I recently looked at this description of VBOs and a sample implementation of them. When I ran the executable, it was significantly faster without VBOs than with them. I don't understand why using VBOs should be slower. Can anybody explain why?
The discrepancy might be caused by any number of factors:

- If your graphics card doesn't support VBOs (vertex buffer objects), your code still creates them, but the OpenGL library converts them, in software, to something the card *does* support and sends that instead.
- If you have a huge number of VBOs, they may not get usefully cached on the card: once the available memory for them is exhausted, every new buffer you send across forces an old one to be evicted to make room, so by the time you circle back to one you want, you have to send it again.
- If you're only rendering a single frame of animation, you pay all the overhead of creating the VBOs and get none of the performance gain from re-using them.
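To make the last point concrete, here's a rough sketch of the usual VBO life cycle (this assumes a legacy fixed-function OpenGL 1.5+ context with the standard GL headers already set up via GLEW/GLAD or similar — it is not the sample code from the link). The key is that the creation cost is paid once at startup, while the per-frame draw ideally sends no vertex data across the bus at all:

```cpp
GLuint vbo = 0;

// Setup -- done ONCE, not per frame; this is where the creation overhead lives.
void createBuffer(const GLfloat* verts, GLsizeiptr bytes) {
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // GL_STATIC_DRAW hints that we upload once and draw many times.
    glBufferData(GL_ARRAY_BUFFER, bytes, verts, GL_STATIC_DRAW);
}

// Per frame -- if the buffer stays cached on the card, no vertex data
// is re-sent; the driver just references what's already there.
void drawBuffer(GLsizei vertexCount) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    // With a VBO bound, the "pointer" is an offset into the buffer.
    glVertexPointer(3, GL_FLOAT, 0, nullptr);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```

If your benchmark calls something like `createBuffer` every frame, or only runs for one frame, the VBO path will naturally lose to plain vertex arrays.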
Also, while it wouldn't likely cause an outright reversal of efficiency, there tend to be smaller issues around individual object size. Way back in the day of triangle strips, there was anecdotal evidence that performance gains were minimal for strips longer than about 12 triangles (on then-state-of-the-art SGI hardware), though the explanations offered at the time tended to be feeble at best. The real explanations tended to involve things like "is the entire strip within the view frustum?" and "is the strip shrinking (in the distance) down to a trivial number of pixels?" In any case, longer strips still helped, just not by as much as one might hope.
In general, things that "are more efficient" in OpenGL almost always have some set of conditions under which the efficiency gain applies. You have to understand the conditions, and use the efficiency-gainers appropriately. They're not cure-alls.
Also, for what it's worth: I once coded up the Sieve of Eratosthenes algorithm for finding prime numbers, in Python. It took about 17 seconds to find the primes under 1 million. Then, in response to somebody on this forum who wanted to accomplish the same thing in under 0.1 seconds, I coded it up in C++ in VS2010. It took about 70 seconds, and that was after I carefully re-selected my data structures to eliminate needless re-allocation of memory. I pulled my hair out for a while, then remembered to build a Release configuration instead of Debug. It then ran in 2-3 seconds, which is what I'd expect relative to Python. Now, I understand that debug-compiled code is necessarily larger and slower than optimized release code, but a factor of 20-30x is just insane! Probably not related to your problem, but it's something I didn't know I had to watch out for.
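For the curious, the allocate-once version looks roughly like this (a minimal sketch of the standard algorithm, not my exact code from back then -- the one up-front `std::vector` allocation is the point):

```cpp
#include <cstddef>
#include <vector>

// Sieve of Eratosthenes: returns a vector of flags where flags[n] is
// true iff n is prime, for all n below `limit` (limit must be >= 2).
std::vector<bool> sieve(std::size_t limit) {
    // One allocation up front -- no per-candidate re-allocation.
    std::vector<bool> is_prime(limit, true);
    is_prime[0] = is_prime[1] = false;
    // Any composite below limit has a factor no larger than sqrt(limit),
    // so we only need to cross off multiples of n while n*n < limit.
    for (std::size_t n = 2; n * n < limit; ++n)
        if (is_prime[n])
            for (std::size_t m = n * n; m < limit; m += n)
                is_prime[m] = false;
    return is_prime;
}
```

In a Release build this finishes the primes under 1 million in a small fraction of a second on modern hardware; in a Debug build, `std::vector<bool>`'s checked, bit-packed accesses are exactly the kind of thing that blows up by an order of magnitude or more.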