Originally Posted by thekashyap
2. I'm sure most ppl know this but writing as it's not already mentioned in this topic so far. Use unsigned int stored in registers for loop variables.
Reasons
2. registers are registers ! They're simply faster than memory access.
Most compilers ignore the "register" keyword, because their optimzations are far more advanced than back in the time, where a variable had only one storage in its life...
But, the register keyword can reduce performances on some compilers, because they interpret it as a strong request, from the programmer, to use a register for storage of the variable, which condemns one register.
Even on very old compilers where the register keyword was a useful hint, it would have been utterly stupid to use the register keyword everywhere, as it would be as slow, or slower, than using no keyword at all, because the compiler would not be better than using its default "guessing" algorithm.
The register keyword should be used only at places, where you want to give a hint to the compiler, that THIS variable needs to be accessed faster than other ones, even if it reduces performances of other variables.
1. unsigned int arithmatic is faster than signed int.
Ridiculous.
On two's complement machines, additions, substractions and multiplications have identical performances as signed or unsigned, since this is exactly the same operation.
For comparisons, the performances are identical on x86 architectures, since the same operation is used for all comparisons (cmp).
On all other architectures, there is absolutely nothing that would add a performance penalty on signed comparisons, because integer comparisons require only very very few transitors... The comparison itself, cannot require more than one CPU cycle.
For integer division, signed integer division was a bit slower than unsigned integer division on old CPU.
But, integer divisions are very rarely time critical, because division is not a very much used operation.
3. Optimizing the breaking conditions in for loops. If you know that the loop variable's range is from 0 to some +ve number AND it doesn't matter which way you traverse while looping, you can optimize the loop like this:
Assuming that the compiler doesn't optimize it for you.
GCC 3.4.5 (-O2) for i386, is clever enough to optimize the first loop with a decrementation operation.
This is because, forward loops are very common, and GCC has specific optimizations for this type of code.
On the other hand, GCC 3.4.5 is not clever enough to produce good code for your "manually optimized" loop.
for( int i = 0; i <= 30; i++ );
Produces this assembly code (MinGW 3.4.5 for Win32 -O2):
0x401300 : dec eax
0x401301 : jns 0x401300
But:
for( int i = 30; i--; );
Produces:
0x401300 : dec eax
0x401301 : cmp eax,0xffffffff
0x401304 : jne 0x401300
Why this pessimization?
First, the following code:
0x401300 : dec eax
0x401301 : js 0x401300
Would immediately stop if eax is negative... But, the conditional expression is about the inequality with zero!
The compiler could have seen that all the values of i will be in range [0,30), but GCC is not clever enough to understand that, because you wrote a weird loop.
A good code, benefiting from the fact that "dec eax" sets the zf and sf flags, would require that you stop when i reaches zero, not -1.
Your code confuses the compiler, and is a pessimization.
So, don't do that!
Note:
Generates the right code:
0x401300 : dec eax
0x401301 : jns 0x401300
With MinGW 3.4.5. I don't claim that it's ok. It may be a pessimization with another compiler.
4, Optimizing very small loops using switch-case.
When you know that the range of loop variable's value is
pretty small avoid the loop altogether.
Or... activate the -funroll-loops option (or your compiler equivalent), which is less error prone.
Also, you should notice that, this should only be used on the most critical code, as it greatly increases the code size, which can have a serious performance penalty.
5. Another way of optimizing loops is to unroll it:
Knowing that printf is so slow that the loop code itself is negligible...
I benchmarked the two programs, using NUL as the standard output.
There is no sensible speed difference.
Both require 3900 milliseconds to execute on my computer.
A revelant piece of code of the two things:
; without manual loop unrolling
; Assume a K6-2 CPU
0x401323 <main+67>: inc ebx
0x401324 <main+68>: push 0x403000 ;1
0x401329 <main+73>: call 0x4018a0 <printf>
0x40132e <main+78>: add esp,0x10
0x401331 <main+81>: cmp ebx,0xc350 ;1
0x401337 <main+87>: jl 0x401320 <main+64> ; 3
; -> 3 CPU cycles + approximatively 39600 CPU cycles for the printf call (if the output stream is the NUL device).
; with manual loop unrolling
0x40134d <main+109>: pop eax
0x40134e <main+110>: lea eax,[ebx+1] ;1
0x401351 <main+113>: pop edx ;2
0x401352 <main+114>: push eax ;3
0x401353 <main+115>: push 0x403000 ;4
0x401358 <main+120>: call 0x4019c0 <printf>
; -> 4 CPU cycles + approximatively 39600 CPU cycles for the printf call.
So, your "optimization", approximatively reduce performances by 0.0025%.
Which is negligible.