<<snip>>

Original article can be found here: http://www.cs.cmu.edu/~gilpin/c++/performance.html


This is a short list of recommendations on how to use C++. My experiences are from using gcc 2.8.0 and Visual C++ 6.0. I had to have things compatible between these two compilers, and between Unix and Windows.

(Note: The recommendations in this post may not be consistent with modern and correct C++ and were not peer reviewed before being posted to this thread. Use them with caution. -Narue)

Contents

IO of binary files

When are destructors called for local variables

Use {} to keep things local

Scope of variables declared in for()

When to use virtual members

IO of binary files

To make sure that there is no CR/LF translation on non-Unix computers, you have to use the following lines to open streams to files with binary data.

ofstream os("output.flt", ios::out | ios::binary);
ifstream is("output.flt", ios::in | ios::binary);

For Visual C++, when using fstream.h, use the flag ios::nocreate in addition. Otherwise you can open a non-existent file for reading without any complaint. (This is not necessary when using fstream.)
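For illustration, here is a minimal round-trip sketch using the standard <fstream> header (the file name and data are made up):

#include <fstream>

int main()
{
    float data[4] = { 1.0f, 2.5f, -3.0f, 0.5f };

    // ios::binary suppresses CR/LF translation on non-Unix systems
    std::ofstream os("output.flt", std::ios::out | std::ios::binary);
    os.write(reinterpret_cast<const char*>(data), sizeof(data));
    os.close();

    std::ifstream is("output.flt", std::ios::in | std::ios::binary);
    if (!is)   // with the standard <fstream>, a missing file is detected here
        return 1;
    float copy[4];
    is.read(reinterpret_cast<char*>(copy), sizeof(copy));
    return 0;
}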


When are destructors called for local variables

Non-static (or 'automatic') variables are 'destructed' automatically when they go out of scope. Scope is a fairly complicated thing, and I'm not going to repeat the definition here. Roughly speaking, the scope ends when you encounter the } around the declaration of the variable. See also the use of {} and how scope is defined in the for() statement.

Variables are destructed (by the compiler) by calling the appropriate destructor of their class. If the objects allocate memory (and hence the destructor should free that memory), this means that you recover the memory allocated.

class array
{
private:
    float *ptr;
public:
    // constructor
    array(int n) { ptr = new float[n]; }
    // destructor
    ~array() { delete [] ptr; }
};

int main()
{
    // ...
    {
        array a(5); // allocates memory
        // do something
    }
    // here the array is destructed, and so the memory is freed
    // ...
}

Use {} to keep things local

Use of the grouping construct {} enables you to declare variables local to that group. When leaving the group, all local variables are destructed. This has the advantage that the reader of the code knows (s)he shouldn't worry about these variables to understand the rest of the code.
In a way this can be understood as if every use of {} is like a function call (with local variables declared in the function). Of course, you don't have the overhead of stack manipulations and jumps involved in a proper function call.

// recommended usage

void f(int a)
{
    if(a==1)
    {
        myclassA Aobject;
        // here I do something with 'Aobject', and maybe 'b'
    }
    // Aobject does not exist here anymore
}

This tip is just an extension of the 'avoid global variables' credo.

As always, this can be abused, as in the following piece of code, where the outer variable 'a' is hidden by a local 'a', resulting in not very readable code.

// not very readable code

{
    int a=1;
    {
        // local variable hides outer 'a'
        int a;
        a = 2;
        assert (a==2);
    }
    // a is again the previous variable
    assert (a==1);
}

Scope of variables declared in for()


The new ANSI C++ standard specifies that variables declared as in for(int i=1; ...) have a scope local to the for statement. Unfortunately, older compilers (for instance Visual C++ 5.0) use the older rule that the scope is the enclosing block. Below I list two possible problems and their recommended solutions:

  • You want to use the variable after the for() statement.

You have to declare the variable outside of the for() statement:

int i;
for(i=1; i<5; i++)
{ /* do something */ }
if (i==5) ...

  • You want to have multiple for() loops with the same variables.

Put each for statement in its own group. You could also declare the variable outside of the 'for', but that makes it slightly trickier for an optimising compiler (and a human) to know what you intend.

{
for(int i=1; i<5; i++)
{ /* do something */ }
}

When to use virtual members

Make a member 'virtual' if a derived class extends the functionality of a member of the base class, and this extended functionality has to be accessible:

  1. inside other member functions of the base class
  2. when using pointers that can point to either an object of the base class, or an object of the derived class.

Example: multi-dimensional arrays which are defined recursively in terms of a 1D array.
We wanted to have a 'grow' member that enlarged the outer dimension of the multidimensional array. At first sight, this is simply calling a general grow of the base class. However, 'grow' has to know the size of the new elements (which are again multidimensional arrays). So, we had to define in the derived class a new 'grow', which calls the base class 'grow' first, and then does more stuff.
At many points in the base class, 'grow' is called to adjust sizes. By making 'grow' virtual we avoid having to rewrite these members for the derived class.
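A minimal sketch of the pattern just described; the member names (append, size_) are made up for illustration:

class array
{
protected:
    int size_;
public:
    array() : size_(0) {}
    virtual ~array() {}

    virtual void grow(int new_size) { /* resize the 1D storage */ size_ = new_size; }

    // A base-class member that adjusts sizes through grow(). Because
    // grow() is virtual, calling append() on a derived object runs the
    // derived grow(), not the base one.
    void append(int extra) { grow(size_ + extra); }
};

class multiarray : public array
{
public:
    virtual void grow(int new_size)
    {
        array::grow(new_size);   // base behaviour first...
        /* ...then resize the nested arrays (the "more stuff") */
    }
};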

Caveat:

For members of the base class which use temporary objects of their own type, the base class 'grow' will be called. For instance:

class array
{
    // ...
public:
    virtual void grow(int new_size);

    array& operator +=(const array& a)
    { /* some definition using 'grow' */ return *this; }

    // operator+ must be a non-member (here a friend), since it takes two arguments
    friend array operator +(const array& a1, const array& a2)
    {
        array a = a1;
        a += a2;
        // Warning: this will call array::grow, even if a1 is really of a derived type
        return a;
    }
};

Thus, you should provide a member of the derived class for every member of the base class which uses temporary objects.

class multiarray : public array
{
    // ...
public:
    virtual void grow(int new_size);

    friend multiarray operator +(const multiarray& a1, const multiarray& a2)
    {
        multiarray a = a1;
        a += a2;
        return a;
    }
};

I don't agree that the usage of virtual functions negatively impacts performance... it does, but one needs to pay the price for something he/she wants to use. If one wants to implement the functionality of virtual functions, then he/she would have to go through a technique like a vtable anyway.
Advantages of virtual functions:
- Readability
- No need to change existing code when extending functionality through inheritance
- Better than a type-field solution
- and many more...

All of the above is from Bjarne Stroustrup, The C++ Programming Language... the inventor of C++.

Happy going virtual!!!!!!!!!!!

Introduction

These tips are based mainly on ideas from the book Efficient C++ by Dov Bulka and David Mayhew. For a more thorough treatment of performance programming with C++, I highly recommend this book. This document, while presenting many of the same ideas in the book, does not go into as much detail as to why certain techniques are better than others. The book also provides code examples to illustrate many of the points presented here.

Constructors and Destructors

  • The performance of constructors and destructors is often poor because an object's constructor (destructor) may call the constructors (destructors) of member objects and parent objects. This can result in constructors (destructors) that take a long time to execute, especially for objects in complex hierarchies or objects that contain several member objects. As long as all of the computations are necessary, there isn't really a way around this. As a programmer, you should at least be aware of this "silent execution". If all of the computations mentioned above are not necessary, then they should be avoided. This seems like an obvious statement, but you should be sure that the constructor you are using does only what you need.
  • Objects should be created only when they are used. A good technique is to defer object creation to the scope in which the object is used. This prevents unnecessary constructors and destructors from being called.
  • Using the initializer list functionality that C++ offers is very important for efficiency. All member objects that are not in the initializer list are created by the compiler using their respective default constructors. By calling a member object's constructor in the initializer list, you avoid the call to its default constructor and the overhead of an assignment operator inside the constructor body. Also, using the initializer list may reduce the number of temporaries needed to construct the object. See the Temporaries section for more information on this, and the sketch below.
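As an illustration of the initializer-list point, a minimal sketch (the class names are made up):

#include <string>

class Slow
{
    std::string name;
public:
    Slow(const std::string& n)
    {
        name = n;   // name is first default-constructed, then assigned
    }
};

class Fast
{
    std::string name;
public:
    Fast(const std::string& n)
        : name(n)   // name is copy-constructed directly; no assignment
    {
    }
};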

Virtual Functions

  • Virtual functions negatively affect performance in 3 main ways:
    1. The constructor of an object containing virtual functions must initialize the vptr table, which is the table of pointers to its member functions.
    2. Virtual functions are called using pointer indirection, which results in a few extra instructions per method invocation as compared to a non-virtual method invocation.
    3. Virtual functions whose resolution is only known at run-time cannot be inlined. (For more on inlining, see the Inlining section.)
  • Templates can be used to avoid the overhead of virtual functions by using a templated class in place of inheritance. A templated class does not use the vptr table because the type of class is known at compile-time instead of having to be determined at run-time. Also, the non-virtual methods in a templated class can be inlined.
  • The cost of using virtual functions is usually not a factor in calling methods that take a long time to execute since the call overhead is dominated by the method itself. In smaller methods, for example accessor methods, the cost of virtual functions is more important.
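A minimal sketch of the template alternative, with made-up names; the concrete type is a template parameter, so no vptr is needed and the call can be inlined:

template <class Shape>
double total_area(const Shape* shapes, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += shapes[i].area();   // resolved (and inlinable) at compile time
    return sum;
}

class Circle
{
    double r;
public:
    Circle(double radius) : r(radius) {}
    double area() const { return 3.14159265358979 * r * r; }
};

int main()
{
    Circle circles[2] = { Circle(1.0), Circle(2.0) };
    double total = total_area(circles, 2);
    (void)total;
    return 0;
}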

Return Value


Methods that must return an object usually have to create an object to return. Since constructing this object takes time, we want to avoid it if possible. There are several ways to accomplish this.

  • Instead of returning an object, add another parameter to the method which allows the caller to pass in the object in which the result should be stored. This way the method won't have to create an extra object; it simply uses the parameter passed to it.
  • Alternatively, a compiler may construct the returned object directly in the caller's storage, eliminating the copy; this is called the Return Value Optimization (RVO). Whether or not RVO happens is up to the compiler, and different compilers handle it differently. One way to help the compiler is to use a computational constructor. A computational constructor can be used in place of a method that returns an object: it takes the same parameters as the method to be optimized, but instead of returning an object based on the parameters, it initializes itself based on their values.
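A minimal sketch of a computational constructor, using a made-up Complex class:

class Complex
{
    double re, im;
public:
    Complex(double r = 0.0, double i = 0.0) : re(r), im(i) {}

    // "Computational constructor": initializes itself directly as a + b,
    // so operator+ needs no named temporary and the copy is easy to elide.
    Complex(const Complex& a, const Complex& b)
        : re(a.re + b.re), im(a.im + b.im) {}

    friend Complex operator+(const Complex& a, const Complex& b)
    {
        return Complex(a, b);   // a single construction in the return expression
    }
};

int main()
{
    Complex x(1.0, 2.0), y(3.0, 4.0);
    Complex z = x + y;
    (void)z;
    return 0;
}

The trade-off is an extra, somewhat unusual constructor signature in the class's public interface.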

Temporaries

Temporaries are objects that are "by-products" of a computation. They are not explicitly declared, and as their name implies, they are temporary. Still, you should know when the compiler is creating a temporary object because it is often possible to prevent this from happening.

  • The most common place for temporaries to occur is in passing an object to a method by value. The formal argument is created on the stack. This can be prevented by using pass by address or pass by reference.
  • Compilers may create a temporary object in the assignment of an object. For example, if a class has a constructor that takes an int as an argument, an int may be assigned to an object of that class: the compiler will create a temporary object using the int as the parameter and then call the assignment operator on the object. You can prevent the compiler from doing this behind your back by using the explicit keyword in the declaration of the constructor (see the sketch after this list).
  • When objects are returned by value, temporaries are often used. See the Return Value section for more on this.
  • Temporaries can be avoided by using the <op>= operators. For example, the code
    a = b + c;
    could be written as
    a = b;
    a += c;
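A minimal sketch of the explicit point above (class and function names are made up):

class Buffer
{
    int size;
public:
    explicit Buffer(int n) : size(n) {}   // 'explicit' forbids implicit int -> Buffer
};

void consume(const Buffer&) {}

int main()
{
    // consume(512);       // without 'explicit' this would silently build a temporary
    consume(Buffer(512));  // with 'explicit' the conversion must be written out
    return 0;
}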

Inlining

Inlining is one of the easiest optimizations to use in C++, and it can result in the most dramatic improvements in execution speed. The main thing to know is when you should inline a method and when you shouldn't.

  • There is always a trade-off between code size and execution speed when inlining. In general, small methods (for example, accessors) should be inlined and large methods should not be inlined.
  • If you are not sure whether or not a given method should be inlined, the best way to decide is to profile the code. That is, run test samples of the code, timing inlined and non-inlined versions.
  • Excessive inlining can drastically increase code size, which can result in increased execution times because of a resulting lower cache hit rate.
  • Watch out for inlined methods that make calls to other inlined methods. This can make the code size unexpectedly larger.
  • Singleton methods, methods that are only called from one place in a program, are ideal for inlining. The code size does not get any bigger and execution speed only gets better.
  • Using literal arguments with an inlined method allows the compiler to make significant optimizations. (This is, however, compiler dependent.)
  • The compiler preprocessor can be used to implement conditional inlining. This is useful so that during testing the code is easier to debug. But for compiling production code, there are no changes to be made to the source code. This is implemented by using a preprocessor macro called INLINE. Inlined code is defined within #ifdef INLINE ... #endif code blocks. Similarly, non-inlined code is defined within #ifndef INLINE ... #endif code blocks. Then to compile using inlined code, you tell the compiler to treat INLINE as defined. (-DINLINE with g++)
  • Sometimes it makes sense to inline a given method in some places but not in others within the same program. This can be accomplished with a technique called selective inlining. The implementation of this technique is not very convenient: for each method that you want to selectively inline you have two methods, where one has "inline_" prepended to the method name and is, of course, inlined. The method without the "inline_" prefix simply calls the inlined version of the method.
  • Recursive calls cannot be inlined, but there are two techniques to try to improve performance in the case of recursive methods:
    1. If the recursion is tail recursion, then the algorithm can be rewritten as an iterative algorithm, eliminating the overhead of method invocations.
    2. Recursive call unrolling basically allows the programmer to inline the first steps of recursion in a recursive algorithm. For example, for a recursive method print() we might do the following: print_unrolled() calls print1() which calls print2() which calls print3() which calls print(). All methods except print() are inlined. The number of recursive steps can be made as high as desired, depending on the application.
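A minimal sketch of the conditional-inlining technique described above, with a made-up Point class; the header/source split is indicated by comments. Compile with -DINLINE (g++) to get the inlined version:

// point.h
class Point
{
    int x_, y_;
public:
    Point(int x, int y) : x_(x), y_(y) {}
    int x() const;
};

#ifdef INLINE
inline int Point::x() const { return x_; }   // production build: inlined in the header
#endif

// point.cpp
#ifndef INLINE
int Point::x() const { return x_; }          // debug build: ordinary out-of-line definition
#endif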

When you want to pass a constant variable, pass it by reference to save memory.

Example:
int example (const int value); // Uses more memory than
int example (const int &value); // this one


How about precalculated values? They save a lot of time on slower machines during long iterations, do they not?
For example:
x = 23.45;
xinv = 1/x;

and use xinv where it's needed.

One can also precalculate trigonometric function values, such as
float sinx = sin(x);
and so on.

And the number one rule of optimization is???
DO NOT

(At least not until you _know_ you need to, and then only where you _know_ it will make a difference, e.g. by profiling.)

On a side note, I recently tried some WTL, and they have a rather nifty technique for avoiding virtual functions while still allowing inheritance (it does not always work well, since only a limited amount of the behaviour is simulated)...
Well, anyway, if you are interested, take a look (it involves specifying the extending class as a template argument to the base class, so the implementing class is known at compile time)... see the sketch below.
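This is commonly known as the Curiously Recurring Template Pattern (CRTP). A minimal sketch of the idea with made-up class names (not WTL's actual classes):

// The derived class passes itself as a template argument, so the base
// can dispatch to it at compile time, with no vtable and no vptr.
template <class Derived>
class Window
{
public:
    void paint()
    {
        static_cast<Derived*>(this)->do_paint();   // static "virtual" call
    }
    void do_paint() { /* default behaviour */ }
};

class MyWindow : public Window<MyWindow>
{
public:
    void do_paint() { /* custom drawing */ }
};

int main()
{
    MyWindow w;
    w.paint();   // calls MyWindow::do_paint without virtual dispatch
    return 0;
}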


Two guidelines from Thinking in C++ by Bruce Eckel:

  1. Avoid the preprocessor. Always use const for value substitution and inlines for macros.
  2. Avoid global variables. Always strive to put data inside classes. Global functions are more likely to occur naturally than global variables, although you may later discover that a global function may fit better as a static member of a class.

This book has 73 guidelines.
See book: http://www.pythoncriticalmass.com/

An easy way to swap 2 variables without using another variable:

a=a+b;
b=a-b;
a=a-b;
commented: Pointless 70's asm hack, if overflow asserts on your machine, you're sunk. Also doesn't work at all with anything not an integer. -3

An easy way to swap 2 variables without using another variable:

a=a+b;
b=a-b;
a=a-b;

Not this again. This is not 1970!

Don't do this today or in the future. Use a temp variable.

An easy way to swap 2 variables without using another variable:

a=a+b;
b=a-b;
a=a-b;

it can't be used if operator+ and operator- aren't defined!!!
And you don't have to use another variable: in standard C++ there is the STL. It is a strength of C++ that there are lots and lots of algorithms in it. For example, there is a sort that always sorts your elements in O(n log n) time, which is the fastest it can get (if you don't count counting sort). Therefore I use it any time I can. There is also a templated function swap that swaps any two elements of the same type. So I think it is easier and cleverer to type swap(a, b); than to use your idea.
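For instance, a minimal sketch of std::swap in use:

#include <algorithm>   // std::swap lives here in classic C++ (in <utility> since C++11)

int main()
{
    int a = 1, b = 2;
    std::swap(a, b);    // a == 2, b == 1

    double x = 1.5, y = 2.5;
    std::swap(x, y);    // also works where the +/- trick would not
    return 0;
}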

the best way to do this is to use the bits. The code would be:

#include <string>
#include <algorithm>
using namespace std;

inline string to_binary( const int &x ) {
  int t;
  string ret;
  if( x > 0 ) t = x; else t = -1 * x;
  for( int i = 0; i < 32; ++i ) 
    if( t & ( 1 << i ) ) ret.push_back( '1' ); else ret.push_back( '0' );
  reverse( ret.begin(), ret.end() );
  ret.erase( 0, ret.find( '1' ) );
  if( ret.size() == 0 ) return "0";
  if( x < 0 ) return '-' + ret;
  return ret;
}
for( int i = 0; i < 32; ++i )

That's not very portable. Use something in <climits> or the C++ equivalent.

An easy way to swap 2 variables without using another variable:

a=a+b;
b=a-b;
a=a-b;

Easy way? I don't know, but it sure is inefficient (unless the optimizer is really good).

There *are* temporary variables, from an ISO standard point of view.


And this algorithm can only be useful if (assuming an optimizer as good as the Turbo C++ 1.0 optimizer or better, knowing that it is a very old compiler with an obsolete optimizer):

  • Both a and b are in a register (otherwise, I don't know of any architecture where the temporary would be avoided)
  • AND there is no builtin XCHG instruction at the assembly level (false for x86 and PowerPC)
  • AND the CPU lacks registers (in that case, it is very probable that the CPU has an XCHG instruction).

And, in that case, you may occasionally (rarely) gain a little performance... In all other cases, you can just pray that the optimizer will understand your bad code and implement it in a correct way.
Most compilers I know produce bad code for that.

If you use a 25-year-old compiler, with a very, very bad optimizer, it might also be possible that this compiler lacks register variables...

In that case, the following code:

a-=b;
b-=a;
a-=b;

Will produce something like the following on x86-16 (yep, I can't imagine any such bad optimizer on x86-32):

mov ax, [bp-8] ; timings on a k6-2 CPU (1)
sub ax, [bp-4]  ; 3
mov [bp-8],ax  ; 4 CPU cycles

mov ax,[bp-4]
sub ax,[bp-8]
mov [bp-4],ax

mov ax, [bp-8]
sub ax, [bp-4]
mov [bp-8],ax ;11.5 CPU cycle (i.e. the last one might be paired with the next instruction)

Note: If the optimizer is even worse than what I described, the generated code might be even worse.
And your code (which has temporaries), on the same compiler, might be twice as slow...

The classical

int c;
c=b;
b=a;
a=c;

Will produce something like:

mov ax,[bp-4]
mov [bp-12],ax

mov ax,[bp-8]
mov [bp-4],ax

mov ax,[bp-12]
mov [bp-8],ax   ; 5.5 CPU cycles (on a K6-2 CPU)

Note: I don't think that the compiler can generate worse code here, even if there is no optimizer.

So, my point is that your code is *always* a bad idea.
This algorithm might only be used in some very special contexts, only on weird architectures (I don't know of any such architecture, but one probably exists or existed), and at assembly level only!

Please, read my posts (SuperKoko) on these two threads (of another forum):
http://www.codeguru.com/forum/showthread.php?t=386883
http://www.codeguru.com/forum/showthread.php?t=383191

Also, remember that std::swap can be assumed to be optimal on your architecture: the compiler is allowed to do specific optimizations on it.

When you want to pass a constant variable, pass it by reference to save memory.

Example:
int example (const int value); // Uses more memory than
int example (const int &value); // this one

That's the opposite!

Using const references is a good idea for large objects (say, 16 bytes and more).
For very small objects (smaller than a pointer), if you're lucky, you won't lose performance.
But it is quite probable that the compiler will generate more code, slower code, and use more stack memory (if the argument is not an lvalue).
Moreover, it is very hard for the compiler to optimize away the const reference when the function is used across translation unit boundaries (at least for most compilers), because it must respect a calling convention and can't know whether the function, internally, takes the address of the const reference.
In that case, the compiler can't replace such a const reference with a simple value.

Please read my posts on these two threads:
http://www.codeguru.com/forum/showthread.php?t=339608
http://www.codeguru.com/forum/showthread.php?t=392134&page=2

Multiplication, division and modulus operations on powers of 2 can be made much faster using bitwise operators.

1. Modulo
Finding the modulus is particularly expensive. This operation can be performed much faster using the & operator, because a % n == a & (n-1) where n is a power of 2.

2. Division
Division by a power of 2 can be done faster using the >> operator: a / n == a >> m, where n = 2^m.

3. Multiplication
This is similar to division, except you use the left shift << operator: a * n == a << m, where n = 2^m.
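A minimal sketch checking these identities; note they are stated here for unsigned values (as a later reply points out, right-shifting signed values is implementation defined):

#include <cassert>

int main()
{
    unsigned a = 53;

    assert(a % 8 == (a & 7));    // modulo:         n = 8, n - 1 = 7
    assert(a / 8 == (a >> 3));   // division:       8 = 2^3
    assert(a * 8 == (a << 3));   // multiplication: 8 = 2^3
    return 0;
}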

Multiplication, division and modulus operations on powers of 2 can be made much faster using bitwise operators.

1. Modulo
Finding the modulus is particularly expensive. This operation can be performed much faster using the & operator, because a % n == a & (n-1) where n is a power of 2.

2. Division
Division by a power of 2 can be done faster using the >> operator: a / n == a >> m, where n = 2^m.

3. Multiplication
This is similar to division, except you use the left shift << operator: a * n == a << m, where n = 2^m.

Valid, albeit ancient, advice -- much like the XOR swap.

It's one of those things that language implementers knew about ages ago, and should be done by the compiler automagically (or it may take some optimization option).

Bottom line: don't do this unless your compiler sucks. Write code to multiply by 4 like this: num *= 4. Obfuscating it as num <<= 2 will make your code look strange to newbies, and it will look like a newbish hack to old timers (both making what you write a little suspect).

And don't forget that right-shifting signed values is implementation defined!

A few more I find useful...

1. Replace switch-case/if-else with array-indexing:

switch ( queue ) {
case 0 :   letter = 'W';
   break;
case 1 :   letter = 'S';
   break;
case 2 :   letter = 'U';
   break;
}
//or maybe
if ( queue == 0 )
  letter = 'W';
else if ( queue == 1 )
  letter = 'S';
else
  letter = 'U';
//A quicker method is to simply use the value as an
//index into a character array, eg.
static const char *classes = "WSU";
letter = classes[queue];

2. I'm sure most people know this, but I'm writing it as it's not already mentioned in this topic so far. Use unsigned ints stored in registers for loop variables.
Reasons
1. unsigned int arithmetic is faster than signed int.
2. registers are registers ! They're simply faster than memory access.

for( register unsigned int loop_variable = 0; loop_variable <= 100000; loop_variable++ )
{ /*do your stuff*/ }

3. Optimizing the breaking conditions in for loops. If you know that the loop variable's range is from 0 to some positive number AND it doesn't matter which way you traverse while looping, you can optimize the loop like this:

//original loop
for( int i = 0; i <= 30; i++ )
{/*do your stuff*/}
//optimized loop
for( int i = 30; i--; )
{/*do your stuff*/}

4. Optimizing very small loops using switch-case.
When you know that the range of the loop variable's value is
pretty small, avoid the loop altogether.

//unoptimized code
for( int i = some_small_positive_int; i--; )
{/*do your stuff with i*/}

//optimized, based on the assumption that some_small_positive_int can only be from 1 to 7
int i = 0;
switch( some_small_positive_int ) 
{ 
    case 7 : /*do your stuff with i*/ i++; 
    case 6 : /*do your stuff with i*/ i++; 
    case 5 : /*do your stuff with i*/ i++; 
    case 4 : /*do your stuff with i*/ i++; 
    case 3 : /*do your stuff with i*/ i++; 
    case 2 : /*do your stuff with i*/ i++; 
    case 1 : /*do your stuff with i*/ ; 
}

3. Optimizing the breaking conditions in for loops. If you know that the loop variable's range is from 0 to some positive number AND it doesn't matter which way you traverse while looping, you can optimize the loop like this:

//original loop
for( int i = 0; i <= 30; i++ )
{/*do your stuff*/}
//optimized loop
for( int i = 30; i--; )
{/*do your stuff*/}

Except, if your description is true, you'd be better off using:

i = 30; 
while (i--)
{
    /*do your stuff*/
}

5. Another way of optimizing loops is to unroll them:

//unoptimized loop
    int loop_count = 50000;  /* could be anything */ 
    for( int j = 0; j < loop_count; j++ )
        printf("process(%d)\n", j); 

     //optimized one
     const int BLOCKSIZE = 8;
    /* The loop_count may not be divisible by BLOCKSIZE, 
     * go as near as we can first, then tidy up.
     */ 
    int i = 0; 
    int blocklimit = (loop_count / BLOCKSIZE) * BLOCKSIZE ;

    /* unroll the loop in blocks of 8 */ 
    while( i < blocklimit ) 
    { 
        printf("process(%d)\n", i); 
        printf("process(%d)\n", i+1); 
        printf("process(%d)\n", i+2); 
        printf("process(%d)\n", i+3); 
        printf("process(%d)\n", i+4); 
        printf("process(%d)\n", i+5); 
        printf("process(%d)\n", i+6); 
        printf("process(%d)\n", i+7); 

        /* update the counter */ 
        i += 8; 

    } 

    //we already know how to optimize small loops.
    switch( loop_count - i ) 
    { 
        case 7 : printf("process(%d)\n", i); i++; 
        case 6 : printf("process(%d)\n", i); i++; 
        case 5 : printf("process(%d)\n", i); i++; 
        case 4 : printf("process(%d)\n", i); i++; 
        case 3 : printf("process(%d)\n", i); i++; 
        case 2 : printf("process(%d)\n", i); i++; 
        case 1 : printf("process(%d)\n", i); 
    }

> //A quicker method is to simply use the value as an
Except your other two methods never risk an out of bounds memory access.

> 1. unsigned int arithmetic is faster than signed int.
Where's your evidence?

> 2. registers are registers ! They're simply faster than memory access.
True, but any decent compiler nowadays is far more capable of deciding which variables would be best placed in registers.

> AND it doesn't matter which way you traverse while looping,
Well if you're using it for indexing an array, and your cache is optimised in favour of incremental access, then you lose badly.
I've only ever seen counting backwards to zero save ONE instruction on those machines which specifically have a 'test-and-branch' instruction.

> Another way of optimizing loops is to unroll it:
Or just use the gcc flag -funroll-loops
Better yet, bone up on all the magic which a modern compiler can do, which doesn't involve you mauling the code into an unreadable mess.
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Optimize-Options.html#Optimize-Options


@WaltP
Except, if your description is true, you'd be better off using

i = 30;
while (i--)
{/*do your stuff*/}

>> Sorry, but I fail to see the difference between the for and while loop. Do you mean it's faster to use while instead of for?

@Salem
> //A quicker method is to simply use the value as an
Except your other two methods never risk an out of bounds memory access.
KashAI>> TRUE. So I hope that anyone able enough to understand this won't blindly copy my code.


> 1. unsigned int arithmetic is faster than signed int.
Where's your evidence?
KashAI>> I was afraid someone would ask. :) Anyway, the simple answer is I don't know.
But here is what I know:
1. In VS 6.0 (on Intel hardware) a simple for loop with an unsigned loop variable is about 2 seconds faster than when the loop variable is a signed int. (looped some 100K and 500K times to print the value of the loop variable)
2. In most cases one can see that there are separate assembly instructions for signed and unsigned arithmetic, which at least indicates a difference in performance.
3. The number of flags applicable (CF = carry flag, SF = sign flag, OF = overflow flag) to the execution of signed and unsigned instructions is different.
4. I vaguely remember an instruction called SBB (subtract with borrow) which, if I'm not wrong, is only applicable to signed arithmetic.
It is used when the requested subtraction of two signed numbers cannot be completed with a single instruction due to register size.
> 2. registers are registers ! They're simply faster than memory access.
True, but any decent compiler nowadays is far more capable of deciding which variables would be best placed in registers.
KashAI>> So if I understand it right, what I've written is correct, but not necessary.
> AND it doesn't matter which way you traverse while looping,
Well if you're using it for indexing an array, and your cache is optimised in favour of incremental access, then you lose badly.
I've only ever seen counting backwards to zero save ONE instruction on
those machines which specifically have a 'test-and-branch' instruction.
KashAI>> So in short, should one NOT optimize it this way? Maybe you could add some practical numbers for the benefit of readers, to help them decide whether to use this optimization or not?
E.g. "in 80% of cases the cache is optimised in favour of incremental access" OR
"nowadays most machines support a 'test-and-branch' instruction".

> Another way of optimizing loops is to unroll it:
Or just use the gcc flag -funroll-loops

Better yet, bone up on all the magic which a modern compiler can do,
which doesn't involve you mauling the code into an unreadable mess.
http://gcc.gnu.org/onlinedocs/gcc-4....timize-Options

KashAI>> Now that is useful info. I just checked and found that VS 6.0 and Sun Workshop 6.0 also support loop unrolling.

When displaying output, use printf wherever possible over C++'s cout, because of the constructors that have to be run, the series of cout class variables to be initialized, and the never-ending list of inline code to be executed :). If you want to see all these details, try using the <trace into> tool that ships with your IDE compiler and you will see what I mean.

2. I'm sure most people know this, but I'm writing it as it's not already mentioned in this topic so far. Use unsigned ints stored in registers for loop variables.
Reasons

2. registers are registers ! They're simply faster than memory access.

Most compilers ignore the "register" keyword, because their optimizations are far more advanced than in the days when a variable had only one storage location for its whole lifetime...

But the register keyword can reduce performance on some compilers, because they interpret it as a strong request from the programmer to use a register for storage of the variable, which ties up one register.

Even on very old compilers where the register keyword was a useful hint, it would have been utterly stupid to use the register keyword everywhere, as that would be as slow as, or slower than, using no keyword at all, because the compiler would be no better than its default "guessing" algorithm.
The register keyword should be used only in places where you want to give the compiler a hint that THIS variable needs to be accessed faster than other ones, even if that reduces the performance of other variables.

1. unsigned int arithmetic is faster than signed int.

Ridiculous.
On two's complement machines, additions, subtractions and multiplications have identical performance whether signed or unsigned, since it is exactly the same operation.

For comparisons, the performance is identical on x86 architectures, since the same instruction (cmp) is used for all comparisons.
On all other architectures, there is absolutely nothing that would add a performance penalty to signed comparisons, because integer comparisons require only very, very few transistors... The comparison itself cannot require more than one CPU cycle.

For integer division, signed integer division was a bit slower than unsigned integer division on old CPUs.
But integer divisions are very rarely time critical, because division is not a heavily used operation.

3. Optimizing the breaking conditions in for loops. If you know that the loop variable's range is from 0 to some positive number AND it doesn't matter which way you traverse while looping, you can optimize the loop like this:

Assuming that the compiler doesn't optimize it for you:
GCC 3.4.5 (-O2) for i386 is clever enough to optimize the first loop into a decrement-and-branch.
This is because forward loops are very common, and GCC has specific optimizations for this type of code.
On the other hand, GCC 3.4.5 is not clever enough to produce good code for your "manually optimized" loop.

for( int i = 0; i <= 30; i++ );

Produces this assembly code (MinGW 3.4.5 for Win32 -O2):

0x401300 :     dec    eax
0x401301 :     jns    0x401300

But:

for( int i = 30; i--; );

Produces:

0x401300 :     dec    eax
0x401301 :     cmp    eax,0xffffffff
0x401304 :     jne    0x401300

Why this pessimization?

First, the following code:

0x401300 : dec eax
0x401301 : js 0x401300

would immediately stop if eax is negative... but the conditional expression is about inequality with zero!
The compiler could have seen that all the values of i will be in the range [0,30), but GCC is not clever enough to understand that, because you wrote a weird loop.

Good code, benefiting from the fact that "dec eax" sets the ZF and SF flags, would require that you stop when i reaches zero, not -1.

Your code confuses the compiler, and is a pessimization.
So, don't do that!

Note:

for( int i = 31; --i; );

Generates the right code:

0x401300 :     dec    eax
0x401301 :     jns    0x401300

With MinGW 3.4.5. I don't claim that it's OK; it may be a pessimization with another compiler.

4. Optimizing very small loops using switch-case.
When you know that the range of the loop variable's value is
pretty small, avoid the loop altogether.

Or... activate the -funroll-loops option (or your compiler's equivalent), which is less error prone.

Also, you should note that this should only be used on the most critical code, as it greatly increases the code size, which can carry a serious performance penalty.

5. Another way of optimizing loops is to unroll them:

Knowing that printf is so slow, the loop code itself is negligible...

I benchmarked the two programs, using NUL as the standard output.
There is no noticeable speed difference.
Both require 3900 milliseconds to execute on my computer.

A relevant piece of code from each of the two versions:

; without manual loop unrolling
; Assume a K6-2 CPU
0x401323 <main+67>:     inc    ebx
0x401324 <main+68>:     push   0x403000 ;1
0x401329 <main+73>:     call   0x4018a0 <printf>
0x40132e <main+78>:     add    esp,0x10
0x401331 <main+81>:     cmp    ebx,0xc350 ;1
0x401337 <main+87>:     jl     0x401320 <main+64> ; 3

; -> 3 CPU cycles + approximately 39600 CPU cycles for the printf call (if the output stream is the NUL device).
; with manual loop unrolling
0x40134d <main+109>:    pop    eax
0x40134e <main+110>:    lea    eax,[ebx+1] ;1
0x401351 <main+113>:    pop    edx ;2
0x401352 <main+114>:    push   eax ;3
0x401353 <main+115>:    push   0x403000 ;4
0x401358 <main+120>:    call   0x4019c0 <printf>

; -> 4 CPU cycles + approximately 39600 CPU cycles for the printf call.

So, your "optimization", approximatively reduce performances by 0.0025%.
Which is negligible.

When displaying output, use printf wherever possible over C++'s cout, because of the constructors that have to be run, the series of cout class variables to be initialized, and the never-ending list of inline code to be executed. If you want to see all these details, try using the <trace into> tool that ships with your IDE compiler and you will see what I mean.

You should not do that before checking whether printf is slower or faster than cout.
This is very highly compiler dependent.
For example, with Borland C++ 5.0, cout is faster than printf.

Check my answer on this thread:
http://www.codeguru.com/forum/showthread.php?t=383112

And don't forget that micro-optimizations are the enemy of real optimizations.

A performance tip that has served me well with C++, C#, Java, VB, and scripting languages is counting backwards. When using a comparison in a loop (presumably for termination purposes), if possible, count down to zero instead of up to a non-zero value.

Since every machine language in existence has "compare to zero" operators, smart compilers can utilize this efficiency.

Normally, the compiler saves the comparison value in memory and then accesses it indirectly on each iteration. However, many compilers will recognize the comparison to zero and optimize it with a single machine-language instruction: bne, bge, etc.

In complex, nested loops, the optimization can be amazing. The main downfall I've observed is compiler consistency. I have not observed .NET's IL doing this, and it doesn't always happen when going native.

An easy way to swap 2 variables without using another variable:

a=a+b;
b=a-b;
a=a-b;

I don't think so. The result will be the following:
a=b;
b=0;
I am a beginner :) please give me some advice

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.