Hey guys,

I've written quite a large piece of code the past few days, and today I've cleaned it up a bit, divided it into functions (I know, bad, should've done it the other way around) etc. etc. The project is a simple 3D thingamadingy, the program is able to display WaveFront .obj files using OpenGL, the GLUT library and C++. The part I was working on these few days was the LOD(Level of detail) algorithm. I'd like to use this thread to optimize that particular part as whole by optimizing each of its functions.

Now, with the program more or less working (working in most cases) I want to optimize some functions. I'll post the functions that get called the most, profiled by gcc/gprof. First, (a part of) the output of the profilation:

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00 100724734     0.00     0.00  visualPart::notOptimized(unsigned int, std::vector<bool, std::allocator<bool> > const&)
  0.00      0.00     0.00   196605     0.00     0.00  visualPart::myVerbosePrintf(char*, ...)
  0.00      0.00     0.00   147432     0.00     0.00  visualPart::rotateFour(int*)
  0.00      0.00     0.00    73736     0.00     0.00  operator new(unsigned int, void*)
  0.00      0.00     0.00    49160     0.00     0.00  vertexMatch::~vertexMatch()
  0.00      0.00     0.00    24587     0.00     0.00  visualPart::updateVertexMatches(std::vector<vertexMatch, std::allocator<vertexMatch> >&, std::vector<bool, std::allocator<bool> >&)
  0.00      0.00     0.00    24587     0.00     0.00  vertexMatch::vertexMatch()
  0.00      0.00     0.00    24573     0.00     0.00  visualPart::optimizeQuad(vertexMatch&, std::vector<bool, std::allocator<bool> >&)
  0.00      0.00     0.00    24573     0.00     0.00  vertexMatch::vertexMatch(vertexMatch const&)

(BTW: does anybody know how I can get gcc to also measure the time? It'd be of great value. I'm on WinXP x64, using latest (stable) release of the minGW suite in combination with Code::Blocks)

First one is really simple: Rotate an array of 4 integers by one to the right.

inline int visualPart::rotateFour(int *x) {
    unsigned int temp = x[3];

    x[3] = x[2];
    x[2] = x[1];
    x[1] = x[0];
    x[0] = temp;

    return 0;
}

The next function: Scanning through a vector<bool> for 4 elements, returning true if one is false.

inline bool visualPart::notOptimized(const unsigned int quadNumber, const vector<bool> &optimized) {
    unsigned int n = 0;
    for (n = 0; n < 4; n++) {
        if (!optimized[quadindex[quadNumber*4+n]]) {
            return true;
        }
    }

    return false;
}

myVerbosePrintf: A function that behaves exactly like printf, but only does print when the verbose variable is true. Little code as there is, little can be optimized, but do you think 200.000 calls to this "empty" function will slow the program significantly down?

int visualPart::myVerbosePrintf(char *fmt, ...) {
    if (!verbose) return 0;
    int retval;

    va_list ap;
    va_start(ap, fmt);
    retval = vprintf(fmt, ap);
    va_end(ap);

    return retval;
}

Fourth on the list is the tons of new[] I'm calling. Now I know from C experience that malloc() and calloc() are "very slow" and that new is almost like a C++ wrapper for malloc(). It might be worth looking into this, but I'm not sure whether I'm the one calling new[] all the time or C++ does so too.

Then a destructor for a struct! Surprised me a bit, but nothing bad. However, I'd like to know if you guys think it's worthwhile to minimize the calls to this destructor. Here's the struct:

struct vertexMatch {
    unsigned int x;
    unsigned int y;
    vector<unsigned int> faces;
};

Then comes one of the core functions which is quite long and takes a bit more understanding of the whole thing, but I'll post it eventually (with explaination what I'm doing, etc. etc.)

Then comes the constructor for that struct. Same question as for its destructor.
...
And its copy constructor. Idem.

Basically I'm asking you to take a look at the functions, if you know special keywords for alignment, if they are still useful nowadays with all the optimizing GCC already does, etc. Anything that will speed these functions up is welcome. If you so much as read it all to this, thank you for reading. xD

With just a very quick glance, that notOptimized() function needs some help. A bitvector (or vector<bool> ) is not a good choice in such a function. Sad, but it was optimized for space, not speed.

If each point of a quad is always optimized or unoptimized, you can reduce it to a single bit. I don't know if you can do that though.

The other way would be to make yourself an array of bytes that are themselves bit fields representing the points of a quad. The overhead of handling it that way wouldn't be much different than the overhead of your bitvector (since each quad stores its own copy of the vertex's optimization state anyway), but your function could then just directly index the quad's optimization byte and compare that to 0x0F. If equal, all are optimized. If not, then at least one is not optimized.

That function is called so often that it is worth every millisecond you can squeeze out of it.

The rotateFour() function is pretty good as-is, but believe it or not you might get better results if you were to use the <cstring> memmove(), or just as good (I think) but easier to use,

register int n = x[ 3 ];
char_traits <int> ::move( x+1, x, 3 );
x[ 0 ] = n;

If that is unsatifactory, it might be worth writing it in assembly yourself.

Finally, I don't know why you are using new so much, but you should minimize its use. Reuse objects as much as possible --reset/reinitialze them instead of delete/new, and use references whenever possible (to avoid copy construction).

Make sure you have all optimizations you can get turned on so expressions like foo * 4 get turned into foo << 2 machine code, which is significantly faster.

My quick $0.02. Sorry I couldn't be of more help.

[edit] It might be worth your time to set the calling convention for your oft-used functions to fastcall.

Comments
Very interesting post. Gave me something to think about =)

With just a very quick glance, that notOptimized() function needs some help. A bitvector (or vector<bool> ) is not a good choice in such a function. Sad, but it was optimized for space, not speed.

In fact, a vector<bool> isn't even optimized for space if it's standard-compliant.

Whoohoo! So much reactions, thanks a lot! I'll implement them one by one, checking if it has an advantage and keep you guys up to date.

Yesterday when I posted, the time was 9 seconds and 481 milliseconds.

Now, it is was 8,586 milliseconds. (as in, 8 seconds, 586 ms.)

WHOOOOOOHOOOOOOOOO!

On the first run, changed the vector<bool> to a C style char array, it finished in *trumpets*

5 seconds and 406 milliseconds!

3,3 seconds. Yeah. :D

PS:

Yesterday I changed the vector in struct VertexMatch to an array, and it actually was slower.. Any ideas? The usage of the vector's functions is minimal: A single reserve(2) and for the rest it's being treated like an array only accessing 2 elements. However, I'll change it again to an array and see what happens.

Hey! I changed the (only) other vector to an array, and it got down to 4 seconds and 648 milliseconds. Just by changing from vector to arrays, it already executes 50% faster! Yeey.

Found the uses of new by the way, it is in the loading/copying part of my code, nothing to worry about. In the part that gets executed a lot there's no new.

I've tried using fastcall, alignment and other keywords/attributes for optimizations. Just like register and inline, they don't seem to change anything. GCC's pretty smart.

With that done, I guess this was it for the simple optimizations.

Here's the main algorithm:

int visualPart::simplifyQuads() {
    unsigned int *quadsPerVert = NULL;
    struct vertexMatch vertexMatchy;

    unsigned char *optimized = new unsigned char[quadcount * 4];
    memset(optimized, FALSE, sizeof(char) * quadcount * 4);

    printf("Simplifying to LOD level %u:\n", LOD+1);
    do {
        if (verbose) {
            printf("DEBUG:: Waiting for input...");
            char c = getchar();
            fflush(stdin);
            if (c == 'y') {
                LOD++;
                return 0;
            }
        }

        if (findVertexMatch(vertexMatchy, optimized) == 0) {
            printf("Done simplifying!\n");
            LOD++;
            updateQuadsPerVert(&quadsPerVert);
            removeLonelyVerts(quadsPerVert);
            return 0;
        }

        if (optimizeQuad(vertexMatchy, optimized) == 0) {
            quadcount--;
        }

    } while (true);

    LOD = 1;

    return 0;
}

Here is the function that also does a lot of scanning work. The findVertexMatch function:

int visualPart::findVertexMatch(struct vertexMatch &vertexMatchy,
                                     const unsigned char * const optimized) {
    unsigned int n, i, eachX, eachY, nextX, nextY;

    //for each quad that isn't optimized
    for (n = 0; n < quadcount; n++) {
        if (!notOptimized(n, optimized)) continue;
        //find another quad that isn't optimized
        for (i = n+1; i < quadcount; i++) {
            if (!notOptimized(i, optimized)) continue;
            //if they both aren't, create pointers to their 4 vertices. 
            int *x, *y;
            x = &quadindex[n*4];
            y = &quadindex[i*4];
            //if 2 of those vertices match, we have a match and are able to simplify it.
            //for each vertex of quad x
            for (eachX = 0; eachX < 4; eachX++) {
                //check if it matches any vertex in quad y
                for (eachY = 0; eachY < 4; eachY++) {
                    //if it doesn't, continue
                    if (x[eachX] != y[eachY]) {
                        continue;
                    } else {
                        //else, calculate the offset to the next vertex. 
                        nextX = eachX + 1;
                        nextX = nextX % 4;

                        nextY = eachY + 3;
                        nextY = nextY % 4;
                        
                        //if they match, set the match structure and return.
                        if (x[nextX] == y[nextY]) {
                            vertexMatchy.x = x[eachX];
                            vertexMatchy.y = x[nextX];
                            vertexMatchy.faces[0] = i;
                            vertexMatchy.faces[1] = n;
                            return 1;
                        }
                    }
                }
            }
        }
    }
    
    //if no match is found, return 0
    return 0;
}

Here's the other function. It does not much scanning, but a lot of memory movement!

int visualPart::optimizeQuad(struct vertexMatch &vertexMatchy,
                             unsigned char * const optimized) {
    unsigned int n, j;

    int quad[4][4];
    for (n = 0; n < 4; n++) {
        memset(quad[n], -1, sizeof(int) * 4);
    }

    for (n = 0; n < 2; n++) {
        memcpy(quad[n], &quadindex[vertexMatchy.faces[n]*4], sizeof(int) * 4);
    }

    for (n = 0; n < 2; n++) {
        if (quad[n][0] == -1) continue;
        while (1) {
            if ((unsigned int)quad[n][2] == vertexMatchy.x || (unsigned int)quad[n][2] == vertexMatchy.y) {
                if ((unsigned int)quad[n][3] == vertexMatchy.x || (unsigned int)quad[n][3] == vertexMatchy.y) {
                    break;
                }
            }
            rotateFour(quad[n]);
        }
    }

    if (verbose) {
        myVerbosePrintf("-----------MAIN QUAD---------------\n");
        for (n = 0; n < 4; n++) {
            myVerbosePrintf("%u: ", n);
            for (j = 0; j < 4; j++) {
                myVerbosePrintf("%d ", quad[n][j]);
            }
            myVerbosePrintf("\n");
        }
        myVerbosePrintf("-----------MAIN QUAD END-----------\n");

        myVerbosePrintf("removing %d and %d...\n", vertexMatchy.x, vertexMatchy.y);
    }

    for (n = 0; n < 2; n++) {
        for (j = 0; j < 4; j++) {
            if ((unsigned int)quad[n][j] == vertexMatchy.x || (unsigned int)quad[n][j] == vertexMatchy.y) {
                quad[n][j] = -1;
            }
        }
    }

    rotateFour(quad[0]);
    rotateFour(quad[0]);

    myVerbosePrintf("AFTER ROT/DEL:\n\t%3d %3d %3d %3d\n\t%3d %3d %3d %3d\n", quad[0][0], quad[0][1], quad[0][2], quad[0][3], quad[1][0], quad[1][1], quad[1][2], quad[1][3]);
    int newQuad[4] = {quad[1][0], quad[1][1], quad[0][2], quad[0][3]};

    for (n = 0; n < 4; n++) {
        optimized[newQuad[n]] = TRUE;
    }


    if (verbose) {
        myVerbosePrintf("%3d %3d %3d %3d\n", newQuad[0], newQuad[1], newQuad[2], newQuad[3]);

        myVerbosePrintf("Deleting old quads %u && %u...\n", vertexMatchy.faces[0], vertexMatchy.faces[1]);
        myVerbosePrintf("Old index: ");
        myVerbosePrintf("|0| ");
        for (n = 0; n < quadcount * 4; n++) {
            myVerbosePrintf("%d ", quadindex[n]);
            if ((n+1)%4 == 0) {
                myVerbosePrintf("|%u| ", (n+1)/4);
            }
        }
        myVerbosePrintf("\n");
    }

    //overwrite one old quad with newQuad
    memcpy(&quadindex[vertexMatchy.faces[0]*4], newQuad, sizeof(int) * 4);

    if (verbose) {
        myVerbosePrintf("Sew index: ");
        myVerbosePrintf("|0| ");
        for (n = 0; n < quadcount * 4; n++) {
            myVerbosePrintf("%d ", quadindex[n]);
            if ((n+1)%4 == 0) {
                myVerbosePrintf("|%u| ", (n+1)/4);
            }
        }
        myVerbosePrintf("\n");
    }

    //erase the other one by copying the rest over it.
    int *begin = &quadindex[vertexMatchy.faces[1]*4];
    int *end = begin + 4;

    int intsTillEnd = quadcount * 4 - (vertexMatchy.faces[1]*4 + 4);
    myVerbosePrintf("DEBUG:: intsTillEnd: %u\n", intsTillEnd);

    if (intsTillEnd != 0) {
        memcpy(begin, end, sizeof(int) * intsTillEnd);
    }

    if (verbose) {
        myVerbosePrintf("New index: ");
        myVerbosePrintf("|0| ");
        for (n = 0; n < quadcount * 4; n++) {
            myVerbosePrintf("%d ", quadindex[n]);
            if ((n+1)%4 == 0) {
                myVerbosePrintf("|%u| ", (n+1)/4);
            }
        }
        myVerbosePrintf("\n");
    }

    return 0;
}

If you can find any optimizations again, please tell me. Optimizing with you guys has been very fruitful already, so I'm really happy.

PS:
If you want me to ditch all the "verbose" code, please tell me so. I don't know if it affects the speed, so I thought I'd better post the function as it is.

I didn't mention it before, but personally I would be inclined to use a compile-time #define to decide whether that verbose function gets used or not. Calling the function takes time, even if it doesn't do anything.

#ifndef VERBOSE
  #define myVerbosePrintf( ... )
#endif

...

#ifdef VERBOSE
  for (...)
    myVerbosePrintf( "fooey: %d\n", foonum );
#endif

...

myVerbosePrintf( "..." );

When you compile, if you don't specify -DVERBOSE then all that verbose stuff doesn't get compiled into the executable and you avoid the overhead of checking and calling stuff.
Oh, notice how the VERBOSE define supplants the 'verbose' variable.

That's what I'd do...

yeah, I only did the "myPrintfVerbose()" because else I'd need to #ifdef #endif like.. 70 printf()s, hahaha. Maybe I should regex search and replace the printf()s with #ifdef #endifs. Can't be done with regular search expressions I'm afraid. :(

PS: I removed all the verbose code, and it seemed not to matter, but thanks for the input.

This article has been dead for over six months. Start a new discussion instead.