You can use this code I've written to measure the number of cycles each one takes. However, I'm already sure that the 2nd way is faster. Also, you should always inline any simple assignment functions so that there is no function call. Creating/destroying objects will almost certainly cost more cycles than just assigning.
What this is:
-a small C/asm source file to help benchmark a function that iterates over a data set.
What it does:
-computes the number of cycles required for the function to run
-computes the CPE(cycles per element) of your function
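To make the CPE number concrete, here's a minimal sketch of the arithmetic. The name cycles_per_element is made up for illustration and is not from the original source. The sample run further down reports 4167 cycles as a CPE of 4.069336, which matches 4167 / 1024, so the example assumes a 1024-element data set; the post never states the element count.

```c
#include <stdint.h>

/* Cycles per element: total cycles divided by the number of elements
 * the function processed. Returns 0.0 for an empty data set.
 * (Hypothetical helper, not part of the original code.) */
double cycles_per_element(uint64_t cycles, uint64_t n_elements)
{
    if (n_elements == 0)
        return 0.0;
    return (double)cycles / (double)n_elements;
}
```

Calling cycles_per_element(4167, 1024) gives about 4.0693, which lines up with the first line of the sample output.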
How it works:
-it uses Pentium-specific assembly instructions to read the processor's timestamp counter, which is a 64-bit value that represents the number of cycles that have passed since the processor was reset
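Here's a rough sketch of what reading the counter can look like. It is not the original code: the original keeps its asm in a separate .s file, while this uses GCC-style inline asm marked volatile. The names join_halves and read_ticks are made up, GCC or clang is assumed, and on non-x86 machines it falls back to clock() just so it still compiles (in that case the result is not a cycle count).

```c
#include <stdint.h>
#include <time.h>

/* rdtsc returns the counter in two 32-bit halves (EDX = high, EAX = low).
 * Join them into one 64-bit value so a carry out of the low half
 * is not lost. (Hypothetical helper.) */
uint64_t join_halves(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Read a tick counter (hypothetical helper). On x86 this is the
 * timestamp counter; elsewhere it is a stand-in based on clock(),
 * which counts clock ticks rather than cycles. */
uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return join_halves(hi, lo);
#else
    return (uint64_t)clock();
#endif
}
```

Taking read_ticks() before and after the code you care about and subtracting the two values gives the elapsed count as an unsigned 64-bit number.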
What you do:
-you write a function that takes a void pointer as its argument and returns a void pointer. In the main function you pack your necessary args into a structure or whatever, and then unpack them in your function. You call the function I wrote, test_it(), and pass it a pointer to your function and your packed-up argument structure. test_it() will then run your function, passing it your argument, and benchmark the performance.
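The pack/unpack pattern described above can be sketched like this. This is not the real test_it(): time_once, struct sum_args, sum_array, and clock_ticks are all made-up names, and the tick source here is plain clock() as a stand-in for the timestamp counter, so the numbers it returns are not cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void *(*bench_fn)(void *);   /* the function shape the post asks for */
typedef uint64_t (*tick_fn)(void);   /* hypothetical: any tick source */

/* Arguments packed into one struct (hypothetical layout). */
struct sum_args {
    const int *data;
    size_t     len;
    long       sum;   /* result written back by the function */
};

/* Example function under test: unpacks its args and sums the array. */
void *sum_array(void *p)
{
    struct sum_args *a = p;
    long total = 0;
    for (size_t i = 0; i < a->len; i++)
        total += a->data[i];
    a->sum = total;
    return a;
}

/* Stand-in tick source; a real run would read the timestamp counter. */
uint64_t clock_ticks(void)
{
    return (uint64_t)clock();
}

/* Run fn(arg) once and return the elapsed ticks minus the cost of
 * two back-to-back counter reads (the measurement overhead). */
uint64_t time_once(bench_fn fn, void *arg, tick_fn now)
{
    uint64_t t0 = now();
    uint64_t t1 = now();
    uint64_t overhead = t1 - t0;

    t0 = now();
    fn(arg);
    t1 = now();

    uint64_t elapsed = t1 - t0;
    return elapsed > overhead ? elapsed - overhead : 0;
}
```

In main you would fill an array, put its pointer and length into a struct sum_args, and call time_once(sum_array, &args, clock_ticks). Passing the tick source as a parameter is only a convenience for this sketch; it makes it easy to swap in a different counter.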
Does it work:
-yes, actually, as far as I can tell. but that doesn't mean it actually does
Is it annoyingly complex?:
-no i hope not. i provided a pretty clear example imo. if you feel otherwise then tell me so.
Code is available in zip and tar:
http://www.1nfamus.netfirms.com/#benching
and here it is if you wanna just look:
/* 12/25/03 - Merry Xmas!!!
* This code is meant to provide reasonably accurate benchmarking of functions
* that iterate over a set of data. it is meant to be as modular as possible
* so as to make testing of different functions go as fast as possible. It
* calculates the total number of cycles required for a function to run. It
* also calculates the CPE of a function that iterates over a dataset. What is
* the CPE you ask?
* CPE - cycles per element. the number of cycles required to process an element
* of a data set. This is a term i stole from this book (which i highly recommend):
* http://csapp.cs.cmu.edu/
* It's a good way imo (and theirs) of benchmarking code b/c it lets you clearly see
* how the processor is performing. example: the intel pentiums have an integer arithmetic
* unit that is capable of executing addition with a latency of 1 cycle. it is also
* capable of starting a new instruction every cycle. Now if you just time your code using
* something like times(), you really have no idea how close your code is coming to reaching
* the max capabilities of the processor. if you instead have a measure of how many cycles
* it takes for an element to be processed, there is a much clearer relationship between what
* is going on in the processor and where the delays are occurring.
*
* BUGS:
* i have compared all my tests to the benchmarks in the above book, and my results
* using their code are nearly the same as their results, so i'm fairly confident that
* this works correctly. i emailed my code to the author and asked him to check it
* out, and will be updating anything as necessary and reposting in the thread where
* this was posted.
* TESTED ON:
* this code was written and compiled on redhat 8 and debian 4. i've come to learn
* (unfortunately) that some of the code i've written will compile fine on one version
* of gcc and then have several errors on others; so i'm only hoping that you'll be able
* to compile this. i'd like to make it work on as many platforms as possible, so if you
* fix it to work on a different one then please let me see it.
* BUILD:
* due to the idiocy of the gcc inline assembler the asm CANNOT be inlined or it breaks
* as soon as it is optimized. so i had to stick the asm routines in a separate assembly
* file. the way i've compiled is like so:
* gcc -Wall this_source_file.c the_assembly_source.s
*
* feel free to do whatever you like with this, but if you make it better you've
* got to share it with me.
* UPDATED:
* 02/04/03 - fixed it so that it uses __u64 unsigned ints to store the stamp values. before
* i wasn't checking for an overflow in the low 32 bits of the counter, now it does :).
* -sean larsson */
and some output:
[n00b@highjack3d] ./a.out
loop is not unrolled
overhead is 37 cycles
+-function took 4167 cycles That's a CPE of 4.069336
+-function took 3243 cycles That's a CPE of 3.166992
+-function took 2926 cycles That's a CPE of 2.857422
+-function took 2983 cycles That's a CPE of 2.913086
+-function took 3028 cycles That's a CPE of 2.957031
sum was 523776
unrolling the loop by 6
overhead is 37 cycles
+-function took 2990 cycles That's a CPE of 2.919922
+-function took 1972 cycles That's a CPE of 1.925781
+-function took 1998 cycles That's a CPE of 1.951172
+-function took 2025 cycles That's a CPE of 1.977539
+-function took 1978 cycles That's a CPE of 1.931641
sum was 523776
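From the output you can clearly see the difference between code/data being in the cache or not: the first run of each test is the slowest (4167 and 2990 cycles), and the later runs settle at roughly 2900 to 3250 and roughly 2000 cycles.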
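For anyone wondering what "unrolling the loop by 6" means in the output, here's a sketch of the idea. The actual loop being benchmarked isn't shown in the post, so this is a guess at its shape: sum_simple and sum_unrolled6 are made-up names, and the data set of 1024 ints holding 0..1023 is an inference from the printed sum (0 + 1 + ... + 1023 = 523776).

```c
#include <stddef.h>

/* Plain loop: one add and one loop test per element. */
long sum_simple(const int *data, size_t len)
{
    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += data[i];
    return total;
}

/* Same sum, unrolled by 6: six adds per loop test, followed by a
 * cleanup loop for the 0 to 5 elements left over at the end. */
long sum_unrolled6(const int *data, size_t len)
{
    long total = 0;
    size_t i = 0;

    for (; i + 6 <= len; i += 6) {
        total += (long)data[i]     + data[i + 1] + data[i + 2]
               +       data[i + 3] + data[i + 4] + data[i + 5];
    }
    for (; i < len; i++)   /* leftovers when len is not a multiple of 6 */
        total += data[i];

    return total;
}
```

Both versions return the same sum; the unrolled one just does fewer loop tests and branches per element, which is one plausible reason for the lower CPE in the second half of the output.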
ps. this information was gleaned from the following:
-intel manuals, primarily vol2,3, and optimizing 1
-the above mentioned book
-the link lying around in one of these threads about performance, posted by jc(peenie), regarding proper use of the rdtsc instruction