hey, today someone in my class asked me a question: what is the fastest way to multiply a number by 7?

I said you can use bitwise operators. The code I gave him is:

int n=10;  // for example

n = (n<<3)-n;

Then he asked me about one more version:

n = (n<<2) + (n<<1) + n;

Now, which one is faster, the first or the second? I was confused by this.

Then I asked somebody else about this, and he said the following code is fastest.

int n=10; // for example

n = n*7;

That confused me badly. Can you help me out? Which one is faster, and why? I can't figure out who is right, which code (out of #1 and #2) is faster, or whether the third is fastest of all. Thanks in advance.

Edited 3 Years Ago by nitin1

I like #1 or #3. #1 because it uses bit shifting, which is always fast. The only minus is the subtraction.

I like #3 because it allows full optimization by the compiler - which surely is set up for something as common as multiplication, in unbeatable time.

I don't like #2; it has too many operations, in total, that need to be performed.

The first is the fastest of all, because it uses one bitwise and one arithmetic operator.

Next comes the second: two bitwise and two arithmetic operators.

The code with the highest time complexity is the third. My teacher once taught: "It looks simple, but whenever a multiplication is called, the compiler will perform inner addition operations. Due to its coding structure, though, most programmers use this notation."

Hope this helps you...

Have a happy coding...:D

You're forgetting that the compiler can optimise. Who's to say the compiler doesn't change a multiplication into something like #1, and simplify #2 to #1? There are too many variables here to say for sure which is the fastest when it is put into practice.

The only minus is the subtraction.

Do I sense a pun?

Edited 3 Years Ago by Assembly Guy

Comments
Excellent points AG, especially the one about the pun!

Whenever a multiplication is called, the compiler will perform inner addition operations.

What? It might have been common for CPUs to not support multiplication over 30 years ago, but I doubt very much that any such CPUs exist now. So there's no reason for a compiler to implement multiplication in terms of addition¹.

And even if there were some low-end embedded CPU still in use somewhere that did not support multiplication, there's no way that a modern compiler wouldn't find a way to compile to something more efficient than repeated addition when multiplying by a constant.

As a general rule: whenever you think you can low-level optimize something better than the compiler can, you're probably wrong.

¹ Except in cases where that might actually be faster like replacing x*2 with x+x - except that would actually be compiled to x << 1 probably.

Here and here are a couple of charts that give the clock cycles for the MUL and SHL/SHR instructions on the 80x86 family of processors. The shift operation takes less than a quarter of the time of the multiplication instruction. How the compiler optimizes the multiplication operator depends on the compiler, and you have no control over that.

Edited 3 Years Ago by Ancient Dragon

Still, Ancient Dragon sir, what is the answer? Can I say that there is a standard way in which "all" compilers optimize the multiplication operator? The discussion is quite good, but the answer is not clear yet. I understood one thing: (as you said) it depends on the compiler. So what about the codes I posted above? Please clear these things up. Thanks in advance.

So what is the answer ?

The answer is "it depends". You need to consider several issues stemming from the platform and compiler, including but not limited to:

  • Are the combination of operators you use to mimic multiplication faster than multiplication on your target platform?

  • Does the compiler optimize * into either your own bit twiddling code or something even faster? Compiler writers are smart folks, and usually their solutions are vastly superior to ad hoc stuff from application programmers.

  • Does the compiler optimize the multiplication away entirely? It doesn't matter how clever your trick is to mimic multiplication if the compiler can beat it by emitting nothing.

  • Does your tricky code confuse the compiler's optimizer such that the resulting machine code is slower than the equivalent multiplication?

There's no single answer to the question as it's one of those cases where your only recourse is to break out the profiler and do empirical testing.

Edited 3 Years Ago by deceptikon

Comments
as usual, 1 man army for problems :D
Well said.

that mean out of the 3 codes, all are dependent ones.Right ? All need to be tested on the target machine ? We can never say that code which i wrote is more optimized ? And I can never say that normal star operator will give best results as we don't know in which way compiler will optimize it ? am i right ? correct me if i wrong at any point. thanks.

Edited 3 Years Ago by nitin1

In a nutshell, yes. We can't really know. I'll reiterate deceptikon's answer:

It depends

You could compile all three and use objdump to disassemble the machine code generated to identify which is fastest when compiled by your compiler. The machine code may (or will) still vary from compiler to compiler, with the situation in which the multiplication is used (e.g. what code comes before or after it), and with which compiler flags are used.

However, in theory, a shl paired with an add instruction will usually be faster than a single mul instruction; but when you're programming in C, your compiler might not compile a simple var*7 into a mul ax,7-type instruction. There are too many variables to give a definitive answer.

EDIT:
FYI: my compiler compiles your first snippet into a left shift followed by a subtraction. The second snippet is turned into a shift, a move, an addition, a move, a shift, followed by an addition. The third snippet is turned into the same machine code as #1.

Edited 3 Years Ago by Assembly Guy

You could compile all three and use objdump to disassemble the machine code generated to identify which is fastest

That doesn't tell you which is fastest; it only tells you which compiles to which instructions. It might be easier to predict the runtime when looking at assembly code than when looking at C code, but there are still a lot of variables.

First of all you'd need to know how fast each instruction is on the given architecture. In more complicated cases you'd also need to know how the code behaves with regards to caching, branch prediction and/or pipelining. Predicting those things just by looking at the assembly is not very reliable.

So the most reliable way to find out how fast a given piece of code is, is to measure. Looking at the assembly may give you a better idea of why a given piece of code is faster, but reading the assembly should not be a substitute for measuring.

PS: Most compilers have a way of outputting assembly code directly (-S in gcc), so you won't need objdump.

So the most reliable way to find out how fast a given piece of code is, is to measure.

Not necessarily -- just sum up the clock cycles for each machine instruction. I already posted two charts for 80x88 family of processors.

just sum up the clock cycles for each machine instruction.

As I said, that won't take into account caching, branch prediction or pipelining. I'm not saying that those factors are relevant in this specific case, but in general they can be.

None of that is relevant when comparing the speed of various algorithms on the same computer. He's not trying to find out whose computer is faster, but which algorithm is faster.

None of that is relevant when comparing the speed of various algorithms on the same computer.

Yes, it is. As an example consider two pieces of code: Both read two ints from memory and add them. The only difference is that in one piece the two ints are next to each other in memory and in the other they're far away. Clearly both pieces of code will use the same machine instructions - only with different arguments. So if you only add up the cycles of the instructions, they should be equal in speed. But they're not because one piece of code only causes one cache miss while the other causes two.

He's not trying to find out whose computer is faster, but which algorithm is faster.

I know that.

But that's still not testing the speed of the algorithms; it's testing machine speed. We shouldn't care about hardware issues.

If I run both pieces of code on the same machine and one is consistently faster than the other, how is that measuring the machine speed? I'm measuring the speed of the pieces of code on this particular machine.

It should also be noted that reading two integers that are next to each other will be faster than reading two integers that are sufficiently far apart on every CPU that has a cache - the only difference is how far "sufficiently far" is and how big the difference is going to be. So this isn't just a quirk of a particular machine.

You said this twice now, but you haven't really explained it. For me "testing hardware", in this context, would mean that I take multiple pieces of hardware and find out which one gives best results. While "testing code" would mean that I take multiple pieces of code and see which one produces the best results. Using those definitions I'm clearly testing the code in the scenario described above.

Clearly you're using different definitions and that's fine, but it doesn't really matter what you call it. If one piece of code consistently performs better than another piece of code, then I'm obviously going to prefer that (ignoring for the moment factors like code readability or maintainability as that's not the discussion we're having right now). It does not matter whether I found this out by "testing hardware" or by "testing software".

The question is, which ALGORITHM is faster. You can't test that if you include other factors that contribute to overall time (such as memory, cache, the operating system, and other things out of the programmer's control, like CPU time slices, time consumed by other processes, etc.). In order to answer the question we need to isolate all the other factors.

Edited 3 Years Ago by Ancient Dragon

An algorithm that accesses memory in a pattern that's cache friendly will be faster than one that doesn't. People optimize their algorithms for that. That's a real thing.

Saying it doesn't matter because that's machine-specific is just silly. If my program runs much slower than another program because I didn't bother with cache friendliness, the customer won't care if I say that that's a hardware issue rather than an issue with my program's algorithm (especially since that's not true by most people's definitions of those terms) - they'll just care that the program runs too slowly.

I should also point out that your proposal of simply summing up each instruction's CPU cycles is hardware specific too (different instructions need a different amount of cycles on different architectures), so I don't understand why that wouldn't also qualify as "testing hardware".

Let's leave it at that instead of squabbling?

Provided it remains civil, "squabbling" amongst experts is a very good thing because a lot of ideas and information get thrown around for people to learn from.

Ah, the ageless "compiler optimizations vs. hand optimizations" argument (discussion)... :-) After 30+ years writing both real-time and large scale system code, I find that simple beats all in the "is it fast enough" race! If it is simple, the compiler can optimize in ways that us mere mortals cannot even imagine! Well, maybe we could imagine if we were compiler designer/writer boffins! ;-)

All that aside, I DO use left/right shift operations if I need to do factor-of-2 multiplies or divides, although the logic to determine how many bits to shift may well be more expensive than letting the compiler figure it out. When I am uncertain which would be better, then I try coding it both ways, run it in a REALLY BIG loop (with compiler loop optimizations disabled), and compute the absolute timing. However, the caveat here is you need to be sure your system isn't doing anything different from one run to the next, and then make multiple runs, averaging out the results. It is still, in computer terms, a SWAG (Stupid Wild Assed Guess), but it will be reasonably close to "truth".

Disclaimer: I am a senior systems/performance engineer for Nokia Mobile Phones (soon to be Microsoft Mobile Phones) and am responsible for end-to-end performance engineering, modeling, and measuring for thousands of servers and 100's of millions of phones world-wide. I still write a LOT of C, C++, Java, JavaScript, Php, and other code... The only programming language(s) I don't use much any longer (unless I am doing processor-specific kernel code) is assembler. :-)

Edited 3 Years Ago by rubberman

And Fortran... My last Fortran programming exercise was in 1967 at the University of Colorado, where I was studying mechanical engineering. My wife, a particle physicist, is MUCH more proficient at Fortran than I am, but most of her programming work these days is in C++. I think her last pure Fortran coding was in the 1980's or 1990's at Cornell... :-)
