I have just begun reading a PDF on Assembly language. One of the terms it mentioned and encouraged the reader to research further is "pipelining". I looked this up and read a brief description on Wikipedia; a section of the article stated: "In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one."

I am not completely clear on what "so that the output of one element is the input of the next one" means. Does this refer to a section of data being output from one element and then absorbed by another for further processing/transfer, or is it possibly referring to one pipeline element outputting data and then being readied to take in and process new data in an "opposite" direction: one dataset out, one dataset in?

I would really like to understand this better.

Thank you in advance.

sharky_machine


What they want you to do is take the CPU's pipeline into consideration when writing the code. On newer x86 CPUs (486 onward) you can get significant speed boosts by ordering things so stages of the pipeline don't stall waiting for other stages' results.

ex.

xor cx,cx          ; zero out CX
mov something, cx  ; store CX to memory
mov di, whatever   ; load DI...
mov ax,[di]        ; ...and use it on the very next instruction -- possible stall

The load of DI and its use in the next instruction are in close proximity and can cause a stall. The code will work; it'll just be somewhat slower than something like this:

mov di, whatever   ; load DI early
xor cx,cx          ; unrelated CX work fills the gap
mov something, cx  ; still unrelated to DI
mov ax,[di]        ; by now DI is ready -- no stall

Here the CX ops were something that had to happen anyway, and interposing them between the DI load and its usage has no effect on the semantics of the code.

Another thing is flag setting and usage. Newer CPUs have branch prediction logic that will try to prefetch cache lines along the most likely path to be taken. If you give the pipeline some warning about the condition of a conditional branch, that branch can happen faster.

ex.

mov cx,1234        ; doesn't touch the flags
mov dx,4567        ; doesn't touch the flags
add ax,something   ; sets ZF
jnz someplace      ; uses ZF on the very next instruction

The setting of ZF happening right before the jump doesn't give the branch predictor much help. If you can interpose some other instructions between the setting of ZF and its usage, the branch predictor can do a better job. Naturally, those instructions had better not be ones that whack ZF; the MOVs are safe here because MOV doesn't touch the flags at all.

add ax,something   ; sets ZF early
mov cx,1234        ; MOV leaves the flags alone
mov dx,4567        ; MOV leaves the flags alone
jnz someplace      ; ZF has been stable for two instructions

Now the pipeline and branch predictor can see that ZF doesn't change between getting set and getting used, and they've got two instructions' worth of time to start fetching the code at "someplace" in the background.

These sorts of optimization "rules" have changed with every generation of CPU, so it's hard to say what is "best" anymore unless you have a specific target in mind.

Thank you very much for your reply.

Regards,
sharky_machine

Purple Avenger's reply mentions some good points, but fails to describe how pipelining actually works. Think of a basic (naive) design for a processor: it reads an instruction from instruction memory (or cache), decodes the instruction, reads register values from the register file, sends those values to an ALU, sends the ALU result to a stage for memory I/O, and sends the result from there (unchanged, if it isn't a memory op) back to the register file to be saved. The total time for all of this can be quite long.
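
To make the overlap concrete, here's a toy C sketch (my own illustration, not anything from a real chip) that prints which instruction occupies which stage on each clock tick. The stage names IF/ID/EX/MEM/WB and the instruction labels i1..i4 are made up for the example:

/* Toy model of a 5-stage pipeline: each "element" hands its output
   to the next one every clock tick. An instruction that enters on
   cycle c occupies stage s on cycle c + s. */
#include <stdio.h>

#define STAGES 5
#define INSNS  4

int main(void) {
    const char *stage[STAGES] = {"IF", "ID", "EX", "MEM", "WB"};
    const char *insn[INSNS]   = {"i1", "i2", "i3", "i4"};
    int cycles = INSNS + STAGES - 1;   /* ticks until the last insn drains */

    printf("cycle");
    for (int s = 0; s < STAGES; s++)
        printf("%5s", stage[s]);
    printf("\n");

    for (int c = 0; c < cycles; c++) {
        printf("%5d", c + 1);
        for (int s = 0; s < STAGES; s++) {
            int i = c - s;             /* which instruction is in stage s */
            if (i >= 0 && i < INSNS)
                printf("%5s", insn[i]);
            else
                printf("%5s", "-");    /* stage is empty this tick */
        }
        printf("\n");
    }
    return 0;
}

Each row is one clock tick, and reading across a row shows several instructions in flight at once; each stage's output becomes the next stage's input on the following tick, which is exactly the "output of one element is the input of the next" from the Wikipedia definition. Once the pipe fills, one instruction completes every tick even though each instruction still takes five ticks end to end.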

Now suppose that we know the ALU takes 1/3 of the total time to handle one instruction (this is a fictional example, btw). If we cut the processor into independent sections as described above, with state registers passing values between them, we can reduce our clock period to 1/3 of its original value (or triple the frequency), since the clock only has to cover the longest intermediate stage. Any given instruction will take just as long to compute; however, you can have n instructions computing simultaneously along the pipeline (where n is the number of pipeline stages). The end result is higher throughput.
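
To put illustrative numbers on that (made up, like the 1/3 figure above): say an unpipelined instruction takes 15 ns end to end, and the slowest stage, the ALU, accounts for 5 ns of that. Pipelined, the clock period only has to cover the slowest stage, so it drops to 5 ns. Each instruction still spends 15 ns in the processor, but once the pipe is full one instruction finishes every 5 ns instead of every 15 ns, i.e., roughly triple the throughput.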

For crazy designs (i.e., whatever's in your box, most likely), there will be multiple pipelines and multiple data paths, both of which act as multipliers on computing throughput and on design complexity.

There's more info on Wikipedia, but I'll admit I didn't actually read through all of it.
