Views: 1108 | Replies: 0 | Solved
I am currently studying the impact of microarchitectural techniques. I have been looking at a loop, working out where it has to stall, and trying to make it more efficient. I did this with three different methods and measured the cycles per loop iteration for each.
Stalling only on true data dependencies, instead of stalling on every single instruction, left me with 48 cycles per iteration. A multiple-issue design, where results can be immediately forwarded from one unit to another or to itself and which only stalls to observe a true data dependence, gave me 23 cycles per iteration. The same multiple-issue design with the code reordered gave me 18 cycles per iteration.
I was wondering if you could look below at what I did and let me know whether I am stalling and reordering correctly. It would be awesome if you guys could give me any suggestions or feedback.
Here is the code I am working with, along with the latencies beyond a single cycle (an instruction listed as +N actually takes N+1 cycles). Also note that the branch is always taken and the branch delay slot is one cycle.
Sorry for the indenting. I wanted to separate everything, but it somehow got messed up in Word.
Latencies beyond a single cycle:
Memory LD +3
Memory SD +1
Integer ADD, SUB +0
Branches +1
ADDD +2
MULTD +4
DIVD +10
Loop: LD F2, 0(Rx)
I0: MULTD F2, F0, F2
I1: DIVD F8, F2, F0
I2: LD F4, 0(Ry)
I3: ADDD F4, F0, F4
I4: ADDD F10, F8, F2
I5: SD F4, 0(Ry)
I6: ADDI Rx, Rx, #8
I7: ADDI Ry, Ry, #8
I8: SUB R20, R4, Rx
I9: BNZ R20, Loop
Branch Delay Slot
The first method I used was stalling only when there were true data dependencies, instead of stalling on every single instruction.
Loop: LD F2, 0(Rx)
<stall> x 3
I0: MULTD F2, F0, F2
<stall> x 4
I1: DIVD F8, F2, F0
I2: LD F4, 0(Ry)
<stall> x 3
I3: ADDD F4, F0, F4
I4: ADDD F10, F8, F2
I5: SD F4, 0(Ry)
I6: ADDI Rx, Rx, #8
I7: ADDI Ry, Ry, #8
I8: SUB R20, R4, Rx
I9: BNZ R20, Loop
Branch Delay Slot
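A hand-drawn stall listing like the one above can be cross-checked with a few lines of Python. This is only a rough sketch under stated assumptions, not a model of any particular pipeline: it is single-issue and in-order, it stalls only on read-after-write (true) dependences, and it assumes a result produced by a "+N" instruction can first be consumed N+1 cycles after that instruction issues. The names `EXTRA_LATENCY`, `LOOP` and `raw_stalls` are made up for this example, and ADDI is assumed to share the "Integer ADD, SUB +0" latency. Structural hazards and the branch delay slot are ignored.

```python
# Extra latencies ("+N") copied from the post; an op takes N + 1 cycles in total.
# Assumption: ADDI uses the "Integer ADD, SUB +0" entry.
EXTRA_LATENCY = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "BNZ": 1,
                 "ADDD": 2, "MULTD": 4, "DIVD": 10}

# (label, opcode, destination register or None, source registers)
LOOP = [
    ("Loop", "LD",    "F2",  ["Rx"]),
    ("I0",   "MULTD", "F2",  ["F0", "F2"]),
    ("I1",   "DIVD",  "F8",  ["F2", "F0"]),
    ("I2",   "LD",    "F4",  ["Ry"]),
    ("I3",   "ADDD",  "F4",  ["F0", "F4"]),
    ("I4",   "ADDD",  "F10", ["F8", "F2"]),
    ("I5",   "SD",    None,  ["F4", "Ry"]),
    ("I6",   "ADDI",  "Rx",  ["Rx"]),
    ("I7",   "ADDI",  "Ry",  ["Ry"]),
    ("I8",   "SUB",   "R20", ["R4", "Rx"]),
    ("I9",   "BNZ",   None,  ["R20"]),
]


def raw_stalls(program, extra_latency):
    """Return {label: stall cycles inserted before that instruction}."""
    ready = {}   # register -> first cycle at which a consumer may issue
    cycle = 0    # next free issue slot
    stalls = {}
    for label, op, dest, srcs in program:
        # Wait until every source register is available (true dependences only).
        issue = max([cycle] + [ready.get(reg, 0) for reg in srcs])
        stalls[label] = issue - cycle
        if dest is not None:
            ready[dest] = issue + extra_latency[op] + 1
        cycle = issue + 1
    return stalls


if __name__ == "__main__":
    for label, count in raw_stalls(LOOP, EXTRA_LATENCY).items():
        print(f"{label}: {count} stall(s)")
```

Printing the result gives one stall count per label, which can be compared line by line against the listing. Any difference may come from the simplifying assumptions rather than from the listing itself.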
Next, I used a multiple-issue design where results can be immediately forwarded from one unit to another or itself. It should only stall to observe a true data dependence.
1st Pipeline 2nd Pipeline
Loop: LD F2, 0(Rx) I0: MULTD F2, F0, F2
I1: DIVD F8, F2, F0 I2: LD F4, 0(Ry)
I3: ADDD F4, F0, F4 <stall> x 6 (waiting for F8)
I4: ADDD F10, F8, F2
I5: SD F4, 0(Ry) I6: ADDI Rx, Rx, #8
I7: ADDI Ry, Ry, #8 I8: SUB R20, R4, Rx
I9: BNZ R20, Loop Branch Delay Slot
The final thing I did was use the multiple-issue design and reorder the code to improve the performance.
1st Pipeline 2nd Pipeline
Loop: LD F2, 0(Rx) I0: MULTD F2, F0, F2
I2: LD F4, 0(Ry) I1: DIVD F8, F2, F0
I3: ADDD F4, F0, F4 I8: SUB R20, R4, Rx
I5: SD F4, 0(Ry) I6: ADDI Rx, Rx, #8
I4: ADDD F10, F8, F2 I7: ADDI Ry, Ry, #8
I9: BNZ R20, Loop Branch Delay Slot
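One way to sanity-check a reordering like this is to confirm that every pair of instructions that touches the same register in a conflicting way keeps its original relative order. The sketch below is a conservative check with no register renaming, and it rests on an assumption about how the flattened table reads: each row is taken left to right, 1st pipeline then 2nd pipeline. `ORIGINAL`, `REORDERED` and `broken_dependences` are names invented for this example.

```python
# (label, destination register or None, source registers), in original program order.
ORIGINAL = [
    ("Loop", "F2",  ["Rx"]),
    ("I0",   "F2",  ["F0", "F2"]),
    ("I1",   "F8",  ["F2", "F0"]),
    ("I2",   "F4",  ["Ry"]),
    ("I3",   "F4",  ["F0", "F4"]),
    ("I4",   "F10", ["F8", "F2"]),
    ("I5",   None,  ["F4", "Ry"]),
    ("I6",   "Rx",  ["Rx"]),
    ("I7",   "Ry",  ["Ry"]),
    ("I8",   "R20", ["R4", "Rx"]),
    ("I9",   None,  ["R20"]),
]

# Assumed reading of the reordered table: row by row, 1st pipeline then 2nd.
REORDERED = ["Loop", "I0", "I2", "I1", "I3", "I8", "I5", "I6", "I4", "I7", "I9"]


def broken_dependences(original, new_order):
    """List (earlier, later) label pairs whose register conflict got reversed."""
    position = {label: index for index, label in enumerate(new_order)}
    broken = []
    for i, (early, e_dest, e_srcs) in enumerate(original):
        for late, l_dest, l_srcs in original[i + 1:]:
            raw = e_dest is not None and e_dest in l_srcs   # read after write
            war = l_dest is not None and l_dest in e_srcs   # write after read
            waw = e_dest is not None and e_dest == l_dest   # write after write
            if (raw or war or waw) and position[early] > position[late]:
                broken.append((early, late))
    return broken


if __name__ == "__main__":
    print(broken_dependences(ORIGINAL, REORDERED))
```

Under that reading the check reports a single pair, (I6, I8): SUB R20, R4, Rx now sits ahead of the ADDI that updates Rx, so it would see the old Rx unless the bound in R4 is adjusted to match. The check only looks at register names and does not consider memory dependences between the LD and SD instructions.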
Thanks again in advance for any responses.
"First learn computer science and all the theory. Next develop a programming style. Then forget all that and just hack."
-George Carrette