I cannot seem to solve this problem. If someone could please help me, I would much appreciate it. Thanks.
This problem is concerned with how variations of Tomasulo’s algorithm perform when they run a loop that is very common. This loop is a vector loop called the DAXPY loop (for double precision aX plus Y) and it is the central operation in Gaussian elimination. The code below implements the operation Y = aX + Y for a vector of length 100. It assumes that initially R1 = 0 and F0 contains the value of a.
foo: L.D F2, 0(R1) ; load X(i)
MUL.D F4, F2, F0 ; multiply a*X(i)
L.D F6, 0(R2) ; load Y(i)
ADD.D F6, F4, F6 ; add a*X(i) + Y(i)
S.D F6, 0(R2) ; store Y(i)
DADDIU R1, R1, #8 ; increment X index
DADDIU R2, R2, #8 ; increment Y index
DSGTUI R3, R1, #800 ; test if done
BEQZ R3, foo ; loop if not done
In this code, the instruction DSGTUI R3, R1, #800 is an integer ALU operation which compares register R1 with unsigned immediate value 800, setting R3 to 1 if R1 > 800, 0 otherwise.
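For reference, here is a minimal Python sketch of what the loop above computes. The function name daxpy and the sample values are only for illustration and are not part of the assignment. It shows the arithmetic Y = aX + Y and says nothing about pipeline timing.

```python
def daxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i] for every index i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One pass of the assembly loop: load X(i), multiply by a,
        # load Y(i), add, and store the result back to Y(i).
        y[i] = a * x[i] + y[i]
    return y


# Illustrative data (assumed values): a vector of length 100, as in the problem.
x = [float(i) for i in range(100)]
y = [1.0] * 100
daxpy(2.0, x, y)
```

Each pass through the Python loop body corresponds to the five floating-point and memory instructions in the assembly; the two DADDIU instructions, DSGTUI, and BEQZ are the loop bookkeeping that range() hides.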
The pipeline functional units have the following characteristics:
PIC ATTACH ONE
- The functional units are not themselves pipelined.
- There is no forwarding between functional units, so that results are communicated using the CDB.
- The EX stage does both the effective address calculation and the memory access for loads and stores (so that the pipeline is IF – ID – IS – EX – WB).
- Loads take one cycle.
- The issue (IS) and write result (WB) stages each take 1 clock cycle.
- There are 5 load buffer slots and 5 store buffer slots.
- Assume that the BEQZ instruction takes 0 clock cycles.
- Assume that a queued instruction in a reservation station may execute in the same cycle that the previous instruction writes to the CDB.
- Assume also that a data dependent instruction begins to execute in the cycle after the data value is broadcast on the CDB.
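The issue and CDB rules in the list above can be sketched as a small helper for filling in the tables. This is only a rough aid under stated assumptions: the function name and its parameters are hypothetical, the cycle in which the functional unit becomes free must be supplied by the caller, and the Table 1 latencies are not modeled because they are in the attached picture.

```python
def earliest_ex_start(issue_cycle, operand_cdb_cycles, unit_free_cycle=0):
    """Return the earliest cycle in which an instruction can enter EX.

    issue_cycle        -- cycle the instruction spends in IS (IS takes 1 cycle)
    operand_cdb_cycles -- cycles in which each awaited source operand is
                          broadcast on the CDB (empty if nothing is awaited)
    unit_free_cycle    -- first cycle the functional unit is available
                          (hypothetical parameter, supplied by the caller)
    """
    # EX cannot start before the cycle after issue.
    start = issue_cycle + 1
    # A dependent instruction starts the cycle after the value is broadcast.
    for cdb_cycle in operand_cdb_cycles:
        start = max(start, cdb_cycle + 1)
    # The functional units are not pipelined, so the unit must also be free.
    return max(start, unit_free_cycle)
```

For example, assuming the first L.D issues in cycle 1, it enters EX in cycle 2 and, since loads take one cycle, writes F2 on the CDB in cycle 3. MUL.D issues in cycle 2 and waits for F2, so earliest_ex_start(2, [3]) gives cycle 4.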
a) For this part of the problem, use the single-issue Tomasulo MIPS pipeline shown in Figure 2.9 of your text with the pipeline latencies shown below in Table 1. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e. enters its first EX cycle) for three iterations of the loop. How many cycles does each loop iteration take? Give your answers in the form of a table like the one labeled Part (a) below. The first couple of lines of that table are filled in to give you an idea of what to do. You are to finish the table starting at the third line and continuing until 3 iterations are complete.
b) Use the code for the DAXPY loop and a fully pipelined floating-point unit with the latencies of Table 1. Assume a two-issue Tomasulo’s algorithm for the hardware, with one integer unit taking one execution cycle (that is, a latency of 0 cycles to use) for all integer operations. Show the number of stall cycles for each instruction and what clock cycle each instruction begins execution (i.e., enters its first EX cycle) for three iterations of the loop. Show your answer in the form of a table like the one labeled Part (b) below. The first few lines of that table are filled in to give you an idea of what to do. You are to finish the table starting at the fourth line and continuing until 3 iterations are complete.
c) Again using the MIPS code for DAXPY given above, assume Tomasulo’s algorithm with speculation as shown in Figure 2.14 of your text. Assume the latencies of Table 1 and also assume that there are separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation. Create a table like the one labeled Part (c) below for the first three iterations of the loop. The first few lines of that table are filled in to give you an idea of what to do. You are to finish the table starting at the fourth line and continuing until 3 iterations are complete.
PIC ATTACH TWO
PIC ATTACH THREE