Lecture: An Overview of Pipelining

Reading materials for this lecture:

  1. Chapter 5.3, A Simple Implementation Scheme, in the textbook.
  2. Chapter 6.1, An Overview of Pipelining, in the textbook.
  3. Chapter 6.2, A Pipelined Datapath, in the textbook.

Section 1 Performance of Single-Cycle MIPS Processor

Performance = 1 / CPU Execution Time

CPU Execution Time is determined by three terms:

CPU Execution Time = Instruction Count * CPI * Clock Cycle Time

CPI: Clock cycles Per Instruction, which is the average number of
clock cycles each instruction takes to execute.

Clock Cycle Time: the length of one clock cycle. Since CPI = 1 in the
single-cycle design, this is also the average time per instruction.
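Putting the three terms together, here is a minimal Python sketch of the
execution-time formula (the instruction count and cycle time below are
made-up numbers, just for illustration):

    def cpu_execution_time(instruction_count, cpi, clock_cycle_time_ns):
        # CPU Execution Time = Instruction Count * CPI * Clock Cycle Time
        return instruction_count * cpi * clock_cycle_time_ns

    # Made-up example: 1,000,000 instructions, CPI = 1, 8ns clock cycle
    print(cpu_execution_time(1_000_000, 1, 8))   # 8000000 ns = 8 ms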

In order to calculate the clock cycle time needed by different instructions,
we make the following assumptions: a memory access (instruction fetch or
data access) takes 2ns, an ALU operation takes 2ns, and a register file
read or write takes 1ns.

Suppose there is no delay in the other parts, such as multiplexors,
control units, sign extension, PC accesses, and the shift unit.

Clock Cycle Time:

R-format : 2ns (instruction fetch) + 1ns (register read) + 2ns (ALU execution) + 1ns (write back to register file) = 6ns

lw       : 2ns (instruction fetch) + 1ns (register read) + 2ns (ALU execution) + 2ns (read Data Memory) + 1ns (write back to register file) = 8ns

sw       : 2ns (instruction fetch) + 1ns (register read) + 2ns (ALU execution) + 2ns (write Data Memory) = 7ns

beq      : 2ns (instruction fetch) + 1ns (register read) + 2ns (ALU execution) = 5ns
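These per-instruction times are just the sum of the component delays each
instruction class uses. A minimal Python sketch under the delay assumptions
above (2ns memory, 2ns ALU, 1ns register file):

    # Component delays in ns, from the assumptions above
    MEM = 2   # instruction fetch or data memory access
    REG = 1   # register file read or write
    ALU = 2   # ALU execution

    # Components each instruction class passes through
    paths = {
        "R-format": [MEM, REG, ALU, REG],       # fetch, reg read, ALU, write back
        "lw":       [MEM, REG, ALU, MEM, REG],  # fetch, reg read, ALU, read memory, write back
        "sw":       [MEM, REG, ALU, MEM],       # fetch, reg read, ALU, write memory
        "beq":      [MEM, REG, ALU],            # fetch, reg read, ALU
    }

    for name, delays in paths.items():
        print(f"{name}: {sum(delays)}ns")       # R-format: 6ns, lw: 8ns, sw: 7ns, beq: 5ns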
 

There are two possible implementations: one with a fixed clock cycle length
(the same for every instruction), and one with a variable clock cycle length
(each instruction takes only as long as it needs).

To compare the performance between two implementations, suppose the following instruction distribution:

24% lw,  12% sw,  44% R-format, 20% beq

For the fixed clock length, the Clock Cycle Time must be at least as long
as the longest time needed by any instruction, i.e., 8ns for lw.

Clock Cycle Time (fixed) =  8ns

For the variable clock length, we can calculate the average value:

Clock Cycle Time (variable) = 8*24%  +  7*12%  +  6*44%  +  5 * 20% = 6.4 ns

We test both implementations with the same instruction sequence, so the
Instruction Count is the same, and CPI = 1 for both. Therefore:

Performance (variable) / Performance (fixed)
    = CPU Execution Time (fixed) / CPU Execution Time (variable)
    = Clock Cycle Time (fixed) / Clock Cycle Time (variable)
    = 8 / 6.4
    = 1.25
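The same comparison can be checked numerically. A minimal Python sketch
using the per-instruction times and instruction mix above:

    # Per-instruction clock cycle times (ns) and the assumed instruction mix
    cycle_time = {"lw": 8, "sw": 7, "R-format": 6, "beq": 5}
    mix        = {"lw": 0.24, "sw": 0.12, "R-format": 0.44, "beq": 0.20}

    fixed    = max(cycle_time.values())                          # 8 ns, set by lw
    variable = sum(cycle_time[i] * mix[i] for i in cycle_time)   # 6.4 ns

    print(fixed, round(variable, 2), round(fixed / variable, 2))   # 8 6.4 1.25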

This indicates the variable clock implementation is 1.25 times faster
than the fixed clock implementation. Keep in mind, here we only
consider a simple instruction set. For a more complex instruction set
including floating-point instructions, the performance of the single-cycle
design with a fixed clock cycle length will be even worse.
 

Section 2 Improving Performance by Pipelining

Example: the laundry wash shown in Fig 6.1 in the textbook.

Suppose washing clothes has to go through the following
four steps, each taking 0.5 hour:

    washer ---> dryer ---> folder ---> storer
    0.5 hour    0.5 hour   0.5 hour    0.5 hour

Fig 6.1 shows two approaches to laundry wash. One is the sequential
(non-pipeline) approach, which takes
         2 * 4 = 8 hours
to wash 4 loads. In comparison, the pipeline approach takes only
         2 + 3*0.5 = 3.5 hours
to wash four loads. So the pipeline approach is more than
2 times faster than the non-pipeline approach for the task of washing
4 loads.

In fact, if all the stages take about the same amount of time and
there is enough work to do, then the speedup due to pipelining
approaches the number of stages in the pipeline. Supposing there
are 1000 wash loads, the non-pipeline approach takes
      2*1000 = 2000 hours
while the pipeline approach only takes
      2 + 999*0.5 = 501.5 hours.
The speedup
      2000/501.5 is approximately equal to 4,
the number of stages.
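The laundry arithmetic generalizes: the first load takes the full 2 hours,
and every later load finishes 0.5 hour after the one before it. A minimal
Python sketch of both approaches:

    def sequential_hours(loads, stage_time=0.5, stages=4):
        # Each load uses all four steps by itself before the next load starts
        return loads * stages * stage_time

    def pipelined_hours(loads, stage_time=0.5, stages=4):
        # First load takes stages * stage_time; each later load adds one stage_time
        return stages * stage_time + (loads - 1) * stage_time

    for n in (4, 1000):
        seq, pipe = sequential_hours(n), pipelined_hours(n)
        print(n, seq, pipe, round(seq / pipe, 2))
    # 4 8.0 3.5 2.29
    # 1000 2000.0 501.5 3.99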
 

Section 3 Pipelined Datapath

Based on the execution steps used by instructions, we can divide
the datapath into five stages.

Five-stage pipelined datapath, shown in Fig. 6.10 in the textbook:
1. IF  : Instruction fetch
2. ID  : Instruction decode and register file read
3. EX  : ALU execution
4. MEM : Data memory access
5. WB  : Write back
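To see how instructions overlap in these five stages, here is a minimal
Python sketch that prints which stage each instruction occupies in each
clock cycle (the three instructions are just placeholders):

    STAGES = ["IF", "ID", "EX", "MEM", "WB"]
    instructions = ["lw $1,0($2)", "sub $3,$4,$5", "add $6,$7,$8"]  # placeholder instructions

    # Instruction i enters the pipeline in cycle i and moves one stage per cycle
    for cycle in range(len(instructions) + len(STAGES) - 1):
        busy = []
        for i, instr in enumerate(instructions):
            stage = cycle - i
            if 0 <= stage < len(STAGES):
                busy.append(f"{instr} in {STAGES[stage]}")
        print(f"cycle {cycle + 1}: " + ", ".join(busy))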

Since at most five instructions can be in the datapath at the same time
in the five-stage datapath, we need to save the information needed by
each instruction. For example, if we did not save the bits of one
instruction, the following instruction entering the datapath would
overwrite them, and all the information for the previous instruction
would be lost.

Just as the PC (program counter) passes the instruction address from one
clock cycle to the next, we can insert pipeline registers between adjacent
stages, as shown in Fig. 6.12 in the textbook.

IF/ID registers: PC address, instruction

ID/EX registers: PC address, Read Data 1, Read Data 2, sign-extended offset

EX/MEM registers: branch address, Zero signal, ALU result, Read Data 2

MEM/WB registers: Read Data from Data Memory, ALU result
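As a minimal sketch of what these pipeline registers carry, the field lists
above can be written down as simple Python data classes (the field names are
illustrative, not taken from the textbook):

    from dataclasses import dataclass

    @dataclass
    class IF_ID:
        pc: int                    # PC address (incremented PC)
        instruction: int           # the fetched instruction bits

    @dataclass
    class ID_EX:
        pc: int                    # PC address, passed along for the branch target
        read_data_1: int           # Read Data 1 from the register file
        read_data_2: int           # Read Data 2 from the register file
        sign_extended_offset: int  # sign-extended 16-bit offset

    @dataclass
    class EX_MEM:
        branch_address: int        # computed branch target address
        zero: bool                 # Zero signal from the ALU
        alu_result: int            # ALU result
        read_data_2: int           # value to write to memory for sw

    @dataclass
    class MEM_WB:
        mem_read_data: int         # Read Data from Data Memory (for lw)
        alu_result: int            # ALU result (for R-format write back)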