1. Instruction pipelining. Split an operation into several sub-operations, each executing in different sub-units of the chip in each clock cycle. For instance, if three instructions A, B, C, need to be executed in a 4-stage operation, consider the following decomposition:
                stage
cycle     1     2     3     4
  1       A
  2       B     A
  3       C     B     A
  4             C     B     A
  5                   C     B
  6                         C
Each operation takes 4 clock cycles to complete, but operations are overlapped so
that in cycle 2, operations B and A are both performed, in cycle 3, C, B, A are
all performed, etc. Thus in 6 cycles, all three instructions are completed for
all four stages. How many cycles would it take a serial machine to complete this work?
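The schedule above can be reproduced with a short Python sketch. It assumes an idealized pipe: every stage takes exactly one cycle, a new instruction enters stage 1 on each cycle, and nothing stalls. The function names are illustrative, not taken from any library. Under the same assumptions, a serial machine is modeled as finishing all stages of one instruction before starting the next.

```python
def pipeline_schedule(instructions, n_stages):
    """Map each cycle to the list of instructions occupying stages 1..n_stages.

    Idle stages are recorded as None. Assumes one cycle per stage, no stalls.
    """
    n_cycles = len(instructions) + n_stages - 1
    schedule = {}
    for cycle in range(1, n_cycles + 1):
        row = []
        for stage in range(1, n_stages + 1):
            # Instruction number idx (0-based) enters stage 1 in cycle idx + 1,
            # so in a given cycle, stage s holds instruction cycle - s.
            idx = cycle - stage
            row.append(instructions[idx] if 0 <= idx < len(instructions) else None)
        schedule[cycle] = row
    return schedule


def pipelined_cycles(n_instructions, n_stages):
    """Cycles needed when instructions overlap in the pipe."""
    return n_instructions + n_stages - 1


def serial_cycles(n_instructions, n_stages):
    """Cycles needed when each instruction finishes before the next starts."""
    return n_instructions * n_stages


schedule = pipeline_schedule(["A", "B", "C"], 4)
for cycle, row in schedule.items():
    print(cycle, " ".join(name or "." for name in row))

print("pipelined:", pipelined_cycles(3, 4))  # 6
print("serial:   ", serial_cycles(3, 4))     # 12
```

For n instructions and k stages the idealized counts are n + k - 1 pipelined versus n * k serial, so the advantage grows with the number of instructions fed through the pipe.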
2. Vector pipelining. This arithmetic pipeline operates in much the same way as instruction pipelining: operations on different components of a long vector are overlapped. Consider, for example, the computation
c_i = a_i * b_i, i = 1, 100
This is a typical vector multiplication. In successive cycles, we have (for a typical vector pipe operating on vectors of length 64)
cycle   operation
  1     load a_i, i = 1, 64
  2     load b_i, i = 1, 64
  3     multiply a_i*b_i = c_i, i = 1, 64
  4     store c_i, i = 1, 64
  5     load a_i, i = 65, 100
  6     load b_i, i = 65, 100
  7     multiply a_i*b_i = c_i, i = 65, 100
  8     store c_i, i = 65, 100
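The load, multiply, store sequence above processes a length-100 vector in strips of at most 64 elements. A minimal Python sketch of that strip-by-strip pattern follows. The names `strips` and `vector_multiply` are hypothetical, the "registers" are ordinary lists, and the code only mimics the order of operations, not the hardware timing.

```python
VECTOR_LENGTH = 64  # strip size, taken from the vector length used in the text


def strips(n, vector_length=VECTOR_LENGTH):
    """Yield (start, stop) pairs, 0-based with stop exclusive, covering range(n)."""
    for start in range(0, n, vector_length):
        yield start, min(start + vector_length, n)


def vector_multiply(a, b, vector_length=VECTOR_LENGTH):
    """Compute c_i = a_i * b_i one strip at a time."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    c = [0.0] * len(a)
    for start, stop in strips(len(a), vector_length):
        a_reg = a[start:stop]                          # load a_i for this strip
        b_reg = b[start:stop]                          # load b_i for this strip
        c_reg = [x * y for x, y in zip(a_reg, b_reg)]  # multiply a_i * b_i
        c[start:stop] = c_reg                          # store c_i for this strip
    return c


a = [float(i) for i in range(1, 101)]
b = [2.0] * 100
c = vector_multiply(a, b)

print(list(strips(100)))  # [(0, 64), (64, 100)]
```

The two strips correspond to i = 1, 64 and i = 65, 100 in the table; the second strip is shorter because 100 is not a multiple of 64.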
Setting up the pipe takes a great deal of overhead - each component of each vector
must be loaded in from memory, and limited memory bandwidth could make that process
very slow - but vector pipelining on the Cray still
resulted in about an order of magnitude speed-up over serial computations.
If we could deliver the results of one (vector) operation directly into a second,
a process called chaining of pipelines, we could pick up even more speed. For
example, in a calculation that multiplies two vectors and then adds a third, such as
d_i = a_i * b_i + e_i,
the results of the vector multiplication could be piped directly into an
addition pipe, resulting in supervector speedup. A related idea is superscalar
speedup, which results from several instructions - say, an add and a store - occurring
in the same clock cycle. Again, data must be appropriately loaded into the registers
for either of these speedups to occur, and that cuts down on the net time savings.
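To see why chaining helps, here is a rough cycle-count sketch in Python. It uses a simple textbook-style model that is an assumption here, not something stated in these notes: each pipe has a fixed start-up latency and then delivers one result per cycle. The latency values are hypothetical, and the model ignores the register loads mentioned above, so it overstates the net gain.

```python
def unchained_cycles(n, mul_latency, add_latency):
    """Multiply then add, where the add waits for the whole product vector."""
    return (mul_latency + n) + (add_latency + n)


def chained_cycles(n, mul_latency, add_latency):
    """Multiply feeding add: each product enters the add pipe as it emerges."""
    return mul_latency + add_latency + n


N = 64           # vector length, as in the text
MUL_LATENCY = 7  # hypothetical start-up latency of the multiply pipe, in cycles
ADD_LATENCY = 6  # hypothetical start-up latency of the add pipe, in cycles

unchained = unchained_cycles(N, MUL_LATENCY, ADD_LATENCY)
chained = chained_cycles(N, MUL_LATENCY, ADD_LATENCY)

print("unchained:", unchained)  # 141
print("chained:  ", chained)    # 77
print("ratio:     %.2f" % (unchained / chained))
```

In this model the ratio approaches 2 for long vectors, since the two element streams overlap almost completely and only the start-up latencies are paid in sequence.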