
A DIGRESSION ON CHIPS

Before we look at other supercomputing architectures, we take an extended tangent to discuss tricks to speed up processing by intelligent design of chips.

1. Instruction pipelining. Split an operation into several sub-operations (stages), each executed by a different sub-unit of the chip, so that in each clock cycle every sub-unit can be working on a different instruction. For instance, if three instructions A, B, C each need to pass through a 4-stage operation, consider the following decomposition:

                          stage
         cycle        1  2  3  4
               1      A
               2      B  A
               3      C  B  A
               4         C  B  A
               5            C  B
               6               C
Each operation takes 4 clock cycles to complete, but operations are overlapped so that in cycle 2, operations B and A are both performed, in cycle 3, C, B, A are all performed, etc. Thus in 6 cycles, all three instructions are completed for all four stages. How many cycles would it take a serial machine to complete this work?
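The schedule in the table can be reproduced with a few lines of Python. This is only a minimal sketch of an idealized pipeline, assuming no stalls and one instruction issued per cycle; the function names `pipeline_schedule`, `pipelined_cycles` and `serial_cycles` are made up for this illustration. Under those assumptions, n instructions passing through k stages need k + n - 1 cycles when overlapped and n * k cycles when executed one after another.

```python
def pipeline_schedule(instructions, n_stages):
    """Return {cycle: {stage: instruction}} for an ideal pipeline.

    Assumes (for illustration) that one instruction enters stage 1 per
    cycle and that nothing ever stalls.
    """
    schedule = {}
    for k, name in enumerate(instructions):
        for stage in range(1, n_stages + 1):
            cycle = k + stage  # instruction k reaches `stage` in this cycle
            schedule.setdefault(cycle, {})[stage] = name
    return schedule


def pipelined_cycles(n_instructions, n_stages):
    """Cycles needed when the instructions are overlapped in the pipeline."""
    return n_stages + n_instructions - 1


def serial_cycles(n_instructions, n_stages):
    """Cycles needed when each instruction finishes before the next starts."""
    return n_instructions * n_stages


if __name__ == "__main__":
    sched = pipeline_schedule(["A", "B", "C"], 4)
    for cycle in sorted(sched):
        row = "  ".join(sched[cycle].get(stage, " ") for stage in range(1, 5))
        print(cycle, "   ", row.rstrip())
    print("pipelined:", pipelined_cycles(3, 4), "serial:", serial_cycles(3, 4))
```

Running the script prints one row per cycle in the same layout as the table, followed by the two cycle counts for comparison.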

2. Vector pipelining. This arithmetic pipeline operates in much the same way as instruction pipelining - namely, operations on different components of a long vector are overlapped. Consider, for example, the computation

$$ c_i = a_i b_i, \qquad i = 1, \ldots, 100 $$

This is a typical vector multiplication. A typical vector pipe operates on vectors of length 64, so the 100 components are processed in two passes: the first 64, then the remaining 36. In successive cycles, we have

   cycle    operation
     1      load a_i, i=1,64
     2      load b_i, i=1,64
     3      multiply a_i*b_i = c_i, i=1,64
     4      store c_i, i=1,64
     5      load a_i, i=65,100
     6      load b_i, i=65,100
     7      multiply a_i*b_i = c_i, i=65,100
     8      store c_i, i=65,100
Setting up the pipe incurs a great deal of overhead - each component of each vector must be loaded from memory, and low memory bandwidth could make that process very slow - but Cray vector pipelining resulted in about an order of magnitude speed-up over serial computations.
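The load / multiply / store pattern in the table, applied to successive chunks of at most 64 components, can be sketched in Python as follows. This only illustrates the chunking and models none of the hardware timing; the name `strip_mined_multiply` and the list-based "registers" are assumptions made for this example.

```python
VECTOR_LENGTH = 64  # chunk size taken from the table above


def strip_mined_multiply(a, b, vector_length=VECTOR_LENGTH):
    """Compute c_i = a_i * b_i in chunks of at most `vector_length`.

    Returns the result vector and the 1-based, inclusive index range of
    each chunk, so the ranges can be compared with the table.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    c = [0.0] * len(a)
    chunks = []
    for start in range(0, len(a), vector_length):
        stop = min(start + vector_length, len(a))
        a_reg = a[start:stop]                          # load a_i
        b_reg = b[start:stop]                          # load b_i
        c_reg = [x * y for x, y in zip(a_reg, b_reg)]  # multiply a_i*b_i = c_i
        c[start:stop] = c_reg                          # store c_i
        chunks.append((start + 1, stop))
    return c, chunks


if __name__ == "__main__":
    a = [float(i) for i in range(1, 101)]
    b = [2.0] * 100
    c, chunks = strip_mined_multiply(a, b)
    print(chunks)
```

For two vectors of length 100, the returned chunk ranges correspond to the two groups of four steps in the table.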

If we could deliver the results of one (vector) operation directly into a second, a process called chaining of pipelines, we could pick up even more speed. For example, in a calculation such as $a_i b_i + d_i$, the results of the vector multiplication could be piped directly into an addition pipe, resulting in supervector speedup. A related idea is superscalar speedup, which results from several instructions - maybe an add and a store - occurring in the same clock cycle. Again, data must be appropriately loaded into the registers for either of these speedups to occur, and that cuts down on the net time savings.
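A rough cycle count shows why chaining a multiplication into an addition helps. The model below is a deliberately simplified sketch: it assumes each pipe delivers one result per cycle once it is full, ignores loads, stores and memory bandwidth entirely, and uses pipe depths of 7 and 6 stages that are hypothetical values chosen only for illustration.

```python
def pipe_cycles(n_elements, depth):
    """Cycles for a pipe with `depth` stages to process n_elements,
    assuming one result per cycle once the pipe is full."""
    return depth + n_elements - 1


def unchained_cycles(n_elements, mul_depth, add_depth):
    """The multiply finishes completely before the add pipe starts."""
    return pipe_cycles(n_elements, mul_depth) + pipe_cycles(n_elements, add_depth)


def chained_cycles(n_elements, mul_depth, add_depth):
    """Each product flows straight into the add pipe, so the two pipes
    behave like one longer pipe."""
    return pipe_cycles(n_elements, mul_depth + add_depth)


if __name__ == "__main__":
    n, mul_depth, add_depth = 64, 7, 6  # hypothetical depths
    print("unchained:", unchained_cycles(n, mul_depth, add_depth))
    print("chained:  ", chained_cycles(n, mul_depth, add_depth))
```

In this model the chained count pays for the vector length only once, whereas the unchained count pays for it twice. On a real machine, the register loading mentioned above would reduce the gain.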





E. Bruce Pitman
Wed Sep 13 22:27:10 EDT 2000