Each time we access a piece of data, we should try to get as many floating point operations out of that reference as we can. A useful performance measure is the number of flops per memory reference, and we want it to be high. First, let's look at a simpler set of examples.
Consider, for example, the operations listed below:

calc  flops/pass  operation
1     2           v1_i = v1_i + a*v2_i
2     8           v1_i = v1_i + s2*v2_i + s3*v3_i + s4*v4_i + s5*v5_i
3     1           v1_i = v2_i/v3_i
4     2           v1_i = v1_i + s2*v2_{idx(i)}
5     2           v1_i = v2_i - v3_i*v1_{i-1}
6     2           s = s + v1_i*v2_i
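To make the flops-per-reference idea concrete, the sketch below tabulates the ratio for the six operations above. The flop counts come from the table; the load and store counts are an assumption of this sketch, not part of the original notes: scalars (a, s, s2 to s5) are taken to stay in registers, every vector element read or written counts as one memory reference, the index lookup idx(i) counts as one extra load, and v1_{i-1} in calc 5 is assumed to be still in a register from the previous pass. The names `KERNELS` and `flops_per_reference` are hypothetical.

```python
# Flops per memory reference for the six kernels in the table.
# ASSUMPTION: scalars live in registers; each vector element that is
# loaded or stored counts as one reference; idx(i) costs one extra load;
# v1_{i-1} in calc 5 is reused from a register.
KERNELS = [
    # (calc, flops, loads, stores)
    (1, 2, 2, 1),  # v1_i = v1_i + a*v2_i
    (2, 8, 5, 1),  # v1_i = v1_i + s2*v2_i + ... + s5*v5_i
    (3, 1, 2, 1),  # v1_i = v2_i/v3_i
    (4, 2, 3, 1),  # v1_i = v1_i + s2*v2_{idx(i)}  (idx, v2, v1 loaded)
    (5, 2, 2, 1),  # v1_i = v2_i - v3_i*v1_{i-1}
    (6, 2, 2, 0),  # s = s + v1_i*v2_i             (s stays in a register)
]


def flops_per_reference(flops, loads, stores):
    """Return floating point operations per memory reference."""
    return flops / (loads + stores)


# Map each calc number to its ratio and print a small report.
ratios = {calc: flops_per_reference(f, l, s) for calc, f, l, s in KERNELS}

for calc, ratio in ratios.items():
    print(f"calc {calc}: {ratio:.3f} flops per reference")
```

Under this counting convention, calc 2 gets the most work out of each reference (8 flops for 6 references), while the division in calc 3 gets the least (1 flop for 3 references). A different convention, for example counting v1_{i-1} as a load, changes the numbers but not the ranking of the extremes.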
More generally, let us study how much memory is required to sustain performance on today's supercomputers. First some terminology. Let us define:
is the theoretical peak performance
is the real performance you can achieve on a special problem
is the sustained performance
Typically,
on massively parallel
machines; on shared memory computers,
.
is typical.
Now consider a typical operation, say a nonlinear system solve that takes 5 Newton iterations, where each Newton iteration involves 5000 matrix-vector multiplications. A matrix-vector multiply costs one multiplication and one addition per non-zero matrix element. Let $N_m$ be the number of non-zero elements in the matrix. Then the number of operations to be executed in, say, one hour, is
$$5 \times 5000 \times 2\,N_m = 5 \times 10^4\,N_m .$$
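In one hour, a machine that sustains $r$ flop/s can execute
$$3600\,r$$
operations. Equating, we find
$$5 \times 10^4\,N_m = 3600\,r .$$
Doing the arithmetic, we find
$$N_m = 0.072\,r ,$$
so that, for example, $r = 5 \times 10^8$ flop/s corresponds to $N_m = 3.6 \times 10^7$ non-zero elements.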
That is, a 500 Mflop/s machine should have about 500 MB of memory for reasonable arithmetic performance.
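The estimate can be checked with a few lines of arithmetic. The sketch below uses the figures from the text (5 Newton iterations, 5000 matrix-vector multiplications each, 2 flops per non-zero, a one-hour budget). The storage cost of 12 bytes per non-zero (an 8-byte value plus a 4-byte column index) is an assumption of this sketch, as are the function names `max_nonzeros` and `matrix_megabytes`.

```python
def max_nonzeros(rate_flops, newton_iters=5, matvecs_per_iter=5000,
                 seconds=3600.0):
    """Largest number of non-zeros whose solve fits in the time budget.

    rate_flops is the rate the machine actually sustains, in flop/s.
    Each non-zero costs one multiply and one add per matrix-vector product.
    """
    flops_per_nonzero = 2
    flops_per_solve_per_nonzero = (
        newton_iters * matvecs_per_iter * flops_per_nonzero
    )
    return seconds * rate_flops / flops_per_solve_per_nonzero


def matrix_megabytes(nonzeros, bytes_per_nonzero=12):
    """Memory for the matrix in MB (10^6 bytes).

    ASSUMPTION: 12 bytes per non-zero, i.e. an 8-byte double plus a
    4-byte column index; other sparse formats will differ.
    """
    return nonzeros * bytes_per_nonzero / 1e6


nm = max_nonzeros(500e6)   # machine sustaining 500 Mflop/s
mb = matrix_megabytes(nm)
print(f"non-zeros: {nm:.3g}, matrix storage: {mb:.0f} MB")
```

With these assumptions the matrix alone takes a few hundred megabytes, which is the same order of magnitude as the figure quoted above; vectors and other work arrays would add to it.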