Registers hold data elements. On vector machines, vector registers typically hold 64 elements, so longer vectors must be strip-mined, that is, processed in pieces of at most 64 elements.
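Splitting a long vector into register-sized pieces (often called strip-mining) can be sketched as follows. This is a minimal illustration in C, not code from any particular machine: the name `axpy_strip_mined` and the constant `VLEN` are assumptions made for the example, with `VLEN` set to the typical 64-element register length.

```c
#include <stddef.h>

/* Assumed vector register length (the "typically 64 elements" case). */
#define VLEN 64

/* Computes y[i] += a * x[i] for i = 0..n-1, one strip of at most VLEN
   elements at a time. Returns the number of strips processed. */
size_t axpy_strip_mined(size_t n, double a, const double *x, double *y)
{
    size_t strips = 0;
    for (size_t start = 0; start < n; start += VLEN) {
        /* The last strip may be shorter than VLEN. */
        size_t len = (n - start < VLEN) ? (n - start) : VLEN;
        /* On a vector machine this inner loop would map onto a single
           vector operation on one register's worth of data. */
        for (size_t i = 0; i < len; i++) {
            y[start + i] += a * x[start + i];
        }
        strips++;
    }
    return strips;
}
```

With n = 150 the loop runs three strips, of 64, 64 and 22 elements.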
Memory bandwidth is the number of words that can be moved in each cycle (in a particular pipe). A word is 64 bits, or 8 bytes. If a pipe can carry only one load or store per cycle, we say there is a 1-word memory bottleneck. Memory bandwidth is often the crucial factor in determining the performance of a machine.
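To see why bandwidth matters so much, one can estimate the fewest cycles a loop needs when memory traffic is the only limit. The sketch below is a back-of-the-envelope model, not a measurement tool; the function name `memory_bound_cycles` and its parameters are hypothetical.

```c
/* Lower bound on the cycles for n loop iterations when each iteration
   moves words_per_iter words and the machine can move words_per_cycle
   words per cycle. Ignores arithmetic time, latency and start-up costs. */
double memory_bound_cycles(double n, double words_per_iter,
                           double words_per_cycle)
{
    return n * words_per_iter / words_per_cycle;
}
```

For example, y[i] = y[i] + a*x[i] moves 3 words per iteration (two loads and one store). With a 1-word memory bottleneck, 1000 iterations need at least 3000 cycles; if 3 words could move per cycle, the bound would drop to 1000.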
Main memory determines the size of the problem that can be solved on a given machine. Memory is measured in words or in megabytes (MB).
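The link between memory size and problem size can be made concrete with a small calculation. The sketch below assumes 8-byte words (as defined earlier) and a dense n-by-n matrix as the "problem"; the function names are hypothetical and chosen for illustration only.

```c
#include <stdint.h>

#define BYTES_PER_WORD 8u  /* one word = 64 bits = 8 bytes */

/* Bytes needed to hold the given number of 64-bit words. */
uint64_t words_to_bytes(uint64_t words)
{
    return words * BYTES_PER_WORD;
}

/* Largest n such that one n-by-n matrix of 64-bit words fits in
   mem_bytes of memory. Uses a simple linear search, which is fine
   for realistic memory sizes. */
uint64_t max_square_matrix(uint64_t mem_bytes)
{
    uint64_t words = mem_bytes / BYTES_PER_WORD;
    uint64_t n = 0;
    while ((n + 1) * (n + 1) <= words) {
        n++;
    }
    return n;
}
```

Taking 1 MB as 2^20 bytes (an assumption), 512 MB of memory holds 2^26 words, so the largest single matrix that fits is 8192 by 8192, before allowing any space for other data.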
Extended memory is extra on-board storage, usually not as fast as main memory, but faster than disk access. With current technology, extended memory can be as fast as main memory, and is usually a couple of times bigger.
I/O to disk is frightfully slow. Sometimes very large problems can "fit" onto machines with small memory only by loading segments of the problem in from disk.
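Loading a problem from disk in segments can be sketched as below: only one segment is resident in memory at a time. This is a minimal illustration assuming the data is a flat binary file of doubles; `sum_out_of_core` and its parameters are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sums all doubles in an open binary file while holding at most
   `chunk` elements in memory at once. Stores the sum in *result and
   returns 0 on success, or -1 on allocation or read failure. */
int sum_out_of_core(FILE *fp, size_t chunk, double *result)
{
    if (chunk == 0) {
        return -1;
    }
    double *buf = malloc(chunk * sizeof *buf);
    if (buf == NULL) {
        return -1;
    }
    double total = 0.0;
    size_t got;
    /* Each fread brings the next segment in from disk. */
    while ((got = fread(buf, sizeof *buf, chunk, fp)) > 0) {
        for (size_t i = 0; i < got; i++) {
            total += buf[i];
        }
    }
    int failed = ferror(fp);
    free(buf);
    if (failed) {
        return -1;
    }
    *result = total;
    return 0;
}
```

The segment size trades memory for the number of disk reads: larger segments mean fewer of the slow I/O operations.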
Floating point units operate on data that is transferred from cache to the registers by load/store operations. Time to access cache is usually 1 or 2 cycles. If the requested data is not in cache (a cache miss), it must be transferred in from memory. Cache size and re-use of data in cache are essential for good performance. Be careful, however: if data is in cache but the variable has since been changed elsewhere, for example by another processor, the cached copy is out of date and must be re-loaded. Keeping cached copies consistent is the job of cache coherence, hence the "cc" in ccNUMA.
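The effect of data re-use in cache shows up in something as simple as loop order. The sketch below sums the same matrix two ways; both functions return the same value, but they touch memory in different orders. It assumes a matrix stored row by row (the usual C layout), and the function names are hypothetical.

```c
#include <stddef.h>

/* Walks along rows: consecutive accesses are adjacent in memory, so
   every word of a cache line is used once the line has been fetched. */
double sum_by_rows(size_t n, const double *a)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            s += a[i * n + j];
        }
    }
    return s;
}

/* Walks down columns: consecutive accesses are n words apart, so for
   large n each access is likely to land in a different cache line
   and cause more cache misses. */
double sum_by_cols(size_t n, const double *a)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < n; i++) {
            s += a[i * n + j];
        }
    }
    return s;
}
```

How large the timing difference is depends on the cache size and line length of the machine, so it is best measured rather than assumed.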