Appendix A: Terminology (taken from the Beowulf-HOWTO)
Although parallel processing has been used for many years in many
systems, it is still somewhat unfamiliar to most computer users. Thus, before discussing
the various alternatives, it is important to become familiar with a few commonly used
terms.
SIMD
SIMD (Single Instruction stream, Multiple Data stream) refers to a
parallel execution model in which all processors execute the same operation at the same
time, but each processor is allowed to operate upon its own data. This model naturally
fits the concept of performing the same operation on every element of an array, and is
thus often associated with vector or array manipulation. Because all operations are
inherently synchronized, interactions among SIMD processors tend to be easily and
efficiently implemented.
MIMD
MIMD (Multiple Instruction stream, Multiple Data stream) refers to a
parallel execution model in which each processor is essentially acting independently. This
model most naturally fits the concept of decomposing a program for parallel execution on a
functional basis; for example, one processor might update a database file while another
processor generates a graphic display of the new entry. This is a more flexible model than
SIMD execution, but it is achieved at the risk of debugging nightmares called race
conditions, in which a program may intermittently fail due to timing variations reordering
the operations of one processor relative to those of another.
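The race condition described above can be made concrete with a small, deterministic
sketch. It does not use real processors; it simply replays two hand-written orderings of
the same operations. The processor names "A" and "B", the replay function, and the
read/write step names are all hypothetical, chosen only for illustration.

```python
def replay(schedule):
    """Replay one interleaving of two processors incrementing a shared counter.

    Each processor performs a non-atomic read-modify-write: it copies the
    shared value into a private register ("read"), then stores register + 1
    back into the shared location ("write").
    """
    shared = 0
    register = {}
    for processor, step in schedule:
        if step == "read":
            register[processor] = shared
        else:  # "write"
            shared = register[processor] + 1
    return shared


# Same four operations, two different relative orderings.
in_order = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
reordered = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(replay(in_order))   # 2: both increments take effect
print(replay(reordered))  # 1: one increment is lost
```

On a real MIMD machine the ordering is chosen by timing rather than by a list, which is
why such failures tend to be intermittent.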
SPMD
SPMD (Single Program, Multiple Data) is a restricted version of MIMD
in which all processors are running the same program. Unlike SIMD, each processor
executing SPMD code may take a different control flow path through the program.
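As a rough illustration of SPMD, the sketch below runs one function on several workers;
each worker receives a different rank and therefore works on different data and may take
a different branch. Threads stand in for processors here, and the names program,
run_spmd, rank, and size are assumptions made for this example, not part of any
particular library.

```python
from concurrent.futures import ThreadPoolExecutor


def program(rank, size, data):
    """The single program that every processor runs."""
    chunk = data[rank::size]      # each rank selects its own share of the data
    partial = sum(chunk)
    if rank == 0:
        # Only rank 0 follows this control flow path.
        return ("coordinator", partial)
    return ("worker", partial)


def run_spmd(size, data):
    """Start `size` copies of the same program, one per rank."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        return list(pool.map(lambda rank: program(rank, size, data), range(size)))


results = run_spmd(4, list(range(10)))
print(results)
```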
Communication Bandwidth
The bandwidth of a communication system is the maximum amount of
data that can be transmitted in a unit of time, once data transmission has begun.
Bandwidth for serial connections is often measured in baud or bits/second (b/s), which
generally correspond to 1/10 to 1/8 that many Bytes/second (B/s). For example, a 1,200
baud modem transfers about 120 B/s, whereas a 155 Mb/s ATM network connection is nearly
130,000 times faster, transferring about 17 MB/s. High bandwidth allows large blocks
of data to be transferred efficiently between processors.
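The arithmetic behind these figures can be checked with a few lines of code. This is only
a sketch: it assumes one bit per baud for the modem, and it treats the 1/10 to 1/8 rule
of thumb as 10 or 8 line bits per byte delivered. The function name bytes_per_second is
made up for this example.

```python
def bytes_per_second(bits_per_second, line_bits_per_byte):
    """Convert a raw line rate into an approximate byte rate.

    line_bits_per_byte is 8 when there is no framing overhead and about 10
    when each byte carries extra bits (for example, start and stop bits).
    """
    return bits_per_second / line_bits_per_byte


modem = bytes_per_second(1_200, 10)            # 120.0 B/s
atm_low = bytes_per_second(155_000_000, 10)    # 15.5 MB/s
atm_high = bytes_per_second(155_000_000, 8)    # 19.375 MB/s
ratio = 155_000_000 / 1_200                    # a little over 129,000

print(modem, atm_low, atm_high, round(ratio))
```

The figure of about 17 MB/s quoted above falls between the two ATM estimates.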
Communication Latency
The latency of a communication system is the minimum time taken to
transmit one object, including any send and receive software overhead. Latency is very
important in parallel processing because it determines the minimum useful grain/block
size, that is, the minimum run time a segment of code needs in order to yield speed-up
through parallel execution. Basically, if a segment of code runs for less time than it takes to transmit
its result value (i.e., latency), executing that code segment serially on the processor
that needed the result value would be faster than parallel execution; serial execution
would avoid the communication overhead.
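The rule of thumb above can be written down as a simple cost model. The sketch below
assumes that the time to move a message is the latency plus the message size divided by
the bandwidth, which is a common simplification rather than something measured. The
LATENCY and BANDWIDTH values are hypothetical, as are the function names.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Estimated seconds to deliver one message of nbytes."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def parallel_pays_off(segment_runtime_s, latency_s):
    """True if the segment runs longer than it takes to send its result."""
    return segment_runtime_s > latency_s


LATENCY = 100e-6    # hypothetical: 100 microseconds per message
BANDWIDTH = 10e6    # hypothetical: 10 MB/s

small = transfer_time(8, LATENCY, BANDWIDTH)           # dominated by latency
large = transfer_time(1_000_000, LATENCY, BANDWIDTH)   # dominated by bandwidth

print(small, large)
print(parallel_pays_off(20e-6, LATENCY))   # False: too fine-grained
print(parallel_pays_off(5e-3, LATENCY))    # True
```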
Message Passing
Message passing is a model for interactions between processors
within a parallel system. In general, a message is constructed by software on one
processor and is sent through an interconnection network to another processor, which then
must accept and act upon the message contents. Although the overhead in handling each
message (latency) may be high, there are typically few restrictions on how much
information each message may contain. Thus, message passing can yield high bandwidth,
making it a very effective way to transmit a large block of data from one processor to
another. However, to minimize the need for expensive message passing operations, data
structures within a parallel program must be spread across the processors so that most
data referenced by each processor is in its local memory; this task is known as data
layout.
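A minimal sketch of the message passing model follows. Two threads stand in for two
processors and a queue stands in for the interconnection network; a real cluster would
use a message passing library instead. The message tags "data" and "done" and the
function names are assumptions made for this example.

```python
import queue
import threading


def sender(channel, block):
    """Construct messages in software and send them through the channel."""
    channel.put(("data", block))   # one large block in a single message
    channel.put(("done", None))    # tell the receiver there is nothing more


def receiver(channel, results):
    """Accept each message and act upon its contents."""
    while True:
        tag, payload = channel.get()
        if tag == "done":
            break
        results.append(sum(payload))


def exchange(block):
    channel = queue.Queue()
    results = []
    receiving = threading.Thread(target=receiver, args=(channel, results))
    sending = threading.Thread(target=sender, args=(channel, block))
    receiving.start()
    sending.start()
    sending.join()
    receiving.join()
    return results


print(exchange(list(range(1000))))   # [499500]
```

Note that the whole block travels in one message, so the per-message overhead is paid
only once.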
Shared Memory
Shared memory is a model for interactions between processors within
a parallel system. Systems like the multi-processor Pentium machines running Linux
physically share a single memory among their processors, so that a value written to shared
memory by one processor can be directly accessed by any processor. Alternatively,
logically shared memory can be implemented for systems in which each processor has its own
memory by converting each non-local memory reference into an appropriate inter-processor
communication. Either implementation of shared memory is generally considered easier to
use than message passing. Physically shared memory can have both high bandwidth and low
latency, but only when multiple processors do not try to access the bus simultaneously;
thus, data layout still can seriously impact performance, and cache effects, etc., can
make it difficult to determine what the best layout is.
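To show how logically shared memory turns non-local references into communication, here
is a small sketch. It assumes a simple block layout in which consecutive elements are
assigned to the same node, and it merely counts the references that would require a
message instead of sending one. The class LogicallySharedArray and its methods are
hypothetical.

```python
class LogicallySharedArray:
    """One array, physically split into per-node blocks."""

    def __init__(self, length, nodes):
        self.block = -(-length // nodes)   # ceiling division: elements per node
        self.local = [
            [0] * max(0, min(self.block, length - node * self.block))
            for node in range(nodes)
        ]
        self.remote_references = 0

    def _locate(self, index):
        """Return (owning node, offset within that node's local memory)."""
        return divmod(index, self.block)

    def read(self, node, index):
        owner, offset = self._locate(index)
        if owner != node:
            self.remote_references += 1    # would be a request/reply message pair
        return self.local[owner][offset]

    def write(self, node, index, value):
        owner, offset = self._locate(index)
        if owner != node:
            self.remote_references += 1    # would be a message to the owner
        self.local[owner][offset] = value


memory = LogicallySharedArray(length=8, nodes=4)
memory.write(node=0, index=1, value=42)   # index 1 lives on node 0: local
value = memory.read(node=3, index=1)      # node 3 must fetch it from node 0
print(value, memory.remote_references)    # 42 1
```

Counting remote references for a given access pattern is one crude way to compare
candidate data layouts.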
Aggregate Functions
In both the message passing and shared memory models, a
communication is initiated by a single processor; in contrast, aggregate function
communication is an inherently parallel communication model in which an entire group of
processors acts together. The simplest such action is a barrier synchronization, in which
each individual processor waits until every processor in the group has arrived at the
barrier. By having each processor output a datum as a side-effect of reaching a barrier,
it is possible to have the communication hardware return to each processor a value that
is an arbitrary function of the values collected from all processors. For example, the
return value might be the answer to the question "did any processor find a
solution?" or it might be the sum of one value from each processor. Latency can be
very low, but bandwidth per processor also tends to be low. Traditionally, this model is
used primarily to control parallel execution rather than to distribute data values.
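The following sketch imitates an aggregate function in software. Threads stand in for
processors, and Python's threading.Barrier provides the barrier synchronization; the
"communication hardware" is replaced by a shared list, which is an assumption of this
example. The function name aggregate is hypothetical.

```python
import threading


def aggregate(values, combine):
    """Each processor contributes one datum and gets back combine(all data)."""
    count = len(values)
    slots = [None] * count
    results = [None] * count
    barrier = threading.Barrier(count)

    def processor(rank):
        slots[rank] = values[rank]       # output a datum on reaching the barrier
        barrier.wait()                   # wait until every processor has arrived
        results[rank] = combine(slots)   # same function of all values, for everyone

    threads = [threading.Thread(target=processor, args=(rank,)) for rank in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# "Did any processor find a solution?"
print(aggregate([False, True, False], any))   # [True, True, True]
# The sum of one value from each processor.
print(aggregate([1, 2, 3, 4], sum))           # [10, 10, 10, 10]
```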
Collective Communication
This is another name for aggregate functions, most often used when
referring to aggregate functions that are constructed using multiple message-passing
operations.
SMP
SMP (Symmetric Multi-Processor) refers to the operating system
concept of a group of processors working together as peers, so that any piece of work
could be done equally well by any processor. Typically, SMP implies the combination of
MIMD and shared memory. In the IA32 world, SMP generally means compliant with MPS (the
Intel MultiProcessor Specification); in the future, it may mean "Slot 2".
SWAR
SWAR (SIMD Within A Register) is a generic term for the concept of
partitioning a register into multiple integer fields and using register-width operations
to perform SIMD-parallel computations across those fields. Given a machine with k-bit
registers, data paths, and function units, it has long been known that ordinary register
operations can function as SIMD parallel operations on as many as n field values of
k/n bits each. Although this type of parallelism can be implemented using ordinary integer
registers and instructions, many high-end microprocessors have recently added specialized
instructions to enhance the performance of this technique for multimedia-oriented tasks.
In addition to the Intel/AMD/Cyrix MMX (MultiMedia eXtensions), there are: Digital Alpha
MAX (MultimediA eXtensions), Hewlett-Packard PA-RISC MAX (Multimedia Acceleration
eXtensions), MIPS MDMX (Digital Media eXtension, pronounced "Mad Max"), and Sun
SPARC V9 VIS (Visual Instruction Set). Aside from the three vendors who have agreed on
MMX, all of these instruction set extensions are roughly comparable, but mutually
incompatible.
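SWAR using only ordinary integer operations can be sketched as follows. The example
assumes a 32-bit register holding four 8-bit unsigned fields and performs four additions
(each modulo 256) with one register-width add. The trick is to add the low 7 bits of each
field first, so that no carry crosses a field boundary, and then to patch in the top bit
of each field with an exclusive-or. The helper names pack, unpack, and swar_add8 are made
up for this example.

```python
HIGH_BITS = 0x80808080   # top bit of each 8-bit field
LOW_BITS = 0x7F7F7F7F    # low 7 bits of each 8-bit field


def pack(fields):
    """Pack four 8-bit values into one 32-bit word, field 0 in the low byte."""
    word = 0
    for position, field in enumerate(fields):
        word |= (field & 0xFF) << (8 * position)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit fields."""
    return [(word >> (8 * position)) & 0xFF for position in range(4)]


def swar_add8(x, y):
    """Add four 8-bit fields at once; each field wraps around modulo 256."""
    low_sum = (x & LOW_BITS) + (y & LOW_BITS)    # carries stop at bit 7 of each field
    return low_sum ^ ((x ^ y) & HIGH_BITS)       # fold in the top bits, dropping carry-out


a = pack([1, 200, 255, 127])
b = pack([2, 100, 1, 1])
print(unpack(swar_add8(a, b)))   # [3, 44, 0, 128]
```

In the second and third fields the sums overflow 8 bits and wrap around, but nothing
leaks into the neighboring field.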
Attached Processors
Attached processors are essentially special-purpose computers that
are connected to a host system to accelerate specific types of computation. For example,
many video and audio cards for PCs contain attached processors designed, respectively, to
accelerate common graphics operations and audio DSP (Digital Signal Processing). There is
also a wide range of attached array processors, so called because they are designed to
accelerate arithmetic operations on arrays. In fact, many commercial supercomputers are
really attached processors with workstation hosts.
RAID
RAID (Redundant Array of Inexpensive Disks) is a simple technology
for increasing both the bandwidth and reliability of disk I/O. Although there are many
different variations, all have two key concepts in common. First, each data block is
striped across a group of n+k disk drives such that each drive only has to read or write
1/n of the data, yielding n times the bandwidth of one drive. Second, redundant data is
written so that data can be recovered if a disk drive fails; this is important because
otherwise, if any one of the n+k drives were to fail, the entire file system could be lost.
A good overview of RAID in general is given at <http://www.dpt.com/uraiddoc.html>,
and information about RAID options for Linux systems is at
<http://linas.org/linux/raid.html>. Aside from specialized RAID hardware support,
Linux also supports software RAID 0, 1, 4, and 5 across multiple disks hosted by a single
Linux system; see the Software RAID mini-HOWTO and the Multi-Disk System Tuning mini-HOWTO
for details. RAID across disk drives on multiple machines in a cluster is not directly
supported.
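Both key concepts can be sketched in a few lines. The example below assumes the simplest
redundant case, k = 1, with a single dedicated parity "drive" computed by exclusive-or,
roughly in the style of RAID 4. Drives are just byte strings here, and the function
names are hypothetical; this is not how the Linux software RAID driver is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise exclusive-or of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data, n):
    """Split data across n drives and append one parity drive (k = 1)."""
    chunk = -(-len(data) // n)                   # ceiling division
    padded = data.ljust(chunk * n, b"\0")        # pad so all chunks are equal
    chunks = [padded[i * chunk:(i + 1) * chunk] for i in range(n)]
    return chunks + [xor_blocks(chunks)]


def recover(drives, failed):
    """Rebuild the contents of one failed drive from all the others."""
    survivors = [drive for i, drive in enumerate(drives) if i != failed]
    return xor_blocks(survivors)


data = b"parallel processing!"
drives = stripe_with_parity(data, n=4)   # 4 data drives + 1 parity drive

print(len(drives[0]))                      # each drive holds 1/n of the data
print(recover(drives, 2) == drives[2])     # True: the lost chunk is rebuilt
```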
IA32
IA32 (Intel Architecture, 32-bit) really has nothing to do with
parallel processing, but rather refers to the class of processors whose instruction sets
are generally compatible with that of the Intel 386. Basically, any Intel x86 processor
after the 286 is compatible with the 32-bit flat memory model that characterizes IA32. AMD
and Cyrix also make a multitude of IA32-compatible processors. Because Linux evolved
primarily on IA32 processors and that is where the commodity market is centered, it is
convenient to use IA32 to distinguish any of these processors from the PowerPC, Alpha,
PA-RISC, MIPS, SPARC, etc. The upcoming IA64 (64-bit with EPIC, Explicitly Parallel
Instruction Computing) will certainly complicate matters, but Merced, the first IA64
processor, is not scheduled for production until late 1999.
COTS
Since the demise of many parallel supercomputer companies, COTS (Commercial
Off-The-Shelf) is commonly discussed as a requirement for parallel computing systems.
For the fanatically pure, the only COTS parallel processing techniques using PCs are
things like SMP Windows NT servers and various MMX Windows applications; it really
doesn't pay to be that fanatical. The underlying concept of COTS is really minimization
of development time and cost. Thus, a more useful, more common, meaning of COTS is that
at least most subsystems benefit from commodity marketing, but other technologies are
used where they are effective. Most often, COTS parallel processing refers to a cluster
in which the nodes are commodity PCs, but the network interface and software are
somewhat customized, typically running Linux and application codes that are freely
available (e.g., copyleft or public domain), but not literally COTS.