The rapid increase in performance of mass market commodity
microprocessors and significant disparity in pricing between PCs and scientific
workstations, has provided an opportunity for substantial gains in performance to cost by
harnessing PC technology in parallel ensembles to provide high-end capability for
scientific and engineering applications. The Beowulf project is a NASA initiative
sponsored by the HPCC program to explore the potential of Pile-of-PCs and to develop the
necessary methodologies to apply these low cost system configurations to NASA
computational requirements in the Earth and Space sciences. Recently, a 16 processor
Beowulf costing less than $50,000 sustained 1.25 Gigaflops on a gravitational N-body
simulation of 10 million particles with a Tree code algorithm using standard commodity
hardware and software components.
This gives us the ability to attain supercomputing power with
off-the-shelf hardware and free software. This eliminates the need for acquiring expensive
supercomputers and at the same addresses the question -- "What do we do with our
unwanted hardware?"
History of the Beowulf Class Cluster
In the summer of 1994 Thomas Sterling and Don Becker, working at
CESDIS under the sponsorship of the ESS project, built a cluster computer consisting of 16
DX4 processors connected by channel-bonded Ethernet. They called their machine Beowulf.
The machine was an instant success and their idea of providing COTS (Commodity Off The
Shelf) base systems to satisfy specific computational requirements quickly spread through
NASA and into the academic and research communities. The development effort for this first
machine quickly grew into a what we now call the Beowulf Project. Some of the major
accomplishment of the Beowulf Project will be chronicled below, but a non-technical
measure of success is the observation that researchers within the High Performance
Computer community are now referring to such machines as "Beowulf Class Cluster
Computers." That is, Beowulf clusters are now recognized as a genre within the HPC
community.
The next few paragraphs will provide a brief history of the Beowulf
Project and a discussion of certain aspects or characteristics of the Project that appear
to be key to its success.
The Center of Excellence in Space Data and Information Sciences
(CESDIS) is a division of the University Space Research Association (USRA) located at the
Goddard Space Flight Center in Greenbelt Maryland. CESDIS is a NASA contractor, supported
in part by the Earth and space sciences (ESS) project. The ESS project is a research
project within the High Performance Computing and Communications (HPCC) program. One of
the goals of the ESS project is to determine the applicability of massively parallel
computers to the problems faced by the Earth and Space sciences community. The first
Beowulf was built to address problems associated with the large data sets that are
involved in ESS applications.
It may well be that this is simply the "right time" in
history for the development of Beowulf class computers. In the last ten years we have
experienced a number of events that have brought together all the pieces for the genesis
of the Beowulf project. The creation of the casual computing market (office automation,
home computing, games and entertainment) now provides system designers with a new source
of cost effective components. The COTS industry now provides fully assembled subsystems
(microprocessors, motherboards, disks and network interface cards). Mass market
competition has driven prices down and reliability up for these subsystems. The
development of publicly available software, in particular, the Linux operating system, the
GNU compilers and programming tools and the MPI and PVM message passing libraries, provide
hardware independent software. Programs like the HPCC program have produced many years of
experience working with parallel algorithms. That experience has taught us that obtaining
high performance, even from vendor provided parallel platforms is hard work and requires
researchers to adopt a do-it-yourself attitude. A second aspect to this history of working
with parallel platforms is an increased reliance on computational science and therefore an
increased need for high performance computing. One could argue that the combination of
these conditions: hardware, software, experience and expectation; provided the
environment that makes the development of Beowulf clusters seem like a natural
evolutionary event.
We are constantly reminded of the performance improvements in
microprocessors, but perhaps more important to the Beowulf project is the recent
cost/performance gains in network technology. The history of MIMD computing shows many
academic groups and commerical vendors that have built multiprocessor machines based on
what was then the state-of-art microprocessor. But these had always required special
"glue" chips or one-of-a-kind interconnection schemes. For the academic
community this lead to interesting research and the exploration of new ideas, which
usually resulted in one-of-a-kind machines. The life cycle of such a machines strongly
correlates to the life cycle of the graduate careers of those working on them. Vendors
usually made choices for special features or interconnection schemes to enhance certain
characteristics of their machine, or to tailor a machine to a perceived market. Exploiting
these enhancements required programmers to adopt a vendor specific programming model. This
often lead to dead ends with respect to software development. The cost effectiveness and
Linux support for high performance networks for PC class machines has enabled the
construction of balanced systems built entirely of COTS technology and has made generic
architectures and programming models practical.
The first Beowulf was built with DX4 processors and 10Mbit/s
Ethernet. At the time the processors were too fast for a single Ethernet and Ethernet
switches were still too expensive. To balance the system Don Becker rewrote his Ethernet
drivers for Linux and built a "channel bonded" Ethernet where the network
traffic was striped across two or more Ethernets. As 10Mbit/s and even 100Mbit/s Ethernet
switches became cost effective, the need for channel bonding has diminished (at
least for now). In late 1997, a good choice for a balanced system was 16 -- 200MHz P6
processors connected by Fast Ethernet and a Fast Ethernet switch. The exact network
configuration of a balanced cluster will continue to change and will remain dependent on
the size of the cluster, the relationship between processor speed and network bandwidth
and the current price of components. An important characteristic of Beowulf clusters is
that these changes (processor type and speed, network technology, cost of components) do
not impact the programming model. Therefore users of these systems can expect to enjoy
more forward compatibility than we have experienced in the past.
Another key component contributing to forward compatibility is the
system software used on Beowulf. With the maturity and robustness of Linux, GNU software
and the "standardization" of message passing via PVM and MPI, programmers now
have a guarantee that the programs they write will run on future Beowulf clusters,
regardless of who makes the processors or the networks. A natural consequence of coupling
the system software with vendor hardware is that the system software must be developed and
refined only slightly ahead of the application software. The historical criticism that
system software for high performance computers is always inadequate is actually unfair to
those developing it. In most cases coupling vendor software and hardware forces the system
software to be perpetually immature. The model used for Beowulf system software can break
this limitation.
The first Beowulf was built to address a specific computational
requirement of the ESS community. It was built by and for researchers with parallel
programming experience. Many of these researchers spent years fighting with MPP vendors
and system administrators over detailed performance information, while struggling with
underdeveloped tools and new programming models. This lead to a "do-it-yourself"
attitude. Another reality they faced was that access to a large machine often meant access
to a tiny fraction of the resources of the machine shared among many users. For these
users, building a cluster that they can completely control and fully utilize results in a
more effective, higher performance, computing platform. An attitude that summarizes this
situation is "Why buy what you can afford to make?" The realization is that
learning to build and run a Beowulf cluster is an investment, whereas learning the
peculiarities of a specific vendor only binds you to that vendor.
These hard core parallel programmers are first and foremost
interested in high performance computing applied to difficult problems. At Supercomputing
'96 both NASA and DOE demonstrated clusters costing less than $50,000 that achieved
greater than gigaflop/s sustained performance. A year later, NASA researchers at Goddard
Space Flight Center combined two clusters for a total of 199, P6 processors and ran a PVM
version of a PPM (Piece-wise Parabolic Method) code at a sustained rate of 10.1 Gflop/s.
In the same week (in fact, on the floor of Supercomputing '97) Caltech's 140 node cluster
ran an N-body problem at a rate of 10.9 Gflop/s. This does not mean that Beowulf clusters
are supercomputers, rather it means one can build a Beowulf that is big enough to attract
the interest of supercomputer users.
Beyond the seasoned parallel programmer, Beowulf clusters have been
built and used by programmers with little or no parallel programming experience. In fact
Beowulf clusters provide universities short on resources, with an excellent platform
to teach parallel programming courses and provide cost effective computing to their
computational scientists as well. The startup cost in a university situation is minimal
for the usual reasons -- most students interested in such a project are likely to be
running Linux on their own computers. Setting up a lab and learning to write parallel
programs is part of the learning experience.
In the taxonomy of parallel computers, Beowulf clusters fall
somewhere between MPP (Massively Parallel Processors, like the nCube, CM5, Convex SPP,
Cray T3D, Cray T3E, etc.) and NOWs (Networks of Workstations). The Beowulf project
benefits from developments in both these architectures. MPPs are typically larger
and have a lower latency interconnect network than a Beowulf cluster. Programmers are
still required to worry about locality, load balancing, granularity, and communication
overheads in order to obtain the best performance. Even on shared memory machines, many
programmers develop their programs in a message passing style. Programs that do not
require fine-grain computation and communication can usually be ported and run effectively
on Beowulf clusters. Programming a NOW is usually an attempt to harvest unused cycles on
an already installed base of workstations in a lab or on a campus. Programming in this
environment requires algorithms that are extremely tolerant of load balancing problems and
large communication latencies. Any program that runs on a NOW will run at least as well on
a cluster.
A Beowulf class cluster computer is distinguished from a Network of
Workstations by several subtle but significant characteristics. First, the nodes in the
cluster are dedicated to the cluster. This helps ease load balancing problems, because the
performance of individual nodes are not subject to external factors. Also, since the
interconnection network is isolated from the external network, the network load is
determined only by the application being run on the cluster. This eases the problems
associated with unpredictable latency in NOWs. All the nodes in the cluster are within the
administrative jurisdiction of the cluster. For example -- the interconnection network for
the cluster is not visible from the outside world so the only authentication needed
between processors is for system integrity. On a NOW, one must be concerned about network
security. Another example is the Beowulf software that provides a global process ID. This
enables a mechanism for a process on one node to send signals to a process on another node
of the system, all within the user domain. This is not allowed on a NOW. Finally,
operating system parameters can be tuned to improve performance. For example, a
workstation should be tuned to provide the best interactive feel (instantaneous responses,
short buffers, etc), but in a cluster the nodes can be tuned to provide better throughput
for coarser-grain jobs because they are not interacting directly with users.
The Beowulf Project grew from the first Beowulf machine, while the
Beowulf community has grown from the NASA project. Like the Linux community, the Beowulf
community is a loosely organized confederation of researchers and developers. Each
organization has its own agenda and its own reasons for developing a particular component
or aspect of the Beowulf system. As a result, Beowulf class cluster computers range from
several node clusters to several hundred node clusters. Some systems have been built by
computational scientists and are used in an operational setting. Others have been built as
test-beds for system research, while others serve as inexpensive platforms to learn about
parallel programming.
Most people in the Beowulf community are independent,
do-it-yourself'ers. Since everyone is doing their own thing, the notion of having a
central control within the Beowulf community just doesn't make sense. The community is
held together by the willingness of its members to share ideas and discuss successes and
failures in their development efforts. The mechanisms that facilitate this interaction are
the Beowulf mailing lists, individual web pages and the occasional meeting or workshop.
The future of the Beowulf project will be determined collectively by
the individual organizations contributing to the Beowulf project and by the future of
mass-market COTS. As microprocessor technology continues to evolve, higher speed networks
become cost effective and as more application developers move to parallel platforms, the
Beowulf project will evolve to fill its niche.
Definition of Parallel Processing
Parallel Processing refers to the concept of speeding-up the
execution of a program by dividing the program into multiple fragments that can execute
simultaneously, each on its own processor. A program being executed across n processors
might execute n times faster than it would using a single processor.
Parallel computer examples might include:
Clusters of workstations communicating on a local or wide area network.
A multiprocessor computer containing multiple processors in a single computer using Symmetric Multiprocessing (SMP).
Utilization of special processor instruction sets such us the MMX found in many modern processors.
On a finer grain, any modern RISC processor which internally contains multiple functional units that work in parallel. In this case, from the outside the RISC processor "simulates" a single serial execution stream, but is internally a complex parallel system.
Definition of Clustering
Connecting two or more computers together in such a way that they
behave like a single computer. Clustering is used for parallel processing, for load
balancing and for fault tolerance. Clustering is a popular strategy for implementing
parallel processing applications because it enables companies to leverage the investment
already made in PCs and workstations. In addition, it's relatively easy to add new CPUs
simply by adding a new PC to the network.
Definition of Beowulf
Famed was this Beowulf: far flew the boast of him, son of Scyld,
in the Scandian lands. So becomes it a youth to quit him well with his father's friends,
by fee and gift, that to aid him, aged, in after days, come warriors willing, should war
draw nigh, liegemen loyal: by lauded deeds shall an earl have honor in every clan. Beowulf
is the earliest surviving epic poem written in English. It is a story about a hero of
great strength and courage who defeated a monster called Grendel.
There are probably as many Beowulf definitions as there are people
who build or use Beowulf Supercomputer facilities. Some claim that one can call their
system Beowulf only if it is built in the same way as the NASA's original machine. Others
go to the other extreme and call Beowulf any system of workstations running parallel code.
Here are some definitions of Beowulf based on the postings on the Beowulf Mailing List:
Beowulf is a multi computer architecture which can be used for
parallel computations. It is a system which usually consists of one server node, and one
or more client nodes connected together via Ethernet or some other network. It is a system
built using commodity hardware components, like any PC capable of running Linux, standard
Ethernet adapters, and switches. It does not contain any custom hardware components and is
trivially reproducible. Beowulf also uses commodity software like the Linux operating
system, Parallel Virtual Machine (PVM) and/or Message Passing Interface (MPI). The server
node controls the whole cluster and serves files to the client nodes. It is also the
cluster's console and gateway to the outside world. Large Beowulf machines might have more
than one server node, and possibly other nodes dedicated to particular tasks, for example
consoles or monitoring stations. In most cases client nodes in a Beowulf system are dumb,
the dumber the better. Nodes are configured and controlled by the server node, and do only
what they are told to do. In a disk-less client configuration, client nodes don't even
know their IP address or name until the server tells them what it is. One of the main
differences between Beowulf and a Cluster of Workstations (COW) is the fact that Beowulf
behaves more like a single machine rather than many workstations. In most cases client
nodes do not have keyboards or monitors, and are accessed only via remote login or
possibly serial terminal. Beowulf nodes can be thought of as a CPU + memory package which
can be plugged into the cluster, just like a CPU or memory module can be plugged into a
motherboard.
Beowulf is not a special software package, new network topology or
the latest kernel hack. Beowulf is a technology of clustering Linux computers to form a
parallel, virtual supercomputer. Although there are many software packages such as kernel
modifications, PVM and/or MPI libraries, and configuration tools which make the Beowulf
architecture faster, easier to configure, and much more usable, one can build a Beowulf
class machine using standard Linux distributions without any additional software. If you
have two networked Linux computers which share at least the /home file system via NFS, and
trust each other to execute remote shells (rsh), then it could be argued that you have a
simple, two node Beowulf machine.
Classes of Beowulf Systems
CLASS I BEOWULF:
This class of machines built entirely from commodity
"off-the-shelf" parts. We shall use the "Computer Shopper"
certification test to define commodity "off-the-shelf" parts. (Computer Shopper
is a 1 inch thick monthly magazine/catalog of PC systems and components.) The test is as
follows:
A CLASS I Beowulf is a machine that can be assembled from parts
found in at least 3 nationally/globally circulated advertising catalogs.
The advantages of a CLASS I system are:
- hardware is available form multiple sources (low prices, easy maintenance)
- no reliance on a single hardware vendor
- driver support from the Linux community
- usually based on standards (SCSI, Ethernet, etc.)
The disadvantages of a CLASS I system are:
- best performance may require CLASS II hardware
CLASS II BEOWULF
A CLASS II Beowulf is simply any machine that does not pass the
Computer Shopper certification test. This is not a bad thing. Indeed, it is merely a
classification of the machine.
The advantages of a CLASS II system are:
- Performance can be quite good! Maybe even better than a CLASS I Beowulf system.
The disadvantages of a CLASS II system are:
- driver support may vary
- reliance on a single hardware vendor for some components
- may be more expensive than CLASS I systems.
One CLASS is not necessarily better than the other. It all depends
on your needs and budget. This classification system is only intended to make discussions
about Beowulf systems a bit more succinct.
Beowulf Versus a
Cluster of Workstations
Clustering idle computer laboratories in a University is a perfect
example of a Cluster of Workstations (COW). What then is so special about Beowulf, and how
is it different from a COW? The truth is that there is not much difference, but Beowulf
does have a few unique characteristics.
Client nodes in a Beowulf cluster do not have keyboards, mice, video cards nor monitors. All access to the client nodes is done via remote connections from the server node, dedicated console node, or a serial console.
The client nodes use private IP addresses like the 10.0.0.0/8 or 192.168.0.0/16 address ranges (RFC 1918 http://www.alternic.net/rfcs/1900/rfc1918.txt.html) and do not have their own connection to the Internet or the institution's Wide Area Network(WAN).
On the server node, users can edit and compile their code, and also spawn jobs on all nodes in the cluster.
Beowulf has also more single system image features which help the users to see the Beowulf cluster as a single computing workstation.
In most cases COWs are used for parallel computations at night and
over weekends when people do not actually use the workstations for every day work, thus
utilizing idle CPU cycles. Beowulf on the other hand is a machine usually dedicated to
parallel computing, and optimized for this purpose. Beowulf also gives a better
price/performance ratio as it is built from off-the-shelf components and mainly runs free
software.
In order to make clustering of workstations more manageable and
easier to setup, there have been some moves by the industry to standardize on different
libraries to supporting clustering of workstations. The most notable of these are:
PVM (Parallel Virtual Machine)
PVM is a freely-available, portable, message-passing library
generally implemented on top of sockets. It is clearly established as the de-facto
standard for message-passing cluster parallel computing.
MPI (Message Passing Interface)
Although PVM is the de-facto standard message-passing library, MPI
(Message Passing Interface) is the relatively new official standard. The home page for the
MPI standard is <http://www.mcs.anl.gov:80/mpi/> and the newsgroup is
comp.parallel.mpi.
AFAPI (Aggregate Function Application Program Interface)
Unlike PVM and MPI, the AFAPI (Aggregate Function API) did not start
life as an attempt to build a portable abstract interface layered on top of existing
network hardware and software. Rather, AFAPI began as the very hardware-specific low-level
support library for PAPERS (Purdue's Adapter for Parallel Execution and Rapid
Synchronization; see http://garage.ecn.purdue.edu/~papers/).
For purposes of this demonstration, I personally chose PVM as the
clustering library of choice. The reasons are:
PVM is still the de facto standard and therefore more software is written for it.
PVM has better support for a heterogeneous network of workstations.
PVM has a better control execution environment. PVM is standard across the board and has only one implementation unlike MPI.