Introduction

The rapid increase in performance of mass market commodity microprocessors and significant disparity in pricing between PCs and scientific workstations, has provided an opportunity for substantial gains in performance to cost by harnessing PC technology in parallel ensembles to provide high-end capability for scientific and engineering applications. The Beowulf project is a NASA initiative sponsored by the HPCC program to explore the potential of Pile-of-PCs and to develop the necessary methodologies to apply these low cost system configurations to NASA computational requirements in the Earth and Space sciences. Recently, a 16 processor Beowulf costing less than $50,000 sustained 1.25 Gigaflops on a gravitational N-body simulation of 10 million particles with a Tree code algorithm using standard commodity hardware and software components.

This gives us the ability to attain supercomputing power with off-the-shelf hardware and free software. This eliminates the need for acquiring expensive supercomputers and at the same addresses the question -- "What do we do with our unwanted hardware?"

back

History of the Beowulf Class Cluster

In the summer of 1994 Thomas Sterling and Don Becker, working at CESDIS under the sponsorship of the ESS project, built a cluster computer consisting of 16 DX4 processors connected by channel-bonded Ethernet. They called their machine Beowulf. The machine was an instant success and their idea of providing COTS (Commodity Off The Shelf) base systems to satisfy specific computational requirements quickly spread through NASA and into the academic and research communities. The development effort for this first machine quickly grew into a what we now call the Beowulf Project. Some of the major accomplishment of the Beowulf Project will be chronicled below, but a non-technical measure of success is the observation that researchers within the High Performance Computer community are now referring to such machines as "Beowulf Class Cluster Computers." That is, Beowulf clusters are now recognized as a genre within the HPC community.

The next few paragraphs will provide a brief history of the Beowulf Project and a discussion of certain aspects or characteristics of the Project that appear to be key to its success.

The Center of Excellence in Space Data and Information Sciences (CESDIS) is a division of the University Space Research Association (USRA) located at the Goddard Space Flight Center in Greenbelt Maryland. CESDIS is a NASA contractor, supported in part by the Earth and space sciences (ESS) project. The ESS project is a research project within the High Performance Computing and Communications (HPCC) program. One of the goals of the ESS project is to determine the applicability of massively parallel computers to the problems faced by the Earth and Space sciences community. The first Beowulf was built to address problems associated with the large data sets that are involved in ESS applications.

It may well be that this is simply the "right time" in history for the development of Beowulf class computers. In the last ten years we have experienced a number of events that have brought together all the pieces for the genesis of the Beowulf project. The creation of the casual computing market (office automation, home computing, games and entertainment) now provides system designers with a new source of cost effective components. The COTS industry now provides fully assembled subsystems (microprocessors, motherboards, disks and network interface cards). Mass market competition has driven prices down and reliability up for these subsystems. The development of publicly available software, in particular, the Linux operating system, the GNU compilers and programming tools and the MPI and PVM message passing libraries, provide hardware independent software. Programs like the HPCC program have produced many years of experience working with parallel algorithms. That experience has taught us that obtaining high performance, even from vendor provided parallel platforms is hard work and requires researchers to adopt a do-it-yourself attitude. A second aspect to this history of working with parallel platforms is an increased reliance on computational science and therefore an increased need for high performance computing. One could argue that the combination of these  conditions: hardware, software, experience and expectation; provided the environment that makes the development of Beowulf clusters seem like a natural evolutionary event.

We are constantly reminded of the performance improvements in microprocessors, but perhaps more important to the Beowulf project is the recent cost/performance gains in network technology. The history of MIMD computing shows many academic groups and commerical vendors that have built multiprocessor machines based on what was then the state-of-art microprocessor. But these had always required special "glue" chips or one-of-a-kind interconnection schemes. For the academic community this lead to interesting research and the exploration of new ideas, which usually resulted in one-of-a-kind machines. The life cycle of such a machines strongly correlates to the life cycle of the graduate careers of those working on them. Vendors usually made choices for special features or interconnection schemes to enhance certain characteristics of their machine, or to tailor a machine to a perceived market. Exploiting these enhancements required programmers to adopt a vendor specific programming model. This often lead to dead ends with respect to software development. The cost effectiveness and Linux support for high performance networks for PC class machines has enabled the construction of balanced systems built entirely of COTS technology and has made generic architectures and programming models practical.

The first Beowulf was built with DX4 processors and 10Mbit/s Ethernet. At the time the processors were too fast for a single Ethernet and Ethernet switches were still too expensive. To balance the system Don Becker rewrote his Ethernet drivers for Linux and built a "channel bonded" Ethernet where the network traffic was striped across two or more Ethernets. As 10Mbit/s and even 100Mbit/s Ethernet switches  became cost effective, the need for channel bonding has diminished (at least for now). In late 1997, a good choice for a balanced system was 16 -- 200MHz P6 processors connected by Fast Ethernet and a Fast Ethernet switch. The exact network configuration of a balanced cluster will continue to change and will remain dependent on the size of the cluster, the relationship between processor speed and network bandwidth and the current price of components. An important characteristic of Beowulf clusters is that these changes (processor type and speed, network technology, cost of components) do not impact the programming model. Therefore users of these systems can expect to enjoy more forward compatibility than we have experienced in the past.

Another key component contributing to forward compatibility is the system software used on Beowulf. With the maturity and robustness of Linux, GNU software and the "standardization" of message passing via PVM and MPI, programmers now have a guarantee that the programs they write will run on future Beowulf clusters, regardless of who makes the processors or the networks. A natural consequence of coupling the system software with vendor hardware is that the system software must be developed and refined only slightly ahead of the application software. The historical criticism that system software for high performance computers is always inadequate is actually unfair to those developing it. In most cases coupling vendor software and hardware forces the system software to be perpetually immature. The model used for Beowulf system software can break this limitation.

The first Beowulf was built to address a specific computational requirement of the ESS community. It was built by and for researchers with parallel programming experience. Many of these researchers spent years fighting with MPP vendors and system administrators over detailed performance information, while struggling with underdeveloped tools and new programming models. This lead to a "do-it-yourself" attitude. Another reality they faced was that access to a large machine often meant access to a tiny fraction of the resources of the machine shared among many users. For these users, building a cluster that they can completely control and fully utilize results in a more effective, higher performance, computing platform. An attitude that summarizes this situation is "Why buy what you can afford to make?" The realization is that learning to build and run a Beowulf cluster is an investment, whereas learning the peculiarities of a specific vendor only binds you to that vendor.

These hard core parallel programmers are first and foremost interested in high performance computing applied to difficult problems. At Supercomputing '96 both NASA and DOE demonstrated clusters costing less than $50,000 that achieved greater than gigaflop/s sustained performance. A year later, NASA researchers at Goddard Space Flight Center combined two clusters for a total of 199, P6 processors and ran a PVM version of a PPM (Piece-wise Parabolic Method) code at a sustained rate of 10.1 Gflop/s. In the same week (in fact, on the floor of Supercomputing '97) Caltech's 140 node cluster ran an N-body problem at a rate of 10.9 Gflop/s. This does not mean that Beowulf clusters are supercomputers, rather it means one can build a Beowulf that is big enough to attract the interest of supercomputer users.

Beyond the seasoned parallel programmer, Beowulf clusters have been built and used by programmers with little or no parallel programming experience. In fact Beowulf clusters provide universities short on  resources, with an excellent platform to teach parallel programming courses and provide cost effective computing to their computational scientists as well. The startup cost in a university situation is minimal for the usual reasons -- most students interested in such a project are likely to be running Linux on their own computers. Setting up a lab and learning to write parallel programs is part of the learning experience.

In the taxonomy of parallel computers, Beowulf clusters fall somewhere between MPP (Massively Parallel Processors, like the nCube, CM5, Convex SPP, Cray T3D, Cray T3E, etc.) and NOWs (Networks of Workstations). The Beowulf project benefits from developments in both these architectures.  MPPs are typically larger and have a lower latency interconnect network than a Beowulf cluster. Programmers are still required to worry about locality, load balancing, granularity, and communication overheads in order to obtain the best performance. Even on shared memory machines, many programmers develop their programs in a message passing style. Programs that do not require fine-grain computation and communication can usually be ported and run effectively on Beowulf clusters. Programming a NOW is usually an attempt to harvest unused cycles on an already installed base of workstations in a lab or on a campus. Programming in this environment requires algorithms that are extremely tolerant of load balancing problems and large communication latencies. Any program that runs on a NOW will run at least as well on a cluster.

A Beowulf class cluster computer is distinguished from a Network of Workstations by several subtle but significant characteristics. First, the nodes in the cluster are dedicated to the cluster. This helps ease load balancing problems, because the performance of individual nodes are not subject to external factors. Also, since the interconnection network is isolated from the external network, the network load is determined only by the application being run on the cluster. This eases the problems associated with unpredictable latency in NOWs. All the nodes in the cluster are within the administrative jurisdiction of the cluster. For example -- the interconnection network for the cluster is not visible from the outside world so the only authentication needed between processors is for system integrity. On a NOW, one must be concerned about network security. Another example is the Beowulf software that provides a global process ID. This enables a mechanism for a process on one node to send signals to a process on another node of the system, all within the user domain. This is not allowed on a NOW. Finally, operating system parameters can be tuned to improve performance. For example, a workstation should be tuned to provide the best interactive feel (instantaneous responses, short buffers, etc), but in a cluster the nodes can be tuned to provide better throughput for coarser-grain jobs because they are not interacting directly with users.

The Beowulf Project grew from the first Beowulf machine, while the Beowulf community has grown from the NASA project. Like the Linux community, the Beowulf community is a loosely organized confederation of researchers and developers. Each organization has its own agenda and its own reasons for developing a particular component or aspect of the Beowulf system. As a result, Beowulf class cluster computers range from several node clusters to several hundred node clusters. Some systems have been built by computational scientists and are used in an operational setting. Others have been built as test-beds for system research, while others serve as inexpensive platforms to learn about parallel programming.

Most people in the Beowulf community are independent, do-it-yourself'ers. Since everyone is doing their own thing, the notion of having a central control within the Beowulf community just doesn't make sense. The community is held together by the willingness of its members to share ideas and discuss successes and failures in their development efforts. The mechanisms that facilitate this interaction are the Beowulf mailing lists, individual web pages and the occasional meeting or workshop.

The future of the Beowulf project will be determined collectively by the individual organizations contributing to the Beowulf project and by the future of mass-market COTS. As microprocessor technology continues to evolve, higher speed networks become cost effective and as more application developers move to parallel platforms, the Beowulf project will evolve to fill its niche.

back

Definition of Parallel Processing

Parallel Processing refers to the concept of speeding-up the execution of a program by dividing the program into multiple fragments that can execute simultaneously, each on its own processor. A program being executed across n processors might execute n times faster than it would using a single processor.

Parallel computer examples might include:

Clusters of workstations communicating on a local or wide area network.

A multiprocessor computer containing multiple processors in a single computer using Symmetric Multiprocessing (SMP).

Utilization of special processor instruction sets such us the MMX found in many modern processors.

On a finer grain, any modern RISC processor which internally contains  multiple functional units that work in parallel. In this case, from the outside the RISC processor "simulates" a single serial execution stream, but is internally a complex parallel system.

 

back

 

 

Definition of Clustering

Connecting two or more computers together in such a way that they behave like a single computer. Clustering is used for parallel processing, for load balancing and for fault tolerance. Clustering is a popular strategy for implementing parallel processing applications because it enables companies to leverage the investment already made in PCs and workstations. In addition, it's relatively easy to add new CPUs simply by adding a new PC to the network.

back

 

 

 

Definition of Beowulf

Famed was this Beowulf: far flew the boast of him, son of Scyld, in the Scandian lands. So becomes it a youth to quit him well with his father's friends, by fee and gift, that to aid him, aged, in after days, come warriors willing, should war draw nigh, liegemen loyal: by lauded deeds shall an earl have honor in every clan. Beowulf is the earliest surviving epic poem written in English. It is a story about a hero of great strength and courage who defeated a monster called Grendel.

There are probably as many Beowulf definitions as there are people who build or use Beowulf Supercomputer facilities. Some claim that one can call their system Beowulf only if it is built in the same way as the NASA's original machine. Others go to the other extreme and call Beowulf any system of workstations running parallel code. Here are some definitions of Beowulf based on the postings on the Beowulf Mailing List:

Beowulf is a multi computer architecture which can be used for parallel computations. It is a system which usually consists of one server node, and one or more client nodes connected together via Ethernet or some other network. It is a system built using commodity hardware components, like any PC capable of running Linux, standard Ethernet adapters, and switches. It does not contain any custom hardware components and is trivially reproducible. Beowulf also uses commodity software like the Linux operating system, Parallel Virtual Machine (PVM) and/or Message Passing Interface (MPI). The server node controls the whole cluster and serves files to the client nodes. It is also the cluster's console and gateway to the outside world. Large Beowulf machines might have more than one server node, and possibly other nodes dedicated to particular tasks, for example consoles or monitoring stations. In most cases client nodes in a Beowulf system are dumb, the dumber the better. Nodes are configured and controlled by the server node, and do only what they are told to do. In a disk-less client configuration, client nodes don't even know their IP address or name until the server tells them what it is. One of the main differences between Beowulf and a Cluster of Workstations (COW) is the fact that Beowulf behaves more like a single machine rather than many workstations. In most cases client nodes do not have keyboards or monitors, and are accessed only via remote login or possibly serial terminal. Beowulf nodes can be thought of as a CPU + memory package which can be plugged into the cluster, just like a CPU or memory module can be plugged into a motherboard.

Beowulf is not a special software package, new network topology or the latest kernel hack. Beowulf is a technology of clustering Linux computers to form a parallel, virtual supercomputer. Although there are many software packages such as kernel modifications, PVM and/or MPI libraries, and configuration tools which make the Beowulf architecture faster, easier to configure, and much more usable, one can build a Beowulf class machine using standard Linux distributions without any additional software. If you have two networked Linux computers which share at least the /home file system via NFS, and trust each other to execute remote shells (rsh), then it could be argued that you have a simple, two node Beowulf machine.

back

 

 

Classes of Beowulf Systems

CLASS I BEOWULF:

This class of machines built entirely from commodity "off-the-shelf" parts. We shall use the "Computer Shopper" certification test to define commodity "off-the-shelf" parts. (Computer Shopper is a 1 inch thick monthly magazine/catalog of PC systems and components.) The test is as follows:

A CLASS I Beowulf is a machine that can be assembled from parts found in at least 3 nationally/globally circulated advertising catalogs.

The advantages of a CLASS I system are:

  • hardware is available form multiple sources (low prices, easy maintenance)
  • no reliance on a single hardware vendor
  • driver support from the Linux community
  • usually based on standards (SCSI, Ethernet, etc.)

The disadvantages of a CLASS I system are:

  • best performance may require CLASS II hardware

CLASS II BEOWULF

A CLASS II Beowulf is simply any machine that does not pass the Computer Shopper certification test. This is not a bad thing. Indeed, it is merely a classification of the machine.

The advantages of a CLASS II system are:

  • Performance can be quite good! Maybe even better than a CLASS I Beowulf system.

The disadvantages of a CLASS II system are:

  • driver support may vary
  • reliance on a single hardware vendor for some components
  • may be more expensive than CLASS I systems.

One CLASS is not necessarily better than the other. It all depends on your needs and budget. This classification system is only intended to make discussions about Beowulf systems a bit more succinct.

 

Beowulf Versus a Cluster of Workstations

Clustering idle computer laboratories in a University is a perfect example of a Cluster of Workstations (COW). What then is so special about Beowulf, and how is it different from a COW? The truth is that there is not much difference, but Beowulf does have a few unique characteristics.

Client nodes in a Beowulf cluster do not have keyboards, mice, video cards nor monitors. All access to the client nodes is done via remote connections from the server node, dedicated console node, or a serial console.

The client nodes use private IP addresses like the 10.0.0.0/8 or 192.168.0.0/16 address ranges (RFC 1918 http://www.alternic.net/rfcs/1900/rfc1918.txt.html) and do not have their own connection to the Internet or the institution's Wide Area Network(WAN).

On the server node, users can edit and compile their code, and also spawn jobs on all nodes in the cluster.

Beowulf has also more single system image features which help the users to see the Beowulf cluster as a single computing workstation.

In most cases COWs are used for parallel computations at night and over weekends when people do not actually use the workstations for every day work, thus utilizing idle CPU cycles. Beowulf on the other hand is a machine usually dedicated to parallel computing, and optimized for this purpose. Beowulf also gives a better price/performance ratio as it is built from off-the-shelf components and mainly runs free software.

back

 

Cluster Support Libraries

In order to make clustering of workstations more manageable and easier to setup, there have been some moves by the industry to standardize on different libraries to supporting clustering of workstations. The most notable of these are:

PVM (Parallel Virtual Machine)

PVM is a freely-available, portable, message-passing library generally implemented on top of sockets. It is clearly established as the de-facto standard for message-passing cluster parallel computing.

MPI (Message Passing Interface)

Although PVM is the de-facto standard message-passing library, MPI (Message Passing Interface) is the relatively new official standard. The home page for the MPI standard is <http://www.mcs.anl.gov:80/mpi/> and the newsgroup is comp.parallel.mpi.

AFAPI (Aggregate Function Application Program Interface)

Unlike PVM and MPI, the AFAPI (Aggregate Function API) did not start life as an attempt to build a portable abstract interface layered on top of existing network hardware and software. Rather, AFAPI began as the very hardware-specific low-level support library for PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization; see http://garage.ecn.purdue.edu/~papers/).


For purposes of this demonstration, I personally chose PVM as the clustering library of choice. The reasons are:

PVM is still the de facto standard and therefore more software is written for it.

PVM has better support for a heterogeneous network of workstations.

PVM has a better control execution environment. PVM is standard across the board and has only one implementation unlike MPI.

 

back