|
|
|
|
Clustering Tim
Parker and Nancy Breaux The term supercomputer has been in widespread use for
many years. Most people think of
supercomputers as a large, expensive, fast number-cruncher designed for complex
calculation. Supercomputers belong at NASA, meteorological stations,
universities and research organizations. The
ability to have supercomputer power in your own office or home was unthinkable a
few years ago, and when dealing with single PCs the performance is getting close
to but not quite good enough to qualify as a supercomputer.
There are a few ways you can have supercomputing capabilities using
simple, inexpensive personal computers, though.
All hinge around the concept of clustering. Clustering
computers to share their computational power for larger problems is not a new
idea. Clusters have been around
since the Digital Equipment VAX days, and large UNIX workstation manufacturers
like IBM and Sun have offered clustering software capabilities. The change in
focus towards inexpensive personal computers really started at NASA, where a
number of old 80486-based computers were networked together into a cluster.
Since then, the focus on clustering of personal computers has lead to many new
developments, as well as a new category of supercomputer: shared computation
power. With terms like Beowulf
cluster entering our vocabulary, it’s time to take a look at what clustering
is, what it does, what you need to create and manage a cluster, and how Linux
can fit into the picture. Let’s
start with a few definitions. The
word cluster actually refers to many different technologies and techniques, all
intended to couple several machines together to provide either more
computational resources than single machines can offer or to offer high
availability servers (called HA clustering).
HA clustering is used for mission-critical and never-down servers such as
Web servers and large corporate application servers.
By having a number of machines on a network that can perform the same
tasks, failure in one machine simply allows another to assume the failed
machine’s tasks. The creation of these redundant clusters is simple in Linux
and UNIX, requiring only monitoring routines to handle the fail-over tasks as
quickly as possible. One other benefit of HA clusters is shared load handling:
if you have five servers all ready to handle requests, they will be able to
handle a much higher load than a single server. Clusters
are often developed specifically for their computation power.
Instead of spending millions of dollars on a supercomputer, companies and
organizations could spend a few thousands on personal computers and use their
shared, aggregate CPU powers. The original cluster at NASA arose for this
reason. The NASA engineers needed
around 1Gflops (one billion operations per second) for their project but
couldn’t afford a supercomputer. Instead,
they combined 16 66MHz 80486 systems into a cluster, writing software to allow
the machines to cooperatively crunch the numbers. In 1994, this was the first
Beowulf cluster (called Wiglaf, after Beowulf’s aide). The total cost of
Wiglaf was well under $50,000, much less than the cost of a supercomputer to
provide the same raw crunching power. The
high-performance cluster setups also have other uses than raw CPU power.
The issue of bulk disk servers arises, since many low-power CPUs can be
used for large files and filesystems. Fast
file servers require fast CPUs to handle file I/O: spreading the loads across
many machines reduces the amount of CPU required for each disk. Even better,
with parallel Linux or UNIX clusters, a filesystem can be constructed that is as
fast or faster than very expensive disk-based fileservers at a
fraction of the cost. Bulk disk servers are ideal for database applications.
All
clusters are not the same. There are several ways to implement clusters, and to
some extent the method of implementing depends on the machines being used in the
cluster and the purpose of the cluster. There
are two technologies that compete in the cluster world: MPI (Message Passing
Interface) and PVM (Parallel Virtual Machine). MPI is a published standard and
is gaining rapid support from the clustering community including vendors and
military applications. PVM is an
implementation developed at one university and is not a standard. Clusters can
be implemented without either MPI or PVM. However,
one or the other message passing protocol is usually used for parallel
programming. PVM is the de facto standard for clusters. PVM
was designed to support heterogenous environments (in other words, many
different kinds of machines making up a cluster) while MPI was not designed for
clusters at all. MPI was designed
for programming massively parallel processor (MPP) machines such as the Cray
T3D. MPI is suitable for any parallel computer that uses message passing to
communicate between processors. The
goal was to write portable libraries for number crunching, so that what you
wrote (often in some flavor of FORTRAN) for your Cray could be ported to your
IBM MPP. Clusters were an
afterthought for MPI. Hence, the
only clusters that are easily supported by MPI are those consisting of machines
all of the same type. Nevertheless,
if you want the application code you wrote for your Cray supercomputer to run on
a bunch of Linux boxes, MPI is probably the first choice.
If you are starting developing an application from
scratch and want a little more flexibility in load balancing, machine types, and
fault tolerance, PVM is probably your best bet (as long as
you don't mind writing your code in C. Although MPI does allow for
heterogeneous networks, there are some limitations to MPI.
Most notably, MPI does not allow for dynamic sizing of the cluster. While
suitable for identical machines coupled by a high-speed network, it has no
built-in way of handling faults. If
new machines are added to the cluster while it is operational, they are not
used. PVM, on the other hand, does
allow dynamic sizing of the cluster, allowing machines to be added and dropped
at will, adapting the clustering software to handle the changes easily. You
will often hear the term Beowulf cluster. A
Beowulf cluster is simply a cluster of Linux-based machines although the term is
sometimes used for the software component as well. In general, though, a Beowulf
cluster is a set of PCs networked together in a dedicated subnet (usually
100Mbps Ethernet), with one machine designated the “master” and the others
are “headless” (no monitor, keyboard, etc).
The master is responsible for managing the cluster and for communications
between the cluster and the outside world. A Beowulf cluster can be either MPI
or PVM. The term Beowulf cluster refers both to the hardware and software parts.
The software part consists of the system software made available by www.beowulf.org. Among the software provided are packages that allows the
process ID space to span multiple nodes, special Ethernet drivers, a package to
allow Ethernet channel bonding, and other tools.
Either PVM or MPI can be used to write programs to run on the cluster,
but when referring to the “Beowulf cluster software” you are specifically
referring to the system software and extensions to the Linux kernel made
available by Scyld computing corporation. In
addition there are commercial variants of what is available as software under
GNU licenses. It is important that
the Beowulf cluster subnet be high speed. This
usually means 100Mbps Ethernet switch, ATM, or similar system that can keep up
with high bandwidth demands. Channel
bonding was a solution to the problem when only 10Mbps Ethernet was available
and switches were prohibitively expensive. A
neat use of cluster has to do with managing very large data sets. If the data
set to be crunched is sufficiently large that it cannot reside in a single
machine’s RAM and if the computation can be spread across multiple CPUs so a
subset of the total data can fit in a single machine’s RAM, then
“superlinear speedup” is possible. Superlinear speedup occurs then several
processors can perform faster than a single processor machine would be capable
of, given the effect of memory and disk swapping.
In other words, if you have 8 processors in a cluster, the performance of
the cluster would be higher than 8 times faster than that of a single processor
due to the software overhead involved in memory swapping. Beowulf clusters are
likely targets for superlinear speedup when using large data sets such as image
analysis and matrix calculations. Clustering
applications There
are three different categories of clustering application. The first, a simplest,
is called “embarrassingly parallel problems”. These are computational
problems where the parallelism is easy to exploit using SIMD (single instruction
multiple datastream). In other words, these are problems where calculations must
be repeatedly performed on a large data set or on multiple data sets, and the
amount of data is more than can be reasonably crunched on a single CPU. By
joining many machines into a cluster, the amount of data to be crunched can be
shared over all the machines and each has its own software to control the number
crunching. With embarrassingly
parallel problems there is no need for machines to talk to each other except to
download raw data and transmit results. Embarrassingly parallel problems are
quite simple to code because there is no need for CPU to CPU conversations, no
cross thread or process communications, and no need for repeated updates of data
status. One of the most widely-used examples of embarrassingly parallel problems
is distributed.net, where computers download a
chunk of encrypted data and analyze it for key to the encryption, receiving raw
data in chunks and sending results back to an Internet-based server. More
complex are tasks that require some communications between the machines that
make up a cluster. These are not embarrassingly parallel problems because of the
need for interprocess communications. Sharing
these complex calculations among multiple machines means that the software to
control the cluster is far more complicated to code and often requires complex
algorithms. Linear algrebra
routines involving large matrices are typical of such applications.
Some of these matrix
calculations have scalable algorithms associated with them, but most
do not. In fact, this continues to
be an active area of research, working towards the discovery of algorithms that
can work well in a cluster environment. The
term “matrix calculations” is often applied to these problems because matrix
manipulation (such as matrix inversions) are typical tasks. Matrix calculations
tend to be scalable and portable, but the overhead in coding and maintaining the
cluster components can be significant. The
third type of clustering application is actually a modification of
embarrassingly parallel problems. It
is called “grid” computing, because it uses the Internet as a computational
network. Grid computing is designed
to allow many individual machines to download a client and data set from a
server, crunch the data, and transfer the results back to the server, all over
the Internet. The seti@home
project is an example of grid computing, as are the applications offered by
distributed.net. Grid computing extends the concept of using background CPU
power to the entire Internet and its millions of machines. Setting
up a Linux cluster In
order to set up a Beowulf cluster of your own you need two things: Linux
machines on a network together, and software to handle the clustering.
Typically, a Linux cluster will have a single master and a number of headless
machines, but all can be full systems with no performance penalty.
A fast network (preferably 100Mbps Ethernet) with a dedicated switch is
ideal for maximum performance. The disks on the Linux machines are often set up
very similar to each other (often cloned from a single source) and RPC (Remote
Procedure Call) is used for intermachine communications.
Finally, the filesystems are often set up using NFS to fast remote
access. One
problem with this approach to clustering is the security issue.
RPC and NFS are both vulnerable to hacking, so a firewall or closed
network loop is ideal to prevent intrusion and tampering. A closed network is a
necessity with a Beowulf cluster meant for number crunching.
The reason is that outside network traffic will interfere with the
interprocess communication. The
increase in variance of the communication time would make writing fast,
efficient number crunching programs almost impossible.
A developer needs a reasonable estimate of the mean and variance of the
interprocess communication time in order to write effect parallel code to rival
the performance of traditional multiprocessor mainframes. The software component you can use on your Linux cluster depends to some extent on your application, and also on what is available to you. If you want to set up a high-availability cluster, your software will be different than a number-crunching cluster. HA Clusters For
a high-availability cluster, you need software that handles failure of a node on
the network, and also provides some load-balancing feature so your machines are
used optimally. MOSIX (www.mosix.cs.huji.ac.il)
is one of the widest used Linux HA clustering applications and it handles
load-balancing across the cluster. MOSIX requires no special modification to the
Linux operating system or the system layout, which makes it easy to install and
remove later. Although not
completely transparent to users, MOSIX is non-intrusive and works well. The
Piranha (available from www.redhat.com and
other Linux sites) software package has been popular lately, too.
Piranha is intended for HA clustering or for load-balancing, but it
doesn’t do a good job of handling both tasks at once.
Piranha’s strength is its ease of setup and use, and it is tightly
integrated into several current Linux releases including RedHat 6.2. The
Linux Virtual Server project (www.linuxvirtualserver.org)
provides a set of kernel modifications and drivers for setting up load-balanced
clusters using Linux systems. The
Virtual Server approach is to allow scalable, reliable clusters to be
constructed using a variety of Linux platforms.
So far, the Virtual Server system has been plagued by some problems, but
these are quickly being ironed out and the Linux Virtual Server idea holds great
promise. The
final HA clustering software package worth noting comes from TurboLinux, with
their TurboCluster EnFuzion package. TurboLinux
is a well-known vendor of optimized Linux operating systems for workstation use,
and TurboCluster is a superset of the TurboLinux release. TurboCluster is
designed to allow integration between Linux and Sun Solaris platforms into a
single cluster, and support for Windows NT as part of the cluster is also
included. Computational
clusters Computational
clusters share number-crunching tasks across several machines. Most
computational cluster approaches require modification to the software load on a
Linux system, as well as special drivers for the applications to be run
themselves. The more widely known and oft-used computational cluster package is
Beowulf (www.beowulf.org). To install a
Beowulf-based cluster you really need to dedicate the machines to the cluster
task, and not use them as stand-alone workstations.
Setting up and managing a Beowulf cluster requires a fair amount of
system administration skill, as well as some ongoing management tasks. An
alternative to Beowulf is Evolocity (www.linuxnetworx.com)
which is a commercial cluster software application. Evolocity provides a
complete software package including management tools, as well as support.
Although not inexpensive, Evolocity is a good choice for organizations and
companies that require a single point of contact for their application, as well
as a solid technical support structure. Clusters
are quickly becoming an important use for Linux, as well as for aging personal
computers. Setting up clusters is
much simpler than many users assume, and can easily be done by any knowledgeable
Linux user. While the dedicated high-speed networks may be beyond the reach of
home or small office setups, the ability to load share and provide redundancy
inexpensively is attractive and an excellent use of PCs. |
|
Send mail to
tparker@tpci.com with
questions or comments about this web site.
|