|
|
|
|
UNIX and Linux Clusters Dr. Nancy Breaux and Tim Parker Cluster this and cluster that. It seems like everyone is talking about clusters today. This
is remarkable for two reasons: first, few people who talk about clusters really
know what they are; and second, clusters are not a new technology but have been
around for years. What is a
cluster? What does a cluster do?
Why would you want a cluster? What’s
involved in setting up a cluster? Can
UNIX and Linux both do clusters? We’ll
answer these questions and more in the next few pages.
So sit back and enter the wonderful world of clusters. What is a cluster? The term cluster is a generic name for a collection of two
or more computers (usually quite at least five) that are cooperating for some
reason. The machines in a cluster
all talk to each other and cooperate on a particular task or set of tasks.
The use of the term “cluster” goes back to minicomputer days,
although the current definition is a bit different.
Today we think of a cluster as a number of personal computers, all
running the same operating system, all working together on a task that would
normally require much larger computers. The term cluster is used primarily with
the Linux operating system, but that’s not the only type of cluster.
There are clusters that use both UNIX and Windows, as you will see later. Clusters of personal computers were first developed for a
project at NASA where a number of aging 80486 machines were tied together by
software on a network. The reason
for the cluster was simple: the researchers could not afford a supercomputer.
By writing clever software and using Linux, a cluster can provide a good
percentage of the sheer CPU horsepower of a supercomputer using much less
expensive personal computers. In
the case of the NASA cluster, a budget of $40,000 produced a cluster with the
same computational power as a supercomputer costing many millions of dollars. For convenience, we usually divide clusters into three
categories (actually two categories with a modification of a third category).
The first category is high-availability clusters (or HA clusters). The
idea behind HA clustering is to provide a cluster of personal computers which
can all perform the same task, and in case of failure of one particular machine,
the others in the cluster take over the load.
Typically, HA clusters are used for Web servers, database servers,
e-commerce applications, and any application that cannot be unavailable. The second category of clusters is computational clusters.
Computational clusters are used to provide several CPUs all working on
the same problem at the same time, essentially providing more CPU horsepower for
a number-crunching application. The
third type of cluster is called the grid, and is a modification of computational
clusters that uses the entire Internet as its network.
We can look at each of these cluster types in a little more detail. HA clusters For many large corporations and e-commerce vendors, the
most important aspect of their business is availability. If the Web server for one of these companies goes down, not
only does it give Internet users a bad image of the company but it can cost lots
of money in lost sales (both directly because users go elsewhere and indirectly
because of the poor image it projects of the company). Keeping a fast, reliable
Web server running on the Internet requires not only redundant connections to
the Internet, but also redundant hardware for the server. Instead of spending a
fortune on large mainframes and supercomputers, the option to spend much less
for personal computers that can provide the same uptime and be easier to
maintain is attractive. HA clusters are not restricted to Web pages, of course.
Many HA clusters are used inside large corporations for their internal
databases and document libraries. HA
clusters can be applied wherever an always-up application is needed. HA clustering uses a set of machines, usually all identical
in software and hardware setup, as well as special software that have all the
machines on the cluster talk to each other at intervals.
When incoming requests for a service (such as a Web page, database
lookup, or e-commerce transaction) is received, it is directed to one of the
machines on the cluster. To the outside world, the entire cluster looks like a
single machine with a single IP address and usually a single DNS entry.
Software inside the cluster (usually on one dominant machine) then
parcels out the incoming service requests to any machine on the cluster. If one
machine on the cluster fails for any reason, status messages between the
machines detect the fact quickly and all requests are rerouted to other
machines. With many HA cluster
applications, dynamic reallocation of IP addresses is part of the software
package. In use, HA clusters can provide very reliable, fast
processing of incoming service requests. Most
Web sites and e-commerce vendors on the Internet today are using HA clusters
based in rack-mounted servers, mostly Intel-based, all with automatic fail-over
in case of problems. UNIX has had HA clustering for several years, especially in
the workstation and server worlds of Sun and HP-UX machines.
We’ve reviewed HA software in this magazine before. Linux HA clustering
is supported by several software packages, and Windows supports HA clustering
through the use of Microsoft’s Cluster Server. Let’s look at the Linux HA clustering software market for
a moment. MOSIX (www.mosix.cs.huji.ac.il)
is probably the widest used Linux HA clustering application. MOSIX handles
load-balancing across the cluster as well as automatic fail-over capabilities. A
strong feature of MOSIX is that it requires no special modification to the Linux
operating system or the filesystem layout, making it easy to install (and remove
later if necessary). Piranha (available from www.redhat.com
and other Linux sites) is another popular HA clustering software package that
has received very good reviews in the Linux press lately.
Piranha is does both HA clustering or load-balancing, but it doesn’t do
both jobs at once (at least, not very well).
The main strength Piranha offers is its ease of installation,
configuration, and use. The Piranha code is tightly integrated into several
current Linux commercial releases including RedHat 6.2 and derivatives of RedHat. TurboLinux is a well known Linux version, popular for
workstation users, and there’s an HA clustering package called TurboCluster
EnFuzion which works well. TurboCluster
is actually a superset of TurboLinux. A neat feature of TurboCluster is that it
allows integration between Linux, Sun Solaris and Windows NT platforms into a
single cluster, allowing for a heterogeneous machine environment.
It also works very well under TurboLinux by itself, of course. Finally, the Linux Virtual Server project (www.linuxvirtualserver.org)
is a set of kernel modifications and drivers for setting up load-balanced
clusters using Linux systems. Virtual
Server allows scalable, reliable clusters to be constructed from a variety of
Linux platforms. The Virtual Server
system has been plagued by problems to date, but the Linux Virtual Server idea
holds great promise. Writing applications for HA clusters is not particularly
difficult as there is no collaboration, per se, between the different machines
when it comes to handling service requests.
Instead, the HA clustering software takes care of any switching that
needs to be performed. Your
applications run on each of the machines in the cluster, each as a separate
process. You can write applications
for HA clustering that provide collaboration between the machines in the
cluster, but this is usually reserved for computational clusters. Computational clusters Computational clusters come in two varieties.
The first is a dedicated LAN of homogenous personal computers running
Linux. This type of cluster is
called a Beowulf cluster. It is
usually set up so that one computer, the master, is fully outfitted, while the
rest of the computers on the network, the slaves, are headless workstations (no
video, keyboard, or mouse). The master itself has two network adapters, one for the
dedicated LAN, and one to connect to the outside world.
Since communication bottlenecks are often a bigger problem than
computational bottlenecks in many parallel algorithms, a fast (100 Mbps or
higher), switched Ethernet LAN is a necessity for serious scientific number
crunching. A switched network,
rather than a simple hub, eliminates packet collisions, decreases the
variability of communication times, and allows more than one point-to-point
communication to occur. A dedicated
LAN is necessary because an undedicated LAN introduces two much variability in
communication times for some programs to work properly.
Also, some cluster applications use rsh and other Unix tools that are
notorious security holes if the network is connected to the Internet. The software for a Beowulf cluster usually includes user
applications, the message passing API, and the system software.
The system software for Beowulf clusters is Linux with patches to the
kernel and other tools supplied by the Beowulf site, www.beowulf.org. The software available includes a package that allows
process ID space to span multiple nodes, device drivers for most network
adapters, Ethernet channel bonding (where communication is striped across two
Ethernet networks in order to double bandwidth), extensions to TCL, and a
virtual memory pre-pager that can reduce the run time of programs with data sets
larger than physical memory. An alternative to Beowulf is Evolocity (www.linuxnetworx.com),
a commercial cluster software application. Evolocity provides a complete
software package including management tools. Evolocity is not inexpensive, but
it is ideal for organizations and companies that want a single point of contact
for technical support and don’t mind paying for this convenience. Programming for computational clusters is not a trivial
task, and there are several ways to go about application development. For the
message passing API (application programming interface) the main choices are the
Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM).
Your applications can be written using either API, and both can be
available on your cluster. MPI is a
standard message-passing interface designed for the programming of massively
parallel processors (MPPs). It was
not designed specifically for programming clusters, but was designed to take
care of the problem that the message-passing interface for existing parallel
processor machines was hardware dependent.
The MPI Forum designed the MPI standard so that a parallel application
written for one platform, such as a Cray, can be recompiled and run on another,
such as an HP. Since MPI is a
standard supported by most system software vendors, it will probably become the
API of choice for numerical library routines. You should definitely include it
in the system software for your Beowulf cluster if applications will be ported
from or to conventional massively parallel computers. The other commonly used message passing API is the Parallel
Virtual Machine (PVM). PVM differs
from MPI in that it was designed specifically for heterogeneous networks of
computers. It allows such a network
to appear as a “virtual machine”. It
is the API of choice for networks of workstations, discussed below, because it
offers more flexibility, has greater fault tolerance, and better tools for
resource management. MPI is less
suitable as an API for non-dedicated, heterogeneous networks because it assumes
a fixed topology, cannot handle on the fly load balancing, and provides no
constructs that aid fault tolerance. MPI
was designed for speed of parallel computation on MPPs, with each vendor’s
implementation tweaked for the hardware it runs on.
In the design of PVM, tradeoffs were made which favored simplicity of
interface and platform interoperability over raw speed.
Theoretically, intense scientific number crunching routines would run
faster under an MPI implementation than a PVM implementation, but actual results
on Beowulf clusters do not consistently show one API to be faster than the
other, mainly because publicly available MPI implementations must be able to run
on the gamut of PCs that may be used, and, hence, cannot be as finely tuned as
their supercomputer counterparts. PVM
is freely available from http://www.epm.ornl.gov/pvm/. Since both PVM and MPI have free implementations available,
both should be offered on your Beowulf cluster. Which is better suited for programming cluster applications?
As a rule of thumb, use MPI if your application will be ported to a conventional
MPP, or if you need the richer communications set that comes with MPI.
If you want the applications designed for the Beowulf cluster to be able
to run on non-dedicated heterogeneous networks, PVM is the choice for you.
Otherwise, it is a matter of which API you like better.
PVM is not standard, but is, nevertheless, a de facto standard, and is
continually being developed at the Oak Ridge National Laboratory.
Because it has been around longer than MPI, there is a richer set
applications and tools currently available which use this interface, including
load monitoring software and graphical parallel debuggers.
PVM is not restricted to any one platform, and free implementations exist
for just about every platform from Cray supercomputers to lowly Macs.
The availability of PVM is part of the impetus for the development of the
grid computing model, discussed below. The other variety of computational cluster is the network
of workstations (NOW). The concept
here is to reclaim processing capability which would otherwise be wasted and use
it to process tasks. Jobs which can
be easily programmed to run on networks of workstations usually follow the
master/worker model, where a large task is broken up into smaller tasks by the
master, then farmed out to the workers for independent processing. The load balancing and fault tolerance capabilities PVM are
clearly useful here. An application may query the other machines on the network
to find out each machine’s current computational load, and adjust the tasking
on the fly should the computational loads on the workstations change.
The fault tolerance capabilities of PVM are especially useful for NOW
applications that may run for weeks or months at a time.
Steps must be taken to ensure that the application will still run even if
a workstation crashes. Processing
of large images or graphics rendering are suitable candidates for parallel
implementation on NOWs. Grid computing If unused computational capacity can be reclaimed on
networks of workstations, why not be able to apply the same concept to the
Internet as a whole? This brings us
to the concept of grid computing, where computational resources are made
available to users the way electricity is available to your homes.
Geographically dispersed supercomputers and databases can be accessed
inexpensively by networks that make the disparate resources seem like one
unified resource: the grid.
This pooling of resources makes possible the solving of large-scale
problems that could not otherwise be tackled.
The concept is still fairly new, but has generated a lot of interest as
can be seen by the site http://www.gridcomputing.com.
Among the applications that use the resources of Internet-wide home
computers are the search for extraterrestrial intelligence (www.setiathome.com
or www.setiathome.ssl.berkeley.edu), the breaking of common encryption
algorithms (www.distributed.net), and the search for large Mersenne primes (www.mersenne.org).
Encryption key breaking is an especially interesting area to watch, since
estimates for breaking codes assuming single machines can become meaningless
when the computational resources of the entire Internet are brought to bear. What kind of speedup can you expect if you invest in a
computational cluster? If two workstations are used to solve a problem, it
doesn’t mean that the problem can be solved twice as fast.
In computer algorithms, there is nearly always a serial component to the
algorithm that cannot be sped up by parallel processing.
In addition to this limitation is the increased overhead that
coordinating two or more processors incurs.
For what are known as embarrassingly parallel problems, problems that can
easily be broken up into separate, independent tasks, speedup should be linearly
proportional to the number of machines that you have.
For certain numerical algorithms, there will be a point at which adding
additional processors to the computational cluster does not yield increased
performance. However, for other
algorithms, we can expect to solve a problem on N machines more than N times as
fast as on a single processor, a condition known as superlinear speedup. Algorithms which are candidates for superlinear speedup are
those which handle datasets too large to fit in physical memory.
By distributing the data over several workstations such that each handles
a subset of the data small enough to fit in physical memory, page faults and
cache misses can be avoided. Windows NT The Microsoft Cluster Server (MSCS) is Microsoft’s
clustering package and it is unique from other clustering packages in a couple
of ways. First, it runs on any
machine that runs Windows NT so there is no issue of special hardware
requirements. Second, Microsoft modifies a general approach of clustering
wherein all the machines in a cluster access a resource at the same time and
dispenses with this code. The lack
of shared resources across a cluster might sound important, but it really
isn’t because all the machines on a cluster seldom have to access the same
resources, anyway. The big
advantage in dropping shared access to resources is the amount of code that was
dropped from the clustering software. The
lack of this complex code makes MSCS quite noticeably faster than cluster
software that includes this code. The Cluster Service is a primary component of the software
that performs six tasks with a manager handling each task. The key to the
clustering software is the fail-over system managed by the Resource Manager.
This is responsible for moving resources among cluster members at all times,
especially when there has been a failure of a node or a service. One interesting
approach Microsoft took with MSCS is the software for a virtual server.
A virtual server is the logical equivalent of a single application or
file server, but spread over many machines. There are two parts to a virtual
server: a unique network name and a single unique IP address. Any machine or
device in the virtual server shares the name and IP address, and the actual
machine that is using the name and IP address can change dynamically. The future for clusters Clustering has already proven itself.
HA clusters are running practically every Web site that generates a lot
of traffic. Many organizations and
corporations, as well as educational institutions and military applications use
computational clusters. Grid computing is used widely for many number-crunching
client-oriented applications. The future will see the use of clustering
accelerate as these advantages are leveraged with better software and more
hardware resources. The ready availability of low-powered or replaced personal computers makes setting up a cluster inexpensive, as well as easy with the newer versions of clustering software. As you will see in future articles, setting up both UNIX and Linux clusters can be performed by anyone with basic system administration skills. If you’ve got a need for a never-down application or a big number crunching application, and you’ve got a few older PCs kicking around, take a look at clustering. |
|
Send mail to
tparker@tpci.com with
questions or comments about this web site.
|