Timothy Parker Consulting Incorporated


 

UNIX and Linux Clusters

Dr. Nancy Breaux and Tim Parker

Cluster this and cluster that.  It seems like everyone is talking about clusters today. This is remarkable for two reasons: first, few people who talk about clusters really know what they are; and second, clusters are not a new technology but have been around for years.  What is a cluster?  What does a cluster do?  Why would you want a cluster?  What’s involved in setting up a cluster?  Can UNIX and Linux both do clusters?  We’ll answer these questions and more in the next few pages.  So sit back and enter the wonderful world of clusters.

What is a cluster?

The term cluster is a generic name for a collection of two or more computers (usually quite at least five) that are cooperating for some reason.  The machines in a cluster all talk to each other and cooperate on a particular task or set of tasks.  The use of the term “cluster” goes back to minicomputer days, although the current definition is a bit different.  Today we think of a cluster as a number of personal computers, all running the same operating system, all working together on a task that would normally require much larger computers. The term cluster is used primarily with the Linux operating system, but that’s not the only type of cluster.  There are clusters that use both UNIX and Windows, as you will see later. 

Clusters of personal computers were first developed for a project at NASA where a number of aging 80486 machines were tied together by software on a network.  The reason for the cluster was simple: the researchers could not afford a supercomputer.  By writing clever software and using Linux, a cluster can provide a good percentage of the sheer CPU horsepower of a supercomputer using much less expensive personal computers.  In the case of the NASA cluster, a budget of $40,000 produced a cluster with the same computational power as a supercomputer costing many millions of dollars. 

For convenience, we usually divide clusters into three categories (actually two categories with a modification of a third category).  The first category is high-availability clusters (or HA clusters). The idea behind HA clustering is to provide a cluster of personal computers which can all perform the same task, and in case of failure of one particular machine, the others in the cluster take over the load.  Typically, HA clusters are used for Web servers, database servers, e-commerce applications, and any application that cannot be unavailable.  The second category of clusters is computational clusters.  Computational clusters are used to provide several CPUs all working on the same problem at the same time, essentially providing more CPU horsepower for a number-crunching application.  The third type of cluster is called the grid, and is a modification of computational clusters that uses the entire Internet as its network.  We can look at each of these cluster types in a little more detail.

HA clusters

For many large corporations and e-commerce vendors, the most important aspect of their business is availability.  If the Web server for one of these companies goes down, not only does it give Internet users a bad image of the company but it can cost lots of money in lost sales (both directly because users go elsewhere and indirectly because of the poor image it projects of the company). Keeping a fast, reliable Web server running on the Internet requires not only redundant connections to the Internet, but also redundant hardware for the server. Instead of spending a fortune on large mainframes and supercomputers, the option to spend much less for personal computers that can provide the same uptime and be easier to maintain is attractive. HA clusters are not restricted to Web pages, of course.  Many HA clusters are used inside large corporations for their internal databases and document libraries.  HA clusters can be applied wherever an always-up application is needed.

HA clustering uses a set of machines, usually all identical in software and hardware setup, as well as special software that have all the machines on the cluster talk to each other at intervals.  When incoming requests for a service (such as a Web page, database lookup, or e-commerce transaction) is received, it is directed to one of the machines on the cluster. To the outside world, the entire cluster looks like a single machine with a single IP address and usually a single DNS entry.  Software inside the cluster (usually on one dominant machine) then parcels out the incoming service requests to any machine on the cluster. If one machine on the cluster fails for any reason, status messages between the machines detect the fact quickly and all requests are rerouted to other machines.  With many HA cluster applications, dynamic reallocation of IP addresses is part of the software package.

In use, HA clusters can provide very reliable, fast processing of incoming service requests.  Most Web sites and e-commerce vendors on the Internet today are using HA clusters based in rack-mounted servers, mostly Intel-based, all with automatic fail-over in case of problems.

UNIX has had HA clustering for several years, especially in the workstation and server worlds of Sun and HP-UX machines.  We’ve reviewed HA software in this magazine before. Linux HA clustering is supported by several software packages, and Windows supports HA clustering through the use of Microsoft’s Cluster Server.

Let’s look at the Linux HA clustering software market for a moment. MOSIX (www.mosix.cs.huji.ac.il) is probably the widest used Linux HA clustering application. MOSIX handles load-balancing across the cluster as well as automatic fail-over capabilities. A strong feature of MOSIX is that it requires no special modification to the Linux operating system or the filesystem layout, making it easy to install (and remove later if necessary). 

Piranha (available from www.redhat.com and other Linux sites) is another popular HA clustering software package that has received very good reviews in the Linux press lately.  Piranha is does both HA clustering or load-balancing, but it doesn’t do both jobs at once (at least, not very well).  The main strength Piranha offers is its ease of installation, configuration, and use. The Piranha code is tightly integrated into several current Linux commercial releases including RedHat 6.2 and derivatives of RedHat.

TurboLinux is a well known Linux version, popular for workstation users, and there’s an HA clustering package called TurboCluster EnFuzion which works well.  TurboCluster is actually a superset of TurboLinux. A neat feature of TurboCluster is that it allows integration between Linux, Sun Solaris and Windows NT platforms into a single cluster, allowing for a heterogeneous machine environment.  It also works very well under TurboLinux by itself, of course.

Finally, the Linux Virtual Server project (www.linuxvirtualserver.org) is a set of kernel modifications and drivers for setting up load-balanced clusters using Linux systems.  Virtual Server allows scalable, reliable clusters to be constructed from a variety of Linux platforms.  The Virtual Server system has been plagued by problems to date, but the Linux Virtual Server idea holds great promise.

Writing applications for HA clusters is not particularly difficult as there is no collaboration, per se, between the different machines when it comes to handling service requests.  Instead, the HA clustering software takes care of any switching that needs to be performed.  Your applications run on each of the machines in the cluster, each as a separate process.  You can write applications for HA clustering that provide collaboration between the machines in the cluster, but this is usually reserved for computational clusters.

Computational clusters

Computational clusters come in two varieties.  The first is a dedicated LAN of homogenous personal computers running Linux.  This type of cluster is called a Beowulf cluster.  It is usually set up so that one computer, the master, is fully outfitted, while the rest of the computers on the network, the slaves, are headless workstations (no video, keyboard, or mouse).  The master itself has two network adapters, one for the dedicated LAN, and one to connect to the outside world.  Since communication bottlenecks are often a bigger problem than computational bottlenecks in many parallel algorithms, a fast (100 Mbps or higher), switched Ethernet LAN is a necessity for serious scientific number crunching.  A switched network, rather than a simple hub, eliminates packet collisions, decreases the variability of communication times, and allows more than one point-to-point communication to occur.  A dedicated LAN is necessary because an undedicated LAN introduces two much variability in communication times for some programs to work properly.  Also, some cluster applications use rsh and other Unix tools that are notorious security holes if the network is connected to the Internet.

The software for a Beowulf cluster usually includes user applications, the message passing API, and the system software.  The system software for Beowulf clusters is Linux with patches to the kernel and other tools supplied by the Beowulf site, www.beowulf.org.   The software available includes a package that allows process ID space to span multiple nodes, device drivers for most network adapters, Ethernet channel bonding (where communication is striped across two Ethernet networks in order to double bandwidth), extensions to TCL, and a virtual memory pre-pager that can reduce the run time of programs with data sets larger than physical memory. 

An alternative to Beowulf is Evolocity (www.linuxnetworx.com),  a commercial cluster software application. Evolocity provides a complete software package including management tools. Evolocity is not inexpensive, but it is ideal for organizations and companies that want a single point of contact for technical support and don’t mind paying for this convenience.

Programming for computational clusters is not a trivial task, and there are several ways to go about application development. For the message passing API (application programming interface) the main choices are the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM).  Your applications can be written using either API, and both can be available on your cluster.  MPI is a standard message-passing interface designed for the programming of massively parallel processors (MPPs).  It was not designed specifically for programming clusters, but was designed to take care of the problem that the message-passing interface for existing parallel processor machines was hardware dependent.  The MPI Forum designed the MPI standard so that a parallel application written for one platform, such as a Cray, can be recompiled and run on another, such as an HP.  Since MPI is a standard supported by most system software vendors, it will probably become the API of choice for numerical library routines. You should definitely include it in the system software for your Beowulf cluster if applications will be ported from or to conventional massively parallel computers.

The other commonly used message passing API is the Parallel Virtual Machine (PVM).  PVM differs from MPI in that it was designed specifically for heterogeneous networks of computers.  It allows such a network to appear as a “virtual machine”.  It is the API of choice for networks of workstations, discussed below, because it offers more flexibility, has greater fault tolerance, and better tools for resource management.  MPI is less suitable as an API for non-dedicated, heterogeneous networks because it assumes a fixed topology, cannot handle on the fly load balancing, and provides no constructs that aid fault tolerance.  MPI was designed for speed of parallel computation on MPPs, with each vendor’s implementation tweaked for the hardware it runs on.  In the design of PVM, tradeoffs were made which favored simplicity of interface and platform interoperability over raw speed.  Theoretically, intense scientific number crunching routines would run faster under an MPI implementation than a PVM implementation, but actual results on Beowulf clusters do not consistently show one API to be faster than the other, mainly because publicly available MPI implementations must be able to run on the gamut of PCs that may be used, and, hence, cannot be as finely tuned as their supercomputer counterparts.   PVM is freely available from http://www.epm.ornl.gov/pvm/.

Since both PVM and MPI have free implementations available, both should be offered on your Beowulf cluster.  Which is better suited for programming cluster applications? As a rule of thumb, use MPI if your application will be ported to a conventional MPP, or if you need the richer communications set that comes with MPI.  If you want the applications designed for the Beowulf cluster to be able to run on non-dedicated heterogeneous networks, PVM is the choice for you.  Otherwise, it is a matter of which API you like better.  PVM is not standard, but is, nevertheless, a de facto standard, and is continually being developed at the Oak Ridge National Laboratory.  Because it has been around longer than MPI, there is a richer set applications and tools currently available which use this interface, including load monitoring software and graphical parallel debuggers.  PVM is not restricted to any one platform, and free implementations exist for just about every platform from Cray supercomputers to lowly Macs.  The availability of PVM is part of the impetus for the development of the grid computing model, discussed below.

The other variety of computational cluster is the network of workstations (NOW).  The concept here is to reclaim processing capability which would otherwise be wasted and use it to process tasks.  Jobs which can be easily programmed to run on networks of workstations usually follow the master/worker model, where a large task is broken up into smaller tasks by the master, then farmed out to the workers for independent processing.  The load balancing and fault tolerance capabilities PVM are clearly useful here. An application may query the other machines on the network to find out each machine’s current computational load, and adjust the tasking on the fly should the computational loads on the workstations change.  The fault tolerance capabilities of PVM are especially useful for NOW applications that may run for weeks or months at a time.  Steps must be taken to ensure that the application will still run even if a workstation crashes.  Processing of large images or graphics rendering are suitable candidates for parallel implementation on NOWs. 

Grid computing

If unused computational capacity can be reclaimed on networks of workstations, why not be able to apply the same concept to the Internet as a whole?  This brings us to the concept of grid computing, where computational resources are made available to users the way electricity is available to your homes.  Geographically dispersed supercomputers and databases can be accessed inexpensively by networks that make the disparate resources seem like one unified resource:  the grid.  This pooling of resources makes possible the solving of large-scale problems that could not otherwise be tackled.  The concept is still fairly new, but has generated a lot of interest as can be seen by the site http://www.gridcomputing.com.  Among the applications that use the resources of Internet-wide home computers are the search for extraterrestrial intelligence (www.setiathome.com or www.setiathome.ssl.berkeley.edu), the breaking of common encryption algorithms (www.distributed.net), and the search for large Mersenne primes (www.mersenne.org).  Encryption key breaking is an especially interesting area to watch, since estimates for breaking codes assuming single machines can become meaningless when the computational resources of the entire Internet are brought to bear.

What kind of speedup can you expect if you invest in a computational cluster? If two workstations are used to solve a problem, it doesn’t mean that the problem can be solved twice as fast.  In computer algorithms, there is nearly always a serial component to the algorithm that cannot be sped up by parallel processing.  In addition to this limitation is the increased overhead that coordinating two or more processors incurs.  For what are known as embarrassingly parallel problems, problems that can easily be broken up into separate, independent tasks, speedup should be linearly proportional to the number of machines that you have.  For certain numerical algorithms, there will be a point at which adding additional processors to the computational cluster does not yield increased performance.  However, for other algorithms, we can expect to solve a problem on N machines more than N times as fast as on a single processor, a condition known as superlinear speedup.  Algorithms which are candidates for superlinear speedup are those which handle datasets too large to fit in physical memory.  By distributing the data over several workstations such that each handles a subset of the data small enough to fit in physical memory, page faults and cache misses can be avoided.   

Windows NT

The Microsoft Cluster Server (MSCS) is Microsoft’s clustering package and it is unique from other clustering packages in a couple of ways.  First, it runs on any machine that runs Windows NT so there is no issue of special hardware requirements. Second, Microsoft modifies a general approach of clustering wherein all the machines in a cluster access a resource at the same time and dispenses with this code.  The lack of shared resources across a cluster might sound important, but it really isn’t because all the machines on a cluster seldom have to access the same resources, anyway.  The big advantage in dropping shared access to resources is the amount of code that was dropped from the clustering software.  The lack of this complex code makes MSCS quite noticeably faster than cluster software that includes this code.

The Cluster Service is a primary component of the software that performs six tasks with a manager handling each task. The key to the clustering software is the fail-over system managed by the Resource Manager. This is responsible for moving resources among cluster members at all times, especially when there has been a failure of a node or a service. One interesting approach Microsoft took with MSCS is the software for a virtual server.  A virtual server is the logical equivalent of a single application or file server, but spread over many machines. There are two parts to a virtual server: a unique network name and a single unique IP address. Any machine or device in the virtual server shares the name and IP address, and the actual machine that is using the name and IP address can change dynamically.

The future for clusters

Clustering has already proven itself.  HA clusters are running practically every Web site that generates a lot of traffic.  Many organizations and corporations, as well as educational institutions and military applications use computational clusters.  Grid computing is used widely for many number-crunching client-oriented applications. The future will see the use of clustering accelerate as these advantages are leveraged with better software and more hardware resources.

The ready availability of low-powered or replaced personal computers makes setting up a cluster inexpensive, as well as easy with the newer versions of clustering software.  As you will see in future articles, setting up both UNIX and Linux clusters can be performed by anyone with basic system administration skills.  If you’ve got a need for a never-down application or a big number crunching application, and you’ve got a few older PCs kicking around, take a look at clustering.

 

Send mail to tparker@tpci.com with questions or comments about this web site.
Copyright © 1995-2007 Timothy Parker Consulting Incorporated
Last modified: January 23, 2007