Travis Metcalfe, email@example.com
Ed Nather, firstname.lastname@example.org
v1.0, 01 October 1998
This Mini-HOWTO explains how to configure a specialized metacomputer with one complete Linux system and any (reasonable) number of additional nodes with minimal hardware. Like the much-celebrated Beowulf systems, this design offers an inexpensive alternative to supercomputers for parallel processing. Unlike a Beowulf system, the minimal design is motivated by the requirements of a specific class of problems.
In January 1998, around the time that the idea of commodity parallel processing started getting a lot of attention, we were independently designing a metacomputer of our own. We had experimented with PVM on mixed networks of workstations and PCs, and we had a computation-intensive problem which would benefit a great deal from unlimited access to our own parallel computing facility. Our budget was modest, so we set out to get the best performance possible per dollar without restricting the ability of the machine to solve our specific problem.
The original Beowulf cluster (which we didn't know about at the time) had a number of features which, though they contributed to the utility of the machine as a multi-purpose computational tool, were unnecessary for our particular problem. We wanted to use each node of the metacomputer to run identical tasks with small, independent sets of data. The results of the calculations performed by the nodes consisted of just a few numbers, which only needed to be communicated to the master process, never to another node. Put another way, network bandwidth was not an issue because the computation-to-communication ratio of our application was extremely high, and hard disks were not needed on the nodes because our problem did not require any significant amount of data swapping.
After defining the hardware requirements on the basis of our specific computational problem, we settled on a design including one master server augmented by 32 minimal nodes connected by a simple 10base-2 network.
The master computer is a Dell Dimension XPS. It features a 333-MHz Pentium-II processor, 128 MB of SDRAM, and two 8.4 GB IDE hard disks. In addition to the factory-installed 3Com 3C900 Ethernet card, it has two NE-2000 compatible network cards, each of which drives 16 nodes on its own subnet. Since a single 10base-2 segment can handle up to 30 devices, no repeater was necessary.
The slave nodes were assembled from components obtained at a local discount computer outlet. Each node consists of an ATX mid-tower case with a Gigabyte LX P-II motherboard, a 300-MHz Pentium-II processor plus fan, a single 32 MB SDRAM module, and an NE-2000 compatible network card with an Am27C256 (32 KB) EPROM. The nodes are connected in series with 3-ft ethernet coaxial cables, and each 16-node subnet has a 50-ohm terminator at each end.
The software configuration of the metacomputer was not much more complex than setting up a diskless Linux box (see Robert Nemkin's Diskless Linux Mini HOWTO). The main difference was that we wanted each node to have an independent filesystem rather than mounting a shared NFS filesystem. Since the nodes had no hard disks, we needed to create a self-contained filesystem that would fit in a modest fraction of the 32 MB of RAM.
The basic idea was to have each node boot from the network (using an inexpensive bootrom rather than a floppy drive), downloading its kernel from the server, creating an 8 MB initial ramdisk, and finally downloading and mounting a minimal root filesystem.
We custom-compiled the kernel for the nodes, including support for the NE-2000 ethernet card, a root filesystem on the network retrieved with BOOTP, a ramdisk filesystem, and an initial ramdisk. The 2.0.34 kernel compiled to about 335 KB.
To create the self-contained root filesystem, we used Tom Fawcett's YARD (Yet Another Rescue Disk) package. Since it was designed to make rescue disks, there were a number of changes we had to make to the defaults. In particular, we had to add a user account for pvm, including a trimmed-down, execute-only distribution of the PVM software. We had to give pvm read-write permission in the /tmp directory, and we had to make /sbin/reboot a setuid-root program to allow remote reboots of the nodes.
There are two files which control the properties and content of the YARD filesystem: Config.pl and Bootdisk_Contents. The Config.pl file controls the size of the filesystem, the location of the kernel image, and other logistical matters. The Bootdisk_Contents file contains a list of the daemons, devices, directories, files, executables, libraries, and utilities that we explicitly wanted to include in the filesystem. When we ran the scripts that came with YARD, they automatically determined the external dependencies of anything we had included, and added those to the filesystem before compressing the whole thing to create the root.gz file.
To set up the subnets, we had to install extra ethernet cards in the server to communicate with the nodes. No more than 30 devices (e.g. ethernet cards in the nodes) can be included on a single subnet without using a repeater to boost the signal. Since ethernet cards are a great deal less expensive than repeaters, we drive our 32 nodes with two ethernet cards, each serving a 16-node subnet.
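The card count can be checked with a little arithmetic. The following is a minimal sketch in Python, not part of the original setup. The function names are hypothetical, and it assumes that the server's own card counts as one of the 30 devices allowed on a segment.

```python
import math

MAX_DEVICES_PER_SEGMENT = 30  # 10base-2 limit without a repeater


def cards_needed(n_nodes, max_devices=MAX_DEVICES_PER_SEGMENT):
    """Number of server ethernet cards (one per segment) for n_nodes.

    Assumption: the server's card occupies one device slot on each
    segment, leaving max_devices - 1 slots for nodes.
    """
    return math.ceil(n_nodes / (max_devices - 1))


def split_nodes(n_nodes, n_cards):
    """Spread the nodes as evenly as possible over the cards."""
    base, extra = divmod(n_nodes, n_cards)
    return [base + 1 if i < extra else base for i in range(n_cards)]


if __name__ == "__main__":
    cards = cards_needed(32)
    print(cards, split_nodes(32, cards))
```

For 32 nodes this gives two cards with 16 nodes each, which matches the layout described here.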
Getting the server to recognize multiple ethernet cards requires a few options to be passed to the kernel specifying the addresses of the devices. It's also necessary to explicitly reserve the address space, to avoid any potential conflicts which might arise if another device attempts to access the address space while the card is being initialized. We pass everything to the kernel through LOADLIN as command line options.
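As an illustration of what such options look like, the sketch below formats the "ether=" and "reserve=" boot parameters understood by 2.0-series kernels. The IRQ and I/O base values are hypothetical placeholders, not the settings used on our server, and the helper names are invented for this example.

```python
NE2000_IO_EXTENT = 32  # an NE-2000 occupies 32 bytes (0x20) of I/O ports

# Hypothetical settings for the two extra NE-2000 compatible cards.
CARDS = [
    {"name": "eth1", "irq": 10, "io_base": 0x300},
    {"name": "eth2", "irq": 11, "io_base": 0x320},
]


def ether_option(irq, io_base, name):
    """Format an 'ether=IRQ,BASE_ADDR,NAME' boot option."""
    return "ether={},{:#x},{}".format(irq, io_base, name)


def reserve_option(regions):
    """Format a 'reserve=BASE,EXTENT[,BASE,EXTENT...]' boot option."""
    parts = ["{:#x},{}".format(base, extent) for base, extent in regions]
    return "reserve=" + ",".join(parts)


def kernel_options(cards):
    """Join the per-card options and one reserve option into one string."""
    opts = [ether_option(c["irq"], c["io_base"], c["name"]) for c in cards]
    opts.append(reserve_option(
        [(c["io_base"], NE2000_IO_EXTENT) for c in cards]))
    return " ".join(opts)


if __name__ == "__main__":
    print(kernel_options(CARDS))
```

The resulting string would be appended to the LOADLIN command line after the kernel image name. The actual values depend on how each card is jumpered or configured.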
Once the extra ethernet cards were recognized during the boot sequence, we configured them and added entries in the /etc/hosts table for the two interfaces. The IP address ranges that are reserved for private networks (which do not operate on the internet) are 10.0.0.0 to 10.255.255.255, 172.16.0.0 to 172.31.255.255, and 192.168.0.0 to 192.168.255.255.
Since we were dealing with a relatively small number of machines, we used the first three numbers of the address to specify the network and the last number to specify the host (a so-called Class C network). Our eth1 interface was assigned control of the 192.168.1.0 network, while 192.168.2.0 was handled by our eth2 interface.
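The addressing scheme can be sketched with Python's standard ipaddress module. The two network numbers are the ones named in the text. The choice of .1 for the server interface and .2 onward for the nodes is an assumption made for this example, as is the helper name node_addresses.

```python
import ipaddress

# Networks named in the text; a /24 prefix is the Class C netmask.
SUBNETS = {
    "eth1": ipaddress.ip_network("192.168.1.0/24"),
    "eth2": ipaddress.ip_network("192.168.2.0/24"),
}


def node_addresses(network, n_nodes, first_host=2):
    """Return n_nodes host addresses on the given network.

    Assumption: host .1 is the server interface, so nodes start at .2.
    """
    hosts = list(network.hosts())  # .1 through .254 for a /24
    start = first_host - 1
    return [str(addr) for addr in hosts[start:start + n_nodes]]


if __name__ == "__main__":
    for iface, net in SUBNETS.items():
        addrs = node_addresses(net, 16)
        print(iface, net.netmask, net.is_private, addrs[0], addrs[-1])
```

Both networks report as private, and each leaves ample room for 16 nodes in the last number of the address.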
We used BOOTP and TFTP to allow the nodes to retrieve and boot their kernel, and to download a compressed version of the root filesystem. We relied heavily on Robert Nemkin's Diskless Linux Mini HOWTO to make it work.
First of all, the BOOTP daemon needs to be running on the server. We added "bootpd -s" to our /etc/rc.local file and uncommented two lines in the /etc/inetd.conf file.
Finally, we verified that the /etc/services file contained the two corresponding lines.
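A check of this kind can be automated. The sketch below assumes that the two entries in question are the standard port assignments for bootps (67/udp) and tftp (69/udp). The function names and the sample text are made up for illustration.

```python
# Assumed entries: the standard assignments for BOOTP server and TFTP.
REQUIRED = {"bootps": (67, "udp"), "tftp": (69, "udp")}


def parse_services(text):
    """Map (name, protocol) -> port for each entry in /etc/services text."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        port, proto = fields[1].split("/", 1)
        if not port.isdigit():
            continue
        for name in [fields[0]] + fields[2:]:  # service name plus aliases
            entries[(name, proto)] = int(port)
    return entries


def missing_services(text, required=REQUIRED):
    """Return the names of required services absent from the text."""
    entries = parse_services(text)
    return [name for name, (port, proto) in sorted(required.items())
            if entries.get((name, proto)) != port]


SAMPLE = """\
bootps   67/udp    # BOOTP server
tftp     69/udp
"""

if __name__ == "__main__":
    print(missing_services(SAMPLE))
```

On the server itself, the text would come from reading /etc/services rather than from the sample string.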
After restarting inetd (kill -HUP [process id of inetd]), we began configuring the server.
We created an /etc/bootptab file containing a list of the hostnames and IP addresses that correspond to each device on the subnet (identified by the unique hardware address of each ethernet card). In addition to various network configuration parameters, this file also describes the location of the bootimage to retrieve with TFTP.
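With 32 nodes, the entries are repetitive enough to generate. The sketch below builds one bootptab line per node using the standard tags ht (hardware type), ha (hardware address), ip, sm (subnet mask), hd (boot file directory) and bf (boot file). The hostname, hardware address, and the boot file name "bootImage" are hypothetical; only the /tftpboot directory comes from our setup.

```python
def bootptab_entry(hostname, mac, ip, bootfile="bootImage",
                   home_dir="/tftpboot", netmask="255.255.255.0"):
    """Return one colon-separated /etc/bootptab entry for a node."""
    ha = mac.replace(":", "").upper()
    if len(ha) != 12:
        raise ValueError("expected a 6-byte hardware address: %r" % mac)
    int(ha, 16)  # raises ValueError if the address is not hexadecimal
    fields = [
        hostname,
        "ht=ethernet",
        "ha=" + ha,
        "ip=" + ip,
        "sm=" + netmask,
        "hd=" + home_dir,
        "bf=" + bootfile,
    ]
    return ":".join(fields)


if __name__ == "__main__":
    # Hypothetical node name and hardware address.
    print(bootptab_entry("node1", "00:40:05:aa:bb:01", "192.168.1.2"))
```

Entries that share settings can also be factored into a template entry referenced with the tc tag, which keeps a long table easier to maintain.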
Since each node is running an identical copy of the bootimage, setting up TFTP was considerably easier than it would have been in general. We simply created the /tftpboot directory in the root partition of the server and placed a copy of the bootimage there.
We also used the "makerom" command from the NETBOOT package to create a ROM image, which we needed in order to program the bootrom for the ethernet card in each node.
In addition to choosing the packet driver for the card (NE-2000 in our case), this program also required that we know the I/O address and IRQ that the card would be using. We chose the defaults on all of the other options.
Although our ROM image was only 16 KB, we used Am27C256 (32 KB) EPROMs because they were actually cheaper than the smaller chips. To get the ROM image onto the EPROMs, we used an old BP Microsystems EP-1 28-pin programmer. The DOS software to drive this device through a parallel port (ep320.exe) was available online, and it was self-explanatory.
4. How it works
With the server up and running, we turn on one node at a time (to prevent the server from being overwhelmed by many simultaneous BOOTP requests). By default, the BIOS tries to boot from the LAN first. It finds the bootrom on the ethernet card, and executes the ROM image. The packet driver in the ROM image initializes the ethernet card and broadcasts a BOOTP request over the network.
When the server receives a request, it identifies the associated hardware address, assigns a corresponding IP address, and allows the requesting node to download the bootimage. The node loads the kernel image into memory, creates an 8 MB initial ramdisk, mounts the root filesystem, and executes the rc script, which establishes the node's identity and starts essential services and daemons.
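The exchange can be made concrete with the fixed 300-byte BOOTP packet layout from RFC 951. The sketch below builds a request the way a bootrom would and then performs the server-side lookup of an IP address by hardware address. It is an illustration only, not the code in the ROM image or in bootpd, and the hardware address and lookup table are hypothetical.

```python
import struct

# RFC 951 layout: op, htype, hlen, hops, xid, secs, unused,
# ciaddr, yiaddr, siaddr, giaddr, chaddr, sname, file, vend.
BOOTP_FORMAT = "!BBBBIHH4s4s4s4s16s64s128s64s"
BOOTREQUEST = 1
HTYPE_ETHERNET = 1


def build_bootp_request(mac, xid):
    """Build a BOOTP request for a client that does not know its IP yet."""
    chaddr = bytes.fromhex(mac.replace(":", ""))
    zero_ip = bytes(4)
    return struct.pack(
        BOOTP_FORMAT, BOOTREQUEST, HTYPE_ETHERNET, len(chaddr), 0,
        xid, 0, 0, zero_ip, zero_ip, zero_ip, zero_ip,
        chaddr, b"", b"", b"")  # short fields are padded with zero bytes


def lookup_ip(packet, table):
    """Server side: map the client hardware address to its assigned IP."""
    fields = struct.unpack(BOOTP_FORMAT, packet)
    hlen, chaddr = fields[2], fields[11]
    return table.get(chaddr[:hlen].hex())


if __name__ == "__main__":
    table = {"004005aabb01": "192.168.1.2"}  # hypothetical bootptab data
    request = build_bootp_request("00:40:05:aa:bb:01", xid=0x1234)
    print(len(request), lookup_ip(request, table))
```

In the real exchange the request is broadcast to UDP port 67, and the reply carries the assigned address and the boot file name back to the node.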
Once all of the nodes are up, we log in to the server as "pvm" and start the PVM daemon. The .rhosts file in the /home/pvm directory on each of the nodes allows the server to start the slave PVM daemons. Any working executable file which incorporates the PVM library routines and has been included in the root filesystem can now be run in parallel.
Typically, the executables residing on the nodes are slave programs that are called by a master program which resides on the server. These programs are tested and compiled on the server, and added to the Bootdisk_Contents file before creating a new root filesystem with YARD. The new bootimage is created with the NETBOOT utilities, and all of the nodes are rebooted. While this may seem tedious, the use of a few shell scripts makes it relatively easy in practice.