The Galaxy Cluster


This cluster consists of 8 machines, each running Linux and connected to 2 networks. The first network is the computer science department network, which is connected to the global Internet. The second network is dedicated solely to these 8 machines and can be reached only through them.

Each machine is configured as follows:

Each machine has two names and two IP addresses, one for each network. The machines are named after bright stars. The names and IP addresses are:
CS network                    Dedicated network
antares - 132.177.16.43       antares-2 - 132.177.17.43
atria   - 132.177.16.54       atria-2   - 132.177.17.54
avior   - 132.177.16.52       avior-2   - 132.177.17.52
becrux  - 132.177.16.48       becrux-2  - 132.177.17.48
mirtak  - 132.177.16.55       mirtak-2  - 132.177.17.55
pollux  - 132.177.16.44       pollux-2  - 132.177.17.44
regulus - 132.177.16.47       regulus-2 - 132.177.17.47
sirius  - 132.177.16.85       sirius-2  - 132.177.17.85
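
The listing above follows a regular pattern: each dedicated-network name is the CS name with "-2" appended, and each dedicated-network address is the CS address with the third octet changed from 16 to 17. The following Python sketch illustrates that pattern. The function name dedicated_entry and the dictionary CS_ADDRESSES are hypothetical helpers introduced here for illustration only; they are not part of the cluster software.

```python
# CS-network addresses copied from the listing above.
CS_ADDRESSES = {
    "antares": "132.177.16.43",
    "atria": "132.177.16.54",
    "avior": "132.177.16.52",
    "becrux": "132.177.16.48",
    "mirtak": "132.177.16.55",
    "pollux": "132.177.16.44",
    "regulus": "132.177.16.47",
    "sirius": "132.177.16.85",
}


def dedicated_entry(cs_name):
    """Return (dedicated_name, dedicated_address) for a CS-network name.

    Assumption: the pattern observed in the listing holds for every
    machine (name + "-2", third octet 16 replaced by 17).
    """
    octets = CS_ADDRESSES[cs_name].split(".")
    octets[2] = "17"
    return cs_name + "-2", ".".join(octets)


if __name__ == "__main__":
    for name in sorted(CS_ADDRESSES):
        print(name, CS_ADDRESSES[name], *dedicated_entry(name))
```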

Programs are executed on the cluster by means of the pshell (Parallel Shell) utility. To execute the pshell program, /usr/local/bin must be in your path. In addition, you must have a "hosts" file available that defines the machines you want to execute on. Here is a sample hosts file for the Galaxy cluster:

#
#antares antares-2
atria atria-2
avior avior-2
sirius sirius-2
becrux becrux-2
#mirtak mirtak-2
#pollux pollux-2
#regulus regulus-2
Note that each machine is specified using both of its names: the CS network name first and the dedicated network name second. Also note the use of the "#" character to indicate comments in the hosts file. This example specifies the use of 4 processors: atria, avior, sirius and becrux.

For NXS users: logical processor numbers are assigned to processors in the order that active processors appear in the hosts file.
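
As an illustration of that numbering rule, here is a minimal Python sketch that reads hosts file text and lists the active processors in order. It is not the pshell's own parser. It assumes that a line starting with "#" is a whole-line comment, that blank lines are ignored, and that numbering starts at zero (the text later refers to "logical processor zero"). The function name active_processors is hypothetical.

```python
SAMPLE_HOSTS = """\
#
#antares antares-2
atria atria-2
avior avior-2
sirius sirius-2
becrux becrux-2
#mirtak mirtak-2
#pollux pollux-2
#regulus regulus-2
"""


def active_processors(hosts_text):
    """Return a list of (logical_number, cs_name, dedicated_name) tuples.

    Assumptions (not confirmed by the pshell documentation):
    lines starting with "#" are comments, blank lines are skipped,
    and logical processor numbers start at 0.
    """
    active = []
    for line in hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line: machine is not active
        fields = line.split()
        if len(fields) != 2:
            raise ValueError("expected two names per line: %r" % line)
        cs_name, dedicated_name = fields
        # The next logical number is the count of active lines so far.
        active.append((len(active), cs_name, dedicated_name))
    return active


if __name__ == "__main__":
    for number, cs_name, dedicated_name in active_processors(SAMPLE_HOSTS):
        print(number, cs_name, dedicated_name)
```

Under these assumptions, reordering the uncommented lines in the hosts file changes which machine receives which logical processor number.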

When invoking the pshell, you can specify the hosts file in one of three ways: as the single command-line argument, through the environment variable NETSTAR_HOSTS, or, by default, as the file named hosts in the current working directory.
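
The lookup can be pictured with the following Python sketch. It only mimics the three sources named above and is not the pshell's implementation. The precedence shown (argument first, then NETSTAR_HOSTS, then the default file) is an assumption based on the order in which the options are listed, and the function name resolve_hosts_file is hypothetical.

```python
import os
import sys


def resolve_hosts_file(argv, environ=None):
    """Return the hosts file path a pshell-like tool might use.

    Assumed precedence (not confirmed by the documentation):
    1. the single command-line argument,
    2. the NETSTAR_HOSTS environment variable,
    3. the file "hosts" in the current working directory.
    """
    if environ is None:
        environ = os.environ
    if len(argv) > 1:
        return argv[1]
    if environ.get("NETSTAR_HOSTS"):
        return environ["NETSTAR_HOSTS"]
    return os.path.join(os.getcwd(), "hosts")


if __name__ == "__main__":
    print(resolve_hosts_file(sys.argv))
```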

Within the pshell, there are three fundamental commands:

IMPORTANT: all machines must share a common directory structure, so that the current working directory has the same name on every active machine and on the machine that is running the pshell. If this is not the case, you will get the admittedly obscure error message: recv EOF on fd K, where K is the number of active processors. This error occurs when you start up the pshell.

The pshell has a primitive history mechanism. It remembers the last command executed as "!!".

The pshell does not currently allow the redirection of stdin, stdout or stderr.

Stdout, stderr and stdin for logical processor zero are routed to/from the pshell. Stdout and stderr for each other active processor are sent to the log file /tmp/netstar.username.log on that processor.
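
To find the output of a processor other than logical processor zero, log in to that machine and read its log file. The Python sketch below builds the path and prints the file if it exists. It assumes that "username" in the path stands for your login name on that machine; the function name netstar_log_path is hypothetical.

```python
import getpass
import os


def netstar_log_path(username=None):
    """Return the per-user log path described above.

    Assumption: "username" in /tmp/netstar.username.log is the
    login name of the user running the parallel program.
    """
    if username is None:
        username = getpass.getuser()
    return "/tmp/netstar.{}.log".format(username)


if __name__ == "__main__":
    path = netstar_log_path()
    if os.path.exists(path):
        with open(path) as log:
            print(log.read(), end="")
    else:
        print("no log file at", path)
```

Because /tmp is local to each machine, the script must be run on the processor whose output you want to inspect.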

To terminate cleanly under the pshell, a process must perform an "exit(0)".

The machine antares should be used to run compilers and the pshell. This keeps the rest of the machines free for parallel computations. If you want to use all 8 nodes for a computation, you will need to coordinate with the other users to minimize interference from other tasks on antares. Remember also that the pshell itself will put some load on antares in this case.

Only one user at a time may be running a parallel program on a particular node of the cluster. If someone else has already allocated the node, then you will get the following error when you attempt to execute on that node: SbpSetup: error on sbp_alloc.

The Galaxy cluster may not be available during the following times to allow for special testing by researchers also using the cluster:

IMPORTANT: The disks on the Galaxy cluster are not backed up. Users are responsible for transferring any files requiring backup to another system that is backed up.

IMPORTANT: The Galaxy cluster is intended to be used to investigate issues surrounding parallel programming. Inappropriate use of the cluster may result in the suspension of your cluster account.

The Galaxy cluster has been constructed with the support of the National Science Foundation via grant CDA-9421997.


Last modified on October 6, 1997.

Comments and questions should be directed to pjh@cs.unh.edu