Reviving a Downed Compute Node in TORQUE/MOAB

The following describes the procedure for bringing a TORQUE compute node that is marked as 'Down' back into service. While the procedure, once known, is relatively simple, working it out required some research, and this document may save others that time.

1. Determine whether the node is really down.

Following an almighty NFS outage, quite a number of compute nodes were marked as "down". However, the two standard tools, `mdiag -n | grep "Down"` and `pbsnodes -ln`, gave significantly different results.

This variation depends on whether the TORQUE mom configuration setting `$down_on_error` is set, which controls whether the mom reports the node as down. By default, TORQUE does not report a node as "down" when it has `message=ERROR` (visible in `pbsnodes`); however, Maui/Moab does report such a node as "Down". Thus `mdiag -n | grep "Down"` gives the more accurate picture. After compiling the list from `mdiag`, use `checknode` to confirm the status of each node, e.g.,


[root@edward-m ~]# checknode edward045
State: Down (in current state for 1:57:10)
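The discrepancy between the two tools can also be seen at a glance by diffing their down lists. A sketch with standard text tools; the function name is made up here, and it assumes the node name is the first field of both tools' output:

```shell
# List nodes that Moab reports Down but TORQUE does not (and vice versa).
# comm requires sorted input; lines unique to the Moab list land in column 1,
# lines unique to the TORQUE list in column 2.
down_discrepancy() {
    mdiag -n | awk '/Down/ {print $1}' | sort > /tmp/moab_down
    pbsnodes -ln | awk '{print $1}' | sort > /tmp/torque_down
    comm -3 /tmp/moab_down /tmp/torque_down
}
```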

2. Check that nothing is running on the node; if clear, reboot.

The next step is to determine whether anything is currently running on the node; its status may be `down`, but jobs can still be running, and such nodes often end up with a very high load. For the time being, deal with those nodes that can definitely be brought back safely. Sadly, this will require a reboot. In this example, the `rpower` command from xCAT (Extreme Cloud Administration Toolkit) is used.


[root@edward-m ~]# jobsonnode edward045
[root@edward-m ~]# rpower edward045 reset
[root@edward-m ~]# ssh edward045
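As a small safety net against typos, the check and the reboot can be combined so that the reset only fires when nothing is reported on the node. A sketch, assuming the site-local `jobsonnode` helper shown above prints nothing for an idle node; the function name is made up:

```shell
# Reboot a node only if the site helper reports no jobs on it.
# 'jobsonnode' is the local helper used above; 'rpower' comes from xCAT.
reboot_if_idle() {
    node=$1
    if [ -z "$(jobsonnode "$node")" ]; then
        rpower "$node" reset
    else
        echo "jobs still present on $node; skipping reboot" >&2
        return 1
    fi
}
```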

3. Remove stale job information

Up to this point, node revival has been fairly trivial. However, even after a reset it is possible, even probable, that job submissions to the node will still fail. This is because old job information remains in the mom spool directory. It needs to be cleared and the mom daemon restarted before the node will begin accepting jobs again.


[root@edward045 ~]# cd /var/spool/torque/mom_priv/
[root@edward045 mom_priv]# rm -rf jobs/*
[root@edward045 mom_priv]# pbs_mom
[root@edward045 mom_priv]# exit
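When several nodes need the same treatment, the clean-up can be driven from the management node over ssh rather than logging in to each one. A sketch assuming passwordless root ssh, as in the transcripts above; the function name is hypothetical:

```shell
# Clear stale mom job state on a node and restart pbs_mom, remotely.
clean_mom() {
    node=$1
    ssh "$node" 'rm -rf /var/spool/torque/mom_priv/jobs/* && pbs_mom'
}
```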

4. Check that it's back

Give the node a minute or so to update its status in Maui/Moab/TORQUE, then check that it is back up with the status `Idle` (or, if the queue has been really busy, it might already be `Busy`!). This can be done with `checknode`, `mdiag -n`, or `pbsnodes`.

The exact time required depends on the RMPOLLINTERVAL set in moab.cfg.


[root@edward-m ~]# checknode edward045
[root@edward ~]# mdiag -n | grep "edward045"
edward045 Idle 16:16 32104:32104 linux
[root@edward ~]# pbsnodes -ln | grep "edward045"
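The wait-and-check can itself be scripted. A sketch that polls `mdiag -n` until the node shows up as `Idle` or `Busy`; the function name and the ~5-minute timeout are arbitrary choices:

```shell
# Poll until the node reports Idle or Busy, giving up after ~5 minutes.
wait_for_node() {
    node=$1
    for _ in $(seq 1 30); do
        if mdiag -n | grep "$node" | grep -Eq "Idle|Busy"; then
            return 0
        fi
        sleep 10
    done
    return 1
}
```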

5. Run a test job from both storage arrays

For the truly careful, running a very short test job on each node is worthwhile. How this is done will, of course, depend on the partitioning, queues, and applications available. In this case a very short Octave job was submitted from two different accounts, one on each of the storage arrays.


[user2@edward ~]$ cat pbstest.pbs
#!/bin/bash
# This is a test job by UniMelb sysadmins to see the state of NFS.
#PBS -q serial
#PBS -l nodes=edward041:ppn=1
cd $PBS_O_WORKDIR
module load octave
octave demo-input.oct


[user2@edward ~]$ cat demo-input.oct
M=rand(300,300);
k=svd(M);
save -ascii demo-result.txt k;
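Once the job has run, the simplest success criterion is that the result file exists and is non-empty on the submitting account's storage array. A trivial helper for that check; the function name is made up:

```shell
# Confirm the test job produced a non-empty result file.
check_result() {
    if test -s "$1"; then
        echo "OK: $1"
    else
        echo "missing or empty: $1" >&2
        return 1
    fi
}
```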

Rinse, wash, and repeat for all the nodes marked `Down`.
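The sweep over the remaining down nodes can be looped. A sketch where the per-node work from steps 2-4 is stubbed out with an echo (substitute the real commands); it assumes, as above, that the node name is the first field of `mdiag -n` output:

```shell
# Iterate over every node Moab reports as Down.
revive_all_down() {
    for node in $(mdiag -n | awk '/Down/ {print $1}'); do
        echo "reviving $node"   # replace with steps 2-4
    done
}
```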

If the tests are successful, inform the users.