Deleting "Stuck" Compute Jobs

Often on a cluster a user launches a compute job only to discover that they need to delete it (e.g., the data file is corrupt, or there was an error in their application commands or PBS script). In TORQUE, PBS Pro, OpenPBS, and similar systems this is done with the standard PBS command, qdel.


[compute-login ~] qdel job_id

Sometimes, however, that simply doesn't work. An error message like the following is typical: "qdel: Server could not connect to MOM". I think I've seen this around a hundred times in the past few years.
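
For illustration, a failed deletion looks something like this (the job ID and exact output format here are indicative rather than exact):


[compute-login ~] qdel 123456
qdel: Server could not connect to MOM 123456.compute-m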

Users should contact their cluster administrators at this point, as it usually means that the compute node their job is running on is inaccessible for some reason. The `checkjob job_id` command (from the Maui/Moab scheduler commonly paired with TORQUE) or the more generic `qstat -f job_id` will show which nodes a job is running on; this information should be provided to the cluster administrator.
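
In the full job status, the exec_host field is the item to look for (the job ID and node name below are placeholders):


[compute-login ~] qstat -f 123456 | grep exec_host
    exec_host = compute001/0+compute001/1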

The first thing the cluster administrator should do is attempt `qdel job_id` themselves. If that also fails, it suggests a problem with the compute node itself, and the next step is to check whether there are other jobs running on that node. A jobsonnode script may be very useful here (essentially qstat filtered with awk, e.g., qstat -n -1 | awk -F. '/compute001/ {print $1}').


[compute-m ~] pbsnodes -a | less
[compute-m ~] jobsonnode nodeid
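
Since jobsonnode is a local convenience script rather than a stock PBS command, a minimal sketch of such a wrapper, built around the awk one-liner above, might look like this:


#!/bin/bash
# jobsonnode: print the IDs of jobs running on a given node
# usage: jobsonnode compute001
node="$1"
qstat -n -1 | awk -F. -v node="$node" '$0 ~ node {print $1}'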

At this point some judgement calls need to be made. With luck, the only jobs on the node are those already known to be stuck. If there are others, it is possible that they will still run to completion; chances are, however, that whatever error affected the node has left all of its jobs in a similar state.
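
If in doubt, the state and resource usage reported for the other jobs give a hint as to whether they are still making progress (the job ID is a placeholder and the output is abridged):


[compute-m ~] qstat -f 123457 | grep -E "job_state|resources_used"
    job_state = R
    resources_used.cput = 112:34:56
    resources_used.walltime = 14:02:10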

If a privileged user can log in to the relevant nodes, the individual jobs can be killed by their process IDs, if they are still active. Another option is to restart pbs_mom on the compute nodes in question with the -p option (preserve running jobs); in TORQUE this is the default behaviour from version 2.4.0 onwards. Then run `qdel job_id` again.
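
Assuming the node is still reachable over ssh, one or both of those options might look roughly like this (the username and PID are placeholders; momctl is the TORQUE utility for controlling pbs_mom):


[compute-m ~]# ssh compute001
[compute001 ~]# ps -u jobowner -f     # identify the stuck job's processes
[compute001 ~]# kill -9 12345         # kill them by process ID
[compute001 ~]# momctl -s             # shut down the running pbs_mom (TORQUE)
[compute001 ~]# pbs_mom -p            # restart it, preserving running jobs
[compute001 ~]# exit
[compute-m ~]# qdel job_id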

Often, however, the node will not be accessible via ssh; the most common example of this situation is when the kernel has hit an out-of-memory error. If this is the case, the entire node should be taken offline and marked as such:


[compute-m ~] pbsnodes -o -N "OOM memory error, 01-01-2016 LL" compute001
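
The offline state and the attached note can then be confirmed from the management node (output is indicative):


[compute-m ~] pbsnodes -l -n
compute001           offline     OOM memory error, 01-01-2016 LL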

In this case the node will need to be rebooted. With a cluster management tool such as xCAT, the node can be rebooted from the management node. For example:


[compute-m ~]# rpower compute001 boot
compute001: reset
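
While waiting for the node to boot, progress can be checked from the xCAT management node, e.g. (output is indicative):


[compute-m ~]# rpower compute001 stat
compute001: on
[compute-m ~]# nodestat compute001
compute001: sshd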

Wait for the node to come back up, restart pbs_mom on it, clear the offline state, and verify:


[compute-m ~]# ping compute001
PING compute001 (172.26.4.53) 56(84) bytes of data.
64 bytes from compute001 (172.26.4.53): icmp_seq=1 ttl=64 time=0.141 ms
64 bytes from compute001 (172.26.4.53): icmp_seq=2 ttl=64 time=0.120 ms
^C


[compute-m ~]# ssh compute001
[compute001 ~]# pbs_mom
[compute001 ~]# exit
[compute-m ~]# pbsnodes -c compute001
[compute-m ~]# pbsnodes -a | less
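
Finally, it is worth confirming that the stuck job has actually gone, and re-running the deletion if it still appears in the queue:


[compute-m ~]# qstat -f job_id
[compute-m ~]# qdel job_id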