Batchholds, Leap Seconds, and PBS Restarts

Submitted by lev_lafayette on Sat, 02/27/2016 - 12:41

It is not unusual for a few jobs to fall into a batchhold state when one is managing a cluster; users often write PBS submissions with errors in them (such as requesting more cores than are actually available). When sysadmins have the opportunity, they should check such scripts and show the users what went wrong.

When more and more jobs start falling into that state, however, it is an indication that something is going very wrong indeed. Sure, some users make uninformed decisions, but when experienced users have jobs consistently falling into batchhold, you know there's a problem.

The first thing to do is to run checkjob -vv [jobid] to have a look at the various resources requested. In this case, something quite unusual came up:

Message[0] cannot start job on reserved resources - cannot start job 2072852 - RM failure, rc: 15012, msg: 'PBS_Server System error: Inappropriate ioctl for device'

Message[1] cannot start job 2072852 - RM failure, rc: 15012, msg: 'PBS_Server System error: Inappropriate ioctl for device'
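When many jobs are held at once, it helps to check whether they all share one failure reason. The following is a minimal sketch only, assuming checkjob output is piped in and that the messages follow the format shown above; the function name summarise_rm_failures is hypothetical and not part of Moab or TORQUE.

```shell
# Hypothetical helper: tally the distinct "RM failure" reasons found in
# checkjob output read from stdin.
summarise_rm_failures() {
    # grep -o prints each match on its own line, even when several
    # messages share one input line; uniq -c then counts duplicates
    # and the final sort puts the most frequent reason first.
    grep -o "RM failure, rc: [0-9]*, msg: '[^']*'" | sort | uniq -c | sort -rn
}

# Usage (assumes checkjob is on the PATH; 2072852 is the job shown above):
#   checkjob -vv 2072852 | summarise_rm_failures
```

If a single reason accounts for nearly every held job, the cause is more likely to be on the server side than in the users' scripts.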

This is a relatively obscure error of the "not a typewriter" variety (ENOTTY), and one that can lead to a great deal of speculation about what is possibly going wrong.

Fortunately, the eagle-eyed NinjaDan noticed another interesting event occurring at the same time: qpidd, an AMQP message broker daemon, was eating up a lot of resources given its relatively minor, if important, role. It is also prone to a rather unexpected leap second bug. As the link explains, "The qpidd itself does not have to be restarted, simply stopping ntpd, setting the date manually and starting ntpd again fixes the problem."
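The quoted workaround can be sketched as follows. This is an illustration under stated assumptions, not the exact commands used on this cluster: it assumes ntpd is managed with "service ntpd" (the service name and init system vary by distribution), and it sets the clock to its own current value, which is one commonly cited way of "setting the date manually". The helper names run and reset_clock_after_leap_second are hypothetical. The sketch only prints commands unless DRY_RUN=0 is set.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 and run as root to actually apply them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper implementing the three quoted steps.
reset_clock_after_leap_second() {
    run service ntpd stop       # assumption: ntpd is a "service"-managed daemon
    run date -s "$(date)"       # set the clock to its own current value
    run service ntpd start
}

reset_clock_after_leap_second
```

Reviewing the dry-run output first is a deliberate choice, since changing the system clock on a head node is not something to do by accident.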

That also explains why so many jobs were falling into batchhold with the unusual resource request error; it was, of all things, time related. It seemed opportune, then, to restart the PBS server on the head node and push those jobs through with their new time resource allocation.

The first step is to run the qterm -t quick command, which puts the PBS server into a terminating state. The advantage of this command is that it keeps running jobs operating. Do not simply restart pbs_server, as that will kill running jobs. After that, it was simply a case of bringing the server up again with pbs_server and pushing through those jobs that were in batchhold with the following short script:

for i in $(showq -b | grep [username] | awk '{print $1}'); do mjobctl -w flags=ignorepolicies -x "$i"; done
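A slightly more defensive variant of that loop is sketched below. It assumes that showq -b prints the job ID in the first column and the username in the second, which should be checked against the local Moab version; the function names held_job_ids and release_commands are hypothetical. Rather than executing anything, it prints the mjobctl commands so they can be reviewed first.

```shell
# Hypothetical helper: print job IDs belonging to one user, reading
# "showq -b" style output from stdin.
# Assumption: column 1 is the job ID and column 2 is the username.
held_job_ids() {
    awk -v user="$1" '$2 == user { print $1 }'
}

# Hypothetical helper: emit one mjobctl command per held job for review.
release_commands() {
    held_job_ids "$1" | while read -r job; do
        printf 'mjobctl -w flags=ignorepolicies -x %s\n' "$job"
    done
}

# Usage (someuser is a placeholder):
#   showq -b | release_commands someuser        # review the commands
#   showq -b | release_commands someuser | sh   # then execute them
```

The exact comparison in awk avoids a pitfall of grep, which matches the username anywhere on the line and so would also catch other users whose names contain it.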

Very soon afterwards the cluster, which had fallen to under 50% utilisation, was humming along at 90%.
