The Danger of Reusing Old Scripts

Did you know you can bring down an entire HPC cluster with an old script? Well, this week I had such an experience. As the systems administrator of a seriously aging cluster serving over 800 post-graduate and post-doctoral researchers, I find that "stress" is a normal part of daily life (for future reference: it's probably killing me).

The situation is as follows: our cluster, as mentioned, has over eight hundred researchers engaged in over three hundred projects. Established quite a few years ago, it is seriously lacking in user and project storage - less than 35 terabytes in total, a substantial amount when the cluster was first established, but not so much today. Whilst there are quotas on the project directories, due to the peculiarities of when it was established, there were no quotas on the home directories. To say the least, this was not a good policy, and one with terrible consequences.

What happened is that a user ran a job script, but was inattentive to both its contents and the environment they were running it in. The problem in the script is the line:

cp -r $TMPDIR/* $PBS_O_WORKDIR/
Which would be fine if there actually was a $TMPDIR; because there is not, the shell expands $TMPDIR/* to /*, and the script copies everything on the compute node it was running on into $PBS_O_WORKDIR - with disastrous consequences. The NFS-mounted /home partition fills up and, being NFS, it really does its best to try again... and again... and again - and as other jobs attempt to do the same thing, the problem escalates. Repairing such problems typically involves shuffling some data to a new location, unmounting and remounting the NFS shares on the login node client, and restarting the NFS services on the storage arrays. You would be surprised how long this can take.
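This whole class of accident can be caught before any copy runs. A minimal sketch, using bash's ${VAR:?} expansion (the variable names follow the script above; the guard itself is my suggestion, not part of the original script):

```shell
#!/bin/bash
# A hypothetical guard for the copy-back step of a job script.
# When TMPDIR is unset or empty, "$TMPDIR/*" silently expands to
# "/*"; the ${TMPDIR:?} expansion makes the shell refuse instead.
unset TMPDIR    # simulate the cluster where no TMPDIR is provided

if ( : "${TMPDIR:?TMPDIR is not set}" ) 2>/dev/null; then
    cp -r "$TMPDIR"/* "$PBS_O_WORKDIR/"
else
    echo "refusing to copy: TMPDIR is unset or empty"
fi
```

The ${TMPDIR:?message} expansion aborts the (sub)shell when TMPDIR is unset or null, so the cp is never reached and /home stays intact.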

Back to the script: the scratch directory is set earlier in the script with the line:

export GAUSS_SCRDIR=$TMPDIR

This effectively overwrites the directory that is established when the module is loaded, setting it to null. i.e.,

[lev@edward028 ~]$ module load gaussian
[lev@edward028 ~]$ echo $GAUSS_SCRDIR
/tmp
[lev@edward028 ~]$ echo $TMPDIR

[lev@edward028 ~]$ export GAUSS_SCRDIR=$TMPDIR
[lev@edward028 ~]$ echo $TMPDIR

[lev@edward028 ~]$ echo $GAUSS_SCRDIR

[lev@edward028 ~]$

This leads to a very bad result.

It can be fixed by switching the two variables around:

[lev@edward ~]$ qsub -l nodes=1:ppn=2 -I -X
qsub: waiting for job 2114322.edward-m to start
qsub: job 2114322.edward-m ready
[lev@edward028 ~]$ module load gaussian/g09
[lev@edward028 ~]$ module display gaussian/g09

module-whatis	 Electronic structure modelling program for computational chemistry. (vg09) 
setenv		 GAUSS_SCRDIR /tmp 
setenv		 g09root /usr/local/gaussian/g09 
prepend-path	 GAUSS_EXEDIR /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09 
prepend-path	 GMAIN /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09 
prepend-path	 PATH /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09 
prepend-path	 LD_LIBRARY_PATH /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09 
setenv		 GAUSS_LEXEDIR /usr/local/gaussian/g09/linda-exe 
setenv		 GAUSS_ARCHDIR /usr/local/gaussian/g09/arch 
setenv		 G03BASIS /usr/local/gaussian/g09/basis 
setenv		 F_ERROPT1 271,271,2,1,2,2,2,2 
set-alias	 g03 g09 
[lev@edward028 ~]$ echo $GAUSS_SCRDIR
/tmp
[lev@edward028 ~]$ export TMPDIR=$GAUSS_SCRDIR
[lev@edward028 ~]$ echo $TMPDIR
/tmp

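Put back into script form, the fix amounts to reversing the assignment. A minimal sketch, with the module's setenv simulated by a plain export so it can be read (and run) in isolation:

```shell
#!/bin/bash
# Simulate what "module load gaussian/g09" does on this cluster
# (per the module display output: setenv GAUSS_SCRDIR /tmp).
export GAUSS_SCRDIR=/tmp
unset TMPDIR                      # no TMPDIR on this cluster

# Wrong order (the old script): GAUSS_SCRDIR=$TMPDIR -> both null.
# Right order: derive TMPDIR from the module's scratch directory.
export TMPDIR=$GAUSS_SCRDIR

echo "TMPDIR=$TMPDIR"             # TMPDIR=/tmp
```

With this ordering, the later copy-back glob expands under /tmp rather than /.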
The script itself looked very familiar, so I did a bit of looking around, and - to the amazement of a coworker - I found it in my files collection in less than thirty seconds. It turns out it was exactly the same as the one which had been run, minus an identifying header - a header which noted the people who had written it (not me, thank goodness), when it had been written (seven years ago), and the machine and environment it had been written for (a long-retired cluster), which presumably had a very different $PATH environment for the application in question.

The user acknowledged that they didn't know any of this, and had simply run a script that had been given to them by their supervisor without checking its contents or the environment.

There are several lessons that can be learned from this. Firstly, scripts that include globbing need to be treated with a great deal of care. Secondly, the environment a script expects, and the effects it will have on environment paths, need to be checked before running it. Thirdly, quotas should always be imposed on a shared system.