The Danger of Reusing Old Scripts
Did you know you can bring down an entire HPC cluster with an old script? Well, this week I had such an experience. As the systems administrator for a seriously aging cluster with over 800 post-graduate and post-doctoral researchers, "stress" is a normal part of daily life (for future reference: it's probably killing me).
The situation is as follows: our cluster, as mentioned, has over eight hundred researchers engaged in over three hundred projects. Established quite a few years ago, it is seriously lacking in user and project storage: less than 35 terabytes in total, a substantial amount when the cluster was first established, not so much today. Whilst there are quotas on the project directories, due to the peculiarities of when it was established, there were no quotas on the home directories. To say the least, this was not a good policy, and it had terrible consequences.
What happened is that a user ran a job script, but was inattentive to both its contents and the environment that they were using it in. The problem in the script is the line:
cp -avf $TMPDIR/* $PBS_O_WORKDIR
This would be fine if there actually was a $TMPDIR; because there is not, $TMPDIR/* expands to /*, and the command copies /* into $PBS_O_WORKDIR. That was everything on the compute node it was running on, with disastrous consequences: the NFS-mounted /home partition fills up and, being NFS, it really does its best to try again... and again... and again. As other writes attempt to do the same thing, the problem escalates. Repairing such problems typically involves shuffling some data to a new location, unmounting and remounting the NFS services on the login node client, and restarting the NFS services on the storage arrays. You would be surprised how long this can take.
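The expansion described above can be reproduced harmlessly by letting echo stand in for cp. What follows is only a minimal sketch, assuming a bash-like shell; the function names show_expansion and guarded_pattern are made up for illustration and are not part of the original job script.

```shell
#!/bin/bash
# Harmless reproduction: echo stands in for cp, so nothing is copied.

# Print the source pattern the job script builds when TMPDIR is unset.
show_expansion() {
    (
        unset TMPDIR                 # subshell: the caller's TMPDIR is untouched
        echo "pattern: $TMPDIR/*"    # quoted, so the glob is shown, not expanded
    )
}

# Same pattern, guarded: ${TMPDIR:?message} makes the command fail with an
# error on stderr when TMPDIR is unset or empty, instead of yielding "/*".
guarded_pattern() {
    (
        echo "pattern: ${TMPDIR:?TMPDIR is not set}/*"
    )
}

show_expansion    # prints: pattern: /*
```

In the job script itself, the equivalent guard would be cp -avf "${TMPDIR:?}"/* "$PBS_O_WORKDIR", which refuses to run rather than quietly falling back to /*.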
Back to the script: $TMPDIR is supposed to be established earlier in the script, with the line:
export GAUSS_SCRDIR=$TMPDIR
Because $TMPDIR is empty at that point, this effectively overwrites the $GAUSS_SCRDIR directory that is established when the module is loaded, setting it to null, i.e.,
[lev@edward028 ~]$ module load gaussian
[lev@edward028 ~]$ echo $GAUSS_SCRDIR
/tmp
[lev@edward028 ~]$ echo $TMPDIR

[lev@edward028 ~]$ export GAUSS_SCRDIR=$TMPDIR
[lev@edward028 ~]$ echo $TMPDIR

[lev@edward028 ~]$ echo $GAUSS_SCRDIR

[lev@edward028 ~]$
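This leads to a very bad result: $TMPDIR is still empty, and now so is $GAUSS_SCRDIR.
It can be fixed by switching the two variables around, so that $TMPDIR takes the value the module set for $GAUSS_SCRDIR.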
[lev@edward ~]$ qsub -l nodes=1:ppn=2 -I -X
qsub: waiting for job 2114322.edward-m to start
qsub: job 2114322.edward-m ready
[lev@edward028 ~]$ module load gaussian/g09
[lev@edward028 ~]$ module display gaussian/g09
-------------------------------------------------------------------
/usr/local/Modules/modulefiles/gaussian/g09:
module-whatis Electronic structure modelling program for computational chemistry. (vg09)
setenv GAUSS_SCRDIR /tmp
setenv g09root /usr/local/gaussian/g09
prepend-path GAUSS_EXEDIR /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09
prepend-path GMAIN /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09
prepend-path PATH /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09
prepend-path LD_LIBRARY_PATH /usr/local/gaussian/g09/bsd:/usr/local/gaussian/g09/private:/usr/local/gaussian/g09
setenv GAUSS_LEXEDIR /usr/local/gaussian/g09/linda-exe
setenv GAUSS_ARCHDIR /usr/local/gaussian/g09/arch
setenv G03BASIS /usr/local/gaussian/g09/basis
setenv F_ERROPT1 271,271,2,1,2,2,2,2
setenv TRAP_FPE OVERFL=ABORT;DIVZERO=ABORT;INT_OVERFL=ABORT
setenv MP_STACK_OVERFLOW OFF
setenv KMP_DUPLICATE_LIB_OK TRUE
set-alias g03 g09
-------------------------------------------------------------------
[lev@edward028 ~]$ echo $GAUSS_SCRDIR
/tmp
[lev@edward028 ~]$ export TMPDIR=$GAUSS_SCRDIR
[lev@edward028 ~]$ echo $TMPDIR
/tmp
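Inside a job script, the same assignment could be made defensively, so that a missing scratch directory stops the job before any copying starts. This is a rough sketch rather than what the original script did: the function name resolve_scratch is hypothetical, and the preference order (an existing TMPDIR first, then the module's GAUSS_SCRDIR) is an assumption.

```shell
#!/bin/bash
# resolve_scratch (hypothetical helper): print a usable scratch directory,
# or fail with a message on stderr.
# Assumed preference order: existing TMPDIR, then GAUSS_SCRDIR from the module.
resolve_scratch() {
    local dir="${TMPDIR:-${GAUSS_SCRDIR:-}}"

    # Both variables unset or empty: refuse rather than fall back to "/".
    if [ -z "$dir" ]; then
        echo "no scratch directory: TMPDIR and GAUSS_SCRDIR are both empty" >&2
        return 1
    fi

    # The directory must exist and be writable on this node.
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "scratch directory '$dir' is missing or not writable" >&2
        return 1
    fi

    printf '%s\n' "$dir"
}
```

A job script might then begin with TMPDIR=$(resolve_scratch) || exit 1, followed by export TMPDIR, so that every later reference to $TMPDIR is known to point at a real directory.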
The script itself looked very familiar, so I did a bit of looking around, and, to the amazement of a coworker, I found it in my files collection in less than thirty seconds. It turns out it was exactly the same as the one which had been run, minus an identifying header: a header which noted the people who had written it (not me, thank goodness), when it had been written (seven years ago), and the machine and environment it had been written for (a long-retired cluster), which presumably had a very different $PATH environment for the application in question.
The user acknowledged that they didn't know any of this, and had simply run a script that had been given to them by their supervisor without checking its contents or the environment.
There are several lessons that can be learned from this. Firstly, scripts that include globbing need to be treated with a great deal of care. Secondly, the environment for a script, and the effects the script will have on environment paths, need to be checked before running it. Thirdly, quotas should always be imposed on a shared system.
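The second lesson can be partly automated: an inherited script can check, at the very top, that every variable it depends on is actually set on the machine it is running on. Below is one possible sketch; the function name require_vars is invented for illustration, and the list of variables to check is whatever the script in question happens to use.

```shell
#!/bin/bash
# require_vars (hypothetical helper): return 1 if any named variable is
# unset or empty, reporting each problem on stderr; return 0 otherwise.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        # Accept only plain identifiers, so the eval below is safe.
        case "$name" in
            ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
                echo "invalid variable name: $name" >&2
                missing=1
                continue
                ;;
        esac
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "required variable $name is unset or empty" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the script in this story, a first line such as require_vars TMPDIR PBS_O_WORKDIR || exit 1 would have ended the job with an error message instead of a full /home partition.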