Deleting "Stuck" Compute Jobs
Submitted by lev_lafayette on Fri, 01/22/2016 - 04:03Often on a cluster a user launches a compute job only to discover that they have some need to delete it (e.g., the data file is corrupt, there was an error in their application commands or PBS script). In TORQUE/PBSPro/OpenPBS etc this can be carried out by the standard PBS command, qdel.
[compute-login ~] qdel job_id
Sometimes however that simply doesn't work. An error message like the following is typical: "qdel: Server could not connect to MOM". I think I've seen this around a hundred times in the past few years.