Power failure again, but this time with a touch of Murphy: the power went down just long enough to kill the jobs, but not long enough to shut the nodes down for good. The nodes came back up, Slurm restarted the jobs (from their state three weeks ago), and this was when things went to hell in a handbasket: the restart files that would have been needed to continue the long jobs were overwritten, making a proper restart impossible. As a result, a --no-requeue flag was added to the NAMDjob script.
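For reference, here is a minimal sketch of what that change might look like as a batch directive. The actual script is not shown here, so the job name is a hypothetical placeholder and the launch command is left out; only the `--no-requeue` option itself comes from the fix described above.

```bash
#!/bin/bash
#SBATCH --job-name=namd-long   # hypothetical job name, for illustration only
#SBATCH --no-requeue           # ask Slurm not to requeue this job after a node failure

# ... the existing NAMD launch command stays here, unchanged ...
```

The same option can also be given at submission time, as in `sbatch --no-requeue <script>`. Either way, a job killed by a node failure should stay dead instead of starting over from its initial state and overwriting the restart files.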