Dec 1st, 2009

Power failure again, but this time with a touch of Murphy's law: the power went down long enough to kill the running jobs, but not long enough to shut the nodes down for good. The nodes came up again, Slurm restarted the jobs (from their state of three weeks ago), and this was when things went to hell in a handbasket: the restart files that would have been needed to continue the long jobs were overwritten, making a proper restart impossible. As a result, a --no-requeue flag was added to the NAMDjob script.
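For reference, a minimal sketch of how such a flag can appear in a Slurm batch script. Only the --no-requeue option comes from the note above. The job name and the placeholder command are hypothetical, and the real NAMDjob script may pass the flag differently (for example on the sbatch command line).

```shell
#!/bin/bash
#SBATCH --job-name=namd_long_run   # hypothetical job name, for illustration only
#SBATCH --no-requeue               # ask Slurm not to requeue this job after a node failure

# Placeholder for the actual NAMD invocation, which is not shown in this note.
echo "NAMD run would start here (requeue disabled)"
```

With requeueing disabled, a job killed by a node failure is not started again automatically, so the existing restart files are left untouched and the run can be continued by hand from the latest checkpoint.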

maintenance/dec_1st_2009.txt · Last modified: 2009/12/01 12:52 (external edit)