And suddenly, NAMD jobs started bombing out at step (8 x LDBperiod) with the following messages from the balancer:
ENERGY: 60240 58.8727 162.3848 79.1835 5.3860 -397348.5272 36382.6634 0.0000
LDB: ============= START OF LOAD BALANCING ============== 2252.97
LDB: ============== END OF LOAD BALANCING =============== 2252.97
ENERGY: 60320 54.5252 157.6152 83.2358 11.1675 -397481.1164 36760.0562 0.0000
LDB: ============= START OF LOAD BALANCING ============== 2259.16
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Model for object 0 found
LB: Singular Matrix
...
LB: Model for object 10239 found
LB: New model completely constructed
LDB: TIME 2259.75 LOAD: AVG 4.87852 MAX 5.20117 PROXIES: TOTAL 828 MAXPE 53 MAXPATCH 4 None 1.25924
LDB: TIME 2259.79 LOAD: AVG 4.87852 MAX 5.11777 PROXIES: TOTAL 828 MAXPE 53 MAXPATCH 4 RefineTorusLB 1.25924
LDB: ============== END OF LOAD BALANCING =============== 2259.79
ENERGY: 64320 60.0195 155.8684 75.8565 6.6394 -397465.2050 36699.6039 0.0000
LDB: ============= START OF LOAD BALANCING ============== 2504.88
Error in estimation:
object 0: real time=0.000000, model error=0.000689, default error=0.000000
object 1: real time=0.000000, model error=-0.020384, default error=0.000000
object 2: real time=0.000000, model error=0.000100, default error=0.000000
object 3: real time=0.000000, model error=-0.018001, default error=0.000000
...
Banging my head against the wall didn't help. Changing executables, scripts, …, didn't either. Switching off the balancer with “ldBalancer none” or “ldbStrategy none” did the trick. Along the way, realised that the newest (CVS) executable of 2.7b was giving better performance for multi-node jobs and slightly worse for single-node jobs. So, installed the newest executable, changed NAMDjob, and generally made a confusing mess.
After all that, the balancer problem automagically disappeared the next day (with exactly the same scripts, executables and nodes). Definitely sounds and feels like a hardware problem, but where? The switch, a node's memory, …? Decided to let it rest until something more solid comes along …
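
In case the problem comes back, a quick way to check old and new log files for the same symptoms might look like the sketch below. It only searches for the two message strings seen in the log excerpt above and notes the last ENERGY step printed before each one first appears. Treating those strings as failure signatures is my own assumption, and the function name scan_log is made up for illustration; none of this comes from the NAMD or Charm++ documentation.

```python
import re

# Message strings copied from the log excerpt; using them as failure
# signatures is an assumption, not documented NAMD/Charm++ behaviour.
BALANCER_MARKERS = ("LB: Singular Matrix", "Error in estimation:")

# An ENERGY line starts with the timestep number.
ENERGY_RE = re.compile(r"^ENERGY:\s+(\d+)")


def scan_log(lines):
    """Count balancer markers and record the last ENERGY step seen
    before the first occurrence of each marker.

    `lines` is any iterable of log lines (a list or an open file).
    Returns {marker: {"after_step": int or None, "count": int}}.
    """
    last_step = None
    hits = {}
    for raw in lines:
        line = raw.strip()
        match = ENERGY_RE.match(line)
        if match:
            last_step = int(match.group(1))
            continue
        for marker in BALANCER_MARKERS:
            if marker in line:
                entry = hits.setdefault(
                    marker, {"after_step": last_step, "count": 0}
                )
                entry["count"] += 1
    return hits
```

To use it on a real run, pass an open file, for example `scan_log(open("run.log"))`, where run.log is a placeholder name. An empty result means neither message shows up in that log.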