June 9th, 10th, 2010

And suddenly, NAMD jobs started bombing out at step (8 x LDBperiod), with the following messages from the load balancer:

ENERGY:   60240        58.8727       162.3848        79.1835         5.3860        -397348.5272     36382.6634         0.0000       

LDB: ============= START OF LOAD BALANCING ============== 2252.97
LDB: ============== END OF LOAD BALANCING =============== 2252.97

ENERGY:   60320        54.5252       157.6152        83.2358        11.1675        -397481.1164     36760.0562         0.0000       

LDB: ============= START OF LOAD BALANCING ============== 2259.16
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Singular Matrix
LB: Model for object 0 found
LB: Singular Matrix
...
LB: Model for object 10239 found
LB: New model completely constructed
LDB: TIME 2259.75 LOAD: AVG 4.87852 MAX 5.20117  PROXIES: TOTAL 828 MAXPE 53 MAXPATCH 4 None 1.25924
LDB: TIME 2259.79 LOAD: AVG 4.87852 MAX 5.11777  PROXIES: TOTAL 828 MAXPE 53 MAXPATCH 4 RefineTorusLB 1.25924
LDB: ============== END OF LOAD BALANCING =============== 2259.79
ENERGY:   64320        60.0195       155.8684        75.8565         6.6394        -397465.2050     36699.6039         0.0000       

LDB: ============= START OF LOAD BALANCING ============== 2504.88
Error in estimation:
object 0: real time=0.000000, model error=0.000689, default error=0.000000
object 1: real time=0.000000, model error=-0.020384, default error=0.000000
object 2: real time=0.000000, model error=0.000100, default error=0.000000
object 3: real time=0.000000, model error=-0.018001, default error=0.000000
...

Banging my head against the wall didn't help. Changing executables, scripts, …, didn't either. Switching off the balancer with “ldBalancer none” or “ldbStrategy none” did the trick. Along the way I realised that the newest (CVS) executable of 2.7b was giving better performance for multi-node jobs, and slightly worse performance for single-node jobs. So I installed the newest executable, changed NAMDjob, and generally made a confusing mess.

After all that, the balancer problem automagically disappeared the next day (with exactly the same scripts, executables and nodes). It definitely sounds and feels like a hardware problem, but where? The switch, a node's memory, …? I decided to let it rest until something more solid comes along …
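
In case the problem comes back, a small script could flag the two symptoms seen in the log excerpt above (the "LB: Singular Matrix" lines and the "Error in estimation:" block) together with the last ENERGY step printed before the error. This is only a rough sketch: the function name scan_namd_log and the file name job.log are made up for illustration, and the marker strings are simply copied from the output pasted above.

```python
import re

# Marker strings copied verbatim from the log excerpt in this entry.
SINGULAR = "LB: Singular Matrix"
ESTIMATION = "Error in estimation:"
# An ENERGY line starts with "ENERGY:" followed by the timestep number.
ENERGY = re.compile(r"^ENERGY:\s+(\d+)")


def scan_namd_log(lines):
    """Count balancer warnings and note the last ENERGY step seen
    before the first 'Error in estimation:' line (hypothetical helper)."""
    last_step = None      # most recent timestep printed on an ENERGY line
    singular = 0          # number of 'LB: Singular Matrix' lines
    saw_error = False     # True once 'Error in estimation:' has appeared
    error_step = None     # value of last_step when the error first appeared
    for line in lines:
        match = ENERGY.match(line)
        if match:
            last_step = int(match.group(1))
        elif line.startswith(SINGULAR):
            singular += 1
        elif line.startswith(ESTIMATION) and not saw_error:
            saw_error = True
            error_step = last_step
    return {
        "singular_matrix": singular,
        "estimation_error": saw_error,
        "last_step_before_error": error_step,
    }
```

It takes any iterable of lines, so an open file works directly, e.g. `with open("job.log") as fh: print(scan_namd_log(fh))`, where job.log stands in for whatever the job's output file is called.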