Tried an old but recently updated 2.6.18 kernel → not surprisingly, couldn't get past the kexec stage.
So, back to the drawing board. I seem to recall that the correction for the AMD-family-related memory problem was a few lines of code in just one kernel module. So, the question is: can the correction be applied to the corresponding module of our old kernel?
It was a simple diff (http://us.generation-nt.com/answer/cpu-mtrrs-linux-kernel-help-207801521.html), but the change is in the core kernel (and not in a module), which means recompiling everything. Bloody hell. The never-ending story …
Tried to find a 2.6.32 kernel with a firmware-independent bnx2 module. Tested kernel-2.6.32-358.el6.x86_64.rpm, kernel-2.6.32-71.el6.x86_64.rpm, kernel-2.6.32-131.21.1.el6.centos.plus.x86_64.rpm, kernel-2.6.32-131.0.15.el6.centos.plus.x86_64.rpm, kernel-2.6.32-71.7.1.el6.centos.plus.x86_64.rpm, kernel-2.6.32.26-175.fc12.x86_64.rpm, kernel-2.6.32.10-44.fc11.x86_64.rpm, kernel-server-2.6.32.8-69mib-1-1mib2010.0.x86_64.rpm. Failed again.
Last-ditch effort? Get a newer VNFS capsule from the CAOS repository, which is based on 2.6.31.6-2. This version already has a firmware-based bnx2 module. To begin with, test it on an as-is basis. If the memory problem persists, then we might get lucky by dropping a new 2.6.32-XXX kernel into this capsule (and then start looking for site-specific changes). Will have to wait till tomorrow …
⇒ This went better. The memory problem persisted and the bnx2 module was not loaded automatically, but after logging in and doing an rmmod-modprobe cycle the network was up and functioning.
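For the record, the rmmod-modprobe cycle can be sketched roughly as below. This is only a minimal sketch: the helper names run and reload_module and the DRY_RUN switch are made up here for illustration and are not part of the capsule or of any tool. The only real commands involved are rmmod and modprobe.

```shell
#!/bin/sh
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: unload a kernel module, then load it again.
# modprobe is attempted only if rmmod succeeded.
reload_module() {
    run rmmod "$1" && run modprobe "$1"
}

# Example (as root, after logging in on the node): reload_module bnx2
```

With DRY_RUN set to 1 the function only prints the two commands, which is handy for checking what would be run before touching the network driver on a remote node.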
⇒ OK, here we go again: try to place the 2.6.38.6-26 kernel + firmware in the new capsule.
⇒ Finally got somewhere. It boots correctly, but slurm (due to its newer version) does not cooperate. Do some benchmarks with NAMD (this is without a CUDA-enabled card).