NAMD 2.8b1 with and without CUDA on an Intel i7 extreme with an Nvidia GTX 295 card

For the non-CUDA runs, eight threads were used. For the CUDA runs, four threads always gave better performance and were used throughout. The non-CUDA runs are based on an SMP-aware executable of NAMD 2.7b1 using the flags '+setcpuaffinity +LBSameCpus'. The CUDA runs are based on the 2.8b1 version of NAMD (obtained directly from the NAMD site). Details of the examples used can be found here (keeping in mind that the value of OutputEnergies was increased to 200 to improve CUDA performance).

All measurements are in nanoseconds per day.

System          With CUDA   Without CUDA   Times faster with CUDA
100K (ApoA1)    1.73        0.37           4.65
60K atoms       5.25        1.66           3.15
25K atoms       10.00       4.55           2.20
6.5K atoms      5.5         15.4           Slower
1.6K atoms      62          76             Slower
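
The last column is simply the CUDA throughput divided by the non-CUDA throughput. As a minimal sketch (the dictionary layout and the function name `speedup` are illustrative, not part of the original benchmark scripts), the ratios can be recomputed from the rounded table values. Because the inputs are rounded to two decimals, the recomputed ratios may differ from the table in the last digit.

```python
# Throughput in ns/day, copied from the table: (with CUDA, without CUDA).
results = {
    "100K (ApoA1)": (1.73, 0.37),
    "60K atoms": (5.25, 1.66),
    "25K atoms": (10.00, 4.55),
    "6.5K atoms": (5.5, 15.4),
    "1.6K atoms": (62.0, 76.0),
}


def speedup(with_cuda, without_cuda):
    """Return how many times faster the CUDA run is (values below 1 mean slower)."""
    return with_cuda / without_cuda


for system, (gpu, cpu) in results.items():
    ratio = speedup(gpu, cpu)
    # Report a ratio only when CUDA actually wins, mirroring the table.
    label = f"{ratio:.2f}x" if ratio >= 1.0 else "slower"
    print(f"{system:<14} {label}")
```

The two smallest systems come out below 1, which is why the table marks them as "Slower" instead of giving a number.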

It is worth noting that for the ApoA1 and 60K systems, the whole quad cluster (8 nodes, 32 cores) produces 1.21 and 3.22 nanoseconds per day respectively, whereas the i7+CUDA box alone produces 1.73 and 5.25 nanoseconds per day. In other words, the single i7+CUDA box is roughly 42% and 63% faster than the entire cluster combined. :-)
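
The percentages follow from the ratio of the two throughputs. A minimal sketch of that arithmetic, using only the figures quoted in the paragraph above (the helper name `percent_faster` is hypothetical):

```python
def percent_faster(fast, slow):
    """Return by what percentage `fast` exceeds `slow` (both in ns/day)."""
    return (fast / slow - 1.0) * 100.0


# ns/day for the quad cluster (8 nodes, 32 cores) and for the single i7+CUDA box.
cluster = {"ApoA1": 1.21, "60K": 3.22}
i7_cuda = {"ApoA1": 1.73, "60K": 5.25}

for system in cluster:
    gain = percent_faster(i7_cuda[system], cluster[system])
    print(f"{system}: i7+CUDA is about {gain:.0f}% faster than the cluster")
```

Since the inputs are rounded to two decimals, the result for ApoA1 can land a point either side of the quoted figure.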

