Farber Upgrade, 2016-06
What follows is a summary of some of the problems encountered after the Farber cluster was upgraded from CentOS 6.5 to CentOS 6.6 and how they were addressed.
Performance suffers moving from CentOS 6.5 => 6.6
HPL benchmarks dropped by between 500 and 1000 GFLOPS after the compute nodes were moved to the 6.6 kernel and associated packages. Other portions of the HPCC test suite tended to perform better (MPI latency was WAY down, bandwidth was up).
After exhausting numerous other ideas and options (and with our scheduled maintenance window at an end), we turned the cluster over to users – some of whom quickly informed us that performance had suffered. For example, some users saw a 4x increase in wall-clock time for the same VASP runs they had been doing prior to the update.
Early this week, after much research, I modified two kernel scheduler parameters that dictate how long a process can run without being preempted by the kernel: by default, both are set to 4 ms; I raised them to between 10 and 15 ms. HPL benchmarks returned to pre-upgrade levels, and users reported their VASP jobs were back to normal.
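The notes above don't name the two parameters, but on this kernel the CFS tunables `kernel.sched_min_granularity_ns` and `kernel.sched_wakeup_granularity_ns` both default to roughly 4 ms and match the description, so a sketch of the change under that assumption might look like:

```
# /etc/sysctl.conf -- a sketch; the parameter names are an assumption
# (the notes name neither), values are in nanoseconds. Apply with
# `sysctl -p` or at boot.
kernel.sched_min_granularity_ns = 10000000     # 10 ms, up from ~4 ms
kernel.sched_wakeup_granularity_ns = 15000000  # 15 ms, up from ~4 ms
```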
I found out that the system does not bother to load the "intel_pstate" cpufreq module by default – likely because on Ivy Bridge frequency scaling is hardware-controlled. Loading that module does give us the ability to read each core's current clock speed:
```
[root@farber ~]# cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
1235937
1196875
3110937
2876562
3189062
1431250
3189062
1235937
3189062
1235937
3189062
3071875
1353125
3189062
1196875
3071875
1235937
3110937
3032812
3189062
```
The module defaults to the "powersave" governor; the alternative is "performance." But since frequency is hardware-controlled, the choice shouldn't matter. Loading this driver does net us the ability to track core frequencies in Ganglia, which could be useful data.
So, the VNFS images were configured to:

- set these scheduler parameters in `/etc/sysctl.conf`
- in `/etc/rc.local`, load the "intel_pstate" cpufreq module and set it to "performance" mode (see the sketch after this list)
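A sketch of what the `/etc/rc.local` half of that could look like – the exact commands are my assumption; only the intent comes from the notes above:

```
# /etc/rc.local addition (sketch) -- load the cpufreq driver, then switch
# every core's governor from the default "powersave" to "performance"
modprobe intel_pstate
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done
```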
Variations on HPL benchmarks
Summary of a series of HPL runs before and after modifying the kernel scheduling granularity:
| when | core count | node count | HPL score |
|---|---|---|---|
| before | 40 | 2 | 0.6096 ± 0.0052 TFLOPS |
| after | 40 | 2 | 0.7636 ± 0.0008 TFLOPS |
| before | 40 | 4 | 0.5336 ± 0.0377 TFLOPS |
| after | 40 | 4 | 0.7847 TFLOPS |
This supports the observations from users.
The 40 core / 4 node test pins 10 MPI workers per node; if only half of each node's cores are busy on average, we're well below the baseline TDP, and the active cores should hit turbo frequencies throughout a large portion of the job (especially since I increased the scheduling granularity by nearly 4x). Spit-balling, the increased clock frequency may be offsetting the increased use of IB links, at least for the HPL testing scheme. Obviously, I need additional 40 core / 4 node runs to get statistical significance, but it's interesting that this layout performed better than the 40 core / 2 node tests despite pushing 50% more of its worker-to-worker traffic over IB (3/4 of any worker's peers are off-node, versus 1/2).
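To spell out the 50% figure: with ranks spread evenly across N nodes, the fraction of any rank's peers that sit off-node is roughly

    f(N) = 1 - 1/N    =>    f(2) = 1/2,  f(4) = 3/4

so the 4-node layout sends 1.5x as many peer pairs across IB as the 2-node layout.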
Automount (Mis)behavior
One research workgroup purchased additional storage on Farber. This storage was broken into three distinct units, each with its own quota. Prior to the upgrades, the units were automounted at `/home/work/geog/[unit]`.
After the upgrade, the workgroup reported that they could no longer get to files under that path. Indeed, on the head node, `cd /home/work/geog/[unit]` now produced a "not found" error, whereas before the appropriate NFS share was mounted. A restart of the `autofs` service was performed – and the old behavior returned.
Other workgroup storage is automounted at `/home/work/[workgroup]` on Farber, and users' home directories under `/home/[uid#]`. These mounts continued to function properly.
When the `autofs` service was restarted on compute nodes, the mounts under `/home/work/geog/[unit]` did NOT regain functionality, making the compute nodes' behavior under the same configuration different from the head node's. The automount map looks like this:
```
/home            program:/opt/etc/automount/users_and_software.sh
/home/work/geog  program:/opt/etc/automount/geog.sh
/home/work       program:/opt/etc/automount/workgroup.sh
```
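For context on the `program:` entries: autofs executes the named script with the lookup key as its argument and expects a sun-format map entry on stdout. A hypothetical sketch of what `geog.sh` might look like – the NFS server and export path here are invented:

```
#!/bin/sh
# Hypothetical program map: autofs passes the lookup key (the [unit]
# name) as $1 and mounts whatever entry we print. The server and export
# path below are placeholders, not Farber's actual configuration.
key="$1"
echo "-rw,hard,intr nfs.example.org:/export/geog/${key}"
```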
Debugging was enabled on a compute node, which demonstrated that the `automount` daemon was attempting to mount the intervening "empty" path `/home/work/geog` first, handing the path `/home/work/geog/[unit]` off to the `/home/work` mountpoint with key `geog`. Since that path is meant to behave as a virtual container for the actual NFS shares, the mount fails and the daemon ceases processing the path `/home/work/geog/[unit]`. This is contrary to previous behavior (and the behavior observed on the head node), whereby the daemon handed key `[unit]` to the `/home/work/geog` mountpoint's program.
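For reference, on CentOS 6 this kind of debug output can typically be switched on via `/etc/sysconfig/autofs` (assuming the stock autofs packaging):

```
# /etc/sysconfig/autofs -- raise automount logging from the default
# ("none") to "debug"; lookups are then logged via syslog, e.g. in
# /var/log/messages
LOGGING="debug"
```

followed by a `service autofs restart` on the node being examined.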