Farber Upgrade, 2016-06

What follows is a summary of some of the problems encountered after the Farber cluster was upgraded from CentOS 6.5 to CentOS 6.6 and how they were addressed.

HPL benchmark scores dropped by between 500 and 1000 GFLOPS after the compute nodes were moved to the 6.6 kernel and its associated packages. Other portions of the HPCC test suite tended to perform better (MPI latency was WAY down, bandwidth was up).

After exhausting numerous other ideas and options (and with our scheduled maintenance window at an end) we turned the cluster over to users – some of whom quickly informed us that performance had suffered. For example, some users saw a 4x increase in wall-clock time for the same VASP runs they had been doing prior to the update.

Early this week, after much research, I modified two kernel scheduler parameters that dictate how long a process can run without being preempted by the kernel: by default, they are set to 4 ms. I set them to values in the 10 to 15 ms range. HPL benchmarks returned to pre-upgrade levels, and users reported their VASP jobs were back to normal.
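
This note doesn't record the exact sysctl names, but on a CentOS 6 kernel the two CFS knobs matching that description (and the 4 ms defaults) are most likely kernel.sched_min_granularity_ns and kernel.sched_wakeup_granularity_ns. A sketch of the /etc/sysctl.conf entries under that assumption, with illustrative 10 ms and 15 ms values:

# assumed parameter names – CFS preemption granularity, in nanoseconds
kernel.sched_min_granularity_ns = 10000000
kernel.sched_wakeup_granularity_ns = 15000000

These take effect at boot; running "sysctl -p" applies them immediately.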

I found out that the system does not load the "intel_pstate" cpufreq module by default – likely because on Ivy Bridge frequency scaling is hardware-controlled. Loading that module does give us the ability to read each core's current clock speed (reported in kHz):

[root@farber ~]# cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
1235937
1196875
3110937
2876562
3189062
1431250
3189062
1235937
3189062
1235937
3189062
3071875
1353125
3189062
1196875
3071875
1235937
3110937
3032812
3189062

The module defaults to the "powersave" governor; the alternative is "performance." But since the frequency is hardware-controlled, the choice shouldn't matter. Loading this driver nets us the ability to track core frequencies in Ganglia, which could be useful data.
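
For reference, the active governor is visible through the same standard cpufreq sysfs tree used above:

[root@farber ~]# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
powersave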

So, the VNFS images were configured to:

  • set the scheduler parameters in /etc/sysctl.conf
  • in /etc/rc.local, load the "intel_pstate" cpufreq module and set it to "performance" mode (a sketch follows this list)
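
A minimal sketch of the rc.local additions, assuming the driver is built as a loadable module on this kernel:

# load the intel_pstate cpufreq driver
modprobe intel_pstate
# switch every core from the default "powersave" governor to "performance"
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$gov"
done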

Summary of a series of HPL runs before and after modifying the kernel scheduling granularity:

when    core count  node count  HPL score
before  40          2           0.6096 ± 0.0052 TFLOPS
after   40          2           0.7636 ± 0.0008 TFLOPS
before  40          4           0.5336 ± 0.0377 TFLOPS
after   40          4           0.7847 TFLOPS

This supports the observations from users.

The 40 core / 4 node test pins 10 MPI workers per node; if only half of each node's cores are busy on average, the node stays well below its baseline TDP and the active cores should hit turbo frequencies throughout a large portion of the job (especially since I increased the scheduling granularity by nearly 4x). Spit-balling, it appears that the increased clock frequency may offset the increased use of IB links – at least for the HPL testing scheme. Obviously, I need additional 40 core / 4 node runs to reach statistical significance, but it's interesting that the 4 node configuration performed better than the 40 core / 2 node tests despite pushing 50% more traffic over IB: with workers spread across 4 nodes, 3/4 of any worker's peers are off-node, versus 1/2 with 2 nodes.

One research workgroup purchased additional storage on Farber. This storage was broken into three distinct units, each with its own quota. Prior to the upgrade, the units were automounted at /home/work/geog/[unit].

After the upgrade, the workgroup reported that they could no longer get to files under that path. Indeed, on the head node, trying to do

$ cd /home/work/geog/[unit]

now produced a "not found" error, whereas before the upgrade the appropriate NFS share was mounted on demand. Restarting the autofs service on the head node restored the old behavior.
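
For reference, the restart in question was the stock CentOS 6 service invocation:

[root@farber ~]# service autofs restart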

Workgroups' storage gets automounted under /home/work/[workgroup] on Farber, and users' home directories under /home/[uid#]. These mounts continued to function properly.

When the autofs service was restarted on compute nodes, the mounts under /home/work/geog/[unit] did NOT regain functionality, making the compute nodes' behavior different from the head node's under the same configuration. The automount map looks like this:

/home              program:/opt/etc/automount/users_and_software.sh
/home/work/geog    program:/opt/etc/automount/geog.sh
/home/work         program:/opt/etc/automount/workgroup.sh
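
The map scripts themselves aren't reproduced here. For context, an autofs program map is an executable that receives the lookup key as its first argument and prints a map entry on stdout; a hypothetical geog.sh for this layout (the server name and export path are invented placeholders) might look like:

#!/bin/sh
# $1 is the key autofs is resolving, e.g. the [unit] name
# the NFS server and export below are placeholders, not the real values
echo "-fstype=nfs,rw nfs.example.edu:/export/geog/$1"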

Debugging was enabled on a compute node, which demonstrated that the automount daemon was attempting to mount the intervening "empty" path /home/work/geog first, handing off the path /home/work/geog/[unit] to the /home/work mount point with key geog. Since that path is meant to behave as a virtual container for the actual NFS shares, the mount fails and the daemon ceases processing the path /home/work/geog/[unit]. This is contrary to the previous behavior (and to the behavior observed on the head node), whereby the daemon handed the key [unit] to the /home/work/geog mount point's program.
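
On CentOS 6 that debug output can be captured by raising the autofs log level and watching syslog, e.g.:

# in /etc/sysconfig/autofs on the node:
LOGGING="debug"

service autofs restart
tail -f /var/log/messages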
