| Both sides previous revision Previous revision | |||
| technical:generic:farber-microcode-201904 [2019-04-23 11:54] – frey | technical:generic:farber-microcode-201904 [2019-04-23 11:57] (current) – [Mitigation] frey | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== 2019 Farber Job Stall ====== | ||
| + | |||
| + | This document summarizes a performance issue reported after the annual cluster maintenance was performed in January of 2019. It outlines proposed actions to mitigate the issue. | ||
| + | |||
| + | ===== The Issue ===== | ||
| + | |||
| + | In March 2019 a workgroup reported ongoing issues with jobs not completing within the wall time limits that had worked prior to the January annual maintenance. | ||
| + | |||
| + | * LAMMPS has a built-in timer facility that causes the program to exit if a given amount of time has elapsed; the timer was not triggering an exit, though. | ||
| + | * Ganglia monitoring showed that the InfiniBand network was in use throughout the jobs (successful or failed) at similar transmission/ | ||
| + | |||
| + | When a failed job was resubmitted it would run successfully in the allotted time, so the job did not simply require more processing time than expected. | ||
| + | |||
| + | When additional data was gathered from the workgroup, it was found that GROMACS jobs were also experiencing occasional random failures. | ||
| + | |||
| + | ===== Updated Kernel and Microcode ===== | ||
| + | |||
| + | The OS update applied to Farber as part of January' | ||
| + | |||
| + | The microcode update was added to the compute node boot image, and 12 nodes owned by IT were rebooted. | ||
| + | |||
| + | * Three users from the workgroup that reported the issues | ||
| + | * Standby queues | ||
| + | |||
| + | Over the course of two weeks, the users funneled a series of jobs that had been experiencing random failures through the 12 nodes. | ||
| + | |||
| + | Though no direct evidence (traces, monitoring) could be gathered to prove conclusively that the missing microcode update is to blame, the empirical evidence from the testing seems clear enough. | ||
| + | |||
| + | ===== Mitigation ===== | ||
| + | |||
| + | All compute nodes in Farber will need to be rebooted in order to apply the microcode update to the processors. | ||
| + | |||
| + | * All queues on all nodes will be disabled. | ||
| + | * Once all jobs running on a node have completed, the node will be rebooted. | ||
| + | * Once the node is online again, its queues will be restored to their previous state and jobs can again run on it. | ||
| + | |||
| + | This staged process will commence at 9:00 a.m. on **April 29, 2019**. | ||
| + | |||
| + | <note important> | ||
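The staged reboot listed under Mitigation (disable all queues, reboot each node once its jobs finish, then restore that node's queues) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Node` class, the node names, and the idea that one job per node finishes on each pass are all hypothetical, and no real scheduler or reboot commands are invoked. It is not the procedure IT actually scripted.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical in-memory model of one compute node."""
    name: str
    running_jobs: int
    queues_enabled: bool = True
    saved_state: bool = True   # queue state remembered before disabling
    rebooted: bool = False


def disable_all_queues(nodes):
    """Step 1: remember each node's queue state, then disable its queues."""
    for node in nodes:
        node.saved_state = node.queues_enabled
        node.queues_enabled = False


def reboot_drained_nodes(nodes):
    """Steps 2 and 3: reboot each idle node once, then restore its queues."""
    finished = []
    for node in nodes:
        if not node.rebooted and node.running_jobs == 0:
            node.rebooted = True                    # stands in for the real reboot
            node.queues_enabled = node.saved_state  # back to the previous state
            finished.append(node.name)
    return finished


def staged_reboot(nodes, max_passes=100):
    """Run the staged process until every node has been rebooted.

    Each pass simulates time moving forward: one running job on every
    not-yet-rebooted node finishes between passes (an assumption made
    purely to keep the simulation short).
    """
    disable_all_queues(nodes)
    order = []
    for _ in range(max_passes):
        order.extend(reboot_drained_nodes(nodes))
        if all(node.rebooted for node in nodes):
            break
        for node in nodes:
            if not node.rebooted and node.running_jobs > 0:
                node.running_jobs -= 1
    return order


if __name__ == "__main__":
    cluster = [Node("n000", 2), Node("n001", 0), Node("n002", 1)]
    # Idle nodes reboot first; busy nodes follow as their jobs drain.
    print(staged_reboot(cluster))
```

The point of the design is that no running job is interrupted: a node only reboots after it has drained, so the cluster returns to service node by node rather than all at once.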