Through most of May and into the first week of June, cluster users reported myriad performance problems with the Lustre filesystem. The reports came almost exclusively from those using content under Mills' /lustre/work, easily the most heavily-used of our Lustre resources.
Users reported sporadic, lengthy hangs when performing even the most basic filesystem operations, like listing a directory. There were also reports of remote file transfer agents (WinSCP, for example) being unable to complete the initial login, most likely due to the same underlying latency when enumerating directory contents.
All of these issues share a root cause: a design choice made by the vendor of the block storage appliance on top of which the Lustre filesystem is built.
Modern disk i/o systems employ memory caches to streamline and accelerate transfers to and from the storage media (hard disks). Because RAM is many times faster at moving data than hard disks, these caches are typically held in standard RAM, which requires an uninterrupted source of power to maintain the bits it stores.
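A rough client-side analogue of the speedup a write cache provides can be seen with dd by comparing a buffered write against one that bypasses the operating system's page cache. (The commands below are purely illustrative; ./cachetest is a hypothetical scratch file on a local disk.) The first, buffered command typically reports a much higher rate because the cache absorbs the writes; the second, with oflag=direct, lands much closer to the raw speed of the underlying media:

> dd if=/dev/zero of=./cachetest bs=1M count=512
> dd if=/dev/zero of=./cachetest bs=1M count=512 oflag=direct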
It is extremely important that any write caching mechanism has the ability to stay online after a loss of power in order to preserve the data that has not been committed to storage media. Historically, i/o adapters had small batteries attached to them for this purpose. The DDN SFA12k-20 behind our Lustre filesystems instead uses a sophisticated in-rack UPS (Uninterruptible Power Supply) to guarantee 10 minutes of time during which the system can empty its write caches. These UPS's are themselves connected to the Computing Center's power infrastructure, which has several redundancies to ensure zero loss of power.
In a facility that lacks such power redundancies, the lead storage batteries inside the UPS's would undergo charge/discharge cycles and degrade over time. DDN assigns the UPS's it uses a nominal two-year life span and has programmed a hard expiration of the UPS at that point into its appliance's software. The UPS's that shipped with our DDN SFA12k-20 had manufacture dates of June 6, 2014, so that two-year lifespan was about to expire.
Despite the fact that the UPS's themselves indicated they were in full health and could sustain 12 minutes of full power draw on power loss, the DDN software demanded that we replace the batteries. Further, it disabled all write caches: if the software cannot trust the UPS, it cannot be sure the write caches could be emptied on a loss of power.
Response time from the vendor on the replacement UPS's was not great:
Date | Event Description |
---|---|
May 5 | IT reports the UPS replacement messages to DDN. All necessary log and system info dumps are included. |
May 10 | DDN ships a single replacement UPS unit. |
May 12 | Further review of the system by UD IT in preparation for UPS installation shows that both UPS's need replacement. |
May 17 | DDN ships a second replacement UPS unit. |
May 17 | A hard disk fails in the SFA12k-20; a case is opened with DDN, a replacement hard disk ships. |
May 19 | A second hard disk fails in the SFA12k-20; a case is opened with DDN, a replacement hard disk ships. |
May 20 | The second UPS unit arrives. IT staff install the two UPS units and reboot the DDN controllers. |
When a controller is rebooted, whatever disks it was servicing fail over to the other controller in the pair. The Lustre servers' multipath service must take notice of this and alter how i/o requests are delivered for those particular disks. Usually this incurs very little interruption on the Lustre servers. All appeared to be working fine at this point.
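On the Lustre servers, one way to sanity-check that such a failover has completed cleanly is to inspect the device-mapper multipath state; a command along the lines of the one below (output formatting varies with the multipath-tools version in use) should show every LUN with at least one active, healthy path once the controllers settle:

> multipath -ll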
Over the next week it became apparent that something was not right with the replacement UPS units. The SFA12k-20 continually reported loss of contact with the UPS units and would reestablish contact 20 seconds later. When contact was reestablished, the SFA12k-20 reported that the battery was more than 2 years old and needed to be replaced.
Date | Event Description |
---|---|
May 27 | IT reports the UPS units' behavior to DDN. All necessary log and system info dumps are included. |
May 27 | DDN ships two (2) replacement UPS units for an SFA12k system. |
The second item bears further discussion: the DDN support technician went back over the previous ticket and found that, even though DDN had our system's serial numbers and the system info dumps clearly identified it as an SFA12k-20, the technician on that case had erroneously sent us UPS units for the older SFA10k model.
Date | Event Description |
---|---|
May 28 | The second set of replacement UPS units arrives. IT staff install the two UPS units and reboot the DDN controllers. |
Again, on the reboot of the controllers the Lustre servers must alter their i/o paths and incur short-term service interruptions.
By the end of the Memorial Day weekend, the DDN SFA12k-20 was still reporting that the UPS batteries were nearing two years of age. Write caches were still disabled, so Lustre performance was becoming increasingly poor as i/o requests piled up awaiting completion. This also meant that jobs were taking longer to execute relative to their runtimes before May 2016, which had serious implications for usage of the (limited wall time) standby queues.
Date | Event Description |
---|---|
Jun 1 | IT reports to DDN that the UPS batteries are still showing a June 6, 2014, date of manufacture. |
That same day, a DDN support technician responded with an unpublished command that could be used in the SFA12k-20 command line interface to wipe the controllers' cached UPS information and reset the manufacture date to the current date and time. After we issued that command on both controllers, all exceptional condition messages disappeared and the write caches were reenabled. As a simple benchmark, the command:
> dd if=/dev/zero of=/lustre/work/it_nss/bigzero bs=1M count=1000
had been seeing bandwidths on the order of 30 MB/s before the write caches were reenabled. Once the write caches were active, the same command saw bandwidth in excess of 300 MB/s.
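As an aside, figures from a plain buffered dd like the one above can also reflect buffering on the Lustre client itself. Appending the standard conv=fdatasync option (shown below as a suggestion, not the command we actually ran) forces the data to be flushed before dd reports its rate and gives a more conservative measurement:

> dd if=/dev/zero of=/lustre/work/it_nss/bigzero bs=1M count=1000 conv=fdatasync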
Any system designed for parallel operation requires locking facilities. A lock is an entity that synchronizes access to a shared resource.
Since Lustre is a massively-parallel filesystem, it makes extensive use of locking, and the lock/unlock calls are spread throughout its code. This makes it necessary for the Lustre programmers to have a very complete understanding of the locking pre- and post-conditions of every function. The sequencing of function calls must be deterministically established during development, to ensure that a function expected to drop a lock on entry is not called by code that has already dropped that same lock. Understandably, there have been myriad bugs in Lustre attributable to its locking facilities and their usage.
Whenever a Lustre component goes offline, i/o operations that target it begin to stack up. For a planned outage, the recovery procedure is fast: the component waits for 5 minutes or until every previously-connected client reconnects before allowing new i/o. Under other circumstances, the recovery may be more complicated since the other Lustre components (e.g. the MDS) may have in-flight dependent operations that have been effectively orphaned by the component's going offline. In these cases, the entire Lustre system must reach a self-consistent state, with all in-flight operations either completing or timing out.
In some cases, bugs exist in the Lustre code that prevent those in-flight operations from ever completing or timing out. Sometimes the problem stems from a lock that can never be dropped. This can degrade system performance since operations needing a lock must wait for one to become available, either through completion of an operation or by timeout.
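As a rough, non-Lustre illustration of why a leaked lock is so painful, consider the standard flock utility (the lock file below is just a hypothetical scratch path). The first command acquires a lock and never releases it; the second then either waits forever or, if given a timeout, gives up without doing its work:

> touch /tmp/leaked.lock
> flock /tmp/leaked.lock sleep infinity &
> flock -w 30 /tmp/leaked.lock echo "acquired the lock" || echo "gave up after 30 seconds"

When a lock has no applicable timeout path at all, as in the case described below, the only remedy is to restart whatever is holding it.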
Date | Event Description |
---|---|
Jun 2 | Several object storage servers (OSS's) begin reporting conditions where zero locks are available for updating group quotas after modifying an object. |
The logged information led us to Lustre bug LU-4807, which is a duplicate of LU-4249. The bug itself seemed to occur after the MDS associated with the Lustre filesystem had gone offline and recovered, with the OSS's staying online through the outage. Locks held prior to the outage were not being dropped.
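For reference, a rough view of how many LDLM locks a Lustre server currently holds can be obtained through the lctl get_param interface; the parameter name below is what we recall from recent Lustre releases and may differ by version:

> lctl get_param ldlm.namespaces.*.lock_count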
Unfortunately, if a locking mechanism has no applicable timeout contingency, the leaked locks can only be recovered by restarting the system. Thus, on the morning of June 3, 2016, all Lustre servers were rebooted to start from scratch.
In February of 2015, DDN altered its stance on UPS battery lifespan. The default lifespan was increased to 3 years in subsequent releases of SFA-OS (the software behind the DDN SFA12k-20), and a command was added to the command line interface that allows a system administrator to reset the battery manufacture date to whatever value s/he chooses. In short, the administrator can reset the manufacture date to avoid both the disablement of write caches and the need to replace the UPS units. One naturally runs some risk in doing so, but when the data center itself guarantees reliability of power, that risk is quite minimal.
The unpublished CLI command that DDN cited to reset the UPS's manufacture dates may well have worked before any battery replacement was attempted, and might have prevented this entire sequence of events. That is of little comfort now, but it could aid others who encounter the same issue.
The extension to the battery lifespan and the additional CLI commands are present in SFA-OS releases beyond 2.2; we are currently running 2.1.1.2. IT is researching a plan to upgrade the SFA12k-20 controllers to the latest SFA-OS release in July during the scheduled maintenance of Farber.