
Recovering a Failed/Failing OST

This documentation was produced as part of the February 2013 failure of a Lustre OST in the Mills cluster.

The /lustre filesystem passed all stages of recovery and was brought back online this morning, Feb 18, around 10:00.

Jeffrey T Frey 2013-02-18 12:39

On Mills, all of the Lustre components are monitored by IT's Nagios system. Within a few hours of a hard disk in Mills' Lustre system failing, Nagios will find out and email a notification to IT so we can replace the disk.

RAID6 allows a virtual disk to remain usable and its data preserved so long as no more than two of the component hard disks fail; if a third disk fails before the original failures have been mitigated, data on the entire virtual disk is lost. In each storage appliance there are 6 RAID6 virtual disks (36 hard disks).

The morning of 02/06, Mills lost two hard disks within a span of five hours – there was only a 14% chance that a second failure would affect the same virtual disk as the first failure, but it did. A spare disk was inserted in the appliance and a rebuild of the RAID6 virtual disk started (the usual response to failure). Within an hour the rebuild failed due to a read error on a third disk from the set – an 11% chance. The combined probability of two subsequent failures affecting the same virtual disk as the first is just 1.7% 1).

With the storage appliance unwilling to rebuild the virtual disk, we had to find an alternate approach to salvaging the data on it. If the virtual disk were completely unusable, a very large part of the Lustre filesystem would be lost and many users adversely affected.

Any data that can be rescued from the failed/failing OST should be archived.

  • Since the OST is really just an EXT4 filesystem, it can be mounted and the tar utility can be used if the disk has bad blocks that are not in use by any files (see the sketch after this list)
    • Lustre uses sparse files, so the tar utility must be given the --sparse flag to preserve sparseness in the archive (as well as the restored image)
    • Lustre sets extended file attributes, so some versions of tar will not create a complete snapshot of the filesystem; use the getfattr utility to preserve a textual dump of all attributes that can later be replayed onto the restored filesystem
  • If bad blocks are in use by the filesystem, tar will fail to archive the contents. In this case, dd with a block size equal to that of the underlying disk and conv=sync,noerror will zero out the bad blocks but continue past them (see the dd example in the case study below)
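
A minimal sketch of the tar-based archive path, assuming the OST mounts read-only as plain EXT4 as described above. The device name (/dev/sdX), mount point, and archive paths are placeholders, not the exact commands run on Mills:

  # Mount the OST read-only so nothing changes while it is being archived
  mount -t ext4 -o ro /dev/sdX /mnt/ost

  # Archive the contents; --sparse keeps sparse regions sparse in the archive
  tar --sparse -cpf /archive/ost-backup.tar -C /mnt/ost .

  # Dump every extended attribute (not just user.*) to a text file that
  # setfattr can replay onto the restored filesystem later
  cd /mnt/ost && getfattr -R -d -m - . > /archive/ost-xattrs.txt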
Case Study: Mills OST Failure

The OST had block errors which necessitated using dd to copy all 8 TiB to an alternate location. The alternate location is showing a mean transfer rate of 16 MiB/s, so the time required to move all 8 TiB of data is 6 days. The transfer started late on Friday, 02/08, putting the target completion time late in the day on Thursday, 02/14 (happy Valentine's day!).

The "slowness" of this operation is due in part to the need to read the OST in 4k chunks – the native block size of the disk. Any larger size would increase the efficiency of the transfer, but would lead to greater loss of data: if reading 2 blocks at a time (8k) then a single faulty 4k block would cause all 8k of data to be lost.

A dd was started early on Friday 02/08, but later failed thanks to the Linux device-mapper subsystem. When the first bad block was encountered, device-mapper could not distinguish between a media error and a communications error, so it tried switching to the alternate SAS link to the disk. When this failed (as expected), it decided the disk was unusable and removed it from the system. Circumventing the device-mapper and using /dev/sdb directly made it past the media errors.
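
A sketch of the sort of dd invocation described here – reading the raw /dev/sdb path directly rather than the multipath device, in 4k blocks, with conv=sync,noerror so unreadable blocks become zero-filled blocks in the image instead of aborting the copy. The output path on the alternate Lustre filesystem is a placeholder:

  # conv=noerror keeps reading past media errors; conv=sync pads each short or
  # failed read out to a full 4k block so the image stays the size of the device
  dd if=/dev/sdb of=/lustre-alt/ost-backup.img bs=4k conv=sync,noerror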

New hard disks must be swapped in and formatted by the storage appliance. If the tar method was used to archive the original disk, the new disk must have a filesystem created on it – for Lustre, this equates to creating and initializing a large EXT4 filesystem, which can take an appreciable amount of time.
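
Had the tar method been used, the replacement virtual disk would first need a new OST filesystem. A minimal sketch using mkfs.lustre, assuming the failed OST's index is reused; the filesystem name, index, MGS NID, and device are placeholders, and the exact options depend on the Lustre version in use:

  # Recreate the OST filesystem so the MDS continues to reference the same OST index
  mkfs.lustre --ost --fsname=lustre --index=12 --mgsnode=10.0.0.1@o2ib /dev/sdY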

Case Study: Mills OST Failure

The storage appliances in Mills use sets of 6 hard disks in a RAID6 configuration for each OST. Data is striped in fixed-size chunks across the 6 disks, with two of the chunks containing parity data calculated from the other four chunks to provide redundancy. Every stripe must be initialized (written) to contain an appropriate set of chunks; this equates to writing 12 TiB of data to the disks – the full raw capacity of the six member disks, not just the 8 TiB of usable space they present.

Since the dd approach was used to archive the original disk, no filesystem creation is necessary (that data is implicit in the backup copy). The storage appliances in Mills allow a RAID6 disk to be initialized in an online mode that does not require the initialization to complete before data can be written to the disk. This is a major benefit, since we can start restoring the backup image almost immediately.

Once the new disk has been prepared to accept data, the backup that was taken must be transferred back from the alternate location.

  • For the tar method, the archive is "untarred" into the new filesystem (see the sketch after this list)
    • If the tar utility does not preserve extended attributes, they must be restored using setfattr
  • If dd was used, the image is transferred to the new disk using dd (see the example in the case study below)
    • Since there should be no unreadable blocks in the archived copy or unwritable blocks on the new disk, larger block sizes should be used to enhance the transfer efficiency
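
A sketch of the tar-based restore, mirroring the archive sketch above; the device name and paths are again placeholders:

  # Mount the freshly formatted OST filesystem and unpack the archive
  mount -t ext4 /dev/sdY /mnt/ost-new
  tar --sparse -xpf /archive/ost-backup.tar -C /mnt/ost-new

  # Replay the extended-attribute dump captured with getfattr; the paths in the
  # dump are relative, so run setfattr from the filesystem root
  cd /mnt/ost-new && setfattr --restore=/archive/ost-xattrs.txt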
Case Study: Mills OST Failure

The backup copy of the disk was written to an alternate Lustre filesystem using Lustre striping – a 4 MiB stripe size across 8 OSTs. While this did not improve the efficiency of creating the backup, the restoration can be performed with input/output block sizes that are aligned with the transfer sizes of the source and destination devices to enhance transfer rates. For example, the input block size might be 8 MiB (2 x 4 MiB) to allow overlapping reads from the two archival Lustre servers.
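
For the dd method actually used on Mills, the restore might look like the following: the 8 MiB input block size spans two 4 MiB Lustre stripes, while the output block size, image path, and device name are illustrative placeholders that should be matched to the destination appliance:

  # ibs=8M reads two 4 MiB stripes at a time, overlapping the two archive servers;
  # obs should be tuned to the RAID stripe width of the new virtual disk
  dd if=/lustre-alt/ost-backup.img of=/dev/sdY ibs=8M obs=1M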

If blocks were lost during the archival, the filesystem must be "checked" to establish a consistent state before it is used. This process is read-intensive, as the entire filesystem must be scanned for missing blocks, unusable blocks, etc. For a Lustre filesystem, following the baseline check another check must be performed against the Lustre metadata server to establish a consistent state across the entire distributed filesystem.
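
The baseline check is an ordinary e2fsck of the restored OST device; a minimal sketch, with the device name a placeholder. The distributed lfsck pass is not sketched here because its invocation is version-specific – at the time it involved building databases with a Lustre-patched e2fsck on the MDS and OSTs before running lfsck from a client.

  # -f forces a full check even if the filesystem is flagged clean;
  # -y answers "yes" to repair prompts so the scan can run unattended
  e2fsck -f -y /dev/sdY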

Case Study: Mills OST Failure

The running time of the baseline e2fsck scales with the size of the filesystem. If few or no repairs are effected, the operation is read-dominant and will proceed as quickly as the disk can provide the requested data. Relatively few blocks were lost in the archival of the Mills OST, so the e2fsck should not be hindered by extensive repairs.

The lfsck is an unknown: we've never performed this procedure before, so it's unclear how long it will take. Since there is no reverse mapping between lost blocks and the Lustre file to which they belonged, it is likely all files on the metadata server containing objects on the OST in question will have to be checked for missing objects.


1)
This ignores the effects of mean-time-between-failure for hard disks, which only serves to make the probability for this particular failure ridiculously remote.