====== Recovering a Failed/Failing OST ======
This documentation was produced as part of the February 2013 failure of a Lustre OST in the Mills cluster.
===== Recovery Status =====
The ''/lustre'' filesystem passed all stages of recovery and was brought back online this morning, Feb 18, around 10:00.
--- //Jeffrey T Frey 2013-02-18 12:39//
===== Nature of the Failure =====
On Mills, all of the Lustre components are monitored by IT's Nagios system. Within a few hours of a hard disk in Mills' Lustre system failing, Nagios will find out and email a notification to IT so we can replace the disk.
RAID6 allows a virtual disk to remain usable and its data preserved so long as no more than two of its component hard disks fail; if a third disk fails before the original failures have been mitigated, data on the entire virtual disk is lost. In each storage appliance there are 6 RAID6 virtual disks of 6 hard disks each (36 hard disks in total).
The morning of 02/06, Mills lost two hard disks within a span of five hours -- there was only a 14% chance (5 of the 35 remaining disks) that the second failure would affect the same virtual disk as the first, but it did. A spare disk was inserted in the appliance and a rebuild of the RAID6 virtual disk started (the usual response to failure). Within an hour the rebuild failed due to a read error on a third disk from the set -- roughly a 12% chance (4 of the 34 remaining disks). The combined probability of two subsequent failures affecting the same virtual disk as the first is just 1.7%((This ignores the effects of mean-time-between-failure for hard disks, which only serves to make the probability for this particular failure ridiculously remote.)).
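The percentages above can be reproduced by counting disks. The following is a minimal sketch, assuming every surviving disk in the appliance is equally likely to be the next to fail (the same simplification the footnote mentions); the function and constant names are ours, chosen for illustration.

```python
from fractions import Fraction

DISKS_PER_APPLIANCE = 36   # 6 virtual disks x 6 hard disks, per the text
DISKS_PER_VDISK = 6


def same_vdisk_probability(failures_so_far):
    """Chance that the next failure lands in the virtual disk that already
    holds `failures_so_far` failed disks (all survivors equally likely)."""
    remaining_in_vdisk = DISKS_PER_VDISK - failures_so_far
    remaining_total = DISKS_PER_APPLIANCE - failures_so_far
    return Fraction(remaining_in_vdisk, remaining_total)


p_second = same_vdisk_probability(1)   # 5 of 35 remaining disks
p_third = same_vdisk_probability(2)    # 4 of 34 remaining disks
p_both = p_second * p_third

print(f"second failure in same virtual disk: {float(p_second):.1%}")  # 14.3%
print(f"third failure in same virtual disk:  {float(p_third):.1%}")   # 11.8%
print(f"both in sequence:                    {float(p_both):.1%}")    # 1.7%
```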
With the storage appliance unwilling to rebuild the virtual disk, we had to find an alternate approach to salvaging the data on it. If the virtual disk were completely unusable, a very large part of the Lustre filesystem would be lost and many users adversely affected.
===== Recovery Procedure =====
==== 1. Archival of old disk ====
Any data that can be rescued from the failed/failing OST should be archived.
* Since the OST is really just an EXT4 filesystem, it can be mounted and archived with the ''tar'' utility, provided that any bad blocks on the disk are not in use by any files
* Lustre uses //sparse files// so the ''tar'' utility must be given the ''--sparse'' flag to preserve sparseness in the archive (as well as the restored image)
* Lustre sets extended file attributes, and some versions of ''tar'' do not preserve them, so the archive alone may not be a complete snapshot of the filesystem; use the ''getfattr'' utility to save a textual dump of all attributes that can later be replayed onto the restored filesystem
* If bad blocks are in use by the filesystem, ''tar'' will fail to archive the contents. In this case, ''dd'' with a block size equal to that of the underlying disk and ''conv=sync,noerror'' will continue past the bad blocks, writing zeros in their place in the copy
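To make the ''conv=sync,noerror'' behaviour concrete, here is a small Python sketch of the same idea: read one block at a time, and when a read fails, write a block of zeros and keep going. This is an illustration only, not what was run on Mills (we used ''dd''); ''FlakyDisk'' is a hypothetical in-memory stand-in for a disk with one unreadable block.

```python
import io


def rescue_copy(src, dst, total_bytes, block_size=4096):
    """Copy block by block; unreadable or short blocks are padded with zeros.

    Returns the indices of the blocks that could not be read.
    """
    bad_blocks = []
    n_blocks = -(-total_bytes // block_size)   # ceiling division
    for index in range(n_blocks):
        offset = index * block_size
        try:
            src.seek(offset)
            data = src.read(block_size)
        except OSError:
            data = b""                         # nothing recovered from this block
            bad_blocks.append(index)
        dst.seek(offset)
        dst.write(data.ljust(block_size, b"\0"))
    return bad_blocks


class FlakyDisk(io.BytesIO):
    """Hypothetical test double: reads starting in a bad block raise OSError."""

    def __init__(self, data, bad_blocks, block_size=4096):
        super().__init__(data)
        self.bad_offsets = {b * block_size for b in bad_blocks}

    def read(self, size=-1):
        if self.tell() in self.bad_offsets:
            raise OSError(5, "Input/output error")
        return super().read(size)


# Four 4 KiB blocks, the third of which (index 2) is unreadable.
source = FlakyDisk(b"\xab" * (4 * 4096), bad_blocks=[2])
image = io.BytesIO()
lost = rescue_copy(source, image, total_bytes=4 * 4096)
print("zero-filled blocks:", lost)
```

The image keeps the same length and offsets as the original, which is what lets a later filesystem check find and repair the damaged regions in place.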
**Case Study: Mills OST Failure**
The OST had block errors which necessitated using ''dd'' to copy all 8 TiB to an alternate location. The alternate location is showing a mean transfer rate of 16 MiB/s, so the time required to move all 8 TiB of data is about 6 days. The transfer started late on Friday, 02/08, putting the target completion time late in the day on Thursday, 02/14 (happy Valentine's Day!).
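The six-day estimate is simple arithmetic on the two figures quoted above. As a quick check (variable names are ours):

```python
TIB = 1024 ** 4
MIB = 1024 ** 2

image_bytes = 8 * TIB        # size of the OST being copied
rate_bytes_per_s = 16 * MIB  # observed mean transfer rate

seconds = image_bytes / rate_bytes_per_s
days = seconds / 86400

print(f"{seconds:.0f} s is about {days:.1f} days")
```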
The "slowness" of this operation is due in part to the need to read the OST in 4k chunks -- the native block size of the disk. Any larger size would increase the efficiency of the transfer, but would lead to greater loss of data: if reading 2 blocks at a time (8k) then a single faulty 4k block would cause all 8k of data to be lost.
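The trade-off between read size and data loss can be sketched as follows. The bad block positions below are hypothetical, picked only to show the effect; the real positions on the Mills OST are not listed here.

```python
NATIVE_BLOCK = 4096  # bytes; native block size of the disk, per the text


def bytes_lost(bad_block_indices, read_size):
    """Bytes zero-filled when every read covering a bad native block is discarded."""
    blocks_per_read = read_size // NATIVE_BLOCK
    damaged_reads = {index // blocks_per_read for index in bad_block_indices}
    return len(damaged_reads) * read_size


hypothetical_bad_blocks = [10, 500, 501]

loss_4k = bytes_lost(hypothetical_bad_blocks, 4096)
loss_8k = bytes_lost(hypothetical_bad_blocks, 8192)
loss_1m = bytes_lost(hypothetical_bad_blocks, 1024 * 1024)

for label, loss in (("4k", loss_4k), ("8k", loss_8k), ("1M", loss_1m)):
    print(f"read size {label}: {loss} bytes lost")
```

With 4k reads only the three bad blocks themselves are lost (12288 bytes); at 1 MiB per read the same three blocks cost 2 MiB of data.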
A first ''dd'' attempt was started early on Friday, 02/08, but later failed thanks to the Linux ''device-mapper'' subsystem. When the first bad block was encountered, ''device-mapper'' could not distinguish between a media error and a communications error, so it tried switching to the alternate SAS link to the disk. When this failed (as expected) it decided the disk was unusable and removed it from the system. Circumventing the ''device-mapper'' and using ''/dev/sdb'' directly allowed the copy to proceed past the media errors.
==== 2. Initialization of new disk ====
New hard disks must be swapped in and formatted by the storage appliance. If the ''tar'' method was used to archive the original disk, the new disk must have a filesystem created on it -- for Lustre, this equates to creating and initializing a large EXT4 filesystem, which can take an appreciable amount of time.
**Case Study: Mills OST Failure**
The storage appliances in Mills use sets of 6 hard disks in a RAID6 configuration for each OST. Data is //striped// in fixed-size chunks across the 6 disks, with two of the chunks containing //parity data// calculated from the other four chunks to provide redundancy. Every stripe must be initialized (written) to contain an appropriate set of chunks; this equates to writing 12 TiB of data to the disks.
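The 12 TiB figure follows from the RAID6 layout: for every four data chunks, six chunks are written. A short check, assuming the 8 TiB quoted for the OST is its usable (data) capacity:

```python
DISKS_PER_SET = 6
PARITY_CHUNKS = 2          # RAID6 keeps two parity chunks per stripe
usable_tib = 8             # assumption: the OST's 8 TiB is usable capacity

data_chunks = DISKS_PER_SET - PARITY_CHUNKS
raw_tib = usable_tib * DISKS_PER_SET / data_chunks

print(f"raw capacity to initialize: {raw_tib:.0f} TiB")
```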
Since the ''dd'' approach was used to archive the original disk, no filesystem creation is necessary (that data is implicit in the backup copy). The storage appliances in Mills allow a RAID6 disk to be initialized in an //online// mode that does not require the initialization to complete before data can be written to the disk. This is a major benefit, since we can start restoring the backup image almost immediately.
==== 3. Restoration of archived disk ====
Once the new disk has been prepared to accept data, the backup that was taken must be transferred back from the alternate location.
* For the ''tar'' method, the archive is "untarred" into the new filesystem.
* If the ''tar'' utility does not preserve extended attributes, they must be restored using ''setfattr''
* If ''dd'' was used, the image is transferred to the new disk using ''dd''
* Since there should be no unreadable blocks in the archived copy or unwritable blocks on the new disk, larger block sizes should be used to enhance the transfer efficiency
**Case Study: Mills OST Failure**
The backup copy of the disk was written to an alternate Lustre filesystem using Lustre //striping// -- a stripe size of 4 MiB across 8 OSTs. While this did not improve the efficiency of the creation of the backup, the restoration can be performed with input/output block sizes that are aligned with the transfer sizes of the source and destination devices to enhance transfer rates. For example, the input block size might be 8 MiB (2 x 4 MiB) to allow overlapping reads from the two archival Lustre servers.
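To see why an aligned 8 MiB read helps, the sketch below maps a read onto the 4 MiB pieces of the backup file. It assumes the simple round-robin layout (piece //n// lives on OST //n// mod 8); how those OSTs are divided between the two archival servers is not stated here, so the sketch stops at OST indices.

```python
MIB = 1024 ** 2
STRIPE_SIZE = 4 * MIB   # per the text
STRIPE_COUNT = 8        # per the text


def osts_for_read(offset, length):
    """OST indices (within the file's layout) touched by one read,
    assuming plain round-robin striping."""
    first = offset // STRIPE_SIZE
    last = (offset + length - 1) // STRIPE_SIZE
    return [stripe % STRIPE_COUNT for stripe in range(first, last + 1)]


aligned = osts_for_read(0, 8 * MIB)            # exactly two pieces
misaligned = osts_for_read(2 * MIB, 8 * MIB)   # straddles three pieces
wrapped = osts_for_read(32 * MIB, 8 * MIB)     # layout wraps back to OST 0

print("aligned 8 MiB read touches OSTs:   ", aligned)
print("misaligned 8 MiB read touches OSTs:", misaligned)
```

An aligned 8 MiB read pulls exactly two whole pieces from two different OSTs, which can be fetched at the same time; a misaligned read of the same size touches three OSTs and fetches two partial pieces.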
==== 4. Filesystem checks ====
If blocks were lost during the archival, the filesystem must be "checked" to establish a consistent state before it is used. This process is read-intensive, as the entire filesystem must be scanned for missing blocks, unusable blocks, etc. For a Lustre filesystem, the baseline check must be followed by another check against the Lustre metadata server to establish a consistent state across the entire distributed filesystem.
**Case Study: Mills OST Failure**
The execution of the baseline ''e2fsck'' scales according to the size of a filesystem. If few or no repairs are effected, the operation is read-dominant and will proceed as quickly as the disk can provide the requested data. Relatively few blocks were lost in the archival of the Mills OST, so the ''e2fsck'' should not be hindered by extensive repairs.
The ''lfsck'' is an unknown: we've never performed this procedure before, so it's unclear how long it will take. Since there is no reverse mapping between lost blocks and the Lustre file to which they belonged, it is likely all files on the metadata server containing objects on the OST in question will have to be checked for missing objects.