If for some reason (perhaps to test reconstruction) it is necessary to pretend a drive has failed, the following will perform that function:
raidctl -f /dev/sd2e raid0
The system will then be performing all operations in degraded mode, where missing data is re-computed from existing data and the parity. In this case, obtaining the status of raid0 will return (in part):
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
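This status can be obtained at any time with the -s option:
raidctl -s raid0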
Note that with the use of -f a reconstruction has not been started. To both fail the disk and start a reconstruction, the -F option must be used:
raidctl -F /dev/sd2e raid0
The -f option may be used first, and then the -F option used later, on the same disk, if desired. Immediately after the reconstruction is started, the status will report:
Components:
/dev/sd1e: optimal
/dev/sd2e: reconstructing
/dev/sd3e: optimal
Spares:
/dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 10% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
This indicates that a reconstruction is in progress. To find out how the reconstruction is progressing, the -S option may be used. This will indicate the progress in terms of the percentage of the reconstruction that is completed. When the reconstruction is finished, the -s option will show:
Components:
/dev/sd1e: optimal
/dev/sd2e: spared
/dev/sd3e: optimal
Spares:
/dev/sd4e: used_spare
[...]
Parity status: clean
Reconstruction is 100% complete.
Parity Re-write is 100% complete.
Copyback is 100% complete.
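While a reconstruction is still running, the progress percentages shown above can be refreshed at any time with the -S option, for example:
raidctl -S raid0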
At this point there are at least two options. First, if /dev/sd2e is known to be good (i.e., the failure was either caused by -f or -F, or the failed disk was replaced), then a copyback of the data can be initiated with the -B option. In this example, this would copy the entire contents of /dev/sd4e to /dev/sd2e. Once the copyback procedure is complete, the status of the device would be (in part):
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
and the system is back to normal operation.
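For completeness, the copyback in this example would have been initiated by applying the -B option to the RAID device:
raidctl -B raid0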
The second option after the reconstruction is to simply use /dev/sd4e in place of /dev/sd2e in the configuration file. For example, the configuration file (in part) might now look like:
START array
1 3 0
START disks
/dev/sd1e
/dev/sd4e
/dev/sd3e
This can be done as /dev/sd4e is completely interchangeable with /dev/sd2e at this point. Note that extreme care must be taken when changing the order of the drives in a configuration. This is one of the few instances where the devices and/or their orderings can be changed without loss of data! In general, the ordering of components in a configuration file should never be changed.
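If the set is configured manually from a file rather than auto-configured, the updated file is simply used for subsequent configurations. Assuming the file is named raid0.conf (the name is only illustrative), that would be:
raidctl -c raid0.conf raid0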
If a component fails and there are no hot spares available on-line, the status of the RAID set might (in part) look like:
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
No spares.
In this case there are a number of options. The first option is to add a hot spare using:
raidctl -a /dev/sd4e raid0
After the hot add, the status would then be:
Components:
/dev/sd1e: optimal
/dev/sd2e: failed
/dev/sd3e: optimal
Spares:
/dev/sd4e: spare
Reconstruction could then take place using -F as described above.
A second option is to rebuild directly onto /dev/sd2e. Once the disk containing /dev/sd2e has been replaced, one can simply use:
raidctl -R /dev/sd2e raid0
to rebuild the /dev/sd2e component. While the rebuild is in progress, the status will be:
Components:
/dev/sd1e: optimal
/dev/sd2e: reconstructing
/dev/sd3e: optimal
No spares.
and when completed, will be:
Components:
/dev/sd1e: optimal
/dev/sd2e: optimal
/dev/sd3e: optimal
No spares.
In circumstances where a particular component is completely unavailable after a reboot, a special component name will be used to indicate the missing component. For example:
Components:
/dev/sd2e: optimal
component1: failed
No spares.
indicates that the second component of this RAID set was not detected at all by the auto-configuration code. The name 'component1' can be used anywhere a normal component name would be used. For example, to add a hot spare to the above set, and rebuild to that hot spare, the following could be done:
raidctl -a /dev/sd3e raid0
raidctl -F component1 raid0
at which point the data missing from 'component1' would be reconstructed onto /dev/sd3e.
When more than one component is marked as 'failed' due to a non-component hardware failure (e.g., loss of power to two components, adapter problems, termination problems, or cabling issues) it is quite possible to recover the data on the RAID set.
The first thing to be aware of is that the first disk to fail will almost certainly be out-of-sync with the remainder of the array. If any I/O was performed between the time the first component is considered 'failed' and when the second component is considered 'failed', then the first component to fail will not contain correct data, and should be ignored.
When the second component is marked as failed, however, the RAID device will (currently) panic the system. At this point the data on the RAID set (not including the first failed component) is still self-consistent, and will be in no worse state of repair than had the power gone out in the middle of a write to a file system on a non-RAID device.
The problem, however, is that the component labels may now have three different 'modification counters' (one value on the first component that failed, one value on the second component that failed, and a third value on the remaining components). In such a situation, the RAID set will not autoconfigure, and can only be forcibly re-configured with the -C option.
To recover the RAID set, one must first remedy whatever physical problem caused the multiple-component failure. After that is done, the RAID set can be restored by forcibly configuring it without the component that failed first. For example, if /dev/sd1e and /dev/sd2e fail (in that order) in a RAID set of the following configuration:
START array
1 4 0
START disks
/dev/sd1e
/dev/sd2e
/dev/sd3e
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5
START queue
fifo 100
then the following configuration (say "recover_raid0.conf")
START array
1 4 0
START disks
absent
/dev/sd2e
/dev/sd3e
/dev/sd4e
START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_5
64 1 1 5
START queue
fifo 100
can be used with
raidctl -C recover_raid0.conf raid0
to force the configuration of raid0. A re-initialization of the component labels with the -I option will then be required in order to synchronize them. At this point the file systems on the RAID set can be checked and corrected. To complete the reconstruction of the RAID set, /dev/sd1e is simply hot-added back into the array, and reconstructed as described earlier.
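For instance, assuming the missing first component is reported under the special name 'component0' (following the naming of the 'component1' example above), the final steps would resemble:
raidctl -a /dev/sd1e raid0
raidctl -F component0 raid0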