ServeRAID - Recovering from multiple disk failures
Recovering from multiple defunct disks in a ServeRAID environment
The ServeRAID controller is designed to tolerate a
single disk failure when configured with redundant RAID levels. There is no
guarantee that any multiple disk failure can be recovered with the data intact.
IBM is providing these steps because they offer the highest possibility of a
successful recovery in the rare circumstance that multiple disk drives are
marked defunct within the same array.
Resource requirements
- IBM ServeRAID Support CD (1)
- ServeRAID Command Line Diskette (available on the IBM ServeRAID Support
CD (2) or can be downloaded)
- ServeRAID-3x or ServeRAID-4x Controller(s)
- These procedures depend on certain logging functions in the BIOS/Firmware
of the ServeRAID controller, which were first implemented in version 4.0 of
the IBM ServeRAID Support CD. The ServeRAID controller must have been flashed
using any 4.x version of the IBM ServeRAID Support CD or diskette(s) prior to
the failure.
- DUMPLOG.BAT and CLEARLOG.BAT for DOS/Windows and/or the version of DUMPLOG
and CLEARLOG appropriate for your operating system.
(1) The IBM ServeRAID Support CD software is backward compatible. If you boot
to a newer version of the IBM ServeRAID Support CD which prompts you to upgrade
BIOS/Firmware, you should cancel out of the BIOS/Firmware update until the
system is recovered. Upgrading software levels while in a failed state is not
recommended unless otherwise directed by IBM support.
(2) Diskette images of the ServeRAID downloadable diskettes are also on the IBM
ServeRAID Support CD in the IMAGES directory. You can build a diskette of the
ServeRAID Command Line utilities from this image. For more information, see the
README.TXT file in the IMAGES directory on the IBM ServeRAID Support CD.
Recovery steps
ServeRAID systems with multiple disk failures
- Capture the ServeRAID
logs using DUMPLOG.BAT. There are two methods of capturing these logs,
depending on the situation. The first method is used when the operating system
is located on the failed logical drive and the second is used when any other
logical drive has failed and the operating system is still accessible. These
logs should be sent to your IBM Support Center as necessary for root cause
analysis. This is the best evidence to determine what caused the
failure.
- Use Method #1 if the
operating system logical drive is off-line. Copy the DUMPLOG.BAT and
CLEARLOG.BAT files to the root of the ServeRAID Command Line diskette or the
A:\ directory. Boot to the ServeRAID Command Line diskette and run the
DUMPLOG command using the following syntax:
DUMPLOG <FILENAME.TXT> <Controller#>
- Use Method #2 when a
data logical drive is off-line and the operating system is online. Copy or
extract the DUMPLOG utility appropriate for your operating system to a local
directory or folder. Run the DUMPLOG command following the instructions on
the above Website (under Resource
Requirements) appropriate for the operating system to capture ServeRAID
logs.
- Use ServeRAID Manager to
determine the first disk in the failed array to be marked Defunct. Follow (a)
below if the operating system is accessible, or (b) below if the operating
system is not accessible.
- When the
operating system is accessible, determine the order the disks were marked
Defunct using the following steps:
- Open ServeRAID
Manager and note the hard disk drive(s) that are defunct
- In ServeRAID Manager
highlight the system hostname with Defunct (DDD) drives
- Right click the
system hostname and then choose Save printable configuration and event
logs. These logs are saved into the installation directory of
ServeRAID Manager, usually "Program Files\RAIDMAN". The log files are
saved as RAIDx.LOG where x is the controller number.
- Open the correct
RAIDx.LOG text file in any standard text editor for the controller with
Defunct drives. Go to the last page of the RAIDx.LOG file and you will see
a list called ServeRAID defunct drive event log for controller x.
This portion of the log lists, by date and time stamp, all the disk
drives marked Defunct in the order they went off-line, with the most recent
failure shown at the bottom of the list. Determine which disk failed
first. The first drive marked Defunct should be at the top of the
list.
IMPORTANT NOTICE: Since the "ServeRAID defunct drive
event log" is not cleared, there may be entries from earlier incidents of
defunct drives that do not pertain to the problem currently being
worked. Review the list of defunct drives carefully from the bottom of the
list to the top and identify the point where the first drive associated
with this incident is logged. The date and time stamps are usually the
strongest indicators of where this point is.
Under certain circumstances, there is no
guarantee that the "ServeRAID defunct drive event log" will list the
drives in the exact order the disks failed.
One example is when an array is set up across multiple ServeRAID channels.
In this configuration, the ServeRAID controller issues parallel I/Os to
disk devices on each channel. In the event of a catastrophic failure,
disks could also be marked defunct in parallel. This could reduce the
reliability of the date and time stamps, because the ServeRAID controller
writes events from multiple channels operating in parallel to a single
log.
- Detach or remove the
first hard disk marked Defunct from the backplane or cable (after the
system is powered off). This hard disk will need to be replaced.
- Exit ServeRAID Manager
and power the server off
- When the
operating system is not accessible, determine the order the disks were
marked Defunct using the following steps:
- Boot to the IBM
ServeRAID Support CD
- Highlight
Local in ServeRAID Manager, right click and select Save
printable configuration and event logs. You will be prompted for a
blank floppy diskette to be inserted into Drive A:.
- Insert a diskette and
ServeRAID Manager will save a RAIDx.LOG file to the diskette. The log
files are saved as RAIDx.LOG where x is the controller number.
- Take the diskette
from Drive A: to another working system and open the RAIDx.LOG text file
in any standard text editor. Go to the last page of the RAIDx.LOG file
and you will see a list called ServeRAID defunct drive event log for
controller x. This portion of the log lists, by date and time stamp,
all the disk drives marked Defunct in the order they went off-line, with
the most recent failure shown at the bottom of the list. Determine which
disk failed first. The first drive marked Defunct should be at the
top of the list.
IMPORTANT NOTICE: Since the "ServeRAID defunct
drive event log" is not cleared, there may be entries from earlier
incidents of defunct drives that do not pertain to the problem currently
being worked. Review the list of defunct drives carefully from the bottom
of the list to the top and identify the point where the first drive
associated with this incident is logged. The date and time stamps are
usually the strongest indicators of where this point is.
Under certain circumstances, there
is no guarantee that the "ServeRAID defunct drive event log" will list the
drives in the exact order the disks failed.
One example is when an array is set up across multiple ServeRAID channels.
In this configuration, the ServeRAID controller issues parallel I/Os to
disk devices on each channel. In the event of a catastrophic failure,
disks could also be marked defunct in parallel. This could reduce the
reliability of the date and time stamps, because the ServeRAID controller
writes events from multiple channels operating in parallel to a single
log.
- Detach or remove the
first hard disk marked Defunct from the backplane or cable (after the
system is powered off). This hard disk will need to be replaced.
- Exit ServeRAID Manager
and power the server off
- While the server is
powered off, reseat the PCI ServeRAID controller(s). Reseat the SCSI cable(s)
and the disks against the backplanes. Reseat the power cables to the backplane
and SCSI backplane repeater options if they are present. As you are reseating
the components, visually inspect each piece for bent pins, nicks, crimps,
pinches or other signs of damage. Take extra time to ensure that each
component snaps or clicks into place properly.
- Power on the system and
boot to the IBM ServeRAID Support CD
- Using ServeRAID Manager,
undefine all Hot Spare drives to avoid an accidental rebuild from
starting
- Using ServeRAID Manager,
set each Defunct hard drive in the failed array (except the first drive
marked Defunct, as listed in the "ServeRAID defunct drive event log") to an
"Online" state. The failed logical drives should change to a critical state. If
there are problems bringing drives online, or if a drive initially goes online
then fails to a Defunct state soon after, see the Hardware Troubleshooting
section below before proceeding. The logical drives must be in a critical
state before proceeding.
- In this step we will
attempt to access the critical logical drive(s). If you are still in ServeRAID
Manager, exit and restart the system.
- If the operating system
logical drive is now critical, attempt to boot into the installed operating
system. (If you are prompted to perform any CHKDSK activities or file system
integrity tests, choose to skip these tests.)
- If the data is on the
critical logical drive, boot into the operating system and attempt to access
the data. (If you are prompted to perform any CHKDSK activities or file
system integrity tests, choose to skip these tests.)
- If the system boots
into the operating system, run the appropriate command to do a READ-ONLY
file system integrity check on each critical logical drive. If the file
system checker determines the file system does not have any data corruption,
the data should be good.
For Windows NT/2K systems, run CHKDSK in
READ ONLY MODE at a command prompt (NO PARAMETERS) for each critical logical
drive. If CHKDSK does not report data corruption, the data should be
intact.
- Run the IPSSEND
GETBST command to determine if the Bad Stripe Table has incremented on
any logical drive. If the Bad Stripe Table has incremented to one or more,
the array should be removed and recreated. The IPSSEND.EXE executable is
located on the IBM ServeRAID Support CD or the command line
diskette.
You can attempt to back up the data; however, you may
encounter "file corrupted" messages for any files that had data on the
stripe that was lost. That data cannot be recovered from the
current logical drive.
- Plan to restore or
rebuild the missing data on each critical logical drive if any of the
following problems persist:
- The critical logical
drive remains inaccessible
- Data corruption is
found on the critical logical drives
- The system
continuously fails to boot properly to the operating system
- Partition information
on the critical logical drives is unknown
- If the data is good,
initiate a backup of the data.
- After the backup
completes, replace the remaining physical hard drive that is still Defunct. An
auto-rebuild should start when the Defunct disk is replaced.
- Redefine hot spares as
necessary.
- Capture another set of
ServeRAID logs using DUMPLOG.
- Clear the ServeRAID logs
using the following CLEARLOG.BAT command available on the DUMPLOG
website:
CLEARLOG <Controller#>
- If a case was opened with
your IBM Support Center for this problem, complete steps #14 and #15,
otherwise you have completed the recovery process.
- Plan to capture the
ServeRAID logs again using DUMPLOG within 2-3 business days of normal activity
(after the ServeRAID logs were cleared in step #12) to confirm the ServeRAID
subsystem is fixed. These logs should be emailed to your IBM support center to
ensure the ServeRAID controller and SCSI bus activities are operating within
normal parameters.
- If additional issues are
observed, an action plan will be provided with corrective actions and steps
#12 and #13 should be repeated until the system is running
optimally.
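The bottom-to-top review of the "ServeRAID defunct drive event log" described in the recovery steps can be sketched in a few lines of Python. This is only an illustration: the entries are hypothetical values typed in by hand, the tuple layout (timestamp, channel, SCSI ID) is an assumption rather than the actual RAIDx.LOG format, and the one-hour and five-second thresholds are arbitrary choices that you would adjust to your own log.

```python
from datetime import datetime, timedelta

# Hypothetical entries transcribed by hand from the defunct drive event log,
# oldest first (the same order as the log). The layout
# (timestamp, channel, SCSI ID) is an assumption made for this sketch.
entries = [
    (datetime(2001, 3, 2, 9, 15, 0), 1, 4),     # earlier, unrelated incident
    (datetime(2001, 6, 18, 22, 41, 7), 1, 2),
    (datetime(2001, 6, 18, 22, 41, 9), 2, 3),
    (datetime(2001, 6, 18, 22, 43, 30), 1, 5),
]


def current_incident(entries, max_gap=timedelta(hours=1)):
    """Walk the log from the bottom up and return the entries that belong
    to the most recent incident, oldest first.

    An entry is kept while the gap to the next (later) entry is at most
    max_gap; the first larger gap marks the boundary with older incidents.
    """
    incident = []
    for entry in reversed(entries):
        if incident and incident[-1][0] - entry[0] > max_gap:
            break
        incident.append(entry)
    incident.reverse()
    return incident


def ordering_is_ambiguous(incident, window=timedelta(seconds=5)):
    """Return True when the first two entries sit on different channels and
    were logged within `window` of each other, the case in which the log
    order may not reflect the true failure order."""
    if len(incident) < 2:
        return False
    (t0, ch0, _), (t1, ch1, _) = incident[0], incident[1]
    return ch0 != ch1 and (t1 - t0) <= window


incident = current_incident(entries)
first_time, first_channel, first_id = incident[0]
print(f"First drive logged in this incident: channel {first_channel}, "
      f"SCSI ID {first_id} at {first_time}")
if ordering_is_ambiguous(incident):
    print("The first two drives failed on different channels within seconds; "
          "treat the logged order with caution.")
```

The time-gap rule mirrors the advice that date and time stamps are usually the strongest indicator of where the current incident begins. The ambiguity check reflects the caveat about arrays that span multiple channels; it does not resolve the ordering, it only flags that the log alone may not be conclusive.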
Hardware Troubleshooting
If you continue to experience problems, such as drives being marked
DDD again or a disk that will not change to an online state, review the
following guidelines to help identify the configuration or hardware
problem.
The most common cause of
multiple disk failures is poor signal quality across the SCSI bus. Poor signal
quality results in SCSI protocol overhead as the bus tries to recover from these
problems. As the system becomes busier and demand for data increases, the
corrective actions of the SCSI protocol increase and the SCSI bus moves closer
to saturation. This overhead will eventually limit the normal device
communications bandwidth and, if left unchecked, one or more SCSI devices may
not be able to respond to the ServeRAID controller in a timely manner, resulting
in the ServeRAID controller marking the disk drives Defunct. These types of
signal problems can be caused by improper installation of the ServeRAID
controller in a PCI slot, poor cable connections, poor seating of hot swap
drives against the SCSI backplane, improper installation or seating of backplane
repeaters, and improper SCSI bus termination.
There are many possible
reasons for multiple disk failures; however, you should be able to isolate most
hardware problems using the following isolation techniques:
- Check error codes within
the ServeRAID Manager when a device fails to respond to a command. Research
these codes in the publicly available Hardware Maintenance Manuals.
- In non-hot swap systems,
make sure the disk devices are attached to the cable starting from the
connector closest to the SCSI terminator and work your way forward to the
connectors closest to the controller. Also examine each SCSI device for the
proper jumper settings.
- While the server is
powered off, reseat the ServeRAID controller in its PCI slot and all cables
and disk devices on the SCSI bus.
- Examine the cable(s) for
bent or missing pins, nicks, crimps, pinches or over stretching.
- Temporarily attach the
disks to an integrated Adaptec controller (or PCI version as available) and
boot into the Adaptec BIOS using CTRL-A. As the Adaptec BIOS POSTs, you should
see all the expected devices and the negotiated data rates. You can review
this information and determine if this is what you should expect to
see.
Once you proceed into the
BIOS, choose the SCSI Scan option which will list all the devices attached to
the controller. Highlight and select one of the disks and initiate a "Media
Test" (this is NOT destructive to data). This will test the device and the
entire SCSI bus. If you see errors on the Adaptec controller, try to determine
if it is the device or the cable by initiating a Media Test on other disks. Test
both Online and Defunct disks, to determine if the test results are consistent
with the drive states on the ServeRAID controller. You can also move Hot Swap
disks to a different position on the backplane and re-test to see if the results
change.
If the problem persists,
swap out the SCSI cable and retry the Media Tests on the same disks. If the
disks test okay, the previous cable is bad. This is a valuable tool to use as
you isolate a failing component in the SCSI path.
IMPORTANT NOTICE: The
Adaptec controller can produce different results from the ServeRAID controller
because of the way the Adaptec controller negotiates data rates with LVD/SE
SCSI devices. If an Adaptec controller detects errors operating at LVD speeds,
it can downgrade the data rates to single-ended speeds and continue to operate
with no reported errors. The ServeRAID controller will not necessarily change
data rates under the same conditions.
Use the System Diagnostics
in F2 Setup to test the ServeRAID subsystem. If the test fails, disconnect the
drives from the ServeRAID controller and reset the controller to factory default
settings. Retry the Diagnostic tests. If Diagnostics pass, attach the disks
to the ServeRAID controller one channel at a time and retry the tests to
isolate the channel of the failing component. If the controller continues to
fail diagnostic tests (when using the latest available diagnostics for the
server), call your IBM support center for further assistance.
Disconnect or detach the
first drive in the array to be marked defunct from the cable or backplane.
Restore default settings to the ServeRAID controller. Attempt to import the RAID
configuration from the disk drives. Depending on how the failure occurred, this
technique may have mixed results. There is a reasonable chance that all the
drives will return to an online state, except the first disk which is
disconnected.
Move the ServeRAID
controller into a different PCI slot or PCI bus and re-test.
When attaching an LVD
ServeRAID controller to legacy storage enclosures, set the data rate for the
channel to the appropriate data rate supported by the enclosure.
Mixing LVD SCSI devices with
Single Ended SCSI devices on the same SCSI bus will result in switching all
devices on the channel to Single Ended mode and data rates.
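The mixing rule above can be expressed as a small check to run against an inventory of the devices on a channel. This is a minimal sketch: the function name and the "LVD"/"SE" labels are hypothetical, and the device list is something you would compile by hand, not something read from the controller.

```python
def effective_bus_mode(device_modes):
    """Return the signalling mode a SCSI channel will run in.

    device_modes: one "LVD" or "SE" label per device on the channel
    (labels are an assumption made for this sketch).
    A single Single Ended device switches the whole channel to SE.
    """
    modes = [mode.upper() for mode in device_modes]
    if not modes:
        raise ValueError("no devices listed for this channel")
    unknown = set(modes) - {"LVD", "SE"}
    if unknown:
        raise ValueError(f"unrecognized mode label(s): {sorted(unknown)}")
    return "SE" if "SE" in modes else "LVD"


# Example: one legacy Single Ended device among LVD disks
channel_1 = ["LVD", "LVD", "SE", "LVD"]
print("Channel 1 will operate in", effective_bus_mode(channel_1), "mode")
```

A check like this is useful mainly as a reminder when planning cabling: if the result is SE for a channel you expected to run at LVD speeds, look for the legacy device before suspecting the cable or controller.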
Open a case with your IBM
Support Center and submit all the sets of ServeRAID logs captured on the system
for interpretation to isolate a failing component.