CF Web>SuperMicro>MegaRaid (2019-04-23, GordBoerke)

LSI MegaRaid

Hardware notes

LSI MegaRaid controllers are available in various configurations.
See www.lsi.com
The RAID array can be set up via the Adapter BIOS on boot.
Save the Adapter configuration to a file:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -help| grep CfgSave
MegaCli -CfgSave -f filename -aN   

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -CfgSave -f /admhome/cscf-adm/MegaRaidAdapterConfiguration -a0
Config data is saved to the file.
Exit Code: 0x00

To retrieve the configuration:
MegaCli -CfgRestore -f filename -aN

Battery Backup Unit and array going offline

The BBUs for 3ware and MegaRAID controllers have an 8 MHz CPU. If a BBU fails, the data in cache stops moving, making the array appear to be offline. If this happens, remove the BBU.
- https://www.broadcom.com/support/knowledgebase/1211161501519/offline-array-troubleshooting-for-megaraid-and-3ware

Software notes

Software install on Ubuntu

20 Nov 2018

1) sudo bash
2) apt-get install alien gdebi
3) Find the programs for your controller (example: 9260-8i)
- In this example I downloaded Driver, Firmware, MegaCLI, Storage Manager and StorCLI into directories with the same names
- In each downloaded directory we
  - convert all *.rpm files with alien
  - unzip all gz files with gunzip
  - extract all tar files with tar xf
4) MegaCLI
- alien MegaCli-8.07.14-1.noarch.rpm
- gdebi megacli_8.07.14-2_all.deb
5) StorCLI
- unzip storcli_All_OS.zip
- gdebi storcli_1.23.02_all.deb
6) Storage Manager
- gunzip 17.05.00.02_Linux-64_MSM.gz
- tar xf 17.05.00.02_Linux-64_MSM*
- alien --scripts *.rpm
- gdebi lib-utils2_1.00-9_all.deb megaraid-storage-manager_17.05.00-3_all.deb
- Start MSM
  - /etc/init.d/vivaldiframeworkd start
- Enable auto startup
  - update-rc.d vivaldiframeworkd defaults
7) Add MegaCLI and StorCLI utils to search path
- vi /etc/environment
  - Insert into PATH: /opt/MegaRAID/MegaCli:/opt/MegaRAID/storcli:
- Add to current login shell
  - export PATH=$PATH:/opt/MegaRAID/MegaCli:/opt/MegaRAID/storcli
8) Firmware
- MegaCli -adpfwflash -f mr2108fw.rom -a0
9) Driver
- tar xzf MR_LINUX_DRIVER_7.7-07.707.03.00-1.tgz
- cd ubuntu/rpms-1
- gdebi megaraid_sas_07.707.03.00-1-ubuntu18.04_x86_64.deb

Using smartmontools with MegaRAID

https://www.thomas-krenn.com/en/wiki/Smartmontools_with_MegaRAID_Controller

Running Storage Manager

20 Nov 2018

cd "/usr/local/MegaRAID Storage Manager"
sudo ./startupui.sh
Use the root account to log in.

Ubuntu Installation example on hops.cs

The command line application for Linux can be found at the LSI web page for the controller. e.g. http://www.lsi.com/channel/products/storagecomponents/Pages/MegaRAIDSAS9285CV-8e.aspx

This example is for the MegaRAID SAS 9285CV-8e
Download MegaCli-8.07.06-1.noarch.rpm

cscf-adm@hops:~$ sudo alien MegaCli-8.07.06-1.noarch.rpm
[sudo] password for cscf-adm:
Warning: Skipping conversion of scripts in package MegaCli: postinst postrm
Warning: Use the --scripts parameter to include the scripts.
megacli_8.07.06-2_all.deb generated

cscf-adm@hops:~$ sudo dpkg -i megacli_8.07.06-2_all.deb
Selecting previously unselected package megacli.
(Reading database ... 181600 files and directories currently installed.)
Unpacking megacli (from megacli_8.07.06-2_all.deb) ...
Setting up megacli (8.07.06-2) ...
Processing triggers for libc-bin ...
ldconfig deferred processing now taking place

cscf-adm@hops:~$ ls /opt
MegaRAID 

cscf-adm@hops:/opt/MegaRAID/MegaCli$ ls -la
total 5576
drwxr-xr-x 2 root root    4096 Jan 30 15:17 .
drwxr-xr-x 3 root root    4096 Jan 30 15:17 ..
-r--r--r-- 1 root root  510200 Nov 14 02:42 libstorelibir-2.so.13.05-0
-rwxr-xr-x 1 root root 2467036 Nov 14 02:42 MegaCli
-rwxr-xr-x 1 root root 2716224 Nov 14 02:42 MegaCli64

cscf-adm@hops:/opt/MegaRAID/MegaCli$ ./MegaCli64 -h
... provides us with a whole bunch of command options

MegaCli references

See MegaCli

Running the CLI

Normally one would execute the CLI via the current path. e.g. ./MegaCli64 options

cscf-adm@hops:/opt/MegaRAID/MegaCli$ ./MegaCli64 -CfgDsply -a0

User specified controller is not present.
Failed to get CpController object.
Exit Code: 0x01

cscf-adm@hops:/opt/MegaRAID/MegaCli$ ./MegaCli64 -AdpAllinfo -aALL
Exit Code: 0x00

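The problem lies in the environment that MegaCli sees. The command must be run through setarch, which sets the reported architecture (x86_64) and, with --uname-2.6, makes the program see a kernel version number beginning with 2.6.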

cscf-adm@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpCount
Controller Count: 1.
Exit Code: 0x01
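
Since every invocation on this host needs the same setarch prefix, a small shell function can save typing. The following is a minimal sketch and not part of the original notes: the function name megacli and the variables MEGACLI_BIN and MEGACLI_DRY_RUN are made up for illustration. Only the setarch prefix and the binary path are taken from the commands above.

```shell
# Hypothetical wrapper: prepend the setarch prefix used throughout these notes.
# MEGACLI_BIN and MEGACLI_DRY_RUN are illustrative names, not MegaCli settings.
MEGACLI_BIN="${MEGACLI_BIN:-/opt/MegaRAID/MegaCli/MegaCli64}"

megacli() {
    if [ -n "${MEGACLI_DRY_RUN:-}" ]; then
        # Dry run: print the command line instead of executing it.
        echo "setarch x86_64 --uname-2.6 $MEGACLI_BIN $*"
        return 0
    fi
    setarch x86_64 --uname-2.6 "$MEGACLI_BIN" "$@"
}

# Example (run as root or via sudo):
#   megacli -adpCount
#   MEGACLI_DRY_RUN=1 megacli -CfgDsply -aALL
```

The dry-run switch is only there so the command line can be checked before it is run against a live controller.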

Commands via the CLI

sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -help

Getting all information about the Adapter

setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL
Here we see that the RAID array was created by dividing 44 of the 45 drives into two groups called "Spans", each with 22 drives called "PDs" (Physical Devices).

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL|grep -E 'SPAN|Span\ Ref|Number\ of'
Number of DISK GROUPS: 1
SPANNED DISK GROUP: 0
Number of Spans: 2
SPAN: 0
Span Reference: 0x00
Number of PDs: 22
Number of VDs: 1
Number of dedicated Hotspares: 0
SPAN: 1
Span Reference: 0x01
Number of PDs: 22
Number of VDs: 1
Number of dedicated Hotspares: 0
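
The grep output above can be condensed further. As a rough sketch (the function name summarize_spans is hypothetical, and the field labels are assumed to match the output shown above), awk can print one line per span plus a total:

```shell
# Reads "-CfgDsply" output on stdin and prints the PD count per span.
# summarize_spans is an illustrative name; labels are those shown above.
summarize_spans() {
    awk -F': *' '
        /^SPAN:/          { span = $2 + 0 }
        /^Number of PDs:/ { printf "span %d: %d PDs\n", span, $2 + 0; total += $2 }
        END               { printf "total: %d PDs\n", total }
    '
}

# Example:
#   setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL | summarize_spans
```

For the array above this gives 22 PDs in each of the two spans, 44 in total.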

Silencing an Alarm

cscf-adm@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp AlarmSilence -aALL
Adapter 0: Set alarm to Silenced success.
Exit Code: 0x00

To find a faulty drive

Finding a faulty drive may be easy on some systems. A faulty drive may be indicated by a flashing red LED on the drive bay. The RAID adapter may also sound an alarm.
To get information about the entire RAID controller:

cscf-adm@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0

One interesting piece of information resulting from the above command:

                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 1
  Offline         : 0
Physical Devices  : 48
  Disks           : 45
  Critical Disks  : 0
  Failed Disks    : 1

The logical drive information will not give a specific drive, but will report whether the array is degraded. If a "hot spare" drive is faulty the array won't be degraded.
The next command asks for "ldinfo" Logical Drive Info, "lall" All Logical devices, "a0" for Adapter 0.

cscf-adm@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 72.753 TB
Sector Size         : 512
Parity Size         : 7.275 TB
State               : Partially Degraded
Strip Size          : 64 KB
Number Of Drives per span:22
Span Depth          : 2
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: Yes
Is VD Cached: No

Exit Code: 0x00
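
The Size and Parity Size values above can be cross-checked by hand. The sketch below assumes that each RAID 6 span gives up two drives' worth of space to parity, that each drive contributes its coerced size of 0xe8d00000 sectors of 512 bytes (as reported by -PdInfo for these drives), and that MegaCli's "TB" means 2^40 bytes truncated to three decimals. That last point is inferred from the numbers, not taken from LSI documentation. The function names are hypothetical.

```shell
# Print capacity in thousandths of a (binary) TB for a number of drives.
tb_milli() {   # $1 = sectors per drive, $2 = number of drives
    echo $(( $1 * 512 * $2 * 1000 / (1 << 40) ))
}

# Format thousandths as N.NNN
fmt_milli() {
    printf '%d.%03d' $(( $1 / 1000 )) $(( $1 % 1000 ))
}

# raid60_size SPANS DRIVES_PER_SPAN SECTORS_PER_DRIVE (hex allowed)
raid60_size() (
    sectors=$(( $3 ))
    data_drives=$(( $1 * ($2 - 2) ))     # two parity drives per RAID 6 span
    parity_drives=$(( $1 * 2 ))
    printf 'data=%s TB parity=%s TB\n' \
        "$(fmt_milli "$(tb_milli "$sectors" "$data_drives")")" \
        "$(fmt_milli "$(tb_milli "$sectors" "$parity_drives")")"
)

raid60_size 2 22 0xe8d00000
```

With two spans of 22 drives this reproduces the 72.753 TB and 7.275 TB shown in the -ldinfo output.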

To get a log of the controller:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog           
Success in AdpEventLog
Exit Code: 0x00

root@hops:/opt/MegaRAID/MegaCli# ls -lat
total 16428
-rw-r--r-- 1 root root 9764271 Feb 26 16:08 lsi-events.log
-rw-r--r-- 1 root root 1226454 Feb 26 16:06 MegaSAS.log

Log entry pertaining to the faulty drive:

cscf-adm@hops:/opt/MegaRAID/MegaCli$ less lsi-events.log
...searched for "Jan 30" in this long file containing log data from "Oct  3"

Time: Wed Jan 30 15:51:49 2013

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0c(e0x21/s3) Path 50030480015b134f, CDB: 28 00 86 8c 04 80 00 00 80 00, Sense: 3/11/00
Event Data:
===========
Device ID: 12
Enclosure Index: 33
Slot Number: 3
CDB Length: 10
CDB Data:
0028 0000 0086 008c 0004 0080 0000 0000 0080 0000 0000 0000 0000 0000 0000 0000 Sense Length: 18
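
The "PD 0c(e0x21/s3)" notation in the event description packs the same three numbers shown under Event Data. Comparing the two in the entry above, the device ID and enclosure appear to be hexadecimal (0c = 12, 0x21 = 33) and the slot decimal. A small helper can do the conversion. This is a sketch based only on that comparison, and decode_pd is a hypothetical name.

```shell
# Decode a PD reference as it appears in the event log, e.g. "0c(e0x21/s3)".
# Assumes hex device ID and enclosure, decimal slot (inferred from the log above).
decode_pd() (
    id=${1%%"("*}                        # text before "("        -> 0c
    enc=${1#*"(e"}; enc=${enc%%/*}       # between "(e" and "/"   -> 0x21
    slot=${1##*/s}; slot=${slot%")"}     # between "/s" and ")"   -> 3
    printf 'device_id=%d enclosure=%d slot=%s\n' "0x$id" "$enc" "$slot"
)

decode_pd '0c(e0x21/s3)'
```

The enclosure and slot values are the E and S to use in a -PhysDrv [E:S] argument.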

To get information about all the drives in the array use the pdlist option on the array controller 0:

cscf-adm@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0

To determine which drive is faulty:

cscf-adm@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -pdlist -a0 | grep -E 'Inquiry|Firmware\ state:\ Failed|Slot|ID|Unconfigured'

Inquiry displays the drive's serial number.
Firmware states are either "Failed", "Online, Spun Up", "Online, Spun Down", "Unconfigured(bad)", "Unconfigured(good), Spun down", "Hotspare, Spun down", "Hotspare, Spun up" or "not Online".
If a hot spare has been built into the array to compensate for the failed drive, then the above command may not show the failed drive as failed. Use the next command to view the drive:

  setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [E:S] -a0        #Where E is the enclosure ID and S is the drive bay slot.

If the hot spare is built into the array it will report as "Online, Spun Up". Hence, it can't be determined that it was previously the hot spare.

In the "pdlist" command "Slot" will display drive bay number.
ID displays the enclosure.
The MegaRAID SAS9285CV-8e has two divisions. In this example 24 drives are connected to ID 33 at the front of the JBOD drive bay enclosure and 21 drives to ID 55 at the back of the JBOD enclosure.

Here's a sample output of "setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL" showing only the failed drive.
Notice the media error count. Also, notice no SMART alert is reported, yet the drive is faulty:

Enclosure Device ID: 33
Slot Number: 3
Drive's position: DiskGroup: 0, Span: 0, Arm: 3
Enclosure position: 1
Device Id: 12
WWN: 5000c5004537e2fb
Sequence Number: 3
Media Error Count: 401
Other Error Count: 5
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: CC49
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x50030480015b134f
Connected Port Number: 0(path0)
Inquiry Data: ATA     ST2000DM001-9YN1CC49            W2406BSP
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive:  Not Certified
Drive Temperature :31C (87.80 F)
PI Eligibility:  No
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No

A mention of drive identification within the enclosure is in order.

Enclosure Device ID: 33 signifies the front of the enclosure. For this particular JBOD the back is Device 55
Slot Number: 3 is the fourth drive starting from drive 0
Device Id: 12 is an alternate numbering system. This JBOD starts with Id 9 in slot 0. Note that the Device Id is not always in sequence.
Device Firmware Level: CC49
Inquiry Data: ATA ST2000DM001-9YN1CC49 W2406BSP shows the drive model and serial number

More on enclosure identification can be found with the command: setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aALL
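
As an alternative to the grep filter used earlier, the three fields that matter for locating a drive can be folded into one line per drive. A minimal sketch, assuming the field labels shown in the sample output above (list_drive_states is a hypothetical name):

```shell
# Reads "-PDList" output on stdin and prints "[E:S] state" for every drive.
# list_drive_states is an illustrative name; labels are those shown above.
list_drive_states() {
    awk -F': *' '
        /^Enclosure Device ID:/ { enc  = $2 + 0 }
        /^Slot Number:/         { slot = $2 + 0 }
        /^Firmware state:/      { printf "[%d:%d] %s\n", enc, slot, $2 }
    '
}

# Example: show only drives that are not online
#   ... MegaCli64 -PDList -a0 | list_drive_states | grep -v 'Online'
```

The bracketed prefix is in the same [E:S] form that -PdInfo and -PdLocate expect.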

Identify a drive

Most drive enclosures will have disk drive bays with LEDs to alert a fault or proper functioning. A drive bay LED can be "blinked" to determine its location. Note that the drive must have the correct firmware to allow this function.
As an example, this Seagate Barracuda drive will light the blue LED on the drive bay enclosure:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [33:0] -a0                                     
Enclosure Device ID: 33
Slot Number: 0
...
Device Firmware Level: CC49
...
Inquiry Data:             W1E05QBAST2000DM001-9YN164                      CC49

A similar Seagate Barracuda drive will not light the blue LED. It is a newer model number with a different firmware.

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [33:18] -a0
                                     
Enclosure Device ID: 33
Slot Number: 18
...
Device Firmware Level: CC24
...
Inquiry Data:             Z240K4H2ST2000DM001-1CH164                      CC24

Although the drive with model number 1CH164 won't turn on its drive bay blue LED it will still blink its red LED with the command below:

Start the blinking

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[E:S] -aALL

Where E is the enclosure and S is the slot.

Stop the blinking

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -stop -physdrv[E:S] -aALL

Spare Drive and Hot Spares

The JBOD enclosure may not have all drives assigned to the array. An unused drive will show as "Unconfigured(good), Spun down".
It may require several minutes (maybe half an hour) for the array controller to recognize a new drive install.

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0

Enclosure Device ID: 55
Slot Number: 20
Enclosure position: 1
Device Id: 54
WWN: 5000c500452e1aa7
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Unconfigured(good), Spun down
Device Firmware Level: CC49
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5003048001943c5c
Connected Port Number: 1(path0)
Inquiry Data:             W1E065X6ST2000DM001-9YN164                      CC49 
...

To make the drive a global Hot Spare
Note there is a difference between dedicated and global hot spares. In this example there are two spans as this is a RAID 60. A dedicated hot spare would be assigned to only one of the two spans. A global hot spare will work for either span, hence for the whole array.

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -PhysDrv [55:20] -a0
Adapter: 0: Set Physical Drive at EnclId-55 SlotId-20 as Hot Spare Success.
Exit Code: 0x00

If the system has a faulty drive at this point, the array will immediately start rebuilding onto the hot spare, as seen in the "Firmware state" field, and the drive bay will blink its red LED:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0

Enclosure Device ID: 55
Slot Number: 20
Drive's position: DiskGroup: 0, Span: 0, Arm: 3
Enclosure position: 1
Device Id: 54
WWN: 5000c500452e1aa7
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Rebuild
Device Firmware Level: CC49

The auto rebuild option can be seen in the adapter "Settings". The "Device Present" will continue to show "degraded" until the rebuild is complete:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0
...
                Settings
                ================
Current Time                     : 20:47:28 2/5, 2013
Predictive Fail Poll Interval    : 300sec
Interrupt Throttle Active Count  : 16
Interrupt Throttle Completion    : 50us
Rebuild Rate                     : 30%
PR Rate                          : 30%
BGI Rate                         : 30%
Check Consistency Rate           : 30%
Reconstruction Rate              : 30%
Cache Flush Interval             : 4s
Max Drives to Spinup at One Time : 4
Delay Among Spinup Groups        : 2s
Physical Drive Coercion Mode     : Disabled
Cluster Mode                     : Disabled
Alarm                            : Enabled
Auto Rebuild                     : Enabled 
...
                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 1
  Offline         : 0
Physical Devices  : 48
  Disks           : 45
  Critical Disks  : 0
  Failed Disks    : 1

If the hot spare is at any point replaced, the replacement drive may need to be reset as a hot spare. Simply run the command again...

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -PhysDrv [55:20] -a0
Adapter: 0: Set Physical Drive at EnclId-55 SlotId-20 as Hot Spare Success.

Replacing a faulty drive

The RAID 60 here is composed of two RAID 6 arrays (spans) striped together as RAID 0. Each RAID 6 array can sustain two failed drives and retain data integrity. If a "hot spare" is built into the array after the first drive failure, then one of the RAID 6 arrays can sustain three failed drives in total, provided the rebuild completes before the third failure, but the other RAID 6 array will still only sustain two failed drives.
Here we see 45 drives (one of which is a hot spare), one failed drive, and the "hot spare" [55:20] as on-line. This may indicate that the hot spare has already been built into the array (as it is no longer degraded) and that a second drive has failed.

root@hops:/opt/MegaRAID# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0
...
                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 48
  Disks           : 45
  Critical Disks  : 0
  Failed Disks    : 1
...

root@hops:~# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0|grep -E 'Inquiry|Firmware|Slot|ID'
...
Enclosure Device ID: 55
Slot Number: 20
Firmware state: Online, Spun Up
Device Firmware Level: CC49
Inquiry Data:             W1E065X6ST2000DM001-9YN164                      CC49

Take the faulty drive offline. It may start the alarm once offline, so stop the alarm.

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv [33:3] -a0
Adapter: 0: EnclId-33 SlotId-3 state changed to OffLine.
Exit Code: 0x00 

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aAll
...
    Number of enclosures on adapter 0 -- 3

    Enclosure 0:
    Device ID                     : 33
    Number of Slots               : 24
    Number of Power Supplies      : 2
    Number of Fans                : 5
    Number of Temperature Sensors : 1
    Number of Alarms              : 1
    Number of SIM Modules         : 0
    Number of Physical Drives     : 23 

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [33:3] -a0 | grep state
Firmware state: Unconfigured(bad)

A faulty drive may not go into the off-line mode. Check to see if the array is in the "spun down" state. If it is spun down then access or edit a file in the array. That will then spin up the drives. Then try to "off-line" the faulty drive.

myself@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdList -a0|grep -E 'Inquiry|Firmware|Slot|ID' 
Enclosure Device ID: 33 
Slot Number: 0 
Firmware state: Online, Spun down

myself@hops:/usr/local/MegaRAID Storage Manager$ sudo setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 
... 
 Settings 
 ============ 
Current Time : 15:52:21 11/15, 2013 
... 
Max Drives to Spinup at One Time : 4 
Delay Among Spinup Groups : 2s 
... 
Maximum number of direct attached drives to spin up in 1 min : 120

If the drive won't go off-line then just go ahead with the drive replacement. However, note that the new drive may not be immediately recognized by the adapter. It may require up to half an hour for the controller to acknowledge the new drive.
Check the drive status again. It may show that the drive no longer exists in the array or that it is still faulty. After waiting half an hour you may want to re-seat the drive.
Showing drive in "Slot Number: 3" missing:

root@hops:/usr/local/MegaRAID Storage Manager# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdList -a0|grep -E 'Inquiry|Firmware|Slot|ID'
...
Enclosure Device ID: 33
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: CC49
Inquiry Data:             W1E06XMMST2000DM001-9YN164                      CC49
Enclosure Device ID: 33
Slot Number: 4
Firmware state: Online, Spun Up
Device Firmware Level: CC49
Inquiry Data:             W1E05NLZST2000DM001-9YN164                      CC49
...
Enclosure Device ID: 55
Slot Number: 20
Firmware state: Online, Spun Up
Device Firmware Level: CC49
Inquiry Data:             W1E065X6ST2000DM001-9YN164                      CC49

Replace the drive in the bay and put it into the enclosure. It is hot-swappable and should immediately be accepted by the array.
If a "Hot Spare" was present the hot spare will now copy its data to the new drive. The alarm may sound, the blue LED should be on, and the red LED flashing indicating a "copy back". As well you should notice all other drives flashing:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0
...
                Device Present
                ================
Virtual Drives    : 1
  Degraded        : 0
  Offline         : 0
Physical Devices  : 48
  Disks           : 45
  Critical Disks  : 0
  Failed Disks    : 0 

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [33:3] -a0

Enclosure Device ID: 33
Slot Number: 3
Enclosure position: 1
Device Id: 12
WWN: 5000c5005cc361d1
Sequence Number: 10
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Copyback
Device Firmware Level: CC24
...

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0

Enclosure Device ID: 55
Slot Number: 20
Drive's position: DiskGroup: 0, Span: 0, Arm: 3
Enclosure position: 1
Device Id: 54
WWN: 5000c500452e1aa7
Sequence Number: 4
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: CC49

The process may require several hours to copy back the information from the hot spare to the new drive. In the example above the hot spare is [55:20] and the new drive is [33:3]. For this RAID 60 with two spans of 22 drives each, a copy-back requires approximately 3.5 hours. Without a "hot spare copy-back", a rebuild requires approximately 6.5 hours.
After copy back completes the new drive is online and the hot spare once again shows "Hot Spare":

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDCpyBk -ShowProg -PhysDrv[33:3] -a0
Physical Drive is not in Copyback state.
Exit Code: 0x00

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [33:3] -a0

Enclosure Device ID: 33
Slot Number: 3
Drive's position: DiskGroup: 0, Span: 0, Arm: 3
Enclosure position: 1
Device Id: 12
WWN: 5000c5005cc361d1
Sequence Number: 11
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Online, Spun Up
Device Firmware Level: CC24 
...

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0

Enclosure Device ID: 55
Slot Number: 20
Enclosure position: 1
Device Id: 54
WWN: 5000c500452e1aa7
Sequence Number: 5
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Hotspare Information:
Type: Global, is revertible

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  0
Firmware state: Hotspare, Spun down
Device Firmware Level: CC49 
...

Note the hot spare may continue to show a red flashing LED after the copy-back. This will occur once it is in PowerSave mode. It is in a "ready" state and will automatically rebuild in case of another drive failure.

To put the hot spare drive into the spun-up state, remove it as a hot spare and then set it again:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0|grep Firmware\ state
Firmware state: Hotspare, Spun down

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Rmv -PhysDrv [55:20] -a0
Adapter: 0: Remove Physical Drive at EnclId-55 SlotId-20 as Hot Spare Success.
Exit Code: 0x00

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0|grep Firmware\ state
Firmware state: Unconfigured(good), Spun Up

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -PhysDrv [55:20] -a0
Adapter: 0: Set Physical Drive at EnclId-55 SlotId-20 as Hot Spare Success.
Exit Code: 0x00

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PdInfo -PhysDrv [55:20] -a0|grep Firmware\ state
Firmware state: Hotspare, Spun Up

If two drives have failed, the hot spare should have been built into the array. Replace one of the drives. The hot spare may not show that it is copying back data. It will reserve that for the other failed drive. Notice that the rebuild of the replaced drive can be seen with the command:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv [55:5] -a0
Rebuild Progress on Device at Enclosure 55, Slot 5 Completed 5% in 17 Minutes.
Exit Code: 0x00
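
The progress line can be turned into a rough time estimate. The sketch below simply extrapolates linearly from the percentage and elapsed minutes, which is an assumption: the real rate depends on the Rebuild Rate setting and on array load. The function name rebuild_eta is hypothetical.

```shell
# Reads "-PDRbld -ShowProg" output on stdin and extrapolates linearly.
# rebuild_eta is an illustrative name; the estimate is only approximate.
rebuild_eta() {
    sed -n 's/.*Completed \([0-9][0-9]*\)% in \([0-9][0-9]*\) Minutes.*/\1 \2/p' |
    while read -r pct mins; do
        if [ "$pct" -gt 0 ]; then
            total=$(( mins * 100 / pct ))
            echo "estimated total: ${total} min, remaining: $(( total - mins )) min"
        fi
    done
}

echo 'Rebuild Progress on Device at Enclosure 55, Slot 5 Completed 5% in 17 Minutes.' | rebuild_eta
```

For the output above this works out to about 340 minutes in total, the same order of magnitude as the 6.5 hours noted earlier for a rebuild.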

Enclosure Information

To help determine drive, fan, temperature, etc. information use the enclosure info command:

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -EncInfo -aALL

    Number of enclosures on adapter 0 -- 3

    Enclosure 0:
    Device ID                     : 33
    Number of Slots               : 24
    Number of Power Supplies      : 2
    Number of Fans                : 5
    Number of Temperature Sensors : 1
    Number of Alarms              : 1
    Number of SIM Modules         : 0
    Number of Physical Drives     : 23
    Status                        : Normal
    Position                      : 1
    Connector Name                : Port B
    Enclosure type                : SES
    FRU Part Number               : N/A
    Enclosure Serial Number       : N/A
    ESM Serial Number             : N/A
    Enclosure Zoning Mode         : N/A
    Partner Device Id             : 65535

    Inquiry data                  :
        Vendor Identification     : LSI CORP
        Product Identification    : SAS2X36
        Product Revision Level    : 0717
        Vendor Specific           : x36-55.7.23.0

Number of Voltage Sensors         :2

Voltage Sensor                    :0
Voltage Sensor Status             :OK
Voltage Value                     :5000 milli volts

Voltage Sensor                    :1
Voltage Sensor Status             :OK
Voltage Value                     :11700 milli volts

Number of Power Supplies     : 2

Power Supply                 : 0
Power Supply Status          : OK

Power Supply                 : 1
Power Supply Status          : OK

Number of Fans               : 5

Fan                          : 0
Fan Speed              :High Speed
Fan Status                   : OK

Fan                          : 1
Fan Speed              :High Speed
Fan Status                   : OK

Fan                          : 2
Fan Speed              :High Speed
Fan Status                   : OK

Fan                          : 3
Fan Status                   : Not Installed

Fan                          : 4
Fan Status                   : Not Installed

Number of Temperature Sensors : 1

Temp Sensor                  : 0
Temperature                  : 36
Temperature Sensor Status    : OK

Number of Chassis             : 1

Chassis                      : 0
Chassis Status               : OK

... output continues with enclosure 1 containing 21 drives and 2 containing no drives.
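
The fan section is long but mostly repetitive. A short awk filter can reduce it to one line per fan. This is a sketch that assumes the "Fan" and "Fan Status" labels appear as in the output above, and summarize_fans is a hypothetical name. Because -EncInfo repeats the section for each enclosure, the fan numbers restart for every enclosure in the full output.

```shell
# Reads "-EncInfo" output on stdin and prints one status line per fan.
# summarize_fans is an illustrative name; labels are those shown above.
summarize_fans() {
    awk -F': *' '
        /^Fan +:/     { fan = $2 + 0 }
        /^Fan Status/ { printf "fan %d: %s\n", fan, $2 }
    '
}

# Example:
#   ... MegaCli64 -EncInfo -aALL | summarize_fans
```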

Battery State

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -aALL

BBU status for Adapter: 0

BatteryType: iBBU-09
Voltage: 4073 mV
Current: 0 mA
Temperature: 24 C
Battery State: Optimal
Segmentation fault (core dumped)

Battery Write-back cache should be enabled

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0 | grep -i cache
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Disk Cache Policy   : Disk's Default
LD's IO profile supports MAX power savings with cached writes: Yes
Is VD Cached: No
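
To check this from a script, the "Current Cache Policy" line is the one to test, since it can differ from the default when the controller considers the BBU bad ("No Write Cache if Bad BBU"). The following is a minimal sketch with a hypothetical function name, check_writeback, and it treats anything other than a policy starting with WriteBack as a failure.

```shell
# Reads "-LDInfo" output on stdin; succeeds only if every
# "Current Cache Policy" line starts with WriteBack.
# check_writeback is an illustrative name, not a MegaCli feature.
check_writeback() {
    awk -F': *' '
        /^Current Cache Policy/ {
            found = 1
            if ($2 !~ /^WriteBack/) bad = 1
            print "current: " $2
        }
        END { exit ((found && !bad) ? 0 : 1) }
    '
}

# Example:
#   ... MegaCli64 -LDInfo -Lall -a0 | check_writeback || echo "write-back cache is not active"
```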

Logs

Get the event information from the Adapter

root@hops:/opt/MegaRAID/MegaCli# setarch x86_64 --uname-2.6 /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog

View the event log.

root@hops:/opt/MegaRAID/MegaCli# less lsi-events.log

Shown next is the output after the drive in enclosure 55 slot 15 failed and was replaced:

Failed ->

Time: Thu Jun 13 17:09:29 2013

Code: 0x000000b9
Class: 2
Locale: 0x04
Event Description: Enclosure PD 37(c Port A/p1) phy bad for slot 15
Event Data:
===========
Device ID: 55
Enclosure Index: 1
Slot Number: 1
Index: 15

Replaced ->

Time: Thu Jun 13 17:40:50 2013

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 31(e0x37/s15) Info: enclPd=37, scsiType=0, portMap=01, sasAddr=5003048001943c57,0000000000000000
Event Data:
===========
Device ID: 49
Enclosure Device ID: 55
Enclosure Index: 2
Slot Number: 15
SAS Address 1: 5003048001943c57
SAS Address 2: 0

seqNum: 0x00017939
Time: Thu Jun 13 17:40:50 2013

Code: 0x00000119
Class: 0
Locale: 0x02
Event Description: CopyBack automatically started on PD 31(e0x37/s15) from PD 36(e0x37/s20)
Event Data:
===========
None

seqNum: 0x0001793a
Time: Thu Jun 13 17:40:50 2013

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 31(e0x37/s15) from UNCONFIGURED_GOOD(0) to COPYBACK(20)
Event Data:
===========
Device ID: 49
Enclosure Index: 55
Slot Number: 15
Previous state: 0
New state: 32

seqNum: 0x0001793b

seqNum: 0x00017926

Finished copy back ->

Time: Thu Jun 13 21:51:06 2013

Code: 0x00000116
Class: 0
Locale: 0x02
Event Description: CopyBack complete on PD 31(e0x37/s15) from PD 36(e0x37/s20)
Event Data:
===========
None

seqNum: 0x000179f9
Time: Thu Jun 13 21:51:06 2013

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 31(e0x37/s15) from COPYBACK(20) to ONLINE(18)
Event Data:
===========
Device ID: 49
Enclosure Index: 55
Slot Number: 15
Previous state: 32
New state: 24

seqNum: 0x000179fa
Time: Thu Jun 13 21:51:06 2013

Code: 0x00000087
Class: 0
Locale: 0x42
Event Description: Global Hot Spare created on PD 36(e0x37/s20) (global,rev)
Event Data:
===========
Device ID: 54
Enclosure Index: 55
Slot Number: 20
Spare Type: Revertible
Arrays Dedicated to:

seqNum: 0x000179fb
Time: Thu Jun 13 21:51:06 2013

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 36(e0x37/s20) from ONLINE(18) to HOT SPARE(2)
Event Data:
===========
Device ID: 54
Enclosure Index: 55
Slot Number: 20
Previous state: 24
New state: 2

seqNum: 0x000179fc
...
Time: Thu Jun 13 22:24:28 2013

Code: 0x0000014b
Class: 0
Locale: 0x02
Event Description: Power state change on PD 36(e0x37/s20) from ON(0) to POWERSAVE(1)
Event Data:
===========
None

seqNum: 0x00017a09
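
The timestamps in the log above give the actual duration of this copy-back: it started at 17:40:50 and completed at 21:51:06 on the same day. A small sketch of the arithmetic in plain shell follows. The function names are hypothetical, and it only handles two times on the same day.

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() (
    IFS=:
    set -- $1                        # split on ":"
    h=${1#0}; m=${2#0}; s=${3#0}     # drop a leading zero so 08/09 are not read as octal
    echo $(( $h * 3600 + $m * 60 + $s ))
)

# elapsed_hms START END, both HH:MM:SS on the same day.
elapsed_hms() (
    d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
    printf '%dh %dm %ds\n' $(( d / 3600 )) $(( d % 3600 / 60 )) $(( d % 60 ))
)

elapsed_hms 17:40:50 21:51:06
```

This gives a little over four hours for the drive in slot 15, somewhat longer than the approximate copy-back time quoted earlier.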

-- GordBoerke - 26 Feb 2013

Topic revision: r27 - 2019-04-23 - GordBoerke

Information in this area is meant for use by CSCF staff and is not official documentation, but anybody who is interested is welcome to use it if they find it useful.
