Thursday, June 7, 2012

Intel SS4200-E RAID Failure & Solution

For years, I have had a RAID setup at home using an Intel Entry Storage System SS4200-E, with four 1TB disks in a RAID 5 configuration. A couple of days ago, there was a power failure. When I rebooted the storage, it initially started rebuilding the array. But after more than 24 hours, the status on the web interface still showed 0% rebuild complete. On top of that, I was not able to access the dashboard through the web interface.

So I powered down the storage, waited 5 minutes or so for the disks to cool down, and powered it back up. I expected the array to start the rebuild process over again. Instead, all four lights on the front panel turned amber, meaning all four disks had failed! Once again, I could not access the dashboard. I power-cycled the system a couple more times, hoping the result would be different.

Unfortunately, there was not much I could do at this point. Then I noticed a warning in the disk status under settings in the web interface, "disk has been replaced with a disk containing data from another system", against each of the four drives. So the system thought that the disks had been swapped and was not assembling the RAID properly.

Digging further, I came across this thread http://communities.intel.com/message/32955 and decided to try it out.

  1. Log in to the device using the web interface
  2. Bring up the hidden support.html page by manually pointing your browser to the URL http://your-device-ip/support.html
  3. Turn on SSH access
  4. Restart the device
  5. Log in to the device using SSH. The username is root and the password is soho followed by your admin password. For example, if your admin password is 1234, then the SSH password for root is soho1234
  6. Look for the e2fsck process by running the command: ps -aef | grep -i e2fsck
  7. Kill this process: kill -9 pid-of-e2fsck-process
Suddenly, all the lights turned solid blue and there was no data loss. I was able to access all my data without any problems! It would have been a nightmare to lose years' worth of data. For now, I am not rebooting the storage until I find a permanent solution for this process kicking in during startup.
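
Steps 6 and 7 can be combined into a small helper. This is only a sketch, assuming the shell on the device provides awk and that ps -aef prints the PID in the second column, as it does on a standard Linux system. The function name find_fsck_pids is made up for illustration and does not come from the Intel thread.

```shell
# Hypothetical helper (not from the original post or the Intel thread).
# Reads "ps -aef"-style output on stdin and prints the PID (2nd column)
# of every line that mentions e2fsck.
find_fsck_pids() {
  # [e]2fsck matches "e2fsck" but not this awk command's own entry in ps.
  awk '/[e]2fsck/ { print $2 }'
}

# Usage on the NAS, after logging in over SSH as root:
#   ps -aef | find_fsck_pids          # list the PIDs first and check them
#   ps -aef | find_fsck_pids | while read -r pid; do kill -9 "$pid"; done
```

Listing the PIDs before piping them into kill is worth the extra step, since kill -9 gives the process no chance to clean up.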

Update-1: It looks like the e2fsck process runs only when the device is rebooted manually by pressing the power button. If it is rebooted cleanly using the web interface, this process is not found running. But the disks still show amber after either type of reboot. So the quick fix seems to be to reboot using the power button, then kill the e2fsck process over SSH to get access to the disks and data. There should be some better way to fix this.

Update-2: After killing the process, I ran the e2fsck command manually: e2fsck -f /dev/evms/md0vol1. I answered yes to all the questions asked. Once this check was complete, I was able to reboot the system without any issues and the disks no longer went to the amber state.
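
A minimal sketch of that manual check, assuming the e2fsck build on the device supports the standard -y option (answer yes to every question) and -C 0 option (print progress to the terminal). The function name check_volume is hypothetical, and the volume path is the one from this setup, so it may differ on other systems.

```shell
# Hypothetical wrapper around the manual check described in Update-2.
check_volume() {
  dev="$1"
  # Refuse to run on anything that is not a block device.
  if [ ! -b "$dev" ]; then
    echo "not a block device: $dev" >&2
    return 1
  fi
  # -f   force a check even if the filesystem is marked clean
  # -y   answer "yes" to every question (assumed to be supported here)
  # -C 0 show a progress bar on the terminal (assumed to be supported here)
  e2fsck -f -y -C 0 "$dev"
}

# Usage on the NAS, after killing the startup e2fsck process:
#   check_volume /dev/evms/md0vol1
```

Note that -y accepts every repair e2fsck proposes, which matches answering yes by hand but leaves no chance to decline a specific fix.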

References

  1. http://communities.intel.com/message/32955
  2. http://communities.intel.com/thread/26862
  3. http://communities.intel.com/message/70720
  4. http://serverfault.com/questions/118791/how-do-you-get-e2fsck-to-show-progress-information

3 comments:

  1. I would like to THANK YOU as I have run into the very same issues. I was about to give up on 2 TB of data and at the very last minute I came across your post. Big Kudos to you!
  2. In my case, I'm not able to access the web interface. When I start the machine, the fans run at full speed.

    Any help on how to recover the data?
  3. I am facing something similar. A week ago, one of the 1TB HDDs in my RAID 5 array showed amber and it started rebuilding. After the rebuild was done (it took one whole day), I was unable to log in to the admin website, and the system lamp is constantly blinking green. Any advice would help.