9 January 2011

Disk swapping question

You may have guessed that since I’m off work, and to all intents and purposes off the radar in the UK, I’m catching up on my emails more formally.

Anyway, I was asked over email whether I would swap a failed hard drive on a production application server running Linux, in this case an HP ProLiant DL385 G1. The concern was potential data loss, or an outage to the application or the server, if something went wrong while swapping the disk.

My reply was that it can be done online without shutting the server down, and that it is a day-to-day task engineers perform regularly without an outage. That said, there are some considerations:

  • If it’s tier 1 and there is genuine concern, leave it and arrange scheduled downtime during a quiet period.
  • If it’s anything I’d say older than 2003, I would swap out of hours, simply because of the age of the servers and the experience I have had with some emotional early servers, where a disk swap should have been fine but wasn’t, for about twelve different reasons.
  • I also tend to leave the disk until it’s in a proper orange failed state, as I’ve had a few issues and arguments with servers and array controllers in the past. Swapping a failed disk is pretty binary. Swapping a disk in predictive failure should be too, but if you’ve got ghosts in your array configuration, or both disks in a RAID 1/0 set reporting predictive failure, swapping the disk might not mean it rebuilds correctly.
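As a rough illustration only, the considerations above could be written down as a small decision helper. This is a sketch of my rule of thumb, not a tool I actually run: the names `SwapContext` and `recommend`, the tier numbering and the disk state labels are all made up for the example, and the order of the checks is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SwapContext:
    """Hypothetical description of the server and the disk to be swapped."""
    tier: int                      # 1 = most critical service (assumed numbering)
    server_year: int               # rough year the server dates from
    disk_state: str                # "failed" or "predictive" (assumed labels)
    genuine_concern: bool = False  # is there genuine concern about an outage?


def recommend(ctx: SwapContext) -> str:
    """Return a suggested approach based on the three considerations."""
    # Prefer to wait for a proper (orange) failed state before swapping.
    if ctx.disk_state != "failed":
        return "wait until the disk is in a fully failed state"
    # Tier 1 with genuine concern: arrange scheduled downtime.
    if ctx.tier == 1 and ctx.genuine_concern:
        return "schedule downtime in a quiet period"
    # Anything older than 2003: swap out of hours.
    if ctx.server_year < 2003:
        return "swap out of hours"
    # Otherwise a hot swap is a routine, online job.
    return "swap online"


if __name__ == "__main__":
    print(recommend(SwapContext(tier=2, server_year=2005, disk_state="failed")))
```

In practice the answer also depends on the support model and how comfortable the engineer is, which no function captures.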

Ultimately it depends on your support model and your comfort in carrying out the disk swap. I tend to work on the basis of cautious optimism, while being prepared for any fallout. Not to an extreme level, just simple things: check the backups, and check that the application was working beforehand, so that if something goes wrong once we swap the drive we can rule the disk swap in or out as the cause.



