As you are all painfully aware at this point, host4 has been experiencing a large number of problems this week and is again offline. Below is a breakdown of what has happened to date, what we have done, and what we are doing to restore service and prevent this from recurring.
Tuesday evening around 9:30PM, this server's RAID array reported that one of its drives had failed. We immediately replaced the drive and the array started to rebuild. However, 83% of the way through the rebuild, the array encountered an error on the drive we were rebuilding from.
At this point, we replaced all the drives in the array and restored the data from a backup from 8PM on Tuesday. Service was restored late Wednesday.
Friday at about 11:30AM, another alarm went off on the RAID array in the server. Again, our self-test on the RAID card showed no problems, so we suspected that a cable to one of the hard drives was going bad. We replaced all the drive cables with new ones and rebuilt the array using brand new drives that arrived Friday morning. We restored service at about 3:30PM, but the array failed completely at 7:30PM.
At this point, we have removed the hot-swap backplanes and RAID card from the server, as these are the only remaining points of failure in this storage array. We are restoring a backup from today to a new drive that is connected directly to the onboard controller.
This is obviously a very rare set of problems that we have run into this week, and we are working as fast as possible to restore service. The backup restore will take 4-5 hours to complete, at which point the server will be back up.
If we have any further issues with this server over the next few days, we will move all customers on this server to completely new hardware (and we do mean completely new: server, storage array, drives, etc.). As long as the current setup proves reliable, we will run on it for the next 4-5 days. Next week, after things settle down a bit, we will replace the RAID controller card and backplane that we removed this evening with new hardware and move everything back to the new storage array.
We will be issuing everyone affected by this server's outage a credit for a free month of hosting for the trouble, although we know that does not make up for the problems. We hope you can understand that we have been working around the clock to resolve this problem and are doing everything possible to make sure service is restored and reliable for you and the rest of the customers on this server.
Saturday, October 11, 2008