What happened
We had some server issues early Sunday morning. More specifically, we were unable to serve websites. Even more specifically, we were unable to serve websites for 22 hours. Not a good day for our clients. Not a good day for us.
It’s never a good time to have web server issues. Back40 Design has hosted websites for 11 years. Although technology and equipment have improved over the years, there have been no great improvements in making a client feel better when their website is down.
Awful timing
The timing of this outage was awful. We are two weeks away from bringing an all-new, updated server system online. In mid-April we purchased a package of 5 servers to improve our web serving capabilities and to guard against fatal hardware issues just like the one we had early Sunday morning. Since the purchase, we have dedicated a programmer full-time to the configuration and testing of the system. The new system is scheduled to go online by May 15th. The switchover to the new servers should not affect business at all for you or your customers.
The new system is configured to be fully redundant, meaning we have another serving system waiting and ready to take over in the event of a hardware failure. The second system’s only job is to listen for a heartbeat of the other server – and if that server skips a beat, connectivity is switched to the waiting server.
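To make the heartbeat idea concrete, here is a minimal sketch in Python of the logic a waiting server might run. It is an illustration only, not our actual configuration: the class name `FailoverMonitor`, the 3-second timeout, and the "primary"/"standby" labels are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Standby-side monitor (hypothetical) that takes over when the primary goes quiet."""

    timeout_seconds: float        # assumed threshold for a "skipped beat"
    last_heartbeat: float = 0.0   # time the most recent heartbeat was heard
    active: str = "primary"       # which server currently receives traffic

    def record_heartbeat(self, now: float) -> None:
        # Called each time a heartbeat from the primary server is heard.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        # If the primary has been silent longer than the timeout, switch over.
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.timeout_seconds:
            self.active = "standby"
        return self.active


monitor = FailoverMonitor(timeout_seconds=3.0)
monitor.record_heartbeat(now=10.0)
still_primary = monitor.check(now=12.0)   # 2.0 seconds of silence, within the timeout
after_silence = monitor.check(now=14.5)   # 4.5 seconds of silence, past the timeout
```

In this sketch the standby stays in charge once it has taken over, even if heartbeats resume. Switching back to the repaired server is left as a deliberate, manual step.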
Bank of trust
For the clients we have built trust with over the years, this outage was a withdrawal. I hope this post sufficiently explains our actions to regain connectivity during this outage, and our plan to safeguard uptime in the future. To our more recent clients who do not yet have the trust of years of doing business with Back40: please understand we are working to give your website and your visitors the experience they expect. We will keep you informed of the upgrade process in the coming days.
What we did to fix it
Hardware failure is what happened. So we pulled and replaced the drive that was not performing. Sounds easy enough, but this takes a while. The server is configured with RAID, so there was no loss of data from the hard drive malfunctioning. We restarted the server. We still had the same error, and it was not a performance issue: databases were not connecting to their respective websites.
Next we replaced all the cabling and the RAID controller. This also took some time. We restarted the system and still had the same error, which did not allow us to get the websites online.
The next step was the undertaking we were aiming to avoid: a total server rebuild. Our methods and actions to this point were meant to isolate and replace the offending hardware. Isolating and replacing malfunctioning hardware is more time efficient than rebuilding a server. This approach did not yield stable results, so we went to plan B and rebuilt the server.
We booted up from an external operating system, attached an external terabyte drive and offloaded all the server data (websites, configuration files, databases). This also took time. More specifically, hours. Once that was completed, we replaced all the hard drives, tested their performance and set up RAID on the new drives. We installed the server software and set up the database hosting software. We tested and re-tested. We transferred the websites from the external terabyte drive and restarted. The rebuild worked, and the server has no serving issues.
You have my apologies. I will keep you informed as our new system comes online in mid-May. If you have any questions or issues I can address, please contact me at dave@back40design.com.