First off, thanks to everyone who has been very supportive since we came back up. It was a pleasant surprise, because I thought we might be on the defensive for some time, particularly since we had a series of hiccups preceding the extended outage. So, thanks everyone, we appreciate it.
As nearly everyone will be aware, site performance was starting to get quite miserable before the upgrade, but this was also leading to other problems. Stressed servers crash more often, and fail in other areas too. A few efforts to apply quick fixes to buy time didn’t provide the benefit we hoped for, so we got new hardware on order, and started migrating literally the second the new machines came online. (And in the case of one machine, before it was even available to us, such was the urgency.)
Even so, the downtime lasted much longer than intended, primarily because of the lack of preparation. Most jobs (particularly those that result in downtime) are rehearsed a number of times to figure out the most efficient way of doing them with the least downtime. This was not an option in this instance, because we had to move to the new kit as soon as possible. This resulted in blunders that pushed us well past the original estimates. (Ask Chris how he felt when I destroyed 2 hours of work at 6am….) Anyway, the best thing that happened was accepting that we would not be able to get back online anytime soon, taking a breath, and then starting again.
So, why the big rush to new hardware, and how come we didn’t see that one coming? Since this site started we have always been pushing our hardware, but whenever we notice degradation in performance, we normally roll up our sleeves and optimise. New hardware is not normally the answer to a slow site; badly designed scripts (scripts that don’t scale) are normally at fault. So, we had hoped in this instance, as before, we could make better use of what we had. But in this case, we were in a corner, and there was nowhere to go. So we needed to get to new hardware ASAP.
We also upgraded numerous software components, including the OS (now 64-bit on the database) and the database server software. Furthermore, we bundled up a few upgrades that were planned for the future. Rather than have another lengthy outage later, we rolled it all into one. That increased the risk, but it appears to have been a success.
We do still have plenty more to do, but there should be no more serious disruption as a result.
Lessons learnt to prevent this happening again? Always make sure we have more contingency on our server resources. As simple as that.
Thanks for listening,
Oh, and we are now using carbon neutral hosting.
Trees. We love them.