Sabbatical Day 40 - Downtime

Two missing updates. I was feeling poorly and thus mainly spent the days in bed. Clearly I'm not hardcore.

Today saw the first significant bit of downtime for SteadyService - customer sites were unavailable for a while.

So what happened?

  • Amazon have made some changes to their infrastructure, which required restarting customer VMs. As scheduled, the VM hosting customer services was restarted.
  • Services were not configured to start at boot, so they did not come back up when the machine returned a few minutes later. I knew this would be the case when I wrote the original scripts, but I had forgotten about it by the time of the restart.
  • There was no active monitoring of the sites in place, so I didn't notice that they were down.

So what am I doing to prevent this occurring again?

  • Adding active monitoring to all the sites so I will be notified immediately if anything breaks. I need to monitor many more hosts than most commercial solutions make possible, plus I want to automatically add customer hosts when they are created, so it looks like this will be something home grown with Nagios. Any better ideas?
  • I will be modifying my site creation scripts so sites are automatically brought online again when a box restarts.
  • In future, if Amazon schedules a restart of machines hosting customer data, I will handle it by bringing up a new box hosting customer sites, testing that it works, transferring DNS over, and then bringing down the old server. This should avoid the downtime which would otherwise be caused by restarts.
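
To make the monitoring idea concrete, here is a minimal sketch of the kind of check a home-grown monitor might run. It is not the actual SteadyService setup: the function names (`check_site`, `find_down_sites`) and the example URLs are hypothetical, and it assumes that a plain HTTP GET answering with a 2xx or 3xx status is a good enough signal that a site is up.

```python
import urllib.request


def _http_status(url, timeout):
    """Fetch the URL and return its HTTP status code.

    urlopen raises an exception for 4xx/5xx responses and for
    connection failures; the caller treats any exception as "down".
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_site(url, timeout=10, fetch=None):
    """Return (is_up, detail) for a single site.

    `fetch` is injectable so the check can be exercised without a
    network; by default it performs a real HTTP GET.
    """
    fetch = fetch or _http_status
    try:
        status = fetch(url, timeout)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return 200 <= status < 400, f"HTTP {status}"


def find_down_sites(urls, fetch=None):
    """Check every URL and return {url: detail} for the ones that are down."""
    down = {}
    for url in urls:
        ok, detail = check_site(url, fetch=fetch)
        if not ok:
            down[url] = detail
    return down


if __name__ == "__main__":
    # Hypothetical customer list; in practice the site creation
    # scripts would append to this whenever a new host is created.
    customer_sites = ["http://customer-one.example", "http://customer-two.example"]
    for url, detail in find_down_sites(customer_sites).items():
        print(f"DOWN {url}: {detail}")
```

Run from cron every few minutes with the output mailed somewhere, something like this would have caught the outage above. A script of this shape could probably also be adapted into a Nagios check, though the sketch does not follow Nagios plugin conventions.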

This is another reminder that running an always-available service brings a whole host of additional challenges you don't face when just writing and distributing software.