Yesterday one of our database servers crashed during what was supposed to be routine maintenance, and we had to follow a lengthy rescue protocol to minimize the chance of data loss or corruption. As far as we can tell, none occurred.
The full incident lasted approximately 4 hours; during that time it was impossible to sign up, log in, or view image pages. For a shorter period, images embedded in third-party websites were unavailable as well. As soon as we had a picture of what was going on, we modified our code base to keep serving them, which limited the impact of the outage to new uploads and to our own pages. Most of our traffic comes from images hotlinked into other websites, so this was our #1 priority.
Lessons learned and plans:
- Implement a replication scheme that will allow us to work around such problems without substantial downtime.
- Make changes to our code that will make the site less brittle and less dependent on that particular DB server.
Thanks for your patience!