Facebook experienced a 2.5 hour down time yesterday and it is the largest or longest ever downtime for the social networking giant.
It was a more than just one site going down and users were waiting for it to come up. It was more than 350,000 sites all over the globe using facebook API to connect / communicate were broken as well!
Facebook status page kept us in the dark, the same darkness they were in;
Current Status: API Latency IssuesWe are currently experiencing latency issues with the API, and we are actively investigating. We will provide an update when either the issue is resolved or we have an ETA for resolution.We were then alerted to to the actual problem they later explained in the post-Mortem of the issue.
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.Finally so it came back and the internet continued to function. That Mark fellow has too much power, breakup facebook!
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
The way to stop the feedback cycle was quite painful - we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.