This is Why Facebook Went Down Yesterday.

Yesterday Facebook was down for about two and a half hours and unstable for even longer. It seems the cause was a glitch in one of their automated systems. Robert Johnson, a Facebook engineer, posted an explanation of what happened. Essentially, they had to turn off the whole site so that an overloaded database cluster could recover:

Early today Facebook was down or unreachable for many of you for approximately 2.5 hours. This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned.

The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.

The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.

Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.

To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.

The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.

You can read the whole post here: More Details on Today’s Outage
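
For readers who think in code, here is a rough sketch of the feedback loop Johnson describes. It is a toy model, not Facebook's actual system: the function name, the single "config" cache key, the client counts, and the idea of a "tick" are all assumptions made up for illustration. The database answers only the first `capacity` queries per tick and fails the rest, and the `errors_delete_key` flag switches between the behavior in the post (a failed query deletes the cache key) and a hypothetical alternative that leaves the cache alone on errors.

```python
def simulate(clients, capacity, ticks, fix_at_tick, errors_delete_key):
    """Return how many database queries are attempted on each tick.

    Toy model (all names and numbers are hypothetical): one shared cache key,
    `clients` clients acting at the same moment each tick, and a database
    that answers only the first `capacity` queries per tick and fails the rest.
    """
    GOOD, INVALID = "good", "invalid"
    stored_value = INVALID          # the bad change made to the persistent store
    cache = {"config": INVALID}
    attempts_per_tick = []

    for tick in range(ticks):
        if tick == fix_at_tick:
            stored_value = GOOD     # the root cause gets fixed here

        # Every client checks the cache at the same moment.
        if cache.get("config") == GOOD:
            attempts_per_tick.append(0)
            continue

        # Every client saw a bad or missing value, so every client queries.
        attempts_per_tick.append(clients)
        for i in range(clients):
            if i < capacity:
                cache["config"] = stored_value   # query succeeded
            elif errors_delete_key:
                cache.pop("config", None)        # error treated as "invalid value"
            # else: the error is ignored and the cache is left alone

    return attempts_per_tick


# The behavior from the post: failed queries delete the cache key.
flawed = simulate(1000, 100, 6, 2, errors_delete_key=True)
# Hypothetical alternative: failed queries leave the cache untouched.
safer = simulate(1000, 100, 6, 2, errors_delete_key=False)
# Same flawed logic, but with traffic cut below what the database can handle.
throttled = simulate(50, 100, 6, 2, errors_delete_key=True)
print(flawed, safer, throttled)
```

In this model the flawed run keeps sending 1000 queries on every tick, even after the stored value is fixed at tick 2, because the failed queries keep wiping out the good value. The throttled run recovers as soon as the fix lands, which is roughly why cutting traffic to the database cluster was the way out. The safer run also recovers, but whether ignoring errors would have been the right design for the real system is not something the post says.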

So basically a feedback loop was created that overloaded the system. Here’s a layman’s interpretation of what happened. OK, you have a kid who wants a cookie, but all you have is crackers.

Kid – “Can I have a cookie?”
You – “No, but here’s a cracker.”
The kid looks at the cracker and it’s not what he wants so he asks again. “Can I have a cookie?”
You – “No, but here’s a cracker.”

This continues… I think we all know kids that do this. 😉

Then the rest of the kids in the neighborhood hear “cookie” and they want one too, so they all start asking.
Neighborhood kids: “Can I have a cookie?”
You – “No, but here’s a cracker.”
The same thing happens and they continue to ask “Can I have a cookie?”

More and more kids start asking for cookies until the whole country is asking. So, the sequence keeps repeating.

Kids – “Can I have a cookie?”
You – “No, but here’s a cracker.”
The kids look at the crackers and it’s not what they want, so they ask again. “Can I have a cookie?”
You – “No, but here’s a cracker.”

Kids – “Can I have a cookie?”
You – “No, but here’s a cracker.”
The kids look at the crackers and it’s not what they want, so they ask again. “Can I have a cookie?”
You – “No, but here’s a cracker.”

Kids – “Can I have a cookie?”
Kids – “Can I have a cookie?”
Kids – “Can I have a cookie?”
Kids – “Can I have a cookie?”

Get it?
