In a phrase, they accidentally shot themselves in the foot.
Facebook went offline for six hours on Monday, leaving advertisers and businesses frustrated and users around the globe confused. It wasn’t just Facebook’s main app, of course: the rest of the company’s portfolio, including Instagram and WhatsApp, also went down.
I covered the play-by-play of the afternoon of October 4th, which you can read here. At the time of writing, there wasn’t any official word from Facebook as to exactly what happened. The company did release a blog post late on Monday indicating it was a malfunction in a routine maintenance session, but it clearly needed to explain more to dispel any confusion around potential hackers or other foul play.
Yesterday, Facebook’s engineering department published a blog post that went over the events of the outage and what led to it. In a similar fashion to yesterday’s newsletter, here’s an outline.
Facebook says the outage was, in fact, caused by a malfunction in a routine maintenance session. According to the company, a command run during that session “unintentionally took down all the connections in [its] backbone network, effectively disconnecting Facebook data centers globally.” (For context, Facebook’s backbone network links its data centers together and is essentially the foundation of Facebook’s online presence. Just as you can’t have a house without a solid foundation, there’s no Facebook on the internet without a working backbone network.)
The company says the command that was run had a bug in it, and the auditing system designed to catch bugs like this had a bug of its own. This ultimately led to the unintentional disconnection of Facebook’s backbone network, and it doesn’t seem the company was aware until people started complaining.
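To make the idea of an audit step concrete, here is a minimal sketch of a check that refuses a command if it would leave too few backbone links up. Facebook has not published how its audit tool works, so everything here is an assumption made for illustration: the `Link` type, the `audit` function, the link names, and the `min_links_up` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A hypothetical backbone link between two data centers."""
    name: str
    up: bool = True


def links_left_after(links, to_disable):
    """Names of links that would still be up if the command ran."""
    return [l.name for l in links if l.up and l.name not in to_disable]


def audit(links, to_disable, min_links_up=1):
    """Approve the command only if enough links would remain up."""
    return len(links_left_after(links, to_disable)) >= min_links_up


# Illustrative topology, not Facebook's real network.
backbone = [Link("dc1-dc2"), Link("dc2-dc3"), Link("dc1-dc3")]
```

In this sketch, `audit(backbone, {"dc1-dc2"})` approves the command because two links stay up, while a command that disables all three links is rejected. A bug in a check like this, say one that always returns `True`, would let the second kind of command through.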
Those complaints didn’t just come from Twitter. People within Facebook were getting locked out of their offices, unable to access their computers and company software to assist with diagnostics. According to Facebook Engineering, the issues reached the point where none of their debugging tools were going to work since they’re all accessed remotely over Facebook’s servers.
“All of this happened very fast,” Santosh Janardhan wrote in the company’s blog. “And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.
“Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.”
Luckily, Facebook was able to get its backbone network back online once engineers were working onsite in the data centers. However, the company goes on to explain that it then faced another challenge: a sudden surge of traffic that was sure to come once they flicked the switch.
“Once our backbone network connectivity was restored across our data center regions, everything came back up with it. But the problem was not over — we knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.”
Facebook says it regularly simulates traffic surges like these in what they call “storm” drills. “Experience from these drills gave us the confidence and experience to bring things back online and carefully manage the increasing loads. In the end, our services came back up relatively quickly without any further systemwide failures.”
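The general idea of bringing load back gradually rather than all at once can be sketched in a few lines. This is only an illustration of a staged ramp-up under stated assumptions, not Facebook’s actual procedure, which the company has not described in detail: the function names, the evenly spaced steps, and the `is_healthy` callback are all hypothetical.

```python
def ramp_schedule(target_load, steps):
    """Cumulative load admitted at each step, ending at target_load."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [target_load * i / steps for i in range(1, steps + 1)]


def restore(target_load, steps, is_healthy):
    """Raise load step by step and return the last level that was healthy.

    is_healthy is a hypothetical callback that reports whether the
    systems can cope with a given load level.
    """
    admitted = 0.0
    for level in ramp_schedule(target_load, steps):
        if not is_healthy(level):
            return admitted  # stop ramping instead of risking a new crash
        admitted = level
    return admitted
```

For example, `ramp_schedule(100, 4)` gives the levels 25, 50, 75 and 100. If the health check starts failing above 50, `restore` stops there instead of pushing on to full load.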
Janardhan then elaborated on how Facebook plans to learn from this outage: “Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one. After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”
Facebook should be way more careful with this stuff. Two seemingly minor bugs were capable of collapsing Facebook’s foundation, making it impossible for users to access its various apps and products and impossible for Facebook to fix the problems remotely. A mere coincidence of two coding errors was enough to shut down one of the world’s largest online presences for hours.
Let’s just hope Facebook learns its lesson. It has to – it’s literally responsible for billions of users and their interaction with one another.