On October 4, 2021, Facebook services began dropping offline, and then disappeared abruptly at 15:39 UTC. It took nearly six hours to restore normal service. With over 3.5 billion users of one or more products from Facebook, Inc. (now known as Meta Platforms, Inc.) facing a lengthy downtime, speculation about the cause of the outage flooded the internet. This article outlines the events that led to the outage and aims to help organizations large and small learn from the breakdown.
During regular network maintenance activities, Facebook engineers issued a command on the routers in its backbone network that unintentionally took down the connections between them.
The audit tooling that normally blocks such mistakes contained a bug, so it failed to stop the command.
Facebook operates its own backbone network, which links its data centers and hands traffic off to the internet at various points of entry. The faulty configuration change to the backbone routers interrupted all internal communication.
This had a cascading effect on the company's intranet: as the network became unhealthy, facilities one by one stopped advertising their presence to the internet, and eventually all of the company's apps and services, including internal access points, went off the grid.
As a result, the facilities that answered DNS queries became unreachable themselves. DNS resolution errors skyrocketed, and within minutes every entry point to Facebook's content worldwide was unreachable. Facebook was suddenly off the grid, and the domain was even listed as "available" for sale for a short time.
"Our internal tools and systems complicated [our IT teams'] attempts to diagnose quickly and resolve the problem," explained Santosh Janardhan, VP for infrastructure at Facebook Inc., adding, "[Our IT team] identified the root cause as a faulty configuration and ruled out any malicious activity or data breach."
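The audit safeguard mentioned above can be illustrated with a minimal sketch. This is not Facebook's tooling: the `Link` type, the `audit` function, the link names, and the `min_up` threshold are all hypothetical, chosen only to show the idea of simulating a change and rejecting it if it would leave the backbone with no working links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A hypothetical backbone link between two data centers."""
    name: str
    up: bool = True


def links_after(links, to_disable):
    """Return the links as they would look after disabling `to_disable`."""
    return [Link(l.name, l.up and l.name not in to_disable) for l in links]


def audit(links, to_disable, min_up=1):
    """Approve a change only if at least `min_up` links would stay up."""
    remaining = sum(l.up for l in links_after(links, to_disable))
    return remaining >= min_up


# Hypothetical three-link backbone used only for illustration.
backbone = [Link("dc1-dc2"), Link("dc2-dc3"), Link("dc1-dc3")]

print(audit(backbone, {"dc1-dc2"}))                        # drains one link
print(audit(backbone, {"dc1-dc2", "dc2-dc3", "dc1-dc3"}))  # drains every link
```

The first change is approved and the second is rejected. A guard like this is only as good as its own logic, which is why a bug in the audit step can let a destructive command through.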
The Border Gateway Protocol (BGP) is often described as the postal system of the internet. Through it, companies such as Facebook announce to other networks which address ranges their autonomous systems can deliver traffic to. In other words, BGP helps each network choose the best path to any other network, much as a postal service chooses a route for a letter.
The internet is a network of networks, so it is vital for each network to keep its routes announced to its peers; without them, users worldwide have no path to its servers. Inside Facebook is a vast network that the company calls its backbone network: a long-term investment in its own intranet that spans the globe, linking its data centers over fiber.
The facilities connect to each other over this backbone network through routers. On October 4, a routine maintenance job on these routers unintentionally took down all the connections in the backbone network. An analogy is a kitchen that gets cut off from its restaurant, leaving impatient and hungry diners demanding meals.
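To make the role of route announcements concrete, here is a toy sketch of a single router's view of the internet. It is a deliberate simplification, not a BGP implementation: the `RoutingTable` class is hypothetical, the prefix and AS numbers come from ranges reserved for documentation, and real BGP tracks routes per peer and applies many more tie-breakers than path length.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps a prefix to the AS path toward its origin."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, as_path):
        net = ipaddress.ip_network(prefix)
        current = self.routes.get(net)
        # Prefer the shorter AS path, one of several real BGP tie-breakers.
        if current is None or len(as_path) < len(current):
            self.routes[net] = list(as_path)

    def withdraw(self, prefix):
        # Forget the prefix entirely; a no-op if it was never announced.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: the destination is unreachable
        # Longest-prefix match: the most specific route wins.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", [64497, 64500])
table.announce("192.0.2.0/24", [64496, 64498, 64500])  # longer path, ignored

print(table.lookup("192.0.2.53"))  # a path exists while the route is announced
table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.53"))  # None once the route is withdrawn
```

Once the prefix is withdrawn, the servers behind it may be perfectly healthy, but no other network knows how to reach them. That is the situation the rest of the internet was in with respect to Facebook's DNS servers during the outage.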
During the repair, engineers found it hard to reach the data centers: remote access was blocked by the network failure, and the internal repair tools were unusable. As a last resort, personnel were sent to the data centers to debug and restart the systems on site. Physical access was designed to be difficult for security reasons, so this took additional time. Once the team fixed the problem, the network was up and running again.
Site24x7's DNS monitoring solution helps you look up the DNS status of your websites from more than 110 global locations. It helps you eliminate potential domain resolution errors on your critical servers, ensuring you stay on top of outages and performance concerns.
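The core of any DNS check is simply asking whether a name still resolves. The sketch below shows that idea using only Python's standard library. It is not Site24x7's implementation or API: the `resolve` helper and its `resolver` parameter are hypothetical names, and a real monitor would query from many locations and specific name servers rather than the local resolver.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for `hostname`, or [] if lookup fails.

    `resolver` is injectable so the check can be exercised without a network.
    """
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        # Name resolution failed, as it did for Facebook's domains.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve("localhost")
    status = "resolves" if addresses else "DOES NOT RESOLVE"
    print(f"localhost {status}: {addresses}")
```

Running a check like this on a schedule, and alerting when the result turns empty, gives early warning of the kind of resolution failure described in this article.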
Site24x7's Network Configuration Management helps IT administrators efficiently back up network router configurations so they can be restored immediately when necessary.
Website monitoring is a high-stakes effort, as user experience and an organization's brand value are often impacted instantly when IT infrastructure is down or under attack. For comprehensive, advanced end-to-end website monitoring capability, visit Site24x7's website monitoring suite. Site24x7 helps keep all your websites available to visitors across the globe, and helps webmasters and IT administrators gain the proactive edge to restore services and thwart attacks, ensuring the best user experience.