Twitter went down today, and it was “a big thing” – many users turned to Facebook and Tumblr to complain about the outage. This is the biggest Twitter outage since October last year, and it deserves a bit of attention. Why do these outages happen? Twitter reported that it was a “cascaded bug”, which probably means very little to the public. A hacker group claimed it was a DDoS. And it might have been a different thing entirely.
Sites like Twitter, Facebook, Google, etc. have hundreds of servers that run the same application. So when you type “twitter.com” you are connected to one of X hundred machines in the server farms / cloud / data centers. It’s no big deal if a server crashes – the rest of the servers take over its load. Even with 10% of the servers down, the rest should be able to cope. Moreover, when the server capacity drops, some features are turned off. For example, you can’t read your favorites or your replies. Features are turned off so that the main features (tweeting and the stream) remain active. But turning off features wasn’t enough this time. So what can be the reason for such an outage? Here is a very high-level explanation:
- cascaded bug – let’s assume the official version is true. Twitter constantly releases updates to its code, and some updates contain bugs. Some bugs might be so severe that they cause a server to run out of memory (RAM), or “eat” too much CPU. Since the code is released to all servers throughout the day, a mistake by a single engineer could have caused the outage. The timing also hints at something like this: 11:59 (actually 12:00 server time) is a time when some scheduled operations take place. If a scheduled operation exceeded the memory of a single server, it could bring that server down. And in the same way it could bring down every server the code is deployed to.
- cascaded outage – this has happened to Foursquare once. When one server dies, its load is transferred to some of the rest. But if they can’t handle it, they also die, and their load is transferred to the remaining ones. Obviously, those can’t handle it either, and the problem cascades until everything runs out of resources. This might be related to the previous case – if all of the servers with the new code “die”, and the rest can’t handle the load, they also die, leaving no servers to handle requests. This can happen when the load is already close to the limit of what the servers can handle. It is far less likely if all servers are operating at around 50% of their resources, because the survivors then have plenty of headroom to absorb the extra load.
- DDoS – distributed denial-of-service attack. A hacker group claimed to have performed a DDoS, thus bringing Twitter down. There is the so-called “bandwidth” – it’s the amount of data that a connection can transfer per second. If hackers fill the whole bandwidth with their requests, legitimate users are rejected – they can’t reach the routers, load balancers and servers. A DDoS is performed by many machines, usually infected with a virus and coordinated by a hacker group. So imagine millions of machines firing thousands of requests at Twitter at the same time. The network can’t handle it, and so many requests are rejected.
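To make the cascaded outage scenario from the list above a bit more concrete, here is a toy simulation. It is only a sketch under made-up assumptions, not how Twitter or any real load balancer works: the function name `simulate_cascade`, the capacity numbers and the rule that load is split evenly across the live servers are all hypothetical.

```python
def simulate_cascade(capacities, total_load, initially_failed):
    """Toy cascade model (all names and numbers are hypothetical).

    capacities       -- max load each server can handle
    total_load       -- total load, split evenly across live servers
    initially_failed -- how many servers (from the start of the list) are down

    Returns the number of live servers after each round.
    """
    alive = list(capacities[initially_failed:])
    history = [len(alive)]
    while alive:
        share = total_load / len(alive)  # load each live server receives
        survivors = [c for c in alive if c >= share]
        if len(survivors) == len(alive):
            break  # every server can handle its share, the cascade stops
        alive = survivors  # overloaded servers die, their load moves on
        history.append(len(alive))
    return history


# 10 equal servers running at 50% of capacity: losing one is harmless.
print(simulate_cascade([100] * 10, 500, 1))   # [9]

# The same servers at 95% of capacity: losing one takes down all of them.
print(simulate_cascade([100] * 10, 950, 1))   # [9, 0]

# Unequal servers: the weakest die first, then the rest follow.
print(simulate_cascade([100, 100, 110, 120, 130], 450, 1))  # [4, 2, 0]
```

The first two runs differ only in how close the servers were to their limit before the first failure, which is the point of the 50% remark: with enough headroom the survivors absorb the extra load, without it a single failure can empty the whole farm.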
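Which one is the true cause? I guess Twitter will explain in the upcoming hours. But the fact that no “fail whale” appeared hints that a DDoS is not unlikely: the fail whale is the page Twitter itself serves when it is over capacity, so its absence suggests that requests may not have been reaching Twitter’s servers at all.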