It really is. Facebook is far from bug-free, but I'm a near-constant user and this is the first time in a long time that I can recall it going down completely. I'll be really surprised if it lasts more than an hour, at most.
It was specifically fbstatic-b-a.akamaihd.net that was also unresponsive, preventing the CSS & JS from loading and rendering a completely broken page in my browser.
Akamai has several different networks: one for streaming media, one for regular HTTP, and one for HTTPS. They also have an S3-like storage system and a DNS service. Some customers are configured with several of these offerings at once, so it's hard to know which, if any, experienced problems and led to the outages.
Honestly, I doubt it was an Akamai issue. If Akamai experienced network problems, dozens or hundreds of sites would be affected. If an Akamai config issue (i.e. human error) were to blame, then it would probably only affect one site, not several. Neither FB nor Akamai is dumb enough to push multiple site changes at once.
I restarted my modem and router two times before I realized that the network was fine and that it was just Facebook that was down. Props to their ops team for uptime that is so reliable.
$400 per second somehow seems... smaller than I would have thought. Their opex on a per second basis has got to be orders of magnitude higher than that.
For comparison, the cost of downtime for a single oil rig is on the order of $100 per second. Petrobras, as an example, operates 70 rigs and has a market cap of $125 billion. Facebook has a market cap of $216 billion.
(source for cost: Cormorant Alpha shutdown in 2013)
Edit: misplaced parenthesis => my first number was way off, $70 000 vs $100. Redid analysis, made mental note not to do math before coffee intake.
http://www.offshoreenergytoday.com/cormorant-alpha-shutdown-... estimates that the Cormorant Alpha shutdown cost $10,000,000/day, or about $100/second. That figure comes from multiplying the oil production (90,000 barrels/day) by the oil price ($110/barrel). Where did you get the $70,000/second figure from?
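For what it's worth, here's a quick sanity check of both per-second figures. The oil inputs come from the linked article; the Facebook number assumes roughly $12.5B in annual revenue, which is my own ballpark for 2014:

    # Back-of-the-envelope check (inputs approximate; FB revenue assumed)
    barrels_per_day = 90_000
    price_per_barrel = 110          # USD, from the article
    seconds_per_day = 24 * 60 * 60  # 86,400

    rig_per_second = barrels_per_day * price_per_barrel / seconds_per_day
    print(f"rig downtime: ~${rig_per_second:,.0f}/s")     # ~$115/s

    fb_annual_revenue = 12.5e9      # USD, assumed ballpark for 2014
    fb_per_second = fb_annual_revenue / (365 * seconds_per_day)
    print(f"facebook revenue: ~${fb_per_second:,.0f}/s")  # ~$396/s

So both the ~$100/second rig figure and the ~$400/second Facebook figure check out to within rounding.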
It wasn't stated that that was the first thing he tried.
I've experienced router/modem issues where only certain sites were inaccessible until one or both pieces of equipment were restarted. I'd be interested in knowing exactly what happens to a router's software to make that happen.
I'd spitball a guess that it has to do with the firewall rules on the modem or router running themselves into a bad state. That would explain why a restart fixes it, since the state would be cleared.
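Something like this toy sketch, maybe (entirely hypothetical; real firmware is messier and I'm only illustrating the guess):

    # Hypothetical toy model of a router's stateful connection-tracking
    # table: an entry that should have expired keeps matching new flows
    # to one destination, so only that site breaks until a reboot clears
    # the state.
    import time

    TIMEOUT = 30  # seconds a flow entry should live without traffic

    class ConntrackTable:
        def __init__(self):
            self.entries = {}  # (src, dst) -> last_seen timestamp

        def lookup_or_create(self, src, dst, now):
            key = (src, dst)
            if key in self.entries:
                # BUG: a correct implementation would evict stale entries:
                #   if now - self.entries[key] > TIMEOUT: del self.entries[key]
                # Without eviction, a dead entry shadows every new flow
                # to this destination until the table is cleared (reboot).
                return "reused (possibly stale) entry"
            self.entries[key] = now
            return "fresh entry"

    table = ConntrackTable()
    print(table.lookup_or_create("192.168.1.10", "facebook.com", time.time()))
    # Hours later the stale entry still wins, and only this site misbehaves:
    print(table.lookup_or_create("192.168.1.10", "facebook.com", time.time() + 7200))

A restart empties the table, which would match the observed fix.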
A site like Facebook has a different profile though. More like email. If you cannot reach your mail, you'll read it at a later point.
When Facebook comes back up, people will most likely catch up on "lost Facebook time". Over the course of the week (or even the day) it will even out.
Yep. Down for me in Queensland, Australia as well (Facebook and Instagram). Situations like this give people a much-needed break from social media (myself included). You don't realise just how much of a stranglehold it has over you (some more than others) until you can't actually use it. Downtime for a site as big as Facebook is rare, but it happens, especially when your infrastructure isn't solely controlled by you (content distribution networks come to mind).
It's interesting to see how Facebook downtime has a huge effect on Internet traffic – there's a Guardian report where they share their referral data while Facebook was down for 31 minutes (the longest outage in 4 years):
Very unlikely. The attack on Malaysia Airlines was a hijacking of the DNS, not an actual hack of the website. Plus, with the resources FB dedicates to security, they have a much better chance of finding 0-days than some terrorists.
Might as well take a first step to ignoring facebook.
I've spent more than my fair share of time on the site in the past few years. While it's been productive overall in terms of professional and personal networking, I could have achieved as much with 10% of the time if I had focused solely on the productive aspects. Two weeks ago I decided I could do without endless pointless conversations with vague colleagues, being updated on largely irrelevant minute details and thoughts of people I don't actually know, and an endless stream of low-quality news links.
I deactivated my account for a week and then reactivated it but only visit once a day at most. My quality of life certainly has not decreased!
Obviously anyone can "take credit" for things like this, but if this[1] is true, then jesus.
Edit: Since it's topical, I enjoy listening to Chuck Rossi in interviews or presentations.
Releng 2014 - Keynote 1: Chuck Rossi, Release Engineering, Facebook Inc. | Talks at Google [2]
E: not that I support these guys in any way. Just curious what would happen -- and whether we might all learn something about DDoS mitigation from such an event.
Site Reliability Engineer. It's a Google (+ Facebook)-specific title that is sort of like a sysadmin or devops role, but instead of keeping the system up by hand, they write code that keeps the system up. They also have a different negotiating position vs. engineering than in many other companies; e.g. SREs have veto power over many architectural decisions in the code, and the posture is more "we'll build the system that can stay upright with a minimum of pagerstorms" than "you build the system and throw it over the wall to us, and then we'll keep it upright through our self-sacrificing heroism."
Apple hires SREs who are actually sysadmins that occasionally code, complicating the title somewhat. This is not unique to them.
nostrademons's explanation of SRE is the correct one, IMO. The architecture is key. Engineering has to be built to allow that. It has helped me in the past to say SREs are concerned more with the operation of a service than a group of machines offering a service; it's almost like a service operations developer. When a company thinks in terms of services and abstracts the machine away, i.e., containers, scheduling, Mesos, Omega/<unnamed>, intelligent CI/CD, service discovery, now you're getting into SRE territory instead of SA territory. The architecture involvement distinguishes SRE from devops for me. You should be able to trust SRE to build services, not just run engineering output.
Teams that congeal out of Xooglers tend to preach SRE well, and there is the occasional company (Twitter and Foursquare come to mind) that applies the title and interacts with the team as intended.
I gotta say I'm finding it hard to accept their word that they're responsible for Facebook. I haven't paid much attention, but I'm under the impression that they primarily just deal in DDoSes and other crude attacks, and I have a hard time imagining that they could cause a large enough DDoS to affect the massive juggernaut that is Facebook. Especially since Facebook just came back up and is now perfectly responsive, showing no signs of being under strain.
Yeah, it doesn't seem like a DoS since facebook was reachable and serving error pages, and instagram was reachable and serving blank pages. It _does_ seem like an intrusion or other security incident though because I'd be surprised to learn that instagram shares lots of critical infrastructure with facebook. It seems more likely that someone hit the panic button for both sites.
Some sort of security incident does seem more plausible than a DDoS, although I can't think of what would affect Facebook, Instagram, Tinder, HipChat, and AIM simultaneously. I'm also having a hard time imagining what sort of security incident would result in Facebook deliberately shutting down their web presence, even for a few minutes. And the other potential attack vectors, such as DNS or the CDN, a) wouldn't affect everyone simultaneously, and b) wouldn't even work, because sites like Facebook don't have a single point of failure; there are always backups, and backups for the backups.
Instagram runs within the Facebook data centers. They had some challenging scaling issues within Amazon Web Services and moved their code base to Facebook's infrastructure.
Nothing to gain? Taking down Facebook seems like something that would bolster their rep significantly. If they think they can claim credit for it and get away with it, then there seems to be little reason why they wouldn't.
I guess we'll just have to wait and see if Facebook makes any public statements as to what the cause of the outage was.
Except that proving they didn't do it turns out to be much harder, because any amount of contrary announcement or evidence can be handwaved away as lies and misdirection. It's roughly akin to trying to talk to conspiracy theorists about 9/11.
For their service infrastructure, certainly. Napkin math suggests that they have to sustain peak floods in the ballpark of 500M concurrent clients, give or take 100M.
However, if you could cripple DNS propagation/resolution from their systems to the next-hop upstreams, it would show up as services being very slow to respond. They do serve content from multiple domains.
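As a rough illustration (hostnames pulled from elsewhere in this thread; the approach is my own), you can time each lookup and see how resolver trouble would surface as a slow page rather than a hard failure:

    # Time DNS resolution for several of the hostnames a Facebook page
    # load touches; slow answers here make the site feel slow even when
    # the web servers themselves are healthy. Hostname list is illustrative.
    import socket, time

    hosts = ["www.facebook.com", "star.c10r.facebook.com",
             "fbstatic-b-a.akamaihd.net"]

    for host in hosts:
        start = time.time()
        try:
            socket.getaddrinfo(host, 443)
            status = "resolved"
        except socket.gaierror as e:
            status = f"failed: {e}"
        print(f"{host}: {status} in {time.time() - start:.3f}s")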
I'd be interested to know how Google's 8.8.8.8 infrastructure defends them from DNS level manipulation...
> they could cause a large enough DDoS to affect the massive juggernaut that is Facebook
You don't need to send Facebook more traffic than it normally gets, you simply need to nail a pain point, and cause an unusual traffic event.
There's (I assume) no chance that it was done using Slowloris, but that's a good example of doing something unusual to deplete a limited resource.
I would tend to agree that it's unlikely Facebook have simply been flooded off the internet, but there are many other ways to perform much more targeted DDoS attacks, and presumably Facebook haven't mitigated against _all_ of them.
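For what it's worth, the classic mitigation for the Slowloris class of attack is dead simple in principle: don't wait indefinitely for request headers. A toy sketch (my own illustration, nothing Facebook-specific):

    # Minimal server that enforces a hard deadline on receiving the full
    # header block, so a client trickling one byte at a time can't pin a
    # worker indefinitely.
    import socket, time

    HEADER_DEADLINE = 5.0  # max seconds to receive the complete headers

    def handle(conn):
        start = time.monotonic()
        buf = b""
        try:
            while b"\r\n\r\n" not in buf:
                remaining = HEADER_DEADLINE - (time.monotonic() - start)
                if remaining <= 0:
                    return  # headers never completed in time: drop client
                conn.settimeout(remaining)
                chunk = conn.recv(1024)
                if not chunk:
                    return
                buf += chunk
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
        except socket.timeout:
            pass  # trickling client gets cut off instead of holding a slot
        finally:
            conn.close()

    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 8080))
    srv.listen(16)
    while True:
        conn, _ = srv.accept()
        handle(conn)

Real servers get this from config knobs (e.g. nginx's client_header_timeout, Apache's mod_reqtimeout) rather than hand-rolled socket code.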
Good point. Although I'd say that any such attack is still unlikely to cause the entire site to become instantly unresponsive (or to serve static error pages, as other people have suggested; personally, I never got the site to load at all). And similarly, it would not be expected to return to normal operating behavior a few minutes later. So yes, I'll grant you that you can DDoS even large sites without pure brute traffic, but it still doesn't seem particularly likely here.
I doubt it was them, though think it's worth watching since Facebook will definitely have a post-mortem. If it's not DDoS/security related, then it would be kind of embarrassing for Lizard Squad (which is why I'm surprised to see them claim it). For example, Facebook's last major downtime was caused by a main database failure – nothing to do with any vulnerabilities:
I'm guessing the multiple services going down at once has to do with Akamai. It seems like there's some speculation about the storm on the east coast and their Boston datacenter.
Or it could be a power issue brought on by the bigger-than-average blizzard that will hit hardest in the Boston area, where, incidentally, Akamai is located.
But sure, couple of script kiddies might be responsible as well. It's usually one or the other.
Hmm, interestingly enough, Facebook today "has complied with a Turkish court order demanding the blocking of a page it said offended the Prophet Muhammad." [0]
I've always thought it strange when people notice and make such a big deal about Facebook being down or inaccessible. It's almost as if fb has become a public utility or something.
As others have remarked, one way to interpret this is as a testament to their ops engineers' exceptional ability: downtime is rare enough that it's this noticeable.
>It's almost as if fb has become a public utility or something.
Well, that's what Zuckerberg answered when a user asked why Facebook wasn't cool anymore. He didn't want it to be cool, but rather a digital utility, like water or electricity.
The trouble with that analogy is that water and electricity are (just about) absolutely essential to our day-to-day lives; a specific social media site isn't. Facebook needs to keep appealing to the trend to stay in business. Hugely popular social networking sites have failed in the past, precisely when they've fallen out of fashion.
    Domain names in the .com and .net domains can now be registered
    with many different competing registrars. Go to http://www.internic.net
    for detailed information.

    To single out one record, look it up with "xxx", where xxx is one of the
    records displayed above. If the records are the same, look them up
    with "=xxx" to receive a full display for each record.

    >>> Last update of whois database: Tue, 27 Jan 2015 06:56:04 GMT <<<
Appears to return hosts whose names start with the search term, which doesn't make a lot of sense but has likely always been the case. Also, the @ flag is a dig thing, not a whois thing.
If the guy is posting pics of himself on Twitter asking others to "dox him", it could mean that he's emotionally stable enough not to fall for the common pitfalls when it comes to online hacking.
In that sense, it's very possible that he has access to some major database dumps that the public is unaware of, and given that he is also claiming credit for the Malaysia Airlines hack, it's clear that he does more than DDoS, no matter what the more educated hackers are saying.
A long way to go for a secure internet, that's for sure.
I can get back on; I'm able to see posts from hours ago, but nothing within the last 5 hours or so. Anyone with an ops background able to give a JS engineer like myself, with no ops experience, a glimpse into what bringing services back online like that looks like? I have no experience with DDoS-mitigation techniques.
I mean, like, WOW. If they have done this, we all must give them props for ambition.
Forget Sony. Nobody outside of Hollywood bigwigs was impacted by that. But Facebook? The US government is going to go nuts. I hope Lizard Squad are driving drone-proof cars.
Facebook is now back online for me on my home connection (BC Canada) and every Tor node I try. Either the attack is over or FB has fought back.
Official update: "Facebook and Instagram experienced a major outage tonight from 22:10 until 23:10 PST. Our engineers identified the cause of the outage and recovered the site quickly. You should now see decreasing error rates while our systems stabilize. We don't expect any other break in service. I'll post another update within 30 mins. Thank you for your patience."
https://developers.facebook.com/status/issues/39399836411226...
"The issue was resolved at 23:10 PST and the site stabilized shortly afterwards. Our internal and external monitoring shows that API requests are being served with normal latency and error rates, in all geographic regions. We are sorry for any inconvenience this may have caused you and the users of your apps."
No hints there, guess we'll just have to wait for a full post-mortem to come out.
Was completely dead in Australia.
Down on both my local ISP and a VPN to the national research network. It was not a DNS issue, as the IP address is the same as it was during the outage.
"Host unreachable", maybe they took out the load-balancer?
    traceroute to star.c10r.facebook.com (2a03:2880:f00c:6:face:b00c:0:2)
    1 redacted
    2 ge-1-0-0.bb1.a.syd.aarnet.net.au (2001:388:1:5001::1) 107.649 ms 91.569 ms 91.814 ms
    3 ae9.pe2.brwy.nsw.aarnet.net.au (2001:388:1:88::1) 91.981 ms 92.822 ms 103.957 ms
    4 ae5.pe1.brwy.nsw.aarnet.net.au (2001:388:1:87::1) 92.447 ms 93.29 ms 101.182 ms
    5 et-1-1-0.pe1.rsby.nsw.aarnet.net.au (2001:388:1:66::1) 94.606 ms 109.654 ms 91.987 ms
    6 et-0-3-0.nsw-msct-bdr1.aarnet.net.au (2001:388:1:a3::2) 96.119 ms 92.032 ms 92.57 ms
    7 6453.syd.equinix.com (2001:de8:6::6453:1) 99.204 ms 111.52 ms 190.499 ms
    8 if-xe-0-3-1.3.thar1.1MH-Sydney.ipv6.as6453.net (2405:2000:ffd0::a) 90.392 ms 90.862 ms 92.174 ms
    9 if-3-0-0.2.core1.PV4-Piti.ipv6.as6453.net (2405:2000:ffd0::1a) 232.07 ms 210.376 ms 163.293 ms
    10 if-xe-3-1-1.10.tcore1.TV2-Tokyo.ipv6.as6453.net (2405:2000:ffb::22) 191.668 ms 200.129 ms 226.642 ms
    11 if-ae2.2.tcore2.TV2-Tokyo.ipv6.as6453.net (2001:5a0:2200:300::2) 192.905 ms 191.953 ms 201.681 ms
    12 if-ae6.2.tcore1.SVW-Singapore.ipv6.as6453.net (2405:2000:ffa0:100::49) 288.696 ms 279.671 ms 269.912 ms
    13 if-ae11.2.thar1.SVQ-Singapore.ipv6.as6453.net (2405:2000:300:100::d) 264.938 ms 264.522 ms 266.912 ms
    14 2405:2000:300:100::16 (2405:2000:300:100::16) 354.279 ms 349.7 ms 389.75 ms
    15 ae2.bb02.sin1.tfbnw.net (2620:0:1cff:dead:beef::84c) 349.649 ms 348.205 ms 347.904 ms
    16 ae0.bb01.hkg1.tfbnw.net (2620:0:1cff:dead:beef::1bdc) 349.552 ms 356.726 ms 348.74 ms
    17 be6.bb01.pdx1.tfbnw.net (2620:0:1cff:dead:beef::5dd) 366.212 ms 368.99 ms 368.89 ms
    18 be9.bb01.prn2.tfbnw.net (2620:0:1cff:dead:beef::f5) 365.81 ms 366.327 ms 366.598 ms
    19 ae10.dr02.prn1.tfbnw.net (2620:0:1cff:dead:beef::1c27) 377.387 ms 365.613 ms 365.392 ms
    23 * * *
    24 ae10.dr02.prn1.tfbnw.net (2620:0:1cff:dead:beef::1c27) 387.156 ms !H 383.877 ms !H 374.672 ms !H
"This was not the result of a third party attack but instead occurred after we introduced a change that affected our configuration systems. We moved quickly to fix the problem, and both services are back to 100% for everyone." - Facebook spokesperson [1]
Funny how an outage of something as pointless as facebook generates this humongous comment count. But then, some source material for http://www.lamebook.com may have been lost...
I don't get how anyone could think that Facebook, a platform that has literally defined a paradigm shift in the way people communicate across the planet, could be genuinely considered pointless.
It's okay not to like Facebook, and there are many good reasons to be sceptical about them as a company, but to call them pointless is so utterly myopic that it beggars belief.
I was calling fb pointless from my own point of view. Obviously, lots of people disagree, otherwise it wouldn't have gotten that big. And I'm sure that every single user has their (valid) opinion on what "the point" of fb is for them. But as far as I'm concerned, I'll still call it pointless; no offence.
That's why I can't comprehend companies using SaaS solutions like Slack or HipChat, which require a connection to an external server to work, for something as crucial as internal communication. What's even scarier is the data-leakage risk: with a DDoS you know when something bad happens, because the service simply stops working, but with a data leak, how can you tell? Every single conversation stored on a third-party server. It's ridiculous. Yes, they're well protected, but they're also much more inviting targets for hackers: once you breach one, you gain a massive amount of sensitive data about thousands of companies. Well, it seems a nice UI and third-party app integrations are more important than security and reliability these days, I guess.
If you are, say, a startup with sensitive information being discussed in an IRC server that you run, do you really think your data would be more secure then? Keep in mind that you can't afford a security expert (or your own time) to admin your system if you're a small business.
As for reliability, I've been using Slack for months without a single outage. If it went down for a day, the impact would be minimal -- I'd just use email and communicate less for that one day. Or I'd switch to another provider. Big deal.
How about P2P connections, so nothing leaves your company's internal network, because why should it? That's the most secure and reliable solution, and even running an IRC server on a local machine is so easy that you don't need an admin.
What about solutions like Dropbox or Gmail? Or even plain old email for that matter. Even if you look after those yourself, how can you be 100% sure that you can offer a service that is more secure and reliable than a third party?
You assume that every company has the knowledge and resources to provide an alternative which is more secure than what a 3rd party would be able to do as part of their core business. This, unfortunately, is not always the case regardless of the company being a startup or an enterprise...
Looks more like a "SW update gone wrong" scenario. People were already reporting problems a few days before, empty timelines and such. They roll out incrementally. Those kinds of problems quickly affect the backends.
They are saying it was human error by Facebook and hence the outage, but how come other apps like HipChat and Tinder were also down? Can't be a coincidence.