Throwing in my positive hot take among all the negative ones here: the immediate response and blog post from Fastly are really good.

A quick fix, a clear apology, enough detail to give an idea of what happened, but not so much detail that there might be a mistake they’ll have to clarify or retract. What more are you looking for?

Apart from “not have the bug in the first place” -- and I hope and expect they’ll go into more detail later when they’ve had time for a proper post mortem -- I’d be interested to hear what anyone thinks they could have done better in terms of their immediate firefighting.



I'm sorry, but I disagree. They gave the BBC enough detail that a very misleading headline was produced as a result. True, the main blame lies with the BBC, but it also comes across — to me, anyway, maybe I'm being too cynical — as a bit of an excuse from Fastly.

"But a customer quite legitimately changing their settings had exposed a bug in a software update issued to customers in mid-May, causing "85% of our network to return errors"

They are careful to make clear that the customer did nothing wrong and that the problem was a bug in their software.


I know — as I've said, the main blame lies with the BBC. However, as it's reported, it comes across very much as Fastly trying to save face. Maybe the blame is entirely on the BBC, maybe Fastly were naive in thinking that giving them this information wouldn't result in irresponsible headlines.

How on earth do you figure that is anyone's fault but the BBC?

Read Fastly's statement. There is nothing about it blaming the customer(s) at all. There is nothing trying to save face.

What is your point here?


> Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.

Is it necessary to refer to "a customer" at all in this statement? What would be problematic if the above were rewritten as something like:

> Early June 8, a configuration change triggered a bug in our software, which caused 85% of our network to return errors.

The advantage is that you wouldn't get ignorant reporting that "one customer took down the internet". I'm not sure there are disadvantages that net outweigh that.


> Is it necessary to refer to "a customer" at all in this statement?

That’s how autopsies work. You describe the cause and resolution. The cause was a bug in the customer’s control panel.

They’re not trying to absolve themselves of responsibility.


Yes, because it is explaining that it was a valid *customer* configuration, which is a separate set of concerns from, say, infrastructure config.

The important adjective "valid" means it was completely normal/expected input and thus not the fault of the customer.

It's perfectly clear you've come at this with a pre-determined agenda of "I bet Fastly, like most other companies making public statements after corporate booboos I've seen, will try to shrug this one off as someone else's fault" after reading the BBC's title, and haven't bothered to read it at all until now.


Not necessarily valid. Could have been a bad entry that passed validation when it shouldn’t have, which would still not be the customer’s fault.

> We experienced a global outage due to an undiscovered software bug that surfaced on June 8 when it was triggered by a valid customer configuration change.

Verbatim from Fastly: https://www.fastly.com/blog/summary-of-june-8-outage


The somewhat awkward "a customer pushed a _valid_ configuration" is Fastly making sure they aren't pushing any blame onto the customer.

There is no customer blaming here. None at all.


> Is it necessary to refer to "a customer" at all in this statement? What would be problematic if the above were rewritten as something like:

That's literally what happened. They even say it was a valid configuration change; it's entirely blameless.

Saying "a configuration change" loses critical context. I would have assumed that this was in some sort of deployment update, not something that a customer could trigger. Why would you want less information here?


OK, I'm replying to your comment since it's the least aggressive — thanks for that!

I'll fully retract my statement. This is 100% the BBC's fault, 0% Fastly's.

Can I make one small suggestion that might help to prevent this kind of misleading reporting in future, though? What if Fastly produce the detailed statement they have, with as much accurate technical detail as possible AND a more general public-facing statement that organisations such as the BBC can use for reporting, that doesn't include such detailed information that can easily be misconstrued?


I hate being part of a dogpile, so yeah sorry about that, I just open up things to reply to, and then come back later and write it up just to find that I'm one of 10 people saying the same shit.

edit: FWIW I had a very negative initial reaction to the headline as well.


Not at all — I understand.

My apologies for any hostility on my part.

No worries. I probably didn't take it very well because my intentions were genuine and I really wasn't trying to level anything beyond the very mildest criticism towards Fastly. I recognise, however, that even that was misplaced — I think the BBC headline just got me too worked up!

Most of the replies to yours haven’t been aggressive. Ironically it’s your comments that have come across the worst by using terms like “aggressive”, “blame” and “fault” in the first place. Calling other people’s comments aggressive is pretty unfair. One might even say hypocritical.

In what way is the BBC at fault for this? Their title is objectively true. A _valid_ configuration setting that was used by a customer _did_ cause Fastly to have an outage.

It's not limited to one specific customer (i.e this customer isn't the only customer who could have caused the issue, presumably), but it _was_ something the customer (legitimately) did. It wasn't a server outage. It wasn't a fire. It wasn't a cut cable.

"a customer quite legitimately changing their settings (BBC: one fastly customer) had exposed a bug (BBC: triggered internet meltdown) in a software update issued to customers (fastly admitting, when combined with 'legitimately', that fastly are at fault) in mid-May".


People love to hate mainstream media.

Not me — I adore the BBC. I've always paid my licence fee gladly, and I've been waxing lyrical about the latest BBC drama on Twitter just this very hour. On this issue, I believe they've made a mistake.

Whatever happened to nuanced opinion, where you can see good and bad in the same entity? Why do some people insist so strongly on absolutes?


How on earth is the headline making a mistake?

Here's some excerpts:

Fastly, the cloud-computing company responsible for the issues, said the bug had been triggered when one of its customers had changed their settings.

Fastly senior engineering executive Nick Rockwell said: "This outage was broad and severe - and we're truly sorry for the impact to our customers and everyone who relies on them."

But a customer quite legitimately changing their settings had exposed a bug in a software update issued to customers in mid-May, causing "85% of our network to return errors", it said.

The headline accurately portrays the story given the limit on headlines.


The wording "a configuration change triggered a bug" in this context sounds (to me) like it was a configuration change made by Fastly to something on their backend.

The wording which was actually used makes it clear that that was not the case.


What verbiage exactly are you looking for from Fastly here? I'm hearing "Nobody else did anything wrong, it was 100% a software bug on our end, and we're sorry about that." How much more responsibility are you asking them to take before you would no longer be considering them to "save face"? I'm trying to come up with an ironic exaggeration here, but I can't, because it kinda seems like Fastly has already taken 100% full responsibility and there's no room left for exaggeration.

Don't forget that the BBC were also initially affected by this, and jumped on it a lot sooner than most outlets, so they have skin in the game.

Would love it if it was the BBC that triggered the problem :D

If they have a bug that can crash their servers, they likely won't want to publicize the details until it is fully patched. I wouldn't expect that detail for a while.

As an ex-CTO with a lot of firefighting experience, I want to give credit to the team for identifying such a user-triggered bug in such a short period of time. Hardly anyone would anticipate that a single user could trigger a meltdown of the internet!

It reminds me of that ancient joke: A QA engineer walks into a bar. He orders a beer. Orders 0 beers. Orders 99999999999 beers. Orders a lizard. Orders -1 beers. Orders a ueicbksjdhd.

First real customer walks in and asks where the bathroom is. The bar bursts into flames, killing everyone.


You forgot the part where the QA engineer ticks off the checklist and gives the bar a pass.

The blog post actually says it’s fixed already (but I would definitely expect them to keep the details private until they’re 100% sure, yeah)

I don't blame them for having a bug. I do blame them for having a design that doesn't isolate incidents like this (although it is hard to know how much without more details). And I blame our industry for relying so much on a single company (and that isn't a problem unique to Fastly, or even to our industry).

I would wager that they run a single configuration, as it grants a significant economy of scale, rather than vertical partitioning of their stack, which would require headroom per customer and/or slice. This way you just need global headroom.

Having done some similar stuff with Varnish in the past (ecommerce platform), they’re likely taking changes in the control panel and deploying them to a global config, and someone put in something lethal that somehow passed validation, got published, and then did not parse.
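
To make that concrete, here is a toy sketch of that failure mode (hypothetical Python, none of it is Fastly's or Varnish's actual pipeline): the publish-time check only looks at the shape of the value, while the apply path on every node assumes more than the check ever verified.

    # Hypothetical sketch: a value passes the publish-time schema check but
    # still breaks the code path that applies it on every node. None of these
    # names correspond to Fastly's real systems.

    def validate(config: dict) -> bool:
        # Publish-time check: only verifies the key exists and is a string.
        return isinstance(config.get("host_header"), str)

    def apply_on_edge_node(config: dict) -> str:
        # Apply-time path: assumes a "name:value" pair, which the validator
        # never checked for.
        name, value = config["host_header"].split(":", 1)
        return f"{name.strip()} = {value.strip()}"

    customer_config = {"host_header": "example.com"}  # valid per the schema

    assert validate(customer_config)                  # passes validation...
    try:
        apply_on_edge_node(customer_config)
    except ValueError as err:
        # ...but every node that tries to apply it hits the same error.
        print(f"apply failed everywhere: {err}")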


This looks like quite a likely scenario.

But then we still don't know what they fixed: was it the incorrect configuration or the underlying bug? I would expect the former rather than the latter, because it is probably not very difficult or dangerous to change that specific configuration, while fixing bugs in the code seems riskier and would probably take more time for testing.

We'll see if they will publish a post-mortem. It has become more or less customary these days (and post-mortems are frequently quite interesting).


They were pretty clear about this in their response (linked in the article):

    Once the immediate effects were mitigated, we turned our attention to fixing the bug and communicating with our customers. We created a permanent fix for the bug and began deploying it at 17:25.

So they did both: first they reverted the config, then later fixed the bug.

> And I blame our industry for relying so much on a single company (and that isn't a problem unique to fastly, or even our industry).

The problem is that if Fastly is the best choice for a company then there's zero incentive for the company to choose another vendor. Everyone acting in their own best interest results in a sub-optimal global outcome.

It's actually one of the major problems with the global, winner-takes-all marketplace that's evolving with the internet.


Do you roll your own power grid? Do you roll your own ISP + telecoms network?

As a software engineer I live by the ethos that coupling and dependency are bad, but if you unravel the layers you start to realise much of our life is centralised:

Roads, trains, water, electricity, internet

These are quite consolidated and any of these going down would be very disruptive to our lives. Connected software, ie the internet, is still quite new. Being charitable, are these just growing pains in the journey to building out foundational infrastructure?


My datacentres have two sources of power (plus internal UPS), two main internet lines to two different exchange points (and half a dozen others), plenty of bottled water.

At home I have emergency power, water and internet. If the trains stop I drive, if the car breaks I take the train.


But having everything redundantly available costs money. While some redundancy is easy to justify, at some point it becomes hard to defend when the MBA wants to cut costs so he gets a bigger bonus.

There is even a competitive advantage in living with the risk, as you have lower costs and overhead... Sure, you might have an outage once every x years for a few minutes... But that's obviously the fault of the development team, duh


Which is one reason why things like roads, trains, water, electricity, etc. are so heavily regulated: to prevent the companies that hold monopolies over the infrastructure from cutting corners like that.

This is a classic example of an externality. You use regulations or lawsuits to force the costs back onto the decision makers. Make it so people can collect damages from outages; the company then needs insurance to cover the potential costs of an outage. If the savings from removing a redundancy exceed the increase in insurance premium then it is actually efficient; otherwise it is a net negative. While an actuary may make a mistake and underestimate the likelihood of an outage, they are far less incentivized to do so than the MBA looking for a bigger bonus.
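
As a back-of-the-envelope sketch of that trade-off (all numbers invented for illustration):

    # Back-of-the-envelope sketch of the argument above; every number here is
    # an invented assumption, not real data.

    redundancy_savings = 200_000      # $/year saved by dropping the backup
    outage_probability = 0.05         # chance of a major outage in a given year
    damages_per_outage = 10_000_000   # $ owed to customers if it happens

    # Once damages are collectable, an insurer prices the risk into the premium.
    premium_increase = outage_probability * damages_per_outage  # $500,000/year

    if redundancy_savings > premium_increase:
        print("Cutting the redundancy is genuinely efficient.")
    else:
        print("Cutting the redundancy is a net negative once the risk is priced in.")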

If you can win a lawsuit against a company for a failure it is probably because it wasn't an externality but a contract or warranty agreement they have breached. This is present even in niches with minimal regulations.

Data centers offer the highest uptime guarantees at the highest price tiers. People pay more for Toyotas, new or used, because of their reputation. Quality is a product feature. If MBAs want to decide whether they can cut corners, there are already upsides and downsides; the calculation is something they need to make.


Yeah, I'm saying add regulations so it is no longer an externality.

Quality is a product feature when there is competition; monopolies don't suffer from cutting quality.


Whereas I'm saying it's not an externality even in the absence of manipulation by specific regulations, which you apparently agree with, given your second sentence.

I find that people have a tendency to be overly narrow in considering competition and declaring things monopolies. There are alternative ways to get tasks done that avoid relying on (and paying for) low-quality internet services if companies find it necessary.


An externality is a side effect or consequence of an industrial or commercial activity that affects other parties without this being reflected in the cost of the goods or services involved. Removing redundancy to increase profit margins affects other parties (i.e. users) without affecting the cost of the goods or services involved. If and only if the MBA's decision to increase risk to the user is reflected in the costs does it cease to be an externality.

And we are specifically talking about an industry over-relying on a single provider of a service. If there were a variety of competing services, the entire point would be moot.


Right, but that's why we (ideally) put infrastructure costs under the control of an entity (the government) which doesn't have to operate within a system of profit and market competition.

What's this "emergency ... internet"? A hot-spot on a cellular telephone?

but do all your emergency backups have emergency backups?

My emergency backups have emergency emergency backups (but those do not have emergency emergency emergency backups).

> Do you roll your own power grid? Do you roll your own ISP + telecoms network?

> Roads, trains, water, electricity, internet

I guess the difference here is that you're (mostly) talking about physical infra, which by definition must be local to where it's being used. We allow (enforce?) a monopoly on power distribution (and separate distribution from generation) because it doesn't make sense to have every power company run their own lines. But with that monopoly comes regulation.

Digital services are different. The entire value prop is that you can serve an effectively infinite number of customers, and the marginal cost of "hooking up" a new one is ~$0. This frequently leads to a natural winner-take-all market.

One way to address this is to add regulation to digital services, saying that they must be up x% of the time or respond to incidents in y minutes or whatever. But another way to address it is to ensure it's easy for new companies to disrupt the incumbents if they are acting poorly. The first still leads to entrenched incumbents who act exactly as poorly as they can get away with. The second actually has a chance of pushing incumbents out, assuming the rules are being enforced. And now you've basically re-discovered the current American antitrust laws.

As far as any individual company's best interests, like anything else in engineering, it's about risk vs. reward.

What's the cost of having a backup CDN (cost of service, cost of extra engineering effort, opportunity cost of building that instead of something else, etc.) vs. the cost of the occasional fastly downtime?

I have to imagine that for most companies the cost of being multi-CDN isn't justified by what they lose to a little downtime (or four hours of downtime every four years).
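
Sketching that comparison with made-up numbers (the "four hours every four years" figure is from above; everything else is an assumption):

    # Rough expected-cost comparison for going multi-CDN. The downtime figure
    # comes from the comment above; every other number is an assumption.

    second_cdn_cost = 50_000       # $/year for the extra CDN contract
    engineering_cost = 80_000      # $/year of extra engineering effort
    multi_cdn_cost = second_cdn_cost + engineering_cost

    downtime_hours_per_year = 4 / 4        # ~4 hours every 4 years
    revenue_lost_per_hour = 30_000         # $ lost per hour the site is down
    expected_downtime_cost = downtime_hours_per_year * revenue_lost_per_hour

    print(f"multi-CDN: ${multi_cdn_cost:,.0f}/yr vs downtime: ${expected_downtime_cost:,.0f}/yr")
    # With these numbers, eating the occasional outage is the cheaper option.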


> One way to address this is to add regulation to digital services, saying that they must be up x% of the time or respond to incidents in y minutes or whatever.

This is good reasoning but I don't think it's possible to legislate service level objectives like that.

> But another way to address it is to ensure it's easy for new companies to disrupt the incumbents if they are acting poorly.

I agree but realistically there will be many cases when a company is far better at something than anyone else. I think the only way to avoid global infra single points of failure is competitive bidding and multi-source contracts, plus competitive pressure to force robustness (which already works quite well).


But a CDN _is_ physical infrastructure. Just like power, water, transit, etc. The same economic forces influence CDNs just as much as they do the others.

> Do you roll your own power grid?

I know plenty of people in Texas who will be buying solar panels and batteries after last winter. I will be doing the same.

> Do you roll your own ISP + telecoms network?

If I could magically get fiber directly to an IX I would gladly be my own ISP. I have confidence I would do as good a job or better than the ISPs I’ve had over the years (yes I realize having hundreds of thousands of customers to service is more difficult than a single home).


> I know plenty of people in Texas who will be buying solar panels and batteries after last winter. I will be doing the same.

I have actually been in the position of having to rely on non-mains power all my life.

It bloody sucks.


How long is "all my life?"

Because it seems like a relatively recent development that off-grid solar power solutions have become affordable and mature enough to not suck on average.


Around 40% of the population of Nigeria (a country of 200+ million) does not have access to electricity. The per capita electricity consumption of Nigeria is _two orders of magnitude lower_ than the US.

> Around 40% of the population of Nigeria (a country of 200+ million) does not have access to electricity. The per capita electricity consumption of Nigeria is _two orders of magnitude lower_ than the US.

... And how is this relevant?


Because that is where I live?

> Because that is where I live?

... And how is this relevant?


You asked how long the "all my life" I've spent only partially reliant on the power grid is, and someone else has provided you with some context (that I actually mean all my life, which you can probably infer to be longer than two decades at least).

As to the second half of your original question, solar power is not the only kind of backup power that exists.


What do you think the suboptimal outcome was in this case?

Is it better for websites to be unavailable at different times as opposed to all at the same time? This seems to be a really common assumption people make re these occasional cloud take-downs, but I don't really understand why people think it.

Seems to me that in cases like this, everyone operating in their own self-interest, by all using the best-value service, is actually the best outcome. Everyone suffered the same outage at the same time, which minimised the overall cost of the outage (one resolution, one communication line, etc. as opposed to many).


With short-term outages like this you're probably right that a single point of failure doesn't matter.

It's the longer-term outages that are the problem. That's because we start talking about knock-on effects.

It's not really a problem if your supplier (and all others) has a short-term issue (assuming you don't run super lean). It may be a headache if your supplier has a longer-term issue while you set up another supplier (or use your less desired one), but it's not a disaster. It's a big problem if all suppliers are down for more than a short time.

I'd assume people seeing this as a market failure are talking about it in the "this highlights the problem" kind of way, not the "this event was a true disaster" way.


Sure. But again, I feel like using a "big, everyone uses it" kind of supplier is the best mitigation of the "what if I have to replace this" problem.

If a massive vendor shutters or has a long term failure, at least you're in the same boat as a bunch of experts, which is a much better place to be than "my obscure or self-rolled solution is now orphaned / hacked / broken".

The unspoken assumption always seems to be "my self-configured solution will have fewer and/or shorter issues than the massive publicly traded solution that everyone uses" but that seems ... very incorrect.

Also... a reverse proxy / CDN is always a single point of failure. The question is... is it a single point of failure that you personally own? In my opinion shared single points of failure are desirable. It's just obviously more efficient.


> Is it better for websites to be unavailable at different times as opposed to all at the same time?

Yes. If one site is down, it may hurt my productivity a little bit, and I may have to adjust what I work on. But if the entire internet is down, that has a drastic impact on my productivity and, depending on what I am working on at the time, may completely block me.


Sometimes, it’s something as small as a missing parenthesis that makes things go wrong.

(Slight joke here)

