
I feel terrible for the people who worked on Boxopus, but I find it hard to believe this very outcome wasn't evaluated during basic risk management.

There has been no shortage of companies built 100% on someone else's platform, nor of reminders about what happens when that access is turned off.




If that had happened to me my entire startup would have to shut down. Not cool.

I hate to be judgmental. We are human; we all make mistakes. But my understanding is that they had been operating for several years and hosting data for many people. Carrying on without a backup plan and hoping for the best was very irresponsible, and I agree with you that they, sadly, deserved to go out of business.

The fact that the cheap-ass idiots at the company had cancelled their backup protection at Rackspace and lacked any other form of backup is just complete incompetence.

Even if it hadn't been a junior developer whom nobody noticed or cared was using the prod DB for dev work, it would have ended in an outright failure eventually. DBs fail, and if you don't have backups you are not doing your damn job.

The CEO should be ashamed of himself, but the lead engineers and the people at the company who were nasty to this guy should all be even more ashamed of themselves.
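(For anyone reading this who hasn't set such a thing up: the bar being described is genuinely low. Below is a minimal sketch of a scheduled dump-and-prune job, assuming PostgreSQL with pg_dump on the PATH; the database name, backup directory, and retention count are invented for illustration.)

    # Minimal nightly backup sketch. Assumes PostgreSQL with pg_dump on the
    # PATH; the database name, backup directory, and retention count below
    # are hypothetical.
    import subprocess
    from datetime import datetime, timezone
    from pathlib import Path

    BACKUP_DIR = Path("/var/backups/postgres")  # hypothetical location
    DB_NAME = "app_production"                  # hypothetical database

    def nightly_backup(keep: int = 14) -> Path:
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        target = BACKUP_DIR / f"{DB_NAME}-{stamp}.dump"
        # Custom-format dump, restorable later with pg_restore.
        subprocess.run(
            ["pg_dump", "--format=custom", f"--file={target}", DB_NAME],
            check=True,
        )
        # Prune the oldest dumps so the backup disk doesn't silently fill up.
        dumps = sorted(BACKUP_DIR.glob(f"{DB_NAME}-*.dump"))
        for old in dumps[:-keep]:
            old.unlink()
        return target

    if __name__ == "__main__":
        print(nightly_backup())

A cron entry pointing at something like this beats nothing, but it only really counts as a backup once restoring from one of those dumps has been tested.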


This could've happened to anyone. It's a huge shame for those in charge, not for you. Any business that lets such operations happen without backups or proper user-rights management should ask itself why it still exists, if it really makes the huge amounts of money you mentioned.

I don't really feel sorry for them; their access control and accounting are a complete mess. It only takes one mistake to lose tons of money, like in this case.

My feels go out to the ops folks at Cloudflare. Mistakes happen no matter how many years of experience people bring in or how much they're paid; we're all human, after all. It must be a high-pressure job to be responsible for potentially millions of dollars of losses during this downtime.

I hope the issue is resolved soon and if a person caused it, they're not in too much trouble.


True - my point was more that accidents happen and are often caused by people. In this case there may not have been data loss, but the monetary impact of sites going down was probably not insignificant.

I also have a box with them. At one point, they made a massive mistake while rebuilding a broken RAID array and lost all the data on a 256GB slab I had, as well as other customers' data. While the human error is incredibly concerning to me (luckily I had backups of critical data), I appreciated the immediate notification, the offer of compensation for the data lost, and the plan to make their RAID setup more robust over the following few weeks. At most providers I've had to actively seek out compensation in situations like these rather than have it offered to me within a day of the data loss.

A catastrophe like this could destroy a company. With all the cloud providers out there, there is no excuse for losing a customer's data.

It is hard to measure the potential damage. Did we lose users/customers? We don't know. We certainly didn't gain any German devs. Did we lose investors? We got ourselves a great investment ($500K), but it may hurt future rounds. This is a "better safe than sorry" decision in the end, I think.

I feel really bad for some of these users; certainly, for the less technically inclined or interested, how were they to know that an otherwise legitimate and professional-looking website would disappear with their data overnight? Whether it's RapidShare, MediaFire, or any of the Amazon S3-backed others, to the regular user they all look the same.

In the long run, if high-profile removals happen like this, I can see it causing a general lack of trust in SaaS for the average home user.


Yes, not notifying at least the big Linux distros and BSD projects was quite irresponsible. Everyone except a few chosen service providers like Cloudflare was thrown under the bus here.

While the post-mortem is thorough, it misses key details about what the companies unlucky enough to be caught out by this outage actually experienced. For example, it fails to mention that impacted customers lost access to certain Atlassian services for up to ~2 weeks - JIRA, Confluence, and OpsGenie - but not to others like Trello or Bitbucket.

Of these, losing access to OpsGenie for that long was a massive problem, dwarfing the loss of most other systems. OpsGenie is like PagerDuty in the Atlassian world.

I spoke to several engineers at impacted companies who could not believe that their incident management system had been “deleted” with no ETA on when it would be back, or that Atlassian could not prioritise restoring this critical system ASAP. JIRA and Confluence being down was trouble enough, but those systems being down for some time was something most teams worked around. Suddenly flying blind, though, with no pager alerting for their own systems? That is not acceptable for any decent company.

Most of those I talked with moved rapidly to an alternative service, building up on-call rosters from memory and emails, as Confluence, which stored these details, was also down. Imagine being a billion-dollar company suddenly left without a paging system, with no ETA on when it would be back and your vendor not responding to your queries.

I talked to engineers at one such company, and it was a long night moving over to PagerDuty. It would be another 7 days before they could get through to a human at Atlassian; by that time, they were a lost customer for this product. Ironically, this company had moved to OpsGenie from PagerDuty a few years earlier because OpsGenie was cheaper and they were already on so many Atlassian services.

The post-mortem has no action items on prioritising services like OpsGenie for reliability or restoration, which is a miss. I can’t tell whether Atlassian staff are unaware of the critical nature of this system or whether they treat all their products - including paging systems - as equal in terms of SLAs on principle.

Worth keeping in mind when choosing paging vendors - some might recognise that these systems are more critical than others.

I wrote about this outage from the viewpoint of the customers as it entered its 10th day, and it was discussed on HN with comments from people impacted by the outage. [1]

[1] https://news.ycombinator.com/item?id=31015813


IME the main alternative is the box crashing, frequently without leaving behind enough information to know what went wrong. At least if the app crashes you have a pretty good idea who was incompetent.

If your senior management/devs are worth anything, they were already aware that this was a possibility. There is no excuse for what appears to be a total lack of a fully functioning development & staging environment, not to mention any semblance of a disaster recovery plan.

My feeling is that whatever post-incident anger you got from them was a manifestation of the stress that comes from actively taking money from customers with full knowledge that Armageddon was a few keystrokes away. You were just Shaggy pulling off their monster mask at the end of that week's episode of Scooby-Doo.


I think the loss may not have been as big as you think; sure, nobody could buy tickets for a few hours, so in theory the company lost millions in revenue during that time. But that assumes people wouldn't just try again later. Downtime does not, in practice, translate directly into lost revenue, I think.

I mean, look at Twitter, which was famously down all the time back when it first launched due to its popularity and architecture. Did it mean people just stopped using Twitter? Some might have, but the vast majority (and then some) didn't.

Downtime isn't catastrophic or company-ending for online services. It may be for things in space, or for high-frequency trading software that can bankrupt the company, but that's why those fields have stricter checks and balances - in theory. In practice they're often worse than most people's shitty CRUD web services that were built with best practices learned from the space/HFT industries.


This is a perfect example of the tragedy of the commons. No one person feels the impact of their own abuse, but collectively they end up ruining it for everyone.

Clearly they didn't intend for this service to be a general backup system.


It's an organisational failure if a junior employee can bring down the company in a few clicks. No backups, testing on the production database - this is no way to run a company. I feel sorry for the guy who made a simple mistake.
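(To make the "no way to run a company" point concrete: one common guard rail is to keep per-environment connection settings separate and have destructive helpers refuse to run when the process is pointed at production. A minimal sketch, with the environment variable names and defaults invented for illustration:)

    # Sketch of an environment guard rail. The environment variable names
    # and defaults here are hypothetical.
    import os

    class UnsafeEnvironmentError(RuntimeError):
        pass

    def get_database_url() -> str:
        # Each environment gets its own connection string; nothing defaults
        # to production.
        env = os.environ.get("APP_ENV", "development")
        urls = {
            "development": os.environ.get("DEV_DATABASE_URL",
                                          "postgresql://localhost/app_dev"),
            "staging": os.environ.get("STAGING_DATABASE_URL", ""),
            "production": os.environ.get("PROD_DATABASE_URL", ""),
        }
        return urls[env]

    def guard_destructive(operation: str) -> None:
        # Refuse drop/truncate/restore-over style operations when the
        # process is configured to point at production.
        if os.environ.get("APP_ENV") == "production":
            raise UnsafeEnvironmentError(
                f"Refusing to run '{operation}' against production."
            )

    def reset_dev_database() -> None:
        guard_destructive("reset_dev_database")
        # connect to get_database_url() and drop/recreate the schema here

It won't stop every mistake, but it means a dev's local tooling has to go out of its way to touch prod rather than landing there by default.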

As a long-time software developer and manager, it is unimaginable to me that they had not done a proper analysis of the risks and did not have mitigation strategies in place. The whole future butterfly comment was bizarre, and I saw it when it happened. I am hoping that at the very least they have DB backups, but I would think that if they did, they would already have been restored. With each passing hour I am in fact growing very concerned. Not to mention that the run on that place could likely put them out of business when and if they ever do come back online. Just look at the owner over there. Smoking too much weed, I suspect?
