
You will know when it becomes a problem. The helpdesk gets flooded. Servers go down. You get white pages and 500 internal server error pages. The entire site folds because a mouse farts.

You add new servers and they get crushed. You scrutinize every line of code and fix algorithms and cache whatever is possible and you still can't keep up. You look at your full stack configuration and tune settings. You are afraid of growth because it will bring the site down. THEN you know it's time to consider switching frameworks. And even then I would try to find the core of the problem and rewrite that one piece.

Rewriting an app in a new framework can kill a company. Be leery of starting over. I personally know of one company that started over some 4+ years ago because the old app was too hard to maintain, and only started rolling out the new app last year to extremely poor reception (even with about 1/2 the features of the old app). The reception was so bad that they had to stop rolling it out until it was fixed. They could have easily spent a year improving the old app and would be miles ahead.

That's not to say that all problems are due to the framework either. I've had web servers go unresponsive for 20+ minutes because Apache went into the swap of death: KeepAlive was set to 15s and MaxClients was set to a value whose worst-case memory use exceeded the available RAM. The quickest solution was to cycle the box. This was 10 years ago though, and I think I had a total of 1GB of RAM to work with.
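For anyone curious, the arithmetic behind that failure mode is simple; here's a rough sketch with made-up numbers (measure your own per-process memory before trusting anything like this):

    # Back-of-the-envelope sizing for Apache prefork MaxClients.
    # Every number here is a hypothetical placeholder.
    total_ram_mb = 1024        # e.g. the 1GB box above
    reserved_mb = 256          # OS, database, everything that isn't Apache
    per_worker_mb = 25         # typical RSS of one worker (measure it!)

    safe_max_clients = (total_ram_mb - reserved_mb) // per_worker_mb
    print(f"MaxClients should be at most ~{safe_max_clients}")

    # A default-ish MaxClients of 150 would mean 150 * 25MB = 3.75GB of
    # worst-case demand on a 1GB box -- hence the swap of death.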




From this operations engineer's perspective, there are only 3 main things that bring a site down: new code, disk space, and 'outages'. If you don't push new code, your apps will be pretty stable. If you don't run out of disk space, your apps will keep running. And if your network/power/etc doesn't mysteriously disappear, your apps will keep running. And running, and running, and running.

The biggest thing that brings down a site is changes. Typically code changes, but also schema/data changes, infra/network/config changes, etc. As long as nothing changes, and you don't run out of disk space (from logs for example), things stay working pretty much just fine. The trick is to design it to be as immutable and simple as possible.
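A minimal sketch of the kind of check that guards against the logs-fill-the-disk failure mode (the path and threshold here are arbitrary examples, not anything specific):

    # Tiny disk-space watchdog; alert well before logs fill the disk.
    import shutil

    def disk_free_percent(path="/var/log"):
        usage = shutil.disk_usage(path)
        return usage.free / usage.total * 100

    if disk_free_percent() < 10:
        # In real life: page someone, rotate/compress logs, etc.
        print("WARNING: less than 10% disk free under /var/log")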

There are other things that can bring a site down, like security issues, or bugs triggered by unusual states, too much traffic, etc. But generally speaking those things are rare and don't bring down an entire site.

The last thing off the top of my head that will absolutely bring a site down over time, is expired certs. If, for any reason at all, a cert fails to be regenerated (say, your etcd certs, or some weird one-off tool underpinning everything that somebody has to remember to regen every 360 days), they will expire, and it will be a very fun day at the office. Over a long enough period of time, your web server's TLS version will be obsoleted in new browser versions, and nobody will be able to load it.
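The boring mitigation is just to watch expiry dates from the outside; a hedged sketch using Python's standard library (hostname is a placeholder):

    # How many days until a host's TLS cert expires.
    import socket
    import ssl
    import time

    def days_until_expiry(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = ssl.cert_time_to_seconds(cert["notAfter"])
        return int((expires - time.time()) // 86400)

    print(days_until_expiry("example.com"))  # alert when this gets small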


Applications are broadly vulnerable to this problem. It's true that your app server and your web server will share the blame, but that's going to be cold comfort.

(This comment sounds more disagreeable than I mean it to; sorry, it's tricky for me to comment about this stuff).


But that's not the first port of call. For instance, I've worked with a few companies that were perplexed why their site was going slow when they had invested heavily in great servers. It turns out the code was terrible.

I disagree that they are different levels of problem. A fatal error in a web app is the same as a crash; the fact that it doesn't affect other users, and the fact that you can effectively "restart" the app by refreshing or going back and performing the action again, are irrelevant.

If you are a serious web business and your app drops a ten thousand dollar order, or it dies in the middle of posting an important live news story, or it fails to send a personal message to an SO overseas, the fact that they can retry (perhaps after re-doing a significant amount of work) is no consolation. You, dear web app creator, are still proper fucked.


They don't until things break: your app can't handle the number of requests coming in, can't insert and fetch data in a reasonable amount of time, etc.

It's at this point that shit falls apart and the manager starts to care about these technical choices and all of the time and money that will be needed to be spent to rewrite.

Call it maintenance costs if you want but those costs are directly affected by the technology decisions you make.


If dev support from the company fades, the UI will start to deteriorate - and whether you are hosting it yourself or not, that also matters: mobile apps, browser plugins, form-filling logic, specific site behaviours, etc.

I don't use a lot of new consumer web apps for this reason (I did use Batch just to test, but didn't start using it heavily because I was afraid of something like this happening).

One way of reassuring me about using a product like this would be to have something in the app's settings menu that allows me to point it at a different production server, and a commitment in the legal Terms of Service that, in the event the product is shut down, the server-side code will be open-sourced.
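For illustration, the settings hook I have in mind is nothing fancy; a sketch (names and defaults are hypothetical, not any real product's API):

    # Let the user point the client at a different backend.
    import os

    DEFAULT_API_BASE = "https://api.example.com"

    def api_base_url():
        # A settings screen would write this value; an env var stands in here.
        return os.environ.get("APP_API_BASE", DEFAULT_API_BASE)

    print(api_base_url())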


This is very true. I learned this the hard way when I moved from making games that only ran for an hour to making server stuff, wondering why I couldn't keep a site up for more than a day, and how you're supposed to debug something that takes a day to reach its failure point. I haven't had the issue in a long time, but I can see it being a problem if you're used to breaking stuff and releasing quickly.

OP has it backwards. What he calls "bad", I call "good".

You shouldn't have to worry about things like systems administration, email, and scaling from the start. If you are, you're probably not focused on the most important things and may well run out of runway before you ever really need them.

OTOH, if you focus on your app and your customers and are so successful that your infrastructure can no longer support you: what a great problem to have.


Throwing more money at it should be the answer to this problem. If your server can't handle an HN traffic burst, and assuming it's not caused by your code being badly written, the obvious solution is to get a better server.

We're talking a few tens of dollars a month more. The time a startup founder spends trying to 'fix' the issue in another way would be better spent elsewhere. Pay more, move on.


There are many reasons that a site can go down, aside from a code deployment.

I had an even worse issue (entirely my own fault) which highlights the dangers of building on another platform. My app (tweetbars.com) had a tiny flaw: it didn't time out its cURL calls to the Twitter API, and my host didn't kill hanging PHP executions.

So when Twitter started hanging (and eventually timing out after a minute or so) the app basically ate one of their shared servers and the host took my whole account down for 24 hours.
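The generic fix for that class of bug is just to bound every external call; here's a sketch using Python's requests library (library and endpoint are my choices, not the original PHP/cURL code):

    # Never call a third-party API without a timeout.
    import requests

    try:
        resp = requests.get(
            "https://api.example.com/timeline",  # hypothetical endpoint
            timeout=(3, 5),  # 3s to connect, 5s to read
        )
        resp.raise_for_status()
        data = resp.json()
    except requests.exceptions.RequestException:
        # Fail fast and serve something degraded instead of tying up a
        # worker process for minutes.
        data = None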


This would make life truly hellish for Web developers: your app randomly gets terminated on client machines that are temporarily sluggish. Lots of them would just give up on the Web and write apps for other platforms.

Yep. I had caused a similar problem in the past that brought down a live site. It was a cascading failure in the error handling: retrying requests piled up into an avalanche until more and more servers failed under the load. Not fun. Luckily we had well-defined deployment and rollback procedures and were able to roll back the change easily.
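The standard mitigation for that avalanche is to cap retries and back off with jitter; a minimal sketch (the function shape is mine, not the system described above):

    # Retry with capped attempts and exponential backoff plus jitter,
    # so failures don't snowball into a retry storm.
    import random
    import time

    def call_with_backoff(fn, max_attempts=4, base_delay=0.5):
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(random.uniform(0, base_delay * (2 ** attempt)))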

This is in no way to dog on Parse, which is handling the situation as well as any company I've seen, but I can't help but think of the hilarious line from https://medium.com/@wob/the-sad-state-of-web-development-160...:

"A [Single Page App] will lock you into a framework that has the shelf life of a hamster dump"

Seems like that could be said about a lot of infrastructure services out there as well.


Modern websites are designed to blow up as soon as there isn't an active development team anymore. The docker image needs to be rebuilt from source, the cloud APIs get deprecated, etc.

Looking at that status page, unless it's completely wrong, there are outages multiple times every month. They really could just upgrade their instance and get more mileage out of it, and it wouldn't require any "refactoring". The site's mostly fine as is.

Meh. I rewrote our backend in about 3 months, and am now at the end of rewriting another service. All good so far!

Sometimes stuff is broken so badly that you're better off starting over.


Resource hogging is a huge class of errors, though. Everything from a bad client update DDoSing a feature, file handles, memory leaks, log storage (a little outdated now perhaps), and so many more...
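One cheap way to catch at least the file-handle variety before it takes the site down; a Linux-only sketch (the threshold is arbitrary):

    # Warn when a process is burning through its file-descriptor limit.
    import os
    import resource

    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))  # Linux-specific
    if open_fds > 0.8 * soft_limit:
        print(f"possible fd leak: {open_fds}/{soft_limit} descriptors in use")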
