
Sure, all of that is easy, but control flow is a nightmare. How do you deal with cleaning up resources on failures? Idempotency? Traps? It's a fucking mess.



Well, somebody has to detect the failures, restart failed processes, etc., so that you can just "go with the flow". If you can "just crash", that's great, but it doesn't lead to much helpful advice for people who actually design the underlying systems.

Sure, creating is easy, but what about monitoring, testing, rate limiting, updating (it can't be atomically updated with another resource, good luck testing that), security patching? What if you create a condition which causes an infinite loop of invocations? This is really rare in monoliths because we all agree on patterns like MVC that determine who calls what. When Conway's Law determines your infrastructure, this is no longer the case.

That's exactly the same problem, with exactly the same failure modes and consequences on failure.

It's also solved in the exact same ways: by scripting your stuff at one level or another.


That's why you write a very small core with the essential tasks, which include updating the non-core, and then add the complexity in another layer, where failure is handled by the core.

I work in data engineering, and fixating on idempotence has been one of the best things I've ever done. Now whenever we build a new job or review an existing one, the first question (well, second, the first being "do we actually need this?") is usually "is this idempotent?" It saves SO much hassle. Processes fail, nodes disconnect, the OOM killer takes stuff out; these things happen on a daily basis in larger systems, so be ready for that.
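To make that concrete, here's a minimal sketch of the kind of job that passes the "is this idempotent?" question (names and paths are made up, assuming a simple partitioned file store): the output location is derived only from the run date and the whole partition gets replaced, so rerunning after a crash can't duplicate anything.

    import json
    import shutil
    import tempfile
    from pathlib import Path

    def run_daily_job(run_date: str, rows: list[dict], out_root: Path) -> None:
        """Hypothetical daily batch job, written to be idempotent: the output
        location depends only on run_date, and the whole partition is replaced
        on every run, so a rerun after a crash cannot duplicate data."""
        partition = out_root / f"dt={run_date}"

        # Stage the full output off to the side first.
        staging = Path(tempfile.mkdtemp(dir=out_root, prefix=".staging-"))
        with open(staging / "part-000.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

        # Swap the partition in wholesale. Rerunning overwrites the same
        # partition instead of appending to it.
        if partition.exists():
            shutil.rmtree(partition)
        staging.rename(partition)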

It's not about getting things done, it's about reliability and repeatability. When you deploy large numbers of nodes in a system you don't want little bits of state causing random failures. You want everything to be as homogeneous and clean as possible.

One of the pieces of software I'm most proud of is a service to manage the dynamic part of our infrastructure. It uses control theory and "let it fail" to great effect.

The service reads the state of the system and applies changes to converge toward a configured policy. If it encounters an error, it doesn't try to handle or fix it; it just fails and logs a fine-grained metric, plus a general error metric.

The system fails all the time at this scale, but heals itself pretty quickly. In over a year of operation it hasn't caused a single incident, and it has survived every outage.
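For anyone curious what that shape of service looks like, a rough sketch (hypothetical interfaces, not the actual code described above): each pass reads real state, applies the diff against the desired policy, and on any error just bumps metrics and gives up; the next pass converges from wherever things were left.

    import logging
    import time

    log = logging.getLogger("reconciler")

    def reconcile_once(read_state, desired_policy, apply_change) -> None:
        """One pass of a convergence loop. Deliberately does not try to
        repair errors in-line: it lets exceptions escape to the caller."""
        observed = read_state()
        for change in desired_policy.diff(observed):  # what needs to happen
            apply_change(change)                      # may raise; that's fine

    def run_forever(read_state, desired_policy, apply_change, metrics,
                    interval_s: float = 30.0) -> None:
        while True:
            try:
                reconcile_once(read_state, desired_policy, apply_change)
            except Exception as exc:
                # Fail fast: no in-line recovery, just a fine-grained metric
                # plus a general one. The next pass re-reads real state and
                # converges from wherever the failure left things.
                metrics.incr(f"reconcile.error.{type(exc).__name__}")
                metrics.incr("reconcile.error")
                log.exception("reconcile pass failed")
            time.sleep(interval_s)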


With that many integrations, some small set must be broken at any given time. How do you handle this without scaling a support staff accordingly?

Idempotency is something I think about constantly. It's part of my Optimize For Sleep® development philosophy. You have to assume that 1) your program can and will fail at any arbitrary point during execution and 2) the caller of your code will retry it with reckless abandon every time it fails (or even if it doesn't). I know this because I am often on both sides of my own code, the writer and the caller. Managing state to avoid side effects takes a lot of mental effort and devs often get complacent or rush through it, but eventually it is going to burn you.
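A toy version of what that assumption forces you into (hypothetical names, with a dict standing in for whatever durable store you'd actually use): key the side effect on the caller's request ID, so retries "with reckless abandon" replay the stored result instead of repeating the charge.

    processed: dict[str, dict] = {}  # stand-in for a durable store keyed by request ID

    def charge_card(request_id: str, amount_cents: int) -> dict:
        """Hypothetical side-effecting operation. Retrying with the same
        request_id returns the original result instead of charging twice."""
        if request_id in processed:
            return processed[request_id]      # caller retried; replay the result

        # The real side effect would happen here, tied to the same request_id.
        result = {"charged": amount_cents, "request_id": request_id}
        processed[request_id] = result        # remember before acknowledging
        return result

    # A caller retrying "with reckless abandon" still causes one charge:
    first = charge_card("req-42", 1999)
    again = charge_card("req-42", 1999)
    assert first == again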

I've seen the guts of a few major financial organizations, and there are some common themes regarding their infrastructures.

The one that really sticks out to me as an engineer is the fact that the whole system in most cases seems to be tied together by a fragile arrangement of 100+ different vendors' systems & middleware that were each tailored to fit specific audit items that cropped up over the years.

Individually, all of these components have high-availability assurances up and down their contracts, but combine all these durable components haphazardly and you get emergent properties that no one person or vendor can account for comprehensively.

When the article says a full reset entails killing the power and restarting, this is my actual experience. These complex leviathans have to be brought up in a special snowflake sequence or your infra state machine gets fucked up and you have to start all over. When dependency chains are 10+ systems long and part of a complex web of other dependency chains it starts to get hopeless pretty quickly.


In this specific case, I don't think that's necessarily the outcome. Our industry has yet to adopt a universally acknowledged equivalent of the lockout/tagout (LOTO) interlock. There is no need for a bureaucracy if we have cryptographically enforced multisig Shamir secret-sharing keys, where a LOTO prevents (in this case) a system from spinning up while another system (apparently the backup system here) is running. Allow it to be overridden by a sufficiently senior manager, or by a sufficient number of lower-seniority managers, which leaves an audit trail. Integrate it with change management, notification, secrets storage, and infrastructure as code, and it encodes these infrastructure dependencies into code that can be queried to auto-construct change interlock sequences for a particular desired state.
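As a toy sketch of the interlock part only (all names hypothetical, and the actual multisig/Shamir machinery left out): starting a system is refused while its conflicting peer is running, unless the override policy is met, and every decision lands in an audit trail.

    from dataclasses import dataclass, field

    @dataclass
    class Lockout:
        """Hypothetical software LOTO: `system` may not start while
        `conflicts_with` is running, unless the override policy is met."""
        system: str
        conflicts_with: str
        min_approvals: int = 2        # e.g. two lower-seniority managers...
        senior_override: bool = True  # ...or one sufficiently senior one
        audit_log: list[str] = field(default_factory=list)

        def may_start(self, running: set[str],
                      approvals: list[tuple[str, int]]) -> bool:
            """approvals: (approver, seniority) pairs; the seniority
            threshold of 10 is an arbitrary stand-in."""
            if self.conflicts_with not in running:
                self.audit_log.append(f"start {self.system}: no conflict, allowed")
                return True
            senior = self.senior_override and any(lvl >= 10 for _, lvl in approvals)
            quorum = len({name for name, _ in approvals}) >= self.min_approvals
            allowed = senior or quorum
            self.audit_log.append(
                f"start {self.system} while {self.conflicts_with} running: "
                f"{'override granted' if allowed else 'blocked'} by {approvals}"
            )
            return allowed

    lock = Lockout(system="primary", conflicts_with="backup")
    assert not lock.may_start(running={"backup"}, approvals=[])
    assert lock.may_start(running={"backup"}, approvals=[("cto", 10)])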

Of course, once you take advantage of such a representation at scale by deploying tremendously more complex infrastructures, you then have to deal with the dependency-network meta-challenge, lest you inadvertently fall into dependency hell. While down that path lie NP-hard problems, they're still computable to a reasonable degree, and I dare say the result is more robust than doing it all by hand like we do today.

The real challenge is that the vast majority of devops staff today would really dislike reasoning about such a representation when it blows up in their faces, and I can't blame them for that kind of reaction.


IMO the only reason people run into trouble is that they code as if everything completes instantly. For example, they'll write to disk assuming the operation is quick, because that's what they're used to. You see this all the time in all sorts of software when it freezes up and errors cascade.

People just code like everything works all the time. Then when they want to integrate an external service, they have to rewrite everything.

If you understand that network calls have to be made and electrical signals have to pass down a bus when you write either a monolith or a microservice, going back and swapping services is super easy.
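A small sketch of the opposite habit (hypothetical endpoint, `requests` assumed available): give every call an explicit time budget and bounded retries, so a slow hop degrades instead of freezing everything.

    import time
    import requests

    def fetch_with_budget(url: str, attempts: int = 3, timeout_s: float = 2.0):
        """Treat the call as slow and fallible by default: bounded timeout,
        bounded retries with backoff, and an explicit failure at the end."""
        for attempt in range(1, attempts + 1):
            try:
                resp = requests.get(url, timeout=timeout_s)
                resp.raise_for_status()
                return resp.json()
            except requests.RequestException:
                if attempt == attempts:
                    raise  # let the caller decide what degraded mode looks like
                time.sleep(0.5 * 2 ** attempt)  # simple exponential backoff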


And idempotency is essential for shell scripts that alter state. Otherwise you have to keep track of where failures happened.
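Same idea sketched in Python rather than shell (made-up paths; translate to `mkdir -p` and `mv` if you prefer): every step is either already true or applied with an operation that's safe to repeat, so a rerun needs no record of where the last run died.

    import os
    from pathlib import Path

    def ensure_deployed(config_text: str) -> None:
        """Every step is safe to repeat, so a rerun after a mid-script
        failure needs no bookkeeping about where it died."""
        # Repeat-safe: creating a directory that already exists is a no-op.
        Path("/srv/myapp/releases").mkdir(parents=True, exist_ok=True)

        # Repeat-safe: stage the file, then swap it in atomically, so a crash
        # leaves either the old config or the new one, never half of each.
        target = Path("/srv/myapp/config.ini")
        staging = target.parent / (target.name + ".tmp")
        staging.write_text(config_text, encoding="utf-8")
        os.replace(staging, target)  # atomic rename on POSIX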

Yeah, I'm not saying to completely abandon it. However, folks are well served by learning that idempotency is your friend. Not always achievable, to be sure. However, even EC2 lets you launch instances with an idempotency key.
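For reference, that EC2 knob is the ClientToken parameter on RunInstances; a hedged boto3 sketch (AMI, instance type, and token values are placeholders): repeating the call with the same token gives you back the original launch rather than a second instance.

    import boto3

    ec2 = boto3.client("ec2")

    def launch_once(token: str):
        """Retries with the same ClientToken are deduplicated by EC2 itself,
        so a timed-out call can be safely repeated."""
        return ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder AMI
            InstanceType="t3.micro",          # placeholder instance type
            MinCount=1,
            MaxCount=1,
            ClientToken=token,                # the idempotency key
        )

    # Calling twice with the same token yields the same reservation,
    # not two instances.
    resp = launch_once("deploy-2024-01-15-webserver")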

So, where I said it "seems" to have those guarantees, take that more as "these seem to be required guarantees." In practice, they are not required, and you need to have plans for them failing at any rate. Rather than building up a system where fluke failures are painful, make one that is constantly checking for failures and doing incremental work.

But, above all, my point was to reason about it. Not to take anything as written in stone.


Yeah, keep stateful stuff and stateless stuff separate: separate clusters, network spaces, cloud accounts, likely a mix of all that.

Clearly define boundaries and acceptable behavior within them.

Set up telemetry and observability to monitor for threshold violations.

Simple. Right?


Yeah, I agree that it's a nightmare in production, but it's absolutely indispensable in development situations where you can tolerate the occasional weirdness or failure. If you're developing a server app (or a thick-client app), avoiding a server restart every time you change anything is a huge, huge win. And again, it's not like it's not implementable: there's already an implementation, for the JVM, that works.

You can recover from failed allocations without catastrophic failure. It is a fundamentally lazy programming practice to pretend that error handling has to be an out-of-band operation that can't be dealt with locally and must instead bubble up progressively.
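A contrived Python sketch of handling it locally rather than out of band (sizes are made up, and the usual overcommit/OOM-killer caveats apply): if the big allocation fails, degrade to a smaller buffer instead of letting the whole process die.

    def make_buffer(preferred_mb: int, fallback_mb: int = 16) -> bytearray:
        """Handle the allocation failure at the call site: degrade to a
        smaller buffer rather than letting MemoryError unwind the process."""
        try:
            return bytearray(preferred_mb * 1024 * 1024)
        except MemoryError:
            # Local, in-band recovery: smaller working set, same code path.
            return bytearray(fallback_mb * 1024 * 1024)

    buf = make_buffer(preferred_mb=4096)  # made-up size; falls back if 4 GiB isn't available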

And have your unreliable, inconsistent, unscalable system that apparently goes down all the time.

Not using ES here is actually nuts.


Agreed.

I found that the best way to get reliable ops long-term is to mercilessly dive into the code for every issue encountered, and fix it at a fundamental level.

It takes significant time, but after following this practice for a while, things start working reliably.

I'm currently doing this with software like GlusterFS, Ceph, Tinc, Consul, Stolon, nixpkgs, and a few others.

In other words, try to make it so that none of your components is a black box to you. Own the entire stack, and be able to fix it. That makes reliable systems.
