
Being able to quickly swap in different retry mechanisms and policies is pretty useful. Sometimes you want exponential backoff, sometimes no backoff, and sometimes it depends on other state.
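Roughly, the swappable-policy idea looks like this; the interface and class names below are made up for illustration, not from any particular library:

```java
import java.time.Duration;
import java.util.Optional;

// Hypothetical strategy interface: a policy decides whether to retry and how long to wait.
interface RetryPolicy {
    // Empty result means "give up"; otherwise wait this long before the next attempt.
    Optional<Duration> nextDelay(int attempt, Throwable lastError);
}

// No backoff: retry immediately, up to a fixed number of attempts.
class NoBackoff implements RetryPolicy {
    private final int maxAttempts;
    NoBackoff(int maxAttempts) { this.maxAttempts = maxAttempts; }
    public Optional<Duration> nextDelay(int attempt, Throwable lastError) {
        return attempt < maxAttempts ? Optional.of(Duration.ZERO) : Optional.empty();
    }
}

// Exponential backoff: 100ms, 200ms, 400ms, ... capped at 10s.
class ExponentialBackoff implements RetryPolicy {
    private final int maxAttempts;
    ExponentialBackoff(int maxAttempts) { this.maxAttempts = maxAttempts; }
    public Optional<Duration> nextDelay(int attempt, Throwable lastError) {
        if (attempt >= maxAttempts) return Optional.empty();
        return Optional.of(Duration.ofMillis(Math.min(100L << Math.min(attempt, 20), 10_000L)));
    }
}
```

The calling code only ever sees RetryPolicy, so switching strategies is a one-line change.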

Not as simple as you might think. And if you're going to be writing microservices, you're going to become intimately familiar with retries.




Very cool. Consistent and clear retry, backoff, and failure behaviors are an important part of designing robust systems, so it's disappointing how uncommon they are. If I were starting a new Java project today I would almost certainly want to use this library instead of the various threads and timers I had to hack together years ago.

If you are not familiar with how retrying is used and why exponential backoff, jitter, etc. are useful, here is a shameless plug for a blog post I wrote recently: https://blog.najaryan.net/posts/autonomous-robustness/ (the first half is pretty much about all of these things).
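The short version of why jitter matters: with plain exponential backoff, every client that failed at the same time retries at the same instants, so the "full jitter" variant picks a random delay up to the exponentially growing cap instead. A tiny sketch (not taken from the post):

```java
import java.util.concurrent.ThreadLocalRandom;

final class Backoff {
    // Full jitter: sleep a random amount between 0 and min(cap, base * 2^attempt),
    // so a crowd of failed clients spreads out instead of retrying in lockstep.
    static long fullJitterDelayMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = Math.min(capMillis, baseMillis * (1L << Math.min(attempt, 30)));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```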

Totally! Thanks for bringing those up. I tried to keep the scope specifically on retries and client-side mitigation. There's a whole bunch of cool stuff to visualise on the server-side, and I'm hoping to get to it in the future.

At a basic level, yes, it is to allow "giving up and moving on", and that is the point. There is more to it than that, but basically it is to allow failing fast and, if possible, degrading gracefully via fallbacks, instead of queueing requests on a latent backend dependency and saturating the resources of every machine in a fleet, which then causes everything to fail instead of just the functionality related to the single dependency.

The concepts behind Hystrix are well known and have been written and spoken about by folks far better at communicating these topics than I am, such as Michael Nygard (http://pragprog.com/book/mnee/release-it) and John Allspaw (http://www.infoq.com/presentations/Anomaly-Detection-Fault-T...). Hystrix is a Java implementation of several different concepts and patterns in a manner that has worked well and been battle-tested at the scale Netflix operates.
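For anyone who hasn't seen it, the core usage pattern is a command that wraps a single dependency call together with its fallback, roughly like this (the class and group names are illustrative, and the remote call is stubbed out):

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps one call to a backend dependency. If the call fails, times out,
// or the circuit for this command is open, getFallback() is used instead.
public class GetUserNameCommand extends HystrixCommand<String> {
    private final long userId;

    public GetUserNameCommand(long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserService"));
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // The real remote call would go here; stubbed out for the sketch.
        return fetchNameFromBackend(userId);
    }

    @Override
    protected String getFallback() {
        return "Guest"; // degrade gracefully instead of failing the whole request
    }

    private String fetchNameFromBackend(long userId) {
        return "User-" + userId; // placeholder for an actual dependency client
    }
}

// Usage: String name = new GetUserNameCommand(42).execute();
```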

Netflix has a large service oriented architecture and applications can communicate with dozens of different services (40+ isolation groups for dependent systems and 100+ unique commands are used by the Netflix API). Each incoming API call will on average touch 6-7 backend services.

Having a standardized implementation of fault and latency tolerance that can be configured, monitored, and relied upon for all dependencies has proven very valuable, compared with each team and system reinventing the wheel and having different approaches to configuration, monitoring, alerting, etc. - which is how things were before.

As for incremental backoff - I agree that this would be an interesting thing to pursue (which is why a plan is in place to allow different strategies to be applied to circuit breaker logic => https://github.com/Netflix/Hystrix/issues/9), but thus far the simple strategy of tripping a circuit completely for a short period of time has worked well.

This may be an artifact of the size of Netflix clusters, and a smaller installation could possibly benefit more from incremental backoff - or perhaps this is functionality that could be a great win for us as well and we just haven't spent enough time on it.

The following link shows a screen capture from the dashboard monitoring a particular backend service that was latent:

https://github.com/Netflix/Hystrix/wiki/images/ops-social-64... (from this page https://github.com/Netflix/Hystrix/wiki/Operations)

Note how it shows 76 circuits open (tripped) and 158 closed across the cluster of 234 servers.

When we are running clusters of 200-1200 instances, the "incremental backoff" naturally occurs as circuits trip and close independently on different servers across the fleet. In essence, the incremental backoff happens at the cluster level rather than within a single instance, and it naturally reduces throughput to what the failing dependency can handle.

Feel free to send questions or requests at https://github.com/Netflix/Hystrix/issues or to me @benjchristensen on Twitter.


> That's why failing fast and not retry is the best strategy for most if not all applications.

I think it's more complex than this. You also have to lump timeouts, caching, and failure behavior into the conversation. And there are situations where you absolutely need some amount of retries. Say, for example, you want seamless failover between backends: you're expecting some failures and don't want or need to expose those to your end users. Or maybe the "end user" isn't a person - for example, finalizing a financial transaction from a queue.
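A minimal sketch of that kind of client-side failover (the backend list and the call itself are hypothetical): try the primary first and quietly fall through to a replica, only surfacing an error if everything fails.

```java
import java.util.List;
import java.util.function.Function;

final class Failover {
    // Try each backend in order; return the first successful result,
    // or rethrow the last failure only if every backend failed.
    static <T> T callWithFailover(List<String> backends, Function<String, T> call) {
        RuntimeException last = new IllegalStateException("no backends configured");
        for (String backend : backends) {
            try {
                return call.apply(backend);
            } catch (RuntimeException e) {
                last = e; // expected occasionally; not exposed to the end user yet
            }
        }
        throw last;
    }
}

// Usage (fetchOrder is hypothetical):
// Failover.callWithFailover(List.of("primary:8080", "replica:8080"), host -> fetchOrder(host, orderId));
```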


Many good points there, but I think a lot of people could be misled by the advice about timeouts and retries. In my experience, which is a lot more than the author's two years, having each component retry and eventually fail on its own schedule is a terrible way to produce an available and debuggable system. Simple retries are fine and even necessary, but there should only be one part of the system making the often-complex decisions about when to declare something dead and give up. That way you don't end up with things that are half-dead or half-connected, which is a real pain in the butt.

The fault tolerance allows you to keep processes and state reliably available for longer than the duration of an HTTP request.

You can have continuously running processes without relying on anything outside the language, and you can more easily distribute such code as an Elixir package. The code works without needing, say, cron or Redis to be available and configured.


Understood, and perhaps the difference here is more about what I said in my other comment. I don't see this as particularly useful, and I see it as deceptive. What matters is that when we intend to do something, it happens. We actually want retries. We want exactly-once processing with at-least-once handling. That's the only way I know of to build resilient systems.

Our back end components literally crash when they fail. No other processing is done. They crash and if there is a bug they will keep crashing. Because our back end components have autonomy, this results in a very narrow service disruption. We fix the issue, identify the root cause and eliminate it. As a result, we have extremely resilient systems that, even in the face of 3rd party downtimes, we can know that every command that gets submitted will be effected. Dead letter queues or anything of the sort are anathema. We try until we succeed. Idempotence is considered in every single handler all the way from the inside to the outside as it must be, mathematically.


Right, but often you should punt your retry logic upward. You already need to handle process crashes at a high level, so you already want some kind of restart/backoff in your supervisor. Unless there's something specific you want to do differently for this particular kind of failure, having separate retry/backoff logic for each kind of failure is just clutter.

I think one of the more useful ones not mentioned would be idempotency. That is, ensure that for any given message, processing it multiple times will be at least tolerated, but ideally yield identical state. This eases so many of the complexities of dealing with failures (it's always safe to retry, even if the recipient may have processed your first attempt), coherency, etc.
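The cheap version of this is to key every message with an ID and skip ones you've already applied. A sketch (the message shape and store are hypothetical; a real system would use a durable store and do the check and the change in one transaction):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class IdempotentHandler {
    // In-memory for the sketch; in production this would be a durable store
    // (e.g. a unique-key table) updated in the same transaction as the change.
    private final Map<String, Boolean> processed = new ConcurrentHashMap<>();

    // Safe to call any number of times per messageId: only the first delivery
    // applies the change, so redeliveries leave state identical.
    void handle(String messageId, Runnable applyChange) {
        if (processed.putIfAbsent(messageId, Boolean.TRUE) == null) {
            applyChange.run();
        }
    }
}
```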

Retrying and the like could be handled with inversion of control, which requires building a set of interfaces that provide operations like 'login()' and handing them to some kind of orchestrator that handles retrying and other meta-logic. That requires understanding some new domain-specific abstraction really well, or providing ample "escape hatches" in the event the abstraction doesn't work for all cases.

Or, in my opinion, you can handle it the way functional programming typically does: your stateful computations like login are represented as functions returning IO, which off-the-shelf functionality (like cats-effect or fs2 in Scala) can easily rate limit and retry. This kind of programming isn't as mainstream as it could be, but if you can build retry once and use it for pretty much any side-effecting computation, you won't feel the need to DRY up things that should stay separate just to share code.
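The same "write retry once, reuse it for any effect" shape is easy to sketch even outside Scala; here's a Java version over Callable (the names are made up, this is not cats-effect's API):

```java
import java.time.Duration;
import java.util.concurrent.Callable;

final class Retry {
    // Run any side-effecting computation, retrying on failure with a fixed delay.
    // Written once, reusable for login(), HTTP calls, DB writes, etc.
    static <T> T withRetries(Callable<T> op, int maxAttempts, Duration delay) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(delay.toMillis());
            }
        }
        throw last;
    }
}

// Usage (login is hypothetical):
// String session = Retry.withRetries(() -> login(user, pass), 3, Duration.ofMillis(200));
```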


As with anything, there are tradeoffs.

A dozen is too short to really make a difference, but one thing that can be done at the service management layer without abandoning restart entirely is a delayed restart, or exponential backoff.
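A sketch of what that can look like in a supervising loop (the service hook here is hypothetical): each crash roughly doubles the wait before the next restart, and a long stretch of healthy uptime resets it.

```java
final class RestartSupervisor {
    // Restart a crashing service with exponential backoff instead of immediately.
    static void superviseForever(Runnable service) throws InterruptedException {
        long delayMillis = 1_000;            // first restart after 1s
        final long maxDelayMillis = 60_000;  // never wait more than a minute
        while (true) {
            long startedAt = System.currentTimeMillis();
            try {
                service.run();               // returns only if the service exits on its own
            } catch (RuntimeException crash) {
                // swallow the crash and fall through to the restart delay
            }
            // If it stayed up for a while, treat the next crash as a fresh incident.
            if (System.currentTimeMillis() - startedAt > 5 * 60_000) delayMillis = 1_000;
            Thread.sleep(delayMillis);
            delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
        }
    }
}
```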

You could also imagine a per-service option for auto restart. Paranoid organizations with plenty of on-call engineers could disable auto-restart, if they were convinced the engineers wouldn't just rig up their own auto-restart to avoid the call.


Simple example where this is not quite true: moving a slow, fallible operation out of band to a durable queue with a sensible retry policy will tend to make the system simpler and less brittle, even though it becomes distributed.

I'd say retry logic that doesn't exponentially back off is somewhat similar to the Chernobyl incident: it increases load at the exact moment you least want it.

It absolutely is difficult. A challenge I have seen is stacked retries, where callers time out subprocesses that are themselves still doing retries.

I just find it amusing that they describe their back-off behaviors as "well tested" and in the same sentence, say it didn't back off adequately.


Yeah. I was going to bring up the .NET library that provides policy-based retries.

So a side effect of the fault tolerance is that you can also easily redeploy small parts of your app. If you have a small logic bug, you can fix it without bringing down the entire app.

There's also more serious stuff, like your workers getting killed by the OS for whatever reason, where you might need to go in and restart them.

You can do this through most queue-based systems in other languages but having everything be built-in is useful.


For #5, I like mesh-level retries for a few reasons, but perhaps the biggest is avoiding retry storms by using budgets: https://linkerd.io/2.15/tasks/configuring-retries/#retry-bud...
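Independent of any particular mesh, the budget idea is: only allow retries up to some fraction of the recent original request volume, so retries can never multiply load during an outage. A toy sketch (the 20% ratio, the floor, and the counters are illustrative; real budgets decay over a time window rather than counting forever):

```java
import java.util.concurrent.atomic.AtomicLong;

final class RetryBudget {
    private final double ratio;                        // e.g. 0.2 => retries add at most ~20% extra load
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong retries = new AtomicLong();

    RetryBudget(double ratio) { this.ratio = ratio; }

    void recordRequest() { requests.incrementAndGet(); }

    // Call before each retry; if the budget is spent, fail instead of retrying.
    boolean tryWithdrawRetry() {
        long allowed = (long) (requests.get() * ratio) + 10; // small floor so low-traffic services can still retry
        if (retries.get() >= allowed) return false;
        retries.incrementAndGet();
        return true;
    }
}
```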

> I'm handling bugs very differently than network failures though, because network failures are usually temporary while bugs are usually (or even by definition) permanent.

Depends on the bug - there are transient bugs that are not networking related.

But let's assume it's a "hard error", i.e. a consistently failing bug. I would say where that bug is makes a huge difference.

If it's a critical feature, that bug should probably get propagated. If it's a non-critical feature, maybe you can recover.

By isolating your state across a network boundary, failure recovery is made much simpler (because you do not need to unwind to a "safe point" - the safe point is your network boundary).

But it often depends on how you do it. I personally prefer to write microservices that use queues for the vast majority of interactions. This makes fault isolation particularly trivial (as you move retry logic to the queue), and it scales very well.

If you build lots of microservices with synchronous communications I think you'll run into a lot more complexity.

Still, I maintain that faults were already something to be handled, and that a network boundary encourages better fault handling by effectively forcing it upon you.

