
Isn't that the point of a pager system?



Many companies have overseas teams that take the night shift, e.g. every on-call dev is on for 12 hours during the day and off for 12 at night. Amazon does not, (allegedly) for cultural reasons: having teams carry their own pager around the clock is supposed to promote ownership and accountability for those teams.

The idea being that if you're woken up at 4am because of a flaky test or some other issue, you'll be more inclined to fix it. And if it's a recurring issue, you'll really want to get it sorted or else your life will be hell. It also puts the emphasis on serious testing methodology and on the idea that Amazon is a 24/7 business.

It actually makes a fair amount of sense, but it's also brutal and contributes to burnout and attrition. I also think this is starting to change for some teams.


Amazon encourages developers to be on call, but there are also rolling on-call systems so no one gets paged in the middle of the night. A team I work closely with does this. It isn't very stressful for me: I'm only on call for small, well-defined intervals, something like 1 week out of every 6, and only high-severity issues will cause a page. In the year or so that I've been on call, I've been paged after normal business hours only once. Of course, Amazon is an absolutely massive company, so different teams may have wildly different experiences.

I think it's different for each team, especially in the AWS groups. I know some support engineers who do 12-on/12-off rolling on-call like you said, but also others where one person is on call 24/7 for a week and catches everything, not just SEV 1. It really just depends on your org/boss, but the AWS org is known to be particularly pager-slaved.

Meh, I really don't buy that reasoning, but maybe that's because of my current job, where 9 out of 10 "got woken up at night" incidents are stuff that's not really our fault (e.g. something in the datacenter is broken, or network connectivity is missing, etc. And yes, I've heard of multihoming, but the budget isn't the team's, and neither is the decision not to keep stuff redundant...)

Disclaimer: Ridiculous pager duty was one of the reasons I quit Amazon (combined with massive failures in management, which I'll describe shortly).

The team I was on (retail-related, not AWS) shifted away from the rolling schedule, in which your counterparts in India took over for the other 12 hours, in order to push the "promote ownership" BS.

The only problem? Management was constantly pushing for new things to get done on extremely tight deadlines (including "emergency features" that needed to be done and in prod in days when they likely required a week of design efforts to get right, never mind actual dev time) so you have two options:

(1) Develop something stable, with good test coverage and the like, and work to fix it if it breaks... and work 20 hours to get it done.

(2) Shit something out as fast as you can and hope it breaks when someone else is on call, in order to maintain some semblance of work-life balance. (That someone will likely be too busy triaging SEV3-5s during business hours to even think about spending time fixing the root cause of the SEV1/2, rather than mitigating it and moving on... or, even worse (!) (/s), trying to get the actual feature owners to fix it.)

I'll leave it as an exercise to the reader to guess at which route was generally taken.

