
We came to the exact same conclusion. An EventBridge schedule triggers a Fargate task. The job terminates its own process after execution, so the container shuts down and all is good.
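For reference, the wiring is roughly this in boto3; all of the ARNs, the subnet id, and the cron expression below are placeholders, not a definitive setup:

    import boto3

    events = boto3.client("events")

    # Placeholder identifiers; substitute your own cluster, task definition,
    # IAM role, and subnet.
    CLUSTER_ARN = "arn:aws:ecs:us-east-1:123456789012:cluster/jobs"
    TASK_DEF_ARN = "arn:aws:ecs:us-east-1:123456789012:task-definition/nightly-job"
    ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-ecs-run-task"

    # A scheduled rule fires on a fixed cadence...
    events.put_rule(Name="nightly-job", ScheduleExpression="cron(0 3 * * ? *)")

    # ...and its target launches a one-off Fargate task. The task's process
    # exits when the job finishes, so the container stops on its own.
    events.put_targets(
        Rule="nightly-job",
        Targets=[{
            "Id": "nightly-job-task",
            "Arn": CLUSTER_ARN,
            "RoleArn": ROLE_ARN,
            "EcsParameters": {
                "TaskDefinitionArn": TASK_DEF_ARN,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {"Subnets": ["subnet-0abc1234"]},
                },
            },
        }],
    )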



Seems like they were not seeing the race, and instead only saw that open connections would stay open if not explicitly closed during the graceful shutdown.

Once the graceful shutdown was properly executed, it closed any open connections to that pod and stopped the 502s they were seeing. Sounds like either the race wasn't happening or they didn't see/care about it.


> You tell your applications to drain connections and gracefully exit on SIGTERM.

The problem is that k8s will send requests to your application after SIGTERM. So you have to wait some amount of time before shutting down to allow for that.

This was at least the case the last time I used k8s, and it seemed to stem from the distributed architecture, so it was more than a mere bugfix away.
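A rough sketch of the workaround; the plain http.server and the 10-second delay are purely illustrative, the point is just to keep serving for a while after SIGTERM, then drain:

    import signal
    import threading
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DRAIN_DELAY_SECONDS = 10  # illustrative; should exceed endpoint-propagation latency

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")

    server = HTTPServer(("0.0.0.0", 8080), Handler)

    def on_sigterm(signum, frame):
        def drain():
            # Keep accepting traffic while the pod is removed from the
            # Service's endpoints, since requests can still arrive after SIGTERM.
            time.sleep(DRAIN_DELAY_SECONDS)
            server.shutdown()  # stops serve_forever(); the in-flight request finishes first
        threading.Thread(target=drain, daemon=True).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    server.serve_forever()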


> Basically when a window is created, we receive an event. After getting that event, we lock the X server, then ask it about the new window. And sometimes, the window is just not there

Relying on this sounds like a race condition even if the lock is working. In the time between processing the event and acquiring the lock, the window could already have been destroyed.


Great post, and this is something we've faced as well. Luckily our jobs are mainly idempotent, and the non-idempotent ones aren't that critical. This is a pretty nice solution! Ethan, the errors you still see from jobs that take more than PRE_TERM_TIMEOUT seconds... I'm assuming that's a separate, job-specific issue, like external services timing out, etc.?

I noticed the "wait 5 seconds, and then a KILL signal if it has not quit" comment in the code above the new_kill_child method. Without jumping into the code, is the normal flow to send a TERM, then force a KILL after 5 seconds? Just curious.
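For what it's worth, the generic shape of that pattern is something like the following; this is just an illustrative sketch, not Resque's actual implementation:

    import signal
    import subprocess

    def stop_child(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
        """Ask the child to exit with TERM; escalate to KILL if it hasn't quit in time."""
        proc.send_signal(signal.SIGTERM)  # polite: the child can flush, release locks, etc.
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL: the child gets no chance to clean up
            proc.wait()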


I hate that this specification and most of the other ones use spans that have a beginning and an end, rather than events that start and end the span. What if the process crashes before it sends out the span? What if the operation is taking a very long time to complete?
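A sketch of the event-based alternative, with a made-up emit() transport: the start record goes out immediately, so a crash or a still-running operation at least leaves a partial trace.

    import time
    import uuid

    def emit(event: dict) -> None:
        # Hypothetical transport; a real system would ship this to a collector.
        print(event)

    def start_span(name: str) -> str:
        span_id = uuid.uuid4().hex
        # Emitted right away, unlike a span object that is only reported on completion.
        emit({"type": "span_start", "id": span_id, "name": name, "ts": time.time_ns()})
        return span_id

    def end_span(span_id: str, status: str = "ok") -> None:
        emit({"type": "span_end", "id": span_id, "status": status, "ts": time.time_ns()})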

> You _can_ block the whole process with a long lived handler.

Or you could handle the event out-of-process.


Task.await tries to exit the calling process when the timeout hits, but IEx traps exits in that process, so it doesn't terminate, and thus the linked task process doesn't either, I think? If I wrap all of this in another task, rather than running it directly in IEx, then I observe the innermost process get terminated via the process link, since the intervening one doesn't trap exits.

Relevant from https://hexdocs.pm/elixir/1.4.5/Task.html, which you've probably already seen:

> If the timeout is exceeded, await will exit; however, the task will continue to run. When the calling process exits, its exit signal will terminate the task if it is not trapping exits.


No, it's saying: if my goroutines crash it's the same as if my main thread crashed, which means: game over, application down.

Which gets handled by the container scheduler.


What if a process takes an hour or two to finish? Can App Runner handle that?

Often I find myself panicking because I can't finish a task within the 15-minute limit, so I end up spinning up a Lightsail server to process long-running tasks, which means I need to create an SQS queue to manage pending jobs, and it's a wheel I seem to reinvent constantly.


Not quite.

In my experience you have several critical issues:

1. What happens when a job silently fails?

2. What happens when a job takes a lot longer than expected to succeed?

If you solve the first with a timeout, the second leads to a job rerun. The best (only?) solution I have found is to have some awareness in the job queue of the fact that the job is currently being processed. In my previous work we used advisory locks for that.
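For the curious, the Postgres version of that looks roughly like this; the connection string and the work callable are placeholders:

    import psycopg2  # assumes a Postgres-backed queue; connection string is a placeholder

    conn = psycopg2.connect("dbname=jobs")
    conn.autocommit = True

    def try_run(job_id: int, work) -> bool:
        """Run work(job_id) only if no other worker holds the advisory lock for this job."""
        with conn.cursor() as cur:
            # Session-level advisory lock: held by this connection, and released
            # automatically by Postgres if the worker process dies.
            cur.execute("SELECT pg_try_advisory_lock(%s)", (job_id,))
            (got_lock,) = cur.fetchone()
            if not got_lock:
                return False  # another worker is already processing this job
            try:
                work(job_id)
                return True
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (job_id,))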


> it will process events while waiting for syscall

How does that work?

According to the source code quoted in the article, there is a separate "coroutine-safe version of time.sleep", which seems like it shouldn't be needed if V has a general solution for unblocking blocking syscalls.
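Python's asyncio has the same split, which is a decent analogy for why a dedicated coroutine-safe sleep exists when there is no general mechanism for unblocking syscalls: a plain blocking sleep stalls every coroutine on the loop, while the coroutine-aware one only suspends the caller.

    import asyncio
    import time

    async def ticker():
        # Keeps printing while other coroutines sleep cooperatively.
        for _ in range(5):
            print("tick")
            await asyncio.sleep(0.2)

    async def blocking_sleep():
        time.sleep(1)           # blocks the whole event loop: the ticker stalls for a full second

    async def cooperative_sleep():
        await asyncio.sleep(1)  # suspends only this coroutine; the ticker keeps going

    async def main():
        # Swap cooperative_sleep() for blocking_sleep() to watch the ticker freeze.
        await asyncio.gather(ticker(), cooperative_sleep())

    asyncio.run(main())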


close can take arbitrarily long; it's a blocking operation.

Don't ever call close on the hot path.
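One way around it, sketched with a made-up close_later helper (not a standard API): push the close onto a background thread so the hot path never waits on it.

    from concurrent.futures import ThreadPoolExecutor

    # A single background thread dedicated to closing resources, so a slow close
    # (e.g. a flush over a network filesystem) never stalls the latency-sensitive path.
    _closer = ThreadPoolExecutor(max_workers=1, thread_name_prefix="deferred-close")

    def close_later(resource) -> None:
        _closer.submit(resource.close)

    # Usage on the request path: close_later(sock) instead of sock.close().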


"To avoid this situation, there is a termination logic in the Executor processes whereby an Executor process terminates itself as soon as three consecutive heartbeat calls fail. Each heartbeat timeout is large enough to eclipse three consecutive heartbeat failures. This ensures that the Store Consumer cannot pull such tasks before the termination logic ends them—the second method that helps achieve this guarantee."

Neither this nor the first method guarantees a lack of concurrent execution. A long GC pause or VM migration after the second check could still allow the job to get rescheduled due to a timeout. The first worker could resume thinking it still had one heartbeat left before giving up on the job, while the job had already been handed out to another worker in the meantime.


The sleep sub-process of time is not part of the pipeline, however. It will exit after the requested time has passed.

Same for any queuing system. You need to set the expiry time long enough for the expected task duration.

In SQS, for example, you use a visibility timeout set high enough that you will have time to finish the job and delete the message before SQS hands it off to another reader.

You won’t always finish in time though, so ideally jobs are idempotent.
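Roughly, with boto3; the queue URL and handle_job are placeholders, and the 15-minute visibility timeout is just an example value:

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

    def handle_job(body: str) -> None:
        ...  # the actual work; should be idempotent, since redelivery can happen

    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,     # long polling
            VisibilityTimeout=900,  # must comfortably exceed the expected job duration
        )
        for msg in resp.get("Messages", []):
            handle_job(msg["Body"])
            # Deleting before the visibility timeout expires is what stops SQS
            # from handing the same message to another reader.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])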


In this case just taking more than a millisecond can cause scheduler collapse. So it's a pretty easy mistake to make.

> On the IO completing, the OS will suspend your thread, and execute your callback.

On Windows, you have to put your thread into an alertable wait to receive any callbacks [1]. If the OS suspended your thread at a random point to execute a callback, that could lead to hard-to-detect/debug deadlocks and race conditions.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...


> What happens when a tasklet takes too long?

The same thing that happens when 1+1 == 3, or when a task tries to write to memory that it doesn't have permissions for. The static analysis that your system relies on for correct behavior is no longer valid, so a hardware belt-and-suspender mechanism (a schedule overrun timer interrupt, a lockstep core check failure, or an MPU fault, respectively) resets or otherwise safe-states the failed ECU and safety is assured higher up in the system analysis.


I have a running-time limit on the containers.

The first snippet is a fork bomb and will cause the container to run out of memory before the timeout. It does terminate, since I have set a 256M memory limit. However, it is not sending the correct response in this case, and the message in the output tab is not updated properly.

Fork works just fine, https://sharepad.io/p/aBn5Oxu

