Time is not a synchronization primitive (xeiaso.net)
168 points by todsacerdoti on 2023-06-24 | 112 comments




Did a devops elective last semester; I was surprised to hear that sleeping to wait for the system to come up was "common in the industry". 30 seconds was the magic number, which was somehow always either mostly a waste or not enough when I tried to use it.

For similar reasons, we similarly can't keep making sleep-sort faster by scaling the sleeps down.


Sleeping on error is a valid pattern when combined with retries. Consider a slow-starting external process that sets up the listener (or even a remote server spinning up), i.e. something we have no control over and that has no way to signal us that it's ready. In such a case, sleeping a little, then trying to connect (and repeating) is a simple, effective and economical solution.
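
For concreteness, a minimal Go sketch of that retry-with-sleep pattern (the address, attempt count and pause are arbitrary placeholders, not anything from the article):

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // dialWithRetry keeps trying to connect, sleeping a little between
    // attempts, until it succeeds or runs out of attempts.
    func dialWithRetry(addr string, attempts int, pause time.Duration) (net.Conn, error) {
        var lastErr error
        for i := 0; i < attempts; i++ {
            conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
            if err == nil {
                return conn, nil
            }
            lastErr = err
            time.Sleep(pause)
        }
        return nil, fmt.Errorf("gave up after %d attempts: %w", attempts, lastErr)
    }

    func main() {
        conn, err := dialWithRetry("127.0.0.1:8080", 10, 500*time.Millisecond)
        if err != nil {
            panic(err)
        }
        defer conn.Close()
    }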

> Sleeping on error is a valid pattern when combined with retries.

Agreed. Just note that it can sometimes (often?) be a good idea to add a small stochastic element to the sleep time, and/or do something similar to the exponential back-off[1] approach that Ethernet (among other things) uses.

[1]: https://en.wikipedia.org/wiki/Exponential_backoff


You ideally want exponential back-off plus some random delay (to prevent a thundering herd).
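
Roughly, in Go (a sketch; the doubling base, the cap and the "full jitter" strategy are all tunable assumptions, not prescriptions):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // backoff returns how long to sleep before retry number attempt (0-based):
    // exponential growth from base, capped at maxDelay, with "full jitter"
    // (a random duration in [0, delay]) so a crowd of clients doesn't retry
    // in lockstep.
    func backoff(attempt int, base, maxDelay time.Duration) time.Duration {
        delay := base << uint(attempt) // base * 2^attempt
        if delay <= 0 || delay > maxDelay {
            delay = maxDelay // the <= 0 case guards against shift overflow
        }
        return time.Duration(rand.Int63n(int64(delay) + 1))
    }

    func main() {
        for i := 0; i < 6; i++ {
            fmt.Println(i, backoff(i, 100*time.Millisecond, 5*time.Second))
        }
    }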

I've never personally seen any benefit from exponential backoff... constant backoff + jitter + short timeouts has always worked well for me.

If anything, people underestimating the ramp-up of exponential back-off (or not capping it) has left me waiting much longer than necessary.


The number of times something (Discord is the first example that comes to mind) has started waiting 30 seconds between retries just because I wasn't connected to the network is absurd.

That was "more sophisticated" and I was given a sample script without a loop. (I got Terraform to wait in a retry loop, which was apparently too complex.)

Yeah for the record I'm not talking about delay loops, I'm talking about hardcoded sleeps in test code that cause random failures when unspoken invariants change. DevOps is a unique blend of hellfire that is almost unfair to compare to everything else in the industry. I'm glad I don't work in DevOps anymore.

> we similarly can't keep making sleep-sort faster by scaling the sleeps down.

Technically, we can, at least to a certain limit - as long as the total time needed to schedule the sleeps is less than the shortest sleep duration, sleepsort will still work. Though we're just offloading the sorting to the scheduler, which has to implement the sorting algorithm anyway...
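
For the curious, the whole joke fits in a few lines of Go; it only "sorts" while the sleep unit comfortably exceeds scheduling jitter, which is exactly the limit described above:

    package main

    import (
        "fmt"
        "time"
    )

    // sleepSort "sorts" non-negative ints by sleeping proportionally to each
    // value. The scheduler is doing the real work; shrink the 10ms unit too
    // far and the output stops being sorted.
    func sleepSort(xs []int) []int {
        out := make(chan int)
        for _, x := range xs {
            x := x
            go func() {
                time.Sleep(time.Duration(10*x) * time.Millisecond)
                out <- x
            }()
        }
        sorted := make([]int, 0, len(xs))
        for range xs {
            sorted = append(sorted, <-out)
        }
        return sorted
    }

    func main() {
        fmt.Println(sleepSort([]int{5, 1, 4, 2, 3}))
    }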


I ended up writing a function that would sleep and retry according to the Fibonacci sequence up to a given total elapsed time, so you could set it to a max timeout of 60s, for example, but if the system came up in 8 seconds, you'd be waiting not much longer than that (11 seconds, technically).
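
Something along these lines, presumably (a sketch of the idea, not the commenter's actual code; the 1-second seed and the 60s budget are placeholders):

    package main

    import (
        "errors"
        "fmt"
        "time"
    )

    // retryFib calls op with Fibonacci-spaced pauses (1s, 1s, 2s, 3s, 5s, ...)
    // until it succeeds or the next sleep would blow the total time budget.
    func retryFib(op func() error, maxElapsed time.Duration) error {
        start := time.Now()
        a, b := time.Second, time.Second
        for {
            err := op()
            if err == nil {
                return nil
            }
            if time.Since(start)+a > maxElapsed {
                return fmt.Errorf("gave up after %v: %w", time.Since(start), err)
            }
            time.Sleep(a)
            a, b = b, a+b
        }
    }

    func main() {
        // Simulate a system that "comes up" after 8 seconds.
        up := time.Now().Add(8 * time.Second)
        err := retryFib(func() error {
            if time.Now().Before(up) {
                return errors.New("not ready yet")
            }
            return nil
        }, 60*time.Second)
        fmt.Println("result:", err)
    }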

I do this too occasionally — Fibonacci is easy to implement, and it's effectively an exponential backoff (with a growth factor of about 1.6 rather than 2, which is also easy to implement ;) )

I found the following line in a codebase once, which made me laugh:

    interval = interval*1.6  # fiboroughly


This bleeds into the Init Wars (sorrows, prayers), but there was this idea for a while that every service was going to be directly queryable by the run control system so that there would never be a need for magic sleep numbers. The result fell rather short of the ideal (particularly on the teardown side of things) but it did at least help admins get a firmer handle on exactly where we were (and still are, for the most part) using magic waits of "long enough".

Way back I worked on a project that used virtualized routers inside labs for a bunch of use cases. The idea was to remove the need for building after building filled with equipment that just sat there idling 99% of the time.

The biggest headache we had was figuring out when everything was up. There simply was no way to detect this programmatically, so the only option would have been to alter the code to print some kind of signal on the console like "System is up.", similar to a Linux box showing the login prompt. So I resorted to wait loops too.

We ended up canceling the project.


There have been several developments in the past decade or so that center on the philosophy of "make it the client's problem" (hello, Wayland) which has simultaneously led to much more robust and maintainable centralized systems with much more fragmented and unpredictable peripheral clients. I'm still not sure whether I regret this or not.

I've been doing "devops" stuff for a long time, but only got into Kubernetes in the last 18 months or so.

I've been endlessly frustrated by how prevalent "just add a <x> second sleep" is in the Kubernetes world.

Examples:

If you're using IRSA to provide pods an AWS IAM Role, it can take some time on first-run for the role to become usable. So your app has to just wait until it can successfully talk to AWS.

If you're using the AWS Elastic Load Balancer controller, you need to add sleeps during pre-stop for a pod. The controller has to see the shutdown event, then start target deregistration on the ELB Target Group. That's a minimum of 30 seconds, during which you can be getting new connections.

Service meshes on Kubernetes add some fun on both startup AND shutdown.

Linkerd, for instance, will only inject its linkerd-proxy container into your pod after startup. So it's entirely possible for your main application container to start, begin talking to the world, and then a moment later all your connections get reset when linkerd-proxy takes over the pod's networking configuration.

You can fix this by injecting into all of your containers a linkerd-await utility that replaces the container entrypoint. It will wait until linkerd starts before starting your regular entrypoint. It's basically a more advanced version of sleep.

Shutdown is just as fun - linkerd-proxy will immediately terminate, dropping all network connectivity for all the containers in your pod. There's a fix for this, too, which involves the above linkerd-await and another option.


With systemd having standardized so much stuff on sd_notify for a basic liveness check, it's frustrating that container setups can't piggy-back more on that.
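
For reference, the wire protocol behind sd_notify is tiny; a hedged sketch of announcing readiness by hand (it ignores the abstract-socket "@" case and assumes systemd set $NOTIFY_SOCKET, i.e. the unit uses Type=notify):

    package main

    import (
        "fmt"
        "net"
        "os"
    )

    // notifyReady implements the bare sd_notify protocol: send "READY=1" as a
    // datagram to the unix socket named by $NOTIFY_SOCKET. It is a no-op when
    // we were not started by systemd.
    func notifyReady() error {
        sock := os.Getenv("NOTIFY_SOCKET")
        if sock == "" {
            return nil // not running under systemd with Type=notify
        }
        conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: sock, Net: "unixgram"})
        if err != nil {
            return err
        }
        defer conn.Close()
        _, err = conn.Write([]byte("READY=1"))
        return err
    }

    func main() {
        // ... bind listeners, warm caches, etc., and only then:
        if err := notifyReady(); err != nil {
            fmt.Println("sd_notify failed:", err)
        }
    }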

Systemd's socket activation mechanism was also born from the idea that "sleep is not a synchronisation primitive; talking to a socket is!", and the container world just ignored it even though it's a great fit!

Actually it's a generalisation of what OP is doing here. Move the listener _out_ of the program and you never have to know if a service is up or not. Connecting to a service just blocks until it's up.


> It's basically a more advanced version of sleep.

Is waiting for the availability/existence of anything, then, a "more advanced version of sleep"?

Is a readiness probe a sleep?

I don't follow at all.


Polling relies on sleep. Setting the polling time matters a great deal, because you'll almost always take at least one poll interval to discover that something has come up, and on average every polled step in your startup process adds (poll time/2) to your total start-up time. If you have a bunch of steps that require polling, that adds up fast.

The problem is often relying too much on cold-starts; these polls amortize well in a real system, where you can afford (and want to afford!)


I think the sleep N > check for availability paradigm is pretty useful.

The alternatives have a ton more complexity for not much benefit.


I agree that it is no synchronization primitive and that even if it was, it's very fragile. However I found it in nearly every project I worked on, in backend tests, frontend tests and even in integration tests where the library driving the browser should take care of waiting for elements to appear on screen. Modals and browser dialogs are particularly good at breaking stuff.

The dynamic is: developers write the test and it usually works on their machines. If it doesn't, there is a chance they figure out what's wrong and rewrite it the proper way. Tests often work locally, where the CPU and disk are nearly 100% available to run them. Sometimes they fail on CI systems. Developers scratch their heads and attempt the easy fix of adding a sleep of 1 second. That almost always works and it took 5 minutes: code, commit, test run. They know that it stinks but they have stuff to do. Every few months somebody attempts to remove some of those sleeps, with mixed and often unsuccessful results. Integration tests are particularly nasty.


and thus you end up with slow software

Slow tests.

Yes, though I suspect it is cultural. A team that writes slow flakey tests probably writes slow flakey production code too, for the same reasons.

I'm sure that's a factor, but having slow tests makes it much harder to notice when you have made the actual code slower.

A team that insists on high-performance tests probably ships lots of bugs, because the tests are so local/isolated/mock-driven that they don’t actually exercise interesting interactions.

As an intern I took a look at an unloved integration test suite that took something like 8 hours to run. Most of it resulted from scattered sleeps waiting for Apache Solr replication, so I built a hack that queried Solr to compare the replication timestamp against (I think) the committed document's timestamp (iirc, it's been a while) and waited for it to be observable, which cut 6 hours off the suite's runtime.

Where on earth do people find interns this good?

I have a test suite with 100s of tests written by someone who was a big fan of this method.

When you run the test in isolation, 1s wait here and 2s wait there is not a big deal. But when you run 200 tests sequentially (because some of them don't work when they run in parallel), then it starts to add up, and your test suite starts feeling sluggish.


I'd be asking several questions here:

Why wait in the first place, and how can I fix the tests not being parallelized?

Both should be red flags in a codebase


I mean, there are some valid reasons why the tests are the way they are, and it's not easy to fix them.

Re waiting: Test are integration tests that wait for some event. Example: test changes a file on disk, waits for UI to update. We rely on the OS to report file system changes. We don't know how long it takes, because it depends on system load. We could just wait for the next file system event, but we might receive multiple file system events, and since they are processed asynchronously in our app, we don't know how long it takes for the UI to update. We could add some kind of notification when the UI updates to speed up tests, but that would mean adding extra code that might slow down the app just for integration tests.

Re parallelising: Tests are integration tests that run against different databases. Not all databases have perfect transaction support, so sometimes one test interferes with another (e.g. one test creates a table, another test fails because that table exists). So parallelising these tests would require multiple copies of each test database, which would cost more money, and spreading tests over multiple test databases would make the test execution code even more complex...


It's a weird example, because it moves listener creation elsewhere, which is not always possible (e.g. consider if the goroutine were spawning an external program instead of setting up its own listener).

Go has an excellent primitive for this, a channel:

    // Set up a channel. We'll send one and only one message over it, then close it.
    // The message will be either an error (if something went wrong)
    // or nil, if everything is fine and we can proceed.
    ready := make(chan error)

    go func() {
        // No matter how this goroutine exits, when it exits the channel will be closed
        defer close(ready)

        // Do whatever we must: open a socket, run a subprocess, or something else
        lis, err := net.Listen(...)
        if err != nil {
            // Let it be known we've failed
            ready <- err
            return
        }
        defer lis.Close()

        // Announce to the "parent" goroutine that we're ready, without errors
        ready <- nil
        ...
    }()

    // Here, we block and wait for our "child" goroutine to tell
    // us something, or at least die and close the channel.
    // You can extend this and e.g. use `select` if you want to implement a timeout
    err, ok := <-ready
    if err != nil {
        return fmt.Errorf("failed to start listener goroutine: %w", err)
    } else if !ok {
        return errors.New("listener goroutine died unexpectedly")
    }
If there are multiple tasks and the case is not complex, then use sync.WaitGroup to count the children and a shared variable to convey the error status.
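
E.g. something along these lines (a sketch; golang.org/x/sync/errgroup does the same bookkeeping for you):

    package main

    import (
        "errors"
        "fmt"
        "sync"
    )

    // runAll launches one goroutine per task, waits for all of them, and
    // reports the first error seen via a mutex-guarded shared variable.
    func runAll(tasks []func() error) error {
        var (
            wg       sync.WaitGroup
            mu       sync.Mutex
            firstErr error
        )
        for _, task := range tasks {
            task := task
            wg.Add(1)
            go func() {
                defer wg.Done()
                if err := task(); err != nil {
                    mu.Lock()
                    if firstErr == nil {
                        firstErr = err
                    }
                    mu.Unlock()
                }
            }()
        }
        wg.Wait()
        return firstErr
    }

    func main() {
        err := runAll([]func() error{
            func() error { return nil },
            func() error { return errors.New("second task failed") },
        })
        fmt.Println(err)
    }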

Multiple languages have channels, I’m not sure why people only discuss this in terms of Go.

.NET and Rust are the first two I can name off the top of my head.


True. Go just puts them in a spotlight, recommending them as the synchronization primitive for many use cases.

Very true. Hell, even Python has them in a library.

Crystal's channel implementation is great

It pisses me off so much that Go doesn't have a built-in way to wait for a goroutine to complete, similar to how every other language I've used allows joining threads. Would avoid all this, solve some extra problems to boot, and be much less error-prone.

> It's a weird example, because it moves listener creation elsewhere, which is not always possible (e.g. consider if the goroutine were spawning an external program instead of setting up its own listener).

Just addressing the case of spawning an external program: if you have control of the external program you're spawning, it can be better to pass the listening socket in using the systemd $LISTEN_PID protocol. It's easy to implement, language independent, and you get systemd integration for free.

See also: https://blog.williammanley.net/2014/01/26/Improve-integratio...
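
A rough sketch of the receiving side, for anyone curious (the env-var names come from the systemd protocol: $LISTEN_PID, $LISTEN_FDS, and passed fds start at 3; error handling is minimal and multiple fds aren't handled):

    package main

    import (
        "fmt"
        "net"
        "os"
        "strconv"
    )

    // listenerFromSystemd returns the already-bound listening socket that
    // systemd passed us as fd 3, or an error if we weren't socket-activated.
    func listenerFromSystemd() (net.Listener, error) {
        if os.Getenv("LISTEN_PID") != strconv.Itoa(os.Getpid()) {
            return nil, fmt.Errorf("not socket-activated for this pid")
        }
        if n, err := strconv.Atoi(os.Getenv("LISTEN_FDS")); err != nil || n < 1 {
            return nil, fmt.Errorf("no file descriptors passed")
        }
        f := os.NewFile(3, "listener") // SD_LISTEN_FDS_START == 3
        defer f.Close()                // net.FileListener dups the fd
        return net.FileListener(f)
    }

    func main() {
        ln, err := listenerFromSystemd()
        if err != nil {
            panic(err)
        }
        defer ln.Close()
        fmt.Println("serving on", ln.Addr())
        // ... hand ln to http.Serve or your own accept loop
    }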


idk, I am new to Go, but same thing: a TLS issue. My tests were very flaky; sometimes they failed, sometimes they didn't. I initially thought it was a WSL-and-GoLand issue, but after it happened 2-3 times I ran the goroutine test with -count 50. It didn't reproduce in GoLand, but when I ran it in the terminal I got to see that yup, my tests really are flaky.

I know this is not the best way, but I added a timeout of 10sec, and while asserting I poll to see whether I got a connection.

I understand I can make it way better, but for now it's good enough. More importantly, it's better not to scatter these kinds of hacks everywhere. Also, making things explicit is always better: now server.Start is blocking, and in my tests I spawn the goroutine myself rather than it happening implicitly.

Another thing: test dependencies. I am going to test a concurrent handler, and it's not dependent on TLS or TCP, so I can just use a fake and test the concurrency. I like this design. Scattering these throughout would make every other test an integration test, increasing execution time, and that is bad.

btw, it would have been better if Go ran tests in parallel.


[flagged]

Conversely, it makes me trust the author's expertise deeply. Our industry would be nowhere without furries.

I don't know about the industry as a whole, but at least the (early) Rust community evangelism seemed to have a lot of them involved.

Yeah it's weird, but the author consistently produces well-polished and effective documents. You see that sort of conversational technique used in other places to great effect, so the odd bit really is just the furry stickers part.

And I guess I feel that we should be accepting of a little weirdness as an audience, especially considering the quality? I don't have that idea all figured out but the two concepts feel related.


[flagged]

I somewhat agree with the sentiment, in the sense that I also don't enjoy tech in that way. But we're all different, and the fact that we enjoy things in different ways shouldn't mean that we can't listen to and learn from each other. The author is a very knowledgeable person that consistently writes really good articles.

As an aside, don't put everyone in the same bucket, it's really not great for e.g. trans people that don't want to be associated with that kind of tech enthusiasm. This is like the new variant of strict gender roles, not fun.


Why shouldn’t I assert my preferences?

In most societies it's considered impolite to barge into someone else's house and decry the decor as tacky.

Whose house is this?

Xe's. And as a response to your previous comment, it's not preference so much as completely ignoring someone's personality and instead assuming one based on some stereotype. And I'm sorry if that upsets you but you can't "prefer" to ignore someone because of gender or personality; I thought humans already concluded that's not a Good Thing. Get over yourself.

To me this comment seems far more rude and aggressive than anything I said. Liberal tolerance in action?

Human empathy in action.

Empathy is being rude and mean?

I'm aware I was mean and maybe that wasn't right, so I'm genuinely sorry for that. My point stands anyway.

Anyone trying to guilt you with 'liberal tolerance' is not arguing in good faith.

I'm aware, but I'll take the criticism. Arguments are almost always better without being aggressive or mean. The liberal tolerance thing was obviously crap.

I should have stated this outright; I don't think you were being aggressive or mean at all. You very clearly laid out that they were not in fact establishing a preference, and genuinely 'getting over yourself' is the only course of action when people are genuinely feeling discomfort from seeing people unlike them.

That's what I mean, I don't think the criticism is at all valid.


Thanks.

The only real offputting thing there is arrogance but I don't think the author here is particularly arrogant either.

The "I see, thanks" part definitely is.

It's not so much off-putting as just covering up for the fact that the explanation above is pedagogically bad.

I had to open both snippets in two tabs so I could flip between them and spot all the actual differences, which is that 1) the listener creation was moved out of the go-routine, 2) the port was changed to auto, and 3) the sleep call was removed.

Making the server support multiple instances at the same time is a distraction, because you could do the same with the old code. The trick here is that creating the listening socket in the original thread means you can also use auto-port assignment and then fetch the port from the returned object.

This would be unnecessary if you queried for a free port first, so perhaps the point here is that an atomic port reserve-and-listen system also prevents race conditions over the port number.
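
For what it's worth, the reserve-and-listen trick is just asking the kernel for port 0 and reading back whatever it picked; a minimal sketch:

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // Port 0 asks the kernel to pick a free port; the bind is atomic, so
        // there's no window for another process to grab it in between.
        lis, err := net.Listen("tcp", "127.0.0.1:0")
        if err != nil {
            panic(err)
        }
        defer lis.Close()

        // The chosen port is read back from the listener itself, so a client
        // can only learn the address after the socket already exists.
        port := lis.Addr().(*net.TCPAddr).Port
        fmt.Println("listening on port", port)
    }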

These are all good angles that could be naturally woven into the explanation. Instead you get furries discussing moldy bread, which doesn't tell you anything, but I suppose makes both reader and writer feel like they got something out of it.

Personally, I feel like the post is cheating by suggesting "in order to synchronize a server and a client, just create the server inside the client!" ... ummm. If you could do that, you wouldn't have a synchronization problem in the first place.


[flagged]

"Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting."

"Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead."

"Eschew flamebait. Avoid generic tangents."

https://news.ycombinator.com/newsguidelines.html


This seems to be a persistent problem. Can you modify the site's code to autodead comments to my blog that contain the word "furry" or something? It is tiring to continue to see people that are NOT in my intended target audience complain that I'm not catering to their every need.

If you didn't make everything noreferrer I could just add a "no fun allowed" bit of JavaScript or server-side Rust code to replace the "objectionable images" with random emoji or pictures of innocuous food or something, but I can't do that.


I'm not sure what target audience you believe you have if "fan of weird furry anime" tech posts contain details that would also not be relevant or of interest to hacker news users.

It is possible to detect if the user came from HN: https://news.ycombinator.com/item?id=24429725


Not anymore, all links are noreferrer now.

I'm not sure it was the right call and am open to maybe putting it back the way it was. It's a point that is unusually unclear - something ends up sucking no matter what we do.

I don't think we'd implement technical solutions like autodead comments of class X on site Y. Those things are too brittle and trying to get around the community with technical tricks is asking for trouble anyhow.


Do I need to email you with more details about my usecase? I'm seriously considering having Hacker News referrers get some absolutely terrible Medium-clone CSS with absolutely no soul, just to stop the harassment that you are facilitating via moderation inaction.

Would some more weird anime furry images help?

The implementers of Google Spanner would like to have a word!

What happens when two 802.11 devices use the frequency at the same time? I was under the impression that they'd both re-transmit after an arbitrarily chosen pause. If that is true, time would be a de-synchronization primitive. Then it is surely also a sync primitive.

How do you arrange a date with your friend in the city? I usually say "let's meet at 17:00 at the main train station, under the clock".[0]

You have to know your counter-party's accuracy. But if you synchronize on any other metric, surely that could also be inaccurate? So how is time any different?

[0]: Note: in Danish: https://www.berlingske.dk/under-uret-uret-skiftes-ud


> If that is true, time would be a de-synchronization primitive

That's not desynchronization or synchronization. The synchronization is realizing they are both trying to connect at the same time.

> How do you arrange a date with your friend in the city? I usually say "let's meet at 17:00 at the main train station, under the clock".[0]

The conversation is synchronization here. Synchronization is unambiguous communication.


Synchronization is inherently temporal. The "chron" in the word even means time. Synchronization is unambiguous communication with respect to some time coordinates.

It's certainly possible to communicate in a way that appears to be unambiguous without explicitly referencing time, but the time is always implicit. Practically, all communication is otherwise ambiguous.


> Synchronization is inherently temporal.

Everything happens over time

> Synchronization is unambiguous communication with respect to some time coordinates.

No, it could mean ordering but that doesn't mean 'time coordinates'.

> It's certainly possible to communicate in a way that appears to be unambiguous without explicitly referencing time

That's because time isn't the most fundamental aspect, reading shared information and ordered writing are.

If cars need to merge on the highway into one lane, the space between them isn't the most important aspect, deciding on the order is. There could be a little or a lot of space in between and it still works.


Ordering often is temporal. "Two comes after one" implies that "one" is somehow before "two" but the word you probably mean is "greater than." In the case of cars, saying "the blue car merges first, and the green car second" implies a temporal ordering of events. That again is time. We implicitly know that the velocity of the car and anticipated changes in acceleration impact when the events will occur, but they are implicit. We inherently know that it's ridiculous for the green car to merge a nanosecond after the blue car, but we're assuming a velocity. Time is present everywhere that there is change. Synchronization tends to involve changes.

Again, just because time is 'present' doesn't mean that 'time is a synchronization primitive'.

Programs execute over time, but synchronization is not fundamentally about when things execute, it's about the order that they execute. The order is what is important, not specific time coordinates or specific timing.

Just like cars merging, where they go in absolute space is not the fundamental concern, it is where they are relative to each other so they don't overlap.

Synchronization is about serializing events if they have overlap in the data they affect.


Ordering is a subset of temporal though. That is, ordering implies there must be a relative time between events. Absolute time doesn't have to matter, nor does time between events. It's still temporal.

> Ordering is a subset of temporal though. That is, ordering implies there must be a relative time between events. Absolute time doesn't have to matter, nor does time between events. It's still temporal.

It's only temporal because all execution is temporal.

The fundamentals of what you are trying to achieve aren't when something executes, it is the order.

If you try to execute two different things at two different times, your synchronization could go wrong, because that isn't really what you want; you want a guarantee of the order.

If you try to synchronize by guaranteeing the order that operations execute it will work regardless of what time something executes.

Ordering is what you are actually trying to achieve; timing is ancillary. That's why the title is 'time is not a synchronization primitive'.


All this means is that time exists. No one tried to say time doesn't exist. They said it was the wrong tool for the job.

Time is a side effect or consequence of the ordering of events, it does not cause, enforce, or predict the ordering.

One thing depends on some other thing. It does not depend on the passage of any particular amount of time.


I think existing programming languages lack a general-purpose listening mechanism for state changes or events on objects in memory. People write their own ad-hoc observer pattern or hook pattern. I think people are forced into polling because of the lack of such a mechanism. Webhooks are an example of a solution as an integration pattern between separate services, but I'm talking about within a process, page or program.

I am inspired by Smalltalk but with a twist: it is the pattern and sequence of interactions (such as events, states or messages) between MULTIPLE objects - all interacting - that are interesting.

Golang's and occam's select operator is powerful for switching between events from different sources.

My idea: I want to be able to efficiently epoll_wait/uring a sequence of steps and state changes between multiple objects.

The idea is complex event processing for POJOs or objects and states.

Epoll for general-purpose objects has use cases in efficient and scalable change detection in microservices, dirty-checking in graphical user interfaces, job systems and workflow systems.

If you have a pull or push driven system and you want to adapt it, your hand might be forced into a substandard solution to add reactivity to an event or state or event sequence.

In integration testing, I see this problem. You can fire events in JavaScript that your test runner can listen to; this is one solution to this problem, but you have to build it yourself. You could hook into Redux state management. I have only created APIs for pages to be interrogated by Selenium. But what I feel you want is state communication that isn't dependent on timing but on efficient notification. (I don't want to use Selenium's ExpectedConditions waits, which is what I generally used to fix tests; there's a far more elegant solution.)
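
In plain Go, one way to get a general-purpose "block until this in-memory value changes" primitive without polling is a channel that gets closed and replaced on every write; this is only a sketch of the idea, not a library recommendation:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Watchable holds a value plus a channel that is closed (and replaced)
    // on every Set, so any number of waiters can block on "the next change"
    // without sleeping or polling.
    type Watchable[T any] struct {
        mu      sync.Mutex
        val     T
        changed chan struct{}
    }

    func NewWatchable[T any](v T) *Watchable[T] {
        return &Watchable[T]{val: v, changed: make(chan struct{})}
    }

    func (w *Watchable[T]) Set(v T) {
        w.mu.Lock()
        w.val = v
        close(w.changed)                // wake everyone currently waiting
        w.changed = make(chan struct{}) // fresh channel for the next change
        w.mu.Unlock()
    }

    // Get returns the current value and a channel that will be closed the
    // next time the value changes.
    func (w *Watchable[T]) Get() (T, <-chan struct{}) {
        w.mu.Lock()
        defer w.mu.Unlock()
        return w.val, w.changed
    }

    func main() {
        state := NewWatchable("starting")
        go func() {
            time.Sleep(100 * time.Millisecond)
            state.Set("ready")
        }()
        for {
            v, ch := state.Get()
            if v == "ready" {
                fmt.Println("observed:", v)
                return
            }
            <-ch // block until the next change; no sleep involved
        }
    }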


What do you need more complex than a pipe and your favourite unix fd multiplexer?

Imagine you want to wait for events B(key=value) and C(key=value) all caused by common cause A as a single atomic event.

You can have a descriptor that represents the combination of all that state.

https://en.wikipedia.org/wiki/Complex_event_processing

I think it would be helpful for microservices and graphical user interfaces for complicated user behaviour.


A neat special case of "hope is not a strategy" from SRE-land!

[flagged]

Please don't cross into personal attack. If you know more than someone else, that's great, but then please share some of what you know so the rest of us can learn. If you don't want to do that or don't have time, another fine option is not to post anything—but swipes and putdowns only poison the ecosystem, so please don't do those.

https://news.ycombinator.com/newsguidelines.html


RCU would like to disagree.

RCU uses epochs for synchronization. That's very different than winging it and waiting an arbitrary amount of time.

In fact RCU exists exactly because the previously common method of waiting an arbitrary time before deleting a potentially shared reference and hoping for the best was no longer considered acceptable.


That's still time as a synchronization primitive.

TrueTime, in Google Spanner, uses atomic clocks.

[flagged]

It would've been nice if the author had described the change that they made instead of just writing "consider the following code".

They do say "you generally want to use something that synchronizes that state", but then fail to describe exactly which method they selected in this particular example. A single sentence explaining the particular change would've been sufficient.

This is a little thing, but it would greatly improve the communicability of the post and show greater empathy for the reader.


"synchronize" literally has the word "time" in it as a greek root (?????/chrónos).

Time is fine for synchronizing when all components are actually reading and acting in accordance with a clock, and agree on using the same one (or a well-characterized facsimile of the same one).

Time is used at the bottom layers of hardware. Certain components generate a signal, and expect the response signal to be settled within a certain time.

Even in situations in which there is robust signaling (like a device can indicate for indefinite periods that it is "not ready" and another device will wait), the underlying signals on which that is based have to meet timing constraints.

You have rules like "when pin X is driven high, the data is expected to be present on lines Y_0 through Y_32 within time T_x." If not, there will be garbage.

All notion of order in the machine rests on arbitrary timing at the lowest electronic design level.


> "synchronize" literally has the word "time" in it as a greek root (?????/chrónos).

Oh, yeah, the simple times before everybody learned that time is a local and observer-dependent measure.

Modern computers do almost no time-based at-distance synchronization. About two decades ago, when I was studying asynchronous computation, hacking the chip-design process to make it possible to do time-based synchronization of nearby transistors on the same silicon crystal was a very trendy idea. It was very hard, because everything is extremely noisy, and I imagine the idea faded away, because everything is much more noisy nowadays. Doing that on any large scale is asking for everything to break.

For all of the computer era, synchronization has been done by signaling. The millennia-old technique of doing it by time is just deprecated.


How does signaling work without time?

I'm having a really hard time understanding your question.

If it's on the line of "signals only exist because we have time", then yeah, duh; and it's such an outrageous misconstruction of my comment that I imagine you meant something else.


I don't think they understand their question either.

The question "How does signaling work without time?" makes complete sense to me. I understand this question. Do you not?

The question is actually rhetorical, implying that "signalling" at the chip level involves time, namely the clock signal that fires periodically. It is not so much signalling as an assumption baked into the circuit design. The circuit must change its state from one state to another only at the next clock signal and not before.


Well there you go. Physical events are not in fact limited to occurring at clock transitions. But even if they did, it still doesn't make any particular amount of time into a correct stand-in for a serial dependency. The passage of 150ms of time is not what makes thing B clear to proceed. The only thing I don't understand is how anyone can not understand that.

The point is that at the hardware level, arbitrary timings are in fact used. The same logic as

   usleep(100000);  // give other thread enough time to do B
is actually correct at the hardware level. Devices are expected to produce a stable output within some arbitrary time, that they meet. If they don't, a garbage value is blindly read anyway: there is no actual cause-and-effect synchronization other than adherence to the timing diagram's requirements.

That usleep is wrong. Sure they exist all over the place, but they are wrong. They are merely easy and "good enough". The other thread has not finished its job just because 100ms passed. The other thread does something; whatever the other thread does is the definition of done and ready, not usleep(100000). sleep is crossing your fingers and hoping that the other thread did actually finish doing whatever it exists to do by then.

The fact that digital circuits operate by different parts all being synced to clock pulses, and some parts even count the pulses such that the 7th bit in a byte is picked out strictly by the timing of 7 pulses, is pointless pedantry when the article is operating in an entirely different domain which does not work that way. There, cause and effect apply, and the definition of a service being ready is when the service is ready, not any particular passage of time. The service might actually be ready before you even asked, or might still not be ready in an hour, or might have come up fully and gone back down all within the sleep.

Today the service takes 100ms to start and you have a 150ms sleep. Tomorrow the same service runs on faster hardware and takes 1ms to start, and now has its own wrong timing assumption: when it starts, it shuts back down if there are no clients (a stupid thing for most services to do, but people do stupid things all the time), so it starts up and sleeps 50ms "to give time for the client". The 150ms-sleeping client doesn't issue any requests, and the service shuts back down.

You have to be writing the firmware to a spinning disk to talk about time being the actual definition of anything.


> Physical events are not in fact limited to occurring at clock transitions.

You seem to be talking about "events". We are not talking about events. We are talking about the clock component of CPU chips that keeps emitting a periodical signal.

> The passage of 150ms of time is not what makes thing B clear to proceed.

It absolutely does. Except that in modern chips the time interval is in nanoseconds. Two decades ago, it used to be in microseconds.

Not trying to be snarky but a genuine question: Do you have any background in chip design and fab? What @kazinator and I are saying is not exactly surprising. These things are taught in undergrad or grad-level integrated circuit courses.

If the circuitry didn't wait those nanoseconds before the next state change, it would be making the state change based on intermediate and possibly garbage values which would make the chip behave in a non-deterministic manner.

Again, we are not talking about events or cause-and-effect. We are talking about something that is more fundamental and is baked into the design of CPU chips at the very lowest level.

See also - https://news.ycombinator.com/item?id=36476621


What do you mean by signaling? What happens under signaling?

With signaling, something locally determined tells when your synchronized nodes need to stop or go, and makes them do that by sending a message.

This is what happens when your connection library starts a call-back after it connects; or blocks until the connection is ready; or use some continuation scheme, like futures to reorder your program and only start the next line of code after the connection is ready.

The alternative is what the article does. AFAIK, you can only do it in Javascript, because it's so obviously bad that no other language supports it.

For my example in circuitry, signal based synchronization uses a clock, or a ready signal. Time based synchronization just uses circuits that are ready to read data exactly when the previous step is done calculating it.


A clock is not really a ready signal though. It is just a regularly cycling signal. It is not generated when a component is done with work; instead, components are implemented in such a way as to be guaranteed to be done within one clock cycle.

> For my example in circuitry, signal based synchronization uses a clock, or a ready signal.

But clock is not a ready signal. The clock is just a periodical signal that keeps firing. If you squint your eyes, it is almost like an event loop performing a time.sleep(ns) before deciding to check the state. In the case of chips, the circuitry performs the next state transition after the very short delay decided by the periodical clock signal. The assumption that every pin would acquire a stable binary state before the next clock tick is a fundamental part of a chip design. Without that we'll have non-deterministic and ill-defined chip behavior.


> Oh, yeah, the simple times before everybody learned that time is a local and observer-dependent measure. Modern computers do almost no time-based at-distance synchronization.

This is a very odd response. Nothing you write here directly contradicts the parent comment and yet you seem to imply that you have a point that is the opposite of the parent comment.

You seem to be going off on a tangent without addressing the main point that chips do depend on a clock firing periodically and that clock interval determining how long the circuitry waits for its pins to achieve a stable binary state before the next state transition. That was the original point of the parent comment and that point is still valid.


Some thoughts:

- The general point is correct -- time is not a synchronization mechanism

- Flakey tests are bad for all the reasons listed

- Shoddy synchronization makes for flakey tests

The example chosen -- waiting for a test server to be available -- is probably a huge source of these issues. However, I don't think the example really demonstrates how to do synchronization.

My core objection is that it's really relying on the operating system to take care of the synchronization, and that winds up hiding it entirely:

- the dial will not happen until the operating system has completed the `listen` call

- the operating system is selecting a random port at the time of the `listen` call, and reports the chosen port to the user

This is the correct way to approach this problem, but it doesn't really show a proper synchronization. It shows the operating system synchronizing on behalf of the program. It's also an example that probably only shows up in test -- client and server in the same process space.

In practice, many clients and servers may start at the same time in different processes or on different hosts, meaning that the client has no signal at all that the server has become ready. The best thing that can really be done is quickly polling with low timeouts until the server reports healthy.

I'd have liked to see use of a `sync.Cond`, or channel to signal that the operation has completed, and the waiting goroutine being safe to continue. It wouldn't fit well in this example, because, well, the operating system provides those facilities.

Overall, because it does come up a lot with the example used, I'm happy there is a blog post that addresses it. Hopefully I'll see less of the problem in the world. I suppose I hope the author writes a part 2, showing other ways sleeps creep in and how to remove them.
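
For the "poll quickly with low timeouts until the server reports healthy" case mentioned above, a sketch (the URL, intervals and deadline are made up):

    package main

    import (
        "context"
        "fmt"
        "net/http"
        "time"
    )

    // waitHealthy polls url with a short per-request timeout until it returns
    // 200 or the overall deadline carried by ctx expires.
    func waitHealthy(ctx context.Context, url string, interval time.Duration) error {
        client := &http.Client{Timeout: 500 * time.Millisecond}
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            resp, err := client.Get(url)
            if err == nil {
                resp.Body.Close()
                if resp.StatusCode == http.StatusOK {
                    return nil
                }
            }
            select {
            case <-ctx.Done():
                return fmt.Errorf("server never became healthy: %w", ctx.Err())
            case <-ticker.C:
                // try again on the next tick
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := waitHealthy(ctx, "http://127.0.0.1:8080/healthz", 100*time.Millisecond); err != nil {
            panic(err)
        }
    }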


on the flip side i've sometimes wondered how useful a syscall time manipulator that injects delays in order to explore the space of concurrent execution sequences would be.

either something systematic or a just a simple stochastic scheduler.


Likewise, if you think you need a sync(), you're wrong.

idk how common that is these days, but I saw it all over, used basically blindly to "fix" all manner of mystery reliability problems, and not one of those countless examples was ever actually the problem or the fix. It was always simply failing to close a file before trying to do something else with it from some other context, like a system() command. It was used mystic-dead-chicken style, even worse than sticking sleeps everywhere for no actually understood reason.

