I'm very excited about the possibilities of Loom. Would love to have a more realistic sample with Spring Boot that would demonstrate the real world scale. I saw a few but nothing remotely as ambitious as that.
In theory once Graal adds support for it, any Graal/Truffle-compatible language can benefit.
IMHO it's only JVM+Graal that can bring this to other languages. Loom relies very heavily on some fairly unique aspects of the Java ecosystem (Go has these things too though). One is that lots of important bits of code are implemented in pure Java, like the IO and SSL stacks. Most languages rely heavily on FFI to C libraries. That's especially true of dynamic scripting languages but is also true of things like Rust. The Java world has more of a culture of writing their own implementations of things.
For the Loom approach to work you need:
a. Very tight and difficult integration between the compiler, threading subsystem and garbage collector.
b. The compiler/runtime to control all code being used. The moment you cross the FFI into code generated by another compiler (i.e. a native library) you have to pin the thread and the scalability degrades or is lost completely.
But! Graal has a trick up its sleeve. It can JIT compile lots of languages, and those languages can call into each other without a classical FFI. Instead the compiler sees both call site and destination site, and can inline them together to optimize as one. Moreover those languages include binary languages like LLVM bitcode and WASM. In turn that means that e.g. Python calling into a C extension can still work, because the C extension will be compiled to LLVM bitcode and then the JVM will take over from there. So there's one compiler for the entire process, even when mixing code from multiple languages. That's what Loom needs.
At least in theory. Perhaps pron will contradict me here because I have a feeling Loom also needs the invariant that there are no pointers into the stack. True for most languages but not once C gets involved. I don't know to what extent you could "fix" C programs at the compiler level to respect that invariant, even if you have LLVM bitcode. But at least the one-compiler aspect is not getting in the way.
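To make the "one compiler, no classical FFI" point concrete, here's a minimal sketch using the GraalVM polyglot API with JavaScript as the guest language (rather than LLVM bitcode); it assumes you're running on a GraalVM JDK with the JS language installed:

    import org.graalvm.polyglot.Context;
    import org.graalvm.polyglot.Value;

    public class PolyglotSketch {
        public static void main(String[] args) {
            // One runtime and one JIT for both the host (Java) and guest (JS) code;
            // the call below isn't a classical FFI, so Graal can see and inline across it.
            try (Context context = Context.create("js")) {
                Value twice = context.eval("js", "(n) => n * 2");
                System.out.println(twice.execute(21).asInt()); // prints 42
            }
        }
    }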
With Truffle you have to map your language's semantics to Java ones. I'm unfortunately out of my depth on the details, but my guess would be that LLVM is handled here with this in mind in a completely safe way (I'd guess pointers into the stack are not safe), so presumably it should work for these as well.
Not exactly, no. That's the whole point of Truffle and why it's such a big leap forward. You do not map your language's semantics to Java semantics. You can implement them on top of the JVM but bypassing Java bytecode. Your language doesn't even have to be garbage collected, and LLVM bitcode isn't (unless you use the enterprise version which adds support for automatically converting C/C++ to memory safe GCd code!).
So - C code running on the JVM via Sulong keeps C/C++ semantics. That probably means you can build pointers into the stack, and then I don't know what Loom would do. Right now they aren't integrated so I guess that's a research question.
Perhaps I wasn’t clear. I do know that Truffle works by writing an AST interpreter for another language, but to achieve the best performance you have to map/reuse existing java constructs. E.g. I have read that perhaps Ruby uses java exceptions in a not too idiomatic way, but this is what Graal can later optimize to very good code.
My (way out of my depth) guess with Sulong is that it uses small heap-allocated regions for every manual memory allocation (it even has a Managed mode in Enterprise).
You use Java constructs to implement the interpreter, but that doesn't mean the language itself has to be mapped to Java constructs, any more than writing an interpreter in C means your language has to be mapped to C semantics.
Sulong uses a standard C-style heap in the open source version. In EE they (can) trap malloc/free and re-point it towards the GCd heap. They also do bounds checking on pointer de-references. It's actually amazingly cool but unfortunately, EE is expensive enough in dollar terms that it gets ignored. I don't know of anything that uses it for real.
Actually it's every client IP+port / server IP+port pair. Linux uses ports 32768-60999 for ephemeral ports (roughly 28k), so it can support 28e3^2 = 784 million connections per IP pair.
Except your service is almost certainly listening on one non-ephemeral port.
But having "only" tens of thousands of connections per client is rarely a problem in practice, apart from some load testing scenarios (such as the experiment here, where they opened a number of ports so they could test a large number of connections with a single client machine).
I imagine that's the limit per client IP address [for a single server port], no? The Linux kernel can use multiple pieces of information to track connections: client IP address, client port, server IP address, server port.
Cloudflare has some interesting blog posts on this topic.
Having run production services that had over 250,000 sockets connecting to a single server port, I'm calling "nope" on that.
Are you thinking of the ephemeral port limit? That's on the client side; not the server side. Each TCP socket pair is a four-tuple of [server IP, server port, client IP, client port]; the uniqueness comes from the client IP/port part in the server case.
You don't really need 77 IP addresses (the 64k limit for TCP is per client IP, per source port, per server IP) but even if you did, your average IPv6 server will have a few billion available. Every client can connect to a server IP of their own if you ignore the practical limits of the network acceleration and driver stack. If you're somehow dealing with this scale, I doubt you'll be stuck with pure legacy IP addressing.
The real problem with such a setup is that you're not left with a whole lot of bandwidth per connection, even if you ignore things like packet loss and retransmits mucking up the connections. Most VPS servers have a 1 Gbps connection; with 5 million clients that leaves about 200 bits (25 bytes) per second per connection for TCP signaling and data to flow through. You'll need a ridiculous network card for a single server to deal with such a load, in the terabits per second range.
While I can't answer the question directly, there is an article about C#'s async/await vs Go's goroutines which compares the two approaches, and while some of the findings are probably stack-specific, a lot of it is probably intrinsic to the approach:
- Green threads scale somewhat better, but both scale ridiculously well, meaning probably you won't run into scaling issues.
- async/await generators use way less memory than a dedicated green thread; this affects both memory consumption and startup time, since with green threads the process has to run around asking the OS for more memory
Loom will make a great backend for Kotlin's coroutines. Roman Elizarov (Kotlin language lead and the person behind Kotlin's coroutine framework) has already confirmed that this will happen, and it makes a lot of sense.
For those who don't understand this, Kotlin's coroutine framework is designed to be language neutral and already works on top of the major platforms that have Kotlin compilers (native, JavaScript, JVM, and soon WASM). So it doesn't really compete with the "native" way of doing concurrent, asynchronous, or parallel computing on any of those platforms; it simply abstracts the underlying functionality.
It's actually a multiplatform library that implements all the platform-specific aspects in the platform-appropriate way. It's also very easy to adapt existing frameworks in this space via Kotlin extension functions, and the JVM implementation ships out of the box with such functions for most common solutions on the JVM (Java's threads, futures, thread pools, etc., as well as Spring Flux, RxJava, Vert.x, and so on). Loom will be just another solution in this long list.
If you use Spring Boot with Kotlin for example, rather than dealing with Spring's Flux, you simply define your asynchronous resources as suspend functions. Spring does the rest.
With Kotlin-js in a browser you can call Promise.toCoroutine() and async { ... }.asPromise(). That makes it really easy to write asynchronous event handling in a web application, for example, or to work with JavaScript APIs that expect promises from Kotlin. And if you use web-compose, fritz2, or even React with kotlin-js, anything asynchronous you'd likely be dealing with via some kind of coroutine and suspend functions.
Once Loom ships, it will basically enable some nice, low-level optimizations in the JVM implementation of coroutines, and there will likely be some new extension functions to adapt the various new Java APIs for this. Not a big deal, but it will probably be nice for situations with extremely large numbers of coroutines and IO. Not that it's particularly struggling there of course, but every little bit helps. It's not likely to require any code updates either. When the time comes, simply update your JVM and coroutine library and you should be good to go.
I won't repeat it all, but the main point is that having runtime support is much better than relying on compiler support, even if compiler support is pretty fantastic.
Note that the two aren't mutually exclusive; you should still be able to use coroutines after Project Loom ships, and they still might make sense in many places.
It's almost a little disappointing that beefy modern servers only manage a 5x scale improvement, though that could be due to the differences in runtime behaviour between Erlang and the JVM.
I mean... is 5M very impressive? Not really. Does it show that Project Loom meets the goal of being able to do large client count thread per server workloads? I think so. Does the name remind me of a best selling point and click adventure game? Definitely yes.
The experiment is about a Java app, but the tweaks are at the OS level. Does that mean any app (Java or not, Loom or not) can achieve the target given the correct tweaks?
Also, why are these not the defaults for the OS? What are we compromising by setting those values?
There's always trade-offs. It would be very rare for any server to reach even 100K concurrent connections, let alone 5M. Optimising for that would be optimising for the 0.000001% case at the expense of the common case.
If the server had a 100 Gbps Ethernet NIC, this would leave just 20 kbps for each TCP connection.
I could imagine some IoT scenarios where this might be a useful thing, but outside of that? I doubt there's anyone that wants 20 kbps throughput in this day and age...
It's a good stress test however to squeeze out inefficiencies, super-linear scaling issues, etc...
20kbps should be sufficient for things like chat apps if you have the CPU power to actually process chat messages like that. Modern apps also require attachments and those will require more bandwidth, but for the core messaging infrastructure without backfilling a message history I think 20kbps should be sufficient. Chat apps are bursty, after all, leaving you with more than just the average connection speed in practice.
I have a memory of some chat site, maybe Discord, sending attachments to a different server, thus trading the bandwidth problem for extra system complexity.
That's how I'd solve the problem. The added complexity isn't even that high: give the application an endpoint to push an attachment into a distributed object store of your choice, submit a message with a reference to the object, and persist it the moment the chat message was sent. This could be done with mere bytes for the message itself and some very dumb anycast-to-s3 services in different data centers.
I'm sure I'm skipping over tons of complexity here (HTTP keepalives binding clients to a single attachment host for example) because I'm no chat app developer, but the theoretical complexity is still relatively low.
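To sketch what I mean (the ObjectStore/ChatService interfaces here are hypothetical placeholders, not any real API):

    // Hypothetical interfaces, just to illustrate the two-step flow described above.
    interface ObjectStore {
        String put(byte[] bytes);                 // returns an object key / URL
    }

    interface ChatService {
        void send(String roomId, String text, String attachmentKey);
    }

    class AttachmentFlow {
        private final ObjectStore store;
        private final ChatService chat;

        AttachmentFlow(ObjectStore store, ChatService chat) {
            this.store = store;
            this.chat = chat;
        }

        void sendWithAttachment(String roomId, String text, byte[] attachment) {
            String key = store.put(attachment); // 1. push the large blob out of band
            chat.send(roomId, text, key);       // 2. the chat message itself stays tiny
        }
    }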
No, it doesn't. The reason the tweaks are at the OS level is because, apparently, Loom-enabled JVMs already scale up to that level without needing any tuning. But if you try that in C++ you're going to die very quickly.
Sure, I wrote some myself. The question is what libraries you can use on top of the userspace thread package that are aware of the userspace threads, rather than just using OS APIs and thus e.g. blocking the current OS thread.
Both your operating system and your application environment need to be up to the task. I'd expect most operating systems to be up to it, although they might need some settings tuned. Some of the settings control things that are statically allocated in non-swappable memory, and you don't want to waste memory on being able to have 5M sockets open if you never go over 10k. Often you'll want to reduce socket buffers from the defaults, which will reduce throughput per socket, but the target throughput per socket is likely low or you wouldn't want to cram so many connections per client. You may need to increase the size of the connection table and the hash used for it as well; again, it wastes non-swappable RAM to have it too big if you won't use it.
For application level, it's going to depend on how you handle concurrency. This post is interesting, because it's a benchmark of a different way to do it in Java. You could probably do 5M connections in regular Java through some explicit event loop structure; but with the Loom preview, you can do it connection per Thread. You would be unlikely to do it with connection per Thread without Loom, since Linux threads are very unlikely to scale so high (but I'd be happy to read a report showing 5M Linux threads)
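For reference, a minimal sketch of the connection-per-(virtual-)thread structure (not the article's actual code; assumes a Loom-enabled JDK):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class EchoServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(8080)) {
                while (true) {
                    Socket socket = server.accept();
                    // One virtual thread per connection; plain blocking reads/writes are fine.
                    Thread.startVirtualThread(() -> handle(socket));
                }
            }
        }

        static void handle(Socket socket) {
            try (socket) {
                socket.getInputStream().transferTo(socket.getOutputStream()); // echo until close
            } catch (IOException ignored) {
            }
        }
    }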
Go look at the source code. Look at how simple it is - anyone who has created a thread with Java knows what's happening. With only minor tweaks, this means your pre-existing code can take advantage of this with, basically, no effort. And it retains all the debuggability of traditional Java threads (i.e. a stack trace that makes sense!).
If you've spent any time at all dealing with the horrors of C# async/await (Why am I here? Oh, no idea) and its doubling of your APIs to support function colouring - or you've fought with the complexities of reactive solutions in the Java space -- often, frankly, in the name of "scalability" that will never be practically required -- this is a big deal.
But in the real world it is common to need information from the authentication stage to use in the authorization stage. For example you may have a user login with an email address/password which you then pass to an LDAP server in order to get a userId. This userId is then used in a database to determine which objects/groups they have access to.
A lower-latency design would be for the authorization service to be able to work with either identifier. That way those requests could be done in parallel to reduce latency.
Loom sets out to give you a sane programming paradigm similar to what threads do (i.e. as opposed to programming asynchronous I/O in Java with some type of callback) without the overhead of Operating System threads.
That's a very cool and a noble pursuit. But the title of this article might as well have been "5M persistent connections with Linux" because that's where the magic 5M connections happen.
I could also attempt 5M connections at the Java level using Netty and asynchronous IO - no threads or Loom. Again, it'd take more Linux configuration than anything else. If that configuration is in place, though, you can also do it with C# async/await, JavaScript, and I'm sure Erlang and anything else that does asynchronous I/O, whether or not it's masked by something like Loom or async/await.
It is true that the experiment exercises the OS, but that's only part of the point. The other part is that it uses a simple, blocking, thread-per-request model with Java 1.0 networking APIs. So this is "achieving 5M persistent connections with (essentially) 26-year-old code that's fully debuggable and observable by the platform." This stresses both the OS and the Java runtime.
So while you could achieve 5M in other ways, those ways would not only be more complex, but also not really observable/debuggable by Java platform tools.
Writing the sort of applications that I get involved with, it's frequently the case that whilst one OS thread per Java thread was a theoretical scalability limitation, in practice we were never likely to hit it (and there was always 'get a bigger computer').
But: the complexity mavens inside our company, and the projects we rely upon, get bitten by an obsessive need to chase 'scalability' /at all costs/. Which is fine, but the downside is that the negative consequences of coloured functions come into play. We end up suffering, having to deal with Vert.x or Kotlin or whatever the flavour-of-the-month solution is that is /inherently/ harder to reason about than a linear piece of code. If you're in a C# project, then you get a library that's async, and boom, game over.
If Loom gets even within performance shouting distance of those other models, it ought to kill (for all but the edgiest of edge-cases) reactive programming in the Java space dead. You might be able to make a case - obviously depending on your use cases, which are not mine - that extracting, say, 50% more scalability is worth the downsides. If that number is, say, 5%, then for the vast majority of projects the answer is going to be 'no'.
I say 'ought to', as I fear the adage that "developers love complexity the way moths love flames - and often with the same results". I see both engineers and projects (Hibernate and keycloak, IIRC) have a great deal of themselves invested in their Rx position, and I already sense that they're not going to give it up without a fight.
So: the headline number is less important than "for virtually everyone you will no longer have to trade simplicity for scalability". I can't wait!
1. Demanding scalability for inappropriate projects and at any cost is something I've seen too, and on investigation it was usually related to former battle scars. A software system that stops scaling at the wrong time can be horrific for the business. Some of them never recover, the canonical example being MySpace, but I've heard of other examples that were less public. In finance entire multi-year IT projects by huge teams have failed and had to be scrapped because they didn't scale to even current business needs, let alone future needs. Emergency projects to make something "scale" because new customers have been on-boarded, or business requirements changed, are the sort of thing nobody wants to get caught up in. Over time these people graduate into senior management where they become architects who react to those bad experiences by insisting on making scalability a checkbox to tick.
Of course there's also trying to make easy projects more challenging, resume-driven development etc too. It's not just that. But that's one way it can happen.
2. Rx type models aren't just about the cost of threads. An abstraction over a stream of events is useful in many contexts, for example, single-threaded GUIs.
One additional point - as noted, it's been 26 years since Java's founding. Project Loom has been around since at least 2018 and still has no release date. It'll be cool for Java projects whenever it comes out, but I just... have a hard time caring right now. I can't use it for old codebases currently, and for new codebases I'm not using one request per Java thread anyway (tbh, when it's my choice I'm not choosing the JVM at all).
The point I was making is that Loom isn't released, stable, production ready, supported, etc., and there's still no date when it's supposed to be, so what you can do with Loom in no way affects what I can do with a production codebase, either new or legacy. I'm not sure how you missed that from my post.
I'm not defending reactive programming on the JVM. I'm also not defending threads as units of concurrency. I'm saying I can get the benefits of Project Loom -right now-, in production ready languages/libraries, outside of the JVM, and I can't reasonably pick Project Loom if I want something stable and supported by its creators.
> and there's no still no date when it's supposed to be
September 20 (in Preview)
> I'm saying I can get the benefits of Project Loom -right now-, in production ready languages/libraries, outside of the JVM
Only sort-of. The only languages offering something similar in terms of programming model are Erlang (/Elixir) and Go — both inspired virtual threads. But Erlang doesn't offer similar performance, and Go doesn't offer similar observability. Neither offers the same popularity.
I'm not saying there aren't tradeoffs, just that if I need the benefits of virtual threads... I have other options. I'm all for this landing on the JVM, mainly so that non-Java languages there can take advantage of it rather than the hoops they currently have to jump through to offer a saner concurrency model, but until it does... don't care. And last I saw, this feature is proposed to land in preview in JDK 19; not that it necessarily will, and... it's still preview. Meaning the soonest we can expect to see this safely available to production code is next year (preview in Java is a bit weird, admittedly. "This is not experimental but we can change any part of it or remove it in future versions depending on how things go" was basically my take on it when I looked in the past).
Meanwhile, as you say, Erlang/Elixir gives me this model with 35+ years of history behind it (and no libraries/frameworks in use trying to provide me a leaky abstraction of something 'better'), better observability than the JVM, a safer memory model for concurrent code, and a better model for reliability, with the main issue being the CPU hit (less of a concern for IO-bound workloads, which is where this kind of concurrency is generally impactful anyway). Go has reduced observability compared to Java, sure, but a number of other tradeoffs I personally prefer (not least because in most of the Java shops I was in, I was the one most familiar with profiling and debugging Java. The tools are there, the experience amongst the average Java developer isn't), and will also be releasing twice between now and next year.
Again, I'm not saying virtual threads from Loom aren't cool (in fact, I said they were; the technical achievement of making it a drop in replacement is itself incredible), or that it wouldn't be useful when it releases for those choosing Java, stuck with Java due to legacy reasons, or using a JVM language that is now able to migrate to take advantage of this to remove some of the impedance mismatch between their concurrency model(s) and Java's threading and the resulting caveats. Just that I don't care until it does (because I've been hearing about it for the past 4 years), it still doesn't put it on par with the models other languages have adopted (memory model matters to me quite a bit since I tend to care about correct behavior under load more than raw performance numbers; that said, of course, nothing is preventing people from adopting safer practices there...just like nothing has been in years previous. They just...haven't), nor do I care about the claims people make about it displacing X, Y, or Z. It probably will for new code! Whenever it gets fully supported in production. But there's still all that legacy code written over the past two decades using libraries and frameworks built to work around Java's initial 1:1 threading model, and which simply due to calling conventions and architecture (i.e., reactive and etc) would have to be rewritten, which probably won't happen due to the reality of production projects, even if there were clear gains in doing so (which as the great-grandparent mentions, is not nearly so clearcut).
Erlang is very cool, and Go has certainly achieved a notable measure of popularity — both served as an inspiration here — but a super-popular language like Java plays in a different world and a completely different scale than languages with a smaller reach. Virtual threads will bring the benefits of lightweight user-mode threads to an audience that is many times that of Erlang and Go combined.
As to legacy code, Java programs have been using the thread-per-request model for over 25 years (there's been a lot of talk of reactive, but actual adoption is relatively low), and Java's threads were designed to be abstracted from day one (in fact, early versions of Java implemented them in user mode). So the right fit has been there all along. Migrating applications to use virtual threads requires relatively few changes because of those reasons, and because we designed them with easy adoption in mind. This particular experiment is about simple, "legacy" Java 1.0 code enjoying terrific scalability.
BTW, Java's observability has come a long way in recent years (largely thanks to JFR — Java Flight Recorder), and even Erlang's is no match for it, although Java still lags behind Erlang's hot-swapping capabilities.
[1]: BTW, I always find talk about the "average Java programmer" a bit out of touch. The top 1% of Java programmers, the experts, outnumber all Rust (or Haskell, or Erlang) programmers several times over, and there are many more reliable Java programs than reliable Erlang programs. The average Java (or Python, or JavaScript, the two other dominant languages these days) programmer, is just the average programmer, period.
JEP 425 has been proposed to target JDK 19, out September 20. It will first be a "Preview" feature, which means supported but subject to change, and if all goes well would normally be out of Preview two releases, i.e. one year, after that.
> I'm not using one request per Java thread anyway
You don't have to, but note that only the thread-per-request model offers you world-class observability/debuggability.
> other than "ugh, this again".
Ok, although in 2022 the Java platform is still among the most technologically advanced, state-of-the-art software platforms out there. It stands shoulder to shoulder with clang and V8 on compilation, and beats everything else on GC and low-overhead observability (yes, even eBPF).
I think my point is more that you end up having to pay the costs (of Rx-style APIs) whether you need the scalability or not, because the libraries end up going down that route. It has sometimes felt like I'm being forced to do work in order to satisfy the fringe needs of some other project!
And sure, if you are living in a single-threaded environment, your choices are somewhat limited. I, personally, dislike front-end programming for exactly that reason - things like RxJS feel hideously overcomplicated to me. My guess is that most, though not all, will much prefer the loom-style threading over async/await given free choice.
Threads (whether lightweight or heavyweight) can’t fully replace reactive/proactive/async programming even ignoring performance and scalability. Sometimes network code simply needs to wait for more than one event as a matter of functionality. For example, a program might need to handle the availability of outgoing buffer space and also handle the availability of incoming data. And it might also need to handle completion of a database query or incoming data on a separate connection. Sure, using extra threads might do it, but it’s awkward.
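For example, the "wait for whichever event happens first" case looks natural with futures (a toy sketch with the event sources stubbed out):

    import java.util.concurrent.CompletableFuture;

    public class MultiEventWait {
        public static void main(String[] args) {
            // Hypothetical event sources, modelled as futures for the sketch.
            CompletableFuture<String> incomingData = CompletableFuture.supplyAsync(() -> "data ready");
            CompletableFuture<String> queryResult  = CompletableFuture.supplyAsync(() -> "query done");

            // Block until whichever completes first - something a single thread
            // blocked on one socket read can't express directly.
            Object first = CompletableFuture.anyOf(incomingData, queryResult).join();
            System.out.println("first event: " + first);
        }
    }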
Let me preface by saying I am a Johnny-come-lately loom fanboy. Amazing work and huge impact.
Re structured concurrency: I wonder if there's any way to combine it with generic exceptions such that we don't force a wrapping exception class. So maybe have an executor class that's generic on the thrown exception type, and then have the join or get APIs explicitly throw that type?
This thought process is inspired by the goto-considered-harmful trail of logic: I think it would get us even closer to concurrency encapsulated in function blocks.
Convenient polymorphism over exceptions is something I would very much like to see in Java, but it's a separate topic. Given that structured concurrency is normally used with things that can fail, and whose failures must be handled, I hope (and think) you'll find that the use of checked exceptions is not onerous at all. If we're mistaken, we can consider solutions during the incubation period.
Totally agree that we need explicit handling of the concurrency-specific exceptions like InterruptedException. It's just that concurrent APIs by their nature take Callable/Runnable arguments, which lose any formality over exceptions thrown by client code, and thus someone up the stack is always forced to write a catch (Throwable) block. So the concurrency leaks up the stack, and forces unsafe default clauses.
You’re clearly correct that the topic is separate, but it has great impact on the leakiness of these apis.
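A small sketch of that leak, with a hypothetical LookupException standing in for a domain error:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class LeakyExceptions {
        static class LookupException extends Exception {}

        static String lookup(String key) throws LookupException {
            throw new LookupException();
        }

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(1);
            try {
                // Callable.call() is declared "throws Exception", so the specific
                // LookupException is erased from the signature at this point.
                Callable<String> task = () -> lookup("key");
                Future<String> result = pool.submit(task);
                try {
                    System.out.println(result.get());
                } catch (ExecutionException e) {
                    // The checked exception resurfaces as an untyped cause, forcing
                    // instanceof checks or broad catch blocks up the stack.
                    if (e.getCause() instanceof LookupException) {
                        System.err.println("lookup failed");
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }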
I think we're in agreement. Ignoring under the hood - Loom's programming paradigm (from the viewpoint of control flow) is the Threading programming paradigm. (Virtual)Thread-per-connection programming is easier and far more intuitive than asynchronous (i.e. callback-esque) programming.
I still attest though - The 5M connections in this example is still a red herring.
Can we get to 6M? Can we get to 10M? Is that a question for Loom or Java's asynchronous IO system? No - it's a question for the operating system.
Loom and Java NIO can handle probably a billion connections as programmed. Java Threads cannot - although that too is a broken statement. "Linux Threads cannot" is the real statement. You can't have that many for resource reasons. Java Threads are just a thin abstraction on top of that.
Linux out of the box can't do 5M connections (last I checked). It takes Linux tuning artistry to get it there.
Don't get me wrong - I think Loom is cool. It's attempted to do the same thing as Async/Await tried - just better. But it is most definitely not the only way to achieve 5MM connections with Java or anything else. Possibly however, it's the most friendly and intuitive way to do it.
*We typically vilify Java Threads for the RAM they consume. Something like 1 MB per thread or so (tunable). Loom must still use "some" RAM per connection, although surely far, far less (and of course Linux must use some amount of kernel RAM per connection too).
> But it is most definitely not the only way to achieve 5MM connections with Java or anything else. Possibly however, it's the most friendly and intuitive way to do it.
It is the only way to achieve that many connections with Java in a way that's debuggable and observable by the platform and its tools, regardless of its intuitiveness or friendliness to human programmers. It's important to understand that this is an objective technical difference, and one of the cornerstones of the project. Computations that are composed in the asynchronous style are invisible to the runtime. Your server could be overloaded with I/O, and yet your profile will show idle thread pools.
Virtual threads don't just allow you to write something you could do anyway in some other way. They actually do work that has simply been impossible so far at that scale: they allow the runtime and its tools to understand how your program is composed and observe it at runtime in a meaningful and helpful way.
One of the main reasons so many companies turn to Java for their most important server-side applications is that it offers unmatched observability into what the program is doing (at least among other languages/platforms with similar performance). But that ability was missing for high-scale concurrency. Virtual threads add it to the platform.
Saying "Linux cannot handle 5M connections with one thread per connection" isn't a reasonable statement because no operating system can do that, they can't even get close. The resource usage of a kernel thread is defined by pretty fundamental limits in operating system architecture, namely, that the kernel doesn't know anything about the software using the thread. Any general purpose kernel will be unable to provision userspace with that many threads without consuming infeasible quantities of RAM.
The reason JVM virtual threads can do this is because the JVM has deep control and understanding of the stack and the heap (it compiled all the code). The reason Loom scalability gets worse if you call into native code is that then you're back to not controlling the stack.
Getting to 10M is therefore very much a question for the JVM as well as the operating system. It'll be heavily affected by GC performance with huge heaps, which luckily modern G1 excels at, it'll be affected by the performance of the JVM's userspace schedulers (ForkJoinPool etc), it'll be affected by the JVM's internal book-keeping logic and many other things. It stresses every level of the stack.
As the GP said, what's cool about this is how simple the code is. You might be able to achieve 5M connections in Java using an event loop based solution (eg Netty), but if the connection handlers need to do any async work, then they also need to be written using an event loop, which is not how most people write Java. Simply put, 5M connections was not possible using Java in the way most people write Java.
Time has shown that bare threads are not a viable high-level API for managing concurrency. As it turns out, we humans don't think in terms of locks and condvars but "to do X, I first need to know Y". That maps perfectly onto futures(/promises). And once you have those, you don't need all the extra complexity and hacks that green threads (/"colourless async") bring in.
I'd take a system that combined the API of futures with the performance of OS threads over the opposite combination, any day of the week. But as it turns out, we don't have to choose. We can have the performance of futures with the API of futures.
Or we can waste person-years chasing mirages, I guess. I just hope I won't get stuck having to use the end product of this.
Maybe threads don't work for your thinking style, but your claim that this is generally true is baseless and pretty well refuted by languages like Go or Erlang that feature stackful threads/processes as a critical part of their best-in-class concurrency stories.
Erlang sidesteps the problem by avoiding mutable shared state, in this context they're threads/processes in name only.
Go is just yet another implementation of green threads that is slightly less broken than prior implementations, because it had the benefit of being implemented on day 1 (so the whole ecosystem is green thread-aware). It's certainly nowhere near "best-in-class".
Shared mutable state is hard to work with, but Java threads and Java promises both give you access to it. In either case, you'd need discipline to avoid patterns which reduce concurrency.
From the article, it seems that Loom (in preview) enables the threaded model for Java to scale. IMHO, this is great because you can write simple straightforward code in a threaded model. You can certainly write complex code in a threaded model too. Maybe there's an argument that promises can be simple and straightforward too, but my experience with them hasn't been very straightforward.
If I look at a thread, I see futures all over the place. They're just implicit, and the OS takes care of concurrency/preemption. Sure, that means that you need concurrency primitives if you access shared resources, but only in the trivial case you can get away without shared state in the promise/future scenario as well (i.e. glue code that ties together the hard stuff). Downside is your code gets convoluted and your stacktraces suck.
I think you're confusing specific synchronisation/communication mechanisms with the basic concept of a thread, which is simply the sequential composition of instructions that is known and observable by the runtime. If you like the future/promise API, that will work even better with threads, because then the sequence is a reified concept known to the runtime and all its tools. You'll be able to step through the sequence of operations with a debugger; the profiler will know to associate operations with their context. What API you choose to compose your operations, whether you prefer message passing with no shared state, shared state with locks, or a combination of the two — that's all orthogonal to threads. All they are is a sequential unit of instructions that may run concurrently to other such units, and is traceable and observable by the platform and its tools.
You can implement futures by just running each future as a thread, but it doesn't really give you much. It's a lot more complex to write a preemptive thread scheduler + delegating future scheduler than to just write a future scheduler in the first place.
Especially when that future scheduler already exists and works, and the preemptive one is a multi-year research project away.
It gives you a lot (aside from the ability to use existing libraries and APIs): observability and debuggability.
Supporting tooling has been one of the most important aspects of this project, because even those who were willing to write asynchronous code, and even the few who actually enjoyed it, constantly complained — and rightly so — that they cannot easily observe, debug and profile such programs. When it comes to "serious" applications, observability is one of the most important aspects and requirements of a system.
Instead of introducing a new kind of sequential code unit through all layers of tooling (which would have been a huge project anyway), we abstracted the existing thread concept.
Threads have essentially the same API as Futures - normally you have some kind of join handle, and you can join a set of threads (the equivalent of awaiting a set of futures).
Threads don't require locks and condvars. You can use channels and scoped joins etc. if you want.
Give me some async code and I'll show you an easier threaded version.
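For example, roughly the kind of transformation I mean (a toy sketch with hypothetical fetchUser/fetchOrders helpers):

    import java.util.concurrent.CompletableFuture;

    public class AsyncVsBlocking {
        // Async style: the sequencing lives in the future chain.
        CompletableFuture<String> handleAsync(String userId) {
            return fetchUserAsync(userId)
                    .thenCompose(user -> fetchOrdersAsync(user))
                    .thenApply(orders -> render(orders));
        }

        // (Virtual-)thread-per-request style: the same sequencing, as plain statements.
        String handleBlocking(String userId) {
            String user = fetchUser(userId);    // blocks only this thread
            String orders = fetchOrders(user);  // ditto
            return render(orders);
        }

        // --- hypothetical I/O helpers, stubbed for the sketch ---
        CompletableFuture<String> fetchUserAsync(String id) { return CompletableFuture.completedFuture("user:" + id); }
        CompletableFuture<String> fetchOrdersAsync(String user) { return CompletableFuture.completedFuture("orders of " + user); }
        String fetchUser(String id) { return "user:" + id; }
        String fetchOrders(String user) { return "orders of " + user; }
        String render(String orders) { return "<html>" + orders + "</html>"; }
    }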
If you decide somewhere deep in your program you want to use async operations, most languages allow you to keep the invoking function/closure synchronous and return some kind of Promise/Future-like value
Which is exactly the workaround with Task.Run(), being able to integrate a library written with async/await in codebases older than the feature, where no one is paying for a full rewrite.
Except Kotlin coroutines already work, can be very easily integrated into existing Java codebases, and are much superior to Loom (structured concurrency, Flow, etc.).
Kotlin coroutines are amazing. They're built on very clever tech that converts fairly normal source code into a state machine when compiled. This has huge benefits and allows the programmer to break their code up without the hassle of explicitly programming callbacks, etc.
However... an unavoidable fact is that converted code works differently to other code. The programmer needs to know the difference. Normal and converted code compose together differently. The Kotlin compiler and type system helps keep track, but it can't paper over everything.
Having coroutine and lightweight thread support directly in the VM makes things very much simpler for programmers (and compiler writers!) since the VM can handle the details of suspending/resuming and code composes together effortlessly, even without compiler support, so it works across languages and codebases.
I don't want to be critical about Kotlin. It's amazing what it achieves and I'm a big fan of this stuff. Here are some notes I wrote on something similar, Scala's experiments with compile-time delimited continuations: https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht...
I think this is a general principle about compiler features vs runtime features. Having things in the runtime makes life a lot easier for everyone, at the cost of runtime complexity, of course.
Another one I'd like to see is native support for tail calls in Java. Kotlin, Scala, etc have to do compile-time tricks to get basic tail call support, but it doesn't work across functions well.
Scala and Kotlin both ask the programmer to add annotations where tail calls are needed, since the code gen so often fails.
As a side note, I can see that tail calls are planned for Project Loom too, but I haven't heard if that's implemented yet. Does anyone know the status?
"Project Loom is to intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform. This is accomplished by the addition of the following constructs:
Coroutines are much less coloured than async/await programming though, since functions return resolved types directly instead of futures. But yes, there is the notion of coroutine scope, and I don't see how to suppress it without making it less expressive.
BTW I expect Kotlin coroutines to leverage Loom eventually.
As for the tailrecursive keyword, it is not a constraint but a feature, since it guarantees at the type level that this function cannot stack overflow.
Few people know there is an alternative to tailrecursive that can make any function stack-overflow safe by leveraging the heap via continuations:
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-deep-re...
Thanks for posting that link to the tail recursion library, super handy, I didn't know about it. You frequently need tail recursion for writing expression evaluators/visitors.
I've been using an IntelliJ extension that can do magic by rewriting recursive functions to stateful stack-based code for performance, but it spits out very ugly code:
> "This inspection detects methods containing recursive calls (not just tail recursive calls) and removes the recursion from the method body, while preserving the original semantics of the code. However, the resulting code becomes rather obfuscated if the control flow in the recursive method is complex."
> Coroutines are much less coloured than async await programming though since functions returns resolved types directly instead of futures
Only because the compiler does its magic behind the scenes and transforms it into bytecode that takes a lambda with a continuation. Try calling a suspend function from java or starting a job and surprise, it's continuations all the way down
Yes, interfacing with Java is generally done via RxJava and Reactor. Interfacing is easy, but nobody wants to use RxJava and Reactor in the first place... I wonder whether Loom will enable easier interop and make the magic work from the Java side's POV.
I think another commenter pointed out that they are still coloured though. Still, they're very cool - and you can use them for more than just lightweight threading.
> As for the tailrecursive keyword, it is not a constraint but a feature since it guarantee at the type level that this function cannot stack overflow
I'd say tailrecursive is a compiler feature (codegen the recursion into a loop) to work around a runtime constraint (no tail call optimisation).
The lack of tail call optimisation on the JRE means recursion is a lot less safe than in functional language runtimes which guarantee stacks don't overflow when you make tail calls.
> As for Java, there is universal support for tail recursion at the bytecode level.
Just a note here for other readers that there are several terms in play here.
I was talking about "tail calls" - when a function calls a function as its last operation - and I mentioned some annotations to do "tail recursion", which is a special case - when a function calls _itself_ as its last operation.
SemanticStrength is talking about "tail recursion" only here. The JVM bytecode can support tail recursion (tail calls on the same method), since we can use the same bytecode that is used for while loops, etc.
However, we cannot do safe tail recursion between different functions (yet), in the same way that we cannot have a loop spanning more than one function. Tail call optimisation is something that will hopefully come in Project Loom.
I am absolutely a novice at this level of detail with the current OpenJDK implementation, so I'm only asking, but wouldn't a method call as a last operation also be a target of the basic inlining done by the JIT compilers? How is a non-recursive tail call any different from an inline at any other location?
1. These are actual threads from the Java runtime's perspective. You can step through them and profile them with existing debuggers and profilers. They maintain stacktraces and ThreadLocals just like platform threads.
2. There is no need for a split world of APIs, some designed for threads and others for coroutines (so-called "function colouring"). Existing APIs, third-party libraries, and programs — even those dating back to Java 1.0 (just as this experiment does with Java 1.0's java.net.ServerSocket) — just work on millions of virtual threads.
Normally, you wouldn't even call Thread.startVirtualThread(), but just replace your platform-thread-pool-based ExecutorService with an ExecutorService that spawns a new virtual thread for each task (Executors.newVirtualThreadPerTaskExecutor()). For more details, see the JEP: https://openjdk.java.net/jeps/425
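A minimal sketch of that swap (assumes a Loom-enabled JDK, e.g. 19 with preview features enabled):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualExecutorSketch {
        public static void main(String[] args) {
            // Before: ExecutorService executor = Executors.newFixedThreadPool(200);
            // After: one cheap virtual thread per task, no pool sizing.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 1_000_000; i++) {
                    int task = i;
                    executor.submit(() -> {
                        Thread.sleep(1_000);        // blocking just parks the virtual thread
                        return "done " + task;
                    });
                }
            } // close() waits for the submitted tasks to finish
        }
    }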
A bit of a digression, but I’d love to see how much further one could go with a memory-optimized userland TCP stack, and storing the send and receive buffers on disk.
A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.
A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.
Presumably at 100M simultaneous connections the machine CPU would be saturated with setting up and closing them, without getting much actual work done. TCP connections seem too fragile to make it worth trying to keep them open for really long periods.
It's interesting to think about though, I agree. What are the next scaling bottlenecks now that threading is nearly solved (for JVM-compatible languages)?
There are some obvious ones. Others in the thread have pointed out network bandwidth. Some use cases don't need much bandwidth but do need intense routability of data between connections, like chat apps, and it seems ideal for those. Still, you're going to face other problems:
1. If that process is restarted for any reason that's a lot of clients that get disrupted. JVMs are quite good at hot-reloading code on the fly, so it's not inherently the case that this is problematic because you could make restarts very rare. But it's still a problem.
2. Your CPU may be sufficient for the steady state but on restart the clients will all try to reconnect at once. Adding jitter doesn't really solve the issue, as users will still have to wait. Handling 5M connections is great unless it takes a long time to reach that level of connectivity and you are depending on it.
3. TCP is rarely used alone now, it usually comes with SSL. Doing SSL handshakes is more expensive than setting up a TCP connection (probably!). Do you need to use something like QUIC instead? Or can you offload that to the NIC making this a non-issue? I don't know. BTW the Java SSL stack is written in Java itself so it's fully Loom compatible.
I'm pretty sure the exercise was to show the absolute extremes that could be achieved in a toy application, and possibly how easily one could achieve a level of blocking-IO scaling that has been harder than most other tasks in Java of late. More and more, heap allocations are cheap, often with sub-millisecond collector pauses; CPU scaling has more to do with what you're doing than with the platform; and Java has enough tools to make your application fast.
For extremely IO-wait-bound workloads, though, there were always a LOT of hoops to jump through to make performance strong, since OS threads always have a notable stack memory footprint that just doesn't scale well when you could have thousands of OS threads waiting around just taking up RAM.
You're totally spot on that connection establishment is much more challenging than steady state; with TLS or just TCP.
I don't think QUIC helps with that at all. Afaik, QUIC is all userland, so you'd skip kernel processing, but that doesn't really make establishment cheaper. And TCP+TLS establishes the connection before doing crypto, so that saves effort on spoofing (otoh, it increases the round trips, so pick your tradeoffs).
One nice thing about TCP though is it's trivial to determine if packets are establishing or connected; you can easily drop incoming SYNs when CPU is saturated to put back pressure on clients. That will work well enough when crypto setup is the issue as well. Operating systems will essentially do this for you if you get behind on accepting on your listen sockets. (Edit) syncookies help somewhat if your system gets overwhelmed and can't keep state for all of the half-established connections, although not without tradeoffs.
In the before times, accelerator cards for TLS handshakes were common (or at least available), but I think current NIC acceleration is mainly the bulk ciphering which IMHO is more useful for sending files than sending small data that I'd expect in a large connection count machine. With file sending, having the CPU do bulk ciphers is a RAM bottleneck: the CPU needs to read the data, cipher it, and write to RAM then tell the NIC to send it; if the NIC can do the bulk cipher that's a read and write omitted. If it's chat data, the CPU probably was already processing it, so a few cycles with AES instructions to cipher it before sending it to send buffers is not very expensive.
QUIC will help with some things, and make others worse. With QUIC you don't need a file descriptor per connection anymore. A single file descriptor for one UDP socket will be sufficient to handle an arbitrary amount of connections (although you might want more to actually exploit concurrency). That fact will help limiting resources that the kernel uses. However the state that needs to be tracked per established connection is likely way larger than for TCP, due to being a more complex and featureful protocol. E.g. QUIC needs state for tracking sub-streams on a connection, while TCP does not. And of course there's all the mandatory crypto state. I am fairly familiar with QUIC implementations, and made a multitude of changes in various libraries (e.g. Quinn and s2n-quic). I wouldn't be surprised if the baseline memory usage of a QUIC connection in most libraries is > 10x of what the Linux TCP stack requires for tracking a connection
It depends on what you do, but I think GC/memory pressure can become an issue rather quickly with the default programming models Java leads you towards. I end up seeing this a lot in somewhat high throughput services/workers I own where fetching a lot of data to handle requests and discarding it afterwards leads to a lot of GC time. Curious if anyone has any sage advice on this front.
It's easy to just get 4TB of ram if that's what you need; I haven't scoped out what you can shove into a cheap off-the-shelf server these days, but I'd guess around 16TB before you need to get fancy servers (Edit: maybe 8TB is more realistic after looking at SuperMicro's 'Ultra' servers). I think you'd need a very specialized application for 100M connections per server to make sense, but if you've got one, that sounds like a fun challenge; my email is in my profile.
Moving 100M connections for maintenance will be a giant pain though. You would want to spend a good amount of time on a test suite so you can have confidence in the new deploys when you make them. Also, the client side of testing will probably be harder to scale than the server side... but you can do things like run 1000 test clients with 100k outgoing connections each to help with that.
It looks closer to goroutines, which to me begs the question - where are the channels that I could use to communicate between these virtual threads?
Go's channels are, simplistically, a mutex in front of a queue. Java has many existing objects that can do the same; it's just that it hasn't been the idiomatic best choice. Since green threads should wake up from Object.notify(), any threads blocking on the monitor should wake/consume. I'm curious how a green-thread-backed ConcurrentDeque would stand up to Go's channels in scalability/performance.
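For example, a rough sketch of the BlockingQueue-as-channel idea between two virtual threads (no equivalent of Go's select here, as the reply below notes):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ChannelLikeSketch {
        public static void main(String[] args) throws InterruptedException {
            // A bounded BlockingQueue behaves much like a buffered Go channel:
            // put() parks the virtual thread when full, take() when empty.
            BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

            Thread producer = Thread.startVirtualThread(() -> {
                try {
                    for (int i = 0; i < 100; i++) {
                        channel.put("msg-" + i);
                    }
                    channel.put("done");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread consumer = Thread.startVirtualThread(() -> {
                try {
                    String msg;
                    while (!(msg = channel.take()).equals("done")) {
                        System.out.println(msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.join();
            consumer.join();
        }
    }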
You are right. But Go channels also come with the superpower of "select", which allows waiting for multiple objects to become ready and atomic execution of actions. I don't think this part can be retrofitted on top of simple BlockingQueues.
no*, and as you've discovered, the skbufs allocated by the kernel will often be the limiting factor for a highly concurrent socket server on linux.
* I don't know if someone has created some experimental implementation somewhere. It would require a significant overhaul of the TCP implementation in the kernel.
Does Linux actually allocate buffers for each socket or does it just link to sk_buff's (which I understand are similar to FreeBSD's mbuf's) and then limit how much storage can be linked? FreeBSD has a limit on the total ram used for mbufs as well, not sure about Linux.
Otoh, FreeBSD's maximum FD limit is set as a factor of total memory pages (edit: looked it up, it's in sys/kern/subr_param.c, the limit is one FD per four pages, unless you edit kernel source) and you've got 2M pages with 8GB ram, so you would be limited to 512k FDs total, and if you're running the client on the same machine as server, that's 256k connections. But 8G is not much for a server, and some phones have more than that... so it's not super limiting.
When you're really not doing much with the connections, userland tcp as suggest in a sibling, could help you squeeze in more connections, but if you're going to actually do work, you probably need more ram.
Btw, as a former WhatsApp server engineer, WhatsApp listens on three ports; 80, 443, and 5222. Not that that makes a significant difference in the content.
I think the socket buffers (sk_buff) are actually shared. They are all packet sized, and whatever socket needs to transmit some data or receives it gets the buffers attached. So my assumption is that the amount of required socket buffers scales more with the amount of data transmission than with the number of sockets.
But independent of socket buffers, the kernel obviously needs to allocate other state per socket, which tracks the state of the TCP connection.
I'm not a java programmer. I tried clicking 3 layers deep of links, but still have no idea what virtual threads are in this context. Is it a userspace thread implementation?
I've used explicit context switching syscalls to "mock out" embedded real time OS task switching APIs. It's pretty fun and useful. The context switching itself may not be any faster than if the kernel does it, but the fact that it's synchronous to your program flow means that you don't have to spend any overhead synchronizing to mutexes, queues, etc. (You still have them, they just don't have to be thread safe.)
I see a lot of these making the FP of HN. But it's very difficult to be impressed, or unimpressed because it's all about hardware. How much hardware is everybody throwing at all of this? 5M persistent connections on a Pi with mere GigE? Pretty frickin' amazing. 5M persistent connections on a Threadripper with 128 cores and a dozen trunked 4 port 10GE NICs? Yaaaaawwwnnn snooze.
We need a standardized computer for benchmarking these types of claims. I propose the RasPi 4 4GB model. Everybody can find one, all the hardware's soldered on so no cheating is really possible, etc. Then we can really shoot for efficiency.
This isn't about the hardware, it's about thread count.
There are limits in the linux kernel, and the 5m concurrent connections was chosen to exceed it.
From what I remember (my knowledge is ancient though), a Java thread consumes a pid_t in the linux kernel. By default this is limited to 64k. However, this can be increased by setting a flag in the kernel, to a maximum 2^22 or 4m.
In order to have more than 4m connections, the existing Java code either needs to be changed to be event driven, or it can't use kernel threads.
Event driven code is very different. It's very powerful, but it is very easy to get lost. Think writing Java code that looks like a Makefile with dependencies or "andThen" everywhere, and everyone having to make sure everything is threadsafe. Thread safety is hard for large teams with high qps services - deadlocks can bring down a service.
If a developer can write "regular" non-re-entrant Java code and still get the concurrent connections? Win all around.