I'm very excited about the possibilities of Loom. Would love to have a more realistic sample with Spring Boot that would demonstrate the real world scale. I saw a few but nothing remotely as ambitious as that.
In theory once Graal adds support for it, any Graal/Truffle-compatible language can benefit.
IMHO it's only JVM+Graal that can bring this to other languages. Loom relies very heavily on some fairly unique aspects of the Java ecosystem (Go has these things too though). One is that lots of important bits of code are implemented in pure Java, like the IO and SSL stacks. Most languages rely heavily on FFI to C libraries. That's especially true of dynamic scripting languages but is also true of things like Rust. The Java world has more of a culture of writing their own implementations of things.
For the Loom approach to work you need:
a. Very tight and difficult integration between the compiler, threading subsystem and garbage collector.
b. The compiler/runtime to control all code being used. The moment you cross the FFI into code generated by another compiler (i.e. a native library) you have to pin the thread and the scalability degrades or is lost completely.
But! Graal has a trick up its sleeve. It can JIT compile lots of languages, and those languages can call into each other without a classical FFI. Instead the compiler sees both call site and destination site, and can inline them together to optimize as one. Moreover those languages include binary languages like LLVM bitcode and WASM. In turn that means that e.g. Python calling into a C extension can still work, because the C extension will be compiled to LLVM bitcode and then the JVM will take over from there. So there's one compiler for the entire process, even when mixing code from multiple languages. That's what Loom needs.
At least in theory. Perhaps pron will contradict me here because I have a feeling Loom also needs the invariant that there are no pointers into the stack. True for most languages but not once C gets involved. I don't know to what extent you could "fix" C programs at the compiler level to respect that invariant, even if you have LLVM bitcode. But at least the one-compiler aspect is not getting in the way.
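To make the "one compiler, no classical FFI" point concrete, here's a minimal sketch using the GraalVM polyglot API with JavaScript as the guest language (rather than LLVM bitcode); it assumes you're running on a GraalVM JDK with the JS language installed:

    import org.graalvm.polyglot.Context;
    import org.graalvm.polyglot.Value;

    public class PolyglotSketch {
        public static void main(String[] args) {
            // One runtime and one JIT for both the host (Java) and guest (JS) code;
            // the call below isn't a classical FFI, so Graal can see and inline across it.
            try (Context context = Context.create("js")) {
                Value twice = context.eval("js", "(n) => n * 2");
                System.out.println(twice.execute(21).asInt()); // prints 42
            }
        }
    }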
With Truffle you have to map your language's semantics to Java ones. I'm unfortunately out of my depth on the details, but my guess would be that LLVM is handled here with this in mind in a completely safe way (I'd guess pointers into the stack are not safe), so presumably it should work for these as well.
Not exactly, no. That's the whole point of Truffle and why it's such a big leap forward. You do not map your language's semantics to Java semantics. You can implement them on top of the JVM but bypassing Java bytecode. Your language doesn't even have to be garbage collected, and LLVM bitcode isn't (unless you use the enterprise version which adds support for automatically converting C/C++ to memory safe GCd code!).
So - C code running on the JVM via Sulong keeps C/C++ semantics. That probably means you can build pointers into the stack, and then I don't know what Loom would do. Right now they aren't integrated so I guess that's a research question.
Perhaps I wasn’t clear. I do know that Truffle works by writing an AST interpreter for another language, but to achieve the best performance you have to map/reuse existing java constructs. E.g. I have read that perhaps Ruby uses java exceptions in a not too idiomatic way, but this is what Graal can later optimize to very good code.
My (way out of my depth) guess with Sulong is that it uses small heap-allocated regions for every manual memory allocation (it even has a Managed mode in Enterprise).
You use Java constructs to implement the interpreter, but that doesn't mean the language itself has to be mapped to Java constructs, any more than writing an interpreter in C means your language has to be mapped to C semantics.
Sulong uses a standard C-style heap in the open source version. In EE they (can) trap malloc/free and re-point it towards the GCd heap. They also do bounds checking on pointer de-references. It's actually amazingly cool but unfortunately, EE is expensive enough in dollar terms that it gets ignored. I don't know of anything that uses it for real.
Actually it's every client IP+port / server IP+port pair. Linux uses ports 32768-60999 for ephemeral ports (roughly 28k), so it can support 28e3^2 = 784 million connections per IP pair.
Except your service is almost certainly listening on one non-ephemeral port.
But having "only" tens of thousands of connections per client is rarely a problem in practice, apart from some load testing scenarios (such as the experiment here, where they opened a number of ports so they could test a large number of connections with a single client machine).
I imagine that's the limit per client IP address [for a single server port], no? The Linux kernel can use multiple pieces of information to track connections: client IP address, client port, server IP address, server port.
Cloudflare has some interesting blog posts on this topic.
Having run production services that had over 250,000 sockets connecting to a single server port, I'm calling "nope" on that.
Are you thinking of the ephemeral port limit? That's on the client side; not the server side. Each TCP socket pair is a four-tuple of [server IP, server port, client IP, client port]; the uniqueness comes from the client IP/port part in the server case.
You don't really need 77 IP addresses (the 64k limit for TCP is per client IP, per source port, per server IP) but even if you did, your average IPv6 server will have a few billion available. Every client can connect to a server IP of their own if you ignore the practical limits of the network acceleration and driver stack. If you're somehow dealing with this scale, I doubt you'll be stuck with pure legacy IP addressing.
The real problem with such a setup is that you're not left with a whole lot of bandwidth per connection, even if you ignore things like packet loss and retransmits mucking up the connections. Most VPS servers have a 1 Gbps connection; with 5 million clients that leaves about 200 bits (25 bytes) per second per connection for TCP signaling and data to flow through. You'll need a ridiculous network card for a single server to deal with such a load, in the terabits per second range.
While I can't answer the question directly, there is an article about C#'s async/await vs Go's goroutines which compares the two approaches, and while some of the findings are probably stack-specific, a lot of it is probably intrinsic to the approach:
- Green threads scale somewhat better, but both scale ridiculously well, meaning probably you won't run into scaling issues.
- async/await generators use way less memory than a dedicated green thread; this affects both memory consumption and startup time, since with green threads the process has to run around asking the OS for more memory
Loom will make a great backend for Kotlin's coroutines. Roman Elizarov (Kotlin language lead and the person behind Kotlin's coroutine framework) has already confirmed that this will happen, and it makes a lot of sense.
For those who don't understand this, Kotlin's coroutine framework is designed to be language neutral and already works on top of the major platforms that have Kotlin compilers (native, JavaScript, JVM, and soon WASM). So it doesn't really compete with the "native" way of doing concurrent, asynchronous, or parallel computing on any of those platforms; it simply abstracts the underlying functionality.
It's actually a multiplatform library that implements all the platform-specific aspects in the platform-appropriate way. It's also very easy to adapt existing frameworks in this space via Kotlin extension functions, and the JVM implementation ships out of the box with such functions for most common solutions on the JVM (Java's threads, futures, thread pools, etc., as well as Spring Flux, RxJava, Vert.x, and so on). Loom will be just another solution in this long list.
If you use Spring Boot with Kotlin for example, rather than dealing with Spring's Flux, you simply define your asynchronous resources as suspend functions. Spring does the rest.
With Kotlin-js in a browser you can call Promise.toCoroutine() and async { ... }.asPromise(). That makes it really easy to write asynchronous event handling in a web application, for example, or to work with JavaScript APIs that expect promises from Kotlin. And if you use web-compose, fritz2, or even React with kotlin-js, anything asynchronous you'd likely be dealing with via some kind of coroutine and suspend functions.
Once Loom ships, it will basically enable some nice, low-level optimizations in the JVM implementation of coroutines, and there will likely be some new extension functions to adapt the various new Java APIs for this. Not a big deal, but it will probably be nice for situations with extremely large numbers of coroutines and IO. Not that it's particularly struggling there of course, but every little bit helps. It's not likely to require any code updates either. When the time comes, simply update your JVM and coroutine library and you should be good to go.
I won't repeat it all, but the main point is that having runtime support is much better than relying on compiler support, even if compiler support is pretty fantastic.
Note that the two aren't mutually exclusive; you should still be able to use coroutines after Project Loom ships, and they still might make sense in many places.
It's almost a little disappointing that beefy modern servers only manage a 5x scale improvement, though that could be due to the differences in runtime behaviour between Erlang and the JVM.
I mean... is 5M very impressive? Not really. Does it show that Project Loom meets the goal of being able to do large client count thread per server workloads? I think so. Does the name remind me of a best selling point and click adventure game? Definitely yes.
The experiment is about a Java app, but the tweaks are at the OS level. Does that mean any app (Java or not, Loom or not) can achieve the target given the correct tweaks?
Also, why are these not the defaults for the OS? What are we compromising by setting those values?
There's always trade-offs. It would be very rare for any server to reach even 100K concurrent connections, let alone 5M. Optimising for that would be optimising for the 0.000001% case at the expense of the common case.
If the server had a 100 Gbps Ethernet NIC, this would leave just 20 kbps for each TCP connection.
I could imagine some IoT scenarios where this might be a useful thing, but outside of that? I doubt there's anyone that wants 20 kbps throughput in this day and age...
It's a good stress test however to squeeze out inefficiencies, super-linear scaling issues, etc...
20kbps should be sufficient for things like chat apps if you have the CPU power to actually process chat messages like that. Modern apps also require attachments and those will require more bandwidth, but for the core messaging infrastructure without backfilling a message history I think 20kbps should be sufficient. Chat apps are bursty, after all, leaving you with more than just the average connection speed in practice.
I have a memory of some chat site, maybe Discord, sending attachments to a different server, thus trading the bandwidth problem for extra system complexity.
That's how I'd solve the problem. The added complexity isn't even that high: give the application an endpoint to push an attachment into a distributed object store of your choice, submit a message with a reference to the object, and persist it the moment the chat message was sent. This could be done with mere bytes for the message itself and some very dumb anycast-to-s3 services in different data centers.
I'm sure I'm skipping over tons of complexity here (HTTP keepalives binding clients to a single attachment host for example) because I'm no chat app developer, but the theoretical complexity is still relatively low.
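To sketch what I mean (the ObjectStore/ChatService interfaces here are hypothetical placeholders, not any real API):

    // Hypothetical interfaces, just to illustrate the two-step flow described above.
    interface ObjectStore {
        String put(byte[] bytes);                 // returns an object key / URL
    }

    interface ChatService {
        void send(String roomId, String text, String attachmentKey);
    }

    class AttachmentFlow {
        private final ObjectStore store;
        private final ChatService chat;

        AttachmentFlow(ObjectStore store, ChatService chat) {
            this.store = store;
            this.chat = chat;
        }

        void sendWithAttachment(String roomId, String text, byte[] attachment) {
            String key = store.put(attachment); // 1. push the large blob out of band
            chat.send(roomId, text, key);       // 2. the chat message itself stays tiny
        }
    }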
No, it doesn't. The reason the tweaks are at the OS level is because, apparently, Loom-enabled JVMs already scale up to that level without needing any tuning. But if you try that in C++ you're going to die very quickly.
Sure, I wrote some myself. The question is what libraries you can use on top of the userspace thread package that are aware of the userspace threads, rather than just using OS APIs and thus e.g. blocking the current OS thread.
Both your operating system and your application environment need to be up to the task. I'd expect most operating systems to be up to it, although they might need some settings tuned. Some of the settings control things that are statically allocated in non-swappable memory, and you don't want to waste memory on being able to have 5M sockets open if you never go over 10k. Often you'll want to reduce socket buffers from the defaults, which will reduce throughput per socket, but the target throughput per socket is likely low or you wouldn't want to cram so many connections per client. You may need to increase the size of the connection table and the hash used for it as well; again, it wastes non-swappable RAM to have it too big if you won't use it.
For application level, it's going to depend on how you handle concurrency. This post is interesting, because it's a benchmark of a different way to do it in Java. You could probably do 5M connections in regular Java through some explicit event loop structure; but with the Loom preview, you can do it connection per Thread. You would be unlikely to do it with connection per Thread without Loom, since Linux threads are very unlikely to scale so high (but I'd be happy to read a report showing 5M Linux threads)
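For reference, a minimal sketch of the connection-per-(virtual-)thread structure (not the article's actual code; assumes a Loom-enabled JDK):

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class EchoServer {
        public static void main(String[] args) throws IOException {
            try (ServerSocket server = new ServerSocket(8080)) {
                while (true) {
                    Socket socket = server.accept();
                    // One virtual thread per connection; plain blocking reads/writes are fine.
                    Thread.startVirtualThread(() -> handle(socket));
                }
            }
        }

        static void handle(Socket socket) {
            try (socket) {
                socket.getInputStream().transferTo(socket.getOutputStream()); // echo until close
            } catch (IOException ignored) {
            }
        }
    }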
Go look at the source code. Look at how simple it is - anyone who has created a thread with Java knows what's happening. With only minor tweaks, this means your pre-existing code can take advantage of this with, basically, no effort. And it retains all the debuggability of traditional Java threads (i.e. a stack trace that makes sense!).
If you've spent any time at all dealing with the horrors of C# async/await (Why am I here? Oh, no idea) and its doubling of your APIs to support function colouring - or you've fought with the complexities of reactive solutions in the Java space -- often, frankly, in the name of "scalability" that will never be practically required -- this is a big deal.
But in the real world it is common to need information from the authentication stage to use in the authorization stage. For example you may have a user login with an email address/password which you then pass to an LDAP server in order to get a userId. This userId is then used in a database to determine which objects/groups they have access to.
A lower-latency design would be for the authorization service to be able to work with either identifier. That way those requests could be done in parallel to reduce latency.
Loom sets out to give you a sane programming paradigm similar to what threads do (i.e. as opposed to programming asynchronous I/O in Java with some type of callback) without the overhead of Operating System threads.
That's a very cool and a noble pursuit. But the title of this article might as well have been "5M persistent connections with Linux" because that's where the magic 5M connections happen.
I could also attempt 5M connections at the Java level using Netty and asynchronous IO - no threads or Loom. Again, it'd take more Linux configuration than anything else. If that configuration is in place, though, you can also do it with C# async/await, JavaScript, and I'm sure Erlang and anything else that does asynchronous I/O, whether or not it's masked by something like Loom or async/await.
It is true that the experiment exercises the OS, but that's only part of the point. The other part is that it uses a simple, blocking, thread-per-request model with Java 1.0 networking APIs. So this is "achieving 5M persistent connections with (essentially) 26-year-old code that's fully debuggable and observable by the platform." This stresses both the OS and the Java runtime.
So while you could achieve 5M in other ways, those ways would not only be more complex, but also not really observable/debuggable by Java platform tools.
Writing the sort of applications that I get involved with, it's frequently the case that whilst one OS thread per Java thread was a theoretical scalability limitation, in practice we were never likely to hit it (and there was always 'get a bigger computer').
But: the complexity mavens inside our company, and the projects we rely upon, get bitten by an obsessive need to chase 'scalability' /at all costs/. Which is fine, but the downside is that the negative consequences of coloured functions come into play. We end up suffering, having to deal with Vert.x or Kotlin or whatever the flavour-of-the-month solution is that is /inherently/ harder to reason about than a linear piece of code. If you're in a C# project, then you get a library that's async, and boom, game over.
If Loom gets even within performance shouting distance of those other models, it ought to kill (for all but the edgiest of edge-cases) reactive programming in the Java space dead. You might be able to make a case - obviously depending on your use cases, which are not mine - that extracting, say, 50% more scalability is worth the downsides. If that number is, say, 5%, then for the vast majority of projects the answer is going to be 'no'.
I say 'ought to', as I fear the adage that "developers love complexity the way moths love flames - and often with the same results". I see both engineers and projects (Hibernate and keycloak, IIRC) have a great deal of themselves invested in their Rx position, and I already sense that they're not going to give it up without a fight.
So: the headline number is less important than "for virtually everyone you will no longer have to trade simplicity for scalability". I can't wait!
1. Demanding scalability for inappropriate projects and at any cost is something I've seen too, and on investigation it was usually related to former battle scars. A software system that stops scaling at the wrong time can be horrific for the business. Some of them never recover, the canonical example being MySpace, but I've heard of other examples that were less public. In finance entire multi-year IT projects by huge teams have failed and had to be scrapped because they didn't scale to even current business needs, let alone future needs. Emergency projects to make something "scale" because new customers have been on-boarded, or business requirements changed, are the sort of thing nobody wants to get caught up in. Over time these people graduate into senior management where they become architects who react to those bad experiences by insisting on making scalability a checkbox to tick.
Of course there's also trying to make easy projects more challenging, resume-driven development etc too. It's not just that. But that's one way it can happen.
2. Rx type models aren't just about the cost of threads. An abstraction over a stream of events is useful in many contexts, for example, single-threaded GUIs.
One additional point - as noted, it's been 26 years since Java's founding. Project Loom has been around since at least 2018 and still has no release date. It'll be cool for Java projects whenever it comes out, but I just... have a hard time caring right now. I can't use it for old codebases currently, and for new codebases I'm not using one request per Java thread anyway (tbh, when it's my choice I'm not choosing the JVM at all).
The point I was making is that Loom isn't released, stable, production ready, supported, etc., and there's still no date when it's supposed to be, so what you can do with Loom in no way affects what I can do with a production codebase, either new or legacy. I'm not sure how you missed that from my post.
I'm not defending reactive programming on the JVM. I'm also not defending threads as units of concurrency. I'm saying I can get the benefits of Project Loom -right now-, in production ready languages/libraries, outside of the JVM, and I can't reasonably pick Project Loom if I want something stable and supported by its creators.
> and there's no still no date when it's supposed to be
September 20 (in Preview)
> I'm saying I can get the benefits of Project Loom -right now-, in production ready languages/libraries, outside of the JVM
Only sort-of. The only languages offering something similar in terms of programming model are Erlang (/Elixir) and Go — both inspired virtual threads. But Erlang doesn't offer similar performance, and Go doesn't offer similar observability. Neither offers the same popularity.
I'm not saying there aren't tradeoffs, just that if I need the benefits of virtual threads... I have other options. I'm all for this landing on the JVM, mainly so that non-Java languages there can take advantage of it rather than the hoops they currently have to jump through to offer a saner concurrency model, but until it does... don't care. And last I saw, this feature is proposed to land in preview in JDK 19; not that it necessarily will, and... it's still preview. Meaning the soonest we can expect to see this safely available to production code is next year (preview in Java is a bit weird, admittedly. "This is not experimental but we can change any part of it or remove it in future versions depending on how things go" was basically my take on it when I looked in the past).
Meanwhile, as you say, Erlang/Elixir gives me this model with 35+ years of history behind it (and no libraries/frameworks in use trying to provide me a leaky abstraction of something 'better'), better observability than the JVM, a safer memory model for concurrent code, and a better model for reliability, with the main issue being the CPU hit (less of a concern for IO-bound workloads, which is where this kind of concurrency is generally impactful anyway). Go has reduced observability compared to Java, sure, but a number of other tradeoffs I personally prefer (not least because in most of the Java shops I was in, I was the one most familiar with profiling and debugging Java. The tools are there, the experience amongst the average Java developer isn't), and will also be releasing twice between now and next year.
Again, I'm not saying virtual threads from Loom aren't cool (in fact, I said they were; the technical achievement of making it a drop in replacement is itself incredible), or that it wouldn't be useful when it releases for those choosing Java, stuck with Java due to legacy reasons, or using a JVM language that is now able to migrate to take advantage of this to remove some of the impedance mismatch between their concurrency model(s) and Java's threading and the resulting caveats. Just that I don't care until it does (because I've been hearing about it for the past 4 years), it still doesn't put it on par with the models other languages have adopted (memory model matters to me quite a bit since I tend to care about correct behavior under load more than raw performance numbers; that said, of course, nothing is preventing people from adopting safer practices there...just like nothing has been in years previous. They just...haven't), nor do I care about the claims people make about it displacing X, Y, or Z. It probably will for new code! Whenever it gets fully supported in production. But there's still all that legacy code written over the past two decades using libraries and frameworks built to work around Java's initial 1:1 threading model, and which simply due to calling conventions and architecture (i.e., reactive and etc) would have to be rewritten, which probably won't happen due to the reality of production projects, even if there were clear gains in doing so (which as the great-grandparent mentions, is not nearly so clearcut).
Erlang is very cool, and Go has certainly achieved a notable measure of popularity — both served as an inspiration here — but a super-popular language like Java plays in a different world and a completely different scale than languages with a smaller reach. Virtual threads will bring the benefits of lightweight user-mode threads to an audience that is many times that of Erlang and Go combined.
As to legacy code, Java programs have been using the thread-per-request model for over 25 years (there's been a lot of talk of reactive, but actual adoption is relatively low), and Java's threads were designed to be abstracted from day one (in fact, early versions of Java implemented them in user mode). So the right fit has been there all along. Migrating applications to use virtual threads requires relatively few changes because of those reasons, and because we designed them with easy adoption in mind. This particular experiment is about simple, "legacy" Java 1.0 code enjoying terrific scalability.
BTW, Java's observability has come a long way in recent years (largely thanks to JFR — Java Flight Recorder), and even Erlang's is no match for it, although Java still lags behind Erlang's hot-swapping capabilities.
[1]: BTW, I always find talk about the "average Java programmer" a bit out of touch. The top 1% of Java programmers, the experts, outnumber all Rust (or Haskell, or Erlang) programmers several times over, and there are many more reliable Java programs than reliable Erlang programs. The average Java (or Python, or JavaScript, the two other dominant languages these days) programmer, is just the average programmer, period.
JEP 425 has been proposed to target JDK 19, out September 20. It will first be a "Preview" feature, which means supported but subject to change, and if all goes well would normally be out of Preview two releases, i.e. one year, after that.
> I'm not using one request per Java thread anyway
You don't have to, but note that only the thread-per-request model offers you world-class observability/debuggability.
> other than "ugh, this again".
Ok, although in 2022 the Java platform is still among the most technologically advanced, state-of-the-art software platforms out there. It stands shoulder to shoulder with clang and V8 on compilation, and beats everything else on GC and low-overhead observability (yes, even eBPF).
I think my point is more that you end up having to pay the costs (of Rx-style APIs) whether you need the scalability or not, because the libraries end up going down that route. It has sometimes felt like I'm being forced to do work in order to satisfy the fringe needs of some other project!
And sure, if you are living in a single-threaded environment, your choices are somewhat limited. I, personally, dislike front-end programming for exactly that reason - things like RxJS feel hideously overcomplicated to me. My guess is that most, though not all, will much prefer the loom-style threading over async/await given free choice.
Threads (whether lightweight or heavyweight) can’t fully replace reactive/proactive/async programming even ignoring performance and scalability. Sometimes network code simply needs to wait for more than one event as a matter of functionality. For example, a program might need to handle the availability of outgoing buffer space and also handle the availability of incoming data. And it might also need to handle completion of a database query or incoming data on a separate connection. Sure, using extra threads might do it, but it’s awkward.
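For example, the "wait for whichever event happens first" case looks natural with futures (a toy sketch with the event sources stubbed out):

    import java.util.concurrent.CompletableFuture;

    public class MultiEventWait {
        public static void main(String[] args) {
            // Hypothetical event sources, modelled as futures for the sketch.
            CompletableFuture<String> incomingData = CompletableFuture.supplyAsync(() -> "data ready");
            CompletableFuture<String> queryResult  = CompletableFuture.supplyAsync(() -> "query done");

            // Block until whichever completes first - something a single thread
            // blocked on one socket read can't express directly.
            Object first = CompletableFuture.anyOf(incomingData, queryResult).join();
            System.out.println("first event: " + first);
        }
    }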
Let me preface by saying I am a Johnny-come-lately loom fanboy. Amazing work and huge impact.
Re structured concurrency: I wonder if there's any way to combine it with generic exceptions such that we don't force a wrapping exception class. So maybe have an executor class that's generic on the thrown exception type, and then have the join or get APIs explicitly throw that type?
This thought process is inspired by the goto-considered-harmful trail of logic: I think it would get us even closer to concurrency encapsulated in function blocks.
Convenient polymorphism over exceptions is something I would very much like to see in Java, but it's a separate topic. Given that structured concurrency is normally used with things that can fail, and whose failures must be handled, I hope (and think) you'll find that the use of checked exceptions is not onerous at all. If we're mistaken, we can consider solutions during the incubation period.
Totally agree that we need explicit handling of the concurrency-specific exceptions like InterruptedException. It's just that concurrent APIs by their nature take Callable/Runnable arguments, which lose any formality over exceptions thrown by client code, and thus someone up the stack is always forced to write a catch (Throwable) block. So the concurrency leaks up the stack, and forces unsafe default clauses.
You’re clearly correct that the topic is separate, but it has great impact on the leakiness of these apis.
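A small sketch of that leak, with a hypothetical LookupException standing in for a domain error:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class LeakyExceptions {
        static class LookupException extends Exception {}

        static String lookup(String key) throws LookupException {
            throw new LookupException();
        }

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(1);
            try {
                // Callable.call() is declared "throws Exception", so the specific
                // LookupException is erased from the signature at this point.
                Callable<String> task = () -> lookup("key");
                Future<String> result = pool.submit(task);
                try {
                    System.out.println(result.get());
                } catch (ExecutionException e) {
                    // The checked exception resurfaces as an untyped cause, forcing
                    // instanceof checks or broad catch blocks up the stack.
                    if (e.getCause() instanceof LookupException) {
                        System.err.println("lookup failed");
                    }
                }
            } finally {
                pool.shutdown();
            }
        }
    }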
I think we're in agreement. Ignoring under the hood - Loom's programming paradigm (from the viewpoint of control flow) is the Threading programming paradigm. (Virtual)Thread-per-connection programming is easier and far more intuitive than asynchronous (i.e. callback-esque) programming.
I still attest though - The 5M connections in this example is still a red herring.
Can we get to 6M? Can we get to 10M? Is that a question for Loom or Java's asynchronous IO system? No - it's a question for the operating system.
Loom and Java NIO can handle probably a billion connections as programmed. Java Threads cannot - although that too is a broken statement. "Linux Threads cannot" is the real statement. You can't have that many for resource reasons. Java Threads are just a thin abstraction on top of that.
Linux out of the box can't do 5M connections (last I checked). It takes Linux tuning artistry to get it there.
Don't get me wrong - I think Loom is cool. It's attempted to do the same thing as Async/Await tried - just better. But it is most definitely not the only way to achieve 5MM connections with Java or anything else. Possibly however, it's the most friendly and intuitive way to do it.
*We typically vilify Java Threads for the RAM they consume. Something like 1 MB per thread or so (tunable). Loom must still use "some" RAM per connection, although surely far, far less (and of course Linux must use some amount of kernel RAM per connection too).
> But it is most definitely not the only way to achieve 5MM connections with Java or anything else. Possibly however, it's the most friendly and intuitive way to do it.
It is the only way to achieve that many connections with Java in a way that's debuggable and observable by the platform and its tools, regardless of its intuitiveness or friendliness to human programmers. It's important to understand that this is an objective technical difference, and one of the cornerstones of the project. Computations that are composed in the asynchronous style are invisible to the runtime. Your server could be overloaded with I/O, and yet your profile will show idle thread pools.
Virtual threads don't just allow you to write something you could do anyway in some other way. They actually do work that has simply been impossible so far at that scale: they allow the runtime and its tools to understand how your program is composed and observe it at runtime in a meaningful and helpful way.
One of the main reasons so many companies turn to Java for their most important server-side applications is that it offers unmatched observability into what the program is doing (at least among other languages/platforms with similar performance). But that ability was missing for high-scale concurrency. Virtual threads add it to the platform.
Saying "Linux cannot handle 5M connections with one thread per connection" isn't a reasonable statement because no operating system can do that, they can't even get close. The resource usage of a kernel thread is defined by pretty fundamental limits in operating system architecture, namely, that the kernel doesn't know anything about the software using the thread. Any general purpose kernel will be unable to provision userspace with that many threads without consuming infeasible quantities of RAM.
The reason JVM virtual threads can do this is because the JVM has deep control and understanding of the stack and the heap (it compiled all the code). The reason Loom scalability gets worse if you call into native code is that then you're back to not controlling the stack.
Getting to 10M is therefore very much a question for the JVM as well as the operating system. It'll be heavily affected by GC performance with huge heaps, which luckily modern G1 excels at, it'll be affected by the performance of the JVM's userspace schedulers (ForkJoinPool etc), it'll be affected by the JVM's internal book-keeping logic and many other things. It stresses every level of the stack.
As the GP said, what's cool about this is how simple the code is. You might be able to achieve 5M connections in Java using an event loop based solution (eg Netty), but if the connection handlers need to do any async work, then they also need to be written using an event loop, which is not how most people write Java. Simply put, 5M connections was not possible using Java in the way most people write Java.
Time has shown that bare threads are not a viable high-level API for managing concurrency. As it turns out, we humans don't think in terms of locks and condvars but "to do X, I first need to know Y". That maps perfectly onto futures(/promises). And once you have those, you don't need all the extra complexity and hacks that green threads (/"colourless async") bring in.
I'd take a system that combined the API of futures with the performance of OS threads over the opposite combination, any day of the week. But as it turns out, we don't have to choose. We can have the performance of futures with the API of futures.
Or we can waste person-years chasing mirages, I guess. I just hope I won't get stuck having to use the end product of this.
Maybe threads don't work for your thinking style, but your claim that this is generally true is baseless and pretty well refuted by languages like Go or Erlang that feature stackful threads/processes as a critical part of their best-in-class concurrency stories.
Erlang sidesteps the problem by avoiding mutable shared state, in this context they're threads/processes in name only.
Go is just yet another implementation of green threads that is slightly less broken than prior implementations, because it had the benefit of being implemented on day 1 (so the whole ecosystem is green thread-aware). It's certainly nowhere near "best-in-class".
Shared mutable state is hard to work with, but Java threads and Java promises both give you access to it. In either case, you'd need discipline to avoid patterns which reduce concurrency.
From the article, it seems that Loom (in preview) enables the threaded model for Java to scale. IMHO, this is great because you can write simple straightforward code in a threaded model. You can certainly write complex code in a threaded model too. Maybe there's an argument that promises can be simple and straightforward too, but my experience with them hasn't been very straightforward.
If I look at a thread, I see futures all over the place. They're just implicit, and the OS takes care of concurrency/preemption. Sure, that means that you need concurrency primitives if you access shared resources, but only in the trivial case you can get away without shared state in the promise/future scenario as well (i.e. glue code that ties together the hard stuff). Downside is your code gets convoluted and your stacktraces suck.
I think you're confusing specific synchronisation/communication mechanisms with the basic concept of a thread, which is simply the sequential composition of instructions that is known and observable by the runtime. If you like the future/promise API, that will work even better with threads, because then the sequence is a reified concept known to the runtime and all its tools. You'll be able to step through the sequence of operations with a debugger; the profiler will know to associate operations with their context. What API you choose to compose your operations, whether you prefer message passing with no shared state, shared state with locks, or a combination of the two — that's all orthogonal to threads. All they are is a sequential unit of instructions that may run concurrently to other such units, and is traceable and observable by the platform and its tools.
You can implement futures by just running each future as a thread, but it doesn't really give you much. It's a lot more complex to write a preemptive thread scheduler + delegating future scheduler than to just write a future scheduler in the first place.
Especially when that future scheduler already exists and works, and the preemptive one is a multi-year research project away.
It gives you a lot (aside from the ability to use existing libraries and APIs): observability and debuggability.
Supporting tooling has been one of the most important aspects of this project, because even those who were willing to write asynchronous code, and even the few who actually enjoyed it, constantly complained — and rightly so — that they cannot easily observe, debug and profile such programs. When it comes to "serious" applications, observability is one of the most important aspects and requirements of a system.
Instead of introducing a new kind of sequential code unit through all layers of tooling (which would have been a huge project anyway), we abstracted the existing thread concept.
Threads have essentially the same API as Futures - normally you have some kind of join handle, and you can join a set of threads (the equivalent of awaiting a set of futures).
Threads don't require locks and condvars. You can use channels and scoped joins etc. if you want.
Give me some async code and I'll show you an easier threaded version.
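For example, roughly the kind of transformation I mean (a toy sketch with hypothetical fetchUser/fetchOrders helpers):

    import java.util.concurrent.CompletableFuture;

    public class AsyncVsBlocking {
        // Async style: the sequencing lives in the future chain.
        CompletableFuture<String> handleAsync(String userId) {
            return fetchUserAsync(userId)
                    .thenCompose(user -> fetchOrdersAsync(user))
                    .thenApply(orders -> render(orders));
        }

        // (Virtual-)thread-per-request style: the same sequencing, as plain statements.
        String handleBlocking(String userId) {
            String user = fetchUser(userId);    // blocks only this thread
            String orders = fetchOrders(user);  // ditto
            return render(orders);
        }

        // --- hypothetical I/O helpers, stubbed for the sketch ---
        CompletableFuture<String> fetchUserAsync(String id) { return CompletableFuture.completedFuture("user:" + id); }
        CompletableFuture<String> fetchOrdersAsync(String user) { return CompletableFuture.completedFuture("orders of " + user); }
        String fetchUser(String id) { return "user:" + id; }
        String fetchOrders(String user) { return "orders of " + user; }
        String render(String orders) { return "<html>" + orders + "</html>"; }
    }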
If you decide somewhere deep in your program you want to use async operations, most languages allow you to keep the invoking function/closure synchronous and return some kind of Promise/Future-like value
Which is exactly the workaround with Task.Run(), being able to integrate a library written with async/await in codebases older than the feature, where no one is paying for a full rewrite.
Except Kotlin coroutines already work, can be very easily integrated into existing Java codebases, and are much superior to Loom (structured concurrency, Flow, etc.).
Kotlin coroutines are amazing. They're built on very clever tech that converts fairly normal source code into a state machine when compiled. This has huge benefits and allows the programmer to break their code up without the hassle of explicitly programming callbacks, etc.
However... an unavoidable fact is that converted code works differently to other code. The programmer needs to know the difference. Normal and converted code compose together differently. The Kotlin compiler and type system helps keep track, but it can't paper over everything.
Having coroutine and lightweight thread support directly in the VM makes things very much simpler for programmers (and compiler writers!) since the VM can handle the details of suspending/resuming and code composes together effortlessly, even without compiler support, so it works across languages and codebases.
I don't want to be critical about Kotlin. It's amazing what it achieves and I'm a big fan of this stuff. Here are some notes I wrote on something similar, Scala's experiments with compile-time delimited continuations: https://rd.nz/2009/02/delimited-continuations-in-scala_24.ht...
I think this is a general principle about compiler features vs runtime features. Having things in the runtime makes life a lot easier for everyone, at the cost of runtime complexity, of course.
Another one I'd like to see is native support for tail calls in Java. Kotlin, Scala, etc have to do compile-time tricks to get basic tail call support, but it doesn't work across functions well.
Scala and Kotlin both ask the programmer to add annotations where tail calls are needed, since the code gen so often fails.
As a side note, I can see that tail calls are planned for Project Loom too, but I haven't heard if that's implemented yet. Does anyone know the status?
"Project Loom is to intended to explore, incubate and deliver Java VM features and APIs built on top of them for the purpose of supporting easy-to-use, high-throughput lightweight concurrency and new programming models on the Java platform. This is accomplished by the addition of the following constructs:
Coroutines are much less coloured than async/await programming though, since functions return resolved types directly instead of futures. But yes, there is the notion of coroutine scope, and I don't see how to suppress it without making it less expressive.
BTW I expect Kotlin coroutines to leverage Loom eventually.
As for the tailrecursive keyword, it is not a constraint but a feature, since it guarantees at the type level that this function cannot stack overflow.
Few people know there is an alternative to tailrecursive that can make any function stack-overflow safe by leveraging the heap via continuations:
https://kotlinlang.org/api/latest/jvm/stdlib/kotlin/-deep-re...
Thanks for posting that link to the tail recursion library, super handy, I didn't know about it. You frequently need tail recursion for writing expression evaluators/visitors.
I've been using an IntelliJ extension that can do magic by rewriting recursive functions to stateful stack-based code for performance, but it spits out very ugly code:
> "This inspection detects methods containing recursive calls (not just tail recursive calls) and removes the recursion from the method body, while preserving the original semantics of the code. However, the resulting code becomes rather obfuscated if the control flow in the recursive method is complex."
> Coroutines are much less coloured than async await programming though since functions returns resolved types directly instead of futures
Only because the compiler does its magic behind the scenes and transforms it into bytecode that takes a lambda with a continuation. Try calling a suspend function from java or starting a job and surprise, it's continuations all the way down
Yes, interfacing with Java is generally done via RxJava and Reactor. Interfacing is easy, but nobody wants to use RxJava and Reactor in the first place... I wonder whether Loom will enable easier interop and make the magic work from the Java side's POV.
I think another commenter pointed out that they are still coloured though. Still, they're very cool - and you can use them for more than just lightweight threading.
> As for the tailrecursive keyword, it is not a constraint but a feature since it guarantee at the type level that this function cannot stack overflow
I'd say tailrecursive is a compiler feature (codegen the recursion into a loop) to work around a runtime constraint (no tail call optimisation).
The lack of tail call optimisation on the JRE means recursion is a lot less safe than in functional language runtimes which guarantee stacks don't overflow when you make tail calls.
> As for Java, there is universal support for tail recursion at the bytecode level.
Just a note here for other readers that there are several terms in play here.
I was talking about "tail calls" - when a function calls a function as its last operation - and I mentioned some annotations to do "tail recursion", which is a special case - when a function calls _itself_ as its last operation.
SemanticStrength is talking about "tail recursion" only here. The JVM bytecode can support tail recursion (tail calls on the same method), since we can use the same bytecode that is used for while loops, etc.
However, we cannot do safe tail recursion between different functions (yet), in the same way that we cannot have a loop spanning more than one function. Tail call optimisation is something that will hopefully come in Project Loom.
I am absolutely a novice at this level of detail with the current OpenJDK implementation, so I'm only asking, but wouldn't a method call as a last operation also be a target of the basic inlining done by the JIT compilers? How is a non-recursive tail call any different from an inline at any other location?
1. These are actual threads from the Java runtime's perspective. You can step through them and profile them with existing debuggers and profilers. They maintain stacktraces and ThreadLocals just like platform threads.
2. There is no need for a split world of APIs, some designed for threads and others for coroutines (so-called "function colouring"). Existing APIs, third-party libraries, and programs — even those dating back to Java 1.0 (just as this experiment does with Java 1.0's java.net.ServerSocket) — just work on millions of virtual threads.
Normally, you wouldn't even call Thread.startVirtualThread(), but just replace your platform-thread-pool-based ExecutorService with an ExecutorService that spawns a new virtual thread for each task (Executors.newVirtualThreadPerTaskExecutor()). For more details, see the JEP: https://openjdk.java.net/jeps/425
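A minimal sketch of that swap (assumes a Loom-enabled JDK, e.g. 19 with preview features enabled):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class VirtualExecutorSketch {
        public static void main(String[] args) {
            // Before: ExecutorService executor = Executors.newFixedThreadPool(200);
            // After: one cheap virtual thread per task, no pool sizing.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 1_000_000; i++) {
                    int task = i;
                    executor.submit(() -> {
                        Thread.sleep(1_000);        // blocking just parks the virtual thread
                        return "done " + task;
                    });
                }
            } // close() waits for the submitted tasks to finish
        }
    }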
A bit of a digression, but I’d love to see how much further one could go with a memory-optimized userland TCP stack, and storing the send and receive buffers on disk.
A TCP connection state machine consists of a few variables to keep track of sequence numbers and congestion control parameters (no more than 100-200 bytes total), plus the space for send/receive buffers.
A 4 TB SSD would fit ~125 million 16-KB buffer pairs, and 125 million 256-byte structs would take up only 32 GB of memory. In theory, handling 100 million simultaneous connections on a single machine is totally doable. Of course, the per-connection throughput would be complete doodoo even with the best NICs, but it would still be a monumental yet achievable milestone.
Presumably at 100M simultaneous connections the machine CPU would be saturated with setting up and closing them, without getting much actual work done. TCP connections seem too fragile to make it worth trying to keep them open for really long periods.
It's interesting to think about though, I agree. What are the next scaling bottlenecks now that threading is nearly solved (for JVM-compatible languages)?
There are some obvious ones. Others in the thread have pointed out network bandwidth. Some use cases don't need much bandwidth but do need intense routability of data between connections, like chat apps, and it seems ideal for those. Still, you're going to face other problems:
1. If that process is restarted for any reason that's a lot of clients that get disrupted. JVMs are quite good at hot-reloading code on the fly, so it's not inherently the case that this is problematic because you could make restarts very rare. But it's still a problem.
2. Your CPU may be sufficient for the steady state but on restart the clients will all try to reconnect at once. Adding jitter doesn't really solve the issue, as users will still have to wait. Handling 5M connections is great unless it takes a long time to reach that level of connectivity and you are depending on it.
3. TCP is rarely used alone now, it usually comes with SSL. Doing SSL handshakes is more expensive than setting up a TCP connection (probably!). Do you need to use something like QUIC instead? Or can you offload that to the NIC making this a non-issue? I don't know. BTW the Java SSL stack is written in Java itself so it's fully Loom compatible.
I'm pretty sure the exercise was to show the absolute extremes that could be achieved in a toy application, and possibly how easily one could achieve a level of blocking-IO scaling that has been harder than most other tasks in Java of late. More and more, heap allocations are cheap, often with sub-millisecond collector pauses; CPU scaling has more to do with what you're doing than with the platform; and Java has enough tools to make your application fast.
For extremely IO-wait-bound workloads, though, there were always a LOT of hoops to jump through to make performance strong, since OS threads always have a notable stack memory footprint that just doesn't scale well when you could have thousands of OS threads waiting around just taking up RAM.
You're totally spot on that connection establishment is much more challenging than steady state; with TLS or just TCP.
I don't think QUIC helps with that at all. Afaik, QUIC is all userland, so you'd skip kernel processing, but that doesn't really make establishment cheaper. And TCP+TLS establishes the connection before doing crypto, so that saves effort on spoofing (otoh, it increases the round trips, so pick your tradeoffs).
One nice thing about TCP though is it's trivial to determine if packets are establishing or connected; you can easily drop incoming SYNs when CPU is saturated to put back pressure on clients. That will work well enough when crypto setup is the issue as well. Operating systems will essentially do this for you if you get behind on accepting on your listen sockets. (Edit) syncookies help somewhat if your system gets overwhelmed and can't keep state for all of the half-established connections, although not without tradeoffs.
In the before times, accelerator cards for TLS handshakes were common (or at least available), but I think current NIC acceleration is mainly the bulk ciphering which IMHO is more useful for sending files than sending small data that I'd expect in a large connection count machine. With file sending, having the CPU do bulk ciphers is a RAM bottleneck: the CPU needs to read the data, cipher it, and write to RAM then tell the NIC to send it; if the NIC can do the bulk cipher that's a read and write omitted. If it's chat data, the CPU probably was already processing it, so a few cycles with AES instructions to cipher it before sending it to send buffers is not very expensive.
QUIC will help with some things, and make others worse. With QUIC you don't need a file descriptor per connection anymore. A single file descriptor for one UDP socket will be sufficient to handle an arbitrary amount of connections (although you might want more to actually exploit concurrency). That fact will help limiting resources that the kernel uses. However the state that needs to be tracked per established connection is likely way larger than for TCP, due to being a more complex and featureful protocol. E.g. QUIC needs state for tracking sub-streams on a connection, while TCP does not. And of course there's all the mandatory crypto state. I am fairly familiar with QUIC implementations, and made a multitude of changes in various libraries (e.g. Quinn and s2n-quic). I wouldn't be surprised if the baseline memory usage of a QUIC connection in most libraries is > 10x of what the Linux TCP stack requires for tracking a connection
It depends on what you do, but I think GC/memory pressure can become an issue rather quickly with the default programming models Java leads you towards. I end up seeing this a lot in somewhat high throughput services/workers I own where fetching a lot of data to handle requests and discarding it afterwards leads to a lot of GC time. Curious if anyone has any sage advice on this front.
It's easy to just get 4TB of ram if that's what you need; I haven't scoped out what you can shove into a cheap off-the-shelf server these days, but I'd guess around 16TB before you need to get fancy servers (Edit: maybe 8TB is more realistic after looking at SuperMicro's 'Ultra' servers). I think you'd need a very specialized application for 100M connections per server to make sense, but if you've got one, that sounds like a fun challenge; my email is in my profile.
Moving 100M connections for maintenance will be a giant pain though. You would want to spend a good amount of time on a test suite so you can have confidence in the new deploys when you make them. Also, the client side of testing will probably be harder to scale than the server side... but you can do things like run 1000 test clients with 100k outgoing connections each to help with that.
It looks closer to goroutines, which to me begs the question - where are the channels that I could use to communicate between these virtual threads?
Go's channels are, simplistically, a mutex in front of a queue. Java has many existing objects that can do the same; it's just that it hasn't been the idiomatic best choice. Since green threads should wake up from Object.notify(), any threads blocking on the monitor should wake/consume. I'm curious how a green-thread-backed ConcurrentDeque would stand up to Go's channels in scalability/performance.
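For example, a rough sketch of the BlockingQueue-as-channel idea between two virtual threads (no equivalent of Go's select here, as the reply below notes):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ChannelLikeSketch {
        public static void main(String[] args) throws InterruptedException {
            // A bounded BlockingQueue behaves much like a buffered Go channel:
            // put() parks the virtual thread when full, take() when empty.
            BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

            Thread producer = Thread.startVirtualThread(() -> {
                try {
                    for (int i = 0; i < 100; i++) {
                        channel.put("msg-" + i);
                    }
                    channel.put("done");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread consumer = Thread.startVirtualThread(() -> {
                try {
                    String msg;
                    while (!(msg = channel.take()).equals("done")) {
                        System.out.println(msg);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            producer.join();
            consumer.join();
        }
    }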
You are right. But Go channels also come with the superpower of "select", which allows waiting for multiple objects to become ready and atomic execution of actions. I don't think this part can be retrofitted on top of simple BlockingQueues.
no*, and as you've discovered, the skbufs allocated by the kernel will often be the limiting factor for a highly concurrent socket server on linux.
* I don't know if someone has created some experimental implementation somewhere. It would require a significant overhaul of the TCP implementation in the kernel.
Does Linux actually allocate buffers for each socket or does it just link to sk_buff's (which I understand are similar to FreeBSD's mbuf's) and then limit how much storage can be linked? FreeBSD has a limit on the total ram used for mbufs as well, not sure about Linux.
Otoh, FreeBSD's maximum FD limit is set as a factor of total memory pages (edit: looked it up, it's in sys/kern/subr_param.c, the limit is one FD per four pages, unless you edit kernel source) and you've got 2M pages with 8GB ram, so you would be limited to 512k FDs total, and if you're running the client on the same machine as server, that's 256k connections. But 8G is not much for a server, and some phones have more than that... so it's not super limiting.
When you're really not doing much with the connections, userland tcp as suggest in a sibling, could help you squeeze in more connections, but if you're going to actually do work, you probably need more ram.
Btw, as a former WhatsApp server engineer, WhatsApp listens on three ports; 80, 443, and 5222. Not that that makes a significant difference in the content.
I think the socket buffers (sk_buff) are actually shared. They are all packet sized, and whatever socket needs to transmit some data or receives it gets the buffers attached. So my assumption is that the amount of required socket buffers scales more with the amount of data transmission than with the number of sockets.
But independent of socket buffers, the kernel obviously needs to allocate other state per socket, which tracks the state of the TCP connection.
I'm not a java programmer. I tried clicking 3 layers deep of links, but still have no idea what virtual threads are in this context. Is it a userspace thread implementation?
I've used explicit context switching syscalls to "mock out" embedded real time OS task switching APIs. It's pretty fun and useful. The context switching itself may not be any faster than if the kernel does it, but the fact that it's synchronous to your program flow means that you don't have to spend any overhead synchronizing to mutexes, queues, etc. (You still have them, they just don't have to be thread safe.)
I see a lot of these making the FP of HN. But it's very difficult to be impressed, or unimpressed because it's all about hardware. How much hardware is everybody throwing at all of this? 5M persistent connections on a Pi with mere GigE? Pretty frickin' amazing. 5M persistent connections on a Threadripper with 128 cores and a dozen trunked 4 port 10GE NICs? Yaaaaawwwnnn snooze.
We need a standardized computer for benchmarking these types of claims. I propose the RasPi 4 4GB model. Everybody can find one, all the hardware's soldered on so no cheating is really possible, etc. Then we can really shoot for efficiency.
This isn't about the hardware, it's about thread count.
There are limits in the linux kernel, and the 5m concurrent connections was chosen to exceed it.
From what I remember (my knowledge is ancient though), a Java thread consumes a pid_t in the linux kernel. By default this is limited to 64k. However, this can be increased by setting a flag in the kernel, to a maximum 2^22 or 4m.
In order to have more than 4m connections, the existing Java code either needs to be changed to be event driven, or it can't use kernel threads.
Event driven code is very different. It's very powerful, but it is very easy to get lost. Think writing Java code that looks like a Makefile with dependencies or "andThen" everywhere, and everyone having to make sure everything is threadsafe. Thread safety is hard for large teams with high qps services - deadlocks can bring down a service.
If a developer can write "regular" non-re-entrant Java code and still get the concurrent connections? Win all around.