Real-time messaging (slack.engineering)
173 points by fagnerbrack | 2023-06-23 15:04:01 | 90 comments




Kind of funny timing because I've noticed my messages in Slack have been quite a bit more delayed than usual for the last few months, especially when moving between devices. Maybe they need more "channel servers"

They're preparing you for the real-time messaging experience when you're working remotely from Mars.

Noticed this today, too. I turned on a computer I haven’t logged in to in 2 months, and clicking a channel took at least 5 seconds to open/load. And then it worked normally after that.

Same here, all channels took 5 seconds to load. I've used Slack on that computer recently though.

I've noticed a big performance drop in the last few weeks as well.

Everything has gotten more sluggish.


There was a performance issue with channel loading, documented here: https://status.slack.com/2023-06-23

Agreed, something changed that made Slack usage across multiple devices much worse. Loading a channel tends to sputter as various state and messages load, messages I already read on different devices change state, it's pretty jarring to watch. The "unread messages" badge is often out of sync.

Edit: if you reach out to support, they can reset stuck state for you. They were very helpful but didn't sound sure it would work.


This seems complex, and not necessarily in a good way

> They replace unhealthy CSs very quickly and efficiently

Maybe fixing the bugs so that they don't mysteriously get ill would be a thing to pursue?


Who suggested they're not doing that already?

I'm sure they have whole teams dedicated to that, but even perfect software would still need these sorts of health checks at the scale slack operates. Sometimes you just have random failures in hardware or the OS, or you need to migrate software around the datacenter without bringing down the service. This is a generic way to route around all of that.

Silly engineers, implementing things like failovers and redundancy. Why can't they just ensure that their servers never crash?

More like, silly engineers relying on the fact that they have failover, so they don't bother to fix the actual issue.

But sure, you get a medal for being a smartass online :)


Why would an asynchronous messaging platform worry about real-time messaging? It seems rather counter to what the platform is meant for.

> It seems rather counter to what the platform is meant for.

Well that's just silly, if you've ever used Slack. It's corporate chat rooms, with real-time chat being a fundamental feature of it.

> counter to what the platform is meant for.

The platform is meant for communication. Async communication is one (probably smallish, at least in my experience) slice of that.


No you don't understand, the GP uses Slack asynchronously so if you don't use Slack asynchronously you're doing it wrong.

I wonder if Slack engineering and its users would benefit from tick/queue based messaging vs real time. Say messages are only exchanged every minute.

> Say messages are only exchanged every minute.

One of the primary uses for slack, in our org, is live support. Having people sitting around, waiting for timers to expire, could never be a benefit for us.


Why would that be at all beneficial?

"We have this very successful, fairly technically advanced as-real-time-as-is-practicable messaging system that most of our users use for real-time communication. We spent large amounts of engineering resources, time, and money to make it not real-time."


[dead]

A dumb pet peeve I have is the use of the term "real time".

What on earth do you even mean by that?

Talk to a controls engineer about real time operating systems and PLC programming, and you get a very solid definition. Talk to a software developer and it means something like "as fast as we can, without any purposeful delays or buffers".

Maybe I'm just a pedant, but the entire article has graphs with no numbers, no defined service level objectives, and vaguely mentions 500ms at the end.

What is real time?


I’d say anything that users perceive as instantaneous.

And even that depends very much on the type of interaction, e.g. tactile feedback vs. visual.

What kind of tactile feedback is there in Slack?

Knock and brush

"interactive" is 10ms, everything above that is just fluff

> Talk to a software developer and it means something like "as fast as we can, without any purposeful delays or buffers"

No, "real time" in a software context means having a persistent stateful TCP/UDP connection open between a client and a server (or two clients in case of WebRTC and the like) to exchange data, skipping over paradigms like having to establish new HTTP connections or being blocked on database fetches. This is a well-established definition, and has nothing to do with any specific millisecond threshold.


Real time means that there is a requirement on maximum time (end-to-end latency, photon to motion, whatever), and that the worst case behavior of the system meets the requirement.

That's a fantastic definition. Now does the article explain what those maximums are, or how often they meet them?

There's "hard real time", where a system must respond within a specific time (eg. vehicle or rocket control, or PLC systems you mention, etc) and "soft real time" meaning the response should be quick but nothing catastrophic will happen if it doesn't.

Voice, video, and even text chat fall into this category. You want minimal latency, but nobody will die if the screen freezes for a bit.

It's customary to talk about these systems as "real-time communication" and nobody working in the field would confuse it with hard real-time.

More info and examples: https://en.m.wikipedia.org/wiki/Real-time_communication


I came out of college with a (not computer) engineering degree, and went into software. I totally flunked a software interview one time early in my career because they asked me how to implement "a real-time clock UI," and I spent the whole time asking questions trying to understand what they meant by "real-time," rather than just implementing a basic clock UI with 4 numbers, a colon, and a 1-second timer updating the UI, and calling it a day.
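
For what it's worth, something like this sketch is presumably all they wanted (my guess at the expected answer, not what the interviewer actually specified):

```typescript
// A basic clock UI: four digits, a colon, and a 1-second timer updating it.
function renderClock(el: HTMLElement): void {
  const tick = () => {
    const now = new Date();
    const hh = String(now.getHours()).padStart(2, "0");
    const mm = String(now.getMinutes()).padStart(2, "0");
    el.textContent = `${hh}:${mm}`;
  };
  tick();                  // paint immediately
  setInterval(tick, 1000); // then refresh every second
}
```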

I would say they failed the interview rather than you

500ms is definitely not real time; even dialup can do better than that. My ping is often 5ms.

The word real-time has more than one meaning. This is about real-time chat, where the message is expected to be read immediately, as opposed to something like email where the message is expected to sit in a buffer for a while before being read.

You wouldn't ask for the deadlines of a video game speedrunner talking about "real time", either. That means real wall-clock time rather than time displayed by the video game (differences can arise due to e.g. pausing)


Lots of people being negative about this, but if you've ever implemented anything that works in near-real-time at wide scale, most of this design makes sense and it works great.

One thing interested me: Why the difference in pathing between events and messages? I think the event flow makes sense, but why not have messages also go through your gateway server instead of through webapp? Surely there is needless latency there when you already have an active websocket open to gateway? I thought perhaps it was because your gateway was for egress-only, but then the events section made it clear you also handle ingress there.


My guess: it’s persisted mutations that need their own retry and deduplication logic, as well as user-facing error handling. Since it hits the DB in the main region anyway, the latency is similar.
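
Something like client-supplied message IDs would do it (a guess at the mechanics; the post doesn't spell this out):

```typescript
// Guessed dedup scheme: the client attaches a unique ID to each send,
// so a retried request for the same message becomes a no-op.
const applied = new Set<string>();

function persistMessage(clientMsgId: string, body: string): "stored" | "duplicate" {
  if (applied.has(clientMsgId)) {
    return "duplicate"; // a retry of a send that already succeeded
  }
  applied.add(clientMsgId);
  // ... write `body` to the database in the main region ...
  return "stored";
}
```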

Yeah, it's a pretty cool design. Discord's real-time system operates on a similar (although much more complicated) system on top of hash rings as well.

Interesting architecture. Asynchronous message passing at scale becomes quite a PITA very quickly. Wonder how scalable the code base really is. Probably Java isn't that bad of a choice considering the tooling although it's not my personal favourite.

Seems like quite the ideal use case for Elixir/Erlang, but since I haven't used it myself, I don't know whether it would be that much better, especially training/developer-pool-wise.

And Vitess/MySQL for persistence? No Cassandra or something similar?


What makes you recommend Cassandra here over the other?

MySQL can scale extremely well if you tune and use it correctly - Twitter was still backed primarily on MySQL at least through Obama’s second election, and I’m pretty sure for some time after that.

Discord uses Elixir. They're fans.

https://elixir-lang.org/blog/2020/10/08/real-time-communicat...

Notably, with regards to the hiring pool:

> None of the chat infrastructure engineers had experience with Elixir before joining the company. They all learned it on the job.


It is definitely an area in which Elixir excels. My personal feeling, however, at this advanced age of late 30s, is that I just can't do dynamically typed languages for large projects anymore. A statically typed Elixir would be a dream I think. There is https://gleam.run/, but I don't know how mature it is.

There is active work being done currently to add set-theoretic types to the core language. https://www.youtube.com/watch?v=gJJH7a2J9O8

Oh wow I wasn't aware. Thanks for the link!

Slack started out on the LAMP stack, so a lot of their technology choices stem from there. There's a mention of using Hack in this article, which is a PHP dialect. People frequently reach for things more complicated than MySQL, but the truth is that there are likely very few use cases that won't work with MySQL.

Vitess makes MySQL manageable at scale (YouTube uses it too) as long as you invest in getting the tooling & automation ready. Vitess itself is fairly straightforward to understand and operate. It's ultimately just MySQL plus a sidecar service, a query layer that handles reading/writing to shards for you, and some other admin tools for managing shards, migrations, etc. If you need to debug something you can just SSH into a host and connect directly to the MySQL instance. It uses standard MySQL replication and you can tune it the same way you would tune MySQL normally. There's no shortage of resources available when it comes to dealing with MySQL.
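
Since vtgate speaks the MySQL wire protocol, an app talks to a Vitess cluster like any other MySQL server. A sketch using the `mysql2` package (host, port, user, and schema here are all hypothetical; the port varies by deployment):

```typescript
import mysql from "mysql2/promise";

async function main() {
  // vtgate looks like a single MySQL server; the keyspace is addressed
  // like an ordinary database. All connection details here are made up.
  const conn = await mysql.createConnection({
    host: "vtgate.internal.example.com",
    port: 15306,
    user: "app",
    database: "messages", // a Vitess keyspace
  });

  // Vitess routes this to the right shard behind the scenes.
  const [rows] = await conn.query(
    "SELECT id, body FROM message WHERE channel_id = ?",
    ["C123"]
  );
  console.log(rows);
}

main();
```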

> YouTube uses it too

Pretty sure YouTube migrated onto Spanner quite a while ago.


Vitess increases the number of things that can go wrong in a high traffic MySQL cluster by an order of magnitude. When you need it you need it but “straightforward to operate” only describes vitess in the most abstract of theoretical happy path situations.

That's true for any database at that scale

Seems to me like their Channel Servers pretty much do the work of what Cassandra would do, with consistent hashing and their Consistent Hash Ring Managers, although it's not detailed how the Channel Servers persist their data, and whether, like Cassandra, they replicate to N Channel Servers. It looks like a specialized variant of how Cassandra works for their use case, possibly without all the drawbacks of the underlying data structure Cassandra uses for persistence (SSTables), like needing the exact ordering of clustering keys when performing queries or dealing with tombstones.
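
For anyone unfamiliar, consistent hashing in this context works roughly like this (a minimal sketch of the general technique, not Slack's actual Channel Server code):

```typescript
import { createHash } from "crypto";

// Minimal consistent hash ring: channels map to the first server
// clockwise from their hash. Virtual nodes smooth the distribution,
// and adding/removing a server only moves its slice of channels.
class HashRing {
  private ring: { point: number; server: string }[] = [];

  constructor(servers: string[], vnodes = 64) {
    for (const server of servers) {
      for (let i = 0; i < vnodes; i++) {
        this.ring.push({ point: this.hash(`${server}#${i}`), server });
      }
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  private hash(key: string): number {
    // First 4 bytes of an MD5 digest as an unsigned 32-bit ring position.
    return createHash("md5").update(key).digest().readUInt32BE(0);
  }

  // Pick the channel server responsible for a channel ID.
  serverFor(channelId: string): string {
    const h = this.hash(channelId);
    const node = this.ring.find((n) => n.point >= h) ?? this.ring[0]; // wrap around
    return node.server;
  }
}

// const ring = new HashRing(["cs-1", "cs-2", "cs-3"]);
// ring.serverFor("C123"); // -> e.g. "cs-2", stable across calls
```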

I could see these Channel Servers using something like SQLite for persistence, which would allow the full suite of SQL features without any of the drawbacks of CQL, of which there are many.

The blog post felt a bit high-level and lacking some technical/architectural details that would've given more insight, but nonetheless it was an interesting read!


Something like this came to my mind, but I was unable to put it into words! Good that you got it.

> Wonder how scalable the code base really is.

Are you for real? This isn't a hypothetical codebase. This is Slack, used by millions of people.

I get it, they're using boring tech, but come on.


It’s a valid point. We might discover FTL and suddenly slack has to work at the federation level and serve trillions of people. This is such a poor design and won’t scale at that point.

[dead]

You can scale even a stack made out of machine code if you just keep throwing money at it. I guess I should have phrased the question as: how maintainable and tech-debt-free is the code base? That "scalable" refers to how many people are served is apparent from the context; I didn't think there was ambiguity in that.

Is there anything fundamentally unique about this architecture?

AOL chatrooms in 1995 too were blazingly real-time even on dialup. Anyone ever use progz/scrollers? With an OH (overhead) account you could flood rooms with thousands of messages a second. Rich text, not plaintext. And it'd appear in super fast "real time" to everyone in the chat.


The biggest difference is that if you shoved a million active users in a single chatroom back then it would have crashed all of AOL. The same problem can have vastly different solutions at different scales.

Huh? Where have you ever seen a Slack channel with a million active users?

100K – https://slack.com/blog/news/top-u-s-retailer-h-e-b-rolled-ou...

350K – https://www.theverge.com/2020/6/4/21280829/slack-amazon-aws-...

The Kubernetes Slack has close to 200K users right now.

Amazon has 1.5 million employees, and I'm sure the majority are on Slack.

Forget 1995, no service other than Slack can run group chats at this scale even today.


> I’m sure the majority are on Slack

80% of Amazon’s employees are warehouse workers and drivers.


Which still leaves >300K corporate employees.

[flagged]

Nothing. Not the geographically distributed customers, (unlike America Online), not the fact that this works both on the web and on mobile devices, nor the fact that mobile devices can't open persistent background connections due to power requirements, not the much larger scale, nor the fact that latency characteristics over dial-up modems were much, much higher than they are now, not even the fact that organizations get their business done using this architecture while AIM was mostly (not always) used to just chat with friends and strangers casually. No, nothing at all.

But really is the internet actually doing anything new? My grandfather used to send telegrams overseas all the time. Does anyone remember teletype? How is that any different than us chatting with each other here on HN? Phaw, tech bros just keep reinventing the wheel.


> Not the geographically distributed customers, (unlike America Online)

AOL existed outside of America.

One of my fondest memories as a would-be computer hacker in the early noughts was sitting in a BT phone box in the UK at 3am, using the lineman opcode for free local calls to connect to my nearest AOL POP, so I could use one of their trial CDs with a fabricated or stolen credit card (I deactivated the trial before the first payment; I was not a total thief).


Sure, but your experience is proof of how different the experience was. You had to call into an AOL POP using POTS and use a trial CD and some proof of payment (a credit card) to perform this action. I can send my friend/coworker in Japan a Slack invite to their email and they can get started chatting on their smartphone in a minute or two. As I said, how is any of this really different than my grandfather sending telegrams? Or his father sending mail through the post?

I was making the argument that you made an assumption that didn't hold any weight.

Regarding doing what Slack can do, I think we do have a bit of collective amnesia about what was truly good and truly bad about what came before.

Slack does a lot of things, but its UX is actually pretty piss poor due to the bolt-ons (threads is just about the worst implementation I have ever seen), walled garden, weird account things (magic links~ if you want to see your other workspaces make sure you use the same email everywhere!), security (goodbye decent realtime bot api) and the client which feels slow despite using more resources than it should given its status as a program that needs to always be running.

To take one example of a worse system: IRC scales to tens of thousands of users on a single machine; it doesn't have emoji or picture sharing, but the bot API is simpler for sure, and you can connect in one click if the person has an IRC client (or you pass a webirc link) - no signup needed.

I think the problem with Slack is exacerbated in Europe (or the rest of the world in general), where the latency of fetching thousands of tiny assets or chats is many, many times higher than in California.

Also, the reason I had to do that was because my mum wouldn't permit a home internet connection; broadband was a thing at the time, and dial-up was not mandatory in 2002. IRC over dial-up was faster than Slack is today, too; you can argue that Slack “does more” though.


> I was making the argument that you made an assumption that didn't hold any weight.

I claimed multiple things. Unless you think the single assumption being wrong in magnitude (I still maintain the majority of their customers were American) completely upends the entire thesis, you're trying to treat this as some counterexample, which... doesn't work here.

> Slack does a lot of things, but its UX is actually pretty piss poor due to the bolt-ons (threads is just about the worst implementation I have ever seen), walled garden, weird account things (magic links~ if you want to see your other workspaces make sure you use the same email everywhere!), security (goodbye decent realtime bot api) and the client which feels slow despite using more resources than it should given its status as a program that needs to always be running.

I'm not discussing the UX here; I'm trying to talk about what Slack is doing differently than AIM.


The overarching software distribution and installation experience has nothing to do with my original question around the networking architecture.

Once you were online with AOL, it was extremely high speed. Virtually instantaneous chatroom messaging with hundreds of users in a given chatroom.


What is "extremely high speed"? Do you have latency measurements from back then? I'm in Slack chatrooms with thousands of users. Your argument is essentially "it felt hella fast back then so what changed". Feelings are hard to dispute.

Mobile devices matter because they cannot maintain persistent connections. This by design necessitates a different architecture. AIM and IRC can have a single machine hold open a TCP connection and send and receive messages. A mobile device's modem wakes up briefly, performs some network chatter, then goes back to sleep. It's a fundamentally different requirement on the way the system is designed. IRC has evolved (through add-ons, not in the protocol itself) to shim this by using bouncers or having some sort of push notification based proxy for messages.
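
In server pseudocode the split looks something like this (all names here are invented; just to illustrate why the two delivery models force different designs):

```typescript
// Delivery must branch on connection model: pull (open socket) vs push.
// All types and functions below are invented for illustration.
type Delivery =
  | { kind: "socket"; send: (payload: string) => void } // desktop/web: persistent connection
  | { kind: "push"; deviceToken: string };              // mobile: radio mostly asleep

function deliver(target: Delivery, message: string): void {
  if (target.kind === "socket") {
    target.send(message); // pushed immediately over the open connection
  } else {
    enqueuePushNotification(target.deviceToken, message); // via an APNs/FCM-style service
  }
}

function enqueuePushNotification(token: string, message: string): void {
  // Stub: a real system hands this off to a platform push gateway.
  console.log(`push -> ${token}: ${message}`);
}
```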

The other question is, how reliable was AOL? I have no idea how many outages they had, how long those outages lasted, nor what the impact was. Given the service was meant to be used as entertainment, I suspect their SLA was one of best effort rather than the measured kind that Slack is leaning into (wrt using multiple CSes and using consistent hashing to pick which CS to send a message to, so the architecture can be scaled as needed).

I don't work for Slack but I work on high scale systems for a big company for a living. I'm pretty aware of the limitations of networks nowadays and the differences between systems in the dial-up era from now. Just the fact that Slack accommodates devices with push based semantics (mobile devices) and pull based semantics (anything that can open a persistent connection) is itself a large difference between systems of old and systems today.

Btw the TOC and Oscar protocols that AOL used have been reverse engineered and documented online so you can see for yourself at least at a protocol level what's different.


> Mobile devices matter because they cannot maintain persistent connections.

This isn't really a property of the device. It's an OS limitation. Nokia Series 60 didn't have a platform push service, so WhatsApp had to maintain a connection from the app to the servers to get messages without delay; it was not unusual to find Series 60 phones connected for 30+ days when I worked at WhatsApp. The newest versions of Nokia Series 40 had push, but older versions had to long-connect as well. Some mobile networks aren't great at letting connections work for longer times either, but some can.

But yeah, Apple only lets Apple keep connections open. Android doesn't have hard and fast rules, but background connections are likely to die; you should use push if you can. The push service's connection is going to try to stay connected for a long time, though.


Yeah I don't disagree that these are OS/platform limitations. I'm just saying that this is a factor that makes the architecture different.

Though it is quite interesting to consider a world where mobile platforms made different choices.


BlackBerry had a working push system without hanging TCP connections. You made an HTTP call to BlackBerry, who then sent a UDP packet over the network direct to the device. It's a shame we've lost this architecture.

Huh interesting. How did the remote know the endpoint to send the UDP packet to? I would expect NAT and firewalls would make it hard to route inbound UDP? Did they do some sort of STUN style negotiation through HTTP?

BlackBerry operated its own GPRS APN, so all BlackBerry devices registered directly with it from the tower, and so it always had their up-to-date location on the network.

Consul, Envoy, websockets. These are architectural components called out in the article. Are you saying that AOL used this same architecture and you don’t see anything “fundamentally unique”?

If you’re using the word architecture even though you mean product functionality, you can simply go to https://slack.com/help/articles/115003205446-Slack-plans-and...

Look at the functionality listed there and see how many things AOL supported which Slack does not and vice versa.


I’m curious about the actual message passing implementation(s) here - there’s a lot of “CS passes a message to all the GSs” - what’s that look like? Is it a general pubsub system? Did they roll their own? What’s the actual machinery here?

This reminds me, a few months ago, a "real-time" chat app appeared on HN which would display your typing to a public chat room before you hit Enter. And people found ways to inject XSS into usernames, including images and JS popups. Does anyone remember the name and URL of the site? I think it would make an intriguing alternative to voice calls, which is more synchronous/immediate/interactive than a conventional IM app where the other person can only start reading a sentence once you've finished typing it.

> I think it would make an intriguing alternative to voice calls

'talk' was an everyday tool back in the 1980s and early 90s that sent characters as they were typed. I just checked and it's installed on my Mac today. It's person-to-person, not group chat, but there's something magical about seeing every keystroke in near-real-time. You can feel the person at the other end.


VAX/VMS had (has?) a talk program that allowed multiparty chat, at least if my memory serves me correctly.

As I recall, ytalk allowed for larger groups, although it gets messy pretty quick.

That was a standard chat feature back in the day, certainly with AIM and probably ICQ.

It's fine if you don't care enough to edit what you say, but if you do, it requires composing the sentence in your head first.


Like speaking?

Meh. Realtime pub/sub at scale is easy as pie. I wrote a WebSocket client/server toolkit which can handle unlimited users with automatic channel sharding and auto-scaling on Kubernetes as a weekend project. It was too easy. I made it open source and for some reason it only ended up being used by some crypto companies and a few adult chat websites though one of them said they were handling 120K concurrent users with a couple of instances.

Sounds fun. Is the repo still up?

Did you actually test it with unlimited users?

I tested it with hundreds of thousands of concurrent connections running on multiple hosts yes.

Thousands is nothing. A single thread can handle thousands. You don't need any of this sharding or Kubernetes stuff. I'd wager 120K connections can be handled by one fast instance.

Above that, you have basic sharding techniques like assigning each channel to an instance. This should work up to some thousands of moderately active chatters in a single channel - at which point nobody can read all the messages anyway so it's not working for non-technical reasons.
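
i.e. the naive version is just this (a sketch; instance names are made up, and unlike consistent hashing, plain modulo reshuffles channels whenever the instance list changes):

```typescript
// Assign each channel to one instance by hashing the channel ID.
const instances = ["chat-0.internal", "chat-1.internal", "chat-2.internal"];

function instanceFor(channelId: string): string {
  let h = 0;
  for (const ch of channelId) h = (h * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  return instances[h % instances.length];
}

// Every subscriber of a given channel lands on the same instance:
// instanceFor("general") always returns the same host.
```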

Slack is trying to solve the problem of having thousands of channels with millions of people, as well as millions of channels with thousands of people. In other words, about 6 orders of magnitude bigger than your load test.


I was limited by time and budget there. My solution is open source after all, and embarrassingly parallel, so it can potentially handle sharding across thousands of hosts. If I remember correctly, my calculations suggested that a cluster with 2k hosts could handle 40 million concurrent users if you push a message to every user every 5 to 10 seconds with some randomness.

The per-channel limit at that rate (1 message to every user every 5 to 10 seconds) was 20K concurrent connections, but the number of channels is unlimited since you can add more capacity by adding more hosts, and the sharding of channels across available hosts is automatic.

The only limit was the number of hosts, since there is only one coordinator instance in the cluster, though the cluster can continue to operate without any downtime while the coordinator instance is down. The coordinator instance only needs to be up while the cluster is scaling up or down. That said, it should be able to handle up to 5k hosts, potentially even much higher.
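
For what it's worth, those numbers hang together as back-of-the-envelope arithmetic (using only the figures from this thread, with the midpoint of the 5-10s push interval):

```typescript
const hosts = 2_000;
const totalUsers = 40_000_000;
const usersPerHost = totalUsers / hosts;                 // 20,000 connections per host
const pushIntervalSec = 7.5;                             // midpoint of "every 5 to 10 seconds"
const pushesPerHostPerSec = usersPerHost / pushIntervalSec; // ~2,667 messages/s per host
console.log({ usersPerHost, pushesPerHostPerSec });
```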


I'm seeing a problem: gateway servers have to subscribe to channels, so they need to know which users are in which channels, but they're not segregated per channel, so this load might not distribute well. Gateway servers could have been made more stateless if the channel server sent a list of which user IDs to broadcast a message to.
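
Sketched out, that stateless-gateway version would look something like this (everything here is invented to illustrate the suggestion, not how Slack actually does it):

```typescript
// The channel server computes the recipient list, so the gateway only
// needs a userId -> socket map and no channel membership state.
interface Broadcast {
  channelId: string;
  userIds: string[]; // resolved by the channel server
  payload: string;
}

const socketsByUser = new Map<string, { send: (p: string) => void }>();

function onBroadcast(b: Broadcast): void {
  for (const userId of b.userIds) {
    // Users connected to other gateways are simply absent from this map.
    socketsByUser.get(userId)?.send(b.payload);
  }
}
```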
