Seems like any key value store would be better for this. Redis, DynamoDB, etc.. could get a lot higher throughput on messages.



Nice article. One thing you should talk about, in my opinion, is systems like Redis. Redis can be used as a generalized key/value store, but it can also be used as a messaging platform. In fact, many systems like Storm (which would be another great topic) have easy integration with Redis pub/sub. While Redis and other solutions like it are probably not a good fit for all your data, they are great for mixed supporting data, caching, and messaging. There are also really nice integrations if you like Java; these days you can make Spring message-driven beans consume messages from a Redis pub/sub very easily with minimal configuration. ZeroMQ is probably another technology worth discussing, either with a layer like Storm on top or by itself. Not every messaging system has to be heavyweight and cumbersome like the good ole' days.
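For a concrete picture, here's a minimal Redis pub/sub sketch in Python. It assumes the redis-py client and a Redis server on localhost; the channel and message are made up for illustration. Note that plain pub/sub is fire-and-forget: subscribers that aren't connected when a message is published never see it.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Publisher side: fire-and-forget broadcast to the "events" channel.
    r.publish("events", "user:42 logged in")

    # Subscriber side (normally a separate process): block and consume.
    p = r.pubsub()
    p.subscribe("events")
    for message in p.listen():
        if message["type"] == "message":
            print(message["channel"], message["data"])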

I implied just a sharded database of messages, with no need to replicate it. With customizable filters along the data-transfer path, of course.

Any distributed storage option? Asynchronous chat-style messaging is rarely enough for social networking.

As an aside: if you're not willing to make that kind of substantial change to your system at this point, I would at least recommend basing your messaging system on something that's designed for it (i.e. a message queue) as opposed to a relational database. RabbitMQ has had quite good reviews from what I've seen, with clustering and persistence available.
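As a rough illustration of the persistence options mentioned above, here's a minimal publishing sketch using the pika Python client; the queue name and message are made up. durable=True and delivery_mode=2 are the two knobs RabbitMQ exposes for surviving a broker restart.

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()

    # durable=True: the queue definition survives a broker restart.
    channel.queue_declare(queue="messages", durable=True)

    # delivery_mode=2: this particular message is written to disk.
    channel.basic_publish(
        exchange="",
        routing_key="messages",
        body="hello",
        properties=pika.BasicProperties(delivery_mode=2),
    )
    conn.close()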

IMO I'd resort to an ACID datastore if permitted, but I'm curious whether there are common patterns that enable this with messaging systems.

If you care to elaborate, I'm curious: what alternative(s) would you recommend instead of one transaction per message, and why?

I think having a message broker in between the services, plus some sort of transient database, should solve this, in a way.
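One common pattern along these lines is the transactional outbox: the business write and the outgoing message are committed in the same database transaction, and a separate relay drains the outbox into the broker. A minimal sketch using sqlite3 from the standard library; the table names and the broker call are illustrative, not from this thread.

    import sqlite3

    db = sqlite3.connect("app.db")
    db.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, item TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

    # The business write and the outgoing message commit atomically.
    with db:
        db.execute("INSERT INTO orders (item) VALUES (?)", ("widget",))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   ('{"event": "order_created"}',))

    # A separate relay process drains the outbox into the broker, deleting
    # rows only after the broker acknowledges them.
    for row_id, payload in db.execute("SELECT id, payload FROM outbox").fetchall():
        # broker.publish(payload)  # hypothetical broker call
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
    db.commit()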

I think this is a disingenuous or naive view. While you may not have been serious, it's worth unpacking why it's naive, as a lot of engineers hold similar opinions.

To start, there's nothing here about what machine this architecture runs on; it could be running on a Raspberry Pi for all we know.

Then, this ignores the cost of database lookups for the keys. That data is probably small enough to fit on one machine, but then you need support for reliably syncing that data to the service in real time. A separate database is therefore probably the right solution here, which means every message send involves a network round trip, making it unlikely that a Pi could do this.

Next up, you've got the issue of reliability. The message-queue separation gives you better resilience in the face of problems such as upstream APIs going down or erroring, or issues with a specific user. All the business logic around handling this, plus the message queue providing persistence and ACID semantics (or parts of them), takes additional resources, not to mention potentially a fair bit of disk space (for a Pi) to queue up undelivered messages should an upstream API slow down or stop accepting new messages.

Then you have hardware failure: at this scale you don't want a single machine failure to wipe out your primary communication channel with millions of customers. You'd therefore want a distributed system, even if that's only for reliability rather than performance.

Lastly, 21 million clock cycles might sound like a lot, and might go a long way with C/C++/Rust, but as you move up to more dynamic languages that budget shrinks significantly. It happens that they are using Go here, which is likely to get pretty good performance out of the hardware, but writing this service in Python or Ruby would be a perfectly valid choice for developer productivity, or based on the skills the team already has. That might be a tenth of the performance, but since you need a distributed system for reliability anyway, adding a few more machines to the pool might be a better choice than introducing a lower-level language that takes longer to develop in and exposes you to memory-safety or threading issues.

There may well be other factors I haven't considered here, but I think for the use case of delivering that scale of messages, the reliability options you get with a system like this are well worth the additional hardware requirements and architecture overhead.

Edit: lmilcin makes a good point about trading systems, but there are several differences – that system still has a single point of failure, it was probably written in a low level language with input from experts on performance, and it was running on a much faster machine. The single point of failure of a server-grade machine like that is probably an acceptable risk if you own the hardware, but in a cloud environment (which brings other benefits) hardware is less reliable so probably not an acceptable risk there. I don't think it's an apples-to-apples comparison, although it is interesting.


You might want to check out Amazon SQS for this. It has all the advantages you're looking for (transactional, guaranteed not to lose messages, zero administration, crazy-simple API with solid client libraries in every conceivable language), with the only possible downside being that it's hosted externally.

It's priced so cheaply as to be essentially free, though you'll need to give it a credit card so that it can bill you eleven cents a month or whatever.

http://aws.amazon.com/sqs/

I've been using it for a few years now with good success.
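For a sense of how small that "crazy-simple API" is, here's a minimal sketch with the boto3 Python client. It assumes configured AWS credentials and an existing queue; the queue name and region are placeholders.

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.get_queue_url(QueueName="my-queue")["QueueUrl"]

    sqs.send_message(QueueUrl=queue_url, MessageBody="hello")

    # Receive with long polling; messages must be deleted explicitly,
    # otherwise they reappear after the visibility timeout.
    resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        print(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])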


My suggestion is to just use RabbitMQ. It's written in Erlang and uses a BerkeleyDB-like backend for message storage. It's non-distributed and "durable", with optionally persistent or non-persistent messages. It has a web interface for examining messages. My second suggestion is to use JSON for your messaging format, although for basic tasks it's possible to put all the info you need in the headers.
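Here's a sketch of those two formatting options, again assuming the pika client and a local broker; the payload fields are invented for illustration.

    import json
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="tasks", durable=True)

    # Option 1: a structured JSON payload in the body.
    channel.basic_publish(
        exchange="",
        routing_key="tasks",
        body=json.dumps({"action": "resize", "image_id": 42}),
        properties=pika.BasicProperties(content_type="application/json"),
    )

    # Option 2: an empty body, with everything in AMQP headers.
    channel.basic_publish(
        exchange="",
        routing_key="tasks",
        body=b"",
        properties=pika.BasicProperties(headers={"action": "resize", "image_id": 42}),
    )
    conn.close()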

I don't see how this applies to messaging systems at all.

It's actually harder to implement a real-time messaging system with a database or key/value store than it is to use something like RabbitMQ (messaging) or Beanstalkd (for jobs).
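The core of the difference is push versus poll: with Beanstalkd a consumer just blocks until a job arrives, whereas with a database you'd be writing the polling loop yourself. A minimal sketch, assuming the beanstalkc Python client and a beanstalkd daemon on the default port:

    import beanstalkc

    queue = beanstalkc.Connection(host="localhost", port=11300)

    # Producer: enqueue a job.
    queue.put("send-email:42")

    # Consumer: reserve() blocks until a job is available; delete() acks it.
    job = queue.reserve()
    print(job.body)
    job.delete()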


On the server side, of course. My bet is that messages are indexed for efficient querying.

Could this be used as an alternative to Amazon SQS?

[Edit] Now that I'm having a little coffee, I realize that you would have difficulty knowing the keys of the queued messages. I guess you could use a set of known keys and cycle through them looking for messages, but why be a cheapskate?
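For what it's worth, if the store were Redis, the "set of known keys" idea maps directly onto lists plus BLPOP, which blocks across several keys instead of busy-polling. A sketch assuming redis-py; the key names are made up:

    import redis

    r = redis.Redis()

    # Producer pushes onto one of the known queues.
    r.rpush("queue:email", "message-1")

    # Consumer blocks on all known queues at once; returns (key, value).
    key, value = r.blpop(["queue:email", "queue:sms"], timeout=0)
    print(key, value)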


Sounds like a great project. How do you handle performance with large message sizes?

Agreed, there are a lot of more robust products already in this space.

My biggest concern is its reliance on MySQL. There is no way this could be a valid option for high-volume messaging when it is essentially database-as-a-queue.
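To make the concern concrete, database-as-a-queue boils down to consumers racing each other with poll-and-delete loops, which is exactly where contention shows up at volume. A toy sketch using sqlite3 (the table is illustrative; the same shape of problem applies on MySQL):

    import sqlite3

    db = sqlite3.connect("queue.db")
    db.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("INSERT INTO queue (body) VALUES (?)", ("hello",))
    db.commit()

    # Every consumer polls and contends for the same rows; a broker would
    # push messages to consumers instead.
    row = db.execute("SELECT id, body FROM queue ORDER BY id LIMIT 1").fetchone()
    if row:
        db.execute("DELETE FROM queue WHERE id = ?", (row[0],))
        db.commit()
        print("processed:", row[1])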


No, but what is the fast schema you have in mind that makes it hard to move messages? Perhaps fast and easily movable messages are not mutually exclusive.

Storing messages isn't really a hard problem either, though.

I think it is a bit harder when the data is highly connected (a social web app).

However, memory is so cheap now that you could probably get away with keeping all of your business objects in memory and having them write through to disk for persistence. I imagine you would have to move business objects between servers to decrease messaging latency as the connectivity between objects changes.

I don't know of any frameworks that use this approach.
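The write-through part of that idea is simple to sketch, though. Below is a toy version using only the Python standard library; the one-JSON-file storage format is purely illustrative (a real system would use a log or a proper datastore):

    import json, os

    class WriteThroughStore:
        def __init__(self, path):
            self.path = path
            self.data = {}
            if os.path.exists(path):
                with open(path) as f:
                    self.data = json.load(f)

        def get(self, key):
            return self.data.get(key)          # served entirely from memory

        def put(self, key, value):
            self.data[key] = value             # update memory first...
            with open(self.path, "w") as f:    # ...then persist synchronously
                json.dump(self.data, f)

    store = WriteThroughStore("objects.json")
    store.put("user:1", {"name": "Alice"})
    print(store.get("user:1"))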


Well, I think what you proposed is better than Bitmessage, but to the end user Bitmessage provides the same benefits (plus a few drawbacks). I have been thinking about exactly what you have proposed, but based purely on a DHT, with intermediate hops and message polling when retrieving data, and without direct connections between clients. If you have read the paper on Freenet's Freemail implementation, it describes parts of the process, because Freenet is a distributed, encrypted key/value storage solution.

A basic feature of a DHT is that data automatically expires after a while if it isn't refreshed to a group of nodes. This would allow more efficient routing than what Bitmessage does; that's one of the primary reasons I don't like it. I also suspect that key streams in Bitmessage could become overloaded, because there's no way to load-balance them.
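To illustrate the expiry behaviour, here's a toy time-to-live store in Python. It's not a real DHT node, just the refresh-or-expire bookkeeping a node would do; the TTL value and key names are made up:

    import time

    class ExpiringStore:
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.entries = {}  # key -> (value, expiry timestamp)

        def put(self, key, value):
            # Storing (or re-storing) a key resets its expiry; this is the
            # "refresh" that keeps data alive on a node.
            self.entries[key] = (value, time.time() + self.ttl)

        def get(self, key):
            item = self.entries.get(key)
            if item is None or time.time() > item[1]:
                self.entries.pop(key, None)  # lazily drop expired data
                return None
            return item[0]

    store = ExpiringStore(ttl_seconds=60)
    store.put("msg:abc", "ciphertext...")
    print(store.get("msg:abc"))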
