Thanks for the feedback! I totally agree with you that using a streaming approach seems like overkill. Making the pipeline truly real-time is actually part of our future work for Cassandra CDC (this is mentioned in Part 2 of this blog post, which is expected to be published next week). In Cassandra 4.0, the CDC feature is improved to allow real-time parsing of commit logs. We are running 3.x in production, but we plan to add 4.0 support in the future.
What does the 4.0 API look like? I would hope that they would expose a high-level interface for change data, similar to SQL Server or Snowflake streams. Most databases leave this as an afterthought, which is so unfortunate.
The high-level idea is hard-linking the commit log segment file into the cdc directory as it is being written, plus a separate index file to keep track of the offset.
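To make that concrete, here is a minimal sketch of what a consumer of that layout could look like. It assumes the 4.0-style convention of a `CommitLog-*.log` segment hard-linked into the cdc directory with a sibling `*_cdc.idx` file whose first token is the highest synced offset and whose second token is `COMPLETED` once the segment is done; the function and parameter names are my own, not Cassandra APIs.

```python
import os
import time

def tail_cdc_segment(cdc_dir, segment_name, poll_interval=1.0):
    """Follow one hard-linked commit log segment in the cdc directory.

    Assumed layout (Cassandra 4.0 style): next to each segment, e.g.
    CommitLog-7-123.log, sits an index file CommitLog-7-123_cdc.idx.
    Its first token is the highest offset durably synced so far; a
    second token, COMPLETED, appears once the segment is finished.
    """
    log_path = os.path.join(cdc_dir, segment_name)
    idx_path = log_path.replace(".log", "_cdc.idx")
    consumed = 0
    while True:
        with open(idx_path) as idx:
            tokens = idx.read().split()
        synced_offset = int(tokens[0])
        completed = len(tokens) > 1 and tokens[1] == "COMPLETED"
        if synced_offset > consumed:
            # Only read bytes the index says are safely on disk.
            with open(log_path, "rb") as log:
                log.seek(consumed)
                chunk = log.read(synced_offset - consumed)
            consumed = synced_offset
            yield chunk  # hand raw bytes to the commit log parser
        if completed:
            return
        time.sleep(poll_interval)
```

The index file is what makes this real-time-safe: the reader never races past bytes the server has not yet synced, which is the gap the pre-4.0 "parse only after the segment is discarded" approach could not close.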
Relying on an eventually consistent database sounds like a huge challenge in terms of getting absolute consistency between Cassandra and Kafka. There seem to be many edge cases that could cause the data in Kafka to get out of sync, so it doesn't sound very reliable, especially if you want to rely on Kafka for event sourcing later on, which it sounds like you're planning to do. For a payments company that sounds very risky.
What happens if you write a row to Cassandra but that row is invalidated because of failed writes to the other nodes in the cluster (e.g. the write was at QUORUM and succeeded on the node you're reading from but failed on all other nodes)? In Cassandra, when an error occurs there are no rollbacks, and the row gets read-repaired. It sounds like that row will still get output to Kafka. How do you "disappear" data written into Kafka if it gets read-repaired?
Great question, actually. There is a nuance in Cassandra regarding read repair. If you issue a QUORUM write and the write itself fails (say, because 2/3 nodes are down) but succeeds on 1 node, and later you read when all three nodes are up, the failed write will actually be resurrected during read repair due to last-write-wins conflict resolution.
This is why it is recommended to always retry when a write fails. This way, the data might arrive in BQ slightly before Cassandra itself for a failed write, but eventually they will be consistent.
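The retry recommendation above can be sketched as a small wrapper. This is a generic helper, not the DataStax driver's built-in retry policy; in practice `execute` would be a closure over a driver call such as `session.execute(prepared, params)` (wiring not shown here).

```python
import time

def write_with_retry(execute, max_attempts=3, backoff=0.1):
    """Retry a Cassandra write until it succeeds.

    Because Cassandra resolves conflicts with last-write-wins
    timestamps, replaying the same mutation is safe, and retrying
    guarantees a write that partially landed eventually reaches a
    quorum instead of lingering on a single replica. That keeps
    downstream consumers (Kafka/BigQuery) convergent with the source.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return execute()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(backoff * attempt)  # linear backoff between tries
```

Note this only restores eventual consistency: between the failed attempt and the successful retry, the CDC pipeline may ship the mutation before the Cassandra read path reflects it, which is exactly the window described above.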
That said, at WePay Cassandra is being used by services that do not have to be strongly consistent at the moment. That's not to say it's not possible; there are companies that use Cassandra's lightweight transaction (LWT) feature for CAS operations.
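For readers unfamiliar with LWTs: they are expressed as conditional CQL statements, which Cassandra coordinates via Paxos and which report success through an `[applied]` boolean column in the result. The table and columns below are hypothetical, purely for illustration; the `IF` syntax is standard CQL.

```python
# Hypothetical payments table; only the conditional (LWT) clauses matter.

# Insert only if no row with this key exists yet (compare-and-set create).
REGISTER_PAYMENT = """
INSERT INTO payments (payment_id, state) VALUES (%s, 'PENDING')
IF NOT EXISTS
"""

# Transition state only if it is still PENDING (compare-and-set update).
CAPTURE_PAYMENT = """
UPDATE payments SET state = 'CAPTURED'
WHERE payment_id = %s
IF state = 'PENDING'
"""
```

The trade-off is cost: each LWT is a Paxos round among replicas, so it is considerably slower than a plain write, which is one reason services reach for it selectively rather than by default.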
As for consistency between Kafka and Cassandra, I agree that it's hard to make it 100% consistent (due to a number of limitations Cassandra poses). This is why we tend to be pretty careful about which services should use Cassandra, and we specify clearly what the expectations and SLA for Cassandra CDC are.