In theory it could be written to store each chunk as a separate value inside of IndexedDB, because when seeding you should only need one chunk at a time...
Hey @manigandham, thanks for the compliments on the database overall =)
I understand conceptually that this is all about splitting data, but if you look at most scalable databases that use sharding, it's really meant as a partitioning of the primary keyspace over servers, and you then map this sharding globally through client libraries, a transparent proxy, or a map that every node maintains, because O(map) = O(# servers). Examples: Cassandra, DynamoDB, scale-out memcached, Vitess, ZippyDB/RocksDB, etc.
We are instead tracking per-chunk state in catalogs to give us this level of flexibility, and allowing the movement/migration of individual chunks on a much finer-grained basis. This is both for placement/management across the cluster but also for data management on single nodes, e.g., for data retention policies, tiering, lazy indexing, etc.
I realize this isn’t a hard-and-fast rule, and exceptions always exist. But one reason we try to call this out is we’re often asked why we don’t just use a standard hash-based partitioning tool/system as a black box, which wouldn’t give us this level of fine-grained visibility & control that we find highly useful for time-series data management.
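A minimal sketch of the distinction being drawn here, with made-up node and catalog names (not the actual catalog schema): pure hash partitioning derives placement from the key alone, while a per-chunk catalog records placement explicitly, which is what makes per-chunk migration, retention, and tiering straightforward.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def black_box_placement(key: str) -> str:
    """Pure hash partitioning: placement is a function of the key alone,
    so there is nothing to consult or update per chunk."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Catalog-based placement: each chunk has an explicit entry we can read and
# rewrite, so moving or tiering a single chunk is a metadata change plus a copy.
chunk_catalog = {
    "chunk_17": {"node": "node-2", "time_range": ("2015-06-01", "2015-06-08"), "tier": "ssd"},
}

def catalog_placement(chunk_id: str) -> str:
    return chunk_catalog[chunk_id]["node"]

def migrate_chunk(chunk_id: str, new_node: str) -> None:
    # Moving one chunk is a catalog update (plus the data copy),
    # not a global rehash of the keyspace.
    chunk_catalog[chunk_id]["node"] = new_node
```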
Would this just be like using RocksDB or something like that with extensions to auto-shard and balance between instances? That's what it looks like to me, but I could be misunderstanding it.
I didn't mean "wrong" in the full sense of the word. What I actually mean is that sharding based on user id will most probably not give or guarantee a predictable distribution of the writes generated by users (basically there can be no guarantee that chunk #1 of users will always generate approximately the same amount of data as chunk #2 of users).
1. The hash would be an extra column that could be calculated from existing data anyway: wasted storage
2. You effectively have to rewrite the entire database onto itself to redistribute (sketched below), and keeping the DB available during this process is _very_ complicated
3. You're putting an extreme load on the DB for a substantial amount of time. This takes away from your DB performance and makes node downtime even more severe
In a distributed DB, you have to remember that the probability your node _doesn't_ have the data you need increases with the size of the cluster, which creates a negative feedback loop for having to rewrite.
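A rough illustration of point 2, assuming plain modulo hashing on that extra hash column: adding even one shard reassigns almost every key, so "rebalancing" really does mean rewriting most of the database.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Simple modulo placement, the scheme being criticized above.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if shard_for(k, 16) != shard_for(k, 17))
print(f"{moved / len(keys):.0%} of keys change shard going from 16 to 17 shards")
# Typically ~94% of keys move; consistent hashing or an explicit placement map
# is how systems avoid this wholesale rewrite.
```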
True, but it's O(n^2) complexity instead of O(n) complexity (1000 shards would lead to 1M reads instead of 4k)
If they're not persisted at the receiving side, how does it handle a crash before committing on the receiving node? Keeping previous versions indefinitely on sending nodes to permit requesting the old values doesn't work, so this introduces a time bound on how long a crashed node has to recover (granted, this could still mean weeks).
> You can set an index template to be used on new indices that match a pattern, which is a very common thing to do
It is, but how can you tell in your template you want to keep shard sizes under 50GB? You can't.
The best thing you can do (as they did) is update the template based on historical data, so that the new index will have shards that (hopefully) stay under 50GB.
Oh fuck me, I didn't even realize they used a single instance (node).
To expand a little bit, the whole point of using multiple shards per index in an ES cluster is so that the shards spread across multiple nodes (servers) and distribute the load (disk i/o) and handle redundancy. ES automatically scales and reshuffles its shards across multiple nodes in the cluster to handle fault-tolerance as well. If one or more nodes go down, the cluster still has all of the data through replica shards etc...
Either way, in this particular case the data is so small that having 5 shards per index with 50k indices results in 250k shards for 5 GB of data.
5 GB / 250k shards = 20 KB per shard.
You have shards of ~20 KB each... total cluster misconfiguration.
From the email thread, it sounds like the decision to shard on UID was made mostly to increase locality of data, so that you didn't have to query more than one node to get a single user's data.
There's no silver bullet here. Hashing on insertion order would basically guarantee that writes favor one node over another, while random hashes would force you to aggregate results from all available nodes for each query.
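A small sketch of that trade-off, with illustrative names: hashing on user id means a per-user query touches one node, while row-level/random placement forces a scatter-gather across all of them.

```python
import hashlib

NUM_NODES = 8

def node_for_user(user_id: int) -> int:
    # Deterministic placement by user id: all of a user's rows land together.
    return int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % NUM_NODES

def nodes_for_user_query(user_id: int, shard_by_user: bool) -> list[int]:
    if shard_by_user:
        return [node_for_user(user_id)]   # locality: one node to ask
    return list(range(NUM_NODES))         # scatter-gather: ask every node
```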
Random distribution seems like it would be better for write performance - it means that inserting multiple new values and rebuilding the index can happen in parallel rather than having to coordinate the auto increments. It also naturally has the right properties for sharding as and when you get to the point of needing that.
On the read side it shouldn't make any difference, as there's no reason the rows you want to access at any given point should have any correlation to when those rows were created.
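A small sketch of the write-side point, assuming UUIDv4 primary keys (my example, not the parent's): random ids need no central counter to coordinate, and they already distribute uniformly, so a later modulo split yields balanced shards.

```python
import uuid

def new_row_id() -> uuid.UUID:
    return uuid.uuid4()              # no coordination between concurrent writers

def shard_of(row_id: uuid.UUID, num_shards: int) -> int:
    return row_id.int % num_shards   # roughly uniform across shards
```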
Good point. I guess it would be cool if you could choose the sharding method per-table. I haven't put too much thought into this, but off the top of my head I can imagine a number of different methods suitable for different situations.
There's a 1:1 mapping between the user's hash modulo the number of shards, and the table they're writing to. So, if we had 1000 logical shards, we have 1000 schemas/tablespaces, with the same tables in each. And the database's own 'nextval()' feature makes sure we never have the same ID twice. Hope that clarified things.
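For concreteness, a hedged sketch of how I read that scheme; the schema, table, and sequence names below are invented, not the poster's actual layout.

```python
import hashlib

NUM_LOGICAL_SHARDS = 1000

def logical_shard(user_id: int) -> int:
    # Hash the user id and take it modulo the number of logical shards.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def insert_sql(user_id: int) -> str:
    shard = logical_shard(user_id)
    # Same table layout in every schema; nextval() on that schema's own
    # sequence guarantees ids are never handed out twice within the shard.
    return (
        f"INSERT INTO shard_{shard:04d}.events (id, user_id, payload) "
        f"VALUES (nextval('shard_{shard:04d}.events_id_seq'), {user_id}, $1)"
    )
```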
I guess you are talking about the plan to have the data of one shard only on one disk (see https://github.com/elastic/elasticsearch/issues/9498)? This does not necessarily mean that you will end up having only one shard per datapath - only if you have just one shard per node. But you are right, the change might lead to unbalanced disk usage in some scenarios, where increasing the number of shards would solve the problem.
1) is recommended, since it allows for throttling at import time (see https://crate.io/docs/en/latest/best_practice/data_import.ht...) and also does not require renaming a table, which is currently not implemented but is on our backlog. However I think once ES 2.0 is out we will have table renames and also throttling in insert-by-query, so option 2) will be recommended then.
Our general recommendation for the fixed-number-of-shards limitation is to choose a higher number of shards upfront (the number of expected cores fits most use cases) or to use partitioned tables (https://crate.io/docs/en/latest/sql/partitioned_tables.html) where possible, since those allow changing the number of shards for future partitions.
Can you do hashtable sharding? Like have a hashtable point to another hashtable by using something appended to the key? Just a thought on a procrastination-filled Friday afternoon.
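One way to read that idea, as a throwaway sketch: a first-level hash on the part appended to the key picks a bucket, and each bucket is itself a hashtable (or shard) keyed by the full key.

```python
import hashlib

FIRST_LEVEL = 16

def level1(key: str) -> int:
    prefix = key.split(":", 1)[0]   # e.g. the part appended to / prefixed on the key
    return int(hashlib.md5(prefix.encode()).hexdigest(), 16) % FIRST_LEVEL

directory = [dict() for _ in range(FIRST_LEVEL)]   # each entry is its own hashtable

def put(key: str, value) -> None:
    directory[level1(key)][key] = value

def get(key: str):
    return directory[level1(key)].get(key)
```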
Honestly I never found a case where this happens; data always falls into one shard according to the key.
Then comes the concept of shard replicas, where a shard can live on several nodes to provide redundancy.
However I'm not sure how this is usually set up on Postgres.