What's "this kind of work"? :) The company I work for does reconciliation as a service, something that's pervasive in finance. We support customer-defined schemas; in fact that's one of our selling-points. I could start talking about the kinds of things we've done with MySQL to make this perform well - it's not difficult, just a bit unorthodox.
Anyway, the diversity in customer schema leaks out into the Hadoop schema, where we'd much prefer to give customers data using column names they're familiar with, and we also want to give them rows from all their different schemas in a single table (because many schemas have overlap by design). The superset of all schema columns is large, however. The problem can be overcome with more tooling - defining friendly views with explicit column choice - but having the option to implement that (and go to market sooner), vs a requirement to implement that, adds up to a distinct advantage for tech that can support the extra columns.
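To be concrete about the "friendly views" tooling I mean: nothing fancier than a per-customer view that picks and renames the columns that customer cares about. A sketch in generic SQL (the real thing would live in whatever SQL layer sits on top of Hadoop, and every table/column name here is invented):

    -- one wide table holds rows from every customer schema
    CREATE VIEW customer_a_trades AS
    SELECT trade_id,
           col_017 AS settlement_date,   -- the names customer A already knows
           col_042 AS counterparty
    FROM   all_schemas_wide
    WHERE  customer_id = 'customer_a';

The point is that this is optional tooling we can add later, rather than a prerequisite for going to market.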
If you've a schema, you sometimes need to rewrite an entire table as part of a migration, and that can mean heavy engineering if you want to avoid downtime or disabling writes during the migration. There are 1:1 relationships between tables in the wild that wouldn't have passed a sniff test as part of an early schema design review, but then got created regardless to avoid a lengthy table rewrite.
If you've no schema proper, by contrast, you can manage multiple variations of what the data might look like in code. Not that such a thing is simple; it's definitely not, for the reason you raised. But it's simpler to deploy and migrate, or certainly might appear to be so to someone who isn't comfortable with SQL.
Also, there's a class of apps where having a schema doesn't add much value and NoSQL actually makes sense. Think storing and mining logs, scraped data, ML training sets, etc. -- apps where it doesn't matter much how a big pile of data gets stored, so long as you can shovel through it in parallel or store it very fast.
In fairness to MongoDB, I've seen this done in the relational world. I once worked at a place that had a schema where everything joined through a single table -- "TableRow_TableRow", which had six fields: two IDs (which were varchars) and metadata_1 through metadata_4 (also varchars).
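For anyone who hasn't run into this species of table, it looked roughly like this (a sketch from memory, with the ID column names invented - the shape is what matters, not the exact DDL):

    CREATE TABLE TableRow_TableRow (
      left_id    VARCHAR(64),    -- "points at" a row in some table
      right_id   VARCHAR(64),    -- ...and a row in some other table
      metadata_1 VARCHAR(255),
      metadata_2 VARCHAR(255),
      metadata_3 VARCHAR(255),
      metadata_4 VARCHAR(255)
    );
    -- every relationship in the system went through joins on this,
    -- with no foreign keys and nothing useful to index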
They couldn't understand why it was so slow. But hey, it's super flexible, right?
You might be overthinking it; for my purposes, at least, a set of example schemas and required outputs would suffice -- the rest I can probably do on a MySQL instance of my own.
You know, I've been toying with this very idea in my head for a while, as a way to make safe schema migrations. But I also know very little about databases, so I had no idea whether this was a stupid idea or not. Glad to know I'm not alone.
One issue I've considered is foreign-key relationships - those could get complicated depending on the approach.
Somebody on our team put this in production. I guess this solution has some merits if you need something quick, but in the long run it turned out to be painful. It's basically SQL over REST. Additionally, your DB schema becomes your API schema, and that either means you force one to serve the purposes of the other or you build DB views to fix that.
MySQL's native schemas have problems - mostly that schema changes can take hours or days on large tables. As a result, you need to build a system that doesn't require schema changes at the MySQL level. That doesn't mean you can't use schemas at a higher level, though.
There are two ways to build on top of MySQL - you can either make a flexible schemaless structure with objects and properties and associations between them, which is what Facebook advocates for, or you can make a flexible virtual schema that has column names, types, and arbitrary indexes. I'm in favor of the latter.
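A minimal sketch of the "virtual schema" flavour, assuming plain MySQL underneath (all names hypothetical):

    -- one physical layout, many logical tables
    CREATE TABLE virtual_columns (
      table_name VARCHAR(64),
      col_name   VARCHAR(64),
      col_type   VARCHAR(32),           -- 'int', 'string', ...
      PRIMARY KEY (table_name, col_name)
    );

    CREATE TABLE virtual_rows (
      table_name VARCHAR(64),
      row_id     BIGINT,
      col_name   VARCHAR(64),
      str_value  VARCHAR(255) NULL,
      int_value  BIGINT NULL,
      PRIMARY KEY (table_name, row_id, col_name),
      KEY by_str (table_name, col_name, str_value),   -- the "arbitrary indexes"
      KEY by_int (table_name, col_name, int_value)
    );

    -- adding a column to a logical table is an INSERT, not an ALTER
    INSERT INTO virtual_columns VALUES ('orders', 'shipped_at', 'string');

The objects/properties/associations approach looks similar on disk; the difference is whether the layer above thinks in typed, named columns or in free-form properties.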
That'd be tedious and error-prone if you had schemas with maybe dozens of tables. Now you've also got dozens of views to manage and keep in sync with the underlying tables.
I work with both MySQL and PgSQL, and not being able to reorder columns easily is one feature I really miss in PG.
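For anyone who hasn't used it: in MySQL it's a one-liner, though on a big table it can still mean a full table rewrite (table/column names made up):

    ALTER TABLE orders
      MODIFY COLUMN customer_note VARCHAR(255) AFTER customer_id;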
You might be surprised by the effort many folks put into trying to force the various DB solutions into some kind of schema. The reason for this is mostly opaque to me...
I've done this before where we ran a schema per customer and it was fabulous. Once the customer was large enough we could justify allocating a separate DB for them. The application was written in such a way that it knew which data store to query based on the user.
Yep, and you can also solve this problem by having a separate table for each variant and joining whenever you need to deal with common stuff. Personally, I think it's usually better just to have one giant table and allow the different columns to be null since it lets you avoid a bunch of verbose joins.
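Roughly the trade-off I mean, in sketch form (invented names):

    -- variant-per-table: tidy, but every query over common fields needs joins
    CREATE TABLE payments_base (id BIGINT PRIMARY KEY, amount DECIMAL(12,2));
    CREATE TABLE payments_card (id BIGINT PRIMARY KEY, card_last4 CHAR(4));
    CREATE TABLE payments_wire (id BIGINT PRIMARY KEY, swift_code VARCHAR(11));

    -- one wide table: variant-specific columns just stay NULL
    CREATE TABLE payments (
      id         BIGINT PRIMARY KEY,
      amount     DECIMAL(12,2) NOT NULL,
      card_last4 CHAR(4) NULL,
      swift_code VARCHAR(11) NULL
    );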
Will look into it, thanks for the tip. Also had an idea of accepting an SQL schema dump to auto-generate the right columns. There's definitely stuff to improve.
There are so many ways to accomplish this other than what you did. I assume there was a reason for your choice, but it would have been so much easier if you could just create a separate table that matched the same primary key as the customer record, and contained a single other field, the data required to be searchable. Easy to join and search, easy to insert and update.
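Something along these lines, I mean (a sketch - I obviously don't know the actual schema, so every name here is invented and it assumes a customer table keyed on id):

    CREATE TABLE customer_search (
      customer_id  BIGINT PRIMARY KEY,          -- same PK as the customer row
      search_value VARCHAR(255) NOT NULL,
      KEY by_value (search_value),
      FOREIGN KEY (customer_id) REFERENCES customer (id)
    );

    -- easy to join and search
    SELECT c.*
    FROM   customer c
    JOIN   customer_search s ON s.customer_id = c.id
    WHERE  s.search_value = 'whatever';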
> reinsert them into a real schema
That sounds sort of like you decided to fix a bunch of problems at the same time...
Ah, yep, that makes complete sense if you want per-tenant schemas. My app is multi-tenant but all tenants share the same schema. We use a customer_id column that's matched against the JWT's customer_id claim to ensure that no data is shared inadvertently by a dev forgetting to add a WHERE clause.
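The pattern is just this (a sketch with an invented table; the first placeholder is always bound to the customer_id claim the app layer pulled out of the verified JWT, never to anything the user typed):

    SELECT *
    FROM   invoices
    WHERE  customer_id = ?   -- bound from the JWT's customer_id claim
    AND    invoice_id  = ?;  -- the actual user-supplied parameter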
Thanks for the SF link, that's quite interesting. It seems bonkers to me to throw away all the advantages of the RDBMS, but you can't argue with their success.
A middle ground I've encountered in an ERP system (prophet21 if you're interested) was each table had multiple "CUSTOM_XX" columns that were initially blank. Customers could edit their UI and would drag/drop the custom columns onto the appropriate form and change the label to be whatever they'd like. That gave them some flexibility but kept the core schema coherent.
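For those who haven't seen the pattern, the tables look more or less like this (a sketch, not the vendor's actual DDL):

    CREATE TABLE item (
      item_id     BIGINT PRIMARY KEY,
      description VARCHAR(255),
      price       DECIMAL(12,4),
      -- pre-allocated spares; each customer relabels and uses them via the UI
      CUSTOM_01   VARCHAR(255) NULL,
      CUSTOM_02   VARCHAR(255) NULL,
      CUSTOM_03   VARCHAR(255) NULL
    );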
The problem with standard-ish SQL + RDBMS and their schemas is that types are not enforced well in SQL or in most libraries and language integrations, so we end up with the quagmire of static schemas that cannot be reliably type-checked using standard tooling.
Also lacking in many DB systems is integrated support for tables with heterogeneous schemas, backed by page/row-level versioning and/or online schema alteration. Having to rewrite a huge table only because you want to narrow a couple of fields for future data gets old quickly.
This is what I do as well. Most of the schema structure is on tables that are not very large, and it's nice - especially when dealing with multiple environments - for preventing errors, or for producing schemas with some polymorphism, or joins between several tables without a chain of id-resolving joins in between.
There are UUID variants that can work well with indices, which shrinks the case for big integers yet further, down to situational micro-optimizations.
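Concretely, MySQL 8+ already ships the pieces for this: UUID() returns a v1 UUID, and UUID_TO_BIN's swap flag moves the timestamp bits to the front so inserts land near the end of the index instead of on random pages (table name made up):

    CREATE TABLE events (
      id      BINARY(16) PRIMARY KEY,
      payload JSON
    );

    -- swap_flag = 1 reorders the time parts for index locality
    INSERT INTO events (id, payload)
    VALUES (UUID_TO_BIN(UUID(), 1), '{"k": "v"}');

    -- and back to the text form on the way out
    SELECT BIN_TO_UUID(id, 1) AS id, payload FROM events;

UUIDv7 gets you the same property natively, if your generator supports it.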