My photos alone are north of 4TB. That's from a DSLR, and not a crazy high-res one (to say nothing of people with video hobbies). I've always worked in small 2-8 person teams at companies that I've been heavily a part of, so that's a huge chunk too. But even discounting that data, I have quite a lot of projects that weigh in pretty heavily.
Yeah, I do have some datahoarding-type collections, because that's the sort of thing you end up doing when automation and total storage become commonplace, but just looking at "bytes I have created and will be lost forever if they vanish", I'm well north of 4TB. I think a lot of other people are too.
I don't mean this disparagingly, but if a person hasn't had any data-heavy hobbies, and has always been some kind of employee to a larger entity who manages data elsewhere, then yeah, your data footprint might be small. I imagine lots of HN regular types don't fit that mold, though.
(On the original link--I don't have much to add to the other comments here. But calling a 4-drive RAID5 setup robust in any sense is nuts. That's data loss waiting to happen, and probably made worse by thinking it is robust)
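For a rough sense of why: once a drive dies, a 4-drive RAID5 rebuild has to read the three surviving drives end to end, and any unrecoverable read error along the way means lost data (some controllers abort the whole rebuild). Here's a back-of-envelope sketch in Python; the 8TB drive size and the 1-per-1e14-bits URE rate are illustrative assumptions of mine, not numbers from the original post:

    # Back-of-envelope odds of a clean 4-drive RAID5 rebuild.
    # Assumptions (illustrative, not from the article): 8 TB drives and a
    # URE rate of 1 per 1e14 bits read, a common consumer-disk spec.
    DRIVE_BYTES = 8e12       # assumed 8 TB per drive
    SURVIVING_DRIVES = 3     # a 4-drive RAID5 rebuild reads all 3 survivors in full
    URE_PER_BIT = 1e-14      # assumed unrecoverable-read-error rate

    bits_read = DRIVE_BYTES * 8 * SURVIVING_DRIVES
    print("expected UREs during rebuild:", bits_read * URE_PER_BIT)          # ~1.9
    print("chance of a zero-URE rebuild:", (1 - URE_PER_BIT) ** bits_read)   # ~0.15

With those assumptions you're more likely than not to hit at least one URE before the rebuild finishes.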
I was wondering the same thing. I don't think the average person has terabytes or even gigabytes of data stored at home. Every time I start a project I make a remote git repo. Music and photos are handled by cloud services... Not that I have photos I really care about, but I'm sure other people do. Documents and configuration files are backed up using Tarsnap, but that's 100MB at the most.
We had a similar debate at work, where I questioned the need for backup of my workstation, arguing that there's nothing on it that's not also in git, on a network share, in LastPass or the mail server. Apparently only two people out of a hundred, me and the CEO, do have important stuff stored solely on our workstations.
Calling people with 4TB of data hoarders may be a little unnecessary, but I do question how much of that data is ever going to be accessed in the future.
That's a lot, compared to mine. How do you organize replication, and do you make backups on any external services? I kinda do want to hoard more, but I find it complicated to deal with at large scale. It gets expensive to back up everything, and HDDs aren't really a solid medium long-term. Now, I can kinda use my judgment of what is important and what is essentially trash I store just in case, but losing 100TB of trash would be pretty devastating too, TBH.
It's a little absurd to think people don't need more than 2TB - especially on HN.
Gamers will likely have 2TB in games alone, videographers often have many TBs of videos and photos from weddings and events in their life, many that care about health may have a few TB in genomic data mirrored on their computers to analyze, etc.
I would imagine it's hard to find people that wouldn't have TBs of data, if they were allowed to do so. The reason many people don't have TBs of data is they're limited by these exact companies you're claiming 'solve the problem' by offering limited storage.
It is notable however, that having better tools to organize, deduplicate, and compress data would be helpful to reduce some of the size of data that many people have.
Over the years I've noticed my family will have multiple tar.gz archives, zip archives, etc., which (after extraction/decryption) will share 20% of their files here, 10% there, a 4KB JPG that's the same image as a 100MB PNG here and there, etc.
So yes, those 10TB archives may end up being 5TB if someone spent the time to really comb over, understand, make good decisions about, and organize that data. But I have not seen anything that can scratch the surface of that yet, other than perhaps https://github.com/jjuliano/aifiles - but I won't use it until it's local-only and has guarantees not to destroy data without explicit permission. An overlay filesystem that handles compression/deduplication with LLM capability like aifiles is probably the best option here.
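For the byte-identical duplicates specifically (the shared 20% here, 10% there), you don't even need an LLM - a plain content-hash sweep over the extracted trees already finds them. A minimal sketch; the root path is just a placeholder, and note it only catches exact byte-for-byte copies, not the 4KB-JPG-vs-100MB-PNG case:

    import hashlib, os
    from collections import defaultdict

    def sha256_of(path, chunk=1 << 20):
        """Hash a file in 1 MB chunks so large files never need to fit in memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def find_duplicates(root):
        """Group byte-identical files under root by their SHA-256 digest."""
        by_digest = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    by_digest[sha256_of(path)].append(path)
                except OSError:
                    pass  # unreadable file, skip it
        return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

    # Example: point it at wherever the archives were extracted (placeholder path).
    for digest, paths in find_duplicates("/path/to/extracted/archives").items():
        print(digest[:12], *paths, sep="\n  ")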
However, I wouldn't imagine that most people's life data comes in under 2TB even with all of this - the 2TB figure is mostly an artificial constraint imposed by these companies.
On the other hand, it doesn't seem like that much when you consider today's storage density.
You can fit around 0.5 PB into one rack nowadays.
200 racks then sounds a bit less impressive than 100 petabytes.
However, that of course doesn't account for redundancy, nor for doing anything useful with such a pile of data, both of which impose some interesting challenges at that scale.
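Roughly, with the redundancy overheads thrown in as illustrative examples (3x replication and a 10+4 erasure-coding scheme are my assumptions, not anything from the thread):

    # 100 PB at ~0.5 PB of raw capacity per rack.
    racks_raw = 100 / 0.5
    print(racks_raw)          # 200 racks before any redundancy
    print(racks_raw * 3)      # 600 racks with plain 3x replication
    print(racks_raw * 1.4)    # 280 racks with e.g. a 10+4 erasure code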
Yep. Everyone winds up with some war story about cleansing some multi-petabyte data store – but the better data engineers I know try very hard to avoid having two of them.
Plenty of things come to mind that can involve a lot of data:
- Docker and other VM images
- Backups
- 4K video editing synced to shared NASes when working from home
- Build and content syncing for gamedev when working from home
4 Gbit/s is still slow enough for me to spend minutes syncing a mere 100GB, even if I can exclusively saturate that link and not share it with a roommate. That's smaller than some Blu-ray discs and many Steam games in their compressed/compiled/processed states for a single platform - no debug symbols, no built object caches, no source assets. It's not uncommon for me to resync a significant portion of that due to mass rebuilds, poorly handled content refactoring, or just switching projects when my local disks (mechanical, SSD, and otherwise) are all full.
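For reference, the best case (the whole 4 Gbit/s link to yourself, zero protocol overhead) works out like this; anything less than full saturation only pushes it up:

    # Best-case time to move 100 GB over a saturated 4 Gbit/s link.
    seconds = (100e9 * 8) / 4e9
    print(seconds, seconds / 60)   # 200 s, ~3.3 minutes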
You have wayyy too much data. You're a straight up digital hoarder. You couldn't even sift through that shit in a lifetime, let alone use it all.
Making full disk backups and archiving huge collections of media assets you'll only ever use 1% of is just a huge burden of mind and a huge loss of portability and flexibility. Not to mention a waste of time and money.
I know you say you're an amateur as if professionals would have even more, but in reality, they've probably realized what a time sink it is and focused on just the small set of stuff that actually matters.
You didn't mention data scale. Just because the disks have room doesn't mean the data access patterns in perfectly stable code will keep performing well as the data grows to multiples of its current size, if old data isn't somehow moved to colder storage.
At that scale, devoting a significant portion of the storage space to parity or redundant data seems plausible if not required. A scratch doesn't have to mean wiping any data completely.
All of that is true, but I don't think it's a realistic concern. You're going to be sharding your data across multiple nodes before it gets that large. Nobody wants to sit around backing up or restoring a monolithic 256 TiB database.
With utmost respect: this isn't a super valuable data point. A "production deployment" with lots of users and/or workloads will see issues you will never encounter in a more modest setting.
FWIW, I use FreeNAS in a similar small but diverse setting with nary a problem, but when I set it up for a 30+ person organization, issues arose that I hadn't expected (not data loss, but usability and performance issues).
It's not nearly the problem now it was ~6 years ago when I was stuck with 8TB in a single file at a small startup.
I look around now and you are right. It's really no big deal to move a few terabytes of data around, but it sure was a mess back then, and I'm once bitten, twice shy.
Your post is valid from a technical and idealistic standpoint; however, when you realize the size of the data sets turned over in the film/TV world on a daily basis, restoring, hashing, and verifying files during production schedules is akin to painting the Forth Bridge - only the bridge has doubled in size by the time you get halfway through, and the river keeps rising...
There are lots of companies doing very well in this industry with targeted data management solutions to help alleviate these problems (I'm not sure that IT 'solutions' exist), however these backups aren't your typical database and document dumps. In today's UHD/HDR space you are looking at potentially petabytes of data for a single production - solely getting the data to tape for archive is a full time job for many in the industry, let alone administration of the systems themselves, which often need overhauling and reconfiguring between projects.
Please don't take this as me trying to detract from your post in any way - I agree with you on a great number of points, and we should all strive for ideals in day to day operations as it makes all our respective industries better. As a fairly crude analogy however, the tactician's view of the battlefield is often very different to that of the man in the trenches, and I've been on both sides of the coin. The film and TV space is incredibly dynamic, both in terms of hardware and software evolution, to the point where standardization is having a very hard time keeping up. It's this dynamism which keeps me coming back to work every day, but also contributes quite significantly to my rapidly receding hairline!
The most common bottleneck in my work experience has been disk I/O when the data set would not fit into memory. And every professional data set I've worked with exceeds 1TB. Perhaps there are other bottlenecks, but disk seeks and (for one place) their iSCSI over gbit ethernet nonsense ruled the performance challenges.
Definitely. If our dataset was absolutely massive and we couldn't hold a reasonable amount on disk, it'd make more sense. Fortunately, we are getting a 90+% hit ratio out of a very small amount of space relative to the size of our bucket.
And yeah, we have zero resilience for write data here. Again, fortunately, we can afford this tradeoff since the amount of uploads is significantly lower and much less critical for us.