> although computers themselves are mostly infallible
What do you mean? Hardware is fallible too, just less often than software. This can cause problems on its own, e.g. bit flips in non-ECC memory, or HDDs that lie (report that the cache was flushed before the data is actually written). Hardware can also trigger software errors: hardware can fail at a random moment, and the software may not be designed to handle that properly.
> "you must have ECC in all your computers otherwise they will constantly and silently corrupt your data + crash" and "statistically speaking, most computers in the world do not use ECC". How can both of these things be true?
What makes you say that? Both those things being true simply means that most computers silently corrupt your data and crash. That matches my experience. My programs occasionally crash. My pictures and videos are occasionally corrupted.
Do I know that those events are caused by memory failures? No. Most of them are probably other sorts of software or hardware failures, but some could be memory errors.
That's not a good excuse. The same memory gets corrupted on machines processing our online money transfers, and yet we don't see problems when buying things online. Why? Because there are layers upon layers of data-integrity validation, so that even if something gets corrupted, no invalid payment is made.
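To illustrate the "layers of validation" idea, here's a minimal sketch (purely illustrative, not how any real payment system works): attach a digest to every message and refuse to act on anything that doesn't verify.

```python
import hashlib

def with_checksum(payload: bytes) -> bytes:
    # Append a SHA-256 digest so any later corruption is detectable
    return payload + hashlib.sha256(payload).digest()

def verify(message: bytes) -> bytes:
    # Split off the 32-byte digest and recompute it over the payload
    payload, digest = message[:-32], message[-32:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("integrity check failed, refusing to process")
    return payload

msg = with_checksum(b'{"amount": 100, "to": "acct-42"}')
assert verify(msg) == b'{"amount": 100, "to": "acct-42"}'

# Simulate a bit flip in the payload: verification rejects the message
corrupted = bytearray(msg)
corrupted[5] ^= 0x40
try:
    verify(bytes(corrupted))
    print("corruption went undetected")
except ValueError:
    print("corruption detected, payment not made")
```

A plain hash only catches accidental corruption; real payment systems additionally use cryptographic MACs or signatures against tampering, but the layering principle is the same.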
Crashes might not matter, but silent data corruption does. The owner/user of that data will care when they eventually discover that it at some point mysteriously got corrupted.
> Software under normal circumstances is remarkably resilient to having its memory corrupted
Not really? Consider anything that uses hashing of some sort: there, by design, a single bit flip produces a completely different output, which is the opposite of resilience.
And "resilience" isn't a precise enough term anyway. Is it just recovering after errors due to flips? Or is it guaranteeing that the operation yields correct output (which implies not crashing)? The latter is far harder.
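To make the hashing point concrete, here's a quick demonstration of the avalanche effect: flipping a single input bit yields a completely different SHA-256 digest.

```python
import hashlib

data = bytearray(b"some important record")
before = hashlib.sha256(bytes(data)).hexdigest()

data[0] ^= 0x01  # flip a single bit of the input
after = hashlib.sha256(bytes(data)).hexdigest()

print(before)
print(after)
# One flipped input bit changes roughly half the output bits
assert before != after
```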
> haven't personally seen any kind of data corruption in motion
Ever had a program crash, hang, or act oddly? That's how data corruption in memory surfaces.
Of course, non-perfect programs (i.e. all of them) act the same way, which means that differentiating memory corruption from misbehaving programs is hard.
Fixing the memory errors will result in a more stable system, but it still won't be perfect.
>In an embedded system such a crash may be as bad as the incorrect access itself.
I don't agree on this point. An incorrect access on an embedded system has the potential to cause all kinds of horribly subtle bugs involving memory corruption. A simple crash is generally much better.
Hardware can misbehave. If the kernel can detect that, it is reasonable to shut down the machine and prevent data corruption. I can't think of a kernel developer who would laugh at that.
> True, but hardware failures are known to happen due to Linux misbehavior
[citation needed]
In the past, Linux was blamed for memory failure because it exposed bad memory when trying to make use of it, and Windows on the same machine didn't. But it was not Linux's fault.
If you edit images or videos, maybe you notice a small corruption in an image. If you use databases or do data analysis, there may be one number that is wrong, or a string with one byte of garbage. Sometimes an application may crash.
All this is very rare. It only matters if you need data integrity and do work where data has value.
> haven't personally seen any kind of data corruption
How would you know? Unless your computer use has been literally trouble-free (and all your archived data has been verified for correctness somehow), you can't know that none of your glitches over the past 17 years has been due to memory errors.
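One way to actually know would be to record checksums of your archives and re-verify them later. A minimal sketch (the file layout and names in the test run are hypothetical):

```python
import hashlib
import pathlib

def sha256_file(path: pathlib.Path) -> str:
    # Hash a file in chunks so large archives don't need to fit in memory
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root: pathlib.Path) -> dict:
    # Record a digest for every file under `root`
    return {str(p.relative_to(root)): sha256_file(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def changed_files(old: dict, new: dict) -> list:
    # Files whose digest differs (or which disappeared) since the old snapshot
    return [name for name in old if old.get(name) != new.get(name)]
```

Store the snapshot somewhere separate from the data itself, re-run it periodically, and investigate anything in `changed_files`. This detects corruption; it can't tell you whether a bit flip, a disk error, or a buggy program caused it.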
Desktop computers are sometimes used for actual work where data integrity is important.
Only a small minority of main-memory data corruptions lead to OS crashes; mostly, the in-memory application or filesystem data just gets silently corrupted.
While a nifty idea, corruption of this sort is so rare and so unbounded (that is, there's no reason to believe it'll strike your incoming data; it could just as well strike the CPU instructions themselves, or who knows where) that there's not much you can do about it from inside the code. It's all but impossible to deal with corruption rates on the order of 1 in 10^18 instructions (or better! properly functioning hardware is obscenely reliable at doing what it was designed to do [1]) on properly functioning hardware, and all but impossible to deal with failing instructions at a much higher rate on malfunctioning hardware, except to replace it with functioning hardware.
[1]: If anyone wants to pop up with complaints about that statement, remember that properly functioning hardware is also doing a lot of things very quickly, so it has a lot of chances to fail. ECC RAM is important, for instance, because something that only happens every few billion accesses may still happen several times a day. But this is still an absolutely obscene degree of reliability. Most disciplines would laugh at worrying about something at that rate of occurrence... they wouldn't even be able to detect it.
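To put rough numbers on that footnote (every figure below is an illustrative assumption, not a measurement):

```python
# Back-of-envelope: "rare per access" can still be "frequent per day".
# All numbers are assumed for illustration only.
accesses_per_second = 1e9     # assumed effective memory access rate
failures_per_access = 1e-13   # assumed: one failure per ten trillion accesses
seconds_per_day = 86_400

failures_per_day = accesses_per_second * failures_per_access * seconds_per_day
print(failures_per_day)  # several events per day despite the tiny per-access rate
```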
> There's simply no structural reason for power failure (or application crashes) to be able to put hard disk data into an inconsistent, corrupt state.
No structural reason no, but a lot of other reasons. You can install your favourite atomic file operations library in your PL of choice and run some benchmarks to identify reason #1.
This one does pervasive damage to software quality though. You get a bug report, it doesn't reproduce easily. The thought is always there: "maybe their hardware had a transient error?". If it did you have no bug to find. That's attractive.
If storage drives reported writes accurately and memory didn't occasionally silently corrupt, software falling over would be more likely to imply an error the developer is empowered to fix.
I've seen a ton of it in the field. In general, when RAM stability is that bad, large operating systems fail to boot because of the corruption, so most of the time it never gets to the data-corruption phase.
Can you make that statement with any certainty? My personal and family computers have crashed quite a few times, and have corrupted photos and files, some of them valuable (taxes, healthcare records, and so on; personal computers hold valuable data these days).
I couldn't tell, as a user, which of those corruptions and crashes were caused by bit flips. Could you?