
To put that another way: hardware failures are beyond the scope of the compiler.



> Out of memory errors generally don't happen on modern systems.

Well, unless we're talking about smartphones or virtualized environments (e.g. a VPS). Neither is very relevant in a compiler context, of course.


> although computers themselves are mostly infallible

What do you mean? Hardware is fallible too, just less often than software. That can cause problems on its own, e.g. bit flips in non-ECC memory, or HDDs that lie (acknowledging a cache flush before the data is actually written), and hardware can also trigger software errors, e.g. hardware can fail at a random moment and the software may not be designed to handle that properly.


Hardware can misbehave. If the kernel can detect that, it is reasonable to shut down the machine and prevent data corruption. I cannot think of a kernel developer who would laugh at that.

It could be both. But having a failure become a security issue is a fail-open design and should be avoided in most cases. Out-of-bounds memory writes can be exploited in a variety of ways and can yield exploits that affect even fail-closed designs.

I get what you mean and mostly agree with it, but compiler / OS / hardware errors are ultimately not that rare. For example:

Not too long ago, the latest Apple clang simply crashed when compiling some C++ codebases I was working on. It was an obvious compiler bug, and we got a fix a few months later.

During the 32-bit to 64-bit transition, it was common for large disk writes to fail silently because some layer truncated the length to 32 bits. Usually the culprit was the standard library, but I once saw it happen in the file system.
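
A minimal sketch of the defensive pattern that catches most of this class of bug (assuming a POSIX write(); it only helps when the misbehaving layer at least reports how many bytes it actually wrote):

    /* Never assume a single write() call wrote everything you asked for. */
    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    int write_all(int fd, const void *buf, size_t len)
    {
        const char *p = buf;
        while (len > 0) {
            ssize_t n = write(fd, p, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted, retry */
                return -1;          /* real error */
            }
            if (n == 0)
                return -1;          /* no progress, treat as failure */
            p += n;
            len -= (size_t)n;       /* loop until the full length is written */
        }
        return 0;
    }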

For a long time, I assumed that OpenMP had some nontrivial memory overhead, because I often saw simple multithreaded data processing code using more memory than I would have expected. Then one time the overhead was tens of gigabytes and continued growing. When I investigated, it turned out to be degenerate behavior in glibc malloc/free. If you allocated and freed memory in different threads, you could end up with many fragmented arenas where the memory freed by the user could not be reused or released back to the OS.
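
For anyone curious, here is a minimal sketch of the allocation pattern involved (assumes Linux/glibc; the real blow-up needs many threads doing this repeatedly, and MALLOC_ARENA_MAX=1 or malloc_trim() are the usual mitigations):

    /* One thread allocates, another frees: the blocks go back to the
     * allocating thread's arena, which can stay fragmented and never
     * shrink. Sizes and counts here are arbitrary. */
    #include <malloc.h>      /* malloc_trim() is glibc-specific */
    #include <pthread.h>
    #include <stdlib.h>

    #define N 1024

    static void *producer(void *arg)
    {
        void **slots = arg;
        for (int i = 0; i < N; i++)
            slots[i] = malloc(1 << 16);   /* lands in this thread's arena */
        return NULL;
    }

    int main(void)
    {
        static void *slots[N];
        pthread_t t;
        pthread_create(&t, NULL, producer, slots);
        pthread_join(&t, NULL);

        for (int i = 0; i < N; i++)
            free(slots[i]);               /* freed from a different thread */

        malloc_trim(0);   /* explicitly hand free pages back to the OS */
        return 0;
    }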

Bit flips and other memory errors seem to have become more common in recent years, but only on consumer hardware. Maybe it's time to start using ECC memory everywhere.


Indeed. Trying to write code that can essentially work correctly with arbitrary memory corruption is not something that should even be attempted.

Yes, excellent article but a clickbaity title. Honestly, all memory access errors eventually boil down to "one register value", and probably many other error types do too. I mean, what else is there if we go down far enough, really?

And your software will continue to have memory safety issues.

But that's a metallurgical failure, not software, and it's something that only shows up in use decades later.

I'm saying this to highlight that what you're pointing out, while not desirable, is kind of expected. That's the point of routine inspection and maintenance - to catch these things.

It's not the same as the conversation we're having under the OP: catastrophic failure due to bad assumptions in software (in this case, about memory safety).


> In an embedded system such a crash may be as bad as the incorrect access itself.

I don't agree on this point. An incorrect access on an embedded system has the potential to cause all kinds of horribly subtle bugs involving memory corruption. A simple crash is generally much better.
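
To illustrate the "crash loudly instead of corrupting silently" idea, here's a rough sketch; system_reset() is a hypothetical stand-in for whatever the platform actually offers (a watchdog timeout, NVIC_SystemReset on Cortex-M, etc.):

    /* Bounds-checked write that forces a clean, deterministic reset
     * rather than scribbling over adjacent state. */
    #include <stddef.h>
    #include <stdint.h>

    #define BUF_LEN 64
    static uint8_t buf[BUF_LEN];

    static void system_reset(void)        /* hypothetical platform hook */
    {
        for (;;) { /* spin until the watchdog resets the MCU */ }
    }

    static void buf_write(size_t i, uint8_t v)
    {
        if (i >= BUF_LEN)
            system_reset();   /* fail fast instead of corrupting memory */
        buf[i] = v;
    }

    int main(void)
    {
        buf_write(3, 0xff);      /* fine */
        buf_write(200, 0xff);    /* out of bounds: resets instead of corrupting */
    }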


The other three are memory management and off-by-one errors.

Unironically yes. Memory and CPU errors happen.

You are arguing a useless point. Yes, in a real computer, memory can be randomly flipped by solar radiation and voila, your pointer is now actually NIL. Or any of the other various failure modes you've mentioned. The point is that those are failure modes, not normal operation. Once a system reaches a failure mode, nothing can be guaranteed, not even that it adds 1 and 1 correctly, because who's to say that the instructions you think you're writing are being written, read, and processed correctly? The only solution is to take the box down, reset the hardware, and hope that whatever happened wasn't permanent damage.

My point being, you cannot invoke catastrophic system failure as an argument against a static type system, simply because that's an argument against any programming construct at all. Linked lists? But what if you can't allocate the next node... Better not to use them at all!


At the hardware level, memory access errors and hardware malfunctions are often reported as exceptions. On x86, for example, the Machine Check Architecture does this via machine-check exceptions (MCEs).
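
As a rough illustration (assuming Linux with the EDAC driver loaded; the sysfs path below simply won't exist otherwise), the corrected-error counters those mechanisms feed can be read from userspace:

    /* Read the corrected-error count for the first memory controller. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/devices/system/edac/mc/mc0/ce_count", "r");
        if (!f) {
            perror("no EDAC memory controller exposed");
            return 1;
        }
        unsigned long ce = 0;
        if (fscanf(f, "%lu", &ce) == 1)
            printf("corrected memory errors so far: %lu\n", ce);
        fclose(f);
        return 0;
    }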

It is possible to corrupt data on the stack with an out-of-bounds write.
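
A deliberately buggy sketch of what that looks like (whether the neighbouring local is actually the victim depends on the compiler's stack layout; the point is that nothing stops the write):

    /* Off-by-one write past a stack array: the stray byte can land in
     * an adjacent local, padding, or even the saved return address. */
    #include <stdio.h>

    int main(void)
    {
        int flag = 0;
        char name[8];

        for (int i = 0; i <= 8; i++)    /* <= walks one byte past the array */
            name[i] = 'A';

        printf("flag = %d\n", flag);    /* may no longer print 0 */
        return 0;
    }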

I've always been told that there is a risk of irreversible hardware damage. I have no idea where this claim comes from, and I've never been offered an explanation. It's one of the things that has always made me wonder, and why I've never tested any of my own code on real hardware. I'd be curious to see this claim expanded on, or debunked.

You might be better off focusing on improving fault tolerance instead of trying to hunt down the root cause.

While a nifty idea, corruption of this sort is so rare and so unbounded (that is, there's no reason to believe it'll strike your incoming data; it could just as well strike the CPU instructions themselves, or who knows what) that there's not much you can do about it from inside the code. It's all but impossible to deal with corruption rates on the order of 1 in 10^18 instructions (or better! properly functioning hardware is obscenely reliable at doing what it was designed to do [1]) on properly functioning hardware, and all but impossible to deal with instructions failing at a much higher rate on malfunctioning hardware, except by replacing it with functioning hardware.

[1]: If anyone wants to pop up with complaints about that statement, remember that properly functioning hardware is also doing a lot of things very quickly, so it has a lot of chances to fail. ECC RAM is important, for instance, because something that only happens every few billion accesses may still happen several times a day. But this is still an absolutely obscene degree of reliability. Most disciplines would laugh at worrying about something at that rate of occurrence... they wouldn't even be able to detect it.


> 99% of BSODs are related to driver or hardware issues

Of course, memory management is also done by a driver. /s

