Bit rot is real. Last month we had weird linker errors on one of our build servers. Turns out that one of the binary libs in the build cache got a single bit flipped in the symbol table, which changed the symbol name, causing the linker errors. If the bit-flip had occurred in the .TEXT section, it wouldn't have caused any errors at build time, and we would have released a buggy binary. It might have just crashed, but it could have silently corrupted data…
It may be a regional thing but I have never heard "bit rot" refer to legacy code. In retro computing circles, bit rot refers to hardware defects (usually in floppies or other storage media) caused by cosmic rays or other environmental hazards.
I agree this is the primary context, but I've seen unmaintained (or very old) software being referred to as "bit rotting" by extension. As in, forward compatibility might break due to obsolete dependencies, etc.
Same here. “Bit rot” is then analogous to food rot: the longer your data sits unverified, the more likely that there will be flipped bits and therefore “rotten data”.
Well, it wasn't that hard to uncover actually. We knew that the same build succeeds on our machines. So we only had to find what the difference was between the two :)
As Arthur Conan Doyle put it: "Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth." ¯\_(ツ)_/¯
Well done, Watson. This calls for a bit of snuff. Seriously, this is the kind of thing that keeps me up at night, and it's nice to hear a happy ending =D
Says it's a Mac/iOS build platform. Since it's a commercial service they're probably complying with the license and thus using actual Mac hardware, and in turn the only ECC option is the really awful-value, outdated Mac Pro. Seems more likely they're using Minis instead, or at least mostly Minis. An unfortunate thing about Apple hardware (says someone still nursing along a final 5,1 Mac Pro for its last few weeks).
I'm just thinking (of course) about your "If the bit-flip had occurred": bit flips probably already do occur in the .TEXT section; we just don't know what they might already have caused, or whether they passed without notice (an unreproducible bug, a flip in a function that's never called, or whatever).
I’ve had a case where a bit flip in a TCP stream was not caught because it happened in a Singapore government deep packet inspection snoop gateway that recalculated the TCP checksum for the bit-flipped segment:
Fragmentation doesn’t change the TCP checksum. The packet is reassembled and the original checksum verified against the rehydrated packet and TCP segment.
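For anyone curious how that checksum behaves, here is a minimal sketch of the RFC 1071-style ones' complement sum in Python (the payload is a made-up placeholder): an end-to-end check catches a single flipped bit, but a box that recomputes the checksum over already-corrupted bytes, as in the DPI story above, hands the receiver a segment that verifies anyway.
    def internet_checksum(data: bytes) -> int:
        # RFC 1071 ones' complement sum of 16-bit words, folding carries back in.
        if len(data) % 2:
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    original = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"   # placeholder payload
    corrupted = bytearray(original)
    corrupted[0] ^= 0x01                                        # a single flipped bit

    # An end-to-end check would catch the flip...
    assert internet_checksum(original) != internet_checksum(bytes(corrupted))
    # ...but a middlebox that recomputes the checksum over the flipped bytes re-blesses them,
    # so the receiver's verification passes against corrupted data.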
It's easier to just code checks and balances to fix those errors instead. There's a famous story about a Google search bug caused by a cosmic ray, and they implemented such error checking on their servers after that.
That's not what actually happened here (probably), but think of relatively heavy atoms like iron, ejected from a supernova at close to the speed of light. They're not really "rays". They're solid particles that punch holes in everything. The good news is, mostly they're ionized and they get repelled away from hitting the planet's surface by our magnetic field (generated by the big hot iron magnet that's churning under our crust). But in the rest of space where there's no field like that, getting lots of tiny puncture wounds from high-velocity iron particles is pretty normal. It does a lot of damage to cells and DNA in a cumulative way pretty quickly.
To answer your question, shielding is possible but it is much harder in space than it is under the magnetosphere. Even so, a stray particle can wreak havoc. Shielding for the purposes of data protection on earth is essentially a risk/reward analysis. Anyone could get hit by a cosmic bullet, but the chances are pretty low. The odds of an SSD flipping bits on its own are significantly higher.
> That's not what actually happened here (probably), but think of relatively heavy atoms like iron, ejected from a supernova at close to the speed of light. They're not really "rays".
Wow, I have to admit I always assumed cosmic rays to be gamma radiation but alpha particle radiation sounds a lot more scary.
Does anyone happen to know if computers engineered for the space station or shuttles already have some built-in (memory) redundancy or magnetic shielding to account for damage by cosmic rays? I imagine a system crash of the life support systems in space would be devastating.
IIRC nobody is currently using magnetic fields for shielding, I don’t know if that’s due to insufficient effectiveness, power consumption, or unwanted interactions e.g. with Earth’s magnetosphere.
I've always wondered about that. Seems to me that if you can shield one side from the sun's heat and expose the other side to the cold of space, an MRI magnet's superconductors should be quite happy with the temperature. A big ol' magnet would be a pain to charge up once, and then provide long-term shielding.
MRIs are big electromagnets and they use big power and produce big heat. In space you have no convection so you must radiate away all of your heat which is challenging. Maybe you can make something better with superconductors that doesn't use much power, but I don't think it exists yet.
Superconductors famously don’t produce heat. Putting a superconducting magnet in shade (though that includes from the Earth not just from Sol) will keep it cool once it gets cool.
This is what the James Webb is going to be doing, though for non-superconducting reasons.
The magnet in an MRI is a superconducting electromagnet. It needs power once, to charge it up, then you close the loop and it just sits there being a superconducting electromagnet. The only power used continuously is for cooling, which is to say, when ambient room heat leaks in, it has to be carried out again. The magnet itself does not produce heat; it has no electrical resistance because it is a superconductor.
The vast majority of the cosmic radiation consists of protons, i.e. hydrogen ions.
Light ions, e.g. of helium, lithium or boron are also relatively abundant and heavier ions are less abundant. There are also electrons, but in smaller quantities, because they can be captured by ions.
The high speed protons collide with the atoms of the Earth atmosphere and the collisions generate a huge variety of particles, but most of them have a very short lifetime, so they decay before reaching ground level.
At ground level, the cosmic radiation consists mostly of muons, because they have a longer lifetime, so they survive the travel from the upper atmosphere where they are generated, until ground level.
Only extremely few of the particles from the primary cosmic radiation, i.e. protons, reach ground, because most are deflected by the magnetic field of the Earth and the remaining protons lose their energy in the collisions with atmosphere, by generating particles, which eventually decay into muons.
I had thought that Starlink would become extremely compelling when the servers were in orbit as well, but maybe that’s naive. Cubesats with massive arrays of active storage might be far too difficult (aka costly) to protect properly.
One of the issues with putting servers in orbit is cooling; you can't just use fans in space. On the other hand, real estate is pretty cheap. Servicing is hard, though, and micrometeorites are another risk. Plus launch costs being high, and radiation an issue, I don't see it happening any time soon outside of very specific areas.
Forget cooling for a minute. Cooling dissipates energy but you need to collect the energy first and worry about powering the server. It’s going to be a solar-plus-battery system, which is heavy and expensive, with substantial launch costs.
Radiation cooling is pretty dramatic in space. With an isolated radiation shield/mirror from the sun and a decent sized heatsink, you can have a lot of cooling power on hand. A server isn't going to be pumping out nearly as much heat as a rocket engine, and launch vehicles don't tend to overheat.
You can’t just put an SSD or whatever in orbit and expect reasonable read latencies at all times. It’s in orbit. It moves. Half the time it’s on the wrong side of the planet.
Putting servers in orbit is a really, really bad idea.
Firstly, it would be wildly inefficient, just due to the unavoidable delay both ways. You expect delay over a long-distance network, but you want the server to be positioned and cabled up to minimize latency. Just locating a satellite requires quite a bit of overhead, so treating them as servers rather than clients would create a huge amount of latency.
As a failover for ground-based servers in a nuclear war scenario, it could make sense, but they're also literally flying over the airspace of hostile nations who, in a war, have the capability to take them out. Mixing any kind of civilian control onto a server embedded in a flying artifact over an enemy nation is, obviously, a recipe for disaster in a hot war.
Take all of that plus the fact that the bandwidth is stone-age and spotty depending on position, and there's no satellite you'd want to use to set up a website selling alpaca sweaters. (Watch: a few years from now we'll find out that the US Space Command runs some second-strike end-of-world command-and-control server off a satellite; that still only proves it's too late to contact the fucker when you need it.)
> Firstly, it would be wildly inefficient, just due to the unavoidable delay both ways
The point I was thinking about is a scenario where Starlink satellites communicate with each other via laser (already starting to happen), and then communicate with the end user via the satellite over them. Because we're talking about speed-of-light transmission between sats, data in the Starlink network can theoretically cover "ground" (aka miles/km) faster than ground-based ISPs.
Then it makes sense to deploy servers within that network so that two Starlink users can have extremely low latency from user to server to other user (the slowest part being the earth to sat latencies, one for each user).
Not trivially; if it were viable the Chernobyl reactor would be gone already. Also, IC packaging and chassis materials can be sources of stray electron beams, and other modes of glitching, like power failure, exist.
The cosmic ray induced radiation at sea level is a mix of neutrons, gamma rays, electrons, and highly penetrating high energy muons (heavier cousins of electrons). They are caused by the cascade of reactions when the CRs collide with nuclei in the atmosphere. You're right, neutrinos barely interact (hence the name "little neutral ones") so they're no problem. You could put infrastructure underground to escape muons, but that wouldn't be practical. Moreover, any shielding (and the materials in the electronics) needs to be carefully chosen to minimise naturally occurring radioactive materials that could lead to upsets themselves. It's a tricky one. There are other ways to mitigate the risk, error checking, redundant systems, etc.
The problem isn't cosmic rays, per se, it's the issues created by any single-instance modification of a record.
The general solution here is to store multiple redundant copies of data and checksums of those data, in order that lack-of-consistency can be determined and the specific records which are considered suspect can be clearly identified.
In general, a record is a persistence of a pattern through time subject to entropic forces.[1] Given that modifications to records are typically random events, keeping (and continuously integrity-checking) multiple copies of those records is the best general defence, not guarding against one specific variety of modification in a single instance of a record.
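A minimal sketch of that approach in Python, with SHA-256 as the checksum and three copies as arbitrary illustrative choices: keep several replicas alongside their digests, flag any copy whose digest no longer matches, and repair it from a copy that still verifies.
    import hashlib

    def digest(blob: bytes) -> str:
        return hashlib.sha256(blob).hexdigest()

    def store(blob: bytes, copies: int = 3):
        # Keep independent copies alongside the checksum recorded at write time.
        return [{"data": bytearray(blob), "sha256": digest(blob)} for _ in range(copies)]

    def scrub(replicas) -> int:
        # Flag suspect copies (checksum mismatch) and repair them from a copy that still verifies.
        good = [r for r in replicas if digest(bytes(r["data"])) == r["sha256"]]
        if not good:
            raise RuntimeError("all replicas corrupt; cannot repair")
        repaired = 0
        for r in replicas:
            if digest(bytes(r["data"])) != r["sha256"]:
                r["data"] = bytearray(good[0]["data"])   # restore from a verified copy
                repaired += 1
        return repaired

    replicas = store(b"some record worth keeping")
    replicas[1]["data"][0] ^= 0x04        # simulate a single flipped bit in one copy
    assert scrub(replicas) == 1           # the suspect copy is identified and repaired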
Can't say I really like the title as it comes across as an absolute statement whereas the bit flip could have happened for any number of unknown reasons.
I saw a similar take on Twitter, and whilst root-causing such things (especially when it's a single occurrence) isn't always possible, shrugging and saying "cosmic rays" should be the last thing to posit, not one of the first.
I remember an older, less computer-savvy gentleman asking for support because "the program isn't working", when after a few questions we realized his computer wouldn't boot up (screen dark, etc.). We thought his terminology was all screwed up. But now I realize he just lacked the necessary PR skill. He should have said that "the program isn't working" is a well-known term of art for power users such as himself. The fact that people who actually have a clue about computing find this imprecise term upsetting is their problem.
It so happens that aside from developing software, I'm a physicist working on particle physics (specifically, I make my living in industry from cosmic rays). So I can assure you that "cosmic rays" actually means something. Something very specific. Books have been written full of specific, intelligent things we can say about cosmic rays. If you go and appropriate it to describe a whole bunch of phenomena because you can't be bothered to distinguish between them, you're in the exact same boat as that gentleman from the start of the story.
Using wrong terms prevents understanding, as can be seen in all the stories linked elsethread so far. For example, using thinner silicon typically reduces the rate of Single-Event Upsets (the malfunctions caused by cosmic rays), but using smaller silicon components typically increases the rate of malfunctions due to quantum fluctuations. The latter typically happen in specific hardware that we manufactured not quite as well as the rest. SEUs happen at the same rate in all hardware.
You're saying you have to engineer chip components so they're small enough not to be hit as often by cosmic rays, but large enough to avoid quantum fluctuations? Are quantum fluctuations something Intel is dealing with regularly as they get down under 3nm?
Yes, quantum tunneling (an electron "teleporting" into or out of transistors) has been an issue everyone has had to design around for multiple process nodes now.
Go easy on the old guy. It sounds to me like he was completely correct, at the appropriate level of abstraction for him.
And as for cosmic rays...this may be a sensitive topic for you (or maybe you're just in a condescending mood), so I'll tread carefully, but seems simple to me.
The actual cause of the problem, given the instrumentation in place at the time of incident, is unknowable. But it might have been a subatomic particle. It absolutely definitively is, sometimes.
Metonymy might be imprecise, but it's human. I'm not sure what standard you intend to hold commenters here (or old guys with computers) to.
> If you go and appropriate it to describe a whole bunch of phenomena because you can't be bothered to distinguish between them
Cosmic ray is the appropriate colloquial term. Just like "bug" is the appropriate term to describe computer problems that have nothing to do with insects. It's a well established colloquialism and not simply terminology made up on the spot like in your example.
Why? When you do Raman spectroscopy, in any given session you're likely to see one go through a 1 cm^2 sample. Students are taught that if you get a super sharp, increasing-looking peak, first rerun it because you probably just saw a cosmic ray.
Hardware issues that cause bitflips, like that one famous issue[1] with Sun servers in ~2000 (supposedly caused by radioactive isotopes in IBM-manufactured SRAM chips), are often called "cosmic rays".
No! This exact issue is a classic example because at one point the fault was attributed to cosmic rays by Sun before the true cause came to light. As another commenter has said, terminology matters.
These are all very fair statements but there’s no guarantee that ECC memory was even used. Computers typically fail open when ECC is potentially present but not available.
People also cite early-stage Google and intentionally do not buy ECC components, running more consumer hardware for production workloads.
It’s always humorous to me when people use the term theology in situations such as this; it makes me wonder, as human mental bandwidth becomes more strained and we increase our specializations to the n-th degree, what will constitute theology in the future?
The benevolent and malevolent rogue AI, eluding capture or control by claiming critical infrastructure. Some generations of humans will pass, and the deities in the outlets will come to be.
Future is already here and we call it consensus. Trusting your peers, believing they are honest and proficient in their respective fields is a natural human response to the unknown phenomena.
Public CAs are not that type of people; I would be disappointed if they were not running two separate systems checking each other for consistency; having top-of-the-range ECC running well inside its specification must be table stakes.
Not hundreds. There are currently 52 root CA operators trusted by Mozilla (and thus Firefox, but also most Linux systems and lots of other stuff) a few more are trusted only by Microsoft or Apple, but not hundreds.
But also, in this context we aren't talking about the CAs anyway, but the Log operators, and so for them reliability is about staying qualified, as otherwise their service is pointless. There are far fewer of those, about half-a-dozen total. Cloudflare, Google, Digicert, Sectigo, ISRG (Let's Encrypt), and Trust Asia.
[Edited, I counted a column header, 53 rows minus 1 header = 52]
These things exist, using trade names like chipkill or lockstep memory. Though they don't need to sacrifice half of the memory chips to get good error recovery properties.
Note that this is still not end-to-end protection of data integrity. Bit flips happen in networking, storage, buses between everything, caches, CPUs, etc. See eg [1]
According to the Intel developer's manual L1 has parity and all caches up from that have ECC. This would seem to imply that the ring / mesh also has at least parity (to retry on error). Parity instead of ECC on L1(D) makes sense since the L1(D) has to handle small writes well, while the other caches deal in lines.
Checksumming filesystems and file transfer protocols cover many cases. SCP, rsync, btrfs, and zfs all fix this problem.
As for guaranteeing the computed data is correct: I know space systems often have two redundant computers that calculate everything and compare results. It's crazy expensive and power demanding, but it all but solves the problem.
Usually they have an odd number. The Space Shuttle had five, of which the fifth was running completely different software. In case of a 2/2 split the crew could shut down a pair (in case the failure was clear) or switch to the backup computer.
IIRC, ECC memory can correct single-bit flips. It can detect/warn/fail on double-bit flips. And it cannot detect triple-bit flips. This might be a simplified understanding, but if this has happened only once, that seems to match up with my intuitive understanding of the probability of a triple-bit flip occurring in a particular system.
You multiply the probability of two random events to get the probability they will happen at the same time. If the expected value of a bit flip is 10^-18, then two would be 10^-36 and three would be 10^-54.
At some point it becomes a philosophical question of how much can the tails of the distribution be tolerated. We've never seen a quantum fluctuation make a whale appear in the sky.
DRAM failures are not independent events, so it’s not appropriate to multiply the probabilities like that. Faults are often clustered in a row, column, bank, page or whatever structure your DRAM has, raising the probability of multi-bit errors.
I don't see why a high-energy particle strike would confine itself to a single bit. The paper I posted elsewhere in this thread says that "the most likely cause of the higher single-bit, single-column, and single-bank transient fault rates in Cielo is particle strikes from high-energy neutrons". In the paper, both single-bit and multi-bit errors are sensitive to altitude.
A single particle strike would only affect a single transistor. If that transistor controls a whole column of memory, then sure it could corrupt lots of bits. With ECC, though, it would probably result in a bunch of ECC blocks with a single bit flip, rather than a single ECC block with several bit flips.
Instead of ECC it would also be possible to run the log machine redundantly and have each replica cross-check the others before making an update public. I assume the log calculation is deterministic.
Process enough data and even ECC can - and will - fail undetected. Any kind of mechanism you come up with is going to have some rate of undetected errors.
Given the rate required for this, it's not a reasonable assumption. It's like saying Amazon sees sha256 collisions between S3 buckets. It just doesn't happen in practice.
Undetected ECC errors are common enough to see from time to time in the wild. This paper estimates that a supercomputer sees one undetected error per day.
"Forward error correction" is really just the application of ECC while transmitting data over a lossy link in order to tolerate errors without two-way communication.
The ECC used in memory is likely relatively space-inefficient, with the benefit of being computationally simple so it can be done quickly in hardware. More redundancy could be added to tolerate more bit flips, but it would either add a lot of memory overhead or a lot of computational complexity. In particular, something really good like Reed-Solomon would likely be very difficult to encode on every single memory write, at least not without taking a several-orders-of-magnitude performance hit. It would likely be easier just to have 2x ECC memory, or 3x non-ECC memory and do majority voting.
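To give a feel for how cheap single-error correction is, here is a toy Hamming(7,4) sketch in Python (real DIMM ECC is typically a wider SECDED code over 64-bit words, and this toy skips the extra parity bit used for double-error detection): parity bits sit at the power-of-two positions, and the syndrome of a corrupted word directly names the flipped bit.
    from functools import reduce
    from operator import xor

    DATA_POS = [3, 5, 6, 7]    # non-power-of-two positions carry data (1-indexed)
    PARITY_POS = [1, 2, 4]     # power-of-two positions carry parity

    def encode(data4):
        # data4: four bits -> Hamming(7,4) codeword (list of seven bits).
        word = [0] * 8                                   # index 0 unused
        for pos, bit in zip(DATA_POS, data4):
            word[pos] = bit
        t = reduce(xor, (pos for pos in DATA_POS if word[pos]), 0)
        for p in PARITY_POS:
            word[p] = 1 if t & p else 0                  # choose parity so the syndrome is zero
        return word[1:]

    def decode(code7):
        # Correct at most one flipped bit, then return the four data bits.
        word = [0] + list(code7)
        syndrome = reduce(xor, (i for i, b in enumerate(word) if b), 0)
        if syndrome:                                     # a nonzero syndrome names the bad position
            word[syndrome] ^= 1
        return [word[pos] for pos in DATA_POS]

    cw = encode([1, 0, 1, 1])
    cw[4] ^= 1                                           # flip one bit "in flight"
    assert decode(cw) == [1, 0, 1, 1]                    # single-bit error corrected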
Bit flips like that should be easy to reverse. Flip just one bit at the n-th place, test the certificate again, and vary n until it is valid. It's doable in linear time.
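A minimal sketch of that linear search, assuming we hold the corrupted blob and a known-good SHA-256 of the original (all the names here are made up): flip each bit in turn, rehash, and stop when it matches. Each candidate flip is independent, so it also parallelizes trivially.
    import hashlib

    def recover_single_bitflip(corrupted: bytes, expected_sha256: str):
        # Try every single-bit flip; return the repaired blob, or None if one flip isn't enough.
        blob = bytearray(corrupted)
        for byte_idx in range(len(blob)):
            for bit in range(8):
                blob[byte_idx] ^= 1 << bit               # flip a candidate bit
                if hashlib.sha256(blob).hexdigest() == expected_sha256:
                    return bytes(blob)                   # found the original
                blob[byte_idx] ^= 1 << bit               # undo and keep searching
        return None

    original = b"-----BEGIN CERTIFICATE-----..."         # stand-in for the logged data
    expected = hashlib.sha256(original).hexdigest()
    damaged = bytearray(original)
    damaged[10] ^= 0x20                                  # simulate the flip
    assert recover_single_bitflip(bytes(damaged), expected) == original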
It could, yes. I’m sure the thought also crossed the mind of everybody else in that thread, and they’ve clearly dismissed it.
Bare conspiratorial assertions don't advance the conversation in any meaningful way. If you have something constructive to add to the discussion, by all means do so. Otherwise, please refrain from this sort of comment.
Humans can examine the underlying data and reason that indeed one bitflip could cause the observed result.
In this particular case the alternatives are:
1. The bitflip happened. Cosmic rays. Defective CPU. Maybe a tremendously rare data tearing race somewhere in software.
OR
2a. Somebody has a fast way to generate near-collisions in SHA256, which are 1-bit errors from the target hash. Rather than publish this breakthrough they:
2b. Had a public CA issue and sign two certificates, one uninteresting real certificate for *.molabtausnipnimo.ml which we've now seen and another "evil" near-colliding certificate with the one-bit-different hash then they:
2c. Logged the real certificate but at only one log (Yeti) they get the log to show the real certificate yet actually use the hash from the "evil" certificate, thus perhaps obtaining a single valid SCT for the "evil" certificate.
Scenario 1 isn't terribly likely for any one certificate, but given we're logging vast numbers every day and errors inherently cascade in this system, it was going to happen sooner or later.
Scenario 2 fails for having too many conspirators (a public CA, and likely a different public log) and yet it achieves almost nothing even if it remains secret. Your Chrome or Safari expects at least two SCTs (one of them from Google, in this case Google's Argon 2022 log), one doesn't get the job done. And it was never going to remain secret, this occurrence was spotted after less than a day, and thoroughly investigated, resulting in the log being shut down, after just a few days.
[Edited multiple times to fix awful HN formatting.]
either because a single bit flip (of the code) could still cause a launch. You'd hope that it was at least stored in ROM and that ECC ensures that even if such a bit flips it does not lead straight to Armageddon.
There are some 'near miss' stories where a single switch made all the difference:
Obviously this anecdote is decades out of date, but my first boss's PhD thesis was for an automatic small-airplane guidance system. I mean: as long as your plane was on a high-speed ballistic arc and needed a guidance system that only ran for about 25 minutes.
The guidance system used mercury & fluidic switches, in case the small aircraft encountered a constant barrage of extremely large EMPs.
You also need to ensure that launch_label and the location of the branch instruction are both more than one bit away from fail_label. You can duplicate the JUMP_IF_NOT_EQUAL instructions - or indeed the whole block before the BRANCH - as necessary to ensure that.
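A small sketch of that idea in Python (the constant values are made up): choose the sentinel values so that no small number of flipped bits can turn one into the other, and assert the distance in a test.
    def hamming_distance(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    # Hypothetical sentinels, chosen to differ in many bit positions rather than
    # being adjacent values like 0 and 1 that a single upset could convert.
    LAUNCH_LABEL = 0x5A5AA5A5
    FAIL_LABEL   = 0xA5A55A5A

    # Even a budget of several simultaneous upsets cannot morph one into the other.
    assert hamming_distance(LAUNCH_LABEL, FAIL_LABEL) > 3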
Your comment is one of the reasons why I still believe assembly has its place; when it really matters what kind of instructions you put out, this is the sort of control you want. What your high-level language is going to dump in the instruction stream is totally invisible (and interpreted languages are not even up for consideration in situations like these).
There have been false alerts caused by single chip failures. In 1980 there was a nuclear alert in the US that lasted over three minutes.
Generally, critical systems can't be armed without physical interaction from humans. It's not just computer logic, but powering up the system that can do the launch. It does not matter what the logic does as long as the ignition system is not powered up using a physical switch.
I don't know about nuclear missiles, but on the Space Shuttle I think they had four duplicate flight computers, and the outputs of all of them would be compared to look for errors. (They also had a fifth computer running entirely different software, as a failover option.)
The push for crypto without ECC RAM is a nonstop horror show. Software under normal circumstances is remarkably resilient to having its memory corrupted. However, crypto algorithms are designed so that a single bit flip effectively changes all the bits in a block. If you chain blocks then a single bit flip in one block destroys all the blocks. I've seen companies like MSPs go out of business because they were doing crypto with consumer hardware. No one thinks it'll happen to them, and once it happens they're usually too dim to even know what happened. Bit flips aren't an act of god; you simply need a better computer.
> The push for crypto without ECC RAM is a nonstop horror show
That's a bit hyperbolic.
First, ECC doesn't protect the full data chain: you can have a bitflip in a hardware flip-flop (or a latch can open a gate that drains a line, etc.) before the value reaches the memory. Logic is known to glitch too.
Second: ECC is mostly designed to protect long-term storage in DRAM. Recognize that a cert like this is a very short-term value; it's computed and then transmitted. The failure happened fast, before copies of the correct value were made. That again argues for a failure location other than a DRAM cell.
But mostly... this isn't the end of the world. This is a failed cert, which is a failure that can be reasonably easily handled by manual intervention. There have been many other mistaken certs distributed that had to be dealt with manually: they can be the result of software bugs, they can be generated with the wrong keys, the dates can be set wrong, they can be maliciously issued, etc... The system is designed to include manual validation in the loop and it works.
So is ECC a good idea? Of course. Does it magically fix problems like this? No. Is this really a "nonstop horror show"? Not really, we're doing OK.
Yes, this kills Yeti 2022. There's a bug referenced which refers to an earlier incident where a bitflip happened in the logged certificate data. That was simply fixed: overwrite with the bit flipped back, and everything checks out from then onwards.
But in this case it's the hash record which was flipped, which unavoidably taints the log from that point on. Verifiers will forever say that Yeti 2022 is broken, and so it had to be locked read-only and taken out of service.
Fortunately, since modern logs are anyway sharded by year of expiry, Yeti 2023 already existed and is unaffected. DigiCert, as log operator, could decide to just change criteria for Yeti 2023 to be "also 2022 is fine" and I believe they may already have done so in fact.
Alternatively they could spin up a new mythical creature series. They have Yeti (a creature believed to live in the high mountains and maybe forests) and Nessie (a creature believed to live in a lake in Scotland) but there are plenty more I'm sure.
It doesn't break anything that I can see (though I'm no expert on the particular protocol). Our ability to detect bad certs isn't compromised, precisely because this was noticed by human beings who can adjust the process going forward to work around this.
Really the bigger news here seems to be a software bug: the CT protocol wasn't tolerant of bad input data and was trusting actors that clearly can't be trusted fully. Here the "black hat" was a hardware glitch, but it's not hard to imagine a more nefarious trick.
Your statement is, to be frank, nonsensical. The protocol itself isn't broken, at least for previous Yeti instances; certificate data are correctly parsed and rejected.* In this instance, it seems that the data was verified pre-signing BUT was flipped mid-signing. This isn't the fault of how CT was designed but rather a hardware failure that requires correction there. (Or at least that's the likely explanation; it could be a software bug° but it would be a very consistent and obvious behaviour if it were indeed a software bug.)
On the issue of subsequent invalidation of all submitted certificates, this is prevented by submitting to at least 3 different entities (as of now, there's a discussion about whether this should be increased), so if a log is subsequently found to be corrupted, the operator can send an "operator error" signal to the browser, and any tampered logs are blacklisted from browsers. (Note that all operators of CT lists are members of the CA/B Forum, at least as of 2020. In the standardisation phase, some individuals operated their own servers, but this is no longer true.)
* Note that if the cert details are nonsensical but technically valid, it is still accepted by design, because all pre-certificates are countersigned by the intermediate signer (which the CT log operator checks from known roots). If the intermediate is compromised, then the correct response is obviously a revocation and possibly distrust.
° At least the human-induced variety; you could say that this incident is technically a software bug that occurred due to a hardware fault.
So I'm learning about Yeti for the first time, but I don't buy that argument. Corrupt transmitted data has been a known failure mode for all digital systems since they were invented.
If your file download in 1982 produced a corrupt binary that wiped your floppy drive, the response would have been "Why didn't you use a checksumming protocol?" and not "The hardware should have handled it".
If Yeti can't handle corrupt data and falls down like this, Yeti seems pretty broken to me.
Not handling corrupted data is kind of the point of cryptographic authentication systems. Informally and generally, the first test of a MAC or a signature of any sort is to see if it fails on arbitrary random single bit flips and shifts.
The protocol here seems to have done what it was designed to do. The corrupted shard has simply been removed from service, and would be replaced if there was any need. The ecosystem of CT logs foresaw this and designed for it.
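The detection property mentioned above (a MAC or signature failing on arbitrary single-bit flips) is easy to demonstrate with a toy HMAC check in Python, where the key and message are placeholders: the untampered message verifies, and flipping a single bit makes verification fail, which is exactly the behaviour you want.
    import hmac, hashlib

    key = b"placeholder-key"
    msg = bytearray(b"certificate entry bytes go here")

    tag = hmac.new(key, msg, hashlib.sha256).digest()
    assert hmac.compare_digest(hmac.new(key, msg, hashlib.sha256).digest(), tag)

    msg[3] ^= 0x01   # one flipped bit anywhere in the message...
    assert not hmac.compare_digest(hmac.new(key, msg, hashlib.sha256).digest(), tag)
    # ...and the authenticator rejects it rather than accepting corrupted data.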
I think Yeti2022 is just the name for one instance of the global CT log? Nick Lamb could probably say more about this; I understand CT mostly in the abstract, and as the total feed of CT records you surveil for stuff.
Yeti is one of the CT logs (and Yeti2022 the 2022 shard of it, containing certs that expire in 2022). CT logs are independent of each other; there is not really a "global log", although the monitoring/search sites aggregate the data from all logs. Each certificate is added to multiple logs, so the loss of one doesn't cause a problem for the certs in it. (Maybe it's also possible to still trust Yeti2022 for the parts of the log that are well-formed, which would decrease the number of impacted certs even more; I'm not familiar enough with the implementations to say.)
The Yeti2022 log is corrupted due to the random event. This has been correctly detected, and is by design and policy not fixable, since logs are not allowed to rewrite their history (ensuring that they don't is very much the point of CT). That the log broke is annoying but not critical, and the consequences are very much CT working as intended.
You can argue if the software running the log should have verified that it calculated the correct thing before publishing it, but that's not a protocol concern.
Presumably it's possible to code defensively against this sort of thing, by eg. running the entire operation twice and checking the result is the same before committing it to the published log?
Big tech companies like Google and Facebook have encountered problems where running the same crypto operation twice on the same processor deterministically or semi-deterministically gives the same incorrect result... so the check needs to be done on separate hardware as well.
I don't think that matters in this case, because the entire point of the log machine is to run crypto operations. If it has such a faulty processor it is basically unusable for the task anyway.
If a system is critical it should run on multiple machines in multiple locations and "sync with checks" kinda like the oh so hated and totally useless blockchains.
Then if such a bit flip occurred, it would never occur on all machines at the same time in the same data.
And on top of that you could easily make the system fix itself if something like that happens (simply assume the majority of nodes didn't have the bit flip), or in the worst-case scenario it could at least stop rather than making "wrong progress".
I have no clue what this particular log needs in terms of "throughput specs", but I assume it would be easily achievable with current DLT.
Safety-critical systems are not a good fit for a blockchain-based resolution to the Byzantine Generals problem. Safety-critical systems need extremely low latency to resolve the conflict fast. So blockchain is not going to be an appropriate choice for all critical applications when there are multiple low-latency solutions for BFT at very low millisecond latency and, IIRC, microseconds for avionics systems.
Not sure what you mean by "safety critical". I made no such assumption. Also, since the topic is a (write-only) log, it probably doesn't need such low latency. What it more likely needs is finality: once an entry is made and accepted it must be final and, of course, correct. DLTs can do this in a distributed and self-fixing way, i.e. a node that tries to add faulty data is overruled and can never get a confirmation for a final state that would later turn out to be invalid.
Getting the whole decentralized system to "agree" will never be fast unless we have quantum tech. There is simply no way servers around the globe could communicate in microseconds even if all communication happened at the speed of light and processing were instant. It would still take time.
In reality such systems need seconds, which is often totally fine. As long as everyone only relies on the data that has been declared final.
I thought you were making a more general statement about all critical systems, that's all. And since many critical systems have a safety factor in play, I wanted to distinguish them as not always being a good target for a Blockchain solution to the problems of consensus.
Blockchain is a very interesting solution to the problem of obtaining consensus in the face of imperfect inputs; there are other options, so, like anything else, you choose the right tool for the job. My own view is that, given other established protocols, blockchain is going to be overkill for dealing with some types of fault tolerance. It is a very good fit for applications where you want to minimize relying on the trust of humans. (And other areas too, but right now I'm just speaking of the narrow context of consensus amid inconsistent inputs.)
Critical in the sense that the system operates/keeps operating/does not reach an invalid state.
It could be for safety, but in general it's more to avoid financial damage. Downtime of any kind usually results in huge financial losses and people working extra shifts. This was my main point.
>...blockchain is going to be overkill for dealing with some types of fault tolerance
But in this case it likely isn't. The current system already works with a chain of blocks; it just lacks the distributed checking and all that stuff. But "blockchains" aren't some secret sauce in this case, just a way to implement a distributed write-only database with field-proven tech. It can be as lightweight as any other solution. The consensus part is completely irrelevant anyway because all nodes are operated by one entity. But due to the use case (money/value) of modern DLTs ("blockchains"), they are incredibly reliable by design. The oldest DLTs that use FBA (instead of PoW/PoS) have been running for 9+ years without any error or downtime. Recreating a similarly reliable system would be months and months of work followed by months of testing.
Yep, pretty much agreed. Whatever anyone may think of cryptocurrencies, blockchains are essentially a different technology where coins are just one application. I'm kind of "meh" on cryptocurrencies (not anti, just think they need a while more to mature), but trustless consensus is a significant innovation in its own right.
I haven't thought about whether an actual blockchain is really the best solution, but the redundancy argument is legitimate. We've been doing it for decades in other systems where an unnoticed bit flip results in complete mission failure, such as an Apollo mission crash.
I'm not really sure what Yeti 2022 is exactly, so take this with heaps of salt, but it seems like this is a "mission failure" event -- it can no longer continue, except as read only. Crypto systems, even more than physical systems like rockets, suffer from such failures after just "one false step". Is the cost of this complete failure so low that it doesn't merit ECC? Extremely doubtful. Is it so low that it doesn't merit redundancy? More open for debate, but plausibly not.
I know rockets experience more cosmic rays and their failure can result in loss of life and (less importantly) losing a lot more money, and everything is a tradeoff -- so I'm not saying the case for redundancy is watertight. But it's legitimate to point out there is an inherent and, it seems, under-acknowledged fragility in non-redundant crypto systems.
>I haven't thought about whether an actual blockchain is really the best solution...
Most likely not. But the tech behind FBA (Federated Byzantine Agreement) distributed ledgers would make an extremely reliable system that can handle hardware malfunctions and large outages of nodes. And since this is a write-only log and only some entities can write to it, it could be implemented as permissioned, so that the system doesn't have to deal with the attacks a public blockchain would face.
Technically everyone can write to it. However you can only write certain specific things.
In the case of Yeti 2022 you were only able to log (pre-)certificates signed by particular CAs trusted in the Web PKI, which were due to expire in the year 2022.
In practice the vast majority of such logging is done by issuing CAs, as part of their normal operations. But it is possible (and is done purposefully, at least sometimes) to obtain certificates which have not been logged. These certificates, of course, won't work in Chrome or Safari because there's no proof they were logged. But you can log them yourself, and get SCTs and show those Just In Time™.
This is only an interesting technical option if you have both the need to do it for some reason and the capability to send SCTs that aren't baked inside your certificates. The vast majority of punters just have the CA do all this for them, and the certificate they get has SCTs baked right inside it, so there are no technical changes for them at all; they needn't know CT exists.
Because the CA's critical business processes depend on writing to logs, they need a formal service level agreement assuring them that some logs they use will accept their writes in a timely fashion and meet whatever criteria, but you as an individual don't need this, you don't care if the log you wanted to use says it's unavailable for 4 hours due to maintenance.
That's pretty much the default state of any blockchain-like system. You need a private key to write to it. It's just that in most public blockchains an infinite number of new private keys can be generated, and some kind of token is attached to them. For a log, none of that would be needed. A central operator could hand out keys to anyone who should be able to write to it, and for all others it's read-only. And of course the key alone would still not allow someone to write invalid data.
Sometimes there are problems with certificates in the Web PKI (approximately the certificates your web browser trusts to determine that this is really news.ycombinator.com, for example). It's a lot easier to discover such problems early, and detect if they've really stopped happening after someone says "We fixed it" if you have a complete list of every certificate.
The issuing CAs could just have like a huge tarball you download, and promise to keep it up-to-date. But you just know that the same errors of omission that can cause the problem you were looking for, can cause that tarball to be lacking the certificates you'd need to see.
So, some people at Google conceived of a way to build an append-only log system, which issues signed receipts for the items logged. They built test logs by having the Google crawler, which already gets sent certificates by any HTTPS site it visits as part of the TLS protocol, log every new certificate it saw.
Having convinced themselves that this idea is at least basically viable, Google imposed a requirement that in order to be trusted in their Chrome browser, all certificates must be logged from a certain date. There are some fun chicken-and-egg problems (which have also been solved, which is why you didn't need to really do anything even if you maintain an HTTPS web server) but in practice today this means if it works in Chrome it was logged. This is not a policy requirement: not logging certificates doesn't mean your CA gets distrusted - it just means those certificates won't work in Chrome until they're logged and the site presents the receipts to Chrome.
The append-only logs are operated by about half a dozen outfits, some you've heard of (e.g. Let's Encrypt, Google itself) and some maybe not (Sectigo, Trust Asia). Google decided the rule for Chrome is, it must see at least one log receipt (these are called SCTs) from Google, and one from any "qualified log" that's not Google.
After a few years operating these logs, Google were doing fine, but some other outfits realised hey, these logs just grow, and grow, and grow without end, they're append-only, that's the whole point, but it means we can't trim 5 year old certificates that nobody cares about. So, they began "sharding" the logs. Instead of creating Kitten log, with a bunch of servers and a single URL, make Kitten 2018, and Kitten 2019, and Kitten 2020 and so on. When people want to log a certificate, if it expires in 2018, that goes in Kitten 2018, and so on. This way, by the end of 2018 you can switch Kitten 2018 to read-only, since there can't be new certificates which have already expired, that's nonsense. And eventually you can just switch it off. Researchers would be annoyed if you did it in January 2019, but by 2021 who cares?
So, Yeti 2022 is the shard of DigiCert's Yeti log which only holds certificates that expire in 2022. DigiCert sells lots of "One year" certificates, so those would be candidates for Yeti 2022. DigiCert also operate Yeti 2021 and 2023 for example. They also have a "Nessie" family with Nessie 2022 still working normally.
Third parties run "verifiers" which talk to a log and want to see that it is in fact a consistent append-only log. They ask it for a type of cryptographic hash of all previous state, which will inherit from an older hash of the same sort, and so on back to when the log was empty. They also ask to see all the certificates which were logged; if the log operates correctly, they can calculate the forward state and determine that the log is indeed a record of a list of certificates, in order, and the hashes match. They remember what the log said, and if it were to subsequently contradict itself, that's a fatal error. For example if it suddenly changed its mind about which certificate was logged in January, that's a fatal error, or if there was a "fork" in the list of hashes, that's a fatal error too. This ensures the append-only nature of the log.
Yeti 2022 failed those verification tests, beginning at the end of June, because in fact it had somehow logged one certificate for *.molabtausnipnimo.ml but had mistakenly calculated a SHA256 hash which was one bit different, and then all subsequent work assumed the (bad) hash was correct. There's no way to rewind and fix that.
In principle if you knew a way to make a bogus certificate which matched that bad hash you could overwrite the real certificate with that one. But we haven't the faintest idea how to begin going about that so it's not an option.
So yes, this was mission failure for Yeti 2022. This log shard will be set read-only and eventually decommissioned. New builds of Chrome (and presumably Safari) will say Yeti 2022 can't be trusted past this failure. But the overall Certificate Transparency system is fine, it was designed to be resilient against failure of just one log.
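To make the "no way to rewind" point concrete, here is a toy hash-chain sketch in Python (real CT logs use a Merkle tree, so treat this purely as an analogy): each published head commits to the previous head plus the new entry, so one wrong leaf hash early on changes every head that verifiers have already recorded after it.
    import hashlib

    def heads(entries, corrupt_index=None):
        # Return the running chain heads; optionally flip one bit in one leaf hash.
        head, out = b"\x00" * 32, []
        for i, entry in enumerate(entries):
            leaf = hashlib.sha256(entry).digest()
            if i == corrupt_index:
                leaf = bytes([leaf[0] ^ 0x01]) + leaf[1:]    # the one-bit mistake
            head = hashlib.sha256(head + leaf).digest()      # new head commits to old head + leaf
            out.append(head)
        return out

    entries = [b"cert-1", b"cert-2", b"cert-3"]              # stand-ins for logged (pre)certificates
    good, bad = heads(entries), heads(entries, corrupt_index=0)

    # The flipped bit in the first leaf hash propagates into every later head, so heads that
    # verifiers have already recorded can never be reconciled with a "repaired" log.
    assert all(g != b for g, b in zip(good, bad))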
Conceivably you could also fix this by having all verifiers special-case this one certificate in their verification software to substitute the correct hash?
Obviously that's a huge pain but in theory it would work?
You really want to make everyone special case this because 1 CT log server had a hardware failure?
This is not the first time a log server had to be removed due to a failure, nor will it be the last. The whole protocol is designed to be resilient to this.
What would be the point of doing something besides following the normal procedures around log failures?
Thank you, that's an excellent description. The CT system as a whole does appear to have ample redundancy, with automated tools informing manual intervention that resolves this individual failure.
This is just learned helplessness because Intel were stingy as shit for over a decade and wanted to segregate their product lines. Error correction is literally prevalent in every single part of every PHY layer in a modern stack, it is an absolute must, and the lack of error correction in RAM is, without question, a ridiculous gap that should have never been allowed in the first place in any modern machine, especially given that density and bandwidth keeps increasing and will continue to do so.
When you are designing these systems, you have two options: you either use error correcting codes and increase channel bandwidth to compensate for them (as a result of injected noise, which is unavoidable), or you lower the transfer rate so much as to be infeasible to use, while also avoiding as much noise as you can. Guess what's happening to RAM? It isn't getting slower or less dense. The error rate is only going to increase. The people designing this stuff aren't idiots. That's why literally every other layer of your system builds in error correction. Software people do not understand this because they prefer to believe in magic due to the fact all of these abstractions result in stable systems, I guess.
All of the talk of hardware flip flops and all that shit is an irrelevant deflection. Doesn't matter. It's just water carrying and post-hoc justification because, again, Intel decided that consumers didn't actually need it a decade ago, and everyone followed suit. They've been proven wrong repeatedly.
Complex systems are built to resist failure. They wouldn't work otherwise. By definition, if a failure occurs, it's because it passed multiple safeguards that were already in place. Basic systems theory. Let's actually try building more safeguards instead of rationalizing their absence.
> By definition, if a failure occurs, it's because it passed multiple safeguards that were already in place.
Having worked on a good bunch of critical systems, there aren't multiple safeguards in most hardware.
E.g. a multiplication error in a core will not be detected by an external device. Or a bit flip when reading cache, or from a storage device.
Very often the only real safeguard is to do the whole computation twice on two different hosts. I would rather have many low-reliability hosts and do the computation twice than a few high-reliability and very expensive hosts.
Unfortunately the software side is really lagging behind when it comes to reproducible computing. Reproducible builds are a good step in that direction and it took many decades to get there.
Depends on the system. In this case it seems like retries are possible after a failure, so two is sufficient to detect bad data. You need three in real time situations where you don't have the capability to go back and figure it out.
Two hosts is efficient. Do it twice on two different hosts and then compare the results. If there is a mismatch, throw it away and redo it again on two hosts. A total of four computations are needed, but only if the difference really was due to bit flips, the chance of which is exceedingly rare. In all the rest of the cases, you get away with two instead of three computations.
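A minimal sketch of that scheme in Python, assuming the work is deterministic (the compute function here is just a stand-in, and in practice each of the two runs would go to a different host): run it twice, accept only on agreement, and redo the pair on a mismatch.
    import hashlib

    def compute(payload: bytes) -> bytes:
        # Stand-in for the real deterministic work (e.g. hashing/signing a log entry).
        return hashlib.sha256(payload).digest()

    def compute_checked(payload: bytes, run=compute, max_attempts=3) -> bytes:
        # Run the work twice and accept only when both runs agree; otherwise redo the pair.
        for _ in range(max_attempts):
            first, second = run(payload), run(payload)   # ideally dispatched to two different hosts
            if first == second:
                return first    # agreement: vanishingly unlikely that both runs flipped the same bit
        raise RuntimeError("persistent disagreement; suspect bad hardware")

    print(compute_checked(b"entry to be logged").hex())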
Safeguards do not just exist at the level of technology but also politics, social structures, policies, design decisions, human interactions, and so on and so forth. "Criticality" in particular is something defined by humans, not a vacuum, and humans are components of all complex systems, as much as any multiplier or hardware unit is. The fact a multiplier can return an error is exactly in line with this: it can only happen after an array of other things allow it to, some of them not computational or computerized at all. And not every failure will also result in catastrophe as it did here. More generally such failures cannot be eliminated, because latent failures exist everywhere even when you have TMR or whatever it is people do these days. Thinking there is any "only real safeguard" like quorums or TMR is exactly part of the problem with this line of thought.
The quote I made is actually very specifically in reference to this paper, in particular point 2, which is mandatory reading for any systems engineer, IMO, though perhaps the word "safeguard" is too strong for the taste of some here. But focusing on definitions of words is beside the point and falls into the same traps this paper mentions: https://how.complexsystems.fail/
Back to the original point: is ECC the single solution to this catastrophe? No, probably not. Systems are constantly changing and failure is impossible to eliminate. Design decisions and a number of other decisions could have mitigated it and caused this failure to not be catastrophic. Another thing might also cause it to topple. But let's not pretend like we don't know what we're dealing with, either, when we've already built these tools and know they work. We've studied ECC plenty! You don't need to carry water for a corporation trying to keep its purse filled to the brim (by cutting costs) to proclaim that failure is inevitable and most things chug on, regardless. We already know that much.
Failing to account for bit flips and HW failure is too common in web/service coding. Look up how Google dropped a massive Bigtable instance in prod and traced it back to a cosmic-ray bit flip that turned a WRITE instruction into a DROP TABLE instruction.
I laugh when I compare my day to day coding to that of an avionics programmer in the aero industry.
People who tout this don't understand the probability of bit flips. It's measured in failures per _billion_ hours of operation. This matters a ton in an environment with thousands of memory modules (data centers and supercomputers), but you're lucky to experience a single RAM bit flip more than once or twice in your entire life.
Looks like things are worse than I thought (but still better than most people seem to think). Interesting to note that the motherboard used affects the error rate, and it seems that part of it is a luck-of-the-draw situation where some DIMMs have more errors than others despite coming from the same manufacturer.
I still think it made sense up until now to not bother with it on consumer hardware, and even at this point. The probability of your phone having a software glitch needing a reboot is way higher. Now that it's practically a free upgrade? Should be included by default. But I still don't think it's nearly as nefarious as people make it out to be that it has been this way for so long
I don't think they are an issue in a phone, but in a system like a blockchain that goes through so much effort to achieve consistency, the severity of the error is magnified, hence the lower tolerance for the error rate.
Bit flips are guaranteed to happen in digital systems. No matter how low the probability is, it will never be zero. You can't go around thinking you're going to dodge a bullet because it's unlikely. If it weren't for the pervasive use of error detection in common I/O protocols, you would be subjected to these errors much more frequently.
We're talking specifically about ECC ram, which solves the specific problem caused by cosmic rays (and apparently bad motherboard design). IO protocol error correction is a totally different problem.
Would ECC have avoided the issue in this case? If so, then it's hard not to agree that it should be considered a minimum standard. It looks like Yeti 2022 isn't going to survive this intact, and while they can resolve the issues in other ways, not everyone will always be so fortunate, and ECC is a relatively small step to avoid a larger problem.
Using ECC is a no-brainer. Even the Raspberry Pi 4 has ECC RAM. It's not particularly expensive and only an artificial limitation Intel has introduced for consumer products.
The Pi 4 uses DRAM with on-die ECC, which AFAIK does not provide any means of reporting errors (corrected or uncorrected) to the SoC's memory controller. It is effectively a cost-saving measure to improve DRAM yields. As such, it does little to guarantee that there are no memory errors.
> First, ECC doesn't protect the full data chain: you can have a bitflip in a hardware flip-flop (or a latch can open a gate that drains a line, etc.) before the value reaches the memory. Logic is known to glitch too.
Of course. DRAM ECC protects against errors in the DRAM cells. That doesn't mean other components don't have other strategies for reducing errors which can form a complete chain.
Latches and arrays and register files often have parity (where data can be reconstructed) or ECC bits, or use low level circuits that themselves are hard or redundant enough to achieve a particular target UBER for the full system.
> Second: ECC is mostly designed to protect long-term storage in DRAM. Recognize that a cert like this is a very short-term value; it's computed and then transmitted. The failure happened fast, before copies of the correct value were made. That again argues for a failure location other than a DRAM cell.
Not necessarily. Cells that are sitting idle apart from refresh have certain error profiles, but so do ones under constant access. In particular, "idle" cells that are in fact being disturbed by adjacent accesses certainly have a non-zero error profile and need ECC too.
My completely anecdotal guess would be this error is at least an order of magnitude more likely to have occurred in non-ECC memory (if that's what was being used) rather than any other path to the CPU or cache or logic on the CPU itself.
I'm one of those who can easily downvote blockchain stuff mercilessly. It is not reflexive, though: I reserve it for dumb ideas; it just so happens that most blockchain ideas I see come off as dumb and/or as an attempted rip-off.
I think it is instructive to ask "why doesn't this mention guarantee downvotes?" because I don't think it's just cargo-culting. I doubt that many of those objecting to blockchain are objecting to Byzantine fault tolerance, DHTs, etc. Very high resource usage in the cost function, the ledger being public and permanent (a long-term privacy risk), negative externalities related to its use as a currency... these are the objections I commonly have and hear, and they are inapplicable here.
Extending what the Wikipedia article says, it's basically glorified database replication. But it also replicates and verifies the calculation to get to that data so it provides far greater fault tolerance. But since it is private you get to throw out the adversarial model (mostly the cost function) and assume failures are accidental, not malicious. It makes the problem simpler and lowers the stakes versus using blockchain for a global trustless digital currency so I don't think we should be surprised that it engenders less controversy.
>If you chain blocks then a single bit flip in one block destroys all the blocks. I've seen companies like MSPs go out of business because they were doing crypto with consumer hardware.
If it is caused by a single bit flip, you know the block in which that bit flip occurred and can flip each bit in turn until you find the right one. This is an embarrassingly parallel problem.
Let's say you need to search a 1 GB space for a single bit flip. That only requires that you test 8 billion bit flips. Given the merklized nature of most crypto, you will probably be searching a space far smaller than 1 GB.
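As a rough illustration, here is a minimal Python sketch of that brute-force repair, assuming you have the corrupted bytes and the SHA-256 digest the block is supposed to hash to (both names are illustrative, not from any real implementation):

    import hashlib

    def find_single_bitflip(corrupted: bytes, expected_digest: bytes):
        # Flip each bit in turn and return the repaired bytes if the hash matches.
        buf = bytearray(corrupted)
        for byte_index in range(len(buf)):
            for bit in range(8):
                buf[byte_index] ^= 1 << bit            # flip one bit
                if hashlib.sha256(buf).digest() == expected_digest:
                    return bytes(buf)                  # found the rotten bit
                buf[byte_index] ^= 1 << bit            # flip it back
        return None  # more than one flip, or the digest itself is corrupted

Splitting the byte range across cores or machines is trivial, which is what makes it embarrassingly parallel.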
>Bit flips aren't an act of god you simply need a better computer.
Rather than using hardware ECC, you could implement ECC in software. I think hardware ECC is a good idea, but you aren't screwed if you don't use it.
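For a concrete idea of what "ECC in software" can look like, here is a classic Hamming(7,4) code sketched in Python: it can correct any single flipped bit in a 7-bit codeword. (Real ECC DIMMs use a wider SECDED code over 64-bit words; this is just the smallest illustrative version.)

    def hamming74_encode(nibble: int) -> int:
        # 4 data bits -> 7-bit codeword; parity bits sit at positions 1, 2 and 4.
        d = [(nibble >> i) & 1 for i in range(4)]
        p1 = d[0] ^ d[1] ^ d[3]
        p2 = d[0] ^ d[2] ^ d[3]
        p3 = d[1] ^ d[2] ^ d[3]
        bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7
        return sum(b << i for i, b in enumerate(bits))

    def hamming74_decode(codeword: int) -> int:
        bits = [(codeword >> i) & 1 for i in range(7)]
        s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]    # checks positions 1,3,5,7
        s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]    # checks positions 2,3,6,7
        s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]    # checks positions 4,5,6,7
        syndrome = s1 | (s2 << 1) | (s3 << 2)         # 0 = clean, else error position
        if syndrome:
            bits[syndrome - 1] ^= 1                   # correct the flipped bit
        return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

For example, hamming74_decode(hamming74_encode(0b1011) ^ (1 << 4)) corrects the injected flip and still returns 0b1011.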
The big threat here is not the occasional random bit flip, but adversary-caused targeted bit flips, since adversaries can flip software state in ways that won't cause detectable failures but will cause hard-to-detect security failures.
In my younger days I worked in customer support for a storage company. A customer had one of our products crash (kernel panic) at a sensitive time. We had a couple of crack engineers I used to hover around to try to pick up gdb tricks so I followed this case with interest—it was a new and unexpected panic.
Turns out the crash was caused by the processor taking the wrong branch. Something like this (it wasn’t Intel but you get the picture):
test edx, edx
jnz $something_that_expects_nonzero_edx
Well, edx was 0, but the CPU jumped anyway.
So yeah, sometimes ECC isn’t enough. If you’re really paranoid (crypto, milsec) you should execute multiple times and sanity-test.
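A trivial Python sketch of the "execute multiple times and sanity-test" idea, assuming the operation returns a hashable value (compute and run_redundantly are illustrative names, not anything from that incident):

    from collections import Counter

    def run_redundantly(compute, runs: int = 3):
        # Run the sensitive computation several times and require a majority result.
        results = [compute() for _ in range(runs)]
        value, votes = Counter(results).most_common(1)[0]
        if votes <= runs // 2:
            raise RuntimeError("no majority - possible transient hardware fault")
        return value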
> Software under normal circumstances is remarkably resilient to having its memory corrupted
Not really? That claim breaks down for anything that uses hashing of some sort, where the goal by design is to produce a completely different output even with a single bit flip.
And "resilience" is not a precise enough term. Is it just recovering after errors due to flips? Or is it guaranteeing that the operation will yield correct output (which implies not crashing)? The latter is far harder.
I used to sell workstations and servers years ago, and trying to convince people they needed ECC RAM, and that it was just insurance for (often) just the price of the extra chips on the DIMMs, was a nightmare.
The amount of uninformed and inexperienced counter-arguments online suggesting it was purely Intel seeking extra money (even though they didn't sell RAM) was ridiculous.
I never understood why there was so much pushback from the consumer world commenting on something they had no idea about. Similar arguments to "why would you ever need xx GB of RAM", while also condemning the incorrect 640 KB RAM Bill Gates comment.
Cosmic-ray bit flipping is real and it has real security concerns. This also makes Intel's efforts at market segmentation by not having ECC support in any consumer CPUs [1] even more unforgivable and dangerous.
Also, ECC RAM is technically supported on AMD’s recent consumer platforms, although it’s not advertised as such since they don’t do validation testing for it.
I've read that reporting of ECC events is not supported on consumer Ryzen. It's not a complete solution and since unregistered ECC is being used, how can you even be sure the memory controller is doing any error correction at all?
Someone would need to induce memory errors and publish their results. I'd love to read it.
This is 4 years old now, but does produce some interesting results.
The author of that article doesn't have hands-on experience with ECC DRAM, and mistakenly concludes that ECC on Ryzen is unreliable because of a misunderstanding of how Linux behaves when it encounters an uncorrected error. However, the author at least includes screenshots which show ECC functionality on Ryzen working properly.
> ...since unregistered ECC is being used, how can you even be sure the memory controller is doing any error correction at all?
ECC is performed by the memory controller, and requires an extra memory device per rank and 8 extra data bits, which unbuffered ECC DIMMs provide.
Registered memory has nothing to do with ECC (although in practice, registered DIMMs almost always have ECC support). It's simply a mechanism to reduce electrical load on the memory controller to allow for the usage of higher-capacity DIMMs than what unbuffered DIMMs would allow.
With respect to Ryzen, Zen's memory controller architecture is unified, and owners of Ryzen CPUs use the same memory controller found in similar-generation Threadripper and EPYC processors (just fewer of them). Although full ECC support is not required on the AM4 platform specifically (it's an optional feature that can be implemented by the motherboard maker), it's functional and supported if present. Indeed, there are several Ryzen motherboards aimed at professional audiences where ECC is an explicitly advertised feature of the board.
ECC reporting is part of the memory controller (which is unified across all Zen architecture parts), and is fully supported and functional. You can see the reporting working as expected within the Hardware Canucks article linked in the grandparent.
The article you linked mentions that ECC reporting is not working with the on-board IPMI controller (which presumably means that ECC events aren't being logged in the SEL). While that might be a limitation of this board (and other IPMI-equipped AM4 boards), reporting from within the operating system will still work.
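For anyone who wants to check this on their own box: on Linux the EDAC subsystem exposes per-memory-controller error counters in sysfs, so you can at least confirm that corrected/uncorrected errors are being reported to the OS. A minimal Python sketch (assuming the EDAC driver for your platform is loaded; the paths below are the standard EDAC sysfs layout):

    from pathlib import Path

    # Each memory controller shows up as mc0, mc1, ... with error counters.
    for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc*")):
        ce = (mc / "ce_count").read_text().strip()   # corrected errors
        ue = (mc / "ue_count").read_text().strip()   # uncorrected errors
        print(f"{mc.name}: corrected={ce} uncorrected={ue}")

If the edac/mc directory doesn't exist at all, the kernel isn't set up to receive ECC reports, regardless of what the DIMMs support.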
AMD is also doing market segmentation on their APU series of Ryzen. PRO vs. non PRO.
It should also be mentioned that Ryzen is a consumer CPU and you're stuck (mostly) with consumer motherboards, none of which tell you the level of ECC support they provide. Some motherboards do nothing with ECC! Yes, they "work" with it. But that means nothing. Motherboards need to say they correct single bit errors and detect double bit errors.[1] None of the Ryzen motherboards say this. Not a single one that I could find.
Maybe Asrock Rack, but that's a workstation/server motherboard. Which is also going for $400-600. You think that $50 Gigabyte motherboard is doing the right thing regarding ECC? That's a ton of faith right there.
Consumer Ryzen CPUs may support ECC, but that's meaningless without motherboards testing it and documenting their support of it. So no, Ryzen really does not support ECC if you ask me.
DDR5 has a modicum of ECC so things might slowly improve. Maybe DDR6 will be full ECC and we will no longer have this market segmentation in the 2030s. Wow, that’s a long time though.
PS: why didn’t Apple do the right thing with the M1? My guess is the availability of the memory, which again points to changing it at the memory-spec level.
I don't understand the issue with market segmentation here. I can absolutely see the reason why all of my servers should have ECC, but I don't see why my gaming PC should (or even my work development machine). What's the worst case impact of the (extremely rare) bit-flip on one of those machines?
Bit flips are the bane of satellite communication, especially if you try to use FTP over it.
Also, for critical eon-duration record keeping, run-length encoding or variable-size records are harder to maintain and recover from a large file on ROM-type storage than fixed-length records or text-type formats (i.e. an ASCII log, or JSON) in the face of multiple cosmic-type alterations.
Sure, you could take a near-Shannon-limit approach with multi-bit ECC, but a text-based format will be recovered a lot quicker (given then-unknown formatting).
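To make that concrete, a small Python sketch (illustrative record size and format, not any particular standard): with fixed-length records a flipped bit only damages the record it lands in, because the boundaries are pure arithmetic; with a length-prefixed format, a flipped bit in a length byte desynchronizes every record after it.

    RECORD_SIZE = 64  # bytes, illustrative

    def read_fixed_records(blob: bytes):
        # Boundaries are arithmetic, so corruption stays contained to one record.
        return [blob[i:i + RECORD_SIZE] for i in range(0, len(blob), RECORD_SIZE)]

    def read_length_prefixed(blob: bytes):
        # One corrupted length byte shifts every boundary after it.
        records, i = [], 0
        while i < len(blob):
            n = blob[i]                  # 1-byte length prefix, illustrative
            records.append(blob[i + 1:i + 1 + n])
            i += 1 + n
        return records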
Are “cosmic rays” actually the only or primary way bits get flipped? Or is it just a stand-in for “all the ways non-ECC RAM can spontaneously have an erroneous bit or two”?
However, radioactive isotopes in the packaging used to be a major cause. I liked this bit: "Controlling alpha particle emission rates for critical packaging materials to less than a level of 0.001 counts per hour per cm2 (cph/cm2) is required for reliable performance of most circuits. For comparison, the count rate of a typical shoe's sole is between 0.1 and 10 cph/cm2."
Faulty RAM can produce errors and it's hard to catch those; even memtest might not detect it. I'm not sure how often non-faulty non-ECC RAM spontaneously has errors, but you can't be sure it won't: cosmic rays are real, unless you've put your PC into a thick lead case, lol.
In my experience with ECC RAM on a couple thousand servers over a few years, we had a couple of machines throw one ECC correctable error and never have an issue again.
It seemed more common for a system to move from no errors to a consistent rate; some moved to one error per day, some to 10/day, some to thousands per second, which kills performance because of machine-check exception handling.
The one-off errors could be cosmic rays or faulty RAM or voltage droop or who knows what; the repeatable errors are probably faulty RAM, and replacing the RAM resolved the problem.
Anyone interested in seeing cosmic rays visually should check out a cloud chamber. One can be made at home with isopropyl alcohol and dry ice. Once you see it, it makes the need for ECC more visceral.
Here’s a video of a cloud chamber. The very long straight lines are muons, which are the products of cosmic rays hitting the upper atmosphere. Also visible are thick lines which are alpha particles, the result of radioactive decay.
Six years later, with the $1,000 question uncollected, it seems that a cosmic ray bit flip is also the cause of a strange happening during a Mario 64 speedrun:
It's not the same thing. GH Issues firstly requires a GitHub account, whereas anyone can sign up to a mailing list. The model is just worse for general discussion too.
A few years ago there was a CCC(?) talk about registering domains a bit flip away from the intended domain, which effectively hijacked legitimate web traffic. While the chance of a bit flip is low, it added up to a surprisingly large number of requests, maybe a few hundred or so. Here is someone doing it for windows.com:
https://www.bleepingcomputer.com/news/security/hijacking-tra...
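If you want a feel for how many of those neighbours exist, here is a small Python sketch that enumerates the domains exactly one bit flip away from a target, keeping only variants that are still plausible hostname characters (the function name is illustrative):

    import string

    ALLOWED = set(string.ascii_lowercase + string.digits + "-.")

    def bitflip_variants(domain: str):
        variants = set()
        raw = domain.encode("ascii")
        for i in range(len(raw)):
            for bit in range(8):
                b = raw[i] ^ (1 << bit)
                if b >= 0x80:
                    continue                   # flipped out of the ASCII range; skip
                candidate = (raw[:i] + bytes([b]) + raw[i + 1:]).decode().lower()
                if candidate != domain and set(candidate) <= ALLOWED:
                    variants.add(candidate)
        return sorted(variants)

    # bitflip_variants("windows.com") includes e.g. "windnws.com" ('o' ^ 0x01 = 'n')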
> I think the most likely explanation is that this was a hardware error caused by a cosmic ray or the like, rather than a software bug. It's just very bad luck :-(
These sorts of cosmic bitflips have been exploited as a security error for some time. See bitsquatting for domain names.
ECC helps and so does software error detection on critical data in memory or on disk. The problem is always that input data should not be relied upon as correct. Do not trust data. It's a difficult mind shift, but it needs to be done, or programs will continue to mysteriously fail when data becomes corrupted.
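A minimal Python sketch of that mindset, carrying a checksum alongside any critical value and verifying it right before use (seal/unseal are illustrative names; a real system might prefer a stronger hash or a proper ECC over a CRC):

    import zlib

    def seal(payload: bytes) -> bytes:
        # Append a CRC32 so corruption can at least be detected before use.
        return payload + zlib.crc32(payload).to_bytes(4, "big")

    def unseal(sealed: bytes) -> bytes:
        payload, crc = sealed[:-4], int.from_bytes(sealed[-4:], "big")
        if zlib.crc32(payload) != crc:
            raise ValueError("checksum mismatch - data corrupted in memory or on disk")
        return payload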
This leaves me with a bit of concern about how reliable storage encryption on consumer hardware is at all.
As far as I know, recent Android devices and iPhones have full-disk encryption by default, but do they protect the keys against random bit flipping?
Also, I guess that authentication devices (like a YubiKey) can't be both safe and easy to use at the same time, because the private key inside can be damaged/modified by a cosmic ray, and it's not possible (by design) to make duplicates. So it's necessary to have multiple of them to compensate, lowering their practicality in the end.
Edit: from the software side, I understand that there are techniques to ensure some level of data safety (checksumming, redundancy, etc.), but I thought it was OK to have some random bit flipping on hard disks (where I found it more frequently), since it could be corrected from software. Now I realize that if the encryption key is randomly changed in RAM, the data can become permanently irrecoverable.
OP here. Unless you work for a certificate authority or a web browser, this event will have zero impact on you. While this particular CT log has failed, there are many other CT logs, and certificates are required to be logged to 2-3 different logs (depending on certificate lifetime) so that if a log fails web browsers can rely on one of the other logs to ensure the certificate is publicly logged.
This is the 8th log to fail (although the first caused by a bit flip), and log failure has never caused a user-facing certificate error. The overall CT ecosystem has proven very resilient, even if a bit flip can take out an individual log.
(P.S. No one knows if it was really a cosmic ray or not. But it's almost certainly a random hardware error rather than a software bug, and cosmic ray is just the informal term people like to use for unexplained hardware bit flips.)
Is it possible to give an example other than “cosmic ray”? I know it’s an informal short hand, but it also raises the question of what’s actually causing these flips.
Is it just random stray electrons that happen to stray too far while traveling through the RAM? Very interesting to me.
Old DRAM/SRAM simply becoming flaky, overheating, physical connection issues, faulty power - there are a ton of reasons these flips happen, and I would think that in most cases they are entirely unnoticed or simply observed as an application crashing, rather than being recognized as directly caused by bit flips.
A while back I encountered what I thought was hardware memory corruption that turned out to be a bug in the kernel [1]. Restic, a highly reliable backup program, is written in Go, and Go programs were highly affected [2] by a memory corruption bug in the kernel [3].
What log operators should be doing, and I am surprised they are not, is verifying new additions to their log on a separate system or systems before publishing them. In the worst-case scenario they might hit a hardware bug that always returns the wrong answer for a particular combination of arguments to an instruction in all processors within an architecture, so verification should happen on multiple architectures to reduce that possibility. If incorrect additions are detected, they can be deleted without publishing, and the correct addition can be recreated, verified, and published.
ECC RAM would help somewhat, but independent verification is orders of magnitude better because it's a cryptographic protocol, and a reliable way of generating broken log entries that validate on another system would constitute a reliable way to generate SHA-2 collisions.
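A rough Python sketch of that independent-verification idea, using the leaf/node hash prefixes from RFC 6962 (publish and the entry list are illustrative stand-ins, and the odd-node handling is the usual promote-unchanged simplification):

    import hashlib

    def merkle_root(leaves):
        # RFC 6962 prefixes: 0x00 for leaf hashes, 0x01 for interior nodes.
        nodes = [hashlib.sha256(b"\x00" + leaf).digest() for leaf in leaves]
        while len(nodes) > 1:
            nxt = [hashlib.sha256(b"\x01" + nodes[i] + nodes[i + 1]).digest()
                   for i in range(0, len(nodes) - 1, 2)]
            if len(nodes) % 2:
                nxt.append(nodes[-1])   # unpaired node carried up unchanged
            nodes = nxt
        return nodes[0]

    def publish_if_verified(entries, claimed_root, publish):
        # Recompute the tree head on an independent machine and compare first.
        if merkle_root(entries) != claimed_root:
            raise RuntimeError("tree heads disagree - do not publish")
        publish(claimed_root)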
This is a rare example of a problem that a (closed-membership, permissioned) blockchain is a good fit for. CT logs are already merkle trees. If a consensus component were added, entries would only be valid if the proposed new entry made sense to all parties involved.
All the CT readers do verify it and reject the log, but it looks like the writer trusts itself. So it keeps rebroadcasting a broken log instead of instantly failing. Having an internal blockchain or just another verification server would solve it, I guess.