TriX wrote:
“Supercomputers” are often used for modeling, and the models can run for hours or days. To avoid losing an entire run (which can be VERY expensive on a big machine) in the event of a crash, main memory (which can be a petabyte or more) is periodically dumped to disk in a “checkpoint” operation, and the machine can do nothing else until that process completes. Some years ago, scientists at Lawrence Livermore National Labs were getting data from long-running models with many checkpoints that was simply wrong. When the checkpoints were reviewed, it was finally determined that the data on disk was correct but was being corrupted by the occasional flipped bit in the disk read cache, caused by the cosmic-ray particles that are constantly bombarding us. Over a zillion iterations, that was enough to skew the data. The answer was to design storage with parity checking on both reads and writes. Occasional flipped bits happen all the time in semiconductor memory, but in more “normal” usage they are rarely if ever noticed unless there are lots of operations. That’s why ECC DRAM, which costs extra, is often used in heavily loaded servers but rarely seen in client machines.
Parity bits predated supercomputers. They were used on mag tapes as early as 1951 because flaking of the tapes' oxide layer was very common.
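Not from any of those machines, but as a rough illustration of what a per-word parity bit buys you, here is a minimal C sketch (names and values are my own): a parity bit is computed when the word is written, and re-checked when it is read back, so any single flipped bit is detected.

```c
#include <stdint.h>
#include <stdio.h>

/* Parity of a 64-bit word: 1 if the number of set bits is odd.
 * Storing this alongside the word makes the combined bit count even. */
static uint8_t parity64(uint64_t w)
{
    w ^= w >> 32;
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return (uint8_t)(w & 1u);
}

int main(void)
{
    uint64_t word = 0xDEADBEEFCAFEF00DULL;
    uint8_t stored_parity = parity64(word);   /* computed on write */

    word ^= 1ULL << 17;                       /* simulate a cosmic-ray bit flip */

    if (parity64(word) != stored_parity)      /* re-checked on read */
        puts("parity mismatch: single-bit error detected");
    return 0;
}
```

Plain parity only detects an odd number of flipped bits; it cannot correct them, which is why ECC schemes that can also repair single-bit errors cost extra hardware.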
The first Cray computers were all about speed and did not have parity checking. Seymour Cray said that the people who used the machines would be able to see if the results were incorrect, rerun the jobs that had errors, and still get their answers faster than they would if error correction were in use. The users' reaction was to run jobs three times and accept answers that occurred more than once.
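In effect the users were doing two-out-of-three voting by hand. A hypothetical sketch of that workaround in C, where run_job() stands in for the real computation:

```c
#include <stdio.h>

/* Stand-in for an expensive, possibly error-prone computation. */
static double run_job(int attempt)
{
    (void)attempt;
    return 42.0;   /* placeholder result */
}

int main(void)
{
    /* Run the same job three times and accept any answer that
     * appears at least twice; otherwise, rerun the whole thing. */
    double a = run_job(1), b = run_job(2), c = run_job(3);

    if (a == b || a == c)
        printf("accepted: %g\n", a);
    else if (b == c)
        printf("accepted: %g\n", b);
    else
        printf("no two runs agreed; rerun the job\n");
    return 0;
}
```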