Ugly Hedgehog - Photography Forum
General Chit-Chat (non-photography talk)
Wonky Computer Chips
Jun 18, 2021 09:48:43   #
jerryc41 Loc: Catskill Mts of NY
 
Both Google and Facebook are having trouble with chips acting up. From New Scientist mag (England).

"Google and Facebook have discovered they are experiencing computer chip failures that can corrupt data or make it difficult to unlock encrypted files. Facebook says hardware manufacturers must take notice of the problem, which has emerged due to the vast scale of computing resources the firms use. The issue surfaced at Google when multiple teams of engineers reported problems with their computations, but the company’s usual diagnostic tools showed no problem. An investigation revealed that individual chips were responsible for repeated faults. In certain cases, researchers could prompt problems by changing a chip’s temperature. These “silent errors” are caused by bits on the chips flipping from 0 to 1 or vice versa. Cosmic radiation can cause bits to flip, so computers destined for space have to be specially designed to prevent this. The errors spotted by Google and Facebook manifest in a similarly sudden way, but are instead due to ever-shrinking chips exhibiting unpredictable behaviours."

Jun 18, 2021 10:49:27   #
downing Loc: Cincinnati
 
Moore's law at its end.

Jun 18, 2021 23:11:10   #
TriX Loc: Raleigh, NC
 
jerryc41 wrote:
Both Google and Facebook are having trouble with chips acting up. From New Scientist mag (England).

"Google and Facebook have discovered they are experiencing computer chip failures that can corrupt data or make it difficult to unlock encrypted files. Facebook says hardware manufacturers must take notice of the problem, which has emerged due to the vast scale of computing resources the firms use. The issue surfaced at Google when multiple teams of engineers reported problems with their computations, but the company’s usual diagnostic tools showed no problem. An investigation revealed that individual chips were responsible for repeated faults. In certain cases, researchers could prompt problems by changing a chip’s temperature. These “silent errors” are caused by bits on the chips flipping from 0 to 1 or vice versa. Cosmic radiation can cause bits to flip, so computers destined for space have to be specially designed to prevent this. The errors spotted by Google and Facebook manifest in a similarly sudden way, but are instead due to ever-shrinking chips exhibiting unpredictable behaviours."


“Supercomputers” are often used for modeling, and the models can run for hours or days. To avoid losing the entire run (which can be VERY expensive on a big machine) in the event of a crash, main memory (which can be a petabyte or more) is periodically dumped to disk in a “checkpoint” operation, and the machine can do nothing until that process is completed. Some years ago, scientists at Lawrence Livermore National Labs were reporting that data from long-running models with many checkpoints was simply wrong. When the checkpoints were reviewed, it was finally determined that the data on disk was correct but was being corrupted by the occasional flipped bit in the disk read cache, caused by particles from cosmic rays, which are constantly bombarding us. Over a zillion iterations, it was enough to skew the data. The answer was to design storage with parity checking on both reads and writes. Occasional flipped bits happen all the time in semiconductor memory, but in more “normal” usage they are rarely if ever noticed unless there are lots of operations. That’s why ECC DRAM, which costs extra, is often used in heavily used servers but rarely seen in client machines.
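
For anyone curious how error-correcting memory can fix a flipped bit rather than just notice it, here is a rough sketch using the textbook Hamming(7,4) code. This is only an illustration: real ECC DRAM uses wider codes, and the function names here are made up for the example.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (list of 0/1)."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    # Codeword positions 1..7 are: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: list[int]) -> tuple[int, int]:
    """Return (data nibble, corrected position); position 0 means no error seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based position of the bad bit
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the bad bit back
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, error_pos


if __name__ == "__main__":
    word = hamming74_encode(0b1011)
    word[4] ^= 1  # simulate a single flipped bit (say, from a cosmic ray)
    print(hamming74_decode(word))
```

A single flipped bit in any of the seven positions gets located and corrected. Two flips in the same codeword would be miscorrected by this simple version, which is one reason real memory uses stronger codes.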

Jun 19, 2021 08:29:42   #
sb Loc: Florida's East Coast
 
Yeah. As I get older sometimes I have a bit that flips from a 0 to a 1 or vice versa.

Jun 19, 2021 11:03:17   #
markngolf Loc: Bridgewater, NJ
 
sb wrote:
Yeah. As I get older sometimes I have a bit that flips from a 0 to a 1 or vice versa.


Only one??? You are lucky!!
Mark

Jun 19, 2021 13:47:11   #
TheShoe Loc: Lacey, WA
 
TriX wrote:
“Supercomputers” are often used for modeling, and the models can run for hours or days. To avoid losing the entire run (which can be VERY expensive on a big machine) in the event of a crash, main memory (which can be a petabyte or more) is periodically dumped to disk in a “checkpoint” operation, and the machine can do nothing until that process is completed. Some years ago, scientists at Lawrence Livermore National Labs were reporting that data from long-running models with many checkpoints was simply wrong. When the checkpoints were reviewed, it was finally determined that the data on disk was correct but was being corrupted by the occasional flipped bit in the disk read cache, caused by particles from cosmic rays, which are constantly bombarding us. Over a zillion iterations, it was enough to skew the data. The answer was to design storage with parity checking on both reads and writes. Occasional flipped bits happen all the time in semiconductor memory, but in more “normal” usage they are rarely if ever noticed unless there are lots of operations. That’s why ECC DRAM, which costs extra, is often used in heavily used servers but rarely seen in client machines.


Parity bits predated the supercomputers. They were used on mag tapes in 1951 because flaking of the oxide layer of the tapes was very common.
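
To show what a single parity bit buys you, here is a small sketch of even parity. The function names and the example value are just made up for illustration, not taken from any real tape format.

```python
def even_parity_bit(data: int) -> int:
    """Return the extra bit that makes the total count of 1 bits even."""
    return bin(data).count("1") % 2


def write_frame(data: int) -> tuple[int, int]:
    """Store the data together with its parity bit."""
    return data, even_parity_bit(data)


def frame_is_valid(data: int, parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return even_parity_bit(data) == parity


if __name__ == "__main__":
    data, parity = write_frame(0b101101)
    print(frame_is_valid(data, parity))              # intact frame
    print(frame_is_valid(data ^ 0b000100, parity))   # one bit flipped
    print(frame_is_valid(data ^ 0b000110, parity))   # two bits flipped
```

One flipped bit is caught, but two flipped bits cancel out and slip through. Parity alone also cannot say which bit went bad, so it detects errors without correcting them.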

The first Cray computers were all about speed and did not have parity checking. Seymour Cray said that the people who used the machines would be able to see if the results were incorrect, rerun the jobs that had errors, and get the answers faster than they would if error correction were in use. The users' reaction was to run jobs three times and accept answers that occurred more than once.
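
The "run it three times and keep the answer that shows up more than once" approach is easy to sketch in software. This is my own illustration, not anything the Cray users actually ran, and `run_with_voting` is a made-up name. It assumes each run returns a value that can be compared and hashed.

```python
from collections import Counter


def run_with_voting(job, runs: int = 3):
    """Run job() several times and accept an answer only if it occurs more than once."""
    results = [job() for _ in range(runs)]
    answer, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two runs agreed; rerun the job")
    return answer


if __name__ == "__main__":
    # Simulate a machine that gives a wrong result on the second run.
    outcomes = iter([42, 43, 42])
    print(run_with_voting(lambda: next(outcomes)))
```

The cost is obvious: three times the machine time. It also only helps when errors are rare and random, since the same wrong answer appearing twice would be accepted.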

Copyright 2011-2024 Ugly Hedgehog, Inc.