Basically, it seems that modern high-density fab is causing sporadic errors in CPUs and these are being noticed in big data centres.
As a hardware guy, I know it's a testing nightmare. Testing at max temperature and minimum voltage may help, but may not. These would typically be heavily cooled 250 W or 280 W packages, and temperature uniformity throughout can only be modeled, not measured. Also, the uniformity of doping could be an issue. "Doping" mixes pure silicon with a tiny percentage of atoms having one electron more (negative doping) or one less (positive doping). Lastly, any manufacturing imperfection would do it; CPUs have an extremely low manufacturing pass rate anyhow.
What I can also imagine is the staggering amount of time required to decide core 59 is dodgy, but not 58 or 60. I'm interested in proposed solutions, because nobody seems to have any. I thought about options to disable cores, but once you find the suspect box, the cheapest practical thing is to replace the CPU or indeed the box.
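On the software side, the grunt work of deciding "core 59 is dodgy, 58 and 60 are fine" can at least be scripted on Linux: pin one deterministic workload to each core in turn and compare answers. A minimal sketch only (the workload and loop count below are placeholders I made up, not a real burn-in suite):

#!/usr/bin/env python3
# Sketch only: real "mercurial core" faults are sporadic, so in practice
# you'd loop this for hours per core rather than once through.
import os
import hashlib

def workload() -> str:
    # Deterministic, CPU-bound chain of SHA-256 hashes; every healthy core
    # must produce exactly the same final digest.
    h = hashlib.sha256(b"seed")
    for _ in range(2_000_000):
        h = hashlib.sha256(h.digest())
    return h.hexdigest()

def main() -> None:
    cpus = sorted(os.sched_getaffinity(0))      # cores we're allowed to use
    reference = workload()                      # baseline result
    for cpu in cpus:
        os.sched_setaffinity(0, {cpu})          # pin this process to one core
        verdict = "OK" if workload() == reference else "MISMATCH"
        print(f"core {cpu:3d}: {verdict}")
    os.sched_setaffinity(0, set(cpus))          # restore the original affinity

if __name__ == "__main__":
    main()

In practice you would also want thermal load on the rest of the package while the test runs, since these errors only seem to show up when the part is hot and busy.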
It is interesting that Google and Facebook have enough servers, and sufficient control over their environment, that they can isolate CPU failures. These companies surely are very good at replacing marginal machines.
CPU reliability is irrelevant to home users, though. A home user has orders of magnitude greater reliability problems due to AC power, Internet service, and hot weather.
Ed
Agreed on the home user. A home user also usually has fewer cores, a lower-wattage package, an APU rather than a server CPU, and lighter loads. These issues probably occur at high temperature and low core voltage, as the silicon would be most vulnerable then. A higher voltage would increase the width of the insulating part of the individual junctions, providing better insulation.
Frankly, my interest is in the chips and the fab size.
For a long time, people could see serious issues coming, for several reasons, at around the 5nm fab size. These were mainly issues of physics. Now we have factories producing at 5nm, and Apple's M1 & M2 CPUs built on it, and it looks like there are penalties for the reduction in fab size.
Apparently, at its lowest level, a pn junction consists of a '++' doped section & a '--' doped section. Where these meet, a neutral 'NN' section develops, as the opposite dopings cancel out. Now the problem is getting that assembly into a smaller and smaller size. A difficulty long foreseen was that as the neutral section is shrunk, individual electrons may get through it. At the reduced size, one stray electron may do enough to do damage.
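For anyone who wants a number to hang on that: the textbook abrupt-junction estimate of that neutral (depletion) width, with the usual symbols (built-in plus applied voltage, dopant concentrations $N_A$ and $N_D$), is

\[ W \;=\; \sqrt{\frac{2\,\varepsilon_s\,(V_{bi}+V_R)}{q}\cdot\frac{N_A+N_D}{N_A\,N_D}} \]

so heavier doping and lower voltage both thin the barrier, which ties in with the earlier point that a higher core voltage widens the insulating region.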
I don't know how Google are set up, but in most companies the job of isolating CPU unreliability would be handed to maintenance, while the issue of finding out why would be handed to R&D or outside specialists. The existence of these CPU issues points out the fact that they are making "probably" chips, instead of the "certainly" ones we got before. It may be possible to decommission individual cores in future chips (i.e. an 80-core part switches off its 'mercurial' core and becomes a 79-core part), but I don't think it is possible now. But for a hardware guy (which I was), it's a fascinating and fairly intractable problem to watch. I've handled many difficult issues in my time which were caused by poor design. This one is caused by 'one bridge too far' in fab size, but it may be possible to live with the issue if the CPU isolates its own mercurial (= dodgy) cores.
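Worth noting that the OS half of that already exists: Linux CPU hotplug lets you take a specific logical CPU away from the scheduler through sysfs, even if the silicon can't retire the core by itself. A minimal sketch, assuming a kernel with CPU hotplug enabled (most distro kernels); the core number 59 is just the example from earlier in the thread, and cpu0 usually can't be offlined:

#!/usr/bin/env python3
# Writing 0 to the sysfs 'online' file stops the scheduler from using that
# logical CPU; writing 1 brings it back. Needs root.
import sys

def set_core_online(cpu: int, online: bool) -> None:
    path = f"/sys/devices/system/cpu/cpu{cpu}/online"
    with open(path, "w") as f:
        f.write("1" if online else "0")

if __name__ == "__main__":
    core = int(sys.argv[1]) if len(sys.argv) > 1 else 59   # suspect core
    set_core_online(core, False)    # park the 'mercurial' core
    print(f"cpu{core} offline; write 1 back to the same sysfs file to restore it")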
It may be getting to the stage where a core is needed as a supervisor for the others. It's interesting to note the lack of a CPU manufacturer's name, too.