Old 06-06-2021, 02:17 PM   #1
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,307

Rep: Reputation: 2324
Silent Data Corruption in Multicore CPUs?


Here's a link with two references and a summary, but Google Scholar has plenty more:
https://hardware.slashdot.org/story/...re-modern-cpus

Basically, it seems that modern high-density fab is causing sporadic errors in CPUs, and these are being noticed in big data centres.

As a hardware guy, I know it's a testing nightmare. Testing at maximum temperature and minimum voltage may help, but may not. These would typically be heavily cooled 250W or 280W packages, and temperature uniformity throughout can only be modelled, not measured. The uniformity of doping could also be an issue: "doping" mixes pure silicon with a tiny percentage of atoms having one electron more (negative doping) or one less (positive doping). Lastly, any manufacturing imperfection would do it; CPUs have an extremely low manufacturing pass rate as it is.

What I can also imagine is the staggering amount of time required to decide that core 59 is dodgy, but not 58 or 60. I'm interested in proposed solutions, because nobody seems to have any. I thought about options to disable cores, but once you find the suspect box, the cheapest practical thing is to replace the CPU, or indeed the box. A sketch of that kind of per-core hunt follows below.
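To make the shape of the hunt concrete, here's a minimal sketch: pin one fixed, deterministic computation to each core in turn and flag any core whose answer differs. The toy workload and the assumption that the reference result comes from a good core are just illustrative; real screening runs far heavier mixed workloads for hours per core.

[code]
/* percore_check.c - illustrative sketch, not a real screening tool.
 * Pins the process to each CPU in turn, repeats the same fixed
 * computation, and flags any core whose answer differs from the
 * reference.
 * Build: gcc -O2 -o percore_check percore_check.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Deterministic integer workload; any fixed computation would do. */
static uint64_t workload(void)
{
    uint64_t h = 0x9e3779b97f4a7c15ULL;
    for (uint64_t i = 0; i < 50000000ULL; i++) {
        h ^= i;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
    }
    return h;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    /* Caveat: the reference comes from whatever core we start on;
     * ideally it would come from a known-good machine instead. */
    uint64_t reference = workload();

    for (long cpu = 0; cpu < ncpus; cpu++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            continue;
        }
        uint64_t got = workload();
        printf("cpu %ld: %s\n", cpu, got == reference ? "ok" : "MISMATCH");
    }
    return 0;
}
[/code]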
 
Old 06-07-2021, 12:01 AM   #2
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
It is interesting that Google and Facebook have enough servers, and sufficient control over their environment, that they can isolate CPU failures. These companies surely are very good at replacing marginal machines.

CPU reliability is irrelevant to home users, though. A home user has orders of magnitude greater reliability problems due to AC power, Internet service, and hot weather.
Ed
 
Old 06-07-2021, 05:17 AM   #3
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,307

Original Poster
Rep: Reputation: 2324
Agreed on the home user. A home user also usually has fewer cores, a lower-wattage package, an APU instead of a server CPU, and lighter loads. These issues probably occur at high temperature and low core voltage, as the silicon would be most vulnerable then. A higher voltage would increase the width of the insulating part of the individual junctions, providing better insulation.

Frankly, my interest is in the chips and the fab size.

For a long time, people could see serious issues, mainly of physics, coming at around the 5nm fab size. Now we have factories producing at 5nm, and Apple's M1 & M2 CPUs built on it. This indicates there are penalties for the reduction in fab size.

Apparently, at its lowest level, a pn junction consists of a '++' doped section and a '--' doped section. Where these meet, a neutral 'NN' section develops, as the opposite dopings cancel out. Now the problem is getting that assembly into a smaller and smaller size. A difficulty long foreseen was that as the 'N' section is shrunk by the doping mixes, individual electrons may get through. At the reduced size, one stray electron may be enough to do damage.
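To put numbers on both claims (this formula is my addition from standard semiconductor texts, not from the linked articles), the depletion width W of an abrupt pn junction is:

[code]
W = \sqrt{ \frac{2\varepsilon_s}{q} \cdot \frac{N_A + N_D}{N_A N_D} \cdot (V_{bi} + V_R) }
[/code]

where eps_s is the permittivity of silicon, q the electron charge, N_A and N_D the doping concentrations, V_bi the built-in potential, and V_R any applied reverse bias. Higher voltage widens W, as per my previous post, while the heavier doping that comes with shrinking geometry narrows it; once W is down to a few nanometres, electrons can tunnel straight through.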

I don't know how Google are set up, but in most companies, the job of isolating CPU unreliability would be handed to maintenance, while the issue of finding out why would be handed to R&D or outside specialists. The existence of these CPU issues points to the fact that they are making "probably" chips instead of the "certainly" ones we got before. It may be possible to decommission individual cores in future chips (i.e. an 80-core part switches off its 'mercurial' core and becomes a 79-core part), but I don't think it is now.

But for a hardware guy (which I was), it's a fascinating and fairly intractable problem to watch. I've handled many difficult issues in my time which were caused by poor design. This one is caused by 'one bridge too far' in fab size, but it may be possible to live with the issue if the CPU isolates its own mercurial (= dodgy) cores.
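On the software side, Linux can already do the 80-core-becomes-79-core trick at the OS level, if not in silicon: the kernel's CPU hotplug interface lets root take a suspect core out of the scheduler. A minimal sketch (the core number is just an example):

[code]
/* offline_core.c - sketch using the kernel's CPU hotplug
 * interface. Needs root, and cpu0 often cannot be offlined.
 * Usage: ./offline_core 59
 */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <cpu-number>\n", argv[0]);
        return 1;
    }
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%s/online", argv[1]);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fputs("0", f);   /* writing "1" would bring the core back */
    fclose(f);
    printf("cpu%s taken offline\n", argv[1]);
    return 0;
}
[/code]

The same thing from a shell is just: echo 0 > /sys/devices/system/cpu/cpu59/online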

It may be getting to the stage where a core is needed as a supervisor for the others. It's also interesting to note the absence of any CPU manufacturer's name in the reports.
 
  


