Basically, it seems that modern high-density fab is causing sporadic errors in CPUs and these are being noticed in big data centres.
As a hardware guy, I know it's a testing nightmare. Testing at max temperature and minimum voltage may help, but may not. These would typically be heavily cooled 250 W or 280 W packages, and temperature uniformity throughout can only be modeled, not measured. Also, the uniformity of doping could be an issue. "Doping" mixes pure silicon with a tiny percentage of atoms having one electron more (negative doping) or one less (positive doping). Lastly, any manufacturing imperfection would do it; CPUs have an extremely low manufacturing pass rate anyhow.
What I can also imagine is the staggering amount of time required to decide core 59 is dodgy, but not 58 or 60. I'm interested in proposed solutions, because nobody seems to have any. I thought about options to disable cores, but once you find the suspect box, the cheapest practical thing is to replace the CPU or indeed the box.
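On the software side, the grunt work of deciding "core 59 is dodgy, 58 and 60 are fine" can at least be scripted on Linux: pin one deterministic workload to each core in turn and compare answers. A minimal sketch only (the workload and loop count below are placeholders I made up, not a real burn-in suite):

#!/usr/bin/env python3
# Sketch only: real "mercurial core" faults are sporadic, so in practice
# you'd loop this for hours per core rather than once through.
import os
import hashlib

def workload() -> str:
    # Deterministic, CPU-bound chain of SHA-256 hashes; every healthy core
    # must produce exactly the same final digest.
    h = hashlib.sha256(b"seed")
    for _ in range(2_000_000):
        h = hashlib.sha256(h.digest())
    return h.hexdigest()

def main() -> None:
    cpus = sorted(os.sched_getaffinity(0))      # cores we're allowed to use
    reference = workload()                      # baseline result
    for cpu in cpus:
        os.sched_setaffinity(0, {cpu})          # pin this process to one core
        verdict = "OK" if workload() == reference else "MISMATCH"
        print(f"core {cpu:3d}: {verdict}")
    os.sched_setaffinity(0, set(cpus))          # restore the original affinity

if __name__ == "__main__":
    main()

In practice you would also want thermal load on the rest of the package while the test runs, since these errors only seem to show up when the part is hot and busy.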
It is interesting that Google and Facebook have enough servers, and sufficient control over their environment, that they can isolate CPU failures. These companies surely are very good at replacing marginal machines.
CPU reliability is irrelevant to home users, though. A home user has orders of magnitude greater reliability problems due to AC power, Internet service, and hot weather.
Ed
Agreed on the home user. A home user also usually has fewer cores, a lower-wattage package, an APU rather than a server CPU, and lighter loads. These issues probably occur at high temperature and low core voltage, as the silicon would be most vulnerable then. A higher voltage would increase the width of the insulating part of the individual junctions, providing better insulation.
Frankly, my interest is in the chips and the fab size.
For a long time, people could see serious issues coming, for several reasons, at around the 5nm fab size. These were mainly issues of physics. Now we have factories producing at 5nm, and Apple's M1 & M2 CPUs built on it, and it looks like there are penalties for the reduction in fab size.
Apparently, at its lowest level, a pn junction consists of a '++' doped section & a '--' doped section. Where these meet, a neutral 'NN' section develops, as the opposite dopings cancel out. Now the problem is getting that assembly into a smaller and smaller size. A difficulty long foreseen was that as the neutral section is shrunk, individual electrons may get through it. At the reduced size, one stray electron may do enough to do damage.
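For anyone who wants a number to hang on that: the textbook abrupt-junction estimate of that neutral (depletion) width, with the usual symbols (built-in plus applied voltage, dopant concentrations $N_A$ and $N_D$), is

\[ W \;=\; \sqrt{\frac{2\,\varepsilon_s\,(V_{bi}+V_R)}{q}\cdot\frac{N_A+N_D}{N_A\,N_D}} \]

so heavier doping and lower voltage both thin the barrier, which ties in with the earlier point that a higher core voltage widens the insulating region.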
I don't know how Google are set up, but in most companies the job of isolating CPU unreliability would be handed to maintenance, while the issue of finding out why would be handed to R&D or outside specialists. The existence of these CPU issues points out the fact that they are making "probably" chips, instead of the "certainly" ones we got before. It may be possible to decommission individual cores in future chips (i.e. an 80-core part switches off its 'mercurial' core and becomes a 79-core part), but I don't think it is possible now. But for a hardware guy (which I was), it's a fascinating and fairly intractable problem to watch. I've handled many difficult issues in my time which were caused by poor design. This one is caused by 'one bridge too far' in fab size, but it may be possible to live with the issue if the CPU isolates its own mercurial (= dodgy) cores.
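Worth noting that the OS half of that already exists: Linux CPU hotplug lets you take a specific logical CPU away from the scheduler through sysfs, even if the silicon can't retire the core by itself. A minimal sketch, assuming a kernel with CPU hotplug enabled (most distro kernels); the core number 59 is just the example from earlier in the thread, and cpu0 usually can't be offlined:

#!/usr/bin/env python3
# Writing 0 to the sysfs 'online' file stops the scheduler from using that
# logical CPU; writing 1 brings it back. Needs root.
import sys

def set_core_online(cpu: int, online: bool) -> None:
    path = f"/sys/devices/system/cpu/cpu{cpu}/online"
    with open(path, "w") as f:
        f.write("1" if online else "0")

if __name__ == "__main__":
    core = int(sys.argv[1]) if len(sys.argv) > 1 else 59   # suspect core
    set_core_online(core, False)    # park the 'mercurial' core
    print(f"cpu{core} offline; write 1 back to the same sysfs file to restore it")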
It may be getting to the stage where a core is needed as a supervisor for the others. It's interesting to note the lack of a CPU manufacturer's name, too.