LinuxQuestions.org


mfoley 11-30-2016 08:06 PM

Why did my server enter an endless trace loop?
 
1 Attachment(s)
Sometime on Wednesday afternoon, Nov. 23rd, my Linux server ceased functioning properly, but did not shut down or reboot. I was no longer able to ssh into it, nor was it resolving DNS (it is the LAN DNS server). The mail port (25) was still open and it was receiving email, but not delivering any. Port 80 was still open. When I finally got to it physically on the following Saturday, the console was continuously looping, with the first line showing "BUG: unable to handle kernel paging request at ffff88030e28fe40". I took a screenshot of the console (attached).

A few lines further down the output are these lines:
Code:

CPU: 5 PID: 20523 Comm: sendmail Tainted: G    B    3.10.17 #3
Hardware name: Gigabyte Technology Co., Ltd. Z97X-UD5H/Z97X-UD5H, BIOS F8 06/17/2014

Then a bunch of what appear to be register dumps, etc.

The "3.10.17" number is the Linux kernel version. "Gigabyte Technology" is the motherboard manufacturer and "Z97X-UD5H" is the motherboard product name. "06/17/2014" is the AMI BIOS date and "F8" is the BIOS version number. Not sure any of that is helpful. What might "sendmail Tainted" mean?
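
On the "sendmail Tainted" question: those are two separate fields. "Comm: sendmail" is the name of the process that was running on that CPU when the fault hit, and "Tainted:" introduces a set of single-letter kernel taint flags. Going by the kernel's tainted-kernels documentation, "G" means only GPL-compatible modules were loaded and "B" means the kernel had already hit a bad page state. A minimal Python sketch of pulling the line apart, assuming exactly the layout quoted above (the function name and the abbreviated flag table are illustrative, not part of any kernel tool):

```python
import re

# Abbreviated table of taint letters, following the kernel's
# tainted-kernels documentation; not every flag is listed here.
TAINT_FLAGS = {
    "G": "only GPL-compatible modules loaded",
    "P": "proprietary module loaded",
    "F": "module was force-loaded",
    "R": "module was force-unloaded",
    "M": "machine check exception occurred",
    "B": "bad page referenced (bad page state)",
    "D": "kernel already died once (oops or BUG)",
    "W": "kernel issued a warning earlier",
    "O": "out-of-tree module loaded",
}

# Assumption: the line looks like the one quoted in this thread.
OOPS_LINE = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"Tainted: (?P<taint>.*?)\s+(?P<kernel>\d+\.\d+\.\d+\S*) #(?P<build>\d+)"
)


def parse_oops_line(line):
    """Split the 'CPU: ... Tainted: ...' line into named fields, or return None."""
    match = OOPS_LINE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Turn the flag letters into a list of human-readable descriptions.
    info["taint"] = [
        TAINT_FLAGS.get(c, "unknown flag " + c)
        for c in info["taint"]
        if not c.isspace()
    ]
    return info


line = "CPU: 5 PID: 20523 Comm: sendmail Tainted: G    B    3.10.17 #3"
print(parse_oops_line(line))
```

Read this way, sendmail is just the process that happened to trip over the fault, and the "B" flag says the kernel had already seen a bad page before this particular message was printed.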

Simply rebooting fixed the problem -- apparently.

Does anyone have any idea what happened?

Slackware64 14.1

Darth Vader 11-30-2016 10:38 PM

Quote:

Originally Posted by mfoley (Post 5636504)
Does anyone have any idea what happened?

Some smartass thought he could make a 24/7 server out of a high-end gaming motherboard, CPU and a bunch of non-ECC memory.

That's the main and fundamental error. This is NOT a Server, but a Gaming Computer forced to do a 24/7 job...

You know, the smart guys invented Server Grade Motherboards and ECC memory for a reason... ;)

mfoley 12-01-2016 09:45 AM

Well, this was a custom-built machine using recommended components, including the motherboard. There was no indication on the MB box that it was specifically for gaming.

What, for example, would you suggest as a Server Grade Motherboard?

You may be right on the non-ECC memory, but how do you know that? Because of the motherboard model?

bassmadrigal 12-01-2016 10:10 AM

The issue that Darth Vader is (rudely) implying is that server hardware tends to be more robust, and that usually includes using ECC (error-correcting code) memory. Long story short, it is possible for a neutron from a cosmic ray to flip a memory bit from 0 to 1 or 1 to 0. This is not common. (Neutrinos, which are a different particle, interact even more rarely: there are about a trillion of them passing through your hand every second, but collisions only occur once every few years; see this fun xkcd what-if for more detailed information on neutrinos.) When one of these neutrons hits a bit in your RAM, it can have no RAMifications (heh) on the computer or it can bring it to its knees. It all depends on what was contained in that bit of memory. ECC memory is designed to notice this and correct it.

Unfortunately, as far as I know, there's no way to know whether this issue was caused by some random particle strike or some other hardware/software glitch. Many times, when this occurs due to a flipped bit, a quick reboot and you're back to normal and won't experience it again. There's really not much you can do to protect yourself from those types of situations without getting server-grade hardware that uses ECC memory. If this issue starts happening frequently, it could be a sign your hardware is having issues and would be unrelated to cosmic rays.
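
To make the "notice this and correct it" part concrete, here is a toy Python sketch of single-bit error correction using a Hamming(7,4) code. It only illustrates the principle: real ECC DIMMs typically protect each 64-bit word with 8 extra check bits, the work is done in hardware by the memory controller, and the function names here are made up for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    code = [0] * 8                      # index 0 is unused
    code[3], code[5], code[6], code[7] = data
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(word):
    """Return (data_bits, error_position); position 0 means no error seen."""
    code = [0] + list(word)
    # XOR together the positions of all set bits. For a valid codeword this
    # is 0; after a single flip it equals the position of the flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1             # flip the bad bit back
    return [code[3], code[5], code[6], code[7]], syndrome


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                          # simulate a flipped bit at position 5
data, position = hamming74_decode(stored)
print(data, position)
```

Flipping any one of the seven stored bits still gives back the original four data bits. Two simultaneous flips are more than this small code can repair, which is why real ECC schemes add an extra check bit to at least detect that case.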

Personally, as long as a machine isn't considered critical, you probably don't need to spend the extra money for server-grade components. I have been happily using my normal desktop rig as a 24/7 server for many years (both within my LAN and out to the internet), but if my machine goes down, only my wife, a few friends, and I would be affected. I also have an htpc running kodi 24/7 that can act as a media server to my mobile phone. I haven't had issues with either, and, to me, neither computer is worth the cost of server-grade hardware based on what they provide. If you're running a major business, you could lose a lot of money in a short period of time if your desktop-turned-server goes out, so it's typically beneficial to pay the extra money for server-grade hardware when that computer will be providing a critical function (it would be up to you and your company to decide whether it is worth it).

TL;DNR: Don't worry about it unless it happens frequently. Then start testing your hardware because there might be a bad component.
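
One rough way to keep an eye on "frequently" is to count kernel error lines per day in the system log. A minimal Python sketch, assuming traditional syslog lines that begin with a month and a day; the sample lines, host name and function name are hypothetical, and a crash as hard as the one in this thread may never make it to the log on disk at all:

```python
from collections import Counter

# Substrings that mark a serious kernel error. "BUG: " matches the console
# message quoted in this thread; "Oops: " is the usual kernel oops marker.
MARKERS = ("BUG: ", "Oops: ")


def kernel_errors_per_day(lines):
    """Count lines containing a marker, keyed by the leading 'Mon DD' stamp."""
    counts = Counter()
    for line in lines:
        if any(marker in line for marker in MARKERS):
            # Assumption: the line starts with e.g. "Nov 23 14:02:11 host ..."
            day = " ".join(line.split()[:2])
            counts[day] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "Nov 23 14:02:11 server kernel: BUG: unable to handle kernel paging request at ffff88030e28fe40",
    "Nov 23 14:02:11 server kernel: Oops: 0000 [#1] SMP",
    "Nov 24 09:00:00 server sendmail[1234]: starting daemon",
    "Dec  1 03:15:42 server kernel: BUG: unable to handle kernel paging request at ffff88030e28fe40",
]
for day, count in kernel_errors_per_day(sample).items():
    print(day, count)
```

On a real machine the input would be the lines of whatever file the kernel messages are logged to; a handful of hits spread over several days is the point at which hardware testing becomes worth the downtime.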

Skaendo 12-01-2016 01:38 PM

My best guess would be that some piece of your hardware had a hiccup. Like bassmadrigal said, you probably don't need to worry about it unless it starts happening more frequently. You could run diagnostic tests, like running a memtest, stressing the CPU, checking the SMART status of your HDD, testing your power supply, etc. But you will need some downtime to do all that.

Personally, I think that it is the MoBo. Gigabyte IMO is junk. I have a friend who cannot give his Gigabyte laptop away. I have had nothing but issues with their MoBos. But that is purely a personal opinion and speculation.

My favorite right now is an old ASUS M2NPV-VM MoBo that I have running my personal web-facing server. This thing is a workhorse. No ECC RAM or anything "server-grade" in it. And it's pretty minimal: MoBo, HDD, PSU, DVD, NIC.

