Slackware 14.2 virtual machine hangs regularly (perhaps due to snmpd)

gdsotirov · 07-19-2017, 03:12 AM

Hello,

I have a fully patched Slackware 14.2 VMware virtual machine (version 8) running on an ESXi 5.5 U1 (build 1623387) hypervisor. From time to time this virtual machine hangs completely and starts burning CPU on the host, which makes the other running virtual machines less responsive. I haven't been able to identify the reason for these hangs so far, and I haven't had this problem on any other Slackware virtual machine (and I have others ranging from Slackware 11.0 to current, both 32 and 64 bit). The last time the hang occurred I was logged in to the terminal through SSH, so I caught the following output:

Code:

Message from syslogd@slack-142 at Wed Jul 19 10:26:01 2017 ...
slack-142 kernel: [ 2649.351954] CPU: 1 PID: 1085 Comm: snmpd Not tainted 4.4.75-smp #2

Message from syslogd@slack-142 at Wed Jul 19 10:26:01 2017 ...
slack-142 kernel: [ 2649.352612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.353910] task: f2d19200 ti: f0176000 task.ti: f0176000

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.358574] Stack:

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.359265]  00000018 00000010 f3263000 f4dad4b8 00000000 00000000 f4003d80 00000000

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.361261] Call Trace:

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.359925]  024280c0 00000000 f2d19200 00000000 00000001 00000000 024000c0 f4003d80

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.361899]  [<c1197208>] ? inode_init_always+0xe8/0x180

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.360602]  024080c0 f0177d78 f0177da0 80200020 f0177d78 c1fef180 f017000a e46e4ecc

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.362572]  [<c11dc1cc>] ? proc_alloc_inode+0x1c/0x90

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.363252]  [<c11696c1>] ___slab_alloc+0x51/0x4a0

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.364242]  [<c11a5196>] ? __inode_wait_for_writeback+0x56/0x90

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...
slack-142 kernel: [ 2649.364955]  [<c10adb4c>] ? call_rcu_sched+0x1c/0x20

Message from syslogd@slack-142 at Wed Jul 19 10:26:03 2017 ...

This makes me think that the problem is due to snmpd, but what could it be? The virtual machine is polled through SNMP by Cacti running on another (real) machine in the same network segment, just like my other Slackware virtual machines. However, only the Slackware 14.2 virtual machine hangs.
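The console messages above are interleaved: the "Call Trace:" header stamped 2649.361261 is printed before stack words stamped 2649.359925. A minimal sketch for putting such a paste back into kernel-timestamp order, assuming the output was saved to a file (the file name oops.txt and the function name sort_kernel_lines are made up for illustration):

```shell
# Reorder interleaved kernel oops lines by their kernel timestamp.
# Reads the saved console output on stdin, drops the
# "Message from syslogd@..." headers and sorts what is left
# numerically on the value inside the first [ ... ].
sort_kernel_lines() {
    grep 'kernel: \[' |
        sort -s -t '[' -k 2,2n
}

# Usage (oops.txt is a hypothetical file name):
#   sort_kernel_lines < oops.txt
```

This only reorders lines that were captured; anything the terminal missed is still missing.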

Any ideas anyone?

I would really appreciate any help in resolving this strange issue. And please let me know if I need to provide any other information.

ponce · 07-19-2017, 11:53 AM

I haven't used vmware esxi for a decade (more or less), but looking at the kernel errors of the virtual machine, and this is a wild guess, might it be that the vmdk image file of the vm (more probably) and/or the esxi filesystem (where the vms are stored) are damaged?

gdsotirov · 07-19-2017, 01:05 PM

Quote:

Originally Posted by ponce

I haven't used vmware esxi for a decade (more or less)

And what are you using instead?

I'm using ESX on a dedicated real machine for running all my virtual machines (~40).

Quote:

Originally Posted by ponce

might it be that the vmdk image file of the vm (more probably) and/or the esxi filesystem (where the vms are stored) are damaged?

I have run the following:

Code:

# vmkfstools --fix check Slack-14.2.vmdk
Disk is error free

to check the disk image file, but I'm not able to run voma to check the VMFS volume, because it's the one on which ESXi is installed and thus it is in use (e.g. I received "Found 1 actively heartbeating hosts on device"). Do you know other ways to check for damage?

Anyway, I'll try to move the virtual machine to another VMFS volume (on another disk) and will report back if this has any effect.

ponce · 07-19-2017, 01:55 PM

Quote:

Originally Posted by gdsotirov

And what are you using instead?

I'm using ESX on a dedicated real machine for running all my virtual machines (~40).

libvirt + qemu and lxc.

just to isolate the issue, you can also try disabling snmpd and see if it still hangs.
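On Slackware that could look roughly like the sketch below. It assumes snmpd is started from /etc/rc.d/rc.snmpd, as with the stock net-snmp package, and relies on the Slackware convention that rc scripts are only run at boot when they are executable. The function name disable_rc_service is only illustrative; run it as root.

```shell
# Stop an rc service now and keep it from starting at the next boot.
disable_rc_service() {
    # Default path is an assumption; pass another script path if yours differs.
    rc_script="${1:-/etc/rc.d/rc.snmpd}"

    if [ ! -f "$rc_script" ]; then
        echo "no such script: $rc_script" >&2
        return 1
    fi

    # Only a script that is currently enabled needs to be stopped.
    if [ -x "$rc_script" ]; then
        "$rc_script" stop
    fi

    # A non-executable rc script is skipped at boot.
    chmod a-x "$rc_script"
}

# Usage (as root):
#   disable_rc_service
# To undo later:
#   chmod +x /etc/rc.d/rc.snmpd && /etc/rc.d/rc.snmpd start
```

If the machine still hangs with snmpd off, snmpd was probably just the process that happened to be running when the kernel tripped.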

kjhambrick · 07-20-2017, 07:03 AM

gdsotirov --

We don't run Slackware on ESXi ( not yet, but we are evaluating it )

However, we do have more than a few CentOS 6 Machines on ESXi out there among our Customer Sites.

On some Customer's Systems, we experienced occasional periods where the CentOS VM would be unresponsive for a few seconds for all users ( and a few seconds feels like a long time when running a terminal-based application via ssh ).

One thing that made a difference for us was a few of the Linux-on-VM tuning recommendations:

#1: Linux VM Performance Tuning ( old but useful for CentOS 6 )

We did not implement all the recommendations in #1.

However, Kernel Parameter elevator=noop ; set noatime in fstab ; vm.swappiness=1 were low-hanging fruit that helped a lot ( I assume you're running vmtools and you're running the para-virtualized devices ) ...

Another obvious one: if you don't need a GUI Console, be sure to boot into runlevel 3, and not runlevel 4 ... the GUI Console is just a `startx` away if you ever need one.

Setting vm.swappiness seemed to help a lot on our Customer's VMs ( we chose to set vm.swappiness=1 as in #2 instead of vm.swappiness=0 as in #1 ).
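A rough, read-only sketch for checking where a guest stands on the three settings mentioned above ( swappiness, elevator=noop, noatime ). The function name check_vm_tuning and the list of filesystem types are assumptions for illustration; the three path arguments default to the usual system files and exist mainly so the check can be tried against copies.

```shell
# Report vm.swappiness, whether elevator=noop is on the kernel command
# line, and which local filesystems in fstab lack the noatime option.
check_vm_tuning() {
    fstab="${1:-/etc/fstab}"
    cmdline="${2:-/proc/cmdline}"
    swappiness="${3:-/proc/sys/vm/swappiness}"

    echo "vm.swappiness = $(cat "$swappiness")"

    if grep -q 'elevator=noop' "$cmdline"; then
        echo "elevator=noop: set on the kernel command line"
    else
        echo "elevator=noop: not set on the kernel command line"
    fi

    # Skip comments and short lines, then flag disk filesystems
    # whose options column does not contain noatime.
    awk '
        $1 ~ /^#/ || NF < 4 { next }
        $3 ~ /^(ext[234]|xfs|jfs|btrfs|reiserfs)$/ && $4 !~ /(^|,)noatime(,|$)/ {
            print "noatime missing on " $2
        }
    ' "$fstab"
}

# Usage:
#   check_vm_tuning
#
# Applying the settings (as root), roughly:
#   sysctl -w vm.swappiness=1                      # takes effect now
#   echo 'vm.swappiness = 1' >> /etc/sysctl.conf   # keep it across reboots
#   add noatime to the options column in /etc/fstab
#   add elevator=noop to the kernel command line in the boot loader config
```

The check changes nothing on the system; the commented commands at the end are the part that actually applies the tuning.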

#2: Kernel Processes Periodically Eating CPU During High Load

Answer 1 in URL #2 also shows how the user solved his own problem via the results of `sar -W` which URL #1 says "don't do that !" ...

Anyhow, we've tried `sar` but never got any useful info from it.

As for other logs and sorry to belabor the obvious if you've already looked ... do you see anything in /var/log/{dmesg,messages,syslog} when the system freezes up ?

Or maybe `top` shows something useful ?

One thing about dmesg is you lose it after each reboot. If you can manage to reboot cleanly, then adding the following line to /etc/rc.d/rc.local_shutdown will save your dmesg for the current boot so you can inspect it after a reboot:

Code:

#
# save dmesg for this boot.  Append at end of /etc/rc.d/rc.local_shutdown
#
/bin/dmesg > /var/log/dmesg-last-boot

I do run VMWare Workstation on my main Slackware64 14.2 + Multilib Laptop with a couple Slackware Guests ( 14.2 and Current ).

I follow most of the recommendations in URL #1 for the VMWare Workstation Guests and they run very well on my Laptop.

HTH and good luck !

-- kjh

P.S. there are also a few customers running Hyper-V and we also tune CentOS on Hyper-V pretty much the same way.

gdsotirov · 07-20-2017, 10:50 AM

First, thanks for the extensive reply and for sharing your experience, kjhambrick!

Now, to the different points and questions:

Quote: