OSP and RHEL possible Bug

AmerHwitat · 05-18-2019, 12:39 PM

Hello,

I had these problems earlier with OSP14 and RHEL 7.6;

Code:

[root@localhost network-scripts]# systemctl status network -l
● network.service - LSB: Bring up/down networking
   Loaded: loaded (/etc/rc.d/init.d/network; bad; vendor preset: disabled)
   Active: failed (Result: exit-code) since Sat 2019-01-19 03:47:01 EST; 21s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 86319 ExecStop=/etc/rc.d/init.d/network stop (code=exited, status=0/SUCCESS)
  Process: 86591 ExecStart=/etc/rc.d/init.d/network start (code=exited, status=1/FAILURE)
    Tasks: 0

Jan 19 03:47:01 localhost.localdomain dhclient[86963]: Please report for this software via the Red Hat Bugzilla site:
Jan 19 03:47:01 localhost.localdomain dhclient[86963]:     http://bugzilla.redhat.com
Jan 19 03:47:01 localhost.localdomain dhclient[86963]: ution.
Jan 19 03:47:01 localhost.localdomain dhclient[86963]: exiting.
Jan 19 03:47:01 localhost.localdomain network[86591]: failed.
Jan 19 03:47:01 localhost.localdomain network[86591]: [FAILED]
Jan 19 03:47:01 localhost.localdomain systemd[1]: network.service: control process exited, code=exited status=1
Jan 19 03:47:01 localhost.localdomain systemd[1]: Failed to start LSB: Bring up/down networking.
Jan 19 03:47:01 localhost.localdomain systemd[1]: Unit network.service entered failed state.
Jan 19 03:47:01 localhost.localdomain systemd[1]: network.service failed.
[root@localhost network-scripts]#

Code:

[root@localhost log]# 
Message from syslogd@localhost at Jan 23 02:23:31 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [ovsdb-server:10088]

Code:

[root@amer network-scripts]# 
Message from syslogd@amer at Jan 27 12:46:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [nova-api:102738]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]

Message from syslogd@amer at Jan 27 19:26:19 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#1 stuck for 27s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:27:30 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [6_scheduler:64928]

Message from syslogd@amer at Jan 27 19:31:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:32:42 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 33s! [swift-object-up:11358]

Message from syslogd@amer at Jan 27 19:33:55 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 24s! [dmeventd:71548]

Message from syslogd@amer at Jan 27 19:34:25 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/u256:3:8447]

Message from syslogd@amer at Jan 27 19:37:50 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:34]

Message from syslogd@amer at Jan 27 19:37:51 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [systemd:11968]

Quote:

The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

I reported this to Bugzilla and got no response; Launchpad and openstack.org didn't answer either. I hope the answer is on this page.

The PC was accidentally disconnected, which caused the last bunch of errors and a crash dump. I think it's the RabbitMQ heartbeat, as it showed up in my logs: it seems that when the server disconnects, RabbitMQ halts and crashes the system with a kernel loop.

berndbausch · 05-19-2019, 07:28 AM

It’s not clear to me what the problem is. You can’t start your network, but have you tried to find more log messages in the journal? There are countless possible causes for network failures. Also, is the network really down?

Do the “CPU stuck” messages cause any real problem or are they just annoying entries in the log?

Where are these messages generated, on a compute node, controller or VM? In which log do you see the message in red font?

You allude to problems with RabbitMQ. Can you elaborate?

You say that “the server disconnects”. Which server disconnects from what?

Finally, what is the significance of the PC you mention in the last paragraph?
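For the journal question above, here is a minimal Python sketch of one way to pull more context for the failed unit. It assumes the unit name network.service from the status output in the first post; the helper names (journal_cmd, failure_lines, fetch_journal) and the keyword list are made up for illustration, and the journalctl options used (-u, -b, --no-pager, -n) should be checked against your version.

```python
import subprocess


def journal_cmd(unit="network.service", lines=200):
    """Build a journalctl command for one unit, limited to the current boot."""
    return ["journalctl", "-u", unit, "-b", "--no-pager", "-n", str(lines)]


def failure_lines(text, keywords=("fail", "error", "dhclient")):
    """Keep only lines that mention one of the (assumed) keywords."""
    kept = []
    for line in text.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            kept.append(line)
    return kept


def fetch_journal(unit="network.service", lines=200):
    """Run journalctl and return its output as text (needs a systemd host)."""
    result = subprocess.run(
        journal_cmd(unit, lines), capture_output=True, text=True, check=False
    )
    return result.stdout
```

On the affected machine, something like `print("\n".join(failure_lines(fetch_journal())))` would show only the suspicious lines; the full journal output is still the better thing to post.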

TB0ne · 05-19-2019, 10:30 AM

Quote:

Originally Posted by AmerHwitat

Hello,
I had these problems earlier with OSP14 and RHEL 7.6;

I reported this to Bugzilla and got no response; Launchpad and openstack.org didn't answer either. I hope the answer is on this page. The PC was accidentally disconnected, which caused the last bunch of errors and a crash dump. I think it's the RabbitMQ heartbeat, as it showed up in my logs: it seems that when the server disconnects, RabbitMQ halts and crashes the system with a kernel loop.

I find this a little hard to believe; that bug has been reported since 2016, and there is a *LOT* of information on it. It was resolved, and RHEL has a patch available for it. So, since you're using Red Hat Enterprise, have you contacted Red Hat Support? You are PAYING FOR RHEL, RIGHT????

https://access.redhat.com/solutions/2073603

Those messages typically revolve around high system load and/or insufficient resources for the RHEL system to function properly. You say nothing about what's running on that machine, if this is a new installation/problem or an existing one, etc. Red Hat support can easily analyze an SOS report for you, to get you more details.
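To see whether the lockups point at overall load rather than at one misbehaving process, it can help to tally them. Below is a minimal Python sketch that parses lines in the format posted above and counts lockups per CPU and per task. The regular expression and the function names are assumptions written for this thread, not part of any Red Hat tool, and the sample holds three lines copied from the first post.

```python
import re
from collections import Counter

# Matches lines such as:
#  kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [ovsdb-server:10088]
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_lockups(text):
    """Return a list of (cpu, seconds, task, pid) tuples found in text."""
    events = []
    for line in text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            events.append(
                (int(m["cpu"]), int(m["secs"]), m["task"], int(m["pid"]))
            )
    return events


def summarize(events):
    """Count lockups per CPU and per task name."""
    per_cpu = Counter(cpu for cpu, _, _, _ in events)
    per_task = Counter(task for _, _, task, _ in events)
    return per_cpu, per_task


SAMPLE = """\
 kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [nova-api:102738]
 kernel:NMI watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [swapper/5:0]
 kernel:NMI watchdog: BUG: soft lockup - CPU#2 stuck for 65s! [kworker/2:0:59993]
"""

events = parse_lockups(SAMPLE)
per_cpu, per_task = summarize(events)
print(per_cpu)
print(per_task)
```

If the counts spread across many CPUs and unrelated tasks, as they do in the posted messages (nova-api, swapper, dmeventd, kworker, systemd), that fits the load or resource-starvation explanation better than a bug in one service.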

AmerHwitat · 05-20-2019, 12:15 AM

Quote:

Originally Posted by TB0ne

I find this a little hard to believe; that bug has been reported since 2016, and there is a *LOT* of information on it. It was resolved, and RHEL has a patch available for it. So, since you're using Red Hat Enterprise, have you contacted Red Hat Support? You are PAYING FOR RHEL, RIGHT????

https://access.redhat.com/solutions/2073603

Those messages typically revolve around high system load and/or insufficient resources for the RHEL system to function properly. You say nothing about what's running on that machine, if this is a new installation/problem or an existing one, etc. Red Hat support can easily analyze an SOS report for you, to get you more details.

Well, my friend, I had this problem back in January 2019. I posted it on Bugzilla and in some other places, but the replies were slow or never came, so I deleted the VM and started all over again. What finally worked was a minimal installation on a 12GB, 8-core VM. I had a developer subscription, and they did open a tech support request for me; the discussion ended with the conclusion that I can support myself, with a big fat VM.

https://bugs.launchpad.net/nova/+bug/1813446
https://bugzilla.redhat.com/show_bug.cgi?id=1666598

I have the Kdump crash image and an ABRT tool snapshot for reference. I also want to say that the firmware of VMs is not compatible with the new Linux kernels.

berndbausch · 05-20-2019, 01:51 AM

You must admit that your problem descriptions are a bit lacking. For example, the Nova bug description starts like this:

Quote:

I have problems running instance after creation it says hosts not found

What says “hosts not found”? Where do you find this message, and what action causes this message? You then continue talking about mysql.

The Bugzilla description mentions something about bridges not being set up correctly without providing details.

In this Linuxquestions thread you say that systemctl status network reports an error, but you are not clear which computer’s or VM’s network has problems, nor if they are really problems. You then mention the CPU stuck messages without clarifying how they are related to the network problem, before jumping to RabbitMQ problems, again without saying how that is supposed to be related.

I know a little bit about OpenStack and would be delighted to be of help, but at this point, I don’t even know what your problem is. Is your VM host crashing? Does the VM have network connectivity problems? Anything else? Without a clear problem statement (Note: random log messages are not problems), nobody can help you.

AmerHwitat · 05-20-2019, 06:40 AM

Quote:

Originally Posted by berndbausch

You must admit that your problem descriptions are a bit lacking. For example, the Nova bug description starts like this:

What says “hosts not found”? Where do you find this message, and what action causes this message? You then continue talking about mysql.

The Bugzilla description mentions something about bridges not being set up correctly without providing details.

In this Linuxquestions thread you say that systemctl status network reports an error, but you are not clear which computer’s or VM’s network has problems, nor if they are really problems. You then mention the CPU stuck messages without clarifying how they are related to the network problem, before jumping to RabbitMQ problems, again without saying how that is supposed to be related.

I know a little bit about OpenStack and would be delighted to be of help, but at this point, I don’t even know what your problem is. Is your VM host crashing? Does the VM have network connectivity problems? Anything else? Without a clear problem statement (Note: random log messages are not problems), nobody can help you.

What if I was all wrong? Don't people have the right to report a problem when the message specifically says to report this bug to this site?

As you can see, most of the time I was talking to nobody in these channels. They are supposed to answer, and more than once. Also, you don't see "Guru" in my name at all.

I found out later that there are people working on both sites, and I was reporting the same problem to them. First they say this site is not responsible for this, try the other site, and on the other site they say this is not a bug, blah blah.

My VM crashed, and I have a snapshot of the crash. The Horizon GUI said to report this bug to Launchpad; a Launchpad member said that this is not Launchpad's problem but Bugzilla's; and on Bugzilla the same man responded, after only a few replies, that this is not a bug and closed the report with no explanation at all. (Duhhh)

thanks for the reply

https://ibb.co/ZzQ8Bfb
https://ibb.co/MV5RTV2
https://ibb.co/4SvwRVg

TB0ne · 05-20-2019, 07:57 AM

Quote:

Originally Posted by AmerHwitat

Well, my friend, I had this problem back in January 2019. I posted it on Bugzilla and in some other places, but the replies were slow or never came, so I deleted the VM and started all over again. What finally worked was a minimal installation on a 12GB, 8-core VM. I had a developer subscription, and they did open a tech support request for me; the discussion ended with the conclusion that I can support myself, with a big fat VM.

https://bugs.launchpad.net/nova/+bug/1813446
https://bugzilla.redhat.com/show_bug.cgi?id=1666598

I have the Kdump crash image and an ABRT tool snapshot for reference. I also want to say that the firmware of VMs is not compatible with the new Linux kernels.

I'm not your friend. And "bugzilla and some other locations" are meaningless, since if you're using Red Hat Enterprise 7.6, you just needed to call Red Hat support. Are you PAYING FOR RHEL??? Because this:

Quote:

Originally Posted by AmerHwitat

I had a developer subscription, and they did open a tech support request for me; the discussion ended with the conclusion that I can support myself, with a big fat VM.

...makes little sense. Who is "they"? Why didn't you just call RHEL support? I gave you a link that was easily found that resolves your issue. Since you claim to be paying for RHEL and have a subscription, you should easily be able to log in and view it. Again, this is a known issue and the resolution was patched. The only reason you wouldn't have it now is if you're not paying for RHEL.

12GB of RAM and 8 cores is plenty to run RHEL, but you still don't say what ELSE is running on that VM, or what kind of VM you're running. The phrase "the firmware of VMs is not compatible with new kernels of Linux" also makes little sense, since there are many articles on the VMware site that have resolutions listed, specifically for such issues. Again, if you're actually PAYING FOR VMware, you can download and apply those patches.

dc.901 · 05-20-2019, 10:01 AM

Quote:

Originally Posted by AmerHwitat

The PC was accidentally disconnected, which caused the last bunch of errors

What errors?

Quote:

Originally Posted by AmerHwitat

and a crash dump. I think it's the RabbitMQ heartbeat, as it showed up in my logs: it seems that when the server disconnects, RabbitMQ halts and crashes the system with a kernel loop.

Well, so before disconnecting, perhaps stop the RabbitMQ service!

AmerHwitat · 05-20-2019, 12:36 PM

Quote:

Originally Posted by dc.901

What errors?

Well, so before disconnecting, perhaps stop the RabbitMQ service!

Yep, that's it: disconnecting caused RabbitMQ to time out, and then the CPUs got stuck in a kernel loop.

Thanks

berndbausch · 05-20-2019, 09:50 PM

Quote:

Originally Posted by AmerHwitat

What if I was all wrong? Don't people have the right to report a problem when the message specifically says to report this bug to this site?

I understand your frustration with Red Hat support, but if I don't know what your problem is, I can't help you.

Quote:

My VM crashed, and I have a snapshot of the crash. The Horizon GUI said to report this bug to Launchpad

In my humble opinion, the messages asking you to report something to launchpad should be removed from the code, but you probably know that OpenStack is not necessarily a very polished product.

Quote:

https://ibb.co/ZzQ8Bfb

My understanding is that the above link shows a crashing VM. Correct?

Quote:

https://ibb.co/4SvwRVg

And this link shows a VM that doesn't crash.

Can you confirm that your problem is a VM in OpenStack that crashes, possibly due to CPU lockups?

If so, your compute host might not have enough power to run this VM (https://ask.openstack.org/en/questio...ft-lockup-cpu/). Can you share the compute node's specs, and how much load is on that node?

As an aside, RabbitMQ is required to launch an instance, but it is not required for an instance to continue running once its launch was successful. I therefore guess that your RabbitMQ problems are separate from the VM crash, but if your compute node is under-powered, this may well cause RabbitMQ timeouts.

AmerHwitat · 05-20-2019, 11:04 PM

Quote:

Originally Posted by berndbausch

My understanding is that the above link shows a crashing VM. Correct?

And this link shows a VM that doesn't crash.

Can you confirm that your problem is a VM in OpenStack that crashes, possibly due to CPU lockups?

Yes, you are right, and yes, I confirm that the crash came from the CPU lockups; it's the Nova component that fails. The host was a 6th-generation i7 with 16GB RAM, and the VM had 8 cores and 12GB RAM. The problem started when the access point (WiFi) failed; for the next half hour everything happened fast. Before that I also had a similar situation that didn't cause this to happen, but I had a CPU lockup on OVS.

I have deleted the VM along with the logs and started a fresh installation with the minimal configuration in the answer file.

I had an all-in-one installation, not multi-node.

Best regards

berndbausch · 05-21-2019, 01:21 AM

Quote:

Originally Posted by AmerHwitat

The host was a 6th-generation i7 with 16GB RAM, and the VM had 8 cores and 12GB RAM
...
I had an all-in-one installation, not multi-node.

This looks very much like you are overloading the host. Keep in mind that the host has to run all the infrastructure including database and message queue, and perhaps also Ceilometer, which can use up a significant part of the host’s CPU. If you can reproduce the problem, run top or a similar tool in parallel to see to what extent the host is loaded.
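As a rough alternative to watching top by eye, the load average can be compared with the number of cores. The Python sketch below does that with os.getloadavg() and os.cpu_count(). The threshold of 1.0 runnable task per core is only a common rule of thumb, assumed here for illustration, and the function names are made up for this post.

```python
import os


def load_per_core(load_avg, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_overloaded(load_avg, cores, threshold=1.0):
    """True if the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


def report():
    """Print the 1, 5 and 15 minute load averages relative to core count."""
    one, five, fifteen = os.getloadavg()  # available on Unix-like systems
    cores = os.cpu_count() or 1
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        ratio = load_per_core(value, cores)
        flag = "possibly overloaded" if is_overloaded(value, cores) else "ok"
        print(f"{label}: load {value:.2f} on {cores} cores "
              f"-> {ratio:.2f} per core ({flag})")


if __name__ == "__main__":
    report()
```

Run it on the host and inside the VM while reproducing the problem; a per-core value that stays well above 1 for minutes would support the overload theory.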

Does OSP support single-host deployments? I would guess not, which might explain the reluctance of Red Hat Support to help you.

AmerHwitat · 05-21-2019, 01:31 AM

Quote:

Originally Posted by berndbausch

Does OSP support single-host deployments? I would guess not, which might explain the reluctance of Red Hat Support to help you.

Well, yes, I ran top most of the time to check on the system resources. A big portion went to the three main components (Nova, Neutron, Cinder), and I noticed that there was more than one instance of each of these processes. The main thing for me was that I got the chance to install and run OSP 14 on my VM before RH eventually disabled my account after I reported this incident. So no more connecting to the repositories with my subscription, which is OK, because I have another account.

anyway thanks for your swift reply

berndbausch · 05-21-2019, 02:24 AM

Try TripleO instead of OSP.

foasolution · 05-21-2019, 03:49 AM

Quote:

Originally Posted by berndbausch

You must admit that your problem descriptions are a bit lacking. For example, the Nova bug description starts like this:

What says “hosts not found”? Where do you find this message, and what action causes this message? You then continue talking about mysql.

The Bugzilla description mentions something about bridges not being set up correctly without providing details.

In this Linuxquestions thread you say that systemctl status network reports an error, but you are not clear which computer’s or VM’s network has problems, nor if they are really problems. You then mention the CPU stuck messages without clarifying how they are related to the network problem, before jumping to RabbitMQ problems, again without saying how that is supposed to be related.

I know a little bit about OpenStack and would be delighted to be of help, but at this point, I don’t even know what your problem is. Is your VM host crashing? Does the VM have network connectivity problems? Anything else? Without a clear problem statement (Note: random log messages are not problems), nobody can help you.

We first need to stop RabbitMQ and then run the system files.