Server Hung with error message: ERROR: Message hist queue is filling up
Hi Guys,
First, please excuse my poor English. :) I have a two-node cluster running RHEL with pacemaker. Recently, one of the cluster nodes hung (it was pingable, but the KVM console showed a black screen and ssh was not possible). The other, active cluster node was serving the mysql service, but users reported that the application couldn't access the database. I had no choice but to reboot the hung system (to get the console back). After the reboot, the cluster went back to normal.

Lots of ERROR messages appeared (every second) in the messages log file on the impacted cluster node, as below:

ERROR: Message hist queue is filling up (500 messages in queue)
WARN: Gmain_timeout_dispatch: Dispatch function for send_reqnodes_msg took too long to execute: 240 ms (> 100 ms) (GSource: 0x9c3f660)

Could anyone advise what went wrong on my system, and how the above messages relate to the hang? Your advice is highly appreciated. Thanks.

You need to investigate the following items:
1. Have you updated the system in recent days? Especially the kernel.
2. Do you have active support with Red Hat?
3. If yes for #2, it is best to engage Red Hat support.

Hi myatthu,
Thanks for the feedback. FYI, no maintenance or update activity was carried out recently; the issue hit the cluster suddenly. Second, the support contract (RHN) is still valid, and I've logged this issue with HP (since I bought the support from HP). HP did the troubleshooting at the hardware and OS levels (and found everything OK), but they can't assist me with cluster-level troubleshooting because my cluster is set up with 'pacemaker' instead of the Red Hat cluster tools (luci/ricci). They advised me to contact pacemaker support :( I'm stuck and really hope that someone in this forum can assist or advise me on this issue. Thanks a lot.

What are your RHEL and heartbeat versions?
Code:
cat /etc/redhat-release
Code:
rpm -qi heartbeat

Hi Myatthu,
Please refer to the output below:
Quote:
Name : heartbeat
Arch : x86_64
Version : 3.0.4

Yeah, your environment is a pretty recent version.
Can you provide the following outputs? You may omit real IPs and the secret key.
Code:
last
Code:
uname -a
Code:
cat /etc/ha.d/ha.cf
Code:
cat /etc/ha.d/haresources
Can you also grep for ERROR in /var/log?
Code:
grep -i error /var/log/*

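Since the error reportedly repeats every second, a plain grep will produce a lot of lines. A rough sketch like the one below may help to see when the flood started and how heavy it was. It assumes the traditional syslog timestamp format ("Mon DD HH:MM:SS") at the start of each line; the function name `count_hist_queue_errors` is made up for illustration and is not part of heartbeat.

```shell
# Hypothetical helper: count "Message hist queue is filling up" errors per hour.
# Assumption: each log line starts with a syslog timestamp such as
# "Mar 10 14:01:02", and the file is in chronological order.
count_hist_queue_errors() {
    logfile="$1"
    # Keep only the matching lines, reduce each timestamp to "Mon DD HH:00",
    # then count consecutive identical hours.
    grep -F 'Message hist queue is filling up' "$logfile" |
        awk '{ split($3, t, ":"); print $1, $2, t[1] ":00" }' |
        uniq -c
}
```

It could be run as `count_hist_queue_errors /var/log/messages` on the impacted node. The first hour that shows up in the output is a reasonable place to start reading the surrounding log entries.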
Hi myatthu,
Sorry for the late reply. I've collected all the output/logs as requested. You can retrieve them from the link below: https://www.dropbox.com/sh/k083f71s2aoe2f9/emmr3fCzbe I appreciate your advice. Thanks.

How do you connect the two nodes? Is it a direct cable or a routed network?
If your network latency is high, you might want to adjust the following items in ha.cf:
warntime 20
deadtime 30
initdead 30
You might also want to check CPU usage during the period of the hang. Again, I am just guessing at some possibilities. You should ask the heartbeat community for a more detailed analysis. Good luck.

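Before changing those values, it may be worth checking what is currently set. The following is only a small sketch: `show_hb_timing` is a made-up helper name, and it assumes the configuration lives at /etc/ha.d/ha.cf as in the earlier commands. It prints the active (uncommented) timing directives, including keepalive, which the other timers are usually sized against.

```shell
# Hypothetical helper: print the active heartbeat timing directives.
# Assumption: the config path defaults to /etc/ha.d/ha.cf; pass another
# path as the first argument to check a copy of the file instead.
show_hb_timing() {
    conf="${1:-/etc/ha.d/ha.cf}"
    # Match only lines that start with one of the timing directives,
    # so commented-out entries (starting with '#') are skipped.
    grep -E '^[[:space:]]*(keepalive|warntime|deadtime|initdead)[[:space:]]' "$conf"
}
```

Running `show_hb_timing` on both nodes and comparing the output is a quick way to confirm that the two ha.cf files agree before and after the change.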