CentOS 6.2 loses its network connection intermittently

hahacc · 03-29-2012, 01:46 AM

Hi guys,
I have one host (CentOS 6.2) that loses its network connection intermittently, leaving the whole OS without networking. The host runs httpd, so losing connectivity takes the sites down.

The OS was not shut down or rebooted after the loss of networking; it just stayed up without a connection. I checked the error logs and could not find anything related to this behavior. The OS has xinetd (rsync/nrpe), httpd, mysql and vsftpd installed. I've already run a yum update, and it is now at 2.6.32-220.7.1.el6.x86_64, CentOS release 6.2 (Final).

Can anyone help with this?

lithos · 03-29-2012, 01:53 AM

Hi

Is your server running any network daemon (service) with DHCP enabled?
Could the NIC be defective? Can you try replacing the network card?
What do the following

Code:

# service network status
Configured devices:
lo eth0 eth1
Currently active devices:
lo eth0


and 

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:30:4F:28:16:C2
          inet addr:192.168.0.7  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::230:4fff:fe28:16c2/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:520277291 errors:0 dropped:0 overruns:0 frame:0
          TX packets:320763080 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2683477502 (2.4 GiB)  TX bytes:3405751313 (3.1 GiB)
          Interrupt:209 Base address:0x2000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:222507 errors:0 dropped:0 overruns:0 frame:0
          TX packets:222507 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:60339993 (57.5 MiB)  TX bytes:60339993 (57.5 MiB)


# cat /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
BROADCAST=192.168.0.255
IPADDR=192.168.0.7
NETMASK=255.255.255.0
NETWORK=192.168.0.0
TYPE=Ethernet

show?

Can you ping any other computer or server in the same subnet?
Or does

Code:

ping www.google.com

give any response?

hahacc · 03-29-2012, 03:00 AM

Thanks.
Here are the outputs:

Quote:

[root@jingan10 network-scripts]# service network status
Configured devices:
lo eth0 eth1
Currently active devices:
lo eth1

[root@jingan10 network-scripts]# ifconfig -a
eth0 Link encap:Ethernet HWaddr BC:AE:C5:3D:316
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:16 Memory:fbde0000-fbe00000

eth1 Link encap:Ethernet HWaddr BC:AE:C5:3D:25:71
inet addr:116.255.130.60 Bcast:116.255.130.63 Mask:255.255.255.224
inet6 addr: fe80::beae:c5ff:fe3d:2571/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:96091 errors:0 dropped:0 overruns:0 frame:0
TX packets:105631 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:20427859 (19.4 MiB) TX bytes:67686188 (64.5 MiB)
Interrupt:17 Memory:fbce0000-fbd00000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:59 errors:0 dropped:0 overruns:0 frame:0
TX packets:59 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:5700 (5.5 KiB) TX bytes:5700 (5.5 KiB)

[root@jingan10 network-scripts]# cat ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="none"
HWADDR="BC:AE:C5:3D:316"
ONBOOT="no"
IPADDR=

[root@jingan10 network-scripts]# cat ifcfg-eth1
DEVICE="eth1"
BOOTPROTO="static"
HWADDR="BC:AE:C5:3D:25:71"
ONBOOT="yes"
IPADDR=116.255.130.60
NETMASK=255.255.255.224
GATEWAY=116.255.130.33

Actually, it seems no service with DHCP enabled is running. I've attached a snapshot of all processes to this thread.

I've also written a script, run from cron every 15 minutes, that checks networking: if the host cannot ping a set of IP addresses, it restarts the network service, waits a while, and reboots the host if the pings still fail. Here is the script:

Quote:

#!/bin/bash
# Cron entry: */15 * * * * /backup/sites/reboot_if_no_internet_access.sh
# Restart networking when nearly all test pings fail; reboot if that does not help.

LOG=/var/tmp/reboot.log

# Each address is pinged three times (nine pings in total).
ip_addy=(
    8.8.8.8 8.8.8.8 8.8.8.8
    220.181.111.85 220.181.111.85 220.181.111.85
    123.125.38.240 123.125.38.240 123.125.38.240
)
_max=7    # act only when more than 7 of the 9 pings fail

# Ping every address once and print the number of failed pings.
count_failures() {
    local failed=0 ip
    for ip in "${ip_addy[@]}"; do
        /bin/ping -c1 -w3 "$ip" > /dev/null || failed=$(( failed + 1 ))
    done
    echo "$failed"
}

sleep 10
if [ "$(count_failures)" -gt "$_max" ]; then
    echo "restart networking at: $(date)" >> "$LOG"
    /etc/init.d/network restart
    sleep 90
    # Second check: reboot only if the restart did not bring connectivity back.
    if [ "$(count_failures)" -gt "$_max" ]; then
        echo "reboot server at: $(date)" >> "$LOG"
        /sbin/reboot
    fi
fi

From the script's log file I can see that networking was restarted before the reboot, and dmesg from that time suggests the restart went well, but the pings still failed afterwards and so the host was rebooted:

Quote:

Mar 29 15:15:38 jingan10 kernel: lo: Disabled Privacy Extensions
Mar 29 15:15:39 jingan10 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 29 15:15:40 jingan10 kernel: e1000e: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
Mar 29 15:15:40 jingan10 kernel: e1000e 0000:02:00.0: eth1: 10/100 speed: disabling TSO
Mar 29 15:15:40 jingan10 kernel: e1000e: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
Mar 29 15:15:40 jingan10 kernel: e1000e 0000:02:00.0: eth1: 10/100 speed: disabling TSO
Mar 29 15:15:40 jingan10 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready #seems networking restarted well
Mar 29 15:17:40 jingan10 init: tty (/dev/tty1) main process (1606) killed by TERM signal #but still host was rebooted
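
Since the log lines above show eth1 being driven by e1000e, it may help to record the exact driver version in use when comparing behavior before and after a driver change. The following is only a sketch: driver_info is a hypothetical helper name, and it assumes the usual "key: value" layout printed by ethtool -i.

```shell
#!/bin/bash
# Hypothetical helper: print "<driver> <version>" from `ethtool -i <iface>`
# output supplied on stdin. Assumes the usual "key: value" line layout.
driver_info() {
    awk -F': ' '
        $1 == "driver"  { d = $2 }
        $1 == "version" { v = $2 }
        END             { print d, v }
    '
}

# Usage on the host itself:
#   ethtool -i eth1 | driver_info
```

Appending that output to the same log file the cron script writes would tie each network restart to the driver version that was loaded at the time.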

lithos · 03-29-2012, 07:00 AM

Hi,

Just as a precaution, please mask the IP addresses in your ifcfg-ethX files (for example IPADDR=1.2.3.4), as they are not relevant here.

It seems that your connections run through the eth1 NIC

Code:

[root@jingan10 network-scripts]# cat ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="none"
HWADDR="BC:AE:C5:3D:316"
ONBOOT="no"
IPADDR=

[root@jingan10 network-scripts]# cat ifcfg-eth1
DEVICE="eth1"
BOOTPROTO="static"
HWADDR="BC:AE:C5:3D:25:71"
ONBOOT="yes"
IPADDR=1.2.1.2
NETMASK=255.255.255.224
GATEWAY=1.2.1.3

rather than the more usual eth0.

But I think this kind of setup needs some routing configured,
which unfortunately I don't know how to do.

Maybe some expert users here can help more with using eth1 as the default Internet connection.

Regards
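
For later readers: on CentOS 6 the GATEWAY= line in ifcfg-eth1 is normally what installs the default route, so the routing concern above can be checked by looking at which interface carries that route. A minimal sketch follows; default_route_dev is a hypothetical helper name, and it assumes the usual "default via <gateway> dev <iface>" line printed by ip route.

```shell
#!/bin/bash
# Hypothetical helper: print the interface that carries the default route,
# given `ip route show` output on stdin.
default_route_dev() {
    awk '$1 == "default" {
        # Find the "dev" keyword and print the field that follows it.
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Usage on the host itself:
#   ip route show | default_route_dev
```

If this prints eth1 while the host is unreachable, the default route is probably in place and the problem lies elsewhere.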

hahacc · 03-30-2012, 07:00 AM

Please see the note below.

hahacc · 04-04-2012, 10:58 PM

Just for anyone who arrives here after searching:
1. There is a kernel bug in the e1000e driver for the Intel 82574L on CentOS 6 (an MSI/MSI-X interrupt issue). It can be resolved by installing the kmod-e1000e package from ELRepo.org and then adding pcie_aspm=off e1000e.IntMode=1,1 e1000e.InterruptThrottleRate=10000,10000 acpi=off to the kernel parameters. More info: "Intel e1000e driver bug on 82574L Ethernet controller causing network blipping".
2. The high Tx traffic was caused by a DNS flooding attack on port 53. I resolved this by writing some iptables rules. More info: "port 53 dns flooding attack".
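
After editing the kernel line (on CentOS 6 this normally lives in /boot/grub/grub.conf) and rebooting, it can be worth confirming that the parameters from point 1 actually reached the running kernel. A rough sketch, where missing_params is a hypothetical helper name and the parameter list is copied from the post above:

```shell
#!/bin/bash
# Hypothetical helper: print each workaround parameter from point 1 that is
# absent from the kernel command line passed as the first argument.
missing_params() {
    local cmdline=" $1 " param
    for param in pcie_aspm=off e1000e.IntMode=1,1 \
                 e1000e.InterruptThrottleRate=10000,10000 acpi=off; do
        case "$cmdline" in
            *" $param "*) ;;            # parameter is present
            *) echo "$param" ;;         # parameter is missing
        esac
    done
}

# Usage on the host itself:
#   missing_params "$(cat /proc/cmdline)"
```

No output means all four parameters are present on the booted kernel's command line.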

Severian37 · 10-19-2012, 03:46 PM

Thanks for posting the info on the elrepo e1000e package and kernel parameters. This was a huge help.