LinuxQuestions.org
Old 11-11-2005, 11:19 AM   #1
branall1
LQ Newbie
 
Registered: Nov 2005
Posts: 14

Rep: Reputation: 0
Network stops responding after inactivity


Hi Guys,

Here is the situation:

Fedora Core 4
Dell 600SC Server
P4 2.4, gig o ram
Integrated Intel 1000+ Ethernet (e1000 module)

This box is running Apache 2.0.54 and MySQL.

After a period of network inactivity, the network card completely stops responding to requests. It will not answer pings, ssh, ftp or http connections. If I log on locally and ping another computer on our network, everything comes back to life. No errors are reported anywhere, and I am at a loss.

As a temporary workaround, I have written a script that runs in the background and pings another server on our network every 60 seconds. It also writes to a log whenever a ping is unsuccessful, and for two days now the server has been up without a single ping failing.
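The original script was not posted. A keepalive of the kind described could look roughly like the Python sketch below. The function names, the log path and the target address are illustrative assumptions, and the ping flags assume the Linux iputils `ping`.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=2):
    """Send one ICMP echo; True if it was answered (assumes iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keepalive(host, log_path, interval_s=60, rounds=None,
              ping=ping_once, sleep=time.sleep):
    """Ping host every interval_s seconds and log each failure.

    rounds=None loops until interrupted; a number limits the loop (handy for
    testing). Returns the number of failed pings.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        if not ping(host):
            failures += 1
            with open(log_path, "a") as log:
                log.write(f"{datetime.now().isoformat()} ping to {host} failed\n")
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return failures


# Hypothetical usage (runs until interrupted):
# keepalive("192.0.2.1", "/var/log/keepalive.log")
```

Note that this only masks the symptom: outbound traffic keeps the box reachable, but it says nothing about why inbound traffic stops.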

This is a strange problem, and I cannot find anything out there that says anyone else has had this problem. I was just wondering if anyone on here has any ideas.

Thank you

PS: Static IP, SElinux is completely off, no firewall, direct connection to our network.

Thanks again for your help,

Brandon
 
Old 11-12-2005, 10:19 PM   #2
sigsegv
Senior Member
 
Registered: Nov 2004
Location: Third rock from the Sun
Distribution: NetBSD-2, FreeBSD-5.4, OpenBSD-3.[67], RHEL[34], OSX 10.4.1
Posts: 1,197

Rep: Reputation: 47
If you want to see a really neat trick, fiddle with the speed and/or duplex of the interface

We have ~30 Dell 8xx series servers with e1000 interfaces in them. I don't know if it's the module or the hardware, but whatever's to blame, the two do *not* play well together.

I have the module parameters at the office that (I think) will help you. I'll post back on Monday.
 
Old 11-14-2005, 08:08 AM   #3
branall1
LQ Newbie
 
Registered: Nov 2005
Posts: 14

Original Poster
Rep: Reputation: 0
I am interested to see if those mod settings work. I would be most appreciative if you could post them.

We have tried using a 3com 10/100 card in this server, and it was doing the exact same thing. This is such a weird problem. This is also the second 600 we have tried this configuration on, to eliminate the possibility of a bad motherboard. Both systems have exhibited the same symptoms, so there has to be an issue with the Kernel not getting along with these servers.

I put FC4 on a Dell 1650 we had lying around, and it has worked like a champ all weekend, and it uses the same e1000 module.

My first thought on this was some form of power management issue, but I checked the BIOS, and there aren't any power settings. I also stopped the ACPI and APM daemons, just to make sure.

So far, so good. My ping script has kept this thing up all week, and it hasn't reported a dropped packet yet. It will only crap out when it idles, but even after it idles, a single ping or a restart of the network will bring everything back up, like nothing was wrong.

Thanks again for your help, I would hate to lose a box to Windows (Especially because this site needs PHP, GD, Apache and MySql, which is a perfect linux setup)
 
Old 11-29-2005, 07:32 AM   #4
sigsegv
Senior Member
 
Registered: Nov 2004
Location: Third rock from the Sun
Distribution: NetBSD-2, FreeBSD-5.4, OpenBSD-3.[67], RHEL[34], OSX 10.4.1
Posts: 1,197

Rep: Reputation: 47
Ok, so these are from a Dell 1850, but it was having a very similar problem.

Code:
[root@iceman ~]# cat /etc/modprobe.conf
...
alias eth2 e1000
alias eth3 e1000
options e1000 Speed=100,100 Duplex=2,2
...
That forces both interfaces to only negotiate 100baseTx-FD. Settings taken from the Intel readme.

There seems to be a (known) problem with Intel EEP1000 cards and Cisco switches, too. If you've got Cisco gear, have your network people turn on portfast and fiddle with other settings in the switch as well (speed and duplex can't be forced if you want to run 1000baseTx-FD -- it *has* to be autoneg).
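For anyone trying forced settings like the ones above, it helps to confirm what the link actually came up at. Below is a minimal Python sketch, assuming a kernel that exposes `/sys/class/net/<iface>/speed` and `/sys/class/net/<iface>/duplex` (newer kernels do; the FC4-era kernels in this thread may not). The function names are made up for illustration.

```python
from pathlib import Path


def link_settings(iface, sysfs_root="/sys/class/net"):
    """Return (speed_mbps, duplex) for iface, or None if unavailable.

    Reading these files can fail when the link is down or the driver does
    not report them, so any read or parse error maps to None.
    """
    base = Path(sysfs_root) / iface
    try:
        speed = int((base / "speed").read_text().strip())
        duplex = (base / "duplex").read_text().strip()
    except (OSError, ValueError):
        return None
    return speed, duplex


def matches_forced(iface, want_speed=100, want_duplex="full",
                   sysfs_root="/sys/class/net"):
    """True if the interface reports the forced speed and duplex."""
    return link_settings(iface, sysfs_root) == (want_speed, want_duplex)


# Hypothetical usage for the interfaces in the modprobe.conf above:
# for name in ("eth2", "eth3"):
#     print(name, link_settings(name), matches_forced(name))
```

If the card reports 100/full but the switch port is still set to autonegotiate, the result is typically a duplex mismatch, so the switch side needs checking as well.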
 
Old 12-05-2005, 08:14 PM   #5
jge026
LQ Newbie
 
Registered: Nov 2005
Location: Boston, MA
Distribution: Fedora Core 4
Posts: 6

Rep: Reputation: 0
don't worry, you're not alone.

I've had the same problem. Mine manifested itself over wired & wireless connections on FC4 & Ubuntu. I tried everything. Still no success.

http://www.linuxquestions.org/questi...d.php?t=381065
 
Old 12-05-2005, 09:16 PM   #6
branall1
LQ Newbie
 
Registered: Nov 2005
Posts: 14

Original Poster
Rep: Reputation: 0
I put CentOS 4.2 on one of these boxes today, and so far so good. It is up and running, and seems to be working great.

The original site that was causing problems has been moved over to a production server, so I never did get to try those mod settings out. I will definitely keep them in mind if I ever need to run FC4 on one of those things again.

Thanks again for your help, and I'm sure you'll see me around sometime again.

Brandon
 
Old 12-10-2005, 03:02 AM   #7
ajustin
Newbie
 
Registered: Dec 2005
Posts: 2

Rep: Reputation: 0
It's not just after no activity for me

Almost identical problem here using CentOS 4.1 or 4.2; I can't tell which, and the CD is somewhere else.

Pinging out from or in to the machine makes no difference; it still dies, usually after no more than 20 minutes.

I transferred the hard drives into a completely different machine, different manufacturer and different network card and the problem moved to the new machine, therefore it isn't hardware and must be something to do with the OS.

If I put the machine on a 10/100 HUB with just one workstation for testing, it works fine and the network stays up, so far, for hours.

If I put the machine on the corporate network which uses recently installed Dell PowerConnect 3348 SWITCHES, the problem occurs. I don't have a spare switch to test it as a single workstation setup. The machine is using a Linksys NC100 NIC in its current setup but I have also tried an RTL8139 and had the same results.

I have tried setting up the test workstation and Centos machine on the hub and uplinking this to the main network but the problem still recurs.

It seems to me that the problem is being initiated either by the Dell switches themselves or by the activity of some other machine on the corporate network.

I have Red Hat ES3 and Red Hat ES4 servers on the network that work fine.

Any suggestions welcome, please.

Tony

Last edited by ajustin; 12-10-2005 at 03:08 AM.
 
Old 12-10-2005, 04:24 PM   #8
branall1
LQ Newbie
 
Registered: Nov 2005
Posts: 14

Original Poster
Rep: Reputation: 0
It's BBBAAAACCCCKKK

This is really getting annoying.

Completely different server now (Dell Power Edge 2650)
OS: Cent 4.2 (Kernel 2.6.9-22.0.1.ELsmp)

Completely fresh installation. No SELinux or Firewall. One IP per adapter (One External one Internal) Both connections to a Cisco 3500XL Switch (With portfast on. Also did it with portfast off)

The same thing is now happening on TWO separate boxes. I used to think it was due to inactivity, but I no longer think this is the case. One of the boxes seemed to work fine for a while, and then all of a sudden developed this symptom. Our development team will be in the box all day long, and all of a sudden one of them will message me telling me "it's down." Sometimes it will come right back up, but most of the time I run the following checks without success:

Ping either IP (internal and external): no response (the internal network is on the same Cisco, with each side on its own VLAN)
Try to load the page, no success
FTP: Nothing
SSH: Nothing

I know a single ping out EITHER interface will bring everything back on line. I also found that restarting Apache will also bring it back online.

The link lights stay lit.
Nothing in ANY log. Apache error, messages, dmesg, NOTHIN

Here is what else is weird. I run GFI network-server monitor to monitor this box (and several others on our network). GFI logs into the box to check services every 5 minutes, and will fail after 3 failures. It also runs an http request to the main page every 5 minutes. One of these checks on this box has NEVER failed. I also run Cacti on our production network with SNMP data from this box, and it has never shown a gap. So the plot thickens.

Now, on to the other box.

I just built this box the other day, but it is also a DELL PE 2650. It is running the same OS and software as the box above, only this one doesn't have any activity on it yet.

This is the one that is completely dropping off the face of the earth after mere minutes of sitting there. Same symptoms: a restart of Apache fixes it, no ping response on any IP, yadda yadda.

This box is also plugged into a Cisco 3500 (same switch as above, although, before you say it, I have tried another switch, but it is also a 3500 set up the same way). I have a couple of Dells back there I can use, but that is unacceptable as a long-term fix.

The original problem also involved a Cisco 3500, but that one wasn't VLAN'ed. It was plugged in to the switch where our 100Meg comes in.


Now, time for ANOTHER twist:

I have this same configuration running on another 2650 and have never had as much as a hiccup.

I am at my wits' end, and I HAVE to find a solution, quickly. I just placed an order for the equipment for our new SAN, and am going to be using this hardware (along with a bunch more) to build an LVS cluster with GFS for these sites. I am not AS concerned over this in the near future, because when they get moved over to LVS, there will be constant heartbeat pings running to and from each of these servers, and judging from the previous problem, if the box is pinging out every few minutes, it stays up without a hitch. I just don't like it when weird problems like this crop up for no apparent reason.

I have done search after search on this, and I haven't seen anyone else really experiencing this. There has to be SOMEONE out there that is having the same problem, or who at least has the same hardware setup (Those CISCO switches are VERY popular, and those Dells are still around by the thousands) without the same problem.

Thanks again for the support, I look forward to finding a solution.
 
Old 12-12-2005, 07:45 AM   #9
ajustin
Newbie
 
Registered: Dec 2005
Posts: 2

Rep: Reputation: 0
I've solved MY problem

My researches described above drew me to the view that it had to be something to do with the switch rather than the OS on the server.

Investigation showed that the switch and the server BOTH had the SAME IP address!

This had not been immediately obvious but browsing around the switch management console brought it to light. It may have shown up in one of the switch log files, but it was not at all obvious otherwise as ssh and vnc both connected to the server in preference to the switch and the server never complained about a duplicate IP.
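A conflict like this can sometimes be spotted from a third machine on the same segment: send ARP requests for the address and see whether more than one MAC answers. The Python sketch below only parses captured output. The `Unicast reply from IP [MAC]` line format is an assumption modeled on iputils `arping`, and the function names are illustrative.

```python
import re
from collections import defaultdict

# Assumed reply format: "Unicast reply from 10.0.0.5 [00:11:22:33:44:55]  0.7ms"
REPLY_RE = re.compile(r"reply from (\d+\.\d+\.\d+\.\d+) \[([0-9A-Fa-f:]{17})\]")


def macs_by_ip(arping_output):
    """Map each replying IP address to the set of MACs that answered for it."""
    seen = defaultdict(set)
    for match in REPLY_RE.finditer(arping_output):
        seen[match.group(1)].add(match.group(2).lower())
    return dict(seen)


def conflicts(arping_output):
    """Return only the IPs that were answered by more than one MAC."""
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip(arping_output).items()
        if len(macs) > 1
    }
```

Two different MACs answering for one IP is a strong hint of a duplicate address, though proxy ARP on a router can produce the same picture.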

Tony
 
Old 12-12-2005, 12:44 PM   #10
Darin
Senior Member
 
Registered: Jan 2003
Location: Portland, OR USA
Distribution: Slackware, SLAX, Gentoo, RH/Fedora
Posts: 1,024

Rep: Reputation: 45
I would definitely suggest the "options e1000 Speed=100,100 Duplex=2,2" setting to force speed and duplex on the network card. From there you will also need to force speed and duplex on the switch to the same settings, 100 full in the case of the options line above.

If you haven't already, you may also want to check out Intel's driver http://downloadfinder.intel.com/scri...ilename=e1000-
The driver from Intel is slightly different from the open source version distributed with the kernel, although Intel puts work into both versions.

I am a little curious about the state it's in when it 'goes down'. Is the switch still sending network data intended for that host out the port? The activity light on the network card should show more than broadcast activity with an active [ping] to the host running somewhere. Does ifconfig show TX or RX packets increasing after it has failed? Does mii-tool show link status? (Looking at my system, I get a 'SIOCGMIIPHY on 'eth1' failed: Operation not supported' error running mii-tool against my e1000 card.) Does the switch show that the link is still up? Do you have anything on the network with an IP address conflict like ajustin had?
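The TX/RX question can be answered without eyeballing ifconfig by taking two snapshots of `/proc/net/dev` a few seconds apart while something is pinging the box. Here is a rough Python sketch (helper names are illustrative). It relies on the standard `/proc/net/dev` layout, where RX packets is the second counter after the interface name and TX packets is the tenth.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_packets, tx_packets)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 10:
            continue
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def packet_deltas(before, after):
    """Per-interface (rx_delta, tx_delta) between two parsed snapshots."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in before
        if iface in after
    }


# Hypothetical usage on the affected box:
# import time
# first = parse_net_dev(open("/proc/net/dev").read())
# time.sleep(10)
# second = parse_net_dev(open("/proc/net/dev").read())
# print(packet_deltas(first, second))
```

An RX delta of zero while another host is pinging suggests the frames never reach the card, which points at the switch. RX increasing with TX flat points back at the host.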
 
Old 12-12-2005, 12:51 PM   #11
branall1
LQ Newbie
 
Registered: Nov 2005
Posts: 14

Original Poster
Rep: Reputation: 0
The 2650s don't use the Intel card, and, on one of the machines, I did try forcing it into 100 FD. The 2650s have a Broadcom card in them (well, built in to them).

I will check link status and watch the ports next time it goes down; I didn't think of that. Restarting Apache actually DOESN'T do anything; I think that was just a fluke. In the latest incarnation it won't respond on the external network (eth1), but I can ssh into the internal one (eth2), and then the external comes up.
 
Old 12-12-2005, 12:53 PM   #12
branall1
LQ Newbie
 
Registered: Nov 2005
Posts: 14

Original Poster
Rep: Reputation: 0
We are starting to focus on the switch today. I don't know anything about these Ciscos, but the other guy here does. I have him checking the logs to see if anything is weird.

Thanks for your help with this. This is driving me NUTS.
 
Old 12-12-2005, 01:02 PM   #13
Darin
Senior Member
 
Registered: Jan 2003
Location: Portland, OR USA
Distribution: Slackware, SLAX, Gentoo, RH/Fedora
Posts: 1,024

Rep: Reputation: 45
Quote:
Originally Posted by branall1
We are starting to focus on the switch today. I don't know anything about these Ciscos, but the other guy here does. I have him checking the logs to see if anything is weird.

Thanks for your help with this. This is driving me NUTS.
Just so you know, if you force speed and duplex on the network card, you pretty much have to also do it on the switch, and vice versa.
 
  

