LinuxQuestions.org

debuser123 02-10-2008 09:21 PM

TCP Checksum errors ... only after some amount of time has passed.
 
Edit 07/11/2008: Quick fix found: disable TCP timestamps
(# echo 0 > /proc/sys/net/ipv4/tcp_timestamps)
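
The quick fix can be wrapped in two small helpers. This is only a sketch, not something posted in the thread: the function names and the optional path argument are made up for illustration (the argument lets you try the helpers on a scratch file), and writing to the real /proc entry needs root.

```shell
# Kernel toggle for TCP timestamps: 0 = disabled, non-zero = enabled.
TS_FILE=/proc/sys/net/ipv4/tcp_timestamps

# Print the current setting (hypothetical helper name).
tcp_timestamps_status() {
    cat "${1:-$TS_FILE}"
}

# Disable timestamps until the next reboot; must run as root
# when pointed at the real /proc entry.
tcp_timestamps_off() {
    echo 0 > "${1:-$TS_FILE}"
}
```

The setting is lost on reboot. The usual way to make it stick is a `net.ipv4.tcp_timestamps = 0` line in /etc/sysctl.conf.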

Currently running a 2.6.24.1 kernel but also experienced this under 2.6.23.11; I am pretty sure it's a kernel/driver problem as a reboot fixes the "problem" without me changing any modem/router configurations (no wireless stuff involved). And I've never experienced this problem with Windows.

As soon as I boot up my PC (Debian Etch), I have no problems with my internet connection. However, if I leave my computer on for a while, my internet connection pretty much stalls (all the while my laptop running XP, plugged into the same router, has no problem). "A while" is not a definite amount of time, but let's say I've experienced this when I've left it on & unattended for at least 5 hours.

DNS queries work no problem, but connecting is the part that seems to have a problem. Running wireshark, I noticed that the main contributor seems to be TCP checksum errors. Offloading is not the problem because the checksums are always off by 1 (for example, correct checksum could be 0x1234 when the segment might have a checksum of 0x1233). It seems some packets "work" while others don't...and for example just opening something as simple as http://google.com might take about 2-3 minutes for the page to fully load.
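
For reference, the TCP checksum is the 16-bit one's-complement sum described in RFC 1071. The sketch below is not from the thread and does not diagnose this machine. With made-up input words and hypothetical function names, it only shows one way an off-by-one can arise: a routine that loses a single end-around carry lands exactly one away from the correct value.

```shell
# Internet checksum (RFC 1071) over a list of 16-bit hex words.
# Function bodies use ( ) so their variables stay local.

# inet_checksum WORD... : prints the correct checksum as 0xXXXX
inet_checksum() (
    sum=0
    for w in "$@"; do
        sum=$(( sum + 0x$w ))
    done
    # Fold any carry above bit 15 back into the low 16 bits.
    while [ $(( sum >> 16 )) -ne 0 ]; do
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    printf '0x%04x\n' $(( ~sum & 0xFFFF ))
)

# Deliberately broken variant: the carry is thrown away, not folded.
inet_checksum_no_fold() (
    sum=0
    for w in "$@"; do
        sum=$(( sum + 0x$w ))
    done
    printf '0x%04x\n' $(( ~sum & 0xFFFF ))
)

# Made-up sample words whose sum overflows 16 bits exactly once.
inet_checksum ffff 0002          # prints 0xfffd
inet_checksum_no_fold ffff 0002  # prints 0xfffe
```

The two results differ by one because the only thing missing from the second routine is the single carry that should have been added back in.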

Doing a little research, I found that someone discovered a bug in some MIPS64 assembly code in the kernel that incorrectly converted between 32-bit and 64-bit values (here), and the bug he describes seems to be exactly my problem: checksums off by one, and only in specific cases (just like mine: packets eventually get through, but a lot of packets are thrown away in between).

The problem is, my PC isn't running a MIPS processor, but an AMD. However, I am running AMD64 which means it could be the exact same problem (32/64 bit conversions). My kernel has been compiled as AMD64 (K7?)...I will know for sure if this is the problem when I recompile it using the same options but w/o the 64'bitness'...I'm hoping this is it and perhaps a bug report can be filed by people who know how to.

Otherwise, does anyone have any other suggestions on what the problem could be?

ARC1450 02-11-2008 08:45 AM

Looks like you're on the right track. If you've ruled out an offloading problem (which has been known to occur with certain gigabit adapters), and you've ruled out a NIC going bad or a poor driver, you're going where you need to go.

Please, though. . .do post what happens when you compile your kernel for generic 32-bit support. By the by, AMD64 is K8, just for future reference. K7 was up to the XP series of Athlons.

debuser123 02-11-2008 04:31 PM

Looks like I was mistaken; my kernel (2.6.24.1) was not compiled as Athlon64(K8) but as a K7...so 64 bitness is probably not the problem.

However, a kernel that has never given me a problem was 2.4.27 and it looks like that was compiled using 386 as the processor type. The bad thing is that you can't really compare 2.4 and 2.6 kernels to figure out a problem, however I can compile as 386.

So, what I will do is this:
#1) use K8 (the processor I actually have) and see what happens
and
#2) use 386 (the end-all in compatibility) and see what happens

The reason I doubt it's offloading is because the checksums are only off by one (on a computer I know has offloading, the checksums differed by huge amounts). My NIC is your generic onboard 100 Mb VIA Rhine (vt6102, rhine-II), which is another reason why offloading probably isn't it, since it's a "slow" card and wouldn't have that big of a use for offloading.

I don't know about a bad NIC because I can reboot and not have a problem at all (though rebooting might reset the NIC and put it in working state).

Funny thing is I don't even need a 64 bit processor...when it was 64bit vs. dual core, I chose wrongly.

PS: Is there a way to, of sorts, reset the TCP stack without rebooting? ifconfig ups/downs, dhcp lease renewals don't fix the problem.

ARC1450 02-11-2008 06:23 PM

Well, restarting your network card should clear the TCP/IP stack, as far as I know.

What I can tell you is that when NICs die, don't be surprised about anything. Some NICs silently go into the night, and just up and die. Some NICs go rather violently and storm a network to death, then die. Some NICs will cause the computer to lock as they die. Some NICs just start sending out jumbo frames in a network that can't accept them. And some NICs will appear to be on, have a connection, and show nothing. I just dealt with an onboard like that at work. The switch detected a connection, the NIC light was on, the switch even had packets trickling in and out. But no traffic actually went to and from the box.

If you've got an el-cheapo NIC, slap it in, see if it works. That'll tell you if it's your kernel or not. But compiling your kernel for an earlier generation of processor is okay. K8's will run anything equal to or less than a K8 kernel on the AMD side. P4's will run anything equal to or less than a P4 kernel on the Intel side.

debuser123 02-12-2008 03:20 AM

I've always assumed optimizations for an X (K7) may not work that well on a Y (K8) even if the Y is backwards compatible. But with K8 as the processor type, I still had the same problem.

A little bit more info:

1) Only packets that have been received generate a TCP checksum mismatch. Wireshark says the checksum should be one greater than what was received. That rules out transmit checksum offloading, since that only affects packets I send.

2) Internal (TCP) LAN traffic whether received or transmitted does not have any checksum mismatches.

I guess another thing I could try is compiling the NIC driver as a module and then unloading / reloading the module when I start getting errors.

This issue isn't a big problem because I don't always waste electricity by leaving it on, but sometimes I leave it on when I know I might need to ssh into it & grab some files.

debuser123 05-16-2008 12:57 AM

Bump...still a problem on 2.6.25.

debuser123 07-03-2008 08:23 PM

This is pretty odd but I figured out the "problem" which does not require me to reboot:

Once I start experiencing a loss of internet access out of nowhere (e.g., can't connect to web sites, can't ping, dns lookups don't work [e.g., FF sits on "Looking up host whatever.com..." in the statusbar]), what fixed it was to:

....Restart XWindows/Xorg (e.g., ctrl+alt+backspace)

I noticed it because I recently put the system monitor applet on the panel (which shows current cpu usage). Well, I noticed that once it looked like my net access was out, that applet was about 50% blue (meaning about 50% of my cpu was being used)...but I wasn't doing anything (at least in the foreground). So I clicked it and noticed that process Xorg was taking up about 60% of the cpu.

What I did next, I don't know why I never tried before, but I switched to tty1 and opened up google in links...voila, came right up. Went back to X, nope, no go. Then I reduced Xorg's priority to 19 (while in X) which was kind of dumb 'cause then my mouse was useless as any clicking didn't register. So restart X, and bam, everything's back to normal.

What this makes me think is that there's an issue with kernel, my motherboard (onboard LAN), and display (GeForce 256). I guess this is now less of a networking issue but my uname and lspci output is below. Still on kernel 2.6.23.11 (which I built a while ago [too lazy to change the default in grub]), but I still experienced the issue on version 2.6.25.

Code:

$ uname -a
Linux thepc 2.6.23.11 #2 PREEMPT Sun Dec 23 01:05:27 CST 2007 i686 GNU/Linux

$ lspci
00:00.0 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.1 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.2 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.3 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.4 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.7 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:0b.0 Multimedia audio controller: Creative Labs SB0400 Audigy2 Value
00:0f.0 IDE interface: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 78)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation NV10 [GeForce 256 SDR] (rev 10)


debuser123 07-07-2008 09:24 AM

Then again, maybe that wasn't the fix or what I experienced earlier wasn't "it". Today it happened again and I reset X...but TCP connections still wouldn't go through. I can ping and do udp/icmp stuff, it's just TCP that has a problem. I give up.........

farslayer 07-07-2008 10:59 AM

I was having issues with BOTH my AMD boxes and their integrated NICs; I believe they also have VIA chipsets. They were doing similar things to what you are describing, with really SLOOOWWW transfer rates. I threw an Intel NIC into each box, disabled the onboard NIC, and all my network problems disappeared on those two machines. I personally think it's a hardware issue.

debuser123 07-10-2008 12:09 PM

I agree it's probably a hardware issue but I never experienced this on any of the 2.4 kernels with the same hardware. I'm attaching a wireshark/tcpdump/pcap dump of how it takes almost a full 2 minutes just to connect to google.com with the links text-mode browser. You can see that for about the first minute it's just filled with "TCP checksum incorrect" errors. Then after that it seems fine. Wireshark says the TCP checksums are off by exactly one: it says the reported checksum was 0x1234 but it should've been 0x1235.
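
The same off-by-one check can be done on the command line instead of by eye in Wireshark. What follows is a rough sketch, not part of the original post: `cksum_deltas` is a made-up helper name, `capture.pcap` is a placeholder file name, and the `cksum 0x... (incorrect -> 0x...)` line format is assumed from current tcpdump releases (older versions print it slightly differently).

```shell
# cksum_deltas: read `tcpdump -vv` text on stdin and, for every segment
# flagged as having a bad checksum, print: reported expected difference
cksum_deltas() {
    sed -n 's/.*cksum \(0x[0-9a-fA-F]*\) (incorrect -> \(0x[0-9a-fA-F]*\)).*/\1 \2/p' |
    while read -r got want; do
        echo "$got $want $(( $want - $got ))"
    done
}

# Typical use against a saved capture (placeholder file name):
#   tcpdump -r capture.pcap -nn -vv tcp | cksum_deltas
```

If every line ends in 1, the bad segments are all short by exactly one, as described above. A mix of large differences would point more towards an offloading artifact.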

1. I don't have any problems when I'm running a local server and I connect with the lo or eth0 ip address.
2. Plugging my laptop into my router and trying to ssh into my computer is successful.
3. Starting an ssh session from my computer to my laptop is also successful.

So it seems that when this problem arises, connections to localhost servers and servers on my LAN still go through without error. The internet just doesn't like me.

I couldn't attach a file so I uploaded the wireshark packet dump to mediafire.com:

http://www.mediafire.com/?cultwg690dh

Anyone know some mailing list I could subscribe to to better debug this problem?

ARC1450 07-10-2008 12:36 PM

Just curious, but have you taken your router out of the mix and just directly connected to the 'net?

debuser123 07-10-2008 04:55 PM

Yup, tried that. The peculiar thing is that a reboot is the only thing that can fix "it". I don't have to reset my router or anything.

I also tried just compiling via-rhine as a module. Once I get the problem I'd do something like:
# ifdown eth0
# modprobe -r via-rhine
# modprobe via-rhine
# ifup eth0

I get an IP address and all, but still incoming TCP packets from the network have invalid checksums.

ARC1450 07-10-2008 05:00 PM

Dude, it sounds like you have a bad NIC, period.

And how long have you been off of a 2.4 based kernel?

debuser123 07-10-2008 05:39 PM

I've used 2.6 kernels for about a year. Prior to that was the default Debian Sarge kernel 2.4.something. I guess I could set a 2.4 as my default. I just never remember having this problem while on a 2.4 kernel, but who knows, maybe I did but didn't realize it.

I really just would like more debugging info in my kernel pertaining to the TCP/IP stack. Anyone know how I could get that?
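
On the question of getting more information out of the TCP/IP stack: one cheap data point, offered as a sketch rather than an answer from the thread, is the kernel's own SNMP counters in /proc/net/snmp. The `InErrs` field on the `Tcp:` rows counts segments received in error, which includes bad checksums. The helper name `tcp_counter` and its optional file argument are made up for illustration.

```shell
# tcp_counter NAME [FILE]: print one counter from the "Tcp:" rows.
# The first "Tcp:" row holds the field names, the second the values.
tcp_counter() {
    awk -v name="$1" '
        $1 == "Tcp:" && !seen { for (i = 2; i <= NF; i++) col[$i] = i; seen = 1; next }
        $1 == "Tcp:" && (name in col) { print $(col[name]) }
    ' "${2:-/proc/net/snmp}"
}

# Typical use on a live system:
#   tcp_counter InErrs
```

Running `tcp_counter InErrs` before and after a stalled connection attempt shows whether the kernel itself is counting the bad segments, or whether they are being dropped somewhere else.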

I'm not that big on upgrading...my video card (nvidia geforce 256) gets talked about enough [and has its own bundle of problems....the infamous Xid lockups with nvidia's closed-source drivers]. I really would like to figure out, if my NIC is bad, why it is.

farslayer 07-10-2008 11:35 PM

Could be a bug in the driver too, and not bad hardware. Like I said, it was easier for me to throw a NIC in the box than waste my time chasing something I couldn't control or identify.


All times are GMT -5.