TCP Checksum errors ... only after some amount of time has passed.
Currently running a 18.104.22.168 kernel but also experienced this under 22.214.171.124; I am pretty sure it's a kernel/driver problem as a reboot fixes the "problem" without me changing any modem/router configurations (no wireless stuff involved). And I've never experienced this problem with Windows.
As soon as I boot up my PC (Debian Etch), I have no problems with my internet connection. However, if I leave my computer on for a while, my internet connection pretty much stalls (all the while my laptop running XP plugged in the same router has no problem). "A while" is not a definite amount of time, but let's say I've experienced this when I've left it on & unattended for at least 5 hours.
DNS queries work no problem, but connecting is the part that seems to have a problem. Running wireshark, I noticed that the main contributor seems to be TCP checksum errors. Offloading is not the problem because the checksums are always off by 1 (for example, correct checksum could be 0x1234 when the segment might have a checksum of 0x1233). It seems some packets "work" while others don't...and for example just opening something as simple as http://google.com might take about 2-3 minutes for the page to fully load.
Doing a little research, I found a guy who discovered a bug in some MIPS64 assembly code (in the kernel) that incorrectly converted between 32-bit and 64-bit values (here), and the bug he describes seems to be exactly my problem: checksums off by one, and only in specific cases (just like mine: packets eventually get through, but a lot of packets get thrown away in between).
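To make the "off by one" symptom concrete, here is a minimal Python sketch of the Internet checksum (RFC 1071) next to a deliberately broken variant. It is not the MIPS64 code from the linked report; `checksum_dropping_carry` is a hypothetical function that only illustrates how losing a single end-around carry (the kind of thing a bad 32/64-bit conversion could plausibly cause) shifts the result by exactly one.

```python
def fold(total: int) -> int:
    """Fold any carries above bit 15 back into the low 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def sum_words(data: bytes) -> int:
    """Plain integer sum of the data taken as big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # odd-length input is padded with a zero byte
    return sum(int.from_bytes(data[i:i + 2], "big")
               for i in range(0, len(data), 2))


def internet_checksum(data: bytes) -> int:
    """RFC 1071: complement of the one's-complement 16-bit sum."""
    return ~fold(sum_words(data)) & 0xFFFF


def checksum_dropping_carry(data: bytes) -> int:
    """Hypothetical buggy variant: truncates the sum instead of folding it."""
    return ~sum_words(data) & 0xFFFF


sample = bytes.fromhex("ffff0002")          # 0xFFFF + 0x0002 overflows 16 bits
good = internet_checksum(sample)            # 0xFFFD
bad = checksum_dropping_carry(sample)       # 0xFFFE
print(hex(good), hex(bad), bad - good)
```

Data that never overflows 16 bits gives the same answer from both functions, which would fit a bug that only bites on some packets. The direction of the error in this sketch is not meant to match the capture.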
The problem is, my PC isn't running a MIPS processor, but an AMD. However, I am running AMD64 which means it could be the exact same problem (32/64 bit conversions). My kernel has been compiled as AMD64 (K7?)...I will know for sure if this is the problem when I recompile it using the same options but w/o the 64'bitness'...I'm hoping this is it and perhaps a bug report can be filed by people who know how to.
Otherwise, does anyone have any other suggestions on what the problem could be?
Last edited by debuser123; 07-11-2008 at 01:18 PM.
Looks like you're on the right track. If you've ruled out an offloading problem (which has been known to occur with certain gigabit adapters), and you've ruled out a NIC going bad or a poor driver, you're going where you need to go.
Please, though. . .do post what happens when you compile your kernel for generic 32-bit support. By the by, AMD64 is K8, just for future reference. K7 was up to the XP series of Athlons.
Looks like I was mistaken; my kernel (126.96.36.199) was not compiled as Athlon64(K8) but as a K7...so 64 bitness is probably not the problem.
However, a kernel that has never given me a problem was 2.4.27 and it looks like that was compiled using 386 as the processor type. The bad thing is that you can't really compare 2.4 and 2.6 kernels to figure out a problem, however I can compile as 386.
So, what I will do is this:
#1) use K8 (the processor I actually have) and see what happens
#2) use 386 (the end-all in compatibility) and see what happens
The reason I doubt it's offloading is that the checksums are only off by one (on a computer I know that does use offloading, the checksums differed by huge amounts). My NIC is your generic onboard 100 Mbit VIA Rhine (VT6102, Rhine-II), which is another reason why offloading probably isn't it: it's a "slow" card and wouldn't have much use for offloading.
I don't know about a bad NIC because I can reboot and not have a problem at all (though rebooting might reset the NIC and put it in working state).
Funny thing is I don't even need a 64 bit processor...when it was 64bit vs. dual core, I chose wrongly.
PS: Is there a way to, of sorts, reset the TCP stack without rebooting? ifconfig ups/downs, dhcp lease renewals don't fix the problem.
Last edited by debuser123; 02-11-2008 at 05:33 PM.
Well, restarting your network card should clear the TCP/IP stack, as far as I know.
What I can tell you is that when NICs die, don't be surprised about anything. Some NICs silently go into the night, and just up and die. Some NICs go rather violently and storm a network to death, then die. Some NICs will cause the computer to lock as they die. Some NICs just start sending out jumbo frames in a network that can't accept them. And some NICs will appear to be on, have a connection, and show nothing. I just dealt with an onboard like that at work. The switch detected a connection, the NIC light was on, the switch even had packets trickling in and out. But no traffic actually went to and from the box.
If you've got an el-cheapo NIC, slap it in, see if it works. That'll tell you if it's your kernel or not. But compiling your kernel for an earlier generation of processor is okay. K8's will run anything equal to or less than a K8 kernel on the AMD side. P4's will run anything equal to or less than a P4 kernel on the Intel side.
I've always assumed optimizations for an X (K7) may not work that well on a Y (K8) even if the Y is backwards compatible. But with K8 as the processor type, I still had the same problem.
A little bit more info:
1) Only packets that have been received generate a TCP checksum mismatch. Wireshark says the checksum should be one greater than what was received. That rules out offloading, since offloading only makes checksums look wrong in a capture for packets being transmitted.
2) Internal (TCP) LAN traffic whether received or transmitted does not have any checksum mismatches.
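For anyone who wants to double-check what Wireshark reports, the TCP checksum can be recomputed by hand. The sketch below assumes you already have the raw bytes of one IPv4 packet (Ethernet header stripped); `tcp_checksums` and `ones_complement_sum` are illustrative names, not part of any library, and getting the bytes out of a capture file is left out.

```python
import struct
from typing import Tuple


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksums(ip_packet: bytes) -> Tuple[int, int]:
    """Return (reported, expected) TCP checksums for a raw IPv4 packet."""
    ihl = (ip_packet[0] & 0x0F) * 4                     # IP header length in bytes
    total_len = struct.unpack("!H", ip_packet[2:4])[0]  # IP total length
    if ip_packet[9] != 6:
        raise ValueError("not a TCP packet")
    src, dst = ip_packet[12:16], ip_packet[16:20]
    segment = bytearray(ip_packet[ihl:total_len])
    reported = struct.unpack("!H", segment[16:18])[0]
    segment[16:18] = b"\x00\x00"                        # zero the field before summing
    # Pseudo-header: source, destination, zero byte, protocol 6, TCP length.
    pseudo = src + dst + struct.pack("!BBH", 0, 6, len(segment))
    expected = ~ones_complement_sum(pseudo + bytes(segment)) & 0xFFFF
    return reported, expected
```

If `reported` and `expected` differ by one for packets from the internet but match for LAN traffic, that reproduces the pattern described above independently of Wireshark.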
I guess another thing I could try is compiling the NIC driver as a module and then unloading / reloading the module when I start getting errors.
This issue isn't a big problem because I don't always waste electricity by leaving it on, but sometimes I leave it on when I know I might need to ssh into it & grab some files.
This is pretty odd but I figured out the "problem" which does not require me to reboot:
Once I start experiencing a loss of internet access out of nowhere (e.g., can't connect to web sites, can't ping, DNS lookups don't work [e.g., FF sits on "Looking up host whatever.com..." in the statusbar]), what fixed it was to restart X.
I noticed it because I recently put the system monitor applet on the panel (which shows current cpu usage). Well, I noticed that once it looked like my net access was out, that applet was about 50% blue (meaning about 50% of my cpu was being used)...but I wasn't doing anything (at least in the foreground). So I clicked it and noticed that process Xorg was taking up about 60% of the cpu.
What I did next, I don't know why I never tried before, but I switched to tty1 and opened up google in links...voila, came right up. Went back to X, nope, no go. Then I reduced Xorg's priority to 19 (while in X) which was kind of dumb 'cause then my mouse was useless as any clicking didn't register. So restart X, and bam, everything's back to normal.
What this makes me think is that there's an issue between the kernel, my motherboard (onboard LAN), and display (GeForce 256). I guess this is now less of a networking issue, but my uname and lspci output are below. Still on kernel 188.8.131.52 (which I built a while ago [too lazy to change the default in grub]), but I still experienced the issue on version 2.6.25.
$ uname -a
Linux thepc 184.108.40.206 #2 PREEMPT Sun Dec 23 01:05:27 CST 2007 i686 GNU/Linux
$ lspci
00:00.0 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.1 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.2 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.3 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.4 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:00.7 Host bridge: VIA Technologies, Inc. K8T800Pro Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:0b.0 Multimedia audio controller: Creative Labs SB0400 Audigy2 Value
00:0f.0 IDE interface: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 78)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: nVidia Corporation NV10 [GeForce 256 SDR] (rev 10)
Then again, maybe that wasn't the fix or what I experienced earlier wasn't "it". Today it happened again and I reset X...but TCP connections still wouldn't go through. I can ping and do udp/icmp stuff, it's just TCP that has a problem. I give up.........
I was having issues with BOTH my AMD boxes and their integrated NICs (I believe they also have VIA chipsets). They were doing similar things to what you are describing, with really SLOOOWWW transfer rates. I threw an Intel NIC into each box, disabled the onboard NIC, and all my network problems disappeared on those two machines. I personally think it's a hardware issue.
I agree it's probably a hardware issue, but I never experienced this on any of the 2.4 kernels with the same hardware. I'm attaching a wireshark/tcpdump/pcap dump showing how it takes almost a full 2 minutes just to connect to google.com with the links text-mode browser. You can see that for about the first minute it's just filled with "TCP checksum incorrect" errors. After that it seems fine. Wireshark says the TCP checksums are off by exactly one: for example, it says the reported checksum was 0x1234 but it should've been 0x1235.
1. I don't have any problems when I'm running a local server and I connect with the lo or eth0 ip address.
2. Plugging my laptop into my router and trying to ssh into my computer is successful.
3. Starting an ssh session from my computer to my laptop is also successful.
So it seems that when this problem arises, connections to localhost servers and servers on my LAN still go through without error. The internet just doesn't like me.
I couldn't attach a file so I uploaded the wireshark packet dump to mediafire.com:
I've used 2.6 kernels for about a year. Prior to that was the default Debian Sarge kernel 2.4.something. I guess I could set a 2.4 as my default. I just never remember having this problem while on a 2.4 kernel, but who knows, maybe I did but didn't realize it.
I really just would like more debugging info in my kernel pertaining to the TCP/IP stack. Anyone know how I could get that?
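One low-effort source of TCP/IP counters that needs no kernel rebuild is /proc/net/snmp. Here is a rough Python sketch that parses it; `parse_proc_net_snmp` and `read_tcp_counters` are made-up helper names, and it assumes your kernel exposes the usual `Tcp:` counter names such as `InSegs`, `InErrs` and `RetransSegs`. `InErrs` counts segments received in error, which includes bad TCP checksums.

```python
def parse_proc_net_snmp(text: str) -> dict:
    """Parse /proc/net/snmp text into {protocol: {counter_name: value}}."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    stats = {}
    # The file comes in pairs: a line of counter names, then a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        proto = names[0].rstrip(":")
        stats[proto] = dict(zip(names[1:], (int(v) for v in values[1:])))
    return stats


def read_tcp_counters(path: str = "/proc/net/snmp") -> dict:
    """Read the live TCP counters from the given file (Linux only)."""
    with open(path) as f:
        return parse_proc_net_snmp(f.read())["Tcp"]
```

Calling `read_tcp_counters()` once while things work and again while connections stall, then comparing `InErrs` against `InSegs`, should show whether the kernel itself is throwing the segments away.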
I'm not that big on upgrading...my video card (nvidia geforce 256) gets talked about enough [and has its own bundle of problems....the infamous Xid lockups with nvidia's closed-source drivers]. If my NIC is bad, I'd really like to figure out why.