LinuxQuestions.org
12-04-2015, 03:39 PM   #1
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Rep: Reputation: Disabled
NFS Server Networking Issue


Hello,

My team is having a problem with a Linux NFS Server, but I believe the underlying problem is a low-level network problem. The Linux NFS Server is a black box to my team: we are not able to log into the server to review logs or the state of the server, and the vendor is no longer available to help. I can describe the hardware, and the behavior of the Linux NFS Server as seen in the network traffic when the problem occurred.

The Linux NFS Server Hardware and Configuration:
1) Red Hat release running 64-bit kernel version 2.6.18 on an Intel i7 Single Board Computer (SBC).
2) The SBC has a Concurrent Technologies XM 510/X24-RC daughter card
a. Four Gigabit Ethernet NICs.
3) The four Gigabit NICs are bonded to a single IP Address
4) The bonding mode is Adaptive Load Balancing
5) The bonded Interface has the following features enabled:
a. RX Checksum
b. TX Checksum
c. Scatter and Gather
d. TCP Segmentation Offloading (TSO)
6) The NFS Server has Jumbo Frames enabled. I believe the MTU size is 8000 bytes.
7) The four Gigabit NICs are split between two switches. The two switches are trunked together by four 10 Gigabit interfaces.
8) There are 8 NFS Clients accessing the NFS Server.
a. The NFS Clients are using TCP
b. The NFS Clients' MTU size is 1500 bytes.
c. The NFS Clients have two mount points each.

The problem we are seeing is that the Linux NFS Server stops responding to one NFS Client during start-up of the system. After the next power cycle, the system usually starts without any issues. The problem can occur with any of the eight NFS Clients.

After operating normally for approximately 3 minutes, we see the following network behavior:
1) An NFS Client issues an NFS Request
2) The NFS Server issues a TCP ACK, but no NFS Response is sent.
3) The NFS Server issues an ARP Request for that NFS Client
a. The ARP Request is not a broadcast but contains the MAC address of the NFS Client in the Ethernet Header.
b. The NFS Client sends an ARP Response with the correct MAC Address
4) The NFS Client issues some additional NFS Requests.
a. The NFS Server sends a TCP ACK, but no NFS Response is sent.
5) The NFS Client's timeout is the default, 60 seconds.
6) The NFS Client retransmits the first NFS Request that received no response.
7) The NFS Server responds the same. A TCP ACK is issued, but no NFS Response is issued.
8) The NFS Server issues an ARP Request for that NFS Client again
a. The ARP Request is not a broadcast but contains the MAC address of the NFS Client in the Ethernet Header.
b. The NFS Client sends an ARP Response with the correct MAC Address
9) The NFS Client starts to retransmit the other NFS Requests with the same results. The NFS Server issues a TCP ACK but never sends an NFS Response.
10) The NFS Server continues to issue ARP Requests for the NFS Client that is having problems, but there seems to be no pattern to the additional ARP Requests.
11) When this problem is happening, the other 7 NFS Clients are working without any issues.
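
The pattern in the list above (requests that are ACKed at the TCP level but never answered at the RPC level) can be picked out of a capture mechanically. The following is only a minimal sketch, assuming the capture has already been decoded into simple records. The `RpcRecord` type, its field names, and the sample addresses are hypothetical and are not taken from the actual capture. It relies on the fact that an RPC retransmission reuses the transaction ID (XID) of the original call.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class RpcRecord:
    """One decoded RPC message (hypothetical, pre-parsed from a capture)."""
    time: float      # seconds since the start of the capture
    client: str      # client IP address
    xid: int         # RPC transaction ID; a retransmission reuses it
    is_reply: bool   # False for a call (request), True for a reply


def unanswered_calls(records: List[RpcRecord]) -> Dict[str, Dict[int, int]]:
    """Return {client: {xid: times_sent}} for calls that never got a reply."""
    sent: Dict[Tuple[str, int], int] = {}
    answered: Set[Tuple[str, int]] = set()
    for rec in records:
        key = (rec.client, rec.xid)
        if rec.is_reply:
            answered.add(key)
        else:
            # Count the original call plus every retransmission.
            sent[key] = sent.get(key, 0) + 1

    result: Dict[str, Dict[int, int]] = {}
    for (client, xid), count in sent.items():
        if (client, xid) not in answered:
            result.setdefault(client, {})[xid] = count
    return result


if __name__ == "__main__":
    # Made-up sample: one healthy client, one client whose calls go unanswered.
    sample = [
        RpcRecord(0.0, "192.0.2.11", 100, False),
        RpcRecord(0.1, "192.0.2.11", 100, True),
        RpcRecord(1.0, "192.0.2.12", 200, False),
        RpcRecord(61.0, "192.0.2.12", 200, False),  # retransmit after timeout
    ]
    print(unanswered_calls(sample))
```

A client that shows up in the result with a send count above 1 matches steps 6 and 9: the same request was retransmitted and still received no NFS Response.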

The NFS Server also shows another strange behavior that is not fatal. It closes almost all of the NFS Client connections during start-up. The affected NFS Clients establish a new TCP connection and continue without any problems. The connection of the NFS Client that encountered the failure was not closed by the NFS Server. However, the connections of some other NFS Clients that functioned normally were not closed either.

Any help will be greatly appreciated,
Chuck

Last edited by cwwanner; 12-04-2015 at 04:33 PM.
 
12-04-2015, 11:23 PM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,162

Rep: Reputation: 1268
Jumbo frames are usually 9000 bytes (though not always). Both endpoints need to be set to the same MTU.
 
12-05-2015, 01:03 PM   #3
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Original Poster
Rep: Reputation: Disabled
With TCP, both endpoints do not have to have the MTU set to the same size. The initial TCP three-way handshake determines the maximum segment size (MSS). Each endpoint advertises an MSS derived from its MTU minus the IP and TCP header overhead, and the connection is limited by the smaller of the two values. The NFS Server's TCP connection will therefore use the NFS Client's MSS value.
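
As a worked example of that arithmetic, here is a small sketch. It assumes IPv4 and TCP headers of 20 bytes each with no options, and it uses the client MTU of 1500 from the first post and the server MTU of 9000 that is confirmed later in this thread. The function names are made up for illustration.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
TCP_HEADER_BYTES = 20   # assumption: TCP header without options


def advertised_mss(mtu: int) -> int:
    """MSS an endpoint would advertise in its SYN for a given interface MTU."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES


def max_segment(sender_mtu: int, receiver_mtu: int) -> int:
    """Largest TCP payload the sender can use toward the receiver."""
    return min(advertised_mss(sender_mtu), advertised_mss(receiver_mtu))


if __name__ == "__main__":
    server_mtu = 9000  # jumbo frames on the NFS Server
    client_mtu = 1500  # standard Ethernet MTU on the NFS Clients
    print("server advertises:", advertised_mss(server_mtu))
    print("client advertises:", advertised_mss(client_mtu))
    print("server -> client segments limited to:",
          max_segment(server_mtu, client_mtu))
```

Under these assumptions the server's segments toward a client are limited to 1460 bytes of payload, which fits in a 1500-byte frame even though the server's own MTU is larger.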

On Monday, I will double check the NFS Server's MTU Size.

Thank You,
Chuck

Last edited by cwwanner; 12-05-2015 at 01:06 PM.
 
12-05-2015, 08:06 PM   #4
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
Not sure how you test this... But make sure that all bonded networks are bonded throughout the net. This almost sounds like the clients aren't getting/sending the packets. With short packets they go out only one of the bonded set - and if that particular link doesn't get through the packet gets lost - and it would look as if it were never sent.
 
12-08-2015, 01:11 PM   #5
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Original Poster
Rep: Reputation: Disabled
jpollard,

I am not sure what is meant by "make sure that all bonded networks are bonded throughout the net". All four NICs are transmitting and receiving network traffic, and all four are in use when this problem occurs.

When bonding the four NICs with Adaptive Load Balancing, all four NICs are being used. If a NIC is disconnected, the bonded set will attempt to balance the traffic on the other three NICs. The traffic is not duplicated on the bonded set. What do you mean by "With short packets they go out only one of the bonded set"?

I am sorry for not understanding; this is my first experience with Linux bonding.

Thank You,
Chuck
 
12-08-2015, 06:03 PM   #6
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
The four connections are physically separate networks. Routers between all such connections must be symmetrical (all networks must be connected the same, with same routing so that all endpoints see the same thing).

If a router misses one, then packets will get lost.

I wasn't the one setting things up, but this was one of the symptoms at work - one of the other admins found out a router was missed... Most of the connections worked just fine, just not all of them, and those were on the other side of a router that missed a connection.

What happens is that data being passed is broken up into separate packets - each going out a different physical network in parallel. When one of those packets gets lost, the entire thing either fails... or performs really poorly.

Last edited by jpollard; 12-08-2015 at 06:06 PM.
 
12-08-2015, 07:03 PM   #7
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Original Poster
Rep: Reputation: Disabled
The SBCs are connected to one switch, which we will call Switch X. The NFS Server has two of its four NICs connected directly to Switch X and the other two connected directly to another switch, which we will call Switch Y. Switches X and Y are directly connected by four 10 Gbit Ethernet lines.

Switches X and Y are layer 3 switches with multiple VLANs configured. However, the SBCs and the NFS Server are within the same VLAN (subnet).

None of the network traffic between the SBCs and the NFS Server is handled by any layer 3 routing protocol.

When we captured the network traffic, we had four taps configured, one on each of the NFS Server's Gigabit NIC ports. All the traffic described was captured on the NFS Server's physical Ethernet lines. I believe no traffic was being lost.
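
With one tap per NIC, traffic for a single client may be spread over more than one capture, so it helps to view all four as a single timeline. The sketch below shows one way to do that, assuming each tap's packets are already decoded, sorted by time, and stamped by synchronized clocks. The `Packet` tuple layout and the tap names are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical record layout: (timestamp in seconds, tap name, summary text)
Packet = Tuple[float, str, str]


def merge_taps(*taps: Iterable[Packet]) -> Iterator[Packet]:
    """Merge per-tap packet streams into one stream ordered by timestamp.

    Each input must already be sorted by timestamp, and the tap clocks are
    assumed to be synchronized; otherwise the merged order is not meaningful.
    """
    return heapq.merge(*taps, key=lambda pkt: pkt[0])


if __name__ == "__main__":
    # Made-up sample data for two of the four taps.
    tap1 = [(10.000, "tap1", "NFS call from client"),
            (70.000, "tap1", "NFS call retransmitted")]
    tap2 = [(10.001, "tap2", "TCP ACK from server"),
            (10.500, "tap2", "ARP request from server")]
    for timestamp, tap, summary in merge_taps(tap1, tap2):
        print(f"{timestamp:8.3f}  {tap}  {summary}")
```

Reading the merged timeline makes it easier to confirm that an ACK seen on one NIC belongs to a request seen on another, and that no NFS Response left the server on any of the four ports.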
 
12-08-2015, 07:04 PM   #8
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Original Poster
Rep: Reputation: Disabled
Smallpond,

I was able to verify the NFS Server's MTU size. I was wrong; the MTU size is 9000 bytes.

Regards,
Chuck
 
12-09-2015, 11:50 AM   #9
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Original Poster
Rep: Reputation: Disabled
We did perform an experiment. We disconnected 3 of the 4 Ethernet lines from the NFS Server and performed 20 power cycles of the system. The problem did not occur.
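
How strongly 20 clean power cycles point at bonding depends on how often the failure normally occurs, which this thread does not state. The sketch below shows the arithmetic only. It assumes each power cycle fails independently with the same probability, and the failure rates in the example are illustrative guesses, not measured values.

```python
def prob_no_failures(p_fail: float, cycles: int) -> float:
    """Probability of seeing zero failures in `cycles` independent boots."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must not be negative")
    return (1.0 - p_fail) ** cycles


if __name__ == "__main__":
    # Hypothetical per-boot failure rates; the real rate is not given here.
    for p_fail in (0.05, 0.10, 0.25):
        chance = prob_no_failures(p_fail, 20)
        print(f"failure rate {p_fail:.0%}: "
              f"chance of 20 clean cycles anyway = {chance:.1%}")
```

For example, if the failure normally showed up in about 1 of every 10 boots, 20 clean cycles would still happen by chance roughly 12% of the time, so more cycles or a higher baseline failure rate would make the result more convincing.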

Does anyone know about a problem with TSO, Bonding, and the Linux network stack?

Thank You,
Chuck
 
12-09-2015, 12:17 PM   #10
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
Quote:
Originally Posted by cwwanner
We did perform an experiment. We disconnected 3 of the 4 Ethernet lines from the NFS Server and performed 20 power cycles of the system. The problem did not occur.

Does anyone know about a problem with TSO, Bonding, and the Linux network stack?

Thank You,
Chuck
Definitely sounds like some data is being dropped somewhere.

Try adding one link at a time. It may identify when the fault occurs, and isolate it to one link. When the error occurs, disconnect that specific link, then add another. If it ONLY occurs with that single link, it is then outside the server. If it happens when any two links are present, then the problem is inside the server. Who knows, it may even be a specific interface on the server that is flaky, or a specific cable.

BTW, the NFS server closing the connections is deliberate - it forces the clients to reconnect and resync possible buffer handling so that data doesn't get lost.

Last edited by jpollard; 12-09-2015 at 12:18 PM.
 
12-09-2015, 04:50 PM   #11
cwwanner
LQ Newbie
 
Registered: Dec 2015
Posts: 11

Original Poster
Rep: Reputation: Disabled
The problem happens on 3 of 3 systems, but not in our lab environment.

We are using TCP for the NFS client. If data were lost outside the server, TCP would cause a retransmission. But the NFS Server sends a TCP ACK, so its network stack did receive the NFS Request. The retransmission of the NFS Request is performed by the NFS client, not by TCP.

We are trying to put together a plan to perform those tests in the field; it will just take some time.

Is there documentation describing the reason for the NFS Server closing connections that you can point me to? I was looking online, but have not found any documentation on that subject.

Thank You,
Chuck
 
12-09-2015, 08:23 PM   #12
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
I believe this should cover it:

https://docs.oracle.com/cd/E19120-01...138/index.html

Note: some clients using NFSv4 will lose their locks, and sometimes have to restart the application to re-acquire the locks.
 