RDMA (Remote Direct Memory Access) connection statistics

Posted 12-02-2009 at 12:36 AM by vkmgeek

Lead:
Every desktop or server Linux distribution ships a rich set of tools
for troubleshooting network connections. However, those tools are of
little use for iWARP [http://en.wikipedia.org/wiki/IWARP] devices,
because no kernel interface exposes information about iWARP
connections. iWARP is also known as RDMA over TCP/IP, with the TCP
stack offloaded to hardware. The fact that an iWARP device supports
both native TCP connections and iWARP connections on the same
interface makes things more complex.
Recently, Moni Shoua submitted a patch that provides such a kernel
interface via debugfs. A user application can read a file under the
debugfs mount that lists iWARP connection statistics, much like
/proc/net/tcp does for native TCP connections.

Problem:
First, let us look at one instance of the problem that network
administrators face when they use iWARP adapters.

Imagine a cluster of machines where each node has an iWARP NIC. The
network administrator runs applications such as ssh, ftp or telnet to
connect the cluster to other machines on the LAN. These applications
use the native TCP stack. At the same time, high-performance message
passing applications run within the cluster for internode
communication (e.g., MPI - http://en.wikipedia.org/wiki/Message_Passing_Interface).
Those applications use the iWARP stack.

Now suppose the administrator hits a problem: an application cannot
bind to a particular port, or misbehaves on a particular port, say
port 4567. He lists all active connections with a netstat-like
command, but netstat shows no connection bound to port 4567. The
reason is that an iWARP connection is already using port 4567, and
netstat has no way to know about it. Any packet that arrives for
port 4567 then bounces between the native TCP stack and the
hardware's TCP stack. This is known as the "port sharing problem",
and the kernel does not yet have a solution for it. Steve proposed
one [http://lkml.org/lkml/2007/8/7/208], but the networking
developers firmly rejected it.
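
To see why netstat comes up empty, recall that the native stack lists its sockets in /proc/net/tcp, where each row carries the local address and port in hexadecimal. The sketch below is only an illustration of that lookup, not netstat's actual code; the helper names proc_tcp_line_has_local_port and native_stack_uses_port are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return 1 if one row of /proc/net/tcp has the given local port, else 0.
 * Rows look like:
 *   "   0: 0100007F:11D7 00000000:0000 0A ..."
 * with the IPv4 address and the port in hexadecimal. The header row and
 * malformed rows yield 0. (Hypothetical helper, for illustration only.)
 */
int proc_tcp_line_has_local_port(const char *line, unsigned int port)
{
    unsigned int slot, local_addr, local_port;

    assert(line != NULL);
    if (sscanf(line, " %u: %x:%x", &slot, &local_addr, &local_port) != 3)
        return 0;
    return local_port == port;
}

/*
 * Scan a /proc/net/tcp-style file. Returns 1 if any row uses the port,
 * 0 if none does, and -1 if the file cannot be opened.
 */
int native_stack_uses_port(const char *path, unsigned int port)
{
    char line[512];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (!found && fgets(line, sizeof(line), fp) != NULL)
        found = proc_tcp_line_has_local_port(line, port);
    fclose(fp);
    return found;
}
```

A port held only by an offloaded iWARP connection has no row in this file, so a call such as native_stack_uses_port("/proc/net/tcp", 4567) would report 0 in the scenario above even though the port is taken.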

Kernel patch that solves the problem:
However, there is a way to work around this problem. If the kernel
provides an interface that reports statistics about iWARP connections,
the network administrator at least has the information needed to
work around the actual problem.

Everyone involved in RDMA (InfiniBand/iWARP), whether as an
implementer or as a user, agrees that there should be a tool or kernel
interface that shows information about active iWARP connections.
Steve Wise said (http://www.mail-archive.com/general@.../msg23880.html)

"
I think we definitely need something like this, and an rdmastat
command to do useful formatting...
"

Moni Shoua proposed a patch
(http://www.mail-archive.com/general@.../msg23868.html)
to Roland. The InfiniBand core code already has all the necessary
information about iWARP connections: the cma_device data structure
defined in cma.c holds the connection information for all iWARP devices.
Moni's initial approach was to traverse that list and add the statistics to an rdma_cm proc file.
Any user-space tool could then read the proc file, format the output and display it.

The approach was sound, but other people on the list did not agree
on where the debug file should live. Roland simply said
"
Umm, no... /proc is not the place for this, since it has nothing to do
with processes.
debugfs is probably the simplest place to put this info, unless you
want to do something like netlink + a userspace tool.
"
Putting the debug info under /proc/net was also discussed, but the
netdev people never responded, so that idea was dropped too. Steve Wise
suggested exposing each connection under /sys/class/infiniband/rdma_cm/*.
However, maintaining one sysfs file per connection carries an obvious
and large overhead.

In the end, the conclusion was to put it under debugfs. The patch is here:
http://www.mail-archive.com/general@.../msg24167.html

The patch creates an "rdma_cm" debugfs file that has cma_dev passed as its private data.

cma_dev->rdma_id_dentry = debugfs_create_file(name, S_IFREG | S_IRUGO,
					      cma_root_dentry, cma_dev,

Later, when the file is read, the code retrieves the cma_dev structure
and produces the sequence file contents by traversing its id_list.
static void *cma_rdma_id_seq_start(struct seq_file *file, loff_t *pos)
{
	struct cma_device *cma_dev = file->private;
	...
	...
	ret = seq_list_start_head(&cma_dev->id_list, *pos);
	return ret;
}

The iteration advances to each following entry through the cma_rdma_id_seq_next() function.
static void *cma_rdma_id_seq_next(struct seq_file *file, void *v, loff_t *pos)
{
	...
	if (v == SEQ_START_TOKEN) {
		++*pos;
		...
	} else {
		ret = seq_list_next(v, &cma_dev->id_list, pos);
	}
	return ret;
}

Finally, the cma_rdma_id_seq_show() function formats the output, one
row per connection. Each row lists the network device name, the source
IP address and port, the destination IP address and port, the port
space and the connection state. It also reports the QP (queue pair)
number, which is more useful to an iWARP user than to an ordinary
network administrator.
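
The way the start, next and show callbacks cooperate can be sketched in user space. The code below is an analogue only, under loud assumptions: struct conn and the conn_seq_* functions are hypothetical stand-ins, the real kernel callbacks take a struct seq_file and are paired with a stop callback, and nothing here is taken from the elided parts of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one tracked connection in a linked list. */
struct conn {
    const char *src;
    const char *dst;
    struct conn *next;
};

/* start: return the element at position *pos, or NULL past the end. */
static struct conn *conn_seq_start(struct conn *head, long *pos)
{
    struct conn *c = head;
    long i;

    for (i = 0; c != NULL && i < *pos; i++)
        c = c->next;
    return c;
}

/* next: advance the position and return the following element. */
static struct conn *conn_seq_next(struct conn *c, long *pos)
{
    ++*pos;
    return c->next;
}

/* show: format one element into buf as a single output row. */
static int conn_seq_show(const struct conn *c, char *buf, size_t len)
{
    return snprintf(buf, len, "%s %s\n", c->src, c->dst);
}

/*
 * Driver loop in the style of a sequence-file read: call start once,
 * then show and next until the iterator returns NULL.
 * Returns the number of rows written to out.
 */
int conn_seq_dump(struct conn *head, long pos, FILE *out)
{
    char row[128];
    int rows = 0;
    struct conn *c;

    assert(out != NULL);
    for (c = conn_seq_start(head, &pos); c != NULL;
         c = conn_seq_next(c, &pos)) {
        conn_seq_show(c, row, sizeof(row));
        fputs(row, out);
        rows++;
    }
    return rows;
}
```

Keeping the position separate from the element pointer is what lets a reader resume part-way through the list, which is why start takes a position rather than always beginning at the head.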

There were suggestions about the formatting and the way the
information is displayed, so Moni submitted a V2 patch
(http://lists.openfabrics.org/piperma...il/059109.html).
Here is an example of the output of cat /sys/kernel/debug/rdma_cm/*:

TYPE DEVICE PORT NET_DEV SRC_ADDR           DST_ADDR           SPACE STATE   QP_NUM
     mthca0 0            0.0.0.0:7174                          TCP   LISTEN  0
IB   mthca0 1    ib0     192.30.3.249:46079 192.30.3.248:7174  TCP   CONNECT 132102
IB   mthca0 1    ib0     192.30.3.249:7174  192.30.3.248:42561 TCP   CONNECT 132103
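
A user-space tool could search output like the rows above for a troublesome port. The following is a minimal sketch that assumes the V2 row layout shown here, which was never finalized; the function name rdma_cm_line_uses_port is hypothetical. It matches the port in either address column, so a caller that cares only about local ports would need to inspect SRC_ADDR specifically.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return 1 if any "address:port" token in one row of the rdma_cm debugfs
 * output carries the given port, else 0. Tokens are separated by
 * whitespace; the port is the decimal number after the last ':' in a
 * token. Columns without a ':' (device, state, QP number) never match.
 */
int rdma_cm_line_uses_port(const char *line, unsigned long port)
{
    const char *p = line;

    assert(line != NULL);
    while (*p != '\0') {
        const char *colon = NULL;
        size_t len, i;

        p += strspn(p, " \t\n");      /* skip separators */
        len = strcspn(p, " \t\n");    /* length of the current token */
        for (i = 0; i < len; i++)
            if (p[i] == ':')
                colon = p + i;        /* remember the last ':' */
        if (colon != NULL && colon + 1 < p + len) {
            char *end;
            unsigned long val = strtoul(colon + 1, &end, 10);

            /* accept only if digits run to the end of the token */
            if (end == p + len && val == port)
                return 1;
        }
        p += len;
    }
    return 0;
}
```

Feeding each line of the debugfs file to this function with port 4567 would point the administrator at the iWARP connection that netstat cannot see.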

However, not everyone agreed with the V2 patch either. The output is not as informative as netstat's.
It also shows the 0.0.0.0 address attached to the mthca0 device, which is confusing because a 0.0.0.0 address applies to all devices.
And unlike netstat, it does not show the process ID, group ID and other important details.

Jason said, (http://lists.openfabrics.org/piperma...ay/059410.html)
"
Reall the thinking should be 'I want to make lsof work usefully' not
'I want some random and different hack to let me see something'. And
yes, that is harder. But the IB stack is now at the point where these
small hard things are the sort of work that is needed to get parity
with the other stuff in linux..
"

Moni agreed with the suggestions, and the V3 patch is expected to implement them.

Conclusion:
The patch has not been accepted yet and definitely needs more work.
Once all the required changes are implemented, it will most likely be
mainlined. Network administrators will then have one more place to
look when debugging the problem described above, and netstat-like
tools can be extended to show iWARP connection statistics.


===========================================
This is based on my understanding. If you have any suggestions or corrections, please email me at viral.vkm@gmail.com