how to read and write ethernet "link layer" packets

maxreason · 08-18-2008, 09:30 AM

How can a Linux program read and write raw ethernet "link layer" packets? These are the simplest packets possible: they contain the actual data transmitted over the ethernet wires. In essence, these raw ethernet packets contain:

0xAA * 7 bytes = preamble
0xAB * 1 byte  = start frame delimiter
0xHH * 6 bytes = target/destination MAC address
0xHH * 6 bytes = source/sender MAC address
0xHH * 2 bytes = packet length in bytes
0xHH * n bytes = data
0xHH * 4 bytes = CRC32
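For illustration, here is a minimal Python sketch of the software-visible part of this layout (destination, source, length/type, data), plus the CRC32 computed the way Ethernet does. The function names are hypothetical, and in practice the preamble, start-of-frame byte, and CRC are normally generated by the MAC/PHY hardware rather than by the program.

```python
import struct
import zlib


def build_frame(dst: bytes, src: bytes, length_or_type: int, payload: bytes) -> bytes:
    """Return dst + src + 2-byte length/type + payload (no preamble, SFD, or CRC)."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses must be exactly 6 bytes")
    # The 16-bit length/type field is sent most-significant byte first.
    return dst + src + struct.pack("!H", length_or_type) + payload


def fcs(frame: bytes) -> bytes:
    """CRC32 over dst..payload, as the 4 bytes in the order they are transmitted."""
    # Ethernet uses the same CRC-32 as zlib; the result is sent low byte first.
    return struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def parse_frame(frame: bytes):
    """Split a received frame into (dst, src, length_or_type, payload)."""
    dst, src, length_or_type = struct.unpack("!6s6sH", frame[:14])
    return dst, src, length_or_type, frame[14:]
```

A frame built this way is what a packet socket hands to, or receives from, the driver; the 4-byte CRC is shown only so it can be compared against what the FPGA computes.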

What many applications consider the "low level packet" has already been processed. For example, the 6-byte MAC addresses converted-to and replaced-by 4-byte IP addresses, a "port number" being taken from the data field, the "packet length" being interpreted as a "protocol identifier" if the length is one of the special recognized lengths (which are typically 0x8xxx == longer than maximum acceptable "jumbo" packet length), etc.
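The length-versus-protocol rule can be made concrete. Under IEEE 802.3, a value of 1500 or less in that 2-byte field is a payload length, and a value of 0x0600 (1536) or more is an EtherType (for example 0x0800 for IPv4, 0x0806 for ARP). A small sketch, with a hypothetical function name:

```python
def classify_length_type(value: int) -> str:
    """Classify the 16-bit field that follows the source MAC address."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("field is 16 bits wide")
    if value <= 1500:
        return "length"      # classic 802.3 length field
    if value >= 0x0600:
        return "ethertype"   # protocol identifier, e.g. 0x0800 = IPv4
    return "undefined"       # 1501..1535 is neither


# A 5000-byte row cannot be described by the field as a length:
# 5000 == 0x1388, which falls in the EtherType range.
```

One consequence for large frames: a receiver following this rule will read "5000" as a protocol identifier, so jumbo-sized frames usually carry an EtherType in this field and take their length from the frame boundaries.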

That's my question. To stop at least some programmers from telling me no legitimate reason exists to work at this level, I provide the following justification.

-----

We are developing ultra-cheap, multipurpose 5 and 8 megapixel cameras. The primary applications are: cinema-HDTV = 24:10 aspect ratio @ 24 FPS, coordinated multicamera robotics and vision systems, multicamera security systems, etc.

To keep hardware costs and power consumption within reason, the controller card is driven by inexpensive parts: one moderate-performance FPGA (EP3C5F256), two gigabit ethernet PHYs (88E1111s), and one wimpy 100MHz C8051F120 8-bit CPU. The CPU does not run fast enough to perform per-packet processing. The FPGA logic that controls the CCDs, reads data from the CCDs, performs lossless data-compression of every packet, computes CRC32 for every packet, and most everything else - is already more complex than I enjoy (and consumes ~every I/O pin on the FPGA). But hey, that's what it takes to maximize performance versus cost. To add the complexity of higher-level packet creation, packet processing, extra algorithm development (ARP, etc) seems quite unwise (and possibly impractical). Oh, each CCD/camera is on a tiny, inexpensive PCB, and the controller PCB can connect to and control one to four of the CCD/cameras (for multicamera applications).

But also, we are already pushing the bandwidth of gigabit ethernet dangerously close to its limit. We need to transfer close to 900 megabits of data per second over the gigabit ethernet connection. Each packet will be up to 5KB (one row of CCD pixels per packet) to maximize processing efficiency and minimize header overhead. But we must assure the PC can *reliably* consume the data at full speed for at least a few minutes at a time *without failure*. It's okay to demand a fast multicore CPU, but it *must work*. Therefore, we want to minimize packet processing in the PC, which is completely pointless anyway, since we intend to require a direct connection between our camera and a single (or dual) gigabit ethernet controller in the PC. In other words, no hub/router/etc is permitted between the camera and PC (for high-speed applications, like "cinema-HDTV" video). As a consequence, the only place that specific gigabit ethernet port can receive packets from is the camera.

No doubt some programmers will think this is an abuse of ethernet. All I can say is, I disagree. Why should a camera, printer, or any other device/peripheral be expected to handle every network process a PC operating system would? Finally, in case someone cares/wonders, we may place the schematics, PCB films, FPGA code, C8051F120 code and so forth into open-source/open-design when we finish, because we'd like to help make robotics, vision-systems, etc more accessible to the open-source/developer community. We can imagine all sorts of really cool gizmos that people might create - if they had inexpensive infrastructure to do so. Which, by the way, is one reason this device is gigabit ethernet (that any application can read/write bytes/commands/messages to), and not a USB nightmare with device drivers and closed/hidden/complex/obscured protocols.

I will appreciate any tips/advice/information from programmers who have worked with sockets at such a low level. I have written client/server/sockets software many times, but not at this lowest level. Thanks.

estabroo · 08-18-2008, 12:09 PM

Your best bet would be to write a kernel module that sits in the stack, like IP does. Another option that might work is to use a tap interface. That would allow you to put a userspace program on the receiving end by bridging the ethernet to the tap.

BedriddenTech · 08-18-2008, 12:33 PM

Transmitting OSI Layer 2 packets is quite simple using the standard C functions provided by the GLIBC. (However, afaik, the checksum is calculated by most network cards.) You could, for example, use the "SOCK_RAW" connection style with the socket call. The GNU C library documentation will give you more info on this topic: http://www.gnu.org/software/libc/man...ml#toc_Sockets
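As a rough illustration of the SOCK_RAW approach on Linux, the sketch below uses a packet socket (AF_PACKET) from Python, which wraps the same socket() call a C program would make. The helper names and the choice of EtherType 0x88B5 (one of the IEEE "local experimental" values) are assumptions for the example, and opening the socket needs root or CAP_NET_RAW.

```python
import socket
import struct

ETH_P_EXPERIMENTAL = 0x88B5  # assumption: IEEE local experimental EtherType


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Destination MAC + source MAC + EtherType + data; the NIC adds preamble and CRC."""
    return dst + src + struct.pack("!H", ethertype) + payload


def send_frame(ifname: str, frame: bytes) -> int:
    """Send one link-layer frame on `ifname` (Linux only, needs CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))   # attach to one interface; protocol 0 is fine for sending
        return s.send(frame)  # bytes go to the driver exactly as built above
```

Usage would look like `send_frame("eth1", build_frame(camera_mac, pc_mac, ETH_P_EXPERIMENTAL, b"start"))`, where the interface name and MAC addresses are placeholders for whatever the real setup uses.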

HTH

estabroo · 08-18-2008, 04:24 PM

I don't think the raw socket stuff will work for this. You still need something in the network stack that'll direct packets to your listener. You might be able to do something like that by putting the NIC into promiscuous mode.
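For reference, promiscuous mode can be requested per packet socket on Linux with the PACKET_ADD_MEMBERSHIP socket option described in packet(7). It should only matter if the camera addresses its frames to a MAC other than the NIC's own address or broadcast. A hedged sketch follows; the numeric constants are copied from the Linux headers, and the function names are hypothetical.

```python
import socket
import struct

# Values from <linux/if_packet.h> and <linux/socket.h>; Python may not export them.
SOL_PACKET = 263
PACKET_ADD_MEMBERSHIP = 1
PACKET_MR_PROMISC = 1


def promisc_request(ifindex: int) -> bytes:
    """Pack a struct packet_mreq {int ifindex; u16 type; u16 alen; u8 address[8];}."""
    return struct.pack("@iHH8s", ifindex, PACKET_MR_PROMISC, 0, b"")


def enable_promiscuous(sock: socket.socket, ifname: str) -> None:
    """Ask the kernel to keep `ifname` promiscuous while `sock` stays open."""
    ifindex = socket.if_nametoindex(ifname)
    sock.setsockopt(SOL_PACKET, PACKET_ADD_MEMBERSHIP, promisc_request(ifindex))
```

The kernel drops the membership automatically when the socket is closed, so the interface does not stay promiscuous after the program exits.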

chort · 08-18-2008, 06:40 PM

Unless I'm very mistaken, this is going to be handled by the PC's kernel driver for the ethernet device regardless, so I don't think what you want to do will work. You could perhaps write patches for some of the drivers of some of the more popular ethernet cards, but then you're going to have heavy vendor lock-in and a lot of work to keep your patchset current for every card you "support".

Why don't you just go with typical UDP streams like every other streaming video protocol does? You can't do what you're trying to do without having your own IP stack, since the kernel handles everything from layer2 to layer4 (OSI).

By the way, pushing 900Mb/s through most "gigabit" cards is going to be pretty tough anyway. The ability to do this depends heavily on the drivers, and the cards with inaccurate/poor/missing documentation tend to have drivers that don't perform anywhere close to wire speed.

If real-time performance is so important, you're going to need to control the hardware and IP stack at both ends, then connect to a PC over a link where some latency/loss is tolerable. You could have some type of "stream aggregator" device that buffers the streams and off-loads them in bulk, or some other creative work-around.

chort · 08-18-2008, 06:42 PM

Hmmm, actually on second thought you might be able to accomplish this by leveraging libpcap. IMO you're still going to have problems with the performance of the NICs you're sending the data to, though.

BedriddenTech · 08-18-2008, 07:13 PM

Linux WOL programs are sending MAC frames, too: http://wake-on-lan.svn.sourceforge.n....c?view=markup
Have a look at raw_open and raw_send.

I've created WOL packets, too, during my university days, and used exactly the same approach.
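To show what such a frame looks like, here is a small sketch that builds a Wake-on-LAN "magic packet" as a raw link-layer frame: six 0xFF bytes followed by the target MAC repeated 16 times, sent to the broadcast address with EtherType 0x0842. This is not a reproduction of the linked raw_open/raw_send code, and the function names are made up for the example.

```python
import struct

ETH_P_WOL = 0x0842
BROADCAST = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or with dashes) to 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("a MAC address has 6 bytes")
    return raw


def magic_packet(target_mac: str) -> bytes:
    """Six 0xFF bytes, then the target MAC sixteen times (102 bytes total)."""
    return b"\xff" * 6 + mac_to_bytes(target_mac) * 16


def wol_frame(src_mac: str, target_mac: str) -> bytes:
    """Complete broadcast frame, ready to hand to a raw packet socket."""
    header = BROADCAST + mac_to_bytes(src_mac) + struct.pack("!H", ETH_P_WOL)
    return header + magic_packet(target_mac)
```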

maxreason · 08-18-2008, 08:28 PM

Quote:

Originally Posted by chort

Unless I'm very mistaken, this is going to be handled by the PC's kernel driver for the ethernet device regardless, so I don't think what you want to do will work. You could perhaps write patches for some of the drivers of some of the more popular ethernet cards, but then you're going to have heavy vendor lock-in and a lot of work to keep your patchset current for every card you "support".

Why don't you just go with typical UDP streams like every other streaming video protocol does? You can't do what you're trying to do without having your own IP stack, since the kernel handles everything from layer2 to layer4 (OSI).

By the way, pushing 900Mb/s through most "gigabit" cards is going to be pretty tough anyway. The ability to do this depends heavily on the drivers, and the cards with inaccurate/poor/missing documentation tend to have drivers that don't perform anywhere close to wire speed.

If real-time performance is so important, you're going to need to control the hardware and IP stack at both ends, then connect to a PC over a link where some latency/loss is tolerable. You could have some type of "stream aggregator" device that buffers the streams and off-loads them in bulk, or some other creative work-around.

You are certainly correct about one thing - I don't want to write separate hacks for every ethernet card (or chipset)!

I haven't heard about "UDP streams", but I'll go do some searching around now to see what that is.

I won't like it, but I am willing to create a list of ethernet cards (or chipsets, if that is sufficient) that operate fast enough for the camera, and just be perfectly clear to everyone up-front that "other gigabit cards won't work".

The speed the device requires is slightly less than I stated, but probably not much (825Mbps~875Mbps). Everything I read in the chip specs implies the hardware is perfectly happy with constant frame after frame - with only 8 bytes "inter-frame gap" between. Also note that every packet will contain 4~5 megabytes of data, so header overhead is something like 0.0001%. The main reason I prefer to capture the data in the simplest form is to assure the drives and other software do nothing but store the data. My lossless *decompression* routine is obscenely short and fast - literally only 2~5 machine language instructions per value/pixel. Beat that! And if necessary, the data can be saved/stored compressed - and/or handed to another CPU/GPU core for display.

Remember that data only flows in one direction --- from camera to PC. Oh sure, after every couple thousand packets (one horizontal CCD pixel row per packet) the data stops for a millisecond, and the CPU might send a short packet to the camera now and then.

I'll go search for "UDP streams" now. Thanks for the ideas.

chort · 08-18-2008, 11:33 PM

A few things come to mind...

First off, UDP is the transport that's usually used by streaming applications (audio, video, etc) because it doesn't attempt to resend data (like TCP does). You can imagine what would happen to a video if it was constantly trying to "repair" missing pieces: It would be jerky, with occasional long pauses... that really wouldn't make for a good viewing experience. UDP on the other hand is "fire and forget". The packet is either received at the other end, or it's not. There's no time to back up because you have hundreds more packets coming in that need to be displayed.
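Since UDP never resends, a receiver that cares about gaps has to detect them itself. One common way, sketched here with a hypothetical 4-byte sequence number in front of each row (not something the camera design above specifies), is to number the datagrams and count what is missing:

```python
import socket
import struct


def make_datagram(seq: int, row: bytes) -> bytes:
    """Prefix one row of pixel data with a 32-bit sequence number."""
    return struct.pack("!I", seq) + row


def split_datagram(datagram: bytes):
    """Return (sequence number, row data) from a received datagram."""
    (seq,) = struct.unpack("!I", datagram[:4])
    return seq, datagram[4:]


def count_missing(seqs) -> int:
    """Count datagrams that never arrived, given sequence numbers in arrival order."""
    missing = 0
    expected = None
    for seq in seqs:
        if expected is not None and seq > expected:
            missing += seq - expected  # everything skipped is lost for good
        expected = seq + 1
    return missing


def send_rows(sock: socket.socket, addr, rows) -> None:
    """Fire and forget: no acknowledgement, no retry."""
    for seq, row in enumerate(rows):
        sock.sendto(make_datagram(seq, row), addr)
```

The receiver simply keeps going after a gap; for video that usually means one damaged row or frame rather than a stall.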

Second, I realized after my first post that you only have to worry about receiving on the PC, not sending. That means you can use libpcap to read the raw frames off the wire without having to process all the way through the IP stack. That will probably be the best way to implement your code.
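On Linux, libpcap sits on top of packet sockets, so the receive path can be sketched directly with one. The loop below reuses a single preallocated buffer for every frame to avoid per-packet allocation. The buffer size and function names are assumptions, and `open_capture` needs root or CAP_NET_RAW.

```python
import socket

ETH_P_ALL = 0x0003      # receive every protocol seen on the interface
ETH_HEADER_LEN = 14     # destination MAC + source MAC + length/type


def open_capture(ifname: str) -> socket.socket:
    """Open a raw packet socket bound to one interface (Linux only)."""
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    s.bind((ifname, 0))
    return s


def receive_frames(sock, count: int, on_payload, bufsize: int = 16384) -> None:
    """Read `count` frames into one reusable buffer and pass each payload on."""
    buf = bytearray(bufsize)
    view = memoryview(buf)
    for _ in range(count):
        n = sock.recv_into(view)          # one frame per call, no new allocation
        if n >= ETH_HEADER_LEN:
            # The slice is only valid until the next recv_into overwrites it.
            on_payload(view[ETH_HEADER_LEN:n])
```

A caller would pass something like `receive_frames(open_capture("eth1"), rows_per_image, store_row)`, where the interface name and `store_row` are placeholders. Whether plain recv calls keep up at 800+ Mbps depends on the NIC and driver, so this is a starting point for measurement rather than a guarantee.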

Third, even 800Mbps is too much for certain "gigabit" cards to handle (rather, too much for their Open Source drivers to handle). For instance, some Broadcom cards could only reach ~400Mbps until very recently due to limitations with their drivers, since they're essentially reverse-engineered. The vendors who actually provide documentation to Open Source projects have much better-performing drivers for their cards. Unfortunately the only real way to know is to buy several of the more popular cards and try them out under your target operating systems. What makes it trickier is that most NICs are built into the motherboards these days, so you'll have to look for PCI-e/X cards that have the same chipset as the popular embedded models.

Last, you might be writing your data in 4-5MB chunks, but the largest you can put on the wire at one time is 9000 bytes (minus header overhead), and that's only for cards that support "jumbo frames", which not all do. That also means making sure the OS of the target machine is configured to use jumbo frames. It sounds like you're going to need to build some more overhead into your calculations to account for much smaller packet sizes.
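The overhead is easy to estimate. On the wire, every raw Ethernet frame costs 8 bytes of preamble and start delimiter, a 14-byte header, a 4-byte CRC, and the standard 12-byte inter-frame gap, 38 bytes in total, regardless of payload size. A quick sketch of the arithmetic (function names are illustrative only):

```python
# Per-frame cost on the wire, in bytes, independent of payload size.
PREAMBLE_SFD = 8
HEADER = 14
FCS = 4
INTER_FRAME_GAP = 12
WIRE_OVERHEAD = PREAMBLE_SFD + HEADER + FCS + INTER_FRAME_GAP  # 38


def payload_rate_mbps(payload_bytes: int, line_rate_mbps: float = 1000.0) -> float:
    """Best-case payload throughput for back-to-back frames of one size."""
    return line_rate_mbps * payload_bytes / (payload_bytes + WIRE_OVERHEAD)


def frames_per_row(row_bytes: int, max_payload: int) -> int:
    """How many frames one row needs when a frame carries at most max_payload."""
    return -(-row_bytes // max_payload)  # ceiling division


if __name__ == "__main__":
    for size in (1500, 5000, 9000):
        print(size, round(payload_rate_mbps(size), 1))
```

With 1500-byte payloads the ceiling is roughly 975 Mbps of data, and with 5000-byte payloads roughly 992 Mbps, so the frame size itself leaves headroom above 875 Mbps; wrapping the data in UDP/IP would cost another 28 bytes per packet.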

maxreason · 08-19-2008, 01:31 AM

Quote:

Originally Posted by chort

First off, UDP is the transport that's usually used by streaming applications (audio, video, etc) because it doesn't attempt to resend data (like TCP does). You can imagine what would happen to a video if it was constantly trying to "repair" missing pieces: It would be jerky, with occasional long pauses... that really wouldn't make for a good viewing experience. UDP on the other hand is "fire and forget". The packet is either received at the other end, or it's not. There's no time to back up because you have hundreds more packets coming in that need to be displayed.

Yes, I always assumed "no retry", not only because the network and drivers can't handle it, but because my circuitry can't handle it either (at least the cheaper version of it, which contains no flash memory).

BTW, is "streaming UDP" any different than plain old UDP (which I am already slightly familiar with from a project several years ago)?

Quote:

Second, I realized after my first post that you only have to worry about receiving on the PC, not sending. That means you can use libpcap to read the raw frames off the wire without having to process all the way through the IP stack. That will probably be the best way to implement your code.

I ran across libpcap and downloaded the code from sourceforge - but I haven't looked into it yet. Thanks for giving me yet another reason to do so. But I hope what I find is that I could have done the same low-level access myself if only I figure out the appropriate PF_xxxxx and SOCK_xxxx arguments to put in the socket() call (perhaps PF_PACKET and SOCK_RAW on Linux?).

Quote:

Third, even 800Mbps is too much for certain "gigabit" cards to handle (rather, too much for their Open Source drivers to handle). For instance, some Broadcom cards could only reach ~400Mbps until very recently due to limitations with their drivers, since they're essentially reverse-engineered. The vendors who actually provide documentation to Open Source projects have much better-performing drivers for their cards. Unfortunately the only real way to know is to buy several of the more popular cards and try them out under your target operating systems. What makes it trickier is that most NICs are built into the motherboards these days, so you'll have to look for PCI-e/X cards that have the same chipset as the popular embedded models.

This may in fact be a problem. But if I am lucky, their drivers are only slow on TCP - precisely because they need to handle so many strange cases (resend, delay, out-of-order, etc). Maybe, just maybe they are better at the lowest level. OTOH, that is probably not the case, because presumably all the sophisticated processing (like those I mentioned) would be handled in the kernel for any/every ethernet controller.

Quote:

Last, you might be writing your data in 4-5MB chunks, but the largest you can put on the wire at one time is 9000 bytes (minus header overhead), and that's only for cards that support "jumbo frames", which not all do. That also means making sure the OS of the target machine is configured to use jumbo frames. It sounds like you're going to need to build some more overhead into your calculations to account for much smaller packet sizes.

Sorry, I screwed up there - again! I keep doing that. The entire image is 5 megapixels (for the first camera) but each row of CCD pixels is about 5KB (kilobytes, not megabytes). So it is plenty big enough to be an efficient packet size, but still well below the 9000 byte maximum "jumbo packet". Sorry about that.

chort · 08-19-2008, 02:46 AM

There's no difference between UDP and "streaming UDP". The "streaming" prefix is just an indicator that it's a stream, rather than a transaction.

TCP is implemented higher up the stack than the drivers, so poor performance in a particular driver is going to affect every protocol--it generally wouldn't be limited to TCP. It's mainly the efficiency in how buffers are dumped from the card and copied into kernel memory.

It sounds like jumbo frames would indeed be large enough to carry your packets, but normal (1500-byte) frames would not. That would essentially mean you're limited to supporting only NICs & drivers that are jumbo-capable (and configured as such). If the card is capable, it shouldn't pose any problems since it's only talking to your device (i.e. you don't have to worry about what switch might go in between, since there won't be one). You want to make sure your PHY supports it too, though.
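A program can check the "configured as such" part before it starts capturing. On Linux the interface MTU (the largest payload carried after the 14-byte Ethernet header) is exposed under /sys/class/net. A minimal sketch, with a hypothetical interface name and row size:

```python
from pathlib import Path


def read_mtu(ifname: str) -> int:
    """Read the configured MTU of a Linux network interface from sysfs."""
    return int(Path(f"/sys/class/net/{ifname}/mtu").read_text().strip())


def row_fits(mtu: int, row_bytes: int) -> bool:
    """True if one row fits in a single frame's payload at this MTU."""
    return row_bytes <= mtu


def check_interface(ifname: str, row_bytes: int) -> None:
    """Fail early with a readable message if jumbo frames are not enabled."""
    mtu = read_mtu(ifname)
    if not row_fits(mtu, row_bytes):
        raise RuntimeError(
            f"{ifname} has MTU {mtu}, but each row needs {row_bytes} bytes; "
            "raise the MTU (jumbo frames) before starting"
        )
```

The MTU itself is typically raised with something like `ip link set dev eth1 mtu 9000`, which only succeeds if the NIC and driver support that size.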