I'm writing a device driver for a PCIe card. The card will be generating lots of input data, which must be written into memory and then be accessible to a user-level process. Further, the design must be zero-copy, i.e. the device must be able to write the input data (using DMA) to the same memory area that the user process will then manipulate, without an additional copy.
The size of the buffer must be huge (several GB), although each DMA operation will be just around 60-100 bytes. In other words, new data arriving from the device will go into the buffer in a round-robin fashion, and the user process will lag behind, processing the data as it follows along.
What are the possible options to implement this memory buffer?
kmalloc and pci_alloc_consistent cannot allocate such huge chunks. vmalloc can do so, but I'm not sure how to make it visible to the user process, nor how to get DMA access to the buffer.
My initial reading about hugepages seems to suggest that its focus is on reducing the cost of virtual-to-physical address translation. However, what I'm trying to avoid is the copy of data from one buffer to another.
Sorry for being dense, but some additional detail of a hugepages-based solution would be helpful.
BTW, I forgot to mention that I'm on a 64-bit system.
Allocating memory in user space and using it as the DMA target in the kernel driver means there is no copy. Using hugepages is not necessary if your DMA device has good scatter-gather capabilities. You could just allocate many 4K pages.
As for detail, allocate your data area and a queue for communicating with your driver. Then start your driver and pass it the user-space addresses.
Code:
struct mystuff {
    char *data_area;                  /* user-space buffer for incoming data */
    long data_size;
    struct myqueue_type *work_queue;  /* queue shared with the driver */
    long work_queue_size;
};

struct mystuff stuff = { data, size, queue, qsize };

system("/sbin/modprobe mydriver");
int fd = open("/dev/mydriver", O_RDWR);   /* device node created by the driver */
ioctl(fd, MY_SETUP, &stuff);
In the driver, lock the data and queue with get_user_pages, map the user addresses to physical DMA addresses, and kick off the transfers. As data comes in, post it on the queue for the user process to access. Add memory barriers in the appropriate places to guard against CPU cache effects.
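In rough outline, that pin-and-map step could look like the sketch below. This is illustrative only: my_dev, my_pin_and_map and the pdev field are made-up names, and the exact get_user_pages variant differs between kernel versions (this uses the newer flags-based get_user_pages_fast()).
Code:
/* Sketch: pin a user buffer and map it for device-to-memory DMA.
 * Error unwinding (unpinning pages, freeing tables) is omitted. */
static int my_pin_and_map(struct my_dev *mydev,
                          unsigned long uaddr, size_t len)
{
    int nr_pages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK), PAGE_SIZE);
    struct page **pages;
    struct sg_table sgt;
    int n;

    pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Pin the pages so they can't be swapped out or migrated. */
    n = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
    if (n < nr_pages)
        return -EFAULT;

    /* Turn the page list into a scatterlist for the DMA API. */
    if (sg_alloc_table_from_pages(&sgt, pages, n,
                                  uaddr & ~PAGE_MASK, len, GFP_KERNEL))
        return -ENOMEM;

    if (!dma_map_sg(&mydev->pdev->dev, sgt.sgl, sgt.nents,
                    DMA_FROM_DEVICE))
        return -EIO;

    /* Walk sgt with for_each_sg() and program the card's
     * scatter-gather engine with each dma_address/length pair. */
    return 0;
}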
Personally, I think that, if "each DMA operation will be just around 60-100 bytes," the "zero-copy" requirement is not engineering-justifiable and should be defeated.
I recommend a design in which the device is able to make DMA transfers into a variable-size list of (not necessarily contiguous) "60-100 byte buffers," which, say, are then transferred to user-land by a separate dedicated kernel thread. The user-land process must specify a buffer, in virtual memory space, to which the data must be moved round-robin, and this is what the thread will do, consuming DMA-buffers as quickly as they have been filled by incoming interrupts.
It is crucial to note that, in this design, there is no "huge DMA-able memory area." None exists, and none is required.
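For concreteness, the consuming side of that design might look something like the sketch below. Everything here is hypothetical (my_dev, pop_filled, recycle_buf), and it assumes the user's big ring was pinned with get_user_pages and mapped once with vmap() at setup time, so the thread can write into it directly.
Code:
/* Sketch: drain small DMA staging buffers into the user's ring.
 * Only the small staging buffers are ever exposed to the device. */
static int relay_thread(void *arg)
{
    struct my_dev *mydev = arg;

    while (!kthread_should_stop()) {
        struct small_buf *b;

        /* Sleep until the IRQ handler queues a filled buffer. */
        wait_event_interruptible(mydev->wq,
                !list_empty(&mydev->filled) || kthread_should_stop());

        while ((b = pop_filled(mydev)) != NULL) {
            /* Move the 60-100 byte payload round-robin into the ring. */
            memcpy(mydev->ring_kva + mydev->head, b->data, b->len);
            smp_wmb();  /* publish data before advancing the cursor */
            mydev->head = (mydev->head + b->len) % mydev->ring_size;
            recycle_buf(mydev, b);  /* back onto the DMA free list */
        }
    }
    return 0;
}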
smallpond: The size of the allocated memory will be close to the physical RAM size, i.e. around 30 or 60 GB. So, I guess the use of hugepages is indeed going to help.
When you say - "map the user addresses to physical DMA addresses", what scheme do you recommend?
sundialsvcs: Even though the data arrives in small chunks of 60-100 bytes, the rate of data traffic is going to be high - 2+ Gb/s. The introduction of an additional buffer and the extra copy would be wasteful, don't you think?
BTW, I also came across some schemes which prevent the kernel from using all of the physical RAM - e.g. it uses only 2 of the 32 GB. The remaining portion is then usable by the driver, via ioremap. What do you think of using such a scheme for the zero-copy space? We would need some way to access it from the user process too (i.e. a virtual memory address).
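(For reference, that scheme usually means booting with something like mem=2G so the kernel manages only part of RAM, then claiming the rest in the driver. A sketch follows, with example addresses and hypothetical names; note that plain ioremap() gives an uncached mapping, so a cached variant may be preferable for bulk data.)
Code:
#define RESERVED_BASE 0x80000000ULL     /* example: RAM above 2 GB */
#define RESERVED_SIZE (30ULL << 30)     /* example: 30 GB hidden from the kernel */

static void __iomem *resv;

static int my_probe(void)
{
    /* Claim the reserved region the kernel was told not to use. */
    resv = ioremap(RESERVED_BASE, RESERVED_SIZE);
    return resv ? 0 : -ENOMEM;
}

/* file_operations.mmap: give the user process a direct view. */
static int my_mmap(struct file *f, struct vm_area_struct *vma)
{
    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_BASE >> PAGE_SHIFT,
                           vma->vm_end - vma->vm_start,
                           vma->vm_page_prot);
}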
No, I do not think that it would be wasteful at all. It would be a CPU moving memory from one place to another, probably one or two cache-lines' worth.
The DMA would be happening into a string of discontiguous buffers, which correspond to what the device is actually moving and doing. The memory-footprint that is exposed to DMA, i.e. locked pages, would remain small.
My 2 cents:
A few years back, I worked on PCIe drivers with somewhat similar requirements. As I faintly remember, I created a huge shared memory area and exposed it to the user application through mmap(). The application then accessed that area with simple memcpy()!
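One standard way to implement that is vmalloc_user() plus remap_vmalloc_range() in the driver's mmap handler; a minimal sketch (SHARED_SIZE and the function names are hypothetical):
Code:
static void *shared_area;

static int my_open(struct inode *ino, struct file *f)
{
    shared_area = vmalloc_user(SHARED_SIZE);  /* zeroed, safe to mmap */
    return shared_area ? 0 : -ENOMEM;
}

static int my_mmap(struct file *f, struct vm_area_struct *vma)
{
    /* After mmap(), the application reads/writes it with plain memcpy(). */
    return remap_vmalloc_range(vma, shared_area, 0);
}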