I'm writing a device driver for a PCIe card. The card will be generating lots of input data, which must be written into memory and then be accessible to a user-level process. Further, the design must be zero-copy, i.e. the device must be able to write the input data (using DMA) to the same memory area that the user process will then manipulate, without an additional copy.
The size of the buffer must be huge (several GB), although each DMA operation will be just around 60-100 bytes. In other words, new data arriving from the device will go into the buffer in a round-robin fashion, and the user process will lag behind, processing the data as it follows along.
What are the possible options to implement this memory buffer?
kmalloc and pci_alloc_consistent cannot allocate such huge chunks. vmalloc can do so, but I'm not sure how to make it visible to the user process, nor how to get DMA access to the buffer.
My initial reading about hugepages seems to suggest that its focus is on reducing the cost of virtual-to-physical address translation. However, what I'm trying to avoid is the copy of data from one buffer to another.
Sorry for being dense, but some additional detail of a hugepages-based solution would be helpful.
BTW, I forgot to mention that I'm on a 64-bit system.
Allocating memory in user space and using it as the DMA target in the kernel driver means there is no copy. Using hugepages is not necessary if your DMA device has good scatter-gather capabilities. You could just allocate many 4K pages.
As for detail, allocate your data area and a queue for communicating with your driver. Then start your driver and pass it the user-space addresses.
Code:
struct mystuff {
    char *data_area;                  /* user-space buffer for incoming data */
    long data_size;
    struct myqueue_type *work_queue;  /* queue shared with the driver */
    long work_queue_size;
};

struct mystuff stuff = { data, size, queue, qsize };

system("/sbin/modprobe mydriver");
int fd = open("/dev/mydriver", O_RDWR);   /* device node created by the driver */
ioctl(fd, MY_SETUP, &stuff);
In the driver, lock the data and queue with get_user_pages, map the user addresses to physical DMA addresses, and kick off the transfers. As data comes in, post it on the queue for the user process to access. Add memory barriers in the appropriate places to guard against CPU cache effects.
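In rough outline, that pin-and-map step could look like the sketch below. This is illustrative only: my_dev, my_pin_and_map and the pdev field are made-up names, and the exact get_user_pages variant differs between kernel versions (this uses the newer flags-based get_user_pages_fast()).
Code:
/* Sketch: pin a user buffer and map it for device-to-memory DMA.
 * Error unwinding (unpinning pages, freeing tables) is omitted. */
static int my_pin_and_map(struct my_dev *mydev,
                          unsigned long uaddr, size_t len)
{
    int nr_pages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK), PAGE_SIZE);
    struct page **pages;
    struct sg_table sgt;
    int n;

    pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Pin the pages so they can't be swapped out or migrated. */
    n = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
    if (n < nr_pages)
        return -EFAULT;

    /* Turn the page list into a scatterlist for the DMA API. */
    if (sg_alloc_table_from_pages(&sgt, pages, n,
                                  uaddr & ~PAGE_MASK, len, GFP_KERNEL))
        return -ENOMEM;

    if (!dma_map_sg(&mydev->pdev->dev, sgt.sgl, sgt.nents,
                    DMA_FROM_DEVICE))
        return -EIO;

    /* Walk sgt with for_each_sg() and program the card's
     * scatter-gather engine with each dma_address/length pair. */
    return 0;
}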
Personally, I think that, if "each DMA operation will be just around 60-100 bytes," the "zero-copy" requirement is not engineering-justifiable and should be defeated.
I recommend a design in which the device is able to make DMA transfers into a variable-size list of (not necessarily contiguous) "60-100 byte buffers," which, say, are then transferred to user-land by a separate dedicated kernel thread. The user-land process must specify a buffer, in virtual memory space, to which the data must be moved round-robin, and this is what the thread will do, consuming DMA-buffers as quickly as they have been filled by incoming interrupts.
It is crucial to note that, in this design, there is no "huge DMA-able memory area." None exists, and none is required.
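For concreteness, the consuming side of that design might look something like the sketch below. Everything here is hypothetical (my_dev, pop_filled, recycle_buf), and it assumes the user's big ring was pinned with get_user_pages and mapped once with vmap() at setup time, so the thread can write into it directly.
Code:
/* Sketch: drain small DMA staging buffers into the user's ring.
 * Only the small staging buffers are ever exposed to the device. */
static int relay_thread(void *arg)
{
    struct my_dev *mydev = arg;

    while (!kthread_should_stop()) {
        struct small_buf *b;

        /* Sleep until the IRQ handler queues a filled buffer. */
        wait_event_interruptible(mydev->wq,
                !list_empty(&mydev->filled) || kthread_should_stop());

        while ((b = pop_filled(mydev)) != NULL) {
            /* Move the 60-100 byte payload round-robin into the ring. */
            memcpy(mydev->ring_kva + mydev->head, b->data, b->len);
            smp_wmb();  /* publish data before advancing the cursor */
            mydev->head = (mydev->head + b->len) % mydev->ring_size;
            recycle_buf(mydev, b);  /* back onto the DMA free list */
        }
    }
    return 0;
}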
smallpond: The size of the allocated memory will be close to the physical RAM size, i.e. around 30 or 60 GB. So, I guess the use of hugepages is indeed going to help.
When you say - "map the user addresses to physical DMA addresses", what scheme do you recommend?
sundialsvcs: Even though the data arrives in small chunks of 60-100 bytes, the rate of data traffic is going to be high - 2+ Gb/s. The introduction of an additional buffer and the extra copy would be wasteful, don't you think?
BTW, I also came across some schemes which prevent the kernel from using all of the physical RAM - e.g. it uses only 2 of the 32 GB. The remaining portion is then usable by the driver, via ioremap. What do you think of using such a scheme for the zero-copy space? We would need some way to access it from the user process too (i.e. a virtual memory address).
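(For reference, that scheme usually means booting with something like mem=2G so the kernel manages only part of RAM, then claiming the rest in the driver. A sketch follows, with example addresses and hypothetical names; note that plain ioremap() gives an uncached mapping, so a cached variant may be preferable for bulk data.)
Code:
#define RESERVED_BASE 0x80000000ULL     /* example: RAM above 2 GB */
#define RESERVED_SIZE (30ULL << 30)     /* example: 30 GB hidden from the kernel */

static void __iomem *resv;

static int my_probe(void)
{
    /* Claim the reserved region the kernel was told not to use. */
    resv = ioremap(RESERVED_BASE, RESERVED_SIZE);
    return resv ? 0 : -ENOMEM;
}

/* file_operations.mmap: give the user process a direct view. */
static int my_mmap(struct file *f, struct vm_area_struct *vma)
{
    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_BASE >> PAGE_SHIFT,
                           vma->vm_end - vma->vm_start,
                           vma->vm_page_prot);
}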
No, I do not think that it would be wasteful at all. It would be a CPU moving memory from one place to another, probably one or two cache-lines' worth.
The DMA would be happening into a string of discontiguous buffers, which correspond to what the device is actually moving and doing. The memory-footprint that is exposed to DMA, i.e. locked pages, would remain small.
My 2 cents:
A few years back, I worked on PCIe drivers with somewhat similar requirements. As I faintly remember, I created a huge shared memory area and exposed it to the user application through mmap(). The application then accessed that area with simple memcpy()!
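One standard way to implement that is vmalloc_user() plus remap_vmalloc_range() in the driver's mmap handler; a minimal sketch (SHARED_SIZE and the function names are hypothetical):
Code:
static void *shared_area;

static int my_open(struct inode *ino, struct file *f)
{
    shared_area = vmalloc_user(SHARED_SIZE);  /* zeroed, safe to mmap */
    return shared_area ? 0 : -ENOMEM;
}

static int my_mmap(struct file *f, struct vm_area_struct *vma)
{
    /* After mmap(), the application reads/writes it with plain memcpy(). */
    return remap_vmalloc_range(vma, shared_area, 0);
}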