Old 08-07-2015, 04:56 AM   #1
mod_dev_123
LQ Newbie
 
Registered: Aug 2015
Posts: 16

Rep: Reputation: Disabled
Allocating large DMA-able memory area


Hello,

I'm writing a device driver for a PCIe card. The card will be generating lots of input data, which must be copied into memory, and then be accessible to a user-level process. Further, the design must be zero-copy, i.e. the device must be able to write the input data (using DMA) to the same memory area that the user process can then manipulate, without an additional copy.

The size of the buffer must be huge (several GB), although each DMA operation will be just around 60-100 bytes. In other words, new data arriving from the device will be written into the buffer in a round-robin fashion, and the user process will lag behind, processing the data as it catches up.

What are the possible options to implement this memory buffer?

kmalloc and pci_alloc_consistent cannot allocate such huge chunks. vmalloc can do so, but I'm not sure how to make it visible to the user process, nor how to get DMA access to the buffer.

Looking forward to your suggestions.

Thanks.
 
Old 08-07-2015, 05:49 AM   #2
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,140

Rep: Reputation: 1263
Use hugepages. Allocate the buffer in the user process and use the user-space mapping functions when you set up the DMA. Avoid putting too much code in the kernel.
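For example, a minimal user-space sketch of grabbing a hugepage-backed buffer with mmap(MAP_HUGETLB) could look like this. It assumes hugepages have already been reserved (e.g. via /proc/sys/vm/nr_hugepages), and the hand-off to the driver is only indicated in a comment:

Code:
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BUF_SIZE (4UL << 30)   /* 4 GB; pick whatever your design needs */

int main(void)
{
    /* anonymous, hugepage-backed mapping; fails if not enough hugepages
     * have been reserved in /proc/sys/vm/nr_hugepages */
    void *buf = mmap(NULL, BUF_SIZE,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                     -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    /* pass buf and BUF_SIZE to the driver (e.g. via an ioctl) so it can
     * pin the pages and set up the DMA mappings */

    munmap(buf, BUF_SIZE);
    return 0;
}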
 
Old 08-10-2015, 01:40 AM   #3
mod_dev_123
LQ Newbie
 
Registered: Aug 2015
Posts: 16

Original Poster
Rep: Reputation: Disabled
Thanks for the tip.

My initial reading about hugepages suggests that its focus is on reducing the virtual-to-physical address translation cost. However, what I'm trying to avoid is the copy of data from one buffer to another.

Sorry for being dense, but some additional detail of a hugepages-based solution would be helpful.

BTW, I forgot to mention that I'm on a 64 bit system.

Thanks.
 
Old 08-10-2015, 12:12 PM   #4
smallpond
Senior Member
 
Registered: Feb 2011
Location: Massachusetts, USA
Distribution: Fedora
Posts: 4,140

Rep: Reputation: 1263
Allocating memory in user space and using it as the DMA target in the kernel driver means there is no copy. Using hugepages is not necessary if your DMA device has good scatter-gather capabilities. You could just allocate many 4K pages.

As for detail, allocate your data area and a queue for communicating with your driver. Then start your driver and pass it the user-space addresses.

Code:
#include <fcntl.h>       /* open() */
#include <stdlib.h>      /* system() */
#include <sys/ioctl.h>

struct mystuff {
    char *data_area;                  /* user-allocated data buffer */
    long data_size;
    struct myqueue_type *work_queue;  /* queue shared with the driver */
    long work_queue_size;
};

struct mystuff setup;                 /* fill in from your allocations */

system("/sbin/modprobe mydriver");
int fd = open("/dev/mydriver", O_RDWR);   /* node name is illustrative */
ioctl(fd, MY_SETUP, &setup);
In the driver, lock the data and queue with get_user_pages, map the user addresses to physical DMA addresses, and kick off the transfers. As your data comes in, post it on the queue for the user process to access it. Add barriers in the appropriate places to prevent CPU cache effects.
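To sketch what that driver side might look like, here is an outline that pins and maps a single page; it assumes MY_SETUP is the ioctl from the snippet above, and note that get_user_pages_fast()'s signature differs between kernel versions. A real driver would loop over the whole buffer and build a scatterlist.

Code:
#include <linux/mm.h>
#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Pin one user page and map it for device-to-memory DMA.
 * Older kernels: get_user_pages_fast(start, nr_pages, write, pages);
 * newer kernels take FOLL_WRITE flags instead of the write int. */
static int my_pin_and_map(struct pci_dev *pdev, unsigned long uaddr,
                          struct page **page, dma_addr_t *bus_addr)
{
    int ret = get_user_pages_fast(uaddr & PAGE_MASK, 1, 1, page);

    if (ret != 1)
        return ret < 0 ? ret : -EFAULT;

    *bus_addr = dma_map_page(&pdev->dev, *page, 0, PAGE_SIZE,
                             DMA_FROM_DEVICE);
    if (dma_mapping_error(&pdev->dev, *bus_addr)) {
        put_page(*page);
        return -EIO;
    }

    /* program *bus_addr into the card's DMA engine; when the transfer is
     * done, dma_unmap_page() and put_page() (dirtying the page first) */
    return 0;
}
The same pattern extends to the full multi-gigabyte buffer by collecting the pinned pages into a scatterlist and mapping them with dma_map_sg().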
 
Old 08-10-2015, 06:31 PM   #5
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
Personally, I think that if "each DMA operation will be just around 60-100 bytes," the "zero-copy" requirement is not justifiable from an engineering standpoint and should be dropped.

I recommend a design in which the device is able to make DMA transfers into a variable-size list of (not necessarily contiguous) "60-100 byte buffers," which, say, are then transferred to user-land by a separate dedicated kernel thread. The user-land process must specify a buffer, in virtual memory space, to which the data must be moved round-robin, and this is what the thread will do, consuming DMA-buffers as quickly as they have been filled by incoming interrupts.
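To make the shape of that design concrete, here is a rough sketch of such a drain thread. Everything here is hypothetical and simplified: struct my_dev, the descriptor queue, and the assumption that the user's ring buffer has already been pinned page-by-page with get_user_pages() into ring_pages, so the kernel thread can write to it without an mm switch.

Code:
#include <linux/kthread.h>
#include <linux/highmem.h>
#include <linux/wait.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/mm.h>

/* one completed 60-100 byte DMA buffer, queued by the IRQ handler */
struct dma_desc {
    struct list_head node;
    void *cpu_addr;
    size_t len;
};

/* hypothetical per-device state */
struct my_dev {
    wait_queue_head_t wq;        /* woken by the IRQ handler */
    spinlock_t lock;
    struct list_head done;       /* completed descriptors */
    struct page **ring_pages;    /* user ring, pinned with get_user_pages() */
    size_t ring_size;
    size_t tail;                 /* next write offset into the user ring */
};

static struct dma_desc *my_pop_done(struct my_dev *dev)
{
    struct dma_desc *d = NULL;

    spin_lock_irq(&dev->lock);
    if (!list_empty(&dev->done)) {
        d = list_first_entry(&dev->done, struct dma_desc, node);
        list_del(&d->node);
    }
    spin_unlock_irq(&dev->lock);
    return d;
}

static int drain_thread(void *arg)
{
    struct my_dev *dev = arg;

    while (!kthread_should_stop()) {
        struct dma_desc *d;

        /* sleep until the IRQ handler queues a completed buffer */
        wait_event_interruptible(dev->wq,
                !list_empty(&dev->done) || kthread_should_stop());

        while ((d = my_pop_done(dev)) != NULL) {
            struct page *pg = dev->ring_pages[dev->tail / PAGE_SIZE];
            void *dst = kmap(pg);

            /* copy into the pinned user ring; this sketch assumes a
             * record never straddles a page boundary */
            memcpy(dst + (dev->tail % PAGE_SIZE), d->cpu_addr, d->len);
            kunmap(pg);

            dev->tail = (dev->tail + d->len) % dev->ring_size;
            /* hand the small DMA buffer back to the device here */
        }
    }
    return 0;
}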

It is crucial to note that, in this design, there is no "huge DMA-able memory area." None exists, and none is required.

Last edited by sundialsvcs; 08-10-2015 at 06:32 PM.
 
Old 08-10-2015, 09:52 PM   #6
mod_dev_123
LQ Newbie
 
Registered: Aug 2015
Posts: 16

Original Poster
Rep: Reputation: Disabled
smallpond, sundialsvcs: Thanks for the replies.

smallpond: The size of the allocated memory will be close to the physical RAM size, i.e. around 30 or 60 GB. So, I guess the use of hugepages is indeed going to help.

When you say - "map the user addresses to physical DMA addresses", what scheme do you recommend?

sundialsvcs: Even though the data arrives in small chunks of 60-100 bytes, the rate of data traffic is going to be high - 2+ Gb/s. The introduction of an additional buffer, and the extra copy would be wasteful, don't you think?

BTW, I also came across some schemes (e.g. booting with a mem= limit) that prevent the kernel from using all the physical RAM, so it uses only 2 of the 32 GB, for instance. The remaining portion can then be used by the driver via ioremap. What do you think of using such a scheme for the zero-copy space? We would also need some way to access it from the user process (i.e. a virtual memory address).
 
Old 08-11-2015, 08:55 AM   #7
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
No, I do not think that it would be wasteful at all.   It would be a CPU moving memory from one place to another, probably one or two cache-lines' worth.

The DMA would be happening into a string of discontiguous buffers, which correspond to what the device is actually moving and doing. The memory-footprint that is exposed to DMA, i.e. locked pages, would remain small.
 
Old 08-11-2015, 04:26 PM   #8
kauuttt
Member
 
Registered: Dec 2008
Location: Atlanta, GA, USA
Distribution: Ubuntu
Posts: 135

Rep: Reputation: 26
My 2 cents:
A few years back, I worked on PCIe drivers with somewhat similar requirements. As I faintly remember, I created a huge shared memory area in the driver and exposed its virtual address to the user application through mmap. The application then accessed that area with simple memcpy()!
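For reference, the mmap hand-off in that kind of design can be sketched roughly as below. It assumes the driver's buffer was allocated with vmalloc_user(); names such as my_buf are illustrative, and since a vmalloc'd buffer is not physically contiguous, DMA into it still requires mapping each underlying page (scatter-gather).

Code:
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *my_buf;          /* allocated at probe time: vmalloc_user(my_buf_size) */
static size_t my_buf_size;

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > my_buf_size)
        return -EINVAL;

    /* hand the whole kernel buffer to the calling process; after this the
     * application reads/writes it like ordinary memory (memcpy, etc.) */
    return remap_vmalloc_range(vma, my_buf, 0);
}

static const struct file_operations my_fops = {
    .owner = THIS_MODULE,
    .mmap  = my_mmap,
};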
 
  

