Efficient data copy from PCIe device to RAM in kernel

PeterWurmsdobler · 07-13-2010, 04:36 PM

Hello,

I have got an FPGA card (4 PCIe v1.0 lanes) that exposes a prefetchable PCI memory window to the system with pcie_mem_length (16MB). In the driver I have ioremapped its base address into the kernel's virtual memory space, called pcie_mem_vaddr. On the other side, I have got memory reserved during boot time which I have also ioremapped into the kernel's virtual memory space, called reserved_vaddr.

Now, since the FPGA card does not support DMA (yet), I have measured the data transfer time of a single memcpy (reserved_vaddr, pcie_mem_vaddr, pcie_mem_length) using ktime_t and associated hr_timer functions. It quite consistently takes 1.84s for 16MB, i.e. roughly 8MB/s. That's not terrific by today's standards.
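As a quick check of the figure quoted above, the throughput is just bytes divided by elapsed time. The following is a minimal user-space sketch, not driver code; the helper name `throughput_mib_per_s` is made up for illustration, and it assumes "16MB" means 16 MiB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: throughput in MiB/s for `bytes` moved in `elapsed_ns`. */
static double throughput_mib_per_s(uint64_t bytes, uint64_t elapsed_ns)
{
    assert(elapsed_ns > 0);                       /* avoid dividing by zero */
    double seconds = (double)elapsed_ns / 1e9;    /* ns -> s */
    double mib = (double)bytes / (1024.0 * 1024.0);
    return mib / seconds;
}
```

Called as `throughput_mib_per_s(16ULL << 20, 1840000000ULL)`, this gives a little under 8.7 MiB/s, in line with the "roughly 8MB/s" above.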

The kernel's default implementation for memcpy in kernelsource/lib/string.c seems to carry out a loop counting down to zero with a *dest++ = *src++, both being char*; this does not look too efficient. Is it this function that is linked in by gcc and called from a kernel module, or a more architecture-specific version found in kernelsource/arch/x86/lib/*.S? There I can find for example a page_copy.S which seems to be efficient (uses prefetch). How can I make use of that?
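For reference, the generic memcpy in lib/string.c is wrapped in `#ifndef __HAVE_ARCH_MEMCPY`, and x86 defines that macro and supplies its own memcpy from arch/x86/lib, so the byte loop is normally not the one used on x86. To illustrate the difference between the two shapes of loop, here is a minimal user-space sketch. The names `copy_bytes` and `copy_words64` are hypothetical, and neither function is the kernel's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time copy: the shape of the generic fallback loop. */
static void *copy_bytes(void *dest, const void *src, size_t count)
{
    assert(dest != NULL && src != NULL);
    char *d = dest;
    const char *s = src;

    while (count--)
        *d++ = *s++;
    return dest;
}

/* Same result, but moving 8 bytes per iteration, then a byte-wise tail. */
static void *copy_words64(void *dest, const void *src, size_t count)
{
    assert(dest != NULL && src != NULL);
    unsigned char *d = dest;
    const unsigned char *s = src;

    while (count >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);   /* alignment-safe 8-byte load */
        memcpy(d, &w, sizeof w);   /* alignment-safe 8-byte store */
        d += sizeof w;
        s += sizeof w;
        count -= sizeof w;
    }
    while (count--)                /* remaining 0..7 bytes */
        *d++ = *s++;
    return dest;
}
```

On ordinary cached RAM the wider loop simply needs fewer iterations. On an uncached PCIe mapping the more relevant point is that each load is likely to become its own read request, so wider loads mean fewer requests.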

How could I speed up the data transfer in the kernel, but without using DMA?

Many thanks for any hints,
peter

nini09 · 07-13-2010, 05:55 PM

What's that memcpy, from system memory to system memory, or from PCIe memory to system memory?

PeterWurmsdobler · 07-14-2010, 03:20 AM

Hello,

sorry if it was not clear from my first post. In the kernel driver's init part I have in essence:

Code:

unsigned long pcie_mem_hwaddr = pci_resource_start (pcie_dev, 0);
unsigned long pcie_mem_length = pci_resource_len(pcie_dev, 0);//(16MB)
void * pcie_mem_vaddr = ioremap(pcie_mem_hwaddr, pcie_mem_length);
void * reserved_vaddr = ioremap(0x100000000UL, 0x200000000UL);

In a periodically called function I have in essence:

Code:

ktime_t before = ktime_get_real();
memcpy(reserved_vaddr, pcie_mem_vaddr, pcie_mem_length);
ktime_t after = ktime_get_real();
ktime_t diff = ktime_sub(after, before);  /* after - before */
printk("dT = %lld ns\n", ktime_to_ns(diff));

And this is what quite consistently produces 1.8s. So how could I speed up the transfer, without DMA?

Cheers

nini09 · 07-14-2010, 02:44 PM

The throughput is too low and it looks like something is wrong.
Why do you use ioremap to reserve/allocate memory instead of kmalloc? Are you sure there is no memory conflict?

PeterWurmsdobler · 07-15-2010, 04:31 AM

Hello,

I need to record 500MB/s of data generated by an FPGA card to RAM. Reserving 8GB at boot time, then using ioremap to bring it into kernel virtual memory space gives me the guarantee that I have memory available to record for roughly 16 seconds. kmalloc would return only small chunks and I cannot be sure that I can claim 8GB in total; in addition, I would need to maintain the chunks returned by kmalloc.

So my question remains unanswered: how can I transfer data from a PCIe card efficiently without using DMA? From what I have tried so far, using memcpy or readq(), I only get 8MB/s, as every read() is translated into a PCIe request and a single PCIe transaction packet is returned for every single word, even though PCIe would support 4k packets.
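The per-word behaviour described above can be turned into a rough cost per read. This sketch assumes every CPU load of `width_bytes` becomes one PCIe read that has to complete before the next one is issued; the actual access width of the copy loop is not stated in the thread, so the width is a parameter. The helper name `ns_per_read` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: average time per read in nanoseconds, assuming the
 * whole transfer consists of back-to-back reads of `width_bytes` each.
 */
static double ns_per_read(uint64_t total_bytes, uint64_t elapsed_ns,
                          unsigned width_bytes)
{
    assert(width_bytes > 0 && total_bytes >= width_bytes);
    uint64_t reads = total_bytes / width_bytes;   /* number of read requests */
    return (double)elapsed_ns / (double)reads;
}
```

For 16 MiB in 1.84s, 8-byte reads would come to a bit under 0.9 microseconds each and 4-byte reads to about 0.44 microseconds each. Under this assumption the transfer is limited by the round trip per request rather than by link bandwidth.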

Cheers,
peter

nini09 · 07-15-2010, 02:43 PM

If you want PCIe 4k burst transfers, that is a job for DMA. My point is that even if the CPU can only generate single-word PCI requests, the throughput shouldn't be as low as 8MB/s.
You could try a CISC CPU, such as x86, and use its string move instruction (move word from string to string).
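The "move word from string to string" instruction presumably refers to the x86 MOVS instruction with a REP prefix. Below is a minimal user-space sketch using GCC/Clang inline assembly on x86-64, with a plain memcpy fallback elsewhere. The function name `copy_rep_movsq` is made up for illustration. The thread does not establish whether this helps on an uncached PCIe mapping, where each element may still be read individually.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `qwords` 8-byte words from src to dst (regions must not overlap). */
static void copy_rep_movsq(void *dst, const void *src, size_t qwords)
{
    assert(dst != NULL && src != NULL);
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /*
     * RDI = destination, RSI = source, RCX = element count.
     * The ABI guarantees the direction flag is clear, so addresses ascend.
     */
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
#else
    /* Fallback for other architectures or compilers. */
    memcpy(dst, src, qwords * sizeof(uint64_t));
#endif
}
```

The count is in 8-byte units, so a 16MB window would be passed as `pcie_mem_length / 8`.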

PeterWurmsdobler · 07-16-2010, 10:18 AM

Hello,
thanks for the answers. We have now changed the FPGA design to incorporate DMA. We now get 600MB/s on 4 lane v1.0 PCIe.
peter
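To put the 600MB/s in context: a PCIe v1.0 lane signals at 2.5 GT/s with 8b/10b encoding, i.e. 250MB/s per lane per direction before packet overhead. The sketch below works that out for a given lane count; the function name `pcie_gen1_bandwidth_mb_per_s` is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: raw bandwidth of a PCIe 1.x link in one direction,
 * in MB/s (10^6 bytes), before TLP header and flow-control overhead.
 */
static double pcie_gen1_bandwidth_mb_per_s(unsigned lanes)
{
    assert(lanes > 0);
    const double transfers_per_s = 2.5e9;   /* 2.5 GT/s per lane */
    const double encoding = 8.0 / 10.0;     /* 8b/10b: 8 data bits per 10 line bits */
    double bits_per_s = transfers_per_s * encoding * (double)lanes;
    return bits_per_s / 8.0 / 1e6;          /* bits -> bytes -> MB */
}
```

For 4 lanes this gives about 1000MB/s, so the 600MB/s achieved with DMA is roughly 60% of the raw link rate.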