LinuxQuestions.org - Efficient data copy from PCIe device to RAM in kernel

Efficient data copy from PCIe device to RAM in kernel

Hello,

I have an FPGA card (4 PCIe v1.0 lanes) that exposes a prefetchable PCI memory window of length pcie_mem_length (16MB) to the system. In the driver I have ioremapped its base address into the kernel's virtual address space as pcie_mem_vaddr. I also have memory reserved at boot time, which I have likewise ioremapped into the kernel's virtual address space as reserved_vaddr.

Now, since the FPGA card does not support DMA (yet), I have measured the data transfer time of a single memcpy(reserved_vaddr, pcie_mem_vaddr, pcie_mem_length) using ktime_t and the associated hrtimer functions. It quite consistently takes 1.84s for 16MB, i.e. roughly 8MB/s. That's not terrific by today's standards.
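As a sanity check on these numbers, the following userspace sketch works out the measured rate and compares it with the raw rate of the link. It assumes the commonly quoted figure for PCIe v1.0 of 2.5 GT/s per lane with 8b/10b encoding, i.e. 250 MB/s per lane per direction before protocol overhead. The function names are made up for this illustration.

```c
#include <assert.h>

/* Measured copy rate in bytes per second. */
static double copy_rate_bytes_per_s(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds;
}

/* The same rate expressed in MiB/s (1 MiB = 1024 * 1024 bytes). */
static double copy_rate_mib_per_s(double bytes, double seconds)
{
    return copy_rate_bytes_per_s(bytes, seconds) / (1024.0 * 1024.0);
}

/*
 * Raw rate of a PCIe v1.0 link in bytes per second, per direction.
 * Assumption: 2.5 GT/s per lane with 8b/10b encoding, i.e. 250 MB/s
 * per lane, ignoring packet and protocol overhead.
 */
static double pcie_v1_raw_bytes_per_s(int lanes)
{
    assert(lanes > 0);
    return 250e6 * lanes;
}

/* Fraction of the raw link rate that the measured copy achieves. */
static double link_utilisation(double bytes, double seconds, int lanes)
{
    return copy_rate_bytes_per_s(bytes, seconds) / pcie_v1_raw_bytes_per_s(lanes);
}
```

With 16MB taken as 16 * 1024 * 1024 bytes and 1.84s, copy_rate_mib_per_s() comes to about 8.7, and link_utilisation() for 4 lanes stays below 1%.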

The kernel's default implementation of memcpy in kernelsource/lib/string.c seems to be a loop counting down to zero with *dest++ = *src++, both being char*; this does not look too efficient. Is it this function that is linked in by gcc and called from a kernel module, or a more architecture-specific version found in kernelsource/arch/x86/lib/*.S? There I can find, for example, a page_copy.S which seems to be efficient (it uses prefetch). How can I make use of that?
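For illustration, here is a userspace sketch contrasting a byte-at-a-time loop of the kind described above with a copy that moves 64-bit words. The names copy_bytes and copy_words are hypothetical and this is not the kernel's code. As far as I know, on x86 the generic loop in lib/string.c is only a fallback and an architecture-specific memcpy is used instead; for I/O memory the kernel also offers helpers such as memcpy_fromio() and readq().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time copy, in the style of the generic loop described above. */
static void *copy_bytes(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    assert(dest != NULL && src != NULL);
    while (count--)
        *d++ = *s++;
    return dest;
}

/*
 * Copy in 64-bit words, then finish the remaining tail bytewise.
 * The fixed-size memcpy() calls avoid alignment and aliasing problems;
 * compilers typically turn each one into a single 64-bit load or store.
 */
static void *copy_words(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    assert(dest != NULL && src != NULL);
    while (count >= sizeof(uint64_t)) {
        uint64_t w;

        memcpy(&w, s, sizeof(w));   /* one 8-byte read */
        memcpy(d, &w, sizeof(w));   /* one 8-byte write */
        d += sizeof(w);
        s += sizeof(w);
        count -= sizeof(w);
    }
    while (count--)                 /* 0 to 7 leftover bytes */
        *d++ = *s++;
    return dest;
}
```

Both functions produce the same result; the word version simply issues one eighth as many reads. Whether that helps on a PCIe window depends on how expensive each individual read is.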

How could I speed up the data transfer in the kernel, but without using DMA?

Many thanks for any hints,
peter

Is that memcpy from system memory to system memory, or from PCIe memory to system memory?

Hello,

Sorry if it was not clear from my first post. In the kernel driver's init part I have, in essence:

Code:

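unsigned long pcie_mem_hwaddr = pci_resource_start(pcie_dev, 0);
unsigned long pcie_mem_length = pci_resource_len(pcie_dev, 0); /* 16MB */
/* ioremap() returns an I/O memory cookie, hence the __iomem annotation */
void __iomem *pcie_mem_vaddr = ioremap(pcie_mem_hwaddr, pcie_mem_length);
/* 8GB reserved at boot time, starting at the 4GB physical address */
void __iomem *reserved_vaddr = ioremap(0x100000000UL, 0x200000000UL);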

In a periodically called function I have in essence:

Code:

ktime_t before = ktime_get_real();
memcpy(reserved_vaddr, pcie_mem_vaddr, pcie_mem_length);
ktime_t after = ktime_get_real();
ktime_t diff = ktime_sub(after, before); /* after - before */
printk("dT = %lld ns\n", ktime_to_ns(diff));

And this is what quite consistently takes 1.8s. So how could I speed up the transfer without DMA?

Cheers

The throughput is too low; it looks like something is wrong.
Why do you use ioremap to reserve/allocate memory instead of kmalloc? Are you sure there is no memory conflict?

Hello,

I need to record 500MB/s of data generated by an FPGA card to RAM. Reserving 8GB at boot time and then using ioremap to bring it into the kernel's virtual address space guarantees that I have memory available to record for roughly 16 seconds. kmalloc would return only small chunks, and I cannot be sure that I could claim 8GB in total; in addition, I would need to keep track of the chunks returned by kmalloc.

So my question remains unanswered: how can I transfer data from a PCIe card efficiently without using DMA? From what I have tried so far, using memcpy or readq(), I only get 8MB/s, as every read is translated into a PCIe request and a single PCIe transaction packet is returned for every single word, even though PCIe would support 4k packets.
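To put a number on that explanation, here is a rough model. It assumes the copy consists of fully serialized 8-byte reads (as with readq()), each one waiting for its own completion before the next is issued. The helper names are hypothetical, and the word size that memcpy actually uses is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Number of read requests needed when each request returns word_bytes. */
static uint64_t reads_needed(uint64_t total_bytes, uint64_t word_bytes)
{
    assert(word_bytes > 0);
    return (total_bytes + word_bytes - 1) / word_bytes; /* round up */
}

/*
 * Average time per read in nanoseconds, assuming the reads are fully
 * serialized so that the total time is just the sum of the round trips.
 */
static double ns_per_read(double total_seconds, uint64_t reads)
{
    assert(reads > 0);
    return total_seconds * 1e9 / (double)reads;
}
```

For 16MB in 1.84s with 8-byte reads, this gives 2097152 requests at roughly 877 ns each. The same 16MB would need only 4096 requests if each one carried a 4k payload.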

Cheers,
peter

If you want PCIe 4k burst transfers, that is a job for DMA. My point is that even if the CPU can only generate single-word PCI requests, the throughput shouldn't be as low as 8MB/s.
You could try a CISC CPU, such as x86, and use its string move instruction (the one that moves a word from string to string).

Hello,
thanks for the answers. We have now changed the FPGA design to incorporate DMA, and we now get 600MB/s on 4-lane v1.0 PCIe.
peter