Efficient data copy from PCIe device to RAM in kernel
Hello,
I have got an FPGA card (4 PCIe v1.0 lanes) that exposes a prefetchable PCI memory window to the system with pcie_mem_length (16MB). In the driver I have ioremapped its base address into the kernel's virtual memory space, called pcie_mem_vaddr. On the other side, I have got memory reserved during boot time which I have also ioremapped into the kernel's virtual memory space, called reserved_vaddr. Now, since the FPGA card does not support DMA (yet), I have measured the data transfer time of a single memcpy (reserved_vaddr, pcie_mem_vaddr, pcie_mem_length) using ktime_t and associated hr_timer functions. It takes quite consistently 1.84s for 16MB, i.e. roughly 8MB/s. That's not terrific for nowadays standards. The kernel's default implementation for memcpy in kernelsource/lib/string.c seems to carry out a loop counting down to zero with a *dest++ = *src++, both being char*; this does not look to efficient. Is it this function that is linked in by gcc and called from a kernel module, or a more architecture specific version found in kernelsource/arch/x86/lib/*.S ? There I can find for example a page_copy.S which seems to be efficient (uses prefetch). How can I make use of that? How could I speed up the data transfer in the kernel, but without using DMA? Many thanks for any hints, peter |
What's that memcpy, from system memory to system memory, or from PCIe memory to system memory?
|
Hello,
sorry if it was not clearn from my first post. In the kernel driver's init part I have in essence: Code:
unsigned long pcie_mem_hwaddr = pci_resource_start (pcie_dev, 0); Code:
ktime_t before = ktime_get_real(); Cheers |
The throughput is too low and it looks like something is wrong.
Why do you use ioremap to reserve/allocate memory instead of kmalloc? Are you sure no memory conflicting? |
Hello,
I need to record 500MB/s of data generated by an FGPA card to RAM. Reserving 8GB at boot time, then using ioremap to bring it into kernel virtual memory space gives me the guarantee that I have memory available to record for roughly 16seconds. kmalloc would return only small chunks and I can not be sure that I can claim 8GB in total; in addition, I would need to maintain the chunks returned by kmalloc. So my question remains unanswered, how can I transfer data from a PCIe card efficiently without using DMA. From what I have tried so far, using memcpy, or readq(), I only get 8MB/s as every read() is translated into a PCIe request and a single PCIe transation packet is returned for every single word, even though PCIe would support 4k packets. Cheers, peter |
If you want PCIe 4k burst transfer, it is DMA job. My point is even if CPU can only generate single word PCI request, the throughput shouldn't be that low, 8MB/s.
You can try some CISC CPU, such as x86, and use the instruction of move word from string to string. |
Hello,
thanks for the answers. We have now changed the FPGA design to incorporate DMA. We now get 600MB/s on 4 lane v1.0 PCIe. peter |
All times are GMT -5. The time now is 01:19 PM. |