LinuxQuestions.org (/questions/)
-   Linux - Kernel (https://www.linuxquestions.org/questions/linux-kernel-70/)
-   -   Out of order data when performing R/W through mmaped memory (https://www.linuxquestions.org/questions/linux-kernel-70/out-of-order-data-when-performing-r-w-through-mmaped-memory-698534/)

marc_ba 01-20-2009 02:01 AM

Out of order data when performing R/W through mmaped memory
 
Hello

I'm writing a driver for a PCIe board. The board is configured through non-prefetchable I/O memory regions located at BARs 0 and 1. I have two options to access this memory: either in kernel code, by calling ioremap, or in user space, by remapping the area through remap_pfn_range (and calling mmap in the user-space program). I'm using a 2.6.18 kernel.
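Roughly, the kernel-side path looks like this (a simplified sketch; the function name and the register offset are made up):

#include <linux/pci.h>
#include <linux/io.h>

static void __iomem *gRegs;

static int BLA_map_bar0(struct pci_dev *pDev)
{
    /* Map BAR 0 (control registers) into kernel virtual address space. */
    gRegs = ioremap(pci_resource_start(pDev, 0), pci_resource_len(pDev, 0));
    if (!gRegs)
        return -ENOMEM;

    iowrite32(0x1, gRegs + 0x10);   /* example register write; offset is made up */
    (void)ioread32(gRegs + 0x10);   /* read back (also flushes the posted write) */
    return 0;
}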

- Everything works fine if I ioremap the memory area in kernel code. I use the classic ioread8, ioread16, ioread32 and iowrite8, iowrite16, iowrite32 accessors to access the data.

- To make it faster, I want to mmap my device memory into user space. I therefore implemented the mmap function in my module, and I get a void* pointer in user space to access the memory. I'm seeing strange behavior:
- If I cast that void* to uint32_t* and then perform memory accesses simply with ptr[i] = bla, I can read back the data I've just written without any trouble.
- If I cast that void* to uint8_t* and do the same, the memory I read back is completely out of order. For instance, if I assign 0, 1, 2, 3 to the first four values of the array, I read back 2, 1, 0, 3.
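A stripped-down version of the user-space test looks like this (the device node name and the mapping length here are placeholders):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/bla", O_RDWR);              /* hypothetical device node */
    if (fd < 0)
        return 1;

    void *base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    volatile uint32_t *p32 = base;                  /* 32-bit accesses: fine */
    p32[0] = 0x12345678;
    printf("dword: 0x%08x\n", (unsigned)p32[0]);

    volatile uint8_t *p8 = base;                    /* 8-bit accesses: come back reordered */
    for (int i = 0; i < 4; i++)
        p8[i] = i;
    for (int i = 0; i < 4; i++)
        printf("byte %d: %u\n", i, (unsigned)p8[i]);

    munmap(base, 4096);
    close(fd);
    return 0;
}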

I suspect the problem comes from caching, but I do apply pgprot_noncached to the vm_page_prot field in my module's mmap function:

static int BLA_mmap(struct file *pFile, struct vm_area_struct *pVma)
{
    /* nResourceStart is the value returned by pci_resource_start */
    pVma->vm_page_prot = pgprot_noncached(pVma->vm_page_prot);
    pVma->vm_flags |= VM_RESERVED | VM_IO;

    return remap_pfn_range(pVma, pVma->vm_start,
                           nResourceStart >> PAGE_SHIFT,
                           pVma->vm_end - pVma->vm_start,
                           pVma->vm_page_prot);
}


Any help appreciated!

Regards,

Marc

sundialsvcs 01-20-2009 09:01 AM

These are consecutive bytes, and the problem isn't "little-endian vs. big-endian"?

I would think, categorically, that if you are "talking directly to a device," you need to be doing that in kernel-space. If you then want to move the data to someplace accessible by user-space to facilitate access by, say, a user-land "helper daemon," that's okay. But I'm not sure that "putting user-land in the driver's seat" is going to be satisfactory or reliable, especially under load.

Also bear in mind the internal caching and reordering behavior of a modern pipelined CPU: "memory barriers."

See: /usr/src/linux/Documentation/memory-barriers.txt.

When you are dealing with memory-mapped I/O, you need to be sure that the bytes you're writing have really been written, and that the bytes you're reading are really fresh.
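For example, something along these lines in the driver (a sketch only; the register offsets and the iomapped pointer are hypothetical):

#include <linux/types.h>
#include <linux/io.h>
#include <asm/system.h>     /* wmb() on 2.6-era kernels */

/* 'regs' is a void __iomem * obtained from ioremap(). */
static void bla_kick_device(void __iomem *regs, u32 param)
{
    iowrite32(param, regs + 0x00);   /* write a parameter register          */
    wmb();                           /* order it before the doorbell write  */
    iowrite32(1, regs + 0x04);       /* "go" doorbell                       */

    /* A read from the device flushes posted PCIe writes, so the command
     * is known to have reached the board before we continue. */
    (void)ioread32(regs + 0x04);
}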

marc_ba 01-21-2009 01:35 AM

Thanks for your answer.

The problem is definitely not "little endian vs big endian".

Architecturally speaking, the problem is that I have to talk to my device very often (about 100 times per second), and I don't want to issue 100 ioctls per second (from user space to kernel space) to send control commands. I assume it would be much faster to map the device memory into user space. Is that correct?

In a few words, is there a way to ensure that the data is written to my device as soon as I perform the assignment in user space (such that ptr[i] = bla immediately translates into a PCIe write request)?
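In other words, I'd like something like this to work on the user-space side (a sketch; the names are placeholders, and whether the read-back is needed is part of my question):

#include <stdint.h>

static volatile uint32_t *regs;     /* points into the mmap()ed BAR */

static void bla_write_reg(unsigned int idx, uint32_t val)
{
    regs[idx] = val;                /* non-cached store -> should become a PCIe write */
    (void)regs[idx];                /* read back: forces the posted write out and
                                       stalls until the device has answered */
}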


Marc

sundialsvcs 01-22-2009 08:20 AM

That's an interesting idea.

I don't think that you'll actually get into trouble doing "100 ioctl calls per second," except for the issue of dispatcher latency: on a busy system, you might not get "100 opportunities to run" in any given second. You can't count on that.

So, to my way of thinking, some kind of queueing arrangement is needed: a circular buffer, say, so that you can tell the device, "go send these 100 things as quickly as the device will accept them," or, "give me the latest things that the device has sent."

To actually drive the device, then, you use a kernel thread. This thread will keep track of the status of the device at all times, feeding it information and gathering results. The ioctl() calls allow you to add to the buffer and to retrieve things from it, properly synchronizing their activities with those of the kernel thread.
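A rough sketch of what I mean (every name, size, and the one-word command format here is made up):

#include <linux/errno.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/io.h>

#define BLA_RING_SIZE 128                      /* must be a power of two */

struct bla_dev {
    u32                 ring[BLA_RING_SIZE];
    unsigned int        head, tail;            /* producer / consumer indices */
    spinlock_t          lock;
    wait_queue_head_t   wq;
    struct task_struct *thread;
    void __iomem       *regs;                  /* from ioremap() of BAR 0 */
};

/* ioctl() path: enqueue one command word. */
static int bla_enqueue(struct bla_dev *dev, u32 cmd)
{
    int ret = 0;

    spin_lock_irq(&dev->lock);
    if (dev->head - dev->tail >= BLA_RING_SIZE) {
        ret = -EBUSY;                          /* ring is full */
    } else {
        dev->ring[dev->head % BLA_RING_SIZE] = cmd;
        dev->head++;
    }
    spin_unlock_irq(&dev->lock);

    if (!ret)
        wake_up(&dev->wq);
    return ret;
}

/* Kernel thread: feed queued commands to the device as it accepts them. */
static int bla_thread(void *data)
{
    struct bla_dev *dev = data;

    while (!kthread_should_stop()) {
        wait_event_interruptible(dev->wq,
                dev->head != dev->tail || kthread_should_stop());

        spin_lock_irq(&dev->lock);
        while (dev->tail != dev->head) {
            u32 cmd = dev->ring[dev->tail % BLA_RING_SIZE];
            dev->tail++;
            iowrite32(cmd, dev->regs);         /* hypothetical command register */
        }
        spin_unlock_irq(&dev->lock);
    }
    return 0;
}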

Although a device could "watch memory," there could be issues with that vis-à-vis virtual-memory paging and so forth. But you could implement the idea in user-land with a C++ (or some such) class...
