Linux - Kernel: This forum is for all discussion relating to the Linux kernel.
Hello all, I have a big problem with specialized HW and its driver, related to RAM access. For data acquisition purposes we have developed a PCI card capable of writing several GB/s into the RAM of a PC. To read out this data we have written a special driver that grants user-level code memory-mapped access to the RAM. This works OK, but with newer (for me) kernels the read throughput drops by a factor of 4.
I reserve big blocks of memory (several GB) at boot time via the "mem=" parameter. Linux does not see this memory, but our driver can still map it via the remap_pfn_range kernel routine. I then access this memory via memory-mapped CPU read cycles. Usually the data is sent to a second machine via 100 Mb or 1 Gb Ethernet, so I need a read throughput of at most 110 MB/s. This used to be fine with old kernels. With newer kernels I get at most ~40 MB/s.
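For readers who want to reproduce this kind of measurement, here is a rough user-space sketch that maps a region read-only and times a sequential read. It is not the test program used in this thread; the function name and the chunk size are made up for illustration, and copying through Python only approximates raw CPU read cycles.

```python
import mmap
import os
import time


def read_throughput_mb_s(path, length, offset=0, chunk=1 << 20):
    """Map `length` bytes of `path` read-only and time a sequential read.

    Returns the throughput in MB/s (1 MB = 10**6 bytes).
    `offset` must be a multiple of mmap.ALLOCATIONGRANULARITY.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, length, access=mmap.ACCESS_READ, offset=offset) as m:
            start = time.perf_counter()
            total = 0
            for pos in range(0, length, chunk):
                # Slicing copies the bytes out of the mapping, forcing a read.
                total += len(m[pos:pos + chunk])
            elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total / 1e6 / max(elapsed, 1e-9)
```

Pointed at the device node exported by a driver like the one described here (say a hypothetical /dev/daqmem), the same call on the old and the new kernel would show whether the factor 4 is visible from user space.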
The RAM above 0x100000000 (4G) is reserved for our driver. The machine has 8G installed, which means the driver has 0x12fffffff+1 bytes available (a bit more than 4G: the upper 4 GB plus the RAM re-mapped from the BIOS memory hole). So far so good.
I have tried to enable write-combining for the RAM pages (via pgprot_writecombine) but all I got was a speedup of the write cycles; read cycles are still slower by a factor of 4 (max ~40 MB/s vs. ~120 MB/s with the old kernel).
Now, I think that with the newer kernel the processor is accessing this RAM word by word rather than one cache line at a time. With the old kernel, as the system still considered the locations as "System RAM", they were still accessed cache line by cache line. This would explain the factor 4 in the read time.
The MTRR setup for the new system (similar to the old system's):
The RAM block I used for the tests, which starts at 0x106000000, is set up as write-combining, which is the best policy I could think of. I also tried other policies (write-through, write-back, uncachable) without success (as far as read access time is concerned).
So, my question: is it possible to set up the RAM block as normal system RAM in the newer kernel, the same way the old kernel does?
Thanks to all for any hints. I am available for extra information if needed.
P.S. On my only AMD-based machine (AMD Athlon(tm) 64 X2 Dual Core Processor 4200+) I do not see the slowdown: reading runs at 120 MB/s with both kernels. On this machine I do not see MTRR/PAT entries associated with the reserved RAM block. Unfortunately I do not have a second AMD-based host to run a second check. All the other machines I can use for my tests are Xeon-based.
P.P.S. I have also tried to boot the kernel with PAT disabled: same low throughput (40 MB/s).
This may be a really stupid question, but are you using the 64 bit kernel for the Xeon tests? Do these processors have the x64 extensions? The reason I ask is that you are mapping above the 4G boundary so a 32 bit machine would have to jump thru special routines to access that memory.
I did a "perf stat -a -ddd" on a simple program that accesses a memory block (edited: of 1 GB) in R/W. The code is the same (a memory pointer gets mapped and then the memory is read/written) and is re-compiled to point either to an IPC block (cached, fast access) or to the off-Linux block (apparently not cached for reads). Here are the results.
I have some important news on this subject. I found out that if I limit the mapping of the off-Linux memory block to the physical RAM installed on the system (excluding the RAM re-mapped from the BIOS memory hole) then I get the proper caching.
Take for example a system with 8 GB, 4 GB for Linux and 4 GB for the special driver. In reality, the off-Linux memory will contain a bit more than 4 GB as we can see from the actual mapping:
In the above example, the off-Linux memory block covers the range 0x0000000100000000 - 0x0000000230000000, and therefore 0x0000000030000000 bytes of it are RAM re-mapped from the BIOS memory hole.
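The sizes involved are easy to check by hand. The short calculation below uses only the addresses quoted in this post, and assumes that the re-mapped hole memory sits directly above the 8 GB of installed RAM, as the quoted range suggests.

```python
GIB = 1 << 30
MIB = 1 << 20

block_start = 0x100000000  # first byte reserved for the driver (4 GiB)
block_end = 0x230000000    # end of the off-Linux block (exclusive)
installed = 8 * GIB        # physical RAM installed

block_size = block_end - block_start
remapped = block_end - installed  # part re-mapped from the BIOS memory hole

print(hex(block_size), block_size / GIB)  # 0x130000000 4.75
print(hex(remapped), remapped // MIB)     # 0x30000000 768
```

So the driver has 4.75 GiB at its disposal, of which the top 768 MiB come from the hole re-mapping.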
Now, if in my driver I allocate memory without the memory from the BIOS hole, I see (from the PAT debug):
Code:
reserve_memtype added 0x106000000-0x200000000, track write-back, req write-back, ret write-back
In this configuration, memory access time in the new kernel is as fast as in the old kernel.
On the same system, if I allocate memory that also includes part of the BIOS hole (even a few bytes) I see the following:
Code:
reserve_memtype added 0x106000000-0x2006ba000, track uncached-minus, req write-back, ret uncached-minus
and the access time goes through the roof.
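When there are many such trace lines, it helps to flag automatically the ones where the returned type differs from the requested one. The sketch below parses only the line format quoted in this thread; other kernel versions may print the range differently, so the pattern is an assumption, and the names MemtypeEntry and parse_reserve_memtype are invented for this example.

```python
import re
from typing import NamedTuple, Optional

# Matches the "reserve_memtype added" lines as quoted in this thread.
LINE_RE = re.compile(
    r"reserve_memtype added 0x([0-9a-f]+)-0x([0-9a-f]+), "
    r"track ([\w-]+), req ([\w-]+), ret ([\w-]+)"
)


class MemtypeEntry(NamedTuple):
    start: int
    end: int
    track: str
    req: str
    ret: str

    @property
    def downgraded(self) -> bool:
        """True when the kernel returned a type other than the requested one."""
        return self.req != self.ret


def parse_reserve_memtype(line: str) -> Optional[MemtypeEntry]:
    m = LINE_RE.search(line)
    if m is None:
        return None  # not a "reserve_memtype added" line
    start, end, track, req, ret = m.groups()
    return MemtypeEntry(int(start, 16), int(end, 16), track, req, ret)


line = ("reserve_memtype added 0x106000000-0x2006ba000, "
        "track uncached-minus, req write-back, ret uncached-minus")
entry = parse_reserve_memtype(line)
print(hex(entry.end), entry.downgraded)  # 0x2006ba000 True
```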
Another interesting thing is that if I first map one single page from the block and then the rest of the block (including the part re-mapped from the BIOS memory hole), then caching is set as I want:
Code:
reserve_memtype added 0x106000000-0x106001000, track write-back, req write-back, ret write-back
Overlap at 0x106000000-0x106001000
reserve_memtype added 0x106000000-0x230000000, track write-back, req write-back, ret write-back
It's as if the kernel chose the setting already in place for the first page for the following pages too, as if it were used as a default.
For the moment my personal conclusion is that the kernel gets confused when a mapped block contains memory with two default caching modes (write-back for the upper RAM and uncached for the BIOS memory hole) and makes an arbitrary choice (the one I do not want), whereas if any location in the block already has a cache mode in place, that mode is used for the whole block.
More investigations will come, but to me this sounds like an undocumented feature of the Linux kernel.
Now, with the device driver I can only set the cache as write-through. I could not (yet) find a way to set it to write-back. The write performance in the two modes is almost identical; what changes radically is the read access. Does anybody know how to set a block of RAM as cached write-back?
We did more checks on other machines and what we found is not very conclusive.
We got several "uncached" results when the memory block being remapped crosses the 4 GB memory boundary (which is consistent with the findings above).
Unfortunately we also got "uncached" when crossing a 4 GB memory boundary inside the off-Linux RAM (e.g. when allocating a 4 GB block starting at 6 GB on a system with 32 GB). In other words there is a "cross-blocks" effect which is not always at the end of the physical RAM. To make things worse, we could also get cached blocks which did span different 4 GB blocks. There is something behind the kernel's decision to cache or not cache the mapped memory that I cannot yet figure out.
I fear that my only way out would be to explicitly request that Linux cache the remapped block as write-back. If I only knew how :-(
Some progress. The boundaries where we got "uncached" memory all correspond to crossing points between MTRR registers. The setup for the last test system mentioned above is:
Well, every time a block of memory mapped by our driver lies across 2 or more registers (e.g. we try to map the area 0x200000000 - 0x400001000) the memory block comes out uncached, while if we remain within a single register then all is OK. This smells a lot like a bug in the area of the MTRR routines (mtrr_type_lookup? pat_x_mtrr_type?). Has anything changed in that area recently?
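To check a planned mapping against the register layout before loading the driver, something like the following could be used. It parses text in the /proc/mtrr format and lists the registers a range overlaps. The register layout in HYPOTHETICAL_MTRR is invented for illustration (it is not the listing of the test system above), and the helper names are made up as well.

```python
import re
from typing import NamedTuple

# One line of /proc/mtrr, e.g.
# reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
MTRR_RE = re.compile(
    r"reg(\d+): base=0x([0-9a-f]+) \(\s*\d+MB\), "
    r"size=\s*(\d+)([KM])B, count=\d+: (\S+)"
)
UNIT = {"K": 1 << 10, "M": 1 << 20}


class Mtrr(NamedTuple):
    reg: int
    base: int
    size: int
    mtype: str


def parse_mtrr(text):
    regs = []
    for m in MTRR_RE.finditer(text):
        reg, base, size, unit, mtype = m.groups()
        regs.append(Mtrr(int(reg), int(base, 16), int(size) * UNIT[unit], mtype))
    return regs


def registers_spanned(start, end, regs):
    """Numbers of the registers overlapping the half-open range [start, end)."""
    return [r.reg for r in regs if r.base < end and start < r.base + r.size]


# Hypothetical layout, NOT taken from the machines discussed here.
HYPOTHETICAL_MTRR = """\
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg03: base=0x200000000 ( 8192MB), size= 8192MB, count=1: write-back
reg04: base=0x400000000 (16384MB), size=16384MB, count=1: write-back
"""

regs = parse_mtrr(HYPOTHETICAL_MTRR)
print(registers_spanned(0x200000000, 0x400000000, regs))  # [3]
print(registers_spanned(0x200000000, 0x400001000, regs))  # [3, 4]
```

With this layout, extending the mapping by a single page past 0x400000000 is enough to touch a second register, which is the situation where the block came out uncached.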
The above is correct as it covers all the Linux RAM between 0x000000000 and 0x100000000 (4GB) plus the off-Linux RAM between 0x100000000 and 0x230000000 (4+GB including the RAM @ the BIOS memory hole). With this setup, allocating memory between 0x106000000 and 0x200000000 we get write-back while allocating memory between 0x106000000 and 0x20035c000 we get uncached. We can confirm this by looking at the PAT debug trace:
Code:
reserve_memtype added 0x106000000-0x1fe8a6000, track write-back, req write-back, ret write-back
reserve_memtype added 0x106000000-0x20035c000, track uncached-minus, req write-back, ret uncached-minus
This setup is basically the same as before but we have now registers 4 and 5 that cover what before was covered by register 4 alone.
I now allocate a block that lies within one single register with the first setup and across two registers with the second setup.
This is the PAT trace for the original MTRR setup:
Code:
reserve_memtype added 0x106000000-0x1fe8a6000, track write-back, req write-back, ret write-back
This is the PAT trace for the modified MTRR setup:
Code:
reserve_memtype added 0x106000000-0x1fe8a6000, track uncached-minus, req write-back, ret uncached-minus
The two MTRR setups are 100% equivalent. Yet PAT sees them differently. It looks like we have a confirmed bug in the way PAT interprets the MTRR setup.
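What "equivalent" means here can be made precise: two layouts describe the same memory types if they are identical once adjacent registers of the same type are merged. The sketch below does that merge for plain, non-overlapping ranges; it does not model the precedence rules for overlapping MTRRs, and both layouts shown are hypothetical stand-ins for the two setups compared above.

```python
def coalesce(regs):
    """Merge adjacent (base, size, mtype) ranges that share a memory type."""
    merged = []
    for base, size, mtype in sorted(regs):
        if merged and merged[-1][2] == mtype and merged[-1][0] + merged[-1][1] == base:
            prev_base, prev_size, _ = merged[-1]
            merged[-1] = (prev_base, prev_size + size, mtype)
        else:
            merged.append((base, size, mtype))
    return merged


# Hypothetical: one register covering 4 GiB - 8 GiB ...
one_register = [(0x100000000, 0x100000000, "write-back")]
# ... versus the same area covered by two adjacent registers.
two_registers = [
    (0x100000000, 0x80000000, "write-back"),
    (0x180000000, 0x80000000, "write-back"),
]

print(coalesce(one_register) == coalesce(two_registers))  # True
```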
Update. We did some extra checks on the issue and could confirm that PAT cannot correctly handle two adjacent blocks covered by two MTRRs with the same setting.
What we did next was to try to split the map into separate blocks, each covered by a single MTRR. Unfortunately this did not work due to another problem: PAT completely ignored the mapping request. The map completed OK, in the sense that we could access the memory, but caching on all of the blocks was undefined (which means they were treated as "uncached"). We cannot understand why the remap_pfn_range call, which returned status OK, ended up being ignored by the PAT module. This is very unfortunate as we do, for other reasons, rely on multiple maps to a single VM address space (to cover, for example, the BIOS memory hole) and this alas ends up uncached as well (as we could confirm with other tests).
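For reference, the split itself is simple to compute. The sketch below cuts a physical range at register boundaries, assuming non-overlapping registers that fully cover the range; the register list is hypothetical. It only computes the pieces: as reported above, mapping them separately did not restore caching on the affected kernels.

```python
def split_by_register(start, end, regs):
    """Cut [start, end) at MTRR boundaries; regs holds (base, size) pairs."""
    pieces = []
    for base, size in sorted(regs):
        lo, hi = max(start, base), min(end, base + size)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


# Hypothetical registers: 8 GiB - 16 GiB and 16 GiB - 32 GiB.
regs = [(0x200000000, 0x200000000), (0x400000000, 0x400000000)]

for lo, hi in split_by_register(0x200000000, 0x400001000, regs):
    print(hex(lo), hex(hi))
# 0x200000000 0x400000000
# 0x400000000 0x400001000
```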