Access mode to off-Linux RAM

divia · 05-02-2013, 07:00 AM

Hello all, I have a big problem with specialized HW and its driver, related to RAM access. For data acquisition purposes we have developed a PCI card capable of writing several GB/s into the RAM of a PC. To read out this data we have written a special driver that gives user-level code memory-mapped access to the RAM. This works OK, but with newer (for me) kernels the read throughput drops by a factor of 4.

I reserve big blocks of memory (several GB) at boot time via the "mem=" parameter. Linux does not see this memory, but our driver can still map it via the remap_pfn_range kernel routine. I then access this memory via memory-mapped CPU read cycles. Usually the data is sent to a second machine via 100 Mb/s or 1 Gb/s Ethernet, so I need to read at a maximum of about 110 MB/s. This used to be fine with old kernels. With newer kernels I get at most ~40 MB/s.

In the old kernel (2.6.18-348.3.1) I see:

Code:

> cat /proc/iomem | tail -5l
fec00000-fec0ffff : reserved
fee00000-fee00fff : reserved
ff800000-ffbfffff : reserved
fffffc00-ffffffff : reserved
100000000-22fffffff : System RAM

The RAM above 0x100000000 (4G) is reserved for our driver. The machine has 8G installed, which means the driver has 0x12fffffff+1 bytes available (a bit more than 4G: the upper 4 GB plus the RAM remapped from the BIOS memory hole). So far so good.
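
As a quick check of the arithmetic, here is a minimal Python sketch that computes the size of the reserved block from the /proc/iomem line above. It assumes the end address printed by /proc/iomem is inclusive; the variable names are purely illustrative.

```python
# Range taken from the old kernel's /proc/iomem (end address assumed inclusive).
start = 0x100000000
end_inclusive = 0x22fffffff

GIB = 1 << 30
size_bytes = end_inclusive - start + 1

print(hex(size_bytes))             # size of the reserved block in bytes
print(size_bytes / GIB)            # the same size expressed in GiB
print(hex(size_bytes - 4 * GIB))   # how much the block exceeds 4 GiB
```

The excess over 4 GiB is the part that the BIOS relocates above the end of the installed RAM.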

In the new kernel I see:

Code:

> cat /proc/iomem | tail -5l
fec85400-fec857ff : IOAPIC 4
fee00000-fee00fff : Local APIC
  fee00000-fee00fff : reserved
ff800000-ffbfffff : reserved
fffffc00-1ffffffff : reserved

As you can see, the last block (the reserved RAM) is no longer visible. I think this is where my problem lies (see below)...

Boot log of the old kernel:

Code:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000010000 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000cff70000 (usable)
 BIOS-e820: 00000000cff70000 - 00000000cff78000 (ACPI data)
 BIOS-e820: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000230000000 (usable)

No NUMA configuration found
Faking a node at 0000000000000000-0000000100000000
Bootmem setup node 0 0000000000000000-0000000100000000

Similar snippet for the new kernel:

Code:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000cff70000 (usable)
 BIOS-e820: 00000000cff70000 - 00000000cff78000 (ACPI data)
 BIOS-e820: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
e820 remove range: 0000000100000000 - ffffffffffffffff (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 000000000009b800 (usable)
 user: 000000000009b800 - 00000000000a0000 (reserved)
 user: 00000000000e4000 - 0000000000100000 (reserved)
 user: 0000000000100000 - 00000000cff70000 (usable)
 user: 00000000cff70000 - 00000000cff78000 (ACPI data)
 user: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 user: 00000000cff80000 - 00000000d0000000 (reserved)
 user: 00000000e0000000 - 00000000f0000000 (reserved)
 user: 00000000fec00000 - 00000000fec10000 (reserved)
 user: 00000000fee00000 - 00000000fee01000 (reserved)
 user: 00000000ff800000 - 00000000ffc00000 (reserved)
 user: 00000000fffffc00 - 0000000200000000 (reserved)

MTRR default type: uncachable
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-C7FFF write-protect
  C8000-E3FFF uncachable
  E4000-FFFFF write-protect
MTRR variable ranges enabled:
  0 base 0D0000000 mask FF0000000 uncachable
  1 base 0E0000000 mask FE0000000 uncachable
  2 base 000000000 mask E00000000 write-back
  3 base 200000000 mask FE0000000 write-back
  4 base 220000000 mask FF0000000 write-back
  5 base 0CFF80000 mask FFFF80000 uncachable
  6 disabled
  7 disabled
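
The base/mask pairs in the "MTRR variable ranges" listing can be turned into start addresses and sizes with a few lines of Python. This is only a sketch: it assumes 36 physical address bits (a guess based on the 9-hex-digit masks) and contiguous masks, and the helper name mtrr_range is made up for illustration.

```python
PHYS_BITS = 36  # assumption: 9-hex-digit masks suggest 36 physical address bits
ADDR_MASK = (1 << PHYS_BITS) - 1

def mtrr_range(base, mask):
    """Return (start, size) of a variable MTRR, assuming a contiguous mask."""
    size = (~mask & ADDR_MASK) + 1
    return base & mask, size

# (base, mask, type) copied from the boot log above.
variable_mtrrs = [
    (0x0D0000000, 0xFF0000000, "uncachable"),
    (0x0E0000000, 0xFE0000000, "uncachable"),
    (0x000000000, 0xE00000000, "write-back"),
    (0x200000000, 0xFE0000000, "write-back"),
    (0x220000000, 0xFF0000000, "write-back"),
    (0x0CFF80000, 0xFFFF80000, "uncachable"),
]

for base, mask, mtype in variable_mtrrs:
    start, size = mtrr_range(base, mask)
    print(f"{start:#011x}  {size >> 10:>8} KiB  {mtype}")
```

Under these assumptions register 2 comes out as a single 8 GiB write-back range starting at 0, with the uncachable registers punching holes into it below 4 GiB.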

I have tried to enable write-combining for the RAM pages (via pgprot_writecombine), but all I got was a speedup of the write cycles. Read cycles are still slower by a factor of 4 (max ~40 MB/s vs. ~120 MB/s with the old kernel).

Now, I think that with the newer kernel the processor is accessing this RAM word-by-word rather than one cache-line at a time. In the old kernel, as the system was still considering the locations as "System RAM", the locations were still accessed cache-line by cache-line. This would explain the factor 4 in the read time.

The MTRR setup for the new system (similar to the old system's):

Code:

# cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-back
reg03: base=0x0cff80000 ( 3327MB), size=  512KB, count=1: uncachable
reg04: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg05: base=0x200000000 ( 8192MB), size=  512MB, count=1: write-back
reg06: base=0x220000000 ( 8704MB), size=  256MB, count=1: write-back

The PAT setup for the new kernel (PAT did not yet exist in the old kernel):

Code:

# cat /sys/kernel/debug/x86/pat_memtype_list
PAT memtype list:
write-back @ 0xcff72000-0xcff74000
write-back @ 0xcff73000-0xcff78000
write-back @ 0xcff78000-0xcff79000
uncached-minus @ 0xdd001000-0xdd002000
uncached-minus @ 0xdd200000-0xdd210000
uncached-minus @ 0xdd300000-0xdd310000
uncached-minus @ 0xe0000000-0xf0000000
uncached-minus @ 0xfed00000-0xfed01000
write-combining @ 0x106000000-0x230000000

The RAM block I used for the tests, which starts at 0x106000000, is set up as write-combining, which is the best policy I could think of. I also tried other policies (write-through, write-back, uncachable) without success (as far as read access time is concerned).
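
To check which memory type PAT has recorded for a given physical address, the listing above can be parsed with a small script. The sketch below works on a copy of the text shown in this post, treats the end address of each entry as exclusive (an assumption), and uses made-up helper names.

```python
import re

PAT_LIST = """\
PAT memtype list:
write-back @ 0xcff72000-0xcff74000
write-back @ 0xcff73000-0xcff78000
write-back @ 0xcff78000-0xcff79000
uncached-minus @ 0xdd001000-0xdd002000
uncached-minus @ 0xdd200000-0xdd210000
uncached-minus @ 0xdd300000-0xdd310000
uncached-minus @ 0xe0000000-0xf0000000
uncached-minus @ 0xfed00000-0xfed01000
write-combining @ 0x106000000-0x230000000
"""

ENTRY = re.compile(r"^(\S+) @ (0x[0-9a-f]+)-(0x[0-9a-f]+)$")

def parse_memtype_list(text):
    """Return a list of (type, start, end) tuples; the header line is skipped."""
    entries = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            entries.append((m.group(1), int(m.group(2), 16), int(m.group(3), 16)))
    return entries

def memtypes_at(entries, addr):
    """Types of all entries covering addr (end assumed exclusive)."""
    return [mtype for mtype, start, end in entries if start <= addr < end]

entries = parse_memtype_list(PAT_LIST)
print(memtypes_at(entries, 0x106000000))
```

On a live system the same parser could be fed the contents of /sys/kernel/debug/x86/pat_memtype_list instead of the embedded string.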

So, my question: is it possible in the newer kernel to set up the RAM block the way normal system RAM is set up, as the old kernel did?

Thanks to all for any hint. I am available for extra information if this is requested.

P.S. On my only AMD-based machine (AMD Athlon(tm) 64 X2 Dual Core Processor 4200+) I do not see the slowdown: reading goes at 120 MB/s with both kernels. On this machine I do not see MTRR/PAT entries associated with the reserved RAM block. Unfortunately I do not have a second AMD-based host to run a second check. All the other machines I can use for my tests are Xeon-based.

P.P.S. I have also tried to boot the kernel with PAT disabled: same low throughput (40 MB/s).

rmolinger · 05-02-2013, 06:28 PM

This may be a really stupid question, but are you using the 64 bit kernel for the Xeon tests? Do these processors have the x64 extensions? The reason I ask is that you are mapping above the 4G boundary so a 32 bit machine would have to jump thru special routines to access that memory.

Cheers
R.

divia · 05-03-2013, 12:22 AM

Hi Randy. The answer is yes: 64 bit kernel and 64 bit processors. The same machines run happily with the old kernel.

Ciao,
Roberto

divia · 05-03-2013, 02:10 AM

For completeness, here is the HW description of one of the machines I used for the tests:

Code:

    description: Computer
    product: X6DH8-XB
    vendor: Supermicro
    version: 0123456789
    serial: 0123456789
    width: 64 bits
    capabilities: smbios-2.33 dmi-2.33 vsyscall64 vsyscall32
    configuration: administrator_password=enabled boot=oem-specific frontpanel_password=unknown keyboard_password=unknown power-on_password=disabled uuid=80F1E964-DC63-0010-89D5-00304877F616
  *-core
       description: Motherboard
       product: X6DH8-XB
       vendor: Supermicro
       physical id: 0
       version: PCB Version
       serial: OM65S00361
     *-firmware
          description: BIOS
          vendor: Phoenix Technologies LTD
          physical id: 0
          version: 6.00 (01/24/2006)
          size: 109KiB
          capacity: 960KiB
          capabilities: isa pci pnp upgrade shadowing escd cdboot bootselect edd int13floppy2880 acpi usb ls120boot zipboot biosbootspecification
     *-cpu:0
          description: CPU
          product: Intel(R) Xeon(TM) CPU 2.80GHz
          vendor: Intel Corp.
          physical id: 4
          bus info: cpu@0
          version: Intel(R) Xeon(TM) CPU 2.80GHz
          slot: CPU1
          size: 2800MHz
          capacity: 4GHz
          width: 64 bits
          clock: 200MHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx x86-64 constant_tsc pebs bts pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm cpufreq
        *-cache:0
             description: L1 cache
             physical id: 6
             slot: L1 Cache
             size: 16KiB
             capacity: 16KiB
             capabilities: asynchronous internal write-back
        *-cache:1
             description: L2 cache
             physical id: 7
             slot: L2 Cache
             size: 2MiB
             capabilities: burst internal write-back
     *-cpu:1
          description: CPU
          product: Intel(R) Xeon(TM) CPU 2.80GHz
          vendor: Intel Corp.
          physical id: 5
          bus info: cpu@1
          version: Intel(R) Xeon(TM)
          slot: CPU2
          size: 2800MHz
          capacity: 4GHz
          width: 64 bits
          clock: 200MHz
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx x86-64 constant_tsc pebs bts pni dtes64 monitor ds_cpl cid cx16 xtpr lahf_lm cpufreq
     *-memory
          description: System Memory
          physical id: 16
          slot: System board or motherboard
          size: 8GiB
        *-bank:0
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 0
             slot: DIMM#1A
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:1
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 1
             slot: DIMM#2A
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:2
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 2
             slot: DIMM#3A
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:3
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 3
             slot: DIMM#4A
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:4
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 4
             slot: DIMM#1B
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:5
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 5
             slot: DIMM#2B
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:6
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 6
             slot: DIMM#3B
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)
        *-bank:7
             description: DIMM DDR Synchronous 333 MHz (3.0 ns)
             physical id: 7
             slot: DIMM#4B
             size: 1GiB
             width: 64 bits
             clock: 333MHz (3.0ns)

divia · 05-03-2013, 05:36 AM

I did a "perf stat -a -ddd" on a simple program that accesses a memory block (edited: of 1 GB) in R/W. The code is the same (a memory pointer gets mapped and then the memory is read/written) and is re-compiled to point either to an IPC block (cached, fast access) or to the off-Linux block (apparently not cached for reads). Here are the results.

R/W from/to IPC:

Code:

       5582.423317 task-clock                #    2.000 CPUs utilized           [100.00%]
               189 context-switches          #    0.034 K/sec                   [100.00%]
                 7 CPU-migrations            #    0.001 K/sec                   [100.00%]
           262,283 page-faults               #    0.047 M/sec                  
     5,196,609,438 cycles                    #    0.931 GHz                     [ 7.38%]
                 0 stalled-cycles-frontend   #    0.00% frontend cycles idle    [18.48%]
                 0 stalled-cycles-backend    #    0.00% backend  cycles idle    [27.68%]
       243,130,332 instructions              #    0.05  insns per cycle         [39.24%]
       452,218,432 branches                  #   81.008 M/sec                   [47.84%]
       171,684,844 branch-misses             #   37.97% of all branches         [55.58%]
   <not supported> L1-dcache-loads         
       145,365,573 L1-dcache-load-misses     #    0.00% of all L1-dcache hits   [64.39%]
   <not supported> LLC-loads               
       130,387,299 LLC-load-misses           #    0.00% of all LL-cache hits    [73.18%]
   <not supported> L1-icache-loads         
   <not supported> L1-icache-load-misses   
   <not supported> dTLB-loads              
       117,182,384 dTLB-load-misses          #    0.00% of all dTLB cache hits  [81.43%]
     <not counted> iTLB-loads              
     <not counted> iTLB-load-misses        
   <not supported> L1-dcache-prefetches    
   <not supported> L1-dcache-prefetch-misses

       2.790536226 seconds time elapsed

R/W from/to the off-Linux RAM:

Code:

      53561.134173 task-clock                #    2.000 CPUs utilized           [100.00%]
             1,457 context-switches          #    0.027 K/sec                   [100.00%]
                10 CPU-migrations            #    0.000 K/sec                   [100.00%]
               179 page-faults               #    0.003 K/sec                  
        33,678,834 cycles                    #    0.001 GHz                     [ 9.49%]
                 0 stalled-cycles-frontend   #    0.00% frontend cycles idle    [17.32%]
                 0 stalled-cycles-backend    #    0.00% backend  cycles idle    [25.75%]
         9,344,709 instructions              #    0.28  insns per cycle         [34.38%]
       439,320,527 branches                  #    8.202 M/sec                   [42.24%]
         6,286,433 branch-misses             #    1.43% of all branches         [51.33%]
   <not supported> L1-dcache-loads         
         4,770,538 L1-dcache-load-misses     #    0.00% of all L1-dcache hits   [60.66%]
   <not supported> LLC-loads               
         4,544,135 LLC-load-misses           #    0.00% of all LL-cache hits    [71.26%]
   <not supported> L1-icache-loads         
   <not supported> L1-icache-load-misses   
   <not supported> dTLB-loads              
         3,979,059 dTLB-load-misses          #    0.00% of all dTLB cache hits  [81.38%]
     <not counted> iTLB-loads              
     <not counted> iTLB-load-misses        
   <not supported> L1-dcache-prefetches    
   <not supported> L1-dcache-prefetch-misses

      26.779970096 seconds time elapsed

Read from IPC:

Code:

       3930.403313 task-clock                #    2.001 CPUs utilized           [100.00%]
               374 context-switches          #    0.095 K/sec                   [100.00%]
                 7 CPU-migrations            #    0.002 K/sec                   [100.00%]
           262,293 page-faults               #    0.067 M/sec                  
     1,021,440,012 cycles                    #    0.260 GHz                     [ 9.44%]
                 0 stalled-cycles-frontend   #    0.00% frontend cycles idle    [16.17%]
                 0 stalled-cycles-backend    #    0.00% backend  cycles idle    [25.29%]
       279,593,249 instructions              #    0.27  insns per cycle         [34.49%]
       377,632,038 branches                  #   96.080 M/sec                   [43.09%]
       193,916,829 branch-misses             #   51.35% of all branches         [49.74%]
   <not supported> L1-dcache-loads         
       162,698,461 L1-dcache-load-misses     #    0.00% of all L1-dcache hits   [58.14%]
   <not supported> LLC-loads               
       150,247,109 LLC-load-misses           #    0.00% of all LL-cache hits    [64.20%]
   <not supported> L1-icache-loads         
   <not supported> L1-icache-load-misses   
   <not supported> dTLB-loads              
       132,995,436 dTLB-load-misses          #    0.00% of all dTLB cache hits  [72.52%]
     <not counted> iTLB-loads              
     <not counted> iTLB-load-misses        
   <not supported> L1-dcache-prefetches    
   <not supported> L1-dcache-prefetch-misses

       1.964593286 seconds time elapsed

Read from the off-Linux RAM:

Code:

      52572.069728 task-clock                #    2.000 CPUs utilized           [100.00%]
             1,698 context-switches          #    0.032 K/sec                   [100.00%]
               161 CPU-migrations            #    0.003 K/sec                   [100.00%]
               154 page-faults               #    0.003 K/sec                  
        35,814,524 cycles                    #    0.001 GHz                     [ 8.79%]
                 0 stalled-cycles-frontend   #    0.00% frontend cycles idle    [17.32%]
                 0 stalled-cycles-backend    #    0.00% backend  cycles idle    [26.32%]
         8,867,268 instructions              #    0.25  insns per cycle         [35.60%]
       271,990,027 branches                  #    5.174 M/sec                   [45.04%]
         5,887,794 branch-misses             #    2.16% of all branches         [53.67%]
   <not supported> L1-dcache-loads         
         4,562,815 L1-dcache-load-misses     #    0.00% of all L1-dcache hits   [63.40%]
   <not supported> LLC-loads               
         4,356,119 LLC-load-misses           #    0.00% of all LL-cache hits    [72.57%]
   <not supported> L1-icache-loads         
   <not supported> L1-icache-load-misses   
   <not supported> dTLB-loads              
         3,871,511 dTLB-load-misses          #    0.00% of all dTLB cache hits  [81.65%]
     <not counted> iTLB-loads              
     <not counted> iTLB-load-misses        
   <not supported> L1-dcache-prefetches    
   <not supported> L1-dcache-prefetch-misses

      26.285389995 seconds time elapsed
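
The elapsed times of the two read-only runs can be converted into rough throughput figures. This is a back-of-the-envelope sketch: it assumes the "1 GB" block is 2**30 bytes and is read exactly once, and it ignores process start-up and mapping time, which are included in the perf totals.

```python
BLOCK_BYTES = 1 << 30  # assumption: the 1 GB test block is 2**30 bytes, read once

# "seconds time elapsed" from the two read-only perf runs above.
elapsed = {
    "IPC read": 1.964593286,
    "off-Linux read": 26.285389995,
}

for name, seconds in elapsed.items():
    mb_per_s = BLOCK_BYTES / seconds / 1e6
    print(f"{name}: {mb_per_s:.0f} MB/s")

ratio = elapsed["off-Linux read"] / elapsed["IPC read"]
print(f"slowdown: {ratio:.1f}x")
```

Under these assumptions the off-Linux read works out to roughly 41 MB/s, in line with the ~40 MB/s ceiling reported at the start of the thread, against roughly 550 MB/s for the cached IPC block.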

divia · 05-03-2013, 06:04 AM

Similar stats for a simple Read of 100 KB (to avoid swapping) repeated 100 times (to evaluate the effect of caching).

Read from IPC:

Code:

          2.952743 task-clock                #    3.057 CPUs utilized            ( +-  0.26% ) [99.88%]
                 7 context-switches          #    0.002 M/sec                    ( +-  0.20% ) [99.93%]
                 2 CPU-migrations            #    0.677 K/sec                    ( +-  0.71% ) [99.96%]
               145 page-faults               #    0.049 M/sec                    ( +-  0.05% )
         4,305,567 cycles                    #    1.458 GHz                      ( +-  0.88% ) [24.66%]
                 0 stalled-cycles-frontend   #    0.00% frontend cycles idle    [88.71%]
                 0 stalled-cycles-backend    #    0.00% backend  cycles idle    [11.34%]
           696,238 instructions              #    0.16  insns per cycle          ( +- 17.87% ) [12.85%]
            97,770 branches                  #   33.112 M/sec                    ( +-  7.18% ) [14.43%]
     <not counted> branch-misses           
   <not supported> L1-dcache-loads         
     <not counted> L1-dcache-load-misses   
   <not supported> LLC-loads               
     <not counted> LLC-load-misses         
   <not supported> L1-icache-loads         
   <not supported> L1-icache-load-misses   
   <not supported> dTLB-loads              
     <not counted> dTLB-load-misses        
     <not counted> iTLB-loads              
     <not counted> iTLB-load-misses        
   <not supported> L1-dcache-prefetches    
   <not supported> L1-dcache-prefetch-misses

       0.000965889 seconds time elapsed                                          ( +-  0.33% )

Read from off-Linux RAM:

Code:

          7.850429 task-clock                #    2.303 CPUs utilized            ( +-  0.20% ) [99.95%]
                11 context-switches          #    0.001 M/sec                    ( +-  1.37% ) [99.97%]
                 2 CPU-migrations            #    0.306 K/sec                    ( +-  3.45% ) [99.99%]
               119 page-faults               #    0.015 M/sec                    ( +-  0.06% )
        14,716,784 cycles                    #    1.875 GHz                      ( +-  1.46% ) [23.40%]
                 0 stalled-cycles-frontend   #    0.00% frontend cycles idle    [38.74%]
                 0 stalled-cycles-backend    #    0.00% backend  cycles idle    [27.63%]
         1,841,157 instructions              #    0.13  insns per cycle          ( +- 18.05% ) [54.60%]
           105,973 branches                  #   13.499 M/sec                    ( +-  0.84% ) [63.14%]
         3,013,535 branch-misses             #  2843.68% of all branches          ( +- 22.81% ) [ 8.27%]
   <not supported> L1-dcache-loads         
            55,108 L1-dcache-load-misses     #    0.00% of all L1-dcache hits    ( +-  7.53% ) [ 5.36%]
   <not supported> LLC-loads               
     <not counted> LLC-load-misses         
   <not supported> L1-icache-loads         
   <not supported> L1-icache-load-misses   
   <not supported> dTLB-loads              
     <not counted> dTLB-load-misses        
     <not counted> iTLB-loads              
     <not counted> iTLB-load-misses        
   <not supported> L1-dcache-prefetches    
   <not supported> L1-dcache-prefetch-misses

       0.003408948 seconds time elapsed                                          ( +-  0.22% )

divia · 05-14-2013, 06:37 AM

I have some important news on this subject. I found out that if I limit the mapping of the off-Linux memory block to the address range of the RAM physically installed on the system (that is, excluding the RAM re-mapped from the BIOS memory hole) then I get the proper caching.

Take for example a system with 8 GB, 4 GB for Linux and 4 GB for the special driver. In reality, the off-Linux memory will contain a bit more than 4 GB as we can see from the actual mapping:

Code:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000cff70000 (usable)
 BIOS-e820: 00000000cff70000 - 00000000cff78000 (ACPI data)
 BIOS-e820: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
e820 remove range: 0000000100000000 - ffffffffffffffff (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 000000000009b800 (usable)
 user: 000000000009b800 - 00000000000a0000 (reserved)
 user: 00000000000e4000 - 0000000000100000 (reserved)
 user: 0000000000100000 - 00000000cff70000 (usable)
 user: 00000000cff70000 - 00000000cff78000 (ACPI data)
 user: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 user: 00000000cff80000 - 00000000d0000000 (reserved)
 user: 00000000e0000000 - 00000000f0000000 (reserved)
 user: 00000000fec00000 - 00000000fec10000 (reserved)
 user: 00000000fee00000 - 00000000fee01000 (reserved)
 user: 00000000ff800000 - 00000000ffc00000 (reserved)
 user: 00000000fffffc00 - 0000000100000000 (reserved)

In the above example, the off-Linux memory block will cover the range 0000000100000000 - 0000000230000000 and therefore we will have 0000000030000000 bytes of it located within the BIOS memory hole.
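
The figures above can be reproduced from the BIOS-e820 lines with a short script. This is a sketch under stated assumptions: the regular expression only targets the log format shown in this post, and the names parse_e820, off_linux and remapped are illustrative.

```python
import re

E820 = """\
 BIOS-e820: 0000000000000000 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000cff70000 (usable)
 BIOS-e820: 00000000cff70000 - 00000000cff78000 (ACPI data)
 BIOS-e820: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
"""

LINE = re.compile(r"BIOS-e820: ([0-9a-f]+) - ([0-9a-f]+) \((.+)\)")

def parse_e820(text):
    """Return (start, end, kind) tuples; end is exclusive in this log format."""
    return [(int(a, 16), int(b, 16), kind) for a, b, kind in LINE.findall(text)]

FOUR_GIB = 1 << 32
regions = parse_e820(E820)

# Usable RAM at or above 4 GiB is what "mem=" hides from Linux in this setup.
off_linux = sum(e - s for s, e, kind in regions if kind == "usable" and s >= FOUR_GIB)
remapped = off_linux - FOUR_GIB  # portion beyond the upper 4 GiB

print(hex(off_linux), hex(remapped))
```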

Now, if in my driver I allocate memory without the part re-mapped from the BIOS hole I see (from the PAT debug):

Code:

reserve_memtype added 0x106000000-0x200000000, track write-back, req write-back, ret write-back

In this configuration, memory access time in the new kernel is as fast as in the old kernel.

On the same system, if I allocate using also memory from the BIOS hole (even a few bytes) I see the following:

Code:

reserve_memtype added 0x106000000-0x2006ba000, track uncached-minus, req write-back, ret uncached-minus

and the access time goes through the roof.

Another interesting thing is that by mapping first one single page from the whole block and then the rest of the block (including the bit re-mapped from the BIOS memory hole), then caching is set as I want:

Code:

reserve_memtype added 0x106000000-0x106001000, track write-back, req write-back, ret write-back
Overlap at 0x106000000-0x106001000
reserve_memtype added 0x106000000-0x230000000, track write-back, req write-back, ret write-back

It's as if the kernel chose the setting already in place for the first page for the following pages too, as if this were used as a default.

For the moment my personal conclusion is that the kernel gets confused when a mapped block contains memory with two default caching modes (write-back for the upper RAM and uncached for the BIOS memory hole) and makes an arbitrary choice (the one I do not want). If, on the other hand, any location in the block already has a cache mode in place, then that mode is used for the whole block.

More investigations will come, but to me this sounds like an undocumented feature of the Linux kernel.

Now, with the device driver I can only set the cache as write-through. I could not (yet) find a way to set it to write-back. The write performance in the two modes is almost identical; what changes radically is the read access. Does anybody know how to set a block of RAM as cached write-back?

divia · 05-15-2013, 02:15 AM

We did more checks on other machines and what we found is not very conclusive.

We got several "uncached" when the memory block being remapped crosses the 4 GB memory barrier (which is consistent with the findings above).

Unfortunately we also got "uncached" when crossing a 4 GB memory barrier inside the off-Linux RAM (e.g. when allocating a 4 GB block starting at 6 GB on a system with 32 GB). In other words there is a "cross-blocks" effect which is not always at the end of the physical RAM. To make things worse, we could also get cached blocks that spanned different 4 GB blocks. There is something behind the kernel's decision to cache or not cache the mapped memory that I cannot yet figure out.

I fear that my only way out would be to explicitly ask Linux to cache the remapped block as write-back. If I only knew how :-(

divia · 05-15-2013, 08:22 AM

Some progress. The barriers where we got "uncached" memory all correspond to boundaries between MTRR registers. The setup for the last test system mentioned above is:

Code:

# cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg03: base=0x200000000 ( 8192MB), size= 8192MB, count=1: write-back
reg04: base=0x400000000 (16384MB), size=16384MB, count=1: write-back
reg05: base=0x800000000 (32768MB), size=32768MB, count=1: write-back
reg06: base=0x1000000000 (65536MB), size= 1024MB, count=1: write-back

Well, every time a block of memory mapped by our driver lies across 2 or more registers (e.g. we try to map the area 0x200000000 - 0x400001000) the memory block comes out uncached while if we remain inside the same block then all is OK. This smells a lot like a bug in the area of the MTRR routines (mtrr_type_lookup? pat_x_mtrr_type?). Has anything changed in that area recently?
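
The rule described above (a block is fine only if it stays inside one register) can be checked offline against a /proc/mtrr dump. The sketch below parses the listing from this post and reports which registers a block touches; the helper names are invented for illustration, and the block is treated as a half-open range [start, end).

```python
import re

PROC_MTRR = """\
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg03: base=0x200000000 ( 8192MB), size= 8192MB, count=1: write-back
reg04: base=0x400000000 (16384MB), size=16384MB, count=1: write-back
reg05: base=0x800000000 (32768MB), size=32768MB, count=1: write-back
reg06: base=0x1000000000 (65536MB), size= 1024MB, count=1: write-back
"""

LINE = re.compile(
    r"reg(\d+): base=(0x[0-9a-f]+) .*size=\s*(\d+)(MB|KB), count=\d+: (\S+)"
)

def parse_mtrr(text):
    """Return (register, start, end, type) tuples with an exclusive end."""
    units = {"KB": 1 << 10, "MB": 1 << 20}
    regs = []
    for num, base, size, unit, mtype in LINE.findall(text):
        start = int(base, 16)
        regs.append((int(num), start, start + int(size) * units[unit], mtype))
    return regs

def registers_touched(regs, start, end):
    """Registers whose range overlaps the half-open block [start, end)."""
    return [num for num, lo, hi, _ in regs if lo < end and start < hi]

regs = parse_mtrr(PROC_MTRR)
print(registers_touched(regs, 0x200000000, 0x400001000))  # the example above
print(registers_touched(regs, 0x200000000, 0x300000000))  # stays in one register
```

A block that reports more than one register is, according to the observations in this thread, a candidate for coming out uncached.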

divia · 05-16-2013, 02:23 AM

I think I have the proof that the problem is indeed in the way the MTRR setup is interpreted by PAT.

This is the default setup of the mapping/MTRR for a machine with 4GB Linux + 4GB off-Linux:

Code:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000cff70000 (usable)
 BIOS-e820: 00000000cff70000 - 00000000cff78000 (ACPI data)
 BIOS-e820: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
e820 remove range: 0000000100000000 - ffffffffffffffff (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 000000000009b800 (usable)
 user: 000000000009b800 - 00000000000a0000 (reserved)
 user: 00000000000e4000 - 0000000000100000 (reserved)
 user: 0000000000100000 - 00000000cff70000 (usable)
 user: 00000000cff70000 - 00000000cff78000 (ACPI data)
 user: 00000000cff78000 - 00000000cff80000 (ACPI NVS)
 user: 00000000cff80000 - 00000000d0000000 (reserved)
 user: 00000000e0000000 - 00000000f0000000 (reserved)
 user: 00000000fec00000 - 00000000fec10000 (reserved)
 user: 00000000fee00000 - 00000000fee01000 (reserved)
 user: 00000000ff800000 - 00000000ffc00000 (reserved)
 user: 00000000fffffc00 - 0000000100000000 (reserved)

MTRR default type: uncachable
MTRR fixed ranges enabled:
  00000-9FFFF write-back
  A0000-BFFFF uncachable
  C0000-C7FFF write-protect
  C8000-E3FFF uncachable
  E4000-FFFFF write-protect
MTRR variable ranges enabled:
  0 base 0D0000000 mask FF0000000 uncachable
  1 base 0E0000000 mask FE0000000 uncachable
  2 base 000000000 mask E00000000 write-back
  3 base 200000000 mask FE0000000 write-back
  4 base 220000000 mask FF0000000 write-back
  5 base 0CFF80000 mask FFFF80000 uncachable
  6 disabled
  7 disabled

x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
original variable MTRRs
reg 0, base: 3328MB, range: 256MB, type UC
reg 1, base: 3584MB, range: 512MB, type UC
reg 2, base: 0GB, range: 8GB, type WB
reg 3, base: 8GB, range: 512MB, type WB
reg 4, base: 8704MB, range: 256MB, type WB
reg 5, base: 3407360KB, range: 512KB, type UC
total RAM covered: 8191M
Found optimal setting for mtrr clean up
 gran_size: 64K         chunk_size: 1M  num_reg: 7      lose cover RAM: 0G
New variable MTRRs
reg 0, base: 0GB, range: 2GB, type WB
reg 1, base: 2GB, range: 1GB, type WB
reg 2, base: 3GB, range: 256MB, type WB
reg 3, base: 3407360KB, range: 512KB, type UC
reg 4, base: 4GB, range: 4GB, type WB
reg 5, base: 8GB, range: 512MB, type WB
reg 6, base: 8704MB, range: 256MB, type WB
e820 update range: 00000000cff80000 - 0000000100000000 (usable) ==> (reserved)
initial memory mapped : 0 - 20000000
init_memory_mapping: 0000000000000000-00000000cff70000
 0000000000 - 00cfe00000 page 2M
 00cfe00000 - 00cff70000 page 4k

# cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-back
reg03: base=0x0cff80000 ( 3327MB), size=  512KB, count=1: uncachable
reg04: base=0x100000000 ( 4096MB), size= 4096MB, count=5: write-back
reg05: base=0x200000000 ( 8192MB), size=  512MB, count=1: write-back
reg06: base=0x220000000 ( 8704MB), size=  256MB, count=1: write-back
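
The two values on the "x86 PAT enabled" line of the boot log are the old and new contents of the PAT register. A small sketch can decode them, assuming the standard x86 encoding in which each byte holds one entry (0 = UC, 1 = WC, 4 = WT, 5 = WP, 6 = WB, 7 = UC-); the function name decode_pat is illustrative.

```python
# Standard x86 PAT memory-type encodings; other values are reserved.
PAT_TYPES = {0x00: "UC", 0x01: "WC", 0x04: "WT", 0x05: "WP", 0x06: "WB", 0x07: "UC-"}

def decode_pat(msr):
    """Return the memory types of PAT entries PA0..PA7 (lowest byte first)."""
    return [PAT_TYPES.get((msr >> (8 * i)) & 0x07, "reserved") for i in range(8)]

old = decode_pat(0x7040600070406)
new = decode_pat(0x7010600070106)

for index, (before, after) in enumerate(zip(old, new)):
    print(f"PA{index}: {before:>3} -> {after}")
```

With that encoding, the only change the kernel makes is to replace the write-through entries with write-combining ones; write-back stays available in entry 0.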

The above is correct as it covers all the Linux RAM between 0x000000000 and 0x100000000 (4GB) plus the off-Linux RAM between 0x100000000 and 0x230000000 (4+GB including the RAM @ the BIOS memory hole). With this setup, allocating memory between 0x106000000 and 0x200000000 we get write-back while allocating memory between 0x106000000 and 0x20035c000 we get uncached. We can confirm this by looking at the PAT debug trace:

Code:

reserve_memtype added 0x106000000-0x1fe8a6000, track write-back, req write-back, ret write-back

reserve_memtype added 0x106000000-0x20035c000, track uncached-minus, req write-back, ret uncached-minus

I have now modified the MTRR setup as follows:

Code:

# cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-back
reg03: base=0x0cff80000 ( 3327MB), size=  512KB, count=1: uncachable
reg04: base=0x100000000 ( 4096MB), size= 2048MB, count=1: write-back
reg05: base=0x180000000 ( 6144MB), size= 2048MB, count=1: write-back
reg06: base=0x200000000 ( 8192MB), size=  512MB, count=1: write-back
reg07: base=0x220000000 ( 8704MB), size=  256MB, count=1: write-back

This setup is basically the same as before but we have now registers 4 and 5 that cover what before was covered by register 4 alone.

I now allocate a block that lies within one single register with the first setup and across two registers with the second setup.

This is the PAT trace for the original MTRR setup:

Code:

reserve_memtype added 0x106000000-0x1fe8a6000, track write-back, req write-back, ret write-back

This is the PAT trace for the modified MTRR setup:

Code:

reserve_memtype added 0x106000000-0x1fe8a6000, track uncached-minus, req write-back, ret uncached-minus

The two MTRR setups are 100% equivalent. Yet, PAT sees them differently. It looks like we have a confirmed bug in the way PAT interprets the MTRR setup.
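
The equivalence claim, and the difference PAT appears to see, can both be illustrated with a toy model. The sketch below only looks at the write-back registers at or above 4 GB (the ones that differ between the two setups). The per-register containment test is not taken from the kernel sources; it is simply a guess that happens to reproduce the two traces above.

```python
# Write-back registers at or above 4 GB as (start, end) with exclusive end.
ORIGINAL = [
    (0x100000000, 0x200000000),
    (0x200000000, 0x220000000),
    (0x220000000, 0x230000000),
]
MODIFIED = [
    (0x100000000, 0x180000000),
    (0x180000000, 0x200000000),
    (0x200000000, 0x220000000),
    (0x220000000, 0x230000000),
]

def coalesce(ranges):
    """Merge adjacent or overlapping ranges into the effective coverage."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def within_single_register(ranges, start, end):
    """Hypothetical model: True only if one register contains the whole block."""
    return any(lo <= start and end <= hi for lo, hi in ranges)

block = (0x106000000, 0x1fe8a6000)  # the block from the PAT traces above

print(coalesce(ORIGINAL) == coalesce(MODIFIED))   # same effective coverage
print(within_single_register(ORIGINAL, *block))   # matches "ret write-back"
print(within_single_register(MODIFIED, *block))   # matches "ret uncached-minus"
```

In other words, the effective write-back coverage is identical, and only a lookup that reasons register by register would tell the two setups apart.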

divia · 05-17-2013, 05:08 AM

Update. We did some extra checks on the issue and we could confirm that PAT cannot handle correctly two adjacent blocks handled by two MTRRs having the same setting.

What we did next was to try to split the map into separate blocks, each covered by a single MTRR. Unfortunately this did not work due to another problem: PAT completely ignored the mapping request. The map completed OK, in the sense that we could access the memory, but the caching mode of all the blocks was left undefined (which made them effectively "uncached"). We cannot understand why the remap_pfn_range call, which returned status OK, ended up being ignored by the PAT module. This is very unfortunate as we do, for other reasons, rely on multiple maps into a single VM address space (to cover, for example, the BIOS memory hole), and these alas end up uncached as well (as we confirmed with other tests).

We will now disable PAT and see what happens...