Old 07-18-2008, 02:58 AM   #1
kailas
LQ Newbie
 
Registered: Jul 2008
Posts: 16

Rep: Reputation: 0
avoiding TLB flush on process switch


Hi

Does the Linux kernel flush all TLB entries for a process when the scheduler switches it out?
AFAIK, CR3 is reloaded during the context switch (in switch_mm), and the hardware flushes all non-global page translation entries whenever CR3 is written (at least on Intel i386).

If this is true, does it mean that Linux under-utilizes the TLB?
The TLB was designed to cache page translation entries for all running processes. But because Linux uses a different page global directory for each process (to give each process its own 4GB virtual address space), it cannot use the TLB as the hardware designers intended.
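
For reference, this is roughly the shape of the i386 switch_mm path as I understand it (a simplified sketch, not the verbatim kernel source; the lazy-TLB and SMP bookkeeping are omitted):

Code:
/* Simplified sketch of the i386 switch_mm path (2.6-era kernels).
 * Not the verbatim source: lazy-TLB state and cpu_vm_mask handling
 * are omitted. */
static inline void switch_mm(struct mm_struct *prev,
                             struct mm_struct *next,
                             struct task_struct *tsk)
{
        if (prev != next) {
                /* Point CR3 at the next process's page global
                 * directory. On i386 this write implicitly flushes
                 * every non-global TLB entry. */
                load_cr3(next->pgd);
        }
}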

Can someone please elaborate on this?

PS: This is my first post on this forum. Kindly forgive me if there is any problem with it.

Thanks in advance.
Regards,
Kailas
 
Old 07-18-2008, 08:23 PM   #2
jailbait
LQ Guru
 
Registered: Feb 2003
Location: Virginia, USA
Distribution: Debian 12
Posts: 8,337

Rep: Reputation: 548
Quote:
Originally Posted by kailas

If this is true, does it mean that Linux under-utilizes the TLB?
The TLB was designed to cache page translation entries for all running processes. But because Linux uses a different page global directory for each process (to give each process its own 4GB virtual address space), it cannot use the TLB as the hardware designers intended.
I am not certain about what I am about to say, but here goes:

As I see it, when a process exits, its memory is freed. However, freeing its virtual memory and the corresponding page translation entries does not clear the matching entries in the Translation Lookaside Buffer. The TLB entries for the just-freed virtual memory are therefore stale and a potential source of error. Linux solves the problem by reloading CR3, which flushes the TLB. Unfortunately, that flushes all TLB entries, not just the stale ones.
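
On i386 that full flush is done by reloading CR3; as far as I can tell it looks roughly like this (a sketch in the style of the kernel's __flush_tlb() macro, not the exact source):

Code:
/* Sketch of the classic i386 full TLB flush: writing CR3 back to
 * itself invalidates every non-global TLB entry. In the style of
 * the kernel's __flush_tlb() macro; not the verbatim source. */
#define __flush_tlb()                                   \
        do {                                            \
                unsigned long tmpreg;                   \
                asm volatile("movl %%cr3, %0\n\t"       \
                             "movl %0, %%cr3"           \
                             : "=r" (tmpreg)            \
                             :                          \
                             : "memory");               \
        } while (0)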

The hardware designers should have put in circuitry so that when virtual pages are freed, the corresponding TLB entries are freed as well. Since they didn't, the kernel developers need a way to find the TLB entries that correspond to the virtual memory being freed. So far they haven't found one, so they are reduced to flushing all TLB entries whenever some of them are stale.

And yes, flushing all TLB entries instead of just the stale ones means that Linux probably under-utilizes the TLB.

--------------------------
Steve Stites

Last edited by jailbait; 07-18-2008 at 08:24 PM.
 
Old 07-19-2008, 02:23 AM   #3
kailas
LQ Newbie
 
Registered: Jul 2008
Posts: 16

Original Poster
Rep: Reputation: 0
RE: avoiding TLB flush on process switch

Thanks a lot for the reply, Steve.

However, my question is not about what happens after a process exits. I agree that the memory associated with the process is freed on exit, which makes its TLB entries stale.

What I am asking about is what happens on a process switch, when the scheduler switches out a process, for example because its time slice expired. In that case the memory mapping of the switched-out process is still valid as long as the correct page global directory is used, and the TLB entries derived from that mapping are still valid too. But CR3 is rewritten anyway, which flushes out these valid entries. This is necessary only because Linux uses a different page global directory for each process, as stated in my earlier post.

Regarding hardware support for keeping a process ID in the TLB, I've read something about ASIDs (address space identifiers). According to what I read, Intel processors maintain an ASID for each process, and the ASID is stored in the TLB alongside the translation entries. When looking up a virtual address in the TLB, the hardware also checks for a matching ASID. I have also seen an ASID parameter in the flushing macros in the Linux source, but I am not sure how and where it is used.
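
To illustrate the idea, here is a small standalone sketch of how an ASID-tagged lookup could conceptually work (illustrative only; the structure and names are my own, not any real hardware interface or kernel code):

Code:
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of one tagged TLB entry. */
struct tlb_entry {
        uint64_t vpn;     /* virtual page number            */
        uint64_t pfn;     /* physical frame number          */
        uint16_t asid;    /* address space identifier tag   */
        bool     global;  /* global mappings match any ASID */
        bool     valid;
};

/* Conceptual lookup: an entry hits only if the virtual page matches
 * AND the entry is global or tagged with the current ASID. With such
 * tags, a context switch only changes the current ASID instead of
 * flushing the whole TLB. */
static bool tlb_lookup(const struct tlb_entry *tlb, int nentries,
                       uint64_t vpn, uint16_t cur_asid, uint64_t *pfn)
{
        for (int i = 0; i < nentries; i++) {
                if (tlb[i].valid && tlb[i].vpn == vpn &&
                    (tlb[i].global || tlb[i].asid == cur_asid)) {
                        *pfn = tlb[i].pfn;
                        return true;
                }
        }
        return false;
}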

Please let me know more on this.

Thanks & Regards,
Kailas
 
Old 07-20-2008, 08:01 AM   #4
kailas
LQ Newbie
 
Registered: Jul 2008
Posts: 16

Original Poster
Rep: Reputation: 0
Hi

Finally, I got the answer to this.

Just to update: Intel processors (including those with VT) do not support tagged TLBs. Hence there is no way to distinguish the TLB entries of one process from another, which forces a TLB flush on process switch by writing to CR3.

The latest AMD processors and MIPS processors support ASID-tagged TLB entries. I think Linux takes advantage of this to avoid the TLB flush, as I did not find a CR3-rewriting segment in switch_mm for these architectures.
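
For example, the MIPS context switch only needs to update the ASID register; roughly (a simplified sketch in the style of arch/mips, not the verbatim source; ASID-generation recycling is omitted):

Code:
/* Simplified sketch of an ASID-based context switch in the style of
 * arch/mips (not the verbatim kernel source; the ASID recycling that
 * happens when the small ASID space wraps around is omitted). */
static inline void switch_mm(struct mm_struct *prev,
                             struct mm_struct *next,
                             struct task_struct *tsk)
{
        unsigned int cpu = smp_processor_id();

        /* Make the next task's address space current by writing its
         * ASID into the EntryHi register. Old TLB entries stay put;
         * they simply stop matching because their ASID tag differs,
         * so no TLB flush is needed. */
        write_c0_entryhi(cpu_asid(cpu, next));
        TLBMISS_HANDLER_SETUP_PGD(next->pgd);
}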

Please correct me if this is wrong.

Some more questions along the same lines.

I have read that Linux uses only 48 of the 64 address bits on 64-bit Intel processors. Does that mean the remaining bits could be used to emulate a tagged TLB (i.e., giving every process a different 4GB linear address space using segmentation)?

Does someone know the reason Intel has not implemented a tagged TLB, even with VT, when it has been shown to be very beneficial for virtual machines?

Thanks in advance.

Regards,
Kailas
 
Old 07-24-2008, 09:14 AM   #5
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
The Linux kernel has many different TLB-related functions that it calls at various times, and they are designed to allow very specific invalidations. Each architecture's implementation of those functions varies according to what the particular chip (or model) can do. (In some cases, such as a Motorola 68000 with no MMU, they do nothing at all.)

When a task switch occurs, the new task does not know anything about the old task. It will soon fill up the TLB with the entries that it cares about, having no real need for (and probably no access to) those of the former task.
 
Old 07-26-2008, 02:17 AM   #6
kailas
LQ Newbie
 
Registered: Jul 2008
Posts: 16

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by sundialsvcs
When a task switch occurs, the new task does not know anything about the old task. It will soon fill up the TLB with the entries that it cares about, having no real need for (and probably no access to) those of the former task.
The new task will create its TLB entries when it starts. The problem, however, is that every time a task is switched out, all its TLB entries are flushed, so it has to recreate them every time the scheduler picks it for execution.

Typically, Intel processors have a TLB with around 96 entries, each of which maps at least 4KB of data (96 x 4KB = 384KB of address space covered at once). Even if we assume the kernel occupies a third of the entries (which it cannot keep, due to flushing), around 60 entries are left. I believe a process will not be able to populate that many entries in the relatively small time slice it gets. Thus, the TLB could hold entries for more than one process at a time, which would avoid walking the paging structures when the same process is rescheduled.

Of course, this is possible only if we can avoid the TLB flush on Intel processors. However, since there is no support for a tagged TLB, it is not straightforward to implement.
Could we use something like a virtually tagged TLB to achieve this? Could we use the upper 16 bits on 64-bit Intel processors (which I suppose are unused, as Linux uses only 48 bits) as region IDs?

I agree that the performance benefit from this is debatable. But I've seen some presentations on Xen saying that it performs better on some AMD processors due to their tagged TLB support.
 
Old 07-26-2008, 02:49 AM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,128

Rep: Reputation: 4121
Interesting topic - although I wonder about all this conjecture.
Just this last week I was (quickly) skimming some of the code (tangentially) related to this, to see how the swap PTEs were handled with regard to the new reporting being added to smaps.
It appeared from the code that (normally) only the TLB entries for the interrupted task were being invalidated, rather than a full TLB flush being done. There was a test for whether a (separate) full TLB flush was needed - presumably via CR3. That implies some form of interaction between the OS and the hardware. I didn't pursue that line.
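
Single-entry invalidation on x86 is done with the invlpg instruction; a sketch of such a helper (in the style of the kernel's __flush_tlb_single(), not the verbatim source):

Code:
/* Sketch of per-page TLB invalidation on x86: invlpg drops only the
 * TLB entry covering the given linear address and leaves the rest of
 * the TLB intact. In the style of the kernel's __flush_tlb_single();
 * not the verbatim source. */
#define __flush_tlb_single(addr) \
        asm volatile("invlpg (%0)" : : "r" (addr) : "memory")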

As for virtualized guests, I also stumbled on an Intel paper that described some (six, maybe?) new instructions that allow the hypervisor to maintain "shadow" TLB(s) for the guests, intercept TLB flushes from those guests, and determine which (if any) actually need to be propagated to the hardware.

Food for thought.
 
Old 07-27-2008, 09:40 AM   #8
kailas
LQ Newbie
 
Registered: Jul 2008
Posts: 16

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by syg00
It appeared from the code that (normally) only the TLB entries for the interrupted task were being invalidated, rather than a full TLB flush being done. There was a test for whether a (separate) full TLB flush was needed - presumably via CR3. That implies some form of interaction between the OS and the hardware. I didn't pursue that line.
Are you talking about some specialized architecture? As I read in UTLK and in the source code for i386 and x86_64, the context-switch path (switch_mm) does a load_cr3(), which results in a TLB flush.
Kindly explain, as I am new to this.

Also, from the Intel manuals for 64-bit processors, I read that the processor supports only a 48-bit address space and requires that bits 48-63 be the same as bit 47 (the "canonical address" rule). Thus, we cannot use these upper bits for virtual tagging of the TLB.
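
To make that concrete, an address is usable only if bits 63-48 are a sign extension of bit 47; a small illustrative check of that rule (my own helper, not a kernel function):

Code:
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helper (my own naming, not a kernel function): an
 * x86-64 linear address is canonical when bits 63..48 are copies of
 * bit 47, i.e. the value survives sign extension from 48 bits.
 * Non-canonical addresses fault, so the upper 16 bits cannot carry
 * a process tag. */
static bool is_canonical(uint64_t va)
{
        return (uint64_t)((int64_t)(va << 16) >> 16) == va;
}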

Does the Linux kernel provide a 128TB linear address space to each process on 64-bit processors, the way each process gets 4GB on 32-bit systems? If so, is it really necessary for each process to have such a large linear address space? If not, could we use the upper 7 bits of it for virtual tagging of the TLB while still providing a 1TB linear address space to each process?
 
  

