1、 Preface

Chen Xingshen, a master of mathematics, has a saying like this: understanding the changes of history is a step to understand this subject. Today, I apply this to a specific Linux module: the best way to understand reverse mapping is to understand its history. This paper introduces the historical process of how the reverse mapping mechanism in Linux kernel came into being from nothing and how it went from heavy to light. Through these historical evolution processes, we hope to have a deeper understanding of reverse mapping.

2、 Basic knowledge

Before cutting into the history of reverse mapping, let’s take a brief look at some basic concepts, which mainly include two aspects: one is the definition of reverse mapping, the other is the reason for the introduction of reverse mapping.

1. What is reverse mapping?

Before we talk about reverse mapping, let’s talk about forward mapping. When you understand forward mapping, the concept of reverse mapping will be very easy. The so-called forward mapping is the process of establishing a complete page table for address mapping when the virtual address and physical address (or page number, page struct) are known. For example, after a VMA is allocated to a process, there is no corresponding page frame (that is, no physical address is allocated). When the program accesses this VMA, an exception is generated, and the kernel allocates physical pages for it and establishes all levels of translation tables. Through forward mapping, we can map the virtual page in the process virtual address space to the corresponding physical page frame.

Reverse mapping, on the contrary, when the page frame is known (it may be PFN, pointer to page descriptor, or physical address. The kernel has various macro definitions to convert between them), find the virtual pages mapped to the physical page. Since a page frame can be shared among multiple processes, the task of reverse mapping is to find all the page table entries scattered in the address space of each process.

Generally speaking, two virtual addresses will not be mapped to a page frame in the address space of a process. If there are multiple mappings, the page is mostly shared by multiple processes. The simplest example is the process fork of cow. The kernel will not allocate a new page frame before the process does not write. Therefore, the parent-child process shares a physical page. Another example is related to C lib. As C lib is the basic library, it will be mapped to many process address spaces, so the page frame corresponding to the program body segment in C lib should have many page table entries corresponding to it.

2. Why reverse mapping?

The reason for the establishment of reverse mapping mechanism is to facilitate page recycling. After the page recycling mechanism is started, if the recycled page frames are located in various memory caches in the kernel (such as the slab memory allocator), then these pages can be recycled directly without related page table operations. If the page frame of the user process space is recycled, the kernel needs to unmapping the page frame before recycling, that is, find all the page table entries, and then modify them accordingly. Of course, if the page is dirty, we need some necessary disk IO operations.

A practical example can be given. For example, when releasing an anonymous mapping page, it is required to change all related page table entries, and write the swap area and page slot index to the page table entries. Only after all the page table entries pointing to the page frame have been modified can the page be swapped to disk and the page frame be recycled. The scenario of demand paging is similar, except that all page table entries need to be cleared. I won’t go into details here.

3、 Prehistoric civilization

Before Pangu, the universe was in chaos. For the reverse mapping scenario, our question is: what is the chaotic kernel world like before the reverse mapping? This chapter is mainly to answer this question. The analysis is based on the source code of 2.4.18 kernel.

1. How does the system work without reverse mapping?

It may be hard for young kernel engineers to imagine a kernel world without reverse mapping, but in fact, the kernel in 2.4 was like this. Let’s imagine that we are the maintainers of page reclaim mechanism. Let’s take a look at our current dilemma: if there is no reverse mapping mechanism, no reverse mapping data is maintained in struct page. In this case, it is impossible for the kernel to find the PTEs corresponding to the page frame in a simple way. When recycling a page frame shared by multiple processes, what should we do?

It is not complicated to recycle the physical page frames of user process, which needs the support of memory mapping and swapping mechanism. The two mechanisms work in a similar way, except that one is used for file mapped page and the other is used for anonymous page. However, for page recycling, their working principle is similar: exchange the page frame that some processes don’t often use to disk, and at the same time release all relationships between the process and the page frame. After completing these two steps, the physical page frame is free and can be recycled to the partner system.

OK, I understand the basic principle. Now I need to see how to implement it: it’s easy to find the infrequently used page frame (inactive LRU linked list), but it’s difficult to break the relationship between page frame and processes, because there is no reverse mapping. However, it is difficult for Linux kernel developers to scan the address space of each process in the whole system.

2. How to scan the process address space?

The figure below is a schematic diagram of scanning the process address space

All the process address spaces (memory descriptors) in the system are strung into a linked list, and the head of the chain is init_ Mm, all the process address spaces in the system are hung in this linked list. The so-called scan of course is carried out along the MM list. Of course, the page recycling algorithm should not scan all the process address space of the whole system as far as possible. After all, that is a relatively stupid method. The recovery algorithm can consider shrinking memory cache or traversing inactive_ List to try to complete the request for the number of reclaims this time (some pages in the linked list are not related to any process). If enough page frames are released through these methods, then everything is done, and the scan process address space is not needed. Of course, the situation is not always so good. Sometimes, the process of physical page recycling must be started to meet the requirements of page recycling.

The process physical page recycling process is realized by calling swap_ Out function, and the code of scan process address space also starts from this function. The function is a three-tier nested structure

(1) First along init_ Mm, scan the address space of each process

(2) When scanning a process address space, scan every VMA belonging to the process address space

(3) When scanning each VMA, each page belonging to the VMA is scanned

In the scanning process, if page frame0 of process a is hit, because the page is only used by process a (that is, it is only mapped by process a), you can unmap and recycle the page directly. For shared pages, we can’t do this, such as page frame 1 in the figure above. However, when scanning a process, if the conditions are met, we will unmap the page and break the relationship between it and process a. of course, we can’t recycle the page at this time, because process x is still using the page. Until the scan process goes through thousands of mountains and rivers to process X and completes the unmaping operation of page frame 1, the physical page can really be embraced by the partner system.

3. Details of address space scanning

First of all, the first question: how much virtual address space does scan stop? When the target has been reached, for example, this scan intends to reclaim 32 page frames. If the target is reached, then scan will stop without scanning all the virtual address space. There is also a more tragic situation, that is, after scanning all the address spaces in the system, the target is still not achieved, and then it can be stopped, but this belongs to the processing of oom. In order to ensure that the processes in the system are scaned evenly (after all, swap out will affect the performance of the process, we certainly can’t just catch some processes to collect their wool). After each scan is completed, record the current scan location (save in swap out)_ The next time you start the scan process, start the swap_ Mm starts to scan.

Due to the impact on performance, swap out needs rain and dew, and all processes cannot run away. In the same way, we also need to treat the address space of a process fairly, so we need to save the virtual address of each scan (mm > swap)_ In this way, every time you restart scan, you always start from swap_ Mm > swap of the address space_ Address virtual address start scan.

The code to swap out a page frame is in try_ To_ swap_ In the out function, if the conditions are met, the relationship between the processes in the page frame will be released, the necessary IO operations will be completed, the page reference count will be reduced by one, the corresponding PTE will be cleared, or the swap entry will be set. Of course, after swap out a page, we may not be able to recycle it, because the page is likely to be shared by multiple processes. In the scan process, if all the page table entries corresponding to the page are found by chance, it means that the page has not been referenced by any process. At this time, the page frame will be ejected from the disk to complete the recycling of a page.

4、 Make the world new

Time goes back to January 2002, when Rik van Riel, the God of VM, suffered a major setback in his life. His painstaking maintenance code was replaced by a new VM subsystem. But Rik van Riel didn’t go down. He was holding back his big move, which is the legendary reverse mapping (hereinafter referred to as rmap). This chapter mainly describes the first version of rmap, with code from Linux 2.6.0.

1. Design concept

How to build rmap? The most intuitive idea is that for each page frame, we maintain a linked list to save all PTEs belonging to the page. Therefore, Rik van Riel adds a PTE chapter member to the struct page to string all PTE entry pointers mapped to the page. In this way, it’s easy to unmap a page. You can find all the mappings along the PTE chain. A basic diagram is as follows, and the following sections will give a more detailed explanation.

2. Modification of struct page

The structure page is modified as follows:

struct page {

……

union {

struct pte_ chain *chain;

} pte;

……

Of course, many pages are not shared. There is only one PTE entry, so direct to that PTE entry is OK. If there is page sharing, the chain member will point to a struct PTE_ Chain.

3. Define struct PTE_ chain

struct pte_ chain {

unsigned long next_ and_ idx;

} ____ cacheline_ aligned;

If PTE_ It’s too wasteful to save only one PTE entry pointer in chapter. A better way is to use struct PTE_ Align the chain in the cache line and make the whole struct PTE_ Chain occupies a cache line. Except next_ and_ IDX is used to point to the next PTE_ In addition to the linked list, the rest of the space is used to save the PTE entry pointer. Since the PTE entry pointer forms an array, we also need an index to indicate the location of the next idle PTE entry pointer_ Chain is aligned in cache line, so next_ and_ Several bits of LSB of IDX are equal to 0 and can be reused as index.

4. Modification of page recycling algorithm

Before we go to the page recycling algorithm based on rmap, let’s recall the painful past. Suppose a physical page P is shared by a and B processes. In the past, to release the physical page P, you need to scan the process address space. First, scan to a process to release the relationship between P and a process. But at this time, it cannot be recycled. B process is still using the page frame. Of course, the scanning process will eventually come to the B process. Only at this time can we have the opportunity to recycle the physical page P. You may ask: if process a accesses P when scanning process B’s address space, the mapping will be established. Then, when you scan a, process B visits again and again, so over and over again, won’t P never be recycled? What about this? This Theoretically, don’t ask me. In fact, I’m desperate.

With rmap, the page recycling algorithm immediately feels much easier. As long as the page frame that the page recycling algorithm likes, it can always pass the try_ To_ Unmap disassociates all processes and recycles them to partner systems. If the page frame is not shared (page flag set PG)_ Direct flag), then page – > pte.direct Hit PTE entry directly and call try_ To_ unmap_ One to unmap. If it is mapped to multiple virtual address spaces, then the_ Chain calls try in turn_ To_ unmap_ One to unmap.

5、 Nuwa mends the sky

Although Rik van Riel has opened up a new world of reverse mapping, there are huge holes in the sky and the earth, which need to be repaired. First of all, let’s see what this “huge hole” is?

After the introduction of the first version of rmap, the page recycling of Linux becomes simple and controllable, but this simple design has a price: each struct page adds a pointer member, which is 4B on the 32bit system. Considering that the system will create a struct page object for each page frame in order to manage memory, the memory overhead caused by introducing rmap is not a small number. In addition, share page needs to create PTE_ Chain list is also a big memory overhead. In addition to the memory pressure, the first version of rmap also had a certain impact on performance. For example, during fork operation, the parent-child processes share a lot of page frames. In this way, the copy page table will be accompanied by a large number of PTEs_ The operation of chain slows down the speed of fork.

This chapter is to show you how object-based reverse mapping (hereinafter referred to as object map) fills the “huge hole”. The code in this chapter comes from the kernel version 2.6.11.

1. Introduction of problem

The driving force of rmap optimization comes from the pressure of memory, and the related question is: does the 32-bit Linux kernel support more than 4G memory. In 1999, Linus decided that a 32-bit Linux kernel would never support more than 2G of memory. However, the tide of history is irresistible. Processor manufacturers have designed expansion modules to address more memory, and high-end servers are also configured with more and more memory. This also forces Linus to change the previous thinking and let the Linux kernel support more memory.

Andrea Arcangeli of red hat was working on making 32-bit Linux run on company servers with more than 32g of memory. These servers often start a large number of processes, share a large number of physical page frames, and consume a lot of memory. For Andrea Arcangeli, the real culprit of memory consumption is clear: reverse mapping module, which consumes too much low memory, resulting in various system crashes. In order to continue his work, he must solve the problem of memory scalability introduced by rmap.

2. Optimization of file mapped

Andrea Arcangeli is not the only one who has paid attention to the memory problem of rmap. In the development process of version 2.5, IBM’s Dave McCracken has submitted a patch, trying to ensure the reverse mapping function and at the same time correct various problems caused by rmap.

Dave McCracken’s scheme is an object-based reverse mapping mechanism. In the past, through rmap, we can directly obtain its corresponding PTEs from the struct page. Objrmap uses other data objects to complete the process of retrieving its corresponding PTEs from the struct page. The schematic diagram of this process is as follows:

For objrmap, it’s a long way to find a page frame’s mappings. It uses VMA (structure VM)_ area_ Struct) this data object. We know that there are backup files for some page frames. This type of page is related to a file. For example, the body segment of a process is related to the executable file of the process. In addition, the process can call MMAP () to map a file. We call these page frames file mapped page.

For these file mapping pages, a member mapping in the struct page points to a struct address_ space，address_ Space is related to the file, which stores the information about the page cache of the file. Of course, our scenario focuses on an I_ Members of MMAP. A file may be mapped to multiple Vmas of multiple processes, all of which are hung to I_ In the priority search tree pointed by MMAP.

Of course, our ultimate goal is PTEs. The following figure shows how to export the virtual address of the page frame from the information in VMA and struct page

In Linux kernel, the function VMA_ Address can complete this function:

staTIc inline unsigned long

vma_ address(struct page *page, struct vm_ area_ struct *vma)

{

pgoff_ t pgoff = page->index < (page_ cache_ shift=”” -=””>

address = vma->vm_ start + ((pgoff – vma->vm_ pgoff) <>

}

For file mapped page, Page > index represents the offset mapped to the file (in byte), while VMA > VM_ Pgoff represents the offset of the VMA mapped to the file (page is the unit). Therefore, through VMA – > VM_ Pgoff and Page > index can get the address offset of the page frame in VMA, plus VMA > VM_ Start to get the virtual address of the page frame. With virtual address and address space (VMA > VM)_ Mm), we can find the corresponding PTE entry of the page through the page table at all levels.

3. Optimization of anonymous page

As we all know, there are two main types of user space process pages, one is file mapped page, the other is anonymous mapped page. Although David McCracken’s objrmap scheme is good, it is only suitable for file mapped pages. For anonymous mapped pages, this scheme is powerless. Therefore, we must design an object-based reverse mapping mechanism for anonymous mapping pages, and finally form a full objrmap scheme.

In order to solve the problem of memory scalability, Andrea Arcangeli works hard on the full objrmap solution. However, he also has a competitor, Hugh Dickins, who also submitted a series of full objrmap patches and tried to incorporate them into the kernel mainline. Obviously, on the anonymous mapping page, the winner is Andrea Arcangeli. His anonymous mapping solution is shown in the following figure:

Similar to file mapped, anonymous page uses VMA to find the corresponding PTE entry of page frame. Since the number of Vmas on file mapping pages may be very large, we use a data structure such as priority search tree. For anonymous mapping pages, the number is generally not too large, so it is OK to use the linked list structure.

In order to save memory, we reuse the mapping pointer in struct page: if a page frame is file mapped, its mapping pointer points to the address of the corresponding file_ Space data structure. If it is anonymous page, the mapping pointer points to anon_ VMA data structure. Although the memory is saved, the readability is reduced. However, because the kernel will create a corresponding struct page data object for each page frame, even if the data structure is increased by 4b, the memory consumption of the whole system is huge. Therefore, the kernel still uses an ugly way to define the mapping member. Through the mapping member in struct page, we can get the information related to the page mapping, which is summarized as follows:

(1) Equal to null, which means that the page frame is no longer in memory, but is swap out to disk.

(2) If it is not equal to null and the least significant bit is equal to 1, it means that the page frame is an anonymous mapping page and mapping points to an anon_ The data structure of VMA.

(3) If it is not equal to null and the least signature bit is equal to 0, it means that the page frame is a file mapping page, and mapping points to an address of the file_ Space data structure.

Via anan_ VMA data structure, we can get all VMA mapped to the page. So far, anonymous mapping and file mapped are combined. The further problem to be solved is how to transfer VMA to PTE entry. In the previous section, we described VMA_ In fact, the logic of anonymous page is the same, except VMA > VM_ The basic point of pgoff is different from that of Page > index. For the scene of file mapped, this basic point is the starting position of the file. For anonymous mapping, there are two starting points: one is share anonymous mapping, and the starting point is 0. The other is private anonymous mapping. The starting point is the virtual address of mapping (divided by page size). However, the algorithm concept of getting the corresponding virtual address from VMA and struct page is similar.

6、 Make a comeback

After full objrmap enters the kernel, it seems that everything is perfect. Compared with Rik van Riel’s rmap scheme, all aspects of objrmap’s indicators are comprehensive. Rik van Riel, who first introduced reverse mapping into the kernel, suffered a second blow, but he was still fighting and trying to make a comeback.

Although objrmap is perfect, there is a dark cloud floating in the clear sky. Rik van Riel, the great God, keenly saw the “dark cloud” of reverse mapping and put forward his own solution. This chapter mainly describes the new anon_ VMA mechanism, code from 4.4.6 kernel.

1. Old anon_ What’s wrong with VMA?

Let’s take a look at old anon first_ How the system works under VMA mechanism. VMA_ P is an anonymous mapping of the parent process. VMA, a and C have assigned page frames, while other pages have not assigned physical pages. After fork, the child process copies VMA_ P. Of course, due to the use of cow technology, the anonymous pages of the parent-child process will be shared. At the same time, the PTE entry corresponding to the address space of the parent-child process will be marked with write protect, as shown in the following figure:

Normally, anonymous pages (such as stack and heap) of different processes are private and will not be shared. However, in order to save memory, after the parent process forks the child process and before the parent-child process writes to the page, the anonymous pages of the parent-child process are shared, so these page frames point to the same anon_ vma。 Of course, sharing is only temporary. Once there is a write operation, an exception will be generated, and page frame will be allocated in the exception handling to release the sharing of anonymous pages of parent-child processes, as shown in page a in the following figure:

At this time, due to the write operation, the page frame originally shared by the parent-child process is no longer shared. However, the two pages still point to the same anon_ VMA, not only that, for pages like B, they are not shared between the parent and child processes at the beginning. When they are visited for the first time (whether it is the parent process or the child process), do_ anonymous_ The page frame allocated by the page function also points to an anon_ vma。 In other words, the VMA of the parent-child process shares an anon_ vma。

In this case, let’s see what happens to unmap page frame1. There is no doubt that the mapping member of the struct page corresponding to page frame1 points to anon in the figure above_ VMA, ergodic anon_ VMA will kill VMA_ P and VMA_ C. Inside, VMA_ C is an invalid VMA and should not have been matched. If you don’t_ VMA’s linked list is not that long, so the overall performance is OK. However, in some network servers, the system relies heavily on fork. A service program may fork a large number of subprocesses to process service requests. In this case, the performance of the system is seriously degraded. Rik van Riel gives a specific example: there are 1000 processes in the system, which are generated by fork, and each process’s VMA has 1000 anonymous pages. According to the current software architecture, anon_ There will be 1000 VMA nodes in the VMA linked list, and a million anonymous pages in the system belong to the same anon_ vma。

What kind of problems will such a system cause? Let’s take a look at try_ To_ unmap_ The code framework of anon function is as follows:

static int try_ To_ unmap_ anon(struct page *page)

{……

anon_ vma = page_ lock_ anon_ vma(page);

list_ for_ each_ entry(vma, &anon_ vma->head, anon_ vma_ node) {

ret = try_ To_ unmap_ one(page, vma);

}

spin_ unlock(&anon_ vma->lock);

return ret;

}

When a CPU in the system is executing try_ To_ unmap_ When using the anon function, you need to traverse the VMA linked list. At this time, you will hold anon_ VMA > lock this spin lock. Because of Anan_ VMA has many irrelevant Vmas. Through the search process of page table, you will find that this VMA has nothing to do with the unmap page. Therefore, you can only scan the next VMA. The whole process takes a lot of time and extends the critical area (the complexity is O (n)). At the same time, when other CPUs try to acquire the lock, they are basically stuck. At this time, the performance of the whole system can be imagined. What’s worse is that not only unmap anonymous pages in the kernel lock and traverse VMA linked lists, but also other scenarios (such as page)_ Referenced function). Imagine a million pages sharing this anon_ VMA, yes, Anan_ The competition of VMA > lock spin lock is quite fierce.

2. Improved scheme

The crux of the old plan is anon_ VMA carries too many Vmas of processes. If it can be turned into per process, the problem will be solved. Rik van Riel’s solution is to create an anon for each process_ VMA structure and through a variety of data structures to the parent-child process anon_ VMA (hereinafter referred to as AV) and VMA are linked together. To link to anan_ VMA, the kernel introduces a new structure called anon_ vma_ Chain (hereinafter referred to as AVC)

struct anon_ vma_ chain {

struct vm_ area_ Struct * VMA; — points to the VMA corresponding to the AVC

struct anon_ vma *anon_ VMA; — points to the AV corresponding to the AVC

struct list_ head same_ VMA; — the node that links to the VMA list

struct rb_ Node RB; — the node linked to AV red black tree

unsigned long rb_ subtree_ last;

}

AVC is a magic structure, each AVC has its corresponding VMA and AV. All AVCS pointing to the same VMA will be linked to a linked list, and the chain header is the anon of VMA_ vma_ Members of the chain. An AV will manage several Vmas, and all related Vmas (their child processes or grandchildren) are linked to the red black tree. The root node is the RB of AV_ Root member.

This description is very boring. It is estimated that the students who first contact with reverse mapping will not understand it. Let’s take a look at how the “building” of AV, AVC and VMA is built.

3. When VMA and VA meet for the first time

Due to the use of cow technology, the anonymous pages of the child process and the parent process are often shared until one of them initiates a write operation. However, if the child process executes the Exec System call and loads its own binary image, the execution environment of the child process and the parent process (including anonymous pages) will go their separate ways_ old_ Exec function). Our scenario starts with such a new process after exec. When the process’s anonymous mapping VMA allocates the first page frame through page fault, the kernel will build the data relationship as shown in the following figure:

Av0 in the figure above is the anon of the process_ VMA is a top-level structure, so its root and parent point to themselves. Of course, the AV data structure is for VMA management, but in the new mechanism, it is transferred through AVC. The avc0 in the figure above builds a bridge between VMA and AV. There are pointers to vma0 and av0 respectively. In addition, avc0 is inserted into the red black tree of AV and the linked list of VMA.

For the newly allocated page frame, it will map to a virtual address corresponding to VMA. In order to maintain the reverse mapping relationship, the mapping in struct page points to av0, and the index member points to the offset of the page in the whole vma0. This relationship is not shown in the picture, mainly because it is a clich é. I believe everyone is familiar with it.

In vma0, several page frames may be mapped to a virtual page of the VMA, but the above structure will not change, except that the mapping in each page points to av0 in the figure above. In addition, the avc0 of the dotted green block in the figure above is actually equal to the avc0 block of the green solid line, that is to say, the VMA has only one anon at this time_ vma_ Chain, that is, avc0. The figure above just shows that the AVC will also be linked to the VMA linked list and anon_ VMA’s red black tree.

If you want to refer to the relevant code, you can take a closer look at do_ anonymous_ Page or do_ cow_ fault。

4. What happened to VMA of anonymous mapping in fork?

Once fork, the child process copies the VMA of the parent process_ In the old mechanism, multiple processes share an AV, while in the new mechanism, AV is per process. Then, the “building” of VMA and VA between parent and child processes is established. The main steps are as follows:

(1) Call anon_ vma_ Clone function to establish the relationship between VMA and VA

(2) Establish the relationship between VMA and VA

How to establish the relationship between VMA and VA? It’s actually Anan_ vma_ chain_ The procedure of calling link function is as follows:

(1) Assign an AVC structure, and the member pointer points to the corresponding VMA and VA

(2) Add the AVC to the VMA list

(3) The AVC was added to VA red black tree

Let’s not make things too complicated at the beginning. Let’s take a look at the scene of a new process fork sub process. At this time, the kernel will build the data relationship as shown in the following figure:

First of all, let’s see how to establish the relationship between the child process VMA1 and the parent process av0. Here we need to traverse the anon of vma0_ vma_ Chain list. Of course, this list has only one avc0 (link to av0). In order to establish a connection with the parent process, we assign AVC_ X01, which is a bridge connecting the parent-child process. (Note: AVC)_ The X in X01 is the connection, and 01 is the connection between level 0 and level 1). Through this bridge, the parent process can find the VMA (because of AVC) of the child process_ X01 is inserted into the red black tree of av0, and the child process can also find the AV of the parent process (because of AVC)_ X01 is inserted into the linked list of VMA1).

Of course, my own anon_ VMA also needs to be created. In the figure above, AV1 is the anon of the child process_ VMA, allocate an avc1 to connect VMA1 and AV1 of the subprocess, and call anon_ vma_ chain_ The link function inserts avc1 into the linked list of VMA1 and the red black tree of AV1.

The parent process will also create other new child processes. The hierarchy of the newly created child process is similar to VMA1 and va1, which will not be described here. However, it should be noted that for each child process created by the parent process, each AVC that acts as a “bridge” will be added to the red black tree of av0 to connect to the VMA of the child process.

5. Building a three story building

The previous section describes how the parent process creates a child process. If the child process forks again, the whole vma-va building will form a three-tier structure, as shown in the following figure:

Of course, the first thing to do is to establish the relationship between VMA and va. the “parent processes” here generally refer to those processes at the top of the sun process. For this scenario, “parent processes” refer to processes a and B in the figure above. How to set up? In fork, we copy VMA: that is, we allocate vma2 and copy VMA1 to vma2. Copy is done along the AVC linked list of VMA1, which has two elements: avc1 and AVC_ X01, associated with AV of parent process a and child process B respectively. Therefore, in process C, we allocate AVC_ X02 and AVC_ X12 two AVCS, and establish the relationship between level 2, level 0 and level 1.

In the same way, the anon of one’s own level_ VMA also needs to be created. In the figure above, AV2 is the anon of sun process c_ VMA, allocate an avc2 to connect vma2 and AV2 of the grandson process, and call anon_ vma_ chain_ The link function inserts avc2 into the linked list of vma2 and the red black tree of AV2.

Root in AV2 points to root AV, that is, AV of process a. The parent member points to the AV of its B process (the parent process of C). Through a pointer like parent, AV at different levels establish a parent-child relationship, while through the root pointer, AV at each level can find root AV.

6. How to add page frame to “building”?

The previous sections focus on how the structure of hierarchy AV is built, that is, how VMA, AVC and AV of parent-child process are related in the process of describing fork. In this section, we will take a look at the page fault processing when one of the parent-child processes visits the page. There are two scenarios in this process. One is that there is no page frame in the parent-child process. At this time, the kernel code will call do_ anonymous_ Page allocates page frame and calls page_ add_ new_ anon_ The rmap function establishes the relationship between the page and the corresponding VMA. The second scenario is a bit more complex. It is a scenario where father and son share anonymous pages. When a write fault occurs, the page frame is allocated and page is called_ add_ new_ anon_ The rmap function establishes the relationship between the page and the corresponding VMA. The specific code is in do_ Wp_ Page function. In any scenario, the mapping member of the page is ultimately pointed to the AV structure of the process.

7. Why build such a complex “building”?

If you can insist on reading here, then you have a strong tolerance for boring words, ha ha. Page, VMA, VAC, VA constitute such a complex hierarchy, why on earth? Is it to beat your interest in learning the kernel? No, let’s illustrate the function of this “building” with a real scene.

We set up the structure of the figure above through the following steps:

(1) There are two types of pages in a VMA of P process: one is with real physical pages, the other is not equipped with physical pages. In the figure above, we track a with physical pages and B without physical pages.

(2) P process fork P1 and P2

(3) The P1 process forks the p12 process

(4) The P1 process accesses page A and allocates page frame2

(5) The p12 process accesses the B page and allocates page frame3

(6) The P2 process accesses the B page and allocates page frame1

(7) The P2 process forks the p21 process

After the above actions, let’s take a look at the situation of page frame sharing: for the page frame of the P process, the mapping member of the page points to the AV of the P process, that is, the AV in the figure above_ P) In other words, it may be shared by all pages in VMA, a child process of any level_ P needs to include its child process, child process All VMA of. For the P1 process, AV_ P1 needs to include P1 subprocess, subprocess It can be confirmed that at least VMA of parent process P and brother process P2 need not be included.

Now let’s look back at the AV structure of the building, which actually meets the above requirements.

8. When recycling a page, how to unmap all the mappings of a page frame?

Building such a complex data structure building is for application. Let’s take a look at the scene of page recycling. This scenario needs to find all Vmas mapped to the physical page through page frame. With the foreshadowing, this is not complicated. Through the mapping member in the struct page, you can find the AV corresponding to the page. In the red black tree of the AV, all Vmas that may share anonymous pages are included. Traverse the red black tree and call try for each VMA_ To_ unmap_ The one function can remove all mappings of the physical page frame.

OK, let’s go back to the beginning of this chapter and look at the performance problems caused by the long critical region. Suppose our server has a service process a, which forks 999 sub processes to serve netizens all over the world. Process a has a VMA and 1000 pages. Now let’s compare the processing of the new and old mechanisms.

First, share an anon_ The situation of VMA has been solved in the new mechanism. Each process has its own unique anon_ VMA object, the page of each process points to its own unique anon_ VMA object. In the old mechanism, 1000 Vmas need to be scanned every time a page is unmapped. In the new mechanism, there are only 1000 Vmas in AV of parent process a at the top level and only one VMA in other child processes. This greatly reduces the length of the critical area.

7、 Postscript

This paper leads you to a brief understanding of the development process of reverse mapping. Of course, the wheel of time never stops, and the reverse mapping mechanism is still being revised. If you want, you can also understand its evolution process and put forward your own optimization scheme, leaving your own mark in its history.