1、 Preface

As the kernel evolves, its source code grows ever faster, making the Linux learning curve steeper and steeper. This is certainly not good news for newcomers to the kernel, whose enthusiasm is easily dampened. My suggestion is a step-by-step approach: do not start with the latest kernel, but find an old version (which is generally much simpler), understand it thoroughly, then move forward release by release, understanding the reason and purpose behind each change, and finally arrive at the latest kernel version.

This article starts from the task scheduler of the 2.4 era, describes its implementation in detail, and advances slowly from there. To better understand the design and implementation of the Linux scheduler, Chapter 2 introduces some general concepts. Chapter 4 then discusses how the O(1) scheduler improved scheduler performance. What is truly epoch-making is the CFS scheduler, merged into the mainline in kernel 2.6.23; its design idea is so compelling that even in the latest kernels the completely fair design has not changed much. We describe it in Chapter 6. Chapter 5 introduces the idea of fair scheduling: through it we can understand Con Kolivas's RSDL scheduler, the pioneer of fair scheduling, and with that groundwork CFS becomes much easier to follow.

2、 Task scheduler overview

To avoid confusion, let us start by clarifying a few concepts. "Process scheduler" is the traditional term, but strictly speaking the process is the unit of resource management while the thread is the unit of scheduling. Still, "thread scheduler" sounds awkward, so this article uses "process scheduler" or "task scheduler", often shortened to just "scheduler". In addition, unless otherwise specified, "process" in this article refers to the entity represented by a task_struct; after all, this is a document about the scheduler.

The task scheduler is a very important component of the operating system. Its main function is to dispatch the tasks in the system onto the CPUs for execution while meeting the following performance requirements:

1. For time sharing processes, the scheduler must be fair

2. Fast process response time

3. The throughput of the system should be high

4. Less power consumption

Of course, different tasks have different requirements, so we classify tasks into two kinds: ordinary processes and real-time processes. For real-time processes, quick response is without doubt the most important; for ordinary processes, the first three requirements must all be balanced. You have probably noticed that these requirements conflict with each other. How do we balance them for time-sharing ordinary processes? Here we further subdivide ordinary processes into interactive processes and batch processes. Interactive processes communicate with users and are therefore sensitive to scheduling delay; batch processes work silently in the background and care more about throughput. In any case, time-sharing ordinary processes still need fairness: nobody should feast on fish and meat while someone else cannot even get a sip of soup. As for power consumption, the scheduler historically paid it little attention; only after Linux spread to large numbers of handheld devices did the scheduler have to face this problem. For reasons of space, this article will not expand on it.

To achieve these design goals, the scheduler must consider scheduling factors such as "priority" and "time slice". Many RTOS schedulers are priority based: the highest-priority runnable task always executes. In the Linux kernel, priority is the main consideration for real-time process scheduling, while for ordinary processes, how to carve up the time slice is the core of the scheduler. Too large a time slice seriously damages the response latency of the system, so that users clearly perceive delay and stutter, hurting the user experience. A smaller time slice helps reduce scheduling delay, but frequent context switching seriously impacts system throughput, because most of the CPU time ends up being spent switching processes, forgetting that the original purpose was actually to make progress on the tasks themselves.

Because Linux is a general-purpose operating system, its ambition reaches for the stars: it must run on embedded platforms, deliver good performance in the server field, and give desktop users a smooth experience. The design of the Linux task scheduler is therefore very challenging work, needing to maintain a balance among various conflicting requirements. Fortunately, after decades of hard work by kernel hackers, the Linux kernel keeps moving toward that ultimate goal.

3、 O(n) scheduler in the 2.4 era

There are many Linux kernel archaeology teams on the Internet, digging out the design and implementation of very old kernels. Although I am interested in the history of the process scheduler, I am only interested in its "modern history", so let us start from the 2.4 era. The specific kernel version chosen here is 2.4.18. For the software structure of this version, please refer to the figure below:

All descriptions in this chapter are based on the above software structure diagram.

1. Process descriptor

struct task_struct {
    volatile long need_resched;
    long counter;
    long nice;
    unsigned long policy;
    int processor;
    unsigned long cpus_runnable, cpus_allowed;
    struct list_head run_list;
    unsigned long rt_priority;
    ......
};


In the 2.4 kernel there are two kinds of process switch. One is voluntary: when a process cannot continue because it must wait for a resource, it suspends itself by calling the schedule function, triggering a task scheduling pass. The other is involuntary: the process is executing happily but is forced to give up the CPU by some scheduling event (such as its time slice running out) and is preempted by another process. In that case schedule is not called immediately; instead the switch is deferred by setting the current process's need_resched to 1 and quietly waiting for the next scheduling point to arrive. When it does, the kernel calls the schedule function to preempt the current task.

The nice member is the static priority of an ordinary process. The NICE_TO_TICKS macro maps the static priority of a process to its default time slice, which is saved in the counter member. At the beginning of a scheduling cycle, counter is therefore the amount of CPU time allotted to the process (sleeping processes get a small bonus, described later). It is counted in ticks and decremented by one at each tick until the time slice is exhausted, after which the process waits for the next scheduling cycle to start over.

policy is the scheduling policy. The 2.4 kernel mainly supports three: SCHED_OTHER for ordinary processes, and SCHED_RR and SCHED_FIFO for real-time processes. When rt_priority differs, SCHED_RR and SCHED_FIFO behave the same: whoever has the highest priority executes first. The only difference appears at equal priority: SCHED_RR rotates tasks by time slice, while SCHED_FIFO is first come, first served; the process that occupies the CPU first keeps executing until it exits or blocks, and only then do other real-time processes of the same priority get a chance. If the process is a real-time process, rt_priority holds its static priority; this member is meaningless for ordinary processes and can be set to 0. Besides the three policies above, the policy member can also carry the SCHED_YIELD flag, which has nothing to do with the scheduling policy itself; it mainly serves the sched_yield system call.

The processor, cpus_runnable and cpus_allowed members are all CPU related. processor records the logical CPU on which the process is executing (or last executed). cpus_allowed is the mask of CPUs on which the task is allowed to run. cpus_runnable exists to decide cheaply whether a given process may be scheduled onto a given CPU: if the process is not executing on any CPU, all its bits are set to 1; if it is executing on some CPU, the bit for that CPU is 1 and all others are 0. For how cpus_runnable is used, refer to the can_schedule function.

The run_list member is the node through which the task is linked into various lists. The next section describes how the kernel organizes tasks, so we will not repeat it here.

2. How to organize tasks

The Linux 2.4 process scheduler uses a very simple method to manage runnable processes. The scheduler module defines a list head variable called runqueue_head; whether a process is ordinary or real-time, as soon as its state becomes runnable it is linked into this global runqueue list. As the system runs, processes are constantly inserted into and removed from the runqueue: a newly forked child is hung onto the runqueue, and a process that blocks or exits is removed from it. In any case, the scheduler always looks at the tasks on the global runqueue list and throws the most suitable one onto a CPU for execution. Because all CPUs in the system share one runqueue, the scheduler module defines a spin lock to protect concurrent access to it.

Besides the runqueue, the system also keeps a linked list of all tasks regardless of state, whose head is init_task. This list is used at the end of a scheduling cycle, when tasks are handed their initial time slices again. In addition, processes that enter the sleep state hang on various wait queues. Since these lists are less closely related to scheduling, they are not shown in the figure above.

3. Dynamic priority and static priority

The so-called static priority is the inherent priority of a task, which does not change with process behavior. For real-time processes the static priority is rt_priority; for ordinary processes it is (20 - nice). However, the scheduler does not use static priority directly when scheduling; instead it uses dynamic priority to decide who is more qualified to obtain CPU resources. Of course, the calculation of dynamic priority is based on static priority.

When calculating the dynamic priority (the goodness function), two cases are distinguished: real-time processes and ordinary processes. For real-time processes, dynamic priority equals static priority plus a fixed offset:

weight = 1000 + p->rt_priority;


This is done to separate real-time processes from ordinary ones, and it guarantees that real-time processes are always scheduled ahead of ordinary processes. For ordinary processes, the calculation of dynamic priority is a little more complicated. The relevant code can be extracted as follows:

weight = p->counter;
if (!weight)
    goto out;

weight += 20 - p->nice;

For ordinary processes, the strategy for calculating dynamic priority is as follows:

(1) If the time slice of the process has been exhausted, then the dynamic priority is 0, which also means that the process has no chance to obtain CPU resources in this scheduling cycle.

(2) If the process still has time slice left, its dynamic priority equals the sum of its remaining time slice and its static priority. Static priority is expressed as (20 - nice) so that it increases monotonically as nice decreases. The remaining time slice is counted in so as to reward sleeping processes: a process that slept has more of its slice left, hence a higher dynamic priority, and is more easily picked by the scheduler.

The scheduler schedules strictly by dynamic priority: whoever scores highest executes first. Take ordinary processes as an example. A process with high static priority (a small nice value) and plenty of time slice left will certainly run first. A process with high static priority but little slice left may yield to tasks of moderate priority that have more slice remaining. A task with low static priority suffers a double blow, since its default time slice is small and its priority is low; however, no matter how high the static priorities of others are, once their slices are used up, low-priority tasks always get a chance to execute and will not starve.

When calculating the dynamic priority of an ordinary process, besides the remaining time slice and the static priority, the scheduler also takes cache and TLB performance into account. For example, if process A and process B have the same priority and both have three ticks left, but A last ran on this CPU, then choosing A likely yields better cache and TLB hit rates, so the scheduler raises A's dynamic priority. In addition, if a candidate process shares its address space with the current process, its scheduling index gets a small boost. There are two such cases: either the candidate and the current process are in the same thread group (two threads of one process), or the candidate is a kernel thread, which typically borrows the address space of the previous process. In either case, no address-space switch is needed on the process switch, which is a performance advantage.

4. Scheduling timing

For the 2.4 kernel, the main scheduling points include:

(1) The process voluntarily initiates scheduling.

(2) In timer interrupt handling, the current process is found to have exhausted its time slice.

(3) When a process is woken up (for example, when an RT process is woken). Refer to the next section for details.

(4) When a parent process forks, its time slice is split evenly between parent and child. If only one tick is left, that tick goes to the child and the parent's slice is cleared to zero, which puts the parent in the same situation as a process found to be out of time slice in timer interrupt handling. If the parent's slice is large enough that both parent and child end up with nonzero slices, the scene is similar to a process wakeup, because both scenarios boil down to adding a new task node to the runqueue.

(5) On a process switch. When a switch occurs on one CPU, say from task A to task B, has A necessarily lost its chance to execute? Not at all: although A lost the competition on this CPU, the dynamic priorities of the tasks running on other CPUs may be lower than A's, or some other CPU may happen to be idle, waiting for a new process to settle in.

(6) When the user process actively gives up the CPU

(7) When the user process modifies the scheduling parameters

In all the scenarios above except voluntary scheduling, the schedule function is not called immediately; instead the need_resched flag is set and the kernel waits for a scheduling point to arrive. Since the 2.4 kernel is not preemptive, the scheduling point arrives when control returns to user space, at which time process scheduling starts on that CPU. There are too many preemption scenarios to cover them all, so let us pick the process wakeup scenario and describe it in detail; the other scenarios are left for the reader to analyze.

5. Process wakeup handling

When a process is woken up (try_to_wake_up), the task is added to the global runqueue, but whether to initiate scheduling requires a series of judgments. To describe this clearly, we call the process performing the wakeup the waker and the woken process the wakee. There are two kinds of wakeup: sync wakeup and non-sync wakeup. A sync wakeup means the waker knows that it will soon go to sleep itself, so when it calls try_to_wake_up it hints that preemption is better avoided, since the waker will voluntarily schedule shortly anyway. Moreover, a waker and its wakee usually have some affinity (for example, they communicate through shared memory), so in SMP scenarios, scheduling them onto the same CPU often performs better. If scheduling were triggered inside try_to_wake_up, the wakee would often be dispatched to some other idle CPU; a sync wakeup avoids this unnecessary CPU bouncing. For a non-sync wakeup, there is no such synchronization relationship between waker and wakee: after the wakeup they proceed independently, so a reschedule can be attempted right away.

Of course, sync wakeup does not mean never scheduling. Suppose the waker wakes the wakee on CPU A, but according to the wakee's cpus_allowed member the wakee cannot run on CPU A at all. Then regardless of sync or not, a reschedule must be attempted (by calling reschedule_idle), since waker and wakee are destined to part ways and execute on different CPUs anyway.

Let us first look at the UP (uniprocessor) case. Here waker and wakee run on the same CPU (there is only one CPU in the system, after all), and who gets the CPU depends entirely on their dynamic priorities. If the wakee's dynamic priority is higher than the waker's, the waker's need_resched flag is set, and the schedule function is called when the scheduling point arrives.

In the SMP case, with more CPU resources available, waker and wakee need not fight over one CPU; the wakee may well execute elsewhere. The algorithm is roughly as follows:

(1) Preference is given to dispatching the wakee to an idle CPU. If the CPU the wakee last ran on happens to be idle, that CPU is chosen first; otherwise all CPUs are scanned to find the most suitable idle CPU, namely the one that entered the idle state most recently.

(2) If all CPUs are busy, traverse the tasks currently running on each CPU, compare their dynamic priorities, and find the CPU running the task with the lowest dynamic priority.

(3) If even the lowest dynamic priority found is still higher than the wakee's, no scheduling is needed; the wakee waits patiently in the runqueue for its next opportunity. If the wakee's dynamic priority beats the lowest one found, that task's need_resched flag is set, and if it is running on another CPU, an IPI is sent to trigger scheduling there.

Of course, this is simply the design choice of the 2.4 scheduler; the approach is in fact questionable. Space does not permit a full discussion here; if I get the chance to write an article on load balancing, I will sort out these relationships there.

6. The main scheduler algorithm

The core code of the main scheduler (the schedule function) is as follows:

list_for_each(tmp, &runqueue_head) {
    p = list_entry(tmp, struct task_struct, run_list);
    int weight = goodness(p, this_cpu, prev->active_mm);
    if (weight > c)
        c = weight, next = p;
}


list_for_each traverses all processes on the runqueue_head list; the temporary variable p is the process descriptor being examined this time. How do we determine which process is most suitable to schedule? By computing each process's dynamic priority (the variable weight above), which is implemented in the goodness function. The process with the highest dynamic priority is the one most suitable to put on the CPU. Once it is selected, the scheduler starts a process switch and runs it in place of the previous process.

As the code shows, even if the first node in the list is the most suitable process, the algorithm still traverses the entire global runqueue and compares every entry, from which we conclude that the 2.4 scheduler's algorithmic complexity is O(n). Once the next process is selected, the process-switch module performs the actual switch on the CPU.

For SCHED_RR real-time processes, the notion of time slice rotation applies among tasks of equal priority. Therefore, before traversing the list, the time slice of the outgoing process is handled first:

if (unlikely(prev->policy == SCHED_RR))
    if (!prev->counter) {
        prev->counter = NICE_TO_TICKS(prev->nice);
        move_last_runqueue(prev);
    }

If its time slice (prev->counter above) is used up, a SCHED_RR real-time process is refilled and moved to the end of the runqueue list. This way, among SCHED_RR tasks of equal priority, the traversal of the runqueue hits the first such task in the list, realizing time slice rotation. One oddity here: the time slice of a SCHED_RR task is set according to its nice value, even though nice should really only apply to ordinary processes.

7. Time slice handling

For ordinary processes, the idea behind time slice handling is as follows:

(1) Each process is assigned a default time slice according to its static priority: the higher the static priority, the larger the default time slice.

(2) Once a process in the runqueue is scheduled to execute, its time slice decreases as each tick arrives. When the slice runs out, the process loses its qualification to be allocated CPU resources.

(3) When the time slices of all processes in the runqueue are used up, a scheduling cycle has ended. At that point the default time slices of the processes in the runqueue are reset, and a new scheduling cycle begins.

How is each process's default time slice determined? Since time slices are allocated in ticks, the smallest possible slice is one tick; that is, the default slice of the lowest priority (nice value 19) is one tick. For the middle priority (nice value 0), the slice is set to about 50ms. For the exact algorithm, refer to the NICE_TO_TICKS implementation. I have to admit that computing the default time slice from the nice value is done in an ugly way: the resulting slice differs depending on the HZ setting, meaning the scheduling behavior of the system depends on HZ, which is hard for anyone with code cleanliness to accept. In any case, a practical example gives a feel for it:

nice value    -20                -10                0                  10                 19
HZ=100        11 ticks (110ms)   8 ticks (80ms)     6 ticks (60ms)     3 ticks (30ms)     1 tick (10ms)
HZ=1000       81 ticks (81ms)    61 ticks (61ms)    41 ticks (41ms)    21 ticks (21ms)    3 ticks (3ms)

When the time slices of all processes in the runqueue are exhausted, the procedure that reloads default time slices starts. The code (in the schedule function) is as follows:

if (unlikely(!c)) {
    struct task_struct *p;
    for_each_task(p)
        p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
    goto repeat_schedule;
}

Here c is the maximum dynamic priority found by traversing the runqueue. If it equals 0, two things hold: first, there is no runnable real-time process in the system; second, every runnable ordinary process has consumed its time slice and needs a recharge. The for_each_task macro traverses all process descriptors in the system, runnable or not. For an ordinary process on the runqueue, its current slice p->counter is already 0, so its new slice is simply NICE_TO_TICKS(p->nice), the default computed from its nice value. A process sleeping on a wait queue still has some slice left in p->counter, which is folded into its new quota; this can be seen as a reward for sleeping. To prevent interactive processes that sleep often from accumulating an overly large slice, the leftover is not carried over in full but discounted by 50% (p->counter >> 1).

A new cycle begins, and the current process runs on the CPU. The code that charges it for its time slice lives in the timer interrupt handler, as follows:

if (--p->counter <= 0) {
    p->counter = 0;
    p->need_resched = 1;
}

At each tick the time slice of the running process is decremented by one. When it reaches 0, the scheduler deprives the process of the right to execute, triggering a schedule. Processes whose slices are still nonzero are then selected to run, until the slices of all processes in the runqueue are exhausted, at which point the slices are reassigned and a new cycle begins. In this way, round after round, the scheduler drives the whole system forward.

4、 O(1) scheduler in the 2.6 era

1. Why an O(1) scheduler

If simplicity were the only criterion for judging a scheduler, then the O(n) scheduler would undoubtedly be the best. But although it is simple and easy to understand, it has serious scalability and performance problems. Let us now lay out the "seven sins" of the O(n) scheduler, which are also the reasons Ingo Molnar started the O(1) scheduler project.

(1) Algorithm complexity problem

The most unpleasant part is the traversal of the runqueue. When there are few runnable processes in the system, the overhead of walking the list is acceptable, but as the number of runnable processes grows, the cost of selecting the next task grows linearly with it. This is why we call it the O(n) scheduler.

In addition, at the end of each scheduling cycle, the scheduler "recharges" the time slices of all processes. On a large system there may be thousands of processes (including sleeping ones), so recomputing the time slice of every one of them is time-consuming.

(2) SMP scalability issues

The O(n) scheduler of the 2.4 kernel has poor SMP scalability. As we have seen, it manages all runnable processes in the system through a single linked list, and that list is accessed in many scenarios: scheduling traverses the runqueue to find a suitable next task; waking up or blocking a process adds or removes nodes. Every access takes the runqueue spin lock with local interrupts disabled, performs its operation on the list, then releases the lock and re-enables interrupts. This synchronization guarantees correct concurrent access from multiple CPUs. When the CPU count is small, the cost of the spin lock is acceptable; but on large systems with many CPUs, the runqueue spin lock becomes a performance bottleneck.

(3) CPU idle problem

Whenever all the processes in the runqueue run out of time slices, the time slices of all processes in the system must be recomputed. This computation is very ugly: it traverses every process in the system (note: every process!) to assign a new value to the counter member of each process descriptor. This operation is enough to wipe out the L1 caches on the CPU; when the recalculation completes, you face an essentially cold L1 cache (not literally empty, but full of now-meaningless data, so the hit rate drops sharply). Moreover, the recalculation causes a waste of CPU resources, which the following figure illustrates:

Before all time slices in the runqueue are exhausted, the system tends toward a state where the last few processes with remaining slices are spread across the CPUs. Take four CPUs as an example, with T0 through T3 running on CPU0 through CPU3. As the system runs, T2 on CPU2 exhausts its slice first, but no scheduling can happen on CPU2: traversing the runqueue finds no suitable process, so CPU2 can only sit idle. Perhaps T0 and T3 then run out as well, idling CPU0 and CPU3. Now only the last process T1 is still running on CPU1 while every other processor idles, wasting resources. The only thing that can break this state is T1 exhausting its slice, triggering the time slice recalculation, after which normal scheduling resumes. As the number of CPUs in the system grows, this waste becomes ever more serious.

(4) Task bouncing issue

Generally speaking, once a process starts running on a CPU, it is best kept on that CPU for as long as it stays runnable. However, under the O(n) scheduler, many users observed processes bouncing between CPUs, and the root cause again lies in the time slice algorithm. At the start of a new cycle, the time slices of all runqueue processes are full, so each CPU has plenty of candidates, and since the scheduler favors the process that last ran on the current CPU, it usually picks one with the right affinity. But as processes exhaust their slices one by one, the choices shrink, and a CPU ends up executing processes with little affinity for it (for example, a process that last ran on CPU0 gets scheduled onto CPU1, even though its affinity with CPU0 is greater).

(5) Scheduling performance of RT processes

Real-time scheduling performance is mediocre. As described in the previous section, real-time processes and ordinary processes are linked into the same list. To schedule a real-time process, the entire runqueue must be traversed, scanning and scoring all processes, in order to select the desired real-time task. In principle, real-time and ordinary processes live in separate scheduling domains and should not affect each other; yet here, scheduling a real-time process requires scanning and scoring ordinary processes too. Such a poor algorithm is unbearable to engineers who care about real-time behavior.

Of course, the above is not the key point. Most importantly, the Linux kernel of that era was non-preemptive: a task could not be preempted anywhere in kernel mode. For time-consuming system calls or interrupt handlers, having to wait until the return to user space before scheduling can happen is unacceptable. Besides kernel preemption, priority inversion also needs attention from the scheduler; otherwise, even when an RT process becomes runnable, it can only watch a lower-priority process run until the resource the RT process is waiting for is released.

(6) Scheduling delay of interactive ordinary processes

The O(n) scheduler does not distinguish interactive processes from batch processes; it simply rewards processes that sleep a lot. However, some batch processes are also IO-bound: a database service, for example, is a background process insensitive to scheduling delay, yet it frequently blocks on disk IO. Boosting the dynamic priority of such a background process is pointless and only increases the scheduling delay of genuinely interactive processes. Conversely, a user-interactive process may be CPU-bound, in which case the scheduler fails to recognize its scheduling requirements and does not compensate it.

(7) Time slice granularity.

The response delay perceived by users is related to system load, which we can roughly describe by the number of runnable processes. Under high load the runqueue is long and one scheduling cycle takes a long time. For example, with HZ = 100, five runnable processes at nice 0 and a time slice quota of 60ms each give a scheduling cycle of 300ms. As the number of runnable processes grows, the scheduling cycle grows with it; a process that has exhausted its time slice can only wait for the next cycle, so system response worsens as the cycle lengthens.

Although the O(n) scheduler has many issues, the community basically approved of the algorithm itself, so the new scheduler was designed not to overthrow the O(n) design but to improve on its problems. In this chapter we use the 2.6.11 kernel to describe how the O(1) scheduler works. Since there is no essential difference between the O(1) and O(n) schedulers, we only describe where they differ.

2. Software function division of the O(1) scheduler

The following figure shows the software framework of the O(1) scheduler:

The O(n) scheduler has only one global runqueue, which seriously hurts scalability, so the O(1) scheduler introduces per-CPU runqueues. All runnable processes in the system are distributed to the runqueue of some CPU by the load balancing module, and the main scheduler and tick scheduler then drive scheduling on that CPU. For reasons of space we will not cover the load balancing module in this paper, but focus on the task scheduling algorithm on a single CPU.

With per-CPU runqueues, the O(1) scheduler drops the spin lock of the global runqueue and moves the lock into the per-CPU runqueue data structure (rq->lock). Splitting one big lock into smaller ones greatly reduces scheduling delay and improves system response time; this is a common, general performance optimization in the kernel.

The software structure above solves the SMP scalability problem and the CPU idle problem of the O(n) scheduler. In addition, a good load balancing algorithm can also solve the task bouncing problem of the O(n) scheduler.

3. Runqueue structure of the O(1) scheduler

The basic optimization of the O(1) scheduler is to turn the single linked list of the original runqueue into multiple lists: processes of each priority are linked on their own list. The related software structure is shown in the following pictures:

Runqueue is a very important data structure in the scheduler; its main role is to manage processes in the runnable state. The O(1) scheduler introduces the concept of a priority queue to manage tasks, abstracted as struct prio_array:

struct prio_array {
    unsigned int nr_active;
    unsigned long bitmap[BITMAP_SIZE];
    struct list_head queue[MAX_PRIO];
};

Since 140 priorities are supported, the queue member holds 140 list heads, one per priority, and processes of different priorities are linked on different lists. bitmap indicates whether the list for each priority is empty, and nr_active records how many tasks are in the queue. Of these priorities, 100–139 are for ordinary processes and the rest are for real-time processes. Thus, in the O(1) scheduler, RT processes and normal processes are separated, and normal processes no longer affect RT scheduling at all.

The runqueue contains two priority queues (struct prio_array), managing the active processes (time slice remaining) and the expired processes (time slice exhausted) respectively; two pointers in the runqueue point at these two queues. As the system runs, tasks in the active queue exhaust their time slices one by one and are moved to the expired queue. When the active queue becomes empty, the active and expired queues are swapped and a new round of scheduling begins.

Although the O(1) scheduler changed how tasks are organized, its core idea is still that of the O(n) scheduler: divide CPU resources into time slices and hand them to each runnable process; once a process uses up its quota, it is preempted and waits for the next scheduling cycle.

4. Core scheduling algorithm

The main job of the main scheduler (the schedule function) is to find, in the CPU's runqueue, the most suitable process to run next. The basic idea is to search the active priority queue. The code is as follows:

idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);

First, the first non-empty list is located via the bitmap of the current active priority queue; the first node of that list is then the most suitable process to schedule next. Since no traversal of a whole linked list is involved, the algorithmic complexity of this scheduler is constant, which solves the O(n) complexity issue.

If the current active priority queue is "empty" (nr_active equals 0), the active and expired priority queues need to be swapped:

if (unlikely(!array->nr_active)) {
    rq->active = rq->expired;
    rq->expired = array;
    array = rq->active;
}

The swap is very fast, and there is no traversal of all processes to reset default time slices (which greatly shrinks the runqueue's critical section). All of this avoids the problems of the O(n) scheduler and improves scheduler performance.

5. Static priority and dynamic priority

In the previous section we deliberately glossed over the exact meaning of "priority" in the priority queue; it is time to clarify. The "priority" in the priority queue is in fact the dynamic priority. From this point of view, the scheduling algorithms of the O(1) and O(n) schedulers are unified again: both schedule by dynamic priority.

The notion of static priority in O(1) is similar to O(n). For real-time processes it is stored in the rt_priority member of the process descriptor, ranging from 1 (lowest priority) to 99 (highest priority). For normal processes it is stored in the static_prio member, ranging from 100 (highest priority) to 139 (lowest priority), corresponding to nice values -20 to 19.

Having covered static priority, let us look at dynamic priority (stored in the prio member of the process descriptor). Since dynamic priority is what actual scheduling uses, it must be monotonic. Static priority is not: 99 for RT and 100 for ordinary processes are both "peaks" of the static priority axis; along that axis the RT segment increases with priority while the ordinary segment decreases. To fix this, a simple mapping computes the dynamic priority of a real-time process:

p->prio = MAX_USER_RT_PRIO - 1 - p->rt_priority;

Through this transformation, the RT dynamic priority also decreases monotonically along the axis. The dynamic priority of an ordinary process is not so simple: besides the static priority, the average sleep time of the process (stored in the sleep_avg member of the process descriptor) is taken into account, and the process is rewarded or punished accordingly. See the effective_prio function for the details, which we will not go into here. The dynamic priority of an ordinary process ranges from 100 (highest priority) to 139 (lowest priority), consistent with the static priority range.

To close this section, we compare the dynamic priority algorithms for ordinary processes in the O(1) and O(n) schedulers. The basic idea is the same: start from the static priority and adjust it by an estimate of the process's interactivity; a highly interactive process gets its dynamic priority raised, otherwise it is lowered. In estimating interactivity, however, O(1) is clearly better. The O(n) scheduler only looks at the remaining time slice of a sleeping process, while the O(1) "average sleep time" heuristic considers far more factors: execution time on the CPU, waiting time in the runqueue, sleep time, the process state while sleeping (whether it can be interrupted by a signal), and the wake-up context (interrupt context or process context). The O(1) scheduler can therefore better judge whether a process is interactive or batch, and compensate interactive processes more accurately.

6. Time slice processing

The default time slice is computed by task_timeslice. In the O(1) scheduler the default time slice has nothing to do with HZ: no matter how HZ is set, ordinary processes with static priority corresponding to nice [-20 … 0 … 19] get default time slices of [800ms … 100ms … 5ms].

When a tick arrives, the time slice of the current task is decremented (--p->time_slice). When it reaches 0, the task is removed from the active priority queue, the resched flag is set, and the time slice is reset. The code is as follows:

dequeue_task(p, rq->active);
set_tsk_need_resched(p);
p->time_slice = task_timeslice(p);

The task_timeslice function computes the process's time slice quota. In the O(1) scheduler, time slice reassignment is decentralized: it happens immediately whenever a task exhausts its slice. This corrects the O(n) scheduler's behavior of traversing all processes in the system in one go to re-assign time slices.

7. Identifying user interactive processes

Normally, after its time slice is exhausted a task is put into the expired priority queue, where it must wait until the next active/expired swap to be scheduled again. Some special scenarios, however, need special treatment:

if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
    enqueue_task(p, rq->expired);
    if (p->static_prio < rq->best_expired_prio)
        rq->best_expired_prio = p->static_prio;
} else
    enqueue_task(p, rq->active);

Here TASK_INTERACTIVE determines whether the process is a user-interactive process (it too is based on the average sleep time, which shows that the average sleep time is used not only to compute the dynamic priority but also to decide whether a process is re-inserted into the active queue). If it is, the process is sensitive to user response and must not be dumped into the expired queue; it is hung back on the active queue instead and keeps getting chances to run. The O(1) scheduler really does take care of interactive processes: once judged interactive, a process gets continuous execution. Of course, the scheduler cannot go too far: if interactive processes kept occupying the CPU, what about the processes waiting in the expired queue? This is where the starvation state of the expired queue comes in, checked by the EXPIRED_STARVING macro. If processes in the expired queue have been waiting too long, the scheduler has been seriously unfair; in that case, even a process judged interactive is put into the expired queue when its slice runs out, so that the scheduling cycle ends as soon as possible and active and expired can be swapped.

The O(1) scheduler uses a very complex algorithm to compute a process's interactivity index and decide whether it is an interactive process, hardcoding many magic constants presumably tuned through a large number of experimental scenarios. I do not have the courage to explore the design rationale of this part, so we skip it here. In any case, it is still better than the O(n) scheduler, which only considers sleep time.

8. Preemptive kernel

In the 2.4 era most Linux deployments were servers, so a non-preemptive kernel was a reasonable design choice. But as Linux penetrated desktop and embedded systems, system response became the main user complaint. During the 2.5 development cycle, Linux therefore introduced the concept of a preemptible kernel (CONFIG_PREEMPT). Without this option everything behaves as in the 2.4 kernel; with it, scheduling no longer has to wait for the return to user space, and most kernel execution paths can be preempted. Again, due to space limitations, we will not expand on this here.

5、 The introduction of the idea of fair scheduling

1. Time slice paradox of traditional scheduler

In the O(n) and O(1) schedulers time slices are fixed, and processes with higher static priority get larger time slices. For example, a process with nice 19 gets a default time slice of 5ms, while a process with nice 0 gets 100ms. Intuitively there is nothing wrong with this policy (higher priority gets more execution time), but the subtext of the setting is that high-priority processes get more opportunity for continuous execution, which is what CPU-bound processes want. In reality, however, CPU-bound processes usually run in the background with relatively low priority.

Suppose, then, that our scheduling policy is to assign a fixed time slice according to static priority. We face a dilemma in allocating the slices: we want to give user-interactive processes high priority for a better user experience, but giving them a large time slice is meaningless, since such processes spend most of their time blocked, waiting for user input. Background processes usually have low priority, but giving them a small slice to match often hurts their throughput: this type of process is best served by running as long as possible while its cache is hot.

What should we do? Perhaps the traditional scheduler's very idea of a fixed, pre-allocated time slice is wrong.

2. Lag problem of the traditional scheduler

During the development of Linux 2.5, the O(1) scheduler designed by Ingo Molnar replaced the original, crude O(n) scheduler, fixing its scalability and performance problems. In most scenarios the scheduler performed satisfactorily; many vendors even backported the O(1) scheduler into their commercial Linux 2.4 distributions. Clearly the O(1) scheduler performed excellently.

However, O(1) was not perfect. In practice, many desktop users still complained about poor interactivity: when a CPU-hungry process started, existing interactive programs (say, a browser showing a web page) suffered noticeable lag. Many talented engineers tried to fix this by tweaking the interactivity heuristics, including Con Kolivas, who later proposed fair scheduling. But no matter how the algorithm was adjusted, fixing one scenario always seemed to surface another scenario with poor interactivity; picky users could always find perceptible response delays in corner cases.

Repeatedly patching user-visible lag makes engineers irritable, and that is when calm thinking is most needed. Con Kolivas examined the scheduler code carefully and found that the problem was the code estimating a process's interactivity. Why should the scheduler guess a process's interactivity requirements from its behavior? That is an impossible task: you can never guess 100% correctly, just as you cannot always read the mind of a girl you like, no matter how carefully you collect information on her personality, hobbies, working style, way of thinking, star sign, even her physiological cycle. Why not schedule based on something that is actually certain? A process's static priority is already set and already expresses its scheduling requirements; that is enough. There is no need to guess interactivity, only to maintain fairness: each process obtains CPU execution time matching its static priority. Guided by this idea, Con Kolivas proposed the RSDL (Rotating Staircase Deadline) scheduler.

3. The RSDL scheduler

The RSDL scheduler still uses the data structures and software structure of the O(1) scheduler, minus, of course, the hair-raising code estimating process interactivity. This section cannot describe the RSDL algorithm in full detail, but once the three basic concepts of rotating, staircase and deadline are clear, we have a basic understanding of RSDL.

First, the staircase, more precisely the priority staircase: as a process is scheduled, its priority drops step by step like walking down stairs. In traditional scheduling, a process has one time slice matching its static priority. RSDL also has such a time slice, but it is spread across many priorities. For example, if a process's priority is 120, its time slice is spread across priorities 120–139. In a scheduling cycle the process is first queued at priority 120 and runs there for 6ms (a tunable parameter; we assume each priority's quota is 6ms). Once the quota at level 120 is used up, the process steps down to 121 (one level lower): a rotation occurs, more precisely a priority minor rotation. The process continues down the staircase until priority 139; when the 6ms quota at that level is exhausted, the process's entire time slice is gone and it is hung on the expired queue to wait for the next scheduling cycle. This rotation is called a major rotation. At that point the process is queued on the expired queue at its static priority, i.e., everything returns to the scheduling starting point.

Deadline means that under the RSDL algorithm any process can accurately bound its scheduling delay. A simple example: suppose the runqueue holds two tasks, process A with static priority 130 and process B with static priority 139. Process B only gets a chance to run once A has walked the priority staircase from 130 down to 139, so B's scheduling delay is 9 × 6ms = 54ms.

What a simple algorithm: just maintain fairness, with no sleep/run time statistics, no interactivity index, no strange constants… That is all scheduling needs to be.

6、 CFS scheduler

Con Kolivas's RSDL scheduler never made it into the kernel mainline; instead the CFS scheduler, built on the same idea of fair scheduling, did. When CFS was merged it did bring a modular design, making it easier for RSDL or other schedulers to enter the kernel later. Con Kolivas, however, withdrew from the community for good, dissatisfied with the kernel development process. As the saying goes, wherever there are people there are disputes; let us leave the rights and wrongs to posterity.

The design goal of CFS is to approximate, on real hardware, ideal, precise, completely fair multitask scheduling. Such a scheduler does not exist, of course, and trade-offs must be made in the actual design and implementation. In fact, all of CFS's design ideas were already laid out in the previous chapter; the only thing left to describe in this chapter is how Ingo Molnar brought the ideal of completely fair scheduling into reality.

1. The introduction of modularization

Starting with the 2.6.23 kernel, the scheduler adopts a modular design that divides the process scheduling software into two levels: the core scheduler layer and the specific scheduler layer.

Functionally, process scheduling is still divided into two parts. The first distributes runnable tasks evenly across the per-CPU runqueues according to load, via the load balancing module. The second schedules on a single CPU, driven by that CPU's main scheduler and tick scheduler. Different schedulers handle different task types (RT tasks, normal tasks, deadline tasks), but whatever the type, they share common logic, which is abstracted into the core scheduler layer. Each specific scheduler type defines its own sched_class and joins the system on a linked list. This modular design lets users define a specific scheduler for their own scenarios without changing the logic of the core scheduler layer.

2. On Fairness

Like the RSDL scheduler, CFS pursues fairness in CPU resource allocation: CPU computing resources are allocated precisely and evenly to the tasks running on the CPU. For example, if two tasks of equal static priority run on one CPU, each consumes 50% of the CPU's computing resources; if their static priorities differ, CPU resources are allocated in proportion to them. How exactly are CPU resources allocated by priority? This is where the concept of load weight comes in.

CFS uses a constant array (sched_prio_to_weight) to map a process's static priority to a weight, and the weight is the share of CPU resources the process should receive. For example, with three runnable threads A, B and C of weights a, b and c, thread A should be allocated a / (a + b + c) of the CPU. The fairness of the CFS scheduler is thus to ensure that all runnable processes receive CPU resources in proportion to their weights.

3. Time granularity

CPU resource allocation is an abstract concept; a scheduler implementation must make it concrete. The simplest way is to turn the allocation of CPU resources into the allocation of CPU time. Seeing the term "time slice", you might instinctively feel CFS is no different from O(1): aren't both about time slices? In fact the CFS scheduler abandons the traditional fixed time slice. The O(1) scheduler assigns each process a default time slice; a process that has used up its slice goes to the expired queue, and when every process in the system has consumed its slice, all of them get fresh slices and re-enter the active queue. CFS has no such static time slice: its "time slice" is dynamic, depending on the current CPU's runnable processes and their priorities.

For an ideal completely fair scheduling algorithm, CPU time is shared fairly no matter how short the observation window. With a 100ms window, two runnable processes A and B of equal static priority get 50ms each. A need not run 50ms straight and then hand over to B; A and B may switch many times within the window, but A's total CPU time is 50ms and so is B's. What about a 1ms window: do A and B each run 500µs? And if we shrink the window further and observe at microsecond scale? Obviously each process cannot run 500ns at a time; that would burn most of the CPU in context switches, leaving ever less time for real work. CFS therefore defines a time granularity, which we call the scheduling period: CFS guarantees that all runnable processes share CPU time fairly within one scheduling period. The next section describes this concept in detail.

4. How to guarantee bounded scheduling delay?

The traditional scheduler cannot bound scheduling delay. To illustrate, imagine a scenario: the CPU runqueue holds two runnable processes A and B with nice 0. The traditional scheduler assigns each a fixed 100ms slice; A runs first until its 100ms are gone, then B runs, the two alternating with a scheduling delay of 100ms. As system load rises, say two more nice-0 runnable processes C and D join the runqueue, A, B, C and D run in turn and the scheduling delay grows to 300ms. The scheduling delay of the traditional scheduler thus grows with system load, and under high load users are more likely to observe stutter.

CFS was designed from the start around a scheduling-delay parameter called the targeted latency. It resembles the scheduling cycle of traditional schedulers, except that where the old cycle handed each runnable process a fixed slice, CFS constantly checks whether fairness is maintained within one targeted latency. The following code computes the targeted latency:

static u64 __sched_period(unsigned long nr_running)
{
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}

When the number of runnable processes in the runqueue is no greater than sched_nr_latency (8), the targeted latency is sysctl_sched_latency (6ms); when it exceeds sched_nr_latency, the targeted latency equals the number of runnable processes multiplied by sysctl_sched_min_granularity (0.75ms). sysctl_sched_min_granularity is thus a floor on the time a scheduled process gets to execute, set to prevent the performance loss of over-scheduling.

The CFS scheduler ensures that every runnable process executes at least once within one targeted latency, thereby guaranteeing a bounded, predictable scheduling delay.

5. Why virtual time?

Although Con Kolivas proposed the brilliant design idea, his implementation was relatively conservative. CFS took a more radical approach, replacing the runqueue's priority lists for managing tasks with a red-black tree. Given such a tree of runnable processes, how is a process's position in the tree determined at insertion? In other words, what is the "key" of this tree?

The CFS red-black tree uses vruntime (virtual runtime) as its key. To understand vruntime we need the notion of a virtual timeline. The previous chapter made the meaning of fairness clear: CPU resources are allocated according to the static priority of the process, and CPU resources ultimately mean CPU time. In the physical world, fairness is allocating CPU time in proportion to static priority. But a red-black tree needs a single axis to compare keys on, while we have two measurement factors: static priority and physical time. We therefore map both onto one virtual time axis that cancels out the effect of static priority. The formula is as follows:

vruntime = (physical runtime) × (weight of nice 0) / (weight of the process)

This formula constructs a virtual world: the two-dimensional physical world (load weight, physical runtime) collapses into the one-dimensional virtual runtime. In this virtual world the vruntime of every process can be compared to determine its position in the red-black tree, and the fairness of CFS is to maintain fairness of vruntime, that is, to keep the vruntime of every process equal.

From the formula we can see that for a process with static priority 120 (nice 0) the physical and virtual time axes coincide, while for other static priorities virtual time is scaled by the ratio of the nice-0 weight to the process weight: the virtual clock of a higher-priority process runs slower, that of a lower-priority process faster.

A simple example of fairness in the virtual world: if two runnable processes A and B exist between time points a and b (on the virtual time axis), CPU time between a and b is split evenly, each process running (b − a)/2, i.e., getting 50% of the time. At time b a new runnable process C appears and stays until time c; from b to c, processes A, B and C each get (c − b)/3 of the execution time, i.e., one third of the CPU each. Again, all of these times are virtual times.

6. How to calculate virtual runtime

To measure time you need a clock. For the CFS scheduler that "clock" lives in the runqueue (the clock and clock_task members). The runqueue keeps two clocks: one records actual physical time, the other the time for which tasks actually execute (clock_task). clock_task exists to account execution time more precisely: while a task runs, asynchronous events such as interrupts inevitably occur, and the task is suspended while they are handled. Counting that time as the task's execution time would be biased, so clock_task excludes the time spent handling asynchronous events and only accumulates while the task is truly executing.

With the clock in place, timing a process becomes fairly simple. When the process enters the running state, read the clock_task "watch" and record the value as A. When the running time needs to be accounted, read clock_task again and record the value as B. B - A is the physical time the process has run. Of course, CFS cares about virtual time, so this physical time still has to be converted into virtual time by the calc_delta_fair function and then accumulated into the process's virtual runtime (the vruntime member of sched_entity, which serves as the key of the red-black tree).
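The record-A-then-B accounting can be sketched as follows. This is a simplified model, not kernel code: `Entity`, `exec_start` and `update_curr` mirror the roles of the kernel's sched_entity fields and update path, and `calc_delta_fair_like` is our stand-in for calc_delta_fair.

```python
NICE_0_LOAD = 1024

def calc_delta_fair_like(delta_ns, weight):
    # Stand-in for the kernel's calc_delta_fair(): scale physical
    # time into virtual time by NICE_0_LOAD / weight.
    return delta_ns * NICE_0_LOAD // weight

class Entity:
    """Toy sched_entity: load weight, accumulated vruntime, and the
    clock_task reading taken when the task last started running."""
    def __init__(self, weight):
        self.weight = weight
        self.vruntime = 0
        self.exec_start = 0   # reading "A" of the clock_task watch

def update_curr(se, clock_task_now):
    # reading "B" minus reading "A" = physical time actually run
    delta_exec = clock_task_now - se.exec_start
    se.exec_start = clock_task_now
    se.vruntime += calc_delta_fair_like(delta_exec, se.weight)

se = Entity(weight=1024)         # nice 0
se.exec_start = 5_000_000        # clock_task when the task was picked
update_curr(se, 9_000_000)       # clock_task at the next update
print(se.vruntime)               # -> 4000000 (weight 1024: 1:1 scaling)
```

For a nice-0 task the 4 ms of physical time become 4 ms of vruntime; a heavier entity would accumulate proportionally less.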

7. Operation of CFS scheduler

The CFS scheduler has no concept of a fixed allocated time slice; there is only a fixed weight, computed from the process's static priority. Once the CFS scheduler selects a process to run, it immediately starts tracking the process's virtual runtime, updating it whenever a tick arrives. With this virtual runtime information, the scheduler can continually check the fairness of the current system (rather than check whether a time slice is used up). The specific method is: compute the target latency (the scheduling period) according to the current system state; within this scheduling period, compute the time slice (in physical time) that the current process should receive; then compare it with the physical time the current process has already accumulated. If the accumulated time exceeds the slice the process should receive, update the current process's vruntime, set the need_resched flag, and initiate scheduling at the nearest scheduling point.
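The period/slice check described above can be sketched like this. The 6 ms target latency and 0.75 ms minimum granularity are illustrative defaults, and the function names are ours, loosely modeled on the kernel's sched-period and slice computations rather than copied from them.

```python
TARGET_LATENCY_NS = 6_000_000     # illustrative scheduling period
MIN_GRANULARITY_NS = 750_000      # illustrative floor per task

def sched_period(nr_running):
    # With many runnable tasks, stretch the period so each task
    # still gets at least the minimum granularity.
    if nr_running > TARGET_LATENCY_NS // MIN_GRANULARITY_NS:
        return nr_running * MIN_GRANULARITY_NS
    return TARGET_LATENCY_NS

def sched_slice(weight, total_weight, nr_running):
    # A task's physical-time slice is its weight's share of the period.
    return sched_period(nr_running) * weight // total_weight

def tick_check(curr_runtime_ns, weight, total_weight, nr_running):
    # True means: mark need_resched, preempt at the next scheduling point.
    return curr_runtime_ns > sched_slice(weight, total_weight, nr_running)

# Two nice-0 tasks: each deserves half of the 6 ms period, i.e. 3 ms.
print(sched_slice(1024, 2048, 2))          # -> 3000000
print(tick_check(3_500_000, 1024, 2048, 2))  # -> True, time to preempt
```

Note that the slice here is not a pre-allocated quantum: it is recomputed from the current weights every time the check runs, which is exactly why CFS adapts when tasks come and go.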

When a reschedule occurs, the scheduler needs to select the next task that will occupy the CPU. For CFS, the algorithm is to pick the leftmost task in the red-black tree and schedule it to run. The task that loses the CPU is re-inserted into the red-black tree, and its position in the tree is determined by its vruntime value.
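The pick-leftmost / re-insert cycle can be modeled with any structure ordered by vruntime; here a heap stands in for the kernel's red-black tree (both give the minimum-key element, though the rbtree also supports the O(log n) deletion of arbitrary nodes that the kernel needs).

```python
import heapq

class MiniRunqueue:
    """Toy CFS run queue: a min-heap keyed by vruntime stands in
    for the red-black tree of sched_entity nodes."""
    def __init__(self):
        self._heap = []

    def enqueue(self, vruntime, name):
        heapq.heappush(self._heap, (vruntime, name))

    def pick_next(self):
        # The "leftmost" node: the task with the smallest vruntime.
        return heapq.heappop(self._heap)

rq = MiniRunqueue()
rq.enqueue(3000, "A")
rq.enqueue(1000, "B")
rq.enqueue(2000, "C")

vr, name = rq.pick_next()
print(name)              # -> B, the smallest vruntime runs first
# After B runs, its vruntime has grown; re-insert it at its new position.
rq.enqueue(vr + 1500, name)
print(rq.pick_next()[1])  # -> C, now the leftmost task
```

Because a running task's vruntime keeps growing while the waiters' vruntime stands still, every task eventually drifts to the left edge of the tree and gets the CPU, which is the mechanism behind the fairness described above.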
