Python is a very practical language in day-to-day development and production, and it is worth mastering for scenarios such as crawlers and network requests. However, because of the global interpreter lock (GIL), a CPython process effectively executes Python code in only one thread at a time, so improving Python's processing speed is an important problem. One of the key techniques for this is the coroutine. This article covers how to understand and use Python coroutines, focusing mainly on network requests, in the hope of helping readers who need it.
Before looking at the concept of coroutines and where they apply, we need several basic operating-system concepts: process, thread, synchronous, asynchronous, blocking and non-blocking. These concepts help not only with coroutines but also with other topics such as message queues and caches. What follows is a summary based on my own understanding and on material found online.
From interviews, we all remember that a process is the smallest unit of system resource allocation. A running system is made up of running programs, that is, processes. A process's memory is generally divided into a text area, a data area and a stack area.
The text area stores the code (machine code) executed by the processor. Generally speaking, it is a read-only area to prevent the running program from being accidentally modified.
The data area stores global and static variables and dynamically allocated memory. It is subdivided into the initialized data area (global, static, constant and external variables that are given an initial value) and the uninitialized data area (global and static variables that are initialized to 0). The initial values of initialized variables are stored alongside the program text in the executable image and copied into the initialized data area when the program starts.
The stack area stores the return addresses, parameters and local variables of active procedure calls. In the address space, the stack area sits opposite the heap area, and the two grow toward each other through linear memory. The heap is placed at low addresses and grows from low to high. The size of the stack is unpredictable, since it is used as calls are made, so the stack is placed at high addresses and grows from high to low. When the heap and stack pointers meet, memory is exhausted, which results in a memory overflow.
Creating and destroying a process is a relatively expensive operation in terms of system resources. To run, a process must be given CPU time. On a single-core CPU, only one process's code can execute at any moment, so running multiple processes on a single core means switching between them quickly, which makes them appear to run at the same time.
Because processes are isolated from one another, each process has its own memory. Compared with threads, which share memory, this is relatively safe. Data can be shared between processes only through IPC (inter-process communication).
Threads are the smallest unit of CPU scheduling. If the process is a container, the thread is the program running in the container. The thread belongs to the process. Multiple threads of the same process share the memory address space of the process.
Threads can communicate directly through global variables, so communication between threads is relatively unsafe. This is why various kinds of locks were introduced, which are not described in detail here.
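As a minimal sketch of the point about shared globals and locks, the following uses the standard threading module. The names counter, add_many and run_threads are illustrative only and do not come from the article.

```python
import threading

counter = 0                      # shared global, visible to every thread
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:               # only one thread may update at a time
            counter += 1


def run_threads(num_threads=4, n=10_000):
    """Start num_threads workers, wait for them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(n,)) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


total = run_threads()
print(total)  # 40000
```

The lock makes each read-modify-write of the global atomic with respect to the other threads, which is the kind of protection that shared memory between threads requires.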
When a thread crashes, it brings down the whole process, so the other threads die with it. Processes are different: when one process crashes, the other processes keep running.
On a multi-core system, a process has only one thread by default, so running multiple processes amounts to running one process per core.
Synchronous and asynchronous
Synchronous and asynchronous describe the message communication mechanism. A synchronous call does not return until the result is available; once it returns, the caller has the return value. In other words, the caller actively waits for the result. An asynchronous call returns immediately after the request is issued, without the result. The actual result is delivered later, through a callback or a similar mechanism.
With a synchronous request, the caller actively reads or writes the data and waits for the result. With an asynchronous request, the caller does not get the result immediately; after the call is issued, the callee notifies the caller through a status or notification, or handles the result through a callback function.
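The difference can be sketched with the standard concurrent.futures module. This is only one way to illustrate it; the function names square and on_done are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


# Synchronous: the caller waits and receives the value directly.
sync_result = square(7)

# Asynchronous: submit() returns a Future at once, and the result
# is delivered later through the registered callback.
received = []


def on_done(future):
    received.append(future.result())


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(square, 7)
    future.add_done_callback(on_done)
# Leaving the with-block waits for the worker to finish.

print(sync_result, received)  # 49 [49]
```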
Blocking and non blocking
A blocking call suspends the current thread until the result is returned; the calling thread does not continue until it has the result. A non-blocking call returns immediately even when the result is not yet available, so the current thread is not suspended. The distinction is therefore whether the process or thread has to wait when the data it wants to access is not ready.
Non-blocking I/O is usually implemented with I/O multiplexing, whose mechanisms include select, poll and epoll.
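A small sketch of both ideas, using the standard socket and selectors modules. selectors.DefaultSelector picks the best multiplexing mechanism available on the platform (epoll on Linux, for example). The socket pair here merely stands in for a real network connection.

```python
import selectors
import socket

# A connected pair of sockets stands in for a network connection.
reader, writer = socket.socketpair()
reader.setblocking(False)

# Non-blocking read with no data available: the call returns at once
# with an error instead of suspending the thread.
try:
    reader.recv(1024)
    would_block = False
except BlockingIOError:
    would_block = True

# Multiplexing: ask the selector which registered sockets are ready.
sel = selectors.DefaultSelector()
sel.register(reader, selectors.EVENT_READ)

writer.send(b"ping")
events = sel.select(timeout=1)
data = [key.fileobj.recv(1024) for key, _mask in events]

sel.unregister(reader)
sel.close()
reader.close()
writer.close()

print(would_block, data)  # True [b'ping']
```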
With these concepts in place, let's look at the concept of coroutines.
A coroutine runs inside a thread and is also known as a micro-thread or fiber. For example, while executing function A, I want to be able to interrupt it at any time to execute function B, then interrupt B and switch back to A. This is what a coroutine does, and the switching is controlled freely by the program. The switch is not a function call, because there is no call statement. Execution looks similar to multithreading, but coroutines run on only one thread.
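The A/B switching described above can be sketched with plain Python generators, where each yield is a point at which the routine suspends itself. The names routine_a, routine_b and interleave are illustrative.

```python
def routine_a(log):
    log.append("A1")
    yield            # suspend A; control goes back to the caller
    log.append("A2")
    yield
    log.append("A3")


def routine_b(log):
    log.append("B1")
    yield
    log.append("B2")


def interleave(log):
    """Switch back and forth between A and B on a single thread."""
    a, b = routine_a(log), routine_b(log)
    next(a)          # run A up to its first yield
    next(b)          # run B up to its first yield
    next(a)          # resume A where it left off
    for gen in (b, a):
        try:
            next(gen)  # let each routine run to completion
        except StopIteration:
            pass
    return log


order = interleave([])
print(order)  # ['A1', 'B1', 'A2', 'B2', 'A3']
```

Each routine keeps its position between switches, and no second thread is involved.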
The advantage of coroutines is that they are very efficient: switching between coroutines is controlled by the program itself, so there is no thread switching and therefore no thread-switching overhead. And because there is only one thread, there are no conflicting concurrent writes and no need for locks (acquiring and releasing locks consumes considerable resources).
Coroutines are mainly used to make I/O-intensive programs more efficient; they are not suitable for CPU-intensive work. Real systems often contain both kinds of workload. To make full use of the CPU, multiple processes can be combined with coroutines. We will come back to this combination later.
According to Wikipedia's definition, coroutines are program components for non-preemptive multitasking that allow subroutines to be suspended and resumed at specific points. So in theory, as long as there is enough memory, a thread can have any number of coroutines, but only one coroutine can be running at a time, and the coroutines share the computing resources allocated to the thread. Coroutines exist to take full advantage of asynchronous calls, and asynchronous operations exist to keep I/O operations from blocking the thread.
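To make the "many coroutines, one thread" point concrete, here is a minimal sketch with asyncio. The asyncio.sleep call stands in for an I/O wait, and the count of 1000 is arbitrary.

```python
import asyncio
import threading


async def worker(i, thread_ids):
    thread_ids.add(threading.get_ident())  # record which thread runs us
    await asyncio.sleep(0.01)              # simulated I/O wait; yields to the loop
    return i


async def main(n):
    thread_ids = set()
    results = await asyncio.gather(*(worker(i, thread_ids) for i in range(n)))
    return results, thread_ids


results, thread_ids = asyncio.run(main(1000))
print(len(results), len(thread_ids))  # 1000 1
```

All 1000 coroutines are in flight together, yet the set of thread identifiers contains a single entry.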
Before looking at how coroutines work, some background knowledge is needed.
1) Most modern mainstream operating systems are time-sharing operating systems, that is, one computer serves multiple users by rotating time slices. The basic unit of system resource allocation is the process, and the basic unit of CPU scheduling is the thread.
2) Runtime memory is divided into a data (variable) area, a stack area and a heap area. In terms of address allocation, the heap grows from low to high addresses and the stack grows from high to low.
3) When the computer runs, instructions are read and executed one by one. While the current instruction executes, the address of the next instruction is held in the instruction pointer register (IP). The ESP register points to the top of the current stack, and the EBP register points to the base address of the currently active stack frame.
4) When a function is called, the following happens: the arguments are pushed from right to left, then the return address is pushed, then the current value of the EBP register is pushed; the ESP register is then adjusted to allocate the space that the function's local variables need on the stack.
5) The context of a coroutine consists of its stack area and the register values that belong to it.
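In CPython, the closest visible analogue of this saved context is the state kept by a suspended generator: its local variables and its position are preserved between resumptions. The sketch below inspects that state with the standard inspect module. It illustrates the idea at the Python level only and says nothing about machine registers; the accumulate function is made up for the example.

```python
import inspect


def accumulate():
    total = 0
    while True:
        value = yield total   # suspension point
        total += value


gen = accumulate()
next(gen)          # run to the first yield
gen.send(5)        # resume with 5, suspend again
gen.send(10)       # resume with 10, suspend again

state = inspect.getgeneratorstate(gen)
saved_locals = inspect.getgeneratorlocals(gen)

print(state)         # GEN_SUSPENDED
print(saved_locals)  # {'total': 15, 'value': 10}
```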
Python 3.3 introduced the yield from syntax, which can be used to write coroutines, and Python 3.5 introduced the syntactic sugar async and await for coroutines. We mainly look at how async / await works. The event loop is at its core, and readers who have written JS will already be familiar with it. An event loop is a programming construct that waits for and dispatches events or messages in a program (Wikipedia). In Python, the asyncio.coroutine decorator was used to mark a function as a coroutine, for use together with asyncio and its event loop; this decorator has since been deprecated and was removed in Python 3.11. In later development, async / await has become more and more widely used.
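The two styles can be put side by side in a short sketch. The helper drive is a hypothetical stand-in for what an event loop does with a generator-based coroutine; it is not part of asyncio.

```python
import asyncio


# Python 3.3 style: a generator delegates to a sub-generator with `yield from`.
def inner_gen():
    yield "waiting"          # suspension point
    return 21


def outer_gen():
    value = yield from inner_gen()
    return value * 2


def drive(gen):
    """Run a generator to completion and return its return value."""
    try:
        while True:
            next(gen)
    except StopIteration as stop:
        return stop.value


# Python 3.5+ style: the same shape with async / await.
async def inner():
    await asyncio.sleep(0)   # suspension point
    return 21


async def outer():
    value = await inner()
    return value * 2


gen_result = drive(outer_gen())
async_result = asyncio.run(outer())
print(gen_result, async_result)  # 42 42
```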
Async / await is the key to using Python coroutines. Structurally, asyncio is essentially an asynchronous framework, and async / await is the interface provided for that framework to make it convenient for users to call. Therefore, to write coroutine code with async / await, users currently have to rely on asyncio or another asynchronous library.
In real asynchronous code, too many callbacks lead to callback hell, yet the results of asynchronous calls still have to be obtained. To solve this, the language designers created an object called Future, which encapsulates the interaction with the loop. The general flow is: after the program starts, a callback function is registered through the add_done_callback method, with the loop relying on a mechanism such as epoll to detect events. When the future's result is set, the previously registered callback runs and passes the result up to the coroutine. This future object is asyncio.Future.
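A minimal sketch of this flow with asyncio.Future follows. The call_later timer is only a stand-in for an I/O event reported by the loop, and the "payload" value is made up.

```python
import asyncio


async def main():
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    seen = []
    # Register a callback that runs once the future has a result.
    future.add_done_callback(lambda fut: seen.append(fut.result()))

    # Stand-in for an I/O event arriving a little later.
    loop.call_later(0.01, future.set_result, "payload")

    value = await future     # the coroutine is suspended until the result is set
    return value, seen


value, seen = asyncio.run(main())
print(value, seen)  # payload ['payload']
```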
However, to receive the return value, the program must resume its working state. The life cycle of a future object is relatively short, and its work may already be complete after each round of callback registration, event generation and callback triggering, so it is not appropriate to use the future itself to send the result to the generator. A new object, Task, is therefore introduced. It is stored in the future object and manages the state of the generator coroutine.
There is another future object in Python, concurrent.futures.Future, which is not compatible with asyncio.Future and is easy to confuse with it. The difference is that concurrent.futures.Future is a thread-level future object: when concurrent.futures.Executor is used for multi-threaded programming, this object passes results between threads.
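The two kinds of future can be told apart, and bridged, as in this sketch. asyncio.wrap_future adapts a concurrent.futures.Future so that it can be awaited; blocking_work is a made-up placeholder.

```python
import asyncio
import concurrent.futures


def blocking_work():
    return "done in worker thread"


async def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        thread_future = pool.submit(blocking_work)
        # The thread-level future is not an asyncio.Future ...
        is_asyncio_future = isinstance(thread_future, asyncio.Future)
        # ... but wrap_future adapts it into one that can be awaited.
        wrapped = asyncio.wrap_future(thread_future)
        result = await wrapped
    return is_asyncio_future, isinstance(wrapped, asyncio.Future), result


flags = asyncio.run(main())
print(flags)  # (False, True, 'done in worker thread')
```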
As mentioned above, Task is the object that maintains the state and execution logic of a generator coroutine. Task has a step method, which is responsible for the coroutine's state transitions and its interaction with the event loop. The whole process can be understood as follows: the task sends a value into the coroutine to resume it; when the coroutine reaches its next suspension point, the task obtains a new future object and then handles the callback registration between that future and the loop.
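The step mechanism can be sketched with toy classes. ToyFuture and ToyTask below are simplified illustrations written for this article, not asyncio's actual implementation, and the manual set_result calls play the role of the event loop.

```python
class ToyFuture:
    """A result placeholder that runs its callbacks once the result is set."""

    def __init__(self):
        self.result = None
        self.callbacks = []

    def set_result(self, value):
        self.result = value
        for callback in self.callbacks:
            callback(self)


class ToyTask:
    """Drives a generator-based coroutine one step at a time."""

    def __init__(self, coro):
        self.coro = coro
        self.final = None
        self.done = False
        self.step(None)              # start the coroutine

    def step(self, value):
        try:
            # Resume the coroutine; it runs until it yields a future.
            future = self.coro.send(value)
        except StopIteration as stop:
            self.final = stop.value
            self.done = True
            return
        # Wake up again when that future gets its result.
        future.callbacks.append(lambda fut: self.step(fut.result))


def fetch_two(pending):
    first = ToyFuture()
    pending.append(first)
    a = yield first                  # suspension point 1
    second = ToyFuture()
    pending.append(second)
    b = yield second                 # suspension point 2
    return a + b


pending = []
task = ToyTask(fetch_two(pending))
# Play the role of the event loop: deliver results as "events" arrive.
pending[0].set_result(1)
pending[1].set_result(2)
print(task.done, task.final)  # True 3
```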
A common misunderstanding in daily development is that every thread automatically has its own loop. In practice, the main thread can create a new loop through asyncio.get_event_loop(), whereas calling get_event_loop() in another thread that has no loop set raises an error. The correct approach is to explicitly bind a loop to the current thread through asyncio.set_event_loop().
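The following sketch shows the error in a worker thread and one way to bind a loop there. It assumes the default event loop policy; the names worker, hello and outcome are illustrative.

```python
import asyncio
import threading

outcome = {}


async def hello():
    await asyncio.sleep(0)
    return "ran in worker loop"


def worker():
    # No loop has been set for this thread, so get_event_loop() fails.
    try:
        asyncio.get_event_loop()
        outcome["error"] = None
    except RuntimeError as exc:
        outcome["error"] = type(exc).__name__

    # Create a loop and bind it to the current thread explicitly.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        outcome["result"] = loop.run_until_complete(hello())
    finally:
        loop.close()


thread = threading.Thread(target=worker)
thread.start()
thread.join()
print(outcome)
```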
The loop has one significant drawback: its running state is not controlled by Python code, so in business logic coroutines cannot be reliably extended to run across multiple threads.
With the concepts and principles covered, let's see how to use them. The following real-world scenario shows how to use Python coroutines.
The groups of data in a single file have no processing dependency on one another. Previously, the network requests were sent through the requests library and executed serially: the next group of data could not be sent until the response for the previous group had returned, so processing the whole file took a long time. This request pattern is a good fit for coroutines.
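The gap between the two patterns can be sketched without any network access. Here fake_request is a hypothetical stand-in that only sleeps, and the measured times will vary from machine to machine.

```python
import asyncio
import time


async def fake_request(i, delay=0.05):
    await asyncio.sleep(delay)       # simulated network latency
    return i


async def run_serial(n):
    # Each request waits for the previous one to finish.
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n):
    # All requests are in flight at the same time.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


serial_result, serial_time = timed(run_serial(5))
concurrent_result, concurrent_time = timed(run_concurrent(5))
print(f"serial: {serial_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With five simulated requests, the serial version takes roughly five delays while the concurrent version takes roughly one.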
To make it easier to send requests from coroutines, we use the aiohttp library instead of the requests library. We will not analyze aiohttp in depth here, only introduce it briefly.
aiohttp is an asynchronous HTTP client / server for asyncio and Python. Because it is asynchronous, it is often used on the server side to receive requests and in client crawler applications to initiate asynchronous requests. Here we mainly use it to send requests.
aiohttp supports both the client and the HTTP server side and can perform concurrent I/O in a single thread. It supports server WebSockets and client WebSockets without callback hell, and it provides middleware.
Talk is cheap, show me the code~
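import asyncio
from inspect import isfunction

import aiohttp

# The original applies a project-local decorator here (utiles.exception(logger)); its import is not shown, so it is omitted.
def request(pool, data_list):
    # asyncio.run creates an event loop and runs the batch to completion.
    asyncio.run(exec(pool, data_list))

async def exec(pool, data_list):
    tasks = []
    sem = asyncio.Semaphore(pool)  # at most `pool` requests in flight
    for item in data_list:
        # Assumed item layout: a dict with method, url, data, headers and callback keys.
        tasks.append(control_sem(sem, item.get("method", "GET"), item.get("url"),
                                 item.get("data"), item.get("headers"), item.get("callback")))
    await asyncio.gather(*tasks)

async def control_sem(sem, method, url, data, headers, callback):
    async with sem:
        count = 0
        flag = False
        while not flag and count < 4:
            flag = await fetch(method, url, data, headers, callback)
            count = count + 1
        if count == 4 and not flag:
            raise Exception('EAS service not responding after 4 times of retry.')

async def fetch(method, url, data, headers, callback):
    try:
        async with aiohttp.request(method, url=url, data=data, headers=headers) as resp:
            json = await resp.read()
            if resp.status != 200:  # assumption: a non-200 status counts as a failed attempt
                return False
            if isfunction(callback):
                callback(json)  # hand the response body to the caller
            return True
    except Exception as e:
        print(e)
        return False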
Here we wrap the sending of batch requests in a request method exposed to callers. It takes the pool size and the list of request data in a single call. Callers only need to build the request data objects and set the request pool size. A retry mechanism is built in: each request is retried up to four times, so that a single request does not fail just because of network jitter.
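The combination of a semaphore-bounded pool and retries can be tried without a network using the sketch below. flaky_fetch is a hypothetical stand-in that fails on its first attempt for every item, and the pool size of 3 is arbitrary.

```python
import asyncio


async def flaky_fetch(item, attempts, state):
    """Hypothetical stand-in for a network call: fails on its first attempt."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.001)           # simulated latency
    state["active"] -= 1
    attempts[item] = attempts.get(item, 0) + 1
    return attempts[item] >= 2           # succeeds from the second attempt on


async def with_retry(sem, item, attempts, state, max_tries=4):
    async with sem:                      # wait for a free slot in the pool
        for _ in range(max_tries):
            if await flaky_fetch(item, attempts, state):
                return True
        raise RuntimeError(f"{item} failed after {max_tries} tries")


async def run_batch(pool, items):
    sem = asyncio.Semaphore(pool)
    attempts, state = {}, {"active": 0, "peak": 0}
    await asyncio.gather(*(with_retry(sem, i, attempts, state) for i in items))
    return attempts, state["peak"]


attempts, peak = asyncio.run(run_batch(pool=3, items=range(10)))
print(attempts, peak)
```

The recorded peak never exceeds the pool size, and every item succeeds on its second attempt.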
After rebuilding the network request module with coroutines, the processing time for 1,000 records was roughly halved, from 816 s to 424 s, and the effect becomes more pronounced as the request pool size increases. Because the third-party platform limits the number of simultaneous connections, we set the pool threshold to 40. As you can see, the optimization is significant.
Editor in charge: CT