Artificial intelligence algorithms demand substantial computing power, and that demand keeps growing at a pace comparable to Moore's law; the large-scale adoption of deep learning raises the bar further. Intelligent algorithms exhibit high parallelism and strong data reuse, and they evolve continuously — new algorithms keep emerging and computation patterns keep changing — leaving a vast design space for processor architects. At present there are two main approaches to AI processor design: domain-specific architectures, represented by the TPU, and general-purpose architectures, represented by the GPU. The former offers high performance per watt and is simple to use, but lacks flexibility and generality; the latter is more flexible and versatile, but consumes more power and makes programming and algorithm design more complex.
Domestic multi-core processors integrate a fused heterogeneous architecture, multi-dimensional parallel data communication, a flexible and optimized memory system, and efficient, balanced compute cores, which together provide effective support for artificial intelligence applications. The deeply fused heterogeneous architecture combines general-purpose processing cores with domain-oriented compute cores, covering the computing capabilities needed for both general computing and intelligent computing. The multi-dimensional parallel communication system uses lightweight register communication and fast inter-core synchronization to achieve low-latency, high-bandwidth data exchange and flexible, efficient synchronization among compute cores, improving their efficiency on AI workloads. The flexible, optimized memory system combines software and hardware to make on-chip storage management adaptable, mitigating the limited memory bandwidth and growing latency of intelligent computing. The efficient, balanced compute core not only sustains high processing efficiency on intelligent-computing applications but also allows more cores to be integrated for higher parallel performance, meeting the needs of AI computing.
2 Development of multi-core processors
The multi-core processor is the key device supporting artificial intelligence computing. Many technologies and architectures have emerged over the course of its development, driven by the contributions of numerous researchers and companies.
Coarse-grained reconfigurable architectures were a leading precursor in the formation of multi-core processors. Around 2000, a large number of coarse-grained reconfigurable processors based on crossbars, linear arrays, and meshes emerged. Architectures based on a full crossbar have strong communication capability; a simplified crossbar is usually adopted to cope with the exponential growth of implementation cost as the number of processing elements increases, as in the rapid-prototyping DSP datapaths PADDI and PADDI-2. Architectures based on one or more linear arrays provide reconfigurable pipeline stages, enabling partial fast dynamic pipeline reconfiguration and runtime scheduling of configuration and data streams, as in PipeRench. Mesh-based architectures arrange processing elements (PEs) in a two-dimensional array; adjacent PEs can communicate, direct communication between PEs in a row or column is generally supported, and both static networks fixed at compile time and dynamic networks resolved at run time are supported, as in Raw. Research on coarse-grained reconfigurable architectures remained mostly academic, though some results were turned into industrial products (such as the Tilera TILE series [2]).
In 2002, the concept of GPGPU (general-purpose computing on GPUs) gradually took shape: floating-point matrix–matrix multiplication was implemented on the GPU [3], and GPUs began to be applied to traditional scientific and engineering computing. In 2005, LU decomposition of floating-point matrices was realized on the GPU. At this stage the main obstacle facing the GPU was programmability: scientific and engineering algorithms had to be mapped onto the traditional graphics-processing pipeline. In the same period, in 2002, IBM began developing the C64 petaflops-class supercomputer, whose core was the Cyclops-64 multi-core processor. Cyclops-64 contains 80 cores interconnected by a crossbar, with a peak performance of 80 GFLOPS. In 2005, IBM released the Cell processor, which integrates two types of cores with different functions — a control core (PPE) and coprocessor cores (SPEs) — interconnected by a bus, with a peak performance of 102 GFLOPS. In 2008, IBM built the Roadrunner supercomputer based on Cell; its sustained Linpack performance exceeded 1 PFLOPS for the first time and it ranked first on the TOP500 list, which had a great impact on the industry.
As multi-core processor architectures matured, their adaptability and usability improved steadily. High-performance GPUs gradually added double-precision floating-point units and ECC in the memory controllers, making their computation more general. In particular, the release of the CUDA software development kit in 2007 paved the way for wide GPU adoption. In June 2010, Dawning's Nebulae supercomputer, using NVIDIA Tesla GPUs, achieved 1.27 PFLOPS; in November 2010, Tianhe-1A, also using Tesla GPUs, reached 2.56 PFLOPS. GPUs have since been used ever more widely in high-performance computing and have become the de facto standard for multi-core processors. Intel, an important HPC vendor, has continuously increased its investment in multi-core processors: in 2006 it began research on the Larrabee architecture, and in 2010 it introduced the MIC architecture, later launching the Xeon Phi high-performance multi-core processors with 57 to 72 x86 cores. In 2013, the National University of Defense Technology developed the Tianhe-2 supercomputer based on Xeon Phi, which ranked first in the world at the time.
According to the structural complexity and organization of their compute cores, multi-core processors can be divided into two categories: those based on general-purpose processing cores and those based on compute clusters.
Multi-core processors based on general-purpose processing cores can be seen as a further extension of the multi-core architecture: many general-purpose cores are integrated through an on-chip interconnection network (NoC). The compute core is generally a simplified general-purpose core; every core is fully functional and computationally capable, but structures such as instruction scheduling and speculative execution are usually simplified. The compute units within each core generally support SIMD, and the traditional multi-level cache hierarchy of a general-purpose processor is usually retained within each core. Typical representatives include Intel's Larrabee/MIC architecture processors, the SCC architecture processor, and Tilera's TILE-Gx series.
Multi-core processors based on compute clusters integrate a large number of simple compute cores, aiming to deliver ultra-high performance through the aggregation of simple compute units. The compute core of such a processor is a simple compute unit; multiple cores are organized into groups or clusters and provide powerful parallelism through data-parallel execution models such as single-instruction multiple-thread (SIMT). Domain-oriented acceleration units are usually integrated on chip. All compute cores in a cluster share the instruction-issue unit as well as memory resources such as the register file and L1 cache, while clusters share the L2 cache and main memory. Typical representatives include NVIDIA's GPU series, such as Fermi and Kepler, and AMD/ATI's GPU series, such as the RV-architecture and GCN-architecture processors [5].
Alongside these international developments, domestic research has proceeded in parallel, including the Godson-T multi-core processor, the YHFT64-2 stream processor, and the Sunway (Shenwei) multi-core processors. Godson-T adopts a 2-D mesh structure with an 8 × 8 array of 64 processor cores and is compatible with the MIPS instruction set. The YHFT64-2 processor adopts a heterogeneous multi-core architecture with 64 cores, combining the flexibility of a traditional general-purpose architecture with abundant compute resources and strong peak performance. The Sunway many-core processor is used in the Sunway TaihuLight supercomputer; it adopts an on-chip heterogeneous architecture with a unified instruction set, balancing usability and application performance to achieve better performance per watt and computing density.
The field of artificial intelligence, represented by deep learning, has opened a new era for computer architecture. The computing-power demand of AI applications is now growing at a super-Moore rate: from 2012 to 2017 it increased by a factor of about 300,000, doubling roughly every 3.5 months. The core computation of AI algorithms is low-precision linear algebra, which on the one hand is adaptable enough to extend to many fields, and on the other hand is specialized enough to benefit from domain-specific architecture design.
The multi-core architecture is not only efficient and well suited to scientific and engineering computing; its support for double- and single-precision matrix computation can also, to a certain extent, meet the key computational needs of AI. Multi-core processors therefore hold natural advantages for AI computing, and they continue to improve for its specific needs, for example by adding acceleration units and mixed-precision support. NVIDIA added Tensor Cores to the V100 and Turing multi-core processors, significantly improving performance and raising peak AI throughput to roughly 17 times the double-precision floating-point peak. AMD's Vega architecture likewise significantly improves AI computing performance, and Intel's Knights Mill multi-core processor adds dedicated instructions for AI computing.
3 Key technologies of domestic multi-core architectures for intelligent computing
Convolution and matrix multiplication are the core operations of intelligent computing, characterized by high parallelism and data reuse; current AI processors focus on accelerating these two operations. To meet the needs of intelligent computing, domestic multi-core processors must support large-scale convolution and matrix multiplication effectively. Their compute cores need flexible control capability to schedule the complex loop nests of convolution and matrix multiplication efficiently; efficient on-chip communication to reuse convolution weights and globally share input feature values; instruction reordering to control weights and input features precisely and overlap local-memory reads with the compute pipeline, further improving performance; flexible data movement and on-chip layout to convert convolution and input-feature layouts cheaply, reducing data-reorganization overhead; and on-chip multi-level parallelism to support efficient on-chip data-parallel strategies, improving data-exchange and weight-update performance.
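To make the two properties concrete, here is a minimal reference sketch of a direct 2-D convolution (single channel, stride 1, no padding). Every output element is independent (the parallelism the text exploits across and within cores), and the same kernel weights are reused at every output position (the data reuse that on-chip communication is meant to capture). The function name and layout are illustrative, not an API from the processors discussed.

```python
def conv2d(image, kernel):
    """Direct 2-D convolution over lists of lists; illustrative only."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):            # output rows: independent -> parallel across cores
        for j in range(ow):        # output cols: independent -> SIMD lanes within a core
            acc = 0.0
            for di in range(kh):   # same kernel weights reused for every (i, j)
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            out[i][j] = acc
    return out

print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
             [[1, 0], [0, 1]]))   # -> [[6.0, 8.0], [12.0, 14.0]]
```

Because the inner two loops touch only a small weight tile, keeping the weights resident in each core's local memory and streaming the input tiles through is the reuse strategy the surrounding text describes.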
Overall, a domestic multi-core architecture needs several innovative key technologies to support AI computing effectively: a fused heterogeneous architecture, a lightweight on-chip communication mechanism, a flexible and optimized memory system, and an efficient, balanced compute-core architecture.
3.1 Fused heterogeneous architecture
Multi-core processors integrate on the same chip both "heavy" cores that exploit instruction-level parallelism (ILP) and structurally simple "light" cores that fully exploit thread-level parallelism (TLP). This combination efficiently supports complex AI applications and algorithm implementations, balances ease of use with performance, and achieves better performance per watt and higher computing density.
The compute cores ("light" cores) and the control cores ("heavy" cores) cooperate to support the different task types in AI applications. The compute cores support multi-width SIMD, providing the bulk of the computing power AI applications require; they support software-managed on-chip local memory, realize data-level and thread-level parallelism through an efficient on-chip network, and support flexible, rich AI algorithm mechanisms such as algorithm layering, on-chip data sharing, and MPMD execution. The control cores handle the hard-to-parallelize parts of AI tasks, exploit instruction-level parallelism, capture applications' spatial and temporal locality through multi-level caches, and support complex hyperparameter tuning, training iteration, data splitting, and so on.
To address the challenges of heterogeneous task management, complex on-chip data sharing, data consistency, and execution-model compatibility in AI workloads, the different cores of the multi-core architecture need to adopt a unified instruction set and a unified execution model, and to support multiple storage-space management modes, realizing deep fusion of the heterogeneous on-chip cores.
3.2 Lightweight on-chip communication mechanism
A multi-core processor has many cores, each with limited local storage, so the working set each core can handle independently is small and the demands on main-memory bandwidth and latency are high. Most AI applications are both memory- and compute-intensive, so the processor must provide an efficient inter-core on-chip data-reuse mechanism to enlarge the working set, reduce memory-access demand, and bring out its full computing power. A lightweight on-chip communication mechanism realizes low-latency, high-bandwidth data exchange between compute cores, improves the efficiency of their close cooperation, significantly raises on-chip data reuse, and effectively alleviates the "memory wall" faced by multi-core processors.
The lightweight on-chip communication mechanism uses a two-sided protocol to realize lightweight blocking and non-blocking communication: the source core deposits data into the send unit, the send instruction completes, and the pipeline continues executing; the target core uses a receive instruction to fetch valid data from its receive buffer. To keep communication efficient and the physical implementation simple, the protocol avoids the complex handshakes or synchronization usually required to establish communication, reducing the design complexity and overhead of the cluster communication network. Compared with traditional on-chip network communication, the lightweight mechanism lets compute cores avoid moving data through multiple levels of the on-chip storage hierarchy as far as possible.
To raise the on-chip data-reuse rate, the lightweight inter-core communication mechanism needs to support fine-grained, low-latency exchange and movement of data between cores, as well as collective operations such as multicast. For the core operation of AI applications — matrix–matrix multiplication — lightweight communication can improve efficiency by more than 10 percentage points.
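The send/receive-buffer model above can be sketched in a few lines. This is an illustrative software model, not the hardware protocol: each core has a bounded receive buffer, a send deposits data and lets the sender continue (blocking only on back-pressure when the buffer is full), a receive blocks only until valid data arrives, and multicast delivers one payload to several cores' buffers. The `CoreComm` class and its method names are assumptions for the sketch.

```python
import queue

class CoreComm:
    """Toy model of lightweight two-sided inter-core communication."""
    def __init__(self, n_cores, buf_depth=4):
        # one bounded receive buffer per core, as in the text's model
        self.rx = [queue.Queue(maxsize=buf_depth) for _ in range(n_cores)]

    def send(self, dst, value):
        # fire-and-forget: no handshake to establish a channel;
        # blocks only if dst's buffer is full (back-pressure)
        self.rx[dst].put(value)

    def recv(self, core_id):
        # blocking receive: returns the oldest value in this core's buffer
        return self.rx[core_id].get()

    def multicast(self, dsts, value):
        # collective operation: same payload to several cores' buffers
        for d in dsts:
            self.rx[d].put(value)

comm = CoreComm(n_cores=4)
comm.send(1, "weight-tile-0")           # core 0 -> core 1
comm.multicast([2, 3], "shared-input")  # one producer, many consumers
print(comm.recv(1))                     # -> weight-tile-0
print(comm.recv(2))                     # -> shared-input
```

The absence of any connection-setup step is what the text calls "lightweight": the only state is the receive buffer itself, so data moves core-to-core without traversing the multi-level on-chip storage hierarchy.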
3.3 Flexible and optimized memory system
Given the high computational density of intelligent computing, multi-core processors need flexible data movement, flexible on-chip data layout, and reconfigurable local data memory. Combining software and hardware makes on-chip storage management adaptable, optimizes data-transfer performance, effectively mitigates the limited bandwidth and growing latency of intelligent-computing storage, and improves the efficiency and adaptability of the multi-core architecture.
(1) Flexible data movement and on-chip layout. When compute cores can directly access the main-memory space, supporting efficient use of on-chip storage and flexible data placement among the cores requires flexible data movement and on-chip layout, with efficient asynchronous transfers between core-local storage and main memory so that computation and memory access proceed in parallel. Based on the memory-access characteristics of AI algorithms, the storage interface implements a sliding-window parallel scheduling strategy and a variety of mapping optimizations, effectively improving memory-bandwidth utilization.
The multi-core architecture supports a variety of data layouts: single-core mode, multicast mode, row mode, broadcast-row mode, and matrix mode. Multicast mode delivers the data each core needs from main memory to multiple compute cores; row mode and broadcast-row mode transfer data blocks distributed cyclically along the row dimension; matrix mode transfers data blocks distributed cyclically over the two-dimensional grid of the whole compute-core cluster. Single-core mode, row mode, and matrix mode support transfers both from main memory to local data memory and back; the other modes support only transfers from main memory to local data memory.
This multi-mode data-stream transfer technology effectively raises the data-reuse rate of intelligent computing and thereby improves the performance of AI algorithms.
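One plausible reading of the layout modes can be written down as block-to-core mapping functions for a cluster arranged as a rows × cols grid (core id = row * cols + col). The exact hardware semantics are not specified in the text, so the cyclic distributions below are assumptions for illustration only.

```python
def single_core_mode(block, core=0):
    return [core]                       # one fixed destination core

def multicast_mode(block, cores):
    return list(cores)                  # explicit set of destination cores

def row_mode(block, cols, row=0):
    # blocks dealt cyclically across the cores of one grid row
    return [row * cols + block % cols]

def broadcast_row_mode(block, cols, row=0):
    # every core of the row receives the block
    return [row * cols + c for c in range(cols)]

def matrix_mode(block, rows, cols):
    # blocks dealt cyclically over the whole 2-D grid, row-major
    return [block % (rows * cols)]

# blocks 0..5 on a 2 x 4 core grid, row mode on row 0:
print([row_mode(b, cols=4) for b in range(6)])  # -> [[0], [1], [2], [3], [0], [1]]
```

Viewed this way, choosing a mode is choosing which cores a main-memory block lands on, which is exactly what determines how much of the block can later be reused on chip.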
(2) Reconfigurable local data memory. The compute core designed for intelligent computing aims for simplicity and efficiency and adopts reconfigurable local data memory: software can configure the core's data storage either as a software–hardware cooperative cache or as on-chip scratchpad memory, to manage data with different characteristics. The two management modes can coexist, with dynamically divided capacity, fully combining hardware efficiency with software flexibility, reducing design overhead, and meeting the storage needs of AI applications.
In the software–hardware cooperative cache, both the cache-line data and the cache-line tags are stored in the local data memory, and a dedicated register holds the state of the cache as a whole. Software manages line loading and eviction, while hardware provides instructions that accelerate hit queries and address translation; together they complete cache management, combining hardware efficiency with software flexibility to achieve effective memory-access optimization at low hardware cost. The hardware handles the automatic branch on hit or miss, reducing software overhead (such as code expansion and conditional-branch checks); the software manages loading and eviction. A running program may use multiple caches, and software is responsible for isolating the accesses of different caches within the local data memory to avoid conflicts.
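The software/hardware split described above can be sketched with a direct-mapped toy cache whose tags and line data both live in local data memory (LDM), modeled here as Python lists. The index/tag split in `read` stands in for the hardware-accelerated hit query; the load-on-miss and the (trivial) eviction policy are the software's job. The class and its organization are assumptions for illustration, not the actual design.

```python
LINE_WORDS = 4  # assumed cache-line size, in words

class SoftCache:
    """Toy software-managed, direct-mapped cache with tags kept in LDM."""
    def __init__(self, main_mem, n_lines=8):
        self.mem = main_mem
        self.n_lines = n_lines
        self.tags = [None] * n_lines                              # tag array in LDM
        self.lines = [[0] * LINE_WORDS for _ in range(n_lines)]   # line data in LDM
        self.misses = 0

    def read(self, addr):
        line_addr, offset = divmod(addr, LINE_WORDS)
        idx = line_addr % self.n_lines        # "hardware" index/tag split
        if self.tags[idx] != line_addr:       # miss: software loads and evicts
            self.misses += 1
            base = line_addr * LINE_WORDS
            self.lines[idx] = self.mem[base:base + LINE_WORDS]
            self.tags[idx] = line_addr        # old line simply overwritten
        return self.lines[idx][offset]        # hit path: direct LDM access

mem = list(range(100))
c = SoftCache(mem)
print(c.read(10), c.read(11), c.misses)  # -> 10 11 1  (second read hits the line)
```

Because the eviction logic is ordinary software, it can be specialized per application — e.g. pinning convolution weights while streaming input tiles — which is the flexibility the cooperative design is after.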
3.4 Efficient and balanced compute-core architecture
Analysis of AI applications suggests that the multi-core architecture can adopt a weakly out-of-order pipeline, characterized by a limited degree of out-of-order execution on top of deterministic execution. Deterministic execution mainly reduces the extra power cost of speculative execution and the area of structures — such as the reorder buffer — used to hold unretired speculative instructions. Limited out-of-order execution refers to instruction-block-based scheduling and issue, which effectively hides the performance loss of some long-latency events (such as scattered main-memory accesses). The weakly out-of-order pipeline improves on an in-order pipeline's performance while keeping structural complexity under control.
Although the weakly out-of-order compute core reduces hardware complexity, it can still process intelligent-computing applications efficiently, for three main reasons. First, the core adopts an efficient branch-prediction mechanism suited to a simplified core: compiler-guided static prediction, branch hints, and branch-target prefetching. For intelligent-computing applications with regular control flow, this maintains pipeline performance while eliminating the large branch-history tables of traditional predictors and reducing area. Second, intelligent-computing applications are data-intensive, with batch data-processing requirements; the core's single-instruction multiple-data (SIMD) support processes batch data efficiently, reducing per-instruction pipeline control overhead and saving power. Third, the core's local data memory, combined with batch data-transfer techniques, effectively hides data-access latency for the regular, predictable access patterns of intelligent computing, greatly improving local-access efficiency and avoiding the risk that capacity misses in a traditional data cache fail to hide access latency.
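The SIMD point can be illustrated with a toy vector operation: one "instruction" applies the same fused multiply–add to a whole group of lanes, so instruction-control overhead is paid once per vector rather than once per element. The function name and the lane width are illustrative assumptions, not an instruction from the processors discussed.

```python
def simd_fma(a, b, c, width=4):
    """Toy SIMD fused multiply-add: out[i] = a[i]*b[i] + c[i], width lanes at a time."""
    out = []
    for i in range(0, len(a), width):
        lanes = range(i, min(i + width, len(a)))   # one vector instruction's lanes
        # all lanes execute the same operation under one "instruction"
        out.extend(a[j] * b[j] + c[j] for j in lanes)
    return out

print(simd_fma([1, 2, 3, 4], [5, 6, 7, 8], [1, 1, 1, 1]))  # -> [6, 13, 22, 33]
```

With `width=4`, a 1024-element loop issues 256 vector steps instead of 1024 scalar ones — the control-overhead reduction the text credits for the power savings.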
This efficient, balanced compute-core structure allows a single chip to integrate more cores: while maintaining processing efficiency for intelligent-computing applications, it obtains higher parallel performance through the larger core count.
4 Performance analysis of intelligent-computing applications on domestic multi-core processors
Domestic multi-core processors now have a relatively complete software ecosystem (the basic linear-algebra library swBLAS, the deep-learning library swDNN, the deep-learning framework swCaffe, etc.), support many typical AI applications (medical imaging, Go, speech recognition, etc.), and have achieved good benchmark performance.
Convolution is a representative deep-learning computation, and swDNN focuses on optimizing and accelerating it. A double-buffering mechanism allocates two LDM buffers for each operand of the convolution, keeping computation and memory access relatively independent so that they overlap; the flexible on-chip network and multiple DMA mechanisms ensure that different convolution computations map efficiently onto the compute-core array; and the compute core's dual pipelines are exploited to maximize the overlap of memory-access and compute instructions, reducing compute-unit stalls and improving convolution performance. Running convolutions with swDNN, the multi-core processor achieves 2 to 9 times the performance of the contemporaneous commercial NVIDIA K40m multi-core processor running the cuDNN library.
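The double-buffering (ping-pong) pattern above can be sketched as follows: while the compute step consumes tile k from one LDM buffer, an asynchronous "DMA" thread fills the other buffer with tile k+1, so memory access overlaps computation. The helper names (`fetch_tile`, `compute_tile`) are illustrative stand-ins, not the swDNN API.

```python
import threading

def fetch_tile(k):
    # stand-in for a DMA get of tile k from main memory into LDM
    return [k * 10 + i for i in range(4)]

def compute_tile(tile):
    # stand-in for the convolution kernel consuming one LDM tile
    return sum(tile)

def run(n_tiles):
    results = []
    buf = [fetch_tile(0), None]        # prime buffer 0 synchronously
    for k in range(n_tiles):
        cur, nxt = k % 2, (k + 1) % 2
        dma = None
        if k + 1 < n_tiles:            # start fetching the NEXT tile...
            def dma_get(slot=nxt, kk=k + 1):
                buf[slot] = fetch_tile(kk)
            dma = threading.Thread(target=dma_get)
            dma.start()
        results.append(compute_tile(buf[cur]))  # ...while computing the CURRENT one
        if dma:
            dma.join()                 # wait for the DMA before swapping buffers
    return results

print(run(3))   # -> [6, 46, 86]
```

The cost is doubled LDM footprint per operand; the payoff is that, in steady state, the pipeline sees memory latency only when a fetch takes longer than the compute step it hides behind.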
swCaffe is the port of the Caffe deep-learning framework to the multi-core processor. It integrates swDNN and swBLAS for customized functionality and performance, uses a parameter server to update global parameters, and supports a synchronous update strategy that overlaps computation with communication. Convolution with swCaffe on a single compute-core array is 3.5 times as fast as on a single Intel Xeon processor, and on a single multi-core processor 1.5 times as fast as on a K40m; parallel training achieves good strong and weak scalability.
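A minimal sketch of the synchronous parameter-server step mentioned above: each worker contributes its gradient, the server averages them, and one gradient-descent update produces the new global weights broadcast to all workers. The function name and the plain SGD rule are assumptions for illustration, not the swCaffe implementation.

```python
def ps_step(weights, worker_grads, lr=0.1):
    """One synchronous parameter-server update: average gradients, then SGD step."""
    n = len(worker_grads)
    # element-wise average of all workers' gradients
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(weights))]
    # new global weights, to be broadcast back to every worker
    return [w - lr * g for w, g in zip(weights, avg)]

w = [1.0, 2.0]
w = ps_step(w, [[0.2, 0.4], [0.6, 0.0]])
print([round(x, 2) for x in w])   # -> [0.96, 1.98]
```

In the synchronous scheme, overlapping the gradient communication of layer L with the backward computation of layer L-1 is what hides the parameter-server traffic behind useful work.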
A Go training program was run on 256 multi-core processors; its deep-learning model includes a 39-layer CNN, trained on 240 million samples. A medical-image processing model, based on networks such as AlexNet and VGG, was trained on 128 multi-core processors with up to 1 TB of training data. The processors were also used to train a remote-sensing image-classification model on more than 10 TB of data.
Driven by artificial intelligence, and deep learning in particular, multi-core processor architecture has been developing toward intelligent computing. The complexity, flexibility, and domain specificity of AI computing will shape the future development of domestic multi-core architecture. As intelligent algorithms keep evolving — new algorithms emerging endlessly and models constantly changing — it is necessary to build a dynamically reconfigurable multi-core architecture that remains programmable, to cope with algorithm turnover and iteration; to design new multi-level, multi-granularity on-chip memory-access and communication management mechanisms that fully match AI applications' on-chip data-sharing and data-movement characteristics, raising computing power while effectively reducing memory-access demand; and, for the core algorithms of AI, to build customizable acceleration cores that respond quickly to algorithmic change, adopting energy-efficient structures and design methods to achieve the goal of green, energy-saving computing.