Design of high efficiency and scalable GNN accelerator based on achronix speedster7t FPGA
Author: Kevin yuan, Senior Field Application Engineer of achronix
Thanks to the rise of big data and the rapid improvement of computing power, machine learning technology has made revolutionary development in recent years. In machine learning tasks such as image classification, speech recognition and natural language processing, the data is Euclidean data whose size dimension is determined and arranged orderly. However, in more and more real scenes, data are represented by complex non Euclidean data such as graph. Graph contains not only data, but also data dependencies, such as social networks, protein molecular structure, e-commerce platform customer data, etc. With the increase of data complexity, the traditional machine learning algorithm design and its implementation technology has brought severe challenges. In this context, many new graph based machine learning algorithms, GNN (graph neural network), are emerging in academia and industry.
GNN has a very high demand for computing power and memory, and its algorithm software implementation is very inefficient, so the industry has a very urgent demand for GNN hardware acceleration. We know that there are many traditional CNN hardware acceleration solutions; however, GNN hardware acceleration has not been fully discussed and studied. At the time of writing this paper, Google and Baidu can not find the Chinese Research on GNN hardware acceleration. The purpose of this paper is to combine the latest GNN algorithm, acceleration technology research abroad, and the author’s discussion on FPGA acceleration technology of GNN, and present it to readers in the form of panorama.
2. Introduction to GNN
GNN architecture has many similarities with traditional CNN in the macro level, such as convolution layer, polling, activation function, machine learning processor (MLP), FC layer and other modules, which will be applied in GNN. The figure below shows a relatively simple GNN architecture.
Figure 1: typical GNN architecture (source: https://arxiv.org/abs/1901.00596 )
However, graph convolution in GNN is different from 2D convolution in traditional CNN. Taking Figure 2 as an example, the convolution calculation process for the red target node is as follows:
Lgraph convolution: the neighbor function is used to sample the characteristics of the peripheral nodes and calculate the mean value. The number of neighbor nodes is uncertain and unordered (non Euclidean data).
L2d convolution: the convolution kernel is used to sample the characteristics of peripheral nodes and calculate the weighted average value, and the number of neighbor nodes is determined and ordered (Euclidean data).
Figure 2: graph convolution and 2D convolution (source: https://arxiv.org/abs/1901.00596 )
3. Introduction of graphsage algorithm
Scholars have done a lot of research and Discussion on GNN algorithm, and put forward a considerable number of innovative implementation methods. Among them, graphsage, which was proposed by Stanford University in 2017, is an inductive representation learning algorithm used to predict the dynamic new unknown node types in large-scale graphs, especially for graphs with large number of nodes and rich node characteristics. As shown in the figure below, the calculation process of graphsage can be divided into three main steps:
Figure 3: visual representation of graphsage algorithm (source: http://snap.stanford.edu/graphsage )
Adjacent node sampling: it is used to reduce the complexity. Generally, there are two sampling layers, and each layer samples several nodes
L aggregation: used to generate embedding of target nodes, that is, low dimensional vector representation of graph
L prediction: Taking embedding as the input of full connection layer, the label of target node D is predicted
In order to implement the acceleration of graphsage algorithm in FPGA, we need to know its mathematical model so that we can map the algorithm to different logic modules. The code shown in the figure below illustrates the mathematical process of this algorithm.
Figure 4: mathematical model of graphsage algorithm (source: http://snap.stanford.edu/graphsage )
For each target node XV to be processed, graphsage performs the following operations:
1) Through neighbor sampling function n (V), the nodes in the subgraph are sampled
2) The aggregation function can be mean (), LSTM (), polling (), etc
3) The aggregation result is combined with the output representation of the last iteration and convoluted with wk
4) The convolution results are processed nonlinearly
5) Iteration several times to end the current k-th layer of all adjacent nodes processing
6) Normalize the k-th iteration results
7) Iteration several times to end the processing of all k-layer sampling depth
8) The final iteration result zv is the embedding of input node XV
4. GNN accelerator design challenges
GNN algorithm involves a lot of matrix calculation and memory access operations. It is very inefficient to run this algorithm on the server of traditional x86 architecture, which is manifested in slow speed and high energy consumption.
The application of new GPU can bring significant benefits for GNN’s computing speed and energy efficiency. However, the short board of GPU’s memory scalability makes it unable to handle the graph of massive nodes; the execution mode of GPU’s instructions also causes the calculation delay to be too large and uncertain, which makes it unable to handle the scene where the graph needs to be calculated in real time.
As mentioned above, the existence of various design challenges makes the industry in urgent need of a GNN acceleration solution that can support highly concurrent real-time computing, huge memory capacity and bandwidth, and be scalable in the data center.
5. FPGA design of GNN accelerator
The speedster7t series of high-performance FPGAs produced by achronix company are specially optimized for data center and machine learning workload, eliminating some performance bottlenecks of CPU, GPU and traditional FPGAs. Speedster7t FPGA is based on TSMC’s 7Nm FinFET technology. Its architecture adopts revolutionary new 2D Network on chip (NOC), original machine learning processor matrix (MLP), and uses high bandwidth gddr6 controller, 400g Ethernet and pciexpressgen5 interface. It not only guarantees ASIC level performance, but also provides users with flexible hardware programmability. The following figure shows the architecture of speedster7t1500 high performance FPGA.
Figure 5: acronix speedster7t1500 high performance FPGA Architecture (source: http://www.achronix.com )
As mentioned above, the achronix speedster7t1500 FPGA provides a perfect solution to the challenges in GNN accelerator design.
Table 1: GNN design challenges and achronix’s speedster7t1500 FPGA solution
|GNN design challenge||Speedster7t1500 solution|
|High speed matrix operation||MLP machine learning processor matrix|
|High bandwidth and low latency storage||LRAM+BRAM+GDDR6+DDR4|
|High concurrency and low latency computing||FPGA uses programmable logic circuit to ensure low and high concurrency delay in hardware|
|Memory expansion||Based on 4 * 400gbps RDMA, it ensures that memory access can be expanded with extremely low latency in the data center|
|The algorithm is evolving||FPGA uses programmable logic circuit to ensure the algorithm can be upgraded and reconfigured at the hardware level|
|Complex design||Rich hard IP reduces development time and complexity, NOC simplifies interconnection between modules and improves timing|
Top level architecture of 5.1gnn accelerator
This GNN accelerator is designed for graphsage, but its architecture has certain generality, which can be applied to other similar GNN algorithm acceleration. Its top-level architecture is shown in the figure below.
Figure 6: top level architecture of GNN accelerator (source: original by achronix)
The GNNCore is the core of algorithm implementation. The design details will be discussed below. RoCE-Lite is a lightweight version of RDMA protocol, which is used for remote memory access through high-speed Ethernet to support Graph computation of massive nodes. Its design details will be discussed in the following article of the official account; 400GE Ethernet controller is used to carry RoCE-Lite protocol; GDDR6 is used for the RoCE-Lite protocol. It stores the high-speed access data needed in GNN processing; DDR4, as a spare high-capacity memory, can be used to store the data with relatively low access frequency, such as the graph to be preprocessed; pciegen5x16 provides high-speed host interface for data interaction with server software; all the above modules realize high-speed interconnection through NOC network on chip.
5.2gnncore micro architecture
Before we begin to discuss the gnncore microarchitecture, let’s review the graphsage algorithm in Section 3 of this paper. The aggregation of inner loops and merging (including convolution) occupy most of the computation and memory access of the algorithm. The characteristics of these two steps are as follows
Table 2: comparison of aggregation and merging operations in GNN algorithm (source: https://arxiv.org/abs/1908.10834 )
|Aggregation operation||Merging operation|
|Memory access mode||Indirect access, irregular||Direct access, rules|
|Calculation model||Dynamic, irregular||Static, regular|
It can be seen that aggregate operation and merge operation have completely different requirements for computing and memory access. The aggregation operation involves the sampling of neighboring nodes, but graph belongs to non Euclidean data type, its size dimension is uncertain and unordered, the matrix is sparse, and the node location is random, so the memory access is irregular and it is difficult to reuse data; in the merge operation, its input data is aggregation result (low dimensional representation of nodes) and weight matrix, its size dimension is fixed, and the storage bit is small It is not a challenge for memory access, but the computation of matrix is very large.
Based on the above analysis, we decided to use two different hardware structures to deal with aggregation and merging operations in gnncore accelerator design. The functional block diagram is shown in the following figure
Figure 7: gnncore functional block diagram (source: original by achronix)
Aggregator: it uses SIMD (single instruction multiple data processor) array to sample and aggregate the neighbor nodes of graph. The “single instruction” can be pre-defined as mean () mean value calculation or other applicable aggregation functions; “multiple data” means that the characteristic data of multiple neighbor nodes are required as input in a single mean () mean value calculation, and these data come from the subgraph sampler; SIMD array performs load balancing through the scheduler aggscheduler; subgraph sampler performs load balancing through NOC The adjacency matrix and node feature data h0v read back by gddr6 or DDR4 are cached in the adjacentlistbuffer and nodefeaturebuffer respectively, and the aggregated result HKN (V) is stored in the aggbuffer.
Combiner: convolution operation of aggregate results is performed by pulsating matrix PE; convolution kernel is wk weight matrix; convolution results are nonlinear processed by relu activation function, and stored in paratiasumbuffer for next iteration.
The combined result is normalized by l2bn, which is the final node characterization of hkv.
In a typical node classification prediction application, the node can be represented by a full connection layer (FC) to get the classification label of the node. This process belongs to one of the traditional machine learning processing methods, which is not reflected in the graphsage paper, and this function is not included in this design.
This paper discusses the mathematical principle of graphsage GNN algorithm, and analyzes the technical challenges in GNN accelerator design from multiple dimensions. By decomposing the problems and solving them one by one at the architecture level, the author creates an excellent and highly scalable GNN acceleration solution by comprehensively using the competitive advantages provided by the acronix speedster7t1500 FPGA.