Compared with other computing platforms such as the CPU and GPU, the FPGA offers high performance, low energy consumption, and hardware programmability. Figure 1 shows the hardware architecture of an FPGA. Each FPGA consists of three main parts: input/output logic, used mainly for communication between the FPGA and external components such as sensors; computational logic blocks, used mainly to build computing modules; and a programmable interconnect network, used mainly to connect different computational logic blocks into a complete computing unit.
When programming, we map the computational logic onto the hardware and connect the different logic blocks by adjusting the interconnect so that together they complete a computing task. For example, to perform an image feature extraction task, we connect the FPGA's input logic to the camera's output logic so that images can enter the FPGA. The input logic is then connected to several computational logic blocks, which extract feature points from different regions of each image in parallel. Finally, we connect those blocks to the FPGA's output logic, which aggregates the feature points and outputs them. In effect, the FPGA hard-wires the algorithm's data flow and execution into the hardware logic, avoiding the instruction-fetch and instruction-decode work a CPU must perform.
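The pipeline described above can be sketched as a software analogy (not actual FPGA code): the image is split into regions, each region is handled by an independent "logic block", and an output stage aggregates the results. The tile size and the simple brighter-than-neighbors feature test are illustrative choices, not the method used in any particular FPGA design.

```python
def split_into_tiles(image, tile):
    """Split a 2D image (list of rows) into (row, col, region) tiles."""
    h, w = len(image), len(image[0])
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            yield r, c, [row[c:c + tile] for row in image[r:r + tile]]

def extract_features(r0, c0, region):
    """One 'logic block': report pixels brighter than all 4 neighbors."""
    feats = []
    for i in range(1, len(region) - 1):
        for j in range(1, len(region[0]) - 1):
            v = region[i][j]
            if v > max(region[i - 1][j], region[i + 1][j],
                       region[i][j - 1], region[i][j + 1]):
                feats.append((r0 + i, c0 + j))
    return feats

def pipeline(image, tile=4):
    """Output stage: gather feature points from every region."""
    features = []
    for r, c, region in split_into_tiles(image, tile):
        # On an FPGA, each region would be processed by its own logic
        # block in parallel; here we iterate sequentially.
        features.extend(extract_features(r, c, region))
    return features
```

On real hardware, each call to `extract_features` would correspond to a separate computational logic block, so all regions are processed in the same clock cycles rather than one after another.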
Although the clock frequency of an FPGA is generally lower than that of a CPU, the FPGA can implement a hardware computing unit with a high degree of parallelism. For example, a CPU can typically process only 4 to 8 instructions at a time, whereas the data-parallel approach on an FPGA can process 256 or more operations at a time, allowing the FPGA to handle far more data than the CPU. In addition, as noted above, the FPGA generally needs no instruction fetch or instruction decode, which eliminates the time those pipeline stages would otherwise add.
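The data-parallel idea can be illustrated with a small sketch: a scalar pipeline handles one element per step, while a wide datapath (as can be built on an FPGA) handles a whole chunk of elements per step. The lane width of 256 mirrors the figure quoted in the text and is illustrative.

```python
LANES = 256  # illustrative datapath width, matching the text's example

def vector_add(a, b, lanes=LANES):
    """Add two sequences chunk by chunk; each chunk of up to `lanes`
    elements models one clock step of the wide datapath. Returns the
    result and the number of steps taken."""
    out, steps = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return out, steps

result, steps = vector_add(list(range(1024)), [1] * 1024)
print(steps)  # 4 steps with 256 lanes, versus 1024 on a scalar machine
```

The FPGA's advantage is that the 256 additions within a chunk are physically separate adders running in the same cycle, so the lower clock frequency is offset by the much wider datapath.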
To give readers a better sense of FPGA acceleration, we summarize a 2010 study by Microsoft Research on accelerating BLAS with FPGAs. BLAS is a low-level matrix-computation library widely used in high-performance computing, machine learning, and other fields. In this study, the researchers analyzed the speedup and energy consumption of BLAS on the CPU, GPU, and FPGA. Figure 2 compares the per-iteration execution time of the gaxpy routine on the FPGA, CPU, and GPU. Compared with the CPU, both the GPU and the FPGA achieve about 60% acceleration. The figure shows results for small matrices; as the matrices grow, the speedup of the GPU and FPGA over the CPU becomes increasingly pronounced.
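The gaxpy ("generalized A x plus y") kernel benchmarked in the study computes y ← Ax + y, a BLAS level-2 operation. A minimal pure-Python reference, assuming A is stored as a list of rows:

```python
def gaxpy(A, x, y):
    """In-place y <- A @ x + y for an m x n matrix A (list of rows)."""
    for i, row in enumerate(A):
        # Dot product of row i with x, accumulated into y[i].
        y[i] += sum(a * b for a, b in zip(row, x))
    return y

A = [[1, 2], [3, 4]]
x = [1, 1]
y = [0, 0]
print(gaxpy(A, x, y))  # [3, 7]
```

On an FPGA, the row-by-row dot products are independent, so they map naturally onto parallel multiply-accumulate units, which is why this kernel benefits from the data-parallel hardware described above.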
Compared with the CPU and GPU, the FPGA has a clear energy-consumption advantage, for two main reasons. First, the FPGA performs no instruction fetch or instruction decode. In an Intel CPU, the decoder alone accounts for about 50% of the whole chip's energy consumption because of the CISC architecture; in a GPU, fetch and decode likewise consume 10%–20% of the energy. Second, the clock frequency of an FPGA is much lower than that of a CPU or GPU: CPUs and GPUs generally run between 1 GHz and 3 GHz, while FPGAs generally run below 500 MHz. This large frequency gap means the FPGA consumes far less energy than the CPU or GPU.
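The frequency argument can be made concrete with the standard CMOS dynamic-power model, P = αCV²f. This is a deliberately simplified back-of-envelope sketch: the constants below are illustrative placeholders, not measured values, and real energy comparisons also depend on voltage scaling, switching activity, and static leakage.

```python
def dynamic_power(alpha, cap, volt, freq):
    """Dynamic switching power (watts) under the model P = alpha*C*V^2*f."""
    return alpha * cap * volt ** 2 * freq

# Hold activity factor, capacitance, and voltage fixed; vary only the clock.
p_cpu = dynamic_power(0.1, 1e-9, 1.0, 3e9)     # 3 GHz clock
p_fpga = dynamic_power(0.1, 1e-9, 1.0, 500e6)  # 500 MHz clock
print(p_cpu / p_fpga)  # 6.0: the clock gap alone gives ~6x dynamic power
```

Under these assumptions, the 3 GHz versus 500 MHz gap alone accounts for a sixfold difference in dynamic power, before the fetch/decode savings are even counted.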