FPGAs offer an early route to architecture specialization for high-performance computing and machine learning. Architecture specialization is one way to keep improving performance as the slowing of Moore's law limits the pace of technology scaling. Whether the goal is lower power consumption or higher performance, application-specific hardware can accelerate an application or part of it, letting more efficient hardware take over the work it is best suited to.
Given the inherent cost of building computing hardware for a single application or workflow, this strategy cannot be used for every application. However, by grouping similar problems or identifying critical workloads or code that can benefit from acceleration, it is likely to become an important part of improving application performance. Some applications are well suited to technologies such as GPUs or FPGAs, which can improve their performance through acceleration. Neither GPU acceleration nor architecture specialization is a new concept, and some experts predict that both will be used increasingly to improve performance and reduce the energy cost of future systems.
CERN is using Xilinx FPGAs to accelerate inference and sensor preprocessing workloads in the search for dark matter. The researchers behind the project use FPGAs alongside CERN's other computing resources to process large volumes of high-energy particle physics data at very high speed, looking for clues to the origin of the universe. This requires filtering sensor data in real time to identify new particle structures that may contain evidence of dark matter and other physical phenomena.
In today's big data era, enterprises and consumers are flooded with massive amounts of data from many sources, including business transactions, social media, and sensor or machine-to-machine data. This data comes in a variety of formats, from structured numeric data in traditional databases to unstructured text documents, e-mail, video, audio and financial transactions.
Effective analysis of this data is key to generating insight and driving better decisions, and machine learning (ML) algorithms are widely used in modern data analysis. One particular class of ML algorithm, the deep neural network (DNN), and especially the deep convolutional network, has been widely used for image classification. The current generation of DNNs, such as AlexNet and VGG, relies on dense floating-point matrix multiplication (GEMM). This computation has regular parallelism and demands high floating-point throughput (measured in TFLOP/s, tera floating-point operations per second), which maps well onto GPU capabilities.
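To make the GEMM workload concrete, here is a minimal sketch in plain Python of a dense matrix multiplication and its operation count. The function names `gemm` and `gemm_flops` are hypothetical, chosen only for illustration; real DNN frameworks call heavily optimized libraries rather than loops like these.

```python
def gemm(a, b):
    """Dense matrix product of a (m x k) and b (k x n), given as lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # Every (i, j) output performs the same k multiply-adds:
            # this uniformity is the "regular parallelism" of dense GEMM.
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def gemm_flops(m, k, n):
    """Floating-point operations in a dense GEMM: one multiply and one add per term."""
    return 2 * m * k * n


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(gemm(a, b))
    print(gemm_flops(2, 2, 2))
```

Because every output element does an identical amount of independent work, the loops over `i` and `j` can be spread across thousands of identical floating-point units, which is why this kind of computation suits GPUs.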
Although FPGAs are more energy-efficient than GPUs (which matters in today's Internet of Things market), their performance on DNNs has not matched that of GPUs. A series of tests conducted by Intel evaluated two of its latest generations of FPGAs (Arria 10 and Stratix 10) against the latest high-performance GPU (Titan X Pascal) on DNN computation. DNNs have traditionally run on GPUs because their data-parallel computation has regular parallelism and requires high floating-point throughput. Each generation of GPU adds more floating-point units, more on-chip RAM and higher memory bandwidth to deliver more floating-point operations per second.
However, computations with irregular parallelism can be a challenge for GPUs because of divergence and related problems. In addition, because GPUs natively support only a fixed set of data types, custom data types may not be handled efficiently, which leaves hardware resources underutilized and performance unsatisfactory. FPGAs, meanwhile, are becoming more capable in several ways. First, next-generation FPGAs integrate more on-chip RAM. Second, technologies such as HyperFlex can significantly raise the clock frequency. Third, more DSPs are available. Fourth, integrated HBM memory increases off-chip bandwidth. Finally, next-generation FPGAs use more advanced process technology, such as 14 nm CMOS.
The Intel Stratix 10 FPGA has more than 5,000 hardened floating-point units (DSPs), more than 28 MB of on-chip RAM (M20Ks), integrated high-bandwidth memory (up to 4 x 250 GB/s per stack, or 1 TB/s in total), and higher frequency thanks to the new HyperFlex technology, resulting in a peak FP32 throughput of 9.2 TFLOP/s. In addition, the FPGA development environment and toolset are evolving to support programming at higher levels of abstraction, making FPGA programming more accessible to developers.
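The headline figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes that each hardened DSP completes one fused multiply-add (2 FLOPs) per clock cycle, which the text does not state; the function names are hypothetical. Under that assumption, and taking the lower bound of 5,000 DSPs, it derives the clock frequency implied by the 9.2 TFLOP/s peak.

```python
def aggregate_bandwidth_gb_s(stacks, gb_s_per_stack):
    """Total memory bandwidth in GB/s across all HBM stacks."""
    return stacks * gb_s_per_stack


def peak_tflops(num_dsps, clock_mhz, flops_per_dsp_per_cycle=2):
    """Peak throughput in TFLOP/s.

    Assumption (not from the text): each DSP retires one fused
    multiply-add, counted as 2 FLOPs, per clock cycle.
    """
    return num_dsps * flops_per_dsp_per_cycle * clock_mhz * 1e6 / 1e12


def implied_clock_mhz(tflops, num_dsps, flops_per_dsp_per_cycle=2):
    """Clock frequency in MHz needed to reach a given peak, same assumption."""
    return tflops * 1e12 / (num_dsps * flops_per_dsp_per_cycle) / 1e6


if __name__ == "__main__":
    # 4 stacks at 250 GB/s each gives the quoted 1 TB/s (1000 GB/s).
    print(aggregate_bandwidth_gb_s(4, 250))
    # With exactly 5,000 DSPs the quoted 9.2 TFLOP/s implies roughly 920 MHz;
    # since the device has more than 5,000 DSPs, the real clock would be lower.
    print(implied_clock_mhz(9.2, 5000))
```

The point of the calculation is that peak throughput scales with the product of DSP count and clock frequency, which is why both the larger DSP budget and the HyperFlex frequency improvement matter.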
Intel recently studied various GEMM operations for next-generation DNNs. The team developed a DNN hardware accelerator template for FPGAs that provides first-class hardware support for sparse matrix algorithms and custom data types. The template is designed to support a range of next-generation DNNs and can be customized to generate an optimized FPGA hardware instance for a user-specified DNN variant.
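To illustrate why sparse matrix algorithms benefit from dedicated hardware support, here is a minimal sketch of a matrix-vector product that skips zero weights using a compressed sparse row (CSR) layout. CSR is a common storage format chosen here purely for illustration; the text does not say which representation Intel's template uses, and the function names are hypothetical.

```python
def to_csr(dense):
    """Convert a dense matrix (list of lists) to CSR arrays, dropping zeros."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[r + 1] marks where row r's nonzeros end in `values`.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, touching only nonzero entries."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        # The trip count differs from row to row, so the work is irregular.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    weights = [[0, 2, 0], [1, 0, 3], [0, 0, 0]]
    vals, cols, ptrs = to_csr(weights)
    print(csr_matvec(vals, cols, ptrs, [1, 2, 3]))
```

Skipping zeros saves arithmetic, but each row now has a different amount of work and reads the input vector at data-dependent positions. That irregularity is the kind of pattern that fixed, lockstep hardware handles poorly and that a customizable datapath can be built around.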
The template was used to run and evaluate key matrix multiplication operations from next-generation DNNs on current and next-generation FPGAs (Arria 10 and Stratix 10) and on the latest high-performance Titan X Pascal GPU. The results show that, for GEMM operations in pruned, Int6 and binarized DNNs, the Stratix 10 FPGA delivers 1.1x, 1.5x and 5.4x the performance of the Titan X Pascal GPU, respectively.
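The binarized case shows most clearly why a custom data type changes the arithmetic. When weights and activations are restricted to +1 and -1, each value fits in one bit and a dot product reduces to an XNOR followed by a population count. The sketch below is a generic illustration of that well-known identity, not Intel's implementation; the encoding (bit 1 for +1, bit 0 for -1) and the function names are assumptions made for the example.

```python
def pack_signs(signs):
    """Pack a list of +1/-1 values into an integer, bit i set when signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
        elif s != -1:
            raise ValueError("binarized values must be +1 or -1")
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the signs agree (product +1);
    the remaining positions contribute -1, so dot = agree - (n - agree).
    """
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agree - n


def binary_gemm(a_rows, b_cols):
    """Binarized GEMM: a_rows and b_cols are lists of equal-length +1/-1 vectors."""
    n = len(a_rows[0])
    packed_a = [pack_signs(r) for r in a_rows]
    packed_b = [pack_signs(c) for c in b_cols]
    return [[binary_dot(pa, pb, n) for pb in packed_b] for pa in packed_a]


if __name__ == "__main__":
    a = [1, -1, 1, 1]
    b = [1, 1, -1, 1]
    print(binary_dot(pack_signs(a), pack_signs(b), len(a)))
    print(sum(x * y for x, y in zip(a, b)))  # reference result, same value
```

No floating-point multiplier is involved at all: the work is bitwise logic and counting, which maps directly onto FPGA lookup tables, whereas a processor with fixed native data types spends much wider arithmetic units on one-bit operands.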
These tests also show that Arria 10 and Stratix 10 FPGAs offer favorable energy efficiency (TOP/s per watt) compared with the Titan X GPU, with both devices achieving 3 to 10 times better energy efficiency than the Titan X. Although GPUs have long been the undisputed choice for DNNs, this comparison of two generations of Intel FPGAs (Arria 10 and Stratix 10) with the latest Titan X GPU shows that current trends in DNN algorithms favor FPGAs, and that FPGAs may even deliver better performance.