Artificial intelligence is the foundation of autonomous driving, and its three pillars are algorithms, computing power, and data. This article focuses on computing power. The available computing power not only directly affects driving speed, but also determines how much information redundancy can be used to guarantee driving safety. Computing power is most directly reflected in hardware, and automobiles place special requirements on the autonomous-driving controller.
Beyond the usual hardware constraints on cost, volume, weight, and power consumption, the controller must provide enough computing power to guarantee driving speed and information redundancy, and it must meet stringent automotive standards, such as an ultra-wide operating temperature range of −40 °C to 85 °C. On the whole, FPGAs are a technology well suited to the high-speed computation that autonomous driving requires.
The practical challenge is the tension between diverse acceleration requirements and limited hardware resources. The requirements come both from deep-learning forward inference and from rule-based algorithms; the hardware constraints include both FPGA resource limits and memory-bandwidth limits.
The FPGA resource limits show up in two ways. First, peak computing power is capped: limited FPGA resources restrict the degree of computational parallelism, which in turn restricts peak throughput. Second, the set of supported operators is limited: the available resources can only accommodate a limited number of operator implementations. The memory-bandwidth limit shows up as data transfer occupying a non-negligible share of total computation time; in extreme cases, increasing the parallelism of some operators does not reduce computation time at all. To meet these challenges, we have distilled some useful lessons from practice and summarize them here to share.
Algorithm engineers train models with float32, so the exported model parameters are also floating point. The FPGA we use, however, has no dedicated floating-point units, and implementing floating-point arithmetic on it is expensive to the point of being infeasible. Approximating floating-point computation with int8 arithmetic, i.e., implementing quantized computation, is therefore the first problem to solve.
For a matrix product C = AB, the dynamic quantization formula is

C = AB ≈ (scale_a · scale_b) · ⟨A / scale_a⟩⟨B / scale_b⟩, where scale_a = max|A| / 127 and scale_b = max|B| / 127.

The ⟨·⟩ symbol denotes rounding. The two ⟨·⟩ operations linearly map the elements of matrices A and B into the interval [−127, 127], where the multiplications and additions are carried out; the final multiplication by scale_a · scale_b restores the integer result to float32. Taking two 100 × 100 matrices as an example, 1,000,000 float32 multiplications must be performed before quantization. After quantizing to int8, 1,000,000 int8 multiplications plus 30,000 quantization and dequantization multiplications are needed (10,000 each to quantize A and B, and 10,000 to dequantize the result).
Since quantization and dequantization account for a tiny fraction of the work, the benefit of quantization is essentially the benefit of replacing float32 multiplications with int8 multiplications, which is very significant.
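The scheme above can be sketched in pure Python. This is an illustration only, with hypothetical function names; in the real system the integer matrix multiply is what runs on the FPGA, while the scale computation and the final rescaling are the cheap "quantize/dequantize" steps.

```python
def quantize(mat, scale):
    # Linearly map each element into [-127, 127] and round.
    return [[max(-127, min(127, round(x / scale))) for x in row] for row in mat]

def int8_matmul_dequant(A, B):
    # Dynamic quantization: scales are derived from the data at run time.
    scale_a = max(abs(x) for row in A for x in row) / 127.0
    scale_b = max(abs(x) for row in B for x in row) / 127.0
    Aq = quantize(A, scale_a)
    Bq = quantize(B, scale_b)
    n, k, m = len(A), len(B), len(B[0])
    # Integer matrix multiply: this is the bulk of the work (the part an
    # FPGA accelerates with int8 arithmetic).
    Cq = [[sum(Aq[i][t] * Bq[t][j] for t in range(k)) for j in range(m)]
          for i in range(n)]
    # One final float multiplication per output element restores the
    # result to float32 scale.
    return [[c * scale_a * scale_b for c in row] for row in Cq]
```

For well-scaled inputs the dequantized result stays close to the float32 product, which is the behavior the text relies on.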
The advantage of this method is that every computation fully exploits the representational range of int8 (the full value 127 is always reachable) and no data saturate (all elements are mapped linearly), so each individual computation is as accurate as possible. The model can be trained directly in floating point and still maintain its precision/recall: on ResNet-50 evaluated on 50,000 images, Top-1 and Top-5 accuracy dropped by about 1%, and across the networks used in our valet-parking product, no drop in recall was observed. The disadvantage is that the FPGA computation incurs truncation error; after many accumulations, the average numerical error can reach up to 10%. For models whose training was not entirely successful (those that only perform well on a limited evaluation set), precision/recall degrades significantly and the results are uncontrollable.
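The truncation effect can be illustrated in isolation. The snippet below is not the actual FPGA datapath, only a sketch of the mechanism: if low-order bits of each product are dropped before accumulation (as cheap fixed-point hardware may do), the per-term error is one-sided, so it does not cancel and instead grows with the number of accumulated terms.

```python
def truncated_sum(products, drop_bits):
    # Accumulate products after dropping their low-order bits, then shift
    # back to the original scale; each term loses up to 2**drop_bits - 1.
    acc = 0
    for p in products:
        acc += p >> drop_bits
    return acc << drop_bits

# Worst case for drop_bits=2: every product is congruent to 3 mod 4,
# so every term loses exactly 3 and the errors add up instead of cancelling.
products = [8 * i + 3 for i in range(256)]
exact = sum(products)
approx = truncated_sum(products, drop_bits=2)
rel_err = (exact - approx) / exact
```

Here the relative error stays small, but it scales with the number of accumulated terms relative to their magnitude, which is how long accumulation chains reach the error levels mentioned above.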
Known quantization scales: static quantization. If the statistics in the formula above are gathered offline, the quantization scales are frozen as scale_a and scale_b; values are rounded and clamped to [−127, 127]. The advantage of this method is that it saves FPGA resources, and it makes it easy to train in a way consistent with quantized inference, so the numerical error between inference and training is very small and precision/recall is controllable. The disadvantage is that model training must use a matching quantization scheme; otherwise the computation error is unacceptably large.
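A minimal sketch of the static variant, with a hypothetical scale value standing in for one frozen from offline statistics. The key difference from the dynamic case is that no run-time max scan is needed, but values outside the calibrated range saturate at ±127, which is why training should see the same quantization (quantization-aware training) for the errors to stay controllable.

```python
def quantize_static(x, scale):
    # Round, then clamp ("saturate") to the int8 range [-127, 127].
    q = round(x / scale)
    return max(-127, min(127, q))

SCALE_A = 0.02  # hypothetical scale frozen from offline calibration statistics

# An in-range value round-trips with at most half a quantization step of error:
x = 1.27
x_hat = quantize_static(x, SCALE_A) * SCALE_A

# An out-of-range value saturates at the top of the calibrated range:
y = 5.0
y_hat = quantize_static(y, SCALE_A) * SCALE_A  # clamped to 127 * SCALE_A
```

If the training pipeline applies the same rounding and clamping, inference and training see nearly identical numerics; if not, saturation like the second case introduces errors the model never learned to tolerate.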