“No picture, no truth” — we live in a world where seeing is believing, a world of visual information. The brain is often claimed to process visual content 60,000 times faster than text. With the popularity of smartphones, generating and sharing pictures and videos has become a basic way of communicating on social platforms. Users upload and share pictures from their phones, tablets, and computers, and the volume grows every year (see Figure 1).
Figure 1: 2016 KPCB statistical report
Every day, hundreds of millions of pictures are uploaded by users to QQ Albums and WeChat Moments. These pictures are stored on back-end servers and distributed over the network. The more each picture can be compressed, the less data has to be stored, transmitted, and distributed — which saves bandwidth, speeds up downloads, and improves the user experience. Can images be compressed? In 1948, Shannon, the founder of information theory, showed that voice and picture signals can be compressed because they contain a great deal of redundant information. Common image compression algorithms include JPEG, WebP, H.264 (intra coding), and HEVC (intra coding); ranked by compression capability, JPEG < WebP ≈ H.264 (intra) < HEVC (intra). This stronger compression is bought with higher computational complexity: WebP and HEVC cost more than 10 times the computation of JPEG compression. At present, most pictures uploaded by users on social platforms are in JPEG format, and back-end servers apply more complex algorithms such as WebP and HEVC (intra) to compress them further and save storage and bandwidth. The essence of image compression is thus to trade computing power for storage and bandwidth. At the same time, the more complex algorithms consume large amounts of computing power and increase processing delay.
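Shannon's point — that redundant signals compress while random ones do not — can be seen with a few lines of Python. This is a minimal sketch using `zlib` (a lossless entropy coder) on two synthetic byte strings standing in for image data; it is an illustration of redundancy, not of JPEG or HEVC themselves.

```python
# Sketch illustrating Shannon's observation that redundant data compresses,
# using zlib (a lossless entropy coder) on two synthetic "images".
import random
import zlib

random.seed(0)

# A smooth 256-byte gradient repeated 256 times: highly redundant,
# like the flat regions of a real photo.
smooth = bytes(x % 256 for x in range(256)) * 256

# Independent random bytes: almost no redundancy, so little to compress.
noisy = bytes(random.randrange(256) for _ in range(256 * 256))

ratio_smooth = len(zlib.compress(smooth)) / len(smooth)
ratio_noisy = len(zlib.compress(noisy)) / len(noisy)

print(f"smooth: {ratio_smooth:.3f} of original size")  # far below 1.0
print(f"noisy:  {ratio_noisy:.3f} of original size")   # close to 1.0
```

Real image codecs exploit exactly this: the smoother and more predictable the content, the fewer bits are needed.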
From a business perspective, offline services can perform image transcoding with the idle computing power between business peaks and troughs. Online services, however, place much stricter requirements on processing delay. To meet them, images are sometimes transcoded ahead of time, and the transcoded copies are stored and served directly when users request them; the delay requirement is met at the cost of extra storage. But this creates a new problem: users view pictures on smart terminals with different screen sizes, so serving everyone the same picture size is clearly suboptimal. The ideal approach is to transcode pictures in real time, on demand, given sufficient computing power.
In the data center, computing power is usually provided by x86 CPUs. In the past, x86 CPU performance doubled roughly every 18 months (“Moore's law”), but the industry consensus is that Moore's law is coming to an end. On March 24, 2016, for example, Intel announced that it would retire its “Tick-Tock” processor development model, stretching the R&D cycle from two years to three. The International Technology Roadmap for Semiconductors (ITRS), which had been maintained for decades and updated every two years to provide guidance and planning for the global semiconductor industry, also announced in 2016 that it would no longer be updated.
On the one hand, processor performance can no longer grow at the pace of Moore's law; on the other hand, data growth demands more computing performance than Moore's law would deliver. Processors alone can no longer meet the performance requirements of high-performance computing (HPC) applications, opening a gap between demand and capability (see Figure 2).
Figure 2: The growing gap between computing demand and computing power
Image processing solution
The image service supports a wide range of capabilities. Basic functions include various thumbnail cropping methods, text and image watermarking, format conversion, resumable upload, image storage, hotlink protection, and more. Combined with user needs in today's image-centric era, we provide an integrated solution for image upload, storage, processing, and distribution. At present, most images stored and downloaded in Internet image services are still JPEG or WebP. With the arrival of the new coding standard HEVC, however, compression efficiency improves by 30%–70% over JPEG/WebP at the same image quality, which can save large amounts of storage and bandwidth. The high complexity of the HEVC algorithm means that the encoding delay and throughput achievable on CPUs cannot meet the demands of an online environment, so we developed a new solution based on FPGAs. The FPGA image processing solution meets online requirements well. It is also compatible with WebP and the other transcoding formats of users' current online systems, adapts to the needs of different users, and delivers low delay, high throughput, and low cost.
Below we take HEVC FPGA image processing as an example to illustrate the architecture for image upload, storage, processing, and download in an Internet business.
Figure 3: HEVC FPGA image upload, storage, processing, and download solution
As shown in Figure 3, HEVC FPGA transcoding is deployed mainly on the transcoding servers that sit before persistent storage and before download. Using FPGAs for transcoding has the following advantages:
1. Transcoding to HEVC with FPGAs before persistent storage effectively reduces storage cost.
2. Compared with CPU transcoding, FPGA transcoding servers reduce server cost.
3. Compared with CPUs, FPGAs greatly increase the throughput of HEVC picture transcoding.
4. When HEVC pictures are generated in real time at download, FPGA-accelerated transcoding greatly reduces transcoding delay and improves the user experience.
Analysis of image coding algorithms
In image and video codec algorithms, every module operates at the pixel or block level, and the same operations are repeated for each pixel or block. In the early image compression standards JPEG and JPEG 2000, the original image is first transformed block by block (block-based DCT in JPEG, wavelet transform in JPEG 2000), and the transformed coefficients are quantized and then entropy coded (Huffman coding or adaptive arithmetic coding) to produce the compressed bitstream. At the decoding end, the bitstream is decoded by the inverse operations. In JPEG 2000, the wavelet transform that replaces the DCT removes redundancy within the image more effectively, and coding the quantized coefficients bit-plane by bit-plane with adaptive arithmetic coding achieves better compression performance.
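The transform-and-quantize step described above can be sketched in a few dozen lines. This is a naive, illustration-only 8×8 DCT (not an optimized codec kernel); the quantization table is the standard JPEG luminance table. Note how a smooth block ends up with only a handful of nonzero coefficients — that is where the compression comes from.

```python
# Minimal sketch of the JPEG-style transform-and-quantize step: a 2-D DCT
# concentrates a smooth 8x8 block's energy into a few coefficients, and
# quantization zeroes most of the rest.
import math

def dct_2d(block):
    """Naive 8x8 DCT-II (O(N^4); fine for illustration, far too slow for real use)."""
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            cv = math.sqrt(0.5) if v == 0 else 1.0
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * cu * cv * s
    return out

QTABLE = [  # standard JPEG luminance quantization table (ITU-T T.81, Annex K)
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

# A smooth diagonal gradient block, level-shifted by -128 as in JPEG.
block = [[(x * 8 + y * 8) - 128 for y in range(8)] for x in range(8)]
coeffs = dct_2d(block)
quantized = [[round(coeffs[u][v] / QTABLE[u][v]) for v in range(8)] for u in range(8)]

nonzero = sum(1 for row in quantized for c in row if c != 0)
print(f"nonzero quantized coefficients: {nonzero} of 64")
```

The entropy coder then only has to spend bits on those few surviving coefficients; the zeros are coded almost for free.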
Besides transforming the original image directly as JPEG does, there is another approach based on block prediction: an image block is first predicted, and the residual between the original block and the prediction is then transformed, quantized, and encoded. A typical standard is WebP, which developed out of H.264 intra prediction. With the introduction of the new-generation video coding standard HEVC/H.265, whose intra coding compresses nearly twice as well as the previous generation, using HEVC intra coding for image compression has become a trend. The HEVC intra coding process is shown in Figure 4.
Figure 4: HEVC intra coding process
In HEVC, blocks are partitioned with a flexible quadtree structure that adapts to different image content, and each final block needs only one prediction mode. Figure 5 shows an example of block partitioning and prediction modes in HEVC picture coding. When a block can be predicted well from a single direction, it need not be split further; regions with complex content are divided into smaller blocks. An important task of the encoder is to find the best partitioning and the best prediction direction.
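The partition search can be sketched as a recursive split decision. In this toy version, pixel variance stands in for the rate-distortion cost a real HEVC encoder would evaluate, and the thresholds and sizes are illustrative only — the point is that flat areas stay as large blocks while complex areas split down to small ones.

```python
# Toy sketch of the encoder's quadtree partition search: recursively decide
# whether to split a block into four quadrants, using pixel variance as a
# stand-in for true rate-distortion cost.
def variance(img, x, y, size):
    pixels = [img[y + j][x + i] for j in range(size) for i in range(size)]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def partition(img, x, y, size, min_size=8, threshold=100.0):
    """Return the leaf blocks (x, y, size) of a variance-driven quadtree."""
    if size <= min_size or variance(img, x, y, size) < threshold:
        return [(x, y, size)]  # flat enough: one prediction mode suffices
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += partition(img, x + dx, y + dy, half, min_size, threshold)
    return leaves

# 32x32 test image: flat left half, busily textured right half.
img = [[0 if x < 16 else (x * 7 + y * 13) % 256 for x in range(32)] for y in range(32)]
leaves = partition(img, 0, 0, 32)
print(f"{len(leaves)} leaf blocks")  # the textured side splits into small blocks
```

A real encoder makes the same shaped decision, but by actually coding each candidate and comparing rate-distortion costs, which is precisely why the mode search dominates encoder complexity.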
Figure 5: Block partitioning and prediction modes in HEVC picture coding
Figure 6(a) is the prediction picture obtained from the final block partitioning and prediction modes. The difference (residual) between the prediction and the original picture, shown in Figure 6(b), is DCT-transformed, quantized, and finally output through the entropy encoder. The decoder combines the received residual data with the same prediction as the encoder to obtain the final reconstructed picture; Figure 6(c) shows the reconstructed data. Because encoding needs reconstructed data as a reference, the encoder must perform the reconstruction process as well. The original picture is shown in Figure 6(d); the loss between the reconstructed and original pictures is very small.
Figure 6: Prediction, residual, reconstruction, and original data in the HEVC image coding process
In HEVC intra coding, searching for the best coding mode makes the encoder computationally expensive, and traditional CPUs cannot reach the desired throughput. Although today's GPUs are widely used for pictures and video, GPU parallelism suits workloads where every pixel undergoes the same operation before the next parallel step; it copes poorly with the complex control flow in each module of HEVC picture coding. (Indeed, in NVIDIA GPUs, picture and video codecs are handled by dedicated hardware blocks.) An FPGA, by contrast, can pipeline the different modules and exploit parallelism in time. And because only intra coding is performed, pictures are independent of each other, so multiple encoder cores can be instantiated in the FPGA to compress different pictures in parallel.
Of course, block-prediction-based image coding also has features that limit FPGA parallelization, but these can be addressed through FPGA design techniques. For example, as Figure 4 shows, the reference pixels for intra prediction must come from reconstruction, which creates dependencies between blocks and limits inter-block parallelism and pipelining. In the actual FPGA design, original pixels can be used as the reference during the primary (rough) selection of the prediction mode, with reconstructed pixels used only in the final coding pass. The scanning order can also be changed so that pixels others depend on are processed first. In addition, adaptive entropy coding updates its code tables and probability estimates as it runs, so the coding of some bits depends on earlier bits. In the FPGA design, the data to be encoded can be grouped so that mutually independent data fall in the same group, and a data cache can determine in advance whether the next item has a dependency, improving entropy coding throughput.
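The original-pixels-as-reference trick can be illustrated with a tiny rough mode decision. This sketch is heavily simplified: the mode names cover only three of HEVC's 35 intra modes, the cost is plain SAD, and `rough_mode` is a hypothetical helper, not real encoder code — the point is only that the references come from arrays of original pixels, so no block has to wait for another block's reconstruction.

```python
# Sketch of the dependency-breaking trick: during primary (rough) mode
# selection, neighbouring *original* pixels serve as the prediction
# reference instead of reconstructed ones, so blocks can be pipelined.
def predict(ref_left, ref_top, mode, size):
    """Build a size x size prediction block from 1-D reference arrays."""
    if mode == "horizontal":
        return [[ref_left[y]] * size for y in range(size)]
    if mode == "vertical":
        return [ref_top[:size] for _ in range(size)]
    dc = sum(ref_left[:size] + ref_top[:size]) // (2 * size)
    return [[dc] * size for _ in range(size)]  # "dc" mode

def sad(block, pred):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b) for row_b, row_p in zip(block, pred)
               for a, b in zip(row_b, row_p))

def rough_mode(block, ref_left, ref_top):
    """Primary selection: cheapest mode, judged against original references."""
    costs = {m: sad(block, predict(ref_left, ref_top, m, len(block)))
             for m in ("horizontal", "vertical", "dc")}
    return min(costs, key=costs.get)

# A block whose rows repeat the left reference: horizontal prediction fits best.
ref_left = [10, 50, 90, 130]
ref_top = [0, 0, 0, 0]
block = [[v] * 4 for v in ref_left]
print(rough_mode(block, ref_left, ref_top))  # prints "horizontal"
```

Only the surviving candidate modes then go through the exact, reconstruction-based selection in the final pass, which keeps the result close to a fully serial encoder.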
FPGA implementation of the HEVC image coding algorithm
FPGA image coding architecture
At present, our picture business has FPGA hardware acceleration for both WebP and HEVC formats. The following takes HEVC I-frame hardware acceleration as an example to show how image coding is implemented on an FPGA.
The FPGA logic architecture consists of a platform part and the HEVC encoder IP. The platform part contains the PCIe DMA and DDR bus logic, which handles data communication with the host CPU and with the DDR memory on the FPGA board. As shown in Figure 7, four HEVC cores are instantiated in the FPGA architecture (the number depends on available FPGA resources). Each HEVC core performs the complete HEVC encoding algorithm, and the four cores work in parallel: four encoding tasks run at the same time and output four HEVC bitstreams simultaneously.
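From the host's point of view, the four cores behave like a pool of four workers that independent pictures are dispatched to. A minimal sketch, where `encode_on_core` is a hypothetical stand-in for the real driver call (DMA the picture in, wait for the completion interrupt, read the bitstream back):

```python
# Host-side view of the four parallel HEVC cores: independent pictures are
# dispatched to whichever core is free, like jobs to a pool of four workers.
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 4  # instantiated cores; the real count depends on FPGA resources

def encode_on_core(picture_id):
    # Placeholder for: DMA picture to FPGA, wait for interrupt, read stream back.
    return f"hevc_stream_{picture_id}"

pictures = list(range(10))
with ThreadPoolExecutor(max_workers=NUM_CORES) as pool:
    streams = list(pool.map(encode_on_core, pictures))

print(len(streams), "bitstreams produced")
```

Because intra-only coding makes every picture independent, this picture-level parallelism scales simply with the number of cores the FPGA resources allow.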
Figure 7: FPGA internal logic architecture
The FPGA internal logic mainly includes:
HEVC core 0–3: H.265 encoder IP implementing the HEVC coding algorithm;
PCIe/DMA: data communication with the host CPU;
Register RW/INT: register read/write and interrupt handling;
HEVC RW arbiter: bus read/write arbitration module;
AXI interconnect / DDRC / DDR PHY: bus logic controlling access to DDR.
FPGA image coding process
The internal algorithm processing flow of the HEVC core, shown in Figure 8, is divided into current-image loading, intra prediction primary selection, intra prediction final selection, CABAC coding, and bitstream output.
Figure 8: HEVC core internal algorithm processing flow
So how is the HEVC core designed to realize the algorithm? The encoder is built as a four-stage module pipeline. As shown in Figure 9, the four stages CurLd / PIntra / Sel / CABAC are designed to have similar processing times; once running in parallel, the pipeline takes about 8,400 cycles on average per LCU. With 510 LCUs per 1080p picture and an FPGA circuit frequency of 200 MHz, a single core can theoretically reach 46 frames/s, and four cores in parallel reach 184 frames/s.
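The single-core figure follows directly from the pipeline numbers — clock frequency divided by (cycles per LCU × LCUs per picture):

```python
# Throughput arithmetic for the four-stage pipeline described above.
clock_hz = 200_000_000    # FPGA implementation frequency, 200 MHz
cycles_per_lcu = 8_400    # average per-LCU cost once the pipeline is full
lcus_per_1080p = 510      # ceil(1920/64) * ceil(1080/64) = 30 * 17 LCUs

fps_single_core = clock_hz / (cycles_per_lcu * lcus_per_1080p)
fps_four_cores = 4 * fps_single_core
print(f"single core: {fps_single_core:.1f} fps, four cores: {fps_four_cores:.1f} fps")
```

This works out to about 46.7 fps per core, which the text rounds down to 46 (and 4 × 46 = 184 for the four-core figure).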
Specifically, CurLd performs the loading logic for the current image. PIntra traverses the 35 intra prediction modes in the primary selection and produces the best candidate mode; this stage is algorithmically optimized so that the prediction reference pixels are the current (original) pixels rather than the traditionally used reconstructed pixels. This optimization allows the primary selection to occupy its own pipeline stage, separate from the final intra selection, roughly doubling the encoder's overall throughput. Sel performs the final intra mode selection and RDO decision, with prediction block sizes of 32/16/8; because of the large amount of transform and quantization logic, this stage is the biggest resource consumer in the encoder, and the design trades off algorithm quality against logic resource consumption. The CABAC module generates the header bitstream, codes the syntax elements and residuals of each LCU, and packages and outputs the bitstream; the main concern for this stage is whether CABAC is fast enough to keep up with smaller-QP coding, which produces more bins.
Figure 9: Algorithm module pipeline
Performance and benefits
We use the FPGA to convert JPEG pictures into HEVC pictures at a resolution of 1920×1080. The FPGA's processing delay is 1/7 that of the CPU, its processing throughput is 10 times that of a CPU machine, and the cost per unit of performance of the FPGA model is 1/3 that of the CPU model (see Figure 10).
Figure 10: Comparison of image transcoding on FPGA and CPU
In short, if FPGA resources, hardware architecture, and processing performance were of no concern, the CPU image compression algorithm could simply be “copied” into the FPGA, and the FPGA's compression performance would exactly equal the CPU's. Reality is less ideal: an FPGA implementation must jointly consider performance, resources, and algorithm complexity, and only co-design yields the best scheme. To fully exploit the speed of a hardware implementation, the algorithm must be optimized, and in practice this means making some “concessions” in the FPGA version. Moreover, once a particular FPGA part is chosen, its logic and routing resources are fixed, so the implementation must also manage resource utilization; achieving the best compression algorithm within the same FPGA resources becomes the hard part of the design. Our goal in implementing the algorithm on FPGA is to keep compression performance as close as possible to the CPU's while achieving image processing throughput and delay the CPU cannot match.