In existing block-based video coding and decoding systems, blocking artifacts appear when the bit rate is low, and the new video coding standard H.264 is no exception. There are two main causes of this blocking effect. First, the block-based integer transform of the prediction residual: when the transform coefficients are quantized with a large quantization step, the block edges of the decoded, reconstructed image become discontinuous. Second, errors introduced by the interpolation used in motion-compensated prediction appear in the reconstructed image. If left untreated, these artifacts accumulate from one reconstructed frame to the next and seriously degrade image quality and compression efficiency. To solve this problem, the deblocking filter in H.264 uses a fairly complex adaptive filter that removes the blocking effect effectively. How to optimize the deblocking filtering algorithm in real-time video decoding, reducing its computational complexity while improving reconstructed image quality, has therefore become a key problem of H.264 decoding.
1. Deblocking filtering of H.264
1.1 Filtering principle
A large quantization step causes a relatively large quantization error, which can turn the continuous gray-level transition between pixels at the border of adjacent blocks into a step change, perceived subjectively as a "pseudo-edge" blocking artifact. The way to remove this artifact is to restore these step changes to small, nearly continuous gray-level changes while keeping the total energy of the image unchanged, and at the same time damaging the real edges of the image as little as possible.
1.2 Adaptive filtering process
In H.264, deblocking filtering is performed macroblock by macroblock in units of 16 × 16 pixels. Within each macroblock, the edges between 4 × 4 sub-blocks are filtered in order, vertical edges first and then horizontal edges, so that all edges in the whole reconstructed image (except the image boundary) are filtered. The edge layout is shown in Figure 1. A 16 × 16 luminance macroblock has 4 vertical edges and 4 horizontal edges, each divided into 16 pixel edges; the corresponding 8 × 8 chrominance macroblock has 2 vertical edges and 2 horizontal edges, each divided into 8 pixel edges. The pixel edge is the basic unit of filtering.
1.2.1 Adaptivity of the filter at two levels
The deblocking filter in H.264 achieves a good filtering effect because it is adaptive at the following two levels.
1) Adaptivity of the filter at the 4 × 4 sub-block level
Filtering is based on the pixel edges of each sub-block. A parameter BS (boundary strength) is defined for each pixel edge, and the filtering strength and the pixels involved are adjusted adaptively according to it. The pixel-edge strength of a chroma block is the same as that of the corresponding luminance edge. Suppose P and Q are two adjacent 4 × 4 sub-blocks; the pixel-edge strength between them is obtained by the steps in Figure 2. The larger the value of BS, the stronger the filtering on both sides of the corresponding edge. BS is set according to the cause of the blocking artifact: sub-blocks coded in intra prediction mode show obvious blocking, so their edges are assigned a larger strength value for strong filtering.
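The boundary-strength decision just described can be sketched in C as follows. The rules follow the H.264 standard (intra at a macroblock edge gives BS = 4, other intra gives 3, residual coefficients give 2, differing motion gives 1, otherwise 0); the struct and its field names are our own illustration, not taken from the JM code.

```c
#include <stdlib.h>

typedef struct {
    int is_intra;    /* sub-block coded in intra mode            */
    int has_coeffs;  /* sub-block has non-zero residual coeffs   */
    int ref_idx;     /* reference frame index                    */
    int mv_x, mv_y;  /* motion vector in quarter-pel units       */
} SubBlock;

/* p and q are the 4x4 sub-blocks on either side of the pixel edge;
   mb_edge is non-zero when the edge is also a macroblock edge. */
int boundary_strength(const SubBlock *p, const SubBlock *q, int mb_edge)
{
    if (p->is_intra || q->is_intra)
        return mb_edge ? 4 : 3;        /* strongest filtering for intra   */
    if (p->has_coeffs || q->has_coeffs)
        return 2;                      /* residual present on either side */
    if (p->ref_idx != q->ref_idx ||
        abs(p->mv_x - q->mv_x) >= 4 || /* >= one full-pel MV difference   */
        abs(p->mv_y - q->mv_y) >= 4)
        return 1;
    return 0;                          /* no filtering for this edge      */
}
```

Evaluating these conditions in this order means the most severe cause of blocking (intra coding with coarse quantization) always wins.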
2) Adaptivity of the filter at the pixel level
A good filtering effect depends on correctly distinguishing the false edges caused by quantization error and motion compensation from the real boundaries in the image. In general, the pixel gradient across a real boundary is larger than that across a false one. The filter therefore sets a threshold α for the gradient of the gray values of the pixels on the two sides of the edge, and a threshold β for the gradient of the gray values of adjacent pixels on the same side, and uses them to decide whether a boundary is real or false. The values of α and β depend mainly on the quantization step size: when the quantization step grows, the quantization error is large, the blocking artifact is obvious and false boundaries arise easily, so the thresholds become larger and the filtering condition is relaxed; when the step becomes smaller, the thresholds shrink accordingly. The arrangement of the sample points is shown in Figure 3. Filtering is started only if all conditions are met.
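The per-edge decision reduces to three comparisons, as in the H.264 standard: the gradient across the edge must be below α and the gradients on each side below β. A minimal sketch (p1, p0 and q0, q1 are the two samples nearest the edge on each side):

```c
#include <stdlib.h>

/* Returns non-zero when the edge looks like a quantization artifact
   (small gradients) rather than a real image edge, i.e. when it
   should be filtered. alpha and beta are the QP-derived thresholds. */
int filter_samples_flag(int p1, int p0, int q0, int q1,
                        int alpha, int beta)
{
    return abs(p0 - q0) < alpha &&  /* gradient across the edge */
           abs(p1 - p0) < beta  &&  /* gradient on the p side   */
           abs(q1 - q0) < beta;     /* gradient on the q side   */
}
```

A genuine object boundary typically fails the first test, so the detail is preserved; a small quantization step produces small thresholds and the filter stays off.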
In addition to these two levels of adaptivity, the filtering strength can be adjusted by setting the coefficients LoopFilterAlphaC0Offset and LoopFilterBetaOffset at the slice level. For example, when the transmission bit rate is low, the blocking artifact is obvious and the receiving end wants an image of relatively good subjective quality, the encoder can set these two filter offsets in the header information to positive values, increasing α and β to strengthen the filtering and improve the subjective quality of the image by removing the blocking effect. Conversely, for high-resolution images the filtering can be weakened by transmitting negative offsets, preserving the details of the image as much as possible.
1.2.2 Filtering adjacent pixels according to the BS value of each pixel edge
If the current pixel edge meets the filtering conditions, a filter is selected according to its BS value, and an appropriate clipping operation is applied to prevent image blurring.
When BS is 1, 2 or 3, a 4-tap linear filter takes p1, p0, q0 and q1 as input to produce adjusted values of p0 and q0. If further conditions indicate a false boundary on either side, the values of p1 and q1 are adjusted as well.
When BS is 4, the edge is a macroblock edge coded in intra mode, and strong filtering is adopted to enhance image quality. For the luminance component, if the condition |p0 − q0| < (α >> 2) + 2 and |p2 − p0| < β holds, a 5-tap filter is selected to filter p0 and p2 and a strong 4-tap filter to filter p1; if the condition does not hold, only a weaker 3-tap filter is applied to p0, and the values of p1 and p2 remain unchanged. For the chrominance components, p0 is filtered with the 3-tap filter if the above condition is met; otherwise no pixel values are modified. The filtering of q0, q1 and q2 mirrors that of p0, p1 and p2.
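The two filter modes can be sketched directly from the H.264 standard equations. For BS = 1..3, a clipped correction delta is computed from the four samples around the edge; for BS = 4 the strong branch replaces p0 with a 5-tap weighted average. c0 here is the clipping bound derived from BS and the quantization parameter; this is a sketch of the standard equations, not the JM code.

```c
static int clip3(int lo, int hi, int v) { return v < lo ? lo : v > hi ? hi : v; }
static int clip255(int v) { return clip3(0, 255, v); }

/* BS = 1..3: 4-tap filter adjusting p0 and q0 by a clipped delta. */
void normal_filter(int *p0, int *q0, int p1, int q1, int c0)
{
    int delta  = clip3(-c0, c0, ((*q0 - *p0) * 4 + (p1 - q1) + 4) >> 3);
    int new_p0 = clip255(*p0 + delta);   /* move p0 toward q0 */
    int new_q0 = clip255(*q0 - delta);   /* move q0 toward p0 */
    *p0 = new_p0;
    *q0 = new_q0;
}

/* BS = 4, strong-filter branch: 5-tap weighted average replacing p0. */
int strong_filter_p0(int p2, int p1, int p0, int q0, int q1)
{
    return (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3;
}
```

The clipping by c0 is what prevents over-smoothing: even when the gradient test passes, the correction applied to each sample is bounded, so a real but soft edge is only slightly attenuated.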
2. Characteristics and structure of BF533
Our H.264 deblocking filter is implemented on the Blackfin ADSP-BF533 processor from Analog Devices (ADI). Blackfin DSPs have the following main characteristics:
a) Highly parallel computing units. The core of the Blackfin architecture is the DAU (data arithmetic unit), comprising two 16-bit MACs (multiply-accumulators), two 40-bit ALUs (arithmetic logic units), one 40-bit barrel shifter and four 8-bit video ALUs. In a single clock cycle the two MACs can perform 16-bit by 16-bit multiplications on four independent 16-bit operands. A 40-bit ALU can add two 40-bit numbers or four 16-bit numbers. This architecture can flexibly perform 8-bit, 16-bit and 32-bit data operations.
b) Dynamic power management. The processor can consume less power than other DSPs by changing the voltage and operating frequency. The Blackfin series DSP architecture allows the voltage and frequency to be adjusted independently, which minimizes the energy consumption of each task and has a good balance between performance and power consumption. It is suitable for the development of real-time video codec, especially the real-time motion video processing with strict requirements on power consumption.
c) High-performance address generators. The processor has two DAGs (data address generators) that generate addresses for the combined load/store operations supporting advanced DSP filtering. They support bit-reversed addressing, circular buffering and other addressing modes, improving programming flexibility.
d) Hierarchical memory. The hierarchical memory structure shortens the access time of the kernel to the memory to obtain the maximum data throughput, less latency and shortened processing idle time.
e) Unique video operation instructions. Provide operation instructions commonly used in video compression standards such as DCT (discrete cosine transform) and Huffman coding. These video instructions also eliminate the complex and mixed communication problem between the main processor and an independent video codec. These features help to shorten the time to market for end applications and reduce the overall cost of the system.
The ADSP-BF533 we use runs continuously at 600 MHz and has a 4 GB unified address space; 80 KB of L1 instruction SRAM, of which 16 KB can be configured as 4-way set-associative cache; two 32 KB banks of L1 data SRAM, half of which can be configured as cache; and a rich set of integrated peripherals and interfaces.
3. Optimized implementation of H.264 deblocking filter based on BF533
The optimization implementation of deblocking filter in Blackfin BF533 is mainly divided into three levels: system level optimization, algorithm level optimization and assembly level optimization.
3.1 System-level optimization
In the DSP development environment, turn on the compiler's optimization options with the optimization goal set to fastest speed, and enable the automatic inlining and interprocedural optimization switches. These settings let the Blackfin BF533 hardware perform to its full potential.
3.2 Algorithm-level optimization
The deblocking filter of the JM8.6 reference model was modified appropriately and ported to our existing H.264 baseline-profile decoder on the Blackfin BF533, and its time consumption was profiled on test image sequences. With the paris.cif, mobile.cif, foreman.cif and claire.cif sequences at a bit rate of about 400 kbit/s, deblocking filtering consumes roughly 1 600 M to 1 800 M clock cycles per second. Even after system-level optimization the computational load remains very high and the efficiency very low, a considerable burden for a processor whose sustained clock rate is 600 MHz.
By analyzing the deblocking filter program in JM8.6, the main reasons for its low efficiency are:
a) The logical relationships between functions in the algorithm are complex, with many conditional judgments, jumps and function calls;
b) The most time-consuming part, the function loops, contains a large amount of repeated computation, which sharply increases the computational load;
c) Much of the data used by the algorithm, such as motion vectors and image luminance and chrominance data, is stored in slow off-chip SDRAM, and the frequent accesses during filtering sharply increase data-transfer time.
For the reason of time-consuming, the algorithm is improved as follows:
3.2.1 Simplify the complex functions and loops in the original program
Instruction count and execution speed constrain each other. Code can be kept compact through conditional judgments, but the extra branching work slows the machine down; conversely, removing judgments and unrolling the program often reduces instruction cycles at the cost of longer code. Since the deblocking filter code in JM8.6 is short, we simplify the relationships between its functions and trade an increase in code length for faster execution.
For the most time-consuming loop bodies in the system, rewriting the loop structure appropriately and unrolling the loop body several times effectively reduces the computational load. In addition, reducing the number of function calls and rewriting if-else statements are also effective optimization methods.
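As an illustration of the unrolling described above, the per-sample loop over a 16-sample edge can be expanded by four so that the loop-control overhead is paid once per four samples instead of once per sample. The increment here is only a placeholder for the actual pixel-edge filtering operation:

```c
/* Unrolled processing of n samples along an edge. The "+= 1" body is
   a stand-in for the real filtering of one pixel edge; the point is
   the 4x expansion of the loop body and the remainder loop. */
void filter_edge_unrolled(int *pix, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {  /* body expanded four times */
        pix[i]     += 1;
        pix[i + 1] += 1;
        pix[i + 2] += 1;
        pix[i + 3] += 1;
    }
    for (; i < n; i++)                /* handle any remainder */
        pix[i] += 1;
}
```

On a pipelined DSP the gain comes less from the saved increment/compare instructions than from the removed branches, which would otherwise interrupt the pipeline on every iteration.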
3.2.2 Remove redundant code and repeated calculations from the reference code
a) The reference code, the deblocking filter module of JM8.6, can filter H.264 streams of all profiles and levels, while our decoder is baseline-profile only and involves just the filtering of I frames and P frames. The parts of the reference code that handle B frames, SP/SI frames, field mode and macroblock-adaptive frame/field mode can therefore be removed.
b) When obtaining the filter strength BS and during luminance/chrominance filtering, the program needs the availability information of the macroblocks neighbouring the macroblock that contains the current sub-block (whether a macroblock can be used is determined by calling the getneighbor function). Since filtering proceeds edge by edge within a macroblock, vertical edges first and then horizontal, this information is the same for every pixel edge of a given edge; it therefore needs to be obtained only once per edge rather than re-evaluated inside the loop. Moreover, the filtering algorithm only needs the availability of the macroblocks above and to the left of the current macroblock, so the redundant retrieval of the upper-left and upper-right neighbours can be removed. In addition, when the function that computes the filtering strength in the horizontal direction calls getneighbor, its parameters are fixed: luma is always 1, xN takes only the values −1, 3, 7 and 11, and yN lies in [0, 15]. Under these conditions many of the if-else statements inside getneighbor are dead branches, and these redundant judgments waste many clock cycles. Finally, analysing the probability of each branch and placing the most likely branch first also speeds up execution.
The following is the simplified getneighbor function code. There are only a few statements, which greatly reduces the amount of computation.
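(The listing itself is not preserved in this copy of the text; the sketch below illustrates, under our own naming, what such a simplified lookup could look like once the fixed parameters are folded away. Since luma is fixed and only xN = −1 can leave the current macroblock in the horizontal direction, availability reduces to "is there a macroblock to the left".)

```c
/* Hypothetical simplified neighbour lookup. mb_addr is the current
   macroblock's address, mb_x its column within the macroblock row.
   For xN >= 0 the sample is inside the current macroblock; for
   xN == -1 it lies in the left neighbour, which exists iff mb_x > 0. */
typedef struct { int available; int mb_addr; } Neighbour;

void get_neighbour_left(int mb_addr, int mb_x, int xN, Neighbour *out)
{
    if (xN >= 0) {                    /* still inside the current MB  */
        out->available = 1;
        out->mb_addr   = mb_addr;
    } else {                          /* xN == -1: macroblock to left */
        out->available = (mb_x > 0);
        out->mb_addr   = mb_addr - 1;
    }
}
```

Compared with a general-purpose lookup that tests all eight neighbour positions, every remaining branch here does useful work, which is the point of the simplification.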
c) In the JM8.6 reference code, the BS values of the 64 pixel edges of a luminance macroblock (4 edges of 16 pixel edges each, in each direction) are obtained one by one. Analysis of the BS decision shows that the four pixel edges lying on the same vertical or horizontal edge between two sub-blocks share the same BS value. For each edge it is therefore enough to obtain the BS values of the 1st, 5th, 9th and 13th pixel edges and assign them to the corresponding other pixel edges. Since the BS computation sits inside a loop and involves many judgments and operations, this improvement greatly reduces the computational load.
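The sharing of BS values can be sketched as follows: the strength is evaluated once per 4-sample edge segment (the 1st, 5th, 9th and 13th pixel edges) and copied to the other three, quartering the number of strength decisions. The array layout is our own illustration:

```c
/* seg_bs[4] holds the BS value computed once per 4-pixel-edge segment;
   bs[16] receives the value replicated across each whole segment,
   replacing 16 full BS evaluations with 4 evaluations plus copies. */
void fill_edge_strength(const int seg_bs[4], int bs[16])
{
    for (int seg = 0; seg < 4; seg++)
        for (int k = 0; k < 4; k++)
            bs[4 * seg + k] = seg_bs[seg];
}
```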
d) The loops in the reference code contain many statements that do not depend on the loop variables; moving these outside the loops avoids redundant computation.
3.2.3 Use BPP block-processing technology to solve the problem of frequent off-chip data access
To address the problem that frequent off-chip data access slows the program, BPP block-processing technology is used for optimization. Three buffers are opened in on-chip L1 memory to hold the luminance, chroma U and chroma V data to be filtered. According to the pixel range that filtering a macroblock may involve, the 396 macroblocks of a CIF frame are divided into four classes. Class A is the first macroblock: its upper and left edges are image edges, and before filtering 16 × 16 luminance samples and two 8 × 8 chroma blocks are read. Class B comprises the remaining macroblocks of the first macroblock row: the upper edge is an image edge, and 16 × 20 luminance samples and two 8 × 12 chroma blocks are read. Class C comprises the remaining macroblocks of the first macroblock column: the left edge is an image edge, and 20 × 16 luminance samples and two 12 × 8 chroma blocks are read. Class D is all other macroblocks, whose upper and left edges lie inside the image: 20 × 20 luminance samples and two 12 × 12 chroma blocks are read.
When filtering, the luminance and chrominance data are read from the off-chip buffer in the amount dictated by the macroblock class, filtered, and the resulting data written back to off-chip storage. On the one hand this reduces the time spent on frequent off-chip accesses, improving running speed; on the other hand, classifying the macroblocks reduces the pipeline interruptions caused by the judgments in the reference code, improving program speed further.
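The four-way classification above depends only on whether the macroblock touches the top or left image edge, so it can be computed directly from the macroblock coordinates. The dimensions follow the text (given as rows × columns); struct and names are our own sketch:

```c
/* Read sizes (height x width) per macroblock class for CIF filtering:
   class A (top-left): luma 16x16, chroma 8x8 per component;
   class B (first row): luma 16x20, chroma 8x12;
   class C (first col): luma 20x16, chroma 12x8;
   class D (interior):  luma 20x20, chroma 12x12. */
typedef struct { int luma_w, luma_h, chroma_w, chroma_h; } ReadSize;

ReadSize classify_mb(int mb_x, int mb_y)
{
    ReadSize r;
    int top  = (mb_y == 0);       /* upper edge is an image edge */
    int left = (mb_x == 0);       /* left edge is an image edge  */
    r.luma_w   = left ? 16 : 20;  /* 4 extra columns from the left MB */
    r.luma_h   = top  ? 16 : 20;  /* 4 extra rows from the MB above   */
    r.chroma_w = left ?  8 : 12;  /* per chroma component (U and V)   */
    r.chroma_h = top  ?  8 : 12;
    return r;
}
```

The extra 4-sample border in each interior direction is exactly the p3..p0 column or row of the neighbouring macroblock that the edge filter may read and modify.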
3.3 Assembly-level optimization
The Blackfin BF533 core supports C and C++, but the compiler's automatic translation of C into assembly is not always efficient. Modules that are called frequently and consume much time can therefore be rewritten by hand in efficient assembly language to improve running speed. Program speed is improved mainly in the following ways:
a) Replace local variables with register variables. In C, local variables are often used in subroutines and functions to hold data temporarily. At run time the compiler allocates memory for every declared local variable, and accessing memory is very slow compared with accessing registers. Using the system's data and pointer registers in place of local variables that serve only as temporary storage therefore saves much of the delay caused by memory access. However, the number of registers is very limited, so they must be used sparingly and efficiently.
b) Replace software loops with hardware loops. A software loop sets a judgment condition at the beginning or end of a for or while loop to control its start, continuation and end. The conditional branch of a software loop is resolved dynamically, and once a jump occurs it stalls the pipeline; keeping the pipeline full is the key to efficient operation. The Blackfin processor has dedicated hardware support for two levels of nested zero-overhead hardware loops. No conditional branch is needed: the DSP hardware executes and terminates the loop automatically according to the preset iteration count, keeping the pipeline flowing smoothly and improving speed.
c) Make full use of the data bus width. The Blackfin BF533 external data bus is 32 bits wide and can access 4 bytes at a time. Making full use of this access width, especially when operating on large amounts of data, reduces the number of instruction cycles and improves execution speed.
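In C this amounts to moving pixel data one 32-bit word at a time instead of byte by byte, quartering the number of load/store operations. A minimal sketch, assuming (as the block-copy code arranges) that both buffers are 4-byte aligned and the length is a multiple of 4:

```c
#include <stdint.h>

/* Copy n pixel bytes using 32-bit accesses: one bus transaction
   moves 4 pixels. src and dst must be 4-byte aligned and n a
   multiple of 4 for this simple version to be valid. */
void copy_pixels_wide(uint8_t *dst, const uint8_t *src, int n)
{
    uint32_t *d = (uint32_t *)dst;
    const uint32_t *s = (const uint32_t *)src;
    for (int i = 0; i < n / 4; i++)
        d[i] = s[i];
}
```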
d) Efficient use of parallel and vector instructions. Parallel and vector instructions are a major feature of Blackfin DSPs. Parallel instructions exploit the SIMD structure of the Blackfin processor and the parallelism of its hardware resources, reducing instruction count and improving execution efficiency; with reasonable arrangement of the program, one parallel instruction can often replace two or three non-parallel ones. Vector instructions make full use of the instruction width by applying the same operation to multiple data streams simultaneously: two 16-bit arithmetic or shift operations can be realized by one 32-bit vector instruction, doing in one clock cycle what previously took two. For example, R3 = ABS R1 (V) takes the absolute value of two 16-bit values in a single instruction cycle.
e) Reasonably allocate data storage space. On-chip DSP memory is fast but small, while off-chip memory is large but slow, so the placement of data is key to program speed. Frequently used data should be placed on-chip as far as possible and rarely used data off-chip. When off-chip data must be accessed, arrange it to be as contiguous as possible and read large blocks into on-chip memory at once, avoiding the time wasted by frequent small off-chip reads.
4. Results of optimized implementation
The optimization effect is tested by adding the deblocking filter C module from the JM8.6 reference code to the original decoder and comparing its cycle count with that of the assembly module optimized at the three levels of system, algorithm and assembly. The test image sequences are claire.cif, paris.cif and mobile.cif; the test data are given in Table 1.
As can be seen from Table 1, the optimized deblocking filter assembly module is about 7 times more efficient than the pre-optimization C code from JM8.6.
5. Conclusion
In this paper, the deblocking filtering function of H.264 is realized through optimization at three levels: system, algorithm and assembly. In particular, improving the filtering algorithm, classifying the macroblocks to be filtered, and making full use of assembly-level techniques such as parallel and vector instructions achieve good optimization results. Running on the original H.264 decoder, the optimized deblocking filter module filters a 25-frame image sequence of about 400 kbit/s in roughly 250 M clock cycles, with the whole decoder taking about 700 M clock cycles, so the decoding speed reaches about 20 frames/s, basically meeting the requirement of quasi-real-time decoding.
Compared with the reference module, this implementation is well optimized, but profiling shows there is still room for improvement in reading the data to be filtered and writing back the filtered data, in the getbs function that obtains the BS values, and in the edgeloop function that performs the filtering. DMA can be used for moving data between off-chip and on-chip memory so that data transfer overlaps with filtering, hiding the clock cycles consumed by data movement; the assembly implementations of getbs and edgeloop can also be made more efficient. These two aspects are the directions of our next improvements.
Responsible editor: GT