This paper presents an optimized design method for a real-time image tracking system based on a high-performance TigerSHARC DSP processing module and a template matching algorithm. It analyzes the address alignment problem in the SAD operation in depth and proposes an optimized design scheme that improves processing efficiency by a factor of 20 and has been applied in a real-time image tracking system.
With the continuing progress of electronic technology, more and more practical digital image tracking systems use high-performance DSP COTS modules to build real-time processing prototypes quickly. Among the many types of DSP chips, the TigerSHARC series offers high fixed-point and floating-point processing capacity and has been described as the fastest floating-point DSP processor in the world. Because of their strong scalability, TigerSHARC DSP processors have led the development of scalable, parallel multi-DSP real-time processing systems and have been called “the standard of multi-DSP system implementation”. TigerSHARC DSP processors also provide good support for byte data processing, such as the byte-based SAD (sum of absolute differences) operation, which makes them well suited to building real-time image processing systems. However, the address alignment problem is severe, so the design must be optimized in order to limit the size and cost of the system while still meeting strict real-time requirements.
2. Practical real-time image tracking and processing system
2.1 Composition of real-time image tracking and processing system
Figure 1 shows the block diagram of the practical real-time image processing system. The system comprises a PMC video acquisition module, a TigerSHARC DSP processing module (integrating four TigerSHARC DSP processors), a CPCI industrial computer motherboard, a control handle, and a VGA display. The input to the video acquisition board is a PAL black-and-white image signal, and its output is 768 × 288 × 8-bit digital image data, which is sent through the TigerSHARC link interface to the DSP processing module for template matching detection at a rate of one field every 20 milliseconds. The DSP processing module then superimposes the target information on the original image and passes it through the CPCI industrial computer motherboard to the VGA display.
National 863 Program: real-time data processing technology based on aerospace platforms (2006AA701415)
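The input load implied by these figures can be checked with a little arithmetic. The sketch below only uses the field size and field period stated above; the constant and function names are illustrative, and the 50 fields per second is simply the reciprocal of the 20 ms field period.

```python
# Field geometry and timing taken from the text; all names are illustrative.
WIDTH, HEIGHT = 768, 288      # pixels per field
BYTES_PER_PIXEL = 1           # 8-bit grayscale data
FIELDS_PER_SECOND = 50        # one field every 20 ms


def field_bytes(width=WIDTH, height=HEIGHT, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes delivered to the DSP module for one field."""
    return width * height * bytes_per_pixel


def input_rate_bytes_per_second(fields_per_second=FIELDS_PER_SECOND):
    """Sustained data rate the link interface has to carry."""
    return field_bytes() * fields_per_second


print(field_bytes())                   # 221184
print(input_rate_bytes_per_second())   # 11059200
```

Each field therefore carries about 221 KB, and all processing for it has to fit inside the same 20 ms window before the next field arrives.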
2.2 Template matching algorithm
Template matching is the basic algorithm for detecting a target in an image scene. A known target image template is slid across an unknown image scene and compared with the scene image block beneath it at each position. If the two are close enough, the image block is identified as the target. Closeness is generally measured by the sum of the absolute values of the pixel-wise differences between the target template and the scene image block. This operation between pixels is defined as SAD (sum of absolute differences), given in equation 1, and the whole process is shown in Figure 2. A target template T, an M × N pixel array, slides horizontally and vertically across the unknown image scene. At each sliding step, T undergoes the SAD operation of equation 1 with the corresponding pixel matrix S to be detected in the scene, where S is a pixel matrix of the same size as T, and the result D(i, j) is stored in the measurement matrix D as the difference measure.
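The sliding process can be sketched in a few lines of plain Python. This is a minimal illustration, not the DSP implementation: it assumes equation 1 has the standard form D(i, j) = Σ |T(m, n) − S(i+m, j+n)|, represents images as lists of rows, and uses the hypothetical names `sad_match` and `best_match`.

```python
def sad_match(scene, template):
    """Slide the template over the scene and return the measurement matrix D.

    D[i][j] is the sum of absolute differences between the template and the
    scene block whose top-left corner is at row i, column j.
    """
    scene_h, scene_w = len(scene), len(scene[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    rows, cols = scene_h - tmpl_h + 1, scene_w - tmpl_w + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            for m in range(tmpl_h):
                for n in range(tmpl_w):
                    total += abs(template[m][n] - scene[i + m][j + n])
            D[i][j] = total
    return D


def best_match(D):
    """Return the (row, column) of the smallest value in D."""
    return min(
        ((i, j) for i in range(len(D)) for j in range(len(D[0]))),
        key=lambda pos: D[pos[0]][pos[1]],
    )


scene = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
template = [[6, 7],
            [10, 11]]
D = sad_match(scene, template)
print(best_match(D))   # (1, 1)
```

In a real tracker the minimum of D would also be compared against a threshold before the block is accepted as the target; the threshold value is application specific and is not given in the text.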
2.3 Address alignment in SAD template matching
The ALU of the TigerSHARC processor provides instruction support for the SAD operation, as shown in Figure 3a. Within one clock cycle (3.3 ns at 300 MHz on the ADSP-TS101S), a single ALU can complete the SAD operation for 8 pixels of template T with the instruction PR += ABS(BRmd - BRnd). PR is the 64-bit accumulator in the ALU, which holds the accumulated result of the byte-wise SAD operation. BRmd and BRnd are each 32 bit × 2 register pairs that hold the 8 corresponding pixel bytes from T and S, respectively. Because the TigerSHARC DSP processor has a dual compute block structure with two ALUs, it can complete the SAD operation for up to 16 pixels of template T in a single cycle. To reach this peak processing capacity, the processor must execute SAD instructions continuously in its instruction pipeline, and the address alignment problem that arises when the ALUs fetch operands from memory is the main bottleneck limiting processing efficiency.
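The peak figures quoted above follow directly from the clock rate and the instruction width. The sketch below works them out; the clock, the 8 pixels per instruction, and the two ALUs come from the text, while the template size and position count in the usage lines are hypothetical values chosen only to show the calculation.

```python
CLOCK_HZ = 300_000_000    # ADSP-TS101S clock rate from the text
PIXELS_PER_ALU = 8        # pixel bytes handled by one SAD instruction
NUM_ALUS = 2              # one ALU in each of the two compute blocks


def cycle_time_ns(clock_hz=CLOCK_HZ):
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def peak_pixel_sads_per_second():
    """Upper bound on pixel SAD operations per second for one processor."""
    return CLOCK_HZ * PIXELS_PER_ALU * NUM_ALUS


def min_cycles(template_pixels, positions):
    """Lower bound on cycles if every cycle issued two full SAD instructions."""
    per_cycle = PIXELS_PER_ALU * NUM_ALUS
    total = template_pixels * positions
    return -(-total // per_cycle)   # ceiling division


print(round(cycle_time_ns(), 2))        # 3.33
print(peak_pixel_sads_per_second())     # 4800000000

# Hypothetical workload: a 32 x 32 template tested at 1000 positions.
print(min_cycles(32 * 32, 1000))        # 64000
```

This bound ignores operand loads entirely, which is exactly why the alignment of those loads decides how close a real implementation gets to it.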
As shown in Figure 3b, before the SAD instruction executes, the two ALUs must each read 8 bytes of target template data and 8 bytes of scene image data from memory into internal registers, and the starting address of the data must be 64-bit aligned; otherwise the internal bus access does not work correctly. Although T and S are pixel matrices of the same size, that size is generally much larger than 16 bytes. The data addresses in T are contiguous, whereas S is part of the scene image and its addresses are usually not contiguous. Moreover, as T slides line by line across the scene image, the efficiency loss caused by the alignment problem becomes even more pronounced. A direct implementation inevitably has to compute unaligned data addresses, shift and merge bytes, and manage very complex nested loop control.
Processing efficiency was evaluated on a single TigerSHARC DSP processor: a compiler-optimized ANSI C program completes one background image matching task in about 100 ms. On this basis, the background data would have to be divided into at least five parts and distributed across five TigerSHARC DSP processors to reach the real-time limit of 20 ms. Since one processing module integrates four processors, a second TigerSHARC DSP processing module would have to be added to the system, which is unacceptable in terms of power consumption, cost, weight, volume, and other indicators. It is therefore not feasible to implement the calculation of equation 1 and Figure 3 directly, and an optimized calculation scheme must be considered.
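How often a sliding block starts on an unaligned address can be illustrated with simple address arithmetic. The sketch assumes a row-major 8-bit scene image whose base address is 64-bit aligned and whose row stride is the 768-byte line width; those layout assumptions and the function names are illustrative and are not stated in the text.

```python
ALIGN_BYTES = 8   # 64-bit alignment required for the operand loads


def is_aligned(address, boundary=ALIGN_BYTES):
    """True if the byte address falls on the required boundary."""
    return address % boundary == 0


def block_start_address(base, stride, row, col):
    """Byte address of scene pixel (row, col) in a row-major 8-bit image."""
    return base + row * stride + col


def aligned_offsets(base, stride, row, num_cols):
    """Column offsets in one scene row where the first load is aligned."""
    return [
        col for col in range(num_cols)
        if is_aligned(block_start_address(base, stride, row, col))
    ]


# Assumed layout: aligned base address, 768-byte rows.
offsets = aligned_offsets(base=0, stride=768, row=3, num_cols=32)
print(offsets)              # [0, 8, 16, 24]
print(len(offsets) / 32)    # 0.125
```

Under these assumptions only one sliding position in eight begins on an aligned address, so a direct implementation spends most of its time on the address calculation and byte merging described above.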
3. Optimization design of SAD
Consider again the SAD process described in Figure 2. Every byte pixel in T slides the same number of times across the background image. Because the result of each sliding step is stored in the corresponding position of D, the number of sliding steps equals the number of elements in D, in one-to-one correspondence. In other words, each element of D involves one sliding step of every byte pixel in T. By the associative law of addition, the optimized SAD process can be obtained from equation 1 as follows.
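The optimized formula itself is not reproduced in this text. As a hedged reconstruction, assume that equation 1 has the standard SAD form, that S(i+m, j+n) denotes the scene pixel under template element (m, n) at sliding offset (i, j), and that the index set of T is partitioned into K disjoint 16-byte segments. The symbols K, Ω_k, and D_k are notation introduced here and do not come from the original. Regrouping the terms of the sum by segment then gives:

```latex
\begin{align}
D(i,j) &= \sum_{m=0}^{M-1}\sum_{n=0}^{N-1}
          \bigl| T(m,n) - S(i+m,\, j+n) \bigr| \\
       &= \sum_{k=0}^{K-1} D_k(i,j),
\qquad
D_k(i,j) = \sum_{(m,n)\in\Omega_k}
          \bigl| T(m,n) - S(i+m,\, j+n) \bigr|
\end{align}
```

Each partial measure D_k can be computed for every offset (i, j) before the next segment is touched, which changes the order of the operations but not their number or the final value of D(i, j).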
Like ants moving house, the scheme works piece by piece: the first 16 bytes of pixel data in T are slid across the background image, the SAD operation is performed at each position, and the result of each sliding step is stored in the corresponding position of D. At this point D holds intermediate partial sums. The next 16 bytes of pixel data in T then complete the same sliding SAD operation over the background image, and each result is added to the intermediate result previously saved in D and written back to the same position. This continues until all the byte pixel data in T have completed their sliding and the accumulated results in D have been updated. The optimized implementation scheme derived from the formula above is shown in Figure 6.
Compared with the direct implementation, the optimized scheme does not reduce the number of SAD operations. However, by restructuring the algorithm and reordering the operations, it greatly improves memory access efficiency, simplifies the implementation, and keeps the processor instruction pipeline well organized. For the same processing task evaluated above, the optimized implementation completes in 5 ms on one TigerSHARC DSP processor. Because the data in T can be partitioned cleanly in the optimized scheme, the task completes in 2.6 ms on two TigerSHARC DSP processors. The optimized scheme therefore not only improves the real-time performance of the system but also eliminates one TigerSHARC DSP processing module, greatly reducing the complexity, volume, power consumption, and cost of the system. It also leaves about 15 ms for compression, transmission, and other processing tasks.
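The segment-by-segment accumulation can be sketched in plain Python to show that it produces the same measurement matrix as the direct method. This is an illustration of the operation order only: it does not model DSP registers, alignment, or pipelining, it assumes segments are cut within each template row, and the name `sad_match_segmented` is hypothetical.

```python
SEGMENT = 16   # template bytes processed per pass, as in the text


def sad_match_segmented(scene, template, segment=SEGMENT):
    """Accumulate D one template segment at a time.

    The outer loops pick a segment of the template; the inner loops slide
    that single segment over every position and add its partial SAD to D.
    """
    scene_h, scene_w = len(scene), len(scene[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    rows, cols = scene_h - tmpl_h + 1, scene_w - tmpl_w + 1
    D = [[0] * cols for _ in range(rows)]
    for m in range(tmpl_h):
        for n0 in range(0, tmpl_w, segment):
            seg = template[m][n0:n0 + segment]   # stays fixed while sliding
            for i in range(rows):
                scene_row = scene[i + m]
                for j in range(cols):
                    start = j + n0
                    partial = 0
                    for k, t in enumerate(seg):
                        partial += abs(t - scene_row[start + k])
                    D[i][j] += partial           # update intermediate result
    return D


# Deterministic test scene; the template is cut out at row 2, column 5.
scene = [[(7 * r + 13 * c) % 251 for c in range(40)] for r in range(6)]
template = [row[5:25] for row in scene[2:5]]
D = sad_match_segmented(scene, template)
print(D[2][5])   # 0
```

Because the template segment is held fixed in the inner loops, only the scene data moves, which is the property the optimized scheme relies on to simplify memory access and loop control.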
This paper has described a practical optimization design method for a real-time digital image tracking system, pointed out the importance of address alignment in optimizing the processing system, and focused on the innovative optimized design and implementation of the template matching algorithm, which meets the whole system's requirements for real-time performance, power consumption, volume, and cost. The system has performed well in an actual project; the actual system composition is shown in the figure below.