1 Hardware design and instruction set of TMS320C6000

The TMS320C6000 series of DSPs (digital signal processors) is the latest family of parallel digital signal processors introduced by TI. It is based on TI's VLIW technology; the TMS320C62xx devices are fixed-point processors and the TMS320C67xx devices are floating-point processors. This article mainly discusses the TMS320C6201. The processor accepts an input clock of up to 50 MHz, which the internal ×4 frequency multiplier raises to 200 MHz. It can execute up to 8 instructions in parallel per clock cycle, giving a peak fixed-point throughput of 1600 MIPS, and a 1024-point fixed-point FFT takes only 70 μs.
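The figures quoted above follow from simple arithmetic, which the short C sketch below recomputes. The function names are illustrative only and do not come from any TI library, and the FFT cycle count is merely what 70 μs implies at 200 MHz, not a separately measured value.

```c
/* Hypothetical helpers; the names are for illustration only. */

/* Core clock in MHz after the on-chip multiplier is applied to the input clock. */
unsigned core_clock_mhz(unsigned input_mhz, unsigned multiplier)
{
    return input_mhz * multiplier;
}

/* Peak throughput in MIPS: instructions issued per cycle times the clock in MHz. */
unsigned peak_mips(unsigned clock_mhz, unsigned instr_per_cycle)
{
    return clock_mhz * instr_per_cycle;
}

/* Number of clock cycles that fit in a time span: microseconds times MHz. */
unsigned cycles_in_us(unsigned time_us, unsigned clock_mhz)
{
    return time_us * clock_mhz;
}
```

With the values from the text, core_clock_mhz(50, 4) gives 200, peak_mips(200, 8) gives 1600, and cycles_in_us(70, 200) gives 14000 cycles for the 1024-point FFT.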

1.1 The hardware structure of TMS320C6000

The CPU of the TMS320C6000 has two data paths, A and B. Each path has sixteen 32-bit registers (A0~A15 and B0~B15) and four functional units (L, S, M, D), and each functional unit is responsible for certain arithmetic or logic operations. The register files of paths A and B are not fully shared: a functional unit can obtain a 32-bit operand from the register file of the other path only through the two cross paths, 1X and 2X, provided by the TMS320C6000.

The TMS320C6000 has a 32-bit address bus, giving a 4 GB memory address space. The C6201 integrates 1 Mbit of SRAM: 512 Kbit of program memory (all of which can be configured as cache if needed) and 512 Kbit of data memory. Through the on-chip program memory controller, the CPU can fetch 256 bits at a time, that is, up to eight 32-bit instructions per fetch.

The C6201 has a 32-bit external memory interface (EMIF), which gives the CPU a seamless interface to external devices. These can be synchronous dynamic memory (SDRAM), synchronous burst static memory (SBSRAM), static memory (SRAM), read-only memory (ROM), or FIFOs.

To facilitate multi-channel digital signal processing, the TMS320C6000 is equipped with a multi-channel buffered serial port (McBSP). The McBSP is very capable: in addition to the functions of a general DSP serial port, it supports standards such as T1/E1, ST-BUS, IOM2, SPI, and IIS. It supports up to 128 channels, transfers data in several word lengths (8/12/16/20/24/32 bits), and can perform μ-law and A-law companding automatically. Its operating rate can reach half the clock rate.

The 16-bit host port interface (HPI) of the TMS320C6000 lets a host device access the memory space of the DSP directly. The host and the DSP can exchange information through internal or external memory, and the host can also use the HPI to access peripherals mapped into the memory space.

DSP devices generally have a DMA controller, which transfers data in the background while the CPU runs. The DMA controller of the TMS320C6201 has four independent programmable channels, so four different DMA operations can be in progress at the same time, and the priority of each channel is programmable. Each channel can transfer 8-, 16-, or 32-bit data as needed, and the DMA controller can access the full 32-bit address space. In addition, an auxiliary channel allows the DMA controller to service requests from the host through the HPI.

1.2 Instruction set

The C62xx and C67xx share the same base instruction set. The C67xx can execute all C62xx instructions, but because it is a floating-point chip, its instruction set also contains some instructions that are used only for floating-point operations. The TMS320C6201 CPU adopts a RISC-like design: the instruction set is simple and execution is fast. The eight functional units handle different kinds of operations, and each instruction maps to particular functional units. The L unit has 23 instructions, the M unit has 20, the S unit has 29, and the D unit has 26.

Most TMS320C6201 instructions complete in a single cycle and can operate directly on 8-, 16-, and 32-bit data. The instruction set also provides features aimed at digital signal processing algorithms: 40-bit operations for complex calculations, efficient overflow handling and normalization, and concise bit-manipulation functions. Up to 8 instructions can execute in parallel, and all instructions can be executed conditionally. Together these features improve instruction execution efficiency, reduce code size, greatly reduce the overhead caused by jumps, and improve coding efficiency.

Pipelining is one of the key techniques that let a DSP achieve high speed and efficiency. The TMS320C6000 reaches 1600 MIPS only when the pipeline is kept full. The C6000 pipeline is divided into three phases (fetch, decode, and execute) comprising 11 stages in total. Compared with the earlier C3x and C54x, it has significant advantages: pipeline control is simplified, which eliminates pipeline interlocks, and the pipeline is deeper, which removes the bottlenecks that traditional pipeline structures have in instruction fetch, data access, and multiplication. Instruction fetch and data access are each split into multiple stages, so the C6000 can access memory at high speed.

2 Several methods for optimizing programming

When programming the TMS320C6000, the first impression is that the assembly instruction set is very small. The C6000 adopts a RISC-style design, so execution is very fast but the instruction set is very simple. Operations that are common in DSP algorithms, such as multiply-accumulate and looping, take one or two instructions on the C54x and C3x but require an explicit loop body on the C6000, so programming it is generally more complicated. To exploit the full computing power of the C6000, we must start from its hardware structure: make maximum use of the eight functional units, use software pipelining, and arrange for as much of the program as possible to run in parallel without conflicts.

The advantage of parallel processing is that operations with no dependence on one another can be completed in parallel whenever CPU resources allow. Where each operation depends on the previous one, or where there are frequent tests and jumps, parallelism offers little benefit. Loop bodies generally satisfy the conditions for parallel processing, and they are usually the most time-consuming part of a program, so optimization effort on a C6000 application should concentrate on loop bodies. To reduce development difficulty, the C6000 tools provide many ways to optimize a program at the level of a high-level language such as ANSI C, and this approach should be used wherever the application still meets its real-time requirements. However, its efficiency is relatively low. The best case for C-level optimization is the dot product: the optimized C loop can use 100% of the CPU resources, with the best possible parallelism. Even so, when we ran a 20-point dot product, it took 3 times as long as the assembly language version. Therefore, if the real-time requirements of the system are demanding, C-level optimization alone is not sufficient.
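For reference, a C dot product written so that a software-pipelining compiler has an easy job might look like the sketch below. The function name, the two accumulators, and the requirement that the length be even and positive are illustrative choices, not the code that the timings above were measured on.

```c
#include <stdint.h>

/* Dot product of two 16-bit vectors with 32-bit accumulation.
   Assumes n is even and positive (an illustrative restriction). */
int32_t dot_product(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum0 = 0;   /* accumulates the even-indexed products */
    int32_t sum1 = 0;   /* accumulates the odd-indexed products  */

    /* Two independent multiply-accumulate chains per iteration give the
       scheduler work for both data paths; the counter runs down to zero. */
    for (int i = n; i > 0; i -= 2) {
        sum0 += (int32_t)a[0] * b[0];
        sum1 += (int32_t)a[1] * b[1];
        a += 2;
        b += 2;
    }
    return sum0 + sum1;
}
```

Keeping the two partial sums independent removes the dependence between consecutive iterations, which is the property the paragraph above identifies as the condition for parallel execution.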

In that case, linear assembly language can be considered. Linear assembly is a programming language unique to the TMS320C6000 that sits between high-level and low-level languages. It exists because, when writing assembly by hand, developers must not only master the C6000 instruction set but also assign a functional unit to each instruction, account for instruction latency and the cooperation between functional units, and allocate the 32 registers sensibly in order to write efficient parallel code and exploit the power of the C6000. A problem in any of these areas seriously affects the efficiency of the algorithm.

Linear assembly uses exactly the same instructions as assembly language, plus its own set of assembly optimizer directives. Unlike in assembly language, the programmer does not need to consider instruction latency, register usage, or functional unit allocation, so code can be written much as in a high-level language. Of course, since it is not a high-level language, it has many programming restrictions. For example, a loop body being optimized must not contain jumps to targets outside the loop; in addition, the loop counter must count down, and if it counts up the optimizer will not work. In general, though, its code is much more efficient than that of a high-level language, and development is much easier and faster than in assembly language.

In actual development, the specific situation must be analyzed to choose an efficient and fast development method. The optimization methods we used are described briefly below, using several modules from application development as examples.

2.1 Use assembly language for optimization

Parallel programming in assembly language is difficult. In some cases, however, the data in the program have very strong sequential dependences, the logic of the code is clear, and no more than 32 registers are needed. It is then more efficient to implement the code directly in assembly language. In addition, some operations that are awkward to express in C have dedicated DSP instructions in the C6000 instruction set and can be implemented directly in assembly language.

When programming in assembly language, special attention must be paid to instruction delays on the C6000: some instructions do not deliver their results immediately. The instructions in the C6000 instruction set that have delays are shown in Table 1.

Example 1 32-bit normalization function norm_1()

short norm_1(long L_var1)
{
    short var_out;

    if (L_var1 == 0L) {
        var_out = (short)0;
    }
    else {
        if (L_var1 == (long)0xffffffffL) {
            var_out = (short)31;
        }
        else {
            /* fold negative inputs onto non-negative ones */
            if (L_var1 < 0L) {
                L_var1 = ~L_var1;
            }
            /* count the left shifts needed to set bit 30 */
            for (var_out = (short)0; L_var1 < (long)0x40000000L; var_out++) {
                L_var1 <<= 1L;
            }
        }
    }
    return (var_out);
}

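Optimized in assembly language:

        .global _norm_1

_norm_1:
        B       B3              ; return to the caller (the branch has 5 delay slots)
        CMPEQ   0,A4,B0         ; B0 = 1 if the argument is 0
 [!B0]  NORM    A4,A4           ; otherwise A4 = number of redundant sign bits
        NOP     3               ; fill the remaining delay slots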

Elapsed time (clock cycles): 723 for C language norm_1(); 11 for assembly language.
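When checking an assembly version such as the one above against the C routine, a host-side reference model is convenient. The sketch below restates the algorithm of norm_1() using int32_t, because the width of long differs between compilers; the name norm32_ref is hypothetical and the function is only a model, not part of the original program.

```c
#include <stdint.h>

/* Portable model of the normalization count: the number of left shifts
   needed to bring a non-zero 32-bit value to the point where bit 30
   differs from the sign bit. Returns 0 for an input of 0. */
int norm32_ref(int32_t x)
{
    if (x == 0)
        return 0;
    if (x == -1)
        return 31;              /* all bits are sign bits */
    if (x < 0)
        x = ~x;                 /* fold negative inputs onto non-negative ones */

    int shifts = 0;
    while (x < 0x40000000) {    /* bit 30 not yet set */
        x <<= 1;
        shifts++;
    }
    return shifts;
}
```

For example, norm32_ref(1) is 30 and norm32_ref(0x40000000) is 0, which are convenient boundary cases to compare against the output of the target routine.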

2.2 Rewrite the entire function in linear assembly language

For functions whose body is mainly a loop, the whole function can be rewritten in linear assembly language. After optimization by the assembly optimizer, the result is very efficient.

The following example is a function from the algorithm that computes the frame energy; it contains two simple loops. When optimizing, first determine the loop count, because when the loop count is a variable the optimizer does not perform parallel optimization. Second, minimize the number of memory accesses; for example, loading two 16-bit values with one 32-bit load instruction saves one load for every pair of samples. A close look at the C code shows that both loops run the same number of times and that the second loop uses the results of the first. The two loops can therefore be merged, which avoids reading the intermediate results back from memory in the second loop and cuts the number of load operations in half.

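long Comp_En(short *Dpnt)
{
    int i;
    long Rez;
    short Temp[60];

    /* first loop: scale each sample down by 2 bits */
    for (i = 0; i < 60; i++)
        Temp[i] = shr(Dpnt[i], (short)2);

    /* second loop: accumulate the squares of the scaled samples */
    Rez = (long)0;
    for (i = 0; i < 60; i++)
        Rez = L_mac(Rez, Temp[i], Temp[i]);

    return Rez;
}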

The corresponding linear assembly code is as follows:

        .global _Comp_En                ; export the function; C names get a leading _

_Comp_En .cproc Dpnt                    ; function header; Dpnt is the parameter
        .reg    Rez, Rez1, Rez2, i      ; symbolic registers; actual allocation is left to the optimizer
        .reg    t1, t2, x1, c1, m1, m2

        zero    Rez
        zero    Rez1
        zero    Rez2
        mv      Dpnt, c1
        mvk     30, i                   ; loop count; LDW is used instead of LDH, so the count is halved

loop1   .trip   30
        ldw     *c1++, x1               ; load two 16-bit samples at once
        shl     x1, 16, t1              ; move the low sample into the upper half
        shr     t1, 2, t1
        shr     x1, 2, t2               ; merging the two loops halves the memory loads
        smpyh   t1, t1, m1
        smpyh   t2, t2, m2
        sadd    Rez1, m1, Rez1
        sadd    Rez2, m2, Rez2
 [i]    sub     i, 1, i                 ; loop counter counts down from 30
 [i]    b       loop1

        sadd    Rez1, Rez2, Rez
        .return Rez
        .endproc

Elapsed time (clock cycles): 32971 for C; 93 for linear assembly.
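The restructuring used in the linear assembly above (one merged loop, two samples per iteration, two accumulators) can be modelled in portable C to check that it gives the same result as the two-pass form. This is a sketch under stated assumptions: sat_add and sat_mul2 are hypothetical stand-ins for the saturating add and the saturating fractional multiply, the function names are invented for illustration, and right-shifting a negative value is assumed to be an arithmetic shift, which C leaves implementation-defined.

```c
#include <stdint.h>

#define FRAME_LEN 60   /* frame length used by Comp_En in the text */

/* Saturating 32-bit add (hypothetical model of a saturating accumulate). */
static int32_t sat_add(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* Saturating fractional multiply: (a * b) * 2, clamped on overflow. */
static int32_t sat_mul2(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    return (p == 0x40000000) ? INT32_MAX : p * 2;
}

/* Two-pass form: scale into a buffer, then accumulate the squares. */
int32_t energy_two_pass(const int16_t *x)
{
    int16_t tmp[FRAME_LEN];
    int32_t acc = 0;
    for (int i = 0; i < FRAME_LEN; i++)
        tmp[i] = (int16_t)(x[i] >> 2);      /* assumes arithmetic shift */
    for (int i = 0; i < FRAME_LEN; i++)
        acc = sat_add(acc, sat_mul2(tmp[i], tmp[i]));
    return acc;
}

/* Merged form: one pass, two samples per iteration, two accumulators. */
int32_t energy_merged(const int16_t *x)
{
    int32_t acc0 = 0, acc1 = 0;
    for (int i = FRAME_LEN / 2; i > 0; i--) {   /* counts down from 30 */
        int16_t s0 = (int16_t)(x[0] >> 2);
        int16_t s1 = (int16_t)(x[1] >> 2);
        acc0 = sat_add(acc0, sat_mul2(s0, s0));
        acc1 = sat_add(acc1, sat_mul2(s1, s1));
        x += 2;
    }
    return sat_add(acc0, acc1);
}
```

Because every accumulated term is a non-negative square, splitting the sum across two saturating accumulators yields the same value as a single accumulator, including when the total saturates.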

2.3 Use linear assembly to rewrite the loop body in a complex function

When the logic of a function is complex, with many tests, jumps, and function calls, the method above becomes less effective. In that case, linear assembly can be used to rewrite only the loop as a separate function, and the loop in the original code is replaced with a call to the optimized function, rather than optimizing the entire complex function.
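The structure this produces can be sketched in C as follows. Both functions are hypothetical examples: sum_abs_kernel stands for the loop that would be moved into linear assembly on the target (here it is plain C so the sketch runs on a host), and classify_frame stands for the control-heavy function that stays in C.

```c
#include <stdint.h>

/* Loop kernel factored out of the complex function. On the target this is
   the piece that would be rewritten in linear assembly. */
int32_t sum_abs_kernel(const int16_t *x, int n)
{
    int32_t acc = 0;
    for (int i = n; i > 0; i--) {       /* counter runs down to zero */
        int32_t v = *x++;
        acc += (v < 0) ? -v : v;
    }
    return acc;
}

/* Control-heavy caller: the tests and branches stay in C, and only the
   hot loop is replaced by a call to the kernel. */
int classify_frame(const int16_t *frame, int n, int32_t threshold)
{
    if (frame == 0 || n <= 0)
        return -1;                      /* invalid input */
    return sum_abs_kernel(frame, n) > threshold ? 1 : 0;
}
```

The caller's interface does not change, so the kernel can first be validated in C and later swapped for an optimized implementation with the same prototype.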

High-speed digital signal processing devices are being applied ever more widely. In mobile communications in particular, new technologies such as software radio and smart antennas depend on powerful real-time digital signal processing, and the TMS320C6000 series DSPs can fully meet such requirements. At present, however, software and hardware development for parallel DSPs is still at an exploratory stage, and how to make full use of the resources of a high-speed DSP is the focus of this work. This article studies optimization strategies for the newly introduced TMS320C6000 and, from an engineering and system perspective, summarizes a set of practical optimization methods that meet real-time requirements while keeping development time short.

Responsible editor: gt
