Embedded systems and desktop PCs differ greatly in structure, but they build on the same underlying technology and follow similar development trends. Just as desktop PCs moved to 64-bit architectures to meet growing memory requirements, embedded systems are rapidly moving to 32-bit processors for the same reason. The desktop/server computing market revolves mainly around the x86 architecture, with most innovation and differentiation happening at the system level: dual-core, quad-core, and many-core processing architectures, integrated graphics processing units and memory controllers, and so on. Similarly, embedded systems are evolving mainly around simple 32-bit RISC processors, with system-level development in multi-core architectures, integrated peripherals, and configurable processing, so that designers can adapt quickly to changing application requirements. According to an iSuppli research report, the 32-bit MCU market would surpass the 8-bit MCU market in 2007. As Figure 1 shows, the 32-bit MCU market is growing faster than other segments of the semiconductor market, while the market share of 8-bit MCUs has declined over the past few years.
The main driving force behind this trend is the growing amount and complexity of software in embedded systems; the direct consequence is that a wider (32-bit) memory bus is needed to meet the code and data requirements of software programs. Unlike traditional microcontrollers, 32-bit processors can address a large, flat memory space without memory-management techniques such as segmentation, so programming is easier. 8-bit MCUs must often be programmed in assembly language, which is hard to learn and use, to fit within their small memory limits (less than 32 KB). Many 32-bit embedded applications can be programmed in C/C++, which improves the productivity of embedded software developers. More importantly, a growing number of operating systems (both real-time and non-real-time) provide ready-made drivers and software libraries, letting software developers focus on developing the application itself.
Integration reduces costs
Guided by Moore's law, ever-shrinking silicon process geometries continue to reduce the cost of 32-bit embedded solutions, bringing them within the price range of more and more applications. In addition, integrating peripherals and on-chip memory further reduces component cost and the overall bill of materials. By integrating peripherals optimized for vertical applications such as mobile phones and game consoles, the prices of many devices have fallen sharply, directly driving market growth.
Price pressure also leads these systems to integrate only a fixed set of peripherals, so the usual peripheral combinations target high-volume applications. However, no universal device can suit every application, so many low-volume, mid-volume, and even high-volume applications cannot directly use off-the-shelf integrated solutions. As a result, designers must add extra chips to expand the peripheral set, offload the processor, or provide glue logic. This is precisely the motivation for configurable processing solutions.
Configurable 32-bit processing
According to a Gartner Dataquest report, shown in Figure 2, the use of FPGA-based embedded processing solutions is growing: by 2010, about 40% of FPGA designs were expected to include an embedded processor. Because such solutions can be customized to the requirements of a specific application or product, embedded system designers are increasingly adopting FPGA-based configurable processing. The main advantages of this approach are cost reduction through integration and, at the same time, product differentiation in the market.
By selecting a different device in the same FPGA family, or retargeting the design to a newer FPGA device, a design can be tuned for higher performance, lower cost, or different I/O standards. This reduces the risk of design obsolescence and helps ensure the design remains viable in the future, a particularly critical factor for products that must have long service lives, such as automotive or industrial applications.
Figure 1: The growth rate of the 32-bit MCU market exceeds that of other types of MCU.
The configuration (or customization) levels of a configurable processing system include:

Processor:
1. Multiplier, divider, floating-point unit, and other optional units.
2. Instruction and data cache configuration.
3. Coprocessors or hardware accelerators.

Peripherals:
1. I/O peripheral selection and customization, and DMA options.
2. Memory peripheral selection and customization.

Software platform:
1. RTOS selection and customization.
2. Application library/middleware customization.
Many products include embedded systems that require some form of network or communications interface. Ethernet is one of the most widely used network interfaces in embedded products because it is inexpensive, nearly ubiquitous, and can connect to the Internet using TCP/IP and other Internet protocols. Network subsystem requirements vary widely with the target application: simple remote control and monitoring applications need a transfer capacity of only a few kilobits per second, while high-end storage or video applications demand sustained gigabit throughput.
For simplicity, we will use TCP payload throughput as the main metric for performance comparison. Table 1 lists some typical applications and their corresponding TCP/IP payload throughput requirements.
Table 1: Network throughput requirements for different applications
Configurable embedded network
The flexibility of FPGA-based processing solutions lets you enable or disable advanced features of the processor, IP cores, and software platform as needed, fine-tuning many independent parameters until the application's requirements are met at the software level. In addition, modeling tools can identify any performance-critical software function, which can then be offloaded to a suitable hardware accelerator or coprocessor.
Let's look at three Ethernet subsystems that use IP cores to meet the performance requirements of typical applications. Each design has a different system architecture, including processor configuration, Ethernet MAC IP configuration, and memory interface. These examples also highlight the different TCP/IP software stacks that can be used with each hardware subsystem. Because the hardware building blocks and software layers are customizable, you can scale these systems up or down to match the application's needs.
A minimal Ethernet subsystem
For the simple network interface needed in remote monitoring or control applications, the minimal network subsystem shown in Figure 3 is sufficient. In this class of application the required TCP/IP performance is low (about 1 Mbps), so a small TCP/IP stack such as lwIP (a lightweight implementation of the Internet protocol suite), running standalone without an RTOS, is enough.
Figure 3: A minimal Ethernet system.
This can be implemented in simple polled mode using the Ethernet Lite IP core without interrupts. All software, including a simple application layer, can be stored in the FPGA's local on-chip memory. As shown in Figure 3, the other required I/O interfaces, an RS-232 UART and GPIO, can be added to the basic subsystem.
Figure 4: A typical 10/100 Ethernet system architecture.
By modifying the minimal system of Figure 3, we can achieve higher TCP/IP throughput (10-50 Mbps) and arrive at the more typical 10/100 Ethernet solution shown in Figure 4. The main changes are:
1. Add a direct memory access (DMA) engine to the Ethernet MAC and switch to interrupt-driven operation;
2. Add external memory to the system and caches to the processor;
3. Use a more complete TCP/IP stack, such as the Linux TCP/IP stack.
For applications requiring TCP/IP throughput above 100 Mbps, consider a tri-mode (10/100/1000) Ethernet MAC provided as a hard or soft IP core (Figure 5). To reach the 500+ Mbps throughput required by high-end applications, advanced DMA techniques such as scatter/gather DMA (SGDMA) and FPGA hardware accelerators such as a data realignment engine (DRE) and checksum offload (CSO) are needed.
To meet Gigabit Ethernet's demand for even higher data throughput, a higher-performance embedded (hard) processor or a customizable soft processor implemented in the FPGA may be needed, along with larger caches, such as 16 Kbit instruction and data caches. On the software side, the advanced TCP/IP stacks in Linux, VxWorks, INTEGRITY, and QNX support features such as zero-copy and checksum offload.
Many factors, both hardware and software, affect TCP performance and hence the system's TCP throughput. They include:
1. Processor: frequency, features, and cache
a. Frequency: a TCP/IP stack typically copies the payload from the user buffer into a stack-managed buffer, and from there into the Ethernet MAC's FIFO. Some of these memory copies are performed in software and consume processor cycles. The processor is also involved in computing TCP checksums, which requires reading the entire packet from memory. A faster processor with faster memory completes these operations in less time and can therefore sustain a higher data rate;
b. Features: the TCP/IP stack must access both packet headers and payload. Typical header processing reads fields at specific offsets, so handling each packet requires many shift operations; multiplications are also needed per packet. In a configurable processor, the optional shift and multiply instructions must be enabled to achieve higher performance;
c. Cache: after a packet is copied from the Ethernet MAC into memory, it passes through the layers of the TCP/IP stack, and the stack's packet-processing code executes on it. Keeping that code and the packet data in cache greatly improves processor efficiency and Ethernet bandwidth.
2. Memory: access time and latency have a huge impact on system performance. In typical applications the TCP/IP application does not fit in local on-chip memory, so program and data reside in external memory, and the time spent accessing instructions and data strongly affects performance. The memory factor is closely related to cache size: enlarging the instruction and data caches helps hide external memory latency and access time.
3. Ethernet MAC
An Ethernet MAC peripheral implemented in an FPGA offers great flexibility, particularly in its operating mode (no DMA versus SGDMA), packet FIFO depth, DRE support, CSO support, and jumbo frame support. Each choice affects the resources the MAC consumes and how much work can be offloaded from the processor, and thus the overall performance.
4. TCP / IP protocol stack
Flexible optimization of the TCP/IP stack is another important factor in system performance. Support for hardware CSO, a zero-copy API (so data need not be copied from the application into stack buffers), and configurable stack options all help improve system performance.
5. Message size
The message (application data) size is another factor affecting performance. As messages get smaller, the relative overhead of the TCP/IP protocol headers (TCP, IP, and Ethernet headers) grows, reducing the overall payload throughput.
Most applications have a basic set of cost, performance, and functionality requirements, and designers must make the right trade-offs among them when designing a product for a specific application. These requirements may change over the product's life cycle, however, as market conditions shift. A flexible, configurable platform can rebalance these requirements as needed without changing design platforms or suppliers.
Editor in charge: PJ