Since FPGAs were first introduced decades ago, every new architecture has continued to use a bit-wise routing architecture. This approach has been successful, but the rise of high-speed communication standards keeps demanding wider on-chip buses to support the new data rates. One consequence of this limitation is that designers often spend a large share of development time trying to achieve timing closure, and sacrifice performance to place and route their designs.

Traditional FPGA routing is built from many independent segments that run horizontally and vertically across the whole FPGA, with a switch box at each intersection of horizontal and vertical wiring to connect paths. With these segments and switch boxes, a path can be built from any source to any destination on the FPGA. This uniform routing structure provides great flexibility for any logic function and supports data paths of any bit width in the FPGA logic array.

Although bit-wise routing in an FPGA is very flexible, its disadvantage is that each segment adds delay to a signal path. Signals that must travel a long distance across the FPGA accumulate delay from every segment-to-segment connection, which reduces the performance of the function. Another challenge of bit-wise routing is congestion: signal paths must detour around congested areas, which adds more delay and degrades performance further.
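The effect of congestion on path length can be sketched with a toy model. The grid below is an assumption made purely for illustration: switch boxes are grid points, each hop between neighbours costs one segment, and the function name, grid size and per-segment delay are hypothetical rather than taken from any real device.

```python
from collections import deque

SEGMENT_DELAY_NS = 0.2  # hypothetical per-segment delay, for illustration only


def segments_on_shortest_route(width, height, src, dst, congested=frozenset()):
    """Count the segments on the shortest route between two switch boxes.

    The fabric is modelled as a width x height grid of switch boxes. Each hop
    between adjacent switch boxes uses one segment, and congested switch boxes
    cannot be used. Returns None if no route exists.
    """
    if src in congested or dst in congested:
        return None
    queue = deque([(src, 0)])
    seen = {src}
    while queue:  # breadth-first search finds the fewest hops
        (x, y), hops = queue.popleft()
        if (x, y) == dst:
            return hops
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in congested and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


clear = segments_on_shortest_route(5, 5, (0, 2), (4, 2))
blocked = segments_on_shortest_route(
    5, 5, (0, 2), (4, 2), congested={(2, 1), (2, 2), (2, 3)}
)
print(clear, blocked)  # 4 8
print(clear * SEGMENT_DELAY_NS, blocked * SEGMENT_DELAY_NS)
```

In this toy case a three-box congested region doubles the number of segments on the path, and the delay grows with it. Real routers weigh many more factors, so the numbers only show the trend.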

Achronix sees this challenge as an opportunity to develop a new architecture that eliminates the design challenges of traditional FPGAs and improves system performance. Achronix’s solution is to add a two-dimensional (2D) high-speed network on chip (NOC) on top of the traditional segmented FPGA routing structure in its new Speedster7t FPGA family. The Speedster7t NOC connects to multiple ports on all of the on-chip high-speed interfaces: 400G Ethernet, PCIe Gen5, GDDR6 and DDR4/5.

The NOC consists of a set of rows and columns that distribute network traffic horizontally and vertically throughout the FPGA logic array. Primary and secondary network access points (NAPs) are located at each intersection of a NOC row and column. These NAPs act as sources or destinations for traffic between the NOC and the programmable logic array.

Figure 1: Network on chip (NOC) and interfaces of the Speedster7t

At first glance, the Speedster7t NOC may seem to help only with bus routing inside the FPGA; however, this new architecture can significantly improve designer productivity, enable new design capabilities, and make it easy to implement data-intensive processing applications. Here are the eight most significant application scenarios in terms of productivity, design flexibility, and performance.

Simplify high speed data distribution in the entire FPGA logic array

In traditional FPGA architectures, bidirectional read/write traffic between the off-chip memory attached to the FPGA and the external high-speed data sources must travel over long, segmented routing paths through the FPGA logic array. This constraint not only limits bandwidth, but also consumes routing resources that the user logic needs, which makes timing closure harder for FPGA designers, especially as other logic functions raise device utilization.

Moving data from external sources to the FPGA and memory is much easier with the Speedster7t NOC than with a traditional FPGA architecture. The Speedster7t NOC augments the traditional programmable interconnect in the FPGA array, much like a highway network superimposed on a city street system. The traditional programmable interconnect matrix in the Speedster7t FPGA remains suitable for slower, local data traffic, while the NOC handles the more demanding high-speed data flows.

Each row or column in the NOC is implemented as two 256-bit unidirectional data channels operating at a fixed clock rate of 2 GHz. Rows have east/west channels and columns have north/south channels, so each NOC row or column can carry 512 Gbps of data traffic in each direction simultaneously. Designers can use these channels to move large amounts of data across the FPGA array by writing simple Verilog or VHDL code that lets their logic communicate with a NAP and so connect to the NOC highway network.
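The 512 Gbps figure follows directly from the channel width and clock rate quoted above. A minimal check of that arithmetic, with constant and function names chosen here for illustration:

```python
CHANNEL_WIDTH_BITS = 256        # width of one unidirectional channel (from the text)
NOC_CLOCK_HZ = 2_000_000_000    # fixed 2 GHz NOC clock (from the text)
CHANNELS_PER_ROW_OR_COLUMN = 2  # one channel per direction (from the text)


def channel_bandwidth_gbps(width_bits, clock_hz):
    """Raw bandwidth of one channel: bits per cycle times cycles per second."""
    return width_bits * clock_hz / 1e9


per_direction = channel_bandwidth_gbps(CHANNEL_WIDTH_BITS, NOC_CLOCK_HZ)
both_directions = per_direction * CHANNELS_PER_ROW_OR_COLUMN

print(per_direction)    # 512.0 Gbps in each direction
print(both_directions)  # 1024.0 Gbps for the row or column as a whole
```

This is raw channel capacity only; it says nothing about protocol overhead or how traffic from several NAPs shares a channel.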

The figure below shows data transfers between points on the NOC. The logic at point 1 and at point 2 each instantiates a horizontal NAP. A NAP can both send and receive data, but each individual data stream flows in one direction only. Similarly, the logic at points 3 and 4 each instantiates a vertical NAP, and the two points can send data streams to each other.

Figure 2: Data flow across the device logic array on the NOC

Automatically connect the PCIe interface to the memory

In current FPGAs, designers who connect a high-speed interface to memory attached to the FPGA for reads and writes must account for the delay introduced by the connecting logic, the routing, and the placement of input and output signals. Building even a simple memory interface with basic functionality usually takes a lot of design time.

In the Speedster7t architecture, the connection from the embedded PCIe Gen5 interface to the attached GDDR6 or DDR4 memory is handled automatically by the peripheral NOC, and designers do not need to write any RTL to establish it. Because the NOC connects to all peripheral IP interfaces, designers have great flexibility to connect PCIe to any GDDR6 or DDR4 memory interface. In the following example, the NOC provides enough bandwidth to sustain PCIe Gen5 traffic to any two channels of GDDR6 memory. This high-bandwidth connection consumes no FPGA logic array resources and takes almost no design time. Users only need to enable the PCIe and GDDR6 interfaces to send transactions over the NOC.

Figure 3: PCIe connected directly to the GDDR6 interface

Secure local reconfiguration on independent FPGA logic array modules

Like other SRAM-based FPGAs, the Speedster7t FPGA must be configured at power-up. It has an on-chip FPGA configuration unit (FCU) that manages the initial configuration of the FPGA and any subsequent local reconfiguration. The FCU is also connected to the NOC, which provides more flexibility when configuring the FPGA: the NOC can carry the configuration bitstream to the FCU, so the FPGA can be configured in ways that were not previously available.

Before the device is configured, the Speedster7t NOC already supports some read/write transactions: PCIe to GDDR6, PCIe to DDR4, and PCIe to FCU. Once the PCIe interface is set up, the FPGA can receive the bitstream over PCIe and send it to the FCU. When the bitstream reaches the FCU, it is written into the FPGA programmable logic to configure the rest of the device. After the device is configured, designers can reconfigure parts of the FPGA (local reconfiguration, more commonly called partial reconfiguration) to add new functions or improve acceleration performance without shutting down the FPGA.

A new local reconfiguration bitstream can be sent to the FCU over the PCIe interface to reconfigure any part of the device. When part of the device is reconfigured, instantiating a NAP in the affected region to communicate with the NOC makes any data entering or leaving the newly configured region easy to access in the Speedster7t1500 device. The NOC removes much of the complexity of local reconfiguration on traditional FPGAs: users do not have to worry about routing around existing logic functions and hurting performance, or about being unable to reach certain device pins because of logic already placed in that region. This feature saves designers time and gives them more flexibility when using local reconfiguration.

In addition, local reconfiguration lets designers adapt the logic in the device as the workload changes. For example, if the FPGA is running a compression algorithm on incoming data and compression is no longer needed, the host CPU can tell the FPGA to reconfigure and load a new design optimized for the next workload. Local reconfiguration can be carried out independently at the level of logic array clusters while the rest of the device keeps running. One clever use case is a self-aware FPGA that uses a soft CPU to monitor device operation and trigger local reconfiguration in real time, either to switch off logic to save power or to add accelerator modules to the FPGA fabric temporarily to handle bursts of input data. These features give designers unprecedented configuration flexibility.

Support hardware virtualization easily

The Speedster7t NOC gives designers the ability to create secure, virtualized hardware in a single FPGA by using NAPs and their AXI interfaces. To connect a programmable logic design directly to the NOC, the designer only needs to instantiate a NAP and its AXI4 interface in the logic design. Each NAP also has an associated address translation table (ATT), which translates logical addresses at the NAP into physical addresses on the NOC. The ATT lets the programmable logic use local addresses and maps NOC-bound transactions to the addresses assigned by the NOC global memory map. This remapping feature can be used in many ways. For example, it allows identical copies of an accelerator engine to all use zero-based virtual addressing while the traffic from each engine is sent to a different physical memory location.

Each ATT entry also contains an access protection bit that prevents the node from accessing forbidden address ranges. This provides an important inter-process security mechanism that prevents multiple applications or tasks running on one Speedster7t FPGA from interfering with the memory regions assigned to other applications or tasks. It also helps prevent system crashes caused by unexpected, accidental, or even intentional memory address conflicts. In addition, designers can use this scheme to block a logic function from accessing an entire memory device.
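The remapping and protection behaviour described above can be modelled in a few lines. This is only a behavioural sketch: the entry layout, field names, region sizes and addresses are assumptions made for illustration and do not describe the actual ATT format in the device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttEntry:
    """One hypothetical ATT entry: a local address window and its mapping."""
    local_base: int       # start of the window as seen by the user logic
    size: int             # window size in bytes
    physical_base: int    # start of the window in the NOC global memory map
    access_allowed: bool  # models the access protection bit


class AccessDenied(Exception):
    """Raised when a transaction targets a protected or unmapped address."""


def translate(att, local_addr):
    """Translate a local address to a physical NOC address via the table."""
    for entry in att:
        if entry.local_base <= local_addr < entry.local_base + entry.size:
            if not entry.access_allowed:
                raise AccessDenied(f"address {local_addr:#x} is protected")
            return entry.physical_base + (local_addr - entry.local_base)
    raise AccessDenied(f"address {local_addr:#x} is not mapped")


# Two identical accelerator copies both use zero-based local addresses,
# but each table sends their traffic to a different physical region.
accel0_att = [AttEntry(0x0000, 0x1000, 0x8000_0000, True)]
accel1_att = [
    AttEntry(0x0000, 0x1000, 0x8000_1000, True),
    AttEntry(0x1000, 0x1000, 0x8000_0000, False),  # region owned by accel0
]

print(hex(translate(accel0_att, 0x10)))  # 0x80000010
print(hex(translate(accel1_att, 0x10)))  # 0x80001010
```

Calling `translate(accel1_att, 0x1000)` raises `AccessDenied` in this model, which mirrors how the protection bit keeps one task out of memory assigned to another.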

Figure 4: Hardware virtualization using the Speedster7t NOC

Simplify team collaborative design

Team-based collaborative FPGA design is not a new concept, but in a traditional FPGA the routing of each part of the design depends on the other parts of the device, which makes this simple concept very hard to implement. Once one team has finished its part of the design, another team working on a different part often struggles to reach resources on the far side of the device, because its signals must be routed through the region that is already complete. Similarly, changing the area or size of a part of the design that has already been placed and routed can have a cascading effect on all the other design modules.

With the Speedster7t NOC, design modules can be mapped to any part of the FPGA, and resource allocation can be changed without affecting the timing, placement or routing of other FPGA modules. Team-based design becomes practical because every NAP in the device gives each design module unrestricted access to the NOC for communication. If one part of a design grows, then as long as enough FPGA resources are available, the NOC manages the data flow automatically, so designers do not have to worry about whether timing is still met or about knock-on effects on the parts of the design that other team members are working on.

Figure 5: Multiple design teams working on the same FPGA

Speed up the design through independent interface and logic verification

Another unique feature of the Speedster7t NOC is that it lets designers configure and verify I/O connections independently of user logic. For example, one design team can verify the interface between PCIe and GDDR6 while another team independently verifies the internal logic functions. This independence is possible because the peripheral portion of the NOC connects PCIe, GDDR6, DDR4 and the FCU without consuming any FPGA resources. These connections can be tested without any HDL code, so the interfaces and the logic can be validated independently and at the same time. This removes the dependency between verification steps and shortens overall verification compared with traditional FPGA architectures.

Figure 6: Independent I/O and logic verification (design team 1: I/O verification; design team 2: logic verification)

Using packet mode to simplify the application of 400 Gbps Ethernet

The challenge in implementing a high-speed 400 Gbps Ethernet data path in an FPGA is finding a bus width that meets the performance requirement at a clock rate the FPGA can achieve. For 400G Ethernet, the only feasible choices for full-bandwidth operation are a 1024-bit bus running at 724 MHz or a 2048-bit bus running at 642 MHz. Buses this wide are difficult to route because they consume a lot of logic resources in the FPGA fabric. Even in the most advanced FPGAs, timing closure is a challenge at these speeds.

In the Speedster7t architecture, however, designers can use a new processing mode called packet mode, in which the incoming Ethernet stream is rearranged into four narrower 32-byte-wide packet streams, carried on four independent 256-bit buses running at 506 MHz. This mode wastes fewer bytes at the end of a packet, and it can transfer data in parallel without waiting for the first packet to finish before starting the second. The Speedster7t FPGA architecture enables packet mode by connecting the Ethernet MAC directly to a specific NOC column, and from that column into the logic array through user-instantiated NAPs. Data can be sent along the NOC column to anywhere in the FPGA fabric for further processing. Configuring packet mode with the ACE design tool greatly simplifies the user design and makes it more efficient to process 400 Gbps Ethernet data streams.
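The claim about wasted bytes at the end of a packet can be illustrated with a simple model. It assumes that a packet always occupies whole bus cycles, so the last cycle is padded up to the bus width, and it ignores preamble, inter-packet gap and how the MAC actually distributes packets across the four buses. The function names and the 65-byte example packet are chosen here for illustration.

```python
import math


def bus_cycles_and_padding(packet_bytes, bus_bytes):
    """Return (cycles used, padding bytes) for one packet on a bus."""
    cycles = math.ceil(packet_bytes / bus_bytes)
    padding = cycles * bus_bytes - packet_bytes
    return cycles, padding


def raw_bandwidth_gbps(bus_bits, clock_mhz, buses=1):
    """Raw capacity: bus width times clock rate times number of buses."""
    return bus_bits * clock_mhz * buses / 1000


# A 65-byte packet on the three bus widths mentioned in the text.
print(bus_cycles_and_padding(65, 128))  # 1024-bit bus: (1, 63)
print(bus_cycles_and_padding(65, 256))  # 2048-bit bus: (1, 191)
print(bus_cycles_and_padding(65, 32))   # one 256-bit packet-mode bus: (3, 31)

# Raw capacity of each option from the text, before any overhead.
print(raw_bandwidth_gbps(1024, 724))
print(raw_bandwidth_gbps(2048, 642))
print(raw_bandwidth_gbps(256, 506, buses=4))
```

Under these assumptions the narrow bus pads far fewer bytes for a short packet, which is why the four-bus arrangement can sustain the line rate at a lower clock frequency than a single wide bus.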

Figure 7: Data bus rearrangement in packet mode

Figure 8: 400 Gbps Ethernet using packet mode

Reduce logic footprint and improve overall FPGA performance

Compared with a traditional FPGA, the Speedster7t NOC offers more flexibility and a simpler design methodology. One potential benefit is that a design which uses the NOC instead of the FPGA logic array for inter-module routing needs less logic. The ACE design tools automatically manage the complexity of connecting design units to the Speedster7t NOC, so designers can be productive without writing HDL code for these connections. This approach eases the time-consuming challenge of timing closure and avoids the loss of application performance caused by routing congestion in the FPGA logic array. The NOC can also raise device utilization without sacrificing FPGA performance, and can significantly increase the number of look-up tables (LUTs) available for computation.

To illustrate this advantage, we created an example design that performs two-dimensional convolution on input images. Each module uses Speedster7t machine learning processor (MLP) and BRAM blocks, and each MLP performs 12 int8 multiplications per cycle. Forty 2D convolution modules are linked together to use almost all of the BRAM and MLP resources available in the device. Running in parallel, the 40 instances use 94% of the MLPs and 97% of the BRAM, but only 8% of the LUTs. The remaining 92% of the LUTs are still available for other functions.

As more instances are added to the device, the maximum frequency (Fmax) of an individual module does not drop. The design maintains its performance because each 2D convolution module reads and writes GDDR6 memory directly through a NAP connected to the NOC, without routing through the FPGA logic array.

Figure 9: A Speedster7t device with 40 instances of the 2D convolution module

Conclusion

The Speedster7t NOC fundamentally changes the FPGA design process. Achronix is the first FPGA company to implement a two-dimensional network on chip (2D NOC) that connects all system interfaces and the FPGA logic array. This new architecture makes Achronix FPGAs particularly suitable for high-bandwidth applications while significantly improving designer productivity. Because the NOC manages all of the networking between the data accelerators designed in the FPGA and the high-speed data interfaces, designers only need to design their data accelerators and connect them to NAP primitives. ACE and the NOC take care of everything else. By using the NOC, FPGA designers can:

Simplify high speed data distribution in FPGA logic array

Automatically connect the PCIe interface to the memory

Secure local reconfiguration on independent FPGA logic array modules

Support hardware virtualization easily

Simplify team design

Speed up the design through independent interface and logic verification

Simplify 400 Gbps Ethernet application with packet mode

Reduce logic footprint and improve overall FPGA performance

Editor in charge: PJ
