Author: Huang Lun, Senior Application Engineer, Achronix

1. General

With the advent of the Internet era, the volume of data generated worldwide has grown explosively. IDC predicts that the total amount of global data will increase from 45 ZB in 2019 to 175 ZB in 2025 [1], as shown in Figure 1. At the same time, nearly 30% of global data will need real-time processing, which drives demand for hardware data processing accelerators such as FPGAs.

Figure 1 Global data growth forecast

As data volume grows rapidly, the network bandwidth used to transmit it and the computing power required to process it must grow as well. Traditional CPUs are increasingly overburdened, so offloading work to hardware accelerators is an important direction for meeting future performance requirements. This places ever higher demands on the acceleration hardware platform, which can be summarized in three aspects: computing power, data transmission bandwidth, and memory bandwidth.

Achronix’s new-generation Speedster7t FPGA, built on TSMC’s 7 nm process, has been optimized in all three aspects to meet the requirements of future hardware acceleration and network acceleration, removing the bottlenecks of traditional FPGAs. This article focuses on the advantages Achronix gains by using hard GDDR6 controllers to improve memory bandwidth.

2. Development of GDDR6

GDDR was originally positioned as a DDR memory specially optimized for graphics cards. With the development and popularity of computer games, especially 3D games, after 2000, graphics cards needed to exchange large amounts of image data at high speed, and GDDR emerged to meet that need. The first GDDR standard was the DDR-based GDDR2, which later evolved into the DDR3-based GDDR5 that was very popular for a period of time.

In 2016, GDDR5X was officially released. It introduced a quad data rate (QDR) mode with 16n prefetch, but at the cost of increasing the access granularity from the 32 bytes of GDDR5 to 64 bytes. In 2018, GDDR6 was released. Its data rate reached 16 Gbps, and its bandwidth was almost twice that of GDDR5X. It also adopted a dual-channel design, which brought the access granularity back to 32 bytes, the same as GDDR5.
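The access granularities quoted above follow from the prefetch length multiplied by the channel width. The following Python sketch reproduces them. The function name is hypothetical, and the 8n prefetch of GDDR5 and the 32-bit channel widths of GDDR5 and GDDR5X are assumptions that this article does not state; the 16-bit GDDR6 channel width is the one described in Section 5.

```python
def access_granularity_bytes(prefetch: int, channel_width_bits: int) -> int:
    """Bytes moved by one read or write burst on a single channel."""
    return prefetch * channel_width_bits // 8


# (prefetch length, channel width in bits).
# The GDDR5 and GDDR5X entries are assumptions, not taken from this article.
generations = {
    "GDDR5": (8, 32),
    "GDDR5X": (16, 32),
    "GDDR6": (16, 16),
}

for name, (prefetch, width) in generations.items():
    print(f"{name}: {access_granularity_bytes(prefetch, width)} bytes")
```

Under these assumptions, halving the channel width is what lets GDDR6 keep the 16n prefetch of GDDR5X while returning to the 32-byte granularity of GDDR5.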

3. Comparison between GDDR6 and DDR4/5

GDDR has always been a DDR memory optimized for graphics cards. Because processing image data, especially 3D image data, places high demands on video memory bandwidth, data exchange between the GPU and GDDR is very frequent. DDR memory, by contrast, focuses on efficient data exchange with the CPU, so it emphasizes overall access performance and low latency. For that reason, CPUs and traditional FPGAs mostly use DDR4.

Hardware acceleration places ever higher demands on memory bandwidth, and traditional DDR4 clearly cannot meet them. Achronix recognized the bandwidth advantage of GDDR6 and introduced it into its FPGAs to remove the memory bandwidth bottleneck of traditional FPGAs.

On July 15, 2020, the JEDEC Solid State Technology Association officially released the DDR5 SDRAM standard (JESD79-5). The memory frequency is significantly higher than the standard DDR4 frequency, and the total transmission bandwidth is 38% higher, but there is still a gap compared with GDDR6. Figure 2 [2] compares the bandwidth of GDDR6 and DDR4/5.

Figure 2 Comparison of bandwidth development between GDDR and DDR

For a high-bandwidth memory application that must deliver the same memory bandwidth, GDDR6 compares favorably with DDR4 in design complexity, PCB area, and power consumption, as shown in Figure 3 [2].

Figure 3 Performance comparison between GDDR6 and DDR4

4. Comparison between GDDR6 and HBM2

HBM stands for High Bandwidth Memory. The original standard was released by JEDEC in 2013, and in January 2016 the second generation, HBM2, officially became an industry standard. HBM was also created to solve the memory bandwidth problem. Unlike GDDR6, an HBM memory is generally built by stacking 4 or 8 DRAM dies, which we call a stack, as shown in Figure 4 [4].

Figure 4 HBM die stack

Take the high-end FPGAs with HBM2 currently on the market as an example. This series of FPGAs integrates one or two HBM2 stacks. The two stacks are independent of each other, and each has its own address space. Each die has two independent 128-bit channels, so four dies with eight channels give a bit width of 1024 bits. HBM2 runs at 900 MHz and is accessed in DDR mode. The total bandwidth of one stack is 900 MHz x 2 (DDR) x 1024 (bit width) / 8 = 230.4 GB/s, so the maximum bandwidth of two stacks is about 460 GB/s.
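The HBM2 arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the text: the function and parameter names are hypothetical, and the default values are the figures quoted in the paragraph.

```python
def hbm2_stack_bandwidth_gbytes(
    clock_mhz: int = 900,
    transfers_per_clock: int = 2,   # DDR: two transfers per clock cycle
    dies: int = 4,
    channels_per_die: int = 2,
    channel_width_bits: int = 128,
) -> float:
    """Peak bandwidth of one HBM2 stack in GB/s (1 GB = 10**9 bytes)."""
    bus_width_bits = dies * channels_per_die * channel_width_bits  # 1024 bits
    mbytes_per_second = clock_mhz * transfers_per_clock * bus_width_bits // 8
    return mbytes_per_second / 1000


one_stack = hbm2_stack_bandwidth_gbytes()
two_stacks = 2 * one_stack
print(f"one stack: {one_stack:.1f} GB/s, two stacks: {two_stacks:.1f} GB/s")
```

The exact results are 230.4 GB/s for one stack and 460.8 GB/s for two, which round to the 230 GB/s and 460 GB/s figures commonly quoted.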

Achronix’s Speedster7t FPGA integrates eight hard GDDR6 controllers, each supporting two channels. The total bandwidth is 16 Gbps x 16 (bit width) x 2 (channels) x 8 (controllers) / 8 = 512 GB/s, which is slightly higher than the memory bandwidth of the FPGA with HBM2.

In terms of cost, GDDR6 has a large advantage over HBM2. HBM2 is technically demanding to manufacture, which significantly affects chip yield and output. GDDR6 is also more flexible to use: because it is off-chip DRAM, GDDR6 devices with different speeds and capacities can be selected according to application requirements. HBM2, for its part, offers high integration and takes up no PCB area. Figure 5 shows a comprehensive cost comparison of DDR4, GDDR6, and HBM2.
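The same kind of check works for the GDDR6 figure. The sketch below uses hypothetical names, with the values quoted in the text as defaults, and compares the result with the two-stack HBM2 number derived earlier in this section.

```python
def gddr6_bandwidth_gbytes(
    pin_rate_gbps: int = 16,
    channel_width_bits: int = 16,
    channels_per_controller: int = 2,
    controllers: int = 8,
) -> float:
    """Peak GDDR6 bandwidth in GB/s for the given number of controllers."""
    gbits_per_second = (
        pin_rate_gbps * channel_width_bits * channels_per_controller * controllers
    )
    return gbits_per_second / 8


# Two HBM2 stacks: 2 x (900 MHz x 2 x 1024 bits / 8), from the HBM2 example above.
HBM2_TWO_STACKS_GBYTES = 460.8

total = gddr6_bandwidth_gbytes()
per_controller = gddr6_bandwidth_gbytes(controllers=1)
print(f"total: {total:.0f} GB/s, per controller: {per_controller:.0f} GB/s")
print(f"ratio to two HBM2 stacks: {total / HBM2_TWO_STACKS_GBYTES:.2f}")
```

Each controller contributes 64 GB/s, and the eight controllers together come out roughly 11% above the two-stack HBM2 figure.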

Figure 5 DDR4 vs GDDR6 vs HBM2

5. Technical details of GDDR6 and clamshell mode

The structure of GDDR6 is shown in Figure 6 [3]. It uses a 16n prefetch architecture, so each write or read operation transfers 16n bits, where n is the width of the channel interface. Each GDDR6 device has two independent channels, and each channel accesses its own memory space. For each channel, the internal read or write width is 256 bits, or 32 bytes. The P-to-S converter is a parallel-to-serial converter that turns each 256-bit word into 16 successive 16-bit transfers on the 16-bit channel bus. The minimum access granularity of each GDDR6 channel is therefore 256 bits, or 32 bytes.

Figure 6 GDDR6 device structure
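To make the parallel-to-serial step concrete, the following Python sketch splits one 256-bit word into 16 transfers of 16 bits and reassembles it. It is a behavioral illustration only: the function names are hypothetical, and the transfer order (least significant 16 bits first) is an assumption, not something the article or the device documentation cited here specifies.

```python
BEATS = 16           # 16n prefetch: 16 transfers per access
BUS_WIDTH_BITS = 16  # width of one GDDR6 channel


def parallel_to_serial(word: int) -> list[int]:
    """Split a 256-bit word into 16 beats of 16 bits (assumed LSB-first order)."""
    if not 0 <= word < 1 << (BEATS * BUS_WIDTH_BITS):
        raise ValueError("word must fit in 256 bits")
    mask = (1 << BUS_WIDTH_BITS) - 1
    return [(word >> (beat * BUS_WIDTH_BITS)) & mask for beat in range(BEATS)]


def serial_to_parallel(beats: list[int]) -> int:
    """Reassemble the 256-bit word from its 16-bit beats."""
    word = 0
    for index, beat in enumerate(beats):
        word |= beat << (index * BUS_WIDTH_BITS)
    return word


# Arbitrary 32-byte test pattern standing in for one access.
word = int.from_bytes(bytes(range(32)), "little")
beats = parallel_to_serial(word)
print(len(beats), "beats of", BUS_WIDTH_BITS, "bits =", len(beats) * BUS_WIDTH_BITS // 8, "bytes")
```

Because every access occupies all 16 beats, a request for fewer than 32 bytes still consumes a full burst on the channel.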

One GDDR6 controller supports two independent channels, and one GDDR6 device also has two independent channels. In normal mode, therefore, each GDDR6 controller is paired with one GDDR6 device operating in x16 mode, for a maximum bandwidth of 512 Gb/s (64 GB/s) per controller.

At present, the maximum density of GDDR6 devices on the market is 16 Gb. For applications that need more capacity, a connection mode called clamshell can be used. As shown in Figure 7 [5], each GDDR6 controller connects to two GDDR6 devices, and each device operates in x8 mode. In clamshell mode the bandwidth stays the same, while the GDDR6 capacity doubles.

Figure 7 Clamshell mode of GDDR6
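The trade-off between normal and clamshell mode can be expressed as a small model. The sketch below is a simplification with hypothetical class and field names. It assumes a 16 Gbps pin rate and two channels per controller, as described earlier, and leaves device capacity in whatever unit the device density is quoted in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerAttachment:
    """Memory devices attached to one GDDR6 controller (simplified model)."""
    devices: int
    bits_per_channel_per_device: int  # 16 in x16 mode, 8 in x8 mode
    device_capacity: int              # same unit as the quoted device density
    channels: int = 2
    pin_rate_gbps: int = 16

    def bandwidth_gbits(self) -> int:
        """Peak bandwidth seen by the controller, in Gb/s."""
        return (self.pin_rate_gbps * self.channels
                * self.devices * self.bits_per_channel_per_device)

    def capacity(self) -> int:
        """Total capacity behind the controller."""
        return self.devices * self.device_capacity


normal = ControllerAttachment(devices=1, bits_per_channel_per_device=16, device_capacity=16)
clamshell = ControllerAttachment(devices=2, bits_per_channel_per_device=8, device_capacity=16)

for name, mode in (("normal", normal), ("clamshell", clamshell)):
    print(f"{name}: {mode.bandwidth_gbits()} Gb/s, capacity {mode.capacity()}")
```

In this model each channel still drives 16 data bits in both modes; clamshell simply splits them across two devices, which is why the bandwidth is unchanged while the capacity doubles.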

6. Read and write efficiency of GDDR6 on 7t1500

Finally, we test the read and write efficiency of the GDDR6 controller on the 7t1500. All results are based on simulation. The test environment is shown in Figure 8. Because the 7t1500 includes a network on chip (NoC) that already implements arbitration and clock domain crossing, we use three user logic modules to access the same GDDR6 channel through the NoC, so that the combined read/write efficiency better reflects real user applications.

Figure 8 GDDR6 read/write efficiency test architecture

The test results under different burst lengths and different address access modes are shown in Figure 9.

Figure 9 GDDR6 read/write efficiency

In later articles, we will continue to look at more features of the Speedster7t FPGA and how they apply to data acceleration and network acceleration. For more information or questions, contact us through the Achronix official account or visit the Achronix website: http://www.achronix.com

References:

1. The Digitization of the World: From Edge to Core, 2018

2. Extending the Benefits of GDDR Beyond Graphics, Micron

3. TN-ED-03: GDDR6: The Next-Generation Graphics DRAM, Memory Array Prefetch and Access Granularity

4. Samsung website: www.samsung.com

5. Micron website: www.micron.com

6. Achronix website: www.achronix.com
