Most readers will already be familiar with 3D stacking and multi-chip packaging. Process technology alone offers few new tricks these days, and sometimes even the performance gain from a new node is limited, so manufacturers have to start from the architecture. Apple's UltraFusion interconnect and Graphcore's 3D WoW (wafer-on-wafer) are both products of cooperation between chip designers and fabs, and the 3D stacked cache we discuss today is no exception.
A large 96MB cache with 3D V-Cache
AMD was the first to play the 3D stacked cache card. The Ryzen 7 5800X3D, announced at CES 2022 this year, can be regarded as an upgraded version of the 5800X: it adds 64MB of 3D V-Cache on top of the original 32MB of 2D L3 cache, bringing the total L3 cache to 96MB. This chip stacking method differs from the common C4 and microbump approaches; AMD uses hybrid bonding 3D technology to achieve higher density, better thermal management and better connectivity.
3D V-Cache schematic / AMD
However, AMD positions the Ryzen 7 5800X3D as a "gaming CPU", which means it will not necessarily perform better in some high-performance computing scenarios, and its benchmark scores may not beat flagship processors such as Intel's 12th-generation Core i9 or AMD's own Ryzen 9 5900 series. Its gaming performance, on the other hand, is quite impressive. At the announcement, AMD's official FPS comparison chart showed the 5800X3D matching Intel's i9-12900K in half of the games tested and running about 10% faster in the other half, even surpassing AMD's own Ryzen 9 5900X.
Ryzen9 5900X vs Ryzen7 5800X3D performance comparison / AMD
Recently, one media outlet jumped the gun and published game test results for this processor, using "Shadow of the Tomb Raider" as the test title. At 720p resolution with the highest image quality settings, the 5800X3D scored 231FPS, 21.58% higher than Intel's i9-12900K. You might ask why the test was run at such a low resolution: this is to remove the GPU as a bottleneck so that the comparison focuses on pure CPU performance differences.
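The reasoning behind the low-resolution test can be sketched with a toy model, assuming the delivered frame rate is simply the lower of what the CPU and the GPU can each sustain. The function name and all per-component limits below are hypothetical and chosen only for illustration; only the 231FPS score and the 21.58% lead come from the test described above.

```python
def delivered_fps(cpu_limit_fps, gpu_limit_fps):
    """Toy model: the slower of the two components sets the frame rate."""
    return min(cpu_limit_fps, gpu_limit_fps)

# Hypothetical CPU-side limits for two processors (illustrative, not measured).
cpu_a, cpu_b = 230.0, 190.0

# Hypothetical GPU-side limits: high at 720p, much lower at 4K.
gpu_720p, gpu_4k = 400.0, 90.0

# At 720p the GPU is not the bottleneck, so the CPU difference is visible.
low_res = (delivered_fps(cpu_a, gpu_720p), delivered_fps(cpu_b, gpu_720p))
# At 4K both systems are capped by the GPU and look identical.
high_res = (delivered_fps(cpu_a, gpu_4k), delivered_fps(cpu_b, gpu_4k))

# From the article: 231 FPS is 21.58% higher than the i9-12900K result,
# so the implied i9-12900K score is 231 / 1.2158.
implied_12900k_fps = 231 / (1 + 21.58 / 100)

print(low_res)
print(high_res)
print(round(implied_12900k_fps, 1))
```

Under these assumed limits the two CPUs are indistinguishable at 4K and clearly separated at 720p, which is why CPU reviews favor low resolutions. The reported figures imply an i9-12900K score of roughly 190FPS.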
It is worth mentioning that in these results the 5800X3D system used a 3080 Ti graphics card while the Intel system used the more powerful 3090 Ti, yet the former still came out ahead, which shows the strength that V-Cache brings.
Supercomputing can also benefit from it?
Beyond these high-performance consumer processors, HPC is also starting to see the potential of 3D stacked caches, for example in AMD's Milan-X server CPUs. As in gaming, larger caches can relieve bottlenecks in many core HPC applications. Researchers from Japan's Institute of Physical and Chemical Research (RIKEN) and other institutions and universities compared two AMD server CPUs: Milan, with a 256MB LLC, and Milan-X, with a 768MB LLC. As the figure below shows, the large cache has a decisive advantage when the input is small, although that advantage gradually shrinks as the input grows.
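Why the advantage fades with larger inputs can be illustrated with a deliberately crude model, assuming accesses are spread uniformly over the working set, so that the share of data resident in the LLC is roughly the cache size divided by the working-set size, capped at 100%. The function name and the working-set sizes are hypothetical; only the 256MB and 768MB capacities come from the comparison above, and real hit rates depend heavily on access patterns.

```python
def llc_resident_fraction(working_set_mb, llc_mb):
    """Crude estimate of the share of the working set that fits in the LLC."""
    return min(1.0, llc_mb / working_set_mb)

MILAN_LLC_MB = 256    # from the article
MILAN_X_LLC_MB = 768  # from the article

gaps = []
for working_set_mb in (512, 2048, 8192):  # hypothetical input sizes
    milan = llc_resident_fraction(working_set_mb, MILAN_LLC_MB)
    milan_x = llc_resident_fraction(working_set_mb, MILAN_X_LLC_MB)
    gaps.append(milan_x - milan)
    print(f"{working_set_mb} MB: Milan {milan:.3f}, Milan-X {milan_x:.3f}")
```

In this sketch a 512MB working set fits entirely in the Milan-X cache but only half of it fits in Milan's, while at 8192MB neither cache holds more than a small fraction, so the gap between the two narrows.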
Performance comparison between Milan and Milan-X / RIKEN
The researchers then used the gem5 simulator to create two LARC processor models, both assuming a 1.5nm process and both using large-capacity 3D stacked caches. The design takes as its reference the A64FX core in Fugaku, the current supercomputer champion, and is based on the Armv8.2 architecture. With an advanced 1.5nm process the area advantage is very large, amounting to a reduction of almost eight times. As a result, the core count also rises from the 12+1 cores of an A64FX core memory group (CMG) to 32 cores.
Core memory group (CMG) comparison of A64FX and LARC / RIKEN
The A64FX's L2 cache is not large, only 8MB. The LARC model adds a 3D stacked L2 cache of 384MB, 48 times the size of the former. However, the gem5 simulator provided by RIKEN only supports power-of-two L2 cache sizes, so 256MB and 512MB versions were modeled instead. For a more accurate comparison, the researchers also designed a 32-core version of the A64FX.
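The cache arithmetic above can be checked in a few lines, assuming the simulator restriction means it accepts only capacities that are exact powers of two in MB. The helper name is hypothetical.

```python
def bracketing_powers_of_two(size_mb):
    """Return the largest power of two <= size_mb and the smallest >= size_mb."""
    lower = 1 << (size_mb.bit_length() - 1)
    upper = lower if lower == size_mb else lower << 1
    return lower, upper

A64FX_L2_MB = 8    # from the article
LARC_L2_MB = 384   # from the article

ratio = LARC_L2_MB // A64FX_L2_MB
lower, upper = bracketing_powers_of_two(LARC_L2_MB)

print(ratio, lower, upper)  # prints: 48 256 512
```

Since 384 is not a power of two, the two nearest configurable sizes are 256MB and 512MB, which matches the two versions the researchers simulated.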
In more than half of the test applications, a single core memory group of the 512MB LARC processor delivers more than twice the performance of the A64FX's, and most of that improvement is attributed to the 3D stacked cache rather than to the more-than-doubled core count alone. Combined with the area advantage, a single-chip LARC processor can reach nearly 10 times the performance of the A64FX on these cache-sensitive applications. This is only a theoretical result, however: a 1.5nm process does not exist yet, and there is no measured chip-level comparison. Still, it offers another route to performance gains as progress under Moore's Law slows.
Limitations of 3D Stacked Caches
Of course, although this type of 3D stacking design is becoming quite common, there is still a thorny problem for designers to solve: heat dissipation. CPU cores sitting under a stacked cache are bound to accumulate heat, and as the temperature rises performance drops; ordinary cooling methods are not effective against this kind of internal heat buildup. Take AMD's 3D V-Cache as an example: although AMD cleverly placed the stacked cache over the non-compute part of the die rather than over the cores, this only slightly relieves the problem, and the L3 cache and the 3D V-Cache themselves may still accumulate heat.
Recent news offers some clues as well. AMD has officially confirmed that the Ryzen 7 5800X3D, the first CPU to use 3D V-Cache technology, does not support core or cache overclocking or core voltage adjustment, and this comes on top of a base frequency that is already 0.4GHz lower than the 5800X's. More importantly, AMD technical marketing director Robert Hallock pointed out that the technology is not yet mature; overclockable models may follow in the future, but for overclocking AMD's focus remains on processors without 3D stacked cache.
This is quite understandable. At present, neither process technology nor materials offer a way to solve the heat dissipation problem of 3D stacked cache. Several HPC researchers speculate that the relevant manufacturing process may not mature until 2028, and only then will it be advanced enough to overcome this limitation. Until then, outside of gaming, 3D stacked cache is not enough of a game changer to turn the processor market around.