Over the past decade or so, CPU performance has increased by more than 100 times, while the performance of traditional hard disk drives (HDDs) has increased by less than 1.5 times. This uneven development of compute and storage technology has significantly held back the overall performance of IT systems. Only with the arrival of the solid state drive (SSD) was the storage bottleneck relieved. However, as a newer technology, the SSD still has some inherent weaknesses, and how to get the most out of it is a topic worth studying. This article discusses it from three angles: performance, endurance, and cost of use.
1. How to get the most out of SSD performance
First, let’s look at how traditional HDDs are used:
The interface protocol is generally SAS or SATA;
The Linux I/O scheduler uses an elevator algorithm to reorder I/O requests and optimize the movement of the disk head;
Enterprise storage usually uses RAID cards for data protection.
In terms of interface protocol, the NVMe protocol emerged alongside the SSD. Compared with the single-queue mechanism of SAS and SATA, NVMe supports up to 65535 queues and connects directly over PCIe, eliminating the link and protocol bottlenecks.
In terms of the controller ecosystem, major manufacturers have launched their own NVMe controller chips, including PMC (now part of Microchip), LSI, Marvell, Intel, Silicon Motion (Huirong), and the Chinese vendor DERA (Derui). The technology is also very mature.
The Linux driver and I/O stack have also been optimized. As shown in the figure below, the NVMe driver can bypass the traditional scheduling layer designed for HDDs, greatly shortening the processing path.
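One way to see this on a Linux host is to look at the block-layer scheduler setting that the kernel exposes under sysfs, where the active scheduler is shown in square brackets. The sketch below is only an illustration: the device name nvme0n1 and the helper function names are assumptions made for this example, and the value reported depends on the kernel and distribution (NVMe devices commonly report none).

```python
import re
from pathlib import Path
from typing import Optional


def parse_active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler marked with [brackets] in the sysfs text.

    If no brackets are present, the stripped text is returned unchanged.
    """
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else sysfs_text.strip()


def read_active_scheduler(device: str = "nvme0n1") -> Optional[str]:
    """Read the active scheduler for a block device, or None if it is absent.

    The device name is a hypothetical example; adjust it to the local system.
    """
    path = Path("/sys/block") / device / "queue" / "scheduler"
    if not path.exists():
        return None
    return parse_active_scheduler(path.read_text())


# Parsing a sample line in the sysfs format, without touching real hardware.
sample = "[none] mq-deadline kyber\n"
active = parse_active_scheduler(sample)
```

A result of none means that requests go to the driver's queues without being reordered by an elevator-style scheduler.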
So far, the first two of the three traditional HDD issues listed above have been addressed, which helps unlock SSD performance. In the enterprise market, however, NVMe-based RAID has never had a good solution. The RAID 5/RAID 6 data protection schemes (n+1, n+2) most widely used in traditional enterprises stripe the data, compute redundant parity, and store the data and parity across multiple drives. Writing new data into an existing stripe usually follows a “read-modify-write” sequence: the old data and parity are read, the parity is recalculated, and both are written back. This sequence is itself a performance bottleneck, and the extra writes it generates significantly shorten SSD service life. In addition, in the NVMe protocol the controller sits inside the NVMe drive and I/O is completed by the DMA engine inside the drive, which makes designing an NVMe-based RAID card considerably harder. At present, few such RAID controller cards are available on the market, and those that exist cannot deliver the performance NVMe is capable of, so they have not been widely adopted.
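The parity update behind this small-write pattern can be sketched in a few lines. This is a simplified, in-memory illustration of single-parity (RAID 5 style) XOR arithmetic, not the behavior of any particular RAID card; the function names and the tiny 4-byte chunks are assumptions chosen for the example.

```python
from functools import reduce
from typing import Dict, List, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(chunks: List[bytes]) -> bytes:
    """Parity of a whole stripe: the XOR of all data chunks."""
    return reduce(xor_bytes, chunks)


def small_write(chunks: List[bytes], parity: bytes, index: int,
                new_data: bytes) -> Tuple[bytes, Dict[str, int]]:
    """Update one chunk in place and return the new parity and the I/O count."""
    old_data = chunks[index]        # read 1: old data chunk
    old_parity = parity             # read 2: old parity chunk
    # new parity = old parity XOR old data XOR new data
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    chunks[index] = new_data        # write 1: new data chunk
    # write 2 is the new parity chunk, returned to the caller here
    return new_parity, {"reads": 2, "writes": 2}


stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = full_parity(stripe)
parity, io_count = small_write(stripe, parity, 1, b"ZZZZ")
```

One logical write therefore costs two device reads and two device writes in this model, which is where both the performance penalty and the extra wear come from.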
As a result, many enterprise storage solutions still use SAS/SATA SSDs with traditional RAID cards. This brings back the two problems that had already been solved, and SSD performance cannot be fully exploited.
However, this situation is changing. The NVMe over TCP (NVMe/TCP) clustered storage solution developed by Lightbits Labs handles this problem well. Using a self-developed data acceleration card and erasure coding, the solution can deliver more than 1M IOPS of random write performance while avoiding the service-life loss caused by “read-modify-write”. Lightbits also proposes the Elastic RAID mechanism, which provides elastic n+1 protection (similar to RAID 5). Traditional RAID 5 requires hot spare drives or prompt replacement of failed drives; Elastic RAID instead rebalances automatically and forms a new protection layout after a drive fails. For example, if a node has 10 drives protected as 9+1 and one drive fails, the system automatically switches to 8+1 protection and rebalances the existing data into the new layout, greatly improving maintainability and data safety. In addition, the data acceleration card can compress at 100 Gb/s line rate, which significantly increases usable capacity and thus greatly reduces system cost.
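The arithmetic of shrinking from 9+1 to 8+1 can be illustrated with a short sketch. It is not Lightbits’ implementation: the function names, the 8 TB drive size, and the rule that rebalancing needs the existing data to fit into the smaller layout are all assumptions made for this example.

```python
from typing import Tuple


def protection_layout(healthy_drives: int) -> Tuple[int, int]:
    """Return (data, parity) for an n+1 layout spread over all healthy drives."""
    if healthy_drives < 2:
        raise ValueError("n+1 protection needs at least two drives")
    return healthy_drives - 1, 1


def usable_fraction(healthy_drives: int) -> float:
    """Share of raw capacity left for data under the n+1 layout."""
    data, parity = protection_layout(healthy_drives)
    return data / (data + parity)


def can_rebalance(used_tb: float, drive_tb: float, healthy_drives: int) -> bool:
    """Assumed rule: existing data must fit in the usable space that remains."""
    return used_tb <= usable_fraction(healthy_drives) * drive_tb * healthy_drives


before = protection_layout(10)        # ten healthy drives
after = protection_layout(9)          # one drive has failed
fits = can_rebalance(used_tb=60.0, drive_tb=8.0, healthy_drives=9)
```

Under these assumptions the usable share of raw capacity drops only from 90% to about 89%, so the node stays protected without a hot spare as long as it has enough free space to absorb the rebalance.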
2. How to improve the endurance of NVMe drives
At present, the most widely used SSDs are based on NAND flash, and an inherent problem of NAND is endurance. As the technology develops, NAND density keeps increasing. The latest generation has reached QLC (4 bits per cell), while the number of program/erase (P/E) cycles each cell can withstand keeps decreasing (1K P/E cycles). The development trend is shown in the figure below.
Another characteristic of NAND is that its smallest erasable unit is relatively large, as shown in the figure below. Data can be written in 4 KB units, but erasing (for example, when existing data is modified) can only be done in 256 KB units (the exact sizes differ between SSDs, but the principle is the same). This easily leaves holes of invalid data and triggers the SSD’s garbage collection (GC), which moves valid data around and leads to so-called write amplification, further reducing the endurance of the drive.
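A rough way to quantify this effect is a simplified steady-state model: before a 256 KB block can be erased, its still-valid 4 KB pages must be copied elsewhere, and only the remaining pages become free for new host data. The sketch below uses the 4 KB and 256 KB sizes from the text; the model and the function names are assumptions for illustration, and real SSD firmware is considerably more complex.

```python
def pages_per_block(erase_block_kb: int = 256, page_kb: int = 4) -> int:
    """Number of write units (pages) inside one erase unit (block)."""
    return erase_block_kb // page_kb


def gc_write_amplification(valid_pages: int, block_pages: int = 64) -> float:
    """Flash writes per host write when reclaimed blocks hold valid_pages.

    Reclaiming one block copies valid_pages pages and frees room for
    (block_pages - valid_pages) new host pages, so the flash absorbs
    block_pages page writes for that many host page writes.
    """
    if not 0 <= valid_pages < block_pages:
        raise ValueError("valid_pages must be in [0, block_pages)")
    return block_pages / (block_pages - valid_pages)


block_pages = pages_per_block()                     # 64 pages per block
waf_empty = gc_write_amplification(0, block_pages)  # block fully invalid
waf_half = gc_write_amplification(32, block_pages)  # half the pages still valid
```

In this model a block that is entirely invalid costs nothing extra to reclaim, while a block that is still half valid doubles the writes that reach the flash.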
Enterprise storage usually relies on the “read-modify-write” mechanism of RAID 5/6, which further multiplies the number of writes to the drive. In general, the wear is about twice that of writing directly. In addition, many RAID 5 implementations enable a journal, which reduces the service life of the drive even further.
Finally, the latest QLC drives bring one more factor to consider: the indirection unit (IU). For example, some QLC drives use a 16 KB IU. Writing I/O smaller than the IU also triggers an internal “read-modify-write”, which shortens service life yet again.
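The cost of small writes against a 16 KB IU can be estimated with simple arithmetic. The sketch below assumes that the drive rewrites every IU that a host write touches; the function name and that rewrite rule are assumptions for this example, not a description of any specific drive’s firmware.

```python
def iu_write_amplification(io_kb: int, iu_kb: int = 16, offset_kb: int = 0) -> float:
    """Ratio of data rewritten internally to data written by the host."""
    first_iu = offset_kb // iu_kb
    last_iu = (offset_kb + io_kb - 1) // iu_kb
    touched_ius = last_iu - first_iu + 1
    return touched_ius * iu_kb / io_kb


small_io = iu_write_amplification(4)                  # 4 KB write, 16 KB IU
aligned_io = iu_write_amplification(16)               # exactly one IU
unaligned_io = iu_write_amplification(16, offset_kb=8)  # straddles two IUs
```

Under this assumption a 4 KB write costs four times its size internally, a 16 KB aligned write costs nothing extra, and a 16 KB write that straddles two IUs costs twice its size, which is why write size and alignment matter on such drives.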
Clearly, NAND-based SSDs are still fragile. However, these problems can be avoided as long as the drives are used correctly. Taking a commonly used QLC drive as an example, the following two sets of endurance and performance figures show that sequential writes offer 5 times the endurance of random writes and about 26 times the throughput:
Endurance: 0.9 DWPD for sequential writes versus 0.18 DWPD for random 4K writes;
Performance: 1600 MB/s for sequential writes versus 15K IOPS (60 MB/s) for random 4K writes.
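The ratios quoted above follow directly from the two sets of figures, as the short calculation below shows. The DWPD, MB/s, and IOPS values come from the text; the 7.68 TB capacity and the 5-year period used to turn DWPD (drive writes per day) into total writes are hypothetical values chosen only for illustration.

```python
SEQ_DWPD = 0.9          # sequential-write endurance from the text
RAND_DWPD = 0.18        # random 4K write endurance from the text
SEQ_MB_S = 1600         # sequential-write throughput from the text
RAND_IOPS = 15_000      # random 4K write IOPS from the text
IO_KB = 4


def lifetime_writes_tb(dwpd: float, capacity_tb: float, years: float) -> float:
    """Total data that can be written over the period at the given DWPD."""
    return dwpd * capacity_tb * 365 * years


rand_mb_s = RAND_IOPS * IO_KB / 1000        # 15K IOPS x 4 KB = 60 MB/s
endurance_ratio = SEQ_DWPD / RAND_DWPD      # about 5
throughput_ratio = SEQ_MB_S / rand_mb_s     # roughly 26.7

# Hypothetical drive: 7.68 TB capacity over a 5-year period.
seq_total_tb = lifetime_writes_tb(SEQ_DWPD, 7.68, 5)
rand_total_tb = lifetime_writes_tb(RAND_DWPD, 7.68, 5)
```

For this hypothetical drive, the same hardware could absorb roughly 12,600 TB of sequential writes but only about 2,500 TB of random 4K writes over the period.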
The analysis above shows how important it is to keep the drive in its optimal working state. The good news is that some advanced solutions, such as Lightbits’ all-NVMe clustered storage solution, can do this. By converting random I/O into sequential I/O and applying its Elastic RAID technology, the solution avoids the drawbacks of RAID “read-modify-write” and can greatly improve both the endurance and the random I/O performance of the drives.
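A common, generic way to turn random writes into sequential ones is a log-structured layout: every write is appended to the end of a log, and an index maps each logical block address to its latest position. The sketch below illustrates only that general idea, in memory and with hypothetical class and method names; it is not a description of how Lightbits implements it.

```python
from typing import Dict, List, Tuple


class LogStructuredStore:
    """Toy append-only store: random logical writes become sequential appends."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, bytes]] = []   # append-only physical log
        self.index: Dict[int, int] = {}          # logical block -> log position

    def write(self, lba: int, data: bytes) -> int:
        """Append the data and return the physical position it landed on."""
        position = len(self.log)
        self.log.append((lba, data))
        self.index[lba] = position               # older copies become stale
        return position

    def read(self, lba: int) -> bytes:
        """Return the most recent data written for this logical block."""
        return self.log[self.index[lba]][1]

    def stale_entries(self) -> int:
        """Log entries superseded by newer writes (to be cleaned up later)."""
        return len(self.log) - len(self.index)


store = LogStructuredStore()
random_lbas = [907, 3, 512, 3]                   # scattered logical addresses
positions = [store.write(lba, f"v{i}".encode()) for i, lba in enumerate(random_lbas)]
```

Although the logical addresses are scattered, the physical positions come out as 0, 1, 2, 3, so the drive sees one sequential stream. The price is that overwritten entries must be cleaned up later by the storage layer.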
3. How to reduce the cost of use
SSDs are a newer technology than HDDs, and with the industry’s production capacity lagging behind demand, they are still expensive compared with HDDs. Reducing the cost of SSDs is therefore very important.
The most important step in reducing cost is to make full use of the SSD, in both capacity and performance. At present, however, most NVMe drives are installed directly in the application server. This approach easily wastes a great deal of capacity and performance, because only applications on that server can use the drive. According to surveys, SSD utilization with this DAS (Direct Attached Storage) approach is about 15%-25%.
A better answer to this problem is the “decoupled” (disaggregated) architecture that has gained wide acceptance in the market in recent years. After decoupling, all NVMe drives form one large storage resource pool, from which each application server takes only as much as it needs. As long as the total capacity is sufficient, utilization can easily be pushed to 80%. In addition, because resources are centralized, there are more ways to reduce cost, such as compression. For example, if application data compresses at 2:1 on average, usable capacity is effectively doubled and the price per GB is halved. Of course, compression brings its own problems: it consumes CPU, and many storage solutions lose a great deal of performance once compression is enabled.
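The combined effect of utilization and compression on cost can be estimated with a simple model: the effective price of each GB actually holding application data is the raw price divided by the utilization and by the compression ratio. The utilization figures and the 2:1 ratio come from the text; the raw price of 0.10 per GB and the function name are hypothetical, used only to make the arithmetic concrete.

```python
def cost_per_effective_gb(raw_price_per_gb: float, utilization: float,
                          compression_ratio: float = 1.0) -> float:
    """Raw price spread over the capacity that actually stores application data."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_price_per_gb / (utilization * compression_ratio)


RAW_PRICE = 0.10  # hypothetical raw price per GB, in any currency

das_cost = cost_per_effective_gb(RAW_PRICE, utilization=0.20)
pooled_cost = cost_per_effective_gb(RAW_PRICE, utilization=0.80)
pooled_compressed_cost = cost_per_effective_gb(
    RAW_PRICE, utilization=0.80, compression_ratio=2.0
)
```

With these numbers, moving from 20% to 80% utilization cuts the effective cost to a quarter, and 2:1 compression halves it again, for an eightfold reduction overall.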
Lightbits’ NVMe/TCP clustered storage solution addresses the compression problem with its storage acceleration card. The card can compress at 100 Gb/s line rate without consuming CPU or adding latency, so compression comes at almost no additional cost. In addition, as described in the section on endurance, the Lightbits solution can extend drive service life and supports the use of QLC drives, which also greatly reduces cost over the whole service cycle. In short, by decoupling to improve utilization, compressing to increase usable capacity, and optimizing writes to extend service life or enable QLC, the cost of SSDs can be kept well under control.
The analysis above, covering performance, endurance, and cost of use, shows that making good use of NVMe SSDs is not easy, so choosing a good storage solution is very important for ordinary users. To this end, Lightbits, an innovative Israeli company, has made it its mission to unlock the full value of NVMe drives: it invented the NVMe/TCP protocol and launched a new generation of all-NVMe clustered storage solutions that help users make use of SSDs easily.
Editor in charge: PJ