In recent years, storage media have developed rapidly, and the performance of a single device keeps rising: from mechanical hard disks delivering fewer than 100 IOPS to NVMe SSDs that can reach 500,000 IOPS. CPU speed, in contrast, has not improved nearly as much. According to statistics from Red Hat, a single disk has gone from tens of IOPS to about 500,000 IOPS, while CPU clock frequency has grown far more slowly. As a result, the number of CPU clock cycles available for each IO differs greatly by medium. For example, a CPU can spend about 20 million clock cycles on each HDD IO, but only about 6,000 on each NVMe IO.
The chart and data are from Red Hat.
With a CPU clocked at 3 GHz, you can afford:
HDD: ~20 million cycles/IO
SSD: 300,000 cycles/IO
NVMe: 6,000 cycles/IO
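The cycle budgets above follow from dividing the clock rate by the IOPS one core has to sustain. The short sketch below reproduces that arithmetic. The per-device IOPS values are assumptions, chosen as the round numbers implied by the budgets rather than figures quoted from Red Hat.

```python
CPU_HZ = 3_000_000_000  # 3 GHz clock, as in the list above

# Assumed round-number IOPS per device class (inferred from the budgets, not measured).
DEVICE_IOPS = {"HDD": 150, "SSD": 10_000, "NVMe": 500_000}


def cycles_per_io(cpu_hz: int, iops: int) -> int:
    """CPU cycles one core can spend on each IO while keeping the device busy."""
    return cpu_hz // iops


budgets = {name: cycles_per_io(CPU_HZ, iops) for name, iops in DEVICE_IOPS.items()}

for name, cycles in budgets.items():
    print(f"{name}: {cycles:,} cycles/IO")
```

The faster the device, the smaller the per-IO budget, which is why software overhead that was negligible on HDD becomes the bottleneck on NVMe.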
In this context, the key to high-performance storage software is using the CPU efficiently. The IOPS delivered per CPU core is a key metric of a storage system.
As an important branch of today's storage systems, distributed storage has become a hot spot in the storage field because it is software defined and can therefore adapt better and faster to new hardware.
Let us briefly look at how open source distributed storage is adapting to new hardware. When it comes to open source distributed storage, Ceph, the star open source project, inevitably comes to mind. Ceph is widely used at home and abroad, and its scalability and stability are widely recognized.
Over the course of Ceph's development, its local storage engine, ObjectStore, has gone through two generations: from the original FileStore to BlueStore, which is now widely used. However, both engines have shortcomings on high-performance storage media such as NVMe SSDs.
The Ceph community therefore proposed a new generation of local storage engine, SeaStore, in 2018. Readers interested in the details can visit https://docs.ceph.com/docs/master/dev/seastore/ .
The following is a brief interpretation of the SeaStore design based on my personal understanding.
SeaStore design goals
1. Target NVMe devices; PMEM and HDD are not the primary concern.
2. Use SPDK for user-space IO.
3. Use the Seastar framework's future and promise model to implement run-to-completion.
4. Combine with Seastar's network messaging layer to achieve zero-copy (or minimal-copy) read and write paths.
For goals 1 and 2, I think the NVMe-oriented design mainly means that current NVMe devices can be driven from user space: by using a polling model and bypassing the kernel, IO latency can be reduced significantly.
At the same time, the GC problem caused by flash's erase-before-write behavior needs to be solved efficiently for NVMe devices. In theory, discard serves as a hint mechanism through which upper-layer software can prompt the underlying flash to schedule GC. In practice, the discard implementation of most flash devices is far from ideal, so the upper-layer software has to get involved.
Goals 3 and 4 approach the underlying storage design from the perspective of reducing CPU overhead. As mentioned above, the CPU becomes the bottleneck in high-performance distributed storage, so improving effective CPU utilization is a key consideration.
For goal 3, the run-to-completion approach avoids the overhead of thread switching and locking, which improves effective CPU utilization. Intel once reported that, with a 4 KB block size, a single thread using SPDK can deliver up to 10.39 million IOPS, which shows how effectively single-threaded asynchronous programming can use the CPU.
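To make the run-to-completion idea concrete, here is a minimal single-threaded sketch: requests are submitted without blocking, and the same thread polls for completions and runs their callbacks inline, so no locks or context switches are involved. `FakeDevice` is a hypothetical stand-in for a user-space NVMe queue; it is not SPDK's or Seastar's API.

```python
from collections import deque


class FakeDevice:
    """Hypothetical stand-in for a user-space NVMe queue pair (not SPDK's API)."""

    def __init__(self):
        self._inflight = deque()

    def submit(self, lba, callback):
        # Queue the request and return immediately; nothing blocks.
        self._inflight.append((lba, callback))

    def poll(self, max_completions=8):
        # Reap up to max_completions requests and run their callbacks inline.
        done = 0
        while self._inflight and done < max_completions:
            lba, callback = self._inflight.popleft()
            callback(lba)
            done += 1
        return done


def run_to_completion(device, lbas):
    """Submit all requests, then poll on the same thread until every callback ran."""
    completed = []
    for lba in lbas:
        device.submit(lba, completed.append)
    while len(completed) < len(lbas):
        device.poll()
    return completed


result = run_to_completion(FakeDevice(), [0, 8, 16])
print(result)
```

Because submission, polling, and callbacks all run on one thread, the shared `completed` list needs no locking.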
Goal 4, however, requires tight integration with the network model, the consistency protocol, and so on to achieve zero copy and reduce the number of memory copies within the modules.
Based on these goals, the SeaStore design focuses on optimizing GC for NVMe devices through a segment-based data layout, and on the handling required when the upper layer controls GC. The document also mentions storing metadata in a B-tree, rather than in RocksDB as BlueStore does.
I think, however, that this design may be difficult to implement. RADOS today stores not only data but also a large amount of metadata, such as omap and xattrs. Storing this small key-value information in a newly written B-tree amounts to implementing a dedicated small key-value database, which is very hard to do well. And if a simple B-tree is used directly, the engine runs into the same dilemma that file systems such as XFS face when storing metadata: it cannot hold a large number of xattrs.
The above is only a brief description of Ceph's design concept for its next-generation storage engine. Waiting for a concrete open source implementation would likely mean 3 to 5 years of development.
Our need for high-performance distributed storage is urgent, so the Sangfor storage team independently designed a new storage engine to meet it.
Next, let us briefly introduce how the Sangfor Enterprise Distributed Storage (EDS) team approaches high-performance local storage.
Design and implementation of PFStore
PFStore (Phoenix Fast Store) is a user-space local storage engine built on SPDK and developed by the EDS team. Its core architecture is shown in the figure below.
The system has two core modules: data management and metadata management. Their responsibilities and technical characteristics are described below.
Responsibilities of the data management module:
The segment is the basic unit of space management. All data is written by appending, so the underlying SSD sees only sequential writes. This gets as much performance out of each SSD as possible and reduces the SSD's own overhead.
The whole store is built on SPDK's spdk_thread programming model, and the entire IO path completes within a single thread, so there is no thread switching or locking overhead.
Fully asynchronous programming does make development harder, so during development we refined an asynchronous programming method based on state machines, with a sub state machine in each subsystem. This preserves the system's high performance while keeping it maintainable and extensible.
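The state-machine style of asynchronous programming described above can be sketched as follows. This is an illustration only: the state names and the `WriteOp` class are hypothetical and are not taken from PFStore. Each step does a small amount of work and returns, so one thread can interleave many in-flight requests.

```python
from enum import Enum, auto


class State(Enum):
    ALLOC = auto()       # reserve space in the current segment
    WRITE_DATA = auto()  # append the data
    WRITE_META = auto()  # record the metadata change
    DONE = auto()


class WriteOp:
    """One write request driven by a state machine; step() never blocks."""

    TRANSITIONS = {
        State.ALLOC: State.WRITE_DATA,
        State.WRITE_DATA: State.WRITE_META,
        State.WRITE_META: State.DONE,
    }

    def __init__(self, name):
        self.name = name
        self.state = State.ALLOC
        self.trace = []

    def step(self):
        # A real engine would issue an async request here and resume from
        # its completion callback; this sketch only records the transition.
        self.trace.append(self.state.name)
        self.state = self.TRANSITIONS[self.state]
        return self.state is State.DONE


def run(ops):
    """Advance every pending op by one step per pass until all are done."""
    pending = list(ops)
    while pending:
        pending = [op for op in pending if not op.step()]


ops = [WriteOp("a"), WriteOp("b")]
run(ops)
print(ops[0].trace)
```

Keeping the transitions in one table makes the control flow of each request explicit, which is the maintainability benefit the text refers to.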
Responsibilities of the metadata management module:
For metadata storage, we borrow the RocksDB approach of BlueStore in Ceph and likewise use a simplified user-space file system, PLFS, as the backend for RocksDB.
Our main consideration here was the difference between the LSM and B-tree storage structures and the benefits each brings. For a storage engine with frequent writes and relatively simple metadata management (the local storage engine exposes an object storage interface, and its metadata hierarchy is flat), an LSM helps improve write performance.
Here are the technical features of the local storage engine.
Technical features of the metadata engine:
A self-developed journal replaces RocksDB's WAL. Metadata changes are written to the journal incrementally and reach RocksDB only when they are flushed at a later time. This improvement rests on the following reasons:
a) PFStore uses a fully asynchronous programming model, while RocksDB's interface is synchronous and would block the entire IO path, resulting in poor performance.
b) Journal data is updated incrementally, so updates can be aggregated before being written, which reduces the number of metadata writes.
c) The journal records not only object metadata updates but also serves as the log of the distributed consistency protocol (similar to the PG log in Ceph). Combining these functions reduces the number of data writes.
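The aggregation effect in point b) can be illustrated with a small sketch. It assumes a plain dict as a stand-in for RocksDB, and the `MetadataJournal` class and its methods are hypothetical names, not PFStore's interface. Repeated updates to the same key are journaled individually but reach the key-value store only once per flush.

```python
class MetadataJournal:
    """Append-only journal that coalesces updates before flushing to a KV store."""

    def __init__(self, kv_store):
        self.kv = kv_store   # stand-in for RocksDB: a plain dict
        self.log = []        # incremental journal entries, in arrival order
        self.pending = {}    # latest value per key, not yet flushed

    def append(self, key, value):
        # Every change is journaled, but only the newest value per key is kept.
        self.log.append((key, value))
        self.pending[key] = value

    def flush(self):
        """Apply aggregated updates to the KV store; return the number of KV writes."""
        writes = len(self.pending)
        self.kv.update(self.pending)
        self.pending.clear()
        self.log.clear()     # journal space can be trimmed after the flush
        return writes


kv = {}
journal = MetadataJournal(kv)
journal.append("obj1.size", 4096)
journal.append("obj1.size", 8192)   # supersedes the previous update
journal.append("obj2.size", 512)
journaled = len(journal.log)
kv_writes = journal.flush()
print(journaled, kv_writes, kv)
```

Three journal entries result in two key-value writes here; the more often a key is rewritten between flushes, the larger the saving.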
RocksDB's compaction processing is also improved: while data is being merged, a B-tree-based location index is built in memory. When data is read from RocksDB, its location can be obtained directly from the index and passed to the asynchronous data read interface. This turns RocksDB's originally synchronous read path into an asynchronous one, which significantly improves single-thread read capability; we measured a 5x to 8x improvement.
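The read path described above, an index lookup followed by an asynchronous read, might look roughly like the sketch below. A sorted list searched with `bisect` stands in for the in-memory B-tree, and `submit_read` is a hypothetical callback-based read function; neither is PFStore's or RocksDB's actual interface.

```python
import bisect


class LocationIndex:
    """Sorted in-memory index: key -> (offset, length). Stands in for the B-tree."""

    def __init__(self):
        self._keys = []
        self._locs = []

    def insert(self, key, offset, length):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._locs[i] = (offset, length)      # overwrite an existing entry
        else:
            self._keys.insert(i, key)
            self._locs.insert(i, (offset, length))

    def lookup(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._locs[i]
        return None


def read_async(index, key, submit_read, on_done):
    """Resolve the location from the index, then hand off to an async read."""
    loc = index.lookup(key)
    if loc is None:
        on_done(None)
        return False
    offset, length = loc
    submit_read(offset, length, on_done)  # returns at once; on_done runs later
    return True


# Usage with an in-memory buffer standing in for the SSD.
media = b"aaaabbbbbbcc"
index = LocationIndex()
index.insert("obj1", 0, 4)
index.insert("obj2", 4, 6)
results = []
read_async(index, "obj2", lambda off, ln, cb: cb(media[off:off + ln]), results.append)
print(results)
```

The point of the design is that the lookup itself never touches the disk, so the only IO on the read path is one that can be issued asynchronously.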
Technical features of the data storage engine:
For data, append-only writes get more performance out of each SSD. They also allow richer management of the logical data space and provide the basis for advanced storage features such as snapshots, clones, compression, and deduplication.
This also means we must implement space reclamation (GC) in the store layer, and the design of space reclamation is critical to the performance and stability of the storage system.
In PFStore, we use tiered data management to reduce the impact of space reclamation on the performance and lifetime of the underlying SSD.
As shown in the figure above, we divide data into three tiers:
Hot data, which is updated frequently, is stored in the data area of SSD Zone1.
Warm data, which is updated less frequently, is stored in the data area of SSD Zone2.
Cold data, which is updated least, is stored in the data area of SSD Zone3.
Each zone is composed of multiple segments. With tiered management, data that is likely to be invalidated at a similar rate is stored together, so the segments chosen for reclamation contain less live data. Less data has to be moved during space reclamation, so normal business traffic is affected less. Less data also has to be written back, which effectively reduces the impact of space reclamation on SSD lifetime.
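As a rough illustration of the tiering above, the sketch below assigns objects to zones by update count. The thresholds and the use of a simple counter as the temperature signal are assumptions made for the example; the article does not say how PFStore classifies data.

```python
def pick_zone(update_count, hot_threshold=100, warm_threshold=10):
    """Map an object's update count to a zone (thresholds are illustrative)."""
    if update_count >= hot_threshold:
        return "Zone1"   # hot: updated frequently
    if update_count >= warm_threshold:
        return "Zone2"   # warm: updated less frequently
    return "Zone3"       # cold: updated least


def place_objects(update_counts):
    """Return the zone chosen for each object."""
    return {obj: pick_zone(count) for obj, count in update_counts.items()}


placement = place_objects({"index": 500, "profile": 25, "archive": 1})
print(placement)
```

Grouping objects this way means a hot segment tends to become mostly garbage quickly, while a cold segment stays mostly live and is rarely touched by reclamation.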
With data partitioned this way, space reclamation uses the segment as its unit of execution, and a flexible reclamation policy can select the segments that are cheapest to reclaim (those with the least data to move). This has less impact on the performance of upper-layer business and, combined with the SSD's discard command, also reduces the impact on SSD lifetime.
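One simple policy matching this description is a greedy one: reclaim the segments with the least live data first until enough space is freed. The sketch below shows that policy under stated assumptions; the `Segment` fields and the `pick_victims` function are hypothetical, and PFStore's actual strategy may weigh other factors.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: int
    capacity: int     # bytes the segment can hold
    live_bytes: int   # bytes still referenced, which must be moved before reuse


def pick_victims(segments, bytes_needed):
    """Greedy selection: cheapest segments first until enough net space is freed.

    Returns (victim segment ids, total live bytes that must be moved).
    """
    victims, reclaimed, moved = [], 0, 0
    for seg in sorted(segments, key=lambda s: s.live_bytes):
        if reclaimed >= bytes_needed:
            break
        victims.append(seg.seg_id)
        reclaimed += seg.capacity - seg.live_bytes   # net space gained
        moved += seg.live_bytes                      # cost: data rewritten
    return victims, moved


segments = [Segment(1, 100, 90), Segment(2, 100, 10), Segment(3, 100, 40)]
victims, moved = pick_victims(segments, bytes_needed=120)
print(victims, moved)
```

In this example, segments 2 and 3 are chosen and only 50 bytes are rewritten, whereas starting with segment 1 would have moved 90 bytes to gain just 10.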
During space reclamation, PFStore cooperates with its QoS module to choose the right moment to reclaim, minimizing the impact on normal business and avoiding performance jitter and glitches in the distributed system.
Each PFStore runs in a single thread using the run-to-completion programming model, which both simplifies the system and eliminates performance jitter. In practice, however, a single CPU core cannot drive an NVMe SSD to its full performance, so we start multiple independent threads in one process. Each thread is bound to one PFStore, and different PFStores manage the physical space of different SSDs, as shown in the figure below.
For a distributed system, the local storage engine is only a basic component; it delivers its value only in cooperation with other modules such as the consistency protocol and network transport. After proposing the SeaStore draft in 2018, the Ceph community proposed a new OSD, Crimson, in 2019 to meet the needs of high-performance scenarios. Interested readers can read "Crimson: A New Ceph OSD for the Age of Persistent Memory and Fast NVMe Storage".
In recent years, great progress has been made in the storage field. As non-volatile memory, SCM, and other hardware technologies mature and are deployed, protocols such as NVMe-oF are gradually gaining support in operating systems. Meanwhile, many good ideas from academia are driving software innovation and rapid development in distributed storage.
Going forward, the Sangfor EDS team will continue to focus on improving the software architecture to better fit the hardware and improve the overall cost-effectiveness of the system.