Author: Lightbits Labs

NVMe has changed the storage industry since it emerged as the leading protocol for high-performance solid-state drives (SSDs).

NVMe was originally designed for high-performance direct-attached PCIe SSDs, and was later extended as NVMe over Fabrics (NVMe-oF) to support rack-scale remote SSD pools.

It is widely expected that NVMe-oF will replace the iSCSI protocol as the communication standard between compute servers and storage servers and become the default protocol for disaggregated storage.

However, the initial deployment options for NVMe-oF were limited to Fibre Channel and remote direct memory access (RDMA) fabrics.

What if a new technology could deliver the speed and performance of NVMe without the high deployment cost and complexity?

NVMe over TCP (NVMe/TCP) uses the simple and efficient TCP/IP stack to extend NVMe across the entire data center.

This article describes how NVMe/TCP is a better fit for existing data centers and the advantages it provides. These advantages include:

● Support for disaggregation across data center availability zones

● Use of ubiquitous TCP transport together with the low-latency, highly parallel NVMe stack

● No changes required on the application server

● A high-performance NVMe-oF solution with performance and latency comparable to direct-attached SSDs (DAS)

● An efficient, simplified block storage network software stack optimized for NVMe

● Parallel access to storage, optimized for today's multi-core application and client servers

● Standard NVMe-oF control path operation

1. Overview of NVMe/TCP

The NVMe specification has become the standard protocol for high-performance SSDs.

Unlike the SCSI, iSCSI, SAS, or SATA interfaces, NVMe implements a streamlined command model and a multi-queue architecture optimized for multi-core server CPUs. The NVMe-oF specification extends NVMe so that PCIe SSDs can be shared over a network, and it was originally implemented over RDMA fabrics. Today, Lightbits Labs is working with Facebook, Intel, and other industry leaders to extend the NVMe-oF standard to support TCP/IP transport as a complement to the RDMA fabrics.

A disaggregated storage scheme based on NVMe/TCP is simple and efficient. TCP is ubiquitous, scalable, and reliable, making it an ideal choice for short-lived connections and container-based applications.

In addition, migrating to shared flash via NVMe/TCP does not require changing the data center's network infrastructure. No infrastructure changes mean easy deployment across data centers, since almost all data center networks are designed to support TCP/IP.

The broad industry collaboration behind the NVMe/TCP protocol means it was designed with an extensive ecosystem from the start, supporting any operating system and network interface card (NIC). The NVMe/TCP Linux driver is part of the mainline Linux kernel and uses the standard Linux network stack and NICs without any modification.
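As a concrete illustration, the sketch below shows one way a host could load the in-kernel NVMe/TCP initiator and connect to a remote target with the standard nvme-cli tool; the target address, port, and subsystem NQN are hypothetical placeholders, not values taken from this article.

```python
# Minimal sketch: attach a Linux host to an NVMe/TCP target using nvme-cli.
# The address, port, and NQN below are hypothetical placeholders.
import subprocess

TARGET_ADDR = "192.168.1.100"                 # storage server IP (placeholder)
TARGET_PORT = "4420"                          # conventional NVMe/TCP port
SUBSYS_NQN = "nqn.2019-08.example:subsys1"    # placeholder subsystem NQN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Load the NVMe/TCP initiator module shipped with the mainline kernel.
run(["modprobe", "nvme-tcp"])

# Discover the subsystems exported by the target.
run(["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT])

# Connect; the remote namespace then appears as a local /dev/nvmeXnY block device.
run(["nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT,
     "-n", SUBSYS_NQN])
```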

This promising new protocol is tailored for hyperscale data centers and can be deployed easily without changing the underlying network infrastructure.

Figure 1: NVMe/TCP integrates seamlessly with the existing NVMe stack in the Linux kernel

2. How data centers handle storage today

2.1 Direct-attached storage architecture and NVMe

The NVMe storage protocol is designed to extract the full performance of solid-state drives (SSDs).

The parallelism built into the NVMe protocol helps achieve that performance. Instead of the single-queue iSCSI model, NVMe supports up to 64,000 queues between the CPU subsystem and storage.

An SSD is itself a parallel device that uses multiple channels to reach many flash storage locations, which means it can absorb data efficiently in massively parallel streams. Before the advent of NVMe/TCP, the simplest way to exploit this parallelism was to install NVMe SSDs directly in the application server, in other words, to build the storage infrastructure in DAS mode.
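As a simple illustration of this parallelism, the hedged sketch below issues reads against an NVMe block device from several threads at once, which is how applications typically keep multiple NVMe queues busy; the device path, block size, and thread count are assumptions chosen only for the example.

```python
# Illustrative sketch: read from an NVMe namespace with several threads in parallel.
# Device path, offsets, block size, and thread count are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

DEVICE = "/dev/nvme0n1"   # hypothetical local NVMe namespace (needs read permission)
BLOCK_SIZE = 4096
NUM_WORKERS = 8           # several threads keep several NVMe queues busy

fd = os.open(DEVICE, os.O_RDONLY)
offsets = [i * BLOCK_SIZE for i in range(1024)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    blocks = list(pool.map(lambda off: os.pread(fd, BLOCK_SIZE, off), offsets))

os.close(fd)
print(f"read {len(blocks)} blocks of {BLOCK_SIZE} bytes in parallel")
```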

With the DAS approach, applications benefit from:

● Multiple CPUs

● The parallel SSD architecture

The challenge for the industry is to move SSDs out of stand-alone servers, where capacity may sit idle, into shared storage solutions with higher infrastructure utilization, without losing the performance benefits of DAS. The goal of every NVMe disaggregation technology is therefore to deliver DAS performance in a shared NVMe solution.

2.2 Previous-generation IP-based storage architecture

Previously, the iSCSI standard was the only option for connecting to block storage over a TCP/IP network. It was developed at the turn of the century, when most processors were single-core devices.

In SCSI, there is a single connection between the application (initiator) and the storage (target). In iSCSI, a single TCP socket connects the client to the block storage server.

Today, data center processors are massively parallel, multithreaded devices. The complexity of modern processors demanded a radical overhaul of the available storage protocols, and NVMe emerged as the replacement for SATA and SAS.

All of those earlier protocols were designed around serial, rotating disk drives.

Non-volatile memory (NVM) is a parallel storage technology: it does not depend on platters rotating under a head or set of heads. With NVM storage devices, many storage cells can be accessed in parallel with low latency.

iSCSI certainly remains suitable for applications with low to moderate storage performance requirements. However, it cannot meet the needs of I/O-intensive applications that must deliver low latency at large scale.

2.3 Other alternatives and the NVMe/TCP disaggregation approach

RDMA, RDMA over Converged Ethernet (RoCE), and NVMe over Fibre Channel (NVMe over FC) are other network storage protocols that attempt to solve the disaggregation problem. However, these alternatives require expensive, specialized hardware at both ends (application server and storage server), such as RDMA-capable NICs. And even after the RDMA hardware is installed, configuring and managing flow control across an RDMA-capable switch fabric remains very complicated.

RDMA does deliver performance in some high-performance computing environments, but at higher cost and with very complex deployment.

TCP/IP has been proven to work reliably and efficiently at very large scale. NVMe/TCP inherits this reliability and efficiency. It can coexist with RDMA as a complementary solution or replace it entirely.

3. Flash disaggregation and the NVMe/TCP solution in the data center

In a DAS environment, drives are purchased before they are deployed in servers, or together with the servers, and their capacity utilization grows slowly over time. Moreover, to avoid the embarrassment of running out of storage, DAS capacity is often deliberately over-provisioned.

In contrast, data centers that separate storage from compute servers are more efficient. Storage capacity can be expanded independently and allocated to compute servers as needed.

As the cost per GB of flash falls, the disaggregated storage approach becomes even more economical, and the initial cost of a data center deployment is much lower. By allocating storage resources dynamically, the overhead of over-provisioning is avoided and total cost can be greatly reduced.
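As a rough, purely hypothetical illustration of why pooling helps, the sketch below compares the raw capacity needed when every server is over-provisioned individually against a shared pool sized for aggregate demand; all of the numbers are invented for the example.

```python
# Hypothetical comparison: per-server over-provisioning vs. a shared storage pool.
# All figures are illustrative, not measurements.
servers = 100
avg_used_tb_per_server = 2.0          # capacity each server actually consumes
das_tb_per_server = 8.0               # DAS sized generously for worst-case growth

das_total = servers * das_tb_per_server
das_utilization = (servers * avg_used_tb_per_server) / das_total

pool_headroom = 1.25                  # shared pool needs only modest aggregate headroom
pool_total = servers * avg_used_tb_per_server * pool_headroom
pool_utilization = (servers * avg_used_tb_per_server) / pool_total

print(f"DAS:  {das_total:.0f} TB purchased, {das_utilization:.0%} utilized")
print(f"Pool: {pool_total:.0f} TB purchased, {pool_utilization:.0%} utilized")
```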

NVMe/TCP solutions unlock the potential of cloud infrastructure built on disaggregated high-performance solid-state drives (SSDs). They let the data center move from the inefficient direct-attached SSD model to a shared model in which compute and storage scale independently, maximizing resource utilization and operational flexibility.

This new shared model uses the innovative NVMe/TCP standard. Lightbits Labs originated the concept and is leading the development of the new standard.

NVMe/TCP does not hurt application performance. In fact, it typically improves application tail latency, which improves the user experience and lets cloud service providers support more users on the same infrastructure. It requires no changes to the data center network infrastructure or to application software. It also reduces the data center's total cost of ownership (TCO) and makes very large data centers easier to maintain and scale. Lightbits Labs is working with other market leaders to drive broad industry adoption of the standard.

NVMe/TCP uses standard Ethernet network topologies to scale compute and storage independently, achieving the highest resource utilization and lowering TCO.

Figure 2: The transition from direct-attached storage (DAS) to disaggregated storage and compute

4. Lightbits Labs: deploying NVMe/TCP in the data center

Lightbits Labs solutions provide the following performance advantages:

● Up to 50% lower tail latency compared to direct-attached storage (DAS)

● Doubled SSD capacity utilization

● 2-4x higher data service performance

● Scaling to tens of thousands of nodes

● Millions of IOPS with average latency under 200 μs

The Lightbits solution delivers the following improvements without affecting system stability or security:

● Physical separation of application servers from their storage

- Independent deployment, scaling, and upgrades

- Storage infrastructure that can grow faster than compute infrastructure

- Higher efficiency for both application servers and storage

- Simpler management and lower TCO through independent lifecycle management of application server and storage hardware

● High performance and low latency comparable to internal NVMe SSDs

● Use of the existing network infrastructure without changes

● Disaggregation across multi-hop data center network architectures

Figure 3: NVMe/TCP can connect storage nodes to application servers across the data center

5. How the Lightbits storage solution works

Lightbits Labs provides a platform for cloud and data center infrastructure.

Cloud-scale infrastructure becomes extremely complex when tens of thousands or hundreds of thousands of compute nodes each lock islands of direct-attached storage inside individual physical nodes.

The Lightbits solution unlocks the potential of disaggregated high-performance SSDs. It enables the data center to move from the inefficient direct-attached SSD model to a shared model in which compute and storage scale independently, maximizing resource utilization and flexibility.

When Lightbits Labs invented NVMe/TCP, we kept the NVMe model used by DAS devices and mapped it onto the industry-standard TCP/IP protocol suite. NVMe/TCP maps multiple parallel NVMe I/O queues onto multiple parallel TCP/IP connections. This pairing of NVMe and TCP yields a simple, standards-based, end-to-end parallel architecture.
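For instance, with the standard nvme-cli initiator the number of I/O queues per controller, and therefore the number of parallel TCP connections, can be requested at connect time; the sketch below is illustrative only, and the address, NQN, and queue count are placeholder assumptions.

```python
# Sketch: request a specific number of I/O queues when connecting over NVMe/TCP.
# Each I/O queue rides on its own TCP connection; all values are placeholders.
import subprocess

subprocess.run([
    "nvme", "connect",
    "-t", "tcp",
    "-a", "192.168.1.100",                  # target address (placeholder)
    "-s", "4420",                           # conventional NVMe/TCP port
    "-n", "nqn.2019-08.example:subsys1",    # placeholder subsystem NQN
    "-i", "8",                              # ask for 8 I/O queues, i.e. 8 parallel TCP connections
], check=True)
```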

Figure 4: NVMe/TCP for a parallel cloud architecture

This new shared model uses the innovative NVMe/TCP standard, which does not compromise latency and requires no changes to the network infrastructure or to application server software. Lightbits Labs is working with other market leaders to drive adoption of the new NVMe/TCP standard.

With the Lightbits Labs disaggregated storage solution, storage can be provisioned to application servers in a simplified way. Thin provisioning means administrators can assign volumes of any size to clients, while the underlying storage capacity is consumed only when the application server actually writes data. Storage is therefore used only at the last moment, when it is needed. This defers the purchase of additional storage resources, further reducing costs. Lightbits also provides hardware-accelerated data services that run at line rate.

As a result, combining Lightbits thin provisioning with hardware-accelerated data services can reduce storage cost to a fraction of the cost of a DAS solution with equivalent performance.
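To illustrate the thin-provisioning concept on its own (this is a generic sketch of the idea, not the Lightbits implementation), the code below tracks the gap between the logical capacity promised to clients and the physical capacity actually consumed by writes.

```python
# Generic thin-provisioning sketch: logical capacity promised vs. physical capacity used.
# This models the concept only; it is not the Lightbits implementation.
class ThinPool:
    def __init__(self, physical_tb):
        self.physical_tb = physical_tb
        self.logical_provisioned_tb = 0.0   # sum of volume sizes handed to clients
        self.physical_used_tb = 0.0         # capacity consumed by actual writes

    def create_volume(self, size_tb):
        # Creating a volume reserves no physical space up front.
        self.logical_provisioned_tb += size_tb

    def write(self, new_data_tb):
        # Physical capacity is consumed only when data is written.
        if self.physical_used_tb + new_data_tb > self.physical_tb:
            raise RuntimeError("pool exhausted: time to add physical capacity")
        self.physical_used_tb += new_data_tb

pool = ThinPool(physical_tb=100)
for _ in range(50):
    pool.create_volume(size_tb=8)     # 400 TB of volumes promised to clients
pool.write(new_data_tb=60)            # only 60 TB of physical flash consumed so far
print(pool.logical_provisioned_tb, pool.physical_used_tb)
```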

5.1 Write algorithms for flash media

Flash media latency is very low for both reads and writes. However, the flash controller on an SSD must continually perform "garbage collection" to free space for incoming writes. Unlike a hard disk drive, which can overwrite existing data in place, a flash drive can only write data to previously unwritten or erased flash blocks.

Garbage collection causes write amplification. As the name suggests, when the SSD controller performs garbage collection, a single write issued by the application server is amplified into multiple writes to the underlying flash media. Write amplification wears out the flash drive and shortens its useful life.

In addition, background garbage collection increases the latency of incoming I/O, and it grows significantly as more random writes hit the flash drive. Unfortunately, a large proportion of real-world I/O is random. In practice, this means users get neither optimal performance nor optimal flash endurance.
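A common way to quantify this effect is the write amplification factor (WAF), the ratio of bytes written to flash to bytes written by the host; the short sketch below computes it from hypothetical counters.

```python
# Write amplification factor (WAF) = bytes written to flash / bytes written by the host.
# The counter values are hypothetical, chosen only for illustration.
host_bytes_written = 1.0e12    # 1 TB written by applications
nand_bytes_written = 3.2e12    # 3.2 TB written to flash after garbage collection

waf = nand_bytes_written / host_bytes_written
print(f"write amplification factor: {waf:.1f}x")

# A higher WAF means more background writes competing with application I/O
# and faster wear-out of the flash media.
```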

The Lightbits Labs solution addresses this problem with an intelligent management layer that manages pools of SSDs at different quality-of-service (QoS) levels. This reduces SSD background operations and makes I/O faster and more efficient.

The LightOS architecture combines multiple algorithms to optimize performance and flash utilization, including data protection algorithms paired with hardware-accelerated data services and high-performance read/write algorithms. All I/O is managed and balanced across the SSD pool, which greatly improves flash utilization.

This design improves overall performance and reduces tail latency, write amplification, and SSD wear, which means LightOS can deliver the highest return on investment (ROI) for your flash storage.

5.2 High-performance data protection

Separating storage from application servers also requires data protection that is intelligent, efficient, and does not compromise performance.

Lightbits combines a high-performance data protection scheme with its hardware-accelerated data services and read/write algorithms.

In terms of how data is written to the SSD pool, Lightbits' data protection approach avoids the excess writes of traditional RAID algorithms, sparing the SSDs additional wear.

6. Summary

Lightbits Labs delivers an efficient flash disaggregation solution with the following implementation and operational advantages:

● Runs on standard TCP/IP networks, with no expensive dedicated network hardware.

● Uses TCP/IP to operate at rack scale or across one or more LANs, without protocol limitations.

● Provides performance and latency comparable to DAS, including up to 50% lower tail latency than DAS.

● Combines a high-performance data protection scheme with hardware-accelerated data services and read/write algorithms so that performance is not compromised.

● Maximizes flash efficiency with hardware-accelerated data services running at full line rate without affecting performance.

● Delivers thin-provisioned storage volumes that support a pay-as-you-need consumption model.

Lightbits is the inventor and promoter of NVMe/TCP.

Applied as a new concept, the Lightbits NVMe/TCP solution achieves efficient flash disaggregation with performance equal to or better than DAS. Lightbits has created a modern implementation of the IP storage architecture that maximizes the potential of the application server, NVMe, TCP, and the parallel SSD architecture.

With the Lightbits Labs solution, cloud-native applications achieve cloud-scale performance, and cloud data centers reduce their cloud-scale TCO.
