Storage technology is evolving rapidly. Traditional centralized SAN/NAS is still widely used, while a range of distributed storage products have taken hold in emerging massive-data scenarios. How should you choose a storage architecture based on business needs? Can distributed storage replace traditional centralized storage in every scenario? This article addresses both questions.

01 Traditional centralized storage

The hardware architecture of traditional SAN/NAS storage follows a “controller + disk enclosure” model. Mid-range and high-end arrays support multiple controllers to ensure high availability and improve performance. The controllers are tightly coupled, interconnected over a PCIe bus or an InfiniBand network, and share the disk array and cache.

Traditional centralized storage has a long history, mature technology, a simple architecture, and proven stability, and it supports high IOPS, low latency, and strong data consistency well. In addition, all-flash arrays have developed rapidly in recent years, raising IOPS to more than 100 times that of mechanical hard disk storage and effectively relieving IOPS bottlenecks.

These characteristics make traditional centralized storage a good fit for the databases of core business systems in fields such as finance and healthcare.

However, the architecture of traditional storage limits its ability to scale, and it does not cope well with highly concurrent access. As we enter the big data era, centralized storage has less and less room to grow.

02 Distributed storage

As data volumes grow rapidly, enterprises need a more flexible and scalable storage architecture.

Distributed storage is an emerging storage technology. It adopts a “standard x86 server hardware + storage software” architecture: standard x86/ARM servers are interconnected over high-speed Ethernet or InfiniBand, and distributed storage software organizes the servers' local HDDs, SSDs, and other storage media into a single large-scale storage resource pool. Distributed storage decouples storage hardware from storage software. A data center can build its storage platform on standardized hardware, improving IT agility and reducing operation and maintenance costs, which is in line with the trend toward the software-defined data center.

Distributed storage is also commonly referred to as software-defined storage (SDS).

The basic storage unit of distributed storage is an x86/ARM server, also called a node. A standard 2U storage server, for example, holds twelve 3.5-inch hard disks in its front panel.

Common hard disk capacities include 4 TB, 6 TB, 8 TB, 10 TB, 12 TB, and 16 TB.

With 10 TB hard disks, the physical capacity of a single node is 12 × 10 = 120 TB.
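
The arithmetic above can be written as a small helper. This is only a sketch: the function names are made up for illustration, decimal units (1 PB = 1000 TB) are assumed, and the 1,000-node cluster is a hypothetical example rather than a figure from any product.

```python
def node_raw_capacity_tb(disk_count: int, disk_size_tb: float) -> float:
    """Raw (physical) capacity of one storage node, in TB."""
    return disk_count * disk_size_tb


def cluster_raw_capacity_pb(node_count: int, disk_count: int, disk_size_tb: float) -> float:
    """Raw capacity of a cluster of identical nodes, in PB (1 PB = 1000 TB assumed)."""
    return node_count * node_raw_capacity_tb(disk_count, disk_size_tb) / 1000


# The 2U node from the text: 12 disks of 10 TB each.
print(node_raw_capacity_tb(12, 10))           # 120
# A hypothetical cluster of 1,000 such nodes.
print(cluster_raw_capacity_pb(1000, 12, 10))  # 120.0
```

Note that this is raw capacity only; the usable capacity is lower once data protection overhead is taken into account.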

Distributed storage effectively solves the scalability problem of traditional centralized storage. A cluster can scale to thousands of nodes, and capacity can reach hundreds of PB or even the EB level. Performance scales approximately linearly with capacity. After capacity is expanded online on demand, data is rebalanced automatically. Because multiple storage nodes serve reads and writes at the same time, aggregate throughput is high and can reach tens of GB/s.
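
The article does not say how rebalancing is implemented. One widely used technique is consistent hashing with virtual nodes, sketched below under that assumption; real products may use different placement algorithms. The class `HashRing`, the helper `stable_hash`, and the node and object names are all hypothetical. The point of the sketch is that when a node is added, only a fraction of the objects change owner, and every object that moves lands on the new node.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Map a string to a 64-bit integer that is stable across runs."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._hashes = []  # sorted ring positions
        self._owners = []  # node that owns the position at the same index
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node is placed on the ring at many positions.
        for i in range(self._vnodes):
            h = stable_hash(f"{node}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._owners.insert(idx, node)

    def locate(self, key: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, stable_hash(key)) % len(self._hashes)
        return self._owners[idx]


keys = [f"object-{i}" for i in range(10_000)]
ring = HashRing([f"node-{n}" for n in range(1, 5)])
before = {k: ring.locate(k) for k in keys}

ring.add_node("node-5")  # online expansion: 4 -> 5 nodes
after = {k: ring.locate(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.1%} of objects moved;",
      "all moved to node-5:", all(after[k] == "node-5" for k in moved))
```

With five equally weighted nodes, roughly one fifth of the objects would be expected to move; the exact share depends on the hash values.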

Distributed storage supports three types of storage service and can serve as a unified data storage platform:

1. SAN block storage, via the SCSI/iSCSI interface protocols

2. NAS file storage, via the CIFS/NFS interface protocols

3. Object storage, via the S3 interface protocol

Distributed storage protects data with multiple replicas or with erasure coding. Replication (2 or 3 replicas are the most common choices in the industry) offers high reliability and high performance, but its effective capacity utilization is low: 50% for 2 replicas and about 33% for 3 replicas. A commonly used erasure coding configuration is 8+4 (8 data blocks and 4 parity blocks), which gives a capacity utilization of about 67% (8/12). Erasure coding offers high reliability and high capacity utilization, but its performance is lower than that of replication.
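
The utilization figures above follow from simple ratios: one usable copy out of n for replication, and k data blocks out of k + m total blocks for erasure coding. The sketch below reproduces them; the function names are illustrative and not taken from any product.

```python
def replication_utilization(copies: int) -> float:
    """Usable fraction of raw capacity when every object is stored `copies` times."""
    return 1 / copies


def erasure_code_utilization(data_blocks: int, parity_blocks: int) -> float:
    """Usable fraction of raw capacity for a k+m erasure code."""
    return data_blocks / (data_blocks + parity_blocks)


def usable_capacity_tb(raw_tb: float, utilization: float) -> float:
    """Usable capacity for a given raw capacity and utilization ratio."""
    return raw_tb * utilization


schemes = {
    "2 replicas": replication_utilization(2),
    "3 replicas": replication_utilization(3),
    "EC 8+4": erasure_code_utilization(8, 4),
}

# Usable capacity of one 120 TB node's worth of raw space under each scheme.
for name, ratio in schemes.items():
    print(f"{name}: utilization {ratio:.1%}, usable {usable_capacity_tb(120, ratio):.0f} TB")
```

In terms of fault tolerance, n replicas survive the loss of n - 1 copies, and a k+m erasure code survives the loss of any m blocks, so 8+4 tolerates four lost blocks while using less raw capacity than 2 replicas.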

The general selection principles are:

1. Use multiple replicas for online storage; use erasure coding for backup and archiving.

2. Use multiple replicas for small files; use erasure coding for large files.
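
The two rules can be expressed as a small decision function. This is a sketch under stated assumptions: the article gives no size threshold for a "small" file, so `SMALL_FILE_BYTES` (1 MiB) is a hypothetical value, and choosing replication whenever either rule points to it is one possible way of combining the rules, not a recommendation from the source.

```python
from dataclasses import dataclass

# Hypothetical threshold: the article does not define what counts as a small file.
SMALL_FILE_BYTES = 1 * 1024 * 1024


@dataclass
class Workload:
    online: bool             # True for online storage, False for backup/archive
    typical_file_bytes: int  # typical size of the stored files


def choose_protection(workload: Workload) -> str:
    """Return 'replication' or 'erasure_coding' following the two rules of thumb."""
    # Assumption: replication wins whenever either rule points to it.
    if workload.online or workload.typical_file_bytes < SMALL_FILE_BYTES:
        return "replication"
    return "erasure_coding"


print(choose_protection(Workload(online=True, typical_file_bytes=64 * 1024)))
print(choose_protection(Workload(online=False, typical_file_bytes=4 * 1024**3)))
```

In practice, the threshold and the tie-breaking policy would be tuned to the workload and the product in use.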

A variety of distributed storage products are currently available in China, including open source software, products built on optimized open source software, and domestically developed distributed storage products.

Bihai (Blue Sea) distributed storage, developed by Shanghai Xiaoyun, delivers excellent performance and addresses the pain point of storing massive numbers of small files. Its performance is three times that of traditional centralized high-end storage and of Ceph products, and it can hold more than 10 billion files. It provides an innovative storage solution for massive-data scenarios such as medical PACS imaging, financial electronic bills, autonomous driving, and industrial automation.

03 Conclusion

In summary, distributed storage is set to become a mainstream storage technology and has very good development prospects. However, it does not suit every business scenario, and the architecture should be chosen according to actual business needs.

The Bihai distributed storage system developed by Xiaoyun Technology is widely used in core production systems in healthcare, finance, telecommunications, education, and other fields, and has been well received by users.
