Brief introduction

In engineering architecture, storage is an important direction. From bottom to top, it divides into the following layers:

Hardware layer: the basic principles of disks, SSDs, SAS, NAS, RAID, and other hardware, and the storage interfaces they provide to the operating system;

Operating system layer: the file system; how the operating system manages each piece of hardware and exposes a higher-level interface;

Single-engine layer: the principles of the single-machine engines behind common storage systems, which use the file system interface to build a higher-level storage interface;

Distributed layer: how multiple single-machine engines are combined into a distributed storage system;

Query layer: how typical user query semantics are expressed and parsed.

Motherboard structure

Before diving into the hardware layer, let's look at how the various components on a computer motherboard relate to each other.

(Figure: the main components on the motherboard and the buses connecting them)

The figure above shows the main components on the motherboard and the bus connections between them. The two core connection points are the North Bridge and South Bridge chips.

The North Bridge is the prestigious one, connecting the high-speed devices. Generally only the CPU, memory, and graphics card attach to it, though in recent years the PCIe 2.0 high-speed interface has also been routed through the North Bridge, so some formerly lower-tier devices can now attach there.

The South Bridge connects all the low-speed devices: USB, mouse, disk, sound card, and so on. These devices differ greatly in purpose and protocol, so different interaction protocols were designed for them; for historical reasons their transmission media may differ too, which makes the bus wiring complex. By now, though, mainstream devices have converged on the shared PCI bus.

Reading and writing process

Let's walk through the process of the CPU reading a piece of data from the disk.

The CPU issues an instruction: get ready, I want to read data.

This instruction travels over the system bus, the inter-bridge bus, and the PCI bus to the disk controller. On receiving it, the controller knows a read request is coming and whether it should raise an interrupt when the read completes; it does some preparation and waits for the data to read.

The CPU issues a second instruction carrying the logical address to be read.

This instruction also reaches the disk controller over the buses. The controller now gets busy: it maps the logical block to a physical block address, seeks to it, and then starts reading the data.

The CPU issues a third instruction: transfer the data at that address into memory.

After issuing this instruction, the CPU stops caring. It tells a manager called the DMA controller: the rest is up to you. The DMA device takes over the bus and moves the disk data to the specified memory location over the PCI bus, the inter-bridge bus, and the memory bus.

The write process is similar, so it is not repeated here.
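The three-step read flow above can be sketched as a toy model. Everything here is illustrative: a real controller is driven through memory-mapped registers, and the dictionaries merely stand in for the drive and main memory.

```python
def dma_read(disk, lba, memory, dest):
    """Toy model of a DMA-assisted disk read.

    disk:   dict mapping logical block address -> data (stands in for the drive)
    memory: dict standing in for main memory; "DMA" writes into it directly
    """
    trace = ["CPU -> controller: prepare to read"]              # step 1: read command
    trace.append(f"CPU -> controller: logical address {lba}")   # step 2: address
    data = disk[lba]               # controller maps logical -> physical, seeks, reads
    memory[dest] = data            # step 3: DMA moves data over PCI/bridge/memory bus
    trace.append("DMA -> CPU: transfer done, raise interrupt")
    return trace

disk = {42: b"hello"}
memory = {}
trace = dma_read(disk, 42, memory, 0x1000)
```

After the call, `memory[0x1000]` holds the block, without the CPU touching the data itself.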

So far we have only explained how data flows across the motherboard, but one black box remains: the disk controller. How does it manage the disks behind it? That is the subject of the next section.

Principle of storage medium

The previous section described how a computer reads data. In this chapter we briefly describe the storage principles of common storage media.

Magnetic tape

Tape is like the cassette tapes you listened to music on as a child: a black ribbon covered with tiny magnetic particles, where the north-south orientation of each particle encodes a 0 or a 1.

Floppy disk

A floppy disk is a little more advanced than tape. The recording principle is the same, but a floppy can be read randomly, while tape can only be read sequentially.

Hard disk

Hardware principle

If the previous two are ancient artifacts of moderate technical sophistication, the hard disk is a genuinely high-tech storage device. It consists of three major components:

Motor

The motor's job is to move the actuator arm so the head lands precisely on a track. A track can be extremely narrow, and positioning over exactly the right one is high technology.


Platter

There are two key requirements for the platter: the substrate must be perfectly smooth, without defects, and the magnetic powder must be coated evenly onto it. Both the manufacture of the magnetic powder and the even coating process are high technology.


Magnetic head

The main difficulty with the head is controlling its distance from the platter surface. Like the floppy disk, the hard disk records data by flipping the north and south poles of the magnetic powder. Too far away and the head cannot sense the magnetic data; too close and it may scratch the surface. Note that a 0 or 1 is not stored in a single magnetic particle but in a small region of particles.

Today's disks use aerodynamics to control that distance: the head flies on a cushion of air above the spinning platter. When the disk stops spinning, however, the head must land, so near the center of each platter there is a powder-free landing zone where the head can park safely. When the disk starts up, the head takes off from that zone along the concentric circle, gets airborne, and then moves out over the data area.

I have often wondered, though, whether the arm could instead carry many heads, one per track, fixed at a safe height when idle so they never approach the surface. That would greatly improve the disk's read-write efficiency, because it eliminates seek time.

Basic concepts

A schematic of the hard disk's composition:

(Figure: hard disk composition)

As shown in the figure above, the hard disk involves the following concepts (they are simple enough that we won't explain each):

Sector

Track

Cylinder

Platter



Reading and writing process

The process has two parts: how data is read off the platters, and how that data is then sent to the computer.

Read data from disk

We know that sequential reads are much faster than random reads on a disk. But with multiple platters and many tracks, what exactly is "sequential" order?

Suppose we traverse the disk's data in order. Reading proceeds like this: first read the outermost track of the top platter; after one full rotation, that track is done. Then switch immediately to the second platter and read its outermost track, and so on down to the bottom platter. Then the arm steps one track inward and the whole process repeats, until the innermost track is read.

In practice the layout is a bit more complicated, because the platter spins at high constant speed. In the brief moment between one sector and the next, a small timing error can mean the following sector's data is missed and cannot be read until the next revolution. To solve this, data is generally stored on the track with interleaving. Suppose a piece of data occupies 10 sectors, D1, D2, ..., D10, and, to illustrate the idea, a track has exactly 10 sectors, S1, S2, ..., S10. Stored contiguously, the sector-to-data mapping is [(S1, D1), (S2, D2), ..., (S10, D10)], which suffers exactly the error just described. With interleaved storage, the mapping becomes something like [(S1, D1), (S3, D2), (S5, D3), ...]: logically consecutive blocks sit one sector apart.
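One simple way to generate an interleaved layout like the one above can be sketched as follows. This is a hypothetical scheme: real drives use manufacturer-tuned interleave factors and skew, and the exact wrap-around order varies.

```python
def interleave_order(n_sectors, step=2):
    """Physical sector index for each logical block: visit every `step`-th
    sector, and on collision slide forward to the next free one."""
    order, used, pos = [], set(), 0
    for _ in range(n_sectors):
        while pos in used:
            pos = (pos + 1) % n_sectors
        order.append(pos)
        used.add(pos)
        pos = (pos + step) % n_sectors
    return order

# interleave_order(10) -> [0, 2, 4, 6, 8, 1, 3, 5, 7, 9]
# i.e. D1 lands on S1, D2 on S3, D3 on S5, ... as in the example above
```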

In addition, switching between platters, although electronic, still has some delay, so the starting sectors of adjacent platters are not aligned one-to-one but staggered slightly.

On old disks these parameters may have needed to be set by the user; nowadays the manufacturer sets them and the user need not care.

Because moving the arm is far slower than switching heads (one is mechanical, the other electronic), the sequential order above reads a whole cylinder before stepping to the next track, rather than finishing one platter before starting the next.

Since arm movement is so slow, and we have only described its motion for sequential reads, how does the arm move in the real world of heavy random access? This is where the common arm scheduling algorithms come in:

RSS: random scheduling. Useless in practice; it exists only as a baseline against which other algorithms are compared.

FIFO: first in, first out. Not very friendly to random-read performance.

PRI: priority scheduling, managed by the user. Similar to FIFO except that priorities are assigned by the user; it raises the cost of using the disk without delivering much efficiency.

SSTF: shortest seek time first. The head always serves the pending request closest to its current position. Its biggest problem is starvation: far-away requests may wait forever.

SCAN: the elevator algorithm, sweeping back and forth across the disk. This is a common algorithm. Like an elevator, the arm moves one track at a time and reverses direction at the outermost or innermost track. No request starves.

C-SCAN: like the elevator algorithm but reading in only one direction. The arm always reads from the inner ring toward the outer ring; on reaching the outer ring, it returns quickly to the inner ring without reading, then repeats.

LOOK: like SCAN, but it turns around early: if there are no further requests in the current direction, it reverses immediately.

C-LOOK: relates to C-SCAN the way LOOK relates to SCAN.

Generally speaking, SSTF performs better under light I/O, while SCAN/LOOK are better under heavy I/O pressure.
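The two winners can be sketched in a few lines. The track numbers and request queue below are made up purely for illustration.

```python
def sstf_order(requests, head):
    """Shortest seek time first: always serve the closest pending request."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - head))  # nearest track wins
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

def scan_order(requests, head):
    """Elevator: sweep upward serving everything ahead, then reverse."""
    up = sorted(t for t in requests if t >= head)
    down = sorted((t for t in requests if t < head), reverse=True)
    return up + down

queue, head = [98, 183, 37, 122, 14, 124, 65, 67], 53
```

With this queue, SSTF zigzags (65, 67, 37, 14, 98, ...) while SCAN finishes the whole upward sweep before coming back for 37 and 14; note how SSTF would keep postponing a distant request if nearby ones kept arriving.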

Here you can see the video of reading data from the hard disk: CVLum3 NQHg.html

Send the data to the computer

Having read the data off the disk, we now explain the interface through which the disk talks to the computer, also called the disk management protocol.

A disk protocol has two parts: software and hardware. The software part is the command set; currently there are two, ATA and SCSI. The hardware part is the data transmission method, typically wires on the motherboard, though not necessarily: the data can even travel over TCP/IP. Defining a protocol means defining both the command set and the transmission method.


ATA

The full name is Advanced Technology Attachment. It doesn't sound impressive now, but judging by the name, it was quite advanced in its day.

The ATA command set was proposed in the 1980s. It comes in two hardware flavors: parallel ATA (PATA) and serial ATA (SATA). PATA, also known as IDE, became popular first. But parallel cables have poor noise immunity, and the space they occupy hurts airflow inside the case, so after the more advanced SATA protocol appeared in 2000, PATA/IDE disks were consigned to history. Today essentially the only ATA disks are SATA disks.


SCSI

The full name is Small Computer System Interface. Also proposed in the 1980s, it was designed as the disk interface for small servers. The interface achieves higher speeds and better transmission efficiency, but at a higher price.

So today SCSI disks dominate the server market while SATA disks dominate PCs. But as SATA evolves, its price advantage is letting it gradually eat into SCSI's server market.

SCSI is not waiting to die, though. Following the same path by which PATA evolved into SATA, it too went serial, producing the SAS (Serial Attached SCSI) interface. SAS is very popular in the market at the moment: reasonably priced and high-performing, it is expected to phase out parallel SCSI entirely in the near future.

SCSI commands can also be carried over an IP network (iSCSI) or over an FC network (FC-SCSI), which we will come back to later.


SSD

Hardware principle

The SSD is a storage medium that has become popular in recent years. There are two kinds: flash-based SSDs, and DRAM-based SSDs that simply add a battery so data survives power failure.

In this article, SSD always means the former, with flash memory as the storage medium. Let's look at how an SSD stores data.

On a disk, 0/1 lives in the north-south poles of magnetic powder; in flash, it is an electronic signal. Flash uses a floating-gate field-effect transistor as the basic storage cell. The transistor has two gates: a control gate and, below it, a floating gate, with a cloud of electrons between them. Apply a potential to the control gate and the electrons move onto the floating gate; disconnect the potential and they stay trapped there (inside the silicon dioxide insulation layer), which represents binary 0. Apply a reverse potential and the electrons drain back toward the control gate, leaving the floating gate empty, which represents binary 1. So by sensing the potential on the floating-gate side, the cell reads out as 0 or 1. Some SSD manufacturers now distinguish four charge levels instead of two, expanding a cell's value from 0/1 to 0/1/2/3 and doubling the capacity. Such cells are called MLC (multi-level cell), as opposed to SLC (single-level cell), which stores only 0/1. MLC has a much higher error rate, however, so the mainstream products on the market are still SLC.
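The level-to-bits idea can be illustrated with made-up threshold voltages. The numbers are purely for illustration: real parts calibrate their thresholds per chip and per wear level.

```python
def sense_cell(voltage, thresholds):
    """Quantize a sensed floating-gate voltage into a level:
    the number of thresholds it exceeds."""
    return sum(voltage > t for t in thresholds)

SLC = [0.5]               # 2 levels -> 1 bit per cell
MLC = [0.25, 0.50, 0.75]  # 4 levels -> 2 bits per cell
```

The same sensed voltage of 0.6 reads as level 1 on an SLC part but level 2 on an MLC part; packing four levels into the same voltage range is exactly why MLC is more error-prone.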

With the basic principle understood, let's see how an SSD organizes these transistors. The concepts are:

Page: generally 4 KB, which means 4K × 8 transistors. The page is the smallest unit of SSD reads and writes.

Block: generally 128 pages. The block concept is very important: the control of writing and erasing happens at block granularity, which we will focus on below.

Plane: generally 2048 blocks.

A chip contains multiple planes, and the planes can operate in parallel.
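Putting the typical sizes quoted above together:

```python
PAGE_BYTES = 4 * 1024        # a page is 4 KB
PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 2048

block_bytes = PAGE_BYTES * PAGES_PER_BLOCK     # 512 KiB per block
plane_bytes = block_bytes * BLOCKS_PER_PLANE   # 1 GiB per plane
```

So a block is 512 KiB and a plane 1 GiB; a chip with several planes multiplies that again.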

The organization of block is shown in the following figure:

(Figure: organization of transistors within a block)

As the figure shows, the transistors in a block are organized as a grid. Each horizontal row is one page, so a block generally has 128 rows and 4K × 8 columns; in practice there are somewhat more columns, because each page carries extra error-correction data.

The horizontal lines are the control lines (word lines), which apply the voltages that charge and discharge cells; the vertical lines are the read lines (bit lines), which sense the potential on each floating gate.

Reading and writing process

The reading process is as follows:

To read the third row, the third control line is set to 0 while the other 127 control lines are given a potential, so the vertical read lines sense only the third row's data and nothing else. Notice that an SSD needs none of the seek-and-rotate machinery when reading, so it is far faster than a traditional disk.

Writing is more troublesome. An SSD cannot charge some cells of a block while discharging others: the signals would interfere and produce unpredictable results. How does the SSD cope? Brute force. It reads the entire block into its own RAM, modifies it there, then discharges (erases) the whole block, and finally writes the whole block back. Even flipping a single bit triggers this full cycle, so SSD writes are expensive. Still, a lean camel is bigger than a horse: even this write path is orders of magnitude faster than a mechanical disk.
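That brute-force write path, sketched as a toy model (a real controller stages this in its own RAM and programs flash pages, not Python lists):

```python
ERASED = None

def rewrite_page(block, index, data):
    """Change one page by rewriting the whole block:
    read it all out, modify, erase everything, write it all back."""
    buf = list(block)                # 1. read the entire block into controller RAM
    buf[index] = data                # 2. modify the single target page
    for i in range(len(block)):
        block[i] = ERASED            # 3. erase: discharge every cell in the block
    for i, page in enumerate(buf):
        block[i] = page              # 4. program the whole block back

block = ["p0", "p1", "p2"]
rewrite_page(block, 1, "new")
# one page changed, at the cost of a full block erase cycle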

SSDs have another headache. Each charge/discharge cycle slightly degrades the silicon dioxide insulation layer in the middle; once it degrades enough that the floating gate can no longer hold electrons, the transistor is dead. So every cell has an erase-cycle lifetime: the mainstream limit is on the order of 100,000 cycles, and MLC is worse still, at only about 10,000.

Given these two problems, what do SSDs generally do?

To optimize write performance, an SSD generally does not erase on write. New data goes to a different, clean block; the old block's data is merely marked stale and erased later, periodically.

For dead transistors, extra error-correction bits are used. Depending on the error-correction algorithm, each page can tolerate a certain number of bad bits; beyond that limit, only an unrecoverable error can be reported.
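The first mitigation, writing to a clean page and merely marking the old copy stale, is the heart of what is called a flash translation layer (FTL). A minimal sketch, with all structures invented for illustration:

```python
class TinyFTL:
    """Out-of-place writes: a logical page always goes to a fresh physical
    page; the previous copy is only marked stale, to be erased later in bulk."""
    def __init__(self, n_pages):
        self.flash = [None] * n_pages
        self.mapping = {}                 # logical page -> physical page
        self.stale = set()                # physical pages awaiting erase
        self.free = list(range(n_pages))

    def write(self, lpage, data):
        if lpage in self.mapping:
            self.stale.add(self.mapping[lpage])  # old copy: mark it, don't erase
        ppage = self.free.pop(0)
        self.flash[ppage] = data
        self.mapping[lpage] = ppage

    def read(self, lpage):
        return self.flash[self.mapping[lpage]]

ftl = TinyFTL(8)
ftl.write(0, "v1")
ftl.write(0, "v2")   # rewrites logical page 0 without erasing anything
```

The rewrite costs one page program instead of a block erase; a background garbage collector eventually reclaims the stale pages, which also spreads erase wear across the whole device.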

Performance figures of common storage media

Finally, a comparison of the parameters of current mainstream hard disks and SSDs, from data the author gathered at work: random and sequential read/write performance of each medium at 4 KB request size. The numbers are fuzzed, preserving only the order of magnitude, which is enough for a general impression.

Test item | SATA | SAS | SSD
Sequential read (MB/s) | 400 | 350 | 500
Sequential write (MB/s) | 200 | 300 | 400
Random read (IOPS) | 700 | 1,300 | 70,000
Random write (IOPS) | 400 | 800 | 30,000
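It is worth translating the random-read IOPS above into bandwidth to see why the gap matters. Rough arithmetic on the fuzzed numbers:

```python
PAGE = 4 * 1024  # the benchmark used 4 KB requests

sata_random = 700 * PAGE / 1e6       # ~2.9 MB/s of random 4K reads
ssd_random = 70_000 * PAGE / 1e6     # ~287 MB/s of random 4K reads
```

A SATA disk that streams 400 MB/s sequentially collapses to under 3 MB/s under random 4K reads (seek and rotation dominate), while the SSD keeps most of its sequential bandwidth even when access is random.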

Hard disk combination

The previous sections covered the storage principle and read-write process of a single disk. In a real production environment, the capacity and performance one disk can provide are limited, so combination technologies are used to gang multiple disks together for better service.

This section introduces those disk combination technologies: first the most basic one, the RAID family; then the larger-scale integration technologies, SAN and NAS.


RAID

RAID technology was proposed in the 1980s.

RAID 0: striping. Very high read and write throughput, but poor fault tolerance.

RAID 1: mirroring. Read efficiency is double write efficiency. Fault tolerance is excellent.

RAID 2 & RAID 3: add one dedicated check disk on top of RAID 0 for fault tolerance. They differ in the verification algorithm used. Both verify at bit granularity, so read and write efficiency is high.

RAID 4, RAID 5 & RAID 6: these verify at block granularity, so they are less efficient than RAID 2/3. RAID 4 does not interleave parity: one disk is the dedicated check disk. RAID 5 interleaves: every disk holds both data and parity. RAID 6 is double insurance, storing two check values.

RAID 5 and RAID 1 are the most widely used today.
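The check value in RAID 5 is plain XOR parity, which is what makes single-disk recovery possible. A minimal sketch:

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

data = [b"AAAA", b"BBBB", b"CCCC"]   # stripes on three data disks
parity = xor_blocks(data)            # stored on the (rotating) parity disk

# disk 1 dies: the XOR of everything that survives rebuilds its stripe
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing stripe equals the XOR of all the others plus the parity, which is why RAID 5 survives exactly one disk failure (and RAID 6 adds a second, independent check value to survive two).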

RAID can be implemented in two ways: soft and hard. Soft RAID means the operating system drives the SCSI/SATA-attached disks in software, exposes a virtual hard disk interface on top, and implements the RAID logic in between. Hard RAID is an ordinary SCSI/SATA card with an added chip that executes the RAID logic itself.

The common choice is hard RAID, because soft RAID has the following two drawbacks:

it consumes extra memory and CPU resources;

the RAID depends on the operating system, so the operating system itself cannot live on the RAID: if the disk holding the OS dies, the whole RAID becomes unusable.

Modern RAID cards are fairly advanced and can build multiple RAIDs from the disks plugged into them: say, three disks in a RAID 5 and another two in a RAID 1, presented to the operating system as two "logical disks". A logical disk is just a disk as far as the OS is concerned, though underneath it may be many disks.

Nor must a logical disk occupy whole disks: one RAID can likewise be carved into several logical disks. Suppose three disks form a RAID 5 with 200 GB of total space; it can be split into two 100 GB pieces, so the user sees two 100 GB disks. A logical disk generally will not span RAIDs, though. It's not that it can't be done, but that there's no need, and it would give the upper layer an inconsistent impression: why is this disk sometimes fast and sometimes slow?

The logical disk also has an English name: LUN (logical unit number). Storage systems today generally call a hardware-virtualized disk a "LUN" and a software-virtualized one a "volume". The term LUN originally belonged to the SCSI protocol: SCSI allows only 16 devices (hosts or disks) on a bus, while a large storage system may contain thousands, which is plainly not enough. So a new addressing scheme was invented in which a disk is addressed by target ID plus LUN ID. The concept later broadened to cover all hardware-virtualized disks.

After the operating system sees the logical disks, it usually wraps them once more. A logical disk is still a hardware-layer artifact, and the hardware layer's trait is efficiency without flexibility. If a logical disk was created at 100 GB, then 100 GB it is; when the space runs out and you want 150 GB, you can only stare at it, because resizing is very costly. For flexibility, the operating system adds a layer of "volume management", splitting and merging logical disks in software to form the "disks" the system really sees.
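The split-and-merge idea of volume management can be sketched as an extent map. This is a simplified model in the spirit of software volume managers such as LVM; all the names are invented.

```python
class VolumeManager:
    """Build flexible volumes out of slices (extents) of logical disks."""
    def __init__(self):
        self.volumes = {}   # volume name -> list of (disk, offset_gb, size_gb)

    def create(self, name, extents):
        self.volumes[name] = list(extents)

    def extend(self, name, extent):
        self.volumes[name].append(extent)   # growing is just appending an extent

    def size(self, name):
        return sum(size for _, _, size in self.volumes[name])

vm = VolumeManager()
vm.create("data", [("lun0", 0, 100)])   # 100 GB carved from one logical disk
vm.extend("data", ("lun1", 0, 50))      # grow to 150 GB by borrowing another
```

Because the extent map lives in software, growing the volume is a metadata update, not a hardware reconfiguration, which is exactly the flexibility the hardware layer lacks.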

Finally, the operating system partitions these volumes and installs itself onto a partition.

Disk revolution

Everything above concerns how disks are organized inside a single machine, and a single machine offers limited storage: chassis size and space allow only a handful of disks, and a typical 2U machine holding 20 disks is already doing very well. For large-scale industrial applications this is nowhere near enough. The industry's answer: stack disks. If one machine can't hold enough disks, put them in a big dedicated box and connect that box to the computer over a dedicated line.

Of course, recent years have also produced a new field, distributed storage for the big-data market: cheap but lower-performing, it has taken over many workloads that need neither high performance nor complex query semantics. We'll discuss distributed storage later; for now, back to stacking disks.

Once the disks multiplied, people found that capacity went up but transmission speed did not. Default SCSI wire transmission has the following limits:

only 16 devices per bus, meaning a storage device can be accessed by at most 15 machines;

a SCSI cable cannot exceed 25 m, a serious constraint on machine-room wiring.

So SCSI fell out of favor in parts of the enterprise market, and people went looking for other hardware solutions. What they found was the FC network.

The FC network was developed in the 1980s by networking researchers as a sibling of Ethernet, with its own complete protocol stack from the physical link layer up through the transport and application layers. It is the premium edition of Ethernet: more expensive, higher performance. It was originally designed for high-speed backbone networks; nobody expected it to shine in the storage field.

Note that the "F" in FC is Fibre, not Fiber: the word denotes the network, not the light. Although FC networks generally use optical fiber as the transmission medium, FC is defined as a full set of network protocols, not merely the cable.

The introduction of FC networks solved the cabling problems nicely:

like Ethernet, FC has its own switches, connection topologies, and routing algorithms, so you can connect as many devices as you like;

optical fiber can run as far as 100 km: the host can be in Beijing with the storage in Qingdao;

The transmission bandwidth is larger;

only the hardware layer is replaced; the command set is still SCSI, so the migration cost for upper layers is very low. Hence its wide enterprise adoption.

The mainstream storage protocols today: SAS for short distances (mostly inside a machine), FC for long distances.

With this string of technical developments, large-scale storage solutions matured and commercial products appeared on the market: essentially a box stuffed with disks, which we call a SAN (storage area network).

Mentioning SAN obliges mentioning another concept, NAS (network attached storage); the letters are the same, only reordered, so the two are easily confused. A NAS is essentially a SAN plus a file system: a SAN exposes a disk-protocol-level interface (ATA/SCSI), while a NAS exposes a file-system-level interface (such as NFS/CIFS). Also, a SAN is generally attached to hosts over an FC network (high-speed optical fiber mesh), so its performance is high, whereas a NAS is generally attached over Ethernet, so its performance is lower.

There is a third concept often mentioned alongside SAN and NAS: DAS (direct attached storage). DAS resembles a SAN except that it serves only one machine, while a SAN offers interfaces to multiple users.

Editor in charge: CT
