Enterprise environments are complex and constantly changing, and rapidly growing business demands keep raising the bar for storage capacity, performance, and reliability. Solid-state drives (SSDs) have become the preferred choice for a growing number of enterprises thanks to their very high read/write performance and very low latency, and they play an important role in databases, virtualization, application acceleration, big data, cloud computing, and artificial intelligence. Enterprise SSDs often have to run under harsh conditions: high concurrency, heavy load, and 24-hour operation. Their reliability is therefore one of the key concerns of enterprise users.
Reliability is the ability of a component or system to keep performing its intended function for a specified period of time under specified operating conditions. For enterprise SSDs it is a very important indicator. It directly determines core figures such as the yield and failure rate of shipped products, and it also plays a key role in protecting data availability and consistency.
Quantitative index of reliability – MTBF
SSD reliability is usually quantified as MTBF (Mean Time Between Failures): the ratio of a product's cumulative operating time over its whole service phase to the number of failures in that time. It expresses how well a product holds up over time: the fewer the failures, the higher the MTBF and the higher the product reliability.
Compared with consumer-grade SSDs, enterprise SSDs face tougher reliability demands. According to the recommendation of the OCP (Open Compute Project), the MTBF of enterprise SSDs deployed in data centers should be 2,000,000 hours, which is also the current standard for enterprise SSDs. However, an MTBF figure has to be verified by actual running tests; it cannot come out of thin air. Running drives for 2 million hours (more than 228 years), let alone repeating the run several times, is obviously impossible with the traditional method. So how is an MTBF of 2 million hours arrived at?
The answer is statistical inference: a sample of a certain size is run for a certain period of time under acceleration factors (such as accelerated write volume and elevated operating temperature). The process simulates typical user scenarios, verifies theoretical values through actual measurement, and checks product quality in advance. How rigorous this running test is directly determines whether the MTBF "reliability index" is itself reliable.
The period that MTBF characterizes
Like most electronic products, SSDs follow the bathtub curve (failure rate curve), which is divided into three key periods:
Early failure period (Infant Mortality)
When a product has just been manufactured and first powered on, the failure rate is high due to factors such as yield. To ensure that the SSDs delivered to customers meet enterprise reliability standards, enterprise SSD manufacturers run burn-in (aging) tests for a certain period of time on every product on the production line. This exposes possible early failures as far as possible and ensures that the products customers receive do not suffer from early failures.
Random failure period (Random Failures or Normal Life)
This stage begins with the official shipment of the product. The failure rate is low and relatively stable. It is the stable service stage of the product, and it is the period that the reliability index MTBF describes.
Wear-out period (Wear Out)
In this stage, due to factors such as wear and aging, the failure rate increases exponentially with time. At this point the SSD has reached the end of its declared service life. It can still be used, but bad blocks accumulate faster as program/erase (P/E) cycles increase, the effective over-provisioned space (OP) of the SSD is gradually exhausted, and the failure rate of the device rises. For enterprise SSDs, continuing to use products that have entered the wear-out period is not recommended.
MTBF = MTTF?
In addition to MTBF, you may have come across another reliability measure: MTTF. For a repairable device, MTBF = MTTF + MTTR. The three are related as follows, where T1 is a period of normal operation, T2 + T3 is the time from a failure to the end of its repair, and N is the number of such periods:
MTTF (Mean Time To Failure): the average length of the periods of normal operation, each measured from the moment the system starts running normally to the moment it fails. MTTF = ∑T1/N;
MTTR (Mean Time To Repair): the average length of the periods from a failure of the system to the end of its repair. MTTR = ∑(T2+T3)/N;
MTBF (Mean Time Between Failures): the average length of the periods between two consecutive system failures, repair time included. MTBF = ∑(T2+T3+T1)/N.
Because MTTR is usually much smaller than MTTF, MTBF is approximately equal to MTTF.
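The relationship above can be sketched with a few numbers. The operating and repair intervals below are made-up values for illustration only, not measurements of any product, and the variable names are hypothetical.

```python
# Hypothetical intervals, in hours, for one repairable system (illustrative only).
uptimes_h = [1000.0, 1200.0, 800.0]   # T1: periods of normal operation before each failure
repairs_h = [2.0, 3.0, 1.0]           # T2 + T3: failure to end of repair


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mttf = mean(uptimes_h)    # MTTF = sum(T1) / N
mttr = mean(repairs_h)    # MTTR = sum(T2 + T3) / N
mtbf = mttf + mttr        # MTBF = MTTF + MTTR

print(f"MTTF={mttf} h, MTTR={mttr} h, MTBF={mtbf} h")
print(f"MTTR is {mttr / mttf:.2%} of MTTF, so MTBF is close to MTTF")
```

With repair times this small relative to operating times, MTBF and MTTF differ by a fraction of a percent, which is why the two are often used interchangeably.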
The theoretical MTTF formula: where does 2,000,000 hours come from?
In the simplest case, MTTF is calculated with the following formula:
MTTF = 2 × ∑(Ai × ti) / χ²(α, 2nf + 2)
where:
Ai is the acceleration factor of SSD i;
ti is the test time of SSD i;
nf is the number of failed SSDs;
α is the confidence level (60%);
χ² is the chi-squared distribution, evaluated at confidence level α with 2nf + 2 degrees of freedom.
The acceleration factors in the above equation are generally divided into 3 categories:
No acceleration: A = 1, usually used for firmware failures;
TBW (Total Bytes Written) acceleration factor: accelerates the consumption of drive life by increasing write intensity;
Temperature acceleration factor: accelerates the occurrence of failures by raising the temperature of the test environment.
TBW (Total Bytes Written) acceleration factor
TBW is the unit in which SSD lifespan is rated. Take the PBlaze6 SSD with an endurance of 1.5 DWPD and a user capacity of 3.84 TB as an example: its total write volume over 5 years (that is, the write volume in field deployment) is 10.5 PB, which corresponds to a daily write volume of 5.76 TB. Writing more data per day (accelerated write stress) consumes SSD life faster and therefore accelerates the occurrence of failures. The TBW acceleration factor is illustrated by the following example:
Assume that an SSD with a user capacity of 100 GB has a product specification that defines its lifespan as 175 TBW, enough for 5 years (43,800 hours) in typical usage scenarios. If 130 TB of data is written to it within 1008 hours with a write amplification of 1.2, the TBW acceleration factor is 32. If more data is written in the same short period, the TBW acceleration factor increases accordingly.
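The arithmetic in the two examples above can be checked with a short sketch. The function names are hypothetical, and the acceleration calculation is one reading that reproduces the quoted factor of 32: the ratio of the write rate during the test to the rated write rate in the field. It assumes the write amplification of 1.2 applies equally to both and cancels out; the exact formula used by the vendor is not shown here and may differ.

```python
def rated_tbw_tb(dwpd, capacity_tb, years=5):
    """Total terabytes written over the rated life, from drive writes per day."""
    return dwpd * capacity_tb * 365 * years


def tbw_acceleration(test_written_tb, test_hours, rated_tbw_tb_value, rated_hours):
    """Assumed form: test write rate divided by rated field write rate.

    Write amplification is assumed identical in test and field, so it cancels.
    """
    test_rate = test_written_tb / test_hours          # TB per hour during the test
    field_rate = rated_tbw_tb_value / rated_hours     # TB per hour over the rated life
    return test_rate / field_rate


# PBlaze6 example: 1.5 DWPD, 3.84 TB user capacity, 5 years.
pblaze6_tbw = rated_tbw_tb(1.5, 3.84)
print(f"5-year write volume: {pblaze6_tbw / 1000:.1f} PB, daily: {1.5 * 3.84:.2f} TB")

# 175 TBW drive rated for 43,800 hours, written with 130 TB in 1008 hours.
factor = tbw_acceleration(130, 1008, 175, 43_800)
print(f"TBW acceleration factor: {factor:.1f}")
```

Under this reading the factor comes out slightly above 32, matching the figure in the text once rounded down.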
Temperature acceleration factor
Due to the inherent characteristics of NAND, data retention degrades as temperature rises. According to the Arrhenius equation, storing an SSD for 1 year (8760 hours) at a room temperature of 40°C is equivalent to 52 hours in an aging chamber at 85°C.
JESD22-A108 defines the influence of temperature on the SSD over time and specifies the high temperature operating life (HTOL) test, which determines the reliability of SSD operation under long-term high-temperature conditions. The standard stipulates that, unless otherwise required, the SSD shall be tested under a junction temperature stress of 125°C. However, enterprise SSDs generally implement over-temperature protection logic to keep NAND data retention from degrading and components from being damaged by excessive temperature, so the actual operating temperature of an SSD will not reach 125°C.
The temperature acceleration factor is calculated with the Arrhenius equation:
A(T) = exp[(Ea / k) × (1/T₁ − 1/T₂)]
where:
Ea is the activation energy of the failure mechanism, generally 0.7 eV;
k is Boltzmann’s constant, 8.617 × 10⁻⁵ eV/K;
T₁ is the working temperature in kelvin (the standard value is 55°C, or 328 K);
T₂ is the test acceleration temperature in kelvin.
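As a rough illustration, the Arrhenius acceleration factor can be computed as below. The function name is hypothetical and the 85°C test temperature in the first call is only an example value. The second call revisits the earlier data-retention equivalence (1 year at 40°C versus hours at 85°C); an activation energy of 1.1 eV is assumed there because it reproduces roughly 52 hours, but that value is not stated in the article.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K


def arrhenius_factor(ea_ev, use_temp_c, stress_temp_c):
    """Arrhenius acceleration factor between a use and a stress temperature."""
    t_use_k = use_temp_c + 273.15
    t_stress_k = stress_temp_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Ea = 0.7 eV, working temperature 55 C, example test temperature 85 C.
a_t = arrhenius_factor(0.7, 55, 85)
print(f"A(T) for 55 C -> 85 C at 0.7 eV: {a_t:.2f}")

# Data retention: 1 year at 40 C expressed as hours at 85 C.
# Ea = 1.1 eV is an assumption chosen to match the article's 52 hours.
retention_factor = arrhenius_factor(1.1, 40, 85)
equivalent_hours = 8760 / retention_factor
print(f"1 year at 40 C is about {equivalent_hours:.0f} hours at 85 C")
```

The factor is very sensitive to the activation energy, which is why the same temperature step gives a single-digit factor at 0.7 eV and a factor above one hundred at 1.1 eV.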
Example of MTTF calculation
Suppose the sample size is 400, the test time is 1008 hours, the acceleration factor Ai = A(TBW) × A(T) is 10, the number of failures is 0, and the confidence level is 60%. Then MTTF = MTBF ≈ 4,400,000 hours.
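The example can be reproduced with a short sketch, assuming the usual chi-squared lower bound MTTF = 2 × ∑(Ai × ti) / χ²(α, 2nf + 2). The function names are hypothetical. To stay free of external libraries, the chi-squared quantile is found by bisection on the closed-form CDF that exists for an even number of degrees of freedom, which 2nf + 2 always is.

```python
import math


def chi2_cdf_even(x, dof):
    """Chi-squared CDF for an even number of degrees of freedom (closed form)."""
    m = dof // 2
    half = x / 2.0
    term = 1.0
    total = 1.0                      # j = 0 term of the Poisson-style sum
    for j in range(1, m):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(p, dof):
    """Invert the CDF by bisection; dof must be even and 0 < p < 1."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mttf_lower_bound(accelerated_device_hours, failures, confidence=0.60):
    """MTTF = 2 * sum(Ai * ti) / chi2(confidence, 2 * failures + 2)."""
    dof = 2 * failures + 2
    return 2.0 * accelerated_device_hours / chi2_quantile_even(confidence, dof)


# 400 drives, 1008 hours each, acceleration factor 10, zero failures.
device_hours = 400 * 1008 * 10
mttf = mttf_lower_bound(device_hours, failures=0)
print(f"MTTF at 60% confidence: {mttf:,.0f} hours")

# Each observed failure lowers the demonstrated MTTF.
print(f"With 1 failure: {mttf_lower_bound(device_hours, failures=1):,.0f} hours")
```

With zero failures the result is about 4.4 million hours, in line with the figure above.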
Note that MTBF is strictly temperature dependent. This is also stated in the OCP Datacenter NVMe SSD Specification:
MTBF 2,500,000 hours (AFR ≤ 0.35%), for an SSD operating temperature of 0°C~50°C;
MTBF 2,000,000 hours (AFR ≤ 0.44%), for an SSD operating temperature of 0°C~55°C.
But there is always a gap between theory and reality. In practice, it is difficult to reach an acceleration factor of 10 in a product-level MTBF test. The TBW acceleration factor can only be used to test the endurance of the NAND flash. A real test also has to cover the reliability of hardware such as circuits and physical interfaces, and that part can only be accelerated by temperature. In actual operation, a test for MTBF = 2 million hours requires at least 2,000 samples running for more than 1,000 hours under the acceleration factor.
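The sample-size figure can be sanity-checked with a small sketch. It assumes the chi-squared lower bound with zero observed failures, for which the chi-squared quantile with 2 degrees of freedom has the closed form −2 ln(1 − confidence). The function name is hypothetical, and an overall acceleration factor of 1 is used as a conservative case.

```python
import math


def samples_needed(target_mtbf_hours, test_hours, acceleration=1.0, confidence=0.60):
    """Drives needed to demonstrate a target MTBF, assuming zero failures.

    With zero failures: MTBF = accelerated device-hours / (-ln(1 - confidence)).
    """
    required_device_hours = target_mtbf_hours * -math.log(1.0 - confidence)
    return math.ceil(required_device_hours / (acceleration * test_hours))


# 2,000,000-hour target, 1000-hour test, no effective acceleration.
n_unaccelerated = samples_needed(2_000_000, 1000)
print(f"No acceleration: {n_unaccelerated} drives")

# Same target with the idealized factor of 10 and a 1008-hour test.
n_accelerated = samples_needed(2_000_000, 1008, acceleration=10)
print(f"Acceleration factor 10: {n_accelerated} drives")
```

Without acceleration the count lands just under 2,000 drives, and any failure during the test pushes the requirement higher, which is consistent with the "at least 2,000 samples" guidance.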
What is the relationship between MTBF and AFR?
In addition to MTBF, there are other quantitative indicators of reliability, such as the failure rate λ and the annualized failure rate (AFR). AFR and MTBF can be converted into each other.
Failure rate λ: When selecting key SSD components, it is necessary to ensure that the failure rate λ of each component reaches the standard. Compared with the failure rate index, the definition of MTBF is more direct and more suitable for expressing system-level reliability;
AFR: Annualized Failure Rate, which provides a better understanding of the chances of a drive failure occurring in any given year.
The conversion formula of MTBF and AFR is as follows:
MTBF (hours) = 1 / λ (per hour)
MTBF (years) = 1 / (λ (per hour) × 24 × 365)
AFR = 365 × 24 hours × λ (per hour) = 8760 hours / MTBF (hours)
The numerical correspondence between MTBF and AFR is as follows:
An enterprise SSD reliability of MTBF ≥ 2,000,000 hours (@55°C) converts to an annualized failure rate of AFR ≤ 0.44%. Over a 5-year warranty period this corresponds to FFR ≤ 2.2% (FFR, Functional Failure Requirement: the cumulative functional failure rate of the SSD over its entire wear lifetime).
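The conversion can be sketched in a few lines. The function names are hypothetical. The linear form AFR = 8760 / MTBF is the one used in this article; the exponential form 1 − exp(−8760 / MTBF), which follows from a constant failure rate, is shown only for comparison and gives almost the same value at these magnitudes.

```python
import math

HOURS_PER_YEAR = 8760


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate using the article's linear conversion."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours):
    """Probability of failure within one year under a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (2_000_000, 2_500_000):
    afr = afr_from_mtbf(mtbf)
    print(
        f"MTBF {mtbf:,} h: AFR {afr:.2%} "
        f"(exponential {afr_exponential(mtbf):.2%}), "
        f"5-year cumulative {5 * afr:.2%}"
    )
```

For 2,000,000 hours this yields an AFR of about 0.44% and a 5-year cumulative figure just under 2.2%, matching the values quoted above.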
Memblaze’s full range of enterprise SSDs is built to the standard of 2,000,000 hours MTBF @55°C / 2,500,000 hours MTBF @50°C, meeting the requirement of stable, uninterrupted 7×24 operation at 55°C/50°C. At 40°C, data is retained for at least 3 months with the power off, and the uncorrectable bit error rate (UBER) is lower than 1E-17.
Verification of MTBF
Memblaze’s self-developed test platform: the Whale system
In the field of data reliability, Memblaze has developed its own MemSolid technology set to ensure the consistency and reliability of enterprise data. Full-path data protection, LDPC soft-decision decoding for error correction, cross-channel backup protection of metadata, a dynamic RAID5 mechanism across dies for recovering bad-block data, and read-retry and over-temperature protection together give PBlaze drives continuous data consistency protection, so that an enterprise's key business data always sits in a safe and reliable storage environment.
To ensure that manufactured SSDs meet the MTBF standard, Memblaze drew on more than ten years of experience in the solid-state drive field and its understanding of how users actually deploy drives to develop its own MTBF test platform, the Whale system.
The Whale system is built according to JEDEC standards and is suitable for PCIe SSD development testing (DVT, Design Verification Test), environmental stress testing (EST), data retention testing, production testing (aging and ORT, Ongoing Reliability Testing), RDT (Reliability Demonstration Test), and other tests. It presets test cases that are as close as possible to customers' actual usage scenarios and applies reasonable acceleration factors to run long-duration tests on products in the RDT stage, providing quality assurance before mass production.
According to Memblaze’s shipment and field failure statistics, the actual cumulative failure rate (CFR) of PBlaze series SSDs is far lower than the nominal annualized failure rate.
After more than ten years of work in the SSD industry, Memblaze has built rigorous design practices and a strict quality control system covering chips, software, hardware, production, and shipment. This ensures that PBlaze series enterprise SSDs give customers excellent reliability, and it also greatly reduces the operating expenses (OPEX) and total cost of ownership (TCO) of customers’ systems. Memblaze will continue to refine its products in the spirit of craftsmanship and live up to expectations!
Responsible editor: haq