Choosing a flash memory device or solid-state drive (SSD) for industrial applications is complicated. Simple price-performance comparisons are of little use, because what counts as performance depends on the requirements of the application and therefore on many factors.
For standard IT applications, price per gigabyte is a perfectly good criterion for purchasing flash memory. In some cases, read and write speeds are also considered. However, if you need flash memory devices for industrial applications, or memory cards for outdoor telecom applications, your priorities will be different. Capacity is usually not the main concern; robustness and durability definitely are. Endurance (i.e., how long the flash medium lasts) and retention (i.e., how long stored data remains readable) are complex issues. The firmware and architecture of the storage medium both come into play here, as does the nature of the application: will data mostly be written, or is the emphasis on reading? If you have special requirements for flash, you need to know exactly which questions to ask.
Background: Signs of Aging
Typically, the cells of a NAND flash device tolerate only a limited number of erase cycles. This is because each time tunneling occurs, the oxide layer that normally keeps electrons from flowing out of the floating gate traps high-energy electrons as a result of the erase voltage. Consequently, the threshold voltage shifts over time until the cell can no longer be read (Figure 1).
[Figure 1 | Aging cell: Electrons accumulate in the tunnel oxide, resulting in a gradual change in threshold voltage. Cracks in the tunnel oxide create leakage current paths that allow charge to flow out. Read errors increase to the point where blocks become “bad blocks” that need to be eliminated. ]
Another effect of aging is the formation of conductive paths in the oxide layer, which cause the cell to gradually lose its charge and, with it, the stored bit. This effect is accelerated by exposure to high temperatures, and it becomes especially pronounced as a cell approaches its maximum number of program/erase (P/E) cycles, resulting in a dramatic drop in retention. So, while both new single-level cell (SLC) and new multi-level cell (MLC) NAND reliably offer 10-year retention, that figure drops to just one year at the end of their useful life. For MLC NAND, this point is reached after 3,000 P/E cycles; for SLC, only after 100,000. This explains why SLC is the preferred solution for demanding applications, and why low-cost triple-level cell (TLC) NAND chips cannot be used for long-term storage. Storing 3 bits per cell requires eight distinct charge states, which is why degradation quickly becomes apparent. In TLC, retention starts at only one year and drops to three months after 500 P/E cycles.
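The link between bits per cell and the number of charge states can be made concrete with a few lines of Python. This is only an illustrative sketch: the evenly divided voltage window is a simplifying assumption, and the function names are made up for this example.

```python
# Bits stored per cell for each NAND type (standard definitions).
BITS_PER_CELL = {"SLC": 1, "MLC": 2, "TLC": 3}


def charge_states(bits_per_cell: int) -> int:
    """Number of distinct charge levels needed to encode the bits of one cell."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell: int) -> float:
    """Share of an idealized, evenly divided voltage window available per state.

    Assumption: real devices do not split the window evenly; this only shows
    how quickly the margin shrinks as more bits are packed into a cell.
    """
    return 1.0 / charge_states(bits_per_cell)


for name, bits in BITS_PER_CELL.items():
    print(f"{name}: {charge_states(bits)} states, "
          f"{relative_margin(bits):.1%} of the window per state")
```

The narrower the share of the window per state, the smaller the threshold shift that is enough to turn one stored value into another.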
Various mechanisms are used to mitigate the effects of physical wear on the chip. Because an entire block has to be marked as “bad” when a single cell fails, wear leveling ensures that all physical storage addresses experience the same level of usage. But read errors are caused by more than just wear. Every time data is written, the cells around the programmed cells are stressed as well (i.e., they pick up a small amount of charge), a phenomenon called “program disturb”. Over time, the threshold voltage of these cells increases, causing read errors that disappear once the affected block is erased. Reading causes a similar form of stress called “read disturb”, in which adjacent pages build up charge. Because the voltages involved are relatively low, this effect is far less pronounced than for writes, but it still produces bit errors, which must be corrected by error-correcting code (ECC) and removed by erasing the affected block. Interestingly, this effect is especially noticeable in applications that repeatedly read the same data. This means that even in read-only applications, blocks need to be erased and pages rewritten from time to time to correct errors.
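The interplay of wear leveling and read-disturb handling can be sketched as a toy model. Everything here is hypothetical: the class names, the four-block device, and the read threshold are chosen for illustration and do not describe any particular controller firmware.

```python
from dataclasses import dataclass, field

READ_DISTURB_LIMIT = 1000  # hypothetical threshold, not a datasheet value


@dataclass
class Block:
    erase_count: int = 0
    reads_since_erase: int = 0


@dataclass
class TinyFlash:
    blocks: list = field(default_factory=lambda: [Block() for _ in range(4)])

    def pick_block_for_write(self) -> int:
        """Wear leveling: choose the block with the fewest erase cycles."""
        return min(range(len(self.blocks)),
                   key=lambda i: self.blocks[i].erase_count)

    def erase(self, index: int) -> None:
        block = self.blocks[index]
        block.erase_count += 1
        block.reads_since_erase = 0

    def read(self, index: int) -> bool:
        """Count a read; return True if the block had to be refreshed."""
        block = self.blocks[index]
        block.reads_since_erase += 1
        if block.reads_since_erase >= READ_DISTURB_LIMIT:
            # A real controller would first copy the data to another block.
            self.erase(index)
            return True
        return False


flash = TinyFlash()
refreshes = sum(flash.read(0) for _ in range(2500))
print(refreshes, flash.blocks[0].erase_count, flash.pick_block_for_write())
```

Even though the host never writes in this example, block 0 is erased twice, and the wear-leveling policy then steers the next write to a less-used block.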
How to measure the endurance of an SSD
Manufacturers use two metrics to describe the lifespan of flash memory devices: terabytes written (TBW) and drive writes per day (DWPD). TBW is the total amount of data that can be written over the life of the SSD, while DWPD is the number of times the full drive capacity can be written per day during the warranty period. The problem with these metrics is that the underlying tests are extremely complex, and users have no choice but to rely on the manufacturer’s specifications. Furthermore, the practical relevance of these specifications is unclear when it comes to choosing the right storage medium for a given application, because the figures obtained depend heavily on the test method. For example, tests of a 480 GB Swissbit SSD yield an endurance of 1,360, 912, or 140 TBW, depending on the measurement method used. The highest value was obtained by measuring sequential writes. The second value (912) comes from the client workload, and the third (140) from the enterprise workload. Both workloads are defined in JEDEC standards. The client workload is based on the behavior of a typical PC user and generates primarily sequential access. The enterprise workload, on the other hand, simulates the behavior of servers in multi-user environments and generates up to 80% random access.
While in theory these standards should allow for comparability, the problem is that many manufacturers don’t specify the underlying workload at all, and instead base their product information on sequential write values. As the example shows, such values can be about ten times higher than those for enterprise workloads, so caution is called for whenever conspicuously high endurance values are quoted without a specified workload.
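To see what the three TBW figures mean in day-to-day terms, they can be converted into DWPD. The sketch below uses the usual relationship between the two metrics; the five-year warranty period is an assumption made purely for illustration, since the article does not state one.

```python
def dwpd_from_tbw(tbw_tb: float, capacity_gb: float, warranty_years: float) -> float:
    """Convert a TBW rating into drive writes per day (decimal units)."""
    total_gb = tbw_tb * 1000          # TB -> GB
    days = warranty_years * 365
    return total_gb / (capacity_gb * days)


CAPACITY_GB = 480
WARRANTY_YEARS = 5  # assumption for illustration; not stated in the article

# TBW values quoted above for the same drive under three test methods.
ratings = {"sequential": 1360, "client": 912, "enterprise": 140}

for workload, tbw in ratings.items():
    dwpd = dwpd_from_tbw(tbw, CAPACITY_GB, WARRANTY_YEARS)
    print(f"{workload:>10}: {tbw:>5} TBW = {dwpd:.2f} DWPD")

spread = ratings["sequential"] / ratings["enterprise"]
print(f"sequential / enterprise: {spread:.1f}x")
```

Under this assumption, the same drive would be rated at roughly one and a half drive writes per day or at well under a fifth of one, depending solely on which workload the data sheet is based on.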
Reducing the Write Amplification Factor (WAF)
The logical-to-physical mapping system, ECC, and the process of reclaiming blocks known as garbage collection are all relevant mechanisms for understanding and evaluating flash memory functionality and performance. A key term in this field is the write amplification factor, or WAF, which is the ratio of the amount of data actually written to the flash to the amount of user data coming from the host. Reducing the WAF (a measure of flash controller efficiency) is the key to improving SSD endurance. Workload factors that affect the WAF include the mix of sequential and random access and the size of the written data chunks relative to the page and block sizes. Two basic constraints apply: the pages of a block have to be written one after the other, and blocks can only be erased as a whole. In the standard approach, the mapping between logical and physical addresses is block-based. This works well for sequential data because the pages of a given block can be written in sequence. An example of such a workload is continuously recorded video data. With random data, however, pages are written in many different blocks, so each overwrite of a page requires erasing an entire block. This results in a higher WAF and a shorter lifespan. Page-based mapping is therefore more suitable for non-sequential data: the firmware ensures that data from different sources can be written sequentially to the pages of a single block. This reduces the number of erases, thus extending the lifetime and improving write performance.
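The effect of the mapping scheme on the WAF can be illustrated with a deliberately simplified cost model. It is a sketch, not a description of any real firmware: the block geometry is hypothetical, the block-mapped model assumes the worst case in which every touched block is rewritten in full, and the page-mapped model ignores garbage-collection overhead and is therefore a lower bound.

```python
PAGES_PER_BLOCK = 64  # hypothetical geometry for illustration


def waf(host_pages: int, flash_pages: int) -> float:
    """Write amplification factor: flash writes divided by host writes."""
    return flash_pages / host_pages


def flash_pages_block_mapped(requests, pages_per_block=PAGES_PER_BLOCK):
    """Worst-case block mapping: each block a request touches is rewritten in full."""
    total = 0
    for start, length in requests:
        first_block = start // pages_per_block
        last_block = (start + length - 1) // pages_per_block
        total += (last_block - first_block + 1) * pages_per_block
    return total


def flash_pages_page_mapped(requests):
    """Page mapping: pages are appended to an open block (GC cost ignored)."""
    return sum(length for _, length in requests)


def host_pages(requests):
    return sum(length for _, length in requests)


# Requests are (start_page, length) tuples.
sequential = [(i * PAGES_PER_BLOCK, PAGES_PER_BLOCK) for i in range(16)]  # e.g. video
random_small = [(p, 1) for p in (5, 700, 131, 999, 64, 320, 12, 845)]     # scattered pages

for name, reqs in (("sequential", sequential), ("random", random_small)):
    block_waf = waf(host_pages(reqs), flash_pages_block_mapped(reqs))
    page_waf = waf(host_pages(reqs), flash_pages_page_mapped(reqs))
    print(f"{name:>10}: block-mapped WAF {block_waf:.1f}, page-mapped WAF {page_waf:.1f}")
```

In this model, block-aligned sequential writes cost the same under both schemes, while scattered single-page writes under block mapping are amplified by the full number of pages per block.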
[Figure 2 | Comparison test: The F-60 durabit is more durable and has a lower WAF than regular products. This is achieved through FTL and larger over-provisioning in DRAM. ]
Another factor that increases the WAF is how full the device is. The more data stored on a flash device, the more data the firmware needs to move from one place to another. Page-based mapping is beneficial here too. Manufacturers have a further adjustment mechanism at their disposal: over-provisioning (i.e., space on the flash device that is reserved exclusively for background activities). Typically, 7% of an SSD (the difference between the binary and decimal values of a gigabyte) is used for this purpose. But reserving 12% instead of 7% for over-provisioning is very effective. For example, an endurance comparison (i.e., TBW for the enterprise workload) of two SSDs built from the same MLC NAND chips shows that the 60 GB Swissbit F-60 durabit achieves a value 6.6 times higher than the 64 GB F-50 device from the same company. For the 240 GB and 256 GB versions, the value is even 10 times higher.
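Where the 7% figure comes from can be checked with a short calculation. The sketch below assumes one common convention, in which over-provisioning is expressed relative to the user-visible capacity; vendors may define it differently, and the 64 GiB raw capacity is a hypothetical example, not a statement about any specific product.

```python
GIB = 2 ** 30   # bytes in a binary gigabyte
GB = 10 ** 9    # bytes in a decimal gigabyte


def over_provisioning(raw_bytes: int, user_bytes: int) -> float:
    """Spare area as a fraction of user capacity (assumed convention)."""
    return (raw_bytes - user_bytes) / user_bytes


def user_capacity_gb(raw_bytes: int, op: float) -> float:
    """Decimal gigabytes left for the host when the fraction `op` is reserved."""
    return raw_bytes / (1 + op) / GB


raw = 64 * GIB  # hypothetical raw NAND capacity

print(f"inherent over-provisioning: {over_provisioning(raw, 64 * GB):.1%}")
print(f"user capacity at 12% reserve: {user_capacity_gb(raw, 0.12):.1f} GB")
```

Selling 64 binary gigabytes of NAND as 64 decimal gigabytes leaves roughly 7% spare without any further effort; a larger reserve has to be paid for with a smaller advertised capacity.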
Conclusion and nine key questions to ask
SLC flash devices are in many ways the most reliable solution for industrial applications, also with regard to power loss protection. However, in many cases, high-end MLC flash media are also suitable for this type of use. When looking for an SSD solution, you should pay attention not only to mechanical robustness to military standards, but above all to the manufacturer’s efforts to reduce the WAF and extend product life through firmware. Other factors that come into play are “data care management” measures for better retention and, not to be forgotten, the long-term availability of modules that have been carefully selected to suit a given application.
Application requirements determine what you need to pay special attention to when choosing an SSD. On request, Swissbit provides its customers with LifeTIme Monitor, a tool that calculates the endurance of a given SSD by analyzing its reads and writes. As mentioned, price is not the deciding factor, but it is still useful to know whether you really need an SLC SSD that costs eight times as much, or whether MLC is good enough for your purposes.
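A rough back-of-the-envelope estimate can already indicate whether MLC might be sufficient. The sketch below is not how the Swissbit tool works; it is a generic first-order formula, and the drive capacity, WAF, and daily write volume are hypothetical inputs. Only the P/E cycle counts are taken from the figures quoted earlier in this article.

```python
def lifetime_years(capacity_gb: float, pe_cycles: int,
                   waf: float, host_gb_per_day: float) -> float:
    """First-order endurance estimate, assuming ideal wear leveling."""
    total_flash_gb = capacity_gb * pe_cycles   # data the NAND can absorb in total
    total_host_gb = total_flash_gb / waf       # share of that available to the host
    return total_host_gb / (host_gb_per_day * 365)


# Hypothetical application: 60 GB drive, 20 GB written per day, WAF of 4.
for name, cycles in (("MLC", 3_000), ("SLC", 100_000)):
    years = lifetime_years(60, cycles, waf=4.0, host_gb_per_day=20)
    print(f"{name}: about {years:.1f} years")
```

The estimate covers wear only. It says nothing about retention at high temperatures, which is a separate criterion.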
Key questions to ask when choosing flash memory devices for industrial applications:
Do you have specific physical requirements, such as vibration resistance or temperature range? – Industrial flash memory devices should be able to demonstrate high-quality materials and sound manufacturing and processing through appropriate qualification procedures.
Will the memory be exposed to high temperatures for long periods? – Because high temperatures degrade the readability of the cells more quickly, you should choose a product with a data care function that refreshes the data regularly.
Do you plan to write large amounts of data to the storage medium and keep it there for a long time? – If so, choose an SLC product.
Does the application primarily read data? – If so, choose a product with a data care function that refreshes the data regularly.
Does the application primarily write data? – For sequential writes, products with block-based mapping are suitable. For random writes, choose products with page-based mapping.
Will the memory capacity be fully utilized? – In heavily used devices, the controller needs internal working space, and over-provisioning can extend endurance.
What workload does the vendor specify for TBW or DWPD? – Storage media can only be compared if the workload used for the benchmark is stated.
Do you need a higher level of protection against data loss? – For particularly critical applications, data care management and power failure protection are essential.
Will the medium still be available in a few years? – Manufacturers should guarantee long-term availability so that storage modules can be replaced without requalification.
Reviewing Editor: Guo Ting