Fault tolerance is the holy grail of embedded systems, especially for military and industrial applications where real-time operating systems are common and downtime is expensive. However, minimizing downtime is easier said than done, particularly when it comes to storage.
Redundant storage using Redundant Array of Independent Disks (RAID) technology has prevailed at the enterprise level for decades, but the size, weight, and computational constraints of embedded systems make it harder to implement there. The recent popularity of high-density SSDs in ever smaller form factors has made storage redundancy possible even in compact embedded systems. Together with ultra-compact hardware RAID controllers, this means we may be entering a new era in which highly available embedded storage is no longer an oxymoron.
Redundancy is key when creating a reliable storage system. Mirroring disks using RAID has been common practice since the 1990s. RAID is a standardized set of schemes for spreading data across drives, and its mirroring modes allow fault-tolerant storage systems to be built even from relatively inexpensive hardware. If a drive fails, its mirrored copy can take over, resulting in minimal to no downtime in a well-implemented system.
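To make the mirroring idea concrete, here is a minimal sketch in Python of how a two-drive mirror keeps serving data after one member fails. The `Drive` and `Mirror` classes are hypothetical toy models invented for illustration. They are not the interface of any real RAID controller or library, and they ignore rebuilds, caching, and error detection.

```python
class Drive:
    """Toy block device: a dict of block number -> bytes (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.failed = False

    def write(self, lba, data):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        self.blocks[lba] = data

    def read(self, lba):
        if self.failed:
            raise IOError(f"{self.name} has failed")
        return self.blocks[lba]


class Mirror:
    """RAID 1-style mirror: each write goes to every healthy drive,
    and a read is served by the first drive that still responds."""

    def __init__(self, drives):
        self.drives = list(drives)

    def write(self, lba, data):
        written = 0
        for drive in self.drives:
            try:
                drive.write(lba, data)
                written += 1
            except IOError:
                continue  # degraded mode: carry on with the surviving drives
        if written == 0:
            raise IOError("all mirror members have failed")

    def read(self, lba):
        for drive in self.drives:
            try:
                return drive.read(lba)
            except IOError:
                continue  # fall through to the next mirror member
        raise IOError("all mirror members have failed")


a, b = Drive("ssd0"), Drive("ssd1")
array = Mirror([a, b])
array.write(0, b"boot config")
a.failed = True          # simulate losing one drive
data = array.read(0)     # the read is now served by ssd1
```

The host only ever talks to `array`, which is why a single drive failure need not cause downtime. Data is lost only if both members fail.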
While RAID makes sense for server applications, implementing it at the embedded system level is a challenge. Before SSDs became popular, hard disks were the main storage medium. Their size and weight meant that redundant drives were impractical for most, if not all, embedded applications.
When SSDs entered the market, RAID was still difficult to implement. Flash storage was initially very expensive, which made redundant embedded storage prohibitively costly for many applications. Size was also an issue, as early SSDs weren’t always smaller than the hard drives they replaced.
Managing RAID has traditionally required either bulky hardware RAID controllers (impractical for space-constrained systems) or software RAID. While software RAID makes sense in terms of saving space, it’s not always the right choice for embedded systems. Embedded computers are typically size- and energy-constrained systems that cannot afford the CPU and memory overhead of running RAID software.
Reliability and Fault Tolerance
Due to the various challenges of implementing storage redundancy in embedded systems, efforts to minimize embedded storage downtime have traditionally focused on reliability rather than fault tolerance. Longevity and uptime can be improved by using high-quality components and designing reliable systems with a higher mean time between failures (MTBF).
Mechanical hard drives are prone to multiple failure modes. Vibration, shock, and plain old wear and tear mean that it’s not a question of whether a drive will fail, but when. Building a reliable hard drive means using better-quality components and a robust mechanical design to better withstand shock and vibration.
Today’s SSDs, with their solid-state design, eliminate mechanical failure modes, but failures can still occur at the drive controller or storage media level. Flash memory cells can endure only a finite number of write cycles before they can no longer accurately store their bit state. Therefore, while flash memory is highly resistant to shock and vibration, SSD write endurance needs to be carefully monitored.
So, for SSDs, improving reliability requires using industrial drives whose controllers are optimized for reliability and write endurance rather than pure performance, as well as using higher-grade flash. Industrial systems typically do not use consumer-grade multi-level cell (MLC) flash, but instead use single-level cell (SLC) or SLC-like flash, such as iSLC. These higher-grade flash types last thousands of write cycles longer than MLC flash, greatly extending storage life.
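As a rough illustration of why flash grade matters for write endurance, the sketch below estimates drive life from capacity, rated program/erase (P/E) cycles, and daily write volume. The function name, the P/E cycle figures, and the write amplification factor are all assumptions chosen for illustration. They are not vendor specifications, and actual ratings should come from the drive's datasheet.

```python
def estimated_life_years(capacity_gb, pe_cycles, host_writes_gb_per_day,
                         write_amplification=2.0):
    """Back-of-the-envelope endurance estimate (assumes ideal wear leveling)."""
    # Total data the flash can absorb before cells reach their rated cycles.
    total_nand_writes_gb = capacity_gb * pe_cycles
    # The controller writes more to flash than the host requests.
    nand_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_writes_gb_per_day / 365


# Illustrative round numbers only; NOT taken from any datasheet.
ASSUMED_PE_CYCLES = {"MLC": 3_000, "iSLC": 20_000, "SLC": 60_000}

for flash_type, cycles in ASSUMED_PE_CYCLES.items():
    years = estimated_life_years(capacity_gb=32, pe_cycles=cycles,
                                 host_writes_gb_per_day=20)
    print(f"{flash_type}: about {years:.1f} years")
```

Because estimated life scales linearly with rated cycles in this model, a tenfold difference in cycle rating means a tenfold difference in expected life for the same workload.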
While improving reliability is always the primary goal of industrial systems, true resiliency also requires fault tolerance. To understand how to create fault tolerance, we need only look at enterprise data centers, where downtime can cost thousands to millions of dollars. In these mission-critical environments, reliable components are combined with fault-tolerant designs to create highly available systems.
Availability, which can be thought of as minimizing downtime, can be improved in two ways. The first is to increase the time the system runs before it fails, which means improving reliability. The second is to reduce the time it takes to restore the system after a failure, which means increasing fault tolerance.
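These two levers correspond to the two terms in the commonly used steady-state availability formula, availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The short sketch below compares the effect of each lever. The hour figures are hypothetical values picked only to show the arithmetic.

```python
HOURS_PER_YEAR = 8760


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical scenarios, for illustration only.
scenarios = {
    "baseline":             availability(mttf_hours=50_000, mttr_hours=24),
    "more reliable drive":  availability(mttf_hours=100_000, mttr_hours=24),
    "faster recovery":      availability(mttf_hours=50_000, mttr_hours=1),
}

for name, avail in scenarios.items():
    print(f"{name}: {avail:.6f} "
          f"({downtime_hours_per_year(avail):.2f} h downtime/year)")
```

With these assumed numbers, cutting recovery time reduces downtime more than doubling the time to failure does. This is the argument for adding redundancy on top of reliable components.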
Fault-Tolerant Embedded Storage
Fault-tolerant storage requires storage redundancy; there is no way around it. Thankfully, both SSDs and RAID controllers have shrunk considerably in recent years.
While SSDs were originally the same size as the 3.5-inch hard drives they replaced, today’s mSATA and M.2 form factor SSDs make even 2.5-inch laptop drives look like oversized behemoths. These compact SSDs are less than half the size of a playing card and only a few millimeters thick.
RAID controllers have also slimmed down dramatically. What used to require a full PCIe card can now be implemented on a single SoC-type chip. When paired with the correct firmware, the new generation of RAID controllers is designed to work with SSDs, not against them.
For today’s embedded system designers, there are several storage options on the market:
For larger systems with existing 2.5-inch drive slots, there are RAID controllers that emulate 2.5-inch disks. They consist of a hardware RAID controller and two mSATA or M.2 slots for the SSDs. They can be configured as RAID 1 for redundancy and fault tolerance or as RAID 0 for higher performance, and in either case they are presented to the host system as a regular 2.5-inch drive.
For smaller systems, RAID controllers in mSATA or M.2 form factors provide some of the most compact RAID configurations available today. Much like the 2.5-inch drive replacements, an mSATA or M.2 RAID controller plugs into the appropriate connector and presents itself to the host as a single drive. Behind that single drive, it provides storage redundancy through physical connections to two SATA drives.
These SATA drives can be regular-sized SATA drives connected using flex cables, or SATADOM drives, which are compact SSDs that connect directly to the SATA connectors. Innodisk’s SATADOM drives come in a variety of physical configurations, from vertical to horizontal, to suit a variety of embedded systems.
While not an option for most low-power embedded systems, space-constrained high-end embedded PCs can consider dual SSDs in combination with software RAID. The compact nature of mSATA, M.2 and SATADOM SSDs makes this the most compact RAID configuration of all, but the CPU and memory overhead of software RAID makes it suitable only for high-end embedded systems with the resources to support it.
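The trade-off between RAID 1 and RAID 0 on the two-drive controllers described above can be sketched numerically. The function below is a simplified model with stated assumptions: drive failures are independent, each drive has the same failure probability over the period of interest, and a failed drive is not replaced during that period. The function name and the example figures are hypothetical.

```python
def two_drive_array(level, drive_gb, p_fail):
    """Return (usable capacity in GB, probability of data loss) for a
    two-drive array, assuming independent failures and no rebuild."""
    if level == 0:
        # Striping: capacities add up, but losing either drive loses the data.
        usable_gb = 2 * drive_gb
        p_loss = 1 - (1 - p_fail) ** 2
    elif level == 1:
        # Mirroring: one drive's worth of capacity, and data is lost only
        # if both drives fail.
        usable_gb = drive_gb
        p_loss = p_fail ** 2
    else:
        raise ValueError("this sketch only models RAID 0 and RAID 1")
    return usable_gb, p_loss


# Hypothetical example: two 64 GB SSDs, each with a 2% chance of failing
# over the period of interest.
for level in (0, 1):
    usable_gb, p_loss = two_drive_array(level, drive_gb=64, p_fail=0.02)
    print(f"RAID {level}: {usable_gb} GB usable, P(data loss) = {p_loss:.4f}")
```

In this model, RAID 0 roughly doubles the chance of data loss compared with a single drive, while RAID 1 reduces it to the square of the single-drive figure at the cost of half the raw capacity.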
Implementing High-Availability Embedded Storage
Combining fault-tolerant, redundant RAID storage with reliable industrial-grade SSDs, such as those built on SLC or iSLC flash, enables embedded systems to achieve true high availability. Both reliability (time to failure) and fault tolerance (time to repair) are addressed, minimizing storage subsystem downtime.
Fault tolerance can also be used on its own, with redundant RAID storage built on MLC-grade SSDs. For applications with low write volumes, this can be an affordable but very effective way to minimize downtime.
While this has been a long and arduous journey, the miniaturization of SSDs and RAID controllers has enabled today’s embedded systems to finally achieve true fault-tolerant storage.
Reviewing Editor: Guo Ting