Many embedded systems today are built around single-chip microcomputers, and the range of such applications keeps growing. For years, however, designers have been troubled by the reliability of single-chip microcomputer systems. In control systems that demand high reliability, this is often the main factor limiting their use.
1. Failure analysis of single-chip microcomputer systems
The reliability of a single-chip microcomputer system is determined jointly by the system itself (its software and hardware) and by its working environment, so reliability should be analyzed and designed for from both aspects. For the system itself, the key question is whether it can effectively suppress interference signals of all kinds, including those arriving directly from outside the system, while still implementing all of its functions. A defective system usually guarantees only that its functions are logically correct; it gives too little consideration to the problems that may arise during operation, and the countermeasures it takes are insufficient. When an interference signal actually arrives, such a system may fail. The reliability of any system is also relative: a system that works well in one environment may be very unstable in another, which shows how much the environment matters to reliable operation. Measures should therefore also be taken to improve the operating environment and reduce environmental interference, although what such measures can achieve is often limited.
2. Measures to improve reliability
There are many ways to improve the reliability of a single-chip microcomputer system. In general, the measures chosen should match the specific reliability problems the system faces and the factors that cause them. These measures serve one of two purposes: first, to minimize the external factors that make the system unreliable; second, to improve the anti-interference ability of the system itself and reduce the instability of its operation. Filtering, isolation and shielding, which suppress power supply noise and environmental interference, serve the first purpose; watchdog circuits, software anti-interference techniques and backup techniques, which act on the system itself, serve the second. Measures of the first kind are common, simple to apply and effective, but the improvement they bring is limited and in many cases does not meet the requirements of the system. Measures of the second kind improve reliability further and are widely used in high-reliability designs. The following sections analyze some issues that arise when using measures of the second kind.
2.1 Improving system reliability with a watchdog timer
Watchdog (monitoring timer) technology is now mature and widely used, and it is well supported: almost all processor manufacturers produce single-chip microcomputers with a built-in watchdog timer, and many stand-alone watchdog timer chips are available on the market. Because such a circuit is easy to implement, its details are not discussed here; only the re-entrancy problem raised by this technique is analyzed. With a watchdog timer in place, once the program runs away the watchdog resets the system and it restarts from the beginning, thereby leaving the abnormal state. When using this technique, however, attention must be paid to the re-entrancy of the system. Re-entrancy can be defined as follows: when a microprocessor system is reset and restarted, its external operations are not changed by the restart, or the change is tolerable, so that the continuity and ordering of the system's external operations, and hence its ultimate safety and reliability, are preserved. If a system's external control operations depend only on its current inputs, the system is almost completely re-entrant. If, on the contrary, its external outputs depend on its historical state as well as its current inputs, and that historical state is not preserved or is corrupted when the system re-enters, then its external operations may be completely wrong. Such a system does leave the abnormal state under the action of the watchdog timer, but the state it re-enters is not normal either; it remains a faulty system and cannot be used.
Therefore, a system that relies on a watchdog circuit to improve reliability must strictly guarantee its re-entrancy.
For a system whose behavior depends on historical state, re-entrancy can be ensured by saving that state in RAM: a dedicated buffer is set aside in the on-chip memory of the single-chip microcomputer or in its external expansion memory. As long as the system does not lose power, the saved data can be reused when the system re-enters. If the power supply is not stable, a backup battery must be used to keep the RAM data safe; for systems that are not time-critical, E2PROM or Flash ROM can also be used to store the historical data.
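The idea of a dedicated history buffer can be sketched in C as follows. This is only an illustration under stated assumptions: the structure fields, the marker value `STATE_MAGIC`, the checksum, and the function names are all hypothetical and do not come from the original design. The marker and checksum are an added safeguard so that a corrupted buffer is not reused after a reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_MAGIC 0xA55Au  /* arbitrary marker value, assumed for this sketch */

/* Hypothetical historical state of a controller. */
typedef struct {
    uint16_t magic;      /* marks the buffer as initialised */
    uint8_t  output_on;  /* last commanded output */
    uint16_t step;       /* position in the control sequence */
    uint8_t  checksum;   /* inverted byte sum of the fields above */
} saved_state_t;

/* On a real target this buffer would be placed in RAM that the startup
   code does not clear (or in battery-backed RAM). Here it is an ordinary
   static variable so that the sketch runs on a host. */
static saved_state_t g_saved;

static uint8_t state_checksum(const saved_state_t *s)
{
    uint8_t sum = 0;
    sum = (uint8_t)(sum + (s->magic & 0xFFu) + (s->magic >> 8));
    sum = (uint8_t)(sum + s->output_on);
    sum = (uint8_t)(sum + (s->step & 0xFFu) + (s->step >> 8));
    return (uint8_t)~sum;
}

/* Called whenever the historical state changes. */
void state_save(uint8_t output_on, uint16_t step)
{
    g_saved.magic = STATE_MAGIC;
    g_saved.output_on = output_on;
    g_saved.step = step;
    g_saved.checksum = state_checksum(&g_saved);
}

/* Called after reset. Returns 1 and restores the saved state if the
   buffer is intact; otherwise returns 0 and falls back to safe defaults. */
int state_restore(uint8_t *output_on, uint16_t *step)
{
    assert(output_on != NULL && step != NULL);
    if (g_saved.magic == STATE_MAGIC &&
        g_saved.checksum == state_checksum(&g_saved)) {
        *output_on = g_saved.output_on;
        *step = g_saved.step;
        return 1;
    }
    *output_on = 0;
    *step = 0;
    return 0;
}
```

After a watchdog reset the startup code would call `state_restore` first; a return value of 0 means the history is unusable and the system should start from a safe default state.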
2.2 Software anti-interference techniques
A system may fail because of various disturbances and unstable factors, and some countermeasures can be taken in the program design itself. Traditional software filtering and software redundancy, which are often used to suppress interference signals, are typical examples. Design experience shows that software locks and program traps can also be used. These two methods mainly address program runaway. When interference makes the program run away, the program counter may end up in one of two regions: it may jump to some other address inside the program area, or it may jump into a blind area of the program space, that is, an area that holds no valid program instructions. The first case can be handled with a software lock. To protect external operations, each relatively independent program block verifies a preset password before or during its execution. The block takes effect only if the password matches, and the upper-level program sets the correct password only when control arrives through the normal path. Otherwise the failed verification forces a transfer to error handling, where the error state is dealt with and normal operation is restored. As an example, suppose that three program blocks are executed in sequence and that each block verifies its own password when it runs.
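This three-block arrangement could be sketched in C as follows. The block names follow the text (written here as `sub_pro1` to `sub_pro3`); the password values, the helper `lock_check`, the one-shot use of each password, and the two counters are assumptions made purely for illustration.

```c
#include <stdint.h>

/* Password values are arbitrary and assumed for this sketch. */
#define KEY_NONE 0x00u
#define KEY_PRO1 0x5Au
#define KEY_PRO2 0xA5u
#define KEY_PRO3 0x3Cu

static uint8_t g_key = KEY_NONE; /* password set by the upper-level program */
static int g_done = 0;           /* blocks that executed effectively */
static int g_errors = 0;         /* times the error handler ran */

/* Error handling: record the fault; a real system would also restore
   a known-good operating state here. */
static void error_handler(void)
{
    g_errors++;
}

/* Verify the password on entry to a block. The password is consumed,
   so it is valid for one entry only. Returns 1 if the entry is legitimate. */
static int lock_check(uint8_t expected)
{
    uint8_t key = g_key;
    g_key = KEY_NONE;
    if (key != expected) {
        error_handler();
        return 0;
    }
    return 1;
}

void sub_pro1(void) { if (lock_check(KEY_PRO1)) g_done++; }
void sub_pro2(void) { if (lock_check(KEY_PRO2)) g_done++; }
void sub_pro3(void) { if (lock_check(KEY_PRO3)) g_done++; }

/* Normal path: the upper-level program sets each password just before
   transferring control to the corresponding block. */
void run_sequence(void)
{
    g_key = KEY_PRO1; sub_pro1();
    g_key = KEY_PRO2; sub_pro2();
    g_key = KEY_PRO3; sub_pro3();
}
```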
When the program executes in sequence, every block runs correctly and takes effect. Now suppose that interference makes the program run away, so that execution jumps from the middle of block sub-pro1 directly into block sub-pro3. The password check in sub-pro3 then fails, and control passes to the error-handling routine, which prevents a wrong operation.
The purpose of a program trap is to catch a program that has run into a blind area. ROM space outside the program code is normally left vacant: when the program is written into ROM, the unused locations are filled with all 1s or all 0s, so a runaway program can jump into this area and execute out of control. Program traps can be used to capture a program that jumps there. For example, suppose the program space of a system is 32 KB and the compiled program occupies 18 KB, leaving 14 KB unused. This area can be filled with trap programs, each consisting of several NOP instructions followed by an LJMP instruction that returns control to a known entry point.
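A minimal sketch of how such a fill pattern could be generated is shown below. It assumes the 8051 instruction encodings (NOP is 0x00; LJMP is 0x02 followed by the high and low bytes of the target address); the function name `fill_traps`, the number of NOPs per trap, and the target address are hypothetical choices, not values from the original design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OP_NOP  0x00u  /* 8051 NOP opcode */
#define OP_LJMP 0x02u  /* 8051 LJMP opcode, followed by addr high, addr low */

/* Fill buf[0..len) with repeated traps, each made of n_nops NOP
   instructions followed by "LJMP target". Any tail too short to hold a
   whole trap is padded with NOPs. Returns the number of complete traps. */
size_t fill_traps(uint8_t *buf, size_t len, size_t n_nops, uint16_t target)
{
    assert(buf != NULL);
    const size_t trap_len = n_nops + 3;  /* LJMP occupies 3 bytes */
    size_t pos = 0;
    size_t count = 0;

    while (len - pos >= trap_len) {
        for (size_t i = 0; i < n_nops; i++) {
            buf[pos++] = OP_NOP;
        }
        buf[pos++] = OP_LJMP;
        buf[pos++] = (uint8_t)(target >> 8);     /* address high byte */
        buf[pos++] = (uint8_t)(target & 0xFFu);  /* address low byte */
        count++;
    }
    while (pos < len) {
        buf[pos++] = OP_NOP;  /* pad the remainder */
    }
    return count;
}
```

For the 14 KB example (14336 bytes) with five NOPs per trap, each trap is 8 bytes long, so the area holds exactly 1792 traps.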
The remaining program space is covered repeatedly with this trap segment. The number of NOP instructions in each trap affects both the capture success rate and the capture time: the more NOP instructions there are, the higher the success rate, but the longer the capture takes and the longer the program stays out of control; with fewer NOP instructions the opposite holds. The reason is that the program is captured only if it lands on a NOP instruction or on the first byte of the LJMP instruction; if it lands on either of the last two bytes of the LJMP instruction, the result is unpredictable. Since a captured program resumes execution from the beginning of the program, the re-entrancy of the program must be considered here as well.
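This trade-off can be put into numbers with a small sketch. It assumes that a runaway program lands on every byte of the trap area with equal probability, which is a simplification introduced here and not a claim from the original; the function names are hypothetical. Each trap of n NOPs plus a 3-byte LJMP then has n + 1 safe entry bytes out of n + 3.

```c
/* Fraction of bytes in a trap that lead to a successful capture:
   the n_nops NOP bytes plus the first byte of LJMP, out of n_nops + 3. */
double trap_capture_rate(unsigned n_nops)
{
    return (double)(n_nops + 1u) / (double)(n_nops + 3u);
}

/* Worst-case number of instructions executed before control returns:
   all n_nops NOPs and then the LJMP itself. */
unsigned trap_worst_case_instructions(unsigned n_nops)
{
    return n_nops + 1u;
}
```

Under this assumption, one NOP per trap gives a capture rate of 0.5, five NOPs give 0.75, and thirteen NOPs give 0.875, while the worst-case capture path grows from 2 to 6 to 14 instructions.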
2.3 Using a backup system to improve reliability
Backup systems are widely used in many important control systems, though mostly in industrial computers or larger systems. Depending on the situation, a backup system can be either an online backup system or a standby backup system. In an online backup system, both CPUs are in the working state. The two CPUs may have equal status, or one may act as the master CPU and the other as the slave CPU. In the equal-status case, the two CPUs jointly determine the external operations of the system, and an error in either CPU causes external operations to be inhibited. In the master-slave case, the master CPU implements the control logic of the system, while the slave CPU monitors the working state of the master. When the slave detects that the master is working abnormally, it restores the master to normal operation, for example by forcibly resetting it. To ensure that the slave itself works normally, its working state is in turn monitored by the master, which can likewise take measures to restore the slave; the two CPUs thus monitor each other. In a specific design, the ways in which the master and slave CPUs can exchange information are flexible and varied: monitoring information can be exchanged through shared memory (for example, by storing the shared information in dual-port RAM) or through handshake signals.
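One possible form of handshake-based monitoring is sketched below, as an assumption rather than the scheme of any particular design: each CPU toggles a heartbeat line at regular intervals, and the other CPU resets it if the line stops changing. The type `peer_monitor_t`, the function names, and the timeout `HB_TIMEOUT_TICKS` are hypothetical. The same code would run on both CPUs, each watching the other.

```c
#include <stdint.h>

#define HB_TIMEOUT_TICKS 3u  /* assumed: silent ticks tolerated before reset */

/* Monitoring state that one CPU keeps about its peer. */
typedef struct {
    uint8_t  last_level;    /* last sampled level of the peer's heartbeat line */
    uint8_t  silent_ticks;  /* consecutive ticks without a level change */
    unsigned resets_issued; /* how many times the peer has been reset */
} peer_monitor_t;

void monitor_init(peer_monitor_t *m, uint8_t initial_level)
{
    m->last_level = initial_level;
    m->silent_ticks = 0;
    m->resets_issued = 0;
}

/* Call once per timer tick with the sampled heartbeat level.
   Returns 1 if the peer must be reset now, 0 otherwise. */
int monitor_tick(peer_monitor_t *m, uint8_t level)
{
    if (level != m->last_level) {
        /* The peer toggled its line, so it is still alive. */
        m->last_level = level;
        m->silent_ticks = 0;
        return 0;
    }
    m->silent_ticks++;
    if (m->silent_ticks >= HB_TIMEOUT_TICKS) {
        /* No activity for too long: request a forced reset of the peer. */
        m->silent_ticks = 0;
        m->resets_issued++;
        return 1;
    }
    return 0;
}
```

On real hardware, a return value of 1 would drive the reset pin of the other CPU; the tick period and timeout would be chosen to match the peer's longest legitimate processing time.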
3. Comprehensive design method to improve system reliability
In a specific design, a single measure is rarely enough: improving stability and reliability to a satisfactory level usually requires a combination of measures. Different systems have different control objects and operating environments, so the main interference problems they face differ, and so do the appropriate countermeasures. The practical approach is to identify the main problems of the system at hand and apply several measures targeted at them.
4. Design example
A design example is given below to further illustrate some common methods to improve system reliability design.
In a satellite communication system, the working temperature of the low-noise amplifier (LNA) must be held constant at 40 ℃ in order to reduce the phase noise of the system. The ambient temperature of the LNA in the field ranges from -40 ℃ to +60 ℃, so the LNA must be placed in a dedicated constant-temperature enclosure (thermostat). The thermostat must be able to both heat and cool: a resistance-wire heater is used for heating and a semiconductor cooling plate for refrigeration. To prevent a controller failure from causing loss of temperature control, or even damage to the LNA, and thus disrupting the whole system, the thermostat controller is based on a master-slave dual-CPU design. In addition, power monitoring, a watchdog, software traps, photoelectric isolation and other measures are used to improve reliability. The structure of the system is shown in Figure 1.
The master CPU measures the temperatures of the heater, the cooling plate and the outside of the enclosure, and carries out the main control tasks. It is an AT89S52 single-chip microcomputer with a built-in watchdog timer. A MAX707 is added externally as the power monitoring circuit; besides providing a reliable reset signal to the master CPU, it generates a power-fail interrupt request so that field data can be saved in time when power fails. The heater is powered from 220 V AC, and the cooling plate from a 15 V regulated DC supply. To keep the high-voltage, high-current circuits from interfering with the low-level electronics, the control signals generated by the master CPU pass through photoelectric isolation before reaching the drive circuits.
The slave CPU is an AT89C2051, which is mainly responsible for monitoring the working condition of the master CPU and the power supply voltage. When a power failure occurs, the voltage comparator in the AT89C2051 detects the change; the slave CPU, running from the backup battery, then reports the failure to the monitoring console through the RS-485 port.
Monitoring between the master and slave CPUs is mutual. The two CPUs handshake over the I/O port lines that connect them, monitor each other's working status, and take corrective action when necessary to keep the external operations of the system safe. With these measures in place, the system has proved highly reliable: it has run stably since being put into operation, with no crash or loss of control from unknown causes. Past experience suggests that, without this comprehensive design approach, such a system would be likely to develop problems after 1 to 2 weeks of continuous operation.
This paper has analyzed the causes of failure in single-chip microcomputer systems, discussed measures for improving their reliability, and proposed a comprehensive design method. Its successful application in the constant-temperature controller of a low-noise amplifier shows that the method is effective and that the reliability of the system is well assured.