The reliability of the power supply system in a data center is critical. No matter how precise the IT equipment, how capable the system or how high its reliability, once the power is cut off, nothing will work. The importance of maintaining equipment in operation therefore cannot be ignored, and the burden on maintenance personnel is correspondingly heavy.
Operation and maintenance tasks, and problems that maintenance cannot solve
To ensure reliable operation of the power supply system, many organizations have drawn up good measures. Even so, there are many loopholes. The inherent reliability of equipment is fixed once it leaves the factory, and some units carry inherent defects: for example, some output isolation transformers are wound with aluminum enamelled wire instead of copper enamelled wire, and such units fail nine times out of ten when run at full load… However, fault statistics show that faults caused by quality problems in the equipment itself account for less than 30%; the rest are acquired after the equipment leaves the factory, that is, they are man-made faults, as follows:
1. Faults caused by improper equipment selection
(1) Unclear basic concepts, so users are easily misled by manufacturers. For example, in a UPS tender for a highway project, the bidding document required that the UPS keep supplying power without discharging the battery even after one or two input phases are lost. This was because some manufacturers boast that their UPS does not discharge the battery when one input phase is lost and still retains 50% of its power capacity, and that with two phases lost the battery still does not discharge and the UPS retains 25% of its capacity, which supposedly extends battery life. Users took this as good performance, but a little thought reveals the drawbacks: to enjoy this advantage, you must buy a UPS rated at four times the load (twice the load just to survive the loss of one phase), otherwise it cannot carry the existing load once a phase is lost. Moreover, what if two of the wires behind the UPS input switch break? Must they be repaired? When? Can they be repaired only after the power is completely cut off? How would this whole series of problems be solved? If users buy such a UPS sized to the actual load, they build in a serious hidden danger that operation and maintenance cannot remove.
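The sizing trap in this example is simple division: the rating needed equals the load divided by the fraction of capacity that survives. A minimal sketch, assuming the retained-capacity figures the manufacturers claimed (50% with one phase lost, 25% with two) and a hypothetical 100 kVA load; the names `RETAINED_FRACTION` and `required_rating_kva` are illustrative and do not come from any vendor:

```python
# Fraction of rated output that, per the manufacturers' claim quoted above,
# the UPS can still deliver without discharging the battery.
RETAINED_FRACTION = {0: 1.0, 1: 0.5, 2: 0.25}


def required_rating_kva(load_kva: float, phases_lost: int) -> float:
    """Minimum UPS rating needed to carry load_kva after losing phases_lost input phases."""
    fraction = RETAINED_FRACTION[phases_lost]
    return load_kva / fraction


if __name__ == "__main__":
    load = 100.0  # hypothetical actual load in kVA
    for lost in (0, 1, 2):
        print(f"{lost} phase(s) lost -> need a {required_rating_kva(load, lost):.0f} kVA UPS")
```

With a 100 kVA load the sketch prints 100, 200 and 400 kVA, which is why a UPS sized to the actual load cannot deliver the advertised benefit.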
(2) Reasons that are hard to state openly. For example, some users began using a certain brand of machine in the last century. At that time, for objective reasons, such machines had a low input power factor, low efficiency, large volume, high power consumption and a high price, and these drawbacks were hard to avoid. Today, a UPS with the newer high-frequency architecture can save 50,000 kWh of electricity per 100 kW per year compared with the older power-frequency (line-frequency) architecture, so a machine room of several megawatts can save millions of kWh per year. However, for some reason the energy-saving equipment is not selected: the energy-hungry and less safe machine is still written into the tender, and even its structural characteristics are made tender requirements. This not only increases the investment in air conditioning equipment and the floor area required, but also plants a hidden danger for future operation. This, too, is a problem that operation and maintenance cannot solve.
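The savings figure scales linearly with installed capacity. A rough sketch, assuming the quoted 50,000 kWh per 100 kW per year holds at every capacity (real savings depend on load rate and on the specific models compared); `annual_saving_kwh` and the room sizes are hypothetical:

```python
# Figure quoted in the text: saving of a high-frequency UPS over a
# power-frequency UPS, per 100 kW of capacity per year (assumed constant).
SAVING_KWH_PER_100KW_PER_YEAR = 50_000


def annual_saving_kwh(capacity_kw: float) -> float:
    """Estimated yearly energy saving for a machine room of capacity_kw."""
    return capacity_kw / 100 * SAVING_KWH_PER_100KW_PER_YEAR


if __name__ == "__main__":
    for mw in (1, 2, 3):  # hypothetical machine-room sizes in MW
        kwh = annual_saving_kwh(mw * 1000)
        print(f"{mw} MW -> about {kwh:,.0f} kWh saved per year")
```

For a 3 MW room this gives about 1.5 million kWh per year, consistent with the "millions of kWh" in the text.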
(3) Chasing the lowest price. Some users think all UPSs are the same, so they chase the lowest price and end up with failures. For example, the headquarters of a highway company bought cheap: the UPS was installed on the first day and caught fire the next. A life insurance company bought a machine at a low price, and within half a year a UPS failure burned out almost all the input circuits of its IT equipment, paralyzing the system. In another case, a megawatt-scale data center ran multiple UPS units in parallel; within a few months of installation, the breakdown of a single inverter power transistor in one UPS tripped every UPS in the system.
2. Faults caused by an unsuitable operating environment
Machines are not installed according to the environmental requirements in the manual; some UPSs are even placed in corridors or in basements with dripping water. For example, several 200 kVA UPS units were placed in a single-story bungalow roofed with nothing but one layer of precast slabs, cooled only by two 5 HP comfort (household) air conditioners. In another case, a glass factory placed its UPS in a workshop full of airborne powder. Conditions like these cause frequent failures.
3. Faults caused by inadequate management rules
For example, some on-duty personnel casually plug electric stoves, rice cookers and vacuum cleaners into the UPS, causing overload trips. In another case, food left by on-duty staff drew rats into the machine, and they started a fire.
4. Faults caused by poor handover
This kind of fault arises mainly when the equipment is managed by different groups of people, or when those groups cooperate poorly. For example, in the ticketing system of a railway station, the previous shift disconnected the UPS external battery pack in order to move the machine, but did not tell the next shift. As a result, when the municipal power supply failed, the UPS lost power as well.
5. Faults caused by misapplied experience
Experience is indispensable and valuable, but it is relative: experience gained on one UPS may not fully apply to another, and applying it blindly can cause faults. A telecom office started a machine of a different brand the same way it started its usual one, without reading the manual, and burned out the inverter.
6. Faults caused by oversight
Some components age or fail early in operation, and if this is not detected in time it leads to faults. Automatic monitoring cannot find these problems. Examples include fuses that begin to bend with age, loose battery structure screws, and fine cracks in battery cases after long-term discharge; any of these can cause a fault if it is not found in time, or not dealt with promptly once found.
7. Faults caused by rushing the job
Maintenance cannot be rushed; think carefully before you start. An engineer from a company was to overhaul a UPS that a user was running. According to the regulations, he had to take the UPS out of service through the maintenance bypass switch before overhauling it, and the procedure required him first to transfer the load to the automatic bypass and only then to close the maintenance bypass switch. Perhaps he had other urgent matters to deal with: on entering the machine room, he closed the maintenance bypass switch without thinking. With the inverter still feeding the load, this tied the inverter output directly to the bypass mains, and the inverter power transistors exploded.
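The required switching order can be expressed as a simple interlock. A minimal sketch, assuming a UPS with only two relevant states (load on the inverter, or load on the automatic bypass); the class and method names are hypothetical and do not correspond to any real UPS control interface, and real machines implement their own interlocks as described in their manuals:

```python
from enum import Enum


class Mode(Enum):
    INVERTER = "inverter"              # load fed by the inverter
    AUTO_BYPASS = "automatic_bypass"   # load transferred to the automatic bypass


class UpsProcedureError(RuntimeError):
    """Raised when a switching step is attempted out of order."""


class UpsModel:
    """Hypothetical model of the two-step procedure in the text; not a real UPS API."""

    def __init__(self) -> None:
        self.mode = Mode.INVERTER
        self.maintenance_bypass_closed = False

    def transfer_to_auto_bypass(self) -> None:
        # Step 1: move the load from the inverter to the automatic bypass.
        self.mode = Mode.AUTO_BYPASS

    def close_maintenance_bypass(self) -> None:
        # Step 2 is refused unless step 1 is done: closing the maintenance bypass
        # while the inverter still feeds the load ties the inverter to the mains.
        if self.mode is not Mode.AUTO_BYPASS:
            raise UpsProcedureError("transfer to automatic bypass before closing maintenance bypass")
        self.maintenance_bypass_closed = True


if __name__ == "__main__":
    ups = UpsModel()
    try:
        ups.close_maintenance_bypass()  # the rushed step from the example
    except UpsProcedureError as err:
        print("refused:", err)
    ups.transfer_to_auto_bypass()
    ups.close_maintenance_bypass()
    print("maintenance bypass closed:", ups.maintenance_bypass_closed)
```

The check only encodes the order of the two steps; it says nothing about synchronization or the other conditions a real transfer requires.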
8. Secondary failure caused by improper maintenance
Regular UPS maintenance is necessary, but it must follow a strict set of management procedures. Irresponsible work that does not meet the requirements of scheduled or unscheduled maintenance is an important cause of machine failure, and maintenance itself can also cause faults. For example, when a multimeter probe is used to measure potentials on a circuit board, it can short two points together. In another case, a user removed the battery from the UPS to discharge it, and when the battery was reconnected after discharging, a surge of current caused an explosion. In yet another, while an engineer was replacing a centrifugal fan, his adjustable wrench slipped and struck the control board. He paid no attention at the time, but after the fan was replaced the machine would not start; inspection revealed that one leg of a component had broken off.
9. Faults caused by static electricity
A computer room was shut down for routine maintenance but could not be turned on again afterward. Inspection found a component with voltage breakdown. Reviewing the maintenance process revealed that the control board had been dusted with a plastic toothbrush. Plastic rubbing against the surface of a dry device can generate frictional electrostatic voltages of several thousand volts. The small-signal circuits of the machine use MOS devices, which have very low withstand voltage and are especially vulnerable to static electricity. Measurements show that an ordinary plastic bag rubbed against a circuit board can produce 3,000 V of electrostatic voltage. Therefore, when working on these circuit boards, it is best to wear a grounded wrist strap.
10. Faults caused by overconfidence
Self-confidence is the foundation of success, but overconfidence sometimes leads to mistakes. For example, an international bank was due to replace its UPS equipment after 8 years of operation, and the manufacturer reminded it many times. Because the UPS had rarely had problems in those 8 years, the user's responsible manager repeatedly answered “no replacement”. A few months later, the UPS stopped supplying power for two hours because of an aging-related fault, interrupting the bank's global business for two hours and causing heavy losses.
According to international statistics, a battery with a nominal service life of 5 years lasts 3 years at most in practice, and without maintenance it should usually be replaced after 2 years. The batteries in the waiting hall of an airport were originally sized for 4 h of backup, but they had still not been replaced after 3 years. When the external power grid failed, the UPS fell far short of its designed 4 h of backup time, and the resulting power outage caused losses.
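The replacement rule of thumb above can be turned into a simple date check. A minimal sketch, assuming only the figures quoted in the text for 5-year-nominal batteries (3 years at most, about 2 years without maintenance); `battery_replacement_due` and the dates are hypothetical, and the rule is not scaled to batteries with other nominal lives:

```python
from datetime import date

# Rule of thumb quoted in the text for batteries with a 5-year nominal life:
# at most 3 years in practice, about 2 years if the battery is not maintained.
MAX_YEARS_MAINTAINED = 3
MAX_YEARS_UNMAINTAINED = 2


def battery_replacement_due(installed: date, today: date, maintained: bool) -> bool:
    """Return True once a 5-year-nominal battery has reached the quoted service limit."""
    limit = MAX_YEARS_MAINTAINED if maintained else MAX_YEARS_UNMAINTAINED
    age_years = (today - installed).days / 365.25  # approximate age in years
    return age_years >= limit


if __name__ == "__main__":
    # Hypothetical dates mirroring the airport case: batteries still in service after 3 years.
    installed, today = date(2020, 1, 1), date(2023, 1, 1)
    print(battery_replacement_due(installed, today, maintained=True))
    print(battery_replacement_due(installed, today, maintained=False))
```

Both calls report that replacement is due, which is the situation the airport was in when the grid failed.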
There are many similar man-made faults, so we will not list them all.
In the final analysis, selecting the power supply system is the first gate, and no seeds of hidden danger must be allowed through it. The connection scheme of the power system is the second gate: even with good equipment, a poor connection scheme leaves hidden dangers. A TV station was misled by a manufacturer about its connection scheme, and UPS failures affecting more than a dozen programs followed one after another, though most were near misses rather than disasters. This has gone on for several years, leaving the maintenance staff constantly anxious. The connection scheme belongs to the construction project and is outside the maintenance staff's control, so for major events and holidays the station has no choice but to have the manufacturer's engineers come on duty. What use is that? The manufacturer's engineers can offer the users only psychological comfort.
Editor in charge ajx