Authors: Huang Liwei, Tian Wenqing, Yang Bin, Zhan Pengfei
In view of the low level of automation in current MAN (metropolitan area network) operation and maintenance, the high labor cost, and the inability to escape repetitive operation and maintenance work, this paper discusses the key points and difficulties of MAN automatic operation and maintenance and puts forward a complete automatic operation and maintenance application system. It also gives corresponding solutions for typical application cases, so as to change the low quality and low efficiency of traditional operation and maintenance and to improve automatic operation and maintenance capability across the whole life cycle of the metropolitan area network.
With the rapid development of mobile operators’ metropolitan area network (MAN) business in recent years, and especially with the steady progress of the national “Broadband China” strategy, the construction of wired home broadband networks has caught up with the three major operators, and competition for customer market share is becoming more and more intense. With the access of the company's 5G network services, the MAN carries more and more service types, including broadband Internet, broadband TV, CDN, IMS voice, Internet dedicated line, TR069, WLAN, network management, and 5G services. Business complexity keeps rising and the network scale is doubling, so MAN operation and maintenance faces many problems and challenges:
(1) The level of automatic operation and maintenance is limited. At present, automation covers only the automatic inspection and backup of network equipment and the automatic configuration of home broadband and group-customer services, which accounts for only about 20% of all operation and maintenance work. A great deal of repetitive manual work remains in automatic resource collection and filing, automatic topology discovery, automatic resource capacity expansion, automatic troubleshooting and repair of network faults, security reinforcement, and network-business coordination and optimization. The level of automation must be raised to further improve operation and maintenance effectiveness.
(2) The number of operation and maintenance personnel does not match the growth of the network. In recent years, to reduce cost and increase efficiency, the company has cut all third-party maintenance personnel. Because in-house staff cannot be supplemented in time and the level of automation is limited, in-house personnel who rely on traditional methods usually hold several duties at once, including business configuration, security reinforcement, index control, link expansion, and quality analysis. They are always short of time, and the mismatch between the number of maintenance personnel and the scale of the network is increasingly prominent. Under prolonged high-intensity work, misoperation will inevitably occur and lead to network failure.
(3) The contradiction between declining operation and maintenance capacity and rising network complexity is prominent. The data communication specialty is highly technical and specialized. A traditional operation and maintenance engineer basically needs about one year of study to master the various protocols, bureau data configuration specifications, and network troubleshooting, and to be able to support network operation and maintenance independently. At the same time, because the turnover rate of data communication operation and maintenance personnel is relatively high, a shortage of staff easily arises if the talent pipeline is not well developed. As network scale and business complexity keep increasing, under the traditional mode the decline in operation and maintenance capacity will become an important weakness of network support and will continue to intensify.
To sum up, fully realizing automatic operation and maintenance is the ideal way to solve the problems of traditional MAN operation and maintenance. Especially under the pressure of cost reduction and efficiency increase, limited human resources, and many influencing factors, automatic operation and maintenance across the whole life cycle of the MAN will become an inevitable trend.
2 discussion on key points and difficulties of automatic operation and maintenance
In moving from the traditional mode to automatic operation and maintenance, the key points and difficulties are standardizing specifications and processes, integrating current technologies such as big data and artificial intelligence, and ensuring that the results of automatic operation and maintenance are highly operable.
2.1 importance of standardization
Standardization of specifications is the basis for automatic resource management and automatic operation and maintenance, especially standardization of the bureau data configuration specifications for equipment from the various manufacturers, including resource allocation specifications (such as ports and VLANs) and business configuration template specifications. In promoting automatic operation and maintenance, historical bureau data must inevitably be rectified to conform to the defined specifications. Rectification that involves complex business logic and high risk still has to be completed manually, and the rectified results should be verified efficiently by program. Only with standardization can bureau data be transparent and business logic clear, a unified CMDB be built more easily, and automatic operation and maintenance programs be able to read, understand, and operate on the data.
2.2 importance of process standardization
Automatic operation and maintenance across the whole life cycle of the MAN involves multiple processes, including resource request and allocation, automatic service configuration and activation, fault control, and service verification. Each process may involve scheduling and cooperation among multiple systems and modules. Standardizing the processes ensures the feasibility, stability, and security of automatic operation and maintenance, avoids stalls in the automated workflow, and keeps the automatic operation and maintenance process moving efficiently.
2.3 integration of new technology advantages
On the basis of standardized specifications and processes, automatic operation and maintenance should also draw on new technologies such as big data, machine learning, cloud computing, and NFV to make data analysis, association mining, and risk identification more scientific, reasonable, and efficient, so as to maximize data value, minimize risky operations, optimize cost, and give full play to the high efficiency of automatic operation and maintenance.
2.4 operability and safety assurance
Automatic operation and maintenance across the whole life cycle of the MAN must be both highly operable and secure. Operability means that the platform should be simple, practical, and efficient; should effectively solve the pain points of current operation and maintenance work, such as repetitive labor and high data value work; and should break down the barriers between the business system, the network management system, and the data configuration system, coupling the systems sensibly so that automatic operation and maintenance tasks are executable and accurate. At the same time, although automation improves production efficiency, the safety of automated operations must be ensured, especially for operations involving bureau data configuration: business logic must be rigorous, authorization at key steps must be strictly controlled, logs must be auditable and traceable, rollback must be fast, and emergency plans for automated operation and maintenance must be complete. Otherwise, misoperation will have a serious impact on network services.
3 application direction of automatic operation and maintenance
3.1 design of automatic operation and maintenance application system
Automatic operation and maintenance across the whole life cycle of the MAN should cover resource management, alarm monitoring, fault repair, business configuration, security protection, and network-business coordination, so as to achieve comprehensive automation, free staff from traditional operation and maintenance labor, save labor cost, and improve production efficiency. Aiming at the pain points in current MAN operation and maintenance, the key applications that need automation and can be realized are shown in Figure 1 below:
Figure 1 automatic operation and maintenance application system
3.1.1 application direction of resource management automation
The realization of resource management automation is the basis and guarantee for the realization of the whole automatic operation and maintenance. Only by building a unified data warehouse, ensuring the accuracy of basic data and realizing the transparent management of resources by the automatic operation and maintenance platform, can other automatic operation and maintenance applications be promoted.
3.1.1.1 application ideas of automatic management of basic resources
The automatic management of basic resources focuses on basic hardware resource management and IP resource management. Basic hardware includes equipment, board, and link information. IP resources mainly involve the filing of public network IP information. The management of basic resource information should rely fully on the unified collection of live network data from the equipment and on updates triggered by operation change events, so that system resource information is synchronized promptly and accurately, the data discrepancies introduced by manual work are minimized, and the accuracy of resource records improves.
3.1.1.2 application ideas of automatic topology discovery
The generation and change of the network topology should rely on the standardized port descriptions, business logic, and VLAN information of the equipment, so that the topology can be discovered and rendered automatically. This replaces the traditional scheme of forming the topology by manually entering and updating system resources and enables automatic, fine-grained topology management, in which the paths of main and standby services and load balancing are also reflected.
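As an illustration of discovery from standardized port descriptions, the following minimal sketch parses descriptions into an adjacency map. The description convention `TO:<peer>:<peer-port>:<role>`, the device names, and the function name are all assumptions made for this example; the paper does not specify a format. Ports whose descriptions do not follow the convention are reported separately, since they would need standardized rectification first.

```python
import re
from collections import defaultdict

# Hypothetical description convention: "TO:<peer-device>:<peer-port>:<role>"
DESC_PATTERN = re.compile(
    r"^TO:(?P<peer>[^:]+):(?P<peer_port>[^:]+):(?P<role>MAIN|BACKUP)$"
)

def discover_topology(port_table):
    """Build an adjacency map from (device, port, description) rows."""
    topology = defaultdict(list)
    nonstandard = []  # ports that still need standardized rectification
    for device, port, description in port_table:
        match = DESC_PATTERN.match(description)
        if match is None:
            nonstandard.append((device, port, description))
            continue
        topology[device].append({
            "local_port": port,
            "peer": match["peer"],
            "peer_port": match["peer_port"],
            "role": match["role"],  # main or standby path of the service
        })
    return dict(topology), nonstandard

# Illustrative rows, as they might be collected from the equipment.
ports = [
    ("OLT-01", "xe-0/0/1", "TO:BNG-01:ge-1/0/1:MAIN"),
    ("OLT-01", "xe-0/0/2", "TO:BNG-02:ge-1/0/1:BACKUP"),
    ("OLT-01", "xe-0/0/3", "uplink to somewhere"),
]
topology, nonstandard = discover_topology(ports)
```

The third row shows why standardization comes first: a free-text description cannot be turned into a link automatically.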
3.1.1.3 application idea of automatic resource allocation
Once basic resources are managed automatically and specifications and processes are standardized, automatic resource allocation is relatively simple to realize. It focuses on implementing the allocation rules, such as VLAN resource allocation rules and rules for binding port resources across boards. At the same time, resource allocation conflict detection should be done well as the last line of defense. Conflicts can be detected online by automatic programs on the equipment, for example by pinging an address to detect an IP conflict or by issuing a command to view port occupancy.
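The combination of an allocation rule and a final conflict check can be sketched as follows. This is a minimal illustration, not the platform's actual logic: the lowest-free-ID rule, the VLAN range, and the function name are assumptions, and the set read back from the device stands in for the online check (a ping or a show command) mentioned above.

```python
def allocate_vlan(used_in_inventory, used_on_device, vlan_range=range(100, 200)):
    """Return the lowest VLAN ID that is free in the records and on the device.

    used_in_inventory: VLAN IDs recorded in the resource database.
    used_on_device: VLAN IDs actually read back from the equipment.
    """
    for vlan_id in vlan_range:
        if vlan_id in used_in_inventory:
            continue  # already allocated according to the records
        if vlan_id in used_on_device:
            continue  # conflict: configured on the device but never recorded
        return vlan_id
    raise RuntimeError("no free VLAN in the configured range")

inventory = {100, 101}
device = {100, 101, 102}  # 102 exists on the device but is missing from records
vlan = allocate_vlan(inventory, device)  # 102 is skipped by the conflict check
```

Checking the device as well as the records is what catches non-standard historical data before it causes a provisioning failure.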
3.1.1.4 application ideas of automatic resource early warning
The application of automatic resource early warning focuses on four core network concerns: link utilization, port occupancy, address resource occupancy, and traffic load imbalance. A statistical early warning report is produced by automatic calculation, and a notice is automatically sent to the network administrator to coordinate capacity expansion, so that network capacity expansion is supported by timely warning.
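A report over these four concerns could be computed along the following lines. The threshold values, field names, and the definition of load imbalance (spread between the busiest and idlest member link) are assumptions chosen for illustration; real values would come from the operator's capacity policy.

```python
# Hypothetical thresholds, expressed as ratios between 0 and 1.
THRESHOLDS = {
    "link_utilization": 0.70,
    "port_occupancy": 0.80,
    "address_occupancy": 0.85,
    "load_imbalance": 0.20,
}

def load_imbalance(member_utilizations):
    """Spread between the busiest and idlest member links of a bundle."""
    return max(member_utilizations) - min(member_utilizations)

def build_warning_report(devices):
    """Return one row per (device, metric) that exceeds its threshold."""
    report = []
    for device in devices:
        metrics = {
            "link_utilization": max(device["member_utilizations"]),
            "port_occupancy": device["ports_used"] / device["ports_total"],
            "address_occupancy": device["addresses_used"] / device["addresses_total"],
            "load_imbalance": load_imbalance(device["member_utilizations"]),
        }
        for name, value in metrics.items():
            if value > THRESHOLDS[name]:
                report.append((device["name"], name, round(value, 2)))
    return report

devices = [
    {"name": "BNG-01", "member_utilizations": [0.82, 0.45],
     "ports_used": 30, "ports_total": 48,
     "addresses_used": 900, "addresses_total": 1000},
    {"name": "BNG-02", "member_utilizations": [0.40, 0.38],
     "ports_used": 10, "ports_total": 48,
     "addresses_used": 200, "addresses_total": 1000},
]
report = build_warning_report(devices)
```

In this sample only BNG-01 appears in the report, flagged for link utilization, address occupancy, and load imbalance.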
3.1.1.5 application ideas of automatic resource expansion
Automatic resource expansion covers boards, links, and address pools. Board expansion is relatively simple: it only requires executing simple loading instructions and confirming that the loading state is normal. Address pool expansion and link expansion are more complex, involving automatic resource allocation, automatic script generation, and service verification. Link expansion also involves steps such as link commissioning; on the MAN side, joint commissioning should focus on automatic commissioning with the engineering jumper personnel through robots.
3.1.2 intelligent application of alarm monitoring
The intelligent application of alarm monitoring does not stop at discovering alarms. It also needs to confirm and resolve abnormal conditions through automatic learning and analysis, for example mining the causes of sudden traffic changes, diagnosing OLT faults, and compressing alarms automatically. Taking automatic alarm compression as an example, the compression of invalid alarms should rely on automatic means to improve quality and efficiency, mainly by applying machine learning. Through supervised learning on historical data, a model is trained on labeled features such as alarm frequency, whether the manufacturer recommends compression, alarm importance rating, degree of alarm impact, and whether associated alarms exist. The resulting alarm compression model then compresses alarms automatically and efficiently, as shown in Figure 2:
Figure 2 invalid alarm compression
3.1.3 intelligent application of fault repair
The key services of the MAN are mainly home broadband, television, and group-customer dedicated line services. When a network failure occurs, the end-to-end link is long and different node equipment is managed by prefectural, municipal, and provincial companies, so information exchange during troubleshooting often takes a long time, and relying on manpower to locate the fault point or restore the service is slow. The key to improving fault repair efficiency and customer satisfaction is therefore to build automatic troubleshooting and fault recovery capabilities.
3.1.3.1 application idea of end-to-end intelligent troubleshooting
When complaints come from a single user, or are scattered without any access-aggregation pattern, an end-to-end ping can be run according to the type of service complained about to locate the fault node quickly. The premise is that the path of each service is classified by trunk link, so that each service can be accurately associated with its end-to-end links; only then is automatic troubleshooting feasible and its result accurate. For example, for a single TV service complaint, the automatic troubleshooting module first pings, from the BNG, the loopback addresses of the CR and BR on the backbone link BNG-CR-BR to confirm that the backbone link has no physical interruption or packet loss. Then, according to the fault type, it pings the DHCP server address for an address pool problem, the address of the multicast rendezvous point (RP) for a live broadcast problem, or the EPG server if the electronic program guide cannot be displayed. From the packet loss in the ping results, the fault point can be determined quickly without contacting the operation and maintenance personnel of the provincial company.
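The probe sequence in the TV example can be sketched as follows. The addresses, dictionary keys, and function name are hypothetical, and the ping itself is passed in as a callable so that the logic can be shown without touching real equipment; in practice it would wrap a ping issued from the BNG.

```python
# Hypothetical addresses; the probe order follows the TV-service example.
BACKBONE_PROBES = [("CR loopback", "10.0.0.1"), ("BR loopback", "10.0.0.2")]
FAULT_TYPE_PROBES = {
    "address_pool": ("DHCP server", "10.1.0.10"),
    "live_broadcast": ("multicast RP", "10.1.0.20"),
    "program_list": ("EPG server", "10.1.0.30"),
}

def locate_fault(fault_type, ping):
    """Return the first node whose probe shows packet loss, or None.

    ping is any callable that takes an address and returns the packet-loss
    ratio, where 0.0 means no loss.
    """
    probes = BACKBONE_PROBES + [FAULT_TYPE_PROBES[fault_type]]
    for node, address in probes:
        if ping(address) > 0.0:
            return node  # backbone is checked first, then the service node
    return None

# Simulated measurements standing in for real ping results.
simulated_loss = {"10.0.0.1": 0.0, "10.0.0.2": 0.0, "10.1.0.20": 0.4}
suspect = locate_fault(
    "live_broadcast", lambda address: simulated_loss.get(address, 0.0)
)
```

Here the backbone probes are clean and the multicast RP shows loss, so the RP is reported as the suspect node.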
3.1.3.2 business self-healing application ideas
Service self-healing includes interruption self-healing and poor-quality self-healing. After the flat networking transformation of the MAN, essentially all services can switch over automatically, through warm standby or hot standby. The most practical self-healing scenario for automatic operation and maintenance is therefore poor-quality self-healing. Take the case where CRC errors on an OLT uplink cause TV picture corruption: the system collects the link ports that report CRC errors on the OLT uplink interfaces, identifies the paired link and in particular its peak utilization, and uses these data to evaluate the switchover in advance. It then decides whether to execute the switching instruction and, if so, sends it to the equipment, so that the switchover is completed before any complaint arises.
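The pre-switch evaluation can be illustrated with a small decision function. The CRC threshold, the utilization limit, and the rule of adding the moved traffic to the paired link's peak are assumptions for this sketch; the paper states only that the paired link's peak utilization is evaluated before switching.

```python
def should_switch(crc_errors, current_traffic_gbps, paired_peak_gbps,
                  paired_capacity_gbps, crc_threshold=100, max_utilization=0.8):
    """Decide whether to move traffic off a link that reports CRC errors.

    The switch is allowed only when the faulty link exceeds the CRC threshold
    and the paired link, at its peak load plus the traffic to be moved, stays
    within the utilization limit. All thresholds are hypothetical defaults.
    """
    if crc_errors < crc_threshold:
        return False  # link quality is acceptable, nothing to do
    projected = (paired_peak_gbps + current_traffic_gbps) / paired_capacity_gbps
    return projected <= max_utilization

# Paired link can absorb the traffic: projected utilization is 0.6.
decision_ok = should_switch(crc_errors=500, current_traffic_gbps=2.0,
                            paired_peak_gbps=4.0, paired_capacity_gbps=10.0)
# Paired link would be overloaded: projected utilization is 0.9.
decision_overload = should_switch(crc_errors=500, current_traffic_gbps=5.0,
                                  paired_peak_gbps=4.0, paired_capacity_gbps=10.0)
```

Refusing the switch in the second case avoids trading a quality problem on one link for congestion on the other.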
3.1.4 business configuration automation application
Automatic configuration activation was the first application to be automated. In 2016, broadband configuration basically achieved automatic configuration activation. In 2019, work also began on automatic configuration activation of dedicated lines in the MAN. The reasons for failures of automatic dedicated line provisioning recorded throughout the application test are shown in Figure 3 below:
Figure 3 Statistics of reasons for failure of automatic dedicated line provisioning
The statistics show that failed or conflicting allocation of IP and VLAN resources is the main cause of provisioning failure. In addition, program bugs in the service provisioning system and the configuration activation system involved in automatic dedicated line provisioning account for 12% of failures, and non-standard bureau data configuration, which prevents the program from performing tasks it should perform, accounts for 11%. The test experience with automatic dedicated line provisioning therefore suggests three measures to ensure that automation applications are feasible. First, strengthen the executability of automatic resource allocation, focusing on the allocation logic and conflict detection for IP, VLAN, and other resources. Second, carry out standardized rectification of bureau data, relying on automatic rather than manual means as far as possible to ensure accuracy. Finally, ensure the robustness of the system programs, so that defects in the systems themselves do not affect functional applications.
Although business configuration automation has been applied, its scope is still relatively limited. To truly realize automatic operation and maintenance, automation should be applied as widely as possible while ensuring executability. For automatic service configuration, configuration templates should be built uniformly for each service type across equipment from different manufacturers. At present, the MAN service configuration templates include the home broadband, group-customer, WLAN, network management, and service collection configuration templates. The sub-category templates under each service configuration template should be as detailed as possible; only then can automatic service configuration adapt fully to the various provisioning scenarios.
3.1.5 application of safety protection automation
With the vigorous development of Internet business, management loopholes in network security protection are becoming more and more prominent. Operators strictly follow the “three synchronization” principle in the early stage of network construction to prevent equipment from “entering the network with disease”. At the same time, the management of network security protection by the competent department is becoming increasingly detailed. As the MAN grows, the task of security protection becomes more and more arduous. The same security reinforcement item often requires logging in to every device and adding configuration one by one, as with the security reinforcement of the TV service. Similarly, dedicated line traffic diversion is only a simple matter of logging in to the equipment and configuring the corresponding ACL, but manual operation is relatively inefficient. Security protection configuration of this kind, simple to perform and low in risk, should be a key target of automatic operation and maintenance.
3.1.6 intelligent application of network-business collaboration
With the continuous growth of MAN business and network scale, and given the uncertainty of market development, adding resources blindly without scientific prediction and analysis for network planning, construction, and capacity expansion may waste MAN resources and hinder later network optimization and adjustment. It is therefore particularly important to achieve intelligent collaboration between the network and the business, including the analysis of poor service quality and poor network quality within user satisfaction analysis. Only with intelligent collaboration between the two can the causes of poor quality be explored efficiently and accurately.
3.1.6.1 collaborative application ideas of resource delivery and market development
The collaboration between resource delivery and market development can combine the town grid with market planning and development data, or the forecast user growth with the bearing data of the town's existing network equipment. Through various business volume prediction models, the demand for new capacity is evaluated and predicted accurately, and a comparison chart of existing and new capacity is output after the evaluation. The capacity expansion needs of each town are then clear at a glance, which makes rational planning and scientific investment of resources easy to achieve. The automation implementation scheme is shown in Figure 4 below:
Figure 4 automated evaluation model
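As a simple worked example of comparing existing and required capacity per town, the sketch below uses a deliberately naive linear model: required capacity equals total forecast users multiplied by a per-user peak bandwidth. The model, the field names, and all figures are assumptions for illustration and do not represent the prediction models used in the evaluation.

```python
def capacity_gap_gbps(existing_gbps, current_users, forecast_new_users,
                      peak_mbps_per_user):
    """Return the additional capacity needed in Gbps, or 0.0 if none."""
    required_gbps = (current_users + forecast_new_users) * peak_mbps_per_user / 1000
    return max(0.0, required_gbps - existing_gbps)

# Illustrative towns with made-up figures.
towns = {
    "Town A": {"existing_gbps": 40, "current_users": 8000,
               "forecast_new_users": 2000, "peak_mbps_per_user": 5},
    "Town B": {"existing_gbps": 40, "current_users": 3000,
               "forecast_new_users": 500, "peak_mbps_per_user": 5},
}
gaps = {name: capacity_gap_gbps(**data) for name, data in towns.items()}
```

With these figures Town A needs 50 Gbps against 40 Gbps installed, so it has a 10 Gbps gap, while Town B needs only 17.5 Gbps and requires no expansion.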
3.1.6.2 application ideas of satisfaction collaborative analysis
Customer satisfaction analysis is also an important application of automatic operation and maintenance. Such analysis is generally based on survey data. To avoid annoying customers, surveys are kept to simple questions, so the results may be one-sided. Only system automation can therefore uncover the causes of poor quality more comprehensively and drive the relevant improvement measures. Customer satisfaction involves many aspects, usually including poor network quality, poor installation and maintenance quality, and poor business service quality, so comprehensively uncovering and improving poor quality requires better collaborative analysis of poor service quality and poor network quality. Collaborative analysis with machine learning methods can be realized in three steps:
Step 1: build classifiers for poor network quality, poor installation and maintenance quality, and poor business service quality; input the complaint user data under each BNG into the classifier model for prediction; then aggregate all the classification results to classify the users under each BNG.
Step 2: use the Apriori association rule algorithm to determine the associations among the causes of poor network quality, poor installation and maintenance quality, and poor business service quality.
Step 3: using the detailed per-BNG classification of poor quality causes from step 1, together with the associated causes from step 2, produce a poor quality analysis report and rectification direction for the services covered by each BNG.
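Step 2 can be illustrated with a compact, pure-Python version of Apriori. This is a teaching sketch rather than the platform's implementation: the item labels, the sample complaints, and the support and confidence thresholds are assumptions, and each transaction stands for the set of poor quality categories that step 1 assigned to one complaining user.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: support}."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        size = len(next(iter(level))) + 1 if level else 0
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every k-item subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent

def association_rules(frequent, min_confidence):
    """Derive (antecedent, consequent, confidence) rules from frequent itemsets."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules

# Made-up complaint labels under one BNG.
transactions = [
    {"poor_network", "poor_service"},
    {"poor_network", "poor_service"},
    {"poor_network", "poor_installation"},
    {"poor_installation"},
    {"poor_network", "poor_service"},
]
rules = association_rules(frequent_itemsets(transactions, 0.4), 0.7)
```

On this sample, every user labeled with poor service quality is also labeled with poor network quality, so the rule from poor service to poor network has confidence 1.0, which is the kind of association step 3 would carry into the analysis report.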
4 system architecture
The MAN automatic operation and maintenance platform is built by combining big data processing with flexible small data processing. The platform design is mainly realized by a 4-tier architecture, as shown in Figure 5 below:
Figure 5 platform architecture
(1) Data source: it mainly realizes the data collection function, covering multi-dimensional basic data such as network management data, bureau data, DPI data, service provisioning data, and complaint data.
(2) Data platform: it mainly realizes data storage and preprocessing, and constructs a unified and standardized basic data warehouse for modeling analysis, calculation processing and instruction configuration at the core algorithm layer.
(3) Core algorithm: it mainly integrates automatic resource processing, automatic business configuration, security protection and reinforcement, intelligent network-business collaboration, and artificial intelligence algorithms to realize intelligent analysis of big data and thereby support the automation functions of the application layer.
(4) Function application: the application layer mainly realizes six functions: intelligent alarm monitoring, resource management automation, intelligent fault repair, security protection automation, business configuration automation, and intelligent network-business collaboration.
The transformation of MAN operation and maintenance from the traditional mode to automatic operation and maintenance will become an inevitable trend. The full life cycle automatic operation and maintenance proposed in this paper covers multiple application scenarios, including resource management, alarm monitoring, fault repair, service configuration, security protection, and network-business collaboration. It also gives solutions to the typical problems that urgently need to be solved or improved in current metropolitan area network operation and maintenance, which lays the foundation for promoting automatic operation and maintenance of the metropolitan area network. Realizing automatic operation and maintenance across the whole life cycle of the metropolitan area network will not only bring comprehensive cost reduction and efficiency gains to network operation and maintenance, but is also of great significance for the evolution of automatic operation and maintenance toward the intelligent operation and maintenance stage.
Responsible editor: GT