Natural language processing (NLP) is an important branch of artificial intelligence (AI) and is often described as a prized jewel of AI technology. It studies the theories and methods that enable effective communication between humans and computers in natural language, and it touches on a wide range of disciplines. Mr. Zhou Haizhong, an internationally known scholar, once pointed out that “natural language processing is a very attractive research field, which has great theoretical significance and practical value.” Today, NLP has become a powerful driving force for scientific and technological progress and a key pillar of comprehensive national strength.
NLP is concerned chiefly with how humans and computers can communicate effectively in natural language, a goal of great practical and theoretical significance. Achieving it means that a computer can not only understand the meaning of natural language text but also express a given intention or idea in natural language text. The former is called natural language understanding (NLU), and the latter natural language generation (NLG); NLP generally comprises both. Because the key to processing natural language is getting the computer to “understand” it, NLP is often loosely equated with NLU, and the field is also known as computational linguistics.
NLP is a science that integrates linguistics, computer science, and mathematics. Because its subject is natural language, that is, the language people use in daily life, it is closely related to linguistics, but there is an important difference: NLP does not study natural language in general; it develops computer systems, especially software systems, that can carry out natural language communication effectively. In that sense it is part of computer science. NLP can thus be described as a field at the intersection of computer science, linguistics, and AI that focuses on the interaction between computers and human language. Today, the demands placed on AI have moved from computational intelligence and perceptual intelligence to cognitive intelligence, of which NLP is the leading representative. Without successful NLP, there will be no real cognitive intelligence.
AI includes perceptual intelligence (such as image recognition, speech recognition, and gesture recognition) and cognitive intelligence (mainly language understanding, knowledge, and reasoning), and language plays the most important role in the latter. If the language problem can be solved, the most difficult part of AI will largely be solved. Mr. Bill Gates, the founder of Microsoft, once said that “language understanding is the pearl in the crown of artificial intelligence.” Mr. Shen Xiangyang, former global executive vice president of Microsoft, also said in a public speech: “Those who understand language will win the world. In the next decade, the breakthrough of artificial intelligence lies in the understanding of natural language. The most profound impact of artificial intelligence on human beings is natural language.” NLP is also regarded as an AI-complete problem, because understanding natural language requires a wide range of knowledge about the external world and the ability to manipulate it. NLP is currently a key core technology in the field of AI, and research on it is full of both appeal and challenges.
The earliest NLP research was on machine translation. In 1949, the American scientist Warren Weaver first put forward a design for machine translation. In the 1960s, many researchers carried out large-scale, costly work on machine translation. But they clearly underestimated the complexity of natural language, and the theory and technology of language processing were still immature, so little progress was made. The main method at the time was to store a large bilingual dictionary of words and phrases and translate by one-to-one lookup; technically, the only further step was to adjust the word order between the two languages. Real translation is far from that simple. A sentence often has to be read in light of what comes before and after it, and only with that context can it be translated correctly. This is what makes machine translation so difficult.
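The limits of one-to-one dictionary lookup can be seen in a minimal sketch. The three-entry English-to-French lexicon and the function name below are hypothetical, chosen only for illustration; systems of the period stored far larger dictionaries, but the failure mode is the same in kind.

```python
# Hypothetical toy lexicon: exactly one target word per source word.
LEXICON = {
    "river": "rivière",
    "bank": "banque",    # the financial-institution sense only
    "account": "compte",
}

def translate_word_for_word(phrase, lexicon):
    """Replace each word with its single dictionary entry, ignoring context.

    Words missing from the lexicon are passed through unchanged.
    """
    return " ".join(lexicon.get(word, word) for word in phrase.lower().split())

# "bank" is rendered as "banque" in both phrases, although in "river bank"
# it means the side of a river, not a financial institution.
riverside = translate_word_for_word("river bank", LEXICON)
finance = translate_word_for_word("bank account", LEXICON)
```

Because the lookup never consults the neighboring words, the same entry is chosen for both senses of "bank", and the source word order is carried over unchanged. Resolving either problem requires the context that the paragraph above describes.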
Since the 1990s, great changes have taken place in the field of NLP. These changes have two obvious characteristics:
(1) On the input side, NLP systems are required to process large-scale real text rather than only a handful of entries and typical sentences, as earlier research systems did. Only then can the resulting systems have real practical value.
(2) On the output side, because truly understanding natural language is very difficult, systems are not required to understand a text deeply, only to extract useful information from it.
Because of this emphasis on “large-scale” and “real text,” two kinds of foundational work have received increased attention:
(1) The development of large-scale corpora of real text. Such corpora, processed to different depths, are the basis for studying the statistical properties of natural language; without them, statistical methods are like water without a source.
(2) The compilation of large-scale, information-rich dictionaries. Computer-usable dictionaries with tens of thousands or even hundreds of thousands of entries, containing rich information such as collocations, are very important to NLP.
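As a rough illustration of what "statistical properties" and "collocation information" mean in practice, the sketch below counts single words and adjacent word pairs in a corpus. The three-sentence corpus and the function name are made up for the example; a real corpus would contain millions of sentences and would need proper tokenization.

```python
from collections import Counter

def count_unigrams_and_bigrams(sentences):
    """Count single words and adjacent word pairs across a list of sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        # zip pairs each token with its right-hand neighbor.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

# Hypothetical miniature corpus, for illustration only.
CORPUS = [
    "strong tea is popular",
    "she drinks strong tea",
    "a powerful computer is expensive",
]

unigrams, bigrams = count_unigrams_and_bigrams(CORPUS)
strong_tea = bigrams[("strong", "tea")]      # a pair that occurs in the corpus
powerful_tea = bigrams[("powerful", "tea")]  # a pair that never occurs
```

Counts like these are only as reliable as the corpus behind them, which is why statistical methods depend on large collections of real text.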
These input and output characteristics are reflected in many areas of NLP, and their development directly promoted the emergence and rise of automatic information retrieval. With the continuous development of computer technology, machine learning, data mining, data modeling, and other techniques based on massive computation perform better and better. NLP was able to survive its “cold winter” and develop again because computer science and statistics kept being combined, so that people, and even machines, can find “features” in large amounts of data and learn from them. However, learning from raw text alone is not enough to achieve true understanding of natural language; new methods and models are also needed.
At present, there are two main problems. On the one hand, grammars so far are limited to analyzing isolated sentences, and there has been no systematic study of how context and the conversational setting constrain and influence a sentence. Ambiguity, ellipsis, pronoun reference, and the different meanings a sentence takes on in different situations or from different speakers have likewise not been studied systematically, and the study of semantics and pragmatics must be strengthened to solve these problems step by step. On the other hand, people understand a sentence not only through grammar but also by drawing on a great deal of relevant knowledge, both everyday and specialized, and not all of this can be stored in a computer. A text understanding system can therefore be built only within a limited vocabulary, set of sentence patterns, and range of topics; only when the storage capacity and speed of computers improve greatly can that scope be appropriately expanded.
Since language engineering and cognitive science are still mainly confined to laboratories, data processing may be the most promising application area for NLP. Since the start of the big data era, the major platforms have never stopped mining user data in depth. To extract useful information, extracting keywords and counting word frequencies is not enough; user data (especially posts, comments, and the like) has to be understood semantically. In addition, applying offline statistical analysis of big data to NLP tasks is currently a research paradigm with great potential, and the success of Google, Twitter, Baidu, and other large companies with such applications has driven the current wave of big data research.
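A small sketch shows why word frequencies alone fall short of semantic understanding. The two sentences and the helper name are chosen only for illustration: they describe opposite events but contain exactly the same words.

```python
from collections import Counter

def bag_of_words(text):
    """Return word frequencies, discarding word order entirely."""
    return Counter(text.lower().split())

first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# The two frequency tables are identical, even though the sentences
# say opposite things about who bit whom.
same_counts = first == second
```

Any method that looks only at these counts cannot tell the two sentences apart, which is the gap that semantic analysis is meant to close.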
NLP is a core text analysis and mining tool for enterprises and developers of all kinds. It is widely used in e-commerce, finance, logistics, healthcare, culture and entertainment, and other industries. It helps users build intelligent products such as content search, content recommendation, public opinion monitoring and analysis, text structuring, and chatbots, and personalized solutions can also be developed through cooperation. Because understanding natural language requires extensive knowledge of the external world and the ability to use and manipulate that knowledge, NLP is also regarded as one of the core problems of strong AI. In the future, NLP will develop in close step with AI as a whole, especially with the design of neural networks that mimic the human brain.
Training an AI system for NLP text analysis requires collecting large, multi-source data sets, which is an ongoing challenge for scientists: they need to use the latest deep learning models, which imitate the behavior of neurons in the human brain, and train them on millions or even billions of annotated examples to keep improving. The currently popular solution in NLP is pre-training, in which a general language model is first trained on unlabeled text and then adapted to perform specific tasks. The idea is that the model’s parameters are no longer randomly initialized; instead, a set of parameters is first obtained by training on one task, the model is initialized with those parameters, and it is then trained further to obtain better predictions.
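The initialization idea can be sketched with a deliberately tiny stand-in. The code below is not a language model: it fits a two-parameter line by gradient descent, and the data sets, learning rate, and step counts are all assumptions chosen for illustration. It only shows that starting from parameters learned on a related task can leave the model closer to a good solution after the same small number of training steps than starting from random values.

```python
import random

def mse(params, data):
    """Mean squared error of the model y = w*x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    """Plain gradient descent on mean squared error, starting from params."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical tasks: a larger "pre-training" set and a small, related
# downstream set whose underlying line is slightly different.
pretrain_data = [(x / 2, 2.0 * (x / 2) + 1.0) for x in range(-4, 5)]
downstream_data = [(x, 2.2 * x + 0.9) for x in (-1.5, -0.5, 0.5, 1.5)]

rng = random.Random(0)
random_init = (rng.uniform(-1, 1), rng.uniform(-1, 1))

# Step 1: obtain parameters by training on the first task.
pretrained = train(random_init, pretrain_data, steps=200)

# Step 2: train briefly on the downstream task from each starting point.
few_steps = 5
from_scratch = train(random_init, downstream_data, steps=few_steps)
fine_tuned = train(pretrained, downstream_data, steps=few_steps)

loss_scratch = mse(from_scratch, downstream_data)
loss_fine_tuned = mse(fine_tuned, downstream_data)
```

In this toy setting the fine-tuned model ends with the lower downstream loss because its starting point was already near the target. Real pre-trained language models differ in scale and in the fact that the first stage uses unlabeled text, but the two-stage structure is the same.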
We have now entered an era of massive information, with the Internet as its main symbol, and most of that information is expressed in natural language. On the one hand, this mass of information gives computers more “material” for learning human language; on the other hand, it gives NLP a broader stage for application. For example, search engines, an important application of NLP, have become an important tool for obtaining information, and search giants such as Google and Baidu have emerged; machine translation has moved from the laboratory into ordinary households; Chinese input methods based on NLP (such as those from Sogou, Microsoft, and Google) have become essential tools for computer users; and computers and mobile phones with speech recognition are on their way to helping users live, work, and learn more effectively.
The NLP field now has a large body of manually annotated knowledge, and deep learning can acquire the corresponding semantic knowledge through supervised learning. There should be some correspondence between this learned knowledge and the knowledge humans have summarized, especially at the level of shallow semantics, because manual annotation in effect supplies the learning targets for deep learning. The difference is that deep learning can learn around the clock, and its gradual approach to those targets may be much faster and better than the human process of summarizing. This seems to be borne out by AlphaGo, the Go program developed by Google’s DeepMind research team, which defeated two human Go masters within a short period.
Deep learning is widely used in NLP. Deep learning models can be found in almost every NLP application, from low-level word segmentation, language modeling, and syntactic analysis to speech recognition, semantic understanding, pragmatic interpretation, dialogue management, and knowledge-based question answering, and good results have been achieved. Research has shifted from traditional machine learning algorithms to more expressive deep learning models such as convolutional neural networks and recurrent neural networks. However, current deep learning technology lacks the conceptual abstraction and logical reasoning abilities needed to understand and use natural language, and this requires further research.
Internet search engines have for some time allowed users to search online using conversational language and phrasing. The same capability is now available to Google Drive users: with new NLP support built into Google Drive, they can search for files and content stored there just as they would use Google Search. The feature lets users find what they need more easily with queries phrased the way they would be in an actual conversation. Google uses NLP widely in services such as web and mobile search, mobile applications, and Google Translate; its research in this area is part of a broader effort to improve machine reading and understanding of human language. As Google refines its algorithms, its NLP should improve over time.
Cambridge Quantum Computing (CQC) recently announced that it has opened up a new field of possible applications by exploiting the “inherently quantum” structure of natural language. It translates grammatical sentences into quantum circuits, runs the resulting programs on a quantum computer, and actually performs question answering. This is the first time NLP has been carried out on a quantum computer. Using CQC’s first-class, platform-independent retargetable compiler t|ket⟩, these programs were successfully executed on an IBM quantum computer and results were obtained. The breakthrough is a significant step for NLP, a dream pursued by computer scientists and computational linguists since the early days of computing.
Researchers at Harvard Medical School have applied NLP technology to data on novel coronavirus pneumonia (COVID-19) drawn from patient records, social media, and public health sources. They pioneered the search for COVID-19 solutions by using machine learning to examine data and information from these various sources. With the help of NLP tools, online information about the virus can be searched and the locations of outbreaks identified. NLP technology has also been used in research on COVID-19 itself and on drugs and vaccines, including clinical diagnosis and treatment and epidemiological studies.
The NLP research team at Alibaba DAMO Academy in China recently proposed an optimized model, StructBERT, which enables machines to better master human grammar and deepens their understanding of natural language. Using this model is like building a “grammar recognizer” into the machine, so that it can still understand accurately, and respond correctly, when faced with words and sentences that are out of order or that do not follow grammatical conventions. This greatly improves the machine’s understanding of words, sentences, and language as a whole. The technology is widely used in AliMe, Ant Financial, Youku, and other Alibaba businesses. DAMO Academy’s language model and reading comprehension technologies are also being used to empower industry, promoting the adoption of AI in healthcare, electric power, finance, and other sectors. The StructBERT model was reportedly rated recently as the most powerful NLP system in the world.
According to a report by Mordor Intelligence, the global NLP market was worth 10.9 billion US dollars in 2019 and is expected to reach 34.8 billion US dollars by 2025, a compound annual growth rate of 21.5%. The report notes that deep learning architectures and algorithms have made remarkable progress in this market over the past few years, and that voice analysis solutions dominate it, because traditional text-based analysis is no longer sufficient for complex business problems.
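The three reported figures can be checked against one another with the standard compound annual growth rate formula. The sketch below assumes six compounding periods between 2019 and 2025; the report’s exact base year and rounding are not stated here, so small differences are to be expected.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start value and an end value."""
    return (end_value / start_value) ** (1 / years) - 1

def project(start_value, rate, years):
    """Value reached after compounding start_value at rate for the given years."""
    return start_value * (1 + rate) ** years

years = 2025 - 2019  # assumed number of compounding periods

# Growth rate implied by the two market sizes (billions of US dollars).
implied_rate = cagr(10.9, 34.8, years)

# Market size obtained by compounding the 2019 figure at the reported 21.5%.
projected_2025 = project(10.9, 0.215, years)
```

Under that assumption the implied rate comes out close to the reported 21.5%, and compounding at 21.5% lands close to the reported 34.8 billion, so the figures are roughly consistent with each other.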
In short, with the spread of the Internet and the emergence of massive information, NLP, as a key core technology and prized jewel of the AI field, is playing an increasingly important role in people’s lives, work, and study, and will play an ever greater part in scientific and technological progress and social development.
Lin Feng, Li Yan (the authors are with the Boston University Institute of Technology and the Purdue University School of Technology)