Language is an essential tool for people to communicate and obtain information. As technology has advanced, machines that understand human language have become a reality. When sound travels through a medium such as air to a person's ears, the brain processes it, forms its own understanding, and then responds with speech or action. So how do computers understand human language? The answer lies in a key technology of human-computer interaction: speech recognition.
Speech recognition technology lets a machine convert a speech signal into the corresponding text or command. Even communication between people can break down because of differences in background, education and experience, so it is harder still for a machine to recognize and understand speech accurately. Machine speech recognition has to cope with different voices, speaking rates, content and environments. Speech signals are variable, dynamic, transient and continuous, and these properties constrain the development of speech recognition.
In the 1950s, the Audrey system developed at AT&T Bell Laboratories became the first speech recognition system in the world; it could recognize the ten English digits. At the end of the 1980s, speech recognition research made a major breakthrough, overcoming three obstacles: large vocabulary, continuous speech and speaker independence. For the first time, all three capabilities were combined in one system. A representative example is the Sphinx system developed at Carnegie Mellon University. In the early 1990s, major companies invested heavily in making speech recognition systems practical.
Current speech recognition technology mainly comprises three parts: feature extraction, pattern matching and model training. Feature extraction pulls the useful feature parameters out of the raw signal: through analysis and processing, redundant information is discarded and the key information is kept. Pattern matching finds, according to a chosen criterion, the best match between an unknown pattern and a model in the model library. Model training derives the model parameters from a large number of known patterns, again according to a chosen criterion.
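To make the three parts concrete, here is a minimal sketch in Python with NumPy. It is only an illustration under stated assumptions, not how any particular recognizer works: the features are simple log band energies, "training" merely stores feature sequences as templates, and matching uses dynamic time warping (DTW), a classic template-matching technique. All function names are hypothetical, and synthetic tones stand in for recorded words.

```python
import numpy as np

def extract_features(signal, frame_len=256, hop=128, n_bands=8):
    """Feature extraction: cut the signal into overlapping frames and keep
    only the log energy in a few frequency bands per frame."""
    window = np.hanning(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(power, n_bands)
        feats.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    return np.array(feats)

def dtw_distance(a, b):
    """Pattern matching criterion: DTW distance between two feature
    sequences that may differ in length (speaking rate)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def train_templates(examples):
    """Model training (simplest possible stand-in): store the feature
    sequence of every known example under its label."""
    return {label: [extract_features(s) for s in signals]
            for label, signals in examples.items()}

def recognize(signal, model):
    """Return the label whose closest template has the smallest DTW distance."""
    feats = extract_features(signal)
    return min(model, key=lambda label: min(dtw_distance(feats, t) for t in model[label]))

def tone(freq, n_samples, sr=8000, seed=0):
    """Synthetic stand-in for a spoken word: a noisy sine tone (assumption)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / sr
    return np.sin(2 * np.pi * freq * t) + 0.05 * rng.standard_normal(n_samples)

# Two "words" with one training example each, then an unknown, longer input.
model = train_templates({"low": [tone(300, 1600, seed=1)],
                         "high": [tone(2000, 1600, seed=2)]})
print(recognize(tone(300, 2400, seed=5), model))
```

DTW is used here because it tolerates inputs of different lengths, which corresponds to the different speaking rates mentioned earlier. A practical system would replace the stored templates with a statistical model whose parameters are estimated from many examples.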
As speech recognition technology has developed, recognition accuracy has reached a very high level. For small and medium vocabularies in particular, speaker-independent systems achieve recognition accuracy above 98%, and speaker-dependent systems do better still. The accuracy of speech recognition is now sufficient for people's everyday applications. Many mobile phones, smart speakers and computers have built-in speech recognition, which is very convenient.
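The text does not say how recognition accuracy is measured. One widely used measure is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. The sketch below assumes this definition and treats accuracy as 1 minus WER; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two transcripts,
    divided by the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # reference word deleted
                            curr[j - 1] + 1,          # extra word inserted
                            prev[j - 1] + (r != h)))  # substitution, or match at no cost
        prev = curr
    return prev[-1] / len(ref)

# One wrong word out of five reference words.
wer = word_error_rate("turn on the kitchen light", "turn on a kitchen light")
print(f"WER = {wer:.2f}, accuracy = {1 - wer:.2f}")
```

Under this definition, an accuracy above 98% corresponds to fewer than two word errors per hundred reference words.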
Given the current development trend of speech recognition technology, will barrier-free conversation between humans and robots, like the scenes in science fiction movies, become possible in the future? Although research institutions have spent decades trying to bring speech recognition accuracy up to human parity, performance still falls short in some areas, such as far-field (distant-microphone) recognition, dialect recognition, and recognition of less widely used languages in noisy environments.
The development of speech recognition technology has made people's work and life more convenient. Many tedious steps can be replaced by a single voice command. The smart home is still at an early stage of development, but speech recognition already makes it possible to build a complete smart home system. In the future, speech recognition technology will open up more possibilities in many areas.