Norwegian scientists are working on new technology that allows computers to recognize any language without first being trained on it, an advance that could revolutionize automatic speech recognition.
With funding from the Large-scale Programme on Core Competence and Value Creation in ICT (VERDIKT) under the Research Council of Norway, professor Torbjørn Svendsen of the Norwegian University of Science and Technology (NTNU) and colleagues have demonstrated that the production of human speech is fundamentally the same across languages. As such, the technology being developed will be applicable to any language without being reliant on speech data for each individual language to train a machine.
The researchers based their approach on phonetics and have also incorporated additional speech and language knowledge into the system, for example the correspondence between sound frequency and words and how words are put together in forming sentences.
The method developed by Svendsen and colleagues involves training a computer to determine which parts of the speech organs are active, based on analysis of the pressure of the sound waves captured by the microphone.
Up to now, two different approaches to speech recognition systems have been most prevalent. Both are based on the use of speech data and source texts in training a computer to recognize different languages on an individual basis.
One approach involves individuals observing words and sounds and deducing rules which are then entered into the computer. For instance, whether or not a sound is voiced depends on whether the vocal cords vibrate during the production of the sound.
“If we analyze a small speech segment and determine that a specific sound is voiced in speech peaking at resonances of 750 and 1200 hertz (Hz) then the sound is likely an ‘a’. If the peaks range between 350 and 800 Hz, it’s likely to be a ‘u’,” said Svendsen.
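A hand-written rule of the kind Svendsen describes could be sketched roughly as follows in Python. This is only an illustration, not the researchers' software: the function name guess_vowel, the two-peak input, and the 100 Hz tolerance are assumptions made for the example, and the second rule reads the quoted "between 350 and 800" as two peaks near those values. Only the frequency values come from the quote.

```python
def guess_vowel(voiced, peak1_hz, peak2_hz, tolerance_hz=100.0):
    """Apply two hand-written rules to a short speech segment.

    voiced       -- True if vocal cord vibration was detected
    peak1_hz     -- lower resonance peak of the segment, in Hz
    peak2_hz     -- higher resonance peak of the segment, in Hz
    tolerance_hz -- assumed margin around the quoted values (hypothetical)
    """
    if not voiced:
        # The rules below only apply to voiced sounds.
        return None

    def near(value, target):
        return abs(value - target) <= tolerance_hz

    # Peaks around 750 Hz and 1200 Hz: likely an 'a'.
    if near(peak1_hz, 750.0) and near(peak2_hz, 1200.0):
        return "a"
    # Peaks around 350 Hz and 800 Hz: likely a 'u'.
    if near(peak1_hz, 350.0) and near(peak2_hz, 800.0):
        return "u"
    # No rule matched; a real system would need many more rules.
    return None


print(guess_vowel(True, 760.0, 1180.0))
print(guess_vowel(True, 340.0, 820.0))
```

Every additional sound needs another rule written by hand, which is the practical limit of this approach.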
The other approach is to leave the training up to the computer by feeding it large amounts of sample material.
“Initially, a machine perceives all sound events to be equally probable,” Svendsen said. “But as the data-driven learning proceeds, occurrences with higher frequency are interpreted as more likely while less common occurrences decrease on the probability scale. This type of approach enables us to process much more speech data than we can using human-based observations. There are just limits to how much data a human can handle.”
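The data-driven idea in the quote can be illustrated with a small counting model in Python. This is a simplified sketch under stated assumptions, not the system used by the NTNU group: the class name SoundEventModel and the add-one smoothing are choices made for the example. Before any data is seen, every sound event gets the same probability; as observations accumulate, frequent events become more likely and rare ones less so.

```python
from collections import Counter


class SoundEventModel:
    """Estimate sound-event probabilities from observed counts (illustrative)."""

    def __init__(self, events):
        self.events = list(events)   # the fixed set of possible events
        self.counts = Counter()      # how often each event has been observed

    def observe(self, event):
        if event not in self.events:
            raise ValueError(f"unknown event: {event!r}")
        self.counts[event] += 1

    def probability(self, event):
        # Add-one smoothing: with no data, all events are equally probable.
        total = sum(self.counts[e] for e in self.events)
        return (self.counts[event] + 1) / (total + len(self.events))


model = SoundEventModel(["a", "u", "s"])
print(model.probability("a"))        # equal for all events before training

for sound in ["a", "a", "a", "u"]:
    model.observe(sound)

print(model.probability("a"), model.probability("s"))
```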
The research group has chosen an approach somewhere between the two traditional approaches.
Speech patterns differ due to variations in the physiology, dialect, education and health of individuals. This affects the production of voice and sentence structure. In order for a machine to learn how to understand speech, it must be able to discern among the most common variations in normal speech and language.
“We are currently developing a computer program which determines the probability of various distinctive characteristics being present or absent during sound production. For example, if there is vocal cord vibration, this indicates the occurrence of a voiced sound. This is our method of classifying sounds,” Svendsen said.
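One way to picture classification by distinctive characteristics is the Python sketch below. It is a hypothetical illustration rather than the group's actual program: the three features, the tiny sound inventory, and the assumption that the features can be treated as independent are all choices made for the example. Each candidate sound is scored by how well its expected features match the detected probabilities, and the best match wins.

```python
# Hypothetical feature set and sound inventory, for illustration only.
FEATURES = ("voiced", "nasal", "fricative")

SOUNDS = {
    "m": {"voiced": True, "nasal": True, "fricative": False},
    "s": {"voiced": False, "nasal": False, "fricative": True},
    "z": {"voiced": True, "nasal": False, "fricative": True},
}


def score(expected, detected):
    """Probability that the detected features match one sound's profile.

    detected maps each feature to the estimated probability that it is
    present; features are assumed independent (a simplification).
    """
    p = 1.0
    for feature in FEATURES:
        present = detected[feature]
        p *= present if expected[feature] else 1.0 - present
    return p


def classify(detected):
    """Return the sound whose feature profile best fits the detections."""
    return max(SOUNDS, key=lambda name: score(SOUNDS[name], detected))


# Strong vocal cord vibration, little nasality, clear frication.
print(classify({"voiced": 0.9, "nasal": 0.1, "fricative": 0.8}))
```

Because features such as voicing describe how speech is produced rather than any one language, a detector of this kind does not in principle have to be retrained for each new language, which is the property the researchers are relying on.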
The next step for the Norwegian researchers is to develop a language-independent module for use in designing competitive speech recognition products.
“The solutions will result in savings both in terms of time and money,” said Svendsen. “It is an important technology, not only for people who are part of a minor language group such as Norwegian. There are a staggering number of languages with only a few million speakers that would benefit greatly from such tools.”
A by-product is that this type of technology can be useful in contexts where several different languages are being used at once. It takes only 30 to 60 seconds to identify a given spoken language. This can be helpful in instances where, for example, a person giving a presentation in one language quotes a passage in another. It can also be significant in investigative work to determine quickly which language an individual is speaking.