During a demonstration at Microsoft Research Asia’s 21st Century Computing event, Rick Rashid, Microsoft’s chief research officer, showed off the latest breakthroughs the company has made in speech translation technology.
Rashid gave the audience a brief history of speech technology and said that until recently even the best speech systems still had word error rates of 20 to 25 percent on arbitrary speech.
However, just over two years ago, researchers at Microsoft Research and the University of Toronto made a breakthrough: by using a technique called Deep Neural Networks, which is patterned after human brain behavior, they were able to train more discriminative and better speech recognizers than previous methods allowed.
“We have been able to reduce the word error rate for speech by over 30 percent compared to previous methods,” Rashid said in a blog post. “This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979, and as we add more data to the training we believe that we will get even better results.”
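To make the arithmetic behind those figures concrete, here is a minimal sketch of how word error rate is commonly computed (word-level edit distance divided by the number of reference words) and how a relative reduction changes it. The function names and the sample sentences are illustrative assumptions, not anything taken from Microsoft's system.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def apply_relative_reduction(error_rate: float, reduction: float) -> float:
    """Error rate after a relative reduction, e.g. 0.30 for 30 percent."""
    return error_rate * (1.0 - reduction)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))

    # The 20 to 25 percent range quoted above, reduced by 30 percent.
    for old_rate in (0.20, 0.25):
        new_rate = apply_relative_reduction(old_rate, 0.30)
        print(f"{old_rate:.0%} -> {new_rate:.1%}, about one word in {1 / new_rate:.1f}")
```

Under these assumptions, a 20 percent error rate cut by exactly 30 percent lands near 14 percent, or roughly one word in seven, while a 25 percent starting point lands at 17.5 percent. The "one word in seven or eight" figure fits a reduction of somewhat more than 30 percent, which matches the "over 30 percent" wording in the quote.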
Rashid noted that machine translation of text has also been difficult.
“Just like speech, the research community has been working on translation for the last 60 years, and as with speech, the introduction of statistical techniques and Big Data have also revolutionized machine translation over the last few years,” he said.
In his presentation, Rashid demonstrated how the technology takes the text recognized from his speech and runs it through translation, in this case turning English into Chinese. In the first step, the system finds the Chinese equivalents for his words. In the second step, it reorders the words to be appropriate for Chinese, an important step for correct translation between languages.
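The two steps can be illustrated with a deliberately tiny sketch: a word lookup followed by a reordering pass. Everything here is a hypothetical toy. The five-word lexicon and the single hand-written rule are assumptions made for illustration, whereas the system Rashid described relies on statistical techniques trained on large amounts of data.

```python
from typing import Dict, List

# Hypothetical toy lexicon; a real system learns translations from data.
LEXICON: Dict[str, str] = {
    "i": "我",
    "eat": "吃",
    "apples": "苹果",
    "at": "在",
    "home": "家",
}


def lookup_words(english: str) -> List[str]:
    """Step 1: replace each English word with its Chinese equivalent."""
    return [LEXICON[word] for word in english.lower().split()]


def reorder_for_chinese(tokens: List[str]) -> List[str]:
    """Step 2: apply one toy word-order rule.

    Chinese usually places a location phrase ("在" plus a place) before the
    verb, so this moves that two-token phrase to just after the first token,
    which is assumed here to be the subject.
    """
    if "在" not in tokens:
        return list(tokens)
    start = tokens.index("在")
    phrase = tokens[start:start + 2]
    rest = tokens[:start] + tokens[start + 2:]
    return rest[:1] + phrase + rest[1:]


def translate(english: str) -> List[str]:
    """Run both steps in order: lookup, then reordering."""
    return reorder_for_chinese(lookup_words(english))


if __name__ == "__main__":
    sentence = "I eat apples at home"
    print(" ".join(lookup_words(sentence)))  # word-for-word, English order
    print(" ".join(translate(sentence)))     # after reordering
```

For "I eat apples at home", the lookup alone keeps the English order (我 吃 苹果 在 家), while the reordering pass produces 我 在 家 吃 苹果, which is the natural Chinese order. That gap is why the second step matters for a correct translation.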
“We have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice,” said Rashid. “It required a text to speech system that Microsoft researchers built using a few hours speech of a native Chinese speaker and properties of my own voice taken from about one hour of pre-recorded (English) data, in this case recordings of previous speeches I’d made.”
Rashid admitted that the results are still not perfect, but said he believes the technology is very promising.
“We hope that in a few years we will have systems that can completely break down language barriers,” he said. “We may not have to wait until the 22nd century for a usable equivalent of Star Trek’s universal translator, and we can also hope that as barriers to understanding language are removed, barriers to understanding each other might also be removed.”