Speech Heads, after many a voicemail message and perilously rigorous scientific testing, we’re finally ready to give you STBlog’s assessment of Nuance’s VM2T (voicemail-to-text) client.
How the review breaks down
A couple of weeks ago, I had a briefing with Nuance Communications about setting the service up, where they explained the lay of the land. They explained that the version I would be testing is a little bit different from the one you’ll find out in the wilds of the market. As we’ve mentioned before, Nuance’s marketing strategy with VM2T is to distribute through its partners–in this case carriers. Nuance provides the underlying technology to its partners, but each iteration is likely to look a little different according to those partners’ needs. The version I was using was hosted directly by Nuance, so interface specifics probably wouldn’t bear any relation to what most end-users will see.
For one, I had to set up a forwarding service to use it, which an end-user would never have to do. For two, all of the messages were emailed to me rather than sent as text messages. In real deployments, Sean Brown, product manager for mobile applications at Nuance, assured me the messages will be sent as SMS texts under most carriers. Also varying from provider to provider are settings for live agent intervention. Depending on what a provider wants to pay for and provide, they may bring in real people to clean up the texts if a message scores low-confidence.
All that said, the recognition engine (Dragon 10) is identical to the one that carriers will be using, so we focussed on that for the purposes of this review.
The process began when I set up my account, dialing a number that would, from that brave moment on, forward all my voicemails past my provider’s system to Nuance VM2T HQ. There, they’d be subjected to the pinch-and-pull of Nuance’s automated recognition, possible human oversight depending on the strength or weakness of confidence scores, and spat back out to my email as a text with a .wav of the message attached for review. If the system was unable to recognize what was said, it would indicate this with [...]. Likewise, if it didn’t have high confidence and guessed a word, it would write [?] after it.
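For the code-minded, here is a minimal sketch of how that kind of markup could be produced from per-word confidence scores. This is purely illustrative and not Nuance’s implementation: the function name `render_transcript`, the token format, and both threshold values are assumptions of mine, since the briefing didn’t cover how the engine scores words internally.

```python
from typing import List, Optional, Tuple

# Hypothetical thresholds; the actual values used by VM2T are not known.
LOW_CONFIDENCE = 0.6   # below this, keep the guess but flag it with [?]
UNRECOGNIZED = 0.2     # below this, treat the word as not recognized

# A token is (word or None, confidence between 0.0 and 1.0).
Token = Tuple[Optional[str], float]


def render_transcript(tokens: List[Token]) -> str:
    """Render recognizer output using the [...] and [?] markers."""
    out: List[str] = []
    for word, confidence in tokens:
        if word is None or confidence < UNRECOGNIZED:
            # Collapse a run of unrecognized words into a single marker.
            if not out or out[-1] != "[...]":
                out.append("[...]")
        elif confidence < LOW_CONFIDENCE:
            # Low-confidence guess: keep the word, flag it for the reader.
            out.append(f"{word} [?]")
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    sample = [("handmade", 0.9), ("shell", 0.4), (None, 0.0), (None, 0.0)]
    print(render_transcript(sample))
```

A carrier that pays for live agent intervention could presumably hook in at the same point, routing a message to a human whenever too many words fall under the lower threshold.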
The results
Initial tests in the office yielded great results. My first message was processed with near perfection but for two paltry punctuation marks.
Testing, testing testing. Hello I’m testing my voicemail. How is it going? Goodbye.
That success under VM2T’s belt, I decided to step up the difficulty. I picked out Lear’s first long bit of dialogue in the first act of Shakespeare’s King Lear and let Nuance have at it. The results were pretty damn accurate, especially given that I had a couple tongue stumbles in my rendering of the Bard.
Meantime we shall express our darker purpose. Give me the map there. Know that we have divided. In three our kingdom: and it is our fast intent to shake all cares and business from our age conferring them on younger strengths, while we unburdened crawl towards death. Our son of Cornwall, and you, our no less loving son of Albany. We have this hour a constant will to publish. Our daughters’ several dowers, that future strife may be prevented now. The princess, France and burgundy create rival for our youngest daughters love, long in our court have made their [...] and here are to be answered. Tell me my daughters, since now we [...] role interest of territory, cares of state, which of you shall we say, tough love is the most that we are largest bounty made the extend. We are nature [...].
Again, things were a tad funky with the punctuation (niggling, I know). It also read Burgundy as the color rather than the principality, and, best of all, “That we our largest…” was rendered as “tough love is the most that we are…”–a chillingly poetic truism. Also, you’ll notice there were a couple of spots where the recognizer failed and dropped in its bracketed ellipses–particularly at the end. The whole last sentence or so of the monologue is truncated. All told, though, a pretty strong showing for Elizabethan English.
We tested further with people from the office taking cracks at it. A message from my brother Adam B. about a speech implementation down at the local Chili’s yielded similarly strong results with only minor flaws (it transcribed his name as Adam D). Clearing the hoops of controlled lab tests, it was time to subject VM2T to the only worthwhile test–The Real World: NYC.
On the mean streets of the Big Apple, VM2T fared pretty well. It was stumped in some predictable areas–a siren going by as a friend of mine was telling me about homemade shelves he was trying to install yielded handmade shell(?)–the question mark indicating a low confidence score. The same friend, who has a thick Arkansan accent, had his “installin’ shelves n’ shit” rendered as solving a thousand shit(?). There were also goof-ups with names, but for the most part it was reliable.
It’s actually somewhat of a triumph that Nuance is able to get this as right as they have. It’s leaps and bounds over the voice recognition of yore, particularly when you consider the subpar phone connections (cell phones the lot of them) in less than ideal circumstances (the loud streets of New York), but, that said, any triumph Nuance has had with VM2T is still relative. VM2T is still not a precise tool.
Mostly accurate just doesn’t do in some cases. Say when someone is leaving you directions, for instance. In one message that a friend left telling me how to get to a place we were meeting for a project, the names of the street and neighborhood were just not recognized accurately at all. Had I tried to follow them, God knows what state I would have ended up in just trying to get across the East River.
When I talked to Mr. Brown from Nuance about that, he pointed out that the company is working to eventually localize its service–that is, to include things like street names in my area in the working lexicon tied to a particular account. Nuance would essentially leverage location information native to mobile technology to target the grammar. That, he claimed, should alleviate some of the mistakes made. Likewise, he told me that VM2T is using Dragon 10’s adaptive technology to build profiles around callers who call my number, keeping track of the language they use, their cadences, etc., so that as I get more voicemails from my thickly Arkansan friend, the system should adapt to his voice and get better at reading it.
For now, though, the adaptive process takes time and the localization is in the works and unavailable (I couldn’t get a firm estimate as to when we could expect that to actually be implemented). Out of the box, those features aren’t going to help much. I ended up using VM2T basically as a heuristic. It gave me a pretty concrete idea of what each message was about, but if it was something I needed to have an especially accurate understanding of, I still ended up going with the audio. That is to say, it can get the job done for most messages, but it’s not a replacement for audio voicemail altogether. Don’t expect to be relying on it exclusively, which is by no means Nuance’s intention–they see it as a value-adding proposition for carriers, not some kind of voicemail cure-all.
Final thought
At the end of the day, it’s definitely a tool worth checking out if you’re one of those people who gets ten voicemails a day and doesn’t have time to plow through them. The engine seems particularly adept at handling short messages, and does especially well in not-so-noisy circumstances. The calls placed from our office were among the best recognized. Is it worth using if you’re not a heavy voicemail recipient, though? That’s probably a matter of personal taste. Almost all of my voicemails are from my mother and I generally know what they’re about without having to check (they’re about me not picking up the phone). So for me, there’s not such a high payoff for using this kind of service. I’ll probably keep checking my messages the old-fashioned way, but admittedly I’m a curmudgeon who frowns on new-fangledness.

Eric B. —
April 8, 2009 @ 1:01 pm