This is a transcript of the morning keynote panel between
Bill Meisel, President, TMA Associates
Mike Cohen, Manager, Speech Technology Group, Google
Victor Melfi, Chief Strategy Officer & Senior Vice President, Marketing, VoiceBox Technologies
Neal Bernstein, Senior Director, Local & Mobile Search, Microsoft
Michael Wehrs, Vice President, Evangelism & Industry Affairs [Ed: Yes, that is his real title], Nuance Communications
John Tadlock, Lead Technical Architect, Consumer Application Architecture, AT&T
I typed the discussion out as I listened. So if it’s a little sloppy in parts…tough tamales, readers.
I did this for two reasons. First, it was a really good debate and the panelists gave some great information that I was unable to include in the daily Speech Tech news story. I just didn’t have enough space.
The second reason is because one panelist calls another an “ignorant slut” [Ed: According to staffers here, this phrase originates from SNL's Weekend Update editions with Jane Curtin...and is hilarious in the context of voice search.]. I wanted to share that.
And yes. Yes. You’re welcome.
Meisel: Voice search is vague and we’ve kept it vague for this conference. Voice search is a way of implying, like web search, that you can get things quickly and easily. How do you see this paradigm? What can we do with it?
Cohen: In terms of terminology, search is a good metaphor for the new capabilities and applications coming along. On the audio indexing side, it’s including more types of information, making it available. On the mobile application side, we look at search to give users more and more flexibility in how they express their needs.
I’d be careful with the word “voice.” We mustn’t become too voice centric. With some user scenarios you want voice and in other scenarios you want other ways or in other combinations. We want to focus on bringing end-users the best value.
Melfi: Definitionally, it’s like nailing Jello to a tree. It can range from an aspect of input architecture to, at its extreme, voice as an interface particularly germane to times when people’s hands and eyes are busy. (Driving) is a use case. That’s voice as an interface rather than voice as an input methodology.
Bernstein: We don’t want it to become another hyped term like Vcommerce. We need to define it more specifically. It’s not a market. In my opinion (voice search) is not voice-driven IVR. It could be searching audio files, but it’s really searching for content by voice. It can take the form of an app that runs on all sorts of devices, a PC, a phone. It’s not a market, it’s not a product; it’s a two-part function.
Tadlock: I read an article describing Web 2.0 as a perspective on technology we’ve been using for a while. I think of voice search the same way. You can emphasize the voice channel or the search. If you look at topics in this conversation, we do so very broadly. So you have large vocabulary recognition, open prompts, but it’s just a perspective. Saying it that way doesn’t mean a perspective isn’t important. The way you look at technology: you can make business decisions based on perspectives. I’m not minimizing voice search.
Wehrs: Voice search isn’t The Thing. It’s a way to get something done. People may choose the modality of voice to get information. There’s lots of technology underneath it to make that happen. The relevance of it, the flexibility of it, become interesting components in how you measure how good it is.
Meisel: If you think of web search, you go to a company web site, you expect a search box. The same thing can happen with certain mobile applications. The tech is capable of supporting that. In defense of the broadness of the term, it’s useful sometimes to have a summary term. What will be the most important apps in voice search?
Wehrs: We’re in an interesting position in that a number of devices ship with a voice-enabled app from Nuance. We have some real data as to what people are asking for. We also have the other perspective on directory assistance.
Primarily in a mobile world, people are out and about. If they’re in a car that’s a subset of that. The search criteria are 1) finding the nearest pizza joint. That’s the number one request we get. McDonald’s is a high request. Five times as many people ask for it as ask for Burger King. If you look at what people do in a search application, it’s primarily about eating. Food and directions on how to get there.
After that you fall into the messaging bucket in terms of frequency of use: SMS and voice messaging.
Tadlock: Thinking about applications, I like to categorize things in three areas: Connectivity, routing, and self-service.
With connectivity, it’s customers trying to figure out what “phone number” they need to call. You’ll have various flavors of directory assistance and voice dial.
By routing I mean determining the customer’s purpose or intention. So in the example you provided, the customer wants to find a restaurant. The system looks for what the customer wants to do. That’s a search activity.
Self-service is automating the task the customer is trying to get to. If AT&T were to automate technical support for mobile phones further, let’s say the customer calls in and is having a problem with their handset. We’d have to figure out which of over 9000 handsets the customer is using. That’s a search activity on the customer side. We’d have to ask what problem they’re having. That’s a triage activity and a search activity as well.
Bernstein: From the Microsoft view, we think of mobile as an input channel connected to your other experiences. Your PC experience that follows you in the car or your media player that follows you when you leave the car.
It’s a seamless interface, probably cloud based, that gets to know you and improves over time. We do multimodal, which is key. There’s a new definition of WYSIWYG (What You See Is What You Get). I’d add WYSIWYG ART: And Respond To. So having both graphics and voice as the output and being text-driven, graphics-driven. It’s true multimodal, not just a screenpop of the experience.
Melfi: I’d also refer to functionality, which is specific to location, which is germane to the mobile experience. It’s meeting the particular need of a customer in a situation. And at the backend, it adds a new variable to the analytic perspective to drive advertising and personalization. That’s very important.
Cohen: Obviously with new technologies, it’s hard to guess what will be the killer apps. What we know from plenty of industry data is that people are most highly motivated to use voice when they’re on the go, way more than in other scenarios. The hands-free environment is very important. Given that mobile is the big usage case, the question is: What are the scenarios?
People are pretty well aware that the nature of what you search for on your mobile is different than what you search for on your desktop. On your mobile, you deal with immediate needs, location-based things that need immediate satisfaction. Those are the areas of most-useful purpose.
Melfi: I want to change my mind. One problem I’ve had with voice is the concept of the application. Lots of friends have developed cool apps as part of an infrastructure. Apps live on a device the interface of which is so muddled that it’ll get lost in the quagmire of a device. It’s more important as an interface. You need to cut through the crap on the surface.
Meisel: I see a strong prejudice towards a graphical user interface (GUI). The mobile device will be ubiquitous and you’ll think of it as your personal assistant. Eventually it’ll know about you. It’ll make your default theater the one you used the last time.
Bernstein: You’re talking about the direction of the whole mobile phone industry. Voice will act as an interface. To Victor’s comments, the ODMs (Original Design Manufacturers) and carriers will continue to embed the technology and put it front and center for the users in different creative applications, starting with local search and expanding to other domains.
Where it’ll become unique is similar to what iPhone has done; put their local search with Google on the homepage.
Wehrs: I look at it differently. How many apps are on the average phone today? How many people actually use all of them? It isn’t more apps or how you position them. I guarantee you, yes you can reorder the menus and make them more accessible and you end up with this weird skewing as to how it should be.
Discoverability and integration are key. No normal person would go through the effort of running a bunch of apps to make something work. How do we make it run seamlessly for a person, make it knit together?
Speech is a layer of integration to deliver on these more comprehensive scenarios.
Melfi: The GUI is necessary to categorize objects. You need to transcend that, see voice not as a continuum but as an alternative to GUI, which falls apart when you get to mobile.
Meisel: Speech is great if you know what you want so you don’t need to see a list. Speech is great at that long list recognition. It’s not surprising that two of the early emerging applications are directory assistance where you’re looking for a business. Let’s look at directory assistance, which is important: How do you see that application evolving? Will it stay as directory assistance or become, if I may borrow a term from the past, a voice portal?
Tadlock: I’d first like to respond to something that Michael said about usability: what is it that customers really use and don’t use? That applies to directory assistance as well: what are customers really going to do, what do they want to do? We have to track that and provide the service that customers want to use. On the flipside, AT&T has a long history of providing directory assistance automation. We have to approach that from the goal that we want to be 100 percent successful on every phone call that comes in. You see a lot of voice search apps that don’t necessarily go well with your customer; that’s not acceptable. If we’re doing directory assistance, we can automate some calls fully. But we want 100 percent success, so you’ll need to have varying degrees of partial automation. That means a service integrated in your agent structure.
Wehrs: I’ll take a spin on it. Directory assistance is the way in, where the masses will integrate. We see it in our product offering. Spoken has taken it to the next level.
Yes, you can start to get into more complicated capabilities. And that means failing occasionally in a graceful way that doesn’t mean the customer throwing his phone against the wall and demanding a refund. We can’t get our market expectations ahead of what we can actually deliver.
So let’s start off with getting automation rates better, getting scaled agent tiers in place because we will need that as a core capability. And then we can talk about adding more complexity.
But to come out today and say “Everyone can have a personal concierge”: we’re not there yet.
Melfi: I have a contrarian take. 411 from a technology perspective sits on an IVR which is incapable of expanding to enhanced 411 because it’s more menu driven. I tried an enhanced 411 experience and it took me two minutes to get a stock quote. I was very frustrated. I don’t think 411 is appropriate for mobile search: it’s the problem of deeply nested data. It’s the stuff of which SNL skits are made.
Bernstein: I disagree with you, Victor. The first part of the question is: is voice search really using 411 as a proxy? It is. We’re getting close to this 90 percent completion rate bar.
To go back on something I said earlier saying voice search is a feature and not a product: The exception to that is the 411 market. It’s voice driven and core search. It’s a crucial piece of the market and it does extend to things like ratings and reviews and community content.
As far as the nested problem, that’s a historical problem of IVRs in general. So from the Microsoft point of view, we’re going to continue to incorporate, taking more and more advantage of personalization features. So (voice search) can hopefully figure out what’s the intent to eliminate frustration of nested menus, incorporate the screen and keep users from having to go through all of them, because they don’t.
Melfi: Oh Jane, you ignorant slut.
Cohen: 411 as a starting point on your mobile: It’s very unclear what people exactly want. On the other hand there are very clear examples of services as far as getting directions that are very natural, clear extensions to 411 experiences.
I can’t give you any numbers (related to Goog411). Generally, I can say a lot of people get a lot of useful information. We’re getting high task completion rates. I can’t cite more specifics than that.
Melfi: Oh Jane.
Meisel: Is the technology the bottleneck in some of these harder applications?
Bernstein: I’ll give the same answer Mike gave: MS’s product is 1-800-CALL-411. The thing that’s the most interesting isn’t just the upward trend, but the average calls and queries per month. It’s also going up. That shows the usefulness of the app and that’s a very important metric we track as it relates to customer experience.
The technology is good enough. We all believe that; it’s about how well we design the application. What will take it to the next level will be taking advantage of the screen and personalization, making that application more streamlined.
Cohen: I take issue with the claim that the speech technology is good enough: it’s good enough to bring value, but there’s a huge amount of room for improvement. There’s a lot more value you can bring to end users with improved speech technology. The quality of the underlying speech technology is one of the big bottlenecks.
The other is user interface design: How do people narrow down their search? How do they better specify location? Improving on the UI (user interface) is in some ways more challenging than improving on the recognition technology. For recognition, it’s easier to define quantitative metrics.
There’s a lot of room for improvement.
Wehrs: Yes, speech tech, algorithmically, can be improved. The integration of grammars can be improved. There are significant improvements just in that core technology.
The multimodality components create interesting metaphors about how things are designed today and how difficult it is to improve.
For example: You’re on a personal navigation device which has voice-driven destination entry. How does that get implemented? The real benefit, and what most people don’t realize, is: Look at what’s happening on the screen. Your selections are on the screen. If your eyes aren’t busy, you’re pushing the button when it’s clearly faster. Rather than have the computer ask you questions, you just point.
But doing that’s a lot of work because the architecture just isn’t in place. We’re at the baby steps of integrating multimodality.
Tadlock: The AT&T experience is informative here. We’d established in our labs a criterion in advance, a speech recognition performance criterion, waiting for the technology to achieve a certain level of experience. So we’d test recognition performance every year to see if it was good enough. We crossed that “good enough” threshold in 2000. We very quickly went into trials, which meant automating the entire state of Oklahoma.
Those trials were technically very successful. But did we deploy? No. Because at that time we had to respect our labor contracts.
We did another trial a few years later. The one thing I remember most about it is there were accountants with the largest Excel spreadsheets I’d ever seen, modeling every aspect of customer self-service including agent severance packages. It came down to pennies, especially when directory assistance was in declining call volumes.
I don’t know if everyone is aware of that here. Call volumes are declining dramatically and that drives your business in very strange ways. Why do you spend a large investment in declining call volumes? Everything was driven by the guy who owned that spreadsheet. The accountant. The decisions will be made by the guy with the spreadsheets.
Meisel: In terms of multimodality, this is an issue of who controls the phone. Even if you have an operating system, it’s the owner of the service like AT&T that determines what goes on. It’s not easy to insert multimodality from a business point of view. Will this open up? Will this hurdle be overcome?
Melfi: That’s why it’s best not to narrow our conversation to phones. It’s a business architecture issue. From a business strategy perspective, at VoiceBox, I just spent over a year banging my head on the issue but in my mind, you’ll see innovation in alternative devices because of convergence and the declining costs of a communication card. The point is other industries will step up to the plate to connect to portable devices. That’s where we’ll see movement and it might change the minds of carriers. But they’re very complex organizations.
Wehrs: Global telecom revenues from mobile network operators: You’re dealing with well over a trillion dollars per year. Governments get involved if you try to mess with that system.
Network operators won’t tolerate high revenue shares. They’ll pat you on the back and say “Thank you very much.” You have to be exactly the right shape to get through those hurdles or the operators will give you certain requirements you have to meet.
The best is Verizon. They’re more open, but that’s a trend of one.
The channel to the market is very interesting. Yes it’s happening, but it’s not happening fast.
Bernstein: At Microsoft, it’s not the guys who control the spreadsheets who control the product plans. That allows us to innovate.
We as a company are very carrier-partnership oriented. In the US it’s absolutely a control point the carriers have. In other markets, in Europe, it’s not nearly as controlled. In Europe, they sell direct. There’s more flexibility.
Related to carriers, you’ll see VoIP evolving on the handsets. And all sorts of devices, whether it’s an MP3 or whatever. Ford Sync is the best example of that today. It was supposed to be in just a few cars and it’s in every Ford. It’s expanding beyond Ford and has become a driver of people getting into the showroom. Examples like that will introduce new types of devices.
What Victor is doing at Magellan is an example. We’ve traditionally thought of the market and the carrier’s control. They’re an important partner, but they’re not the only option.
Meisel: Ford put as much money in launching the Sync as they did a new car model.
Bernstein: From a branding perspective, I must add it’s not Ford Sync. It’s Microsoft Sync.
Meisel: It has Nuance technology as well.
Wehrs: We wouldn’t write the check for that branding.
Melfi: I think we will point to Ford Sync as a seminal point. It’s gotten into mindshare.
Meisel: Is advertising going to help? If so, what form? What’s the business model that will support these applications?
Wehrs: (shrugs)
Melfi: If you go back to the history of media, I can point to seven research studies done this year alone asking: Do you want commercial content on your phone? 90 percent said absolutely not.
They asked: How about your sports team? 70 percent said absolutely.
Advertising is something we love to hate; people accept it more than they realize. But the economics of doing this on mobile are less favorable than on the internet. It’s owned by multiple carriers and if you use the search model paradigm, the value of a search is intuitively lower than the internet model, which is driven by mortgage brokers and lawyers.
And when you’re asking about restaurants, the value of that lead is less than a penny.
I don’t think we’ll have a choice though because we have to go there. And sometimes, markets are made.
Wehrs: This falls into two categories: there is a segment of the population that will accept advertising to access the service. There’s another segment that says, I’ll just pay because I don’t want to deal with that advertising; they’ll have a tiered service. When I say advertising, I was very specific in the use of that word. When it falls into information relevant to me, it falls into information I want to see. So if it’s targeted enough, the people that pay for it will be happy to get it too. You have to be careful sitting on that bubble.
Cohen: Well, what do I know about ads?
As far as Goog411 goes, our focus now is improving the end user experience. There can be many ways that bring value either through an advertising outlet or not.
What I will say is that media like that become the basis for advertisements. It’s important to respect the end-user and not be intrusive. Try to bring real value with advertising.
The way it evolved on the internet is a good example: matching those ads to what users needed at the time turned out to be a big breakthrough.
Are there similar opportunities for voice search? I don’t know.
People are interested in simple, sparse interactions. On mobile, it’s even more important to be simple, efficient, and quick. If suddenly your directory assistance call turns into a radio station playing crazy ads that are irrelevant to what you’re interested in, that’s not an approach that will work.
So you need to think of the way to bring value.
Bernstein: The business model depends on who you’re talking about: the consumer? The device manufacturer?
It takes a form of subscriptions, ads, or transactions. If you’re providing 411, it’s transaction. If you’re providing information to a driver that doesn’t want to be distracted, it’s subscription.
The thing is, you’re not going to ask for Ruth’s Chris Steakhouse and hear a McDonald’s ad. There are more vehicles than the audio channel for delivering relevant ads. There’s the text channel and the graphical channel.
Those ads need to add value and be unobtrusive to the user experience and those that aren’t will be inferior, regardless of what the device happens to be. Related to the device, you’ll see a cross-platform piece. In the case of Microsoft or Google that are part of an industry-wide ad network, your voice-search experience may result in an ad for the P&D. It may not.
Tadlock: A lot of people here are talking about network services. The focus at AT&T is less on apps and services and more on the next generation network infrastructure that can support more arbitrary apps and services.
The idea is you can’t wait until the business model is discovered: you still need to make progress. We have a next-generation network architecture called CARTS. The idea is to provide a home in the AT&T network for a multitude of services. Trying to anticipate in the future what will work. This is a business model that works across multiple levels along multiple companies.
It needs to synergistically come together. The focus at the moment at AT&T is to be prepared for whatever business model does appear. A lot of effort is put into that approach.
Audience Question: What is the one technology problem you’d like to see solved by 2009?
Melfi: That’s easy for me: the problem of the interface. That’s where I put my money in.
Cohen: I agree that’s the single biggest bottleneck. The other one is recognition accuracy. We’ve had incremental progress for years, but it’d be nice to see some big breakthroughs.
Bernstein: I’d like to expand the corpus from local search to web search. Relevant to what you want to do on your mobile, you can get more than names and directions. You’re shopping and you want a review on a camera or a better price. Expand what you’d normally do from your desktop to your mobile phone.
Tadlock: I think the real issue is what combination, what service has a mass-market appeal and a business model that works? If you could answer that question, then you’ll get all the technology people making sure (the technology) works.
Wehrs: It comes down to integration. That can be manifested in the UI. It’s about integration and how it manifests in a UI.
But I’ll agree that the number one problem that needs to be resolved is the business problem that lets us make enough money to continue this. It’s hard to get innovation when you’ve got so many controls placed on you by others.
Audience Question: How will demographics determine killer apps from a market segment perspective?
Melfi: I have a great video I took yesterday. My daughter had a slumber party and she and her friend were playing on the iMac voice interface. The younger demographic is remarkably forgiving of engaging interface technology. The younger demographic is particularly open to this form. The killer apps will be sharing of content, because that’s what they do all the time.
We might see some content as intrusive, but the younger generation sees it as engaging content. When I see a Harry Potter trailer, I know I’m being hit on. My daughter shares it with her friends.
Wehrs: Two thoughts: You’ve got the younger demographic, and also a skewing of 50-plus. The general older set uses the phone to dial a phone.
The other one I’ll give you is a Hollywood reference. There was a Star Trek movie, with the whales, where Scotty comes down trying to give a formula for some nonsense. He sits down at the computer and says, “Computer.” He picks up the mouse and says “Oh, computer.”
Someone says, “Why not use the keyboard?” and he says “Oh that’s so quaint.”
You get to a point where people have expectations on how to interact with a device.
The younger generation is fully expecting that all things are connected, so having a speech-related interface is just another interface.

STM Blog —
March 11, 2008 @ 9:04 pm