Allison Smith

The Tough Task of TTS

I was thrilled a couple of years ago when I was approached by Cepstral, one of the premier architects of high-quality, natural-sounding voice synthesis products, to be one of their text-to-speech voices... and I was even thrilled by their very public proposal. They gave a presentation at Astricon one year, and while they were discussing their range of available voices, a slide appeared on the screen that read: “Coming soon: The Allison Voice!”

Geez, give a girl some notice. At least we’re not capturing the event on a jumbotron.

Text-to-speech (TTS) synthesis is basically the artificial production of human speech (most people’s first thought will gravitate immediately to Stephen Hawking, whose TTS voice has become a part of his persona).

Rumor has it that Cepstral, which designed his initial TTS utility, has offered him numerous upgrades and more current and evolved versions throughout the years. He has turned them all down. His early, rudimentary voice works well; it is recognizable; and most significantly, it has practically become a part of who he is.

TTS products immeasurably enhance the lives of those unable to speak, and it’s imperative that the user and voice connect on a visceral level.

A TTS system converts normal language text into speech by concatenating pieces of recorded speech that are stored in a database. Phonemes are simply broken-down sound fragments, and graphemes are the written letters and letter combinations that stand for them; the system recognizes the graphemes in the typed words and assigns the corresponding sounds. The storage of entire words and even sentences allows for high-quality output but is laborious and time-intensive to record.

Tell me about it.
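
For readers who like to see the moving parts, the concatenation idea described above can be sketched in a few lines of Python. This is only a toy illustration, not Cepstral's engine: the names WORD_UNITS, PHONEME_UNITS, LEXICON, and synthesize are all hypothetical, and plain strings stand in for the recorded audio clips. It assumes the system prefers a whole recorded word when one exists and otherwise falls back to stitching together smaller sound fragments.

```python
# Toy unit database. In a real system the values would be audio clips;
# every entry here is a hypothetical placeholder.
WORD_UNITS = {"hi": "<hi.wav>", "allison": "<allison.wav>"}
PHONEME_UNITS = {"S": "<s.wav>", "M": "<m.wav>", "IH": "<ih.wav>", "TH": "<th.wav>"}

# Hypothetical pronunciation lexicon: spelling -> list of sound labels.
LEXICON = {"smith": ["S", "M", "IH", "TH"]}


def synthesize(text):
    """Return the ordered list of recorded units to play back for `text`."""
    units = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in WORD_UNITS:
            # A whole-word recording exists: best quality, use it as-is.
            units.append(WORD_UNITS[word])
        elif word in LEXICON:
            # Otherwise concatenate the smaller fragments for this word.
            units.extend(PHONEME_UNITS[p] for p in LEXICON[word])
        else:
            raise KeyError(f"no recording covers {word!r}")
    return units


print(synthesize("Hi Allison Smith!"))
```

The more whole words the database holds, the less stitching is needed, which is one way to see why a larger recording script tends to give smoother output.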

Cepstral’s goal, when they proposed the idea of working together, was to build a very robust TTS engine—possibly the most robust they’d ever designed. Due to the prevalence of my voice, not only on the Asterisk Open Source PBX but with many other telephony platforms, they saw the advantages in recording volumes more “sounds” than usual to create as seamless as possible an interface that would dovetail well with pre-installed stock prompts and custom-recorded prompts alike. As a way of achieving that, a script arrived that had the breadth (and thickness) of a typical major-city white pages telephone book. No problem!

In this script, I found thousands upon thousands of single words and just as many pages of random and often nonsensical sentences (“During the period, the company continued to benefit from favorable tax effects” or “But oh what a hit it could be” as examples). From larger sentences, phonemes can be farmed (think of the single sounds and combinations of sounds that could be extracted from the sentence: “Julie put on her red coat and made it to the train station by nine”) and stored for retrieval when the system perceives that the fragment is needed.
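
As a rough sketch of what "farming" might look like, the Python below indexes every single sound and every adjacent pair of sounds in a transcribed sentence, so that a fragment can be looked up later by where it was recorded. The function name farm_units, the sentence ID, and the hand-written sound labels for "red coat" are assumptions made for illustration; how a production system actually segments and stores its recordings is not described here.

```python
from collections import defaultdict


def farm_units(sentence_id, phonemes):
    """Index single sounds and adjacent sound pairs by (sentence, position)."""
    index = defaultdict(list)
    # Single sounds.
    for i, sound in enumerate(phonemes):
        index[(sound,)].append((sentence_id, i))
    # Combinations of two neighbouring sounds.
    for i in range(len(phonemes) - 1):
        index[(phonemes[i], phonemes[i + 1])].append((sentence_id, i))
    return index


# Hypothetical hand-written transcription of the words "red coat".
units = farm_units("julie_01", ["R", "EH", "D", "K", "OW", "T"])
print(units[("K", "OW")])  # where the "co" of "coat" can be found
```

A single sentence yields many reusable fragments this way, which is why even nonsensical sentences are worth recording.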

It’s not flawless, though. At a subsequent Astricon after the Allison TTS voice was launched, one of the Digium staffers was very eager to unveil the Cepstral Allison Voice; he typed in “Hi! I’m Allison Smith!” and out of the computer I spoke: “Hi! I’m Allison Smeeeeth!” I find it hard to believe we didn’t capture the “ih” sound that the “i” in “Smith” makes, but there you have it.

One of the most difficult sounds to capture in a TTS application is, oddly enough, the word “of.” It is widely used in the English language, and it’s one of the few words where “f” is pronounced “v.” Naturally, this creates problems for TTS utilities.
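
One common way to cope with a word like “of” is to check a small exception list before applying general letter-to-sound rules. The sketch below assumes that approach; the tables LETTER_SOUNDS and EXCEPTIONS, the sound labels, and the function pronounce are hypothetical, and the letter table is deliberately tiny.

```python
# Toy letter-to-sound table (hypothetical and far from complete).
LETTER_SOUNDS = {"o": "AA", "f": "F", "i": "IH", "n": "N"}

# Words whose pronunciation breaks the general rules are listed explicitly.
EXCEPTIONS = {"of": ["AH", "V"]}


def pronounce(word):
    """Return sound labels for `word`, consulting the exceptions first."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    # Fall back to the naive one-letter-one-sound rule.
    return [LETTER_SOUNDS[letter] for letter in word]


print(pronounce("of"))
print(pronounce("if"))
```

Without the exception entry, the naive rule would give “of” an “f” sound, which is exactly the kind of slip a listener notices.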

I devoted about three hours a day for several weeks to getting the project recorded, and managed to soldier through it, not only voicing all the words and sentences but editing them into individual sound files. Apparently it was worth it because the Cepstral Allison TTS voice is the number one selling voice for Cepstral, and is offered as a very useful add-on for purchasers of the Asterisk PBX.

The use of TTS for the speaking-disabled allows for clear, real-time communication for those with challenges; other applications in the area of transcription of the written word to audio format are vast and key to its growth and evolution. While it will never “replace” me (I’ve had a few clients who have tried doing longer paragraphs and one client who even tried to forge together an entire on-hold system using strictly my TTS voice, unsuccessfully), the TTS utility is ideal for filling in gaps, smithing together proper and place names, or simply bridging together prompts that need integration. While the Allison TTS voice, just by the volume of material that built it, is a formidable and extensive TTS utility, it will always be identifiable as “mechanized” and never apt to be mistaken for an organic recording.

Check out the Cepstral Allison Voice at: www.cepstral.com/demos.

Type anything in, and I’ll say it. Yes, anything. My husband is prone to typing in things like: “You are correct 100 percent of the time!” or “There are no chores for you today!” Hearing them in a slightly robotic, manufactured style is better than not hearing them at all….

————

Allison Smith is a professional telephone voice, having voiced platforms for Sprint, Verizon, Qwest, Cingular, Bell Canada, Vonage, Twitterfone, Hawaiian Telcom, and the Asterisk Open-Source PBX. Her Web site is www.theivrvoice.com.