Creating a computer voice that people like

When computers speak, how human should they sound? This was a question that a team of six IBM linguists, engineers and marketers faced in 2009, when they began designing a function that turned text into speech for Watson, the company’s “Jeopardy!” – playing artificial intelligence programme.

Eighteen months later, a carefully crafted voice – sounding not quite human but also not quite like HAL 9000 from the movie “2001: A Space Odyssey” – expressed Watson’s synthetic character in a highly publicised match in which the programme defeated two of the best human “Jeopardy!” players.

The challenge of creating a computer “personality” is now one that a growing number of software designers are grappling with as computers become portable and users with busy hands and eyes increasingly use voice interaction.

Machines are listening, understanding and speaking, and not just computers and smartphones. Voices have been added to a wide range of everyday objects like cars and toys, as well as household information “appliances” like the home-companion robots Pepper and Jibo, and Alexa, the voice of the Amazon Echo speaker device.

A new design science is emerging in the pursuit of building what are called “conversational agents,” software programmes that understand natural language and speech and can respond to human voice commands. However, the creation of such systems, led by researchers in a field known as human-computer interaction design, is still as much an art as it is a science.

It is not yet possible to create a computerised voice that is indistinguishable from a human one for anything longer than short phrases that might be used for weather forecasts or communicating driving directions.

Most software designers acknowledge that they are still faced with crossing the “uncanny valley,” in which voices that are almost human-sounding are actually disturbing or jarring.

The phrase was coined by the Japanese roboticist Masahiro Mori in 1970. He observed that as graphical animations became more humanlike, there was a point at which they would become creepy and weird before improving to become indistinguishable from videos of humans. The same is true for speech.

“Jarring is the way I would put it,” said Brian Langner, senior speech scientist at ToyTalk, a technology firm in San Francisco that creates digital speech for things like the Barbie doll. “When the machine gets some of those things correct, people tend to expect that it will get everything correct.”

Beyond correct pronunciation, there is the even larger challenge of correctly placing human qualities like inflection and emotion into speech. Linguists call this “prosody,” the ability to add correct stress, intonation or sentiment to spoken language.

Today, despite all the progress, artificial intelligence still cannot fully reproduce the rich emotional range of human speech. The first experimental results – gained from applying machine-learning algorithms to huge databases of emotion-laden human speech – are only just becoming available to speech scientists.

Synthesised speech is created in a variety of ways. The highest-quality techniques for natural-sounding speech begin with a human voice that is used to generate a database of parts and even subparts of speech spoken in many different ways. A human voice actor may spend from 10 hours to hundreds of hours, if not more, recording for each database.
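The approach described above is known as concatenative, or unit-selection, synthesis: the system stitches together small recorded fragments, choosing among alternative takes of each fragment so that the joins sound smooth. A minimal sketch of that idea, using made-up toy data (short lists of numbers standing in for audio, and a crude join cost in place of the acoustic measures real systems use):

```python
# Toy sketch of unit-selection synthesis. Each "unit" (e.g. a diphone)
# has several recorded variants; we greedily pick the variant whose
# first sample best matches the end of the audio built so far -- a
# stand-in for the join cost real synthesisers compute.

# Hypothetical database: unit name -> candidate "waveforms"
# (short lists of floats standing in for recorded audio).
DATABASE = {
    "h-e": [[0.0, 0.2, 0.4], [0.1, 0.3, 0.5]],
    "e-l": [[0.4, 0.3, 0.2], [0.9, 0.7, 0.5]],
    "l-o": [[0.2, 0.1, 0.0], [0.5, 0.4, 0.3]],
}

def join_cost(prev_tail, candidate):
    """Squared mismatch between the last sample so far and the
    candidate's first sample -- a crude continuity measure."""
    return (prev_tail - candidate[0]) ** 2

def synthesize(units):
    """Concatenate one candidate per unit, greedily minimising join cost."""
    output = []
    for unit in units:
        candidates = DATABASE[unit]
        if not output:
            best = candidates[0]  # no join yet: take the first take
        else:
            best = min(candidates, key=lambda c: join_cost(output[-1], c))
        output.extend(best)
    return output

audio = synthesize(["h-e", "e-l", "l-o"])
```

Real systems score candidates on spectral shape, pitch and duration rather than a single sample, and search all paths with dynamic programming instead of greedily, but the database-of-recorded-parts structure is the same.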

The roots of modern speech synthesis technology lie in the early work of the Scottish computer scientist Alan Black, who is now a professor at the Language Technologies Institute at Carnegie Mellon University.

Black acknowledges that even though major progress has been made, speech synthesis systems do not yet achieve humanlike perfection. “The problem is we don’t have good controls over how we say to these synthesisers, ‘Say this with feeling,’ ” he said.

For those like the developers at ToyTalk who design entertainment characters, errors may not be fatal, since the goal is to entertain or even to make their audience laugh. However, for programmes that are intended to collaborate with humans in commercial situations or to become companions, challenges are more subtle.

These designers often say they do not want to try to fool the humans that the machines are communicating with, but they still want to create a humanlike relationship between the user and the machine.

IBM, for example, recently ran a television ad featuring a conversation between the influential singer-songwriter Bob Dylan and the Watson programme in which Dylan abruptly leaves the stage when the programme tries to sing. Watson, as it happens, is a terrible singer.

The advertisement does a good job of expressing IBM’s goal of conveying a not-quite-human savant. The designers wanted a voice that was not too humanlike and, by extension, not creepy.

Mispronunciation pitfalls

“Jeopardy!” was a particularly challenging speech synthesis problem for IBM’s researchers because although the answers were short, there were a vast number of possible mispronunciation pitfalls.

“The error rate, in just correctly pronouncing a word, was our biggest problem,” said Andy Aaron, a researcher in the Cognitive Environments Laboratory at IBM Research.

Several members of the team spent more than a year creating a giant database of correct pronunciations to cut the errors to as close to zero as possible. Phrases like brut Champagne, carpe diem and sotto voce presented potential minefields of errors, making it impossible to follow pronunciation guidelines blindly.

Researchers interviewed 25 voice actors, looking for a particular human sound from which to build the Watson voice. Narrowing it down to the voice they liked best, they then played with it in various ways, at one point even frequency-shifting it so that it sounded like a child.

“This type of persona was strongly rejected by just about everyone,” said Michael Picheny, a senior manager at the Watson Multimodal Lab for IBM Research. “We didn’t want the voice to sound hyper-enthusiastic.”

Researchers looked for a machine voice that was slow, steady and, most importantly, “pleasant.” In the end, acting more as artists than engineers, they fine-tuned the programme. The voice they arrived at is clearly a computer’s, but it sounds optimistic, even a bit peppy.

“A good computer-machine interface is a piece of art and should be treated as such,” Picheny said.
