No laughing matter

The engineer is one of a growing number of researchers trying to crack the next barrier in computer speech synthesis - emotion. Computers are starting to laugh and sigh, express joy and anger, and even hesitate with natural "ums" and "ahs," in labs at many research centres and universities.

The work is called expressive speech synthesis, and Ellen Eide of IBM in New York says "it’s the hot area" in the field today. The multinational plans to introduce a version of its commercial speech synthesiser that incorporates the new technology.

It is also one of the hardest problems to solve, says Sundaram, who has spent months at the University of Southern California tweaking his laugh synthesiser. And the sound? Mirthful, but still machine-made.

"Laughter," he says, "is a very, very complex process."

In 2001, psychologists Jo-Anne Bachorowski, of Vanderbilt University, and Michael Owren, of Cornell, discovered the complexity of laughter when they recorded 1,024 responses from college students watching the films Monty Python and the Holy Grail and When Harry Met Sally.

Men tended to grunt and snort, while women generated more song-like laughter. When some of the audience cracked up with laughter, they hit pitches in excess of 1,000 hertz - roughly high C for a soprano. And those were just the men.

But the complexity of the challenge has not deterred the quest for expressive speech synthesis. It is driven primarily by a grim fact of electronic life: the automated response computers that many of us talk to every day, as we look up phone numbers, check portfolio balances or book airline flights, might be convenient, but they can be annoying.

Commercial voice synthesisers speak in the same perpetually upbeat tone whether they’re announcing the time of day or telling you that your pension has just plummeted.

David Nahamoo, overseer of voice synthesis research at IBM, says businesses are concerned that as the technology spreads, customers will be turned off. "We all go crazy when we get some chipper voice telling us bad news," he says.

And so, in the coming months, IBM plans to roll out a new commercial speech synthesiser that understands. The Expressive Text-to-Speech Engine took two years to develop and is designed to strike the appropriate tone when delivering good and bad news. The goal, says Nahamoo, is "to really show some sort of feeling".

Scientist Juergen Schroeter, who oversees speech synthesis research at AT&T Labs, says his organisation wants not only to generate emotional speech, but to detect it too. "Everybody wants to be able to recognise anger and frustration automatically," says Julia Hirschberg, a former AT&T researcher who is now at Columbia University in New York. For example, an automated system that senses stress or anger in a caller’s voice could automatically transfer a customer to a human for help, she says.

Hirschberg is developing tutoring software that can recognise frustration and stress in a student’s voice and react by adopting a more soothing tone or by restating a problem. "Sometimes, just by addressing the emotion, it makes people feel better," says Hirschberg.

So, how do you make a machine sound emotional? Nick Campbell, a speech synthesis researcher at the Advanced Telecommunications Research Institute in Kyoto, Japan, says it first helps to understand how the speech synthesis technology most people encounter today is created.

The technique, known as "concatenative synthesis", works like this: engineers hire actors to read into a microphone for several hours. Then they dice the recording into short segments. Measured in milliseconds, each segment is often barely the length of a single vowel. When it’s time to talk, the computer picks through this audio database for the right vocal elements and stitches them together, digitally smoothing any rough transitions.
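
For readers who want to see the idea in miniature, the stitching step can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not how any commercial synthesiser is built: the "recordings" are stand-in sine tones rather than an actor's voice, the database holds exactly one segment per label (real systems choose among many candidates), and the names `tone`, `crossfade_join`, `synthesise`, `SAMPLE_RATE` and `overlap_ms` are all invented for this example. The smoothing shown is a simple linear crossfade at each join.

```python
import math

SAMPLE_RATE = 16000  # samples per second (an assumed value for this sketch)


def tone(freq_hz, duration_ms):
    """Stand-in for one recorded segment: a short sine tone (hypothetical data)."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def crossfade_join(left, right, overlap):
    """Join two segments, linearly blending `overlap` samples to smooth the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    head = left[:len(left) - overlap]
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight given to the incoming segment
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + blended + right[overlap:]


def synthesise(unit_labels, database, overlap_ms=5):
    """Look up each requested unit and stitch the segments together in order."""
    overlap = SAMPLE_RATE * overlap_ms // 1000
    out = []
    for label in unit_labels:
        segment = database[label]  # raises KeyError if the unit was never recorded
        out = crossfade_join(out, segment, overlap)
    return out


# A toy "audio database": one millisecond-scale segment per label.
database = {
    "h": tone(200, 40),
    "e": tone(300, 80),
    "l": tone(250, 60),
    "o": tone(220, 90),
}
audio = synthesise(["h", "e", "l", "o"], database)
```

Each join overlaps the end of one segment with the start of the next, so the output is slightly shorter than the segments laid end to end. That overlap is the "digital smoothing" in its simplest form.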

If the research succeeds, the first customers are likely to be Japanese carmakers and toymakers, who want to make their products - cars, robots and so on - expressive. Campbell adds: "Instead of saying, ‘You’ve exceeded the speed limit,’ they want the car to go, ‘Oy - watch it!’"

But Sundaram and others know that synthesising emotional speech is only part of the challenge. Determining when and how to use it is vital. As Jurgen Trouvain, a linguist at Saarland University in Germany who is working on laughter synthesis, says: "You would not like to be embarrassing."