Auto-translation gets more personal

Microsoft believes it has taken automatic speech translation to the next step: using something close to your own voice rather than what sounds like a robot.

There have been several developments in speech recognition and translation in recent years, most notably Google’s translation app on Android, which lets two people who speak different languages converse. Well, as long as they don’t mind passing a smartphone back and forth and pressing a button to show the speaker has changed, and as long as the phone’s owner can convey that requirement to the other speaker.

Indeed, it now looks quite conceivable that this concept will lead to the point where two speakers of different languages can talk on the telephone, with the translation quick enough to make real-time conversation viable.

One drawback, however, is that computer-generated speech often sounds disappointingly artificial (though using Kindle text-to-speech on A Brief History of Time is a particularly weird experience). It is unnaturally rhythmic and doesn’t reflect the distinctive pitch or accent of the speaker.

The Microsoft system attempts to solve that by piecing together the speech from the user’s own voice. It requires about an hour of training, with the user reading out a series of sounds in their own language, along with variations and combinations in other languages. These recordings are then stitched together to make the “foreign” speech.
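To make the "stitching" idea concrete, here is a toy sketch of joining recorded sound units end to end, with a short crossfade at each seam so the joins are less abrupt. This is an illustration only, not Microsoft's method: the function names, the inventory of recordings, and the overlap parameter are all hypothetical, and the audio is represented as plain lists of sample values.

```python
from typing import Dict, List

Samples = List[float]  # one recorded sound, as raw sample values


def crossfade_concat(units: List[Samples], overlap: int) -> Samples:
    """Join units in order, linearly blending up to `overlap` samples at each seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight given to the incoming unit
            idx = len(out) - n + i
            out[idx] = (1 - w) * out[idx] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(sound_ids: List[str],
               inventory: Dict[str, Samples],
               overlap: int = 2) -> Samples:
    """Look up each requested sound in the speaker's recordings and stitch them."""
    missing = [s for s in sound_ids if s not in inventory]
    if missing:
        raise KeyError(f"no recording for: {missing}")
    return crossfade_concat([inventory[s] for s in sound_ids], overlap)


if __name__ == "__main__":
    # Hypothetical inventory: two four-sample "recordings" from the training session.
    inventory = {"a": [1.0] * 4, "b": [0.0] * 4}
    print(synthesize(["a", "b"], inventory, overlap=2))
```

A real system would work with far more units and would also have to adjust pitch and timing, which is where the artificial sound tends to creep in.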

Technology Review has posted examples of the system in action. To my ears, the results still sound very artificial, though they are a little closer to the speaker’s own voice (that of Microsoft’s Rick Rashid) than a “standard” robotic voice. It might have been better if the demonstration had involved subjects with more distinctive voices or accents.

To develop further, the system might need a longer training period so it can record the person saying the same sound with different intonations or levels of emphasis. The translated speech could then capture more of the speaker’s mood and emotion.

(Image: Tower of Babel by Lucas van Valckenborch)

