It's 2009, why can't we do real-sounding text to speech?

Started by July 19, 2009 02:25 AM
19 comments, last by CodaKiller 15 years, 3 months ago
Personally, I wonder why so many of the text-to-speech and speech-to-text systems out there (that I've seen, at least) work with natural languages instead of phonetic systems like IPA (with, perhaps, additional information that describes intonation, stress, speed, etc).

If you could develop quality IPA-to-speech and speech-to-IPA (which should be far easier than actual language processing), you could accomplish many of the same goals as text-to-speech and speech-to-text using traditional AI algorithms. It could work much like old-school adventure games did with "natural language" input, except more sophisticated with the additional computing resources and knowledge available today.
You'd also have the ultimate VOIP system - build a pronunciation profile for each user based on them reading some well-chosen text, and aside from that initial transmission, it would only take a few bytes per second to transmit speech.
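As a rough sanity check on the bandwidth claim, here is a back-of-the-envelope sketch. The speaking rate and per-phoneme byte counts are assumptions picked for illustration, not measurements; the only hard number is G.711's 64 kbit/s, used as a conventional voice-codec baseline.

```python
# Back-of-the-envelope bandwidth for streaming phonemes instead of audio.
# Every constant below is an illustrative assumption, not a measurement.
PHONEMES_PER_SECOND = 12        # assumed conversational speaking rate
SYMBOL_BYTES = 1                # assumes a phoneme inventory of <= 256 symbols
PROSODY_BYTES = 1               # assumed extra byte for stress/pitch/duration


def phoneme_stream_bytes_per_second(rate=PHONEMES_PER_SECOND,
                                    symbol_bytes=SYMBOL_BYTES,
                                    prosody_bytes=PROSODY_BYTES):
    """Bytes per second needed to send phonemes plus a little prosody."""
    return rate * (symbol_bytes + prosody_bytes)


def codec_bytes_per_second(kbit_per_s):
    """Convert a codec bitrate in kbit/s to bytes per second."""
    return kbit_per_s * 1000 // 8


if __name__ == "__main__":
    phonemes = phoneme_stream_bytes_per_second()
    g711 = codec_bytes_per_second(64)  # G.711 runs at 64 kbit/s
    print(f"phoneme stream: {phonemes} B/s, G.711: {g711} B/s")
```

Under these assumptions the stream is a few tens of bytes per second rather than literally a few, but still a couple of orders of magnitude below a plain audio codec. The one-time pronunciation profile is not counted here.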
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
You can get decent text to speech, but it's gonna cost you a pretty penny. It needs a lot of setup time, someone with a clear voice and a lot of love and attention placed into getting the text right, which usually means butchering the mother tongue.

We use TTS in our products, and it works reasonably well, granted we're not a game development house.
Quote: Original post by NickGravelyn
Quote: Original post by zedz
Yes, I had a look at synth speech recently, when one of the Natal demos was shown,
to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
Whoever said the Natal demo was doing speech synthesis? They likely just recorded some lines (much like every video game) and had the computer play the pre-recorded audio.

Yes, I know what they in fact did, and that the whole thing was fake,
but if you watched the video you would see they imply(*)
that this technology will:

A/ understand what you say
B/ reply to you in perfect English
C/ conduct an intelligent conversation
D/ recognize objects by sight
..
well, a whole long list

(*) fair enough if you haven't watched the demonstration

Also Milo calls you by name, which would require some sort of text-to-speech.

I have a fairly uncommon RL name, and I'm interested to see how (if at all) it'll handle that. Not to mention all the wild Us3rN@m3z. :)
You know, you're not supposed to feed correct English to a TTS system. You're supposed to write it, usually using their hints, so that the text it's speaking actually includes intonation information. So parsing and understanding English aren't really difficulties unless you're trying to use your TTS engine quite naively.
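For a concrete idea of what such hints can look like, here is a minimal sketch that builds a fragment in SSML, the W3C speech markup. The post does not say which hint format any particular engine uses, so SSML is an assumed stand-in, and `build_ssml` is a made-up helper name. A complete SSML document would also carry namespace and language attributes, and engines differ in which elements they honor.

```python
import xml.etree.ElementTree as ET


def build_ssml():
    """Build a small SSML fragment with emphasis, a pause, and a slower rate."""
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = "I "

    # Stress one word instead of leaving the engine to guess.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "never"
    emphasis.tail = " said that."

    # Explicit pause between the two sentences.
    pause = ET.SubElement(speak, "break", {"time": "300ms"})
    pause.tail = " "

    # Slow down the follow-up sentence.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow"})
    prosody.text = "Not once."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```

The author decides where the stress and pauses go, so the engine no longer has to infer them from plain English.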
SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.
Quote: Original post by Promit
You know, you're not supposed to feed correct English to a TTS system. You're supposed to write it, usually using their hints, so that the text it's speaking actually includes intonation information. So parsing and understanding English aren't really difficulties unless you're trying to use your TTS engine quite naively.


I think our intention was that our end users could use it naively. I think we've come to the conclusion now that that is not going to be possible.
Quote: Original post by Extrarius
Personally, I wonder why so many of the text-to-speech and speech-to-text systems out there (that I've seen, at least) work with natural languages instead of phonetic systems like IPA (with, perhaps, additional information that describes intonation, stress, speed, etc).


Easy. You would still need to convert the plain text to IPA, which requires -- tadaa -- natural language parsing.
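A small sketch of why a plain lookup is not enough: some spellings map to more than one pronunciation, and only context (tense, part of speech) picks the right one. The table and the IPA strings below are a toy, approximate illustration, not a real lexicon.

```python
# Toy pronunciation table. The entries are hypothetical and the IPA is approximate.
PRONUNCIATIONS = {
    "i": ["aɪ"],
    "read": ["ɹiːd", "ɹɛd"],   # present tense vs. past tense
    "the": ["ðə"],
    "book": ["bʊk"],
}


def candidates(sentence):
    """Pair each word with every pronunciation the table lists for it."""
    return [(word, PRONUNCIATIONS.get(word, []))
            for word in sentence.lower().split()]


def ambiguous_words(sentence):
    """Words the table alone cannot resolve: several entries, or none at all."""
    return [word for word, prons in candidates(sentence) if len(prons) != 1]


if __name__ == "__main__":
    print(ambiguous_words("I read the book"))
```

Choosing between the two entries for "read" requires knowing the tense of the sentence, which is exactly the parsing step in question.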

Sander Marechal [Lone Wolves] [Hearts for GNOME] [E-mail] [Forum FAQ]

Quote: Original post by Sander
Quote: Original post by Extrarius
Personally, I wonder why so many of the text-to-speech and speech-to-text systems out there (that I've seen, at least) work with natural languages instead of phonetic systems like IPA (with, perhaps, additional information that describes intonation, stress, speed, etc).


Easy. You would still need to convert the plain text to IPA, which requires -- tadaa -- natural language parsing.
While technically true, you omit the fact that for many uses of text-to-speech (and speech-to-text) it would be possible to work with IPA either directly or indirectly. For example, to make a game where NPCs speak, it'd be easy enough to make an English-to-IPA dictionary for developers to use when building dialog text (so they can produce both an English and a phonetic version of the speech). The developers could also input additional data to help it sound more natural. Since they're directly describing the sounds made, they can also easily give different characters different accents, dialects, and manners of speech.
Conversely, text-to-speech that works directly with text is less tunable and must be far more complicated to produce equal results.
I'm a big fan of small, modular pieces rather than monolithic systems, and a natural-text-to-speech system is monolithic by design.
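A minimal sketch of that modular pipeline, under stated assumptions: the lexicon, the IPA strings, and the single "non-rhotic" accent rule are all made up for illustration, and `to_phonetic` is a hypothetical helper rather than any existing tool's API.

```python
# Hypothetical base lexicon a studio might maintain; the IPA is approximate.
BASE_LEXICON = {
    "hello": "həˈloʊ",
    "traveler": "ˈtɹævələɹ",
    "water": "ˈwɔtəɹ",
}


def non_rhotic(ipa):
    """Toy accent rule: drop a word-final 'ɹ', as a non-rhotic speaker might."""
    return ipa[:-1] if ipa.endswith("ɹ") else ipa


def to_phonetic(text, overrides=None, accent=None):
    """Turn authored English dialog into a space-separated IPA string.

    overrides: per-line pronunciations supplied by the writer (checked first).
    accent:    optional function applied to each word's IPA.
    """
    overrides = overrides or {}
    phonetic = []
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        ipa = overrides.get(word) or BASE_LEXICON.get(word)
        if ipa is None:
            # Fail loudly so the writer adds the missing entry.
            raise KeyError(f"no pronunciation for {word!r}")
        phonetic.append(accent(ipa) if accent else ipa)
    return " ".join(phonetic)


if __name__ == "__main__":
    print(to_phonetic("Hello, traveler!"))
    print(to_phonetic("Hello, traveler!", accent=non_rhotic))
```

Each stage (lookup, author overrides, accent rule) stays small and replaceable, and the writer resolves ambiguous words by hand instead of leaving that to a parser.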
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
Quote: Original post by Extrarius
For example, to make a game where NPCs speak, it'd be easy enough to make an English-to-IPA dictionary for developers to use when building dialog text (so they can produce both an English and a phonetic version of the speech). The developers could also input additional data to help it sound more natural. Since they're directly describing the sounds made, they can also easily give different characters different accents, dialects, and manners of speech.

Heh, BioWare would probably love a system like that. Not sure if you guys have paid much attention to SW:TOR, but they have a lot of voice actors.

I don't mind the Microsoft Sam text to speech. I used to use something in the game America's Army (it was probably Microsoft Sam) since they had it set up to read the text.
All this and no mention of Tom Baker Says? It's probably the most impressive voice synthesis I've seen (heard). The delivery is still slightly stilted but it doesn't have the robotic twang that the AT&T stuff has.

