Finding Debbie's Voice: Not just what we say.
"Hi, I'm Debbie!"
[ˈhaɪ ˈaɪm ˈdɛ.bi]
These lines say the same thing. Since you're reading this, it feels safe to assume the first one makes sense to you. The second line is written in the International Phonetic Alphabet (IPA), which breaks words and phrases down into the individual sounds that make them up, rather than being concerned with their meaning. In day-to-day life it's mostly seen in dictionary entries.
Dictionaries and that transcription both have the same issue: they sound like me. English is my first language and I grew up in the south of the UK so, broadly, my accent can be described as Standard Southern British English (SSBE). Despite the name, it isn't particularly standard. Not everyone sounds like me, nor should they, but when we input words into text-to-speech (TTS) engines, that's the assumption they make. We can set a language and a location - such as English (US) or English (UK) - but I'm yet to see English (Cornwall) as an option.
Debbie - our resident AI and chatterbox - is Cornish. She was born here, she grew up here, and she'd never want to live anywhere else. It's impossible to imagine a version of Debbie that would mispronounce Mousehole (it's like 'mowzle', if you're not sure!), but if we didn't think about her voice, that's exactly what would happen.
Fortunately, TTS editors can and do understand the IPA. We can ensure that Debbie knows the names of all the places she cares about, and her colleagues too. We use Speech Synthesis Markup Language (SSML) to input this information.
Before: 'Mousehole'
After: <phoneme alphabet="ipa" ph="ˈmaʊzl̩">Mousehole</phoneme>
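Writing those tags by hand gets tedious once there are more than a few place names. As a rough sketch of how the step could be automated (this is not a description of how Debbie is actually built), the Python below wraps every word found in a small pronunciation lexicon in a phoneme tag. The function name add_phonemes and the LEXICON dictionary are made up for illustration, and only the Mousehole entry comes from this post.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: word -> IPA. Only the Mousehole entry comes from the post.
LEXICON = {"Mousehole": "ˈmaʊzl̩"}


def add_phonemes(text, lexicon=LEXICON):
    """Return SSML in which every lexicon word in `text` carries a <phoneme> tag."""
    if not lexicon:
        return f"<speak>{escape(text)}</speak>"
    # Longest words first, so a longer entry wins over a shorter one it contains.
    words = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    parts, last = [], 0
    for match in pattern.finditer(text):
        word = match.group(0)
        # Escape the plain text between matches so stray & or < stay valid XML.
        parts.append(escape(text[last:match.start()]))
        parts.append(
            f'<phoneme alphabet="ipa" ph={quoteattr(lexicon[word])}>{escape(word)}</phoneme>'
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return f"<speak>{''.join(parts)}</speak>"


print(add_phonemes("Debbie loves Mousehole!"))
```

Matching here is case-sensitive and only catches whole words, which suits proper nouns like place names. Anything not in the lexicon is left for the TTS engine to pronounce in its default accent.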
That's not all. Your voice is more than just pronunciation. It's pitch, speed, volume, emphasis and a million other things that help to convey more meaning behind the words you say.
You might already know that your voice tends to get higher at the end of a question. It also tends to get lower during sarcasm, and to rise on each item of a list until the last. Some of this might not seem very useful in the context of an AI, but it all builds towards speech being natural and conversational. If you've ever noticed how often people slightly overlap with each other's speech in conversation, that's partly because they've judged from the intonation that the utterance before is nearly complete.
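SSML has a prosody element for pitch, speed (which it calls rate) and volume. As a minimal sketch, the hypothetical helper below wraps a piece of text in a prosody tag and sets only the attributes it is given. The values used, such as a 15% pitch rise on the last word of a question, are illustrative guesses rather than tuned settings, and how closely a given TTS engine follows them varies.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap text in an SSML <prosody> tag, setting only the attributes supplied."""
    settings = {"pitch": pitch, "rate": rate, "volume": volume}
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in settings.items()
        if value is not None
    )
    # With nothing to adjust, return the (escaped) text without a wrapper.
    return f"<prosody{attrs}>{escape(text)}</prosody>" if attrs else escape(text)


# Illustrative only: lift the pitch on the final word to suggest a question.
question = "Have you been to " + prosody("Mousehole?", pitch="+15%")
print(f"<speak>{question}</speak>")
```

The same helper could slow a sentence down with rate="slow" or soften it with volume="soft", which are among the named values the SSML specification defines.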
Intonation and emphasis can also fully change the meaning of a statement. Look at the two sentences below:
The first casts doubt on who 'he' is, and the second on his relationship to you. We can build such cues into Debbie's speech, both to make her sound more natural and to convey meaning faster and more reliably. We can also do this to an extent in written text: in the examples above I've used italics. The issue is that italics could equally signal myriad other things, like sarcasm, whispers, and even an internal monologue.
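In SSML the cue is less ambiguous than italics, because there is an emphasis element that means exactly one thing. The sketch below is a hypothetical helper, not Debbie's actual code, that stresses one chosen word in a sentence. The sentence used is a made-up stand-in for illustration.

```python
from xml.sax.saxutils import escape

# Emphasis levels defined by the SSML specification.
LEVELS = {"strong", "moderate", "none", "reduced"}


def emphasise(sentence, word, level="strong"):
    """Wrap the first occurrence of `word` in an SSML <emphasis> tag."""
    if level not in LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not appear in the sentence")
    return (
        f'{escape(before)}<emphasis level="{level}">{escape(found)}</emphasis>'
        f"{escape(after)}"
    )


# The same (made-up) words, stressed two different ways.
print(emphasise("He is your friend?", "He"))
print(emphasise("He is your friend?", "friend"))
```

Because the helper only marks the first occurrence of the word, a sentence that repeats the stressed word would need a more careful approach, such as passing a position instead.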
All in all, it's not just words we communicate with. It's how we say them. By building in emphasis and intonation, we can give Debbie more of the emotional range she needs to foster trust and create relationships with the people she talks to. In making her accent true to her own history, we're creating a more authentic experience. Giving her the right voice helps Debbie be more Debbie.