My voice… but make it synthetic 🤖


Do machines dream of synthetic voices?

Speech is the most natural way for human beings to communicate. Systems of spoken language developed long before the most primitive attempts at writing were even made. While the gift of language in general, and speech in particular, comes easily to people, science has long struggled to make machines produce speech that sounds as natural as possible.

The goal of speech synthesis, also known as text-to-speech, is to convert written texts into spoken utterances. Speech synthesis finds application in many areas of our everyday life, ranging from announcements at train stations to voice assistants in call centers.

Given recordings of a person speaking, we can recreate their voice to generate arbitrary utterances. This is a unique opportunity for people who have lost their voices due to medical conditions. Text-to-speech can also be used to reconstruct the voice of deceased loved ones; sometimes hearing reassuring things in the voice of a late loved one may help to process the grief and ease the pain.

...Or we can build our own voice right now just for fun!

Listen to some things I never said... (but then I did)

Here are several samples of synthetically generated utterances along with references, i.e. me pronouncing the same sentences. Speech samples under the "Unit selection" header were generated with the open-source framework Festival, using a custom voice I built on my own speech data (around 400 recordings of me reading prompts from the CMU ARCTIC dataset). Samples under "SV2TTS" were created with SV2TTS, a real-time voice cloning framework that uses pre-trained deep neural networks.

NB: All wave files share the following characteristics: 16 kHz sampling rate, 1 channel, 16-bit depth.
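As a quick sanity check, these properties can be read with Python's built-in `wave` module. A minimal sketch; `sample.wav` is a hypothetical stand-in created here for demonstration, not one of the actual samples:

```python
import wave

def describe_wav(path):
    """Return (sampling rate in Hz, channel count, bit depth) of a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth() * 8

# Create one second of 16 kHz, mono, 16-bit silence matching the
# characteristics above, just so there is a file to inspect.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 2 bytes per sample = 16 bits
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(describe_wav("sample.wav"))  # -> (16000, 1, 16)
```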

Unit selection · SV2TTS · Reference speech

1. "Happiness isn't in the having. It's in just being. It's in just saying it."
2. "Maturity is a bitter disappointment for which no remedy exists, unless laughter could be said to remedy anything."
3. "If you don't know, the thing to do is not to get scared, but to learn."
4. "Enjoy the little things in life, for one day you'll look back and realize they were big things."
5. "Fear is the path to the dark side. Fear leads to anger. Anger leads to hate. Hate leads to suffering."
6. "He was my North, my South, my East and West, My working week and my Sunday rest, My noon, my midnight, my talk, my song; I thought that love would last forever: I was wrong."

You reap what you sow

Or, as machine learning practitioners like to say: garbage in, garbage out. This holds true for speech synthesis as well. TTS systems based on deep neural networks can benefit from transfer learning, i.e. reusing knowledge a model gained while being trained on different data or a different task, and in some cases can reproduce someone's voice from just a few samples. Unit selection (concatenative) speech synthesis, in contrast, relies heavily on the data it is built with.

As the name suggests, a concatenative speech synthesizer selects pre-recorded units of speech and concatenates them to form a new utterance. Obviously, the quality of the generated speech depends on how carefully the dataset was designed. To that end, several criteria need to be considered:
  • Phonemic coverage. When creating a script, it's important to design it so that the final recordings are phonetically balanced. The diagram below shows the distribution of individual phonemes in the dataset I recorded for building a custom unit selection voice (NB: this is CMU phonetic notation). As we can see, some phonemes are severely underrepresented, especially vowels with secondary stress (the ones marked with the digit 2).
  • Diphone (triphone) coverage. It may seem counterintuitive to a non-phonetician, but the same phoneme is not quite the same depending on the context it occurs in: the /k/ sounds in cat, puncture, and look are not identical. Different realizations of the same phoneme are called allophones. Using them interchangeably will probably not hinder understanding, but the naturalness of the speech will be lost. Thus, in the case of unit selection speech synthesis it's crucial to provide as many samples of each phoneme in as many phonemic contexts as possible.
  • Prosodic coverage. Prosody deals with everything in phonetics that is above individual phones: intonation, rhythm, stress, etc. Therefore, it's vital to make sure that the dataset is prosodically balanced: there is a sufficient number of declarative, exclamatory, interrogative and imperative sentences; the speaker's style of delivery is consistent in terms of rhythm and emotion; etc.
TO DO: diphone coverage diagram
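A rough way to audit phonemic and diphone coverage of a prompt script is to count units over its phonemic transcriptions. A minimal sketch; the ARPAbet transcriptions below are toy examples I made up, whereas in practice they would come from running the script through a lexicon (e.g. the CMU Pronouncing Dictionary) or a grapheme-to-phoneme tool:

```python
from collections import Counter

# Hypothetical ARPAbet transcriptions of a tiny prompt script.
prompts = [
    ["HH", "AH0", "L", "OW1"],        # "hello"
    ["W", "ER1", "L", "D"],           # "world"
    ["L", "UH1", "K", "AE1", "T"],    # "look at"
]

phones = Counter()
diphones = Counter()
for phonemes in prompts:
    phones.update(phonemes)
    # Adjacent phoneme pairs approximate diphone coverage.
    diphones.update(zip(phonemes, phonemes[1:]))

# Units that occur rarely (or not at all) point to prompts worth
# adding to the script before recording.
print("phone counts:", dict(phones))
print("diphone types covered:", len(diphones))
```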