Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech

1NAVER AI Lab, 2KAIST, 3NAVER CLOUD

Face-TTS generates speech from given text script and face image.

Abstract

The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine the voice of someone when they look at his or her face, we introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time that face images are used as a condition to train a TTS model.

We jointly train cross-model biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and the ground truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles.

Examples

Here, all of the results were using only unseen and unheard speakers.

Note that we does not aim to reconstruct an accurate voice of the person, but rather to generate acoustic characteristics that are correlated with the input face.

Case 1 (Real face from LRS3 dataset)

Text: The employees raced the elevators to the first floor. Givens saw Oswald standing at the gate on the fifth floor as the elevator went by.

From voice

From face

Case 2 (Real face from LRS3 dataset)

Text: These had been attributed to political action; some thought that the large purchases in foreign grains, effected at losing prices,

From voice

From face

Case 3 (Real face from LRS3 dataset)

Text: Four point eight to five point six seconds if the second shot missed,

From voice

From face

Case 4 (Real face from LRS3 dataset)

Text: And so numerous were his opportunities of showing favoritism, that all the prisoners may be said to be in his power.

From voice

From face

Case 5 (Virtual face brought from Stable Diffusion)

Text: The preference given to the Pentonville system destroyed all hopes of a complete reformation of Newgate.

Case 6 (Virtual face brought from Stable Diffusion)

Text: And one or two men were allowed to mend clothes and make shoes. The rules made by the Secretary of State were hung up in conspicuous parts of the prison.

Ethical Considerations

Please be aware that our objective is not to generate an accurate reproduction of the person's voice, but rather to produce acoustic characteristics that are connected to the input face. It is important to note that certain aspects of synthesised speech may not necessarily be linked to face. Furthermore, Face-TTS should not be used for purposes that are illegal or abusive, and we stand in opposition to such usage.

Related links

We need face detector to align the face image into LRS3 construction. The code implementation is heavily borrowed from the official implementation of Grad-TTS. We are deeply grateful for all of the projects.

BibTeX

If you find this work useful for your research, please cite our paper:
@inproceedings{lee2023imaginary,
  author    = {Lee, Jiyoung and Chung, Joon Son and Chung, Soo-Whan},
  title     = {Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year      = {2023},
}