TTS x Hallo Talking Portrait Generator

This demo allows you to generate a talking portrait with the help of several open-source projects: SDXL Lightning | Parler TTS | WhisperSpeech | Hallo

So that everyone in the community can try this demo, the driving audio (and therefore the video) is limited to a maximum of 4 seconds.

Duplicate this space to skip the queue and get unlimited video duration. Generating a video from 4-5 seconds of audio takes about 5 minutes per inference, so please be patient.

1. Load Portrait

2. Load Voice

3. Result

Image

Generate image

Audio

Text to synthesize

Voice description

Video

Hallo Pro Tips:

Hallo has a few simple requirements for input data:

For the source image:

It should be cropped to a square.
The face should be the main focus, making up 50%-70% of the image.
The face should be facing forward, with a rotation angle of less than 30° (no side profiles).
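The square-crop requirement above can be sketched as a small helper. This is only an illustration, not part of Hallo: `center_square_box` is a hypothetical name, and it assumes the face is already roughly centered. It returns a `(left, upper, right, lower)` box, which is the tuple layout that Pillow's `Image.crop` accepts. A centered crop does not by itself guarantee that the face fills 50%-70% of the image or that it faces forward, so those points still need a visual check.

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square.

    Hypothetical helper for illustration; assumes the face is near the
    center of the picture.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)        # the square side is the shorter edge
    left = (width - side) // 2       # trim equally from left and right
    upper = (height - side) // 2     # trim equally from top and bottom
    return (left, upper, left + side, upper + side)


if __name__ == "__main__":
    # A 1920x1080 landscape photo keeps its full height and loses the sides.
    print(center_square_box(1920, 1080))
```

With Pillow installed, the result could be used as `image.crop(center_square_box(*image.size))`.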

For the driving audio:

It must be in WAV format.
It must be in English since our training datasets are only in this language.
Ensure the vocals are clear; background music is acceptable.
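Before uploading, it may help to confirm that a file really is a WAV and to see how long it is. The sketch below uses only Python's standard `wave` module. The function names and the 4-second default (taken from the limit mentioned for this demo) are assumptions made for illustration, and the `wave` module reads PCM WAV files, so other encodings inside a `.wav` container may be rejected. It cannot check the language or the clarity of the vocals.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds.

    Raises wave.Error if the file is not a WAV that the standard
    library can parse.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def fits_demo_limit(path, max_seconds=4.0):
    """Return True if the file is a readable WAV no longer than max_seconds.

    max_seconds defaults to the 4-second limit of the shared demo; this
    is a hypothetical helper, not part of Hallo.
    """
    try:
        return wav_duration_seconds(path) <= max_seconds
    except (wave.Error, EOFError):
        # Not a parseable WAV file (wrong format or truncated header).
        return False
```

A duplicated space without the duration limit could call `fits_demo_limit(path, max_seconds=float("inf"))` to check the format only.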

TTS Pro Tips:

For Parler TTS:

Include the term "very clear audio" to generate the highest quality audio, or "very noisy audio" for high levels of background noise.
Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech.
The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt.
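As a rough illustration of the tips above, a voice description can be assembled from the individual features. The helper name, its parameters, and the exact sentence wording below are assumptions made for this sketch, not an official Parler TTS template; only the "very clear audio" and "very noisy audio" phrases come from the tips themselves.

```python
def build_voice_description(gender="female", rate="moderate",
                            pitch="moderate", reverberation="very little",
                            clean=True):
    """Compose a free-text voice description for the description field.

    Hypothetical helper: the sentence template is illustrative only.
    """
    # The tips recommend these exact phrases to steer audio quality.
    quality = "very clear audio" if clean else "very noisy audio"
    return (
        f"A {gender} speaker talks at a {rate} pace with a {pitch} pitch, "
        f"with {reverberation} reverberation and {quality}."
    )


if __name__ == "__main__":
    print(build_voice_description(gender="male", rate="slow", pitch="low"))
```

The resulting string is meant to be pasted into the voice description field, while the text to be spoken (with commas where short pauses are wanted) goes into the text field.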

For WhisperSpeech:

WhisperSpeech is able to quickly clone a voice from an audio sample.

Upload a voice sample in the WhisperSpeech tab
Enter the text to synthesize, then click the Generate voice clone button.