TTS x Hallo Talking Portrait Generator

This demo allows you to generate a talking portrait with the help of several open-source projects: SDXL Lightning, Parler TTS, WhisperSpeech, and Hallo.

To let the community try and enjoy this demo, each video is limited to a maximum of 4 seconds of audio.

Duplicate this space to skip the queue and get unlimited video duration. 4-5 seconds of audio takes ~5 minutes per inference, so please be patient.

1. Load Portrait

2. Load Voice

3. Result

Hallo Pro Tips:

Hallo has a few simple requirements for input data:

For the source image:

  1. It should be cropped to a square.
  2. The face should be the main focus, making up 50%-70% of the image.
  3. The face should be facing forward, with a rotation angle of less than 30° (no side profiles).
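
As a rough illustration of the square-crop requirement, the sketch below computes a centered square crop box from an image's width and height. The function name `square_crop_box` is hypothetical and not part of Hallo; a centered crop is also only an assumption, since it will not keep the face in frame if the face is off-center.

```python
def square_crop_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a centered square crop.

    Hypothetical helper: it only does the arithmetic and does not
    detect where the face is.
    """
    side = min(width, height)        # the square is as large as the short edge allows
    left = (width - side) // 2       # trim equally from both sides of the long edge
    top = (height - side) // 2
    return (left, top, left + side, top + side)


if __name__ == "__main__":
    print(square_crop_box(1920, 1080))
```

The returned tuple uses the (left, upper, right, lower) ordering that Pillow's `Image.crop` expects, so it can be passed to that method directly if Pillow is used for the actual cropping.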

For the driving audio:

  1. It must be in WAV format.
  2. It must be in English since our training datasets are only in this language.
  3. Ensure the vocals are clear; background music is acceptable.
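
Before uploading, it can help to confirm that a file really is a WAV and to check how long it is. The following is a minimal sketch using Python's standard `wave` module. The helper names and the 4-second default (taken from the demo limit mentioned above) are assumptions for illustration, and the `wave` module mainly handles uncompressed PCM files.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds.

    wave.open raises wave.Error when the file lacks a valid RIFF/WAVE
    header, which doubles as a basic format check.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def fits_demo_limit(path: str, max_seconds: float = 4.0) -> bool:
    """Hypothetical check against the demo's audio length limit."""
    return wav_duration_seconds(path) <= max_seconds
```

A file that fails to open with `wave.Error` is probably in another format (for example MP3) and would need converting to WAV first.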

TTS Pro Tips:

For Parler TTS:

  • Include the term "very clear audio" to generate the highest quality audio, and "very noisy audio" for high levels of background noise
  • Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech
  • The remaining speech features (gender, speaking rate, pitch and reverberation) can be controlled directly through the prompt
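
To make the tips above concrete, here is a small sketch that assembles a voice description from the features listed. The function `build_description`, its parameters, and the exact sentence template are all hypothetical; Parler TTS accepts free-form text, so this is just one possible way to phrase a prompt.

```python
def build_description(
    gender: str = "female",
    pitch: str = "slightly low-pitched",
    rate: str = "moderate",
    reverberation: str = "very confined-sounding",
    quality: str = "very clear audio",
) -> str:
    """Compose a free-form voice description (hypothetical template)."""
    return (
        f"A {gender} speaker with a {pitch} voice speaks at a {rate} pace "
        f"in a {reverberation} environment, with {quality}."
    )


if __name__ == "__main__":
    # Default description, then a deliberately noisy variant.
    print(build_description())
    print(build_description(gender="male", quality="very noisy audio"))
```

The text to be spoken is entered separately from this description, and that is where punctuation such as commas shapes the pauses.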

For WhisperSpeech:

WhisperSpeech is able to quickly clone a voice from an audio sample.

  • Upload a voice sample in the WhisperSpeech tab
  • Add text to synthesize, then click the Generate voice clone button