---
title: DeepAudio-V1
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

## DeepAudio-V1: Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

## Installation

**1. Create a conda environment**

```bash
conda create -n v2as python=3.10
conda activate v2as
```

**2. Install the F5-TTS base**

```bash
cd ./F5-TTS
pip install -e .
```

**3. Install additional requirements**

```bash
pip install -r requirements.txt
conda install cudnn
```

**Pretrained models**

The models are available at https://huggingface.co/lshzhm/DeepAudio-V1. See [MODELS.md](./MODELS.md) for the details.
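
As a convenience, the model repository above could be fetched programmatically. The following is a minimal sketch, not the project's documented procedure: it assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function. The helper name `fetch_models`, the `downloader` parameter, and the `./ckpts` target directory are hypothetical; the layout the scripts actually expect is described in MODELS.md.

```python
from pathlib import Path
from typing import Callable, Optional

# Repository named in this README.
REPO_ID = "lshzhm/DeepAudio-V1"


def fetch_models(
    target_dir: str = "./ckpts",  # hypothetical location; see MODELS.md
    downloader: Optional[Callable[..., str]] = None,
) -> Path:
    """Download a snapshot of the model repository into target_dir.

    `downloader` exists only so the function can be exercised without
    network access; by default huggingface_hub.snapshot_download is used.
    """
    if downloader is None:
        # Imported lazily so this module loads without huggingface_hub.
        from huggingface_hub import snapshot_download

        downloader = snapshot_download

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloader(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `fetch_models()` with no arguments would download the whole repository into `./ckpts`; move or link the files afterwards if MODELS.md specifies different paths.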

## Inference

**1. V2A inference**

```bash
bash v2a.sh
```

**2. V2S inference**

```bash
bash v2s.sh
```

**3. TTS inference**

```bash
bash tts.sh
```

## Evaluation

```bash
bash eval_v2c.sh
```
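
The evaluation reports WER, SPK-SIM, and EMO-SIM, using the recognition and embedding models credited in the Acknowledgement section. The sketch below illustrates the arithmetic such metrics usually involve; it is not the code in `eval_v2c.sh`. It assumes WER is the word-level edit distance divided by the reference length, with only lowercasing and whitespace tokenization (real pipelines typically normalize punctuation too), and that the similarity scores are cosine similarities between embedding vectors. The function names are illustrative.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In this reading, the reference would be the ground-truth transcript and the hypothesis the recognizer's transcript of the generated speech, while the two vectors would be embeddings of the generated and reference audio.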

## Acknowledgement

- [MMAudio](https://github.com/hkchengrex/MMAudio) for the video-to-audio backbone and pretrained models
- [F5-TTS](https://github.com/SWivid/F5-TTS) for the text-to-speech and video-to-speech backbone
- [V2C](https://github.com/chenqi008/V2C) for the animated movie benchmark
- [Wav2Vec2-Emotion](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim) for emotion recognition in the EMO-SIM evaluation
- [WavLM-SV](https://huggingface.co/microsoft/wavlm-base-sv) for speaker verification in the SPK-SIM evaluation
- [Whisper](https://huggingface.co/Systran/faster-whisper-large-v3) for speech recognition in the WER evaluation