AudioX: A Unified Framework for Anything-to-Audio Generation

AudioX is a unified framework for anything-to-audio generation that accepts diverse multimodal conditions (text, video, and audio signals). Its core design is the Multimodal Adaptive Fusion (MAF) module, which fuses these heterogeneous inputs, strengthening cross-modal alignment and improving overall generation quality. This repository hosts the AudioX-MAF checkpoint, which builds on the base HKUSTAudio/AudioX model.
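The actual fusion internals are defined in the official repository; as intuition only, adaptive fusion can be pictured as projecting each modality into a shared space and mixing the projections with learned, input-dependent gates. The sketch below is a hypothetical illustration of that idea, not the AudioX implementation.

import torch
import torch.nn as nn

class AdaptiveFusionSketch(nn.Module):
    """Hypothetical gated fusion: project each modality embedding to a
    shared width, then average them with input-dependent weights.
    Illustrative only; not the official AudioX MAF module."""
    def __init__(self, modality_dims, d_model):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in modality_dims)
        self.gate = nn.Linear(d_model * len(modality_dims), len(modality_dims))

    def forward(self, feats):
        # feats: one (batch, dim) embedding per modality
        h = [p(f) for p, f in zip(self.proj, feats)]
        weights = torch.softmax(self.gate(torch.cat(h, dim=-1)), dim=-1)
        return sum(weights[:, i:i+1] * h[i] for i in range(len(h)))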

Sample Usage

To use this model programmatically, run the script below. Note that you must first install the audiox package as described in the official repository.

import torch
import torchaudio
from einops import rearrange
from audiox import get_pretrained_model
from audiox.inference.generation import generate_diffusion_cond
from audiox.data.utils import read_video, merge_video_audio, load_and_process_audio, encode_video_with_synchformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load pretrained model
# Choose one: "HKUSTAudio/AudioX", "HKUSTAudio/AudioX-MAF", or "HKUSTAudio/AudioX-MAF-MMDiT"
model_name = "HKUSTAudio/AudioX"
model, model_config = get_pretrained_model(model_name)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config["video_fps"]
seconds_start = 0
seconds_total = 10

model = model.to(device)

# Example: Video-to-Music generation
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video" 
audio_path = None

# Prepare inputs
video_tensor = read_video(video_path, seek_time=seconds_start, duration=seconds_total, target_fps=target_fps)
if audio_path:
    audio_tensor = load_and_process_audio(audio_path, sample_rate, seconds_start, seconds_total)
else:
    # Use zero tensor when no audio is provided
    audio_tensor = torch.zeros((2, int(sample_rate * seconds_total)))

# For AudioX-MAF and AudioX-MAF-MMDiT: encode video with synchformer
video_sync_frames = None
if "MAF" in model_name:
    video_sync_frames = encode_video_with_synchformer(
        video_path, model_name, seconds_start, seconds_total, device
    )
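# video_sync_frames carries frame-level Synchformer features for finer
# temporal alignment; it remains None for the base AudioX checkpoint.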

# Create conditioning
conditioning = [{
    "video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": video_sync_frames},        
    "text_prompt": text_prompt,
    "audio_prompt": audio_tensor.unsqueeze(0),
    "seconds_start": seconds_start,
    "seconds_total": seconds_total
}]
    
# Generate audio
output = generate_diffusion_cond(
    model,
    steps=250,                     # number of diffusion sampling steps
    cfg_scale=7,                   # classifier-free guidance strength
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,                 # noise schedule bounds
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",   # DPM++ 3M SDE sampler
    device=device
)

# Post-process: peak-normalize and convert to 16-bit PCM
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32)
output = output.div(torch.max(torch.abs(output))).clamp(-1, 1)  # normalize to [-1, 1]
output = output.mul(32767).to(torch.int16).cpu()                # scale to int16 PCM
torchaudio.save("output.wav", output, sample_rate)
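
Since the script conditions on a video, you may also want to mux the generated track back into the clip; merge_video_audio is imported above for this purpose. The argument order shown below (video path, audio path, output path, start, duration) is an assumption, so verify it against the official repository.

# Optional: mux the generated audio back into the source video.
# NOTE: the argument order here is assumed; check the audiox repository.
merge_video_audio(video_path, "output.wav", "output_with_audio.mp4",
                  seconds_start, seconds_total)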

Citation

@article{tian2025audiox,
  title={AudioX: Diffusion Transformer for Anything-to-Audio Generation},
  author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2503.10522},
  year={2025}
}