---
license: other
license_name: sam-license
license_link: LICENSE
---

# SAM-Audio Judge Model

SAM-Audio Judge is a model for evaluating the quality of audio separation results from [SAM Audio](https://huggingface.co/facebook/sam-audio-large). It assesses how well a separated audio track matches a given text description, providing four quality metrics: overall quality, recall, precision, and faithfulness.

## Authentication

Before using SAM-Audio Judge, you need to:

1. Request access to the checkpoints on the [SAM-Audio Judge Hugging Face repo](https://huggingface.co/facebook/sam-audio-judge)
2. Authenticate with Hugging Face: `huggingface-cli login`
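
Alternatively, you can authenticate from Python rather than the CLI. A minimal sketch using the `huggingface_hub` library (the `login()` call prompts for a token if one isn't passed):

```python
# Programmatic alternative to `huggingface-cli login`.
# Requires an access token from https://huggingface.co/settings/tokens
from huggingface_hub import login

login()  # prompts for your token; or pass it directly: login(token="hf_...")
```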

## Usage

### Basic Usage

The Judge model evaluates the quality of audio separation by comparing the input audio, separated audio, and text description.

```python
import torch
import torchaudio
from sam_audio import SAMAudioJudgeModel, SAMAudioJudgeProcessor

# Load model and processor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudioJudgeModel.from_pretrained("facebook/sam-audio-judge").to(device).eval()
processor = SAMAudioJudgeProcessor.from_pretrained("facebook/sam-audio-judge")

# Load audio files
input_audio, sr = torchaudio.load("path/to/input_audio.wav")
separated_audio, sr = torchaudio.load("path/to/separated_audio.wav")

# Text description that was used for separation
description = "A man speaking"

# Process inputs
inputs = processor(
    text=[description],
    input_audio=[input_audio],  # list of tensors of shape (1, num_samples); file paths also work (see the batch example)
    separated_audio=[separated_audio],  # list of tensors of shape (1, num_samples); file paths also work
).to(device)

# Get quality scores
with torch.inference_mode():
    result = model(**inputs)

# Access individual scores
print(f"Overall Quality: {result.overall.item():.3f}")
print(f"Recall: {result.recall.item():.3f}")
print(f"Precision: {result.precision.item():.3f}")
print(f"Faithfulness: {result.faithfulness.item():.3f}")
```
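
If your audio files are not already at the sampling rate the Judge processor expects, you may want to resample the tensors before passing them in. A minimal sketch, assuming the `processor.audio_sampling_rate` attribute (used in the full example below) exposes the expected rate; the processor itself may already handle resampling internally:

```python
import torchaudio.functional as F

# Hypothetical pre-processing step: resample loaded tensors to the
# processor's expected rate before calling the processor.
target_sr = processor.audio_sampling_rate
if sr != target_sr:
    input_audio = F.resample(input_audio, orig_freq=sr, new_freq=target_sr)
    separated_audio = F.resample(separated_audio, orig_freq=sr, new_freq=target_sr)
```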

### Batch Processing

You can evaluate multiple separation results in a single batch:

```python
import torch
from sam_audio import SAMAudioJudgeModel, SAMAudioJudgeProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudioJudgeModel.from_pretrained("facebook/sam-audio-judge").to(device).eval()
processor = SAMAudioJudgeProcessor.from_pretrained("facebook/sam-audio-judge")

# Multiple examples
descriptions = ["A man speaking", "Piano playing a melody", "A dog barking"]
input_audios = ["input1.wav", "input2.wav", "input3.wav"]
separated_audios = ["separated1.wav", "separated2.wav", "separated3.wav"]

# Process batch
inputs = processor(
    text=descriptions,
    input_audio=input_audios,
    separated_audio=separated_audios,
).to(device)

with torch.inference_mode():
    result = model(**inputs)

# Each score tensor has shape (batch_size, 1)
for i, desc in enumerate(descriptions):
    print(f"\nExample {i+1}: {desc}")
    print(f"  Overall: {result.overall[i].item():.3f}")
    print(f"  Recall: {result.recall[i].item():.3f}")
    print(f"  Precision: {result.precision[i].item():.3f}")
    print(f"  Faithfulness: {result.faithfulness[i].item():.3f}")
```

### Evaluating SAM-Audio Separation

Here's a complete example that performs separation with SAM-Audio and then evaluates it with the Judge model:

```python
import torch
from sam_audio import SAMAudio, SAMAudioProcessor
from sam_audio import SAMAudioJudgeModel, SAMAudioJudgeProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 1: Perform separation with SAM-Audio
sam_model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
sam_processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

audio_file = "path/to/audio.wav"
description = "A person coughing"

# Separate
inputs = sam_processor(audios=[audio_file], descriptions=[description]).to(device)
with torch.inference_mode():
    separation_result = sam_model.separate(inputs)

# Step 2: Evaluate the separation
judge_model = SAMAudioJudgeModel.from_pretrained("facebook/sam-audio-judge").to(device).eval()
judge_processor = SAMAudioJudgeProcessor.from_pretrained("facebook/sam-audio-judge")

# Prepare for judge
judge_inputs = judge_processor(
    text=[description],
    input_audio=[audio_file],
    separated_audio=[separation_result.target[0].unsqueeze(0)],
    sampling_rate=judge_processor.audio_sampling_rate,
).to(device)

with torch.inference_mode():
    judge_result = judge_model(**judge_inputs)

print("\nSeparation Quality Metrics:")
print(f"Overall Quality: {judge_result.overall.item():.3f}")
print(f"Recall: {judge_result.recall.item():.3f}")
print(f"Precision: {judge_result.precision.item():.3f}")
print(f"Faithfulness: {judge_result.faithfulness.item():.3f}")
```

## Output Format

The `SAMAudioJudgeModel` returns a `SAMAudioJudgeOutput` object with the following attributes:

- **`overall`** (torch.Tensor): Overall quality score of shape `(batch_size, 1)`. A combined metric representing the overall separation quality.
- **`recall`** (torch.Tensor): Recall score of shape `(batch_size, 1)`. Measures how much of the target sound was successfully captured in the separation.
- **`precision`** (torch.Tensor): Precision score of shape `(batch_size, 1)`. Measures how pure the separated sound is (i.e., how little unwanted sound is included).
- **`faithfulness`** (torch.Tensor): Faithfulness score of shape `(batch_size, 1)`. Measures how closely the target sounds present in the extracted audio match their counterparts in the input audio.

All scores are continuous values where higher values indicate better quality.
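
For downstream logging or analysis, the per-metric tensors can be flattened into plain Python records. A minimal sketch (the `scores_to_records` helper is hypothetical, not part of the `sam_audio` API; `result` and `descriptions` are as in the batch example above):

```python
# Hypothetical helper: collect per-example Judge scores into dicts.
def scores_to_records(result, descriptions):
    records = []
    for i, desc in enumerate(descriptions):
        records.append({
            "description": desc,
            "overall": result.overall[i].item(),
            "recall": result.recall[i].item(),
            "precision": result.precision[i].item(),
            "faithfulness": result.faithfulness[i].item(),
        })
    return records

# e.g. rank separation candidates by overall quality:
# best = max(scores_to_records(result, descriptions), key=lambda r: r["overall"])
```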

## Citation

If you use SAM-Audio Judge in your research, please use the following BibTeX entry:

```bibtex
@article{shi2025samaudio,
  title={SAM Audio: Segment Anything in Audio},
  author={Bowen Shi and Andros Tjandra and John Hoffman and Helin Wang and Yi-Chiao Wu and Luya Gao and Julius Richter and Matt Le and Apoorv Vyas and Sanyuan Chen and Christoph Feichtenhofer and Piotr Doll{\'a}r and Wei-Ning Hsu and Ann Lee},
  year={2025},
  url={https://arxiv.org/abs/2512.18099}
}
```

## License

This project is licensed under the SAM License. See the [LICENSE](LICENSE) file for details.