---
license: other
license_name: sam-license
license_link: LICENSE
---

# SAM-Audio Judge Model

SAM-Audio Judge is a model for evaluating the quality of audio separation results from [SAM Audio](https://huggingface.co/facebook/sam-audio-large). It assesses how well a separated audio track matches a given text description, providing four quality metrics: overall quality, recall, precision, and faithfulness.

## Authentication

Before using SAM-Audio Judge, you need to:

1. Request access to the checkpoints on the [SAM-Audio Judge Hugging Face repo](https://huggingface.co/facebook/sam-audio-judge)
2. Authenticate with Hugging Face: `huggingface-cli login`
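
Alternatively, you can authenticate from Python rather than the CLI. A minimal sketch using the `huggingface_hub` library (the `login()` call prompts for a token if one isn't passed):

```python
# Programmatic alternative to `huggingface-cli login`.
# Requires an access token from https://huggingface.co/settings/tokens
from huggingface_hub import login

login()  # prompts for your token; or pass it directly: login(token="hf_...")
```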

## Usage

### Basic Usage

The Judge model evaluates the quality of audio separation by comparing the input audio, separated audio, and text description.

```python
import torch
import torchaudio
from sam_audio import SAMAudioJudgeModel, SAMAudioJudgeProcessor

# Load model and processor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudioJudgeModel.from_pretrained("facebook/sam-audio-judge").to(device).eval()
processor = SAMAudioJudgeProcessor.from_pretrained("facebook/sam-audio-judge")

# Load audio files
input_audio, sr = torchaudio.load("path/to/input_audio.wav")
separated_audio, sr = torchaudio.load("path/to/separated_audio.wav")

# Text description that was used for separation
description = "A man speaking"

# Process inputs
inputs = processor(
    text=[description],
    input_audio=[input_audio],  # list of tensors of shape (1, num_samples); file paths also work (see the batch example)
    separated_audio=[separated_audio],  # list of tensors of shape (1, num_samples); file paths also work
).to(device)

# Get quality scores
with torch.inference_mode():
    result = model(**inputs)

# Access individual scores
print(f"Overall Quality: {result.overall.item():.3f}")
print(f"Recall: {result.recall.item():.3f}")
print(f"Precision: {result.precision.item():.3f}")
print(f"Faithfulness: {result.faithfulness.item():.3f}")
```
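
If your audio files are not already at the sampling rate the Judge processor expects, you may want to resample the tensors before passing them in. A minimal sketch, assuming the `processor.audio_sampling_rate` attribute (used in the full example below) exposes the expected rate; the processor itself may already handle resampling internally:

```python
import torchaudio.functional as F

# Hypothetical pre-processing step: resample loaded tensors to the
# processor's expected rate before calling the processor.
target_sr = processor.audio_sampling_rate
if sr != target_sr:
    input_audio = F.resample(input_audio, orig_freq=sr, new_freq=target_sr)
    separated_audio = F.resample(separated_audio, orig_freq=sr, new_freq=target_sr)
```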

### Batch Processing

You can evaluate multiple separation results in a single batch:

```python
import torch
from sam_audio import SAMAudioJudgeModel, SAMAudioJudgeProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SAMAudioJudgeModel.from_pretrained("facebook/sam-audio-judge").to(device).eval()
processor = SAMAudioJudgeProcessor.from_pretrained("facebook/sam-audio-judge")

# Multiple examples
descriptions = ["A man speaking", "Piano playing a melody", "A dog barking"]
input_audios = ["input1.wav", "input2.wav", "input3.wav"]
separated_audios = ["separated1.wav", "separated2.wav", "separated3.wav"]

# Process batch
inputs = processor(
    text=descriptions,
    input_audio=input_audios,
    separated_audio=separated_audios,
).to(device)

with torch.inference_mode():
    result = model(**inputs)

# Each score tensor has shape (batch_size, 1)
for i, desc in enumerate(descriptions):
    print(f"\nExample {i+1}: {desc}")
    print(f"  Overall: {result.overall[i].item():.3f}")
    print(f"  Recall: {result.recall[i].item():.3f}")
    print(f"  Precision: {result.precision[i].item():.3f}")
    print(f"  Faithfulness: {result.faithfulness[i].item():.3f}")
```

### Evaluating SAM-Audio Separation

Here's a complete example that performs separation with SAM-Audio and then evaluates it with the Judge model:

```python
import torch
from sam_audio import SAMAudio, SAMAudioProcessor
from sam_audio import SAMAudioJudgeModel, SAMAudioJudgeProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 1: Perform separation with SAM-Audio
sam_model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
sam_processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

audio_file = "path/to/audio.wav"
description = "A person coughing"

# Separate
inputs = sam_processor(audios=[audio_file], descriptions=[description]).to(device)
with torch.inference_mode():
    separation_result = sam_model.separate(inputs)

# Step 2: Evaluate the separation
judge_model = SAMAudioJudgeModel.from_pretrained("facebook/sam-audio-judge").to(device).eval()
judge_processor = SAMAudioJudgeProcessor.from_pretrained("facebook/sam-audio-judge")

# Prepare for judge
judge_inputs = judge_processor(
    text=[description],
    input_audio=[audio_file],
    separated_audio=[separation_result.target[0].unsqueeze(0)],
    sampling_rate=judge_processor.audio_sampling_rate,
).to(device)

with torch.inference_mode():
    judge_result = judge_model(**judge_inputs)

print("\nSeparation Quality Metrics:")
print(f"Overall Quality: {judge_result.overall.item():.3f}")
print(f"Recall: {judge_result.recall.item():.3f}")
print(f"Precision: {judge_result.precision.item():.3f}")
print(f"Faithfulness: {judge_result.faithfulness.item():.3f}")
```

## Output Format

The `SAMAudioJudgeModel` returns a `SAMAudioJudgeOutput` object with the following attributes:

- **`overall`** (torch.Tensor): Overall quality score of shape `(batch_size, 1)`. A combined metric representing the overall separation quality.
- **`recall`** (torch.Tensor): Recall score of shape `(batch_size, 1)`. Measures how much of the target sound was successfully captured in the separation.
- **`precision`** (torch.Tensor): Precision score of shape `(batch_size, 1)`. Measures how pure the separated sound is (i.e., how little unwanted sound is included).
- **`faithfulness`** (torch.Tensor): Faithfulness score of shape `(batch_size, 1)`. Measures how closely the target sounds present in the extracted audio match their counterparts in the input audio.

All scores are continuous values where higher values indicate better quality.
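
For downstream logging or analysis, the per-metric tensors can be flattened into plain Python records. A minimal sketch (the `scores_to_records` helper is hypothetical, not part of the `sam_audio` API; `result` and `descriptions` are as in the batch example above):

```python
# Hypothetical helper: collect per-example Judge scores into dicts.
def scores_to_records(result, descriptions):
    records = []
    for i, desc in enumerate(descriptions):
        records.append({
            "description": desc,
            "overall": result.overall[i].item(),
            "recall": result.recall[i].item(),
            "precision": result.precision[i].item(),
            "faithfulness": result.faithfulness[i].item(),
        })
    return records

# e.g. rank separation candidates by overall quality:
# best = max(scores_to_records(result, descriptions), key=lambda r: r["overall"])
```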

## Citation

If you use SAM-Audio Judge in your research, please use the following BibTeX entry:

```bibtex
@article{shi2025samaudio,
  title={SAM Audio: Segment Anything in Audio},
  author={Bowen Shi and Andros Tjandra and John Hoffman and Helin Wang and Yi-Chiao Wu and Luya Gao and Julius Richter and Matt Le and Apoorv Vyas and Sanyuan Chen and Christoph Feichtenhofer and Piotr Doll{\'a}r and Wei-Ning Hsu and Ann Lee},
  year={2025},
  url={https://arxiv.org/abs/2512.18099}
}
```

## License

This project is licensed under the SAM License. See the [LICENSE](LICENSE) file for details.