slprl/PAST

ortal1602 committed on Jul 6

Commit 8e7e856 (verified) · 1 parent: bc49b51

Upload README.md

Files changed (1)

README.md +130 -3

@@ -1,4 +1,131 @@
-# PAST: Phonetic-Acoustic Speech Tokenizer
-### News
-20/5 - Initialized model card for the paper ["PAST: Phonetic-Acoustic Speech Tokenizer"](https://arxiv.org/abs/2505.14470v1). This repo would be updated soon.

+# 📘 PAST: Phonetic-Acoustic Speech Tokenizer
+**Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi
+**Affiliation:** The Hebrew University of Jerusalem
+📄 [Paper PDF](https://huggingface.co/path/to/pdf) | 🌐 [Project Page](https://pastpaper2025.github.io/past) | 📦 [Model Repo](https://huggingface.co/username/past-model)
+🧠 **Abstract:** See below
+📸 **Figure:** See below
+📊 Sample results and evaluation: See tables below
+---
+## 🧭 Quick Start
+### 📥 Clone and Set Up
+```bash
+git clone https://github.com/yourname/past.git
+cd past
+conda create -n past_env python=3.10 -y
+conda activate past_env
+pip install -r requirements.txt
+```
+### 🚀 Load the Model
+```python
+from past.models.past_model import PastModel
+import torch
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = PastModel.from_pretrained("path/to/checkpoint.th", device=device)
+print("Sample rate:", model.sample_rate)
+```
+### 🔊 Run on Audio
+```python
+import torchaudio
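+def read_one_wav(path, target_sr):
+    """Load a WAV file as a mono tensor of shape (1, 1, T) at target_sr."""
+    wav, sr = torchaudio.load(path)  # wav: (channels, T)
+    if sr != target_sr:
+        wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
+    if wav.shape[0] > 1:
+        wav = wav[:1]  # multi-channel input: keep only the first channel
+    return wav.unsqueeze(0)  # add the batch dimension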
+wav = read_one_wav("path/to/audio.wav", model.sample_rate).to(device)
+with torch.no_grad():
+    codes, scale = model.encode(wav)
+    reconstructed = model.decode(codes, scale)
+```
+### 🎧 Listen and Evaluate
+```python
+from IPython.display import Audio, display
+display(Audio(wav.cpu().numpy().squeeze(), rate=model.sample_rate))
+display(Audio(reconstructed.cpu().numpy().squeeze(), rate=model.sample_rate))
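+# Evaluate: both metrics compare the reconstruction against the original
+from audiocraft.losses.sisnr import SISNR
+from pypesq import pesq
+# audiocraft's SISNR is a training loss that returns the negated SI-SNR, so flip the sign
+sisnr_val = -SISNR(sample_rate=model.sample_rate)(reconstructed, wav).item()
+# pypesq takes 1-D NumPy arrays in the order (reference, degraded, sample rate)
+pesq_val = pesq(wav.squeeze().cpu().numpy(), reconstructed.squeeze().cpu().numpy(), model.sample_rate)
+print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")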
+```
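For readers unfamiliar with the SI-SNR figure printed above, here is a minimal, dependency-free sketch of the standard scale-invariant SNR definition. It is not the audiocraft implementation and may differ from it in details such as segmenting; the function name `si_snr`, the `eps` parameter, and the toy signals are illustrative assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to get the "target" component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Toy example (hypothetical data): a sine reference, and an estimate that is
# a rescaled copy of it plus a small orthogonal error component.
n = 200
reference = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
estimate = [
    3.0 * (r + 0.1 * math.cos(2 * math.pi * 5 * k / n))
    for k, r in enumerate(reference)
]
print(f"SI-SNR: {si_snr(estimate, reference):.2f} dB")
```

The error component carries 1% of the reference energy, so the score comes out at about 20 dB. Because the estimate is projected onto the reference before the ratio is taken, the `3.0 *` rescaling leaves the score unchanged.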
+---
+## 📌 What You Can Do
+- 🎙️ **Tokenize** audio into discrete phonetic-acoustic tokens
+- 🔁 **Reconstruct** audio from tokens (no vocoder needed)
+- 🧠 **Use tokens** in speech language modeling tasks
+- 📊 **Evaluate** token quality (PESQ, SI-SNR, ABX, PNMI)
+- 🛰️ Use the **streamable variant** for real-time applications
+---
+## 🧪 Results (from the paper)
+### 🧠 Phonetic Information
+| Tokenizer        | PNMI ↑   | ABX ↓ (W/A)     | WER ↓           |
+|------------------|----------|-----------------|-----------------|
+| Deep HuBERT 500  | 0.67     | 3.91 / 4.73     | 11.3 / 24.7     |
+| **PAST**         | **0.75** | **2.82 / 3.54** | 15.7 / 36.8     |
+| PAST Streamable  | 0.74     | 3.05 / 3.89     | **14.3 / 32.3** |
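PNMI in the table above is phone-normalized mutual information: the mutual information between frame-level phone labels and token IDs, divided by the entropy of the phone labels. The following is a minimal sketch of that definition, assuming frame-aligned phone and token sequences are already available. The function name `pnmi` and the toy inputs are illustrative, and this is not the evaluation code behind the reported numbers.

```python
import math
from collections import Counter


def pnmi(phones, tokens):
    """Phone-normalized mutual information: I(phone; token) / H(phone)."""
    if len(phones) != len(tokens):
        raise ValueError("phones and tokens must be frame-aligned")
    n = len(phones)
    joint = Counter(zip(phones, tokens))
    phone_counts = Counter(phones)
    token_counts = Counter(tokens)

    # Mutual information from empirical joint and marginal frequencies.
    mi = 0.0
    for (phone, token), count in joint.items():
        p_joint = count / n
        p_phone = phone_counts[phone] / n
        p_token = token_counts[token] / n
        mi += p_joint * math.log(p_joint / (p_phone * p_token))

    # Entropy of the phone labels, used as the normalizer.
    h_phone = -sum((c / n) * math.log(c / n) for c in phone_counts.values())
    return mi / h_phone if h_phone > 0 else 0.0


# Toy example (hypothetical labels): tokens that track the phones perfectly,
# and tokens that are statistically independent of them.
phones = ["a", "a", "b", "b"]
print(pnmi(phones, [0, 0, 1, 1]))
print(pnmi(phones, [0, 1, 0, 1]))
```

A value near 1 means the token identity almost fully determines the phone, while a value near 0 means the tokens carry little phonetic information.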
+### 🔊 Reconstruction Quality
+| Tokenizer        | SI-SNR ↑ | ViSQOL ↑ | PESQ ↑ |
+|------------------|----------|-----------|--------|
+| EnCodec          | **7.49** | 4.48      | 3.88   |
+| PAST             | 4.84     | 4.40      | 3.55   |
+| PAST Streamable  | 3.90     | 4.37      | 3.40   |
+### 📖 Speech Language Modeling (sWUGGY)
+| Tokenizer        | Inter ↑ | OOV ↑ |
+|------------------|---------|--------|
+| PAST             | **71.8** | **57.5** |
+| PAST Streamable  | 70.2    | 56.3  |
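sWUGGY is a spot-the-word task: a speech language model scores a real word and a matched non-word, and a pair counts as correct when the real word receives the higher score. The sketch below shows how such an accuracy could be computed, assuming per-pair log-likelihoods have already been produced by some language model over the tokens. The function name `spot_the_word_accuracy` and the toy scores are hypothetical, not taken from the paper's evaluation code.

```python
def spot_the_word_accuracy(scored_pairs):
    """Percentage of (word, non-word) pairs where the real word scores higher.

    scored_pairs: iterable of (logprob_word, logprob_nonword) tuples.
    """
    pairs = list(scored_pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    # A pair is correct only when the real word is strictly more likely.
    correct = sum(1 for word_lp, nonword_lp in pairs if word_lp > nonword_lp)
    return 100.0 * correct / len(pairs)


# Toy example (made-up log-likelihoods for four pairs).
toy_pairs = [(-10.0, -12.0), (-8.0, -7.0), (-5.0, -9.0), (-3.0, -4.0)]
print(f"accuracy: {spot_the_word_accuracy(toy_pairs):.1f}%")
```

Chance level on this kind of pairwise comparison is 50%, which gives a reference point for reading the table.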
+---
+## 📝 Citation
+> If you use PAST in your work, please cite:
+```
+@inproceedings{har2025past,
+  title={PAST: Phonetic-Acoustic Speech Tokenizer},
+  author={Har-Tuv, Nadav and Tal, Or and Adi, Yossi},
+  booktitle={Interspeech},
+  year={2025}
+}
+```
+---
+## 🖼️ Abstract and Figure
+> **Abstract:**
+> We present **PAST**, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. [...] Results demonstrate that PAST surpasses existing tokenizers across phonetic representation, speech reconstruction, and language modeling. We also introduce a **streamable variant** for real-time use.
+
+![Figure 1: PAST pipeline](path/to/figure.png)