Commit 8e7e856 by ortal1602 · verified · 1 Parent(s): bc49b51

Upload README.md

  1. README.md +130 -3
README.md CHANGED
@@ -1,4 +1,131 @@
- # PAST: Phonetic-Acoustic Speech Tokenizer
 
- ### News
- 20/5 - Initialized model card for the paper ["PAST: Phonetic-Acoustic Speech Tokenizer"](https://arxiv.org/abs/2505.14470v1). This repo would be updated soon.
+ # 📘 PAST: Phonetic-Acoustic Speech Tokenizer
 
+ **Authors:** Nadav Har-Tuv, Or Tal, Yossi Adi
+ **Affiliation:** The Hebrew University of Jerusalem
+ 📄 [Paper PDF](https://huggingface.co/path/to/pdf) | 🌐 [Project Page](https://pastpaper2025.github.io/past) | 📦 [Model Repo](https://huggingface.co/username/past-model)
+ 🧠 **Abstract:** See below
+ 📸 **Figure:** See below
+ 📊 Sample results and evaluation: See tables below
+
+ ---
+
+ ## 🧭 Quick Start
+
+ ### 📥 Clone and Set Up
+
+ ```bash
+ git clone https://github.com/yourname/past.git
+ cd past
+ conda create -n past_env python=3.10 -y
+ conda activate past_env
+ pip install -r requirements.txt
+ ```
+
+ ### 🚀 Load the Model
+
+ ```python
+ from past.models.past_model import PastModel
+ import torch
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = PastModel.from_pretrained("path/to/checkpoint.th", device=device)
+ print("Sample rate:", model.sample_rate)
+ ```
+
+ ### 🔊 Run on Audio
+
+ ```python
+ import torchaudio
+
+ def read_one_wav(path, target_sr):
+     # Load the file, resample to the model's rate, and keep one channel.
+     wav, sr = torchaudio.load(path)
+     if sr != target_sr:
+         wav = torchaudio.transforms.Resample(sr, target_sr)(wav)
+     if wav.shape[0] > 1:
+         wav = wav[:1]  # keep only the first channel of multi-channel audio
+     return wav.unsqueeze(0)  # add a batch dimension
+
+ wav = read_one_wav("path/to/audio.wav", model.sample_rate).to(device)
+
+ with torch.no_grad():
+     codes, scale = model.encode(wav)
+     reconstructed = model.decode(codes, scale)
+ ```
+
+ ### 🎧 Listen and Evaluate
+
+ ```python
+ from IPython.display import Audio, display
+ display(Audio(wav.cpu().numpy().squeeze(), rate=model.sample_rate))
+ display(Audio(reconstructed.cpu().numpy().squeeze(), rate=model.sample_rate))
+
+ # Evaluate (requires the audiocraft and pypesq packages)
+ from audiocraft.losses.sisnr import SISNR
+ from pypesq import pesq
+
+ # SISNR is written as a loss module, so check its sign convention before
+ # reporting the value (a loss is typically the negated SI-SNR).
+ sisnr_val = SISNR(sample_rate=model.sample_rate)(reconstructed, wav)
+ # pesq takes the reference signal first, then the degraded signal.
+ pesq_val = pesq(wav.squeeze().cpu().numpy(), reconstructed.squeeze().cpu().numpy(), model.sample_rate)
+
+ print(f"PESQ: {pesq_val:.2f}, SI-SNR: {sisnr_val:.2f}")
+ ```
+
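For readers who want to see what the SI-SNR figure measures, here is a minimal dependency-free sketch of the usual scale-invariant SNR formula. It is not the `audiocraft` implementation: the function name `si_snr` and the `eps` parameter are assumptions made for illustration, and it works on plain Python sequences rather than tensors.

```python
import math

def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; not part of PAST or audiocraft.
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)
    # Remove the mean so a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to isolate the "target" part.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after removing the target is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))

# Example: a sine wave and a slightly perturbed copy of it.
clean = [math.sin(0.1 * i) for i in range(1000)]
noisy = [c + 0.1 * math.cos(0.37 * i) for i, c in enumerate(clean)]
print(f"SI-SNR: {si_snr(clean, noisy):.2f} dB")
```

Because the estimate is projected onto the reference first, rescaling the estimate leaves the score essentially unchanged, which is the "scale-invariant" part of the name.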
+ ---
+
+ ## 📌 What You Can Do
+
+ - 🎙️ **Tokenize** audio into discrete phonetic-acoustic tokens
+ - 🔁 **Reconstruct** audio from tokens (no vocoder needed)
+ - 🧠 **Use tokens** in speech language modeling tasks
+ - 📊 **Evaluate** token quality (PESQ, SI-SNR, ABX, PNMI)
+ - 🛰️ Use the **streamable variant** for real-time applications
+
+ ---
+
+ ## 🧪 Results (from the paper)
+
+ ### 🧠 Phonetic Information
+
+ | Tokenizer       | PNMI ↑   | ABX ↓ (W/A)     | WER ↓           |
+ |-----------------|----------|-----------------|-----------------|
+ | Deep HuBERT 500 | 0.67     | 3.91 / 4.73     | 11.3 / 24.7     |
+ | **PAST**        | **0.75** | **2.82 / 3.54** | 15.7 / 36.8     |
+ | PAST Streamable | 0.74     | 3.05 / 3.89     | **14.3 / 32.3** |
+
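As a rough guide to reading the PNMI column, the sketch below computes phone-normalized mutual information under its conventional definition, I(phone; token) / H(phone), from frame-aligned phone labels and token IDs. The function name `pnmi` and the toy inputs are assumptions for illustration; how the paper obtains alignments and which codebook it scores are not specified here.

```python
import math
from collections import Counter

def pnmi(phones, tokens):
    """Phone-normalized mutual information: I(phone; token) / H(phone).

    Hypothetical helper for illustration. `phones` and `tokens` are
    equal-length sequences holding one label per aligned frame.
    """
    if len(phones) != len(tokens) or len(phones) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(phones)
    joint = Counter(zip(phones, tokens))
    phone_counts = Counter(phones)
    token_counts = Counter(tokens)
    # Mutual information from empirical joint and marginal frequencies.
    mi = 0.0
    for (ph, tk), c in joint.items():
        p_xy = c / n
        p_x = phone_counts[ph] / n
        p_y = token_counts[tk] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    # Entropy of the phone labels, used as the normalizer.
    h_phone = -sum((c / n) * math.log(c / n) for c in phone_counts.values())
    return mi / h_phone if h_phone > 0 else 0.0

# Toy example: tokens that track the phone labels fairly closely.
print(pnmi(["a", "a", "b", "b", "c", "c"], [0, 0, 1, 1, 2, 1]))
```

A value near 1 means the token identity almost determines the phone, while a value near 0 means the tokens carry little phonetic information.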
+ ### 🔊 Reconstruction Quality
+
+ | Tokenizer       | SI-SNR ↑ | ViSQOL ↑ | PESQ ↑ |
+ |-----------------|----------|----------|--------|
+ | EnCodec         | **7.49** | 4.48     | 3.88   |
+ | PAST            | 4.84     | 4.40     | 3.55   |
+ | PAST Streamable | 3.90     | 4.37     | 3.40   |
+
+ ### 📖 Speech Language Modeling (sWUGGY)
+
+ | Tokenizer       | Inter ↑  | OOV ↑    |
+ |-----------------|----------|----------|
+ | PAST            | **71.8** | **57.5** |
+ | PAST Streamable | 70.2     | 56.3     |
+
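sWUGGY is a spot-the-word benchmark: a speech language model is given a real word and a matched non-word, and it is counted as correct when it scores the real word higher. The sketch below shows only that final accuracy computation. The function name `spot_the_word_accuracy` and the example scores are hypothetical, and the paper's exact scoring (for example, any length normalization of the likelihoods) is not described here.

```python
def spot_the_word_accuracy(pairs):
    """Percentage of (real, fake) pairs where the real word scores higher.

    Hypothetical helper for illustration. `pairs` is an iterable of
    (score_real, score_fake) tuples, each score being the log-likelihood
    a speech LM assigns to the tokenized utterance.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for real, fake in pairs if real > fake)
    return 100.0 * correct / len(pairs)

# Made-up log-likelihoods for four word / non-word pairs.
example_scores = [(-1.0, -2.0), (-3.0, -2.5), (-0.5, -4.0), (-2.0, -2.1)]
print(spot_the_word_accuracy(example_scores))
```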
+ ---
+
+ ## 📝 Citation
+
+ > If you use PAST in your work, please cite:
+
+ ```
+ @article{har2025past,
+   title={PAST: Phonetic-Acoustic Speech Tokenizer},
+   author={Har-Tuv, Nadav and Tal, Or and Adi, Yossi},
+   journal={Interspeech},
+   year={2025}
+ }
+ ```
+
+ ---
+
+ ## 🖼️ Abstract and Figure
+
+ > **Abstract:**
+ We present **PAST**, a novel end-to-end framework that jointly models phonetic information alongside signal reconstruction, eliminating the need for external pretrained models. [...] Results demonstrate that PAST surpasses existing tokenizers across phonetic representation, speech reconstruction, and language modeling. We also introduce a **streamable variant** for real-time use.
+
+ ![Figure 1: PAST pipeline](path/to/figure.png)