# Plapre Simple - Phoneme-based TTS Model

A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.

## Model Overview

This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).

## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

1. **Standard tokens (4)**: `<pad>`, `<unk>`, `<bos>`, `<eos>`
2. **Phonemes (109)**: IPA phoneme characters from `phoneme_list.json`
3. **Audio tokens (65,536)**: `<audio_0>` to `<audio_65535>` (representing neural codec codes)
4. **Special structure tokens (8)**:
   - `<phoneme_start>`, `<phoneme_end>`
   - `<audio_start>`, `<audio_end>`
   - `<ref_audio_start>`, `<ref_audio_end>`
   - `<ref_text_start>`, `<ref_text_end>`
5. **Placeholder tokens (128)**: `<placeholder_0>` to `<placeholder_127>` (reserved for future use)
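
Assuming the vocabulary is laid out in the order listed above (an assumption; the authoritative mapping lives in `tokenizer_config.json`), the ID ranges can be sketched as:

```python
# Sketch of the vocabulary layout, assuming tokens appear in the order listed
# above. The authoritative mapping is in tokenizer_config.json.
NUM_STANDARD = 4        # <pad>, <unk>, <bos>, <eos>
NUM_PHONEMES = 109      # from phoneme_list.json
NUM_AUDIO = 65536       # <audio_0> ... <audio_65535>
NUM_STRUCTURE = 8       # <phoneme_start>, <phoneme_end>, <audio_start>, ...
NUM_PLACEHOLDER = 128   # <placeholder_0> ... <placeholder_127>

# Under this ordering, <audio_0> would sit right after the standard and
# phoneme tokens.
audio_token_start_id = NUM_STANDARD + NUM_PHONEMES  # 113 under this assumption

# Note: these categories sum to 65,785; the remaining ids up to the stated
# 66,192-token total are presumably extra reserved entries.
listed_total = (NUM_STANDARD + NUM_PHONEMES + NUM_AUDIO
                + NUM_STRUCTURE + NUM_PLACEHOLDER)
```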

## Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```

### Example Sequence

For the text "You know, when":

1. **Text**: "You know, when"
2. **Phonemes**: `juː nˈoʊ, wˌɛn`
3. **Training sequence**:

```
<phoneme_start>
j u ː n ˈ o ʊ , w ˌ ɛ n
<phoneme_end>
<audio_start>
<audio_2151> <audio_43235> <audio_56802> ... (audio tokens)
<audio_end>
```
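
As a rough sketch, the sequence above could be assembled like this (the `build_training_sequence` helper is hypothetical, and dropping word spaces is an assumption suggested by the example):

```python
def build_training_sequence(phoneme_str, audio_codes):
    """Assemble a training sequence as a list of token strings (sketch)."""
    # Character-level phoneme tokens; the example above suggests spaces are dropped.
    phoneme_tokens = [ch for ch in phoneme_str if ch != " "]
    audio_tokens = [f"<audio_{code}>" for code in audio_codes]
    return (["<phoneme_start>"] + phoneme_tokens + ["<phoneme_end>"]
            + ["<audio_start>"] + audio_tokens + ["<audio_end>"])

# The example from this card, with a truncated list of audio codes
seq = build_training_sequence("juː nˈoʊ, wˌɛn", [2151, 43235, 56802])
```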

### Training Objective

- The model uses causal language modeling (next-token prediction)
- **Phoneme tokens are masked** in the loss (labels set to -100)
- **Only audio tokens are trained** to be predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input
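
A minimal sketch of this masking (hypothetical helper; the actual implementation is in `train_simple.py`, and the exact boundary handling around `<audio_start>` is an assumption):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_phoneme_labels(input_ids, audio_start_id):
    """Copy input_ids into labels, masking everything up to and including
    <audio_start> so the loss covers only the audio-token portion (sketch)."""
    labels = []
    in_audio = False
    for token_id in input_ids:
        labels.append(token_id if in_audio else IGNORE_INDEX)
        if token_id == audio_start_id:
            in_audio = True
    return labels
```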

## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character by character (each IPA symbol is a separate token).
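
In practice the phoneme string comes from espeak-ng (e.g. via the `phonemizer` package); the character-level split itself is plain Python. A sketch using the example string from this card:

```python
# Character-level tokenization of an espeak-ng phoneme string (sketch).
# The string below is the example output shown in this card; whether spaces
# are kept as tokens is an implementation detail of the real tokenizer.
phoneme_str = "juː nˈoʊ, wˌɛn"

# Each IPA symbol (including stress marks and punctuation) becomes one token.
phoneme_tokens = list(phoneme_str)
```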

## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `(audio_token_start_id + n)`
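
This mapping is a simple offset, sketched below (`AUDIO_TOKEN_START_ID = 113` is an assumption based on the vocabulary order listed above; the real value comes from the tokenizer config):

```python
AUDIO_TOKEN_START_ID = 113  # assumed offset; read the real value from the tokenizer

def audio_code_to_token_id(code):
    """Map a neural codec code (0-65535) to its vocabulary token id."""
    assert 0 <= code <= 65535
    return AUDIO_TOKEN_START_ID + code

def token_id_to_audio_code(token_id):
    """Inverse mapping, used when decoding generated tokens back to codec codes."""
    return token_id - AUDIO_TOKEN_START_ID
```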

## Model Details

- **Base model**: Qwen3-0.6B-Base
- **Vocabulary size**: 66,192 tokens
- **Training dataset**: neuphonic/emilia-yodas-english-neucodec
- **Batch size**: 16 (effective)
- **Precision**: bfloat16
- **Attention**: Flash Attention 2
79
+ ## Usage
80
+
81
+ To use this model, you'll need:
82
+ 1. The custom `PhonemeTokenizer` class (see `train_simple.py`)
83
+ 2. espeak-ng for phonemization
84
+ 3. A neural audio codec decoder for converting audio tokens to waveforms
85
+
86
+ ```python
87
+ from transformers import AutoModelForCausalLM
88
+ from train_simple import PhonemeTokenizer
89
+
90
+ # Load model and tokenizer
91
+ model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
92
+ tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")
93
+
94
+ # Your inference code here
95
+ ```
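
For the decoding step, one plausible sketch (hypothetical helper, pure Python) is to slice the generated ids between `<audio_start>` and `<audio_end>` and strip the vocabulary offset, yielding codec codes for the neural codec decoder:

```python
def extract_audio_codes(generated_ids, audio_start_id, audio_end_id,
                        audio_token_start_id):
    """Pull neural codec codes out of a generated id sequence (sketch).
    Takes everything between the first <audio_start> and the following
    <audio_end>, then removes the vocabulary offset."""
    start = generated_ids.index(audio_start_id) + 1
    try:
        end = generated_ids.index(audio_end_id, start)
    except ValueError:
        end = len(generated_ids)  # model may stop before emitting <audio_end>
    return [tid - audio_token_start_id for tid in generated_ids[start:end]]
```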

## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in the vocabulary
- `README.md` - This file

## Training Details

Trained using the Hugging Face Transformers `Trainer` with:

- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW
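
These settings would map onto `TrainingArguments` roughly as follows (field names are the standard Transformers ones; shown as a plain dict so the sketch stays dependency-free):

```python
# Hyperparameters from this card, in TrainingArguments-style field names (sketch).
train_config = {
    "learning_rate": 2e-4,
    "warmup_steps": 1000,
    "gradient_accumulation_steps": 4,
    "per_device_train_batch_size": 4,
    "bf16": True,              # matches the bfloat16 precision noted above
    "optim": "adamw_torch",    # AdamW
}

# Effective batch size = per-device batch * gradient accumulation (single device)
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
```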

## License

Inherits license from Qwen3-0.6B-Base.