# Plapre Simple - Phoneme-based TTS Model

A simplified phoneme-based text-to-speech model built on Qwen3-0.6B-Base, trained to generate audio tokens from phoneme sequences.

## Model Overview

This model is trained to perform phoneme-to-audio-token generation using a causal language modeling approach. It takes phoneme sequences as input and generates audio tokens that can be decoded by a neural audio codec (e.g., NeuCodec).

## Tokenization

The model uses a custom phoneme-based tokenizer with the following vocabulary structure:

### Vocabulary Composition (66,192 tokens total)

1. **Standard tokens (4)**: `<pad>`, `<unk>`, `<bos>`, `<eos>`
2. **Phonemes (109)**: IPA phoneme characters from `phoneme_list.json`
3. **Audio tokens (65,536)**: `<audio_0>` to `<audio_65535>` (representing neural codec codes)
4. **Special structure tokens (8)**:
   - `<phoneme_start>`, `<phoneme_end>`
   - `<audio_start>`, `<audio_end>`
   - `<ref_audio_start>`, `<ref_audio_end>`
   - `<ref_text_start>`, `<ref_text_end>`
5. **Placeholder tokens (128)**: `<placeholder_0>` to `<placeholder_127>` (reserved for future use)
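
For concreteness, here is a minimal sketch of the token-ID layout this composition implies, assuming the blocks are contiguous and laid out in the order listed above (the actual offsets are defined by the `PhonemeTokenizer` in `train_simple.py`):

```python
# Hypothetical ID layout, assuming contiguous blocks in the listed order.
STANDARD = ["<pad>", "<unk>", "<bos>", "<eos>"]          # IDs 0-3
NUM_PHONEMES = 109                                       # from phoneme_list.json
NUM_AUDIO = 65_536                                       # <audio_0> ... <audio_65535>
NUM_SPECIAL = 8                                          # <phoneme_start>, <phoneme_end>, ...
NUM_PLACEHOLDER = 128                                    # <placeholder_0> ... <placeholder_127>

phoneme_start_id = len(STANDARD)                         # 4
audio_token_start_id = phoneme_start_id + NUM_PHONEMES   # 113
special_start_id = audio_token_start_id + NUM_AUDIO      # 65,649
placeholder_start_id = special_start_id + NUM_SPECIAL    # 65,657

listed_total = placeholder_start_id + NUM_PLACEHOLDER    # 65,785 listed tokens; any
# remaining IDs up to the stated 66,192 total are presumably padding/reserved.
```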

## Training Sequence Format

The model is trained on sequences with the following structure:

```
<phoneme_start> + [phoneme tokens] + <phoneme_end> + <audio_start> + [audio tokens] + <audio_end>
```

### Example Sequence

For the text "You know, when":

1. **Text**: "You know, when"
2. **Phonemes**: `juː nˈoʊ, wˌɛn`
3. **Training sequence**:

```
<phoneme_start>
j u ː n ˈ o ʊ , w ˌ ɛ n
<phoneme_end>
<audio_start>
<audio_2151> <audio_43235> <audio_56802> ... (audio tokens)
<audio_end>
```
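
As an illustration, a single training example like the one above could be assembled into a flat token sequence roughly as follows (token strings before ID lookup; whitespace handling and the exact assembly code in `train_simple.py` may differ):

```python
# Hypothetical assembly of one training sequence as token strings.
phonemes = list("juː nˈoʊ, wˌɛn")             # character-by-character IPA tokens
audio_codes = [2151, 43235, 56802]            # codes from the neural codec (truncated)

sequence = (
    ["<phoneme_start>"]
    + phonemes
    + ["<phoneme_end>", "<audio_start>"]
    + [f"<audio_{c}>" for c in audio_codes]
    + ["<audio_end>"]
)
# ['<phoneme_start>', 'j', 'u', 'ː', ..., '<audio_2151>', ..., '<audio_end>']
```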

### Training Objective

- The model uses causal language modeling (next-token prediction)
- **Phoneme tokens are masked** in the loss (labels set to -100)
- **Only audio tokens are trained** to be predicted from the phoneme context
- This teaches the model to generate audio tokens conditioned on phoneme input
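
A minimal sketch of the label construction under this objective, assuming the conditioning prefix ends at `<phoneme_end>` (whether `<audio_start>` is also masked is a detail left to `train_simple.py`):

```python
# Hypothetical label masking: ignore everything up to and including <phoneme_end>,
# so the loss is only computed on the audio portion of the sequence.
IGNORE_INDEX = -100

def build_labels(input_ids: list[int], phoneme_end_id: int) -> list[int]:
    labels = list(input_ids)
    boundary = input_ids.index(phoneme_end_id) + 1   # position just after <phoneme_end>
    labels[:boundary] = [IGNORE_INDEX] * boundary
    return labels
```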

## Phoneme Encoding

Text is converted to phonemes using espeak-ng with the following settings:

- Language: `en-us`
- Preserve punctuation: `True`
- With stress markers: `True`

Phonemes are then tokenized character-by-character (each IPA symbol is a separate token).
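
These settings map naturally onto the `phonemizer` package's espeak backend; whether the training code calls `phonemizer` directly or shells out to espeak-ng is not specified here, so treat this as a sketch:

```python
from phonemizer import phonemize

text = "You know, when"
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)                   # e.g. "juː nˈoʊ, wˌɛn"
phoneme_tokens = list(phonemes)   # character-by-character tokenization
```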

## Audio Token Encoding

Audio codes from the neural codec (range 0-65535) are mapped to vocabulary tokens:

- Audio code `n` → token `<audio_n>` → token ID `(audio_token_start_id + n)`
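
In code, this mapping and its inverse might look like the following, assuming a contiguous `<audio_*>` block starting at `audio_token_start_id`:

```python
# Hypothetical code <-> token-ID mapping for the contiguous <audio_*> block.
def audio_code_to_token_id(code: int, audio_token_start_id: int) -> int:
    return audio_token_start_id + code        # e.g. code 2151 -> ID of <audio_2151>

def token_id_to_audio_code(token_id: int, audio_token_start_id: int) -> int:
    return token_id - audio_token_start_id    # inverse, used when decoding generations
```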

## Model Details

- **Base model**: Qwen3-0.6B-Base
- **Vocabulary size**: 66,192 tokens
- **Training dataset**: neuphonic/emilia-yodas-english-neucodec
- **Batch size**: 16 (effective)
- **Precision**: bfloat16
- **Attention**: Flash Attention 2

## Usage

To use this model, you'll need:

1. The custom `PhonemeTokenizer` class (see `train_simple.py`)
2. espeak-ng for phonemization
3. A neural audio codec decoder for converting audio tokens to waveforms

```python
from transformers import AutoModelForCausalLM
from train_simple import PhonemeTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("syvai/plapre-simple")
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Your inference code here
```
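
A rough end-to-end inference sketch is shown below. The prompt format follows the training sequence structure described above, but the `PhonemeTokenizer` call signature, the `audio_token_start_id` offset, and the codec decoding step are illustrative assumptions rather than the repository's exact API:

```python
import torch
from phonemizer import phonemize
from transformers import AutoModelForCausalLM

from train_simple import PhonemeTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "syvai/plapre-simple", torch_dtype=torch.bfloat16
)
tokenizer = PhonemeTokenizer.from_pretrained("syvai/plapre-simple")

# Phonemize the text and build the prompt up to <audio_start>; the model then
# continues the sequence with <audio_*> tokens until it emits <audio_end>.
phonemes = phonemize("You know, when", language="en-us", backend="espeak",
                     preserve_punctuation=True, with_stress=True)
prompt = "<phoneme_start>" + phonemes + "<phoneme_end><audio_start>"   # assumed prompt format
input_ids = tokenizer(prompt, return_tensors="pt").input_ids           # assumed tokenizer call

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=1024)

# Strip the prompt, keep only <audio_*> IDs, and remove the vocabulary offset.
audio_token_start_id = 113            # assumed offset, see the vocabulary sketch above
generated = output_ids[0, input_ids.shape[1]:].tolist()
audio_codes = [i - audio_token_start_id for i in generated
               if audio_token_start_id <= i < audio_token_start_id + 65_536]
# waveform = neucodec.decode(audio_codes)   # placeholder for the NeuCodec decoder
```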

## Files in Repository

- `config.json` - Model configuration
- `model.safetensors` / `pytorch_model.bin` - Model weights
- `tokenizer_config.json` - Tokenizer configuration and vocabulary
- `phoneme_list.json` - List of phonemes used in vocabulary
- `README.md` - This file

## Training Details

Trained using the Hugging Face Transformers `Trainer` with:

- Learning rate: 0.0002
- Warmup steps: 1000
- Gradient accumulation: 4
- Per-device batch size: 4
- Optimizer: AdamW
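
For reference, these values map onto `TrainingArguments` roughly as follows (a sketch; the output directory, optimizer variant, and any scheduler settings beyond warmup are assumptions):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the Trainer configuration from the values above.
training_args = TrainingArguments(
    output_dir="plapre-simple",        # assumed
    learning_rate=2e-4,
    warmup_steps=1000,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size 16
    bf16=True,
    optim="adamw_torch",
)
```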

## License

This model inherits its license from Qwen3-0.6B-Base.