---
datasets:
- amphion/Emilia-Dataset
language:
- en
base_model:
- LiquidAI/LFM2-350M
pipeline_tag: text-to-speech
library_name: transformers
---
## Overview
VyvoTTS-LFM2-350M is a text-to-speech model based on LFM2-350M, trained to produce natural-sounding English speech.

- **Type:** Text-to-Speech
- **Language:** English
- **License:** CC BY-NC 4.0
- **Params:** ~383M

See the GitHub repository for pretraining and fine-tuning instructions.

GitHub: https://github.com/Vyvo-Labs/VyvoTTS
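
The YAML front matter lists `transformers` as the library, so the checkpoint can likely also be loaded with it directly. A minimal, untested sketch (the full token-to-audio pipeline in the Usage section below is still needed to obtain a waveform):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical direct load via transformers; assumes the repo hosts a
# standard causal-LM checkpoint (VyvoTTS is built on LFM2-350M).
tokenizer = AutoTokenizer.from_pretrained("Vyvo/VyvoTTS-LFM2-350M")
model = AutoModelForCausalLM.from_pretrained("Vyvo/VyvoTTS-LFM2-350M")
```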
## Usage
Below is an example of using the model with `unsloth` and `SNAC` for speech generation.
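It assumes the `unsloth` and `snac` packages are installed (typically via `pip install unsloth snac`) and that a CUDA-capable GPU is available: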

```python
from unsloth import FastLanguageModel
import torch
from snac import SNAC

# Load the TTS model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Vyvo/VyvoTTS-LFM2-Neuvillette",
    max_seq_length = 8192,
    dtype = None,
    load_in_4bit = False,
)
# SNAC decodes the generated audio codes back into a 24 kHz waveform.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

# Special-token IDs appended after the 64,400-entry text vocabulary.
tokeniser_length = 64400
start_of_text = 1
end_of_text = 7

start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2
start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4
pad_token = tokeniser_length + 7

audio_tokens_start = tokeniser_length + 10
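# Token layout: each prompt is wrapped as
#   [start_of_human] <text tokens> [end_of_text] [end_of_human]
# and the model replies with [start_of_speech] followed by SNAC audio
# codes (7 codes per frame, offset by audio_tokens_start) until it
# emits [end_of_speech].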

prompts = ["Hey there my name is Elise, and I'm a speech generation model that can sound like a person."]
chosen_voice = None  # set a speaker name to prefix prompts with "<voice>: "

FastLanguageModel.for_inference(model)
snac_model.to("cpu")
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

# Tokenize each prompt.
all_input_ids = []
for prompt in prompts_:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    all_input_ids.append(input_ids)

# Wrap each prompt with the human-turn markers.
start_token = torch.tensor([[start_of_human]], dtype=torch.int64)
end_tokens = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64)

all_modified_input_ids = []
for input_ids in all_input_ids:
    modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
    all_modified_input_ids.append(modified_input_ids)

# Left-pad all sequences to the same length and build attention masks.
all_padded_tensors, all_attention_masks = [], []
max_length = max([m.shape[1] for m in all_modified_input_ids])
for m in all_modified_input_ids:
    padding = max_length - m.shape[1]
    padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), m], dim=1)
    attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, m.shape[1]), dtype=torch.int64)], dim=1)
    all_padded_tensors.append(padded_tensor)
    all_attention_masks.append(attention_mask)

input_ids = torch.cat(all_padded_tensors, dim=0).to("cuda")
attention_mask = torch.cat(all_attention_masks, dim=0).to("cuda")

# Sample until the model emits end_of_speech.
generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=end_of_speech,
    use_cache=True
)

# Keep only the tokens after the last start_of_speech marker.
token_to_find = start_of_speech
token_to_remove = end_of_speech
token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

# Drop the end_of_speech tokens.
processed_rows = []
for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

# Trim each row to a whole number of 7-token SNAC frames and shift the
# IDs back into the SNAC code range.
code_lists = []
for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - audio_tokens_start for t in trimmed_row]
    code_lists.append(trimmed_row)

# Split each 7-token frame across SNAC's three codebook layers
# (each layer uses 4096 codes, so every frame position carries a
# fixed multiple-of-4096 offset).
def redistribute_codes(code_list):
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(len(code_list)//7):
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i+1]-4096)
        layer_3.append(code_list[7*i+2]-(2*4096))
        layer_3.append(code_list[7*i+3]-(3*4096))
        layer_2.append(code_list[7*i+4]-(4*4096))
        layer_3.append(code_list[7*i+5]-(5*4096))
        layer_3.append(code_list[7*i+6]-(6*4096))
    codes = [
        torch.tensor(layer_1).unsqueeze(0),
        torch.tensor(layer_2).unsqueeze(0),
        torch.tensor(layer_3).unsqueeze(0)
    ]
    audio_hat = snac_model.decode(codes)
    return audio_hat

my_samples = []
for code_list in code_lists:
    samples = redistribute_codes(code_list)
    my_samples.append(samples)

# Play the generated audio (in a notebook environment).
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
    raise Exception("Number of prompts and samples do not match")
else:
    for i in range(len(my_samples)):
        print(prompts[i])
        samples = my_samples[i]
        display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))

del my_samples, samples
```
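
Outside a notebook, the decoded waveforms can be written to disk instead of displayed. A minimal sketch (run before the final `del`), assuming the `soundfile` package is installed:

```python
import soundfile as sf

# Write each decoded waveform out as a 24 kHz WAV file.
for i, sample in enumerate(my_samples):
    audio = sample.detach().squeeze().to("cpu").numpy()
    sf.write(f"output_{i}.wav", audio, samplerate=24000)
```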

## Citation

If you use this model, please cite:

```bibtex
@misc{VyvoTTS-LFM2-350M,
  title={VyvoTTS-LFM2-350M},
  author={Vyvo},
  year={2025},
  howpublished={\url{https://huggingface.co/Vyvo/VyvoTTS-LFM2-350M}}
}
```