This is a fine-tuned version of the Conversational Speech Model (CSM-1B), adapted with LoRA for parameter-efficient fine-tuning. The model is trained on a 1,530-sample dataset of medical cardiology texts and is designed to generate high-quality speech from cardiology-related text. It builds on the text-to-speech capabilities of the original CSM-1B model, extended with domain-specific terminology for medical cardiology, and is intended for English speech generation, especially in clinical and educational contexts.
Use the code below to get started with the model.
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
import soundfile as sf
from peft import PeftModel

model_id = "unsloth/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and base CSM-1B model, then attach the LoRA adapter.
processor = AutoProcessor.from_pretrained(model_id)
base_model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(base_model, "khazarai/Cardiology-TTS")

text = "The coronary arteries are patent with no significant stenosis."
speaker_id = 0
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

audio_values = model.generate(
    **processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True,
    ).to(device),
    max_new_tokens=200,
    # play with these parameters to tweak results
    # depth_decoder_top_k=0,
    # depth_decoder_top_p=0.9,
    # depth_decoder_do_sample=True,
    # depth_decoder_temperature=0.9,
    # top_k=0,
    # top_p=1.0,
    # temperature=0.9,
    # do_sample=True,
    output_audio=True,
)

# CSM generates 24 kHz audio; convert to float32 NumPy and write a WAV file.
audio = audio_values[0].to(torch.float32).cpu().numpy()
sf.write("example.wav", audio, 24000)
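Before writing the waveform to disk, it can help to peak-normalize it so the WAV file never clips. A minimal NumPy sketch (the `peak_normalize` helper is illustrative, not part of the model or library API):

```python
import numpy as np

def peak_normalize(audio: np.ndarray, headroom: float = 0.95) -> np.ndarray:
    """Scale a waveform so its largest absolute sample equals `headroom`."""
    peak = np.max(np.abs(audio))
    if peak == 0:
        return audio  # silent clip: nothing to scale
    return audio * (headroom / peak)

# Example: a waveform that would clip at |1.2| is scaled back below 1.0.
clipped = np.array([0.5, -1.2, 0.8], dtype=np.float32)
safe = peak_normalize(clipped)
print(np.max(np.abs(safe)))  # 0.95
```

Applying such a step between `audio_values[0]...numpy()` and `sf.write` keeps the output within the valid [-1, 1] range regardless of the sampling parameters chosen above.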
The model was fine-tuned on 1,530 samples of cardiology-related text paired with audio.
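As an illustration of what one such text-audio training pair might look like, the sketch below uses hypothetical field names (the actual dataset schema is not published here and may differ):

```python
# Hypothetical illustration of a single text-audio training pair.
# All field names are assumptions for illustration, not the real schema.
sample = {
    "text": "The coronary arteries are patent with no significant stenosis.",
    "audio": {
        "array": [0.0, 0.012, -0.008],  # waveform samples (truncated excerpt)
        "sampling_rate": 24000,         # CSM-1B operates on 24 kHz audio
    },
    "speaker_id": 0,  # matches the speaker id used at inference time
}
print(sample["audio"]["sampling_rate"])  # 24000
```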