---
license: cc-by-4.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- SparkAudio/Spark-TTS-0.5B
- zai-org/glm-4-voice-tokenizer
pipeline_tag: audio-to-audio
metrics:
- bleu
library_name: transformers
---

# Model Card for UniSS

## Model Details

### Model Description

UniSS is a unified single-stage speech-to-speech translation (S2ST) framework that achieves high translation fidelity and speech quality while preserving timbre, emotion, and duration consistency. UniSS currently supports English and Chinese.

### Model Sources

- **Repository:** https://github.com/cmots/UniSS
- **Paper:** https://arxiv.org/pdf/2509.21144
- **Demo:** https://cmots.github.io/uniss-demo

## Quick Start

1. Install the environment and get the code

```bash
conda create -n uniss python=3.10.16
conda activate uniss
git clone https://github.com/cmots/UniSS.git
cd UniSS
pip install -r requirements.txt

# If you are in mainland China, you can use the following mirror instead:
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```

2. Download the weights

The UniSS weights are hosted on [Hugging Face](https://huggingface.co/cmots/UniSS). You need to download the model manually, either with the provided script:

```bash
python download_weight.py
```

or via git clone (skip this if you have already downloaded the weights with the Python script):

```bash
mkdir -p pretrained_models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/cmots/UniSS pretrained_models/UniSS
```

3. Run the code

```python
import soundfile
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

from uniss import UniSSTokenizer
from uniss import process_input, process_output

# 1. Set the device, wav path, and model path
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
wav_path = "prompt_audio.wav"
model_path = "pretrained_models/UniSS"

# 2. Set the mode and target language
mode = 'Quality'  # 'Quality' or 'Performance'
tgt_lang = "<|eng|>"  # for English output
# tgt_lang = "<|cmn|>"  # for Chinese output

# 3. Load the model, text tokenizer, and speech tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
speech_tokenizer = UniSSTokenizer.from_pretrained(model_path, device=device)

# 4. Extract speech tokens
glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

# 5. Process the input
input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

# 6. Translate the speech
output = model.generate(
    input_token_ids,
    max_new_tokens=1500,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.1
)

# 7. Decode the output
output_text = tokenizer.batch_decode(output, skip_special_tokens=True)

# 8. Process the output
audio, translation, transcription = process_output(output_text[0], input_text, speech_tokenizer, mode, device)

# 9. Save and show the results
soundfile.write("output_audio.wav", audio, 16000)
if mode == 'Quality':
    print("Transcription:\n", transcription)
    print("Translation:\n", translation)
```

More examples and details are available in [our GitHub repo](https://github.com/cmots/UniSS). A minimal sketch that wraps these steps into a single helper function is included after the citation below.

## Citation

If you find our paper and code useful in your research, please consider giving it a like and a citation.
```bibtex
@misc{cheng2025uniss_s2st,
      title={UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice},
      author={Sitong Cheng and Weizhen Bian and Xinsheng Wang and Ruibin Yuan and Jianyi Chen and Shunshun Yin and Yike Guo and Wei Xue},
      year={2025},
      eprint={2509.21144},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.21144},
}
```
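
If you want to translate several files with one loaded model, the Quick Start steps above can be wrapped into a single helper. The sketch below is a minimal example assembled only from the calls shown in the Quick Start; `translate_file` and its argument list are our own convenience wrapper, not part of the UniSS API, and it assumes `model`, `tokenizer`, `speech_tokenizer`, and `device` have already been created as in step 3.

```python
import soundfile

from uniss import process_input, process_output


def translate_file(wav_path, out_path, model, tokenizer, speech_tokenizer,
                   device, mode="Quality", tgt_lang="<|eng|>"):
    # Hypothetical convenience wrapper around steps 4-9 of the Quick Start.

    # 4. Extract GLM-4-Voice and BiCodec speech tokens from the prompt audio
    glm4_tokens, bicodec_tokens = speech_tokenizer.tokenize(wav_path)

    # 5. Build the model input
    input_text = process_input(glm4_tokens, bicodec_tokens, mode, tgt_lang)
    input_token_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)

    # 6. Translate the speech
    output = model.generate(
        input_token_ids,
        max_new_tokens=1500,
        temperature=0.7,
        top_p=0.8,
        repetition_penalty=1.1,
    )

    # 7-8. Decode and post-process into audio and text
    output_text = tokenizer.batch_decode(output, skip_special_tokens=True)
    audio, translation, transcription = process_output(
        output_text[0], input_text, speech_tokenizer, mode, device
    )

    # 9. Save the 16 kHz output audio and return the text results
    soundfile.write(out_path, audio, 16000)
    return translation, transcription


# Example usage (model, tokenizer, speech_tokenizer, device loaded as in step 3):
# translation, transcription = translate_file(
#     "prompt_audio.wav", "output_audio.wav",
#     model, tokenizer, speech_tokenizer, device, tgt_lang="<|cmn|>"
# )
```

The helper simply returns whatever `process_output` produces; as in the Quick Start example, the transcription and translation text are only meant to be used in `'Quality'` mode.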