How do you use this exactly?
I've been looking for a <3B omni model, would this work for that? Does it generate audio as well, or is it only adapted to take audio as input?
Hey, yes, this works as an omni model - it takes text, video, and audio as input. To use audio input, you need to pair the model with shivamg05/smolVLA-Audio-Projector and the DyMN10-as audio encoder, so the overall flow for audio is: raw audio -> DyMN10-as -> smolVLA-Audio-Projector -> SmolVLM2-500M-Audio-Aligned. Note that the raw audio must first be preprocessed into the spectrogram format DyMN10-as expects - you can use AugmentMelSTFT, which is provided by the same git repo as DyMN10-as.

Also, the model is only adapted to take audio as input; it does not generate audio. If you have any more questions, let me know!
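In case it helps, here's a minimal sketch of that data flow in code. This is only meant to illustrate how the pieces chain together - the call signatures of the mel frontend, encoder, and projector, plus details like mono downmixing and output shapes, are assumptions, so check the DyMN10-as repo and the projector card for the actual APIs and checkpoint loading code.

```python
import torch
import torchaudio


@torch.no_grad()
def audio_to_lm_embeddings(
    wav_path: str,
    mel_frontend: torch.nn.Module,   # AugmentMelSTFT from the DyMN10-as repo (spectrogram frontend)
    audio_encoder: torch.nn.Module,  # DyMN10-as audio encoder
    projector: torch.nn.Module,      # shivamg05/smolVLA-Audio-Projector
) -> torch.Tensor:
    """Sketch of: raw audio -> spectrogram -> DyMN10-as -> projector.

    The returned embeddings are what you feed to SmolVLM2-500M-Audio-Aligned
    alongside the text/video inputs. Exact tensor shapes and call conventions
    depend on the actual checkpoints, so treat this as a data-flow sketch,
    not a drop-in implementation.
    """
    waveform, sample_rate = torchaudio.load(wav_path)   # (channels, num_samples)
    waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono (assumption)
    spec = mel_frontend(waveform)                         # waveform -> log-mel spectrogram
    audio_feats = audio_encoder(spec)                     # spectrogram -> audio feature embeddings
    return projector(audio_feats)                         # project into the LM's embedding space
```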