How do you use this exactly?

#1
by TimeLordRaps - opened

I've been looking for a <3B omni model. Would this work for that? Does it generate audio as well, or does it only accept audio as input?

Hey, yes, this works as an omni model - it takes in text, video, and audio. To feed it audio, you must pair it with shivamg05/smolVLA-Audio-Projector and the DyMN10-as audio encoder. The raw audio also has to be preprocessed into the spectrogram format that DyMN10-as expects; you can use AugmentMelSTFT for that, which is provided by the same git repo as DyMN10-as. So the overall flow for audio inputs is raw audio --> AugmentMelSTFT (spectrogram) --> DyMN10-as --> smolVLA-Audio-Projector --> SmolVLM2-500M-Audio-Aligned. The model is only adapted to take audio as input; it does not generate audio. If you have any more questions, let me know!
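To make the ordering concrete, here is a minimal sketch of how the stages chain together. It is not the real loading code: `audio_to_lm_embeddings`, `fake_mel`, `fake_encoder`, `fake_projector`, `FRAME_SIZE`, and `LM_DIM` are all hypothetical placeholders with made-up shapes, standing in for AugmentMelSTFT, DyMN10-as, and smolVLA-Audio-Projector. Only the order of the calls reflects the flow described above.

```python
from typing import Callable, List, Sequence

Frames = List[List[float]]  # one row per time step


def audio_to_lm_embeddings(
    waveform: Sequence[float],
    to_mel: Callable[[Sequence[float]], Frames],
    encode: Callable[[Frames], Frames],
    project: Callable[[Frames], Frames],
) -> Frames:
    """Chain the three audio stages in the order given in the reply."""
    mel = to_mel(waveform)      # role of AugmentMelSTFT: raw audio -> spectrogram
    features = encode(mel)      # role of DyMN10-as: spectrogram -> audio features
    return project(features)    # role of smolVLA-Audio-Projector: features -> LM space


# --- Placeholder stages (NOT the real models; all shapes are made up) ---
FRAME_SIZE = 4  # hypothetical samples per frame
LM_DIM = 3      # hypothetical language-model embedding width


def fake_mel(waveform: Sequence[float]) -> Frames:
    # Split the waveform into non-overlapping frames, dropping any remainder.
    n = len(waveform) // FRAME_SIZE
    return [list(waveform[i * FRAME_SIZE:(i + 1) * FRAME_SIZE]) for i in range(n)]


def fake_encoder(mel: Frames) -> Frames:
    # Summarize each frame as [mean, max].
    return [[sum(f) / len(f), max(f)] for f in mel]


def fake_projector(features: Frames) -> Frames:
    # Fixed linear map from 2 features to LM_DIM values per frame.
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    return [[sum(w * x for w, x in zip(row, f)) for row in weights] for f in features]


waveform = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
audio_embeds = audio_to_lm_embeddings(waveform, fake_mel, fake_encoder, fake_projector)
print(len(audio_embeds), len(audio_embeds[0]))  # frames, embedding width
```

In a real setup you would swap each placeholder for the corresponding component and pass the projected audio embeddings to SmolVLM2-500M-Audio-Aligned alongside the text and video inputs; how exactly they are merged is not covered in this thread.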
