How do you use this exactly?
I've been looking for a <3B omni model, would this work for that? Does it generate audio as well, or is it only adapted to take audio as input?
Hey, yes, this works as an omni model - it takes text, video, and audio as input. To use audio input, you need to pair the model with shivamg05/smolVLA-Audio-Projector and the DyMN10-as audio encoder, so the overall flow for audio is: raw audio -> DyMN10-as -> smolVLA-Audio-Projector -> SmolVLM2-500M-Audio-Aligned. Note that the raw audio must first be preprocessed into the spectrogram format DyMN10-as expects - you can use AugmentMelSTFT, which is provided by the same git repo as DyMN10-as.

Also, the model is only adapted to take audio as input; it does not generate audio. If you have any more questions, let me know!
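In case it helps, here's a minimal sketch of that data flow in code. This is only meant to illustrate how the pieces chain together - the call signatures of the mel frontend, encoder, and projector, plus details like mono downmixing and output shapes, are assumptions, so check the DyMN10-as repo and the projector card for the actual APIs and checkpoint loading code.

```python
import torch
import torchaudio


@torch.no_grad()
def audio_to_lm_embeddings(
    wav_path: str,
    mel_frontend: torch.nn.Module,   # AugmentMelSTFT from the DyMN10-as repo (spectrogram frontend)
    audio_encoder: torch.nn.Module,  # DyMN10-as audio encoder
    projector: torch.nn.Module,      # shivamg05/smolVLA-Audio-Projector
) -> torch.Tensor:
    """Sketch of: raw audio -> spectrogram -> DyMN10-as -> projector.

    The returned embeddings are what you feed to SmolVLM2-500M-Audio-Aligned
    alongside the text/video inputs. Exact tensor shapes and call conventions
    depend on the actual checkpoints, so treat this as a data-flow sketch,
    not a drop-in implementation.
    """
    waveform, sample_rate = torchaudio.load(wav_path)   # (channels, num_samples)
    waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono (assumption)
    spec = mel_frontend(waveform)                         # waveform -> log-mel spectrogram
    audio_feats = audio_encoder(spec)                     # spectrogram -> audio feature embeddings
    return projector(audio_feats)                         # project into the LM's embedding space
```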