Fails when I try to run it on Open WebUI with an Ollama backend
Here is the log:
llama_model_load: vocab only - skipping tensors
time=2025-10-14T16:03:20.246Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-14T16:03:20.247Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-9e7232c93d12498a74a19826c367681314d86d5e468dacc6afe24bc6adbd7754 --port 36647"
time=2025-10-14T16:03:20.247Z level=INFO source=server.go:505 msg="system memory" total="31.3 GiB" free="28.1 GiB" free_swap="166.9 MiB"
time=2025-10-14T16:03:20.248Z level=INFO source=server.go:512 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-10-14T16:03:20.255Z level=INFO source=runner.go:864 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes, ID: GPU-c6d9f479-6f10-6431-34ac-33af8732e7aa
The following devices will have suboptimal performance due to a lack of tensor cores:
Device 0: NVIDIA GeForce GTX 1660 SUPER
Consider compiling with CMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual and DGGML_CUDA_FORCE_MMQ to force the use of the Pascal code for Turing.
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-10-14T16:03:20.285Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-14T16:03:20.285Z level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:36647"
time=2025-10-14T16:03:25.352Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.104590158 runner.size="4.5 GiB" runner.vram="4.5 GiB" runner.parallel=1 runner.pid=217845 runner.model=/root/.ollama/models/blobs/sha256-a29e9d0911c8a65726beaf9934899fd98e64d97e5330629fc079b1d822455298
time=2025-10-14T16:03:25.575Z level=INFO source=server.go:505 msg="system memory" total="31.3 GiB" free="27.9 GiB" free_swap="166.9 MiB"
time=2025-10-14T16:03:25.575Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/root/.ollama/models/blobs/sha256-9e7232c93d12498a74a19826c367681314d86d5e468dacc6afe24bc6adbd7754 library=CUDA parallel=1 required="3.2 GiB" gpus=1
time=2025-10-14T16:03:25.576Z level=INFO source=server.go:545 msg=offload library=CUDA layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.2 GiB" memory.required.partial="3.2 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.2 GiB]" memory.weights.total="1.5 GiB" memory.weights.repeating="1.2 GiB" memory.weights.nonrepeating="315.3 MiB" memory.graph.full="298.7 MiB" memory.graph.partial="298.7 MiB"
time=2025-10-14T16:03:25.576Z level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:8 GPULayers:29[ID:GPU-c6d9f479-6f10-6431-34ac-33af8732e7aa Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2025-10-14T16:03:25.576Z level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) (0000:01:00.0) - 0 MiB free
I'm no LLM expert, but it looks like it errors out at "llama_model_load_from_file_impl: using device CUDA0 ... - 0 MiB free". That can't be right, as I run an i9-9900K with 32 GB RAM, and the 1660 SUPER is a 6 GB VRAM card. Regular Granite 4 micro, tiny, and small all work on my system.
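For reference, here's a minimal sketch of how the actual free VRAM could be double-checked from Python, assuming the nvidia-ml-py package (which provides the pynvml module) is installed:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # Device 0 = the GTX 1660 SUPER here
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"free: {info.free / 2**20:.0f} MiB / total: {info.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
```

If this reports several GiB free while the runner still logs "0 MiB free", the problem is unlikely to be actual VRAM pressure.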
Hey! It's fixed now; I had to requantize and reupload the GGUFs, as the architecture was changed from granitehybrid to granite in a recent llama.cpp release; sorry!
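For anyone who wants to check which architecture name a given GGUF reports, here's a minimal sketch using the gguf Python package; the path is a placeholder, and the way the string value is pulled out of the field mirrors what gguf_dump.py does (an assumption about the reader's internal layout):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/path/to/model.gguf")      # placeholder path
field = reader.fields["general.architecture"]
# String values live in the field's last part (assumption about GGUFReader internals).
print(bytes(field.parts[-1]).decode("utf-8"))   # "granite" on the new quants, "granitehybrid" on the old ones
```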
Yup - all good now. A suggestion: add a Q6 version as well, for the sake of completeness. That said, I can fit Q8 even on my 6 GB VRAM card with an 8k context window.
Done, my friend! Thanks for the suggestion.
One more note - I can't go higher than 8k (8192). But this is a Granite issue; the original model also errors out on me with 16k (16384).
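In case it helps anyone reproduce this, a minimal sketch of setting the context window per request through Ollama's REST API; the model tag is hypothetical, substitute whatever the quant is tagged as locally:

```python
import requests  # assumes Ollama is listening on the default port 11434

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "granite-4-micro-h:q8_0",  # hypothetical tag, use your local one
        "prompt": "Hello",
        "stream": False,
        "options": {"num_ctx": 8192},       # 8192 works for me; 16384 errors out
    },
    timeout=600,
)
print(resp.json()["response"])
```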
I will try to check it out tomorrow, as I didn't have much time to test beyond 4096, tbh.
For me, it's not that important, as I usually use these types of models for single-shot questions or at most a few follow-ups. It's mostly just interesting to me - why does it happen?
I did move to llama.cpp, and a 16k context window works there.
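For completeness, a minimal sketch of the equivalent setup through the llama-cpp-python bindings, with the 16k context that worked for me; the model path and filename are just placeholders:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

llm = Llama(
    model_path="/path/to/granite-4-micro-h-Q8_0.gguf",  # placeholder filename
    n_ctx=16384,        # the 16k context window that works under llama.cpp
    n_gpu_layers=-1,    # offload every layer to the GPU
    flash_attn=True,
)
out = llm("Single-shot question goes here", max_tokens=256)
print(out["choices"][0]["text"])
```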