Fails when I try to run it on Open WebUI with an Ollama backend
Here is the log:
llama_model_load: vocab only - skipping tensors
time=2025-10-14T16:03:20.246Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-14T16:03:20.247Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-9e7232c93d12498a74a19826c367681314d86d5e468dacc6afe24bc6adbd7754 --port 36647"
time=2025-10-14T16:03:20.247Z level=INFO source=server.go:505 msg="system memory" total="31.3 GiB" free="28.1 GiB" free_swap="166.9 MiB"
time=2025-10-14T16:03:20.248Z level=INFO source=server.go:512 msg="model requires more memory than is currently available, evicting a model to make space" estimate.library="" estimate.layers.requested=0 estimate.layers.model=0 estimate.layers.offload=0 estimate.layers.split=[] estimate.memory.available=[] estimate.memory.gpu_overhead="0 B" estimate.memory.required.full="0 B" estimate.memory.required.partial="0 B" estimate.memory.required.kv="0 B" estimate.memory.required.allocations=[] estimate.memory.weights.total="0 B" estimate.memory.weights.repeating="0 B" estimate.memory.weights.nonrepeating="0 B" estimate.memory.graph.full="0 B" estimate.memory.graph.partial="0 B"
time=2025-10-14T16:03:20.255Z level=INFO source=runner.go:864 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes, ID: GPU-c6d9f479-6f10-6431-34ac-33af8732e7aa
The following devices will have suboptimal performance due to a lack of tensor cores:
Device 0: NVIDIA GeForce GTX 1660 SUPER
Consider compiling with CMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual and DGGML_CUDA_FORCE_MMQ to force the use of the Pascal code for Turing.
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-10-14T16:03:20.285Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-14T16:03:20.285Z level=INFO source=runner.go:900 msg="Server listening on 127.0.0.1:36647"
time=2025-10-14T16:03:25.352Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.104590158 runner.size="4.5 GiB" runner.vram="4.5 GiB" runner.parallel=1 runner.pid=217845 runner.model=/root/.ollama/models/blobs/sha256-a29e9d0911c8a65726beaf9934899fd98e64d97e5330629fc079b1d822455298
time=2025-10-14T16:03:25.575Z level=INFO source=server.go:505 msg="system memory" total="31.3 GiB" free="27.9 GiB" free_swap="166.9 MiB"
time=2025-10-14T16:03:25.575Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/root/.ollama/models/blobs/sha256-9e7232c93d12498a74a19826c367681314d86d5e468dacc6afe24bc6adbd7754 library=CUDA parallel=1 required="3.2 GiB" gpus=1
time=2025-10-14T16:03:25.576Z level=INFO source=server.go:545 msg=offload library=CUDA layers.requested=-1 layers.model=29 layers.offload=29 layers.split=[29] memory.available="[5.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.2 GiB" memory.required.partial="3.2 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.2 GiB]" memory.weights.total="1.5 GiB" memory.weights.repeating="1.2 GiB" memory.weights.nonrepeating="315.3 MiB" memory.graph.full="298.7 MiB" memory.graph.partial="298.7 MiB"
time=2025-10-14T16:03:25.576Z level=INFO source=runner.go:799 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:8 GPULayers:29[ID:GPU-c6d9f479-6f10-6431-34ac-33af8732e7aa Layers:29(0..28)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
time=2025-10-14T16:03:25.576Z level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1660 SUPER) (0000:01:00.0) - 0 MiB free
I'm no LLM expert, but it looks like it errors out at "llama_model_load_from_file_impl: using device CUDA0 ... - 0 MiB free". That can't be right, as I run an i9-9900K with 32 GB RAM, and the 1660 SUPER is a 6 GB VRAM card. Regular Granite 4 micro, tiny, and small all work on my system.
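For reference, here's a minimal sketch of how the actual free VRAM could be double-checked from Python, assuming the nvidia-ml-py package (which provides the pynvml module) is installed:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # Device 0 = the GTX 1660 SUPER here
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"free: {info.free / 2**20:.0f} MiB / total: {info.total / 2**20:.0f} MiB")
pynvml.nvmlShutdown()
```

If this reports several GiB free while the runner still logs "0 MiB free", the problem is unlikely to be actual VRAM pressure.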
Hey! It's fixed now; I had to requantize and reupload the GGUFs, as the architecture was changed from granitehybrid to granite in a recent llama.cpp release; sorry!
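For anyone who wants to check which architecture name a given GGUF reports, here's a minimal sketch using the gguf Python package; the path is a placeholder, and the way the string value is pulled out of the field mirrors what gguf_dump.py does (an assumption about the reader's internal layout):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/path/to/model.gguf")      # placeholder path
field = reader.fields["general.architecture"]
# String values live in the field's last part (assumption about GGUFReader internals).
print(bytes(field.parts[-1]).decode("utf-8"))   # "granite" on the new quants, "granitehybrid" on the old ones
```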
Yup - all good now. A suggestion: add a Q6 version as well, for the sake of completeness. That said, I can fit Q8 even on my 6 GB VRAM card with an 8k context window.
Done, my friend! Thanks for the suggestion.
One more note - I can't go higher than 8k (8192). But this is a Granite issue; the original model also errors out on me with 16k (16384).
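In case it helps anyone reproduce this, a minimal sketch of setting the context window per request through Ollama's REST API; the model tag is hypothetical, substitute whatever the quant is tagged as locally:

```python
import requests  # assumes Ollama is listening on the default port 11434

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "granite-4-micro-h:q8_0",  # hypothetical tag, use your local one
        "prompt": "Hello",
        "stream": False,
        "options": {"num_ctx": 8192},       # 8192 works for me; 16384 errors out
    },
    timeout=600,
)
print(resp.json()["response"])
```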
I will try to check it out tomorrow, as I didn't have much time to test beyond 4096, tbh.
For me, it's not that important, as I usually use these types of models for single-shot questions or at most a few follow-ups. It's mostly just interesting to me - why does it happen?
I did move to llama.cpp, and a 16k context window works there.
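For completeness, a minimal sketch of the equivalent setup through the llama-cpp-python bindings, with the 16k context that worked for me; the model path and filename are just placeholders:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

llm = Llama(
    model_path="/path/to/granite-4-micro-h-Q8_0.gguf",  # placeholder filename
    n_ctx=16384,        # the 16k context window that works under llama.cpp
    n_gpu_layers=-1,    # offload every layer to the GPU
    flash_attn=True,
)
out = llm("Single-shot question goes here", max_tokens=256)
print(out["choices"][0]["text"])
```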