Wait, the Q4 quantization which is more than 20GB fits in your 16GB GPU ? I didn't know that was possible, I was always restricting myself to smaller model than the VRAM I had
llama.cpp is designed for partial offloading, the most important part of the model will be loaded into the GPU and the rest on system ram. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having that much GPU vram.
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
I see the change in kld values is pretty modest vs prior version. Does anyone know how that translates to real world? Is more of a linear type situation or exponential etc
With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
I see the change in kld values is pretty modest vs prior version. Does anyone know how that translates to real world? Is more of a linear type situation or exponential etc