Unsloth Dynamic 2.0 GGUFs

(unsloth.ai)

24 points | by tosh 2 hours ago

6 comments

Maxious 1 hour ago
ICYMI, Unsloth has had some major breakthroughs today with the Qwen3.5 local models: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
With the Qwen3.5 35B A3B at Q4, I've got 200k context running at 62.98 tokens per second on a local RTX 5080 16GB.
- Kayou 1 hour ago
  Wait, the Q4 quantization, which is more than 20GB, fits on your 16GB GPU? I didn't know that was possible. I was always restricting myself to models smaller than the VRAM I had.
  - segmondy 1 hour ago
    llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU and the rest into system RAM. I run 500B+ models such as DeepSeek/Kimi K2.5/GLM-5 without having that much GPU VRAM.
- jychang 1 hour ago
  Not really breakthroughs, more like bugfixes for their broken first batch.
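For anyone wondering how a 20GB+ file can run on a 16GB card, the partial offloading described above can be sketched as a back-of-the-envelope estimate. This is only a rough illustration, not how llama.cpp decides anything: it assumes all layers are the same size, and the layer count (40) and the 4 GiB reserve for KV cache and buffers are made-up numbers. The helper name `layers_that_fit` is hypothetical. The result is the kind of value you would pass to llama.cpp's `-ngl` / `--n-gpu-layers` option.

```python
GIB = 2**30

def layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes):
    """Estimate how many layers fit on the GPU (hypothetical helper).

    Assumes every layer is the same size, which is a simplification;
    real models (MoE models especially) have unevenly sized tensors.
    """
    per_layer = model_bytes / n_layers
    # Keep some VRAM free for the KV cache and compute buffers.
    usable = max(vram_bytes - reserve_bytes, 0)
    return min(n_layers, int(usable // per_layer))

# Illustrative numbers only: a 20 GiB quant with an assumed 40 layers,
# a 16 GiB GPU, and an assumed 4 GiB reserve.
gpu_layers = layers_that_fit(20 * GIB, 40, 16 * GIB, 4 * GIB)
print(f"offload about {gpu_layers} of 40 layers to the GPU")
```

The remaining layers stay in system RAM and run on the CPU, which is why generation speed depends heavily on which parts end up on the GPU.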
jychang 54 minutes ago
What's up with this post? It's a link to something which has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
- tosh 51 minutes ago
  Unsloth have just released benchmarks on how their dynamic quants perform for Qwen 3.5
  https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
Havoc 1 hour ago
Advances in this space are always welcome.
I see the change in KLD values is pretty modest vs. the prior version. Does anyone know how that translates to real-world quality? Is it more of a linear relationship, or exponential, etc.?
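For context on what a KLD number measures, here is a minimal sketch of KL divergence between a reference model's next-token distribution and a quantized model's, at a single token position. The distributions below are invented toy values, and the function name `kl_divergence` is just for illustration. Benchmarks typically report a statistic (such as the mean) of this quantity over many positions, so the sketch shows the formula only and says nothing about how KLD maps to downstream task quality.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    p: reference (e.g. full-precision) probabilities
    q: approximating (e.g. quantized) probabilities
    Terms with p_i == 0 contribute nothing; q_i is clamped to eps
    to avoid dividing by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary (made up).
reference = [0.70, 0.20, 0.10]
quantized = [0.65, 0.25, 0.10]

print(kl_divergence(reference, quantized))
```

Because each term is weighted by the reference probability and depends on a log ratio, small shifts on high-probability tokens produce small KLD values, while a quant that flips the top token can move the number a lot.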
aichen_dev 1 hour ago
[dead]
MarcLore 1 hour ago
[dead]
shablulman 1 hour ago
[dead]