Unsloth Dynamic 2.0 GGUFs

(unsloth.ai)

24 points | by tosh 2 hours ago

6 comments

  • Maxious 1 hour ago
    ICYMI, Unsloth has had some major breakthroughs today with the Qwen3.5 local models https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks

    With the Qwen3.5 35B A3B at Q4 I've got 200k context running at 62.98 tokens per second on a local RTX5080 16GB.

    • Kayou 1 hour ago
      Wait, the Q4 quantization, which is more than 20GB, fits in your 16GB GPU? I didn't know that was possible; I was always restricting myself to models smaller than the VRAM I had.
      • segmondy 1 hour ago
        llama.cpp is designed for partial offloading: the most important parts of the model are loaded onto the GPU, and the rest stays in system RAM. I run 500B+ models such as DeepSeek/KimiK2.5/GLM-5 without having anywhere near that much GPU VRAM.
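        A back-of-the-envelope sketch of the split described above (all numbers illustrative, and `layers_on_gpu` is a hypothetical helper, not part of llama.cpp): llama.cpp's `-ngl` / `--n-gpu-layers` flag controls how many transformer layers are placed in VRAM, with the remainder served from system RAM, so you can estimate a workable value by dividing your spare VRAM by a rough per-layer size.

        ```python
        def layers_on_gpu(vram_gb: float, n_layers: int, model_gb: float,
                          overhead_gb: float = 1.5) -> int:
            """Estimate how many layers fit in VRAM (illustrative, not llama.cpp's logic)."""
            # Crude per-layer size, assuming weights are spread evenly across layers.
            per_layer = model_gb / n_layers
            # Reserve some VRAM for the KV cache and compute buffers (rough guess).
            budget = vram_gb - overhead_gb
            return max(0, min(n_layers, int(budget / per_layer)))

        # Illustrative numbers only: a ~20 GB Q4 model with 48 layers on a 16 GB card.
        print(layers_on_gpu(16, 48, 20))
        ```

        The leftover layers then run on the CPU from system RAM, which is why a 20GB file can still be usable on a 16GB card, just with reduced speed on the CPU-resident portion.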
    • jychang 1 hour ago
      Not really breakthroughs, more like bugfixes for their broken first batch.
  • jychang 54 minutes ago
    What's up with this post? It's a link to something that has existed for a long time, and there's a bunch of dead comments below. Some weird SEO campaign thing?
  • Havoc 1 hour ago
    Advances in this space are always welcome.

    I see the change in KLD values is pretty modest vs the prior version. Does anyone know how that translates to the real world? Is it more of a linear situation, or exponential, etc.?
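    For reference, the KLD figures in quant benchmarks measure the KL divergence between the full-precision model's next-token distribution and the quantized model's, averaged over a corpus. A minimal sketch of the per-token quantity (toy distributions, not real benchmark data):

    ```python
    import math

    def kld(p, q):
        """KL divergence D(p || q) in nats over one next-token distribution.

        p: token probabilities from the full-precision model,
        q: token probabilities from the quantized model.
        """
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Toy example: a quant that slightly perturbs the next-token distribution.
    p = [0.70, 0.20, 0.10]
    q = [0.65, 0.23, 0.12]
    print(kld(p, q))  # small positive value; 0 means identical distributions
    ```

    Since KLD is measured per token and errors can compound over a long generation, a small per-token difference does not translate linearly into end-task quality, which is presumably why the question is hard to answer in general.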

  • aichen_dev 1 hour ago
    [dead]
  • MarcLore 1 hour ago
    [dead]
  • shablulman 1 hour ago
    [dead]