I really appreciate this type of article. I feel like a lot of knowledge about LLM training and inference is locked inside the heads of practitioners, much like compiler engineering knowledge used to be.
To work in LLM training/inference you’re expected to know this stuff but to know this stuff you need to be working in the space.
Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices as witnessed in this post. But, if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, no matter the context size, or the model size. So then you have two things to worry about:
First, how do you know what the optimal VRAM assignment per model and per context size is? That currently seems to be based purely on experience. Second, how do you make sure that only that amount is available to your infra/containers? That part is being handled by DRA and stuff like https://project-hami.io
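For the first question, a rough lower bound can at least be computed rather than guessed. The sketch below assumes a standard transformer KV cache (one K and one V tensor per layer) stored in fp16. The model shape in the example is hypothetical, and the estimate ignores activations, framework overhead, and fragmentation, so treat it as a starting point rather than what vLLM or any other server will actually allocate.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # Factor 2: each layer keeps one K tensor and one V tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


def weight_bytes(param_count, bytes_per_param=2):
    """Approximate size of the model weights in bytes (fp16 by default)."""
    return param_count * bytes_per_param


GIB = 2 ** 30

# Hypothetical model shape, chosen only for illustration.
layers, kv_heads, head_dim = 32, 8, 128
params = 8_000_000_000

for ctx in (2048, 8192, 32768):
    kv = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
    total = kv + weight_bytes(params)
    print(f"context={ctx:>6}: KV cache {kv / GIB:.2f} GiB, "
          f"weights + KV {total / GIB:.2f} GiB")
```

With these assumed numbers the cache costs 128 KiB per token, so an 8192-token context needs 1 GiB of KV cache per sequence on top of the weights.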
This is only tangentially related to the blog post, but the title is picked in such a way that I couldn't help but put the shameless plug here. When he wrote "popping the bubble", I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.
Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.
> you find that the GPU often sits idle, not for lack of work, but because the CPU hasn't told it what to do next yet. This phenomenon is called a GPU bubble.
This is true, but I've never heard anyone refer to this as a GPU bubble before.
I think most people hear "GPU bubble" and think of a financial bubble of some kind.
Very odd, but perhaps more familiar to graphics programmers? I will say I'd probably call it a stall, which is exactly what the Vulkan docs call it moments later, so :shrug:
I feel like bubble is what this is commonly called in GPU programming circles (e.g. https://github.com/sgl-project/sglang/issues/5593 or any number of other issues). Didn't occur to me that it would be confusing to be honest. But yes stall is maybe a better word.
it's not stalled, as that would imply that it waits for something, which is not necessarily the case with bubbles. most often it shows lack of proper pipelining or wrong pipeline dependencies (pipeline A waits for pipeline B, pipeline C waits for pipeline B, while pipeline B waits for an event X, now you've just made all three pipelines stalled on event X - not good).
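To put numbers on the bubble being discussed, here is a toy timeline model. It is only a sketch under simplifying assumptions: every step has a fixed CPU scheduling cost and a fixed GPU execution cost, and the function name and the millisecond figures are made up for illustration. It compares the serial case (the GPU waits while the CPU prepares each step) with a two-stage pipeline where the CPU prepares step i+1 while the GPU runs step i.

```python
def gpu_idle_fraction(cpu_ms, gpu_ms, steps, overlap):
    """Fraction of wall-clock time the GPU spends idle in a toy model."""
    if overlap:
        # Two-stage pipeline: only the first CPU preparation is exposed,
        # after that each step costs the slower of the two stages.
        total = cpu_ms + gpu_ms + (steps - 1) * max(cpu_ms, gpu_ms)
    else:
        # Serial: the GPU sits idle during every CPU scheduling phase.
        total = steps * (cpu_ms + gpu_ms)
    busy = steps * gpu_ms
    return 1 - busy / total


# Hypothetical costs: 2 ms of CPU scheduling, 8 ms of GPU work per step.
serial = gpu_idle_fraction(cpu_ms=2, gpu_ms=8, steps=100, overlap=False)
piped = gpu_idle_fraction(cpu_ms=2, gpu_ms=8, steps=100, overlap=True)
print(f"serial:    GPU idle {serial:.1%}")
print(f"pipelined: GPU idle {piped:.1%}")
```

In this model the serial case leaves the GPU idle for a fifth of the run, while the pipelined case only exposes the very first scheduling step. If the CPU cost exceeds the GPU cost, the bubble comes back even with overlap, because the pipeline is then bound by the CPU stage.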
This appears to be different than the recent "Speculative Pipeline Decoding" paper: https://arxiv.org/abs/2605.30852
any time your GPU is idle = you are losing $$$ = your TCO is going up. you don't want that.
Better term, anyone?