I don’t understand why this is a useful effort. It seems like a solution in search of a problem. It’s going to be incredibly easy to end up with hopelessly inefficient programs that need a full redesign in a normal GPU programming model to be useful.
1. Programming GPUs is a problem. The ratio of CPUs to CPU programmers and GPUs to GPU programmers is massively out of whack. Not because GPU programming is less valuable or lucrative, but because GPUs are weird and the tools are weird.
2. We are more interested in leveraging existing libraries than running existing binaries wholesale (mostly within a warp). But, running GPU-unaware code leaves a lot of space for the compiler to move stuff around and optimize things.
3. The compiler changes are not our product, the GPU apps we are building with them are. So it is in our interest to make the apps very fast.
Anyway, skepticism is understandable and we are well aware code wins arguments.
It depends. At VectorWare we are a bit of an extreme case in that we are inverting the relationship and making the GPU the main loop that calls out to the CPU sparingly. So in that model, yes. If your code is run in a more traditional model (CPU driving and using the GPU as a coprocessor), probably not. Going across the bus dominates most workloads. That being said, the traditional wisdom is becoming less relevant as integrated memory is popping up everywhere and tech like GPUDirect exists with the right datacenter hardware.
These are the details we intend to insulate people from so they can just write code and have it run fast. There is a reason why abstractions were invented on the CPU and we think we are at that point for the GPU.
(For the datacenter folks: I know hardware topology has a HUGE impact that software cannot overcome on its own in many situations.)
This is the fault of NVIDIA, who, instead of using the terms that had been used for decades in computer science before them for things like vector lanes, processor threads, processor cores etc., have invented a new jargon by replacing each old word with a new word, in order to obfuscate how their GPUs really work.
Unfortunately, ATI/AMD has imitated slavishly many things initiated by NVIDIA, so soon after that they have created their own jargon, by replacing every word used by NVIDIA with a different word, also different from the traditional word, enhancing the confusion. The worst is that the NVIDIA jargon and the AMD jargon sometimes reuse traditional terms by giving them different meanings, e.g. an NVIDIA thread is not what a "thread" normally means.
Later standards, like OpenCL, have attempted to make a compromise between the GPU vendor jargons, instead of going back to a more traditional terminology, so they have only increased the number of possible confusions.
So to be able to understand GPUs, you must create a dictionary with word equivalences: traditional => NVIDIA => ATI/AMD (e.g. IBM 1964 task = Vyssotsky 1966 thread => NVIDIA warp => AMD wavefront).
All the names for waves come from different hardware and software vendors adopting names for the same or similar concept.
- Wavefront: AMD, comes from their hardware naming
- Warp: Nvidia, comes from their hardware naming for largely the same concept
Both of these were implementation details until Microsoft and Khronos enshrined them in the shader programming model, independent of the hardware implementation, so you get
- Subgroup: Khronos' name for the abstract model that maps to the hardware
- Wave: Microsoft's name for the same
They all describe mostly the same thing so they all get used and you get the naming mess. Doesn't help that you'll have the API spec use wave/subgroup, but the vendor profilers will use warp/wavefront in the names of their hardware counters.
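For anyone who wants the dictionary written down, here is a minimal sketch of the four names listed above as a lookup. The `Namer` enum and `subgroup_term` function are made-up names for illustration, not part of any vendor API, and the mapping only covers what this thread mentions.

```rust
/// Who is doing the naming. Variants follow the list in this thread;
/// the type itself is hypothetical and exists only for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Namer {
    Amd,
    Nvidia,
    Khronos,
    Microsoft,
}

/// Returns the name each party uses for (mostly) the same execution unit.
pub fn subgroup_term(namer: Namer) -> &'static str {
    match namer {
        Namer::Amd => "wavefront",
        Namer::Nvidia => "warp",
        Namer::Khronos => "subgroup",
        Namer::Microsoft => "wave",
    }
}
```

So an API spec that says "subgroup" and a profiler counter that says "warp" are, for most purposes, talking about the same thing.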
Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow; in fact they're quite a bit faster than any single GPU thread. If code is written in a GPU-unaware way, it's not going to take advantage of the reasons for being on the GPU in the first place.
We have this issue in GFQL right now. We wrote the first OSS GPU Cypher query language implementation, where we make a query plan of GPU-friendly collective operations... But today the steps are coordinated via Python, which has high constant overheads.
We are looking to shed some of the Python<->C++<->GPU overheads by pushing macro steps out of Python and into C++. However, it'd probably be way better to skip all the CPU<->GPU back-and-forth by coordinating the task queue on the GPU to begin with. It's 2026, so ideally we can use modern tools and type safety for this.
Note: I looked at the company's GitHub and didn't see any relevant OSS, which changes the calculus for a team like ours. Sustainable infra is hard!
This programming model seems like the wrong one, and I think it's based on some faulty assumptions.
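To make the overhead argument concrete, here is a back-of-the-envelope cost model. It is only a sketch: the function names and parameters are hypothetical, the units are arbitrary (say microseconds), and it optimistically assumes that coordinating on the device adds no per-step cost at all, which a real task queue would not achieve.

```rust
/// Host-coordinated plan: every step pays the host<->device round trip
/// (interpreter, launch, and bus overhead lumped into one number).
pub fn host_coordinated_cost(steps: u64, per_step_overhead: u64, per_step_compute: u64) -> u64 {
    steps * (per_step_overhead + per_step_compute)
}

/// Device-coordinated plan: the overhead is paid once up front, then the
/// steps run back to back (assumption: zero per-step coordination cost).
pub fn device_coordinated_cost(steps: u64, one_time_overhead: u64, per_step_compute: u64) -> u64 {
    one_time_overhead + steps * per_step_compute
}
```

With 10 steps, an overhead of 5, and 2 units of compute per step, the first model gives 70 and the second 25. The gap grows with the number of steps, which is the case where a query plan made of many small operations hurts most.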
>Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur
This is extremely problematic - being able to write divergent code between lanes is good. Virtually all high performance GPGPU code I've ever written contains divergent code paths!
>The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution
This is where I think it falls apart a bit, and we need to dig into GPU architecture to find out why. A lot of people think that GPUs are a bunch of executing threads, that are grouped into warps that execute in lockstep. This is a very overly restrictive model of how they work, that misses a lot of the reality
GPUs are a collection of threads that are broken up into local work groups. These share local memory (what CUDA calls shared memory), which can be used for fast intra-work-group communication. Work groups are split up into subgroups - which map to warps - that can communicate extra fast
This is the first problem with this model: it neglects the local work group as an execution unit. To get adequate performance, you have to set the work group size much higher than the size of a warp, at least 64 for a 32-wide warp. In general though, 128-256 is a better size. Different warps in a local work group make truly independent progress, so if you don't take this into account in Rust, it's a bad time and you'll run into races. To get good performance and cache management, these warps need to be executing the same code. Trying to have a task-per-warp is a really bad move for performance
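The arithmetic behind those sizes is simple enough to spell out. This is a sketch with a hypothetical helper name, not a real API; it only counts how many warps a given work group size occupies.

```rust
/// Number of warps needed to hold `work_group_size` threads,
/// rounding up when the size is not a multiple of the warp width.
pub fn warps_per_work_group(work_group_size: u32, warp_size: u32) -> u32 {
    assert!(warp_size > 0, "warp size must be non-zero");
    (work_group_size + warp_size - 1) / warp_size
}
```

For a 32-wide warp, a work group of 64 is 2 warps, 128 is 4, and 256 is 8. A task-per-warp design corresponds to the degenerate case of 1 warp per group.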
>Each warp has its own program counter, its own register file, and can execute independently from other warps
The second problem: it used to be true that all threads in a warp executed in lockstep, with strict on/off masks for thread divergence, but this is no longer the case on modern GPUs, so the quote above is just wrong. On a modern GPU, each *thread* has its own program counter and call stack, and can independently make forward progress. Divergent threads can have better throughput than you'd expect on a modern GPU, as the hardware gets more capable at handling this. Divergence isn't bad, it's just something you have to manage - and hardware architectures are rapidly improving here
Say we have two warps, both running the same code, where half of each warp splits at a divergence point. Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non-divergent warps, and bam, divergence solved at the hardware level. But notice that to get this hardware acceleration, we need to actually use the GPU programming model to its fullest
The key mistake is to assume that the current warp model is always going to stick rigidly to being strictly wide SIMD units with a funny programming model, but we already ditched that concept a while back on GPUs, around the Pascal-to-Volta transition. As time goes on this model will only increasingly diverge from how GPUs actually work under the hood, which seems like an error. Right now, even with just the local work group problems, I'd guess you're leaving ~50% of your performance on the table, which seems like a bit of a problem when the entire reason to use a GPU is performance!
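For reference, the older lockstep-with-masks picture can be written as a toy cost model. This is a sketch of that simplified model only, not of how current hardware schedules threads; `lockstep_cost` and its lane encoding are assumptions made up for illustration.

```rust
/// Each lane is `None` (idle), `Some(true)` (takes branch A) or
/// `Some(false)` (takes branch B). In the strict lockstep model the warp
/// runs every branch that at least one active lane takes, one after the
/// other, with the other lanes masked off.
pub fn lockstep_cost(lanes: &[Option<bool>], cost_a: u32, cost_b: u32) -> u32 {
    let any_a = lanes.iter().any(|lane| *lane == Some(true));
    let any_b = lanes.iter().any(|lane| *lane == Some(false));
    let a = if any_a { cost_a } else { 0 };
    let b = if any_b { cost_b } else { 0 };
    a + b
}
```

Under this model, a warp with one active lane and 31 idle ones costs the same as a fully converged warp, while a half-and-half split pays for both branches. That is the reasoning behind the "idle is better than divergent" claim, and it only holds as long as the lockstep assumption does.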
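The regrouping idea in the two-warp example can be sketched as a toy on the CPU. Whether and how a given GPU actually does this is the claim being made here, not something this code demonstrates; `regroup` and `is_divergent` are hypothetical helpers, and each thread is reduced to a single boolean for which branch it takes.

```rust
/// A warp is divergent if its threads do not all take the same branch.
pub fn is_divergent(warp: &[bool]) -> bool {
    warp.windows(2).any(|pair| pair[0] != pair[1])
}

/// Toy regrouping: gather threads by the branch they take, then cut the
/// result back into warps of `warp_size` threads.
pub fn regroup(branches: &[bool], warp_size: usize) -> Vec<Vec<bool>> {
    assert!(warp_size > 0, "warp size must be non-zero");
    let mut sorted = branches.to_vec();
    sorted.sort(); // `false` sorts before `true`, so branches become contiguous
    sorted.chunks(warp_size).map(|chunk| chunk.to_vec()).collect()
}
```

Two 4-wide warps that each split half and half come out as one warp entirely on one branch and one entirely on the other. The catch is that this only works when several warps run the same code, which a one-task-per-warp model rules out.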
Why is the terminology also so all over the place? Subgroups, wavefronts, warps, etc. all refer to the same concept. That doesn't help.
Besides, full redesign isn't so expensive these days (depending).
>It seems like a solution in search of a problem.
Agreed, but it'll be interesting to see how it plays out.