Chen, Z. (CSE) – GPU Subgroup Semantics for Portable High-Performance Kernels
Modern high-performance GPU kernels increasingly rely on execution at the subgroup level, which includes communication within a subgroup, subgroup operations, and matrix operations. These features are essential for workloads such as matrix multiplication and FlashAttention, but […]