Round pegs, square holes: Why GPGPUs are an architectural mismatch for modern LLMs

The saying “round pegs do not fit square holes” persists because it captures a deep engineering reality: inefficiency most often arises not from flawed components, but from misalignment between a system’s assumptions and the problem it is asked to solve. A square hole is not poorly made; it’s simply optimized for square pegs.

Modern large language models (LLMs) now find themselves in exactly this situation. Although they are overwhelmingly executed on general-purpose graphics processing units (GPGPUs), those processors were never shaped around the enormous, regular matrix multiplications that dominate inference.

GPUs dominate not because they are a perfect match, but because they were already available, massively parallel, and economically scalable when deep learning took off, particularly for training AI models.

What follows is not an indictment of GPUs, but a careful explanation of why they are extraordinarily effective when the workload is dynamic and unpredictable, as in graphics processing, and disappointingly inefficient when the workload is essentially regular and predictable, as in AI/LLM inference.

The inefficiencies that emerge are not accidental; they are structural, predictable, and increasingly expensive as models continue to evolve.

Execution geometry and the meaning of “square”

When a GPU renders a graphics scene, it deals with a workload that is highly irregular at the macro level but quite regular at the micro level. A scene changes in real time with significant variations in content, such as shifting triangle counts and illumination, yet within any single image there is usually a great deal of local regularity.

One frame displays a simple brick wall; the next, an explosion that creates thousands of tiny triangles and complex lighting changes. To handle this, the GPU architecture relies on a single-instruction, multiple-thread (SIMT) or wave/warp-based approach in which all threads in a “wave” or “warp,” usually between 16 and 128 of them, receive the same instruction at once.

This works rather efficiently for graphics because, while the whole scene is a mess, local patches of pixels are usually doing the same thing. This allows the GPU to be a “micro-manager,” constantly and dynamically scheduling these tiny waves to react to the scene’s chaos.
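
To make the lockstep behavior concrete, here is a toy model (a Python sketch with an assumed 32-lane warp and made-up instruction costs, not any vendor’s actual scheduler) that counts how many issued lane-cycles do useful work when a warp’s threads agree versus diverge.

WARP_SIZE = 32  # assumed warp width; real hardware uses 16 to 128 lanes

def warp_cycles(branch_taken, then_cost, else_cost):
    """Count lane-cycles issued vs. lane-cycles doing useful work."""
    takers = sum(branch_taken)
    others = len(branch_taken) - takers
    issued = useful = 0
    if takers:                      # whole warp steps through the 'then' path
        issued += WARP_SIZE * then_cost
        useful += takers * then_cost
    if others:                      # ...then through the 'else' path
        issued += WARP_SIZE * else_cost
        useful += others * else_cost
    return issued, useful

# A graphics-like patch: all 32 pixels take the same branch -> no waste.
uniform = [True] * WARP_SIZE
# An irregular patch: half the lanes branch one way, half the other.
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]

for name, mask in [("uniform", uniform), ("divergent", divergent)]:
    issued, useful = warp_cycles(mask, then_cost=10, else_cost=10)
    print(f"{name:9s}: utilization = {useful / issued:.0%}")

When every lane takes the same branch, utilization is 100%; when the lanes split evenly, the warp still issues both paths and utilization drops to 50%.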

However, when applied to AI and LLMs, the workload changes entirely. AI processing is built on tensor math and matrix multiplication, which is fundamentally regular and predictable. Unlike a highly dynamic game scene, matrix math is just an immense but steady flow of numbers. Because AI is so consistent, the GPU’s fancy, high-speed micro-management becomes unnecessary. In this context, that hardware is just “overhead,” consuming power and space for a flexibility that the AI doesn’t actually use.

This leaves the GPGPU in a bit of a paradox: it’s simultaneously too dynamic and not dynamic enough. It’s too dynamic because it wastes energy on micro-level management and complex scheduling that a steady AI workload doesn’t require. Yet it’s not dynamic enough because it is bound by the rigid size of its “waves.”

If the AI math doesn’t perfectly fit into a warp of 32, the GPU must use “padding,” effectively leaving seats empty on the bus. While the GPU is a perfect match for solving irregular graphics problems, it’s an imperfect fit for the sheer, repetitive scale of modern tensor processing.

Wasted area as a physical quantity

This inefficiency can be understood geometrically. A circle inscribed in a square leaves about 21% of the square’s area unused. In processing hardware terms, the “area” corresponds to execution lanes, cycles, bandwidth, and joules. Any portion of these resources that performs work that does not advance the model’s output is wasted area.
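
The 21% figure is simple geometry, as the short calculation below confirms.

import math

square_area = 1.0                     # unit square
circle_area = math.pi * 0.5 ** 2      # inscribed circle of radius 1/2
print(f"unused area: {1 - circle_area / square_area:.1%}")   # ~21.5%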

The utilization gap (MFU)

The primary way to quantify this inefficiency is through Model FLOPs Utilization (MFU). This metric measures how much of the chip’s theoretical peak math power is actually being used for the model’s calculations versus how much is wasted on overhead, data movement, or idling.

For an LLM like GPT-4 running on GPGPU-based accelerators in interactive mode, MFU can drop by an order of magnitude, with the hardware busy with “bookkeeping”: moving data between memory levels, managing thread synchronization, or waiting for the next “wave” of instructions to be decoded.
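
As a rough sketch of how that gap is measured (the model size, token rate, and peak throughput below are illustrative assumptions, not benchmarks of any specific system), MFU can be estimated from the observed token rate using the common approximation of roughly two FLOPs per parameter per generated token.

def inference_mfu(params, tokens_per_sec, peak_flops):
    """Model FLOPs Utilization: useful model FLOPs vs. theoretical peak."""
    flops_per_token = 2 * params          # ~2 FLOPs per parameter per token
    achieved = flops_per_token * tokens_per_sec
    return achieved / peak_flops

# Illustrative numbers: a 70B-parameter model decoding 30 tokens/s per chip
# on an accelerator with a 1,000 TFLOP/s peak (all values assumed).
print(f"MFU ~ {inference_mfu(70e9, 30, 1000e12):.1%}")   # ~0.4%

Serving more requests per batch raises this figure, which is precisely why large batches became the standard workaround discussed later.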

The energy cost of flexibility

The inefficiency is even more visible in power consumption. A significant portion of a GPU’s power budget goes to the “dynamic micromanagement”: the logic gates that handle warp scheduling, branch divergence, and instruction fetching for irregular tasks.

The “padding” penalty

Finally, there is the “padding” inefficiency. Because a GPGPU-based accelerator operates on fixed wave sizes (typically 32 or 64 threads), a calculation whose dimensions don’t align with those multiples, as often happens in an LLM’s attention mechanism, still burns the power of a full wave while some threads sit idle.
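
A back-of-the-envelope model shows the cost (the 32-thread wave size is from above; the example work sizes are assumptions chosen for illustration):

import math

WAVE = 32  # assumed fixed wave/warp size

def padded_utilization(work_items):
    """Fraction of launched lanes that do useful work."""
    launched = math.ceil(work_items / WAVE) * WAVE
    return work_items / launched

# e.g., a dimension of 100 elements is rounded up to 128 lanes, leaving 28 idle
for n in (100, 129, 4096):
    print(f"{n:5d} items -> utilization {padded_utilization(n):.1%}")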

These effects multiply rather than add. A GPU may be marketed on its high peak throughput, yet once deployed it may deliver only a fraction of that peak as useful throughput for LLM inference, while drawing close to peak power.

The memory wall and idle compute

Even if compute utilization were perfect, LLM inference would still collide with the memory wall: the growing disparity between how fast processors can compute and how fast they can access memory. LLM inference has low arithmetic intensity, meaning that relatively few floating-point operations are performed per byte of data fetched. Much of the execution time is spent reading and writing the key-value (KV) cache.
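
A roofline-style check makes “low arithmetic intensity” concrete (the peak-compute and memory-bandwidth figures below are illustrative assumptions): if a kernel performs fewer FLOPs per byte moved than the machine’s compute-to-bandwidth ratio, the math units spend most of their time waiting.

def bound_by(flops, bytes_moved, peak_flops, mem_bw):
    """Roofline test: compare a kernel's arithmetic intensity to the machine balance."""
    intensity = flops / bytes_moved    # FLOPs per byte the kernel achieves
    balance = peak_flops / mem_bw      # FLOPs per byte needed to keep the ALUs busy
    return ("compute-bound" if intensity >= balance else "memory-bound",
            intensity, balance)

# Single-token decode: every FP16 weight (2 bytes) is read once for ~2 FLOPs.
params = 70e9
kind, ai, bal = bound_by(flops=2 * params, bytes_moved=2 * params,
                         peak_flops=1000e12, mem_bw=3.3e12)  # assumed accelerator specs
print(f"{kind}: {ai:.1f} FLOPs/byte vs. machine balance of {bal:.0f} FLOPs/byte")

Single-token decode sits orders of magnitude below the machine balance, so the bottleneck is bandwidth, not math.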

GPUs attempt to hide memory latency with massive concurrency. Each streaming multiprocessor (SM) holds many warps and switches among them, so some can compute while others wait for memory. This strategy works well when memory accesses are staggered and independent. In LLM inference, however, many warps stall simultaneously while waiting on similar memory accesses.

As a result, SMs spend large fractions of their time idle, not because they lack instructions, but because data cannot arrive fast enough. Measurements commonly show that 50–70% of cycles during inference are lost to memory stalls. Importantly, power draw does not scale down proportionally, since clocks continue toggling and control logic remains active, resulting in poor energy efficiency.

Predictable stride assumptions and the cost of generality

To maximize bandwidth, GPUs rely on predictable stride assumptions; that is, the expectation that memory accesses follow regular patterns. This enables techniques such as cache line coalescing and memory swizzling, a remapping of addresses designed to avoid bank conflicts and improve locality.

LLM memory access patterns violate these assumptions. Accesses into the KV cache depend on token position, sequence length, and request interleaving across users. The result is reduced cache effectiveness and increased pressure on address-generation logic. The hardware expends additional cycles and energy rearranging data that cannot be reused.
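
A toy memory model illustrates the point (the 128-byte transaction size and the two access patterns are assumptions, not the behavior of any particular GPU): a warp whose lanes read consecutive elements is served by a single wide transaction, while the same lanes gathering from scattered token positions in the KV cache each trigger their own.

TRANSACTION = 128   # assumed memory-transaction granularity in bytes
ELEM = 2            # bytes per FP16 element

def transactions(addresses):
    """Count distinct 128-byte segments touched by one warp's loads."""
    return len({addr // TRANSACTION for addr in addresses})

# Coalesced: 32 lanes read 32 consecutive elements of one KV vector.
coalesced = [lane * ELEM for lane in range(32)]
# Scattered: 32 lanes read the same feature from 32 token positions 4,096 elements apart.
scattered = [lane * 4096 * ELEM for lane in range(32)]

print("coalesced:", transactions(coalesced), "transactions")   # 1
print("scattered:", transactions(scattered), "transactions")   # 32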

This is often described as a “generality tax.”

Why GPUs still dominate

Given these inefficiencies, it’s natural to ask why GPUs remain dominant. The answer lies in history rather than optimality. Early deep learning workloads were dominated by dense linear algebra, which mapped reasonably well onto GPU hardware. Training budgets were large enough that inefficiency could be absorbed.

Inference changes priorities. Latency, cost per token, and energy efficiency now matter more than peak throughput. At this stage, structural inefficiencies are no longer abstract; they directly translate into operational cost.

From adapting models to aligning hardware

For years, the industry focused on adapting models to hardware through larger batches, heavier padding, and more aggressive quantization. These techniques smooth over the mismatch but do not remove it.

A growing alternative is architectural alignment: building hardware whose execution model matches the structure of LLMs themselves. Such designs schedule work around tokens rather than warps, and memory systems are optimized for KV locality instead of predictable strides. By eliminating unused execution lanes entirely, these systems reclaim the wasted area rather than hiding it.

The inefficiencies seen in modern AI data centers (idle compute, memory stalls, padding overhead, and excess power draw) are not signs of poor engineering. They are the inevitable result of forcing a smooth, steady workload into a rigid, geometric execution model.

GPUs remain masterfully engineered square holes. LLMs remain inherently round pegs. As AI becomes a key ingredient in global infrastructure, the cost of this mismatch becomes the problem itself. The next phase of AI computing will belong not to those who shave the peg more cleverly, but to those who reshape the hole to match the true geometry of the workload.

Lauro Rizzatti is a business advisor to VSORA, a technology company offering silicon semiconductor solutions that redefine performance. He is a noted chip design verification consultant and industry expert on hardware emulation.