Meeting Challenges Posed by AI Inference at the Edge

The GPU, the ubiquitous graphics processing unit, is arguably the most important computing technology today. It made AI processing possible, but at an unsustainable cost exacerbated by enormous power consumption.

A revolution is underway in the software development world. The paradigm that ruled software engineering as originally conceived consists of a sequence of instructions executed on the traditional computing architecture known as the von Neumann CPU. This enduring processing architecture can only execute rigorously codifiable jobs. It cannot handle tasks such as recognizing an object or a music score, or writing an essay. Known as predictive AI and generative AI, these assignments can be handled by large language models (LLMs) that require processing hundreds of billions, if not trillions, of parameters within a clock cycle, well beyond the realm of CPUs.

Today, LLM training and inference are performed in data centers equipped with arrays of leading-edge GPUs. While the approach works, it leads to soaring acquisition and operating costs and spiraling power consumption that can strain a power grid.

That is not the case for inference at the edge, which is expected to serve the largest AI application market, spanning sectors as varied as industrial, commercial, medical, educational and entertainment.

Power consumption, cost and latency cannot be overlooked when inference is executed at the edge: high performance, low latency, low cost and low power are all critical attributes.

Efficiency is an often-ignored parameter in a computational engine’s target specifications. It quantifies how much of the theoretical maximum compute power is actually delivered when executing an algorithm. GPUs exemplify the dilemma. Originally designed for parallel processing of graphics, they suffer a drop in deliverable computational power when executing AI algorithms; in the case of GPT-3, efficiency falls to the low single digits. GPU vendors address the limitation by adding large numbers of devices, at considerable cost and with an exponential increase in the energy consumption of AI processing in data centers.
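To make the notion concrete, efficiency is simply the ratio of delivered throughput to theoretical peak. A minimal sketch in Python, with purely illustrative figures (the peak and delivered numbers below are assumptions, not measurements of any particular device):

```python
# Efficiency = delivered throughput / theoretical peak.
# The figures below are illustrative assumptions, not measured values.

def efficiency(delivered_tflops: float, peak_tflops: float) -> float:
    """Fraction of the theoretical peak actually delivered on a workload."""
    return delivered_tflops / peak_tflops

peak = 1000.0     # hypothetical GPU peak, in TFLOPS
delivered = 30.0  # hypothetical sustained throughput on an LLM workload

print(f"Efficiency: {efficiency(delivered, peak):.1%}")  # -> 3.0%, i.e. low single digits
```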

The bottleneck lies in the data transfers between memory and the processing units.

Historically, advancements in memory technology have not kept up with progress in processing logic. Over time, the gap has led to a drop in useful processing power because memory cannot feed data at the rate the processors require. Much of the time, the computational units sit idle waiting for data, and the situation worsens as processing power increases. The higher the compute power of the processing units, the larger the bottleneck feeding them data, a problem dubbed the memory wall when the term was coined in the mid-1990s.

A memory hierarchy was created to ease the problem. At the bottom sits the slow main memory; at the top sit the registers next to the processing units. In between, successive layers of faster but smaller memories speed up data transfer.
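As a rough illustration of that hierarchy, the sketch below uses common textbook orders of magnitude for capacity and latency; the figures are illustrative assumptions, not the specifications of any particular chip.

```python
# Illustrative memory hierarchy: capacity grows and speed drops toward the bottom.
# Latencies and capacities are ballpark textbook values, not vendor data.
hierarchy = [
    # (level, typical capacity, typical access latency in clock cycles)
    ("registers",           "~KBs",            1),
    ("L1 cache",            "tens of KBs",     4),
    ("L2 cache",            "hundreds of KBs", 15),
    ("L3 cache",            "tens of MBs",     50),
    ("main memory (DRAM)",  "GBs",             300),
]

for level, capacity, cycles in hierarchy:
    print(f"{level:<20} {capacity:<16} ~{cycles} cycles")
```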

While registers can feed data to compute units at the rate they need, their number is typically limited to hundreds or, at most, a few thousand, whereas many millions are necessary today.

An innovative architecture that breaks the memory wall is needed. One proposal is to collapse all the layered caches into a Tightly Coupled Memory (TCM) that looks and acts like registers. From the perspective of the processing units, data could be accessed anywhere, at any time, within one clock cycle. A TCM of 192 megabytes would roughly equate to 1.5 billion single-bit registers.
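The register-count figure follows directly from the arithmetic (using decimal megabytes):

```python
# 192 MB expressed as single-bit registers (decimal megabytes assumed).
tcm_bytes = 192 * 10**6
tcm_bits = tcm_bytes * 8
print(f"{tcm_bits / 1e9:.2f} billion single-bit registers")  # ~1.54 billion
```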

Implementing 192 megabytes of registers via a register transfer level (RTL) design flow would be arduous, posing a major challenge. Instead, a design implementation flow at a high level of abstraction would drastically simplify and speed up the accelerator’s development. Coupled with 192 gigabytes of onboard High-Bandwidth Memory (HBM), such a device could run GPT-3 entirely on a single chip, a highly efficient implementation. When processing LLMs, it would reach efficiencies of 50% to 55%, more than an order of magnitude higher than GPUs.
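A quick capacity check supports the single-chip claim, assuming GPT-3’s roughly 175 billion parameters are stored at about one byte per parameter (an 8-bit quantization assumption made here for illustration, not stated above):

```python
# Back-of-the-envelope check: does GPT-3 fit in 192 GB of on-board HBM?
# Assumes ~175 billion parameters stored at 1 byte each (8-bit weights).
params = 175e9
bytes_per_param = 1  # assumption: 8-bit quantization
model_gb = params * bytes_per_param / 1e9
hbm_gb = 192
print(f"Model: ~{model_gb:.0f} GB, HBM: {hbm_gb} GB, fits: {model_gb <= hbm_gb}")
```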

A drastic reduction in data transmission between the external memory and the compute units could lead to a significant drop in power use, on the order of 50 watts per petaflop. At the same time, it would cut execution latency by more than 10X compared with GPUs.

Critically, the architecture should not be hardwired. Instead, it ought to be fully programmable and highly scalable.

AI application algorithms evolve almost weekly. The more frequent changes are limited to fine-tuning the algorithms’ performance, latency and power consumption attributes, with an impact on cost. Periodically, radically new algorithmic structures render older versions obsolete. The new accelerator architecture should accommodate all of the above and enable updates and upgrades in the field.

Such a fully programmable approach ought also to support computation quantization configurable on the fly, from 4-bit to 64-bit, in either integer or floating-point math, applied automatically on a layer-by-layer basis to accommodate a broad range of applications. Sparsity in weights and data should be supported on the fly as well.
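As a hypothetical sketch of what such a per-layer configuration might look like to the programmer, the structure below is purely illustrative; its names and fields are assumptions, not an existing API.

```python
# Hypothetical per-layer compute configuration: precision and sparsity chosen
# layer by layer. Names and fields are illustrative, not an existing API.
from dataclasses import dataclass
from typing import Literal

@dataclass
class LayerConfig:
    name: str
    bits: Literal[4, 8, 16, 32, 64]          # quantization width
    number_format: Literal["int", "float"]   # integer or floating-point math
    weight_sparsity: float                   # fraction of weights pruned
    activation_sparsity: float               # fraction of activations skipped

model_plan = [
    LayerConfig("embedding", bits=8,  number_format="int",   weight_sparsity=0.0, activation_sparsity=0.0),
    LayerConfig("attention", bits=16, number_format="float", weight_sparsity=0.5, activation_sparsity=0.3),
    LayerConfig("mlp",       bits=4,  number_format="int",   weight_sparsity=0.7, activation_sparsity=0.5),
    LayerConfig("lm_head",   bits=32, number_format="float", weight_sparsity=0.0, activation_sparsity=0.0),
]

for layer in model_plan:
    print(layer)
```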

From the deployment perspective, the accelerator could act as a companion chip to the main processor in a scheme transparent to the user. Algorithmic engineers could write their algorithms as if they ran on the main processor, letting the compiler separate the code that runs on the accelerator from the code that runs on the main processor. The approach would simplify the accelerator’s deployment and use model.
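Purely to illustrate that use model, the sketch below shows code written as if everything ran on the main processor, with a marker standing in for the compiler pass that would decide placement; the decorator and its behavior are invented for this sketch and do not refer to any real toolchain.

```python
# Hypothetical programming model: the engineer writes ordinary code; a marker
# (or the compiler itself) decides what is offloaded to the accelerator.
# Everything here is illustrative; no real compiler or library is implied.

def offload(func):
    """Stand-in for a compiler pass that would map this function to the accelerator."""
    func._runs_on = "accelerator"  # in this sketch, just a tag
    return func

@offload
def attention_block(q, k, v):
    # Written as if it runs on the main processor; placement is the compiler's job.
    return [qi * ki + vi for qi, ki, vi in zip(q, k, v)]

def control_logic(tokens):
    # Stays on the main processor in this sketch.
    return attention_block(tokens, tokens, tokens)

print(control_logic([1.0, 2.0, 3.0]))  # [2.0, 6.0, 12.0]
```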

Unlike the data flow driving GPUs, which operates at a low level, data flow in this imagined architecture would work at the algorithmic level, reading MATLAB code and graphs and executing them natively.

Is it possible? Perhaps. 

A device like this would run five to 10 times faster than leading-edge GPU-based accelerators while consuming a small fraction of their power and boasting significantly lower latency, meeting the needs of AI inference at the edge. Undoubtedly, it would ease deployment and usage, appealing to a large community of scientists and engineers.