A Breakthrough in FPGA-Based Deep Learning Inference

Mipsology’s Zebra Deep Learning inference engine is designed to be fast, painless, and adaptable, outclassing CPU, GPU, and ASIC competitors

Source: EEWEB

By Lauro Rizzatti (Contributed Content) | Friday, November 23, 2018

I recently attended the 2018 Xilinx Development Forum (XDF) in Silicon Valley. While at this forum, I was introduced to a company called Mipsology, a startup in the field of artificial intelligence (AI) that claims to have solved the AI-related problems associated with field-programmable gate arrays (FPGAs). Mipsology was founded with a grand vision to accelerate the computation of any neural network (NN) with the highest performance achievable on FPGAs without the constraints inherent in their deployment.

Mipsology demonstrated the ability to execute more than 20,000 images per second, running on the newly announced Alveo boards from Xilinx and processing a collection of NNs, including ResNet50, InceptionV3, VGG19, and others.

Introducing neural networks and deep learning
Loosely modeled on the web of neurons in the human brain, a neural network is the foundation of deep learning (DL), a complex mathematical system that can learn tasks on its own. By examining many examples or associations, an NN can learn connections and relationships faster than a traditional recognition program can. The process of configuring an NN to perform a specific task by learning from millions of samples of the same type is called training.

For example, an NN might listen to many vocal samples and use DL to learn to “recognize” the sounds of specific words. This NN could then sift through a list of new vocal samples and correctly identify samples containing words that it has learned by using a technique called inference.

Despite its complexity, DL boils down to performing simple operations — mostly additions and multiplications — by the billions or trillions. The computational demand of such workloads is daunting. More specifically, the cumulative computing needs of DL inference exceed those of DL training: training must be performed only once, but a trained NN must perform inference again and again, for every new sample it receives.
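To make the point concrete, here is a minimal sketch (plain Python, no framework assumed) of one fully connected NN layer: every output value is nothing more than a chain of multiplications and additions, repeated across millions of neurons in a real network.

```python
def dense_layer(inputs, weights, biases):
    """Compute one layer's outputs: each output is a sum of products plus a bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, neuron_weights):
            acc += x * w  # one multiply and one add per connection
        outputs.append(acc)
    return outputs

# Toy example: 3 inputs feeding 2 neurons.
x = [1.0, 2.0, 3.0]
W = [[0.1, 0.2, 0.3],   # weights for neuron 0
     [0.4, 0.5, 0.6]]   # weights for neuron 1
b = [0.5, -0.5]
print(dense_layer(x, W, b))
```

A full inference pass simply repeats this multiply-accumulate pattern layer after layer, which is why hardware that can issue many such operations in parallel is so attractive.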

Four choices to accelerate deep-learning inference
Over time, the engineering community has turned to four different computing devices to process NNs. In increasing order of processing power and power consumption, and in decreasing order of flexibility and adaptability, these are: central processing units (CPUs), graphics processing units (GPUs), FPGAs, and application-specific integrated circuits (ASICs). The table below summarizes the main differences among the four computing devices.

Comparison of CPUs, GPUs, FPGAs, and ASICs for DL computing (Source: Lauro Rizzatti)

CPUs are based on the von Neumann architecture. While flexible (the very reason for their existence), CPUs suffer from long latency because memory accesses consume several clock cycles even for a simple task. For workloads that demand the lowest latencies, such as NN computation — and DL training and inference in particular — they are the poorest choice.

GPUs provide high computation throughput at the cost of reduced flexibility. Furthermore, GPUs consume significant power and demand cooling, making them less than ideal for deployment in data centers.

While custom ASICs may seem to be an ideal solution, they have their own set of issues. Developing an ASIC takes years. DL and NNs are evolving rapidly with ongoing breakthroughs, making last year’s technology irrelevant. Plus, to compete with a CPU or a GPU, an ASIC would need to use a large silicon area using the thinnest process node technology. This makes the upfront investment expensive without any guarantee of long-term relevancy. All things considered, ASICs are effective for specific tasks.

FPGA devices have emerged as the best possible choice for inference. They are fast, flexible, power-efficient, and offer a good solution for data processing in data centers, especially in the fast-moving world of DL, at the edge of the network and under the desk of AI scientists.

The largest FPGAs available today include millions of simple Boolean operators, thousands of memories and DSPs, and several Arm CPU cores. All of these resources work in parallel — each clock tick can trigger up to millions of simultaneous operations — resulting in trillions of operations performed per second. The processing required by DL maps quite well onto FPGA resources.

FPGAs have other advantages over CPUs and GPUs used for DL, including the following:

  • They are not limited to certain data types. In particular, they can handle the non-standard, low-precision formats that deliver higher throughput for DL.
  • They use less power than CPUs or GPUs — usually five to 10 times less average power for the same NN computation — which lowers their recurring cost in data centers.
  • They can be reprogrammed to fit a specific task yet remain generic enough to accommodate various undertakings. DL is evolving rapidly, and the same FPGA will fit new requirements without waiting for next-generation silicon (as is typical with ASICs), thereby reducing the cost of ownership.
  • They range from large to small devices, so they can be used in a data center or in an internet of things (IoT) node. The only difference is the number of blocks they contain.

All that glitters is not gold
An FPGA’s high computational power, low power consumption, and flexibility come at a price: they are difficult to program.

Programming an FPGA requires specific skills and knowledge. It starts with specialized hardware description languages (HDLs) and continues with the vendor-specific tools that compile a design through complex steps such as synthesis, placement, and routing. Before reaping the rewards, the programmer must define a “program” architecture, respect constraining design rules, fit the “program” into the FPGA, close timing, endure day-long compilations, and do without software-like debugging.

Mipsology’s Zebra: solving the FPGA problem
At XDF, Ludovic Larzul, Mipsology founder and CEO, and I talked about Mipsology’s Zebra, a deep-learning inference engine that computes neural networks on FPGAs.

According to Larzul, “Zebra conceals the FPGA from the user, eliminating the issues that make FPGAs hard to program. Zebra does not require learning a new language or new tools, nor understanding hardware-level details. It is delivered with pre-compiled FPGA binary files, removing the need to learn the FPGA compilation process.

“We simplified the process with Zebra. Once an FPGA-based board has been plugged into a PC, all that is needed is a single Linux command. FPGAs can be used for inference in place of CPUs or GPUs, seamlessly and instantly, accelerating computation by an order of magnitude at lower power consumption.”

Zebra was designed for the AI community: “FPGAs now can be used by the AI and deep-learning community,” affirmed Larzul. “Zebra is integrated in Caffe, Caffe2, MXNet, and TensorFlow. No modifications are necessary to neural frameworks to deploy Zebra, giving the AI expert the ability to run any application on top of the same framework. They can switch from CPUs or GPUs after NN training to FPGAs for inference without wasting R&D time.

“Zebra supports a wide variety of NNs, from the most common commercial networks to any custom-designed NN. No changes are required to a neural network as long as it is built with the supported layers and parameters. Zebra’s limits should not prevent any NN from running on it: up to 1 million layers, up to 3 billion weights in the network, and 50,000 filters in each convolution layer. All are well above the resources commonly used by neural networks.

“Adjusting NN parameters, or even changing the neural network, does not force recompilation of the FPGAs (a task that can take hours, days, or, in the event of timing issues, weeks, if it succeeds at all), which makes Zebra viable for NN deployment. Any modification to an NN can run on Zebra, simplifying the testing of new versions for use in data centers. Zebra reuses the NN training typically performed on GPUs, removing the need for retraining and avoiding new tools to migrate the training parameters.

“Zebra executes inference calculations using 8-bit or 16-bit fixed-point integers, while a CPU or GPU typically uses floating-point values. As disclosed in numerous scientific articles, the accuracy of results does not suffer from the change in precision when using appropriate quantization. Zebra accommodates that, too, without any user intervention. By reducing computing precision, the computation throughput increases dramatically.”
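The fixed-point conversion Larzul describes is commonly done with quantization. The sketch below illustrates the general idea with a simple symmetric 8-bit scheme (the function names and the scheme itself are illustrative assumptions; Zebra’s actual quantization method is not disclosed in the article): floats are mapped to integers in [-127, 127] with a shared scale factor, and the worst-case rounding error stays below half a quantization step.

```python
def quantize(values, num_bits=8):
    """Map floats to fixed-point integers sharing one scale factor (symmetric scheme)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax   # one step in float units
    return [round(v / scale) for v in values], scale

def dequantize(ints, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in ints]

weights = [0.82, -0.41, 0.05, -0.99]
q, scale = quantize(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, max(errors))  # error is bounded by scale / 2
```

With an appropriate scale per layer or per channel, this kind of rounding introduces errors small enough that, as the article notes, inference accuracy is largely preserved while integer arithmetic runs far faster on hardware.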

Performance is what matters: Larzul claimed, “While FPGAs offer multiple advantages over other hardware platforms, processing speed, power consumption, and cost are what make a hardware platform attractive in most situations. The speed of execution of Zebra is considerably higher than that of GPUs or CPUs when using the same software stack, framework, and neural network.”

Zebra accommodates NNs trained by other accelerators. (Source: Mipsology)

Larzul adamantly stated, “At Mipsology, we excel in continuously improving Zebra’s throughput, aiming at the highest performance possible in FPGAs. Just in 2018, for example, we achieved 5x speedup on the same silicon.

“Not to mention that throughput per dollar and throughput per watt are even better metrics for Zebra when compared with the variety of FPGA boards available on the market.”

The discovery of Mipsology and Zebra at the 2018 XDF was a pleasant surprise. As Larzul summarized, “Zebra was conceived from the ground up to exploit the potential throughput of FPGAs without their drawbacks to provide AI scientists and specialists with high computational power to accelerate NN inference in data centers and on the edge, complementing GPUs for training.”