Parsing the Mindboggling Cost of Ownership of Generative AI

The latest algorithms, such as GPT-4, pose a challenge to the current state-of-the-art processing hardware, and GenAI accelerators aren’t keeping up. In fact, no hardware on the market today can run the full GPT-4.

Current large language model (LLM) development, which focuses on creating smaller but more specialized LLMs that can run on existing hardware, is a diversion. The GenAI industry needs semiconductor innovations in computing methods and architectures capable of delivering multiple petaFLOPS of performance at an efficiency greater than 50%, reducing latency to less than two seconds per query, constraining energy consumption and shrinking cost to 0.2 cents per query.

Once this is in place, and it is only a matter of time, the promise of transformers deployed on edge devices will be fully exploited.

Advancements in software algorithms driven by transformers, however, haven’t been met with similar progress in the computing hardware tasked to execute them. For example, GPT-4’s LLM is huge, exceeding one trillion parameters. The volume of these parameters poses a challenge to storage and performance requirements. Memory storage is already reaching hundreds of gigabytes. Processing throughput needs multiple petaops (one petaops is 1,000,000,000,000,000 operations per second) to deliver query responses in an acceptable timeframe, typically less than a couple of seconds.

While model training and inference share performance requirements, they differ on four other characteristics: memory, latency, power consumption and cost. See table 1.

Table 1: Algorithm training and inference share some but not all critical attributes. (Source: Vsora)

Today, model training and inference are carried out on extensive computing farms. A job runs for a long time and consumes a sizable amount of electric power, producing copious heat at mindboggling cost. Nonetheless, the farms deliver what’s expected of them.

To size the task, training a GPT-4 model on fp32 or fp64 arithmetic may require more than one trillion bits stored on the fastest versions of high-bandwidth memory (HBM) DRAM. The performance necessary to train such a massive model calls for tens of petaops running for weeks, an annoyance but not a roadblock. To accomplish the job, the computing farms consume megawatts, with a total cost of ownership in the hundreds of billions of dollars. No, not a perfect scenario, but a working solution.

Model inference, by contrast, is usually performed on fp8 arithmetic. It still produces large amounts of data, in the hundreds of billions of bits, and must deliver a query response with a latency of no more than a couple of seconds to keep the user’s attention and acceptance. Further, considering that a vast potential market for inference encompasses mobile applications at the edge, a viable solution must provide high throughput of more than one petaops with implementation efficiency exceeding 50%.

Additionally, as is mandatory for mobility, the solution must minimize energy consumption, possibly to less than 50 watts per petaops, at an acquisition/deployment cost in the ballpark of a few hundred dollars.

These are lofty specifications for feasible inference scenarios running on edge devices.

The crux of the matter centers on the memory bottleneck, known as the memory wall, that increases latency with a deleterious impact on implementation efficiency, expands energy consumption and magnifies cost.

Impact of memory wall on generative AI

Moving terabytes of data at high speed between memory and computing elements requires data transfer bandwidths of terabytes/sec, which is hardly practicable. If the processor doesn’t receive data on time, it sits idle, and its efficiency suffers. As recently reported, the efficiency of running GPT-4 on leading-edge hardware drops to 3% or less. A GenAI accelerator with a nominal performance of one petaops but an actual efficiency of 3% delivers a meager 30 teraops. Basically, a very costly processor designed to run these algorithms remains inactive 97% of the time.
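The arithmetic behind that figure can be sketched in a few lines. This is only an illustration: the function name `effective_throughput` is hypothetical, and the inputs are the one-petaops nominal rating and 3% efficiency quoted above.

```python
PETA = 1e15  # operations per second in one petaops
TERA = 1e12  # operations per second in one teraops


def effective_throughput(nominal_ops_per_sec: float, efficiency: float) -> float:
    """Return delivered ops/sec for a nominal rating and a utilization fraction."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be a fraction between 0 and 1")
    return nominal_ops_per_sec * efficiency


# Inputs taken from the text: one petaops nominal, 3% efficiency.
NOMINAL_OPS = 1 * PETA
EFFICIENCY = 0.03

delivered = effective_throughput(NOMINAL_OPS, EFFICIENCY)
idle_fraction = 1.0 - EFFICIENCY

print(f"Delivered throughput: {delivered / TERA:.0f} teraops")
print(f"Share of nominal capacity left unused: {idle_fraction:.0%}")
```

The same function shows what the 50% efficiency target would mean: the identical one-petaops part would deliver 500 teraops instead of 30.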

To compensate for the low efficiency in processing model training and inference in data centers, cloud providers add more hardware to perform the same task. The approach escalates the cost and multiplies power consumption. Obviously, such a method isn’t applicable for inference at the edge.

Estimated cost analysis of GenAI in datacenters processing ChatGPT-4

McKinsey estimated that in 2022, Google search processed 3.3 trillion queries (~100,000 queries/sec) at a cost of ¢0.2 per query, considered to be the benchmark. The total annual cost amounted to $6.6 billion. Google isn’t charging fees for the search service. Instead, it covers the cost via advertising revenues. For now.

The same McKinsey analysis stated that the ChatGPT-3 cost hovers around ¢3 per query, 15× larger than the benchmark. At the same volume of roughly 100,000 queries/sec, the total annual cost would approach $100 billion.
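As a rough check, the annual totals follow directly from the query volume and the per-query costs quoted above. The sketch below assumes the ChatGPT-3 scenario handles the same 3.3 trillion queries per year as Google search; the variable and function names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def annual_cost_usd(queries_per_year: float, cost_per_query_usd: float) -> float:
    """Total yearly cost for a given query volume and per-query cost."""
    return queries_per_year * cost_per_query_usd


# Figures quoted in the text (attributed there to McKinsey).
QUERIES_PER_YEAR = 3.3e12            # Google search queries in 2022
SEARCH_COST_PER_QUERY_USD = 0.002    # 0.2 cents, the benchmark
CHATGPT3_COST_PER_QUERY_USD = 0.03   # 3 cents

average_qps = QUERIES_PER_YEAR / SECONDS_PER_YEAR
search_annual = annual_cost_usd(QUERIES_PER_YEAR, SEARCH_COST_PER_QUERY_USD)
chatgpt3_annual = annual_cost_usd(QUERIES_PER_YEAR, CHATGPT3_COST_PER_QUERY_USD)
cost_ratio = CHATGPT3_COST_PER_QUERY_USD / SEARCH_COST_PER_QUERY_USD

print(f"Average load: {average_qps:,.0f} queries/sec")
print(f"Search, annual cost: ${search_annual / 1e9:.1f} billion")
print(f"ChatGPT-3, annual cost: ${chatgpt3_annual / 1e9:.1f} billion")
print(f"Cost ratio vs. benchmark: {cost_ratio:.0f}x")
```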

Let’s evaluate the implication of the benchmarks on the cost-of-ownership of a data center supporting ChatGPT-4 based on a best-in-class GenAI accelerator, including purchasing, operating and system maintenance expenses.

The cost per query comprises two contributors: acquisition cost and energy consumption cost.

Estimated hardware acquisition costs


  • Hardware refresh: three years
  • Purchasing cost of a leading-edge GenAI accelerator system, containing eight accelerator chips and delivering a gross compute power of 16 petaops at fp8, processing ChatGPT-4 with a 3% efficiency: ~$500,000 per system
  • Theoretical throughput of one leading-edge GenAI system processing ChatGPT-4: ~0.055 queries/sec
  • Number of systems needed to meet a processing capability of 100,000 queries/sec: ~1,800,000 (100,000 / 0.055)
  • Total acquisition cost: ~$900,000,000,000 (1,800,000 * 500,000), approaching $1 trillion.

The daily depreciation over the three-year refresh period amounts to about $820 million (900,000,000,000 / 1,095 days).
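The acquisition estimate can be reproduced with a short script. This is a sketch using only the inputs listed above; the constant names are illustrative. Because the article rounds the system count to 1.8 million, the unrounded results printed here come out slightly higher than the rounded figures in the text.

```python
import math

# Inputs from the list above.
SYSTEM_PRICE_USD = 500_000
QUERIES_PER_SEC_PER_SYSTEM = 0.055
TARGET_QUERIES_PER_SEC = 100_000
REFRESH_DAYS = 3 * 365  # three-year hardware refresh, 1,095 days

# Whole systems are purchased, so round the count up.
systems_needed = math.ceil(TARGET_QUERIES_PER_SEC / QUERIES_PER_SEC_PER_SYSTEM)
acquisition_cost_usd = systems_needed * SYSTEM_PRICE_USD
daily_depreciation_usd = acquisition_cost_usd / REFRESH_DAYS

print(f"Systems needed: {systems_needed:,}")
print(f"Acquisition cost: ${acquisition_cost_usd / 1e9:,.0f} billion")
print(f"Daily depreciation: ${daily_depreciation_usd / 1e6:,.0f} million")
```

Straight-line depreciation over the refresh period is the assumption here, matching the division by 1,095 in the text.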

Estimated energy costs to operate the hardware


  • Average power consumption per chip: 25 W, based on nominal power, efficiency, memory bandwidth
  • Throughput per chip: ~0.007 queries/sec (0.055 / 8)
  • Energy consumption per query: 3,637 J (25 W / 0.007 queries per second)
  • Electricity price: $0.11 per kWh
  • Energy cost per query: ~$1.1e-4 (3,637 J / 3,600,000 J per kWh * $0.11)
  • Total power consumption for 100,000 queries/second: ~363.7 MW

The energy cost amounts to about $1.0 million/day (363,700 kW * 24 hours * $0.11 per kWh).
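The energy figures can be checked the same way. The sketch below takes the 25 W per chip, eight chips per system, 0.055 queries/sec per system and $0.11 per kWh from the text; the names are illustrative, and it uses the unrounded per-chip throughput, so results land close to the listed values with small rounding differences.

```python
# Inputs from the text.
POWER_PER_CHIP_W = 25.0
CHIPS_PER_SYSTEM = 8
SYSTEM_QUERIES_PER_SEC = 0.055
TARGET_QUERIES_PER_SEC = 100_000
PRICE_PER_KWH_USD = 0.11

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3,600,000 J

# Per-chip throughput, kept unrounded (0.006875 queries/sec).
chip_queries_per_sec = SYSTEM_QUERIES_PER_SEC / CHIPS_PER_SYSTEM

# Watts divided by queries/sec gives joules per query.
energy_per_query_j = POWER_PER_CHIP_W / chip_queries_per_sec

# Joules per query times queries/sec gives total watts.
total_power_w = energy_per_query_j * TARGET_QUERIES_PER_SEC

energy_cost_per_query_usd = energy_per_query_j / JOULES_PER_KWH * PRICE_PER_KWH_USD
daily_energy_cost_usd = total_power_w / 1_000 * 24 * PRICE_PER_KWH_USD

print(f"Energy per query: {energy_per_query_j:,.0f} J")
print(f"Total power: {total_power_w / 1e6:.1f} MW")
print(f"Energy cost per query: ${energy_cost_per_query_usd:.2e}")
print(f"Daily energy cost: ${daily_energy_cost_usd / 1e6:.2f} million")
```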

Clearly, the cost is dominated by hardware acquisition. The best-guess total daily cost is in the ballpark of $820 million.

The above leads to a GPT-4 cost per query of ¢9.5 for a system running 100,000 queries per second: 820,000,000 / (100,000 * 24 * 60 * 60), that is, the cost per day divided by the number of queries per day (queries per second * hours * minutes * seconds). See table 2.

Table 2: Comparing the cost per query of GPT-3 and GPT-4 against Google search shows the leap in cost associated with GPT-4. (Source: Vsora)
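The per-query comparison can be recomputed from the figures given in this article. The sketch below assumes a total daily cost of $820 million at 100,000 queries per second, and uses the ¢0.2 and ¢3 per-query figures quoted earlier for Google search and ChatGPT-3; the names are illustrative.

```python
# Figures from the text.
DAILY_COST_USD = 820e6
QUERIES_PER_SEC = 100_000
SEARCH_CENTS_PER_QUERY = 0.2    # Google search benchmark
CHATGPT3_CENTS_PER_QUERY = 3.0  # ChatGPT-3 estimate

SECONDS_PER_DAY = 24 * 60 * 60

queries_per_day = QUERIES_PER_SEC * SECONDS_PER_DAY
gpt4_cents_per_query = DAILY_COST_USD / queries_per_day * 100

# How many times more expensive each option is than the search benchmark.
costs = {
    "Google search": SEARCH_CENTS_PER_QUERY,
    "ChatGPT-3": CHATGPT3_CENTS_PER_QUERY,
    "GPT-4 (this estimate)": gpt4_cents_per_query,
}
for name, cents in costs.items():
    ratio = cents / SEARCH_CENTS_PER_QUERY
    print(f"{name:<24} {cents:5.2f} cents/query  {ratio:5.1f}x benchmark")
```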

Lauro Rizzatti is a business consultant at Vsora.