Hyperscale Data Centers: A Game Changer for SSD Designs

Changes to 2020 SSD controllers versus 2018 SSD controllers are comparable to the task of migrating from HDD controllers to SSD controllers

Source: EEWeb

By Lauro Rizzatti (Contributed Content) | Thursday, December 20, 2018

In the technology sector, occasionally a new requirement will drive a minor technical concern into a big issue. If not addressed properly and in a timely manner, the unexpected fallout may jeopardize the life of a product, if not an entire company.

Hyperscale requirements for storage in data centers provide a good example. Seemingly out of the blue, what is known as non-deterministic, or random, latency of solid-state drives (SSDs) is becoming the “make or break” issue for the success of SSDs in hyperscale data centers. Given that SSDs have replaced hard disk drives (HDDs) as the primary choice for storage, random latency could force a storage player that ignored or downplayed the problem out of business.

First, let’s review what a hyperscale data center is all about. Hyperscale refers to the ability of a data center infrastructure to scale with demand for storage. Streaming movies and live video events are good examples: if everyone tunes into a live event, the data center may suddenly need to serve 100 million users instead of one million.

Data centers must keep up with swift and dynamic demand. By 2020, more than half of all data center traffic will come from hyperscale data centers, which means that to sell a storage device into these sockets, a vendor must meet hyperscale requirements.

This constraint is imposing radical changes on the architecture of an SSD controller. The SSD controller of 2020 will be different from the controller of 2018. Some challenges are familiar, such as latency, power, and high performance in terms of input/output operations per second (IOPS) and bandwidth. Others are new: the endurance and reliability issues of quad-level cell (QLC) NAND will be more complicated to manage than those of triple-level cell (TLC) NAND, and deterministic I/O with precise latency will become critical.

Another challenge is the convergence of networking and storage interfaces. New possibilities will open up, but they come with the task of integrating a new interface. NVMe over Fabrics (NVMe-oF), a standard that allows storage devices to be attached directly to a network fabric such as Ethernet, bypassing the need for host CPU involvement in the transfer, is gaining traction in this space.

As mentioned, one of the more complicated issues is deterministic I/O, or deterministic latency. An SSD controller managing NAND flash must juggle several simultaneous activities, among them write amplification, garbage collection (which moves data around on the NAND without the host’s knowledge), and error retries, and the combined result is non-deterministic latency. The consequence can be failure to qualify a drive for its intended market, and it is especially damaging in the hyperscale data center. Once a drive has been filled and these processes start in earnest, the performance the host sees can be orders of magnitude worse than when the drive is fresh out of the box.
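To make the effect concrete, the toy model below is an illustration only; the latency figures and the garbage-collection probability are assumed, not measured. It shows how a drive that looks perfectly well-behaved fresh out of the box develops a long latency tail once it has been filled and background garbage collection starts colliding with host writes.

```cpp
// Toy model of SSD write latency: a hedged illustration, not a real controller.
// Once the drive has been written full, background garbage collection (GC)
// occasionally stalls host writes, producing the long latency tail that makes
// the drive look non-deterministic to the host.
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    const double base_write_us  = 20.0;   // illustrative NAND program latency
    const double gc_stall_us    = 2000.0; // illustrative stall while GC relocates data
    const double gc_probability = 0.02;   // assumed chance a write collides with GC

    auto run = [&](bool drive_full, int writes) {
        double worst = 0.0, total = 0.0;
        for (int i = 0; i < writes; ++i) {
            double latency = base_write_us;
            // GC only kicks in once free blocks are scarce, i.e., after the
            // drive has been filled at least once.
            if (drive_full && coin(rng) < gc_probability)
                latency += gc_stall_us;
            total += latency;
            if (latency > worst) worst = latency;
        }
        std::printf("%-14s avg %7.1f us   worst %7.1f us\n",
                    drive_full ? "steady state:" : "fresh drive:",
                    total / writes, worst);
    };

    run(false, 100000); // fresh-out-of-the-box behavior
    run(true,  100000); // after the drive has been filled once
    return 0;
}
```

Running it shows the steady-state average several times higher, and the worst case roughly two orders of magnitude higher, than the fresh-drive figures; that gap is exactly what a qualification team has to bound.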

In a hyperscale data center, random latency is not acceptable. Non-determinism is frustrating with a single drive; multiplied across thousands or tens of thousands of drives in a data center, it becomes intolerable. The host must be informed when garbage collection is going to start and how quickly the drive can finish the job. Knowing what the latencies will be, in order to prioritize SSD operations, is imperative; at the same time, determining those latencies before silicon is increasingly difficult.
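What such a handshake might look like from the host’s side is sketched below. The interface names (GcWindow, query_gc_window, pick_drive) are hypothetical, invented for the illustration and not part of any real driver API; in practice, NVMe’s Predictable Latency Mode offers a comparable mechanism for advertising quiet and busy windows.

```cpp
// Hedged sketch of the host/drive handshake the article calls for: the drive
// tells the host when background work (e.g., GC) will start and how long it
// will run, so the host can steer latency-sensitive I/O elsewhere.
// All names and numbers below are hypothetical placeholders.
#include <cstdint>
#include <cstdio>

struct GcWindow {
    uint64_t starts_in_us;   // time until background GC begins
    uint64_t duration_us;    // how long the drive expects GC to run
};

// Hypothetical per-drive query a host could issue.
GcWindow query_gc_window(int drive_id) {
    // Placeholder values for illustration only.
    return {drive_id % 2 ? 500u : 50000u, 3000u};
}

// Route a latency-critical write to a drive that is not about to start GC.
int pick_drive(int num_drives, uint64_t io_budget_us) {
    for (int d = 0; d < num_drives; ++d) {
        GcWindow w = query_gc_window(d);
        if (w.starts_in_us > io_budget_us)   // drive is quiet long enough
            return d;
    }
    return 0; // fall back; a real scheduler would queue or degrade gracefully
}

int main() {
    int target = pick_drive(4, 1000);
    std::printf("latency-critical write routed to drive %d\n", target);
    return 0;
}
```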

By 2020, demand for storage in the data center is expected to increase by 5X to 10X, from two terabytes to 10 terabytes for a normal enterprise data center and to 20 terabytes for a hyperscale data center. By 2020, 47% of all servers, 68% of all processing, 57% of all data stored, and 53% of all traffic will be hyperscale. In order to sell drives into a socket in the hyperscale data center, SSD providers must meet hyperscale requirements.

The SSD architecture necessary to support such a requirement mandates full pre-silicon performance validation of the design to ensure the SSD meets tough specifications. This includes bandwidth and latency during all operations of the controller.

Performance/Latency Validation of SSDs
Performance validation of a system-on-chip (SoC) design is a challenging task. Register-transfer-level (RTL) simulation has the timing accuracy, but not the processing power to run the many millions of clock cycles needed to confirm that the design performs within a few percentage points of the actual silicon. Hardware emulation has the processing power to perform the job. It is worth noting that while emulation is not timing accurate, it is cycle accurate, which is what is required to establish how many cycles are needed to complete an operation, or how many cycles elapse between an input request and the corresponding output response. That is what latency is all about.
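The measurement itself is simple once the platform is cycle accurate. The sketch below uses a minimal stand-in for an emulated design (the 120-cycle pipeline depth is assumed purely for illustration) and measures latency as a count of clock cycles between request and response; that count holds whether a cycle takes 500 nanoseconds in the emulator or 2 nanoseconds in silicon.

```cpp
// Minimal sketch of cycle-accurate latency measurement: latency is the number
// of clock cycles between issuing a request and observing the response.
// DutModel is a stand-in for the emulated design with an assumed fixed
// pipeline depth; the point is the counting, not the model.
#include <cstdio>

class DutModel {
    int busy_cycles_ = 0;
public:
    void issue_request() { busy_cycles_ = 120; } // assumed internal pipeline depth
    void clock() { if (busy_cycles_ > 0) --busy_cycles_; }
    bool response_ready() const { return busy_cycles_ == 0; }
};

int main() {
    DutModel dut;
    dut.issue_request();

    long cycle = 0;
    do {
        dut.clock();       // advance the design by one clock cycle
        ++cycle;
    } while (!dut.response_ready());

    // The cycle count is the same at any clock frequency; only the wall-clock
    // time per cycle differs between emulation and silicon.
    std::printf("request-to-response latency: %ld cycles\n", cycle);
    return 0;
}
```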

The classic in-circuit-emulation (ICE) mode of deployment is not suitable for the task. In ICE mode, a physical target system drives the design under test (DUT) represented by a software model compiled and mapped inside the emulator.

This setup requires inserting speed adapters, essentially FIFO buffers, between the relatively slow DUT (~1MHz to 2MHz) and the fast target system (100MHz to 1GHz). The adapters distort the ratio between the slow design clocks and the fast target-system clocks, and under these conditions no meaningful performance/latency evaluation can be accomplished.
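A back-of-the-envelope calculation shows the scale of the distortion. Using the clock figures quoted above, plus an assumed 500MHz production clock for the controller, the host-to-DUT clock ratio seen through the speed adapter differs from the ratio in silicon by more than two orders of magnitude:

```cpp
// Rough illustration of why speed adapters break latency measurements.
// The emulation and target-system clocks are the figures quoted in the text;
// the 500 MHz production clock is an assumption for the sake of the example.
#include <cstdio>

int main() {
    const double dut_emulation_hz = 2e6;    // DUT inside the emulator (from the text)
    const double target_system_hz = 1e9;    // physical host interface (from the text)
    const double dut_silicon_hz   = 500e6;  // assumed production clock, for illustration

    std::printf("ICE setup: host/DUT clock ratio = %.0f : 1\n",
                target_system_hz / dut_emulation_hz);
    std::printf("Silicon:   host/DUT clock ratio = %.0f : 1\n",
                target_system_hz / dut_silicon_hz);
    // The FIFO speed adapter hides this 250x discrepancy by buffering traffic,
    // so cycle counts measured through it say nothing about real latency.
    return 0;
}
```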

Virtual Emulation as an Alternative to ICE
ICE mode can still be used for the media, which is usually mounted on a NAND daughter card and attached to the emulator with special cables; the NAND is generally tolerant of speed variations. That is not the case for the host interface, which must be virtualized to get meaningful performance verification.

In virtual emulation mode, the target system is described as a C++ software model running on the host workstation and driving the DUT compiled inside the emulator. Since both the DUT and the target system are models, their operating frequencies can be set to any value without the need for speed adapters. As simple as it sounds, the clock ratio between the DUT and the target system is preserved, and performance and latency can be validated to within a few percentage points of the actual silicon.
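The sketch below illustrates the principle under stated assumptions: the class names (VirtualHost, DutModel) and the 100-cycle command latency are invented for the illustration and are not taken from any emulator’s API. Both the virtual host and the DUT are advanced as models at a chosen clock ratio, so request-to-response latency can be counted in DUT cycles exactly as it would accrue in silicon.

```cpp
// Hedged sketch of the virtual setup: both the host (target system) and the
// DUT are software models, so they can be advanced at any chosen clock ratio
// without a speed adapter. The ratio and latency below are assumptions for
// illustration only.
#include <cstdio>
#include <queue>

struct DutModel {                 // stands in for the design compiled into the emulator
    std::queue<int> pending;      // outstanding commands
    int latency_counter = 0;
    void clock() {
        if (!pending.empty() && ++latency_counter >= 100) {  // assumed 100-cycle command latency
            pending.pop();
            latency_counter = 0;
        }
    }
};

struct VirtualHost {              // C++ model of the target system (e.g., an NVMe host)
    int issued = 0;
    void clock(DutModel& dut) {
        if (issued < 1) dut.pending.push(issued++);  // issue a single command
    }
};

int main() {
    const int host_ticks_per_dut_tick = 2;  // e.g., 1 GHz host interface vs. an assumed 500 MHz controller clock
    DutModel dut;
    VirtualHost host;

    long dut_cycles = 0;
    while (true) {
        for (int i = 0; i < host_ticks_per_dut_tick; ++i)
            host.clock(dut);                // host-clock ticks at the modeled ratio
        dut.clock();                        // one DUT-clock tick
        ++dut_cycles;
        if (host.issued == 1 && dut.pending.empty()) {
            std::printf("command completed after %ld DUT cycles\n", dut_cycles);
            return 0;
        }
    }
}
```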

The virtual emulation mode was conceived about 20 years ago by IKOS Systems (acquired in 2002 by Mentor, a Siemens business) to interface a software testbench described at a high level of abstraction to the DUT. This mode opened up a wide range of deployments for emulation, but ICE was, and still is, considered mandatory for the ability to exercise the DUT with real traffic.

For precise design metrics of this kind, however, the virtual mode is now better and, in fact, compulsory. It is the only solution that can validate SSD designs for hyperscale data centers with a high degree of accuracy compared to the real silicon.

Conclusions
Hyperscale data centers are a game changer. The changes required of the 2020 SSD controller are comparable to the move from HDD controllers to SSD controllers. Some successful SSD companies saw this coming last year and required virtual solutions for SSD controller verification. Others decided against taking this route; they thought it wasn’t an important investment, and today it may be just too late.

Dr. Lauro Rizzatti is a marketing expert and consultant in EDA. He held management positions in product marketing, technical marketing and engineering at EVE (acquired by Synopsys), Get2Chip (acquired by Cadence), Synopsys, Mentor Graphics, Teradyne, Alcatel and Italtel. Dr. Rizzatti has published numerous articles and technical papers in industry publications and has presented at various international technical conferences around the globe. He holds a doctorate in Electronic Engineering from the Università degli Studi di Trieste, Italy.

Ben Whitehead has been in the storage industry developing verification solutions for nearly 20 years working for LSI Logic, Seagate and, most recently, managing SSD controller teams at Micron. His leadership with verification methodologies in storage technologies led him to his current position as a Storage Product Specialist at Mentor, a Siemens Business.