2024 DVCon US Panel: Overcoming the challenges of multi-die systems verification

2024 DVCon was very busy this year. Bernard Murphy and I were in attendance for SemiWiki, he has already written about it.  Multi die and chiplets was again a popular topic. Lauro Rizzatti, a consultant specializing in hardware-assisted verification, moderated an engaging panel, sponsored by Synopsys, focusing on the intricacies of verifying multi-die systems. The panel, which attracted a significant audience, included esteemed experts such as Alex Starr, a Corporate Fellow at AMD; Bharat Vinta, Director of Hardware Engineering at Nvidia; Divyang Agrawal, Senior Director of RISC-V Cores at Tenstorrent; and Dr. Arturo Salz, a distinguished Fellow at Synopsys.

Presented below is a condensed transcript of the panel discussion edited for clarity and coherence.

Rizzatti: How is multi-die evolving and growing? More specifically, what advantages have you experienced? Any drawback you can share?

Starr: By now, I think everybody has probably realized that AMD’s strategy is multi-die. Many years ago, ahead of the industry, we made a big bet on multi-die solutions, including scalability and practicality. Today, I would reflect and say our bet paid off to the extent that we’re living that dream right now. And that dream looks something like many dies per package, being able to use different process geometries for each of those dies as well to exploit the best out of each technology in terms of I/O versus compute, power/performance trade-offs.

Vinta: Pushed by the demand for increasing performance from generation to generation, modern chip sizes are growing so huge that a single die cannot accommodate the capacity we need any longer. Multi-die, as Alex put it, is here right now, it’s becoming a necessity not only today, but into the future. A multitude of upcoming products are going to reuse chiplets.

Agrawal: Coming from a startup, I have a slightly different view. Multi-die gets you the flexibility of mixing and matching different technologies. This is a significant help for small companies since we can focus on what our core competency is rather than worrying about the entire ecosystem.

Salz: I agree with all three of you because that’s largely what it is. Monolithic SoCs are hitting a reticle limit, we cannot grow them any bigger, they are low-yield, high-cost designs. We had to switch to multi-die and the benefits include the ability to mix and match different technologies. Now that you can mount and stack chiplets the interposer has no reticle limit, hence, there is no foreseeable limit for each of these SoCs. Size and capacity become the big challenge

Rizzatti: Let’s talk about adoption of multi-die design. What are the challenges to adopt the technology and what changes have you experienced?

Starr: We have different teams for different chiplets. All of them work independently but have to deliver on a common schedule to go into the package. While the teams are inherently tied, they are slightly decoupled in their schedules. Making sure that the different die work together as you’re co-developing them is a challenge.

You can verify each individual die, but, unfortunately, the real functionality of the device requires all of those dies to be there, forcing you to do what we used to call SoC simulation – I don’t even know what a SoC is anymore – you now have all of those components assembled together in such multitude that RTL simulators are not fast enough to perform any real testing at this system level. That’s why there has been a large growth in emulation/prototyping deployment because they’re the only engines that can perform this task.

Vinta: Multi-die introduces a major challenge when they all share the same delivery schedules. To meet the tapeout schedule, you not only have to perform die-level verification but also full chip verification. You need to verify the full SoC under all use cases scenarios.

Agrawal: I tend to think of everything backwards from a silicon standpoint. If your compute is coming in a little early you may have a platform on which do silicon bring up and not wait for everything else to come in. What if my DDR is busted? What if my HBM is busted? How do you compare, how do you combine, mix and match those things.

Salz: When you get into system level, you’re not dealing with just a system but a collection of systems communicating through interconnect fabrics. That’s a big difference that RTL designers are not used to thinking about. You have jitter or coherency issues, errors, guaranteed delivery, all things engineers commonly deal with in networking. It really is a bunch of networks on the chip but we’re not thinking about it that way. You need to plan this out all the way at the architectural level. You need to think about floor planning before you write any RTL code. You need to think about how you are going to test these chiplets. Are you going to test them each time we integrate one? What happens to different DPM or yields for different dies? Semiconductor makers are opportunistic. If you build a 16 core engine and two of them don’t work, you label it as an eight core piece and sell it. When you have 10 chiplets, you can get a factorial number in the millions of products. It can’t work that way.

Rizzatti: What are the specific challenges in verification and validation? Obviously, you need emulation and prototyping, can you possibly quantify these issues?

Starr: In terms of emulation capacity, we’ve grown 225X over the last 10 years and a large part of that is because of the increased complexity of chiplet-based designs. That’s a reference point for quantification.

I would like to add that, as Arturo mentioned, the focus on making sure you’re performing correct-by-construction design is more important now than ever before. In a monolithic chip-die environment you could get away with SoC level verification and just catch bugs that you may have missed in your IP. That is just really hard to do in a multi-die design.

Vinta: With the chiplet approach, there is no end in sight for how big the chip could grow to. System-level verification of full chip calls for huge emulation capacity requirements, particularly for the use cases that require full system emulation. It’s a challenge not only for emulation but also for prototyping. The capacity could easily increase an order of magnitude from chip to chip. That is one of my primary concerns, in the sense of “how do we configure emulation and prototyping systems that could handle these full system level sizes?”

Agrawal: With so many interfaces connected together, how do you even guarantee system level performance? This was a much cleaner problem to address when you had a monolithic die, but when you have chiplets then the performance is the least common denominator of all the interfaces, hoops that a transition has to go through.

Salz: That’s a very good point. By the way, the whole industry hinges on having standard interfaces. The future when you can buy a chiplet from a supplier and integrate it into your chip is only going to be possible if you have standard interfaces. We need more and better interfaces, such as UCIe.

By the way you don’t need to go to emulation right away. You do need emulation when you’re going to run software cycles, at the application-level, but for basic configuration testing you can use a mix of hybrid models and simulation. If you throw the entire system at it, you’ve got a big issue because emulation capacity is not growing as fast as these systems are growing, so that’s going to be a big challenge too.

Rizzatti: Are the tools available today adequate for the job? Do you need different tools? Have you developed anything inhouse that you couldn’t find on the market?

Starr: PSS portable stimulus is an important tool for chiplet verification. It’s because a lot of functionality of these designs is not just in RTL anymore, you’ve got tons of firmware components, and you need to be able to test out the systemic nature of these chiplet-based designs. Portable stimulus is going to give us a path to have a highly efficient, close to the metal stimulus that can exercise things at the system-level.

Vinta: From the tools and methodologies point of view, given that there is a need to do verification at the chiplet-level as well as at the system-level, you would want to simulate the chiplets individually and then, if possible, simulate at full system-level. The same goes for emulation and prototyping. Emulate and prototype at the chiplet-level as well as at the system-level if you can afford to do it. From the tools perspective, chiplet-level simulation is pretty much like monolithic chip simulation. Verification engineers are knowledgeable and experienced to that methodology.

Agrawal: No good debug tools are out there where you could combine multiple chiplets and debug something.

From a user standpoint, if you have a CPU-based chiplet and you’re running a Spec benchmark or 100 million instructions per workload on your multi-die package then something fails, maybe it’s functional performance, where do you start? What do you look at? If I bring that design up in Verdi it would take forever.

When you verify a large language model and run a data flow graph and you’re placing different pieces or snippets of the model across different cores, whether Tenstorrent cores or CPU cores, and you have to know at that point whether your placement is correct, how can you answer that question? There’s absolute lack of good visibility tools that can help verification engineers moving to multi-die design right now.

Salz: I do agree with Alex that portable stimulus is a good place to start because you want to do scenario testing, and that’s well suited for doing scenario testing with consumer-producer schemes that pick snippets of code needed for the test.

There are things to do for debug. Divyang, I think you’re thinking of old style waveform dumping for the whole SoC, and that is never going to work. You need to think about transaction level debug. There are features in Verdi to enable transaction level debug, but you need to create the transactions. I’ve seen people grab like a CPU transaction which typically is just the instructions and look at it and say, there’s a bug right there, or no, the problem is not in the CPU. Most of the time, north of 90%, the problem sits in the firmware or in the software, so that’s a good place to start as well.

Rizzatti: If there is such a thing as a wish-list for multi-die system verification, what would that wish-list include?

Starr: We probably need something like a thousand times faster verification, but typically we see perhaps a 2X improvement per generation in these technologies today. The EDA solutions are not keeping up with the demands of this scaling.

Some of that’s just inherent in the nature of things in that you can’t create technologies that are going to outpace the new technology you’re actually building. But we still need to come up with some novel ways of doing things and we can do all the things we discussed such as divide and conquer, hybrid modeling, and surrogate models.

Vinta: I 100% agree. Capacity and throughput need to be addressed. Current platforms are not going to scale, at least not in the near future. We would need to figure out how to divide and conquer, as Alex noted. Making sure with the given footprint how do you get more testing done, more verification up front. And then on top of it, address the debug questions that Divyang and Arturo have brought up.

Agrawal: Not exactly tool specific, but it would be nice to have a standard for some of these methodologies to talk to each other. Right now, it’s vendor specific. It would be nice to have some way of plugging in and playing different standards together so things just work right so people can focus on their core competencies rather than having to deal with what they don’t know.

Salz: It’s interesting that nobody’s brought up the “when do know you’re done?”

It’s an infinite process. You can keep simulating/verifying and that brings to mind the question of coverage. We understand some coverage at the block level, but at the system level is scenario driven. You can dream up more and more scenarios, each application brings something else. That’s an interesting problem that we have not yet addressed.

Rizzatti: This concludes our panel discussion for today. I want to thank all the panelists for offering their time and for sharing their insights into the multi-die verification challenges and solutions.