The problem with verifying today’s designs is that we have to deal with systems that are inherently unverifiable
By Lauro Rizzatti (Contributed Content) | Thursday, June 27, 2019
Jean-Marie Brunet, senior director of marketing at Mentor, a Siemens Business, served as moderator for a well-attended and lively DVCon U.S. panel discussion on the hot topics of artificial intelligence (AI) and machine learning (ML).
The hour-long session featured panelists Raymond Nijssen, vice president and chief technologist at Achronix; Rob Aitken, fellow and director of technology at Arm; Alex Starr, senior fellow at AMD; Ty Garibay, vice president of hardware engineering at Mythic; and Saad Godil, director of Applied Deep Learning Research at Nvidia.
Part One of this mini-series covered Brunet's first question, how AI is reshaping the semiconductor industry and chip design verification in particular, along with the panelists' impressions. In Part Two, panelists discussed whether tool vendors are ready to deliver what they need to verify chips in this domain and whether those vendors are prepared to help.
Now, in Part Three of this mini-series, which is based on the panel transcript, audience members question the panelists:
Audience Question #1: Increasingly, I'm hearing about probabilistic computing, analog computing, quantum computing, and so on putting strain on both instruction sets and testing. How are such chips going to be made? I'd like to hear your views on this. The Mythic representative should have the deepest insight, since Mythic is building circuits based on these approaches; at least, that's what the company has been talking about.
The second question: Chip verification is a known thing. This is Silicon Valley; we have been doing it for a long time. In my view, it is the algorithmic novelty that is new, and those novelties come from statistical models, which are new. Many of them have not even been implemented anywhere before; they are being designed and implemented based on biology-inspired neuromorphic computing, where asynchronous messaging is more prevalent than synchronous models.
The last question is about benchmarks. People talk a lot about algorithmic benchmarks. Baidu came out with some research papers a couple of years ago. I'd like to know the latest from your standpoint.
Ty Garibay: First, thank you for even knowing we [Mythic] exist. We are implementing convolutional math in analog, and it does create its own range of verification issues. I find it a bit humorous that we're spending millions of dollars and millions of man-hours to target 100% verification of machines that are intended to be 96.8% accurate, and then treating the final bit as if it were meaningful. Oh, that's wrong because it's off by this significant bit. No, it's not wrong; it's just different from your software model. And that's really all we have right now as verification golden models.
The way that most digital AI implementations deal with this so far is to treat a software model as if it was an instruction set model. And every bit must match. Why? The hardware could be more accurate in the end, as we believe that our analog computing is more accurate in certain cases. We have to create a new paradigm for verification in this space — what does it mean to be correct when modeling digital operations in the analog world? It’s a fascinating challenge that we will say more about in the future.
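Garibay's point, that bit-exact agreement with a software golden model may be the wrong correctness criterion for analog compute, can be illustrated with a tolerance-based check. This is a hypothetical sketch; the function name, tolerance value, and data are illustrative, not Mythic's actual methodology:

```python
def check_against_golden(hw_out, golden_out, tol=1e-2):
    """Compare hardware results against a software golden model.

    Instead of demanding a bit-exact match, accept any result within
    a relative-error band, reflecting the accuracy target of the
    network rather than the rounding of one particular implementation.
    """
    for hw, gold in zip(hw_out, golden_out):
        # Relative error, guarded against division by zero
        rel_err = abs(hw - gold) / max(abs(gold), 1e-12)
        if rel_err > tol:
            return False
    return True

# A bit-exact comparison would flag this run as a failure;
# a tolerance-based check accepts it.
golden = [0.500, 0.250, 0.125]   # software golden model output
analog = [0.501, 0.249, 0.125]   # hardware output with small analog deviation
assert check_against_golden(analog, golden, tol=0.01)
```

The interesting engineering question, which the panel leaves open, is how to choose the tolerance so it tracks the network's end-to-end accuracy target rather than an arbitrary numeric bound.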
Rob Aitken: You talked about 96.8%, and it’s tempting, especially as a digital designer, to say, “Oh, well, I fudged the circuit a little bit, and now, instead of 96.8%, it’s only 96.7%. But, hey, it was only 96.8% before, so really 96.7% is probably fine.” The software people who come up with the networks in the first place will kill for that extra 0.1%. Hardware people can’t just give it away. It does lead to some interesting thoughts.
You mentioned probabilistic with quantum, and so on. That is in the future. When we start running out of other knobs, when we can’t tweak the transistors, we can’t get any more power out of what we’ve got, we can’t do these things, then I think probabilistic compute and approximate computing will start becoming much more common. That’s because there will be no other way to do better when you’ve maxed out on whatever you can get out of digital. Then you will have to start doing all the things that I am afraid of because I fear analog.
Jean-Marie Brunet: On the third question on benchmarks: There are benchmarks on frameworks. Are we going toward some standards or is it going to be like this for the foreseeable future?
Raymond Nijssen: I would guess it is going to be a mess for a while. There is no way to keep up with all these new developments. Everything is going so fast, it’s diverging and converging. We all should be focusing on this, and you start focusing on that, and then you look up and everyone’s already moved on in different directions. I think there is a need for benchmarks, there is a need to know where we stand, how good is this really, is it 96.8% or 96.7%? I think that’s what we’re going to have to live with, and that is part of the exciting times.
It is the same as the 1960s, if you remember: all these professors came up with seemingly simple algorithms that were named after them. Why? Because everything was in its infancy. Now AI/ML is in its infancy, and now is the time to come up with a Johnson's benchmark or some such methodology.
In terms of the previous question, we’re going to have to figure out a way to make reliable systems out of unreliable components.
From a design verification point of view, nothing could be more daunting than that. Yet other fields have solved this complex task. A telecom transmission channel, for example, only delivers correct data at the other end because of probabilistic methods. Your hard drive is the same: the data is only read correctly in a probabilistic fashion, with no guarantee that it will be. Yet you trust all your data, because somebody figured out how to build an extremely reliable system on an unreliable foundation.
In design verification, you have to deal with a system that is unverifiable. Somebody is going to build something reliable on top of it by adding layers, as in the error-correction analogy, or redundancy, or whatever it may be. How do you verify that? The amount of data involved is enormous, and you are going to need computers to plow through it and recognize issues, because there is no way you can simply browse log files in the future to figure it out.
Alex Starr: I am going to add on to what you said. I agree. In some ways, we're already there. If you look at what we're trying to verify when we make big designs today, it's got the hardware and the software in it, and it is not deterministic when we run it. In fact, we go out of our way to design non-deterministic systems in the name of performance, to get as much throughput as possible, and then we have to verify that everything works.
We're inherently already in a place where there is no golden model of the entire system that behaves exactly like we model it pre-silicon, and potentially post-silicon as well. We're already in this painful place, and we've got to figure out the tools to get better at reading log files to understand what's going on. Maybe we don't know what we're looking for. Have we got bottlenecks in the system? Are there issues in some dark corner where we weren't looking? How do we find those things? These are some of the top problems we have right now in the industry.
A good example is GPUs. Depending on the timing of the system, the load on the software driver, and what's happening on the host, you can end up with different image results, in some cases by design, for performance reasons. What's the golden model in that scenario? It is a challenging problem, and one we simply have to deal with today. We need to do better at solving these problems.
Saad Godil: I want to comment on this whole idea of neural nets, artificial intelligence, and the probabilistic aspect of it, and what happens if we get something wrong now and then. As Rob was saying, how do you know whether this activation was supposed to fire or not? As someone who stares at loss functions all day to see why a model is not converging, I can guarantee you: you do not want your verification problem to depend on being able to figure out why a neural network is not working. You're not going to make it. I think eventually you will hit some limits; maybe that will be the eventual result.
This is going to be a tradeoff, and you're going to make an active choice: do you really want to push? Do you want to move away from the spec? Hardware specs today are, in many cases, deterministic, and, as Alex said, sometimes you have to push that boundary. Every time you do, there is a huge verification cost.
The most important thing I did when I was working on GPU verification was to work with the architects in the early stages and stamp out any unnecessary non-determinism. That single thing had a bigger impact on our time-to-market than anything else, more than using the latest EDA methodology or new testbenches or whatever.
The best thing you can do today as an organization is to make sure that you are not adding unnecessary complexity, because it carries a huge cost. Maybe you have to add it for marketing or business reasons. I certainly wouldn't say we're there, so let's just do it and throw in the towel. Just be aware that when you do that, you're adding an incredible amount of cost. The people who are going to be successful will be the ones who make careful, intelligent tradeoffs, and they will be in a better position than their competitors.
In the final installment of this mini-series, Part Four (which will be posted next week), panelists were asked to talk about system validation, neural-network hardware, and deep-learning training and inference.