Subject: Lauro on CDNS Palladium-XP2 vs. MENT Veloce 2 vs. SNPS Zebu 3

> Category 1:
>
>   - Cadence Palladium.  Hats off to Cadence for being pioneers in
>     emulation and sustaining innovation to maintain a very competitive
>     product year-over-year.
>
>   - Mentor Veloce.  Their revenue numbers show emulation is a growing
>     segment for them.  (See ESNUG 510 #7.)  Clearly Wally and Greg
>     have been investing heavily in emulation.
>
> Category 2:
>
>   - Synopsys EVE Zebu.  This has been the choice for companies and
>     design groups doing mid-size SoCs or blocks for emulation.  It
>     is no secret that Intel was an EVE customer. (See ESNUG 508 #6.)
>     My expectation is that with the Synopsys acquisition, EVE will now
>     move upstream to challenge Cadence and Mentor at the high end.
>
>       - from http://www.deepchip.com/items/0522-04.html

From: [ Lauro Rizzatti, Emulation Consultant ]

Hi, John,

About 18 months ago, you published a multi-part overview of HW emulators by Jim Hogan in ESNUG 522 #4. Hogan's report gave the impression back then that Cadence Palladium was the undisputed leader in emulation.

Recently in the DAC'14 #1 engineering survey 7 readers mentioned MENT Veloce, 3 cited CDNS Palladium, and only one spoke on SNPS EVE ZeBu.

As you know, I was EVE's VP of marketing at one time -- but I no longer work for Synopsys -- so my analysis doesn't reflect an official Synopsys view. On the other hand, over the last few months, I've written articles on emulation technology, its market and its applications sponsored by Mentor Graphics.

---- ---- ---- ---- ---- ---- ----

Based on my knowledge of emulation from careful investigation of published data, word-of-mouth, common sense, blah, blah, blah... here's my summary of what each emulation tool has to offer today:

|                | Palladium-XP2                | Veloce 2          | Zebu Server 3     |
|Chip Structure  | massive array of 65nm custom | custom 65nm FPGAs | Xilinx 28nm FPGAs |
|# of Users      | 16 users                     | 64 users          | 5 users           |
|                | ~2.0 MHz                     | ~2.0 MHz          | ~5.0 MHz          |
|                | unpublished                  | ~44.0 kW          | ~4.0 kW           |
|Cooling System  | water cooled                 | forced air        | forced air        |
|                | SAIF & FSDB                  | SAIF & FSDB       | SAIF & FSDB       |
|SW Debug        | physical or ... in ICE and in ...                                    |

---- ---- ---- ---- ---- ---- ----

CADENCE PALLADIUM: The original architecture of Palladium was an offspring of an IBM technology that Quickturn acquired in 1996 and promoted in 1997 under the CoBALT name. Based on a vast array of Boolean processors, it was sold as an alternative to Quickturn's standard FPGA-based emulator.
In 1998, Cadence bought Quickturn and discontinued the FPGA-based approach, claiming it to be inferior to Palladium CoBALT's custom processor-based architecture in three main areas:

  - very slow setup-time and compilation time;
  - rather poorer debugging capabilities; and
  - a significant drop in execution speed as design size increased.

Over the years, Cadence launched five generations of this custom processor technology under the name of Palladium. The 5th and latest implementation, called Palladium-XP2, was introduced in 2013. It appears to be an improvement of the hardware and software of the previous Palladium-XP version -- but XP2 is NOT a brand-new emulator based on a re-spin of its 65nm custom-processor chip.

Palladium-XP2 continues to excel in very fast compilation time, an inherent benefit of the custom-processor approach. According to published specs, its compilation speed reaches 70 million gates per hour on a single workstation.

As for maximum design capacity, the benefit of a processor-based emulator is that instead of the hard limit typical of an FPGA-based emulator, it enjoys a somewhat soft limit. A Palladium user can slightly exceed the max capacity specified by Cadence -- maybe by as much as 10% -- at the expense of a drop in performance (that may be significant).

HEAT, CABLES, AND RELIABILITY

But while overall Palladium design capacity is more than adequate for most system-on-chip (SoC) designs, Palladiums demand the largest number of boxes of the three emulators to achieve a comparable capacity. Just consider that the maximum Palladium-XP2 capacity of 2.3 billion ASIC-equivalent gates, as specified in the datasheet, requires a setup of 32 interconnected boxes. Its interconnection network is a massive collection of cables that affects the reliability of the Palladium system.

From my research, the largest Palladium configuration installed today has 16 boxes for about 1.1 billion ASIC-equivalent gates. Palladium-XP2 does not scale too well.
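
As a rough worked check on the box counts above, the average capacity per box implied by the two Palladium-XP2 figures can be computed directly. This is only a sketch: the helper name is made up for illustration, and it assumes capacity is spread evenly across boxes, which the datasheet figures quoted here do not state.

```python
# Rough per-box capacity implied by the figures quoted above.
# Assumption (not from Cadence): capacity is spread evenly across boxes.

def gates_per_box(total_gates: float, boxes: int) -> float:
    """Average ASIC-equivalent gates per box (hypothetical helper)."""
    if boxes <= 0:
        raise ValueError("boxes must be positive")
    return total_gates / boxes

# Datasheet maximum: 2.3 billion gates in 32 boxes.
datasheet = gates_per_box(2.3e9, 32)
# Largest installation cited above: about 1.1 billion gates in 16 boxes.
installed = gates_per_box(1.1e9, 16)

print(f"datasheet: {datasheet / 1e6:.1f} M gates per box")
print(f"installed: {installed / 1e6:.1f} M gates per box")
```

Both cases come out at roughly 70 million gates per box, so any design approaching a billion gates needs well over a dozen cabled boxes under this even-split assumption.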
More boxes translate to larger dimensions, heavier weight and more power consumption. Palladium-XP2 runs with water cooling. One negative of its 65nm processor-based technology is that it consumes significantly more energy than a 65nm or 28nm FPGA-based emulator with equivalent capacity. This increases the cost-of-ownership and further worsens the emulation system reliability.

Palladium-XP2 supports CPF (but not UPF) power analysis and it generates the switching activity for your power estimation tools.

REALLY GOOD AT ICE

In terms of maximum speed of execution, Palladium-XP2 clocks in the range of 1.5 to 2.0 MHz, i.e., in the same ballpark as Veloce 2, both slower than ZeBu-Server 3. However, this speed is reached only in two deployment modes:

  - In-Circuit Emulation (ICE) mode
  - targetless mode

In ICE mode, a design-under-test (DUT) is mapped inside the Palladium box. It uses a socket to reach your external target system, where the rest of your system (and your test SW) is.

In targetless mode, your DUT and your test environment are all included inside the Palladium emulator. This is the case, for instance, when the testbench is synthesizable or when your DUT is crunching on embedded software that has no external dependencies.

Historically, Cadence's (and its predecessor Quickturn's) emulators were mostly used in ICE mode. This approach requires a speed adapter -- Cadence calls them "speed-bridges" -- to accommodate your chip's fast clock rate (usually hundreds of megahertz or even gigahertz) to the Palladium box's slow clock rate (one or two megahertz or less).

The long history of Palladium (CoBALT) has fostered the creation of a large library of speed bridges. This is definitely a plus for Cadence.

TRANSACTION BASED VERIFICATION (TBV)

This 1-to-2 MHz speed is NOT achieved in acceleration mode, i.e., when your DUT is driven by an external software testbench running on the workstation.
This should not be surprising if your external testbench talks to your DUT by way of a programming language interface (PLI), but it is a bit surprising if your interface is based on Direct Programming Interface (DPI) calls, which are typical for transaction-based communication.

Different vendors call it Transaction-Based Acceleration (TBA or TBX) mode or Transaction-Based Verification (TBV) mode. Regardless of the name, this verification mode is the emerging trend in the industry. It does not require human supervision to plug/unplug speed adapters when you switch from one design to the next. As such, TBV is the mandatory choice for remote access at large emulation datacenters accessible 24/7 from anywhere in the world.

Palladium-XP2 supports TBV, but it is rumored that its throughput is significantly lower than that of Veloce 2 and ZeBu 3. Consider that in recent CDNS quarterly earnings calls, Cadence CEO Lip-Bu Tan has repeatedly claimed "progress in TBA" -- indicating it is an issue that needs attention.

DEBUG

Hardware emulators are mandatory to clear a large chip of all the residual bugs that were not uncovered by Verilog/VHDL/SystemVerilog SW runs -- and to do so before final tape-out. Needless to say, debug must be efficient, which means easy to use, effective, and fast.

For general debugging, Palladium supports SystemVerilog Assertions (SVA), Universal Verification Methodology (UVM), save/restore, and functional coverage.

Palladium-XP2 also has "FullVision", defined as "at-speed full visibility of nets for typically two-million samples during runtime," and "InfiniTrace", defined as "enables unlimited trace-capture depth and allows users to revert back to any checkpoint and restart emulation from that point." Further, its "Dynamic Probes" allow for "fast waveform upload of up to 80 million samples of selected signals before run."
All of this sounds impressive, but these definitions do NOT clearly state that extending the timing window from 2 million cycles to 80 million cycles trades full visibility for a partial view of 50,000 signals that must be pre-selected at compile time.

EMBEDDED DEBUG

For embedded software validation, Palladium-XP2 supports software debugging by way of a physical JTAG connection in ICE mode at full emulation speed. This is a popular method that requires a HW debug infrastructure embedded in your DUT.

As an alternative to a physical JTAG connection, Palladium-XP2 can be deployed with a transaction-based virtual JTAG connection. A virtual JTAG presents several benefits vs. a physical JTAG:

  - Virtual can be used earlier in the design cycle.
  - Virtual removes complexities due to physical timing dependencies, making it simpler/quicker/cheaper to use.
  - Virtual lets you create massive emulation datacenters as mentioned earlier.

However, debugging your chip's software by way of a virtual JTAG connection in a Palladium-XP2 is rather slow.

In addition, it supports SystemVerilog Assertions (SVA), Universal Verification Methodology (UVM), save/restore, and functional coverage.

---- ---- ---- ---- ---- ---- ----

With two decades in business, Cadence Palladium easily enjoys the largest list of customers from all segments of the semiconductor industry.

---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ----

MENTOR VELOCE: In 2012, Mentor Graphics launched a new emulator, Veloce-2, an evolution of their 65nm custom emulator-on-a-chip first introduced with Veloce.
The "emulator-on-chip" concept was architected by merging the custom-FPGA emulator from Meta Systems, a French startup acquired by Mentor in 1996, with the Virtual Wire approach implemented in the IKOS VStation emulators purchased by Mentor in 2002.

Compilation speed of a Veloce-2 stands at about 35 million gates per hour on a farm of workstations -- a notch below Palladium-XP2.

The Veloce-2 execution speed hovers around 1.5 MHz -- with only a minor drop as design size increases. This comes from its scalable architecture, based on an active backplane that removes the interconnect bottlenecks caused by cables.

In early 2014, Mentor announced a new operating system called "Veloce OS3" that turns the emulator into a global datacenter resource.

Veloce-2 doubles the capacity of the original Veloce (launched in 2007) to 2 billion ASIC-equivalent gates with two interconnected Maximus cabinets, each accommodating 1 billion gates.

From what I can find, the largest configuration installed today is two Maximus cabinets for almost 2 billion ASIC gates. Veloce-2 scales to almost the target maximum capacity.

LESS HEAT, LESS CABLES, MORE RELIABLE

The Maximus cabinet is made up of four units interconnected internally with a backplane and limited cabling to avoid impacting reliability. Lots of cables means lots of failure points.

The Veloce-2 forced air cooling makes it easier and less expensive to install than a Palladium-XP2. And not only does air cooling save on users' A/C bills, it also increases Veloce-2's reliability as compared to a water cooled emulator.

Veloce-2 supports UPF (but not CPF) power analysis and generates switching activity for power estimation tools.

For ICE support, Veloce-2 also has a large library of speed adapters.
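
To put the two compile-speed figures quoted so far in perspective, here is a back-of-the-envelope estimate in Python. It is a sketch under loud assumptions: compile time is taken to scale linearly with gate count, the 500-million-gate design size is picked only for illustration, and the helper and table names are hypothetical. Note also that the Palladium figure is quoted for a single workstation and the Veloce figure for a farm, so the two are not strictly like-for-like.

```python
# Back-of-the-envelope compile-time estimate from the quoted throughput.
# Assumption: compile time scales linearly with gate count, which real
# compile flows only approximate.

COMPILE_RATE_GATES_PER_HOUR = {
    "Palladium-XP2": 70e6,  # quoted for a single workstation
    "Veloce-2": 35e6,       # quoted for a farm of workstations
}

def compile_hours(design_gates: float, emulator: str) -> float:
    """Estimated compile time in hours (hypothetical helper)."""
    return design_gates / COMPILE_RATE_GATES_PER_HOUR[emulator]

design = 500e6  # example design size, chosen only for illustration
for name in COMPILE_RATE_GATES_PER_HOUR:
    print(f"{name}: {compile_hours(design, name):.1f} hours")
```

Under these assumptions the Veloce-2 estimate is simply twice the Palladium-XP2 one for any design size, since the quoted rates differ by a factor of two.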
TRANSACTION BASED VERIFICATION

Mentor is actively pushing its Transaction-Based-Acceleration (TBX), because it does not require human supervision to plug/unplug speed adapters when you switch from one design to the next -- thus enabling large remote emulation datacenters.

While no vendor publishes specs for their throughput in acceleration mode, Veloce-2 users claim that they've seen no degradation in speed while switching from ICE to TBX. In fact, a few Veloce-2 users reported higher throughput in TBX than in ICE.

One negative aspect of the TBX acceleration mode is that you need to create a testbench. Mentor addressed this by introducing their VirtuaLAB concept, which is a virtual target system that's functionally equivalent to a physical target system, but without the need for cables or speed adapters.

The VirtuaLAB is driven by operating systems, drivers, and stacks of software running on the emulator. It eliminates the need to create a testbench -- a foreign concept for a software developer used to writing software programs.

Veloce-2 does 100% visibility without compilation. Its on-board memories, coupled to the 65nm emulator-on-chip devices, store up to 500 K samples of "compressed" data -- including registers and memory contents. The data is uploaded to the host workstation by way of wideband channels, and there's a reconstruction mechanism running on the host computer that rebuilds your waveforms for all of your combinational logic nodes.

DEBUG

While some Veloce-2 debug is similar to Palladium-XP2's, Mentor devised a faster debugging scheme based on the on-demand waveform streaming of a few selected signals without requiring compilation.

A Veloce-2 debug process called "back-replay debug" consists of rewinding and re-running a test with added visibility such as: assertions, monitors, trackers, $display, and waveform capture.
It removes the need for a testbench and reduces the amount of data sent to the host, providing a boost in time-to-visibility in a fully deterministic, repetitive environment.

Like Palladium, it also supports SVAs, UVM, save/restore, and functional coverage.

EMBEDDED DEBUG

In addition to physical and virtual JTAG connections like a Palladium box, a Veloce-2 can do tracing software debug by way of Codelink. When your DUT does not have any hardware debug infrastructure in place, Codelink traces the state of your processor by observing signals in and around the RTL code of the processors in your design -- and it does not interfere with the operation of your design being run. With Codelink, the HW developer can begin debugging earlier in the design cycle and offline.

Another Veloce-2 approach replaces the RTL processing cores in your SoC design with QEMU-based cores -- and then moves them into the host connected to Veloce-2 by way of transactors. The emulator continues to execute the remaining synthesizable portion of your SoC, pushing performance from the 1 to 3 MIPS seen when your entire SoC is mapped inside Veloce-2 up to an upper limit of 100 MIPS. With this, some users are booting an Android RTOS, and then running applications like Antutu for performance characterization prior to silicon.

---- ---- ---- ---- ---- ---- ----

Today, Mentor Veloce cannot match the sheer number of customers claimed by Cadence Palladium but, in the past 2 to 3 years, MENT has increased its own customer base by taking a bite out of the CDNS customer base.

---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ----

SYNOPSYS EVE ZEBU: After acquiring EVE in 2012, Synopsys launched the ZeBu Server 3 in 2014. EVE was an early developer of standard FPGA-based emulators. The name ZeBu stands for "zero-bugs", the goal being that your design has no bugs.
ZeBu Server 3 is based on the Xilinx 28nm Virtex7-LX2000T. While all the FPGA prototyping companies assume the ~12 million gate capacity of the V7-LX2000T, Synopsys elected to lower it (i.e. ~50% utilization) to about 6.5 million gates.

From what I can find, the largest ZeBu configuration installed today is 7 boxes for close to 2 billion ASIC gates. ZeBu Server 3 scales nicely, but it may not reach the target of 3 billion ASIC-equivalent gates.

SLOW COMPILES, FAST RUNS

Its horrible 5 M gates/hour design compilation speed puts the ZeBu Server 3 at a disadvantage vis-a-vis the 70 M gates/hour Palladium and the 35 M gates/hour Veloce. It compiles at 1/14th the speed of a Palladium!

The main compile-time hurdle is in the place-and-route of the Xilinx FPGAs. Synopsys does not publish data, but it is public knowledge that the P&R of a Virtex7-LX2000T may take several hours -- even while limiting the resource utilization to 50% or less.

The ZeBu Server 3 leads the pack with the highest clock speed, bordering on the performance of 28nm FPGA prototyping for designs of 100+ million gates. But this performance drops significantly when multiple ZeBu boxes are used, due to the massive interconnecting cabling.

ZeBu Server 3 supports UPF (but not CPF) power analysis and it generates switching activity for power estimation tools.

Compared to the Palladium XP2 and the Veloce 2, the forced air cooling of the ZeBu 3 plus its small physical dimensions, its light weight, and its low power consumption give the ZeBu 3 relatively high reliability.

LESS ICE, MORE TBV

Apparently Synopsys continues EVE's approach of not actively promoting ICE, and instead supports TBV. But just like the Veloce 2, the ZeBu Server 3 also performs TBV at speeds in the same ballpark as its ICE.

DEBUG

Design debug offers 100% visibility via dynamic probing, a feature that takes advantage of the built-in scan chain in the Xilinx Virtex FPGAs.
While dynamic probing does not require compilation, it comes with a drawback: retrieving data takes a long time, at a speed of a few tens of hertz.

The sequential data activity is not stored in on-board memories. Rather, it is sent directly to the host server where the combinational activity is recreated via a proprietary mechanism. EVE/Synopsys points out that the overall performance of doing:

  - data retrieval via dynamic probing,
  - data transfer to the server, and
  - reconstruction of combinational data

is comparable to Palladium or Veloce when doing these same tasks combined.

The ZeBu Server 3 also supports SVAs and save/restore, but there is NO mention of UVM nor functional coverage.

---- ---- ---- ---- ---- ---- ----

By all indications, the ZeBu is trailing behind Palladium and Veloce in the number of customers. Monitoring the quarterly earnings calls reveals that Mentor and Cadence boast success after success in the emulation space. Not so for Synopsys. This may be company policy, or it may reflect the difficulty in reporting a sales "win".

---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ----

CONCLUSION:
Palladium-XP2 offers the fastest compilation speed combined with excellent HW debug capabilities. Fast 2.0 MHz execution speed. For its 1.1 B gate max capacity, scalability is questionable when design sizes approach or exceed a billion gates. Palladium is very strong in ICE, but has noticeably slower TBV compared to its rivals. CPF power analysis. Palladium's large physical footprint, energy consumption, water cooling, and reliability are not the best.

Veloce-2 does fast compilation and excellent debug, but it also has added stuff like: on-demand waveform streaming of a few selected signals without requiring compilation, less need to write testbenches, QEMU-based cores, and tracing software. Fast 2.0 MHz execution speed. It runs both TBV and ICE equally fast. Large 2.0 B gate max capacity, and scalable. UPF power analysis. Its backplane with much less cabling, its air cooling, and its lower energy use give the Veloce-2 box good reliability.

ZeBu Server 3 leads the pack with the highest 5.0 MHz execution speed, and a large 2.0 B gate max capacity. Ideal for software debug in TBV. But no mention of ICE, no UVM, no functional coverage. And its compilation speed is 1/14th that of Palladium. Data retrieval is painfully slow, in the tens of hertz. UPF power analysis. Small size, low power, and air cooling give ZeBu Server 3 high reliability.

Today, all three emulator choices can do the job, some better than others.

    - Lauro Rizzatti, emulation consultant