Tradeoffs Between On-Premise And On-Cloud Design

Experts at the Table: Semiconductor Engineering sat down to discuss how and why companies are dividing up work on-premise and in the cloud, and what to watch out for, with Philip Steinke, fellow, CAD infrastructure and physical design at AMD; Mahesh Turaga, vice president of business development for cloud at Cadence Design Systems; Richard Ho, vice president of hardware engineering at Lightmatter; Craig Johnson, vice president of cloud solutions at Siemens Digital Industries Software; and Rob Aitken, fellow at Synopsys. What follows are excerpts of that conversation, which was held in front of a live audience at the Design Automation Conference. Part one of this discussion is here.

SE: With so much at stake in chip design today, how do you achieve the best ROI with cloud resources?

Steinke: When selecting workflows for cloud enablement, we started by looking at the data. Which ones have a manageable set of data that can be encapsulated, shifted to a cloud-hosted data center to do some sort of compute, and then have a reasonable amount of results that come back? One of the areas we focused on was front-end verification. My preferred flow there involves keeping our builds on-premise, bundling up the model itself along with the test stimulus, and sending that out to do the actual simulation activity on the cloud compute. The other class of workloads we've done cloud enablement for is full-chip signoff: full-chip runs for static timing, physical verification, and power. That's mainly because in a place-and-route type environment you get into a regular cadence of daily ECOs making changes, and there's already a data management setup, with releases being done of the design. So we're able to put hooks into that release mechanism, not just to put a release in some sort of local release volume in our own data center, but to then push that data to a cloud data center that has been selected for executing those jobs. One concern with a workload that big is that if you already have your merged OASIS, or if you've collected all of the specs for the design you want to run static timing on, that's a significant chunk of data to shift all at once. But by updating the block-level release methodology, the data trickles in as each block releases through the day. That way you can kick off the cloud-hosted, full-chip analysis job with lower latency. The main challenge I've seen there is access to good cloud VMs with enough memory to run those big jobs. That's another space where we continue to push our cloud partners to offer solutions with plenty of RAM for chip design companies to use.
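
To make that concrete, here is a minimal sketch of the kind of release hook Steinke describes, written against a generic S3-style object store. The bucket name, block list, and `submit_fullchip_job` stub are all hypothetical illustrations, not AMD's actual tooling:

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "example-eda-releases"                  # hypothetical staging bucket
EXPECTED_BLOCKS = {"cpu0", "cpu1", "l3", "noc"}  # illustrative block list

def released_blocks() -> set:
    """Return the set of blocks that have already landed in the bucket."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="release/")
    return {obj["Key"].rsplit("/", 1)[-1].rsplit(".", 1)[0]
            for obj in resp.get("Contents", [])}

def release_block(block: str, local_path: str) -> None:
    """Hook into the block-level release: in addition to the on-prem
    release volume, push the block's data to cloud storage as it
    releases, so full-chip input data trickles in through the day."""
    s3.upload_file(local_path, BUCKET, f"release/{block}.oas")
    if released_blocks() >= EXPECTED_BLOCKS:
        submit_fullchip_job()  # last block is in; start with low latency

def submit_fullchip_job() -> None:
    """Placeholder for submitting the cloud-hosted full-chip STA/PV/power
    run via whatever scheduler the cloud data center exposes."""
    print("all blocks staged -- submitting full-chip analysis job")
```

The point of the pattern is that the data transfer overlaps with the design day, so the full-chip job is never gated on one large upload.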

SE: What advice can you give about discerning when a workload should be done on prem versus on cloud?

Aitken: There’s an interesting dynamic we see just in the representatives on this panel, because the way that Richard might approach it, and the way Phil might approach it will be different. One is very focused on a design moving through the peaks and valleys in terms of needs. At AMD, presumably, there’s lots of designs going on all the time, so there’s an initial effort in terms of just what it is they’re trying to do. What infrastructure makes sense if you’re trying to get to a world where you’re going to do all your design on the cloud, and you’d just rather not have an on-prem data center at all? The way you approach that is going to be different than if you’re using cloud as a backup and expansion capability for a massive infrastructure you have already.

SE: Practically speaking, how do you decide?

Ho: You look at the data. How much data do you have? How much data are you going to generate? And what do you need back? The key to making it successful is that the information you want back on-prem from the cloud has to be minimal, so it's just the reports and the results of your regressions that come back. Our build actually stays on our small on-prem setup. We ship it up, run our simulations out there, and do our own coverage analysis. Then we ship the results back, and that's very small, which works well. The back end is different. On the physical design side, you ship the design out there, and you want it to stay in the cloud as long as you can because those databases are enormous; you really don't want them to come back at all. At that point it's infrastructure as a service. You just have your people log into the cloud and do all of the physical design up there until you get the GDS. There, the limiter is the stuff inside the machine, such as how much memory you can get. It's actually very expensive to have very large virtual machines in the cloud. Quite often it's cheaper to buy your own. We haven't talked about cost. The cost of cloud is not what people think. It's pretty high, often more than on-prem, so you have to balance that against the advantage of being flexible and having access to big memory resources. And the way that looks is going to be very individual for each customer.
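
As a sketch of the "ship little back" discipline Ho describes, the snippet below pulls only small report and coverage artifacts from a cloud results bucket, leaving waveforms and databases in place. The bucket name, prefixes, and suffix list are assumptions for illustration:

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
BUCKET = "example-regression-results"  # hypothetical results bucket

# Only small artifacts come back on-prem: reports, logs, coverage summaries.
# Waveforms and simulation databases stay in the cloud.
KEEP_SUFFIXES = (".rpt", ".log.gz", ".cov.json")

def fetch_results(run_id: str, dest: str = "./results") -> None:
    """Pull back only the minimal result set from a cloud regression run."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"runs/{run_id}/")
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key.endswith(KEEP_SUFFIXES):
            s3.download_file(BUCKET, key, str(Path(dest) / Path(key).name))

fetch_results("nightly-2025-06-01")    # illustrative run ID
```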

Johnson: This question really relates to the ROI of doing that. It depends on what you're trying to achieve as the user of the environment, and that's part of the challenge. Each company does its own calculation, based on its strategy, to figure out what it wants to do in the cloud and how aggressively it is willing to spend to reap that benefit. The other element is that the return you measure is different from the cost of ownership. We tend to be better at doing total-cost-of-ownership analysis than the "R" part of ROI, which has to bring into play more intangible factors such as throughput time and time-to-market advantage.

Aitken: Even with something as simple as latency, when you’re running a tool and the response time is, ‘I moved the mouse and then a while later something happens,’ that can be very frustrating.

Turaga: Historically, if we look at the ROI for cloud, there are three classes of tools that can leverage cloud very effectively. First, within the design organization, are design iterations and regressions in verification. Second are long-running simulations with heavy compute loads, which scale very well to take advantage of the compute available in various instance types. Third are the interactive tools, which need low latency and support a lot of collaboration. Those are the three categories of tools that get the best ROI from cloud. Which tool to start with in the cloud depends on each customer's situation; some of our customers started with verification.

SE: For the cloud users, how did you arrive at the decisions for your cloud-use model?

Steinke: We've been around a while. We already had a pretty big data center, so we didn't need to go all-in at the beginning. We were looking to augment what we have with what the cloud has to offer. Our on-prem data center continues to deliver a huge amount of our compute capacity. Projects come and go, and unexpected things happen; being able to layer in that flexibility and have multiple sources of compute to pull from is an advantage we wanted to jump on. That's been a big part of our motivation for cloud, and why we went that way. We already had that upfront investment, so it was something we were looking to augment and build on.

Ho: I can answer this from two perspectives. The first is that before I was at Lightmatter, I was at Google working on the TPU and on the infrastructure team, and we used cloud there, too. The answer there is different from the answer at Lightmatter. One of the questions you have to ask yourself is whether you want your repo (repository) on-prem or in the cloud. A company like Google, and presumably AMD, wants its repo on-prem. They feel more secure, and they feel it's more in their control. At a smaller company like Lightmatter, I don't necessarily care. I was comfortable with the security of the cloud, so I can have the repo in the cloud. And in that smaller context, having the repo in the cloud means we are using cloud almost as a full infrastructure. It's the same as my on-prem. That's the first concern. The second concern is legacy. Some companies have legacy, and when you try to move from legacy to a cloud-based solution, you really have to understand what you're gaining, which speaks to the goal of this panel. We're trying to point out where you gain a benefit in terms of flexibility, in terms of being able to have newer machines, and so on. Where that really counts is on workloads where you have a lot of parallel runs going. You want to manage a large set of servers and jobs, and that's where you should go to the cloud. You can make your workload take advantage of that. Then, coming back to data flow, where you have a constraint, you have to make a decision. We made the decision to have the repo for physical design in the cloud, but other companies haven't. I know of companies that have done a lot of physical design on-prem because they need a lot of storage and don't need that many machines. So you have to look at each of those cases and make a decision based on your situation.

Turaga: Many of our small-to-medium customers and startups don't want the repo on-prem. They don't have legacy data center issues, so they're really embracing cloud fully. Some of the larger companies that already have huge on-prem infrastructure are moving to a hybrid-type model.

SE: With the almost-baffling number of instance types available in the cloud from different vendors, on top of licensing costs which are not optimized for cloud yet, how can users improve the way they choose the right kind of instance to run their jobs?

Johnson: This is one of the foundational things we're trying to address. Our idea was that companies like AMD, which largely want to manage their own infrastructure and optimize it in their particular way, would like help from us on application-specific decisions: what types of instances with what amount of memory work best, and maybe the configuration of the workload itself. How can they manage the job runs for optimal performance? We try to package that all up into something we call a flight plan. We have these flight plans available for various parts of our flow with baseline suggestions. If a customer wants to use that, great. If they want to riff on it and improve from there, that's fine with us too.
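
A hypothetical sketch of what a flight-plan-style baseline might capture: per-application suggestions for instance shape, memory, and job packing. The names and numbers here are invented for illustration and are not Siemens' actual flight plans:

```python
# Hypothetical "flight plan"-style baselines: per-application suggestions
# for instance shape and job configuration. All values are illustrative.
FLIGHT_PLANS = {
    "rtl_simulation": {
        "instance": "compute-optimized-16vcpu",  # latency-sensitive, modest RAM
        "memory_gb": 64,
        "jobs_per_instance": 4,
    },
    "full_chip_sta": {
        "instance": "memory-optimized-64vcpu",   # timing graphs are RAM-hungry
        "memory_gb": 1024,
        "jobs_per_instance": 1,
    },
    "physical_verification": {
        "instance": "memory-optimized-96vcpu",   # DRC/LVS scales across cores
        "memory_gb": 768,
        "jobs_per_instance": 1,
    },
}

def plan_for(workload: str) -> dict:
    """Return the baseline suggestion; users can riff on it and tune."""
    return FLIGHT_PLANS[workload]
```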

Aitken: The Synopsys view is the same, but there's also a dependency on the specific design you're doing. Some designs are just going to require bigger or higher-end instances than others. And depending on what your particular workflow is, some instances will make more or less sense than others. Also, it's not just the licensing costs. It's the machine costs in the cloud that you have to trade off as well. Bigger machines are more expensive, but maybe you can run your workload on a lesser instance for longer and pay more in license fees but less in compute fees.
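
That tradeoff is easy to make concrete. In the toy comparison below, with invented hourly rates, the total cost of a run is the runtime multiplied by the sum of the machine rate and the license rate; which instance wins depends on which rate dominates:

```python
def run_cost(runtime_hours: float, machine_rate: float,
             license_rate: float) -> float:
    """Total cost of one run: you pay for the machine and the license
    for as long as the job takes."""
    return runtime_hours * (machine_rate + license_rate)

# Illustrative numbers only.
big_fast   = run_cost(runtime_hours=10, machine_rate=12.0, license_rate=20.0)
small_slow = run_cost(runtime_hours=18, machine_rate=4.0,  license_rate=20.0)

print(f"big instance:   ${big_fast:.2f}")    # $320.00
print(f"small instance: ${small_slow:.2f}")  # $432.00
# Here the expensive machine is cheaper overall because the license fee
# dominates the hourly cost; with a cheap license and a wide gap in
# machine rates, the smaller, slower instance wins instead.
```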

Ho: Our focus has been to look at it in terms of what the actual run is. Is it a pre-check-in run? With a pre-check-in, you want a faster run with low latency, so you get a really high-performance instance for that. Is it an overnight regression? In that case, I don't necessarily care how fast it finishes. It just needs to finish overnight, which means I can pay for cheaper instances for my overnight regressions. We work with our cloud provider to figure out the best instance for each type of job. Then it's a question of optimizing costs. You do want to keep your costs as low as you can because, as I said, they add up pretty fast. We look at each particular workload and ask, "What is the instance type that's needed for that?" At the same time, it does get difficult because you then have to manage the pools of instances for each job, and make sure you have enough of each pool available so that when a job actually kicks in, it runs in a reasonable timeframe. As you get to deployment, you have to address these questions.
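
A minimal sketch of the kind of policy Ho outlines, with hypothetical instance families and pool sizes: latency-sensitive pre-check-in runs get fast on-demand capacity, while overnight regressions ride cheaper spot capacity:

```python
from dataclasses import dataclass

@dataclass
class InstanceChoice:
    family: str      # illustrative instance family name
    purchasing: str  # "on-demand" when latency matters, "spot" when it doesn't
    pool_size: int   # how many to keep available so jobs start promptly

# Hypothetical policy in the spirit of what Ho describes.
POLICY = {
    # Pre-check-in: a developer is waiting, so pay for fast on-demand capacity.
    "pre_checkin":        InstanceChoice("high-freq-compute", "on-demand", 32),
    # Overnight regression: it only needs to finish by morning, so use cheap
    # spot/preemptible capacity and tolerate slower machines.
    "nightly_regression": InstanceChoice("standard-compute", "spot", 256),
}

def choose_instance(job_class: str) -> InstanceChoice:
    """Map a run's class to the instance pool it should draw from."""
    return POLICY[job_class]
```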

Turaga: Over the years we've developed some best practices. Initially, when you are not sure which instance type to use, you choose something with a balance between compute and memory, a general-purpose instance type. Then you look at different types of workloads: for verification you need a bit more memory; the same is the case with timing, where you need even more; and for CFD analysis you may need GPUs. These are part of the best practices we've developed and share with customers.
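
One way to operationalize that practice, sketched below with invented shape names rather than any vendor's actual mechanism, is a simple escalation ladder: start on a balanced shape and retry on progressively larger-memory shapes if the job runs out of RAM:

```python
# Illustrative escalation ladder: balanced shape first, then bigger memory.
MEMORY_LADDER = ["balanced-64gb", "highmem-256gb", "highmem-1tb"]

class OutOfMemory(Exception):
    """Raised by the (hypothetical) job runner when a run is OOM-killed."""

def run_with_escalation(job, runner) -> str:
    """Try the job on each shape in turn; stop at the first that fits.
    `runner(job, shape)` is assumed to raise OutOfMemory on OOM."""
    for shape in MEMORY_LADDER:
        try:
            runner(job, shape)
            return shape  # remember this shape for the next run of this job
        except OutOfMemory:
            continue
    raise RuntimeError("job exceeded the largest available memory shape")
```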

