Original Link: https://www.anandtech.com/show/16947/cerebras-in-the-cloud-get-your-wafer-scale-in-an-instance
Cerebras In The Cloud: Get Your Wafer Scale in an Instance
by Dr. Ian Cutress on September 16, 2021 9:00 AM EST- Posted in
- CPUs
- Cirrascale
- AI
- Cloud
- ML
- Cerebras
- Wafer Scale
- WSE2
- CS-2
- CSP
To date, most of the new AI hardware entering the market has been a ‘purchase necessary’ affair. Any business looking to go down the route of using specialized AI hardware needs to get hold of a test system, see how easy it is to migrate its workflow, and then work out the cost, the effort, and the future prospects of going down that route, if it is feasible at all. Most AI startups are flush enough with VC funding that they’re willing to put in the legwork for this, hoping to snag a big customer at some point to make the business profitable. One simple answer would be to offer the hardware in the cloud, but it takes a lot for a Cloud Service Provider (CSP) to bite and offer that hardware as an option to its customers. Today’s announcement from Cerebras and Cirrascale is that Cirrascale, as a CSP, will begin to offer wafer-scale instances based on Cerebras’ WSE2.
Cerebras WSE2 and CS-2
The Cerebras Wafer Scale Engine 2 is a single AI chip the size of a wafer. Using TSMC N7 and a variety of patented technologies relating to cross-reticle connectivity and packaging, a single 46,225 mm² chip has over 800,000 cores and 2.6 trillion transistors. With 40 GB of SRAM on board, WSE2 is designed to hold large machine learning models for training without the need to split the training across multiple nodes. Rather than using a distributed TensorFlow or PyTorch model with MPI or synchronization, the aim of WSE2 is to fit the entire model onto a single chip, speeding up communication between the cores and making the software easier to manage as models scale rapidly.
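To put those headline figures in proportion, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above and makes two assumptions that are not from Cerebras: that 40 GB means 40 x 10^9 bytes, and that the SRAM is spread evenly across cores. Because the core count is quoted as "over 800,000", the per-core figure is an upper bound rather than a specification.

```python
# Figures quoted in the article for WSE2.
DIE_AREA_MM2 = 46_225
TRANSISTORS = 2.6e12
CORES_LOWER_BOUND = 800_000   # article says "over 800,000"
SRAM_BYTES = 40e9             # assumption: 40 GB taken as decimal gigabytes

# Average transistor density across the whole wafer-scale die.
density_per_mm2 = TRANSISTORS / DIE_AREA_MM2

# Upper bound on SRAM per core, assuming an even split (an assumption;
# the real core count is higher, so the real share is smaller).
sram_per_core_bytes = SRAM_BYTES / CORES_LOWER_BOUND

print(f"Density: about {density_per_mm2 / 1e6:.1f} million transistors per mm2")
print(f"SRAM per core: at most {sram_per_core_bytes / 1e3:.0f} KB")
```

The point of the per-core number is that the 40 GB is not one shared pool sitting off to the side: it is distributed in small slices next to the compute, which is what makes keeping a whole model on-chip attractive.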
The WSE2 sits at the heart of a CS-2 system, a 15U rack device with a custom machined aluminium front panel. Connectivity comes through 12 x 100 gigabit Ethernet ports, and the chip inside uses a custom packaging and water cooling system with redundancy. A single chip is rated at 14 kW typical and 23 kW peak, although there are 12 x 4 kW power supplies inside. Current customers of CS-2 units include national laboratories, supercomputing centers, pharmacology, biotechnology, the military, and other intelligence services. At a cost of several million dollars each, it’s a large bite to take all at once, hence the announcement today.
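The gap between the rated power and the installed power supplies is easier to see with the arithmetic written out. The sketch below uses the figures quoted above; the assumption that the supplies share load evenly and can each deliver their full 4 kW is mine, and the actual redundancy scheme was not described.

```python
# Figures quoted in the article for a single CS-2 system.
PEAK_KW = 23
PSU_COUNT = 12
PSU_KW = 4
ETH_PORTS = 12
ETH_GBPS = 100

def min_psus_for(load_kw, each_kw=PSU_KW):
    """Smallest number of supplies that covers the load (ceiling division).

    Assumes even load sharing at full rated output, which is an
    illustrative simplification rather than a documented design detail.
    """
    return -(-load_kw // each_kw)

installed_kw = PSU_COUNT * PSU_KW          # total installed capacity
needed_at_peak = min_psus_for(PEAK_KW)     # supplies needed to cover peak
spare_psus = PSU_COUNT - needed_at_peak    # supplies beyond that minimum
aggregate_gbps = ETH_PORTS * ETH_GBPS      # combined Ethernet bandwidth

print(f"Installed: {installed_kw} kW for a {PEAK_KW} kW peak")
print(f"Supplies needed at peak: {needed_at_peak}, spare: {spare_psus}")
print(f"Aggregate network bandwidth: {aggregate_gbps} Gbps")
```

Under those assumptions, roughly half of the supplies would cover peak draw, which is consistent with a system built for redundancy.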
Cerebras x Cirrascale: WSE2 In The Cloud
Today’s announcement is that Cirrascale, a cloud services provider focusing on GPU clouds for AI and machine learning, will deploy a CS-2 system at its facility in Santa Clara. It will be offered to customers as a complete system instance, rather than as a partitioned device like a CPU or GPU might be, on the basis that the sort of customers interested in a CS-2 have models so large that a portion of a CS-2 isn’t enough. Cerebras CEO Andrew Feldman explained that customers looking at CS-2 know that their workload scales to so many GPUs that they need a different avenue to get their models to fit on a single device.
Currently this is only a single system, and rather than hosting multiple users at once, Cirrascale will offer it on a first-come, first-served basis. Normally a single CS-2 system costs several million dollars to purchase, whereas cloud rental at Cirrascale will cost $60k a week or $180k a month, with further discounts for longer terms. The minimum rental time is a week, and if a customer wishes, their instance data can be saved locally by Cirrascale for a future rental window.
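The two quoted rates imply a sizeable discount for committing to a month, and they also give a feel for how long a rental would have to run before it approaches a purchase. The sketch below works that through. The weekly and monthly rates are from the announcement; the weeks-per-month conversion and, in particular, the purchase price are my assumptions, since the only figure given is "several million".

```python
WEEKLY_RATE_USD = 60_000     # quoted in the article
MONTHLY_RATE_USD = 180_000   # quoted in the article
WEEKS_PER_MONTH = 52 / 12    # assumption: average calendar month

# Hypothetical purchase price, for illustration only; the article says
# "several million" and gives no exact figure.
HYPOTHETICAL_PURCHASE_USD = 3_000_000

def monthly_discount(weekly=WEEKLY_RATE_USD, monthly=MONTHLY_RATE_USD):
    """Fraction saved by the monthly rate versus paying weekly for a month."""
    weekly_cost_for_month = weekly * WEEKS_PER_MONTH
    return 1 - monthly / weekly_cost_for_month

def months_to_match(purchase_price_usd, monthly=MONTHLY_RATE_USD):
    """Months of rental whose total cost equals the given purchase price."""
    return purchase_price_usd / monthly

discount = monthly_discount()
breakeven_months = months_to_match(HYPOTHETICAL_PURCHASE_USD)

print(f"Monthly rate discount vs weekly: {discount:.1%}")
print(f"Months of rental to match purchase: {breakeven_months:.1f}")
```

This ignores the longer-term discounts mentioned in the announcement, as well as the power, cooling, and hosting costs of an owned system, so it is only a first-order comparison.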
Cirrascale’s CEO PJ Go explained that some of the interest they’ve had in the system comes from large financial services looking to analyze their internal databases or customer services, as well as pharmacology, and these businesses tend to initiate long contracts when they’ve found the right solution for their extended ongoing workflow.
Those who are interested in the system will be able to use Cirrascale’s cloud toolset, which already has Cerebras’ toolchain and compilers built in. A CS-2 instance rental will include the full toolset and an associated compute and storage system.
Thoughts
One of the issues of getting most AI training hardware into the cloud is scale. It simply isn’t enough to rent a few dozen instances of several AI chips each and then partition them together, because ultimately they might be on other sides of the datacenter. If that package of a dozen instances gets sold as a single instance type, then you have to balance between workload and scale-out. This is why training in the cloud can be difficult to execute, and most AI hardware startups end up looking for on-premises deployments rather than cloud deployments.
This is what puts Cerebras in a unique position. The Wafer Scale Engine is a big unit, designed to take a large training job that might require hundreds of GPUs and fit it onto a single chip. There is no sub-division of an instance, and no time-sharing for simple jobs: companies that need it tend to need all of it, and that makes it a monetizable unit for cloud deployment. However, that monetizable unit is still a hefty chunk, especially for anyone wanting to explore the capability of the device for their workloads. For example, $180k for a month would essentially pay for an on-premises DGX A100. That being said, as Cerebras pointed out, the WSE is for users that have to scale above and beyond that, without the complexity of synchronizing across multiple chips.
The only issue I still can’t seem to work out with this deal is that Cirrascale appears to be deploying only a single CS-2 system. In our briefing it sounded as though there are potential customers lining up at the door to try this thing, and I can imagine that even if everyone only wanted a week to try it out, some won’t wait around for 8 weeks to get their turn. Alternatively, if a customer books it for a month and then wants it for a year, no-one else can use it and Cirrascale will need another. It wasn’t clear whether Cirrascale had purchased the CS-2 from Cerebras, or whether the company is simply ‘renting’ it or sharing profits based on how it gets used. I have been told, however, that if the unit Cirrascale is offering is regularly oversubscribed, more will be added.
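The queueing concern is simple to model. With one system and first-come, first-served booking, each customer waits for the sum of every booking ahead of them. The sketch below is illustrative only: the queue lengths and the treatment of a month as four weeks are my assumptions, not figures from Cirrascale.

```python
def wait_weeks(bookings):
    """Weeks each customer waits before their slot starts.

    `bookings` is the list of rental lengths in weeks, in arrival order,
    for a single system handed out first-come, first-served.
    """
    waits = []
    elapsed = 0
    for weeks in bookings:
        waits.append(elapsed)
        elapsed += weeks
    return waits

# Nine customers who each want the one-week minimum rental.
trial_queue = [1] * 9
print(wait_weeks(trial_queue))   # the ninth customer waits 8 weeks

# One month-long booking (assumed to be 4 weeks) ahead of two trials.
mixed_queue = [4, 1, 1]
print(wait_weeks(mixed_queue))
```

A single long booking at the front pushes every later customer back by its full length, which is why one system can look oversubscribed very quickly.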
From a corporate perspective, Cerebras is in a healthy position. There’s lots of VC funding still in the bank, the company has sold a strong double-digit number of WSE systems to both corporate and government accounts, and the team has a continuing roadmap for future products. The team seems very eager to promote every sale, or at least the ones they’re allowed to talk about. Out of almost all the AI startups, Cerebras has the most immediately striking unique proposition for the market: for large training jobs, a big single chip makes things easier. It will be interesting to see how the company fares against some of the newer AI startups that aim to take the multichip-as-monolithic approach. Arguably Cerebras has already done that with its new SwarmX/MemoryX technology that it announced back at Hot Chips 2021, which allows seamless scaling up to 192 CS-2 machines, and a reported 1:1 performance scaling for 100 trillion parameter models. Tesla’s Dojo aims to do something similar, however that’s just for Tesla, not anyone else. Cerebras’ market is selling or offering systems, through deployments like Cirrascale, that theoretically anyone should be able to use.
Interested customers will be able to register their interest from today, with the system up and running now for the first cloud customers.
Related Reading
- Hot Chips 2021 Live Blog: Machine Learning (Graphcore, Cerebras, SambaNova, Anton)
- Cerebras Unveils Wafer Scale Engine Two (WSE2): 2.6 Trillion Transistors, 100% Yield
- Cerebras Wafer Scale Engine News: DoE Supercomputer Gets 400,000 AI Cores
- Hot Chips 2020 Live Blog: Cerebras WSE Programming (3:00pm PT)
- 342 Transistors for Every Person In the World: Cerebras 2nd Gen Wafer Scale Engine Teased
- Cerebras’ Wafer Scale Engine Scores a Sale: $5m Buys Two for the Pittsburgh Supercomputing Center
- Hot Chips 31 Live Blogs: Cerebras' 1.2 Trillion Transistor Deep Learning Processor