35 Comments
MrSpadge - Wednesday, September 28, 2016 - link
7 billion transistors for "just" 512 CUDA cores, and 43% of the P40's performance? Assuming Pascal CUDA cores, this would imply a clock speed of ~4.9 GHz. Nope, I'm pretty sure the additional transistors either went into significantly redesigned SMs (wider CUDA cores) or the accelerators are performing some magic. Quite interesting!
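A quick back-of-the-envelope check of that implied clock, as a minimal sketch. The P40 figures (3840 CUDA cores, ~1.53 GHz boost, ~47 INT8 TOPS) are published Pascal specs; the assumption that Xavier's 20 DL TOPS would come purely from 512 Pascal-style cores doing dp4a INT8 (8 ops per core per clock) is the commenter's hypothetical, not anything NVIDIA has stated.

```python
# Rough sanity check: what clock would 512 Pascal-style CUDA cores need
# to reach Xavier's claimed 20 DL TOPS using only dp4a-style INT8 math?
# Assumption: 8 INT8 ops per core per clock (4 multiply-accumulates), as on Pascal.

XAVIER_DL_TOPS = 20e12       # NVIDIA's claimed deep-learning ops/sec
CORES = 512                  # Xavier's stated CUDA core count
INT8_OPS_PER_CORE_CLK = 8    # dp4a: 4 MACs = 8 ops per core per clock

implied_clock_hz = XAVIER_DL_TOPS / (CORES * INT8_OPS_PER_CORE_CLK)
print(f"Implied clock: {implied_clock_hz / 1e9:.2f} GHz")   # ~4.88 GHz

# Cross-check against the Tesla P40 (GP102): 3840 cores at ~1.53 GHz boost
p40_tops = 3840 * 1.53e9 * INT8_OPS_PER_CORE_CLK
print(f"P40 INT8 throughput: {p40_tops / 1e12:.0f} TOPS")   # ~47 TOPS
print(f"Xavier / P40: {XAVIER_DL_TOPS / p40_tops:.0%}")     # ~43%
```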
Vatharian - Wednesday, September 28, 2016 - link
I vote for magic. In fact, PMoS (Probability Manipulation on Silicon) could be the next big thing in the automotive industry. Besides increasing the apparent performance of the AI behind the wheel, it would help reduce danger in case of a collision.
Qwertilot - Wednesday, September 28, 2016 - link
Don't forget the 8 ARM-based cores, which will presumably provide a fair chunk of the performance.
name99 - Wednesday, September 28, 2016 - link
Only if they are hooked up to a decent memory system. That has been conspicuously lacking in mobile ARM systems so far. Look at the usual ARM scaling: 4+4 systems like A57+A53 or A72+A53 get about 3x the throughput of a single core.
It's not enough just to have a wider bus; you need more parallelism throughout the memory system, which in turn means multiple (at least two) memory controllers, memory buses, and physical slivers of memory silicon. But if there's one thing manufacturers hate, it's adding more pins and more physical slivers of silicon. So I fully expect this to be as crippled as all its predecessors, able to run all those cores for a mighty 3x throughput compared to one core.
As the saying goes, amateurs talk core count, professionals talk memory subsystem.
londedoganet - Wednesday, September 28, 2016 - link
I think you'll have to wait for the nVidia Maximoff (as in Wanda) to see a PMoS implementation. ;-)
hahmed330 - Wednesday, September 28, 2016 - link
I think they have quite definitely increased operations per clock... 4 ops per core per clock at 2.7 GHz on 512 cores gives you 5.5 teraflops, or 22.1184 Deep Learning Tera-Ops. All of this at 20 watts, including the CPUs, would be just insanely amazing.
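Reconstructing that arithmetic as a minimal sketch. Both inputs are the commenter's guesses rather than published Xavier specs, and the 4:1 ratio of DL ops to FP32 FLOPS is an assumption modeled on Pascal's dp4a behavior.

```python
# The comment's numbers: 512 cores, 4 FP32 ops per core per clock, 2.7 GHz.
# Both the ops-per-clock figure (i.e., two FMAs per core) and the clock are
# assumptions, not confirmed Xavier specifications.

CORES = 512
FP32_OPS_PER_CORE_CLK = 4    # assumed: 2 FMAs per core per clock
CLOCK_HZ = 2.7e9             # assumed clock

fp32_tflops = CORES * FP32_OPS_PER_CORE_CLK * CLOCK_HZ / 1e12
dl_tops = fp32_tflops * 4    # assume dp4a-style INT8 at 4x the FP32 op rate

print(f"FP32: {fp32_tflops:.4f} TFLOPS")   # 5.5296
print(f"DL:   {dl_tops:.4f} TOPS")         # 22.1184
```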
Krysto - Wednesday, September 28, 2016 - link
Which, by the way, is exactly the performance the Tesla P4 gets for 75W of power. Strange that Ryan chose to compare Xavier with "43% of P40's performance" when the P4 comparison was much more apt.
p1esk - Wednesday, September 28, 2016 - link
He compared it to the P40 because it's currently the fastest Nvidia card when it comes to INT8 performance.
Vatharian - Wednesday, September 28, 2016 - link
Jokes aside, a big part of the performance is probably tailoring the cores for integer operations. Abandoning compatibility with heavy FP would leave space for heavy parallelization within a single core, especially with such a high transistor count.
Samus - Wednesday, September 28, 2016 - link
Volta is their first microarchitecture to be built from the ground up for 16nm, so it's likely to have even wider optimization over Pascal than Pascal had over Maxwell.
I'm still as surprised as you are, though. 7 billion transistors with 512 cores could mean some epic-sized caches too.
name99 - Wednesday, September 28, 2016 - link
7 billion is not THAT many transistors, just to calibrate expectations.
The Apple A10 has 3.3B at 125 sq mm on 16nm FF+.
The A10X will probably crack 4B, maybe 4.5B, since it's apparently going to ship on TSMC 10nm early'ish next year, and will presumably at the very least bump up the GPUs by around 50% and, who knows, maybe even add a third core?
The A11X will ship at likely the same sort of time frame (early 2018), quite possibly even on 7nm, and quite possibly approaching the same number of transistors (though on a substantially smaller piece of silicon).
(What will Apple do with all those transistors? Who the heck knows? Maybe just keep upping the GPUs and using them for AI-related matters, the way nV does? Or maybe add dedicated AI hardware of the sort pioneered by Movidius and Nervana?)
Krysto - Wednesday, September 28, 2016 - link
I mean, the chip looks very impressive for what it is, but it's getting a little annoying seeing Nvidia name everything from a $200 chip to a $129,000 one an "AI supercomputer".
The performance of this chip will likely still be a tiny fraction of what Level 4 autonomous driving systems will require 10 years from now, so why call it a supercomputer, as if it's already more than capable of handling Level 4 self-driving or whatever?
What are they going to call their 2020 generation? The "AI Quantum Computer" (but you know, without actually being a quantum computer either...)
Meteor2 - Wednesday, September 28, 2016 - link
Did you read the live blog? This looks like enough for Level 4, easily.
GhostOfAnand - Wednesday, September 28, 2016 - link
Yawn...
djayjp - Wednesday, September 28, 2016 - link
8K video but not at 60fps?
Yojimbo - Wednesday, September 28, 2016 - link
That chart was made by AnandTech based on all the information they have. The Parker and Erista entries are filled in from published information on existing products. The Xavier entries are filled in from slides from a keynote presentation giving a 'sneak peek' of the architecture. Trying to compare the different entries too critically is bound to result in red herrings.
Yojimbo - Wednesday, September 28, 2016 - link
Well, I'm gonna take a wild guess even though I have very limited knowledge of the industry. The two surprising numbers are 7 billion transistors and 20 DL TOPS at 20 W. A 512-core Pascal GPU would be about 2 billion transistors, by my very rough estimate. I think that leaves three possibilities: 1) Volta has a massive increase in transistors per core over Pascal in general, 2) the GPU in Xavier is specially designed, or 3) over 2/3 of the transistors are devoted to other areas of the chip.
I can't see number 1 happening, because the GV100 looks to have 1.6 to 2 times the theoretical performance of the GP100 on the same process node. The GP100 is already a physically maxed-out chip, so they would not be likely to manage such a huge increase in transistors per core without reducing the number of CUDA cores on the GV100 compared with the GP100. Then the GV100 would need to either run at vastly higher clocks or perform more than 2 FMA per clock in order to achieve its performance. Both seem unlikely.
That leaves 2) or 3). I think they are just different ways of saying the same thing: there is specialized hardware on the SoC for performing DL operations, outside the realm of a normal GPU, and a significant number of transistors on Xavier are devoted to it. There are four blocks mentioned on NVIDIA's slides: the GPU block, the CPU block, "dual 8K HDR video processors", and a "computer vision accelerator". My guess is that an "HDR video processor" is more than just an encode/decode block and includes an ISP. If so, then what the computer vision accelerator does is a mystery to me (though maybe not to someone with knowledge of vision processing). I propose that this computer vision accelerator is a special compute unit that consists of a significant number of transistors (>2 billion?) and accounts for a significant number of DL TOPS (>10 DL TOPS?).
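For what it's worth, a minimal sketch of how that "about 2 billion" figure can be reached by scaling published Pascal dies down to 512 cores. The transistor and core counts are real published numbers; the linear per-core scaling is my assumption and ignores fixed-cost blocks.

```python
# Scale published Pascal GPU transistor counts down to 512 CUDA cores.
# Linear per-core scaling is a crude assumption: it ignores fixed blocks
# (memory controllers, display, NVENC/NVDEC) that don't shrink with core count.

pascal_dies = {
    "GP102": (12.0e9, 3840),   # (transistors, CUDA cores)
    "GP104": (7.2e9,  2560),
    "GP106": (4.4e9,  1280),
    "GP107": (3.3e9,   768),
}

TARGET_CORES = 512
for name, (transistors, cores) in pascal_dies.items():
    scaled = transistors * TARGET_CORES / cores
    print(f"{name}: ~{scaled / 1e9:.1f}B transistors for {TARGET_CORES} cores")
# Prints roughly 1.4B to 2.2B, consistent with the ~2B ballpark above.
```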
Yojimbo - Wednesday, September 28, 2016 - link
I meant "more than 2 FP ops per core per clock", not "more than 2 FMA per clock".
syxbit - Wednesday, September 28, 2016 - link
It's really sad that they're not doing mobile SoCs. I get that they've had some terrible chips (T2, T3 and the 64-bit K1), but the T4 was okay. The 32-bit K1 was very good, and the X1 was the best Android SoC of 2015. I'd love to see this Xavier in a Shield TV 2 (or other Shield device).
I'd really like it if they designed their new stuff with the ability to downsize. Have an 8-core beast for AI, but chop it down to size and give us a 2-4 core tablet chip.
A5 - Wednesday, September 28, 2016 - link
They couldn't get any phone design wins from major OEMs. Android tablets aren't a big enough space to justify the resources they were putting in.
Yojimbo - Wednesday, September 28, 2016 - link
I dunno if you can assume they won't be introducing further Shield products, or even that they won't have their chips in tablets. This advance reveal is targeted at a specific segment, one with a lot of forward-looking reveals from their competition (Intel/Movidius, NXP, CEVA). The segment is very young and these companies are all competing to convince developers to use their platforms.
It's entirely possible that NVIDIA would release two Tegra-based SKUs: one high-power SKU geared towards Drive PX and Jetson, and one low-power SKU geared towards consumer electronics products, with the computer vision accelerator and various other blocks stripped out of the design of the latter. Although the fact that they are calling this "Xavier" seems to suggest it has taken over the entire Tegra line, and so we won't see any more consumer electronics Tegras, I don't think we can be completely sure what "Parker" and "Xavier" really mean to NVIDIA, or whether they'd switch a consumer electronics version of the technology to a different code-name scheme.
SquarePeg - Wednesday, September 28, 2016 - link
Agreed. There's a huge hole in the Chromebook processor landscape. You have Intel Atom-based Celerons and Rockchip at the bottom, and then from there you jump up to more expensive Intel Broadwell Celerons and i3s with a TDP of 15 watts that require a fan. Nvidia needs an SoC to fill this gap. Something like quad-core A73s at 3GHz with a 256-core Pascal GPU and a TDP of 3.5 to 4 watts. This would be great for Chromebooks and tablets and would slot in between the existing options on the market. Just using stock A73s with their own GPU would make it much quicker and cheaper to bring to market, and they would have that "tweener" space to themselves.
milli - Wednesday, September 28, 2016 - link
"Something like quad core A73's at 3ghz with a 256 core Pascal GPU and a TDP of 3.5 to 4 watts."Well that's just not possible on 16nm. Maybe in the future on 10 or 7nm but on 16nm that would result in a 10w SOC (if not more).
Yojimbo - Wednesday, September 28, 2016 - link
If a 2560-core Pascal Tesla P4 can operate at a 50W TDP by clocking low enough, why can't a mobile SoC with 256 Pascal cores have a TDP under 10 watts? Maybe not 3.5 to 4, but somewhere from 5 to 10.
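A minimal sketch of that scaling argument. The P4's 50W TDP configuration and 2560 cores are public; the linear power scaling and the SoC overhead figure are rough assumptions rather than measurements, and board power also covers GDDR5 and regulator losses.

```python
# Naive linear scaling of Tesla P4 board power down to a 256-core mobile GPU.
# This is extremely rough: it ignores memory/VRM overhead on the P4 board and
# ignores that a low-clocked mobile part would also run at lower voltage.

P4_TDP_W = 50.0          # Tesla P4's low-power TDP configuration
P4_CORES = 2560
MOBILE_CORES = 256
SOC_OVERHEAD_W = 2.0     # assumed budget for CPU cluster, fabric, I/O

gpu_share_w = P4_TDP_W * MOBILE_CORES / P4_CORES
print(f"Scaled GPU power: ~{gpu_share_w:.1f} W")                     # ~5 W
print(f"With SoC overhead: ~{gpu_share_w + SOC_OVERHEAD_W:.1f} W")   # ~7 W, inside the 5-10 W range
```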
SquarePeg - Wednesday, September 28, 2016 - link
My assumption would be that a hypothetical SoC like the one I'd like to see would be built on 10nm. TSMC already produced quad-core A73 test chips at 10nm for ARM back in May. The Tegra X1 was a 10-watt SoC and was built on TSMC 20nm (planar). My thinking is that much more efficient A73s plus Pascal at 10nm would be possible in the 3.5 to 4 watt TDP range.
Yojimbo - Wednesday, September 28, 2016 - link
10nm will ramp up for volume production in 2017. Apple, Samsung, and Qualcomm will buy all the early capacity. NVIDIA has a history of not jumping onto a new node early, anyway. Therefore an NVIDIA 10nm chip wouldn't arrive until 2018. By that time a Volta GPU would make more sense, even if it meant waiting another quarter.
Yojimbo - Wednesday, September 28, 2016 - link
Besides, if NVIDIA isn't planning on putting this Xavier chip on 10nm, I doubt they would put a hypothetical consumer electronics chip on that node, as such a chip would be much less important to them.
Ktracho - Wednesday, September 28, 2016 - link
It could be a decision similar to IBM's in the 2005 time frame, when they decided it was better overall for their business (and shareholders) to pursue development of higher-power CPUs (which eventually evolved into today's POWER8) rather than lower-power CPUs appropriate for laptops, for example. While Apple made prototypes of laptops with IBM's G5, which was not easy, they ultimately decided to switch to Intel CPUs so they could make their computers smaller.
TheinsanegamerN - Thursday, October 6, 2016 - link
Unfortunately, that leaves us with zero good, powerful tablets. Samsung seems content to put weak GPUs in their newest Tab models, nobody else will use anything other than bottom-barrel midrange chips, and nobody dares to make a product over $150. Nobody wants to make a 10-inch $400 tablet with a Snapdragon 820 and microSD support, it seems.
I'm glad I grabbed a Shield Tablet when I did. Looks like there won't be a good tablet other than the $500 Pixel C for quite some time (and the Pixel C doesn't have microSD, so it's a bit of a non-starter).
name99 - Wednesday, September 28, 2016 - link
There are two ways designers seem able to track the path they need to follow: you can start at high performance and then try to maintain that while reducing energy every iteration. That's basically been the Intel path. Or you can start at low energy and try to maintain that while improving performance (the Apple/ARM path).
Maybe we don't have enough data to draw strong conclusions, but it is notable that Apple and ARM have done OK following their track, while Intel and nV have not. Both have managed to stand still, to protect their existing markets, but they have not managed to grow the market substantially in the way that Apple and ARM have.
Trying to grow downward seems fundamentally more problematic than trying to grow upward. I don't know if that's because of business psychology (management is scared that cheaper chips will steal high-end sales, so they cripple the chips so much as to be useless) or because of technology (it's just a more complicated problem to strip the energy out of a complicated fast design than to add performance to a low-energy design [while being very careful that everything you add does not add extra energy cost]).
doggface - Thursday, September 29, 2016 - link
I think if anything you could see Apple as very Intel-like in their development. Whereas the majority of ARM licensees go for many small cores to save costs, Apple have followed Intel in having big, wide cores with large caches to prioritise single-threaded IPC.
Further, if you look at Intel's current offerings, they are offering "cores formerly known as Core M" at 4-5W that are not architecturally dissimilar to their big 100-150 watt cores.
The only real difference is that Apple's cores were designed from the get-go to be mobile-first, while Intel originally did not have those design constraints and focused on MOAR PWR, and it has taken them 3-4 generations to get the power draw issues resolved without sacrificing performance drastically (not really moving forward far either, but that is a different issue).
doggface - Thursday, September 29, 2016 - link
Or to put it another way... the only real difference between Apple and Intel, in my book, is WHEN they entered the market; different expectations were in place.
TheinsanegamerN - Thursday, October 6, 2016 - link
Tell me about it. The Shield Tablet is the last reasonable Android machine. The rest of the $150-200 crowd are poorly made, have vastly inferior SoCs, etc. Samsung is putting too-small GPUs in their new Tabs, and Google is only pushing the ridiculously priced Pixel C with no microSD.
They are just handing the market to Apple.
Huber - Sunday, October 9, 2016 - link
So I'm guessing this means no more NVIDIA Shield tablets and Android TV devices? If so, that's a shame; those were the best on the market.