5 Comments
rpg1966 - Tuesday, May 30, 2023 - link
This might sound flippant, but why does the industry talk about performance of "known good stack dies"? What's the alternative, measuring performance of dies that failed in the production process?
Amandtec - Tuesday, May 30, 2023 - link
I speculate - different dies can produce different grades of silicon - in this case 10 Gbps vs 8 Gbps.
Kevin G - Tuesday, May 30, 2023 - link
Not all the dies in a stack will connect correctly. That is why nVidia's special H100 accelerators for ChatGPT have 94 GB of memory enabled instead of the 96 GB that is physically on the module. Two dies across the six stacks don't pass testing and are disabled. Maximum bandwidth figures are advertised for a fully functional stack. The reason for partially functional stacks is that using through-silicon vias (TSVs) is tricky and there are well over two thousand vias per stack. With products like nVidia's H100, there are six stacks of HBM which also have to connect to the H100 die. That means ~24,000 points of failure in packaging between the H100 and the memory bus.
Shaunathan - Tuesday, May 30, 2023 - link
kevin you smart i like you
rpg1966 - Tuesday, May 30, 2023 - link
Thanks!