[Meatloaf] Sleezon's 100K AI Cluster
Jeff Hayas
jeff.hayas at gmail.com
Sat Nov 2 18:51:52 MDT 2024
Some interesting trivia on the NVIDIA H100 GPU:
(1) "Nvidia is said to have opted to outsource the production of its
next-generation GPUs to Taiwan's TSMC. Nvidia intends to manufacture its
H100 GPUs on TSMC's 4-nanometer manufacturing technology. The new GPUs will
be available beginning in the third quarter of 2022." [1]
<https://www.guru3d.com/story/nvidia-will-manufacture-h100-gpus-using-tsmc-4-nm-process/>
(2) "The NVIDIA H100 Tensor Core GPU delivers exceptional performance,
scalability, and security for every workload. H100 uses breakthrough
innovations based on the NVIDIA Hopper™ architecture to deliver
industry-leading conversational AI, speeding up large language models
(LLMs) by 30X. H100 also includes a dedicated Transformer Engine to solve
trillion-parameter language models." [2]
<https://www.nvidia.com/en-us/data-center/h100/>
(3) "Nvidia Makes 1000% Profit on H100 GPUs" [3]
<https://semiwiki.com/forum/index.php?threads/nvidia-makes-1000-profit-on-h100-gpus.18591/>
(4) "Nvidia is seemingly considering Intel’s foundries to manufacture its
H100 AI GPUs. Team Green may start slow with a small batch to test the
waters, potentially leading to larger orders if everything goes as planned.

Following the great demand for its H100 GPUs and TSMC’s overloaded
calendar, Nvidia is apparently looking for alternative factories to build
its chips. The GPU giant may soon add Intel Foundry Services (IFS) to its
suppliers since TSMC alone can’t satisfy its needs.

Currently, all major Nvidia chips – A100, A800, A30, H100, H800, H200,
GH200, etc. – are manufactured by TSMC. This makes availability very
susceptible to unexpected events, be they natural or political in nature.
It’s a delicate position for one of the most valued companies on the
planet.

According to MyDrivers’ sources, Intel has the capacity to produce 5,000
wafers per month for Nvidia using its advanced process and packaging
technologies. While the exact chips remain unknown, Nvidia is likely to go
with its high-margin H100 AI GPUs, which are in short supply. Depending on
the wafer size and yield, Intel could make between 300,000 and 800,000 H100
GPUs per month – roughly speaking.

Although TSMC expects to double its capacity by the end of 2024 to
20,000 wafers per month, up from 11,000 back in 2023, Nvidia’s appetite
will not be satiated. Since Intel already has an alternative to TSMC’s
CoWoS-S packaging in the form of Foveros 3D stacking technology, Nvidia has
more reasons to diversify." [4]
<https://www.club386.com/nvidia-may-select-intel-for-some-of-its-h100-gpu-production/>
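As a sanity check on that wafer math, here is a quick back-of-envelope sketch.
The 5,000 wafers/month figure comes from the quote above; the 300 mm wafer
size, the ~814 mm^2 GH100 die area, and the gross-die approximation are my
own assumptions, not from the article.

```python
import math

def gross_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> float:
    """Standard gross die-per-wafer approximation: wafer area over die area,
    minus an edge-loss correction term."""
    r = wafer_diameter_mm / 2
    return (math.pi * r * r) / die_area_mm2 \
        - (math.pi * wafer_diameter_mm) / math.sqrt(2 * die_area_mm2)

WAFERS_PER_MONTH = 5_000   # MyDrivers figure quoted above
WAFER_MM = 300             # assumption: standard 300 mm wafers
DIE_MM2 = 814              # assumption: published GH100 die size, ~814 mm^2

gross = gross_dies_per_wafer(WAFER_MM, DIE_MM2)   # ~63 gross dies per wafer
monthly = gross * WAFERS_PER_MONTH                # ~317,000 gross dies/month
print(f"~{gross:.0f} gross dies/wafer, ~{monthly:,.0f} gross dies/month")
```

At roughly 63 gross dies per wafer this lands a bit over 300,000 dies per
month before yield losses, which matches the low end of the quoted
300,000-800,000 range; the high end looks hard to reach with a die this large.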
(5) "Nvidia reportedly selects Intel Foundry Services for GPU packaging
production — could produce over 300,000 H100 GPUs per month." [5]
<https://www.tomshardware.com/pc-components/gpus/nvidia-reportedly-selects-intel-foundry-services-for-chip-packaging-production-could-produce-over-300000-h100-gpus-per-month>
(6) "Which Companies Own The Most Nvidia H100 GPUs?" [6]
<https://www.visualcapitalist.com/which-companies-own-the-most-nvidia-h100-gpus/>
- Meta ... 350K
- xAI/X ... 100K
- Tesla ... 35K
- Lambda ... 30K
- Google ... 26K
- Oracle ... 16K
The first 3 companies use them for a "Private Cloud". The next 3 use them
for a "Public Cloud".
-------
All of this is - as Spock might say - "Interesting".
What makes me most nervous about it all is NOT a Colossus takeover of the
world.
Rather, it's China's temptation to invade Taiwan to take over the
fabrication tech that makes all of this possible.
Or as Jersey guys put it: "Eh, I'm just sayin."
-- Uncle Ersatz
On Sat, 2 Nov 2024 at 18:05, Jeff Hayas <jeff.hayas at gmail.com> wrote:
>
> Wow. At 28K$ (retail) per GPU
> <https://www.amazon.com/NVIDIA-Hopper-Graphics-5120-Bit-Learning/dp/B0CXBNNNSD>,
> that's $2.8 Billion just for the GPUs (100,000 x $28K).
> Then there is the cost of Racks, Power and Cooling systems, and of course
> the data-interconnects (I wonder what architecture they use);
> add all that on top and the total probably lands north of $3-4
> Billion(US).
>
> I find it ironic that they chose to call the Super-AI system "Colossus",
> as in the 1970 film "Colossus: The Forbin Project"
> <https://en.wikipedia.org/wiki/Colossus:_The_Forbin_Project>.
> We'll know we're in trouble if the new Colossus system has Musk
> assassinated.
>
> -- Uncle Ersatz
>
>
> On Fri, 1 Nov 2024 at 17:50, pt <mnemotronic at gmail.com> wrote:
>
>>
>> For those who have heard stories of Elon Musk’s xAI building a giant AI
>> supercomputer in Memphis, this is that cluster. With 100,000 NVIDIA H100
>> GPUs, this multi-billion-dollar AI cluster is notable not just for its size
>> but also for the speed at which it was built. In only 122 days, the teams
>> built this giant cluster.
>>
>>
>> https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/
>>
>>
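For completeness, the cluster-cost arithmetic from the quoted message can be
sketched in a few lines. The 100,000-GPU count comes from the ServeTheHome
article and the $28K retail price from the Amazon listing above; the overhead
multiplier for racks, power, cooling, and interconnect is purely a guess.

```python
GPU_COUNT = 100_000        # xAI Colossus, per the ServeTheHome article above
GPU_PRICE_USD = 28_000     # retail price from the Amazon listing above

gpu_cost = GPU_COUNT * GPU_PRICE_USD
print(f"GPUs alone: ${gpu_cost / 1e9:.1f}B")

# Racks, power, cooling, and interconnect add more on top; the multipliers
# below are pure guesses, bracketing the total rather than pinning it down.
for overhead in (1.3, 1.5):
    print(f"Total at {overhead}x GPU cost: ${gpu_cost * overhead / 1e9:.1f}B")
```

The GPU line item alone already comes to $2.8 Billion, so any realistic
all-in figure sits comfortably above $3 Billion.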