December 26, 2024

Westside People

Complete News World

2 ExaFLOPS, tens of thousands of CPUs and GPUs

Argonne National Laboratory and Intel said Thursday that they have installed all 10,624 blades of the Aurora supercomputer, a machine announced in 2015 that has had a particularly bumpy history. The system promises a peak theoretical performance of more than 2 FP64 ExaFLOPS from tens of thousands of Xeon Max ‘Sapphire Rapids’ CPUs with on-package HBM2E memory and Data Center GPU Max ‘Ponte Vecchio’ GPUs. The system is set to go online later this year.

“Aurora is Intel’s first Max Series GPU deployment, the largest Xeon Max CPU-based system, and the largest GPU cluster in the world,” said Jeff McVeigh, Intel Corporate Vice President and General Manager, Super Compute Group.

The Aurora supercomputer looks impressive by the numbers. The machine is powered by 21,248 general-purpose processors with over 1.1 million cores for workloads that require traditional CPU horsepower, and by 63,744 compute GPUs for AI and HPC workloads. Memory-wise, the CPUs have 1.36 petabytes of on-package HBM2E memory and 10.9 petabytes of DDR5 memory, in addition to the 8.16 petabytes of HBM2E carried by the Ponte Vecchio GPUs.
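As a rough sanity check, the totals quoted above can be broken down per blade and per device. The short Python sketch below does that arithmetic. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and an even spread of memory across devices; the variable names are illustrative only and not taken from Intel or Argonne.

```python
# Figures quoted in the article; everything else is derived from them.
BLADES = 10_624
CPUS = 21_248
GPUS = 63_744
CPU_HBM_PB = 1.36   # HBM2E on the Xeon Max CPUs, in petabytes
GPU_HBM_PB = 8.16   # HBM2E on the Ponte Vecchio GPUs, in petabytes

# Assumption: decimal units, as is usual for vendor capacity figures.
PB = 10**15
GB = 10**9

cpus_per_blade = CPUS / BLADES
gpus_per_blade = GPUS / BLADES

# Assumption: memory is spread evenly across devices of the same type.
hbm_per_cpu_gb = CPU_HBM_PB * PB / CPUS / GB
hbm_per_gpu_gb = GPU_HBM_PB * PB / GPUS / GB

print(f"CPUs per blade: {cpus_per_blade:.0f}")
print(f"GPUs per blade: {gpus_per_blade:.0f}")
print(f"HBM2E per CPU:  {hbm_per_cpu_gb:.0f} GB")
print(f"HBM2E per GPU:  {hbm_per_gpu_gb:.0f} GB")
```

Under those assumptions the totals work out to two CPUs and six GPUs per blade, with roughly 64 GB of HBM2E per CPU and 128 GB per GPU.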

Aurora uses 166 racks with 64 blades each. It spans eight rows and occupies an area equivalent to two basketball courts. That does not include the Aurora storage subsystem, which uses 1,024 all-flash storage nodes that provide 220PB of storage and a total bandwidth of 31TB/s. So far, Argonne National Laboratory has not released official power consumption figures for Aurora or its storage subsystem.

The supercomputer, which will be used for a variety of workloads from nuclear fusion simulations to weather prediction and from aerodynamics to medical research, uses HPE’s Shasta supercomputer architecture with Slingshot interconnects. In the meantime, before the system passes ANL’s acceptance tests, it will be used to train large-scale generative AI models for science.

“As we work on acceptance testing, we will use Aurora to train some large-scale open source generative AI models for science,” said Rick Stevens, associate laboratory director at Argonne National Laboratory. “With over 60,000 Intel Max GPUs, a very fast I/O system, and a massive all-solid-state storage system, Aurora is the perfect environment for training these models.”

Even though the Aurora blades are installed, the supercomputer still has to pass a series of acceptance tests, which is a common procedure for supercomputers. Once it clears them and comes online later in the year, it is expected to reach a theoretical performance in excess of 2 ExaFLOPS (two billion billion, or 2 × 10^18, floating point operations per second). With that performance, it is expected to take first place on the Top500 list.

The Aurora installation marks several milestones: it is the industry’s first supercomputer with performance above 2 ExaFLOPS and the first Intel-based ExaFLOPS-class machine. It also marks the conclusion of the Aurora saga, which began eight years ago and has seen its fair share of bumps.

Originally revealed in 2015, Aurora was intended to be powered by Intel Xeon Phi coprocessors and was expected to deliver approximately 180 PetaFLOPS in 2018. However, Intel abandoned Xeon Phi in favor of compute GPUs, which made it necessary to renegotiate the agreement with Argonne National Laboratory to deliver an ExaFLOPS system by 2021.

Delivery of the system was further delayed by complications with Ponte Vecchio’s compute tile, caused by the delay of Intel’s 7nm production node (now known as Intel 4) and the need to redesign the tile for TSMC’s N5 (5nm-class) process technology. Intel finally introduced its Data Center GPU Max products late last year and has now shipped more than 60,000 of those GPUs to ANL.