Graviton3 performance

Since AWS announced Graviton3 at the end of last year, we have been eagerly looking forward to test-driving it. After three weeks with it, our conclusion is that Graviton3 takes the Arm microarchitecture in the cloud to a new level of HPC performance and can now compete mano a mano with x86_64 processors.
Before discussing some of our preliminary results, a few introductory remarks about the processor itself. Graviton3 uses the Arm Neoverse V1 microarchitecture and implements the ARMv8 ISA (not ARMv9, as speculated in some discussion groups). It is built with 5 nm technology and runs at 2.6 GHz, which is a very slight increase versus Graviton2 (2.5 GHz). It uses DDR5 memory and PCI-Express 5.0 and doubles the number of double-precision instructions per cycle (DP-IPC) versus its predecessor, which could theoretically boost performance for HPC apps by 100%. Further technical details are discussed by one of the AWS team members.
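As a rough illustration of where the "100%" figure comes from, the theoretical per-core gain can be sketched as the clock ratio multiplied by the per-cycle throughput ratio. The function name below is our own, and the result is only an upper bound that ignores memory and other bottlenecks:

```python
def theoretical_dp_speedup(clock_new_ghz, clock_old_ghz, per_cycle_ratio):
    """Upper bound on per-core double-precision speedup.

    Assumes performance scales with clock frequency times the number of
    double-precision operations completed per cycle, and nothing else.
    """
    return (clock_new_ghz / clock_old_ghz) * per_cycle_ratio


# Values taken from the text: 2.6 GHz vs 2.5 GHz, and a doubled DP-IPC.
speedup = theoretical_dp_speedup(2.6, 2.5, 2.0)
print(f"Theoretical per-core DP speedup: {speedup:.2f}x")
```

Real applications rarely reach this bound because most of them are limited by memory bandwidth as well as by floating-point throughput.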
The following graphs reflect performance on a c7g.16xlarge instance, which is similar to the widely tested c6g.16xlarge instance but with the upgraded processor. The results also include measurements with hpc6a.48xlarge (AMD EPYC3) and c6i.32xlarge (Intel Ice Lake) instances for comparison purposes. The first measurements focus on memory bandwidth and the HPCG benchmark, which is predominantly memory bound. Then, the discussion turns to app performance.

Memory performance

The memory bandwidth for the 4 instances was measured with the STREAM (TRIAD) benchmark:
The upgrades to DDR5 memory along with the use of PCI-Express 5.0 have increased memory bandwidth by 50% compared to Graviton2 (as previously revealed by AWS). To obtain a first estimate of the combined performance of memory and double-precision instructions, we used the HPCG benchmark. The performance gain of Graviton3 versus Graviton2 is greater than 75%. Although this is not the theoretical maximum (based on DP-IPC), it is an impressive achievement, particularly from a historical perspective, as this is only the third generation of this processor. Furthermore, the HPCG GFLOP/s per core is now either superior to or within 5% of its x86_64 competitors.
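For readers unfamiliar with STREAM, the TRIAD kernel behind the bandwidth numbers can be sketched as follows. This is a minimal pure-Python illustration, not the official benchmark (which is written in C/Fortran and run with OpenMP); the array size is a hypothetical small value, and the bandwidth it prints reflects interpreter overhead rather than the memory system:

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM TRIAD kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth_gbs(n, seconds, bytes_per_element=8):
    """Bandwidth in GB/s, counting three arrays of n elements:
    two reads (b and c) and one write (a)."""
    return 3 * bytes_per_element * n / seconds / 1e9


n = 200_000  # hypothetical size; real runs use arrays far larger than the caches
a = array("d", [0.0]) * n
b = array("d", [1.0]) * n
c = array("d", [2.0]) * n

start = time.perf_counter()
triad(a, b, c, 3.0)
elapsed = time.perf_counter() - start

print(f"Check value a[0] = {a[0]}")
print(f"{triad_bandwidth_gbs(n, elapsed):.3f} GB/s (interpreter-bound, not a hardware measurement)")
```

The point of the sketch is the accounting: each TRIAD iteration moves 24 bytes of double-precision data, so the reported bandwidth is 24 times the array length divided by the elapsed time.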

WRF & CMAQ performance

To assess whether these gains translate into app performance, we selected two benchmark cases, one for WRF and one for CMAQ. For WRF, our tests used a domain over the continental U.S. with a 12 km resolution. The grid is 425x300x50 points and the time step is 72 seconds. Here, WRF v4.4 is used, with the history and restart files compressed with HDF5. The results reflect wall times for a 12-hour forecast. The CMAQ tests use the standard U.S. Southeast (2016) benchmark, which consists of a 100x80x35 grid with a 12 km resolution. The results reflect wall times for a 24-hour forecast with 218 tracked species. Our testing shows significant gains in switching from Graviton2 to Graviton3, with performance coming very close to that of the x86_64 processors even when using per-core metrics.
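To give a sense of the size of these two cases, the grid dimensions and time step quoted above can be turned into point and step counts. The helper names are ours, and `core_seconds` is just one possible per-core metric, shown here as an assumption about how such a normalization could be done:

```python
def grid_points(nx, ny, nz):
    """Total number of grid points in a 3-D domain."""
    return nx * ny * nz


def time_steps(forecast_hours, dt_seconds):
    """Number of model time steps needed to cover the forecast length."""
    return (forecast_hours * 3600) // dt_seconds


def core_seconds(wall_seconds, cores):
    """Wall time multiplied by core count, so that instances with
    different numbers of cores can be compared on a per-core basis."""
    return wall_seconds * cores


wrf_points = grid_points(425, 300, 50)   # WRF continental U.S. domain
wrf_steps = time_steps(12, 72)           # 12-hour forecast, 72 s time step
cmaq_cells = grid_points(100, 80, 35)    # CMAQ U.S. Southeast benchmark

print(f"WRF:  {wrf_points:,} grid points, {wrf_steps} time steps")
print(f"CMAQ: {cmaq_cells:,} grid cells")
```

With a metric like `core_seconds`, a lower value means less total compute was spent on the same forecast, regardless of how many cores each instance offers.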

Air quality modeling is now available at AWS and Azure

The latest versions of CMAQ, WRF-CMAQ and CAMx are now available on images from the AWS and Azure Marketplaces. These apps come precompiled and optimized for different processors and architectures (x86_64 and AArch64). The images are available to any organization or person with a valid AWS or Azure account. CMAQ includes the ISAM and DDM3D models in addition to the standard compilation. The images also incorporate several pre- and postprocessing tools, such as NCL, SMOKE, IDV and VERDI, among others.
We have performed several benchmarks with the images to evaluate performance with different public cloud hardware. For CMAQ, the first benchmark is the standard U.S. Southeast (2016) benchmark with a domain size of 100x80x35 and 218 tracked species. The figure shows wall (computational) times from the benchmark versus the number of cores for different AWS/Azure options and the EPA cluster.
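When reading a plot of wall time versus number of cores, it can help to convert the timings into speedup and parallel efficiency. The sketch below shows one common way to do this; the timings in `runs` are hypothetical placeholders chosen for illustration and are not results from our benchmarks:

```python
def speedup(t_ref, t_n):
    """Speedup of a run relative to a reference run."""
    return t_ref / t_n


def parallel_efficiency(cores_ref, t_ref, cores_n, t_n):
    """Speedup divided by the increase in core count (1.0 = ideal scaling)."""
    return speedup(t_ref, t_n) / (cores_n / cores_ref)


# Hypothetical (cores, wall seconds) pairs, NOT measured values.
runs = [(16, 800.0), (32, 420.0), (64, 230.0)]

cores_ref, t_ref = runs[0]
for cores, t in runs:
    s = speedup(t_ref, t)
    e = parallel_efficiency(cores_ref, t_ref, cores, t)
    print(f"{cores:3d} cores: speedup {s:.2f}x, efficiency {e:.0%}")
```

An efficiency well below 100% at high core counts usually indicates that communication or memory bandwidth, rather than compute, has become the limiting factor.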

The results with the new AMD EPYC3 processor (codenamed “Milan”) are particularly impressive as wall times are below 200 s for single instances.
In addition to the stand-alone CMAQ, we have also benchmarked a similar case using WRF-CMAQ with short-wave feedback. The computational times are approximately five times the previous measurements as the figure shows:

Compared with the EPA cluster, the gains for WRF-CMAQ with cloud hardware are also very significant. Another CMAQ benchmark that is sometimes used to assess large systems covers the continental U.S. (CONUS1) with a 499x299x35 grid and 219 tracked species. The figure shows the results for the CONUS1 benchmark with AWS IaaS.
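To put the CONUS1 case in perspective, its size can be compared with the U.S. Southeast benchmark using the grid dimensions and species counts quoted in the text. This is simple arithmetic under the assumption that work scales roughly with the number of cells times the number of tracked species; actual run times also depend on chemistry, I/O and time stepping:

```python
# Grid dimensions and species counts as quoted in the text.
southeast = {"grid": (100, 80, 35), "species": 218}
conus1 = {"grid": (499, 299, 35), "species": 219}


def cells(case):
    """Number of grid cells in a benchmark case."""
    nx, ny, nz = case["grid"]
    return nx * ny * nz


def cell_species(case):
    """Cells times tracked species, a crude proxy for problem size."""
    return cells(case) * case["species"]


ratio_cells = cells(conus1) / cells(southeast)
ratio_work = cell_species(conus1) / cell_species(southeast)

print(f"CONUS1 cells:    {cells(conus1):,}")
print(f"Southeast cells: {cells(southeast):,}")
print(f"Cell ratio: {ratio_cells:.2f}, cell-species ratio: {ratio_work:.2f}")
```

By this measure CONUS1 is more than an order of magnitude larger than the Southeast case, which is why it is the one used to assess larger systems.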

Updated WRF benchmarks

One of the great things about working with public CSPs is that they continue bringing the latest hardware options from different manufacturers to the benefit of all of us. We continue evaluating the performance of several HPC apps with the latest hardware so that our customers can make informed decisions. Our updated round of WRF benchmarks includes measurements with Intel Ice Lake and AMD EPYC3 (Milan) processors from AWS and Azure. The figure reflects overall WRF performance on computing and high-performance-computing instances from AWS and Azure.

AWS introduced Graviton3 at re:Invent 2021

During re:Invent 2021 in Las Vegas, AWS CEO Adam Selipsky announced the future availability of AWS's new Graviton3 processor. This announcement came two years after the launch of Graviton2 and in the middle of the battle for supremacy in server processors, a market traditionally dominated by the x86_64 architecture. Graviton3 is based on the AArch64 (Arm) architecture and will power the first family of 7th generation EC2 instances. The new processor promises to boost almost every aspect of performance versus Graviton2, and to position itself as the strongest competitor to the new AMD and Intel server processors, particularly as a more economical option.
As usual, AWS has been tight-lipped about this project and there is only so much that we know about the new processor. Like Graviton2, it has been developed by Annapurna Labs. It uses 5 nm technology and comes in a single socket with 64 cores running at 2.6 GHz, which means a very slight increase versus Graviton2 (2.5 GHz). For HPC apps, critical upgrades include doubling the number of double-precision operations per cycle, using PCI-Express 5.0, and being the first processor to include DDR5 memory, which has about 50 percent more bandwidth than the DDR4 memory commonly used by the older generation of server processors. Other technical enhancements seem to be the upgrade to the ARMv8.5 ISA and a doubling of the L3 cache. Some initial estimates point to a performance improvement close to 60% in double-precision operations, which bodes well for HPC apps.
At the time of this newsletter, AWS has still not communicated our preview window to us, but we will provide an update as soon as we have tested the new instances.