Microprocessor Usage Trends in HPC Systems for 2022-2023

Background

In 2018, Intel x86 microprocessors were found to be particularly susceptible to the Meltdown security vulnerability, whereby any system that allowed out-of-order execution was potentially open to an attack in which a process could read memory it was not authorised to access [1]. As this vulnerability did not affect AMD processors, suggestions were raised that AMD could be a more effective choice for HPC environments. In the same year, a notable topic at the International Supercomputing Conference was the European Processor Initiative (EPI), a programme to develop processors for domestic European supercomputers based on ARM (Advanced RISC Machine) and RISC-V, in the form of a "European Processor Accelerator" system-on-a-chip [2]. With the benefit of four years of hindsight, it is valuable to consider the current trends in microprocessor architecture.

A wide-ranging analysis was recently presented at HPCAsia2021 [3], covering trends over the last 27 years across more than 10,000 computers from the Top500, with a more detailed analysis of 28 systems from 2009 to 2019. Of particular note in this context is the steady growth in recent years of heterogeneous supercomputers, i.e., systems with GPGPUs, which now make up 28% of the Top500 and have been increasing by around 1% per annum. The authors note: "We expect this increasing trend will continue, particularly for addressing technological limitations controlling the power consumption", a claim that could certainly be justified by the use of Nvidia GPUs or Intel Xeon Phi (discontinued as of 2020) as co-processors. At the time, most systems clustered around 1 GB of memory per core, with only three contemporary systems at 2 GB per CPU core; there was wide variation in compute performance and parallel file system storage, and an increasing use among the most powerful systems of burst buffer storage to overcome the performance gap between memory and the file system.

Recent Developments

It is also necessary to explore trend changes in microprocessor architecture that are not covered by the HPCAsia paper, especially over the last eighteen months. In particular, there has been strong growth in AMD EPYC processors, which increased their share of the Top500 almost five-fold in the June 2021 list compared to a year earlier and were present in half of the 58 new entries on that list [4]. Also of note is AMD's HPC Fund for COVID-19 research, which includes a donated system with EPYC processors and Instinct accelerators. Specifically, AMD processors appeared in 49 systems compared to 11 a year earlier, including 3 new entrants in the top 10; however, none of these are systems with Instinct accelerators [5]. Intel is still dominant of course, with 431 systems in the June 2021 Top500, albeit down from 470 the previous year.

In the November 2021 list, there was one new entry into the top 10, an AMD system (the Microsoft Azure system, "Voyager-EUS2") with NVIDIA A100 GPUs. In that list, the top 10 comprised four AMD systems, one Fujitsu (albeit in first place), one Sunway, two IBM POWER9, and two Intel Xeon [6]. AMD's total Top500 share rose to 73 systems for November 2021 and Intel's fell to 401, although Intel added 42 new systems to AMD's 28 (the aggregate core count of AMD's new entries was, however, higher). Much of this trend is driven by the EPYC Milan series, launched in March 2021, which compares strongly to Intel's Ice Lake, released a month later. Among accelerators, Nvidia's share in November 2021 was 143 systems, roughly stable compared to the previous values of 138 and 141; Nvidia GPUs are in seven of the top 10 clusters and 14 of the top 20. It can also be stated with considerable confidence that another AMD system, LUMI from a European consortium, will be in the top 10 at the next release [7].

Whilst the European Union's home-grown ARM/RISC-V exascale systems are not planned for public release until 2023, the programme remains within its planned timeline, with stage one completed at the end of 2021. This stage included the Rhea general-purpose processor, using ARM Neoverse V1 cores alongside 29 RISC-V cores [8], with an emphasis on security, power utilisation, and integration with the European automotive industry. The EPI team heavily advocates the capacity of open-source RISC-V acceleration to transform the HPC space, with a number of architectures including long-vector processing units, stencil and tensor accelerators, and variable-precision accelerators.

At the moment, however, ARM-based systems make up only six computers in the Top500; that does, however, include the world's top system, Fugaku, which was the first system to exceed an exaflop (on the mixed-precision HPL-AI benchmark) and has held the number one position since June 2020. Of the six ARM-based systems, five use Fujitsu processors, while the remaining one uses Marvell's ThunderX2, a now-cancelled line. The prospect of ARM/RISC-V processors increasing their share in HPC depends very much on assembly-level and compiler software development. Complex Instruction Set Computer (CISC) architectures, such as x86, provide a more comprehensive set of instructions, but at a cost in flexibility and power consumption. RISC architectures, in contrast, have a smaller instruction set and, whilst usually requiring more instructions (and hence cycles) to achieve the same task, can add new extensions more flexibly and have much lower power consumption for the same work.
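
As an illustrative and deliberately simplified sketch of that difference, consider incrementing a value in memory: a CISC ISA such as x86-64 can typically encode the read-modify-write as a single instruction, whereas a RISC ISA such as RISC-V expresses it as separate load, add, and store instructions. The assembly shown in the comments below is representative compiler output, not an authoritative listing.

    /* cisc_vs_risc.c - illustrative comparison of typical code generation
     * for a memory increment on a CISC ISA (x86-64) versus a RISC ISA
     * (RISC-V). The assembly in the comments is representative, not
     * exhaustive. */
    #include <stdio.h>

    static long counter = 0;

    void increment(void)
    {
        /* x86-64 (CISC): one read-modify-write instruction, e.g.
         *     addq $1, counter(%rip)
         *
         * RISC-V (RISC): separate load / add / store instructions, e.g.
         *     la   a5, counter
         *     ld   a4, 0(a5)
         *     addi a4, a4, 1
         *     sd   a4, 0(a5)
         *
         * The RISC sequence uses more instructions, but each one is simple
         * to decode and execute, which helps keep the core small and
         * power efficient. */
        counter += 1;
    }

    int main(void)
    {
        increment();
        printf("counter = %ld\n", counter);
        return 0;
    }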

In terms of the SPEC (Standard Performance Evaluation Corporation) SPECspeed2017 and SPECrate2017 tests, AMD's Milan outperforms Intel's Ice Lake in each of the 16 tests conducted, in both one- and two-socket configurations, and for both integer and floating-point performance ratings [9]; this is perhaps unsurprising given that Milan offers more cores per processor (64 vs 40, albeit with individual cores clocked around 25% lower), more PCIe lanes, and a very large L3 cache (up to 768 MB on the Milan-X with 3D V-Cache). The L3 cache will be particularly significant for simulation applications built on many data points and rapid read-writes, such as molecular modelling, climate and weather simulations, computational fluid dynamics, and finite element analysis. Further, Milan also offers up to 42% lower power usage and 50% lower rack space requirements.
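
To make the cache argument concrete, the sketch below (a generic illustration, not drawn from the benchmarks in [9]; the matrix and block sizes are arbitrary) shows cache blocking of a matrix multiplication: the block size is chosen so that the working set of three sub-matrices stays resident in cache, which is precisely the kind of read/write-heavy access pattern that a very large L3 such as Milan-X's can keep out of main memory.

    /* blocked_matmul.c - minimal sketch of cache blocking (loop tiling).
     * BLOCK is a tunable, illustrative value; in practice it is chosen so
     * that roughly 3 * BLOCK * BLOCK * sizeof(double) fits in the target
     * cache level. */
    #include <stdio.h>

    #define N     1024
    #define BLOCK 64

    static double A[N][N], B[N][N], C[N][N];

    int main(void)
    {
        /* Fill A and B with something deterministic. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = (double)(i + j);
                B[i][j] = (double)(i - j);
            }

        /* Blocked C = A * B: each (ii, kk, jj) tile reuses BLOCK x BLOCK
         * sub-matrices while they are still resident in cache. */
        for (int ii = 0; ii < N; ii += BLOCK)
            for (int kk = 0; kk < N; kk += BLOCK)
                for (int jj = 0; jj < N; jj += BLOCK)
                    for (int i = ii; i < ii + BLOCK; i++)
                        for (int k = kk; k < kk + BLOCK; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + BLOCK; j++)
                                C[i][j] += a * B[k][j];
                        }

        printf("C[0][0] = %f\n", C[0][0]);
        return 0;
    }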

Whilst Intel's Ice Lake is a very significant advance on its Cascade Lake and Cooper Lake server predecessors, it seems essentially to be a generation behind AMD's EPYC. It is worth noting that since Ice Lake, Intel has released the Rocket Lake generation, with the Sapphire Rapids server parts promised in 2022 (Alder Lake, the equivalent desktop line, is already available). Nevertheless, a comparison between Rocket Lake (e.g., the Intel Core i9-11900T) and AMD's Zen 3-based Milan (EPYC 75F3) also reveals that Milan is ahead. Sapphire Rapids (originally expected in the last quarter of 2021) does promise up to 64 GB of high-bandwidth memory that can act as an L4-style cache, although this is a rather different approach from Milan-X's 768 MB of on-die L3.

Perhaps it should be expected that Intel will offer heavy discounts to provide a more competitive offering, given the advantage AMD systems hold on many baseline costs as well as on performance (Intel has, after all, fought hard in the courts over the years to invalidate AMD's and others' x86 licences). Another notable advantage is that Intel offers a greater variety of Ice Lake configurations than AMD does of Milan, providing greater specialisation for diverse workloads, and this variety extends across its other micro-architectures. For example, Intel's Alder Lake microprocessors outperform AMD's equivalent Ryzen 5000 series on desktop systems [10]. Whilst this advantage has led to some rather upbeat end-of-year remarks from Intel, it would be wrong to automatically extend the same comparison to server systems or, for that matter, to market share. Whilst in the past AMD has struggled financially, this has not been the case for some years, with obvious benefits to consumers: "AMD has literally never been in a stronger position to face Intel’s challenge. The company has now been profitable every year since 2018... What we actually have for the first time in at least 20 years is two financially stable and healthy x86 CPU design firms slugging it out for your dollars." [11]

Whilst both AMD and Intel use the same ISA, the implementation in micro-architecture is different. Another area in which Intel has always enjoyed an advantage, owing to market share, is extensions to the x86 instruction set architecture (e.g., Intel's 256-bit AVX/AVX2 and 512-bit AVX-512 SIMD extensions, AMD's XOP, etc.) and hardware-assisted virtualisation. For example, in a previous generation of micro-architecture AMD adopted an over-clocking approach to replicate the performance of the extensions on Intel micro-architectures [12]. As it turned out (c.f. Meltdown), Intel's own performance metrics rested on less than secure foundations. Today, whilst AMD does make an effort to support Intel's instruction set extensions, performance can still be suboptimal, ranging from simply "lower performance" to "export this environment variable for compatibility" for particular applications. One example has been seen with the Gaussian and MATLAB applications, which require an additional environment variable. These differences, and the requirements for environment modifications, are most evident in software that makes use of the Intel MKL [13]. Note that with hardware-assisted virtualisation (e.g., Intel HAXM), competitors do not even pretend to aim for compatibility.
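
As a concrete example of the "export this environment variable" case: for MKL versions up to roughly 2020.0, the widely reported (and officially unsupported) workaround is MKL_DEBUG_CPU_TYPE=5, which forces MKL's AVX2 code path on AMD CPUs [13]; later MKL releases removed the variable. The minimal launcher below is an illustrative sketch of how an HPC site might apply it per application; the wrapper itself is hypothetical, not a vendor-supported mechanism.

    /* mkl_amd_wrapper.c - illustrative launcher that sets the (undocumented,
     * pre-2020.1) MKL_DEBUG_CPU_TYPE=5 variable before exec'ing the target
     * application, so that MKL selects its AVX2 code path on AMD Zen CPUs.
     * Usage:  ./mkl_amd_wrapper ./my_mkl_application args...
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
            return 1;
        }

        /* Force MKL's AVX2 dispatch path; harmless if MKL ignores it. */
        if (setenv("MKL_DEBUG_CPU_TYPE", "5", 1) != 0) {
            perror("setenv");
            return 1;
        }

        execvp(argv[1], &argv[1]);
        perror("execvp");   /* only reached if exec fails */
        return 1;
    }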

Overall, it must be said that whilst Intel certainly retains the majority position in major HPC systems, and provides solid products with incremental improvements and a diverse range, the Milan series of EPYC processors from AMD has kept pace with comparable Intel systems and, with its enormous L3 cache, has pinpointed where the "big data" bottleneck in HPC systems really lies: between memory and the processor. This has been a known problem for quite a while. Whilst many vendors have talked about various solutions (such as reviving the old PDP-11 core memory idea with non-volatile memory used as cache), this is really the first architecture that has addressed the problem at scale. It would be unwise to overlook this opportunity: if it does not take advantage of architectures that address this problem, the University of Melbourne's flagship supercomputer would lose significant ground and fall out of being a world-class system. All other things being roughly equal, any microarchitecture that deals with the processor-memory gap - Intel, AMD, or ARM - should be considered as a priority.
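
The gap itself is easy to demonstrate with a STREAM-style triad, sketched below (the array size and timing approach are arbitrary illustrative choices, not a calibrated benchmark): the loop performs only two floating-point operations for every 24 bytes moved, so once the working set spills out of cache its speed is set almost entirely by memory bandwidth rather than by the cores' arithmetic capability, which is exactly the regime a very large L3 or other memory-side cache is designed to soften.

    /* triad.c - STREAM-style triad used to illustrate the processor-memory
     * gap. Sizes are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (32 * 1024 * 1024)   /* ~256 MB per array: larger than any cache */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) { perror("malloc"); return 1; }

        for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Triad: 2 flops per iteration, 24 bytes of memory traffic. */
        const double scalar = 3.0;
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("triad: %.2f GB/s effective bandwidth\n",
               3.0 * N * sizeof(double) / secs / 1e9);

        free(a); free(b); free(c);
        return 0;
    }

Compiled with optimisation (e.g. cc -O2 triad.c), the bandwidth this reports on current server CPUs is typically far below what the cores could sustain arithmetically, which is the gap described above.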

Certainly, a system that is heterogeneous in terms of processors is possible, as long as system engineers are aware of the potential need for build switches and multiple builds. This is no simple task: producing specific builds for heterogeneous systems that take advantage of each architecture is an often overlooked hurdle, especially when the software in question has a complex collection of dependent builds.
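
Where maintaining one build per node type becomes unwieldy, a complementary approach is to dispatch at run time on the instruction set extensions the node actually provides. The sketch below uses GCC/Clang's built-in CPU feature checks on x86; the kernel function names are hypothetical placeholders for architecture-specific builds of a real routine.

    /* dispatch.c - minimal sketch of runtime dispatch on a heterogeneous
     * cluster: pick a code path based on the CPU features of the node the
     * job actually landed on, instead of shipping one build per partition.
     * Requires GCC or Clang on x86 for __builtin_cpu_supports(). */
    #include <stdio.h>

    static void kernel_avx512(void)  { puts("running AVX-512 build of the kernel"); }
    static void kernel_avx2(void)    { puts("running AVX2 build of the kernel"); }
    static void kernel_generic(void) { puts("running generic build of the kernel"); }

    int main(void)
    {
        __builtin_cpu_init();

        if (__builtin_cpu_supports("avx512f"))
            kernel_avx512();
        else if (__builtin_cpu_supports("avx2"))
            kernel_avx2();
        else
            kernel_generic();

        return 0;
    }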

GPUs and Accelerators

In the GPU and accelerator space, the release of AMD's Instinct MI250 and MI250X in November 2021 compares very favourably to Nvidia's A100 80 GB GPU released in November 2020 [14]. One can expect that Nvidia will have a major release with "Ampere Next" (aka "Hopper"), expected in March 2022, although it would require a very significant improvement to close the performance gap: around fivefold across 64-bit floating-point vector and matrix compute, given the MI250X's published 47.9/95.7 TFLOPS FP64 vector/matrix [14] against the A100's 9.7/19.5 TFLOPS. Research modelling a notional next-generation "GPU-N" [15] has prompted the suggestion that: "Coming to the performance numbers, the 'GPU-N' (presumably Hopper GH100) produces 24.2 TFLOPs of FP32 (24% increase over A100) and 779 TFLOPs FP16 (2.5x increase over A100) which sounds really close to the 3x gains that were rumored for GH100 over A100. Compared to AMD's CDNA 2 'Aldebaran' GPU on the Instinct MI250X accelerator, the FP32 performance is less than half (95.7 TFLOPs vs 24.2 TFLOPs) but the FP16 performance is 2.15x higher." [16]

For their own part, albeit delayed by a year more than expected, Intel will certainly be launching the "Ponte Vecchio" Xe HPC GPU, and the promise of "petaflops in your palm" is quite plausible from the released specifications. "What stands out almost immediately is the amount of L2 cache leveraged by Ponte Vecchio: 408MB vs just 16MB on the Instinct MI200 and 40MB on the A100. However, in terms of raw compute, AMD has a lot more vector units: 7,040 across 110 CUs, resulting in an overall throughput of 95.7 TFLOPs, compared to just 19.5 TFLOPs on the NVIDIA A100. However, each of Intel’s CUs will be better fed with much higher cache hit rates and wider XMX matrix units. The MI250X has an 8192-bit wide bus paired with 128GB of HBM2e memory capable of transfer rates of up to 3.2TB/s. Intel hasn’t shared any details regarding the bus width or memory configuration of PVC just yet." [17]

At the same time, technical management must keep a careful eye on the development of stage two of the European Processor Initiative, and be prepared for the possibility that the next dominant processor type in high performance computing may very well not only be based on ARM/RISC-V, but may even be provided on open-source principles. This would be an extraordinary achievement, as the consumer benefits of competitive performance without vendor lock-in would be enormous. There is also an international shift in this regard [18]: the European Union's EPI is explicitly aimed at reducing reliance on foreign technologies, an objective other major international powers are also pursuing. In 2021, Russia revealed a programme based around RISC-V parts, combined with that country's Elbrus processors, and the People's Republic of China also has a RISC-V chip family (XiangShan) for personal computers, following its use of the Matrix-2000 and Sunway SW26010 RISC processors in response to US sanctions on the export of Intel Xeon Phi systems.

Summary

* Heterogeneous systems (CPU/GPU) are now normal in HPC and this will expand to include other architectures.
* AMD Zen CPUs are increasing as a proportion of the Top500 and are overcoming MKL/AVX-512 concerns.
* RISC systems are currently a very small percentage, but include the top system. RISC-V will be important in the future.
* Despite very competitive hardware offerings from AMD and Intel, GPUs are still overwhelmingly dominated by Nvidia.

References

[1] CVE-2017-5754 Detail
https://www.cve.org/CVERecord?id=CVE-2017-5754

[2] Lev Lafayette, New Developments in Supercomputing, Presentation to Linux Users of Victoria, September 4, 2018
http://levlafayette.com/files/2018luvsupercomputers.pdf

[3] Awais Khan, Hyogi Sim, Sudharshan S. Vazhkudai, Ali R. Butt, Youngjae Kim. An Analysis of System Balance and Architectural Trends Based on Top500 Supercomputers. HPCAsia2021: International Conference on High Performance Computing in Asia-Pacific Region, Association for Computing Machinery, 2021

[4] AMD Leads High Performance Computing Towards Exascale and Beyond, June 28, 2021
https://ir.amd.com/news-events/press-releases/detail/1012/amd-leads-high...

[5] AMD Quadrupled EPYC’s Top 500 Supercomputer Share In A Year, June 28, 2021
https://www.crn.com/news/components-peripherals/amd-quadrupled-epyc-s-to...

[6] November 2021 Top 500
https://www.top500.org/lists/top500/2021/11/

[7] Kurt Lust, EasyBuild on LUMI, a pre-exascale supercomputer, Proceedings of the 7th EasyBuild User Meeting, 24-28 January, 2022

[8] EPI Announces Successful Conclusion of European Processor Initiative Phase One, December 22, 2021
https://www.hpcwire.com/off-the-wire/epi-announces-successful-conclusion...

[9] AMD 3rd Gen Epyc CPUs Put Intel Xeon SPS on Ice in the Datacenter, NextPlatform, July 29, 2021
https://www.nextplatform.com/2021/07/29/amd-3rd-gen-epyc-cpus-put-intel-...
Nota Bene: Article sponsored by AMD.

[10] Paul Alcorn, CPU Benchmarks and Hierarchy 2022: Intel and AMD Processors Ranked, Tom's Hardware, January 8, 2022
https://www.tomshardware.com/reviews/cpu-hierarchy,4312.html

[11] Joel Hruska, Intel’s CEO is Wrong About AMD, ExtremeTech, January 19, 2022
https://www.extremetech.com/computing/330685-intels-ceo-is-wrong-about-amd

[12] Joel Hruska, Analyzing Bulldozer: Why AMD’s chip is so disappointing, ExtremeTech, October 24, 2011
https://www.extremetech.com/computing/100583-analyzing-bulldozers-scalin...

[13] Mingru Yang, MKL has bad performances on an AMD CPU, Nov 18, 2019
https://sites.google.com/a/uci.edu/mingru-yang/programming/mkl-has-bad-p...

[14] AMD Instinct MI200 Series Accelerator, December 2021
https://www.amd.com/system/files/documents/amd-instinct-mi200-datasheet.pdf

[15] Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee et al, ACM Transactions on Architecture and Code Optimization, Volume 19 Issue 1 March 2022
(pre-release available online December 2021)
https://doi.org/10.1145/3484505

[16] Hassan Mujtaba, Mysterious NVIDIA ‘GPU-N’ Could Be Next-Gen Hopper GH100 In Disguise, December 21, 2021
https://wccftech.com/mysterious-nvidia-gpu-n-could-be-next-gen-hopper-gh...

[17] Areej Syed, Intel Ponte Vecchio Specs. Hardware Times, November 15, 2021
https://www.hardwaretimes.com/intel-ponte-vecchio-specs-1024-cores-408mb...

[18] Gareth Halfacree, First RISC-V computer chip lands at the European Processor Initiative, The Register, 22 Sep, 2021
https://www.theregister.com/2021/09/22/first_riscv_epi_chip/