Intel 11th Gen Core RocketLake AVX512 Performance Improvement vs AVX2/FMA3

What is AVX512?

AVX512 (Advanced Vector eXtensions 512) is the 512-bit SIMD instruction set that follows the earlier 256-bit AVX2/FMA3/AVX instruction sets. Intel originally introduced it, albeit in a somewhat different form, with its “Xeon Phi” GPGPU accelerators; it has now reached Intel's desktop CPU line with “RocketLake” (RKL), having previously been available in HEDT / server / workstation “Skylake-X” (SKL-X) and mobile “IceLake” (ICL).

While the desktop/mobile “Skylake” arch was rumoured to be designed for AVX512 as well, based on core changes (ports widened to 512-bit, unit changes, etc.), no public way of enabling it has been found.

AVX512 consists of multiple extensions and not all CPUs (or GPGPUs) may implement them all:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit. [supported by SKL-X, ICL, RKL]
  • AVX512-DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit [supported by SKL-X, ICL, RKL]
  • AVX512-BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit [supported by SKL-X, ICL, RKL]
  • AVX512-VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers [supported by SKL-X, ICL, RKL]
  • AVX512-CD – Conflict Detection – helps vectorise loops with possible address conflicts [supported by SKL-X, ICL, RKL]
  • AVX512-ER – Exponential & Reciprocal – transcendental operations [Xeon Phi only]
  • AVX512-VNNI (Vector Neural Network Instructions, DL Boost INT8/INT16) e.g. convolution [supported by ICL, RKL]
  • AVX512-VBMI/VBMI2 (Vector Byte Manipulation Instructions) – various uses
  • AVX512-VAES (Vector AES) accelerating block-crypto [supported by ICL, RKL]
  • AVX512-GFNI (Galois Field) – e.g. used in AES-GCM [supported by ICL, RKL]
  • more sets will be introduced in future versions
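
As the extensions are reported individually, software has to check for each one it needs. On Linux the kernel lists them as flags in /proc/cpuinfo; the sketch below maps such a flag string to the extension names used in this article. The flag spellings are assumed from recent Linux kernels, and the function name `avx512_extensions` and the sample flag line are made up for illustration (they are not taken from Sandra or from a real machine).

```python
# Map Linux /proc/cpuinfo flag names to the extension names used in the article.
# The flag spellings are an assumption based on recent Linux kernels.
FLAG_TO_EXTENSION = {
    "avx512f": "AVX512F",
    "avx512dq": "AVX512-DQ",
    "avx512bw": "AVX512-BW",
    "avx512vl": "AVX512-VL",
    "avx512cd": "AVX512-CD",
    "avx512er": "AVX512-ER",
    "avx512ifma": "AVX512-IFMA",
    "avx512_vnni": "AVX512-VNNI",
    "avx512vbmi": "AVX512-VBMI",
    "avx512_vbmi2": "AVX512-VBMI2",
    # vaes/gfni alone do not imply 512-bit support; avx512f must also be present.
    "vaes": "AVX512-VAES",
    "gfni": "AVX512-GFNI",
}


def avx512_extensions(flags_line):
    """Return the extensions named in a whitespace-separated flag string."""
    present = set(flags_line.lower().split())
    return [ext for flag, ext in FLAG_TO_EXTENSION.items() if flag in present]


# Hypothetical flag line, made up for illustration.
sample = "fpu sse2 avx avx2 fma avx512f avx512dq avx512bw avx512vl vaes gfni"
print(avx512_extensions(sample))
```

On a real Linux system the input would be the `flags` line read from /proc/cpuinfo; on Windows (the environment used for the tests here) a CPUID-based query would be needed instead.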

Unfortunately, simply doubling register width does not automatically double performance: dependencies, memory load/store latencies and even data characteristics limit the gains, and realising the full potential may require future archs or even better tools.

AVX512 usage does increase the processor's power consumption; this is why turbo speed has historically been limited in AVX512 mode, just as it was previously limited in AVX2/AVX mode. While SKL-X already consumed a relatively high amount of power on the HEDT platform (which had to be dissipated by the cooling system), RKL is the first desktop processor to consume/dissipate such high amounts (up to 250W).
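
A rough way to see why wider registers and lower turbo clocks combine into less than a 2x gain is an Amdahl-style estimate. The sketch below is a simplification and not how Sandra measures anything: it assumes a fraction `p` of the run time scales with vector width and that the 512-bit code runs at `clock_ratio` times the 256-bit clock. Both parameter names and all example values are illustrative assumptions.

```python
def estimated_speedup(p, width_ratio=2.0, clock_ratio=1.0):
    """Amdahl-style estimate of the gain from wider vectors.

    p           -- fraction of run time that scales with vector width (0..1)
    width_ratio -- 2.0 when going from 256-bit to 512-bit registers
    clock_ratio -- 512-bit clock divided by 256-bit clock (1.0 = no throttling)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return clock_ratio / ((1.0 - p) + p / width_ratio)


def vector_fraction(speedup, width_ratio=2.0):
    """Invert the model (with clock_ratio = 1): the p implied by a speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / width_ratio)


# Illustrative values only, not measurements.
print(estimated_speedup(1.0))                   # fully vectorised, same clock
print(estimated_speedup(0.7))                   # 70% of the time scales
print(estimated_speedup(0.7, clock_ratio=0.9))  # same, with a 10% lower clock
print(vector_fraction(1.5))                     # p implied by a 1.5x gain
```

Under this model a gain of about 1.5x corresponds to roughly two-thirds of the run time scaling with vector width; any clock reduction lowers the result further.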

Some have postulated that AVX512 should just be disabled and RKL should just rely on older AVX2/FMA3; we shall see whether that would have been sufficient and just how much RKL benefits from AVX512 code.

Reviews

In this article we test AVX512 core performance; please see our other articles on:

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Disclaimer

Note: We (SiSoftware) claim copyright over the scores (benchmark results) posted to the Ranker. Please see:
Privacy: Who owns the data (scores) posted to the Ranker?

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

| Native Benchmarks | Intel Core i9 11900K 8C/16T (RKL) – AVX512 | Intel Core i9 11900K 8C/16T (RKL) – AVX2/FMA3 | Intel Core i9 10900K 10C/20T (CML) – AVX2/FMA3 | Comments |
|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 1,688 [+53%] | 1,100 | 1,475 | Integer workloads improve by over 50% and allow it to beat CML. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 569 [+13%] | 504 | 589 | With a 64-bit integer, the improvement is just 13%. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 236* [2.46x] | 96 | 109 | IFMA makes AVX512 almost 2.5x faster, and faster than CML. |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 1,774 [+53%] | 1,160 | 1,358 | With floating-point we see a similar 53% improvement, again beating CML. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 998 [+57%] | 636 | 778 | Switching to FP64 we get an even better 57% improvement. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 43.68 [+42%] | 30.68 | 36.63 | Using FP64 to mantissa-extend FP128 we see a 42% improvement. |

If you were expecting 2x performance you may be disappointed: don’t be! AVX512 delivers a solid 40-50% improvement over AVX2/FMA3, which allows RKL to beat CML despite the latter's 2 extra cores and to be competitive against the AMD competition.

Keep in mind that RKL at 14nm runs hot, is very likely thermally limited and has a single AVX512-FMA unit. Future processors at 10nm should perform much better. In any case RKL is much more performant with AVX512.

Note*: using AVX512-IFMA extension.
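
The bracketed gains in the multi-media results can be reproduced from the raw scores. The sketch below copies the RKL scores from the results above and computes the ratio and the percentage gain; the helper name `gain_percent` is ours, for illustration only.

```python
# Scores (Mpix/s) copied from the multi-media results: (AVX512, AVX2/FMA3) on RKL.
SCORES = {
    "Int32": (1688, 1100),
    "Int64": (569, 504),
    "Int128": (236, 96),
    "FP32": (1774, 1160),
    "FP64": (998, 636),
    "FP128": (43.68, 30.68),
}


def gain_percent(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


for name, (avx512, avx2) in SCORES.items():
    ratio = avx512 / avx2
    print(f"{name:7s} {ratio:5.2f}x  {gain_percent(avx512, avx2):+6.1f}%")
```

Note that a ratio and a percentage gain are not interchangeable: the Int128 result is 2.46x as fast, which is a gain of about 146%.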

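| Native Benchmarks | Intel Core i9 11900K 8C/16T (RKL) – AVX512 | Intel Core i9 11900K 8C/16T (RKL) – AVX2/FMA3 | Intel Core i9 10900K 10C/20T (CML) – AVX2/FMA3 | Comments |
|---|---|---|---|---|
| BenchCrypt Crypto SHA2-256 (GB/s) | 33.56 [2.46x] | 13.67 | 16.24 | Heavy compute shows the power of AVX512 – it’s almost 2.5x faster. |
| BenchCrypt Crypto SHA1 (GB/s) | 38.84 [+42%] | 27.43 | 28.61 | Less compute reduces the benefit to 42% – likely due to memory bandwidth limitation. |
| BenchCrypt Crypto SHA2-512 (GB/s) | 22.88 [2.2x] | 10.32 | | 64-bit integer workload is over 2x faster with AVX512. |
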
With a heavy-compute integer workload, AVX512 is over 2x faster than the older AVX2 code path – a significant result. Only when we hit memory bandwidth limits (both runs use the same memory speed) does the improvement shrink; higher-speed memory is needed to go further.

Despite core improvements, it is clear that RKL, with fewer cores, would beat neither CML nor the AMD competition without AVX512.
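
The claim about core counts can be checked with simple arithmetic on the crypto scores. The sketch below compares total and per-core throughput; dividing by core count is a crude normalisation that ignores clock speeds and shared memory bandwidth, and the helper name `per_core` is ours.

```python
RKL_CORES, CML_CORES = 8, 10  # core counts of the 11900K and 10900K

# (RKL AVX512, RKL AVX2/FMA3, CML AVX2/FMA3) in GB/s, from the crypto results.
CRYPTO = {
    "SHA2-256": (33.56, 13.67, 16.24),
    "SHA1": (38.84, 27.43, 28.61),
}


def per_core(score, cores):
    """Throughput divided evenly across cores (a crude normalisation)."""
    return score / cores


for name, (rkl512, rkl256, cml256) in CRYPTO.items():
    print(
        f"{name:9s}"
        f" RKL AVX2 beats CML: {rkl256 > cml256} |"
        f" RKL AVX512 beats CML: {rkl512 > cml256} |"
        f" per-core AVX2: RKL {per_core(rkl256, RKL_CORES):.2f}"
        f" vs CML {per_core(cml256, CML_CORES):.2f} GB/s"
    )
```

With AVX2/FMA3 the RKL cores are individually faster but the chip as a whole trails CML; only the AVX512 scores put it ahead in total throughput.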

| Native Benchmarks | Intel Core i9 11900K 8C/16T (RKL) – AVX512 | Intel Core i9 11900K 8C/16T (RKL) – AVX2/FMA3 | Intel Core i9 10900K 10C/20T (CML) – AVX2/FMA3 | Comments |
|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | 460 [+14%] | 405 | 575 | FP32 GEMM sees only 14% improvement, need optimisation |
| BenchScience DGEMM (GFLOPS) double/FP64 | 293 [+58%] | 186 | 215 | Changing to FP64 we see a healthy 58% improvement. |
| BenchScience SFFT (GFLOPS) float/FP32 | 22 [-5%] | 23.25 | 25.43 | We see a regression here that needs optimisation. |
| BenchScience DFFT (GFLOPS) double/FP64 | 14.43 [+33%] | 10.89 | 12.46 | With FP64 we again see a good 33% improvement. |
| BenchScience SNBODY (GFLOPS) float/FP32 | 616 [+2%] | 605 | 657 | Again we only see a minor 2% improvement – further optimisation needed. |
| BenchScience DNBODY (GFLOPS) double/FP64 | 190 [+3%] | 184 | 222 | With FP64 we again see a minor improvement. |

With complex SIMD code that is not written in assembler, it seems there is *still* some work to be done, just as we saw many years back with SKL-X. Memory-bound algorithms with many dependencies need careful optimisation to take advantage of AVX512.

| Native Benchmarks | Intel Core i9 11900K 8C/16T (RKL) – AVX512 | Intel Core i9 11900K 8C/16T (RKL) – AVX2/FMA3 | Intel Core i9 10900K 10C/20T (CML) – AVX2/FMA3 | Comments |
|---|---|---|---|---|
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 5,210 [+69%] | 3,080 | 3,337 | We start well here, with AVX512 69% faster on a float FP32 workload. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 2,439 [+86%] | 1,310 | 1,318 | Same algorithm but more shared data improves by 86%. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 1,246 [+94%] | 641 | 676 | Again the same algorithm, but even more shared data brings an almost 2x improvement. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 1,984 [+98%] | 1,000 | 1,137 | Using two buffers retains the almost 2x speed improvement. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 305 [3.46x] | 88.1 | 102 | This algorithm loves AVX512, it’s almost 3.5x faster! |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 84.37 [+59%] | 52.9 | 56.43 | Using the new scatter/gather in AVX512 brings a 60% improvement. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 4,728 [+12%] | 4,210 | 4,724 | A 64-bit integer workload only brings a 12% improvement. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 1,022 [+39%] | 737 | 800 | Again, lots of gathers bring a 40% improvement. |

Image processing just loves SIMD, and here AVX512 again brings large performance increases, from 12% for the 64-bit integer filter to almost 3.5x for the median filter, with the new scatter/gather instructions proving especially useful even when limited by memory latency.

There is no question that for image processing tasks, AVX512 is a winner.
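
One plausible reading of the "more shared data" comments is that larger kernels do more arithmetic per output pixel, so the wider units stay busier. This is our interpretation, not something the benchmark reports. The sketch below lists the multiply-adds per output pixel for the three convolution filters next to the gains computed from the scores above; the helper name `taps_per_pixel` is ours.

```python
# (kernel width, RKL AVX512 Mpix/s, RKL AVX2/FMA3 Mpix/s) from the image results.
FILTERS = {
    "Blur 3x3": (3, 5210, 3080),
    "Sharpen 5x5": (5, 2439, 1310),
    "Motion-Blur 7x7": (7, 1246, 641),
}


def taps_per_pixel(width):
    """Multiply-adds per output pixel for a square width x width kernel."""
    return width * width


for name, (width, avx512, avx2) in FILTERS.items():
    gain = (avx512 / avx2 - 1.0) * 100.0
    print(f"{name:16s} taps/pixel={taps_per_pixel(width):2d}  gain={gain:+.0f}%")
```

The gain rises with kernel size in these three results, which is consistent with that reading but does not prove it.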


Final Thoughts / Conclusions

Summary: AVX512 is ~40% faster than AVX2/FMA3 on RocketLake!

Although RKL is built on 14nm and is power limited (just like SKL-X before it, due to its high power consumption), which is especially an issue in AVX512 mode, there is no question that across algorithms RKL performs much better with AVX512 and will benefit greatly from software updated to use it.

With AVX512 now available across platforms (server/workstation, mobile, desktop) there is no question that just about all modern software will *have to* be updated to use it – which generally is not a difficult task. Further optimisations taking advantage of specific extensions (e.g. IFMA, VNNI, etc.) will yield far higher improvements. Here at SiSoftware, we are very much looking to optimise our benchmarks further.

It is likely that future processors, e.g. “AlderLake” (ADL) on 10nm, will not be as power limited, allowing them to sustain higher turbo speeds and perform even better on AVX512 code. There is no question that AVX512 is here to stay and will gain even more extensions.

Here is to the next processors!
