What is AVX512?
AVX512 (Advanced Vector eXtensions) is the 512-bit SIMD instruction set that follows the previous 256-bit AVX2/FMA/AVX instruction sets. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators, it was next introduced on the HEDT platform with Skylake-X (SKL-X/EX/EP), but until now it was not available on the mainstream platforms.
With the 10th “real” generation Core arch(itecture) (IceLake/ICL), we finally see “enhanced” AVX512 on the mobile platform which includes all the original extensions and quite a few new ones.
Original AVX512 extensions as supported by SKL/KBL-X HEDT processors:
- AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit.
- AVX512-DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit
- AVX512-BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit
- AVX512-VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers
- AVX512-CD* – Conflict Detection – loop vectorisation through predication [only on Xeon/Phi co-processors]
- AVX512-ER* – Exponential & Reciprocal – transcendental operations [only on Xeon/Phi co-processors]
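To put the widths above in concrete terms, here is a minimal Python sketch that counts how many elements ("lanes") of each size fit into a 128-bit, 256-bit or 512-bit register. The helper name `lanes` is hypothetical and used for illustration only; this is plain arithmetic, not a measurement.

```python
# Illustrative arithmetic only: how many elements fit in one SIMD register.
REGISTER_BITS = [128, 256, 512]   # register widths reachable with AVX512 + VL
ELEMENT_BITS = [8, 16, 32, 64]    # byte, word, double-word/float, quad-word/double


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given size held by one register."""
    return register_bits // element_bits


if __name__ == "__main__":
    for reg in REGISTER_BITS:
        counts = ", ".join(f"{lanes(reg, el)} x {el}-bit" for el in ELEMENT_BITS)
        print(f"{reg}-bit register: {counts}")
```

A 512-bit register thus holds 16 single-precision floats or 64 bytes, twice what a 256-bit AVX2 register holds.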
New AVX512 extensions supported by ICL processors:
- AVX512-VNNI** (Vector Neural Network Instructions) [also supported by updated CPL-X HEDT]
- AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions)
- AVX512-BITALG (Bit Algorithms)
- AVX512-IFMA (Integer FMA)
- AVX512-VAES (Vector AES) accelerating crypto
- AVX512-GFNI (Galois Field)
- AVX512-GNA (Gaussian Neural Accelerator)
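As an illustration of what one of the new extensions does, the sketch below emulates in plain Python the arithmetic of the VNNI byte dot-product instruction (VPDPBUSD): each 32-bit lane multiplies four unsigned bytes by four signed bytes, sums the products and adds them to an accumulator. The function names are hypothetical, and this models the non-saturating form only; it shows the semantics, not the performance.

```python
def to_int32(value: int) -> int:
    """Wrap a Python integer to a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def vnni_dot_lane(acc: int, unsigned_bytes, signed_bytes) -> int:
    """One 32-bit lane: acc + sum of four (u8 * s8) products, no saturation."""
    assert len(unsigned_bytes) == 4 and len(signed_bytes) == 4
    assert all(0 <= u <= 255 for u in unsigned_bytes)
    assert all(-128 <= s <= 127 for s in signed_bytes)
    products = sum(u * s for u, s in zip(unsigned_bytes, signed_bytes))
    return to_int32(acc + products)


def vnni_dot_512(acc_lanes, a_bytes, b_bytes):
    """A 512-bit register: 16 lanes, each consuming 4 bytes from each input."""
    assert len(acc_lanes) == 16
    assert len(a_bytes) == 64 and len(b_bytes) == 64
    return [
        vnni_dot_lane(acc_lanes[i], a_bytes[4 * i:4 * i + 4], b_bytes[4 * i:4 * i + 4])
        for i in range(16)
    ]


if __name__ == "__main__":
    print(vnni_dot_lane(10, [1, 2, 3, 4], [1, -1, 2, -2]))
    print(vnni_dot_512([0] * 16, [1] * 64, [2] * 64))
```

Without VNNI the same result takes a short sequence of multiply, widen and add instructions, which is why quantised (8-bit) neural network code is the usual example for this extension.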
As with anything, simply doubling register widths does not automagically increase performance by 2x as dependencies, memory load/store latencies and even data characteristics limit performance gains; some may require future arch updates or tools to realise their true potential.
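A rough way to see why doubling the width does not double performance is an Amdahl-style estimate: only the part of the run time that actually scales with vector width gets faster. The sketch below is a simplified model under that single assumption; the function name `expected_speedup` and the example fractions are hypothetical and are not taken from our benchmark results.

```python
def expected_speedup(vector_fraction: float, width_ratio: float = 2.0) -> float:
    """Amdahl-style estimate of the overall speed-up.

    vector_fraction: share of run time that scales with SIMD width (0..1).
    width_ratio:     how much wider the new registers are (512/256 = 2).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if width_ratio <= 0.0:
        raise ValueError("width_ratio must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / width_ratio)


if __name__ == "__main__":
    for fraction in (0.25, 0.50, 0.75, 0.90, 1.00):
        print(f"{fraction:.0%} vectorisable -> {expected_speedup(fraction):.2f}x")
```

Only code that is entirely width-bound reaches 2x in this model; anything waiting on memory or serial dependencies lands below that.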
SIMD FMA Units: Unlike HEDT/server processors, ICL ULV (and likely desktop) have a single 512-bit FMA unit, not two (2): the execution rate (without dependencies) is thus similar for AVX512 and AVX2/FMA code. However, future versions are likely to add execution units, so AVX512 code will benefit even more.
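The execution-rate argument can be checked with simple arithmetic. The sketch below counts peak single-precision operations per cycle as units x lanes x 2 (an FMA counts as a multiply plus an add). It assumes, as an illustration, that AVX2/FMA code runs on two 256-bit FMA units; the function name is hypothetical and the figures are theoretical peaks, not measurements.

```python
def fp32_ops_per_cycle(fma_units: int, register_bits: int) -> int:
    """Theoretical peak FP32 operations per cycle per core."""
    lanes = register_bits // 32      # 32-bit floats per register
    return fma_units * lanes * 2     # one FMA = multiply + add


if __name__ == "__main__":
    # Assumption for illustration: AVX2/FMA code issues to two 256-bit units.
    print("AVX2/FMA, 2 x 256-bit units:", fp32_ops_per_cycle(2, 256))
    # ICL ULV as described above: a single 512-bit FMA unit.
    print("AVX512,   1 x 512-bit unit :", fp32_ops_per_cycle(1, 512))
    # HEDT/server parts with two 512-bit FMA units.
    print("AVX512,   2 x 512-bit units:", fp32_ops_per_cycle(2, 512))
```

Under these assumptions the first two cases give the same peak, so any AVX512 gain on ICL ULV has to come from elsewhere: wider loads/stores, more registers, masking and the new instructions.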
In this article we test AVX512 core performance; please see our other articles on:
- Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance
- Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – CPU AVX512 Performance
- AVX512 Improvement for Skylake-X (Core i9-7900X)
Native SIMD Performance
We are testing native SIMD performance using various instruction sets (AVX512, AVX2/FMA3 and AVX) to determine the gains the new instruction sets bring.
Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
Environment: Windows 10 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.
SiSoftware Official Ranker Scores
Final Thoughts / Conclusions
We never expected a low-power, TDP-limited ULV platform to benefit from AVX512 as much as HEDT/server platforms – especially when you consider the lower count of SIMD execution units. Nevertheless, it is clear that ICL (even in ULV form) benefits greatly from AVX512, with 50-100% improvement in many algorithms and no losses.
ICL also introduces many new AVX512 extensions which can even be used to accelerate existing AVX512 code (not just legacy AVX2/FMA); we are likely to see even higher gains in the future as software (and compilers) take advantage of the new extensions. Future CPU architectures are also likely to optimise complex instructions as well as add more SIMD/FMA execution units, which will greatly improve AVX512 code performance.
As the data-paths for caches (L1D, L2?) have been widened, 512-bit memory accesses help extract more bandwidth for streaming algorithms (e.g. crypto) while scatter/gather instructions reduce latencies for non-sequential data accesses. Thus the benefit of AVX512 extends to more than just raw compute code.
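For readers unfamiliar with gather, the sketch below models its semantics in plain Python: each lane loads from its own index, and a mask decides which lanes are loaded at all, with masked-off lanes keeping their previous value. The name `masked_gather` and its parameters are hypothetical; this illustrates behaviour only and says nothing about timing.

```python
def masked_gather(memory, indices, mask, passthrough):
    """Per-lane load: memory[indices[i]] where mask[i] is set, else passthrough[i].

    Masked-off lanes never touch memory, so their index may even be invalid.
    """
    assert len(indices) == len(mask) == len(passthrough)
    return [
        memory[index] if enabled else previous
        for index, enabled, previous in zip(indices, mask, passthrough)
    ]


if __name__ == "__main__":
    table = [10, 20, 30, 40]
    # The second lane is masked off, so its out-of-range index is never read.
    print(masked_gather(table, [3, 99, 2], [True, False, True], [-1, -1, -1]))
```

In scalar code each of these loads would be a separate instruction inside a loop; a gather expresses the whole set of non-sequential accesses as one vector operation.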
We are excitedly waiting to see how AVX512-enabled desktop/HEDT ICL performs, not constrained by TDP and adequately cooled…