What is AVX512?
AVX512 is a new SIMD instruction set operating on 512-bit registers that is the natural progression from FMA/AVX (256-bit registers). It was first introduced with Intel’s “Phi” co-processor (Intel’s answer to GPGPUs), and a version of it is now making its way to CPUs themselves.
Why is AVX512 important?
CPU performance has increased only marginally (5-10%) from one generation to the next, with power efficiency being the primary goal. With limited options (clock speeds cannot be increased, power must be reduced, execution efficiency is hard to improve, etc.), exploiting data-level parallelism through SIMD is a relatively simple way to improve performance.
SIMD instructions have long been used to increase performance (since the introduction of MMX with the Pentium in 1997!) and their register width has been increasing steadily from 64-bit (MMX) to 128-bit (SSEx) to 256-bit (AVX/FMA) and now to 512-bit (AVX512) – thus processing more and more data simultaneously.
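To make "processing more and more data simultaneously" concrete, the short sketch below counts how many elements of a given size fit into one register of each width listed above. The function names are ours and purely illustrative; only the register widths come from the text.

```python
# Register widths in bits, as listed in the paragraph above.
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX": 256, "AVX512": 512}


def lanes(register_bits, element_bits):
    """Return how many elements of element_bits fit in one register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element size must evenly divide the register width")
    return register_bits // element_bits


def lanes_table(element_bits):
    """Map each instruction set name to its lane count for this element size."""
    return {name: lanes(bits, element_bits) for name, bits in REGISTER_BITS.items()}


if __name__ == "__main__":
    for name, count in lanes_table(32).items():
        print(f"{name}: {count} x 32-bit elements per register")
```

With 32-bit elements the lane count doubles at each step, which is where the theoretical 2x gain per generation comes from.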
Unfortunately, software has to be specifically modified to support AVX512 (or at the very least re-compiled) but developers are generally used to this these days after the SSE to AVX transition.
SiSoftware has thus been updating its benchmarks to AVX512, though some need compiler support and will need to wait until Microsoft updates its Visual C++ compiler at some point.
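"Specifically modified" usually means that the software carries several code paths and picks the widest one the CPU supports at run time. The following is a minimal sketch of that selection logic only; the path names and the feature-flag strings are hypothetical and do not describe how Sandra itself is implemented.

```python
# Hypothetical code paths, ordered from most to least preferred.
# Each entry pairs the feature flags a path requires with an illustrative name.
CODE_PATHS = [
    (frozenset({"avx512f"}), "avx512"),
    (frozenset({"avx2", "fma"}), "avx2_fma"),
    (frozenset({"sse2"}), "sse2"),
]


def pick_code_path(cpu_features):
    """Return the first code path whose required features are all present."""
    features = set(cpu_features)
    for required, name in CODE_PATHS:
        if required <= features:
            return name
    # No SIMD path is usable: fall back to plain scalar code.
    return "scalar"


if __name__ == "__main__":
    print(pick_code_path({"sse2", "avx2", "fma"}))
    print(pick_code_path({"sse2", "avx2", "fma", "avx512f"}))
```

The set of supported features would come from the CPU itself; here it is simply passed in so the logic can be tried on any machine.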
What CPUs will support AVX512?
It was rumoured that the newly released “Skylake” Core consumer CPUs were going to support AVX512 – but they do not. The future “Skylake-E” Xeon “Purley” server/workstation CPUs are supposed to support it.
AVX512 is actually a collection of several subsets, with “Skylake-E” supporting F (foundation), CD (conflict detection), BW (byte & word), DQ (double-word & quad-word) and VL (vector length extension), and the future “Cannonlake-E” supporting IFMA (integer FMA), VBMI (vector byte manipulation instructions) and perhaps others.
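Each of these subsets is reported by its own feature bit in CPUID leaf 7 (sub-leaf 0), so software has to test them individually. The sketch below decodes the bits for the subsets named above from raw EBX/ECX values. The bit positions are those documented by Intel; the function name and the sample register value are our own assumptions (the value is synthetic, not a dump from real hardware).

```python
# Feature bit positions in CPUID leaf 7, sub-leaf 0.
EBX_BITS = {"F": 16, "DQ": 17, "IFMA": 21, "CD": 28, "BW": 30, "VL": 31}
ECX_BITS = {"VBMI": 1}


def decode_avx512(ebx, ecx=0):
    """Return the AVX512 subsets flagged in the given EBX/ECX register values."""
    found = [name for name, bit in EBX_BITS.items() if (ebx >> bit) & 1]
    found += [name for name, bit in ECX_BITS.items() if (ecx >> bit) & 1]
    return found


if __name__ == "__main__":
    # Synthetic EBX value with only the F, DQ, CD, BW and VL bits set.
    sample_ebx = sum(1 << EBX_BITS[name] for name in ("F", "DQ", "CD", "BW", "VL"))
    print(hex(sample_ebx), decode_avx512(sample_ebx))
```

A complete check would also confirm that the operating system saves the 512-bit register state; that step is omitted here.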
It is disappointing that AVX512 is not enabled on consumer CPUs (Core), but it will eventually appear in future iterations; until then, gamers/enthusiasts will need to buy into the “extreme/Skylake-E” platform, while business users will get “Xeon/Skylake-E” in their workstations.
What kind of performance improvement can we expect with AVX512?
The transition from SSE 128-bit to AVX/FMA/AVX2 256-bit has (eventually) resulted in a 70-120% improvement, with compute-intensive code that seldom accesses memory yielding the best improvement. Note that AVX code executes at a lower clock than “normal”/SSE code.
AVX512 not only doubles the register width (512-bit) but also the number of registers (32 vs 16), thus we can hold 4x (four times) more data in registers, which may reduce cache/memory accesses by keeping more data local. But AVX512 code will again run at a lower clock than AVX/FMA code.
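The 4x figure, and the effect of the lower clock, can be checked with some simple arithmetic. The sketch below is a best-case model only: it assumes throughput scales linearly with register width and clock, and the clock speeds used are hypothetical placeholders, not measured or published values.

```python
def register_file_bytes(num_registers, register_bits):
    """Total bytes held by a vector register file."""
    return num_registers * register_bits // 8


def ideal_speedup(width_ratio, old_clock_mhz, new_clock_mhz):
    """Best-case speedup: wider registers scaled by the (lower) clock."""
    return width_ratio * new_clock_mhz / old_clock_mhz


if __name__ == "__main__":
    avx = register_file_bytes(16, 256)     # 16 registers of 256 bits
    avx512 = register_file_bytes(32, 512)  # 32 registers of 512 bits
    print(f"AVX: {avx} bytes, AVX512: {avx512} bytes, ratio {avx512 // avx}x")

    # Hypothetical clocks: 4000 MHz for AVX code, 3600 MHz for AVX512 code.
    print(f"Ideal speedup: {ideal_speedup(2, 4000, 3600):.2f}x")
```

Real code rarely reaches this ceiling, as memory accesses and instructions that do not vectorise limit the gain.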
In the next examples we project future gains through AVX512 for common algorithms as implemented in Sandra’s benchmarks and what they might mean to customers.
Can I test AVX512 performance with Sandra?
Yes, with the release of Sandra 2016 SP1 you can now test AVX512 performance, provided you have a CPU that supports it. All the low-level benchmarks (below) have been ported to AVX512:
- Multi-Media (Fractal Generation) Benchmark: AVX512 F, BW, DQ supported now
- Cryptography (SHA Hashing) Benchmark: AVX512 BW, DQ supported now
- Memory & Cache Bandwidth Benchmarks: AVX512 F, DQ supported now
The following benchmarks require future compiler support (Microsoft VC++) and have not been released at this time:
- Financial Analysis (Black-Scholes, Binomial, Monte-Carlo): AVX512 F support coming soon
- Scientific Analysis (GEMM, FFT, N-Body): AVX512 F support coming soon
- Image Processing (Blur/Sharpen/Motion-Blur, Sobel, Median): AVX512 BW support coming soon
- .Net Vectorised (Fractal Generation): AVX512 support depends on the RyuJIT numerics libraries, which need to be updated by Microsoft. No changes to the benchmark itself are required.
We are comparing two publicly released CPUs with their projected next-gen counterparts supporting AVX512.
|Processor||Intel i7-6700K (Skylake)||Intel i7-77XX? (next-gen)||Intel i7-5820K (Haswell-E)||Intel i7-78XX? (Skylake-E)|
|Cores/Threads||4C / 8T||4C / 8T||6C / 12T||6C / 12T|
|Clock Speeds (MHz) Min-Max-Turbo||800-4000-4200||assumed same||1200-3300-3600||assumed same|
|Caches L1/L2/L3||4x 32kB, 4x 256kB, 8MB||assumed same||6x 32kB, 6x 256kB, 15MB||assumed same|
|Power TDP Rating (W)||91W||assumed same||140W||assumed same|
|Instruction Set Support||AVX2, FMA3, AVX, etc.||AVX512 + AVX2, FMA3, AVX, etc.||AVX2, FMA3, AVX, etc.||AVX512 + AVX2, FMA3, AVX, etc.|
We do not expect major changes in future AVX512-supporting architectures, especially with Skylake-E, as Core Skylake is already out and the core specifications are known.
Multi-media (Fractal Generation) Benchmark
We will update the article with future (projected) results as more benchmarks are converted to AVX512 once compiler support is released, but the results so far already show an excellent performance improvement.
Until then, those of you with access to AVX512 supporting hardware can download Sandra 2016 SP1 and test away!