AVX10: A big improvement for x64 vectorisation but it’s not ARM SVE

What is AVX10?

It is a unification of the current AVX (Advanced Vector eXtensions) instruction sets (ISA) into a new, version-based standard that is width-independent (i.e. it applies to any of the 128, 256 and 512-bit widths). This allows different CPU cores (e.g. Atom, Core, Xeon, etc.) to support the same instruction set, just at different widths (e.g. 128, 256 or 512-bit), rather than support different feature sets.

Unlike the current flag-based feature detection (e.g. FMA, iFMA, FP16, BF16, VNNI, VAES, etc.), we will have version numbers (10.1, 10.2, etc.) that each include a baseline set of supported features. Similar to, say, DirectX 12 feature baselines vs. the old DirectX 10/11 (which still had versions).
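For illustration, here is a minimal sketch in C of what version-based detection might look like (MSVC-style intrinsics), assuming the CPUID leaf/bit layout from Intel's AVX10 architecture specification (AVX10 presence in leaf 7 sub-leaf 1, version and width in leaf 24H); the exact bit positions should be checked against the spec before relying on them:

    /* Minimal sketch: AVX10 version/width detection via CPUID.
       Assumed layout (per Intel's AVX10 specification):
       CPUID.(7,1):EDX[19]      = AVX10 supported
       CPUID.(24H,0):EBX[7:0]   = AVX10 converged ISA version (1 = 10.1, 2 = 10.2, ...)
       CPUID.(24H,0):EBX[17/18] = 256/512-bit vector widths supported */
    #include <intrin.h>
    #include <stdio.h>

    int main(void)
    {
        int r[4]; /* EAX, EBX, ECX, EDX */

        __cpuidex(r, 7, 1);                     /* structured extended features, sub-leaf 1 */
        if (!((r[3] >> 19) & 1)) {              /* EDX bit 19: AVX10 present? */
            puts("no AVX10 - fall back to flag-based detection (AVX2, FMA, VNNI, ...)");
            return 0;
        }

        __cpuidex(r, 0x24, 0);                  /* AVX10 enumeration leaf */
        int version = r[1] & 0xFF;              /* EBX[7:0]: 1 = AVX10.1, 2 = AVX10.2, ... */
        int has256  = (r[1] >> 17) & 1;         /* EBX[17]: 256-bit width supported */
        int has512  = (r[1] >> 18) & 1;         /* EBX[18]: 512-bit width supported */

        printf("AVX10.%d, max width %d-bit\n", version, has512 ? 512 : has256 ? 256 : 128);
        return 0;
    }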

Why has Intel done this?

Since the introduction of the “Hybrid” (aka big.LITTLE) architecture (with “Alder Lake” ADL, now “Raptor Lake” RPL and beyond), the difference in feature sets between the “Atom” (E) and “Core” (P) CPU cores created a problem that could only be resolved by disabling features (like AVX512) of the “Core” cores in order to present a homogeneous feature set to all threads running on a system.

Thus, for the first time in x86/x64 history, a new CPU was launched with less instruction-set support than previous CPUs (e.g. “Rocket Lake” RKL, “Tiger Lake” TGL, “Ice Lake” ICL, etc.); in some benchmarks this caused performance regressions, as AVX/AVX2 code paths had to be used instead of the more advanced AVX512 code paths that older CPUs could still run.

We did not only lose the wider vector registers (512-bit vs. 256-bit) and the increased register count (32 vs. 16), but also access to all the new advanced extensions that were defined only for AVX512. To mitigate this, Intel has “retro-fitted” (retconned?) these AVX512-only extensions (e.g. AVX512-VNNI, VAES, iFMA, FP16, BF16, etc.) onto “old” AVX/AVX2 (aka AVX-VNNI, VAES, iFMA, etc.) through a host of additional new flags.

Thus we had to add new AVX/AVX2-based code paths to support these retro-fitted features on new and future CPU cores – while all the optimisation work of the last half-decade or so had targeted AVX512 only!
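To show why this needs a separate code path, the sketch below writes the same u8×s8 dot-product-accumulate once with the AVX512-VNNI intrinsic and once with the retro-fitted AVX-VNNI one. Intrinsic names follow the Intel Intrinsics Guide (note the “_avx_” infix that marks the VEX-encoded, non-AVX512 variant); compiler flags (-mavx512vnni / -mavxvnni) and support vary by toolchain:

    /* Minimal sketch: the same u8*s8 dot-product-accumulate via AVX512-VNNI
       (EVEX, 512-bit, 32 registers) and via the retro-fitted AVX-VNNI
       (VEX, 256-bit, 16 registers). */
    #include <immintrin.h>

    /* AVX512-VNNI: 64 u8*s8 products folded into 16 x 32-bit accumulators */
    __m512i dot_avx512vnni(__m512i acc, __m512i u8, __m512i s8)
    {
        return _mm512_dpbusd_epi32(acc, u8, s8);
    }

    /* AVX-VNNI: the same operation at 256-bit width on non-AVX512 cores */
    __m256i dot_avxvnni(__m256i acc, __m256i u8, __m256i s8)
    {
        return _mm256_dpbusd_avx_epi32(acc, u8, s8);
    }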

Is this a good thing for AVX?

In general yes, although there are reservations:

  • AVX10 allows all cores (e.g. Atom) to use advanced extensions that previously applied only to AVX512 processors
  • AVX10 allows all cores (e.g. Atom) to use 2x more registers (32 vs. 16 on AVX/AVX2 or 8 on SSE/2)
  • AVX10 does not allow all cores (e.g. Atom) to use 512-bit width registers!
  • Atom will still use 128/256-bit registers and some cores might even use 128-bit only
  • New extensions in AVX10.x will apply to all cores (i.e. Atom, Core, Xeon)

Intel also suggests that AVX2 software will benefit from AVX10, though, with respect, that is unlikely in practice:

  • AVX2 assembler-code (legacy optimised code) would not benefit from AVX10 as register usage is hard-coded (16) and would require a rewrite (e.g. to intrinsics, which depend on compiler support and code-generation quality)
  • AVX2 intrinsic-code (newer code) would benefit from AVX10 as the compiler can use the additional registers (32), but it would need to be built as an extra code path (thus “AVX2 256-bit legacy” and “AVX10 256-bit new”) – see the sketch after this list
  • AVX2 high-level code would not use the new instruction sets of AVX10, unless the code generator is aware that such instructions, previously AVX512-only, are now usable down-level on non-AVX512-capable CPUs (e.g. .Net/Java vector byte-code)
  • Hybrid, Atom and other cores are still limited to 128/256-bit width, or possibly less
  • Only Xeon with discrete Core (P) cores will be able to use 512-bit width, as is the case today with AVX512
  • We still need different code paths for AVX10 128, 256 and 512-bit widths rather than a single code path!
  • Still limited to 512-bit; future widths (e.g. 1,024/2,048-bit accelerators) will require new support and recompilation
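To make the “extra code path” point concrete, here is a minimal sketch of the same AVX2/FMA3 intrinsic kernel simply being built twice: once as the legacy path (-mavx2 -mfma, ymm0-ymm15 only) and once as a new AVX10/256 path, assuming a compiler that accepts an AVX10.1/256 target (recent GCC/Clang reportedly do) so that register allocation can also use ymm16-ymm31. The source itself does not change; only the build target, and thus the generated code, does:

    /* Minimal sketch: one AVX2/FMA3 intrinsic kernel, built as two code paths
       (legacy AVX2 256-bit and new AVX10 256-bit) purely via compiler flags. */
    #include <immintrin.h>
    #include <stddef.h>

    /* dst = a*c + b over float arrays; n assumed to be a multiple of 8 */
    void saxpy8(float *dst, const float *a, const float *b, float c, size_t n)
    {
        const __m256 vc = _mm256_set1_ps(c);
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_fmadd_ps(va, vc, vb));
        }
    }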

It is likely that future AVX10 hybrid processors (e.g. “Meteor Lake” MTL, “Arrow Lake” ARL, etc.) will still report only 256-bit width support for both Atom (E) and Core (P) cores for consistency – even though the “Core” cores support 512-bit! Otherwise, code designed for 512-bit width might attempt to run on a 256-bit Atom core and thus fail.

And no, Atom is not getting 512-bit width AVX10.x now or in the future. Possibly some variants may be limited to 128-bit.

As mentioned, we would still need a different code path for each width (3); thus we need to keep track of the supported CPU width in addition to the AVX10 version (10.1, 10.2, etc.), and could easily end up with 6 code paths in addition to the older AVX2/FMA and legacy AVX paths.
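As a purely hypothetical sketch of what that book-keeping implies (the kernel names and detection inputs below are illustrative, not Sandra's actual code paths), the dispatcher ends up selecting one entry per (width, ISA level) combination at start-up:

    /* Hypothetical dispatch sketch: one kernel per (width, ISA level),
       selected once at start-up from the CPUID results. */
    #include <stddef.h>
    #include <stdio.h>

    typedef void (*kernel_fn)(float *dst, const float *src, size_t n);

    /* stand-in kernels: real ones would be built per ISA/width */
    static void kernel_sse2_128  (float *d, const float *s, size_t n) { (void)d; (void)s; (void)n; puts("SSE2 128-bit");   }
    static void kernel_avx2_256  (float *d, const float *s, size_t n) { (void)d; (void)s; (void)n; puts("AVX2 256-bit");   }
    static void kernel_avx10_256 (float *d, const float *s, size_t n) { (void)d; (void)s; (void)n; puts("AVX10 256-bit");  }
    static void kernel_avx512_512(float *d, const float *s, size_t n) { (void)d; (void)s; (void)n; puts("AVX512 512-bit"); }

    static kernel_fn select_kernel(int avx10_ver, int max_width, int has_avx512, int has_avx2)
    {
        if (has_avx512 || (avx10_ver >= 1 && max_width >= 512)) return kernel_avx512_512;
        if (avx10_ver >= 1 && max_width >= 256)                 return kernel_avx10_256;
        if (has_avx2)                                           return kernel_avx2_256;
        return kernel_sse2_128;
    }

    int main(void)
    {
        /* e.g. values as they might come back from the detection sketch earlier */
        kernel_fn k = select_kernel(/*avx10_ver*/1, /*max_width*/256, /*has_avx512*/0, /*has_avx2*/1);
        k(NULL, NULL, 0);
        return 0;
    }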

How does Intel’s AVX10 compare to ARM’s SVE?

Ever since ARM introduced the width-independent SVE instruction set (ISA), we have realised how much it simplifies support for different current, and also future, CPUs. While more complex to write initially (and debug!), it really simplifies development and future support.

We even wrote an article for it years ago: What Intel needs is SVE-like variable width SIMD-AVX to solve hybrid problem.

  • Single binary (!) supports any width from 128-bit to 2,048-bit (!)
  • Generic width-independent vector registers
  • Width-independent instructions (load, store, compare, etc.)
  • Width-independent constructs (loop, compare, etc.)

Yes – you read that right, a single binary supports any vector register width, from a 128-bit phone/tablet CPU/SoC, to a 256/512-bit mobile/desktop CPU, to a 1,024/2,048-bit server/accelerator – both now and in the future. Similar to, say, .Net variable-width vector support that runs on AVX, AVX2 or AVX512 depending on the CPU.

ARM was bold to design this brand-new vectorised instruction set from scratch rather than simply extend the old Neon (128-bit, like SSE/2) to 256/512-bit as on x86/x64. However, once the new SVE code is written, width scaling is automatic and no code changes are needed.
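For comparison, a minimal width-independent SVE loop using the ACLE intrinsics from <arm_sve.h> might look like the sketch below; the same object code runs with 128-bit vectors on a phone core and 512-bit (or wider) vectors on a server, because the per-iteration element count is read from the hardware at run time:

    /* Minimal sketch: width-independent SVE scale loop (ACLE intrinsics). */
    #include <arm_sve.h>
    #include <stdint.h>

    void scale(float *dst, const float *src, float c, uint64_t n)
    {
        for (uint64_t i = 0; i < n; i += svcntw()) {        /* svcntw(): floats per vector, read at run time */
            svbool_t pg = svwhilelt_b32_u64(i, n);           /* predicate masks off the tail */
            svfloat32_t v = svld1_f32(pg, src + i);
            svst1_f32(pg, dst + i, svmul_n_f32_x(pg, v, c));
        }
    }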

Thus, while AVX10 is a step in the right direction – we cannot but be sad that it is more of a “fix” for the Hybrid architecture issue rather than a bold step into the future for vectorisation on x86/x64 platform…

What improvement can be expected from AVX10?

For AVX512 (512-bit) processors, none – as AVX10.1 is basically the current AVX512 (+ friends) baseline.

For future Hybrid processors (e.g. “Meteor Lake” MTL, “Arrow Lake” ARL, etc.) that will support AVX10.2, the improvement depends on:

  • Advanced instruction sets previously available on AVX512 only that have not already been retro-fitted to AVX (unlike e.g. VNNI, VAES, iFMA, which have)
  • Future instruction sets supported by AVX10.2
  • ~20% due to the availability of more registers (32 vs. 16)
  • Not quite reaching the performance improvement seen by AVX512 512-bit processors, even those (like AMD Ryzen Zen4) with only 256-bit SIMD units

For future discrete Atom/Other processors that will support AVX10.2, the improvement depends on:

  • Whether limited to only 128-bit or 256-bit width
  • Advanced instruction sets previously available on AVX512 only that have not already been retro-fitted to AVX (unlike e.g. VNNI, VAES, iFMA, which have)
  • Future instruction sets supported by AVX10.2
  • ~20% due to the availability of more registers (32 vs. 16)
  • Not quite reaching the performance improvement seen by AVX512 512-bit processors, even those (like AMD Ryzen Zen4) with only 256-bit SIMD units – but better than legacy AVX2/FMA3-supporting cores

Will Sandra use AVX10?

Yes; to fully support hybrid we will have to add additional code paths for AVX10 at 256-bit, but the current AVX512 code paths are likely to be sufficient to support both “legacy” AVX512 (and friends) and AVX10.x 512-bit. In effect we will have:

  • SSE2 128-bit code path [legacy]
  • AVX10 128-bit code path if required for Atom/Other [new]
  • AVX 256-bit code path [legacy]
  • AVX2/FMA3 256-bit code path [legacy]
  • AVX10 256-bit code path for Hybrid [new]
  • AVX10 512-bit & AVX512 code path for Xeon [current]

Future AVX10.2 will likely require new code paths – and code paths for all widths – in addition to the ones above.

Will Sandra use different vector-width code-paths on Hybrid with AVX10?

On (big.LITTLE) ARM, the SVE code path is width-independent and can thus run at different widths on different cores automatically.

On (Hybrid) x64, even with AVX10 each width requires a different binary – thus different threads would need to load/use different binaries (e.g. Core 512-bit, Atom 256-bit) and would have to be hard-affinitised so that they never move to a different type of core. This would not be a problem for Sandra, which has its own thread scheduler – but could be a problem for other programs.
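A minimal sketch of such hard affinitisation on Windows is shown below; the affinity mask is hypothetical and would in practice be derived from the OS topology / core-type enumeration rather than hard-coded:

    /* Minimal sketch: pin the current thread to a set of cores so that a
       width-specific code path never migrates to the wrong core type.
       The mask is illustrative; derive it from GetLogicalProcessorInformationEx
       or CPUID core-type enumeration in real code. */
    #include <windows.h>

    void pin_current_thread(DWORD_PTR core_mask)
    {
        /* e.g. core_mask = bits of the P cores for a 512-bit path,
                or bits of the E cores for a 256-bit path */
        SetThreadAffinityMask(GetCurrentThread(), core_mask);
    }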

Thus no, on Hybrid the 256-bit AVX10 code path will be used, as this is what is supported.

It is unknown whether disabling the Atom (E) cores on Hybrid will allow the Core (P) to report and use 512-bit width AVX10.

When can we test the new AVX10 benchmarks?

Look for the new benchmarks in the new version of Sandra, available soon. 😉
