AVX512 Improvement for Icelake Mobile (i7-1065G7 ULV)

What is AVX512?

AVX512 (Advanced Vector eXtensions) is the 512-bit SIMD instruction set that follows from previous 256-bit AVX2/FMA/AVX instruction set. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators, it was next introduced on the HEDT platform with Skylake-X (SKL-X/EX/EP) but until now it was not avaible on the mainstream platforms.

With the 10th “real” generation Core arch(itecture) (IceLake/ICL), we finally see “enhanced” AVX512 on the mobile platform which includes all the original extensions and quite a few new ones.

Original AVX512 extensions as supported by SKL/KBL-X HEDT processors:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit.
  • AVX512-DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit
  • AVX512-BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit
  • AVX512-VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers
  • AVX512-CD* – Conflict Detection – loop vectorisation through predication [only on Xeon/Phi co-processors]
  • AVX512-ER* – Exponential & Reciprocal – transcedental operations [only on Xeon/Phi co-processors]

New AVX512 extensions supported by ICL processors:

  • AVX512-VNNI** (Vector Neural Network Instructions) [also supported by updated CPL-X HEDT]
  • AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-IFMA (Integer FMA)
  • AVX512-VAES (Vector AES) accelerating crypto
  • AVX512-GFNI (Galois Field)
  • AVX512-GNA (Gaussian Neural Accelerator)

As with anything, simply doubling register widths does not automagically increase performance by 2x as dependencies, memory load/store latencies and even data characteristics limit performance gains; some may require future arch updates or tools to realise their true potential.

SIMD FMA Units: Unlike HEDT/server processors, ICL ULV (and likely desktop) have a single 512-bit FMA unit, not two (2): the execution rate (without dependencies) is thus similar for AVX512 and AVX2/FMA code. However, future versions are likely to increase execution units thus AVX512 code will benefit even more.

In this article we test AVX512 core performance; please see our other articles on:

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks ICL ULV AVX512 ICL ULV AVX2/FMA3 Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 504 [+25%] 403 For integer workloads we manage25% improvement, not quite the 100% we were hoping but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 145 [+1%] 143 With a 64-bit integer workload the improvement reduces to 1%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 3.67 3.73 [-2%] – [No SIMD in use here]
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 414 [+22%] 339 In this floating-point test, we see a 22% improvement similar to integer.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 232 [+20%] 194 Switching to FP64 we see a similar improvement.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 10.17 [+13%] 9 In this heavy algorithm using FP64 to mantissa extend FP128 we see only 13% improvement
With limited resources, AVX512 cannot bring 100% improvement, but still manages 20-25% improvement over AVX2/FMA which is decent improvement; also consider this is a TDP-constrained ULV platform not desktop/HEDT.
BenchCrypt Crypto SHA2-256 (GB/s) 9 [+2.25x] 4 With no data dependency – we get great scaling of over 2x in this integer workload.
BenchCrypt Crypto SHA1 (GB/s) 15.71 [+81%] 8.6 Here we see only 80% improvement likely due to lack of (more) memory bandwidth – it likely would scale higher.
BenchCrypt Crypto SHA2-512 (GB/s) 7.09 [+2.3x] 3.07 With 64-bit integer workload we see larger than 2x improvement.
Thanks to the new crypto-algorithm friendly acceleration instructions of AVX512 and no doubt helped by high-bandwidth LP-DDR4X memory, we see over 2x (twice) improvement over older AVX2. ICL ULV will no doubt be a great choice for low-power network devices (routers/gateways/firewalls) able to pump 100′ Gbe crypto streams.
BenchScience SGEMM (GFLOPS) float/FP32 185 [-6%] 196 More optimisations seem to be required here for ICL at least.
BenchScience DGEMM (GFLOPS) double/FP64 91 [+18%] 77 Changing to FP64 brings a 18% improvement.
BenchScience SFFT (GFLOPS) float/FP32 31.72 [+12%] 28.34 With FFT, we see a modest 12% improvement.
BenchScience DFFT (GFLOPS) double/FP64 17.72 [-2%] 18 With FP64 we see 2% regression.
BenchScience SNBODY (GFLOPS) float/FP32 200 [+7%] 187 No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64 61.76 [=] 62 With FP64 there is no delta.
With highly-optimised scientific algorithms, it seems we still have some way to go to extract more performance out of AVX512, though overall we still see a 7-12% improvement even at this time.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1,580 [+79%] 883 We start well here with AVX512 80% faster with float FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 633 [+71%] 371 Same algorithm but more shared data improves by 70%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 326 [+67%] 195 Again same algorithm but even more data shared now brings the improvement down to 67%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 502 [+58%] 318 Using two buffers does not change much still 58% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 72.92 [+2.4x] 30.14 Different algorithm works better, with AVX512 over 2x faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 24.73 [+50%] 16.45 Using the new scatter/gather in AVX512 still brings 50% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 2,100 [+33%] 1,580 Here we have a 64-bit integer workload algorithm with many gathers still good 33% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 307 [+33%] 231 Again loads of gathers and similar 33% improvement.
Image manipulation algorithms working on individual (non-dependent) pixels love AVX512, with 33-140% improvement. The new scatter/gather instructions also simplily memory access code that can benefit from future arch improvements.
Neural Networks NeuralNet CNN Inference (Samples/s) 25.94 [+3%] 25.23 Inference improves by a mere 3% only despite few dependencies.
Neural Networks NeuralNet CNN Training (Samples/s) 4.6 [+5%] 4.39 Traning improves by a slighly better 5% likely due to 512-bit accesses.
Neural Networks NeuralNet RNN Inference (Samples/s) 25.66 [-1%] 25.81 RNN interference seems very slighly slower.
Neural Networks NeuralNet RNN Training (Samples/s) 2.97 [+33%] 2.23 Finally RNN traning improves by 33%.
Unlike image manipulation, neural networks don’t seem to benefit as much pretty much the same performance across board. Clearly more optimisation is needed to push performance.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

We never expected a low-power TDP (power)-limited ULV platform to benefit from AVX512 as much as HEDT/server platforms – especially when you consider the lower count of SIMD execution units. Nevertheless, it is clear that ICL (even in ULV form) benefits greatly from AVX512 with 50-100% improvement in many algorithms and no loses.

ICL also introduces many new AVX512 extensions which can even be used to accelrate existing AVX512 code (not just legacy AVX2/FMA), we are likely to see even higher gains in the future as software (and compilers) take advantage of the new extensions. Future CPU architectures are also likely to optimise complex instructions as well as add more SIMD/FMA execution units which will greatly improve AVX512 code performance.

As the data-paths for caches (L1D, L2?) have been widened, 512-bit memory accesses help extract more bandwidth for streaming algorithms (e.g. crypto) while scatter/gather instruction reduce latencies for non-sequential data accesses. Thus the benefit of AVX512 extends to more than just raw compute code.

We are excitedly waiting to see how AVX512-enabled desktop/HEDT ICL performs, not constrained by TDP and adequately cooled…

SiSoftware Sandra 20/20/4a (2020 R4a) Released

Note: The original R4 release text has been updated below. The (*) denotes new changes.

We are pleased to release R4a (version 30.39) update for 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Benchmarks:
    • Crypto AES Benchmarks*: Optimised AVX512/AVX2-VAES code to outperform AES-HWA where possible.
    • Crypto SHA Benchmarks*: Select AVX512 multi-buffer instead of SHA-HWA where supported.
    • Network (LAN), Wireless (WLAN/WWAN) Benchmarks: multi-threaded transfer tests and increased packet size to better utilise 10Gbe+ (and higher) links. [Note: threaded CPU required]
    • Internet Connection, Internet Peerage Benchmarks: multi-threaded transfer tests and increased packet size to better utilise Gigabit+ (and higher) connections.
  • Hardware Support:
    • Updated IceLake (ICL Gen10 Core), Future* (RKL, TGL Gen11 Core) AVX512, VAES, SHA-HWA support (see CPU, GP-GPU, Cache & Memory, AVX512 improvement reviews)
    • Updated CometLake (Gen10 Core) support (see CPU, GP-GPU, Cache & Memory reviews)
    • Updated CPU features support*
    • Updated NVMe support
    • Enhanced Biometrics information (fingerprint, face, voice, audio, etc. sensors)
    • Updated WiFi support (WiFi 6/802.11ax, WPA3)
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/3 (2020 R3) Released

We are pleased to release R3 (version 30.31) update for 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • Additional PCIe extended capabilities support
  • CPU Cyrptography Benchmarks:
    • Block size changed to ~1500 bytes similar to Ethernet packet
    • Various stability and reliability improvements
  • GPGPU Cyrptography Benchmarks:
    • Block size changed to ~1500 bytes similar to Ethernet packet
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/2 (2020 R2) Released

We are pleased to release R2 (version 30.27) update for 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • PCIe extended capabilities support
  • Software Support:
    • ReFS format Disk benchmark stability issues
  • CPU Benchmarks:
    • Tools (Visual C++ compiler 2019) Update
  • GPGPU Benchmarks:
    • CUDA: Updated SDK 10.2/10.1
    • OpenCL: Updated SDK support

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/1a (2020 R1a) Released

Update November 25th: Released patch (version 30.24) to add further hardware and software support.

Update October 24th: Released patch (version 30.21) to corrrect Windows 7 / Server 2008/R2 run-time issues.

We are pleased to release R1 (version 30.24) update for 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • AMD Ryzen2 (series 3000 Matisse), Stoney Ridge updated support
    • Intel Cascade Lake (CSL), Comet Lake (CML), Cannon Lake (CNL), Ice Lake (ICL) updated support
  • CPU Benchmarks:
    • Tools (Visual C++ compiler 2019) Update
  • GPGPU Benchmarks:
    • CUDA: Updated SDK 10.2/10.1
    • OpenCL: Updated SDK support

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20 (2020) Released!

FOR IMMEDIATE RELEASE

Contact: Press Office

SiSoftware Sandra 20/20 (2020) Released:
Brand-new benchmarks (AI/ML), hardware support

Updates: R1, R2, R3, R4.

London, UK, July 18th, 2019 – We are pleased to announce the launch of SiSoftware Sandra 20/20 (2020), the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, mobile devices and networks.

It adds two Neural Networks AI/ML (Artificial Intelligence/Machine Learning) benchmarks for both CPU and GP (GPU) to measure both CNN (Convolution Neural Network) & RNN (Recurrent Neural Networks) performance on modern hardware.

It also adds hardware support and optimisations for brand-new CPU architectures (AMD Ryzen 2 (3000 series); Intel IceLake, CometLake) not forgetting GPGPU architectures across the various interfaces (CUDA, OpenCL, DirectX ComputeShader, OpenGL Compute).

As SiSoftware operates a “just-in-time” release cycle, some features were introduced in Sandra 2017 service packs: in Sandra Titanium they have been updated and enhanced based on all the feedback received.

Operating System Module

Broad Operating System Support

All current versions supported: Windows 10, 8.1*, 8*, 7*; Server 2019, 2016, 2012/R2 and 2008/R2*

Brand new AI/ML benchmarks featuring both CNN & RNN networks testing both inference/forward and training/back-propagation performance.

Processor Neural Networks (AI/ML)

A combined performance index of CNN (inference/forward & training) & RNN (inference/forward & training) for all precisions (single/FP32, double/FP64 floating-point) and instruction sets (AVX512, AVX2/FMA, AVX, SSE4, SSE2, RTM/HLE with NUMA and large-page support)

Ranker: Processor Neural Networks (Normal/Single Precision)
Ranker: Processor Neural Networks (High/Double Precision)

GP (GPU) Neural Networks (AI/ML)

A combined performance index of CNN (inference/forward & training) & RNN (inference/forward & training) for all precisions (half/FP16, single/FP32 floating-point) and platforms (CUDA, OpenCL, DirectX Compute)

GP (GPU) Neural Networks (Normal/Single Precision)
GP (GPU) Neural Networks (Low/Half Precision)

CNN (Convolution Neural Network) Architecture

Detailed document on the CNN architecture, data-sets and results that underpin our choices for the new benchmarks.

The new Neural Networks (AI/ML) Benchmarks: CNN Architecture

RNN (Recurrent Neural Network) Architecture

Detailed document on the RNN architecture, data-sets and results that underpin our choices for the new benchmarks.

The new Neural Networks (AI/ML) Benchmarks: RNN Architecture

Major changes

  • All connections to website engines (Ranker, Information, Price) are now secured by SSL through HTTP.
  • Sandra client (management console) is now installed as native 64-bit (on x64 and arm64) and thus needs 64-bit Access components (2016, 2013, 2010, etc.) or SQL Server (2017, 2016, 2014, etc) for its database.

Key features of Sandra 20/20

  • 4 native architectures support (x86, x64, ARM64** – Windows; ARM, ARM64, x86, x64 – Android)
  • Huge official hardware support through technology partners (AMD/ATI, nVidia, Intel).
  • 4 native (GP)GPU/APU platforms support (OpenCL 2.1+, CUDA 10.1+, DirectX Compute Shader 11/10+, OpenGL Compute 4.5+, Vulkan 1.0+).
  • 4 native Graphics platforms support (DirectX 11.x/10.x, OpenGL 4.0+, Vulkan 1.0+).
  • 9 language versions (English, German, French, Italian, Spanish, Japanese, Chinese (Traditional, Simplified), Russian) in a single installer.
  • Enhanced Sandra Lite (Eval) version (free for personal/educational use, evaluation for other uses)

Articles & Benchmarks

For more details, please see the following articles:

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please click here.

Downloading

For more details, and to download the Lite (Evaluation) version, please click here.

Reviewers and Editors

For your free review copies, please contact us.

About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Many worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Thousands on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, OpenGL, DirectCompute, x64, ARM64, ARM, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX512, AVX2/FMA3, AVX, NEON/2, SSE4.2/4, SSSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit www.sisoftware.net, www.sisoftware.eu, or www.sisoftware.co.uk