AVX512-IFMA(52) Improvement for IceLake and TigerLake

CPU Multi-Media Vectorised SIMD

What is Sandra’s Multi-Media benchmark?

The “multi-media” benchmark in Sandra was introduced way back with Intel’s MMX instruction set (and thus the Pentium MMX) to show the difference vectorisation brings to common algorithms, in this case (Mandelbrot) fractal generation. While MMX did not have floating-point support, floating-point operations can be emulated using integers of various widths (short/16-bit, int/32-bit, long/int64/64-bit, etc.).
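As a rough illustration of such emulation (a minimal sketch in C, not Sandra’s actual implementation – the fixed-point format and function name are ours, and __int128 is a GCC/Clang extension used here for brevity):

    #include <stdint.h>

    /* Fixed-point Mandelbrot iteration using 64-bit integers: values carry
     * FRAC_BITS fractional bits, so every product must be shifted right by
     * FRAC_BITS to renormalise. The __int128 intermediate emulates the
     * full-width 64x64 multiply that MULX/IFMA52 provide natively. */
    #define FRAC_BITS 44
    #define FIXED(x)  ((int64_t)((x) * (1LL << FRAC_BITS)))

    static int mandel_iters(int64_t cx, int64_t cy, int max_iters)
    {
        int64_t zx = 0, zy = 0;
        int i;
        for (i = 0; i < max_iters; i++) {
            int64_t zx2 = (int64_t)(((__int128)zx * zx) >> FRAC_BITS);
            int64_t zy2 = (int64_t)(((__int128)zy * zy) >> FRAC_BITS);
            if (zx2 + zy2 > FIXED(4.0))    /* |z|^2 > 4: point escaped */
                break;
            int64_t zxy = (int64_t)(((__int128)zx * zy) >> FRAC_BITS);
            zy = 2 * zxy + cy;             /* z = z^2 + c */
            zx = zx2 - zy2 + cx;
        }
        return i;
    }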

The benchmark thus contains various precision tests using both integer and floating point data, currently 6 (single/double/quad-floating point, short/int/long integer) with more to come in the near future (half/FP16 floating-point, etc.). Larger widths provide more precision and thus generate more accurate fractals (images) but are slower to compute (they also take more memory to store).

While the latest instruction sets (AVX(2)/FMA, AVX512) naturally support floating-point data, integer compute performance remains very important and thus needs to be tested too. As quantities become larger (e.g. memory/disk sizes, pointers/address spaces, etc.) we have moved from int/32-bit to long/64-bit processing, with some algorithms being exclusively 64-bit (e.g. SHA-512 hashing).

What is the “trouble” with 64-bit integers?

While all native 64-bit processors (e.g. x64, IA64, etc.) support native 64-bit integer operations, these are generally scalar with limited SIMD (vectorised) support. Multiplication is especially “problematic” as it can generate numbers up to twice (2x) the number of bits – thus multiplying two 64-bit integers can generate a full 128-bit result, for which there was no SIMD support.

Intel added native 64×64→128-bit multiplication support (MULX) with BMI2 (Bit Manipulation Instructions, version 2), but that is still scalar (non-SIMD); not even the latest AVX512-DQ instruction set brought SIMD support. While we could emulate the full 128-bit multiplication using native 32×32→64-bit half-multiplications, we have chosen to wait for native support. An additional issue (for us) is that we use “signed integers” (i.e. they can hold both positive (+ve) and negative (-ve) values) while most multiplication instructions operate on “unsigned integers” (which can hold only positive values) – thus we need to adjust the result for our needs, which incurs overhead.
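As a hedged sketch (our own helper name, not Sandra’s code), a scalar 64×64→128-bit multiply via MULX plus the classic unsigned-to-signed correction of the high half looks roughly like this:

    #include <stdint.h>
    #include <immintrin.h>  /* _mulx_u64: BMI2, compile with -mbmi2 */

    /* Signed 64x64 -> 128-bit multiply built on the unsigned MULX.
     * The low 64 bits are identical for signed and unsigned operands;
     * only the high half needs the two's-complement correction - the
     * exact overhead mentioned above for signed integers. */
    static int64_t mul64x64_s(int64_t a, int64_t b, int64_t *hi)
    {
        unsigned long long uhi;
        uint64_t lo = _mulx_u64((uint64_t)a, (uint64_t)b, &uhi);
        if (a < 0) uhi -= (uint64_t)b;  /* correct high half for a < 0 */
        if (b < 0) uhi -= (uint64_t)a;  /* correct high half for b < 0 */
        *hi = (int64_t)uhi;
        return (int64_t)lo;
    }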

Thus the long/64-bit integer benchmark in Sandra remained non-vectorised until the introduction of AVX512-IFMA52.

What is AVX512-IFMA52?

IFMA52 is one of the new extensions of AVX512, introduced with “IceLake” (ICL), that supports native 52-bit fused multiply-add with a 104-bit result. As it is 512-bit wide, we can multiply-add eight (8) pairs of 64-bit integers in one go, one instruction every 2 clocks (0.5 throughput, 4-cycle latency on ICL) – especially useful for algorithms like (Mandelbrot) fractals where we can operate on many pixels independently.

As it generates a full 104-bit result, it is (as per the name) only a 52-bit integer multiply, thus we need to restrict our integers to 52 bits. It also operates on unsigned integers only, so results need to be adjusted for our signed-integer purposes. Note also that while it is a fused multiply-add, we have chosen to use only the multiply feature here (in this Sandra version 20/20 R9); future versions (of Sandra) may use the full multiply-add feature for even better performance.
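A minimal sketch of the vectorised 52-bit multiply using the public AVX512-IFMA intrinsics (the helper name is ours and this is not Sandra’s actual kernel):

    #include <stdint.h>
    #include <immintrin.h>  /* AVX512-IFMA: compile with -mavx512ifma */

    /* Multiply eight pairs of (unsigned, <= 52-bit) integers at once.
     * VPMADD52LUQ / VPMADD52HUQ compute acc + lo52/hi52 of the 104-bit
     * product of the low 52 bits of each lane; passing a zero accumulator
     * turns the fused multiply-add into a pure multiply, as described
     * above. */
    static void mul52_x8(const uint64_t *a, const uint64_t *b,
                         uint64_t *lo, uint64_t *hi)
    {
        const __m512i mask52 = _mm512_set1_epi64((1ULL << 52) - 1);
        __m512i va = _mm512_and_si512(_mm512_loadu_si512(a), mask52);
        __m512i vb = _mm512_and_si512(_mm512_loadu_si512(b), mask52);
        __m512i zero = _mm512_setzero_si512();

        /* low 52 bits of each 104-bit product, per 64-bit lane */
        _mm512_storeu_si512(lo, _mm512_madd52lo_epu64(zero, va, vb));
        /* high 52 bits of each 104-bit product, per 64-bit lane */
        _mm512_storeu_si512(hi, _mm512_madd52hi_epu64(zero, va, vb));
    }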

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments
BenchCpuMM Emulated Int64 ALU64 (Mpix/s) | 3.67 | 4.34 | While native, scalar int64 processing is pretty slow.
BenchCpuMM Native Int64 ADX/BMI2 (Mpix/s) | 21.24 [+5.78x] | | Using BMI2 for 64-bit multiplication increases (scalar) performance by almost 6x!
BenchCpuMM Emulated Int64 SSE4 (Mpix/s) | 13.92 [-35%] | | Vectorisation through SSE4 (2x wide) is not enough to beat ADX/BMI2.
BenchCpuMM Emulated Int64 AVX2 (Mpix/s) | 22.8 [+64%] | | AVX2 is 4x wide (256-bit) and just about beats scalar ADX/BMI2.
BenchCpuMM Emulated Int64 AVX512/DQ (Mpix/s) | 33.53 [+47%] | | 512-bit wide AVX512 is 47% faster than AVX2.
BenchCpuMM Native Int64 AVX512/IFMA52 (Mpix/s) | 55.87 [+66%, +15x over ALU64] | 70.41 [+16x over ALU64] | IFMA52 is 66% faster than emulated AVX512 and over 15x faster than scalar ALU64.
With IFMA52 we finally see the big performance gain that native 64-bit integer multiplication plus vectorisation (512-bit wide, thus 8x 64-bit integer pairs) brings: over 15x faster on ICL and 16x faster on TGL! In fairness, scalar ADX/BMI2 is only about 2.6x slower than vectorised IFMA52 – showing how much native instructions help processing.

Conclusion

AVX512 continues to bring performance improvements by adding more sub-instruction sets like AVX512-IFMA(52) that help 64-bit integer processing. With 64-bit integers taking over most computations due to increased sizes (data, pointers, etc.) this is becoming more and more important – and not before time.

While not a full 128-bit multiplier, the 104-bit result allows full 52-bit integer operations, which is sufficient for most tasks – today. Perhaps in the future, an IFMA64 will be provided for full 128-bit multiply-result integer support.

Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance

Intel Core i7 Gen 11

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel – the line that replaced the ageing “Skylake” (SKL) arch and its many derivatives that are still with us (“CometLake” (CML), “RocketLake” (RKL), etc.). It is an optimisation of the “IceLake” (ICL) arch, thus on an updated 10nm++ process, again launched for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC (CPU, GPU, memory controller):

  • 10nm++ process (lower voltage, higher performance benefits)
  • Up to 4C/8T “Willow Cove” on ULV  (CometLake up to 6C/12T)
  • Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • AVX512 and more of its friends
  • Increased L2 cache from 512kB to 1.25MB per core (2.5x)
  • Increased L3 cache from 8MB to 12MB (+50%)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
  • PCIe 4.0
  • Thunderbolt 4 (and thus USB 4.0 support) integrated
  • Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)

While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:

  • AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
  • AVX512-VP2INTERSECT (Vector Pair Intersection)

While some software has not yet been updated to AVX512 – as it was previously reserved for HEDT/Servers – with this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
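As an illustration (a hedged sketch with our own helper name – not Sandra’s or any particular framework’s kernel), VNNI fuses the multiply, horizontal add and accumulate of an int8 dot product – the core operation of quantised neural-network inference – into a single instruction:

    #include <stddef.h>
    #include <stdint.h>
    #include <immintrin.h>  /* _mm512_dpbusd_epi32: compile with -mavx512vnni */

    /* u8 x s8 dot product: each 32-bit lane accumulates 4 adjacent
     * products (VPDPBUSD), i.e. 64 multiply-accumulates per instruction.
     * n is assumed to be a multiple of 64 for brevity. */
    static int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n)
    {
        __m512i acc = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 64)
            acc = _mm512_dpbusd_epi32(acc,
                                      _mm512_loadu_si512(a + i),
                                      _mm512_loadu_si512(b + i));
        return _mm512_reduce_add_epi32(acc);  /* sum the 16 partial sums */
    }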

The caches are finally getting updated and increased, considering that the competition has deployed massive caches in its latest products. L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had previously doubled L2 over SKL (and current CML) derivatives, which means TGL’s L2 is 5x larger than older designs.

From a security point-of-view, TGL mitigates all (currently reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1, which does not have a hardware solution), thus it should not require the slower software mitigations that affect performance (especially I/O). Like ICL, it is also not affected by the JCC erratum that is still being addressed through software (compiler) changes – though old software will never be updated.

DDR5 / LPDDR5 will bring even more memory bandwidth and faster data rates (up to 5400MT/s), without the need for multiple (SO)DIMMs to enable at least dual-channel operation; naturally, populating all channels will allow even higher bandwidth. Higher data rates will reduce memory latencies (assuming the timings don’t increase too much). Unfortunately there are no public DDR5 modules for us to test. LPDDR4X also gets a bump to a maximum of 4267MT/s.

PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs, including Intel’s) and NVMe SSDs, with ~8GB/s transfer (x4 lanes) on ULV and up to 32GB/s (x16) on desktop. Note that the DMI/OPI link between the CPU and I/O Hub is also updated to PCIe 4.0 speeds, improving CPU/Hub transfers.

Thunderbolt 4 brings support for the upcoming USB 4.0 protocol and its data rates (40Gbps), which will also enable new peripherals including external eGPUs for discrete graphics.

Finally, the GPU cores have been updated again to Xe (Gen12) cores – up to 96 EU on some SKUs – representing huge compute and graphics performance increases over the old (Gen 9.x) cores used by gen 10 APUs (see the corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles for the remaining domains and for comparisons against the other Gen10 CPUs.

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 10, 11) as well as competitors (AMD) with a view to upgrading to a mid-range but high-performance design.

CPU Specifications | AMD Ryzen 4500U | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments
Cores (CU) / Threads (SP) | 6C / 6T | 4C / 8T | 4C / 8T | 4C / 8T | No change in core count.
Speed (Min / Base / Turbo) | 1.6-2.3-4.0GHz | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) | 0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W) | Both base and Turbo clocks are way up.
Power (TDP) | 15-35W | 15-35W | 15-35W | 12-35W | Similar power envelope, possibly higher.
L1D / L1I Caches | 6x 32kB 8-way / 6x 64kB 4-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | No change in L1D from ICL.
L2 Caches | 6x 512kB 8-way | 4x 256kB 16-way | 4x 512kB 16-way | 4x 1.25MB | L2 has more than doubled (2.5x)!
L3 Cache | 2x 4MB 16-way | 8MB 16-way | 8MB 16-way | 12MB 16-way | L3 is 50% larger.
Microcode (Firmware) | n/a | MU-068E09-CC | MU-067E05-6A | MU-TBD | Revisions just keep on coming.
Special Instruction Sets | AVX2/FMA, SHA | AVX2/FMA | AVX512, VNNI, SHA, VAES, IFMA | AVX512, VNNI, SHA, VAES, IFMA | More AVX512!
SIMD Width / Units | 256-bit | 256-bit | 512-bit | 512-bit | Widest SIMD units ever.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets (AVX512, AVX2, AVX, etc.). “TigerLake” (TGL), like “IceLake” (ICL), supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks | AMD Ryzen 4500U | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 208 | 134 | 154 | 169 [+10%] | TGL is 10% faster than ICL but not enough to beat AMD.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 191 | 135 | 151 | 167 [+11%] | With a 64-bit integer workload – an 11% increase.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 89 | 85 | 90 | 99.5 [+10%] | With floating-point, TGL is only 10% faster but enough to beat AMD.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 75 | 70 | 74 | 83 [+12%] | With FP64 we see a 12% improvement.
With integer (legacy) workloads (not using SIMD), TGL is not much faster than ICL even with its higher-clocked cores; still, a 10-12% improvement is welcome as it allows it to beat the 6-core Ryzen Mobile competition.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 506 | 409 | 504* | 709* [+41%] | With AVX512 TGL is over 40% faster than ICL.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 193 | 149 | 145* | 216* [+49%] | With a 64-bit AVX512 integer workload TGL is 50% faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 4.47 | 2.54 | 3.67** | 4.34** [+18%] | A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**]
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 433 | 328 | 414* | 666* [+61%] | In this floating-point vectorised test TGL is 61% faster!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 251 | 194 | 232* | 381* [+64%] | Switching to FP64 SIMD AVX512 code, TGL is 64% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 11.23 | 8.22 | 10.2* | 15.28* [+50%] | A heavy algorithm using FP64 to mantissa-extend FP128; TGL is still 50% faster than ICL.
With heavily vectorised SIMD workloads, TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile with its 6x 256-bit SIMD cores, but also run 40-60% faster than ICL. Intel seems to have managed to get the SIMD units to run much faster than ICL’s within a similar power envelope!

* using AVX512 instead of AVX2/FMA.

** note test has been rewritten in Sandra 20/20 R9: now vectorised and AVX512-IFMA enabled – see “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.

BenchCrypt Crypto AES-256 (GB/s) | 13.46 | 12.11 | 21.3* | 19.72* [-7%] | Memory bandwidth rules here so TGL is similar to ICL in speed.
BenchCrypt Crypto AES-128 (GB/s) | 13.5 | 12.11 | 21.3* | 19.8* [-7%] | No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) | 7.03** | 4.28 | 9*** | 13.87*** [+54%] | Despite SHA HWA, TGL soundly beats Ryzen using AVX512.
BenchCrypt Crypto SHA1 (GB/s) | 7.19 | | 15.71*** | | Less compute-intensive SHA1 does not help.
BenchCrypt Crypto SHA2-512 (GB/s) | | | 7.09*** | | SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and despite Ryzen Mobile having SHA HWA, TGL is much faster using AVX512 and, as we have seen before, 50% faster than ICL! AVX512 helps even against native hashing acceleration.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W

BenchFinance Black-Scholes float/FP32 (MOPT/s) | | | 64.16 | 109 |
BenchFinance Black-Scholes double/FP64 (MOPT/s) | 91.48 | 87.17 | 91 | 132 [+45%] | Using FP64 TGL is 45% faster than ICL.
BenchFinance Binomial float/FP32 (kOPT/s) | | | 16.34 | 23.55 | Binomial uses thread-shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) | 31.2 | 21 | 27 | 37.23 [+38%] | With FP64 code TGL is 38% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) | | | 12.48 | 79.9 | Monte-Carlo also uses thread-shared data but read-only, reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 45.59 | 16.5 | 33 | 45.98 [+39%] | Switching to FP64, TGL is 40% faster.
With non-SIMD financial workloads, TGL still improves by a decent 40-45% over ICL – enough to beat the 6-core Ryzen Mobile, no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.
BenchScience SGEMM (GFLOPS) float/FP32 | 158 | | 185* | 294* [+59%] | In this tough vectorised algorithm, TGL is 60% faster!
BenchScience DGEMM (GFLOPS) double/FP64 | 76.86 | 69.2 | 91.7* | 167* [+82%] | With FP64 vectorised code, TGL is over 80% faster!
BenchScience SFFT (GFLOPS) float/FP32 | 13.9 | | 31.7* | 31.14* [-2%] | FFT is also heavily vectorised but memory-dependent, so TGL does not improve over ICL.
BenchScience DFFT (GFLOPS) double/FP64 | 7.15 | 7.35 | 17.7* | 16.41* [-3%] | With FP64 code, nothing much changes.
BenchScience SNBODY (GFLOPS) float/FP32 | 169 | | 200* | 286* [+43%] | N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 | 98.7 | 64.2 | 61.8* | 81.61* [+32%] | With FP64 code TGL is 32% faster.
With highly vectorised SIMD code (scientific workloads), TGL again shows us the power of AVX512 – beating ICL by 30-80% and naturally Ryzen Mobile too. Algorithms that are completely memory latency/bandwidth dependent cannot improve this way and require faster memory instead.

* using AVX512 instead of AVX2/FMA

Neural Networks NeuralNet CNN Inference (Samples/s) | 19.33 | 25.62*
Neural Networks NeuralNet CNN Training (Samples/s) | 3.33 | 4.56*
Neural Networks NeuralNet RNN Inference (Samples/s) | 23.88 | 24.93*
Neural Networks NeuralNet RNN Training (Samples/s) | 1.57 | 2.97*
* using AVX512 instead of AVX2/FMA (not using VNNI yet)
CPU Image Processing Blur (3×3) Filter (MPix/s) | 1060 | 891 | 1580* | 2276* [+44%] | In this vectorised integer workload TGL is 44% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 441 | 359 | 633* | 912* [+44%] | Same algorithm but more shared data; TGL is still 44% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 231 | 186 | 326* | 480* [+47%] | Again the same algorithm but with even more shared data brings 47%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 363 | 302 | 502* | 751* [+50%] | A different but still vectorised algorithm; still 50% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 28.02 | 27.7 | 72.9* | 109* [+49%] | Still vectorised code; TGL is again 50% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 12.23 | 15.7 | 24.7* | 34.74* [+40%] | A similar improvement here of about 40%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 936 | 1580 | 2100* | 2998* [+43%] | With an integer workload, 43% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 127 | 214 | 307* | 430* [+40%] | In this final test, again with an integer workload, 40% faster.
Similar to what we saw before, TGL is 40-50% faster than ICL at a similar power envelope and far faster than Ryzen Mobile and its 6 cores. Again we see the huge improvement AVX512 brings, even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA

Perhaps due to the relatively meager ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures with more cores (Ryzen Mobile or CometLake with 6 cores) – but TGL improves things considerably, anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD making big improvements with Ryzen Mobile (ZEN2) – updated 256-bit SIMD units and more cores (6+) – Intel had to improve: and improve it did. While AVX512, due to high power consumption, was never a good fit for mobile and its meager ULV power envelopes (15-25W, etc.), somehow “TigerLake” (TGL) manages to run its AVX512 units much faster – 40-50% faster than “IceLake” – thus beating the competition.

TGL’s performance – while staying within a ULV power budget in a thin-and-light laptop (e.g. Dell XPS 13) – is pretty compelling and soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring to a modern, efficient design.

TGL also brings PCIe 4.0 (thus faster NVMe/Optane storage I/O) and Thunderbolt 4 / USB 4.0 compatibility (thus faster external I/O as well). DDR5 & LPDDR5 also promise even higher bandwidth to feed the new cores – not to mention the updated GPGPU engine with its many more cores (up to 96 EU now!) that requires a lot more bandwidth.

TGL is a huge improvement over older architectures (even 8th gen) that improves everything: greater compute power, greater graphics/GP compute power, faster memory, faster storage and faster external I/O! If you thought that ICL – despite its own big improvements – did not quite reach the “upgrade threshold”, TGL does everything and much more. The times of small, incremental improvements are finally over, and ICL/TGL are just what was needed. Let’s hope Intel can keep it up!

In a word: Highly Recommended!

SiSoftware Sandra 20/20/9 (2020 R9) Update – GPGPU updates, fixes

We are pleased to release the R9 (version 30.69) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

GPGPU Benchmarks

  • CUDA SDK 11.3+ update for nVidia “Ampere” (SM8.x) – deprecated SM3.x support
  • OpenCL SDK updated to 1.2 minimum, 2.x recommended and 3.x experimental.
  • Increased addressing in all benchmarks (GPGPU Processing, Cryptography, Scientific Analysis, Financial Analysis, Image Processing, etc.) to 64-bit for large VRAM cards (12GB and larger)
  • Increased limits allowing bigger grids/workloads on large memory systems (e.g. 24GB+ VRAM, 64GB+ RAM)
  • Optimised (vectorised/predicated) some image processing kernels for higher performance on VLIW GPGPUs.

CPU / SVM Benchmarks

  • Vectorised CPU Multi-Media 128-bit integer benchmark to support ADX/BMI(2) instructions as well as AVX512-IFMA(52) (52-bit integer FMA, 104-bit intermediate result) on Intel “IceLake” and newer CPUs (also supporting AVX512-DQ, AVX2 and SSE4). See the “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article and the sketch after this list.
  • Optimised .Net and Java Multi-Media 128-bit integer benchmarks similarly to the CPU version.
  • Increased limits/addressing allowing bigger grids/workloads on large thread/memory systems (e.g. 256-thread, 256GB RAM)
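A hedged sketch of the scalar ADX/BMI2 building blocks involved (our own types and helper names, not Sandra’s actual code):

    #include <stdint.h>
    #include <immintrin.h>  /* _mulx_u64 (BMI2), _addcarry_u64 (ADX);
                               compile with -mbmi2 -madx */

    typedef struct { uint64_t lo, hi; } u128;

    /* Truncated 128x128 -> 128-bit multiply from 64-bit pieces: MULX
     * provides the full 64x64 -> 128-bit partial product, and the cross
     * terms only contribute their low 64 bits to the high half. */
    static u128 mul128(u128 a, u128 b)
    {
        unsigned long long hi;
        u128 r;
        r.lo = _mulx_u64(a.lo, b.lo, &hi);
        r.hi = hi + a.lo * b.hi + a.hi * b.lo;
        return r;
    }

    /* 128-bit addition as an explicit ADX-style carry chain. */
    static u128 add128(u128 a, u128 b)
    {
        unsigned long long lo, hi;
        unsigned char c = _addcarry_u64(0, a.lo, b.lo, &lo);
        _addcarry_u64(c, a.hi, b.hi, &hi);
        return (u128){ lo, hi };
    }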

Bug Fixes

  • Fixed an incorrect rounding function that resulted in negative numbers being displayed for zero values (!) – see the sketch after this list. Display issue only; actual values were not affected.
  • Fixed display for large scores in GPGPU Processing benchmarks (result would overflow the display routine). Display issue only, scores stored internally (database) or sent/received from Ranker were correct and will display correctly upon update.
  • Fixed (sub)domain for Information Engine. Updated microcode, firmware, BIOS, driver versions are displayed again when available.
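As an illustration of this class of display bug (a hedged sketch in C, not Sandra’s actual routine): naive rounding can print a tiny negative value as “-0.00”; rounding first and then normalising negative zero fixes the display without touching the stored value:

    #include <math.h>
    #include <stdio.h>

    /* Round for display, collapsing -0.0 so "-0.00" can never appear. */
    static void print_rounded(double v, int digits)
    {
        double scale = pow(10.0, digits);
        double r = round(v * scale) / scale;
        if (r == 0.0)            /* true for both +0.0 and -0.0 */
            r = 0.0;             /* normalise the sign for display */
        printf("%.*f\n", digits, r);
    }

    /* print_rounded(-0.0001, 2) prints "0.00" rather than "-0.00". */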

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite