Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – CPU AVX512 Performance

What is “IceLake”?

It is the “real” 10th generation Core arch(itecture) (ICL/IceLake) from Intel: the brand new core that replaces the ageing “Skylake” (SKL) arch and its many derivatives. Due to delays it actually debuts shortly after the latest update, “CometLake” (CML), which is also called 10th generation. It launches first for mobile ULV (U/Y) devices, with mainstream (desktop/workstation) parts to follow soon.

As a new design, it contains extensive changes to all parts of the SoC (CPU, GPU, memory controller):

  • 10nm+ process (lower voltage, performance benefits)
  • Up to 4C/8T on ULV (similar to WhiskeyLake but less than top-end CometLake 6C/12T)
  • Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskeyLake)
  • AVX512 instruction set (like HEDT platform)
  • SHA HWA instruction set (like Ryzen)
  • 2-channel LP-DDR4X support up to 3733MT/s
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
  • WiFi6 (802.11ax) AX201 integrated

Probably the biggest change is support for the AVX512 family of instruction sets, effectively doubling the SIMD processing width (vs. AVX2/FMA) as well as adding a whole host of specialised instructions that even the HEDT platform (SKL/KBL-X) does not support:

  • VNNI (Vector Neural Network Instructions)
  • VBMI, VBMI2 (Vector Byte Manipulation Instructions)
  • BITALG (Bit Algorithms)
  • IFMA (Integer FMA)
  • VAES (Vector AES) accelerating crypto
  • GFNI (Galois Field)
  • SHA accelerating hashing
  • GNA (Gaussian & Neural Accelerator)
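
To check which of the extensions listed above a given system actually exposes, the CPU feature flags can be inspected. The following is a minimal sketch only: it assumes the flag names used by Linux in /proc/cpuinfo (our tests ran on Windows, where a different mechanism would be needed), and the helper names are purely illustrative. GNA is left out as it is a separate accelerator block rather than an instruction set.

```python
# Assumption: flag names follow the Linux /proc/cpuinfo convention.
# All function and variable names here are illustrative only.
ICL_FEATURES = {
    "AVX512 Foundation": "avx512f",
    "VNNI": "avx512_vnni",
    "VBMI": "avx512vbmi",
    "VBMI2": "avx512_vbmi2",
    "BITALG": "avx512_bitalg",
    "IFMA": "avx512ifma",
    "VAES": "vaes",
    "GFNI": "gfni",
    "SHA": "sha_ni",
}


def parse_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def report_features(flags):
    """Map each extension name to True/False depending on the flag set."""
    return {name: flag in flags for name, flag in ICL_FEATURES.items()}


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read cpuinfo text; returns an empty string where the file is missing."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    for name, present in report_features(parse_flags(read_cpuinfo())).items():
        print(f"{name:20s} {'yes' if present else 'no'}")
```

On a system without /proc/cpuinfo the script simply reports every extension as absent, so the result should only be trusted on Linux.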

While some software has not yet been updated to AVX512, as it was reserved for HEDT/servers, this mainstream launch makes it very likely that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
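
To illustrate why VNNI suits low-precision networks, here is a rough scalar emulation of what its core instruction (VPDPBUSD) computes per 32-bit lane: four unsigned 8-bit values are multiplied by four signed 8-bit values and the sum is accumulated into a signed 32-bit integer, all in one step. This is a sketch for explanation only; the function names are our own and real code would use compiler intrinsics instead.

```python
def to_int32(value):
    """Wrap a Python int to signed 32-bit (the non-saturating variant)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def vpdpbusd_lane(acc, a_u8, b_s8):
    """One 32-bit lane: acc += sum of 4 products (unsigned byte * signed byte)."""
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))


def vpdpbusd(acc, a_u8, b_s8):
    """Whole register: len(acc) lanes, 4 bytes per lane (16 lanes for 512-bit)."""
    return [
        vpdpbusd_lane(acc[i], a_u8[4 * i:4 * i + 4], b_s8[4 * i:4 * i + 4])
        for i in range(len(acc))
    ]


if __name__ == "__main__":
    # A 512-bit register holds 64 bytes, i.e. 16 lanes of 4 bytes each.
    print(vpdpbusd([0] * 16, [1, 2, 3, 4] * 16, [1, -1, 2, -2] * 16))
```

A 512-bit register thus performs 64 multiply-accumulates per instruction, which is the source of the expected inference speed-up once software adopts it.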

VAES and SHA acceleration improve crypto/hashing performance – important today as even LAN transfers between workstations are likely to be encrypted/signed, not to mention just about all WAN transfers, encrypted disk/containers, etc. Some SoCs will also make their way into powerful (but low power) firewall appliances where both AES and SHA acceleration will prove very useful.

From a security point of view, ICL mitigates all (existing/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1, which has no hardware solution), and thus should not require the slower software mitigations that affect performance (especially I/O).

The memory controller supports LP-DDR4X at higher speeds than CML, while the cache/TLB systems have been improved; this should help both CPU and GPU performance (see corresponding article) as well as reduce power vs. older designs using LP-DDR3.

Finally, the GPU core has been updated (Gen11) and generally contains many more cores than the old Gen9.5 design that was used from KBL (CPU Gen7) all the way to CML (CPU Gen10) (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 8, 7, 6) as well as competitors (AMD) with a view to upgrading to a mid-range but high-performance design.

 

| CPU Specifications | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| Cores (CU) / Threads (SP) | 4C / 8T | 4C / 8T | 4C / 8T | 4C / 8T | No change in core count. |
| Speed (Min / Max / Turbo) | 1.6-2.0-3.6GHz | 0.4-1.8-4.0GHz (1.8GHz @ 15W, 2GHz @ 25W) | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) | ICL has lower clocks vs. CML. |
| Power (TDP) | 15-35W | 15-35W | 15-35W | 12-35W | Same power envelope. |
| L1D / L1I Caches | 4x 32kB 8-way / 4x 64kB 4-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | L1D is 50% larger. |
| L2 Caches | 4x 512kB 8-way | 4x 256kB 16-way | 4x 256kB 16-way | 4x 512kB 16-way | L2 has doubled. |
| L3 Caches | 4MB 16-way | 6MB 16-way | 8MB 16-way | 8MB 16-way | No L3 changes. |
| Microcode (Firmware) | MU8F1100-0B | MU068E09-AE | MU068E0C-BE | MU067E05-6A | Revisions just keep on coming. |
| Special Instruction Sets | AVX2/FMA, SHA | AVX2/FMA | AVX2/FMA | AVX512, VNNI, SHA, VAES, GFNI | 512-bit wide SIMD on mobile! |
| SIMD Width / Units | 128-bit | 256-bit | 256-bit | 512-bit | Widest SIMD units ever. |
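
To put the SIMD widths above into perspective, the number of elements processed per instruction is simply the register width divided by the element size. The small sketch below (the function name is our own) works this out for the three widths in the table.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements of a given size that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


if __name__ == "__main__":
    for width in (128, 256, 512):
        print(
            f"{width}-bit: {simd_lanes(width, 32)} x FP32/Int32, "
            f"{simd_lanes(width, 64)} x FP64/Int64, "
            f"{simd_lanes(width, 8)} x Int8"
        )
```

Lane count is only a theoretical upper bound on the speed-up; clocks, power limits and memory bandwidth decide how much of it is realised.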

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2/FMA, AVX, etc.). “IceLake” (ICL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

 

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 103 | 125 | 134 | 154 [+15%] | ICL is 15% faster than CML. |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 102 | 115 | 135 | 151 [+12%] | With a 64-bit integer workload we see a 12% increase. |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 79 | 67 | 85 | 90 [+6%] | With floating-point, ICL is 6% faster than CML. |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 67 | 57 | 70 | 74 [+5%] | With FP64 we see a 5% improvement. |

With integer (legacy) workloads (not using SIMD) we see the new ICL core is over 10% faster than the higher-clocked CML core; with floating-point we see a 5-6% improvement. While modest, it shows the potential of the new core over the old-but-refined cores we’ve had since SKL.
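
For reference, the bracketed figures are the ICL score relative to the CML score. A minimal sketch of that arithmetic follows; the switch to multiplier notation at 2x is our assumption about the convention used in the tables, and the function names are illustrative.

```python
def relative_gain(new_score, old_score):
    """Fractional gain of new over old (0.15 means 15% faster)."""
    if old_score <= 0:
        raise ValueError("old_score must be positive")
    return new_score / old_score - 1.0


def format_gain(new_score, old_score):
    """Format a gain as '[+15%]', or as '[+2.4x]' once it reaches 2x."""
    ratio = relative_gain(new_score, old_score) + 1.0
    if ratio >= 2.0:
        return f"[+{ratio:.1f}x]"
    return f"[{(ratio - 1.0) * 100:+.0f}%]"


if __name__ == "__main__":
    print(format_gain(154, 134))    # Dhrystone Integer, ICL vs. CML
    print(format_gain(145, 149))    # Int64 Multi-Media, ICL vs. CML
    print(format_gain(17.7, 7.35))  # DFFT, ICL vs. CML
```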
| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 239 | 306 | 409 | 504* [+23%] | With AVX512 ICL wins this vectorised integer test. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 53.4 | 117 | 149 | 145* [-3%] | With a 64-bit AVX512 integer workload we have parity. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 2.41 | 2.21 | 2.54 | 3.67 [+44%] | A tough test using long integers to emulate Int128 without SIMD; ICL is 44% faster! |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 222 | 266 | 328 | 414* [+26%] | In this floating-point vectorised test, AVX512 is 26% faster. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 127 | 155.9 | 194 | 232* [+19%] | Switching to FP64 SIMD code, ICL is almost 20% faster. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 6.23 | 6.51 | 8.22 | 10.2* [+24%] | A heavy algorithm using FP64 to mantissa-extend to FP128; ICL is 24% faster. |

With heavily vectorised SIMD workloads ICL is able to deploy AVX512, which leads to a 20-25% performance improvement even at the slower clock. However, AVX512 is quite power-hungry (as we’ve seen on HEDT) so we are power-constrained in a ULV here, but higher-TDP systems (28W, etc.) should perform much better.

* using AVX512 instead of AVX2/FMA.

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| BenchCrypt Crypto AES-256 (GB/s) | 10.9 | 13.1 | 12.1 | 21.3* [+76%] | ICL with VAES is 76% faster than CML. |
| BenchCrypt Crypto AES-128 (GB/s) | 10.9 | 13.1 | 12.1 | 21.3* [+76%] | No change with AES128. |
| BenchCrypt Crypto SHA2-256 (GB/s) | 6.78** | 3.97 | 4.3 | 9** [+2.1x] | Despite SHA HWA, Ryzen loses top spot. |
| BenchCrypt Crypto SHA1 (GB/s) | 7.13** | 7.5 | 7.2 | 15.7** [+2.2x] | Less compute intensive SHA1 does not help. |
| BenchCrypt Crypto SHA2-512 (GB/s) | 1.48 | 1.54 | - | 7.1*** | SHA2-512 is not accelerated by SHA HWA. |

The memory sub-system is crucial here: even with VAES (AVX512 VL) and SHA HWA support (like Ryzen), ICL wins mainly thanks to the very fast LP-DDR4X @ 3733MT/s. VAES helps only marginally (at this time), and SHA HWA cannot beat AVX512 multi-buffer but should be much more important in single-buffer large-data workloads.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W
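
For readers who want a rough feel for hashing throughput on their own system, the sketch below times single-buffer SHA2-256 over a block of memory and reports GB/s. It is not the benchmark used above: it is single-threaded, uses Python's hashlib, and whether SHA HWA is used at all depends on the crypto library hashlib was built against. Buffer size and repeat count are arbitrary choices.

```python
import hashlib
import time


def sha256_throughput(buffer_bytes=1 << 20, repeats=16):
    """Hash a zero-filled buffer repeatedly; return (GB/s, last hex digest)."""
    data = bytes(buffer_bytes)
    digest = ""
    start = time.perf_counter()
    for _ in range(repeats):
        digest = hashlib.sha256(data).hexdigest()
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (buffer_bytes * repeats) / elapsed / 1e9, digest


if __name__ == "__main__":
    gbps, _ = sha256_throughput(buffer_bytes=64 << 20, repeats=8)
    print(f"SHA2-256 single-buffer: {gbps:.2f} GB/s")
```

A large buffer (well beyond the L3 size) exercises the memory sub-system as well as the hashing units, which is where the fast LP-DDR4X matters.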

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| BenchFinance Black-Scholes float/FP32 (MOPT/s) | 93.34 | 73.02 | - | 109 | With non-vectorised code ICL is still faster. |
| BenchFinance Black-Scholes double/FP64 (MOPT/s) | 77.86 | 75.24 | 87.2 | 91 [+4%] | Using FP64 ICL is 4% faster. |
| BenchFinance Binomial float/FP32 (kOPT/s) | 35.49 | 16.2 | - | 23.5 | Binomial uses thread shared data thus stresses the cache & memory system. |
| BenchFinance Binomial double/FP64 (kOPT/s) | 19.46 | 19.31 | 21 | 27 [+29%] | With FP64 code ICL is 29% faster. |
| BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 20.11 | 14.61 | - | 79.9 | Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches. |
| BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 15.32 | 14.54 | 16.5 | 66 [+2x] | Switching to FP64 ICL is 2x faster. |

With non-SIMD financial workloads, ICL still improves by a significant amount over CML, thus it makes sense to choose it rather than the older core. Still, it is more likely that the GPGPU will be used for such workloads today.
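
As an illustration of the kind of per-option work a Black-Scholes test performs, here is a textbook closed-form pricer for a European call and put. It is a sketch of the standard formula only, not the benchmark's actual code, and the parameter names are our own.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, years):
    """Return (call, put) prices of a European option without dividends."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * years)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


if __name__ == "__main__":
    call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, vol=0.2, years=1.0)
    print(f"call={call:.4f} put={put:.4f}")
```

Each option needs a logarithm, an exponential, a square root and several error-function evaluations, so the workload is dominated by scalar floating-point throughput rather than memory.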
| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | 107 | 141 | 158 | 185* [+17%] | In this tough vectorised algorithm, ICL is 17% faster. |
| BenchScience DGEMM (GFLOPS) double/FP64 | 47.2 | 55 | 69.2 | 91.7* [+32%] | With FP64 vectorised code, ICL is 32% faster. |
| BenchScience SFFT (GFLOPS) float/FP32 | 3.75 | 13.23 | 13.9 | 31.7* [+2.3x] | FFT is also heavily vectorised and here ICL is over 2x faster. |
| BenchScience DFFT (GFLOPS) double/FP64 | 4 | 6.53 | 7.35 | 17.7* [+2.4x] | With FP64 code, ICL is even faster. |
| BenchScience SNBODY (GFLOPS) float/FP32 | 112.6 | 160 | 169 | 200* [+18%] | N-Body simulation is vectorised but with more memory accesses. |
| BenchScience DNBODY (GFLOPS) double/FP64 | 45.3 | 57.9 | 64.2 | 61.8* [-4%] | With FP64 code ICL is slightly behind CML. |

With highly vectorised SIMD code (scientific workloads), ICL again shows us the power of AVX512 and can be over 2x (twice) as fast as CML despite CML's higher clock. Some algorithms may need further optimisation (FP64 N-Body is slightly behind), but elsewhere we still see a 17-32% improvement.

* using AVX512 instead of AVX2/FMA
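
To make the GEMM figures easier to relate to, the sketch below shows the conventional way a GFLOPS number is derived for a matrix multiply: an M×K by K×N product is counted as 2·M·N·K floating-point operations (one multiply and one add per inner step). We assume, without confirmation, that the benchmark counts operations the same way; the naive triple loop is for illustration only and is nowhere near the speed of vectorised code.

```python
import time


def gemm_flops(m, n, k):
    """Conventional operation count for C[m x n] = A[m x k] * B[k x n]."""
    return 2 * m * n * k


def naive_gemm(a, b):
    """Plain triple-loop matrix multiply on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def measure_gflops(size=64):
    """Time one naive GEMM of size x size and convert to GFLOPS."""
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    start = time.perf_counter()
    naive_gemm(a, b)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return gemm_flops(size, size, size) / elapsed / 1e9


if __name__ == "__main__":
    print(f"naive Python GEMM: {measure_gflops():.4f} GFLOPS")
```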

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| Neural Networks NeuralNet CNN Inference (Samples/s) | 14.32 | 17.27 | 19.33 | 25.62* [+33%] | Using AVX512 ICL inference is 33% faster. |
| Neural Networks NeuralNet CNN Training (Samples/s) | 1.46 | 2.06 | 3.33 | 4.56* [+37%] | Even training improves by 37%. |
| Neural Networks NeuralNet RNN Inference (Samples/s) | 16.93 | 22.69 | 23.88 | 24.93* [+4%] | Just 4% faster but improvement is there. |
| Neural Networks NeuralNet RNN Training (Samples/s) | 1.48 | 1.14 | 1.57 | 2.97* [+43%] | Training is much faster by 43% over CML. |

As we’ve seen before, ICL benefits greatly from AVX512: it beats the higher-clocked CML across the board, by 33-43% in three of the four tests, and that is before using VNNI to accelerate algorithms even more.

* using AVX512 instead of AVX2/FMA (not using VNNI yet)

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (KabyLake-R ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 532 | 720 | 891 | 1580* [+77%] | In this vectorised integer workload ICL is 77% faster. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 146 | 290 | 359 | 633* [+76%] | Same algorithm but more shared data, still 76%. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 123 | 157 | 186 | 326* [+75%] | Again same algorithm but even more data shared brings 75%. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 185 | 251 | 302 | 502* [+66%] | Different algorithm but still a vectorised workload, still 66% faster. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 26.49 | 25.38 | 27.7 | 72.9* [+2.6x] | Still vectorised code; ICL rules here, 2.6x faster! |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 9.38 | 14.29 | 15.7 | 24.7* [+57%] | Similar improvement here of about 57%. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 660 | 1525 | 1580 | 2100* [+33%] | With integer workload, 33% faster. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 94.16 | 188.8 | 214 | 307* [+43%] | In this final test, again with integer workload, 43% faster. |

ICL rules this benchmark with AVX512 integer (B/W) 33-43% faster and floating-point AVX512 66-77% faster than CML, even at a lower clock. Again we see the huge improvement AVX512 already brings, even within low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA
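
For context, a 3×3 blur visits every pixel and combines it with its eight neighbours, which is why such filters vectorise so well. The sketch below is a plain scalar box blur with edge clamping; the exact kernel and edge handling used by the benchmark are not stated, so both are assumptions made purely for illustration.

```python
def box_blur_3x3(image):
    """Average each pixel with its 8 neighbours (edges clamped), integer maths."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row index
                    xx = min(max(x + dx, 0), width - 1)   # clamp column index
                    total += image[yy][xx]
            out[y][x] = total // 9
    return out


if __name__ == "__main__":
    sample = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
    for row in box_blur_3x3(sample):
        print(row)
```

A SIMD version processes many adjacent pixels per instruction (64 bytes with AVX512 B/W), so the nine loads and adds per pixel above turn into a handful of wide operations per block of pixels.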

Unlike CML, ICL with AVX512 support is a revolution in performance, which is exactly what we were hoping for; even at a much lower clock we see anywhere from 20% all the way to over 2x (twice) the performance within the same power limits (TDP/turbo). As we know from HEDT, AVX512 is power-hungry, thus higher-TDP rated versions (e.g. 28W) should perform even better.

Even without AVX512, we see a good improvement of 5-15%, again at a much lower clock (3.9GHz vs. 4.9GHz turbo), while CML and older versions relied on higher clocks / more cores to outperform their predecessors (KBL/SKL-U).
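
To show what the clock difference implies, the score gain can be normalised by clock speed. The sketch below does this using the rated turbo clocks; this is a simplification, as neither CPU sustains its maximum turbo in multi-threaded tests at 15W, so the result is best read as an indication rather than a measured per-clock gain. The function name is our own.

```python
def per_clock_gain(new_score, old_score, new_ghz, old_ghz):
    """Fractional gain in score-per-GHz of new over old."""
    if min(new_score, old_score, new_ghz, old_ghz) <= 0:
        raise ValueError("scores and clocks must be positive")
    return (new_score / new_ghz) / (old_score / old_ghz) - 1.0


if __name__ == "__main__":
    # Dhrystone Integer: ICL 154 GIPS @ 3.9GHz turbo vs. CML 134 GIPS @ 4.9GHz turbo
    gain = per_clock_gain(154, 134, 3.9, 4.9)
    print(f"per-clock gain at rated turbo: {gain * 100:.0f}%")
```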

Final Thoughts / Conclusions

With AMD snapping at its heels with Ryzen Mobile, Intel has finally fixed its 10nm production and rolled out the “new Skylake” we deserve: Ice Lake with AVX512 brings feature parity with the much older HEDT platform and shows good promise for the future. This is the “Core” you have been looking for.

While power-hungry and TDP-constrained, AVX512 does bring sizeable performance gains that come in addition to the core improvements and the cache & memory sub-system improvements. Other instruction sets (VAES, SHA HWA) complete the package and might help in some scenarios where code has not been updated to AVX512.

With ICL, a mere 15W thin & light (e.g. Dell XPS 13 9300) can outperform older desktop-class CPUs (e.g. SKL) at 4-6x (four/six-times) TDP which makes us really keen to see what desktop-class processors will be capable of. And not before time as the competition has been bringing stronger and stronger designs (Ryzen2, future Ryzen 3).

If you have been waiting to upgrade from the much older (but still good) SKL/KBL with just 2 cores and no hardware vulnerability mitigations, then you finally have something to upgrade to. CML was not it: despite its 4 cores (and rumoured 6-core version), it just did not bring enough to the table to make upgrading worthwhile (save hardware mitigations that don’t cripple performance).

Overall, with GPGPU and memory improvements, ICL-U is a very compelling proposition that, cost permitting, should be your top choice for long-term use.

In a word: Highly Recommended!
