Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance

Intel Core i7 Gen 11

What is “TigerLake”?

It is the latest update of the “next generation” Core (gen 11) architecture (TGL/“TigerLake”) from Intel – the line that replaced the ageing “Skylake” (SKL) arch and its many derivatives that are still with us (“CometLake” (CML), “RocketLake” (RKL), etc.). It is an optimisation of the “IceLake” (ICL) arch – thus on an updated 10nm++ process – again launched first for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC – CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance)
  • Up to 4C/8T “Willow Cove” cores on ULV (CometLake: up to 6C/12T)
  • Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • AVX512 and more of its friends
  • Increased L2 cache from 512kB to 1.25MB per core (2.5x)
  • Increased L3 cache from 8MB to 12MB (+50%)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
  • PCIe 4.0
  • Thunderbolt 4 (and thus USB 4.0 support) integrated
  • Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)

While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives, effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:

  • AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
  • AVX512-VP2INTERSECT (Vector Pair Intersection)

While some software may not have been updated to AVX512 while it was reserved for HEDT/servers, with this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
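To illustrate why VNNI matters for low-precision inference, here is a minimal sketch (our own illustration, not Sandra’s code; the function name is hypothetical) of an int8 dot product, where a single VPDPBUSD fuses the multiply, widen and accumulate steps that AVX2 needs several instructions for:

```cpp
// Illustrative AVX512-VNNI use; compile with -mavx512f -mavx512vnni and
// verify CPUID support before calling.
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Dot product of unsigned-int8 a[] with signed-int8 b[]; n must be a multiple of 64.
int32_t dot_u8s8(const uint8_t* a, const int8_t* b, std::size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);   // 64x uint8
        __m512i vb = _mm512_loadu_si512(b + i);   // 64x int8
        // One instruction: multiply byte pairs, widen and add into 16x int32 lanes.
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);          // horizontal sum of the 16 lanes
}
```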

The caches are also finally being updated and enlarged, considering that the competition has deployed massive caches in its latest products: L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had already doubled L2 over the SKL (and current CML) derivatives, which makes TGL’s L2 5x larger than those older designs.

From a security point-of-view, TGL mitigates all (current/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Bounds Check Bypass, aka Spectre V1, which does not have a hardware solution); thus it should not require the slower software mitigations that affect performance (especially I/O). Like ICL, it is also not affected by the JCC erratum that is still being addressed through software (compiler) changes – and old software will never be updated.

DDR5 / LPDDR5 will ensure even more memory bandwidth through faster data rates (up to 5400Mt/s), without needing multiple (SO)DIMMs to enable at least dual-channel operation; naturally, populating all channels will allow even higher bandwidth. Faster data rates also reduce effective memory latency (assuming the timings do not increase too much). Unfortunately, there are no public DDR5 modules for us to test. LPDDR4X also gets a bump, to a maximum of 4267Mt/s.

PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs, including Intel’s) and NVMe SSDs, with ~8GB/s transfer (x4 lanes) on ULV and up to ~32GB/s (x16) on desktop. Note that the DMI/OPI link between the CPU and the I/O Hub is also updated to PCIe 4.0 speeds, improving CPU/Hub transfers – see the back-of-envelope sketch below.
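A quick back-of-envelope check of the memory and PCIe figures above (our own arithmetic; assumptions: a 128-bit combined LPDDR bus and PCIe 4.0’s 16GT/s per lane with 128b/130b encoding):

```cpp
#include <cstdio>

int main() {
    // Memory: data rate (MT/s) x bus width (bytes) = MB/s; /1000 for GB/s.
    auto mem_gbs  = [](double mts, int bus_bits) { return mts * (bus_bits / 8) / 1000.0; };
    // PCIe 4.0: 16GT/s per lane, 128b/130b encoding, per direction.
    auto pcie_gbs = [](int lanes) { return lanes * 16.0 * 128.0 / 130.0 / 8.0; };

    std::printf("LPDDR4X-4267: %.1f GB/s\n", mem_gbs(4267, 128)); // ~68.3 GB/s
    std::printf("LPDDR5-5400 : %.1f GB/s\n", mem_gbs(5400, 128)); // ~86.4 GB/s
    std::printf("PCIe 4.0 x4 : %.1f GB/s\n", pcie_gbs(4));        // ~7.9 GB/s
    std::printf("PCIe 4.0 x16: %.1f GB/s\n", pcie_gbs(16));       // ~31.5 GB/s
}
```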

Thunderbolt 4 brings support for the upcoming USB 4.0 protocol and its data rates (40Gbps) and will also enable new peripherals, including external eGPUs for discrete graphics.

Finally, the GPU cores have been updated again, to Xe (Gen12) – up to 96 EU on some SKUs – representing huge compute and graphics performance increases over the old Gen9.x cores used by gen 10 APUs (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 10, 11) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

| CPU Specifications | AMD Ryzen 4500U | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments |
|---|---|---|---|---|---|
| Cores (CU) / Threads (SP) | 6C / 6T | 4C / 8T | 4C / 8T | 4C / 8T | No change in core count. |
| Speed (Min / Max / Turbo) | 1.6-2.3-4.0GHz | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) | 0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W) | Both base and Turbo clocks are way up. |
| Power (TDP) | 15-35W | 15-35W | 15-35W | 12-35W | Similar power envelope, possibly higher. |
| L1D / L1I Caches | 6x 32kB 8-way / 6x 64kB 4-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | No change in L1D vs. ICL. |
| L2 Caches | 6x 512kB 8-way | 4x 256kB 16-way | 4x 512kB 16-way | 4x 1.25MB | L2 has more than doubled (2.5x)! |
| L3 Caches | 2x 4MB 16-way | 8MB 16-way | 8MB 16-way | 12MB 16-way | L3 is 50% larger. |
| Microcode (Firmware) | n/a | MU-068E09-CC | MU-067E05-6A | MU-TBD | Revisions just keep on coming. |
| Special Instruction Sets | AVX2/FMA, SHA | AVX2/FMA | AVX512, VNNI, SHA, VAES, IFMA | AVX512, VNNI, SHA, VAES, IFMA | More AVX512! |
| SIMD Width / Units | 256-bit | 256-bit | 512-bit | 512-bit | Widest SIMD units ever. |

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets (AVX512, AVX2, AVX, etc.). “TigerLake” (TGL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES, IFMA and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
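For context, 2MB “large pages” are typically obtained on Windows along these lines (a minimal sketch assuming the SeLockMemoryPrivilege has been granted to the process; Sandra’s internals are not public):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    SIZE_T page = GetLargePageMinimum();     // usually 2MB on x64; 0 if unsupported
    if (page == 0) { std::puts("Large pages not supported."); return 1; }

    // Fails with ERROR_PRIVILEGE_NOT_HELD unless SeLockMemoryPrivilege is enabled.
    void* buf = VirtualAlloc(nullptr, page,
                             MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                             PAGE_READWRITE);
    if (!buf) { std::printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }

    // ... benchmark kernels run against buf, with far fewer TLB misses than 4kB pages ...
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```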

| Native Benchmarks | AMD Ryzen 4500U | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments |
|---|---|---|---|---|---|
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 208 | 134 | 154 | 169 [+10%] | TGL is 10% faster than ICL but not enough to beat AMD. |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 191 | 135 | 151 | 167 [+11%] | With a 64-bit integer workload, an 11% increase. |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 89 | 85 | 90 | 99.5 [+10%] | With floating-point, TGL is only 10% faster but enough to beat AMD. |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 75 | 70 | 74 | 83 [+12%] | With FP64 we see a 12% improvement. |

With integer (legacy) workloads (not using SIMD), TGL is not much faster than ICL even with its higher-clocked cores; still, a 10-12% improvement is welcome as it allows it to beat the 6-core Ryzen Mobile competition.
| Native Benchmarks | Ryzen 4500U | i7-10510U (CML) | i7-1065G7 (ICL) | i7-1165G7 (TGL) | Comments |
|---|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 506 | 409 | 504* | 709* [+41%] | With AVX512, TGL is over 40% faster than ICL. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 193 | 149 | 145* | 216* [+49%] | With a 64-bit AVX512 integer workload, TGL is ~50% faster. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 4.47 | 2.54 | 3.67** | 4.34** [+18%] | A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**] |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 433 | 328 | 414* | 666* [+61%] | In this floating-point vectorised test, TGL is 61% faster! |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 251 | 194 | 232* | 381* [+64%] | Switching to FP64 SIMD AVX512 code, TGL is 64% faster. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 11.23 | 8.22 | 10.2* | 15.28* [+50%] | A heavy algorithm using FP64 to mantissa-extend FP128; TGL is still 50% faster than ICL. |

With heavily vectorised SIMD workloads, TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile (even with its 6x 256-bit SIMD cores) but also to outpace ICL by 40-60%. Intel seems to have managed to get the SIMD units to run much faster than ICL within a similar power envelope!
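Much of that scaling is simply register width: the same fused multiply-add kernel touches 8 FP32 values per instruction under AVX2/FMA but 16 under AVX512. A minimal sketch of the difference (our illustration, assuming n is a multiple of the vector width; not Sandra’s actual kernels):

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] += a * x[i] with AVX2/FMA: 8 FP32 lanes per iteration.
void saxpy_avx2(float a, const float* x, float* y, std::size_t n) {
    const __m256 va = _mm256_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 8)
        _mm256_storeu_ps(y + i,
            _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i), _mm256_loadu_ps(y + i)));
}

// Same kernel with AVX512: 16 FP32 lanes per iteration, same instruction count.
void saxpy_avx512(float a, const float* x, float* y, std::size_t n) {
    const __m512 va = _mm512_set1_ps(a);
    for (std::size_t i = 0; i < n; i += 16)
        _mm512_storeu_ps(y + i,
            _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i), _mm512_loadu_ps(y + i)));
}
```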

* using AVX512 instead of AVX2/FMA.

** note test has been rewritten in Sandra 20/20 R9: now vectorised and AVX512-IFMA enabled – see “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.
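The IFMA(52) instructions referenced in that note multiply 52-bit unsigned integers held in 64-bit lanes and accumulate either half of the 104-bit product – handy for emulating wide integers. A hedged sketch of the primitive (illustrative only; requires -mavx512ifma):

```cpp
#include <immintrin.h>

// One wide-integer multiply-accumulate step: lo_acc/hi_acc collect the low and
// high 52-bit halves of b*c across 8x 64-bit lanes (carry handling omitted).
void madd52_step(__m512i& lo_acc, __m512i& hi_acc, __m512i b, __m512i c) {
    lo_acc = _mm512_madd52lo_epu64(lo_acc, b, c); // lo_acc += low 52 bits of b*c
    hi_acc = _mm512_madd52hi_epu64(hi_acc, b, c); // hi_acc += high 52 bits of b*c
}
```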

| Native Benchmarks | Ryzen 4500U | i7-10510U (CML) | i7-1065G7 (ICL) | i7-1165G7 (TGL) | Comments |
|---|---|---|---|---|---|
| BenchCrypt Crypto AES-256 (GB/s) | 13.46 | 12.11 | 21.3* | 19.72* [-7%] | Memory bandwidth rules here, so TGL is similar to ICL in speed. |
| BenchCrypt Crypto AES-128 (GB/s) | 13.5 | 12.11 | 21.3* | 19.8* [-7%] | No change with AES128. |
| BenchCrypt Crypto SHA2-256 (GB/s) | 7.03** | 4.28 | 9*** | 13.87*** [+54%] | Despite SHA HWA, TGL soundly beats Ryzen using AVX512. |
| BenchCrypt Crypto SHA1 (GB/s) | 7.19 | | | 15.71*** | The less compute-intensive SHA1 does not help. |
| BenchCrypt Crypto SHA2-512 (GB/s) | | | | 7.09*** | SHA2-512 is not accelerated by SHA HWA. |

The memory sub-system is crucial here, and despite Ryzen Mobile having SHA HWA, TGL is much faster using AVX512 – and, as we have seen before, ~50% faster than ICL. AVX512 helps even against native hashing acceleration.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W
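For reference, the VAES note (*) means a single instruction now performs an AES round on four independent 128-bit blocks packed into one ZMM register, where classic AES-NI handled one block at a time; a minimal sketch (our illustration, not Sandra’s code):

```cpp
#include <immintrin.h>

// One AES encryption round over 4 blocks at once (requires AVX512F + VAES).
__m512i aes_round_x4(__m512i four_blocks, __m128i round_key) {
    const __m512i rk = _mm512_broadcast_i32x4(round_key); // replicate key to 4 lanes
    return _mm512_aesenc_epi128(four_blocks, rk);         // 4x AESENC in one go
}
```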

| Native Benchmarks | Ryzen 4500U | i7-10510U (CML) | i7-1065G7 (ICL) | i7-1165G7 (TGL) | Comments |
|---|---|---|---|---|---|
| BenchFinance Black-Scholes float/FP32 (MOPT/s) | 64.16 | | 109 | | |
| BenchFinance Black-Scholes double/FP64 (MOPT/s) | 91.48 | 87.17 | 91 | 132 [+45%] | Using FP64, TGL is 45% faster than ICL. |
| BenchFinance Binomial float/FP32 (kOPT/s) | 16.34 | | 23.55 | | Binomial uses thread-shared data, thus stressing the cache & memory system. |
| BenchFinance Binomial double/FP64 (kOPT/s) | 31.2 | 21 | 27 | 37.23 [+38%] | With FP64 code, TGL is 38% faster. |
| BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 12.48 | | 79.9 | | Monte-Carlo also uses thread-shared data but read-only, reducing modify pressure on the caches. |
| BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 45.59 | 16.5 | 33 | 45.98 [+39%] | Switching to FP64, TGL is ~40% faster. |

With non-SIMD financial workloads, TGL still improves by a decent 40-45% over ICL – enough to beat the 6-core Ryzen Mobile, no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.
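For the curious, the Black-Scholes test represents this kind of scalar FP64 math – a textbook sketch of the call-price formula (our illustration, not Sandra’s implementation):

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
double norm_cdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// European call price: s = spot, k = strike, r = risk-free rate,
// v = volatility, t = years to expiry.
double bs_call(double s, double k, double r, double v, double t) {
    const double d1 = (std::log(s / k) + (r + 0.5 * v * v) * t) / (v * std::sqrt(t));
    const double d2 = d1 - v * std::sqrt(t);
    return s * norm_cdf(d1) - k * std::exp(-r * t) * norm_cdf(d2);
}
```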
| Native Benchmarks | Ryzen 4500U | i7-10510U (CML) | i7-1065G7 (ICL) | i7-1165G7 (TGL) | Comments |
|---|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | | 158 | 185* | 294* [+59%] | In this tough vectorised algorithm, TGL is ~60% faster! |
| BenchScience DGEMM (GFLOPS) double/FP64 | 76.86 | 69.2 | 91.7* | 167* [+82%] | With FP64 vectorised code, TGL is over 80% faster! |
| BenchScience SFFT (GFLOPS) float/FP32 | | 13.9 | 31.7* | 31.14* [-2%] | FFT is also heavily vectorised but memory-bound, so TGL does not improve over ICL. |
| BenchScience DFFT (GFLOPS) double/FP64 | 7.15 | 7.35 | 17.7* | 16.41* [-3%] | With FP64 code, nothing much changes. |
| BenchScience SNBODY (GFLOPS) float/FP32 | | 169 | 200* | 286* [+43%] | N-Body simulation is vectorised but with more memory accesses. |
| BenchScience DNBODY (GFLOPS) double/FP64 | 98.7 | 64.2 | 61.8* | 81.61* [+32%] | With FP64 code, TGL is 32% faster. |

With highly vectorised SIMD code (scientific workloads), TGL again shows the power of AVX512, beating ICL by 30-80% and naturally Ryzen Mobile too. Algorithms that are completely memory latency/bandwidth bound cannot improve and require faster memory instead.

* using AVX512 instead of AVX2/FMA

| Native Benchmarks | Ryzen 4500U | i7-10510U (CML) | i7-1065G7 (ICL) | i7-1165G7 (TGL) | Comments |
|---|---|---|---|---|---|
| Neural Networks NeuralNet CNN Inference (Samples/s) | | 19.33 | 25.62* | | |
| Neural Networks NeuralNet CNN Training (Samples/s) | | 3.33 | 4.56* | | |
| Neural Networks NeuralNet RNN Inference (Samples/s) | | 23.88 | 24.93* | | |
| Neural Networks NeuralNet RNN Training (Samples/s) | | 1.57 | 2.97* | | |
* using AVX512 instead of AVX2/FMA (not using VNNI yet)
| Native Benchmarks | Ryzen 4500U | i7-10510U (CML) | i7-1065G7 (ICL) | i7-1165G7 (TGL) | Comments |
|---|---|---|---|---|---|
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 1060 | 891 | 1580* | 2276* [+44%] | In this vectorised integer workload, TGL is 44% faster. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 441 | 359 | 633* | 912* [+44%] | Same algorithm but more shared data; TGL is still 44% faster. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 231 | 186 | 326* | 480* [+47%] | Again the same algorithm, but even more shared data brings 47%. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 363 | 302 | 502* | 751* [+50%] | A different algorithm but still a vectorised workload; 50% faster. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 28.02 | 27.7 | 72.9* | 109* [+49%] | Still vectorised code; TGL is again ~50% faster. |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 12.23 | 15.7 | 24.7* | 34.74* [+40%] | A similar improvement here of about 40%. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 936 | 1580 | 2100* | 2998* [+43%] | With an integer workload, 43% faster. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 127 | 214 | 307* | 430* [+40%] | In this final test, again with an integer workload, 40% faster. |

Similar to what we saw before, TGL is 40-50% faster than ICL at a similar power envelope and far faster than Ryzen Mobile and its 6 cores. Again we see the huge improvement AVX512 brings, even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA
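For reference, the Blur (3×3) filter follows this per-pixel pattern – trivially data-parallel, which is why wider SIMD helps so much. A scalar sketch (our illustration; Sandra’s actual kernels are vectorised and not public):

```cpp
#include <cstddef>
#include <cstdint>

// 3x3 box blur over the interior of a w x h 8-bit greyscale image
// (borders left unchanged; each output pixel averages its 9 neighbours).
void blur3x3(const uint8_t* src, uint8_t* dst, std::size_t w, std::size_t h) {
    for (std::size_t y = 1; y + 1 < h; ++y)
        for (std::size_t x = 1; x + 1 < w; ++x) {
            unsigned sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = static_cast<uint8_t>(sum / 9);
        }
}
```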

Perhaps due to the relatively meagre ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures with more cores (6-core Ryzen Mobile or CometLake) – but TGL improves things considerably, by anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD making big improvements with Ryzen Mobile (Zen2) – updated 256-bit SIMD units and more cores (6+) – Intel had to improve: and improve it did. While AVX512, due to its high power consumption, was never a good fit for mobile and its meagre ULV power envelopes (15-25W, etc.), somehow “TigerLake” (TGL) manages to run its AVX512 units much faster – 40-50% faster than “IceLake” – thus beating the competition.

TGL’s performance, while still within an ULV power budget in a thin & light laptop (e.g. Dell XPS 13), is pretty compelling: it soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring to a modern, efficient design.

TGL also brings PCIe 4.0 – thus faster NVMe/Optane storage I/O – plus Thunderbolt 4 / USB 4.0 compatibility and thus faster external I/O as well. DDR5 & LPDDR5 also promise even higher bandwidth to feed the new cores, not to mention the updated GPGPU engine whose many more cores (up to 96 EU now!) require a lot more bandwidth.

TGL is a huge improvement over older architectures (even 8th gen), improving everything: greater compute power, greater graphics/GP compute power, faster memory, faster storage and faster external I/O! If you thought that ICL – despite its own big improvements – did not quite reach the “upgrade threshold”, TGL does everything and much more. The time of small, incremental improvements is finally over; ICL/TGL are just what was needed. Let’s hope Intel can keep it up!

In a word: Highly Recommended!

Please see our other articles on:

Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – CPU AVX512 Performance

What is “IceLake”?

It is the “real” 10th generation Core arch(itecture) (ICL/“IceLake”) from Intel – the brand-new core that replaces the ageing “Skylake” (SKL) arch and its many derivatives; due to delays, it actually debuts shortly after the latest Skylake update (“CometLake” (CML)) that is also called 10th generation. First launched for mobile ULV (U/Y) devices, it will also launch for mainstream (desktop/workstation) platforms soon.

Thus it contains extensive changes to all parts of the SoC – CPU, GPU and memory controller:

  • 10nm+ process (lower voltage, higher performance benefits)
  • Up to 4C/8T “Sunny Cove” cores on ULV (less than top-end CometLake 6C/12T)
  • Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskyLake)
  • AVX512 instruction set (like HEDT platform)
  • SHA HWA instruction set (like Ryzen)
  • 2-channel LP-DDR4X support up to 3733Mt/s
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
  • WiFi6 (802.11ax) AX201 integrated

Probably the biggest change is support for the AVX512 family of instruction sets, effectively doubling the SIMD processing width (vs. AVX2/FMA) as well as adding a whole host of specialised instructions that even the HEDT platform (SKL/KBL-X) does not support (a runtime detection sketch follows this list):

  • AVX512-VNNI (Vector Neural Network Instructions)
  • AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-IFMA (Integer FMA)
  • AVX512-VAES (Vector AES) accelerating crypto
  • AVX512-GFNI (Galois Field)
  • SHA HWA accelerating hashing
  • GNA (Gaussian & Neural Accelerator) – a separate low-power inference block, not an AVX512 extension
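Since these subsets vary from core to core, software must detect them at runtime before dispatching; a minimal sketch using CPUID leaf 7 (GCC/Clang intrinsics; bit positions per Intel’s documentation):

```cpp
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 1;

    std::printf("AVX512F       : %u\n", (ebx >> 16) & 1);
    std::printf("AVX512-IFMA   : %u\n", (ebx >> 21) & 1);
    std::printf("SHA HWA       : %u\n", (ebx >> 29) & 1);
    std::printf("AVX512-VBMI   : %u\n", (ecx >> 1) & 1);
    std::printf("GFNI          : %u\n", (ecx >> 8) & 1);
    std::printf("VAES          : %u\n", (ecx >> 9) & 1);
    std::printf("AVX512-VNNI   : %u\n", (ecx >> 11) & 1);
    std::printf("AVX512-BITALG : %u\n", (ecx >> 12) & 1);
    return 0;
}
```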

While some software may not have been updated to AVX512 while it was reserved for HEDT/servers, with this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.

VAES and SHA acceleration improve crypto/hashing performance – important today as even LAN transfers between workstations are likely to be encrypted/signed, not to mention just about all WAN transfers, encrypted disks/containers, etc. Some SoCs will also make their way into powerful (but low-power) firewall appliances where both AES and SHA acceleration will prove very useful.

From a security point-of-view, ICL mitigates all (existing/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Bounds Check Bypass, aka Spectre V1, which does not have a hardware solution); thus it should not require the slower software mitigations that affect performance (especially I/O).

The memory controller supports LP-DDR4X at higher speeds than CML, while the cache/TLB systems have been improved – this should help both CPU and GPU performance (see corresponding article) as well as reduce power vs. older designs using LP-DDR3.

Finally, the GPU core has been updated (Gen11) and generally contains many more cores than the old Gen9.5 core that was used from KBL (CPU gen 7) all the way to CML (CPU gen 10) (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 8, 7, 6) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

| CPU Specifications | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (CoffeeLake ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| Cores (CU) / Threads (SP) | 4C / 8T | 4C / 8T | 4C / 8T | 4C / 8T | No change in core count. |
| Speed (Min / Max / Turbo) | 1.6-2.0-3.6GHz | 0.4-1.8-4.0GHz (1.8GHz @ 15W, 2.0GHz @ 25W) | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) | ICL has lower clocks vs. CML. |
| Power (TDP) | 15-35W | 15-35W | 15-35W | 12-35W | Same power envelope. |
| L1D / L1I Caches | 4x 32kB 8-way / 4x 64kB 4-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | L1D is 50% larger. |
| L2 Caches | 4x 512kB 8-way | 4x 256kB 16-way | 4x 256kB 16-way | 4x 512kB 16-way | L2 has doubled. |
| L3 Caches | 4MB 16-way | 6MB 16-way | 8MB 16-way | 8MB 16-way | No L3 changes. |
| Microcode (Firmware) | MU8F1100-0B | MU068E09-AE | MU068E0C-BE | MU067E05-6A | Revisions just keep on coming. |
| Special Instruction Sets | AVX2/FMA, SHA | AVX2/FMA | AVX2/FMA | AVX512, VNNI, SHA, VAES, GFNI | 512-bit wide SIMD on mobile! |
| SIMD Width / Units | 128-bit | 256-bit | 256-bit | 512-bit | Widest SIMD units ever. |

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets (AVX512, AVX2, AVX, etc.). “IceLake” (ICL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 8550U (CoffeeLake ULV) | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Comments |
|---|---|---|---|---|---|
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 103 | 125 | 134 | 154 [+15%] | ICL is 15% faster than CML. |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 102 | 115 | 135 | 151 [+12%] | With a 64-bit integer workload, a 12% increase. |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 79 | 67 | 85 | 90 [+6%] | With floating-point, ICL is 6% faster than CML. |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 67 | 57 | 70 | 74 [+5%] | With FP64 we see a 5% improvement. |

With integer (legacy) workloads (not using SIMD), the new ICL core is over 10% faster than the higher-clocked CML core; with floating-point we see a 5% improvement. While modest, it shows the potential of the new core over the old-but-refined cores we have had since SKL.
| Native Benchmarks | Ryzen 2500U | i7-8550U (CFL) | i7-10510U (CML) | i7-1065G7 (ICL) | Comments |
|---|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 239 | 306 | 409 | 504* [+23%] | With AVX512, ICL wins this vectorised integer test. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 53.4 | 117 | 149 | 145* [-3%] | With a 64-bit AVX512 integer workload we have parity. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 2.41 | 2.21 | 2.54 | 3.67 [+44%] | A tough test using long integers to emulate Int128 without SIMD; ICL is 44% faster! |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 222 | 266 | 328 | 414* [+26%] | In this floating-point vectorised test, AVX512 is 26% faster. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 127 | 155.9 | 194 | 232* [+19%] | Switching to FP64 SIMD code, ICL is ~20% faster. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 6.23 | 6.51 | 8.22 | 10.2* [+24%] | A heavy algorithm using FP64 to mantissa-extend FP128; ICL is 24% faster. |

With heavily vectorised SIMD workloads, ICL is able to deploy AVX512, which leads to a 20-25% performance improvement even at lower clocks. However, AVX512 is quite power-hungry (as we have seen on HEDT), so we are power-constrained in an ULV part here – higher-TDP systems (28W, etc.) should perform much better.

* using AVX512 instead of AVX2/FMA.

| Native Benchmarks | Ryzen 2500U | i7-8550U (CFL) | i7-10510U (CML) | i7-1065G7 (ICL) | Comments |
|---|---|---|---|---|---|
| BenchCrypt Crypto AES-256 (GB/s) | 10.9 | 13.1 | 12.1 | 21.3* [+76%] | ICL with VAES is 76% faster than CML. |
| BenchCrypt Crypto AES-128 (GB/s) | 10.9 | 13.1 | 12.1 | 21.3* [+76%] | No change with AES128. |
| BenchCrypt Crypto SHA2-256 (GB/s) | 6.78** | 3.97 | 4.3 | 9*** [+2.1x] | Despite SHA HWA, Ryzen loses the top spot. |
| BenchCrypt Crypto SHA1 (GB/s) | 7.13** | 7.5 | 7.2 | 15.7*** [+2.2x] | The less compute-intensive SHA1 does not help. |
| BenchCrypt Crypto SHA2-512 (GB/s) | | 1.48 | 1.54 | 7.1*** | SHA2-512 is not accelerated by SHA HWA. |

The memory sub-system is crucial here, and despite VAES (AVX512 VL) and SHA HWA support (like Ryzen), ICL wins thanks to the very fast LP-DDR4X @ 3733Mt/s. VAES helps marginally (at this time) and SHA HWA cannot beat AVX512 multi-buffer hashing, but should be much more important in single-buffer, large-data workloads.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W

| Native Benchmarks | Ryzen 2500U | i7-8550U (CFL) | i7-10510U (CML) | i7-1065G7 (ICL) | Comments |
|---|---|---|---|---|---|
| BenchFinance Black-Scholes float/FP32 (MOPT/s) | 93.34 | 73.02 | | 109 | With non-vectorised code, ICL is still faster. |
| BenchFinance Black-Scholes double/FP64 (MOPT/s) | 77.86 | 75.24 | 87.2 | 91 [+4%] | Using FP64, ICL is 4% faster. |
| BenchFinance Binomial float/FP32 (kOPT/s) | 35.49 | 16.2 | | 23.5 | Binomial uses thread-shared data, thus stressing the cache & memory system. |
| BenchFinance Binomial double/FP64 (kOPT/s) | 19.46 | 19.31 | 21 | 27 [+29%] | With FP64 code, ICL is 29% faster. |
| BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 20.11 | 14.61 | | 79.9 | Monte-Carlo also uses thread-shared data but read-only, reducing modify pressure on the caches. |
| BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 15.32 | 14.54 | 16.5 | 66 [+2x] | Switching to FP64, ICL is over 2x faster. |

With non-SIMD financial workloads, ICL still improves by a significant amount over CML, thus it makes sense to choose it over the older core. Still, it is more likely that the GPGPU will be used for such workloads today.
| Native Benchmarks | Ryzen 2500U | i7-8550U (CFL) | i7-10510U (CML) | i7-1065G7 (ICL) | Comments |
|---|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | 107 | 141 | 158 | 185* [+17%] | In this tough vectorised algorithm, ICL is 17% faster. |
| BenchScience DGEMM (GFLOPS) double/FP64 | 47.2 | 55 | 69.2 | 91.7* [+32%] | With FP64 vectorised code, ICL is 32% faster. |
| BenchScience SFFT (GFLOPS) float/FP32 | 3.75 | 13.23 | 13.9 | 31.7* [+2.3x] | FFT is also heavily vectorised; here ICL is over 2x faster. |
| BenchScience DFFT (GFLOPS) double/FP64 | 4 | 6.53 | 7.35 | 17.7* [+2.4x] | With FP64 code, ICL is even faster. |
| BenchScience SNBODY (GFLOPS) float/FP32 | 112.6 | 160 | 169 | 200* [+18%] | N-Body simulation is vectorised but with more memory accesses. |
| BenchScience DNBODY (GFLOPS) double/FP64 | 45.3 | 57.9 | 64.2 | 61.8* [-4%] | With FP64 code, ICL is slightly behind CML. |

With highly vectorised SIMD code (scientific workloads), ICL again shows the power of AVX512 and can be over 2x (twice) as fast as CML, even at lower clocks. Some algorithms may need further optimisation, but even then we see a 17-30% improvement.

* using AVX512 instead of AVX2/FMA

| Native Benchmarks | Ryzen 2500U | i7-8550U (CFL) | i7-10510U (CML) | i7-1065G7 (ICL) | Comments |
|---|---|---|---|---|---|
| Neural Networks NeuralNet CNN Inference (Samples/s) | 14.32 | 17.27 | 19.33 | 25.62* [+33%] | Using AVX512, ICL inference is 33% faster. |
| Neural Networks NeuralNet CNN Training (Samples/s) | 1.46 | 2.06 | 3.33 | 4.56* [+37%] | Even training improves by 37%. |
| Neural Networks NeuralNet RNN Inference (Samples/s) | 16.93 | 22.69 | 23.88 | 24.93* [+4%] | Just 4% faster, but the improvement is there. |
| Neural Networks NeuralNet RNN Training (Samples/s) | 1.48 | 1.14 | 1.57 | 2.97* [+43%] | Training is much faster, by 43% over CML. |

As we have seen before, ICL benefits greatly from AVX512, beating the higher-clocked CML in most tests by 33-43% – and that is before using VNNI to accelerate these algorithms even more.

* using AVX512 instead of AVX2/FMA (not using VNNI yet)

| Native Benchmarks | Ryzen 2500U | i7-8550U (CFL) | i7-10510U (CML) | i7-1065G7 (ICL) | Comments |
|---|---|---|---|---|---|
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 532 | 720 | 891 | 1580* [+77%] | In this vectorised integer workload, ICL is 77% faster. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 146 | 290 | 359 | 633* [+76%] | Same algorithm but more shared data; still 76%. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 123 | 157 | 186 | 326* [+75%] | Again the same algorithm, but even more shared data brings 75%. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 185 | 251 | 302 | 502* [+66%] | A different algorithm but still a vectorised workload; 66% faster. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 26.49 | 25.38 | 27.7 | 72.9* [+2.6x] | Still vectorised code; ICL rules here, 2.6x faster! |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 9.38 | 14.29 | 15.7 | 24.7* [+57%] | A similar improvement here of about 57%. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 660 | 1525 | 1580 | 2100* [+33%] | With an integer workload, 33% faster. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 94.16 | 188.8 | 214 | 307* [+43%] | In this final test, again with an integer workload, 43% faster. |

ICL rules this benchmark, with AVX512 integer (B/W) workloads 33-43% faster and floating-point AVX512 workloads 66-77% faster than CML, even at lower clocks. Again we see the huge improvement AVX512 brings, even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA

Unlike CML, ICL with AVX512 support is a revolution in performance – which is exactly what we were hoping for; even at a much lower clock we see anywhere between 33% and over 2x (twice) faster within the same power limits (TDP/turbo). As we know from HEDT, AVX512 is power-hungry, thus higher-TDP-rated versions (e.g. 28W) should perform even better.

Even without AVX512, we see a good improvement of 5-15%, again at a much lower clock (3.9GHz vs. 4.9GHz), while CML and older revisions relied on higher clocks / more cores to outperform their older KBL/SKL-U predecessors.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD snapping at its heels with Ryzen Mobile, Intel has finally fixed its 10nm production and rolled out the “new Skylake” we deserve: IceLake with AVX512 brings feature parity with the much older HEDT platform and shows good promise for the future. This is the “Core” you have been looking for.

While power-hungry and TDP-constrained, AVX512 does bring sizeable performance gains on top of the core improvements and cache & memory sub-system improvements. The other instruction sets (VAES, SHA HWA) complete the package and might help in scenarios where code has not yet been updated to AVX512.

With ICL, a mere 15W thin & light (e.g. Dell XPS 13 9300) can outperform older desktop-class CPUs (e.g. SKL) at 4-6x (four/six times) the TDP, which makes us really keen to see what desktop-class versions will be capable of. And not before time, as the competition has been bringing out stronger and stronger designs (Ryzen2, future Ryzen3).

If you have been waiting to upgrade from the much older – but still good – SKL/KBL with just 2 cores and no hardware vulnerability mitigations, you finally have something to upgrade to. CML was not it: despite its 4 cores (and rumoured 6 cores), it just did not bring enough to the table to make upgrading worthwhile (save for hardware mitigations that don’t cripple performance).

Overall, with GPGPU and memory improvements, ICL-U is a very compelling proposition that, cost permitting, should be your top choice for long-term use.

In a word: Highly Recommended!

Please see our other articles on:

Intel Core Gen10 CometLake ULV (i7-10510U) Review & Benchmarks – CPU Performance

What is “CometLake”?

It is one of the 10th generation Core architectures (CML) from Intel – the latest revision of the venerable (6th gen!) “Skylake” (SKL) arch; it succeeds the current “WhiskyLake”/“CoffeeLake” 8/9th-gen architectures for mobile (ULV U/Y) devices. The “real” 10th generation Core arch is “IceLake” (ICL), which does bring many changes but has not made its mainstream debut yet.

As a result, there are no major updates vs. previous “Skylake” designs, save for an increase in core count on top-end versions and hardware vulnerability mitigations – which can still make a big difference:

  • Up to 6C/12T (from 4C/8T “WhiskyLake”/”CoffeeLake” or 2C/4T Skylake/KabyLake)
  • Increased Turbo ratios
  • 2-channel LP-DDR4 support and DDR4-2667 (up from 2400)
  • WiFi6 (802.11ax) AX201 integrated (from WiFi5 (802.11ac) 9560)
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)

The 3x (three times) increase in core count (6C/12T vs. the 2C/4T of “Skylake”/“KabyLake” ULV) in the same 15-28W power envelope is pretty significant, considering that Core ULV designs since the 1st gen had always had 2C/4T; unfortunately, it is limited to the top end, thus even the i7-10510U still has 4C/8T.

LP-DDR4 support is important, as many thin & light laptops (e.g. Dell XPS, Lenovo Carbon X1, etc.) have been “stuck” with slow LP-DDR3 memory instead of high-bandwidth DDR4 in order to save power. Note that the Y-variants (4.5-6W) will not support it.

WiFi is now integrated into the PCH and has been updated to WiFi6/AX (2x2 streams, up to 2400Mbps with a 160MHz-wide channel) from WiFi5/AC (1733Mbps); this also means no simple WiFi-card upgrades in the future, as with older laptops (except those with “whitelists” like HP, Lenovo, etc.).

Why review it now?

Until “IceLake” makes its public debut, “CometLake” is the latest ULV APU line from Intel that you can buy today; despite being just a revision of “Skylake”, the increased core counts/Turbo ratios may still make it a worthy competitor, not just on cost but also on performance.

As they contain hardware fixes/mitigations for the vulnerabilities discovered since the original “Skylake” launched (especially “Meltdown” but also various “Spectre” variants), the operating system & applications do not need to deploy the slower software mitigations that can affect performance (especially I/O) on older designs. For some algorithms, this may be worth the upgrade alone!

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 8, 7, 6) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

| CPU Specifications | AMD Ryzen 2500U (Raven Ridge) | Intel i7 7500U (KabyLake ULV) | Intel i7 8550U (CoffeeLake ULV) | Intel Core i7 10510U (CometLake ULV) | Comments |
|---|---|---|---|---|---|
| Cores (CU) / Threads (SP) | 4C / 8T | 2C / 4T | 4C / 8T | 4C / 8T | No change in core count vs. CFL. |
| Speed (Min / Max / Turbo) | 1.6-2.0-3.6GHz | 0.4-2.7-3.5GHz | 0.4-1.8-4.0GHz (1.8GHz @ 15W, 2.0GHz @ 25W) | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | CML has a +22% faster Turbo. |
| Power (TDP) | 15-35W | 15-25W | 15-35W | 15-35W | Same power envelope. |
| L1D / L1I Caches | 4x 32kB 8-way / 4x 64kB 4-way | 2x 32kB 8-way / 2x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | No L1 changes. |
| L2 Caches | 4x 512kB 8-way | 2x 256kB 16-way | 4x 256kB 16-way | 4x 256kB 16-way | No L2 changes. |
| L3 Caches | 4MB 16-way | 4MB 16-way | 8MB 16-way | 8MB 16-way | And no L3 changes. |
| Microcode (Firmware) | MU8F1100-0B | MU068E09-8E | MU068E09-AE | MU068E0C-BE | Revisions just keep on coming. |

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). “CometLake” (CML) supports all modern instruction sets including AVX2, FMA3 but not AVX512 (like “IceLake”) or SHA HWA (like Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

| Native Benchmarks | AMD Ryzen 2500U (Raven Ridge) | Intel i7 7500U (KabyLake ULV) | Intel i7 8550U (CoffeeLake ULV) | Intel Core i7 10510U (CometLake ULV) | Comments |
|---|---|---|---|---|---|
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 103 | 73.15 | 125 | 134 [+8%] | CML starts off 8% faster than CFL – a good start. |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 102 | 74.74 | 115 | 135 [+17%] | With a 64-bit integer workload, this increases to 17%. |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 79 | 45 | 67.29 | 84.95 [+26%] | With a floating-point workload, CML is 26% faster! |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 67 | 37 | 57 | 70.63 [+24%] | With FP64 we see a similar 24% improvement. |

With integer (legacy) workloads, CML-U brings a modest improvement of about 10% over CFL-U, cementing its top position. But with floating-point (also legacy) workloads we see a larger ~25% increase, which allows it to beat the competition (Ryzen Mobile) that was beating older designs (CFL-U, WHL-U, KBL-U, etc.).
| Native Benchmarks | Ryzen 2500U | i7-7500U (KBL) | i7-8550U (CFL) | i7-10510U (CML) | Comments |
|---|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 239 | 193 | 306 | 409 [+34%] | In this vectorised AVX2 integer test, CML-U is 34% faster than CFL-U. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 53.4 | 75 | 117 | 149 [+27%] | With a 64-bit AVX2 integer workload the lead drops to 27%. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 2.41 | 1.12 | 2.21 | 2.54 [+15%] | A tough test using long integers to emulate Int128 without SIMD; here CML-U is still 15% faster. |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 222 | 160 | 266 | 328 [+23%] | In this floating-point AVX/FMA vectorised test, CML-U is 23% faster. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 127 | 94.8 | 155.9 | 194.4 [+25%] | Switching to FP64 SIMD code, nothing much changes: still ~25% faster. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 6.23 | 4.04 | 6.51 | 8.22 [+26%] | In this heavy algorithm using FP64 to mantissa-extend FP128 with AVX2, we see a 26% improvement. |

With heavily vectorised SIMD workloads, CML-U is ~25% faster than the previous CFL-U, which may be sufficient against future competition from Gen3 Ryzen Mobile and its improved (256-bit) SIMD units – something CFL/WHL-U may not manage. IceLake (ICL) with AVX512 should improve over this despite lower clocks.
| Native Benchmarks | Ryzen 2500U | i7-7500U (KBL) | i7-8550U (CFL) | i7-10510U (CML) | Comments |
|---|---|---|---|---|---|
| BenchCrypt Crypto AES-256 (GB/s) | 10.9 | 7.28 | 13.11 | 12.11 [-8%] | With AES HWA support, all CPUs are memory-bandwidth bound. |
| BenchCrypt Crypto AES-128 (GB/s) | 10.9 | 9.07 | 13.11 | 12.11 [-8%] | No change with AES128. |
| BenchCrypt Crypto SHA2-256 (GB/s) | 6.78 | 2.55 | 3.97 | 4.28 [+8%] | Without SHA HWA of its own, CML-U cannot beat Ryzen Mobile. |
| BenchCrypt Crypto SHA1 (GB/s) | 7.13 | 4.07 | | 7.19 | The less compute-intensive SHA1 allows CML-U to catch up. |
| BenchCrypt Crypto SHA2-512 (GB/s) | | | 1.48 | 1.54 | SHA2-512 is not accelerated by SHA HWA, so CML-U does better. |

The memory sub-system is crucial here, and CML-U can improve over older designs when paired with faster memory (which we were not able to use here). Without the SHA HWA that Ryzen Mobile supports, it cannot beat it, and it improves only marginally over the older CFL-U.
| Native Benchmarks | Ryzen 2500U | i7-7500U (KBL) | i7-8550U (CFL) | i7-10510U (CML) | Comments |
|---|---|---|---|---|---|
| BenchFinance Black-Scholes float/FP32 (MOPT/s) | 93.34 | 49.34 | 73.02 | | With non-vectorised code, CML-U needs to catch up. |
| BenchFinance Black-Scholes double/FP64 (MOPT/s) | 77.86 | 43.33 | 75.24 | 87.17 [+16%] | Using FP64, CML-U is 16% faster, finally beating Ryzen Mobile. |
| BenchFinance Binomial float/FP32 (kOPT/s) | 35.49 | 12.3 | 16.2 | | Binomial uses thread-shared data, thus stressing the cache & memory system. |
| BenchFinance Binomial double/FP64 (kOPT/s) | 19.46 | 11.4 | 19.31 | 20.99 [+9%] | With FP64 code, CML-U is 9% faster than CFL-U. |
| BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 20.11 | 9.87 | 14.61 | | Monte-Carlo also uses thread-shared data but read-only, reducing modify pressure on the caches. |
| BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 15.32 | 7.88 | 14.54 | 16.54 [+14%] | Switching to FP64, nothing much changes: CML-U is 14% faster. |

With non-SIMD financial workloads, CML-U modestly improves (10-15%) over the older CFL-U, but this does allow it to beat the competition (Ryzen Mobile) which dominated older CFL-U designs. This may just be enough to match future Gen3 Ryzen Mobile and thus be competitive all-round.
| Native Benchmarks | Ryzen 2500U | i7-7500U (KBL) | i7-8550U (CFL) | i7-10510U (CML) | Comments |
|---|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | 107 | 76.14 | 141 | 158 [+12%] | In this tough vectorised AVX2/FMA algorithm, CML-U is 12% faster. |
| BenchScience DGEMM (GFLOPS) double/FP64 | 47.2 | 31.71 | 55 | 69.2 [+26%] | With FP64 vectorised code, CML-U is 26% faster than CFL-U. |
| BenchScience SFFT (GFLOPS) float/FP32 | 3.75 | 7.21 | 13.23 | 13.93 [+5%] | FFT is also heavily vectorised (x4 AVX2/FMA) but stresses the memory sub-system more. |
| BenchScience DFFT (GFLOPS) double/FP64 | 4 | 3.95 | 6.53 | 7.35 [+13%] | With FP64 code, CML-U is 13% faster. |
| BenchScience SNBODY (GFLOPS) float/FP32 | 112.6 | 105 | 160 | 169 [+6%] | N-Body simulation is vectorised but with more memory accesses. |
| BenchScience DNBODY (GFLOPS) double/FP64 | 45.3 | 30.64 | 57.9 | 64.16 [+11%] | With FP64 code, nothing much changes. |

With highly vectorised SIMD code (scientific workloads), CML-U is again 15-25% faster than CFL-U, which should be enough to match future Gen3 Ryzen Mobile with 256-bit SIMD units. Again, we need ICL with AVX512 – or more cores – to dominate these workloads.
| Native Benchmarks | Ryzen 2500U | i7-7500U (KBL) | i7-8550U (CFL) | i7-10510U (CML) | Comments |
|---|---|---|---|---|---|
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 532 | 474 | 720 | 891 [+24%] | In this vectorised integer AVX2 workload, CML-U is 24% faster. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 146 | 191 | 290 | 359 [+24%] | Same algorithm but more shared data; still 24%. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 123 | 98.3 | 157 | 186 [+18%] | Again the same algorithm, but even more shared data reduces the improvement to 18%. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 185 | 164 | 251 | 302 [+20%] | A different algorithm but still an AVX2-vectorised workload; 20% faster. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 26.49 | 14.38 | 25.38 | 27.73 [+9%] | Still AVX2-vectorised code, but here just 9% faster. |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 9.38 | 7.63 | 14.29 | 15.74 [+10%] | A similar improvement here of about 10%. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 660 | 764 | 1525 | 1580 [+4%] | With an integer AVX2 workload, only a 4% improvement. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 94.16 | 105.1 | 188.8 | 214 [+13%] | In this final test, again with an integer AVX2 workload, CML-U is 13% faster. |

Without support for any new instruction sets (AVX512, SHA HWA, etc.), CML-U was never going to be a revolution in performance; it has to rely on clocks and very minor improvements/fixes (especially for vulnerabilities). Versions with more cores (6C/12T) would certainly help, if they can stay within the power limits (TDP/turbo).

Intel themselves did not claim a big performance improvement – still, CML-U is 10-25% faster than CFL-U across workloads at the same TDP. At the same cost/power it is a welcome improvement, and it does allow Intel to beat the current competition (Ryzen Mobile) which was nipping at its heels; it may also be enough to match future Gen3 Ryzen Mobile designs.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For some it may be disappointing that we do not have the brand-new, much-improved “IceLake” (ICL-U) now, rather than a 3rd-revision “Skylake” – but “CometLake” (CML-U) does improve even over the previous revisions (8/9th gen “WhiskyLake”/“CoffeeLake” WHL/CFL-U) while, thanks to 2x the core count, completely outperforming the original 6/7th gen “Skylake”/“KabyLake” in the same power envelope. Perhaps it also shows how much Intel has had to improve at short notice due to Ryzen Mobile APUs (e.g. 2500U) that finally brought competition to the mobile space.

While owners of 8/9th gen hardware won’t be upgrading – it is very rare to recommend changing from one generation to the next anyway – owners of older hardware can look forward to over 2x the performance in most workloads for the same power draw, not to mention the additional features (integrated WiFi6, Thunderbolt, etc.).

On the other hand, the competition (AMD Ryzen Mobile) also offers good performance, and older 8/9th gen parts remain competitive too – thus it will all come down to price. With Gen3 Ryzen Mobile on the horizon (with 256-bit SIMD units), “CometLake” may just manage to match it on performance. It may also be worth waiting for “IceLake” to make its debut, to see what performance improvements it brings and at what cost – which may also push “CometLake” prices down.

All in all, Intel has managed to “squeeze” all it can from the old Skylake arch; while not revolutionary, it still has enough to be competitive with current designs – and with the future 50% increase in core count (6C/12T from 4C/8T) it might even beat them, not just on cost but also on performance.

In a word: Qualified Recommendation!

Please see our other articles on: