Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel, the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, etc.). It is an optimisation of the “IceLake (ICL)” arch on an updated 10nm++ process, again launched for mobile ULV (U/Y) devices and perhaps for other platforms too.

Note that RocketLake-S (RKL) will be the desktop equivalent of IceLake (ICL) cores but with TigerLake (TGL) graphics.

While not a “revolution” like ICL was, it still contains big changes across the SoC (CPU, GPU, memory controller):

  • 10nm++ process (lower voltage, higher performance benefits)
  • Up to 4C/8T “Willow Cove” on ULV  (CometLake up to 6C/12T)
  • Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • AVX512 and more of its friends
  • Increased L2 cache from 512kB to 1.25MB per core (2.5x)
  • Increased L3 cache from 8MB to 12MB (+50%)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
  • PCIe 4.0 (up to 32GB/s with x16 lanes)
  • Thunderbolt 4 (and thus USB 4.0 support) integrated
  • Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)

While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:

  • AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
  • AVX512-VP2INTERSECT (Vector Pair Intersect)

While some software may not have been updated to AVX512 as it was reserved for HEDT/Servers, with this mainstream launch it is very likely that most vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate low-precision neural-networks that are likely to be used on mobile platforms.
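To illustrate why VNNI suits low-precision neural networks, here is a minimal scalar sketch of what one 32-bit lane of the VPDPBUSD instruction computes: four unsigned 8-bit by signed 8-bit products summed into a signed 32-bit accumulator. The function names are hypothetical and the sketch models the arithmetic only, not the performance.

```python
def to_int32(value):
    """Wrap an integer to signed 32-bit (the non-saturating form wraps)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def vpdpbusd_lane(acc, a_u8, b_s8):
    """Model one 32-bit lane: acc += sum of four u8 * s8 products.

    a_u8: four unsigned bytes (0..255), e.g. quantised activations
    b_s8: four signed bytes (-128..127), e.g. quantised weights
    """
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)
    assert all(-128 <= b <= 127 for b in b_s8)
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))


def lanes_per_register(register_bits=512):
    """Number of 32-bit accumulator lanes in one SIMD register."""
    return register_bits // 32


if __name__ == "__main__":
    print(vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))
    # Byte multiply-accumulates handled by a single 512-bit instruction.
    print(lanes_per_register() * 4)
```

With 16 lanes per 512-bit register, one such instruction covers 64 byte-sized multiply-accumulates, work that would otherwise take a short sequence of AVX512 integer instructions.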

The caches are finally getting updated and increased considering that the competition has deployed massively big caches in its latest products. L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had previously doubled L2 from SKL (and current CML) derivatives which means it’s 5x larger than older designs.
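The cache growth figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the sizes are the per-core L2 and shared L3 figures listed in this article's specification table, and the dictionary and function names are made up for illustration.

```python
# Per-core L2 sizes in kB and shared L3 sizes in MB, as listed in this article.
L2_KB = {"SKL/CML": 256, "ICL": 512, "TGL": 1280}  # 1.25MB = 1280kB
L3_MB = {"ICL": 8, "TGL": 12}
CORES = 4  # ULV parts compared here are all 4C/8T on the Intel side


def growth(new, old):
    """Return how many times larger `new` is than `old`."""
    return new / old


if __name__ == "__main__":
    print("L2 TGL vs ICL:", growth(L2_KB["TGL"], L2_KB["ICL"]))
    print("L2 TGL vs SKL/CML:", growth(L2_KB["TGL"], L2_KB["SKL/CML"]))
    print("L3 TGL vs ICL:", growth(L3_MB["TGL"], L3_MB["ICL"]))
    print("Total L2 on TGL (MB):", CORES * L2_KB["TGL"] / 1024)
```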

From a security point-of-view, TGL mitigates all (current/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1 that does not have a hardware solution) thus should not require slower mitigations that affect performance (especially I/O). Like ICL it is also not affected by the JCC issue that is still being addressed through software (compiler) changes but old software will never be updated.

DDR5 / LPDDR5 will ensure even more memory bandwidth and higher data rates (up to 5400MT/s), without the need for multiple (SO)DIMMs to enable at least dual-channel; naturally populating all channels will allow even higher bandwidth. Higher data rates should also reduce memory latencies (assuming the timings don’t increase too much). Unfortunately there are no public DDR5 modules for us to test. LPDDR4X also gets a bump to a maximum of 4267MT/s.
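As a rough guide to what these data rates mean, theoretical peak bandwidth is the transfer rate multiplied by the bus width. The sketch below assumes a 128-bit total memory bus, which is an assumption for illustration and not a figure stated in this article; sustained bandwidth in practice is lower.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=128):
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB).

    bus_width_bits defaults to 128, an assumed total bus width.
    """
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * bytes_per_transfer / 1000.0


if __name__ == "__main__":
    for name, rate in (("LPDDR4X-4267", 4267), ("LPDDR5-5400", 5400)):
        print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s peak")
```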

PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs including Intel’s) and NVMe SSDs with ~8GB/s transfer (x4 lanes) on ULV but on desktop up to 32GB/s (x16). Note that the DMI/OPI link between CPU and I/O Hub is also thus updated to PCIe 4.0 speeds improving CPU/Hub transfer.
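The ~8GB/s (x4) and 32GB/s (x16) figures follow from the PCIe 4.0 signalling rate. A minimal sketch, assuming the specification's 16 GT/s per lane and 128b/130b line encoding (neither value is quoted in this article) and ignoring protocol overhead:

```python
def pcie_bandwidth_gbs(lanes, gt_per_s=16.0, encoding=128 / 130):
    """Approximate one-direction PCIe bandwidth in GB/s.

    gt_per_s: raw transfer rate per lane (16 GT/s for PCIe 4.0)
    encoding: usable fraction after 128b/130b line encoding
    """
    usable_gbit_per_s = lanes * gt_per_s * encoding
    return usable_gbit_per_s / 8  # bits to bytes


if __name__ == "__main__":
    for lanes in (4, 16):
        print(f"x{lanes}: {pcie_bandwidth_gbs(lanes):.2f} GB/s")
```

The results land slightly under the rounded marketing figures because of the encoding overhead.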

Thunderbolt 4.0 brings support for the upcoming USB 4.0 protocol and data rates as well (32Gbps) that will also bring new peripherals including external enclosures (eGPU) for discrete graphics.

Finally the GPU cores have been updated again to Xe (Gen 12) cores, up to 96 on some SKUs, which represent huge compute and graphics performance increases over the old (Gen 9.x) cores used by gen 10 APUs (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 10, 11) as well as competitors (AMD) with a view to upgrading to a mid-range but high performance design.

| CPU Specifications | AMD Ryzen 4500U | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments |
| --- | --- | --- | --- | --- | --- |
| Cores (CU) / Threads (SP) | 6C / 6T | 4C / 8T | 4C / 8T | 4C / 8T | No change in cores count. |
| Speed (Min / Max / Turbo) | 1.6-2.3-4.0GHz | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) | 0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W) | Both rates and Turbo clocks are way up |
| Power (TDP) | 15-35W | 15-35W | 15-35W | 12-35W | Similar power envelope possibly higher. |
| L1D / L1I Caches | 6x 32kB 8-way / 6x 64kB 4-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | No change L1D |
| L2 Caches | 6x 512kB 8-way | 4x 256kB 16-way | 4x 512kB 16-way | 4x 1.25MB | L2 has more than doubled (2.5x)! |
| L3 Caches | 2x 4MB 16-way | 8MB 16-way | 8MB 16-way | 12MB 16-way | L3 is 50% larger |
| Microcode (Firmware) | n/a | MC068E09-CC | MC067E05-6A | MC068C01-72 | Revisions just keep on coming. |
| Special Instruction Sets | AVX2/FMA, SHA | AVX2/FMA | AVX512, VNNI, SHA, VAES, IFMA | AVX512, VNNI, SHA, VAES, IFMA | More AVX512! |
| SIMD Width / Units | 256-bit | 256-bit | 512-bit | 512-bit | Widest SIMD units ever |

Disclaimer

This is an independent article that has not been endorsed or sponsored by any entity (e.g. Intel). All trademarks acknowledged and used for identification only under fair use.

The article contains only public information (available elsewhere on the Internet) that was neither provided under NDA nor embargoed. At publication time, the products have not been directly tested by SiSoftware and thus the accuracy of the benchmark scores cannot be verified; however, they appear consistent and do not appear to be false/fake.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.). “TigerLake” (TGL), like “IceLake” (ICL), supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.
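Picking the “highest performing instruction set” is typically a run-time dispatch on the CPU feature flags. The sketch below is not how the benchmark itself does it; it is a hypothetical illustration that assumes a Linux host exposing flags in /proc/cpuinfo (the tests here ran on Windows), and the set of flags required for each code path is likewise an assumption.

```python
# Preference order, widest first. Flag names follow the Linux
# /proc/cpuinfo convention; the required sets are illustrative.
CODE_PATHS = [
    ("AVX512", {"avx512f", "avx512bw", "avx512dq", "avx512vl"}),
    ("AVX2/FMA", {"avx2", "fma"}),
    ("AVX", {"avx"}),
    ("SSE2", {"sse2"}),
]


def parse_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def best_code_path(flags):
    """Pick the first code path whose required flags are all present."""
    for name, required in CODE_PATHS:
        if required <= flags:
            return name
    return "scalar"


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo text, or return '' where it is unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print(best_code_path(parse_flags(read_cpuinfo())))
```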

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
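The bracketed percentages in the results tables appear to be the TigerLake score relative to the IceLake score, rounded to a whole percent. A small helper to reproduce them might look like this (the function names are hypothetical; the sample scores are the Dhrystone Integer and AES-256 results from this article):

```python
def improvement_pct(new_score, old_score):
    """Relative change of new_score over old_score, in whole percent."""
    return round((new_score / old_score - 1.0) * 100)


def format_delta(new_score, old_score):
    """Format the change the way the tables show it, e.g. [+10%]."""
    return f"[{improvement_pct(new_score, old_score):+d}%]"


if __name__ == "__main__":
    print(format_delta(169, 154))     # Dhrystone Integer, TGL vs ICL
    print(format_delta(19.72, 21.3))  # AES-256, TGL vs ICL
```

A few published values may differ by a point from this calculation, presumably because the scores shown in the tables are themselves rounded.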

| Native Benchmarks | AMD Ryzen 4500U 6C/6T | Intel Core i7 10510U 4C/8T (CometLake ULV) | Intel Core i7 1065G7 4C/8T (IceLake ULV) | Intel Core i7 1165G7 4C/8T (TigerLake ULV) | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 208 | 134 | 154 | 169 [+10%] | TGL is 10% faster than ICL but not enough to beat AMD. |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 191 | 135 | 151 | 167 [+11%] | With a 64-bit integer workload – 11% increase |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 89 | 85 | 90 | 99.5 [+10%] | With floating-point, TGL is only 10% faster but enough to beat AMD. |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 75 | 70 | 74 | 83 [+12%] | With FP64 we see a 12% improvement. |

With legacy workloads (not using SIMD) TGL is not much faster than ICL even with its highly clocked cores; still, a 10-12% improvement is welcome as it allows it to beat the 6-core Ryzen Mobile competition in the floating-point (Whetstone) tests, though not in the integer (Dhrystone) ones.

| Native Benchmarks | AMD Ryzen 4500U 6C/6T | Intel Core i7 10510U 4C/8T (CometLake ULV) | Intel Core i7 1065G7 4C/8T (IceLake ULV) | Intel Core i7 1165G7 4C/8T (TigerLake ULV) | Comments |
| --- | --- | --- | --- | --- | --- |
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 506 | 409 | 504* | 709* [+41%] | With AVX512 TGL is over 40% faster than ICL. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 193 | 149 | 145* | 216* [+49%] | With a 64-bit AVX512 integer workload TGL is 50% faster. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 4.47 | 2.54 | 3.67** | 4.34** [+18%] | A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**] |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 433 | 328 | 414* | 666* [+61%] | In this floating-point vectorised test TGL is 61% faster! |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 251 | 194 | 232* | 381* [+64%] | Switching to FP64 SIMD AVX512 code, TGL is 64% faster. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 11.23 | 8.22 | 10.2* | 15.28* [+50%] | A heavy algorithm using FP64 to mantissa extend FP128; TGL is still 50% faster than ICL. |

With heavily vectorised SIMD workloads TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile even with its 6x 256-bit SIMD cores (the Int128 emulation test excepted), but it is also 40-60% faster than ICL. Intel seems to have managed to get the SIMD units to run much faster than ICL even within a similar power envelope!

* using AVX512 instead of AVX2/FMA.

** this test has been rewritten in Sandra 20/20 R9: it is now vectorised and AVX512-IFMA enabled; see the “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.
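For context on why IFMA helps with wide-integer emulation such as Int128: the AVX512-IFMA instructions (VPMADD52LUQ / VPMADD52HUQ) multiply 52-bit integers and add the low or high 52 bits of the 104-bit product to a 64-bit accumulator. The scalar model below is only a sketch of that arithmetic with hypothetical function names; it does not reflect how the benchmark itself is written.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """Add the low 52 bits of the 52x52-bit product to a 64-bit accumulator."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64


def madd52hi(acc, a, b):
    """Add the high 52 bits (bits 103..52) of the product to the accumulator."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


def mul52_full(a, b):
    """Rebuild the full 104-bit product from the two 52-bit halves."""
    return (madd52hi(0, a, b) << 52) | madd52lo(0, a, b)


if __name__ == "__main__":
    a, b = MASK52, MASK52 - 12345
    print(mul52_full(a, b) == a * b)
```

Wide integers split into 52-bit limbs can then be multiplied limb by limb, with the spare 12 bits of each 64-bit lane absorbing carries before they need to be propagated.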

BenchCrypt Crypto AES-256 (GB/s) 13.46 12.11 21.3*  19.72* [-7%] Memory bandwidth rules here so TGL is similar to ICL in speed.
BenchCrypt Crypto AES-128 (GB/s) 13.5 12.11 21.3* 19.8* [-7%] No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 7.03** 4.28 9*** 13.87*** [+54%] Despite SHA HWA, TGL soundly beats Ryzen using AVX512.
BenchCrypt Crypto SHA1 (GB/s) 7.19 15.71***   Less compute intensive SHA1 does not help.
BenchCrypt Crypto SHA2-512 (GB/s) 7.09*** SHA2-512 is not accelerated by SHA HWA.

The memory sub-system is crucial here. Despite Ryzen Mobile having SHA HWA, TGL is much faster using AVX512 and, as we’ve seen before, about 50% faster than ICL in hashing. AVX512 helps even against native hashing acceleration.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W

BenchFinance Black-Scholes float/FP32 (MOPT/s) 64.16 109
BenchFinance Black-Scholes double/FP64 (MOPT/s) 91.48 87.17 91 132 [+45%] Using FP64 TGL is 45% faster than ICL.
BenchFinance Binomial float/FP32 (kOPT/s) 16.34 23.55 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 31.2 21 27  37.23 [+38%]
With FP64 code TGL is 38% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 12.48 79.9 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 45.59 16.5 33 45.98 [+39%] Switching to FP64 TGL is 40% faster.

With non-SIMD financial workloads, TGL still improves by a decent 38-45% over ICL and it is enough to beat 6-core Ryzen Mobile, no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.

BenchScience SGEMM (GFLOPS) float/FP32 158 185*  294* [+59%]
In this tough vectorised algorithm, TGL is 60% faster!
BenchScience DGEMM (GFLOPS) double/FP64 76.86 69.2 91.7*  167* [+82%]
With FP64 vectorised code, TGL is over 80% faster!
BenchScience SFFT (GFLOPS) float/FP32 13.9 31.7*  31.14* [-2%] FFT is also heavily vectorised but memory dependent so TGL does not improve over ICL.
BenchScience DFFT (GFLOPS) double/FP64 7.15 7.35 17.7*  16.41* [-3%] With FP64 code, nothing much changes.
BenchScience SNBODY (GFLOPS) float/FP32 169 200*  286* [+43%]
N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 98.7 64.2 61.8* 81.61* [+32%]
With FP64 code TGL is 32% faster.

With highly vectorised SIMD code (scientific workloads), TGL again shows us the power of AVX512 and beats ICL by 30-80% in the GEMM and N-Body tests, and naturally Ryzen Mobile too. Algorithms that are completely memory latency/bandwidth dependent (FFT here) cannot improve and require faster memory instead.

* using AVX512 instead of AVX2/FMA3

Neural Networks NeuralNet CNN Inference (Samples/s) 19.33 25.62*  
Neural Networks NeuralNet CNN Training (Samples/s) 3.33 4.56*
Neural Networks NeuralNet RNN Inference (Samples/s) 23.88 24.93*
Neural Networks NeuralNet RNN Training (Samples/s) 1.57 2.97*
* using AVX512 instead of AVX2/FMA (not using VNNI yet)

| Native Benchmarks | AMD Ryzen 4500U 6C/6T | Intel Core i7 10510U 4C/8T (CometLake ULV) | Intel Core i7 1065G7 4C/8T (IceLake ULV) | Intel Core i7 1165G7 4C/8T (TigerLake ULV) | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 1060 | 891 | 1580* | 2276* [+44%] | In this vectorised integer workload TGL is 44% faster. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 441 | 359 | 633* | 912* [+44%] | Same algorithm but more shared data; TGL is still 44% faster. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 231 | 186 | 326* | 480* [+47%] | Again same algorithm but even more data shared brings 47%. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 363 | 302 | 502* | 751* [+50%] | Different algorithm but still a vectorised workload; still 50% faster. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 28.02 | 27.7 | 72.9* | 109* [+49%] | Still vectorised code; TGL is again 50% faster. |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 12.23 | 15.7 | 24.7* | 34.74* [+40%] | Similar improvement here of about 40%. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 936 | 1580 | 2100* | 2998* [+43%] | With integer workload, 43% faster. |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 127 | 214 | 307* | 430* [+40%] | In this final test, again with integer workload, 40% faster. |

Similar to what we saw before, TGL is 40-50% faster than ICL at a similar power envelope and far faster than Ryzen Mobile and its 6 cores. Again we see the huge improvement AVX512 brings even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA

Perhaps due to the relatively meager ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures but with more cores (Ryzen Mobile or Comet Lake with 6-cores) – but TGL improves things considerably – anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.
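One way to condense the “40-50% across algorithms” claim into a single figure is a geometric mean of the per-test speed-ups. The sketch below uses the IceLake and TigerLake image processing scores quoted in this article; the choice of a geometric mean is ours, not necessarily how the official ranker aggregates results, and it needs Python 3.8 or later for `statistics.geometric_mean`.

```python
from statistics import geometric_mean

# (ICL score, TGL score) in MPix/s, copied from the image processing results.
IMAGE_SCORES = {
    "Blur 3x3": (1580, 2276),
    "Sharpen 5x5": (633, 912),
    "Motion-Blur 7x7": (326, 480),
    "Edge Detection": (502, 751),
    "Noise Removal": (72.9, 109),
    "Oil Painting": (24.7, 34.74),
    "Diffusion": (2100, 2998),
    "Marbling": (307, 430),
}


def speedups(scores):
    """Map each test name to its TGL/ICL ratio."""
    return {name: tgl / icl for name, (icl, tgl) in scores.items()}


def overall_speedup(scores):
    """Geometric mean of the per-test ratios."""
    return geometric_mean(list(speedups(scores).values()))


if __name__ == "__main__":
    print(f"Overall TGL vs ICL: {overall_speedup(IMAGE_SCORES):.2f}x")
```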

Final Thoughts / Conclusions

With AMD making big improvements with Ryzen Mobile (ZEN2), its updated 256-bit SIMD units and more cores (6+), Intel had to improve, and improve it did. Due to its high power consumption, AVX512 was never a good fit for mobile devices and their meager ULV power envelopes (15-25W, etc.), yet somehow “Tiger Lake” (TGL) manages to run its AVX512 units much faster, 40-50% faster than “Ice Lake”, thus beating the competition.

TGL’s performance, still within a ULV power budget in a thin & light laptop (e.g. Dell XPS 13), is pretty compelling and soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring on a modern efficient design.

TGL also brings PCIe 4.0 thus faster NVMe/Optane storage I/O, Thunderbolt 4 / USB 4.0 compatibility and thus faster external I/O as well. DDR5 & LPDDR5 also promise even higher bandwidth in order to feed the new cores not to mention the updated GPGPU engine with its many more cores (up to 96 EU now!) that require a lot more bandwidth.

TGL is a huge improvement over older architectures (even 8th gen) that improves everything: greater compute power, greater graphics/GP compute power, faster memory, faster storage and faster external I/O! If you thought that ICL, despite its own big improvements, did not quite reach the “upgrade threshold”, TGL does everything and much more. The times of small, incremental improvements are finally over and ICL/TGL are just what was needed. Let’s hope Intel can keep it up!

In a word: Highly Recommended – 9/10
