What is “TigerLake”?
It is the 3rd update of the “next generation” Core architecture (Gen 11, TGL/TigerLake) from Intel, the one that replaced the ageing “Skylake” (SKL) arch and its many derivatives that are still with us (“CometLake” (CML), etc.). It is an optimisation of the “IceLake” (ICL) arch, built on the updated 10nm++ process, and again launched for mobile ULV (U/Y) devices first and perhaps for other platforms too.
Note that RocketLake-S (RKL) will be the desktop equivalent of IceLake (ICL) cores but with TigerLake (TGL) graphics.
While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:
- 10nm++ process (lower voltage, higher performance benefits)
- Up to 4C/8T “Willow Cove” on ULV (CometLake up to 6C/12T)
- Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
- AVX512 and more of its friends
- Increased L2 cache from 512kB to 1.25MB per core (2.5x)
- Increased L3 cache from 8MB to 12MB (+50%)
- DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
- PCIe 4.0 (up to 32GB/s with x16 lanes)
- Thunderbolt 4 (and thus USB 4.0 support) integrated
- Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)
While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives, effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:
- AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
- AVX512-VP2INTERSECT (Vector Pair Intersection)
While some software may not have been updated to AVX512 while it was reserved for HEDT/servers, with this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
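To see why VNNI helps low-precision inference, consider the AVX512-VNNI instruction VPDPBUSD, which fuses an 8-bit multiply with a 32-bit accumulate. Below is a scalar Python sketch (ours, for illustration only) of a single 32-bit lane; the real instruction processes 16 such lanes per 512-bit register in one step:

```python
def vpdpbusd_lane(acc: int, a: list[int], b: list[int]) -> int:
    """Model one 32-bit lane of AVX512-VNNI's VPDPBUSD:
    multiply 4 unsigned bytes by 4 signed bytes, sum the
    products and add the result to the 32-bit accumulator."""
    assert len(a) == len(b) == 4
    assert all(0 <= u <= 255 for u in a)      # unsigned int8 inputs
    assert all(-128 <= s <= 127 for s in b)   # signed int8 weights
    return acc + sum(u * s for u, s in zip(a, b))

# One VNNI step replaces the multiply + widen + horizontal-add + add
# sequence that previously took several AVX512 instructions.
print(vpdpbusd_lane(100, [1, 2, 3, 4], [10, 10, 10, 10]))  # 200
```

This fused dot-product step is exactly the inner loop of int8 neural-network inference, which is why VNNI matters on power-limited mobile parts.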
The caches are finally getting updated and increased, considering that the competition has deployed massively big caches in its latest products. L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had previously doubled L2 from SKL (and current CML) derivatives, which means TGL’s L2 is 5x larger than those older designs.
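The ratios quoted above are simple arithmetic on the per-core cache sizes:

```python
# Per-core L2 cache sizes in kB: SKL/CML-derived, ICL, TGL
skl_l2, icl_l2, tgl_l2 = 256, 512, 1280   # 1.25 MB = 1280 kB

print(tgl_l2 / icl_l2)   # 2.5 -> L2 "more than doubles" vs ICL
print(tgl_l2 / skl_l2)   # 5.0 -> 5x larger than SKL-era designs
print(12 / 8)            # 1.5 -> L3: 12MB vs 8MB, i.e. 50% larger
```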
From a security point-of-view, TGL mitigates all (currently reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1, which does not have a hardware solution) and thus should not require the slower software mitigations that affect performance (especially I/O). Like ICL, it is also not affected by the JCC erratum that is still being addressed through software (compiler) changes, though old software will never be updated.
DDR5 / LPDDR5 will ensure even more memory bandwidth and faster data rates (up to 5400MT/s), without needing multiple (SO)DIMMs to enable at least dual-channel operation; naturally, populating all channels will allow even higher bandwidth. Higher data rates will also reduce effective memory latencies (assuming timings don’t increase too much). Unfortunately there are no public DDR5 modules for us to test. LPDDR4X also gets a bump, to a maximum of 4267MT/s.
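As a back-of-the-envelope check on what those data rates mean, peak theoretical bandwidth is just transfers/s × bytes per transfer × channels. The sketch below assumes two 64-bit channels (for LPDDR4X/LPDDR5 the channels are narrower but the aggregate bus width is the same 128 bits):

```python
def peak_bandwidth_gbs(mt_s: float, bus_bytes: int = 8, channels: int = 2) -> float:
    """Peak theoretical memory bandwidth in GB/s:
    data rate (MT/s) x bytes per transfer x number of channels."""
    return mt_s * 1e6 * bus_bytes * channels / 1e9

print(peak_bandwidth_gbs(5400))   # 86.4  GB/s for dual-channel (L)DDR5-5400
print(peak_bandwidth_gbs(4267))   # ~68.3 GB/s for dual-channel LPDDR4X-4267
```

Real-world achievable bandwidth is of course lower (refresh, turnaround, controller overheads), but the ratio between the two generations holds.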
PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs, including Intel’s own) and NVMe SSDs, with ~8GB/s transfer (x4 lanes) on ULV and up to 32GB/s (x16) on desktop. Note that the DMI/OPI link between CPU and I/O Hub is thus also updated to PCIe 4.0 speeds, improving CPU/Hub transfers.
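The quoted figures follow directly from the PCIe 4.0 signalling rate of 16 GT/s per lane with 128b/130b line encoding; a quick sketch of the per-direction arithmetic:

```python
def pcie4_bandwidth_gbs(lanes: int) -> float:
    """Usable PCIe 4.0 bandwidth per direction in GB/s:
    16 GT/s per lane, 128b/130b encoding, 8 bits per byte."""
    gt_per_s = 16e9        # raw transfers per second per lane
    encoding = 128 / 130   # payload bits per transferred bit
    return gt_per_s * encoding / 8 * lanes / 1e9

print(round(pcie4_bandwidth_gbs(4), 2))    # 7.88  -> the "~8 GB/s" x4 NVMe figure
print(round(pcie4_bandwidth_gbs(16), 2))   # 31.51 -> the "~32 GB/s" x16 figure
```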
Thunderbolt 4 brings support for the upcoming USB 4.0 protocol and its data rates as well (up to 40Gbps), which will also enable new peripherals including external eGPUs for discrete graphics.
Finally, the GPU cores have been updated again to Xe (Gen12) cores, up to 96 EU on some SKUs, which represents a huge compute and graphics performance increase over the old (Gen 9.x) cores used by Gen 10 APUs (see corresponding article).
CPU (Core) Performance Benchmarking
In this article we test CPU core performance; please see our other articles on:
- CPU
- Intel Core Gen 11 RocketLake (i7-11700K) Review & Benchmarks – AVX512 Performance
- AVX512-IFMA(52) Improvement for IceLake and TigerLake
- Benchmarks of JCC Erratum Mitigation – Intel CPUs
- Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – CPU AVX512 Performance
- Intel Core Gen10 CometLake ULV (i7-10510U) Review & Benchmarks – CPU Performance
- GPGPU
Hardware Specifications
We are comparing the top-of-the-range Intel ULV with previous and current architectures (Gen 10, 11) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.
CPU Specifications | AMD Ryzen 4500U | Intel Core i7 10510U (CometLake ULV) | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments
Cores (CU) / Threads (SP) | 6C / 6T | 4C / 8T | 4C / 8T | 4C / 8T | No change in core count.
Speed (Min / Max / Turbo) | 1.6-2.3-4.0GHz | 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) | 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) | 0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W) | Both base and Turbo clocks are way up.
Power (TDP) | 15-35W | 15-35W | 15-35W | 12-35W | Similar power envelope, possibly higher.
L1D / L1I Caches | 6x 32kB 8-way / 6x 64kB 4-way | 4x 32kB 8-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | 4x 48kB 12-way / 4x 32kB 8-way | No change in L1 caches from ICL.
L2 Caches | 6x 512kB 8-way | 4x 256kB 16-way | 4x 512kB 16-way | 4x 1.25MB | L2 has more than doubled (2.5x)!
L3 Caches | 2x 4MB 16-way | 8MB 16-way | 8MB 16-way | 12MB 16-way | L3 is 50% larger.
Microcode (Firmware) | n/a | MC068E09-CC | MC067E05-6A | MC068C01-72 | Revisions just keep on coming.
Special Instruction Sets | AVX2/FMA, SHA | AVX2/FMA | AVX512, VNNI, SHA, VAES, IFMA | AVX512, VNNI, SHA, VAES, IFMA | More AVX512!
SIMD Width / Units | 256-bit | 256-bit | 512-bit | 512-bit | Widest SIMD units ever.
Disclaimer
This is an independent article that has not been endorsed or sponsored by any entity (e.g. Intel). All trademarks acknowledged and used for identification only under fair use.
The article contains only public information (available elsewhere on the Internet) and not provided under NDA nor embargoed. At publication time, the products have not been directly tested by SiSoftware and thus the accuracy of the benchmark scores cannot be verified; however, they appear consistent and do not appear to be false/fake.
Native Performance
We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.). “TigerLake” (TGL), like ICL, supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.
Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
Native Benchmarks | AMD Ryzen 4500U 6C/6T | Intel Core i7 10510U 4C/8T (CometLake ULV) | Intel Core i7 1065G7 4C/8T (IceLake ULV) | Intel Core i7 1165G7 4C/8T (TigerLake ULV) | Comments
Native Dhrystone Integer (GIPS) | 208 | 134 | 154 | 169 [+10%] | TGL is 10% faster than ICL but not enough to beat AMD.
Native Dhrystone Long (GIPS) | 191 | 135 | 151 | 167 [+11%] | With a 64-bit integer workload – an 11% increase.
Native FP32 (Float) Whetstone (GFLOPS) | 89 | 85 | 90 | 99.5 [+10%] | With floating-point, TGL is only 10% faster but enough to beat AMD.
Native FP64 (Double) Whetstone (GFLOPS) | 75 | 70 | 74 | 83 [+12%] | With FP64 we see a 12% improvement.
With integer (legacy) workloads (not using SIMD) TGL is not much faster than ICL even with its highly clocked cores; still, a 10-12% improvement is welcome as it allows it to beat the 6-core Ryzen Mobile competition.
Native Integer (Int32) Multi-Media (Mpix/s) | 506 | 409 | 504* | 709* [+41%] | With AVX512 TGL is over 40% faster than ICL.
Native Long (Int64) Multi-Media (Mpix/s) | 193 | 149 | 145* | 216* [+49%] | With a 64-bit AVX512 integer workload TGL is 50% faster.
Native Quad-Int (Int128) Multi-Media (Mpix/s) | 4.47 | 2.54 | 3.67** | 4.34** [+18%] | A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**]
Native Float/FP32 Multi-Media (Mpix/s) | 433 | 328 | 414* | 666* [+61%] | In this floating-point vectorised test TGL is 61% faster!
Native Double/FP64 Multi-Media (Mpix/s) | 251 | 194 | 232* | 381* [+64%] | Switching to FP64 SIMD AVX512 code, TGL is 64% faster.
Native Quad-Float/FP128 Multi-Media (Mpix/s) | 11.23 | 8.22 | 10.2* | 15.28* [+50%] | A heavy algorithm using FP64 to mantissa-extend to FP128; TGL is still 50% faster than ICL.
With heavily vectorised SIMD workloads TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile even with its 6x 256-bit SIMD cores, but it is also 40-60% faster than ICL. Intel seems to have managed to get the SIMD units to run much faster than ICL even within a similar power envelope!
Note: * using AVX512 instead of AVX2/FMA. ** the test has been rewritten in Sandra 20/20 R9: now vectorised and AVX512-IFMA enabled – see the “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.
Crypto AES-256 (GB/s) | 13.46 | 12.11 | 21.3* | 19.72* [-7%] | Memory bandwidth rules here so TGL is similar to ICL in speed.
Crypto AES-128 (GB/s) | 13.5 | 12.11 | 21.3* | 19.8* [-7%] | No change with AES128.
Crypto SHA2-256 (GB/s) | 7.03** | 4.28 | 9*** | 13.87*** [+54%] | Despite SHA HWA, TGL soundly beats Ryzen using AVX512.
Crypto SHA1 (GB/s) | 7.19 | 15.71*** | Less compute-intensive SHA1 does not help.
Crypto SHA2-512 (GB/s) | 7.09*** | SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and despite Ryzen Mobile having SHA HWA – TGL is much faster using AVX512 and, as we’ve seen before, 50% faster than ICL! AVX512 helps even against native hashing acceleration.
Note: * using VAES (AVX512 VL) instead of AES HWA. ** using SHA HWA instead of multi-buffer AVX2. *** using AVX512 B/W.
Black-Scholes float/FP32 (MOPT/s) | 64.16 | 109 | –
Black-Scholes double/FP64 (MOPT/s) | 91.48 | 87.17 | 91 | 132 [+45%] | Using FP64 TGL is 45% faster than ICL.
Binomial float/FP32 (kOPT/s) | 16.34 | 23.55 | Binomial uses thread-shared data thus stresses the cache & memory system.
Binomial double/FP64 (kOPT/s) | 31.2 | 21 | 27 | 37.23 [+38%] | With FP64 code TGL is 38% faster.
Monte-Carlo float/FP32 (kOPT/s) | 12.48 | 79.9 | Monte-Carlo also uses thread-shared data but read-only, thus reducing modify pressure on the caches.
Monte-Carlo double/FP64 (kOPT/s) | 45.59 | 16.5 | 33 | 45.98 [+39%] | Switching to FP64 TGL is 40% faster.
With non-SIMD financial workloads, TGL still improves by a decent 40-45% over ICL and it is enough to beat the 6-core Ryzen Mobile – no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.
SGEMM (GFLOPS) float/FP32 | 158 | 185* | 294* [+59%] | In this tough vectorised algorithm, TGL is 60% faster!
DGEMM (GFLOPS) double/FP64 | 76.86 | 69.2 | 91.7* | 167* [+82%] | With FP64 vectorised code, TGL is over 80% faster!
SFFT (GFLOPS) float/FP32 | 13.9 | 31.7* | 31.14* [-2%] | FFT is also heavily vectorised but memory dependent, so TGL does not improve over ICL.
DFFT (GFLOPS) double/FP64 | 7.15 | 7.35 | 17.7* | 16.41* [-3%] | With FP64 code, nothing much changes.
SNBODY (GFLOPS) float/FP32 | 169 | 200* | 286* [+43%] | N-Body simulation is vectorised but with more memory accesses.
DNBODY (GFLOPS) double/FP64 | 98.7 | 64.2 | 61.8* | 81.61* [+32%] | With FP64 code TGL is 32% faster.
With highly vectorised SIMD code (scientific workloads), TGL again shows us the power of AVX512 – and beats ICL by 30-80% and naturally Ryzen Mobile too. Some algorithms that are completely memory latency/bandwidth dependent cannot improve but require faster memory instead.
Note: * using AVX512 instead of AVX2/FMA3.
NeuralNet CNN Inference (Samples/s) | 19.33 | 25.62*
NeuralNet CNN Training (Samples/s) | 3.33 | 4.56*
NeuralNet RNN Inference (Samples/s) | 23.88 | 24.93*
NeuralNet RNN Training (Samples/s) | 1.57 | 2.97*
Note: * using AVX512 instead of AVX2/FMA (not using VNNI yet).
Blur (3×3) Filter (MPix/s) | 1060 | 891 | 1580* | 2276* [+44%] | In this vectorised integer workload TGL is 44% faster.
Sharpen (5×5) Filter (MPix/s) | 441 | 359 | 633* | 912* [+44%] | Same algorithm but with more shared data; TGL is still 44% faster.
Motion-Blur (7×7) Filter (MPix/s) | 231 | 186 | 326* | 480* [+47%] | Again the same algorithm but with even more shared data brings a 47% gain.
Edge Detection (2*5×5) Sobel Filter (MPix/s) | 363 | 302 | 502* | 751* [+50%] | A different but still vectorised workload; still 50% faster.
Noise Removal (5×5) Median Filter (MPix/s) | 28.02 | 27.7 | 72.9* | 109* [+49%] | Still vectorised code; TGL is again 50% faster.
Oil Painting Quantise Filter (MPix/s) | 12.23 | 15.7 | 24.7* | 34.74* [+40%] | A similar improvement here of about 40%.
Diffusion Randomise (XorShift) Filter (MPix/s) | 936 | 1580 | 2100* | 2998* [+43%] | With this integer workload, 43% faster.
Marbling Perlin Noise 2D Filter (MPix/s) | 127 | 214 | 307* | 430* [+40%] | In this final test, again with an integer workload, 40% faster.
Similar to what we saw before, TGL is 40-50% faster than ICL at a similar power envelope and far faster than Ryzen Mobile and its 6 cores. Again we see the huge improvement AVX512 brings, even at low-power ULV envelopes.
Note: * using AVX512 instead of AVX2/FMA.
Perhaps due to the relatively meagre ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures with more cores (Ryzen Mobile or CometLake with 6 cores) – but TGL improves things considerably, anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.
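The Quad-Float tests above emulate FP128 by mantissa-extending pairs of FP64 values (the classic double-double technique). The building block such code relies on is an error-free addition; a minimal sketch of Knuth's two-sum (the naming is ours, not Sandra's internal implementation):

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Error-free transformation: returns (s, e) such that
    s + e == a + b exactly, where s = fl(a + b) and e is the
    rounding error lost by the plain FP64 addition."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

# The small term would vanish in plain FP64; two_sum recovers it,
# effectively doubling the usable mantissa across the (s, e) pair.
s, e = two_sum(1.0, 1e-20)
print(s)  # 1.0
print(e)  # 1e-20 (the part lost to rounding)
```

Every FP128 operation decomposes into several such FP64 steps, which is why the Quad tests run orders of magnitude slower than the native FP64 ones.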
SiSoftware Official Ranker Scores
Final Thoughts / Conclusions
With AMD making big improvements with Ryzen Mobile (ZEN2), its updated 256-bit SIMD units and more cores (6+), Intel had to improve: and improve it did. While AVX512, due to its high power consumption, was never a good fit for mobile and its meagre ULV power envelopes (15-25W, etc.), somehow “TigerLake” (TGL) manages to run its AVX512 units much faster – 40-50% faster than “IceLake” – and thus beat the competition.
TGL’s performance, while staying within the ULV power budget of a thin & light laptop (e.g. Dell XPS 13), is pretty compelling: it soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring to a modern, efficient design.
TGL also brings PCIe 4.0 thus faster NVMe/Optane storage I/O, Thunderbolt 4 / USB 4.0 compatibility and thus faster external I/O as well. DDR5 & LPDDR5 also promise even higher bandwidth in order to feed the new cores not to mention the updated GPGPU engine with its many more cores (up to 96 EU now!) that require a lot more bandwidth.
TGL is a huge improvement over older architectures (even 8th gen) that improves everything: greater compute power, greater graphics/GP compute power, faster memory, faster storage and faster external I/O! If you thought that ICL – despite its own big improvements – did not quite reach the “upgrade threshold”, TGL does everything and much more. The time of small, incremental improvements is finally over and ICL/TGL are just what was needed. Let’s hope Intel can keep it up!
In a word: Highly Recommended – 9/10
Please see our other articles on:
- Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance
- AVX512-IFMA(52) Improvement for IceLake and TigerLake
- Benchmarks of JCC Erratum Mitigation – Intel CPUs
- Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – Cache & Memory Performance
- Intel Iris Plus G7 Gen11 IceLake ULV (i7-1065G7) Review & Benchmarks – GPGPU Performance