Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel, the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, etc.). It is an optimisation of the “IceLake (ICL)” arch, built on an updated 10nm++ process, and is again launched for mobile ULV (U/Y) devices and perhaps for other platforms too.

Note that RocketLake-S (RKL) will be the desktop equivalent of IceLake (ICL) cores but with TigerLake (TGL) graphics.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

10nm++ process (lower voltage, higher performance benefits)
Up to 4C/8T “Willow Cove” on ULV (CometLake up to 6C/12T)
Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
AVX512 and more of its friends
Increased L2 cache from 512kB to 1.25MB per core (2.5x)
Increased L3 cache from 8MB to 12MB (+50%)
DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
PCIe 4.0 (up to 32GB/s with x16 lanes)
Thunderbolt 4 (and thus USB 4.0 support) integrated
Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)

While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:

AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
AVX512-VP2INTERSECT (Vector Pair Intersection)

Some software may not yet have been updated to AVX512, as it was previously reserved for HEDT/servers; with this mainstream launch, most vectorised algorithms already ported to AVX2/FMA are likely to be ported over soon. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
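
To illustrate why VNNI helps low-precision neural networks, the sketch below emulates in plain Python the arithmetic of one AVX512-VNNI instruction (VPDPBUSD): in each 32-bit lane, four unsigned 8-bit values are multiplied by four signed 8-bit values, summed, and added to a 32-bit accumulator. This only illustrates the arithmetic under those assumptions; the function names are invented for this example and it is not how Sandra (or any library) implements it.

```python
def to_int32(x):
    """Wrap a Python int to a signed 32-bit two's complement value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def vpdpbusd_lane(acc, a_u8, b_s8):
    """Emulate one 32-bit lane: acc + sum of 4 (unsigned byte * signed byte)."""
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)      # unsigned 8-bit inputs
    assert all(-128 <= b <= 127 for b in b_s8)   # signed 8-bit inputs
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))


def vpdpbusd_512(acc, a_u8, b_s8):
    """Emulate a full 512-bit register: 16 lanes, 64 byte pairs in total."""
    assert len(acc) == 16 and len(a_u8) == 64 and len(b_s8) == 64
    return [
        vpdpbusd_lane(acc[i], a_u8[4 * i:4 * i + 4], b_s8[4 * i:4 * i + 4])
        for i in range(16)
    ]


# One lane: 10 + (1*1 + 2*-1 + 3*2 + 4*-2) = 10 - 3
print(vpdpbusd_lane(10, [1, 2, 3, 4], [1, -1, 2, -2]))  # 7
```

A single 512-bit operation therefore covers 64 multiply-accumulates, which is where the benefit for 8-bit inference comes from.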

The caches are finally getting updated and increased considering that the competition has deployed massively big caches in its latest products. L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had previously doubled L2 from SKL (and current CML) derivatives which means it’s 5x larger than older designs.

From a security point-of-view, TGL mitigates all (current/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1 that does not have a hardware solution) thus should not require slower mitigations that affect performance (especially I/O). Like ICL it is also not affected by the JCC issue that is still being addressed through software (compiler) changes but old software will never be updated.

DDR5 / LPDDR5 will ensure even more memory bandwidth and faster data rates (up to 5400MT/s), without the need for multiple (SO)DIMMs to enable at least dual-channel; naturally populating all channels will allow even higher bandwidth. Higher data rates should also reduce memory latencies (assuming the timings don’t increase too much). Unfortunately there are no public DDR5 modules for us to test. LPDDR4X also gets a bump to a maximum of 4267MT/s.
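
As a rough guide to what these data rates mean, theoretical peak bandwidth is simply the data rate multiplied by the bus width. The sketch below assumes a 128-bit total memory bus (an assumption for this example, not a figure taken from the article) and ignores real-world efficiency losses; the function name is invented for illustration.

```python
def peak_bandwidth_gbs(data_rate_mts, bus_width_bits=128):
    """Theoretical peak bandwidth in GB/s (10^9 bytes/s).

    data_rate_mts  : memory data rate in mega-transfers per second
    bus_width_bits : total bus width; 128 bits is an assumption here
    """
    bytes_per_transfer = bus_width_bits / 8
    return data_rate_mts * 1e6 * bytes_per_transfer / 1e9


for name, rate in [("LPDDR4X-4267", 4267), ("DDR5/LPDDR5-5400", 5400)]:
    print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s peak")
```

Under these assumptions the step from 4267MT/s to 5400MT/s is worth roughly a quarter more peak bandwidth; measured results will be lower.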

PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs including Intel’s) and NVMe SSDs with ~8GB/s transfer (x4 lanes) on ULV but on desktop up to 32GB/s (x16). Note that the DMI/OPI link between CPU and I/O Hub is also thus updated to PCIe 4.0 speeds improving CPU/Hub transfer.
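
The ~8GB/s (x4) and 32GB/s (x16) figures follow from the PCIe 4.0 signalling rate of 16GT/s per lane with 128b/130b line encoding. A minimal sketch of that arithmetic is shown below; it gives per-direction link bandwidth and ignores packet/protocol overhead, so real transfers will be somewhat slower. The names are invented for this example.

```python
GT_PER_S = 16.0        # PCIe 4.0 raw rate per lane, in GT/s
ENCODING = 128 / 130   # 128b/130b line encoding efficiency


def pcie4_bandwidth_gbs(lanes):
    """Approximate one-direction bandwidth in GB/s for a PCIe 4.0 link."""
    return GT_PER_S * ENCODING / 8 * lanes


for lanes in (1, 4, 16):
    print(f"x{lanes}: {pcie4_bandwidth_gbs(lanes):.2f} GB/s")
```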

Thunderbolt 4 brings support for the upcoming USB 4.0 protocol and its data rates (32Gbps), which should also bring new peripherals including external GPU (eGPU) enclosures for discrete graphics.

Finally, the GPU has been updated again to Xe (Gen 12) cores, with up to 96 EUs on some SKUs, which represents a huge compute and graphics performance increase over the old (Gen 9.x) cores used by gen 10 APUs (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 10, 11) as well as competitors (AMD) with a view to upgrading to a mid-range but high-performance design.

CPU Specifications	AMD Ryzen 4500U	Intel Core i7 10510U (CometLake ULV)	Intel Core i7 1065G7 (IceLake ULV)	Intel Core i7 1165G7 (TigerLake ULV)	Comments
Cores (CU) / Threads (SP)	6C / 6T	4C / 8T	4C / 8T	4C / 8T	No change in core count.
Speed (Min / Max / Turbo)	1.6-2.3-4.0GHz	0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W)	0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W)	0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W)	Both base and Turbo clocks are way up.
Power (TDP)	15-35W	15-35W	15-35W	12-35W	Similar power envelope, possibly higher.
L1D / L1I Caches	6x 32kB 8-way / 6x 64kB 4-way	4x 32kB 8-way / 4x 32kB 8-way	4x 48kB 12-way / 4x 32kB 8-way	4x 48kB 12-way / 4x 32kB 8-way	No change in L1D.
L2 Caches	6x 512kB 8-way	4x 256kB 16-way	4x 512kB 16-way	4x 1.25MB	L2 has more than doubled (2.5x)!
L3 Caches	2x 4MB 16-way	8MB 16-way	8MB 16-way	12MB 16-way	L3 is 50% larger
Microcode (Firmware)	n/a	MC068E09-CC	MC067E05-6A	MC068C01-72	Revisions just keep on coming.
Special Instruction Sets	AVX2/FMA, SHA	AVX2/FMA	AVX512, VNNI, SHA, VAES, IFMA	AVX512, VNNI, SHA, VAES, IFMA	More AVX512!
SIMD Width / Units	256-bit	256-bit	512-bit	512-bit	Widest SIMD units ever

Disclaimer

This is an independent article that has not been endorsed or sponsored by any entity (e.g. Intel). All trademarks acknowledged and used for identification only under fair use.

The article contains only public information (available elsewhere on the Internet) that was not provided under NDA or embargo. At publication time, the products have not been directly tested by SiSoftware and thus the accuracy of the benchmark scores cannot be verified; however, they appear consistent and do not appear to be false/fake.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets available on each CPU (AVX512, AVX2, AVX, etc.). Like “IceLake” (ICL), “TigerLake” (TGL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks		AMD Ryzen 4500U 6C/6T	Intel Core i7 10510U 4C/8T (CometLake ULV)	Intel Core i7 1065G7 4C/8T (IceLake ULV)	Intel Core i7 1165G7 4C/8T (TigerLake ULV)	Comments

	Native Dhrystone Integer (GIPS)	208	134	154	169 [+10%]	TGL is 10% faster than ICL but not enough to beat AMD.
	Native Dhrystone Long (GIPS)	191	135	151	167 [+11%]	With a 64-bit integer workload – 11% increase
	Native FP32 (Float) Whetstone (GFLOPS)	89	85	90	99.5 [+10%]	With floating-point, TGL is only 10% faster but enough to beat AMD.
	Native FP64 (Double) Whetstone (GFLOPS)	75	70	74	83 [+12%]	With FP64 we see a 12% improvement.
With legacy workloads (not using SIMD) TGL is not much faster than ICL even with its highly clocked cores; still, a 10-12% improvement is welcome as it allows TGL to beat the 6-core Ryzen Mobile competition in floating-point (Whetstone), although not in integer (Dhrystone).
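
The bracketed percentages in these tables are the relative improvement of TGL over ICL. As a quick sanity check, the sketch below recomputes them from the Dhrystone/Whetstone scores listed above; the helper name is invented for this example.

```python
def uplift_percent(new_score, old_score):
    """Relative improvement of new_score over old_score, in percent."""
    return (new_score / old_score - 1.0) * 100.0


# (ICL score, TGL score) taken from the table above
scores = {
    "Dhrystone Integer (GIPS)": (154, 169),
    "Dhrystone Long (GIPS)": (151, 167),
    "Whetstone FP32 (GFLOPS)": (90, 99.5),
    "Whetstone FP64 (GFLOPS)": (74, 83),
}

for test, (icl, tgl) in scores.items():
    print(f"{test}: {uplift_percent(tgl, icl):+.1f}%")
```

Depending on rounding, a recomputed value may differ by a point from the published figure (FP32 Whetstone works out at about 10.6%).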

	Native Integer (Int32) Multi-Media (Mpix/s)	506	409	504*	709* [+41%]	With AVX512 TGL is over 40% faster than ICL.
	Native Long (Int64) Multi-Media (Mpix/s)	193	149	145*	216* [+49%]	With a 64-bit AVX512 integer workload TGL is 50% faster.
	Native Quad-Int (Int128) Multi-Media (Mpix/s)	4.47	2.54	3.67**	4.34** [+18%]	A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**]
	Native Float/FP32 Multi-Media (Mpix/s)	433	328	414*	666* [+61%]	In this floating-point vectorised test TGL is 61% faster!
	Native Double/FP64 Multi-Media (Mpix/s)	251	194	232*	381* [+64%]	Switching to FP64 SIMD AVX512 code, TGL is 64% faster.
	Native Quad-Float/FP128 Multi-Media (Mpix/s)	11.23	8.22	10.2*	15.28* [+50%]	A heavy algorithm using FP64 to mantissa-extend FP128; TGL is still 50% faster than ICL.
With heavily vectorised SIMD workloads TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile even with its 6x 256-bit SIMD cores, but it is also 40-60% faster than ICL. Intel seems to have managed to get the SIMD units to run much faster than ICL even within a similar power envelope!
* using AVX512 instead of AVX2/FMA.
** test has been rewritten in Sandra 20/20 R9: now vectorised and AVX512-IFMA enabled; see the “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.

	Crypto AES-256 (GB/s)	13.46	12.11	21.3*	19.72* [-7%]	Memory bandwidth rules here so TGL is similar to ICL in speed.
	Crypto AES-128 (GB/s)	13.5	12.11	21.3*	19.8* [-7%]	No change with AES128.
	Crypto SHA2-256 (GB/s)	7.03**	4.28	9***	13.87*** [+54%]	Despite SHA HWA, TGL soundly beats Ryzen using AVX512.
	Crypto SHA1 (GB/s)		7.19	15.71***		Less compute intensive SHA1 does not help.
	Crypto SHA2-512 (GB/s)			7.09***		SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and despite Ryzen Mobile having SHA HWA, TGL is much faster using AVX512 and, as we’ve seen before, about 50% faster than ICL. AVX512 helps even against native hashing acceleration.
* using VAES (AVX512 VL) instead of AES HWA.
** using SHA HWA instead of multi-buffer AVX2.
*** using AVX512 B/W.

	Black-Scholes float/FP32 (MOPT/s)		64.16	109		–
	Black-Scholes double/FP64 (MOPT/s)	91.48	87.17	91	132 [+45%]	Using FP64 TGL is 45% faster than ICL.
	Binomial float/FP32 (kOPT/s)		16.34	23.55		Binomial uses thread shared data thus stresses the cache & memory system.
	Binomial double/FP64 (kOPT/s)	31.2	21	27	37.23 [+38%]	With FP64 code TGL is 38% faster.
	Monte-Carlo float/FP32 (kOPT/s)		12.48	79.9		Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
	Monte-Carlo double/FP64 (kOPT/s)	45.59	16.5	33	45.98 [+39%]	Switching to FP64 TGL is 40% faster.
With non-SIMD financial workloads, TGL still improves by a decent 38-45% over ICL, which is enough to beat the 6-core Ryzen Mobile: no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.

	SGEMM (GFLOPS) float/FP32		158	185*	294* [+59%]	In this tough vectorised algorithm, TGL is 60% faster!
	DGEMM (GFLOPS) double/FP64	76.86	69.2	91.7*	167* [+82%]	With FP64 vectorised code, TGL is over 80% faster!
	SFFT (GFLOPS) float/FP32		13.9	31.7*	31.14* [-2%]	FFT is also heavily vectorised but memory dependent so TGL does not improve over ICL.
	DFFT (GFLOPS) double/FP64	7.15	7.35	17.7*	16.41* [-3%]	With FP64 code, nothing much changes.
	SNBODY (GFLOPS) float/FP32		169	200*	286* [+43%]	N-Body simulation is vectorised but with more memory accesses.
	DNBODY (GFLOPS) double/FP64	98.7	64.2	61.8*	81.61* [+32%]	With FP64 code TGL is 32% faster.
With highly vectorised SIMD code (scientific workloads), TGL again shows us the power of AVX512 and beats ICL by 30-80% in GEMM and N-Body, and Ryzen Mobile too in all but DNBODY. Some algorithms that are completely memory latency/bandwidth dependent (FFT) cannot improve but require faster memory instead.
* using AVX512 instead of AVX2/FMA3

	NeuralNet CNN Inference (Samples/s)		19.33	25.62*
	NeuralNet CNN Training (Samples/s)		3.33	4.56*
	NeuralNet RNN Inference (Samples/s)		23.88	24.93*
	NeuralNet RNN Training (Samples/s)		1.57	2.97*
* using AVX512 instead of AVX2/FMA (not using VNNI yet)

	Blur (3×3) Filter (MPix/s)	1060	891	1580*	2276* [+44%]	In this vectorised integer workload TGL is 44% faster.
	Sharpen (5×5) Filter (MPix/s)	441	359	633*	912* [+44%]	Same algorithm but more shared data TGL still 44% faster.
	Motion-Blur (7×7) Filter (MPix/s)	231	186	326*	480* [+47%]	Again same algorithm but even more data shared brings 47%
	Edge Detection (2*5×5) Sobel Filter (MPix/s)	363	302	502*	751* [+50%]	Different algorithm but still vectorised workload still 50% faster.
	Noise Removal (5×5) Median Filter (MPix/s)	28.02	27.7	72.9*	109* [+49%]	Still vectorised code TGL is again 50% faster.
	Oil Painting Quantise Filter (MPix/s)	12.23	15.7	24.7*	34.74* [+40%]	Similar improvement here of about 40%
	Diffusion Randomise (XorShift) Filter (MPix/s)	936	1580	2100*	2998* [+43%]	With integer workload, 43% faster.
	Marbling Perlin Noise 2D Filter (MPix/s)	127	214	307*	430* [+40%]	In this final test again with integer workload 40% faster
Similar to what we saw before, TGL is 40-50% faster than ICL at a similar power envelope and far faster than Ryzen Mobile and its 6 cores. Again we see the huge improvement AVX512 already brings, even at low-power ULV envelopes.
* using AVX512 instead of AVX2/FMA
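
To summarise the image-processing results as a single number, one option is the geometric mean of the per-filter TGL/ICL ratios, which treats each filter equally regardless of its absolute MPix/s. The sketch below uses the scores from the table above; the choice of a geometric mean is ours for illustration and is not necessarily how Sandra aggregates its own scores.

```python
import math

# (ICL MPix/s, TGL MPix/s) for each image filter, from the table above
filters = {
    "Blur 3x3": (1580, 2276),
    "Sharpen 5x5": (633, 912),
    "Motion-Blur 7x7": (326, 480),
    "Sobel Edge Detection": (502, 751),
    "Median Noise Removal": (72.9, 109),
    "Oil Painting": (24.7, 34.74),
    "Diffusion (XorShift)": (2100, 2998),
    "Marbling (Perlin)": (307, 430),
}

ratios = [tgl / icl for icl, tgl in filters.values()]

# Geometric mean computed via logarithms: exp(mean(log(ratio)))
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(f"min {min(ratios):.2f}x, max {max(ratios):.2f}x, geomean {geomean:.2f}x")
```

All eight ratios fall between 1.40x and 1.50x, consistent with the 40-50% range quoted in the text.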

Perhaps due to the relatively meager ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures but with more cores (Ryzen Mobile or Comet Lake with 6-cores) – but TGL improves things considerably – anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.

Final Thoughts / Conclusions

With AMD making big improvements with Ryzen Mobile (ZEN2) and its updated 256-bit SIMD units and also more cores (6+), Intel had to improve: and improve it did. Due to its high power consumption, AVX512 was never a good fit for mobile parts and their meager ULV power envelopes (15-25W, etc.), yet “Tiger Lake” (TGL) somehow manages to run AVX512 code much faster, 40-50% faster than “Ice Lake”, thus beating the competition.

TGL’s performance still within ULV power budget in a thin & light laptop (e.g. Dell XPS 13) is pretty compelling and soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring on a modern efficient design.

TGL also brings PCIe 4.0 thus faster NVMe/Optane storage I/O, Thunderbolt 4 / USB 4.0 compatibility and thus faster external I/O as well. DDR5 & LPDDR5 also promise even higher bandwidth in order to feed the new cores not to mention the updated GPGPU engine with its many more cores (up to 96 EU now!) that require a lot more bandwidth.

TGL is a huge improvement over older architectures (even 8th gen) that improves everything: greater compute power, greater graphics/GP compute power, faster memory, faster storage and faster external I/O! If you thought that ICL, despite its own big improvements, did not quite reach the “upgrade threshold”, TGL does everything and much more. The times of small, incremental improvements are finally over and ICL/TGL are just what was needed. Let’s hope Intel can keep it up!

In a word: Highly Recommended – 9/10
