Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – CPU AVX512 Performance

What is “IceLake”?

It is the “real” 10th generation Core arch(itecture) (ICL/”IceLake”) from Intel – the brand new core to replace the ageing “Skylake” (SKL) arch and its many derivatives; due to delays it actually debuts shortly after the latest update (“CometLake” (CLM)) that is also called 10th generation. Firstly launched for mobile ULV (U/Y) devices, it will also be launched for mainstream (desktop/workstations) soon.

Thus it contains extensive changes to all parts of the SoC: CPU, GPU, memory controller:

10nm+ process (lower voltage, higher performance benefits)
Up to 4C/8T “Sunny Cove” cores on ULV (less than top-end CometLake 6C/12T)
Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskyLake)
AVX512 instruction set (like HEDT platform)
SHA HWA instruction set (like Ryzen)
2-channel LP-DDR4X support up to 3733Mt/s
Thunderbolt 3 integrated
Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
WiFi6 (802.11ax) AX201 integrated

Probably the biggest change is support for AVX512-family instruction set, effectively doubling the SIMD processing width (vs. AVX2/FMA) as well as adding a whole host of specialised instructions that even the HEDT platform (SKL/KBL-X) does not support:

AVX512-VNNI (Vector Neural Network Instructions)
AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions)
AVX512-BITALG (Bit Algorithms)
AVX512-AVX512-IFMA (Integer FMA)
AVX512-VAES (Vector AES) accelerating crypto
AVX512-GFNI (Galois Field)
SHA HWA accelerating hashing
AVX512-GNA (Gaussian Neural Accelerator)

While some software may not have been updated to AVX512 as it was reserved for HEDT/Servers, due to this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI, IFMA support can accelerate low-precision neural-networks that are likely to be used on mobile platforms.

VAES and SHA acceleration improve crypto/hashing performance – important today as even LAN transfers between workstations are likely to be encrypted/signed, not to mention just about all WAN transfers, encrypted disk/containers, etc. Some SoCs will also make their way into powerful (but low power) firewall appliances where both AES and SHA acceleration will prove very useful.

From a security point-of-view, ICL mitigates all (existing/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1 that does not have a hardware solution) thus should not require slower mitigations that affect performance (especially I/O).

The memory controller supports LP-DDR4X at higher speeds than CML while the cache/TLB systems have been improved that should help both CPU and GPU performance (see corresponding article) as well as reduce power vs. older designs using LP-DDR3.

Finally the GPU core has been updated (Gen11) and generally contains many more cores than the old core (Gen9.5) that was used from KBL (CPU Gen7) all the way to CML (CPU Gen10) (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Intel Core Gen10 CometLake ULV (i7-10510U) Review & Benchmarks – CPU Performance

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 8, 7, 6) as well as competiors (AMD) with a view to upgrading to a mid-range but high performance design.

CPU Specifications	AMD Ryzen 2500U Bristol Ridge	Intel i7 8550U (Coffeelake ULV)	Intel Core i7 10510U (CometLake ULV)	Intel Core i7 1065G7 (IceLake ULV)	Comments
Cores (CU) / Threads (SP)	4C / 8T	4C / 8T	4C / 8T	4C / 8T	No change in cores count.
Speed (Min / Max / Turbo)	1.6-2.0-3.6GHz	0.4-1.8-4.0GHz (1.8 @ 15W, 2GHz @ 25W)	0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W)	0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W)	ICL has lower clocks ws. CML.
Power (TDP)	15-35W	15-35W	15-35W	12-35W	Same power envelope.
L1D / L1I Caches	4x 32kB 8-way / 4x 64kB 4-way	4x 32kB 8-way / 4x 32kB 8-way	4x 32kB 8-way / 4x 32kB 8-way	4x 48kB 12-way / 4x 32kB 8-way	L1D is 50% larger.
L2 Caches	4x 512kB 8-way	4x 256kB 16-way	4x 256kB 16-way	4x 512kB 16-way	L2 has doubled.
L3 Caches	4MB 16-way	6MB 16-way	8MB 16-way	8MB 16-way	No L3 changes
Microcode (Firmware)	MU8F1100-0B	MU068E09-AE	MU068E0C-BE	MU067E05-6A	Revisions just keep on coming.
Special Instruction Sets	AVX2/FMA, SHA	AVX2/FMA	AVX2/FMA	AVX512, VNNI, SHA, VAES, GFNI	512-bit wide SIMD on mobile!
SIMD Width / Units	128-bit	256-bit	256-bit	512-bit	Widest SIMD units ever

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). “IceLake” (ICL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks		AMD Ryzen 2500U Bristol Ridge	Intel i7 8550U (Coffeelake ULV)	Intel Core i7 10510U (CometLake ULV)	Intel Core i7 1065G7 (IceLake ULV)	Comments

	Native Dhrystone Integer (GIPS)	103	125	134	154 [+15%]	ICL is 15% faster than CML.
	Native Dhrystone Long (GIPS)	102	115	135	151 [+12%]	With a 64-bit integer workload – 12% increase
	Native FP32 (Float) Whetstone (GFLOPS)	79	67	85	90 [+6%]	With floating-point, ICL is 6% faster than CML
	Native FP64 (Double) Whetstone (GFLOPS)	67	57	70	74 [+5%]	With FP64 we see 5% improvement
With integer (legacy) workloads (not using SIMD) we see the new ICL core is over 10% faster than the higher-clocked CML core; with floating-point we see a 5% improvement. While modest, it shows the potential of the new core over the old-but-refined cores we’ve had since SKL.

	Native Integer (Int32) Multi-Media (Mpix/s)	239	306	409	*504 [+23%]**	With AVX512 ICL wins this vectorised integer test
	Native Long (Int64) Multi-Media (Mpix/s)	53.4	117	149	145* [-3%]	With a 64-bit AVX512 integer workload we have parity.
	Native Quad-Int (Int128) Multi-Media (Mpix/s)	2.41	2.21	2.54	3.67 [+44%]	A tough test using long integers to emulate Int128 without SIMD; ICL is 44% faster!
	Native Float/FP32 Multi-Media (Mpix/s)	222	266	328	*414 [+26%]**	In this floating-point vectorised test, AVX512 is 26% faster.
	Native Double/FP64 Multi-Media (Mpix/s)	127	155.9	194	*232 [+19%]**	Switching to FP64 SIMD code, ICL is 20% faster.
	Native Quad-Float/FP128 Multi-Media (Mpix/s)	6.23	6.51	8.22	*10.2 [+24%]**	A heavy algorithm using FP64 to mantissa extend FP128 ICL is 24% faster.
With heavily vectorised SIMD workloads ICL is able to deploy AVX512 which leads to a 20-25% performance improvement even at the slower clock. However, AVX512 is quite power-hungry (as we’ve seen on HEDT) so we are power constrained in an ULV here – but higher TDP systems (28W, etc.) should perform much better. * using AVX512 instead of AVX2/FMA.

	Crypto AES-256 (GB/s)	10.9	13.1	12.1	*21.3 [+76%]**	ICL with VAES is 76% faster than CML.
	Crypto AES-128 (GB/s)	10.9	13.1	12.1	*21.3 [+76%]**	No change with AES128.
	Crypto SHA2-256 (GB/s)	6.78**	3.97	4.3	9 [+2.1x]**	Despite SHA HWA, Ryzen loses top spot.
	Crypto SHA1 (GB/s)	7.13**	7.5	7.2	15.7 [+2.2x]**	Less compute intensive SHA1 does not help.
	Crypto SHA2-512 (GB/s)	1.48	1.54		7.1***	SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and despite VAES (AVX512 VL) and SHA HWA support (like Ryzen), ICL wins thanks to the very fast LP-DDR4X @ 3733Mt/s. VAES marginally helps (at this time) and SHA HWA cannot beat AVX512 multi-buffer but should be much more important in single-buffer large data workloads. * using VAES (AVX512 VL) instead of AES HWA. using SHA HWA instead of multi-buffer AVX2. * using AVX512 B/W

	Black-Scholes float/FP32 (MOPT/s)	93.34	73.02		109	With non-vectorised code ICL is still faster
	Black-Scholes double/FP64 (MOPT/s)	77.86	75.24	87.2	91 [+4%]	Using FP64 ICL is 4% faster
	Binomial float/FP32 (kOPT/s)	35.49	16.2		23.5	Binomial uses thread shared data thus stresses the cache & memory system.
	Binomial double/FP64 (kOPT/s)	19.46	19.31	21	27 [+29%]	With FP64 code ICL is 29% faster.
	Monte-Carlo float/FP32 (kOPT/s)	20.11	14.61		79.9	Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
	Monte-Carlo double/FP64 (kOPT/s)	15.32	14.54	16.5	66 [+2x]	Switching to FP64 ICL is 2x faster.
With non-SIMD financial workloads, ICL still improves a significant amount over CML thus it makes sense to choose it rather than the older core. Still, it is more likely that the GPGPU will be used for such workloads today.

	SGEMM (GFLOPS) float/FP32	107	141	158	*185 [+17%]**	In this tough vectorised algorithm, ICL is 17% faster
	DGEMM (GFLOPS) double/FP64	47.2	55	69.2	*91.7 [+32%]**	With FP64 vectorised code, ICL is 32% faster.
	SFFT (GFLOPS) float/FP32	3.75	13.23	13.9	*31.7 [+2.3x%]**	FFT is also heavily vectorised and here ICL is over 2x faster.
	DFFT (GFLOPS) double/FP64	4	6.53	7.35	*17.7 [+2.4x]**	With FP64 code, ICL is even faster.
	SNBODY (GFLOPS) float/FP32	112.6	160	169	*200 [+18%]**	N-Body simulation is vectorised but with more memory accesses.
	DNBODY (GFLOPS) double/FP64	45.3	57.9	64.2	*61.8 [-4%]**	With FP64 code ICL is slighly behind CML.
With highly vectorised SIMD code (scientific workloads), ICL again shows us the power of AVX512 and can be over 2x (twice) faster than CML even at higher clock. Some algorithms may need further optimisations but even then we see 17-30% improvement. * using AVX512 instead of AVX2/FMA

	NeuralNet CNN Inference (Samples/s)	14.32	17.27	19.33	*25.62 [+33%]**	Using AVX512 ICL inference is 33% faster.
	NeuralNet CNN Training (Samples/s)	1.46	2.06	3.33	*4.56 [+37%]**	Even training improves by 37%.
	NeuralNet RNN Inference (Samples/s)	16.93	22.69	23.88	*24.93 [+4%]**	Just 4% faster but improvement is there.
	NeuralNet RNN Training (Samples/s)	1.48	1.14	1.57	*2.97 [+43%]**	Training is much faster by 43% over CML.
As we’ve seen before, ICL benefits greatly from AVX512 – manages to beat the higher-clock CML across the board from 33-43% – and that is before using VNNI to accelerate algorithms even more. * using AVX512 instead of AVX2/FMA (not using VNNI yet)

	Blur (3×3) Filter (MPix/s)	532	720	891	*1580 [+77%]**	In this vectorised integer workload ICL is 77% faster
	Sharpen (5×5) Filter (MPix/s)	146	290	359	*633 [+76%]**	Same algorithm but more shared data still 76%.
	Motion-Blur (7×7) Filter (MPix/s)	123	157	186	*326 [+75%]**	Again same algorithm but even more data shared brings 75%
	Edge Detection (2*5×5) Sobel Filter (MPix/s)	185	251	302	*502 [+66%]**	Different algorithm but still vectorised workload still 66% faster.
	Noise Removal (5×5) Median Filter (MPix/s)	26.49	25.38	27.7	*72.9 [+2.6x]**	Still vectorised code ICL rules here 2.6x faster!
	Oil Painting Quantise Filter (MPix/s)	9.38	14.29	15.7	*24.7 [57%]**	Similar improvement here of about 57%
	Diffusion Randomise (XorShift) Filter (MPix/s)	660	1525	1580	*2100 [+33%]**	With integer workload, 33% faster.
	Marbling Perlin Noise 2D Filter (MPix/s)	94,16	188.8	214	*307 [+43%]**	In this final test again with integer workload 43% faster
ICL rules this benchmark with AVX512 integer (B/W) 33-43% faster and floating-point AVX512 66-77% faster than CML even at lower clock. Again we see the huge improvement AVX512 brings already even at low-power ULV envelopes. * using AVX512 instead of AVX2/FMA

Unlike CML, ICL with AVX512 support is a revolution in performance – which is exactly what we were hoping for; even at much lower clock we see anywhere between 33% all the way to over 2x (twice) faster within the same power limits (TDP/turbo). As we know from HEDT, AVX512 is power-hungry thus higher-TDP rated version (e.g. 28W) should perform even better.

Even without AVX512, we see good improvement of 5-15% again at much lower clock (3.9GHz vs 4.9GHz) while CML and older versions relied on higher clock / more cores to outperform older versions KBL/SKL-U.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD snapping at its heel with Ryzen Mobile, Intel has finally fixed its 10nm production and rolled out the “new Skylake” we deserve: Ice Lake with AVX512 brings feature parity with the much older HEDT platform and showing good promise for the future. This is the “Core” you have been looking for.

While power-hungry and TDP constrained, AVX512 does bring sizeable performance gains that are in addition to core improvements and cache & memory sub-system improvements. Other instruction sets VAES, SHA HWA complete the package and might help in some scenarios where code has not been updated to AVX512.

With ICL, a mere 15W thin & light (e.g. Dell XPS 13 9300) can outperform older desktop-class CPUs (e.g. SKL) at 4-6x (four/six-times) TDP which makes us really keen to see what desktop-class processors will be capable of. And not before time as the competition has been bringing stronger and stronger designs (Ryzen2, future Ryzen 3).

If you have been waiting to upgrade from the much older – but still good – SKL/KBL with just 2 cores and no hardware vulnerability mitigations – then you finally have something to upgrade to: CML was not it as despite its 4 cores (and rumoured 6 core), it just did not bring enough to the table to make upgrading worth-while (save hardware mitigations that don’t cripple performance).

Overall, with GP GPU and memory improvements, ICL-U is a very compelling proposition that cost permitting should be your top choice for long-term use.

In a word: Highly Recommended!

Please see our other articles on:

Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance
Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance
Benchmarks of JCC Erratum Mitigation – Intel CPUs
Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – Cache & Memory Performance
Intel Iris Plus G7 Gen11 IceLake ULV (i7-1065G7) Review & Benchmarks – GPGPU Performance