Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance

Intel Iris Xe Gen 12

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel, the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch on the updated 10nm++ process, again launched first for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU, memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Gen12 (XE-LP) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each, 5400Mt/s)
  • No eDRAM cache unfortunately (like CrystalWell and co)
  • New Image Processing Unit (IPU6) up to 4K90 resolution
  • New 2x Media Encoders HEVC 4K60-10b 4:4:4 & 8K30-10b 4:2:0
  • PCIe 4.0

While ICL had already greatly upgraded the GP-GPU to gen 11 cores (and more than doubled the count to 64 EUs for G7), TGL upgrades them yet again to “XE”-LP gen 12 cores, now all the way up to 96 EUs. While again most features seem to be geared towards gaming and media (with new image processing and media encoders), there should be a few new instructions for AI, hopefully exposed through an OpenCL extension.

Again there is no FP64 support (!) while FP16 is naturally supported at 2x rate as before. BF16 should also be supported by a future driver. Int32, Int16 performance has reportedly doubled with Int8 now supported and DP4A accelerated.
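For readers unfamiliar with DP4A: it is generally described as a four-way dot product of packed 8-bit integers accumulated into a 32-bit integer. The Python sketch below emulates that behaviour for signed inputs. The names `pack_i8x4` and `dp4a` are illustrative assumptions, not an Intel intrinsic or an OpenCL built-in, and the wrap-around (rather than saturating) accumulator is also an assumption.

```python
import struct

def pack_i8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]

def dp4a(a_packed, b_packed, acc):
    """Emulate a signed DP4A: acc + sum(a[i] * b[i]) over four int8 lanes."""
    a = struct.unpack("<4b", struct.pack("<I", a_packed))
    b = struct.unpack("<4b", struct.pack("<I", b_packed))
    total = acc + sum(x * y for x, y in zip(a, b))
    # Assumption: the accumulator wraps to signed 32 bits (no saturation).
    return (total + 2**31) % 2**32 - 2**31

a = pack_i8x4([1, -2, 3, 4])
b = pack_i8x4([5, 6, -7, 8])
print(dp4a(a, b, 10))  # 1*5 + (-2)*6 + 3*(-7) + 4*8 + 10
```

Four multiply-adds collapse into a single operation here, which is why such an instruction helps quantised (Int8) inference workloads.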

The new memory controller supports DDR5 / LPDDR5 (5400Mt/s) that should – once such memory becomes readily available – provide more bandwidth for the EU cores; until then LPDDR4X can be clocked faster than on ICL (4267Mt/s vs. 3733Mt/s). There is no mention of an eDRAM (L4) cache at all.

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously, perhaps with the forthcoming DG1 discrete graphics.

GPGPU (Xe-LP G7) Performance Benchmarking

In this article we test GPGPU core performance.

Hardware Specifications

We are comparing the mid-range Intel integrated GP-GPUs with the previous generation, as well as with competing architectures, with a view to upgrading to a brand-new, high-performance design.

GPGPU Specifications Intel Iris XE-LP G7 Intel XE-LP G1 Intel Iris Plus (IceLake) G7 AMD Vega 8 (Ryzen5) Comments
Arch Chipset EV12 / G7 EV12 / G1 EV11 / G7 GCN1.5 The first G12 from Intel.
Cores (CU) / Threads (SP) 96 / 768 32 / 256 64 / 512 8 / 512 50% more cores vs. G11
SIMD per CU / Width 8 8 8 64 Same SIMD width
Wave/Warp Size 32 32 16/32 64 Wave size matches nVidia
Speed (Min-Turbo) 1.2GHz 1.15GHz 1.1GHz 1.1GHz Turbo speed has slightly increased.
Power (TDP) 15-35W 15-35W 15-35W 15-35W Similar power envelope.
ROP / TMU 24 / 48 8 / 16 16 / 32 8 / 32 ROPs and TMUs have also increased 50%.
Shared Memory 64kB 64kB 64kB 32kB Same shared memory but 2x Vega.
Constant Memory 3.2GB 3.2GB 2.7GB 3.2GB No dedicated constant memory but large.
Global Memory 2x LP-DDR4X 4267Mt/s (LPDDR5 5400Mt/s) 2x LP-DDR4X 4267Mt/s 2x LP-DDR4X 3733Mt/s 2x DDR4-2400 Can support faster (LP)DDR5 in the future.
Memory Bandwidth 42GB/s 42GB/s 58GB/s 42GB/s Highest (possible) bandwidth ever
L1 Caches 64kB x 6 64kB x 2 16kB x 8 8x 16kB L1 is much larger.
L3 Cache 3.8MB ? 3MB ? L3 has modestly increased.
Maximum Work-group Size 256×256 256×256 256×256 1024×1024 Vega supports 4x bigger workgroups.
FP64/double ratio No! No! No! Yes, 1/16x No FP64 support in current drivers!
FP16/half ratio 2x 2x 2x 2x Same 2x ratio

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel Iris XE-LP G7 96EV Intel XE-LP G1 32EV Intel Iris Plus (IceLake) G7 64EV AMD Vega 8 (Ryzen5) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 4,342 [+54%] 1,419 2,820 2,000 Xe beats EV11 by over 50% using FP16!
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 2,062 [+55%] 654 1,330 1,350 Standard FP32 is just as fast, 55% faster.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 98.6* [+41%] 31.3* 70* 111 Without native FP64 support Xe craters like old EV11.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 9.91* [+31%] 3.49* 7.54* 7.11 Emulated FP128 is even harder for Xe.
Starting off, we see almost perfect scaling with improvement in EUs, with Xe 50% faster than old EV11. Unfortunately, again without native FP64 support – it cannot match the competition. For FP64 workloads – you’ll have to use the CPU; for ULV that may be OK but for discrete DG1 that is not so great.
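The bracketed deltas in the table above compare Xe-LP G7 against EV11 G7. A minimal sketch of that arithmetic, set against the 50% increase in EU count (96 vs. 64), is given here; the helper name `percent_gain` is ours and the scores are copied from the Mandel rows.

```python
def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

# Mandel scores (Mpix/s) from the table above: (Xe-LP G7, EV11 G7).
scores = {
    "FP16": (4342, 2820),
    "FP32": (2062, 1330),
    "FP64 (emulated)": (98.6, 70),
}
eu_gain = percent_gain(96, 64)  # 96 EUs vs. 64 EUs

for name, (xe, ev11) in scores.items():
    print(f"{name}: {percent_gain(xe, ev11):+.0f}% (EU count: {eu_gain:+.0f}%)")
```

The native FP16 and FP32 gains sit slightly above the EU increase, while the emulated FP64 path scales less well.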

* Emulated FP64 through FP32.

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 7.9 [+3x] 2.54 2.6 2.58 Integer performance is 3x faster than EV11
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 3.54 3.38 3.3 Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 20.52 [+3x] 6.81 6.9 14.29 Xe beats Vega even with its acceleration.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 13.34 14.18 18.77 With 128-bit Xe is even faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 2.26 3.36 64-bit integer workload is also stellar.
Despite our sample using slower DDR4 memory vs. the LP-DDR4X of ICL/EV11, integer performance is 3x faster – a huge upgrade. It even manages to beat AMD’s Vega with its crypto acceleration instructions (media ops). While the crypto-currency frenzy has died down (you are not likely to mine coins on ULV GP-GPUs), the dedicated DG1 may be a serious crypto-cracker GPU.
GPGPU Finance Benchmark Black-Scholes float/FP16 (MOPT/s) 1,111 2,340 1,720 With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1,603 [+22%] 993 1,310 829 With FP32 Xe is 22% faster.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 116 292 270 Binomial uses thread shared data thus stresses the memory system.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 334 [+14%] 111 292 254 With FP32, XE is just 15% faster.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 470 667 584 Monte-Carlo also uses thread shared data but read-only.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1,385 [+94%] 444 719 362 With FP32 code Xe is 2x faster than EV11.
For financial FP32/FP16 workloads, Xe is not always much faster than EV11, with two algorithms just 15-22% faster but one 2x as fast. Again, due to lack of FP64 support – it cannot run high-precision workloads which may be a problem for some algorithms.
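To give a sense of what one "option" in the MOPT/s figures involves, here is a minimal sketch of the textbook Black-Scholes price of a European call. It is not the benchmark's actual kernel, which the article does not show; the function and parameter names are illustrative.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, years):
    """Price of a European call option under the Black-Scholes model."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Example inputs (chosen for illustration): at-the-money, 5% rate, 20% volatility, 1 year.
print(round(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0), 4))
```

Each option needs a logarithm, an exponential, a square root and two error-function evaluations, so throughput depends heavily on transcendental-function speed at the chosen precision.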

This does not bode well for the dedicated DG1, as it would be the only discrete card without native FP64 support, unlike the competition. However, it is likely that (some) FP64 units will be included unless Intel aims it squarely (and only) at gamers.

GPGPU Science Benchmark HGEMM (GFLOPS) float/FP16 528 563 884 Vega still has great performance with FP16.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 683 [+64%] 419 314 With FP32, Xe is 64% faster than EV11.
GPGPU Science Benchmark HFFT (GFLOPS) float/FP16 33.32 61.4 61.34 Vega does very well here also with FP16.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 52.7 [+34%] 39.2 31.5 With FP32, Xe is 34% faster.
GPGPU Science Benchmark HNBODY (GFLOPS) float/FP16 652 930 623 All Intel GPUs do well here.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 908 [+60%] 566 537 With FP32, Xe is 60% faster.
On scientific algorithms, Xe does much better: it manages 35-65% better performance than EV11 and generally trounces Vega on FP32, though not quite on FP16. Shall we mention lack of FP64 again?
GPGPU Image Processing Blur (3×3) Filter single/FP16 (MPix/s) 3,520 2,273
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 4,725 [+3x] 1,649 1,570 782 In this 3×3 convolution algorithm, Xe is 3x faster!
GPGPU Image Processing Sharpen (5×5) Filter single/FP16 (MPix/s) 1,000 582
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,354 [+4.2x] 436 319 157 Same algorithm but more shared data, Xe is 4x faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP16 (MPix/s) 924 619
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 727 [+2.2x] 232 328 161 With even more data Xe is 2x faster.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP16 (MPix/s) 1,000 595
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,354 [+4.26x] 435 318 155 Still convolution but with 2 filters – 4.3x faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP16 (MPix/s) 26.63 7.69
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.73 [+33%] 16.27 26.91 4.06 Different algorithm Xe just 33% faster.
GPGPU Image Processing Oil Painting Quantise Filter single/FP16 (MPix/s) 24.34
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 23.95 [+22%] 11.11 19.63 2.59 Without major processing, Xe is only 22% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP16 (MPix/s) 1,740 2,091
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 2,772 [+48%] 1,175 1,870 2,100 This algorithm is 64-bit integer heavy and Xe is 48% faster than EV11.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP16 (MPix/s) 215 1,046
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 916 [-4%] 551 950 608 One of the most complex and largest filters, Xe ties with EV11.
For image processing tasks, Xe seems to do best, with up to 4x better performance – likely due to updated compiler and drivers. In any case for such tasks, upgrading to TGL will give you a huge boost. (fortunately no FP64 processing here)
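Several of the filters above (Blur, Sharpen, Motion Blur, Sobel) are described as convolutions. The sketch below shows the basic 3×3 case on a tiny image in plain Python. It is an illustration only, not the benchmark's OpenCL kernel: the border clamping and the box-blur weights are assumptions, and the kernel is applied without flipping, which is equivalent for symmetric kernels.

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to a 2D list of floats, clamping at the borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(-1, 2):
                for kx in range(-1, 2):
                    # Clamp coordinates so edge pixels reuse their nearest neighbour.
                    sy = min(max(y + ky, 0), height - 1)
                    sx = min(max(x + kx, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky + 1][kx + 1]
            out[y][x] = acc
    return out

box_blur = [[1.0 / 9.0] * 3 for _ in range(3)]
image = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
blurred = convolve3x3(image, box_blur)
print(blurred[1][1])  # the bright centre pixel is spread over its neighbours
```

Each output pixel reads 9 inputs for a 3×3 kernel, 25 for 5×5 and 49 for 7×7, so the larger filters lean more on caches and shared memory.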

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel Iris XE-LP G7 96EV Intel XE-LP G1 32EV Intel Iris Plus (IceLake) G7 64EV AMD Vega 8 (Ryzen5) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 44.92 [+27%] 45.9 36.3 27.2 Xe manages to squeeze more bandwidth out of DDR4.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 7.75 [-54%] 7.7 17 4.74 Uploads are less than half as fast at this time.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 7.6 [-58%] 7.6 18 5 Download bandwidth is not much better.
Thanks to the faster LP-DDR4X memory, Xe has even higher bandwidth than EV11; with future DDR5 / LPDDR5 this will increase further. At this time, perhaps due to the driver, the upload/download bandwidths are less than half those of EV11.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Once again Intel seems to be taking graphics seriously: for the 2nd time in a row we have a major graphics upgrade, with Xe bringing big increases in EV core count, performance and bandwidth. Overall it seems to be 50% faster than EV11, with lower-end devices benefiting most from the upgrade. While the competition once seemed unassailable, Intel has managed to close the gap and overtake.

However, this is still a core aimed at gamers and it does not provide much for GP-GPU: the improved integer performance (3 times better!) is very much welcome, but there are only a few, specific new instructions for AI. Lack of FP64 makes it unsuitable for high-precision financial and scientific workloads; something that the old EV7-9 cores could do reasonably well (all things considered).

For integrated graphics, this is not a problem – not many people would expect ULV GPU core to run compute-heavy workloads; however, the dedicated DG1 card would really be out-spec’d by the competition, with even old, low-end devices providing more features. However, dedicated DG1 is likely to include (some) FP64 units and/or additional units unlike the low-power (LP ULV) integrated versions.

Getting back to ULV, Xe-LP’s performance completely obsoletes devices (e.g. SKL/KBL/WHL/CML-ULV) using the older EV9x cores – unless you really don’t plan on using them except for “business 2D graphics” or displaying the desktop.

If you have not upgraded to ICL yet, TGL is a far better, more compelling proposition that should be your (current) top choice for long-term use. For ICL owners, there is still a lot to gain from upgrading, though the jump is not as massive as previous ones.

In a word: Highly Recommended!


Intel Iris Plus G7 Gen11 IceLake ULV (i7-1065G7) Review & Benchmarks – GPGPU Performance

Intel Iris Plus Graphics

What is “IceLake”?

It is the “proper” 10th generation Core arch (ICL) from Intel – the brand new core to replace the ageing “Skylake” (SKL) arch and its many derivatives; due to delays it actually debuts shortly after the latest update (“CometLake” (CML)) that is also called 10th generation. First launched for mobile ULV (U/Y) devices, it will also be launched for mainstream (desktop/workstations) soon.

Thus it contains extensive changes to all parts of the SoC: CPU, GPU, memory controller:

  • 10nm+ process (lower voltage, performance benefits)
  • Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskeyLake)
  • 64 EUs up to 1.1GHz – up to 1.12 TFLOPS/FP32, 2.25TFLOPS/FP16
  • 2-channel LP-DDR4X support up to 3733Mt/s
  • No eDRAM cache unfortunately (like CrystalWell and co)
  • VRS (Variable Rate Shading) – useful for games
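The peak-throughput figures in the list above follow from simple arithmetic. A minimal sketch, assuming 8 FP32 lanes per EU (the specification table lists 64 EUs with 512 threads) and one fused multiply-add, i.e. 2 FLOPs, per lane per clock; `peak_gflops` and its parameter names are illustrative only.

```python
def peak_gflops(eus, lanes_per_eu, clock_ghz, flops_per_lane_per_clock=2):
    """Theoretical peak throughput in GFLOPS.

    flops_per_lane_per_clock=2 assumes one fused multiply-add per lane per clock.
    """
    return eus * lanes_per_eu * flops_per_lane_per_clock * clock_ghz

fp32 = peak_gflops(eus=64, lanes_per_eu=8, clock_ghz=1.1)
fp16 = 2 * fp32  # FP16 runs at twice the FP32 rate
print(f"FP32: {fp32 / 1000:.2f} TFLOPS, FP16: {fp16 / 1000:.2f} TFLOPS")
```

Under these assumptions the result lands at roughly 1.13 TFLOPS for FP32 and 2.25 TFLOPS for FP16, in line with the figures quoted above. Real kernels reach only a fraction of such peaks.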

The biggest change GPGPU-wise is the increase in EUs (64 at the top end), which greatly increases processing power compared to the previous generation, which used few EUs (24, except for the very rare GT3 version). Most of the features seem to be geared towards gaming, not GPGPU – thus one notable omission is FP64 support! While mobile platforms are not very likely to use high-precision kernels, Gen9 FP64 performance did exceed CPU AVX2/FMA FP64 performance. FP16 is naturally supported at 2x rate, as in most current designs.

While there does not seem to be eDRAM (L4) cache at all, thanks to very high-speed LP-DDR4X memory (at 3733Mt/s) the bandwidth has almost doubled (58GB/s) which should greatly help bandwidth-intensive workloads. While L1 does not seem changed, L2 has been increased to 3MB (up from 1MB) which should also help.

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously.

GPGPU (Gen11 G7) Performance Benchmarking

In this article we test GPGPU core performance.

Hardware Specifications

We are comparing the mid-range Intel integrated GP-GPUs with the previous generation, as well as with competing architectures, with a view to upgrading to a brand-new, high-performance design.

GPGPU Specifications Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
Arch Chipset EV9.5 / GT2 EV9 / GT3 Vega / GCN1.5 EV11 / G7 The first G11 from Intel.
Cores (CU) / Threads (SP) 24 / 192 48 / 384 8 / 512 64 / 512 Less powerful CU but same SP as Vega
SIMD per CU / Width 8 8 64 8 Same SIMD width
Wave/Warp Size 32 32 64 32 Wave size matches nVidia
Speed (Min-Turbo) 300-1000MHz 300-950MHz 300-1100MHz 400-1100MHz Turbo matches Vega.
Power (TDP) 15-25W 15-25W 25W 15-25W Same TDP
ROP / TMU 8 / 16 16 / 24 8 / 32 16 / 32 ROPs the same but TMUs have increased.
Shared Memory 64kB 64kB 32kB 64kB Same shared memory but 2x Vega.
Constant Memory 1.6GB 3.2GB 2.7GB 3.2GB No dedicated constant memory but large.
Global Memory 2x DDR4 2133Mt/s 2x DDR4 2133Mt/s 2x DDR4 2400Mt/s 2x LP-DDR4X 3733Mt/s Fastest memory ever
Memory Bandwidth 38GB/s 38GB/s 42GB/s 58GB/s Highest bandwidth ever
L1 Caches 16kB x 24 16kB x 48 8x 16kB 16kB x 64 L1 does not appear changed.
L2 Cache 512kB 1MB ? 3MB L2 has tripled in size
Maximum Work-group Size 256×256 256×256 1024×1024 256×256 Vega supports 4x bigger workgroups
FP64/double ratio 1/16x 1/16x 1/32x No! No FP64 support in current drivers!
FP16/half ratio 2x 2x 2x 2x Same 2x ratio

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 895 1,530 2,000 2,820 [+41%] G7 beats Vega by 40%! Pretty incredible start.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 472 843 1,350 1,330 [-1%] Standard FP32 is just a tie.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 113 195 111 70* Without native FP64 support G7 craters, but old GT3 beats Vega.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 6 10.2 7.1 7.54* Emulated FP128 is hard on FP64 units and G7 beats Vega again.
G7 ties with Mobile Vega in FP32 which in itself is a great achievement but FP16 is much faster. Unfortunately, without native FP64 support – G7 is a lot slower using emulation – but hopefully mobile systems don’t use high-precision kernels.

* Emulated FP64 through FP32.
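The article does not say which emulation scheme is used. One common approach is "double-single" arithmetic, where each value is stored as a pair of FP32 numbers (high and low parts). The sketch below illustrates the idea in plain Python, rounding every intermediate result to FP32 with the standard `struct` module; the helper names `f32`, `two_sum` and `df_add` are ours.

```python
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Knuth's TwoSum in FP32: returns (s, e) with s + e == a + b exactly."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def df_add(x, y):
    """Add two (hi, lo) FP32 pairs, keeping roughly twice the FP32 precision."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))  # fast renormalisation of the pair
    return hi, lo

small = 2.0 ** -30
print(f32(1.0 + small))                  # plain FP32 drops the small term
print(df_add((1.0, 0.0), (small, 0.0)))  # the pair keeps it in the low word
```

A single emulated addition costs around ten FP32 operations in this scheme, which is consistent with the emulated rows trailing native FP64 hardware.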

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 0.88 1.14 2.58 2.6 [+1%] G7 manages to tie with Vega on this streaming test.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.1 1.42 3.3 3.4 [+2%] Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 1.1 1.83 3.36 2.26 [-33%] Without crypto acceleration G7 cannot match Vega.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 3 4.45 14.29 6.9 [1/2x] With 128-bit G7 is 1/2 speed of Vega.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 6.79 10.6 18.77 14.18 [-24%] 64-bit integer workload is still 25% slower.
Thanks to the fast LP-DDR4X memory and its high bandwidth, G7 performance ties with Vega on integer workloads. However, G7 has no crypto acceleration, so Vega is much faster on hashing – thus crypto-currency/coin algorithms still favour AMD.
GPGPU Finance Benchmark Black-Scholes float/FP16 (MOPT/s) 1,170 1,470 1,720 2,340 [+36%] With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 710 758 829 1,310 [+58%] With FP32 G7 is now even faster – 60% faster than Vega.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 158 264 185 No FP64 support.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 95.7 153 254 292 [+15%] Binomial uses thread shared data thus stresses the memory system so G7 is just 15% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 20.32 31.1 15.67 No FP64 support.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 240 392 362 719 [+2x] Monte-Carlo also uses thread shared data but read-only and here G7 is 2x faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 35.27 59.7 47.13 No FP64 support.
For financial FP32/FP16 workloads, G7 is between 15% and 100% faster than Vega – thus for financial workloads it is a great choice. Unfortunately, due to lack of FP64 support – it cannot run high-precision workloads which may be a problem for some algorithms.
GPGPU Science Benchmark HGEMM (GFLOPS) float/FP16 142 220 884 563 [-36%] G7 cannot beat Vega despite previous FP16 great performance.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 119 162 314 419 [+33%] With FP32, G7 is 33% faster than Vega.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 44.2 65.1 62.5 No FP64 support
GPGPU Science Benchmark HFFT (GFLOPS) float/FP16 39.77 42.54 61.34 61.4 [=] G7 manages to tie with Vega here.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 23.8 29.69 31.48 39.22 [+25%] With FP32, G7 is 25% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 4.81 3.43 14.19 No FP64 support
GPGPU Science Benchmark HNBODY (GFLOPS) float/FP16 383 597 623 930 [+49%] G7 comes up strong here winning by 50%.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 209 327 537 566 [+5%] With FP32, G7 drops to just 5% faster than Vega.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 26.93 44.19 44
On scientific algorithms, G7 manages to beat Vega between 25-50% with FP32 precision and sometimes with FP16 as well. Again, the lack of FP64 support means all the high-precision kernels cannot be used which for some algorithms may be a problem.
GPGPU Image Processing Blur (3×3) Filter single/FP16 (MPix/s) 1,000 1,370 2,273 3,520 [+55%] With FP16, G7 is only 50% faster than Vega.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 498 589 781 1,570 [+2x] In this 3×3 convolution algorithm, G7 is 2x faster.
GPGPU Image Processing Sharpen (5×5) Filter single/FP16 (MPix/s) 307 441 582 1,000 [+72%] With FP16, G7 is just 70% faster.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 108 143 157 319 [+2x] Same algorithm but more shared data, G7 still 2x faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP16 (MPix/s) 284 435 619 924 [+49%] With FP16, G7 is again 50% faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 112 156 161 328 [+2x] With even more data the gap remains at 2x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP16 (MPix/s) 309 428 595 1,000 [+68%] With FP16 precision, G7 is 70% faster than Vega.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 108 145 155 318 [+2x] Still convolution but with 2 filters – same 2x difference.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP16 (MPix/s) 8.78 8.23 7.68 26.63 [+2.5x] With FP16, G7 is “just” 2.5x faster than Vega.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 7.87 6.29 4.06 26.9 [+5.6x] Different algorithm allows G7 to fly at 6x faster.
GPGPU Image Processing Oil Painting Quantise Filter single/FP16 (MPix/s) 9.6 9.14 24.34 G7 does similarly well with FP16
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 8.84 6.77 2.59 19.63 [+6.6x] Without major processing, this filter is 6x faster on G7.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP16 (MPix/s) 1,000 1,620 2,091 1,740 [-17%] With FP16, G7 is 17% slower than Vega.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 1,000 1,560 2,100 1,870 [-11%] This algorithm is 64-bit integer heavy thus G7 is 10% slower
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP16 (MPix/s) 36.5 34.32 1,046 215 [1/5x] Some issues needed to be worked out here.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 433 649 608 950 [+56%] One of the most complex and largest filters, G7 is over 50% faster.
For image processing tasks, G7 does very well: with FP32 it is about 2x faster than Vega, while with FP16 precision the lead drops to around 50% (Vega benefits greatly from the lower precision). All in all a fantastic result for those using image/video manipulation algorithms.
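The Diffusion filter in the table is labelled XorShift and described as 64-bit integer heavy. As an illustration of why, here is a minimal sketch of Marsaglia's 64-bit xorshift generator with the (13, 7, 17) shift triple. Whether the benchmark uses this exact variant is an assumption, as the article does not show its kernel.

```python
MASK64 = (1 << 64) - 1

def xorshift64(state):
    """One step of Marsaglia's 64-bit xorshift (shifts 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state

state = 1  # any non-zero seed works; zero would stay zero forever
for _ in range(3):
    state = xorshift64(state)
    print(state)
```

Every step is three 64-bit shifts and three 64-bit XORs, so a GPU whose integer units are natively 32 bits wide has to split each one into several instructions.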

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 21.36 23.66 27.32 36.3 [+33%] G7 has 33% more bandwidth than Vega.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 10.4 11.77 4.74 17 [+2.6x] G7 manages far higher transfers.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 10.55 11.75 5 18 [+2.6x] Again, same 2.6x delta.
Thanks to the fast LP-DDR4X memory, G7 has far more bandwidth than Vega or older GT2/GT3 design; this no doubt helps streaming algorithms as we have seen above.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 232 277 412 343 [-17%] Better latency than Vega but not less than old arch.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 363 436 519 433 [-17%] Similar 17% less than Vega.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 153 213 201 267 [+33%] Vega seems to be a lot faster than G7.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 236 252 411 350 [-15%] Same latency as global as not dedicated.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 72.5 100 22.5 16.7 [-26%] G7 has greatly reduced shared memory latency.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 1,116 1,500 278 1,100 [+3x] Not much improvement over older versions.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 1,178 1,533 418 1,018 [+1.4x] Similar high latency for G7.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 1,057 1,324 122 973 [+8x] Again Vega has much lower latencies.
Despite high bandwidth, the latencies are high as LP-DDR4 has higher latencies than standard DDR4 (tens of clocks). Like Vega there is no dedicated constant memory – unlike nVidia. But G7 has greatly reduced shared memory latency to less than Vega which greatly helps algorithms using shared memory.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It’s great to see Intel taking graphics seriously again; with ICL, you don’t just get a brand-new core but a much updated GPU core too. And it does not disappoint – it trades blows with competition (Vega Mobile) and usually wins while it is close to 2x faster than Gen9/GT3 and 3x faster than Gen9.5/GT2 – a huge improvement.

The lack of native FP64 support is puzzling – but then again it could be reserved for higher-end/workstation versions if supported at all. Intel no doubt is betting on the CPU’s AVX512 SIMD cores for FP64 performance which is considerable. Again, it’s not very likely that mobile (ULV) platforms are going to run high-precision kernels.

The memory bandwidth is also 50% higher but unfortunately latencies are also higher due to LP-DDR4(X) memory; lower-end versions using “standard” DDR4 memory will not see high bandwidth but will see lower latencies – thus it is give and take.

As we’ve said in the other reviews of ICL, if you have been waiting to upgrade from the much older – but still good – SKL/KBL with Gen8/9 GT2 GPU – the Gen11 GPU is a significant upgrade. You will no longer feel “inadequate” compared to competing integrated GPUs. Naturally, you cannot expect discrete GPU levels of performance but for an integrated APU it is more than sufficient.

Overall with CPU and memory improvements, ICL-U is a very compelling proposition that cost permitting should be your top choice for long-term use.

In a word: Highly Recommended!
