Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance

Intel Iris Xe Gen 12

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel, the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch and is thus built on an updated 10nm++ process, again launched for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Gen12 (XE-LP) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each, 5400Mt/s)
  • No eDRAM cache (as CrystalWell and co had), unfortunately
  • New Image Processing Unit (IPU6) up to 4K90 resolution
  • New 2x Media Encoders HEVC 4K60-10b 4:4:4 & 8K30-10b 4:2:0
  • PCIe 4.0

While ICL had already greatly upgraded the GP-GPU to gen 11 cores (and more than doubled them to 64EU for G7), TGL upgrades them yet again to “XE”-LP gen 12 cores, now all the way up to 96EUs. While most features again seem to be geared towards gaming and media (with new image processing and media encoders), there should be a few new instructions for AI, hopefully exposed through an OpenCL extension.

Again there is no FP64 support (!), while FP16 is naturally supported at 2x rate as before. BF16 should also be supported by a future driver. Int32 and Int16 performance has reportedly doubled, with Int8 now supported and DP4A accelerated.
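To illustrate what a DP4A-style operation computes, here is a minimal Python sketch: a dot product of two 4-element Int8 vectors added to a 32-bit accumulator. The function names `to_int8` and `dp4a` are hypothetical, and the wrap-around (rather than saturating) overflow behaviour is an assumption, not something confirmed for Xe hardware.

```python
def to_int8(value: int) -> int:
    """Wrap an integer into the signed 8-bit range [-128, 127]."""
    value &= 0xFF
    return value - 256 if value >= 128 else value


def dp4a(a: list[int], b: list[int], acc: int) -> int:
    """Dot product of two 4-element int8 vectors, added to a 32-bit accumulator."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dp4a expects exactly four elements per operand")
    total = acc + sum(to_int8(x) * to_int8(y) for x, y in zip(a, b))
    # Assumption: wrap into the signed 32-bit range, as a plain register would.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


# 1*5 + 2*6 + 3*7 + 4*8 = 70, plus the accumulator value 10
print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # prints 80
```

Doing four multiply-adds in one instruction is what makes this attractive for quantised (Int8) AI inference.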

The new memory controller supports DDR5 / LPDDR5 (5400Mt/s) that should, once such memory becomes readily available, provide more bandwidth for the EU cores; until then LPDDR4X can clock faster than before (4267Mt/s). There is no mention of an eDRAM (L4) cache at all.
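As a rough guide to what these transfer rates mean, theoretical peak bandwidth is simply the transfer rate multiplied by the bus width in bytes. The sketch below assumes a 128-bit total bus width, which is our assumption for illustration and not a figure stated above; the function name `peak_bandwidth_gbs` is hypothetical.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int = 128) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes).

    bus_width_bits defaults to 128, an assumed total width for illustration.
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


for name, rate in [("LPDDR4X-4267", 4267), ("LPDDR5-5400", 5400)]:
    print(f"{name}: {peak_bandwidth_gbs(rate):.1f} GB/s")
```

Measured bandwidth is always lower than this theoretical figure, as the CPU cores share the same memory.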

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously, perhaps with the forthcoming DG1 discrete graphics.

GPGPU (Xe-LP G7) Performance Benchmarking

In this article we test GPGPU core performance.

Hardware Specifications

We are comparing the mid-range Intel integrated GP-GPUs with the previous generation, as well as with competing architectures, with a view to upgrading to a brand-new, high-performance design.

GPGPU Specifications | Intel Iris XE-LP G7 | Intel XE-LP G1 | Intel Iris Plus (IceLake) G7 | AMD Vega 8 (Ryzen5) | Comments
Arch Chipset | EV12 / G7 | EV12 / G1 | EV11 / G7 | GCN1.5 | The first G12 from Intel.
Cores (CU) / Threads (SP) | 96 / 768 | 32 / 256 | 64 / 512 | 8 / 512 | 50% more cores vs. G11
SIMD per CU / Width | 8 | 8 | 8 | 64 | Same SIMD width
Wave/Warp Size | 32 | 32 | 16/32 | 64 | Wave size matches nVidia
Speed (Min-Turbo) | 1.2GHz | 1.15GHz | 1.1GHz | 1.1GHz | Turbo speed has slightly increased.
Power (TDP) | 15-35W | 15-35W | 15-35W | 15-35W | Similar power envelope.
ROP / TMU | 24 / 48 | 8 / 16 | 16 / 32 | 8 / 32 | ROPs and TMUs have also increased 50%.
Shared Memory | 64kB | 64kB | 64kB | 32kB | Same shared memory but 2x Vega.
Constant Memory | 3.2GB | 3.2GB | 2.7GB | 3.2GB | No dedicated constant memory but large.
Global Memory | 2x LP-DDR4X 4267Mt/s (LPDDR5 5400Mt/s) | 2x LP-DDR4X 4267Mt/s | 2x LP-DDR4X 3733Mt/s | 2x DDR4-2400 | Can support faster (LP)DDR5 in the future.
Memory Bandwidth | 42GB/s | 42GB/s | 58GB/s | 42GB/s | Highest (possible) bandwidth ever
L1 Caches | 64kB x 6 | 64kB x 2 | 16kB x 8 | 16kB x 8 | L1 is much larger.
L3 Cache | 3.8MB | ? | 3MB | ? | L3 has modestly increased.
Maximum Work-group Size | 256×256 | 256×256 | 256×256 | 1024×1024 | Vega supports 4x bigger workgroups.
FP64/double ratio | No! | No! | No! | Yes, 1/16x | No FP64 support in current drivers!
FP16/half ratio | 2x | 2x | 2x | 2x | Same 2x ratio
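
The thread (SP) counts and turbo clocks in the specification table above allow a back-of-the-envelope estimate of theoretical peak throughput. The sketch below assumes each SP retires one fused multiply-add (counted as 2 FLOPs) per clock at turbo speed; that is a common rule of thumb and our assumption here, not a figure from Intel or AMD. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(shader_units: int, clock_ghz: float,
                flops_per_clock: int = 2, rate_multiplier: float = 1.0) -> float:
    """Theoretical peak GFLOPS = units x clock (GHz) x FLOPs per clock x rate.

    flops_per_clock=2 assumes one FMA per unit per clock (an assumption).
    rate_multiplier=2.0 models the 2x FP16 rate listed in the table.
    """
    return shader_units * clock_ghz * flops_per_clock * rate_multiplier


# (threads/SP, turbo clock in GHz) taken from the specification table
gpus = {
    "Iris XE-LP G7": (768, 1.2),
    "XE-LP G1": (256, 1.15),
    "Iris Plus (IceLake) G7": (512, 1.1),
    "Vega 8": (512, 1.1),
}

for name, (units, clock) in gpus.items():
    fp32 = peak_gflops(units, clock)
    fp16 = peak_gflops(units, clock, rate_multiplier=2.0)
    print(f"{name}: FP32 ~{fp32:.0f} GFLOPS, FP16 ~{fp16:.0f} GFLOPS")
```

Under these assumptions the XE-LP G7 has roughly 1.6x the theoretical FP32 throughput of the IceLake G7 (50% more units plus a slightly higher clock); real workloads reach only a fraction of such peaks.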

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks | Intel Iris XE-LP G7 96EV | Intel XE-LP G1 32EV | Intel Iris Plus (IceLake) G7 64EV | AMD Vega 8 (Ryzen5) | Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 4,342 [+54%] 1,419 2,820 2,000 Xe beats EV11 by over 50% using FP16!
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 2,062 [+55%] 654 1,330 1,350 Standard FP32 is just as fast, 55% faster.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 98.6* [+41%] 31.3* 70* 111 Without native FP64 support Xe craters like old EV11.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 9.91* [+31%] 3.49* 7.54* 7.11 Emulated FP128 is even harder for Xe.
Starting off, we see almost perfect scaling with the increase in EU count, with Xe 50% faster than the old EV11. Unfortunately, again without native FP64 support, it cannot match the competition. For FP64 workloads you’ll have to use the CPU; for ULV that may be OK, but for the discrete DG1 that is not so great.

* Emulated FP64 through FP32.
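
For readers unfamiliar with the idea, emulating higher precision with FP32 is commonly done by storing each value as an unevaluated sum of two FP32 numbers ("double-single" arithmetic). The Python sketch below illustrates that generic technique only; it is an assumption about the general approach and not the code used in these benchmarks. All function names are hypothetical, and FP32 rounding is simulated with `struct`.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def split(x: float) -> tuple[float, float]:
    """Represent x as an unevaluated sum hi + lo of two FP32 values."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's error-free addition: returns (s, e) with s + e == a + b."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def ds_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two double-single numbers (simple variant, small error possible)."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo


hi, lo = split(math.pi)
print("FP32 error:         ", abs(f32(math.pi) - math.pi))
print("double-single error:", abs((hi + lo) - math.pi))

total = ds_add(split(0.1), split(0.2))
print("0.1 + 0.2 ~", total[0] + total[1])
```

Each emulated addition costs many FP32 operations, which is consistent with emulated FP64 scores being a small fraction of the FP32 ones.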

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 7.9 [+3x] 2.54 2.6 2.58 Integer performance is 3x faster than EV11
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 3.54 3.38 3.3 Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 20.52 [+3x] 6.81 6.9 14.29 Xe beats Vega even with its acceleration.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 13.34 14.18 18.77 With 128-bit Xe is even faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 2.26 3.36 64-bit integer workload is also stellar.
Despite our sample using slower DDR4 memory vs. the LP-DDR4X of ICL/EV11, integer performance is 3x faster, a huge upgrade. It even manages to beat AMD’s Vega with its crypto acceleration instructions (media ops). While the cryptocurrency frenzy has died down (nobody is likely to mine coins on ULV GP-GPUs), the dedicated DG1 may be a serious crypto-cracker GPU.
GPGPU Finance Benchmark Black-Scholes float/FP16 (MOPT/s) 1,111 2,340 1,720 With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1,603 [+22%] 993 1,310 829 With FP32 Xe is 22% faster.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 116 292 270 Binomial uses thread shared data thus stresses the memory system.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 334 [+14%] 111 292 254 With FP32, Xe is just 14% faster.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 470 667 584 Monte-Carlo also uses thread shared data but read-only.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1,385 [+94%] 444 719 362 With FP32 code Xe is 2x faster than EV11.
For financial FP32/FP16 workloads, Xe is not always much faster than EV11, with two algorithms just 14-22% faster but one almost 2x as fast. Again, due to the lack of FP64 support, it cannot run high-precision workloads, which may be a problem for some algorithms.
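
The relative improvements quoted for the financial tests can be reproduced from the raw scores. A minimal sketch, using the Xe G7 and IceLake G7 FP32 values from the rows above (the helper name `improvement_percent` is hypothetical):

```python
def improvement_percent(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# (Xe G7 score, IceLake G7 score) for the FP32 financial benchmarks
fp32_results = {
    "Black-Scholes": (1603, 1310),
    "Binomial": (334, 292),
    "Monte-Carlo": (1385, 719),
}

for name, (xe, ev11) in fp32_results.items():
    print(f"{name}: {improvement_percent(xe, ev11):+.0f}%")
```

Percentages computed this way from the rounded table scores may differ by a point or so from the bracketed figures.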

This does not bode well for the dedicated DG1, as it would be the only discrete card without native FP64 support, unlike the competition. However, it is likely that (some) FP64 units will be included, unless Intel aims it squarely at gamers (only).

GPGPU Science Benchmark HGEMM (GFLOPS) float/FP16 528 563 884 Vega still has great performance with FP16.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 683 [+64%] 419 314 With FP32, Xe is 64% faster than EV11.
GPGPU Science Benchmark HFFT (GFLOPS) float/FP16 33.32 61.4 61.34 Vega does very well here also with FP16.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 52.7 [+34%] 39.2 31.5 With FP32, Xe is 34% faster.
GPGPU Science Benchmark HNBODY (GFLOPS) float/FP16 652 930 623 All Intel GPUs do well here.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 908 [+60%] 566 537 With FP32, Xe is 60% faster.
On scientific algorithms, Xe does much better, managing 35-65% better performance than EV11 and generally trouncing Vega on FP32, though not quite on FP16. Shall we mention the lack of FP64 again?
GPGPU Image Processing Blur (3×3) Filter single/FP16 (MPix/s) 3,520 2,273
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 4,725 [+3x] 1,649 1,570 782 In this 3×3 convolution algorithm, Xe is 3x faster!
GPGPU Image Processing Sharpen (5×5) Filter single/FP16 (MPix/s) 1,000 582
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,354 [+4.2x] 436 319 157 Same algorithm but more shared data, Xe is 4x faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP16 (MPix/s) 924 619
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 727 [+2.2x] 232 328 161 With even more data Xe is 2x faster.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP16 (MPix/s) 1,000 595
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,354 [+4.26x] 435 318 155 Still convolution but with 2 filters – 4.3x faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP16 (MPix/s) 26.63 7.69
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.73 [+33%] 16.27 26.91 4.06 Different algorithm Xe just 33% faster.
GPGPU Image Processing Oil Painting Quantise Filter single/FP16 (MPix/s) 24.34
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 23.95 [+22%] 11.11 19.63 2.59 Without major processing, Xe is only 22% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP16 (MPix/s) 1,740 2,091
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 2,772 [+48%] 1,175 1,870 2,100 This algorithm is 64-bit integer heavy; Xe is 48% faster than EV11.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP16 (MPix/s) 215 1,046
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 916 [-4%] 551 950 608 One of the most complex and largest filters, Xe ties with EV11.
For image processing tasks, Xe seems to do best, with up to 4x better performance, likely due to an updated compiler and drivers. In any case, for such tasks upgrading to TGL will give you a huge boost. (Fortunately there is no FP64 processing here.)
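
Several of the filters above are small convolutions. As a reminder of what such a kernel does per pixel, here is a minimal pure-Python sketch of a 3×3 box blur; it is an illustration under our own assumptions (border pixels left unchanged, a uniform kernel) and not the benchmark's actual OpenCL code. The names `convolve3x3` and `BOX_BLUR` are hypothetical.

```python
def convolve3x3(image: list[list[float]], kernel: list[list[float]]) -> list[list[float]]:
    """Apply a 3x3 kernel to a 2D image; border pixels are copied unchanged.

    The kernel is applied without flipping, which is equivalent to a true
    convolution for symmetric kernels such as the box blur below.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    acc += image[y + ky - 1][x + kx - 1] * kernel[ky][kx]
            out[y][x] = acc
    return out


# Uniform 3x3 kernel: each output pixel is the mean of its neighbourhood
BOX_BLUR = [[1.0 / 9.0] * 3 for _ in range(3)]

image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(convolve3x3(image, BOX_BLUR)[1][1])
```

On a GPU each output pixel maps to one work-item, and neighbouring work-items re-read the same input pixels, which is why larger filters lean more heavily on shared memory.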

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 44.92 [+27%] 45.9 36.3 27.2 Xe manages to squeeze more bandwidth out of DDR4.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 7.75 [-54%] 7.7 17 4.74 Uploads are about half as fast at this time.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 7.6 [-58%] 7.6 185 Download bandwidth is not much better.
Thanks to the faster LP-DDR4X memory, Xe has even higher bandwidth than EV11; with future DDR5 / LPDDR5 this will increase further. At this time, perhaps due to the driver, the upload/download bandwidths are about half those of EV11.

Final Thoughts / Conclusions

Once again Intel seems to be taking graphics seriously: for the 2nd time in a row we have a major graphics upgrade, with Xe bringing big upgrades in EV core count, performance and bandwidth. Overall it seems to be 50% faster than EV11, with lower-end devices benefiting most from the upgrade. While the competition once seemed unassailable, Intel has managed to close the gap and overtake it.

However, this is still a core aimed at gamers and it does not provide much for GP-GPU; the improved integer performance, 3 times better (!), is very much welcome, but there are only a few, specific instructions for AI. The lack of FP64 makes it unsuitable for high-precision financial and scientific workloads, something that the old EV7-9 cores could do reasonably well (all things considered).

For integrated graphics, this is not a problem, as not many people would expect a ULV GPU core to run compute-heavy workloads; however, the dedicated DG1 card would really be out-spec’d by the competition, with even old, low-end devices providing more features. That said, the dedicated DG1 is likely to include (some) FP64 units and/or additional units, unlike the low-power (LP ULV) integrated versions.

Getting back to ULV, Xe-LP’s performance completely obsoletes devices (e.g. SKL/KBL/WHL/CML-ULV) using the older EV9x cores – unless you really don’t plan on using them except for “business 2D graphics” or displaying the desktop.

If you have not upgraded to ICL yet, TGL is a far better, more compelling proposition that should be your (current) top choice for long-term use. For ICL owners, there is still a lot to gain, though the upgrade is not as massive as previous ones.

In a word: Highly Recommended!
