Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance

Intel Iris Xe Gen 12

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel, the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch built on an updated 10nm++ process, and is again launched first for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Gen12 (XE-LP) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each, 5400Mt/s)
  • No eDRAM cache unfortunately (unlike CrystalWell and co)
  • New Image Processing Unit (IPU6) up to 4K90 resolution
  • New 2x Media Encoders HEVC 4K60-10b 4:4:4 & 8K30-10b 4:2:0
  • PCIe 4.0

While ICL had already greatly upgraded the GP-GPU to gen 11 cores (and more than doubled them to 64EU for G7), TGL upgrades them yet again to “XE”-LP gen 12 cores, now all the way up to 96EUs. While again most features seem to be geared towards gaming and media (with new image processing and media encoders), there should be a few new instructions for AI, hopefully exposed through an OpenCL extension.

Again there is no FP64 support (!) while FP16 is naturally supported at 2x rate as before. BF16 should also be supported by a future driver. Int32, Int16 performance has reportedly doubled with Int8 now supported and DP4A accelerated.
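As a rough illustration of what a DP4A-style operation computes, the sketch below treats two 32-bit words as four packed signed 8-bit lanes each, multiplies them lane by lane and adds the four products to a 32-bit accumulator. The names `dp4a` and `pack4`, and the choice of signed lanes, are assumptions made for this example rather than Intel's definition of the instruction.

```python
import struct

def pack4(lanes):
    """Hypothetical helper: pack four signed 8-bit values into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<4b", *lanes))[0]

def dp4a(a_packed: int, b_packed: int, acc: int = 0) -> int:
    """Dot product of four packed signed 8-bit lanes, added to an accumulator."""
    # Reinterpret each 32-bit word as four signed bytes (little-endian lanes).
    a_lanes = struct.unpack("<4b", struct.pack("<I", a_packed & 0xFFFFFFFF))
    b_lanes = struct.unpack("<4b", struct.pack("<I", b_packed & 0xFFFFFFFF))
    total = acc + sum(x * y for x, y in zip(a_lanes, b_lanes))
    # Wrap to a signed 32-bit result, as a fixed-width accumulator would.
    return (total + 2**31) % 2**32 - 2**31

a = pack4([1, -2, 3, 4])
b = pack4([5, 6, -7, 8])
print(dp4a(a, b, acc=10))  # 1*5 + (-2)*6 + 3*(-7) + 4*8 + 10
```

One such operation replaces four multiplies and four adds, which is why Int8 inference workloads benefit from it.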

The new memory controller supports DDR5 / LPDDR5 (5400Mt/s) that should, once such memory becomes readily available, provide more bandwidth for the EU cores; until then LPDDR4X can be clocked faster than on ICL (4267Mt/s vs. 3733Mt/s). There is no mention of an eDRAM (L4) cache at all.

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously, perhaps with the forthcoming DG1 discrete graphics.

GPGPU (Xe-LP G7) Performance Benchmarking

In this article we test GPGPU core performance.

Hardware Specifications

We are comparing the middle-range Intel integrated GP-GPUs with previous generation, as well as competing architectures with a view to upgrading to a brand-new, high performance, design.

GPGPU Specifications | Intel Iris XE-LP G7 | Intel XE-LP G1 | Intel Iris Plus (IceLake) G7 | AMD Vega 8 (Ryzen5) | Comments
Arch Chipset | EV12 / G7 | EV12 / G1 | EV11 / G7 | GCN1.5 | The first G12 from Intel.
Cores (CU) / Threads (SP) | 96 / 768 | 32 / 256 | 64 / 512 | 8 / 512 | 50% more cores vs. G11
SIMD per CU / Width | 8 | 8 | 8 | 64 | Same SIMD width
Wave/Warp Size | 32 | 32 | 16/32 | 64 | Wave size matches nVidia
Speed (Min-Turbo) | 1.2GHz | 1.15GHz | 1.1GHz | 1.1GHz | Turbo speed has slightly increased.
Power (TDP) | 15-35W | 15-35W | 15-35W | 15-35W | Similar power envelope.
ROP / TMU | 24 / 48 | 8 / 16 | 16 / 32 | 8 / 32 | ROPs and TMUs have also increased 50%.
Shared Memory | 64kB | 64kB | 64kB | 32kB | Same shared memory but 2x Vega.
Constant Memory | 3.2GB | 3.2GB | 2.7GB | 3.2GB | No dedicated constant memory but large.
Global Memory | 2x LP-DDR4X 4267Mt/s (LPDDR5 5400Mt/s) | 2x LP-DDR4X 4267Mt/s | 2x LP-DDR4X 3733Mt/s | 2x DDR4-2400 | Can support faster (LP)DDR5 in the future.
Memory Bandwidth | 42GB/s | 42GB/s | 58GB/s | 42GB/s | Highest (possible) bandwidth ever
L1 Caches | 64kB x 6 | 64kB x 2 | 16kB x 8 | 8x 16kB | L1 is much larger.
L3 Cache | 3.8MB | ? | 3MB | ? | L3 has modestly increased.
Maximum Work-group Size | 256×256 | 256×256 | 256×256 | 1024×1024 | Vega supports 4x bigger workgroups.
FP64/double ratio | No! | No! | No! | Yes, 1/16x | No FP64 support in current drivers!
FP16/half ratio | 2x | 2x | 2x | 2x | Same 2x ratio

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel Iris XE-LP G7 96EV Intel XE-LP G1 32EV Intel Iris Plus (IceLake) G7 64EV AMD Vega 8 (Ryzen5) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 4,342 [+54%] 1,419 2,820 2,000 Xe beats EV11 by over 50% using FP16!
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 2,062 [+55%] 654 1,330 1,350 Standard FP32 scales just as well, 55% faster.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 98.6* [+41%] 31.3* 70* 111 Without native FP64 support Xe craters like old EV11.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 9.91* [+31%] 3.49* 7.54* 7.11 Emulated FP128 is even harder for Xe.
Starting off, we see almost perfect scaling with improvement in EUs, with Xe 50% faster than old EV11. Unfortunately, again without native FP64 support – it cannot match the competition. For FP64 workloads – you’ll have to use the CPU; for ULV that may be OK but for discrete DG1 that is not so great.
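The scaling claim can be checked against the Mandel scores in the table. The minimal sketch below recomputes the bracketed deltas and divides each speed-up by the 96/64 increase in EUs; the helper name `percent_delta` is made up for this example, and the small turbo clock difference between the two parts is deliberately ignored.

```python
def percent_delta(new: float, old: float) -> float:
    """Relative change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0

# Mandel scores (Mpix/s) from the table: (Xe-LP G7 96EU, EV11 G7 64EU)
scores = {
    "FP16": (4342, 2820),
    "FP32": (2062, 1330),
    "FP64 (emulated)": (98.6, 70),
    "FP128 (emulated)": (9.91, 7.54),
}
eu_ratio = 96 / 64  # Xe-LP G7 has 1.5x the execution units of EV11 G7

for name, (xe, ev11) in scores.items():
    speedup = xe / ev11
    # Scaling efficiency: measured speed-up relative to the EU increase.
    print(f"{name}: {percent_delta(xe, ev11):+.0f}%, "
          f"scaling efficiency {speedup / eu_ratio:.2f}")
```

An efficiency close to 1.0 means the extra EUs are fully used; the emulated precisions fall below that.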

* Emulated FP64 through FP32.
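The article does not say how the emulation is implemented. One common way to get extra precision from FP32-only hardware is “double-single” arithmetic, where each value is stored as an unevaluated sum of two FP32 numbers. The sketch below illustrates that general technique on the CPU, rounding every intermediate to FP32 via `struct`; it is an assumption about the approach, not the benchmark's actual kernel code, and all function names are hypothetical.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a: float, b: float):
    """Error-free addition of two FP32 values: returns (sum, rounding error)."""
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err

def ds_from_float(v: float):
    """Split a value into a (hi, lo) pair of FP32 numbers."""
    hi = f32(v)
    return hi, f32(v - hi)

def ds_add(x, y):
    """Add two double-single numbers, each a (hi, lo) pair of FP32 values."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)              # renormalise so that hi carries the leading bits
    lo = f32(e - f32(hi - s))
    return hi, lo

# 1 + 1e-9 cannot be held in one FP32 value, but survives as a pair.
hi, lo = ds_add(ds_from_float(1.0), ds_from_float(1e-9))
print(f32(1.0 + 1e-9) == 1.0)  # plain FP32 loses the small term
print(hi, lo)                  # the pair keeps it in the low word
```

Every emulated addition costs many FP32 operations, which is consistent with the emulated FP64 scores sitting far below the FP32 ones.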

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 7.9 [+3x] 2.54 2.6 2.58 Integer performance is 3x faster than EV11
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 3.54 3.38 3.3 Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 20.52 [+3x] 6.81 6.9 14.29 Xe beats Vega even with its acceleration.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 13.34 14.18 18.77 With 128-bit Xe is even faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 2.26 3.36 64-bit integer workload is also stellar.
Despite our sample using slower DDR4 memory vs. the LP-DDR4X of ICL/EV11, integer performance is 3x faster – a huge upgrade. It even manages to beat AMD’s Vega with its crypto acceleration instructions (media ops). While the crypto-currency frenzy has died down (nobody is likely to mine coins on ULV GP-GPUs), the dedicated DG1 may be a serious crypto-cracker GPU.
GPGPU Finance Benchmark Black-Scholes float/FP16 (MOPT/s) 1,111 2,340 1,720 With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1,603 [+22%] 993 1,310 829 With FP32 Xe is 22% faster.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 116 292 270 Binomial uses thread shared data thus stresses the memory system.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 334 [+14%] 111 292 254 With FP32, Xe is just 14% faster.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 470 667 584 Monte-Carlo also uses thread shared data but read-only.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1,385 [+94%] 444 719 362 With FP32 code Xe is 2x faster than EV11.
For financial FP32/FP16 workloads, Xe is not always much faster than EV11, with two algorithms just 14-22% faster but one almost 2x as fast. Again, due to the lack of FP64 support, it cannot run high-precision workloads, which may be a problem for some algorithms.

This does not bode well for the dedicated DG1 as it would be the only discrete card without native FP64 support, unlike the competition. However, it is likely that (some) FP64 units will be included unless Intel aims it squarely at gamers (only).

GPGPU Science Benchmark HGEMM (GFLOPS) float/FP16 528 563 884 Vega still has great performance with FP16.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 683 [+64%] 419 314 With FP32, Xe is 64% faster than EV11.
GPGPU Science Benchmark HFFT (GFLOPS) float/FP16 33.32 61.4 61.34 Vega does very well here also with FP16.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 52.7 [+34%] 39.2 31.5 With FP32, Xe is 34% faster.
GPGPU Science Benchmark HNBODY (GFLOPS) float/FP16 652 930 623 All Intel GPUs do well here.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 908 [+60%] 566 537 With FP32, Xe is 60% faster.
On scientific algorithms, Xe does much better, managing 35-65% better performance than EV11 and generally trouncing Vega on FP32, though not quite on FP16. Shall we mention the lack of FP64 again?
GPGPU Image Processing Blur (3×3) Filter single/FP16 (MPix/s) 3,520 2,273
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 4,725 [+3x] 1,649 1,570 782 In this 3×3 convolution algorithm, Xe is 3x faster!
GPGPU Image Processing Sharpen (5×5) Filter single/FP16 (MPix/s) 1,000 582
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,354 [+4.2x] 436 319 157 Same algorithm but more shared data, Xe is 4x faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP16 (MPix/s) 924 619
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 727 [+2.2x] 232 328 161 With even more data Xe is 2x faster.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP16 (MPix/s) 1,000 595
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,354 [+4.26x] 435 318 155 Still convolution but with 2 filters – 4.3x faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP16 (MPix/s) 26.63 7.69
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.73 [+33%] 16.27 26.91 4.06 Different algorithm Xe just 33% faster.
GPGPU Image Processing Oil Painting Quantise Filter single/FP16 (MPix/s) 24.34
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 23.95 [+22%] 11.11 19.63 2.59 Without major processing, Xe is only 22% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP16 (MPix/s) 1,740 2,091
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 2,772 [+48%] 1,175 1,870 2,100 This algorithm is 64-bit integer heavy and Xe is 48% faster than EV11.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP16 (MPix/s) 215 1,046
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 916 [-4%] 551 950 608 One of the most complex and largest filters, Xe ties with EV11.
For image processing tasks, Xe seems to do best, with up to 4x better performance – likely due to updated compiler and drivers. In any case for such tasks, upgrading to TGL will give you a huge boost. (fortunately no FP64 processing here)
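Several of the filters above are plain convolutions that differ mainly in kernel size. As a minimal CPU-side sketch (not the benchmark's OpenCL kernel), the code below applies a square kernel with clamped edges and shows how the number of neighbouring pixels read per output pixel grows from 3×3 to 7×7; `convolve` and `box_kernel` are hypothetical names chosen for the example.

```python
def convolve(image, kernel):
    """Apply a square kernel to a 2D list of floats, clamping reads at the edges."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(k):
                for kx in range(k):
                    # Clamp coordinates so border pixels reuse the nearest valid pixel.
                    sy = min(max(y + ky - r, 0), h - 1)
                    sx = min(max(x + kx - r, 0), w - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out

def box_kernel(size):
    """Uniform blur kernel whose taps sum to 1."""
    return [[1.0 / (size * size)] * size for _ in range(size)]

# Small checkerboard test image.
image = [[float((x + y) % 2) for x in range(6)] for y in range(6)]
blurred = convolve(image, box_kernel(3))

for size in (3, 5, 7):
    print(f"{size}x{size} kernel reads {size * size} pixels per output pixel")
```

On a GPU the neighbouring reads are usually staged through shared (local) memory, which is why the larger kernels put more pressure on that part of the memory system.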

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel Iris XE-LP G7 96EV Intel XE-LP G1 32EV Intel Iris Plus (IceLake) G7 64EV AMD Vega 8 (Ryzen5) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 44.92 [+27%] 45.9 36.3 27.2 Xe manages to squeeze more bandwidth of DDR4.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 7.75 [-54%] 7.7 17 4.74 Uploads run at about half the speed at this time.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 7.6 [-58%] 7.6 185 Download bandwidth is not much better.
Thanks to the faster LP-DDR4X memory, Xe has even higher bandwidth than EV11; with future DDR5 / LPDDR5 this will increase further. At this time, perhaps due to the driver, the upload/download bandwidths are about half those of EV11.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Once again Intel seems to be taking graphics seriously: for the 2nd time in a row we have a major graphics upgrade, with Xe bringing big increases in EV core count, performance and bandwidth. Overall it seems to be 50% faster than EV11, with lower-end devices benefiting most from the upgrade. While the competition once seemed unassailable, Intel has managed to close the gap and overtake it.

However, this is still a core aimed at gamers and it does not provide much for GP-GPU; the improved integer performance is very much welcome, at 3 times better (!), but there are only a few, specific instructions for AI. Lack of FP64 makes it unsuitable for high-precision financial and scientific workloads, something that the old EV7-9 cores could do reasonably well (all things considered).

For integrated graphics, this is not a problem – not many people would expect ULV GPU core to run compute-heavy workloads; however, the dedicated DG1 card would really be out-spec’d by the competition, with even old, low-end devices providing more features. However, dedicated DG1 is likely to include (some) FP64 units and/or additional units unlike the low-power (LP ULV) integrated versions.

Getting back to ULV, Xe-LP’s performance completely obsoletes devices (e.g. SKL/KBL/WHL/CML-ULV) using the older EV9x cores – unless you really don’t plan on using them except for “business 2D graphics” or displaying the desktop.

If you have not upgraded to ICL yet, TGL is a far better, compelling proposition that should be your (current) top choice for long-term use. For ICL owners, there is still a lot to gain from upgrading, though the jump is not as massive as from older generations.

In a word: Highly Recommended!


nVidia Titan RTX / 2080Ti: Turing GPGPU performance in CUDA and OpenCL

nVidia RTX 2080 TI (Turing)

What is “Titan RTX / 2080Ti”?

It is the latest high-end “pro-sumer” card from nVidia with the next-generation “Turing” architecture, the update to the current “Volta” architecture that has had a limited release in Titan/Quadro cards. It powers the new Series 20 top-end (with RTX) and Series 16 mainstream (without RTX) cards that replace the old Series 10 “Pascal” series.

As “Volta” is intended for AI/scientific/financial data-centers, it features high-end HBM2 memory; since “Turing” is meant for gaming, rendering, etc., it has “normal” GDDR6 memory. Similarly, “Turing” has the new RTX (Ray-Tracing) cores for high-fidelity visualisation and image generation, in addition to the Tensor (TSX) cores that “Volta” introduced.

While “Volta” has 1/2 FP64 ratio cores (vs. FP32), “Turing” has the normal 1/32 FP64 ratio cores: for high-precision computation – you need “Volta”. However, as “Turing” maintains the 2x FP16 rate (vs. FP32) it can run low-precision AI (neural networks) at full speed. Old “Pascal” had 1/64x FP16 ratio making it pretty much unusable in most cases.

“Turing” does not have high-end on-package HBM2 memory but instead high-speed GDDR6 memory that has decent bandwidth but is not plentiful – with 1GB missing (11GB instead of 12GB).

With the soon-to-be unveiled “Ampere” (Series 30) architecture on the way, we look at whether you can get “cheap” Titan V performance using a Turing 2080TI consumer card.


Hardware Specifications

We are comparing the top-of-the-range Titan V with previous generation Titans and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications | nVidia Titan RTX / 2080TI (Turing) | nVidia Titan V (Volta) | nVidia Titan X (Pascal) | Comments
Arch Chipset | Turing TU102 (7.5) | Volta GV100 (7.0) | Pascal GP102 (6.1) | The V is the only one using the top-end 100 chip, not the lower-end 102 or 104 versions.
Cores (CU) / Threads (SP) | 68 / 4352 | 80 / 5120 | 28 / 3584 | Not as many cores as Volta but still decent.
ROPs / TMUs | 88 / 272 | 96 / 320 | 96 / 224 | Cannot match Volta but more ROPs per CU for gaming.
FP32 / FP64 / Tensor Cores | 4352 / 136 / 544 | 5120 / 2560 / 640 | 3584 / 112 / no | Maintains the Tensor cores important for AI tasks (neural networks, etc.)
Speed (Min-Turbo) | 1.35GHz (136-1.635) | 1.2GHz (135-1.455) | 1.531GHz (135-1.910) | Clocks have improved over Volta likely due to lower number of SMs.
Power (TDP) | 260W | 300W | 250W (125-300) | TDP is less due to lower CU number.
Global Memory | 11GB GDDR6 14GHz 352-bit | 12GB HBM2 850MHz 3072-bit | 11GB GDDR5X 10GHz 384-bit | As a pro-sumer card it has 1GB less than Volta and same as Pascal.
Memory Bandwidth (GB/s) | 616 | 652 | 512 | Despite no HBM2, bandwidth almost matches due to high speed of GDDR6.
L1 Cache | 2x (32kB + 64kB) | 2x 24kB / 96kB shared | | L1/shared is still the same but ratios have changed.
L2 Cache | 5.5MB (6MB?) | 4.5MB (3MB?) | 3MB | L2 cache reported has increased by 25%.
FP64/double ratio | 1/32x | 1/2x | 1/32x | Low ratio like all consumer cards, Volta dominates here
FP16/half ratio | 2x | 2x | 1/64x | Same rate as Volta, 2x over FP32


Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2 (latest nVidia provides). Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 41,080 / n/a [=] 40,920 / n/a 336 / n/a Right off the bat, Turing matches Volta and is miles faster than old Pascal.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 25,000 / 23,360 [+11%] 22,530 / 21,320 18,000 / 16,000 With standard FP32, Turing even manages to be 11% faster despite fewer CUs.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 812 / 772 [-93%] 11,300 / 10,500 641 / 642 For FP64 you don’t want Turing, you want Volta. At any cost.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 30.4 / 29.1 [-94%] 472 / 468 24.4 / 27 With emulated FP128 precision Turing is again demolished.
Turing manages to improve over Volta in FP16/FP32 despite having fewer CUs – most likely due to faster clocks and optimisations. However, if you do need FP64 precision then Volta reigns supreme – the 1/32 rate of Turing & Pascal just does not cut it.
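The FP64 ratios can be cross-checked in two ways from numbers already given: the FP32/FP64 unit counts in the specification table and the measured Mandel FP32/FP64 scores (CUDA results). The sketch below does both; `fp32_per_fp64` is a hypothetical helper name.

```python
# Unit counts from the specification table, Mandel scores (Mpix/s, CUDA) from the results.
cards = {
    "Turing": {"fp32_units": 4352, "fp64_units": 136, "mandel_fp32": 25000, "mandel_fp64": 812},
    "Volta": {"fp32_units": 5120, "fp64_units": 2560, "mandel_fp32": 22530, "mandel_fp64": 11300},
    "Pascal": {"fp32_units": 3584, "fp64_units": 112, "mandel_fp32": 18000, "mandel_fp64": 641},
}

def fp32_per_fp64(fp64: float, fp32: float) -> float:
    """How many FP32 results per FP64 result (32 means a 1/32x FP64 rate)."""
    return fp32 / fp64

for name, c in cards.items():
    paper = fp32_per_fp64(c["fp64_units"], c["fp32_units"])
    measured = fp32_per_fp64(c["mandel_fp64"], c["mandel_fp32"])
    print(f"{name}: 1/{paper:.0f}x on paper, about 1/{measured:.1f}x measured")
```

The measured ratios land close to the paper ones, which suggests the Mandel test is compute bound on all three cards.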
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 48 / 52 [-33%] 72 / 86 42 / 41 Streaming workloads love Volta’s HBM2 memory, Turing is 33% slower.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 64 / 70 [-30%] 92 / 115 57 / 54 Not a lot changes here, Turing is 30% slower.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 192 / 182 [+7%] 179 / 181 72 / 83 With 64-bit integer workload, Turing manages a 7% win despite “slower” memory.
GPGPU Crypto Benchmark Crypto SHA256 (GB/s) 170 / 125 [-33%] 253 / 188 95 / 60 As with AES, hashing loves HBM2 so Turing is 33% slower than Volta.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 161 / 125 [+56%] 103 / 113 69 / 74 While Turing wins, it is likely a compiler optimisation.
It seems that Turing GDDR6 memory cannot keep up with Volta’s HBM2 – despite the similar bandwidths: streaming algorithms are around 30% slower on Turing. The only win is 64-bit integer workload that is 7% faster on Turing likely due to integer units optimisations.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 17,230 / 17,000 [-7%] 18,480 / 18,860 10,710 / 10,560 Turing is just 7% slower than Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 1,530 / 1,370 [-82%] 8,660 / 8,500 1,400 / 1,340 With FP64, Turing manages less than 1/5 of Volta’s performance.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 4,280 / 4,250 [+4%] 4,130 / 4,110 2,220 / 2,230 Binomial uses thread shared data thus stresses the SMX’s memory system – Turing is 4% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 164 / 163 [-91%] 1,920 / 2,000 131 / 134 With FP64 code Turing is 1/10x slower.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 11,440 / 11,740 [+1%] 11,340 / 12,900 8,100 / 6,000 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – Turing is just 1% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 327 / 263 [-92%] 4,330 / 3,590 304 / 274 Switching to FP64 again Turing is 1/10x slower.
For financial workloads, as long as you only need FP32 (or FP16), Turing can match and slightly outperform Volta; considering the cost that is no mean feat. However, if you do need FP64 precision – as we saw before, there is no contest – Volta is 10x (ten times) faster.
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 34,080 [-16%] 40,790 Using the new Tensor cores, Turing is just 16% slower.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 7,400 / 7,330 [-33%] 11,000 / 10,870 6,280 / 6,600 Perhaps surprisingly, Turing is 33% slower than Volta here.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 502 / 498 [-89%] 4,470 / 4,550 335 / 332 With FP64 precision, Turing is 1/10x slower than Volta.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 1,000 [+2%] 979 FFT somehow allows Turing to match Volta in performance.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 512 / 573 [-5%] 540 / 599 242 / 227 With FP32, Turing is just 5% slower.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 302 / 302 [+1%] 298 / 375 207 / 191 Completely memory bound, Turing matches Volta here.
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 9,000 [-2%] 9,160 N-Body simulation with FP16 is just 2% slower.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 9,330 / 8,120 [+27%] 7,320 / 6,620 5,600 / 4,870 N-Body simulation allows Turing to dominate.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 222 / 295 [-94%] 3,910 / 5,130 275 / 275 With FP64 precision, Turing is again 1/10x slower than Volta.
The scientific scores are a bit more mixed – but again Turing can match or slightly exceed Volta with FP32/FP16 precision – as long as we’re not memory limited; there Volta is still around 30% faster. With FP64 it’s the same story, Turing is about 1/10x slower.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 23,090 / 19,000 [-14%] 26,860 / 29,820 17,860 / 13,680 In this 3×3 convolution algorithm, Turing is 14% slower. Convolution is also used in neural nets (CNN).
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 28,240 [=] 28,310 1,570 With FP16 precision, Turing matches Volta in performance.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 6,000 / 4,350 [-35%] 9,230 / 7,250 4,800 / 3,460 Same algorithm but more shared data makes Turing 35% slower.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 10,580 [-28%] 14,676 609 With FP16 Volta is almost 40% faster than Turing.
GPGPU Image Processing Motion-Blur (7×7) Filter single/FP32 (MPix/s) 6,180 / 4,570 [-33%] 9,420 / 7,470 4,830 / 3,620 Again same algorithm but even more data shared Turing is 33% slower.
GPGPU Image Processing Motion-Blur (7×7) Filter half/FP16 (MPix/s) 10,160 [-31%] 14,651 325 With FP16 nothing much changes in this algorithm.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 6,220 / 4,340 [-30%] 8,890 / 7,000 4,740 / 3,450 Still convolution but with 2 filters – Turing is 30% slower.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 10,100 [-25%] 13,446 309 As we have seen above, Turing is about 25% slower than Volta.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 52.53 / 59.9 [-50%] 108 / 66.34 36 / 55 Different algorithm we see the biggest delta with Turing 50% slower.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 121 [-40%] 204 71 With FP16 Turing reduces the loss to just 40%.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 20.28 / 25.64 [-50%] 41.38 / 23.14 15.14 / 15.3 Without major processing, this filter flies on Volta, again Turing is 50% slower.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 59.55 [-54%] 129 50.75 FP16 precision does not change things.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 24,600 / 29,640 [+1%] 24,400 / 24,870 19,480 / 14,000 This algorithm is 64-bit integer heavy and here Turing is 1% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 22,400 [-8%] 24,292 6,090 FP16 does not help here as we’re at maximum performance.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 3,000 / 10,500 [-20%] 3,771 / 8,760 1,288 / 6,530 One of the most complex and largest filters, Turing is 20% slower than Volta.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 7,850 [-4%] 8,137 461 Switching to FP16, Turing is just 4% slower than Volta.
For image processing, Turing is generally 20-35% slower than Volta somewhat in line with memory performance. If FP16 is sufficient, then we see Turing matching Volta in performance – something that old Pascal could never do.
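The diffusion filter is described as XorShift-based and 64-bit integer heavy. To show why, here is a minimal sketch of a 64-bit xorshift generator using the well-known 13/7/17 shift triple: every step is a handful of 64-bit shifts and XORs. Whether the benchmark uses this exact variant is an assumption, and `noise` is a hypothetical helper.

```python
MASK64 = (1 << 64) - 1

def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state ^= (state << 13) & MASK64  # mask to emulate 64-bit wrap-around
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state

def noise(seed: int, count: int):
    """Yield `count` pseudo-random values in [0, 1) from a non-zero seed."""
    state = seed & MASK64
    for _ in range(count):
        state = xorshift64(state)
        # Keep the top 53 bits so the result is an exact float below 1.0.
        yield (state >> 11) / 2**53

print([round(v, 3) for v in noise(seed=2463534242, count=4)])
```

On hardware without native 64-bit integer units each of these operations is split into several 32-bit ones, so such filters track integer rather than floating-point throughput.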

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Memory Benchmarks nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 494 / 485 [-7%] 534 / 530 356 / 354 GDDR6 provides good bandwidth, only 7% less than HBM2.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.3 / 10.4 [-1%] 11.4 / 11.4 11.4 / 9 Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4!
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 11.9 / 12.3 [-1%] 12.1 / 12.3 12.2 / 8.9 Again no significant difference but we were not expecting any.
Turing’s GDDR6 memory provides almost the same bandwidth as Volta’s expensive HBM2. All cards use PCIe3 x16 connections thus similar upload/download bandwidth. Hopefully the move to PCIe4/5 will improve transfers.
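For context, the measured figures can be set against theoretical peaks. The sketch below computes peak bandwidth as transfer rate times bus width, using the Titan V's HBM2 figures from the specification table, and the PCIe 3.0 x16 limit from its 8 GT/s per lane and 128b/130b encoding. The helper names are hypothetical, and the efficiency figures simply divide the measured CUDA results by the rated bandwidths quoted in the table.

```python
def peak_bandwidth_gbs(transfer_rate_gt: float, bus_width_bits: int) -> float:
    """Theoretical bandwidth in GB/s: transfer rate (GT/s) x bus width in bytes."""
    return transfer_rate_gt * bus_width_bits / 8

def pcie3_bandwidth_gbs(lanes: int) -> float:
    """PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, one bit per transfer."""
    return 8.0 * lanes * (128 / 130) / 8

# HBM2 at 850MHz is double data rate, so 1.7 GT/s over a 3072-bit bus.
hbm2 = peak_bandwidth_gbs(0.850 * 2, 3072)
print(f"Titan V HBM2 peak: {hbm2:.1f} GB/s")

# Measured internal bandwidth (CUDA) against the rated bandwidth in the table.
for name, measured, rated in (("Turing", 494, 616), ("Volta", 534, 652), ("Pascal", 356, 512)):
    print(f"{name}: {measured / rated:.0%} of rated bandwidth")

pcie = pcie3_bandwidth_gbs(16)
print(f"PCIe3 x16 peak: {pcie:.2f} GB/s; measured upload of 11.3 GB/s is {11.3 / pcie:.0%} of that")
```

Seen this way, Turing and Volta reach a similar fraction of their rated bandwidth, and the transfer results sit below the PCIe3 x16 ceiling on every card.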
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 135 / 143 [-25%] 180 / 187 201 / 230 From the start we see global latency accesses reduced by 25%, not a lot but will help.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 243 / 248 [-22%] 311 / 317 286 / 311 Full range random accesses are also 22% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 40 / 43 [-25%] 53 / 57 89 / 121 Sequential accesses have also dropped 25%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 77 / 80 [+2%] 75 / 76 117 / 174 Constant memory latencies seem about the same.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 10.6 / 71 [-41%] 18 / 85 18.7 / 53 Shared memory latencies seem to be improved.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 157 / 217 [-26%] 212 / 279 195 / 196 Texture access latencies have also reduced by 26%.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 268 / 329 [-22%] 344 / 313 282 / 278 As we’ve seen with global memory, we see reduced latencies by 22%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 67 / 138 [-24%] 88 / 163 87 / 123 With sequential access we also see a 24% reduction.
The high data rate of Turing’s GDDR6 brings reduced latencies across the board over HBM2 although, as we’ve seen in the compute benchmarks, this does not always translate into better performance. Still, some algorithms, especially less optimised ones, may benefit at much lower cost.
We see L1 cache effects between 32-64kB tallying with an L1D of 32-48kB (depending on setting) with the other inflexion between 4-8MB matching the 6MB L2 cache.
As with global memory we see the same L1D (32kB) and L2 (6MB) cache effects with similar latencies. Both are significant upgrades over the Titan X’s caches.
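Latency curves of this kind are commonly produced by pointer chasing: each load depends on the result of the previous one, so the hardware cannot overlap or prefetch accesses, and the working-set size is swept to expose cache boundaries. The article does not describe how its tests are implemented, so the sketch below only illustrates the general method on the CPU; `build_chain` and `chase` are hypothetical names, and Python interpreter overhead dominates the absolute timings.

```python
import random
import time

def build_chain(n: int, seed: int = 1):
    """Sattolo's algorithm: a random permutation that forms one single cycle."""
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i guarantees a single n-element cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain

def chase(chain, steps: int) -> int:
    """Follow the chain; every access depends on the previous one."""
    idx = 0
    for _ in range(steps):
        idx = chain[idx]
    return idx

STEPS = 200_000
for kb in (16, 256, 2048):
    n = kb * 1024 // 8  # nominal 8-byte elements
    chain = build_chain(n)
    start = time.perf_counter()
    chase(chain, STEPS)
    ns = (time.perf_counter() - start) / STEPS * 1e9
    print(f"{kb:5d} kB working set: {ns:.1f} ns per access (interpreter overhead included)")
```

Using a single-cycle permutation ensures the whole working set is visited before any element repeats, so the measured time reflects the cache level that the set fits into.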

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

If you wanted to upgrade your old Pascal Titan X but could not afford the Volta’s Titan V – then you can now get a cheap RTX 2080Ti or Titan RTX and get similar if not slightly faster FP16/FP32 performance that blows the not-so-old Titan X out of the water! If you can make do with FP16 and use Tensor cores, we’re looking at 6-8x performance over FP32 using a single card.

Naturally, the FP64 performance is again “gimped” at 1/32x so if that’s what you require, Turing cannot help you there – you will have to get a Volta. But then again the Titan X was similarly “gimped” thus if that’s what you had you still get a decent performance upgrade.

The GDDR6 memory may have similar bandwidth on paper, but in streaming algorithms is about 33% slower than HBM2 so there Turing cannot match Volta, but considering the cost it is a good trade. You will also lose 1GB just like with Titan X but again, not a surprise. Global/Constant/Texture memory access latencies are lower due to the high data rate which should help algorithms that are memory access limited (if you cannot help hide them).

As we’re testing GPGPU performance here, we have not touched on the ray-tracing (RTX) units, but should you happen to play a game or two when you are “resting”, then the Titan RTX / 2080TI might just impress you even more. Here, not even Volta can match it!

All in all – Titan RTX is a compelling (relatively) cheap upgrade over the old Titan X if you don’t require FP64 precision.

nVidia Titan RTX (Turing)

Intel Iris Plus G7 Gen11 IceLake ULV (i7-1065G7) Review & Benchmarks – GPGPU Performance

Intel Iris Plus Graphics

What is “IceLake”?

It is the “proper” 10th generation Core arch (ICL) from Intel – the brand new core to replace the ageing “Skylake” (SKL) arch and its many derivatives; due to delays it actually debuts shortly after the latest update (“CometLake” (CML)) that is also called 10th generation. First launched for mobile ULV (U/Y) devices, it will also be launched for mainstream (desktop/workstations) soon.

Thus it contains extensive changes to all parts of the SoC: CPU, GPU, memory controller:

  • 10nm+ process (lower voltage, performance benefits)
  • Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskeyLake)
  • 64 EUs up to 1.1GHz – up to 1.12 TFLOPS/FP32, 2.25TFLOPS/FP16
  • 2-channel LP-DDR4X support up to 3733Mt/s
  • No eDRAM cache unfortunately (unlike CrystalWell and co)
  • VRS (Variable Rate Shading) – useful for games
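The quoted TFLOPS figures follow from the EU count and clock. As a minimal sketch, assuming 8 FP32 lanes per EU (the SIMD width listed in the specification table) and counting a fused multiply-add as 2 floating-point operations per lane per clock:

```python
def peak_gflops(eus: int, lanes_per_eu: int, clock_ghz: float, rate: float = 1.0) -> float:
    """Peak throughput: lanes x 2 FLOPs per fused multiply-add x clock x precision rate."""
    return eus * lanes_per_eu * 2 * clock_ghz * rate

fp32 = peak_gflops(eus=64, lanes_per_eu=8, clock_ghz=1.1)
fp16 = peak_gflops(eus=64, lanes_per_eu=8, clock_ghz=1.1, rate=2.0)  # FP16 runs at 2x rate
print(f"FP32: {fp32 / 1000:.2f} TFLOPS, FP16: {fp16 / 1000:.2f} TFLOPS")
```

This gives roughly 1.13 and 2.25 TFLOPS, in line with the figures in the list above; these are upper bounds at the maximum turbo clock.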

The biggest change GPGPU-wise is the increase in EUs (64 at the top end) which greatly increases processing power compared to the previous generation, which used few EUs (24, except for the very rare GT3 version). Most of the features seem to be geared towards gaming not GPGPU – thus one omission is FP64 support! While mobile platforms are not very likely to use high-precision kernels, Gen9 FP64 performance did exceed CPU AVX2/FMA FP64 performance. FP16 is naturally supported at 2x rate, as in most current designs.

While there does not seem to be eDRAM (L4) cache at all, thanks to very high-speed LP-DDR4X memory (at 3733Mt/s) the bandwidth has almost doubled (58GB/s) which should greatly help bandwidth-intensive workloads. While L1 does not seem changed, L2 has been increased to 3MB (up from 1MB) which should also help.

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously.

GPGPU (Gen11 G7) Performance Benchmarking

In this article we test GPGPU core performance.

Hardware Specifications

We are comparing the middle-range Intel integrated GP-GPUs with previous generation, as well as competing architectures with a view to upgrading to a brand-new, high performance, design.

GPGPU Specifications | Intel UHD 630 (7200U) | Intel Iris HD 540 (6550U) | AMD Vega 8 (Ryzen 5) | Intel Iris Plus (1065G7) | Comments
Arch Chipset | EV9.5 / GT2 | EV9 / GT3 | Vega / GCN1.5 | EV11 / G7 | The first G11 from Intel.
Cores (CU) / Threads (SP) | 24 / 192 | 48 / 384 | 8 / 512 | 64 / 512 | Less powerful CU but same SP as Vega
SIMD per CU / Width | 8 | 8 | 64 | 8 | Same SIMD width
Wave/Warp Size | 32 | 32 | 64 | 32 | Wave size matches nVidia
Speed (Min-Turbo) | 300-1000MHz | 300-950MHz | 300-1100MHz | 400-1100MHz | Turbo matches Vega.
Power (TDP) | 15-25W | 15-25W | 25W | 15-25W | Same TDP
ROP / TMU | 8 / 16 | 16 / 24 | 8 / 32 | 16 / 32 | ROPs the same but TMUs have increased.
Shared Memory | 64kB | 64kB | 32kB | 64kB | Same shared memory but 2x Vega.
Constant Memory | 1.6GB | 3.2GB | 2.7GB | 3.2GB | No dedicated constant memory but large.
Global Memory | 2x DDR4 2133Mt/s | 2x DDR4 2133Mt/s | 2x DDR4 2400Mt/s | 2x LP-DDR4X 3733Mt/s | Fastest memory ever
Memory Bandwidth | 38GB/s | 38GB/s | 42GB/s | 58GB/s | Highest bandwidth ever
L1 Caches | 16kB x 24 | 16kB x 48 | 8x 16kB | 16kB x 64kB | L1 does not appear changed.
L2 Cache | 512kB | 1MB | ? | 3MB | L2 has tripled in size
Maximum Work-group Size | 256×256 | 256×256 | 1024×1024 | 256×256 | Vega supports 4x bigger workgroups
FP64/double ratio | 1/16x | 1/16x | 1/32x | No! | No FP64 support in current drivers!
FP16/half ratio | 2x | 2x | 2x | 2x | Same 2x ratio

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks | Intel UHD 630 (7200U) | Intel Iris HD 540 (6550U) | AMD Vega 8 (Ryzen 5) | Intel Iris Plus (1065G7) | Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) | 895 | 1,530 | 2,000 | 2,820 [+41%] | G7 beats Vega by 40%! Pretty incredible start.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) | 472 | 843 | 1,350 | 1,330 [-1%] | Standard FP32 is just a tie.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) | 113 | 195 | 111 | 70* | Without native FP64 support G7 craters, but old GT3 beats Vega.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) | 6 | 10.2 | 7.1 | 7.54* | Emulated FP128 is hard on FP64 units and G7 beats Vega again.

G7 ties with Mobile Vega in FP32 which in itself is a great achievement but FP16 is much faster. Unfortunately, without native FP64 support – G7 is a lot slower using emulation – but hopefully mobile systems don’t use high-precision kernels.

* Emulated FP64 through FP32.
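
The footnote only states that FP64 is emulated through FP32; the method the benchmark actually uses is not described. One common approach, shown here purely as an illustrative sketch, is "double-single" arithmetic, where each value is held as the unevaluated sum of two FP32 numbers. The helper names (`f32`, `two_sum`, `ds_from_float`, `ds_add`) are hypothetical, and Python's `struct` module is used only to round intermediate results to FP32.

```python
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def two_sum(a, b):
    """Error-free FP32 addition: returns (s, e) where s is the rounded sum
    and e is the rounding error that s lost."""
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e

def ds_from_float(x):
    """Split an FP64 value into a (hi, lo) pair of FP32 values."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo

def ds_add(x, y):
    """Add two (hi, lo) pairs using only FP32-rounded operations."""
    s, e = two_sum(x[0], y[0])
    e = f32(e + f32(x[1] + y[1]))
    hi = f32(s + e)
    lo = f32(e - f32(hi - s))
    return hi, lo

# Plain FP32 loses the small term entirely; the pair keeps it in `lo`.
plain = f32(f32(1.0) + f32(1e-9))
hi, lo = ds_add(ds_from_float(1.0), ds_from_float(1e-9))
print(plain, hi + lo)
```

Each emulated addition costs many FP32 operations, which fits the large drop seen in the FP64 row for hardware without native support.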

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) | 0.88 | 1.14 | 2.58 | 2.6 [+1%] | G7 manages to tie with Vega on this streaming test.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) | 1.1 | 1.42 | 3.3 | 3.4 [+2%] | Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) | 1.1 | 1.83 | 3.36 | 2.26 [-33%] | Without crypto acceleration G7 cannot match Vega.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) | 3 | 4.45 | 14.29 | 6.9 [1/2x] | With 128-bit G7 is 1/2 speed of Vega.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) | 6.79 | 10.6 | 18.77 | 14.18 [-24%] | 64-bit integer workload is still 25% slower.

Thanks to the fast LP-DDR4X memory and its high bandwidth, G7 performance ties with Vega on integer workloads. However, G7 has no crypto acceleration thus Vega is much faster – thus crypto-currency/coin algorithms still favour AMD.

GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) | 1,170 | 1,470 | 1,720 | 2,340 [+36%] | With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) | 710 | 758 | 829 | 1,310 [+58%] | With FP32 G7 is now even faster – 60% faster than Vega.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) | 158 | 264 | 185 | | No FP64 support.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) | 95.7 | 153 | 254 | 292 [+15%] | Binomial uses thread shared data thus stresses the memory system so G7 is just 15% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) | 20.32 | 31.1 | 15.67 | | No FP64 support.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) | 240 | 392 | 362 | 719 [+2x] | Monte-Carlo also uses thread shared data but read-only and here G7 is 2x faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) | 35.27 | 59.7 | 47.13 | | No FP64 support.

For financial FP32/FP16 workloads, G7 is between 15% and 100% faster than Vega – thus for financial workloads it is a great choice. Unfortunately, due to lack of FP64 support – it cannot run high-precision workloads which may be a problem for some algorithms.

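
To give a sense of what one "option" in the MOPT/s figures involves, here is a minimal sketch of the textbook Black-Scholes price of a European call. It is not the benchmark's kernel; the function and parameter names are illustrative assumptions, and Python floats are FP64, so it corresponds to the double-precision variant.

```python
import math

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, years):
    """Closed-form price of a European call option."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# At-the-money call: spot 100, strike 100, 5% rate, 20% volatility, 1 year.
price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(price, 4))  # about 10.45
```

The work per option is a handful of transcendental functions (log, exp, erf) with no data shared between options, which is why this test scales mostly with raw arithmetic throughput.
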
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 | 142 | 220 | 884 | 563 [-36%] | G7 cannot beat Vega despite its previous great FP16 performance.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 | 119 | 162 | 314 | 419 [+33%] | With FP32, G7 is 33% faster than Vega.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 | 44.2 | 65.1 | 62.5 | | No FP64 support.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 | 39.77 | 42.54 | 61.34 | 61.4 [=] | G7 manages to tie with Vega here.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 | 23.8 | 29.69 | 31.48 | 39.22 [+25%] | With FP32, G7 is 25% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 | 4.81 | 3.43 | 14.19 | | No FP64 support.
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 | 383 | 597 | 623 | 930 [+49%] | G7 comes up strong here winning by 50%.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 | 209 | 327 | 537 | 566 [+5%] | With FP32, G7 drops to just 5% faster than Vega.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 | 26.93 | 44.19 | 44 | |

On scientific algorithms, G7 manages to beat Vega by between 25% and 50% with FP32 precision and sometimes with FP16 as well. Again, the lack of FP64 support means all the high-precision kernels cannot be used which for some algorithms may be a problem.

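
The GEMM rows are reported in GFLOPS. As a rough sketch of how such a figure is usually derived, a matrix multiply of an M×K by a K×N matrix is counted as 2·M·N·K floating-point operations (one multiply and one add per term), divided by the elapsed time. The naive pure-Python multiply below is only a stand-in for a real kernel, and all names are illustrative assumptions.

```python
import time

def matmul(a, b):
    """Naive matrix multiply on lists of lists (rows x inner) * (inner x cols)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            aik = a[i][k]
            for j in range(cols):
                out[i][j] += aik * b[k][j]
    return out

def gemm_flops(m, n, k):
    """One multiply and one add for each of the m*n*k inner-product terms."""
    return 2 * m * n * k

def gflops(flop_count, seconds):
    """Convert an operation count and elapsed time into GFLOPS."""
    return flop_count / seconds / 1e9

size = 48
a = [[1.0] * size for _ in range(size)]
b = [[2.0] * size for _ in range(size)]
start = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - start
print(f"{gflops(gemm_flops(size, size, size), elapsed):.4f} GFLOPS")
```
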
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) | 1,000 | 1,370 | 2,273 | 3,520 [+55%] | With FP16, G7 is only 50% faster than Vega.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) | 498 | 589 | 781 | 1,570 [+2x] | In this 3×3 convolution algorithm, G7 is 2x faster.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) | 307 | 441 | 582 | 1,000 [+72%] | With FP16, G7 is just 70% faster.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) | 108 | 143 | 157 | 319 [+2x] | Same algorithm but more shared data, G7 still 2x faster.
GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) | 284 | 435 | 619 | 924 [+49%] | With FP16, G7 is again 50% faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) | 112 | 156 | 161 | 328 [+2x] | With even more data the gap remains at 2x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) | 309 | 428 | 595 | 1,000 [+68%] | With FP16 precision, G7 is 70% faster than Vega.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) | 108 | 145 | 155 | 318 [+2x] | Still convolution but with 2 filters – same 2x difference.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) | 8.78 | 8.23 | 7.68 | 26.63 [+2.5x] | With FP16, G7 is “just” 2.5x faster than Vega.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) | 7.87 | 6.29 | 4.06 | 26.9 [+5.6x] | Different algorithm allows G7 to fly at 6x faster.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) | 9.6 | 9.14 | | 24.34 | G7 does similarly well with FP16.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) | 8.84 | 6.77 | 2.59 | 19.63 [+6.6x] | Without major processing, this filter is 6x faster on G7.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) | 1,000 | 1,620 | 2,091 | 1,740 [-17%] | With FP16, G7 is 17% slower than Vega.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) | 1,000 | 1,560 | 2,100 | 1,870 [-11%] | This algorithm is 64-bit integer heavy thus G7 is 10% slower.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) | 36.5 | 34.32 | 1,046 | 215 [1/5x] | Some issues needed to be worked out here.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) | 433 | 649 | 608 | 950 [+56%] | One of the most complex and largest filters, G7 is over 50% faster.

For image processing tasks, G7 does very well – in FP32 it is about 2x faster than Vega, while in FP16 precision its lead drops to around 50% (with Vega benefiting greatly from the lower precision). All in all a fantastic result for those using image/video manipulation algorithms.
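
Several of the filters above are described as convolutions (3×3, 5×5, 7×7). As a minimal sketch of the idea, assuming a simple box-blur kernel and edge clamping (neither is confirmed as what the benchmark does), a 3×3 convolution over a greyscale image can be written as:

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to a 2D list of pixel values.
    Reads outside the image are clamped to the nearest edge pixel."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(-1, 2):
                for kx in range(-1, 2):
                    sy = min(max(y + ky, 0), height - 1)
                    sx = min(max(x + kx, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky + 1][kx + 1]
            out[y][x] = acc
    return out

# Box blur: every neighbour contributes equally.
BOX_BLUR = [[1.0 / 9.0] * 3 for _ in range(3)]

image = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
blurred = convolve3x3(image, BOX_BLUR)
print(blurred[1][1])
```

Each output pixel reads nine inputs, and neighbouring pixels share most of them. On a GPU this reuse is typically served from shared/local memory, and larger kernels (5×5, 7×7) increase the amount of shared data per pixel.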

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks | Intel UHD 630 (7200U) | Intel Iris HD 540 (6550U) | AMD Vega 8 (Ryzen 5) | Intel Iris Plus (1065G7) | Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) | 21.36 | 23.66 | 27.32 | 36.3 [+33%] | G7 has 33% more bandwidth than Vega.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) | 10.4 | 11.77 | 4.74 | 17 [+2.6x] | G7 manages far higher transfers.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) | 10.55 | 11.75 | 5 | 18 [+2.6x] | Again, same 2.6x delta.

Thanks to the fast LP-DDR4X memory, G7 has far more bandwidth than Vega or older GT2/GT3 design; this no doubt helps streaming algorithms as we have seen above.

GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) | 232 | 277 | 412 | 343 [-17%] | Better latency than Vega but not less than old arch.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) | 363 | 436 | 519 | 433 [-17%] | Similar 17% less than Vega.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) | 153 | 213 | 201 | 267 [+33%] | Vega seems to be a lot faster than G7.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) | 236 | 252 | 411 | 350 [-15%] | Same latency as global as not dedicated.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) | 72.5 | 100 | 22.5 | 16.7 [-26%] | G7 has greatly reduced shared memory latency.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) | 1,116 | 1,500 | 278 | 1,100 [+3x] | Not much improvement over older versions.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) | 1,178 | 1,533 | 418 | 1,018 [+1.4x] | Similar high latency for G7.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) | 1,057 | 1,324 | 122 | 973 [+8x] | Again Vega has much lower latencies.

Despite high bandwidth, the latencies are high as LP-DDR4 has higher latencies than standard DDR4 (tens of clocks). Like Vega there is no dedicated constant memory – unlike nVidia. But G7 has greatly reduced shared memory latency to less than Vega which greatly helps algorithms using shared memory.
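
Latency tests of this kind are commonly built on "pointer chasing": each load's address depends on the previous load, so accesses cannot overlap and the average time per step approximates the access latency. The sketch below illustrates the technique only; it is not the benchmark's code, the names are hypothetical, and timings taken in Python measure interpreter overhead rather than hardware latency.

```python
import random
import time

def make_chain(size, seed=0):
    """Sattolo's algorithm: a random permutation that forms one single cycle,
    so following it visits every slot before returning to the start."""
    rng = random.Random(seed)
    chain = list(range(size))
    for i in range(size - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        chain[i], chain[j] = chain[j], chain[i]
    return chain

def chase(chain, steps):
    """Follow the chain; each lookup depends on the result of the previous one."""
    index = 0
    for _ in range(steps):
        index = chain[index]
    return index

def ns_per_access(chain, steps):
    """Average time per dependent access, in nanoseconds."""
    start = time.perf_counter()
    chase(chain, steps)
    return (time.perf_counter() - start) * 1e9 / steps

chain = make_chain(1 << 12)
print(f"{ns_per_access(chain, 100_000):.1f} ns per access")
```

Restricting the chain to one memory page would correspond to an "in-page random" pattern, spreading it over the whole buffer to "full range random", and using `chain[i] = i + 1` to "sequential" access.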

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It’s great to see Intel taking graphics seriously again; with ICL, you don’t just get a brand-new core but a much updated GPU core too. And it does not disappoint – it trades blows with competition (Vega Mobile) and usually wins while it is close to 2x faster than Gen9/GT3 and 3x faster than Gen9.5/GT2 – a huge improvement.

The lack of native FP64 support is puzzling – but then again it could be reserved for higher-end/workstation versions if supported at all. Intel no doubt is betting on the CPU’s AVX512 SIMD cores for FP64 performance which is considerable. Again, it’s not very likely that mobile (ULV) platforms are going to run high-precision kernels.

The memory bandwidth is also 50% higher but unfortunately latencies are also higher due to LP-DDR4(X) memory; lower-end versions using “standard” DDR4 memory will not see high bandwidth but will see lower latencies – thus it is give and take.

As we’ve said in the other reviews of ICL, if you have been waiting to upgrade from the much older – but still good – SKL/KBL with Gen8/9 GT2 GPU – the Gen11 GPU is a significant upgrade. You will no longer feel “inadequate” compared to competition integrated GPUs. Naturally, you cannot expect discrete GPU levels of performance but for an integrated APU it is more than sufficient.

Overall with CPU and memory improvements, ICL-U is a very compelling proposition that cost permitting should be your top choice for long-term use.

In a word: Highly Recommended!

AMD Ryzen 2 Mobile (2500U) Vega 8 GP(GPU) Performance

Amd Ryzen 2500U

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited Ryzen2 APU mobile “Raven Ridge” version of the desktop Ryzen 2 with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper – there was no (at least released) APU version or a mobile version – leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design in order to hit the TDP range (15-35W) for mobile devices. It has also added the brand-new Vega graphics cores to the APU that have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (core complex) and thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2 mobile:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Radeon RX Vega graphics core (DirectX 12.1)
  • Optimised boost (aka Turbo) algorithm – sharing between CPU & GPU cores

In this article we test GP(GPU) integrated graphics performance.

Hardware Specifications

We are comparing the graphics units of Ryzen2 mobile with competing APUs with integrated graphics to determine whether they are good enough for modest use, especially for compute (GPGPU) use supporting the CPU.

GPGPU Specifications | AMD Radeon RX Vega 8 (2500U) | Intel UHD 630 (7200U) | Intel HD Iris 520 (6500U) | Intel HD Iris 540 (6550U) | Comments
Arch Chipset | GCN1.5 | GT2 / EV9.5 | GT2 / EV9 | GT3 / EV9 | All graphics cores are minor revisions of previous cores with extra functionality.
Cores (CU) / Threads (SP) | 8 / 512 | 24 / 192 | 24 / 192 | 48 / 384 | Vega has the most SPs though only a few but powerful CUs.
ROPs / TMUs | 8 / 32 | 8 / 16 | 8 / 16 | 16 / 24 | Vega has fewer ROPs than GT3 but more TMUs.
Speed (Min-Turbo, MHz) | 300-1100 | 300-1000 | 300-1000 | 300-950 | Turbo boost puts Vega in top position power permitting.
Power (TDP) | 25-35W | 15-25W | 15-25W | 15-25W | TDP is about the same for all though both Ryzen2 and CFL-U have somewhat higher TDP (25W).
Constant Memory | 2.7GB | 1.6GB | 1.6GB | 3.2GB | There is no dedicated constant memory thus a large chunk is available to use (GB) unlike a dedicated video card with very fast but small (kB).
Shared (Local) Memory | 32kB | 64kB | 64kB | 64kB | Intel has 2x larger shared/local memory but slow (likely non dedicated) unlike Vega.
Global Memory | 2.7 / 3GB | 1.6 / 3.2GB | 1.6 / 3.2GB | 3.2 / 6.4GB | About 50% of main memory can be used as global memory – thus pretty large workloads can be run.
Memory System | 128-bit DDR4 2400Mt/s | 128-bit DDR3L 1866Mt/s | 128-bit DDR3L 1866Mt/s | 128-bit DDR4 2133Mt/s | Ryzen2’s memory controller is rated for faster data rates thus should be able to use faster (laptop) memory.
Memory Bandwidth (GB/s) | 36 | 30 | 30 | 33 | The high data rate of DDR4 can result in higher bandwidth useful for the GPU cores.
L2 Cache | ? | 512kB | 512kB | 1MB | L2 is comparable to Intel units.
FP64/double ratio | Yes, 1/16x | Yes, 1/8x | Yes, 1/8x | Yes, 1/8x | FP64 is supported and at good ratio but lower than Intel’s.
FP16/half ratio | Yes, 2x | Yes, 2x | Yes, 2x | Yes, 2x | FP16 is also now supported at twice the rate – again unlike gimped dedicated cards.
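
The SP counts, turbo clocks and FP64/FP16 ratios in the table can be combined into a rough theoretical peak. The sketch below assumes each SP retires one fused multiply-add (counted as 2 FLOPs) per clock at the full turbo frequency. That is a common rule of thumb rather than something the table states, and real kernels will land well below it. The function name is hypothetical.

```python
def peak_gflops(shader_units, clock_mhz, ratio=1.0):
    """Theoretical peak in GFLOPS: SPs x 2 FLOPs (one FMA) x clock x precision ratio."""
    return shader_units * 2 * clock_mhz * 1e6 * ratio / 1e9

# Values taken from the specification table above.
vega_fp32 = peak_gflops(512, 1100)             # Vega 8: 512 SP at 1100MHz
vega_fp64 = peak_gflops(512, 1100, 1.0 / 16)   # FP64 at 1/16x
vega_fp16 = peak_gflops(512, 1100, 2.0)        # FP16 at 2x
gt3_fp32 = peak_gflops(384, 950)               # Iris 540 (GT3): 384 SP at 950MHz
gt3_fp64 = peak_gflops(384, 950, 1.0 / 8)      # FP64 at 1/8x

print(f"Vega 8 FP32 {vega_fp32:.1f}  FP64 {vega_fp64:.1f}  FP16 {vega_fp16:.1f}")
print(f"GT3    FP32 {gt3_fp32:.1f}  FP64 {gt3_fp64:.1f}")
```

Under these assumptions Vega leads clearly in FP32, but its 1/16x FP64 ratio puts its FP64 peak (about 70 GFLOPS) below that of GT3 (about 91 GFLOPS).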

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks | Intel UHD 630 (7200U) | Intel HD Iris 520 (6500U) | Intel HD Iris 540 (6550U) | AMD Radeon RX Vega 8 (2500U) | Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) | 831 | 927 | 1630 | 2000 [+23%] | Thanks to FP16 support we see double the performance over FP32 but Vega is only 23% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) | 476 | 478 | 865 | 1350 [+56%] | Vega rules FP32 and is over 50% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) | 113 | 122 | 209 | 111 [-47%] | FP64 lower rate makes Vega 1/2 the speed of GT3 and only matching GT2 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) | 5.71 | 6.29 | 10.78 | 7.11 [-34%] | Emulated FP128 precision depends entirely on FP64 performance thus not a lot changes.

Vega is over 50% faster than Intel’s top-end Iris/GT3 graphics but only in FP32 precision – while it gains from FP16 Intel scales better reducing the lead to just 25% or so. In FP64 precision though its relatively low 1/16x ratio means it only ties with GT2 low-end models while GT3 is 2x (twice) as fast. Pity.

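
The bracketed percentages in the Vega column can be reproduced from the raw scores as a relative difference against the GT3 (Iris 540) result. A minimal sketch, with a hypothetical helper name:

```python
def delta_percent(score, baseline):
    """Relative difference of `score` versus `baseline`, in percent."""
    return (score / baseline - 1.0) * 100.0

# Mandel scores from the table: Vega 8 versus Iris 540 (GT3).
print(round(delta_percent(2000, 1630)))  # FP16: 23
print(round(delta_percent(1350, 865)))   # FP32: 56
print(round(delta_percent(111, 209)))    # FP64: -47
```
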
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) | 0.858 | 0.87 | 1.23 | 2.58 [+2.1x] | No wonder AMD is crypto-king: Vega is over 2x faster than even GT3.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) | 1 | 1.08 | 1.52 | 3.3 [+2.17x] | Nothing changes here, Vega is over 2.2x faster.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) | 2.72 | 3 | 4.7 | 14.29 [+3x] | In this heavy integer workload, Vega is now 3x faster no wonder it’s used for crypto mining.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) | 6 | 6.64 | 11.59 | 18.77 [+62%] | SHA1 is less compute intensive allowing Intel to catch up but Vega is still over 60% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) | 1.019 | 1.08 | 1.86 | 3.36 [+81%] | With 64-bit integer workload, Vega does better and is 80% (almost 2x) faster than GT3.

Nobody will be using integrated graphics for crypto-mining any time soon, but if you needed to (perhaps using encrypted containers, VMs, etc.) then Vega is your choice – even GT3 is left in the dust despite big improvement over low-end GT2. Intel would need at least 2x more cores to be competitive here.

GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) | 1000 | 1140 | 1470 | 1720 [+17%] | If 16-bit precision is sufficient for financial work, Vega is 20% faster than GT3.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) | 694 | 697 | 794 | 829 [+4%] | In this relatively simple FP32 financial workload Vega is just 4% faster than GT3.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) | 142 | 154 | 281 | 185 [-33%] | Switching to FP64 precision, Vega is 33% slower than GT3.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) | 86 | 95 | 155 | 270 [+74%] | Switching to 16-bit precision allows Vega to gain over GT3 and is almost 2x faster.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) | 92 | 93 | 153 | 254 [+66%] | Binomial uses thread shared data thus stresses the internal memory sub-system, and here Vega shows its power – it is 66% faster than GT3.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) | 18 | 18.86 | 32 | 15.67 [-51%] | With FP64 precision Vega loses again vs. GT3 at 1/2 the speed and just matches GT2 units.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) | 211 | 236 | 395 | 584 [+48%] | With 16-bit precision, Vega dominates again and is almost 50% faster than GT3.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) | 223 | 236 | 412 | 362 [-12%] | Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – but Vega somehow loses against GT3.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) | 29.5 | 33.36 | 58.7 | 47.13 [-20%] | Switching to FP64 precision as expected Vega is slower.

Financial algorithms perform well on Vega – at least in FP16 & FP32 precision but FP64 is too “gimped” (1/16x FP32 rate) and thus loses against GT3 despite more powerful cores.

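
The Binomial comment mentions "thread shared data". To illustrate why a binomial pricer repeatedly reads and rewrites a common working set, here is a minimal sketch of a Cox-Ross-Rubinstein tree for a European call. This is the textbook method and is assumed, not confirmed, to resemble what the benchmark runs; all names are illustrative.

```python
import math

def binomial_call(spot, strike, rate, vol, years, steps):
    """Price a European call on a Cox-Ross-Rubinstein binomial tree."""
    dt = years / steps
    up = math.exp(vol * math.sqrt(dt))
    down = 1.0 / up
    growth = math.exp(rate * dt)
    prob_up = (growth - down) / (up - down)  # risk-neutral probability of an up move
    discount = 1.0 / growth
    # Option payoff at expiry for every terminal node (j = number of up moves).
    values = [max(spot * up ** j * down ** (steps - j) - strike, 0.0)
              for j in range(steps + 1)]
    # Roll back through the tree: every level is computed from the level after it.
    for level in range(steps, 0, -1):
        values = [discount * (prob_up * values[j + 1] + (1.0 - prob_up) * values[j])
                  for j in range(level)]
    return values[0]

print(round(binomial_call(100.0, 100.0, 0.05, 0.2, 1.0, 200), 2))
```

The `values` array is rewritten at every level and each entry depends on two neighbours. In a GPU implementation this is the kind of buffer that would usually live in shared/local memory, with synchronisation between levels.
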
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 | 127 | 140 | 236 | 884 [+3.75x] | With 16-bit precision Vega runs away with GEMM and is almost 4x faster than GT3.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 | 105 | 107 | 175 | 314 [+79%] | GEMM makes heavy use of shared/local memory which is likely why Vega is 80% faster than GT3.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 | 38.8 | 41.69 | 70 | 62.6 [-11%] | As expected, due to gimped FP64 rate Vega falls behind GT3 but only by just 11%.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 | 34.2 | 34.7 | 45.85 | 61.34 [+34%] | 16-bit precision helps reduce memory bandwidth pressure thus Vega is 34% faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 | 20.9 | 21.45 | 29.69 | 31.48 [+6%] | FFT is memory access bound but Vega does well to beat GT3.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 | 4.3 | 5.4 | 6.07 | 14.19 [+2.34x] | Despite the FP64 rate, Vega manages its memory accesses better beating GT3 by over 2x (two times).
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 | 270 | 284 | 449 | 623 [+39%] | 16-bit precision still benefits N-Body and here Vega is 40% faster than GT3.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 | 162 | 181 | 291 | 537 [+85%] | Back to FP32 and Vega has a pretty large 85% lead – almost 2x GT3.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 | 22.73 | 26.1 | 43.34 | 44 [+2%] | With FP64 precision, Vega and GT3 are pretty much tied.

Vega performs well on compute heavy scientific algorithms (making heavy use of shared/local memory) and also benefits from half/FP16 to reduce memory bandwidth pressure, but FP64 rate comes back to haunt it where it loses against Intel’s GT3. Pity.

GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) | 888 | 937 | 1390 | 2273 [+64%] | With 16-bit precision Vega doubles its lead to 64% over GT3 despite its gain over FP32.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) | 461 | 491 | 613 | 781 [+27%] | In this 3×3 convolution algorithm, Vega does well but only 30% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) | 279 | 302 | 409 | 582 [+42%] | Again a huge gain by using FP16, over 40% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) | 100 | 107 | 144 | 157 [+9%] | Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) | 254 | 272 | 396 | 619 [+56%] | Large gain again by switching to FP16 with 3x performance over FP32.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) | 103 | 111 | 156 | 161 [+3%] | With even more shared data the gap falls to just 3%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) | 259 | 281 | 363 | 595 [+64%] | Another huge gain and over 3x improvement over FP32.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) | 99 | 106 | 145 | 155 [+7%] | Still convolution but with 2 filters – the gap is similar to 5×5 – Vega is 7% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) | 7.39 | 9.4 | 8.56 | 7.688 [-18%] | Big gain but not enough to beat GT3 here.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) | 7 | 7.57 | 7.08 | 4 [-47%] | Vega does not like this algorithm (lots of branching causing divergence) and is 1/2 GT3 speed.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) | 8.55 | 9.32 | 9.22 | <BSOD> | This test would cause BSOD; we are investigating.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) | 8 | 8.65 | 6.77 | 2.59 [-70%] | Vega does not like this algorithm either (complex branching) and neither does GT3.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) | 941 | 967 | 1580 | 2091 [+32%] | In order to prevent artifacts most of this test runs in FP32 thus not much gain here.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) | 878 | 952 | 1550 | 2100 [+35%] | This algorithm is 64-bit integer heavy allowing Vega 35% better performance over GT3.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) | 341 | 390 | 343 | 1046 [+2.5x] | Switching to FP16 makes a huge difference to Vega which is over 2x faster.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) | 384 | 425 | 652 | 608 [-7%] | One of the most complex and largest filters, Vega is a bit slower than GT3 by 7%.

For image processing Vega generally performs well in FP32 beating GT3 hands down; but there are a few algorithms that may need to be optimised for it that don’t perform as well as expected. Switching to FP16 though doubles/triples scores – thus Vega may be starved of memory.
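
The Diffusion filter is described as XorShift-based and "64-bit integer heavy". For reference, a minimal sketch of one step of Marsaglia's 64-bit xorshift generator follows. The shift constants 13, 7, 17 are one well-known triple and are assumed here, since the article does not say which variant the benchmark uses. Python integers are unbounded, so the left shifts are masked to 64 bits explicitly.

```python
MASK64 = (1 << 64) - 1

def xorshift64(state):
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17).
    The state must be non-zero, because zero maps to itself."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state

state = 1
for _ in range(3):
    state = xorshift64(state)
    print(state)
```

Every step is just three 64-bit shifts and three XORs, so throughput depends on how well the GPU handles 64-bit integer operations rather than on floating-point rate.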

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks | Intel UHD 630 (7200U) | Intel HD Iris 520 (6500U) | Intel HD Iris 540 (6550U) | AMD Radeon RX Vega 8 (2500U) | Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) | 12.17 | 21.2 | 24 | 27.32 [+14%] | With higher speed DDR4 memory, Vega has 14% more bandwidth.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) | 6 | 10.4 | 11.7 | 4.74 [-60%] | The GPU<>CPU link seems a bit slow here at 1/2 bandwidth of Intel.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) | 6 | 10.5 | 11.75 | 5 [-57%] | Download bandwidth shows a similar issue, 1/2 bandwidth expected.

All designs have to rely on the shared memory controller and Vega performs as expected with good internal bandwidth due to higher speed DDR4 memory. But – transfer up/down speeds are disappointing possibly due to the driver as “zero-copy” mode should be engaged and working on such transfers (APU mode).

GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) | 246 | 244 | 288 | 412 [+49%] | Similarly with CPU data latencies, global “in-page/random” (aka “TLB hit”) latencies are a bit high though not by a huge amount.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) | 365 | 372 | 436 | 519 [+19%] | Due to faster memory clock but increased timings “full/random” latencies appear a bit higher.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) | 156 | 158 | 213 | 201 [-6%] | Sequential access latencies are less than competition by 6%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) | 245 | 243 | 252 | 411 [+63%] | None have dedicated constant memory thus we see a similar picture to global memory: somewhat high latencies.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) | 82 | 84 | 100 | 22.5 [1/5x] | Vega has dedicated shared/local memory and it shows – it’s about 5x faster than Intel’s designs.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) | 1152 | 1157 | 1500 | 278 [1/5x] | Texture access is also very fast on Vega, with latencies 5x lower (aka 1/5) than Intel’s designs.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) | 1178 | 1162 | 1533 | 418 [1/3x] | Even full/random accesses are fast, 3x (three times) faster than Intel’s.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) | 1077 | 1081 | 1324 | 122 [1/10x] | With sequential access we see a crazy 10x lower latency as if AMD uses prefetchers and Intel does not.

As we’ve seen in Ryzen 2’s data latency tests – “in-page/random” latencies are higher than competition but the rest are comparable, with sequential (prefetched) latencies especially small. But dedicated shared/local memory is far faster (5x) and texture accesses are also very fast (3-5x) which should greatly help algorithms making use of them.

Plotting the global (or constant) memory latencies together we see that the “in-page/random” access latencies should perhaps peak somewhat lower but still nothing close to what we’ve seen in the (CPU) data memory latencies article. It is not very clear (unlike the texture latencies graph) where the caches are located.
The texture latencies graph is far clearer where we can see each level’s caches; unlike the global (or constant) latencies we see “in-page/random” latency peak and hold at a somewhat lower level (4MB).

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Vega mobile, like its bigger desktop siblings, is undoubtedly powerful and a good upgrade from the older integrated GPU cores; it also supports modern features like half/FP16 compute (which needs vectorisation to what the driver reports as “optimised width”) and relishes complex algorithms making use of shared/local memory, which is efficient. However, Intel’s GT3 EV9.x can get close to it in some workloads and, due to its better FP64 ratio (1/8x vs 1/16x), even beat it in most FP64 precision tests, which is somewhat disappointing.

Luckily for AMD, GT3 variant is very rare and thus Vega has an easy job defeating GT2 in just about all tests; but it shows that should Intel “get serious” and continue to improve integrated graphics (and CPUs) like they used to do before Skylake (SKL/KBL) – AMD might have more serious competition on its hands.

Note that until recently (2019) Ryzen2 mobile APUs were not supported by AMD’s main drivers (“Adrenalin”) and had to rely on pretty old OEM (HP, etc.) drivers that were somewhat problematic especially with Windows 10 changing every 6 months while the drivers were almost 1 year old. Thankfully this has now changed and users (and us) can benefit from updated, stable and performant drivers.

In any case if you want a laptop/ultraportable with just an APU and no dedicated graphics, then Vega is pretty much your only choice which means a Ryzen2 system. That pretty much means it is worthy of a recommendation.

In a word: Highly Recommended

Intel Core i7 8700K, 9900K CoffeeLake Review & Benchmarks – UHD 630 GPGPU Performance

Intel Graphics

What is “CoffeeLake” CFL?

The 8th generation Intel Core architecture is code-named “CoffeeLake” (CFL): unlike previous architectures, it is a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). As before, the CPUs contain an integrated GPU (with compute support aka GPGPU).

While originally Intel integrated graphics were not much use – starting with SNB (“SandyBridge”) and especially its GPGPU-capable successor IVB (“IvyBridge”) the integrated graphics units made large progress, with HSW (“Haswell”) introducing powerful versions with many compute units (GT3+) and esoteric L4 cache (eDRAM) versions (“CrystalWell”) supporting high-end features like FP64 (native 64-bit floating-point support) and zero-copy CPU <> GPU transfers.

Alas, while the features remained, the higher-end versions (GT3, GT4e) never became mainstream and pretty much disappeared, except in very high-end ULV/H SKUs; top-end desktop CPUs like the 6700K, 8700K, etc. tested here are stuck with the low-end GT2 versions. While perhaps nobody in their right mind would use such CPUs without a dedicated external (GP)GPU, it is still interesting to see how the GPU core has evolved over time.

Also let’s not forget that on the mobile platforms (either ULV/Y even H) most laptops/tablets do not have dedicated GPU and rely solely on integrated graphics – and here naturally UHD630 performance matters.

Hardware Specifications

We are comparing the graphics units of top-of-the-range Intel CPUs with low-end dedicated cards to determine whether they are good enough for modest use, especially for compute (GPGPU) use supporting the CPU.

GPGPU Specifications | Intel UHD 630 (8700K, 9900K) | Intel HD 530 (6700K) | nVidia GT 1030 | Comments
Arch Chipset | GT2 / EV9.5 | GT2 / EV9 | GP108 / SM6.1 | UHD6xx is just a minor revision of the HD5xx video core.
Cores (CU) / Threads (SP) | 24 / 192 | 24 / 192 | 3 / 384 | No change in core / SP units.
ROPs / TMUs | 8 / 16 | 8 / 16 | 16 / 24 | No change in ROP/TMUs either.
Speed (Min-Turbo) | 350-1200 | 350-1150 | 300-1.26-1.52 | Turbo speed is only slightly increased.
Power (TDP) | 95W | 91W | 35W | TDP has gone up a bit but nothing major.
Constant Memory | 3.2GB | 3.2GB | 64kB (dedicated) | There is no dedicated constant memory thus a large chunk is available to use (GB) unlike a dedicated video card with very fast but small (kB).
Shared (Local) Memory | 64kB | 64kB | 48kB (dedicated) | Bigger than usual shared/local memory but slow (likely non dedicated).
Global Memory | 7GB (of 16GB) | 7GB (of 16GB) | 2GB | About 50% of main memory can be used as global memory – thus pretty large workloads can be run.
Memory System | DDR4 3200Mt/s 128-bit | DDR4 2533Mt/s 128-bit | GDDR5 6Gt/s 64-bit | CFL can reliably run at faster data rates thus 630 benefits too.
Memory Bandwidth (GB/s) | 50 | 40 | 48 | The high data rate of DDR4 can result in higher bandwidth than some dedicated cards.
L2 Cache | 512kB | 512kB | 48kB | L2 is unchanged and reasonably large.
FP64/double ratio | Yes, 1/8 | Yes, 1/8 | Yes, 1/32 | FP64 is supported and at good ratio compared to gimped dedicated cards.
FP16/half ratio | Yes, 2x | Yes, 2x | Yes, 1/64 | FP16 is also now supported at twice the rate – again unlike gimped dedicated cards.
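
The Memory Bandwidth row is consistent with the usual theoretical peak of data rate multiplied by bus width. A minimal sketch of that arithmetic, using the Memory System values from the table (the function name is hypothetical, and sustained bandwidth in practice is lower than this peak):

```python
def peak_bandwidth_gbs(data_rate_mts, bus_width_bits):
    """Theoretical peak bandwidth in GB/s: transfers per second x bytes per transfer."""
    return data_rate_mts * 1e6 * (bus_width_bits / 8) / 1e9

print(peak_bandwidth_gbs(3200, 128))  # UHD 630, DDR4 3200Mt/s 128-bit: 51.2 (table: 50)
print(peak_bandwidth_gbs(2533, 128))  # HD 530, DDR4 2533Mt/s 128-bit: about 40.5 (table: 40)
print(peak_bandwidth_gbs(6000, 64))   # GT 1030, GDDR5 6Gt/s 64-bit: 48.0 (table: 48)
```

This shows why a wide, fast DDR4 interface can match a low-end dedicated card: the GT 1030's higher data rate is offset by its narrower 64-bit bus.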

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 1150 [+7%] 1070 1660 Thanks to FP16 support we see double the performance over FP32 and thus only about 30% slower than a dedicated 1030.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 584 [+9%] 535 1660 630 is almost 10% faster than old 530 but still about 1/3 of a dedicated 1030.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 151 [+9%] 138 72.8 FP64 sees a similar delta (+9%) but much faster (2x) than a dedicated 1030 due to gimped FP64 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 7.84 [+5%] 7.46 2.88 Emulated FP128 precision depends entirely on FP64 performance and much better (3x) than gimped dedicated.
UHD630 is about 5-9% faster than the 530, not much to celebrate – but due to native FP16 and especially FP64 support it can match or even overtake low-end dedicated GPUs – a pretty surprising result! If only we had more cores, it might actually be very competitive.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 1 [+5%] 0.954 4.37 We see a 5% improvement for the 630 but far lower performance than a dedicated GPU.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.3 [+6%] 1.23 5.9 Nothing changes here, we see a 6% improvement.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 3.6 [+3%] 3.5 18.4 In this heavy integer workload, the improvement falls to just 3% – but a dedicated unit would be about 5x faster.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 8.18 [+2%] 8 24 Nothing much changes here, we see a 2% improvement.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 1.3 [+2%] 1.27 7.8 With 64-bit integer workload, same improvement of just 2% but now the 1030 is about 6x faster!
Nobody will be using integrated graphics for crypto-mining any time soon: we see a very minor improvement in the 630 vs. the old 530, but overall low performance versus dedicated graphics like a 1030 which would be 4-6x faster. We would need 3x more cores to compete here.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1180 [+21%] 977 1320 In this FP32 financial workload we see a good 21% improvement vs. old 530. Also good result vs. dedicated 1030.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 180 [+2%] 175 137 Switching to FP64 code, the difference is next to nothing but better than a gimped 1030.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 111 [+12%] 99 255 Binomial uses thread shared data thus stresses the internal memory sub-system, and here 630 is 12% faster. But 1/2 the performance of a 1030.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 22.3 [+4%] 21.5 14 With FP64 code the improvement drops to 4%.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 298 [+2%] 291 617 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – strangely we see only 2% improvement and again 1/2 1030 performance.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 43.4 [+2%] 42.5 28 Switching to FP64 we see no changes. But almost 2x performance over a 1030.
You can run financial analysis algorithms with decent performance on an UHD630 – just as you could on the old 530 – and again better FP64 performance than dedicated – (GT 1030) a pretty impressive result. Naturally, you can just use the powerful CPU cores instead…
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 143 [+4%] 138 685 Using 32-bit precision 630 improves 4% but is almost 1/5 (5 times slower) than a 1030.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 55.5 [+3%] 53.7 35 With FP64 precision, the delta does not change but now the 630 is about 1.6x faster than a 1030.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 39.6 [+20%] 33 37 FFT is memory access bound and here 630’s faster DDR4 memory gives it a 20% lead.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 9.3 [+16%] 8 20 We see a similar improvement with FP64 about 16%.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 272 [+2%] 266 637 Back to normality with this algorithm – we see just 2% improvement.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 27.7 [+3%] 26.9 32 With FP64 precision, nothing much changes.
The scientific scores are similar to the financial ones – except the memory access heavy FFT which greatly benefits from better memory (if that is provided of course). A dedicated card (like the 1030) is much faster in FP32 mode but again the 630 can be up to 2x faster in FP64 mode. Again, you’re much better off using the CPU and its powerful SIMD units for these algorithms.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 592 [+10%] 536 1620 In this 3×3 convolution algorithm, we see a 10% improvement over the old 530. But about 1/3 the performance of a 1030.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 128 [+9%] 117 637 Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 133 [+9%] 122 391 With even more data the gap remains the same.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 127 [+9%] 116 368 Still convolution but with 2 filters – still 9% better.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 9.2 [+10%] 8.4 7.3 A different algorithm does not change much: still 10% better.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 10.6 [+9%] 9.7 4.08 Without major processing, 630 improves by the same amount.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 1640 [+2%] 1600 2350 This algorithm is 64-bit integer heavy thus we fall to the “usual” 2% improvement.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 550 [+2%] 538 849 One of the most complex and largest filters, sees the same 2% improvement.
For image processing using FP32 precision 630 performs a bit better than usual, 10% faster across the board compared to the old 530 – but still about 1/3 (third) the speed of a dedicated 1030. But if you can make do with FP16 precision image processing, then we almost double performance.
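
The bracketed deltas in the table above (e.g. `[+9%]`) can be reproduced from the raw scores. This is a minimal sketch, assuming the delta is simply the rounded relative difference between the UHD 630 and HD 530 scores; the helper names are ours.

```python
def pct_delta(new: float, old: float) -> int:
    """Relative change of `new` over `old`, rounded to a whole percent."""
    return round((new / old - 1) * 100)


def delta_label(new: float, old: float) -> str:
    """Format the change the way the tables do, e.g. '[+9%]' or '[-19%]'."""
    return f"[{pct_delta(new, old):+d}%]"


def times_faster(a: float, b: float) -> float:
    """How many times faster score `a` is than score `b` (higher is better)."""
    return a / b


# Scores taken from the Mandel FP32 row: UHD 630 = 584, HD 530 = 535, GT 1030 = 1660.
print(delta_label(584, 535))             # UHD 630 vs HD 530
print(f"{times_faster(1660, 584):.1f}x")  # GT 1030 vs UHD 630
```

For latency rows, where lower is better, a negative label means an improvement.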

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 36.4 [+21%] 30 38.5 Due to higher speed DDR4 memory, the 630 manages 21% better bandwidth than the 530 – and comparable to a 64-bit bus dedicated card.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 17.9 [+29%] 13.9 3 (PCIe3 x4) The CPU<>GPU internal link seems to have 30% more bandwidth – naturally zero-copy transfers are also supported. And a lot better than a dedicated card on PCIe3 x4 (4 lanes).
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 17.9 [+35%] 13.3 3 (PCIe3 x4) Here again we see a good 35% bandwidth improvement.
CFL’s higher (stable) memory speed support improves bandwidth between 20-35% – which is likely behind most benchmark improvement in the compute algorithms above. However, that will only happen if high-speed DDR4 memory (3200 or faster) were to be used – an expensive proposition! eDRAM would greatly help here…
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 179 [+1%] 178 223 No changes in global latencies in-page showing no memory sub-system improvements.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 268 [-19%] 332 244 Due to faster memory clock (even with slightly increased timings) full random access latencies fall by 20% (similar to bandwidth increase).
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 126 [-5%] 132 76 Sequential access latencies do fall by a minor 5% as well though.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 181 [-6%] 192 92.5 Intel’s GPGPUs don’t have dedicated constant memory thus we see similar performance to global memory.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 72 [-1%] 73 16.6 Shared memory latency is unchanged – and quite slow compared to architectures from competitors like the 1030.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 138 [-9%] 151 220 Texture access latencies do seem to show a 9% improvement, a surprising result.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 227 [-16%] 270 242 Just as we’ve seen with global (full range access) latencies, we see the best improvement about 16% here.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 45 [=] 45 71.9 With sequential access we see no improvement.
Anything to do with main memory access (aka “full random access”) does show a similar improvement to bandwidth increases, i.e. between 16-19% due to higher speed (but somewhat higher timings) main memory. All other access patterns show little to no improvements.

When using higher speed DDR4 memory – as we do here (3200 vs 2533) – UHD630 shows a good improvement in both bandwidth and reduced latencies – but otherwise it performs just the same as the old HD530 – not a surprise really. At least you can see that your (expensive) memory investment does not go to waste – with memory bound algorithms showing good improvement.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For GPGPU workloads, UHD630 does not bring anything new – it performs similarly to the old HD530. But as CFL can use higher (stable) memory speeds, bandwidth and latencies are improved (when using such higher speed memory) and thus most algorithms do show good improvements. Naturally as long as you can afford to provide such memory.

The surprising native FP64 support at a 1/8 ratio means 64-bit floating-point algorithms can run faster than on a typical low-end graphics card (which, despite also supporting native FP64, runs it at 1/32 of the FP32 rate), so high accuracy workloads do work well on it. If loss of accuracy is OK (e.g. picture processing), native FP16 support at 2x rate makes such algorithms almost 2x faster and thus within reach of a typical low-end graphics card (which either does not support FP16 or runs it at a 1/64 ratio!).
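
To illustrate why FP16 is often acceptable for picture processing, here is a small sketch that round-trips values through IEEE half precision using Python's standard `struct` module (format code `e`). The helper name `to_half` is ours; this only shows the storage precision, not how any particular filter in the benchmark is implemented.

```python
import struct


def to_half(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision (FP16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Every 8-bit pixel value (0..255) survives the round trip exactly,
# because FP16 has an 11-bit significand.
pixels_exact = all(to_half(float(v)) == float(v) for v in range(256))
print("8-bit pixel values exact in FP16:", pixels_exact)

# General fractions are rounded to roughly 3 decimal digits.
print("0.1 stored as FP16:", to_half(0.1))

# Beyond 2048 not every integer is representable any more.
print("2049 exact in FP16:", to_half(2049.0) == 2049.0)
```

So 8-bit image data fits comfortably, while intermediate results with a wide dynamic range (or large accumulations) may still need FP32.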

As we touched on in the introduction – this may not matter on desktop – but on mobile, where most laptops/tablets use the integrated graphics, any and all such improvements can make a big difference. While in the past the fast-improving EV cores became performance competitive with CPU cores (as there were only 2 ULV ones) – with CFL doubling the number of CPU cores (4 vs. 2) it is likely that internal graphics (GPGPU) performance is now too low.

We’re sad that the GT3/GT4 versions are not common-place, not to mention the L4/eDRAM which showed so much promise in the HSW days.

But Intel has recently revamped its GPU division and is committed to releasing dedicated (not just internal) graphics in a few years (2020?) which hopefully means we should see far more powerful GPUs from them soon.

Let’s hope they do see the light-of-day and are not cancelled like the “Phi” GPGPU accelerators (“Knights Landing”) which showed so much promise but somehow never made it outside data centres before sailing into the sunset…

nVidia Titan V: Volta GPGPU performance in CUDA and OpenCL

What is “Titan V”?

It is the latest high-end “pro-sumer” card from nVidia with the next-generation “Volta” architecture, the next generation to the current “Pascal” architecture on the Series 10 cards. Based on the top-end 100 chipset (not lower 102 or 104) it boasts full speed FP64/FP16 performance as well as brand-new “tensor cores” (matrix multipliers) for scientific and deep-learning workloads. It also comes with on-chip HBM2 (high-bandwidth) memory unlike more traditional GDDRX stand-alone memory.

For this reason the price is also far higher than previous Titan X/XP cards but considering the features/performance are more akin to “Tesla” series it would still be worth it depending on workload.

While using the additional cores provided in FP64/FP16 workloads is automatic – save usual code optimisations – tensor core support requires custom code, and existing libraries and apps need to be updated to make use of them. It is unknown at this time if consumer cards based on “Volta” will also include them. As they support FP16 precision only, not all workloads may be able to use them – but DL (deep learning) and AI (artificial intelligence) are generally fine using lower precision thus for such tasks it is ideal.

Hardware Specifications

We are comparing the top-of-the-range Titan V with previous generation Titans and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications nVidia Titan V nVidia Titan X (P) nVidia 980 GTX (M2) Comments
Arch Chipset Volta GV100 (7.0) Pascal GP102 (6.1) Maxwell 2 GM204 (5.2) The V is the only one using the top-end 100 chip not 102 or 104 lower-end versions
Cores (CU) / Threads (SP) 80 / 5120 28 / 3584 16 / 2048 The V boasts 80 CU units but these contain only 64 FP32 units each, not 128 like lower-end chips, thus they are equivalent to 40 of those.
FP32 / FP64 / Tensor Cores 5120 / 2560 / 640 3584 / 112 / no 2048 / 64 / no Titan V is the only one with tensor cores and also huge amount of FP64 cores that Titan X simply cannot match; it also has full speed FP16 support.
Speed (Min-Turbo) 1.2GHz (135-1.455) 1.531GHz (139-1910) 1.126GHz (135-1.215) Slightly lower clocked than the X, it will make up for it with sheer CU units.
Power (TDP) 300W 250W (125-300) 180W (120-225) TDP increases by 50W but it is not unexpected considering the additional units.
ROP / TMU 96 / 320 96 / 224 64 / 128 Not a “gaming card” but while ROPs stay the same the number of TMUs has increased – likely required for compute tasks using textures.
Global Memory 12GB HBM2 850MHz 3072-bit 12GB GDDR5X 10Gbps 384-bit 4GB GDDR5 7Gbps 256-bit Memory size stays the same at 12GB but now uses on-chip HBM2 for much higher bandwidth.
Memory Bandwidth (GB/s) 652 512 224 In addition to the modest bandwidth increase, latencies are also meant to have decreased by a good amount.
L2 Cache 4.5MB 3MB 2MB L2 cache has gone up by about 50% to feed all the cores.
FP64/double ratio 1/2 1/32 1/32 For FP64 workloads the V has a huge advantage as consumer cards and the previous Titan X had far fewer FP64 units.
FP16/half ratio 2x 1/64 n/a The V has an even bigger advantage here with over 128x more units for FP16 tasks like DL and AI.
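
The unit counts and ratios in the table translate into rough peak throughput figures. The sketch below assumes the common estimate of one fused multiply-add (counted as 2 FLOPs) per unit per clock and uses the headline clock from the "Speed" row; both are assumptions on our part, and the function names are hypothetical.

```python
def fp64_units(fp32_units: int, ratio_denominator: int) -> int:
    """FP64 unit count implied by an FP64:FP32 ratio of 1/ratio_denominator."""
    return fp32_units // ratio_denominator


def peak_gflops(units: int, clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Rough theoretical peak in GFLOPS (assumes 1 FMA = 2 FLOPs per unit per clock)."""
    return units * flops_per_clock * clock_ghz


# Titan V: 5120 FP32 units, FP64 at 1/2 rate, 1.2GHz headline clock.
# Titan X (P): 3584 FP32 units, FP64 at 1/32 rate, 1.531GHz headline clock.
for name, fp32, denom, clock in [
    ("Titan V", 5120, 2, 1.2),
    ("Titan X (P)", 3584, 32, 1.531),
]:
    units64 = fp64_units(fp32, denom)
    print(f"{name}: {units64} FP64 units, "
          f"FP32 ~{peak_gflops(fp32, clock):,.0f} GFLOPS, "
          f"FP64 ~{peak_gflops(units64, clock):,.0f} GFLOPS")
```

The implied unit counts (2560 and 112) match the "FP32 / FP64 / Tensor Cores" row; actual clocks under load vary, so treat the GFLOPS values as ballpark figures only.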

nVidia Titan V (Volta)

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan V CUDA/OpenCL nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL Comments
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 22,400 [+25%] / 20,000 17,870 / 16,000 7,000 / 6,100 Right off the bat, the V is just 25% faster than the X; some optimisations may be required.
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 33,300 [135x] / n/a 245 / n/a n/a For FP16 workloads the V shows its power: it is an astonishing 135 *times* (times not %) faster than the X.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 11,000 [+16.7x] / 11,000 661 / 672 259 / 265 For FP64 precision workloads the V shines again, it is 16 times faster than the X.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 458 [+17.7x] / 455 25 / 24 10.8 / 10.7 With emulated FP128 precision the V is again 17 times faster.
As expected, FP64 and FP16 performance is much improved on Titan V, with FP64 over 16x faster than on the X; FP16 is about 50% faster than FP32 on the V, making it almost 2x faster than Titan X running FP32. For workloads that need it, the performance of Titan V is stellar.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 71 [+79%] / 87 40 / 38 16 / 16 Titan V is almost 80% faster than the X here a significant improvement.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 91 [+75%] / 116 52 / 51 23 / 21 Not a lot changes here, with the V still 75% faster than the X.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 253 [+89%] / 252 134 / 142 58 / 59 In this integer workload, Titan V is almost 2x faster than the X.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 130 [+21%] / 134 107 / 114 50 / 54 SHA1 is mysteriously slower than SHA256 and here the V is just 21% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 173 [+2.4x] / 176 72 / 42 32 / 24 With 64-bit integer workload, Titan V shines again – it is almost 2.5x (times) faster than the X!
Historically, nVidia cards have not been tuned for integer workloads, but Titan V is almost 2x faster in 32-bit hashing and almost 2.5x faster in 64-bit hashing than the older X. For algorithms that use integer computation this can be quite significant.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 18,460 [+61%] / 18,870 11,480 / 11,470 5,280 / 5,280 Titan V manages to be 60% faster in this FP32 financial workload.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 8,400 [+6.1x] / 9,200 1,370 / 1,300 547 / 511 Switching to FP64 code, the V is over 6x (times) faster than the X.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 4,180 [+81%] / 4,190 2,240 / 2,240 1,200 / 1,140 Binomial uses thread shared data thus stresses the SMX’s memory system: but the V is 80% faster than the X.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 2,000 [+15.5x] / 2,000 129 / 133 51 / 51 With FP64 code the V is much faster – 15x (times) faster!
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 12,550 [+2.35x] / 12,610 5,350 / 5,150 2,140 / 2,000 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here the V is over 2x faster than the X and that is FP32 code!
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 4,440 [+15.1x] / 4,100 294 / 267 118 / 106 Switching to FP64 the V is again over 15x (times) faster!
For financial workloads, the Titan V is significantly faster, almost twice as fast as Titan X on FP32 but over 15x (times) faster on FP64 workloads. If time is money, then this can be money well-spent!
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 9,860 [+57%] / 10,350 6,280 / 6,600 2,550 / 2,550 Without using the new “tensor cores”, Titan V is about 60% faster than the X.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 3,830 [+11.4x] / 3,920 335 / 332 130 / 129 With FP64 precision, the V crushes the X again: it is 11x (times) faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 605 [+2.5x] / 391 242 / 227 148 / 136 FFT allows the V to do even better – no doubt due to HBM2 memory.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 280 [+35%] / 245 207 / 191 89 / 82 We may need some optimisations here, otherwise the V is just 35% faster.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,390 [+15%] / 4,630 5,600 / 4,870 2,100 / 2,000 N-Body simulation also needs some optimisations as the V is just 15% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 4,270 [+15.5x] / 4,200 275 / 275 82 / 81 With FP64 precision, the V again crushes the X – it is 15x faster.
The scientific scores are a bit more mixed – GEMM will require code paths to take advantage of the new “tensor cores” and some optimisations may be required – otherwise FP64 code simply flies on Titan V.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 26,790 [+50%] / 26,660 17,860 / 13,680 7,310 / 5,530 In this 3×3 convolution algorithm, Titan V is 50% faster than the X. Convolution is also used in neural nets (CNN) thus performance here counts.
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 29,200 [+18.6x] 1,570 n/a With FP16 precision, Titan V shines: it is 18x (times) faster than the X but just 12% faster than FP32.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 9,295 [+94%] / 6,750 4,800 / 3,460 1,870 / 1,380 Same algorithm but more shared data allows the V to be almost 2x faster than the X.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 14,900 [+24.4x] 609 n/a With FP16 Titan V is almost 25x (times) faster than the X and also 60% faster than FP32.
GPGPU Image Processing Motion-Blur (7×7) Filter single/FP32 (MPix/s) 9,428 [+2x] / 7,260 4,830 / 3,620 1,910 / 1,440 Again the same algorithm but with even more shared data; the V is 2x faster than the X.
GPGPU Image Processing Motion-Blur (7×7) Filter half/FP16 (MPix/s) 14,790 [+45x] 325 n/a With FP16 the V is now 45x (times) faster than the X, showing the usefulness of FP16 support.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 9,079 [+1.92x] / 7,380 4,740 / 3,450 1,860 / 1,370 Still convolution but with 2 filters – Titan V is almost 2x faster again.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 13,740 [+44x] 309 n/a Just as we have seen above, the V is an astonishing 44x (times) faster than the X, and also ~50% faster than FP32 code.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 111 [+3x] / 66 36 / 55 20 / 25 Different algorithm but here the V is even faster, 3x faster than the X!
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 206 [+2.89x] 71 n/a With FP16 the V is “only” 3x faster than the X but also 2x faster than the FP32 code-path, again a big gain for FP16 processing.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 157 [+10x] / 24 15 / 15 12 / 11 Without major processing, this filter flies on the V – it is 10x faster than the X.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 215 [+4x] 50 n/a FP16 precision is “just” 4x faster but it is also ~40% faster than FP32.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 24,370 [+25%] / 22,780 19,480 / 14,000 7,600 / 6,640 This algorithm is 64-bit integer heavy and here Titan V is 25% faster than the X.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 24,180 [+4x] 6,090 n/a FP16 does not help a lot here, but still the V is 4x faster than the X.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 846 [+3x] / 874 288 / 635 210 / 308 One of the most complex and largest filters, Titan V does very well here, it is 3x faster than the X.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 1,712 [+3.7x] 461 n/a Switching to FP16, the V is almost 4x (times) faster than the X and over 2x faster than FP32 code.
For image processing, Titan V brings big performance increases, from 50% to 4x (times) faster than Titan X, a big upgrade. If you are willing to drop to FP16 precision, then it is an extra 50% to 2x faster again – while naturally FP16 is not really usable on the X. With potentially 8x better performance, Titan V powers through image processing tasks.
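
The Mandel FP128 row earlier in this table notes that emulated FP128 tracks FP64 performance. The article does not say how the emulation is done; one common technique is "double-double" arithmetic, where a value is held as an unevaluated sum of two FP64 numbers. The sketch below is an illustration under that assumption only (Python floats are IEEE FP64), not the benchmark's actual code.

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's error-free addition: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two double-double values, each given as a (hi, lo) pair."""
    s, e = two_sum(x[0], y[0])   # add the high parts, capturing rounding error
    e += x[1] + y[1]             # fold in the low parts
    return two_sum(s, e)         # renormalise into (hi, lo)


# Plain FP64 loses the small term entirely...
print(1.0 + 1e-20 == 1.0)
# ...while the double-double pair keeps it in the low word.
print(dd_add((1.0, 0.0), (1e-20, 0.0)))
```

In this sketch a single extended-precision add costs 14 FP64 additions/subtractions, which is why such emulation can only ever be a fraction of native FP64 throughput.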

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Memory Benchmarks nVidia Titan V CUDA/OpenCL nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 536 [+51%] / 530 356 / 354 145 / 144 HBM2 brings about 50% more raw bandwidth to feed all the extra compute cores, a significant upgrade.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.47 / 11.4 11.4 / 9 12.1 / 12 Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4!
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.3 / 12.3 12.2 / 8.9 11.5 / 12.2 Again no significant difference but we were not expecting any.
Titan V’s HBM2 brings 50% more memory bandwidth but as it still uses the PCIe3 x16 connection there is no change to host upload/download bandwidth, which may be a bit of a bottleneck when trying to keep all those cores fed with data. Even more streaming load/save is required and code will need to be optimised to use all that processing power.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 180 [-10%] / 187 201 / 230 230 From the start we see global latency accesses reduced by 10%, not a lot but will help.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 311 [+9%] / 317 286 / 311 306 Full range random accesses do seem to be 9% slower which may be due to the architecture.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53 [-40%] / 57 89 / 121 97 However, sequential accesses seem to have dropped a huge 40% likely due to better prefetchers on the Titan V.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75 [-36%] / 76 117 / 174 126 Constant memory latencies also seem to have dropped by almost 40% a great result.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18 / 85 18 / 53 21 No significant change in shared memory latencies.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 [+9%] / 279 195 / 196 208 Texture access latencies seem to have increased by 9%.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 [+22%] / 313 282 / 278 308 As we’ve seen with global memory, we see increased latencies here by about 20%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88 / 163 87 / 123 102 With sequential access there is no appreciable delta in latencies.
HBM2 does seem to increase latencies slightly, by about 10%, but for sequential accesses Titan V performs a lot better than the X with 20-40% lower latencies, likely due to the new architecture. Thus code using coalesced memory accesses will perform faster, but code using a random access pattern over large data sets will see somewhat higher latencies.
We see L1 cache effects between 64-128kB tallying with an L1D of 96kB – 4x more than what we’ve seen on Titan X (at 16kB). The other inflexion is at 4MB – matching the 4.5MB L2 cache size – which is 50% more than what we saw on Titan X (at 3MB).
As with global memory we see the same L1D (64kB) and L2 (4.5MB) cache effects with similar latencies. Both are significant upgrades over Titan X’s caches.

Titan V’s memory performance does not disappoint – HBM2 obviously brings a large bandwidth increase – while latency depends on access pattern: when prefetchers can engage latencies are much lower, but in random accesses out-of-page they are a bit higher, though nothing significant. We’re also limited by the PCIe3 bus for transfers, which requires judicious overlap of memory transfers and compute to keep the cores busy.
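
To see why overlapping transfers and compute matters, here is a toy timing model. It is only a sketch under simplifying assumptions that are ours: the work splits into equal chunks, upload, compute and download can each run concurrently on different chunks, and there is no per-chunk overhead. The stage timings in the example are hypothetical, not measurements.

```python
def serial_time(upload_s: float, compute_s: float, download_s: float) -> float:
    """Total time when upload, compute and download run strictly one after another."""
    return upload_s + compute_s + download_s


def pipelined_time(upload_s: float, compute_s: float, download_s: float,
                   chunks: int) -> float:
    """Total time for a 3-stage pipeline over `chunks` equal pieces.

    The first chunk passes through all three stages; every further chunk
    adds only the duration of the slowest stage.
    """
    stages = [upload_s / chunks, compute_s / chunks, download_s / chunks]
    return sum(stages) + (chunks - 1) * max(stages)


# Hypothetical job: 1s upload, 2s compute, 1s download in total.
print(serial_time(1.0, 2.0, 1.0))          # no overlap
print(pipelined_time(1.0, 2.0, 1.0, 4))    # 4 chunks, overlapped
```

With more chunks the total approaches the duration of the slowest stage, so on a card this fast the PCIe transfers quickly become the limiting stage unless the data stays resident on the device.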

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

“Volta” architecture does bring good improvements in FP32 performance which we hope to see soon in consumer (Series 11?) graphics cards – as well as lower-end Titan cards.

But here (on Titan V) we have the top-end chip with full-power FP64 and FP16 units more akin to Tesla which simply power through any and all algorithms you can throw at them. This is really the “Titan” you were looking for, and moving from the previous Titan X (Pascal) is a huge upgrade, admittedly for quite a bit more money.

If you have workloads that require double/FP64 precision – Titan V is 15-16x faster than Titan X – thus great value for money. If code can make do with FP16 precision then you can gain up to 2x extra performance again – as well as save storage for large data-sets – again Titan X cannot cut it here running at 1/64 rate.

We have not yet shown tensor core performance which is an additional reason for choosing such a card – if you have code that can make use of them you can gain an extra 16x (times) performance that really puts Titan V head and shoulders above the Titan X.

All in all Titan V is a compelling upgrade if you need more power than Titan X and are using (or thinking of using) multiple cards – there is simply no point in that any more. One Titan V can replace 4 or more Titan X cards on FP64 or FP16 workloads and that is before you make any optimisations. Obviously you are still “stuck” with 12GB of memory and the PCIe bus for transfers, but with judicious optimisations this should not impact performance significantly.

nVidia Titan X: Pascal GPGPU Performance in CUDA and OpenCL

What is “Titan X (Pascal)”?

It is the current high-end “pro-sumer” card from nVidia using the current generation “Pascal” architecture – equivalent to the Series 10 cards. It is based on the 2nd-from-the-top 102 chipset (not the top-end 100) thus it does not feature full speed FP64/FP16 performance that is generally reserved for the “Quadro/Tesla” professional range of cards. It does however come with more memory to fit more datasets and is engineered for 24/7 performance.

Pricing has increased a bit from previous generation X/XP but that is a general trend today from all manufacturers.

Hardware Specifications

We are comparing the top-of-the-range Titan X with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications nVidia Titan X (P) nVidia 980 GTX (M2) AMD Vega 56 AMD Fury Comments
Arch Chipset Pascal GP102 (6.1) Maxwell 2 GM204 (5.2) Vega 10 Fiji The X uses the current Pascal architecture that is also powering the current Series 10 consumer cards
Cores (CU) / Threads (SP) 28 / 3584 16 / 2048 56 / 3584 64 / 4096 We’ve got 28CU/SMX here down from 32 on GP100/Tesla but should still be sufficient to power through tasks.
FP32 / FP64 / Tensor Cores 3584 / 112 / no 2048 / 64 / no 3584 / 448 / no 4096 / 512 / no Only 112 FP64 units – a lot less than competition from AMD, this is a card geared for FP32 workloads.
Speed (Min-Turbo) 1.531GHz (139-1910) 1.126GHz (135-1.215) 1.64GHz 1GHz Higher clocked than the previous generation and comparable with the competition.
Power (TDP) 250W (125-300) 180W (120-225) 200W 150W TDP has also increased to 250W but again that is in line with top-end cards that are pushing over 200W.
ROP / TMU 96 / 224 64 / 128 64 / 224 64 / 256 As it may also be used as a top-end graphics card, it has a good amount of ROPs (50% more than the competition) and similar numbers of TMUs.
Global Memory 12GB GDDR5X 10Gbps 384-bit 4GB GDDR5 7Gbps 256-bit 8GB HBM2 2Gbps 2048-bit 4GB HBM 1Gbps 4096-bit Titan X comes with a huge 12GB of current GDDR5X memory while the competition has switched to HBM2 for top-end cards.
Memory Bandwidth (GB/s) 512 224 483 512 Due to high speed GDDR5X, the X has plenty of memory bandwidth even higher than HBM2 competition.
L2 Cache 3MB 2MB 4MB 2MB L2 cache has increased by 50% over previous arch to keep all cores fed.
FP64/double ratio 1/32 1/32 1/8 1/8 The X is not really meant for FP64 workloads as it uses the same 1:32 ratio as normal consumer cards.
FP16/half ratio 1/64 n/a 1/1 1/1 With a 1:64 ratio FP16 is not really usable on Titan X and can only really be used for compatibility testing.
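
The bandwidth figures follow from the per-pin data rate and the bus width listed in the Global Memory row. As a minimal sketch (the function name and the dictionary are illustrative, not taken from any benchmark tool), the theoretical peak can be estimated as data rate times bus width divided by 8. The results will not necessarily match the Memory Bandwidth row exactly, since the data rates listed in the table look rounded; that is an assumption.

```python
def peak_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s.

    data_rate_gbps: per-pin data rate in Gbit/s
    bus_width_bits: memory bus width in bits
    """
    return data_rate_gbps * bus_width_bits / 8.0  # 8 bits per byte


# Data rates and bus widths as listed in the Global Memory row of the table.
cards = {
    "Titan X (GDDR5X)": (10.0, 384),
    "GTX 980 (GDDR5)": (7.0, 256),
    "Vega 56 (HBM2)": (2.0, 2048),
    "Fury (HBM)": (1.0, 4096),
}

for name, (rate, width) in cards.items():
    print(f"{name}: {peak_bandwidth_gbs(rate, width):.0f} GB/s")
```

A wide but slow bus (HBM) and a narrow but fast one (GDDR5X) can land on similar totals, which is why the two memory types end up so close in the table.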

nVidia Titan X (Pascal)

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers from both nVidia and competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL AMD Vega 56 OpenCL AMD Fury OpenCL Comments
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 17,870 [37%] / 16,000 7,000 / 6,100 13,000 8,720 Titan X makes a good start beating the Vega by almost 40%.
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 245 [-98%] / n/a n/a 13,130 7,890 FP16 is so slow that it is unusable – just for testing.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 661 [-47%] / 672 259 / 265 1,250 901 FP64 is also quite slow though a lot faster than on the GTX 980.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 25 [-67%] / 24 10.8 / 10.7 77.3 55 Emulated FP128 precision depends entirely on FP64 performance and thus is… slow.
With “normal” FP32 workloads Titan X is quite fast: ~40% faster than the Vega and about 2.5x faster than the older GTX 980, quite an improvement. FP16 is best avoided (FP32 is the better choice) and FP64 runs at about 1/2 the performance of a Vega, slower than even a Fiji. As long as all workloads are FP32 there should be no problems.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 40 [-38%] / 38 16 / 16 65 46 Titan X is a lot faster than the previous generation but still ~40% slower than a Vega.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 52 [-38%] / 51 23 / 21 84 60 Nothing changes here, the X is still about 40% slower than a Vega.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 134 [+4%] / 142 58 / 59 129 82 In this integer workload, somehow Titan X manages to beat the Vega by 4%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 [-34%] / 114 50 / 54 163 124 SHA1 is mysteriously slower thus the X is ~35% slower than a Vega.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 72 [+2.3x] / 42 32 / 24 31 13.8 With a 64-bit integer workload, Titan X is a massive 2.3x faster than a Vega.
Historically, nVidia cards have not been tuned for integer workloads, but Titan X still manages to beat a Vega – the “gold standard” for crypto-currency hashing – on both SHA256 and especially on 64-bit integer SHA2-512! Perhaps for the first time an nVidia card is competitive on integer workloads and even much faster on 64-bit integer workloads.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 11,480 [+28%] / 11,470 5,280 / 5,280 9,000 11,220 In this FP32 financial workload Titan X is almost 30% faster than a Vega.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 1,370 [-36%] / 1,300 547 / 511 1,850 1,290 Switching to FP64 code, the X remains competitive and is about 35% slower.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 2,240 [-8%] / 2,240 1,200 / 1,140 2,440 1,760 Binomial uses thread shared data and thus stresses the SMX’s memory system; here Vega surprisingly does better by 8%.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 129 [-20%] / 133 51 / 51 161 115 With FP64 code the X is only 20% slower than a Vega.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,350 [+47%] / 5,150 2,140 / 2,000 3,630 2,470 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Titan X is almost 50% faster!
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 294 [-34%] / 267 118 / 106 385 332 Switching to FP64 the X is again 34% slower than a Vega.
For financial FP32 workloads, the Titan X generally beats the Vega by a good amount or at least ties with it; with FP64 precision it is about 1/2 the speed which is to be expected. As long as you have FP32 workloads this should not be a problem.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,280 [+19%] / 6,600 2,550 / 2,550 5,260 3,630 Using 32-bit precision Titan X beats the Vega by 20%.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 335 [-40%] / 332 130 / 129 555 381 With FP64 precision, unsurprisingly the X is 40% slower.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 242 [-20%] / 227 148 / 136 306 348 FFT does better with HBM memory and here Titan X is 20% slower than a Vega.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 207 / 191 89 / 82 139 116 Surprisingly the X does very well here and manages to beat all cards by almost 50%!
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 5,600 [+20%] / 4,870 2,100 / 2,000 4,670 3,080 Titan X does well in this algorithm, beating the Vega by 20%.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 275 [-20%] / 275 82 / 81 343 303 With FP64 precision, the X is again 20% slower.
The scientific scores are similar to the financial ones but the gain/loss is about 20% not 40% – in FP32 workloads Titan X is 20% faster while in FP64 it is about 20% slower than a Vega – a closer result than expected.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 14,550 [-60%] / 10,880 7,310 / 5,530 36,000 28,000 In this 3×3 convolution algorithm, somehow Titan X is over 50% slower than a Vega and even a Fury.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 3,840 [-11%] / 2,750 1,870 / 1,380 4,300 3,150 Same algorithm but more shared data reduces the gap to 10% but still a loss.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 3,920 [-10%] / 2,930 1,910 / 1,440 4,350 3,200 With even more data the gap remains at 10%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 3,740 [-11%] / 2,760 1,860 / 1,370 4,210 3,130 Still convolution but with 2 filters – Titan X is 10% slower again.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.7 / 55 [+52%] 20.6 / 25.4 36.3 30.8 Different algorithm allows the X to finally beat the Vega by 50%.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 15.6 [-60%] / 15.3 12.2 / 11.4 38.7 14.3 Without major processing, this filter does not like the X much: it runs at less than 1/2 the speed of the Vega.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 16,480 [-57%] / 14,000 7,600 / 6,640 38,730 28,500 This algorithm is 64-bit integer heavy but again Titan X is 1/2 the speed of Vega.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 290 / 6,350 [+13%] 210 / 3,080 5,600 4,410 One of the most complex and largest filters, Titan X finally beats the Vega by over 10%.
For image processing using FP32 precision Titan X surprisingly does not do as well as expected – either in CUDA or OpenCL – with the Vega beating it by a good margin on most filters – a pretty surprising result. Perhaps more optimisations are needed on nVidia hardware. We obviously did not test FP16 performance at all as that would have been far slower.
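
The bracketed figures in the table appear to be the Titan X result relative to the Vega 56 result. The sketch below shows one way such figures can be derived. The helper names and the choice to switch from a percentage to a multiplier at 2x are assumptions made for illustration, not a description of how the benchmark tool formats its output.

```python
def relative_delta(value: float, baseline: float) -> float:
    """Percentage difference of value against baseline (positive means faster)."""
    return (value / baseline - 1.0) * 100.0


def format_delta(value: float, baseline: float) -> str:
    """Format as a multiplier from 2x upwards (assumed threshold), else as a percentage."""
    ratio = value / baseline
    if ratio >= 2.0:
        return f"[+{ratio:.1f}x]"
    return f"[{relative_delta(value, baseline):+.0f}%]"


# Titan X (CUDA) against Vega 56 (OpenCL), values taken from the table above.
print("Mandel FP32:", format_delta(17870, 13000))
print("AES-256:    ", format_delta(40, 65))
print("SHA2-512:   ", format_delta(72, 31))
```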

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers from nVidia and competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Memory Benchmarks nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL AMD Vega 56 OpenCL AMD Fury OpenCL Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 356 [+13%] / 354 145 / 144 316 387 Titan X brings more bandwidth than a Vega (+13%) but the old Fury takes the crown.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.4 / 9 12.1 / 12 12.1 11 All cards use PCIe3 x16 and thus no appreciable delta.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.2 / 8.9 11.5 / 12.2 10 9.8 Again no significant difference but we were not expecting any.
Titan X uses current GDDR5X but with a high data rate, allowing it to bring more bandwidth than some HBM2 designs – a pretty impressive feat. Naturally high-end cards using HBM2 should have even higher bandwidth.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 201 / 230 230 273 343 Compared to previous generation, Titan X has better latency due to higher data rate.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 286 / 311 306 399 525 Similarly, even full random accesses are faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89 / 121 97 129 216 Sequential access has similarly low latencies but nothing special.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 / 174 126 269 353 Constant memory latencies have also dropped.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18 / 53 21 49 112 Even shared memory latencies have dropped likely due to higher core clocks.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 / 196 208 121 Texture access latencies have come down as well.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 / 278 308 And even full range latencies have decreased.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87 / 123 102 With sequential access there is no appreciable delta in latencies.
We’re only comparing CUDA latencies here (as OpenCL is quite variable) – thus compared to the previous generation (GTX 980) all latencies are down, either due to higher memory data rate or core clock increases – but nothing spectacular. Still good progress and everything helps.
We see L1 cache effects until 16kB (same as previous arch) and between 2-4MB tallying with the 3MB cache. While fast perhaps they could be a bit bigger.
As with global memory we see the same L1D and L2 cache effects with similar latencies. All in all good performance but we could do with bigger caches.

Titan X’s memory performance is what you’d expect from higher clocked GDDR5X memory – it is competitive even with the latest HBM2 powered competition – both bandwidth and latency wise. There are no major surprises here and everything works nicely.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Titan X, based on the current “Pascal” architecture, performs very well in FP32 workloads – it is much faster than the previous generation for a modest price increase and is competitive with AMD’s Vega offerings. But it is likely due to be replaced soon, as the next-generation “Volta” architecture is already out on the high-end (Titan V) and is likely to filter down the stack to both consumer (Series 11?) cards and “pro-sumer” Titan cards cheaper than the Titan V.

For FP64 workloads it is perhaps best to choose an older Quadro/Tesla card with more FP64 units as performance is naturally much lower. FP16 performance is also restricted and pretty much not usable – good for compatibility testing should you hope to upgrade to a full-speed FP16 card in the future. For both these workloads – the high-end Titan V is the card you probably want – but at a much higher price.

Still for the money, Titan X has its place: the most common FP32 workloads (financial, scientific, high precision image processing, etc.) that do not require FP64 or FP16 optimisations perform very well, and for these this card is all you need.


FP16 GPGPU Image Processing Performance & Quality

GPGPU Image Processing

What is FP16 (“half”)?

FP16 (aka “half” floating-point) is the IEEE lower-precision floating-point representation that has recently begun to be supported by GPGPUs for compute (e.g. Intel EV9+ Skylake GPU, nVidia Pascal) while CPU support is still limited to SIMD conversion only (FP16C). It has been added to allow mobile devices (phones, tablets) to provide increased performance (and thus save power for fixed workloads) for a small drop in quality for normal 8-bpc (24-bpp) image and video.

However, normal laptops and tablets with integrated graphics can also benefit from FP16 support in the same way, due to relatively low graphics compute power and the need to save power due to limited battery in thin and light formats.

In this article we’re investigating the performance differences vs. standard FP32 (aka “single”) and the resulting quality difference (if any) for mobile GPGPUs (Intel’s EV9/9.5 SKL/KBL).

Image Processing Performance & Quality

We are testing GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Image Filter FP32/Single FP16/Half Comments
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  481  967 [+2x] We see a text-book 2x performance increase for no visible drop in quality.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  107  331 [+3.1x] Using FP16 yields over 3x performance increase but we do see a few more changed pixels though no visible difference.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  112  325 [+2.9x] Again almost 3x performance increase but no visible quality difference. Result!
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  107  323 [+3.1x] Again just over 3x performance increase but no visible quality difference.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s) 5.41  5.67 [+4%] No image difference at all but also almost no performance increase – a measly 4%.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  4.7  13.48 [+2.86x] We’re back with a 2.8x performance increase but a few more differences than we’ve seen so far, though quality seems acceptable.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  1188  1210 [+2%] Due to random number generation using 64-bit integer processing the performance difference is minimal, but the picture quality is not acceptable.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s) 470  508 [+8%] Again due to Perlin noise generation we see almost no performance gain but big drop in image quality – not worth it.
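
The “changed pixels” mentioned in the comments can be illustrated without a GPU. The following is a minimal sketch, not the benchmark’s kernel: it emulates FP16 and FP32 rounding on the CPU with Python’s `struct` module (format codes `e` and `f`), runs a simple 3×3 box blur on a synthetic image, and counts how many 8-bit output pixels differ. The image, the filter weights and all function names are assumptions chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision (binary16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision (binary32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def blur3x3(image, rnd):
    """3x3 box blur; every intermediate result is rounded with rnd.

    Edge pixels are handled by clamping coordinates to the image.
    """
    height, width = len(image), len(image[0])
    weight = rnd(1.0 / 9.0)
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    pixel = rnd(image[yy][xx] / 255.0)
                    acc = rnd(acc + rnd(pixel * weight))
            # Quantise back to an 8-bit channel value.
            row.append(min(255, max(0, int(acc * 255.0 + 0.5))))
        result.append(row)
    return result


# Synthetic 16x16 single-channel test image (hypothetical data).
image = [[(x * 37 + y * 91) % 256 for x in range(16)] for y in range(16)]

out32 = blur3x3(image, to_fp32)
out16 = blur3x3(image, to_fp16)

diffs = [abs(a - b) for r32, r16 in zip(out32, out16) for a, b in zip(r32, r16)]
changed = sum(1 for d in diffs if d != 0)
print(f"changed pixels: {changed} of {len(diffs)}, largest difference: {max(diffs)} level(s)")
```

For a short convolution like this the FP16 rounding error stays within a level or two of an 8-bit channel, which is consistent with differences that are countable but not visible. Filters that feed long chains of arithmetic into the result have more room to accumulate error.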

Other Image Processing Related Algorithms

Image Filter FP16/Half FP32/Single FP64/Double Comments
GPGPU Science Benchmark GEMM OpenCL (GFLOPS)  178 [+50%]  118  35 Dropping to FP16 gives us 50% more performance, not as good as 2x but still a significant increase.
GPGPU Science Benchmark FFT OpenCL (GFLOPS)  34 [+70%]  20  5.4 With FFT we are now 70% faster, closer to the 100% promised.
GPGPU Science Benchmark N-Body OpenCL (GFLOPS)  297 [+49%]  199  35 Again we drop to “just” 50% faster with FP16 but still a great performance improvement.

Final Thoughts / Conclusions

For many image processing filters (Blur, Sharpen, Motion-Blur, Sobel/Edge-Detection, Oil Painting) we see a huge 2-3x performance increase – more than we had hoped for (2x) – with little or no image quality degradation; only Median/De-Noise gains almost nothing (4%). Thus FP16 support is very much useful and should be used when supported.

However for complex filters (Diffusion, Marble/Perlin Noise) the drop in quality is not acceptable for minor performance increase (2-8%); increasing the precision of more data items to improve quality (from FP16 to FP32) would further drop performance making the whole endeavour pointless.

For those algorithms that do benefit, the performance improvement is very much worth it – so FP16 support is very useful indeed.

Intel Graphics GPGPU Performance


Why test GPGPU performance of Intel Core Graphics?

Laptops (and tablets) are still in fashion with desktops largely left to PC game enthusiasts and workstations for big compute workloads; most laptops (and all tablets) make do with integrated graphics, with few dedicated graphics options, mainly for mobile PC gamers.

As a result integrated graphics on Intel’s mobile platform is what the vast majority of users will experience – thus its importance is not to be underestimated. While in the past integrated graphics options were dire, the introduction of the Core v3 (Ivy Bridge) series brought us a GPGPU-capable graphics processor as well as an updated version of the internal media transcoder of Core v2 (Sandy Bridge).

With each generation Intel has progressively improved the graphics core, perhaps far more than its CPU cores – and added more variants (GT3) and embedded cache (eDRAM) which greatly increased performance – all within the same power limit.

New Features enabled by the latest 21.45 graphics driver

With Intel graphics drivers supporting just 2 generations of graphics – unlike unified drivers of AMD and nVidia – old graphics quickly become obsolete with few updates; but Windows 10 “free update” forced Intel’s hand somewhat – with its driver (20.40) supporting 3 generations of graphics (Haswell, Broadwell and latest at the time Skylake).

However, the latest 21.45 driver for newly released Kabylake and older Skylake does bring new features that can make a big difference in performance:

  • Native FP64 (64-bit aka “double” floating-point support) in OpenCL – thus allowing high precision compute on integrated graphics.
  • Native FP16 (16-bit aka “half” floating-point support) in OpenCL, ComputeShader – thus allowing lower precision but faster compute.
  • Vulkan graphics interface support – OpenGL’s successor and DirectX 12’s competitor – for faster graphics and compute.
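
In OpenCL, FP16 and FP64 arithmetic support is advertised through the `cl_khr_fp16` and `cl_khr_fp64` entries of the device extension string (`CL_DEVICE_EXTENSIONS`). A minimal sketch of the check follows. It parses a plain string so that it runs without a GPU or an OpenCL binding, and the two example strings are hypothetical rather than captured from the drivers discussed here.

```python
def supported_precisions(extensions: str) -> dict:
    """Report FP16/FP64 support from a space-separated OpenCL extension string."""
    names = set(extensions.split())
    return {
        "fp16": "cl_khr_fp16" in names,
        "fp64": "cl_khr_fp64" in names,
    }


# Hypothetical extension strings, for illustration only.
older_driver = "cl_khr_global_int32_base_atomics cl_khr_byte_addressable_store"
newer_driver = older_driver + " cl_khr_fp16 cl_khr_fp64"

print("older:", supported_precisions(older_driver))
print("newer:", supported_precisions(newer_driver))
```

Splitting into whole names rather than searching for a substring avoids false matches against longer extension names that merely contain the same text.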

Will these new features make upgrading your laptop to a brand-new KBL laptop more compelling?

In this article we test (GP)GPU graphics unit performance.

Hardware Specifications

We are comparing the internal GPUs of the new Intel ULV APUs with the old versions.

Graphics Unit Haswell HD4000 Haswell HD5000 Broadwell HD6100 Skylake HD520 Skylake HD540 Kabylake HD620 Comment
Graphics Core EV7.5 HSW GT2U EV7.5 HSW GT3U EV8 BRW GT3U EV9 SKL GT2U EV9 SKL GT3eU EV9.5 KBL GT2U Despite 4 CPU generations we really have 2 GPU generations.
APU / Processor Core i5-4210U Core i7-4650U Core i7-5557U Core i7-6500U Core i5-6260U Core i3-7100U The naming convention has changed between generations.
Cores (CU) / Shaders (SP) / Type 20C / 160SP 40C / 320SP 48C / 384SP 24C / 192SP 48C / 384SP 23C / 184SP BRW increased CUs to 24/48 and i3 misses 1 core.
Speed (Min / Max / Turbo) MHz 200-1000 200-1100 300-1100 300-1000 300-950 300-1000 The turbo clocks have hardly changed between generations.
Power (TDP) W 15 15 28 15 15 15 Except GT3 BRW, all ULVs are 15W rated.
DirectX CS Support 11.1 11.1 11.1 11.2 / 12.1 11.2 / 12.1 11.2 / 12.1 SKL/KBL enjoy v11.2 and 12.1 support.
OpenGL CS Support 4.3 4.3 4.3 4.4 4.4 4.4 SKL/KBL provide v4.4 vs. version 4.3 for older devices.
OpenCL CS Support 1.2 1.2 1.2 2.0 2.0 2.1 SKL provides v2 support with KBL 2.1 vs 1.2 for older devices.
FP16 / FP64 Support No / No No / No No / No Yes / Yes Yes / Yes Yes / Yes SKL/KBL support both FP64 and FP16.
Byte / Integer Width 8 / 32-bit 8 / 32-bit 8 / 32-bit 128 / 128-bit 128 / 128-bit 128 / 128-bit SKL/KBL prefer vectorised integer workloads, 128-bit wide.
Float/ Double Width 32 / X-bit 32 / X-bit 32 / X-bit 32 / 64-bit 32 / 64-bit 32 / 64-bit Strangely neither arch prefers vectorised floating-point loads – driver bug?
Threads per CU 512 512 256 256 256 256 Strangely BRW and later reduced the threads/CU to 256.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 288 399 597 875 [+3x] 1500 840 [+2.8x] If FP16 is enough, KBL and SKL have 2x performance of FP32.
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 299 375 614 468 [+56%] 817 452 [+50%] SKL GT3e rules the roost but KBL hardly improves on SKL.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 18.54 (eml) 24.4 (eml) 38.9 (eml) 112 [+6x] 193 104 [+5.6x] SKL GT2 with native FP64 is almost 3x faster than emulated FP64 on BRW GT3!
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.8 (eml) 2.36 (eml) 4.4 (eml) 6.34 (eml) [+3.5x] 10.92 (eml) 6.1 (eml) [+3.4x] Emulating FP128 through FP64 is ~2.5x faster than through FP32.
As expected native FP16 runs about 2x faster than FP32 and thus provides a huge performance upgrade if precision is sufficient. Native FP64 is about 8x emulated FP64 and even emulated FP128 improves by about 2.5x! Otherwise KBL GT2 matches SKL GT2 and is about 50% faster than HSW GT2 in FP32 and 6x faster in FP64.
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 1.37 1.85 2.7 2.19 [+60%] 3.36  2.21 [+60%] Since BRW integer performance is similar.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1.87 2.45 3.45 2.79 [+50%] 4.3 2.83 [+50%] Not a lot changes here.
SKL/KBL GT2 with integer workloads (with extensive memory accesses) are 50-60% faster than HSW, similar to what we saw with floating-point performance. But the change happened with BRW, which improved the most over HSW, with SKL and KBL not improving further.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s)  1.2 1.62 4.35  3 [+2.5x] 5.12 2.92 In this tough compute test SKL/KBL are 2.5x faster.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2.86  3.93  9.82  6.7 [+2.34x]  11.26  6.49 With a lighter algorithm SKL/KBL are still ~2.4x faster.
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s)  0.828  1.08 1.68 1.08 [+30%] 1.85  1 64-bit integer performance does not improve much.
In pure integer compute tests SKL/KBL greatly improve over HSW, being no less than 2.5x faster, a huge improvement; but 64-bit integer performance hardly improves (30% faster with 20% more CUs, 24 vs 20). Again BRW is where the improvements were added, with SKL GT3e hardly improving over BRW GT3.
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 461 495 493 656 [+42%]  772 618 [+40%] Pure FP32 compute SKL/KBL are 40% faster.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) 137  238 135 SKL GT3 is 73% faster than GT2 variants.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 62.45 85.76 123 86.32 [+38%]  145.6 82.8 [+35%] In this tough algorithm SKL/KBL are still almost 40% faster.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) 18.65 31.46 19 SKL GT3 is over 65% faster than GT2 KBL.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 106 160.4 192 174 [+64%] 295 166.4 [+56%] M/C is not as tough so here SKL/KBL are 60% faster.
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) 31.61 56 31 GT3 SKL manages an 80% improvement over GT2.
Intel is pulling our leg here; KBL GPU seems to show no improvement whatsoever over SKL, but both are about 40% faster in FP32 than the much older HSW. GT3 SKL variant shows good gains of 65-80% over the common GT2 and thus is the one to get if available. Obviously the ace card for SKL and KBL is FP64 support.
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS)  117  130 142 116 [=]  181 113 [=] SKL/KBL have a problem with this algorithm but GT3 does better?
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) 34.9 64.7 34.7 GT3 SKL manages an 86% improvement over GT2.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 13.3 13.1 15 20.53 [+54%]  27.3 21.9 [+64%] In a return to form SKL/KBL are 50% faster.
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) 5.2  4.19  4.69 GT3 stumbles a bit here; some optimisations are needed.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS)  122  157.9 249 201 [+64%]  304 177.6 [+45%] Here SKL/KBL are 50% faster overall.
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) 19.25 31.9 17.8 GT3 manages only a 65% improvement here.
Again we see no delta between SKL and KBL – the graphics cores perform the same; again both benefit from FP64 support allowing high precision kernels to run. GT3 SKL variant greatly improves over common GT2 variant – except in one test (DFFT) that seems to be an outlier.
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  341  432  636 492 [+44%]  641 488 [+43%] We see the GT3s trading blows in this integer test, but SKL/KBL are 40% faster than HSW.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  72.7  92.8  147  106 [+45%]  139  106 [+45%] BRW GT3 just wins this with SKL/KBL again 45% faster.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  75.6  96  152  110 [+45%]  149  111 [+45%] Another win for BRW and a 45% improvement for SKL/KBL.
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  72.6  90.6  147  105 [+44%]  143  105 [+44%] As above in this test.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s)  2.38  1.53  6.51  5.2 [+2.2x]  7.73  5.32 [+2.23x] SKL’s GT3 manages a win but overall SKL/KBL are over 2x faster than HSW.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  1.17  0.719  5.83  4.57 [+3.9x]  4.58  4.5 [+3.84x] Another win for BRW.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  511  688  1150  1100 [+2.1x]  1750  1080 [+2.05x] SKL/KBL are over 2x faster than HSW. BRW is beaten here.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s)  378.5  288  424  437 [+15%]  611  443 [+17%] Some wild results here, some optimizations may be needed.
In these integer workloads (with texture access) the 28W GT3 of BRW manages a few wins over the 15W GT3e of SKL – but compared to old HSW – both SKL and KBL are between 40 and 300% faster. Again we see no delta between SKL and KBL – there does not seem to be any difference at all!

If you have a HSW GT2 then an upgrade to SKL GT2 brings massive improvements as well as native FP16 and FP64 support. But the HSW GT3 variant is competitive and BRW GT3 even more so. KBL GT2 shows no improvement over SKL GT2, so it’s not just the CPU core that is unchanged but the graphics core also: this is no EV9.5, more like EV9.1!

For integer workloads BRW is where the big improvement came, but for 64-bit integer that improvement is still to come, if ever. At least all drivers support native int64.
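
The FP128 results in the arithmetic table are emulated, and the comments note that emulation through FP64 is faster than through FP32. The article does not say which emulation method is used. One common technique, shown here purely as an assumed example, is “double-double” arithmetic, where each value is stored as a pair of FP64 numbers. This gives roughly 106 bits of significand rather than true IEEE quad precision. Python floats are FP64, so the sketch runs as-is.

```python
def two_sum(a: float, b: float):
    """Knuth's error-free addition: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def quick_two_sum(a: float, b: float):
    """Error-free addition, valid when |a| >= |b|."""
    s = a + b
    err = b - (s - a)
    return s, err


def dd_add(x, y):
    """Add two double-double numbers, each a (hi, lo) pair of FP64 values."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


one = (1.0, 0.0)
tiny = (1e-20, 0.0)

print("plain FP64 keeps the small term:", 1.0 + 1e-20 != 1.0)
print("double-double result (hi, lo):  ", dd_add(one, tiny))
```

A single extended-precision addition costs roughly a dozen FP64 operations in this scheme, so emulated throughput tracks the FP64 rate of the hardware. Building the same precision from FP32 pieces needs more components per value, which would fit the slower result reported for that path.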

Transcoding Performance

We are testing media (video + audio) transcoding performance for common video algorithms: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
H.264/AVC Decoder/Encoder QuickSync H264 8-bit only QuickSync H264 8-bit only QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit HSW supports 8-bit only, so 10-bit (high-colour) content is out of luck.
H.265/HEVC Decoder/Encoder QuickSync H265 8-bit partial QuickSync H265 8-bit QuickSync H265 8-bit QuickSync H265 8/10-bit SKL has full/hardware H265/HEVC transcoding but for 8-bit only; Main10 (10-bit profile) requires KBL so finally we see a difference.
Transcode Benchmark VC 1 > H264/AVC Transcoding (MB/s)  7.55 8.4  7.42 [-2%]  8.25  8.08 [+6%] With DDR4 KBL is 6% faster.
Transcode Benchmark VC 1 > H265/HEVC Transcoding (MB/s)  0.734  3.14 [+4.2x]  3.67  3.63 [+5x] Hardware support makes SKL/KBL 4-5x faster.

If you want HEVC/H.265, including 4K/UHD, then you want SKL. But if you plan on using 10-bit/HDR colour then you need KBL – finally an improvement over SKL. As transcoding uses fixed-function hardware, the GT3 performs only slightly faster.

Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX/OpenGL ComputeShader,  including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Memory Configuration 8GB DDR3 1.6GHz 128-bit 8GB DDR3 1.6GHz 128-bit 16GB DDR3 1.6GHz 128-bit 8GB DDR3 1.867GHz 128-bit 16GB DDR4 2.133GHz 128-bit 16GB DDR4 2.133GHz 128-bit All use 128-bit memory with SKL/KBL using DDR4.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 64 2048 / 64 2048 / 64 2048 / 64 Shared memory remains the same; in SKL/KBL constant memory is the same as global.
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 10.4 10.7 11 15.65 23 [+2.1x] 19.6 DDR4 seems to provide over 2x bandwidth despite low clock.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 5.23 5.35 5.54 7.74 11.23 [+2.1x] 9.46 Again over 2x increase in up speed.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 5.27 5.36 5.29 7.42 11.31 [+2.1x] 9.6 Again over 2x increase in down speed.
SKL/KBL + DDR4 provide over 2x increase in internal, up and down memory bandwidth – despite the relatively modest increase in memory speed (2133 vs 1600); with DDR3 1867MHz memory the improvement drops to 1.5x. So if you were deciding between DDR3 and DDR4, the choice has been made!
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns)  179 192  234 [+30%]  296 235 [+30%] With DDR4 latency has increased by 30%, which is not great.
GPGPU Memory Latency Constant Memory Latency (ns)  92.5  112  234 [+2.53x]  279  235 [+2.53x] Constant memory has effectively been dropped resulting in a disastrous 2.53x higher latencies.
GPGPU Memory Latency Shared Memory Latency (ns)  80  84  –  86.8 [+8%]  102  84.6 [+8%] Shared memory latency has stayed the same.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns)  283  298  56 [1/5x]  58.1 [1/5x] Texture access seems to have markedly improved to be 5x faster.
SKL/KBL global memory latencies have increased by 30% with DDR4 – thus wiping out some gains. The “new” constant memory (2GB!) is now really just bog-standard global memory and thus shows an over 2x increase in latency. Shared memory latency has stayed pretty much the same. Texture memory access is very much faster – 5x faster, likely through some driver optimisations.

Again no delta between KBL and SKL; if you want bandwidth (who doesn’t?) DDR4 with modest 2133MHz memory doubles bandwidths – but latencies increase. Constant memory is now the same as global memory and does not seem any faster.
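
To put the measured figures in context, the theoretical peak of each memory configuration can be estimated from the transfer rate and the 128-bit bus width given in the table. This is a rough sketch under stated assumptions: the “GHz” figures in the Memory Configuration row are read as transfer rates in MT/s, and the function name is illustrative only.

```python
def peak_bandwidth_gbs(transfers_mt_s: float, bus_width_bits: int = 128) -> float:
    """Theoretical peak bandwidth in GB/s for a given transfer rate (MT/s) and bus width."""
    return transfers_mt_s * 1e6 * bus_width_bits / 8 / 1e9


# (label, transfer rate in MT/s, measured internal bandwidth in GB/s from the table)
configs = [
    ("HD4000, DDR3-1600", 1600, 10.4),
    ("HD520, DDR3-1867", 1867, 15.65),
    ("HD540, DDR4-2133", 2133, 23.0),
    ("HD620, DDR4-2133", 2133, 19.6),
]

for name, rate, measured in configs:
    peak = peak_bandwidth_gbs(rate)
    print(f"{name}: peak {peak:.1f} GB/s, measured {measured} GB/s "
          f"({measured / peak:.0%} of peak)")
```

On these assumptions the move from 1600 to 2133 raises the theoretical peak by only about a third, so most of the measured doubling would have to come from the newer parts using a larger share of the available bandwidth rather than from the memory clock alone.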

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL as well as memory bandwidth performance.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 250  119 602 [+2.4x] 1000 537 [+2.1x] FP16 support in DirectX doubles performance.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 235  109 338 [+43%]  496 289 [+23%] FP16 does not yet work in OpenGL.
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s)  238  120 276 [+16%]  485 248 [+4%] We only see a measly 4-16% better performance here.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 228  108 338 [+48%] 498 289 [+26%] SKL does better here – it’s 50% faster than HSW.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 52.4  78 76.7 [+46%] 133 69 [+30%] With FP64 SKL is still 45% faster.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 63.2  67.2 105 [+60%] 177 96 [+50%] Similar result here 50-60% faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 5.2  7 18.2 [+3.5x] 31.3 16.7 [+3.2x] Driver optimisation makes SKL/KBL over 3x faster.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 5.55  7.5 57.5 [+10x]  97.7 52.3 [+9.4x] Here we see SKL/KBL around 10x faster!
We see similar results to OpenCL GPGPU here – with FP16 doubling performance in DirectX – but with FP64 already supported in both DirectX and OpenGL even with HSW, KBL and SKL have less of a lead – of around 50%.
Video Memory Benchmark Internal Memory Bandwidth (GB/s)  15  14.8 27.6 [+84%] 26.9 25 [+67%] DDR4 brings almost 50% more bandwidth.
Video Memory Benchmark Upload Bandwidth (GB/s)  7  7.8 10.1 [+44%] 12.34 10.54 [+50%] Upload bandwidth has also increased ~50%.
Video Memory Benchmark Download Bandwidth (GB/s)  3.63  3.3 3.53 [-2%] 5.66 3.51 [-3%] No change in download bandwidth though.

Final Thoughts / Conclusions

SKL and KBL with the 21.45 driver yield significant gains in OpenCL, making an upgrade from HSW and even BRW quite compelling despite the relatively modern 20.40 driver Intel was forced to provide for Windows 10. The GT3 version provides good gains over the standard GT2 version and should always be selected if available.

Native FP64 support is a huge addition which provides support for high-precision kernels – unheard of for integrated graphics. Native FP16 support provides an additional 2x performance in cases where 16-bit floating-point processing is sufficient.
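Whether 16-bit floating-point is “sufficient” depends on the data being processed. As a rough illustration (not part of the benchmark itself), the sketch below round-trips a few values through IEEE half precision using Python's standard `struct` module (format code `e`) and reports the rounding error. The helper names `to_half` and `relative_error` are our own.

```python
import struct

def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def relative_error(x: float) -> float:
    """Relative error introduced by storing a non-zero x as FP16."""
    return abs(to_half(x) - x) / abs(x)

# Sample values chosen for illustration only.
for value in (0.1, 1.0, 1000.123, 2049.0):
    print(value, to_half(value), relative_error(value))
```

FP16 has only an 11-bit significand and tops out at 65504, so a kernel whose intermediate values need more precision or range than that cannot take advantage of the 2x rate.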

However, KBL’s EV9.5 graphics core shows no improvement at all over SKL’s EV9 core – thus it’s not just the CPU core that has not been changed but the GPU core too! The one exception is the updated transcoder supporting Main10 HEVC/H.265 (thus HDR / 10-bit+ colour), which is still quite useful for UHD/4K HDR media.

This is very much a surprise: while the CPU core has not improved markedly since SNB (Core v2), the GPU core has always provided significant improvements, and now we have hit the same road-block. As dedicated GPUs have continued to improve significantly in performance and power efficiency, this is all the more unexpected. This marks the smallest generation-to-generation improvement (SKL to KBL) ever; effectively KBL is a SKL refresh.

It seems the rumour that Intel may change to ATI/AMD graphics cores may not be such a crazy idea after all!

AMD A4 “Mullins” APU GPGPU (Radeon R4): Time does not stand still…

What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kabini” (KB) SoC, which was AMD’s major foray into tablets/netbooks replacing the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” for lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update to both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • GPU: Core remains the same (GCN)

In this article we test (GP)GPU graphics unit performance.

Hardware Specifications

We are comparing the internal GPUs of the new AMD APU with the old version as well as its competition from Intel.

Graphics Unit CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comment
Graphics Core B-GT EV8 B-GT2Y EV8 GCN GCN There is no change in GPU core in Mullins, it appears to be a re-brand from 83XX to R4. But time does not stand still, so while Kabini went against BayTrail’s “crippled” EV7 (IvyBridge) GPU – Mullins must battle the “beefed-up” brand-new EV8 (Broadwell) GPU. We shall see if the old GCN core is enough…
APU / Processor Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) The series has changed but not much else, not even the CPU core.
Cores (CU) / Shaders (SP) / Type 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) 2C / 128SP 2C / 128SP [=] We still have 2 GCN Compute Units but now they go against 16 EV8 units rather than 4 EV7 units. You can see just how much Intel has improved the Atom GPGPU from generation to generation while AMD has not. Will this cost them dearly?
Speed (Min / Max / Turbo) MHz 200 – 600 200 – 800 266 – 496 266 – 500 [=] Nope, clock has not changed either in Mullins.
Power (TDP) W 2.4 (under 4) 4.5 15 15 [=] As before, Intel’s designs have a crushing advantage over AMD’s: both Kabini and Mullins are rated at least 3x (three times) higher power than Core M and as much as 5-6x more than new Atom. Powered-down versions (6W?) would still consume more while performing worse.
DirectX / OpenGL / OpenCL Support 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 11.2 (12?) / 4.5 / 1.2 11.2 (12?) / 4.5 / 2.0 GCN supports DirectX 11.2 (not a big deal) and also OpenGL 4.5 (vs 4.3 on Intel but including Compute) and OpenCL 2.0 (same). All designs should benefit from Windows 10’s DirectX 12. So while AMD supports newer versions of standards there’s not much in it.
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) No / Yes No / Yes Sadly even AMD does not support FP16 (why?) but does support FP64 (double-float) in all interfaces – while Atom/Core GPU only in DirectX and OpenGL. Few people would elect to run heavy FP64 compute on these GPUs but it’s good to know it’s there…
Threads per CU 256 (256x256x256) 512 (512x512x512) 256 (256x256x256) 256 (256x256x256) GCN has traditionally not supported a large number of threads-per-CU (256) and here it is no different, with Intel’s GPU now supporting twice as many (512) – but whether this will make a difference remains to be seen.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (July 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: GPGPU Vectorised
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 160.5 181.7 165.3 163.9 [-1%] Straight off the bat, we see no change in score in Mullins; however, Atom has caught up – scoring within a whisker and Core M faster still (+13%). Not what AMD is used to seeing for sure.
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 160 180 165 163.9 [-1%] As FP16 is not supported by any of the GPUs, unsurprisingly the results don’t change.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 10.1 (emulated) 11.6 (emulated) 13.4 14 [+4%] We see a tiny 4% improvement in Mullins but due to native FP64 support it is almost 40% faster than Atom’s GPU and about 20% faster than Core M’s.
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.08 (emulated) 1.32 (emulated) 0.731 (emulated) 0.763 [+4%] (emulated) No GPU supports FP128, but GCN can emulate it using FP64 while EV8 needs to use more complex FP32 maths. Again we see a 4% improvement in Mullins, but despite FP64 support both Intel GPUs are much faster. Sometimes FP64/FP32 ratio is so low that it’s not worth using FP64 and emulation can be faster (e.g. nVidia).
AMD Mullins: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 825 770 998 1024 [+3%] In this tough integer workload that uses shared memory (as cache), Mullins only sees a 3% improvement. GCN shows its power being 25% faster than Intel’s GPUs – TDP notwithstanding.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1106 ? 1280 1423 [+11%] With fewer rounds, Mullins is now 11% faster – finally a good improvement and again 28% faster than Atom’s GPU.
AMD Mullins: GPGPU Hash
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 309 ? 59 282 [+4.7x] This 64-bit integer compute-heavy workload seems to have triggered a driver bug in Kabini since Mullins is almost 5x (five times) faster – perhaps 64-bit integer operations were emulated using int32 rather than native? Surprisingly Atom’s EV8 is faster (+9%) – not something we’d expect to see.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 1187 1331 1618 1638 [+1%] In this integer compute-heavy workload, Mullins is just 1% faster (within margin of error) – which again proves GPU has not changed at all vs. older Kabini. At least it’s faster than both Intel GPUs, 38% faster than Atom’s.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2764 ? 2611 3256 [+24%] SHA1 is less compute-heavy but here we see a 24% Mullins improvement, again likely a driver “fix”. This allows it to beat Atom’s GPU by 17% – showing that driver optimisations can make a big difference.
AMD Mullins: GPGPU Financial
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 299.8 280.5 248.3 326.7 [+31%] Starting with the financial tests, Mullins flies off with a 31% improvement over the old Kabini, which is just as well as both Intel GPUs are coming strong – it’s 9% faster than Atom’s GPU. One thing’s for sure, Intel’s EV8 GPU is no slouch.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) n/a (no FP64) n/a (no FP64) 21 21.2 [+1%] AMD’s GCN supports native FP64, but here Mullins is just 1% faster than Kabini (within margin of error), unable to replicate the FP32 improvement we saw.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 28 36.5 32.3 30.9 [-4%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and here Mullins somehow manages to be slower (-4%) – likely due to driver differences. Both Intel GPUs are coming strong, with Core M’s GPU 20% faster. Considering how fast the GCN shared memory is – we expected better.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 1.85 1.87 [+1%] Switching to FP64 on AMD’s GPUs, Mullins is now 1% faster (within margin of error). Luckily Intel’s GPUs do not support FP64.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 61.9 54.9 32.9 46.3 [+40%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Mullins is 40% faster here (again driver change) – but surprisingly cannot match either Intel GPU, with Atom’s GPU 32% faster! Again, we see just how much Intel has improved the GPU in Atom – perhaps too much!
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 5.39 5.59 [+3%] Switching to FP64 we now see a little 3% improvement for Mullins, but better than the 1% we saw before…
AMD Mullins: GPGPU Scientific
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 45 44.1 43.5 41.5 [-5%] GEMM is quite a tough algorithm for our GPUs and Mullins manages to be 5% slower than Kabini – again this allows Intel’s GPUs to win, with Atom’s GPU just 8% faster – but a win is a win. Mullins’s GPU is starting to look underpowered considering the much higher TDP.
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.11 3.73 [-9%] Switching to FP64, Mullins now manages to be 9% slower than Kabini – thankfully Intel’s GPUs don’t support FP64…
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 9 8.94 7.89 9.5 [+20%] FFT involves many kernels processing data in a pipeline – and Mullins now manages to be 20% faster than Kabini – again, just as well as Intel’s GPUs are hot on its tail – and it is just 5% faster than Atom’s GPU!
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 2.2 3 [+36%] Switching to FP64, Mullins is now 36% faster than Kabini – again likely a driver improvement.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 65 50 58 63 [+9%] In our last test we see Mullins is 9% faster – but not enough to beat Atom’s GPU, which is only about 3% faster but faster still. Did anybody expect that?
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.75 4.74 [=] Switching to FP64, Mullins scores exactly the same as Kabini.

Firstly, Mullins’s GPU scores are unchanged from Kabini; due to driver optimisations/fixes (as well as kernel optimisations) sometimes Mullins is faster but that’s not due to any hardware changes. If you were expecting more, you will be disappointed.
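For readers who want to check the bracketed deltas themselves, here is a minimal sketch of the arithmetic: the relative change of the Mullins score against the Kabini score. The function name `relative_change` is ours, and the scores are copied from three rows of the table above.

```python
def relative_change(old: float, new: float) -> float:
    """Percentage change of `new` relative to `old` (positive means faster)."""
    return (new - old) / old * 100.0

# (Kabini, Mullins) scores copied from the table above.
scores = {
    "SFFT FP32 (GFLOPS)": (7.89, 9.5),
    "DFFT FP64 (GFLOPS)": (2.2, 3.0),
    "DGEMM FP64 (GFLOPS)": (4.11, 3.73),
}

for name, (kabini, mullins) in scores.items():
    print(f"{name}: {relative_change(kabini, mullins):+.0f}%")
```

For these three rows the results round to +20%, +36% and -9%, matching the bracketed values.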

Intel’s EV8 GPUs in the new Atom (CherryTrail) as well as Core M (Broadwell) now can keep up with it and even beat it in some tests. The crushing GPGPU advantage AMD’s APUs used to have is long gone. Considering the TDP differences (4-5x higher), Mullins’s GPU looks underpowered – the number of cores should at least have been doubled to maintain its advantage.
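To put the TDP gap into numbers, the following rough sketch divides the SGEMM FP32 score by the rated TDP from the specification table. This is only an approximation: TDP is a rating for the whole SoC rather than measured GPU power, and for Atom we assume the 2.4W figure listed in the table.

```python
# SGEMM FP32 score (GFLOPS) and rated TDP (W), copied from the tables above.
# Assumption: Atom's TDP is taken as 2.4 W (listed as "2.4 (under 4)").
gpus = {
    "CherryTrail GT (Atom X7)": (45.0, 2.4),
    "Broadwell GT2Y (Core M)": (44.1, 4.5),
    "Kabini (HD 8330)": (43.5, 15.0),
    "Mullins (R4)": (41.5, 15.0),
}

def gflops_per_watt(gflops: float, tdp_watts: float) -> float:
    """Very rough efficiency figure: benchmark score per rated watt."""
    return gflops / tdp_watts

efficiency = {name: gflops_per_watt(*spec) for name, spec in gpus.items()}

# Print from most to least efficient.
for name, eff in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {eff:.2f} GFLOPS/W")
```

On these figures both Intel parts come out several times more efficient than either AMD APU, which is the point made above.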

Unless Atom (CherryTrail) is more expensive – there’s really no reason to choose Mullins, the power advantage of Atom is hard to deny. The only positive for AMD is that Core M looks uncompetitive vs. Atom itself, but then again Intel’s 15W ULV designs are far more powerful.

Transcoding Performance

We are testing transcoding performance of GPUs using their hardware transcoders with popular media formats: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) AMD h264Encoder (hardware accelerated) AMD h264Encoder (hardware accelerated) Both are using their own hardware-accelerated transcoders for H264.
AMD Mullins: Transcoding H264
Transcode Benchmark H264 > H264 Transcoding (MB/s) 5 ? 2 2.14 [+7%] We see a small but useful 7% bandwidth improvement in Mullins vs. Kabini, but even Atom is over 2x (twice) as fast.
Transcode Benchmark WMV > H264 Transcoding (MB/s) 4.75 ? 2 2.07 [+3.5%] When just using the H264 encoder we only see a small 3.5% bandwidth improvement in Mullins. Again, Atom is over 2x as fast.

We see a minor 3.5-7% improvement in Mullins, but the new Atom blows it out of the water – it is over twice as fast transcoding H.264! Poor Mullins/Kabini are not even a good fit for HTPC (NUC/Brix) boxes…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
Memory Configuration 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) Except Core M, all APUs have a single memory controller, though Atom can also be configured with 2 channels.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 32 64 / 32 Surprisingly AMD’s GCN has 1/2 the shared memory of Intel’s EV8 (32 vs. 64) but considering the low number of threads-per-CU (256) only kernels making very heavy use of shared memory would be affected; still, more is better than less.
L1 / L2 / L3 Caches (kB) 256kB? L2 256kB? L2 16kB? L1 / 256kB? L2 16kB? L1 / 256kB? L2 Cache sizes are always pretty “hush hush” but since core has not changed, we would expect the same cache sizes – with GCN also sporting a L1 data cache.
AMD Mullins: GPGPU Memory BW
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 11 10.1 8.8 5.5 [-38%] OpenCL memory performance has surprisingly taken a big hit in Mullins, most likely a driver bug. We shall see whether DirectX Compute is similarly affected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 2.09 3.91 4.1 2.88 [-30%] Upload bandwidth is similarly affected, we measure a 30% decrease.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 2.29 3.79 3.18 2.9 [-9%] Download bandwidth is the least affected, just 9% lower.
AMD Mullins: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 829 274 1973 693 [-1/3x] Even though the core is unchanged, latency is 1/3 of Kabini. Since we don’t see a comparable increase in performance, this again points to a driver issue.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1669 ? ? 817 Going out-of-page does not increase latency much.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 279 ? ? 377 Sequential access brings the latency down to about 1/2, showing the prefetchers do a good job.
GPGPU Memory Latency Constant Memory Latency (ns) 209 ? 629 401 [-33%] The L1 (16kB) cache does not cover the whole constant memory (64kB) – and latency is not lower than global memory. There is no advantage to using constant memory.
GPGPU Memory Latency Shared Memory Latency (ns) 201 ? 20 16 [-20%] Shared memory is a little bit faster (20% lower). We see that shared memory latency is much lower than constant/global latency (16 vs. 401) – denoting dedicated shared memory. On Intel’s EV8 GPU there is no change (201 vs. 209) – which would indicate global memory used as shared memory.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1234 ? 2369 691 [-70%] We see a massive latency reduction – again likely a driver optimisation/fix.
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 353 ? ? 353 Sequential access brings the latency down to a quarter (1/4x) – showing the power of the prefetchers.

The optimisations in newer drivers make a big difference – though the same could apply to the previous gen (Kabini). The dedicated shared memory – compared to Intel’s GPUs – likely helps GCN achieve its performance.
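The benchmark’s own kernels are not shown here, but access patterns such as “sequential”, “in-page random” and “full random” are commonly measured by pointer chasing: each element stores the index of the next one to load, so every load depends on the previous one and cannot be overlapped. The sketch below builds such chains on the host. The function names and the page size of 1024 elements are our assumptions, and a real test would time the traversal inside a GPU kernel rather than in Python.

```python
import random

def build_chain(n, pattern, page=1024, seed=0):
    """Return a list `nxt` where nxt[i] is the element to visit after i.

    The chain is a single cycle covering all n elements, ordered by `pattern`.
    """
    rng = random.Random(seed)
    if pattern == "sequential":
        order = list(range(n))
    elif pattern == "full_random":
        order = list(range(n))
        rng.shuffle(order)
    elif pattern == "in_page_random":
        order = []
        for start in range(0, n, page):
            block = list(range(start, min(start + page, n)))
            rng.shuffle(block)       # random inside a page...
            order.extend(block)      # ...but pages are visited in order
    else:
        raise ValueError(f"unknown pattern: {pattern}")

    nxt = [0] * n
    for cur, following in zip(order, order[1:] + order[:1]):
        nxt[cur] = following
    return nxt

def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final index."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i
```

Dividing the elapsed time of `chase` by `steps` gives an average latency per access; comparing the three patterns then separates prefetcher-friendly access (sequential) from page-local and fully random access.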

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Shader
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 127.6 ? 128.8 129.6 [=] Starting with DirectX FP32, we see no change in Mullins – not even the DirectX driver has changed.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 121.8 172 124 124.4 [=] OpenGL does not change matters, Mullins scores exactly the same as its predecessor. But here we see Core M pulling ahead, an unexpected change.
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 109.5 ? 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 121.8 170 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change either.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 18 ? 8.9 8.91 [=] Unlike OpenCL driver, DirectX Intel driver does support FP64 – which allows Atom’s GPU to be at least 2x (twice) as fast as Mullins/Kabini.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 26 46 12 12 [=] As above, Intel OpenGL driver does support FP64 also – so all GPUs run native FP64 code again. This allows Atom’s GPU to be over 2x faster than Mullins/Kabini again – while Core M’s GPU is almost 4x (four times!) faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.34 (emulated) ? 1.6 (emulated) 1.66 (emulated) [+3%] Here we’re emulating (mantissa extending) FP128 using FP64: EV8 stumbles a bit allowing Mullins/Kabini to be a little bit faster despite what we saw in FP64 test.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1.1 (emulated) 3.4 (emulated) 0.738 (emulated) 0.738 (emulated) [=] OpenGL does change the results a bit, Atom’s GPU is now faster (+50%) while Core M’s GPU is far faster (+5x). Heavy shaders seem to take their toll on GCN’s GPU.

Unlike GPGPU, here Mullins scores exactly the same as Kabini – neither the DirectX nor the OpenGL driver seems to make a difference. But what is different is that Intel’s GPUs support FP64 natively in both DirectX/OpenGL – making them much faster (3-5x) than AMD’s GCN. If the OpenCL driver were to support it – AMD would be in trouble!
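The FP128 rows rely on emulation by “mantissa extending”. The shaders used by the benchmark are not shown, so as an illustration only, here is the standard double-double technique: a value is stored as an unevaluated sum of two FP64 numbers (hi, lo), and additions use Knuth’s error-free two-sum. This gives roughly 106 bits of mantissa, which is close to but not identical to true IEEE FP128. The function names are ours.

```python
def two_sum(a, b):
    """Knuth's error-free addition: returns (s, e) with s + e equal to a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y):
    """Add two double-double numbers, each stored as a (hi, lo) pair of FP64 values."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalise so that lo stays tiny relative to hi

# Plain FP64 loses the small term entirely; double-double keeps it.
plain = (1.0 + 1e-20) - 1.0
total = dd_add((1.0, 0.0), (1e-20, 0.0))
recovered = dd_add(total, (-1.0, 0.0))
print(plain, recovered)
```

Every emulated addition costs several native ones, which is why the emulated FP128 scores are so much lower than the native FP64 scores.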

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 11.18 12.46 8 9.7 [+21%] DirectX bandwidth does not seem to be affected by the OpenCL “bug”, here we see Mullins having 21% more bandwidth than Kabini using the very same memory. Perhaps the memory controller has seen some improvements after all.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.83 5.29 3 3.61 [+20%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again Mullins does well with 20% more bandwidth.
Video Memory Benchmark Download Bandwidth (GB/s) 2.1 1.23 3 3.34 [+11%] Download bandwidth improves by 11% only, but better than nothing.

Unlike OpenCL, we see DirectX bandwidth increased by 11-21% – while using the same memory. Hopefully AMD will “fix” the OpenCL issue which should help kernel performance no end.
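As a sanity check on these figures, the theoretical peak bandwidth of the memory configuration can be estimated as transfer rate times bus width. The sketch below assumes that “DDR3 1.6GHz 64-bit” in the memory configuration table means 1600 MT/s over a single 64-bit channel; that reading is ours, as is the function name.

```python
def peak_bandwidth_gbs(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second x bytes per transfer."""
    return transfers_per_sec * bus_width_bits / 8 / 1e9

# Assumption: DDR3-1600 (1600 MT/s) on a 64-bit bus, as listed for Mullins.
peak = peak_bandwidth_gbs(1600e6, 64)

# Measured DirectX internal bandwidth for Mullins, from the table above.
measured = 9.7
efficiency = measured / peak

print(f"peak {peak:.1f} GB/s, measured {measured} GB/s, {efficiency:.0%} of peak")
```

Under that assumption the measured DirectX figure is a healthy fraction of the theoretical peak, while the 5.5GB/s measured in OpenCL is well under half of it, which fits the driver issue suspected earlier.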

Final Thoughts / Conclusions

Mullins’s GPU is unchanged from its predecessor (Kabini) but a few driver optimisations/fixes allow it to score better in many tests by a small margin – however these would also apply to the older devices. There isn’t really more to be said – nothing has really changed.

But time does not stand still – and now Intel’s EV8 GPU that powers the new Atom (CherryTrail) as well as Core M (Broadwell) is hot on its heels and even manages to beat it in some tests – not something we’re used to seeing in AMD’s APUs. Mullins’s GPU is looking underpowered.

If we now remember that Mullins’s TDP is 15W vs. Atom at 2.4-4W or Core M at 4.5W – it’s really not looking good for AMD: its CPU performance is unlikely to be much better than Atom’s (we shall see in CPU AMD A4 “Mullins” performance) – and at 3-5x (three to five times) more power it is woefully power-inefficient.

Let’s hope that the next generation APUs (aka Mullins’ replacement) perform better.