AMD Radeon 5700XT: Navi GPGPU Performance in OpenCL

What is “Navi”?

It is the code-name of the new AMD GPU, the first of the brand-new RDNA (Radeon DNA) GPU architecture, replacing “Vega”, which was the last of the GCN (Graphics Core Next) architecture. It is a mid-range GPU optimised for gaming and thus not expected to set records, but GPUs today are used for many other tasks as well (mining, encoding, algorithm/compute acceleration, etc.).

RDNA brings big changes from the various GCN revisions we’ve seen previously, but its first iteration here does not bring any major new features, at least in the compute domain. Hopefully the next versions will bring tensor units (matrix multipliers) and other accelerated instruction sets.

Hardware Specifications

We are comparing the middle-range Radeon with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
Arch Chipset RDNA / Navi 10 GCN5.1 / Vega 20 Pascal / GP102 GCN5.0 / Vega 10 The first of the Navi chips.
Cores (CU) / Threads (SP) 40 / 2560 60 / 3840 28 / 3584 56 / 3584 Fewer CUs than Vega1 but the same 64 SPs per CU.
SIMD per CU / Width 2 / 32 [2x] 4 / 16 4 / 16 Navi increases the SIMD width but decreases counts.
Wave/Warp Size 32 [1/2x] 64 32 64 Wave size is reduced to match nVidia.
Speed (Min-Turbo, GHz) 1.6 / 1.755 1.4 / 1.75 1.531 / 1.91 1.156 / 1.471 Base clock is ~40% higher and turbo ~20% higher than Vega1.
Power (TDP) 225W 295W 250W 210W Slightly higher TDP than Vega1 but nothing significant.
ROP / TMU 64 / 160 64 / 240 96 / 224 64 / 224 ROPs are the same but we see ~30% fewer TMUs.
Shared Memory 64kB [+2x] 32kB 48kB / 96kB per SM 32kB We have 2x more shared memory allowing bigger kernels.
Constant Memory 4GB 8GB 64kB dedicated 4GB No dedicated constant memory but large.
Global Memory 8GB GDDR6 14Gt/s 256-bit 16GB HBM2 1Gt/s 4096-bit 12GB GDDR5X 10Gt/s 384-bit 8GB HBM2 900Gt/s 4096-bit Sadly no HBM this time; GDDR6 is faster but not very wide.
Memory Bandwidth (GB/s) 448GB/s [+9%] 1024GB/s 512GB/s 410GB/s Still bandwidth is 9% higher.
L1 Caches ? x40 16kB x60 48kB x28 16kB x56 L1 does not appear changed but unclear.
L2 Cache 4MB 4MB 3MB 4MB L2 has not changed.
Maximum Work-group Size 1024 / 1024 256 / 1024 1024 / 2048 per SM 256 / 1024 AMD has unlocked work-group sizes to 4x.
FP64/double ratio 1/16x 1/4x 1/32x 1/16x Ratio is same as consumer Vega1 rather than pro Vega2.
FP16/half ratio 2x 2x 1/64x 2x Ratio is the same throughout.
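
The memory bandwidth row follows from the data rate and bus width in the global memory row. Here is a minimal sketch of that arithmetic, assuming the usual rule that peak bandwidth is transfers per second times bytes per transfer; the function name is ours, and the 376GB/s measured figure is the internal bandwidth result from the memory benchmarks later in this article.

```python
def peak_bandwidth_gbs(data_rate_gtps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s.

    data_rate_gtps: transfers per second per pin, in Gt/s
    bus_width_bits: width of the memory bus in bits
    """
    return data_rate_gtps * bus_width_bits / 8  # 8 bits per byte


# Radeon 5700XT: GDDR6 at 14Gt/s on a 256-bit bus (from the table above).
navi_peak = peak_bandwidth_gbs(14.0, 256)

# Measured internal bandwidth from the memory benchmarks (GB/s).
navi_measured = 376.0
navi_efficiency = navi_measured / navi_peak

print(f"Peak: {navi_peak:.0f} GB/s, measured: {navi_measured:.0f} GB/s "
      f"({navi_efficiency:.0%} of peak)")
```

The result matches the 448GB/s quoted in the table; the measured OpenCL figure is naturally somewhat lower than the theoretical peak.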

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 18,265 [-7%] 29,057 245 19,580 Navi starts well but cannot beat Vega1.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 11,863 [-13%] 17,991 17,870 13,550 Standard FP32 increases the gap to 13%.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 1,047 [-16%] 5,031 661 1,240 FP64 does not change much, Navi is 16% slower.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 43 [-45%] 226 25 77 Emulated FP128 is hard on FP64 units and here Navi is almost 1/2 Vega1.
To start with, Navi cannot beat Vega1 in heavily vectorised compute loads: FP16 is the most efficient (almost parity), while complex emulated FP128 is almost 2x slower.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 51 [-25%] 91 42 67 Despite more bandwidth Navi is 25% slower than Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 176 [+40%] 209 145 125 Navi shows its power here beating Vega1 by a huge 40%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 76 32
Despite the higher bandwidth of GDDR6, streaming algorithms work better on “old” HBM2, thus Navi cannot beat Vega1. But in pure integer compute algorithms like hashing it is significantly faster, which bodes well for the future.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 12,459 [+31%] 23,164 11,480 9,500 In this FP32 financial workload Navi is 30% faster than Vega1!
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 1,370 1,880
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 850 [1/3x] 3,501 2,240 2,530 Binomial uses thread shared data thus stresses the memory system and here we have some optimisation to do.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 129 164
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,027 [+30%] 6,249 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Navi is again 30% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 294 472
For financial FP32 workloads, Binomial aside, Navi is ~30% faster than Vega1, a pretty good improvement, though in FP64 it naturally cannot compete with Vega2 due to its consumer FP64 ratio (1/16x). Crypto-currency fans will love Navi.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 5,165 [+2%] 6,634 6,073 5,066 GEMM can only bring a measly 2% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 340 620
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 376 [+2%] 643 235 369 FFT loves HBM but Navi is still 2% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 207 175
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 4,534 [-6%] 6,846 5,720 4,840 Navi can’t manage as well in N-Body and ends up 6% slower.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 275 447
The scientific scores don’t show the same improvement as the financial ones likely due to heavy use of shared memory with Navi just matching Vega1. Perhaps the larger shared memory can allow us to use larger workgroups.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 8,674 [1/2.1x] 25,418 18,410 19,130 In this 3×3 convolution algorithm, Navi is 1/2x the speed of Vega1.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,734 [1/2.5x] 5,275 5,000 4,340 Same algorithm but more shared data makes Navi even slower.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 1,802 [1/2.5x] 5,510 5,080 4,450 With even more data the gap remains at 1/2.5x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,723 [1/2.5x] 5,273 4,800 4,300 Still convolution but with 2 filters – same 1/2.5x performance.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 48.44 [=] 92.53 37 48 Different algorithm allows Navi to tie with Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 97.34 [+2.5x] 57.66 12.7 38 Without major processing, this filter performs well on Navi.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 32,050 [+1.5x] 47,349 19,480 20,880 This algorithm is 64-bit integer heavy and Navi is 50% faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 9,516 [+1.6x] 7,708 305 6,000 One of the most complex and largest filters, Navi is almost 60% faster here.
For image processing using FP32 precision, Navi goes from 1/2.5x Vega1 performance (convolution) to 50% faster (complex algorithms with integer processing). It seems some optimisations are needed for the convolution algorithms.
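
The bracketed figures in the tables compare the Navi score against the Vega1 column. Here is a minimal sketch of that comparison, assuming a simple score/baseline ratio; the function names and the 0.5 threshold for switching to the "1/Nx" notation are our own illustrative choices, not part of the benchmark.

```python
def ratio(score: float, baseline: float) -> float:
    """Score relative to the baseline (1.0 means parity)."""
    return score / baseline


def describe(score: float, baseline: float) -> str:
    """Format the comparison roughly the way the tables do."""
    r = ratio(score, baseline)
    if r < 0.5:
        # Large deficits read better as a fraction, e.g. 1/2.5x.
        return f"1/{1 / r:.1f}x"
    # Otherwise report the signed percentage difference.
    return f"{(r - 1) * 100:+.0f}%"


# Scores copied from the tables above (Navi vs Vega1).
print("Black-Scholes FP32:", describe(12459, 9500))
print("Sharpen 5x5 FP32:", describe(1734, 4340))
```

Because the published scores are rounded, a label recomputed this way can differ from the printed one by a point or so.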

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 376 [+13%] 627 356 333 Navi’s GDDR6 manages 13% more bandwidth than Vega1.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 21.56 [+77%] 12.37 11.4 12.18 PCIe 4.0 brings almost 80% more bandwidth
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 22.28 [+84%] 12.95 12.2 12.08 Again almost 2x more bandwidth.
Navi’s PCIe 4.0 interface (on 500-series motherboards) brings as expected almost 2x more upload/download bandwidth while its high-clocked GDDR6 manages just over 10% higher bandwidth over HBM2.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 276 [+11%] 202 201 247 Navi’s GDDR6 brings slight latency increase (+10%)
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 286 353
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87.6 80
Not unexpectedly, GDDR6’s latencies are higher than HBM2’s, although not by as much as we were fearing.
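
To put the upload figures in context, here is a rough sketch of the theoretical PCIe link bandwidth. It assumes an x16 link (the article does not state the lane count) and the 128b/130b line encoding used by PCIe 3.0 and 4.0; the function name is ours.

```python
def pcie_peak_gbs(transfer_rate_gtps: float, lanes: int = 16) -> float:
    """Theoretical one-way PCIe bandwidth in GB/s.

    transfer_rate_gtps: per-lane rate (8 GT/s for PCIe 3.0, 16 GT/s for 4.0)
    lanes: link width; x16 is an assumption, not stated in the article
    """
    encoding = 128 / 130  # 128b/130b line code used by PCIe 3.0 and 4.0
    return transfer_rate_gtps * lanes * encoding / 8


# Measured upload bandwidth from the table above (GB/s).
measured = {
    "Radeon VII, PCIe 3.0": (8.0, 12.37),
    "Radeon 5700XT, PCIe 4.0": (16.0, 21.56),
}

for name, (rate, upload) in measured.items():
    peak = pcie_peak_gbs(rate)
    print(f"{name}: {upload:.2f} of {peak:.2f} GB/s ({upload / peak:.0%})")
```

Under these assumptions the link doubles from roughly 15.8GB/s to 31.5GB/s, so the measured gain of almost 2x is in line with what the interface can offer.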

Final Thoughts / Conclusions

“Navi” is an interesting chip to be sure and perhaps more was expected of it; as always the drivers are the weak link and it is hard to determine which issues will be fixed driver-side and which will need to be optimised in compute kernels.

Thus performance-wise it oscillates between roughly 1/2x and 1.5x Vega1 performance depending on the algorithm, with compute-heavy algorithms (especially crypto-currencies) doing best and shared/local memory heavy algorithms doing worst. The 2x bigger shared memory (64kB vs 32kB) in conjunction with the larger work-group sizes (1024 vs 256 by default) does present future optimisation opportunities. AMD has also reduced the wave/warp size to match nVidia, a historic change.
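
As a rough illustration of what these limits mean for kernel design, the sketch below works out how many wavefronts make up a default-sized work-group and how much shared memory each work-item gets. It assumes the shared memory figure in the specification table is the per-work-group limit; the helper names are ours.

```python
import math


def wavefronts_per_group(work_group_size: int, wave_size: int) -> int:
    """Number of wavefronts needed to cover one work-group."""
    return math.ceil(work_group_size / wave_size)


def shared_bytes_per_item(shared_kb: int, work_group_size: int) -> float:
    """Shared (local) memory available per work-item, in bytes."""
    return shared_kb * 1024 / work_group_size


# (default work-group size, wave size, shared memory in kB) from the table.
configs = {
    "Navi (5700XT)": (1024, 32, 64),
    "Vega1 (56)": (256, 64, 32),
}

for name, (group, wave, shared_kb) in configs.items():
    waves = wavefronts_per_group(group, wave)
    per_item = shared_bytes_per_item(shared_kb, group)
    print(f"{name}: {waves} wavefronts per group, {per_item:.0f} bytes per item")
```

With the default sizes, a Navi work-group spans 32 wavefronts against 4 on Vega1, while shared memory per work-item is halved, which is one reason kernels tuned for Vega1 may need retuning.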

Memory wise, the cost-cutting change from HBM2 to even high-speed GDDR6 does bring more bandwidth but naturally higher latencies – but PCIe 4.0 doubles upload/download bandwidths which will become much more important on higher capacity (16GB+) cards in the future.

Overall it is hard to recommend for compute workloads unless the particular algorithm (crypto, financial) does well on Navi; otherwise the much older Vega1 56/64 offers a better performance/cost ratio, especially today. However, as drivers mature and implementations are optimised for it, Navi is likely to start to perform better.

We are looking forward to the next iterations of Navi, especially the rumoured “big Navi” version optimised for compute…

AMD Radeon VII: Vega2 GPGPU Performance in OpenCL

What is “Vega2”?

It is the code-name of the updated “Vega” GPU architecture, the last of the GCN (Graphics Core Next) architecture (version 5.1), shrunk to 7nm before being replaced by the forthcoming “Navi”. Originally intended for the professional/workstation high-end market, “Vega2”/“big Vega”, designed for compute (scientific, machine learning, etc.) workloads, was pressed into service to battle the latest 2000-series “Turing”/RTX competition.

As a result it contains many high-end features not normally found on consumer cards:

  • 1/4 FP64 rate (instead of 1/16 or worse)
  • 16GB HBM2 memory (instead of 8-12GB)
  • 4096-bit HBM2 memory with 1TB/s bandwidth (instead of 400-500GB/s)
  • Int8/Int4 support for AI/ML workloads
  • PCIe 4.0 capable but not enabled at this time

Hardware Specifications

We are comparing the middle-range Radeon with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
Arch Chipset Vega 20 / GCN 5.1 GV100 / 7.0 GP102 / 6.1 Vega 10 / GCN 5.0 A minor revision of Vega1.
Cores (CU) / Threads (SP) 60 / 3840 80 / 5120 28 / 3584 56 / 3584 More CUs than normal Vega but not 64.
SIMD per CU / Width 4 / 16 n/a n/a 4 / 16 Naturally same SIMD count and width
Wave/Warp Size 64 32 32 64 Wave size has always been 2x nVidia.
Speed (Min-Turbo) 1.4 – 1.750 [+21%] (135-1455) 1.531 (139-1910) 1.156 – 1.471 Base and turbo clocks are both ~20% higher than Vega1.
Power (TDP) 300W [+42%] 300W 250W 210W TDP has gone up by 40%.
ROP / TMU 64 / 256 96 / 320 96 / 224 64 / ROPs and TMUs unchanged
Shared Memory 32kB 48 / 96 kB 48 / 96kB 32kB No shared memory change.
Constant Memory 8GB 64kB 64kB 4GB No dedicated constant memory but large.
Global Memory 16GB HBM2 2Gbps 4096-bit 12GB HBM2 2x850Mbps 3072-bit 12GB GDDR5X 10Gbps 384-bit 8GB HBM2 1.89Gbps 2048-bit 2x as big and 2x as wide HBM a huge improvement.
Memory Bandwidth (GB/s) 1000 [+2.4x] 652 512 410 Bandwidth is 2.4x higher than Vega1.
L1 Caches 16kB x 60 96kB x 80 48kB x 28 16kB x 56 L1 has not changed.
L2 Cache 4MB 4.5MB 3MB 4MB L2 has not changed.
Maximum Work-group Size 256 / 1024 1024 / 2048 1024 / 2048 256 / 1024 Same work-group sizes.
FP64/double ratio 1/4x 1/2x 1/32x 1/16x Ratio is 4x better than Vega1.
FP16/half ratio 2x 2x 1/64x 2x Ratio is the same throughout.
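
The effect of the FP64 ratio can be estimated from the table. This is a minimal sketch, assuming each SP retires one fused multiply-add (2 FLOPs) per clock and that the turbo clock is sustained; real workloads will land below these figures, and the function name is ours.

```python
def peak_gflops(shaders: int, clock_ghz: float, rate: float = 1.0) -> float:
    """Theoretical peak GFLOPS.

    shaders:   number of SPs
    clock_ghz: clock in GHz (turbo clock assumed sustained)
    rate:      throughput relative to FP32 (e.g. 1/4 for Vega2 FP64)
    """
    flops_per_clock = 2  # one fused multiply-add counts as two operations
    return shaders * flops_per_clock * clock_ghz * rate


# (SPs, turbo clock in GHz, FP64 ratio) taken from the table above.
cards = {
    "Radeon VII (Vega2)": (3840, 1.750, 1 / 4),
    "Vega 56 (Vega1)": (3584, 1.471, 1 / 16),
}

for name, (sp, clock, fp64_rate) in cards.items():
    fp32 = peak_gflops(sp, clock)
    fp64 = peak_gflops(sp, clock, fp64_rate)
    print(f"{name}: FP32 {fp32:,.0f} GFLOPS, FP64 {fp64:,.0f} GFLOPS")
```

On paper Vega2 has only about 27% more FP32 throughput than Vega1 but roughly 5x the FP64 throughput, which is the pattern to look for in the benchmark results.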

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 29,057 [+48%] 33,860 245 19,580 Vega2 starts strong with a 48% lead over Vega1 and almost catching Volta.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 18,340 [+35%] 22,680 17,870 13,550 Good improvement here +35% over Vega1 again close to Volta.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 5,377 [+4.3x] 11,000 661 1,240 The 1/4 FP64 rate makes it over four times (4x) faster than Vega1.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 234 [+3x] 458 25.77 77 Similar to above, Vega2 is over three times (3x) faster.
Vega2 looks about 35-50% faster than Vega1 in FP32/FP16 and 3-4x faster in FP64 due to its 1/4 FP64 rate. It won’t beat real workstation cards with a 1/2 FP64 rate though, thus the Titan V has nothing to worry about.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 91 [+36%] 70 42 67 The fast HBM2 memory allows it to beat even Volta not just Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 93 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 209 [+67%] 245 145 125 Vega2 is a huge 70% faster in integer/crypto workloads.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 129 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 176 76 32
Vega2 increases its lead in integer workloads, even streaming ones, no doubt due to its very fast HBM2 memory. This makes it the crypto king of the hill, though its cost may be an issue.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 23,164 [+2.4x] 18,570 11,480 9,500 Vega2 is over 2x faster than Vega1 also beating Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 [+3.84x] 8,400 1,370 1,880 In FP64 it’s almost 4x faster, just below Volta!
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 3,501 [+38%] 4,200 2,240 2,530 Binomial uses thread shared data thus stresses the memory system Vega2 is still 40% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 [+4.8x] 2,000 129 164 With FP64 we’re almost 5x faster than Vega1.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 6,249 [+62%] 11,920 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Vega2 is 60% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 [+3.55x] 4,440 294 472 With FP64 we’re over 3.5x faster.
For financial FP32 workloads, Vega2 is 40-60% faster than Vega1, a decent improvement; naturally in FP64 it is 4-5x faster, thus a significant upgrade for algorithms that require such precision.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,634 [+30%] 11,000 6,073 5,066 GEMM still brings a 30% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 [+3.77x] 3,830 340 620 But DGEMM is almost 4x faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 643 [+74%] 617 235 369 FFT loves HBM thus Vega2 is 75% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 [+2.1x] 280 207 175 DFFT is tough but Vega2 is still twice as fast.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,846 [+41%] 7,790 5,720 4,840 In N-Body physics Vega2 is 40% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 [+3.9x] 4,270 275 447 And in FP64 physics Vega2 is almost 4x faster.
The scientific scores show a similar improvement, with FP32 30-40% better but FP64 a whopping four times (4x) faster than Vega1 and, in some algorithms, matching the hugely expensive Volta.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 25,418 [+32%] 26,790 18,410 19,130 In this 3×3 convolution algorithm, Vega2 is 32% faster than Vega1
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 5,275 [+21%] 9,295 5,000 4,340 Same algorithm but more shared data reduces the lead to 21%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 5,510 [+24%] 9,428 5,080 4,450 With even more data the gap remains constant.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 5,273 [+23%] 9,079 4,800 4,300 Still convolution but with 2 filters – similar 23% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 92 [+91%] 112 37 48 Different algorithm makes Vega2 almost 2x faster than Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 57 [+50%] 42 12.7 38 Without major processing, this filter is 50% faster on Vega2.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 47,349 [+2.3x] 24,370 19,480 20,880 This algorithm is 64-bit integer heavy and Vega2 flies 2x faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 7,708 [+28%] 8,460 305 6,000 One of the most complex and largest filters, Vega2 is 28% faster.
For image processing using FP32 precision, Vega2 ranges from 21% to over 2x faster than Vega1, overall a decent improvement if you are processing a large number of images. In some filters it beats the far more expensive Volta competition.
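
One way to condense the image processing results into a single figure is the geometric mean of the per-filter speed-ups, which is less dominated by outliers than an arithmetic mean. This is our own illustrative summary under that assumption, not a score the benchmark reports; the numbers are the Vega2 and Vega1 columns from the table above.

```python
from statistics import geometric_mean

# (Vega2, Vega1) scores in MPix/s, copied from the table above.
scores = {
    "Blur 3x3": (25418, 19130),
    "Sharpen 5x5": (5275, 4340),
    "Motion Blur 7x7": (5510, 4450),
    "Sobel 2*5x5": (5273, 4300),
    "Median 5x5": (92, 48),
    "Oil Painting": (57, 38),
    "Diffusion": (47349, 20880),
    "Marbling": (7708, 6000),
}

# Speed-up of Vega2 over Vega1 for each filter.
speedups = {name: vega2 / vega1 for name, (vega2, vega1) in scores.items()}

overall = geometric_mean(list(speedups.values()))
slowest = min(speedups, key=speedups.get)
fastest = max(speedups, key=speedups.get)

print(f"Geometric mean speed-up: {overall:.2f}x")
print(f"Smallest gain: {slowest} ({speedups[slowest]:.2f}x)")
print(f"Largest gain: {fastest} ({speedups[fastest]:.2f}x)")
```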

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 627 [+88%] 536 356 333 Vega2’s wide HBM2 is almost 2x faster as expected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 12.37 [+2%] 11.47 11.4 12.18 Using PCIe 3.0 similar upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.95 [+7%] 12.27 12.2 12.08 Again similar bandwidth.
Vega2 benefits greatly from its very wide HBM2 memory (4096-bit), which provides almost 2x the real bandwidth, as expected. But while it is PCIe 4.0 capable, for now it has to make do with 3.0 and thus the same upload/download bandwidth. Here’s hoping for a BIOS update once new motherboards come out.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 202 [-19%] 180 201 247 The higher clock allows Vega2 a 20% latency reduction.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 [-4%] 311 286 353 Full range is only 4% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53.4 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75.4 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.1 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88.5 87.6 80
The higher memory clock gives Vega2 lower global memory latencies than Vega1: about 20% lower for in-page access but only 4% lower across the full range.

Final Thoughts / Conclusions

Vega2 (“BigVega”) is a big improvement over normal Vega1 and its workstation-class pedigree shows. For FP16/FP32 workloads, though, the 30-40% performance improvement may not be worth it considering the much higher price. Naturally FP64 performance is almost 4x higher due to the 1/4 FP64 rate, though not as good as professional cards or the Titan competition with their 1/2 rate.

While the GCN core (rev 5.1) has seen internal updates, there is nothing new that can be supported/optimised for in the compute land thus any code working well on Vega1 should work just as well on Vega2.

The 16GB HBM2 wide memory also helps big workloads with 2x higher bandwidth and also lower latency due to higher clock. For some workloads this alone makes it a definite buy when competition stops at 12GB.

Unfortunately the card has had a limited release at a relatively high price thus value/price ratio depends entirely on your workload – if FP64 with large datasets then it is very much worth it; if FP32/FP16 with datasets that fit in standard 8GB memory then the older Vega1 is much better value and you can even get 2 for the price of the Vega2.
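
Whether a workload “fits” can be estimated up front. As a rough sketch, take a DGEMM-style job that holds three square FP64 matrices (A, B and the result C) in device memory; the helper names are ours, capacities are treated as GiB, and we ignore whatever the driver and runtime reserve for themselves, so real limits will be somewhat lower.

```python
import math

BYTES_PER_DOUBLE = 8
MATRICES = 3  # A, B and the result C all resident on the device


def footprint_bytes(n: int) -> int:
    """Device memory needed for three n x n FP64 matrices."""
    return MATRICES * n * n * BYTES_PER_DOUBLE


def max_square_dim(capacity_gib: int) -> int:
    """Largest n whose three n x n FP64 matrices fit in the given capacity."""
    capacity = capacity_gib * 1024**3
    return math.isqrt(capacity // (MATRICES * BYTES_PER_DOUBLE))


for card, gib in [("Vega1 (8GB)", 8), ("Titan X/V (12GB)", 12), ("Radeon VII (16GB)", 16)]:
    n = max_square_dim(gib)
    print(f"{card}: up to {n} x {n} ({footprint_bytes(n) / 1024**3:.2f} GiB)")
```

Doubling the memory raises the largest matrix dimension by a factor of about 1.41 (the square root of 2), so the extra capacity matters most for problems that are already close to the 8GB or 12GB limit.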

For revolutionary change we need to wait for Navi and its brand new RDNA (Radeon DNA) arch(itecture)…

AMD Ryzen 2 Mobile (2500U) Vega 8 GP(GPU) Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited Ryzen2 APU mobile “Raven Ridge” version of the desktop Ryzen 2 with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper, there was no APU version (at least none released) and no mobile version, leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design in order to hit the TDP range (15-35W) for mobile devices. It has also added the brand-new Vega graphics cores to the APU that have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (core complex) and thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2 mobile:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Radeon RX Vega graphics core (DirectX 12.1)
  • Optimised boost (aka Turbo) algorithm – sharing between CPU & GPU cores

In this article we test GP(GPU) integrated graphics performance.

Hardware Specifications

We are comparing the graphics units of Ryzen2 mobile with competitive APUs with integrated graphics to determine whether they are good enough for modest use, especially for compute (GPGPU) use supporting the CPU.

GPGPU Specifications AMD Radeon RX Vega 8 (2500U) Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) Comments
Arch Chipset GCN1.5 GT2 / EV9.5 GT2 / EV9 GT3 / EV9 All graphics cores are minor revisions of previous cores with extra functionality.
Cores (CU) / Threads (SP) 8 / 512 24 / 192 24 / 192 48 / 384 Vega has the most SPs though only a few but powerful CUs
ROPs / TMUs 8 / 32 8 / 16 8 / 16 16 / 24 Vega has fewer ROPs than GT3 but more TMUs.
Speed (Min-Turbo) 300-1100 300-1000 300-1000 300-950 Turbo boost puts Vega in top position power permitting.
Power (TDP) 25-35W 15-25W 15-25W 15-25W TDP is about the same for all though both Ryzen2 and CFL-U have somewhat higher TDP (25W).
Constant Memory 2.7GB 1.6GB 1.6GB 3.2GB There is no dedicated constant memory thus a large chunk is available to use (GB) unlike a dedicated video card with very fast but small (kB).
Shared (Local) Memory 32kB 64kB 64kB 64kB Intel has 2x larger shared/local memory but slow (likely non dedicated) unlike Vega.
Global Memory 2.7 / 3GB 1.6 / 3.2GB 1.6 / 3.2GB 3.2 / 6.4GB About 50% of main memory can be used as global memory – thus pretty large workloads can be run.
Memory System 128-bit DDR4 2400Mt/s 128-bit DDR3L 1866Mt/s 128-bit DDR3L 1866Mt/s 128-bit DDR4 2133MT/s Ryzen2’s memory controller is rated for faster data rates thus should be able to use faster (laptop) memory.
Memory Bandwidth (GB/s) 36 30 30 33 The high data rate of DDR4 can result in higher bandwidth useful for the GPU cores.
L2 Cache ? 512kB 512kB 1MB L2 is comparable to Intel units.
FP64/double ratio Yes, 1/16x Yes, 1/8x Yes, 1/8x Yes, 1/8x FP64 is supported and at good ratio but lower than Intel’s.
FP16/half ratio Yes, 2x Yes, 2x Yes, 2x Yes, 2x FP16 is also now supported at twice the rate, again unlike gimped dedicated cards.

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) AMD Radeon RX Vega 8 (2500U) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 831 927 1630 2000 [+23%] Thanks to FP16 support we see double the performance over FP32 but Vega is only 23% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 476 478 865 1350 [+56%] Vega rules FP32 and is over 50% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 113 122 209 111 [-47%] FP64 lower rate makes Vega 1/2 the speed of GT3 and only matching GT2 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 5.71 6.29 10.78 7.11 [-34%] Emulated FP128 precision depends entirely on FP64 performance thus not a lot changes.
Vega is over 50% faster than Intel’s top-end Iris/GT3 graphics but only in FP32 precision; while it gains from FP16, Intel scales better, reducing the lead to just 25% or so. In FP64 precision, though, its relatively low 1/16x ratio means it only ties with low-end GT2 models while GT3 is 2x (twice) as fast. Pity.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 0.858 0.87 1.23 2.58 [+2.1x] No wonder AMD is crypto-king: Vega is over 2x faster than even GT3.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1 1.08 1.52 3.3 [+2.17x] Nothing changes here, Vega is almost 2.2x faster.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 2.72 3 4.7 14.29 [+3x] In this heavy integer workload, Vega is now 3x faster no wonder it’s used for crypto mining.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 6 6.64 11.59 18.77 [+62%] SHA1 is less compute intensive allowing Intel to catch up but Vega is still over 60% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 1.019 1.08 1.86 3.36 [+81%] With 64-bit integer workload, Vega does better and is 80% (almost 2x) faster than GT3.
Nobody will be using integrated graphics for crypto-mining any time soon, but if you needed to (perhaps using encrypted containers, VMs, etc.) then Vega is your choice – even GT3 is left in the dust despite big improvement over low-end GT2. Intel would need at least 2x more cores to be competitive here.
GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) 1000 1140 1470 1720 [+17%] If 16-bit precision is sufficient for financial work, Vega is 20% faster than GT3.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 694 697 794 829 [+4%] In this relatively simple FP32 financial workload Vega is just 4% faster than GT3.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 142 154 281 185 [-33%] Switching to FP64 precision, Vega is 33% slower than GT3.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 86 95 155 270 [+74%] Switching to 16-bit precision allows Vega to gain over GT3 and is almost 2x faster.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 92 93 153 254 [+66%] Binomial uses thread shared data thus stresses the internal memory sub-system, and here Vega shows its power – it is 66% faster than GT3.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 18 18.86 32 15.67 [-51%] With FP64 precision Vega loses again vs. GT3 at 1/2 the speed and just matches GT2 units.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 211 236 395 584 [+48%] With 16-bit precision, Vega dominates again and is almost 50% faster than GT3.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 223 236 412 362 [-12%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – but Vega somehow loses against GT3.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 29.5 33.36 58.7 47.13 [-20%] Switching to FP64 precision as expected Vega is slower.
Financial algorithms perform well on Vega – at least in FP16 & FP32 precision but FP64 is too “gimped” (1/16x FP32 rate) and thus loses against GT3 despite more powerful cores.
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 127 140 236 884 [+3.75x] With 16-bit precision Vega runs away with GEMM and is almost 4x faster than GT3.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 105 107 175 214 [+79%] GEMM makes heavy use of shared/local memory which is likely why Vega is 80% faster than GT3.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 38.8 41.69 70 62.6 [-11%] As expected, due to gimped FP64 rate Vega falls behind GT3 but only by just 11%.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 34.2 34.7 45.85 61.34 [+34%] 16-bit precision helps reduce memory bandwidth pressure thus Vega is 34% faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 20.9 21.45 29.69 31.48 [+6%] FFT is memory access bound but Vega does well to beat GT3.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 4.3 5.4 6.07 14.19 [+2.34x] Despite the FP64 rate, Vega manages its memory accesses better beating GT3 by over 2x (two times).
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 270 284 449 623 [+39%] 16-bit precision still benefits N-Body and here Vega is 40% faster than GT3.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 162 181 291 537 [+85%] Back to FP32 and Vega has a pretty large 85% lead – almost 2x GT3.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 22.73 26.1 43.34 44 [+2%] With FP64 precision, Vega and GT3 are pretty much tied.
Vega performs well on compute heavy scientific algorithms (making heavy use of shared/local memory) and also benefits from half/FP16 to reduce memory bandwidth pressure, but FP64 rate comes back to haunt it where it loses against Intel’s GT3. Pity.
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 888 937 1390 2273 [+64%] With 16-bit precision Vega more than doubles its lead over GT3 to 64%, despite GT3’s own gain over FP32.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 461 491 613 781 [+27%] In this 3×3 convolution algorithm, Vega does well but only 30% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 279 302 409 582 [+42%] Again a huge gain by using FP16, over 40% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 100 107 144 157 [+9%] Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) 254 272 396 619 [+56%] Large gain again by switching to FP16 with 3x performance over FP32.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 103 111 156 161 [+3%] With even more shared data the gap falls to just 3%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 259 281 363 595 [+64%] Another huge gain and over 3x improvement over FP32.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 99 106 145 155 [+7%] Still convolution but with 2 filters – the gap is similar to 5×5 – Vega is 7% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 7.39 9.4 8.56 7.688 [-18%] Big gain but not enough to beat GT3 here.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 7 7.57 7.08 4 [-47%] Vega does not like this algorithm (lots of branching causing divergence) and is 1/2 GT3 speed.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 8.55 9.32 9.22 <BSOD> This test would cause BSOD; we are investigating.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 8 8.65 6.77 2.59 [-70%] Vega does not like this algorithm either (complex branching) and neither does GT3.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 941 967 1580 2091 [+32%] In order to prevent artifacts most of this test runs in FP32 thus not much gain here.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 878 952 1550 2100 [+35%] This algorithm is 64-bit integer heavy allowing Vega 35% better performance over GT3.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 341 390 343 1046 [+2.5x] Switching to FP16 makes a huge difference to Vega which is over 2x faster.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 384 425 652 608 [-7%] One of the most complex and largest filters, Vega is a bit slower than GT3 by 7%.
For image processing Vega generally performs well in FP32 beating GT3 hands down; but there are a few algorithms that may need to be optimised for it that don’t perform as well as expected. Switching to FP16 though doubles/triples scores – thus Vega may be starved of memory.
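The FP16 gains quoted above can be checked directly against the table rows. The Python sketch below recomputes the FP16-over-FP32 ratio for four of the convolution filters on Vega; the scores are copied by hand from the table and the helper name `speedup` is ours, not part of any benchmark tool.

```python
# Vega 8 (2500U) scores in MPix/s, copied by hand from the table above.
# Each entry is (FP16 score, FP32 score).
vega_scores = {
    "Blur 3x3": (2273, 781),
    "Sharpen 5x5": (582, 157),
    "Motion Blur 7x7": (619, 161),
    "Sobel 2*5x5": (595, 155),
}


def speedup(fp16: float, fp32: float) -> float:
    """Return how many times faster the FP16 path is than the FP32 path."""
    return fp16 / fp32


for name, (fp16, fp32) in vega_scores.items():
    print(f"{name}: FP16 is {speedup(fp16, fp32):.2f}x the FP32 score")
```

All four ratios land between roughly 3x and 4x, which is what the "doubles/triples" remark refers to.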

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) AMD Radeon RX Vega 8 (2500U) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 12.17 21.2 24 27.32 [+14%] With higher speed DDR4 memory, Vega has 14% more bandwidth.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 6 10.4 11.7 4.74 [-60%] The GPU<>CPU link seems a bit slow here at 1/2 bandwidth of Intel.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 6 10.5 11.75 5 [-57%] Download bandwidth shows a similar issue, 1/2 bandwidth expected.
All designs have to rely on the shared memory controller and Vega performs as expected with good internal bandwidth due to higher speed DDR4 memory. But – transfer up/down speeds are disappointing possibly due to the driver as “zero-copy” mode should be engaged and working on such transfers (APU mode).
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 246 244 288 412 [+49%] Similarly with CPU data latencies, global “in-page/random” (aka “TLB hit”) latencies are a bit high though not by a huge amount.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 365 372 436 519 [+19%] Due to faster memory clock but increased timings “full/random” latencies appear a bit higher.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 156 158 213 201 [-6%] Sequential access latencies are less than competition by 6%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 245 243 252 411 [+63%] None have dedicated constant memory thus we see a similar picture to global memory: somewhat high latencies.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 82 84 100 22.5 [1/5x] Vega has dedicated shared/local memory and it shows – it’s about 5x faster than Intel’s designs.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 1152 1157 1500 278 [1/5x] Texture access is also very fast on Vega, with latencies 5x lower (aka 1/5) than Intel’s designs.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 1178 1162 1533 418 [1/3x] Even full/random accesses are fast, 3x (three times) faster than Intel’s.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 1077 1081 1324 122 [1/10x] With sequential access we see a crazy 10x lower latency as if AMD uses prefetchers and Intel does not.
As we’ve seen in Ryzen 2’s data latency tests, “in-page/random” latencies are higher than the competition but the rest are comparable, with sequential (prefetched) latencies especially small. Dedicated shared/local memory, however, is far faster (5x) and texture accesses are also very fast (3-5x), which should greatly help algorithms making use of them.
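The three access patterns named in the latency rows (sequential, in-page random, full range random) are commonly produced with a pointer chase, where each memory slot stores the index of the next slot to visit. The article does not say how the benchmark builds its chains, so the sketch below is only an assumed illustration; `build_chain`, `chase` and the `page` parameter are hypothetical names.

```python
import random


def build_chain(n, pattern, page=1024, seed=1):
    """Return a next-index table that forms one cycle over all n slots.

    pattern: "sequential", "in_page" (random order inside each page,
    pages visited in order) or "full" (random order over the whole range).
    """
    rng = random.Random(seed)
    if pattern == "sequential":
        order = list(range(n))
    elif pattern == "full":
        order = list(range(n))
        rng.shuffle(order)
    elif pattern == "in_page":
        order = []
        for start in range(0, n, page):
            block = list(range(start, min(start + page, n)))
            rng.shuffle(block)
            order.extend(block)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    nxt = [0] * n
    for i in range(n):
        # The slot visited at step i points to the slot visited at step i + 1.
        nxt[order[i]] = order[(i + 1) % n]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for the given number of dependent loads."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


chain = build_chain(4096, "in_page", page=256)
print(chase(chain, 4096))  # one full lap returns to the starting slot
```

Because every load depends on the previous one, the time per step approximates the access latency; on a GPU the same table would be walked by a single work-item.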
Plotting the global (or constant) memory latencies together we see that the “in-page/random” access latencies should perhaps peak somewhat lower but still nothing close to what we’ve seen in the (CPU) data memory latencies article. It is not very clear (unlike the texture latencies graph) where the caches are located.
The texture latencies graph is far clearer where we can see each level’s caches; unlike the global (or constant) latencies we see “in-page/random” latency peak and hold at a somewhat lower level (4MB).

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Vega mobile, like its big desktop siblings, is undoubtedly powerful and a good upgrade from the older integrated GPU cores; it also supports modern features like half/FP16 compute (which needs vectorisation to what the driver reports as the “optimised width”) and relishes complex algorithms that make use of its efficient shared/local memory. However, Intel’s GT3 EV9.x can get close to it in some workloads and, due to its better FP64 ratio (1/8x vs 1/16x), even beat it in most FP64 precision tests, which is somewhat disappointing.

Luckily for AMD, GT3 variant is very rare and thus Vega has an easy job defeating GT2 in just about all tests; but it shows that should Intel “get serious” and continue to improve integrated graphics (and CPUs) like they used to do before Skylake (SKL/KBL) – AMD might have more serious competition on its hands.

Note that until recently (2019) Ryzen2 mobile APUs were not supported by AMD’s main drivers (“Adrenalin”) and had to rely on pretty old OEM (HP, etc.) drivers that were somewhat problematic especially with Windows 10 changing every 6 months while the drivers were almost 1 year old. Thankfully this has now changed and users (and us) can benefit from updated, stable and performant drivers.

In any case if you want a laptop/ultraportable with just an APU and no dedicated graphics, then Vega is pretty much your only choice which means a Ryzen2 system. That pretty much means it is worthy of a recommendation.

In a word: Highly Recommended

In this article we test GP(GPU) integrated graphics performance; please see our other articles on:

nVidia Titan V: Volta GPGPU performance in CUDA and OpenCL

What is “Titan V”?

It is the latest high-end “pro-sumer” card from nVidia with the next-generation “Volta” architecture, the next generation to the current “Pascal” architecture on the Series 10 cards. Based on the top-end 100 chipset (not lower 102 or 104) it boasts full speed FP64/FP16 performance as well as brand-new “tensor cores” (matrix multipliers) for scientific and deep-learning workloads. It also comes with on-chip HBM2 (high-bandwidth) memory unlike more traditional GDDRX stand-alone memory.

For this reason the price is also far higher than previous Titan X/XP cards but considering the features/performance are more akin to “Tesla” series it would still be worth it depending on workload.

While using the additional cores provided in FP64/FP16 workloads is automatic – save the usual code optimisations – tensor core support requires custom code, and existing libraries and apps need to be updated to make use of them. It is unknown at this time if consumer cards based on “Volta” will also include them. As they support FP16 precision only, not all workloads may be able to use them – but DL (deep learning) and AI (artificial intelligence) are generally fine using lower precision, thus for such tasks it is ideal.
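To give a feel for what "FP16 precision only" means for a workload, the sketch below rounds a few values to IEEE 754 half precision using Python's standard `struct` module (format code `e`). The helper name `to_half` is ours; the point is only to show the rounding, not how any GPU implements it.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (0.1, 1.0, 2049.0, 65504.0):
    print(f"{value!r} stored as FP16 becomes {to_half(value)!r}")
```

Half precision carries an 11-bit significand, so values such as 0.1 lose digits and odd integers above 2048 can no longer be represented exactly; 65504 is the largest finite value.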

See these other articles on Titan performance:

Hardware Specifications

We are comparing the top-of-the-range Titan V with previous generation Titans and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications nVidia Titan V nVidia Titan X (P) nVidia 980 GTX (M2) Comments
Arch Chipset Volta GV100 (7.0) Pascal GP102 (6.1) Maxwell 2 GM204 (5.2) The V is the only one using the top-end 100 chip, not the lower-end 102 or 104 versions.
Cores (CU) / Threads (SP) 80 / 5120 28 / 3584 16 / 2048 The V boasts 80 CUs but these contain only 64 FP32 units each, not 128 like the lower-end chips, thus are equivalent to 40 of those.
FP32 / FP64 / Tensor Cores 5120 / 2560 / 640 3584 / 112 / no 2048 / 64 / no Titan V is the only one with tensor cores and also has a huge number of FP64 cores that Titan X simply cannot match; it also has full speed FP16 support.
Speed (Min-Turbo) 1.2GHz (135-1.455) 1.531GHz (139-1910) 1.126GHz (135-1.215) Slightly lower clocked than the X, it will make up for it with sheer CU count.
Power (TDP) 300W 250W (125-300) 180W (120-225) TDP increases by 50W but it is not unexpected considering the additional units.
ROP / TMU 96 / 320 96 / 224 64 / 128 Not a “gaming card” but while ROPs stay the same the number of TMUs has increased – likely required for compute tasks using textures.
Global Memory 12GB HBM2 850MHz 3072-bit 12GB GDDR5X 10Gbps 384-bit 4GB GDDR5 7Gbps 256-bit Memory size stays the same at 12GB but now uses on-chip HBM2 for much higher bandwidth
Memory Bandwidth (GB/s) 652 512 224 In addition to the modest bandwidth increase, latencies are also meant to have decreased by a good amount.
L2 Cache 4.5MB 3MB 2MB L2 cache has gone up by about 50% to feed all the cores.
FP64/double ratio 1/2 1/32 1/32 For FP64 workloads the V has huge advantage as consumer and previous Titan X had far less FP64 units.
FP16/half ratio 2x 1/64 n/a The V has an even bigger advantage here with over 128x more units for FP16 tasks like DL and AI.
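The bandwidth row can be related to the memory row with the usual peak formula: per-pin data rate times bus width, divided by 8 bits per byte. The sketch below applies it to the three cards; it assumes the HBM2 clock of 850MHz transfers twice per clock (double data rate), and the function name is ours.

```python
def bandwidth_gb_s(rate_gbit_per_pin: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s from per-pin data rate (Gbit/s) and bus width."""
    return rate_gbit_per_pin * bus_width_bits / 8


# Titan V: 850MHz HBM2, assumed double data rate -> 1.7 Gbit/s per pin.
titan_v = bandwidth_gb_s(0.850 * 2, 3072)
# Titan X (Pascal): 10 Gbit/s GDDR5X on a 384-bit bus.
titan_x = bandwidth_gb_s(10, 384)
# GTX 980: 7 Gbit/s GDDR5 on a 256-bit bus.
gtx_980 = bandwidth_gb_s(7, 256)

print(f"Titan V {titan_v:.1f}, Titan X {titan_x:.1f}, GTX 980 {gtx_980:.1f} GB/s")
```

Under these assumptions the formula reproduces the table for Titan V and the GTX 980, while for Titan X it yields a lower figure than the 512 listed; the sketch does not attempt to explain that difference.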

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan V CUDA/OpenCL nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL Comments
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 22,400 [+25%] / 20,000 17,870 / 16,000 7,000 / 6,100 Right off the bat, the V is just 25% faster than the X; some optimisations may be required.
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 33,300 [135x] / n/a 245 / n/a n/a For FP16 workloads the V shows its power: it is an astonishing 135 *times* (times not %) faster than the X.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 11,000 [+16.7x] / 11,000 661 / 672 259 / 265 For FP64 precision workloads the V shines again, it is 16 times faster than the X.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 458 [+17.7x] / 455 25 / 24 10.8 / 10.7 With emulated FP128 precision the V is again 17 times faster.
As expected, FP64 and FP16 performance is much improved on Titan V, with FP64 over 16x faster than the X; FP16 performance is about 50% faster than FP32 performance, making it almost 2x faster than Titan X. For workloads that need it, the performance of Titan V is stellar.
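The measured FP64 gap can be compared with a rough paper estimate built from the specification table: FP64 unit count times clock, counting a fused multiply-add as 2 operations. The sketch below uses the base clocks from the table; both the FMA factor and the choice of base rather than turbo clocks are our assumptions.

```python
def peak_gflops(units: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPS, counting one FMA per unit per clock as 2 FLOPs."""
    return 2 * units * clock_ghz


# FP64 unit counts and base clocks taken from the specification table.
titan_v_fp64 = peak_gflops(2560, 1.2)
titan_x_fp64 = peak_gflops(112, 1.531)
ratio = titan_v_fp64 / titan_x_fp64

print(f"Titan V {titan_v_fp64:.0f} GFLOPS, Titan X {titan_x_fp64:.0f} GFLOPS, "
      f"ratio {ratio:.1f}x")
```

The estimate comes out at roughly 18x, in the same region as the 16.7x measured in the FP64 Mandel test.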
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 71 [+79%] / 87 40 / 38 16 / 16 Titan V is almost 80% faster than the X here a significant improvement.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 91 [+75%] / 116 52 / 51 23 / 21 Not a lot changes here, with the V still 75% faster than the X.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 253 [+89%] / 252 134 / 142 58 / 59 In this integer workload, Titan V is almost 2x faster than the X.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 130 [+21%] / 134 107 / 114 50 / 54 SHA1 is mysteriously slower than SHA256 and here the V is just 21% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 173 [+2.4x] / 176 72 / 42 32 / 24 With 64-bit integer workload, Titan V shines again – it is almost 2.5x (times) faster than the X!
Historically, nVidia cards have not been tuned for integer workloads, but Titan V is almost 2x faster in 32-bit hashing and almost 2.5x faster in 64-bit hashing than the older X. For algorithms that use integer computation this can be quite significant.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 18,460 [+61%] / 18,870 11,480 / 11,470 5,280 / 5,280 Titan V manages to be 60% faster in this FP32 financial workload.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 8,400 [+6.1x] / 9,200 1,370 / 1,300 547 / 511 Switching to FP64 code, the V is over 6x (times) faster than the X.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 4,180 [+81%] / 4,190 2,240 / 2,240 1,200 / 1,140 Binomial uses thread shared data thus stresses the SMX’s memory system: but the V is 80% faster than the X.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 2,000 [+15.5x] / 2,000 129 / 133 51 / 51 With FP64 code the V is much faster – 15x (times) faster!
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 12,550 [+2.35x] / 12,610 5,350 / 5,150 2,140 / 2,000 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here the V is over 2x faster than the X and that is FP32 code!
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 4,440 [+15.1x] / 4,100 294 / 267 118 / 106 Switching to FP64 the V is again over 15x (times) faster!
For financial workloads, the Titan V is significantly faster, almost twice as fast as Titan X on FP32 but over 15x (times) faster on FP64 workloads. If time is money, then this can be money well-spent!
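For readers unfamiliar with the Black-Scholes workload, each "option" in the MOPT/s figures corresponds to one evaluation of the closed-form European option price. The article does not show the benchmark's kernel, so the following is only the textbook formula as a scalar Python sketch; the parameter names are ours.

```python
import math


def norm_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, rate, vol, years):
    """Return (call, put) prices of a European option, no dividends."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discounted_strike = strike * math.exp(-rate * years)
    call = spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    put = discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05, vol=0.2, years=1.0)
print(f"call {call:.4f}, put {put:.4f}")
```

Every option is independent of the others, which is why this workload maps so well onto thousands of GPU threads; the log, exp and erf calls are where FP64 throughput matters.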
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 9,860 [+57%] / 10,350 6,280 / 6,600 2,550 / 2,550 Without using the new “tensor cores”, Titan V is about 60% faster than the X.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 3,830 [+11.4x] / 3,920 335 / 332 130 / 129 With FP64 precision, the V crushes the X again: it is 11x (times) faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 605 [+2.5x] / 391 242 / 227 148 / 136 FFT allows the V to do even better – no doubt due to HBM2 memory.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 280 [+35%] / 245 207 / 191 89 / 82 We may need some optimisations here, otherwise the V is just 35% faster.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,390 [+15%] / 4,630 5,600 / 4,870 2,100 / 2,000 N-Body simulation also needs some optimisations as the V is just 15% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 4,270 [+15.5x] / 4,200 275 / 275 82 / 81 With FP64 precision, the V again crushes the X – it is 15x faster.
The scientific scores are a bit more mixed – GEMM will require code paths to take advantage of the new “tensor cores” and some optimisations may be required – otherwise FP64 code simply flies on Titan V.
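GEMM scores in GFLOPS are conventionally derived by counting 2 floating-point operations (one multiply, one add) per inner-loop step, giving 2·N·K·M for an N×K by K×M product, and dividing by the run time. The sketch below shows that bookkeeping with a naive pure-Python multiply; it is an illustration of the metric only and is not how the benchmark computes or optimises its GEMM.

```python
import time


def matmul(a, b):
    """Naive matrix product of two lists-of-lists (no tiling, no shared memory)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def gemm_flops(n: int, k: int, m: int) -> int:
    """One multiply and one add per inner-loop step."""
    return 2 * n * k * m


n = 64
a = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

start = time.perf_counter()
c = matmul(a, identity)
elapsed = time.perf_counter() - start

print(f"{gemm_flops(n, n, n) / elapsed / 1e9:.4f} GFLOPS (pure Python on the CPU)")
```

Multiplying by the identity matrix makes the result easy to verify, since the output must equal the input.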
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 26,790 [+50%] / 26,660 17,860 / 13,680 7,310 / 5,530 In this 3×3 convolution algorithm, Titan V is 50% faster than the X. Convolution is also used in neural nets (CNN) thus performance here counts.
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 29,200 [+18.6x] 1,570 n/a With FP16 precision, Titan V shines: it is 18x (times) faster than the X, but only about 9% faster than its own FP32 result.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 9,295 [+94%] / 6,750 4,800 / 3,460 1,870 / 1,380 Same algorithm but more shared data allows the V to be almost 2x faster than the X.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 14,900 [+24.4x] 609 n/a With FP16 Titan V is almost 25x (times) faster than the X and also 60% faster than FP32.
GPGPU Image Processing Motion-Blur (7×7) Filter single/FP32 (MPix/s) 9,428 [+2x] / 7,260 4,830 / 3,620 1,910 / 1,440 Again the same algorithm but with even more shared data; the V is 2x faster than the X.
GPGPU Image Processing Motion-Blur (7×7) Filter half/FP16 (MPix/s) 14,790 [+45x] 325 n/a With FP16 the V is now 45x (times) faster than the X, showing the usefulness of FP16 support.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 9,079 [+1.92x] / 7,380 4,740 / 3,450 1,860 / 1,370 Still convolution but with 2 filters – Titan V is almost 2x faster again.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 13,740 [+44x] 309 n/a Just as we have seen above, the V is an astonishing 44x (times) faster than the X, and also ~50% faster than FP32 code.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 111 [+3x] / 66 36 / 55 20 / 25 Different algorithm but here the V is even faster, 3x faster than the X!
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 206 [+2.89x] 71 n/a With FP16 the V is “only” 3x faster than the X but also almost 2x faster than the FP32 code-path, again a big gain for FP16 processing.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 157 [+10x] / 24 15 / 15 12 / 11 Without major processing, this filter flies on the V – it is 10x faster than the X.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 215 [+4x] 50 n/a FP16 precision is “just” 4x faster but it is also ~40% faster than FP32.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 24,370 [+25%] / 22,780 19,480 / 14,000 7,600 / 6,640 This algorithm is 64-bit integer heavy and here Titan V is 25% faster than the X.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 24,180 [+4x] 6,090 n/a FP16 does not help a lot here, but still the V is 4x faster than the X.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 846 [+3x] / 874 288 / 635 210 / 308 One of the most complex and largest filters, Titan V does very well here, it is 3x faster than the X.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 1,712 [+3.7x] 461 n/a Switching to FP16, the V is almost 4x (times) faster than the X and over 2x faster than FP32 code.
For image processing, Titan V brings big performance increases, from 50% to 4x (times) faster than Titan X – a big upgrade. If you are willing to drop to FP16 precision, then it is an extra 50% to 2x faster again – while naturally FP16 is not really usable on the X. With potentially 8x better performance, Titan V powers through image processing tasks.
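Most of the filters above are convolutions: every output pixel is a weighted sum of its neighbours, so adjacent pixels read overlapping data, which is why shared memory matters. As a minimal illustration, here is a 3×3 box blur in plain Python; the equal weights and the clamp-to-edge border handling are our assumptions, since the article does not specify the benchmark's kernel.

```python
def blur3x3(img):
    """Apply a 3x3 box blur to a 2D list of floats, clamping reads at the edges."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
for row in blur3x3(image):
    print(row)
```

On a GPU each output pixel would be computed by its own work-item; the 5×5 and 7×7 filters follow the same pattern with 25 and 49 reads per pixel.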

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.


Memory Benchmarks nVidia Titan V CUDA/OpenCL nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 536 [+51%] / 530 356 / 354 145 / 144 HBM2 brings about 50% more raw bandwidth to feed all the extra compute cores, a significant upgrade.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.47 / 11.4 11.4 / 9 12.1 / 12 Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4!
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.3 / 12.3 12.2 / 8.9 11.5 / 12.2 Again no significant difference but we were not expecting any.
Titan V’s HBM2 brings 50% more memory bandwidth but as it still uses the PCIe3 x16 connection there is no change to host upload/download bandwidth, which may be a bit of a bottleneck when trying to keep all those cores fed with data. Even more streaming load/save is required and code will need to be optimised to use all that processing power.
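A back-of-the-envelope model shows how lopsided the two bandwidths are. The sketch below splits a data set into chunks and compares uploading and processing them one after another with a double-buffered scheme where the upload of the next chunk overlaps processing of the current one. It assumes the kernel is purely bandwidth-bound and makes a single pass over the data at the measured internal bandwidth, which is a simplification; the function name and the 8GB / 16-chunk figures are ours.

```python
def pipeline_time(total_gb, chunks, upload_gb_s, process_gb_s, overlap):
    """Estimated seconds to upload and process total_gb split into equal chunks."""
    chunk_gb = total_gb / chunks
    t_up = chunk_gb / upload_gb_s
    t_proc = chunk_gb / process_gb_s
    if not overlap:
        # Each chunk is uploaded, then processed, strictly in sequence.
        return chunks * (t_up + t_proc)
    # Double buffering: the first upload and the last processing step cannot
    # be hidden; in between, the slower of the two stages sets the pace.
    return t_up + (chunks - 1) * max(t_up, t_proc) + t_proc


UPLOAD_GB_S = 11.47    # measured Titan V upload bandwidth (CUDA)
INTERNAL_GB_S = 536.0  # measured Titan V internal bandwidth (CUDA)

serial = pipeline_time(8.0, 16, UPLOAD_GB_S, INTERNAL_GB_S, overlap=False)
overlapped = pipeline_time(8.0, 16, UPLOAD_GB_S, INTERNAL_GB_S, overlap=True)
print(f"serial {serial:.3f}s, overlapped {overlapped:.3f}s")
```

In this model overlap hides nearly all of the processing time, yet the total barely moves because the upload dominates; keeping data resident on the card matters more than the overlap itself.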
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 180 [-10%] / 187 201 / 230 230 From the start we see global latency accesses reduced by 10%, not a lot but will help.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 311 [+9%] / 317 286 / 311 306 Full range random accesses do seem to be 9% slower which may be due to the architecture.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53 [-40%] / 57 89 / 121 97 However, sequential accesses seem to have dropped a huge 40% likely due to better prefetchers on the Titan V.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75 [-36%] / 76 117 / 174 126 Constant memory latencies also seem to have dropped by almost 40%, a great result.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18 / 85 18 / 53 21 No significant change in shared memory latencies.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 [+9%] / 279 195 / 196 208 Texture access latencies seem to have increased by 9%.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 [+22%] / 313 282 / 278 308 As we’ve seen with global memory, we see increased latencies here by about 20%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88 / 163 87 / 123 102 With sequential access there is no appreciable delta in latencies.
HBM2 does seem to increase latencies slightly, by about 10%, but for sequential accesses Titan V performs a lot better than the X with 20-40% lower latencies, likely due to the new architecture. Thus code using coalesced memory accesses will perform faster, but code using a random access pattern over large data sets will see slightly higher latencies.
We see L1 cache effects between 64-128kB tallying with an L1D of 96kB – 4x more than what we’ve seen on Titan X (at 16kB). The other inflexion is at 4MB – matching the 4.5MB L2 cache size – which is 50% more than what we saw on Titan X (at 3MB).
As with global memory we see the same L1D (64kB) and L2 (4.5MB) cache effects with similar latencies. Both are significant upgrades over Titan X’s caches.

Titan V’s memory performance does not disappoint – HBM2 obviously brings a large bandwidth increase – while latency depends on access pattern: when prefetchers can engage latencies are much lower, but for random out-of-page accesses they are a bit higher, though nothing significant. We’re also limited by the PCIe3 bus for transfers, which requires judicious overlap of memory transfers and compute to keep the cores busy.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

“Volta” architecture does bring good improvements in FP32 performance which we hope to see soon in consumer (Series 11?) graphics cards – as well as lower-end Titan cards.

But here (on Titan V) we have the top-end chip with full-power FP64 and FP16 units, more akin to Tesla, which simply power through any and all algorithms you can throw at them. This is really the “Titan” you were looking for, and moving up from the previous Titan X (Pascal) is a huge upgrade, admittedly for quite a bit more money.

If you have workloads that require double/FP64 precision – Titan V is 15-16x faster than Titan X – thus great value for money. If code can make do with FP16 precision then you can gain up to 2x extra performance again – as well as save storage for large data-sets – again Titan X cannot cut it here running at 1/64 rate.

We have not yet shown tensor core performance, which is an additional reason for choosing such a card – if you have code that can make use of them you can gain an extra 16x (times) performance that really puts Titan V head and shoulders above the Titan X.

All in all Titan V is a compelling upgrade if you need more power than Titan X; if you are using (or thinking of using) multiple Titan X cards, there is simply no point. One Titan V can replace 4 or more Titan X cards on FP64 or FP16 workloads and that is before you make any optimisations. Obviously you are still “stuck” with 12GB memory and the PCIe bus for transfers but with judicious optimisations this should not impact performance significantly.

nVidia Titan X: Pascal GPGPU Performance in CUDA and OpenCL

What is “Titan X (Pascal)”?

It is the current high-end “pro-sumer” card from nVidia using the current generation “Pascal” architecture – equivalent to the Series 10 cards. It is based on the 2nd-from-the-top 102 chipset (not the top-end 100) thus it does not feature full speed FP64/FP16 performance that is generally reserved for the “Quadro/Tesla” professional range of cards. It does however come with more memory to fit more datasets and is engineered for 24/7 performance.

Pricing has increased a bit from previous generation X/XP but that is a general trend today from all manufacturers.

See these other articles on Titan performance:

Hardware Specifications

We are comparing the top-of-the-range Titan X with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications nVidia Titan X (P) nVidia 980 GTX (M2) AMD Vega 56 AMD Fury Comments
Arch Chipset Pascal GP102 (6.1) Maxwell 2 GM204 (5.2) Vega 10 Fiji The X uses the current Pascal architecture that is also powering the current Series 10 consumer cards
Cores (CU) / Threads (SP) 28 / 3584 16 / 2048 56 / 3584 64 / 4096 We’ve got 28CU/SMX here down from 32 on GP100/Tesla but should still be sufficient to power through tasks.
FP32 / FP64 / Tensor Cores 3584 / 112 / no 2048 / 64 / no 3584 / 448 / no 4096 / 512 / no Only 112 FP64 units – a lot less than competition from AMD, this is a card geared for FP32 workloads.
Speed (Min-Turbo) 1.531GHz (139-1910) 1.126GHz (135-1.215) 1.64GHz 1GHz Higher clocked than the previous generation and comparable with the competition.
Power (TDP) 250W (125-300) 180W (120-225) 200W 150W TDP has also increased to 250W but again that is in line with top-end cards that are pushing over 200W.
ROP / TMU 96 / 224 64 / 128 64 / 224 64 / 256 As it may also be used as top-end graphics card, it has a good amount of ROPs (50% more than competition) and similar numbers of TMUs.
Global Memory 12GB GDDR5X 10Gbps 384-bit 4GB GDDR5 7Gbps 256-bit 8GB HBM2 2Gbps 2048-bit 4GB HBM 1Gbps 4096-bit Titan X comes with a huge 12GB of current GDDR5X memory while the competition has switched to HBM2 for top-end cards.
Memory Bandwidth (GB/s) 512 224 483 512 Due to high speed GDDR5X, the X has plenty of memory bandwidth even higher than HBM2 competition.
L2 Cache 3MB 2MB 4MB 2MB L2 cache has increased by 50% over previous arch to keep all cores fed.
FP64/double ratio 1/32 1/32 1/8 1/8 The X is not really meant for FP64 workloads as it uses the same ratio 1:32 as normal consumer cards.
FP16/half ratio 1/64 n/a 1/1 1/1 With a 1:64 ratio FP16 is not really usable on Titan X and can only really be used for compatibility testing.

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers from both nVidia and competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL AMD Vega 56 OpenCL AMD Fury OpenCL Comments
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 17,870 [+37%] / 16,000 7,000 / 6,100 13,000 8,720 Titan X makes a good start beating the Vega by almost 40%.
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 245 [-98%] / n/a n/a 13,130 7,890 FP16 is so slow that it is unusable – just for testing.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 661 [-47%] / 672 259 / 265 1,250 901 FP64 is also quite slow though a lot faster than on the GTX 980.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 25 [-67%] / 24 10.8 / 10.7 77.3 55 Emulated FP128 precision depends entirely on FP64 performance and thus is… slow.
With FP32 “normal” workloads Titan X is quite fast, ~40% faster than Vega and about 2.5x faster than an older GTX 980 thus quite an improvement. But FP16 workloads should not apply – better off with FP32 – and FP64 is also about 1/2 the performance of a Vega – also slower than even a Fiji. As long as all workloads are FP32 there should be no problems.
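The "Mandel" rows measure a Mandelbrot-style workload, where every pixel runs a short loop of multiplies and adds with almost no memory traffic, so the score tracks raw arithmetic rate at the chosen precision. The benchmark's actual kernel is not shown in the article; the sketch below is the textbook escape-time iteration, with `max_iter` chosen arbitrarily.

```python
def mandel_iterations(cr: float, ci: float, max_iter: int = 256) -> int:
    """Return the number of iterations before z = z*z + c escapes |z| > 2."""
    zr = zi = 0.0
    for i in range(max_iter):
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:
            return i
        zi = 2.0 * zr * zi + ci
        zr = zr2 - zi2 + cr
    return max_iter


# Render a tiny ASCII view of the set: '#' marks points that never escaped.
for row in range(11):
    ci = -1.25 + row * 0.25
    line = ""
    for col in range(31):
        cr = -2.0 + col * 0.1
        line += "#" if mandel_iterations(cr, ci) == 256 else "."
    print(line)
```

Swapping the float type for half, double or an emulated quad type changes only the arithmetic inside the loop, which is why the four Mandel rows expose the FP16, FP32, FP64 and FP128 rates so directly.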
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 40 [-38%] / 38 16 / 16 65 46 Titan X is a lot faster than the previous gen but still ~40% slower than a Vega.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 52 [-38%] / 51 23 / 21 84 60 Nothing changes here, the X is still about 40% slower than a Vega.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 134 [+4%] / 142 58 / 59 129 82 In this integer workload, somehow Titan X manages to beat the Vega by 4%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 [-34%] / 114 50 / 54 163 124 SHA1 is mysteriously slower thus the X is ~35% slower than a Vega.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 72 [+2.3x] / 42 32 / 24 31 13.8 With 64-bit integer workload, Titan X is a massive 2.3x faster than a Vega.
Historically, nVidia cards have not been tuned for integer workloads, but Titan X still manages to beat a Vega – the “gold standard” for crypto-currency hashing – on both SHA256 and especially on 64-bit integer SHA2-512! Perhaps for the first time a nVidia card is competitive on integer workloads and even much faster on 64-bit integer workloads.
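One reason 64-bit hashing can behave differently from 32-bit hashing is that SHA2-512 works on 64-bit words and relies heavily on 64-bit rotations. On hardware whose integer lanes are 32 bits wide (an assumption here; the article does not describe either vendor's integer units) each such rotation has to be assembled from several 32-bit operations. The sketch below shows that decomposition and checks it against a direct 64-bit rotate; the function names are ours.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rotr64(x: int, n: int) -> int:
    """Reference 64-bit rotate right."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64


def rotr64_from_halves(hi: int, lo: int, n: int):
    """Rotate the 64-bit value hi:lo right by n using only 32-bit pieces."""
    n %= 64
    if n >= 32:
        hi, lo = lo, hi  # rotating by 32 simply swaps the two halves
        n -= 32
    if n == 0:
        return hi, lo
    new_lo = ((lo >> n) | (hi << (32 - n))) & MASK32
    new_hi = ((hi >> n) | (lo << (32 - n))) & MASK32
    return new_hi, new_lo


value = 0x0123456789ABCDEF
hi, lo = rotr64_from_halves(value >> 32, value & MASK32, 28)
print(hex((hi << 32) | lo), hex(rotr64(value, 28)))
```

Two shifts and an OR per half, plus a possible swap, replace what would be a single instruction with native 64-bit support.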
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 11,480 [+28%] / 11,470 5,280 / 5,280 9,000 11,220 In this FP32 financial workload Titan X is almost 30% faster than a Vega.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 1,370 [-36%] / 1,300 547 / 511 1,850 1,290 Switching to FP64 code, the X remains competitive and is about 35% slower.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 2,240 [-8%] / 2,240 1,200 / 1,140 2,440 1,760 Binomial uses thread shared data thus stresses the SMX’s memory system and here Vega surprisingly does better by 8%.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 129 [-20%] / 133 51 / 51 161 115 With FP64 code the X is only 20% slower than a Vega.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,350 [+47%] / 5,150 2,140 / 2,000 3,630 2,470 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Titan X is almost 50% faster!
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 294 [-34%] / 267 118 / 106 385 332 Switching to FP64 the X is again 34% slower than a Vega.
For financial FP32 workloads, the Titan X generally beats the Vega by a good amount or at least ties with it; with FP64 precision it is about 1/2 the speed which is to be expected. As long as you have FP32 workloads this should not be a problem.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,280 [+19%] / 6,600 2,550 / 2,550 5,260 3,630 Using 32-bit precision Titan X beats the Vega by 20%.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 335 [-40%] / 332 130 / 129 555 381 With FP64 precision, unsurprisingly the X is 40% slower.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 242 [-20%] / 227 148 / 136 306 348 FFT does better with HBM memory and here Titan X is 20% slower than a Vega.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 207 / 191 89 / 82 139 116 Surprisingly the X does very well here and manages to beat all cards by almost 50%!
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 5,600 [+20%] / 4,870 2,100 / 2,000 4,670 3,080 Titan X does well in this algorithm, beating the Vega by 20%.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 275 [-20%] / 275 82 / 81 343 303 With FP64 precision, the X is again 20% slower.
The scientific scores are similar to the financial ones but the gain/loss is about 20% not 40% – in FP32 workloads Titan X is 20% faster while in FP64 it is about 20% slower than a Vega – a closer result than expected.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 14,550 [-60%] / 10,880 7,310 / 5,530 36,000 28,000 In this 3×3 convolution algorithm, somehow Titan X is over 50% slower than a Vega and even a Fury.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 3,840 [-11%] / 2,750 1,870 / 1,380 4,300 3,150 Same algorithm but more shared data reduces the gap to 10% but still a loss.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 3,920 [-10%] / 2,930 1,910 / 1,440 4,350 3,200 With even more data the gap remains at 10%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 3,740 [-11%] / 2,760 1,860 / 1,370 4,210 3,130 Still convolution but with 2 filters – Titan X is 10% slower again.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.7 / 55 [+52%] 20.6 / 25.4 36.3 30.8 Different algorithm allows the X to finally beat the Vega by 50%.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 15.6 [-60%] / 15.3 12.2 / 11.4 38.7 14.3 Without major processing, this filter does not like the X much; it runs 60% slower than the Vega.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 16,480 [-57%] / 14,000 7,600 / 6,640 38,730 28,500 This algorithm is 64-bit integer heavy but again Titan X is 1/2 the speed of Vega.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 290 / 6,350 [+13%] 210 / 3,080 5,600 4,410 One of the most complex and largest filters, Titan X finally beats the Vega by over 10%.
For image processing using FP32 precision Titan X surprisingly does not do as well as expected – either in CUDA or OpenCL – with the Vega beating it by a good margin on most filters – a pretty surprising result. Perhaps more optimisations are needed on nVidia hardware. We obviously did not test FP16 performance at all as that would have been far slower.
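The bracketed figures in the table above (for example [+28%] or [+2.3x]) appear to compare the Titan X CUDA score with the Vega 56 score. A minimal sketch of that arithmetic, assuming this reading is correct; the helper name `delta_label` and the 2x threshold for switching to a multiplier are illustrative assumptions, not something the article states.

```python
def delta_label(score: float, baseline: float) -> str:
    """Format score relative to baseline in the style of the table brackets.

    Assumption: ratios of 2x or more are shown as a multiplier,
    anything smaller as a signed percentage.
    """
    ratio = score / baseline
    if ratio >= 2.0:
        return f"+{ratio:.1f}x"
    return f"{(ratio - 1.0) * 100:+.0f}%"


# Scores taken from the table above (Titan X CUDA vs Vega 56).
print("Black-Scholes FP32:", delta_label(11480, 9000))
print("SHA2-512:", delta_label(72, 31))
print("Binomial FP32:", delta_label(2240, 2440))
```

Computing the deltas from the raw scores makes it easy to check that a commentary line agrees with its row.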

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers from nVidia and competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

HBM2 does seem to increase latencies slightly by about 10% but for sequential accesses Titan V does perform a lot better than the X with 20-40% lower latencies, likely due to the new architecture. Thus code using coalesced memory accesses will perform faster but for code using random access pattern over large data sets

 

Memory Benchmarks nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL AMD Vega 56 OpenCL AMD Fury OpenCL Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 356 [+13%] / 354 145 / 144 316 387 Titan X brings more bandwidth than a Vega (+13%) but the old Fury takes the crown.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.4 / 9 12.1 / 12 12.1 11 All cards use PCIe3 x16 and thus no appreciable delta.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.2 / 8.9 11.5 / 12.2 10 9.8 Again no significant difference but we were not expecting any.
Titan X uses current GDDR5X but with high data rate allowing it to bring more bandwidth than some HBM2 designs – a pretty impressive feat. Naturally high-end cards using HBM2 should have even higher bandwidth.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 201 / 230 230 273 343 Compared to previous generation, Titan X has better latency due to higher data rate.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 286 / 311 306 399 525 Similarly, even full random accesses are faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89 / 121 97 129 216 Sequential access has similarly low latencies but nothing special.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 / 174 126 269 353 Constant memory latencies have also dropped.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18 / 53 21 49 112 Even shared memory latencies have dropped likely due to higher core clocks.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 / 196 208 121 Texture access latencies have come down as well.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 / 278 308 And even full range latencies have decreased.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87 / 123 102 With sequential access there is no appreciable delta in latencies.
We’re only comparing CUDA latencies here (as OpenCL is quite variable) – thus compared to the previous generation (GTX 980) all latencies are down, either due to higher memory data rate or core clock increases – but nothing spectacular. Still good progress and everything helps.
We see L1 cache effects until 16kB (same as previous arch) and between 2-4MB tallying with the 3MB cache. While fast perhaps they could be a bit bigger.
As with global memory we see the same L1D and L2 cache effects with similar latencies. All in all good performance but we could do with bigger caches.

Titan X’s memory performance is what you’d expect from higher clocked GDDR5X memory – it is competitive even with the latest HBM2 powered competition – both bandwidth and latency wise. There are no major surprises here and everything works nicely.
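The latency rows above distinguish sequential, in-page random and full-range random access. The article does not describe how these are measured; a common approach is a dependent "pointer chase", where each load supplies the index of the next one. The sketch below only builds and walks such a chain on the host (the names `build_chain` and `walk` are hypothetical), using Sattolo's algorithm so that the random chain is a single cycle visiting every element.

```python
import random


def build_chain(n, pattern, seed=1):
    """Return a next-index table: chain[i] is the element visited after i."""
    if pattern == "sequential":
        return [(i + 1) % n for i in range(n)]
    # Sattolo's algorithm: a random permutation made of one cycle of length n,
    # so the walk touches every element before it repeats.
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j is always strictly below i
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def walk(chain, steps, start=0):
    """Follow the chain; each step depends on the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx


def cycle_length(chain, start=0):
    """Number of steps before the walk returns to its starting element."""
    idx, count = chain[start], 1
    while idx != start:
        idx = chain[idx]
        count += 1
    return count


random_chain = build_chain(1024, "random")
print("random chain cycle length:", cycle_length(random_chain))
print("sequential walk ends at:", walk(build_chain(1024, "sequential"), 10))
```

Restricting the chain to one memory page would approximate the in-page case, and spreading it over the whole buffer would approximate the full-range case. Varying the buffer size is what exposes the cache steps mentioned above.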

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Titan X based on the current “Pascal” architecture performs very well in FP32 workloads – it is much faster than previous generation for a modest price increase and is competitive with AMD’s Vega offerings. But it is likely due to be replaced soon as next-generation “Volta” architecture is already out on the high-end (Titan V) and likely due to filter down the stack to both consumer (Series 11?) cards and “pro-sumer” Titan cards cheaper than the Titan V.

For FP64 workloads it is perhaps best to choose an older Quadro/Tesla card with more FP64 units as performance is naturally much lower. FP16 performance is also restricted and pretty much not usable – good for compatibility testing should you hope to upgrade to a full-speed FP16 card in the future. For both these workloads – the high-end Titan V is the card you probably want – but at a much higher price.

Still for the money, Titan X has its place and the most common FP32 workloads (financial, scientific, high precision image processing, etc.) that do not require FP64 nor FP16 optimisations perform very well and this card is all you need.
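To show what the financial FP32/FP64 workloads mentioned above compute per option, here is a minimal scalar sketch of the standard Black-Scholes formula for a European call. It is a textbook formulation, not the benchmark's kernel; the function names are illustrative, and a GPU version would evaluate many options in parallel.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Price of a European call option under Black-Scholes assumptions."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# Textbook example: at-the-money call, 5% rate, 20% volatility, 1 year.
print(round(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0), 4))
```

The kernel is dominated by transcendental functions (log, exp, erf), so its throughput tracks the FP32 or FP64 unit rate of the card.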

FP16 GPGPU Image Processing Performance & Quality

GPGPU Image Processing

What is FP16 (“half”)?

FP16 (aka “half” floating-point) is the IEEE lower-precision floating-point representation that has recently begun to be supported by GPGPUs for compute (e.g. Intel EV9+ Skylake GPU, nVidia Pascal) while CPU support is still limited to SIMD conversion only (FP16C). It has been added to allow mobile devices (phones, tablets) to provide increased performance (and thus save power for fixed workloads) for a small drop in quality for normal 8-bpc (24-bpp) image and video.

However, normal laptops and tablets with integrated graphics can also benefit from FP16 support in the same way due to relatively low graphics compute power and the need to save power due to limited battery in thin and light formats.
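To see why FP16 can be enough for 8-bit-per-channel images, the sketch below rounds values to IEEE binary16 with Python's `struct` module (format code `e`). The helper name `to_half` is an illustrative choice. Every 8-bit channel value is represented exactly, while larger integers and most fractions are not.

```python
import struct


def to_half(x):
    """Round x to the nearest IEEE 754 binary16 value, returned as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Every 8-bit channel value survives the round trip unchanged.
exact = all(to_half(float(v)) == float(v) for v in range(256))
print("all 8-bit values exact:", exact)

# Integers stop being exact above 2048 (11 significand bits).
print("2048 ->", to_half(2048.0), " 2049 ->", to_half(2049.0))

# Fractions keep only about three decimal digits.
print("0.1 ->", to_half(0.1))
```

Precision loss therefore comes from intermediate results (sums, products, noise functions), not from the input pixels themselves.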

In this article we’re investigating the performance differences vs. standard FP32 (aka “single”) and the resulting quality difference (if any) for mobile GPGPUs (Intel’s EV9/9.5 SKL/KBL). See the previous articles for general performance comparison:

Image Processing Performance & Quality

We are testing GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Image Filter
FP32/Single FP16/Half Comments
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  481  967 [+2x] We see a text-book 2x performance increase for no visible drop in quality.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  107  331 [+3.1x] Using FP16 yields over 3x performance increase but we do see a few more changed pixels though no visible difference.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  112  325 [+2.9x] Again almost 3x performance increase but no visible quality difference. Result!
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  107  323 [+3.1x] Again just over 3x performance increase but no visible quality difference.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s) 5.41  5.67 [+4%] No image difference at all but also almost no performance increase – a measly 4%.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  4.7  13.48 [+2.86x] We’re back with a 2.8x performance increase but a few more differences than we’ve seen, though quality seems acceptable.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  1188  1210 [+2%] Due to random number generation using 64-bit integer processing the performance difference is minimal but the picture quality is not acceptable.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s) 470  508 [+8%] Again due to Perlin noise generation we see almost no performance gain but big drop in image quality – not worth it.
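The "changed pixels" remarks in the table above can be illustrated by running the same filter at two precisions and counting the 8-bit outputs that differ. The following is a host-side sketch, not the benchmark's kernel: it assumes a plain 3×3 mean filter, rounds every intermediate to the chosen format through `struct`, and uses a synthetic test image.

```python
import struct


def quantise(x, fmt):
    """Round x to binary16 (fmt 'e') or binary32 (fmt 'f')."""
    return struct.unpack("<" + fmt, struct.pack("<" + fmt, x))[0]


def box_blur(img, fmt):
    """3x3 mean filter; each intermediate is rounded to the given format."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp at the borders
                    xx = min(max(x + dx, 0), w - 1)
                    acc = quantise(acc + img[yy][xx], fmt)
            mean = quantise(acc / 9.0, fmt)
            out[y][x] = min(255, int(mean + 0.5))  # back to an 8-bit value
    return out


def changed_pixels(a, b):
    """Count positions where the two images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


# Synthetic 8-bit test image (made-up data, not from the benchmark).
image = [[(x * 37 + y * 91) % 256 for x in range(32)] for y in range(32)]
fp32 = box_blur(image, "f")
fp16 = box_blur(image, "e")
print(changed_pixels(fp32, fp16), "of", 32 * 32, "pixels differ")
```

With a 3×3 window only the last accumulation can exceed 2048, the point where FP16 stops representing integers exactly, so any differing pixel is off by at most one level. Larger windows and more complex filters accumulate more rounding error.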

Other Image Processing Related Algorithms

Image Filter
FP16/Half FP32/Single FP64/Double Comments
GPGPU Science Benchmark GEMM OpenCL (GFLOPS)  178 [+50%]  118  35 Dropping to FP16 gives us 50% more performance, not as good as 2x but still a significant increase.
GPGPU Science Benchmark FFT OpenCL (GFLOPS)  34 [+70%]  20  5.4 With FFT we are now 70% faster, closer to the 100% promised.
GPGPU Science Benchmark N-Body OpenCL (GFLOPS)  297 [+49%]  199  35 Again we drop to “just” 50% faster with FP16 but still a great performance improvement.

Final Thoughts / Conclusions

For many image processing filters (Blur, Sharpen, Sobel/Edge-Detection, Median/De-Noise, etc.) we see a huge 2-3x performance increase – more than we’ve hoped for (2x) – with little or no image quality degradation. Thus FP16 support is very much useful and should be used when supported.

However for complex filters (Diffusion, Marble/Perlin Noise) the drop in quality is not acceptable for minor performance increase (2-8%); increasing the precision of more data items to improve quality (from FP16 to FP32) would further drop performance making the whole endeavour pointless.
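The Diffusion filter is described as relying on 64-bit integer random number generation (an XorShift generator), which explains why switching the floating-point type barely changes its speed. A minimal sketch of one 64-bit xorshift step follows; the shift triple 13, 7, 17 is one of Marsaglia's published choices, and whether the benchmark uses this exact variant is not stated.

```python
MASK64 = (1 << 64) - 1  # keep Python's unbounded ints within 64 bits


def xorshift64(state):
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


state = 1  # any non-zero seed works; zero would stay zero forever
for _ in range(3):
    state = xorshift64(state)
    print(hex(state))
```

All of the work here is shifts and XORs on 64-bit integers, so none of it runs on the FP16 path.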

For those algorithms that do benefit from FP16 the performance improvement with FP16 is very much worth it – so FP16 support is very useful indeed.

Intel Graphics GPGPU Performance


Why test GPGPU performance of Intel Core Graphics?

Laptops (and tablets) are still in fashion with desktops largely left to PC game enthusiasts and workstations for big compute workloads; most laptops (and all tablets) make do with integrated graphics with few dedicated graphics options mainly for mobile PC gamers.

As a result integrated graphics on Intel’s mobile platform is what the vast majority of users will experience – thus its importance is not to be underestimated. While in the past integrated graphics options were dire – the introduction of Core v3 (Ivy Bridge) series brought us a GPGPU-capable graphics processor as well as an updated version of the internal media transcoder of Core v2 (Sandy Bridge).

With each generation Intel has progressively improved the graphics core, perhaps far more than its CPU cores – and added more variants (GT3) and embedded cache (eDRAM) which greatly increased performance – all within the same power limit.

New Features enabled by the latest 21.45 graphics driver

With Intel graphics drivers supporting just 2 generations of graphics – unlike unified drivers of AMD and nVidia – old graphics quickly become obsolete with few updates; but Windows 10 “free update” forced Intel’s hand somewhat – with its driver (20.40) supporting 3 generations of graphics (Haswell, Broadwell and latest at the time Skylake).

However, the latest 21.45 driver for newly released Kabylake and older Skylake does bring new features that can make a big difference in performance:

  • Native FP64 (64-bit aka “double” floating-point support) in OpenCL – thus allowing high precision compute on integrated graphics.
  • Native FP16 (16-bit aka “half” floating-point support) in OpenCL, ComputeShader – thus allowing lower precision but faster compute.
  • Vulkan graphics interface support – OpenGL’s successor and DirectX 12’s competitor – for faster graphics and compute.

Will these new features make upgrading your laptop to a brand-new KBL laptop more compelling?

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Hardware Specifications

We are comparing the internal GPUs of the new Intel ULV APUs with the old versions.

Graphics Unit Haswell HD4000 Haswell HD5000 Broadwell HD6100 Skylake HD520 Skylake HD540 Kabylake HD620 Comment
Graphics Core EV7.5 HSW GT2U EV7.5 HSW GT3U EV8 BRW GT3U EV9 SKL GT2U EV9 SKL GT3eU EV9.5 KBL GT2U Despite 4 CPU generations we really have 2 GPU generations.
APU / Processor Core i5-4210U Core i7-4650U Core i7-5557U Core i7-6500U Core i5-6260U Core i3-7100U The naming convention has changed between generations.
Cores (CU) / Shaders (SP) / Type 20C / 160SP 40C / 320SP 48C / 384SP 24C / 192SP 48C / 384SP 23C / 184SP BRW increased CUs to 24/48 and i3 misses 1 core.
Speed (Min / Max / Turbo) MHz 200-1000 200-1100 300-1100 300-1000 300-950 300-1000 The turbo clocks have hardly changed between generations.
Power (TDP) W 15 15 28 15 15 15 Except GT3 BRW, all ULVs are 15W rated.
DirectX CS Support 11.1 11.1 11.1 11.2 / 12.1 11.2 / 12.1 11.2 / 12.1 SKL/KBL enjoy v11.2 and 12.1 support.
OpenGL CS Support 4.3 4.3 4.3 4.4 4.4 4.4 SKL/KBL provide v4.4 vs. version 4.3 for older devices.
OpenCL CS Support 1.2 1.2 1.2 2.0 2.0 2.1 SKL provides v2 support with KBL 2.1 vs 1.2 for older devices.
FP16 / FP64 Support No / No No / No No / No Yes / Yes Yes / Yes Yes / Yes SKL/KBL support both FP64 and FP16.
Byte / Integer Width 8 / 32-bit 8 / 32-bit 8 / 32-bit 128 / 128-bit 128 / 128-bit 128 / 128-bit SKL/KBL prefer vectorised integer workloads, 128-bit wide.
Float/ Double Width 32 / X-bit 32 / X-bit 32 / X-bit 32 / 64-bit 32 / 64-bit 32 / 64-bit Strangely neither arch prefers vectorised floating-point loads – driver bug?
Threads per CU 512 512 256 256 256 256 Strangely BRW and later reduced the threads/CU to 256.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX/OpenGL ComputeShader.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 288 399 597 875 [+3x] 1500 840 [+2.8x] If FP16 is enough, KBL and SKL have 2x performance of FP32.
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 299 375 614 468 [+56%] 817 452 [+50%] SKL GT3e rules the roost but KBL hardly improves on SKL.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 18.54 (eml) 24.4 (eml) 38.9 (eml) 112 [+6x] 193 104 [+5.6x] SKL GT2 with native FP64 is almost 4x emulated BRW GT3!
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.8 (eml) 2.36 (eml) 4.4 (eml) 6.34 (eml) [+3.5x] 10.92 (eml) 6.1 (eml) [+3.4x] Emulating FP128 through FP64 is ~2.5x faster than through FP32.
As expected native FP16 runs about 2x faster than FP32 and thus provides a huge performance upgrade if precision is sufficient. Native FP64 is about 8x emulated FP64 and even emulated FP128 improves by about 2.5x! Otherwise KBL GT2 matches SKL GT2 and is about 50% faster than HSW GT2 in FP32 and 6x faster in FP64.
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 1.37 1.85 2.7 2.19 [+60%] 3.36  2.21 [+60%] Since BRW integer performance is similar.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1.87 2.45 3.45 2.79 [+50%] 4.3 2.83 [+50%] Not a lot changes here.
SKL/KBL GT2 with integer workloads (with extensive memory accesses) are 50-60% faster than HSW similar to what we saw with floating-point performance. But the change happened with BRW which improved the most over HSW with SKL and KBL not improving further.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s)  1.2 1.62 4.35  3 [+2.5x] 5.12 2.92 In this tough compute test SKL/KBL are 2.5x faster.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2.86  3.93  9.82  6.7 [+2.34x]  11.26  6.49 With a lighter algorithm SKL/KBL are still ~2.4x faster.
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s)  0.828  1.08 1.68 1.08 [+30%] 1.85  1 64-bit integer performance does not improve much.
In pure integer compute tests SKL/KBL greatly improve over HSW, being no less than 2.5x faster, a huge improvement; but 64-bit integer performance hardly improves (30% faster with 20% more CUs, 24 vs 20). Again BRW is where the improvements were added with SKL GT3e hardly improving over BRW GT3.
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 461 495 493 656 [+42%]  772 618 [+40%] Pure FP32 compute SKL/KBL are 40% faster.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) 137  238 135 SKL GT3 is 73% faster than GT2 variants
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 62.45 85.76 123 86.32 [+38%]  145.6 82.8 [+35%] In this tough algorithm SKL/KBL are still almost 40% faster.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) 18.65 31.46 19 SKL GT3 is over 65% faster than GT2 KBL.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 106 160.4 192 174 [+64%] 295 166.4 [+56%] M/C is not as tough so here SKL/KBL are 60% faster.
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) 31.61 56 31 GT3 SKL manages an 80% improvement over GT2.
Intel is pulling our leg here; KBL GPU seems to show no improvement whatsoever over SKL, but both are about 40% faster in FP32 than the much older HSW. GT3 SKL variant shows good gains of 65-80% over the common GT2 and thus is the one to get if available. Obviously the ace card for SKL and KBL is FP64 support.
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS)  117  130 142 116 [=]  181 113 [=] SKL/KBL have a problem with this algorithm but GT3 does better?
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) 34.9 64.7 34.7 GT3 SKL manages a 86% improvement over GT2.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 13.3 13.1 15 20.53 [+54%]  27.3 21.9 [+64%] In a return to form SKL/KBL are 50% faster.
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) 5.2  4.19  4.69 GT3 stumbles a bit here; some optimisations are needed.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS)  122  157.9 249 201 [+64%]  304 177.6 [+45%] Here SKL/KBL are 50% faster overall.
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) 19.25 31.9 17.8 GT3 manages only a 65% improvement here.
Again we see no delta between SKL and KBL – the graphics cores perform the same; again both benefit from FP64 support allowing high precision kernels to run. GT3 SKL variant greatly improves over common GT2 variant – except in one test (DFFT) that seems to be an outlier.
GPGPU Image Processing Blur (3×3) Filter OpenCL (MPix/s)  341  432  636 492 [+44%]  641 488 [+43%] We see the GT3s trading blows in this integer test, but SKL/KBL are 40% faster than HSW.
GPGPU Image Processing Sharpen (5×5) Filter OpenCL (MPix/s)  72.7  92.8  147  106 [+45%]  139  106 [+45%] BRW GT3 just wins this with SKL/KBL again 45% faster.
GPGPU Image Processing Motion-Blur (7×7) Filter OpenCL (MPix/s)  75.6  96  152  110 [+45%]  149  111 [+45%] Another win for BRW and 45% improvement for SKL/KBL.
GPGPU Image Processing Edge Detection (2*5×5) Sobel OpenCL (MPix/s)  72.6  90.6  147  105 [+44%]  143  105 [+44%] As above in this test.
GPGPU Image Processing Noise Removal (5×5) Median OpenCL (MPix/s)  2.38  1.53  6.51  5.2 [+2.2x]  7.73  5.32 [+2.23x] SKL’s GT3 manages a win but overall SKL/KBL are over 2x faster than HSW.
GPGPU Image Processing Oil Painting Quantise OpenCL (MPix/s)  1.17  0.719  5.83  4.57 [+3.9x]  4.58  4.5 [+3.84x] Another win for BRW.
GPGPU Image Processing Diffusion Randomise OpenCL (MPix/s)  511  688  1150  1100 [+2.1x]  1750  1080 [+2.05x] SKL/KBL are over 2x faster than HSW. BRW is beaten here.
GPGPU Image Processing Marbling Perlin Noise 2D OpenCL (MPix/s)  378.5  288  424  437 [+15%]  611  443 [+17%] Some wild results here, some optimizations may be needed.
In these integer workloads (with texture access) the 28W GT3 of BRW manages a few wins over 15W GT3e of SKL – but compared to old HSW – both SKL and KBL are between 40 and 300% faster. Again we see no delta between SKL and KBL – there does not seem to be any difference at all!

If you have a HSW GT2 then an upgrade to SKL GT2 brings massive improvements as well as FP16 and FP64 native support. But HSW GT3 variant is competitive and BRW GT3 even more so. KBL GT2 shows no improvement over SKL GT2 – so it’s not just the CPU core that is unchanged but the graphics core also; this is no EV9.5, more like EV9.1!

For integer workloads BRW is where the big improvement came but for 64-bit integer that improvement is still to come, if ever. At least all drivers support native int64.
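The arithmetic table above reports FP128 as emulated (eml), with emulation through FP64 running about 2.5x faster than through FP32. The article does not say how the emulation is done. One common technique is "double-double" arithmetic, where a value is stored as an unevaluated sum of two doubles, giving roughly 106 significand bits (not a true IEEE binary128). A minimal sketch of its addition, built on Knuth's error-free `two_sum`:

```python
def two_sum(a, b):
    """Error-free transformation: returns (s, e) with a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalise so that |lo| is tiny relative to hi


one = (1.0, 0.0)
tiny = (1e-20, 0.0)
print("plain double: ", 1.0 + 1e-20)        # the small term is lost
print("double-double:", dd_add(one, tiny))  # the small term is kept in lo
```

Each emulated addition costs several native additions, so scores several times lower than native FP64 are plausible for this kind of scheme.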

Transcoding Performance

We are testing media (video + audio) transcoding performance for common video algorithms: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (April 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
H.264/AVC Decoder/Encoder QuickSync H264 8-bit only QuickSync H264 8-bit only QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit QuickSync H264 8/10-bit HSW supports 8-bit only so 10-bit (high-colour) are out of luck.
H.265/HEVC Decoder/Encoder QuickSync H265 8-bit partial QuickSync H265 8-bit QuickSync H265 8-bit QuickSync H265 8/10-bit SKL has full/hardware H265/HEVC transcoding but for 8-bit only; Main10 (10-bit profile) requires KBL so finally we see a difference.
Transcode Benchmark VC 1 > H264/AVC Transcoding (MB/s)  7.55 8.4  7.42 [-2%]  8.25  8.08 [+6%] With DDR4 KBL is 6% faster.
Transcode Benchmark VC 1 > H265/HEVC Transcoding (MB/s)  0.734  3.14 [+4.2x]  3.67  3.63 [+5x] Hardware support makes SKL/KBL 4-5x faster.

If you want HEVC/H.265 then you want SKL including 4k/UHD. But if you plan on using 10-bit/HDR colour then you need KBL – finally an improvement over SKL. As it uses fixed-point hardware the GT3 performs only slightly faster.

Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX/OpenGL ComputeShader,  including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Memory Configuration 8GB DDR3 1.6GHz 128-bit 8GB DDR3 1.6GHz 128-bit 16GB DDR3 1.6GHz 128-bit 8GB DDR3 1.867GHz 128-bit 16GB DDR4 2.133GHz 128-bit 16GB DDR4 2.133GHz 128-bit All use 128-bit memory with SKL/KBL using DDR4.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 64 2048 / 64 2048 / 64 2048 / 64 Shared memory remains the same; in SKL/KBL constant memory is the same as global.
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 10.4 10.7 11 15.65 23 [+2.1x] 19.6 DDR4 seems to provide over 2x bandwidth despite low clock.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 5.23 5.35 5.54 7.74 11.23 [+2.1x] 9.46 Again over 2x increase in up speed.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 5.27 5.36 5.29 7.42 11.31 [+2.1x] 9.6 Again over 2x increase in down speed.
SKL/KBL + DDR4 provide over 2x increase in internal, up and down memory bandwidth – despite the relatively modest increase in memory speed (2133 vs 1600); with DDR3 1867MHz memory the improvement drops to 1.5x. So if you were to decide DDR3 or DDR4 the choice has been made!
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns)  179 192  234 [+30%]  296 235 [+30%] With DDR4 latency has increased by 30%, which is not great.
GPGPU Memory Latency Constant Memory Latency (ns)  92.5  112  234 [+2.53x]  279  235 [+2.53x] Constant memory has effectively been dropped resulting in a disastrous 2.53x higher latencies.
GPGPU Memory Latency Shared Memory Latency (ns)  80  84  –  86.8 [+8%]  102  84.6 [+8%] Shared memory latency has stayed the same.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns)  283  298  56 [1/5x]  58.1 [1/5x] Texture access seems to have markedly improved to be 5x faster.
SKL/KBL global memory latencies have increased by 30% with DDR4 – thus wiping out some gains. The “new” constant memory (2GB!) is now really just bog-standard global memory and thus with over 2x increase in latency. Shared memory latency has stayed pretty much the same. Texture memory access is very much faster – 5x faster likely through some driver optimisations.

Again no delta between KBL and SKL; if you want bandwidth (who doesn’t?) DDR4 with modest 2133MHz memory doubles bandwidths – but latencies increase. Constant memory is now the same as global memory and does not seem any faster.
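To put the measured internal bandwidth into context, it helps to compare it with the theoretical peak of each memory configuration. The sketch below assumes the "GHz" figures in the Memory Configuration row are effective transfer rates in MT/s and that the whole 128-bit bus is usable; the measured values are the internal bandwidth scores from the table above.

```python
def peak_bandwidth_gbs(bus_bits, mega_transfers):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return (bus_bits / 8) * mega_transfers * 1e6 / 1e9


configs = (
    ("DDR3-1600 (HD4000)", 1600, 10.4),
    ("DDR3-1867 (HD520)", 1867, 15.65),
    ("DDR4-2133 (HD540)", 2133, 23.0),
)
for label, rate, measured in configs:
    peak = peak_bandwidth_gbs(128, rate)
    print(f"{label}: peak {peak:.1f} GB/s, measured {measured / peak:.0%} of peak")
```

The peak rises by only about a third from 1600 to 2133 MT/s while the measured figure more than doubles, so under these assumptions most of the gain comes from better utilisation of the bus, and not from the data rate alone. The GT3e part's embedded cache may also contribute.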

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL as well as memory bandwidth performance.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers (Apr 2017). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors HD4000 (EV7.5 HSW-GT2U) HD5000 (EV7.5 HSW-GT3U) HD6100 (EV8 BRW-GT3U) HD520 (EV9 SKL-GT2U) HD540 (EV9 SKL-GT3eU) HD620 (EV9.5 KBL-GT2U) Comments
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 250  119 602 [+2.4x] 1000 537 [+2.1x] FP16 support in DirectX doubles performance.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 235  109 338 [+43%]  496 289 [+23%] FP16 does not yet work in OpenGL.
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s)  238  120 276 [+16%]  485 248 [+4%] We only see a measly 4-16% better performance here.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 228  108 338 [+48%] 498 289 [+26%] SKL does better here – it’s 50% faster than HSW.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 52.4  78 76.7 [+46%] 133 69 [+30%] With FP64 SKL is still 45% faster.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 63.2  67.2 105 [+60%] 177 96 [+50%] Similar result here 50-60% faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 5.2  7 18.2 [+3.5x] 31.3 16.7 [+3.2x] Driver optimisation makes SKL/KBL over 3.5x faster.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 5.55  7.5 57.5 [+10x]  97.7 52.3 [+9.4x] Here we see SKL/KBL over 10x faster!
We see similar results to OpenCL GPGPU here – with FP16 doubling performance in DirectX – but with FP64 already supported in both DirectX and OpenGL even with HSW, KBL and SKL have less of a lead – of around 50%.
Video Memory Benchmark Internal Memory Bandwidth (GB/s)  15  14.8 27.6 [+84%] 26.9 25 [+67%] DDR4 brings almost 50% more bandwidth.
Video Memory Benchmark Upload Bandwidth (GB/s)  7  7.8 10.1 [+44%] 12.34 10.54 [+50%] Upload bandwidth has also increased ~50%.
Video Memory Benchmark Download Bandwidth (GB/s)  3.63  3.3 3.53 [-2%] 5.66 3.51 [-3%] No change in download bandwidth though.

Final Thoughts / Conclusions

SKL and KBL with the 21.45 driver yield significant gains in OpenCL making an upgrade from HSW and even BRW quite compelling despite the relatively modern 20.40 driver Intel was forced to provide for Windows 10. The GT3 version provides good gains over the standard GT2 version and should always be selected if available.

Native FP64 support is a huge addition which provides support for high-precision kernels – unheard of for integrated graphics. Native FP16 support provides an additional 2x performance in cases where 16-bit floating-point processing is sufficient.

However KBL’s EV9.5 graphics core shows no improvement at all over SKL’s EV9 core – thus it’s not just the CPU core that has not been changed but the GPU core too! Except for the updated transcoder supporting Main10 HEVC/H.265 (thus HDR / 10-bit+ colour) which is still quite useful for UHD/4K HDR media.

This is very much a surprise – as while the CPU core has not improved markedly since SNB (Core v2), the GPU core has always provided significant improvements – and now we have hit the same road-block. As dedicated GPUs have continued to improve significantly in performance and power efficiency this is quite a surprise. This marks the smallest generation-to-generation improvement ever (SKL to KBL); effectively KBL is a SKL refresh.

It seems the rumour that Intel may change to ATI/AMD graphics cores may not be such a crazy idea after all!

SiSoftware OpenCL Support Released

GPGPU Arithmetic Benchmark

FOR IMMEDIATE RELEASE

Contact: Press Office


London, UK, 30th November 2009 – SiSoftware releases its suite of OpenCL GPGPU (General Purpose Graphics Processor Unit) benchmarks as part of SiSoftware Sandra 2010, the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, and networks.

At SiSoftware we are constantly looking out for new technologies with the aim to understand how those technologies can best be benchmarked and analysed. We believe that the industry is seeing a shift from the model where heavy computational workload is processed on a traditional CPU to a model that uses the GPGPU or a combination of GPU and CPU; in a wide range of applications developers are using the power of GPGPU to aid business analysis, games, graphics and scientific applications.

As certain tasks or workloads may still perform better on traditional CPU, we see both CPU and GPGPU benchmarks to be an important part of performance analysis. Having launched the GPGPU Benchmarks with SiSoftware Sandra 2009 with support for AMD CTM/STREAM and nVidia CUDA, we have now ported the benchmark suite to OpenCL.

OpenCL is an open standard for running parallel tasks on GPUs, CPUs and hardware accelerators using the same code, unlike proprietary solutions. We believe OpenCL will become “the standard” for programming parallel workloads in the future, thus we have ported all our GPGPU benchmarks to OpenCL.

Below is a quote we would like to share with you:

“AMD believes OpenCL is what the industry has been waiting for: an industry-standard, cross-platform development platform designed to allow developers to harness the immense computational power available in today’s GPUs and multi-core CPUs. We’ve been a staunch supporter of and contributor to OpenCL since its inception,” said Patricia Harrell, director of Stream Computing, AMD. “SiSoftware has made significant contributions to the OpenCL ecosystem with the release of its GPGPU benchmark suite with OpenCL support. This benchmark suite enables customers, partners and OpenCL developers to easily measure application performance on heterogeneous platforms, and provides the information required to help optimize this performance.”

The SiSoftware OpenCL Benchmarks look at the two major performance aspects:

  • Computational performance: in simple terms how fast it can crunch numbers. It follows the same style as the CPU Multi-Media benchmark using fractal generation as its workload. This allows the user to see the power of the GPGPU in solving a workload thus far exclusively performed on a CPU.
  • Memory performance: this analyses how fast data can be transferred to and from the GPGPU. No matter how fast the processing, ultimately the end result will be affected by memory performance.
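The two aspects above can be illustrated with a minimal sketch. This is not the Sandra benchmark code and it does not run on a GPU: it is a plain-Python, CPU-only stand-in that assumes a Mandelbrot set as the fractal workload and uses an in-memory buffer copy in place of a host-to-device transfer. All function names and parameters here are hypothetical and chosen only for illustration.

```python
import time


def mandelbrot_iterations(cx, cy, max_iter=64):
    """Return how many iterations the point (cx, cy) survives before escaping."""
    zx = zy = 0.0
    for i in range(max_iter):
        # z = z^2 + c, written out on real and imaginary parts
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return i
    return max_iter


def compute_score(width=64, height=64, max_iter=64):
    """Time one fractal render; return (pixels per second, checksum)."""
    start = time.perf_counter()
    checksum = 0
    for py in range(height):
        cy = -1.5 + 3.0 * py / height
        for px in range(width):
            cx = -2.0 + 3.0 * px / width
            checksum += mandelbrot_iterations(cx, cy, max_iter)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return (width * height) / elapsed, checksum


def memory_score(size_bytes=8 * 1024 * 1024, repeats=5):
    """Time repeated buffer copies; return bytes copied per second."""
    src = bytearray(size_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = bytes(src)  # assumption: stands in for a host-to-device transfer
        assert len(dst) == size_bytes
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (size_bytes * repeats) / elapsed


if __name__ == "__main__":
    pixels_per_s, _ = compute_score()
    print(f"compute: {pixels_per_s:,.0f} pixels/s")
    print(f"memory:  {memory_score() / 1e6:,.0f} MB/s")
```

The checksum is returned so that two runs can be compared for correctness as well as speed; a real GPGPU benchmark would dispatch the per-pixel work as a parallel kernel and time actual device transfers instead.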

Key features

  • 4 architectures natively supported (x86, x64/AMD64/EM64T, IA64/Itanium2, ARM)
  • 6 languages supported (English, French³, German³, Italian³, Japanese³, Russian³)
  • AMD OpenCL 1.0¹
  • nVidia OpenCL 1.0
  • GPU + CPU parallel execution supported, up to 8 devices in total.
  • Different models of GPUs supported, including integrated GPU + dedicated GPUs.
  • Multi-GPUs supported, up to 8 in parallel.

With each release, we continue to add support and compatibility for the latest technologies. SiSoftware works with hardware vendors to ensure the best support for new emerging hardware.

Notes:

1 Available as Beta at this time, performance cannot be guaranteed.

2 By special arrangement; Enterprise versions only.

3 Not all languages available at publication, will be released later.

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please contact your distributor (sales support).

Downloading

For more details, and to download the Lite version, please click here.

Reviewers and Editors

For your free review copies, please contact us.

About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Nearly 700 worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Over 9,000 on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, DirectCompute, x64, ARM, MIPS, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX3, AVX2, AVX, FMA4, FMA, NEON, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit http://www.sisoftware.net, http://www.sisoftware.eu, http://www.sisoftware.info or http://www.sisoftware.co.uk