AMD Radeon 5700XT: Navi GPGPU Performance in OpenCL

What is “Navi”?

It is the code-name of the new AMD GPU, the first of the brand-new RDNA (Radeon DNA) GPU arch(itecture) – replacing the “Vega” that was the last of the GCN (graphics core next) arch(itecture). It is a mid-range GPU optimised for gaming thus not expected to set records, but GPUs today are used for many other tasks (mining, encoding, algorithm/compute acceleration, etc.) as well.

The RDNA arch brings big changes from the various GCN revisions we’ve seen previously, but its first iteration here does not bring any major new features, at least in the compute domain. Hopefully the next versions will bring tensor units (matrix multipliers) and other accelerated instruction sets.


Hardware Specifications

We are comparing the middle-range Radeon with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications | AMD Radeon 5700XT (Navi) | AMD Radeon VII (Vega2) | nVidia Titan X (Pascal) | AMD Radeon 56 (Vega1) | Comments
Arch Chipset | RDNA / Navi 10 | GCN5.1 / Vega 20 | Pascal / GP102 | GCN5.0 / Vega 10 | The first of the Navi chips.
Cores (CU) / Threads (SP) | 40 / 2560 | 60 / 3840 | 28 / 3584 | 56 / 3584 | Fewer CUs than Vega1 but the same 64 SPs per CU.
SIMD per CU / Width | 2 / 32 [2x] | 4 / 16 | n/a | 4 / 16 | Navi doubles the SIMD width but halves the count.
Wave/Warp Size | 32 [1/2x] | 64 | 32 | 64 | Wave size is halved to match nVidia.
Speed (Min-Turbo) | 1.6 / 1.755 | 1.4 / 1.75 | 1.531 / 1.91 | 1.156 / 1.471 | Base clock is ~40% higher and turbo ~20% higher than Vega1.
Power (TDP) | 225W | 295W | 250W | 210W | Slightly higher TDP than Vega1 but nothing significant.
ROP / TMU | 64 / 160 | 64 / 240 | 96 / 224 | 64 / 224 | ROPs are the same but we see ~30% fewer TMUs.
Shared Memory | 64kB [+2x] | 32kB | 48kB / 96kB per SM | 32kB | We have 2x more shared memory, allowing bigger kernels.
Constant Memory | 4GB | 8GB | 64kB dedicated | 4GB | No dedicated constant memory but large.
Global Memory | 8GB GDDR6 14Gt/s 256-bit | 16GB HBM2 2Gt/s 4096-bit | 12GB GDDR5X 10Gt/s 384-bit | 8GB HBM2 1.6Gt/s 2048-bit | Sadly no HBM this time; GDDR6 is faster per pin but the bus is not very wide.
Memory Bandwidth (GB/s) | 448 [+9%] | 1024 | 512 | 410 | Bandwidth is still 9% higher than Vega1.
L1 Caches | ? x 40 | 16kB x 60 | 48kB x 28 | 16kB x 56 | L1 does not appear to have changed, but this is unclear.
L2 Cache | 4MB | 4MB | 3MB | 4MB | L2 has not changed.
Maximum Work-group Size | 1024 / 1024 | 256 / 1024 | 1024 / 2048 per SM | 256 / 1024 | AMD has unlocked work-group sizes, now 4x larger.
FP64/double ratio | 1/16x | 1/4x | 1/32x | 1/16x | Ratio is the same as consumer Vega1 rather than pro Vega2.
FP16/half ratio | 2x | 2x | 1/64x | 2x | Ratio is the same across the AMD cards.
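
The memory bandwidth figures in the table follow from the bus width and the per-pin transfer rate. Here is a minimal sketch of that arithmetic, assuming the usual peak formula (bus width in bits times transfer rate in GT/s, divided by 8 bits per byte). The function name is ours and purely illustrative, and the Radeon VII rate of 2 GT/s is the "2Gbps" figure quoted in the Vega2 article's specification table.

```python
def peak_bandwidth_gbs(bus_width_bits: int, rate_gtps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    Assumption: every pin transfers `rate_gtps` giga-transfers per second
    and each transfer carries one bit per pin.
    """
    return bus_width_bits * rate_gtps / 8


# Radeon 5700XT: GDDR6 at 14 GT/s on a 256-bit bus
navi = peak_bandwidth_gbs(256, 14.0)

# Radeon VII: HBM2 at 2 GT/s on a 4096-bit bus (rate assumed, see above)
vega2 = peak_bandwidth_gbs(4096, 2.0)

print(f"Radeon 5700XT: {navi:.0f} GB/s")
print(f"Radeon VII:    {vega2:.0f} GB/s")
```

The narrow 256-bit bus is why very fast GDDR6 still lands at less than half the bandwidth of the wide HBM2 interface.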

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 18,265 [-7%] 29,057 245 19,580 Navi starts well but cannot beat Vega1.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 11,863 [-13%] 17,991 17,870 13,550 Standard FP32 increases the gap to 13%.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 1,047 [-16%] 5,031 661 1,240 FP64 does not change much, Navi is 16% slower.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 43 [-45%] 226 25 77 Emulated FP128 is hard on FP64 units and here Navi is almost 1/2 Vega1.
To start with, Navi cannot beat Vega1 in heavy vectorised compute loads: FP16 is the most efficient (within 7%, almost parity) while complex emulated FP128 is almost 2x slower.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 51 [-25%] 91 42 67 Despite more bandwidth Navi is 25% slower than Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 176 [+40%] 209 145 125 Navi shows its power here beating Vega1 by a huge 40%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 76 32
Despite the higher bandwidth of GDDR6, streaming algorithms work better on “old” HBM2, thus Navi cannot beat Vega. But in pure integer compute algorithms like hashing it is significantly faster, which bodes well for the future.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 12,459 [+31%] 23,164 11,480 9,500 In this FP32 financial workload Navi is 30% faster than Vega1!
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 1,370 1,880
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 850 [1/3x] 3,501 2,240 2,530 Binomial uses thread shared data thus stresses the memory system and here we have some optimisation to do.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 129 164
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,027 [+30%] 6,249 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Navi is again 30% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 294 472
For financial FP32 workloads, Navi is ~30% faster than Vega1 in Black-Scholes and Monte-Carlo, a pretty good improvement, though it naturally cannot compete with Vega2, especially in FP64 given its consumer multiplier (1/16x). Crypto-currency fans will love Navi.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 5,165 [+2%] 6,634 6,073 5,066 GEMM can only bring a measly 2% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 340 620
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 376 [+2%] 643 235 369 FFT loves HBM but Navi is still 2% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 207 175
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 4,534 [-6%] 6,846 5,720 4,840 Navi can’t manage as well in N-Body and ends up 6% slower.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 275 447
The scientific scores don’t show the same improvement as the financial ones likely due to heavy use of shared memory with Navi just matching Vega1. Perhaps the larger shared memory can allow us to use larger workgroups.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 8,674 [1/2.2x] 25,418 18,410 19,130 In this 3×3 convolution algorithm, Navi is less than 1/2x the speed of Vega1.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,734 [1/2.5x] 5,275 5,000 4,340 Same algorithm but more shared data makes Navi even slower.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 1,802 [1/2.5x] 5,510 5,080 4,450 With even more data the gap remains at 1/2.5x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,723 [1/2.5x] 5,273 4,800 4,300 Still convolution but with 2 filters – same 1/2.5x performance.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 48.44 [=] 92.53 37 48 Different algorithm allows Navi to tie with Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 97.34 [+2.5x] 57.66 12.7 38 Without major processing, this filter performs well on Navi.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 32,050 [+1.5x] 47,349 19,480 20,880 This algorithm is 64-bit integer heavy and Navi is 50% faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 9,516 [+1.6x] 7,708 305 6,000 One of the most complex and largest filters, Navi is almost 60% faster.
For image processing using FP32 precision, Navi goes from 1/2.5x Vega1 performance (convolution) to 50% faster (complex algorithms with integer processing). It seems some optimisations are needed for the convolution algorithms.
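
The bracketed figures in the tables compare each Navi score with the Vega1 baseline. The sketch below shows one way such labels could be derived. The function name and the 1.5x threshold for switching from a percentage to a multiplier are our assumptions for illustration, not a documented rule of the benchmark.

```python
def relative_score(score: float, baseline: float) -> str:
    """Format a higher-is-better score relative to a baseline.

    Assumption: differences of 1.5x or more are shown as multipliers,
    smaller differences as signed percentages.
    """
    ratio = score / baseline
    if ratio >= 1.5:
        return f"[+{ratio:.1f}x]"
    if ratio <= 1 / 1.5:
        return f"[1/{1 / ratio:.1f}x]"
    return f"[{(ratio - 1) * 100:+.0f}%]"


# Navi vs Vega1 scores taken from the tables above
print("Black-Scholes FP32:", relative_score(12459, 9500))
print("Mandel FP16:       ", relative_score(18265, 19580))
print("Sharpen (5x5):     ", relative_score(1734, 4340))
print("Diffusion:         ", relative_score(32050, 20880))
```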

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 376 [+13%] 627 356 333 Navi’s GDDR6 manages 13% more bandwidth than Vega1.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 21.56 [+77%] 12.37 11.4 12.18 PCIe 4.0 brings almost 80% more bandwidth
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 22.28 [+84%] 12.95 12.2 12.08 Again almost 2x more bandwidth.
Navi’s PCIe 4.0 interface (on 500-series motherboards) brings as expected almost 2x more upload/download bandwidth while its high-clocked GDDR6 manages just over 10% higher bandwidth over HBM2.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 276 [+12%] 202 201 247 Navi’s GDDR6 brings a slight latency increase (~12%).
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 286 353
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87.6 80
Not unexpectedly, GDDR6’s latencies are higher than HBM2’s, although not by as much as we were fearing.
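
To put the upload/download results in context, the measured figures can be compared with the theoretical link rate. This is a rough sketch assuming an x16 link and the 128b/130b line encoding used by PCIe 3.0 (8 GT/s per lane) and PCIe 4.0 (16 GT/s per lane). The x16 width is our assumption and the function name is illustrative.

```python
def pcie_peak_gbs(gt_per_s: float, lanes: int = 16) -> float:
    """Peak one-way PCIe bandwidth in GB/s (128b/130b encoding)."""
    return gt_per_s * lanes * (128 / 130) / 8


# (label, per-lane rate in GT/s, measured upload in GB/s from the table)
links = [
    ("PCIe 3.0, Vega1", 8.0, 12.18),
    ("PCIe 4.0, Navi", 16.0, 21.56),
]

for label, rate, measured in links:
    peak = pcie_peak_gbs(rate)
    print(f"{label}: peak {peak:.2f} GB/s, "
          f"measured {measured} GB/s ({measured / peak:.0%} of peak)")
```

Under these assumptions neither card saturates its link, and the PCIe 4.0 card achieves a somewhat lower share of its peak, which is consistent with the upload gain being closer to 80% than a full 2x.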


Final Thoughts / Conclusions

“Navi” is an interesting chip to be sure and perhaps more was expected of it; as always the drivers are the weak link and it is hard to determine which issues will be fixed driver-side and which will need to be optimised in compute kernels.

Thus performance-wise it oscillates between 1/2x and 1.5x Vega1 performance (from half the speed to 50% faster) depending on algorithm, with compute-heavy algorithms (especially crypto-currencies) doing best and shared/local memory heavy algorithms doing worst. The 2x bigger shared memory (64kB vs 32kB) in conjunction with the larger work-group sizes (1024 vs 256 by default) do present future optimisation opportunities. AMD has also reduced the warp/wave size to match nVidia, a historic change.
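
As an illustration of that optimisation opportunity, the sketch below estimates the largest square tile a convolution kernel could stage in shared memory. It is a simplified model under our own assumptions (one work-item per output pixel, square tiles, one FP32 value per pixel, a halo of half the filter width on each side) and does not describe how the benchmark kernels are actually written.

```python
import math


def max_tile_side(local_mem_bytes: int, max_work_group: int,
                  filter_size: int, bytes_per_pixel: int = 4) -> int:
    """Largest square output tile under both hardware limits."""
    halo = filter_size // 2
    # The padded tile (output tile plus halo) must fit in shared memory
    padded_side = math.isqrt(local_mem_bytes // bytes_per_pixel)
    by_memory = padded_side - 2 * halo
    # One work-item per output pixel must fit in the work-group
    by_work_group = math.isqrt(max_work_group)
    return min(by_memory, by_work_group)


def halo_overhead(tile_side: int, filter_size: int) -> float:
    """Pixels loaded into shared memory per output pixel produced."""
    padded = tile_side + 2 * (filter_size // 2)
    return (padded * padded) / (tile_side * tile_side)


for name, local_kb, work_group in [("Vega1", 32, 256), ("Navi", 64, 1024)]:
    side = max_tile_side(local_kb * 1024, work_group, filter_size=5)
    print(f"{name}: {side}x{side} tile, "
          f"{halo_overhead(side, 5):.2f} loads per output pixel")
```

In this model the work-group limit, not the shared memory size, is what bounds the tile on both cards, so the larger work-groups are the change that would let a 5×5 filter reuse more of each loaded pixel.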

Memory wise, the cost-cutting change from HBM2 to even high-speed GDDR6 does bring more bandwidth but naturally higher latencies – but PCIe 4.0 doubles upload/download bandwidths which will become much more important on higher capacity (16GB+) cards in the future.

Overall it is hard to recommend it for compute workloads unless the particular algorithm (crypto, financial) does well on Navi, otherwise the much older Vega1 56/64 offer better performance/cost ratio especially today. However, as drivers mature and implementations are optimised for it, Navi is likely to start to perform better.

We are looking forward to the next iterations of Navi, especially the rumoured “big Navi” version optimised for compute…

AMD Radeon VII: Vega2 GPGPU Performance in OpenCL

What is “Vega2”?

It is the code-name of the updated “Vega” GPU arch(itecture), the last of the GCN (graphics core next) arch (version 5.1), shrunk to 7nm before being replaced by the forthcoming “Navi”. Originally for the professional/workstation high-end market, “Vega2”/“big Vega”, designed for compute (scientific, machine learning, etc.) workloads, was pressed into service to battle the latest 2000-series “Turing”/RTX competition.

As a result it contains many high-end features not normally found on consumer cards:

  • 1/4 FP64 rate (instead of 1/16 or worse)
  • 16GB HBM2 memory (instead of 8-12GB)
  • 4096-bit HBM2 memory with 1TB/s bandwidth (instead of 400-500GB/s)
  • Int8/Int4 support for AI/ML workloads
  • PCIe 4.0 capable but not enabled at this time


Hardware Specifications

We are comparing the high-end Radeon with previous generation cards and competing architectures with a view to upgrading to a high performance design.

GPGPU Specifications | AMD Radeon VII (Vega2) | nVidia Titan V (Volta) | nVidia Titan X (Pascal) | AMD Vega 56 (Vega1) | Comments
Arch Chipset | Vega 20 / GCN 5.1 | GV100 / 7.0 | GP102 / 6.1 | Vega 10 / GCN 5.0 | A minor revision of Vega1.
Cores (CU) / Threads (SP) | 60 / 3840 | 80 / 5120 | 28 / 3584 | 56 / 3584 | More CUs than Vega 56 but not the full 64.
SIMD per CU / Width | 4 / 16 | n/a | n/a | 4 / 16 | Naturally the same SIMD count and width.
Wave/Warp Size | 64 | 32 | 32 | 64 | Wave size has always been 2x nVidia’s.
Speed (Min-Turbo) | 1.4 – 1.750 [+21%] | (135-1455) | 1.531 (139-1910) | 1.156 – 1.471 | Base and turbo clocks are both ~20% higher.
Power (TDP) | 300W [+42%] | 300W | 250W | 210W | TDP has gone up by over 40%.
ROP / TMU | 64 / 256 | 96 / 320 | 96 / 224 | 64 / 224 | ROPs are unchanged.
Shared Memory | 32kB | 48 / 96kB | 48 / 96kB | 32kB | No shared memory change.
Constant Memory | 8GB | 64kB | 64kB | 4GB | No dedicated constant memory but large.
Global Memory | 16GB HBM2 2Gbps 4096-bit | 12GB HBM2 2x850Mbps 3072-bit | 12GB GDDR5X 10Gbps 384-bit | 8GB HBM2 1.6Gbps 2048-bit | 2x as big and 2x as wide HBM2, a huge improvement.
Memory Bandwidth (GB/s) | 1000 [+2.4x] | 652 | 512 | 410 | Bandwidth is 2.4x higher than Vega1.
L1 Caches | 16kB x 60 | 96kB x 80 | 48kB x 28 | 16kB x 56 | L1 has not changed.
L2 Cache | 4MB | 4.5MB | 3MB | 4MB | L2 has not changed.
Maximum Work-group Size | 256 / 1024 | 1024 / 2048 | 1024 / 2048 | 256 / 1024 | Same work-group sizes.
FP64/double ratio | 1/4x | 1/2x | 1/32x | 1/16x | Ratio is 4x better than Vega1.
FP16/half ratio | 2x | 2x | 1/64x | 2x | Ratio is the same as Vega1.
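
The effect of the FP64 ratio can be estimated from the table values. This is a back-of-the-envelope sketch assuming 2 floating-point operations per stream processor per clock (one fused multiply-add) at the turbo clock. That assumption and the function name are ours, and real kernels will land below these peaks.

```python
def peak_gflops(stream_processors: int, clock_ghz: float,
                rate: float = 1.0) -> float:
    """Theoretical peak GFLOPS.

    Assumption: each SP retires one fused multiply-add (2 FLOPs) per clock.
    `rate` scales FP32 throughput to another precision (e.g. 1/4 for FP64).
    """
    return stream_processors * 2 * clock_ghz * rate


# SP counts, turbo clocks and FP64 ratios taken from the table above
vega2_fp64 = peak_gflops(3840, 1.750, rate=1 / 4)
vega1_fp64 = peak_gflops(3584, 1.471, rate=1 / 16)

print(f"Vega2 peak FP64: {vega2_fp64:.0f} GFLOPS")
print(f"Vega1 peak FP64: {vega1_fp64:.0f} GFLOPS")
print(f"Theoretical advantage: {vega2_fp64 / vega1_fp64:.1f}x")
```

The theoretical advantage combines the 4x better ratio with the extra CUs and higher clock, so measured FP64 gains of around 4x sit plausibly below this ceiling.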

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 29,057 [+48%] 33,860 245 19,580 Vega2 starts strong with a 48% lead over Vega1 and almost catching Volta.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 18,340 [+35%] 22,680 17,870 13,550 Good improvement here +35% over Vega1 again close to Volta.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 5,377 [+4.3x] 11,000 661 1,240 1/4 FP64 rate makes it over four times (4x) faster than Vega1.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 234 [+3x] 458 25.77 77 Similar to above, Vega2 is over three times (3x) faster.
Vega2 looks about 35-50% faster than Vega1 in FP32/FP16 and 3-4x faster in FP64 due to its 1/4 FP64 rate. It won’t beat real workstation cards with 1/2 FP64 rate though, thus the Titan V has nothing to worry about.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 91 [+36%] 70 42 67 The fast HBM2 memory allows it to beat even Volta not just Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 93 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 209 [+67%] 245 145 125 Vega2 is a huge 70% faster in integer/crypto workloads.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 129 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 176 76 32
Vega2 increases its lead in integer workloads even streaming ones no doubt due to its very fast HBM2 memory making it the crypto-king of the hill though its cost may be an issue.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 23,164 [+2.4x] 18,570 11,480 9,500 Vega2 is over 2x faster than Vega1, also beating Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 [+3.87x] 8,400 1,370 1,880 In FP64 it’s almost 4x faster, just below Volta!
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 3,501 [+38%] 4,200 2,240 2,530 Binomial uses thread shared data and thus stresses the memory system; Vega2 is still 40% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 [+4.8x] 2,000 129 164 With FP64 we’re almost 5x faster than Vega1.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 6,249 [+62%] 11,920 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Vega2 is 60% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 [+3.55x] 4,440 294 472 With FP64 we’re over 3.5x faster.
For financial FP32 workloads, Vega2 is 40-60% faster than Vega1 in Binomial and Monte-Carlo (and over 2x faster in Black-Scholes), a decent improvement; naturally in FP64 it’s 3.5-5x faster, thus a significant upgrade for algorithms that require such precision.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,634 [+30%] 11,000 6,073 5,066 GEMM still brings a 30% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 [+3.77x] 3,830 340 620 But DGEMM is almost 4x faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 643 [+74%] 617 235 369 FFT loves HBM thus Vega2 is 75% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 [+2.1x] 280 207 175 DFFT is tough but Vega2 is still twice as fast.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,846 [+41%] 7,790 5,720 4,840 In N-Body physics Vega2 is 40% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 [+3.9x] 4,270 275 447 And in FP64 physics Vega2 is almost 4x faster.
The scientific scores show a similar improvement, with FP32 30-75% better but FP64 up to four times (4x) faster than Vega1 and, in some algorithms, matching the hugely expensive Volta.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 25,418 [+32%] 26,790 18,410 19,130 In this 3×3 convolution algorithm, Vega2 is 32% faster than Vega1
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 5,275 [+21%] 9,295 5,000 4,340 Same algorithm but more shared data reduces the lead to 21%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 5,510 [+24%] 9,428 5,080 4,450 With even more data the gap remains constant.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 5,273 [+23%] 9,079 4,800 4,300 Still convolution but with 2 filters – similar 23% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 92 [+91%] 112 37 48 Different algorithm makes Vega2 almost 2x faster than Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 57 [+50%] 42 12.7 38 Without major processing, this filter is 50% faster on Vega2.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 47,349 [+2.3x] 24,370 19,480 20,880 This algorithm is 64-bit integer heavy and Vega2 flies 2x faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 7,708 [+28%] 8,460 305 6,000 One of the most complex and largest filters, Vega2 is 28% faster.
For image processing using FP32 precision, Vega2 goes from 21% to over 2x faster than Vega1, overall a decent improvement if you are processing a large number of images. In many filters it beats the far more expensive Volta competition.

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 627 [+88%] 536 356 333 Vega2’s wide HBM2 is almost 2x faster as expected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 12.37 [+2%] 11.47 11.4 12.18 Using PCIe 3.0 similar upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.95 [+7%] 12.27 12.2 12.08 Again similar bandwidth.
Vega2 benefits greatly from its very wide HBM2 memory (4096-bit) which provides almost 2x real bandwidth as expected. But while PCIe 4.0 capable, for now it has to make do with 3.0 and thus the same upload/download bandwidth. Here’s hoping for a BIOS update once new motherboards come out.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 202 [-19%] 180 201 247 The higher clock allows Vega2 a 20% latency reduction.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 [-4%] 311 286 353 Full range is only 4% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53.4 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75.4 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.1 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88.5 87.6 80
As expected, the higher-clocked HBM2 gives Vega2 lower global memory latencies than Vega1, though the gain shrinks from about 20% in-page to just 4% across the full range.
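
One way to read the internal bandwidth results is as a fraction of each card's rated bandwidth. The short sketch below does that division using the measured and rated figures from the tables in this article. The helper name is ours, and the rated values are the table's own numbers rather than independently verified specifications.

```python
def efficiency(measured_gbs: float, rated_gbs: float) -> float:
    """Measured internal bandwidth as a fraction of the rated figure."""
    return measured_gbs / rated_gbs


# (measured internal bandwidth, rated bandwidth) in GB/s, from the tables
cards = {
    "Radeon VII (Vega2)": (627, 1000),
    "Titan V (Volta)": (536, 652),
    "Titan X (Pascal)": (356, 512),
    "Vega 56 (Vega1)": (333, 410),
}

for name, (measured, rated) in cards.items():
    print(f"{name}: {efficiency(measured, rated):.0%} of rated bandwidth")
```

On these numbers Vega2 delivers by far the most absolute bandwidth yet the smallest share of its rated figure, which may help explain why measured gains sit nearer 2x than the 2.4x the specification suggests.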


Final Thoughts / Conclusions

Vega2 (“BigVega”) is a big improvement over normal Vega1 and its workstation-class pedigree shows. For FP16/FP32 workloads, though, the 30-40% performance improvement may not be worth it considering the much higher price; naturally FP64 performance is almost 4x higher due to the 1/4 FP64 rate, though not as good as professional cards or the Titan competition with their 1/2 rate.

While the GCN core (rev 5.1) has seen internal updates, there is nothing new that can be supported/optimised for in the compute land thus any code working well on Vega1 should work just as well on Vega2.

The 16GB HBM2 wide memory also helps big workloads with 2x higher bandwidth and also lower latency due to higher clock. For some workloads this alone makes it a definite buy when competition stops at 12GB.

Unfortunately the card has had a limited release at a relatively high price thus value/price ratio depends entirely on your workload – if FP64 with large datasets then it is very much worth it; if FP32/FP16 with datasets that fit in standard 8GB memory then the older Vega1 is much better value and you can even get 2 for the price of the Vega2.

For revolutionary change we need to wait for Navi and its brand new RDNA (Radeon DNA) arch(itecture)…