AMD Radeon VII: Vega2 GPGPU Performance in OpenCL

What is “Vega2”?

It is the code-name of the updated “Vega” GPU arch(itecture), the last of the GCN (graphics core next) arch (version 5.1) shrinked to 7nm before being replaced by the forthcoming “Navi”. Originally for the professional/workstation high-end market, “Vega2″/”big Vega” designed for compute (scientific, machine learning, etc.) workloads was pressed into service to battle the latest 2000-series “Turing”/RTX competition.

As a result it contains many high-end features not normally found on consumer cards:

  • 1/4 FP64 rate (instead of 1/16 or worse)
  • 16GB HBM2 memory (instead of 8-12)
  • 4096-bit HBM2 memory 1TB/s bandwidth (instead of 400-500)
  • Int8/Int4 support for AI/ML workloads
  • PCIe 4.0 capable but not enabled at this time

See these other articles on GPGPU performance:

Hardware Specifications

We are comparing the middle-range Radeon with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
Arch Chipset Vega 20 / GCN 5.1 GV100 / 7.0 GP102 / 6.1 Vega 10 / GCN 5.0 A minor revision of Vega1.
Cores (CU) / Threads (SP) 60 / 3840 80 / 5120 28 / 3584 56 / 3584 More CUs than normal Vega but not 64.
SIMD per CU / Width 4 / 16 n/a n/a 4 / 16 Naturally same SIMD count and width
Wave/Warp Size 64 32 32 75 Wave size has always been 2x nVidia.
Speed (Min-Turbo) 1.4 – 1.750 [+21%] (135-1455) 1.531 (139-1910) 1.156 – 1.471 Base clock is ~20% higher and turbo
Power (TDP) 300W [+42%] 300W 250W 210W TDP has gone up by 40%.
ROP / TMU 64 / 256 96 / 320 96 / 224 64 / ROPs and TMUs unchanged
Shared Memory
32kB 48 / 96 kB 48 / 96kB 32kB No shared memory change.
Constant Memory
8GB 64kB 64kB 4GB No dedicated constant memory but large.
Global Memory 16GB HBM2 2Gbps 4096-bit 12GB HBM2 2x850Mbps 3072-bit 12GB GDDR5X 10Gbps 384-bit 8GB HBM2 1.89Gbps 2048-bit 2x as big and 2x as wide HBM a huge improvement.
Memory Bandwidth (GB/s)
1000 [+2.4x] 652 512 410 Still bandwidth is 9% higher.
L1 Caches 16kB x 60 96kB x 80 48kB x 28 16kB x 56 L1 has not changed.
L2 Cache 4MB 4.5MB 3MB 4MB L2 has not changed.
Maximum Work-group Size
256 / 1024 1024 / 2048 1024 / 2048 256 / 1024 Same work-group sizes.
FP64/double ratio
1/4x 1/2x 1/32x 1/16x Ratio is 4x better than Vega1.
FP16/half ratio
2x 2x 1/64x 2x Ratio is the same throughout.

Processing Performance

We are testing both OpenCL performance using the latest SDK / libraries / drivers from both AMD and competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 29,057 [+48%] 33,860 245 19,580 Vega2 starts strong with a 48% lead over Vega1 and almost catching Volta.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 18,340 [+35%] 22,680 17,870 13,550 Good improvement here +35% over Vega1 again close to Volta.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 5,377 [+4.3x] 11,000 661 1,240 1/4 FP64 rate makes it over four (4x) times faster than Vega1.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 234 [+3x] 458 25.77 77 Similar to above, Vega2 is over three (3x) faster.
Vega2 looks about 35-50% faster than Vega1 in FP32/FP16 and 3-4x faster in FP64 due to its 1/4 FP64 rate. It won’t beat real workstation cards with 1/2 FP64 rate through thus that Titan has nothing to worry about.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 91 [+36%] 70 42 67 The fast HBM2 memory allows it to beat even Volta not just Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 93 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 209 [+67%] 245 145 125 Vega2 is a huge 70% faster in integer/crypto workloads.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 129 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 176 76 32
Vega2 increases its lead in integer workloads even streaming ones no doubt due to its very fast HBM2 memory making it the crypto-king of the hill though its cost may be an issue.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 23,164 [+2.3x] 18,570 11,480 9,500 Vega2 is over 2x faster than Vega1 also beating Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 [+3.84x] 8,400 1,370 1,880 In FP64 its almost 4x faster just below Volta!
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 3,501 [+38%] 4,200 2,240 2,530 Binomial uses thread shared data thus stresses the memory system Vega2 is still 40% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 [+4.8x] 2,000 129 164 With FP64 we’re almost 5x faster than Vega1.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 6,249 [+62%] 11,920 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Vega2 is 60% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 [+3.55x] 4,440 294 472 With FP64 we’re over 3.5x faster.
For financial FP32 workloads, Vega2 is 40-60% faster than Vega1 a decent improvement; naturally in FP64 it’s 4-5x times faster thus a significant upgrade for algorithms that require such precision.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,634 [+30%] 11,000 6,073 5,066 GEMM still brings a 30% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 [+3.77x] 3,830 340 620 But DGEMM is almost 4x faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 643 [+74%] 617 235 369 FFT loves HBM thus Vega2 is 75% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 [+2.1x] 280 207 175 DFFT is tough but Vega2 is still twice as fast.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,846 [+41%] 7,790 5,720 4,840 In N-Body physics Vega2 is 40% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 [+3.9x] 4,270 275 447 And in FP64 physics Vega2 is almost 4x faster.
The scientific scores show a similar improvement, with FP32 30-40% better but FP64 a whopping four (4x) faster than Vega1 and, in some algorithms, matching the hugely expensive Volta.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 25,418 [+32%] 26,790 18,410 19,130 In this 3×3 convolution algorithm, Vega2 is 32% faster than Vega1
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 5,275 [+21%] 9,295 5,000 4,340 Same algorithm but more shared data reduces the lead to 21%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 5,510 [+24%] 9,428 5,080 4,450 With even more data the gap remains constant.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 5,273 [+23%] 9,079 4,800 4,300 Still convolution but with 2 filters – similar 23% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 92 [+91%] 112 37 48 Different algorithm makes Vega2 almost 2x faster than Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 57 [+50%] 42 12.7 38 Without major processing, this filter is 50% faster on Vega2.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 47,349 [+2.3x] 24,370 19,480 20,880 This algorithm is 64-bit integer heavy and Vega2 flies 2x faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 7,708 [+28%] 8,460 305 6,000 One of the most complex and largest filters, Vega2 is 28% faster.
For image processing using FP32 precision, Vega goes from 21% to 2x faster, overall a decent improvement if you are processing a large number of images. In many filters it beats the far more expensive Volta competition.

Memory Performance

We are testing both OpenCL performance using the latest SDK / libraries / drivers from AMD and competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia. drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 627 [+88%] 536 356 333 Vega2’s wide HBM2 is almost 2x faster as expected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 12.37 [+2%] 11.47 11.4 12.18 Using PCIe 3.0 similar upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.95 [+7%] 12.27 12.2 12.08 Again similar bandwidth.
Vega2 benefits greatly from its very wide HBM2 memory (4096-bit) which provides almost 2x real bandwidth as expected. But while PCIe 4.0 capable for now it has to make do with 3.0 and thus same upload/download bandwith. Here’s hoping for a BIOS update once new motherboards come out.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 202 [-19%] 180 201 247 The higher clock allows Vega2 a 20% latency reduction.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 [-4%] 311 286 353 Full range is only 4% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53.4 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75.4 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.1 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88.5 87.6 80
Not unexpected, GDDR6′ latencies are higher than HBM2 although not by as much as we were fearing.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Vega2 (“BigVega”) is a big improvement over normal Vega1 and its workstation-class pedigree shows. For FP16/Fp32 workloads though the 30-40% performance improvement may not be worth it considering the much higher price: naturally FP64 performance is almost 4x due to 1/4 FP64 rate though not as good at professional cards with 1/2 rate or Titan competition with similar 1/2 rate.

While the GCN core (rev 5.1) has seen internal updates, there is nothing new that can be supported/optimised for in the compute land thus any code working well on Vega1 should work just as well on Vega2.

The 16GB HBM2 wide memory also helps big workloads with 2x higher bandwidth and also lower latency due to higher clock. For some workloads this alone makes it a definite buy when competition stops at 12GB.

Unfortunately the card has had a limited release at a relatively high price thus value/price ratio depends entirely on your workload – if FP64 with large datasets then it is very much worth it; if FP32/FP16 with datasets that fit in standard 8GB memory then the older Vega1 is much better value and you can even get 2 for the price of the Vega2.

For revolutionary change we need to wait for Navi and its brand new RDNA (Radeon DNA) arch(itecture)…

AMD A4 “Mullins” APU GPGPU (Radeon R4): Time does not stand still…

AMD Logo

What is “Mullins”?

“Mullins” (ML) is the next generation A4 “APU” SoC from AMD (v2 2015) replacing the current A4 “Kaveri” (KV) SoC which was AMD’s major foray into tablets/netbooks replacing the older “Brazos” E-Series APUs. While still at a default 15W TDP, it can be “powered down” for lower TDP where required – similar to what Intel has done with the ULV Core versions

While Kabini was a major update both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • GPU: Core remains the same (GCN)

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Previous and next articles regarding AMD GPGPU performance:

Hardware Specifications

We are comparing the internal GPUs of the new AMD APU with the old version as well as its competition from Intel.

Graphics Unit CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comment
Graphics Core B-GT EV8 B-GT2Y EV8 GCN GCN There is no change in GPU core in Mullins, it appears to be a re-brand fromm 83XX to R4. But time does not stand still, so while Kabini went against BayTrail’s “crippled” EV7 (IvyBridge) GPU – Mullins must battle the “beefed-up” brand-new EV8 (Broadwell) GPU. We shall see if the old GCN core is enough…
APU / Processor Atom X7 Z8700 (CherryTrail) Core M 5Y10 (Broadwell-Y) A4-5000 (Kabini) A4-6000? (Mullins) The series has changed but not much else, not even the CPU core.
Cores (CU) / Shaders (SP) / Type 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) 2C / 128SP 2C / 128SP [=] We still have 2 GCN Compute Units but now they go against 16 EV8 units rather than 4 EV7 units. You can see just how much Intel has improved the Atom GPGPU from generation to generation while AMD has not. Will this cost them dearly?
Speed (Min / Max / Turbo) MHz 200 – 600 200 – 800 266 – 496 266 – 500 [=] Nope, clock has not changed either in Mullins.
Power (TDP) W 2.4 (under 4) 4.5 15 15 [=] As before, Intel’s designs have a crushing advantage over AMD’s: both Kabini and Mullins are rated at least 3x (three times) higher power than Core M and as much as 5-6x more than new Atom. Powered-down versions (6W?) would still consume more while performing worse.
DirectX / OpenGL / OpenCL Support 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 11.2 (12?) / 4.5 / 1.2 11.2 (12?) / 4.5 / 2.0 GCN supports DirectX 11.2 (not a big deal) and also OpenCL 4.5 (vs 4.3 on Intel but including Compute) and OpenCL 2.0 (same). All designs should benefit from Windows 10’s DirectX 12. So while AMD supports newer versions of standards there’s not much in it.
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) No / Yes No / Yes Sadly even AMD does not support FP16 (why?) but does support FP64 (double-float) in all interfaces – while Atom/Core GPU only in DirectX and OpenGL. Few people would elect to run heavy FP64 compute on these GPUs but it’s good to know it’s there…
Threads per CU 256 (256x256x256) 512 (512x512x512) 256 (256x256x256) 256 (256x256x256) GCN has traditionally not supported large number of threads-per-CPU (256) and here’s no different, with Intel’s GPU now supporting twice as many (512) – but whether this will make a difference remains to be seen.

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (July 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: GPGPU Vectorised
GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) 160.5 181.7 165.3 163.9 [-1%] Straight off the bat, we see no change in score in Mullins; however, Atom has cought up – scoring within a whisker and Core M faster still (+13%). Not what AMD is used to seeing for sure.
GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) 160 180 165 163.9 [-1%] As FP16 is not supported by any of the GPUs, unsurprisingly the results don’t change.
GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) 10.1 (emulated) 11.6 (emulated) 13.4 14 [+4%] We see a tiny 4% improvement in Mullins but due to native FP64 support it is almost 40% faster than both Intel GPUs.
GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) 1.08 (emulated) 1.32 (emulated) 0.731 (emulated) 0.763 [+4%] (emulated) No GPU supports FP128, but GCN can emulate it using FP64 while EV8 needs to use more complex FP32 maths. Again we see a 4% improvement in Mullins, but despite FP64 support both Intel GPUs are much faster. Sometimes FP64/FP32 ratio is so low that it’s not worth using FP64 and emulation can be faster (e.g nVidia).
AMD Mullins: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 825 770 998 1024 [+3%] In this tough integer workload that uses shared memory (as cache), Mullins only sees a 3% improvement. GCN shows its power being 25% faster than Intel’s GPUs – TDP notwhitstanding.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 1106 ? 1280 1423 [+11%] With less rounds, Mullins is now 11% faster – finally a good improvement and again 28% faster than Atom’s GPU.
AMD Mullins: GPGPU Hash
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 309 ? 59 282 [+4.7x] This 64-bit integer compute-heavy wokload seems to have triggered a driver bug in Kabini since Mullins is almost 5x (five times) faster – perhaps 64-bit integer operations were emulated using int32 rather than native? Surprisingly Atom’s EV8 is faster (+9%) – not something we’d expect to see.
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 1187 1331 1618 1638 [+1%] In this integer compute-heavy workload, Mullins is just 1% faster (within margin of error) – which again proves GPU has not changed at all vs. older Kabini. At least it’s faster than both Intel GPUs, 38% faster than Atom’s.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 2764 ? 2611 3256 [+24%] SHA1 is less compute-heavy but here we see a 24% Mullins improvement, again likely a driver “fix”. This allows it to beat Atom’s GPU 17% – showing that driver optimisations can make a big difference.
AMD Mullins: GPGPU Financial
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 299.8 280.5 248.3 326.7 [+31%] Starting with the financial tests, Mullins flies off with a 31% improvement over the old Kabini, with is just as well as both Intel GPUs are coming strong – it’s 9% faster than Atom’s GPU. One thing’s for sure, Intel’s EV8 GPU is no slouch.
GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) n/a (no FP64) n/a (no FP64) 21 21.2 [+1%] AMD’s GCN supports native FP64, but here Mullins is just 1% faster than Kabini (within margin of error), unable to replicate the FP32 improvement we saw.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 28 36.5 32.3 30.9 [-4%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and here Mullins somehow manages to be slower (-4%) – likely due to driver differences. Both Intel GPUs are coming strong, with Core M’s GPU 20% faster. Considering how fast the GCN shared memory is – we expected better.
GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 1.85 1.87 [+1%] Switching to FP64 on AMD’s GPUs, Mullins is now 1% faster (within margin of error). Luckily Intel’s GPUs do not support FP64.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 61.9 54.9 32.9 46.3 [+40%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Mullins is 40% faster here (again driver change) – but surprisingly cannot match either Intel GPUs, with Atom’s GPU 32% faster! Again, we see just how much Intel has improved the GPU in Atom – perhaps too much!
GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) n/a (no FP64) n/a (no FP64) 5.39 5.59 [+3%] Switching to FP64 we now see a little 3% improvement for Mullins, but better than the 1% we saw before…
AMD Mullins: GPGPU Scientific
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 45 44.1 43.5 41.5 [-5%] GEMM is quite a tough algorithm for our GPUs and Mullins manages to be 5% slower than Kabini – agin this allows Intel’s GPUs to win, with Atom’s GPU just 8% faster – but a win is a win. Mullins’s GPU is starting to look underpowered considering the much higher TDP.
GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.11 3.73 [-9%] Swithing to FP64, Mullins now manages to be 5% slower than Kabini – thankfully Intel’s FPUs don’t support FP64…
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 9 8.94 7.89 9.5 [+20%] FFT involves many kernels processing data in a pipeline – and Mullins now manages to be 20% faster than Kabini – again, just as well as Intel’s GPUs are hot on its tail – and it is just 5% faster than Atom’s GPU!
GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 2.2 3 [+36%] Switching to FP64, Mullins is now 36% faster than Kabini – again likely a driver improvement.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 65 50 58 63 [+9%] In our last test we see Mullins is 9% faster – but not enough to beat Atom’s GPU which is 1% faster but faster still. Anybody expected that?
GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) n/a (no FP64) n/a (no FP64) 4.75 4.74 [=] Switching to FP64, Mullins scores exactly the same as Kabini.

Firstly, Mullins’s GPU scores are unchanged from Kabini; due to driver optimisations/fixes (as well as kernel optimisations) sometimes Mullins is faster but that’s not due to any hardware changes. If you were expecting more, you are to be disappointed.

Intel’s EV8 GPUs in the new Atom (CherryTrail) as well as Core M (Broadwell) now can keep up with it and even beat it in some tests. The crushing GPGPU advantage AMD’s APUs used to have is long gone. Considering the the TDP differences (4-5x higher) the Mullins’s GPU looks underpowered – the number of cores should at least been doubled to maintain its advantage.

Unless Atom (CherryTrail) is more expensive – there’s really no reason to choose Mullins, the power advantage of Atom is hard to be denied. The only positive for AMD is that Core M looks uncompetitive vs. Atom itself, but then again Intel’s 15W ULV designs are far more powerful.

Transcoding Performance

We are testing memory performance of GPUs using their hardware transcoders using popular media formats: H.264/MP4, AVC1, M.265/HEVC.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) AMD h264Encoder (hardware accelerated) AMD h264Encoder (hardware accelerated) Both are using their own hardware-accelerated transcoders for H264.
AMD Mullins: Transcoding H264
Transocde Benchmark H264 > H264 Transcoding (MB/s) 5 ? 2 2.14 [+7%] We see a small but useful 7% bandwidth improvement in Mullins vs. Kabini, but even Atom is over 2x (twice) as fast.
Transocde Benchmark WMV > H264 Transcoding (MB/s) 4.75 ? 2 2.07 [+3.5%] When just using the H264 encoder we only see a small 3.5% bandwidth improvement in Mullins. Again, Atom is over 2x as fast.

We see a minor 3.5-7% improvement in Mullins, but the new Atom blows it out of the water – it is over twice as fast transcoding H.264! Poor Mullins/Kaveri are not even a good fit for HTPC (NUC/Brix) boxes…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
Memory Configuration 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) 4GB DDR3 1.6GHz 64-bit (shared with CPU) Except Core M, all APUs have a single memory controller, though Atom can also be configured with 2 channels.
Constant (kB) / Shared (kB) Memory 64 / 64 64 / 64 64 / 32 64 / 32 Surprisingly AMD’s GCN has 1/2 the shared memory of Intel’s EV8 (32 vs. 64) but considering the low number of threads-per-CU (256) only kernels making very heavy use of shared memory would be affected, still better more than less.
L1 / L2 / L3 Caches (kB) 256kB? L2 256kB? L2 16kB? L1 / 256kB? L2 16kB? L1 / 256kB? L2 Caches sizes are always pretty “hush hush” but since core has not changed, we would expect the same cache sizes – with GCN also sporting a L1 data cache.
AMD Mullins: GPGPU Memory BW
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 11 10.1 8.8 5.5 [-38%] OpenCL memory performance has surprisingly taken a bit hit in Mullins, most likely a driver bug. We shall see whether DirectX Compute is similarly affected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 2.09 3.91 4.1 2.88 [-30%] Upload bandwidth is similarly affected, we measure a 30% decrease.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 2.29 3.79 3.18 2.9 [-9%] Upload bandwidth is the least affected, just 9% lower.
AMD Mullins: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 829 274 1973 693 [-1/3x] Even though the core is unchanged, latency is 1/3 of Kabini. Since we don’t see a comparative increase in performance, this again points to a driver issue.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1669 ? ? 817 Going out-of-page does not increase latency much.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 279 ? ? 377 Sequential access brings the latency down to about 1/2, showing the prefetchers do a good job.
GPGPU Memory Latency Constant Memory Latency (ns) 209 ? 629 401 [-33%] The L1 (16kB) cache does not cover the whole constant memory (64kb) – and is not lower than global memory. There is no advantage to using constant memory.
GPGPU Memory Latency Shared Memory Latency (ns) 201 ? 20 16 [-20%] Shared memory is a little big faster (20% lower). We see that shared memory latency is much lower than constant/global lantency (16 vs. 401) – denoting dedicated shared memory. On Intel’s EV8 GPU there is no change (201 vs. 209) – which would indicate global memory used as shared memory.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1234 ? 2369 691 [-70%] We see a massive latency reduction – again likely a driver optimisation/fix.
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 353 ? ? 353 Sequential access brings the latency down to a quarter (1/4x) – showing the power of the prefetchers.

The optimisations in newer drivers make a big difference – though the same could apply to the previous gen (Kabini). The dedicated shared memory – compared to Intel’s GPUs – likely help GCN achieve its performance.

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Shader
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 127.6 ? 128.8 129.6 [=] Starting with DirectX FP32, we see no change in Mullins – not even the DirectX driver has changed.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 121.8 172 124 124.4 [=] OpenGL does not change matters, Mullins scores exactly the same as its predecessor. But here we see Core M pulling ahead, an unexpected change.
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 109.5 ? 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 121.8 170 124 124 [=] As FP16 is not supported by any of the GPUs and promoted to FP32 the results don’t change either.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 18 ? 8.9 8.91 [=] Unlike OpenCL driver, DirectX Intel driver does support FP64 – which allows Atom’s GPU to be at least 2x (twice) as fast as Mullins/Kebini.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 26 46 12 12 [=] As above, Intel OpenGL driver does support FP64 also – so all GPUs run native FP64 code again. This allows Atom’s GPU to be over 2x faster than Mullins/Kabini again – while Core M’s GPU is almost 4x (four times!) faster.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.34 (emulated) ? 1.6 (emulated) 1.66 (emulated) [+3%] Here we’re emulating (mantissa extending) FP128 using FP64: EV8 stumbles a bit allowing Mullins/Kabini to be a little bit faster despite what we saw in FP64 test.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1.1 (emulated) 3.4 (emulated) 0.738 (emulated) 0.738 (emulated) [=] OpenGL does change the results a bit, Atom’s GPU is now faster (+50%) while Core M’s GPU is far faster (+5x). Heavy shaders seem to take their toll on GCN’s GPU.

Unlike GPGPU, here Mullins scores exactly the same as Kabini – neither the DirectX nor OpenGL driver seem to make a difference. But what is different is that Intel’s GPUs support FP64 natively in both DirectX/OpenGL – making it much faster 3-5x than AMD’s GCN. If OpenCL driver were to support it – AMD woud be in trouble!

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors CherryTrail GT Broadwell GT2Y – HD 5300 Kabini – Radeon HD 8330 Mullins – Radeon R4 Comments
AMD Mullins: Video Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 11.18 12.46 8 9.7 [+21%] DirectX bandwidth does not seem to be affected by the OpenCL “bug”, here we see Mullins having 21% more bandwidth than Kaveri using the very same memory. Perhaps the memory controller has seen some some improvements after all.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.83 5.29 3 3.61 [+20%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again Mullins does well with 20% more bandwidth.
Video Memory Benchmark Download Bandwidth (GB/s) 2.1 1.23 3 3.34 [+11%] Download bandwidth improves by 11% only, but better than nothing.

Unlike OpenCL, we see DirectX bandwidth increased by 11-20% – while using the same memory. Hopefuly AMD will “fix” the OpenCL issue which should help kernel performance no end.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mullins’s GPU is unchanged from its predecessor (Kabini) but a few driver optimisations/fixes allow it to score better is many tests by a small margin – however these would also apply to the older devices. There isn’t really more to be said – nothing has really changed.

But time does not stand still – and now Intel’s EV8 GPU that powers the new Atom (CherryTrail) as well as Core M (Broadwell) is hot on its heels and even manages to beat it in some tests – not something we’re used to seeing in AMD’s APUs. Mullins’s GPU is looking underpowered.

If we now remember that Mullins’s TDP is 15W vs. Atom at 2.6-4W or Core M at 4.6W – it’s really not looking good for AMD: it’s CPU performance is unlikely to be much better than Atom’s (we shall see in CPU AMD A4 “Mullins” performance) – and at 3-5x (three to five times) more power woefully power inefficient.

Let’s hope that the next generation APUs (aka Mullins’ replacement) perform better.