SP1a for SiSoftware Sandra 2016 Released!


We are happy to release SP1a (Service Pack 1a) for SiSoftware Sandra 2016.

This is a minor update that improves stability and adds a few optimisations developed after further testing of the SP1 release.

The SP1a update also enables the Marbling: Perlin Noise 2D (3 octaves) Filter for both GPGPUs (CUDA, OpenCL) and the CPU.


SP1 for SiSoftware Sandra 2016 Released!


We are happy to release SP1 (Service Pack 1) for SiSoftware Sandra 2016.

This release introduces the initial AVX512 benchmarks, with the remaining SIMD benchmarks due to be ported once compiler support becomes available:

  • CPU Multi-Media (Fractal Generation): single/double floating-point and integer/long benchmarks ported to AVX512. [See the article Future performance with AVX512]
  • CPU Crypto (SHA Hashing): SHA2-256 and SHA2-512 multi-buffer hashing ported to AVX512 (see the sketch after this list).
  • Hardware support for future architectures (AMD and Intel).
  • .Net Multi-Media: native vector support is vector-width independent and will thus support AVX512 automatically with a future CLR release.
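Multi-buffer hashing is what makes SHA vectorisable at all: SHA's rounds are inherently serial within one message, so the same round function is instead applied to many independent messages at once, one per SIMD lane. Below is a minimal illustrative round step in C++ – our sketch of the general technique, not Sandra's actual SHA2 kernel (`round_step` and the constants are made up):

```cpp
#include <immintrin.h>

#if defined(__AVX512F__)
// 16 independent message states, one per 32-bit lane of a ZMM register;
// every instruction advances all 16 hashes by the same round step.
inline __m512i round_step(__m512i state, __m512i msg_words) {
    __m512i t = _mm512_add_epi32(state, msg_words); // add schedule words
    t = _mm512_rol_epi32(t, 7);                     // per-lane rotate
    return _mm512_xor_si512(t, state);              // non-linear mix
}
#endif
```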

GPU Image Processing: New, more complex filters:

  • Oil Painting: Quantise (9×9) Filter: CUDA, OpenCL
  • Diffusion: Randomise (256) Filter: CUDA, OpenCL
  • Marbling: Perlin Noise 2D (3 octaves) Filter: CUDA, OpenCL

CPU Image Processing: New, more complex filters:

  • Oil Painting: Quantise (9×9) Filter: AVX2/FMA, AVX, SSE2
  • Diffusion: Randomise (256) Filter: AVX2/FMA, AVX, SSE2
  • Marbling: Perlin Noise 2D (3 octaves) Filter: AVX2/FMA, AVX, SSE2
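As a rough illustration of what a "Marbling: Perlin Noise 2D (3 octaves)" style filter involves, here is a minimal scalar C++ sketch: three octaves of smooth 2D noise are summed and used to perturb a sine pattern, producing marble-like veins. The hash and interpolation below are generic stand-ins – Sandra's actual SIMD/GPGPU kernels are not published:

```cpp
#include <cmath>
#include <cstdint>

// Integer-pair -> [0,1) pseudo-random value (illustrative hash).
static float hash2(int x, int y) {
    std::uint32_t h = std::uint32_t(x) * 374761393u
                    + std::uint32_t(y) * 668265263u;
    h = (h ^ (h >> 13)) * 1274126177u;
    return float(h & 0xffffffu) / float(0x1000000u);
}

// Smoothly interpolated 2D value noise.
static float smooth_noise(float x, float y) {
    int   xi = int(std::floor(x)), yi = int(std::floor(y));
    float fx = x - float(xi),      fy = y - float(yi);
    fx = fx * fx * (3.0f - 2.0f * fx);   // smoothstep fade
    fy = fy * fy * (3.0f - 2.0f * fy);
    auto lerp = [](float a, float b, float t) { return a + (b - a) * t; };
    float top = lerp(hash2(xi, yi),     hash2(xi + 1, yi),     fx);
    float bot = lerp(hash2(xi, yi + 1), hash2(xi + 1, yi + 1), fx);
    return lerp(top, bot, fy);
}

// Marble intensity at (x,y): 3 octaves of noise perturb a sine pattern.
float marble(float x, float y) {
    float n = 0.0f, amp = 1.0f, freq = 1.0f;
    for (int octave = 0; octave < 3; ++octave) {
        n += amp * smooth_noise(x * freq, y * freq);
        amp *= 0.5f;   // each octave: half the amplitude...
        freq *= 2.0f;  // ...twice the frequency (finer detail)
    }
    return 0.5f + 0.5f * std::sin((x + n * 8.0f) * 3.14159265f);
}
```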

More benchmarks will be ported to AVX512 subject to compiler support; currently Microsoft's VC++ does not support AVX512 intrinsics and, in the interest of fairness, we do not use specialised compilers.

Please see our article – Future performance with AVX512 – for a primer on AVX512 and projected performance improvements due to AVX512 and 512-bit transfers.

Future performance with AVX512 in Sandra 2016 SP1


What is AVX512?

AVX512 is a new SIMD instruction set operating on 512-bit registers, the natural progression from FMA/AVX (256-bit registers). It was first introduced with Intel's "Phi" co-processor (Intel's answer to GPGPUs) and a version of it is now making its way to CPUs themselves.

Why is AVX512 important?

CPU performance has only marginally increased (5-10%) from one generation to the next, with power efficiency being the primary goal; with limited options (cannot increase clock speeds, must reduce power, hard to improve execution efficiency, etc.), exploiting data-level parallelism through SIMD is a relatively simple way to improve performance.

SIMD instructions have long been used to increase performance (since the introduction of MMX with the Pentium in 1997!) and their register width has been increasing steadily from 64-bit (MMX) to 128-bit (SSEx) to 256-bit (AVX/FMA) and now to 512-bit (AVX512) – thus processing more and more data simultaneously.
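In intrinsics terms the progression is literal: the same multiply is issued on ever wider registers, doubling the elements processed per instruction at each step (a C++ sketch; the AVX512 variant only compiles where supported):

```cpp
#include <immintrin.h>

// One multiply instruction per width - 4, 8 or 16 floats at a time.
__m128 mul_sse(__m128 a, __m128 b)    { return _mm_mul_ps(a, b);    } // 128-bit
__m256 mul_avx(__m256 a, __m256 b)    { return _mm256_mul_ps(a, b); } // 256-bit
#if defined(__AVX512F__)
__m512 mul_avx512(__m512 a, __m512 b) { return _mm512_mul_ps(a, b); } // 512-bit
#endif
```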

Unfortunately, software has to be specifically modified to support AVX512 (or at the very least re-compiled), but developers have generally become used to this after the SSE-to-AVX transition.

SiSoftware has thus been updating its benchmarks to AVX512, though some need compiler support and will have to wait until Microsoft updates its Visual C++ compiler.

What CPUs will support AVX512?

It was rumoured that the newly released “Skylake” Core consumer CPUs were going to support AVX512 – but they do not. The future “Skylake-E” Xeon “Purley” server/workstation CPUs are supposed to support it.

AVX512 is actually a family of multiple instruction subsets – with "Skylake-E" supporting F (foundation), CD (conflict detection), BW (byte & word), DQ (double-word and quad-word) and VL (vector length extensions) – and the future "Cannonlake-E" supporting IFMA (integer FMA), VBMI (vector byte manipulation) and perhaps others.

It is disappointing that AVX512 is not enabled on consumer CPUs (Core), though it will eventually appear in future iterations; until then, gamers/enthusiasts need to buy into the "extreme" Skylake-E platform, while business users will get it with Xeon "Skylake-E" workstations.

What kind of performance improvement can we expect with AVX512?

The transition from 128-bit SSE to 256-bit AVX/FMA/AVX2 has – eventually – resulted in a 70-120% improvement, with compute-intensive code that seldom accesses memory yielding the best improvement. Note that AVX code executes at a lower clock than "normal"/SSE code.

AVX512 not only doubles the register width (512-bit) but also the number of registers (32 vs. 16); thus we can hold 4x (four times) more data in registers, which may reduce cache/memory accesses by keeping more data locally. But AVX512 code will again run at a lower clock versus AVX/FMA.
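A small C++ sketch of what porting a loop to AVX512 looks like (our illustration, not Sandra's code; compile with AVX512F enabled, e.g. -mavx512f): 16 floats per iteration, with the remainder handled via AVX512's lane masks instead of a scalar tail loop:

```cpp
#include <immintrin.h>
#include <cstddef>

#if defined(__AVX512F__)
// a[i] = a[i]*b[i] + c[i] over n floats, 16 lanes at a time.
void fma_inplace(float* a, const float* b, const float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_loadu_ps(c + i);
        _mm512_storeu_ps(a + i, _mm512_fmadd_ps(va, vb, vc));
    }
    if (i < n) {  // remainder handled with a lane mask, not scalar code
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1);
        __m512 va = _mm512_maskz_loadu_ps(m, a + i);
        __m512 vb = _mm512_maskz_loadu_ps(m, b + i);
        __m512 vc = _mm512_maskz_loadu_ps(m, c + i);
        _mm512_mask_storeu_ps(a + i, m, _mm512_fmadd_ps(va, vb, vc));
    }
}
#endif
```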

In the next examples we project future gains through AVX512 for common algorithms as implemented in Sandra’s benchmarks and what they might mean to customers.

Can I test AVX512 performance with Sandra?

Yes: with the release of Sandra 2016 SP1 you can now test AVX512 performance – naturally, you need a CPU that supports it. The following low-level benchmarks have been ported to AVX512:

  • Multi-Media (Fractal Generation) Benchmark: AVX512 F, BW, DQ supported now
  • Cryptography (SHA Hashing) Benchmark: AVX512 BW, DQ supported now
  • Memory & Cache Bandwidth Benchmarks: AVX512 F, DQ supported now
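Selecting an AVX512 code path requires a runtime check; here is a minimal C++ sketch using a GCC/Clang builtin (an assumption for illustration – not how Sandra itself dispatches its code paths):

```cpp
#include <cstdio>

int main() {
    __builtin_cpu_init();   // populate CPU feature flags (GCC/Clang)
    // Note: a production check should also verify OS support for the
    // wider ZMM register state (XGETBV/XCR0), not just CPUID bits.
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512bw"))
        std::puts("AVX512 F+BW present: 512-bit code paths available");
    else
        std::puts("No AVX512: falling back to AVX2/AVX/SSE paths");
}
```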

The following benchmarks require future compiler support (Microsoft VC++) and have not been released at this time:

  • Financial Analysis (Black-Scholes, Binomial, Monte-Carlo): AVX512 F support coming soon
  • Scientific Analysis (GEMM, FFT, N-Body): AVX512 F support coming soon
  • Image Processing (Blur/Sharpen/Motion-Blur, Sobel, Median): AVX512 BW support coming soon
  • .Net Vectorised (Fractal Generation): AVX512 support depends on the RyuJIT numerics libraries, which need to be updated by Microsoft; no changes to Sandra are required.

Hardware Stats

We are comparing two released public CPUs with their projected next-gen counterparts supporting AVX512.

| Processor | Intel i7-6700K (Skylake) | Intel i7-77XX? (next-gen) | Intel i7-5820K (Haswell-E) | Intel i7-78XX? (Skylake-E) |
|---|---|---|---|---|
| Cores / Threads | 4C / 8T | 4C / 8T | 6C / 12T | 6C / 12T |
| Clock Speeds (MHz) Min-Max-Turbo | 800-4000-4200 | assumed same | 1200-3300-3600 | assumed same |
| Caches L1 / L2 / L3 | 4x 32kB, 4x 256kB, 8MB | assumed same | 6x 32kB, 6x 256kB, 15MB | assumed same |
| Power TDP Rating (W) | 91W | assumed same | 140W | assumed same |
| Instruction Set Support | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. |

We do not expect major changes in the future AVX512-supporting architectures, especially Skylake-E, as the Skylake core is already out and its specifications are known.

Multi-media (Fractal Generation) Benchmark

AVX512 Multi-Media

| Benchmark | Future Core i7 (4C/8T AVX512, projected) | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512, projected) | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx) | Comments |
|---|---|---|---|---|---|---|---|
| Integer SIMD (Mpix/s) | 912.5 [+76% over AVX] | 516.2 [+76% over SSE] | 292 | 1020.7 [+76% over AVX] | 577.4 [+76% over SSE] | 327 | We see around a 76% improvement from AVX2 vs. SSE, thus we assume something similar moving to AVX512 (~80%). |
| Long SIMD (Mpix/s) | 315.3 [+66% over AVX] | 190.1 [+66% over SSE] | 114.6 | 284.3 [+66% over AVX] | 171.4 [+66% over SSE] | 87.6 | We see around a 66% improvement from AVX2 vs. SSE, but due to the new instructions we may see better AVX512 gains. |
| Single Float SIMD (Mpix/s) | 916.8 [+2x over AVX] | 458.4 [+2.12x over SSE] | 216 | 1079 [+2x over AVX] | 539.5 [+2.12x over SSE] | 234.8 | We saw over a 2x improvement from AVX/FMA over SSE, so while we may not see as large an improvement with AVX512, we may still get 100%. |
| Double Float SIMD (Mpix/s) | 545.8 [+2x over AVX] | 272.9 [+2.35x over SSE] | 116.1 | 622.4 [+2x over AVX] | 311.2 [+2.35x over SSE] | 126 | We see an even better improvement from AVX over SSE here (2.35x), so hopefully we'll get 2x moving to AVX512. |
| Quad Float SIMD (Mpix/s) | 20.3 [+94% over AVX] | 10.5 [+94% over SSE] | 5.4 | 622.4 [+94% over AVX] | 311.2 [+94% over SSE] | 126 | Emulating FP128 is hard work, but even then AVX is 94% faster than SSE, thus we'd expect AVX512 to be almost 2x faster still. |
Despite some disappointment over arch-to-arch performance improvements, the Skylake 4C (i7-6700K) already goes toe-to-toe with the Haswell-E 6C (i7-5820K); with AVX512 support, Skylake-E 6C/8C is projected to comprehensively outperform it.

AVX512 will also allow Skylake-E to narrow the gap to current GPGPUs, with multi-CPU Xeon systems able to "do without" GPGPUs – well, except perhaps a "Phi" or two?

AVX512 Crypto

| Benchmark | Future Core i7 (4C/8T AVX512, projected) | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512, projected) | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx) | Comments |
|---|---|---|---|---|---|---|---|
| Hashing SHA2-256 (GB/s) | 11.80 [+2x over AVX] | 5.90 [+2.36x over SSE] | 2.50 | 13.60 [+2x over AVX] | 6.80 [+2.26x over SSE] | 3 | We see a large 2.26-2.36x improvement of AVX2 vs. SSE, thus we expect about a 2x increase with AVX512 still. |
| Hashing SHA1 (GB/s) | 23 [+2x over AVX] | 11.5 [+2.16x over SSE] | 5.33 | 27.70 [+2x over AVX] | 13.85 [+2.04x over SSE] | 6.79 | Even with SHA1 we see a good 2.04-2.16x improvement of AVX2 vs. SSE, thus AVX512 should again double performance, though we may be limited by memory bandwidth. |
| Hashing SHA2-512 (GB/s) | 8.74 [+2x over AVX] | 4.37 [+2.33x over SSE] | 1.87 | 9.60 [+2x over AVX] | 4.80 [+2.20x over SSE] | 2.18 | Switching to 64-bit integer SHA2-512 we see the best improvement yet of AVX2 vs. SSE (2.2-2.33x), with AVX512 likely to improve by 2x yet again. |
With hashing we see even better results than even fractal generation, with AVX2 improving over 2x over SSE – and AVX512 will thus improve by at least 100% – if anything it is likely we will hit memory bandwidth limitations.
AVX512 Memory Bandwidth

| Benchmark | Future Core i7 (4C/8T AVX512, projected) | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512, projected) | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx) | Comments |
|---|---|---|---|---|---|---|---|
| Memory Bandwidth (GB/s) | ~31.30 | 31.30 [0%] | 31.30 | ~42.00 [0%] | 42.30 [-1%] | 42.6 | Even with DDR4 the memory sub-system hasn't changed much, and despite 512-bit transfers with AVX512 there is really no performance delta in streaming data to/from memory. |
| L3 Bandwidth (GB/s) | ~267.97 [+10%] | 243.30 [+10%] | 220.90 | ~202.20 [+3%] | 195.90 [+3%] | 189.8 | As we move up the cache hierarchy, the L3 already shows a 10% bandwidth improvement using AVX2/FMA vs. SSE, with AVX512 improving performance further. |
| L2 Bandwidth (GB/s) | ~392.50 [+21%] | 323.30 [+21%] | 266.30 | ~536.81 [+20%] | 444.10 [+20%] | 367.4 | As expected, L2 bandwidth improves ~20% with AVX2/FMA and is likely to improve further. |
| L1D Bandwidth (GB/s) | ~1,364.25 [+50%] | 909.50 [+2.11x] | 429.90 | ~1,536.00 [+50%] | 1,024.00 [+2x] | 518 | Skylake has widened the data access ports (just like Haswell before it), thus 512-bit AVX512 transfers show the best improvement yet, 40-50%! |
AVX512 does help take advantage of the widened data ports in Skylake and future architectures, with the L1D cache showing the best bandwidth improvement – just as Haswell did before it (with AVX2).

Memory bandwidth is still limited by DDR4 speeds, but faster modules are coming out all the time – and this time their clocks are JEDEC ratified.

We will update the article with further (projected) results once more benchmarks are converted to AVX512 – once compiler support is released – but even the results so far show excellent performance improvements.

Until then, those of you with access to AVX512 supporting hardware can download Sandra 2016 SP1 and test away!

New Promotions for Valentine’s Day February 2016


For February 2016 – and the soon-to-arrive Valentine's Day – we have some promotions for you to enjoy:

Happy Valentine’s Day (in advance) 😉


.Net Vectors (CLR 4.6 RyuJIT) Performance


What is RyuJIT?

"RyuJIT" is the code-name of the new JIT compiler in the .Net 4.6 CLR, as included in Windows 10 (with updates available for Windows 8.1, 8 and 7); it brings a variety of performance optimisations as well as new features such as native vectorised/SIMD support.

Why do we need .Net Vector support?

Many algorithms benefit from vectorisation/parallelisation through SIMD instruction sets in (all) modern processors; while compilers/run-times (CLR/JVM) may be able to automatically vectorise code – the most efficient way is through constructs that indicate to the compiler/run-time how to vectorise code for the hardware it is running on.
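The actual construct here is the CLR's width-agnostic vector type; as a C++ analogue of the same idea (illustrative names, not Sandra's code), the algorithm below is written once against an abstract lane count, and the widest ISA selected at build time fills in the width:

```cpp
#include <immintrin.h>
#include <cstddef>

#if defined(__AVX__)
using vec = __m256;                       // 8 float lanes on AVX
constexpr std::size_t LANES = 8;
inline vec vload(const float* p)   { return _mm256_loadu_ps(p); }
inline vec vadd(vec a, vec b)      { return _mm256_add_ps(a, b); }
inline void vstore(float* p, vec v){ _mm256_storeu_ps(p, v); }
#else
using vec = __m128;                       // 4 float lanes on SSE
constexpr std::size_t LANES = 4;
inline vec vload(const float* p)   { return _mm_loadu_ps(p); }
inline vec vadd(vec a, vec b)      { return _mm_add_ps(a, b); }
inline void vstore(float* p, vec v){ _mm_storeu_ps(p, v); }
#endif

// The algorithm never names a concrete width - just like a .Net loop
// written against the runtime-provided vector length.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        vstore(out + i, vadd(vload(a + i), vload(b + i)));
    for (; i < n; ++i)                    // scalar tail
        out[i] = a[i] + b[i];
}
```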

While we could always interop to native code libraries using SIMD, these would be platform / instruction-set dependent and introduce code and maintenance complexity.

What are the other pros/cons of .Net Vector support?

The new CLR is a boon for high-performance algorithms:

  • Widely deployed: by default on Windows 10 and Windows Update on older Windows.
  • Widest possible: automatically uses the “widest” SIMD ISA (instruction set) supported by the processor, be it AVX2/FMA, AVX, SSE2, etc. [and AVX512 in future CLR] without any code modifications.
  • ISA/platform independent: same .Net code runs whatever the platform/ISA now and in the future. No need to write native code for each platform and ISA (e.g. AVX-Win64, SSE2-Win32, etc.)
  • All primitive data types supported: single/double floating-point, int/long integers.

Unfortunately Microsoft could not go the “whole way” and there are downsides:

  • x64 Only: RyuJIT is for x64 Windows only with x86 stuck with the old CLR that is unlikely to be updated.
  • Very limited Integer operators: without basic binary operators like “shift”, “mask”, “swap/permute”, etc. integer performance is low.
  • Limited functions and operators: even floating-point provides a limited subset of functions and operators.
  • CLR Issues: the new RyuJIT CLR does have problems with some .Net apps which may require users to stick to the older CLR and thus no Vector support.

.Net Vectors vs. Native SIMD Performance

We are testing native and .Net multi-media (fractal generation) performance using various SIMD instruction sets (AVX2/FMA, AVX, SSE2, etc.).

Hardware: Intel i7-4650U (Haswell ULV) with AVX2/FMA, AVX, SSE2 support.

Results Interpretation: Higher values (MPix/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

.Net Vectorised Performance

| Data Type | .Net Vectorised | .Net Scalar | Native AVX2/FMA | Native AVX | Native SSE2 |
|---|---|---|---|---|---|
| Single Float (Mpix/s) | 54 (8pix width) [+9.2x] | 5.89 (1pix) | 102.3 (8pix) [+17.4x] | 89 (8pix) | 57.8 (4pix) |
| Double Float (Mpix/s) | 30.1 (4pix width) [+2.04x] | 14.78 (1pix) | 62.5 (4pix) [+4.2x] | 53.4 (2pix) | 31.9 (2pix) |
| Integer (Mpix/s) | 1.03 (8pix width) [0.056x] | 18.5 (1pix) | 114.5 (16pix) [+6.2x] | 73.4 (8pix) | 31.3 (4pix) |
| Int64 (Mpix/s) | 0.361 (4pix width) [0.020x] | 18 (1pix) | 41.6 (8pix) [+2.3x] | 23.4 (4pix) | 23 (2pix) |

We can confirm the use of AVX2/FMA/AVX by the width of the Vectors (256-bit wide, with float/int being 8-units wide, double/int64 being 4-units wide).

While the performance improvement over scalar code is significant (~2-9x), it does not quite reach the native SIMD implementation (only ~50% of it), which is somewhat disappointing but not altogether unexpected. However, future versions of the CLR will likely improve upon this – while our native code is unlikely to be optimised much further.

No, the Vector integer performance is *not* a bug: the lack of bit-manipulation operations (“shift”, “swap/permute”, “mask”, etc.) makes complex Vector algorithms pretty much useless. Thus we only enable Vectors for floating-point operations.
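For illustration, this is the kind of per-element step (shown here in C++) that such integer kernels lean on – a rotate built from two shifts; with no shift operator on the CLR's integer vectors, a loop built around it has to stay scalar:

```cpp
#include <cstdint>

// 32-bit rotate-left from two shifts and an OR - a staple of hashing
// and mixing kernels (r masked to avoid undefined shifts at r == 0).
inline std::uint32_t rotl32(std::uint32_t x, unsigned r) {
    r &= 31u;
    return (x << r) | (x >> ((32u - r) & 31u));
}
```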

Vectors may never replace native code completely, but many algorithms can now be implemented in pure .Net code with good performance and without the need for native libraries – making deployment to different platforms (e.g. ARM/Windows, Mono/Linux, etc.) far easier.

It is good to see Microsoft adding new features to the CLR – features we would have expected Java to get first – as both the CLR and the JVM have somewhat "stagnated" lately, which is not good to see.

SiSoftware Sandra 2016 RTMa Released


We are providing an update to Sandra 2016, RTMa (version 22.15) with various updates and fixes:

  • .Net native Vector support (floating-point single/double) in the latest 4.6 CLR (RyuJIT); the CLR automatically uses AVX/SSE2 SIMD as supported by the CPU. (See the .Net Vectors (CLR 4.6 RyuJIT) Performance article for more information.)
  • CPU Image Processing: fixed – SIMD code-paths (FMA, AVX, SSE2) were not run, only FPU, resulting in low performance.
  • GPGPU Image Processing: minor performance optimisation for the median/de-noise filter.
  • GPGPU Crypto: SHA performance optimisations for nVidia cards in CUDA and OpenCL (SHA1 especially).
  • Overall Score 2016: fixed – the score might not generate in all cases.
  • Windows 10: 1511 SDK update (build 10586, 2015 November update).
  • Website: due to the transition to WordPress, some links and feeds were broken (now fixed).

We recommend you update your version of Sandra 2016 as soon as possible.

AMD A4 “Mullins” APU CPU: Time does not stand still…


What is “Mullins”?

"Mullins" (ML) is the next-generation A4 "APU" SoC from AMD (v2, 2015), replacing the current A4 "Kabini" SoC which was AMD's major foray into tablets/netbooks, itself replacing the older "Brazos" E-Series APUs. While still at a default 15W TDP, it can be "powered down" to a lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update to both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • Crypto: Random number generator.
  • Security: Platform Security Processor (PSP) included in the SoC (ARM based).

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing Mullins with its predecessor (Kabini) as well as its competition from Intel.

| APU Specifications | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| Cores (CU) / Threads (SP) | 4C / 4T | 2C / 4T | 4C / 4T | 4C / 4T | We still have 4 cores and 4 threads, just like Atom (old and new) – only Core M has 2 cores with HT – we shall see whether this makes a big difference. |
| Speed (Min / Max / Turbo) | 480-1600-2400 (6x-20x-30x) | 500-800-2000 (5x-8x-20x) | 1000-1500 (10x-15x) | 1000-1800-2400 (10x-18x-24x) | Mullins is clocked a bit higher (1.8GHz vs. 1.5GHz – 20% faster) but also supports Turbo (up to 2.4GHz – up to 60% faster), which should give it a big advantage over old Kabini. Both Atom and Core M also depend on opportunistic Turbo for most of their performance. As Mullins/Kabini are 15W rated, they should be able to Turbo higher and for longer – at least in theory. |
| Power (TDP) | 2.4W | 4.5W | 15W | 15W [=] | TDP remains the same at 15W, which is a bit disappointing considering the new Atom is 2.4-4W rated – we're talking 3-5x (up to five times) more power! |
| L1D / L1I Caches | 4x 24kB 6-way / 4x 32kB 8-way | 2x 32kB 8-way / 2x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 2-way | 4x 32kB 8-way / 4x 32kB 2-way | No change in the L1 caches, which pretty much match Atom's; comparatively Core M has half the caches. |
| L2 Caches | 2x 1MB 16-way | 4MB 16-way | 2MB 16-way | 2MB 16-way | No change in the L2 cache; here Core M has twice as much cache – the same size as a normal ULV i7. It's a pity AMD was not able to increase the caches. |

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets (AVX2, AVX, etc.). Haswell introduced AVX2, which allows 256-bit integer SIMD (AVX only allowed 128-bit), and FMA3 – finally bringing "fused multiply-add" to Intel CPUs (see the sketch below). We are now seeing CPUs getting as wide as GP(GPU)s – not far from the 512-bit AVX3/AVX512 in "Phi".
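For reference, FMA3 collapses the multiply-add at the heart of such kernels into one instruction with a single rounding step (a C++ sketch; compile with FMA enabled, e.g. -mfma):

```cpp
#include <immintrin.h>

// a*b + c over 8 floats in one fused instruction (one rounding step),
// versus separate _mm256_mul_ps + _mm256_add_ps on plain AVX.
__m256 mul_add(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}
```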

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Arithmetic Native

| Native Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| CPU Arithmetic Benchmark Native Dhrystone (GIPS) | 35.91 SSE4 [+20%] | 31.84 AVX2 | 25.12 SSE4 | 28.77 SSE4 [+18%] | Mullins, like Kabini, has no AVX2 but is still 18% faster than it (clocked 20% faster); unfortunately the new Atom manages to beat it. Not the best of starts. |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 18.43 AVX [+13%] | 21.56 AVX/FMA | 13.55 AVX | 16.2 AVX [+19%] | Mullins has no FMA either, so is again 19% faster than Kabini – it shows the ALUs and FPUs are unchanged. Again Atom manages to be faster. |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 12.34 AVX [+23%] | 13.49 AVX/FMA | 8.44 AVX | 10 AVX [+18%] | With FP64 we see the same 18% difference – and Atom is still 20% faster. |

We only see an 18-19% improvement in Mullins – in line with clock speed (+20%), with Turbo not doing much. The new CherryTrail Atom is thus 14-21% faster than it, not something you expect considering the hugely different TDPs. Time does not stand still and Mullins is outclassed here.

SIMD Native

| Native Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 48.7 AVX [-20%] | 70.8 AVX2 | 58.37 | 60.76 [+4%] | Without AVX2, Mullins can only manage a paltry 4% improvement over Kabini – the only "silver lining" is that Atom is 20% slower than it, unlike what we saw before. Naturally Core M with AVX2 runs away with it. |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 14.5 AVX [-38%] | 24.5 AVX2 | 21.92 AVX | 23.26 AVX [+6%] | With a 64-bit integer workload the improvement increases to 6%, less than the clock difference – but thankfully Atom is much slower (by half); naturally Core M is the winner. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 0.512 [+75%] | 0.382 | 0.246 | 0.292 [+18%] | This is a tough test using Long integers to emulate Int128, but here we see the full 18% improvement over Kabini – but now Atom is a huge 75% faster! |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 41.5 AVX [-4%] | 61.3 FMA | 36.91 AVX | 43 AVX [+16%] | In this floating-point AVX/FMA algorithm, Mullins returns to being 16% faster and a whisker (4%) faster than Atom. With FMA, Core M is almost 50% faster still. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 15.9 AVX [-31%] | 36.48 FMA | 19.57 AVX | 22.9 AVX [+17%] | Switching to FP64 code, we see a 17% improvement for Mullins, which allows it to be 30% faster than Atom. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 0.81 AVX [-37%] | 1.69 FMA | 1.27 AVX | 1.27 AVX [=] | In this heavy algorithm using FP64 to mantissa-extend FP128, we see no improvement whatsoever – at least Atom is 37% slower; and yes, Core M is faster still. |

Lack of AVX2/FMA and a Turbo that does not seem to engage make Mullins struggle to be more than 16-18% faster than Kabini – but thankfully it can beat its Atom rival, sometimes by a good 30%. All in all, it does better with SIMD code than we've seen elsewhere, though without any core changes it is showing its age…

Cryptography Native

| Native Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchCrypt Crypto AES-256 (GB/s) | 1.44 AES HWA [-55%] | 2.59 AES HWA | 3 AES HWA | 3.14 AES HWA [+5%] | All three CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – here Mullins is 5% faster than Kabini and 2x (twice) as fast as Atom; it even overtakes Core M with its dual-channel controller! |
| BenchCrypt Crypto AES-128 (GB/s) | 2 AES HWA [-40%] | ? AES HWA | 3 AES HWA | 3.3 AES HWA [+10%] | What we saw with AES-256 was no fluke: fewer rounds do make some difference – Mullins is now 10% faster, and 65% faster than Atom! |
| BenchCrypt Crypto SHA2-256 (GB/s) | 0.572 AVX [-24%] | 0.93 AVX2 | 0.708 AVX | 0.659 AVX [-7%] | In this tough AVX compute test, Mullins is unexpectedly 7% slower than the old Kabini – but Atom is still slower. With SHA HWA in the next Atom, though, AMD will quickly be at a big disadvantage… |
| BenchCrypt Crypto SHA1 (GB/s) | 1.17 AVX [-23%] | ? AVX2 | 1.42 AVX | 1.34 AVX [-6%] | With a less complex algorithm we still see Mullins 6% slower than Kabini – and again Atom is slower. |
| BenchCrypt Crypto SHA2-512 (GB/s) | ? AVX | ? AVX2 | 0.511 AVX | 0.477 AVX [-7%] | Using 64-bit integers, this is pretty much the most complex hashing algorithm and thus tough for all CPUs – and here we see Mullins 7% slower again. |

Mullins misses both AVX2 and the forthcoming SHA HWA – but manages to extract more memory bandwidth and is thus 5-10% faster than Kabini with AES, and also much faster than Atom. Somehow it manages to be slower at hashing whatever the algorithm – but remains much faster than Atom.

Financial Analysis Native

| Native Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchFinance Black-Scholes float/FP32 (MOPT/s) | 14.15 [-26%] | 21.88 | 17.55 | 18.87 [+7%] | In this non-SIMD test we start with a 7% improvement over Kabini – good, but less than we expected – but at least faster than Atom. |
| BenchFinance Black-Scholes double/FP64 (MOPT/s) | 11.66 [-27%] | 17.6 | 13.43 | 15.82 [+18%] | Switching to FP64 code, we see a good 18% improvement – and victory over Atom again. |
| BenchFinance Binomial float/FP32 (kOPT/s) | 3.85 [-22%] | 5.1 | 1.31 | 4.33 [+3.3x] | Binomial uses thread-shared data and thus stresses the cache & memory system; Mullins has improved by a huge 3.3x (over three times) – a great win over Atom (but not Core M). It seems the memory improvements do help a lot. |
| BenchFinance Binomial double/FP64 (kOPT/s) | 4.13 [-31%] | 5.02 | 2.66 | 5.91 [+2.2x] | With FP64 code, Mullins is "only" 2.2x (over two times) faster than Kabini – and this again gives it a big win over its Atom competition, as well as over Core M! |
| BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 2.67 [-29%] | 3.73 | 3.23 | 3.71 [+15%] | Monte-Carlo also uses thread-shared data, but read-only, reducing modify pressure on the caches; here Mullins is just 15% faster, but it's enough to tie with Core M and leave Atom in the dust. |
| BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 2.43 [-30%] | 3.0 | 2.67 | 3.43 [+28%] | Switching to FP64 we see a big 28% improvement – Mullins manages to beat both Atom and Core M by a good measure. |

Somehow Mullins managed to redeem itself – beating Atom in all tests and even Core M in some of them. Running financial tests on a Mullins tablet should work better than on an Atom or Core M one.

Scientific Analysis Native

| Native Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchScience SGEMM (GFLOPS) float/FP32 | 12.42 AVX [-7%] | 13.99 FMA | 11.43 AVX | 13.22 AVX [+16%] | In this tough SIMD algorithm, Mullins sees a good 16% improvement, beating Atom and getting within a whisker of Core M – even without FMA. |
| BenchScience DGEMM (GFLOPS) double/FP64 | 6.09 AVX [-13%] | 9.61 FMA | 6.62 AVX | 6.93 AVX [+5%] | With FP64 SIMD code, Mullins is just 5% faster – but it's enough to beat Atom, though not Core M. Still, a good improvement. |
| BenchScience SFFT (GFLOPS) float/FP32 | 5.17 AVX [+72%] | 4.11 FMA | 2.91 AVX | 3 AVX [+3%] | FFT also uses SIMD and thus AVX but stresses the memory sub-system more: here Mullins sees only a 3% improvement, not enough to beat Atom, which is over 70% faster still. Mullins has its limits. |
| BenchScience DFFT (GFLOPS) double/FP64 | 2.83 AVX [+52%] | 3.66 FMA | 1.54 AVX | 1.85 AVX [+20%] | With FP64 code, Mullins improves by a large 20% – but again not enough to beat Atom, which is now over 50% faster still. |
| BenchScience SNBODY (GFLOPS) float/FP32 | 1.58 AVX [-31%] | 2.64 FMA | 2.1 AVX | 2.27 AVX [+8%] | N-Body simulation is SIMD-heavy but with many memory accesses to shared data, so Mullins is just 8% faster than Kabini – yet enough to beat Atom (by 43%). Unlike FFT, N-Body again agrees with Mullins/Kabini. |
| BenchScience DNBODY (GFLOPS) double/FP64 | 1.76 AVX [-30%] | 3.71 FMA | 2.09 AVX | 2.51 AVX [+20%] | With FP64 code Mullins improves again by 20% – now more than enough to beat Atom (by 42%), though not enough to beat Core M. |

With highly optimised SIMD AVX code, Mullins sees a 5-20% improvement – which allows it to beat Atom in most tests – a good result after the rout we saw before.

Inter-Core Native

| Native Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchMultiCore Inter-Core Bandwidth (GB/s) | 1.69 [-52%] | 8.5 | 3.0 | 3.47 [+15%] | With unchanged L1/L2 caches Mullins relies on its higher rated speed – and 15% is a good improvement over Kabini. It has 2x (two times) the bandwidth Atom manages to muster – but is way below Core M, which has over 2.4x more still. We see how all these caches perform in the Cache & Memory AMD A4 "Mullins" performance article. |
| BenchMultiCore Inter-Core Latency (ns) | 179 [1/3.5x] | 76 | 66 | 31 [-33%] | Latency, however, sees a massive 33% decrease, more than we'd expect – and is surprisingly way lower than Atom's (1/3.5x) and even lower than Core M's (1/2x). |

While it brings no new instruction sets (AVX2, FMA, SHA HWA) and its Turbo does not seem to engage, Mullins's 20% clock improvement does show and brings a corresponding 5-19% performance increase over Kabini in most tests.

Against Atom, the scores are all over the place: sometimes Atom (CherryTrail) is 20-70% faster, other times Mullins is 20-55% faster. If they were rated the same TDP-wise that would be a good result – but as Mullins is rated 15W vs. 2.6-4W, it is not really power efficient. Core M is invariably faster than either in just about all tests.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 8.x/10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers. .Net 4.5.x, Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

.Net Arithmetic

| VM Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchDotNetAA .Net Dhrystone (GIPS) | 3.85 [-33%] | 2.95 | 4.0 | 5.78 [+44%] | .Net CLR performance improves by a huge 44% – a great start, in line with the rated clock increase and enough to beat Atom by 33%. |
| BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) | 9 [-3%] | 10.8 | 12.49 | 9.24 [-23%] | Floating-point CLR performance takes a 23% hit vs. Kabini – but is thankfully still a bit faster than Atom (by just 3%). Something in the new CLR does not agree with it. |
| BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) | 10 [-20%] | 12.71 | 13.39 | 12.49 [-7%] | FP64 CLR performance also sees a decrease, a more modest 7%, but is again still 20% faster than Atom. |

With .Net we see a big variation, from 23% lower to 44% higher performance than Kabini, but in all cases higher than Atom (by 3-33%). It is strange to see such a big variance, but the CLR changes may have something to do with it.

.Net Vectorised

| VM Benchmarks | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | Comments |
|---|---|---|---|---|---|
| BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) | 11.34 [+7.8%] | 11.25 | 7.12 | 10.52 [+47%] | Just as we saw with Dhrystone, this integer workload sees a 47% improvement with Mullins – something in the CLR does agree with it – though not as much as with Atom, which is 8% faster still. |
| BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) | 6.36 [-12%] | 5.62 | 1.78 | 7.2 [+4.04x] | With a 64-bit integer vectorised workload, we see a massive 4x (four times) improvement over Kabini – and 12% faster than Atom. |
| BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) | 2.6 [+25%] | 4.14 | 2.01 | 2.08 [+3%] | Switching to single-precision (FP32) floating-point code, we see only a minor 3% improvement – and here Atom is 25% faster still. |
| BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) | 6.25 [-12%] | 9.36 | 5.97 | 7.1 [+19%] | Switching to FP64 code, Mullins is 19% faster (in line with the clock increase) and 12% faster than Atom. While heavy compute tasks are unlikely to be written in .Net rather than native code, small compute kernels do benefit. |

Vectorised .Net improves by 3-47% over Kabini (except the 64-bit integer "fluke"), and is thus sometimes faster, sometimes slower than Atom.

We see a big variation here – unlike what we saw with native / SIMD code – likely due to CLR changes – but generally welcome. Against Atom we see an even larger variation – faster and slower, but overall competitive.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Despite no core changes, Mullins is helped by its higher rated clock (and Turbo, when it works) – which gives it a good 5-19% performance improvement over its Kabini predecessor in most tests. Any new instruction-set support (AVX2, FMA, SHA HWA, etc.) will have to wait for the next model.

Unfortunately, time does not stand still – Atom has seen more core improvements (though no new instruction support either) – which makes it a far tougher competitor than what Kabini had to deal with. While competitive with it performance-wise, Mullins is rated at 3-5x the power (15W vs. 2.6-4W), so its power efficiency is low. While it can be "powered down" to hit a lower TDP, performance will naturally suffer – and then it would no longer be competitive with Atom.

Previously, AMD APUs relied on their much more powerful GPUs (see AMD A4 “Mullins” APU GPGPU (Radeon R4) performance) to make up for lower CPU performance and power efficiency – but now the latest Intel APUs (be they Atom or Core) are very much competitive – thus their main advantage has gone.

The only advantage would be cost – assuming that Mullins is priced much lower than even Atom, though that is difficult to see. Thus there are few scenarios where Mullins would be the top choice.

We’ll have to wait for the next AMD APU model – though, again, time does not stand still – and future Atom/Core M models will bring brand-new goodies (DDR4, new instruction sets, etc.) which may well make even tougher opposition. We shall have to wait and see…

AMD A4 “Mullins” APU GPGPU (Radeon R4): Time does not stand still…


What is “Mullins”?

"Mullins" (ML) is the next-generation A4 "APU" SoC from AMD (v2, 2015), replacing the current A4 "Kabini" SoC which was AMD's major foray into tablets/netbooks, itself replacing the older "Brazos" E-Series APUs. While still at a default 15W TDP, it can be "powered down" to a lower TDP where required – similar to what Intel has done with the ULV Core versions.

While Kabini was a major update to both CPU and GPU vs. Brazos, Mullins is a minor drop-in update adding just a few features while waiting for the next generation to take over:

  • Turbo: Where possible within power envelope Mullins can now Turbo to higher clocks.
  • Clock: Model replacements (e.g. A4-6000 vs. 5000) are clocked faster.
  • GPU: Core remains the same (GCN).

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Previous and next articles regarding AMD GPGPU performance:

Hardware Specifications

We are comparing the internal GPUs of the new AMD APU with the old version as well as its competition from Intel.

| Graphics Unit | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comment |
|---|---|---|---|---|---|
| Graphics Core | B-GT EV8 | B-GT2Y EV8 | GCN | GCN | There is no change in the GPU core in Mullins; it appears to be a re-brand from 83XX to R4. But time does not stand still: while Kabini went against BayTrail's "crippled" EV7 (IvyBridge) GPU, Mullins must battle the "beefed-up" brand-new EV8 (Broadwell) GPU. We shall see if the old GCN core is enough… |
| APU / Processor | Atom X7 Z8700 (CherryTrail) | Core M 5Y10 (Broadwell-Y) | A4-5000 (Kabini) | A4-6000? (Mullins) | The series has changed but not much else, not even the CPU core. |
| Cores (CU) / Shaders (SP) / Type | 16C / 128SP (2×4 SIMD) | 24C / 192SP (2×4 SIMD) | 2C / 128SP | 2C / 128SP [=] | We still have 2 GCN Compute Units, but now they go against 16 EV8 units rather than 4 EV7 units. You can see just how much Intel has improved the Atom GPGPU from generation to generation while AMD has not. Will this cost them dearly? |
| Speed (Min / Max / Turbo) MHz | 200 – 600 | 200 – 800 | 266 – 496 | 266 – 500 [=] | Nope, the clock has not changed either in Mullins. |
| Power (TDP) W | 2.4 (under 4) | 4.5 | 15 | 15 [=] | As before, Intel's designs have a crushing advantage over AMD's: both Kabini and Mullins are rated at least 3x (three times) higher power than Core M and as much as 5-6x more than the new Atom. Powered-down versions (6W?) would still consume more while performing worse. |
| DirectX / OpenGL / OpenCL Support | 11.1 (12?) / 4.3 / 1.2 | 11.1 (12?) / 4.3 / 2.0 | 11.2 (12?) / 4.5 / 1.2 | 11.2 (12?) / 4.5 / 2.0 | GCN supports DirectX 11.2 (not a big deal), OpenGL 4.5 (vs. 4.3 on Intel, including Compute) and OpenCL 2.0 (same). All designs should benefit from Windows 10's DirectX 12. So while AMD supports newer versions of the standards, there's not much in it. |
| FP16 / FP64 Support | No / No (OpenCL), Yes (DirectX, OpenGL) | No / No (OpenCL), Yes (DirectX, OpenGL) | No / Yes | No / Yes | Sadly even AMD does not support FP16 (why?), but it does support FP64 (double-float) in all interfaces – while the Atom/Core GPU does only in DirectX and OpenGL. Few people would elect to run heavy FP64 compute on these GPUs, but it's good to know it's there… |
| Threads per CU | 256 (256x256x256) | 512 (512x512x512) | 256 (256x256x256) | 256 (256x256x256) | GCN has traditionally not supported a large number of threads-per-CU (256) and here is no different, with Intel's GPU now supporting twice as many (512) – whether this will make a difference remains to be seen. |

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (July 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

AMD Mullins: GPGPU Vectorised

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Arithmetic Single/Float/FP32 Vectorised OpenCL (Mpix/s) | 160.5 | 181.7 | 165.3 | 163.9 [-1%] | Straight off the bat, we see no change in score for Mullins; however, Atom has caught up – scoring within a whisker – and Core M is faster still (+13%). Not what AMD is used to seeing, for sure. |
| GPGPU Arithmetic Half/Float/FP16 Vectorised OpenCL (Mpix/s) | 160 | 180 | 165 | 163.9 [-1%] | As FP16 is not supported by any of the GPUs, unsurprisingly the results don't change. |
| GPGPU Arithmetic Double/FP64 Vectorised OpenCL (Mpix/s) | 10.1 (emulated) | 11.6 (emulated) | 13.4 | 14 [+4%] | We see a tiny 4% improvement in Mullins, but due to native FP64 support it is almost 40% faster than both Intel GPUs. |
| GPGPU Arithmetic Quad/FP128 Vectorised OpenCL (Mpix/s) | 1.08 (emulated) | 1.32 (emulated) | 0.731 (emulated) | 0.763 [+4%] (emulated) | No GPU supports FP128, but GCN can emulate it using FP64 while EV8 needs to use more complex FP32 maths. Again we see a 4% improvement in Mullins, but despite FP64 support both Intel GPUs are much faster. Sometimes the FP64/FP32 ratio is so low that it's not worth using FP64 and emulation can be faster (e.g. nVidia). |

AMD Mullins: GPGPU Crypto

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) | 825 | 770 | 998 | 1024 [+3%] | In this tough integer workload that uses shared memory (as cache), Mullins only sees a 3% improvement. GCN shows its power, being 25% faster than Intel's GPUs – TDP notwithstanding. |
| GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) | 1106 | ? | 1280 | 1423 [+11%] | With fewer rounds, Mullins is now 11% faster – finally a good improvement – and again 28% faster than Atom's GPU. |

AMD Mullins: GPGPU Hash

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) | 309 | ? | 59 | 282 [+4.7x] | This 64-bit integer compute-heavy workload seems to have triggered a driver bug in Kabini, since Mullins is almost 5x (five times) faster – perhaps 64-bit integer operations were emulated using int32 rather than run natively? Surprisingly, Atom's EV8 is faster (+9%) – not something we'd expect to see. |
| GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) | 1187 | 1331 | 1618 | 1638 [+1%] | In this integer compute-heavy workload, Mullins is just 1% faster (within the margin of error) – which again shows the GPU has not changed at all vs. the older Kabini. At least it's faster than both Intel GPUs – 38% faster than Atom's. |
| GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) | 2764 | ? | 2611 | 3256 [+24%] | SHA1 is less compute-heavy, but here we see a 24% Mullins improvement – again likely a driver "fix". This allows it to beat Atom's GPU by 17% – showing that driver optimisations can make a big difference. |

AMD Mullins: GPGPU Financial

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) | 299.8 | 280.5 | 248.3 | 326.7 [+31%] | Starting with the financial tests, Mullins flies off with a 31% improvement over the old Kabini – which is just as well, as both Intel GPUs are coming on strong – it's 9% faster than Atom's GPU. One thing's for sure: Intel's EV8 GPU is no slouch. |
| GPGPU Finance Benchmark Black-Scholes FP64 OpenCL (MOPT/s) | n/a (no FP64) | n/a (no FP64) | 21 | 21.2 [+1%] | AMD's GCN supports native FP64, but here Mullins is just 1% faster than Kabini (within the margin of error), unable to replicate the FP32 improvement we saw. |
| GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) | 28 | 36.5 | 32.3 | 30.9 [-4%] | Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and here Mullins somehow manages to be slower (-4%), likely due to driver differences. Both Intel GPUs are coming on strong, with Core M's GPU 20% faster. Considering how fast GCN's shared memory is, we expected better. |
| GPGPU Finance Benchmark Binomial FP64 OpenCL (kOPT/s) | n/a (no FP64) | n/a (no FP64) | 1.85 | 1.87 [+1%] | Switching to FP64 on AMD's GPUs, Mullins is now 1% faster (within the margin of error). Luckily for it, Intel's GPUs do not support FP64. |
| GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) | 61.9 | 54.9 | 32.9 | 46.3 [+40%] | Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Mullins is 40% faster here (again a driver change) – but surprisingly cannot match either Intel GPU, with Atom's GPU 32% faster! Again we see just how much Intel has improved the GPU in Atom – perhaps too much! |
| GPGPU Finance Benchmark Monte-Carlo FP64 OpenCL (kOPT/s) | n/a (no FP64) | n/a (no FP64) | 5.39 | 5.59 [+3%] | Switching to FP64 we now see a small 3% improvement for Mullins – better than the 1% we saw before… |

AMD Mullins: GPGPU Scientific

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) | 45 | 44.1 | 43.5 | 41.5 [-5%] | GEMM is quite a tough algorithm for our GPUs and Mullins manages to be 5% slower than Kabini – again this allows Intel's GPUs to win, with Atom's GPU just 8% faster – but a win is a win. Mullins's GPU is starting to look underpowered considering its much higher TDP. |
| GPGPU Science Benchmark DGEMM FP64 OpenCL (GFLOPS) | n/a (no FP64) | n/a (no FP64) | 4.11 | 3.73 [-9%] | Switching to FP64, Mullins now manages to be 9% slower than Kabini – thankfully Intel's GPUs don't support FP64… |
| GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) | 9 | 8.94 | 7.89 | 9.5 [+20%] | FFT involves many kernels processing data in a pipeline – and Mullins now manages to be 20% faster than Kabini – again just as well, as Intel's GPUs are hot on its tail – it is just 5% faster than Atom's GPU! |
| GPGPU Science Benchmark DFFT FP64 OpenCL (GFLOPS) | n/a (no FP64) | n/a (no FP64) | 2.2 | 3 [+36%] | Switching to FP64, Mullins is now 36% faster than Kabini – again likely a driver improvement. |
| GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) | 65 | 50 | 58 | 63 [+9%] | In our last test we see Mullins 9% faster – but not enough to beat Atom's GPU, which is slightly faster still. Did anybody expect that? |
| GPGPU Science Benchmark N-Body FP64 OpenCL (GFLOPS) | n/a (no FP64) | n/a (no FP64) | 4.75 | 4.74 [=] | Switching to FP64, Mullins scores exactly the same as Kabini. |
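The "emulated" FP128 entries above rely on mantissa-extending pairs of FP64 values (double-double arithmetic). The core building block is an error-free sum; here is a minimal C++ sketch of the standard two-sum step (our illustration of the general technique – the actual GPU kernels are more involved; compile without fast-math so the compiler keeps the rounding steps):

```cpp
// Represent an extended-precision value as an unevaluated sum hi + lo.
struct dd { double hi, lo; };

// Error-free addition (Knuth's two-sum): s is the rounded sum and err
// recovers exactly the bits that rounding discarded, so s + err == a + b.
inline dd two_sum(double a, double b) {
    double s   = a + b;
    double bv  = s - a;                      // portion of b inside s
    double err = (a - (s - bv)) + (b - bv);  // what rounding lost
    return { s, err };
}
```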

Firstly, Mullins's GPU scores are unchanged from Kabini's; due to driver optimisations/fixes (as well as kernel optimisations) Mullins is sometimes faster, but that is not due to any hardware changes. If you were expecting more, you will be disappointed.

Intel's EV8 GPUs in the new Atom (CherryTrail) as well as Core M (Broadwell) can now keep up with it and even beat it in some tests. The crushing GPGPU advantage AMD's APUs used to have is long gone. Considering the TDP differences (4-5x higher), Mullins's GPU looks underpowered – the number of cores should have at least been doubled to maintain its advantage.

Unless Atom (CherryTrail) is more expensive, there's really no reason to choose Mullins – the power advantage of Atom is hard to deny. The only positive for AMD is that Core M looks uncompetitive vs. Atom itself, but then again Intel's 15W ULV designs are far more powerful.

Transcoding Performance

We are testing the transcoding performance of the GPUs using their hardware transcoders with popular media formats: H.264/MP4, AVC1, H.265/HEVC.

Results Interpretation: Higher values (MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

| Graphics Processors | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| H.264/MP4 Decoder/Encoder | QuickSync H264 (hardware accelerated) | QuickSync H264 (hardware accelerated) | AMD h264Encoder (hardware accelerated) | AMD h264Encoder (hardware accelerated) | Both vendors use their own hardware-accelerated transcoders for H264. |

AMD Mullins: Transcoding H264

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| Transcode Benchmark H264 > H264 Transcoding (MB/s) | 5 | ? | 2 | 2.14 [+7%] | We see a small but useful 7% bandwidth improvement in Mullins vs. Kabini, but even Atom is over 2x (twice) as fast. |
| Transcode Benchmark WMV > H264 Transcoding (MB/s) | 4.75 | ? | 2 | 2.07 [+3.5%] | When just using the H264 encoder we only see a small 3.5% bandwidth improvement in Mullins. Again, Atom is over 2x as fast. |

We see a minor 3.5-7% improvement in Mullins, but the new Atom blows it out of the water – it is over twice as fast transcoding H.264! Poor Mullins/Kabini are not even a good fit for HTPC (NUC/Brix) boxes…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.
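For context, "in-page random" latency is commonly measured with a pointer chase, where each load's address depends on the previous load so prefetchers and out-of-order execution cannot hide the latency. A minimal CPU-side C++ sketch of the technique (our illustration – Sandra's actual GPGPU kernels are not published):

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 1u << 22;           // 4M entries, ~32 MB
    std::vector<std::size_t> next(N);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: a random single-cycle permutation, so the
    // chase visits every element exactly once before repeating.
    std::mt19937_64 rng{42};
    for (std::size_t i = N - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Each iteration's load address depends on the previous load.
    auto t0 = std::chrono::steady_clock::now();
    std::size_t i = 0;
    for (std::size_t k = 0; k < N; ++k) i = next[i];
    auto dt = std::chrono::steady_clock::now() - t0;

    std::printf("end=%zu  ~%.1f ns/load\n", i,  // print i so the loop isn't elided
        std::chrono::duration<double, std::nano>(dt).count() / double(N));
}
```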

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

| Graphics Processors | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| Memory Configuration | 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) | 4GB DDR3 1.6GHz 128-bit (shared with CPU) | 4GB DDR3 1.6GHz 64-bit (shared with CPU) | 4GB DDR3 1.6GHz 64-bit (shared with CPU) | Except for Core M, all APUs have a single memory controller, though Atom can also be configured with 2 channels. |
| Constant (kB) / Shared (kB) Memory | 64 / 64 | 64 / 64 | 64 / 32 | 64 / 32 | Surprisingly, AMD's GCN has half the shared memory of Intel's EV8 (32 vs. 64), but considering the low number of threads-per-CU (256), only kernels making very heavy use of shared memory would be affected; still, better more than less. |
| L1 / L2 / L3 Caches (kB) | 256kB? L2 | 256kB? L2 | 16kB? L1 / 256kB? L2 | 16kB? L1 / 256kB? L2 | Cache sizes are always pretty "hush hush", but since the core has not changed we would expect the same cache sizes – with GCN also sporting an L1 data cache. |

AMD Mullins: GPGPU Memory Bandwidth

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) | 11 | 10.1 | 8.8 | 5.5 [-38%] | OpenCL memory performance has surprisingly taken a big hit in Mullins, most likely a driver bug. We shall see whether DirectX Compute is similarly affected. |
| GPGPU Memory Bandwidth Upload Bandwidth (GB/s) | 2.09 | 3.91 | 4.1 | 2.88 [-30%] | Upload bandwidth is similarly affected; we measure a 30% decrease. |
| GPGPU Memory Bandwidth Download Bandwidth (GB/s) | 2.29 | 3.79 | 3.18 | 2.9 [-9%] | Download bandwidth is the least affected, just 9% lower. |

AMD Mullins: GPGPU Memory Latency

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) | 829 | 274 | 1973 | 693 [-1/3x] | Even though the core is unchanged, latency is about 1/3 of Kabini's. Since we don't see a comparable increase in performance, this again points to a driver issue. |
| GPGPU Memory Latency Global Memory (Full Random) Latency (ns) | 1669 | ? | ? | 817 | Going out-of-page does not increase latency much. |
| GPGPU Memory Latency Global Memory (Sequential) Latency (ns) | 279 | ? | ? | 377 | Sequential access brings the latency down to about 1/2, showing the prefetchers do a good job. |
| GPGPU Memory Latency Constant Memory Latency (ns) | 209 | ? | 629 | 401 [-33%] | The L1 (16kB) cache does not cover the whole constant memory (64kB) – and latency is not lower than global memory's; there is no advantage to using constant memory. |
| GPGPU Memory Latency Shared Memory Latency (ns) | 201 | ? | 20 | 16 [-20%] | Shared memory is a little bit faster (20% lower latency). Shared-memory latency is also much lower than constant/global latency (16 vs. 401), denoting dedicated shared memory. On Intel's EV8 GPU there is no change (201 vs. 209), which would indicate global memory being used as shared memory. |
| GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) | 1234 | ? | 2369 | 691 [-70%] | We see a massive latency reduction – again likely a driver optimisation/fix. |
| GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) | 353 | ? | ? | 353 | Sequential access brings the latency down substantially – showing the power of the prefetchers. |

The optimisations in newer drivers make a big difference – though the same could apply to the previous generation (Kabini). The dedicated shared memory – compared to Intel's GPUs – likely helps GCN achieve its performance.

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

AMD Mullins: Video Shader

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) | 127.6 | ? | 128.8 | 129.6 [=] | Starting with DirectX FP32, we see no change in Mullins – not even the DirectX driver has changed. |
| Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) | 121.8 | 172 | 124 | 124.4 [=] | OpenGL does not change matters; Mullins scores exactly the same as its predecessor. But here we see Core M pulling ahead, an unexpected change. |
| Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) | 109.5 | ? | 124 | 124 [=] | As FP16 is not supported by any of the GPUs and is promoted to FP32, the results don't change. |
| Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) | 121.8 | 170 | 124 | 124 [=] | As FP16 is not supported by any of the GPUs and is promoted to FP32, the results don't change either. |
| Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) | 18 | ? | 8.9 | 8.91 [=] | Unlike its OpenCL driver, Intel's DirectX driver does support FP64 – which allows Atom's GPU to be at least 2x (twice) as fast as Mullins/Kabini. |
| Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) | 26 | 46 | 12 | 12 [=] | As above, Intel's OpenGL driver supports FP64 too – so all GPUs run native FP64 code again. This allows Atom's GPU to be over 2x faster than Mullins/Kabini again – while Core M's GPU is almost 4x (four times!) faster. |
| Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) | 1.34 (emulated) | ? | 1.6 (emulated) | 1.66 (emulated) [+3%] | Here we're emulating (mantissa-extending) FP128 using FP64: EV8 stumbles a bit, allowing Mullins/Kabini to be a little faster despite what we saw in the FP64 test. |
| Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) | 1.1 (emulated) | 3.4 (emulated) | 0.738 (emulated) | 0.738 (emulated) [=] | OpenGL does change the results a bit: Atom's GPU is now faster (+50%) while Core M's GPU is far faster (+5x). Heavy shaders seem to take their toll on GCN. |

Unlike GPGPU, here Mullins scores exactly the same as Kabini – neither the DirectX nor the OpenGL driver seems to make a difference. What is different is that Intel's GPUs support FP64 natively in both DirectX and OpenGL – making them 3-5x faster than AMD's GCN. If the OpenCL driver were to support it, AMD would be in trouble!

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest AMD and Intel drivers (Jul 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

AMD Mullins: Video Bandwidth

| Benchmark | CherryTrail GT | Broadwell GT2Y – HD 5300 | Kabini – Radeon HD 8330 | Mullins – Radeon R4 | Comments |
|---|---|---|---|---|---|
| Video Memory Benchmark Internal Memory Bandwidth (GB/s) | 11.18 | 12.46 | 8 | 9.7 [+21%] | DirectX bandwidth does not seem to be affected by the OpenCL "bug": here we see Mullins having 21% more bandwidth than Kabini using the very same memory. Perhaps the memory controller has seen some improvements after all. |
| Video Memory Benchmark Upload Bandwidth (GB/s) | 2.83 | 5.29 | 3 | 3.61 [+20%] | While APUs don't need to transfer memory over PCIe like dedicated GPUs, they still don't support "zero copy" – thus memory transfers are not free. Again Mullins does well, with 20% more bandwidth. |
| Video Memory Benchmark Download Bandwidth (GB/s) | 2.1 | 1.23 | 3 | 3.34 [+11%] | Download bandwidth improves by 11% only, but better than nothing. |

Unlike OpenCL, we see DirectX bandwidth increased by 11-20% – while using the same memory. Hopefully AMD will "fix" the OpenCL issue, which should help kernel performance no end.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mullins's GPU is unchanged from its predecessor's (Kabini), but a few driver optimisations/fixes allow it to score better in many tests by a small margin – however, these would also apply to the older devices. There isn't really more to be said – nothing has really changed.

But time does not stand still – and now Intel’s EV8 GPU that powers the new Atom (CherryTrail) as well as Core M (Broadwell) is hot on its heels and even manages to beat it in some tests – not something we’re used to seeing in AMD’s APUs. Mullins’s GPU is looking underpowered.

If we now remember that Mullins's TDP is 15W vs. Atom's 2.6-4W or Core M's 4.5W, it's really not looking good for AMD: its CPU performance is unlikely to be much better than Atom's (we shall see in the CPU AMD A4 "Mullins" performance article), and at 3-5x (three to five times) more power it is woefully power-inefficient.

Let’s hope that the next generation APUs (aka Mullins’ replacement) perform better.