Intel Atom X7 (CherryTrail 2015) GPGPU: Closing on Core M?


What is CherryTrail (Braswell)?

“CherryTrail” (CYT) is the next-generation Atom “APU” SoC from Intel (v3, 2015), replacing the current Z3000 “BayTrail” (BYT) SoC which was Intel’s major foray into tablets (both Windows & Android). The “desktop” APUs are known as “Braswell” (BRS), while the APUs for other platforms have different code names.

BayTrail was a major update to both the CPU (OoO core, SSE4.x, AES HWA, Turbo/dynamic overclocking) and the GPU (EV7 IvyBridge GPGPU core), so CherryTrail is a minor process shrink – but with a very much updated GPGPU, now EV8 (as in the latest Core Broadwell).

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Hardware Specifications

We are comparing the internal GPUs of 3 processors (BayTrail, CherryTrail and Broadwell-Y) that support GPGPU.

Graphics Unit BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comment
Graphics Core B-GT EV7 B-GT EV8? B-GT2Y EV8 CherryTrail’s GPU is meant to be based on EV8 like Broadwell – the very latest GPGPU core from Intel! This makes it more advanced than the very popular Core Haswell series, a first for Atom.
APU / Processor Atom Z3770 Atom X7 Z8700 Core M 5Y10 Core M is the new Core-Y ULV line for high-end tablets, competing against the Atom processors for “normal” tablets/phones.
Cores (CU) / Shaders (SP) / Type 4C / 32SP (2×4 SIMD) 16C / 128SP (2×4 SIMD) 24C / 192SP (2×4 SIMD) Here’s the major change: CherryTrail has no less than 4 times the compute units (CU) in the same power envelope as the old BayTrail. Broadwell has more (24) but it is also rated at a higher TDP.
Speed (Min / Max / Turbo) MHz 333 – 667 200 – 600 200 – 800 CherryTrail goes down all the way to 200MHz (same as Broadwell) which should help power savings. Its top speed is a bit lower than BayTrail but not by much.
Power (TDP) W 2.4 (under 4) 2.4 (under 4) 4.5 Both Atoms have the same TDP of around 2-2.4W – while Broadwell-Y is rated about 2x higher at 4.5-6W. We shall see whether this makes a difference.
DirectX / OpenGL / OpenCL Support 11 / 4.0 / 1.1 11.1 (12?) / 4.3 / 1.2 11.1 (12?) / 4.3 / 2.0 Intel has continued to improve the video driver – 2 generations share a driver – but here CherryTrail has a brand-new driver that supports much newer technologies like DirectX 11.1 (vs 11.0), OpenGL 4.3 (vs 4.0) including Compute and OpenCL 1.2. Broadwell’s driver does support OpenCL 2.0 – perhaps a later CherryTrail driver will do too?
FP16 / FP64 Support No / No (OpenCL), Yes (DirectX) No / No (OpenCL), Yes (DirectX, OpenGL) No / No (OpenCL), Yes (DirectX, OpenGL) Sadly FP16 support is still missing and FP64 is also missing on OpenCL – but available in DirectX Compute as well as OpenGL Compute! Those Intel FP64 extensions are taking their time to appear…
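
For developers wondering what a given driver actually exposes, OpenCL advertises native half/double support through the standard cl_khr_fp16 and cl_khr_fp64 extension strings. Below is a minimal plain-C sketch (illustrative only – error handling trimmed, first GPU device assumed) of how an application can check:

    /* caps.c - report FP16/FP64 support of the first GPU device (sketch) */
    #include <stdio.h>
    #include <string.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id   dev;
        char ext[4096] = {0};

        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);

        /* native half/double are advertised as Khronos extensions */
        printf("FP16 (cl_khr_fp16): %s\n",
               strstr(ext, "cl_khr_fp16") ? "yes" : "no (promoted to FP32)");
        printf("FP64 (cl_khr_fp64): %s\n",
               strstr(ext, "cl_khr_fp64") ? "yes" : "no (must be emulated)");
        return 0;
    }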

GPGPU Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported).

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (Jun 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: GPGPU Vectorised
GPGPU Arithmetic Benchmark Single/Float/FP32 Vectorised OpenCL (Mpix/s) 25 160 [+6.4x] 181 [+13%] Straight off the bat we see that CherryTrail’s 4x more (and more advanced) CUs give us 6.4x better performance – a huge improvement! Even the brand-new Broadwell GPU is only 13% faster.
GPGPU Arithmetic Benchmark Half/Float/FP16 Vectorised OpenCL (Mpix/s) 25 160 [+6.4x] 180 [+13%] As FP16 is not supported by any of these GPUs and is promoted to FP32, the results don’t change.
GPGPU Arithmetic Benchmark Double/FP64 Vectorised OpenCL (Mpix/s) 1.63 (emulated) 10.1 [+6.2x] (emulated) 11.6 [+15%] (emulated) None of the GPUs supports native FP64 either: emulating FP64 (mantissa extending) is quite hard on all GPUs, but the picture doesn’t change: CherryTrail is 6.2x faster, with Broadwell just 15% faster still.
GPGPU Arithmetic Benchmark Quad/FP128 Vectorised OpenCL (Mpix/s) 0.18 (emulated) 1.08 [+6x] (emulated) 1.32 [+22%] (emulated) Emulating FP128 using FP32 is even more complex but CherryTrail does not disappoint, it is still 6x faster; Broadwell does pull ahead a bit being 22% faster.
Intel Braswell: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 96 825 [+8.6x] 770 [-7%] In this tough integer workload that uses shared memory CherryTrail does even better – it is 8.6x faster, more than we’d expect – the newer driver may help. Surprisingly this is faster than even Broadwell.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 129 1105 [+8.6x] n/a What we saw before is no fluke, CherryTrail’s GPU is still 8.6 times faster than BayTrail’s.
Intel Braswell: GPGPU Hashing
GPGPU Crypto Benchmark SHA2-512 (int64) Hash OpenCL (MB/s) 54 309 [+5.7x] This 64-bit integer compute-heavy workload is hard on all GPUs (no native 64-bit arithmetic), but CherryTrail does well – it is almost 6x faster than the older BayTrail. Note that neither DirectX nor OpenGL natively supports int64, so this is about as hard as it gets for our GPUs (a sketch of this kind of emulated 64-bit arithmetic follows after the results).
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 96 1187 [+12.4x] 1331 [+12%] In this integer compute-heavy workload, CherryTrail really shines – it is 12.4x (twelve times) faster than BayTrail! Again, even the latest Broadwell is just 12% faster than it! Atom finally kicks ass both in CPU and GPU performance.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 215 2764 [+12.8x] SHA1 is less compute-heavy, but the results don’t change: CherryTrail is 12.8x faster – the best result we’ve seen so far.
Intel Braswell: GPGPU Finance
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 34.33 299.8 [+8.7x] 280.5 [-7%] Starting with the financial tests, CherryTrail is quick off the mark – being almost 9x (nine times) faster than BayTrail – and again somehow faster than even Broadwell. Who says Atom cannot hold its own now?
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 5.16 28 [+5.4x] 36.5 [+30%] Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – but CherryTrail still holds its own: it’s over 5x faster – not as much as we saw before, but a massive improvement. Broadwell’s EV8 GPU does show its prowess, being 30% faster still.
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 3.6 61.9 [+17x] 54 [-12%] Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; here we see CherryTrail shine – it’s 17x (seventeen times) faster than BayTrail’s GPU – so much so we had to recheck. Most likely the newer GPU driver helps – but BayTrail will not get these improvements. Broadwell is again, surprisingly, 12% slower.
Intel Braswell: GPGPU Science
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 6 45 [+7.5x] 44.1 [-3%] GEMM is quite a tough algorithm for our GPUs but CherryTrail remains over 7.5x faster – even Broadwell is 3% slower than it. We have seen EV8 not performing as expected before – perhaps some more optimisations are needed.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 2.27 9 [+3.96x] 8.94 [-1%] FFT involves many kernels processing data in a pipeline – so here we see CherryTrail only 4x (four times) faster – the slowest we’ve seen so far. But then again Broadwell scores about the same so it’s a tough test for all GPUs.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 10.74 65 [+6x] 50 [-23%] In our last test we see CherryTrail going back to being 6x faster than BayTrail – surprisingly again Broadwell’s EV8 GPU is 23% slower than it.
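
As noted in the SHA2-512 test above, without native int64 all 64-bit arithmetic has to be emulated from 32-bit operations. A minimal sketch in C of what an emulated 64-bit addition looks like (illustrative only – our kernels are OpenCL/compute shaders, but the carry logic is the same):

    /* 64-bit unsigned value emulated as two 32-bit halves (lo, hi) -   */
    /* the kind of arithmetic an int32-only shader must do for SHA2-512 */
    typedef struct { unsigned int lo, hi; } u64emu;

    static u64emu add64(u64emu a, u64emu b)
    {
        u64emu r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low word */
        return r;
    }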

There is no simpler way to put this: CherryTrail Atom’s GPU obliterates the old one – never being less than 4x and up to 17x (yes, seventeen!) faster, many times even overtaking the much newer, more expensive and more power hungry Broadwell (Core M) EV8 GPU! It is a no-brainer really, you want it – for once Microsoft made a good choice for Surface 3 after the disasters of earlier Surfaces (perhaps they finally learn? Nah!).

There isn’t really much to criticise: sure, FP16 native support is missing – a pity on Android (which uses FP16 in the UX) – and naturally FP64 is also missing in OpenCL, though as usual it is available through DirectX Compute and OpenGL Compute. As mentioned, since OpenGL 4.3 is supported, Compute is now supported for the first time on Atom – a feature recently introduced in newer drivers for Haswell and later GPUs (EV7.5, EV8).
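
To illustrate what the “mantissa extending” emulation in the FP64/FP128 tests involves: the value is kept as an unevaluated sum of two native floats, roughly doubling the usable mantissa, at the cost of several native operations per emulated one. A minimal float-float addition sketch in C (illustrative only, not our benchmark kernels, which do the same in OpenCL/compute shaders; compile without fast-math so the compiler keeps the rounding steps):

    /* float-float ("double-single"): value = hi + lo, with |lo| <= 0.5*ulp(hi) */
    #include <stdio.h>

    typedef struct { float hi, lo; } ff_t;

    /* error-free addition of two floats (Knuth two-sum): a + b == s.hi + s.lo exactly */
    static ff_t two_sum(float a, float b)
    {
        ff_t s;
        s.hi = a + b;
        float bb = s.hi - a;
        s.lo = (a - (s.hi - bb)) + (b - bb);
        return s;
    }

    /* add two float-float numbers: ~8 FP32 ops instead of 1, hence the big slowdown */
    static ff_t ff_add(ff_t x, ff_t y)
    {
        ff_t s = two_sum(x.hi, y.hi);
        s.lo += x.lo + y.lo;
        return two_sum(s.hi, s.lo);   /* renormalise so hi again carries the leading bits */
    }

    int main(void)
    {
        ff_t a = { 1.0f, 1e-9f }, b = { 2.0f, -3e-10f };
        ff_t c = ff_add(a, b);
        printf("hi=%.9g lo=%.9g\n", c.hi, c.lo);
        return 0;
    }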

Just in case we’re not clear: this *is* the Atom you are looking for!

Transcoding Performance

We are testing the transcoding performance of the GPUs using their hardware transcoders with popular media formats: H.264/MP4 (AVC1) and H.265/HEVC.

Results Interpretation: Higher values (MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
H.264/MP4 Decoder/Encoder QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) QuickSync H264 (hardware accelerated) Same transcoder is used for all GPUs.
Intel Braswell: Transcoding H264
Transcode Benchmark H264 > H264 Transcoding (MB/s) 2.24 5 [+2.23x] 8.41 [+68%] H.264 transcoding on the new Atom has more than doubled (2.2x) which makes it ideal as a HTPC (e.g. Plex server). However, with more power available, Core M has almost 70% more bandwidth.
Transcode Benchmark WMV > H264 Transcoding (MB/s) 1.89 4.75 [+2.51x] 8.2 [+70%] When just using the H264 encoder we still see a 2.5x improvement (over two and a half times), with Core M again about 70% faster still.

Intel has not forgotten transcoding, with the new Atom over 2x (twice) as fast – so if you were thinking of using it as a HTPC (NUC/Brix) server, better get the new one. However, unless you really want low power – the Core M (and the ULV) versions are 60-70% faster still…

GPGPU Memory Performance

We are testing memory performance of GPUs using OpenCL, DirectX ComputeShader and OpenGL ComputeShader (if supported), including transfer (up/down) to/from system memory and latency.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance. Lower values (ns, clocks) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Memory Configuration 2GB DDR3 1067MHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 64/128-bit (shared with CPU) 4GB DDR3 1.6GHz 128-bit (shared with CPU) Atom is generally configured to use a single memory controller, but CherryTrail runs at 1600MT/s – the same as modern Core APUs. Core M/Broadwell naturally has a dual-channel controller, though some laptops/tablets may use just one.
Cache Configuration 32kB L2 global/texture? 128kB L3 256kB L2 global/texture? 384kB L3 256kB L2 global/texture? 384kB L3 The internal cache arrangement seems to be a closely-guarded secret – so a lot of it is deduced from the latency graphs. The L2 increase in CherryTrail (8x larger) more than keeps up with the 4x CU increase.
Intel Braswell: GPGPU Memory Bandwidth
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 3.45 11 [+3.2x] 10.1 [-8%] CherryTrail manages over 3x higher bandwidth during internal transfer over BayTrail, close to what we’d expect a dual-channel system to achieve. Surprisingly our dual-channel Core M manages to be 8% slower. We did see Broadwell achieve less than Haswell – which may explain what we’re seeing here.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 1.18 2.09 [+77%] 3.91 [+87%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Here CherryTrail improves almost 2x over BayTrail – but finally we see Core M being 87% faster still.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 1.13 2.29 [+2.02x] 3.79 [+65%] Download bandwidth has improved a bit more than upload, with CherryTrail being over 2x (twice) faster – but again Broadwell is 65% faster still. This will really help GPGPU applications that need to copy large results from the GPU to CPU memory until a “zero copy” feature arrives.
Intel Braswell: GPGPU Memory Latency
GPGPU Memory Latency Global Memory (In-Page Random) Latency (ns) 981 829 [-15%] 274 With the memory running faster, we see latency decreasing by 15% a good result. However, Broadwell does so much better with almost 1/4 latency.
GPGPU Memory Latency Global Memory (Full Random) Latency (ns) 1272 1669 [+31%] Surprisingly, using full-random access we see latency increase by 31%. This could be due to the larger (4GB vs. 2GB) memory arrangement – the TLB-miss hit could be much higher.
GPGPU Memory Latency Global Memory (Sequential) Latency (ns) 383 279 [-27%] Sequential access brings the latency down by 27% – a good result.
GPGPU Memory Latency Constant Memory Latency (ns) 660 209 [-1/3x] With L1 cache covering the entire constant memory on CherryTrail – we see latency decrease to 1/3 (a third), great for kernels that use more than 32kB constant data.
GPGPU Memory Latency Shared Memory Latency (ns) 215 201 [-6%] Shared memory is a bit faster (6% lower latency), nothing to write home about.
GPGPU Memory Latency Texture Memory (In-Page Random) Latency (ns) 1583 1234 [-22%] With the memory running faster, as with global memory we see latency decreasing by 22% here – a good result!
GPGPU Memory Latency Texture Memory (Sequential) Latency (ns) 916 353 [-61%] Sequential access brings the latency down by a huge 61%, an even bigger difference than what we saw with Global memory. Impressive!

Again, we see big gains in CherryTrail with bandwidth increasing by 2-3x which is necessary to keep all those new EVs fed with data; Broadwell does do better but then again it has a dual-channel memory controller.

Latency has also decreased by a good amount in most access patterns (6-61%), likely due to the faster memory employed, and the much larger caches (8x) certainly help. For data that exceeded the small BayTrail cache (32kB) – the CherryTrail one should be more than sufficient.
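
The upload/download figures above are the cost of explicit buffer copies; on an APU the same physical memory can often be reached by mapping a host-pointer buffer instead, which is what a future “zero copy” path would make essentially free. A hedged sketch in C of the two approaches (illustrative only – whether the map truly avoids a copy depends on the driver and on buffer alignment):

    #include <CL/cl.h>

    /* ctx, q, host_src, host_dst and bytes are assumed to be set up by the caller */
    static void transfer_demo(cl_context ctx, cl_command_queue q,
                              void *host_src, void *host_dst, size_t bytes)
    {
        cl_int err;

        /* (a) explicit copies - what the Upload/Download bandwidth tests measure */
        cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
        clEnqueueWriteBuffer(q, dev_buf, CL_TRUE, 0, bytes, host_src, 0, NULL, NULL); /* upload   */
        clEnqueueReadBuffer (q, dev_buf, CL_TRUE, 0, bytes, host_dst, 0, NULL, NULL); /* download */
        clReleaseMemObject(dev_buf);

        /* (b) host-pointer buffer - the driver may use the pages in place on an APU */
        cl_mem shared = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                       bytes, host_src, &err);
        void *p = clEnqueueMapBuffer(q, shared, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        /* ... CPU works on p while GPU kernels read/write 'shared' ... */
        clEnqueueUnmapMemObject(q, shared, p, 0, NULL, NULL);
        clReleaseMemObject(shared);
    }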

Shader Performance

We are testing shader performance of the GPUs in DirectX and OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (Jun 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: Video Shaders
Video Shader Benchmark Single/Float/FP32 Vectorised DirectX (Mpix/s) 39 127.6 [+3.3x] Starting with DirectX FP32, CherryTrail is over 3.3x faster than BayTrail – not as high as we saw before but a good start.
Video Shader Benchmark Single/Float/FP32 Vectorised OpenGL (Mpix/s) 38.3 121.8 [+3.2x] 172 [+41%] OpenGL does not change matters, CherryTrail is still just over 3x (three times) faster than BayTrail. Here, though, Broadwell is 41% faster still…
Video Shader Benchmark Half/Float/FP16 Vectorised DirectX (Mpix/s) 39.2 109.5 [+2.8x] As FP16 is not supported by any of these GPUs and is promoted to FP32, the results don’t change.
Video Shader Benchmark Half/Float/FP16 Vectorised OpenGL (Mpix/s) 38.11 121.8 [+3.2x] 170 [+39%] As FP16 is not supported by any of these GPUs and is promoted to FP32, the results don’t change.
Video Shader Benchmark Double/FP64 Vectorised DirectX (Mpix/s) 7.48 18 [+2.4x] Unlike the OpenCL driver, the DirectX driver does support FP64 – so all GPUs run native FP64 code, not emulation. Here, CherryTrail is only 2.4x faster than BayTrail.
Video Shader Benchmark Double/FP64 Vectorised OpenGL (Mpix/s) 9.17 26 [+2.83x] 46.45 [+78%] As above, the OpenGL driver supports FP64 too – so all GPUs again run native FP64 code. CherryTrail is 2.83x faster here, but Broadwell is 78% faster still.
Video Shader Benchmark Quad/FP128 Vectorised DirectX (Mpix/s) 1.3 (emulated) 1.34 [+3%] (emulated) (emulated) Here we’re emulating (mantissa extending) FP128 using FP64 not FP32 but it’s hard: CherryTrail’s performance falls to just 3% faster over BayTrail, perhaps some optimisations are needed.
Video Shader Benchmark Quad/FP128 Vectorised OpenGL (Mpix/s) 1 (emulated) 1.1 [+10%] (emulated) 3.4 [+3.1x] (emulated) OpenGL does not change the results – but here we see Broadwell being 3x faster than both CherryTrail and BayTrail. Perhaps such heavy shaders are too much for our Atom GPUs.

Unlike GPGPU, here we don’t see the same crushing improvement – but CherryTrail’s GPU is still about 3x (three times) faster than BayTrail’s – though Broadwell shows its power. Perhaps our shaders are a bit too complex for pixel processing and should rather stay in the GPGPU field…

Shader Memory Performance

We are testing memory performance of GPUs using DirectX and OpenGL, including transfer (up/down) to/from system memory.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers (June 2015). Turbo / Dynamic Overclocking was enabled on all configurations.

Graphics Processors BayTrail GT CherryTrail GT Broadwell GT2Y – HD 5300 Comments
Intel Braswell: Video Memory Bandwidth
Video Memory Benchmark Internal Memory Bandwidth (GB/s) 6.74 11.18 [+65%] 12.46 [+11%] DirectX bandwidth is not as “bad” as OpenCL on BayTrail (better driver?) so we start from a higher baseline: CherryTrail still manages 65% more bandwidth – with Broadwell only squeezing out 11% more despite its dual-channel controller. It shows that the OpenCL GPGPU driver has come a long way to match DirectX.
Video Memory Benchmark Upload Bandwidth (GB/s) 2.62 2.83 [+8%] 5.29 [+87%] While all APUs don’t need to transfer memory over PCIe like dedicated GPUs, they still don’t support “zero copy” – thus memory transfers are not free. Again BayTrail does better so CherryTrail can only be 8% faster than it – with Broadwell finally 87% faster.
Video Memory Benchmark Download Bandwidth (GB/s) 1.14 2.1 [+83%] 1.23 [-42%] Here BayTrail “stumbles”, so CherryTrail can be 83% faster, with Broadwell surprisingly 42% slower. What it does show is that CherryTrail’s much newer drivers are simply better. It is a pity Intel does not provide this driver for BayTrail too…

Again CherryTrail gains – 65% more internal bandwidth and 83% more download bandwidth, though upload barely moves (+8%); the much newer driver clearly helps here. Broadwell only pulls clearly ahead on upload, thanks to its dual-channel memory controller.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Here we tested the brand-new Atom X7 Z8700 (CherryTrail) GPU with 16 EVs (EV8) – 4x (four times) more than the older Atom Z3700 (BayTrail) GPU with 4 EVs (EV7) – so we expected big gains – and they were delivered: GPGPU performance is nothing less than stellar, obliterating the old Atom GPU – no doubt also helped by the newer driver (which sadly BayTrail won’t get). And all at about the same 2.4W TDP! Impressive!

Even the very latest Core M (Broadwell) GPU (EV8) is sometimes left behind – at about 2x higher power, more EVs and higher cost – perhaps the new Atom is too good?

Architecturally not much has changed (besides the far greater number of EVs) – but we also get better bandwidth and lower latencies – no doubt due to the higher memory bus clock.

All in all, there’s no doubt – this new Atom is the one to get and will bring far better graphics and GPGPU performance at low cost – even overshadowing the great Core M series – no doubt the reason it is found in the latest Surface 3.

To see how the Atom CherryTrail CPU fares, please see the Atom Z8700 (CherryTrail) CPU performance article!

Mali T760 GPGPU (Exynos 5433 SoC): FP64 Champion – Adreno Killer?


What is Mali? What is Exynos?

“Mali” is the name of ARM’s own “standard” GPU cores that complement its standard CPU cores (Cortex). Many ARM CPU licensees integrate GPU cores from other vendors in their SoCs (e.g. Imagination, Vivante, Adreno) rather than the default Mali.

The Mali 700 series is the 3rd-generation “Midgard” core that complements ARM’s 64-bit ARMv8 Cortex-A5x designs and is thus used in the very latest phones/tablets; it has been updated to include support for new technologies like OpenCL, OpenGL ES and DirectX.

“Exynos” is the name of Samsung’s line of SoCs used in Samsung’s own phones/tablets/TVs. Series 5 is the 5th-generation SoC, generally using ARM’s “big.LITTLE” architecture of “small” cores for low power and “big” cores for performance. The 5433 is the 1st 64-bit SoC from Samsung supporting AArch64 aka ARMv8, but running in “legacy” 32-bit ARMv7 mode.

In this article we test (GP)GPU graphics unit performance; please see our other articles on:

Hardware Specifications

We are comparing the internal GPUs of various modern phones and tablets that support GPGPU.

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Type / Micro-Arch VLIW4 (Midgard 3rd gen) VLIW5 VLIW5 VLIW5 VLIW4 (Midgard 2nd gen) Scalar (Kepler 3rd gen) All except the K1 are VLIW and thus work best with vectorised data; some compilers are very good at vectorising simple code (e.g. by executing multiple data items simultaneously), but the programmer can generally do a better job of extracting parallelism.
Core Speed (MHz) estimated 600 600 578 400 533 ? Core speeds are comparable – the latest devices do not push clocks much higher, instead improving the cores themselves.
OpenGL ES Support 3.1 3.1 3.0 3.0 3.0 (should support 3.1) 3.1 Mali T7xx adds official support for OpenGL ES 3.1 just like the other modern GPU designs: Adreno 400 and K1. While Mali T6xx should also support 3.1, the drivers have not been updated for this “legacy” device.
OpenCL ES Support 1.2 (full) 1.2 (full) 1.1 1.1 1.1 (should support full) Not for Android, supports CUDA Mali T7xx adds support for OpenCL 1.2 but also “full profile” just like Adreno 420 – both supporting all the desktop features of OpenCL – thus any kernels developed for desktop/mobile GPUs can run pretty much unchanged.
CU / SP Units 8 / 256 4 / 128 4 / 128 4 / 64 8 / 64 1 / 192 Mali T760 has 4x the SPs of the T628 and they should also be far more powerful. Adreno 420 relies only on more powerful CUs over the 330/320; nVidia uses only 1 SMX/CU but more SPs.
Global Memory (MB) 2048 of 3072 1400 of 3072 1400 of 3072 960 of 2048 1024 of 3072 n/a Modern phones with 3GB memory seem to allow about 50% to be allocated through OpenCL. Mali does generally seem to allow more, typically 66%.
Largest Memory Block (MB) 512 of 2048 347 of 1400 347 of 1400 227 of 960 694 of 1024 n/a The maximum block size seems to be about 25% of total memory, but Mali’s driver allows as much as 50%.
Constant Memory (kB) 64 64 4 4 64 n/a Mali T600 was already fine here, with Adreno 400 needing to catch up to the rest. Previously, constant data would have needed to be kept in normal global memory due to the small constant memory size.
Shared Memory (kB) 32 32 8 8 32 n/a Again Mali T600 was fine already – with Adreno 400 finally matching the rest.
Max. Workgroup Size 256 x 256 x 256 1024 x 1024 x 1024 512 x 512 x 512 256 x 256 x 256 256 x 256 x 256 n/a Surprisingly the work-group size remains at 256 for Mali T700/T600, with Adreno 400 pushing all the way to 1024. That does not necessarily mean it is the optimum size.
Cache (Reported) kB 256 128 32 32 n/a n/a Here Mali T760 overtakes them all with a 256kB L2 cache, 2x bigger than Adreno 400 and older Mali T600.
FP16 / FP64 Support Yes / Yes Yes / No Yes / No Yes / No No No Here we have the 1st mobile GPU with native FP64! If you have double-precision floating-point workloads then stop reading now and get a SoC with the Mali T700 series.
Byte/Integer Width 16 / 4 1 / 1 1 / 1 1 / 1 16 / 4 n/a Adreno prefers non-vectorised integer data even though it is VLIW5; only Mali prefers vectorised data (vec4) similar to the old ATI/AMD pre-GCN hardware. At least all our vectorisations are not in vain 😉
Float/Double Width 4 / 2 1 / n/a 1 / n/a 1 / n/a 4 / n/a n/a As before, Adreno prefers non-vectorised while Mali vectorised data. As Mali T760 supports FP64, it also wants vectorised double floating-point data.

GPGPU Compute Performance

We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
Exynos 5433: GPGPU Arithmetic
GPGPU Arithmetic Benchmark Half-float/FP16 Vectorised OpenCL (Mpix/s) 199.9 184.4 73.4 20.2
GPGPU Arithmetic Benchmark Single-Float/FP32 Vectorised OpenCL (Mpix/s) 105.9 182 [+71%] 114.8 49.1 20.2 Adreno 420 manages to beat Mali T760 by a good ~70% even though we use a highly vectorised kernel – not what we expected. But this is 5x faster than the old Mali T628, showing just how much Mali has improved since its last version – though not enough!
GPGPU Arithmetic Benchmark Double-float/FP64 Vectorised OpenCL (Mpix/s) 30.6 [+3x] 10.1 (emulated) 8.4 (emulated) 3.4 (emulated) 0.518 (emulated) Here you see the power of native support: Mali T760 blows everything out of the water – it is 3x faster than the Adreno 400 and a crazy 60x (sixty times) faster than the old Mali T628! This is really the GPGPU to beat – nVidia must be regretting not adding FP64 to K1.
GPGPU Arithmetic Benchmark Quad-float/FP128 Vectorised OpenCL (Mpix/s) 0.995 [+5x] (emulated using FP64) 0.197 (emulated) 0.056 (emulated) 0.053 (emulated) failed Emulating FP128 using FP64 gives Mali T760 a big advantage – now it is 5x faster than Adreno 400! There is no question – if you have high precision computation to do, Mali T700 is your GPGPU.
Exynos 5433: GPGPU Crypto
GPGPU Crypto Benchmark AES256 Crypto OpenCL (MB/s) 136 145 [+6%] 96 70 85 T760 is just a bit (6%) slower than Adreno 420 here, but is still a good improvement (1.6x) over its older brother T628.
GPGPU Crypto Benchmark AES128 Crypto OpenCL (MB/s) 200 131 94 89
Exynos 5433: GPGPU Hashing
GPGPU Crypto Benchmark SHA2-256 (int32) Hash OpenCL (MB/s) 321 [+2%] 314 141 106 89 In this integer compute-heavy workload, Mali T760 just edges Adreno 420 by 2% – pretty much a tie. Both GPGPUs are competitive in integer workloads as we’ve seen in the AES tests also.
GPGPU Crypto Benchmark SHA1 (int32) Hash OpenCL (MB/s) 948 442 271 294
Exynos 5433: GPGPU Finance
GPGPU Finance Benchmark Black-Scholes FP32 OpenCL (MOPT/s) 212.9 235.4 [+10%] 170.7 85 98.3 Black-Scholes is not compute heavy allowing many GPUs to do well, and here Adreno 420 is 10% faster than Mali T760.
GPGPU Finance Benchmark Binomial FP32 OpenCL (kOPT/s) 0.842 6.605 [+7x] 4.737 1.477 0.797 Binomial is far more complex than Black-Scholes, involving many shared-memory operations (reduction) – and Adreno 420 does not disappoint; however, Mali T760 (just like the T628) is not very happy with our code, with a pitiful score about 1/7 of Adreno’s (seven times slower).
GPGPU Finance Benchmark Monte-Carlo FP32 OpenCL (kOPT/s) 34 59.5 [+1.75x] 19.1 14.2 10.4 Monte-Carlo is also more complex than Black-Scholes, also involving shared memory – but read-only; Adreno 420 is 1.75x (almost two times) faster. It could be that Mali stumbles at the shared memory operations which are key to both algorithms.
Exynos 5433: GPGPU Science
GPGPU Science Benchmark SGEMM FP32 OpenCL (GFLOPS) 5.167 6.179 [+19%] 3.173 2.992 2.362 Adreno 420 continues its dominance here, being ~20% faster than Mali T760 but nowhere near the lead it had in Financial tests.
GPGPU Science Benchmark SFFT FP32 OpenCL (GFLOPS) 1.902 5.470 [+2.87x] 3.535 2.146 1.914 FFT involves a lot of memory accesses but here Adreno 420 is almost 3x faster than Mali T760, a lead similar to what we saw in the complex Financial (Binomial/Monte-Carlo) tests.
GPGPU Science Benchmark N-Body FP32 OpenCL (GFLOPS) 14.3 27.7 [+2x] 23.9 15.9 9.46 N-Body generally allows GPUs to “spread their wings” and here Adreno 420 does not disappoint – it is 2x faster than Mali T760.

It seems our early enthusiasm over FP64 native support was quickly extinguished: while Mali T760 naturally does well in FP64 tests – it cannot beat its rival Adreno (420) in other tests.

In single-precision floating-point (FP32) simple workloads, Adreno is only about 10-20% faster; however, in complex workloads (Binomial, Monte-Carlo, FFT, GEMM) Adreno can be 2-7x (times) faster – a huge lead. This seems to be down to shared-memory accesses rather than the VLIW design needing highly-vectorised kernels – which is exactly what we are providing.
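
As a concrete example of the “highly-vectorised kernels” mentioned above, here is a minimal OpenCL C sketch (illustrative only, not our benchmark code) of the same operation written scalar and as float4; on a vector/VLIW design like Mali the float4 form can fill a whole SIMD issue, while scalar GPUs simply process the four lanes one after another:

    // scalar: one work-item processes one float
    __kernel void axpy_scalar(const float a,
                              __global const float *x,
                              __global float *y)
    {
        const size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    // vectorised: one work-item processes four floats at once (vec4) -
    // buffers now hold N/4 float4 elements
    __kernel void axpy_vec4(const float a,
                            __global const float4 *x,
                            __global float4 *y)
    {
        const size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];   // component-wise multiply-add across all four lanes
    }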

Naturally in double-precision floating-point (FP64) workloads, Mali T760 flies – being 3-5x (times) faster, so if those are the kinds of workloads you require – it is the natural choice. However, such precision is uncommon on phones/tablets – even desktop/laptop GPGPUs have crippled FP64 performance.
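
For completeness, native FP64 in OpenCL has to be requested explicitly inside the kernel via the cl_khr_fp64 extension; a minimal OpenCL C sketch (illustrative only) – on GPUs without the extension (Adreno, older Mali) this simply fails to build and an FP32-based emulation path has to be used instead:

    // native double-precision kernel: only builds where cl_khr_fp64 is exposed
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void scale_fp64(const double a,
                             __global double *x)
    {
        const size_t i = get_global_id(0);
        x[i] *= a;   // runs on native FP64 units, e.g. Mali T760
    }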

In integer workloads, the two GPGPUs are competitive with a 3-5% difference either way.

The relatively small (256) workgroup size may also hamper performance with Adreno 420 able to keep more (1024) threads in flight – although the shared cache size is the same.
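
Since these limits differ so much between devices (and can be lower still for a particular compiled kernel), a portable application should query them rather than hard-code them; a small sketch in C (illustrative only, device and kernel handles assumed to exist):

    #include <stdio.h>
    #include <CL/cl.h>

    /* print the device-wide and per-kernel work-group limits */
    static void print_wg_limits(cl_device_id dev, cl_kernel krn)
    {
        size_t dev_max = 0, krn_max = 0, pref_mult = 0;

        /* absolute device limit, e.g. 256 on Mali T760, 1024 on Adreno 420 */
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(dev_max), &dev_max, NULL);

        /* what this particular compiled kernel can actually use (often smaller) */
        clGetKernelWorkGroupInfo(krn, dev, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(krn_max), &krn_max, NULL);

        /* granularity the hardware prefers (warp/wavefront-like width) */
        clGetKernelWorkGroupInfo(krn, dev, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(pref_mult), &pref_mult, NULL);

        printf("device max %zu, kernel max %zu, preferred multiple %zu\n",
               dev_max, krn_max, pref_mult);
    }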

GPGPU Memory Performance

We are testing memory bandwidth performance of GPUs using OpenCL, including transfer (up/down) to/from system memory; we also measure the latencies of the various memory types (global, constant, shared, etc.) using different access patterns (in-page random access, sequential access, etc.).

Results Interpretation (Bandwidth): Higher values (MPix/s, MB/s, etc.) mean better performance.

Results Interpretation (Latency): Lower values (ns, clocks, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 Comment
Memory Configuration 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 3GB DDR3 (shared with CPU) 2GB DDR3 (shared with CPU) 3GB DDR2 (shared with CPU) Modern phones and tablets now ship with 3GB – close to the 32-bit address limit. While all SoCs support unified memory, none seem to support “zero copy” or HSA, which has recently made it to OpenCL on the desktop with version 2.0. Future SoCs will fix this issue and provide a global virtual memory address space that CPU & GPU can share.
Exynos 5433: GPGPU Memory Bandwidth
GPGPU Memory Bandwidth Internal Memory Bandwidth (MB/s) 4528 9457 [+2x] 8383 4751 1436 Qualcomm’s SoC manages to extract almost 2x more bandwidth compared to Samsung’s SoC – which may explain some of the performance issues we saw when processing large amounts of data. Adreno has almost 10GB/s of bandwidth to play with, similar to single-channel desktops/laptops!
GPGPU Memory Bandwidth Upload Bandwidth (MB/s) 2095 3558 [+69%] 3294 2591 601 Adreno wins again with 70% more upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (MB/s) 2091 3689 [+76%] 2990 2412 691 Again Adreno has over 70% more download bandwidth – it is no wonder it did so well in the compute tests. Mali will have to improve markedly to match.
Exynos 5433: GPGPU Memory Latency
GPGPU Memory Latency Global Memory Latency – In Page Random Access (ns) 199.3 [-17%] 239.6 388.2 658.8 625.8 It starts well for Mali T760, with ~20% lower response time over Adreno 420 and almost 3x faster than its old T628 brother – which no doubt helps the performance of many algorithms that access global memory randomly. All modern SoC (805, 5433, 801) show just how much improvement was made in the memory prefetchers in the last few years.
GPGPU Memory Latency Global Memory Latency – Full Random Access (ns) 346.5 [-30%] 493.2 500.2 933.7 815.2 Full random access is tough on all GPUs, but here Mali T760 manages to be 30% faster than Adreno 420 – which has not improved over the 330 (same memory controller in the 800 series).
GPGPU Memory Latency Global Memory Latency – Sequential Access (ns) 147.2 106.6 [-27%] 98.2 116.2 280.6 With sequential accesses – we finally see Adreno 420 (but also the older 300 series) show their prowess being 27% faster. Qualcomm’s memory prefetchers seem to be doing their job here.
GPGPU Memory Latency Constant Memory (ns) 263.1 70.7 [-1/3.75x] 74.5 103 343 Here Adreno 420’s constant memory has almost 4x lower latency than Mali’s (thus ~1/4 the response time) – which may be a clue as to why it is so much faster. Basically, constant memory does not seem to be cached on the Mali but treated as just normal global memory.
GPGPU Memory Latency Shared Memory (ns) 301 30.1 [-1/10x] 47 83 329 Shared memory is even more important as it is used to share data between threads – lack of it reduces the work-group sizes that can be used. Here we see Adreno 420’s shared memory having 10x lower latency than Mali’s (thus 1/10 the response time) – no wonder Mali T760 is so much slower in complex workloads that make extensive use of shared memory. Basically, shared memory is not *dedicated* but just normal global memory.

Memory testing seems to reveal Mali’s T760 problem: its bandwidth is much lower than Adreno while its key memories (shared, constant) latencies are far higher. It is a wonder how it performs so well actually if the numbers are to be believed – but since Mali T628 scores similarly there is no reason to doubt them.

Adreno 420 has 2x higher internal bandwidth and over 70% more upload/download bandwidth – and since neither supports HSA and thus “zero copy” – it will be much faster the bigger the memory blocks used. Here, Qualcomm’s fully in-house SoC design (CPU, GPU, memory controller) pays dividends.

Mali T760’s global memory latency is lower, but neither constant nor (more crucially) shared memory seems to be treated differently, so both have latencies similar to global memory; common GPGPU optimisations are thus useless and any complex algorithm making extensive use of shared memory will be greatly bogged down. ARM should re-think this approach for the new (T800) Mali series.
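
For reference, this is the kind of shared (“local”) memory pattern that the complex kernels above lean on – a minimal OpenCL C reduction sketch (illustrative only, not our benchmark code). When __local memory is merely a window onto global memory, as appears to be the case on Mali, every one of these staged accesses pays near-global latency:

    // per-work-group sum reduction staged through __local (shared) memory
    __kernel void wg_reduce(__global const float *in,
                            __global float *partial,  // one partial sum per work-group
                            __local  float *scratch)  // sized per work-group by the host
    {
        const size_t lid = get_local_id(0);
        const size_t gid = get_global_id(0);
        const size_t lsz = get_local_size(0);

        scratch[lid] = in[gid];            // stage data in (hopefully) fast local memory
        barrier(CLK_LOCAL_MEM_FENCE);

        // tree reduction: log2(lsz) rounds of local reads/writes plus barriers
        for (size_t s = lsz / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }

The host sizes the scratch area with clSetKernelArg(kernel, 2, local_size * sizeof(float), NULL); on a GPU with truly dedicated shared memory those scratch accesses are an order of magnitude cheaper than global ones – exactly the gap the latency table above exposes.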

Video Shader Performance

We are testing vectorised shader compute performance of the GPUs in OpenGL.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Android 5.x.x, latest updates (May 2015).

Graphics Processors ARM Mali T760 Qualcomm Adreno 420 Qualcomm Adreno 330 Qualcomm Adreno 320 ARM Mali T628 nVidia K1 Comment
Exynos 5433: Video Shaders
Video Shader Benchmark Single-Float/FP32 OpenGL ES (Mpix/s) 49 170.5 [+3.5x] 114.4 60 33.6 124.7 Finally the K1 can play and does very well, but cannot overtake Adreno 420 – which also blows the Mali T760 out of the water, being over 3.5x faster. We can see just how much shader performance has improved in a few years.
Video Shader Benchmark Half-float/FP16 OpenGL ES (Mpix/s) 54 219.6 [+4x] 115 107.6 32.2 124.4 While Mali T760 finally supports FP16, it does not seem to do much good over FP32 (+10% faster) while Adreno 420 benefits greatly – thus increases its lead to being 4x faster. OpenGL is still not Mali’s forte.
Video Shader Benchmark Double-float/FP64 OpenGL ES (Mpix/s) 2.3 [1/21x] (emulated) 10.6 [+5x] [1/17x] (emulated) 9.4 [1/12x] (emulated) 4.0 [1/15x] (emulated) 2.1 [1/16x] (emulated) 26.0 [1/4.8x] (emulated) While Mali T760 does support FP64, the OpenGL extension is not yet supported/enabled – thus it is forced to run it emulated in FP32. This allows Adreno 420 to be 5x faster – though nVidia’s K1 takes the win.
Video Shader Benchmark Quad-float/FP128 OpenGLES (Mpix/s) n/a n/a n/a n/a n/a n/a Emulating FP128 using FP32 is too much for our GPUs, we shall have to wait for the new generation of mobile GPUs.

Using OpenGL ES allows the K1 to play, but more importantly it shows that Mali’s OpenGL prowess is lacking – Adreno 420 is between 3.5-5x faster – a big difference. FP16 support seems to make little difference, while FP64 support is missing in OpenGL so Mali cannot play its ace card. ARM has some OpenGL driver optimisation to do.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Mali T760 is a big upgrade over its older T600 series though a lot of the details have not changed. However, it is not enough to beat its rival Adreno 400 series – with native FP64 performance (and thus FP128 emulated) being the only shining example. While its integer workload performance is competitive – floating-point performance in complex workloads (making extensive use of shared memory) is much lower. Even highly vectorised kernels that should help its VLIW design cannot close the gap.

It seems the SoC’s memory controller lets it down, and its non-dedicated shared and constant memory means high latencies slow it down. ARM should really implement dedicated shared memory in the next major version.

Elsewhere its OpenGL shader compute performance is even slower (about 1/4 of Adreno) with FP16 support not helping much and native FP64 support missing. This is a surprise considering its far more competitive OpenCL performance. Hopefully future drivers will address this – but considering that T600 performance has remained pretty much unchanged, we’re not hopeful.

To see how the Exynos 5433 CPU fares, please see the Exynos 5433 CPU (Cortex A57+A53) performance article!

SiSoftware Releases Support for DirectX 11 Compute Shader/DirectCompute

GPGPU Memory Bandwidth

FOR IMMEDIATE RELEASE

Contact: Press Office

SiSoftware Releases Support for DirectX 11 Compute Shader/DirectCompute

 

London, UK, 30th November 2009 – SiSoftware releases its suite of DirectX 11 Compute Shader/DirectCompute GPGPU (General Purpose Graphics Processor Unit) benchmarks as part of SiSoftware Sandra 2010, the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, and networks.

At SiSoftware we are constantly looking out for new technologies with the aim to understand how those technologies can best be benchmarked and analysed. We believe that the industry is seeing a shift from the model where heavy computational workload is processed on a traditional CPU to a model that uses the GPGPU or a combination of GPU and CPU; in a wide range of applications developers are using the power of GPGPU to aid business analysis, games, graphics and scientific applications.

As certain tasks or workloads may still perform better on traditional CPU, we see both CPU and GPGPU benchmarks to be an important part of performance analysis. Having launched the GPGPU Benchmarks with SiSoftware Sandra 2009 with support for AMD CTM/STREAM and nVidia CUDA, we have now ported the benchmark suite to DirectX 11 Compute Shader/DirectCompute.

Compute Shader/DirectCompute is a new programmable shader stage introduced with DirectX 11 that expands Direct3D beyond graphics programming. Windows programmers familiar with DirectX can now use high-speed general purpose computing and take advantage of the large numbers of parallel processors on GPUs. We believe Compute Shader/DirectCompute will become “the standard” for programming parallel workloads in Windows, thus we have ported all our GPGPUs benchmarks to DirectX 11 Compute Shader/DirectCompute.

Below is a quote we would like to share with you:

“As ATI Stream technology grows in popularity with software developers, SiSoftware’s Sandra 2010 benchmark is an increasingly important tool for evaluating GPU compute performance,” said Eric Demers, chief technology officer, graphics products, AMD. “As the only provider of DirectX 11 GPUs today, AMD welcomes SiSoftware’s support for that popular application programming interface in Sandra 2010.”

The SiSoftware DirectX 11 Compute Shader/DirectCompute Benchmarks look at the two major performance aspects:

  • Computational performance: in simple terms, how fast it can crunch numbers. It follows the same style as the CPU Multi-Media benchmark, using fractal generation as its workload (a minimal sketch of this kind of workload follows after this list). This allows the user to see the power of the GPGPU in solving a workload thus far exclusively performed on a CPU.
  • Memory performance: this analyses how fast data can be transferred to and from the GPGPU. No matter how fast the processing, ultimately the end result will be affected by memory performance.
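
For illustration, the fractal workload mentioned above is “embarrassingly parallel” – every pixel can be computed independently, which is exactly what the thousands of parallel processors on a GPU are good at. A tiny plain-C sketch of such a workload (illustrative only, not the actual Sandra kernel):

    /* mandel.c - a minimal fractal (Mandelbrot) generator: every pixel is   */
    /* independent, the kind of workload a compute shader can spread across  */
    /* thousands of threads                                                  */
    #include <stdio.h>

    #define W 64
    #define H 32
    #define MAX_ITER 256

    static int mandel(float cr, float ci)
    {
        float zr = 0.0f, zi = 0.0f;
        int i;
        for (i = 0; i < MAX_ITER && zr * zr + zi * zi <= 4.0f; ++i) {
            float t = zr * zr - zi * zi + cr;   /* z = z*z + c */
            zi = 2.0f * zr * zi + ci;
            zr = t;
        }
        return i;
    }

    int main(void)
    {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) {
                int it = mandel(-2.5f + 3.5f * x / W, -1.0f + 2.0f * y / H);
                putchar(it == MAX_ITER ? '#' : ' ');
            }
            putchar('\n');
        }
        return 0;
    }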

Key features

  • 4 architectures natively supported (x86, x64/AMD64/EM64T, IA64/Itanium2, ARM)
  • 6 languages supported (English, French (3), German (3), Italian (3), Japanese (3), Russian (3))
  • DirectX 11 Compute Shader/DirectCompute
  • Different models of GPUs supported, including integrated GPU + dedicated GPUs.
  • Multi-GPUs supported, up to 8 in parallel.

With each release, we continue to add support and compatibility for the latest technologies. SiSoftware works with hardware vendors to ensure the best support for new emerging hardware.

Notes:

1 Available as Beta at this time, performance cannot be guaranteed.

2 By special arrangement; Enterprise versions only.

3 Not all languages available at publication, will be released later.

Relevant Press Releases

Relevant Articles

For more details, please see the following articles comparing current devices on the market:

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please contact your distributor (sales support).

Downloading

For more details, and to download the Lite version, please click here.

Reviewers and Editors

For your free review copies, please contact us.

About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Nearly 700 worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Over 9,000 on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, DirectCompute, x64, ARM, MIPS, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX3, AVX2, AVX, FMA4, FMA, NEON, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit http://www.sisoftware.net, http://www.sisoftware.eu, http://www.sisoftware.info or http://www.sisoftware.co.uk