SP2 for SiSoftware Sandra 2016 Released!

We are happy to release SP2 (Service Pack 2) for SiSoftware Sandra 2016.

This new version has been built with updated tools to extract the maximum performance from the latest hardware, and also contains minor additions and fixes:

  • Spanish Help file translation courtesy of Antonio Pérez Madrazo.
  • CUDA 8.0 (Pascal) preliminary device support.
  • Compiler optimisations including SIMD improvements.

As always, the update is free: visit either the Sandra Lite Downloads or the Sandra Commercial Downloads page.

SP1a for SiSoftware Sandra 2016 Released!

We are happy to release SP1a (Service Pack 1a) for SiSoftware Sandra 2016.

This is a minor update that improves stability and adds a few optimisations that were developed after further testing of the SP1 release.

The SP1a update also enables the Marbling: Perlin Noise 2D (3 octaves) Filter for both GPGPUs (CUDA, OpenCL) and CPU.

SP1 for SiSoftware Sandra 2016 Released!

We are happy to release SP1 (Service Pack 1) for SiSoftware Sandra 2016.

This release introduces initial AVX512 benchmarks, with all SIMD benchmarks due to be ported once compiler support becomes available:

  • CPU Multi-Media (Fractal Generation): single, double floating-point; integer, long benchmarks ported to AVX512. [See article Future performance with AVX512]
  • CPU Crypto (SHA Hashing): SHA2-256 and SHA2-512 multi-buffer ported to AVX512.
  • Hardware support for future arch (AMD and Intel).
  • .Net Multi-Media: native vector support is vector-width independent and thus will support AVX512 automatically with a future CLR release.

GPU Image Processing: New, more complex filters:

  • Oil Painting: Quantise (9×9) Filter: CUDA, OpenCL
  • Diffusion: Randomise (256) Filter: CUDA, OpenCL
  • Marbling: Perlin Noise 2D (3 octaves) Filter: CUDA, OpenCL

CPU Image Processing: New, more complex filters:

  • Oil Painting: Quantise (9×9) Filter: AVX2/FMA, AVX, SSE2
  • Diffusion: Randomise (256) Filter: AVX2/FMA, AVX, SSE2
  • Marbling: Perlin Noise 2D (3 octaves) Filter: AVX2/FMA, AVX, SSE2

More benchmarks will be ported to AVX512 subject to compiler support; currently Microsoft’s VC++ does not support AVX512 intrinsics and, in the interest of fairness, we do not use specialised compilers.

Please see our article – Future performance with AVX512 – for a primer on AVX512 and projected performance improvements due to AVX512 and 512-bit transfers.

Future performance with AVX512 in Sandra 2016 SP1

What is AVX512?

AVX512 is a new SIMD instruction set operating on 512-bit registers that is the natural progression from FMA/AVX (256-bit registers). It was first introduced with Intel’s “Phi” co-processor (Intel’s answer to GPGPUs) and now a version of it is making its way to CPUs themselves.

Why is AVX512 important?

CPU performance has only marginally increased (5-10%) from one generation to the next, with power efficiency being the primary goal; with limited options (cannot increase clock speeds, must reduce power, hard to improve execution efficiency, etc.), exploiting data-level parallelism through SIMD is a relatively simple way to improve performance.

SIMD instructions have long been used to increase performance (since the introduction of MMX with the Pentium in 1997!) and their register width has been increasing steadily from 64-bit (MMX) to 128-bit (SSEx) to 256-bit (AVX/FMA) and now to 512-bit (AVX512) – thus processing more and more data simultaneously.
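
To make the width progression concrete, the short Python sketch below counts how many elements of a given size fit in one register at each width. The register widths come from the text above; the selection of element types and the helper name `lanes` are illustrative assumptions.

```python
# Register widths in bits, as listed in the text above.
REGISTER_BITS = {"MMX": 64, "SSE": 128, "AVX/FMA": 256, "AVX512": 512}

# Element sizes in bits for a few common data types (illustrative selection).
ELEMENT_BITS = {"int32": 32, "int64": 64, "float": 32, "double": 64}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given size that fit in one register."""
    return register_bits // element_bits


# Print one row per instruction set. MMX was integer-only, so the float and
# double counts on that row are purely arithmetic.
for isa, width in REGISTER_BITS.items():
    counts = ", ".join(
        f"{name} x{lanes(width, bits)}" for name, bits in ELEMENT_BITS.items()
    )
    print(f"{isa:8s} ({width:3d}-bit): {counts}")
```

Each doubling of register width doubles the number of elements processed per instruction, which is where the "more and more data simultaneously" claim comes from.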

Unfortunately, software has to be specifically modified to support AVX512 (or at the very least re-compiled) but developers are generally used to this these days after the SSE to AVX transition.

SiSoftware has thus been updating its benchmarks to AVX512, though some need compiler support and will need to wait until Microsoft updates its Visual C++ compiler at some point.

What CPUs will support AVX512?

It was rumoured that the newly released “Skylake” Core consumer CPUs were going to support AVX512 – but they do not. The future “Skylake-E” Xeon “Purley” server/workstation CPUs are supposed to support it.

AVX512 is actually a family of instruction subsets, with “Skylake-E” supporting F (foundation), CD (conflict detection), BW (byte & word), DQ (double-word and quad-word) and VL (vector length extension), and the future “Cannonlake-E” supporting IFMA (integer FMA), VBMI (vector byte manipulation instructions) and perhaps others.

It is disappointing that AVX512 is not enabled on consumer CPUs (Core), but it will eventually appear in future iterations; until then, gamers/enthusiasts will need to buy into the “extreme/Skylake-E” platform and business users will need “Xeon/Skylake-E” workstations.
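
Because AVX512 is split into subsets, software has to check for each one separately. As a minimal sketch, the Python snippet below picks the AVX512 subsets out of a list of CPU feature flags. It assumes the flag spelling used by Linux in `/proc/cpuinfo` (`avx512f`, `avx512cd`, and so on); the helper name `avx512_subsets` and the sample flag string are made up for illustration, and this is not how Sandra itself detects support.

```python
from typing import List

# Subset name -> feature flag, assuming Linux /proc/cpuinfo spelling.
AVX512_FLAGS = {
    "F": "avx512f",
    "CD": "avx512cd",
    "BW": "avx512bw",
    "DQ": "avx512dq",
    "VL": "avx512vl",
    "IFMA": "avx512ifma",
    "VBMI": "avx512vbmi",
}


def avx512_subsets(flags_line: str) -> List[str]:
    """Return the AVX512 subsets present in a space-separated flags string."""
    flags = set(flags_line.split())
    return [name for name, flag in AVX512_FLAGS.items() if flag in flags]


# Hypothetical flags for a CPU with the five subsets attributed to Skylake-E.
sample = "fpu sse2 avx avx2 fma avx512f avx512cd avx512bw avx512dq avx512vl"
print(avx512_subsets(sample))  # ['F', 'CD', 'BW', 'DQ', 'VL']
```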

What kind of performance improvement can we expect with AVX512?

The transition from 128-bit SSE to 256-bit AVX/FMA/AVX2 has eventually resulted in a 70-120% improvement, with compute-intensive code that seldom accesses memory yielding the best gains. Note that AVX code executes at a lower clock than “normal”/SSE code.

AVX512 not only doubles the register width (512-bit) but also the number of registers (32 vs. 16), so we can hold 4x (four times) more data in registers, which may reduce cache/memory accesses by keeping more data local. But AVX512 code will again run at a lower clock than AVX/FMA code.
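
The 4x figure follows directly from the register counts and widths quoted above, as this small Python check shows (the helper name `register_file_bytes` is illustrative):

```python
def register_file_bytes(num_registers: int, register_bits: int) -> int:
    """Total capacity of the SIMD register file in bytes."""
    return num_registers * register_bits // 8


avx2_bytes = register_file_bytes(16, 256)    # 16 registers of 256 bits
avx512_bytes = register_file_bytes(32, 512)  # 32 registers of 512 bits

print(avx2_bytes, avx512_bytes, avx512_bytes // avx2_bytes)  # 512 2048 4
```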

In the next examples we project future gains through AVX512 for common algorithms as implemented in Sandra’s benchmarks and what they might mean to customers.

Can I test AVX512 performance with Sandra?

Yes, with the release of Sandra 2016 SP1 – you can now test AVX512 performance – naturally you need the required CPU. All the low-level benchmarks (below) have been ported to AVX512:

  • Multi-Media (Fractal Generation) Benchmark: AVX512 F, BW, DQ supported now
  • Cryptography (SHA Hashing) Benchmark: AVX512 BW, DQ supported now
  • Memory & Cache Bandwidth Benchmarks: AVX512 F, DQ supported now

The following benchmarks require future compiler support (Microsoft VC++) and have not been released at this time:

  • Financial Analysis (Black-Scholes, Binomial, Monte-Carlo): AVX512 F support coming soon
  • Scientific Analysis (GEMM, FFT, N-Body): AVX512 F support coming soon
  • Image Processing (Blur/Sharpen/Motion-Blur, Sobel, Median): AVX512 BW support coming soon
  • .Net Vectorised (Fractal Generation): AVX512 support dependent on RyuJIT numerics libraries that need to be updated by Microsoft. No changes required.

Hardware Stats

We are comparing two publicly released CPUs with their projected next-gen counterparts supporting AVX512.

| Processor | Intel i7-6700K (Skylake) | Intel i7-77XX? (next-gen) | Intel i7-5820K (Haswell-E) | Intel i7-78XX? (Skylake-E) |
| --- | --- | --- | --- | --- |
| Cores/Threads | 4C / 8T | 4C / 8T | 6C / 12T | 6C / 12T |
| Clock Speeds (MHz) Min-Max-Turbo | 800-4000-4200 | assumed same | 1200-3300-3600 | assumed same |
| Caches L1/L2/L3 | 4x 32kB, 4x 256kB, 8MB | assumed same | 6x 32kB, 6x 256kB, 15MB | assumed same |
| Power TDP Rating (W) | 91W | assumed same | 140W | assumed same |
| Instruction Set Support | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. |

We do not expect major changes in future AVX512 supporting arch, especially with Skylake-E as Core Skylake is already out and the core specifications are known.

Multi-media (Fractal Generation) Benchmark

AVX512 Multi-Media

| Benchmark | Future Core i7 (4C/8T AVX512) Projected | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512) Projected | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx) |
| --- | --- | --- | --- | --- | --- | --- |
| Integer SIMD (Mpix/s) | 912.5 [+76% over AVX] | 516.2 [+76% over SSE] | 292 | 1020.7 [+76% over AVX] | 577.4 [+76% over SSE] | 327 |
| Long SIMD (Mpix/s) | 315.3 [+66% over AVX] | 190.1 [+66% over SSE] | 114.6 | 284.3 [+66% over AVX] | 171.4 [+66% over SSE] | 87.6 |
| Single Float SIMD (Mpix/s) | 916.8 [+2x over AVX] | 458.4 [+2.12x over SSE] | 216 | 1079 [+2x over AVX] | 539.5 [+2.12x over SSE] | 234.8 |
| Double Float SIMD (Mpix/s) | 545.8 [+2x over AVX] | 272.9 [+2.35x over SSE] | 116.1 | 622.4 [+2x over AVX] | 311.2 [+2.35x over SSE] | 126 |
| Quad Float SIMD (Mpix/s) | 20.3 [+94% over AVX] | 10.5 [+94% over SSE] | 5.4 | 622.4 [+94% over AVX] | 311.2 [+94% over SSE] | 126 |

  • Integer SIMD: We see around 76% improvement from AVX2 vs. SSE, thus we assume we’ll see something similar moving to AVX512 (~80%).
  • Long SIMD: We see around 66% improvement from AVX2 vs. SSE, but due to the new instructions we may see better AVX512 gains.
  • Single Float SIMD: We saw over 2x improvement from AVX/FMA over SSE, so while we may not see such a large improvement with AVX512, we may still get 100%.
  • Double Float SIMD: We see even better improvement from AVX over SSE here (2.35x), so hopefully we’ll get 2x moving to AVX512.
  • Quad Float SIMD: Emulating fp128 is hard work, but even then AVX is 94% faster than SSE and thus we’d expect AVX512 to be almost 2x faster still.

Despite some being disappointed by arch-to-arch performance improvement, the Skylake 4C (i7-6700K) already goes toe-to-toe with Haswell-E 6C (i7-5820K), but with AVX512 support Skylake-E 6C/8C is projected to comprehensively outperform it.

AVX512 will also allow Skylake-E to narrow the gap between it and current GPGPUs with multi-CPU Xeon systems able to “do without” GPGPUs – well except perhaps a “Phi” or two?
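
The projected i7-6700K figures in the Multi-Media results above appear consistent with a simple rule: apply the measured AVX2-over-SSE gain once more on top of the AVX2 score, capped at 2x for the floating-point tests. The Python sketch below reproduces three of the projected values under that assumption; the function name `project_avx512` and the `max_gain` parameter are illustrative, not part of Sandra.

```python
from typing import Optional


def project_avx512(avx2_score: float, sse_score: float,
                   max_gain: Optional[float] = None) -> float:
    """Project an AVX512 score by re-applying the AVX2-over-SSE gain.

    max_gain optionally caps the gain (e.g. 2.0 for the float tests).
    """
    gain = avx2_score / sse_score
    if max_gain is not None:
        gain = min(gain, max_gain)
    return avx2_score * gain


# i7-6700K scores (Mpix/s) taken from the results above.
print(round(project_avx512(516.2, 292.0), 1))                # Integer: 912.5
print(round(project_avx512(190.1, 114.6), 1))                # Long: 315.3
print(round(project_avx512(458.4, 216.0, max_gain=2.0), 1))  # Single float: 916.8
```

This is a straight extrapolation and ignores the lower clock that AVX512 code is expected to run at, so real results may differ.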

AVX512 Crypto

| Benchmark | Future Core i7 (4C/8T AVX512) Projected | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512) Projected | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx) |
| --- | --- | --- | --- | --- | --- | --- |
| Hashing SHA2-256 (GB/s) | 11.80 [+2x over AVX] | 5.90 [+2.36x over SSE] | 2.50 | 13.60 [+2x over AVX] | 6.80 [+2.26x over SSE] | 3 |
| Hashing SHA1 (GB/s) | 23 [+2x over AVX] | 11.5 [+2.16x over SSE] | 5.33 | 27.70 [+2x over AVX] | 13.85 [+2.04x over SSE] | 6.79 |
| Hashing SHA2-512 (GB/s) | 8.74 [+2x over AVX] | 4.37 [+2.33x over SSE] | 1.87 | 9.60 [+2x over AVX] | 4.80 [+2.20x over SSE] | 2.18 |

  • Hashing SHA2-256: We see a large 2.26-2.36x improvement of AVX2 vs. SSE, thus we still expect about a 2x increase with AVX512.
  • Hashing SHA1: Even with SHA1 we see a good 2.04-2.16x improvement of AVX2 vs. SSE, thus AVX512 should again double performance, though we may be limited by memory bandwidth.
  • Hashing SHA2-512: Switching to 64-bit integer SHA512 we see the best improvement yet of AVX2 vs. SSE (2.2-2.33x), with AVX512 likely to improve by 2x yet again.

With hashing we see even better results than with fractal generation, with AVX2 improving over 2x over SSE; AVX512 should thus improve by at least 100%, and if anything it is likely we will hit memory bandwidth limitations.

AVX512 Memory Bandwidth

| Benchmark | Future Core i7 (4C/8T AVX512) Projected | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512) Projected | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx) |
| --- | --- | --- | --- | --- | --- | --- |
| Memory Bandwidth (GB/s) | ~31.30 | 31.30 [0%] | 31.30 | ~42.00 [0%] | 42.30 [-1%] | 42.6 |
| L3 Bandwidth (GB/s) | ~267.97 [+10%] | 243.30 [+10%] | 220.90 | ~202.20 [+3%] | 195.90 [+3%] | 189.8 |
| L2 Bandwidth (GB/s) | ~392.50 [+21%] | 323.30 [+21%] | 266.30 | ~536.81 [+20%] | 444.10 [+20%] | 367.4 |
| L1D Bandwidth (GB/s) | ~1,364.25 [+50%] | 909.50 [+2.11x] | 429.90 | ~1,536.00 [+50%] | 1,024.00 [+2x] | 518 |

  • Memory Bandwidth: Even with DDR4 the memory sub-system hasn’t changed much, and despite 512-bit transfers with AVX512 there is really no performance delta in streaming data to/from memory.
  • L3 Bandwidth: As we move up the cache hierarchy, the L3 already shows a 10% bandwidth improvement using AVX2/FMA vs. SSE, with AVX512 likely to improve performance further.
  • L2 Bandwidth: As we expected, L2 bandwidth improves ~20% with AVX2/FMA and is likely to improve further.
  • L1D Bandwidth: Skylake has widened the data access ports (just like Haswell before it), thus 512-bit AVX512 transfers show the best improvement yet, 40-50%!

AVX512 does help take advantage of the widened data ports in Skylake and future arch, with L1D cache showing the best bandwidth improvement just like Haswell before it (with AVX2).

Memory bandwidth is still limited by DDR4 speeds; faster modules are coming out all the time, and this time their clocks are JEDEC ratified.

We will update the article with future (projected) results once more benchmarks are converted to AVX512 – once compiler support is released – but even so far we see excellent performance improvement.

Until then, those of you with access to AVX512 supporting hardware can download Sandra 2016 SP1 and test away!

New Promotions for Valentine’s Day February 2016

For February 2016 – and soon to arrive Valentine’s Day – we have some promotions for you to enjoy:

Happy Valentine’s Day (in advance) 😉

Sandra 2016 - Personal - Feb Promo

 

.Net Vectors (CLR 4.6 RyuJIT) Performance

What is RyuJIT?

“RyuJIT” is the code-name of the new x64 JIT compiler in the .Net 4.6 CLR, as included in Windows 10 (with updates available for Windows 8.1, 8, 7); it brings a variety of performance optimisations as well as new features like native vectorised/SIMD support.

Why do we need .Net Vector support?

Many algorithms benefit from vectorisation/parallelisation through SIMD instruction sets in (all) modern processors; while compilers/run-times (CLR/JVM) may be able to automatically vectorise code – the most efficient way is through constructs that indicate to the compiler/run-time how to vectorise code for the hardware it is running on.

While we could always interop to native code libraries using SIMD, these would be platform / instruction-set dependent and introduce code and maintenance complexity.

What are the other Pros/Cons of .Net Vector support?

The new CLR is a boon for high-performance algorithms:

  • Widely deployed: by default on Windows 10 and Windows Update on older Windows.
  • Widest possible: automatically uses the “widest” SIMD ISA (instruction set) supported by the processor, be it AVX2/FMA, AVX, SSE2, etc. [and AVX512 in future CLR] without any code modifications.
  • ISA/platform independent: same .Net code runs whatever the platform/ISA now and in the future. No need to write native code for each platform and ISA (e.g. AVX-Win64, SSE2-Win32, etc.)
  • All primitive data types supported: single/double floating-point, int/long integers.

Unfortunately Microsoft could not go the “whole way” and there are downsides:

  • x64 Only: RyuJIT is for x64 Windows only with x86 stuck with the old CLR that is unlikely to be updated.
  • Very limited Integer operators: without basic binary operators like “shift”, “mask”, “swap/permute”, etc. integer performance is low.
  • Limited functions and operators: even floating-point provides a limited subset of functions and operators.
  • CLR Issues: the new RyuJIT CLR does have problems with some .Net apps which may require users to stick to the older CLR and thus no Vector support.

.Net Vectors vs. Native SIMD Performance

We are testing native and .Net multi-media (fractal generation) performance using various SIMD instruction sets (AVX2/FMA, AVX, SSE2, etc.).

Hardware: Intel i7-4650U (Haswell ULV) with AVX2/FMA, AVX, SSE2 support.

Results Interpretation: Higher values (MPix/s, etc.) mean better performance.

Environment: Windows 8.1 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

.Net Vectorised Performance

| Data Type | .Net Vectorised | .Net Scalar | Native AVX2/FMA | Native AVX | Native SSE2 |
| --- | --- | --- | --- | --- | --- |
| Single Float (Mpix/s) | 54 (8pix width) [+9.2x] | 5.89 (1pix) | 102.3 (8pix) [+17.4x] | 89 (8pix) | 57.8 (4pix) |
| Double Float (Mpix/s) | 30.1 (4pix width) [+2.04x] | 14.78 (1pix) | 62.5 (4pix) [+4.2x] | 53.4 (2pix) | 31.9 (2pix) |
| Integer (Mpix/s) | 1.03 (8pix width) [0.056x] | 18.5 (1pix) | 114.5 (16pix) [+6.2x] | 73.4 (8pix) | 31.3 (4pix) |
| Int64 (Mpix/s) | 0.361 (4pix width) [0.020x] | 18 (1pix) | 41.6 (8pix) [+2.3x] | 23.4 (4pix) | 23 (2pix) |

We can confirm the use of AVX2/FMA/AVX by the width of the Vectors (256-bit wide, with float/int being 8-units wide, double/int64 being 4-units wide).

While the performance improvement over scalar code is significant (~2x-9x), the Vector code reaches only about 50% of the native SIMD implementation’s performance, which is somewhat disappointing but not altogether unexpected. However, future versions of the CLR will likely improve upon this, while our native code is unlikely to be optimised further.
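
The ~2x-9x and ~50% figures can be checked against the floating-point rows of the results table with a few lines of Python. The scores are copied from the table; the dictionary layout and helper names are illustrative.

```python
# (.Net Vectorised, .Net Scalar, Native AVX2/FMA) scores in Mpix/s,
# copied from the floating-point rows of the results table.
RESULTS = {
    "Single Float": (54.0, 5.89, 102.3),
    "Double Float": (30.1, 14.78, 62.5),
}


def speedup_over_scalar(vectorised: float, scalar: float) -> float:
    """How many times faster the Vector code is than the scalar code."""
    return vectorised / scalar


def share_of_native(vectorised: float, native: float) -> float:
    """Fraction of native AVX2/FMA performance reached by the Vector code."""
    return vectorised / native


for name, (vec, scalar, native) in RESULTS.items():
    print(f"{name}: {speedup_over_scalar(vec, scalar):.1f}x over scalar, "
          f"{share_of_native(vec, native):.0%} of native AVX2/FMA")
# Single Float: 9.2x over scalar, 53% of native AVX2/FMA
# Double Float: 2.0x over scalar, 48% of native AVX2/FMA
```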

No, the Vector integer performance is *not* a bug: the lack of bit-manipulation operations (“shift”, “swap/permute”, “mask”, etc.) makes complex Vector algorithms pretty much useless. Thus we only enable Vectors for floating-point operations.

Vectors may never replace native code completely, but lots of algorithms may now be implemented in native .Net code with good performance without the need of native libraries making deployment to different platforms (e.g. ARM/Windows, Mono/Linux, etc.) far easier.

It is good to see Microsoft adding new features to the CLR (which we would have expected Java to release first), as both the CLR and JVM have somewhat “stagnated” lately.

SiSoftware Sandra 2016 RTMa Released

We are providing an update to Sandra 2016, RTMa (version 22.15) with various updates and fixes:

  • .Net native Vector support: (floating-point single/double) in the latest 4.6 CLR RyuJIT. The CLR automatically uses AVX/SSE2 SIMD as supported by the CPU. (See the .Net Vectors (CLR 4.6 RyuJIT) Performance article for more information.)
  • CPU Image Processing: Fixed the SIMD code-paths (FMA, AVX, SSE2) not being run, with only the FPU path used, which resulted in low performance.
  • GPGPU Image Processing: Minor performance optimisation for median/de-noise filter.
  • GPGPU Crypto: SHA performance optimisations for nVidia cards in CUDA and OpenCL (SHA1 especially).
  • Overall Score 2016: Fixed the score not being generated in all cases.
  • Windows 10: 1511 SDK update (build 10586, November 2015 update).
  • Website Change: Fixed links and feeds that were broken by the transition to WP.

We recommend you update your version of Sandra 2016 as soon as possible.