For the month of May we have a new promotion: a FREE SiSoftware USB drive with any purchase of Sandra Personal or Business Edition.
We are happy to release SP2 (Service Pack 2) for SiSoftware Sandra 2016.
This new version has been built with updated tools to extract maximum performance from the latest hardware, and also contains minor additions and fixes:
- Spanish Help file translation courtesy of Antonio Pérez Madrazo.
- CUDA 8.0 (Pascal) preliminary device support.
- Compiler optimisations including SIMD improvements.
We are happy to release SP1a (Service Pack 1a) for SiSoftware Sandra 2016.
This is a minor update that improves stability and adds a few optimisations developed after further testing of the SP1 release.
The SP1a update also enables the Marbling: Perlin Noise 2D (3 octaves) Filter for both GPGPUs (CUDA, OpenCL) and CPU.
We are happy to release SP1 (Service Pack 1) for SiSoftware Sandra 2016.
This release introduces the initial AVX512 benchmarks, with all SIMD benchmarks due to be ported once compiler support becomes available:
– CPU Multi-Media (Fractal Generation): single, double floating-point; integer, long benchmarks ported to AVX512. [See article Future performance with AVX512]
– CPU Crypto (SHA Hashing): SHA2-256 and SHA2-512 multi-buffer ported to AVX512.
– Hardware support for future architectures (AMD and Intel).
– .Net Multi-Media native vector support is vector-width independent and will thus support AVX512 automatically with a future CLR release.
– GPU Image Processing: New, more complex filters:
- Oil Painting: Quantise (9×9) Filter: CUDA, OpenCL
- Diffusion: Randomise (256) Filter: CUDA, OpenCL
- Marbling: Perlin Noise 2D (3 octaves) Filter: CUDA, OpenCL
– CPU Image Processing: New, more complex filters
- Oil Painting: Quantise (9×9) Filter: AVX2/FMA, AVX, SSE2
- Diffusion: Randomise (256) Filter: AVX2/FMA, AVX, SSE2
- Marbling: Perlin Noise 2D (3 octaves) Filter: AVX2/FMA, AVX, SSE2
More benchmarks will be ported to AVX512 subject to compiler support; currently Microsoft’s VC++ does not support AVX512 intrinsics, and in the interest of fairness we do not use specialised compilers.
Please see our article – Future performance with AVX512 – for a primer on AVX512 and projected performance improvements due to AVX512 and 512-bit transfers.
What is AVX512?
AVX512 is a new SIMD instruction set operating on 512-bit registers, the natural progression from FMA/AVX (256-bit registers). It was first introduced with Intel’s “Phi” co-processor (Intel’s answer to GPGPUs) and now a version of it is making its way to CPUs themselves.
Why is AVX512 important?
CPU performance has only marginally increased (5-10%) from one generation to the next, with power efficiency being the primary goal; with limited options (cannot increase clock speeds, must reduce power, hard to improve execution efficiency, etc.), exploiting data-level parallelism through SIMD is a relatively simple way to improve performance.
SIMD instructions have long been used to increase performance (since the introduction of MMX with the Pentium in 1997!) and their register width has been increasing steadily from 64-bit (MMX) to 128-bit (SSEx) to 256-bit (AVX/FMA) and now to 512-bit (AVX512) – thus processing more and more data simultaneously.
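The register widths above translate directly into how many elements a single instruction processes. A quick back-of-the-envelope sketch (illustrative only; MMX was integer-only, so the float counts apply from SSE onwards):

```python
# Elements per register for each SIMD generation mentioned above.
WIDTHS = {"MMX": 64, "SSEx": 128, "AVX/FMA": 256, "AVX512": 512}

def lanes(register_bits, element_bits):
    """How many elements of a given size fit in one SIMD register."""
    return register_bits // element_bits

for isa, bits in WIDTHS.items():
    print(f"{isa}: {lanes(bits, 32)} x 32-bit floats, {lanes(bits, 64)} x 64-bit doubles")
```

AVX512 thus processes 16 single-precision or 8 double-precision values per instruction, quadruple what SSE managed.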
Unfortunately, software has to be specifically modified to support AVX512 (or at the very least re-compiled), but developers are generally used to this after the SSE-to-AVX transition.
SiSoftware has thus been updating its benchmarks to AVX512, though some need compiler support and will have to wait until Microsoft updates its Visual C++ compiler.
What CPUs will support AVX512?
It was rumoured that the newly released “Skylake” Core consumer CPUs were going to support AVX512 – but they do not. The future “Skylake-E” Xeon “Purley” server/workstation CPUs are supposed to support it.
AVX512 is actually a family of multiple instruction sets – with “Skylake-E” supporting F (foundation), CD (conflict detection), BW (byte & word), DQ (double-word and quad-word) and VL (vector length extension) – and the future “Cannonlake-E” supporting IFMA (integer FMA), VBMI (vector byte manipulation) and perhaps others.
It is disappointing that AVX512 is not enabled on consumer CPUs (Core), though it will eventually appear in future iterations; for now, gamers/enthusiasts need to buy into the “extreme/Skylake-E” platform, while business users will get “Xeon/Skylake-E” in their workstations.
What kind of performance improvement can we expect with AVX512?
The transition from 128-bit SSE to 256-bit AVX/FMA/AVX2 has – eventually – resulted in a 70-120% improvement, with compute-intensive code that seldom accesses memory yielding the best improvement. Note that AVX code executes at a lower clock than “normal”/SSE code.
AVX512 not only doubles the register width (512-bit) but also the number of registers (32 vs. 16); thus we can hold 4x (four times) more data, which may reduce cache/memory accesses by keeping more data locally. However, AVX512 code will again run at a lower clock than AVX/FMA code.
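The 4x figure follows from simple arithmetic on the register file (a sketch using the architectural register counts quoted above):

```python
# AVX2: 16 registers x 256 bits; AVX512: 32 registers x 512 bits.
avx2_capacity = 16 * 256 // 8     # bytes held in the AVX2 register file: 512
avx512_capacity = 32 * 512 // 8   # bytes held in the AVX512 register file: 2048
print(avx512_capacity // avx2_capacity)  # 4x more data kept out of cache/memory
```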
In the next examples we project future gains through AVX512 for common algorithms as implemented in Sandra’s benchmarks and what they might mean to customers.
Can I test AVX512 performance with Sandra?
Yes – with the release of Sandra 2016 SP1 you can now test AVX512 performance; naturally, you need a CPU with the required support. All the low-level benchmarks below have been ported to AVX512:
- Multi-Media (Fractal Generation) Benchmark: AVX512 F, BW, DQ supported now
- Cryptography (SHA Hashing) Benchmark: AVX512 BW, DQ supported now
- Memory & Cache Bandwidth Benchmarks: AVX512 F, DQ supported now
The following benchmarks require future compiler support (Microsoft VC++) and have not been released at this time:
- Financial Analysis (Black-Scholes, Binomial, Monte-Carlo): AVX512 F support coming soon
- Scientific Analysis (GEMM, FFT, N-Body): AVX512 F support coming soon
- Image Processing (Blur/Sharpen/Motion-Blur, Sobel, Median): AVX512 BW support coming soon
- .Net Vectorised (Fractal Generation): AVX512 support depends on the RyuJIT numerics libraries, which need to be updated by Microsoft; no changes are required on our side.
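For context, the Multi-Media benchmark generates a fractal (Mandelbrot-style). Sandra’s actual kernel is not public, so the sketch below is only an illustration of the per-pixel work that the AVX512 port performs 16 pixels (single float) at a time:

```python
# Minimal scalar Mandelbrot sketch (illustration only, not Sandra's kernel).
def mandelbrot_iters(c, max_iter=256):
    """Iterations until |z| escapes 2.0 - the work SIMD lanes do in parallel."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter

# Each pixel maps to a point c; with AVX512, 16 single-float pixels
# can be iterated simultaneously per instruction.
print(mandelbrot_iters(0j))      # interior point: never escapes -> 256
print(mandelbrot_iters(2 + 2j))  # exterior point: escapes immediately -> 0
```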
We are comparing two released public CPUs with their projected next-gen counterparts supporting AVX512.
| Processor | Intel i7-6700K (Skylake) | Intel i7-77XX? (next-gen) | Intel i7-5820K (Haswell-E) | Intel i7-78XX? (Skylake-E) |
|---|---|---|---|---|
| Cores/Threads | 4C / 8T | 4C / 8T | 6C / 12T | 6C / 12T |
| Clock Speeds (MHz) Min-Max-Turbo | 800-4000-4200 | assumed same | 1200-3300-3600 | assumed same |
| Caches L1/L2/L3 | 4x 32kB, 4x 256kB, 8MB | assumed same | 6x 32kB, 6x 256kB, 15MB | assumed same |
| Power TDP Rating (W) | 91W | assumed same | 140W | assumed same |
| Instruction Set Support | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. |
We do not expect major changes in future AVX512-supporting architectures, especially Skylake-E, as the Skylake Core parts are already out and the core specifications are known.
Multi-media (Fractal Generation) Benchmark
We will update the article with future (projected) results once more benchmarks are converted to AVX512 – once compiler support is released – but even the results so far show excellent performance improvements.
Until then, those of you with access to AVX512 supporting hardware can download Sandra 2016 SP1 and test away!
For February 2016 – with Valentine’s Day soon to arrive – we have some promotions for you to enjoy:
- Sandra Personal – 50% Off! ($24.99 from $49.99)
- Sandra Business Edition – Free Power Bank
- Sandra Tech Support Edition – Free USB Keyring and Power Bank
Happy Valentine’s Day (in advance) 😉
What is RyuJIT?
“RyuJIT” is the code-name of the latest CLR of .Net 4.6, included in Windows 10 (with updates available for Windows 8.1, 8 and 7), which brings a variety of performance optimisations as well as new features such as native vectorised/SIMD support.
Why do we need .Net Vector support?
Many algorithms benefit from vectorisation/parallelisation through the SIMD instruction sets in (all) modern processors; while compilers/run-times (CLR/JVM) may be able to automatically vectorise code, the most efficient way is through constructs that tell the compiler/run-time how to vectorise code for the hardware it is running on.
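In C#, such a construct is `System.Numerics.Vector<T>`, whose lane count is fixed by the run-time rather than the source. The loop idiom is sketched below in Python for illustration (the real API is C#; `lane_count` stands in for `Vector<float>.Count`):

```python
# Width-independent loop pattern (illustration of .Net's Vector<T> idiom):
# the lane count is queried at run time, so the same source automatically
# uses SSE2 (4 floats), AVX (8) or, in a future CLR, AVX512 (16).
def add_arrays(a, b, lane_count):
    """Process `lane_count` elements per 'instruction', then a scalar tail."""
    out = [0.0] * len(a)
    i = 0
    while i + lane_count <= len(a):
        # In C#: (new Vector<float>(a, i) + new Vector<float>(b, i)).CopyTo(out, i)
        out[i:i + lane_count] = [x + y for x, y in zip(a[i:i + lane_count],
                                                       b[i:i + lane_count])]
        i += lane_count
    for j in range(i, len(a)):     # scalar remainder
        out[j] = a[j] + b[j]
    return out

print(add_arrays([1.0] * 10, [2.0] * 10, 8))  # -> [3.0, 3.0, ..., 3.0]
```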
While we could always interop to native code libraries using SIMD, these would be platform / instruction-set dependent and introduce code and maintenance complexity.
What are other Pro/Cons of .Net Vector support?
The new CLR is a boon for high-performance algorithms:
- Widely deployed: by default on Windows 10 and Windows Update on older Windows.
- Widest possible: automatically uses the “widest” SIMD ISA (instruction set) supported by the processor, be it AVX2/FMA, AVX, SSE2, etc. [and AVX512 in future CLR] without any code modifications.
- ISA/platform independent: same .Net code runs whatever the platform/ISA now and in the future. No need to write native code for each platform and ISA (e.g. AVX-Win64, SSE2-Win32, etc.)
- All primitive data types supported: single/double floating-point, int/long integers.
Unfortunately Microsoft could not go the “whole way” and there are downsides:
- x64 Only: RyuJIT is for x64 Windows only with x86 stuck with the old CLR that is unlikely to be updated.
- Very limited Integer operators: without basic binary operators like “shift”, “mask”, “swap/permute”, etc. integer performance is low.
- Limited functions and operators: even floating-point provides a limited subset of functions and operators.
- CLR Issues: the new RyuJIT CLR does have problems with some .Net apps which may require users to stick to the older CLR and thus no Vector support.
.Net Vectors vs. Native SIMD Performance
We are testing native and .Net multi-media (fractal generation) performance using various SIMD instruction sets (AVX2/FMA, AVX, SSE2, etc.).
Hardware: Intel i7-4650U (Haswell ULV) with AVX2/FMA, AVX, SSE2 support.
Results Interpretation: Higher values (MPix/s, etc.) mean better performance.
Environment: Windows 8.1 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.
| Data Type | .Net Vectorised | .Net Scalar | Native AVX2/FMA | Native AVX | Native SSE2 |
|---|---|---|---|---|---|
| Single Float (MPix/s) | 54 (8pix width) [+9.2x] | 5.89 (1pix) | 102.3 (8pix) [+17.4x] | 89 (8pix) | 57.8 (4pix) |
| Double Float (MPix/s) | 30.1 (4pix width) [+2.04x] | 14.78 (1pix) | 62.5 (4pix) [+4.2x] | 53.4 (2pix) | 31.9 (2pix) |
| Integer (MPix/s) | 1.03 (8pix width) [0.056x] | 18.5 (1pix) | 114.5 (16pix) [+6.2x] | 73.4 (8pix) | 31.3 (4pix) |
| Int64 (MPix/s) | 0.361 (4pix width) [0.020x] | 18 (1pix) | 41.6 (8pix) [+2.3x] | 23.4 (4pix) | 23 (2pix) |
We can confirm the use of AVX2/FMA/AVX by the width of the Vectors (256-bit wide, with float/int being 8-units wide, double/int64 being 4-units wide).
While the performance improvement over scalar code is significant (~2x-9x), it only reaches about 50% of the native SIMD implementation, which is somewhat disappointing but not altogether unexpected. However, future versions of the CLR will likely improve upon this, while our native code is unlikely to be optimised much further.
No, the Vector integer performance is *not* a bug: the lack of bit-manipulation operators (“shift”, “swap/permute”, “mask”, etc.) makes complex integer Vector algorithms pretty much useless. Thus we only enable Vectors for floating-point operations.
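To illustrate why the missing operators matter (a hypothetical fixed-point example, not Sandra’s actual kernel): integer image/fractal kernels typically scale products with a right shift, and without a vector shift every lane falls back to slow element-wise code.

```python
# Why integer Vectors stall without shifts: fixed-point arithmetic (common
# in integer kernels) scales every multiply with a right shift.
# Hypothetical Q16.16 fixed-point multiply:
def fx_mul(a, b, frac_bits=16):
    return (a * b) >> frac_bits   # this '>>' has no Vector<T> equivalent

one_half = 1 << 15                             # 0.5 in Q16.16
print(fx_mul(one_half, one_half) / (1 << 16))  # -> 0.25
```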
Vectors may never replace native code completely, but many algorithms can now be implemented in pure .Net code with good performance and without native libraries, making deployment to different platforms (e.g. ARM/Windows, Mono/Linux, etc.) far easier.
It is good to see Microsoft adding new features to the CLR – features we would have expected Java to ship first – as both the CLR and JVM have somewhat “stagnated” lately, which is not good to see.
We are providing an update to Sandra 2016, RTMa (version 22.15) with various updates and fixes:
- .Net native Vector support (floating-point single/double) in the latest 4.6 CLR RyuJIT: the CLR automatically uses AVX/SSE2 SIMD as supported by the CPU. (See the .Net Vectors (CLR 4.6 RyuJIT) Performance article for more information.)
- CPU Image Processing: fixed SIMD code-paths (FMA, AVX, SSE2) not running – only FPU – resulting in low performance.
- GPGPU Image Processing: Minor performance optimisation for median/de-noise filter.
- GPGPU Crypto: SHA performance optimisations for nVidia cards in CUDA and OpenCL (SHA1 especially).
- Overall Score 2016: fixed the score not being generated in some cases.
- Windows 10: 1511 SDK update (build 10586, November 2015 update).
- Website Change: fixed links and feeds broken by the transition to WP.
We recommend you update your version of Sandra 2016 as soon as possible.