SiSoftware Sandra Titanium (2018) SP3 Update

We are pleased to announce SP3 (version 28.40) for Sandra Titanium (2018) with updated hardware and software support:

Sandra Titanium (2018) Press Release

GPGPU Benchmarks:

  • Enable FP16/half support in CUDA benchmarks*
  • Enable low-precision FP32 shader support in DirectX Compute benchmarks
  • Enable Tensor support for CUDA/GEMM/FP16

Note: FP16 is already supported by the OpenCL and DirectX compute benchmarks.

GPGPU Processing, Video Shader Benchmarks:

  • FP16/half-precision performance* (if supported) is included in the aggregate scores. This better reflects the performance of new GPGPUs that natively support FP16/half processing.
  • Tensors code-path used by default where applicable.

Note: FP16 is supported by all benchmarks: DirectX, OpenGL, OpenCL and now CUDA.

Reviews using Sandra 2018 SP3

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra Titanium (2018) SP2c Update

We are pleased to announce SP2c (version 28.34) for Sandra Titanium (2018) with updated hardware and software support:

Sandra Titanium (2018) Press Release

  • Hardware Information
    • Ryzen 1/2 & Threadripper: fix crash when Hyper-V or “Windows Core Isolation – Memory Integrity” (VBS) is enabled
    • Ryzen 2 & Threadripper: enabled display of Core temperature, voltage, current, power usage

Reviews using Sandra 2018 SP2c

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite


SiSoftware Sandra Titanium (2018) SP2b Update

We are pleased to announce SP2b (version 28.31) for Sandra Titanium (2018) with updated hardware and software support:

Sandra Titanium (2018) Press Release

  • CPU Benchmarks:
    • Multi-Media, .Net Multi-Media, Java Multi-Media: optimised fractal picture size selection for better scaling. Crash fix upon loading large fractals.
    • Image Processing: optimised picture size selection for better scaling.
  • GPGPU Benchmarks:
    • Processing (CUDA, OpenCL, DirectX): optimised fractal picture size selection for better scaling. Crash fix upon loading large fractals.
    • Image Processing (CUDA, OpenCL, DirectX): optimised picture size selection for better scaling.
  • GPGPU: improved clock speed reporting (base/boost).

Reviews using Sandra 2018 SP2b

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra Titanium (2018) SP2 Update

We are pleased to announce SP2 (version 28.28) for Sandra Titanium (2018) with updated hardware and software support:

Sandra Titanium (2018) Press Release

  • GPGPU:
    • nVidia Series 2000 (Turing) GPGPU support (still on SDK 9.2 for broad compatibility)
    • Fixes workgroup sizes for Scientific benchmarks
  • CPU:
    • Support for Intel 9000-Series CPU
    • Crypto AES benchmark VAES 512-bit and 256-bit support

Product Reviews

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra Titanium (2018) SP1b Update

We are pleased to announce SP1b (version 28.26) for Sandra Titanium (2018) with updated hardware and software support:

Sandra Titanium (2018) Press Release

  • GPGPU: Updated nVidia CUDA 9.2 SDK with FP16 support for all benchmarks
    • nVidia Volta arch GPGPU support
  • CPU: Support update for AMD ThreadRipper v2
  • CPU: Support update for Intel v9 CPUs

Product Reviews

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra Titanium (2018) RTMa Update

We are pleased to release RTMa (RTM update A – version 28.18) update for Sandra Titanium (2018) with the following updates:

Sandra Titanium (2018) Press Release

  • CPU: Updated tools allow a new AVX512 code-path (FFT)
  • Hardware support updates and fixes
  • Updates to the new Overall Benchmarks: CPU, GPGPU, Disk, Memory
  • Firmware logging and upload for all benchmarks including Ranker support
    • Microcode version for CPUs (to determine Spectre/Meltdown support)
    • Firmware version for Memory Controllers
    • Firmware version for Disk drives (HDD or SSD)
    • BIOS version for mainboards
    • BIOS version for Graphics Cards / GPGPUs

Note: Sandra will also display whether an updated version for microcode, firmware or BIOS is available for your device from the respective manufacturer / OEM. Community powered, this relies on users benchmarking and uploading their scores to the Ranker.

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Sandra Platinum (2017) SP4 – Updates for ‘Meltdown’ and ‘Spectre’

NB: SP4 was refreshed a day after release: the current version is 24.61, replacing the original 24.55.

We are pleased to release SP4 (Service Pack 4 – version 24.61) update for Sandra Platinum (2017) with the following updates:

Sandra Platinum (2017) Press Release

  • Reporting of Operating System (Windows) speculation control settings for the recently discovered vulnerabilities:
  • Reporting of latest CPU microcode update availability
    • Hardware mitigation for BTI/’Spectre’
  • Reporting of CPU features for branch control
    • Hardware enumeration and control for speculation and predictors
      • Indirect branch restricted speculation (IBRS) and Indirect branch predictor barrier (IBPB)
      • Single thread indirect branch predictors (STIBP)
      • Architecture Capabilities (affected by IB or not)
  • Reporting of CPU support for Context ID / Indirect CID
    • Hardware acceleration for context switching thus mitigating performance loss when KVA (Kernel’s Virtual Address) Shadowing is enabled.

Windows speculation control settings reporting:

Recommended Settings (Windows and CPU updated)
Operating System Mitigation: Enabled (Windows updated)

BTI Mitigation: Enabled, CPU indirect branch-control enumeration support (microcode updated)

RDCL Mitigation: KVA Shadowing enabled, Windows/CPU context-ID support

OK Settings (Windows updated, but CPU not updated)
Operating System Mitigation: Enabled (Windows updated)

BTI Mitigation: Not enabled, no CPU support (e.g. firmware not updated – check for BIOS update)

RDCL Mitigation: KVA Shadowing enabled, Windows/CPU context-ID support

OK Settings (Windows updated, but CPU not updated, no CID)
Operating System Mitigation: Enabled (Windows updated)

BTI Mitigation: Not enabled, no CPU support (e.g. firmware not updated – check for BIOS update)

RDCL Mitigation: KVA Shadowing not enabled, no context-ID support (either CPU or Windows obsolete)

Not Recommended (Windows not updated)
Operating System Mitigation: Not enabled (Windows has not been updated: install the OS update)
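The four cases above form a small decision table. As a minimal sketch, assuming the three reported settings are available as booleans, they could be mapped to a rating as follows. The `SpeculationStatus` container and the `rate` function are hypothetical names used for illustration only; they are not part of Sandra or of any Windows API.

```python
from dataclasses import dataclass


@dataclass
class SpeculationStatus:
    """Hypothetical container for the settings listed above."""
    os_mitigation: bool   # Windows updated, OS mitigation enabled
    bti_mitigation: bool  # CPU indirect branch-control support (microcode updated)
    kva_shadowing: bool   # RDCL mitigation through KVA Shadowing
    context_id: bool      # Windows/CPU context-ID support


def rate(status: SpeculationStatus) -> str:
    """Map a status onto one of the four cases listed above."""
    if not status.os_mitigation:
        return "Not Recommended (Windows not updated)"
    if status.bti_mitigation and status.kva_shadowing and status.context_id:
        return "Recommended (Windows and CPU updated)"
    if status.kva_shadowing and status.context_id:
        # BTI mitigation is necessarily off here.
        return "OK (Windows updated, but CPU not updated)"
    if not (status.bti_mitigation or status.kva_shadowing or status.context_id):
        return "OK (Windows updated, but CPU not updated, no CID)"
    # Any combination that the list above does not describe.
    return "Unlisted combination"


print(rate(SpeculationStatus(True, False, True, True)))
```

Combinations that the list does not cover are reported as such rather than guessed.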

Processor features and microcode updates:

CPU microcode not latest
CPU has not been updated with the latest microcode update: check for an updated BIOS/firmware from the OEM/computer manufacturer.
CPU microcode w/speculation support
CPU supports IBRS/IBPB (indirect branch restricted speculation/indirect branch predictor barrier) enumeration as well as STIBP (single thread indirect branch predictors).
CPU supports CID or InvPCID
CPU supports either CID (Context ID) or InvPCID (Invalidate Process-Context ID) for faster context switching, thus mitigating performance loss.

 

Microcode Updates for Intel processors (non-exhaustive list):

 

 

Generation | Old Microcode | Updated Microcode
IvyBridge 3rd Gen (IVB C0) | 28 | 2a
Haswell 4th Gen (HSW-U/Y Cx/Dx) | 20 | 21
Haswell-X 4th Gen (HSX C0) | 3a | 3b
Haswell-EX 4th Gen (HSW-EX E0) | 0f | 10
Broadwell 5th Gen (BDW-U/Y E/F) | 25 | 28
Crystalwell 5th Gen (CRW CX) | 17 | 18
Broadwell 5th Gen (BDW-H E/G) | 17 | 1b
Broadwell 5th Gen (BDX-DE V0/V1) | 0f | 14
Broadwell 5th Gen (BDX-DE V2) | 0d | 11
SkyLake 6th Gen (SKL-U/Y D0/R0) | ba | c2
SkyLake-X 6th Gen (SKX H0) | 35 | 3c
KabyLake 7th Gen (KBL-U/Y H0) | 62 | 80
KabyLake 7th Gen (KBL Y0 / CFL D0) | 70 | 80
KabyLake 7th Gen (KBL-H/S B0) | 5e | 80
CoffeeLake 8th Gen (CFL U0/B0) | 70 | 80
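
As an illustration of how the list above could be used, here is a minimal sketch that checks a reported microcode revision against the updated revision for a few of the listed steppings. It assumes revisions are compared as the short hexadecimal values shown above; the `MICROCODE` dictionary and `has_updated_microcode` function are hypothetical names, and reading the revision from the CPU is outside the scope of the sketch.

```python
# (old, updated) microcode revisions in hexadecimal, copied from the list above.
# Only a few steppings are included to keep the sketch short.
MICROCODE = {
    "IVB C0": (0x28, 0x2A),
    "HSW-U/Y Cx/Dx": (0x20, 0x21),
    "SKL-U/Y D0/R0": (0xBA, 0xC2),
    "SKX H0": (0x35, 0x3C),
    "KBL-U/Y H0": (0x62, 0x80),
}


def has_updated_microcode(stepping: str, revision: int) -> bool:
    """Return True if `revision` is at least the updated revision listed."""
    _old, updated = MICROCODE[stepping]
    return revision >= updated


for stepping, revision in [("SKX H0", 0x35), ("SKX H0", 0x3C)]:
    state = "updated" if has_updated_microcode(stepping, revision) else "not latest"
    print(f"{stepping} revision {revision:02x}: {state}")
```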

 

Note: Due to some unexplained crashes, resume from sleep issues, etc. the current (January 2018) microcode updates have been suspended by Intel. They are scheduled to be resumed in March 2018.

In preliminary benchmarking, we have observed no or very minor performance impact to CPU, GPGPU and memory scores; disk performance for small transfers is impacted when KVA shadowing is enabled.

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Sandra Platinum (2017) SP3 – Updates and Fixes

We are pleased to release SP3 (Service Pack 3 – version 24.50) update for Sandra Platinum (2017) with the following updates:

Sandra Platinum (2017) Press Release

  • Fix start-up crash with non-AVX older systems (both x86 and x64)
  • Tools update allowing further optimisations to AVX512 benchmarks
  • Other hardware information fixes depending on APIC ID configuration
    • AMD Ryzen 5 – L3 cache detection 4x not 2x
    • AMD Ryzen 3 – detected as SMP

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Sandra Platinum (2017) SP2 – NUMA for ThreadRipper, AVX512 for SKL-X


We are pleased to release SP2 (Service Pack 2 – version 24.41) update for Sandra Platinum (2017) with the following updates:

Sandra Platinum (2017) Press Release

  • Tools update allowing further ports of benchmarks to AVX512, e.g.:
    • CPU Multi-Media: 128-bit (octa) floating-point benchmark
    • CPU Scientific: GEMM, FFT, N-Body (single and double floating-point)
    • CPU Image Processing: All filters vectorised and ported to AVX512 (Blur/Sharpen/Motion-Blur, Edge/Noise/Oil, Diffusion/Marble)
  • Algorithm harness update allowing NUMA multi-block performance improvement, e.g.:
    • CPU Multi-Media: all algorithms, both integer and floating-point.
    • CPU Cryptography: all algorithms, both crypto and hashing.
    • CPU Scientific: all algorithms (but especially (F/D)GEMM)
    • CPU Financial: Monte-Carlo (N/A others).
    • CPU Image Processing: All filters (Blur/Sharpen/Motion-Blur, Edge/Noise/Oil, Diffusion/Marble).

New articles showing the improvement that AVX512 and NUMA bring:

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

AVX512 Improvement for Skylake-X (Core i9-7900X)

Intel Skylake-X Core i9

What is AVX512?

AVX512 (Advanced Vector eXtensions) is the 512-bit SIMD instruction set that follows the previous 256-bit AVX2/FMA3/AVX instruction sets. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators, albeit in a somewhat different form, it has finally made it to its CPU lines with Skylake-X (SKL-X/EX/EP), for now on HEDT (i9) and Server (Xeon) parts, and hopefully it will reach the mainstream at some point.

Note: it is rumoured that the current Skylake (SKL)/Kabylake (KBL) cores were also meant to support it, based on core changes (widening of ports to 512-bit, unit changes, etc.); nevertheless, no public way of enabling it has been found.

AVX512 consists of multiple extensions and not all CPUs (or GPGPUs) may implement them all:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit. [supported by SKL-X, Phi]
  • AVX512-DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512-BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512-VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers [supported by SKL-X]
  • AVX512-CD – Conflict Detection – detects conflicting element addresses so that loops with indirect memory updates can be vectorised [supported by SKL-X, Phi]
  • AVX512-ER – Exponential & Reciprocal – transcendental operations [not supported by SKL-X but Phi]
  • more sets will be introduced in future versions
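
Software normally discovers which of these extensions a CPU implements through the feature bits returned in EBX by CPUID leaf 7, sub-leaf 0. The sketch below only decodes a raw EBX value that has already been obtained elsewhere; the `decode_avx512` helper and the sample value are hypothetical, and a complete check would also confirm that the operating system saves the 512-bit register state.

```python
from __future__ import annotations

# Bit positions in EBX as returned by CPUID leaf 7, sub-leaf 0.
AVX512_BITS = {
    "AVX512F": 16,
    "AVX512DQ": 17,
    "AVX512PF": 26,
    "AVX512ER": 27,
    "AVX512CD": 28,
    "AVX512BW": 30,
    "AVX512VL": 31,
}


def decode_avx512(ebx: int) -> list[str]:
    """List the AVX512 extensions whose feature bit is set in `ebx`."""
    return [name for name, bit in AVX512_BITS.items() if (ebx >> bit) & 1]


# Hypothetical EBX value with the F, DQ, BW and VL bits set.
sample_ebx = (1 << 16) | (1 << 17) | (1 << 30) | (1 << 31)
print(decode_avx512(sample_ebx))
```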

As with anything, simply doubling register width does not automagically double performance: dependencies, memory load/store latencies and even data characteristics limit the gains, some of which may require future architectures or tools to realise their true potential.
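
One way to reason about this limit is Amdahl's law, which is brought in here as an illustration and is not something the benchmarks themselves compute. If only a fraction of the run time benefits from the wider registers, the overall gain falls short of 2x. The function names below are hypothetical.

```python
def overall_speedup(vector_fraction: float, vector_gain: float = 2.0) -> float:
    """Amdahl's law: only `vector_fraction` of the run time gets `vector_gain` faster."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_gain)


def implied_vector_fraction(observed: float, vector_gain: float = 2.0) -> float:
    """Invert Amdahl's law: which fraction would explain an observed speedup?"""
    return (1.0 - 1.0 / observed) / (1.0 - 1.0 / vector_gain)


for fraction in (0.5, 0.8, 1.0):
    print(f"{fraction:.0%} of run time vectorised: {overall_speedup(fraction):.2f}x overall")

# A hypothetical observed gain of 1.8x would imply that roughly 89% of the
# run time scales with register width.
print(f"{implied_vector_fraction(1.8):.1%}")
```

Under this simple model, even a workload whose run time is 80% vectorised tops out at about 1.67x when register width doubles.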

In this article we test AVX512 core performance; please see our other articles on:

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks | SKL-X AVX512 | SKL-X AVX2/FMA3 | Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 1460 [+23%] | 1180 | For integer workloads we manage only a 23% improvement: not quite the 100% we were hoping for, but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 519 [+19%] | 435 | With a 64-bit integer workload the improvement drops to 19%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 7.72 [=] | 7.62 | No SIMD here.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 1800 [+80%] | 1000 | In this floating-point test we finally see the power of AVX512: it is 80% faster than AVX2/FMA3, a huge improvement.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 1150 [+85%] | 622 | Switching to FP64 increases the improvement to 85%, a huge gain.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 36 [+50%] | 24 | In this heavy algorithm, which uses FP64 to mantissa-extend to FP128, we see only a 50% improvement, still nothing to ignore.

AVX512 cannot bring a 100% improvement but does manage up to 85%, no mean feat! While the integer workload gain is only 20-25%, it is still decent. Heavy compute algorithms will greatly benefit from AVX512.

Native Benchmarks | SKL-X AVX512 | SKL-X AVX2/FMA3 | Comments
BenchCrypt Crypto SHA2-256 (GB/s) | 26 [+78%] | 14.6 | With no data dependency we get good scaling of almost 80%, even with this integer workload.
BenchCrypt Crypto SHA1 (GB/s) | 39.8 [+51%] | 26.4 | Here we see only a 50% improvement, likely due to a lack of memory bandwidth; with more bandwidth it would likely scale higher.
BenchCrypt Crypto SHA2-512 (GB/s) | 21.2 [+94%] | 10.9 | With a 64-bit integer workload we see almost perfect scaling of 94%.

As we work on different buffers and have no dependencies, AVX512 brings up to a 94% performance improvement, limited only by memory bandwidth: even 4-channel DDR4 @ 3200MT/s is not enough for a 10C/20T CPU. AVX512 is absolutely worth it to drive the system to the limit.

Native Benchmarks | SKL-X AVX512 | SKL-X AVX2/FMA3 | Comments
BenchScience SGEMM (GFLOPS) float/FP32 | 558 [-7%] | 605 | Unfortunately the current compiler does not seem to help.
BenchScience DGEMM (GFLOPS) double/FP64 | 235 [+2%] | 229 | Changing to FP64 at least allows AVX512 to win by a meagre 2%.
BenchScience SFFT (GFLOPS) float/FP32 | 35.3 [=] | 35.3 | Again the compiler does not seem to help here.
BenchScience DFFT (GFLOPS) double/FP64 | 19.9 [-2%] | 20.2 | With FP64 nothing much happens.
BenchScience SNBODY (GFLOPS) float/FP32 | 585 [-1%] | 591 | No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64 | 175 [-1%] | 178 | With an FP64 workload nothing much changes.

With complex SIMD code that is not written in assembler, the compiler has some way to go and performance is not great. At least it is not much worse: every test except SGEMM is within a few percent of AVX2/FMA3.

Native Benchmarks | SKL-X AVX512 | SKL-X AVX2/FMA3 | Comments
CPU Image Processing Blur (3×3) Filter (MPix/s) | 3830 [+60%] | 2390 | We start well here, with AVX512 60% faster on a float/FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 1700 [+70%] | 1000 | Same algorithm but more shared data improves by 70%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 885 [+56%] | 566 | Again the same algorithm, but even more shared data now brings the improvement down to 56%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 1290 [+56%] | 826 | Using two buffers does not change much: still a 56% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 136 [+59%] | 85 | A different algorithm keeps the AVX512 advantage the same at about 60%.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 65.6 [+31.7%] | 49.8 | Using the new scatter/gather in AVX512 still brings 30% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 3920 [+3%] | 3800 | Here we have a 64-bit integer algorithm with many gathers; AVX512 is likely memory-latency bound, thus almost no improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 770 [+2%] | 755 | Again, loads of gathers do not allow AVX512 to shine, but performance is still decent.

As with other SIMD tests, AVX512 brings around a 55-70% performance increase in most filters, which is very impressive. However, in algorithms that involve heavy memory access (scatter/gather) we are limited by memory latency and thus see almost no delta, but at least it is not slower.
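
The bracketed deltas in the table above are simply the ratio of the AVX512 score to the AVX2/FMA3 score. A minimal sketch of that arithmetic, using a few score pairs copied from the table (the `improvement_percent` name is hypothetical, and the published figures are rounded so small differences are expected):

```python
def improvement_percent(avx512: float, avx2: float) -> float:
    """Relative gain of the AVX512 score over the AVX2/FMA3 score, in percent."""
    return (avx512 / avx2 - 1.0) * 100.0


# (AVX512 score, AVX2/FMA3 score) pairs taken from the table above.
results = {
    "Int32 Multi-Media (Mpix/s)": (1460, 1180),
    "FP32 Multi-Media (Mpix/s)": (1800, 1000),
    "SHA2-512 (GB/s)": (21.2, 10.9),
    "DGEMM (GFLOPS)": (235, 229),
}

for name, (avx512, avx2) in results.items():
    print(f"{name}: {improvement_percent(avx512, avx2):+.1f}%")
```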

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that even as a 1st-generation CPU with AVX512 support, SKL-X greatly benefits from the new instruction set, with a 50-95% performance improvement in many workloads. However, compilers/tools are raw (VC++ 2017 only added support in the recent 15.3 version) and performance is sketchy where hand-crafted assembler is not used. These will get better, and future CPU generations (CFL-X, etc.) will likely improve performance.

Also, let’s remember that some SKUs are licensed for 2x FMA units (aka 512-bit), along with other instructions, while most SKUs have only 1x FMA (aka 256-bit); the former likely benefit even more from AVX512, and this is something Intel may be more generous in enabling in future generations.
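
To see why the number of FMA units matters, here is a rough peak-throughput estimate. It is a simplification that assumes each FMA unit processes one full 512-bit vector per cycle and counts a fused multiply-add as two floating-point operations. The 10-core count matches the CPU tested here, but the 3.0 GHz sustained clock is a hypothetical figure chosen only for illustration, as is the `peak_gflops` helper.

```python
def peak_gflops(cores: int, ghz: float, fma_units: int,
                vector_bits: int, element_bits: int) -> float:
    """Theoretical peak GFLOPS: cores x clock x FMA units x lanes x 2 ops per FMA."""
    lanes = vector_bits // element_bits
    flops_per_cycle = fma_units * lanes * 2  # one FMA = multiply + add
    return cores * ghz * flops_per_cycle


# Hypothetical 3.0 GHz sustained clock; FP32 elements in 512-bit vectors.
for units in (1, 2):
    print(f"{units}x FMA: {peak_gflops(10, 3.0, units, 512, 32):.0f} GFLOPS peak (FP32)")
```

Real sustained clocks under heavy AVX512 load differ from the nominal figure, so these numbers are upper bounds and not predictions.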

In algorithms heavily dependent on memory bandwidth or latency, AVX512 cannot work miracles, but it will at least extract the maximum possible compute performance from the CPU. SKUs with a lower number of cores (8, 6, 4, etc.) are likely to gain even more from AVX512.
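
The bandwidth ceiling can be estimated from the memory configuration mentioned in the results (4-channel DDR4 @ 3200MT/s on a 10-core CPU). The sketch assumes the standard 64-bit (8-byte) DDR4 channel width and gives a theoretical peak, not a measured figure; `peak_bandwidth_gbs` is a hypothetical helper name.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: float, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (64-bit channel = 8 bytes per transfer)."""
    return channels * mega_transfers * 1e6 * bus_bytes / 1e9


total = peak_bandwidth_gbs(channels=4, mega_transfers=3200)
print(f"Peak: {total:.1f} GB/s, or {total / 10:.2f} GB/s per core with 10 cores")
```

With fewer cores sharing the same channels, each core has more bandwidth available, which is consistent with the expectation that lower core-count SKUs gain more.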

We are eagerly awaiting the AVX512-enabled processors on desktop and mobile platforms…