SiSoftware Sandra 20/20/10 (2020 R10) Update – optimisations and fixes

Update Wizard

We are pleased to release R10 (version 30.71) update for Sandra 20/20 (2020) with the following changes:

Sandra 20/20 (2020) Press Release

Latest Sandra Version

We are moving towards a tiered system where different versions (R numbers) are provided to different customers depending on their needs and version type. This allows us, in these tough times, to prioritise our customers while still providing stable versions with the best features to the community. We will still aim to release all versions together where possible, but we no longer guarantee it.

  • Manufacturers/OEM, Tech Support, Reviewers:
    • Latest Sandra (Beta) version, R+1
  • Commercial (Professional/Business/Engineer/Enterprise):
    • Current Sandra (Stable) version, R (R+1 if required*)
  • Lite (Evaluation):
    • Previous Sandra (Stable) version, R-1

Note (*): we will provide access to Beta versions if a customer is affected by an issue resolved in the next release.

GP-GPU (CUDA / OpenCL / DirectX Compute) Benchmarks

  • Additional performance improvements for nVidia “Ampere”
  • Additional performance improvements for “Image Processing” Benchmarks
  • Relaxed limits further for better performance on high-end/multiple GP-GPUs [up to 8]

CPU Benchmarks

  • Fixed possible lock-up in “Scientific Analysis” Benchmarks
  • Revised benchmarks for asymmetric work-loads for hybrid CPUs

Bug Fixes

  • Fixed (possible) crash on Intel graphics with 64-bit PCIe memory addressing
  • Reviewed all device code that deals with 64-bit PCIe memory addressing
  • Fixed TigerLake (TGL) memory information/timings for (LP)DDR5
  • Fixed TigerLake (TGL) integrated graphics memory information
  • Additional IceLake (ICL) memory information

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Commercial (Pro/Biz/Eng/Ent)

Download Sandra Lite

AVX512-IFMA(52) Improvement for IceLake and TigerLake

CPU Multi-Media Vectorised SIMD

What is Sandra’s Multi-Media benchmark?

The “multi-media” benchmark in Sandra was introduced way back with Intel’s MMX instruction set (and thus the Pentium MMX) to show the difference vectorisation makes to common algorithms, in this case (Mandelbrot) fractal generation. While MMX did not have floating-point support, floating-point values can be emulated using integers of various widths (short/16-bit, int/32-bit, long/int64/64-bit, etc.).
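For illustration only, here is a minimal fixed-point sketch of the kind of integer emulation described above (hypothetical code, not Sandra’s actual kernel; the Q16.48 format and helper names are ours for this example):

```cpp
#include <cstdint>

// Minimal sketch (not Sandra's code): a Mandelbrot iteration using 64-bit
// fixed-point integers (Q16.48: 48 fractional bits) instead of floating-point.
constexpr int FRAC_BITS = 48;

// Fixed-point multiply: the 128-bit intermediate product is exactly the
// "trouble with 64-bit integers" discussed below.
// (__int128 is a GCC/Clang extension; MSVC would use _mul128/_mulx_u64.)
static inline int64_t fx_mul(int64_t a, int64_t b) {
    return (int64_t)(((__int128)a * b) >> FRAC_BITS);
}

// Iterate z = z^2 + c for one point; returns the escape iteration count.
int mandelbrot_iters(int64_t cx, int64_t cy, int max_iter) {
    int64_t x = 0, y = 0;
    const int64_t four = (int64_t)4 << FRAC_BITS;
    for (int i = 0; i < max_iter; ++i) {
        const int64_t x2 = fx_mul(x, x);
        const int64_t y2 = fx_mul(y, y);
        if (x2 + y2 > four) return i;      // |z| > 2: point escaped
        const int64_t xy = fx_mul(x, y);
        x = x2 - y2 + cx;
        y = 2 * xy + cy;
    }
    return max_iter;
}
```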

The benchmark thus contains various precision tests using both integer and floating point data, currently 6 (single/double/quad-floating point, short/int/long integer) with more to come in the near future (half/FP16 floating-point, etc.). Larger widths provide more precision and thus generate more accurate fractals (images) but are slower to compute (they also take more memory to store).

While the latest instruction sets (AVX(2)/FMA, AVX512) do naturally support floating-point data, integer compute performance remains very important and thus still needs to be tested. As quantities become larger (e.g. memory/disk sizes, pointers/address spaces, etc.) we have moved from int/32-bit to long/64-bit processing, with some algorithms being exclusively 64-bit (e.g. SHA512 hashing).

What is the “trouble” with 64-bit integers?

While all native 64-bit processors (e.g. x64, IA64, etc.) support native 64-bit integer operations, these are generally scalar with limited SIMD (vectorised) support. Multiplication is especially problematic as it can generate results up to twice (2x) the number of bits: multiplying two 64-bit integers can generate a full 128-bit result, for which there was no (SIMD) support.

Intel added native full 128-bit multiplication support (MULX) with BMI2 (Bit Manipulation Instructions Version 2), but that is still scalar (non-SIMD); not even the latest AVX512-DQ instruction set brought SIMD support. While we could emulate full 128-bit multiplication using native 32-bit to 64-bit half multiplications, we have chosen to wait for native support. An additional issue (for us) is that we use signed integers (i.e. able to hold both positive (+ve) and negative (-ve) values) while most multiplication instructions operate on unsigned integers (holding only positive values) – thus we need to adjust the result for our needs, which incurs overheads.
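As an illustration of the signed/unsigned adjustment just described, here is a minimal scalar sketch (assumed example code, not Sandra’s) of a full 64×64→128-bit signed multiply built on BMI2’s MULX:

```cpp
#include <immintrin.h>   // _mulx_u64 (requires BMI2, e.g. compile with -mbmi2)
#include <cstdint>

// Minimal sketch (not Sandra's code): full 64x64 -> 128-bit signed multiply.
// MULX produces the unsigned 128-bit product; the high half is then corrected
// to obtain the signed result, which is the overhead mentioned above.
static inline void mul64x64_signed(int64_t a, int64_t b,
                                   uint64_t* lo, int64_t* hi) {
    unsigned long long hi_u;
    *lo = _mulx_u64((uint64_t)a, (uint64_t)b, &hi_u);

    // Unsigned -> signed correction of the high 64 bits:
    // subtract b if a is negative, and a if b is negative (mod 2^64).
    uint64_t hi_adj = (uint64_t)hi_u;
    if (a < 0) hi_adj -= (uint64_t)b;
    if (b < 0) hi_adj -= (uint64_t)a;
    *hi = (int64_t)hi_adj;
}
```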

Thus the long/64-bit integer benchmark in Sandra remained non-vectorised until the introduction of AVX512-IFMA52.

What is AVX512-IFMA52?

IFMA52 is one of the new AVX512 extensions introduced with “IceLake” (ICL); it provides a native 52-bit fused multiply-add with a 104-bit result. As it is 512-bit wide, we can multiply-add eight (8) pairs of 64-bit integers in one go every 2 clocks (0.5 throughput, 4 latency on ICL) – especially useful for algorithms like (Mandelbrot) fractals where we can operate on many pixels independently.

As it generates a full 104-bit result, it is (as per the name) only a 52-bit integer multiply, thus we need to restrict our integers to 52 bits. It also operates on unsigned integers only, and thus needs to be adjusted for our signed-integer purposes. Note also that while it is a fused multiply-add, we have chosen to use only the multiply part here (in this Sandra version 20/20 R9); future versions (of Sandra) may use the full multiply-add for even better performance.
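For illustration, a minimal sketch (not Sandra’s kernel) of how eight 52-bit products can be formed in one go with the AVX512-IFMA52 intrinsics; passing a zero accumulator turns the fused multiply-add into the plain multiply described above:

```cpp
#include <immintrin.h>   // AVX512-IFMA intrinsics (e.g. compile with -mavx512ifma)

// Minimal sketch (not Sandra's code): multiply eight pairs of unsigned 52-bit
// integers (held in the low 52 bits of each 64-bit lane) in one go, producing
// the low and high halves of the 104-bit products.
static inline void mul52x8(__m512i b, __m512i c, __m512i* lo52, __m512i* hi52) {
    const __m512i zero = _mm512_setzero_si512();
    *lo52 = _mm512_madd52lo_epu64(zero, b, c);   // bits  0..51  of each product
    *hi52 = _mm512_madd52hi_epu64(zero, b, c);   // bits 52..103 of each product
}
```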

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks | Intel Core i7 1065G7 (IceLake ULV) | Intel Core i7 1165G7 (TigerLake ULV) | Comments
BenchCpuMM Emulated Int64 ALU64 (Mpix/s) | 3.67 | 4.34 | While native, scalar int64 processing is pretty slow.
BenchCpuMM Native Int64 ADX/BMI2 (Mpix/s) | 21.24 [+5.78x] | – | Using BMI2 for 64-bit multiplication increases (scalar) performance by almost 6x!
BenchCpuMM Emulated Int64 SSE4 (Mpix/s) | 13.92 [-35%] | – | Vectorisation through SSE4 (2x wide) is not enough to beat ADX/BMI2.
BenchCpuMM Emulated Int64 AVX2 (Mpix/s) | 22.8 [+64%] | – | AVX2 is 4x wide (256-bit) and just about beats scalar ADX/BMI2.
BenchCpuMM Emulated Int64 AVX512/DQ (Mpix/s) | 33.53 [+47%] | – | 512-bit wide AVX512 is 47% faster than AVX2.
BenchCpuMM Native Int64 AVX512/IFMA52 (Mpix/s) | 55.87 [+66%] / [+15x over ALU64] | 70.41 [+16x over ALU64] | IFMA52 is 66% faster than normal AVX512 and over 15x faster than scalar ALU.
With IFMA52 we finally see a big performance gain through native 64-bit integer multiplication and vectorisation (512-bit wide, thus 8x 64-bit integer pairs): over 15x faster on ICL and 16x faster on TGL! In fairness, ADX/BMI2 is only about 1/2 slower – and that is scalar – showing how much native instructions help processing.

Conclusion

AVX512 continues to bring performance improvements by adding sub-instruction sets like AVX512-IFMA(52) that help 64-bit integer processing. With 64-bit integers taking over most computations due to increased sizes (data, pointers, etc.), this is becoming more and more important – and not before time.

While not a full 128-bit multiplier, the 104-bit result allows complete 52-bit integer operation, which is sufficient for most tasks – today. Perhaps in the future an IFMA64 will be provided for full 128-bit multiply-result integer support.

SiSoftware Sandra 20/20/9 (2020 R9) Update – GPGPU updates, fixes

Update Wizard

We are pleased to release R9 (version 30.69) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

GPGPU Benchmarks

  • CUDA SDK 11.3+ update for nVidia “Ampere” (SM8.x) – deprecated SM3.x support
  • OpenCL SDK updated to 1.2 minimum, 2.x recommended and 3.x experimental.
  • Increased addressing in all benchmarks (GPGPU Processing, Cryptography, Scientific Analysis, Financial Analysis, Image Processing, etc.) to 64-bit for large VRAM cards (12GB and larger)
  • Increased limits allowing bigger grids/workloads on large memory systems (e.g. 24GB+ VRAM, 64GB+ RAM)
  • Optimised (vectorised/predicated) some image processing kernels for higher performance on VLIW GPGPUs.

CPU / SVM Benchmarks

  • Updated the CPU Multi-Media 128-bit integer benchmark to use ADX/BMI(2) instructions and vectorised it to use AVX512-IFMA(52) (52-bit integer FMA, 104-bit intermediate result) on Intel “IceLake” and newer CPUs (also supporting AVX512-DQ, AVX2 and SSE4). See the “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.
  • Optimised the .Net and Java Multi-Media 128-bit integer benchmarks similarly to the CPU version.
  • Increased limits/addressing allowing bigger grids/workloads on large thread/memory systems (e.g. 256-thread, 256GB RAM)

Bug Fixes

  • Fixed an incorrect rounding function resulting in negative numbers being displayed for zero values (!). Display issue only; actual values not affected.
  • Fixed display of large scores in GPGPU Processing benchmarks (the result would overflow the display routine). Display issue only; scores stored internally (database) or sent/received from the Ranker were correct and will display correctly after updating.
  • Fixed (sub)domain for Information Engine. Updated microcode, firmware, BIOS, driver versions are displayed again when available.

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/8 (2020 R8t) Update – JCC, bigLITTLE, Hypervisors + Future Hardware

Note: The original R8 release has been updated to R8t with future hardware support.

We are pleased to release R8t (version 30.61) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

JCC Erratum Mitigation

Recent Intel processors (SKL “Skylake” and later, but not ICL “IceLake”) have been found to be impacted by the JCC Erratum, which had to be patched through microcode. Naturally this can cause performance degradation depending on the benchmark (approx. 3% but up to 10%), but it can be mitigated through assembler/compiler updates that prevent the issue from occurring.

We have updated the tools with which Sandra is built to mitigate JCC, and we have tested the performance implications on both Intel and AMD hardware in the linked articles.

bigLITTLE Hybrid Architecture (aka “heterogeneous multi-processing”)

big.Little HMP
While the bigLITTLE architecture (combining low- and high-performance asymmetric cores in the same processor) has been used in many ARM processors, Intel is now introducing it to x86 as “Foveros”. Thus we have Atom (low performance but very low power) and Core (high performance but relatively high power) cores in the same processor – scheduled to run or be “parked” depending on compute demands.

As with any new technology, it will naturally require operating system (scheduler) support and may go through various iterations. Do note that as we’ve discussed in our 2015 (!) article – ARM big.LITTLE: The trouble with heterogeneous multi-processing: when 4 are better than 8 (or when 8 is not always the “lucky” number) – software (including benchmarks) using all cores (big & LITTLE) may have trouble correctly assigning workloads and thus not use such processors optimally.

As Sandra uses its own scheduler to assign (benchmarking) threads to logical cores, we have updated it to allow users to benchmark not only “All Threads (MT)” and “Only Cores (MC)” but also “Only big Cores (bMC)” and “Only LITTLE Cores (LMC)”. This way you can compare and contrast the performance of the various cores without BIOS/firmware changes.

The (benchmark) workload scheduler also had to be updated to allow per-thread workloads – with threads scheduled on LITTLE cores assigned less work and threads on big cores assigned more work, depending on their relative performance. The changes to Sandra’s workload scheduler allow each core to be fully utilised – at least when benchmarking.
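For illustration only, a minimal sketch of such proportional partitioning (an assumed example, not Sandra’s actual scheduler; the relative-performance figures are placeholders):

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch (not Sandra's scheduler): split `total_items` of work across
// threads in proportion to the relative performance of the core each thread is
// pinned to, so big and LITTLE cores finish at roughly the same time.
std::vector<size_t> partition_work(size_t total_items,
                                   const std::vector<double>& core_perf) {
    if (core_perf.empty()) return {};

    double perf_sum = 0.0;
    for (double p : core_perf) perf_sum += p;

    std::vector<size_t> share(core_perf.size());
    size_t assigned = 0, fastest = 0;
    for (size_t i = 0; i < core_perf.size(); ++i) {
        share[i] = static_cast<size_t>(total_items * (core_perf[i] / perf_sum));
        assigned += share[i];
        if (core_perf[i] > core_perf[fastest]) fastest = i;
    }
    share[fastest] += total_items - assigned;   // hand rounding remainder to the fastest core
    return share;
}

// Example: 4 big cores (relative performance 1.0) and 4 LITTLE cores (0.4)
// sharing 1,000,000 work items:
//   auto shares = partition_work(1000000, {1.0, 1.0, 1.0, 1.0, 0.4, 0.4, 0.4, 0.4});
```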

Note: This advanced information is subject to change pending hardware and software releases and updates.

Future Hardware Support

Update R8t adds support for “Tiger Lake” (TGL) as well as updated support for “Ice Lake” (ICL) and future processors.

AMD Power/Performance Determinism

Some of AMD’s server processors allow “determinism” to be changed to either “performance” (consistent speed across nodes/sockets) or “power” (consistent power across nodes/sockets). While workloads normally require predictability and thus “consistent performance”, this can come at the expense of speed (not taking advantage of power/thermal headroom for higher speed) and even power (too much power consumed by some sockets/nodes).

As “power deterministic” mode allows each processor to run at its maximum performance, there can be reasonable deviations across processors – but this headroom would go unused if each thread were assigned the same workload. In effect, it is similar to the “hybrid” issue above: some cores can sustain a different workload than others, and the workload needs to vary accordingly. Again, the changes to Sandra’s workload scheduler allow each core to be fully utilised – at least when benchmarking.

Note: In most systems the deviation between nodes/sockets is relatively small if headroom (thermal/power) is small.

Hypervisors

More and more installations now run in virtualised mode under a (Type 1) hypervisor: using Hyper-V, Docker, programming tools for various systems (Android, etc.) or even enabling “Memory Integrity” all mean the system will silently be modified to run transparently under a hypervisor (Hyper-V on Windows).

As a result, Sandra will now detect and report hypervisor details when uploading benchmarks to the SiSoftware Official Live Ranker: even when running transparently in “host mode”, there can be deviations between benchmark scores, especially when I/O operations (disk, network, but even memory) are involved; some vulnerability mitigations apply to both the hypervisor and the host/guest operating system, with a “double impact” on performance.
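For illustration, a minimal sketch (not Sandra’s detection code) of how a hypervisor can be detected via CPUID on Windows toolchains; the vendor strings shown are examples:

```cpp
#include <intrin.h>   // __cpuid (MSVC/Windows; GCC/Clang use <cpuid.h> instead)
#include <cstring>
#include <string>

// Minimal sketch (not Sandra's code): check the CPUID "hypervisor present" bit
// and, if set, read the 12-character hypervisor vendor signature.
bool hypervisor_present(std::string* vendor) {
    int regs[4] = {0};
    __cpuid(regs, 1);
    if (!((unsigned)regs[2] & (1u << 31)))   // ECX bit 31: hypervisor present
        return false;

    __cpuid(regs, 0x40000000);               // hypervisor vendor leaf
    char sig[13] = {0};
    std::memcpy(sig + 0, &regs[1], 4);       // EBX
    std::memcpy(sig + 4, &regs[2], 4);       // ECX
    std::memcpy(sig + 8, &regs[3], 4);       // EDX
    if (vendor) *vendor = sig;               // e.g. "Microsoft Hv", "VMwareVMware"
    return true;
}
```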

Note: We will publish an article detailing the deviation seen with different hypervisors (Hyper-V, VmWare, Xen, etc.).

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/7 (2020 R7) Released – updates and fixes

We are pleased to release R7 (version 30.49) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Updates & Optimisations
    • CPU Benchmarks: AMD Ryzen 4000 series (APU) preliminary support.
    • GPGPU (CUDA/OpenCL) Benchmarks: nVidia Ampere preliminary support.
    • Database: Optimise performance when accessing/updating benchmark results.
    • Branding (Benchmarks/Ranker): Update manufacturer list.
  • Support & Fixes
    • Internet Benchmarks: Fix website access due to obsolete agent string.
    • Disk Benchmarks: Fix crash on fragmented media (HDD/SSD).
    • Database: Fix update/insert issues with specific benchmark results.

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/6 (2020 R6) Released – 2 brand-new benchmarks!

We are pleased to release R6 (version 30.45) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

Internet DNS Benchmark: Benchmark the performance of the DNS service. Measure the latency of both cached and un-cached DNS queries to local and remote DNS servers.
Internet Overall Score Benchmark: A combined performance index of all Internet benchmarks (Connection (Bandwidth/Latency), Peerage (Bandwidth/Latency) and DNS (cached/un-cached Query Latency)). Rate the overall performance of your Internet connection.
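For illustration only, a minimal sketch of how a single DNS query’s latency could be timed (not Sandra’s benchmark code; the hostname is a placeholder and the POSIX resolver headers are an assumption – Windows would use the Winsock equivalents):

```cpp
#include <chrono>
#include <netdb.h>        // getaddrinfo/freeaddrinfo (POSIX)
#include <sys/socket.h>   // AF_UNSPEC

// Minimal sketch (not Sandra's code): time one DNS lookup. Calling it twice for
// the same name gives a rough un-cached vs. cached latency comparison.
double dns_query_ms(const char* hostname) {
    addrinfo hints = {};
    hints.ai_family = AF_UNSPEC;
    addrinfo* result = nullptr;

    const auto t0 = std::chrono::steady_clock::now();
    const int rc = getaddrinfo(hostname, nullptr, &hints, &result);
    const auto t1 = std::chrono::steady_clock::now();

    if (rc == 0 && result) freeaddrinfo(result);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Example:
//   double uncached = dns_query_ms("www.example.com");
//   double cached   = dns_query_ms("www.example.com");
```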
  • Benchmarks:
    • New: Internet DNS Benchmark: measure cached & un-cached DNS query latency for local and public DNS servers.
    • New: Internet Overall Score: using the existing Internet benchmarks (Connection, Peerage and brand-new DNS), compute an overall score denoting the Internet connection quality.
    • Internet Connection, Internet Peerage Benchmarks: updated list of top (300) websites to test against; additional multi-threading optimisations
  • Hardware Support:
    • Additional future hardware support and optimisations.
    • Additional CPU features support
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/5 (2020 R5) Released – Updated Hardware Support

We are pleased to release R5 (version 30.41) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Benchmarks:
    • Internet Connection, Internet Peerage Benchmarks: updated list of top websites to test against; additional multi-threading optimisations
  • Hardware Support:
    • Additional IceLake (ICL Gen10 Core), Future* (RKL, TGL Gen11 Core) AVX512, VAES, SHA-HWA support (see CPU, GP-GPU, Cache & Memory, AVX512 improvement reviews)
    • Additional CPU features support
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

AVX512 Improvement for Icelake Mobile (i7-1065G7 ULV)

Ice Lake

What is AVX512?

AVX512 (Advanced Vector eXtensions 512) is the 512-bit SIMD instruction set that follows on from the previous 256-bit AVX2/FMA/AVX instruction sets. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators, it next appeared on the HEDT platform with Skylake-X (SKL-X/EX/EP), but until now it was not available on mainstream platforms.

With the 10th “real” generation Core arch(itecture) (IceLake/ICL), we finally see “enhanced” AVX512 on the mobile platform which includes all the original extensions and quite a few new ones.

Original AVX512 extensions as supported by SKL/KBL-X HEDT processors:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit.
  • AVX512-DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit
  • AVX512-BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit
  • AVX512-VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers
  • AVX512-CD* – Conflict Detection – loop vectorisation through predication [only on Xeon Phi co-processors]
  • AVX512-ER* – Exponential & Reciprocal – transcendental operations [only on Xeon Phi co-processors]

New AVX512 extensions supported by ICL processors:

  • AVX512-VNNI** (Vector Neural Network Instructions) [also supported by updated Cascade Lake-X (CLX) HEDT]
  • AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-IFMA (Integer FMA)
  • AVX512-VAES (Vector AES) accelerating crypto
  • AVX512-GFNI (Galois Field)
  • AVX512-GNA (Gaussian Neural Accelerator)

As with anything, simply doubling register widths does not automagically increase performance by 2x, as dependencies, memory load/store latencies and even data characteristics limit performance gains; some extensions may require future architecture updates or tools to realise their true potential.

SIMD FMA Units: Unlike HEDT/server processors, ICL ULV (and likely desktop) have a single 512-bit FMA unit, not two (2): the execution rate (without dependencies) is thus similar for AVX512 and AVX2/FMA code. However, future versions are likely to increase execution units thus AVX512 code will benefit even more.
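To illustrate the point, here is a minimal sketch (not one of Sandra’s kernels) of the same multiply-add loop written with AVX2/FMA3 and AVX512; on a single 512-bit FMA unit the two versions execute at a similar rate, while a second FMA unit lets the AVX512 version pull ahead:

```cpp
#include <immintrin.h>
#include <cstddef>

// Minimal sketch (not Sandra's code): y[i] = a * x[i] + y[i].
// Remainder elements (n not a multiple of the vector width) are omitted here.

void saxpy_avx2(float a, const float* x, float* y, size_t n) {
    const __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i + 8 <= n; i += 8) {             // 8 floats per iteration
        const __m256 vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i),
                                          _mm256_loadu_ps(y + i));
        _mm256_storeu_ps(y + i, vy);
    }
}

void saxpy_avx512(float a, const float* x, float* y, size_t n) {
    const __m512 va = _mm512_set1_ps(a);
    for (size_t i = 0; i + 16 <= n; i += 16) {            // 16 floats per iteration
        const __m512 vy = _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i),
                                          _mm512_loadu_ps(y + i));
        _mm512_storeu_ps(y + i, vy);
    }
}
```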

In this article we test AVX512 core (CPU) performance; please see our other articles for the corresponding Cache & Memory and GP-GPU results.

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks | ICL ULV AVX512 | ICL ULV AVX2/FMA3 | Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 504 [+25%] | 403 | For integer workloads we manage a 25% improvement – not quite the 100% we were hoping for, but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 145 [+1%] | 143 | With a 64-bit integer workload the improvement drops to 1%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 3.67 | 3.73 [-2%] | No SIMD in use here.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 414 [+22%] | 339 | In this floating-point test we see a 22% improvement, similar to integer.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 232 [+20%] | 194 | Switching to FP64 we see a similar improvement.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 10.17 [+13%] | 9 | In this heavy algorithm, using FP64 to mantissa-extend FP128, we see only a 13% improvement.
With limited resources, AVX512 cannot bring a 100% improvement, but it still manages 20-25% over AVX2/FMA, which is a decent improvement; also consider that this is a TDP-constrained ULV platform, not desktop/HEDT.
BenchCrypt Crypto SHA2-256 (GB/s) | 9 [+2.25x] | 4 | With no data dependency we get great scaling of over 2x in this integer workload.
BenchCrypt Crypto SHA1 (GB/s) | 15.71 [+81%] | 8.6 | Here we see only an 81% improvement, likely due to lack of (more) memory bandwidth – it would likely scale higher.
BenchCrypt Crypto SHA2-512 (GB/s) | 7.09 [+2.3x] | 3.07 | With a 64-bit integer workload we see a larger than 2x improvement.
Thanks to the new crypto-friendly acceleration instructions of AVX512, and no doubt helped by high-bandwidth LP-DDR4X memory, we see over 2x (twice) the performance of older AVX2. ICL ULV will no doubt be a great choice for low-power network devices (routers/gateways/firewalls) able to pump 100s of Gbe crypto streams.
BenchScience SGEMM (GFLOPS) float/FP32 | 185 [-6%] | 196 | More optimisations seem to be required here, for ICL at least.
BenchScience DGEMM (GFLOPS) double/FP64 | 91 [+18%] | 77 | Changing to FP64 brings an 18% improvement.
BenchScience SFFT (GFLOPS) float/FP32 | 31.72 [+12%] | 28.34 | With FFT we see a modest 12% improvement.
BenchScience DFFT (GFLOPS) double/FP64 | 17.72 [-2%] | 18 | With FP64 we see a 2% regression.
BenchScience SNBODY (GFLOPS) float/FP32 | 200 [+7%] | 187 | No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64 | 61.76 [=] | 62 | With FP64 there is no delta.
With highly-optimised scientific algorithms, it seems we still have some way to go to extract more performance out of AVX512, though overall we still see a 7-12% improvement even at this time.
CPU Image Processing Blur (3×3) Filter (MPix/s) | 1,580 [+79%] | 883 | We start well here, with AVX512 80% faster on a float/FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 633 [+71%] | 371 | Same algorithm but more shared data improves by 71%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 326 [+67%] | 195 | Again the same algorithm, but even more shared data brings the improvement down to 67%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 502 [+58%] | 318 | Using two buffers does not change much; still a 58% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 72.92 [+2.4x] | 30.14 | A different algorithm works better, with AVX512 over 2x faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 24.73 [+50%] | 16.45 | Using the new scatter/gather in AVX512 still brings 50% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 2,100 [+33%] | 1,580 | Here we have a 64-bit integer workload with many gathers; still a good 33% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 307 [+33%] | 231 | Again loads of gathers and a similar 33% improvement.
Image manipulation algorithms working on individual (non-dependent) pixels love AVX512, with 33-140% improvements. The new scatter/gather instructions also simplify memory-access code, which can benefit from future architecture improvements.
Neural Networks NeuralNet CNN Inference (Samples/s) | 25.94 [+3%] | 25.23 | Inference improves by a mere 3%, despite few dependencies.
Neural Networks NeuralNet CNN Training (Samples/s) | 4.6 [+5%] | 4.39 | Training improves by a slightly better 5%, likely due to 512-bit accesses.
Neural Networks NeuralNet RNN Inference (Samples/s) | 25.66 [-1%] | 25.81 | RNN inference seems very slightly slower.
Neural Networks NeuralNet RNN Training (Samples/s) | 2.97 [+33%] | 2.23 | Finally, RNN training improves by 33%.
Unlike image manipulation, neural networks don’t seem to benefit as much – performance is pretty much the same across the board. Clearly more optimisation is needed to push performance.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

We never expected a low-power, TDP (power)-limited ULV platform to benefit from AVX512 as much as HEDT/server platforms – especially considering the lower count of SIMD execution units. Nevertheless, it is clear that ICL (even in ULV form) benefits greatly from AVX512, with 50-100% improvements in many algorithms and no losses.

ICL also introduces many new AVX512 extensions which can even be used to accelerate existing AVX512 code (not just legacy AVX2/FMA), so we are likely to see even higher gains in the future as software (and compilers) take advantage of the new extensions. Future CPU architectures are also likely to optimise complex instructions as well as add more SIMD/FMA execution units, which will greatly improve AVX512 code performance.

As the data paths for the caches (L1D, L2?) have been widened, 512-bit memory accesses help extract more bandwidth for streaming algorithms (e.g. crypto), while scatter/gather instructions reduce latencies for non-sequential data accesses. Thus the benefit of AVX512 extends beyond raw compute code.
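For illustration, a minimal sketch (not Sandra’s image-processing code) of the gather/scatter pattern mentioned above: sixteen non-contiguous pixels are fetched, processed and written back with single instructions.

```cpp
#include <immintrin.h>

// Minimal sketch (not Sandra's code): gather 16 floats from arbitrary indices,
// apply an example operation and scatter the results back.
// The scale argument (4) is the element size in bytes.
void process_indexed(float* pixels, const int* indices) {
    const __m512i vidx = _mm512_loadu_si512(indices);        // 16 x 32-bit indices
    __m512 vpix = _mm512_i32gather_ps(vidx, pixels, 4);      // non-sequential reads
    vpix = _mm512_mul_ps(vpix, _mm512_set1_ps(0.5f));        // example operation
    _mm512_i32scatter_ps(pixels, vidx, vpix, 4);             // non-sequential writes
}
```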

We are excitedly waiting to see how AVX512-enabled desktop/HEDT ICL performs, not constrained by TDP and adequately cooled…


SiSoftware Sandra 20/20/4 (2020 R4a) Released – Updated Benchmarks

Note: The original R4 release text has been updated below. The (*) denotes new changes.

We are pleased to release R4a (version 30.39) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Benchmarks:
    • Crypto AES Benchmarks*: Optimised AVX512/AVX2-VAES code to outperform AES-HWA where possible (see the VAES sketch after this list).
    • Crypto SHA Benchmarks*: Select AVX512 multi-buffer instead of SHA-HWA where supported.
    • Network (LAN), Wireless (WLAN/WWAN) Benchmarks: multi-threaded transfer tests and increased packet size to better utilise 10Gbe+ (and higher) links. [Note: threaded CPU required]
    • Internet Connection, Internet Peerage Benchmarks: multi-threaded transfer tests and increased packet size to better utilise Gigabit+ (and higher) connections.
  • Hardware Support:
    • Updated IceLake (ICL Gen10 Core), Future* (RKL, TGL Gen11 Core) AVX512, VAES, SHA-HWA support (see CPU, GP-GPU, Cache & Memory, AVX512 improvement reviews)
    • Updated CometLake (Gen10 Core) support (see CPU, GP-GPU, Cache & Memory reviews)
    • Updated CPU features support*
    • Updated NVMe support
    • Enhanced Biometrics information (fingerprint, face, voice, audio, etc. sensors)
    • Updated WiFi support (WiFi 6/802.11ax, WPA3)
    • Various stability and reliability improvements
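Regarding the VAES-optimised Crypto AES benchmark above: with VAES, a single instruction applies one AES round to four 128-bit blocks at once, versus one block with AES-HWA (AES-NI). A minimal sketch (not Sandra’s kernel), assuming VAES with AVX512 registers:

```cpp
#include <immintrin.h>   // VAES intrinsics (e.g. compile with -mvaes -mavx512f)

// Minimal sketch (not Sandra's code): one AES encryption round applied to four
// independent 128-bit blocks packed into a single 512-bit register.
static inline __m512i aes_round_x4(__m512i four_blocks, __m512i round_key) {
    return _mm512_aesenc_epi128(four_blocks, round_key);
}
```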

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/3 (2020 R3) Released – Updated Benchmarks

We are pleased to release R3 (version 30.31) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • Additional PCIe extended capabilities support
  • CPU Cryptography Benchmarks:
    • Block size changed to ~1500 bytes, similar to an Ethernet packet
    • Various stability and reliability improvements
  • GPGPU Cryptography Benchmarks:
    • Block size changed to ~1500 bytes, similar to an Ethernet packet
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite