SiSoftware Sandra 20/20 (2020) Released!

FOR IMMEDIATE RELEASE

Contact: Press Office

SiSoftware Sandra 20/20 (2020) Released:
Brand-new benchmarks (AI/ML), hardware support

Updates: SP1.

London, UK, July 18th, 2019 – We are pleased to announce the launch of SiSoftware Sandra 20/20 (2020), the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, mobile devices and networks.

It adds two Neural Networks AI/ML (Artificial Intelligence/Machine Learning) benchmarks for both CPU and GP (GPU) to measure both CNN (Convolution Neural Network) & RNN (Recurrent Neural Networks) performance on modern hardware.

It also adds hardware support and optimisations for brand-new CPU architectures (AMD Ryzen 2 (3000 series); Intel IceLake, CometLake) not forgetting GPGPU architectures across the various interfaces (CUDA, OpenCL, DirectX ComputeShader, OpenGL Compute).

As SiSoftware operates a “just-in-time” release cycle, some features were introduced in Sandra 2017 service packs: in Sandra Titanium they have been updated and enhanced based on all the feedback received.

Operating System Module

Broad Operating System Support

All current versions supported: Windows 10, 8.1*, 8*, 7*; Server 2019, 2016, 2012/R2 and 2008/R2*

Brand new AI/ML benchmarks featuring both CNN & RNN networks testing both inference/forward and training/back-propagation performance.

Processor Neural Networks (AI/ML)

A combined performance index of CNN (inference/forward & training) & RNN (inference/forward & training) for all precisions (single/FP32, double/FP64 floating-point) and instruction sets (AVX512, AVX2/FMA, AVX, SSE4, SSE2, RTM/HLE with NUMA and large-page support)

Ranker: Processor Neural Networks (Normal/Single Precision)
Ranker: Processor Neural Networks (High/Double Precision)

GP (GPU) Neural Networks (AI/ML)

A combined performance index of CNN (inference/forward & training) & RNN (inference/forward & training) for all precisions (half/FP16, single/FP32 floating-point) and platforms (CUDA, OpenCL, DirectX Compute)

GP (GPU) Neural Networks (Normal/Single Precision)
GP (GPU) Neural Networks (Low/Half Precision)

CNN (Convolution Neural Network) Architecture

Detailed document on the CNN architecture, data-sets and results that underpin our choices for the new benchmarks.

The new Neural Networks (AI/ML) Benchmarks: CNN Architecture

RNN (Recurrent Neural Network) Architecture

Detailed document on the RNN architecture, data-sets and results that underpin our choices for the new benchmarks.

The new Neural Networks (AI/ML) Benchmarks: RNN Architecture

Major changes

  • All connections to website engines (Ranker, Information, Price) are now secured by SSL through HTTP.
  • Sandra client (management console) is now installed as native 64-bit (on x64 and arm64) and thus needs 64-bit Access components (2016, 2013, 2010, etc.) or SQL Server (2017, 2016, 2014, etc) for its database.

Key features of Sandra 20/20

  • 4 native architectures support (x86, x64, ARM64** – Windows; ARM, ARM64, x86, x64 – Android)
  • Huge official hardware support through technology partners (AMD/ATI, nVidia, Intel).
  • 4 native (GP)GPU/APU platforms support (OpenCL 2.1+, CUDA 10.1+, DirectX Compute Shader 11/10+, OpenGL Compute 4.5+, Vulkan 1.0+).
  • 4 native Graphics platforms support (DirectX 11.x/10.x, OpenGL 4.0+, Vulkan 1.0+).
  • 9 language versions (English, German, French, Italian, Spanish, Japanese, Chinese (Traditional, Simplified), Russian) in a single installer.
  • Enhanced Sandra Lite (Eval) version (free for personal/educational use, evaluation for other uses)

Articles & Benchmarks

For more details, please see the following articles:

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please click here.

Downloading

For more details, and to download the Lite (Evaluation) version, please click here.

Reviewers and Editors

For your free review copies, please contact us.

About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Many worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Thousands on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, OpenGL, DirectCompute, x64, ARM64, ARM, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX512, AVX2/FMA3, AVX, NEON/2, SSE4.2/4, SSSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit www.sisoftware.net, www.sisoftware.eu, or www.sisoftware.co.uk

AMD Ryzen2 3700X Review & Benchmarks – CPU 8-core/16-thread Performance

What is “Ryzen2” ZEN2?

AMD’s Zen2 (“Matisse”) is the “true” 2nd generation ZEN core on 7nm process shrink while the previous ZEN+ (“Pinnacle Ridge”) core was just an optimisation of the original ZEN (“Summit Ridge”) core that while socket compatible it introduces many design improvements over both previous cores. An APU version (with integrated “Navi” graphics) is scheduled to be launched later.

While new chipsets (500 series) will also be introduced and required to support some new features (PCIe 4.0), with an BIOS/firmware update older boards may support them thus allowing upgrades to existing systems adding more cores and thus performance. [Note: older boards will not be enabled for PCIe 4.0 after all]

The list of changes vs. previous ZEN/ZEN+ is extensive thus performance delta is likely to be very different also:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 8MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 2 chiplets on desktop platform thus up to 2x2x4C (16C/32T 3950X) (same amount as old ThreadRipper 1950X/2950X)
  • 2x larger L3 cache per CCX thus up to 2x2x16MB (64MB) L3 cache (3900X+)
  • 24 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 2x DDR4 memory controllers up to 4266Mt/s

To upgrade from Ryzen+/Ryzen1 or not?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units 2x Fmacs (fixing a major deficiency in ZEN/ZEN+ cores)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies claim to be reduced through higher-speed memory (note all requests go through IF to Central I/O hub with memory controllers)
  • Load/Store 32bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run 1x or 1/2x vs. DRAM clock (when higher than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities bar two (BTI/”Spectre”, SSB/”Spectre v4″) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transitions latencies reduced (ACPI/CPPCv2)

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the middle-of-the-range Ryzen2 (3700X) with previous generation Ryzen+ (2700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen 9 3900X (Matisse)
AMD Ryzen 7 3700X (Matisse) AMD Ryzen 7 2700X (Pinnacle Ridge) Intel i9 9900K (Coffeelake-R) Intel i9 7900X (Skylake-X) Comments
Cores (CU) / Threads (SP) 12C / 24T 8C / 16T 8C / 16T 8C / 16T 10C / 20T Core counts remain the same.
Topology 2 chiplets, each 2 CCX, each 3 cores (1 disabled) (12C) 1 chiplet, 2 CCX, each 4 cores (8C) 2 CCX, each 4 cores (8C) Monolithic die Monolithic die 1 chiplet+1 sio rather than 1 die
Speed (Min / Max / Turbo) 3.8 / 4.6GHz 3.6 / 4.4GHz 3.7 / 4.2GHz 3.6 / 5GHz 3.3 / 4.3GHz 3700x base clock is lower than 2700x but turbo is higher
Power (TDP / Turbo) 105 / 135W 65 / 90W 105 / 135W 95 / 135W 140 / 308W TDP has been greatly reduced vs. ZEN+
L1D / L1I Caches 12x 32kB 8-way / 12x 32kB 8-way 8x 32kB 8-way / 8x 32kB 8-way 8x 32kB 8-way / 8x 64kB 4-way 8x 32kB 8-way / 8x 32kB 8-way 10x 32kB 8-way / 10x 32kB 8-way L1I has been halved but better no. ways
L2 Caches 12x 512kB (6MB) 8-way 8x 512kB (4MB) 8-way 8x 512kB (4MB) 8-way 8x 256kB (2MB) 16-way 10x 1MB (10MB) 16-way No changes to L2
L3 Caches 2x2x 16MB (64MB) 16-way 2x 16MB (32MB) 16-way 2x 8MB (16MB) 16-way 16MB 16-way 13.75MB 11-way L3 is 2x ZEN+
Mitigations for Vulnerabilities BTI/”Spectre”, SSB/”Spectre v4″ hardware BTI/”Spectre”, SSB/”Spectre v4″ hardware BTI/”Spectre”, SSB/”Spectre v4″ software/firmware RDCL/”Meltdown”, L1TF hardware, BTI/”Spectre”, MDS/”Zombieload”, software/firmware RDCL/”Meltdown” , L1TF, BTI/”Spectre”, MDS/”Zombieload”, all software/firmware Ryzen2 addresses the remaining 2 vulnerabilities while Intel was forced to add MDS to its long list…
Microcode MU-8F7100-11 MU-8F7100-11 MU-8F0802-04 MU-069E0C-9E MU-065504-49 The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units 256-bit AVX/FMA3/AVX2 256-bit AVX/FMA3/AVX2 128bit AVX/FMA3/AVX2 256-bit AVX/FMA3/AVX2 512-bit AVX512 ZEN2 SIMD units are 2x wider than ZEN+

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA but not AVX-512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks AMD Ryzen 7 3700X (Matisse)
AMD Ryzen 7 2700X (Pinnacle Ridge)
Intel i9 9900K (Coffeelake-R)
Intel i9 7900X (Skylake-X)
Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 336 [=] 334 400 485 We start with no improvement over ZEN+
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 339 [=] 335 393 485 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 202 [+2%] 198 236 262 Floating-point performance does not change delta either – only 2% faster
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 170 [=] 169 196 223 With FP64 nothing much changes again.
In the legacy integer/floating-point benchmarks ZEN2 is not any faster than ZEN+ despite the change in clocks. Perhaps future microcode updates will help?
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1023 [+78%] 574 985 1590 ZEN2 is ~80% faster than ZEN+ despite what we’ve seen before.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 374 [+2x] 187 414 581 With a 64-bit AVX2 integer vectorised workload, ZEN2 is now 2x faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 6.56 [+13%] 5.8 6.75 7.56 This is a tough test using Long integers to emulate Int128 without SIMD; here ZEN2 is still 13% faster.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 100 [+68%] 596 914 1760 In this floating-point AVX/FMA vectorised test, ZEN2 is ~70% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 618 [+84%] 335 535 533 Switching to FP64 SIMD code, ZEN2 is now ~90% faster than ZEN+
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 24.22 [+55%] 15.6 23 40.3 In this heavy algorithm using FP64 to mantissa extend FP128, ZEN2 is still 55% faster
With its brand-new 256-bit SIMD units, ZEN2 is anywhere from 55% to 100% faster than ZEN+/ZEN1 a huge upgrade from one generation to the next. For SIMD loads upgrading to ZEN2 gives a huge performance uplift.
BenchCrypt Crypto AES-256 (GB/s) 18 [+12%] 16.1 17.63 23 With AES/HWA support all CPUs are memory bandwidth bound  but ZEN2 manages a 12% improvement.
BenchCrypt Crypto AES-128 (GB/s) 18.76 [+17%] 16.1 17.61 23 What we saw with AES-256 just repeats with AES-128; ZEN2 is now 17% faster.
BenchCrypt Crypto SHA2-256 (GB/s) 20.21 [+9%] 18.6 12 26 With SHA/HWA ZEN2 similarly powers through hashing tests leaving Intel in the dust – and is still ~10% faster than ZEN+
BenchCrypt Crypto SHA1 (GB/s) 20.41 [+6%] 19.3 22.9 38 The less compute-intensive SHA1 does not change things due to acceleration.
BenchCrypt Crypto SHA2-512 (GB/s) 3.77 9 21
ZEN2 with AES/SHA HWA is memory bound like all other CPUs, but it still manages 6-17% better performance than ZEN+ using the same memory. But as ZEN2 is rated for faster memory – using such memory would greatly improve the results.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 257 276 309
BenchFinance Black-Scholes double/FP64 (MOPT/s) 229 [+5%] 219 238 277 Switching to FP64 code, ZEN2 is just 5% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 107 59.9 70.5 Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) 57.98 [-4%] 60.6 61.6 68 With FP64 code ZEN2 is 4% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.2 56.5 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 46.34 [+13%] 41 44.5 50.5 Switching to FP64 nothing much changes, ZEN2 is 13% faster.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: ZEN2 does not improve much and is pretty much tied with ZEN+ – thus for non SIMD workloads you might as well stick with the older versions.
BenchScience SGEMM (GFLOPS) float/FP32 263 [-12%] 300 375 413 In this tough vectorised algorithm ZEN2 is strangely slower.
BenchScience DGEMM (GFLOPS) double/FP64 193 [+63%] 119 209 212 With FP64 vectorised code, ZEN2 comes back to be over 60% faster.
BenchScience SFFT (GFLOPS) float/FP32 22.78 [+2.5x] 9 22.33 28.6 FFT is also heavily vectorised but stresses the memory sub-system more; ZEN2 is 2.5x (times) faster.
BenchScience DFFT (GFLOPS) double/FP64 11.16 [+41%] 7.92 11.21 14.6 With FP64 code, ZEN2 is ~40% faster.
BenchScience SNBODY (GFLOPS) float/FP32 612 [+2.2x] 280 557 638 N-Body simulation is vectorised but fewer memory accesses; ZEN2 is over 2x faster.
BenchScience DNBODY (GFLOPS) double/FP64 220 [+2x] 113 171 195 With FP64 precision ZEN2 is almost 2x faster.
With highly vectorised SIMD code ZEN2 improves greatly over ZEN2 sometimes managing to be over 2x faster using the same memory.
CPU Image Processing Blur (3×3) Filter (MPix/s) 2049 [+42%] 1440 2560 4880 In this vectorised integer workload ZEN2 starts over 40% faster than ZEN+.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 950 [+52%] 627 1000 1920 Same algorithm but more shared data makes ZEN2 over 50% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 495 [+52%] 325 519 1000 Again same algorithm but even more data shared still 50% faster
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 826 [+67%] 495 827 1500 Different algorithm but still vectorised workload ZEN2 is almost 70% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 89.68 [+24%] 72.1 78 221 Still vectorised code now ZEN2 drops to just 25% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 25.05 [+5%] 23.9 42.2 66.7 This test has always been tough for Ryzen so ZEN2 does not improve much.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1763 [+76%] 1000 4000 4070 With integer workload, Intel CPUs seem to do much better but ZEN2 is still almost 80% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 321 [+32%] 243 596 777 In this final test again with integer workload ZEN2 is 32% faster
As we’ve seen before, the new SIMD units are anywhere from 5% (worst-case) to 2x faster than ZEN+/1, a huge performance improvement.
Aggregate Score (Points) 8,200 [+40%] 5,850 7,930 11,810 Across all benchmarks, ZEN2 is ~40% faster than ZEN+.
Aggregating all the various scores, the result was never in doubt: ZEN2 (3700X) is 40% faster than the old ZEN+ (2700X) that itself improved over the original 1700X.

ZEN2’s 256-bit wide SIMD units are a big upgrade and show their power in every SIMD workload; otherwise there is only minor improvement.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: For SIMD workloads you really have to upgrade to Ryzen2; otherwise stick with Ryzen+ unless lower power is preferred. 9/10 overall.

The big change in Ryzen2 are the 256-bit wide SIMD units and all vectorised workloads (Multi-Media, Scientific, Image processing, AI/Machine Learning, etc.) using AVX/FMA will greatly benefit – anything between 50-100% which is a significant increase from just one generation to the next.

But for all other workloads (e.g. Financial, legacy, etc.) there is not much improvement over Ryzen+/1 which were already doing very well against competition.

Naturally it all comes at lower TDP (65W vs 95) which may help with overclocking and also lower noise (from the cooling system) and power consumption (if electricity is expensive or you are running it continuously) thus the performance/W(att) is still greatly improved.

Overall the 3700X does represent a decent improvement over the old 2700X (which is no slouch and was a nice upgrade over 1700X due to better Turbo speeds) and should still be usable in older AM4 300/400-series mainboards with just a BIOS upgrade (without PCIe 4.0).

However, while 2700X (and 1700X/1800X) were top-of-the-line, 3700X is just middle-ground, with the new top CPUs being the 3900X and even the 3950X with twice (2x) more cores and thus potentially huge performance rivaling HEDT Threadripper. The goad-posts have thus moved and thus far higher performance can be yours with just upgrading the CPU. The future is bright…

AMD Ryzen2 3900X Review & Benchmarks – CPU 12-core/24-thread Performance

What is “Ryzen2” ZEN2?

AMD’s Zen2 (“Matisse”) is the “true” 2nd generation ZEN core on 7nm process shrink while the previous ZEN+ (“Pinnacle Ridge”) core was just an optimisation of the original ZEN (“Summit Ridge”) core that while socket compatible it introduces many design improvements over both previous cores. An APU version (with integrated “Navi” graphics) is scheduled to be launched later.

While new chipsets (500 series) will also be introduced and required to support some new features (PCIe 4.0), with an BIOS/firmware update older boards may support them thus allowing upgrades to existing systems adding more cores and thus performance. [Note: older boards will not be enabled for PCIe 4.0 after all]

The list of changes vs. previous ZEN/ZEN+ is extensive thus performance delta is likely to be very different also:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 8MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 2 chiplets on desktop platform thus up to 2x2x4C (16C/32T 3950X) (same amount as old ThreadRipper 1950X/2950X)
  • 2x larger L3 cache per CCX thus up to 2x2x16MB (64MB) L3 cache (3900X+)
  • 24 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 2x DDR4 memory controllers up to 4266Mt/s

AMD Ryzen2 3950X chiplets

What’s new in the Ryzen2 core?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units 2x Fmacs (fixing a major deficiency in ZEN/ZEN+ cores)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies claim to be reduced through higher-speed memory (note all requests go through IF to Central I/O hub with memory controllers)
  • Load/Store 32bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run 1x or 1/2x vs. DRAM clock (when higher than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities bar two (BTI/”Spectre”, SSB/”Spectre v4″) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transitions latencies reduced (ACPI/CPPCv2)

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 (3900X, 3700X) with previous generation Ryzen+ (2700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen 9 3900X (Matisse)
AMD Ryzen 7 3700X (Matisse) AMD Ryzen 7 2700X (Pinnacle Ridge) Intel i9 9900K (Coffeelake-R) Intel i9 7900X (Skylake-X) Comments
Cores (CU) / Threads (SP) 12C / 24T 8C / 16T 8C / 16T 8C / 16T 10C / 20T Matching core-count with CFL (3800X) but 3900X has 50% more cores – more than SKL-X.
Topology 2 chiplets, each 2 CCX, each 3 cores (1 disabled) (12C) 1 chiplet, 2 CCX, each 4 cores (8C) 2 CCX, each 4 cores (8C) Monolithic die Monolithic die AMD uses discrete dies/chiplets unlike Intel
Speed (Min / Max / Turbo) 3.8 / 4.6GHz 3.6 / 4.4GHz 3.7 / 4.2GHz 3.6 / 5GHz 3.3 / 4.3GHz Base clock and turbo are competitive with 3800X having higher base while 3900X higher turbo.
Power (TDP / Turbo) 105 / 135W 65 / 90W 105 / 135W 95 / 135W 140 / 308W TDP remains the same but 3900X may exceed that having more cores.
L1D / L1I Caches 12x 32kB 8-way / 12x 32kB 8-way 8x 32kB 8-way / 8x 32kB 8-way 8x 32kB 8-way / 8x 64kB 4-way 8x 32kB 8-way / 8x 32kB 8-way 10x 32kB 8-way / 10x 32kB 8-way ZEN2 matches L1I with CFL/SKL-X (1/2x ZEN+ but 8-way), L1D is unchanged (also matches Intel)
L2 Caches 12x 512kB (6MB) 8-way 8x 512kB (4MB) 8-way 8x 512kB (4MB) 8-way 8x 256kB (2MB) 16-way 10x 1MB (10MB) 16-way No changes to L2, still 2x CFL. Only SKL-X has its massive 1MB L2 per core which 3900X almost matches!
L3 Caches 2x2x 16MB (64MB) 16-way 2x 16MB (32MB) 16-way 2x 8MB (16MB) 16-way 16MB 16-way 13.75MB 11-way L3 is 2x ZEN/ZEN+ and thus 2x CFL (3800X) with 3900X having a massive 64MB unheard of on the desktop platform! SKL-X can’t match it either.
Mitigations for Vulnerabilities BTI/”Spectre”, SSB/”Spectre v4″ hardware BTI/”Spectre”, SSB/”Spectre v4″ hardware BTI/”Spectre”, SSB/”Spectre v4″ software/firmware RDCL/”Meltdown”, L1TF hardware, BTI/”Spectre”, MDS/”Zombieload”, software/firmware RDCL/”Meltdown” , L1TF, BTI/”Spectre”, MDS/”Zombieload”, all software/firmware Ryzen2 addresses the remaining 2 vulnerabilities while Intel was forced to add MDS to its long list…
Microcode MU-8F7100-11 MU-8F7100-11 MU-8F0802-04 MU-069E0C-9E MU-065504-49 The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units 256-bit AVX/FMA3/AVX2 256-bit AVX/FMA3/AVX2 128bit AVX/FMA3/AVX2 256-bit AVX/FMA3/AVX2 512-bit AVX512 ZEN2 finally matches Intel/CFL but SKL-X’s secret weapon is AVX512 with even consumer CPUs able to do 2x 512-bit FMA ops.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA but not AVX-512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks AMD Ryzen 9 3900X (Matisse)
AMD Ryzen 7 2700X (Pinnacle Ridge)
Intel i9 9900K (Coffeelake-R)
Intel i9 7900X (Skylake-X)
Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 551 [+38%] 334 400 485 Right off Ryzen2 demolishes all CPUs, it is 40% faster than CFL-R!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 556 [+41%] 335 393 485 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 331 [+40%] 198 236 262 Floating-point performance does not change delta either – still 40% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 280 [+43%] 169 196 223 With FP64 nothing much changes again.
Ryzen2 starts with an astonishing display, with 3900X demolishing both 9900X and 7900X winning all tests by a large margin 38-43%! It does have 50% more cores (12 vs. 8) but it is not easy to realise gains just by increasing core counts. Intel will need to add far more cores in future CPUs in order to compete!
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1449 [+47%] 574 985 1590 Ryzen2 starts off by blowing CFL-R away by 47% and almost matching SKL-X with AVX512!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 553 [+34%] 187 414 581 With a 64-bit AVX2 integer vectorised workload, Ryzen2 is still 34% faster!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 9.52 [+41%] 5.8 6.75 7.56 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 is again 41% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1480 [+62%] 596 914 1760 In this floating-point AVX/FMA vectorised test, Ryzen2 is now over 60% faster than CFL-R and not far off SKL-X!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 906 [+69%] 335 535 533 Switching to FP64 SIMD code, Ryzen2 is now 70% faster even beating SKL-X!!!
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 35.23 [+53%] 15.6 23 40.3 In this heavy algorithm using FP64 to mantissa extend FP128, Ryzen2 is still 53% faster!
With its brand-new 256-bit SIMD units, Ryzen2 finally goes toe-to-toe with Intel, soundly beating CFL-R in all benchmarks (+34-69%) sometimes by more than just core count increase (+50%). Only SKL-X with AVX512 manages to be faster (but also with its extra 2 cores). Intel had better release AVX512 for desktop soon but even that will not be enough without increasing core counts to match AMD.
BenchCrypt Crypto AES-256 (GB/s) 15.44 [-12%] 16.1 17.63 23 With AES/HWA support all CPUs are memory bandwidth bound – thus Ryzen2 scores less than Ryzen+ and CFL-R.
BenchCrypt Crypto AES-128 (GB/s) 15.44 [-12%] 16.1 17.61 23 What we saw with AES-256 just repeats with AES-128; Ryzen2 is again slower by 12%.
BenchCrypt Crypto SHA2-256 (GB/s) 29.84 [+2.5x] 18.6 12 26 With SHA/HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust – 2.5x faster than CFL-R and beating SKL-X with AVX512!
BenchCrypt Crypto SHA1 (GB/s) 19.3 22.9 38
BenchCrypt Crypto SHA2-512 (GB/s) 3.77 9 21
Ryzen2 with AES/SHA HWA is memory bound thus needs faster memory than 3200Mt/s in order to feed all the cores; otherwise due to increased contention for the same bandwidth it may end up slower than Ryzen+ and Intel designs. Here you see the need for HEDT platforms and thus ThreadRipper but at much increased cost.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 257 276 309
BenchFinance Black-Scholes double/FP64 (MOPT/s) 379 [+55%] 219 238 277 Switching to FP64 code, nothing much changes, Ryzen2 55% faster than CFL-R.
BenchFinance Binomial float/FP32 (kOPT/s) 107 59.9 70.5 Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) 95.73 [+55%] 60.6 61.6 68 With FP64 code Ryzen2 is still 55% faster!
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.2 56.5 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 76.72 [+72%] 41 44.5 50.5 Switching to FP64 nothing much changes, Ryzen2 is 70% faster than CFL-R and still beating SKL-X.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: Ryzen2 is over 50% faster than CFL-R (+55-72%) and soundly beats SKL-X too! As before for financial algorithms there is only one choice and that is Ryzen, be it Ryzen1, Ryzen+ or Ryzen2!
BenchScience SGEMM (GFLOPS) float/FP32 300 375 413 In this tough vectorised algorithm Ryzen2.
BenchScience DGEMM (GFLOPS) double/FP64 212 [+1%] 119 209 212 With FP64 vectorised code, Ryzen2 matches CFL-R and SKL-X.
BenchScience SFFT (GFLOPS) float/FP32 9 22.33 28.6 FFT is also heavily vectorised but stresses the memory sub-system more;
BenchScience DFFT (GFLOPS) double/FP64 12.69 [+13%] 7.92 11.21 14.6 With FP64 code, Ryzen2 is 13% faster than CFL-R.
BenchScience SNBODY (GFLOPS) float/FP32 280 557 638 N-Body simulation is vectorised but fewer memory accesses;
BenchScience DNBODY (GFLOPS) double/FP64 332 [+94%] 113 171 195 With FP64 precision Ryzen2 is almost 2x faster than CFL-R.
With highly vectorised SIMD code Ryzen2 remains competitive but finds some algorithms tougher than others. The new 256-bit SIMD units help but it seems the cores are starved of bandwidth (especially due to SMT) and some workloads may perform better with SMT off.
CPU Image Processing Blur (3×3) Filter (MPix/s) 3056 [+20%] 1440 2560 4880 In this vectorised integer workload Ryzen2 is 20% faster than CFL-R.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 1499 [+50%] 627 1000 1920 Same algorithm but more shared data makes Ryzen2 50% faster!
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 767 [+48%] 325 519 1000 Again same algorithm but even more data shared still 50% faster
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 1298 [+57%] 495 827 1500 Different algorithm but still vectorised workload Ryzen2 is almost 60% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 136 [+74%] 72.1 78 221 Still vectorised code now Ryzen2 is 70% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 38.23 [-9%] 23.9 42.2 66.7 This test has always been tough for Ryzen but Ryzen2 is competitive.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1384 [-65%] 1000 4000 4070 With integer workload, Intel CPUs seem to do much better.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 487 [-18%] 243 596 777 In this final test again with integer workload Ryzen2 is 20% slower.
Thanks to AVX512 SKL-X does win all tests but Ryzen2 beats CFL-R between 20-74% with a few test mixing integer & floating-point SIMD instructions seemingly heavily favouring Intel but nothing to worry about. Overall for image processing Ryzen2 should be your 1st choice.
Aggregate Score (Points) 10,250 [+29%] 5,850 7,930 11,810 Across all benchmarks, Ryzen2 is ~30% faster than CFL-R!
Aggregating all the various scores, the result was never in doubt: Ryzen2 (3900X) is almost 2x faster than Ryzen+ (2700X) and 30% faster than CFL-R, almost catching up HEDT SKL-X.

Ryzen2 (unlike Ryzen1/+) has no trouble with SIMD code due to its widened SIMD units (256-bit) and thus soundly beats the opposition into dust (CFL-R 9900K flagship) sometimes more than just core count increase alone (+50% i.e. 12 cores vs. 8). Sometimes it even beats the AVX512 opposition (SKL-X 7900K) with more cores (10 cores vs. 12).

The only “problematic” algorithms are the memory bound ones where the cores/threads (due to SMT we have 24!) are starved for data and due to contention we see performance lower than less-core devices. While larger caches help (thus the massive 4x 16MB L3 caches) higher clocked memory should be used to match the additional core requirements.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: Ryzen2 is phenomenal and a huge upgrade over Ryzen1/+ that (most) AM4 users can enjoy and Intel has no answer to. 10/10.

Just as original Ryzen forced Intel to increase (double really) core counts to match (from 4 to 6 then 8), Ryzen2 will force Intel to come up with even more (and better) cores in order to compete. 3900X with its 12-cores soundly beats CFL-R 9900K (8-cores) in just about all benchmarks and in some tests goes toe-to-toe with HEDT SKL-X AVX512-enabled (10-cores) except in memory-bound algorithms where the 4 DDR4 memory channels with 2x more bandwidth count. For that you need ThreadRipper!

Ryzen1/+ was already competitive with Intel on integer and floating-point (non-SIMD) workloads but would fare badly on SIMD (AVX/FMA3/AVX2) workloads due to its 128-bit units; Ryzen2 “fixes” this issue, with its 256-bit units matching Intel. Only SKL-X with its 512-bit units (AVX512) is faster and Intel will have to finally include AVX512 for consumer CPUs in order to compete (IceLake?).

For compute-bound workloads, the forthcoming 3950X with its 16-cores/32-threads brings unprecedented performance to the consumer/desktop segment pretty much unheard of just a few years ago when 4-core/8-threads (e.g. 7700K) were all you could hope for – unless paying a lot more for HEDT where 8/10-core CPUs were far far more expensive. Naturally we shall see how the reduced memory bandwidth affects its performance with likely very fast DDR4 memory (4300Mt/s+) required for best performance.

Let’s also remember than Ryzen2 adds hardware mitigation to its remaining 2 vulnerabilities while Intel has been forced to add MDS/”Zombieload” even to its very latest CFL-R that now loses its trump card: hardware RDCL/”Meltdown” fix not to forget the recommendation to disable SMT/Hyperthreading that would mean a sizeable performance drop.

What is astonishing is that TDP has remained similar and with a BIOS/firmware upgrade, owners of older 300-series boards can now upgrade to these CPUs – and likely not even change the cooler unit! Naturally for PCIe4.0 a 500-series board is recommended and 400-series boards do support more features in Ryzen2/+ but let’s remember than on Intel you can only go back/forward 1 generation even though there is pretty much no core difference from Skylake (Gen 6) to Coffeelake-R (Gen 9)!

From top-end (3950X), high-end (3800X) to low-end/APU (3200G) Ryzen2 is such a compelling choice it is hard to recommend anything else… at least at this time…

The new Neural Networks (AI/ML) Benchmarks: RNN Architecture

What is a Recurrent Neural Network (RNN/LSTM)?

A RNN is a type of neural network that is primarily made of of neurons that store their previous states thus are said to ‘have memory’. In effect this allows them to ‘remember’ patterns or sequences.

However, they can still be used as ‘classifiers’ i.e. recognising visual patterns in images and thus can be used in visual recognition software.

What is VGG(net) is why use it now?

VGGNet is the baseline (or benchmark) CNN-type network that while did not win the ILSVRC 2014 competition (won by GoogleNet/Inception) it is still the preferred choice in the community for classification due to its uniform and thus relatively simple architecture.

While it is generally implemented using CNN layers, either directly or combination like ResNet, it can also be implemented using RNN layers which is what we have done here.

We believe this is a good test scenario and thus a relevant benchmark for today’s common systems.

We are considering much complex neurons, like LSTM, for future tests specifically designed for high-end systems as those used in research and academia.

What is the MNIST dataset and why use it now?

The MNIST database (https://en.wikipedia.org/wiki/MNIST_database) is a decently sized dataset of handwritten digits used for training and testing image processing systems like neural networks. It contains 60K training and 10K testing images of 28×28 pixel anti-aliased gray levels. The number of classes is only 10 (digits ‘0’ to ‘9’).

While they are only 28×28 and not colour, they can be up-scaled to any size by common up-scaling algorithms to test neural networks with little source data.

Today (2019) the digits would be captured in much higher resolution similar to the standard input resolution of the image processing networks of today (between 200×200 and 300×300 pixels).

As Sandra is designed to be small and easily downloadable, it is not possible to include gigabytes (GB) of data for either inference or training. Even the low-resolution (32x32x3) ILSVRC is 3GB thus unusable for our purpose.

What is Sandra’s RNN network architecture and why was it designed this way?

Due to the low complexity of the data and in order to maintain good performance even on low-end hardware, a standard RNN was chosen as the architecture. The features are:

  • Input is 224x224x1 as MNIST images are grey-scale only (up-scaled from 28×28)
  • Output is 10 as there are only 10 classes
  • 4 layer network, 1 RNN, 3 fully connected layers

What are the implementation details of the network?

The CPU version of the neural network supports all common instruction sets and precision and will be continuously updated as the industry moves forward.

  • Both inference/forward and train/back-propagation tested and supported.
  • Precision: single and double floating-point supported with future half/FP16.
  • SIMD Instruction Sets: CPU, SSE2, SSE4.x, AVX, AVX2/FMA and AVX512 with future VNNI.
  • Threads/Cores: Up to the maximum operating system 384 threads in 64-thread groups are supported with hard affinity as all other benchmarks.
  • NUMA: NUMA is supported up to 16 nodes with data allocated to the closest node.

What kind of BTT (Back-propagation Through Time) is used?

Unfortunately as we only know the output (digit) at the end of the sequence (i.e. once all pixels have been presented) we cannot calculate intermediate errors in order to use TBTT (Truncated BTT) which relies on known output at intermediate sequence time-steps.

What kind of detection rate and error does Sandra’s implementation achieve?

Naturally due to the low source resolution, a much shallower/simpler network would have sufficed. However due to up-scaling and the relatively large number of training images there is no danger of over-fitting.

It achieves a % detection rate (over the 10K testing images) after just 1 epoch (Epoch 0) and % after 30 epochs.

Training (30 epochs) took just X* hours on an i9-7900X (10C/20T) using AVX512/single-precision.

Does Sandra fully infer or train the full image set when benchmarking?

As with all other Sandra benchmarks the tests are limited to 30 seconds (in order to complete reasonably quickly) – within this time as many images at random from the data-sets (60K train, 10K test) will be processed.

The new Neural Networks (AI/ML) Benchmarks: CNN Architecture

What is a Convolution Neural Network (CNN/ConvNet)?

A CNN is a type of neural network that is primarily made of of neuron layers connected in such a way that they perform convolution over the previous layers: in effect they are filters over the input – the same way a blur/sharpen/edge/etc filter would be applied over a picture.

They are used as ‘classifiers’ i.e. recognising visual patterns in images and thus are used in visual recognition software.

What is VGG(net) is why use its architecture now?

VGGNet is the baseline (or benchmark) CNN-type network that while did not win the ILSVRC 2014 competition (won by GoogleNet/Inception) it is still the preferred choice in the community for classification due to its uniform and thus relatively simple architecture.

Thus while today (2019) there are far deeper and more complex neural networks, as Sandra is intended to run on common systems we had to choose the most common but relatively simple network.

We believe this is a good test scenario and thus a relevant benchmark for today’s common systems.

We are considering much deeper networks, like ResNet, for future tests specifically designed for high-end systems as those used research and academia.

Why not use Tensorflow, Caffee, etc. as back-end?

As with all Sandra benchmarks we develop our own code which is optimised in the conjunction with the community which includes hardware makers. This allows us to control all the benchmark stack adding new features and support as required which we would not be able to do when using a back-end.

Using a specific vendor’s libraries (e.g. cuDNN, MKL, etc.) would lock-us into a specific platform while we provide implementation for all platforms including all CPU SIMD instruction sets (SSE2, SSE4, AVX, AVX2/FMA, AVX512) and major GP (GPGPU) run-times (CUDA, OpenCL, DirectX 11/12 Compute and future Vulkan*).

What is the MNIST dataset and why use it now?

The MNIST database (https://en.wikipedia.org/wiki/MNIST_database) is a decently sized dataset of handwritten digits used for training and testing image processing systems like neural networks. It contains 60k (thousand) training and 10k testing images of 28×28 pixel anti-aliased gray levels. The number of classes is only 10 (digits ‘0’ to ‘9’).

While they are only 28×28 and not colour (1 channel), they can be up-scaled to any size by common up-scaling algorithms to test neural networks with little source data. Here we up-scale them 8x to 224x224x1.

Today (2018) the digits would be captured in much higher resolution similar to the standard input resolution of the image processing networks of today (between 200×200 and 300×300 pixels).

As Sandra is designed to be small and easily downloadable, it is not possible to include gigabytes (GB) of data for either inference or training. Even the low-resolution ImageNet ILSVRC is 3GB thus unusable for our purpose.

What are the CIFAR datasets and why use them now?

The CIFAR datasets (https://www.cs.toronto.edu/~kriz/cifar.html) are also decently sized datasets of objects used for training and testing image processing systems like neural networks. They both consists of 50k (thousand) training and 10k testing images of 32x32x3 pixel colour images with CIFAR-10 having 10 classes and CIFAR-100 having 100 classes.

Unlike MNIST the pictures are colour (3 channels RGB) and can also be up-scaled to any size by common up-scaling algorithms to test neural networks with little source data. Here we up-scale them 7x to 224x224x3.

Again, just as with MNIST this allows us to include more datasets while processing them in high resolution similar to modern neural networks without including a large dataset like ImageNet ILSVRC dataset.

What are ImageNet ILSVRC datasets and why *not* use them?

The ImageNet (ImageNet Large Scale Visual Recognition Challenge) datasets (http://www.image-net.org/challenges/LSVRC/) are used in the yearly challenge for researchers in object detection, image classification at large scale. They are used to measure progress in computer vision in the World today.

The yearly challenge/competition has thus yielded many recent advancements in the field with winners (and in some cases runner-ups) providing the classical neural networks of today: AlexNet, VGG, ResNet, Inception, etc.

Naturally the task is non-trivial and requires cutting-edge complex neural networks that generally require similarly high-end hardware that is not the domain of mass-market. While old(er) neural networks like AlexNet, VGG or ResNet can today (2018) work on consumer hardware – they are usually deployed in inference/classification mode. Training them (from scratch) would still require significant processing power and time which does not make sense for our benchmark.

Due to the nature of our software (mass-market, small, fast) the size of the datasets (about 3GB for 32x32x3 1.2 million training images) makes them unsuitable to be included either as standard or downloadable. As we aready use low-resolution datasets, it would not make sense to include another – and the high resolution versions (e.g. 256x256x3) are far larger (about 137GB train, 6.3GB test).

Another issue is the licensing: they are licensed for research which Sandra as a commercial product – even though we provide the benchmarks free of charge – would likely not qualify.

What is Sandra’s CNN network architecture and why was it designed this way?

Due to the low complexity of the data and in order to maintain good performance even on low-end hardware, VGG-16 was chosen as the architecture. The features are:

  • For MNIST dataset
    • Input is 224x224x1 as MNIST images are grey-scale (upscaled from 28×28)
    • Output is 10* as there are only 10 classes
    • 8 convolution (3×3 step 1), 5 pooling (2×2 step 2), 3 full-connect layers
  • Network/Engine features
    • Layers: Fully Connected/Dense, Convolution, Max Pooling, Recurrent, Dropout.
    • Activation: ReLU, Leaky ReLU, Smooth ReLU, Sigmoid, TanH. Activation functions are fused to the layers for reduced memory size/bandwidth footprint.
    • Back-propagation Optimiser: 2nd order Hessian.
    • Alignment: For performance, some layer sizes may be increased (e.g. output) to match SIMD alignment; the performance due to SIMD is higher than the overhead due to more un-needed neurons.
    • SIMD Float Width: Up to 64 single-precision pixels per cycle when using AVX512.
    • SIMD Half Width: Up to 128 half-precision pixels per cycle when using AVX512/BFloat16*.
    • SIMD Int8 Width: Up to 256 int8 pixels per cycle when using AVX512/VNNI*.

What are the implementation details of the network?

The CPU version of the neural network supports all common instruction sets and precision and will be continuously updated as the industry moves forward.

  • Both inference/forward and train/back-propagation tested and supported.
  • Processor:
    • Precision: single/FP32 and double/FP64 supported.
    • SIMD Instruction Sets: FPU, SSE2, SSE4.x, AVX, AVX2/FMA, AVX512 with future VNNI*.
    • Threads/Cores: Up to the maximum operating system 384 threads in 64-thread groups are supported with hard affinity as all other benchmarks.
    • Atomic Updates: TSX/RTE used where supported otherwise 128/64/32-bit interlock/update.
    • NUMA: NUMA is supported up to 16 nodes with data allocated to the closest node.
    • Large Pages: Large (2/4MB) pages used where supported and enabled.
  • GP (GPGPU):
    • Precision: single/FP32 and half/FP16 supported.
    • Run-Times: CUDA 10+, OpenCL 1.2+, DirectX 11/12 Compute.
    • Multi-GPU: Up to 8 devices are supported including CPU pseudo-device.

How is the data stored/processed?

We use the CHW format for simple SIMD implementation and performance load/store.

What activation function do you use?

We use the Sigmoid activation function with a fast (but naturally somewhat low-precision) SIMD tanh/exp implementation; while many modern networks (and VGG itself) use ReLU (for speed reasons) we’ve found the Sigmoid to work “better” for us without appreciable performance impact. By better we mean fast convergence and no need for batch normalisation.

What kind of detection rate and error does Sandra’s implementation achieve?

Naturally due to the low source resolution, a much shallower/simpler network would have sufficed. However due to upscaling and the relatively large number of training images there is no danger of overfitting.

It achieves a 95.3% detection rate (over the 10k testing images) after just 1 epoch (Epoch 0) and 99.82% after 30 epochs.

Training (30 epochs) took just 7* hours on an i9-7900X (10C/20T) using AVX512/single-precision.

Does Sandra fully infer or train the full image set when benchmarking?

As with all other Sandra benchmarks the tests are limited to 30 seconds (in order to complete resonably quickly) – within this time as many images at random from the datasets (60k train, 10k test) will be processed.