SiSoftware Sandra 20/20 (2020) Released!

FOR IMMEDIATE RELEASE

Contact: Press Office

SiSoftware Sandra 20/20 (2020) Released:
Brand-new benchmarks (AI/ML), hardware support

Updates: RTMa (30.16, August 8th).

London, UK, July 18th, 2019 – We are pleased to announce the launch of SiSoftware Sandra 20/20 (2020), the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, mobile devices and networks.

It adds two Neural Network AI/ML (Artificial Intelligence/Machine Learning) benchmarks, for both CPU and GP (GPU), measuring both CNN (Convolution Neural Network) and RNN (Recurrent Neural Network) performance on modern hardware.

It also adds hardware support and optimisations for brand-new CPU architectures (AMD Ryzen 3000 series (Zen 2); Intel IceLake, CometLake), not forgetting GPGPU architectures across the various interfaces (CUDA, OpenCL, DirectX ComputeShader, OpenGL Compute).

As SiSoftware operates a “just-in-time” release cycle, some features were first introduced in earlier service packs; in Sandra 20/20 they have been updated and enhanced based on all the feedback received.

Operating System Module

Broad Operating System Support

All current versions supported: Windows 10, 8.1*, 8*, 7*; Server 2019, 2016, 2012/R2 and 2008/R2*

Brand-new AI/ML benchmarks featuring both CNN and RNN networks, testing both inference/forward and training/back-propagation performance.

Processor Neural Networks (AI/ML)

A combined performance index of CNN (inference/forward & training) & RNN (inference/forward & training) for all precisions (single/FP32, double/FP64 floating-point) and instruction sets (AVX512, AVX2/FMA, AVX, SSE4, SSE2, RTM/HLE with NUMA and large-page support)

Ranker: Processor Neural Networks (Normal/Single Precision)
Ranker: Processor Neural Networks (High/Double Precision)

GP (GPU) Neural Networks (AI/ML)

A combined performance index of CNN (inference/forward & training) & RNN (inference/forward & training) for all precisions (half/FP16, single/FP32 floating-point) and platforms (CUDA, OpenCL, DirectX Compute)

GP (GPU) Neural Networks (Normal/Single Precision)
GP (GPU) Neural Networks (Low/Half Precision)

CNN (Convolution Neural Network) Architecture

Detailed document on the CNN architecture, data-sets and results that underpin our choices for the new benchmarks.

The new Neural Networks (AI/ML) Benchmarks: CNN Architecture

RNN (Recurrent Neural Network) Architecture

Detailed document on the RNN architecture, data-sets and results that underpin our choices for the new benchmarks.

The new Neural Networks (AI/ML) Benchmarks: RNN Architecture

Major changes

  • All connections to the website engines (Ranker, Information, Price) are now secured by SSL/TLS (HTTPS).
  • Sandra client (management console) is now installed as native 64-bit (on x64 and arm64) and thus needs 64-bit Access components (2016, 2013, 2010, etc.) or SQL Server (2017, 2016, 2014, etc.) for its database.

Key features of Sandra 20/20

  • Support for 4 native architectures (x86, x64, ARM64** – Windows; ARM, ARM64, x86, x64 – Android)
  • Huge official hardware support through technology partners (AMD/ATI, nVidia, Intel).
  • Native (GP)GPU/APU platform support (OpenCL 2.1+, CUDA 10.1+, DirectX Compute Shader 11/10+, OpenGL Compute 4.5+, Vulkan 1.0+).
  • Native Graphics platform support (DirectX 11.x/10.x, OpenGL 4.0+, Vulkan 1.0+).
  • 9 language versions (English, German, French, Italian, Spanish, Japanese, Chinese (Traditional, Simplified), Russian) in a single installer.
  • Enhanced Sandra Lite (Eval) version (free for personal/educational use; evaluation for other uses)

Articles & Benchmarks

For more details, please see the following articles:

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please click here.

Downloading

For more details, and to download the Lite (Evaluation) version, please click here.

Reviewers and Editors

For your free review copies, please contact us.

About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Many worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Thousands of on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, OpenGL, DirectCompute, x64, ARM64, ARM, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX512, AVX2/FMA3, AVX, NEON/2, SSE4.2/4, SSSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit www.sisoftware.net, www.sisoftware.eu, or www.sisoftware.co.uk

The new Neural Networks (AI/ML) Benchmarks: RNN Architecture

What is a Recurrent Neural Network (RNN/LSTM)?

An RNN is a type of neural network primarily made up of neurons that store their previous states and are thus said to ‘have memory’. In effect, this allows them to ‘remember’ patterns or sequences.

However, they can still be used as ‘classifiers’, i.e. recognising visual patterns in images, and thus can be used in visual recognition software.
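
For illustration only (this is not Sandra’s code), a minimal vanilla RNN cell in C++ shows how the previous state is ‘remembered’; the weight names (Wx, Wh, b) are hypothetical:

    #include <vector>
    #include <cmath>

    // Minimal vanilla RNN cell sketch: h_t = tanh(Wx*x_t + Wh*h_{t-1} + b).
    // Illustrative only; the names (Wx, Wh, b) are hypothetical, not Sandra's.
    struct RnnCell {
        int in, hid;                      // input and hidden sizes
        std::vector<float> Wx, Wh, b;     // [hid*in], [hid*hid], [hid]

        // One time-step: updates the hidden state h in place from input x.
        void step(const std::vector<float>& x, std::vector<float>& h) const {
            std::vector<float> h2(hid);
            for (int i = 0; i < hid; ++i) {
                float a = b[i];
                for (int j = 0; j < in;  ++j) a += Wx[i * in  + j] * x[j];
                for (int j = 0; j < hid; ++j) a += Wh[i * hid + j] * h[j];
                h2[i] = std::tanh(a);     // the 'memory' carried forward
            }
            h = h2;                       // previous state is stored here
        }
    };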

What is VGG(net) and why use it now?

VGGNet is the baseline (or benchmark) CNN-type network: while it did not win the ILSVRC 2014 competition (won by GoogLeNet/Inception), it is still the preferred choice in the community for classification due to its uniform and thus relatively simple architecture.

While it is generally implemented using CNN layers, either directly or in combination (as in ResNet), it can also be implemented using RNN layers, which is what we have done here.

We believe this is a good test scenario and thus a relevant benchmark for today’s common systems.

We are considering more complex neurons, like LSTM, for future tests specifically designed for high-end systems such as those used in research and academia.

What is the MNIST dataset and why use it now?

The MNIST database (https://en.wikipedia.org/wiki/MNIST_database) is a decently sized dataset of handwritten digits used for training and testing image processing systems like neural networks. It contains 60K training and 10K testing images of 28×28-pixel anti-aliased grey-scale digits. The number of classes is only 10 (digits ‘0’ to ‘9’).

While they are only 28×28 and not colour, they can be up-scaled to any size by common up-scaling algorithms, allowing neural networks to be tested with little source data.

Today (2019), the digits would be captured at a much higher resolution, similar to the standard input resolution of today’s image processing networks (between 200×200 and 300×300 pixels).

As Sandra is designed to be small and easily downloadable, it is not possible to include gigabytes (GB) of data for either inference or training. Even the low-resolution (32x32x3) ILSVRC is 3GB and thus unusable for our purpose.

What is Sandra’s RNN network architecture and why was it designed this way?

Due to the low complexity of the data and in order to maintain good performance even on low-end hardware, a standard RNN was chosen as the architecture. The features are:

  • Input is 224x224x1 as MNIST images are grey-scale only (up-scaled from 28×28)
  • Output is 10 as there are only 10 classes
  • 4-layer network: 1 RNN layer, 3 fully connected layers (see the sketch after this list)
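
A minimal C++ shape/parameter sketch of that topology; only the 224x224x1 input and the 10 outputs come from the text above, while the hidden width, the FC widths and the row-per-time-step presentation are illustrative assumptions:

    #include <cstdio>

    // Shape/parameter walk-through of the 4-layer network above
    // (1 RNN + 3 fully connected). Hidden width 128 and FC widths 64/32
    // are hypothetical; input 224x224x1 and output 10 are from the text.
    int main() {
        const int rows = 224, cols = 224;   // up-scaled grey-scale MNIST
        const int T = rows;                 // assume one row per time-step
        const int in = cols, hid = 128;     // hypothetical RNN width
        const int fc[3] = { 64, 32, 10 };   // two hidden FCs, 10 classes

        long long rnn   = (long long)hid * in + (long long)hid * hid + hid;
        long long dense = (long long)fc[0] * hid + fc[0]
                        + (long long)fc[1] * fc[0] + fc[1]
                        + (long long)fc[2] * fc[1] + fc[2];
        std::printf("sequence: %d steps of %d pixels\n", T, in);
        std::printf("parameters: RNN %lld, FC %lld\n", rnn, dense);
        return 0;
    }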

What are the implementation details of the network?

The CPU version of the neural network supports all common instruction sets and precisions and will be continuously updated as the industry moves forward.

  • Both inference/forward and training/back-propagation are tested and supported.
  • Precision: single and double floating-point supported, with future half/FP16.
  • SIMD Instruction Sets: FPU, SSE2, SSE4.x, AVX, AVX2/FMA and AVX512, with future VNNI.
  • Threads/Cores: Up to the operating-system maximum of 384 threads (in 64-thread groups) is supported, with hard affinity, as in all other benchmarks (see the sketch after this list).
  • NUMA: Supported up to 16 nodes, with data allocated to the closest node.
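
On Windows, a process spanning more than 64 logical CPUs must use processor groups; a sketch of that kind of hard-affinity pinning (illustrative only, not Sandra’s actual scheduler):

    #include <windows.h>
    #include <thread>
    #include <vector>

    // Sketch: pin one worker thread to each logical processor across all
    // processor groups (Windows exposes at most 64 logical CPUs per group).
    int main() {
        std::vector<std::thread> workers;
        const WORD groups = GetActiveProcessorGroupCount();
        for (WORD g = 0; g < groups; ++g) {
            const DWORD cpus = GetActiveProcessorCount(g);
            for (DWORD c = 0; c < cpus; ++c) {
                workers.emplace_back([g, c] {
                    GROUP_AFFINITY aff = {};
                    aff.Group = g;
                    aff.Mask  = KAFFINITY(1) << c;   // hard affinity: one CPU
                    SetThreadGroupAffinity(GetCurrentThread(), &aff, nullptr);
                    /* run this thread's slice of the benchmark here */
                });
            }
        }
        for (auto& t : workers) t.join();
        return 0;
    }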

What kind of BPTT (Back-propagation Through Time) is used?

Unfortunately, as we only know the output (digit) at the end of the sequence (i.e. once all pixels have been presented), we cannot calculate intermediate errors in order to use TBPTT (Truncated BPTT), which relies on known outputs at intermediate sequence time-steps; the full sequence is therefore back-propagated end-to-end.
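
For reference, in standard (not Sandra-specific) notation, full BPTT propagates the single final loss L through every time-step, whereas TBPTT would keep only the last k terms of the sum:

    \frac{\partial L}{\partial W} = \sum_{t=1}^{T}
        \frac{\partial L}{\partial h_T}
        \left( \prod_{j=t+1}^{T} \frac{\partial h_j}{\partial h_{j-1}} \right)
        \frac{\partial h_t}{\partial W}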

What kind of detection rate and error does Sandra’s implementation achieve?

Naturally, due to the low source resolution, a much shallower/simpler network would have sufficed. However, due to up-scaling and the relatively large number of training images, there is no danger of over-fitting.

It achieves a % detection rate (over the 10K testing images) after just 1 epoch (Epoch 0) and % after 30 epochs.

Training (30 epochs) took just X* hours on an i9-7900X (10C/20T) using AVX512/single-precision.

Does Sandra fully infer or train the full image set when benchmarking?

As with all other Sandra benchmarks, the tests are limited to 30 seconds (in order to complete reasonably quickly); within this time, as many images as possible, chosen at random from the data-sets (60K train, 10K test), are processed.
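
A sketch of that time-boxed, randomly-sampled loop (illustrative; processImage() is a hypothetical stand-in for one inference or training step):

    #include <chrono>
    #include <random>
    #include <cstdio>

    // Sketch of a 30-second, randomly-sampled benchmark loop as described.
    static void processImage(int /*index*/) { /* infer or train on one image */ }

    int main() {
        using clock = std::chrono::steady_clock;
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, 60000 - 1); // 60K train set
        long long done = 0;
        const auto end = clock::now() + std::chrono::seconds(30);
        while (clock::now() < end) {       // fixed 30-second budget
            processImage(pick(rng));       // random image from the data-set
            ++done;
        }
        std::printf("images processed in 30s: %lld\n", done);
        return 0;
    }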

The new Neural Networks (AI/ML) Benchmarks: CNN Architecture

What is a Convolution Neural Network (CNN/ConvNet)?

A CNN is a type of neural network primarily made up of neuron layers connected in such a way that they perform convolution over the previous layers: in effect they act as filters over the input – the same way a blur/sharpen/edge/etc. filter would be applied over a picture.

They are used as ‘classifiers’, i.e. recognising visual patterns in images, and are thus used in visual recognition software.
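
To make the ‘filter’ analogy concrete, a minimal single-channel 2D convolution (3×3 kernel, stride 1, no padding; illustrative only):

    #include <vector>

    // Minimal single-channel 2D convolution sketch (3x3 kernel, stride 1,
    // no padding) - the 'sliding filter' a CNN layer applies over its input.
    std::vector<float> conv3x3(const std::vector<float>& img, int w, int h,
                               const float k[9]) {
        const int ow = w - 2, oh = h - 2;   // output shrinks without padding
        std::vector<float> out((size_t)ow * oh, 0.0f);
        for (int y = 0; y < oh; ++y)
            for (int x = 0; x < ow; ++x) {
                float acc = 0.0f;
                for (int ky = 0; ky < 3; ++ky)  // slide the 3x3 filter window
                    for (int kx = 0; kx < 3; ++kx)
                        acc += k[ky * 3 + kx]
                             * img[(size_t)(y + ky) * w + (x + kx)];
                out[(size_t)y * ow + x] = acc;  // e.g. an edge/blur response
            }
        return out;
    }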

What is VGG(net) and why use its architecture now?

VGGNet is the baseline (or benchmark) CNN-type network: while it did not win the ILSVRC 2014 competition (won by GoogLeNet/Inception), it is still the preferred choice in the community for classification due to its uniform and thus relatively simple architecture.

Thus, while today (2019) there are far deeper and more complex neural networks, as Sandra is intended to run on common systems we had to choose the most common yet relatively simple network.

We believe this is a good test scenario and thus a relevant benchmark for today’s common systems.

We are considering much deeper networks, like ResNet, for future tests specifically designed for high-end systems such as those used in research and academia.

Why not use TensorFlow, Caffe, etc. as the back-end?

As with all Sandra benchmarks, we develop our own code, optimised in conjunction with the community, which includes hardware makers. This allows us to control the entire benchmark stack, adding new features and support as required – something we would not be able to do when using a third-party back-end.

Using a specific vendor’s libraries (e.g. cuDNN, MKL, etc.) would lock us into a specific platform, whereas we provide implementations for all platforms, including all CPU SIMD instruction sets (SSE2, SSE4, AVX, AVX2/FMA, AVX512) and major GP (GPGPU) run-times (CUDA, OpenCL, DirectX 11/12 Compute and future Vulkan*).

What is the MNIST dataset and why use it now?

The MNIST database (https://en.wikipedia.org/wiki/MNIST_database) is a decently sized dataset of handwritten digits used for training and testing image processing systems like neural networks. It contains 60k (thousand) training and 10k testing images of 28×28-pixel anti-aliased grey-scale digits. The number of classes is only 10 (digits ‘0’ to ‘9’).

While they are only 28×28 and not colour (1 channel), they can be up-scaled to any size by common up-scaling algorithms, allowing neural networks to be tested with little source data. Here we up-scale them 8x to 224x224x1, as sketched below.
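
One simple way to perform such an 8x up-scale is nearest-neighbour; a sketch (the exact algorithm Sandra uses is not specified):

    #include <vector>
    #include <cstdint>

    // 8x nearest-neighbour up-scale: 28x28 grey-scale -> 224x224, as one
    // simple way to enlarge MNIST; Sandra's exact algorithm is not public.
    std::vector<uint8_t> upscale8x(const std::vector<uint8_t>& src) {
        const int s = 28, f = 8, d = s * f;   // 28 * 8 = 224
        std::vector<uint8_t> dst((size_t)d * d);
        for (int y = 0; y < d; ++y)
            for (int x = 0; x < d; ++x)       // copy each source pixel 8x8 times
                dst[(size_t)y * d + x] = src[(size_t)(y / f) * s + (x / f)];
        return dst;
    }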

Today (2019), the digits would be captured at a much higher resolution, similar to the standard input resolution of today’s image processing networks (between 200×200 and 300×300 pixels).

As Sandra is designed to be small and easily downloadable, it is not possible to include gigabytes (GB) of data for either inference or training. Even the low-resolution ImageNet ILSVRC is 3GB and thus unusable for our purpose.

What are the CIFAR datasets and why use them now?

The CIFAR datasets (https://www.cs.toronto.edu/~kriz/cifar.html) are also decently sized datasets of objects used for training and testing image processing systems like neural networks. They both consist of 50k (thousand) training and 10k testing colour images of 32x32x3 pixels, with CIFAR-10 having 10 classes and CIFAR-100 having 100 classes.

Unlike MNIST, the pictures are colour (3 channels, RGB) and can also be up-scaled to any size by common up-scaling algorithms to test neural networks with little source data. Here we up-scale them 7x to 224x224x3.

Again, just as with MNIST, this allows us to include more datasets while processing them at high resolution, similar to modern neural networks, without including a large dataset like the ImageNet ILSVRC.

What are the ImageNet ILSVRC datasets and why *not* use them?

The ImageNet ILSVRC (ImageNet Large Scale Visual Recognition Challenge) datasets (http://www.image-net.org/challenges/LSVRC/) are used in the yearly challenge for researchers in object detection and image classification at large scale; they are used to measure progress in computer vision today.

The yearly challenge/competition has thus yielded many recent advancements in the field, with winners (and in some cases runners-up) providing the classic neural networks of today: AlexNet, VGG, ResNet, Inception, etc.

Naturally, the task is non-trivial and requires cutting-edge, complex neural networks that generally require similarly high-end hardware, which is not mass-market territory. While old(er) neural networks like AlexNet, VGG or ResNet can today (2019) run on consumer hardware, they are usually deployed in inference/classification mode. Training them (from scratch) would still require significant processing power and time, which does not make sense for our benchmark.

Due to the nature of our software (mass-market, small, fast), the size of the datasets (about 3GB for the 1.2 million training images at 32x32x3) makes them unsuitable for inclusion, either as standard or as a download. As we already use low-resolution datasets, it would not make sense to include another – and the high-resolution versions (e.g. 256x256x3) are far larger (about 137GB train, 6.3GB test).

Another issue is licensing: the datasets are licensed for research use, for which Sandra, as a commercial product – even though we provide the benchmarks free of charge – would likely not qualify.

What is Sandra’s CNN network architecture and why was it designed this way?

Due to the low complexity of the data and in order to maintain good performance even on low-end hardware, VGG-16 was chosen as the architecture. The features are:

  • For the MNIST dataset
    • Input is 224x224x1 as MNIST images are grey-scale (up-scaled from 28×28)
    • Output is 10* as there are only 10 classes
    • 8 convolution (3×3, step 1), 5 pooling (2×2, step 2) and 3 fully-connected layers (see the shape walk-through after this list)
  • Network/Engine features
    • Layers: Fully Connected/Dense, Convolution, Max Pooling, Recurrent, Dropout.
    • Activation: ReLU, Leaky ReLU, Smooth ReLU, Sigmoid, TanH. Activation functions are fused into the layers for a reduced memory size/bandwidth footprint.
    • Back-propagation Optimiser: 2nd order Hessian.
    • Alignment: For performance, some layer sizes may be increased (e.g. the output) to match SIMD alignment; the performance gained from SIMD outweighs the overhead of the extra unused neurons.
    • SIMD Float Width: Up to 64 single-precision pixels per cycle when using AVX512.
    • SIMD Half Width: Up to 128 half-precision pixels per cycle when using AVX512/BFloat16*.
    • SIMD Int8 Width: Up to 256 int8 pixels per cycle when using AVX512/VNNI*.
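
A quick walk-through of how those layers reduce the input (assuming the 3×3 stride-1 convolutions are ‘same’-padded, which the text does not state):

    #include <cstdio>

    // Shape walk-through of the layer list above: five 2x2 stride-2 poolings
    // halve 224x224 down to 7x7 before the fully-connected layers. The 3x3
    // stride-1 convolutions are assumed 'same'-padded (an assumption).
    int main() {
        int dim = 224;                       // up-scaled MNIST input (224x224x1)
        for (int pool = 1; pool <= 5; ++pool) {
            dim /= 2;                        // each 2x2 step-2 pooling halves it
            std::printf("after pooling %d: %dx%d\n", pool, dim, dim);
        }
        // 224 -> 112 -> 56 -> 28 -> 14 -> 7; the 7x7 maps are then flattened
        // into the 3 fully-connected layers ending in the 10 class outputs.
        return 0;
    }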

What are the implementation details of the network?

The CPU version of the neural network supports all common instruction sets and precisions and will be continuously updated as the industry moves forward.

  • Both inference/forward and training/back-propagation are tested and supported.
  • Processor:
    • Precision: single/FP32 and double/FP64 supported.
    • SIMD Instruction Sets: FPU, SSE2, SSE4.x, AVX, AVX2/FMA, AVX512 with future VNNI*.
    • Threads/Cores: Up to the operating-system maximum of 384 threads (in 64-thread groups) is supported, with hard affinity, as in all other benchmarks.
    • Atomic Updates: TSX/RTM used where supported, otherwise 128/64/32-bit interlocked updates.
    • NUMA: Supported up to 16 nodes, with data allocated to the closest node (see the sketch after this list).
    • Large Pages: Large (2/4MB) pages used where supported and enabled.
  • GP (GPGPU):
    • Precision: single/FP32 and half/FP16 supported.
    • Run-Times: CUDA 10+, OpenCL 1.2+, DirectX 11/12 Compute.
    • Multi-GPU: Up to 8 devices are supported including CPU pseudo-device.
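
A sketch of the ‘closest node’ allocation idea using the Windows NUMA APIs (illustrative only; Sandra’s allocator is not public, and adding MEM_LARGE_PAGES would request the large pages mentioned above, subject to the SeLockMemoryPrivilege):

    #include <windows.h>
    #include <cstdio>

    // Sketch: allocate a benchmark buffer on the NUMA node closest to the
    // calling thread, per the 'data allocated to the closest node' item.
    void* allocOnLocalNode(SIZE_T bytes) {
        PROCESSOR_NUMBER pn;
        GetCurrentProcessorNumberEx(&pn);       // where is this thread now?
        USHORT node = 0;
        GetNumaProcessorNodeEx(&pn, &node);     // which NUMA node is that?
        return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                  MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                  node);        // prefer the local node
    }

    int main() {
        void* buf = allocOnLocalNode(64 * 1024 * 1024);  // 64MB working set
        std::printf("buffer %p\n", buf);
        if (buf) VirtualFree(buf, 0, MEM_RELEASE);
        return 0;
    }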

How is the data stored/processed?

We use the CHW format for a simple SIMD implementation and high-performance loads/stores.
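
In CHW (planar) layout the innermost W dimension is contiguous, so a whole row of pixels can be streamed through SIMD registers; the indexing is simply:

    #include <cstddef>

    // CHW ("planar") indexing sketch: for a tensor of C channels of HxW
    // pixels, element (c, y, x) lives at c*H*W + y*W + x, so a whole row of
    // W pixels is contiguous - ideal for aligned SIMD loads/stores.
    inline std::size_t chw(std::size_t c, std::size_t y, std::size_t x,
                           std::size_t H, std::size_t W) {
        return (c * H + y) * W + x;
    }
    // e.g. float* row = data + chw(c, y, 0, H, W); // W consecutive floats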

What activation function do you use?

We use the Sigmoid activation function with a fast (but naturally somewhat low-precision) SIMD tanh/exp implementation; while many modern networks (and VGG itself) use ReLU (for speed), we have found the Sigmoid to work “better” for us without appreciable performance impact – by “better” we mean fast convergence and no need for batch normalisation.
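
For reference, the identity s(x) = 0.5·(1 + tanh(x/2)) is what lets a fast SIMD tanh double as a sigmoid; a plain-loop sketch the compiler can auto-vectorise (Sandra’s hand-tuned SIMD version is not public):

    #include <cmath>
    #include <cstddef>

    // Sigmoid via tanh: s(x) = 0.5 * (1 + tanh(x/2)) == 1 / (1 + exp(-x)).
    // Written as a simple loop so the compiler can auto-vectorise it;
    // illustrative only, not Sandra's hand-tuned SIMD implementation.
    void sigmoid(const float* x, float* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            y[i] = 0.5f * (1.0f + std::tanh(0.5f * x[i]));
    }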

What kind of detection rate and error does Sandra’s implementation achieve?

Naturally, due to the low source resolution, a much shallower/simpler network would have sufficed. However, due to up-scaling and the relatively large number of training images, there is no danger of over-fitting.

It achieves a 95.3% detection rate (over the 10k testing images) after just 1 epoch (Epoch 0) and 99.82% after 30 epochs.

Training (30 epochs) took just 7* hours on an i9-7900X (10C/20T) using AVX512/single-precision.

Does Sandra fully infer or train the full image set when benchmarking?

As with all other Sandra benchmarks, the tests are limited to 30 seconds (in order to complete reasonably quickly); within this time, as many images as possible, chosen at random from the datasets (60k train, 10k test), are processed.