The new Neural Networks (AI/ML) Benchmarks: CNN Architecture

What is a Convolutional Neural Network (CNN/ConvNet)?

A CNN is a type of neural network made primarily of neuron layers connected in such a way that they perform convolution over the previous layers: in effect they are filters over the input, in the same way that a blur, sharpen or edge filter would be applied to a picture.
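
As an illustration of the "filter over the input" idea, here is a minimal pure-Python sketch that slides a 3x3 box-blur kernel over a tiny grey-scale image. The function name, the kernel and the test image are made up for this example and are not taken from Sandra's code.

```python
# Minimal sketch: slide a small kernel over a 2D grey-scale image.
# "Valid" mode (no padding), so the output is smaller than the input.
# As in most CNN code, the kernel is not flipped (cross-correlation).

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out

# 3x3 box blur: each output pixel is the mean of its 3x3 neighbourhood.
BOX_BLUR = [[1.0 / 9.0] * 3 for _ in range(3)]

# Hypothetical 4x4 image with a single bright pixel in the top-left corner.
image = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

blurred = convolve2d(image, BOX_BLUR)  # 2x2 result
```

Only the top-left output window covers the bright pixel, so only that output is non-zero. A convolution layer does the same thing, except that the kernel values are learned during training rather than fixed.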

They are used as ‘classifiers’, i.e. they recognise visual patterns in images, and are therefore used in visual recognition software.

What is VGG(net) and why use its architecture now?

VGGNet is the baseline (or benchmark) CNN-type network: while it did not win the ILSVRC 2014 competition (won by GoogLeNet/Inception), it is still the preferred choice in the community for classification due to its uniform and thus relatively simple architecture.

Thus, while today (2019) there are far deeper and more complex neural networks, Sandra is intended to run on common systems, so we had to choose the most common but relatively simple network.

We believe this is a good test scenario and thus a relevant benchmark for today’s common systems.

We are considering much deeper networks, like ResNet, for future tests specifically designed for high-end systems such as those used in research and academia.

Why not use TensorFlow, Caffe, etc. as back-end?

As with all Sandra benchmarks, we develop our own code, which is optimised in conjunction with the community, including hardware makers. This allows us to control the whole benchmark stack, adding new features and support as required, which we would not be able to do when using a back-end.

Using a specific vendor’s libraries (e.g. cuDNN, MKL, etc.) would lock us into a specific platform, whereas we provide implementations for all platforms including all CPU SIMD instruction sets (SSE2, SSE4, AVX, AVX2/FMA, AVX512) and major GP (GPGPU) run-times (CUDA, OpenCL, DirectX 11/12 Compute and future Vulkan*).

What is the MNIST dataset and why use it now?

The MNIST database (https://en.wikipedia.org/wiki/MNIST_database) is a decently sized dataset of handwritten digits used for training and testing image processing systems like neural networks. It contains 60k (thousand) training and 10k testing images, each 28×28 pixels in anti-aliased grey levels. There are only 10 classes (digits ‘0’ to ‘9’).

While they are only 28×28 and not colour (1 channel), they can be up-scaled to any size by common up-scaling algorithms to test neural networks with little source data. Here we up-scale them 8x to 224x224x1.
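
The 8x up-scaling step can be sketched as follows. The text does not say which up-scaling algorithm is used, so nearest-neighbour (pixel repetition) is assumed here purely because it is the simplest. The function name and the placeholder image are hypothetical.

```python
# Minimal sketch, assuming nearest-neighbour up-scaling (an assumption:
# the actual algorithm is not specified in the text).

def upscale_nearest(image, factor):
    """Repeat every pixel `factor` times horizontally and vertically."""
    out = []
    for row in image:
        wide = [pixel for pixel in row for _ in range(factor)]
        # Emit `factor` independent copies of the widened row.
        out.extend(list(wide) for _ in range(factor))
    return out

# Placeholder 28x28 single-channel image with one lit pixel at row 3, col 5.
digit = [[0] * 28 for _ in range(28)]
digit[3][5] = 255

big = upscale_nearest(digit, 8)  # 28 * 8 = 224 pixels per side
```

With a factor of 8, the single lit pixel becomes an 8x8 block covering rows 24 to 31 and columns 40 to 47 of the 224x224 result.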

Today (2018) the digits would be captured in much higher resolution similar to the standard input resolution of the image processing networks of today (between 200×200 and 300×300 pixels).

As Sandra is designed to be small and easily downloadable, it is not possible to include gigabytes (GB) of data for either inference or training. Even the low-resolution ImageNet ILSVRC is about 3GB and thus unusable for our purpose.

What are the CIFAR datasets and why use them now?

The CIFAR datasets (https://www.cs.toronto.edu/~kriz/cifar.html) are also decently sized datasets of objects used for training and testing image processing systems like neural networks. Both consist of 50k (thousand) training and 10k testing colour images of 32x32x3 pixels, with CIFAR-10 having 10 classes and CIFAR-100 having 100 classes.

Unlike MNIST the pictures are colour (3 channels RGB) and can also be up-scaled to any size by common up-scaling algorithms to test neural networks with little source data. Here we up-scale them 7x to 224x224x3.

Again, just as with MNIST, this allows us to include more datasets while processing them at a high resolution similar to modern neural networks, without including a large dataset like ImageNet ILSVRC.

What are ImageNet ILSVRC datasets and why not use them?

The ImageNet (ImageNet Large Scale Visual Recognition Challenge) datasets (http://www.image-net.org/challenges/LSVRC/) are used in the yearly challenge for researchers in object detection and image classification at large scale. They are used to measure progress in computer vision today.

The yearly challenge/competition has thus yielded many recent advancements in the field, with winners (and in some cases runners-up) providing the classical neural networks of today: AlexNet, VGG, ResNet, Inception, etc.

Naturally the task is non-trivial and requires cutting-edge, complex neural networks that generally require similarly high-end hardware outside the mass-market domain. While old(er) neural networks like AlexNet, VGG or ResNet can today (2018) work on consumer hardware, they are usually deployed in inference/classification mode. Training them (from scratch) would still require significant processing power and time, which does not make sense for our benchmark.

Due to the nature of our software (mass-market, small, fast), the size of the datasets (about 3GB for the 1.2 million 32x32x3 training images) makes them unsuitable for inclusion either as standard or as a download. As we already use low-resolution datasets, it would not make sense to include another, and the high-resolution versions (e.g. 256x256x3) are far larger (about 137GB train, 6.3GB test).
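
To put the sizes in perspective, the following back-of-the-envelope sketch estimates raw pixel storage from the image counts and dimensions quoted in this article. It assumes one byte per channel value and ignores labels, file headers and compression, so it only gives an order of magnitude and will not match the size of any distributed archive exactly.

```python
# Rough estimate of raw pixel data, assuming 8 bits per channel value
# (an assumption) and ignoring labels, headers and compression.

def raw_bytes(images, width, height, channels, bytes_per_value=1):
    return images * width * height * channels * bytes_per_value

mnist = raw_bytes(60_000 + 10_000, 28, 28, 1)   # train + test
cifar = raw_bytes(50_000 + 10_000, 32, 32, 3)   # train + test
imagenet32 = raw_bytes(1_200_000, 32, 32, 3)    # low-resolution train set

for name, size in (("MNIST", mnist), ("CIFAR", cifar), ("ImageNet 32x32", imagenet32)):
    print(f"{name}: {size / 1e6:.1f} MB")
```

Under these assumptions MNIST and CIFAR stay in the tens to hundreds of megabytes, while even the 32x32 ImageNet training set lands in the gigabyte range.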

Another issue is the licensing: they are licensed for research, for which Sandra as a commercial product would likely not qualify, even though we provide the benchmarks free of charge.

What is Sandra’s CNN network architecture and why was it designed this way?

Due to the low complexity of the data and in order to maintain good performance even on low-end hardware, VGG-16 was chosen as the architecture. The features are:

For MNIST dataset
- Input is 224x224x1 as MNIST images are grey-scale (upscaled from 28×28)
- Output is 10* as there are only 10 classes
- 8 convolution (3×3 step 1), 5 pooling (2×2 step 2), 3 fully connected layers
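
The effect of the layer parameters above on the spatial size can be sketched with the standard output-size formulas. The convolution padding is not stated in the list; padding of 1 (so that a 3×3 step 1 convolution keeps the size unchanged, as in the original VGG design) is assumed here. The helper names are hypothetical.

```python
# Minimal sketch of how the 224x224 input shrinks through the 5 pooling
# stages. Padding of 1 for the convolutions is an assumption.

def conv_out(size, kernel=3, stride=1, padding=1):
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    return (size - window) // stride + 1

size = 224
sizes = [size]
for _ in range(5):
    # One or more 3x3 convolutions (size preserved) followed by one pooling.
    size = pool_out(conv_out(size))
    sizes.append(size)

print(sizes)
```

Each 2×2 step 2 pooling halves the width and height, so under this assumption the fully connected layers receive 7x7 feature maps.
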
Network/Engine features
- Layers: Fully Connected/Dense, Convolution, Max Pooling, Recurrent, Dropout.
- Activation: ReLU, Leaky ReLU, Smooth ReLU, Sigmoid, TanH. Activation functions are fused to the layers for reduced memory size/bandwidth footprint.
- Back-propagation Optimiser: 2nd order Hessian.
- Alignment: For performance, some layer sizes may be increased (e.g. output) to match SIMD alignment; the performance gain from SIMD outweighs the overhead of the extra, unneeded neurons.
- SIMD Float Width: Up to 64 single-precision pixels per cycle when using AVX512.
- SIMD Half Width: Up to 128 half-precision pixels per cycle when using AVX512/BFloat16*.
- SIMD Int8 Width: Up to 256 int8 pixels per cycle when using AVX512/VNNI*.
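
The alignment item above can be illustrated with a small sketch that rounds a layer size up to a whole number of SIMD lanes. The number of lanes per register follows directly from the register and element widths. The exact multiple Sandra pads to is not stated, so padding to one full register is an assumption, and the helper names are hypothetical.

```python
# Minimal sketch: pad a layer size to a multiple of the SIMD lane count.
# Padding to one full vector register is an assumption.

def simd_lanes(register_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits

def pad_to_lanes(count, lanes):
    """Round `count` up to the next multiple of `lanes`."""
    return -(-count // lanes) * lanes

lanes_fp32 = simd_lanes(512, 32)   # AVX512, single precision
lanes_fp16 = simd_lanes(512, 16)   # AVX512, half precision
lanes_int8 = simd_lanes(512, 8)    # AVX512, int8

padded_outputs = pad_to_lanes(10, lanes_fp32)  # the 10-class output layer
```

With 16 single-precision lanes per 512-bit register, a 10-neuron output layer would be padded to 16 neurons, and the 6 extra neurons are simply ignored when reading the result.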

What are the implementation details of the network?

The CPU version of the neural network supports all common instruction sets and precisions and will be continuously updated as the industry moves forward.

Both inference/forward and train/back-propagation are tested and supported.
Processor:
- Precision: single/FP32 and double/FP64 supported.
- SIMD Instruction Sets: FPU, SSE2, SSE4.x, AVX, AVX2/FMA, AVX512 with future VNNI*.
- Threads/Cores: Up to the operating system maximum of 384 threads, in 64-thread groups, are supported with hard affinity, as in all other benchmarks.
- Atomic Updates: TSX/RTM used where supported, otherwise 128/64/32-bit interlock/update.
- NUMA: NUMA is supported up to 16 nodes with data allocated to the closest node.
- Large Pages: Large (2/4MB) pages used where supported and enabled.
GP (GPGPU):
- Precision: single/FP32 and half/FP16 supported.
- Run-Times: CUDA 10+, OpenCL 1.2+, DirectX 11/12 Compute.
- Multi-GPU: Up to 8 devices are supported including CPU pseudo-device.
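
The "64-thread groups" item in the Processor list can be pictured as a mapping from a flat thread number to a group number plus a position inside that group. The sketch below is purely illustrative: how Sandra actually assigns threads is not described here, and the function and constant names are hypothetical.

```python
# Illustrative sketch: map a flat thread number onto 64-thread groups.
# The assignment policy is an assumption, not Sandra's documented behaviour.

GROUP_SIZE = 64
MAX_THREADS = 384

def group_and_index(thread):
    """Return (group number, position within the group) for a thread."""
    if not 0 <= thread < MAX_THREADS:
        raise ValueError("thread number out of range")
    return divmod(thread, GROUP_SIZE)

def affinity_mask(thread):
    """64-bit mask with only this thread's position set within its group."""
    _, index = group_and_index(thread)
    return 1 << index

group_count = MAX_THREADS // GROUP_SIZE
```

With these numbers there are 6 groups, and pinning each worker to exactly one bit of one group's mask is one way to obtain hard affinity.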

How is the data stored/processed?

We use the CHW (channel, height, width) format for a simple SIMD implementation and fast load/store.
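
The following sketch shows how a pixel is addressed in a flat CHW buffer, with the interleaved HWC layout alongside for comparison. The function names are hypothetical; the indexing formulas are the standard ones for these layouts.

```python
# Minimal sketch of flat-buffer indexing for planar (CHW) and
# interleaved (HWC) image layouts.

def chw_offset(c, y, x, height, width):
    """Planar layout: all of channel 0, then all of channel 1, and so on."""
    return (c * height + y) * width + x

def hwc_offset(c, y, x, width, channels):
    """Interleaved layout: the channel values of one pixel sit together."""
    return (y * width + x) * channels + c

H, W, C = 224, 224, 3

# Distance in elements between horizontally adjacent pixels of one channel.
chw_step = chw_offset(0, 0, 1, H, W) - chw_offset(0, 0, 0, H, W)
hwc_step = hwc_offset(0, 0, 1, W, C) - hwc_offset(0, 0, 0, W, C)
```

In CHW, neighbouring pixels of the same channel are contiguous in memory (a step of 1), so a vector load picks up a run of same-channel pixels directly. In HWC the same pixels are `channels` elements apart.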

What activation function do you use?

We use the Sigmoid activation function with a fast (but naturally somewhat low-precision) SIMD tanh/exp implementation. While many modern networks (and VGG itself) use ReLU (for speed reasons), we’ve found the Sigmoid to work “better” for us without appreciable performance impact. By better we mean fast convergence and no need for batch normalisation.
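
The link between the Sigmoid and tanh/exp can be made concrete with the standard identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)). The sketch below uses Python's exact `math.tanh` and `math.exp` rather than a fast low-precision approximation, so it shows only the mathematics and not Sandra's SIMD implementation. The function names are hypothetical.

```python
import math

# Minimal sketch: two equivalent ways to evaluate the sigmoid, plus the
# derivative needed for back-propagation. Uses exact library functions,
# not a fast approximation.

def sigmoid_via_exp(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_via_tanh(x):
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def sigmoid_derivative(s):
    """Derivative expressed through the sigmoid output s = sigmoid(x)."""
    return s * (1.0 - s)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, sigmoid_via_exp(x), sigmoid_via_tanh(x))
```

Because the derivative depends only on the output value, the backward pass can reuse the activations stored during the forward pass without evaluating tanh or exp again.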

What kind of detection rate and error does Sandra’s implementation achieve?

Naturally, due to the low source resolution, a much shallower/simpler network would have sufficed. However, due to upscaling and the relatively large number of training images, there is no danger of overfitting.

It achieves a 95.3% detection rate (over the 10k testing images) after just 1 epoch (Epoch 0) and 99.82% after 30 epochs.
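
In absolute terms, the quoted rates translate into the following counts over the 10k testing images. This is plain arithmetic on the figures above; the helper name is hypothetical.

```python
# Convert the quoted detection rates into image counts over the test set.

TEST_IMAGES = 10_000

def correct_count(rate_percent, total=TEST_IMAGES):
    return round(rate_percent / 100.0 * total)

correct_first = correct_count(95.3)    # after the first epoch
correct_last = correct_count(99.82)    # after 30 epochs

errors_first = TEST_IMAGES - correct_first
errors_last = TEST_IMAGES - correct_last
```

That is 470 misclassified images after the first epoch against 18 after 30 epochs.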

Training (30 epochs) took just 7* hours on an i9-7900X (10C/20T) using AVX512/single-precision.

Does Sandra fully infer or train the full image set when benchmarking?

As with all other Sandra benchmarks, the tests are limited to 30 seconds (in order to complete reasonably quickly); within this time, as many images as possible, picked at random from the datasets (60k train, 10k test), will be processed.
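
A time-limited run of this kind can be sketched as a loop that keeps processing randomly chosen images until the time budget is used up. Everything in this sketch is hypothetical (function names, the callback, reporting throughput as images per second); it shows the general pattern only, not Sandra's actual benchmark loop.

```python
import random
import time

# Minimal sketch of a time-boxed benchmark loop. All names are hypothetical.

def run_timed(process, dataset_size, budget_seconds=30.0,
              clock=time.perf_counter, rng=None):
    """Process random image indices until the time budget is exhausted.

    Returns (images processed, images per second).
    """
    rng = rng or random.Random()
    start = clock()
    processed = 0
    while clock() - start < budget_seconds:
        process(rng.randrange(dataset_size))  # pick an image at random
        processed += 1
    elapsed = clock() - start
    return processed, processed / elapsed

# Short demonstration run with a do-nothing workload and a 50 ms budget.
count, rate = run_timed(lambda index: None, 60_000, budget_seconds=0.05)
```

Because the budget is fixed, a faster system simply processes more images in the same time, which keeps the run length predictable across very different hardware.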