What is a Recurrent Neural Network (RNN/LSTM)?
An RNN is a type of neural network primarily made of neurons that store their previous state and are thus said to ‘have memory’. In effect, this allows them to ‘remember’ patterns or sequences.
Nevertheless, they can still be used as ‘classifiers’, i.e. to recognise visual patterns in images, and can thus be used in visual recognition software.
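The ‘memory’ described above can be illustrated with a minimal sketch of a simple (Elman-style) recurrent step; the sizes and weights here are arbitrary, purely for illustration, and are not Sandra’s actual implementation:

```python
import numpy as np

# Minimal recurrent step: the hidden state h carries information from
# previous inputs -- this is the 'memory' of an RNN neuron.
def rnn_step(x, h, Wx, Wh, b):
    # The new state depends on both the current input and the old state.
    return np.tanh(x @ Wx + h @ Wh + b)

rng = np.random.default_rng(0)
Wx = rng.standard_normal((4, 8)) * 0.1   # input-to-hidden weights
Wh = rng.standard_normal((8, 8)) * 0.1   # hidden-to-hidden (recurrent) weights
b = np.zeros(8)

h = np.zeros(8)                          # initial state: no memory yet
for t in range(5):                       # feed a 5-step sequence
    x = rng.standard_normal(4)
    h = rnn_step(x, h, Wx, Wh, b)        # h now summarises steps 0..t
print(h.shape)                           # (8,)
```

After the loop, `h` is a fixed-size summary of the whole sequence seen so far, which is what lets an RNN act as a classifier once the full input has been presented.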
What is VGG(net) and why use it now?
VGGNet is a baseline (or benchmark) CNN-type network that, while it did not win the ILSVRC 2014 competition (won by GoogLeNet/Inception), is still a preferred choice in the community for classification due to its uniform and thus relatively simple architecture.
While it is generally implemented using CNN layers, either directly or in combination as in ResNet, it can also be implemented using RNN layers, which is what we have done here.
We believe this is a good test scenario and thus a relevant benchmark for today’s common systems.
We are considering more complex neurons, like LSTM, for future tests specifically designed for high-end systems such as those used in research and academia.
What is the MNIST dataset and why use it now?
The MNIST database (https://en.wikipedia.org/wiki/MNIST_database) is a decently sized dataset of handwritten digits used for training and testing image processing systems like neural networks. It contains 60K training and 10K testing images of 28×28-pixel anti-aliased grey-scale digits. The number of classes is only 10 (digits ‘0’ to ‘9’).
While they are only 28×28 and not colour, they can be up-scaled to any size by common up-scaling algorithms, allowing neural networks to be tested with little source data.
Today (2019), such digits would be captured at a much higher resolution, similar to the standard input resolution of today’s image-processing networks (between 200×200 and 300×300 pixels).
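Since 224 is an exact 8× multiple of 28, the up-scaling can be as simple as replicating each pixel; a minimal nearest-neighbour sketch (not Sandra’s actual scaler, which may use a smoother algorithm):

```python
import numpy as np

# Nearest-neighbour up-scaling: 28x28 -> 224x224 is an exact 8x integer
# factor, so each source pixel is simply replicated into an 8x8 block.
def upscale_nn(img, factor=8):
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

digit = np.zeros((28, 28), dtype=np.float32)  # stand-in for one MNIST image
big = upscale_nn(digit)
print(big.shape)  # (224, 224)
```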
As Sandra is designed to be small and easily downloadable, it is not possible to include gigabytes (GB) of data for either inference or training. Even the low-resolution (32x32x3) ILSVRC is 3GB thus unusable for our purpose.
What is Sandra’s RNN network architecture and why was it designed this way?
Due to the low complexity of the data and in order to maintain good performance even on low-end hardware, a standard RNN was chosen as the architecture. The features are:
- Input is 224×224×1 as MNIST images are grey-scale only (up-scaled from 28×28)
- Output is 10 as there are only 10 classes
- 4-layer network: 1 RNN layer, 3 fully connected layers
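The shape of the 4-layer network above can be sketched as follows. This is a hypothetical illustration: the hidden widths, the row-by-row presentation order and the random weights are assumptions, not Sandra’s published internals.

```python
import numpy as np

# Illustrative 4-layer shape: 1 RNN layer scanning the image, followed by
# 3 fully connected layers down to the 10 digit classes.
rng = np.random.default_rng(0)
H, F1, F2 = 128, 64, 32                    # assumed hidden widths

Wx = rng.standard_normal((224, H)) * 0.01  # one 224-pixel row per time-step
Wh = rng.standard_normal((H, H)) * 0.01    # recurrent weights
W1 = rng.standard_normal((H, F1)) * 0.01
W2 = rng.standard_normal((F1, F2)) * 0.01
W3 = rng.standard_normal((F2, 10)) * 0.01  # 10 output classes

def forward(img):                          # img: 224x224 grey-scale
    h = np.zeros(H)
    for row in img:                        # RNN layer: scan all 224 rows
        h = np.tanh(row @ Wx + h @ Wh)
    a1 = np.tanh(h @ W1)                   # fully connected layer 1
    a2 = np.tanh(a1 @ W2)                  # fully connected layer 2
    return a2 @ W3                         # fully connected layer 3 (logits)

out = forward(np.zeros((224, 224)))
print(out.shape)  # (10,)
```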
What are the implementation details of the network?
The CPU version of the neural network supports all common instruction sets and precisions and will be continuously updated as the industry moves forward.
- Both inference/forward and training/back-propagation are tested and supported.
- Precision: single and double floating-point supported with future half/FP16.
- SIMD Instruction Sets: scalar CPU/FPU, SSE2, SSE4.x, AVX, AVX2/FMA and AVX512, with future VNNI.
- Threads/Cores: Up to the operating system maximum of 384 threads (in 64-thread groups) is supported, with hard affinity as in all other benchmarks.
- NUMA: NUMA is supported up to 16 nodes with data allocated to the closest node.
What kind of BPTT (Back-Propagation Through Time) is used?
Unfortunately, as we only know the output (digit) at the end of the sequence (i.e. once all pixels have been presented), we cannot calculate intermediate errors in order to use TBPTT (Truncated BPTT), which relies on known outputs at intermediate sequence time-steps. Full (untruncated) BPTT over the whole sequence is therefore used.
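Since the error is only available at the final time-step, back-propagation has to run through the entire unrolled sequence. A toy sketch of full BPTT with a final-step-only loss (toy sizes and a simple squared-error loss, not Sandra’s actual code):

```python
import numpy as np

# Full (untruncated) BPTT: loss is known only at the final time-step, so
# the error must be propagated back through every earlier step.
rng = np.random.default_rng(0)
T, D, H = 6, 3, 4
Wx = rng.standard_normal((D, H)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1
xs = rng.standard_normal((T, D))
target = rng.standard_normal(H)

# Forward: keep every hidden state so gradients can flow back through them.
hs = [np.zeros(H)]
for t in range(T):
    hs.append(np.tanh(xs[t] @ Wx + hs[-1] @ Wh))

# Backward: error exists only at step T-1, then propagates back to step 0.
dWx, dWh = np.zeros_like(Wx), np.zeros_like(Wh)
dh = hs[-1] - target                   # grad of 0.5*||h_T - target||^2
for t in reversed(range(T)):
    dz = dh * (1.0 - hs[t + 1] ** 2)   # back through tanh
    dWx += np.outer(xs[t], dz)         # accumulate over all time-steps
    dWh += np.outer(hs[t], dz)
    dh = dz @ Wh.T                     # carry error to the previous step

print(dWx.shape, dWh.shape)
```

Note that every hidden state must be stored for the backward pass, which is why full BPTT costs more memory than the truncated variant.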
What kind of detection rate and error does Sandra’s implementation achieve?
Naturally, due to the low source resolution, a much shallower/simpler network would have sufficed. However, due to the up-scaling and the relatively large number of training images, there is no danger of over-fitting.
It achieves a % detection rate (over the 10K testing images) after just 1 epoch (Epoch 0) and % after 30 epochs.
Training (30 epochs) took just X* hours on an i9-7900X (10C/20T) using AVX512/single-precision.
Does Sandra fully infer or train the full image set when benchmarking?
As with all other Sandra benchmarks, the tests are limited to 30 seconds (in order to complete reasonably quickly); within this time, as many images as possible, drawn at random from the data-sets (60K train, 10K test), will be processed.
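The time-boxed sampling described above can be illustrated as follows; `run_benchmark` and its parameters are hypothetical names for illustration, not Sandra’s API:

```python
import random
import time

# Illustrative time-boxed benchmark loop: process images drawn at random
# from the set until the time budget (30 s in Sandra) expires.
def run_benchmark(images, infer, budget_s=30.0):
    deadline = time.monotonic() + budget_s
    processed = 0
    while time.monotonic() < deadline:
        infer(random.choice(images))   # random sample, with replacement
        processed += 1
    return processed                   # rate = processed / budget_s

# Tiny budget and a no-op model just to demonstrate the loop.
count = run_benchmark(list(range(10_000)), lambda img: None, budget_s=0.05)
print(count > 0)  # True
```

Reporting images processed per fixed time window makes scores comparable across machines regardless of how many passes over the set each one completes.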