What is FP16 (“half”)?
FP16 (aka “half”) is the lower-precision 16-bit IEEE floating-point format (1 sign, 5 exponent, 10 mantissa bits) that has recently gained compute support on GPGPUs (e.g. Intel EV9+ Skylake GPU, nVidia Pascal/Turing), with CPUs soon to follow via the related BFloat16 format (which keeps FP32’s 8 exponent bits but only 7 mantissa bits). While originally meant for mobile devices, where it reduces memory and compute requirements, it also allows workstations/servers to handle deep neural-network workloads that have exploded in both size and compute demands.
Not all algorithms can use such low precision, and some may require parts of the computation to remain in normal precision; nevertheless, FP16 can still be used in many instances and thus needs to be implemented and benchmarked.
In addition, we see the introduction of specialised compute engines, like the “Tensor cores”, that specifically support FP16 (and not higher precisions like FP32/FP64).
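To give a feel for what FP16 looks like in code, here is a minimal CUDA sketch (kernel names are ours, purely illustrative): FP16 halves the storage of every element, and on GPUs with native half support (compute capability 5.3+) the arithmetic can stay in half throughout.

```cuda
#include <cuda_fp16.h>

// Convert an FP32 buffer to FP16, halving memory footprint and bandwidth.
__global__ void f32_to_f16(const float* in, __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __float2half(in[i]);  // round to 1 sign, 5 exponent, 10 mantissa bits
}

// a*x + y computed entirely in half precision (requires sm_53 or later).
__global__ void f16_axpy(const __half* x, __half* y, __half a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hadd(__hmul(a, x[i]), y[i]);
}
```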
What are “Tensors”?
A tensor engine (hardware accelerator) is a specialised processing unit that accelerates matrix multiplication in hardware, in this case the “Tensor cores” of the latest nVidia GPGPU architectures (Volta/Turing). While the former was targeted at workstations (Titan V), the latter powers all consumer (series 2000) graphics cards – thus it has entered the mainstream. In addition, the speed restrictions of earlier consumer parts (e.g. consumer Pascal FP16 processing was limited to 1/64 of FP32 speed) have been lifted.
While it can be used by other algorithms, it is primarily intended to accelerate neural networks (so-called “AI”) that are now being used in mainstream local workloads like image/video processing (scaling, de-noising, etc.) and games (anti-aliasing, de-noising when used with ray-tracing, bots/NPCs, procedural world-building, etc.).
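To show how software reaches these units, here is a minimal sketch using CUDA’s WMMA API (the public interface to the Tensor cores on sm_70+ hardware); the single 16x16 tile and the leading dimensions are our assumptions for a self-contained demo:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively computes a 16x16 tile of D = A*B + C
// on the Tensor cores: FP16 inputs, FP32 accumulation.
__global__ void wmma_16x16(const half* A, const half* B, float* D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);           // 16x16 FP16 tile of A
    wmma::load_matrix_sync(b_frag, B, 16);           // 16x16 FP16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // warp-level matrix multiply-accumulate
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}
```

Compiled for sm_70 or later and launched with at least one full warp, the single mma_sync call replaces what would otherwise be 16x16x16 = 4096 scalar multiply-adds.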
In this article we investigate both FP16/half performance vs. standard FP32 and the performance improvement when using the Tensor cores, on the following hardware:
- nVidia Titan RTX / 2080Ti: Turing GPGPU performance in CUDA and OpenCL
- nVidia Titan V: Volta GPGPU performance in CUDA and OpenCL
- nVidia Titan X : Pascal GPGPU performance in CUDA and OpenCL
FP16/half Performance
We are testing GPGPU performance in CUDA, as it supports both FP16/half operations and the Tensor cores; hopefully both OpenCL and DirectX will also be updated to support FP16/half compute and tensors.
Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.
Environment: Windows 10 x64, latest nVidia drivers (Jan 2019). Turbo / Dynamic Overclocking was enabled on all configurations.
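For reference, much of the FP16 compute speed-up comes from packed half2 instructions that process two FP16 values per operation; here is a minimal sketch (kernel name is ours) of the kind of inner loop such compute benchmarks exercise:

```cuda
#include <cuda_fp16.h>

// Packed FP16: each __half2 holds two values, so every fused multiply-add
// below performs two FP16 FMAs, doubling throughput over scalar operations.
__global__ void f16x2_fma(const __half2* x, __half2* y, __half2 a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);  // y = a*x + y, two lanes at once
}
```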
For image processing, the Titan V brings big performance increases, from 50% to 4x faster than the Titan X: a big upgrade. If you are willing to drop to FP16 precision, it is an extra 50% to 2x faster again, while naturally FP16 is not really usable on the X. With potentially 8x better performance, the Titan V powers through image processing tasks.
Final Thoughts / Conclusions
FP16/half support, when unlocked, can greatly benefit many algorithms if the lower precision is acceptable: in general performance improves by about 50%, though in some cases it can reach 200%.
When using the new Tensor cores, performance improves hugely: in GEMM we see a 4x improvement vs. FP32. It thus makes great sense to recast algorithms (like convolution) as matrix multiplication so they can use the Tensor cores, which will greatly accelerate image processing and neural networks. With the new nVidia 2000 series, this kind of performance is available in the mainstream right now and is pretty amazing to see.
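For applications that already call a BLAS, tapping the Tensor cores can be as simple as opting in through cuBLAS; here is a minimal sketch (handle creation and buffer setup omitted, function name ours), valid as of the CUDA 10 era:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Request a Tensor-core GEMM: FP16 inputs A and B, FP32 accumulation into C.
// cuBLAS selects a Tensor-core kernel when the math mode and types permit it.
void tensor_gemm(cublasHandle_t handle,
                 const __half* A, const __half* B, float* C,
                 int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // allow Tensor-core paths
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, A, CUDA_R_16F, m,             // A is m x k, FP16
                         B, CUDA_R_16F, k,             // B is k x n, FP16
                 &beta,  C, CUDA_R_32F, m,             // C is m x n, FP32
                 CUDA_R_32F,                           // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```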
Expect to see similar hardware accelerator units in other GPGPUs and soon in CPUs, where AVX512-VNNI and FP16 processing support (BFloat16) will allow multi-core, wide-SIMD CPUs to remain competitive.