What is “Titan V”?
It is the latest high-end “pro-sumer” card from nVidia, built on the next-generation “Volta” architecture that succeeds the current “Pascal” architecture of the Series 10 cards. Based on the top-end 100 chip (not the lower-end 102 or 104), it boasts full-speed FP64/FP16 performance as well as brand-new “tensor cores” (matrix multipliers) for scientific and deep-learning workloads. It also comes with on-package HBM2 (high-bandwidth) memory rather than the more traditional stand-alone GDDR5/X memory.
For this reason the price is also far higher than the previous Titan X/XP cards, but considering the features and performance are more akin to the “Tesla” series, it may still be worth it depending on workload.
While using the additional FP64/FP16 units is automatic – save for the usual code optimisations – tensor core support requires custom code, so existing libraries and apps need to be updated to make use of them. It is unknown at this time whether consumer cards based on “Volta” will also include them. As they support FP16 precision only, not all workloads will be able to use them – but DL (deep learning) and AI (artificial intelligence) algorithms are generally fine with lower precision, so for such tasks they are ideal.
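To illustrate what a tensor core actually computes – a fused D = A×B + C on small matrix tiles, with FP16 inputs and FP32 accumulation (real code would use the CUDA `wmma` API) – here is a minimal pure-Python reference model; the function names are ours, for illustration only:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def tensor_core_mma(A, B, C):
    """Reference model of one tensor-core op: D = A x B + C on a 4x4 tile.
    Inputs A and B are rounded to FP16; products accumulate in full precision."""
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]  # FP32 accumulator (here: Python float)
            for k in range(n):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D
```

Note how only the multiplication inputs are reduced to FP16 while the sum is kept at higher precision – this is why DL training, which is dominated by such multiply-accumulates, usually tolerates the reduced input precision.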
See these other articles on Titan (and competition) performance:
- AMD Radeon RX 6900 (RDNA2, Navi2) Review & Benchmarks – GPGPU Performance
- nVidia 3090, 3080 RTX: Ampere GPGPU performance in CUDA and OpenCL
- nVidia Titan RTX / 2080Ti: Turing GPGPU performance in CUDA and OpenCL
- AMD Radeon 5700XT: Navi GPGPU Performance in OpenCL
- nVidia Titan X : Pascal GPGPU performance in CUDA and OpenCL
- nVidia Titan V/X: FP16 and Tensor CUDA Performance
Hardware Specifications
We are comparing the top-of-the-range Titan V with the previous-generation Titans and competing architectures, with a view to seeing whether an upgrade is worthwhile.
GPGPU Specifications | nVidia Titan V | nVidia Titan X (P) | nVidia 980 GTX (M2) | Comments
--- | --- | --- | --- | ---
Arch Chipset | Volta GV100 (7.0) | Pascal GP102 (6.1) | Maxwell 2 GM204 (5.2) | The V is the only one using the top-end 100 chip, not the lower-end 102 or 104 versions.
Cores (CU) / Threads (SP) | 80 / 5120 | 28 / 3584 | 16 / 2048 | The V boasts 80 CU units, but each contains only 64 FP32 units (not 128 like lower-end chips), thus equivalent to 40.
FP32 / FP64 / Tensor Cores | 5120 / 2560 / 640 | 3584 / 112 / none | 2048 / 64 / none | Titan V is the only one with tensor cores, and it has a huge number of FP64 cores that Titan X simply cannot match; it also has full-speed FP16 support.
Speed (Min-Turbo) | 1.2GHz (135MHz - 1.455GHz) | 1.531GHz (139MHz - 1.91GHz) | 1.126GHz (135MHz - 1.215GHz) | Slightly lower clocked than the X, but it makes up for it with sheer CU count.
Power (TDP) | 300W | 250W (125-300W) | 180W (120-225W) | TDP increases by 50W, which is not unexpected considering the additional units.
ROP / TMU | 96 / 320 | 96 / 224 | 64 / 128 | Not a “gaming card”, but while the ROP count stays the same the number of TMUs has increased – likely required for compute tasks using textures.
Global Memory | 12GB HBM2 850MHz 3072-bit | 12GB GDDR5X 10Gbps 384-bit | 4GB GDDR5 7Gbps 256-bit | Memory size stays the same at 12GB, but Titan V now uses on-package HBM2 for much higher bandwidth.
Memory Bandwidth (GB/s) | 652 | 512 | 224 | In addition to the modest bandwidth increase, latencies are also meant to have decreased by a good amount.
L2 Cache | 4.5MB | 3MB | 2MB | L2 cache has gone up by about 50% to feed all the cores.
FP64/double ratio | 1/2 | 1/32 | 1/32 | For FP64 workloads the V has a huge advantage, as consumer cards and the previous Titan X have far fewer FP64 units.
FP16/half ratio | 2x | 1/64 | n/a | The V has an even bigger advantage here, with over 128x the throughput for FP16 tasks like DL and AI.
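The headline figures in the table can be sanity-checked with simple arithmetic, assuming one FMA (2 FLOPs) per FP32 unit per clock and double-pumped (DDR) memory – a back-of-the-envelope sketch, not an official formula:

```python
def peak_fp32_gflops(fp32_cores: int, turbo_ghz: float) -> float:
    """Theoretical peak: each FP32 unit issues one FMA (2 FLOPs) per cycle."""
    return fp32_cores * 2 * turbo_ghz

def mem_bandwidth_gbs(clock_mhz: float, transfers_per_clock: int, bus_bits: int) -> float:
    """Theoretical memory bandwidth in GB/s from clock, pump factor and bus width."""
    return clock_mhz * transfers_per_clock * bus_bits / 8 / 1000

titan_v = peak_fp32_gflops(5120, 1.455)   # ~14.9 TFLOPS
titan_x = peak_fp32_gflops(3584, 1.910)   # ~13.7 TFLOPS
gtx_980 = peak_fp32_gflops(2048, 1.215)   # ~5.0 TFLOPS

hbm2 = mem_bandwidth_gbs(850, 2, 3072)    # ~652.8 GB/s, matching the table
```

Despite having far more cores, Titan V's FP32 peak is only ~9% above Titan X's – the FP64 and FP16/tensor units are where the real generational gains lie.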
Processing Performance
We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.
Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.
Memory Performance
We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.
Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.
Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.
HBM2 does seem to increase latencies slightly – by about 10% for random accesses – but for sequential accesses Titan V performs a lot better than the X, with 20-40% lower latencies, likely due to the new architecture. Thus code using coalesced memory accesses will run faster, while code using random access patterns over large data sets will see a small penalty.
Overall, Titan V’s memory performance does not disappoint – HBM2 obviously brings a large bandwidth increase – while latency depends on access pattern: when the prefetchers can engage, latencies are much lower, but for random out-of-page accesses they are a bit higher, though nothing significant. We are also limited by the PCIe 3 bus for transfers, which requires judicious overlap of memory transfers and compute to keep the cores busy.
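To see why the PCIe 3 bus is the real bottleneck – and why overlapping transfers with compute matters so much – compare its theoretical bandwidth with the device memory bandwidth from the table (a rough sketch using the standard PCIe 3.0 figures):

```python
def pcie3_x16_gbs() -> float:
    """PCIe 3.0 x16: 8 GT/s per lane, 128b/130b encoding, 16 lanes, in GB/s."""
    return 8e9 * (128 / 130) * 16 / 8 / 1e9   # ~15.75 GB/s

hbm2_gbs = 652.0                    # Titan V device bandwidth (from the table)
ratio = hbm2_gbs / pcie3_x16_gbs()  # ~41x: on-card memory is ~41x faster
```

With a ~41x gap, any kernel that streams its working set over the bus rather than keeping it resident in the 12GB of HBM2 will be bus-bound, no matter how fast the cores are.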
SiSoftware Official Ranker Scores
Final Thoughts / Conclusions
“Volta” architecture does bring good improvements in FP32 performance which we hope to see soon in consumer (Series 11?) graphics cards – as well as lower-end Titan cards.
But here (on Titan V) we have the top-end chip with full-power FP64 and FP16 units, more akin to Tesla, which simply powers through any and all algorithms you can throw at it. This is really the “Titan” you were looking for, and upgrading from the previous Titan X (Pascal) is a huge upgrade, admittedly for quite a bit more money.
If you have workloads that require double/FP64 precision – where Titan V is 15-16x faster than Titan X – it is great value for money. If your code can make do with FP16 precision, you can gain up to 2x extra performance again – as well as halve storage for large data sets – and again Titan X cannot cut it here, running at 1/64 rate.
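The measured 15-16x FP64 figure is in line with what the spec table predicts – a quick theoretical check using the FP64 core counts and turbo clocks above (assuming one FMA, i.e. 2 FLOPs, per core per cycle):

```python
def peak_gflops(cores: int, turbo_ghz: float) -> float:
    """Theoretical peak assuming one FMA (2 FLOPs) per core per cycle."""
    return cores * 2 * turbo_ghz

titan_v_fp64 = peak_gflops(2560, 1.455)   # ~7450 GFLOPS
titan_x_fp64 = peak_gflops(112, 1.910)    # ~428 GFLOPS
ratio = titan_v_fp64 / titan_x_fp64       # ~17.4x theoretical advantage
```

The theoretical ~17x ceiling versus the measured 15-16x suggests the benchmarks are extracting close to the maximum from the new FP64 units.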
We have not yet shown tensor core performance, which is an additional reason for choosing such a card – if you have code that can make use of them, you can gain an extra ~16x performance, which really puts Titan V head and shoulders above the Titan X.
All in all, Titan V is a compelling upgrade if you need more power than Titan X – especially if you are using (or thinking of using) multiple cards, where there is simply no contest: one Titan V can replace four or more Titan X cards on FP64 or FP16 workloads, and that is before you make any optimisations. Obviously you are still “stuck” with 12GB of memory and the PCIe bus for transfers, but with judicious optimisations this should not impact performance significantly.