nVidia Titan X: Pascal GPGPU Performance in CUDA and OpenCL

What is “Titan X (Pascal)”?

It is the current high-end “pro-sumer” card from nVidia, built on the current-generation “Pascal” architecture that also powers the Series 10 cards. It is based on the second-from-the-top GP102 chip (not the top-end GP100), so it lacks the full-speed FP64/FP16 performance that is generally reserved for the “Quadro/Tesla” professional range of cards. It does however come with more memory to fit larger datasets and is engineered for 24/7 operation.

Pricing has increased a bit from previous generation X/XP but that is a general trend today from all manufacturers.

Hardware Specifications

We are comparing the top-of-the-range Titan X with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

| GPGPU Specifications | nVidia Titan X (P) | nVidia 980 GTX (M2) | AMD Vega 56 | AMD Fury | Comments |
|---|---|---|---|---|---|
| Arch / Chipset | Pascal GP102 (6.1) | Maxwell 2 GM204 (5.2) | Vega 10 | Fiji | The X uses the current Pascal architecture that also powers the current Series 10 consumer cards. |
| Cores (CU) / Threads (SP) | 28 / 3584 | 16 / 2048 | 56 / 3584 | 64 / 4096 | We’ve got 28 CU/SMX here, down from 32 on GP100/Tesla, but that should still be sufficient to power through tasks. |
| FP32 / FP64 / Tensor Cores | 3584 / 112 / no | 2048 / 64 / no | 3584 / 448 / no | 4096 / 512 / no | Only 112 FP64 units, a lot fewer than the competition from AMD; this is a card geared for FP32 workloads. |
| Speed (Min-Turbo) | 1.531GHz (139MHz-1.91GHz) | 1.126GHz (135MHz-1.215GHz) | 1.64GHz | 1GHz | Higher clocked than the previous generation and comparable with the competition. |
| Power (TDP) | 250W (125-300) | 180W (120-225) | 200W | 150W | TDP has also increased to 250W, but again that is in line with top-end cards that are pushing over 200W. |
| ROP / TMU | 96 / 224 | 64 / 128 | 64 / 224 | 64 / 256 | As it may also be used as a top-end graphics card, it has a good number of ROPs (50% more than the competition) and a similar number of TMUs. |
| Global Memory | 12GB GDDR5X 10Gbps 384-bit | 4GB GDDR5 7Gbps 256-bit | 8GB HBM2 2Gbps 2048-bit | 4GB HBM 1Gbps 4096-bit | Titan X comes with a huge 12GB of current GDDR5X memory while the competition has switched to HBM2 for top-end cards. |
| Memory Bandwidth (GB/s) | 512 | 224 | 483 | 512 | Due to high-speed GDDR5X, the X has plenty of memory bandwidth, even more than the HBM2 competition. |
| L2 Cache | 3MB | 2MB | | | L2 cache has increased by 50% over the previous arch to keep all cores fed. |
| FP64/double ratio | 1/32 | 1/32 | 1/8 | 1/8 | The X is not really meant for FP64 workloads as it uses the same 1:32 ratio as normal consumer cards. |
| FP16/half ratio | 1/64 | n/a | 1/1 | 1/1 | With a 1:64 ratio, FP16 is not really usable on Titan X and can only really be used for compatibility testing. |
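
As a rough cross-check of the specifications above, theoretical peak throughput can be estimated from the shader count and clock. The sketch below assumes each FP32 unit retires one fused multiply-add (2 FLOPs) per clock at the listed base clock, and that FP64 throughput is the FP32 figure scaled by the listed ratio. The function names are illustrative only and are not part of any benchmark.

```python
# Rough peak-throughput estimate from the specification table.
# Assumption: one fused multiply-add (2 FLOPs) per FP32 unit per clock,
# at the listed base clock; boost clocks would give higher figures.

def peak_fp32_gflops(shaders: int, clock_ghz: float) -> float:
    """Theoretical FP32 peak in GFLOPS."""
    return 2 * shaders * clock_ghz

def peak_fp64_gflops(shaders: int, clock_ghz: float, ratio: float) -> float:
    """Theoretical FP64 peak: the FP32 peak scaled by the FP64:FP32 ratio."""
    return peak_fp32_gflops(shaders, clock_ghz) * ratio

cards = {
    # name: (shaders, base clock in GHz, FP64 ratio) as listed in the table
    "Titan X (P)": (3584, 1.531, 1 / 32),
    "GTX 980": (2048, 1.126, 1 / 32),
}

for name, (shaders, clock, ratio) in cards.items():
    fp32 = peak_fp32_gflops(shaders, clock)
    fp64 = peak_fp64_gflops(shaders, clock, ratio)
    print(f"{name}: FP32 ~{fp32:,.0f} GFLOPS, FP64 ~{fp64:,.0f} GFLOPS")
```

Under these assumptions the Titan X lands at roughly 11 TFLOPS for FP32 and about 343 GFLOPS for FP64, which is in the same range as the DGEMM score reported in the processing results.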

Processing Performance

We are testing both CUDA native and OpenCL performance using the latest SDK / libraries / drivers from both nVidia and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.


| Processing Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) | 17,870 [+37%] / 16,000 | 7,000 / 6,100 | 13,000 | 8,720 | Titan X makes a good start, beating the Vega by almost 40%. |
| GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) | 245 [-98%] / n/a | n/a | 13,130 | 7,890 | FP16 is so slow that it is unusable; it is just for testing. |
| GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) | 661 [-47%] / 672 | 259 / 265 | 1,250 | 901 | FP64 is also quite slow, though a lot faster than on the GTX 980. |
| GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) | 25 [-67%] / 24 | 10.8 / 10.7 | 77.3 | 55 | Emulated FP128 precision depends entirely on FP64 performance and thus is… slow. |

With “normal” FP32 workloads Titan X is quite fast: ~40% faster than the Vega and about 2.5x faster than the older GTX 980, quite an improvement. FP16 workloads are best avoided (you are better off with FP32), and FP64 runs at about 1/2 the performance of a Vega and is slower than even a Fiji. As long as all workloads are FP32 there should be no problems.
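
The bracketed percentages in these results appear to be the Titan X CUDA score relative to the Vega 56 score; that is an inference from the figures, not something the table states. A minimal sketch of the calculation, using a hypothetical helper name:

```python
def relative_delta(score: float, baseline: float) -> float:
    """Percentage by which `score` differs from `baseline` (higher is better)."""
    return (score / baseline - 1.0) * 100.0

# Mandel scores (Mpix/s) taken from the table: Titan X CUDA vs Vega 56.
rows = {
    "FP32": (17_870, 13_000),
    "FP16": (245, 13_130),
    "FP64": (661, 1_250),
}

for name, (titan_x, vega) in rows.items():
    print(f"Mandel {name}: {relative_delta(titan_x, vega):+.1f}%")
```

Rounded to whole numbers, these reproduce the bracketed figures for the three Mandel rows.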

| Processing Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Crypto Benchmark Crypto AES-256 (GB/s) | 40 [-38%] / 38 | 16 / 16 | 65 | 46 | Titan X is a lot faster than the previous gen but still ~40% slower than a Vega. |
| GPGPU Crypto Benchmark Crypto AES-128 (GB/s) | 52 [-38%] / 51 | 23 / 21 | 84 | 60 | Nothing changes here, the X is still about 40% slower than a Vega. |
| GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) | 134 [+4%] / 142 | 58 / 59 | 129 | 82 | In this integer workload, somehow Titan X manages to beat the Vega by 4%! |
| GPGPU Crypto Benchmark Crypto SHA1 (GB/s) | 107 [-34%] / 114 | 50 / 54 | 163 | 124 | SHA1 is mysteriously slower, leaving the X ~35% slower than a Vega. |
| GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) | 72 [+2.3x] / 42 | 32 / 24 | 31 | 13.8 | With this 64-bit integer workload, Titan X is a massive 2.3x faster than a Vega. |

Historically, nVidia cards have not been tuned for integer workloads, but Titan X still manages to beat a Vega (the “gold standard” for crypto-currency hashing) on both SHA2-256 and especially on 64-bit integer SHA2-512! Perhaps for the first time an nVidia card is competitive on integer workloads and even much faster on 64-bit integer workloads.
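
To illustrate why SHA2-512 counts as a 64-bit integer workload, the sketch below implements the Σ0 mixing function of SHA2-256 and SHA2-512 as defined in FIPS 180-4. The two have the same structure, but one rotates 32-bit words and the other 64-bit words. This is a host-side illustration only, not the benchmark's kernel.

```python
def rotr(x: int, n: int, bits: int) -> int:
    """Rotate the `bits`-wide unsigned integer x right by n positions."""
    mask = (1 << bits) - 1
    x &= mask
    return ((x >> n) | (x << (bits - n))) & mask

def big_sigma0_256(x: int) -> int:
    """Sigma0 of SHA2-256: three rotations of a 32-bit word."""
    return rotr(x, 2, 32) ^ rotr(x, 13, 32) ^ rotr(x, 22, 32)

def big_sigma0_512(x: int) -> int:
    """Sigma0 of SHA2-512: three rotations of a 64-bit word."""
    return rotr(x, 28, 64) ^ rotr(x, 34, 64) ^ rotr(x, 39, 64)

print(hex(big_sigma0_256(1)))  # 0x40080400
print(hex(big_sigma0_512(1)))  # 0x1042000000
```

GPUs typically build 64-bit integer operations out of several 32-bit instructions, so how efficiently a given architecture does this can change the SHA2-512 ranking relative to SHA2-256.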

| Processing Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) | 11,480 [+28%] / 11,470 | 5,280 / 5,280 | 9,000 | 11,220 | In this FP32 financial workload Titan X is almost 30% faster than a Vega. |
| GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) | 1,370 [-36%] / 1,300 | 547 / 511 | 1,850 | 1,290 | Switching to FP64 code, the X remains competitive and is about 35% slower. |
| GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) | 2,240 [-8%] / 2,240 | 1,200 / 1,140 | 2,440 | 1,760 | Binomial uses thread shared data and thus stresses the SMX’s memory system; here Vega surprisingly does better by 8%. |
| GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) | 129 [-20%] / 133 | 51 / 51 | 161 | 115 | With FP64 code the X is only 20% slower than a Vega. |
| GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) | 5,350 [+47%] / 5,150 | 2,140 / 2,000 | 3,630 | 2,470 | Monte-Carlo also uses thread shared data but read-only, thus reducing modify pressure; here Titan X is almost 50% faster! |
| GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) | 294 [-34%] / 267 | 118 / 106 | 385 | 332 | Switching to FP64 the X is again 34% slower than a Vega. |

For financial FP32 workloads, the Titan X generally beats the Vega by a good amount or comes close to it; with FP64 precision it is 20-36% slower, which is to be expected. As long as you have FP32 workloads this should not be a problem.
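
For context on what an “options per second” figure counts, each Black-Scholes evaluation is a short closed-form calculation dominated by `log`, `exp`, `sqrt` and the normal CDF. The sketch below is the textbook European call formula, not necessarily the exact kernel the benchmark runs, and the input values are made up for illustration.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot: float, strike: float, rate: float,
                       vol: float, years: float) -> float:
    """Price of a European call option under the Black-Scholes model."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Illustrative inputs only (not benchmark data).
price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05, vol=0.2, years=1.0)
print(f"call price: {price:.4f}")
```

Because every option is priced independently with no shared state, this kind of workload maps naturally onto one GPU thread per option, unlike the binomial and Monte-Carlo tests that lean on thread shared data.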

| Processing Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 | 6,280 [+19%] / 6,600 | 2,550 / 2,550 | 5,260 | 3,630 | Using 32-bit precision Titan X beats the Vega by about 20%. |
| GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 | 335 [-40%] / 332 | 130 / 129 | 555 | 381 | With FP64 precision, unsurprisingly the X is 40% slower. |
| GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 | 242 [-20%] / 227 | 148 / 136 | 306 | 348 | FFT does better with HBM memory and here Titan X is 20% slower than a Vega. |
| GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 | 207 / 191 | 89 / 82 | 139 | 116 | Surprisingly the X does very well here and manages to beat all cards by almost 50%! |
| GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 | 5,600 [+20%] / 4,870 | 2,100 / 2,000 | 4,670 | 3,080 | Titan X does well in this algorithm, beating the Vega by 20%. |
| GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 | 275 [-20%] / 275 | 82 / 81 | 343 | 303 | With FP64 precision, the X is again 20% slower. |

The scientific scores are similar to the financial ones but the gain/loss is about 20% rather than 40%: in FP32 workloads Titan X is about 20% faster while in FP64 it is about 20% slower than a Vega, a closer result than expected.

| Processing Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) | 14,550 [-60%] / 10,880 | 7,310 / 5,530 | 36,000 | 28,000 | In this 3×3 convolution algorithm, somehow Titan X is over 50% slower than a Vega and slower than even a Fury. |
| GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) | 3,840 [-11%] / 2,750 | 1,870 / 1,380 | 4,300 | 3,150 | Same algorithm, but more shared data reduces the gap to about 10%; still a loss. |
| GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) | 3,920 [-10%] / 2,930 | 1,910 / 1,440 | 4,350 | 3,200 | With even more data the gap remains at 10%. |
| GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) | 3,740 [-11%] / 2,760 | 1,860 / 1,370 | 4,210 | 3,130 | Still convolution but with 2 filters; Titan X is 10% slower again. |
| GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) | 35.7 / 55 [+52%] | 20.6 / 25.4 | 36.3 | 30.8 | A different algorithm allows the X to finally beat the Vega by 50%. |
| GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) | 15.6 [-60%] / 15.3 | 12.2 / 11.4 | 38.7 | 14.3 | Without major processing, this filter does not like the X much: it is 60% slower than the Vega. |
| GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) | 16,480 [-57%] / 14,000 | 7,600 / 6,640 | 38,730 | 28,500 | This algorithm is 64-bit integer heavy but again Titan X is under 1/2 the speed of the Vega. |
| GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) | 290 / 6,350 [+13%] | 210 / 3,080 | 5,600 | 4,410 | One of the most complex and largest filters; Titan X finally beats the Vega by over 10%. |

For image processing using FP32 precision, Titan X surprisingly does not do as well as expected, in either CUDA or OpenCL, with the Vega beating it by a good margin on most filters. Perhaps more optimisations are needed on nVidia hardware. We did not test FP16 performance at all, as that would have been far slower.
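
Most of the filters above are convolutions: each output pixel is a weighted sum of its neighbourhood, so a 3×3 filter reads 9 input pixels per output, a 5×5 filter 25 and a 7×7 filter 49. Neighbouring outputs reuse most of the same inputs, which is why shared data matters more as the window grows. A minimal CPU-side sketch of a 3×3 blur follows; equal (box) weights and clamped borders are assumptions, as the benchmark's actual kernel is not given here.

```python
def box_blur_3x3(image):
    """Apply a 3x3 box blur to a 2D list of floats, clamping at the borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp coordinates so edge pixels reuse their nearest neighbour.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            out[y][x] = total / 9.0
    return out

image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
blurred = box_blur_3x3(image)
print(blurred[1][1])  # 1.0: the single bright pixel is spread over 9 samples
```

On a GPU the two outer loops become the thread grid, and the inner window reads are what a tiled implementation would stage through shared memory.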

Memory Performance

We are testing both CUDA native and OpenCL performance using the latest SDK / libraries / drivers from nVidia and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

HBM2 does seem to increase latencies slightly, by about 10%, but for sequential accesses Titan V performs a lot better than the X with 20-40% lower latencies, likely due to the new architecture. Thus code using coalesced memory accesses will perform faster but for code using random access patterns over large data sets

| Memory Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) | 356 [+13%] / 354 | 145 / 144 | 316 | 387 | Titan X brings more bandwidth than a Vega (+13%) but the old Fury takes the crown. |
| GPGPU Memory Bandwidth Upload Bandwidth (GB/s) | 11.4 / 9 | 12.1 / 12 | 12.1 | 11 | All cards use PCIe3 x16 and thus there is no appreciable delta. |
| GPGPU Memory Bandwidth Download Bandwidth (GB/s) | 12.2 / 8.9 | 11.5 / 12.2 | 10 | 9.8 | Again no significant difference, but we were not expecting any. |

Titan X uses current GDDR5X but with a high data rate, allowing it to bring more bandwidth than some HBM2 designs, a pretty impressive feat. Naturally high-end cards using HBM2 should have even higher bandwidth.
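
The trade-off between a narrow, fast bus (GDDR5X) and a wide, slow one (HBM) can be seen from the usual peak-bandwidth formula: per-pin data rate times bus width, divided by 8 to convert bits to bytes. The sketch below applies it to the data rates and bus widths listed in the specifications. The helper name is hypothetical, and the results are idealised peaks that need not match rated or measured figures.

```python
def peak_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s from per-pin data rate and bus width."""
    return data_rate_gbps * bus_width_bits / 8

# (per-pin data rate in Gbps, bus width in bits) as listed in the specifications
memory = {
    "Titan X (GDDR5X)": (10, 384),
    "GTX 980 (GDDR5)": (7, 256),
    "Fury (HBM)": (1, 4096),
}

for name, (rate, width) in memory.items():
    print(f"{name}: {peak_bandwidth_gbs(rate, width):.0f} GB/s")
```

The measured internal bandwidth scores (356, 145 and 387 GB/s respectively) sit below these theoretical peaks, as expected for a real workload.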

| Memory Benchmarks | nVidia Titan X CUDA/OpenCL | nVidia GTX 980 CUDA/OpenCL | AMD Vega 56 OpenCL | AMD Fury OpenCL | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) | 201 / 230 | 230 | 273 | 343 | Compared to the previous generation, Titan X has better latency due to its higher data rate. |
| GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) | 286 / 311 | 306 | 399 | 525 | Similarly, even full random accesses are faster. |
| GPGPU Memory Latency Global (Sequential Access) Latency (ns) | 89 / 121 | 97 | 129 | 216 | Sequential access has similarly low latencies but nothing special. |
| GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) | 117 / 174 | 126 | 269 | 353 | Constant memory latencies have also dropped. |
| GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) | 18 / 53 | 21 | 49 | 112 | Even shared memory latencies have dropped, likely due to higher core clocks. |
| GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) | 195 / 196 | 208 | 121 | | Texture access latencies have come down as well. |
| GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) | 282 / 278 | 308 | | | And even full range latencies have decreased. |
| GPGPU Memory Latency Texture (Sequential Access) Latency (ns) | 87 / 123 | 102 | | | With sequential access there is no appreciable delta in latencies. |

We’re only comparing CUDA latencies here (as OpenCL is quite variable). Compared to the previous generation (GTX 980) all latencies are down, either due to the higher memory data rate or core clock increases, but nothing spectacular. Still, it is good progress and everything helps.

We see L1 cache effects until 16kB (same as the previous arch) and between 2-4MB, tallying with the 3MB L2 cache. While fast, perhaps the caches could be a bit bigger.

As with global memory, we see the same L1D and L2 cache effects with similar latencies. All in all good performance, but we could do with bigger caches.
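
Latency tests of this kind are commonly implemented as pointer chasing: a buffer holds the index of the next element to read, so every load depends on the previous one and cannot be overlapped. Whether the benchmark used here works exactly this way is an assumption. The sketch below only builds the two access patterns (random and sequential) and walks them on the CPU; all names are illustrative.

```python
import random

def random_cycle(n: int, seed: int = 0) -> list:
    """Permutation of range(n) forming one single cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def sequential_cycle(n: int) -> list:
    """Each element points at its neighbour, wrapping around at the end."""
    return [(i + 1) % n for i in range(n)]

def chase(nxt: list, steps: int, start: int = 0) -> int:
    """Follow the chain; every load depends on the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx

n = 1024
print(chase(random_cycle(n), n))      # 0: back at the start after n steps
print(chase(sequential_cycle(n), n))  # 0
```

Timing the walk and dividing by the number of steps gives a per-access latency. Growing `n` past each cache size is what produces the steps seen around 16kB and 2-4MB.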

Titan X’s memory performance is what you’d expect from higher clocked GDDR5X memory – it is competitive even with the latest HBM2 powered competition – both bandwidth and latency wise. There are no major surprises here and everything works nicely.

Final Thoughts / Conclusions

Titan X, based on the current “Pascal” architecture, performs very well in FP32 workloads: it is much faster than the previous generation for a modest price increase and is competitive with AMD’s Vega offerings. But it is likely to be replaced soon, as the next-generation “Volta” architecture is already out at the high end (Titan V) and is likely to filter down the stack to both consumer (Series 11?) cards and “pro-sumer” Titan cards cheaper than the Titan V.

For FP64 workloads it is perhaps best to choose an older Quadro/Tesla card with more FP64 units, as FP64 performance here is naturally much lower. FP16 performance is also restricted and pretty much unusable, though it is good enough for compatibility testing should you hope to upgrade to a full-speed FP16 card in the future. For both these workloads the high-end Titan V is the card you probably want, but at a much higher price.
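
The compatibility testing mentioned above mostly comes down to checking that results survive the reduced precision of FP16 before committing to faster hardware. A minimal host-side sketch using Python's `struct` module, whose `e` format is IEEE 754 half precision, is shown below. The sample values are arbitrary and the helper names are illustrative.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round x to IEEE 754 single precision and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

for value in (0.1, 2049.0, 1000.5):
    print(f"{value}: fp32={to_fp32(value)!r} fp16={to_fp16(value)!r}")
```

Half precision keeps only 11 significant bits, so integers above 2048 are no longer all representable (2049 rounds to 2048), which is the kind of loss an FP16 port has to tolerate.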

Still, for the money, Titan X has its place: the most common FP32 workloads (financial, scientific, high precision image processing, etc.) that do not require FP64 or FP16 optimisations perform very well, and for those this card is all you need.
