What is “Titan RTX / 2080Ti”?
It is the latest high-end “pro-sumer” card from nVidia, built on the next-generation “Turing” architecture – the update to the current “Volta” architecture that has only had a limited release in Titan/Quadro cards. Turing powers the new Series 20 top-end (with RTX) and Series 16 mainstream (without RTX) cards that replace the old Series 10 “Pascal” cards.
As “Volta” is intended for AI/scientific/financial data-centres, it features high-end HBM2 memory; since “Turing” is meant for gaming, rendering, etc., it has “normal” GDDR6 memory. Similarly, “Turing” adds the new RTX (ray-tracing) cores for high-fidelity visualisation and image generation – in addition to the Tensor cores that “Volta” introduced.
While “Volta” has a 1/2 FP64 rate (vs. FP32), “Turing” has the usual 1/32 FP64 rate of consumer cards: for high-precision computation you still need “Volta”. However, as “Turing” maintains the 2x FP16 rate (vs. FP32), it can run low-precision AI (neural network) workloads at full speed. Old “Pascal” had a 1/64x FP16 rate, making it pretty much unusable for FP16 in most cases.
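To illustrate what the 2x FP16 rate means in practice, here is a minimal CUDA sketch (not from any benchmark suite): on Volta/Turing the full FP16 throughput is reached by operating on packed __half2 pairs, e.g. with the __hfma2 intrinsic, rather than on scalar half values.

```cuda
#include <cuda_fp16.h>

// Minimal sketch: packed FP16 AXPY. The 2x FP16 rate on Volta/Turing comes from
// processing two half values per instruction via __half2; scalar __half math
// gives up the packed speed-up.
__global__ void axpy_half2(const __half2* x, __half2* y, __half2 a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);   // fused multiply-add on two halves at once
}
```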
“Turing” does not have high-end on-package HBM2 memory but high-speed GDDR6 instead, which offers decent bandwidth but is not as plentiful – with 1GB missing (11GB instead of 12GB).
With the soon-to-be-unveiled “Ampere” (Series 30) architecture on the horizon, we look at whether you can get “cheap” Titan V performance out of a Turing 2080Ti consumer card.
See these other articles on Titan performance:
Hardware Specifications
We are comparing the Turing-based Titan RTX / 2080Ti against the top-of-the-range Titan V and the previous-generation Titan X, with a view to a high-performance upgrade at a (relatively) reasonable cost.
| GPGPU Specifications | nVidia Titan RTX / 2080Ti (Turing) | nVidia Titan V (Volta) | nVidia Titan X (Pascal) | Comments |
|---|---|---|---|---|
| Arch / Chipset | Turing TU102 (7.5) | Volta GV100 (7.0) | Pascal GP102 (6.1) | The V is the only one using the top-end 100 chip rather than the lower-end 102/104 versions. |
| Cores (CU) / Threads (SP) | 68 / 4352 | 80 / 5120 | 28 / 3584 | Not as many cores as Volta, but still a decent count. |
| ROPs / TMUs | 88 / 272 | 96 / 320 | 96 / 224 | Cannot match Volta, but more ROPs per CU for gaming. |
| FP32 / FP64 / Tensor Cores | 4352 / 136 / 544 | 5120 / 2560 / 640 | 3584 / 112 / none | Keeps the Tensor cores important for AI tasks (neural networks, etc.). |
| Speed (Min–Turbo) | 1.35GHz (136MHz–1.635GHz) | 1.2GHz (135MHz–1.455GHz) | 1.531GHz (135MHz–1.910GHz) | Clocks have improved over Volta, likely due to the lower SM count. |
| Power (TDP) | 260W | 300W | 250W (125–300W) | TDP is lower due to the lower CU count. |
| Global Memory | 11GB GDDR6 14Gbps 352-bit | 12GB HBM2 850MHz 3072-bit | 11GB GDDR5X 10Gbps 384-bit | As a pro-sumer card it has 1GB less than Volta and the same as Pascal. |
| Memory Bandwidth (GB/s) | 616 | 652 | 512 | Despite no HBM2, bandwidth almost matches due to the high speed of GDDR6. |
| L1 Cache | 2x (32kB + 64kB) | 2x 24kB / 96kB shared | | L1/shared capacity is much the same, but the ratios have changed. |
| L2 Cache | 5.5MB (6MB?) | 4.5MB (3MB?) | 3MB | Reported L2 cache has increased by about 25%. |
| FP64/double ratio | 1/32x | 1/2x | 1/32x | Low ratio like all consumer cards; Volta dominates here. |
| FP16/half ratio | 2x | 2x | 1/64x | Same rate as Volta, 2x over FP32. |
nVidia RTX 2080 TI (Turing)
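As a quick sanity check of the specifications above, the CUDA runtime reports most of them directly. A minimal sketch (device 0; the peak-bandwidth figure assumes the usual double-data-rate factor of 2):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query SM count, clocks, memory bus and L2 size for the card in slot 0.
int main()
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    // Theoretical peak bandwidth: memory clock (kHz) * bus width (bytes) * 2 (DDR).
    double peakGBs = 2.0 * p.memoryClockRate * 1e3 * (p.memoryBusWidth / 8) / 1e9;

    printf("%s (compute %d.%d)\n", p.name, p.major, p.minor);
    printf("SMs: %d, boost clock: %.0f MHz\n", p.multiProcessorCount, p.clockRate / 1e3);
    printf("Memory: %.0f GB, %d-bit bus, ~%.0f GB/s peak\n",
           p.totalGlobalMem / 1e9, p.memoryBusWidth, peakGBs);
    printf("L2 cache: %.1f MB\n", p.l2CacheSize / 1e6);
    return 0;
}
```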
Processing Performance
We are testing both native CUDA and OpenCL performance using the latest SDKs / libraries / drivers. Results are shown as CUDA / OpenCL, with the bracketed percentage being the difference versus the Titan V.
Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2 (latest nVidia provides). Turbo / Boost was enabled on all configurations.
| Processing Benchmarks | nVidia Titan RTX / 2080Ti (Turing) | nVidia Titan V (Volta) | nVidia Titan X (Pascal) | Comments |
|---|---|---|---|---|
| Mandel FP16/Half (Mpix/s) | 41,080 / n/a [=] | 40,920 / n/a | 336 / n/a | Right off the bat, Turing matches Volta and is miles faster than old Pascal. |
| Mandel FP32/Single (Mpix/s) | 25,000 / 23,360 [+11%] | 22,530 / 21,320 | 18,000 / 16,000 | With standard FP32, Turing even manages to be 11% faster despite having fewer CUs. |
| Mandel FP64/Double (Mpix/s) | 812 / 772 [-93%] | 11,300 / 10,500 | 641 / 642 | For FP64 you don’t want Turing, you want Volta. At any cost. |
| Mandel FP128/Quad (Mpix/s) | 30.4 / 29.1 [-94%] | 472 / 468 | 24.4 / 27 | With emulated FP128 precision, Turing is again demolished. |

Turing manages to improve over Volta in FP16/FP32 despite having fewer CUs – most likely due to higher clocks and architectural optimisations. However, if you do need FP64 precision then Volta reigns supreme – the 1/32 rate of Turing and Pascal just does not cut it.
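As a rough illustration of why the FP64 numbers collapse, here is a minimal Mandelbrot kernel – a sketch, not the benchmark’s actual code. The arithmetic is identical for float and double, so throughput simply tracks the card’s FP32:FP64 ratio (1/32 on Turing/Pascal, 1/2 on Volta).

```cuda
// Minimal Mandelbrot sketch: one thread per pixel, iteration count written out.
template <typename T>
__global__ void mandel(unsigned* out, int w, int h, T x0, T y0, T step, int maxIter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;

    T cr = x0 + px * step, ci = y0 + py * step;
    T zr = 0, zi = 0;
    int it = 0;
    while (it < maxIter && zr * zr + zi * zi < T(4)) {
        T t = zr * zr - zi * zi + cr;   // z = z^2 + c
        zi = T(2) * zr * zi + ci;
        zr = t;
        ++it;
    }
    out[py * w + px] = it;
}

// Instantiate for the precision under test, e.g.:
//   mandel<float> <<<grid, block>>>(d_out, w, h, -2.0f, -1.25f, 2.5f / w, 256);
//   mandel<double><<<grid, block>>>(d_out, w, h, -2.0,  -1.25,  2.5  / w, 256);
```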
| Benchmark | Turing | Volta | Pascal | Comments |
|---|---|---|---|---|
| Crypto AES-256 (GB/s) | 48 / 52 [-33%] | 72 / 86 | 42 / 41 | Streaming workloads love Volta’s HBM2 memory; Turing is 33% slower. |
| Crypto AES-128 (GB/s) | 64 / 70 [-30%] | 92 / 115 | 57 / 54 | Not a lot changes here; Turing is 30% slower. |
| Crypto SHA2-512 (GB/s) | 192 / 182 [+7%] | 179 / 181 | 72 / 83 | With a 64-bit integer workload, Turing manages a 7% win despite “slower” memory. |
| Crypto SHA2-256 (GB/s) | 170 / 125 [-33%] | 253 / 188 | 95 / 60 | As with AES, hashing loves HBM2, so Turing is 33% slower than Volta. |
| Crypto SHA1 (GB/s) | 161 / 125 [+56%] | 103 / 113 | 69 / 74 | While Turing wins here, it is likely down to a compiler optimisation. |

It seems that Turing’s GDDR6 memory cannot keep up with Volta’s HBM2 despite the similar headline bandwidth: streaming algorithms are around 30% slower on Turing. The only win is the 64-bit integer workload, which is 7% faster on Turing, likely due to integer-unit optimisations.
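The SHA2-512 result is worth a note: its round functions are chains of 64-bit rotates and xors, so it leans on the integer units rather than memory alone. A minimal sketch of the relevant operations (the standard SHA-512 Σ functions, not the benchmark’s code):

```cuda
#include <cstdint>

// 64-bit rotate-right, the building block of the SHA-512 round functions.
__device__ __forceinline__ uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

__device__ __forceinline__ uint64_t big_sigma0(uint64_t x)   // Sigma-0 in SHA-512
{
    return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39);
}

__device__ __forceinline__ uint64_t big_sigma1(uint64_t x)   // Sigma-1 in SHA-512
{
    return rotr64(x, 14) ^ rotr64(x, 18) ^ rotr64(x, 41);
}

// Toy kernel exercising the 64-bit integer path (not a real hash).
__global__ void sigma_bench(const uint64_t* in, uint64_t* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = big_sigma0(in[i]) ^ big_sigma1(in[i]);
}
```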
| Benchmark | Turing | Volta | Pascal | Comments |
|---|---|---|---|---|
| Black-Scholes float/FP32 (MOPT/s) | 17,230 / 17,000 [-7%] | 18,480 / 18,860 | 10,710 / 10,560 | Turing is just 7% slower than Volta. |
| Black-Scholes double/FP64 (MOPT/s) | 1,530 / 1,370 [-82%] | 8,660 / 8,500 | 1,400 / 1,340 | With FP64, Turing drops to less than 1/5 of Volta’s rate. |
| Binomial float/FP32 (kOPT/s) | 4,280 / 4,250 [+4%] | 4,130 / 4,110 | 2,220 / 2,230 | Binomial uses thread-shared data and thus stresses the SM’s memory system; Turing is 4% faster. |
| Binomial double/FP64 (kOPT/s) | 164 / 163 [-91%] | 1,920 / 2,000 | 131 / 134 | With FP64 code, Turing runs at roughly 1/10 of Volta’s rate. |
| Monte-Carlo float/FP32 (kOPT/s) | 11,440 / 11,740 [+1%] | 11,340 / 12,900 | 8,100 / 6,000 | Monte-Carlo also uses thread-shared data, but read-only, reducing write pressure; Turing is just 1% faster. |
| Monte-Carlo double/FP64 (kOPT/s) | 327 / 263 [-92%] | 4,330 / 3,590 | 304 / 274 | Switching to FP64, Turing is again an order of magnitude slower. |

For financial workloads, as long as you only need FP32 (or FP16), Turing can match and slightly outperform Volta; considering the cost, that is no mean feat. However, if you do need FP64 precision then, as we saw before, there is no contest: Volta is around 10x (ten times) faster.
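For reference, this is roughly what such a financial kernel looks like – a minimal single-precision Black-Scholes call pricer (a sketch, not the benchmark’s code; the array names are placeholders, and normcdff is the CUDA math library’s cumulative normal distribution). Changing float to double is all it takes to fall onto the card’s FP64 rate.

```cuda
// One European call option priced per thread, pure FP32 arithmetic.
__global__ void black_scholes_call(const float* S, const float* K, const float* T,
                                   float r, float sigma, float* price, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i]) / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;

    // C = S*N(d1) - K*exp(-rT)*N(d2)
    price[i] = S[i] * normcdff(d1) - K[i] * expf(-r * T[i]) * normcdff(d2);
}
```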
| Benchmark | Turing | Volta | Pascal | Comments |
|---|---|---|---|---|
| HGEMM (GFLOPS) half/FP16 | 34,080 [-16%] | 40,790 | | Using the new Tensor cores, Turing is just 16% slower. |
| SGEMM (GFLOPS) float/FP32 | 7,400 / 7,330 [-33%] | 11,000 / 10,870 | 6,280 / 6,600 | Perhaps surprisingly, Turing is 33% slower than Volta here. |
| DGEMM (GFLOPS) double/FP64 | 502 / 498 [-89%] | 4,470 / 4,550 | 335 / 332 | With FP64 precision, Turing runs at roughly 1/10 of Volta’s rate. |
| HFFT (GFLOPS) half/FP16 | 1,000 [+2%] | 979 | | FFT allows Turing to match Volta in performance. |
| SFFT (GFLOPS) float/FP32 | 512 / 573 [-5%] | 540 / 599 | 242 / 227 | With FP32, Turing is just 5% slower. |
| DFFT (GFLOPS) double/FP64 | 302 / 302 [+1%] | 298 / 375 | 207 / 191 | Completely memory bound, Turing matches Volta here. |
| HNBODY (GFLOPS) half/FP16 | 9,000 [-2%] | 9,160 | | N-Body simulation with FP16 is just 2% slower. |
| SNBODY (GFLOPS) float/FP32 | 9,330 / 8,120 [+27%] | 7,320 / 6,620 | 5,600 / 4,870 | N-Body simulation allows Turing to pull 27% ahead. |
| DNBODY (GFLOPS) double/FP64 | 222 / 295 [-94%] | 3,910 / 5,130 | 275 / 275 | With FP64 precision, Turing again falls an order of magnitude behind Volta. |

The scientific scores are a bit more mixed, but again Turing can match or slightly exceed Volta at FP32/FP16 precision – as long as we are not memory limited; there Volta is still around 30% faster. With FP64 it is the same story: Turing is roughly an order of magnitude slower.
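The HGEMM result relies on the Tensor cores, which are normally reached through libraries rather than hand-written kernels. A hedged sketch using cuBLAS (CUDA 11.x API; the handle, device buffers d_A/d_B/d_C and the dimensions are assumed to be set up elsewhere): FP16 inputs with FP32 accumulation let cuBLAS dispatch Tensor-core kernels on Volta/Turing.

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A * B with FP16 storage and FP32 accumulation (column-major, no transpose).
void hgemm_tensor_cores(cublasHandle_t handle, const __half* d_A, const __half* d_B,
                        __half* d_C, int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;   // FP32 scalars to match CUBLAS_COMPUTE_32F

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,
                 d_B, CUDA_R_16F, k,
                 &beta,
                 d_C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```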
| Benchmark | Turing | Volta | Pascal | Comments |
|---|---|---|---|---|
| Blur (3×3) Filter single/FP32 (MPix/s) | 23,090 / 19,000 [-14%] | 26,860 / 29,820 | 17,860 / 13,680 | In this 3×3 convolution algorithm, Turing is 14% slower. Convolution is also used in neural nets (CNNs). |
| Blur (3×3) Filter half/FP16 (MPix/s) | 28,240 [=] | 28,310 | 1,570 | With FP16 precision, Turing matches Volta in performance. |
| Sharpen (5×5) Filter single/FP32 (MPix/s) | 6,000 / 4,350 [-35%] | 9,230 / 7,250 | 4,800 / 3,460 | The same algorithm with more shared data makes Turing 35% slower. |
| Sharpen (5×5) Filter half/FP16 (MPix/s) | 10,580 [-38%] | 14,676 | 609 | With FP16, Volta is almost 40% faster than Turing. |
| Motion-Blur (7×7) Filter single/FP32 (MPix/s) | 6,180 / 4,570 [-33%] | 9,420 / 7,470 | 4,830 / 3,620 | Again the same algorithm with even more shared data; Turing is 33% slower. |
| Motion-Blur (7×7) Filter half/FP16 (MPix/s) | 10,160 [-31%] | 14,651 | 325 | With FP16 nothing much changes in this algorithm. |
| Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) | 6,220 / 4,340 [-30%] | 8,890 / 7,000 | 4,740 / 3,450 | Still convolution but with two filters; Turing is 30% slower. |
| Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) | 10,100 [-25%] | 13,446 | 309 | Just as we’ve seen above, Turing is about 25% slower than Volta. |
| Noise Removal (5×5) Median Filter single/FP32 (MPix/s) | 52.53 / 59.9 [-50%] | 108 / 66.34 | 36 / 55 | With a different algorithm we see the biggest delta: Turing is 50% slower. |
| Noise Removal (5×5) Median Filter half/FP16 (MPix/s) | 121 [-40%] | 204 | 71 | With FP16, Turing reduces the deficit to 40%. |
| Oil Painting Quantise Filter single/FP32 (MPix/s) | 20.28 / 25.64 [-50%] | 41.38 / 23.14 | 15.14 / 15.3 | This filter flies on Volta; again Turing is 50% slower. |
| Oil Painting Quantise Filter half/FP16 (MPix/s) | 59.55 [-54%] | 129 | 50.75 | FP16 precision does not change things. |
| Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) | 24,600 / 29,640 [+1%] | 24,400 / 24,870 | 19,480 / 14,000 | This algorithm is 64-bit integer heavy and here Turing is 1% faster. |
| Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) | 22,400 [-8%] | 24,292 | 6,090 | FP16 does not help here as we are already at maximum performance. |
| Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) | 3,000 / 10,500 [-20%] | 3,771 / 8,760 | 1,288 / 6,530 | One of the largest and most complex filters; Turing is 20% slower than Volta. |
| Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) | 7,850 [-4%] | 8,137 | 461 | Switching to FP16, Turing is within 4% of Volta and over 2x faster than its own FP32 code. |

For image processing, Turing is generally 20-35% slower than Volta, somewhat in line with memory performance. If FP16 is sufficient then we see Turing matching Volta in performance – something that old Pascal could never do.
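For context, the 3×3 blur is a plain convolution; a minimal sketch (not the benchmark’s implementation) shows why performance hinges on how well the overlapping neighbour reads hit the texture/L1 caches:

```cuda
// Minimal 3x3 box-blur: each output pixel averages its 3x3 neighbourhood.
__global__ void blur3x3(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = min(max(x + dx, 0), w - 1);   // clamp at the image borders
            int sy = min(max(y + dy, 0), h - 1);
            sum += in[sy * w + sx];                // neighbour reads overlap across threads
        }
    out[y * w + x] = sum * (1.0f / 9.0f);
}
```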
Memory Performance
We are testing both native CUDA and OpenCL performance using the latest SDKs / libraries / drivers.
Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.
Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2. Turbo / Boost was enabled on all configurations.
| Memory Benchmarks | nVidia Titan RTX / 2080Ti (Turing) | nVidia Titan V (Volta) | nVidia Titan X (Pascal) | Comments |
|---|---|---|---|---|
| Internal Memory Bandwidth (GB/s) | 494 / 485 [-7%] | 534 / 530 | 356 / 354 | GDDR6 provides good bandwidth, only 7% less than HBM2. |
| Upload Bandwidth (GB/s) | 11.3 / 10.4 [-1%] | 11.4 / 11.4 | 11.4 / 9 | Still using PCIe 3.0 x16, there is no change in upload bandwidth. Roll on PCIe 4.0! |
| Download Bandwidth (GB/s) | 11.9 / 12.3 [-1%] | 12.1 / 12.3 | 12.2 / 8.9 | Again no significant difference, but we were not expecting any. |

Turing’s GDDR6 memory provides almost the same bandwidth as Volta’s expensive HBM2. All cards use PCIe 3.0 x16 connections and thus show similar upload/download bandwidth. Hopefully the move to PCIe 4.0/5.0 will improve transfers.
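A rough way to reproduce the internal-bandwidth figure yourself is to time a large device-to-device copy with CUDA events; a minimal sketch (absolute numbers will differ from Sandra’s, but the ordering between cards should not):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = size_t(1) << 30;           // 1 GiB per buffer
    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up

    cudaEventRecord(start);
    for (int i = 0; i < 20; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads and writes the buffer once, hence the factor of 2.
    printf("~%.0f GB/s\n", 20.0 * 2.0 * bytes / (ms / 1e3) / 1e9);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```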
| Benchmark | Turing | Volta | Pascal | Comments |
|---|---|---|---|---|
| Global (In-Page Random Access) Latency (ns) | 135 / 143 [-25%] | 180 / 187 | 201 / 230 | From the start we see global in-page access latency reduced by 25% – not a huge amount, but it will help. |
| Global (Full Range Random Access) Latency (ns) | 243 / 248 [-22%] | 311 / 317 | 286 / 311 | Full-range random accesses are also 22% faster. |
| Global (Sequential Access) Latency (ns) | 40 / 43 [-25%] | 53 / 57 | 89 / 121 | Sequential access latency has also dropped by 25%. |
| Constant Memory (In-Page Random Access) Latency (ns) | 77 / 80 [+2%] | 75 / 76 | 117 / 174 | Constant memory latencies are about the same. |
| Shared Memory (In-Page Random Access) Latency (ns) | 10.6 / 71 [-41%] | 18 / 85 | 18.7 / 53 | Shared memory latencies appear improved. |
| Texture (In-Page Random Access) Latency (ns) | 157 / 217 [-26%] | 212 / 279 | 195 / 196 | Texture access latencies have also dropped by 26%. |
| Texture (Full Range Random Access) Latency (ns) | 268 / 329 [-22%] | 344 / 313 | 282 / 278 | As we’ve seen with global memory, latencies are reduced by 22%. |
| Texture (Sequential Access) Latency (ns) | 67 / 138 [-24%] | 88 / 163 | 87 / 123 | With sequential access we also see a 24% reduction. |

The high data rate of Turing’s GDDR6 brings reduced latencies across the board versus HBM2, although – as we’ve seen in the compute benchmarks – this does not always translate into better performance. Still, some algorithms, especially less optimised ones, may benefit at a much lower cost.
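These latency figures come from dependent (serialised) accesses; a classic way to measure them is a pointer chase, sketched below (a hypothetical kernel, with a single thread following a random cycle). Sweeping the buffer size also exposes the L1/L2 cache inflexion points discussed next.

```cuda
// Each load depends on the previous one, so total time / steps ~ access latency.
__global__ void chase(const unsigned* next, unsigned start, int steps, unsigned* sink)
{
    unsigned idx = start;
    for (int i = 0; i < steps; ++i)
        idx = next[idx];           // serialised, latency-bound loads
    *sink = idx;                   // keep the compiler from removing the loop
}

// Host side (sketch): fill `next` with a random cycle over N elements, launch
// chase<<<1,1>>>(...), time it with CUDA events and divide by `steps` to get ns
// per access; repeat for growing N to reveal the L1, L2 and DRAM plateaus.
```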
We see L1 cache effects between 32-64kB, tallying with an L1D of 32-48kB (depending on configuration), with the other inflexion between 4-8MB matching the ~6MB L2 cache.
As with global memory, we see the same L1D (32kB) and L2 (6MB) cache effects with similar latencies. Both are significant upgrades over the Titan X’s caches.
SiSoftware Official Ranker Scores
Final Thoughts / Conclusions
If you wanted to upgrade your old Pascal Titan X but could not afford the Volta-based Titan V, you can now get a (relatively) cheap RTX 2080Ti or Titan RTX and enjoy similar if not slightly better FP16/FP32 performance that blows the not-so-old Titan X out of the water! If you can make do with FP16 and use the Tensor cores, we are looking at 6-8x the performance over FP32 from a single card.
Naturally, FP64 performance is again “gimped” at 1/32x, so if that is what you require, Turing cannot help you – you will have to get a Volta. But then again the Titan X was similarly “gimped”, so if that is what you had, you still get a decent performance upgrade.
The GDDR6 memory may have similar bandwidth on paper, but in streaming algorithms it is about 33% slower than HBM2, so there Turing cannot match Volta – considering the cost, though, it is a good trade. You also lose 1GB, just as with the Titan X, but again that is no surprise. Global/constant/texture memory access latencies are lower thanks to the high data rate, which should help algorithms that are memory-access limited (if you cannot otherwise hide them).
As we are testing GPGPU performance here, we have not touched on the ray-tracing (RTX) units, but should you happen to play a game or two while “resting”, then the Titan RTX / 2080Ti might just impress you even more. Here not even Volta can match it!
All in all, the Titan RTX / 2080Ti is a compelling, (relatively) cheap upgrade over the old Titan X if you don’t require FP64 precision.
nVidia Titan RTX (Turing)