What is “Titan RTX / 2080Ti”?
It is the latest highend “prosumer” card from nVidia with the nextgeneration “Turing” architecture, the update to the current “Volta” architecture that has had a limited release in Titan/Quadro cards. It powers the new Series 20 topend (with RTX) and Series 16 mainstream (without RTX) cards that replace the old Series 10 “Pascal” series.
As “Volta” is intended for AI/scientific/financial datacenters – it features highend HBM2 memory; since “Turing” is meant for gaming, rendering, etc. has “normal” GDDR6 memory. Similarly “Turing” has the new RTX (RayTracing) cores for highfidelity visualisation and image generation – in addition to the Tensor (TSX) cores that “Volta” has introduced.
While “Volta” has 1/2 FP64 ratio cores (vs. FP32), “Turing” has the normal 1/32 FP64 ratio cores: for highprecision computation – you need “Volta”. However, as “Turing” maintains the 2x FP16 rate (vs. FP32) it can run lowprecision AI (neural networks) at full speed. Old “Pascal” had 1/64x FP16 ratio making it pretty much unusable in most cases.
“Turing” does not have highend onpackage HBM2 memory but instead highspeed GDDR6 memory that has decent bandwidth but is not plentiful – with 1GB missing (11GB instead of 12GB).
With the soontobe unveiled “Ampere” (Series 30) architecture, we look whether you can have a “cheap” Titan V performance using a Turing 2080TI consumer card.
See these other articles on Titan performance:
Hardware Specifications
We are comparing the topoftherange Titan V with previous generation Titans and competing architectures with a view to upgrading to a midrange high performance design.
GPGPU Specifications 
nVidia Titan RTX / 2080TI (Turing)

nVidia Titan V (Volta)

nVidia Titan X (Pascal)

Comments 
Arch Chipset 
Turing GP102 (7.5) 
Volta VP100 (7.0) 
Pascal FP102 (6.1) 
The V is the only one using the topend 100 chip not 102 or 104 lowerend versions 
Cores (CU) / Threads (SP) 
68 / 4352 
80 / 5120 
28 / 3584 
Not as many cores as Volta but still decent. 
ROPs / TMUs 
88 / 272 
96 / 320

96 / 224 
Cannot match Volta but more ROPs per CU for gaming. 
FP32 / FP64 / Tensor Cores 
4352 / 136 / 544 
5120 / 2560 / 640 
3584 / 112 / no 
Maintains the Tensor cores important for AI tasks (neural networks, etc.) 
Speed (MinTurbo) 
1.35GHz (1361.635) 
1.2GHz (1351.455) 
1.531 (1351.910) 
Clocks have improved over Volta likely due to lower number of SMs. 
Power (TDP) 
260W 
300W 
250W (125300) 
TDP is less due to lower CU number. 
Global Memory 
11GB GDDR6 14GHz 320bit 
12GB HBM2 850Mhz 3072bit 
11GB GDDR5X 10GHz 384bit 
As a prosumer card it has 1GB less than Volta and same as Pascal. 
Memory Bandwidth (GB/s)

616 
652 
512 
Despite no HBM2, bandwidth almost matches due to high speed of GDDR6. 
L1 Cache 
2x (32kB + 64kB) 
2x 24kB / 96kB shared 

L1/shared is still the same but ratios have changed. 
L2 Cache 
5.5MB (6MB?) 
4.5MB (3MB?) 
3MB 
L2 cache reported has increased by 25%. 
FP64/double ratio

1/32x 
1/2x 
1/32x 
Low ratio like all consumer cards, Volta dominates here 
FP16/half ratio

2x 
2x 
1/32x 
Same rate as Volta, 2x over FP32 
nVidia RTX 2080 TI (Turing)
Processing Performance
We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.
Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2 (latest nVidia provides). Turbo / Boost was enabled on all configurations.
Processing Benchmarks 
nVidia Titan RTX / 2080TI (Turing) 
nVidia Titan V (Volta) 
nVidia Titan X (Pascal) 
Comments 


Mandel FP16/Half (Mpix/s) 
41,080 / n/a [=] 
40,920 / n/a 
336 / n/a 
Right off the bat, Turing matches Volta and is miles faster than old Pascal. 

Mandel FP32/Single (Mpix/s) 
25,000 / 23,360 [+11%] 
22,530 / 21320 
18,000 / 16,000 
With standard FP32, Turing even manages to be 11% faster despite less CUs. 

Mandel FP64/Double (Mpix/s) 
812 / 772 [93%] 
11,300 / 10,500 
641 / 642 
For FP64 you don’t want Turing, you want Volta. At any cost. 

Mandel FP128/Quad (Mpix/s) 
30.4 / 29.1 [94%] 
472 / 468 
24.4 / 27 
With emulated FP128 precision Turing is again demolished. 
Turing manages to improve over Volta in FP16/FP32 despite having less CUs – most likely due to faster clock and optimisations. However, if you do need FP64 precision then Volta reigns supreme – the 1/32 rate of Turing & Pascal just does not cut it. 


Crypto AES256 (GB/s) 
48 / 52 [33%] 
72 / 86 
42 / 41 
Streaming workloads love Volta’s HBM2 memory, Turing is 33% slower. 

Crypto AES128 (GB/s) 
64 / 70 [30%] 
92 / 115 
57 / 54 
Not a lot changes here, Turing is 30% slower. 


Crypto SHA2512 (GB/s) 
192 / 182 [+7%] 
179 / 181 
72 / 83 
With 64bit integer workload, Turing manages a 7% win despite “slower” memory. 

Crypto SHA256 (GB/s) 
170 / 125 [33%] 
253 / 188 
95 / 60 
As with AES, hashing loves HBM2 so Turing is 33% slower than Volta. 

Crypto SHA1 (GB/s) 
161 / 125 [+56] 
103 / 113 
69 / 74 
While Turing wins, it is likely a compiler optimisation. 
It seems that Turing GDDR6 memory cannot keep up with Volta’s HBM2 – despite the similar bandwidths: streaming algorithms are around 30% slower on Turing. The only win is 64bit integer workload that is 7% faster on Turing likely due to integer units optimisations. 


BlackScholes float/FP32 (MOPT/s) 
17,230 / 17,000 [7%] 
18,480 / 18,860 
10,710 / 10,560 
Turing is just 7% slower than Volta. 

BlackScholes double/FP64 (MOPT/s) 
1,530 / 1,370 [82%] 
8,660 / 8,500 
1,400 / 1,340 
FP64 is almost 1/8x slower. 

Binomial float/FP32 (kOPT/s) 
4,280 / 4,250 [+4%] 
4,130 / 4,110 
2,220 / 2,230 
Binomial uses thread shared data thus stresses the SMX’s memory system – Turing is 4% faster. 

Binomial double/FP64 (kOPT/s) 
164 / 163 [91%] 
1,920 / 2,000 
131 / 134 
With FP64 code Turing is 1/10x slower. 

MonteCarlo float/FP32 (kOPT/s) 
11,440 / 11,740 [+1%] 
11,340 / 12,900 
8,100 / 6,000 
MonteCarlo also uses thread shared data but readonly thus reducing modify pressure – Turing is just 1% faster. 

MonteCarlo double/FP64 (kOPT/s) 
327 / 263 [92%] 
4,330 / 3,590 
304 / 274 
Switching to FP64 again Turing is 1/10x slower. 
For financial workloads, as long as you only need FP32 (or FP16), Turing can match and slightly outperform Volta; considering the cost that is no mean feat. However, if you do need FP64 precision – as we saw before, there is no contest – Volta is 10x (ten times) faster. 


HGEMM (GFLOPS) half/FP16 
34,080 [16%] 
40,790 

Using the new Tensor cores, Turing is just 16% slower. 

SGEMM (GFLOPS) float/FP32 
7,400 / 7,330 [33%] 
11,000 / 10,870 
6,280 / 6,600 
Perhaps surprisingly, Turing is 33% slower than Volta here. 

DGEMM (GFLOPS) double/FP64 
502 / 498 [89%] 
4,470 4,550 
335 / 332 
With FP64 precision, Turing is 1/10x slower than Volta. 

HFFT (GFLOPS) half/FP16 
1,000 [+2%] 
979 

FFT somehow allows Turing to match Volta in performance. 

SFFT (GFLOPS) float/FP32 
512 / 573 [5%] 
540 / 599 
242 / 227 
With FP32, Turing is just 5% slower. 

DFFT (GFLOPS) double/FP64 
302 / 302 [+1%] 
298 / 375 
207 / 191 
Completely memory bound, Turing matches Volta here. 

HNBODY (GFLOPS) half/FP16 
9,000 [2%] 
9,160 

NBody simulation with FP16 is just 2% slower. 

SNBODY (GFLOPS) float/FP32 
9,330 / 8,120 [+27%] 
7,320 / 6,620 
5,600 / 4,870 
NBody simulation allows Turing to dominate. 

DNBODY (GFLOPS) double/FP64 
222 / 295 [94%] 
3,910 / 5,130 
275 / 275 
With FP64 precision, Turing is again 1/10x slower than Volta. 
The scientific scores are a bit more mixed – but again Turing can match or slightly exceed Volta with FP32/FP16 precision – as long as we’re not memory limited; there Volta is still around 30% faster. With FP64 it’s the same story, Turing is about 1/10x slower. 


Blur (3×3) Filter single/FP32 (MPix/s) 
23,090 / 19,000 [14%] 
26,860 / 29,820 
17,860 / 13,680 
In this 3×3 convolution algorithm, Turing is 14% slower. Convolution is also used in neural nets (CNN). 

Blur (3×3) Filter half/FP16 (MPix/s) 
28,240 [=] 
28,310 
1,570 
With FP16 precision, Turing matches Volta in performance. 

Sharpen (5×5) Filter single/FP32 (MPix/s) 
6,000 / 4,350 [35%] 
9,230 / 7,250 
4,800 / 3,460 
Same algorithm but more shared data makes Turing 35% slower. 

Sharpen (5×5) Filter half/FP16 (MPix/s) 
10,580 [38%] 
14,676 
609 
With FP16 Volta is almost 40% faster over Turing. 

MotionBlur (7×7) Filter single/FP32 (MPix/s) 
6,180 / 4,570 [33%] 
9,420 / 7,470 
4,830 / 3,620 
Again same algorithm but even more data shared Turing is 33% slower. 

MotionBlur (7×7) Filter half/FP16 (MPix/s) 
10,160 [31%] 
14,651 
325 
With FP16 nothing much changes in this algorithm. 

Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 
6,220 / 4,340 [30%] 
8,890 / 7,000 
4,740 / 3,450 
Still convolution but with 2 filters – Turing is 30% slower. 

Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 
10,100 [25%] 
13,446 
309 
Just as we seen above, Turing is about 25% slower than Volta. 

Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 
52.53 / 59.9 [50%] 
108 / 66.34 
36 / 55 
Different algorithm we see the biggest delta with Turing 50% slower. 

Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 
121 [40%] 
204 
71 
With FP16 Turing reduces the loss to just 40%. 

Oil Painting Quantise Filter single/FP32 (MPix/s) 
20.28 / 25.64 [50%] 
41.38 / 23.14 
15.14 / 15.3 
Without major processing, this filter flies on Volta, again Turing is 50% slower. 

Oil Painting Quantise Filter half/FP16 (MPix/s) 
59.55 [54%] 
129 
50.75 
FP16 precision does not change things. 

Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 
24,600 / 29,640 [+1%] 
24,400 / 24,870 
19,480 / 14,000 
This algorithm is 64bit integer heavy and here Turing is 1% faster. 

Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 
22,400 [8%] 
24,292 
6,090 
FP16 does not help here as we’re at maximum performance. 

Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 
3,000 / 10,500 [20%] 
3,771 / 8,760 
1,288 / 6,530 
One of the most complex and largest filters, Turing is 20% slower than Volta. 

Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 
7,850 [4%] 
8,137 
461 
Switching to FP16, the V is almost 4x (times) faster than the X and over 2x faster than FP32 code. 
For image processing, Turing is generally 2035% slower than Volta somewhat in line with memory performance. If FP16 is sufficient, then we see Turing matching Volta in performance – something that old Pascal could never do. 
Memory Performance
We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.
Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.
Environment: Windows 10 x64, latest nVidia drivers 352, CUDA 11.3, OpenCL 1.2. Turbo / Boost was enabled on all configurations.
Memory Benchmarks 
nVidia Titan RTX / 2080TI (Turing) 
nVidia Titan V (Volta) 
nVidia Titan X (Pascal) 
Comments 


Internal Memory Bandwidth (GB/s) 
494 / 485 [7%] 
534 / 530 
356 / 354 
GDDR6 provides good bandwidth, only 7% less than HBM2. 

Upload Bandwidth (GB/s) 
11.3 / 10.4 [1%] 
11.4 / 11.4 
11.4 / 9 
Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4! 

Download Bandwidth (GB/s) 
11.9 / 12.3 [1%] 
12.1 / 12.3 
12.2 / 8.9 
Again no significant difference but we were not expecting any. 
Turing’s GDDR6 memory provides almost the same bandwidth as Volta’s expensive HBM2. All cards use PCIe3 x16 connections thus similar upload/download bandwidth. Hopefully the move to PCIe4/5 will improve transfers. 


Global (InPage Random Access) Latency (ns) 
135 / 143 [25%] 
180 / 187 
201 / 230 
From the start we see global latency accesses reduced by 25%, not a lot but will help. 

Global (Full Range Random Access) Latency (ns) 
243 / 248 [22%] 
311 / 317 
286 / 311 
Full range random accesses are also 22% faster. 

Global (Sequential Access) Latency (ns) 
40 / 43 [25%] 
53 / 57 
89 / 121 
Sequential accesses have also dropped 25%. 

Constant Memory (InPage Random Access) Latency (ns) 
77 / 80 [+2%] 
75 / 76 
117 / 174 
Constant memory latencies seem about the same. 

Shared Memory (InPage Random Access) Latency (ns) 
10.6 / 71 [41%] 
18 / 85 
18.7 / 53 
Shared memory latencies seem to be improved. 

Texture (InPage Random Access) Latency (ns) 
157 / 217 [26%] 
212 / 279 
195 / 196 
Texture access latencies have also reduced by 26%. 

Texture (Full Range Random Access) Latency (ns) 
268 / 329 [22%] 
344 / 313 
282 / 278 
As we’ve seen with global memory, we see reduced latencies by 22%. 

Texture (Sequential Access) Latency (ns) 
67 / 138 [24%] 
88 / 163 
87 / 123 
With sequential access we also see a 24% reduction. 
The high data rate of Turing’s GDDR6 brings reduced latencies across the board over HBM2 although as we’ve seen in the compute benchmarks, this does not always translate in better performance. Still some algorithms, especially less optimised ones may still benefit at much lower cost. 

We see L1 cache effects between 3264kB tallying with an L1D of 3248kB (depending on setting) with the other inflexion between 48MB matching the 6MB L2 cache. 

As with global memory we see the same L1D (32kB) and L2 (6MB) cache affects with similar latencies. Both are significant upgrades over Titan X’ caches. 
SiSoftware Official Ranker Scores
Final Thoughts / Conclusions
If you wanted to upgrade your old Pascal Titan X but could not afford the Volta’s Titan V – then you can now get a cheap RTX 2080Ti or Titan RTX and get similar if not slightly faster FP16/FP32 performance that blows the notsoold Titan X out of the water! If you can make do with FP16 and use Tensor cores, we’re looking at 68x performance over FP32 using a single card.
Naturally, the FP64 performance is again “gimped” at 1/32x so if that’s what you require, Turing cannot help you there – you will have to get a Volta. But then again the Titan X was similarly “gimped” thus if that’s what you had you still get a decent performance upgrade.
The GDDR6 memory may have similar bandwidth on paper, but in streaming algorithms is about 33% slower than HBM2 so there Turing cannot match Volta, but considering the cost it is a good trade. You will also lose 1GB just like with Titan X but again, not a surprise. Global/Constant/Texture memory access latencies are lower due to the high data rate which should help algorithms that are memory access limited (if you cannot help hide them).
As we’re testing GPGPU performance here, we have not touched on the raytracing (RTX) units, but should you happen to play a game or two when you are “resting”, then the Titan RTX / 2080TI might just impress you even more. Here, not even Volta can match it!
All in all – Titan RTX is a compelling (relatively) cheap upgrade over the old Titan X if you don’t require FP64 precision.
nVidia Titan RTX (Turing)