nVidia RTX 3090, 3080: Ampere GPGPU performance in CUDA and OpenCL

What is “Ampere”?

It is the latest arch(itecture) (SM 8.x) from nVidia, launching with the new Series 30 mainstream cards (RTX 3090, 3080 and soon 3070, 3060), a major update over the previous “Turing”/“Volta” (Series 20, SM 7.x). A “Titan” pro-sumer version is also expected to launch soon, while the data-center A100 version is already available.

Like previous mainstream versions, “Ampere” uses the standard consumer compute ratios (1/32 FP64, 2x FP16) and (high-speed) GDDR6X memory (not HBM2+). It brings updated 3rd gen(eration) tensor cores and 2nd gen ray-tracing (RTX) cores, but no new core types.

The updated tensor cores now support FP64 precision, BF16 (in addition to the existing FP16) and also TF32 – an “optimised” FP32 precision format that can speed up operations requiring higher precision (than 16-bit). Thus, for the first time, high-precision algorithms can make use of the tensor cores, greatly expanding their use.
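As a minimal sketch of how TF32 can be engaged from CUDA 11 – here via cuBLAS’ math-mode setting (the function name and dimensions are illustrative assumptions, not Sandra’s code) – a plain FP32 GEMM can be routed through the tensor cores without changing any data types:

```cpp
// Minimal sketch: routing a plain FP32 GEMM through the TF32 tensor cores
// (CUDA 11 / cuBLAS; function name, buffers and dimensions are illustrative).
#include <cublas_v2.h>
#include <cuda_runtime.h>

void sgemm_tf32(const float* dA, const float* dB, float* dC, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Opt in to TF32: inputs stay FP32 in memory; the tensor cores round
    // the mantissa to 10 bits internally and accumulate in FP32.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
}
```

The appeal of TF32 is precisely that no code or data-layout changes are required – only an opt-in – which is why existing FP32 workloads can benefit immediately.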

It supports PCIe4, thus doubling transfer bandwidth (PCIe4 x16 up to 32GB/s) on supported platforms (AMD only for now, but soon Intel with RocketLake and later), which was needed considering the size of the video memory (up to 24GB). It also supports “RTX I/O”, which can asynchronously transfer data from storage directly to the GPU; this will be used by Microsoft’s DirectStorage (and similar APIs) and, hopefully, CUDA / OpenCL extensions.
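To actually saturate a PCIe4 x16 link, transfers need pinned (page-locked) host memory and asynchronous copies; a minimal sketch (the buffer size is an arbitrary assumption):

```cpp
// Minimal sketch: pinned host memory + async copy, the usual way to get
// close to the PCIe link's peak transfer bandwidth (size is illustrative).
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 << 20;   // 256MB test buffer (assumed)
    float *hBuf, *dBuf;
    cudaHostAlloc((void**)&hBuf, bytes, cudaHostAllocDefault); // page-locked
    cudaMalloc((void**)&dBuf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Async upload: with pinned memory the copy is a pure DMA transfer at
    // full link speed and can overlap with kernels in other streams.
    cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(hBuf);
    cudaFree(dBuf);
    cudaStreamDestroy(stream);
    return 0;
}
```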

For higher bandwidth, “Ampere” supports GDDR6X, an evolution of GDDR6 allowing much higher data rates – up to 40% over the previous generation. Size-wise, the 3090 comes with 24GB of video memory, an over-2x increase over the previous 2080Ti!

Note: Due to the great increase in both compute and memory capacity, we (SiSoftware) have had to increase (Sandra’s GPGPU) benchmark limits to take advantage of the new capabilities. Please update to Sandra 20/20 R10 or later for best results. Optimisation work is ongoing and further updates will likely be released in due course.


Hardware Specifications

We are comparing the top-of-the-range “Ampere” cards with previous-generation cards and competing architectures, with a view to upgrading to a high-performance design.

| GPGPU Specifications | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| Arch / Chipset | Ampere GA102 (SM 8.6) | Ampere GA102 (SM 8.6) | Turing TU102 (SM 7.5) | Volta GV100 (SM 7.0) | The Titan V is the only one using the top-end 100 chip. |
| Cores (CU) / Threads (SP) | 82 / 10,496 [+2.4x] | 68 / 8,704 [2x] | 68 / 4,352 | 80 / 5,120 | 2x more SPs per CU is quite an increase. |
| ROPs / TMUs | 112 / 328 | 96 / 272 | 88 / 272 | 96 / 320 | More units across the board. |
| Tensor Cores (TC) | 328 | 272 [1/2x] | 544 | 640 | More powerful tensor cores despite the lower count. |
| Speed (Min-Turbo) | 1.4GHz (135MHz – 1.78GHz) | 1.44GHz (135MHz – 1.71GHz) | 1.35GHz (136MHz – 1.635GHz) | 1.2GHz (135MHz – 1.455GHz) | Clocks have improved over Volta. |
| Power (TDP) | 350W [+34%] | 320W [+23%] | 260W | 300W | TDP has greatly increased. |
| Global Memory | 24GB GDDR6X 19Gbps 384-bit | 10GB GDDR6X 19Gbps 320-bit | 11GB GDDR6 14Gbps 352-bit | 12GB HBM2 850MHz 3072-bit | 2x more memory than even Volta. |
| Memory Bandwidth (GB/s) | 936 [+52%] | 760 [+23%] | 616 | 652 | Despite no HBM2, over 40% more bandwidth than even Volta. |
| L1 Cache (kB) | 2x (64kB + 64kB) [+33%] | 2x (64kB + 64kB) [+33%] | 2x (32kB + 64kB) | 2x 24kB / 96kB shared | Combined L1/shared capacity is similar, but the ratios have changed. |
| L2 Cache (MB) | 6MB | 5MB | 5.5MB | 4.5MB | Reported L2 cache has increased. |
| FP64/double ratio | 1/32x | 1/32x | 1/32x | 1/2x | Low ratio like all consumer cards; Volta dominates here. |
| FP16/half ratio | 2x | 2x | 2x | 2x | Same rate as Volta, 2x over FP32. |
| Price/RRP (USD) | $1,500 [+25%, +$300] | $700 | $1,200 | $3,000 | The 3090 gets a $300 bump, about 25% vs. Turing. |
nVidia RTX 3090 (Ampere)

Processing Performance

We are testing both native CUDA and OpenCL performance using the latest SDKs / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers (452), CUDA 11.3, OpenCL 1.2 (the latest nVidia provides). Turbo / Boost was enabled on all configurations.

 

(For the Turing and Volta cards, results are shown as CUDA / OpenCL pairs; blank cells denote results not reported.)

| Processing Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Arithmetic Benchmark: Mandel FP16/Half (Mpix/s) | 58,880 [+43%] | 48,692 [+18%] | 41,080 / n/a | 40,920 / n/a | Right off the bat, Ampere is 43% faster. |
| GPGPU Arithmetic Benchmark: Mandel FP32/Single (Mpix/s) | 41,378 [+66%] | 33,666 [+34%] | 25,000 / 23,360 | 22,530 / 21,320 | With standard FP32, Ampere is 66% faster. |
| GPGPU Arithmetic Benchmark: Mandel FP64/Double (Mpix/s) | 996 [+23%] | 835 [+3%] | 812 / 772 | 11,300 / 10,500 | For FP64 you don’t want consumer Ampere. |
| GPGPU Arithmetic Benchmark: Mandel FP128/Quad (Mpix/s) | 37 | 31 [+2%] | 30.4 / 29.1 | 472 / 468 | With emulated FP128 precision (built on FP64), Ampere is again demolished by Volta. |
Ampere greatly improves over Turing/Volta in FP16/FP32 precision – by 40-70%! Naturally, being a consumer card, its FP64 performance is too low to be considered an option.
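For reference, the inner loop of a Mandelbrot-style benchmark is almost pure multiply-add throughput, which is why it tracks the FP32 unit count so closely; a minimal FP32 kernel sketch (illustrative only, not Sandra’s implementation):

```cpp
// Minimal FP32 Mandelbrot kernel sketch (illustrative, not Sandra's code):
// the inner loop is almost pure multiply-add, so throughput scales with
// the number of FP32 units - exactly where Ampere doubled up.
__global__ void mandel(unsigned* out, int w, int h, int maxIter) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float cr = -2.0f + 3.0f * x / w;   // map pixel to the complex plane
    float ci = -1.5f + 3.0f * y / h;
    float zr = 0.0f, zi = 0.0f;
    int i = 0;
    while (i < maxIter && zr * zr + zi * zi < 4.0f) {
        float t = zr * zr - zi * zi + cr;   // z = z^2 + c
        zi = 2.0f * zr * zi + ci;
        zr = t;
        ++i;
    }
    out[y * w + x] = i;   // iteration count = escape time
}
```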
| Processing Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Crypto Benchmark: Crypto AES-256 (GB/s) | 108 [+2.25x] | 91 [+89%] | 48 / 52 | 72 / 86 | Streaming workloads fly on Ampere despite no HBMx. |
| GPGPU Crypto Benchmark: Crypto AES-128 (GB/s) | | | 64 / 70 | 92 / 115 | Not a lot changes here. |
| GPGPU Crypto Benchmark: Crypto SHA2-512 (GB/s) | | | 192 / 182 | 179 / 181 | With 64-bit integer workload. |
| GPGPU Crypto Benchmark: Crypto SHA256 (GB/s) | 348 [+2.05x] | 317 [+86%] | 170 / 125 | 253 / 188 | Despite no HBM, again Ampere reigns. |
| GPGPU Crypto Benchmark: Crypto SHA1 (GB/s) | | | 161 / 125 | 103 / 113 | Nothing much changes here. |
While Turing’s GDDR6 memory could not keep up with Volta’s HBM2, Ampere’s GDDR6X has no such problems: it is over 2x faster than Turing in both streaming benchmarks (crypto and hashing). Together with the huge increase in size (24GB), it is a significant upgrade.
| Processing Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Finance Benchmark: Black-Scholes float/FP32 (MOPT/s) | 25,122 [+46%] | 21,337 [+24%] | 17,230 / 17,000 | 18,480 / 18,860 | Ampere starts off 46% faster than Turing. |
| GPGPU Finance Benchmark: Black-Scholes double/FP64 (MOPT/s) | 1,927 [+26%] | 1,336 [-13%] | 1,530 / 1,370 | 8,660 / 8,500 | FP64 is 26% faster, but there is little point. |
| GPGPU Finance Benchmark: Binomial float/FP32 (kOPT/s) | 7,718 [+80%] | 6,665 [+55%] | 4,280 / 4,250 | 4,130 / 4,110 | Binomial uses thread-shared data, thus stressing the SM’s memory system. |
| GPGPU Finance Benchmark: Binomial double/FP64 (kOPT/s) | 207 [+26%] | 157 [-5%] | 164 / 163 | 1,920 / 2,000 | With FP64, again little point. |
| GPGPU Finance Benchmark: Monte-Carlo float/FP32 (kOPT/s) | 17,636 [+54%] | 15,904 [+39%] | 11,440 / 11,740 | 11,340 / 12,900 | Monte-Carlo also uses thread-shared data, but read-only, reducing write pressure. |
| GPGPU Finance Benchmark: Monte-Carlo double/FP64 (kOPT/s) | 413 [+26%] | 255 [-4%] | 327 / 263 | 4,330 / 3,590 | Switching to FP64, again little point. |
For financial workloads, as long as you only need FP32 (or FP16), Ampere is again 40-80% faster than Turing. Anything needing high precision (FP64), however, need not apply.
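The Binomial comment above is worth illustrating: when each thread block keeps its working set in on-chip shared memory, the SM’s memory system rather than raw FLOPS becomes the limiter. A minimal sketch of the pattern (a generic binomial backward-induction, illustrative only, not Sandra’s code):

```cpp
// Minimal sketch of the shared-memory pattern a binomial option pricer uses
// (illustrative): each block keeps the option-value lattice on-chip and
// repeatedly folds it, so performance depends on the SM's shared memory.
#define STEPS 1024

__global__ void binomial(float* price, const float* v0,
                         float pu, float pd, float disc) {
    __shared__ float buf[2][STEPS + 1];
    int t = threadIdx.x;

    // load terminal option values for this option (one block per option)
    for (int i = t; i <= STEPS; i += blockDim.x)
        buf[0][i] = v0[blockIdx.x * (STEPS + 1) + i];
    __syncthreads();

    // backward induction; ping-pong between two buffers to avoid races
    for (int step = STEPS; step > 0; --step) {
        const float* src = buf[(STEPS - step) & 1];
        float*       dst = buf[(STEPS - step + 1) & 1];
        for (int i = t; i < step; i += blockDim.x)
            dst[i] = disc * (pu * src[i + 1] + pd * src[i]);
        __syncthreads();
    }
    if (t == 0) price[blockIdx.x] = buf[STEPS & 1][0];
}
```

Every step reads two shared values per node and synchronises the block, which is exactly the kind of on-chip traffic that Ampere’s reworked SM handles so much better here.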
| Processing Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Science Benchmark: HGEMM (GFLOPS) half/FP16 | | | 34,080* | 40,790* | * Using the tensor cores. |
| GPGPU Science Benchmark: SGEMM (GFLOPS) float/FP32 | 14,433 [+95%] | 12,582 [+70%] | 7,400 / 7,330 | 11,000 / 10,870 | Ampere is almost 2x faster than Turing. |
| GPGPU Science Benchmark: DGEMM (GFLOPS) double/FP64 | | | 502 / 498 | 4,470 / 4,550 | With FP64 precision. |
| GPGPU Science Benchmark: HFFT (GFLOPS) half/FP16 | | | 1,000 | 979 | FFT is memory-bound. |
| GPGPU Science Benchmark: SFFT (GFLOPS) float/FP32 | 1,014 [+98%] | 891 [+74%] | 512 / 573 | 540 / 599 | With FP32, Ampere is again 2x faster. |
| GPGPU Science Benchmark: DFFT (GFLOPS) double/FP64 | | | 302 / 302 | 298 / 375 | Completely memory-bound. |
| GPGPU Science Benchmark: HNBODY (GFLOPS) half/FP16 | | | 9,000 | 9,160 | N-Body simulation with FP16. |
| GPGPU Science Benchmark: SNBODY (GFLOPS) float/FP32 | 13,910 [+49%] | 11,314 [+21%] | 9,330 / 8,120 | 7,320 / 6,620 | N-Body simulation allows Ampere to dominate. |
| GPGPU Science Benchmark: DNBODY (GFLOPS) double/FP64 | | | 222 / 295 | 3,910 / 5,130 | With FP64 precision. |
With the new tensor cores, Ampere enjoys a 2x lead over Turing in GEMM; in the other benchmarks we see a similar 50% (or better) improvement. Again, FP64 performance is too low to matter, tensor cores or not.
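For those curious how the tensor cores are programmed directly, a minimal sketch using CUDA’s WMMA API (one warp computing one 16×16 output tile, FP16 inputs with FP32 accumulation; M, N, K are assumed multiples of 16 and the grid sized accordingly – compile for sm_70 or later):

```cpp
// Minimal WMMA sketch: one warp multiplies one 16x16 tile on the tensor
// cores (FP16 inputs, FP32 accumulate). Dimensions assumed multiples of 16.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_gemm(const half* A, const half* B, float* C,
                          int M, int N, int K) {
    // each warp owns one 16x16 output tile
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN =  blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + warpN * 16, N);
        wmma::mma_sync(acc, a, b, acc);   // tensor-core tile multiply
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}
```

In practice libraries like cuBLAS/cuDNN do this (plus tiling and pipelining) for you, which is where GEMM-heavy benchmarks get their tensor-core speed-ups.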
| Processing Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Image Processing: Blur (3×3) Filter single/FP32 (MPix/s) | 39,152 [+70%] | 34,722 | 23,090 / 19,000 | 26,860 / 29,820 | In this 3×3 convolution algorithm, Ampere is 70% faster. |
| GPGPU Image Processing: Blur (3×3) Filter half/FP16 (MPix/s) | | | 28,240 | 28,310 | With FP16 precision. |
| GPGPU Image Processing: Sharpen (5×5) Filter single/FP32 (MPix/s) | 13,766 [+2.29x] | 11,895 | 6,000 / 4,350 | 9,230 / 7,250 | More shared data: Ampere is over 2x faster! |
| GPGPU Image Processing: Sharpen (5×5) Filter half/FP16 (MPix/s) | | | 10,580 | 14,676 | With FP16. |
| GPGPU Image Processing: Motion-Blur (7×7) Filter single/FP32 (MPix/s) | 13,484 [+2.18x] | 11,764 | 6,180 / 4,570 | 9,420 / 7,470 | Even more data; Ampere is still 2x faster. |
| GPGPU Image Processing: Motion-Blur (7×7) Filter half/FP16 (MPix/s) | | | 10,160 | 14,651 | With FP16, nothing much changes in this algorithm. |
| GPGPU Image Processing: Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) | 13,120 [+2.11x] | 11,477 | 6,220 / 4,340 | 8,890 / 7,000 | Still convolution but with 2 filters; Ampere is still 2x faster. |
| GPGPU Image Processing: Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) | | | 10,100 | 13,446 | Just as we have seen above. |
| GPGPU Image Processing: Noise Removal (5×5) Median Filter single/FP32 (MPix/s) | 244 [+4.5x] | 215 | 52.53 / 59.9 | 108 / 66.34 | In this very memory-sensitive algorithm, Ampere is over 4x faster. |
| GPGPU Image Processing: Noise Removal (5×5) Median Filter half/FP16 (MPix/s) | | | 121 | 204 | With FP16. |
| GPGPU Image Processing: Oil Painting Quantise Filter single/FP32 (MPix/s) | 92 [+4.5x] | 86 | 20.28 / 25.64 | 41.38 / 23.14 | Memory helps Ampere be 4.5x faster. |
| GPGPU Image Processing: Oil Painting Quantise Filter half/FP16 (MPix/s) | | | 59.55 | 129 | FP16 precision does not change things. |
| GPGPU Image Processing: Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) | 81,962 [+54%] | 72,367 | 24,600 / 29,640 | 24,400 / 24,870 | This algorithm is 64-bit integer heavy: Ampere is much faster. |
| GPGPU Image Processing: Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) | | | 22,400 | 24,292 | FP16 does not help here as we are already at maximum performance. |
| GPGPU Image Processing: Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) | 1,233 [-70%] | 1,087 | 3,000 / 10,500 | 3,771 / 8,760 | This complex, largest filter still needs some optimisations. |
| GPGPU Image Processing: Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) | | | 7,850 | 8,137 | Switching to FP16. |
For image processing, Ampere is even faster than in the other tests – routinely 2x faster than Turing. SM improvements and memory performance seem to help a lot here.
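As a point of reference, a 3×3 blur is the simplest of these convolutions; a minimal FP32 sketch (illustrative, not Sandra’s implementation) shows why neighbouring-pixel reuse makes such filters so sensitive to cache and memory behaviour:

```cpp
// Minimal 3x3 box-blur sketch (illustrative, not Sandra's code): each output
// pixel reads 9 neighbours, so adjacent threads re-read the same data and
// performance leans heavily on the L1/texture caches and memory system.
__global__ void blur3x3(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = min(max(x + dx, 0), w - 1);   // clamp at the borders
            int sy = min(max(y + dy, 0), h - 1);
            sum += in[sy * w + sx];
        }
    out[y * w + x] = sum * (1.0f / 9.0f);   // average of the 3x3 window
}
```

Larger windows (5×5, 7×7) read proportionally more neighbours per pixel, which is why Ampere’s lead grows with the filter size above.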
| Memory Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Bandwidth: Internal Memory Bandwidth (GB/s) | 766 [+55%] | 624 [+26%] | 494 / 485 | 534 / 530 | GDDR6X gives 55% better performance. |
| GPGPU Memory Bandwidth: Upload Bandwidth (GB/s) | 24.4 [+2.16x] | 23.3 [+2x] | 11.3 / 10.4 | 11.4 / 11.4 | PCIe4 is 2x faster. |
| GPGPU Memory Bandwidth: Download Bandwidth (GB/s) | 24.5 [+2.05x] | 24.4 [+2x] | 11.9 / 12.3 | 12.1 / 12.3 | Again, PCIe4 is 2x faster. |
GDDR6X brings over 50% more bandwidth, overtaking even Volta’s HBM2; PCIe4 doubles upload/download bandwidth, which should greatly help large memory transfers. All in all, a huge upgrade over Turing.
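Internal-bandwidth figures like these are typically obtained by timing large on-device copies with CUDA events; a minimal sketch (buffer size is an arbitrary assumption, and this is not necessarily Sandra’s exact method):

```cpp
// Minimal sketch of an internal-bandwidth measurement (illustrative):
// time a large device-to-device copy with CUDA events and divide the
// bytes moved (read + write) by the elapsed time.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;   // 1GB buffers (assumed; must fit VRAM)
    float *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // a copy is 1 read + 1 write of 'bytes', hence the factor of 2
    printf("~%.0f GB/s\n", 2.0 * bytes / (ms * 1e-3) / 1e9);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```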
| Memory Benchmarks | nVidia RTX 3090 FE (Ampere) | nVidia RTX 3080 FE (Ampere) | nVidia RTX 2080Ti (Turing) | nVidia Titan V (Volta) | Comments |
|---|---|---|---|---|---|
| GPGPU Memory Latency: Global (In-Page Random Access) Latency (ns) | 156 [+16%] | 151 | 135 / 143 | 180 / 187 | Despite the higher clocks, latencies have gone up. |
| GPGPU Memory Latency: Global (Full Range Random Access) Latency (ns) | | | 243 / 248 | 311 / 317 | Full-range random accesses are 22% faster (Turing vs. Volta). |
| GPGPU Memory Latency: Global (Sequential Access) Latency (ns) | | | 40 / 43 | 53 / 57 | Sequential access latencies are ~25% lower (Turing vs. Volta). |
| GPGPU Memory Latency: Constant Memory (In-Page Random Access) Latency (ns) | | | 77 / 80 | 75 / 76 | Constant memory latencies are about the same. |
| GPGPU Memory Latency: Shared Memory (In-Page Random Access) Latency (ns) | | | 10.6 / 71 | 18 / 85 | Shared memory latencies have improved. |
| GPGPU Memory Latency: Texture (In-Page Random Access) Latency (ns) | | | 157 / 217 | 212 / 279 | Texture access latencies are ~26% lower. |
| GPGPU Memory Latency: Texture (Full Range Random Access) Latency (ns) | | | 268 / 329 | 344 / 313 | As with global memory, latencies are ~22% lower. |
| GPGPU Memory Latency: Texture (Sequential Access) Latency (ns) | | | 67 / 138 | 88 / 163 | With sequential access, a ~24% reduction. |
For now, Ampere’s GDDR6X brings higher latencies despite the great increase in clocks and bandwidth. Perhaps future versions will either increase clocks further (while maintaining timings) or tighten timings as better memory becomes available.
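Latencies like these are usually measured with a pointer-chase: each load’s address depends on the previous load, so the GPU cannot hide any of the latency. A minimal sketch (illustrative only; parameters and the host-side setup are assumptions):

```cpp
// Minimal pointer-chase latency sketch (illustrative): each load's address
// depends on the previous load, so the GPU cannot overlap them and the
// loop time directly exposes memory latency.
__global__ void chase(const unsigned* next, unsigned start, int hops,
                      unsigned* sink) {
    unsigned idx = start;
    for (int i = 0; i < hops; ++i)
        idx = next[idx];        // serialized, dependent loads
    *sink = idx;                // keep the compiler from optimising it away
}
```

On the host side, `next` would be filled with a random permutation whose footprint matches the range under test – small enough to stay within one DRAM page for the “in-page” figures, or spanning the whole allocation for “full range”.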


Final Thoughts / Conclusions

Executive Summary: Big, expensive but immensely powerful: 9/10 overall.

For compute workloads on mainstream cards, “Ampere” brings big gains (50-100%) when using FP16/FP32 precision – a sizeable improvement. The updated tensor cores also allow TF32/FP64 acceleration (for the first time), which greatly helps many algorithms (e.g. convolution: neural networks/AI, image processing, etc.). The increase in memory size and performance also allows much bigger kernels and data sets to run.

Still, as with all mainstream cards, FP64 performance is too reduced to be usable; for that you need either a full Titan (not consumer) or a professional card. If performance there (especially with the tensor cores now supporting FP64) proves similar to FP16/FP32, the gains will be significant.

GDDR6X and PCIe4 bring sizeable bandwidth increases (50%-2x), and while latencies seem to have gone up a bit, they are manageable and do not seem to affect performance. As mentioned, the top-end memory size (24GB) could be a game-changer if your dataset now fits.

Except for physical size (it takes 3 slots) and power (TDP is now up to 350W, up from 280-300W), there aren’t really any downsides to the new “Ampere”. Most systems should have adequate power supplies, thus no worries there.

In summary, even upgrading from previous Turing arch(itecture) cards is worthwhile, as the performance gains are significant; and since the old cards have held their value well, selling them can offset the cost, making the upgrade much cheaper. As algorithms get updated and data sets grow, we should see even higher performance gains.


