AMD Ryzen Threadripper 3970X, 3960X Review & Benchmarks – CPU Performance

What is “Ryzen ThreadRipper 3000 series” (Zen2-TR3)?

AMD’s Zen2 TR3 (“Starship”) is the “true” 2nd generation ZEN core on 7nm process shrink while the previous ZEN+ TR2 (“”) core was just an optimisation of the original ZEN TR that introduces many design improvements over both cores. Unlike desktop parts – a new socket “sTRX4” is introduced thus new mainboards based on the new chipset “TRX40” are required.

The list of changes vs. previous ZEN/ZEN+ is extensive thus performance delta is likely to be very different also:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 8MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 4 chiplets on desktop platform thus up to 4x 2x 4C (32C/64T TR 3970X)
  • 2x larger L3 cache per CCX thus up to 4x 2x 16MB (128MB) L3 cache (TR 3960X+)
  • 60 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 4x DDR4 memory controllers up to 4266Mt/s
  • New sTRX4 Socket with AMD TRX40 chipset

4 CCD “chiplets” (2 CCX each) + 1 I/O hub

Even with previous ThreadRippers AMD used to recommend UMA mode (interleave across nodes) rather than NUMA mode (interleave across channels only) – despite the fact that some CCXes naturally had to access memory through (other) CCXes with DRAM controllers at much higher latency. Now, with TR3 and sIO-based memory controllers there is a weaker affinity between cores/CCX/chiplets and DRAM controllers thus NUMA mode is pretty much gone.

SiSoftware has always “disagreed” with AMD on this – with our recommendation to always use NUMA mode and that even normal Ryzen should have a “fake NUMA” mode where each CCX/chiplet is a “NUMA node”. This would have instantly greatly improved performance as all NUMA-aware operating system (OS) schedulers would not migrate threads off NUMA modes (aka CCX/chiplets) nor would it require “Ryzen/TR custom power management profile” on Windows client/server.

While recent Windows kernels have improved matters (especially with custom profile), we still need to wait for 2020 kernel for better scheduling.

What’s new in the Ryzen2 core?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units 2x Fmacs (fixing a major deficiency in ZEN/ZEN+ cores)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies claim to be reduced through higher-speed memory (note all requests go through IF to Central I/O hub with memory controllers)
  • Load/Store 32bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run 1x or 1/2x vs. DRAM clock (when higher than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities bar two (BTI/”Spectre”, SSB/”Spectre v4″) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transitions latencies reduced (ACPI/CPPCv2)

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range ThreadRipper 3 (3970X, 3960X) with previous generation TR (2990X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Threadripper 3970X (Zen2) AMD Threadripper 2990(W)X (Zen+) AMD Threadripper 3960X (Zen2) Intel Core i9-10980X (CSL-X) AMD Ryzen 9 3950X (Zen2)
Comments
Cores (CU) / Threads (SP) 32C / 64T 32C / 64T 24C / 48T 18C / 36T 16C / 32T Same core count but still +78% more than CSL-X.
Topology 4 chiplets, each 2 CCX, each 4 cores (32C) 4 chiplet, 2 CCX, each 4 cores (32C) 4 chiplets, each 2 CCX, each 3 cores (24C) Monolithic die 2 chiplets, each 2 CCX, each 4 cores (16C) AMD uses discrete dies/chiplets unlike Intel.
Speed (Min / Max / Turbo) 3.7 [+23%] / 4.5GHz [+7%]
3.0 / 4.2GHz 3.8 / 4.5GHz 3.0 / 4.6GHz 3.8 / 4.6GHz Large base clock increase (+20%) over last gen.
Power (TDP / Turbo) 280W [+12%] 250W 280W 165 – 250W 105 – 135W TDP has gone up a bit [+12%]
L1D / L1I Caches 32x 32kB 8-way / 32x 32kB 8-way 32x 32kB 8-way / 32x 64kB 4-way 24x 32kB 8-way / 24x 64kB 4-way 18x 32kB 8-way / 8x 18kB 8-way 16x 32kB 8-way / 16x 32kB 8-way TR3 matches L1I with Intel (1/2x ZEN+ but 8-way), L1D is unchanged (also matches Intel)
L2 Caches 32x 512kB (16MB) 8-way 32x 512kB (16MB) 8-way 24x 512kB (12MB) 8-way 18x 1MB (18MB) 16-way 16x 512kB (8MB) 8-way No L2 changes and almost a match for Intel.
L3 Caches 8x 16MB (128MB) 16-way 8x 8MB (64MB) 16-way 8x 8MB (128MB) 16-way 24.75 11-way 4x 16MB (64MB) 16-way L3 is 2x ZEN/ZEN+ and thus 5x (five times) CSL-X
Mitigations for Vulnerabilities BTI/”Spectre”, SSB/”Spectre v4″ hardware BTI/”Spectre”, SSB/”Spectre v4″ software/firmware BTI/”Spectre”, SSB/”Spectre v4″ hardware RDCL/”Meltdown”, L1TF hardware, BTI/”Spectre”, MDS/”Zombieload”, software/firmware BTI/”Spectre”, SSB/”Spectre v4″ hardware TR3 addresses the remaining 2 vulnerabilities while Intel was forced to add MDS to its long list…
Microcode MU-8F3100-25 MU-8F0802-0B MU-8F3100-25 MU-065507-29 MU-8F7100-11 The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units 256-bit AVX/FMA3/AVX2 128-bit AVX/FMA3/AVX2 256bit AVX/FMA3/AVX2 512-bit AVX512 256bit AVX/FMA3/AVX2 TR3 finally matches Intel but CSL-X’s secret weapon is AVX512 with even consumer CPUs able to do 2x 512-bit FMA ops.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA but not AVX-512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks AMD Threadripper 3970X (Zen2) 32C/64T
AMD Threadripper 2990(W)X (Zen+) 32C/64T
AMD Threadripper 3960X (Zen2) 24C/48T
Intel Core i9-10980X (CSL-X) 18C/36T
AMD Ryzen 9 3950X (Zen2) 16C/32T
Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 1,488 [+33%]
1,119 1,208 779 753 Right off Ryzen2 demolishes all CPUs, it is 40% faster than CFL-R!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 1,458 [+30%]
1,121 1,197 835 750 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 905 [+30%]
696 733 459 464 Floating-point performance does not change delta either – still 40% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 779 [+33%] 584 631 379 393 With FP64 nothing much changes again.
Ryzen2 starts with an astonishing display, with 3900X demolishing both 9900X and 7900X winning all tests by a large margin 38-43%! It does have 50% more cores (12 vs. 8) but it is not easy to realise gains just by increasing core counts. Intel will need to add far more cores in future CPUs in order to compete!
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 4,125 [+2.13x]
1,932 3,309 2,341 1,873 Ryzen2 starts off by blowing CFL-R away by 47% and almost matching SKL-X with AVX512!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 1,485 [+2.25x]
659 1,206 913 744 With a 64-bit AVX2 integer vectorised workload, Ryzen2 is still 34% faster!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 35.43 [+75%]
20.22 20.52 12.92 12.98 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 is again 41% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 4,113 [+2x]
2,053 3,480 2,676 1,970 In this floating-point AVX/FMA vectorised test, Ryzen2 is now over 60% faster than CFL-R and not far off SKL-X!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 2,431 [+2.09x]
1,164 2,079 1,738 1,200 Switching to FP64 SIMD code, Ryzen2 is now 70% faster even beating SKL-X!!!
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 96.72 [+79%]
53.98 76.05 56.41 46.49 In this heavy algorithm using FP64 to mantissa extend FP128, Ryzen2 is still 53% faster!
With its brand-new 256-bit SIMD units, Ryzen2 finally goes toe-to-toe with Intel, soundly beating CFL-R in all benchmarks (+34-69%) sometimes by more than just core count increase (+50%). Only SKL-X with AVX512 manages to be faster (but also with its extra 2 cores). Intel had better release AVX512 for desktop soon but even that will not be enough without increasing core counts to match AMD.
BenchCrypt Crypto AES-256 (GB/s) 37.9 [+89%]
20 37.73 33.93 13 With AES/HWA support all CPUs are memory bandwidth bound – thus Ryzen2 scores less than Ryzen+ and CFL-R.
BenchCrypt Crypto AES-128 (GB/s) 37.92 [+90%]
20 37.83 33.93 13 What we saw with AES-256 just repeats with AES-128; Ryzen2 is again slower by 12%.
BenchCrypt Crypto SHA2-256 (GB/s) 72.1 [+56%]
46.18 57.21 33.52 28.58 With SHA/HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust – 2.5x faster than CFL-R and beating SKL-X with AVX512!
BenchCrypt Crypto SHA1 (GB/s) 53
BenchCrypt Crypto SHA2-512 (GB/s)
Ryzen2 with AES/SHA HWA is memory bound thus needs faster memory than 3200Mt/s in order to feed all the cores; otherwise due to increased contention for the same bandwidth it may end up slower than Ryzen+ and Intel designs. Here you see the need for HEDT platforms and thus ThreadRipper but at much increased cost.
BenchFinance Black-Scholes float/FP32 (MOPT/s)
BenchFinance Black-Scholes double/FP64 (MOPT/s) 933 [+84%]
506 599 497 308 Switching to FP64 code, nothing much changes, Ryzen2 55% faster than CFL-R.
BenchFinance Binomial float/FP32 (kOPT/s) Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) 233 [+21%]
193 220 128 124 With FP64 code Ryzen2 is still 55% faster!
BenchFinance Monte-Carlo float/FP32 (kOPT/s) Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 394 [+2.9x]
136 361 178 212 Switching to FP64 nothing much changes, Ryzen2 is 70% faster than CFL-R and still beating SKL-X.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: Ryzen2 is over 50% faster than CFL-R (+55-72%) and soundly beats SKL-X too! As before for financial algorithms there is only one choice and that is Ryzen, be it Ryzen1, Ryzen+ or Ryzen2!
BenchScience SGEMM (GFLOPS) float/FP32 In this tough vectorised algorithm Ryzen2.
BenchScience DGEMM (GFLOPS) double/FP64 190 240 165 With FP64 vectorised code, Ryzen2 matches CFL-R and SKL-X.
BenchScience SFFT (GFLOPS) float/FP32 FFT is also heavily vectorised but stresses the memory sub-system more;
BenchScience DFFT (GFLOPS) double/FP64 12.76 22.07 With FP64 code, Ryzen2 is 13% faster than CFL-R.
BenchScience SNBODY (GFLOPS) float/FP32 N-Body simulation is vectorised but fewer memory accesses;
BenchScience DNBODY (GFLOPS) double/FP64 391 292 388 With FP64 precision Ryzen2 is almost 2x faster than CFL-R.
With highly vectorised SIMD code Ryzen2 remains competitive but finds some algorithms tougher than others. The new 256-bit SIMD units help but it seems the cores are starved of bandwidth (especially due to SMT) and some workloads may perform better with SMT off.
CPU Image Processing Blur (3×3) Filter (MPix/s) 8,090 [+87%]
4,319 5,418 7,295 2,883 In this vectorised integer workload TR2 is 87% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 3,593 [+96%]
1,832 3,168 2,868 1,857 Same algorithm but more shared data makes TR2 2x faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 1,831 [+79%]
1,018 1,632 1,724 959 Again same algorithm but even more data shared still 50% faster
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 3,022 [+95%] 1,549 2,761 2,285 1,589 Different algorithm but still vectorised workload Ryzen2 is almost 60% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 333 [+47%] 226 293 332 174 Still vectorised code now Ryzen2 is 70% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 93.15 [+16%] 80.42 80.15 112 49.71 This test has always been tough for Ryzen but Ryzen2 is competitive.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 3,944 [+36%] 2,900 2,308 3,573 1,505 With integer workload, Intel CPUs seem to do much better.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 1,159 [+44%] 804 1,022 1,162 627 In this final test again with integer workload TR2 is 44% faster.
Thanks to AVX512 SKL-X does win all tests but Ryzen2 beats CFL-R between 20-74% with a few test mixing integer & floating-point SIMD instructions seemingly heavily favouring Intel but nothing to worry about. Overall for image processing Ryzen2 should be your 1st choice.

Ryzen2 (unlike Ryzen1/+) has no trouble with SIMD code due to its widened SIMD units (256-bit) and thus soundly beats the opposition into dust (CFL-R 9900K flagship) sometimes more than just core count increase alone (+50% i.e. 12 cores vs. 8). Sometimes it even beats the AVX512 opposition (SKL-X 7900K) with more cores (10 cores vs. 12).

The only “problematic” algorithms are the memory bound ones where the cores/threads (due to SMT we have 24!) are starved for data and due to contention we see performance lower than less-core devices. While larger caches help (thus the massive 4x 16MB L3 caches) higher clocked memory should be used to match the additional core requirements.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: Ryzen2 is phenomenal and a huge upgrade over Ryzen1/+ that (most) AM4 users can enjoy and Intel has no answer to. 10/10.

Just as original Ryzen forced Intel to increase (double really) core counts to match (from 4 to 6 then 8), Ryzen2 will force Intel to come up with even more (and better) cores in order to compete. 3900X with its 12-cores soundly beats CFL-R 9900K (8-cores) in just about all benchmarks and in some tests goes toe-to-toe with HEDT SKL-X AVX512-enabled (10-cores) except in memory-bound algorithms where the 4 DDR4 memory channels with 2x more bandwidth count. For that you need ThreadRipper!

Ryzen1/+ was already competitive with Intel on integer and floating-point (non-SIMD) workloads but would fare badly on SIMD (AVX/FMA3/AVX2) workloads due to its 128-bit units; Ryzen2 “fixes” this issue, with its 256-bit units matching Intel. Only SKL-X with its 512-bit units (AVX512) is faster and Intel will have to finally include AVX512 for consumer CPUs in order to compete (IceLake?).

For compute-bound workloads, the forthcoming 3950X with its 16-cores/32-threads brings unprecedented performance to the consumer/desktop segment pretty much unheard of just a few years ago when 4-core/8-threads (e.g. 7700K) were all you could hope for – unless paying a lot more for HEDT where 8/10-core CPUs were far far more expensive. Naturally we shall see how the reduced memory bandwidth affects its performance with likely very fast DDR4 memory (4300Mt/s+) required for best performance.

Let’s also remember than Ryzen2 adds hardware mitigation to its remaining 2 vulnerabilities while Intel has been forced to add MDS/”Zombieload” even to its very latest CFL-R that now loses its trump card: hardware RDCL/”Meltdown” fix not to forget the recommendation to disable SMT/Hyperthreading that would mean a sizeable performance drop.

What is astonishing is that TDP has remained similar and with a BIOS/firmware upgrade, owners of older 300-series boards can now upgrade to these CPUs – and likely not even change the cooler unit! Naturally for PCIe4.0 a 500-series board is recommended and 400-series boards do support more features in Ryzen2/+ but let’s remember than on Intel you can only go back/forward 1 generation even though there is pretty much no core difference from Skylake (Gen 6) to Coffeelake-R (Gen 9)!

From top-end (3950X), high-end (3800X) to low-end/APU (3200G) Ryzen2 is such a compelling choice it is hard to recommend anything else… at least at this time…

Tagged , , , , , . Bookmark the permalink.

Comments are closed.