AMD Ryzen 7 5800X-3D (Zen3 V-Cache) Review & Benchmarks – CPU & Cache Performance

What is “Zen3” (Ryzen 5000)?

AMD’s Zen3 (“Vermeer”) is the 3rd generation ZEN core, sold as the 5000-series of CPUs, and introduces further refinements of the ZEN2 core and layout. An APU version (with integrated graphics) is also available. The CPUs/APUs remain socket AM4 compatible on desktop, thus allowing in-place upgrade (subject to a BIOS update as always), but 500-series chipsets are recommended to enable all features (e.g. PCIe4, etc.). [Note this is the last CPU that will fit the AM4 socket; future CPUs supporting DDR5 need a new socket]

Unlike the ZEN2 transition, the main changes are to the core/cache layout, but they could still prove significant considering the cache/memory latency issues that have impacted ZEN designs:

  • (AMD) Claims +19% IPC (instructions per clock) overall improvement vs. ZEN2
    • Higher base and turbo clocks +7% [for 5800X vs. 3700X]
  • Still built around “chiplets” CCX (“core complexes”) but now of 8C/16T and larger L3 cache (still 7nm)
    • Same central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
    • Still up to 2 chiplets on desktop platform thus up to 2x 8C (16C/32T 5950X)
  • L3 is still the same 32MB but now unified (not 2x 16MB) still up to 64MB on 5950X
    • 3D V-Cache L3 is 96MB unified, thus 3x (!) larger than original Zen3
  • 20 PCIe 4.0 lanes
  • 2x DDR4 memory controllers up to 3200Mt/s official (4266Mt/s max) [future AM5 socket for DDR5 support]

What is the new Zen3-3D V-Cache (Ryzen 5000-3D)?

It is a version of the Zen3 chiplet with a vertically stacked L3 cache (hence the “3D” moniker) that is 3x larger (thus 96MB). Latency is expected to be slightly higher (+4 clocks) and bandwidth slightly lower (~10% less).

However, the sheer size of the L3 cache allows many (desktop) workloads’ data sets to be fulfilled directly from the L3 cache thus avoiding main memory access (with higher latencies and lower bandwidth). Inter-core/thread transfers of relatively large data sets (12MB/core) can also be fulfilled directly by the L3 cache.

Until recently, top-end 8-core Intel CPUs (e.g. 11900K, 10700K, etc.) had only 16MB L3 cache (1/2x normal Ryzen, 1/6x 3D Ryzen) – with only recent Intel “AlderLake” (ADL) 16-core (8C+8c) having a comparable 30MB L3 cache.
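
The per-core figure and the ratios quoted above follow from simple arithmetic on the L3 sizes. A minimal sketch, where the cache sizes come from the text and the function and constant names are ours, purely for illustration:

```python
# L3 cache sizes in MB, as quoted in the text above.
L3_INTEL_8C_MB = 16   # e.g. 11900K, 10700K
L3_ZEN3_MB = 32       # standard Ryzen 7 5800X
L3_ZEN3_3D_MB = 96    # Ryzen 7 5800X-3D
CORES = 8             # all of these are 8-core parts


def l3_per_core_mb(l3_mb: float, cores: int = CORES) -> float:
    """Average L3 per core in MB. The L3 is shared, so this is a mean, not a partition."""
    return l3_mb / cores


def l3_ratio(l3_mb: float, reference_mb: float) -> float:
    """Size of one L3 relative to a reference L3."""
    return l3_mb / reference_mb


print(f"3D Zen3 L3 per core: {l3_per_core_mb(L3_ZEN3_3D_MB):.0f} MB")
print(f"Intel vs. Zen3:      {l3_ratio(L3_INTEL_8C_MB, L3_ZEN3_MB):.2f}x")
print(f"Intel vs. 3D Zen3:   {l3_ratio(L3_INTEL_8C_MB, L3_ZEN3_3D_MB):.3f}x")
```

Since the L3 is unified, a single thread can in principle use far more than its 12MB average share; the per-core number is only a convenient yardstick.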

To upgrade from standard Zen3 or not?

Except for the new 3D V-Cache L3, there are no other major changes:

  • Minor stepping update (S2 vs. S0) with no major fixes
  • Requires AGESA V2 1.2.0.6+ for support – update BIOS before installing
  • Base and Turbo clocks are lower than normal Zen3 (5800X), thus raw compute power is lower

It all depends on the data set(s) of the workload(s) you are running:

  • Data sets that either entirely fit or can be significantly served in the 96MB L3 cache – will see significant uplift
  • Inter-core/thread data transfers that can entirely fit in the 3D L3 cache – will see significant uplift
  • Streaming workloads, or those with very large data sets, may show no uplift and may even be slower due to lower base/turbo clocks
  • Compute heavy algorithms with small data sets will be slower due to lower base/turbo clocks
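
The rules of thumb above can be sketched as a rough classifier on working-set size. This is only an illustration under a simplifying assumption of ours: the hard thresholds at exactly 32MB and 96MB are hypothetical, as in practice a data set somewhat larger than the cache can still be “significantly served” by it, and the function name is not from any tool.

```python
MB = 1024 * 1024
L3_ZEN3 = 32 * MB      # standard 5800X
L3_ZEN3_3D = 96 * MB   # 5800X-3D


def expected_effect(working_set_bytes: int) -> str:
    """Rough guess at how 3D Zen3 compares to standard Zen3 for a given working set.

    ASSUMPTION: hard cut-offs at the two L3 sizes; real behaviour is gradual.
    """
    if working_set_bytes <= L3_ZEN3:
        # Already fits in the standard L3, so the lower clocks dominate.
        return "slightly slower (lower clocks)"
    if working_set_bytes <= L3_ZEN3_3D:
        # Too big for 32MB but fits in 96MB: served from the 3D L3.
        return "significant uplift (fits 3D L3)"
    # Mostly streamed from main memory on both parts.
    return "little or no uplift (memory bound)"


for size_mb in (8, 64, 1024):
    print(f"{size_mb:5d} MB -> {expected_effect(size_mb * MB)}")
```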

Review

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-range Ryzen 7 5000-series (Zen3 8-core) with previous generation Ryzen 7 3000-series (Zen2 8-core) and competing architectures with a view to upgrading to a top-range, high performance design.

| CPU Specifications | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 5800X 8C/16T (Vermeer) | Intel Core i7 11700K 8C/16T (RocketLake) | Intel Core i7 12700 8C+4c / 20T (AlderLake) | Comments |
| --- | --- | --- | --- | --- | --- |
| Cores (CU) / Threads (SP) | 8C / 16T | 8C / 16T | 8C / 16T | 8C + 4c / 20T | Core counts remain the same. |
| Topology | 1 chiplet, 1 CCX, each 8 core (8C) + I/O hub | 1 chiplet, 1 CCX, each 8 core (8C) + I/O hub | Monolithic die | Monolithic die | Same topology |
| Speed (Min / Max / Turbo) (GHz) | 3.4 / 4.5GHz | 3.8 / 4.7GHz | 3.6 / 5GHz | 2.1+1.6 / 4.8+3.6 | Both base and turbo are down |
| Power (TDP / Turbo) (W) | 105 / 135W (PL2) | 105 / 135W (PL2) | 125 / 175W (PL2) | 65 / 180W (PL2) | Same TDP |
| L1D / L1I Caches (kB) | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB + 4x 48kB / 8x 48kB + 4x 32kB | No changes to L1 |
| L2 Caches (MB) | 8x 512kB (4MB) 8-way inclusive | 8x 512kB (4MB) 8-way inclusive | 8x 512kB (4MB) | 8x 1.25MB + 2MB | No changes to L2 |
| L3 Caches (MB) | 96MB 16-way exclusive [+3x] | 32MB 16-way exclusive | 16MB 16-way | 25MB 11-way | 3x larger L3 |
| Mitigations for Vulnerabilities | BTI/“Spectre”, SSB/“Spectre v4” hardware | BTI/“Spectre”, SSB/“Spectre v4” hardware | BTI/“Spectre”, SSB/“Spectre v4” software/firmware | BTI/“Spectre”, SSB/“Spectre v4” software/firmware | No new fixes required… yet! |
| Microcode (MU) | A20F12-05 | A20F10-16 | 0A0671-50 | 090672-15 | The latest microcodes have been loaded. |
| SIMD Units | 256-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 512-bit AVX512 | 256-bit AVX/FMA3/AVX2 | Same SIMD widths |
| Price/RRP (USD) | $449 | $449 | $399 | $349 | Same price as normal version |

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. AMD, etc.). All trademarks acknowledged and used for identification only under fair use.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Zen3 supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA but not AVX-512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

| Native Benchmarks | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 5800X 8C/16T (Vermeer) | Intel Core i7 11700K 8C/16T (RocketLake) | Intel Core i7 12700 8C+4c / 20T (AlderLake) | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Multi-Core Benchmark Total Inter-Thread Bandwidth – Best Pairing (GB/s) | 99.2 [+8%] | 91.55 | 77.89* | 94.69 | 3D Zen3 has 8% more overall bandwidth |

As the 3D L3 is the “star of the show”, we start with the inter-thread benchmark, where we see a +8% overall bandwidth improvement over the original Zen3: even large block transfers between threads can be fulfilled by the 3D L3 cache and no longer need to go through much slower system memory.

This should benefit all algorithms that process larger data blocks which cannot fit even in the generous 32MB of the original Zen3 L3 cache. Note that all but the most recent CPUs had at most 16MB of L3, if not much less, so even the original Zen3 already has the largest L3 in the business.

Note*: using AVX512 512-bit wide transfers.

| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Multi-Core Benchmark Average Inter-Thread Latency (ns) | 22.6 [+13%] | 20 | 29.3 | 42.6 | 3D Zen3 is 13% slower. |
| CPU Multi-Core Benchmark Inter-Thread Latency (Same Core) Latency (ns) | 10.5 [+9%] | 9.6 | 13.4 | 14.6 | Same-core latency is also 9% slower. |
| CPU Multi-Core Benchmark Inter-Core Latency (big Core, same Module) Latency (ns) | 23.5 [+13%] | 20.8 | 30.4 | 38.9 | Similar 13% slower than Zen3. |
| CPU Multi-Core Benchmark Inter-Core (Little Core, same Module) Latency (ns) |  |  |  | 51.2 | n/a |
| CPU Multi-Core Benchmark Inter-Big-Little Latency (Same Module) Latency (ns) |  |  |  | 56.4 | n/a |

Surprisingly, 3D Zen3’s inter-core latencies are somewhat higher than we’d expect from the clock difference alone (+5%); most likely this is some configuration issue. In any case, they are still much lower than the competition (Intel) and this has not changed.
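
To separate the clock effect from the cache effect, the measured latencies can be converted from nanoseconds into core clock cycles. A minimal sketch, assuming (our assumption, not something the benchmark reports) that both CPUs hold their maximum turbo clock during the test:

```python
def ns_to_clocks(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in ns to core clock cycles (1 GHz = 1 cycle per ns)."""
    return latency_ns * clock_ghz


# Average inter-thread latencies from the table above.
ZEN3_NS, ZEN3_3D_NS = 20.0, 22.6
# ASSUMPTION: both parts sustain max turbo (from the specification table).
ZEN3_GHZ, ZEN3_3D_GHZ = 4.7, 4.5

zen3_clocks = ns_to_clocks(ZEN3_NS, ZEN3_GHZ)
zen3_3d_clocks = ns_to_clocks(ZEN3_3D_NS, ZEN3_3D_GHZ)
extra_clocks = zen3_3d_clocks - zen3_clocks

# Latency 3D Zen3 would show if it needed the same number of cycles as Zen3
# and only the lower clock made a difference.
clock_only_ns = zen3_clocks / ZEN3_3D_GHZ

print(f"Zen3:    {zen3_clocks:.1f} clocks")
print(f"3D Zen3: {zen3_3d_clocks:.1f} clocks ({extra_clocks:+.1f})")
print(f"Clock-only expectation: {clock_only_ns:.1f} ns vs. {ZEN3_3D_NS} ns measured")
```

Under that assumption the lower clock alone would account for roughly 20.9ns, so part of the measured 22.6ns comes from additional cycles rather than from frequency.
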

| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 325 [-4%] | 339 | 224 | 378 | 3D Zen3 is 4% slower than the normal version. |
| CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 338 [-1%] | 343 | 207 | 377 | With a 64-bit integer workload, it’s 1% slower |
| CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 276 [-5%] | 290 | 165 | 280 | Floating-point performance is 5% slower |
| CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 227 [-7%] | 243 | 139 | 207 | With FP64 we’re down 7%, still beating Intel |

3D Zen3 is about 4-5% slower than normal Zen3, which is exactly what we’d expect from the lower clocks (-5%) in these legacy integer/floating-point benchmarks: they fit entirely in the L1/L2 caches and cannot take any advantage of the immense new L3 cache.
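
The bracketed percentages can be reproduced from the raw scores and set against the clock deficit. A small sketch using only figures from this review (the helper name `pct_change` is ours):

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# (5800X-3D, 5800X) pairs taken from the tables in this review.
CLOCKS_GHZ = {
    "Base clock (GHz)": (3.4, 3.8),
    "Turbo clock (GHz)": (4.5, 4.7),
}
SCORES = {
    "Dhrystone Integer (GIPS)": (325, 339),
    "Dhrystone Long (GIPS)": (338, 343),
    "Whetstone FP32 (GFLOPS)": (276, 290),
    "Whetstone FP64 (GFLOPS)": (227, 243),
}

for name, (x3d, std) in {**CLOCKS_GHZ, **SCORES}.items():
    print(f"{name:26s} {pct_change(x3d, std):+6.1f}%")
```

The turbo deficit works out at about -4% and the base deficit at about -11%, so score drops between those two values are consistent with a purely clock-bound workload.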

Against the competition, the situation does not change much, with Zen3 still competitive against Intel’s ADL.

| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 1,894 [-5%] | 1,997 | 1,428* | 1,361 | 3D Zen3 is again 5% slower than Zen3 as expected |
| BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 650 [-6%] | 691 | 363* | 540 | With a 64-bit integer workload nothing changes. |
| BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 115 [-6%] | 122 | 78.8* | 98.2 | This is a tough test using Long integers to emulate Int128; nothing changes. |
| BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 1,712 [-7%] | 1,847 | 890* | 1,413 | In this floating-point test, we’re 7% slower. |
| BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 876 [-7%] | 946 | 446* | 787 | Switching to FP64 code, nothing changes. |
| BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 36.23 [-5%] | 38.3 | 22.5* | 40.5 | In this heavy algorithm using FP64 to mantissa-extend FP128, nothing changes. |

Even in heavy compute SIMD vectorised algorithms we see the same results, ~5% slower than normal Zen3 as expected. This is due to the relatively small data set (Mandelbrot fractal bitmap) that already fits in “normal size” L3 caches.

If we were to use a much larger data set (e.g. 64MB) that would overwhelm the smaller caches but still fit in the new 3D V-Cache, we would see a benefit. We are looking to provide benchmark configurations with larger datasets in order to show this benefit if such caches become mainstream.

Note*: using AVX512 instead of AVX2/FMA.

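| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| BenchCrypt Crypto AES-256 (GB/s) | 20.43 [+5%] | 19.39 | 23.53*** | 18.5*** | 3D Zen3 sees a 5% improvement over normal Zen3. |
| BenchCrypt Crypto AES-128 (GB/s) | 20.43 [+5%] | 19.39 | 23.71*** | 18.5*** | What we saw with AES-256 just repeats with AES-128. |
| BenchCrypt Crypto SHA2-256 (GB/s) | 25.01** [-5%] | 26.25** | 14.45* | 22.41 | With SHA/HWA we return to 5% slower. |
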
BenchCrypt Crypto SHA1 (GB/s) 28.56** 39.94* The less compute-intensive SHA1 does not change things due to acceleration.
While streaming tests (crypto/hashing) are memory bound, 3D Zen3 does see a small improvement (+5%) in AES but the same drop (-5%) in SHA – thus overall pretty much tied with original Zen3.

Again, should our dataset fit entirely in the L3 cache, or be significantly serviced by it, we would see a big improvement over the original Zen3. But with a large dataset (up to 16GB total on 32GB systems) the size of the L3 cache is of little benefit. Again, perhaps allowing configurable-size data sets is an idea should these large L3 caches become mainstream.

Note***: using VAES 256-bit (AVX2) or 512-bit (AVX512)

Note**: using SHA HWA not SIMD (e.g. AVX512, AVX2, AVX, etc.)

Note*: using AVX512 not AVX2.
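
To see why the L3 size barely matters for the streaming crypto/hashing tests above, compare the cache size with the data set size. A minimal sketch; it computes only an upper bound on the share of the data that can be cache-resident and makes no claim about how the benchmark actually walks its buffers:

```python
MB = 1024 ** 2
GB = 1024 ** 3


def max_cached_fraction(cache_bytes: int, dataset_bytes: int) -> float:
    """Upper bound on the fraction of a data set that can sit in the cache at once."""
    return min(1.0, cache_bytes / dataset_bytes)


for dataset, label in ((16 * GB, "16GB streaming set"), (64 * MB, "64MB set")):
    for cache, name in ((32 * MB, "Zen3 32MB"), (96 * MB, "3D Zen3 96MB")):
        share = max_cached_fraction(cache, dataset)
        print(f"{label:20s} {name:14s} {share:8.3%}")
```

With a 16GB data set even the 96MB cache can hold well under 1% of the data at any time, whereas a 64MB data set fits entirely in the 3D L3 but only half of it fits in the standard 32MB L3.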

BenchFinance Black-Scholes float/FP32 (MOPT/s) 371 The standard financial algorithm.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 312 [-6%] 332 150 347 Switching to FP64 code, 3D Zen3 is 6% slower.
BenchFinance Binomial float/FP32 (kOPT/s) 162 Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) 91.5 [-7%] 98.7 41.5 105 With FP64 code 3D Zen3 is 7% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 292 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 124 [-6%] 132 54.4 138 No improvement here either.
Ryzen always did well on non-SIMD floating-point algorithms and here 3D Zen3 performs as expected: it is about 6% slower than normal Zen3. Again, now that the L3 cache is so big, we need updated algorithms that can buffer into it in order to see improvements.
BenchScience SGEMM (GFLOPS) float/FP32 410 553* In this tough vectorised algorithm that is widely used (e.g. AI/ML).
BenchScience DGEMM (GFLOPS) double/FP64 332 [+74%] 191 211* 178 With FP64 3D Zen finally sees big uplift.
BenchScience SFFT (GFLOPS) float/FP32 23.7 30.6* FFT is also heavily vectorised but stresses the memory sub-system more.
BenchScience DFFT (GFLOPS) double/FP64 12.37 [=] 12.43 14.57* 11.24 With FP64 code, scores are tied.
BenchScience SNBODY (GFLOPS) float/FP32 518 606* N-Body simulation is vectorised but fewer memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 217 [-6%] 231 108* 165 With FP64 precision 3D Zen3 is 6% slower.
The main news here is that with a dataset that fits in the 3D L3 cache in GEMM – we see a 74% improvement over normal Zen3. GEMM is already using the L1D caches to buffer the tiles for higher performance – but here we see the huge improvement the L3 cache makes if the whole dataset fits the L3 cache.
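
As a rough illustration of what “fits in the L3 cache” means for GEMM, the sketch below estimates the largest square matrix dimension for which all three matrices (A, B and C) are cache-resident at once. The benchmark’s actual matrix sizes are not stated in this review, so the three-matrix working set and the function names are our assumptions:

```python
import math

MB = 1024 ** 2


def gemm_working_set_bytes(n: int, elem_bytes: int = 8) -> int:
    """Bytes needed for three n x n matrices (A, B, C); FP64 by default."""
    return 3 * n * n * elem_bytes


def max_square_gemm_dim(cache_bytes: int, elem_bytes: int = 8) -> int:
    """Largest n such that three n x n matrices fit in cache_bytes.

    ASSUMPTION: the cache holds nothing else (no code, stacks or other data).
    """
    return math.isqrt(cache_bytes // (3 * elem_bytes))


for name, cache in (("Zen3 32MB", 32 * MB), ("3D Zen3 96MB", 96 * MB)):
    n = max_square_gemm_dim(cache)
    used = gemm_working_set_bytes(n) / MB
    print(f"{name:13s} DGEMM up to {n} x {n} ({used:.1f} MB)")
```

Under these assumptions an FP64 problem between roughly 1,182 and 2,048 per side would spill out of the standard L3 yet fit in the 3D L3, which is the kind of window where a large uplift would be expected.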

Note*: using AVX512 not AVX2/FMA3.

| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Image Processing Blur (3×3) Filter (MPix/s) | 3,469 [-5%] | 3,642 | 3,803* | 3,430 | In this vectorised integer workload 3D Zen3 is 5% slower. |
| CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 1,299 [-5%] | 1,372 | 1,907* | 1,353 | Same algorithm but more shared data; no changes. |
| CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 667 [-5%] | 703 | 981* | 679 | Again same algorithm but even more data shared; no change. |
| CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 1,094 [-6%] | 1,162 | 1,523* | 1,146 | Different algorithm but still vectorised; no change. |
| CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 115 [-7%] | 123 | 236* | 103 | Still vectorised code but no change |
| CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 35.81 [-5%] | 37.89 | 72.7* | 54.54 | This test has always been tough but still no change. |
| CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 3,917 [+23%] | 3,190 | 3,739* | 3,755 | With integer workload, we see an unexpected 23% improvement |
| CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 490 [-3%] | 507 | 951* | 764 | In this final test we see little change. |

Again, if the dataset is too small and thus can fit in the normal L3 caches (e.g. 32MB) – you’re not going to see benefit from the much larger 3D V-Cache. In all other respects, 3D Zen3 performs as expected.

Note*: using AVX512 not AVX2/FMA3.

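| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| Aggregate Score (Points) | 12,050 [+3%] | 11,740 | 9,610* | 10,790 | Across all benchmarks, 3D Zen3 is 3% faster! |
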
Despite 3D Zen3 being 5% slower in most compute benchmarks, the cache-sensitive benchmarks (Inter-Core Transfer, Crypto AES) do manage to bring it to 3% faster than the normal Zen3 overall, which is a great result.

Note*: using AVX512 not AVX2/FMA3.

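| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| Price/RRP (USD) | $449 [=] | $449 | $399 | $349 | Price stays the same. |
| Price Efficiency (Perf. vs. Cost) (Points/USD) | 26.84 [+3%] | 26.15 | 24.09 | 30.92 | Small 3% efficiency gain, in line with performance. |
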
As AMD has kept the cost the same – 3D Zen3 sees the same improvement as overall performance: +3%. This means it is still below Intel’s latest ADL competition – that is much cheaper and thus more “bang-per-buck” despite lower overall performance. How the tables have turned!
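
| Native Benchmarks | Ryzen 7 5800X-3D | Ryzen 7 5800X | Core i7 11700K | Core i7 12700 | Comments |
| --- | --- | --- | --- | --- | --- |
| Power/TDP (W) | 105-135 [=] | 105-135 | 125-175 | 65-180 | TDP has remained the same. |
| Power Efficiency (Perf. vs. Power) (Points/W) | 114.7 [+3%] | 111.8 | 76.88 | 166 | As TDP is the same, we see same improvement. |
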
As AMD has kept the TDP the same – and lowered clocks to make sure that actual power consumed is kept in check – we see the same performance uplift as overall performance. Perhaps AMD could have just kept the clocks the same to ensure an outright victory over the normal Zen3.
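
Both efficiency metrics are plain ratios of the aggregate score. The sketch below recomputes them from the figures in this review; that the power figure divides by the base TDP rather than the PL2 turbo limit is our inference from the published numbers, which it matches to within rounding:

```python
# name: (aggregate score in points, price in USD, base TDP in W), from this review.
CPUS = {
    "Ryzen 7 5800X-3D": (12050, 449, 105),
    "Ryzen 7 5800X": (11740, 449, 105),
    "Core i7 11700K": (9610, 399, 125),
    "Core i7 12700": (10790, 349, 65),
}


def price_efficiency(points: float, price_usd: float) -> float:
    """Points per USD."""
    return points / price_usd


def power_efficiency(points: float, tdp_w: float) -> float:
    """Points per watt. ASSUMPTION: base TDP, not the PL2 turbo limit."""
    return points / tdp_w


for name, (points, usd, tdp) in CPUS.items():
    print(f"{name:17s} {price_efficiency(points, usd):6.2f} pts/USD"
          f" {power_efficiency(points, tdp):7.2f} pts/W")
```

Dividing by the turbo power limits instead (135W vs. 180W) would narrow the power-efficiency gap to AlderLake considerably, so the base-TDP choice matters for that comparison.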


Final Thoughts / Conclusions

Summary: Recommended if moving to Zen3 from older versions of Ryzen (8/10).

Perhaps the biggest issue with the 3D V-Cache Zen3 is that the original (standard) Zen3 is too good, so the enormous 3D cache does not make more of a difference. The original Zen3’s L3 cache (32MB) is already large (compared to all but the most recent CPUs, especially the Intel competition), provides good bandwidth, has reasonably low latencies, and is already unified!

As 3D Zen3 has lower base/turbo clocks, it is already at a bit of a disadvantage against the original Zen3, and in raw compute workloads it is naturally ~5% slower. In workloads with small data sets that already fit in the original L3 cache (32MB), the higher latency and slightly lower bandwidth of the 3D L3 cache make it slightly slower than the original Zen3 yet again.

We do see good gains in inter-thread transfer bandwidth (when larger blocks are transferred between threads) of about +8% overall, and cache & memory bandwidth is about +20% higher overall (when larger blocks are read/written), which can improve some algorithms (e.g. GEMM) by over 70%. But it all depends on the dataset size.

If you work with datasets comparable to the new 3D L3 cache size, you will see a big uplift in performance. Otherwise, you may well see a small decrease in performance. Thus it is a very niche product, but at the same price point & TDP it is the one we’d choose over the original if moving to Zen3 from older versions. In effect, it is the “top-end” 8-core AM4 socket Ryzen!

But for “top-end” AM4 socket performance there are higher core-count Zen3 versions, all the way up to the 16-core 5950X, which also have a larger (2x 32MB aka 64MB) total L3 cache and may also be “upgraded” to 3D V-Cache at some point. With more cores/threads on their side, the 3D Zen3 cannot be expected to match/beat them with just an L3 cache upgrade.


