AMD Ryzen 7 7800X-3D (Zen4 V-Cache) Review & Benchmarks – VCache for the Win!

What is “Zen4” (Ryzen 7000)?

AMD’s Zen4 (“Raphael”) is the 4th generation ZEN core – aka the new 7000-series of CPUs from AMD – that brings brand new features like AVX512 ISA (instruction set) support, DDR5 and PCIe5. These do require a brand new platform (AM5), almost a decade after the current AM4 platform was launched, before even the 1st generation Ryzen. With any luck, it will remain for the next 4 or even more CPU generations, unlike the 2-generation support on the competitor’s (Intel) platform.

Zen4 contains only big/P(erformance) cores and it is not a hybrid design. It remains to be seen if AMD will launch such hybrid (big/LITTLE) products that, in our opinion, are too problematic on desktop platforms for the benefits they bring. Even on mobile platforms where efficiency is a top priority – workloads do not easily lend to a hybrid design despite huge work done on the Windows scheduler for Windows 11. In this regard, a non-hybrid design like Zen4 is very much preferred.

AVX512 is a huge boost for compute performance as we’ve seen on Intel since SKL-X (Skylake-X). There is a reason it exists along with all its extensions (IFMA, VNNI, VAES, etc.), and it is not unexpected that even basic usage can bring up to 100% (2x) performance improvement, and even more with specific instructions. While originally CPUs would reduce clocks due to the extra power drawn – this has pretty much been mitigated in modern designs. Even Centaur (before Intel bought them) had AVX512-enabled (LITTLE) cores.

While here AMD has implemented it as 2x 256-bit ops (similar to previous AVX2/FMA3 in Zen1/1+/2 implemented as 2x 128-bit) – we still benefit from 2x more registers + 2x wider registers (4x overall), arguably better instruction specification, optimised extensions (IFMA, VNNI, VAES, etc.) that overall can still build up to a big improvement over old AVX2/FMA3.
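
The “4x overall” figure follows from the size of the architectural register file. A minimal sketch of that arithmetic, assuming 64-bit mode where AVX2 exposes 16 registers of 256 bits and AVX512 exposes 32 registers of 512 bits (the function name is only illustrative):

```python
def simd_register_file_bytes(num_registers: int, width_bits: int) -> int:
    """Total architectural SIMD register capacity in bytes."""
    return num_registers * width_bits // 8

# Assumption: 64-bit mode register counts (16 x YMM for AVX2, 32 x ZMM for AVX512).
avx2_bytes = simd_register_file_bytes(16, 256)
avx512_bytes = simd_register_file_bytes(32, 512)

# 2x more registers times 2x wider registers gives 4x the capacity.
ratio = avx512_bytes // avx2_bytes
print(avx2_bytes, avx512_bytes, ratio)
```

This only counts the registers visible to software; it says nothing about how wide the execution units are, which is why Zen4 can offer the larger register file while still executing 512-bit operations as 2x 256-bit.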

  • 5nm process (TSMC) for CCX (vs. 7nm on Zen3) for better efficiency and clocks
  • 6nm process (TSMC) for I/O hub (vs. 12nm for Zen3) for better memory speeds
    • claimed 13% IPC increase vs. Zen3 + clock increase uplift => ~29% total uplift vs. Zen 3
  • AVX512 instruction support, with potential 100%+ improvement in optimised workloads
    • Executed as 2x 256-bit (not true 512-bit like Intel) but still many benefits over AVX2/FMA3
    • Specific AVX512 extensions (IFMA, VNNI, VAES, etc.) can bring well over 100% improvement
  • DDR5 support up to 5200Mt/s (official) for much higher memory bandwidth vs. DDR4 Zen3
    • Unofficial support for at least 6400Mt/s with XMP3/EXPO profiles
    • AMD says 6000Mt/s is the “sweet-spot” for performance/value
  • 1MB L2 per core (2x vs. 512kB on Zen3)
  • Standard L3 is the same 32MB, V-Cache the same 96MB
  • PCIe5 support, up to 24 lanes (2x bandwidth vs. PCIe4)
  • Still up to 2 chiplets (at launch) thus up to 2x 8C big/P cores (16C/32T on 7950X)
  • Much higher both base and turbo speeds in most variants, e.g. 7950X
    • Higher base 4.5GHz of standard CCX (vs. 3.4GHz on 5950X +32% clock uplift)
    • Higher base 4.2GHz of V-Cache CCX (vs. 3.4GHz on 5950X +24% clock uplift)
    • Higher turbo 5.7GHz (vs. 4.9GHz on 5950X +16% clock uplift)
  • TDP has increased to 120W (vs. 105W on 5950X) thus 14% higher
    • Turbo (PPT aka PL2) around 160W (vs. 142W on 5950X) thus 13% higher
    • Note that other models (e.g. 7700X) have kept the same TDP/Turbo
  • Built-in Radeon Graphics (RDNA2) core
    • 2CU / 128SP 400-2.2GHz cores for very basic graphics
AMD Zen4-3D (Ryzen 7800X-3D), V-Cache CCX + I/O

What is the new Zen4-3D V-Cache (Ryzen 7000-3D)?

It is a version of the Zen4+ chiplet/CCX with vertically stacked (thus the 3D(imensions) moniker) L3 cache that is 3x larger (thus 96MB). The latency is expected to be slightly higher (+4 clocks) and bandwidth slightly lower (~10% less).

Originally, AMD launched the asymmetric/hybrid (VCache CCX + Standard CCX) dual CCX processors (7950X-3D, 7900X-3D) – likely to benefit from early adopters. Now we finally have the cheaper, single-VCache CCX version (7800X-3D).

Similar to Zen3-3D – the clocks (Turbo) of the cores on the V-Cache CCX (5.25GHz) are lower than the standard CCX (5.7GHz).

To upgrade from standard Zen4 or not?

Except for the new 3D V-Cache L3, there are no other major changes:

  • Minor stepping update (S2 vs. S0) with no major fixes
    • Base and Turbo clocks of standard CCX are the same as original Zen4 (e.g. 7950X)
    • Base clocks of V-Cache CCX are lower than original Zen4, thus raw compute power is lower
  • AMD provided Windows driver to migrate threads to the “proper” CCX while parking other CCX
    • Games scheduled on V-Cache/slow CCX
    • Normal workloads scheduled on standard/fast CCX
    • This assumes the workload uses 16-threads or less

It all depends on the data set(s) of the workload(s) you are running:

  • Data sets that either entirely fit or can be significantly served in the 96MB L3 cache – will see significant uplift
  • Inter-core/thread data transfers that can entirely fit in the 3D L3 cache – will see significant uplift
  • Streaming workloads, or workloads with very large data sets, may show no uplift and may even be slower due to lower base/turbo clocks
  • Compute heavy algorithms with small data sets will be slower due to lower base/turbo clocks
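
As a rough rule of thumb, the cases above can be sketched as a simple size check. This is only an illustration under stated assumptions: the 32MB and 96MB sizes are the L3 capacities quoted in this review, while the function name and the three labels are hypothetical, and real behaviour also depends on access patterns, L2 contents and other running processes.

```python
MB = 1024 * 1024
STANDARD_L3 = 32 * MB  # standard Zen4 CCX L3, as quoted in this review
VCACHE_L3 = 96 * MB    # Zen4-3D V-Cache CCX L3, as quoted in this review

def l3_outlook(working_set_bytes: int) -> str:
    """Classify a working set by which L3 size it fits in (hypothetical labels)."""
    if working_set_bytes <= STANDARD_L3:
        # Already served by the standard L3: clocks decide, V-Cache adds little.
        return "fits-standard-l3"
    if working_set_bytes <= VCACHE_L3:
        # Spills on standard Zen4 but fits the V-Cache: best case for uplift.
        return "fits-vcache-only"
    # Larger than the V-Cache: only partially served from L3.
    return "exceeds-vcache"

for size_mb in (16, 64, 512):
    print(size_mb, "MB ->", l3_outlook(size_mb * MB))
```

Only the middle case is where the 3D part is expected to pull clearly ahead of the standard part; in the other two, the lower clocks of the V-Cache CCX tend to dominate.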

Review

In this article we test CPU core performance.

Hardware Specifications

We are comparing the Ryzen 7 7000-series (Zen4 3D) with the standard Ryzen 7 and competing architectures with a view to upgrading to a top-range, high performance design.

| CPU Specifications | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Cores (CU) / Threads (SP) | 8C / 16T | 8C / 16T | 8C / 16T | 8C+8c / 24T | Core counts remain the same. |
| Topology | 3D/CCX 8C + I/O hub | 3D/CCX 8C + I/O hub | CCX 8C + I/O hub | Monolithic die | Same topology but asymmetric |
| Speed (Min / Max / Turbo) (GHz) | 4.2 [+23%] – 5.0GHz [+11%] | 3.4 – 4.5GHz | 4.5 – 5.7GHz | 3.4 – 5.4GHz / 2.5 – 4.2GHz | Base up 23%, turbo 11% |
| Power (TDP / Turbo) (W) | 120 – 253W [+14%] | 105 – 135W | 105 – 142W | 125 – 253W | TDP is 14% higher |
| L1D / L1I Caches (kB) | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 64kB + 8x 32kB / 8x 32kB + 8x 48kB | No changes to L1 |
| L2 Caches (MB) | 8x 1MB (8MB) 8-way | 8x 512kB (4MB) 8-way | 8x 1MB (8MB) 8-way | 8x 2MB + 2x 4MB [24MB] | L2 is 2x larger |
| L3 Caches (MB) | 96MB 16-way exclusive | 96MB 16-way exclusive | 32MB 16-way exclusive | 20MB 16-way | L3 is the same |
| Mitigations for Vulnerabilities | BTI/”Spectre”, SSB/”Spectre v4″ hardware | BTI/”Spectre”, SSB/”Spectre v4″ hardware | BTI/”Spectre”, SSB/”Spectre v4″ hardware | BTI/”Spectre”, SSB/”Spectre v4″ hardware | No new fixes required… yet! |
| Microcode (MU) | A60F12-1203 | A20F12-1205 | A60F12-03 | 0B0671-10E | The latest microcodes have been loaded. |
| SIMD Units | 2x 256-bit (512-bit total) AVX512+ | 256-bit AVX/FMA3/AVX2 | 2x 256-bit (512-bit total) AVX512+ | 256-bit AVX/FMA3/AVX2 | Same SIMD widths |
| Price/RRP (USD) | $449 | $449 | $399 | $419 | Same price as old 3D |

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. AMD, etc.). All trademarks acknowledged and used for identification only under fair use.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets. Zen4 supports all modern instruction sets including AVX2/FMA3 and crypto SHA HWA, but also AVX-512 and its extensions (IFMA, VNNI, VAES, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
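
The bracketed percentages in the tables below appear to be the change of the 7800X-3D result relative to the 5800X-3D (Zen3-3D) result; this is an assumption inferred from the numbers rather than something stated by the benchmark. A minimal sketch of that calculation, with a hypothetical helper name:

```python
def uplift_percent(new: float, old: float) -> int:
    """Relative change of `new` versus `old`, rounded to a whole percent."""
    return round((new / old - 1.0) * 100.0)

# Inter-thread bandwidth (GB/s): higher is better, so a positive value is a gain.
bandwidth_change = uplift_percent(116, 99.2)

# Inter-thread latency (ns): lower is better, so a negative value is a gain.
latency_change = uplift_percent(19.1, 22.6)

print(bandwidth_change, latency_change)  # 17 -15
```

Note that for the latency tests the interpretation is inverted: a negative bracketed value means the newer CPU is better.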

Environment: Windows 11 x64 (21H2), latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Total Inter-Thread Bandwidth – Best Pairing (GB/s) | 116* [+17%] | 99.2 | 110* | 128 | 3D Zen4 has 17% more bandwidth |

As the 3D/V-Cache L3 is the “star of the show” – we start with the inter-thread benchmark – where we see a +17% overall bandwidth improvement over the Zen3-3D and also higher than the Zen4-standard. Even large data block transfers between threads can be fulfilled by the 3D L3 cache and do not need to go through much slower system memory anymore.

RPL still has higher bandwidth – but that is largely thanks to the extra 8 little cores with their own L1D and shared L2 caches. Let’s note that all but the most recent CPUs had at most 16MB L3, if not much less, and 96MB L3 is huge for a desktop processor.

Note*: using AVX512 512-bit wide transfers.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Average Inter-Thread Latency (ns) | 19.1 [-15%] | 22.6 | 16.8 | 36.3 | 15% less latency than Zen3-3D |
| Inter-Thread Latency (Same Core) (ns) | 8.9 [-15%] | 10.5 | 7.9 | 11.4 | 15% lower latency than Zen3-3D |
| Inter-Core Latency (big Core, same Module) (ns) | 19.8 [-16%] | 23.5 | 17.4 | 33 | 16% lower latency than Zen3-3D |
| Inter-Core Latency (Little Core, same Module) (ns) | | | | 42.5 | n/a |
| Inter-Module Latency (ns) | | | | | Single CCX |

Overall, Zen4-3D has 15% lower latencies than Zen3-3D which is a big result, but naturally higher latencies than the standard Zen4 that runs at higher clocks.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Native Dhrystone Integer (GIPS) | 538 [+65%] | 326 | 612 | 211 | Z4-3D is 65% faster than Z3-3D. |
| Native Dhrystone Long (GIPS) | 551 [+63%] | 338 | 637 | 180 | With a 64-bit integer workload, nothing changes |
| Native FP32 (Float) Whetstone (GFLOPS) | 319 [+16%] | 276 | 350 | 175 | Floating-point performance is 16% faster |
| Native FP64 (Double) Whetstone (GFLOPS) | 263 [+16%] | 227 | 295 | 147 | With FP64 data nothing changes |

Zen4-3D is about 40% faster than the old Zen3-3D – but naturally cannot beat the standard Zen4 with its much higher clocks. In these legacy integer/floating-point benchmarks the V-Cache does not help and clocks rule.

In any case, Zen4 (3D or not) soundly beats Intel’s latest RPL (13700K), which is equivalent to the top-of-the-range ADL (12900K) of yesteryear.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Native Integer (Int32) Multi-Media (Mpix/s) | 2,367* [+25%] | 1,894 | 2,648* | 2,207 | Z4-3D is 25% faster than old Z3-3D |
| Native Long (Int64) Multi-Media (Mpix/s) | 710* [+9%] | 650 | 816* | 727 | With a 64-bit integer workload it’s 9% faster |
| Native Quad-Int (Int128) Multi-Media (Mpix/s) | 200* [+74%] | 115 | 227* | 135 | In this 128-bit int emulation, Z4-3D is 74% faster! |
| Native Float/FP32 Multi-Media (Mpix/s) | 2,118* [+24%] | 1,712 | 2,341* | 2,199 | In this floating-point test, it’s 24% faster. |
| Native Double/FP64 Multi-Media (Mpix/s) | 1,159* [+32%] | 876 | 1,271* | 1,123 | Switching to FP64 code, it’s 32% faster. |
| Native Quad-Float/FP128 Multi-Media (Mpix/s) | 46* [+26%] | 36.23 | 50.49* | 52 | Emulating 128-bit floats, Z4-3D is 26% faster. |

Even in heavy compute SIMD vectorised algorithms we see similar results: Zen4-3D is 32% faster than the old Zen3-3D but cannot beat the standard Zen4, due to the relatively small data set (Mandelbrot fractal bitmap) that already fits in the standard size L3 caches.

If we were to use a much larger data set (e.g. 96-128MB) that would overwhelm the smaller caches but fit in the new 3D V-Cache, we would see a benefit.

Note*: using AVX512 instead of AVX2/FMA.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Crypto AES-256 (GB/s) | 25*** [+22%] | 20.43 | 26.49*** | 28 | Z4-3D sees a 22% improvement over Z3-3D |
| Crypto AES-128 (GB/s) | 25*** [+22%] | 20.43 | | 27 | What we saw with AES-256 just repeats with AES-128. |
| Crypto SHA2-256 (GB/s) | 26* [+2%] | 25.01** | 27.92* | 31** | With SHA/HWA it is 2% faster |
| Crypto SHA1 (GB/s) | | | | | The less compute-intensive SHA1 does not change things due to acceleration. |

As streaming tests (crypto/hashing) are memory bound, Zen4-3D does not really see an uplift.

Again, should our dataset be able to fit entirely in the L3 cache(s) or be significantly serviced by it – we would see a big improvement over the standard Zen4. But with a large dataset (up to 16GB total on 32GB systems) the size of the L3 cache is of little benefit. Again, perhaps allowing configurable size data sets is an idea should these large L3 caches become mainstream.

Note***: using VAES 256-bit (AVX2) or 512-bit (AVX512)

Note**: using SHA HWA not SIMD (e.g. AVX512, AVX2, AVX, etc.)

Note*: using AVX512 not AVX2.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Black-Scholes float/FP32 (MOPT/s) | | | | | The standard financial algorithm. |
| Black-Scholes double/FP64 (MOPT/s) | 344 [+10%] | 312 | 391 | 461 | Switching to FP64 code, Z4-3D is 10% faster. |
| Binomial float/FP32 (kOPT/s) | | | | | Binomial uses thread shared data thus stresses the cache & memory system. |
| Binomial double/FP64 (kOPT/s) | 94.9 [+4%] | 91.47 | 117 | 140 | With FP64 code Z4-3D is 4% faster |
| Monte-Carlo float/FP32 (kOPT/s) | | | | | Monte-Carlo also uses thread shared data but read-only, thus reducing modify pressure on the caches. |
| Monte-Carlo double/FP64 (kOPT/s) | 150 [+21%] | 124 | 165 | 182 | Again, we see around 20% improvement. |

Ryzen always did well on non-SIMD floating-point algorithms and here 3D Zen4 performs as expected; naturally it cannot beat the faster standard Zen4 and thus we see no uplift from the bigger L3 cache. Again, we need updated algorithms that can buffer into the L3 cache now that it is so big in order to see improvements.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| SGEMM (GFLOPS) float/FP32 | | | | | In this tough vectorised algorithm that is widely used (e.g. AI/ML). |
| DGEMM (GFLOPS) double/FP64 | 472* [+42%] | 332 | 340* | 325 | With FP64 Z4-3D 42% faster! |
| SFFT (GFLOPS) float/FP32 | | | | | FFT is also heavily vectorised but stresses the memory sub-system more. |
| DFFT (GFLOPS) double/FP64 | 15.52* [+25%] | 12.37 | 14.92* | 19.8 | With FP64 code, Z4-3D is 25% faster |
| SNBODY (GFLOPS) float/FP32 | | | | | N-Body simulation is vectorised but fewer memory accesses. |
| DNBODY (GFLOPS) double/FP64 | 282 [+30%] | 217 | 318 | 213 | With FP64 precision Z4-3D is 30% faster |

The main news here is that with a dataset that fits in the 3D L3 cache in GEMM – we see a 42% improvement over Zen3-3D and about 39% over standard Zen4 (472 vs. 340 GFLOPS). GEMM is already using the L1D caches to buffer the tiles for higher performance – but here we see the huge improvement the L3 cache makes if the whole dataset fits the L3 cache.
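
To get a feel for what “the whole dataset fits the L3 cache” means for GEMM, here is a back-of-the-envelope sketch. It assumes three square N x N matrices (A, B and C) of 8-byte FP64 elements that all need to stay resident; the actual matrix sizes used by the benchmark are not stated in this review, so the numbers are illustrative only.

```python
import math

MIB = 1024 * 1024

def max_square_gemm_dim(cache_bytes: int, element_bytes: int = 8, matrices: int = 3) -> int:
    """Largest N such that `matrices` N x N arrays fit in `cache_bytes`."""
    # Solve matrices * N^2 * element_bytes <= cache_bytes for integer N.
    return math.isqrt(cache_bytes // (matrices * element_bytes))

# Assumption: L3 sizes as quoted in this review (32MB standard, 96MB V-Cache).
n_standard = max_square_gemm_dim(32 * MIB)
n_vcache = max_square_gemm_dim(96 * MIB)
print(n_standard, n_vcache)  # 1182 2048
```

Under these assumptions, FP64 problems with N between roughly 1,200 and 2,000 spill out of a 32MB L3 but still fit in 96MB, which is the range where the V-Cache part would be expected to pull ahead.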

Note*: using AVX512 not AVX2/FMA3.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Blur (3×3) Filter (MPix/s) | 6,088* [+75%] | 3,469 | 6,237* | 5,130 | In this vectorised integer workload Z4-3D is 75% faster! |
| Sharpen (5×5) Filter (MPix/s) | 2,355* [+81%] | 1,299 | 2,673* | 1,943 | Same algorithm but more shared data – 81% faster |
| Motion-Blur (7×7) Filter (MPix/s) | 1,215* [+82%] | 667 | 1,385* | 966 | Again same algorithm but even more data shared – 82% faster |
| Edge Detection (2*5×5) Sobel Filter (MPix/s) | 1,873* [+71%] | 1,094 | 2,109* | 1,624 | Different algorithm but still vectorised, no change. |
| Noise Removal (5×5) Median Filter (MPix/s) | 253* [+2.2x] | 115 | 284* | 141 | Still vectorised code but over 2x faster |
| Oil Painting Quantise Filter (MPix/s) | 39* [+9%] | 35.81 | 43.37* | 72 | This test has always been tough – 9% faster |
| Diffusion Randomise (XorShift) Filter (MPix/s) | 5,608* [+43%] | 3,917 | 4,841* | 5,539 | With integer workload, we see an unexpected 43% faster |
| Marbling Perlin Noise 2D Filter (MPix/s) | 633* [+29%] | 490 | 692* | 986 | In this final test we see little change. |

Again, if the dataset is small enough to fit in the normal L3 cache (e.g. 32MB) – you’re not going to see benefit from the much larger 3D V-Cache. Interestingly, we see 2 tests where the improvement is a huge 40%! This is not a fluke, as we’ve seen similar improvement in Zen3-3D vs. standard Zen3.

Note*: using AVX512 not AVX2/FMA3.

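| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Aggregate Score (Points) | 15,430* [+28%] | 12,050 | 16,390* | 16,020 | Zen4-3D is 28% faster than Zen3-3D |

As with standard Zen4, Zen4-3D is a huge 28% faster across all benchmarks than the older Zen3-3D; still, without VCache-optimised datasets the normal Zen4 performs better.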

Note*: using AVX512 not AVX2/FMA3.

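| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Price/RRP (USD) | $449 [=] | $449 | $399 | $419 | Same price |
| Price Efficiency (Perf. vs. Cost) (Points/USD) | 34.37 [+28%] | 26.84 | 41.08 | 38.23 | Zen4-3D is 28% better than the old one |

AMD has kept the same price, thus the new Zen4-3D provides 28% better performance/price. However, due to its lower price, the standard Zen4 is more efficient and even Intel’s RPL seems more efficient.

| Native Benchmarks | AMD Ryzen 7 7800X-3D 8C/16T (Raphael-3D) | AMD Ryzen 7 5800X-3D 8C/16T (Vermeer-3D) | AMD Ryzen 7 7800X 8C/16T (Raphael) | Intel Core i7 13700K 8C+8c/24T (Raptor Lake) | Comments |
|---|---|---|---|---|---|
| Power/TDP (W) | 120 – 253W [+14%] | 105 – 135W | 105 – 142W | 125 – 253W | TDP is 14% higher |
| Power Efficiency (Perf. vs. Power) (Points/W) | 128 [+12%] | 115 | 156 | 128 | As the TDP is a bit higher, Zen4-3D is 12% better |

With the slightly higher TDP, Zen4-3D comes up just 12% more power efficient than the old Zen3-3D but that is bound to increase going forward as apps are updated to support it. Still, the standard Zen4 reigns supreme due to both lower power and higher performance – at least on our benchmarks.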
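
The efficiency figures are simply the aggregate score divided by the price or by the TDP. A minimal sketch that reproduces them from the numbers quoted in this review, assuming the 120W TDP from the specification table for the 7800X-3D (the dictionary layout and function name are illustrative):

```python
# Aggregate scores (points), RRP (USD) and TDP (W) as quoted in this review.
results = {
    "7800X-3D": {"points": 15430, "usd": 449, "tdp_w": 120},
    "5800X-3D": {"points": 12050, "usd": 449, "tdp_w": 105},
    "7800X":    {"points": 16390, "usd": 399, "tdp_w": 105},
    "13700K":   {"points": 16020, "usd": 419, "tdp_w": 125},
}

def efficiency(points: float, cost: float) -> float:
    """Performance per unit of cost (USD or Watt)."""
    return points / cost

for name, r in results.items():
    per_usd = efficiency(r["points"], r["usd"])
    per_watt = efficiency(r["points"], r["tdp_w"])
    print(f"{name}: {per_usd:.2f} points/USD, {per_watt:.1f} points/W")
```

Since TDP is a nominal rating rather than measured power draw, the Points/W values are best read as a rough guide only.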

Final Thoughts / Conclusions

Summary: The 8-Core King Returns: 9/10

Even with the original 3D V-Cache Zen3 (5800X-3D) – the biggest issue was that the standard Zen3 was too good/performant and the huge L3 cache only made a difference in some workloads (notably games!). The standard 32MB L3 CCX cache is already large enough and fast enough especially considering the competition (Intel). Still, the 3D model had 3x (three times) larger L3 that can be a big asset.

Unlike the multi-CCX designs with asymmetric/hybrid L3 cache of different sizes – the 7800X-3D brings back “normality” with a single, unified 3D-VCache. No need for special drivers for games and other applications to schedule threads on the “right” CCX for best performance and turn off other cores/CCX for best power efficiency.

Due to much higher bandwidth on AM5/DDR5 platform (e.g. standard DDR5-6500 memory) vs. old AM4/DDR4 (e.g. common DDR4-3200 memory) – Zen4-3D takes less of a hit going to main memory than Zen3-3D though L3 bandwidth is still 10x higher than DDR5.
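
For reference, the theoretical peak bandwidth of the two memory configurations mentioned above can be worked out as follows. This sketch assumes a dual-channel setup with 64 bits (8 bytes) per channel per transfer; these are theoretical peaks, not measured results, and the function name is illustrative.

```python
def peak_dram_bandwidth_gbs(mts: int, channels: int = 2, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for a transfer rate given in MT/s."""
    # MT/s x bytes per transfer x channels gives MB/s; divide by 1000 for GB/s.
    return mts * bytes_per_transfer * channels / 1000

ddr4 = peak_dram_bandwidth_gbs(3200)  # DDR4-3200 on AM4
ddr5 = peak_dram_bandwidth_gbs(6500)  # DDR5-6500 on AM5
print(ddr4, ddr5, round(ddr5 / ddr4, 2))
```

On paper the DDR5 configuration offers roughly twice the bandwidth of the DDR4 one, which is consistent with a cache miss being less costly on the newer platform.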

In the end – it all depends on your workloads: if you game regularly and thus want a 3D/V-Cache Zen4, but also regularly need more cores/threads for other tasks than the 7800X-3D can provide, then the 7950X-3D/7900X-3D could work for you.

Otherwise you’re better off with the standard Zen4 (7700X), or if you are still on AM4 platform, the Zen3-3D (5800X-3D) has come down in price and is still very much competitive.
