AMD Ryzen 2 Mobile 2500U Review & Benchmarks – Cache & Memory Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited Ryzen2 APU mobile “Raven Ridge” version of the desktop Ryzen 2 with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper, there was no APU or mobile version (at least none released), leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design to hit the 15-35W TDP range of mobile devices. It has also added the brand-new Vega graphics cores to the APU that have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (compute unit) and thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

Why review it now?

With Ryzen3 due for release later this year (2019), along with a corresponding Ryzen3 APU mobile, it is good to re-test the platform, especially in light of the many BIOS/firmware updates, the many video/GPU driver updates and, not forgetting, the many operating system (Windows) mitigations for vulnerabilities (“Spectre”). Together these have greatly affected performance: sometimes for the good (firmware, drivers, optimisations), sometimes for the bad (mitigations).

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the Ryzen2 mobile (2500U) with competing Intel ULV architectures (SKL-U, KBL-U, CFL-U) with a view to upgrading to a mid-range high performance design.

 

CPU Specifications AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 2x 32kB 8-way / 2x 32kB 8-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen2 icache is 2x of Intel with matching dcache.
L2 Caches 4x 512kB 8-way 2x 256kB 16-way 2x 256kB 16-way 4x 256kB 16-way Ryzen2 L2 cache is 2x bigger than Intel and thus 4x larger than older SKL/KBL-U.
L3 Caches 4MB 16-way 4MB 16-way 4MB 16-way 6MB 16-way Here CFL-U brings 50% bigger L3 cache (6 vs 4MB) which may help some workloads.
TLB 4kB pages 64 full-way / 1536 8-way 64 8-way / 1536 6-way 64 8-way / 1536 6-way 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages 64 full-way / 1536 2-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) 600 2600 (400-3100) 2700 (400-3500) 1600 (400-3400) Ryzen2’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max 1200-2400 (2667) 1033-1866 (2133) 1067-2133 (2400) 1200-2400 (2533) Ryzen2 now supports up to 2667MHz (officially) which should improve its performance quite a bit – unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Timing (clocks) 17-17-17-39 8-56-18-9 1T 14-17-17-40 10-57-16-11 2T 15-15-15-36 4-51-17-8 2T 19-19-19-43 5-63-21-9 2T Timings naturally depend on memory which for laptops is somewhat limited and quite expensive.
Memory Controller Firmware 2.1.0 3.6.0 3.6.4 Firmware is the same as on desktop devices.
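The TLB rows above list entry counts only; how much memory each level can map without a page walk ("TLB reach") is simply entries multiplied by page size. The short sketch below works that out for the entry counts in the table. The function and variable names are ours, for illustration only, and the sketch does not model associativity.

```python
# TLB reach = number of entries x page size.
# Entry counts are taken from the specification table; names are illustrative.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Memory that a TLB level can map without a page walk."""
    return entries * page_bytes

# 4kB pages: 64-entry L1 TLB and 1536-entry L2 TLB (all CPUs in the table)
l1_reach_4k = tlb_reach_bytes(64, 4 * KIB)
l2_reach_4k = tlb_reach_bytes(1536, 4 * KIB)

# 2MB "large" pages: Ryzen2 has 64 L1 entries, the Intel ULV parts have 8
l1_reach_2m_ryzen = tlb_reach_bytes(64, 2 * MIB)
l1_reach_2m_intel = tlb_reach_bytes(8, 2 * MIB)
l2_reach_2m = tlb_reach_bytes(1536, 2 * MIB)

print(l1_reach_4k // KIB, "kB L1 reach with 4kB pages")
print(l2_reach_4k // MIB, "MB L2 reach with 4kB pages")
print(l1_reach_2m_ryzen // MIB, "MB Ryzen2 L1 reach with 2MB pages")
print(l1_reach_2m_intel // MIB, "MB Intel L1 reach with 2MB pages")
print(l2_reach_2m // GIB, "GB L2 reach with 2MB pages")
```

With 2MB pages enabled (as in our test environment) the L2 TLB of every CPU here covers far more than the block sizes tested, which is why large pages matter for the latency tests.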

Core Topology and Testing

As discussed in the previous articles (Ryzen1 and Ryzen2 reviews), cores on Ryzen are grouped in blocks (CCX or compute units), each with its own L3 cache, connected via a 256-bit bus running at memory controller clock. However, unlike desktop/workstation parts, so far all Ryzen2 mobile designs have a single (1) CCX, thus all the issues that “plagued” the desktop/workstation Ryzen designs do not apply here.

However, AMD could have released higher-core mobile designs to go against Intel’s H-line (beefed to 6-core / 12-threads with CFL-H) that would have likely required 2 CCX blocks. At this time (start 2019) considering that Ryzen3 (mobile) will launch soon that seems unlikely to happen…

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen2 mobile supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
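The bracketed percentages in the benchmark table (e.g. [-21%]) are relative differences between the Ryzen2 result and a reference result. A minimal sketch of that arithmetic follows; the assumption that the reference is the CFL-U (8250U) column is ours, inferred from the figures, and the function name is illustrative.

```python
def relative_delta_percent(value: float, reference: float) -> float:
    """Difference of `value` against `reference`, as a percentage of `reference`."""
    return (value - reference) / reference * 100.0

# Inter-core bandwidth, best case (GB/s): higher is better, so negative is worse
best_bw_delta = relative_delta_percent(18.65, 23.65)

# Data sequential latency (ns): lower is better, so negative is better
seq_latency_delta = relative_delta_percent(4.1, 5.9)

print(f"bandwidth: {best_bw_delta:+.0f}%")
print(f"latency:   {seq_latency_delta:+.0f}%")
```

Note that the sign alone does not say which CPU wins: for rates a negative delta is a deficit, while for latencies it is an advantage.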

Native Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 18.65 [-21%] 16.81 18.93 23.65 Ryzen2 L1D is not as wide as Intel’s designs (512-bit) thus inter-core transfers in L1D are 20% slower.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 9.29 [=] 6.62 7.4 9.3 Using the unified L3 caches – both Ryzen2 and CFL-U manage the same bandwidths.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 16 [-24%] 21 18 19 Within the same core (share L1D) Ryzen2 has lower latencies by 24% than all Intel CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 46 [-23%] 61 54 56 Within the same compute unit (share L3) Ryzen2 again yields 23% lower latencies.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) n/a n/a n/a n/a With a single CCX we have no latency issues.
While the L1D cache on Ryzen2 is not as wide as on Intel SKL/KBL/CFL-U to yield the same bandwidth (20% lower), both it and L3 manage lower latencies by a relatively large ~25%. With a single CCX design we have none of the issues seen on the desktop/workstation CPUs.
Aggregated L1D Bandwidth (GB/s) 267 [-67%] 315 302 628 Ryzen2’s L1D is just not wide enough – even 2-core SKL/KBL-U have more bandwidth and CFL-U has almost 3x more.
Aggregated L2 Bandwidth (GB/s) 225 [-29%] 119 148 318 The 2x larger L2 caches (512 vs 256kB) perform better but still CFL-U manages 30% more bandwidth.
Aggregated L3 Bandwidth (GB/s) 130 [-31%] 90 95 188 CFL-U not only has 50% bigger L3 (6 vs 4MB) but also somehow manages 30% more bandwidth too while SKL/KBL-U are left in the dust.
Aggregated Memory (GB/s) 24 [=] 21 21 24 With the same memory clock, Ryzen2 ties with CFL-U which means good bandwidth for the cores.
While we saw big improvements on Ryzen2 (desktop) for all caches L1D/L2/L3 – more work needs to be done: in particular the L1D caches are not wide enough compared to Intel’s CPUs – and even L2/L3 need to be wider. Most likely Ryzen3 with native wide 256-bit SIMD (unlike 128-bit as Ryzen1/2) will have twice as wide L1D/L2 that should be sufficient to match Intel.

The memory controller performs well matching CFL-U and is officially rated for higher DDR4 memory – though on laptops the choices are more limited and more expensive.

Data In-Page Random Latency (ns) 91.8 [4-13-32] [+2.75x] 34.6 [3-10-17] 27.6 [4-12-22] 24.5 As on desktop Ryzen1/2 in-page random latencies are large compared to the competition while L1D/L2 are OK but L3 also somewhat large.
Data Full Random Latency (ns) 117 [4-13-32] [-16%] 108 [3-10-27] 84.7 [4-12-33] 139 Out-of-page latencies are not much different which means Ryzen2 is a lot more competitive but still somewhat high.
Data Sequential Latency (ns) 4.1 [4-6-7] [-31%] 5.6 [3-10-11] 6.5 [4-12-13] 5.9 Ryzen’s prefetchers are working well with sequential access with lower latencies than Intel.
Ryzen1/2 desktop issues were high memory latencies (in-page/full random) and nothing much changes here. “In-Page/Random pattern” (TLB hit) latencies are almost 3x higher than Intel’s, and actually not much lower than the “Full/Random pattern” (TLB miss) latencies, which are comparable to Intel’s SKL/KBL/CFL. On the other hand “Sequential pattern” yields lower latencies (30% less) than Intel thus simple access patterns work better than complex/random access patterns.
Looking at the data access latencies’ graph for Ryzen2 mobile – we see the “in-page/random” following the “full/random” latencies all the way to 8MB block where they plateau; we would have expected them to plateau at a lower value. See the “code access latencies” graph below.
Code In-Page Random Latency (ns) 17.6 [5-9-25] [+14%] 13.3 [2-9-18] 14.9 [2-11-21] 15.5 Code latencies were not a problem on Ryzen1/2 and they are OK here, 14% higher.
Code Full Random Latency (ns) 108 [5-15-48] [+19%] 91.8 [2-10-38] 90.4 [2-11-45] 91 Out-of-page latency is also competitive and just 20% higher.
Code Sequential Latency (ns) 8.2 [5-13-20] [+37%] 5.9 [2-4-8] 7.8 [2-4-9] 6 Ryzen’s prefetchers are working well with sequential access pattern latency but not as fast as Intel.
Unlike data, code latencies (any pattern) are competitive with Intel though CFL-U does have lower latencies (between 15-20%) but in exchange you get a 2x bigger L1I (64 vs 32kB) which should help complex software.
This graph for code access latencies is what we expected to see for data: “in-page/random” latencies plateau much earlier than “full/random” thus “TLB hit” latencies being much lower than “TLB miss” latencies.
Memory Update Transactional (MTPS) 7.17 [-7%] 6.5 7.72 7.2 As none of Intel’s CPUs have HLE enabled Ryzen2 performs really well with just 7% less transactions/second.
Memory Update Record Only (MTPS) 5.66 [+5%] 4.66 5.25 5.4 With only record updates it manages to be 5% faster.
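The three access patterns in the latency rows above (sequential, in-page random, full random) differ only in the order in which cache lines are visited. The sketch below shows one common way such orders can be generated and linked into a pointer chain, so that each load depends on the previous one. It is an illustration under our own assumptions (64-byte lines, 4kB pages, hypothetical function names), not the benchmark's actual implementation, and it only builds the patterns rather than timing them.

```python
import random

def sequential_order(n_lines):
    """Visit cache lines in address order (prefetcher friendly)."""
    return list(range(n_lines))

def full_random_order(n_lines, rng):
    """Visit lines in random order across the whole block (frequent page changes)."""
    order = list(range(n_lines))
    rng.shuffle(order)
    return order

def in_page_random_order(n_lines, lines_per_page, rng):
    """Random order inside each page, pages visited one after another."""
    order = []
    for start in range(0, n_lines, lines_per_page):
        page = list(range(start, min(start + lines_per_page, n_lines)))
        rng.shuffle(page)
        order.extend(page)
    return order

def to_pointer_chain(order):
    """chain[i] holds the line visited after line i; following it walks one cycle."""
    chain = [0] * len(order)
    for pos, idx in enumerate(order):
        chain[idx] = order[(pos + 1) % len(order)]
    return chain

def page_switches(order, lines_per_page):
    """Count consecutive accesses that land on a different page."""
    return sum(1 for a, b in zip(order, order[1:])
               if a // lines_per_page != b // lines_per_page)

if __name__ == "__main__":
    rng = random.Random(2019)
    lines, per_page = 1024, 64  # 64kB block, 4kB pages, 64-byte lines (assumed)
    for name, order in [
        ("sequential", sequential_order(lines)),
        ("in-page random", in_page_random_order(lines, per_page, rng)),
        ("full random", full_random_order(lines, rng)),
    ]:
        print(name, "page switches:", page_switches(order, per_page))
```

The in-page pattern changes page only as often as the sequential one, so the TLB should nearly always hit; the full random pattern changes page on almost every access.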

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

We saw good improvement on Ryzen2 (desktop/workstation) but still not enough to beat Intel, and a lot more work is needed both on L1/L2 cache bandwidth/widening and on memory latency (“in-page” aka “TLB hit” random access pattern), which cannot be improved with firmware/BIOS updates (AGESA firmware). Ryzen2 mobile does have the potential to use faster DDR4 memory (officially rated 2667MHz) thus could overtake Intel using faster memory, but laptop DDR4 SODIMM choice is limited.

Regardless of these differences – the CPU results we’ve seen are solid thus sufficient to recommend Ryzen2 mobile especially when at a much lower cost than competing designs. Even if you do choose Intel – you will be picking up a better design due to Ryzen2 mobile competition – just compare the SKL/KBL-U and CFL/WHL-U results.

We are looking forward to seeing what improvements Ryzen3 mobile brings to the mobile platform.

In a word: Recommended – with reservations

In this article we tested CPU Cache and Memory performance.

Intel Core i9 9900K CoffeeLake-R Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “CoffeeLake-R” CFL-R?

It is the “refresh” (updated) version of the 8th generation Intel Core architecture (CFL) – itself a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). While ordinarily this would not be much of an event – this time we do have more significant changes:

  • Patched vulnerabilities in hardware: this can help restore I/O workload performance degradation due to OS mitigations
    • Kernel Page Table Isolation (KPTI) aka “Meltdown” – Patched in hardware
    • L1TF/Foreshadow – Patched in hardware
    • (IBPB/IBRS) “Spectre 2” – OS mitigation needed
    • Speculative Store Bypass disabling (SSBD) “Spectre 4” – OS mitigation needed
  • Increased core counts yet again: CFL-R top-end now has 8 cores, not 6.

Intel CPUs bore the brunt of the vulnerabilities disclosed at the start of 2018 with “Meltdown” operating system mitigations (KVA) likely having the biggest performance impact in I/O workloads. While modern features (e.g. PCID (process context id) acceleration) could help reduce performance impact somewhat on recent architectures (4th gen and newer) the impact can still be significant. The CFL-R hardware fixes (thus not needing KVA) may thus prove very important.

On the desktop we also see increased cores (again!) now up to 8 (thus 16 threads with HyperThreading) – double what KBL and SKL brought and matching AMD.

We also see increased clocks, mainly Turbo, which allows 1 or 2 cores to boost higher than CFL could and thus helps workloads that are not massively threaded. This can improve responsiveness as single tasks can be run at top speed when there is little thread utilization.

While rated TDP has not changed, in practice we are likely to see increased “real” power consumption especially due to higher clocks – with Turbo pushing power consumption even higher – close to SKL/KBL-X.

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range CFL-R Core i9 (9900K) with previous generation (8700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
L1D / L1I Caches 8x 32kB 8-way / 8x 32kB 8-way 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way No L1D/I changes, Ryzen’s L1I is twice as big.
L2 Caches 8x 256kB 4-way 6x 256kB 4-way 8x 512kB 8-way 10x 1MB 16-way No L2 changes, Ryzen’s L2 is twice as big again.
L3 Caches 16MB 16-way 12MB 16-way 2x 8MB 16-way 2x 8MB 16-way L3 has also increased with the number of cores, and now matches Ryzen.
TLB 4kB pages 64 4-way / 64 8-way / 1536 6-way 64 4-way / 64 8-way / 1536 6-way 64 full-way 1536 8-way 64 4-way / 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages 8 full-way / 1536 6-way 8 full-way / 1536 6-way 64 full-way 1536 2-way 8 full-way / 1536 6-way No TLB changes.
Memory Controller Speed (MHz) 1200-5000 1200-4400 1333-2667 1200-2700 The uncore (memory controller) runs at faster clock due to higher rated clock but not a lot in it.
Memory Data Speed (MHz) 3200 3200 2667 3200 CFL/R can easily run at 3200MT/s while KBL/SKL were not as reliable. We could not get Ryzen past 2667 while it does support 2933.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Bandwidth (GB/s) 50 50 42 100 Bandwidth has naturally increased with memory clock speed but latencies are higher.
Uncore / Memory Controller Firmware 2.6.2 2.6.2 We’re on firmware 2.6.x on both.
Memory Timing (clocks) 16-16-16-36 6-52-25-12 2T 16-16-16-36 6-52-25-12 2T 16-17-17-35 7-60-20-10 2T Timings are very much BIOS dependent and vary a lot.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL-R supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 70.7 [+28%] 52.5 55.3 86 CFL-R finally overtakes Ryzen2 in inter-core bandwidth with almost 30% more bandwidth.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 15.4 [-1%] 15.5 6.35 25.7 In worst-case pairs on Ryzen2 must go across CCXes – unlike Intel’s CPUs – thus CFL can muster over 2x more bandwidth in this case.
CFL-R manages good bandwidth improvement with its 2 extra cores allowing it to dominate Ryzen 2; worst-case bandwidth does not improve as the inter-core connector has remained the same.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 13.4 [-7%] 14.4 13.5 15 With its faster clock, CFL-R manages lower inter-core latency with 7% drop.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 43.7 [-3%] 45 40 75 Within the same unit, Ryzen2 is again faster than CFL/R.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 115 Obviously going across CCXes is slow, about 3x slower which needs careful thread scheduling.
The multiple CCX design of Ryzen 2 still presents some challenges to programmers, requiring threads to be carefully scheduled; the unified CFL-R, just like CFL before it, enjoys lower latencies throughout.
Aggregated L1D Bandwidth (GB/s) 1890 [+39%] 1630 854 2220 Intel’s wide L1D in CFL/R means almost 2x more bandwidth than Ryzen 2.
Aggregated L2 Bandwidth (GB/s) 618 [+8%] 571 720 985 But Ryzen2’s L2 caches are not only twice as big but also very wide – CFL/R surprisingly cannot beat it.
Aggregated L3 Bandwidth (GB/s) 326 [=] 327 339 464 Ryzen’s 2 L3 caches also provide good bandwidth matching CFL’s unified L3 cache.
Aggregated Memory (GB/s) 35.5 [=] 35.6 32.2 70 Running at 3200MT/s obviously CFL enjoys higher bandwidth than Ryzen2 at 2667MT/s but somehow the latter has better efficiency.
Nothing much has changed in CFL/R vs. old SKL/KBL: the L1 caches are wide and thus fast, but the L2 and L3 are not as impressive. The memory controller, while competitive, does not seem as efficient as Ryzen2’s but is more stable at high data rates, allowing for higher bandwidth.
Data In-Page Random Latency (ns) 17.5 (3-10-21) 17.4 (4-11-20) [-73%] 63.4 (4-12-31) 25.5 (4-13-30) While clock latencies have not changed vs. old KBL/SKL, CFL-R enjoys lower latencies due to higher data rates. Ryzen2 has problems here.
Data Full Random Latency (ns) 54.3 (3-10-36) 53.4 (4-11-42) [-30%] 76.2 (4-12-32) 74 (4-13-62) Out-of-page clock latencies have increased but still overall lower. Ryzen2 has almost caught up here.
Data Sequential Latency (ns) 3.8 (3-10-11) 3.8 (4-11-12) 3.3 (4-6-7) 5.3 (4-12-12) With sequential access, Ryzen2 is now faster as CFL/R’s clock latencies have not changed.
CFL-R does not improve over CFL (same memory controller) but is lucky here, as Ryzen2 still has high latencies in random accesses (either in-page or full range) though it manages to be faster with sequential access. Intel will need to improve going forward as clock latencies, while good, have really not improved at all.
Code In-Page Random Latency (ns) 8.6 (2-9-19) 8.7 (2-10-21) 13.8 (4-9-24) 11.8 (4-14-25) Code clock latencies also have not changed and, while Ryzen2 performs a lot better here, CFL/R manage to be ~35% faster.
Code Full Random Latency (ns) 60.1 (2-9-48) 59.8 (2-10-48) 85.7 (4-14-49) 83.6 (4-15-74) Out-of-page clock latencies also have not changed and here CFL/R is 30% faster than Ryzen2.
Code Sequential Latency (ns) 4.3 (2-3-8) 4.5 (2-4-10) 7.4 (4-12-20) 6.8 (4-7-11) Ryzen2 is competitive but again CFL/R manages to be almost 40% faster.
CFL/R does not improve over CFL but still dominates here and enjoys 30-40% less latency over Ryzen2 but the latter has improved a lot in time.
Memory Update Transactional (MTPS) 73.3 [+36%] 54 5 59 Finally all top-end Intel CPUs have HLE enabled and working and thus enjoy huge performance increase.
Memory Update Record Only (MTPS) 53.4 [+41%] 38 4.58 59 Nothing much changes here. CFL-R can do over 40% more transactions.

CFL-R does not really perform any different cache/memory wise vs. old CFL as the caches and memory controller are unchanged.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

CFL-R just adds more cores, thus enjoys higher aggregated L1D/L2 bandwidths vs CFL but the L3 is still disappointing, especially as now it has to feed 33% more cores/threads (8/16 vs 6/12). Latencies (in clocks) do not change either but as it can clock higher they do decrease in real terms (ns).
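To illustrate why the same latency in clocks becomes a lower latency in nanoseconds at a higher clock, here is a small worked conversion. The clock speeds used (4.7GHz for CFL, 5.0GHz for CFL-R) are our assumption of peak single-core Turbo and the 10-clock cache hit is a hypothetical round figure; neither is a measurement from this article.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clocks to nanoseconds (1 GHz = 1 clock per ns)."""
    return cycles / clock_ghz

# Assumed peak Turbo clocks, for illustration only
CFL_CLOCK_GHZ = 4.7
CFL_R_CLOCK_GHZ = 5.0

# Hypothetical cache hit costing 10 clocks on both CPUs
cfl_ns = cycles_to_ns(10, CFL_CLOCK_GHZ)
cfl_r_ns = cycles_to_ns(10, CFL_R_CLOCK_GHZ)

print(f"CFL:   {cfl_ns:.2f} ns")
print(f"CFL-R: {cfl_r_ns:.2f} ns")
```

The gain is proportional to the clock increase only, so it is modest compared with what a redesigned cache could bring.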

The memory controller is the very same (even running same firmware) thus performs the same though now it has to feed 33% more cores/threads (8/16 vs 6/12) thus when all cores/threads are used the aggregated bandwidth falls due to extra contention. In fairness Ryzen2 has the same issue (too many cores/threads for too little bandwidth) thus SKL/KBL-X is where you should be looking for more bandwidth.
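A quick way to see the contention problem is to divide the measured aggregated memory bandwidth by the number of cores and threads it has to feed. The sketch below uses the aggregated memory figures from the benchmark table (35.5GB/s for CFL-R, 35.6GB/s for CFL); the helper name is ours and an even split between cores is a simplifying assumption.

```python
def bandwidth_share(aggregate_gbs: float, units: int) -> float:
    """Even share of aggregate bandwidth per core or thread (simplifying assumption)."""
    return aggregate_gbs / units

# (aggregated memory bandwidth GB/s, cores, threads) from the tables above
cpus = {
    "CFL-R 9900K": (35.5, 8, 16),
    "CFL 8700K": (35.6, 6, 12),
}

shares = {
    name: (bandwidth_share(bw, cores), bandwidth_share(bw, threads))
    for name, (bw, cores, threads) in cpus.items()
}

for name, (per_core, per_thread) in shares.items():
    print(f"{name}: {per_core:.2f} GB/s per core, {per_thread:.2f} GB/s per thread")

# Relative change in per-core share going from CFL to CFL-R
change = shares["CFL-R 9900K"][0] / shares["CFL 8700K"][0] - 1.0
print(f"per-core share change: {change * 100:+.0f}%")
```

With total bandwidth essentially unchanged, each core's share drops by about a quarter when going from 6 to 8 cores.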

Intel Core i7 8700K CoffeeLake Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “CoffeeLake” CFL?

The 8th generation Intel Core architecture is code-named “CoffeeLake” (CFL): unlike previous architectures, it is a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). The server/workstation (SKL-X/KBL-X) CPU core saw new instruction set support (AVX512) as well as other improvements – these have not made the transition yet.

Possibly due to limited competition (before AMD Ryzen launch), process issues (still at 14nm) and the disclosure of a whole host of hardware vulnerabilities (Spectre, Meltdown, etc.) which required microcode (firmware) updates – performance improvements have not been forthcoming. This is pretty much unprecedented – while some Core updates were only evolutionary we have not had complete stagnation before; in addition the built-in GPU core has also remained pretty much stagnant – we will investigate this in a subsequent article.

However, CFL does bring a major change: increased core counts both on desktop and mobile. On desktop we go from 4 to 6 cores (+50%) while on mobile (ULV) we go from 2 to 4 (+100%) within the same TDP envelope!

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i7 (8700K) with previous generation (6700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Intel i7-6700K SkyLake Comments
L1D / L1I Caches 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way No L1D/I changes, Ryzen’s L1I is twice as big.
L2 Caches 6x 256kB 4-way 8x 512kB 8-way 10x 1MB 16-way 4x 256kB 4-way No L2 changes, Ryzen’s L2 is twice as big again.
L3 Caches 12MB 16-way 2x 8MB 16-way 2x 8MB 16-way 8MB 16-way L3 has also increased with the number of cores, still behind Ryzen’s dual 8MB L3 caches.
TLB 4kB pages 64 4-way / 64 8-way / 1536 6-way 64 full-way 1536 8-way 64 4-way / 64 8-way / 1536 6-way 64 4-way / 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages 8 full-way / 1536 6-way 64 full-way 1536 2-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way No TLB changes.
Memory Controller Speed (MHz) 1200-4400 1333-2667 1200-2700 1200-4000 The uncore (memory controller) runs at faster clock due to higher rated clock but not a lot in it.
Memory Data Speed (MHz) 3200 2667 3200 2533 CFL can easily run at 3200MT/s while KBL/SKL were not as reliable. We could not get Ryzen past 2667 while it does support 2933.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Bandwidth (GB/s) 50 42 100 40 Bandwidth has naturally increased with memory clock speed but latencies are higher.
Uncore / Memory Controller Firmware 2.6.2 2.0.0.6 We’re on firmware 2.6.x vs. 2.0.x on old SKL/KBL.
Memory Timing (clocks) 16-16-16-36 6-52-25-12 2T 16-17-17-35 7-60-20-10 2T 16-18-18-36 5-54-21-10 2T Timings are very much BIOS dependent and vary a lot.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Intel i7-6700K SkyLake Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 52.5 [-5%] 55.3 86 39.5 Despite having 2 fewer cores, CFL has only 5% less bandwidth than Ryzen 2.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 15.5 [+144%] 6.35 25.7 16.1 In worst-case pairs on Ryzen2 must go across CCXes – unlike Intel’s CPUs – thus CFL can muster over 2x more bandwidth in this case.
CFL manages good bandwidth improvement over KBL/SKL – and due to unified design matching Ryzen2 in best case and beating it soundly in worst case.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 14.4 [+7%] 13.5 15 16 Surprisingly, Ryzen2 manages lower thread latency when sharing core.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 45 [+12%] 40 75 47 Within the same unit, Ryzen2 is again faster than CFL.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 115 Obviously going across CCXes is slow, about 3x slower which needs careful thread scheduling.
The multiple CCX design still presents some challenges to programmers, requiring threads to be carefully scheduled, but we see Ryzen2 with lower latencies for both core and unit: a surprising result, as usually Intel’s caches are lower latency.
Aggregated L1D Bandwidth (GB/s) 1630 [+59%] 854 2220 884 Intel’s wide data path L1 caches allow even old SKL to beat Ryzen2 with CFL enjoying 60% more bandwidth.
Aggregated L2 Bandwidth (GB/s) 571 [-21%] 720 985 329 But Ryzen2’s L2 caches are not only twice as big but also very wide – CFL has 20% less bandwidth.
Aggregated L3 Bandwidth (GB/s) 327 [-4%] 339 464 243 Ryzen’s 2 L3 caches also provide good bandwidth matching CFL’s unified L3 cache.
Aggregated Memory (GB/s) 35.6 [+11%] 32.2 70 30.1 Running at 3200MT/s obviously CFL enjoys higher bandwidth than Ryzen2 at 2667MT/s but somehow the latter has better efficiency.
Nothing much has changed in CFL vs. old SKL: the L1 caches are wide and thus fast, but the L2 and L3 are not as impressive. The memory controller, while competitive, does not seem as efficient as Ryzen2’s but is more stable at high data rates, allowing for higher bandwidth.
Data In-Page Random Latency (ns) 17.4 (4-11-20) [-73%] 63.4 (4-12-31) 25.5 (4-13-30) 20.4 (4-12-21) While clock latencies have not changed vs. old KBL/SKL, CFL enjoys lower latencies due to higher data rates. Ryzen2 has problems here.
Data Full Random Latency (ns) 53.4 (4-11-42) [-30%] 76.2 (4-12-32) 74 (4-13-62) 63.9 (4-12-34) Out-of-page clock latencies have increased but still overall lower. Ryzen2 has almost caught up here.
Data Sequential Latency (ns) 3.8 (4-11-12) [+15%] 3.3 (4-6-7) 5.3 (4-12-12) 4.1 (4-12-13) With sequential access, Ryzen2 is now faster as CFL’s clock latencies have not changed.
CFL is lucky here as even Ryzen2 still has high latencies in random accesses (either in-page or full range) but manages to be faster with sequential access. Intel will need to improve going forward as clock latencies while good have really not improved at all.
Code In-Page Random Latency (ns) 8.7 (2-10-21) [-37%] 13.8 (4-9-24) 11.8 (4-14-25) 10.1 (2-10-21) Code clock latencies also have not changed and, while Ryzen2 performs a lot better here, CFL (even old SKL) manages to be ~35% faster.
Code Full Random Latency (ns) 59.8 (2-10-48) [-30%] 85.7 (4-14-49) 83.6 (4-15-74) 70.7 (2-11-46) Out-of-page clock latencies also have not changed and here CFL is 30% faster than Ryzen2.
Code Sequential Latency (ns) 4.5 (2-4-10) [-39%] 7.4 (4-12-20) 6.8 (4-7-11) 5 (2-4-9) Ryzen2 is competitive but again CFL manages to be almost 40% faster.
CFL dominates here and enjoys 30-40% less latency over Ryzen2 but the latter has improved a lot in time.
Memory Update Transactional (MTPS) 54 [+980%] 5 59 35 Finally all top-end Intel CPUs have HLE enabled and working and thus enjoy huge performance increase.
Memory Update Record Only (MTPS) 38 [+730%] 4.58 59 24.8 Nothing much changes here.

CFL does not bring anything new vs. old KBL/SKL, both caches and memory controller are unchanged. The latter can now (officially) use higher clocked memory thus it does improve in terms of bandwidth/latencies and the uncore can also clock a bit higher but that is it.
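The “efficiency” remark in the table can be made concrete by comparing measured bandwidth against the theoretical peak, which is the data rate multiplied by the bus width in bytes. The sketch below uses the data rates and measured aggregated memory bandwidth from the tables for the three dual-channel parts; function names are ours and the peak ignores refresh and other protocol overheads.

```python
def peak_bandwidth_gbs(data_rate_mts: float, bus_width_bits: int = 128) -> float:
    """Theoretical peak in GB/s: transfers per second x bytes per transfer."""
    return data_rate_mts * (bus_width_bits / 8) / 1000.0

def efficiency(measured_gbs: float, data_rate_mts: float) -> float:
    """Measured bandwidth as a fraction of theoretical peak."""
    return measured_gbs / peak_bandwidth_gbs(data_rate_mts)

# (memory data rate MT/s, measured aggregated memory bandwidth GB/s)
systems = {
    "CFL 8700K": (3200, 35.6),
    "Ryzen2 2700X": (2667, 32.2),
    "SKL 6700K": (2533, 30.1),
}

results = {name: efficiency(bw, rate) for name, (rate, bw) in systems.items()}

for name, (rate, bw) in systems.items():
    print(f"{name}: peak {peak_bandwidth_gbs(rate):.1f} GB/s, "
          f"measured {bw} GB/s, efficiency {results[name] * 100:.0f}%")
```

On these figures Ryzen2 extracts a larger fraction of its (lower) peak, which matches the comment in the table, while CFL still delivers more bandwidth in absolute terms thanks to the higher data rate.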

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

CFL’s caches and memory (uncore) sub-systems are unchanged from SKL/KBL and thus provide no surprises, with rock-solid performance at 3200Mt/s with huge bandwidth (needed after all to feed 12 threads) but Ryzen2 has improved a lot over old AMD CPU designs.

With the continuous increase in cores/threads (8/16 in CFL-R) as with Ryzen1/2 but modest DDR4 speed increases (not to mention very high cost), the desktop platforms are likely to see diminishing returns due to core/thread data starvation, as the extra cores just cannot be fed by the memory sub-systems. The L2 and L3 caches will need to be improved (widened, larger as with SKL-X); also the now defunct L4/eDRAM cache should re-emerge to mitigate these issues…

AMD Ryzen+ 2700X Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen+” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen+” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU versions, with integrated “Vega RX” graphics, as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen+:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we are testing them in this article!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen+ (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen+ 2700X Pinnacle Ridge AMD Ryzen+ 2600 Pinnacle Ridge AMD Ryzen 1700X Summit Ridge Intel i7-6700K SkyLake Comments
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 6x 32kB 8-way / 6x 64kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen+ data/instruction caches are unchanged; icache is still 2x as big as Intel’s.
L2 Caches 8x 512kB 8-way 6x 512kB 8-way 8x 512kB 8-way 4x 256kB 8-way Ryzen+ L2 cache is unchanged but we’re told latencies have been improved. And 4x bigger than Intel’s!
L3 Caches 2x 8MB 16-way 2x 8MB 16-way 2x 8MB 16-way 8MB 16-way Ryzen+ L3 caches are also unchanged – but again latencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
TLB 4kB pages 64 full-way 1536 8-way 64 full-way 1536 8-way 64 full-way 1536 8-way 64 8-way 1536 6-way No TLB changes.
TLB 2MB pages 64 full-way 1536 2-way 64 full-way 1536 2-way 64 full-way 1536 2-way 8 full-way 1536 6-way No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) 600-1200 600-1200 600-1200 1200-4000 Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max 2400 / 2933 2400 / 2933 2400 / 2666 2533 / 2400 Ryzen+ now supports up to 2933MHz (officially) which should improve its performance quite a bit – unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Timing (clocks) 14-16-16-32 7-54-18-9 2T 14-16-16-32 7-54-18-9 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T Memory runs at the same timings on both Ryzen+ and Ryzen but we shall see if measured latencies are different.

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.
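
As a rough back-of-envelope check, the peak throughput of such a link is its width multiplied by its clock. The sketch below assumes one 256-bit transfer per clock in one direction (an assumption; the article does not state the transfer rate) and uses the 600-1200MHz memory controller range from the specification table. The function name is ours, for illustration only.

```python
def link_peak_gbs(width_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    """Peak one-way throughput in GB/s (1 GB = 1e9 bytes).

    Assumes `transfers_per_clock` full-width transfers every clock (hypothetical).
    """
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Memory controller clock range taken from the specification table
for clock in (600, 1200):
    print(f"{clock} MHz: {link_peak_gbs(256, clock):.1f} GB/s")
```

This is a theoretical ceiling only; the measured inter-CCX figures reported later include protocol and coherency overheads.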

When running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX ensures lower latencies and higher bandwidth, which we will test presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark’ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) and to provide different matching algorithms when selecting the producer/consumer units: best match (lowest latency) and worst match (highest latency), thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
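
To make the three cases concrete, here is a minimal sketch of how a pair of logical processors could be classified on an 8C/16T part with two 4-core CCXes. The numbering scheme (SMT siblings adjacent, cores numbered CCX by CCX) is an assumption for illustration; real software should query the operating system for the topology, and this is not how Sandra does it internally.

```python
THREADS_PER_CORE = 2   # SMT enabled (assumed)
CORES_PER_CCX = 4      # assumed: 8-core part made of 2 CCXes


def locate(logical_cpu: int) -> tuple[int, int]:
    """Return (ccx, core) for a logical processor under the assumed numbering."""
    core = logical_cpu // THREADS_PER_CORE
    return core // CORES_PER_CCX, core


def classify_pair(a: int, b: int) -> str:
    """Classify a producer/consumer pair by the closest resource they share."""
    ccx_a, core_a = locate(a)
    ccx_b, core_b = locate(b)
    if core_a == core_b:
        return "same core"        # SMT siblings sharing L1D/L2
    if ccx_a == ccx_b:
        return "same CCX"         # different cores sharing L3
    return "different CCX"        # must cross the inter-CCX link


for pair in ((0, 1), (0, 6), (0, 8)):
    print(pair, classify_pair(*pair))
```

A scheduler that wants low latency would keep communicating threads in the first two categories and avoid the third.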

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks | Ryzen+ 2700X 8C/16T Pinnacle Ridge | Ryzen+ 2600 6C/12T Pinnacle Ridge | Ryzen 1700X 8C/16T Summit Ridge | i7-6700K 4C/8T SkyLake | Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) | 54.9 [+15%] | 46.5 | 47.8 | 39 | Ryzen+ manages 15% higher bandwidth between its cores, slightly better than just the 11% clock increase – signalling some improvements under the hood.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) | 5.89 [+2%] | 5.53 | 5.8 | 16.3 | In worst-case pairs, data on Ryzen must go across CCXes – and with this link running at the same clock (1200MHz) on Ryzen+ we can only manage a 2% increase in bandwidth. This is why faster memory is needed.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) | 13.5 [-13%] | 15.4 | 15.6 | 16.2 | Within the same core (sharing L1D/L2), Ryzen+ manages a 13% reduction in latency, again better than just the clock speed increase.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) | 40.1 [-7%] | 43.5 | 43.2 | 47.3 | Within the same compute unit (sharing L3), the latency decreased by 7% on Ryzen+ thus L3 seems to have improved also.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) | 128 [-6%] | 132 | 236 | | Going inter-CCX we still see a 6% reduction in latency on Ryzen+ – with the CCX link at the same speed – a welcome surprise.
The multiple CCX design still presents some challenges to programmers, requiring threads to be carefully scheduled – but we see a decent 6-7% reduction in L3/CCX latencies on Ryzen+ even when running at the same clock as Ryzen.
Aggregated L1D Bandwidth (GB/s) | 862 [+18%] | 615 | 730 | 837 | Right off we see an 18% bandwidth increase – almost 2x the 11% clock increase – thus some improvements have been made to the cache system. It allows Ryzen+ to finally beat the i7 with its wide L1 data paths (512-bit), though with 2x more caches (8 vs 4).
Aggregated L2 Bandwidth (GB/s) | 736 [+32%] | 542 | 556 | 329 | We see a huge 32% increase in L2 cache bandwidth – almost 3x the clock increase (11%) – suggesting the L2 caches have been improved also. Ryzen+ thus has over 2x the L2 bandwidth of the i7, though with 2x more caches (8 vs 4).
Aggregated L3 Bandwidth (GB/s) | 339 [+19%] | 398 | 284 | 238 | The bandwidth of the L3 caches has also increased by 19% (almost 2x the clock increase) though we see the 6-core 2600 doing better (398 vs 339), likely due to fewer threads competing for the same L3 caches (12 vs 16). Ryzen+ L3 caches are not just 2x bigger than Intel’s but also deliver about 1.4x the bandwidth.
Aggregated Memory (GB/s) | 30.2 [+2%] | 30.2 | 29.6 | 29.1 | With the same memory clock, Ryzen+ still manages a small 2% improvement – signalling memory controller improvements. We also see Ryzen’s memory at 2400MT/s having better bandwidth than Intel’s at 2533.
We see big improvements on Ryzen+ for all caches (L1D/L2/L3) of roughly 20-30% – more than just the raw clock increase (11%) – so AMD has indeed made improvements, which to be fair needed to be done. The memory controller is also a bit more efficient (2%) and it can run at higher clocks than tested (2400MT/s) – hopefully fast DDR4 memory will become more affordable.
Data In-Page Random Latency (ns) | 66.4 (4-12-31) [-6%] [0][-5][-4] | 66.4 (4-12-31) | 70.5 (4-17-35) | 20.4 (4-12-21) | In-page latency has decreased by a noticeable 6% on Ryzen+ (both 2700X and 2600) – we see a 5 clock reduction for L2 and 4 for L3, a welcome improvement. But there is still a way to go to catch Intel, which has about 1/3 of the latency.
Data Full Random Latency (ns) | 80.9 (4-12-32) [-8%] [0][-5][-4] | 79.4 (4-12-32) | 87.6 (4-17-36) | 63.9 (4-12-34) | Out-of-page latencies have also been reduced by 8% on Ryzen+ (same memory) and we see the same 5 and 4 clock reductions for L2 and L3 (on both 2700X and 2600, so it’s no fluke). Again these are welcome but there is still a way to go to catch Intel.
Data Sequential Latency (ns) | 3.4 (4-6-7) [-8%] [0][-1][0] | 3.5 (4-6-7) | 3.7 (4-7-7) | 4.1 (4-12-13) | Ryzen’s prefetchers are working well with the sequential access pattern and we see an 8% latency drop for Ryzen+.
Ryzen’s issue was high memory latencies (in-page/full random) and Ryzen+ has reduced them all by 6-8%. While it is a good improvement, they are still pretty high compared to Intel’s, thus more work needs to be done here.
Code In-Page Random Latency (ns) | 14.2 (4-9-24) [-9%] [0][0][0] | 14.6 (4-9-24) | 15.6 (4-9-24) | 10.1 (2-10-21) | Code latencies were not a problem on Ryzen but we still see a welcome reduction of 9% on Ryzen+ (no clocks delta).
Code Full Random Latency (ns) | 88.6 (4-14-49) [-9%] [0][+1][+2] | 89.3 (4-14-49) | 97.4 (4-13-47) | 70.7 (2-11-46) | Out-of-page latency also sees a 9% decrease on Ryzen+ but, somewhat surprisingly, a 1-2 clock increase.
Code Sequential Latency (ns) | 7.6 (4-12-20) [-8%] [0][+1][+1] | 7.8 (4-12-20) | 8.3 (4-11-19) | 5.0 (2-4-9) | Ryzen’s prefetchers are working well with the sequential access pattern and we see an 8% reduction on Ryzen+.
Code access latencies were not a problem on Ryzen, but they also see an 8-9% improvement on Ryzen+, which is welcome. Note the L1I code cache is 2x Intel’s (64kB vs 32kB).
Memory Update Transactional (MTPS) | 4.7 [+10%] | 5 | 4.28 | 33.2 HLE | Ryzen+ is 10% faster than Ryzen but naturally without HLE support it cannot match the i7. But with Intel disabling HLE on all but top-end CPUs, AMD does not have much to worry about.
Memory Update Record Only (MTPS) | 4.6 [+11%] | 4.75 | 4.16 | 23 HLE | With only record updates we still see an 11% increase.

Ryzen+ brings nice updates – good bandwidth increases for all caches (L1D/L2/L3) and also much-needed latency reductions for data (and code) accesses. Yes, there is still work to be done to bring the latencies down further – but it may be just enough to push Intel into 2nd place for a good while.
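
The bracketed deltas in the results table compare the 2700X against the 1700X. The short sketch below reproduces a few of them from the raw figures and relates them to the 11% clock increase quoted in the comments; the helper name is ours, for illustration only.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


CLOCK_GAIN = 11.0  # % clock increase of the 2700X over the 1700X, as quoted in the article

# (2700X, 1700X) pairs copied from the results table
results = {
    "L1D bandwidth (GB/s)": (862, 730),
    "L2 bandwidth (GB/s)": (736, 556),
    "L3 bandwidth (GB/s)": (339, 284),
    "Data in-page latency (ns)": (66.4, 70.5),
}

for name, (r2700x, r1700x) in results.items():
    delta = pct_change(r2700x, r1700x)
    print(f"{name}: {delta:+.0f}% ({abs(delta) / CLOCK_GAIN:.1f}x the clock gain)")
```

Any delta well above the clock gain points to an architectural or firmware improvement rather than frequency alone.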

At the high-end, ThreadRipper2 will likely benefit most as it is going up against the many-core, AVX512-enabled SKL-X competitor, which is a lot “tougher” than the normal SKL/KBL/CFL consumer versions.

Final Thoughts / Conclusions

As with the original Ryzen, the cache and memory system performance is not the clean-sweep we’ve seen in CPU testing – but Ryzen+ does bring welcome improvements in bandwidth and latency, which will hopefully improve further with AGESA firmware/BIOS updates.

With the potential to use faster DDR4 memory, Ryzen+ can do far better than in this test (e.g. with 2933/3200MHz memory). Unfortunately, at this time DDR4 memory – especially the fast high-end versions – is hideously expensive, which is a bit of a problem. You may be better off using less but fast(er) memory with Ryzen designs.

Ryzen+ is a great update that will not disappoint upgraders and is likely to increase AMD’s market share. AMD is here to stay!

AMD Threadripper 1950X Review & Benchmarks – 4-channel DDR4 Cache & Memory Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provides an SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equalling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes – and thus up to 32C/64T – can be enabled on the socket in the server (“EPYC”) designs, while current HEDT systems use only 2; AMD may release versions with more dies later on. The large socket allows for 4 DDR4 memory channels, greatly increasing bandwidth over Ryzen, just as with Intel.

AMD Threadripper die

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing Threadripper (1950X) with the competing HEDT architecture (Intel SKL-X) as well as with the mainstream Ryzen (1700X) and i7 Skylake (6700K) designs, with a view to upgrading to a high-end, high performance design.

CPU Specifications | AMD Threadripper 1950X | Intel 9700X (SKL-X) | AMD Ryzen 1700X | Intel 6700K (SKL) | Comments
TLB 4kB pages | 64 full-way, 1536 8-way | 64 8-way, 1536 6-way | 64 full-way, 1536 8-way | 64 8-way, 1536 6-way | TR/Ryzen has comparatively “better” TLBs: 8-way vs 6-way and full-way vs 8-way.
TLB 2MB pages | 64 full-way, 1536 2-way | 8 full-way, 1536 6-way | 64 full-way, 1536 2-way | 8 full-way, 1536 6-way | Nothing much changes for 2MB pages with TR/Ryzen leading the pack again.
Memory Controller Speed (MHz) | 600-1200 | 800-3300 | 600-1200 | 800-4000 | TR/Ryzen’s memory controller runs at memory clock (MCLK) base rate and thus depends on the memory installed. Intel’s UNC (uncore) runs between min and max CPU clock and is thus perhaps faster.
Memory Speed (MHz) Max | 2400 / 2666 | 2533 / 2400 | 2400 / 2666 | 2533 / 2400 | TR/Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL/X supports only up to 2400 officially but happily runs at 3200MHz, a big advantage.
Memory Channels / Width | 4 / 256-bit | 4 / 256-bit | 2 / 128-bit | 2 / 128-bit | Both TR and SKL-X enjoy a 256-bit total memory width (4 channels).
Memory Timing (clocks) | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | Despite faster memory, TR/Ryzen can run lower timings than HSW-E and SKL, reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on TR/Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

When running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX ensures lower latencies and higher bandwidth, which we will test presently.

In addition, Threadripper is a NUMA SMP design – with the other nodes effectively different CPUs; thus sharing data between cores on different nodes is equivalent to sharing between different CPUs in an SMP system.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark’ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) and to provide different matching algorithms when selecting the producer/consumer units: best match (lowest latency) and worst match (highest latency), thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
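
The best/worst matching idea can be pictured with a minimal sketch that greedily pairs producer and consumer units by latency. This is not Sandra's actual algorithm: the greedy strategy, the toy 8-unit topology and all names are hypothetical, and greedy pairing is not guaranteed to be optimal in general. The latency figures are the TR 1950X values reported in the results table.

```python
from itertools import combinations

# Latency (ns) by relationship, TR 1950X figures from the results table
LATENCY_NS = {"same core": 15.4, "same CCX": 46.4,
              "different CCX": 184.7, "different node": 274.4}


def relation(a: tuple, b: tuple) -> str:
    """Units are (node, ccx, core, thread) tuples (hypothetical layout)."""
    if a[:3] == b[:3]:
        return "same core"
    if a[:2] == b[:2]:
        return "same CCX"
    if a[0] == b[0]:
        return "different CCX"
    return "different node"


def match_pairs(units: list, worst: bool = False) -> list:
    """Greedy matching: repeatedly take the lowest (or highest) latency pair of unused units."""
    candidates = sorted(combinations(units, 2),
                        key=lambda p: LATENCY_NS[relation(*p)], reverse=worst)
    used, pairs = set(), []
    for a, b in candidates:
        if a not in used and b not in used:
            pairs.append((a, b, relation(a, b)))
            used.update((a, b))
    return pairs


# Toy topology: 2 nodes x 2 CCXes x 1 core x 2 SMT threads
units = [(n, x, 0, t) for n in range(2) for x in range(2) for t in range(2)]
print([rel for _, _, rel in match_pairs(units)])              # best match
print([rel for _, _, rel in match_pairs(units, worst=True)])  # worst match
```

On this toy layout the best match pairs SMT siblings while the worst match pairs units on different nodes, which is the contrast the two bandwidth rows in the results are designed to expose.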

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). TR (like Ryzen) supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks | AMD Threadripper 1950X | Intel 9700X (SKL-X) | AMD Ryzen 1700X | Intel 6700K (SKL) | Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) | 92.2 [+7%] | 85.5 | 47.2 | 39.5 | With 16 cores (and thus 16 pairs) TR’s inter-core bandwidth beats SKL-X by over 7% – assuming threads are scheduled correctly.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) | 7.51 [1/3] | 24.4 | 5.75 | 16 | In the worst case, pairs on TR span not just different CCXes but different NUMA nodes, thus bandwidth is 1/3 that of SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) | 15.4 [-1%] | 15.8 | 15.5 | 16.1 | Within the same core (sharing L1D/L2), TR/Ryzen inter-unit latency is ~15ns, comparable with both of Intel’s CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Core (ns) | 46.4 [-36%] | 72.3 | 44.3 | 45 | Within the same compute unit (sharing L3), the latency of ~45ns is much lower than on SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Different CCX (ns) | 184.7 [+4x] | | 135 | | Going inter-CCX increases the latency by 4 times thus threads sharing data must be properly scheduled.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Node (ns) | 274.4 [+6x] | | | | Going inter-node increases the latency yet again, to 6 times, thus scheduling is everything.
The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled – as latencies are much larger than inter-core; going off node increases latencies yet again but not by a lot; if anything the inter-node interconnect seems pretty low latency comparatively.
Aggregated L1D Bandwidth (GB/s) | 1372 [-40%] | 2252 | 739 | 878 | SKL/X has 512-bit data ports (for AVX512) so TR/Ryzen cannot compete, but they would do better against older designs.
Aggregated L2 Bandwidth (GB/s) | 990 [-2%] | 1010 | 565 | 402 | The 16x L2 caches have similar bandwidth to the 10x much bigger caches on SKL-X.
Aggregated L3 Bandwidth (GB/s) | 749 [+2.6x] | 289 | 300 | 247 | The 4x L3 caches have much higher bandwidth than the single SKL-X cache.
Aggregated Memory (GB/s) | 56 [-18%] | 69 | 28 | 31 | Running at lower memory speed TR cannot beat SKL-X but has comparatively higher memory efficiency.
Even with 16x L1D and L2 caches, TR cannot match the much faster SKL-X 10x caches – which have been updated for 512-bit support – but they are competitive; the 4x L3 caches do soundly beat the unified one on SKL-X but then again sharing data not within the same CCX is going to be very much slower.

At 2400MT/s TR’s memory is running 25% slower than SKL-X’s at 3200MT/s (put the other way, SKL-X’s memory is 33% faster) but its bandwidth is just 18% lower – thus its 4x DDR4 controllers are more efficient – not something we’re used to seeing.
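
The efficiency claim can be checked with simple arithmetic. The sketch below assumes the usual 64-bit (8-byte) DDR4 channel and takes the theoretical peak as channels x 8 bytes x transfer rate; the measured figures are those from the Aggregated Memory row above. Function and variable names are ours, for illustration only.

```python
def peak_gbs(channels: int, mt_per_s: int, bytes_per_channel: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s, assuming 64-bit DDR4 channels."""
    return channels * bytes_per_channel * mt_per_s / 1000.0


# name: (channels, transfer rate in MT/s, measured GB/s from the results table)
systems = {
    "TR 1950X": (4, 2400, 56.0),
    "SKL-X": (4, 3200, 69.0),
}

for name, (channels, rate, measured) in systems.items():
    peak = peak_gbs(channels, rate)
    print(f"{name}: peak {peak:.1f} GB/s, measured {measured} GB/s, "
          f"efficiency {measured / peak:.0%}")
```

Under these assumptions TR reaches roughly 73% of its theoretical peak versus roughly 67% for SKL-X, which is consistent with the observation that its memory controllers are the more efficient of the two.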

Data In-Page Random Latency (ns) | 72.8 [4-17-37] [+2.75x] | 26.4 [4-13-33] | 70.7 [4-17-37] | 20 [4-12-21] | What we saw previously with Ryzen was no accident; TR also suffers from surprisingly large in-page latency, almost 3x that of Intel designs. Either the TLBs are very slow or not working.
Data Full Random Latency (ns) | 111.5 [4-17-44] [+47%] | 75.5 [4-13-70] | 87.9 [4-17-37] | 65 [4-12-34] | Out-of-page latencies are ‘better’ with TR/Ryzen ‘only’ ~50% slower than SKL/X.
Data Sequential Latency (ns) | 5.5 [4-7-8] [=] | 5.4 [4-11-13] | 3.8 [4-7-8] | 4.1 [4-12-13] | TR’s prefetchers are working well, with sequential access pattern latency at ~5ns matching SKL-X.
We finally discover an issue – TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge – almost 3x higher than Intel’s. It is a mystery as to why, as both out-of-page random and sequential are competitive. It does point to something with the TLBs: either they do not work or they are just very much slower for some reason.
Code In-Page Random Latency (ns) | 17.2 [4-10-26] [+43%] | 12 [4-14-28] | 16.1 [4-9-25] | 10 [4-11-21] | With code we don’t see the same problem – in-page latency is a bit higher than SKL-X (about 40%) but nowhere near as high as what we saw before.
Code Full Random Latency (ns) | 178 [4-15-60] [+2x] | 86.1 [4-16-106] | 95.4 [4-13-49] | 70 [4-11-47] | Out-of-page latency is about 2x that of SKL-X, a larger gap than with data (+47%).
Code Sequential Latency (ns) | 8.7 [4-10-20] [+33%] | 6.5 [4-7-12] | 8.4 [4-9-18] | 5.3 [4-9-20] | Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns and thus 33% higher than SKL-X.
While code access latencies are higher than on the new SKL-X – they are comparable with the older designs and not as bad as we’ve seen with data. Overall it seems TR (like Ryzen) will need some memory controller optimisations regarding latencies – though bandwidth seems just great.
Memory Update Transactional (MTPS) | 1.9 | 52.2 [HLE] | 4.18 | 32.4 [HLE] | SKL/X is in a world of its own due to support for HLE/RTM and there is not much TR/Ryzen can do about it.
Memory Update Record Only (MTPS) | 1.88 | 57.23 [HLE] | 4.22 | 25.4 [HLE] | We see a similar pattern here.
Without HLE/RTM, TR (like Ryzen) doesn’t have much chance against SKL/X but considering support for it is disabled in most SKUs, there’s not much AMD has to be worried about – not to mention Intel disabling it in the older HSW and BRW designs. But should AMD enable it in future designs Intel will have a problem on its hands…

Threadripper’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals’, partly due to more cores and more caches (16 vs 10); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies, which are far higher (about 3 times), possibly pointing to TLB issues. We’ve been here before with Bulldozer, where this could not be easily fixed – but if AMD does manage it this time, Ryzen’s performance will fly!
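
For readers unfamiliar with the three access patterns used in the latency tests, the sketch below builds pointer-chasing index chains for sequential, in-page random and full random traversal. It illustrates the patterns only: the page and element counts are assumptions, this is not Sandra's implementation, and Python itself is far too slow to measure nanosecond latencies.

```python
import random


def build_chain(n_elems: int, elems_per_page: int, pattern: str, seed: int = 0) -> list[int]:
    """Return the visiting order of element indices for the given access pattern."""
    rng = random.Random(seed)
    if pattern == "sequential":
        return list(range(n_elems))
    if pattern == "full_random":
        order = list(range(n_elems))
        rng.shuffle(order)            # any element may follow any other (crosses pages)
        return order
    if pattern == "in_page_random":
        order = []
        for start in range(0, n_elems, elems_per_page):
            page = list(range(start, min(start + elems_per_page, n_elems)))
            rng.shuffle(page)         # random within the page; the page is finished before moving on
            order.extend(page)
        return order
    raise ValueError(f"unknown pattern: {pattern}")


def to_next_pointers(order: list[int]) -> list[int]:
    """Turn a visiting order into a 'next index' array forming a single cycle."""
    nxt = [0] * len(order)
    for cur, following in zip(order, order[1:] + order[:1]):
        nxt[cur] = following
    return nxt


order = build_chain(16, 4, "in_page_random")
print(order)
print(to_next_pointers(order))
```

Because in-page random accesses stay within one page at a time, they should mostly hit the TLB, which is why unexpectedly high in-page latency draws attention to the TLBs rather than to the memory itself.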

Still, despite this issue we’ve seen in the previous article that TR’s CPU performance is very strong thus it may not be such a big problem.

Final Thoughts / Conclusions

TR’s memory performance is not the clean-sweep we’ve seen in CPU testing but it is competitive with Intel’s designs, especially the older ones. The bandwidths are all competitive and the memory controllers in particular seem to be more efficient – but latencies are a bit of a problem which AMD may have to improve in future designs.

Overall we’d still recommend TR over Intel CPUs unless you want an absolutely tried and tested design which has already been patched by microcode and firmware/BIOS updates.