Intel Core i7 8700K CoffeeLake Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “CoffeeLake” (CFL)?

The 8th generation Intel Core architecture is code-named “CoffeeLake” (CFL): unlike previous architectures, it is only a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). The server/workstation (SKL-X/KBL-X) CPU cores gained new instruction set support (AVX512) as well as other improvements, but these have not yet made the transition to the consumer parts.

Possibly due to limited competition (before the AMD Ryzen launch), process issues (still at 14nm) and the disclosure of a whole host of hardware vulnerabilities (Spectre, Meltdown, etc.) that required microcode (firmware) updates, performance improvements have not been forthcoming. This is pretty much unprecedented: while some Core updates were only evolutionary, we have not had complete stagnation before. In addition, the built-in GPU core has also remained pretty much stagnant; we will investigate this in a subsequent article.

However, CFL does bring a major change: increased core counts on both desktop and mobile. On desktop we go from 4 to 6 cores (+50%) while on mobile (ULV) we go from 2 to 4 (+100%) within the same TDP envelope!

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i7 (8700K) with previous generation (6700K) and competing architectures with a view to upgrading to a mid-range high performance design.

| CPU Specifications | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Intel i7-6700K SkyLake | Comments |
| --- | --- | --- | --- | --- | --- |
| L1D / L1I Caches | 6x 32kB 8-way / 6x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | No L1D/I changes, Ryzen’s L1I is twice as big. |
| L2 Caches | 6x 256kB 4-way | 8x 512kB 8-way | 10x 1MB 16-way | 4x 256kB 4-way | No L2 changes, Ryzen’s L2 is twice as big again. |
| L3 Caches | 12MB 16-way | 2x 8MB 16-way | 2x 8MB 16-way | 8MB 16-way | L3 has also increased with the number of cores but is still behind Ryzen’s dual 8MB L3 caches. |
| TLB 4kB pages | 64 4-way / 64 8-way / 1536 6-way | 64 full-way / 1536 8-way | 64 4-way / 64 8-way / 1536 6-way | 64 4-way / 64 8-way / 1536 6-way | No TLB changes. |
| TLB 2MB pages | 8 full-way / 1536 6-way | 64 full-way / 1536 2-way | 8 full-way / 1536 6-way | 8 full-way / 1536 6-way | No TLB changes. |
| Memory Controller Speed (MHz) | 1200-4400 | 1333-2667 | 1200-2700 | 1200-4000 | The uncore (memory controller) runs at a faster clock due to the higher rated clock, but there is not a lot in it. |
| Memory Data Speed (MHz) | 3200 | 2667 | 3200 | 2533 | CFL can easily run at 3200Mt/s while KBL/SKL were not as reliable. We could not get Ryzen past 2667 although it does support 2933. |
| Memory Channels / Width | 2 / 128-bit | 2 / 128-bit | 2 / 128-bit | 2 / 128-bit | All have 128-bit total channel width. |
| Memory Bandwidth (GB/s) | 50 | 42 | 100 | 40 | Bandwidth has naturally increased with memory clock speed but latencies are higher. |
| Uncore / Memory Controller Firmware | 2.6.2 | | | 2.0.0.6 | We’re on firmware 2.6.x vs. 2.0.x on old SKL/KBL. |
| Memory Timing (clocks) | 16-16-16-36 6-52-25-12 2T | 16-17-17-35 7-60-20-10 2T | | 16-18-18-36 5-54-21-10 2T | Timings are very much BIOS dependent and vary a lot. |
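
The Memory Bandwidth figures in the table follow from the data rate and the total bus width. A minimal sketch of that arithmetic, assuming decimal gigabytes (10^9 bytes); the function name `peak_bandwidth_gbs` is illustrative and not taken from any benchmark tool:

```python
def peak_bandwidth_gbs(data_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return data_rate_mts * 1e6 * bytes_per_transfer / 1e9


# Data rates from the table above, on a 128-bit (2-channel) bus
for name, rate in [("i7-8700K", 3200), ("Ryzen2 2700X", 2667), ("i7-6700K", 2533)]:
    print(f"{name}: {peak_bandwidth_gbs(rate, 128):.1f} GB/s")
```

The results (51.2, 42.7 and 40.5 GB/s) line up with the rounded 50, 42 and 40 GB/s quoted in the table. The 100 GB/s quoted for the i9-7900X is what this formula gives for a 256-bit bus at 3200Mt/s (102.4 GB/s).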

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

| Native Benchmarks | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Intel i7-6700K SkyLake | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) | 52.5 [-5%] | 55.3 | 86 | 39.5 | Despite having 2 fewer cores, CFL has only 5% less bandwidth than Ryzen2. |
| CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) | 15.5 [+144%] | 6.35 | 25.7 | 16.1 | In the worst case, pairs on Ryzen2 must go across CCXes, unlike on Intel’s CPUs, thus CFL can muster over 2x more bandwidth in this case. |
| Summary | | | | | CFL manages a good bandwidth improvement over KBL/SKL and, thanks to its unified design, matches Ryzen2 in the best case and beats it soundly in the worst case. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) | 14.4 [+7%] | 13.5 | 15 | 16 | Surprisingly, Ryzen2 manages lower thread latency when sharing a core. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) | 45 [+12%] | 40 | 75 | 47 | Within the same unit, Ryzen2 is again faster than CFL. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) | | 115 | | | Obviously going across CCXes is slow, about 3x slower, which needs careful thread scheduling. |
| Summary | | | | | The multiple CCX design still presents some challenges to programmers, requiring threads to be carefully scheduled, but we see Ryzen2 with lower latencies for both core and unit: a surprising result, as Intel’s caches usually have lower latency. |
| Aggregated L1D Bandwidth (GB/s) | 1630 [+59%] | 854 | 2220 | 884 | Intel’s wide data path L1 caches allow even old SKL to beat Ryzen2, with CFL enjoying 60% more bandwidth. |
| Aggregated L2 Bandwidth (GB/s) | 571 [-21%] | 720 | 985 | 329 | But Ryzen2’s L2 caches are not only twice as big but also very wide; CFL has 20% less bandwidth. |
| Aggregated L3 Bandwidth (GB/s) | 327 [-4%] | 339 | 464 | 243 | Ryzen’s 2 L3 caches also provide good bandwidth, matching CFL’s unified L3 cache. |
| Aggregated Memory (GB/s) | 35.6 [+11%] | 32.2 | 70 | 30.1 | Running at 3200Mt/s, CFL obviously enjoys higher bandwidth than Ryzen2 at 2667Mt/s, but somehow the latter has better efficiency. |
| Summary | | | | | Nothing much has changed in CFL vs. old SKL: while the L1 caches are wide and thus fast, the L2 and L3 are not as impressive. The memory controller is competitive but does not seem as efficient as Ryzen2’s, though it is more stable at high data rates, allowing for higher bandwidth. |
| Data In-Page Random Latency (ns) | 17.4 (4-11-20) [-73%] | 63.4 (4-12-31) | 25.5 (4-13-30) | 20.4 (4-12-21) | While clock latencies have not changed vs. old KBL/SKL, CFL enjoys lower latencies due to higher data rates. Ryzen2 has problems here. |
| Data Full Random Latency (ns) | 53.4 (4-11-42) [-30%] | 76.2 (4-12-32) | 74 (4-13-62) | 63.9 (4-12-34) | Out-of-page clock latencies have increased but are still lower overall. Ryzen2 has almost caught up here. |
| Data Sequential Latency (ns) | 3.8 (4-11-12) [+15%] | 3.3 (4-6-7) | 5.3 (4-12-12) | 4.1 (4-12-13) | With sequential access, Ryzen2 is now faster as CFL’s clock latencies have not changed. |
| Summary | | | | | CFL is lucky here as even Ryzen2 still has high latencies in random accesses (either in-page or full range) but manages to be faster with sequential access. Intel will need to improve going forward as clock latencies, while good, have really not improved at all. |
| Code In-Page Random Latency (ns) | 8.7 (2-10-21) [-37%] | 13.8 (4-9-24) | 11.8 (4-14-25) | 10.1 (2-10-21) | Code clock latencies have not changed either, and while Ryzen2 performs a lot better here, CFL (and even old SKL) manages to be ~35% faster. |
| Code Full Random Latency (ns) | 59.8 (2-10-48) [-30%] | 85.7 (4-14-49) | 83.6 (4-15-74) | 70.7 (2-11-46) | Out-of-page clock latencies also have not changed and here CFL is 30% faster than Ryzen2. |
| Code Sequential Latency (ns) | 4.5 (2-4-10) [-39%] | 7.4 (4-12-20) | 6.8 (4-7-11) | 5 (2-4-9) | Ryzen2 is competitive but again CFL manages to be almost 40% faster. |
| Summary | | | | | CFL dominates here and enjoys 30-40% lower latency than Ryzen2, but the latter has improved a lot over time. |
| Memory Update Transactional (MTPS) | 54 [+980%] | 5 | 59 | 35 | Finally all top-end Intel CPUs have HLE enabled and working and thus enjoy a huge performance increase. |
| Memory Update Record Only (MTPS) | 38 [+730%] | 4.58 | 59 | 24.8 | Nothing much changes here. |
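
The bracketed percentages in the results are relative differences between the CFL score and the Ryzen2 score. A small sketch of how such a figure can be reproduced; the helper name `delta_percent` is an assumption for illustration, and the sample values are copied from the table above:

```python
def delta_percent(value: float, reference: float) -> float:
    """Relative difference of value against reference, in percent."""
    return (value - reference) / reference * 100.0


# (metric, CFL score, Ryzen2 score) taken from the results above
rows = [
    ("Inter-Core Bandwidth, Best (GB/s)", 52.5, 55.3),
    ("Data In-Page Random Latency (ns)", 17.4, 63.4),
    ("Code Full Random Latency (ns)", 59.8, 85.7),
]
for metric, cfl, ryzen2 in rows:
    print(f"{metric}: {delta_percent(cfl, ryzen2):+.0f}%")
```

For bandwidth rows a positive figure favours CFL; for latency rows a negative figure does, since lower latency is better.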

Ryzen2 brings nice updates – good bandwidth increases to all caches L1D/L2/L3 and also well-needed latency reduction for data (and code) accesses. Yes, there is still work to be done to bring the latencies down further – but it may be just enough to beat Intel to 2nd place for a good while.

At the high-end, ThreadRipper2 will likely benefit most as it’s going against many-core SKL-X AVX512-enabled competitor which is a lot “tougher” than the normal SKL/KBL/CFL consumer versions.

Final Thoughts / Conclusions

CFL’s caches and memory (uncore) sub-systems are unchanged from SKL/KBL and thus provide no surprises, with rock-solid performance at 3200Mt/s with huge bandwidth (needed after all to feed 12 threads) but Ryzen2 has improved a lot over old AMD CPU designs.

With the continuous increase in cores/threads (8/16 in CFL-R), as with Ryzen1/2, but only modest DDR4 speed increases (not to mention very high cost), desktop platforms are likely to see diminishing returns due to core/thread data starvation, as the extra cores simply cannot be fed by the memory sub-system. The L2 and L3 caches will need to be improved (widened and enlarged as with SKL-X), and the now defunct L4/eDRAM cache should re-emerge to mitigate these issues…
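
The data starvation argument can be made concrete by dividing the measured memory bandwidth by the number of hardware threads. This is a rough sketch using the "Aggregated Memory" results from this article; the thread counts (two threads per core) and the function name are assumptions added for illustration:

```python
def bandwidth_per_thread(total_gbs: float, threads: int) -> float:
    """Measured memory bandwidth available to each hardware thread, in GB/s."""
    return total_gbs / threads


# (CPU, measured aggregated memory bandwidth in GB/s, hardware threads)
systems = [
    ("i7-6700K", 30.1, 8),
    ("i7-8700K", 35.6, 12),
    ("Ryzen2 2700X", 32.2, 16),
    ("i9-7900X", 70.0, 20),
]
for name, gbs, threads in systems:
    print(f"{name}: {bandwidth_per_thread(gbs, threads):.2f} GB/s per thread")
```

On these figures the 8700K has less bandwidth per thread than the older 6700K even though its total bandwidth is higher, which is the trend the paragraph above describes.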

AMD Ryzen2 2700X Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen2” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen2” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU versions, with integrated “Vega RX” graphics, as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we are testing them in this article!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

| CPU Specifications | AMD Ryzen2 2700X Pinnacle Ridge | AMD Ryzen2 2600 Pinnacle Ridge | AMD Ryzen 1700X Summit Ridge | Intel i7-6700K SkyLake | Comments |
| --- | --- | --- | --- | --- | --- |
| L1D / L1I Caches | 8x 32kB 8-way / 8x 64kB 8-way | 6x 32kB 8-way / 6x 64kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | Ryzen2 data/instruction caches are unchanged; the icache is still 2x as big as Intel’s. |
| L2 Caches | 8x 512kB 8-way | 6x 512kB 8-way | 8x 512kB 8-way | 4x 256kB 8-way | Ryzen2 L2 cache is unchanged but we’re told latencies have been improved. In total it is 4x bigger than Intel’s! |
| L3 Caches | 2x 8MB 16-way | 2x 8MB 16-way | 2x 8MB 16-way | 8MB 16-way | Ryzen2 L3 caches are also unchanged, but again latencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7. |
| TLB 4kB pages | 64 full-way / 1536 8-way | 64 full-way / 1536 8-way | 64 full-way / 1536 8-way | 64 8-way / 1536 6-way | No TLB changes. |
| TLB 2MB pages | 64 full-way / 1536 2-way | 64 full-way / 1536 2-way | 64 full-way / 1536 2-way | 8 full-way / 1536 6-way | No TLB changes, same as 4kB pages. |
| Memory Controller Speed (MHz) | 600-1200 | 600-1200 | 600-1200 | 1200-4000 | Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster. |
| Memory Speed (MHz) Max | 2400 / 2933 | 2400 / 2933 | 2400 / 2666 | 2533 / 2400 | Ryzen2 now supports up to 2933MHz (officially) which should improve its performance quite a bit; unfortunately fast DDR4 is very expensive right now. |
| Memory Channels / Width | 2 / 128-bit | 2 / 128-bit | 2 / 128-bit | 2 / 128-bit | All have 128-bit total channel width. |
| Memory Timing (clocks) | 14-16-16-32 7-54-18-9 2T | 14-16-16-32 7-54-18-9 2T | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | Memory runs at the same timings on both Ryzen2 and Ryzen1 but we shall see if measured latencies are different. |
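
Two of the rows above can be related with simple arithmetic: DDR memory transfers data twice per memory clock, so 2400Mt/s memory implies a 1200MHz MCLK (the upper bound of Ryzen's memory controller speed), and the first timing value (CAS latency, in clocks) can be converted to nanoseconds. A minimal sketch; the function names are illustrative assumptions:

```python
def memory_clock_mhz(data_rate_mts: float) -> float:
    """DDR transfers twice per clock, so the memory clock is half the data rate."""
    return data_rate_mts / 2


def cas_latency_ns(cl_cycles: int, data_rate_mts: float) -> float:
    """CAS latency converted from memory clocks to nanoseconds."""
    return cl_cycles / memory_clock_mhz(data_rate_mts) * 1000.0


print(memory_clock_mhz(2400))               # Ryzen at 2400Mt/s
print(round(cas_latency_ns(14, 2400), 2))   # Ryzen timings 14-16-16-32
print(round(cas_latency_ns(16, 2533), 2))   # i7-6700K timings 16-18-18-36
```

This only covers the DRAM's own CAS delay; the latencies measured later in the article also include the cache hierarchy and memory controller.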

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

When running algorithms that require data to be shared between threads (e.g. producer/consumer), scheduling those threads on the same CCX should ensure lower latencies and higher bandwidth, which we will test presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
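
To illustrate the idea of "best match" and "worst match" pairing, here is a hedged sketch and not Sandra's actual algorithm. It assumes a hypothetical logical CPU numbering in which SMT siblings are adjacent and each CCX holds 4 cores, and it uses the latencies measured for the 2700X in the results of this article as the cost of each relationship:

```python
from itertools import combinations

# Assumed (hypothetical) enumeration: SMT siblings are adjacent logical CPUs
THREADS_PER_CORE = 2
CORES_PER_CCX = 4

# Inter-unit latencies (ns) reported for the 2700X in this article
LATENCY_NS = {"same core": 13.5, "same CCX": 40.1, "different CCX": 128.0}


def relation(cpu_a: int, cpu_b: int) -> str:
    """Classify how two logical CPUs are related in the assumed topology."""
    core_a, core_b = cpu_a // THREADS_PER_CORE, cpu_b // THREADS_PER_CORE
    if core_a == core_b:
        return "same core"
    if core_a // CORES_PER_CCX == core_b // CORES_PER_CCX:
        return "same CCX"
    return "different CCX"


def match_pairs(cpus, best=True):
    """Greedily pair CPUs, taking lowest (best) or highest (worst) latency first."""
    ranked = sorted(combinations(cpus, 2),
                    key=lambda pair: LATENCY_NS[relation(*pair)],
                    reverse=not best)
    used, pairs = set(), []
    for a, b in ranked:
        if a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs


for label, best in (("best", True), ("worst", False)):
    pairs = match_pairs(range(16), best)
    print(label, [(a, b, relation(a, b)) for a, b in pairs])
```

With 16 logical CPUs, best matching pairs each thread with its SMT sibling, while worst matching pairs every thread with one on the other CCX, which is the case that exercises the inter-CCX link.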

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

| Native Benchmarks | Ryzen2 2700X 8C/16T Pinnacle Ridge | Ryzen2 2600 6C/12T Pinnacle Ridge | Ryzen 1700X 8C/16T Summit Ridge | i7-6700K 4C/8T SkyLake | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) | 54.9 [+15%] | 46.5 | 47.8 | 39 | Ryzen2 manages 15% higher bandwidth between its cores, slightly better than the 11% clock increase alone, signalling some improvements under the hood. |
| CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) | 5.89 [+2%] | 5.53 | 5.8 | 16.3 | In the worst case, pairs on Ryzen must go across CCXes, and with this link running at the same clock (1200MHz) on Ryzen2 we can only manage a 2% increase in bandwidth. This is why faster memory is needed. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) | 13.5 [-13%] | 15.4 | 15.6 | 16.2 | Within the same core (sharing L1D/L2), Ryzen2 manages a 13% reduction in latency, again better than just the clock speed increase. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) | 40.1 [-7%] | 43.5 | 43.2 | 47.3 | Within the same compute unit (sharing L3), the latency decreased by 7% on Ryzen2 thus L3 seems to have improved also. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) | 128 [-6%] | 132 | 236 | | Going inter-CCX we still see a 6% reduction in latency on Ryzen2, with the CCX link at the same speed: a welcome surprise. |
| Summary | | | | | The multiple CCX design still presents some challenges to programmers, requiring threads to be carefully scheduled, but we see a decent 6-7% reduction in L3/CCX latencies on Ryzen2 even when running at the same clock as Ryzen1. |
| Aggregated L1D Bandwidth (GB/s) | 862 [+18%] | 615 | 730 | 837 | Right off we see an 18% bandwidth increase, almost 2x the 11% clock increase, thus some improvements have been made to the cache system. It allows Ryzen2 to finally beat the i7 with its wide L1 data paths (512-bit), though with 2x more caches (8 vs 4). |
| Aggregated L2 Bandwidth (GB/s) | 736 [+32%] | 542 | 556 | 329 | We see a huge 32% increase in L2 cache bandwidth, almost 3x the 11% clock increase, suggesting the L2 caches have been improved also. Ryzen2 thus has 2x the L2 bandwidth of the i7, though with 2x more caches (8 vs 4). |
| Aggregated L3 Bandwidth (GB/s) | 339 [+19%] | 398 | 284 | 238 | The bandwidth of the L3 caches has also increased by 19% (almost 2x the clock increase), though we see the 6-core 2600 doing better (398 vs 339), likely due to fewer threads competing for the same L3 caches (12 vs 16). Ryzen2 L3 caches are not just 2x bigger than Intel’s but also have about 40% more bandwidth (339 vs 238). |
| Aggregated Memory (GB/s) | 30.2 [+2%] | 30.2 | 29.6 | 29.1 | With the same memory clock, Ryzen2 does still manage a small 2% improvement, signalling memory controller improvements. We also see Ryzen’s memory at 2400Mt/s having better bandwidth than Intel at 2533. |
| Summary | | | | | We see big improvements on Ryzen2 for all caches L1D/L2/L3 of 20-30%, more than just the raw clock increase (11%), so AMD has indeed made improvements, which to be fair needed to be done. The memory controller is also a bit more efficient (2%) though it can run at higher clocks than tested (2400Mt/s); hopefully fast DDR4 memory will become more affordable. |
| Data In-Page Random Latency (ns) | 66.4 (4-12-31) [-6%] [0][-5][-4] | 66.4 (4-12-31) | 70.5 (4-17-35) | 20.4 (4-12-21) | In-page latency has decreased by a noticeable 6% on Ryzen2 (both 2700X and 2600); we see a 5 clock reduction for L2 and 4 for L3, a welcome improvement. But there is still a way to go to catch Intel, which has about 1/3 of the latency. |
| Data Full Random Latency (ns) | 80.9 (4-12-32) [-8%] [0][-5][-4] | 79.4 (4-12-32) | 87.6 (4-17-36) | 63.9 (4-12-34) | Out-of-page latencies have also been reduced by 8% on Ryzen2 (same memory) and we see the same 5 and 4 clock reduction for L2 and L3 (on both 2700X and 2600, so it’s no fluke). Again these are welcome but still have a way to go to catch Intel. |
| Data Sequential Latency (ns) | 3.4 (4-6-7) [-8%] [0][-1][0] | 3.5 (4-6-7) | 3.7 (4-7-7) | 4.1 (4-12-13) | Ryzen’s prefetchers are working well with sequential access pattern latency and we see an 8% latency drop for Ryzen2. |
| Summary | | | | | Ryzen1’s issue was high memory latencies (in-page/full random) and Ryzen2 has reduced them all by 6-8%. While it is a good improvement, they are still pretty high compared to Intel’s thus more work needs to be done here. |
| Code In-Page Random Latency (ns) | 14.2 (4-9-24) [-9%] [0][0][0] | 14.6 (4-9-24) | 15.6 (4-9-24) | 10.1 (2-10-21) | Code latencies were not a problem on Ryzen1 but we still see a welcome reduction of 9% on Ryzen2 (no clocks delta). |
| Code Full Random Latency (ns) | 88.6 (4-14-49) [-9%] [0][+1][+2] | 89.3 (4-14-49) | 97.4 (4-13-47) | 70.7 (2-11-46) | Out-of-page latency also sees a 9% decrease on Ryzen2 but, somewhat surprisingly, a 1-2 clock increase. |
| Code Sequential Latency (ns) | 7.6 (4-12-20) [-8%] [0][+1][+1] | 7.8 (4-12-20) | 8.3 (4-11-19) | 5.0 (2-4-9) | Ryzen’s prefetchers are working well with sequential access pattern latency and we see an 8% reduction on Ryzen2. |
| Summary | | | | | Code access latencies were not a problem on Ryzen1, but they also see an 8% improvement on Ryzen2, which is welcome. Note the code L1I cache is 2x Intel’s (64kB vs 32kB). |
| Memory Update Transactional (MTPS) | 4.7 [+10%] | 5 | 4.28 | 33.2 HLE | Ryzen2 is 10% faster than Ryzen1 but naturally without HLE support it cannot match the i7. But with Intel disabling HLE on all but top-end CPUs AMD does not have much to worry about. |
| Memory Update Record Only (MTPS) | 4.6 [+11%] | 4.75 | 4.16 | 23 HLE | With only record updates we still see an 11% increase. |
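
The comments above repeatedly compare each gain against the 11% clock increase. One way to separate the two, sketched here as an illustration rather than the method used for the article, is to divide the measured ratio by the clock ratio; the function name and the normalisation itself are assumptions:

```python
CLOCK_GAIN = 0.11  # Ryzen2 vs Ryzen1 clock increase quoted in the article


def per_clock_gain(new: float, old: float, clock_gain: float = CLOCK_GAIN) -> float:
    """Fractional improvement left after factoring out the clock increase."""
    return (new / old) / (1.0 + clock_gain) - 1.0


# (metric, 2700X score, 1700X score) from the bandwidth results above
rows = [
    ("Inter-Core Bandwidth, Best", 54.9, 47.8),
    ("Aggregated L1D Bandwidth", 862, 730),
    ("Aggregated L2 Bandwidth", 736, 556),
    ("Aggregated L3 Bandwidth", 339, 284),
]
for metric, ryzen2, ryzen1 in rows:
    print(f"{metric}: {per_clock_gain(ryzen2, ryzen1) * 100:+.1f}% per clock")
```

All four rows remain positive after normalisation, with the L2 cache showing by far the largest per-clock gain, consistent with the claim that the caches themselves were improved.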

Ryzen2 brings nice updates – good bandwidth increases to all caches L1D/L2/L3 and also well-needed latency reduction for data (and code) accesses. Yes, there is still work to be done to bring the latencies down further – but it may be just enough to beat Intel to 2nd place for a good while.

At the high-end, ThreadRipper2 will likely benefit most as it’s going against many-core SKL-X AVX512-enabled competitor which is a lot “tougher” than the normal SKL/KBL/CFL consumer versions.

Final Thoughts / Conclusions

As with original Ryzen, the cache and memory system performance is not the clean-sweep we’ve seen in CPU testing – but Ryzen2 does bring welcome improvements in bandwidth and latency – which hopefully will further improve with firmware/BIOS updates (AGESA firmware).

With the potential to use faster DDR4 memory, Ryzen2 can do far better than in this test (e.g. with 2933/3200MHz memory). Unfortunately, at this time DDR4 memory, especially the high-end fast versions, is hideously expensive, which is a bit of a problem. You may be better off using less but fast(er) memory with Ryzen designs.

Ryzen2 is a great update that will not disappoint upgraders and is likely to increase AMD’s market share. AMD is here to stay!

AMD Threadripper 1950X Review & Benchmarks – 4-channel DDR4 Cache & Memory Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket, thus up to 32C/64T can be enabled in the server (“EPYC”) designs. Current HEDT systems only use 2, but AMD may release versions with more dies later on. The large socket allows for 4 DDR4 memory channels, greatly increasing bandwidth over Ryzen, just as with Intel.
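
The core, thread and cache counts quoted for these multi-die packages follow from the per-die layout. A small sketch under stated assumptions: each die is taken to hold 2 CCXes of 4 cores with 8MB L3 each, 2 threads per core and 2 memory channels, consistent with the Ryzen figures in these articles; the class and field names are illustrative only:

```python
from dataclasses import dataclass


@dataclass
class ZenPackage:
    """Hypothetical model of a multi-die Zen package; defaults are assumptions."""
    dies: int
    ccx_per_die: int = 2
    cores_per_ccx: int = 4
    threads_per_core: int = 2
    l3_per_ccx_mb: int = 8
    channels_per_die: int = 2

    @property
    def cores(self) -> int:
        return self.dies * self.ccx_per_die * self.cores_per_ccx

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    @property
    def l3_caches(self) -> int:
        return self.dies * self.ccx_per_die

    @property
    def memory_channels(self) -> int:
        return self.dies * self.channels_per_die


tr = ZenPackage(dies=2)      # Threadripper 1950X
server = ZenPackage(dies=4)  # 4-die server configuration
print(tr.cores, tr.threads, tr.l3_caches, tr.memory_channels)
print(server.cores, server.threads)
```

Two dies give the 16 cores, 4 L3 caches and 4 memory channels discussed for Threadripper, while four dies give the 32C/64T mentioned above.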

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

| CPU Specifications | AMD Threadripper 1950X | Intel i9-7900X (SKL-X) | AMD Ryzen 1700X | Intel 6700K (SKL) | Comments |
| --- | --- | --- | --- | --- | --- |
| TLB 4kB pages | 64 full-way / 1536 8-way | 64 8-way / 1536 6-way | 64 full-way / 1536 8-way | 64 8-way / 1536 6-way | TR/Ryzen has comparatively “better” TLBs: 8-way vs 6-way and full-way vs 8-way. |
| TLB 2MB pages | 64 full-way / 1536 2-way | 8 full-way / 1536 6-way | 64 full-way / 1536 2-way | 8 full-way / 1536 6-way | Nothing much changes for 2MB pages with TR/Ryzen leading the pack again. |
| Memory Controller Speed (MHz) | 600-1200 | 800-3300 | 600-1200 | 800-4000 | TR/Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster. |
| Memory Speed (MHz) Max | 2400 / 2666 | 2533 / 2400 | 2400 / 2666 | 2533 / 2400 | TR/Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL/X supports only up to 2400 officially but happily runs at 3200MHz, a big advantage. |
| Memory Channels / Width | 4 / 256-bit | 4 / 256-bit | 2 / 128-bit | 2 / 128-bit | Both TR and SKL-X enjoy 256-bit total memory width. |
| Memory Timing (clocks) | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | 14-16-16-32 7-54-18-9 2T | 16-18-18-36 5-54-21-10 2T | Despite faster memory, TR/Ryzen can run lower timings than HSW-E and SKL reducing its overall latencies. |

Core Topology and Testing

As discussed in the previous article, cores on TR/Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

When running algorithms that require data to be shared between threads (e.g. producer/consumer), scheduling those threads on the same CCX should ensure lower latencies and higher bandwidth, which we will test presently.

In addition, Threadripper is a NUMA SMP design – with the other nodes effectively different CPUs; thus sharing data between cores on different nodes is equivalent to different CPUs in a SMP system.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). TR (like Ryzen) supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

| Native Benchmarks | AMD Threadripper 1950X | Intel i9-7900X (SKL-X) | AMD Ryzen 1700X | Intel 6700K (SKL) | Comments |
| --- | --- | --- | --- | --- | --- |
| CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) | 92.2 [+7%] | 85.5 | 47.2 | 39.5 | With 16 cores (and thus 16 pairs) TR’s inter-core bandwidth beats SKL-X by over 7%, assuming threads are scheduled correctly. |
| CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) | 7.51 [1/3] | 24.4 | 5.75 | 16 | In the worst case, pairs on TR go not just to a different CCX but to a different NUMA node, thus bandwidth is 1/3 that of SKL-X. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) | 15.4 [-1%] | 15.8 | 15.5 | 16.1 | Within the same core (sharing L1D/L2), TR/Ryzen inter-unit latency is ~15ns, comparable with both Intel CPUs. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Different Core (ns) | 46.4 [-36%] | 72.3 | 44.3 | 45 | Within the same compute unit (sharing L3), the latency is ~45ns, much lower than SKL-X. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Different CCX (ns) | 184.7 [+4x] | | 135 | | Going inter-CCX increases the latency by 4 times thus threads sharing data must be properly scheduled. |
| CPU Multi-Core Benchmark Inter-Unit Latency – Different Node (ns) | 274.4 [+6x] | | | | Going inter-node increases the latency yet again by 6 times, thus scheduling is everything. |
| Summary | | | | | The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled, as latencies are much larger than inter-core; going off node increases latencies yet again but not by a lot; if anything the inter-node interconnect seems pretty low latency comparatively. |
| Aggregated L1D Bandwidth (GB/s) | 1372 [-40%] | 2252 | 739 | 878 | SKL/X has 512-bit data ports (for AVX512) so TR/Ryzen cannot compete but they would do better against older designs. |
| Aggregated L2 Bandwidth (GB/s) | 990 [-2%] | 1010 | 565 | 402 | The 16x L2 caches have similar bandwidth to the 10x much bigger caches on SKL-X. |
| Aggregated L3 Bandwidth (GB/s) | 749 [+2.6x] | 289 | 300 | 247 | The 4x L3 caches have much higher bandwidth than the single SKL-X cache. |
| Aggregated Memory (GB/s) | 56 [-18%] | 69 | 28 | 31 | Running at lower memory speed TR cannot beat SKL-X but has comparatively higher memory efficiency. |
| Summary | | | | | Even with 16x L1D and L2 caches, TR cannot match the much faster SKL-X 10x caches, which have been updated for 512-bit support, but they are competitive; the 4x L3 caches do soundly beat the unified one on SKL-X but then again sharing data not within the same CCX is going to be very much slower. |

At 2400Mt/s TR is running 25% slower than SKL-X at 3200Mt/s but its bandwidth is just 18% lower, thus its 4x DDR4 controllers are more efficient; not something we’re used to seeing.
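
The efficiency claim can be checked by dividing the measured bandwidth by the theoretical peak for each platform. A minimal sketch, assuming decimal gigabytes and the 256-bit total width listed in the specifications; the function names are illustrative:

```python
def peak_bandwidth_gbs(data_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return data_rate_mts * 1e6 * (bus_width_bits / 8) / 1e9


def efficiency(measured_gbs: float, data_rate_mts: float, bus_width_bits: int) -> float:
    """Fraction of the theoretical peak that was actually measured."""
    return measured_gbs / peak_bandwidth_gbs(data_rate_mts, bus_width_bits)


# Measured "Aggregated Memory" results, both on a 256-bit (4-channel) bus
print(f"TR 1950X @ 2400Mt/s: {efficiency(56, 2400, 256):.1%}")
print(f"SKL-X    @ 3200Mt/s: {efficiency(69, 3200, 256):.1%}")
```

On these numbers TR reaches roughly 73% of its theoretical peak against roughly 67% for SKL-X.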

| Native Benchmarks | AMD Threadripper 1950X | Intel i9-7900X (SKL-X) | AMD Ryzen 1700X | Intel 6700K (SKL) | Comments |
| --- | --- | --- | --- | --- | --- |
| Data In-Page Random Latency (ns) | 72.8 [4-17-37] [+2.75x] | 26.4 [4-13-33] | 70.7 [4-17-37] | 20 [4-12-21] | What we saw previously with Ryzen was no accident; TR also suffers from surprisingly large in-page latency, almost 3x that of Intel designs. Either the TLBs are very slow or not working. |
| Data Full Random Latency (ns) | 111.5 [4-17-44] [+47%] | 75.5 [4-13-70] | 87.9 [4-17-37] | 65 [4-12-34] | Out-of-page latencies are ‘better’ with TR/Ryzen ‘only’ ~50% slower than SKL/X. |
| Data Sequential Latency (ns) | 5.5 [4-7-8] [=] | 5.4 [4-11-13] | 3.8 [4-7-8] | 4.1 [4-12-13] | TR’s prefetchers are working well with sequential access pattern latency at ~5ns matching SKL-X. |
| Summary | | | | | We finally discover an issue: TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge, almost 3x higher than Intel’s. It is a mystery as to why, as both out-of-page random and sequential are competitive. It does point to something with the TLBs, as to whether they do work or are just very much slower for some reason. |
| Code In-Page Random Latency (ns) | 17.2 [4-10-26] [+43%] | 12 [4-14-28] | 16.1 [4-9-25] | 10 [4-11-21] | With code we don’t see the same problem, with in-page latency a bit higher than SKL-X (40%) but nowhere near as high as what we saw before. |
| Code Full Random Latency (ns) | 178 [4-15-60] [+2x] | 86.1 [4-16-106] | 95.4 [4-13-49] | 70 [4-11-47] | Out-of-page latency is about 2x that of SKL-X, though still not as bad as the ~3x seen with in-page data. |
| Code Sequential Latency (ns) | 8.7 [4-10-20] [+33%] | 6.5 [4-7-12] | 8.4 [4-9-18] | 5.3 [4-9-20] | Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns and thus 33% higher than SKL-X. |
| Summary | | | | | While code access latencies are higher than the new SKL-X, they are comparable with the older designs and not as bad as we’ve seen with data. Overall it seems TR (like Ryzen) will need some memory controller optimisations regarding latencies, though bandwidth seems just great. |
| Memory Update Transactional (MTPS) | 1.9 | 52.2 [HLE] | 4.18 | 32.4 [HLE] | SKL/X is in a world of its own due to support for HLE/RTM and there is not much TR/Ryzen can do about it. |
| Memory Update Record Only (MTPS) | 1.88 | 57.23 [HLE] | 4.22 | 25.4 [HLE] | We see a similar pattern here. |
| Summary | | | | | Without HLE/RTM, TR (like Ryzen) doesn’t have much chance against SKL/X, but considering support for it is disabled in most SKUs there’s not much AMD has to be worried about, not to mention Intel disabling it in the older HSW and BRW designs. But should AMD enable it in future designs Intel will have a problem on its hands… |

Threadripper’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals, partly due to more cores and more caches (16 vs 10). Overall latencies are also fine for caches and memory, except for the crucial ‘in-page random access’ data latencies which are far higher, about 3 times: TLB issues? We’ve been here before with Bulldozer, where this could not be easily fixed, but if AMD does manage it this time Ryzen’s performance will fly!
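
To make the three access patterns concrete, the sketch below generates index sequences for sequential, in-page random and full random access and counts how often consecutive accesses land on a different page. It only illustrates the terms: how Sandra actually builds its patterns is not described here, and the function names and the `page_elems` parameter (elements per page) are assumptions.

```python
import random


def sequential(n: int) -> list:
    """Visit elements in address order."""
    return list(range(n))


def full_random(n: int, rng: random.Random) -> list:
    """Visit every element once, in random order across the whole range."""
    order = list(range(n))
    rng.shuffle(order)
    return order


def in_page_random(n: int, page_elems: int, rng: random.Random) -> list:
    """Visit pages in order, but elements within each page in random order."""
    order = []
    for start in range(0, n, page_elems):
        page = list(range(start, min(start + page_elems, n)))
        rng.shuffle(page)
        order.extend(page)
    return order


def page_switches(order: list, page_elems: int) -> int:
    """Count consecutive accesses that fall on different pages."""
    return sum(1 for a, b in zip(order, order[1:])
               if a // page_elems != b // page_elems)


rng = random.Random(0)
n, page_elems = 4096, 512
for name, order in [("sequential", sequential(n)),
                    ("in-page random", in_page_random(n, page_elems, rng)),
                    ("full random", full_random(n, rng))]:
    print(name, page_switches(order, page_elems))
```

In this model the in-page random pattern changes page as rarely as sequential access does, so it should put little pressure on the TLBs, which is why unusually high in-page latency is surprising.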

Still, despite this issue we’ve seen in the previous article that TR’s CPU performance is very strong thus it may not be such a big problem.

Final Thoughts / Conclusions

TR’s memory performance is not the clean sweep we’ve seen in CPU testing but it is competitive with Intel’s designs, especially against older designs. The bandwidths are all competitive and the memory controllers in particular seem to be more efficient, but latencies are a bit of a problem which AMD may have to improve in future designs.

Overall we’d still recommend TR over Intel CPUs unless you want an absolutely tried and tested design which has already been patched by microcode and firmware/BIOS updates.