AMD Ryzen2 3700X Review & Benchmarks – CPU 8-core/16-thread Performance

What is “Ryzen2” ZEN2?

AMD’s Zen2 (“Matisse”) is the “true” 2nd generation ZEN core, built on a 7nm process shrink; the previous ZEN+ (“Pinnacle Ridge”) core was just an optimisation of the original ZEN (“Summit Ridge”) core. While remaining socket compatible, Zen2 introduces many design improvements over both previous cores. An APU version (with integrated “Navi” graphics) is scheduled to be launched later.

While new chipsets (500 series) will also be introduced and are required to support some new features (PCIe 4.0), older boards may support the new CPUs with a BIOS/firmware update, thus allowing upgrades to existing systems that add more cores and thus performance. [Note: older boards will not be enabled for PCIe 4.0 after all]

The list of changes vs. previous ZEN/ZEN+ is extensive, thus the performance delta is likely to be very different too:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 16MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 2 chiplets on desktop platform thus up to 2x2x4C (16C/32T 3950X) (same amount as old ThreadRipper 1950X/2950X)
  • 2x larger L3 cache per CCX thus up to 2x2x16MB (64MB) L3 cache (3900X+)
  • 24 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 2x DDR4 memory controllers up to 4266Mt/s
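
The core and cache totals in the list above are simple products of the topology. A minimal sketch in Python, assuming 2 CCX per chiplet, 16MB of L3 per CCX (from the 2x2x16MB figure above) and 2 threads per core; the class and field names are illustrative only and do not come from AMD or any library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zen2Topology:
    """Hypothetical helper: derives totals from the chiplet layout."""
    chiplets: int
    cores_per_ccx: int = 4      # active cores per CCX (3 on the 3900X)
    ccx_per_chiplet: int = 2
    l3_per_ccx_mb: int = 16
    threads_per_core: int = 2   # SMT

    @property
    def cores(self) -> int:
        return self.chiplets * self.ccx_per_chiplet * self.cores_per_ccx

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    @property
    def l3_mb(self) -> int:
        return self.chiplets * self.ccx_per_chiplet * self.l3_per_ccx_mb


parts = {
    "3700X": Zen2Topology(chiplets=1),
    "3900X": Zen2Topology(chiplets=2, cores_per_ccx=3),
    "3950X": Zen2Topology(chiplets=2),
}
for name, topo in parts.items():
    print(f"{name}: {topo.cores}C/{topo.threads}T, {topo.l3_mb}MB L3")
```

The 3900X keeps the full 64MB of L3 despite having one core disabled per CCX, which is why it appears alongside the 3950X in the L3 figure above.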

To upgrade from Ryzen+/Ryzen1 or not?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units with 2x FMACs (fixing a major deficiency in ZEN/ZEN+ cores)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies are claimed to be reduced through higher-speed memory (note all requests go through IF to the central I/O hub with the memory controllers)
  • Load/Store 32 bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but with higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run at 1x or 1/2x the DRAM clock (the latter when memory is faster than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities, bar two (BTI/“Spectre”, SSB/“Spectre v4”) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transition latencies reduced (ACPI/CPPCv2)
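
The effect of the wider SIMD units can be estimated with simple arithmetic. A minimal sketch, assuming two FMA units per core at both widths and counting one fused multiply-add as 2 floating-point operations; the function names are illustrative, and these are theoretical per-cycle peaks rather than measured results:

```python
def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements processed by one SIMD instruction."""
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits: int, element_bits: int,
                         fma_units: int = 2) -> int:
    """Theoretical per-core peak; one FMA counts as 2 floating-point ops."""
    return simd_lanes(vector_bits, element_bits) * fma_units * 2


for label, bits in (("ZEN/ZEN+ (128-bit)", 128), ("ZEN2 (256-bit)", 256)):
    fp32 = peak_flops_per_cycle(bits, 32)
    fp64 = peak_flops_per_cycle(bits, 64)
    print(f"{label}: {fp32} FP32 / {fp64} FP64 FLOPs per cycle per core")
```

Under these assumptions the per-cycle peak doubles, which is the upper bound one would expect for fully vectorised AVX/FMA code at equal clocks.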

In this article we test CPU core performance.

Hardware Specifications

We are comparing the middle-of-the-range Ryzen2 (3700X) with previous generation Ryzen+ (2700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | AMD Ryzen 9 3900X (Matisse) | AMD Ryzen 7 3700X (Matisse) | AMD Ryzen 7 2700X (Pinnacle Ridge) | Intel i9 9900K (Coffeelake-R) | Intel i9 7900X (Skylake-X) | Comments
Cores (CU) / Threads (SP) | 12C / 24T | 8C / 16T | 8C / 16T | 8C / 16T | 10C / 20T | Core counts remain the same.
Topology | 2 chiplets, each 2 CCX, each 3 cores (1 disabled) (12C) | 1 chiplet, 2 CCX, each 4 cores (8C) | 2 CCX, each 4 cores (8C) | Monolithic die | Monolithic die | 1 chiplet + 1 sIO rather than 1 die.
Speed (Min / Max / Turbo) | 3.8 / 4.6GHz | 3.6 / 4.4GHz | 3.7 / 4.2GHz | 3.6 / 5GHz | 3.3 / 4.3GHz | 3700X base clock is lower than 2700X but turbo is higher.
Power (TDP / Turbo) | 105 / 135W | 65 / 90W | 105 / 135W | 95 / 135W | 140 / 308W | TDP has been greatly reduced vs. ZEN+.
L1D / L1I Caches | 12x 32kB 8-way / 12x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 4-way | 8x 32kB 8-way / 8x 32kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | L1I has been halved but has more ways.
L2 Caches | 12x 512kB (6MB) 8-way | 8x 512kB (4MB) 8-way | 8x 512kB (4MB) 8-way | 8x 256kB (2MB) 16-way | 10x 1MB (10MB) 16-way | No changes to L2.
L3 Caches | 2x2x 16MB (64MB) 16-way | 2x 16MB (32MB) 16-way | 2x 8MB (16MB) 16-way | 16MB 16-way | 13.75MB 11-way | L3 is 2x ZEN+.
Mitigations for Vulnerabilities | BTI/“Spectre”, SSB/“Spectre v4” hardware | BTI/“Spectre”, SSB/“Spectre v4” hardware | BTI/“Spectre”, SSB/“Spectre v4” software/firmware | RDCL/“Meltdown”, L1TF hardware; BTI/“Spectre”, MDS/“Zombieload” software/firmware | RDCL/“Meltdown”, L1TF, BTI/“Spectre”, MDS/“Zombieload”, all software/firmware | Ryzen2 addresses the remaining 2 vulnerabilities while Intel was forced to add MDS to its long list…
Microcode | MU-8F7100-11 | MU-8F7100-11 | MU-8F0802-04 | MU-069E0C-9E | MU-065504-49 | The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units | 256-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 128-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 512-bit AVX512 | ZEN2 SIMD units are 2x wider than ZEN+.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2 and FMA3, plus extras such as SHA hardware acceleration (HWA), but not AVX-512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks | AMD Ryzen 7 3700X (Matisse) | AMD Ryzen 7 2700X (Pinnacle Ridge) | Intel i9 9900K (Coffeelake-R) | Intel i9 7900X (Skylake-X) | Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 336 [=] | 334 | 400 | 485 | We start with no improvement over ZEN+.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 339 [=] | 335 | 393 | 485 | With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 202 [+2%] | 198 | 236 | 262 | Floating-point performance does not change the delta either – only 2% faster.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 170 [=] | 169 | 196 | 223 | With FP64 nothing much changes again.
In the legacy integer/floating-point benchmarks ZEN2 is not any faster than ZEN+ despite the change in clocks. Perhaps future microcode updates will help?
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 1023 [+78%] | 574 | 985 | 1590 | ZEN2 is ~80% faster than ZEN+ despite what we’ve seen before.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 374 [+2x] | 187 | 414 | 581 | With a 64-bit AVX2 integer vectorised workload, ZEN2 is now 2x faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 6.56 [+13%] | 5.8 | 6.75 | 7.56 | This is a tough test using Long integers to emulate Int128 without SIMD; here ZEN2 is still 13% faster.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 1000 [+68%] | 596 | 914 | 1760 | In this floating-point AVX/FMA vectorised test, ZEN2 is ~70% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 618 [+84%] | 335 | 535 | 533 | Switching to FP64 SIMD code, ZEN2 is now ~85% faster than ZEN+.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 24.22 [+55%] | 15.6 | 23 | 40.3 | In this heavy algorithm using FP64 to mantissa-extend FP128, ZEN2 is still 55% faster.
With its brand-new 256-bit SIMD units, ZEN2 is anywhere from 55% to 100% faster than ZEN+/ZEN1 in the vectorised tests, a huge uplift from one generation to the next.
BenchCrypt Crypto AES-256 (GB/s) | 18 [+12%] | 16.1 | 17.63 | 23 | With AES/HWA support all CPUs are memory bandwidth bound but ZEN2 manages a 12% improvement.
BenchCrypt Crypto AES-128 (GB/s) | 18.76 [+17%] | 16.1 | 17.61 | 23 | What we saw with AES-256 just repeats with AES-128; ZEN2 is now 17% faster.
BenchCrypt Crypto SHA2-256 (GB/s) | 20.21 [+9%] | 18.6 | 12 | 26 | With SHA/HWA ZEN2 similarly powers through hashing tests leaving Intel in the dust – and is still ~10% faster than ZEN+.
BenchCrypt Crypto SHA1 (GB/s) | 20.41 [+6%] | 19.3 | 22.9 | 38 | The less compute-intensive SHA1 does not change things due to acceleration.
BenchCrypt Crypto SHA2-512 (GB/s) | | 3.77 | 9 | 21 |
ZEN2 with AES/SHA HWA is memory bound like all other CPUs, but it still manages 6-17% better performance than ZEN+ using the same memory. As ZEN2 is rated for faster memory, using such memory should improve the results further.
BenchFinance Black-Scholes float/FP32 (MOPT/s) | | 257 | 276 | 309 |
BenchFinance Black-Scholes double/FP64 (MOPT/s) | 229 [+5%] | 219 | 238 | 277 | Switching to FP64 code, ZEN2 is just 5% faster.
BenchFinance Binomial float/FP32 (kOPT/s) | | 107 | 59.9 | 70.5 | Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) | 57.98 [-4%] | 60.6 | 61.6 | 68 | With FP64 code ZEN2 is 4% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) | | 54.2 | 56.5 | 63 | Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 46.34 [+13%] | 41 | 44.5 | 50.5 | Switching to FP64 nothing much changes, ZEN2 is 13% faster.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: ZEN2 does not improve much and is pretty much tied with ZEN+ – thus for non-SIMD workloads you might as well stick with the older versions.
BenchScience SGEMM (GFLOPS) float/FP32 | 263 [-12%] | 300 | 375 | 413 | In this tough vectorised algorithm ZEN2 is strangely slower.
BenchScience DGEMM (GFLOPS) double/FP64 | 193 [+63%] | 119 | 209 | 212 | With FP64 vectorised code, ZEN2 comes back to be over 60% faster.
BenchScience SFFT (GFLOPS) float/FP32 | 22.78 [+2.5x] | 9 | 22.33 | 28.6 | FFT is also heavily vectorised but stresses the memory sub-system more; ZEN2 is 2.5x (times) faster.
BenchScience DFFT (GFLOPS) double/FP64 | 11.16 [+41%] | 7.92 | 11.21 | 14.6 | With FP64 code, ZEN2 is ~40% faster.
BenchScience SNBODY (GFLOPS) float/FP32 | 612 [+2.2x] | 280 | 557 | 638 | N-Body simulation is vectorised but has fewer memory accesses; ZEN2 is over 2x faster.
BenchScience DNBODY (GFLOPS) double/FP64 | 220 [+2x] | 113 | 171 | 195 | With FP64 precision ZEN2 is almost 2x faster.
With highly vectorised SIMD code ZEN2 improves greatly over ZEN+, sometimes managing to be over 2x faster using the same memory.
CPU Image Processing Blur (3×3) Filter (MPix/s) | 2049 [+42%] | 1440 | 2560 | 4880 | In this vectorised integer workload ZEN2 starts over 40% faster than ZEN+.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 950 [+52%] | 627 | 1000 | 1920 | Same algorithm but more shared data makes ZEN2 over 50% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 495 [+52%] | 325 | 519 | 1000 | Again same algorithm but even more data shared; still 50% faster.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 826 [+67%] | 495 | 827 | 1500 | Different algorithm but still a vectorised workload; ZEN2 is almost 70% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 89.68 [+24%] | 72.1 | 78 | 221 | Still vectorised code; now ZEN2 drops to just 25% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 25.05 [+5%] | 23.9 | 42.2 | 66.7 | This test has always been tough for Ryzen so ZEN2 does not improve much.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 1763 [+76%] | 1000 | 4000 | 4070 | With integer workload, Intel CPUs seem to do much better but ZEN2 is still almost 80% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 321 [+32%] | 243 | 596 | 777 | In this final test, again with integer workload, ZEN2 is 32% faster.
As we’ve seen before, the new SIMD units are anywhere from 5% (worst-case) to 2x faster than ZEN+/1, a huge performance improvement.
Aggregate Score (Points) | 8,200 [+40%] | 5,850 | 7,930 | 11,810 | Across all benchmarks, ZEN2 is ~40% faster than ZEN+.
Aggregating all the various scores, the result was never in doubt: ZEN2 (3700X) is 40% faster than the old ZEN+ (2700X) that itself improved over the original 1700X.
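
The bracketed deltas in the table are the 3700X result relative to the 2700X baseline. A minimal sketch of that calculation, using a few values from the table above; the function name is illustrative, and near-ties that the table marks as [=] are simply reported here as small percentages:

```python
def percent_delta(new: float, baseline: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# (3700X, 2700X) pairs taken from the benchmark table.
samples = {
    "Int32 Multi-Media (Mpix/s)": (1023, 574),
    "Binomial FP64 (kOPT/s)": (57.98, 60.6),
    "Aggregate Score (Points)": (8200, 5850),
}
for name, (zen2, zen_plus) in samples.items():
    print(f"{name}: {percent_delta(zen2, zen_plus):+.0f}%")
```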

ZEN2’s 256-bit wide SIMD units are a big upgrade and show their power in every SIMD workload; otherwise there is only minor improvement.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: For SIMD workloads you really have to upgrade to Ryzen2; otherwise stick with Ryzen+ unless lower power is preferred. 9/10 overall.

The big change in Ryzen2 is the 256-bit wide SIMD units: all vectorised workloads (Multi-Media, Scientific, Image processing, AI/Machine Learning, etc.) using AVX/FMA will greatly benefit, anywhere between 50% and 100%, which is a significant increase from just one generation to the next.

But for all other workloads (e.g. Financial, legacy, etc.) there is not much improvement over Ryzen+/1 which were already doing very well against competition.

Naturally it all comes at a lower TDP (65W vs. 105W for the 2700X), which may help with overclocking and also lowers noise (from the cooling system) and power consumption (useful if electricity is expensive or you are running it continuously); thus the performance/W(att) is greatly improved.
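
The performance/W claim can be roughly quantified from figures already in this article. A minimal sketch, assuming the rated TDP from the specifications table (65W for the 3700X, 105W for the 2700X) stands in for actual power draw, which was not measured here; the function name is illustrative:

```python
def points_per_watt(score: float, tdp_watts: float) -> float:
    """Aggregate score divided by rated TDP (not measured power)."""
    return score / tdp_watts


zen2 = points_per_watt(8200, 65)       # 3700X aggregate score / TDP
zen_plus = points_per_watt(5850, 105)  # 2700X aggregate score / TDP
ratio = zen2 / zen_plus

print(f"3700X: {zen2:.1f} points/W")
print(f"2700X: {zen_plus:.1f} points/W")
print(f"Improvement: {ratio:.2f}x")
```

On these rated figures the 3700X delivers a little over twice the score per watt, though real consumption under load may differ from TDP on both parts.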

Overall the 3700X does represent a decent improvement over the old 2700X (which is no slouch and was a nice upgrade over 1700X due to better Turbo speeds) and should still be usable in older AM4 300/400-series mainboards with just a BIOS upgrade (without PCIe 4.0).

However, while 2700X (and 1700X/1800X) were top-of-the-line, 3700X is just middle-ground, with the new top CPUs being the 3900X and even the 3950X with up to twice (2x) as many cores and thus potentially huge performance rivaling HEDT Threadripper. The goal-posts have thus moved and far higher performance can be yours just by upgrading the CPU. The future is bright…

AMD Ryzen2 3900X Review & Benchmarks – CPU 12-core/24-thread Performance

What is “Ryzen2” ZEN2?

AMD’s Zen2 (“Matisse”) is the “true” 2nd generation ZEN core, built on a 7nm process shrink; the previous ZEN+ (“Pinnacle Ridge”) core was just an optimisation of the original ZEN (“Summit Ridge”) core. While remaining socket compatible, Zen2 introduces many design improvements over both previous cores. An APU version (with integrated “Navi” graphics) is scheduled to be launched later.

While new chipsets (500 series) will also be introduced and are required to support some new features (PCIe 4.0), older boards may support the new CPUs with a BIOS/firmware update, thus allowing upgrades to existing systems that add more cores and thus performance. [Note: older boards will not be enabled for PCIe 4.0 after all]

The list of changes vs. previous ZEN/ZEN+ is extensive, thus the performance delta is likely to be very different too:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 16MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 2 chiplets on desktop platform thus up to 2x2x4C (16C/32T 3950X) (same amount as old ThreadRipper 1950X/2950X)
  • 2x larger L3 cache per CCX thus up to 2x2x16MB (64MB) L3 cache (3900X+)
  • 24 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 2x DDR4 memory controllers up to 4266Mt/s
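
The “2x higher transfer rate” of PCIe 4.0 can be put into numbers. A minimal sketch, assuming the standard per-lane signalling rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and 128b/130b line encoding, which come from the PCIe specifications rather than from this article; the function name is illustrative and protocol overhead above the line encoding is ignored:

```python
ENCODING_128B_130B = 128 / 130  # payload bits per transferred bit


def lane_gbytes_per_s(gigatransfers_per_s: float) -> float:
    """Approximate one-direction bandwidth of a single lane in GB/s."""
    return gigatransfers_per_s * ENCODING_128B_130B / 8


pcie3_lane = lane_gbytes_per_s(8.0)
pcie4_lane = lane_gbytes_per_s(16.0)

print(f"PCIe 3.0 x1:  {pcie3_lane:.2f} GB/s")
print(f"PCIe 4.0 x1:  {pcie4_lane:.2f} GB/s")
print(f"PCIe 4.0 x16: {16 * pcie4_lane:.1f} GB/s")
print(f"PCIe 4.0 x4:  {4 * pcie4_lane:.1f} GB/s")
```

A x4 link (typical for an NVMe SSD) thus approaches 8 GB/s in each direction, which is where the new 500-series boards matter most.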

AMD Ryzen2 3950X chiplets

What’s new in the Ryzen2 core?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units with 2x FMACs (fixing a major deficiency in ZEN/ZEN+ cores)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies are claimed to be reduced through higher-speed memory (note all requests go through IF to the central I/O hub with the memory controllers)
  • Load/Store 32 bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but with higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run at 1x or 1/2x the DRAM clock (the latter when memory is faster than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities, bar two (BTI/“Spectre”, SSB/“Spectre v4”) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transition latencies reduced (ACPI/CPPCv2)
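
The Infinity Fabric clocking rule in the list above can be sketched as follows. This is a simplified model, assuming the DRAM clock is half the DDR transfer rate and that the fabric switches from 1x to 1/2x exactly above the 3733Mt/s point quoted above; the function name is illustrative and actual behaviour depends on firmware settings:

```python
def fabric_clock_mhz(ddr_mt_s: float, coupled_limit_mt_s: float = 3733) -> float:
    """Estimated Infinity Fabric clock for a given DDR4 transfer rate."""
    dram_clock = ddr_mt_s / 2           # DDR: two transfers per clock
    if ddr_mt_s <= coupled_limit_mt_s:
        return dram_clock               # 1x: fabric runs at the DRAM clock
    return dram_clock / 2               # 1/2x: fabric runs at half the DRAM clock


for rate in (3200, 3733, 4266):
    print(f"DDR4-{rate}: fabric ~{fabric_clock_mhz(rate):.1f} MHz")
```

Under this model DDR4-4266 yields a lower fabric clock than DDR4-3200, which suggests that memory just above the switch point may not help latency-sensitive workloads.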

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 (3900X, 3700X) with previous generation Ryzen+ (2700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | AMD Ryzen 9 3900X (Matisse) | AMD Ryzen 7 3700X (Matisse) | AMD Ryzen 7 2700X (Pinnacle Ridge) | Intel i9 9900K (Coffeelake-R) | Intel i9 7900X (Skylake-X) | Comments
Cores (CU) / Threads (SP) | 12C / 24T | 8C / 16T | 8C / 16T | 8C / 16T | 10C / 20T | Matching core-count with CFL (3800X) but 3900X has 50% more cores – more than SKL-X.
Topology | 2 chiplets, each 2 CCX, each 3 cores (1 disabled) (12C) | 1 chiplet, 2 CCX, each 4 cores (8C) | 2 CCX, each 4 cores (8C) | Monolithic die | Monolithic die | AMD uses discrete dies/chiplets unlike Intel.
Speed (Min / Max / Turbo) | 3.8 / 4.6GHz | 3.6 / 4.4GHz | 3.7 / 4.2GHz | 3.6 / 5GHz | 3.3 / 4.3GHz | Base clock and turbo are competitive, with 3800X having higher base while 3900X has higher turbo.
Power (TDP / Turbo) | 105 / 135W | 65 / 90W | 105 / 135W | 95 / 135W | 140 / 308W | TDP remains the same but 3900X may exceed that having more cores.
L1D / L1I Caches | 12x 32kB 8-way / 12x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 4-way | 8x 32kB 8-way / 8x 32kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | ZEN2 matches L1I with CFL/SKL-X (1/2x ZEN+ but 8-way), L1D is unchanged (also matches Intel).
L2 Caches | 12x 512kB (6MB) 8-way | 8x 512kB (4MB) 8-way | 8x 512kB (4MB) 8-way | 8x 256kB (2MB) 16-way | 10x 1MB (10MB) 16-way | No changes to L2, still 2x CFL. Only SKL-X has its massive 1MB L2 per core which 3900X almost matches!
L3 Caches | 2x2x 16MB (64MB) 16-way | 2x 16MB (32MB) 16-way | 2x 8MB (16MB) 16-way | 16MB 16-way | 13.75MB 11-way | L3 is 2x ZEN/ZEN+ and thus 2x CFL (3800X) with 3900X having a massive 64MB unheard of on the desktop platform! SKL-X can’t match it either.
Mitigations for Vulnerabilities | BTI/“Spectre”, SSB/“Spectre v4” hardware | BTI/“Spectre”, SSB/“Spectre v4” hardware | BTI/“Spectre”, SSB/“Spectre v4” software/firmware | RDCL/“Meltdown”, L1TF hardware; BTI/“Spectre”, MDS/“Zombieload” software/firmware | RDCL/“Meltdown”, L1TF, BTI/“Spectre”, MDS/“Zombieload”, all software/firmware | Ryzen2 addresses the remaining 2 vulnerabilities while Intel was forced to add MDS to its long list…
Microcode | MU-8F7100-11 | MU-8F7100-11 | MU-8F0802-04 | MU-069E0C-9E | MU-065504-49 | The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units | 256-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 128-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 512-bit AVX512 | ZEN2 finally matches Intel/CFL but SKL-X’s secret weapon is AVX512 with even consumer CPUs able to do 2x 512-bit FMA ops.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2 and FMA3, plus extras such as SHA hardware acceleration (HWA), but not AVX-512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks | AMD Ryzen 9 3900X (Matisse) | AMD Ryzen 7 2700X (Pinnacle Ridge) | Intel i9 9900K (Coffeelake-R) | Intel i9 7900X (Skylake-X) | Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 551 [+38%] | 334 | 400 | 485 | Right off Ryzen2 demolishes all CPUs; it is 40% faster than CFL-R!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 556 [+41%] | 335 | 393 | 485 | With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 331 [+40%] | 198 | 236 | 262 | Floating-point performance does not change the delta either – still 40% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 280 [+43%] | 169 | 196 | 223 | With FP64 nothing much changes again.
Ryzen2 starts with an astonishing display, with 3900X demolishing both 9900K and 7900X and winning all tests by a large margin (38-43% over CFL-R)! It does have 50% more cores (12 vs. 8) but it is not easy to realise gains just by increasing core counts. Intel will need to add far more cores in future CPUs in order to compete!
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 1449 [+47%] | 574 | 985 | 1590 | Ryzen2 starts off by blowing CFL-R away by 47% and almost matching SKL-X with AVX512!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 553 [+34%] | 187 | 414 | 581 | With a 64-bit AVX2 integer vectorised workload, Ryzen2 is still 34% faster!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 9.52 [+41%] | 5.8 | 6.75 | 7.56 | This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 is again 41% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 1480 [+62%] | 596 | 914 | 1760 | In this floating-point AVX/FMA vectorised test, Ryzen2 is now over 60% faster than CFL-R and not far off SKL-X!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 906 [+69%] | 335 | 535 | 533 | Switching to FP64 SIMD code, Ryzen2 is now 70% faster, even beating SKL-X!
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 35.23 [+53%] | 15.6 | 23 | 40.3 | In this heavy algorithm using FP64 to mantissa-extend FP128, Ryzen2 is still 53% faster!
With its brand-new 256-bit SIMD units, Ryzen2 finally goes toe-to-toe with Intel, soundly beating CFL-R in all benchmarks (+34-69%) sometimes by more than just core count increase (+50%). Only SKL-X with AVX512 manages to be faster (but also with its extra 2 cores). Intel had better release AVX512 for desktop soon but even that will not be enough without increasing core counts to match AMD.
BenchCrypt Crypto AES-256 (GB/s) | 15.44 [-12%] | 16.1 | 17.63 | 23 | With AES/HWA support all CPUs are memory bandwidth bound – thus Ryzen2 scores less than Ryzen+ and CFL-R.
BenchCrypt Crypto AES-128 (GB/s) | 15.44 [-12%] | 16.1 | 17.61 | 23 | What we saw with AES-256 just repeats with AES-128; Ryzen2 is again slower by 12%.
BenchCrypt Crypto SHA2-256 (GB/s) | 29.84 [+2.5x] | 18.6 | 12 | 26 | With SHA/HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust – 2.5x faster than CFL-R and beating SKL-X with AVX512!
BenchCrypt Crypto SHA1 (GB/s) | | 19.3 | 22.9 | 38 |
BenchCrypt Crypto SHA2-512 (GB/s) | | 3.77 | 9 | 21 |
Ryzen2 with AES/SHA HWA is memory bound thus needs faster memory than 3200Mt/s in order to feed all the cores; otherwise due to increased contention for the same bandwidth it may end up slower than Ryzen+ and Intel designs. Here you see the need for HEDT platforms and thus ThreadRipper but at much increased cost.
BenchFinance Black-Scholes float/FP32 (MOPT/s) | | 257 | 276 | 309 |
BenchFinance Black-Scholes double/FP64 (MOPT/s) | 379 [+55%] | 219 | 238 | 277 | Switching to FP64 code, nothing much changes; Ryzen2 is 55% faster than CFL-R.
BenchFinance Binomial float/FP32 (kOPT/s) | | 107 | 59.9 | 70.5 | Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) | 95.73 [+55%] | 60.6 | 61.6 | 68 | With FP64 code Ryzen2 is still 55% faster!
BenchFinance Monte-Carlo float/FP32 (kOPT/s) | | 54.2 | 56.5 | 63 | Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 76.72 [+72%] | 41 | 44.5 | 50.5 | Switching to FP64 nothing much changes; Ryzen2 is 70% faster than CFL-R and still beating SKL-X.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: Ryzen2 is over 50% faster than CFL-R (+55-72%) and soundly beats SKL-X too! As before for financial algorithms there is only one choice and that is Ryzen, be it Ryzen1, Ryzen+ or Ryzen2!
BenchScience SGEMM (GFLOPS) float/FP32 | | 300 | 375 | 413 | In this tough vectorised algorithm Ryzen2.
BenchScience DGEMM (GFLOPS) double/FP64 | 212 [+1%] | 119 | 209 | 212 | With FP64 vectorised code, Ryzen2 matches CFL-R and SKL-X.
BenchScience SFFT (GFLOPS) float/FP32 | | 9 | 22.33 | 28.6 | FFT is also heavily vectorised but stresses the memory sub-system more;
BenchScience DFFT (GFLOPS) double/FP64 | 12.69 [+13%] | 7.92 | 11.21 | 14.6 | With FP64 code, Ryzen2 is 13% faster than CFL-R.
BenchScience SNBODY (GFLOPS) float/FP32 | | 280 | 557 | 638 | N-Body simulation is vectorised but has fewer memory accesses;
BenchScience DNBODY (GFLOPS) double/FP64 | 332 [+94%] | 113 | 171 | 195 | With FP64 precision Ryzen2 is almost 2x faster than CFL-R.
With highly vectorised SIMD code Ryzen2 remains competitive but finds some algorithms tougher than others. The new 256-bit SIMD units help but it seems the cores are starved of bandwidth (especially due to SMT) and some workloads may perform better with SMT off.
CPU Image Processing Blur (3×3) Filter (MPix/s) | 3056 [+20%] | 1440 | 2560 | 4880 | In this vectorised integer workload Ryzen2 is 20% faster than CFL-R.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 1499 [+50%] | 627 | 1000 | 1920 | Same algorithm but more shared data makes Ryzen2 50% faster!
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 767 [+48%] | 325 | 519 | 1000 | Again same algorithm but even more data shared; still almost 50% faster.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 1298 [+57%] | 495 | 827 | 1500 | Different algorithm but still a vectorised workload; Ryzen2 is almost 60% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 136 [+74%] | 72.1 | 78 | 221 | Still vectorised code; now Ryzen2 is over 70% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 38.23 [-9%] | 23.9 | 42.2 | 66.7 | This test has always been tough for Ryzen but Ryzen2 is competitive.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 1384 [-65%] | 1000 | 4000 | 4070 | With integer workload, Intel CPUs seem to do much better.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 487 [-18%] | 243 | 596 | 777 | In this final test, again with integer workload, Ryzen2 is almost 20% slower.
Thanks to AVX512, SKL-X does win all tests, but Ryzen2 beats CFL-R by 20-74% in most of them; a few tests mixing integer & floating-point SIMD instructions seemingly favour Intel heavily, but that is nothing to worry about. Overall for image processing Ryzen2 should be your 1st choice.
Aggregate Score (Points) | 10,250 [+29%] | 5,850 | 7,930 | 11,810 | Across all benchmarks, Ryzen2 is ~30% faster than CFL-R!
Aggregating all the various scores, the result was never in doubt: Ryzen2 (3900X) is about 75% faster than Ryzen+ (2700X) and 30% faster than CFL-R, almost catching up with HEDT SKL-X.

Ryzen2 (unlike Ryzen1/+) has no trouble with SIMD code due to its widened SIMD units (256-bit) and thus soundly beats the opposition (CFL-R 9900K flagship), sometimes by more than the core count increase alone (+50% i.e. 12 cores vs. 8). Sometimes it even beats the AVX512 opposition (SKL-X 7900X), helped by its higher core count (12 cores vs. 10).

The only “problematic” algorithms are the memory-bound ones where the cores/threads (due to SMT we have 24!) are starved for data and, due to contention, we see lower performance than on CPUs with fewer cores. While larger caches help (thus the massive 4x 16MB L3 caches), higher clocked memory should be used to match the additional core requirements.
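
The contention argument can be illustrated with theoretical peak figures. A minimal sketch, assuming a dual-channel (2x 64-bit) DDR4 configuration at the 3200Mt/s mentioned above for both the 8-core and 12-core parts; the function name is illustrative and sustained bandwidth will be lower than these peaks:

```python
def peak_bandwidth_gb_s(channels: int, mt_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channel = 8 bytes)."""
    return channels * bus_bytes * mt_s / 1000


total = peak_bandwidth_gb_s(channels=2, mt_s=3200)
print(f"Dual-channel DDR4-3200: {total:.1f} GB/s peak")

for cores in (8, 12):
    threads = cores * 2  # SMT
    print(f"{cores} cores: {total / cores:.2f} GB/s per core, "
          f"{total / threads:.2f} GB/s per thread")
```

With the same memory, each of the 12 cores has one third less peak bandwidth available than each of 8 cores, which is consistent with the lower AES results above.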

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: Ryzen2 is phenomenal and a huge upgrade over Ryzen1/+ that (most) AM4 users can enjoy and Intel has no answer to. 10/10.

Just as original Ryzen forced Intel to increase (double really) core counts to match (from 4 to 6 then 8), Ryzen2 will force Intel to come up with even more (and better) cores in order to compete. 3900X with its 12-cores soundly beats CFL-R 9900K (8-cores) in just about all benchmarks and in some tests goes toe-to-toe with HEDT SKL-X AVX512-enabled (10-cores) except in memory-bound algorithms where the 4 DDR4 memory channels with 2x more bandwidth count. For that you need ThreadRipper!

Ryzen1/+ was already competitive with Intel on integer and floating-point (non-SIMD) workloads but would fare badly on SIMD (AVX/FMA3/AVX2) workloads due to its 128-bit units; Ryzen2 “fixes” this issue, with its 256-bit units matching Intel. Only SKL-X with its 512-bit units (AVX512) is faster and Intel will have to finally include AVX512 for consumer CPUs in order to compete (IceLake?).

For compute-bound workloads, the forthcoming 3950X with its 16-cores/32-threads brings unprecedented performance to the consumer/desktop segment, pretty much unheard of just a few years ago when 4-core/8-threads (e.g. 7700K) were all you could hope for unless you paid a lot more for HEDT, where 8/10-core CPUs were far more expensive. Naturally we shall see how the reduced memory bandwidth per core affects its performance, with very fast DDR4 memory (4300Mt/s+) likely required for best performance.

Let’s also remember that Ryzen2 adds hardware mitigation for its remaining 2 vulnerabilities while Intel has been forced to add MDS/“Zombieload” even to its very latest CFL-R, which now loses its trump card (the hardware RDCL/“Meltdown” fix), not to forget the recommendation to disable SMT/Hyperthreading that would mean a sizeable performance drop.

What is astonishing is that TDP has remained similar and, with a BIOS/firmware upgrade, owners of older 300-series boards can now upgrade to these CPUs – and likely not even change the cooler unit! Naturally for PCIe 4.0 a 500-series board is recommended and 400-series boards do support more features in Ryzen2/+, but let’s remember that on Intel you can only go back/forward 1 generation even though there is pretty much no core difference from Skylake (Gen 6) to Coffeelake-R (Gen 9)!

From top-end (3950X), high-end (3800X) to low-end/APU (3200G) Ryzen2 is such a compelling choice it is hard to recommend anything else… at least at this time…

AMD Ryzen 2 Mobile (2500U) Vega 8 GP(GPU) Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited Ryzen2 mobile APU (“Raven Ridge”) version of the desktop Ryzen 2 with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper, there was no APU or mobile version (at least none released), leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design in order to hit the TDP range (15-35W) for mobile devices. It has also added the brand-new Vega graphics cores to the APU that have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (core complex) and thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2 mobile:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Radeon RX Vega graphics core (DirectX 12.1)
  • Optimised boost (aka Turbo) algorithm – sharing between CPU & GPU cores

In this article we test GP(GPU) integrated graphics performance.

Hardware Specifications

We are comparing the graphics units of Ryzen2 mobile with competing APUs with integrated graphics to determine whether they are good enough for modest use, especially for compute (GPGPU) use supporting the CPU.

GPGPU Specifications AMD Radeon RX Vega 8 (2500U) Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) Comments
Arch Chipset GCN1.5 GT2 / EV9.5 GT2 / EV9 GT3 / EV9 All graphics cores are minor revisions of previous cores with extra functionality.
Cores (CU) / Threads (SP) 8 / 512 24 / 192 24 / 192 48 / 384 Vega has the most SPs, though spread over only a few but powerful CUs.
ROPs / TMUs 8 / 32 8 / 16 8 / 16 16 / 24 Vega has fewer ROPs than GT3 but more TMUs.
Speed (Min-Turbo) (MHz) 300-1100 300-1000 300-1000 300-950 Turbo boost puts Vega in top position, power permitting.
Power (TDP) 25-35W 15-25W 15-25W 15-25W TDP is about the same for all though both Ryzen2 and CFL-U have somewhat higher TDP (25W).
Constant Memory 2.7GB 1.6GB 1.6GB 3.2GB There is no dedicated constant memory, thus a large chunk (GB) is available to use, unlike a dedicated video card with very fast but small (kB) constant memory.
Shared (Local) Memory 32kB 64kB 64kB 64kB Intel has 2x larger shared/local memory but it is slow (likely not dedicated), unlike Vega’s.
Global Memory 2.7 / 3GB 1.6 / 3.2GB 1.6 / 3.2GB 3.2 / 6.4GB About 50% of main memory can be used as global memory, thus pretty large workloads can be run.
Memory System 128-bit DDR4 2400MT/s 128-bit DDR3L 1866MT/s 128-bit DDR3L 1866MT/s 128-bit DDR4 2133MT/s Ryzen2’s memory controller is rated for faster data rates and thus should be able to use faster (laptop) memory.
Memory Bandwidth (GB/s) 36 30 30 33 The high data rate of DDR4 can result in higher bandwidth, useful for the GPU cores.
L2 Cache ? 512kB 512kB 1MB L2 is comparable to Intel units.
FP64/double ratio Yes, 1/16x Yes, 1/8x Yes, 1/8x Yes, 1/8x FP64 is supported and at a good ratio but lower than Intel’s.
FP16/half ratio Yes, 2x Yes, 2x Yes, 2x Yes, 2x FP16 is also now supported at twice the rate, again unlike gimped dedicated cards.
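The Memory Bandwidth row can be sanity-checked from the Memory System row: a rough ceiling is the bus width in bytes multiplied by the transfer rate. Below is a minimal Python sketch of that arithmetic; the helper name `peak_bandwidth_gbs` is ours for illustration, and taking 1 GB as 10^9 bytes is an assumption.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak bandwidth in GB/s, taking 1 GB as 1e9 bytes (assumption)."""
    bytes_per_transfer = bus_width_bits / 8  # a 128-bit bus moves 16 bytes per transfer
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


# Memory configurations as listed in the specifications table.
configs = {
    "128-bit DDR4 2400MT/s (Vega 8)": (128, 2400),
    "128-bit DDR3L 1866MT/s (GT2)": (128, 1866),
    "128-bit DDR4 2133MT/s (GT3)": (128, 2133),
}

for name, (width, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gbs(width, rate):.1f} GB/s")
```

These ceilings (38.4, 29.9 and 34.1 GB/s) sit close to the 36, 30 and 33 GB/s listed in the table.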

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) AMD Radeon RX Vega 8 (2500U) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 831 927 1630 2000 [+23%] Thanks to FP16 support we see double the performance over FP32 but Vega is only 23% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 476 478 865 1350 [+56%] Vega rules FP32 and is over 50% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 113 122 209 111 [-47%] FP64 lower rate makes Vega 1/2 the speed of GT3 and only matching GT2 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 5.71 6.29 10.78 7.11 [-34%] Emulated FP128 precision depends entirely on FP64 performance thus not a lot changes.
Vega is over 50% faster than Intel’s top-end Iris/GT3 graphics, but only in FP32 precision: while it gains from FP16, Intel scales better, reducing the lead to just 25% or so. In FP64 precision, though, its relatively low 1/16x ratio means it only ties with the low-end GT2 models while GT3 is 2x (twice) as fast. Pity.
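A back-of-the-envelope estimate shows why the FP64 ratio matters so much. The sketch below multiplies the SP count and turbo clock from the specifications table by the quoted FP64 ratios; counting 2 FLOPs per SP per clock (one fused multiply-add) is our assumption rather than something stated in the article, and `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(shaders: int, clock_mhz: float, rate_vs_fp32: float = 1.0) -> float:
    """Paper peak in GFLOPS, assuming 2 FLOPs (one FMA) per shader per clock."""
    return shaders * 2 * clock_mhz * 1e6 * rate_vs_fp32 / 1e9


# SP counts, turbo clocks and FP64 ratios copied from the specifications table.
units = {
    "Vega 8": {"shaders": 512, "clock_mhz": 1100, "fp64_ratio": 1 / 16},
    "GT3 (Iris 540)": {"shaders": 384, "clock_mhz": 950, "fp64_ratio": 1 / 8},
    "GT2 (UHD 630)": {"shaders": 192, "clock_mhz": 1000, "fp64_ratio": 1 / 8},
}

for name, u in units.items():
    fp32 = peak_gflops(u["shaders"], u["clock_mhz"])
    fp64 = peak_gflops(u["shaders"], u["clock_mhz"], u["fp64_ratio"])
    print(f"{name}: FP32 {fp32:.1f} GFLOPS, FP64 {fp64:.1f} GFLOPS")
```

On paper Vega leads clearly in FP32 (about 1126 vs 730 GFLOPS for GT3), but its FP64 peak (about 70 GFLOPS) falls below GT3's (about 91 GFLOPS), which matches the direction of the measured FP64 results.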
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 0.858 0.87 1.23 2.58 [+2.1x] No wonder AMD is crypto-king: Vega is over 2x faster than even GT3.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1 1.08 1.52 3.3 [+2.17x] Nothing changes here, Vega is about 2.2x faster.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 2.72 3 4.7 14.29 [+3x] In this heavy integer workload, Vega is now 3x faster; no wonder it’s used for crypto mining.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 6 6.64 11.59 18.77 [+62%] SHA1 is less compute intensive allowing Intel to catch up but Vega is still over 60% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 1.019 1.08 1.86 3.36 [+81%] With 64-bit integer workload, Vega does better and is 80% (almost 2x) faster than GT3.
Nobody will be using integrated graphics for crypto-mining any time soon, but if you needed to (perhaps using encrypted containers, VMs, etc.) then Vega is your choice – even GT3 is left in the dust despite big improvement over low-end GT2. Intel would need at least 2x more cores to be competitive here.
GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) 1000 1140 1470 1720 [+17%] If 16-bit precision is sufficient for financial work, Vega is 17% faster than GT3.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 694 697 794 829 [+4%] In this relatively simple FP32 financial workload Vega is just 4% faster than GT3.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 142 154 281 185 [-33%] Switching to FP64 precision, Vega is 33% slower than GT3.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 86 95 155 270 [+74%] Switching to 16-bit precision allows Vega to gain over GT3 and is almost 2x faster.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 92 93 153 254 [+66%] Binomial uses thread shared data thus stresses the internal memory sub-system, and here Vega shows its power – it is 66% faster than GT3.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 18 18.86 32 15.67 [-51%] With FP64 precision Vega loses again vs. GT3 at 1/2 the speed and just matches GT2 units.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 211 236 395 584 [+48%] With 16-bit precision, Vega dominates again and is almost 50% faster than GT3.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 223 236 412 362 [-12%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – but Vega somehow loses against GT3.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 29.5 33.36 58.7 47.13 [-20%] Switching to FP64 precision as expected Vega is slower.
Financial algorithms perform well on Vega – at least in FP16 & FP32 precision but FP64 is too “gimped” (1/16x FP32 rate) and thus loses against GT3 despite more powerful cores.
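For readers unfamiliar with the workload, here is a minimal sketch of the kind of calculation a Black-Scholes test repeats: pricing one European call option in closed form. We assume (the article does not say) that one priced option counts as one "OPT", so MOPT/s would be millions of such evaluations per second; the function names are ours.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       vol: float, years: float) -> float:
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * math.sqrt(years))
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


# Textbook example: at-the-money call, 5% rate, 20% volatility, 1 year to expiry.
print(f"{black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0):.4f}")
```

Each option needs only a handful of transcendental operations and no shared data, which may be why this test is described as relatively simple compared with Binomial and Monte-Carlo.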
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 127 140 236 884 [+3.75x] With 16-bit precision Vega runs away with GEMM and is almost 4x faster than GT3.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 105 107 175 214 [+79%] GEMM makes heavy use of shared/local memory which is likely why Vega is 80% faster than GT3.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 38.8 41.69 70 62.6 [-11%] As expected, due to gimped FP64 rate Vega falls behind GT3 but only by just 11%.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 34.2 34.7 45.85 61.34 [+34%] 16-bit precision helps reduce memory bandwidth pressure thus Vega is 34% faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 20.9 21.45 29.69 31.48 [+6%] FFT is memory access bound but Vega does well to beat GT3.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 4.3 5.4 6.07 14.19 [+2.34x] Despite the FP64 rate, Vega manages its memory accesses better beating GT3 by over 2x (two times).
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 270 284 449 623 [+39%] 16-bit precision still benefits N-Body and here Vega is 40% faster than GT3.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 162 181 291 537 [+85%] Back to FP32 and Vega has a pretty large 85% lead – almost 2x GT3.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 22.73 26.1 43.34 44 [+2%] With FP64 precision, Vega and GT3 are pretty much tied.
Vega performs well on compute heavy scientific algorithms (making heavy use of shared/local memory) and also benefits from half/FP16 to reduce memory bandwidth pressure, but FP64 rate comes back to haunt it where it loses against Intel’s GT3. Pity.
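The GFLOPS figures for GEMM can be read as operation count divided by run time. A minimal sketch, assuming the usual convention of 2·M·N·K floating-point operations for an M×K by K×N multiply (the article does not state how its scores are derived); the pure-Python `matmul` is only there to make the example runnable and bears no relation to the optimised OpenCL kernels being benchmarked.

```python
import time


def matmul(a, b):
    """Naive matrix multiply on lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def gemm_flops(m: int, n: int, k: int) -> int:
    """One multiply and one add per inner-product term (assumed convention)."""
    return 2 * m * n * k


def measure_gflops(n: int) -> float:
    """Time an n x n multiply and convert the operation count to GFLOPS."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    return gemm_flops(n, n, n) / elapsed / 1e9


print(f"{measure_gflops(64):.4f} GFLOPS (interpreted Python, single thread)")
```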
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 888 937 1390 2273 [+64%] With 16-bit precision Vega doubles its lead to 64% over GT3 despite its gain over FP32.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 461 491 613 781 [+27%] In this 3×3 convolution algorithm, Vega does well but is only 27% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 279 302 409 582 [+42%] Again a huge gain by using FP16, over 40% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 100 107 144 157 [+9%] Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) 254 272 396 619 [+56%] Large gain again by switching to FP16 with 3x performance over FP32.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 103 111 156 161 [+3%] With even more shared data the gap falls to just 3%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 259 281 363 595 [+64%] Another huge gain and over 3x improvement over FP32.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 99 106 145 155 [+7%] Still convolution but with 2 filters – the gap is similar to 5×5 – Vega is 7% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 7.39 9.4 8.56 7.688 [-18%] Big gain but not enough to beat GT3 here.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 7 7.57 7.08 4 [-47%] Vega does not like this algorithm (lots of branching causing divergence) and is 1/2 GT3 speed.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 8.55 9.32 9.22 <BSOD> This test would cause BSOD; we are investigating.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 8 8.65 6.77 2.59 [-70%] Vega does not like this algorithm either (complex branching) and neither does GT3.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 941 967 1580 2091 [+32%] In order to prevent artifacts most of this test runs in FP32 thus not much gain here.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 878 952 1550 2100 [+35%] This algorithm is 64-bit integer heavy allowing Vega 35% better performance over GT3.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 341 390 343 1046 [+2.5x] Switching to FP16 makes a huge difference to Vega which is over 2x faster.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 384 425 652 608 [-7%] One of the most complex and largest filters, Vega is a bit slower than GT3 by 7%.
For image processing Vega generally performs well in FP32 beating GT3 hands down; but there are a few algorithms that may need to be optimised for it that don’t perform as well as expected. Switching to FP16 though doubles/triples scores – thus Vega may be starved of memory.
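To illustrate what the convolution filters above do per pixel, here is a minimal 3×3 blur in plain Python. The equal-weight (box) kernel and the clamp-to-edge border handling are our assumptions; the article does not describe the actual kernels. Each output pixel reads 9 neighbours, so larger filters (5×5, 7×7) read proportionally more shared data.

```python
def blur3x3(image):
    """Box blur: each output pixel is the mean of its 3x3 neighbourhood.

    Pixels outside the image are replaced by the nearest edge pixel.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row index
                    xx = min(max(x + dx, 0), width - 1)   # clamp column index
                    total += image[yy][xx]
            out[y][x] = total / 9.0
    return out


# A single bright pixel is spread evenly over its 3x3 neighbourhood.
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 9.0
for row in blur3x3(img):
    print(row)
```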

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.
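The bracketed figures next to the Vega scores (for example [+23%] or [+2.34x]) appear to be Vega's score relative to the GT3 column in most rows. A minimal sketch of how such labels can be derived; the helper names and the rule of switching to a multiplier at 2x are our assumptions. For latency rows the same arithmetic applies, but a positive delta then means a worse result.

```python
def relative_delta(score: float, reference: float) -> float:
    """Percentage difference of score against reference."""
    return (score / reference - 1.0) * 100.0


def label(score: float, reference: float) -> str:
    """Format as a percentage below 2x, as a multiplier from 2x upwards."""
    ratio = score / reference
    if ratio >= 2.0:
        return f"[+{ratio:.2f}x]"
    return f"[{relative_delta(score, reference):+.0f}%]"


# Scores taken from the tables in this article (Vega 8 vs Iris 540 / GT3).
print(label(2000, 1630))   # Mandel FP16 (Mpix/s)
print(label(111, 209))     # Mandel FP64 (Mpix/s)
print(label(14.19, 6.07))  # DFFT (GFLOPS)
print(label(519, 436))     # Global full-range latency (ns): higher is worse
```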

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) AMD Radeon RX Vega 8 (2500U) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 12.17 21.2 24 27.32 [+14%] With higher speed DDR4 memory, Vega has 14% more bandwidth.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 6 10.4 11.7 4.74 [-60%] The GPU<>CPU link seems a bit slow here at well under 1/2 the bandwidth of Intel’s GT3.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 6 10.5 11.75 5 [-57%] Download bandwidth shows a similar issue, at less than 1/2 the bandwidth expected.
All designs have to rely on the shared memory controller and Vega performs as expected with good internal bandwidth due to higher speed DDR4 memory. But – transfer up/down speeds are disappointing possibly due to the driver as “zero-copy” mode should be engaged and working on such transfers (APU mode).
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 246 244 288 412 [+49%] Similarly with CPU data latencies, global “in-page/random” (aka “TLB hit”) latencies are a bit high though not by a huge amount.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 365 372 436 519 [+19%] Due to faster memory clock but increased timings “full/random” latencies appear a bit higher.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 156 158 213 201 [-6%] Sequential access latencies are less than competition by 6%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 245 243 252 411 [+63%] None have dedicated constant memory thus we see a similar picture to global memory: somewhat high latencies.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 82 84 100 22.5 [1/5x] Vega has dedicated shared/local memory and it shows – it’s about 5x faster than Intel’s designs.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 1152 1157 1500 278 [1/5x] Texture access is also very fast on Vega, with latencies 5x lower (aka 1/5) than Intel’s designs.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 1178 1162 1533 418 [1/3x] Even full/random accesses are fast, 3x (three times) faster than Intel’s.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 1077 1081 1324 122 [1/10x] With sequential access we see a crazy 10x lower latency as if AMD uses prefetchers and Intel does not.
As we’ve seen in Ryzen 2’s data latency tests, “in-page/random” latencies are higher than the competition but the rest are comparable, with sequential (prefetched) latencies especially small. But dedicated shared/local memory is far faster (5x) and texture accesses are also very fast (3-5x), which should greatly help algorithms making use of them.
Plotting the global (or constant) memory latencies together we see that the “in-page/random” access latencies should perhaps peak somewhat lower but still nothing close to what we’ve seen in the (CPU) data memory latencies article. It is not very clear (unlike the texture latencies graph) where the caches are located.
The texture latencies graph is far clearer where we can see each level’s caches; unlike the global (or constant) latencies we see “in-page/random” latency peak and hold at a somewhat lower level (4MB).

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Vega mobile, like its bigger desktop siblings, is undoubtedly powerful and a good upgrade from the older integrated GPU cores; it also supports modern features like half/FP16 compute (which needs vectorisation to what the driver reports as the “optimised width”) and relishes complex algorithms that make use of its efficient shared/local memory. However, Intel’s GT3 EV9.x can get close to it in some workloads and, thanks to its better FP64 ratio (1/8x vs 1/16x), even beats it in most FP64 precision tests, which is somewhat disappointing.

Luckily for AMD, GT3 variant is very rare and thus Vega has an easy job defeating GT2 in just about all tests; but it shows that should Intel “get serious” and continue to improve integrated graphics (and CPUs) like they used to do before Skylake (SKL/KBL) – AMD might have more serious competition on its hands.

Note that until recently (2019) Ryzen2 mobile APUs were not supported by AMD’s main drivers (“Adrenalin”) and had to rely on pretty old OEM (HP, etc.) drivers that were somewhat problematic especially with Windows 10 changing every 6 months while the drivers were almost 1 year old. Thankfully this has now changed and users (and us) can benefit from updated, stable and performant drivers.

In any case if you want a laptop/ultraportable with just an APU and no dedicated graphics, then Vega is pretty much your only choice which means a Ryzen2 system. That pretty much means it is worthy of a recommendation.

In a word: Highly Recommended

AMD Ryzen 2 Mobile 2500U Review & Benchmarks – Cache & Memory Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited mobile APU version (“Raven Ridge”) of the desktop Ryzen 2, with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper, there was no APU or mobile version (at least none released), leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design to hit the 15-35W TDP range for mobile devices. It has also added to the APU the brand-new Vega graphics cores that have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (compute unit) and thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

Why review it now?

With Ryzen3 due to be released later this year (2019), along with a corresponding Ryzen3 mobile APU, it is good to re-test the platform, especially in light of the many BIOS/firmware updates, the many video/GPU driver updates and, not least, the many operating system (Windows) mitigations for vulnerabilities (“Spectre”) that have greatly affected performance: sometimes for the good (firmware, drivers, optimisations), sometimes for the bad (mitigations).

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 mobile (2500U) with competing architectures (Intel gen 6, 7, 8) with a view to upgrading to a mid-range but high performance design.

CPU Specifications AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 2x 32kB 8-way / 2x 32kB 8-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen2 icache is 2x of Intel with matching dcache.
L2 Caches 4x 512kB 8-way 2x 256kB 16-way 2x 256kB 16-way 4x 256kB 16-way Ryzen2 L2 cache is 2x bigger than Intel and thus 4x larger than older SKL/KBL-U.
L3 Caches 4MB 16-way 4MB 16-way 4MB 16-way 6MB 16-way Here CFL-U brings 50% bigger L3 cache (6 vs 4MB) which may help some workloads.
TLB 4kB pages 64 full-way / 1536 8-way 64 8-way / 1536 6-way 64 8-way / 1536 6-way 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages 64 full-way / 1536 2-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) 600 2600 (400-3100) 2700 (400-3500) 1600 (400-3400) Ryzen2’s memory controller runs at memory clock (MCLK) base rate and thus depends on the memory installed. Intel’s UNC (uncore) runs between min and max CPU clock and is thus perhaps faster.
Memory Speed (MHz) Max 1200-2400 (2667) 1033-1866 (2133) 1067-2133 (2400) 1200-2400 (2533) Ryzen2 now supports up to 2667MHz (officially) which should improve its performance quite a bit; unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Timing (clocks) 17-17-17-39 8-56-18-9 1T 14-17-17-40 10-57-16-11 2T 15-15-15-36 4-51-17-8 2T 19-19-19-43 5-63-21-9 2T Timings naturally depend on memory which for laptops is somewhat limited and quite expensive.
Memory Controller Firmware 2.1.0 3.6.0 3.6.4 Firmware is the same as on desktop devices.

Core Topology and Testing

As discussed in the previous articles (Ryzen1 and Ryzen2 reviews), cores on Ryzen are grouped in blocks (CCX or compute units), each with its own L3 cache, but connected via a 256-bit bus running at memory controller clock. However, unlike desktop/workstations, so far all Ryzen2 mobile designs have a single (1) CCX, thus all the issues that “plagued” the desktop/workstation Ryzen designs do not apply here.

However, AMD could have released higher-core mobile designs to go against Intel’s H-line (beefed to 6-core / 12-threads with CFL-H) that would have likely required 2 CCX blocks. At this time (start 2019) considering that Ryzen3 (mobile) will launch soon that seems unlikely to happen…

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen2 mobile supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 18.65 [-21%] 16.81 18.93 23.65 Ryzen2 L1D is not as wide as Intel’s designs (512-bit) thus inter-core transfers in L1D are 20% slower.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 9.29 [=] 6.62 7.4 9.3 Using the unified L3 caches – both Ryzen2 and CFL-U manage the same bandwidths.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 16 [-24%] 21 18 19 Within the same core (shared L1D) Ryzen2 has lower latencies than all Intel CPUs, by up to 24%.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 46 [-23%] 61 54 56 Within the same compute unit (shared L3) Ryzen2 again yields 23% lower latencies.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) n/a n/a n/a n/a With a single CCX we have no latency issues.
While the L1D cache on Ryzen2 is not as wide as on Intel SKL/KBL/CFL-U to yield the same bandwidth (20% lower), both it and L3 manage lower latencies by a relatively large ~25%. With a single CCX design we have none of the issues seen on the desktop/workstation CPUs.
Aggregated L1D Bandwidth (GB/s) 267 [-67%] 315 302 628 Ryzen2’s L1D is just not wide enough – even 2-core SKL/KBL-U have more bandwidth and CFL-U has almost 3x more.
Aggregated L2 Bandwidth (GB/s) 225 [-29%] 119 148 318 The 2x larger L2 caches (512 vs 256kB) perform better but still CFL-U manages 30% more bandwidth.
Aggregated L3 Bandwidth (GB/s) 130 [-31%] 90 95 188 CFL-U not only has 50% bigger L3 (6 vs 4MB) but also somehow manages 30% more bandwidth too while SKL/KBL-U are left in the dust.
Aggregated Memory (GB/s) 24 [=] 21 21 24 With the same memory clock, Ryzen2 ties with CFL-U which means good bandwidth for the cores.
While we saw big improvements on Ryzen2 (desktop) for all caches L1D/L2/L3 – more work needs to be done: in particular the L1D caches are not wide enough compared to Intel’s CPUs – and even L2/L3 need to be wider. Most likely Ryzen3 with native wide 256-bit SIMD (unlike 128-bit as Ryzen1/2) will have twice as wide L1D/L2 that should be sufficient to match Intel.

The memory controller performs well matching CFL-U and is officially rated for higher DDR4 memory – though on laptops the choices are more limited and more expensive.

Data In-Page Random Latency (ns) 91.8 [4-13-32] [+2.75x] 34.6 [3-10-17] 27.6 [4-12-22] 24.5 As on desktop Ryzen1/2 in-page random latencies are large compared to the competition while L1D/L2 are OK but L3 also somewhat large.
Data Full Random Latency (ns) 117 [4-13-32] [-16%] 108 [3-10-27] 84.7 [4-12-33] 139 Out-of-page latencies are not much different which means Ryzen2 is a lot more competitive but still somewhat high.
Data Sequential Latency (ns) 4.1 [4-6-7] [-31%] 5.6 [3-10-11] 6.5 [4-12-13] 5.9 Ryzen’s prefetchers are working well with sequential access, with lower latencies than Intel.
Ryzen1/2 desktop issues were high memory latencies (in-page/full random) and nothing much changes here. “In-Page/Random pattern” (TLB hit) latencies are almost 3x higher than Intel’s and actually not much lower than the “Full/Random pattern” (TLB miss) latencies, which are themselves comparable to Intel’s SKL/KBL/CFL. On the other hand, “Sequential pattern” yields lower latencies (30% less) than Intel, thus simple access patterns work better than complex/random access patterns.
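The three access patterns can be pictured as different ways of linking the same buffer into a pointer-chasing chain, where each load depends on the previous one. The sketch below is our own illustration of the idea, not the benchmark's actual implementation: `build_chain` and `walk` are hypothetical names, and a 4kB page holding 512 eight-byte elements is an assumption. Python cannot measure nanosecond latencies meaningfully; a real test would time the walk in native code.

```python
import random


def build_chain(n_elems: int, pattern: str, page_elems: int = 512, seed: int = 1):
    """Return a list where chain[i] is the index to visit after i.

    sequential      - next element in memory (prefetcher friendly)
    in_page_random  - random order inside a page, pages in order (TLB hits)
    full_random     - random order over the whole buffer (TLB misses)
    """
    rng = random.Random(seed)
    if pattern == "sequential":
        order = list(range(n_elems))
    elif pattern == "full_random":
        order = list(range(n_elems))
        rng.shuffle(order)
    elif pattern == "in_page_random":
        order = []
        for start in range(0, n_elems, page_elems):
            page = list(range(start, min(start + page_elems, n_elems)))
            rng.shuffle(page)
            order.extend(page)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    # Link the visiting order into one cycle covering every element.
    chain = [0] * n_elems
    for pos, idx in enumerate(order):
        chain[idx] = order[(pos + 1) % n_elems]
    return chain


def walk(chain, steps: int, start: int = 0):
    """Follow the chain; each step depends on the result of the previous one."""
    visited, idx = [], start
    for _ in range(steps):
        visited.append(idx)
        idx = chain[idx]
    return visited


for pattern in ("sequential", "in_page_random", "full_random"):
    chain = build_chain(2048, pattern)
    print(pattern, walk(chain, 8))
```

Because every element is visited exactly once per lap, the average time per step in a native implementation approximates the load latency for that pattern and buffer size.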
Looking at the data access latencies’ graph for Ryzen2 mobile – we see the “in-page/random” following the “full/random” latencies all the way to 8MB block where they plateau; we would have expected them to plateau at a lower value. See the “code access latencies” graph below.
Code In-Page Random Latency (ns) 17.6 [5-9-25] [+14%] 13.3 [2-9-18] 14.9 [2-11-21] 15.5 Code latencies were not a problem on Ryzen1/2 and they are OK here, 14% higher.
Code Full Random Latency (ns) 108 [5-15-48] [+19%] 91.8 [2-10-38] 90.4 [2-11-45] 91 Out-of-page latency is also competitive and just 20% higher.
Code Sequential Latency (ns) 8.2 [5-13-20] [+37%] 5.9 [2-4-8] 7.8 [2-4-9] 6 Ryzen’s prefetchers are working well with sequential access pattern latency but not as fast as Intel.
Unlike data, code latencies (any pattern) are competitive with Intel though CFL-U does have lower latencies (between 15-20%) but in exchange you get a 2x bigger L1I (64 vs 32kB) which should help complex software.
This graph for code access latencies is what we expected to see for data: “in-page/random” latencies plateau much earlier than “full/random” thus “TLB hit” latencies being much lower than “TLB miss” latencies.
Memory Update Transactional (MTPS) 7.17 [-7%] 6.5 7.72 7.2 As none of Intel’s CPUs have HLE enabled, Ryzen2 performs really well with just 7% fewer transactions/second.
Memory Update Record Only (MTPS) 5.66 [+5%] 4.66 5.25 5.4 With only record updates it manages to be 5% faster.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

We saw good improvement on Ryzen2 (desktop/workstation) but still not enough to beat Intel; a lot more work is needed both on L1/L2 cache bandwidth/widening and on memory latency (“in-page” aka “TLB hit” random access pattern), which cannot be improved with firmware/BIOS updates (AGESA firmware). Ryzen2 mobile does have the potential to use faster DDR4 memory (officially rated 2667MHz) and thus could overtake Intel using faster memory, but laptop DDR4 SODIMM choice is limited.

Regardless of these differences – the CPU results we’ve seen are solid thus sufficient to recommend Ryzen2 mobile especially when at a much lower cost than competing designs. Even if you do choose Intel – you will be picking up a better design due to Ryzen2 mobile competition – just compare the SKL/KBL-U and CFL/WHL-U results.

We are looking forward to seeing what improvements Ryzen3 mobile brings to the mobile platform.

In a word: Recommended – with reservations

AMD Ryzen 2 Mobile 2500U Review & Benchmarks – CPU Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited mobile APU version (“Raven Ridge”) of the desktop Ryzen 2, with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper, there was no APU or mobile version (at least none released), leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design to hit the 15-35W TDP range for mobile devices. It has also added to the APU the brand-new Vega graphics cores that have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (compute unit) and thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

Why review it now?

With Ryzen3 due to be released later this year (2019), along with a corresponding Ryzen3 mobile APU, it is good to re-test the platform, especially in light of the many BIOS/firmware updates, the many video/GPU driver updates and, not least, the many operating system (Windows) mitigations for vulnerabilities (“Spectre”) that have greatly affected performance: sometimes for the good (firmware, drivers, optimisations), sometimes for the bad (mitigations).

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 mobile (2500U) with competing architectures (Intel gen 6, 7, 8) with a view to upgrading to a mid-range but high-performance design.

CPU Specifications AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
Cores (CU) / Threads (SP) 4C / 8T 2C / 4T 2C / 4T 4C / 8T Ryzen has double the cores of ULV Skylake/Kabylake and only recently Intel has caught up by also doubling cores.
Speed (Min / Max / Turbo) 1.6-2.0-3.6GHz (16x-20x-36x) 0.4-2.6-3.1GHz (4x-26x-31x) 0.4-2.7-3.5GHz (4x-27x-35x) 0.4-1.6-3.4GHz (4x-16x-34x) Ryzen2 has higher base and turbo than CFL-U and higher turbo than all Intel competition.
Power (TDP) 25-35W 15-25W 15-25W 25-35W Both Ryzen2 and CFL-U have higher TDP at 25W and turbo up to 35W depending on configuration while older devices were mostly 15W with turbo 20-25W.
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 2x 32kB 8-way / 2x 32kB 8-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen2 icache is 2x of Intel with matching dcache.
L2 Caches 4x 512kB 8-way 2x 256kB 16-way 2x 256kB 16-way 4x 256kB 16-way Ryzen2 L2 cache is 2x bigger than Intel and thus 4x larger than older SKL/KBL-U.
L3 Caches 4MB 16-way 4MB 16-way 4MB 16-way 6MB 16-way Here CFL-U brings 50% bigger L3 cache (6 vs 4MB) which may help some workloads.
Microcode (Firmware) MU8F1100-0B MU064E03-C6 MU068E09-8E MU068E09-96 On Intel you can see just how many updates the platforms have had – we’re now at CX versions but even Ryzen2 has had a few.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 103 [-6%] 52 73 109 Right off Ryzen2 does not beat CFL-U but is very close, soundly beating the older Intel designs.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 102 [-4%] 51 74 106 With a 64-bit integer workload – the difference drops to 4%.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 79 [+18%] 39 45 67 Somewhat surprisingly, Ryzen2 is almost 20% faster than CFL-U here.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 67 [+22%] 33 37 55 With FP64 nothing much changes, with Ryzen2 over 20% faster.
You can see why Intel needed to double the cores for ULV: otherwise even top-of-the-line i7 SKL/KBL-U are pounded into dust by Ryzen2. CFL-U does trade blows with it and manages to pull ahead in Dhrystone but Ryzen2 is 20% faster in floating-point. Whatever you choose you can thank AMD for forcing Intel’s hand.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 239 [-32%] 183 193 350 In this vectorised AVX2 integer test Ryzen2 starts 30% slower than CFL-U but does beat the older designs.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 53.4 [-58%] 68.2 75 127 With a 64-bit AVX2 integer vectorised workload, Ryzen2 is even slower.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 2.41 [+12%] 1.15 1.12 2.15 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 has its 1st win by 12% over CFL-U.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 222 [-20%] 149 159 277 In this floating-point AVX/FMA vectorised test, Ryzen2 is still slower but only by 20%.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 126 [-22%] 88.3 94.8 163 Switching to FP64 SIMD code, nothing much changes still 20% slower.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 6.23 [-16%] 3.79 4.04 7.4 In this heavy algorithm using FP64 to mantissa extend FP128 with AVX2 – Ryzen2 is less than 20% slower.
Just as on desktop, we did not expect AMD’s Ryzen2 mobile to beat 4-core CFL-U (with Intel’s wide SIMD units) and it doesn’t: but it remains very competitive and is just 20% slower. In any case, it soundly beats all older but ex-top-of-the-line i7 SKL/KBL-U thus making them all obsolete at a stroke.
BenchCrypt Crypto AES-256 (GB/s) 10.9 [+1%] 6.29 7.28 10.8 With AES/HWA support all CPUs are memory bandwidth bound – here Ryzen2 ties with CFL-U and soundly beats older versions.
BenchCrypt Crypto AES-128 (GB/s) 10.9 [+1%] 8.84 9.07 10.8 What we saw with AES-256 just repeats with AES-128; Ryzen2 is marginally faster but the improvement is there.
BenchCrypt Crypto SHA2-256 (GB/s) 6.78 [+60%] 2 2.55 4.24 With SHA/HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust; SHA is still memory bound but Ryzen2 is 60% faster than CFL-U.
BenchCrypt Crypto SHA1 (GB/s) 7.13 [+2%] 3.88 4.07 7.02 Ryzen also accelerates the soon-to-be-defunct SHA1 but CFL-U with AVX2 has caught up.
BenchCrypt Crypto SHA2-512 (GB/s) 1.48 [-44%] 1.47 1.54 2.66 SHA2-512 is not accelerated by SHA/HWA thus Ryzen2 falls behind here.
Ryzen2 mobile (like its desktop brother) gets a boost from SHA/HWA but otherwise ties with CFL-U which is helped by its SIMD units. As before older 2-core i7 SKL/KBL-U are left with no hope and cannot even saturate the memory bandwidth.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 93.3 [-4%] 44.7 49.3 97 In this non-vectorised test we see Ryzen2 matches CFL-U.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 77.8 [-8%] 39 43.3 84.7 Switching to FP64 code, nothing much changes, Ryzen2 is 8% slower.
BenchFinance Binomial float/FP32 (kOPT/s) 35.5 [+61%] 10.4 12.3 22 Binomial uses thread shared data thus stresses the cache & memory system; here the arch(itecture) improvements do show, Ryzen2 is 60% faster than CFL-U.
BenchFinance Binomial double/FP64 (kOPT/s) 19.5 [-7%] 10.1 11.4 21 With FP64 code Ryzen2 drops back from its previous win.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 20.1 [+1%] 9.24 9.87 19.8 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen2 cannot match its previous gain.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 15.3 [-3%] 7.38 7.88 15.8 Switching to FP64 nothing much changes, Ryzen2 matches CFL-U.
Unlike desktop where Ryzen2 is unstoppable, here we see a more mixed result – with CFL-U able to trade blows with it except in one test where Ryzen2 is 60% faster. Otherwise CFL-U does manage to be just a bit faster in the other tests but nothing significant.
BenchScience SGEMM (GFLOPS) float/FP32 107 [+16%] 92 76 85 In this tough vectorised AVX2/FMA algorithm Ryzen2 manages to be almost 20% faster than CFL-U.
BenchScience DGEMM (GFLOPS) double/FP64 47.2 [-6%] 44.2 31.7 50.5 With FP64 vectorised code, Ryzen2 drops down to 6% slower.
BenchScience SFFT (GFLOPS) float/FP32 3.75 [-53%] 7.17 7.21 8 FFT is also heavily vectorised (x4 AVX2/FMA) but stresses the memory sub-system more; Ryzen2 does not like it much.
BenchScience DFFT (GFLOPS) double/FP64 4 [-7%] 3.23 3.95 4.3 With FP64 code, Ryzen2 does better and is just 7% slower.
BenchScience SNBODY (GFLOPS) float/FP32 112 [-27%] 96.6 104.9 154 N-Body simulation is vectorised but makes many memory accesses and is not a Ryzen2 favourite.
BenchScience DNBODY (GFLOPS) double/FP64 45.3 [-30%] 29.6 30.64 64.8 With FP64 code nothing much changes.
With highly vectorised SIMD code Ryzen2 remains competitive but finds some algorithms tougher than others. Just as with desktop Ryzen1/2 it may require SIMD code changes for best performance due to its 128-bit units; Ryzen3 with 256-bit units should fix that.
CPU Image Processing Blur (3×3) Filter (MPix/s) 532 [-39%] 418 474 872 In this vectorised integer AVX2 workload Ryzen2 is quite a bit slower than CFL-U.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 146 [-58%] 168 191 350 Same algorithm but more shared data makes Ryzen2 even slower, 1/2 CFL-U.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 123 [-32%] 87.6 98 181 Again same algorithm but even more data shared reduces the delta to 1/3.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 185 [-37%] 136 164 295 Different algorithm but still an AVX2 vectorised workload; Ryzen2 is still ~35% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 26.5 [-1%] 13.3 14.4 26.7 Still AVX2 vectorised code but here Ryzen2 ties with CFL-U.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.38 [-38%] 7.21 7.63 15.09 Again we see Ryzen2 fall behind CFL-U.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 660 [-53%] 730 764 1394 With integer AVX2 workload, Ryzen2 falls behind even SKL/KBL-U.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 94.1 [-55%] 99.6 105 209 In this final test again with integer AVX2 workload Ryzen2 is 1/2 speed of CFL-U.

With all the modern instruction sets supported (AVX2, FMA, AES and SHA/HWA) Ryzen2 does extremely well in all workloads – and makes all older i7 SKL/KBL-U designs obsolete and unable to compete. As we said – Intel pretty much had to double the number of cores in CFL-U to stay competitive – and it does – but it is all thanks to AMD.

Even then Ryzen2 does beat CFL-U in non-SIMD tests, while the latter is helped tremendously by its wide (256-bit) SIMD units and thus benefits greatly from AVX2/FMA workloads. But Ryzen3 with double-width SIMD units should be much faster, comfortably beating Intel designs.
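
The bracketed percentages in the table above appear, for most rows, to be the relative difference between the Ryzen2 score and the i5 8250U (CFL-U) score on the same line. As a minimal sketch of that arithmetic (the helper name `percent_delta` is our own and purely illustrative; the scores are copied from four rows of the table):

```python
def percent_delta(score: float, baseline: float) -> int:
    """Relative difference of score versus baseline, as a whole percent."""
    return round((score / baseline - 1.0) * 100.0)


# (Ryzen2 2500U, i5 8250U) score pairs copied from the table above.
ROWS = {
    "Dhrystone Integer (GIPS)": (103, 109),
    "Whetstone FP32 (GFLOPS)": (79, 67),
    "SHA2-256 (GB/s)": (6.78, 4.24),
    "Blur 3x3 (MPix/s)": (532, 872),
}

for name, (ryzen2, cfl_u) in ROWS.items():
    # A negative value means Ryzen2 is slower than CFL-U in that test.
    print(f"{name}: {percent_delta(ryzen2, cfl_u):+d}%")
```

The same helper can be pointed at the SKL-U or KBL-U columns to see how far the older 2-core parts trail.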

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest drivers. .Net 4.7.x (RyuJit), Java 1.9.x. Turbo / Boost was enabled on all configurations.

VM Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 22.7 [+39%] 9.58 12.1 16.36 .Net CLR integer starts great – Ryzen2 is 40% faster than CFL-U.
BenchDotNetAA .Net Dhrystone Long (GIPS) 22 [+34%] 9.24 12.1 16.4 64-bit integer workloads also favour Ryzen2, still 35% faster.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 40.5 [+9%] 18.7 22.5 37.1 Floating-Point CLR performance is also good but just about 10% faster than CFL-U.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 49.6 [+6%] 23.7 28.8 46.8 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen2 faster by 6%.
.Net CLR performance was always incredible on Ryzen1 and 2 (desktop/workstation) and mobile is no exception – all Intel designs are left in the dust with even CFL-U soundly beaten by anything between 10-40%.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 43.23 [+20%] 21.32 25 35 Just as we saw with Dhrystone, this integer workload sees a big 20% improvement for Ryzen2.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 44.71 [+21%] 21.27 26 37 With 64-bit integer workload we see a similar story – 21% better.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 137 [+46%] 78.17 94 56 Here we make use of RyuJit’s support for SIMD vectors thus running AVX2/FMA code – Ryzen2 does even better here 50% faster than CFL-U.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 75.2 [+45%] 43.59 52 35 Switching to FP64 SIMD vector code – still running AVX2/FMA – we see a similar gain.
As before Ryzen2 dominates .Net CLR performance – even when using RyuJit’s SIMD instructions we see big gains of 20-45% over CFL-U.
Java Arithmetic Java Dhrystone Integer (GIPS) 222 [+13%] 119 150 196 We start JVM integer performance with a 13% lead over CFL-U.
Java Arithmetic Java Dhrystone Long (GIPS) 208 [+12%] 101 131 185 Nothing much changes with 64-bit integer workload – Ryzen2 still faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 50.9 [+9%] 23.13 27.8 46.6 With a floating-point workload Ryzen2 performance improvement drops a bit.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 54 [+13%] 23.74 28.7 47.7 With FP64 workload Ryzen2 gets back to 13% faster.
Java JVM performance delta is not as high as .Net but still decent just over 10% over CFL-U similar to what we’ve seen on desktop.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 48.74 [+15%] 20.5 24 42.5 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR but Ryzen2 is still 15% faster.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 46.75 [+4%] 20.3 24.8 44.8 With 64-bit vectorised workload Ryzen2’s lead drops to 4%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 38.2 [+9%] 14.59 17.6 35 Switching to floating-point we return to a somewhat expected 9% improvement.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 35.7 [+2%] 14.59 17.4 35 With FP64 workload Ryzen2’s lead somewhat inexplicably drops to 2%.
Java’s lack of vectorised primitives, which would allow the JVM to use SIMD instruction sets, allows Ryzen2 to do well and overtake CFL-U by 2-15%.

Ryzen2 on desktop dominated the .Net and Java benchmarks – and Ryzen2 mobile does not disappoint – it is consistently faster than CFL-U which does not bode well for Intel. If you mainly run .Net and Java apps on your laptop then Ryzen2 is the one to get.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen2 was a worthy update on the desktop and Ryzen2 mobile does not disappoint; it instantly obsoleted all older Intel designs (SKL/KBL-U) with only the very latest 4-core ULV (CFL/WHL-U) being able to match it. You can see from the results how AMD forced Intel’s hand to double cores in order to stay competitive.

Even then Ryzen2 manages to beat CFL-U in non-SIMD workloads and remains competitive in SIMD AVX2/FMA workloads (only 20% or so slower) while soundly beating SKL/KBL-U with their 2-cores and wide SIMD units. With soon-to-be-released Ryzen3 with wide SIMD units (256-bit as CFL/WHL-U) – Intel will need AVX512 to stay competitive – however it has its own issues which may be problematic in mobile/ULV space.

Both Ryzen2 mobile and CFL/WHL-U have increased TDP (~25W) in order to manage the increased number of cores (instead of 15W with older 2-core designs) and turbo short-term power as much as 35W. This means while larger 14/15″ designs with good cooling are able to extract top performance – smaller 12/13″ designs are forced to use lower cTDP of 15W (20-25W turbo) thus with lower multi-threaded performance.

Also consider that Ryzen2 is not affected by most “Spectre” vulnerabilities and not by “Meltdown” either, thus it does not need KVA Shadow (kernel virtual address shadowing) that greatly impacts I/O workloads. Only the very latest Whiskey-Lake ULV (WHL-U gen 8-refresh) has hardware “Meltdown” fixes – thus there is little point buying CFL-U (gen 8 original) and even less point buying older SKL/KBL-U.

In light of the above – Ryzen2 mobile is a compelling choice especially as it comes at a (much) lower price-point: its competition is really only the very latest WHL-U i5/i7 which do not come cheap – with most vendors still selling CFL-U and even KBL-U inventory. The only issue is the small choice of laptops available with it – hopefully the vendors (Dell, HP, etc.) will continue to release more versions especially with Ryzen 3 mobile.

In a word: Highly Recommended!


AMD Ryzen+ 2700X Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen+” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen+” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU versions – with integrated “Vega RX” graphics – as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen+:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we are testing them in this article!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU Cache and Memory performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen+ (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen+ 2700X Pinnacle Ridge AMD Ryzen+ 2600 Pinnacle Ridge AMD Ryzen 1700X Summit Ridge Intel i7-6700K SkyLake Comments
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 6x 32kB 8-way / 6x 64kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen+ data/instruction caches are unchanged; icache is still 2x as big as Intel’s.
L2 Caches 8x 512kB 8-way 6x 512kB 8-way 8x 512kB 8-way 4x 256kB 8-way Ryzen+ L2 cache is unchanged but we’re told latencies have been improved. The 2700X’s total L2 is also 4x bigger than Intel’s (4MB vs 1MB)!
L3 Caches 2x 8MB 16-way 2x 8MB 16-way 2x 8MB 16-way 8MB 16-way Ryzen+ L3 caches are also unchanged – but again latencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
TLB 4kB pages 64 full-way 1536 8-way 64 full-way 1536 8-way 64 full-way 1536 8-way 64 8-way 1536 6-way No TLB changes.
TLB 2MB pages 64 full-way 1536 2-way 64 full-way 1536 2-way 64 full-way 1536 2-way 8 full-way 1536 6-way No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) 600-1200 600-1200 600-1200 1200-4000 Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max 2400 / 2933 2400 / 2933 2400 / 2666 2533 / 2400 Ryzen+ now supports up to 2933MHz (officially) which should improve its performance quite a bit – unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Timing (clocks) 14-16-16-32 7-54-18-9 2T 14-16-16-32 7-54-18-9 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T Memory runs at the same timings on both Ryzen+ and Ryzen but we shall see if measured latencies are different.
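
The channel width and memory speed in the table set a theoretical ceiling on memory bandwidth: transfers per second multiplied by bytes per transfer. As a rough sketch (the function name is our own; the 2400Mt/s speed and the 30.2GB/s measured figure are the ones quoted for the 2700X in the benchmark results later in this article):

```python
def peak_bandwidth_gbs(transfers_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak in GB/s: transfers per second x bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer / 1e9


# 2 channels x 64-bit = 128-bit total width, DDR4 at 2400Mt/s as tested.
peak = peak_bandwidth_gbs(2400, 128)

# Aggregated Memory bandwidth reported for the 2700X in this article.
measured = 30.2
efficiency = measured / peak

print(f"peak {peak:.1f} GB/s, measured {measured} GB/s, "
      f"efficiency {efficiency:.0%}")
```

Running the same function with 2933 shows how much headroom the officially supported memory speed would add, at least on paper.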

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Algorithms that require data to be shared between threads – e.g. producer/consumer – benefit from having those threads scheduled on the same CCX, which ensures lower latencies and higher bandwidth; we will test this presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
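
To illustrate the three producer/consumer combinations, here is a small sketch that classifies pairs of logical CPUs. It is not Sandra’s implementation: the numbering scheme (SMT siblings numbered consecutively, 4 cores per CCX) is an assumption, and real operating systems may enumerate logical CPUs differently. The latencies are the 2700X figures from the results table in this article.

```python
from itertools import combinations

THREADS_PER_CORE = 2  # assumption: SMT siblings are numbered consecutively
CORES_PER_CCX = 4     # assumption: 8C/16T part with 4 cores per CCX


def relation(cpu_a: int, cpu_b: int) -> str:
    """Classify two logical CPUs by the closest level they share."""
    core_a = cpu_a // THREADS_PER_CORE
    core_b = cpu_b // THREADS_PER_CORE
    if core_a == core_b:
        return "same core"
    if core_a // CORES_PER_CCX == core_b // CORES_PER_CCX:
        return "same CCX"
    return "different CCX"


def pairs_by_relation(num_cpus: int) -> dict:
    """Group every unordered pair of logical CPUs by its relation."""
    groups = {"same core": [], "same CCX": [], "different CCX": []}
    for a, b in combinations(range(num_cpus), 2):
        groups[relation(a, b)].append((a, b))
    return groups


# Inter-unit latencies (ns) reported for the 2700X in this article.
LATENCY_NS = {"same core": 13.5, "same CCX": 40.1, "different CCX": 128.0}

groups = pairs_by_relation(16)
best = min(LATENCY_NS, key=LATENCY_NS.get)    # "best match" pairing
worst = max(LATENCY_NS, key=LATENCY_NS.get)   # "worst match" pairing
print(best, len(groups[best]), "pairs;", worst, len(groups[worst]), "pairs")
```

Under these assumptions more than half of all possible pairs cross the CCX boundary, which is why careless thread placement is costly.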

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Ryzen+ 2700X 8C/16T Pinnacle Ridge Ryzen+ 2600 6C/12T Pinnacle Ridge Ryzen 1700X 8C/16T Summit Ridge i7-6700K 4C/8T SkyLake Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 54.9 [+15%] 46.5 47.8 39 Ryzen+ manages 15% higher bandwidth between its cores, slightly better than just 11% clock increase – signalling some improvements under the hood.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 5.89 [+2%] 5.53 5.8 16.3 Worst-case pairs on Ryzen must go across CCXes – and with this link running at the same clock (1200MHz) on Ryzen+ we can only manage a 2% increase in bandwidth. This is why faster memory is needed.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 13.5 [-13%] 15.4 15.6 16.2 Within the same core (sharing L1D/L2), Ryzen+ manages a 13% reduction in latency, again better than just clock speed increase.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 40.1 [-7%] 43.5 43.2 47.3 Within the same compute unit (sharing L3), the latency decreased by 7% on Ryzen+ thus L3 seems to have improved also.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 128 [-6%] 132 236 Going inter-CCX we still see a 6% reduction in latency on Ryzen+ – with the CCX link at the same speed – a welcome surprise.
The multiple CCX design still presents some challenges to programmers requiring threads to be carefully scheduled – but we see a decent 6-7% reduction in L3/CCX latencies on Ryzen+ even when running at the same clock as Ryzen.
Aggregated L1D Bandwidth (GB/s) 862 [+18%] 615 730 837 Right off we see an 18% bandwidth increase – almost 2x the 11% clock increase – thus some improvements have been made to the cache system. It allows Ryzen+ to finally beat the i7 with its wide L1 data paths (512-bit) though with 2x more caches (8 vs 4).
Aggregated L2 Bandwidth (GB/s) 736 [+32%] 542 556 329 We see a huge 32% increase in L2 cache bandwidth – almost 3x clock increase (the 11%) suggesting the L2 caches have been improved also. Ryzen+ has thus 2x the L2 bandwidth of i7 though with 2x more caches (8 vs 4).
Aggregated L3 Bandwidth (GB/s) 339 [+19%] 398 284 238 The bandwidth of the L3 caches has also increased by 19% (2x clock increase) though we see the 6-core 2600 doing better (398 vs 339) likely due to fewer threads competing for the same L3 caches (12 vs 16). Ryzen+ L3 caches are not just 2x bigger than Intel’s but also deliver considerably more bandwidth.
Aggregated Memory (GB/s) 30.2 [+2%] 30.2 29.6 29.1 With the same memory clock, Ryzen+ does still manage a small 2% improvement – signalling memory controller improvements. We also see Ryzen’s memory at 2400Mt/s having better bandwidth than Intel at 2533.
We see big improvements on Ryzen+ for all caches L1D/L2/L3 of 20-30% – more than just raw clock increase (11%) – so AMD has indeed made improvements – which to be fair needed to be done. The memory controller is also a bit more efficient (2%) though it can run at higher clocks than tested (2400Mt/s) – hopefully fast DDR4 memory will become more affordable.
Data In-Page Random Latency (ns) 66.4 (4-12-31) [-6%] [0][-5][-4] 66.4 (4-12-31) 70.5 (4-17-35) 20.4 (4-12-21) In-page latency has decreased by a noticeable 6% on Ryzen+ (both 2700X and 2600) – we see 5 clocks reduction for L2 and 4 for L3 a welcome improvement. But still a way to go to catch Intel which has 1/3x (three times less) latency.
Data Full Random Latency (ns) 80.9 (4-12-32) [-8%] [0][-5][-4] 79.4 (4-12-32) 87.6 (4-17-36) 63.9 (4-12-34) Out-of-page latencies have also been reduced by 8% on Ryzen+ (same memory) and we see the same 5 and 4 clock reduction for L2 and L3 (on both 2700X and 2600 it’s no fluke). Again these are welcome but still have a way to go to catch Intel.
Data Sequential Latency (ns) 3.4 (4-6-7) [-8%] [0][-1][0] 3.5 (4-6-7) 3.7 (4-7-7) 4.1 (4-12-13) Ryzen’s prefetchers are working well with sequential access pattern latency and we see an 8% latency drop for Ryzen+.
Ryzen’s issue was high memory latencies (in-page/full random) and Ryzen+ has reduced them all by 6-8%. While it is a good improvement, they are still pretty high compared to Intel’s thus more work needs to be done here.
Code In-Page Random Latency (ns) 14.2 (4-9-24) [-9%] [0][0][0] 14.6 (4-9-24) 15.6 (4-9-24) 10.1 (2-10-21) Code latencies were not a problem on Ryzen but we still see a welcome reduction of 9% on Ryzen+. (no clocks delta)
Code Full Random Latency (ns) 88.6 (4-14-49) [-9%] [0][+1][+2] 89.3 (4-14-49) 97.4 (4-13-47) 70.7 (2-11-46) Out-of-page latency also sees a 9% decrease on Ryzen+ but somewhat surprisingly a 1-2 clock increase.
Code Sequential Latency (ns) 7.6 (4-12-20) [-8%] [0][+1][+1] 7.8 (4-12-20) 8.3 (4-11-19) 5.0 (2-4-9) Ryzen’s prefetchers are working well with sequential access pattern latency and we see an 8% reduction on Ryzen+.
Code access latencies were not a problem on Ryzen but they also see an 8% improvement on Ryzen+ which is welcome. Note code L1i cache is 2x Intel’s (64kB vs 32).
Memory Update Transactional (MTPS) 4.7 [+10%] 5 4.28 33.2 HLE Ryzen+ is 10% faster than Ryzen but naturally without HLE support it cannot match the i7. But with Intel disabling HLE on all but top-end CPUs AMD does not have much to worry about.
Memory Update Record Only (MTPS) 4.6 [+11%] 4.75 4.16 23 HLE With only record updates we still see an 11% increase.
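
The bracketed triples in the latency rows above (e.g. 4-12-31) are the L1D-L2-L3 latencies in clocks. To put the 5-clock L2 and 4-clock L3 reductions into time units, clocks can be divided by the core frequency. The sketch below assumes the cores ran at their turbo clocks (4.2GHz for the 2700X, 3.8GHz for the 1700X, as listed in the companion CPU review); that is our assumption, not something the benchmark reports, so treat the output as indicative only.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """One cycle lasts 1/clock_ghz nanoseconds."""
    return cycles / clock_ghz


# Assumption: cores were running at turbo clock during the test.
ASSUMED_CLOCK_GHZ = {"2700X": 4.2, "1700X": 3.8}

# L1D-L2-L3 clocks from the "Data In-Page Random Latency" row above.
LATENCY_CLOCKS = {"2700X": (4, 12, 31), "1700X": (4, 17, 35)}

for cpu, clocks in LATENCY_CLOCKS.items():
    ns = [round(cycles_to_ns(c, ASSUMED_CLOCK_GHZ[cpu]), 2) for c in clocks]
    print(cpu, dict(zip(("L1D", "L2", "L3"), ns)))
```

Because the clock also rose, the time saved per L2 or L3 access is larger than the clock-count reduction alone suggests.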

Ryzen+ brings nice updates – good bandwidth increases to all caches L1D/L2/L3 and also much-needed latency reductions for data (and code) accesses. Yes, there is still work to be done to bring the latencies down further – but it may be just enough to push Intel into 2nd place for a good while.
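
One way to separate the architectural gain from the clock gain is to divide the bandwidth ratio by the clock ratio. The sketch below does this for the aggregated cache bandwidths of the 2700X versus the 1700X, taking the 11% clock increase quoted in the text as given; the function name is illustrative.

```python
CLOCK_GAIN = 0.11  # clock increase of the 2700X over the 1700X, per the text

# (2700X, 1700X) aggregated bandwidth in GB/s from the table above.
BANDWIDTH = {"L1D": (862, 730), "L2": (736, 556), "L3": (339, 284)}


def gain_beyond_clock(new: float, old: float, clock_gain: float) -> float:
    """Fractional gain left over after dividing out the clock increase."""
    return (new / old) / (1.0 + clock_gain) - 1.0


for cache, (ryzen_plus, ryzen) in BANDWIDTH.items():
    extra = gain_beyond_clock(ryzen_plus, ryzen, CLOCK_GAIN)
    print(f"{cache}: {extra:+.1%} beyond the clock increase")
```

On these figures the L2 cache shows by far the largest improvement that clock speed alone does not explain.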

At the high-end, ThreadRipper2 will likely benefit most as it’s going against many-core SKL-X AVX512-enabled competitor which is a lot “tougher” than the normal SKL/KBL/CFL consumer versions.

SiSoftware Official Ranker Scores

 

Final Thoughts / Conclusions

As with original Ryzen, the cache and memory system performance is not the clean-sweep we’ve seen in CPU testing – but Ryzen+ does bring welcome improvements in bandwidth and latency – which hopefully will further improve with firmware/BIOS updates (AGESA firmware).

With the potential to use faster DDR4 memory – Ryzen+ can do far better than in this test (e.g. with 2933/3200MHz memory). Unfortunately at this time DDR4 memory – especially high-end fast versions – is hideously expensive which is a bit of a problem. You may be better off using less but fast(er) memory with Ryzen designs.

Ryzen+ is a great update that will not disappoint upgraders and is likely to increase AMD’s market share. AMD is here to stay!

AMD Ryzen+ 2700X Review & Benchmarks – CPU 8-core/16-thread Performance

What is “Ryzen+” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen+” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU versions – with integrated “Vega RX” graphics – as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen+:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Ryzen+ (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen+ 2700X Pinnacle Ridge AMD Ryzen+ 2600 Pinnacle Ridge AMD Ryzen 1700X Summit Ridge Intel i7-6700K SkyLake Comments
Cores (CU) / Threads (SP) 8C / 16T 6C / 12T 8C / 16T 4C / 8T Ryzen+ like its predecessor has the most cores and threads; it will thus be down to IPC and clock speeds for performance improvements.
Speed (Min / Max / Turbo) 2.2-3.7-4.2GHz (22x-37x-42x) [+9% rated, +11% turbo] 1.55-3.4-3.9GHz (15x-34x-39x) 2.2-3.4-3.8GHz (22x-34x-38x) 0.8-4.0-4.2GHz (8x-40x-42x) Ryzen+ base clock is 9% higher while Turbo/Boost/XFR is 11% higher; we thus expect at least about 10% improvement in CPU benchmarks.
Power (TDP) 105W 65W 95W 91W Ryzen+ also increases TDP by 11% (105W vs 95) which may require a bit more cooling especially when overclocking.
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 6x 32kB 8-way / 6x 64kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen+ data/instruction caches are unchanged; icache is still 2x as big as Intel’s.
L2 Caches 8x 512kB 8-way 6x 512kB 8-way 8x 512kB 8-way 4x 256kB 8-way Ryzen+ L2 cache is unchanged but we’re told latencies have been improved. The 2700X’s total L2 is 4x bigger than Intel’s (4MB vs 1MB).
L3 Caches 2x 8MB 16-way 2x 8MB 16-way 2x 8MB 16-way 8MB 16-way Ryzen+ L3 caches are also unchanged – but again latencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
SIMD Units 128-bit AVX/FMA3/AVX2 128-bit AVX/FMA3/AVX2 128-bit AVX/FMA3/AVX2 256-bit AVX/FMA3/AVX2 Ryzen+ still uses the 128-bit SIMD units going against the 256-bit SIMD units that all Intel CPUs have had since Sandy Bridge!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen+ supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Ryzen+ 2700X 8C/16T Pinnacle Ridge Ryzen+ 2600 6C/12T Pinnacle Ridge Ryzen 1700X 8C/16T Summit Ridge i7-6700K 4C/8T Skylake Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 323 [+8%] 236 298 194 Right off Ryzen+ is 8% faster than Ryzen, let’s hope it does better. Even the 2600 beats the i7 easily.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 337 [+12%] 238 301 194 With a 64-bit integer workload – we finally get into gear, Ryzen+ is 12% faster than its old brother.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 204 [+12%] 144 182 107 Even in this floating-point test, Ryzen+ is again 12% faster. All AMD CPUs beat the i7 into dust.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 172 [+11%] 123 155 89 With FP64 nothing much changes, Ryzen+ is still 11% faster.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, Ryzen+ is about 10% faster than Ryzen: this is exactly in line with the speed increase (9-11%) but if you were expecting more you may be a tiny bit disappointed.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 619 [+16%] 428 535 510 In this vectorised AVX2 integer test Ryzen+ starts to pull ahead and is 16% faster than Ryzen; perhaps some of the arch improvements benefit SIMD vectorised workloads.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 187 [+10%] 132 170 197 With a 64-bit AVX2 integer vectorised workload, Ryzen+ drops to just 10% but still in line with speed increase.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 5.83 [+7%] 4.12 5.47 3 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen+ drops to just 7% faster than Ryzen but still a decent improvement.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 577 [+11%] 409 520 453 In this floating-point AVX/FMA vectorised test, Ryzen+ is the standard 11% faster than Ryzen.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 332 [+11%] 236 299 267 Switching to FP64 SIMD code, again Ryzen+ is just the standard 11% faster than Ryzen.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 15.6 [+15%] 11 13.7 11 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – Ryzen+ manages to pull ahead further and is 15% faster.
In vectorised AVX2/FMA code we see a similar story with 10% average improvement (7-15%). It seems the SIMD units are unchanged. In any case the i7 is left in the dust.
BenchCrypt Crypto AES-256 (GB/s) 14.1 [+1%] 14.1 13.9 14.7 With AES HWA support all CPUs are memory bandwidth bound; as we’re testing Ryzen+ running at the same memory speed/timings there is still a very small improvement of 1%. But its advantage is that the memory controller is rated for 2933Mt/s operation (vs. 2533) thus with faster memory it could run considerably faster.
BenchCrypt Crypto AES-128 (GB/s) 14.2 [+1%] 14.2 14 14.8 What we saw with AES-256 just repeats with AES-128; Ryzen+ is marginally faster but the improvement is there.
BenchCrypt Crypto SHA2-256 (GB/s) 18.4 [+12%] 13.2 16.5 5.9 With SHA HWA Ryzen+ similarly powers through hashing tests leaving Intel in the dust; SHA is still memory bound but with just one (1) buffer it has larger headroom. Thus Ryzen+ can use its speed advantage and be 12% faster – impressive.
BenchCrypt Crypto SHA1 (GB/s) 19.2 [+14%] 13.1 16.8 11.3 Ryzen+ also accelerates the soon-to-be-defunct SHA1 and here it is even faster – 14% faster than Ryzen.
BenchCrypt Crypto SHA2-512 (GB/s) 3.75 [+12%] 2.66 3.34 4.4 SHA2-512 is not accelerated by SHA HWA (version 1) thus Ryzen+ has to use the same vectorised AVX2 code path – it still is 12% faster than Ryzen but still loses to the i7. Those SIMD units are tough to beat.
In memory bandwidth bound algorithms, Ryzen+ will have to be used with faster memory (up to 2933Mt/s officially) in order to significantly beat its older Ryzen brother. Otherwise there is only a tiny 1% improvement.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 260 [+11%] 184 235 126 In this non-vectorised test we see Ryzen+ is the standard 11% faster than Ryzen.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 221 [+11%] 157 199 112 Switching to FP64 code, nothing changes, Ryzen+ is still 11% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 106 [+23%] 76 86 27 Binomial uses thread shared data thus stresses the cache & memory system; here the arch(itecture) improvements do show, Ryzen+ 23% faster – 2x more than expected. Not to mention 3x (three times) faster than the i7.
BenchFinance Binomial double/FP64 (kOPT/s) 60.8 [+28%] 43.2 47.5 29.2 With FP64 code Ryzen+ is now even faster – 28% faster than Ryzen not to mention 2x faster than the i7. Indeed it seems there are improvements to the cache and memory system.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.4 [+11%] 38.6 49.2 49.2 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen+ does not seem to be able to reproduce its previous gain and is just the standard 11% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 41.2 [+10%] 29.1 37.3 20.3 Switching to FP64 nothing much changes, Ryzen+ is 10% faster.
Ryzen does very well in these algorithms, but Ryzen+ does even better – especially when thread-shared data is involved, managing a 23-28% improvement. For financial workloads Intel does not seem to have a chance anymore – Ryzen is impossible to beat.
BenchScience SGEMM (GFLOPS) float/FP32 275 [+10%] 238 250 267 In this tough vectorised AVX2/FMA algorithm Ryzen+ is still “just” the 10% faster than older Ryzen – but it finally manages to beat the i7.
BenchScience DGEMM (GFLOPS) double/FP64 113 [+4%] 103 109 116 With FP64 vectorised code, Ryzen+ only manages to be 4% faster. It seems the memory is holding it back thus faster memory would allow it to do much better.
BenchScience SFFT (GFLOPS) float/FP32 8.56 [+4%] 7.36 8.2 19.4 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; Ryzen+ is just 4% faster again and is still 1/2x the speed of the i7. Again it seems faster memory would help.
BenchScience DFFT (GFLOPS) double/FP64 7.42 [+1%] 6.87 7.32 9.19 With FP64 code, Ryzen+’s improvement reduces to just 1% over Ryzen and again slower than the i7.
BenchScience SNBODY (GFLOPS) float/FP32 279 [+12%] 197 249 269 N-Body simulation is vectorised but has many memory accesses to shared data, and Ryzen+ gets back to a 12% improvement over Ryzen. This allows it to finally overtake the i7.
BenchScience DNBODY (GFLOPS) double/FP64 114 [+13%] 80 101 79 With FP64 code nothing much changes, Ryzen+ is still 13% faster.
With highly vectorised SIMD code Ryzen+ still improves by the standard 10-12% but in memory-heavy code it needs to run at higher memory speed to significantly overtake Ryzen. But it allows it to beat the i7 in more algorithms.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1290 [+11%] 913 1160 1170 In this vectorised integer AVX2 workload Ryzen+ is 11% faster allowing it to soundly beat the i7.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 551 [+11%] 391 497 435 Same algorithm but more shared data does not change things for Ryzen+. Only the i7 falls behind.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 307 [+11%] 218 276 233 Again same algorithm but even more data shared does not change anything, but now the i7 is so far behind that Ryzen+ is over 30% faster. Incredible.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 461 [+11%] 326 415 384 Different algorithm but still AVX2 vectorised workload still changes nothing – Ryzen+ is 11% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 69.7 [+12%] 49.7 62 38 Still AVX2 vectorised code and still nothing changes; the i7 falls even further behind with Ryzen+ 2x (two times) as fast.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 24.7 [+11%] 17.5 22.3 20 Again we see Ryzen+ 11% faster than the older Ryzen and pulling away from the i7.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1460 [+8%] 1130 1350 1670 Here Ryzen+ is just 8% faster than Ryzen but strangely it’s not enough to beat the i7. Those SIMD units are way fast.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 243 [+11%] 172 219 268 In this final test, Ryzen+ returns to being 11% faster and again strangely not enough to beat the i7.

With all the modern instruction sets supported (AVX2, FMA, AES and SHA HWA) Ryzen+ does extremely well in all workloads – but it generally improves only by the 11% as per clock speed increase, except in some cases which seem to show improvements in the cache and memory system (which we have not tested yet).
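As a quick check of which results go beyond the clock speed increase, the percentages can be recomputed from the raw scores. The sketch below uses the BenchFinance scores from the table above (Ryzen+ 2700X vs. Ryzen 1700X); the 11% clock uplift is taken from the text, while the 3% noise margin and all function names are our own assumptions for illustration.

```python
# Scores copied from the BenchFinance rows above: (Ryzen+ 2700X, Ryzen 1700X).
SCORES = {
    "Black-Scholes FP32": (260.0, 235.0),
    "Black-Scholes FP64": (221.0, 199.0),
    "Binomial FP32": (106.0, 86.0),
    "Binomial FP64": (60.8, 47.5),
    "Monte-Carlo FP32": (54.4, 49.2),
    "Monte-Carlo FP64": (41.2, 37.3),
}

# The gain the text attributes to clock speed alone.
CLOCK_UPLIFT_PCT = 11.0
# Hypothetical tolerance for run-to-run noise (an assumption, not a measured value).
MARGIN_PCT = 3.0


def gain_pct(new: float, old: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


def beyond_clock(scores: dict) -> list:
    """Names of the tests whose gain exceeds the clock uplift plus the margin."""
    return [
        name
        for name, (new, old) in scores.items()
        if gain_pct(new, old) > CLOCK_UPLIFT_PCT + MARGIN_PCT
    ]


if __name__ == "__main__":
    for name, (new, old) in SCORES.items():
        print(f"{name:20s} {gain_pct(new, old):+5.1f}%")
    print("Beyond clock uplift:", beyond_clock(SCORES))
```

With these inputs only the two Binomial tests stand out, which is consistent with the cache and memory system improvements suggested above.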

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest drivers. .Net 4.7.x (RyuJit), Java 1.9.x. Turbo / Boost was enabled on all configurations.

VM Benchmarks Ryzen+ 2700X 8C/16T Pinnacle Ridge Ryzen+ 2600 6C/12T Pinnacle Ridge Ryzen 1700X 8C/16T Summit Ridge i7-6700K 4C/8T Skylake Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 63.2 [+8%] 30 58.6 26 .Net CLR integer performance starts off OK with Ryzen+ just 8% faster than Ryzen but now almost 3x (three times) faster than i7.
BenchDotNetAA .Net Dhrystone Long (GIPS) 49.6 [+20%] 33.6 41.2 27 Ryzen seems to favour 64-bit integer workloads, with Ryzen+ 20% faster a lot higher than expected.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 104 [+15%] 71.2 90.5 54.3 Floating-Point CLR performance was pretty spectacular with Ryzen already, but Ryzen+ is still 15% faster than Ryzen.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 122 [+20%] 88.2 102 65.6 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen+ even faster by 20%.
Ryzen’s performance in .Net was pretty incredible but Ryzen+ is even faster – even faster than expected by mere clock speed increase. There is only one game in town now for .Net applications.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 106 [+9%] 74 97 54 Just as we saw with Dhrystone, this integer workload sees a 9% improvement for Ryzen+ which makes it 2x faster than the i7.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 111 [+8%] 78 103 57 With 64-bit integer workload we see a similar story – Ryzen+ is 8% faster and again 2x faster than the i7.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 387 [+11%] 278 348 240 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Ryzen+ is 11% faster but still almost 2x faster than i7 despite its fast SIMD units
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 217 [+12%] 153 194 48.6 Switching to FP64 SIMD vector code – still running AVX/FMA – Ryzen+ is still 12% faster. i7 is truly left in the dust 1/4x the speed.
Ryzen+ is the usual 9-12% faster than Ryzen here but it means that even RyuJit’s SIMD support cannot save Intel’s i7 – it would take 2x as many cores (not 50%) to beat Ryzen+.
Java Arithmetic Java Dhrystone Integer (GIPS) 574 [+12%] 399 514 We start JVM integer performance with the usual 12% gain over Ryzen.
Java Arithmetic Java Dhrystone Long (GIPS) 559 [+12%] 392 500 Nothing much changes with 64-bit integer workload, we have Ryzen+ 12% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 138 [+13%] 99 122 With a floating-point workload Ryzen+ performance improvement is 13%.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 137 [+7%] 97 128 With FP64 workload Ryzen+ is just 7% faster but still welcome
Java performance improves by the expected amount 7-13% on Ryzen+ and allows it to completely dominate the i7.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 108 [+15%] 76 94 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR but here Ryzen+ manages a 15% lead over Ryzen.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 114 [+24%] 73 92 With 64-bit vectorised workload Ryzen+ (similar to .Net) increases its lead by 24%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 99 [+14%] 69 87 Switching to floating-point we return to the usual 14% speed improvement.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 93 [+1%] 64 92 With FP64 workload Ryzen+’s lead somewhat inexplicably drops to 1%.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives Ryzen+ free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

Ryzen dominated the .Net and Java benchmarks – but now Ryzen+ extends that dominance out-of-reach. It would take a very much improved run-time or Intel CPU to get anywhere close. For .Net and Java code, Ryzen is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen+ is a worthy update but its speed increase is generally due to its faster clock speed – similar to Intel’s SkyLake > KabyLake (gen 6 to gen 7) transition. But coming at the same price, a “free” performance increase of 10% or so is obviously not to be ignored. Let’s not forget that Ryzen+ can still use all the existing series 300 mainboards – subject to BIOS update.

The process shrink and power optimisations do allow Ryzen+ to run at lower voltages and consume less power – even though TDP has increased at least “on paper”.

Some algorithms do seem to show that the cache and memory system has been improved – but Ryzen+’s advantage is that it can use (much) faster memory. Unfortunately at this time DDR4 memory, especially the fast versions, is very expensive. Here Intel does (still) have an advantage in that fast DDR4 memory is not required except for bandwidth bound algorithms.

One advantage is that by now operating systems (and applications) have been updated to deal with its dual-CCX design that used to be so much trouble when we benchmarked Ryzen initially. With AMD increasing its market share no high-performance application can afford to ignore AMD CPUs.

We (just) cannot wait to see the new improvements in future AMD designs and especially the ThreadRipper2 update!

AMD Threadripper 1950X Review & Benchmarks – CPU 16-core/32-thread Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket, thus up to 32C/64T, as enabled in the server (“EPYC”) designs; current HEDT systems only use 2, but AMD may release versions with more dies later on.

AMD Epyc/Threadripper Die

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Threadripper (1950X) with HEDT competition (Intel SKL-X) as well as normal desktop solutions (Ryzen, Skylake) which also serves to compare HEDT with the “normal” desktop solution.

CPU Specifications AMD Threadripper 1950X Intel i9 7900X (SKL-X) AMD Ryzen 1700X Intel i7 6700K (SKL) Comments
Cores (CU) / Threads (SP) 16C / 32T 10C / 20T 8C / 16T 4C / 8T Just as Ryzen, TR has the most cores though Intel has just announced new SKL-X with more cores.
Speed (Min / Max / Turbo) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 1.2-3.3-4.3GHz (12x-33x-43x) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 0.8-4.0-4.2GHz (8x-40x-42x) SKL has the highest base clock but all CPUs have similar Turbo clocks
Power (TDP) 180W 150W 95W 91W TR has higher TDP than SKL-X just like Ryzen so may need a beefier cooling system
L1D / L1I Caches 16x 32kB 8-way / 16x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way TR and Ryzen’s instruction caches are 2x data (and SKL/X) but all caches are 8-way.
L2 Caches 16x 512kB 8-way (8MB total) 10x 1MB 16-way (10MB total) 8x 512kB 8-way (4MB total) 4x 256kB 8-way (1MB total) SKL-X has really pushed the boat out with a 1MB L2 cache per core that dwarfs all other CPUs.
L3 Caches 4x 8MB 16-way (32MB total) 13.75MB 11-way 2x 8MB 16-way (16MB total) 8MB 16-way TR actually has 2 sets of 2 L3 caches rather than a combined L3 cache like SKL/X.
NUMA Nodes 2x 16GB each no, unified 32GB no, unified 16GB no, unified 16GB Only TR has 2 NUMA nodes.

Thread Scheduling and Windows

Threadripper’s topology (4 cores in each CCX, with 2 CCX in one node and 2 nodes) makes things even more complicated for operating system (Windows) schedulers. Effectively we have a 2-tier NUMA SMP system where CCXes are level 1 and nodes are level 2, thus the scheduling of threads matters a lot.

Also keep in mind this is a NUMA system (2 nodes) with each node having its own memory; while for compatibility AMD recommends (and the BIOS defaults) to “UMA” (Unified) “interleaving across nodes” – for best performance the non-interleaving mode (or “interleaving across CCX”) should be used.

What all this means is that you likely need a reasonably new operating system – thus Windows 10 / Server 2016 – with a kernel that has been updated to support Ryzen/TR, as Microsoft is not likely to care about old versions.
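To make the 2-tier topology concrete, here is a minimal sketch that maps a logical CPU index to its node, CCX, core and SMT thread. It assumes logical CPUs are numbered contiguously with SMT siblings adjacent; this numbering is our assumption for illustration (real systems may enumerate differently), and the names `Location`, `locate` and `ccx_cpus` are hypothetical.

```python
from typing import List, NamedTuple

# Topology as described above: 2 nodes x 2 CCX x 4 cores x 2 SMT threads.
NODES, CCX_PER_NODE, CORES_PER_CCX, THREADS_PER_CORE = 2, 2, 4, 2
TOTAL_CPUS = NODES * CCX_PER_NODE * CORES_PER_CCX * THREADS_PER_CORE  # 32


class Location(NamedTuple):
    node: int    # NUMA node (die) index
    ccx: int     # CCX index within the node
    core: int    # core index within the CCX
    thread: int  # SMT thread index within the core


def locate(cpu: int) -> Location:
    """Map a logical CPU index to its place in the hierarchy.

    Assumption: contiguous numbering with SMT siblings adjacent.
    """
    if not 0 <= cpu < TOTAL_CPUS:
        raise ValueError(f"cpu must be in 0..{TOTAL_CPUS - 1}")
    rest, thread = divmod(cpu, THREADS_PER_CORE)
    rest, core = divmod(rest, CORES_PER_CCX)
    node, ccx = divmod(rest, CCX_PER_NODE)
    return Location(node, ccx, core, thread)


def ccx_cpus(node: int, ccx: int) -> List[int]:
    """Logical CPUs belonging to one CCX, e.g. as the basis for an affinity set."""
    return [c for c in range(TOTAL_CPUS) if locate(c)[:2] == (node, ccx)]


if __name__ == "__main__":
    print(locate(0), locate(31))
    print("CPUs of node 1, CCX 0:", ccx_cpus(1, 0))
```

Threads that share data would ideally be kept within one such CCX set, and failing that within one node.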

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen/TR support all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks AMD Threadripper 1950X Intel 7900X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 447 [-2%] 454 226 186 TR can keep up with SKL-X and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 459 [+1%] 456 236 184 An Int64 load does not change results.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 352 [+30%] 269 184 107 Finally TR soundly beats SKL-X by 30% and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 295 [+32%] 223 154 89 With a FP64 work-load the lead increases slightly.
Unlike Ryzen which soundly dominated Skylake (albeit with 2x more cores, 8 vs. 4), Threadripper does not have the same advantage (16 vs. 10) thus it can only beat SKL-X in floating-point work-loads where it is 30% faster, still a good result.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 918 [-22%] 1180 535 527 With AVX2/FMA SKL-X is just too strong, with TR 22% slower.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 307 [-29%] 435 161 191 With Int64 AVX2 TR is almost 30% slower than SKL-X.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 7 [+30%] 5.4 3.6 2 This is a tough test using Long integers to emulate Int128 without SIMD and here TR manages to be 30% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 996 [=] 1000 518 466 In this floating-point AVX2/FMA vectorised test  TR manages to tie with SKL-X.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 559 [-10%] 622 299 273 Switching to FP64 SIMD code, TR is now 10% slower than SKL-X.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 27 [+12%] 24 13.7 10.7 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – TR manages a 12% win.
In vectorised AVX2/FMA code we see TR lose in most tests, or tie in one – and only shine in emulation tests not using SIMD instruction sets. Intel’s SIMD units – even without AVX512 that SKL-X brings – are just too strong for TR just as we saw Ryzen struggle against normal Skylake.
BenchCrypt Crypto AES-256 (GB/s) 27.1 [-21%] 34.4  14  15 All CPUs support AES HWA – but TR/Ryzen memory is just 2400Mt/s vs 3200 that SKL-X enjoys (+33%) thus this is a good result; TR seems to use its channels pretty effectively.
BenchCrypt Crypto AES-128 (GB/s)  27.4 [-18%]  33.5  14  15 Similar to what we saw above TR is just 18% slower which is a good result; unfortunately we cannot get the memory over 2400Mt/s.
BenchCrypt Crypto SHA2-256 (GB/s) 32.2 [+2.2x] 14.6 17.1 5.9 Like Ryzen, TR’s secret weapon is SHA HWA which allows it to soundly beat SKL-X, being over 2.2x faster!
BenchCrypt Crypto SHA1 (GB/s) 34.2 [+30%] 26.4 17.7 11.3 Even against the multi-buffer AVX2 implementation used by SKL-X, SHA HWA allows TR to come out 30% ahead.
BenchCrypt Crypto SHA2-512 (GB/s)  6.34 [-41%]  10.9  3.35  4.38 SHA2-512 is not accelerated by SHA HWA (version 1) thus TR has to use the same vectorised AVX2 code thus is 41% slower.
TR’s secret crypto weapon (as Ryzen) is SHA HWA which allows it to soundly beat SKL-X even with memory running 25% slower (2400 vs. 3200Mt/s); provided software is NUMA-enabled it seems TR can effectively use its 4-channel memory controllers.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 436 [+35%] 322 234.6 129 In this non-vectorised test TR beats SKL-X by 35%. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s) 366 [+32%] 277 198.6 109 Switching to FP64 code, TR still beats SKL-X by over 30%. So far so great.
BenchFinance Binomial float/FP32 (kOPT/s) 165 [+2.46x] 67.3 85.6 27.25 Binomial uses thread shared data thus stresses the cache & memory system; we would expect TR to falter – but nothing of the sort – it is actually almost 2.5x faster than SKL-X leaving it in the dust!
BenchFinance Binomial double/FP64 (kOPT/s) 83.7 [+27%] 65.6 45.6 25.54 With FP64 code the situation changes somewhat – TR is only 27% faster but still an appreciable lead. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 91.6 [+42%] 64.3 49.1 25.92 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; TR reigns supreme being over 40% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 68.7 [+34%] 51.2 37.1 19 Switching to FP64, TR is just 34% faster but still a good lead.
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point workloads TR soundly beats SKL-X by a big margin that even a 16-core version may not be able to match. But should these tests be vectorisable using SIMD – especially AVX512 – then we would fully expect Intel to win. But for now – for financial workloads there is only one choice: TR/Ryzen!!!
BenchScience SGEMM (GFLOPS) float/FP32  165 [?] 623  240.7  268 We need to implement NUMA fixes here to allow TR to scale.
BenchScience DGEMM (GFLOPS) double/FP64  75.9 [?]  216  102.2  92.2 We need to implement NUMA fixes here to allow TR to scale.
BenchScience SFFT (GFLOPS) float/FP32  16.6 [-51%]  34.3  8.57  19 FFT is also heavily vectorised but stresses the memory sub-system more; here TR cannot beat SKL-X and is 50% slower – but scales well against Ryzen.
BenchScience DFFT (GFLOPS) double/FP64  8 [-65%]  23.18  7.6  11.13 With FP64 code, the gap only widens with TR over 65% slower than SKL-X and little scaling over Ryzen.
BenchScience SNBODY (GFLOPS) float/FP32  456 [-22%]  587  234  272 N-Body simulation is vectorised but has many memory accesses to shared data – and here TR is only 22% slower than SKL-X but again scales well vs Ryzen.
BenchScience DNBODY (GFLOPS) double/FP64  173 [-2%]  178  87.2  79.6 With FP64 code TR almost catches up with SKL-X
With highly vectorised SIMD code TR cannot do as well – but an additional issue is that NUMA support needs to be improved – S/D-GEMM shows how much of a problem this can be as all memory traffic is using a single NUMA node.
CPU Image Processing Blur (3×3) Filter (MPix/s)  1470 [-6%] 1560  775  634 In this vectorised integer AVX2 workload TR does surprisingly well against SKL-X just 6% slower.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 617 [-10%] 693 327 280 Same algorithm but more shared data used sees TR now 10% slower; more NUMA optimisations are needed.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  361 [-6%]  384  192  154 Again same algorithm but even more data shared now TR is 6% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  570 [-6%]  609  307  271 Different algorithm but still AVX2 vectorised workload – TR is still 6% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  106 [+35%]  78.3  57.3  34.9 Still AVX2 vectorised code but TR does far better, it is no less than 35% faster than SKL-X!
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  37.8 [-17%]  45.8  20  18.1 TR does worst here, it is 17% slower than SKL-X but still scales well vs. Ryzen.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1260 [?]  4260  1160  2280 This 64-bit SIMD integer workload is a problem for TR but likely NUMA issue again as not much scaling vs. Ryzen.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 420 [-45%]  777  175  359 TR really does not do well here but does scale well vs. Ryzen, likely some code optimisation is needed.

As TR (like Ryzen) supports most modern instruction sets now (AVX2, FMA, AES/SHA HWA) it does well but generally not enough to beat SKL-X; unfortunately the latter with AVX512 can potentially get even faster (up to 100%) increasing the gap even more.

While we’ve not tested memory performance in this article, we see that in streaming tests (e.g. AES, SHA) – even more memory bandwidth is needed to feed all the 16 cores (32 threads) and being able to run the memory at higher speeds would be appreciated.

NUMA support is crucial – as non-NUMA algorithms take a big hit (see GEMM) where performance can be even lower than Ryzen. While complex server or scientific software won’t have this problem, most programs will not be NUMA aware.
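How well TR “scales vs. Ryzen” can be expressed as the measured speed-up divided by the 2x core ratio (16 vs. 8 cores). The sketch below uses scores copied from the tables above; the 0.9 threshold for “scales well” and the function names are assumptions chosen purely for illustration.

```python
# (Threadripper 1950X, Ryzen 1700X) scores copied from the tables above.
SCORES = {
    "Dhrystone Integer (GIPS)": (447.0, 226.0),
    "SFFT (GFLOPS)": (16.6, 8.57),
    "DFFT (GFLOPS)": (8.0, 7.6),
    "SGEMM (GFLOPS)": (165.0, 240.7),
    "DGEMM (GFLOPS)": (75.9, 102.2),
}

CORE_RATIO = 16 / 8  # TR has twice the cores of Ryzen


def scaling_efficiency(tr: float, ryzen: float) -> float:
    """Measured speed-up as a fraction of the ideal (core ratio) speed-up."""
    return (tr / ryzen) / CORE_RATIO


def classify(tr: float, ryzen: float, good: float = 0.9) -> str:
    """Label a result; `good` is a hypothetical threshold, not a measured value."""
    if tr < ryzen:
        return "regression"  # slower than the 8-core part, e.g. the non-NUMA GEMM
    if scaling_efficiency(tr, ryzen) >= good:
        return "scales well"
    return "poor scaling"


if __name__ == "__main__":
    for name, (tr, ryzen) in SCORES.items():
        eff = scaling_efficiency(tr, ryzen)
        print(f"{name:26s} {eff:5.0%}  {classify(tr, ryzen)}")
```

With these inputs the GEMM tests come out as regressions, which matches the NUMA issue described above.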

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks AMD Threadripper 1950X Intel 7900X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS)  111 [+88%]  59  61.5  29 .Net CLR integer performance starts off very well with TR 88% faster than SKL-X an incredible result! This is *not* a fluke as Ryzen scores incredibly too.
BenchDotNetAA .Net Dhrystone Long (GIPS) 62.9 [+3%]  61  41  29 TR cannot match the same gain with 64-bit integer, but still just about manages to beat SKL-X.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS)  193 [+82%]  106  103  50 Floating-Point CLR performance is pretty spectacular with TR (like Ryzen) dominating – it is no less than 82% faster than SKL-X!
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS)  225 [+67%]  134  111  63 FP64 performance is also great with TR 67% faster than SKL-X an absolutely huge win!
It’s pretty incredible: for .Net applications TR – like Ryzen – is king! It is between 65-90% faster in all tests (except 64-bit integer). With more and more applications (apps?) running under the CLR, TR (like Ryzen) has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 195 [+38%] 141 92.6 53.4 In this non-vectorised test, TR is almost 40% faster than SKL-X, not as high as what we’ve seen before but still significant.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 192 [+34%] 143 97.6 56.5 With 64-bit integer workload this time we see no changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 626 [+27%] 491 347 241 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Intel strikes back through its SIMD units but TR is still a comfortable 27% faster.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 344 [+14%] 301 192 135 Switching to FP64 SIMD vector code – still running AVX/FMA – TR’s lead falls to 14% but it is still a win!
Taking advantage of RyuJit’s support for vectors/SIMD (through SSE2, AVX/FMA) allows SKL-X to gain some traction – TR remains very much faster up to 40%. Whatever the workload, it seems TR just loves it.
Java Arithmetic Java Dhrystone Integer (GIPS)  1000 [+16%]  857 JVM integer performance is only 16% faster on TR than SKL-X – but a win is a win.
Java Arithmetic Java Dhrystone Long (GIPS)  974 [+26%]  771 With 64-bit integer workloads, TR is now 26% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS)  231 [+48%]  156 With a floating-point workload TR increases its lead to a massive 48%, a pretty incredible result.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS)  183 [+14%]  160 With FP64 workload the gap reduces way down to 14% but it is still faster than SKL-X.
Java performance is not as incredible as we’ve seen with .Net, but TR is still 15-50% faster than SKL-X – no mean feat! Again if you have Java workloads, then TR should be the CPU of choice.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s)  200 [+45%]  137 The JVM does not support SIMD/vectors, thus TR uses its scalar prowess to be 45% faster.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 186 [+33%] 139 With 64-bit vectorised workload TR is still 33% faster.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s)  169 [+69%]  100 With floating-point, TR is a massive 69% faster than SKL-X a pretty incredible result.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s)  159 [+59%]  100 With FP64 workload TR’s lead falls just a little to 59% – a huge win over SKL-X.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives TR (like Ryzen) free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

TR (like Ryzen) absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than the latest Intel SKL-X – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on TR. For .Net and Java code, TR is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It may be difficult to decide whether AMD’s design (multiple CCX units, multiple dies/nodes on a socket) is “cool” and supporting it effectively is not easy for programmers – be they OS/kernel or application – but when it works it works extremely well! There is no doubt that Threadripper can beat Skylake-X at the same cost (approx 1,000$), though using more cores, just as its little (single-die) brother Ryzen.

Scalar workloads, .Net/Java workloads just fly on it – but highly vectorised AVX2/FMA workloads only perform competitively; unfortunately once AVX512 support is added SKL-X is likely to dominate effectively these workloads though for now it’s early days.

Its multiple NUMA node design – unless running in UMA (unified) mode – requires both OS and application support, otherwise performance can tank to Ryzen levels; while server and scientific programs are likely to be NUMA-aware, this is a problem for most applications. Then we have its dual-CCX design which further complicates workloads, effectively being a 2nd NUMA level; we can see inter-core latencies having 4 tiers while SKL-X only has 2 tiers.

In effect both platforms will get better in the future: Intel’s SKL-X with AVX512 support and AMD’s Threadripper with NUMA/CCX memory optimisations (and hopefully AVX512 support at one point). Intel are also already launching newer versions with more cores (up to 18C/36T) while AMD can release some server EPYC versions with 4 dies (and thus up to 32C/64T) that will both push power envelopes to the maximum.

For now, Threadripper is a return to form from AMD.

AMD Threadripper 1950X Review & Benchmarks – 4-channel DDR4 Cache & Memory Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket, thus up to 32C/64T, as enabled in the server (“EPYC”) designs; current HEDT systems only use 2, but AMD may release versions with more dies later on. The large socket allows for 4 DDR4 memory channels, greatly increasing bandwidth over Ryzen, just as with Intel.

AMD Threadripper die

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Threadripper 1950X Intel 7900X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
TLB 4kB pages 64 full-way / 1536 8-way 64 8-way / 1536 6-way 64 full-way / 1536 8-way 64 8-way / 1536 6-way TR/Ryzen has comparatively “better” TLBs 8-way vs 6-way and full-way vs 8-way.
TLB 2MB pages 64 full-way / 1536 2-way 8 full-way / 1536 6-way 64 full-way / 1536 2-way 8 full-way / 1536 6-way Nothing much changes for 2MB pages with TR/Ryzen leading the pack again.
Memory Controller Speed (MHz) 600-1200 800-3300 600-1200 800-4000 TR/Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max 2400 / 2666 2533 / 2400 2400 / 2666 2533 / 2400 TR/Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL/X supports only up to 2400 officially but happily runs at 3200MHz, a big advantage.
Memory Channels / Width 4 / 256-bit 4 / 256-bit 2 / 128-bit 2 / 128-bit Both TR and SKL-X enjoy 256-bit memory channels.
Memory Timing (clocks) 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T Despite faster memory, TR/Ryzen can run lower timings than HSW-E and SKL reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on TR/Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

When running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX would ensure lower latencies and higher bandwidth, which we will test presently.

In addition, Threadripper is a NUMA SMP design – with the other nodes effectively different CPUs; thus sharing data between cores on different nodes is equivalent to different CPUs in a SMP system.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!
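The pair classification and the best/worst matching described above can be sketched as follows. This is not Sandra’s implementation, just a minimal illustration under the assumption that logical CPUs are numbered contiguously (SMT siblings adjacent, then cores, CCXes and nodes); the names `Tier`, `tier` and `match` are hypothetical.

```python
from enum import IntEnum

# Assumed topology: 2 SMT threads x 4 cores x 2 CCX x 2 nodes = 32 logical CPUs.
THREADS_PER_CORE, CORES_PER_CCX, CCX_PER_NODE, NODES = 2, 4, 2, 2
TOTAL_CPUS = THREADS_PER_CORE * CORES_PER_CCX * CCX_PER_NODE * NODES


class Tier(IntEnum):
    """Producer/consumer relationship, ordered from nearest to farthest."""
    SAME_CORE = 0
    SAME_CCX = 1
    DIFFERENT_CCX = 2
    DIFFERENT_NODE = 3


def tier(a: int, b: int) -> Tier:
    """Classify a pair of logical CPUs (assumes contiguous numbering)."""
    core_a, core_b = a // THREADS_PER_CORE, b // THREADS_PER_CORE
    ccx_a, ccx_b = core_a // CORES_PER_CCX, core_b // CORES_PER_CCX
    node_a, node_b = ccx_a // CCX_PER_NODE, ccx_b // CCX_PER_NODE
    if node_a != node_b:
        return Tier.DIFFERENT_NODE
    if ccx_a != ccx_b:
        return Tier.DIFFERENT_CCX
    if core_a != core_b:
        return Tier.SAME_CCX
    return Tier.SAME_CORE


def match(producer: int, worst: bool = False) -> int:
    """Pick a consumer CPU: the nearest tier by default, the farthest if `worst`."""
    candidates = [c for c in range(TOTAL_CPUS) if c != producer]
    pick = max if worst else min
    return pick(candidates, key=lambda c: tier(producer, c))


if __name__ == "__main__":
    print("best match for CPU 0: ", match(0), tier(0, match(0)).name)
    worst = match(0, worst=True)
    print("worst match for CPU 0:", worst, tier(0, worst).name)
```

Best match keeps a pair on the same core (or CCX), while worst match deliberately sends it across nodes, which is what allows the inter-CCX and inter-node paths to be measured.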

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). TR (like Ryzen) supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks AMD Threadripper 1950X Intel 7900X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s)  92.2 [+7%]  85.5  47.2  39.5 With 16 cores (and thus 16 pairs) TR’s inter-core bandwidth beats SKL-X by over 7% – assuming threads are scheduled correctly.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 7.51 [1/3] 24.4 5.75 16 In the worst case, pairs on TR span not just different CCXes but different NUMA nodes, thus bandwidth is 1/3 that of SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.4 [-1%] 15.8 15.5 16.1 Within the same core (sharing L1D/L2), TR/Ryzen inter-unit latency is ~15ns, comparable with both Intel’s CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Core (ns) 46.4 [-36%] 72.3 44.3 45 Within the same compute unit (sharing L3), the latency of ~45ns is much lower than on SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Different CCX (ns)  184.7 [+4x]  135 Going inter-CCX increases the latency by 4 times thus threads sharing data must be properly scheduled.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Node (ns)  274.4 [+6x]  Going inter-node increases the latency yet again, to about 6 times the same-CCX latency, thus scheduling is everything.
The multi-CCX design does present some challenges to programmers: threads will have to be carefully scheduled, as inter-CCX latencies are much larger than inter-core ones. Going off-node increases latencies yet again, but not by a lot (about 1.5x the inter-CCX figure); if anything, the inter-node interconnect seems comparatively low latency.
Aggregated L1D Bandwidth (GB/s)  1372 [-40%]  2252  739  878  SKL/X has 512-bit data ports (for AVX512) so TR/Ryzen cannot compete, though they would do better against older designs.
Aggregated L2 Bandwidth (GB/s)  990 [-2%]  1010  565  402 The 16x L2 caches have similar bandwidth to the 10x much bigger caches on SKL-X.
Aggregated L3 Bandwidth (GB/s)  749 [+2.6x]  289  300  247  The 4x L3 caches have much higher bandwidth than the single SKL-X cache.
Aggregated Memory (GB/s)  56 [-18%]  69  28  31  Running at a lower memory speed, TR cannot beat SKL-X but has comparatively higher memory efficiency.
Even with 16x L1D and L2 caches, TR cannot match the much faster 10x caches of SKL-X, which have been updated for 512-bit support, but it remains competitive. The 4x L3 caches do soundly beat the unified one on SKL-X, but then again sharing data outside the same CCX is going to be very much slower.

At 2400Mt/s TR’s memory is running 25% slower than SKL-X’s at 3200Mt/s, but its bandwidth is just 18% lower; thus its 4x DDR4 controllers are more efficient, which is not something we’re used to seeing.
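The efficiency comparison can be checked against theoretical peak bandwidth. The sketch below assumes both platforms run four 64-bit (8-byte) DDR4 channels, so peak is channels x 8 bytes x transfer rate; the function names are illustrative and the measured figures are the aggregated memory results from the table.

```python
def peak_bandwidth_gbs(channels, mega_transfers, bus_bytes=8):
    """Theoretical peak in GB/s: channels x bytes per transfer x MT/s."""
    return channels * bus_bytes * mega_transfers / 1000

def efficiency(measured_gbs, channels, mega_transfers):
    """Fraction of the theoretical peak actually achieved."""
    return measured_gbs / peak_bandwidth_gbs(channels, mega_transfers)

# Assumption: quad-channel on both platforms.
tr = efficiency(56, channels=4, mega_transfers=2400)     # 56 / 76.8 GB/s
skl_x = efficiency(69, channels=4, mega_transfers=3200)  # 69 / 102.4 GB/s

print(f"TR: {tr:.0%}, SKL-X: {skl_x:.0%}")  # prints "TR: 73%, SKL-X: 67%"
```

Under these assumptions TR reaches roughly 73% of its theoretical peak against roughly 67% for SKL-X, which is consistent with the bandwidth gap being smaller than the memory speed gap.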

Data In-Page Random Latency (ns)  72.8 [4-17-37] [+2.75x]  26.4 [4-13-33]  70.7 [4-17-37]  20 [4-12-21]  What we saw previously with Ryzen was no accident; TR also suffers from surprisingly large in-page latency, almost 3x that of Intel designs. Either the TLBs are very slow or they are not working.
Data Full Random Latency (ns)  111.5 [4-17-44] [+47%]  75.5 [4-13-70]  87.9 [4-17-37]  65 [4-12-34] Out-of-page latencies are ‘better’ with TR/Ryzen ‘only’ ~50% slower than SKL/X.
Data Sequential Latency (ns)  5.5 [4-7-8] [=]  5.4 [4-11-13]  3.8 [4-7-8]  4.1 [4-12-13]  TR’s prefetchers are working well, with sequential access pattern latency at ~5ns matching SKL-X.
We finally discover an issue: TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge, almost 3x higher than Intel’s. It is a mystery as to why, as both out-of-page random and sequential latencies are competitive. It does point to the TLBs: either they are not working or they are very much slower for some reason.
Code In-Page Random Latency (ns)  17.2 [4-10-26] [+43%]  12 [4-14-28]  16.1 [4-9-25]  10 [4-11-21]  With code we don’t see the same problem: in-page latency is somewhat higher than SKL-X (~40%) but nowhere near as high as what we saw with data.
Code Full Random Latency (ns)  178 [4-15-60] [+2x]  86.1 [4-16-106]  95.4 [4-13-49]  70 [4-11-47]  Out-of-page latency is about twice that of SKL-X, though still short of the ~2.75x gap seen with in-page data latency.
Code Sequential Latency (ns)  8.7 [4-10-20] [+33%]  6.5 [4-7-12]  8.4 [4-9-18]  5.3 [4-9-20]  TR’s prefetchers are working well, with sequential access pattern latency at ~9ns, though that is 33% higher than SKL-X.
While code access latencies are higher than the new SKL-X, they are comparable to the older designs and not as bad as we’ve seen with data. Overall it seems TR (like Ryzen) will need some memory controller optimisations regarding latencies, though bandwidth seems just great.
Memory Update Transactional (MTPS)  1.9 52.2 [HLE]  4.18  32.4 [HLE] SKL/X is in a world of its own due to support for HLE/RTM and there is not much TR/Ryzen can do about it.
Memory Update Record Only (MTPS)  1.88  57.23 [HLE]  4.22  25.4 [HLE] We see a similar pattern here.
Without HLE/RTM, TR (like Ryzen) doesn’t have much chance against SKL/X, but considering support for it is disabled in most SKUs, there’s not much AMD has to be worried about, not to mention Intel disabling it in the older HSW and BRW designs. But should AMD enable it in future designs, Intel will have a problem on its hands…

Threadripper’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals, partly due to more cores and more caches (16 vs 10). Overall latencies are also fine for caches and memory, except the crucial ‘in-page random access’ data latencies, which are far higher (about 3 times), possibly due to TLB issues. We’ve been here before with Bulldozer, where the problem could not be easily fixed; but if AMD does manage it this time, Ryzen’s performance will fly!

Still, despite this issue, we’ve seen in the previous article that TR’s CPU performance is very strong, thus it may not be such a big problem.
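For readers unfamiliar with the three access patterns behind the latency results, the sketch below builds the visiting order for each one and turns it into a pointer chain of the kind a latency test would walk. It is an illustration only, not Sandra's code: it assumes 4kB pages, 8-byte elements, and that ‘in-page random’ means random accesses confined to one page at a time. All function names are hypothetical, and Python itself is far too slow to measure nanosecond latencies, so this only shows how the patterns differ.

```python
import random

PAGE_BYTES = 4096                   # assumed page size
ELEM_BYTES = 8                      # assumed element size (one 64-bit pointer)
PER_PAGE = PAGE_BYTES // ELEM_BYTES

def sequential_order(n):
    """Visit elements in address order (prefetcher friendly)."""
    return list(range(n))

def in_page_random_order(n, seed=0):
    """Visit pages in order, but shuffle the elements inside each page."""
    rng = random.Random(seed)
    order = []
    for start in range(0, n, PER_PAGE):
        page = list(range(start, min(start + PER_PAGE, n)))
        rng.shuffle(page)
        order.extend(page)
    return order

def full_random_order(n, seed=0):
    """Shuffle across the whole buffer, so most accesses change page."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order

def build_chain(order):
    """chain[i] holds the index to visit after i, closing the loop at the end."""
    chain = [0] * len(order)
    for here, there in zip(order, order[1:] + order[:1]):
        chain[here] = there
    return chain

def page_switches(order):
    """Count consecutive accesses that land on a different page."""
    return sum(a // PER_PAGE != b // PER_PAGE for a, b in zip(order, order[1:]))

n = 4 * PER_PAGE
for name, make in [("sequential", sequential_order),
                   ("in-page random", in_page_random_order),
                   ("full random", full_random_order)]:
    print(name, page_switches(make(n)))
```

Under these assumptions the in-page pattern changes page no more often than the sequential one, so it should rarely miss the TLB, which is why a large in-page latency is surprising.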

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

TR’s memory performance is not the clean sweep we’ve seen in CPU testing, but it is competitive with Intel’s designs, especially the older ones. The bandwidths are all competitive and the memory controllers in particular seem to be more efficient, but latencies are a bit of a problem which AMD may have to improve in future designs.

Overall we’d still recommend TR over Intel CPUs unless you want an absolutely tried and tested design which has already been patched by microcode and firmware/BIOS updates.

AMD Ryzen 5 Series Launch & Reviews

Today AMD’s latest Ryzen 5 series launches (6C/12T 1600X, 1600 and 4C/8T 1500X, 1400) and the reviews, including Sandra benchmarks, have hit the web.

Congratulations to AMD on a great product; we look forward to publishing our review of Ryzen 5 here too!