AMD Ryzen 1700X Review & Benchmarks – CPU 8-core/16-thread Performance

What is “Ryzen”?

“Ryzen” (code-name ZP aka “Zeppelin”) is the latest generation CPU from AMD (2017), replacing the previous “Vishera”/“Bulldozer” designs for desktop and server platforms. An APU version with an integrated (GP)GPU will be launched later (Ryzen2) and will likely include a few improvements as well.

This is the “make or break” CPU for AMD and thus it greatly improves performance, with much higher IPC (instructions per clock), higher sustained clocks, better Turbo performance and “proper” SMT (simultaneous multi-threading). There are no longer “core modules” but proper “cores with 2 SMT threads”, so an “eight-core CPU” really sports 8C/16T and not 4M/8T.

No new chipsets have been introduced – thus Ryzen should work with current 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update – making it a great upgrade.

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design.

Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications | AMD Ryzen 1700X | Intel 6700K (Skylake) | Intel 5820K (Haswell-E) | Comments
Cores (CU) / Threads (SP) | 8C / 16T | 4C / 8T | 6C / 12T | Ryzen has the most cores and threads, so it will be down to IPC and clock speeds. But if it’s threads you want, Ryzen delivers.
Speed (Min / Max / Turbo) | 2.2-3.4-3.9GHz (22x-34x-39x) | 0.8-4.0-4.2GHz (8x-40x-42x) | 1.2-3.3-4.0GHz (12x-33x-40x) | SKL has the highest rated speed at 4GHz, but all three have comparable Turbo clocks, so it depends on how long they can sustain them.
Power (TDP) | 95W | 91W | 140W | Ryzen has a comparable TDP to SKL, while HSW-E is almost 50% higher.
L1D / L1I Caches | 8x 32kB 8-way / 8x 64kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | 6x 32kB 8-way / 6x 32kB 2-way | Ryzen’s instruction cache is 2x the size of its data cache, a somewhat strange decision; all caches are 8-way except HSW-E’s L1I.
L2 Caches | 8x 512kB 8-way | 4x 256kB 8-way | 6x 256kB 8-way | Ryzen’s L2 is 2x as big as either Intel CPU’s, which should help quite a bit, though it is still 8-way.
L3 Caches | 2x 8MB 16-way | 8MB 16-way | 15MB 20-way | With 2x as many cores/threads, Ryzen has 2 8MB caches, one for each CCX.

Thread Scheduling and Windows

Ryzen’s topology (2 CCXes (core complexes) of 4 cores each) makes it akin to the old Core 2 Quad or Pentium D (2 dies in 1 socket), effectively an SMP (dual-CPU) system on a single socket. Windows has always tended to migrate running threads from unit to unit in order to equalise thermal dissipation, though Windows 10/Server 2016 have increased the ‘stickiness’ of threads to units.

As the Windows scheduler is intertwined with the power management system, under ‘Balanced’ and other power-saving profiles unused cores are ‘parked’ (i.e. powered down), which affects which cores are available for scheduling. AMD has recommended the ‘High Performance’ profile, and initially claimed the Windows scheduler was not ‘Ryzen-aware’ before retracting the statement.

However, there does seem to be a problem: when Sandra runs with fewer than the total 16 threads (e.g. the MC test with 8 threads), in tests where it does not hard-schedule threads through its own scheduler (e.g. the .Net and Java benchmarks), the scheduling does not appear optimal:

Left: Ryzen Hard Affinity (e.g. Native). Right: Ryzen No Affinity (e.g. Java/.Net).

While in the left image we see Sandra assigning the 8 threads to 8 different cores, with 100% utilisation on those units and almost nothing on the other 8, in the right image we see 10 units (!) used and 4 not used at all, yet overall utilisation is still 50%.

This does not seem to happen on Intel hardware – even SMP systems – thus it may be something to be adjusted in future Windows versions.
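As a rough illustration of what “hard affinity” means here, the sketch below picks one logical CPU per physical core before touching any SMT sibling. It is not Sandra’s scheduler: the function name is hypothetical, and it assumes SMT siblings are numbered adjacently (core k owns logical CPUs 2k and 2k+1), which may not hold on every system.

```python
def one_thread_per_core(num_cores, smt_per_core, num_threads):
    """Return the logical CPU index each worker thread would be pinned to.

    Assumption (not from the article): SMT siblings are numbered adjacently,
    i.e. core k owns logical CPUs k*smt_per_core .. k*smt_per_core + smt_per_core - 1.
    """
    total = num_cores * smt_per_core
    if not 0 < num_threads <= total:
        raise ValueError("thread count must be between 1 and %d" % total)
    placement = []
    for i in range(num_threads):
        core = i % num_cores        # spread over distinct physical cores first
        sibling = i // num_cores    # only then fall back to the second SMT thread
        placement.append(core * smt_per_core + sibling)
    return placement


# 8 threads on an 8C/16T part: one logical CPU per physical core.
print(one_thread_per_core(8, 2, 8))
```

With 8 threads every physical core gets exactly one worker, which matches the pattern described for the left image; the resulting indices would then be handed to whatever affinity call the platform provides.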

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2 and FMA3, plus SHA HWA (hardware acceleration, which Intel currently supports only on Atom), but has dropped AMD’s own extensions like FMA4 and XOP, likely due to low usage.
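Choosing “the highest performing instruction set” amounts to a preference-ordered lookup against what the CPU reports. The sketch below is only an assumption about how such a dispatch could look; the feature labels, the preference list and the function name are illustrative and are not taken from Sandra.

```python
# Preference order for a SHA2-256 kernel; labels are illustrative only.
SHA256_PATHS = ["sha", "avx2", "avx", "sse2"]


def pick_code_path(supported, preference):
    """Return the first instruction set in `preference` that `supported` contains."""
    available = {name.lower() for name in supported}
    for name in preference:
        if name in available:
            return name
    return "scalar"  # fall back to plain non-SIMD code


# Hypothetical feature sets matching the description above.
ryzen = {"SSE2", "AVX", "AVX2", "FMA3", "AES", "SHA"}
skylake = {"SSE2", "AVX", "AVX2", "FMA3", "AES"}
print(pick_code_path(ryzen, SHA256_PATHS))    # sha
print(pick_code_path(skylake, SHA256_PATHS))  # avx2
```

This mirrors the code paths listed in the crypto results, where Ryzen runs SHA2-256 on SHA HWA while both Intel CPUs use AVX2.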

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Native Benchmarks Ryzen 1700X 8C/16T (MT)
8C/8T (MC)
i7-6700K 4C/8T (MT)
4C/4T (MC)
i7-5820K 6C/12T (MT)
6C/6T (MC)
Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 290 [+24%] | 242 [+13%] AVX2 185 | 146 233 | 213 Right off the bat Ryzen beats both Intel CPUs in both MT and MC tests with SMT providing a good gain (hard scheduled of course).
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 292 [+27%] | 260 [+22%] AVX2 185 | 146 230 | 213 With a 64-bit integer workload nothing much changes: Ryzen still beats both in both tests and is 27% faster than HSW-E! AMD has risen from the ashes like the Phoenix!
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 185 [+23%] | 123 [+23%] AVX/FMA 109 | 74 150 | 100 Even in this floating-point test, Ryzen beats both again by a similar margin, 23% better than HSW-E. What performance for the money!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 155 [+33%] | 102 [+32%] AVX/FMA 89 | 60 116 | 77 With FP64 the winning streak continues, with the difference increasing to 33% over HSW-E a huge gain.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, Ryzen rules the roost, blowing both SKL and HSW-E away by being 23-33% faster, with or without SMT. SMT also yields a bigger gain than on Intel’s designs.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 535 [-16%] | 421 [-13%] AVX2 513 | 389 639 | 485 In this vectorised AVX2 integer test Ryzen just overtakes SKL but cannot beat HSW-E and is just 16% slower; still it is a good result but it shows Intel’s SIMD units are really strong with AMD’s 8 cores matching Intel’s 4 cores.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 159 [-16%] | 137 [-18%] AVX2 191 | 158 191 | 168 With a 64-bit AVX2 integer vectorised workload again Ryzen is unable to beat either Intel CPU being slower by a similar margin -16%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 3.61 [+30%] | 2.1 [+11%] 2.15 | 1.36 2.74 | 1.88 This is a tough test using Long integers to emulate Int128 without SIMD and here Ryzen comes back on top being 30% faster similar to what we saw in Dhrystone.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 530 [-11%] | 424 [-4%] FMA 479 | 332 601 | 440 In this floating-point AVX/FMA vectorised test we see again the power of Intel’s SIMD units, with Ryzen being only 11% slower than HSW-E but beating SKL.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 300 [-13%] | 247 [=] FMA 271 | 189 345 | 248 Switching to FP64 SIMD code, again Ryzen cannot beat HSW-E but does beat SKL which should be sufficient.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 13.7 [+14%] | 9.7 [+2%] FMA 10.7 | 7.5 12 | 9.5 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – Ryzen manages to beat both CPUs being 14% faster. So AVX2 or FMA code is not a problem.
In vectorised AVX2/FMA code we see Ryzen lose for the first time to Intel’s SIMD units, but not by a large margin; in non-vectorised code, as with Dhrystone and Whetstone, Ryzen is again quite a bit faster than either Intel CPU. Overall Ryzen would be the preferred choice unless you are number-crunching with vectorised code.
BenchCrypt Crypto AES-256 (GB/s) 13.8 [-31%] | 14 [-32%] AES 15 | 15.4 20 | 20.7 All three CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – and 2 memory channels is just not enough; with its 4 channels HSW-E is unbeatable for streaming tests. But Ryzen is only marginally slower than its counterpart SKL.
BenchCrypt Crypto AES-128 (GB/s) 13.9 [-31%] | 14 [-33%] AES 15 | 15.4 20.1 | 21.2 What we saw with AES-256 just repeats with AES-128; Ryzen would need more memory channels to beat HSW-E but at least it is only marginally slower than SKL.
BenchCrypt Crypto SHA2-256 (GB/s) 17.1 [+2.25x] | 10.6 [+49%] SHA 5.9 | 5.5 AVX2 7.6 | 7.1 AVX2 Ryzen’s secret weapon is revealed: by supporting SHA HWA it soundly beats both Intel CPUs even running multi-buffer vectorised AVX2 code – it’s 2.2x faster! Surprisingly disabling SMT (MC mode) reduces performance appreciably, not what would be expected.
BenchCrypt Crypto SHA1 (GB/s) 17.3 [+14%] | 11.4 [-14%] SHA 11.3 | 10.6 AVX2 15.1 | 13.3 AVX2 Ryzen also accelerates the soon-to-be-defunct SHA1 but the AVX2 implementation is much less complex, allowing HSW-E to come within a whisker of Ryzen and beat it in MC mode by a similar amount (14%). Still, it is much better to have SHA HWA than to have to find multiple buffers to process with AVX2.
BenchCrypt Crypto SHA2-512 (GB/s) 3.34 [-37%] | 3.32 [-36%] AVX2 4.4 | 4.2 5.34 | 5.2 SHA2-512 is not accelerated by SHA HWA (version 1) thus Ryzen has to use the same vectorised AVX2 code path where Intel’s SIMD units show their power again.
Ryzen’s secret crypto weapon is support for SHA HWA (which Intel only supports on Atom currently), which allows it to beat both Intel CPUs. For streaming algorithms like encrypt/decrypt it would probably benefit from more memory channels to feed all those cores. But overall it would still be the preferred choice.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 234 [+49%] | 166 [+36%] 129 | 97 157 | 122 In this non-vectorised test we see Ryzen shine brightly again, beating even HSW-E by almost 50%, an incredible result. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s) 198 [+51%] | 139 [+39%] 108 | 83 131 | 100 Switching to FP64 code, Ryzen still shines, beating HSW-E by 50% again and totally demolishing SKL. So far so great!
BenchFinance Binomial float/FP32 (kOPT/s) 85.1 [+2.25x] | 83.2 [+3.23x] 27.2 | 18.1 37.8 | 25.7 Binomial uses thread-shared data and thus stresses the cache & memory system; we would expect Ryzen to falter here but nothing of the sort: it beats both Intel CPUs into the dust and is 2.25 times faster than HSW-E! Even a 12-core HSW-E would not be sufficient.
BenchFinance Binomial double/FP64 (kOPT/s) 45.8 [+37%] | 46.3 [+38%] 25.5 | 24.6 33.3 | 33.5 With FP64 code the situation changes somewhat, with Ryzen only 37% faster than HSW-E; but it’s still an appreciable win. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 49.2 [+55%] | 41.2 [+52%] 25.9 | 21.9 31.6 | 27 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen reigns supreme here also being 50% faster than even HSW-E. SKL is left in the dust.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 37.3 [+75%] | 31.8 [+41%] 19.1 | 17.2 21.2 | 22.5 Switching to FP64, Ryzen increases its dominance to 75% over HSW-E and destroys SKL completely.
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point workloads Ryzen reigns supreme beating even 6-core Haswell-E into dust by such a margin that even a 12-core HSW-E may not beat it. For financial workloads there is only one choice: Ryzen!!! Long live the new king!
BenchScience SGEMM (GFLOPS) float/FP32 68.3 [-63%] | 155 [-27%] FMA 109 | 162 185 | 213 In this tough vectorised AVX2/FMA algorithm Ryzen falters and gets soundly beaten by both SKL and HSW-E. Again the powerful SIMD units of Intel’s CPUs allow them to finally beat it, as we’ve seen in previous tests. It’s its Achilles’ heel.
BenchScience DGEMM (GFLOPS) double/FP64 62.7 [-28%] | 78.4 [-23%] FMA 72 | 67.8 87.7 | 103 With FP64 vectorised code, the gap reduces with Ryzen just 28% slower than HSW-E and just a bit slower than SKL. Again vectorised SIMD code is problematic.
BenchScience SFFT (GFLOPS) float/FP32 8.9 [-50%] | 9.85 [-39%] FMA 18.9 | 15 18 | 16.4 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; here Ryzen is again much slower than both SKL and HSW-E; for vectorised code it seems it needs 2x more SIMD units to match Intel.
BenchScience DFFT (GFLOPS) double/FP64 7.5 [-31%] | 7.3 [-30%] FMA 9.3 | 9 10.9 | 10.5 With FP64 code, Ryzen does improve (or Intel gets slower), being only about 30% slower than HSW-E and about 19% slower than SKL.
BenchScience SNBODY (GFLOPS) float/FP32 234 [-15%] | 225 [-16%] FMA 273 | 271 158 | 158 N-Body simulation is vectorised but many memory accesses to shared data and here SKL seems to do unusually well beating Ryzen in 2nd place but only by 15%. Strangely HSW-E does badly in this test even with 6-cores.
BenchScience DNBODY (GFLOPS) double/FP64 87 [+10%] | 87 FMA 79 | 79 40 | 40 With FP64 code Ryzen improves, beating its SKL rival by 10%; again HSW-E does pretty badly in this test.
With highly vectorised SIMD code Ryzen is again the loser but not by a lot; Intel has just one chance – highly vectorised SIMD algorithms that allow the powerful SIMD units to shine. Everything else is dominated by Ryzen.
CPU Image Processing Blur (3×3) Filter (MPix/s) 750 [-1%] | 699 [+4%] AVX2 655 | 563 760 | 668 In this vectorised integer AVX2 workload Ryzen ties with HSW-E, a good result considering we saw it lose in similar algorithms.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 316 [-8%] | 301 AVX2 285 | 258 345 | 327 The same algorithm but with more shared data sees Ryzen now 8% slower than HSW-E but still beating SKL.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 172 [-8%] | 166 AVX2 151 | 141 188 | 182 Again same algorithm but even more data shared does not change anything, Ryzen is again 8% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 292 [-7%] | 279 AVX2 271 | 242 316 | 276 Different algorithm but still AVX2 vectorised workload sees Ryzen still about 7% slower than HSW-E but again still faster than SKL.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 58.5 [+16%] | 37.4 AVX2 35.4 | 26.4 50.3 | 37 Still AVX2 vectorised code but here Ryzen manages to beat even HSW-E by 16%. Thus it is not a given that it will lose in all such tests; it just depends.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.6 [+26%] | 5.2 6.3 | 4.2 7.6 | 5.5 This test is not vectorised though it uses SIMD instructions and here Ryzen manages a 26% win even over HSW-E while leaving SKL in the dust.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 852 [+50%] | 525 422 | 297 571 | 420 Again in a non-vectorised test Ryzen just flies: it’s 2x faster than SKL and no less than 50% faster than HSW-E! Intel does not have its way all the time, unless the code is highly vectorised!
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 147 [+47%] | 101 75 | 55 101 | 77 In this final non-vectorised test Ryzen really flies: it’s again 2x faster than SKL and almost 50% faster than HSW-E! Intel must be getting desperate for SIMD vectorised versions of algorithms by now…

With all the modern instruction sets supported (AVX2, FMA, AES and SHA HWA) Ryzen does extremely well, beating both Skylake 4C and even Haswell-E 6C in all workloads except highly vectorised SIMD code, where the powerful Intel SIMD units can shine. Overall it would still be the choice for most workloads except SIMD number-crunching tasks, which are somewhat specialised.
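The bracketed figures in the table above appear to be the Ryzen score relative to the HSW-E score in the same column; that reading is inferred from the numbers rather than stated in the article. A minimal sketch of the arithmetic, using scores copied from the table and a hypothetical helper name:

```python
def relative_gain(score, baseline):
    """Percentage by which `score` exceeds (or trails) `baseline`."""
    return (score / baseline - 1.0) * 100.0


# Dhrystone Integer, MT: Ryzen 290 GIPS vs HSW-E 233 GIPS
print(round(relative_gain(290, 233)))    # 24
# SGEMM FP32, MT: Ryzen 68.3 GFLOPS vs HSW-E 185 GFLOPS
print(round(relative_gain(68.3, 185)))   # -63
# SMT uplift (MT vs MC) in Dhrystone Integer
print(round(relative_gain(290, 242)))    # Ryzen: 20
print(round(relative_gain(233, 213)))    # HSW-E: 9
```

The last two lines apply the same formula to the MT and MC columns of one CPU, which is one way to quantify the observation that SMT yields a bigger gain on Ryzen than on Intel’s designs in this test.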

While we’ve not tested memory performance in this article, we see that in streaming tests (e.g. AES, SHA) more memory bandwidth to feed all the 16-threads would not go amiss but the difference may not justify the increased cost as we see with Intel 2011 platform and HSW-E.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. .Net 4.6.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on all configurations.

VM Benchmarks Ryzen 1700X 8C/16T (MT)
8C/8T (MC)
i7-6700K 4C/8T (MT)
4C/4T (MC)
i7-5820K 6C/12T (MT)
6C/6T (MC)
Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 36.5 [+18%] | 25 23.3 | 17.2 30.7 | 26.8 .Net CLR integer performance starts off very well, with 18% better performance even over HSW-E, which admittedly does not do much better than SKL.
BenchDotNetAA .Net Dhrystone Long (GIPS) 45.1 [+60%] | 26 23.6 | 21.6 28.2 | 25 Ryzen seems to greatly favour 64-bit integer workloads, here it is 60% faster than even HSW-E and over 2x faster than SKL. All CPUs perform better with 64-bit workloads.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 100.6 [+53%] | 53 47.4 | 21.4 65.4 | 39.4 Floating-Point CLR performance is pretty spectacular with Ryzen beating HSW-E by over 50% a pretty incredible result. Native or CLR code works just great on Ryzen.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 121.3 [+41%] | 62 63.6 | 37.5 85.7 | 53.4 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen just over 40% faster than HSW-E.
It’s pretty incredible, for .Net applications Ryzen is king – no point buying Intel’s 2011 platform – buy Ryzen! With more and more applications (apps?) running under the CLR, Ryzen has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 92.6 [+22%] | 49 55.7 | 37.5 75.4 | 49.9 Just as we saw with Dhrystone, this integer workload sees a 22% improvement for Ryzen. While RyuJit supports SIMD integer vectors, the lack of bitfield instructions makes it slower for our code; a shame.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 97.8 [+23%] | 51 60.3 | 39.5 79.2 | 53.1 With 64-bit integer workload we see a similar story – Ryzen is 23% faster than even HSW-E. If only RyuJit SIMD would fix integer workloads too.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 272.7 [-4%] | 156 AVX 12.9 | 6.74 284.2 | 187.1 Here we make use of RyuJit’s support for SIMD vectors, thus running AVX/FMA code; Intel strikes back through its SIMD units with Ryzen 4% slower than HSW-E. Still Intel usually wins these kinds of tests.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 149 [-15%] | 85 AVX 38.7 | 21.38 176.1 | 103.3 Switching to FP64 SIMD vector code (still running AVX/FMA), Ryzen loses again, this time by 15% against HSW-E.
The only tests Intel’s CPUs can win are vectorised ones using RyuJit’s support for SIMD (aka SSE2, AVX/FMA) and thus allowing Intel’s SIMD units to shine; otherwise Ryzen dominates absolutely everything without fail.
Java Arithmetic Java Dhrystone Integer (GIPS)  513 [+29%] | 311  313 | 289  395 | 321 We start JVM integer performance with an even bigger gain, Ryzen is ~30% faster than HSW-E and 60% faster than SKL.
Java Arithmetic Java Dhrystone Long (GIPS) 514 [+28%] | 311 332 | 299 399 | 367 Nothing much changes with 64-bit integer workload, we have Ryzen 28% faster than HSW-E.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 117 [+31%] | 66 62.8 | 34.6 89 | 49 With a floating-point workload Ryzen continues its lead over both Intel’s CPUs. Native or CLR or JVM code works just great on Ryzen.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 128 [+40%] | 63 64.6 | 36 91 | 53 With FP64 workload the gap increases even further to 40% over HSW-E and an incredible 2x over SKL! Ryzen is the JVM king.
Java performance is even more incredible than what we’ve seen in .Net; server people rejoice, if you have Java workloads Ryzen is the CPU for you! 40% better performance than Intel’s 2011 platform for much lower cost? Yes please!
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 99 [+20%] | 52.6 59.5 | 36.5 82 | 49 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR, but here Ryzen manages a 20% lead over HSW-E and is almost 2x faster than SKL.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 93 [+17%] | 51 60.6 | 37.7 79 | 53 With 64-bit vectorised workload Ryzen maintains its lead of about 20%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 86 [+40%] | 42.3 40.6 | 22.1 61 | 32 Just as we’ve seen with Whetstone, Ryzen is about 40% faster than HSW-E and over 2x faster than SKL! It does not get a lot better than this. Intel had better hope Oracle will add vector primitives allowing SIMD code to use the power of its CPUs’ SIMD units.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 82 [+30%] | 42 40.9 | 22.1 63 | 32 With FP64 workload Ryzen’s lead somewhat inexplicably drops to ‘just’ 30% but remains over 2x faster than SKL. Nothing to grumble about really.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives Ryzen free rein to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

Ryzen absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than on Intel’s (ex)-top-of-the-range Haswell-E – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on Ryzen. For .Net and Java code, Ryzen is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

What a reversal of fortune for AMD! Despite a hurried launch and inevitable issues which will be fixed in time (e.g. Windows scheduler), Ryzen puts in a strong performance, beating Intel’s previous top-of-the-range Skylake 6700K and Haswell-E 5820K into the dust in most tests at a much cheaper price.

Of course there are setbacks: highly vectorised AVX2/FMA code greatly favours Intel’s SIMD units and here Ryzen falls behind a bit; streaming algorithms can overload the 2 memory channels, but then again Intel’s mainstream platform also has only 2. Still, if you were replacing a 2011 4-channel platform with Ryzen then very high-speed memory may be required to sustain performance.

Its dual-CCX design may also affect non-symmetrical workloads where different threads execute different code, with thread data-sharing across CCXes naturally slower. Clever thread assignment to the ‘right’ CCX should fix those issues, but that is down to each application and is something Windows (or other OSes) may not be able to fix. Considering we have SMP and NUMA systems out there, it is not a new problem, but perhaps one not usually seen on normal desktop systems due to the high cost of SMP/NUMA systems.
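To make the idea of assigning threads to the ‘right’ CCX concrete, here is a minimal sketch that keeps each group of data-sharing threads inside a single CCX. All names are hypothetical, and the numbering is an assumption (logical CPUs 0-7 on the first CCX and 8-15 on the second, with SMT siblings adjacent); a real application would query the topology from the OS instead.

```python
def ccx_of(logical_cpu, cores_per_ccx=4, smt_per_core=2):
    """CCX index of a logical CPU, assuming adjacent SMT-sibling numbering."""
    return logical_cpu // (cores_per_ccx * smt_per_core)


def cpus_by_ccx(num_logical=16, cores_per_ccx=4, smt_per_core=2):
    """Group logical CPU indices by the CCX they are assumed to belong to."""
    groups = {}
    for cpu in range(num_logical):
        groups.setdefault(ccx_of(cpu, cores_per_ccx, smt_per_core), []).append(cpu)
    return groups


def place_groups(thread_groups, num_logical=16):
    """Give each group of data-sharing threads the CPUs of a single CCX."""
    ccx_cpus = cpus_by_ccx(num_logical)
    placement = {}
    for index, name in enumerate(thread_groups):
        # Round-robin over the available CCXes if there are more groups than CCXes.
        placement[name] = ccx_cpus[index % len(ccx_cpus)]
    return placement


# Two hypothetical thread groups, each kept on its own CCX.
print(place_groups(["producers", "consumers"]))
```

The returned CPU lists could then be turned into an affinity mask, for example via SetThreadAffinityMask on Windows or os.sched_setaffinity on Linux, so that threads sharing data also share an L3 cache.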

All in all Ryzen is a solid CPU which should worry Intel at the high-end; we shall have to see how the lower-end 4-core and even 2-core versions perform.

SiSoftware Sandra Platinum (2017) Released!

FOR IMMEDIATE RELEASE

Contact: Press Office

SiSoftware Sandra Platinum (2017) Released:
Brand-new benchmarks, hardware support

Superseded By: SiSoftware Sandra Titanium (2018)

Updates: RTMa, RTMc, SP1, SP1a, SP2, SP3, SP4.

London, UK, March 24th 2017 – We are pleased to announce the launch of SiSoftware Sandra Platinum (2017), the latest version of our award-winning utility, which includes remote analysis, benchmarking and diagnostic features for PCs, servers, mobile devices and networks.

Sandra Platinum has a brand-new fresh look that is not out-of-place on all operating systems from the classic Windows 7 (Aero) to future Windows 10 (Acrylic).

We have added hardware support and optimisations for brand-new CPU architectures (AMD Ryzen/Threadripper, future AVX512, etc.) not forgetting GPGPU architectures across the various interfaces (CUDA, OpenCL, DirectX ComputeShader, OpenGL Compute).

As SiSoftware operates a “just-in-time” release cycle, some features were introduced in Sandra 2016 service packs: in Sandra Platinum they have been updated and enhanced based on all the feedback received.

Here is an in-depth new feature list of Sandra Platinum:

Brand-new look and icons for all versions of Windows from 7 (Aero) to future 10 (Acrylic)

  • Updated benchmark icons: high resolution 256×256 icons suitable for high-dpi screens (4k/UHD).

Screenshots: Main, Benchmarks, Wizards, Information

Windows

Broad Operating System Support
All current versions supported: Windows 10, 8.1, 8, 7; Server 2016, 2012/R2 and 2008/R2

  • Updated Benchmark Module: GPGPU Image Processing (oil painting, diffuse/random, marbling/perlin noise) supporting all modern interfaces (CUDA, OpenCL, DirectX ComputeShader)
  • New Benchmark Module: CPU Image Processing (oil painting, diffuse, marbling) supporting all modern vectorised SIMD instruction sets (AVX512, AVX2/FMA3, AVX, SSE4, SSSE3, SSE2)
  • New OpenGL Compute Support: Ported GPGPU benchmarks to OpenGL (4.3+) Compute Shader (Fractals, Crypto, Image Processing)
  • New GPU Precision: FP16/half-float precision benchmarks (Image Processing)
  • Maintained Benchmark: Updated Overall Score (2016/2017) by adding new benchmarks to the index.
  • New Hardware Support: AMD Ryzen and Threadripper architecture support; support for future AVX512-capable hardware (SKL-X, KBL-X, Coffeelake, etc.).

System Overall Benchmark

Overall Score 2017/Platinum benchmark:
16 benchmarks to fully evaluate computer performance

While each benchmark measures the performance of a specific device (CPU, Memory, (GP)GPU, Storage, etc.), there is a real need for a benchmark to evaluate the overall computer performance: this new benchmark is a weighted average of the individual scores of the existing benchmarks:

  • Native CPU Arithmetic, Cryptographic, Multi-Media (SIMD), Financial and Scientific: measures native processing performance using the very latest instruction sets (AVX512, AVX2/FMA3, AVX, SSE4, SSE2)
  • .Net/Java Arithmetic: measures software virtual machine performance (e.g. for .Net WPF/Silverlight/Modern applications)
  • Memory and Cache Bandwidth and Latency: measures memory and caches performance
  • File System/Storage Bandwidth and I/O: measures storage performance
  • GP (General Processing) / HC (Heterogeneous Compute) (GPU/APU) Arithmetic, Cryptographic, Financial, Scientific: measures (GP)GPU/APU processing performance
  • GP (General Processing) / HC (Heterogeneous Compute) (GPU/APU) Memory Bandwidth and Latency: measures (GP)GPU/APU memory performance
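As a sketch of what “a weighted average of the individual scores” could look like, the example below computes a weighted arithmetic mean. The component names, scores and weights are invented for illustration and are not Sandra’s actual values or weighting scheme.

```python
def overall_score(scores, weights):
    """Weighted arithmetic mean of per-benchmark scores."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same benchmarks")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical, already-normalised scores and weights (not Sandra's values).
scores = {"cpu": 120.0, "memory": 90.0, "storage": 60.0, "gpgpu": 150.0}
weights = {"cpu": 0.4, "memory": 0.2, "storage": 0.1, "gpgpu": 0.3}
print(overall_score(scores, weights))
```

Dividing by the sum of the weights means they do not have to add up to exactly 1, which keeps the index stable when a benchmark is added or a weight is retuned.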

Key features of Sandra Platinum

  • 4 native architectures support (x86, x64 – Windows; ARM, ARM64, x86, x64 – Android)
  • Huge official hardware support through technology partners (AMD/ATI, nVidia, Intel).
  • 4 native (GP)GPU/APU platforms support (OpenCL 1.2+, CUDA 8.0+, DirectX Compute Shader 10+, OpenGL Compute 4.3+).
  • 4 native Graphics platforms support (DirectX 12.x, DirectX 11.x, DirectX 10.x, OpenGL 3.0+).
  • 9 language versions (English, German, French, Italian, Spanish, Japanese, Chinese (Traditional, Simplified), Russian) in a single installer.
  • Enhanced Sandra Lite (Eval) version (free for personal/educational use, evaluation for other uses)

Articles & Benchmarks

For more details, please see the following articles:

Purchasing

For more details, and to purchase the commercial versions, please click here.

Updating or Upgrading

To update your existing commercial version, please click here.

Downloading

For more details, and to download the Lite (Evaluation) version, please click here.

Reviewers and Editors

For your free review copies, please contact us.

About SiSoftware

SiSoftware, founded in 1995, is one of the leading providers of computer analysis, diagnostic and benchmarking software. The flagship product, known as “SANDRA”, was launched in 1997 and has become one of the most widely used products in its field. Many worldwide IT publications, magazines and review sites use SANDRA to analyse the performance of today’s computers. Thousands of on-line reviews of computer hardware that use SANDRA are catalogued on our website alone.

Since launch, SiSoftware has always been at the forefront of the technology arena, being among the first providers of benchmarks that show the power of emerging new technologies such as multi-core, GPGPU, OpenCL, OpenGL, DirectCompute, x64, ARM, NUMA, SMT (Hyper-Threading), SMP (multi-threading), AVX512, AVX2, AVX, FMA3, NEON, SSE4.2, SSE4, SSSE3, SSE2, SSE, Java and .NET.

SiSoftware is located in London, UK. For more information, please visit www.sisoftware.net, www.sisoftware.eu, www.sisoftware.info or www.sisoftware.co.uk

SP3 for Sandra 2016 Released!

Update Wizard

As we move towards 2017, we have not forgotten our users with the just released SP3 for Sandra 2016 adding many additions and fixes to keep it up-to-date with new developments:

  • NVMe SSD support (Windows 10, Samsung, Intel)
  • nVidia CUDA 8 SDK support with native nVidia Pascal SM6.x hardware support (GTX 1080, 1070, 1060, etc.)
  • Intel Core 7th Gen (Kaby Lake) support (ULV, Y, future desktop/H)
  • TPM 2.0 support (mandated for new Windows 10)
  • Windows 10 Anniversary Update (1607) native support

As always the update is free, so please update as soon as possible.

Sandra USB supplied on SanDisk* Extreme disks

We are now providing Sandra USB versions on one of the fastest USB3 flash disks – the SanDisk* Extreme! With read bandwidth ~200MB/s you won’t be waiting for Sandra to start off the USB drive (write speed is not bad either at ~60MB/s).

Now they are not the smallest of flash drives but then again you are unlikely to lose them. Here they are for your viewing pleasure:

Sandra USB on SanDisk Extreme

Intel Broadwell-E Launch and Reviews

Intel has launched the Broadwell-E (6900, 6800 series) CPU for the 2011 platform (X99). Is it worth upgrading from your 5800 Haswell-E? Here are some reviews that contain Sandra benchmarks to help you make your decision:

Stay tuned for our own mini-review of this CPU. Note that as this is not Skylake-E it does not support the AVX512 instruction set but it does contain many other improvements…

SP2 for SiSoftware Sandra 2016 Released!

We are happy to release SP2 (Service Pack 2) to SiSoftware Sandra 2016.

This new version has been built with the updated tools in order to extract the maximum performance out of the latest hardware and also contains minor additions and fixes:

  • Spanish Help file translation courtesy of Antonio Pérez Madrazo.
  • CUDA 8.0 (Pascal) preliminary device support.
  • Compiler optimisations including SIMD improvements.

As always the update is free so either visit the Sandra Lite Downloads or the Sandra Commercial Downloads.

SP1a for SiSoftware Sandra 2016 Released!

We are happy to release SP1a (Service Pack 1a) to SiSoftware Sandra 2016.

This is a minor update that improves stability and adds a few optimisations that were developed after further testing of SP1 release.

The SP1a update also enables the Marbling: Perlin Noise 2D (3 octaves) Filter for both GPGPUs (CUDA, OpenCL) and CPU.

Sandra 2016 SP1 New Image Filters

SP1 for SiSoftware Sandra 2016 Released!

We are happy to release SP1 (Service Pack 1) to SiSoftware Sandra 2016.

This release introduces initial AVX512 benchmarks with all SIMD benchmarks due to be ported once compiler support becomes available:

CPU Multi-Media (Fractal Generation): single, double floating-point; integer, long benchmarks ported to AVX512. [See article Future performance with AVX512]

CPU Crypto (SHA Hashing): SHA2-256 and SHA2-512 multi-buffer ported to AVX512.

– Hardware support for future arch (AMD and Intel).

.Net Multi-Media native vector support is vector width independent and thus will support AVX512 with a future CLR release automatically

GPU Image Processing: New, more complex filters:

  • Oil Painting: Quantise (9×9) Filter: CUDA, OpenCL
  • Diffusion: Randomise (256) Filter: CUDA, OpenCL
  • Marbling: Perlin Noise 2D (3 octaves) Filter: CUDA, OpenCL

CPU Image Processing: New, more complex filters

  • Oil Painting: Quantise (9×9) Filter: AVX2/FMA, AVX, SSE2
  • Diffusion: Randomise (256) Filter: AVX2/FMA, AVX, SSE2
  • Marbling: Perlin Noise 2D (3 octaves) Filter: AVX2/FMA, AVX, SSE2

More benchmarks will be ported to AVX512 subject to compiler support; currently Microsoft’s VC++ does not support AVX512 intrinsics and in the interest of fairness we do not use specialised compilers.

Please see our article – Future performance with AVX512 – for a primer on AVX512 and projected performance improvements due to AVX512 and 512-bit transfers.

Future performance with AVX512 in Sandra 2016 SP1

Intel Skylake

What is AVX512?

AVX512 is a new SIMD instruction set operating on 512-bit registers that is the natural progression from FMA/AVX (256-bit registers). It was first introduced with Intel’s “Phi” co-processor (Intel’s answer to GPGPUs) and now a version of it is making its way to CPUs themselves.

Why is AVX512 important?

CPU performance has only marginally increased (5-10%) from one generation to the next, with power efficiency being the primary goal; with limited options (cannot increase clock speeds, must reduce power, hard to improve execution efficiency, etc.) exploiting data-level parallelism through SIMD is a relatively simple way to improve performance.

SIMD instructions have long been used to increase performance (since the introduction of MMX with the Pentium in 1997!) and their register width has been increasing steadily from 64-bit (MMX) to 128-bit (SSEx) to 256-bit (AVX/FMA) and now to 512-bit (AVX512) – thus processing more and more data simultaneously.
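
To make the width progression concrete, here is a minimal sketch that counts how many elements fit in one register at each width. The function and table names are ours, chosen for illustration only; the widths are those quoted in the paragraph.

```python
# Register widths (in bits) for each SIMD generation named in the text.
REGISTER_BITS = {"MMX": 64, "SSEx": 128, "AVX/FMA": 256, "AVX512": 512}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of the given size that fit in one register."""
    return register_bits // element_bits


def lane_table(element_bits: int = 32) -> dict:
    """Lanes per register for every generation, for one element size."""
    return {name: lanes(bits, element_bits) for name, bits in REGISTER_BITS.items()}


if __name__ == "__main__":
    for name, count in lane_table(32).items():
        print(f"{name:8s} {count:2d} x 32-bit elements per register")
```

Each doubling of the register width doubles the number of elements a single instruction can process, which is where the projected gains later in the article come from.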

Unfortunately, software has to be specifically modified to support AVX512 (or at the very least re-compiled) but developers are generally used to this these days after the SSE to AVX transition.

SiSoftware has thus been updating its benchmarks to AVX512, though some need compiler support and will need to wait until Microsoft updates its Visual C++ compiler at some point.

What CPUs will support AVX512?

It was rumoured that the newly released “Skylake” Core consumer CPUs were going to support AVX512 – but they do not. The future “Skylake-E” Xeon “Purley” server/workstation CPUs are supposed to support it.

AVX512 is actually a collection of several subsets: “Skylake-E” supports F (foundation), CD (conflict detection), BW (byte & word), DQ (double-word and quad-word) and VL (vector length extension), while the future “Cannonlake-E” is expected to add IFMA (integer FMA), VBMI (vector byte manipulation instructions) and perhaps others.

It is disappointing that AVX512 is not enabled on consumer (Core) CPUs, though it will eventually appear in future iterations; until then gamers/enthusiasts will need to buy into the “extreme/Skylake-E” platform and business users will need “Xeon/Skylake-E” workstations.

What kind of performance improvement can we expect with AVX512?

The transition from 128-bit SSE to 256-bit AVX/FMA/AVX2 has – eventually – resulted in a 70-120% improvement, with compute-intensive code that seldom accesses memory yielding the best gains. Note that AVX code executes at a lower clock than “normal”/SSE code.

AVX512 doubles not only the register width (512-bit) but also the number of registers (32 vs. 16), thus we can hold 4x (four times) more data in registers, which may reduce cache/memory accesses. But AVX512 code will again run at a lower clock than AVX/FMA code.
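
The 4x figure follows directly from the register counts and widths quoted above. A small sketch of the arithmetic (the helper name is ours, for illustration):

```python
def register_file_bytes(num_registers: int, register_bits: int) -> int:
    """Total bytes held by the architectural SIMD registers."""
    return num_registers * register_bits // 8


# 16 x 256-bit registers for AVX/FMA versus 32 x 512-bit registers for AVX512.
avx2_bytes = register_file_bytes(16, 256)
avx512_bytes = register_file_bytes(32, 512)
capacity_ratio = avx512_bytes // avx2_bytes

if __name__ == "__main__":
    print(f"AVX/FMA: {avx2_bytes} bytes, AVX512: {avx512_bytes} bytes, "
          f"ratio {capacity_ratio}x")
```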

In the next examples we project future gains through AVX512 for common algorithms as implemented in Sandra’s benchmarks and what they might mean to customers.
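
As a rough illustration of how such a projection can be made, the sketch below assumes that the AVX2-over-SSE gain repeats for AVX512, capped at 2x (the width doubling). This is only our reading of the published figures, not a documented method, and the function name and the cap parameter are hypothetical.

```python
def project_avx512(sse_score: float, avx2_score: float, cap: float = 2.0) -> float:
    """Project an AVX512 score from measured SSE and AVX2/FMA scores.

    Assumption: the gain seen going from SSE to AVX2 repeats when going
    from AVX2 to AVX512, but never exceeds `cap` (2x for doubled width).
    """
    gain = min(avx2_score / sse_score, cap)
    return avx2_score * gain


if __name__ == "__main__":
    # i7-6700K figures from the Multi-Media results in this article (Mpix/s).
    print(round(project_avx512(292.0, 516.2), 1))   # Integer SIMD
    print(round(project_avx512(216.0, 458.4), 1))   # Single Float SIMD
```

With the i7-6700K Integer SIMD and Single Float SIMD scores as inputs, this rule reproduces the projected values listed in the results.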

Can I test AVX512 performance with Sandra?

Yes, with the release of Sandra 2016 SP1 – you can now test AVX512 performance – naturally you need the required CPU. All the low-level benchmarks (below) have been ported to AVX512:

  • Multi-Media (Fractal Generation) Benchmark: AVX512 F, BW, DQ supported now
  • Cryptography (SHA Hashing) Benchmark: AVX512 BW, DQ supported now
  • Memory & Cache Bandwidth Benchmarks: AVX512 F, DQ supported now

The following benchmarks require future compiler support (Microsoft VC++) and have not been released at this time:

  • Financial Analysis (Black-Scholes, Binomial, Monte-Carlo): AVX512 F support coming soon
  • Scientific Analysis (GEMM, FFT, N-Body): AVX512 F support coming soon
  • Image Processing (Blur/Sharpen/Motion-Blur, Sobel, Median): AVX512 BW support coming soon
  • .Net Vectorised (Fractal Generation): AVX512 support dependent on RyuJIT numerics libraries that need to be updated by Microsoft. No changes required.
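
The subset requirements listed for the released benchmarks can be checked against a CPU's supported AVX512 subsets with simple set logic. This is a hedged sketch: the data structure and function are ours, and the “Skylake-E” subset list is the one given earlier in this article.

```python
# AVX512 subsets expected on "Skylake-E", as listed in this article.
SKYLAKE_E = frozenset({"F", "CD", "BW", "DQ", "VL"})

# Subsets each released benchmark uses, taken from the list above.
BENCHMARK_NEEDS = {
    "Multi-Media (Fractal Generation)": {"F", "BW", "DQ"},
    "Cryptography (SHA Hashing)": {"BW", "DQ"},
    "Memory & Cache Bandwidth": {"F", "DQ"},
}


def runnable_benchmarks(cpu_subsets) -> list:
    """Names of benchmarks whose required subsets the CPU fully supports."""
    supported = set(cpu_subsets)
    return [name for name, needs in BENCHMARK_NEEDS.items() if needs <= supported]


if __name__ == "__main__":
    print(runnable_benchmarks(SKYLAKE_E))
    # A hypothetical CPU with only F and CD would run none of the three.
    print(runnable_benchmarks({"F", "CD"}))
```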

Hardware Stats

We are comparing two publicly released CPUs with their projected next-gen counterparts supporting AVX512.

Processor | Intel i7-6700K (Skylake) | Intel i7-77XX? (next-gen) | Intel i7-5820K (Haswell-E) | Intel i7-78XX? (Skylake-E)
Cores/Threads | 4C / 8T | 4C / 8T | 6C / 12T | 6C / 12T
Clock Speeds (MHz) Min-Max-Turbo | 800-4000-4200 | assumed same | 1200-3300-3600 | assumed same
Caches L1/L2/L3 | 4x 32kB, 4x 256kB, 8MB | assumed same | 6x 32kB, 6x 256kB, 15MB | assumed same
Power TDP Rating (W) | 91W | assumed same | 140W | assumed same
Instruction Set Support | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc.

We do not expect major changes in future AVX512 supporting arch, especially with Skylake-E as Core Skylake is already out and the core specifications are known.

Multi-media (Fractal Generation) Benchmark

Benchmark | Future Core i7 (4C/8T AVX512) Projected | Core i7-6700K (4C/8T AVX2/FMA) | Core i7-6700K (4C/8T SSEx) | Future Core i7-E (6C/12T AVX512) Projected | Core i7-5820K (6C/12T AVX2/FMA) | Core i7-5820K (6C/12T SSEx)
AVX512 Multi-Media
Integer SIMD (Mpix/s) | 912.5 [+76% over AVX] | 516.2 [+76% over SSE] | 292 | 1020.7 [+76% over AVX] | 577.4 [+76% over SSE] | 327
We see around 76% improvement from AVX2 vs. SSE, thus we assume we’ll see something similar moving to AVX512 (~80%).
Long SIMD (Mpix/s) | 315.3 [+66% over AVX] | 190.1 [+66% over SSE] | 114.6 | 284.3 [+66% over AVX] | 171.4 [+66% over SSE] | 87.6
We see around 66% improvement from AVX2 vs. SSE, but due to the new instructions we may see better AVX512 gains.
Single Float SIMD (Mpix/s) | 916.8 [+2x over AVX] | 458.4 [+2.12x over SSE] | 216 | 1079 [+2x over AVX] | 539.5 [+2.12x over SSE] | 234.8
We saw over 2x improvement from AVX/FMA over SSE so while we may not see such a large improvement with AVX512, we may still get 100%.
Double Float SIMD (Mpix/s) | 545.8 [+2x over AVX] | 272.9 [+2.35x over SSE] | 116.1 | 622.4 [+2x over AVX] | 311.2 [+2.35x over SSE] | 126
We see an even better improvement from AVX over SSE here (2.35x) so hopefully we’ll get 2x moving to AVX512.
Quad Float SIMD (Mpix/s) | 20.3 [+94% over AVX] | 10.5 [+94% over SSE] | 5.4 | 622.4 [+94% over AVX] | 311.2 [+94% over SSE] | 126
Emulating fp128 is hard work but even then AVX is 94% faster than SSE and thus we’d expect AVX512 to be almost 2x faster still.
Despite some being disappointed by arch-to-arch performance improvement, the Skylake 4C (i7-6700K) already goes toe-to-toe with Haswell-E 6C (i7-5820K), but with AVX512 support Skylake-E 6C/8C is projected to comprehensively outperform it.

AVX512 will also allow Skylake-E to narrow the gap between it and current GPGPUs with multi-CPU Xeon systems able to “do without” GPGPUs – well except perhaps a “Phi” or two?

AVX512 Crypto
Hashing SHA2-256 (GB/s) | 11.80 [+2x over AVX] | 5.90 [+2.36x over SSE] | 2.50 | 13.60 [+2x over AVX] | 6.80 [+2.26x over SSE] | 3
We see a large 2.26-2.36x improvement of AVX2 vs. SSE, thus we still expect about a 2x increase with AVX512.
Hashing SHA1 (GB/s) | 23 [+2x over AVX] | 11.5 [+2.16x over SSE] | 5.33 | 27.70 [+2x over AVX] | 13.85 [+2.04x over SSE] | 6.79
Even with SHA1 we see a good 2.04-2.16x improvement of AVX2 vs. SSE, thus AVX512 should again double performance though we may be limited by memory bandwidth.
Hashing SHA2-512 (GB/s) | 8.74 [+2x over AVX] | 4.37 [+2.33x over SSE] | 1.87 | 9.60 [+2x over AVX] | 4.80 [+2.20x over SSE] | 2.18
Switching to 64-bit integer SHA512 we see the best improvement yet of AVX2 vs. SSE (2.2-2.33x) with AVX512 likely to improve by 2x yet again.
With hashing we see even better results than with fractal generation, with AVX2 improving by over 2x over SSE. We thus expect AVX512 to improve by around 100% again, although it is likely we will hit memory bandwidth limitations.
AVX512 Memory Bandwidth
Memory Bandwidth (GB/s) | ~31.30 | 31.30 [0%] | 31.30 | ~42.00 [0%] | 42.30 [-1%] | 42.6
Even with DDR4 the memory sub-system hasn’t changed much and despite 512-bit transfers with AVX512 there is really no performance delta in streaming data to/from memory.
L3 Bandwidth (GB/s) | ~267.97 [+10%] | 243.30 [+10%] | 220.90 | ~202.20 [+3%] | 195.90 [+3%] | 189.8
As we move up the cache hierarchy, the L3 already shows a 10% bandwidth improvement using AVX2/FMA vs. SSE, and AVX512 should improve performance further.
L2 Bandwidth (GB/s) | ~392.50 [+21%] | 323.30 [+21%] | 266.30 | ~536.81 [+20%] | 444.10 [+20%] | 367.4
As we expected, L2 bandwidth improves ~20% with AVX2/FMA and is likely to improve further with AVX512.
L1D Bandwidth (GB/s) | ~1,364.25 [+50%] | 909.50 [+2.11x] | 429.90 | ~1,536.00 [+50%] | 1,024.00 [+2x] | 518
Skylake has widened the data access ports (just like Haswell before it), thus 512-bit AVX512 transfers show the best improvement yet, 40-50%!
AVX512 does help take advantage of the widened data ports in Skylake and future arch, with L1D cache showing the best bandwidth improvement just like Haswell before it (with AVX2).

Memory bandwidth is still limited by DDR4 speeds, although faster modules are coming out all the time and, this time, their clocks are JEDEC ratified.
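
To see why wider SIMD transfers cannot raise streaming bandwidth, it helps to compute the theoretical peak of the memory interface itself. The sketch below uses the standard formula (channels x 8-byte bus x transfer rate); the DDR4-2133 speed and the channel counts are assumptions for illustration only, as the memory configuration of the test systems is not stated here.

```python
def peak_bandwidth_gbs(channels: int, megatransfers_per_s: float,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return channels * bus_bytes * megatransfers_per_s * 1e6 / 1e9


def efficiency(measured_gbs: float, peak_gbs: float) -> float:
    """Fraction of the theoretical peak actually achieved."""
    return measured_gbs / peak_gbs


if __name__ == "__main__":
    # ASSUMPTION: DDR4-2133, dual-channel and quad-channel configurations.
    dual = peak_bandwidth_gbs(2, 2133)
    quad = peak_bandwidth_gbs(4, 2133)
    print(f"dual-channel peak: {dual:.1f} GB/s")
    print(f"quad-channel peak: {quad:.1f} GB/s")
    # Measured streaming figure for the i7-6700K from the results above.
    print(f"dual-channel efficiency at 31.3 GB/s: {efficiency(31.3, dual):.0%}")
```

Whatever the instruction width, measured bandwidth cannot exceed this ceiling, so only faster modules or more channels move it.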

We will update the article with future (projected) results once more benchmarks are converted to AVX512 – once compiler support is released – but even so far we see excellent performance improvement.

Until then, those of you with access to AVX512 supporting hardware can download Sandra 2016 SP1 and test away!