Intel Core i7 8700K CoffeeLake Review & Benchmarks – CPU 6-core/12-thread Performance

What is “CoffeeLake” (CFL)?

The 8th generation Intel Core architecture is code-named “CoffeeLake” (CFL): unlike previous architectures, it is only a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). The server/workstation (SKL-X/KBL-X) CPU core gained new instruction set support (AVX512) as well as other improvements – these have not made the transition to the mainstream parts yet.

Possibly due to limited competition (before the AMD Ryzen launch), process issues (still at 14nm) and the disclosure of a whole host of hardware vulnerabilities (Spectre, Meltdown, etc.) which required microcode (firmware) updates, performance improvements have not been forthcoming. This is pretty much unprecedented: while some Core updates were only evolutionary, we have not had complete stagnation before. In addition, the built-in GPU core has also remained pretty much stagnant – we will investigate this in a subsequent article.

However, CFL does bring one major change: increased core counts on both desktop and mobile. On desktop we go from 4 to 6 cores (+50%) while on mobile (ULV) we go from 2 to 4 (+100%) within the same TDP envelope!

While this article is a bit late in the day considering the 8700K launched last year, we are preparing to review the brand-new CoffeeLake-R (Refresh) Core i9-9900K and it seems a good time to see what has changed performance-wise for the previous top-of-the-range CPU.

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i7 (8700K) with previous generation (6700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Intel i7-6700K SkyLake | Comments
Cores (CU) / Threads (SP) | 6C / 12T | 8C / 16T | 10C / 20T | 4C / 8T | We have 50% more cores compared to SKL/KBL but still not as many as Ryzen/2 with 8 cores.
Speed (Min / Max / Turbo) | 0.8-3.7-4.7GHz (8x-37x-47x) | 2.2-3.7-4.2GHz (22x-37x-42x) | 1.2-3.3-4.3GHz (12x-33x-43x) | 0.8-4.0-4.2GHz (8x-40x-42x) | Single-core Turbo has increased close to 5GHz (reserved for the Special Edition 8086K), way above SKL/KBL and Ryzen.
Power (TDP) | 95W (131) | 105W (135) | 140W (308) | 91W (100) | TDP has only increased by 4% and is still below Ryzen though Turbo is comparable.
L1D / L1I Caches | 6x 32kB 8-way / 6x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | No change in L1 caches. Just more of them.
L2 Caches | 6x 256kB 8-way | 8x 512kB 8-way | 10x 1MB 8-way | 4x 256kB 8-way | No change in L2 caches. Just more of them.
L3 Caches | 12MB 16-way | 2x 8MB 16-way | 13.75MB 11-way | 8MB 16-way | L3 has also increased by 50% in line with cores, but is still below Ryzen’s 16MB.
Microcode/Firmware | MU069E0A-96 | MU8F0802-04 | MU065504-49 | MU065E03-C2 |

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
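The bracketed percentages in the result tables appear to be the first CPU's score relative to the Ryzen 2 score in the next column. The article does not state the formula, so the helper below (`relative_delta_percent`, a name chosen here for illustration) is only a sketch of that assumed calculation, checked against two rows of the native table.

```python
def relative_delta_percent(score: float, baseline: float) -> int:
    """Return how far `score` is above (+) or below (-) `baseline`, in whole percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round((score / baseline - 1.0) * 100.0)


# Native Dhrystone Integer (GIPS): CFL 291 vs. Ryzen 2 334
print(relative_delta_percent(291, 334))

# Native Integer (Int32) Multi-Media (Mpix/s): CFL 741 vs. Ryzen 2 574
print(relative_delta_percent(741, 574))
```

Because higher values are better in every test here, a negative result means the first CPU is slower than the baseline. Small mismatches against the published brackets are possible where the table scores were themselves rounded.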

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Intel i7-6700K SkyLake | Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 291 [-13%] | 334 | 485 | 190 | In the old Dhrystone integer workload, CFL is still 13% slower than Ryzen 2 despite the huge lead over SKL.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 296 [-12%] | 335 | 485 | 192 | With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 170 [-14%] | 198 | 262 | 105 | Switching to floating-point, CFL is still 14% slower in the old Whetstone, also a micro-benchmark.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 143 [-15%] | 169 | 223 | 89 | With FP64 nothing much changes.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, CFL is still 12-15% slower than Ryzen 2 with its 2 more cores (8 vs. 6), but much faster than the old SKL with 4 cores. We begin to see now why Intel is adding more cores in CFL-R.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 741 [+29%] | 574 | 1590 (AVX512) | 474 | In this vectorised AVX2 integer test we see CFL beating Ryzen by ~30% despite fewer cores.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 305 [+63%] | 187 | 581 (AVX512) | 194 | With a 64-bit AVX2 integer vectorised workload, CFL is now 63% faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 4.9 [-16%] | 5.8 | 7.6 | 3 | This is a tough test using Long integers to emulate Int128 without SIMD: Ryzen 2 thus wins this one with CFL slower by 16%.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 678 [+14%] | 596 | 1760 (AVX512) | 446 | In this floating-point AVX/FMA vectorised test, CFL is 14% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 402 [+20%] | 335 | 533 (AVX512) | 268 | Switching to FP64 SIMD code, CFL is again faster, now by 20%.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 16.7 [+7%] | 15.6 | 40.3 (AVX512) | 11 | In this heavy algorithm using FP64 to mantissa-extend to FP128 but not vectorised, CFL is just 7% faster but does win.
In vectorised SIMD code we see the power of Intel’s SIMD units that can execute 256-bit instructions in one go; CFL soundly beats Ryzen 2 despite fewer cores (by 7-63%). SKL-X shows that AVX512 brings further gains and it is a pity CFL still does not support it.
BenchCrypt Crypto AES-256 (GB/s) | 17.8 [+11%] | 16.1 | 23 | 15 | With AES HWA support all CPUs are memory bandwidth bound; unfortunately our Ryzen 2 memory runs at 2667Mt/s vs. CFL/SKL-X at 3200Mt/s, which means CFL is 11% faster.
BenchCrypt Crypto AES-128 (GB/s) | 17.8 [+11%] | 16.1 | 23 | 15 | What we saw with AES-256 just repeats with AES-128.
BenchCrypt Crypto SHA2-256 (GB/s) | 9 [-51%] | 18.6 | 26 (AVX512) | 5.9 | With SHA HWA Ryzen 2 powers through the hashing tests leaving Intel in the dust; CFL is thus about 50% slower.
BenchCrypt Crypto SHA1 (GB/s) | 17.3 [-9%] | 19.3 | 38 (AVX512) | 11.2 | Ryzen also accelerates the soon-to-be-defunct SHA1 but the algorithm is less compute heavy, thus CFL is only 9% slower.
BenchCrypt Crypto SHA2-512 (GB/s) | 6.65 [+77%] | 3.77 | 21 (AVX512) | 4.4 | SHA2-512 is not accelerated by SHA HWA, allowing CFL to use its SIMD units and be 77% faster.
AES HWA is memory bound and here CFL comfortably works with 3200Mt/s memory, thus it is faster than Ryzen 2 with 2667Mt/s memory (our sample); likely both would score similarly at 3200Mt/s. Ryzen 2’s SHA HWA allows it to easily beat all other CPUs but only in SHA1/SHA256 – in the others the CFL SIMD units win the day.
BenchFinance Black-Scholes float/FP32 (MOPT/s) | 207 [-19%] | 257 | 309 | 128 | In this non-vectorised test CFL cannot match Ryzen 2 and is ~20% slower.
BenchFinance Black-Scholes double/FP64 (MOPT/s) | 180 [-18%] | 219 | 277 | 113 | Switching to FP64 code, nothing much changes; Ryzen 2 is still faster.
BenchFinance Binomial float/FP32 (kOPT/s) | 47 [-56%] | 107 | 70.5 | 29.3 | Binomial uses thread shared data and thus stresses the cache & memory system; Ryzen 2 does very well here with CFL almost 60% slower.
BenchFinance Binomial double/FP64 (kOPT/s) | 44.2 [-27%] | 60.6 | 68 | 27.3 | With FP64 code Ryzen 2’s lead diminishes; CFL is “only” 27% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 41.6 [-23%] | 54.2 | 63 | 25.7 | Monte-Carlo also uses thread shared data but read-only, thus reducing modify pressure on the caches; Ryzen 2 also wins this one, CFL is 23% slower.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 32.9 [-20%] | 41 | 50.5 | 20.3 | Switching to FP64 nothing much changes, CFL is still 20% slower.
Without SIMD support, CFL loses to Ryzen 2 as we saw with Dhrystone/Whetstone, by roughly 20-55%. As we noted before, Intel will still need to add more cores in order to beat Ryzen 2. Still, it is a big improvement over the old SKL/KBL as expected.
BenchScience SGEMM (GFLOPS) float/FP32 | 385 [+28%] | 300 | 413 (AVX512) | 268 | In this tough vectorised AVX2/FMA algorithm CFL is ~30% faster.
BenchScience DGEMM (GFLOPS) double/FP64 | 135 [+13%] | 119 | 212 (AVX512) | 130 | With FP64 vectorised code, CFL’s lead reduces to 13% over Ryzen 2.
BenchScience SFFT (GFLOPS) float/FP32 | 24 [+167%] | 9 | 28.6 (AVX512) | 16.1 | FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; here CFL is over 2.5x faster.
BenchScience DFFT (GFLOPS) double/FP64 | 11.9 [+51%] | 7.92 | 14.6 (AVX512) | 7.2 | With FP64 code, CFL’s lead reduces to ~50%.
BenchScience SNBODY (GFLOPS) float/FP32 | 411 [+47%] | 280 | 638 (AVX512) | 271 | N-Body simulation is vectorised but makes many memory accesses to shared data; CFL still remains ~50% faster.
BenchScience DNBODY (GFLOPS) double/FP64 | 127 [+13%] | 113 | 195 (AVX512) | 79 | With FP64 code CFL’s lead reduces to 13% over Ryzen 2.
With highly vectorised SIMD code CFL performs well, soundly beating Ryzen 2 by 13-167% as well as significantly improving over the older SKL/KBL. As long as SIMD code is used Intel has little to fear.
CPU Image Processing Blur (3×3) Filter (MPix/s) | 1700 [+39%] | 1220 | 4540 (AVX512) | 1090 | In this vectorised integer AVX2 workload CFL enjoys a ~40% lead over Ryzen 2.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 675 [+25%] | 542 | 1790 (AVX512) | 433 | Same algorithm but more shared data reduces the lead to 25%, still significant.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 362 [+19%] | 303 | 940 (AVX512) | 233 | Again the same algorithm but even more shared data reduces the lead to ~20%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 589 [+30%] | 453 | 1520 (AVX512) | 381 | Different algorithm but still an AVX2 vectorised workload, which means CFL is 30% faster than Ryzen 2.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 57.8 [-17%] | 69.7 | 223 (AVX512) | 37.6 | Still AVX2 vectorised code but CFL stumbles a bit here – it’s 17% slower than Ryzen 2.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 31.8 [+29%] | 24.6 | 70.8 (AVX512) | 20 | Again we see CFL ~30% faster.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 3480 [+140%] | 1450 | 3570 (AVX512) | 2300 | CFL (like all Intel CPUs) does very well here – it’s a huge 140% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 448 [+84%] | 243 | 909 (AVX512) | 283 | In this final test, CFL is almost 2x faster than Ryzen 2.

The addition of 2 more cores (not to mention the higher Turbo clock) brings big performance gains over the old SKL/KBL, which is pretty impressive considering TDP has stayed almost the same. With SIMD code (AVX/AVX2/FMA3) CFL has no problem beating Ryzen 2 by a pretty large margin (up to 2x faster) – but any algorithm that is not vectorised allows Ryzen 2 to win, though not by much (12-20%).
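The Quad-Int (Int128) test is described as using Long integers to emulate Int128 without SIMD. As an illustration only (this is not the benchmark's code), a 128-bit addition built from two 64-bit limbs could look like the sketch below; `to_limbs`, `from_limbs` and `add128` are names made up for this example. The carry from the low limb feeds the high limb, a serial dependency that is one reason such code tends to favour more scalar cores over wide SIMD units.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1  # keeps a value within one 64-bit limb

Limbs = Tuple[int, int]  # (high, low) 64-bit halves of a 128-bit integer


def to_limbs(value: int) -> Limbs:
    """Split a non-negative integer below 2**128 into (high, low) limbs."""
    return (value >> 64) & MASK64, value & MASK64


def from_limbs(limbs: Limbs) -> int:
    """Recombine (high, low) limbs into a single Python integer."""
    return (limbs[0] << 64) | limbs[1]


def add128(a: Limbs, b: Limbs) -> Limbs:
    """Add two 128-bit values limb by limb, wrapping around at 2**128."""
    low_sum = a[1] + b[1]
    carry = low_sum >> 64  # 1 if the low limbs overflowed, else 0
    high = (a[0] + b[0] + carry) & MASK64
    return high, low_sum & MASK64


x = to_limbs(2**64 - 1)
y = to_limbs(1)
# The low limb overflows, so the carry must move into the high limb.
print(from_limbs(add128(x, y)) == 2**64)
```

Python integers are arbitrary precision, so the masking is there purely to mimic fixed 64-bit registers.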

Streaming tests likely benefit from the higher supported memory frequencies; while in theory these could be used on the older SKL/KBL (memory overclock), they were neither officially supported nor stable in all cases. We shall test memory performance in a forthcoming article.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

VM Benchmarks | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Intel i7-6700K SkyLake | Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) | 41 [-33%] | 61 | 52 | 28 | .Net CLR integer performance starts off well ahead of the old SKL but still 33% slower than Ryzen 2.
BenchDotNetAA .Net Dhrystone Long (GIPS) | 41 [-32%] | 60 | 54 | 27 | With 64-bit integers nothing much changes.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) | 78 [-24%] | 102 | 107 | 49 | Floating-point CLR performance does not change much; CFL is still 24% slower than Ryzen 2 despite a big gain over the old SKL.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) | 95 [-19%] | 117 | 137 | 62 | FP64 performance is similar to FP32.
Ryzen 2 performs exceedingly well in .Net workloads – soundly beating CFL, which is 19-33% slower. More cores will be needed for parity with Ryzen 2, but at least CFL improves a lot over SKL/KBL.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) | 93.5 [-16%] | 111 | 144 | 57 | Just as we saw with Dhrystone, this integer workload sees CFL improve greatly over SKL but Ryzen 2 is still faster.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) | 93.1 [-15%] | 109 | 143 | 57 | With a 64-bit integer workload nothing much changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) | 361 [-8%] | 392 | 585 | 228 | Here we make use of RyuJIT’s support for SIMD vectors, thus running AVX/FMA code, but CFL is still 8% slower than Ryzen 2.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) | 198 [-9%] | 217 | 314 | 128 | Switching to FP64 SIMD vector code – still running AVX/FMA – CFL is still slower.
We see a similar improvement for CFL but again not enough to beat Ryzen 2; even using RyuJIT’s vectorised support CFL cannot beat it, only reduce the loss to 8-9%.
Java Arithmetic Java Dhrystone Integer (GIPS) | 557 [-3%] | 573 | 877 | 352 | Java JVM performance is almost neck-and-neck with Ryzen 2 despite 2 fewer cores.
Java Arithmetic Java Dhrystone Long (GIPS) | 488 [-12%] | 553 | 772 | 308 | With 64-bit integers, CFL does fall behind Ryzen 2 by 12%.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) | 101 [-23%] | 131 | 156 | 62 | Floating-point JVM performance is worse though; CFL is now 23% slower.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) | 103 [-26%] | 139 | 160 | 64 | With 64-bit precision nothing much changes.
While CFL improves markedly over the old SKL and almost ties with Ryzen 2 in integer workloads, it does fall behind in floating-point by a good amount.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) | 100 [-12%] | 113 | 140 | 63 | Without SIMD acceleration we see the usual delta (around 12%) with an integer workload.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) | 89 [-12%] | 101 | 152 | 62 | Nothing changes when switching to 64-bit integer workloads.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) | 64 [-34%] | 97 | 98 | 41 | With floating-point code that is not SIMD accelerated we see a bigger delta of about 30% vs. Ryzen 2.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) | 64 [-29%] | 90 | 99 | 41 | With 64-bit floating-point precision nothing much changes.
With compute heavy vectorised code that is not SIMD accelerated, CFL cannot keep up with Ryzen 2 and the deficit increases to about 30%. Intel really needs to get Oracle to add SIMD extensions similar to .Net’s new CLR.

Ryzen dominates .Net and Java benchmarks – Intel will need more cores in order to compete; while the additional 2 cores helped a lot, it is not enough!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Due to core-count stagnation (4 on desktop, 2 on mobile), Intel really had no choice but to increase core counts in light of new competition from AMD’s Ryzen/2 with twice as many cores (8) and SMT (Hyper-Threading) as well! While KBL had increased base/Turbo clocks appreciably over SKL within the same power envelope, CFL had to add more cores in order to compete.

With 50% more cores (6) CFL performs much better than the older SKL/KBL as expected but that is not enough in non-vectorised loads; Ryzen 2 with 2 more cores is still faster, not by much (12-20%) but still faster. Once vectorised code is used, the power of Intel’s SIMD units shows – with CFL soundly beating Ryzen despite not supporting AVX512 nor SHA HWA which is a pity. AMD still has work to do with Ryzen if it wants to be competitive in vectorised workloads (integer or floating-point).
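To make the cores-versus-throughput argument concrete, the multi-threaded Native Dhrystone Integer scores from the table above can be divided by each CPU's core count. This is only a back-of-the-envelope sketch (the `per_core` helper is made up for this example) and it ignores clock speed, SMT and Turbo behaviour, so it should not be read as a single-thread benchmark.

```python
# (score in GIPS, physical cores) taken from the Native Dhrystone Integer row
DHRYSTONE_INTEGER = {
    "i7-8700K (CFL)": (291, 6),
    "Ryzen 2 2700X": (334, 8),
    "i7-6700K (SKL)": (190, 4),
}


def per_core(score: float, cores: int) -> float:
    """Average multi-threaded score contributed by each physical core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return score / cores


for name, (score, cores) in DHRYSTONE_INTEGER.items():
    print(f"{name}: {per_core(score, cores):.1f} GIPS per core")
```

On these figures each CFL core contributes more than each Ryzen 2 core, so the overall deficit in non-vectorised code comes from having two fewer cores, which is consistent with the case for adding cores in CFL-R.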

We now see why CFL-R (CoffeeLake Refresh) will add even more cores (8C/16T with the 9900K, which we review in a subsequent article) – it is the only way to beat Ryzen 2 in all workloads. In effect AMD has reached performance parity with Intel in all but SIMD workloads – a great achievement!

Unfortunately (unlike AMD’s AM4 Ryzen) CFL does require new chipsets/boards (series 300), which makes it an expensive upgrade for SKL/KBL owners; otherwise it would have been a no-brainer upgrade for those needing more compute power. While the new platform does bring some improvements (USB 3.1 Gen 2 aka 10Gb/s, more PCIe lanes, integrated 802.11ac WiFi – at least on mobile) it offers nothing over the competition.

Roll on CoffeeLake Refresh and the new CPUs: they are sorely needed…

Intel Core i9 (SKL-X) Review & Benchmarks – CPU 10-core/20-thread AVX512 Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of the desktop/mobile Skylake CPU – the 6th gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU, but what it does contain is more cores, more PCIe lanes and more memory channels:

  • Server 2S, 4S and 8S (sockets)
  • Workstation 1S and 2S
  • Up to 28 cores and 56 threads per CPU
  • Up to 48 PCIe 3.0 lanes
  • Up to 46-bit physical address space and 48-bit virtual address space
  • 512-bit SIMD aka AVX512F, AVX512BW (byte and word), AVX512DQ (doubleword and quadword)
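To put the address-space figures in the list above into more familiar units, the small sketch below converts an address width in bits into TiB of addressable memory. The function name is made up for this example, and the result is the architectural limit, not the amount of RAM any given board can actually hold.

```python
def address_space_tib(address_bits: int) -> int:
    """Bytes addressable with `address_bits` bits, expressed in TiB (2**40 bytes)."""
    if address_bits < 40:
        raise ValueError("address width below 40 bits is less than 1 TiB")
    return (1 << address_bits) // (1 << 40)


print("46-bit physical:", address_space_tib(46), "TiB")
print("48-bit virtual:", address_space_tib(48), "TiB")
```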

While it may seem an “old core”, the 7th gen Kabylake core is not much more than a stepping update, with even the future 8th gen Coffeelake rumoured to use the very same core again. But what SKL-X does do is include the much expected 512-bit AVX512 instruction set (ISA), which is not enabled in the current desktop/mobile parts.

On the desktop – Intel is now using the “i9” moniker for its top parts – in a way a much needed change for its top HEDT platform (socket 2011 now socket 2066) to differentiate from its mainstream one.

In this article we test CPU core performance.

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications | Intel i9 7900X (Skylake-X) | AMD Ryzen 1700X | Intel i7 6700K (Skylake) | Intel i7 5820K (Haswell-E) | Comments
Cores (CU) / Threads (SP) | 10C / 20T | 8C / 16T | 4C / 8T | 6C / 12T | SKL-X manages more cores than Ryzen (10 vs 8) which, considering their speed, may just be too tough to beat. HSW-E topped out at 8 cores also.
Speed (Min / Max / Turbo) | 1.2-3.3-4.3GHz (12x-33x-43x) | 2.2-3.4-3.9GHz (22x-34x-39x) | 0.8-4.0-4.2GHz (8x-40x-42x) | 1.2-3.3-4.0GHz (12x-33x-40x) | SKL-X somehow manages a higher single-core turbo than even SKL-A (43x vs. 42x) – but its rated speed is a match for Ryzen and HSW-E.
Power (TDP) | 140W | 95W | 91W | 140W | Ryzen has a comparable TDP to SKL while HSW-E and SKL-X are both almost 50% higher.
L1D / L1I Caches | 10x 32kB 8-way / 10x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | 6x 32kB 8-way / 6x 32kB 2-way | Ryzen’s instruction cache is 2x the data cache, a somewhat strange decision; all caches are 8-way except the HSW-E’s L1I.
L2 Caches | 10x 1MB 16-way | 8x 512kB 8-way | 4x 256kB 8-way | 6x 256kB 8-way | Surprise surprise – the new SKL-X’s L2 is 4-times the size of SKL/HSW-E’s and thus even beats Ryzen’s. Large datasets should have no problem getting cached.
L3 Caches | 13.75MB 11-way | 2x 8MB 16-way | 8MB 16-way | 15MB 20-way | In a somewhat surprising move, the L3 cache has been reduced pretty drastically and is now smaller than both Ryzen’s and even the very old HSW-E’s!

 

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks | i9-7900X (Skylake-X) | Ryzen 1700X | i7-6700K 4C/8T (Skylake) | i7-5820K (Haswell-E) | Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) | 446 [+54%] AVX2 | 290 AVX2 | 185 AVX2 | 233 AVX2 | Dhrystone does not yet use AVX512 – but no matter, SKL-X beats Ryzen by over 50%!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) | 459 [+57%] AVX2 | 292 AVX2 | 185 AVX2 | 230 AVX2 | With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) | 271 [+46%] AVX/FMA | 185 AVX/FMA | 109 AVX/FMA | 150 AVX/FMA | Whetstone does not yet use AVX512 either – but SKL-X is still approx 50% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) | 223 [+50%] AVX/FMA | 155 AVX/FMA | 89 AVX/FMA | 116 AVX/FMA | With FP64 the winning streak continues.
The Empire strikes back – SKL-X beats Ryzen by a sizeable margin (~50%) across integer and floating-point workloads even on the “legacy” AVX2/FMA instruction sets. It will only get faster once AVX512 is enabled.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) | 1460 [+2.7x] AVX512DQW | 535 AVX2 | 513 AVX2 | 639 AVX2 | For the 1st time we see AVX512 in action and everything is pummeled into dust – almost 3-times faster than Ryzen!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) | 521 [+3.3x] AVX512DQW | 159 AVX2 | 191 AVX2 | 191 AVX2 | With a 64-bit integer vectorised workload SKL-X is over 3-times faster than Ryzen!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) | 5.37 [+48%] | 3.61 | 2.15 | 2.74 | This is a tough test using Long integers to emulate Int128 without SIMD and thus SKL-X returns to “just” 50% faster than Ryzen.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) | 1800 [+3.4x] AVX512F | 530 FMA | 479 FMA | 601 FMA | In this floating-point vectorised test we see again the power of AVX512, with SKL-X again over 3-times faster than Ryzen!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) | 1140 [+3.8x] AVX512F | 300 FMA | 271 FMA | 345 FMA | Switching to FP64 SIMD code SKL-X gets even faster, approaching 4-times.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) | 24 [+84%] AVX512F | 13.7 FMA | 10.7 FMA | 12 FMA | In this heavy algorithm using FP64 to mantissa-extend to FP128 but not vectorised, SKL-X returns to just 85% faster.
Ryzen’s SIMD units were never strong – splitting 256-bit ops into 2 – but with AVX512 SKL-X is unstoppable: integer or floating-point, we see it over 3-times faster, which is a serious improvement in performance. Even against its older HSW-E it is over 2-times faster, a significant upgrade. For heavily vectorised SIMD code – as long as it’s updated to AVX512 – there is no other choice.
BenchCrypt Crypto AES-256 (GB/s) | 32.7 [+2.4x] AES | 13.8 AES | 15 AES | 20 AES | All CPUs support AES HWA – thus it is mainly a matter of memory bandwidth – and with 4 memory channels SKL-X reigns supreme – it’s over 2-times faster.
BenchCrypt Crypto AES-128 (GB/s) | 32 [+2.3x] AES | 13.9 AES | 15 AES | 20.1 AES | What we saw with AES-256 just repeats with AES-128; Ryzen would need more memory channels to even match HSW-E, never mind SKL-X.
BenchCrypt Crypto SHA2-256 (GB/s) | 25 [+46%] AVX512DQW | 17.1 SHA | 5.9 AVX2 | 7.6 AVX2 | Even Ryzen’s support for SHA hardware acceleration is not enough as memory bandwidth lets it down, with SKL-X “only” 50% faster through AVX512.
BenchCrypt Crypto SHA1 (GB/s) | 39.3 [+2.3x] AVX512DQW | 17.3 SHA | 11.3 AVX2 | 15.1 AVX2 | SKL-X only gets faster with the simpler SHA1 and is now over 2-times faster.
BenchCrypt Crypto SHA2-512 (GB/s) | 21.1 [+6.3x] AVX512DQW | 3.34 AVX2 | 4.4 AVX2 | 5.34 AVX2 | SHA2-512 is not accelerated by SHA HWA, thus Ryzen is forced to use SIMD and loses badly.
Memory bandwidth rules here and SKL-X with its 4 channels of ~100GB/s bandwidth reigns supreme (we can only imagine what the 6-channel beast will score) – so Ryzen loses badly. Its ace card – support for SHA HWA – is not enough to “save it” as AVX512 allows SKL-X to power through algorithms like a knife through butter. The 64-bit SHA2-512 test is sobering, with SKL-X no less than 6-times faster than Ryzen.
BenchFinance Black-Scholes float/FP32 (MOPT/s) | 320 [+36%] | 234 | 129 | 157 | In this non-vectorised test SKL-X is only 36% faster than Ryzen. SIMD would greatly help it here.
BenchFinance Black-Scholes double/FP64 (MOPT/s) | 277 [+40%] | 198 | 108 | 131 | Switching to FP64 code nothing much changes, SKL-X is just 40% faster.
BenchFinance Binomial float/FP32 (kOPT/s) | 66.9 [-21%] | 85.1 | 27.2 | 37.8 | Binomial uses thread shared data and thus stresses the cache & memory system; somehow Ryzen manages to win this.
BenchFinance Binomial double/FP64 (kOPT/s) | 65 [+41%] | 45.8 | 25.5 | 33.3 | With FP64 code the situation gets back to “normal” – with SKL-X again 40% faster than Ryzen.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) | 64 [+30%] | 49.2 | 25.9 | 31.6 | Monte-Carlo also uses thread shared data but read-only, thus reducing modify pressure on the caches; SKL-X is just 30% faster here.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) | 51 [+36%] | 37.3 | 19.1 | 21.2 | Switching to FP64, where Ryzen did so well, SKL-X is 36% faster.
Without the help of its SIMD engine, SKL-X is still 30-40% faster than Ryzen but over 2-times faster than HSW-E, showing just how much the core has improved for complex code with lots of shared data (read-only or modifiable). While Ryzen thought it had found its “niche”, it has already been beaten…
BenchScience SGEMM (GFLOPS) float/FP32 | 343 [+5x] FMA | 68.3 FMA | 109 FMA | 185 FMA | GEMM has not yet been updated for AVX512 but SKL-X is an incredible 5x faster!
BenchScience DGEMM (GFLOPS) double/FP64 | 124 [+2x] FMA | 62.7 FMA | 72 FMA | 87.7 FMA | Even without AVX512, with FP64 vectorised code, SKL-X still manages to be 2x faster.
BenchScience SFFT (GFLOPS) float/FP32 | 34 [+3.8x] FMA | 8.9 FMA | 18.9 FMA | 18 FMA | FFT has also not been updated to AVX512 but SKL-X is still almost 4x faster than Ryzen!
BenchScience DFFT (GFLOPS) double/FP64 | 19 [+2.5x] FMA | 7.5 FMA | 9.3 FMA | 10.9 FMA | With FP64 SIMD SKL-X is over 2.5x faster than Ryzen in this tough algorithm with loads of memory accesses.
BenchScience SNBODY (GFLOPS) float/FP32 | 585 [+2.5x] FMA | 234 FMA | 273 FMA | 158 FMA | NBODY is not yet updated to AVX512 but again SKL-X wins.
BenchScience DNBODY (GFLOPS) double/FP64 | 179 [+2x] FMA | 87 FMA | 79 FMA | 40 FMA | With FP64 code SKL-X is still 2-times faster than Ryzen.
With highly vectorised SIMD code, even without the help of AVX512, SKL-X is over 2.5x faster than Ryzen, but more than that – almost 4-times faster than its older HSW-E brother!
CPU Image Processing Blur (3×3) Filter (MPix/s) | 1639 [+2.2x] AVX2 | 750 AVX2 | 655 AVX2 | 760 AVX2 | In this vectorised integer AVX2 workload SKL-X is over 2x faster than Ryzen.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) | 711 [+2.2x] AVX2 | 316 AVX2 | 285 AVX2 | 345 AVX2 | Same algorithm but more shared data does not change anything.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) | 377 [+2.2x] AVX2 | 172 AVX2 | 151 AVX2 | 188 AVX2 | Again the same algorithm, and even more shared data still does not change anything.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) | 609 [+2.1x] AVX2 | 292 AVX2 | 271 AVX2 | 316 AVX2 | Different algorithm but SKL-X is still 2x faster than Ryzen.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) | 79.8 [+36%] AVX2 | 58.5 AVX2 | 35.4 AVX2 | 50.3 AVX2 | Still AVX2 vectorised code but here Ryzen does much better, with SKL-X just 36% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) | 15.7 [+63%] | 9.6 | 6.3 | 7.6 | This test is not vectorised though it uses SIMD instructions, and here SKL-X only manages to be 63% faster.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) | 1000 [+17%] | 852 | 422 | 571 | Again in a non-vectorised test Ryzen just flies but SKL-X manages to be almost 20% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) | 190 [+29%] | 147 | 75 | 101 | In this final non-vectorised test Ryzen really flies but not enough to beat SKL-X, which is 30% faster.
As with other SIMD tests, SKL-X remains just over 2-times faster than Ryzen and about as much faster than HSW-E. But without SIMD its lead drops significantly to just 20-60%, showing just how well Ryzen performs.

When using the new AVX512 instruction set we see incredible performance, with SKL-X about 3x faster than its Ryzen competitor and about 2x faster than the older HSW-E; with the older AVX2/FMA instruction sets supported by all CPUs, it is “only” about 2x faster. With non-vectorised code its lead shortens to about 30-60%.
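Part of the AVX512 advantage is simply register width. As a rough sketch (the helper name is made up for this example), the number of elements a single SIMD instruction can process is the register width divided by the element width, so a 512-bit register holds twice as many lanes as a 256-bit one. Real speed-ups also depend on clocks, the number of execution units and memory bandwidth, so this is an upper bound on the per-instruction gain, not a prediction of benchmark results.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of `element_bits` that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of the element width")
    return register_bits // element_bits


for name, width in (("AVX2/FMA (256-bit)", 256), ("AVX512 (512-bit)", 512)):
    print(name, "FP32 lanes:", simd_lanes(width, 32), "FP64 lanes:", simd_lanes(width, 64))
```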

While we’ve not tested memory performance in this article, we see that in streaming tests its 4 DDR4 channels trounce 2-channel CPUs that just cannot feed all their cores. Being able to use much faster DDR4 memory (3200 vs 2133) allows it to also soundly beat its older HSW-E brother.
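The ~100GB/s figure quoted for four channels is consistent with the theoretical peak of DDR4-3200. The sketch below shows that arithmetic, assuming the standard 64-bit (8-byte) data bus per DDR4 channel and decimal gigabytes; the function name is made up for this example, and sustained bandwidth in real workloads will be lower than these peaks.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (10**9 bytes per second)."""
    if channels <= 0 or mega_transfers_per_s <= 0 or bus_bytes <= 0:
        raise ValueError("all parameters must be positive")
    # channels * MT/s * bytes per transfer gives MB/s; divide by 1000 for GB/s
    return channels * mega_transfers_per_s * bus_bytes / 1000


print("4 channels @ 3200:", peak_bandwidth_gbs(4, 3200), "GB/s")
print("4 channels @ 2133:", peak_bandwidth_gbs(4, 2133), "GB/s")
print("2 channels @ 3200:", peak_bandwidth_gbs(2, 3200), "GB/s")
```

The same formula also shows why a 2-channel CPU at the same memory speed has only half the theoretical bandwidth available to feed its cores.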

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks | i9-7900X (Skylake-X) | Ryzen 1700X | i7-6700K 4C/8T (Skylake) | i7-5820K (Haswell-E) | Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) | 69.8 [+1.9x] | 36.5 | 23.3 | 30.7 | While Ryzen used to dominate .Net CLR workloads, now SKL-X is 2x faster than it and naturally than the older HSW-E.
BenchDotNetAA .Net Dhrystone Long (GIPS) | 60.9 [+35%] | 45.1 | 23.6 | 28.2 | Ryzen seems to do very well here, cutting SKL-X’s lead to just 35% – while SKL-X is still about 2x faster than HSW-E.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) | 112 [+12%] | 100.6 | 47.4 | 65.4 | Floating-point CLR performance is pretty spectacular with Ryzen and SKL-X only manages to be 12% faster.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) | 138 [+14%] | 121.3 | 63.6 | 85.7 | FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with SKL-X just 14% faster.
While Ryzen used to dominate .Net workloads, SKL-X restores the balance in Intel’s favour – though in many tests it is just over 10% faster than Ryzen. The CLR definitely seems to prefer Ryzen.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) | 140 [+50%] | 92.6 | 55.7 | 75.4 | Just as we saw with Dhrystone, this integer workload sees a 50% improvement for SKL-X. While RyuJIT supports SIMD integer vectors, the lack of bitfield instructions makes it slower for our code; shame.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) | 143 [+47%] | 97.8 | 60.3 | 79.2 | With a 64-bit integer workload we see a similar story – SKL-X is about 50% faster.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) | 543 [+2x] AVX/FMA | 272.7 AVX/FMA | 12.9 | 284.2 AVX/FMA | Here we make use of RyuJIT’s support for SIMD vectors, thus running AVX/FMA code – SKL-X strikes back to be 2x faster than Ryzen.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) | 294 [+2x] AVX/FMA | 149 AVX/FMA | 38.7 | 176.1 AVX/FMA | Switching to FP64 SIMD vector code – still running AVX/FMA – SKL-X is still 2x faster.
With RyuJIT’s support for SIMD vector instructions, SKL-X brings its power to bear, being the usual 2-times faster than Ryzen; it does not seem that RyuJIT supports AVX512 yet – something that would make it even faster. With scalar instructions SKL-X is “only” 50% faster but still about 2x faster than HSW-E.
Java Arithmetic Java Dhrystone Integer (GIPS) | 716 [+39%] | 513 | 313 | 395 | Ryzen puts in a strong performance with SKL-X “just” 40% faster. Still, it’s almost 2x faster than HSW-E.
Java Arithmetic Java Dhrystone Long (GIPS) | 873 [+70%] | 514 | 332 | 399 | Somehow SKL-X does better here, being 70% faster than Ryzen.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) | 155 [+32%] | 117 | 62.8 | 89 | With a floating-point workload Ryzen continues to do well so SKL-X is again “just” 30% faster.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) | 160 [+25%] | 128 | 64.6 | 91 | With an FP64 workload SKL-X’s lead drops to 25%.
With the JVM seemingly favouring Ryzen – and without SIMD – SKL-X is just 25-40% faster than it in most tests – but do note it absolutely trounces its older HSW-E brother, being almost 2x faster. So Intel has made big gains but at a cost.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) | 135 [+40%] | 99 | 59.5 | 82 | Oracle’s JVM does not yet support SIMD vectors so SKL-X is “just” 40% faster than Ryzen.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) | 132 [+41%] | 93 | 60.6 | 79 | With 64-bit integers nothing much changes.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) | 97 [+13%] | 86 | 40.6 | 61 | Scary times as SKL-X manages its smallest lead over Ryzen at just over 10%. Intel better hope Oracle will add vector primitives allowing SIMD code to use the power of its CPUs’ SIMD units.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) | 99 [+20%] | 82 | 40.9 | 63 | With an FP64 workload SKL-X is lucky to increase its lead to 20%.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA, AVX512) allows the competition to creep up on SKL-X in performance but at far lower cost. This is not a good place for Intel to be in.

While Ryzen used to dominate .Net and Java benchmarks, SKL-X restores the balance in Intel’s favour – though both the CLR and JVM do seem to “favour” Ryzen for some reason. If you are running the older HSW-E then you can be sure SKL-X is over 2x faster than it throughout.

Thus current and future applications running under the CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run much better on SKL-X than on older Intel designs, but also reasonably well on Ryzen – at least when not using SIMD vector extensions, where SKL-X’s power comes to the fore.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Just when AMD were likely celebrating their fantastic Ryzen, Intel strikes back with a killer – though really expensive – CPU. While we’ve not seen major core advances since SandyBridge (SNB and SNB-E) and will likely not see anything new in Coffeelake (CFL) either, somehow these improvements add up to quite a lot, with SKL-X soundly beating both Ryzen and its older HSW-E brother.

We finally see AVX512 released and it does not disappoint: SKL-X increases its lead by 50% through it, but note that lower-end CPUs will execute some instructions a lot slower, which is unfortunate. Using AVX512 also requires new tools (either a compiler, which on Windows means the brand-new Visual C++ 2017, or an assembler) and a decent amount of work – thus it is not something most developers will do, at least until the normal desktop/mobile platforms support it too.
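Since AVX512 is not available on every CPU, software that adopts it normally has to check for support at run time and fall back to AVX2/FMA otherwise. As a hedged illustration of that check (not how the benchmark itself does it), the sketch below parses text in the style of a Linux `/proc/cpuinfo` listing and looks for the `avx512f` flag. The helper names and the `SAMPLE` text are made up for this example, and on Windows a different mechanism would be needed.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx512(flags: set) -> bool:
    """AVX512F is the foundation subset that every AVX512 CPU provides."""
    return "avx512f" in flags


# Hypothetical, shortened sample; a real listing has many more flags.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma avx512f avx512bw avx512dq\n"

flags = cpu_flags(SAMPLE)
print("AVX512 code path" if supports_avx512(flags) else "AVX2/FMA fallback")
```

On a Linux system the same functions could be fed the contents of `/proc/cpuinfo`.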

All in all it is a solid upgrade – though costly – but if performance is what you’re after you can “safely” remain with Intel – you don’t need to join the “rebel camp”. But we’ll need to see what AMD’s Threadripper has in store for us… 😉