Intel Core i9 9900K CoffeeLake-R Review & Benchmarks – CPU 8-core/16-thread Performance

What is “CoffeeLake-R” (CFL-R)?

It is the “refresh” (updated) version of the 8th generation Intel Core architecture “CoffeeLake” (CFL) – itself a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). While ordinarily this would not be much of an event – this time we do have more significant changes:

  • Patched vulnerabilities in hardware: this can help recover the I/O workload performance lost to OS mitigations
    • “Meltdown” (mitigated in the OS by Kernel Page Table Isolation, KPTI/KVA) – Patched in hardware
    • L1TF/Foreshadow – Patched in hardware
    • (IBPB/IBRS) “Spectre 2” – OS mitigation needed
    • Speculative Store Bypass disabling (SSBD) “Spectre 4” – OS mitigation needed
  • Increased core counts yet again: CFL-R top-end now has 8 cores, not 6.

Intel CPUs bore the brunt of the vulnerabilities disclosed at the start of 2018 with “Meltdown” operating system mitigations (KVA) likely having the biggest performance impact in I/O workloads. While modern features (e.g. PCID (process context id) acceleration) could help reduce performance impact somewhat on recent architectures (4th gen and newer) the impact can still be significant. The CFL-R hardware fixes (thus not needing KVA) may thus prove very important.
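Which of these issues a given system still mitigates in software can be checked from the operating system. The tests in this review ran on Windows; as a hedged illustration only, the sketch below reads the status files that Linux kernels expose under /sys/devices/system/cpu/vulnerabilities. The function names are hypothetical, and treating a “Not affected” status as “no OS mitigation needed” is an assumption made for this example.

```python
from pathlib import Path

# Linux-only kernel interface; it does not exist on the Windows systems
# used for the benchmarks in this review.
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerability_status(directory=SYSFS_DIR):
    """Return {vulnerability_name: status_text} for each file in directory."""
    directory = Path(directory)
    if not directory.is_dir():
        return {}
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def fixed_in_hardware(status_text):
    """Assumption: the kernel reports 'Not affected' when no mitigation is needed."""
    return status_text.lower().startswith("not affected")


if __name__ == "__main__":
    for name, text in read_vulnerability_status().items():
        label = "no OS mitigation needed" if fixed_in_hardware(text) else "OS mitigation in use or required"
        print(f"{name}: {text} ({label})")
```

On a machine without this interface the script simply prints nothing.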

On the desktop we also see increased cores (again!) now up to 8 (thus 16 threads with HyperThreading) – double what KBL and SKL brought and matching AMD.

We also see increased clocks, mainly Turbo: 1 or 2 cores can now boost higher than CFL could, which helps workloads that are not massively threaded. This can improve responsiveness, as single tasks can run at top speed when thread utilization is low.

While rated TDP has not changed, in practice we are likely to see increased “real” power consumption especially due to higher clocks – with Turbo pushing power consumption even higher – close to SKL/KBL-X.

In this article we test CPU Core performance.

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i9 (9900K) with previous generation (8700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | Intel i9-9900K CoffeeLake-R | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Comments
Cores (CU) / Threads (SP) | 8C / 16T | 6C / 12T | 8C / 16T | 10C / 20T | We have 33% more cores matching Ryzen and close to mainstream SKL-X!
Speed (Min / Max / Turbo) | 0.8-3.6-5GHz (8x-36x-50x) | 0.8-3.7-4.7GHz (8x-37x-47x) | 2.2-3.7-4.2GHz (22x-37x-42x) | 1.2-3.3-4.3GHz (12x-33x-43x) | Single/Dual core Turbo has now reached 5GHz same as 8086K special edition.
Power (TDP) | 95W (135) | 95W (131) | 105W (135) | 140W (308) | TDP is the same but overall power consumption likely far higher.
L1D / L1I Caches | 8x 32kB 8-way / 8x 32kB 8-way | 6x 32kB 8-way / 6x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | No change in L1 caches. Just more of them.
L2 Caches | 8x 256kB 8-way | 6x 256kB 8-way | 8x 512kB 8-way | 10x 1MB 8-way | No change in L2 caches. Just more of them.
L3 Caches | 16MB 16-way | 12MB 16-way | 2x 8MB 16-way | 13.75MB 11-way | L3 has also increased by 33% in line with cores matching Ryzen.
Microcode/Firmware | MU069E0C-9E | MU069E0A-96 | MU8F0802-04 | MU065504-49 | We have a new stepping and slightly newer microcode.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.
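The bracketed percentages in the result tables appear to be the 9900K score relative to the Ryzen2 2700X column (for example, 400 vs. 334 GIPS in Dhrystone is listed as +20%). A minimal Python sketch of that calculation, assuming this reading is correct; the function names are illustrative and not part of any benchmark tool:

```python
def relative_delta(score, reference):
    """Percentage by which score is above (+) or below (-) reference."""
    return (score / reference - 1.0) * 100.0


def format_delta(score, reference):
    """Format the delta the way the tables show it, e.g. '[+20%]'."""
    return f"[{relative_delta(score, reference):+.0f}%]"


if __name__ == "__main__":
    # Values taken from the native benchmark table: 9900K vs. Ryzen2 2700X.
    print("Dhrystone Integer", format_delta(400, 334))
    print("SHA2-256         ", format_delta(12.2, 18.6))
```

Since higher values mean better performance, a positive delta means the 9900K is faster than the reference CPU and a negative delta means it is slower.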

Environment: Windows 10 x64 (1809), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 400 [+20%] 291 334 485 In the old Dhrystone integer workload, CFL-R finally beats Ryzen by 20%.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 393 [+17%] 296 335 485 With a 64-bit integer workload – nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 236 [+19%] 170 198 262 Switching to floating-point, CFL-R still beats Ryzen by 20% in old Whetstone.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 196 [+16%] 143 169 223 With FP64 nothing much changes.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, CFL-R handily beats Ryzen by about 20% and is also 33-39% faster than old CFL. It’s “king of the hill”.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1000 [+74%] 741 574 1590 (AVX512) In this vectorised AVX2 integer test CFL-R is 74% faster than Ryzen.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 416 [+122%] 305 187 581 (AVX512) With a 64-bit AVX2 integer vectorised workload, CFL-R is now over 2.2x faster than Ryzen.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 6.75 [+16%] 4.9 5.8 7.6 This is a tough test using Long integers to emulate Int128 without SIMD: still CFL-R is fastest.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 927 [+56%] 678 596 1760 (AVX512) In this floating-point AVX/FMA vectorised test, CFL-R is 56% faster than Ryzen.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 544 [+62%] 402 335 533 (AVX512) Switching to FP64 SIMD code, CFL-R increases its lead to 62%.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 23.3 [+49%] 16.7 15.6 40.3 (AVX512) In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – CFL-R is 50% faster.
In vectorised SIMD code we know Intel’s SIMD units (that can execute 256-bit instructions in one go) are much more powerful than AMD’s and it shows: Ryzen is soundly beaten by a big 50-100% margin. Naturally it cannot beat its “big brother” SKL-X with AVX512 – which is likely why Intel has not enabled AVX512 on CFL-R.
BenchCrypt Crypto AES-256 (GB/s) 17.6 [+9%] 17.8 16.1 23 With AES HWA support all CPUs are memory bandwidth bound; core contention (8 vs 6) means CFL-R scores slightly worse than CFL.
BenchCrypt Crypto AES-128 (GB/s) 17.6 [+9%] 17.8 16.1 23 What we saw with AES-256 just repeats with AES-128.
BenchCrypt Crypto SHA2-256 (GB/s) 12.2 [-34%] 9 18.6 26 (AVX512) With SHA HWA Ryzen2 powers through but CFL-R is only 34% slower.
BenchCrypt Crypto SHA1 (GB/s) 23 [+19%] 17.3 19.3 38 (AVX512) Ryzen also accelerates the soon-to-be-defunct SHA1 but the algorithm is less compute heavy allowing CFL-R to beat it.
BenchCrypt Crypto SHA2-512 (GB/s) 9 [+139%] 6.65 3.77 21 (AVX512) SHA2-512 is not accelerated by SHA HWA, allowing CFL-R to use its SIMD units and be 139% faster.
AES HWA is memory bound and here CFL-R also enjoys the 3200MT/s memory – but now feeding 8C / 16T which all contend for the bandwidth. Thus CFL-R does score slightly less than CFL and obviously gets left in the dust by SKL-X with 4 memory channels. Ryzen2 SHA HWA does manage a lonely win but anything SIMD accelerated belongs to Intel.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 276 [+7%] 207 257 309 In this non-vectorised test CFL-R is just a bit faster than Ryzen2.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 240 [+10%] 180 219 277 Switching to FP64 code, nothing much changes, CFL-R is 10% faster
BenchFinance Binomial float/FP32 (kOPT/s) 59.9 [-44%] 47 107 70.5 Binomial uses thread shared data thus stresses the cache & memory system; CFL-R strangely loses by 44%.
BenchFinance Binomial double/FP64 (kOPT/s) 61.9 [+2%] 44.2 60.6 68 With FP64 code Ryzen2’s lead disappears: CFL-R is pretty much tied with it.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 56.5 [+4%] 41.6 54.2 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; CFL-R is just 4% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 44.3 [+8%] 32.9 41 50.5 Switching to FP64 CFL-R increases its lead to 8%.
Without SIMD support, CFL-R relies on its increased thread count (now matching Ryzen2) to beat Ryzen2 by a small amount in all but one test (Binomial FP32). In tests that AMD used to always win (with Ryzen 1/2) Intel now has the lead.
BenchScience SGEMM (GFLOPS) float/FP32 403 [+34%] 385 300 413 (AVX512) In this tough vectorised AVX2/FMA algorithm CFL-R is 34% faster than Ryzen2.
BenchScience DGEMM (GFLOPS) double/FP64 269 [+126%] 135 119 212 (AVX512) With FP64 vectorised code, CFL-R is over 2x faster.
BenchScience SFFT (GFLOPS) float/FP32 23.4 [+160%] 24 9 28.6 (AVX512) FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; CFL-R is over 2.5x faster.
BenchScience DFFT (GFLOPS) double/FP64 11.2 [+41%] 11.9 7.92 14.6 (AVX512) With FP64 code, CFL-R’s lead reduces to 41%.
BenchScience SNBODY (GFLOPS) float/FP32 550 [+96%] 411 280 638 (AVX512) N-Body simulation is vectorised but makes many memory accesses to shared data; CFL-R is still almost 2x faster than Ryzen2.
BenchScience DNBODY (GFLOPS) double/FP64 172 [+52%] 127 113 195 (AVX512) With FP64 code CFL-R’s lead reduces to about 50% over Ryzen 2.
With highly vectorised SIMD code CFL-R performs very well – dominating Ryzen2: in some tests it is over 2x faster! Then again CFL did not have any issues here either, Intel is just extending their lead…
CPU Image Processing Blur (3×3) Filter (MPix/s) 2270 [+86%] 1700 1220 4540 (AVX512) In this vectorised integer AVX2 workload CFL-R is almost 2x faster than Ryzen2.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 903 [+67%] 675 542 1790 (AVX512) Same algorithm but more shared data reduces the lead to 67% still significant.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 488 [+61%] 362 303 940 (AVX512) Again same algorithm but even more data shared reduces the lead to 61%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 784 [+73%] 589 453 1520 (AVX512) Different algorithm but still AVX2 vectorised workload means CFL-R is 73% faster than Ryzen 2.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 78.6 [+13%] 57.8 69.7 223 (AVX512) Still AVX2 vectorised code but CFL-R stumbles a bit here – but it’s still 13% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 43.4 [+76%] 31.8 24.6 70.8 (AVX512) CFL-R recovers its dominance over Ryzen2.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 4470 [+208%] 3480 1450 3570 (AVX512) CFL-R (like all Intel CPUs) does very well here – it’s a huge 3x faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 614 [+153%] 448 243 909 (AVX512) In this final test, CFL-R is over 2x faster than Ryzen 2.

Adding 2 more cores (not to mention the higher Turbo clock) brings additional performance gains over CFL, which itself showed big gains over the old SKL/KBL, again within the same (rated) TDP. Intel never had any problem with SIMD code (AVX/AVX2/FMA3), beating Ryzen2 by a large margin (now over 2x faster in some tests), but CFL-R now also wins pretty much all the other tests.

It is consistently 33-40% faster than CFL (8700K) in line with core/speed increases which bodes well for compute-heavy code; streaming performance can be lower due to increased core contention for bandwidth and here faster (though more expensive) memory would help.
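As a rough sanity check on the “in line with core/speed increases” claim, the sketch below multiplies core count by clock speed for the 9900K and the 8700K, using the base and Turbo figures from the specification table. This is a deliberately naive model and an assumption of this example, not how the benchmarks compute anything: it assumes perfect scaling with cores and clock, and the Turbo figure applies to only 1 or 2 cores, so the upper bound is optimistic.

```python
def core_clock_product(cores, clock_ghz):
    """Naive throughput proxy: cores multiplied by clock (GHz)."""
    return cores * clock_ghz


def expected_gain_percent(new_cores, new_clock, old_cores, old_clock):
    """Percentage gain predicted by the naive cores-times-clock model."""
    new = core_clock_product(new_cores, new_clock)
    old = core_clock_product(old_cores, old_clock)
    return (new / old - 1.0) * 100.0


if __name__ == "__main__":
    # 9900K: 8 cores, 3.6GHz base, 5.0GHz Turbo; 8700K: 6 cores, 3.7GHz base, 4.7GHz Turbo.
    base_gain = expected_gain_percent(8, 3.6, 6, 3.7)
    turbo_gain = expected_gain_percent(8, 5.0, 6, 4.7)
    print(f"Expected gain at base clocks:  {base_gain:.1f}%")
    print(f"Expected gain at Turbo clocks: {turbo_gain:.1f}%")
```

Under these assumptions the model gives roughly 30% at base clocks and 42% at Turbo clocks, a range that brackets the 33-40% measured in the compute-heavy tests.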

No – it cannot beat its “older big brother” SKL-X with AVX512 – not to mention increased core/thread count as well as memory channels, but in some tests it is competitive.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1809), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

VM Benchmarks Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 54.3 [-11%] 41 61 52 .Net CLR integer performance starts off well but CFL-R is 11% slower than Ryzen2.
BenchDotNetAA .Net Dhrystone Long (GIPS) 55.8 [-7%] 41 60 54 With 64-bit integers the gap lowers to 7%.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 95.5 [-6%] 78 102 107 Floating-Point CLR performance does not change much, CFL-R is still 6% slower than Ryzen2 despite big gain over old CFL/KBL/SKL.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 126 [+8%] 95 117 137 FP64 performance allows CFL-R to win in the end.
Ryzen 2 performs exceedingly well in .Net workloads – but CFL-R can hold its own: overall it is not significantly slower (11% at most) and even wins 1 test out of 4.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 144 [+30%] 93.5 111 144 Unlike CFL, here CFL-R is 30% faster than Ryzen2.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 139 [+28%] 93.1 109 143 With 64-bit integer workload nothing much changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 499 [+27%] 361 392 585 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code and CFL-R is 27% faster than Ryzen2.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 274 [+26%] 198 217 314 Switching to FP64 SIMD vector code – still running AVX/FMA – CFL-R is still faster.
We see a big improvement in CFL-R even against old CFL: this allows it to soundly beat Ryzen 2 (which used to win this test) by about 30%, a significant margin. It is possible the hardware fixes (“Meltdown”) are having an effect here.
Java Arithmetic Java Dhrystone Integer (GIPS) 614 [+7%] 557 573 877 Java JVM performance starts well with a 7% lead over Ryzen2.
Java Arithmetic Java Dhrystone Long (GIPS) 644 [+16%] 488 553 772 With 64-bit integers, CFL-R doubles its lead to 16%.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 143 [+9%] 101 131 156 Floating-point JVM performance is similar – 9% faster.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 147 [+6%] 103 139 160 With 64-bit precision the lead drops to 6%.
CFL-R improves by a good amount over CFL (which itself improved greatly over KBL/SKL) and now beats Ryzen2 by a good margin of 6-16%.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 147 [+30%] 100 113 140 Without SIMD acceleration we still see CFL-R 30% faster than Ryzen2 in this integer workload.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 142 [+41%] 89 101 152 Changing to 64-bit integer increases the lead to 41%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 91 [-6%] 64 97 98 With floating-point non-SIMD accelerated Ryzen2 is faster.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 93 [+3%] 64 90 99 With 64-bit floating-point precision CFL-R is back on top by just 3%.
With compute heavy vectorised code but not SIMD accelerated, CFL-R is still faster or at least ties with Ryzen2.

CFL-R now beats or at least matches Ryzen2 in VM tests the latter used to easily win. It may not have the lead native SIMD vectorised code allows it – but has no trouble keeping up – unlike its older SKL/KBL “brothers”.

Final Thoughts / Conclusions

While Intel had finally increased core counts with CFL (after many years) in light of new competition from AMD with Ryzen/2 – it still relied on fewer but more powerful (at least in SIMD) cores to compete. CFL-R finally brings core (8 vs 8) and thread (16 vs 16) parity with the competition to ensure domination.

And with few exceptions – that’s what it has achieved – 9900K is the fastest desktop CPU at this time (November 2018) and if you can afford it you can upgrade on the same 300-series mainboards (though naturally the newer Z390 is recommended). The performance improvement (33-40% over 8700K) is significant enough to justify an upgrade – again if you can afford it – considering it is a “mere refresh”.

The “Meltdown” fixes in hardware are also likely to bring big improvement in I/O workloads – or to be precise – restore the performance loss of OS mitigations (KVA) that have been deployed this year (2018). Still, in very rough terms, now you don’t have to decide between “speed” and “security” – though perhaps KVA should be used by default just in case any CPU (not just Intel) leaks information between user/kernel spaces by a yet undiscovered side-channel vulnerability.

But despite the “i9” moniker – don’t think you’re getting a workstation-class CPU on the cheap: SKL-X not only (still) has more cores/threads and 2x memory channels but also supports AVX512 beating it soundly. It will also be refreshed soon – sporting the same “Meltdown” in-hardware fixes. But again considering the costs (almost 2x) CFL-R is likely the performance/price winner on most workloads.

For now, on the desktop, the 9900K is “king of the hill”!

Intel Core i7 8700K CoffeeLake Review & Benchmarks – CPU 6-core/12-thread Performance

What is “CoffeeLake” (CFL)?

The 8th generation Intel Core architecture is code-named “CoffeeLake” (CFL): unlike previous architectures, it is a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). The server/workstation (SKL-X/KBL-X) CPU core saw new instruction set support (AVX512) as well as other improvements – these have not made the transition yet.

Possibly due to limited competition (before AMD Ryzen launch), process issues (still at 14nm) and the disclosure of a whole host of hardware vulnerabilities (Spectre, Meltdown, etc.) which required microcode (firmware) updates – performance improvements have not been forthcoming. This is pretty much unprecedented – while some Core updates were only evolutionary we have not had complete stagnation before; in addition the built-in GPU core has also remained pretty much stagnant – we will investigate this in a subsequent article.

However, CFL does bring a major change – and that is increased core counts both on desktop and mobile: on desktop we go from 4 to 6 cores (+50%) while on mobile (ULV) we go from 2 to 4 (+100%) within the same TDP envelope!

While this article is a bit late in the day considering the 8700K launched last year – we have now also reviewed the brand-new CoffeeLake-R (Refresh) Core i9-9900K and it seems a good time to see what has changed performance-wise for the previous top-of-the-range CPU.

In this article we test CPU Core performance.

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i7 (8700K) with previous generation (6700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | Intel i7-8700K CoffeeLake | AMD Ryzen2 2700X Pinnacle Ridge | Intel i9-7900X SkyLake-X | Intel i7-6700K SkyLake | Comments
Cores (CU) / Threads (SP) | 6C / 12T | 8C / 16T | 10C / 20T | 4C / 8T | We have 50% more cores compared to SKL/KBL but still not as many as Ryzen/2 with 8 cores.
Speed (Min / Max / Turbo) | 0.8-3.7-4.7GHz (8x-37x-47x) | 2.2-3.7-4.2GHz (22x-37x-42x) | 1.2-3.3-4.3GHz (12x-33x-43x) | 0.8-4.0-4.2GHz (8x-40x-42x) | Single-core Turbo has increased close to 5GHz (reserved for Special Edition 8086K) way above SKL/KBL and Ryzen.
Power (TDP) | 95W (131) | 105W (135) | 140W (308) | 91W (100) | TDP has only increased by 4% and is still below Ryzen though Turbo is comparable.
L1D / L1I Caches | 6x 32kB 8-way / 6x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | 4x 32kB 8-way / 4x 32kB 8-way | No change in L1 caches. Just more of them.
L2 Caches | 6x 256kB 8-way | 8x 512kB 8-way | 10x 1MB 8-way | 4x 256kB 8-way | No change in L2 caches. Just more of them.
L3 Caches | 12MB 16-way | 2x 8MB 16-way | 13.75MB 11-way | 8MB 16-way | L3 has also increased by 50% in line with cores, but still below Ryzen’s 16MB.
Microcode/Firmware | MU069E0A-96 | MU8F0802-04 | MU065504-49 | MU065E03-C2 | We have a new model and somewhat newer microcode.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Intel i7-6700K SkyLake Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 291 [-13%] 334 485 190 In the old Dhrystone integer workload, CFL is still 13% slower than Ryzen 2 despite the huge lead over SKL.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 296 [-12%] 335 485 192 With a 64-bit integer workload – nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 170 [-14%] 198 262 105 Switching to floating-point, CFL is still 14% slower in the old Whetstone also a micro-benchmark.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 143 [-15%] 169 223 89 With FP64 nothing much changes.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, CFL is still 12-15% slower than Ryzen 2 with its 2 more cores (8 vs. 6), but much faster than the old SKL with 4 cores. We begin to see now why Intel is adding more cores in CFL-R.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 741 [+29%] 574 1590 (AVX512) 474 In this vectorised AVX2 integer test we see CFL beating Ryzen by ~30% despite fewer cores.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 305 [+63%] 187 581 (AVX512) 194 With a 64-bit AVX2 integer vectorised workload, CFL is now 63% faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 4.9 [-16%] 5.8 7.6 3 This is a tough test using Long integers to emulate Int128 without SIMD: Ryzen 2 thus wins this one with CFL slower by 16%.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 678 [+14%] 596 1760 (AVX512) 446 In this floating-point AVX/FMA vectorised test, CFL is again 14% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 402 [+20%] 335 533 (AVX512) 268 Switching to FP64 SIMD code, CFL is again 20% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 16.7 [+7%] 15.6 40.3 (AVX512) 11 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – CFL is just 7% faster but does win.
In vectorised SIMD code we see the power of Intel’s SIMD units that can execute 256-bit instructions in one go; CFL soundly beats Ryzen2 (by 7-63%) despite fewer cores. SKL-X shows that AVX512 brings further gains and it is a pity CFL still does not support it.
BenchCrypt Crypto AES-256 (GB/s) 17.8 [+11%] 16.1 23 15 With AES HWA support all CPUs are memory bandwidth bound; unfortunately Ryzen 2 is at 2667 vs CFL/SKL-X at 3200 which means CFL is 11% faster.
BenchCrypt Crypto AES-128 (GB/s) 17.8 [+11%] 16.1 23 15 What we saw with AES-256 just repeats with AES-128.
BenchCrypt Crypto SHA2-256 (GB/s) 9 [-51%] 18.6 26 (AVX512) 5.9 With SHA HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust; CFL is thus 50% slower.
BenchCrypt Crypto SHA1 (GB/s) 17.3 [-9%] 19.3 38 (AVX512) 11.2 Ryzen also accelerates the soon-to-be-defunct SHA1 but the algorithm is less compute heavy thus CFL is only 9% slower.
BenchCrypt Crypto SHA2-512 (GB/s) 6.65 [+77%] 3.77 21 (AVX512) 4.4 SHA2-512 is not accelerated by SHA HWA, allowing CFL to use its SIMD units and be 77% faster.
AES HWA is memory bound and here CFL comfortably works with 3200MT/s memory and is thus faster than Ryzen 2 with 2667MT/s memory (our sample); likely both would score similarly at 3200MT/s. Ryzen 2 SHA HWA allows it to easily beat all other CPUs but only in SHA1/SHA256 – in others the CFL SIMD units win the day.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 207 [-19%] 257 309 128 In this non-vectorised test CFL cannot match Ryzen 2 and is ~20% slower.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 180 [-18%] 219 277 113 Switching to FP64 code, nothing much changes, Ryzen 2 is still faster.
BenchFinance Binomial float/FP32 (kOPT/s) 47 [-56%] 107 70.5 29.3 Binomial uses thread shared data thus stresses the cache & memory system; Ryzen 2 does very well here with CFL almost 60% slower.
BenchFinance Binomial double/FP64 (kOPT/s) 44.2 [-27%] 60.6 68 27.3 With FP64 code Ryzen2’s lead diminishes, CFL is “only” 27% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 41.6 [-23%] 54.2 63 25.7 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen 2 also wins this one, CFL is 23% slower.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 32.9 [-20%] 41 50.5 20.3 Switching to FP64 nothing much changes, CFL is still 20% slower.
Without SIMD support, CFL loses to Ryzen 2 as we saw with Dhrystone/Whetstone – between 20 and 50%. As we noted before, Intel will still need to add more cores in order to beat Ryzen 2. Still big improvement over the old SKL/KBL as expected.
BenchScience SGEMM (GFLOPS) float/FP32 385 [+28%] 300 413 (AVX512) 268 In this tough vectorised AVX2/FMA algorithm CFL is ~30% faster.
BenchScience DGEMM (GFLOPS) double/FP64 135 [+13%] 119 212 (AVX512) 130 With FP64 vectorised code, CFL’s lead reduces to 13% over Ryzen 2.
BenchScience SFFT (GFLOPS) float/FP32 24 [+167%] 9 28.6 (AVX512) 16.1 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; here CFL is over 2.5x faster.
BenchScience DFFT (GFLOPS) double/FP64 11.9 [+51%] 7.92 14.6 (AVX512) 7.2 With FP64 code, CFL’s lead reduces to ~50%.
BenchScience SNBODY (GFLOPS) float/FP32 411 [+47%] 280 638 (AVX512) 271 N-Body simulation is vectorised but makes many memory accesses to shared data; CFL remains ~50% faster.
BenchScience DNBODY (GFLOPS) double/FP64 127 [+13%] 113 195 (AVX512) 79 With FP64 code CFL’s lead reduces to 13% over Ryzen 2.
With highly vectorised SIMD code CFL performs well soundly beating Ryzen 2 between 13-167% as well as significantly improving over older SKL/KBL. As long as SIMD code is used Intel has little to fear.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1700 [+39%] 1220 4540 (AVX512) 1090 In this vectorised integer AVX2 workload CFL enjoys a ~40% lead over Ryzen 2.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 675 [+25%] 542 1790 (AVX512) 433 Same algorithm but more shared data reduces the lead to 25% still significant.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 362 [+19%] 303 940 (AVX512) 233 Again same algorithm but even more data shared reduces the lead to 20%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 589 [+30%] 453 1520 (AVX512) 381 Different algorithm but still AVX2 vectorised workload means CFL is 30% faster than Ryzen 2.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 57.8 [-17%] 69.7 223 (AVX512) 37.6 Still AVX2 vectorised code but CFL stumbles a bit here – it’s 17% slower than Ryzen 2.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 31.8 [+29%] 24.6 70.8 (AVX512) 20 Again we see CFL ~30% faster.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 3480 [+140%] 1450 3570 (AVX512) 2300 CFL (like all Intel CPUs) does very well here – it’s a huge 140% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 448 [+84%] 243 909 (AVX512) 283 In this final test, CFL is almost 2x faster than Ryzen 2.

The addition of 2 more cores brings big performance gains (not to mention the higher Turbo clock) over the old SKL/KBL which is pretty impressive considering TDP has stayed the same. With SIMD code (AVX/AVX2/FMA3) CFL has no problem beating Ryzen 2 by a pretty large margin (up to 2x faster) – but any algorithm not vectorised allows Ryzen 2 to win – though not by much 12-20%.
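To illustrate why register width matters so much in the vectorised tests, the sketch below counts how many elements fit in one SIMD register for the widths mentioned in this review. It is simple arithmetic rather than a performance model, and the helper name is hypothetical; actual throughput also depends on the number of execution units, clocks and memory bandwidth.

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements processed by one SIMD instruction."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


if __name__ == "__main__":
    widths = {"128-bit": 128, "AVX2 256-bit": 256, "AVX512 512-bit": 512}
    elements = {"Int32/FP32": 32, "Int64/FP64": 64}
    for width_name, width in widths.items():
        for element_name, size in elements.items():
            print(f"{width_name:>15} x {element_name}: {simd_lanes(width, size)} lanes")
```

A 256-bit AVX2 instruction thus handles 8 FP32 or 4 FP64 values at once, and AVX512 doubles that again, which is consistent with SKL-X pulling well ahead in most rows marked (AVX512).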

Streaming tests likely benefit from the higher officially supported memory frequencies: while in theory these could be used on the older SKL/KBL (memory overclock), they were neither officially supported nor stable in all cases. We shall test memory performance in a forthcoming article.
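To put the memory speeds in context, the sketch below computes theoretical peak bandwidth for dual-channel DDR4 at the two transfer rates used in these tests (2667MT/s on the Ryzen 2 sample, 3200MT/s on CFL). It assumes the standard 64-bit (8-byte) DDR4 channel and two channels on both desktop platforms; these are theoretical peaks rather than measured figures, and the function names are illustrative.

```python
BYTES_PER_TRANSFER = 8  # one DDR4 channel is 64 bits wide


def peak_bandwidth_gbs(channels, mega_transfers_per_s):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return channels * BYTES_PER_TRANSFER * mega_transfers_per_s / 1000.0


def per_core_share_gbs(channels, mega_transfers_per_s, cores):
    """Peak bandwidth available per core if all cores stream at once."""
    return peak_bandwidth_gbs(channels, mega_transfers_per_s) / cores


if __name__ == "__main__":
    for rate in (2667, 3200):
        print(f"Dual-channel DDR4-{rate}: {peak_bandwidth_gbs(2, rate):.1f} GB/s peak")
    for cores in (4, 6, 8):
        share = per_core_share_gbs(2, 3200, cores)
        print(f"{cores} cores sharing DDR4-3200: {share:.2f} GB/s per core")
```

The per-core figures show why adding cores without adding memory channels increases contention in bandwidth-bound tests such as AES.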

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

VM Benchmarks Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Intel i7-6700K SkyLake Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 41 [-33%] 61 52 28 .Net CLR integer performance starts off well over old SKL but still 33% slower than Ryzen 2.
BenchDotNetAA .Net Dhrystone Long (GIPS) 41 [-32%] 60 54 27 With 64-bit integers nothing much changes.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 78 [-24%] 102 107 49 Floating-Point CLR performance does not change much, CFL is still 25% slower than Ryzen despite big gain over old SKL.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 95 [-19%] 117 137 62 FP64 performance is similar to FP32.
Ryzen 2 performs exceedingly well in .Net workloads – soundly beating all Intel CPUs, with CFL between 20-33% slower. More cores will be needed for parity with Ryzen 2, but at least CFL improves a lot over SKL/KBL.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 93.5 [-16%] 111 144 57 Just as we saw with Dhrystone, this integer workload sees CFL improve greatly over SKL but Ryzen 2 is still faster.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 93.1 [-15%] 109 143 57 With 64-bit integer workload nothing much changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 361 [-8%] 392 585 228 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code but CFL is still 8% slower than Ryzen 2.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 198 [-9%] 217 314 128 Switching to FP64 SIMD vector code – still running AVX/FMA – CFL is still slower.
We see a similar improvement for CFL but again not enough to beat Ryzen 2; even using RyuJit’s vectorised support CFL cannot beat it – just reduce the loss to 8-9%.
Java Arithmetic Java Dhrystone Integer (GIPS) 557 [-3%] 573 877 352 Java JVM performance is almost neck-and-neck with Ryzen 2 despite 2 fewer cores.
Java Arithmetic Java Dhrystone Long (GIPS) 488 [-12%] 553 772 308 With 64-bit integers, CFL does fall behind Ryzen2 by 12%.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 101 [-23%] 131 156 62 Floating-point JVM performance is worse though, CFL is now 23% slower.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 103 [-26%] 139 160 64 With 64-bit precision nothing much changes.
While CFL improves markedly over the old SKL and almost ties with Ryzen 2 in integer workloads, it does fall behind in floating-point by a good amount.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 100 [-12%] 113 140 63 Without SIMD acceleration we see the usual delta (around 12%) with integer workload.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 89 [-12%] 101 152 62 Nothing changes when changing to 64-bit integer workloads.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 64 [-34%] 97 98 41 With floating-point non-SIMD accelerated we see a bigger delta of about 30% less vs. Ryzen 2.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 64 [-29%] 90 99 41 With 64-bit floating-point precision nothing much changes.
With compute heavy vectorised code but not SIMD accelerated, CFL cannot keep up with Ryzen 2 and the difference increases to about 30% less. Intel really needs to get Oracle to add SIMD extensions similar to .Net’s new CLR.

Ryzen dominates .Net and Java benchmarks – Intel will need more cores in order to compete; while the additional 2 cores helped a lot, it is not enough!

Final Thoughts / Conclusions

Due to core count stagnation (4 on desktop, 2 on mobile), Intel had no choice really but to increase core counts in light of new competition from AMD with Ryzen/2 with twice as many cores (8) and SMT (Hyper-Threading) as well! While KBL had increased base/Turbo clocks appreciably over SKL within the same power envelope, CFL had to add more cores in order to compete.

With 50% more cores (6) CFL performs much better than the older SKL/KBL as expected but that is not enough in non-vectorised loads; Ryzen 2 with 2 more cores is still faster, not by much (12-20%) but still faster. Once vectorised code is used, the power of Intel’s SIMD units shows – with CFL soundly beating Ryzen despite not supporting AVX512 nor SHA HWA which is a pity. AMD still has work to do with Ryzen if it wants to be competitive in vectorised workloads (integer or floating-point).

We now see why CFL-R (CoffeeLake Refresh) will add even more cores (8C/16T with 9900K which we review in a subsequent article) – it is the only way to beat Ryzen 2 in all workloads. In effect AMD has reached performance parity with Intel in all but SIMD workloads – a great achievement!

Unfortunately (unlike AMD’s AM4 Ryzen) CFL does require new chipset/boards (series 300) which makes it an expensive upgrade for SKL/KBL owners; otherwise it would have been a no-brainer upgrade for those needing more compute power. While the new platform does bring some improvements (USB 3.1 Gen 2 aka 10Gb/s, more PCIe lanes, integrated 802.11ac WiFi – at least on mobile) it’s nothing over the competition.

Roll on CoffeeLake Refresh and the new CPUs: they are sorely needed…