Intel Core i9 9900K CofeeLake-R Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “CofeeLake-R” CFL-R?

It is the “refresh” (updated) version of the 8th generation Intel Core architecture (CFL) – itself a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). While ordinarily this would not be much of an event – this time we do have more significant changes:

  • Patched vulnerabilities in hardware: this can help restore I/O workload performance degradation due to OS mitigations
    • Kernel Page Table Isolation (KPTI) aka “Meltdown” – Patched in hardware
    • L1TF/Foreshadow – Patched in hardware
    • (IBPB/IBRS) “Spectre 2” – OS mitigation needed
    • Speculative Store Bypass disabling (SSBD) “Spectre 4” – OS mitigation needed
  • Increased core counts yet again: CFL-R top-end now has 8 cores, not 6.

Intel CPUs bore the brunt of the vulnerabilities disclosed at the start of 2018 with “Meltdown” operating system mitigations (KVA) likely having the biggest performance impact in I/O workloads. While modern features (e.g. PCID (process context id) acceleration) could help reduce performance impact somewhat on recent architectures (4th gen and newer) the impact can still be significant. The CFL-R hardware fixes (thus not needing KVA) may thus prove very important.

On the desktop we also see increased cores (again!) now up to 8 (thus 16 threads with HyperThreading) – double what KBL and SKL brought and matching AMD.

We also see increased clocks, mainly Turbo, but this still allows 1 or 2 cores to boost clocks higher than CFL could and thus help workloads not massively threaded. This can improve responsiveness as single tasks can be run at top speed when there is little thread utilization.

While rated TDP has not changed, in practice we are likely to see increased “real” power consumption especially due to higher clocks – with Turbo pushing power consumption even higher – close to SKL/KBL-X.

In this article we test CPU Core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i7 (8700K) with previous generation (6700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i9-9900K CofeeLake-R Intel i7-8700K CofeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
L1D / L1I Caches 8x 32kB 8-way / 8x 32kB 8-way 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way No L1D/I changes, Ryzen’s L1I is twice as big.
L2 Caches 8x 256kB 4-way 6x 256kB 4-way 8x 512kB 8-way 10x 1MB 16-way No L2 changes, Ryzen’s L2 is twice as big again.
L3 Caches 16MB 16-way 12MB 16-way 2x 8MB 16-way 2x 8MB 16-way L3 has also increased with no of cores, and now matches Ryzen.
TLB 4kB pages
64 4-way / 64 8-way / 1536 6-way 64 4-way / 64 8-way/ 1536 6-way 64 full-way 1536 8-way 64 4-way / 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages
8 full-way / 1536 6-way 8 full-way / 1536 6-way 64 full-way 1536 2-way 8 full-way / 1536 6-way No TLB changes.
Memory Controller Speed (MHz) 1200-5000 1200-4400 1333-2667 1200-2700 The uncore (memory controller) runs at faster clock due to higher rated clock but not a lot in it.
Memory Data Speed (MHz)
3200 3200 2667 3200 CFL/R can easily run at 3200Mt/s while KBL/SKL were not as reliable. We could not get Ryzen past 2667 while it does support 2933.
Memory Channels / Width
2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Bandwidth (GB/s)
50 50 42 100 Bandwidth has naturally increased with memory clock speed but latencies are higher.
Uncore / Memory Controller Firmware
2.6.2 2.6.2 We’re on firmware 2.6.x on both.
Memory Timing (clocks)
16-16-16-36 6-52-25-12 2T 16-16-16-36 6-52-25-12 2T 16-17-17-35 7-60-20-10 2T Timings are very much BIOS dependent and vary a lot.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL-R supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i9-9900K CofeeLake-R Intel i7-8700K CofeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 70.7 [+28%] 52.5 55.3 86 CFL-R finally overtakes Ryzen2 in inter-core bandwidth with almost 30% more bandwidth.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 15.4 [-1%] 15.5 6.35 25.7 In worst-case pairs on Ryzen2 must go across CCXes – unlike Intel’s CPUs – thus CFL can muster over 2x more bandwidth in this case.
CFL-R manages good bandwidth improvement with its 2  extra cores allowing it to dominate Ryzen  2; worst-case bandwidth does not improve as the inter-core connector has remained the same
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 13.4 [-7%] 14.4 13.5 15 With its faster clock, CFL-R manages lower inter-core latency with 7% drop.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 43.7 [-3%] 45 40 75 Within the same unit, Ryzen2 is again faster than CFL/R.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 115 Obviously going across CCXes is slow, about 3x slower which needs careful thread scheduling.
The multiple CCX designof Ryzen 2 still presents some challenges to programmers requiring threads to be carefully scheduled – thus the unified CFL-R just like CFL before it enjoys lower latencies throughout.
Aggregated L1D Bandwidth (GB/s) 1890 [+39%] 1630
854 2220 Intel’s wide L1D in CFL/R means almost 2x more bandwidth than Ryzen 2.
Aggregated L2 Bandwidth (GB/s) 618 [+8%] 571 720 985 But Ryzen2’s L2 caches are not only twice as big but also very wide – CFL/R surprisingly cannot beat it.
Aggregated L3 Bandwidth (GB/s) 326 [=] 327 339 464 Ryzen’s 2 L3 caches also provide good bandwidth matching CFL’s unified L3 cache.
Aggregated Memory (GB/s) 35.5 [=] 35.6 32.2 70 Running at 3200Mt’s obviously CFL enjoys higher bandwidth than Ryzen2 at 2667Mt’s but somehow the latter has better efficiency.
Nothing much has changed in CFL/R vs. old SKL/KBL thus while L1 caches are wide and thus fast – the L2, L3 are not as impressive and the memory controller while competitive it does not seem as efficient as Ryzen2 but is more stable at high data rates allowing for higher bandwidth.
Data In-Page Random Latency (ns) 17.5 (3-10-21) 17.4 (4-11-20) [-73%] 63.4 (4-12-31) 25.5 (4-13-30) While clock latencies have not changed w.s. old KBL/SKL, CFLR enjoys lower latencies due to higher data rates. Ryzen2 has problems here.
Data Full Random Latency (ns) 54.3 (3-10-36) 53.4 (4-11-42) [-30%] 76.2 (4-12-32) 74 (4-13-62) Out-of-page clock latencies have increased but still overall lower. Ryzen2 has almost caught up here.
Data Sequential Latency (ns) 3.8 (3-10-11) 3.8 (4-11-12) 3.3 (4-6-7) 5.3 (4-12-12) With sequential access, Ryzen2 is now faster as CFL/R’s clock latencies have not changed.
CFL-R does not improve over CFL (same memory controller) is lucky here as even Ryzen2 still has high latencies in random accesses (either in-page or full range) but manages to be faster with sequential access. Intel will need to improve going forward as clock latencies while good have really not improved at all.
Code In-Page Random Latency (ns) 8.6 (2-9-19) 8.7 (2-10-21) 13.8 (4-9-24) 11.8 (4-14-25) Code clock latencies also have not changed and again and while Ryzen2 performs a lot better, CFL/R manage to be ~35% faster.
Code Full Random Latency (ns) 60.1 (2-9-48) 59.8 (2-10-48) 85.7 (4-14-49) 83.6 (4-15-74) Out-of-page clock latencies also have not changed and here CFL/R is 20% faster over Ryzen2.
Code Sequential Latency (ns) 4.3 (2-3-8) 4.5 (2-4-10) 7.4 (4-12-20) 6.8 (4-7-11) Ryzen2 is competitive but again CFL/R manages to be almost 40% faster.
CFL/R does not improve over CFL but still dominates here and enjoys 30-40% less latency over Ryzen2 but the latter has improved a lot in time.
Memory Update Transactional (MTPS) 73.3 [+36%] 54 5 59 Finally all top-end Intel CPUs have HLE enabled and working and thus enjoy huge performance increase.
Memory Update Record Only (MTPS) 53.4 [+41%] 38 4.58 59 Nothing much changes here. CFL-R can do over 40% more transactions.

CFL-R does not really perform any different cache/memory wise vs. old CFL as the caches and memory controller are unchanged.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

CFL-R just adds more cores, thus enjoys higher aggregated L1D/L2 bandiwdths vs CFL but the L3 is still disappointing – especially as now it has to feed 33% more cores/threads (8/16 vs 6/12). Latencies (in clocks) do not change either but as it can clock higher they do decrease in real terms (ns).

The memory controller is the very same (even running same firmware) thus performs the same though now it has to feed 33% more cores/threads (8/16 vs 6/12) thus when all cores/threads are used the aggregated bandwidth falls due to extra contention. In fairness Ryzen2 has the same issue (too many cores/threads for too little bandwidth) thus SKL/KBL-X is where you should be looking for more bandwidth.

Intel Core i9 9900K CofeeLake-R Review & Benchmarks – CPU 8-core/16-thread Performance

What is “CofeeLake-R” CFL-R?

It is the “refresh” (updated) version of the 8th generation Intel Core architecture (CFL) – itself a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). While ordinarily this would not be much of an event – this time we do have more significant changes:

  • Patched vulnerabilities in hardware: this can help restore I/O workload performance degradation due to OS mitigations
    • Kernel Page Table Isolation (KPTI) aka “Meltdown” – Patched in hardware
    • L1TF/Foreshadow – Patched in hardware
    • (IBPB/IBRS) “Spectre 2” – OS mitigation needed
    • Speculative Store Bypass disabling (SSBD) “Spectre 4” – OS mitigation needed
  • Increased core counts yet again: CFL-R top-end now has 8 cores, not 6.

Intel CPUs bore the brunt of the vulnerabilities disclosed at the start of 2018 with “Meltdown” operating system mitigations (KVA) likely having the biggest performance impact in I/O workloads. While modern features (e.g. PCID (process context id) acceleration) could help reduce performance impact somewhat on recent architectures (4th gen and newer) the impact can still be significant. The CFL-R hardware fixes (thus not needing KVA) may thus prove very important.

On the desktop we also see increased cores (again!) now up to 8 (thus 16 threads with HyperThreading) – double what KBL and SKL brought and matching AMD.

We also see increased clocks, mainly Turbo, but this still allows 1 or 2 cores to boost clocks higher than CFL could and thus help workloads not massively threaded. This can improve responsiveness as single tasks can be run at top speed when there is little thread utilization.

While rated TDP has not changed, in practice we are likely to see increased “real” power consumption especially due to higher clocks – with Turbo pushing power consumption even higher – close to SKL/KBL-X.

In this article we test CPU Core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i9 (9900K) with previous generation (8700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i9-9900K CofeeLake-R
Intel i7-8700K CofeeLake
AMD Ryzen2 2700X Pinnacle Ridge
Intel i9-7900X SkyLake-X
Comments
Cores (CU) / Threads (SP) 8C / 16T 6C / 12T 8C / 16T 10C / 20T We have 33% more cores matching Ryzen and close to mainstream SKL-X!
Speed (Min / Max / Turbo) 0.8-3.6-5GHz

(8x-36x-50x)

0.8-3.7-4.7GHz (8x-37x-47x) 2.2-3.7-4.2GHz (22x-37x-42x) 1.2-3.3-4.3 (12x-33x-43x) Single/Dual core Turbo has now reached 5GHz same as 8086K special edition.
Power (TDP) 95W (135) 95W (131) 105W (135) 140W (308) TDP is the same but overall power consumption likely far higher.
L1D / L1I Caches 8x 32kB 8-way / 8x 32kB 8-way 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way No change in L1 caches. Just more of them.
L2 Caches 8x 256kB 8-way 6x 256kB 8-way 8x 512kB 8-way 10x 1MB 8-way No change in L2 caches. Just more of them.
L3 Caches 16MB 16-way 12MB 16-way 2x 8MB 16-way 13.75MB 11-way L3 has also increased by 33% in line with cores matching Ryzen.
Microcode/Firmware MU069E0C-9E MU069E0A-96 MU8F0802-04 MU065504-49 We have a new stepping and slightly newer microcode.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1809), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i9-9900K CofeeLake-R Intel i7-8700K CofeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 400 [+20%] 291 334 485 In the old Dhrystone integer workload, CFL-R finally beats Ryzen by 20%.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 393 [+17%] 296 335 485 With a 64-bit integer workload – nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 236 [+19%] 170 198 262 Switching to floating-point, CFL-R still beats Ryzen by 20% in old Whetstone.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 196 [+16%] 143 169 223 With FP64 nothing much changes.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, CFL-R handily beats Ryzen by about 20% and is also 33-39% faster than old CFL. It’s “king of the hill”.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1000 [+74%] 741 574 1590 (AVX512) In this vectorised AVX2 integer test CFL-R is almost 2x faster than Ryzen
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 416 [+122%] 305 187 581 (AVX512) With a 64-bit AVX2 integer vectorised workload, CFL-R is now over 2.2x faster than Ryzen.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 6.75 [+16%] 4.9 5.8 7.6 This is a tough test using Long integers to emulate Int128 without SIMD: still CFL-R is fastest.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 927 [+56%] 678 596 1760 (AVX512) In this floating-point AVX/FMA vectorised test, CFL-R is 56% faster than Ryzen.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 544 [+62%] 402 335 533 (AVX512) Switching to FP64 SIMD code, CFL-R increases its lead to 612.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 23.3 [+49%] 16.7 15.6 40.3 (AVX512) In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – CFL-R is 50% faster.
In vectorised SIMD code we know Intel’s SIMD units (that can execute 256-bit instructions in one go) are much more powerful than AMD’s and it shows: Ryzen is soundly beaten by a big 50-100% margin. Naturally it cannot beat its “big brother” SKL-X with AVX512 – which is likely why Intel has not enabled them.
BenchCrypt Crypto AES-256 (GB/s) 17.6 [+9%] 17.8 16.1 23 With AES HWA support all CPUs are memory bandwidth bound; core contention (8 vs 6) means CFL-R scores slightly worse than CFL.
BenchCrypt Crypto AES-128 (GB/s) 17.6 [+9%] 17.8 16.1 23 What we saw with AES-256 just repeats with AES-128.
BenchCrypt Crypto SHA2-256 (GB/s) 12.2 [-34%] 9 18.6 26 (AVX512) With SHA HWA Ryzen2 powers through but CFL-R is only 34% slower.
BenchCrypt Crypto SHA1 (GB/s) 23 [+19%] 17.3 19.3 38 (AVX512) Ryzen also accelerates the soon-to-be-defunct SHA1 but the algorithm is less compute heavy allowing CFL-R to beat it.
BenchCrypt Crypto SHA2-512 (GB/s) 9 [+139%] 6.65 3.77 21 (AVX512) SHA2-512 is not accelerated by SHA HWA, allowing CFL-R to use its SIMD units and be 139% faster.
AES HWA is memory bound and here and CFL-R also enjoys the 3200Mt/s memory – but now feeding 8C / 16T which all contend for the bandwidth. Thus CFL-R does score slighly less than CFL and obviously gets left in the dust by SKL-X with 4 memory channels. Ryzen2 SHA HWA does manage a lonely win but anything SIMD accelerated belongs to Intel.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 276 [+7%] 207 257 309 In this non-vectorised test CFL-R is just a bit faster than Ryzen2.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 240 [+10%] 180 219 277 Switching to FP64 code, nothing much changes, CFL-R is 10% faster
BenchFinance Binomial float/FP32 (kOPT/s) 59.9 [-44%] 47 107 70.5 Binomial uses thread shared data thus stresses the cache & memory system; CFL-R strangely loses by 44%.
BenchFinance Binomial double/FP64 (kOPT/s) 61.9 [+2%] 44.2 60.6 68 With FP64 code Ryzen2’s lead diminishes, CFL is pretty much tied with it.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 56.5 [+4%] 41.6 54.2 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; CFL-R is just 4% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 44.3 [+8%] 32.9 41 50.5 Switching to FP64 CFL-R increases its lead to 8%.
Without SIMD support, CFL-R relies on its thread count increase (thus matching Ryzen2) to beat Ryzen2 by a small amount and lose in one test. But in a test that AMD used to always win (with Ryzen 1/2) Intel now has the lead.
BenchScience SGEMM (GFLOPS) float/FP32 403 [+34%] 385 300 413 (AVX512) In this tough vectorised AVX2/FMA algorithm CFL-R is 35% faster than Ryzen2
BenchScience DGEMM (GFLOPS) double/FP64 269 [+126%] 135 119 212 (AVX512) With FP64 vectorised code, CFL-R is over 2x faster.
BenchScience SFFT (GFLOPS) float/FP32 23.4 [+160%] 24 9 28.6 (AVX512) FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more CFL-R is over 2x faster.
BenchScience DFFT (GFLOPS) double/FP64 11.2 [+41%] 11.9 7.92 14.6 (AVX512) With FP64 code, CFL-R’s lead reduces to 41%.
BenchScience SNBODY (GFLOPS) float/FP32 550 [+96%] 411 280 638 (AVX512) N-Body simulation is vectorised but many memory accesses to shared data but CFL-R is 2x faster than Ryzen2.
BenchScience DNBODY (GFLOPS) double/FP64 172 [+52%] 127 113 195 (AVX512) With FP64 code CFL’s lead reduces to 50% over Ryzen 2.
With highly vectorised SIMD code CFL-R performs very well – dominating Ryzen2: in some tests it is over 2x faster! Then again CFL did not have any issues here either, Intel is just extending their lead…
CPU Image Processing Blur (3×3) Filter (MPix/s) 2270 [+86%] 1700 1220 4540 (AVX512) In this vectorised integer AVX2 workload CFL-R is amost 2x faster than Ryzen2.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 903 [+67%] 675 542 1790 (AVX512) Same algorithm but more shared data reduces the lead to 67% still significant.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 488 [+61%] 362 303 940 (AVX512) Again same algorithm but even more data shared reduces the lead to 61%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 784 [+73%] 589 453 1520 (AVX512) Different algorithm but still AVX2 vectorised workload means CFL-R is 73% faster than Ryzen 2.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 78.6 [+13%] 57.8 69.7 223 (AVX512) Still AVX2 vectorised code but CFL-R stumbles a bit here – but it’s still 13% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 43.4 [+76%] 31.8 24.6 70.8 (AVX512) CFL-R recovers its dominance over Ryzen2.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 4470 [+208%] 3480 1450 3570 (AVX512) CFL-R (like all Intel CPUs) does very well here – it’s a huge 3x faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 614 [+153%] 448 243 909 (AVX512) In this final test, CFL-R is over 2x faster than Ryzen 2.

Adding 2 more cores brings additional performance gains (not to mention the higher Turbo clock) over CFL which showed big gains over the old SKL/KBL again within the same (rated) TDP. Intel never had any problem with SIMD code (AVX/AVX2/FMA3) beating Ryzen2 by a large margin (now over 2x faster) but now also wins pretty much all tests.

It is consistently 33-40% faster than CFL (8700K) in line with core/speed increases which bodes well for compute-heavy code; streaming performance can be lower due to increase core contention for bandwidth and here faster (though more expensive) memory would help.

No – it cannot beat its “older big brother” SKL-X with AVX512 – not to mention increased core/thread count as well as memory channels, but in some tests it is competitive.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1809), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

VM Benchmarks Intel i9-9900K CofeeLake-R Intel i7-8700K CofeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 54.3 [-11%] 41 61 52 .Net CLR integer performance starts off well but CFL-R is 10% slower.
BenchDotNetAA .Net Dhrystone Long (GIPS) 55.8 [-7%] 41 60 54 With 64-bit integers the gap lowers to 7%.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 95.5 [-6%] 78 102 107 Floating-Point CLR performance does not change much, CFL-R is still 6% slower than Ryzen2 despite big gain over old CFL/KBL/SKL.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 126 [+8%] 95 117 137 FP64 performance allows CFL-R to win in the end.
Ryzen 2 performs exceedingly well in .Net workloads – but CFL-R can hold its own and overall it is not significantly slower (under 10%) and even wins 1 test out of 4.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 144 [+30%] 93.5 111 144 Unlike CFL, here CFL-R is 30% faster than Ryzen2.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 139 [+28%] 93.1 109 143 With 64-bit integer workload nothing much changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 499 [+27%] 361 392 585 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code and CFL-R is 27% faster than Ryzen2.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 274 [+26%] 198 217 314 Switching to FP64 SIMD vector code – still running AVX/FMA – CFL-R is still faster.
We see a big improvement in CFL-R even against old CFL: this allows it to soundly beat Ryzen 2 (which used to win this test) by about 30%, a significant margin. It is possible the hardware fixes (“Meltdown”) are having an effect here.
Java Arithmetic Java Dhrystone Integer (GIPS) 614 [+7%] 557 573 877 Java JVM performance starts well with a 7% lead over Ryzen2.
Java Arithmetic Java Dhrystone Long (GIPS) 644 [+16%] 488 553 772 With 64-bit integers, CFL-R doubles its lead to 16%.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 143 [+9%] 101 131 156 Floating-point JVM performance is similar – 9% faster.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 147 [+6%] 103 139 160 With 64-bit precision the lead drops to 6%.
While CFL-R improves by a good amount over CFL (which itself improved greatly over KBL/SKL) and now beats Ryzen2 by a good margin 7-16%.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 147 [+30%] 100 113 140 Without SIMD acceleration we still see CFL-R 30% than Ryzen2 in this integer workload.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 142 [+41%] 89 101 152 Changing to 64-bit integer increases the lead to 41%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 91 [-6%] 64 97 98 With floating-point non-SIMD accelerated Ryzen2 is faster.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 93 [+3%] 64 90 99 With 64-bit floatint-point precision CFL-R is back on top by just 3%.
With compute heavy vectorised code but not SIMD accelerated, CFL-R is still faster or at least ties with Ryzen2.

CFL-R now beats or at least matches Ryzen2 in VM tests the latter used to easily win. It may not have the lead native SIMD vectorised code allows it – but has no trouble keeping up – unlike its older SKL/KBL “brothers”.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

While Intel had finally increased core counts with CFL (after many years) in light of new competition from AMD with Ryzen/2 – it still relied on fewer but more powerful (at least in SIMD) cores to compete. CFL-R finally brings core (8 vs 8) and thread (16 vs 16) parity with the competition to ensure domination.

And with few exceptions – that’s what it has achieved – 9900K is the fastest desktop CPU at this time (November 2018) and if you can afford it you can upgrade on the same, old, 200-series mainboards (though naturally 300-series is recommended). The performance improvement (33-40% over 8700K) is pretty significant to upgrade – again if you can afford it – considering it is a “mere refresh”.

The “Meltdown” fixes in hardware are also likely to bring big improvement in I/O workloads – or to be precise – restore the performance loss of OS mitigations (KVA) that have been deployed this year (2018). Still, in very rough terms, now you don’t have to decide between “speed” and “security” – though perhaps KVA should be used by default just in case any CPU (not just Intel) leaks information between user/kernel spaces by a yet undiscovered side-channel vulnerability.

But despite the “i9” moniker – don’t think you’re getting a workstation-class CPU on the cheap: SKL-X not only (still) has more cores/threads and 2x memory channels but also supports AVX512 beating it soundly. It will also be refreshed soon – sporting the same “Meltdown” in-hardware fixes. But again considering the costs (almost 2x) CFL-R is likely the performance/price winner on most workloads.

For now, on the desktop, the 9900K is “king of the hill”!

Intel Core i7 8700K, 9900K CofeeLake Review & Benchmarks – UHD 630 GPGPU Performance

What is “CofeeLake” CFL?

The 8th generation Intel Core architecture is code-named “CofeeLake” (CFL): unlike previous architectures, it is a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). As before, the CPUs contain an integrated GPU (with compute support aka GPGPU).

While originally Intel integrated graphics were not much use – starting with SNB (“SandyBridge”) and especially its GPGPU-capable successor IVB (“IvyBridge”) the integrated graphics units made large progress, with HSW (“Haswell”) introducing powerful many compute units (GT3+) and esoteric L4 cache (eDRAM) versions (“CrystallWell) supporting high-end features like FP64 (native 64-bit floating-point support) and zero-copy CPU <> GPU transfers.

Alas, while the features remained, the higher-end versions (GT3, GT4e) never became mainstream and pretty much disappeared – except very high-end ULV/H SKUs with top-end desktop CPUs like 6700K, 8700K, etc. tested here stuck with the low-end GT2 versions. Perhaps nobody in their right mind would use such CPUs without a dedicated external (GP)GPU, it is still interesting to see how the GPU core has evolved in time.

Also let’s not forget that on the mobile platforms (either ULV/Y even H) most laptops/tablets do not have dedicated GPU and rely solely on integrated graphics – and here naturally UHD630 performance matters.

Hardware Specifications

We are comparing the graphics units of to-of-the-range Intel CPUs with low-end dedicated cards to determine whether they are good enough for modest use, especially for compute (GPGPU) use supporting the CPU.

GPGPU Specifications Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
Arch Chipset GT2 / EV9.5 GT2 / EV9 GP108 / SM6.1 UHD6xx is just a minor revision of the HD5xx video core.
Cores (CU) / Threads (SP) 24 / 192 24 / 192 3 / 384 No change in core / SP units.
ROPs / TMUs 8 / 16 8 / 16 16 / 24 No change in ROP/TMUs either.
Speed (Min-Turbo) 350-1200 350-1150 300-1.26-1.52 Turbo speed is only slightly increased.
Power (TDP) 95W 91W 35W TDP has gone up a bit but nothing major.
Constant Memory 3.2GB 3.2GB 64kB (dedicated) There is no dedicated constant memory thus a large chunk is available to use (GB) unlike a dedicated video card with very fast but small (kB).
Shared (Local) Memory 64kB 64kB 48kB (dedicated) Bigger than usual shared/local memory but slow (likely non dedicated).
Global Memory 7GB (of 16GB) 7GB (of 16GB) 2GB About 50% of main memory can be used as global memory – thus pretty large workloads can be run.
Memory System DDR4 3200Mt/s 128-bit DDR4 2533Mt/s 128-bit GDDR5 6Gt/s 64-bit CFL can reliably run at faster data rates thus 630 benefits too.
Memory Bandwidth (GB/s)
50 40 48 The high data rate of DDR4 can result in higher bandwidth than some dedicated cards.
L2 Cache 512kB 512kB 48kB L2 is unchanged and reasonably large.
FP64/double ratio Yes, 1/8 Yes, 1/8 Yes, 1/32 FP64 is supported and at good ration compared to gimped dedicated cards.
FP16/half ratio
Yes, 2x Yes, 2x Yes, 1/64 FP16 is also now supported at twice the rate – again unlike gimped dedicated cards.

Processing Performance

We are testing both OpenCL performance using the latest SDK / libraries / drivers from both Intel and competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 1150 [+7%] 1070 1660 Thanks to FP16 support we see double the performance over FP32 and thus only 50% slower than dedicated 1030.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 584 [+9%] 535 1660 630 is almost 10% faster than old 530 but still about 1/3 of a dedicated 1030.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 151 [+9%] 138 72.8 FP64 sees a similar delta (+9%) but much faster (2x) than a dedicated 1030 due to gimped FP64 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 7.84 [+5%] 7.46 2.88 Emulated FP128 precision depends entirely on FP64 performance and much better (3x) than gimped dedicated.
UHD630 is about 5-9% faster than 520, not much to celebrate – but due to native FP16 and especially FP64 support it can match or even overtake low-end dedicated GPUs – a pretty surprising result! If only we had more cores, it may actually be very much competitive.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 1 [+5%] 0.954 4.37 We see a 5% improvement for 630 0 but far lower performance than a dedicated GPU.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.3 [+6] 1.23 5.9 Nothing changes here , we see a 6% improvement.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 3.6 [+3%] 3.5 18.4 In this heavy integer workload, the improvement falls to just 3% – but a dedicated unit would be about 4x faster.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 8.18 [+2%] 8 24 Nothing much changes here, we see a 2% improvement.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 1.3 [+2%] 1.27 7.8 With 64-bit integer workload, same improvement of just 2% but now the 1030 is about 6x faster!
Nobody will be using integrated graphics for crypto-mining any time soon, we see a very minor improvement in 639 vs old 530, but overall low performance versus dedicated graphics like a 1030 which would be 4-6x faster. We would need 3x more cores to compete here.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1180 [+21%] 977 1320 In this FP32 financial workload we see a good 21% improvement vs. old 530. Also good result vs. dedicated 1030.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 180 [+2%] 175 137 Switching to FP64 code, the difference is next to nothing but better than a gimped 1030.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 111 [+12%] 99 255 Binomial uses thread shared data thus stresses the internal memory sub-system, and here 630 is 12% faster. But 1/2 the performance of a 1030.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 22.3 [+4%] 21.5 14 With FP64 code the improvement drops to 4%.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 298 [+2%] 291 617 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – strangely we see only 2% improvement and again 1/2 1030 performance.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 43.4 [+2%] 42.5 28 Switching to FP64 we see no changes. But almost 2x performance over a 1030.
You can run financial analysis algorithms with decent performance on an UHD630 – just as you could on the old 530 – and again better FP64 performance than dedicated – (GT 1030) a pretty impressive result. Naturally, you can just use the powerful CPU cores instead…
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 143 [+4%] 138 685 Using 32-bit precision 630 improves 4% but is almost 1/5 (5 times slower) than a 1030.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 55.5 [+3%] 53.7 35 With FP64 precision, the delta does not change but now 640 is amost 2x faster than a 1030.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 39.6 [+20%] 33 37 FFT is memory access bound and here 630’s faster DDR4 memory gives it a 20% lead.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 9.3 [+16%] 8 20 We see a similar improvement with FP64 about 16%.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 272 [+2%] 266 637 Back to normality with this algorithm – we see just 2% improvement.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 27.7 [+3%] 26.9 32 With FP64 precision, nothing much changes.
The scientific scores are similar to financial ones – except the memory access heavy FFT which greatly benefits from better memory  (if that is provided of course) but this a dedicated card (like the 1030) is much faster in FP32 mode but again the 630 can be 2x faster in FP64 mode. Again, you’re much better off using the CPU and its powerful SIMD units for these algorithms.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 592 [+10%] 536 1620 In this 3×3 convolution algorithm, we see a 10% improvement over the old 530. But about 1/3x performance of a 1030.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 128 [+9%] 117 637 Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 133 [+9%] 122 391 With even more data the gap remains the same.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 127 [+9%] 116 368 Still convolution but with 2 filters – still 9% better.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 9.2 [+10%] 8.4 7.3 Different algorithm does not change much still 10% better.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 10.6 [+9%] 9.7 4.08 Without major processing, 630 improves by the same amount.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 1640 [+2%] 1600 2350 This algorithm is 64-bit integer heavy thus we fall to the “usual” 2% improvement.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 550 [+2%] 538 849 One of the most complex and largest filters, sees the same 2% improvement.
For image processing using FP32 precision 630 performs a bit better than usual, 10% faster across the board compared to the old 530 – but still about 1/3 (third) the speed of a dedicated 1030. But if you can make do with FP16 precision image processing, then we almost double performance.

Memory Performance

We are testing both OpenCL performance using the latest SDK / libraries / drivers from both Intel and competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 36.4 [+21%] 30 38.5 Due to higher speed DDR4 memory, the 630 manages 21% better bandwidth than the 620 – and comparable to a 64-bit bus dedicated card.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 17.9 [+29%] 13.9 3 (PCIe3 x4) The CPU<>GPU internal link seems to have 30% more bandwidth – naturally zero transfers are also supported. And a lot better than a dedicated card on PCIe3 x4 (4 lanes).
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 17.9 [+35%] 13.3 3 (PCIe3 x4) Here again we see a good 35% bandwidth improvement.
CFL’s higher (stable) memory speed support improves bandwidth between 20-35% – which is likely behind most benchmark improvement in the compute algorithms above. However, that will only happen if high-speed DDR4 memory (3200 or faster) were to be used – an expensive proposition! eDRAM would greatly help here…
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 179 [+1%] 178 223 No changes in global latencies in-page showing no memory sub-system improvements.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 268 [-19%] 332 244 Due to faster memory clock (even with slightly increased timings) full random access latencies fall by 20% (similar to bandwidth increase).
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 126 [-5%] 132 76 Sequential access latencies do fall by a minor 5% as well though.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 181 [-6%] 192 92.5 Intel’s GPGPU don’t have dedicated constant memory thus we see similar performance to global memory.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 72 [-1%] 73 16.6 Shared memory latency is unchanged – and quite slow compared to architectures from competitors like the 1030.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 138 [-9%] 151 220 Texture access latencies do seem to show a 9% improvement a surprising result.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 227 [-16%] 270 242 Just as we’ve seen with global (full range access) latencies, we see the best improvement about 16% here.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 45 [=] 45 71.9 With sequential access we see no improvement.
Anything to do with main memory access (aka “full random access”) does show a similar improvement to bandwidth increases, i.e. between 16-19% due to higher speed (but somewhat higher timings) main memory. All other access patterns show little to no improvements.

When using higher speed DDR4 memory – as we do here (3200 vs 2533) UHD630 shows a good improvement in both bandwidth and reduced latencies – but otherwise it performs just the same as the old HD520 – not a surprise really. At least you can see that your (expensive) memory investment does not go to waste – with memory bound algorithms showing good improvement.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For GPGPU workloads, UHD630 does not bring anything new – it performs similarly to the old HD520. But as CFL can use higher (stable) memory, bandwidth and latencies are improved (when using such higher speed memory) and thus most algorithms do show good improvements. Naturally as long as you can afford to provide such memory.

The surprising support for 1/2 ratio native FP64 support means 64-bit floating-point algorithms can run faster than on a typical low-end graphics card (as despite also supporting native FP64 the ratio is 1/32 vs. FP32 rate)  so high accuracy workloads do work well on it. If loss of accuracy is OK (e.g. picture processing) native FP16 support at 2x rate makes such algorithms almost 2x faster and thus within the performance of a typical low-end graphics card (that either don’t support FP16 or their ratio is 1/64!).

As we touched in the introduction – this may not matter on desktop – but on mobile where most laptops/tablets use the integrated graphics any and all such improvements can make a big difference. While in the past the fast-improving EV cores became performance competitive with CPU cores (as there were only 2 ULV ones) – with CFL doubling number of CPU cores (4 vs. 2) it is likely that internal graphics (GPGPU) performance is now too low.

We’re sad that the GT3/GT4 versions are not common-place not to mention the L4/eDRAM which showed so much promise in the HSW days.

But Intel has recently revamped its GPU division and are committed to release dedicated (not just internal) graphics in a few years (2020?) which hopefully means we should see far more powerful GPUs from them soon.

Let’s hope they do see the light-of-day and are not cancelled like the “Phi” GPGPU accelerators (“Knights Landing”) which showed so much promise but somehow never made it outside data centres before sailing into the sunset…