Intel Core i7 8700K, i9 9900K CoffeeLake Review & Benchmarks – UHD 630 GPGPU Performance

What is “CoffeeLake” (CFL)?

The 8th generation Intel Core architecture is code-named “CoffeeLake” (CFL): unlike earlier generations, it is only a minor stepping of the 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). As before, the CPUs contain an integrated GPU (with compute support, aka GPGPU).

While early Intel integrated graphics were not much use, starting with SNB (“SandyBridge”) – and especially its GPGPU-capable successor IVB (“IvyBridge”) – the integrated graphics units made great progress, with HSW (“Haswell”) introducing powerful versions with many more compute units (GT3+) and esoteric L4-cache (eDRAM) versions (“CrystalWell”), supporting high-end features like native FP64 (64-bit floating-point) and zero-copy CPU <> GPU transfers.

Alas, while the features remained, the higher-end versions (GT3, GT4e) never became mainstream and have pretty much disappeared – save for a few very high-end ULV/H SKUs – while top-end desktop CPUs like the 6700K and 8700K tested here are stuck with the low-end GT2 version. While perhaps nobody in their right mind would use such CPUs without a dedicated (GP)GPU, it is still interesting to see how the GPU core has evolved over time.

Also, let’s not forget that on mobile platforms (whether ULV/Y or even H) most laptops/tablets do not have a dedicated GPU and rely solely on integrated graphics – and there UHD630 performance naturally matters.

Hardware Specifications

We are comparing the graphics units of top-of-the-range Intel CPUs with a low-end dedicated card to determine whether they are good enough for modest use, especially for compute (GPGPU) work supporting the CPU.

GPGPU Specifications Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
Arch Chipset GT2 / EV9.5 GT2 / EV9 GP108 / SM6.1 UHD6xx is just a minor revision of the HD5xx video core.
Cores (CU) / Threads (SP) 24 / 192 24 / 192 3 / 384 No change in core / SP units.
ROPs / TMUs 8 / 16 8 / 16 16 / 24 No change in ROP/TMUs either.
Speed (Min-Turbo) (MHz) 350-1200 350-1150 300 / 1260-1520 Turbo speed is only slightly increased.
Power (TDP) 95W 91W 35W TDP has gone up a bit but nothing major.
Constant Memory 3.2GB 3.2GB 64kB (dedicated) There is no dedicated constant memory, so a large chunk of system memory (GB) can be used – unlike a dedicated video card which has very fast but small (kB) constant memory.
Shared (Local) Memory 64kB 64kB 48kB (dedicated) Larger than usual shared/local memory, but slower (likely not dedicated).
Global Memory 7GB (of 16GB) 7GB (of 16GB) 2GB About 50% of main memory can be used as global memory – thus pretty large workloads can be run (these sizes can be queried at run-time, see the sketch after the table).
Memory System DDR4 3200MT/s 128-bit DDR4 2533MT/s 128-bit GDDR5 6GT/s 64-bit CFL can reliably run at faster memory data rates, thus the 630 benefits too.
Memory Bandwidth (GB/s) 50 40 48 The high data rate of DDR4 can result in higher bandwidth than some dedicated cards.
L2 Cache 512kB 512kB 48kB L2 is unchanged and reasonably large.
FP64/double ratio Yes, 1/8 Yes, 1/8 Yes, 1/32 FP64 is supported and at a good ratio compared to gimped dedicated cards.
FP16/half ratio Yes, 2x Yes, 2x Yes, 1/64 FP16 is also supported, at twice the FP32 rate – again unlike gimped dedicated cards.
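Most of the memory-related figures above (global, local and constant memory sizes) as well as FP64/FP16 support are reported by the OpenCL runtime itself, so they can be confirmed on any machine. Below is a minimal C sketch along those lines (our own illustration, not Sandra code; it assumes an OpenCL SDK and ICD are installed):

    /* caps.c – query memory sizes and FP64/FP16 support of the first GPU device.
       Build example (assumption): gcc caps.c -lOpenCL */
    #define CL_TARGET_OPENCL_VERSION 120
    #include <stdio.h>
    #include <string.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        char name[256] = "", ext[8192] = "";
        cl_ulong gmem = 0, lmem = 0, cmem = 0;
        cl_uint cus = 0;

        if (clGetPlatformIDs(1, &plat, NULL) != CL_SUCCESS ||
            clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL) != CL_SUCCESS) {
            fprintf(stderr, "no OpenCL GPU device found\n");
            return 1;
        }

        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(gmem), &gmem, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(cmem), &cmem, NULL);

        printf("%s: %u CUs, global %lluMB, local %llukB, constant %llukB\n",
               name, cus, (unsigned long long)(gmem >> 20),
               (unsigned long long)(lmem >> 10), (unsigned long long)(cmem >> 10));
        printf("FP64 (cl_khr_fp64): %s\n", strstr(ext, "cl_khr_fp64") ? "yes" : "no");
        printf("FP16 (cl_khr_fp16): %s\n", strstr(ext, "cl_khr_fp16") ? "yes" : "no");
        return 0;
    }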

Processing Performance

We are testing OpenCL processing performance using the latest SDKs / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 1150 [+7%] 1070 1660 Thanks to FP16 support we see double the performance of FP32 – and thus only about 30% slower than the dedicated 1030.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 584 [+9%] 535 1660 The 630 is almost 10% faster than the old 530 but still only about 1/3 the speed of a dedicated 1030.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 151 [+9%] 138 72.8 FP64 sees a similar delta (+9%) but is much faster (2x) than a dedicated 1030, due to the latter’s gimped FP64 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 7.84 [+5%] 7.46 2.88 Emulated FP128 precision depends entirely on FP64 performance and is thus much better (almost 3x) than the gimped dedicated card.
The UHD630 is about 5-9% faster than the 530 – not much to celebrate – but due to native FP16 and especially FP64 support it can match or even overtake low-end dedicated GPUs, a pretty surprising result! With more cores it could actually be very competitive.
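To make the FP16 gain concrete: a Mandelbrot-type benchmark is pure arithmetic per pixel, so halving the data width roughly doubles the SIMD throughput. Below is a simplified OpenCL C kernel in that spirit (our own sketch under those assumptions – not Sandra’s actual kernel):

    // Simplified Mandelbrot iteration count per pixel, FP32 version (sketch).
    __kernel void mandel_fp32(__global uint* out, const uint width, const uint height,
                              const float scale, const uint max_iter)
    {
        const uint px = get_global_id(0), py = get_global_id(1);
        if (px >= width || py >= height) return;

        const float cx = (px - width  * 0.5f) * scale;    // map pixel to complex plane
        const float cy = (py - height * 0.5f) * scale;
        float x = 0.0f, y = 0.0f;
        uint i = 0;
        while (i < max_iter && (x * x + y * y) <= 4.0f) { // escape-time iteration
            const float xt = x * x - y * y + cx;
            y = 2.0f * x * y + cy;
            x = xt;
            ++i;
        }
        out[py * width + px] = i;
    }

    // FP16 variant (needs the cl_khr_fp16 extension reported by EV9/EV9.5):
    //   #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    // then the same loop with 'half' instead of 'float', trading precision
    // (zoom depth) for roughly double the arithmetic rate.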
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 1 [+5%] 0.954 4.37 We see a 5% improvement for the 630 but far lower performance than a dedicated GPU.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.3 [+6%] 1.23 5.9 Nothing changes here; we see a 6% improvement.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 3.6 [+3%] 3.5 18.4 In this heavy integer workload the improvement falls to just 3% – and a dedicated unit is about 5x faster.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 8.18 [+2%] 8 24 Nothing much changes here, we see a 2% improvement.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 1.3 [+2%] 1.27 7.8 With 64-bit integer workload, same improvement of just 2% but now the 1030 is about 6x faster!
Nobody will be using integrated graphics for crypto-mining any time soon: we see only a very minor improvement in the 630 vs. the old 530, and overall low performance versus dedicated graphics like the 1030, which is 4-6x faster. We would need about 3x more cores to compete here.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1180 [+21%] 977 1320 In this FP32 financial workload we see a good 21% improvement vs. the old 530 – and a good result vs. the dedicated 1030 too.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 180 [+2%] 175 137 Switching to FP64 code the improvement is next to nothing, but performance is still better than the gimped 1030.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 111 [+12%] 99 255 Binomial uses thread-shared data and thus stresses the internal memory sub-system; here the 630 is 12% faster – but still about 1/2 the performance of a 1030.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 22.3 [+4%] 21.5 14 With FP64 code the improvement drops to 4%.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 298 [+2%] 291 617 Monte-Carlo also uses thread-shared data but read-only, reducing write pressure – strangely we see only a 2% improvement and again about 1/2 the 1030’s performance.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 43.4 [+2%] 42.5 28 Switching to FP64 we see no change – but about 50% higher performance than a 1030.
You can run financial analysis algorithms with decent performance on the UHD630 – just as you could on the old 530 – and again with better FP64 performance than the dedicated GT 1030, a pretty impressive result. Naturally, you can just use the powerful CPU cores instead…
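For reference, the per-option arithmetic in a Black-Scholes FP32 test maps to a very small kernel – the FP64 variant is the same code with double (and cl_khr_fp64 enabled). The OpenCL C sketch below is our own illustration of the standard closed-form call price, not Sandra’s implementation:

    // Black-Scholes European call price, one option per work-item (FP32 sketch).
    // cnd() is the standard normal CDF expressed via the built-in erf().
    inline float cnd(float x) { return 0.5f * (1.0f + erf(x * M_SQRT1_2_F)); }

    __kernel void black_scholes_fp32(__global const float* S,  // spot prices
                                     __global const float* K,  // strike prices
                                     __global const float* T,  // times to expiry
                                     const float r,            // risk-free rate
                                     const float v,            // volatility
                                     __global float* call)     // output prices
    {
        const size_t i = get_global_id(0);
        const float sqrtT = sqrt(T[i]);
        const float d1 = (log(S[i] / K[i]) + (r + 0.5f * v * v) * T[i]) / (v * sqrtT);
        const float d2 = d1 - v * sqrtT;
        call[i] = S[i] * cnd(d1) - K[i] * exp(-r * T[i]) * cnd(d2);
    }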
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 143 [+4%] 138 685 Using 32-bit precision the 630 improves by 4% but is still almost 5 times slower than a 1030.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 55.5 [+3%] 53.7 35 With FP64 precision the delta does not change, but now the 630 is about 1.6x faster than the 1030.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 39.6 [+20%] 33 37 FFT is memory-access bound and here the 630’s faster DDR4 memory gives it a 20% gain over the 530 – enough to edge past the 1030.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 9.3 [+16%] 8 20 We see a similar improvement with FP64, of about 16%.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 272 [+2%] 266 637 Back to normality with this algorithm – we see just 2% improvement.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 27.7 [+3%] 26.9 32 With FP64 precision, nothing much changes.
The scientific scores are similar to the financial ones – except for the memory-access-heavy FFT, which greatly benefits from the faster memory (if that is provided, of course). A dedicated card like the 1030 is much faster in FP32 mode, but the 630 can still come out ahead in FP64 mode (e.g. DGEMM). Again, you’re much better off using the CPU and its powerful SIMD units for these algorithms.
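To put the GEMM numbers in context, the kernel at its core is a fused multiply-add loop. A textbook, non-tiled OpenCL C version is sketched below (our own illustration – Sandra’s kernel is certainly tiled/optimised, which is how dedicated cards get closer to peak):

    // Naive single-precision matrix multiply: C[MxN] = A[MxK] * B[KxN].
    // One work-item computes one element of C; a tuned version would tile
    // A and B through __local memory to reduce global-memory traffic.
    __kernel void sgemm_naive(__global const float* A,
                              __global const float* B,
                              __global float* C,
                              const uint M, const uint N, const uint K)
    {
        const uint col = get_global_id(0);   // 0..N-1
        const uint row = get_global_id(1);   // 0..M-1
        if (row >= M || col >= N) return;

        float acc = 0.0f;
        for (uint k = 0; k < K; ++k)
            acc = fma(A[row * K + k], B[k * N + col], acc);
        C[row * N + col] = acc;
    }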
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 592 [+10%] 536 1620 In this 3×3 convolution algorithm we see a 10% improvement over the old 530 – but only about 1/3 the performance of a 1030.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 128 [+9%] 117 637 Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 133 [+9%] 122 391 With even more data the gap remains the same.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 127 [+9%] 116 368 Still convolution but with 2 filters – still 9% better.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 9.2 [+10%] 8.4 7.3 A different algorithm does not change much – still 10% better, and here even ahead of the 1030.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 10.6 [+9%] 9.7 4.08 Without major processing, 630 improves by the same amount.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 1640 [+2%] 1600 2350 This algorithm is 64-bit integer heavy thus we fall to the “usual” 2% improvement.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 550 [+2%] 538 849 One of the most complex and largest filters, sees the same 2% improvement.
For image processing using FP32 precision the 630 performs a bit better than usual – about 10% faster across the board compared to the old 530 – but still roughly 1/3 the speed of a dedicated 1030. If you can make do with FP16 precision image processing, performance almost doubles.
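All the filters above are variations of the same pattern: a small convolution window around each pixel. A minimal single-channel 3×3 box-blur sketch in OpenCL C (our illustration, with clamped borders; an FP16 version would simply use half storage):

    // 3x3 box blur, single-channel FP32 (our sketch, not Sandra's filter).
    __kernel void blur3x3(__global const float* src, __global float* dst,
                          const int width, const int height)
    {
        const int x = get_global_id(0);
        const int y = get_global_id(1);
        if (x >= width || y >= height) return;

        float sum = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                const int sx = clamp(x + dx, 0, width  - 1);  // clamp at borders
                const int sy = clamp(y + dy, 0, height - 1);
                sum += src[sy * width + sx];
            }
        dst[y * width + x] = sum * (1.0f / 9.0f);
    }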

Memory Performance

We are testing OpenCL memory performance using the latest SDKs / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (8700K, 9900K) Intel HD 530 (6700K) nVidia GT 1030 Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 36.4 [+21%] 30 38.5 Due to higher-speed DDR4 memory, the 630 manages 21% better bandwidth than the 530 – comparable to a 64-bit-bus dedicated card.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 17.9 [+29%] 13.9 3 (PCIe3 x4) The CPU <> GPU internal link has about 30% more bandwidth – and naturally zero-copy transfers are also supported. A lot better than a dedicated card on PCIe3 x4 (4 lanes).
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 17.9 [+35%] 13.3 3 (PCIe3 x4) Here again we see a good 35% bandwidth improvement.
CFL’s support for higher (stable) memory speeds improves bandwidth by 20-35% – which is likely behind most of the benchmark improvements in the compute algorithms above. However, that only happens if high-speed DDR4 memory (3200 or faster) is used – an expensive proposition! eDRAM would greatly help here…
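The upload/download advantage comes from the iGPU sharing system RAM with the CPU: suitably allocated buffers can be mapped rather than copied (“zero-copy”). The host-side C fragment below shows the usual map/unmap pattern (a sketch only – ‘ctx’ and ‘queue’ are assumed to have been created earlier, and CL_MEM_ALLOC_HOST_PTR is the commonly recommended flag for integrated GPUs):

    /* Zero-copy style buffer use on an iGPU (sketch; ctx/queue assumed created). */
    size_t bytes = 1 << 20;
    cl_int err = CL_SUCCESS;

    /* Let the runtime allocate host-visible memory it can share with the iGPU. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

    /* Map for writing on the CPU - on an iGPU this should not copy the data. */
    float* p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                          0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < bytes / sizeof(float); ++i)
        p[i] = (float)i;                     /* fill directly in place */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    /* ... enqueue kernels using 'buf' as an argument ... */

    /* Map again to read results back without an explicit copy. */
    p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                   0, bytes, 0, NULL, NULL, &err);
    /* ... consume p ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    clReleaseMemObject(buf);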
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 179 [+1%] 178 223 No change in in-page global latencies, showing no memory sub-system improvements.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 268 [-19%] 332 244 Due to faster memory clock (even with slightly increased timings) full random access latencies fall by 20% (similar to bandwidth increase).
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 126 [-5%] 132 76 Sequential access latencies do fall by a minor 5% as well though.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 181 [-6%] 192 92.5 Intel’s GPUs don’t have dedicated constant memory, thus we see similar performance to global memory.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 72 [-1%] 73 16.6 Shared memory latency is unchanged – and quite slow compared to architectures from competitors like the 1030.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 138 [-9%] 151 220 Texture access latencies do show a 9% improvement – a surprising result.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 227 [-16%] 270 242 Just as with global (full-range access) latencies, we see the best improvement here, of about 16%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 45 [=] 45 71.9 With sequential access we see no improvement.
Anything involving main memory access (i.e. “full range random access”) shows an improvement similar to the bandwidth increase, between 16-19%, due to the higher-speed (but somewhat looser-timed) main memory. All other access patterns show little to no improvement.
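Latency for these access patterns is typically measured with a dependent “pointer-chase”, where each load produces the index of the next one so the hardware cannot overlap them. A kernel sketch in OpenCL C (our illustration; the host would fill ‘next’ with a sequential, in-page-random or full-range-random permutation, launch a single work-item and time it with profiling events):

    // Dependent pointer-chase: each load's result is the next index, so loads
    // serialise and (total time / count) approximates the access latency.
    __kernel void chase(__global const uint* next, const uint count,
                        __global uint* sink)
    {
        uint idx = 0;
        for (uint i = 0; i < count; ++i)
            idx = next[idx];   // latency-bound, cannot be overlapped
        *sink = idx;           // keep the chain live so it is not optimised away
    }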

When using higher-speed DDR4 memory – as we do here (3200 vs. 2533) – the UHD630 shows a good improvement in both bandwidth and latency; otherwise it performs just the same as the old HD530 – not a surprise really. At least your (expensive) memory investment does not go to waste, with memory-bound algorithms showing good improvement.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For GPGPU workloads the UHD630 does not bring anything new – it performs similarly to the old HD530. But as CFL can use higher (stable) memory speeds, bandwidth and latencies are improved (when such higher-speed memory is used) and thus most algorithms do show good improvements – naturally, as long as you can afford to provide such memory.

The surprisingly good native FP64 ratio (1/8 of the FP32 rate, vs. 1/32 on typical low-end cards that also support native FP64) means 64-bit floating-point algorithms can run faster than on a typical low-end graphics card, so high-accuracy workloads work well on it. If loss of accuracy is acceptable (e.g. picture processing), native FP16 support at 2x rate makes such algorithms almost twice as fast and thus within the performance range of a typical low-end graphics card (which either does not support FP16 or runs it at a 1/64 ratio!).

As we touched on in the introduction, this may not matter much on the desktop – but on mobile, where most laptops/tablets use the integrated graphics, any and all such improvements can make a big difference. While in the past the fast-improving EV graphics cores became performance-competitive with the CPU cores (when there were only 2 ULV CPU cores), with CFL doubling the number of CPU cores (4 vs. 2) the integrated graphics (GPGPU) performance is now likely too low by comparison.

We’re sad that the GT3/GT4 versions never became commonplace, not to mention the L4/eDRAM which showed so much promise in the HSW days.

But Intel has recently revamped its GPU division and is committed to releasing dedicated (not just integrated) graphics in a few years (2020?), which hopefully means we should see far more powerful GPUs from them soon.

Let’s hope they do see the light of day and are not cancelled like the “Phi” GPGPU accelerators (“Knights Landing”), which showed so much promise but somehow never made it outside data centres before sailing into the sunset…
