AVX512 Improvement for Icelake Mobile (i7-1065G7 ULV)

What is AVX512?

AVX512 (Advanced Vector eXtensions 512) is the 512-bit SIMD instruction set that follows on from the previous 256-bit AVX/AVX2/FMA instruction sets. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators, it next appeared on the HEDT platform with Skylake-X (SKL-X/EX/EP), but until now it was not available on mainstream platforms.

With the 10th “real” generation Core arch(itecture) (IceLake/ICL), we finally see “enhanced” AVX512 on the mobile platform, which includes all the original extensions and quite a few new ones.

Original AVX512 extensions as supported by SKL/KBL-X HEDT processors:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit.
  • AVX512-DQ – Double-Word & Quad-Word – most 32-bit and 64-bit integer instructions widened to 512-bit.
  • AVX512-BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit.
  • AVX512-VL – Vector Length eXtensions – allows most AVX512 instructions to operate on the previous 256-bit and 128-bit SIMD registers.
  • AVX512-CD* – Conflict Detection – loop vectorisation through predication [only on Xeon/Phi co-processors].
  • AVX512-ER* – Exponential & Reciprocal – transcendental operations [only on Xeon/Phi co-processors].
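To give a flavour of what the Foundation extension adds beyond sheer width, here is a minimal sketch (our own example, not Sandra code) of the per-lane predication that the new opmask registers enable – the same mechanism that CD-style loop vectorisation builds on. It assumes a compiler with -mavx512f and, for brevity, only handles full 16-lane blocks:

```cpp
// Hedged sketch: clamp negative values to zero using an AVX512F opmask.
// Where AVX2 needs a compare + blend, AVX512 predicates the operation itself.
// Compile with: -mavx512f
#include <immintrin.h>
#include <cstddef>

void zero_negatives(float *data, size_t n) {
    for (size_t i = 0; i + 16 <= n; i += 16) {  // tail handling omitted
        __m512 v = _mm512_loadu_ps(data + i);
        // k has bit set for every lane that is negative
        __mmask16 k = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_LT_OQ);
        // overwrite only the flagged lanes, leave the rest untouched
        v = _mm512_mask_mov_ps(v, k, _mm512_setzero_ps());
        _mm512_storeu_ps(data + i, v);
    }
}
```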

New AVX512 extensions supported by ICL processors:

  • AVX512-VNNI** (Vector Neural Network Instructions) [also supported by updated CPL-X HEDT]
  • AVX512-VBMI, VBMI2 (Vector Byte Manipulation Instructions)
  • AVX512-BITALG (Bit Algorithms)
  • AVX512-IFMA (Integer FMA)
  • AVX512-VAES (Vector AES) accelerating crypto
  • AVX512-GFNI (Galois Field)
  • AVX512-GNA (Gaussian Neural Accelerator)

As with anything, simply doubling register widths does not automagically double performance: dependencies, memory load/store latencies and even data characteristics limit the gains; some extensions may require future arch updates or tools to realise their true potential.

SIMD FMA Units: unlike HEDT/server processors, ICL ULV (and likely desktop) has a single 512-bit FMA unit, not two (2): the execution rate (without dependencies) is thus similar for AVX512 and AVX2/FMA code. However, future versions are likely to add execution units, and AVX512 code will then benefit even more.
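As a concrete illustration of both points (a hedged sketch of our own, not Sandra's kernels): with a single 512-bit FMA port, a dot product limited by one serial accumulator runs at the FMA latency, while splitting it into independent accumulators lets the unit issue every cycle – the extra width only pays off once dependencies are out of the way:

```cpp
// Sketch: why register width alone does not set throughput. One serial
// accumulator stalls on the FMA latency (several cycles); two independent
// dependency chains keep ICL's single 512-bit FMA port busy every cycle.
// Compile with: -mavx512f
#include <immintrin.h>
#include <cstddef>

float dot(const float *a, const float *b, size_t n) {   // n multiple of 32
    __m512 acc0 = _mm512_setzero_ps(), acc1 = _mm512_setzero_ps();
    for (size_t i = 0; i + 32 <= n; i += 32) {
        acc0 = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                               _mm512_loadu_ps(b + i), acc0);
        acc1 = _mm512_fmadd_ps(_mm512_loadu_ps(a + i + 16),
                               _mm512_loadu_ps(b + i + 16), acc1);
    }
    // combine the chains, then horizontal-sum the 16 lanes
    return _mm512_reduce_add_ps(_mm512_add_ps(acc0, acc1));
}
```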

In this article we test AVX512 core performance; please see our other articles on:

Native SIMD Performance

We are testing native SIMD performance using the various instruction sets (AVX512, AVX2/FMA3, AVX) to determine the gains the new instruction set brings.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks ICL ULV AVX512 ICL ULV AVX2/FMA3 Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 504 [+25%] 403 For integer workloads we manage a 25% improvement – not quite the 100% we were hoping for, but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 145 [+1%] 143 With a 64-bit integer workload the improvement reduces to just 1%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 3.67 3.73 [-2%] – [No SIMD in use here]
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 414 [+22%] 339 In this floating-point test we see a 22% improvement, similar to integer.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 232 [+20%] 194 Switching to FP64 we see a similar improvement.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 10.17 [+13%] 9 In this heavy algorithm using FP64 to mantissa-extend FP128 we see only a 13% improvement.
With limited SIMD execution resources, AVX512 cannot bring a 100% improvement, but it still manages 20-25% over AVX2/FMA, which is decent; also consider this is a TDP-constrained ULV platform, not desktop/HEDT.
BenchCrypt Crypto SHA2-256 (GB/s) 9 [+2.25x] 4 With no data dependencies we get great scaling of over 2x in this integer workload.
BenchCrypt Crypto SHA1 (GB/s) 15.71 [+81%] 8.6 Here we see only an 81% improvement, likely due to a lack of (more) memory bandwidth – with more it would likely scale higher.
BenchCrypt Crypto SHA2-512 (GB/s) 7.09 [+2.3x] 3.07 With a 64-bit integer workload we see a larger-than-2x improvement.
Thanks to the new crypto-friendly instructions of AVX512, and no doubt helped by the high-bandwidth LP-DDR4X memory, we see over a 2x (twice) improvement over older AVX2. ICL ULV will no doubt be a great choice for low-power network devices (routers/gateways/firewalls) able to pump 100GbE-class crypto streams.
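For reference, this is the kind of operation the new VAES extension provides – a minimal sketch of our own (key schedule and full round loop omitted), applying a single AES round to four independent 128-bit blocks at once, e.g. four counter blocks in AES-CTR:

```cpp
// Hedged sketch of AVX512-VAES: one AESENC across four 128-bit lanes,
// where legacy AES-NI processes a single block per instruction.
// Compile with: -mavx512f -mvaes
#include <immintrin.h>

__m512i aes_round_x4(__m512i four_blocks, __m512i round_key) {
    return _mm512_aesenc_epi128(four_blocks, round_key);
}
```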
BenchScience SGEMM (GFLOPS) float/FP32 185 [-6%] 196 More optimisations seem to be required here for ICL at least.
BenchScience DGEMM (GFLOPS) double/FP64 91 [+18%] 77 Changing to FP64 brings an 18% improvement.
BenchScience SFFT (GFLOPS) float/FP32 31.72 [+12%] 28.34 With FFT, we see a modest 12% improvement.
BenchScience DFFT (GFLOPS) double/FP64 17.72 [-2%] 18 With FP64 we see a 2% regression.
BenchScience SNBODY (GFLOPS) float/FP32 200 [+7%] 187 No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64 61.76 [=] 62 With FP64 there is no delta.
With highly-optimised scientific algorithms, it seems we still have some way to go to extract more performance out of AVX512, though overall we still see a 7-18% improvement where it helps, even at this time.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1,580 [+79%] 883 We start well here, with AVX512 79% faster in this float/FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 633 [+71%] 371 The same algorithm but with more shared data improves by 71%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 326 [+67%] 195 Again the same algorithm, but even more shared data brings the improvement down to 67%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 502 [+58%] 318 Using two buffers does not change much – still a 58% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 72.92 [+2.4x] 30.14 A different algorithm works better, with AVX512 over 2.4x faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 24.73 [+50%] 16.45 Using the new scatter/gather instructions in AVX512 still brings 50% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 2,100 [+33%] 1,580 Here we have a 64-bit integer workload with many gathers – still a good 33% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 307 [+33%] 231 Again loads of gathers and a similar 33% improvement.
Image manipulation algorithms working on individual (non-dependent) pixels love AVX512, with 33-140% improvements. The new scatter/gather instructions also simplify memory-access code, which can benefit from future arch improvements.
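To make the gather pattern concrete, here is a minimal sketch (our own illustration, not Sandra's actual filter code), assuming per-lane pixel indices have already been computed by the filter:

```cpp
// Sketch: fetch 16 arbitrary pixels in one instruction instead of 16
// scalar loads - the access pattern warped/permuted filters rely on.
// Compile with: -mavx512f
#include <immintrin.h>

__m512 sample16(const float *image, __m512i indices) {
    // scale=4: indices are in elements, each pixel is a 4-byte float
    return _mm512_i32gather_ps(indices, image, 4);
}
```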
Neural Networks NeuralNet CNN Inference (Samples/s) 25.94 [+3%] 25.23 Inference improves by a mere 3% despite few dependencies.
Neural Networks NeuralNet CNN Training (Samples/s) 4.6 [+5%] 4.39 Training improves by a slightly better 5%, likely due to 512-bit accesses.
Neural Networks NeuralNet RNN Inference (Samples/s) 25.66 [-1%] 25.81 RNN inference seems very slightly slower.
Neural Networks NeuralNet RNN Training (Samples/s) 2.97 [+33%] 2.23 Finally, RNN training improves by 33%.
Unlike image manipulation, neural networks don’t seem to benefit as much, with pretty much the same performance across the board. Clearly more optimisation is needed to push performance.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

We never expected a low-power, TDP (power)-limited ULV platform to benefit from AVX512 as much as HEDT/server platforms – especially when you consider the lower count of SIMD execution units. Nevertheless, it is clear that ICL (even in ULV form) benefits greatly from AVX512, with 50-100% improvements in many algorithms and no losses.

ICL also introduces many new AVX512 extensions which can even be used to accelerate existing AVX512 code (not just legacy AVX2/FMA); we are likely to see even higher gains in the future as software (and compilers) take advantage of the new extensions. Future CPU architectures are also likely to optimise complex instructions as well as add more SIMD/FMA execution units, which will greatly improve AVX512 code performance.

As the data paths for the caches (L1D, L2?) have been widened, 512-bit memory accesses help extract more bandwidth for streaming algorithms (e.g. crypto), while scatter/gather instructions reduce latencies for non-sequential data accesses. Thus the benefit of AVX512 extends beyond raw compute code.

We are excitedly waiting to see how AVX512-enabled desktop/HEDT ICL performs, not constrained by TDP and adequately cooled…

Intel Iris Plus G7 Gen11 IceLake ULV (i7-1065G7) Review & Benchmarks – GPGPU Performance

What is “IceLake”?

It is the “proper” 10th generation Core arch (ICL) from Intel – the brand-new core to replace the ageing “Skylake” (SKL) arch and its many derivatives; due to delays it actually debuts shortly after the latest update (“CometLake” (CML)) that is also called 10th generation. First launched for mobile ULV (U/Y) devices, it will also launch for mainstream (desktop/workstation) platforms soon.

Thus it contains extensive changes to all parts of the SoC: CPU, GPU, memory controller:

  • 10nm+ process (lower voltage, performance benefits)
  • Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskyLake)
  • 64 EUs at up to 1.1GHz – up to 1.12 TFLOPS/FP32, 2.25 TFLOPS/FP16 (see the quick arithmetic after this list)
  • 2-channel LP-DDR4X support up to 3733Mt/s
  • No eDRAM cache unfortunately (unlike CrystalWell and co)
  • VRS (Variable Rate Shading) – useful for games
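As a quick sanity check on those headline figures (assuming the commonly-quoted Gen11 EU layout of 2x SIMD4 FP32 ALUs, i.e. 8 FMAs or 16 FLOPs per EU per clock): 64 EU × 16 FLOP × 1.1GHz ≈ 1,126 GFLOPS ≈ 1.12 TFLOPS FP32; at twice the rate, FP16 gives ≈ 2.25 TFLOPS.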

The biggest change GPGPU-wise is the increase in EUs (64 at the top end), which greatly increases processing power compared to the previous generation’s far fewer EUs (24, except the rare 48-EU GT3 version). Most of the new features seem geared towards gaming rather than GPGPU – and one notable omission is FP64 support! While mobile platforms are not very likely to run high-precision kernels, Gen9 FP64 performance did exceed CPU AVX2/FMA FP64 performance. FP16 is naturally supported at 2x rate, as on most current designs.
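The omission is easy to verify from the host side; here is a minimal hedged sketch (our own, using only standard OpenCL 1.x API calls) that reports whether each GPU device exposes FP64 and FP16:

```cpp
// Sketch: query every OpenCL GPU for FP64/FP16 capability. On Gen11/ICL
// drivers at the time of writing, CL_DEVICE_DOUBLE_FP_CONFIG returns 0.
#include <CL/cl.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    cl_uint np = 0;
    clGetPlatformIDs(0, nullptr, &np);
    std::vector<cl_platform_id> plats(np);
    clGetPlatformIDs(np, plats.data(), nullptr);
    for (cl_platform_id p : plats) {
        cl_uint nd = 0;
        if (clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, 0, nullptr, &nd) != CL_SUCCESS || nd == 0)
            continue;
        std::vector<cl_device_id> devs(nd);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_GPU, nd, devs.data(), nullptr);
        for (cl_device_id d : devs) {
            char name[256] = {}, ext[16384] = {};
            cl_device_fp_config fp64 = 0;   // zero bits => no FP64 at all
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(fp64), &fp64, nullptr);
            clGetDeviceInfo(d, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, nullptr);
            printf("%s: FP64 %s, FP16 %s\n", name, fp64 ? "yes" : "no",
                   strstr(ext, "cl_khr_fp16") ? "yes" : "no");
        }
    }
    return 0;
}
```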

While there does not seem to be any eDRAM (L4) cache, thanks to the very high-speed LP-DDR4X memory (at 3733Mt/s) the bandwidth has almost doubled (58GB/s), which should greatly help bandwidth-intensive workloads. While L1 does not seem changed, L2 has been increased to 3MB (up from 1MB), which should also help.
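For reference, the headline bandwidth follows directly from the memory specs (assuming the usual 2-channel, 128-bit combined LP-DDR4X bus): 3,733 MT/s × 16 bytes ≈ 59.7 GB/s theoretical peak, of which the quoted 58 GB/s is the practical ceiling.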

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously.

GPGPU (Gen11 G7) Performance Benchmarking

In this article we test GPGPU core performance; please see our other articles on:

To compare against the other Gen10 SoC, please see our other articles:

Hardware Specifications

We are comparing the middle-range Intel integrated GP-GPUs with the previous generation, as well as competing architectures, with a view to upgrading to a brand-new, high-performance design.

GPGPU Specifications Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
Arch Chipset EV9.5 / GT2 EV9 / GT3 Vega / GCN1.5 EV11 / G7 The first Gen11 GPU from Intel.
Cores (CU) / Threads (SP) 24 / 192 48 / 384 8 / 512 64 / 512 Less powerful CUs but the same SP count as Vega.
SIMD per CU / Width 8 8 64 8 Same SIMD width.
Wave/Warp Size 32 32 64 32 Wave size matches nVidia.
Speed (Min-Turbo) 300-1000MHz 300-950MHz 300-1100MHz 400-1100MHz Turbo matches Vega.
Power (TDP) 15-25W 15-25W 25W 15-25W Same TDP.
ROP / TMU 8 / 16 16 / 24 8 / 32 16 / 32 ROPs are the same but TMUs have increased.
Shared Memory 64kB 64kB 32kB 64kB Same shared memory as before but 2x Vega.
Constant Memory 1.6GB 3.2GB 2.7GB 3.2GB No dedicated constant memory but large.
Global Memory 2x DDR4 2133Mt/s 2x DDR4 2133Mt/s 2x DDR4 2400Mt/s 2x LP-DDR4X 3733Mt/s Fastest memory ever.
Memory Bandwidth 38GB/s 38GB/s 42GB/s 58GB/s Highest bandwidth ever.
L1 Caches 16kB x 24 16kB x 48 16kB x 8 16kB x 64 L1 does not appear changed.
L2 Cache 512kB 1MB ? 3MB L2 has tripled in size.
Maximum Work-group Size 256×256 256×256 1024×1024 256×256 Vega supports 4x bigger workgroups.
FP64/double ratio 1/16x 1/16x 1/32x No! No FP64 support in current drivers!
FP16/half ratio 2x 2x 2x 2x Same 2x ratio.

Processing Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 895 1,530 2,000 2,820 [+41%] G7 beats Vega by 41%! A pretty incredible start.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 472 843 1,350 1,330 [-1%] Standard FP32 is just a tie.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 113 195 111 70* Without native FP64 support G7 craters, though even the old GT3 beats Vega.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 6 10.2 7.1 7.54* Emulated FP128 is hard on the (emulated) FP64 units, yet G7 beats Vega again.
G7 ties with mobile Vega in FP32, which in itself is a great achievement, while in FP16 it is much faster. Unfortunately, without native FP64 support G7 is a lot slower using emulation – but hopefully mobile systems don’t use high-precision kernels.

* Emulated FP64 through FP32.
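For the curious, “emulated FP64 through FP32” typically means double-float (hi+lo) arithmetic; a minimal sketch of the exact-addition building block (our own illustration, assuming IEEE FP32 semantics and no fast-math re-association) shows why it is so expensive – roughly 14 FP32 operations for a single emulated add:

```cpp
// Sketch of double-float ("FP64 via FP32") addition. A value is kept as an
// unevaluated sum hi + lo of two floats. Requires strict IEEE FP32
// (no -ffast-math), otherwise the error terms are optimised away.
struct float2x { float hi, lo; };

// Knuth two-sum: returns s and err such that s + err == a + b exactly
inline void two_sum(float a, float b, float &s, float &err) {
    s = a + b;
    float bb = s - a;
    err = (a - (s - bb)) + (b - bb);
}

inline float2x df_add(float2x x, float2x y) {
    float s, e;
    two_sum(x.hi, y.hi, s, e);   // exact sum of the high parts
    e += x.lo + y.lo;            // fold in the low-order parts
    two_sum(s, e, s, e);         // renormalise back to hi + lo form
    return {s, e};
}
```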

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 0.88 1.14 2.58 2.6 [+1%] G7 manages to tie with Vega on this streaming test.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1.1 1.42 3.3 3.4 [+2%] Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 1.1 1.83 3.36 2.26 [-33%] Without crypto acceleration G7 cannot match Vega.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 3 4.45 14.29 6.9 [1/2x] With less compute-intensive SHA1, G7 is 1/2 the speed of Vega.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 6.79 10.6 18.77 14.18 [-24%] A 64-bit integer workload is still ~25% slower.
Thanks to the fast LP-DDR4X memory and its high bandwidth, G7 ties with Vega on streaming (AES) integer workloads. However, G7 has no crypto (hashing) acceleration, thus Vega is much faster there – crypto-currency/coin algorithms still favour AMD.
GPGPU Finance Benchmark Black-Scholes float/FP16 (MOPT/s) 1,170 1,470 1,720 2,340 [+36%] With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 710 758 829 1,310 [+58%] With FP32 G7 is now even faster – 60% faster than Vega.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 158 264 185 No FP64 support.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 95.7 153 254 292 [+15%] Binomial uses thread-shared data, thus stressing the memory system, so G7 is just 15% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 20.32 31.1 15.67 No FP64 support.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 240 392 362 719 [+2x] Monte-Carlo also uses thread shared data but read-only and here G7 is 2x faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 35.27 59.7 47.13 No FP64 support.
For financial FP32/FP16 workloads, G7 is between 15% and 2x faster than Vega – thus for financial workloads it is a great choice. Unfortunately, due to the lack of FP64 support it cannot run high-precision workloads, which may be a problem for some algorithms.
GPGPU Science Benchmark HGEMM (GFLOPS) float/FP16 142 220 884 563 [-36%] G7 cannot beat Vega here despite its great FP16 performance elsewhere.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 119 162 314 419 [+33%] With FP32, G7 is 33% faster than Vega.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 44.2 65.1 62.5 No FP64 support
GPGPU Science Benchmark HFFT (GFLOPS) float/FP16 39.77 42.54 61.34 61.4 [=] G7 manages to tie with Vega here.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 23.8 29.69 31.48 39.22 [+25%] With FP32, G7 is 25% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 4.81 3.43 14.19 No FP64 support
GPGPU Science Benchmark HNBODY (GFLOPS) float/FP16 383 597 623 930 [+49%] G7 comes up strong here winning by 50%.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 209 327 537 566 [+5%] With FP32, G7 drops to just 5% faster than Vega.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 26.93 44.19 44 No FP64 support.
On scientific algorithms, G7 manages to beat Vega by 25-50% in FP32 and sometimes in FP16 as well. Again, the lack of FP64 support means high-precision kernels cannot be used, which for some algorithms may be a problem.
GPGPU Image Processing Blur (3×3) Filter single/FP16 (MPix/s) 1,000 1,370 2,273 3,520 [+55%] With FP16, G7 is “only” 55% faster than Vega.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 498 589 781 1,570 [+2x] In this 3×3 convolution algorithm, G7 is 2x faster.
GPGPU Image Processing Sharpen (5×5) Filter single/FP16 (MPix/s) 307 441 382 1,000 [+72%] With FP16, G7 is just 70% faster.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 108 143 157 319 [+2x] Same algorithm but more shared data, G7 still 2x faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP16 (MPix/s) 284 435 619 924 [+49%] With FP16, G7 is again 50% faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 112 156 161 328 [+2x] With even more data the gap remains at 2x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP16 (MPix/s) 309 428 595 1,000 [+68%] With FP16 precision, G7 is 70% faster than Vega.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 108 145 155 318 [+2x] Still convolution but with 2 filters – same 2x difference.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP16 (MPix/s) 8.78 8.23 7.68 26.63 [+2.5x] With FP16, G7 is “just” 2.5x faster than Vega.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 7.87 6.29 4.06 26.9 [+5.6x] Different algorithm allows G7 to fly at 6x faster.
GPGPU Image Processing Oil Painting Quantise Filter single/FP16 (MPix/s) 9.6 9.14 24.34 G7 does similarly well with FP16
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 8.84 6.77 2.59 19.63 [+6.6x] Without major processing, this filter is 6x faster on G7.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP16 (MPix/s) 1,000 1,620 2,091 1,740 [-17%] With FP16, G7 is 17% slower than Vega.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 1,000 1,560 2,100 1,870 [-11%] This algorithm is 64-bit integer heavy, thus G7 is ~11% slower.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP16 (MPix/s) 36.5 34.32 1,046 215 [1/5x] Some issues still need to be worked out here.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 433 649 608 950 [+56%] One of the most complex and largest filters, G7 is over 50% faster.
For image-processing tasks G7 does very well – in FP32 it is ~2x faster than Vega, while in FP16 the lead drops to ~50% (Vega benefiting greatly from the lower precision). All in all, a fantastic result for those using image/video manipulation algorithms.

Memory Performance

We are testing OpenCL memory performance using the latest SDKs / libraries / drivers from Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 21.36 23.66 27.32 36.3 [+33%] G7 has 33% more bandwidth than Vega.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 10.4 11.77 4.74 17 [+2.6x] G7 manages far higher transfers.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 10.55 11.75 5 18 [+2.6x] Again, same 2.6x delta.
Thanks to the fast LP-DDR4X memory, G7 has far more bandwidth than Vega or older GT2/GT3 design; this no doubt helps streaming algorithms as we have seen above.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 232 277 412 343 [-17%] Better latency than Vega, but not lower than the old arch.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 363 436 519 433 [-17%] Similar 17% less than Vega.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 153 213 201 267 [+33%] Vega seems to be a lot faster than G7.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 236 252 411 350 [-15%] Same latency as global memory, as it is not dedicated.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 72.5 100 22.5 16.7 [-26%] G7 has greatly reduced shared memory latency.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 1,116 1,500 278 1,100 [+3x] Not much improvement over older versions.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 1,178 1,533 418 1,018 [+1.4x] Similar high latency for G7.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 1,057 1,324 122 973 [+8x] Again Vega has much lower latencies.
Despite the high bandwidth, latencies are high as LP-DDR4 has higher latencies than standard DDR4 (tens of clocks). Like Vega, there is no dedicated constant memory – unlike nVidia. But G7 has greatly reduced shared-memory latency to below Vega’s, which helps algorithms making heavy use of shared memory.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It’s great to see Intel taking graphics seriously again; with ICL you don’t just get a brand-new CPU core but a much-updated GPU core too. And it does not disappoint – it trades blows with the competition (Vega Mobile) and usually wins, while being close to 2x faster than Gen9/GT3 and 3x faster than Gen9.5/GT2 – a huge improvement.

The lack of native FP64 support is puzzling – but then again it could be reserved for higher-end/workstation versions, if supported at all. Intel is no doubt betting on the CPU’s AVX512 SIMD units for FP64 performance, which is considerable. Again, it’s not very likely that mobile (ULV) platforms are going to run high-precision kernels.

The memory bandwidth is also ~50% higher, but unfortunately latencies are higher too due to the LP-DDR4(X) memory; lower-end versions using “standard” DDR4 memory will not see the high bandwidth but will see lower latencies – so it is give and take.

As we’ve said in the other reviews of ICL, if you have been waiting to upgrade from the much older – but still good – SKL/KBL with Gen8/9 GT2 GPU – the Gen11 GPU is a significant upgrade. You will no longer feel “inadequate” compared to competition integrated GPUs. Naturally, you cannot expect discrete GPU levels of performance but for an integrated APU it is more than sufficient.

Overall, with the CPU and memory improvements, ICL-U is a very compelling proposition that, cost permitting, should be your top choice for long-term use.

In a word: Highly Recommended!

Please see our other articles on:

SiSoftware Sandra 20/20/4a (2020 R4a) Released

Note: The original R4 release text has been updated below. The (*) denotes new changes.

We are pleased to release the R4a (version 30.39) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Benchmarks:
    • Crypto AES Benchmarks*: Optimised AVX512/AVX2-VAES code to outperform AES-HWA where possible.
    • Crypto SHA Benchmarks*: Select AVX512 multi-buffer instead of SHA-HWA where supported.
    • Network (LAN), Wireless (WLAN/WWAN) Benchmarks: multi-threaded transfer tests and increased packet sizes to better utilise 10GbE (and higher) links. [Note: multi-threaded CPU required]
    • Internet Connection, Internet Peerage Benchmarks: multi-threaded transfer tests and increased packet sizes to better utilise Gigabit (and higher) connections.
  • Hardware Support:
    • Updated IceLake (ICL Gen10 Core), Future* (RKL, TGL Gen11 Core) AVX512, VAES, SHA-HWA support (see CPU, GP-GPU, Cache & Memory, AVX512 improvement reviews)
    • Updated CometLake (Gen10 Core) support (see CPU, GP-GPU, Cache & Memory reviews)
    • Updated CPU features support*
    • Updated NVMe support
    • Enhanced Biometrics information (fingerprint, face, voice, audio, etc. sensors)
    • Updated WiFi support (WiFi 6/802.11ax, WPA3)
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Intel Core Gen10 IceLake ULV (i7-1065G7) Review & Benchmarks – CPU AVX512 Performance

What is “IceLake”?

It is the “real” 10th generation Core arch(itecture) (ICL/IceLake) from Intel – the brand-new core to replace the ageing “Skylake” (SKL) arch and its many derivatives; due to delays it actually debuts shortly after the latest update (“CometLake” (CML)) that is also called 10th generation. First launched for mobile ULV (U/Y) devices, it will also launch for mainstream (desktop/workstation) platforms soon.

Thus it contains extensive changes to all parts of the SoC: CPU, GPU, memory controller:

  • 10nm+ process (lower voltage, performance benefits)
  • Up to 4C/8T on ULV (similar to WhiskyLake but less than top-end CometLake 6C/12T)
  • Gen11 graphics (finally up from Gen9.5 for CometLake/WhiskyLake)
  • AVX512 instruction set (like HEDT platform)
  • SHA HWA instruction set (like Ryzen)
  • 2-channel LP-DDR4X support up to 3733Mt/s
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
  • WiFi6 (802.11ax) AX201 integrated

Probably the biggest change is support for the AVX512 family of instruction sets, effectively doubling the SIMD processing width (vs. AVX2/FMA) as well as adding a whole host of specialised instructions that even the HEDT platform (SKL/KBL-X) does not support:

  • VNNI (Vector Neural Network Instructions)
  • VBMI, VBMI2 (Vector Byte Manipulation Instructions)
  • BITALG (Bit Algorithms)
  • IFMA (Integer FMA)
  • VAES (Vector AES) accelerating crypto
  • GFNI (Galois Field)
  • SHA accelerating hashing
  • GNA (Gaussian Neural Accelerator)
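All of these (except GNA, which is exposed as a separate low-power accelerator device rather than a CPUID instruction-set flag) can be detected from CPUID leaf 7; here is a minimal hedged sketch of our own for GCC/Clang on x86-64, with bit positions as per the Intel SDM:

```cpp
// Sketch: detect the ICL AVX512-family extensions via CPUID leaf 7.
#include <cpuid.h>
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 1;
    printf("AVX512F: %u\n", (ebx >> 16) & 1);  // Foundation
    printf("IFMA:    %u\n", (ebx >> 21) & 1);  // Integer FMA
    printf("SHA:     %u\n", (ebx >> 29) & 1);  // SHA HWA
    printf("VBMI:    %u\n", (ecx >> 1)  & 1);
    printf("VBMI2:   %u\n", (ecx >> 6)  & 1);
    printf("GFNI:    %u\n", (ecx >> 8)  & 1);  // Galois Field
    printf("VAES:    %u\n", (ecx >> 9)  & 1);  // Vector AES
    printf("VNNI:    %u\n", (ecx >> 11) & 1);  // Neural Network
    printf("BITALG:  %u\n", (ecx >> 12) & 1);
    return 0;
}
```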

While some software may not have been updated for AVX512 while it was reserved for HEDT/servers, with this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.

VAES and SHA acceleration improve crypto/hashing performance – important today, as even LAN transfers between workstations are likely to be encrypted/signed, not to mention just about all WAN transfers, encrypted disks/containers, etc. Some SoCs will also make their way into powerful (but low-power) firewall appliances, where both AES and SHA acceleration will prove very useful.

From a security point of view, ICL mitigates all (existing/reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (bounds-check bypass, aka Spectre V1, which does not have a hardware solution), thus it should not require the slower software mitigations that affect performance (especially I/O).

The memory controller supports LP-DDR4X at higher speeds than CML, while the cache/TLB systems have been improved, which should help both CPU and GPU performance (see the corresponding article) as well as reduce power vs. older designs using LP-DDR3.

Finally the GPU core has been updated (Gen11) and generally contains many more cores than the old core (Gen9.5) that was used from KBL (CPU Gen7) all the way to CML (CPU Gen10) (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV processor with previous generations (gen 8, 7, 6) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.


CPU Specifications AMD Ryzen 2500U (Raven Ridge) Intel i7 8550U (Coffeelake ULV) Intel Core i7 10510U (CometLake ULV) Intel Core i7 1065G7 (IceLake ULV) Comments
Cores (CU) / Threads (SP) 4C / 8T 4C / 8T 4C / 8T 4C / 8T No change in core count.
Speed (Min / Max / Turbo) 1.6-2.0-3.6GHz 0.4-1.8-4.0GHz (1.8GHz @ 15W, 2GHz @ 25W) 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) ICL has lower clocks vs. CML.
Power (TDP) 15-35W 15-35W 15-35W 12-35W Same power envelope.
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 4x 32kB 8-way / 4x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way 4x 48kB 12-way / 4x 32kB 8-way L1D is 50% larger.
L2 Caches 4x 512kB 8-way 4x 256kB 16-way 4x 256kB 16-way 4x 512kB 16-way L2 has doubled.
L3 Caches 4MB 16-way 6MB 16-way 8MB 16-way 8MB 16-way No L3 changes.
Microcode (Firmware) MU8F1100-0B MU068E09-AE MU068E0C-BE MU067E05-6A Revisions just keep on coming.
Special Instruction Sets AVX2/FMA, SHA AVX2/FMA AVX2/FMA AVX512, VNNI, SHA, VAES, GFNI 512-bit wide SIMD on mobile!
SIMD Width / Units 128-bit 256-bit 256-bit 512-bit Widest SIMD units ever.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets (AVX512, AVX2, AVX, etc.). “IceLake” (ICL) supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.


Native Benchmarks AMD Ryzen 2500U (Raven Ridge) Intel i7 8550U (Coffeelake ULV) Intel Core i7 10510U (CometLake ULV) Intel Core i7 1065G7 (IceLake ULV) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 103 125 134 154 [+15%] ICL is 15% faster than CML.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 102 115 135 151 [+12%] With a 64-bit integer workload – a 12% increase.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 79 67 85 90 [+6%] With floating-point, ICL is 6% faster than CML.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 67 57 70 74 [+5%] With FP64 we see a 5% improvement.
With integer (legacy) workloads (not using SIMD) we see the new ICL core is over 10% faster than the higher-clocked CML core; with floating-point we see a 5% improvement. While modest, it shows the potential of the new core over the old-but-refined cores we’ve had since SKL.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 239 306 409 504* [+23%] With AVX512, ICL wins this vectorised integer test.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 53.4 117 149 145* [-3%] With a 64-bit AVX512 integer workload we have parity.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 2.41 2.21 2.54 3.67 [+44%] A tough test using long integers to emulate Int128 without SIMD; ICL is 44% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 222 266 328 414* [+26%] In this floating-point vectorised test, AVX512 is 26% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 127 155.9 194 232* [+19%] Switching to FP64 SIMD code, ICL is 19% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 6.23 6.51 8.22 10.2* [+24%] A heavy algorithm using FP64 to mantissa-extend FP128; ICL is 24% faster.
With heavily vectorised SIMD workloads, ICL is able to deploy AVX512, which leads to a 20-25% performance improvement even at lower clocks. However, AVX512 is quite power-hungry (as we’ve seen on HEDT), so we are power-constrained in a ULV part here – higher-TDP systems (28W, etc.) should perform much better.

* using AVX512 instead of AVX2/FMA.

BenchCrypt Crypto AES-256 (GB/s) 10.9 13.1 12.1 21.3* [+76%] ICL with VAES is 76% faster than CML.
BenchCrypt Crypto AES-128 (GB/s) 10.9 13.1 12.1 21.3* [+76%] No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 6.78** 3.97 4.3 9** [+2.1x] Despite SHA HWA, Ryzen loses the top spot.
BenchCrypt Crypto SHA1 (GB/s) 7.13** 7.5 7.2 15.7** [+2.2x] Less compute-intensive SHA1 does not help either.
BenchCrypt Crypto SHA2-512 (GB/s) 1.48 1.54 7.1*** SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here and, with VAES (AVX512 VL) and SHA HWA support (like Ryzen), ICL wins thanks to the very fast LP-DDR4X @ 3733Mt/s. VAES helps marginally (at this time) and SHA HWA cannot beat AVX512 multi-buffer, but it should be much more important in single-buffer large-data workloads.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W
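To illustrate the “multi-buffer” approach the footnotes refer to (a hedged sketch of our own, not Sandra’s implementation): since each SHA-2 hash is a serial chain, the AVX512 trick is to hash 16 independent buffers side by side, one per 32-bit lane – for example, one small-sigma step computed for 16 streams at once:

```cpp
// Sketch: one SHA-256 message-schedule step, sigma0(x) = ror(x,7) ^
// ror(x,18) ^ (x >> 3), for 16 independent message streams in parallel.
// Compile with: -mavx512f
#include <immintrin.h>

__m512i sha256_sigma0_x16(__m512i x) {
    __m512i r7  = _mm512_ror_epi32(x, 7);   // rotate each lane right by 7
    __m512i r18 = _mm512_ror_epi32(x, 18);  // ... and by 18
    __m512i s3  = _mm512_srli_epi32(x, 3);  // logical shift right by 3
    return _mm512_xor_si512(_mm512_xor_si512(r7, r18), s3);
}
```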

BenchFinance Black-Scholes float/FP32 (MOPT/s) 93.34 73.02 109 With non-vectorised code ICL is still faster.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 77.86 75.24 87.2 91 [+4%] Using FP64 ICL is 4% faster
BenchFinance Binomial float/FP32 (kOPT/s) 35.49 16.2 23.5 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 19.46 19.31 21 27 [+29%] With FP64 code ICL is 29% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 20.11 14.61 79.9 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 15.32 14.54 16.5 66 [+4x] Switching to FP64, ICL is 4x faster.
With non-SIMD financial workloads, ICL still improves a significant amount over CML, thus it makes sense to choose it over the older core. Still, it is more likely that GPGPUs will be used for such workloads today.
BenchScience SGEMM (GFLOPS) float/FP32 107 141 158 185* [+17%] In this tough vectorised algorithm, ICL is 17% faster.
BenchScience DGEMM (GFLOPS) double/FP64 47.2 55 69.2 91.7* [+32%] With FP64 vectorised code, ICL is 32% faster.
BenchScience SFFT (GFLOPS) float/FP32 3.75 13.23 13.9 31.7* [+2.3x] FFT is also heavily vectorised and here ICL is over 2x faster.
BenchScience DFFT (GFLOPS) double/FP64 4 6.53 7.35 17.7* [+2.4x] With FP64 code, ICL is even faster.
BenchScience SNBODY (GFLOPS) float/FP32 112.6 160 169 200* [+18%] N-Body simulation is vectorised but with more memory accesses; ICL is 18% faster.
BenchScience DNBODY (GFLOPS) double/FP64 45.3 57.9 64.2 61.8* [-4%] With FP64 code ICL is slightly behind CML.
With highly vectorised SIMD code (scientific workloads), ICL again shows the power of AVX512 and can be over 2x (twice) faster than CML, even at lower clocks. Some algorithms may need further optimisation, but even then we see a 17-32% improvement.

* using AVX512 instead of AVX2/FMA

Neural Networks NeuralNet CNN Inference (Samples/s) 14.32 17.27 19.33 25.62* [+33%] Using AVX512, ICL inference is 33% faster.
Neural Networks NeuralNet CNN Training (Samples/s) 1.46 2.06 3.33 4.56* [+37%] Even training improves, by 37%.
Neural Networks NeuralNet RNN Inference (Samples/s) 16.93 22.69 23.88 24.93* [+4%] Just 4% faster here, but the improvement is there.
Neural Networks NeuralNet RNN Training (Samples/s) 1.48 1.14 1.57 2.97* [+43%] Training is much faster, by 43% over CML.
As we’ve seen before, ICL benefits greatly from AVX512 – it manages to beat the higher-clocked CML across the board, by up to 43% – and that is before using VNNI to accelerate these algorithms even more.

* using AVX512 instead of AVX2/FMA (not using VNNI yet)

CPU Image Processing Blur (3×3) Filter (MPix/s) 532 720 891 1,580* [+77%] In this vectorised integer workload ICL is 77% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 146 290 359 633* [+76%] Same algorithm but more shared data; still 76% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 123 157 186 326* [+75%] Again the same algorithm, but even more shared data brings 75%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 185 251 302 502* [+66%] A different but still vectorised workload; 66% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 26.49 25.38 27.7 72.9* [+2.6x] Still vectorised code; ICL rules here, 2.6x faster!
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.38 14.29 15.7 24.7* [+57%] A similar improvement here of about 57%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 660 1,525 1,580 2,100* [+33%] With a 64-bit integer workload, 33% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 94.16 188.8 214 307* [+43%] In this final test, again with an integer workload, 43% faster.
ICL rules this benchmark, with AVX512 integer (B/W) workloads 33-43% faster and AVX512 floating-point 66-77% faster than CML, even at lower clocks. Again we see the huge improvement AVX512 already brings, even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA

Unlike CML, ICL with AVX512 support is a revolution in performance – which is exactly what we were hoping for; even at much lower clocks we see anywhere between 33% and over 2x (twice) faster within the same power limits (TDP/turbo). As we know from HEDT, AVX512 is power-hungry, thus higher-TDP versions (e.g. 28W) should perform even better.

Even without AVX512, we see a good improvement of 5-15%, again at much lower clocks (3.9GHz vs. 4.9GHz), while CML and older versions relied on higher clocks / more cores to outperform their predecessors (KBL/SKL-U).

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD snapping at its heels with Ryzen Mobile, Intel has finally fixed its 10nm production and rolled out the “new Skylake” we deserve: IceLake with AVX512 brings feature parity with the much older HEDT platform and shows good promise for the future. This is the “Core” you have been looking for.

While power-hungry and TDP-constrained, AVX512 does bring sizeable performance gains on top of the core and cache & memory sub-system improvements. The other instruction sets (VAES, SHA HWA) complete the package and might help in scenarios where code has not been updated to AVX512.

With ICL, a mere 15W thin & light (e.g. Dell XPS 13 9300) can outperform older desktop-class CPUs (e.g. SKL) with 4-6x the TDP, which makes us really keen to see what the desktop-class processors will be capable of. And not before time, as the competition has been bringing out stronger and stronger designs (Ryzen2, future Ryzen3).

If you have been waiting to upgrade from the much older – but still good – SKL/KBL with just 2 cores and no hardware vulnerability mitigations, then you finally have something to upgrade to. CML was not it: despite its 4 cores (and rumoured 6 cores), it just did not bring enough to the table to make upgrading worthwhile (save hardware mitigations that don’t cripple performance).

Overall, with the GPGPU and memory improvements, ICL-U is a very compelling proposition that, cost permitting, should be your top choice for long-term use.

In a word: Highly Recommended!

Please see our other articles on:

SiSoftware Sandra 20/20/3 (2020 R3) Released

We are pleased to release the R3 (version 30.31) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • Additional PCIe extended capabilities support
  • CPU Cryptography Benchmarks:
    • Block size changed to ~1500 bytes similar to Ethernet packet
    • Various stability and reliability improvements
  • GPGPU Cryptography Benchmarks:
    • Block size changed to ~1500 bytes similar to Ethernet packet
    • Various stability and reliability improvements

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/2 (2020 R2) Released

We are pleased to release the R2 (version 30.27) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • PCIe extended capabilities support
  • Software Support:
    • ReFS format Disk benchmark stability issues
  • CPU Benchmarks:
    • Tools (Visual C++ compiler 2019) Update
  • GPGPU Benchmarks:
    • CUDA: Updated SDK 10.2/10.1
    • OpenCL: Updated SDK support

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

SiSoftware Sandra 20/20/1a (2020 R1a) Released

Update November 25th: Released patch (version 30.24) to add further hardware and software support.

Update October 24th: Released patch (version 30.21) to correct Windows 7 / Server 2008/R2 run-time issues.

We are pleased to release the R1 (version 30.24) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

  • Hardware Support:
    • AMD Ryzen2 (series 3000 Matisse), Stoney Ridge updated support
    • Intel Cascade Lake (CSL), Comet Lake (CML), Cannon Lake (CNL), Ice Lake (ICL) updated support
  • CPU Benchmarks:
    • Tools (Visual C++ compiler 2019) Update
  • GPGPU Benchmarks:
    • CUDA: Updated SDK 10.2/10.1
    • OpenCL: Updated SDK support

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Intel Core i9 10980X (Cascade Lake) Review & Benchmarks – CPU 18-core/36-thread AVX512 Performance

Intel Skylake-X Core i9

What is “Cascade Lake (CSL-X)”?

It is one of the 10th generation Core X-Series archs (CSL-X) from Intel – the latest revision of the “Skylake-X” (SKL-X) arch; it succeeds the older 9900 and 7900 X-Series for the HEDT platform. Again, as on desktop/mobile, it is not the “real” 10th generation Core-X arch – but unlike those platforms it does actually bring a few more features, thus it may be thought of as “gen 9.5”:

  • Up to 18C/36T (matching older 7/9-X series)
  • Increased Turbo ratios (e.g. 3.0/4.6GHz for 10980X vs. 2.6/4.2 for 7980X)
  • 4-channel DDR4-2933 (up from 2667) and 256GB (up from 128)
  • AVX512-VNNI, aka “Deep Learning Boost” (DLB) for AI/ML neural networks
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
  • Reduced cost – by 50% ($999 for 10980X vs. $1999 for 7980X)

Unfortunately there are no core-count increases here, as the CPUs are still power-limited especially under AVX512 loads, but we do get some base and turbo ratio increases that should prove useful. We also get a good increase in (official) memory speed support and double the memory capacity (256GB!) for those big servers.

New instruction sets are always appreciated, though “VNNI” is just an acceleration for twin 8/16-bit integer multiply/accumulate operations for faster summation in low-precision quantised (thus integer, not floating-point) neural networks. Thus it is not something most algorithms can benefit from: if all you’re going to use your CPU for is AI/ML then great – otherwise it may not be of much use.
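Concretely, this is the operation VNNI fuses (a hedged sketch of our own): vpdpbusd replaces the old vpmaddubsw + vpmaddwd + vpaddd sequence when accumulating 8-bit products, as in quantised-inference dot products:

```cpp
// Sketch: int8 dot product via AVX512-VNNI. Each vpdpbusd multiplies
// unsigned 8-bit by signed 8-bit elements and accumulates every group of
// 4 adjacent products into a 32-bit lane - 64 MACs per instruction.
// Compile with: -mavx512f -mavx512vnni
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) { // n multiple of 64
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        acc = _mm512_dpbusd_epi32(acc, va, vb);
    }
    return _mm512_reduce_add_epi32(acc);   // horizontal sum of 16 partial sums
}
```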

Dropping the price by a *huge* 50% instantly doubles the performance/cost ratio, making CSL-X far more competitive against the new Ryzen 3 / ThreadRipper 3 that have brought big performance gains. Alternatively, it allows almost doubling the number of cores per cost – a nice upgrade for lower-end users, though it will not help top-end (12C+) users.

Why review it now?

Until “IceLake” (ICL-X) makes its public debut, “Cascade Lake” is the latest X-Series CPU from Intel you can buy today; despite being just a revision of “Skylake-X”, due to its reduced price it may still prove a worthy competitor, not just in cost but also performance.

As it contains hardware fixes/mitigations for the vulnerabilities discovered since the original “Skylake-X” launched (especially “Meltdown” but also various “Spectre” variants), the operating system & applications do not need to deploy the slower mitigations that can affect performance (especially I/O and virtualisation) on the older designs. For some algorithms, this may be enough to warrant an upgrade alone!

In this article we test CPU core performance; please see our other articles on:

Other articles using Sandra around the Internet:

Hardware Specifications

We are comparing the top-of-the-range Intel HEDT processor with its predecessors as well as competitors (AMD), with a view to upgrading to a high-core-count, high-performance design.

CPU Specifications Intel Core i9-10980X (CSL-X) Intel Core i9-9900K (CFL-R) Intel Core i9-7900X (SKL-X) AMD Ryzen 9 3950X (R3) Comments
Cores (CU) / Threads (SP) 18C / 36T 8C / 16T 10C / 20T 16C / 32T CSL-X has the most cores, thus a big advantage.
Speed (Min / Max / Turbo) 3.0-4.6GHz 3.6-5.0GHz 3.3-4.3GHz 3.8-4.6GHz CSL-X improves Turbo clocks over SKL-X.
Power (TDP) 165-250W 95-135W 140-250W 105-135W TDP has increased over SKL-X.
L1D / L1I Caches 18x 32kB / 18x 32kB 8x 32kB / 8x 32kB 10x 32kB / 10x 32kB 16x 32kB / 16x 32kB No L1 change.
L2 Caches 18x 1MB (18MB) 8x 256kB (2MB) 10x 1MB (10MB) 16x 512kB (8MB) No L2 change and a good size vs. Ryzen3.
L3 Caches 24.75MB 16MB 13.75MB 4x 16MB (64MB) L3/core stays the same – too little vs. Ryzen3.
Microcode (Firmware) MU065507-29 MU069E0C-9E MU065504-49 MU8F7100-11 Just a stepping change of the same core.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest-performing instruction sets (AVX512, AVX2, AVX, etc.). “CascadeLake” (CSL-X) supports all modern instruction sets including AVX512 and AVX512-VNNI, but not SHA HWA (like Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Intel Core i9-10980X (18C/36T) Intel Core i9-9900K (8C/16T) Intel Core i9-7900X (10C/20T) AMD Ryzen 9 3950X (16C/32T) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 779 [+3%] 400 455 753 CSL-X is just 3% faster than Ryzen3.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 835 [+11%] 393 448 750 With a 64-bit integer workload – the gain is 11%.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 459 [-1%] 236 262 464 With floating-point workload we have a tie.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 379 [-4%] 196 223 393 With FP64 it is 4% slower than R3.
Despite its extra 2 cores (18 vs. Ryzen3’s 16), CSL-X pretty much ties with Ryzen3 across both legacy workloads (integer and floating-point). The performance increase vs. the older SKL-X is pretty much in line with the core count (18 vs. 10), thus no discernible per-core improvement.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 2,341 [+25%] 985 1,430 1,873 In this vectorised integer test, AVX512 allows CSL-X a 25% win.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 913 [+23%] 414 550 744 With a 64-bit AVX2 integer workload the gain is 23%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 12.92 [=] 6.75 9.58 12.98 This is a tough test using Long integers to emulate Int128 without SIMD it’s a tie.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 2,676 [+36%] 914 1,740 1,970 In this floating-point vectorised test, CSL-X is 36% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 1,738 [+45%] 535 1,140 1,200 Switching to FP64 SIMD code, the gain is 45%.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 56.4 [+21%] 23 38.7 46.5 In this heavy algorithm using FP64 to mantissa extend FP128 CSL-X is 21% faster.
Thanks to AVX512 support, CSL-X still manages to beat the new Ryzen3 (with its double-width SIMD units) by ~25% on vectorised integer and ~40% on floating-point workloads. With older AVX2/FMA we would have a tie, despite the extra 2 cores. Again, no appreciable delta vs. the old SKL-X; thus, without VNNI-optimised software, there is nothing to see here.
BenchCrypt Crypto AES-256 (GB/s) 33.9 [+2.6x] 17.6 34 13 With AES HWA support, CSL-X wins due to its 4-channel memory.
BenchCrypt Crypto AES-128 (GB/s) 33.9 [+2.6x] 17.6 34 13 No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 33.5 [+17%] 12 26 28.6 Even without SHA HWA, CSL-X still wins.
BenchCrypt Crypto SHA1 (GB/s) 22.9 38 Less compute-intensive SHA1.
BenchCrypt Crypto SHA2-512 (GB/s) 9 21 SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and with 4-channel DDR4 CSL-X easily wins against Ryzen3 even while lacking SHA HWA support. But again, nothing special vs. the old SKL-X at 3200Mt/s, which ties with it despite fewer cores. If you were to use only “official/non-XMP” memory clocks, then CSL-X would win.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 276 344 With non vectorised workload.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 497 [+61%] 238 277 308 Using FP64, CSL-X is 61% faster than Ryzen3.
BenchFinance Binomial float/FP32 (kOPT/s) 59.9 68.3 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 128 [+3%] 61.6 68 124 With FP64 code CSL-X is just 3% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 56.5 257 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 178 [-16%] 44.5 103 212 Switching to FP64 CSL-X is 16% slower.
With non-SIMD financial workloads, CSL-X does not always win outright over Ryzen3: sometimes it ties, sometimes it is slower, sometimes faster. It is a big improvement over SKL-X only due to having more cores at the same price.
BenchScience SGEMM (GFLOPS) float/FP32 375 413 In this tough vectorised AVX2/FMA algorithm.
BenchScience DGEMM (GFLOPS) double/FP64 240 [+45%] 209 212 165 With FP64 vectorised code, CSL-X is 45% faster.
BenchScience SFFT (GFLOPS) float/FP32 22.3 28.6 FFT is also heavily vectorised but stresses the memory sub-system more.
BenchScience DFFT (GFLOPS) double/FP64 22.07 [+2.6x] 11.21 14.6 8.56 With FP64 code, CSL-X is over 2x faster.
BenchScience SNBODY (GFLOPS) float/FP32 557 638 N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 292 [-25%] 171 195 388 With FP64 code CSL-X is 25% slower.
With highly vectorised SIMD code (scientific workloads) using AVX512, CSL-X easily wins again over Ryzen3 (up to 2.6x faster in DFFT), but again offers nothing special over the older SKL-X save for more cores. You will need AVX512-optimised algorithms to realise these gains, though; otherwise it is again pretty much a tie vs. Ryzen3.
CPU Image Processing Blur (3×3) Filter (MPix/s) 7,295 [+2.53x] 2,560 4,880 2,883 In this vectorised integer workload CSL-X is 2.5x faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 2,868 [+54%] 1,000 1,880 1,857 Same algorithm but more shared data still 54%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 1,724 [+80%] 519 1,000 959 Same algorithm but even more data shared 80% faster.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 2,285 [+44%] 827 1,500 1,589 Different algorithm but still vectorised workload 44% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 332 [+91%] 78 221 174 Still vectorised code again almost 2x faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 112 [+2.25x] 42.2 66.7 49.7 Even better improvement here of 2.25x
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 3,573 [+2.37x] 4,000 3,084 1,505 With this integer workload, ~2.4x faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 1,162 [+85%] 596 776 627 In this final test again with integer workload CSL-X is 85% faster.

Thanks to AVX512, CSL-X manages to easily beat Ryzen3 in heavily vectorised algorithms (up to 2.5x faster) and also in memory-bandwidth-heavy algorithms (due to its 4-channel memory sub-system). But, despite having 2 extra cores, with older AVX2/FMA we pretty much have a tie – not something we are used to seeing from Intel.

Also, the improvement over the older SKL-X is exactly in line with the increase in cores (18 vs. 10 here) – thus there are no appreciable core improvements to boost performance. Without specific VNNI-accelerated algorithms, there is no reason for SKL-X users to upgrade: you do get more cores for a lot less money, but your existing hardware is also now worth a lot less.

It shows how much Ryzen3 has improved (especially due to its 256-bit-wide AVX2/FMA units); ThreadRipper 3, with 4 channels and even more cores (up to 32!) and threads (up to 64!), should nullify Intel’s AVX512 benefit.

SiSoftware Official Ranker Scores


Final Thoughts / Conclusions

For many it may be disappointing that we have a third revision of “Skylake-X” rather than the brand-new “IceLake-X” (ICL-X) – and indeed “Cascade Lake-X” (CSL-X) does struggle against newer (and older) competition to make its mark. Without core-count increases, and with only minor clock increases while still very power-limited at the high end, it just does not bring enough improvement. The only exception is workloads (low-precision quantised neural networks) that can use AVX512-VNNI.

Indeed, its “ace card” is the 1/2 price reduction vs. the old 7/9-X series, and that just about makes it competitive; thankfully existing X299 mainboards can run it after a BIOS/ME update – although boards remain expensive.

But the competition (AMD Ryzen 3, ThreadRipper 3) has much higher performance these days, while older CPUs (Ryzen 2 / ThreadRipper 2) have also been greatly reduced in price. They can also use older boards, although to use the new features (PCIe 4.0, better power management) new boards are required.

All in all, Intel has done all it can – fixed vulnerabilities, greatly reduced the price – to keep the X-Series competitive with current designs, and that it has pretty much achieved. The next X-Series arch had better deliver, otherwise it will be dead and buried.

In a word: Recommended due to 1/2 price drop

Please see our other articles on:

Intel Core Gen10 CometLake ULV (i7-10510U) Review & Benchmarks – CPU Performance

What is “CometLake”?

It is one of the 10th generation Core archs (CML) from Intel – the latest revision of the venerable (6th gen!) “Skylake” (SKL) arch; it succeeds the “WhiskyLake”/“CoffeeLake” 8/9th-gen current architectures for mobile (ULV U/Y) devices. The “real” 10th generation Core arch is “IceLake” (ICL), which does bring many changes but has not made its mainstream debut yet.

As a result there are no major updates vs. previous Skylake designs, save an increase in core count on top-end versions and hardware vulnerability mitigations, which can still make a big difference:

  • Up to 6C/12T (from 4C/8T WhiskyLake/CoffeeLake or 2C/4T Skylake/KabyLake)
  • Increase Turbo ratios
  • 2-channel LP-DDR4 support and DDR4-2667 (up from 2400)
  • WiFi6 (802.11ax) AX201 integrated (from WiFi5 (802.11ac) 9560)
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)

The 3x (three times) increase in core count (6C/12T vs. Skylake/KabyLake 2C/4T) in the same 15-28W power envelope is pretty significant, considering that Core ULV designs since the 1st gen have always had 2C/4T; unfortunately it is limited to the top end, thus even the i7-10510U still has 4C/8T.

LP-DDR4 support is important as many thin & light laptops (e.g. Dell XPS, Lenovo Carbon X1, etc.) have been “stuck” with slow LP-DDR3 memory instead of high-bandwidth DDR4 memory in order to save power. Note the Y-variants (4.5-6W) will not support this.

WiFi is now integrated in the PCH and has been updated to WiFi6/AX (2×2 streams, up to 2400Mbps with a 160MHz-wide channel) from WiFi5/AC (1733Mbps); this also means no simple WiFi-card upgrades in the future as with older laptops (except those with “whitelists” like HP, Lenovo, etc.).

Why review it now?

Until “IceLake” makes its public debut, “CometLake” is the latest ULV APU line from Intel you can buy today; despite being just a revision of “Skylake”, due to increased core counts/Turbo ratios it may still prove a worthy competitor, not just in cost but also performance.

As they contain hardware fixes/mitigations for vulnerabilities discovered since original “Skylake” has launched (especially “Meltdown” but also various “Spectre” variants), the operating system & applications do not need to deploy slower mitigations that can affect performance (especially I/O) on the older designs. For some algorithms, this may be worth an upgrade alone!

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV processor with previous generations (gen 8, 7, 6) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

CPU Specifications AMD Ryzen2 2500U (Raven Ridge) Intel i7 7500U (KabyLake ULV) Intel i7 8550U (CoffeeLake ULV) Intel Core i7 10510U (CometLake ULV) Comments
Cores (CU) / Threads (SP) 4C / 8T 2C / 4T 4C / 8T 4C / 8T No change in core count.
Speed (Min / Max / Turbo) 1.6-2.0-3.6GHz 0.4-2.7-3.5GHz 0.4-1.8-4.0GHz (1.8GHz @ 15W, 2GHz @ 25W) 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) CML has a +22% faster turbo.
Power (TDP) 15-35W 15-25W 15-35W 15-35W Same power envelope.
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way No L1 changes.
L2 Caches 4x 512kB 8-way 2x 256kB 16-way 4x 256kB 16-way 4x 256kB 16-way No L2 changes.
L3 Caches 4MB 16-way 4MB 16-way 8MB 16-way 8MB 16-way And no L3 changes.
Microcode (Firmware) MU8F1100-0B MU068E09-8E MU068E09-AE MU068E0C-BE Revisions just keep on coming.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). “CometLake” (CML) supports all modern instruction sets including AVX2, FMA3 but not AVX512 (like “IceLake”) or SHA HWA (like Atom, Ryzen).
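For reference, here is a minimal sketch – not part of the benchmark itself – of how software can verify at runtime which of these instruction sets a CPU reports, using the documented CPUID feature bits (MSVC intrinsics assumed; on GCC/Clang use __get_cpuid_count from <cpuid.h>):

```cpp
// Minimal sketch: query CPUID for the SIMD feature bits discussed above.
#include <intrin.h>   // MSVC __cpuid / __cpuidex
#include <cstdio>

int main() {
    int r[4] = {};                      // EAX, EBX, ECX, EDX
    __cpuid(r, 1);
    bool fma3 = r[2] & (1 << 12);       // leaf 1, ECX bit 12: FMA3
    bool avx  = r[2] & (1 << 28);       // leaf 1, ECX bit 28: AVX

    __cpuidex(r, 7, 0);                 // leaf 7, subleaf 0
    bool avx2    = r[1] & (1 << 5);     // EBX bit 5:  AVX2
    bool avx512f = r[1] & (1 << 16);    // EBX bit 16: AVX512F
    bool sha     = r[1] & (1 << 29);    // EBX bit 29: SHA extensions

    // A CML-U part should report AVX/AVX2/FMA3 set, AVX512F/SHA clear.
    std::printf("AVX:%d AVX2:%d FMA3:%d AVX512F:%d SHA:%d\n",
                avx, avx2, fma3, avx512f, sha);
}
```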

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks AMD Ryzen2 2500U (Raven Ridge) Intel i7 7500U (KabyLake ULV) Intel i7 8550U (CoffeeLake ULV) Intel Core i7 10510U (CometLake ULV) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 103 73.15 125 134 [+7%] CML starts off 7% faster than CFL – a good start.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 102 74.74 115 135 [+17%] With a 64-bit integer workload the lead increases to 17%.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 79 45 67.29 84.95 [+26%] With floating-point workload CML is 26% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 67 37 57 70.63 [+24%] With FP64 we see a similar 24% improvement.
With integer (legacy) workloads, CML-U brings a modest improvement of about 10% over CFL-U, cementing its top position. But with floating-point (also legacy) workloads we see a larger ~25% increase, which allows it to beat the competition (Ryzen Mobile) that was beating the older designs (CFL-U, WHL-U, KBL-U, etc.)
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 239 193 306 409 [+34%] In this vectorised AVX2 integer test  CML-U is 34% faster than CFL-U.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 53.4 75 117 149 [+27%] With a 64-bit AVX2 integer workload the difference drops to 27%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 2.41 1.12 2.21 2.54 [+15%] This is a tough test using Long integers to emulate Int128 without SIMD; here CML-U is still 15% faster.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 222 160 266 328 [+23%] In this floating-point AVX/FMA vectorised test, CML-U is 23% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 127 94.8 155.9 194.4 [+25%] Switching to FP64 SIMD code nothing much changes – still ~25% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 6.23 4.04 6.51 8.22 [+26%] In this heavy algorithm using FP64 to mantissa extend FP128 with AVX2 – we see 26% improvement.
With heavily vectorised SIMD workloads CML-U is ~25% faster than the previous CFL-U, which may be sufficient to see off future competition from Gen3 Ryzen Mobile with its improved (256-bit) SIMD units – something CFL/WHL-U could not hope to do. IceLake (ICL) with AVX512 should improve over this despite lower clocks.
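The Quad-Float test above “mantissa-extends” FP128 using pairs of FP64 values (so-called double-double arithmetic). The benchmark’s exact kernel is not public; below is a minimal sketch of the standard error-free TwoSum building block such code relies on – note how each extended-precision add costs several FP64 operations, which explains the large drop in Mpix/s:

```cpp
// Minimal double-double sketch (do not compile with -ffast-math,
// which would break the error-free transformations).
#include <cstdio>

struct dd { double hi, lo; };           // value = hi + lo

// Knuth's TwoSum: hi + lo == a + b exactly, no information lost.
static dd two_sum(double a, double b) {
    double s   = a + b;
    double bv  = s - a;                 // the part of b actually absorbed
    double err = (a - (s - bv)) + (b - bv);
    return { s, err };
}

// Add a plain double to a double-double, renormalising the pair:
// 2 calls to two_sum = ~12 FP64 ops for one extended-precision add.
static dd dd_add(dd x, double y) {
    dd s = two_sum(x.hi, y);
    s.lo += x.lo;
    return two_sum(s.hi, s.lo);
}

int main() {
    dd acc{0.0, 0.0};
    for (int i = 0; i < 10; ++i) acc = dd_add(acc, 0.1);  // 0.1 is inexact in FP64
    std::printf("hi=%.17g lo=%.17g\n", acc.hi, acc.lo);   // lo holds the residual
}
```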
BenchCrypt Crypto AES-256 (GB/s) 10.9 7.28 13.11 12.11 [-8%] With AES/HWA support all CPUs are memory bandwidth bound.
BenchCrypt Crypto AES-128 (GB/s) 10.9 9.07 13.11 12.11 [-8%] No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 6.78 2.55 3.97 4.28 [+8%] Without SHA/HWA Ryzen Mobile beats even CML-U.
BenchCrypt Crypto SHA1 (GB/s) 7.13 4.07 7.19 Less compute intensive SHA1 allows CML-U to catch up.
BenchCrypt Crypto SHA2-512 (GB/s) 1.48 1.54 SHA2-512 is not accelerated by SHA/HWA, thus CML-U does better.
The memory sub-system is crucial here, and CML-U can improve over older designs when using faster memory (which we were not able to use here). Without the SHA/HWA that Ryzen Mobile supports, CML-U cannot beat it and improves only marginally over the older CFL-U.
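As an illustration of why AES/HWA shifts the bottleneck to memory: with AES-NI, each 16-byte block costs only a handful of AESENC instructions. A minimal sketch follows (dummy round keys – the real AESKEYGENASSIST key schedule is omitted for brevity; this is purely illustrative, not the benchmark’s code). Compile with -maes on GCC/Clang:

```cpp
#include <wmmintrin.h>   // AES-NI intrinsics
#include <cstdint>
#include <cstdio>

int main() {
    __m128i rk[11];                              // AES-128 uses 11 round keys
    for (int i = 0; i < 11; ++i)                 // dummy keys, NOT a real schedule
        rk[i] = _mm_set1_epi32(0x01010101 * (i + 1));

    __m128i block = _mm_set1_epi32(0xdeadbeef);  // one 16-byte plaintext block
    block = _mm_xor_si128(block, rk[0]);         // initial whitening
    for (int r = 1; r < 10; ++r)
        block = _mm_aesenc_si128(block, rk[r]);  // 9 full rounds, ~1 instr each
    block = _mm_aesenclast_si128(block, rk[10]); // final round (no MixColumns)

    alignas(16) uint8_t out[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), block);
    std::printf("%02x...\n", out[0]);            // compute is so cheap that the
}                                                // memory bus becomes the limit
```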
BenchFinance Black-Scholes float/FP32 (MOPT/s) 93.34 49.34 73.02 With non-vectorised code CML-U needs to catch up.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 77.86 43.33 75.24 87.17 [+16%] Using FP64 CML-U is 16% faster finally beating Ryzen Mobile.
BenchFinance Binomial float/FP32 (kOPT/s) 35.49 12.3 16.2 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 19.46 11.4 19.31 20.99 [+9%] With FP64 code CML-U is 9% faster than CFL-U.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 20.11 9.87 14.61 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 15.32 7.88 14.54 16.54 [+14%] Switching to FP64 nothing much changes, CML-U is 14% faster.
With non-SIMD financial workloads, CML-U modestly improves (10-15%) over the older CFL-U, but this does allow it to beat the competition (Ryzen Mobile) which dominated the older designs. This may just be enough to match future Gen3 Ryzen Mobile and thus be competitive all-round.
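For reference, the closed-form Black-Scholes pricer is dominated by transcendentals (log/exp/erfc), which is why this test exercises the classic FPU rather than the SIMD units. A minimal scalar FP64 sketch (not the benchmark’s exact code):

```cpp
#include <cmath>
#include <cstdio>

static double norm_cdf(double x) {      // standard normal CDF via erfc
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// S: spot, K: strike, r: risk-free rate, v: volatility, T: years to expiry
static double bs_call(double S, double K, double r, double v, double T) {
    double d1 = (std::log(S / K) + (r + 0.5 * v * v) * T) / (v * std::sqrt(T));
    double d2 = d1 - v * std::sqrt(T);
    return S * norm_cdf(d1) - K * std::exp(-r * T) * norm_cdf(d2);
}

int main() {
    // At-the-money 1-year call, 5% rate, 20% vol: ~10.45
    std::printf("call = %.4f\n", bs_call(100.0, 100.0, 0.05, 0.2, 1.0));
}
```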
BenchScience SGEMM (GFLOPS) float/FP32 107 76.14 141 158 [+12%] In this tough vectorised AVX2/FMA algorithm CML-U is 12% faster.
BenchScience DGEMM (GFLOPS) double/FP64 47.2 31.71 55 69.2 [+26%] With FP64 vectorised code, CML-U is 26% faster than CFL-U.
BenchScience SFFT (GFLOPS) float/FP32 3.75 7.21 13.23 13.93 [+5%] FFT is also heavily vectorised (x4 AVX2/FMA) but stresses the memory sub-system more.
BenchScience DFFT (GFLOPS) double/FP64 4 3.95 6.53 7.35 [+13%] With FP64 code, CML-U is 13% faster.
BenchScience SNBODY (GFLOPS) float/FP32 112.6 105 160 169 [+6%] N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 45.3 30.64 57.9 64.16 [+11%] With FP64 code nothing much changes.
With highly vectorised SIMD code (scientific workloads) CML-U is again 15-25% faster than CFL-U, which should be enough to match future Gen3 Ryzen Mobile with its 256-bit SIMD units. Again, we need ICL with AVX512 – or more cores – to dominate these workloads.
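GEMM-style code boils down to fused multiply-adds over packed floats; below is a minimal AVX2/FMA sketch of the inner-loop pattern (illustrative only – production GEMM adds cache blocking, which is where the memory sub-system notes above come in). Compile with -mavx2 -mfma:

```cpp
#include <immintrin.h>
#include <cstdio>

// c[0..n8-1] += sum over p of a[p] * b[p][0..n8-1]; n8 must be a multiple of 8.
static void fma_row(const float* a, const float* b, float* c, int k, int n8) {
    for (int j = 0; j < n8; j += 8) {
        __m256 acc = _mm256_loadu_ps(c + j);
        for (int p = 0; p < k; ++p) {
            __m256 av = _mm256_set1_ps(a[p]);            // broadcast A element
            __m256 bv = _mm256_loadu_ps(b + p * n8 + j); // strip of B's row p
            acc = _mm256_fmadd_ps(av, bv, acc);          // 8 MACs per FMA instr
        }
        _mm256_storeu_ps(c + j, acc);
    }
}

int main() {
    float a[4] = {1, 2, 3, 4}, b[4 * 8], c[8] = {};
    for (int i = 0; i < 32; ++i) b[i] = 0.5f;
    fma_row(a, b, c, 4, 8);
    std::printf("c[0]=%g\n", c[0]);                      // (1+2+3+4)*0.5 = 5
}
```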
CPU Image Processing Blur (3×3) Filter (MPix/s) 532 474 720 891 [+24%] In this vectorised integer AVX2 workload CML-U is 24% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 146 191 290 359 [+24%] Same algorithm but more shared data still 24%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 123 98.3 157 186 [+18%] Again same algorithm but even more data shared reduces improvement to 18%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 185 164 251 302 [+20%] Different algorithm but still AVX2 vectorised workload still 20% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 26.49 14.38 25.38 27.73 [+9%] Still AVX2 vectorised code but here just 9% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.38 7.63 14.29 15.74 [+10%] Similar improvement here of about 10%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 660 764 1525 1580 [+4%] With integer AVX2 workload, only 4% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 94.16 105.1 188.8 214 [+13%] In this final test again with integer AVX2 workload CML-U is 13% faster.
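For reference, the access pattern behind these convolution filters: a (2r+1)×(2r+1) neighbourhood is re-read for every output pixel, so larger kernels share progressively more data between neighbouring pixels – which is what squeezes the gain from 24% (3×3) down to 18% (7×7) above. A minimal scalar sketch of the 3×3 case (the benchmark itself vectorises this with AVX2):

```cpp
#include <vector>
#include <cstdio>

static void blur3x3(const std::vector<float>& src, std::vector<float>& dst,
                    int w, int h) {
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            float s = 0.0f;
            for (int dy = -1; dy <= 1; ++dy)    // 3x3 neighbourhood: 9 reads
                for (int dx = -1; dx <= 1; ++dx)
                    s += src[(y + dy) * w + (x + dx)];
            dst[y * w + x] = s / 9.0f;          // box average
        }
}

int main() {
    int w = 8, h = 8;
    std::vector<float> src(w * h, 1.0f), dst(w * h, 0.0f);
    blur3x3(src, dst, w, h);
    std::printf("%g\n", dst[3 * w + 3]);        // 1.0: a flat image stays flat
}
```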

Without support for any new instruction sets (AVX512, SHA/HWA, etc.), CML-U was never going to be a revolution in performance; it has to rely on clock increases and very minor improvements/fixes (especially for vulnerabilities). Versions with more cores (6C/12T) would certainly help if they can stay within the power limits (TDP/turbo).

Intel themselves did not claim a big performance improvement – still, CML-U is 10-25% faster than CFL-U across workloads at the same TDP. At the same cost/power it is a welcome improvement, and it does allow Intel to beat the current competition (Ryzen Mobile) which was nipping at its heels; it may also be enough to match future Gen3 Ryzen Mobile designs.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For some it may be disappointing that we do not have the brand-new, improved “IceLake” (ICL-U) now rather than a 3rd revision of “Skylake” – but “CometLake” (CML-U) does seem to improve even over the previous revisions (8/9th gen “WhiskyLake”/”CoffeeLake” WHL/CFL-U) while, due to 2x the core count, completely outperforming the originals (6/7th gen “Skylake”/”KabyLake”) in the same power envelope. Perhaps it also shows how much Intel has had to improve at short notice due to Ryzen Mobile APUs (e.g. 2500U) that finally brought competition to the mobile space.

While owners of 8/9th-gen hardware won’t be upgrading – it is very rare to recommend changing from one generation to the next anyway – owners of older hardware can look forward to an over-2x performance increase in most workloads at the same power draw, not to mention the additional features (integrated WiFi6, Thunderbolt 3, etc.).

On the other hand, the competition (AMD Ryzen Mobile) also offers good performance and the older 8/9th-gen parts remain competitive – thus it will all depend on price. With Gen3 Ryzen Mobile on the horizon (with 256-bit SIMD units), “CometLake” may just manage to match it on performance. It may also be worth waiting for “IceLake” to make its debut to see what performance improvements it brings and at what cost – which may also push “CometLake” prices down.

All in all, Intel has managed to “squeeze” all it can from the old Skylake arch which, while not revolutionary, still has enough to be competitive with current designs – and with the future 50% core-count increase (6C/12T from 4C/8T) might even beat them not just on cost but also performance.

In a word: Qualified Recommendation!

Please see our other articles on:

AMD Radeon 5700XT: Navi GPGPU Performance in OpenCL

What is “Navi”?

It is the code-name of the new AMD GPU, the first of the brand-new RDNA (Radeon DNA) GPU arch(itecture) – replacing “Vega”, the last of the GCN (Graphics Core Next) arch(itecture). It is a mid-range GPU optimised for gaming and thus not expected to set records, but GPUs today are used for many other tasks (mining, encoding, algorithm/compute acceleration, etc.) as well.

The RDNA arch brings big changes from the various GCN revisions we’ve seen previously, but this first iteration does not bring any major new features, at least in the compute domain. Hopefully the next versions will bring tensor units (matrix multipliers), additional accelerated instruction sets and so on.

See these other articles on GPGPU performance:

Hardware Specifications

We are comparing the mid-range Radeon with previous-generation cards and competing architectures, with a view to upgrading to a mid-range but high-performance design.

GPGPU Specifications AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
Arch Chipset RDNA / Navi 10 GCN5.1 / Vega 20 Pascal / GP102 GCN5.0 / Vega 10 The first of the Navi chips.
Cores (CU) / Threads (SP) 40 / 2560 60 / 3840 28 / 3584 56 / 3584 Fewer CUs than Vega1 but the same (64x) SP per CU.
SIMD per CU / Width 2 / 32 [2x] 4 / 16 4 / 16 Navi increases the SIMD width but decreases their number.
Wave/Warp Size 32 [1/2x] 64 32 64 Wave size is halved to match nVidia.
Speed (Min-Turbo) 1.6 / 1.755 1.4 / 1.75 1.531 / 1.91 1.156 / 1.471 40% faster base and 20% faster turbo than Vega1.
Power (TDP) 225W 295W 250W 210W Slightly higher TDP but nothing significant.
ROP / TMU 64 / 160 64 / 240 96 / 224 64 / 224 ROPs are the same but ~30% fewer TMUs.
Shared Memory 64kB [+2x] 32kB 48kB / 96kB per SM 32kB 2x more shared memory, allowing bigger kernels.
Constant Memory 4GB 8GB 64kB dedicated 4GB No dedicated constant memory but large.
Global Memory 8GB GDDR6 14Gt/s 256-bit 16GB HBM2 2Gt/s 4096-bit 12GB GDDR5X 10Gt/s 384-bit 8GB HBM2 1.6Gt/s 2048-bit Sadly no HBM this time; the memory is faster but the bus far narrower.
Memory Bandwidth (GB/s) 448GB/s [+9%] 1024GB/s 512GB/s 410GB/s Still, bandwidth is 9% higher than Vega1’s.
L1 Caches ? x40 16kB x60 48kB x28 16kB x56 L1 does not appear changed, but details are unclear.
L2 Cache 4MB 4MB 3MB 4MB L2 has not changed.
Maximum Work-group Size 1024 / 1024 256 / 1024 1024 / 2048 per SM 256 / 1024 AMD has unlocked the default work-group size to 4x (1024 vs. 256).
FP64/double ratio 1/16x 1/4x 1/32x 1/16x Same ratio as consumer Vega1 rather than pro Vega2.
FP16/half ratio 2x 2x 1/64x 2x Same 2x ratio on all AMD cards.
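For reference, the work-group and local-memory limits in the table above are exactly what OpenCL reports via clGetDeviceInfo; a minimal query sketch (error handling omitted, assumes an OpenCL SDK is installed):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);

    size_t wg = 0; cl_ulong lmem = 0;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(wg), &wg, nullptr);
    clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(lmem), &lmem, nullptr);

    // Per the table above, Navi should report 1024 and 64kB here,
    // vs. Vega1's default 256 and 32kB.
    std::printf("max work-group: %zu, local mem: %llu bytes\n",
                wg, (unsigned long long)lmem);
}
```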

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 18,265 [-7%] 29,057 245 19,580 Navi starts well but cannot beat Vega1.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 11,863 [-13%] 17,991 17,870 13,550 Standard FP32 increases the gap to 13%.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 1,047 [-16%] 5,031 661 1,240 FP64 does not change much, Navi is 16% slower.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 43 [-45%] 226 25 77 Emulated FP128 is hard on the FP64 units; here Navi is almost 1/2x Vega1.
Starting out, Navi does not seem able to beat Vega1 in heavily vectorised compute loads: FP16 is the most efficient (near parity) while complex FP128 is ~2x slower.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 51 [-25%] 91 42 67 Despite more bandwidth Navi is 25% slower than Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 176 [+40%] 209 145 125 Navi shows its power here beating Vega1 by a huge 40%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 76 32
Despite GDDR6’s higher bandwidth, streaming algorithms work better on the “old” HBM2, thus Navi cannot beat Vega. But in pure integer compute algorithms like hashing it is significantly faster, which bodes well for the future.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 12,459 [+31%] 23,164 11,480 9,500 In this FP32 financial workload Navi is 31% faster than Vega1!
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 1,370 1,880
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 850 [1/3x] 3,501 2,240 2,530 Binomial uses thread shared data thus stresses the memory system and here we have some optimisation to do.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 129 164
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,027 [+30%] 6,249 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Navi is again 30% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 294 472
For financial FP32 workloads, Navi is ~30% faster than Vega1 – a pretty good improvement – though it naturally cannot compete with Vega2, whose pro-class FP64 ratio (1/4x vs. the consumer 1/16x) is far higher. Crypto-currency fans will love Navi.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 5,165 [+2%] 6,634 6,073 5,066 GEMM can only bring a measly 2% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 340 620
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 376 [+2%] 643 235 369 FFT loves HBM but Navi is still 2% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 207 175
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 4,534 [-6%] 6,846 5,720 4,840 Navi can’t manage as well in N-Body and ends up 6% slower.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 275 447
The scientific scores don’t show the same improvement as the financial ones likely due to heavy use of shared memory with Navi just matching Vega1. Perhaps the larger shared memory can allow us to use larger workgroups.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 8,674 [1/2.1x] 25,418 18,410 19,130 In this 3×3 convolution algorithm, Navi is about half the speed of Vega1.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,734 [1/3x] 5,275 5,000 4,340 Same algorithm but more shared data makes Navi even slower.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 1,802 [1/2.5x] 5,510 5,080 4,450 With even more data the gap remains at 1/2.5x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,723 [1/2.5x] 5,273 4,800 4,300 Still convolution but with 2 filters – same 1/2.5x performance.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 48.44 [=] 92.53 37 48 Different algorithm allows Navi to tie with Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 97.34 [+2.5x] 57.66 12.7 38 Without major processing, this filter performs well on Navi.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 32,050 [+1.5x] 47,349 19,480 20,880 This algorithm is 64-bit integer heavy and Navi is 50% faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 9,516 [+1.6x] 7,708 305 6,000 One of the most complex and largest filters; Navi is ~60% faster.
For image processing using FP32 precision, Navi ranges from 1/2.5x Vega1 performance (convolution) to 50-60% faster (complex algorithms with integer processing). It seems some optimisations are needed for the convolution algorithms.
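The usual fix for convolution on GPUs is to stage an image tile in local (“shared”) memory so each pixel is fetched from global memory once per work-group rather than once per filter tap; Navi’s 64kB local memory and 1024-item work-groups give such kernels more headroom than Vega1 had. A minimal OpenCL C sketch (assumes a 16×16 work-group; not the benchmark’s kernel):

```c
__kernel void blur3x3_tiled(__global const float* src, __global float* dst,
                            int w, int h) {
    __local float tile[18][18];            // 16x16 tile plus a 1-pixel halo

    int gx = get_global_id(0), gy = get_global_id(1);
    int lx = get_local_id(0),  ly = get_local_id(1);

    // Cooperatively fill the 18x18 tile (halo included): 256 work-items
    // load 324 values, clamping out-of-image reads to 0.
    for (int i = ly * 16 + lx; i < 18 * 18; i += 16 * 16) {
        int ty = i / 18, tx = i % 18;
        int sy = gy - ly + ty - 1, sx = gx - lx + tx - 1;  // group origin - 1
        tile[ty][tx] = (sx >= 0 && sx < w && sy >= 0 && sy < h)
                     ? src[sy * w + sx] : 0.0f;
    }
    barrier(CLK_LOCAL_MEM_FENCE);          // tile must be complete before use

    if (gx >= 1 && gx < w - 1 && gy >= 1 && gy < h - 1) {
        float s = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)   // all 9 taps now hit local memory
            for (int dx = -1; dx <= 1; ++dx)
                s += tile[ly + 1 + dy][lx + 1 + dx];
        dst[gy * w + gx] = s / 9.0f;
    }
}
```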

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 376 [+13%] 627 356 333 Navi’s GDDR6 manages 13% more bandwidth than Vega1.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 21.56 [+77%] 12.37 11.4 12.18 PCIe 4.0 brings almost 80% more bandwidth
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 22.28 [+84%] 12.95 12.2 12.08 Again almost 2x more bandwidth.
Navi’s PCIe 4.0 interface (on 500-series motherboards) brings, as expected, almost 2x more upload/download bandwidth, while its high-clocked GDDR6 manages just over 10% more bandwidth than Vega1’s HBM2.
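For reference, upload bandwidth is typically measured by timing a large blocking host-to-device transfer; a minimal OpenCL sketch (error handling omitted):

```cpp
#include <CL/cl.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, nullptr, nullptr);

    const size_t bytes = 256u << 20;                 // 256MB test buffer
    std::vector<char> host(bytes, 1);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, nullptr);

    auto t0 = std::chrono::steady_clock::now();      // blocking write = upload
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host.data(), 0, nullptr, nullptr);
    double dt = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();

    // On PCIe 4.0 x16 (Navi on a 500-series board) expect roughly
    // double the PCIe 3.0 figure, as in the table above.
    std::printf("upload: %.1f GB/s\n", bytes / dt / 1e9);
}
```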
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 276 [+11%] 202 201 247 Navi’s GDDR6 brings slight latency increase (+10%)
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 286 353
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87.6 80
Not unexpectedly, GDDR6’s latencies are higher than HBM2’s, although not by as much as we were fearing.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

“Navi” is an interesting chip to be sure and perhaps more was expected of it; as always the drivers are the weak link and it is hard to determine which issues will be fixed driver-side and which will need to be optimised in compute kernels.

Thus, performance-wise, it oscillates between 1/2x and 1.5x Vega1 performance depending on algorithm, with compute-heavy algorithms (especially crypto-currencies) doing best and shared/local-memory-heavy algorithms doing worst. The 2x bigger shared memory (64kB vs. 32kB) in conjunction with the larger default work-group sizes (1024 vs. 256) does present future optimisation opportunities. AMD has also reduced the warp/wave size to match nVidia – a historic change.

Memory-wise, the cost-cutting change from HBM2 to (even high-speed) GDDR6 does bring slightly more bandwidth but naturally higher latencies; meanwhile PCIe 4.0 doubles upload/download bandwidth, which will become much more important on higher-capacity (16GB+) cards in the future.

Overall it is hard to recommend for compute workloads unless the particular algorithm (crypto, financial) does well on Navi; otherwise the much older Vega1 56/64 offer a better performance/cost ratio, especially today. However, as drivers mature and implementations are optimised for it, Navi is likely to perform better.

We are looking forward to the next iterations of Navi, especially the rumoured “big Navi” version optimised for compute…