Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Compute

Hyper-V

What is “Hyper-V”?

Hyper-V is Microsoft’s virtualisation solution, a Type 1 (i.e. “bare-metal”) hypervisor that is included in Windows Server (since 2008) and, more recently, Windows Client (e.g. Windows 10). A hypervisor creates “virtual computers” on which complete operating systems (VMs) can be installed – all sharing the same physical hardware.

While companies have been virtualising servers for over a decade – not to forget the more recent transition to the “cloud” – client computers have generally used Type 2 (i.e. running on top of the operating system) hypervisors (e.g. Oracle VirtualBox, VMware Workstation, Microsoft’s old Virtual PC). With new security technologies like “Core Isolation” and “Memory Integrity”, container technologies (e.g. Docker) and mobile development emulators (e.g. Android, the now-defunct Windows Mobile), many users now have Hyper-V enabled.

Hyper-V, like most modern hypervisors, requires hardware-assisted virtualisation support (e.g. Intel VT-x, AMD-V), SLAT (Second Level Address Translation) and IOMMU (e.g. Intel VT-d) – but most modern hardware (CPU, chipset, BIOS, etc.) should support these.

One huge advantage of Hyper-V is that, unlike dedicated hypervisors (e.g. VMware ESXi), it uses standard Windows drivers for hardware – thus if Windows supports it, Hyper-V will support it too. This can be important when the hardware is not “server grade”, is niche, or is too old or too new.

What is the (performance) impact of enabling Hyper-V?

Enabling Hyper-V is easy and visual changes are minimal – but big changes take place “under the hood”: the operating system (Windows) no longer runs on the bare metal but becomes a VM (virtual machine) running as the “root/parent partition”, to which key hardware is passed through. Hardware like the video card (GP-GPU) thus works as if nothing has happened, but it can be “detached” and passed through to other VMs (child partitions) that can now run in addition to the root operating system. However, only one (1) VM can use the hardware directly.

More advanced hardware (generally network cards) supports SR-IOV (Single-Root I/O Virtualisation), which can expose multiple VFs (Virtual Functions) that allow it to be shared between VMs as if each had its own hardware. New video cards (e.g. nVidia “Ampere”) now support SR-IOV, allowing multiple VMs to use the hardware compute (GPGPU) capabilities of the physical host.

While the root partition Windows now inhabits is “privileged” – having access to all the hardware – it is still virtualised, running on top of Hyper-V, and thus performance will be impacted. Mitigations for vulnerabilities (e.g. “Meltdown”, “Spectre”, “MDS”, etc.) may apply to the hypervisor in addition to the root Windows operating system and impact performance even further.

Users may decide to create and run other VMs (child partitions), install additional copies of Windows and run various applications or services there – leaving the host Windows partition “clean”. A better way would be to use the free Microsoft Hyper-V Server to manage Hyper-V and run the Windows client itself as a VM.

Why measure the performance impact of hypervisors?

Power users (i.e. our clients using Sandra) want to get the very best performance out of their systems; many may overclock or even disable vulnerability mitigations in the quest for the highest performance (or benchmark scores). While modern hardware and hypervisors use virtualisation-acceleration features, there is still a performance impact to enabling virtualisation (and thus Hyper-V). Even on modern hardware with many cores & threads, this performance degradation may be significant.

Users may also need to create a VM/container using an older (e.g. Windows 7, XP, etc.) or different (e.g. Linux, FreeBSD, etc.) operating system in order to run older/non-Windows applications or services (e.g. game emulation, firewall/VPN, home automation, etc.) that cannot run on the host operating system.

It is also a good idea to run untrusted apps/services in a separate VM/container in order not to corrupt the host operating system. Evaluation software (whether try-before-you-buy or pre-release/beta) is also commonly provided in container/VM form for easy deployment and evaluation.

CPU Performance Impact of Hyper-V

In this article we test CPU core performance; please see our other articles on:

  • Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Cache and Memory
  • Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Storage
  • Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Networking

Hardware Specifications

We are comparing (relatively) high-end desktop hardware running latest (client) Windows with/without Hyper-V and running a Windows (client) VM of comparable specification.

CPU Specifications Bare-Metal  (Intel i9-7900X) 10C/20T Root HyperV (Intel i9-7900X) 10C/20T VM HyperV (Intel i9-7900X) 20 vCpu Comments
Cores (CU) / Threads (SP) 10C / 20T 10C / 20T 20 vCPUs Same thread counts.
Memory 4x 8GB (32GB) DDR4 3200Mt/s 4x 8GB (32GB) DDR4 3200Mt/s 24GB VM has slightly less memory assigned.
Power Profile Balanced Balanced Balanced Default power profile.
Storage 512GB NVMe NTFS 512GB NVMe NTFS 256GB NTFS VHDX Same storage backend.
Instruction Sets AVX512, AES AVX512, AES AVX512, AES All instruction sets passed through (native)

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512*, AES*, SHA*).

Note(*): To enable advanced SIMD instruction sets in a VM –  the VM must have “Migrate to a physical processor with different processor version” disabled; otherwise only basic instruction sets will be available resulting in much lower performance.
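
As a quick sanity check before benchmarking, a minimal sketch (assuming the MSVC __cpuidex intrinsic) of how one might verify which instruction sets the virtual CPU actually exposes:

```cpp
#include <intrin.h>
#include <cstdio>

// Minimal sketch (MSVC <intrin.h> assumed): report whether AVX-512F is
// exposed to the (virtual) CPU. Simplified – a production check would also
// verify the maximum CPUID leaf and OS XSAVE state support.
int main() {
    int regs[4];                           // EAX, EBX, ECX, EDX
    __cpuidex(regs, 7, 0);                 // structured extended features leaf
    bool avx512f = (regs[1] >> 16) & 1;    // EBX bit 16 = AVX-512 Foundation
    std::printf("AVX-512F %s\n", avx512f ? "available" : "not exposed");
    return 0;
}
```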

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
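
For context only – the benchmarks use 2MB large pages, and a hedged sketch of how an application might request them on Windows (the process needs the SeLockMemoryPrivilege; this is illustrative, not Sandra’s actual code) looks like this:

```cpp
#include <windows.h>

// Hedged sketch: request 2MB "large pages" on Windows. The process token
// needs SeLockMemoryPrivilege enabled, and the allocation size must be a
// multiple of the large-page minimum.
void* alloc_large_pages(SIZE_T bytes) {
    SIZE_T large = GetLargePageMinimum();               // typically 2MB on x64
    if (large == 0) return nullptr;                     // large pages unsupported
    SIZE_T size = ((bytes + large - 1) / large) * large;    // round up
    return VirtualAlloc(nullptr, size,
                        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                        PAGE_READWRITE);
}
```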

Native Benchmarks VM HyperV (Intel i9-7900X) 20 vCpu Root HyperV (Intel i9-7900X) 10C/20T Bare-Metal  (Intel i9-7900X) 10C/20T Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 397 [-11%] 446 [=] 444 No difference for Root, VM is 11% slower.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 390 [-13%] 445 [=] 446 No significant changes, VM is 13% slower.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 247 [-7%] 260 [-3%] 267 With floating-point, VM is 7% slower and root 3% slower.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 208 [-6%] 221 [=] 222 With FP64, VM is just 6% slower.
With legacy workloads (not using SIMD) – the root partition is just as fast as bare metal. The VM, despite the same number of threads does take a performance hit between 6 and 13% – higher for integer workloads, lower for floating-point.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1,360 [-7%] 1,470 [=] 1,460 With AVX512, VM is 7% slower, Root no change.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 524 [-3%] 547 [+1%] 542 With a 64-bit AVX512 integer workload, no change.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 119 [-4%] 125 [+1%] 124 A tough test using long integers to emulate Int128: no change.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1,710 [-9%] 1,870 [=] 1,870 In this floating-point vectorised test VM is 9% slower.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 1,090 [-8%] 1,180 [=] 1,180 Switching to FP64 SIMD AVX512 code no change.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 46.7 [-5%] 49 [=] 48.88 A heavy algorithm using FP64 to mantissa-extend FP128: no change.
With heavily vectorised SIMD workloads we still see a similar pattern: the root partition is as fast as bare metal, while the VM takes a smaller performance hit of between 3-9%. Thus, enabling HV has no discernible effect on heavy compute performance (on the root partition with no VMs running) – while even VM use has only a small performance impact.
BenchCrypt Crypto AES-256 (GB/s) 33.36 [-3%] 34.2 [=] 34.28 Memory bandwidth rules here thus VM is just 3% slower.
BenchCrypt Crypto AES-128 (GB/s) 33 [=] 33 [=] 33.18 No performance difference here at all.
BenchCrypt Crypto SHA2-256 (GB/s) 26.2 [-4%] 27.2 [=] 27.2 VM is 4% slower here.
BenchCrypt Crypto SHA1 (GB/s) 43.1 [-5%] 45.4 [=] 45.5 Less compute intensive SHA1 makes VM 5% slower.
BenchCrypt Crypto SHA2-512 (GB/s) 22 22.7 [-4%] 22.9 SHA2-512 is compute intensive thus same performance.
The memory sub-system bandwidth is crucial here, and we see the least variation in performance – even the VM ties up with bare metal in two tests. For streaming tests, HV does not affect performance – we shall later see if this holds for latency tests.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 269 [-20%] 335 [-1%] 337 With BS we see the biggest VM hit of 20%.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 238 [-18%] 288 [-1%] 290 Using FP64 we see a similar 18% loss.
BenchFinance Binomial float/FP32 (kOPT/s) 64.6 [-5%] 67.3 [-1%] 68.1 Binomial uses thread shared data and here we see a 5% loss.
BenchFinance Binomial double/FP64 (kOPT/s) 67.2 [-6%] 71.8 [=] 71.85 With FP64 code we see a 6% loss.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 246 [-3%] 252 [=] 253 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 102 [-2%] 103 [-1%] 103.9 Switching to FP64 we see only a 2% loss.
With non-SIMD financial workloads, we see the biggest performance drop for VM of ~20%, though other tests are just 2-6% slower. Root partition use is just 1% slower than bare metal which is within margin of error. Still, it is more likely that the GPGPU will be used for such workloads today.
BenchScience SGEMM (GFLOPS) float/FP32 706 [+1%] 708 [+2%] 698 In this tough vectorised algorithm, we see minor changes.
BenchScience DGEMM (GFLOPS) double/FP64 292 [+2%] 286 [+1%] 284 With FP64 vectorised code, minor change.
BenchScience SFFT (GFLOPS) float/FP32 38 [-3%] 39 [=] 39.12 FFT is also heavily vectorised but memory dependent still minor.
BenchScience DFFT (GFLOPS) double/FP64 18.78 [-5%] 19.65 [=] 19.7 With FP64 code, nothing much changes.
BenchScience SNBODY (GFLOPS) float/FP32 573 [-4%] 591 [=] 592 N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 171 [-4%] 179 [=] 179 With FP64 code we see 4% loss.
With highly vectorised SIMD code (scientific workloads), the performance changes become minimal – with even VM at most 4% slower and in some cases even slightly faster (likely due to synchronisation). For such heavy compute, you can even use VMs with no appreciable performance loss; naturally root partition shows no loss whatsoever.
Neural Networks NeuralNet Single SCNN Inference (Samples/s) 56.37 [-7%] 61.49 [+1%] 60.94 Also heavily vectorised, inference is 7% slower in VM.
Neural Networks NeuralNet Single SCNN Training (Samples/s) 8.59 [-6%] 9.17 [=] 9.1 Training is compute intensive but we see similar results.
Neural Networks NeuralNet Double DCNN Inference (Samples/s) 20.22 [-7%] 20.23 [-7%] 21.68 FP64 brings first loss for root at 7% (same as VM)
Neural Networks NeuralNet Double DCNN Training (Samples/s) 3.07 [-5%] 2.98 [-7%] 3.2 FP64 training  shows a similar 7% loss.
Neural Networks NeuralNet Single SRNN Inference (Samples/s) 60.8 [-15%] 71.77 [=] 71.84 RNN is memory access heavy and here VM takes 15% loss.
Neural Networks NeuralNet Single SRNN Training (Samples/s) 6.03 [-6%] 5.99 [-6%] 6.37 Training is compute intensive but VM is just 6% slower.
Neural Networks NeuralNet Double DRNN Inference (Samples/s) 29.92 [-13%] 34.26 [=] 34.3 FP64 also brings a VM loss of 13% here.
Neural Networks NeuralNet Double DRNN Training (Samples/s) 3.37 [-3%] 3.51 [+1%] 3.45 Training brings losses down to just 3% for the VM.
While heavily vectorised/SIMD, neural networks are also memory intensive, which for the first time shows a loss for the root partition, likely due to different memory access latencies. However, this is a very isolated result, not seen in other tests (so far). VM use also shows the largest performance loss of up to 15%, though typical losses are between 5-7%.
CPU Image Processing Blur (3×3) Filter (MPix/s) 4,000 [-12%] 4,530 [=] 4,530 In this vectorised integer workload VM is 12% slower.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 1,840 [-8%] 2,000 [=] 2,000 Same algorithm but more shared data VM is 8% slower.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 965 [-5%] 1,000 [=] 1,000 Again same algorithm but even more data shared VM is 5% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 1,390 [-11%] 1,560 [=] 1,560 Different algorithm but still vectorised VM is 11% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 199 [-9%] 216 [=] 217 Still vectorised VM is 9% slower.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 65.29 [-4%] 68 [=] 68 Different algorithm, VM is just 4% slower.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 3,160 [-23%] 3,460 [-6%] 4,090 With integer workload, VM is 23% slower while Root is 6% slower.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 727 [-7%] 780 [=] 775 In this final test again VM is 7% slower.
Similar to what we saw before, VM is between 4-12% slower than bare-metal with an outlier of 23% slower. Root partition is again as fast as bare metal – with an outlier of 6% slower. Overall we see the same deltas we have seen before.

For native compute workloads, either legacy or vectorised/SIMD – enabling HV on the system has no discernible performance impact on the root partition (with *no* VMs running). Enabling HV for better security costs no performance.

Running the same workloads in a VM – even with the same number of threads (vCPUs) as the system – does mean a performance loss of between 5-10% depending on workload, with legacy (integer) workloads affected more (~10%) than heavy SIMD compute workloads (~3-5%).

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Virtualisation has come a long way and is no longer the preserve of servers; it is likely to be enabled by default even on client computers in order to provide security for the operating system (Windows) as well as isolation (sandboxes) for applications and services. You may even decide to run some additional VMs (e.g. running a different operating system like Linux, FreeBSD, etc.) or containers (e.g. running game emulators, firewall/VPN, home automation, etc.) as well.

The good news is that enabling Hyper-V (for any reason) does not cause performance degradation – despite the virtualisation of the OS into the parent partition and despite the various vulnerability mitigations deployed for both OS and hypervisor. It is really great to see that all the benefits of virtualisation bring no performance loss.

Running tasks in a separate VM (with the same number of vCPUs as host threads) does mean slight performance degradation (5-10%). This should be acceptable if you need to keep those workloads completely separate for whatever reason (security, requiring an old OS, requiring a different OS, etc.). Let’s remember the parent partition is also running in this case (thus 2 VMs + hypervisor). [Ordinarily, you would not assign a VM as many threads as the host has – but if you need to (especially on low core-count hosts), you can.]

Adding complexity (running virtualised) can bring new issues, mainly concerning 3rd-party device drivers (video, network, peripherals, etc.) that may throw new errors when running virtualised – however, by now most modern drivers will have been tested and certified to work in virtualised mode. Older device drivers may still be problematic.

In conclusion, we do not see any downside to enabling Hyper-V and the new security measures (Core Isolation, Memory Integrity, etc.) in Windows. You can also experiment with creating VMs and containers and playing with the new technology.

In a word: Recommended!

Please see our other articles on:

  • Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Cache and Memory
  • Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Storage
  • Performance Impact of Hyper-V virtualisation (Windows 10 Pro) – Networking

SiSoftware Sandra 20/20/10 (2020 R10x) Update – optimisations and fixes

Update Wizard

Note: Original R10 release article has been updated with R10x update.

We are pleased to release R10x (version 30.77) update for Sandra 20/20 (2020) with the following changes:

Sandra 20/20 (2020) Press Release

Latest Sandra Version

We are moving towards a tiered system where different versions (R numbers) are provided to different customers depending on their needs and version type. This allows us, in these tough times, to prioritise our customers while still providing stable, fully-featured versions to the community. We will still aim to release all versions together where possible, as before, but we no longer guarantee it.

  • Manufacturers/OEM, Tech Support, Reviewers:
    • Latest Sandra (Beta) version, R+1
  • Commercial (Professional/Business/Engineer/Enterprise):
    • Current Sandra (Stable) version, R (R+1 if required*)
  • Lite (Evaluation):
    • Previous Sandra (Stable) version, R-1 (R+1 if testing*)

Note (*): we do provide access to Beta versions if a customer is affected by an issue resolved in the next release.

GP-GPU (CUDA / OpenCL / DirectX Compute) Benchmarks

  • Additional improvements for nVidia “Ampere“; CUDA SDK updated to 11.1
  • Additional improvements for “Image Processing” benchmarks
  • Additional improvements for “Scientific Analysis” benchmarks (FFT, GEMM) [perf impact]
  • Reverted (R3 update) hashing/SHA block change in “Cryptography” benchmarks [perf impact]
  • Relaxed limits further for better performance on high-end/multiple GP-GPUs [up to 8]

CPU Benchmarks

  • Fixed possible lock-up in “Scientific Analysis” benchmarks
  • Reverted (R3 update) hashing/SHA block change in “Cryptography” benchmarks [no impact]
  • Revised benchmarks for asymmetric work-loads for hybrid CPUs

Bug Fixes

  • Fixed (possible) crash on Intel graphics with 64-bit PCIe memory addressing
  • Reviewed all device code that deals with 64-bit PCIe memory addressing
  • Fixed TigerLake (TGL) memory information/timings for (LP)DDR5
  • Fixed TigerLake (TGL) integrated graphics memory information
  • Additional IceLake (ICL) memory information

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Commercial (Pro/Biz/Eng/Ent)

Download Sandra Lite

AVX512-IFMA(52) Improvement for IceLake and TigerLake

CPU Multi-Media Vectorised SIMD

What is Sandra’s Multi-Media benchmark?

The “multi-media” benchmark in Sandra was introduced way back with Intel’s MMX instruction set (and thus the Pentium MMX) to show the difference vectorisation brings to common algorithms, in this case (Mandelbrot) fractal generation. While MMX did not have floating-point support, floating-point operations can be emulated using integers of various widths (short/16-bit, int/32-bit, long/int64/64-bit, etc.).

The benchmark thus contains various precision tests using both integer and floating point data, currently 6 (single/double/quad-floating point, short/int/long integer) with more to come in the near future (half/FP16 floating-point, etc.). Larger widths provide more precision and thus generate more accurate fractals (images) but are slower to compute (they also take more memory to store).
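
To illustrate the kind of work being vectorised, here is a minimal scalar sketch of one Mandelbrot escape-time pixel using 64-bit fixed-point integers; the Q16.48 format and the helper name are our own illustrative assumptions, not Sandra’s internal code:

```cpp
#include <cstdint>

// Illustrative scalar sketch of one Mandelbrot escape-time pixel using
// signed 64-bit fixed-point arithmetic (Q16.48 chosen arbitrarily here).
// The widening multiply uses the GCC/Clang __int128 extension; the real
// benchmark vectorises many such pixels across SIMD lanes instead.
int mandel_iterations_fixed(int64_t cx, int64_t cy, int max_iter) {
    const int F = 48;                          // fractional bits
    const int64_t four = (int64_t)4 << F;      // escape radius^2 in fixed-point
    int64_t x = 0, y = 0;
    int i = 0;
    while (i < max_iter) {
        int64_t x2 = (int64_t)(((__int128)x * x) >> F);
        int64_t y2 = (int64_t)(((__int128)y * y) >> F);
        if (x2 + y2 > four) break;             // |z|^2 > 4 -> escaped
        int64_t xy = (int64_t)(((__int128)x * y) >> F);
        y = 2 * xy + cy;                       // z = z^2 + c
        x = x2 - y2 + cx;
        ++i;
    }
    return i;
}
```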

While the latest instruction sets (AVX(2)/FMA, AVX512) do naturally support floating-point data, integer compute performance is still very much important and thus needs to be tested. As quantities become larger (e.g. memory/disk sizes, pointers/address spaces, etc.) we have moved from int/32-bit to long/64-bit processing, with some algorithms being exclusively 64-bit (e.g. SHA512 hashing).

What is the “trouble” with 64-bit integers?

While all native 64-bit processors (e.g. x64, IA64, etc.) support native 64-bit integer operations, these are generally scalar with limited SIMD (vectorised) support. Multiplication is especially “problematic” as it has the potential to generate numbers up to twice (2x) the number of bits – thus multiplying two 64-bit integers can generate a full 128-bit integer result, for which there was no (SIMD) support.

Intel added native full 128-bit multiplication support (MULX) with BMI2 (Bit Manipulation Instructions Version 2), but that is still scalar (non-SIMD); not even the latest AVX512-DQ instruction set brought support. While we could emulate full 128-bit multiplication using native 32-bit-to-64-bit half multiplications (see the sketch below), we have chosen to wait for native support. An additional issue (for us) is that we use “signed integers” (i.e. they can hold both positive (+ve) and negative (-ve) values) while most multiplication instructions are for “unsigned integers” (which can hold only positive values) – thus we need to adjust the result for our needs, which incurs overheads.
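
For reference, a minimal sketch of the emulation mentioned above – a 64×64 → 128-bit unsigned multiply built from four 32×32 → 64-bit partial products (illustrative only; as noted, we chose to wait for native support instead):

```cpp
#include <cstdint>

// Portable sketch of an unsigned 64x64 -> 128-bit multiply built from four
// 32x32 -> 64-bit partial products.
void mul64x64_to_128(uint64_t a, uint64_t b, uint64_t& hi, uint64_t& lo) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;                     // bits   0..63
    uint64_t p1 = a_lo * b_hi;                     // bits  32..95
    uint64_t p2 = a_hi * b_lo;                     // bits  32..95
    uint64_t p3 = a_hi * b_hi;                     // bits  64..127

    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    lo = (mid << 32) | (uint32_t)p0;
    hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```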

Thus the long/64-bit integer benchmark in Sandra remained non-vectorised until the introduction of AVX512-IFMA52.

What is AVX512-IFMA52?

IFMA52 is one of the new extensions of AVX512 introduced with “IceLake” (ICL) that supports a native 52-bit fused multiply-add with a 104-bit result. As it is 512-bit wide, we can multiply-add eight (8) pairs of 64-bit integers in one go every 2 clocks (0.5 throughput, 4 latency on ICL) – especially useful for algorithms like (Mandelbrot) fractals where we can operate on many pixels independently.

As it generates a 104-bit full result, it is (as per the name) only a 52-bit multiply, thus we need to restrict our integers to 52 bits. It also operates on unsigned integers only, so the result needs to be adjusted for our signed-integer purposes. Note also that while it is a fused multiply-add, we have chosen to use only the multiply feature here (in this Sandra version 20/20 R9); future versions (of Sandra) may use the full multiply-add feature for even better performance.
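
A minimal sketch of how the instruction is exposed through compiler intrinsics (the helper name is ours; requires a compiler targeting AVX512F + AVX512-IFMA):

```cpp
#include <immintrin.h>
#include <cstdint>

// Illustrative helper (ours, not Sandra's): multiply eight pairs of
// unsigned integers (each at most 52 bits) per call, producing the low and
// high 52-bit halves of each 104-bit product.
// Compile with AVX512F + AVX512-IFMA enabled (e.g. -mavx512ifma).
void mul52_x8(const uint64_t a[8], const uint64_t b[8],
              uint64_t lo[8], uint64_t hi[8]) {
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    __m512i zero = _mm512_setzero_si512();
    // madd52: dst = acc + (52-bit half of the product); with acc = 0 we get
    // the raw low/high halves of a*b per lane.
    _mm512_storeu_si512(lo, _mm512_madd52lo_epu64(zero, va, vb));
    _mm512_storeu_si512(hi, _mm512_madd52hi_epu64(zero, va, vb));
}
```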

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Intel Core i7 1065G7 (IceLake ULV) Intel Core i7 1165G7 (TigerLake ULV) Comments
BenchCpuMM Emulated Int64 ALU64 (Mpix/s) 3.67 4.34 While native, scalar int64 processing is pretty slow.
BenchCpuMM Native Int64 ADX/BMI2 (Mpix/s) 21.24 [+5.78x] Using BMI2 for 64-bit multiplication increases (scalar) performance by 6x!
BenchCpuMM Emulated Int64 SSE4 (Mpix/s) 13.92 [-35%] Using vectorisation though SSE4 (2x wide) is not enough to beat ADX/BMI
BenchCpuMM Emulated Int64 AVX2 (Mpix/s) 22.8 [+64%] AVX2 is 4x wide (256-bit) and just about beats scalar ADX/BMI2.
BenchCpuMM Emulated Int64 AVX512/DQ (Mpix/s) 33.53 [+47%] 512-bit wide AVX512 is 47% faster than AVX2.
BenchCpuMM Native Int64 AVX512/IFMA52 (Mpix/s) 55.87 [+66%] / [+15x over ALU64] 70.41 [+16x over ALU64] IFMA52 is 66% faster than normal AVX512 and over 15x faster than scalar ALU.
With IFMA52, we finally see a big performance gain through native 64-bit integer multiplication and vectorisation (512-bit wide, thus 8x 64-bit integer pairs): it is over 15x faster on ICL and 16x faster on TGL! In fairness, ADX/BMI2 is only about 1/2x slower and that is scalar – showing how much native instructions help processing.

Conclusion

AVX512 continues to bring performance improvements by adding more sub-instruction sets like AVX512-IFMA(52) that help 64-bit integer processing. With 64-bit integers taking over most computations due to increased sizes (data, pointers, etc.), this is becoming more and more important – and not before time.

While not a full 128-bit multiplier, 104 bits allow complete 52-bit integer operation, which is sufficient for most tasks – today. Perhaps in the future an IFMA64 will be provided for full 128-bit multiply-result integer support.

Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance

Intel iRIS Xe Gen 12

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel – the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch, thus on an updated 10nm++ process, again launched for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Gen12 (XE-LP) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each, 5400Mt/s)
  • No eDRAM cache unfortunately (like “Crystal Well” and co)
  • New Image Processing Unit (IPU6) up to 4K90 resolution
  • New 2x Media Encoders HEVC 4K60-10b 4:4:4 & 8K30-10b 4:2:0
  • PCIe 4.0

While ICL had already greatly upgraded the GP-GPU to gen 11 cores (and more than doubled the count to 64 EUs for G7), TGL upgrades them yet again to “Xe”-LP gen 12 cores, now all the way up to 96 EUs. While again most features seem geared towards gaming and media (with a new image processing unit and media encoders), there should be a few new instructions for AI – hopefully exposed through an OpenCL extension.

Again there is no FP64 support (!) while FP16 is naturally supported at 2x rate as before. BF16 should also be supported by a future driver. Int32 and Int16 performance has reportedly doubled, with Int8 now supported and DP4A accelerated.

The new memory controller supports DDR5 / LPDDR5 (5400Mt/s) which should – once such memory becomes readily available – provide more bandwidth for the EU cores; until then, LPDDR4X can clock faster than before (4267Mt/s). There is no mention of eDRAM (L4) cache at all.

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously – perhaps with the forthcoming DG1 discrete graphics.

GPGPU (Xe-LP G7) Performance Benchmarking

In this article we test GPGPU core performance; please see our other articles on:

To compare against the other Gen10 SoC, please see our other articles:

Hardware Specifications

We are comparing the middle-range Intel integrated GP-GPUs with previous generation, as well as competing architectures with a view to upgrading to a brand-new, high performance, design.

GPGPU Specifications Intel Iris XE-LP G7 Intel XE-LP G1 Intel Iris Plus (IceLake) G7 AMD Vega 8 (Ryzen5) Comments
Arch Chipset EV12 / G7 EV12 / G1 EV11 / G7 GCN1.5 The first G12 from Intel.
Cores (CU) / Threads (SP) 96 / 768 32 / 256 64 / 512 8 / 512 50% more cores vs. G11
SIMD per CU / Width 8 8 8 64 Same SIMD width
Wave/Warp Size 32 32 16/32 64 Wave size matches nVidia
Speed (Min-Turbo) 1.2GHz 1.15GHz 1.1GHz 1.1GHz Turbo speed has slightly increased.
Power (TDP) 15-35W 15-35W 15-35W 15-35W Similar power envelope.
ROP / TMU 24 / 48 8 / 16 16 / 32 8 / 32 ROPs and TMUs have also increased 50%.
Shared Memory 64kB 64kB 64kB 32kB Same shared memory but 2x Vega.
Constant Memory 3.2GB 3.2GB 2.7GB 3.2GB No dedicated constant memory but large.
Global Memory 2x LP-DDR4X 4267Mt/s (LPDDR5 5400Mt/s) 2x LP-DDR4X 4267Mt/s 2x LP-DDR4X 3733Mt/s 2x DDR4-2400 Can support faster (LP)DDR5 in the future.
Memory Bandwidth 42GB/s 42GB/s 58GB/s 42GB/s Highest (possible) bandwidth ever.
L1 Caches 64kB x 6 64kB x 2 16kB x 8 8x 16kB L1 is much larger.
L3 Cache 3.8MB ? 3MB ? L3 has modestly increased.
Maximum Work-group Size 256×256 256×256 256×256 1024×1024 Vega supports 4x bigger workgroups.
FP64/double ratio No! No! No! Yes, 1/16x No FP64 support in current drivers!
FP16/half ratio 2x 2x 2x 2x Same 2x ratio.

Processing Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel Iris XE-LP G7 96EV Intel XE-LP G1 32EV Intel Iris Plus (IceLake) G7 64EV AMD Vega 8 (Ryzen5) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 4,342 [+54%] 1,419 2,820 2,000 Xe beats EV11 by over 50% using FP16!
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 2,062 [+55%] 654 1,330 1,350 Standard FP32 is just as fast, 55% faster.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 98.6* [+41%] 31.3* 70* 111 Without native FP64 support Xe craters like old EV11.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 9.91* [+31%] 3.49* 7.54* 7.11 Emulated FP128 is even harder for Xe.
Starting off, we see almost perfect scaling with improvement in EUs, with Xe 50% faster than old EV11. Unfortunately, again without native FP64 support – it cannot match the competition. For FP64 workloads – you’ll have to use the CPU; for ULV that may be OK but for discrete DG1 that is not so great.

* Emulated FP64 through FP32.
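
As an aside, one common way to emulate higher precision on FP32-only hardware is “double-float” (float-float) arithmetic, keeping each value as an unevaluated sum of two floats; a minimal sketch of the error-free building blocks (illustrative of the general technique, not necessarily what Sandra uses internally):

```cpp
#include <cmath>

// "Double-float": an unevaluated sum of two FP32 values, giving roughly
// 48 bits of effective mantissa on hardware that only rates FP32 at speed.
struct ff { float hi, lo; };

// Error-free product: captures the exact product of a and b as hi + lo,
// using FMA to recover the rounding error of the FP32 multiply.
inline ff two_prod(float a, float b) {
    float p = a * b;
    float e = std::fmaf(a, b, -p);
    return { p, e };
}

// Error-free sum (Knuth's TwoSum): s + err == a + b exactly.
inline ff two_sum(float a, float b) {
    float s   = a + b;
    float bb  = s - a;
    float err = (a - (s - bb)) + (b - bb);
    return { s, err };
}
```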

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 7.9 [+3x] 2.54 2.6 2.58 Integer performance is 3x faster than EV11
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 3.54 3.38 3.3 Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 20.52 [+3x] 6.81 6.9 14.29 Xe beats Vega even with its acceleration.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 13.34 14.18 18.77 With 128-bit Xe is even faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 2.26 3.36 64-bit integer workload is also stellar.
Despite our sample using slower DDR4 memory (vs. LP-DDR4X on ICL/EV11), integer performance is 3x faster – a huge upgrade. It even manages to beat AMD’s Vega with its crypto-acceleration instructions (media ops). While the crypto-currency frenzy has died down (nobody is likely to mine coins on ULV GP-GPUs), the dedicated DG1 may be a serious crypto-cracking GPU.
GPGPU Finance Benchmark Black-Scholes float/FP16 (MOPT/s) 1,111 2,340 1,720 With FP16 we see G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1,603 [+22%] 993 1,310 829 With FP32 Xe is 22% faster.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 116 292 270 Binomial uses thread shared data thus stresses the memory system.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 334 [+14%] 111 292 254 With FP32, XE is just 15% faster.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 470 667 584 Monte-Carlo also uses thread shared data but read-only.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1,385 [+94%] 444 719 362 With FP32 code Xe is 2x faster than EV11.
For financial FP32/FP16 workloads, Xe is not always much faster than EV11, with two algorithms just 15-22% faster but one 2x as fast. Again, due to lack of FP64 support – it cannot run high-precision workloads which may be a problem for some algorithms.

This does not bode well for the dedicated DG1, as it would be the only discrete card without native FP64 support, unlike the competition. However, it is likely (some) FP64 units will be included unless Intel aims it squarely at gamers (only).

GPGPU Science Benchmark HGEMM (GFLOPS) float/FP16 528 563 884 Vega still has great performance with FP16.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 683 [+64%] 419 314 With FP32, Xe is 64% faster than EV11.
GPGPU Science Benchmark HFFT (GFLOPS) float/FP16 33.32 61.4 61.34 Vega does very well here also with FP16.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 52.7 [+34%] 39.2 31.5 With FP32, Xe is 34% faster.
GPGPU Science Benchmark HNBODY (GFLOPS) float/FP16 652 930 623 All Intel GPUs do well here.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 908 [+60%] 566 537 With FP32, Xe is 60% faster.
On scientific algorithms, Xe does much better and manages 35-65% better performance than EV11 and generally trouncing Vega on FP32 though not quite on FP16. Shall we mention lack of FP64 again?
GPGPU Image Processing Blur (3×3) Filter single/FP16 (MPix/s) 3,520 2,273
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 4,725 [+3x] 1,649 1,570 782 In this 3×3 convolution algorithm, Xe is 3x faster!
GPGPU Image Processing Sharpen (5×5) Filter single/FP16 (MPix/s) 1,000 582
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,354 [+4.2x] 436 319 157 Same algorithm but more shared data, Xe is 4x faster.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP16 (MPix/s) 924 619
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 727 [+2.2x] 232 328 161 With even more data Xe is 2x faster.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP16 (MPix/s) 1,000 595
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,354 [+4.26x] 435 318 155 Still convolution but with 2 filters – 4.3x faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP16 (MPix/s) 26.63 7.69
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.73 [+33%] 16.27 26.91 4.06 Different algorithm Xe just 33% faster.
GPGPU Image Processing Oil Painting Quantise Filter single/FP16 (MPix/s) 24.34
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 23.95 [+22%] 11.11 19.63 2.59 Without major processing, Xe is only 22% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP16 (MPix/s) 1,740 2,091
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 2,772 [+48%] 1,175 1,870 2,100 This algorithm is 64-bit integer heavy thus G7 is 10% slower
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP16 (MPix/s) 215 1,046
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 916 [-4%] 551 950 608 One of the most complex and largest filters, Xe ties with EV11.
For image processing tasks, Xe seems to do best, with up to 4x better performance – likely due to the updated compiler and drivers. In any case, for such tasks, upgrading to TGL will give you a huge boost. (Fortunately there is no FP64 processing here.)

Memory Performance

We are testing OpenCL memory performance using the latest SDKs / libraries / drivers from Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel Iris HD 540 (6550U) AMD Vega 8 (Ryzen 5) Intel Iris Plus (1065G7) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 44.92 [+27%] 45.9 36.3 27.2 Xe manages to squeeze more bandwidth out of DDR4.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 7.75 [-54%] 7.7 17 4.74 Uploads are 1/2 slower at this time.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 7.6 [-58%] 7.6 185 Download bandwidth is not much better.
Thanks to the faster LP-DDR4X memory, Xe has even higher bandwidth than EV11; with future DDR5 / LPDDR5 this will increase further. At this time – perhaps due to the driver – the upload/download bandwidths are about half of what we would expect.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Once again Intel seems to be taking graphics seriously: for the 2nd time in a row we have a major graphics upgrade, with Xe bringing big increases in EU count, performance and bandwidth. Overall it is around 50% faster than EV11, with lower-end devices benefiting most from the upgrade. While the competition once seemed unassailable, Intel has managed to close the gap and overtake.

However, this is still a core aimed at gamers and it does not provide much for GP-GPU; the improved integer performance is very much welcome – 3 times better (!) – but the new instructions are few and AI-specific. Lack of FP64 makes it unsuitable for high-precision financial and scientific workloads; something that the old EV7-9 cores could do reasonably well (all things considered).

For integrated graphics this is not a problem – not many people would expect a ULV GPU core to run compute-heavy workloads; however, a dedicated DG1 card would really be out-spec’d by the competition, with even old, low-end devices providing more features. Then again, the dedicated DG1 is likely to include (some) FP64 units and/or additional units unlike the low-power (LP ULV) integrated versions.

Getting back to ULV, Xe-LP’s performance completely obsoletes devices (e.g. SKL/KBL/WHL/CML-ULV) using the older EV9x cores – unless you really don’t plan on using them for anything beyond “business 2D graphics” or displaying the desktop.

If you have not upgraded to ICL yet, TGL is a far better, more compelling proposition that should be your (current) top choice for long-term use. For ICL owners there is still a worthwhile upgrade here, though not as massive as coming from anything released previously.

In a word: Highly Recommended!

Please see our other articles on:

Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance

Intel Core i7 Gen 11

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel – the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch, thus on an updated 10nm++ process, again launched for mobile ULV (U/Y) devices and perhaps for other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Up to 4C/8T “Willow Cove” on ULV  (CometLake up to 6C/12T)
  • Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • AVX512 and more of its friends
  • Increased L2 cache from 512kB to 1.25MB per core (+2.5x)
  • Increased L3 cache from 8MB to 12MB (+50%)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
  • PCIe 4.0
  • Thunderbolt 4 (and thus USB 4.0 support) integrated
  • Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)

While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:

  • AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
  • AVX512-VP2INTERSECT (Vector Pair Intersection)

While some software may not have been updated to AVX512 while it was reserved for HEDT/Servers, with this mainstream launch you can pretty much guarantee that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
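
As an illustration of what VNNI provides, a minimal sketch of its int8 dot-product intrinsic (the helper name is ours):

```cpp
#include <immintrin.h>

// One VNNI step of an int8 dot-product (requires AVX512F + AVX512-VNNI):
// for each of the 16 int32 lanes, acc += sum of 4 adjacent u8(a) * s8(b)
// products. A single instruction replaces the multiply/widen/add sequence
// otherwise needed with plain AVX512-BW.
__m512i dot_step(__m512i acc, __m512i a_u8, __m512i b_s8) {
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
```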

The caches are finally getting updated and enlarged, considering that the competition has deployed massively large caches in its latest products. L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had previously doubled L2 vs. SKL (and current CML) derivatives, which means TGL’s L2 is 5x larger than older designs.

From a security point of view, TGL mitigates all (currently reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Spectre V1, which does not have a hardware solution), thus it should not require the slower software mitigations that affect performance (especially I/O). Like ICL, it is also not affected by the JCC erratum that is still being addressed through software (compiler) changes – changes old software will never receive.

DDR5 / LPDDR5 will ensure even more memory bandwidth and faster data rates (up to 5400Mt/s), without the need for multiple (SO)DIMMs to enable at least dual-channel operation; naturally, populating all channels will allow even higher bandwidth. Higher data rates will reduce memory latencies (assuming the timings don’t increase too much). Unfortunately there are no public DDR5 modules for us to test. LPDDR4X also gets a bump to a maximum of 4267Mt/s.

PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs including Intel’s) and NVMe SSDs with ~8GB/s transfer (x4 lanes) on ULV but on desktop up to 32GB/s (x16). Note that the DMI/OPI link between CPU and I/O Hub is also thus updated to PCIe 4.0 speeds improving CPU/Hub transfer.

Thunderbolt 4.0 brings support for the upcoming USB 4.0 protocol and its data rates as well (32Gbps), which will also bring new peripherals including external eGPUs for discrete graphics.

Finally the GPU cores have been updated again to XE (Gen 12) cores, up to 96 on some SKUs that represent huge compute and graphics performance increases over the old (Gen 9.x) cores used by gen 10 APUs (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 10, 11) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

CPU Specifications AMD Ryzen 4500U Intel Core i7 10510U (CometLake ULV) Intel Core i7 1065G7 (IceLake ULV) Intel Core i7 1165G7 (TigerLake ULV) Comments
Cores (CU) / Threads (SP) 6C / 6T 4C / 8T 4C / 8T 4C / 8T No change in cores count.
Speed (Min / Max / Turbo) 1.6-2.3-4.0GHz 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) 0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W) Both base and Turbo clocks are way up.
Power (TDP) 15-35W 15-35W 15-35W 12-35W Similar power envelope possibly higher.
L1D / L1I Caches 6x 32kB 8-way / 6x 64kB 4-way 4x 32kB 8-way / 4x 32kB 8-way 4x 48kB 12-way / 4x 32kB 8-way 4x 48kB 12-way / 4x 32kB 8-way No change L1D
L2 Caches 6x 512kB 8-way 4x 256kB 16-way 4x 512kB 16-way 4x 1.25MB L2 has more than doubled (2.5x)!
L3 Caches 2x 4MB 16-way 8MB 16-way 8MB 16-way 12MB 16-way L3 is 50% larger
Microcode (Firmware) n/a MU-068E09-CC MU-067E05-6A MU-TBD Revisions just keep on coming.
Special Instruction Sets AVX2/FMA, SHA AVX2/FMA AVX512, VNNI, SHA, VAES, IFMA AVX512, VNNI, SHA, VAES, IFMA More AVX512!
SIMD Width / Units 256-bit 256-bit 512-bit 512-bit Widest SIMD units ever.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2/FMA, etc.). “TigerLake” (TGL), like “IceLake” (ICL), supports all modern instruction sets including AVX512, VNNI, SHA HWA and VAES, as well as the older AVX2/FMA and AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks AMD Ryzen 4500U Intel Core i7 10510U (CometLake ULV) Intel Core i7 1065G7 (IceLake ULV) Intel Core i7 1165G7 (TigerLake ULV) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 208 134 154 169 [+10%] TGL is 10% faster than ICL but not enough to beat AMD.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 191 135 151 167 [+11%] With a 64-bit integer workload – 11% increase
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 89 85 90 99.5 [+10%] With floating-point, TGL is only 10% faster but enough to beat AMD.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 75 70 74 83 [+12%] With FP64 we see a 12% improvement.
With integer (legacy) workloads (not using SIMD), TGL is not much faster than ICL even with its higher-clocked cores; still, a 10-12% improvement is welcome and (in floating-point at least) allows it to beat the 6-core Ryzen Mobile competition.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 506 409 504* 709* [+41%] With AVX512 TGL is over 40% faster than ICL.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 193 149 145* 216* [+49%] With a 64-bit AVX512 integer workload TGL is 50% faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 4.47 2.54 3.67** 4.34** [+18%] A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**]
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 433 328 414* 666* [+61%] In this floating-point vectorised test TGL is 61% faster!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 251 194 232* 381* [+64%] Switching to FP64 SIMD AVX512 code, TGL is 64% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 11.23 8.22 10.2* 15.28* [+50%] A heavy algorithm using FP64 to mantissa-extend FP128: TGL is still 50% faster than ICL.
With heavily vectorised SIMD workloads, TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile (even with its 6x 256-bit SIMD cores) but also be 40-60% faster than ICL. Intel seems to have managed to get the SIMD units to run much faster than ICL’s even within a similar power envelope!

* using AVX512 instead of AVX2/FMA.

** note test has been rewritten in Sandra 20/20 R9: now vectorised and AVX512-IFMA enabled – see “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.

BenchCrypt Crypto AES-256 (GB/s) 13.46 12.11 21.3*  19.72* [-7%] Memory bandwidth rules here so TGL is similar to ICL in speed.
BenchCrypt Crypto AES-128 (GB/s) 13.5 12.11 21.3* 19.8* [-7%] No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 7.03** 4.28 9*** 13.87*** [+54%] Despite SHA HWA, TGL soundly beats Ryzen using AVX512.
BenchCrypt Crypto SHA1 (GB/s) 7.19 15.71***   Less compute intensive SHA1 does not help.
BenchCrypt Crypto SHA2-512 (GB/s) 7.09*** SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and despite Ryzen Mobile having SHA HWA – TGL is much faster using AVX512 and, as we’ve seen before, 50% faster than ICL! AVX512 helps even against native hashing acceleration.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W
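
For reference, the VAES note above refers to the EVEX-encoded AES instructions that process four 128-bit blocks per 512-bit register; a minimal sketch of the encryption rounds using intrinsics (illustrative only, not Sandra’s implementation; keys are assumed pre-expanded):

```cpp
#include <immintrin.h>

// Apply AES rounds to four independent 128-bit blocks packed into one
// 512-bit register (requires VAES + AVX512F). Round keys are assumed
// pre-expanded and broadcast to all four lanes.
__m512i aes_rounds_x4(__m512i blocks, const __m512i* round_keys, int rounds) {
    __m512i s = _mm512_xor_si512(blocks, round_keys[0]);    // initial whitening
    for (int i = 1; i < rounds; ++i)
        s = _mm512_aesenc_epi128(s, round_keys[i]);          // middle rounds
    return _mm512_aesenclast_epi128(s, round_keys[rounds]);  // final round
}
```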

BenchFinance Black-Scholes float/FP32 (MOPT/s) 64.16 109
BenchFinance Black-Scholes double/FP64 (MOPT/s) 91.48 87.17 91 132 [+45%] Using FP64 TGL is 45% faster than ICL.
BenchFinance Binomial float/FP32 (kOPT/s) 16.34 23.55 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 31.2 21 27 37.23 [+38%] With FP64 code TGL is 38% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 12.48 79.9 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 45.59 16.5 33 45.98 [+39%] Switching to FP64 TGL is 40% faster.
With non-SIMD financial workloads, TGL still improves by a decent 40-45% over ICL, which is enough to beat the 6-core Ryzen Mobile – no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.
BenchScience SGEMM (GFLOPS) float/FP32 158 185* 294* [+59%] In this tough vectorised algorithm, TGL is 60% faster!
BenchScience DGEMM (GFLOPS) double/FP64 76.86 69.2 91.7* 167* [+82%] With FP64 vectorised code, TGL is over 80% faster!
BenchScience SFFT (GFLOPS) float/FP32 13.9 31.7*  31.14* [-2%] FFT is also heavily vectorised but memory dependent so TGL does not improve over ICL.
BenchScience DFFT (GFLOPS) double/FP64 7.15 7.35 17.7*  16.41* [-3%] With FP64 code, nothing much changes.
BenchScience SNBODY (GFLOPS) float/FP32 169 200* 286* [+43%] N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 98.7 64.2 61.8* 81.61* [+32%] With FP64 code TGL is 32% faster.
With highly vectorised SIMD code (scientific workloads), TGL again shows us the power of AVX512 – and beats ICL by 30-80%, and naturally Ryzen Mobile too. Some algorithms that are completely memory latency/bandwidth bound cannot improve and require faster memory instead.

* using AVX512 instead of AVX2/FMA

Neural Networks NeuralNet CNN Inference (Samples/s) 19.33 25.62*  
Neural Networks NeuralNet CNN Training (Samples/s) 3.33 4.56*
Neural Networks NeuralNet RNN Inference (Samples/s) 23.88 24.93*
Neural Networks NeuralNet RNN Training (Samples/s) 1.57 2.97*
* using AVX512 instead of AVX2/FMA (not using VNNI yet)
CPU Image Processing Blur (3×3) Filter (MPix/s) 1060 891 1580* 2276* [+44%] In this vectorised integer workload TGL is 44% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 441 359 633*  912* [+44%] Same algorithm but more shared data TGL still 44% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 231 186 326* 480* [+47%] Again same algorithm but even more data shared brings 47%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 363 302 502* 751* [+50%] Different algorithm but still vectorised, still 50% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 28.02 27.7 72.9* 109* [+49%] Still vectorised code, TGL is again 50% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 12.23 15.7 24.7* 34.74* [+40%] Similar improvement here of about 40%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 936 1580 2100* 2998* [+43%] With integer workload, 43% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 127 214 307* 430* [+40%] In this final test, again with integer workload, 40% faster.
Similar to what we saw before, TGL is between 40-50% faster than ICL at similar power envelope and far faster than Ryzen Mobile and its 6-cores. Again we see the huge improvement AVX512 brings already even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA

Perhaps due to the relatively meagre ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures with more cores (Ryzen Mobile or CometLake with 6 cores) – but TGL improves things considerably, anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD making big improvements with Ryzen Mobile (Zen2), its updated 256-bit SIMD units and more cores (6+), Intel had to improve: and improve it did. While, due to high power consumption, AVX512 was never a good fit for mobile and its meagre ULV power envelopes (15-25W, etc.) – somehow “TigerLake” (TGL) manages to run its AVX512 units much faster, 40-50% faster than “IceLake”, and thus beat the competition.

TGL’s performance – still within the ULV power budget of a thin & light laptop (e.g. Dell XPS 13) – is pretty compelling and soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring on a modern, efficient design.

TGL also brings PCIe 4.0 thus faster NVMe/Optane storage I/O, Thunderbolt 4 / USB 4.0 compatibility and thus faster external I/O as well. DDR5 & LPDDR5 also promise even higher bandwidth in order to feed the new cores not to mention the updated GPGPU engine with its many more cores (up to 96 EU now!) that require a lot more bandwidth.

TGL is a huge improvement over older architectures (even 8th gen) that improves everything: greater compute power, greater graphics/GP compute power, faster memory, faster storage and faster external I/O! If you thought that ICL – despite its own big improvements – did not quite reach the “upgrade threshold”, TGL does everything and much more. The times of small, incremental improvements are finally over, and ICL/TGL are just what was needed. Let’s hope Intel can keep it up!

In a word: Highly Recommended!

Please see our other articles on:

nVidia Titan RTX / 2080Ti: Turing GPGPU performance in CUDA and OpenCL

nVidia RTX 2080 TI (Turing)

What is “Titan RTX / 2080Ti”?

It is the latest high-end “pro-sumer” card from nVidia with the next-generation “Turing” architecture, the update to the current “Volta” architecture that had a limited release in Titan/Quadro cards. It powers the new Series 20 top-end (with RTX) and Series 16 mainstream (without RTX) cards that replace the old Series 10 “Pascal” cards.

As “Volta” is intended for AI/scientific/financial data-centres, it features high-end HBM2 memory; since “Turing” is meant for gaming, rendering, etc., it has “normal” GDDR6 memory. Similarly, “Turing” adds the new RTX (Ray-Tracing) cores for high-fidelity visualisation and image generation – in addition to the Tensor (TSX) cores that “Volta” introduced.

While “Volta” has 1/2 FP64 ratio cores (vs. FP32), “Turing” has the normal 1/32 FP64 ratio cores: for high-precision computation – you need “Volta”. However, as “Turing” maintains the 2x FP16 rate (vs. FP32) it can run low-precision AI (neural networks) at full speed. Old “Pascal” had 1/64x FP16 ratio making it pretty much unusable in most cases.

“Turing” does not have high-end on-package HBM2 memory but instead high-speed GDDR6 memory, which has decent bandwidth but is not as plentiful – with 1GB missing (11GB instead of 12GB).

With the soon-to-be-unveiled “Ampere” (Series 30) architecture on the horizon, we look at whether you can get “cheap” Titan V performance out of a Turing 2080Ti consumer card.

See these other articles on Titan performance:

Hardware Specifications

We are comparing the top-of-the-range Titan RTX / 2080Ti with previous-generation Titans and competing architectures, with a view to upgrading to a mid-range but high-performance design.

GPGPU Specifications nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
Arch Chipset Turing TU102 (7.5) Volta GV100 (7.0) Pascal GP102 (6.1) The V is the only one using the top-end 100 chip, not the lower-end 102 or 104 versions.
Cores (CU) / Threads (SP) 68 / 4352 80 / 5120 28 / 3584 Not as many cores as Volta but still decent.
ROPs / TMUs 88 / 272 96 / 320 96 / 224 Cannot match Volta but more ROPs per CU for gaming.
FP32 / FP64 / Tensor Cores 4352 / 136 / 544 5120 / 2560 / 640 3584 / 112 / no Maintains the Tensor cores important for AI tasks (neural networks, etc.)
Speed (Min-Turbo) 1.35GHz (136-1.635) 1.2GHz (135-1.455) 1.531 (135-1.910) Clocks have improved over Volta likely due to lower number of SMs.
Power (TDP) 260W 300W 250W (125-300) TDP is less due to lower CU number.
Global Memory 11GB GDDR6 14GHz 320-bit 12GB HBM2 850Mhz 3072-bit 11GB GDDR5X 10GHz 384-bit As a pro-sumer card it has 1GB less than Volta and same as Pascal.
Memory Bandwidth (GB/s) 616 652 512 Despite no HBM2, bandwidth almost matches due to the high speed of GDDR6.
L1 Cache 2x (32kB + 64kB) 2x 24kB / 96kB shared L1/shared is still the same but ratios have changed.
L2 Cache 5.5MB (6MB?) 4.5MB (3MB?) 3MB L2 cache reported has increased by 25%.
FP64/double ratio 1/32x 1/2x 1/32x Low ratio like all consumer cards; Volta dominates here.
FP16/half ratio 2x 2x 1/32x Same rate as Volta, 2x over FP32.

nVidia RTX 2080 TI (Turing)

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2 (latest nVidia provides). Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 41,080 / n/a [=] 40,920 / n/a 336 / n/a Right off the bat, Turing matches Volta and is miles faster than old Pascal.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 25,000 / 23,360 [+11%] 22,530 / 21,320 18,000 / 16,000 With standard FP32, Turing even manages to be 11% faster despite fewer CUs.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 812 / 772 [-93%] 11,300 / 10,500 641 / 642 For FP64 you don’t want Turing, you want Volta. At any cost.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 30.4 / 29.1 [-94%] 472 / 468 24.4 / 27 With emulated FP128 precision Turing is again demolished.
Turing manages to improve over Volta in FP16/FP32 despite having fewer CUs – most likely due to faster clocks and optimisations. However, if you do need FP64 precision then Volta reigns supreme – the 1/32 rate of Turing & Pascal just does not cut it.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 48 / 52 [-33%] 72 / 86 42 / 41 Streaming workloads love Volta’s HBM2 memory, Turing is 33% slower.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 64 / 70 [-30%] 92 / 115 57 / 54 Not a lot changes here, Turing is 30% slower.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 192 / 182 [+7%] 179 / 181 72 / 83 With 64-bit integer workload, Turing manages a 7% win despite “slower” memory.
GPGPU Crypto Benchmark Crypto SHA256 (GB/s) 170 / 125 [-33%] 253 / 188 95 / 60 As with AES, hashing loves HBM2 so Turing is 33% slower than Volta.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 161 / 125 [+56%] 103 / 113 69 / 74 While Turing wins here, it is likely down to a compiler optimisation.
It seems that Turing GDDR6 memory cannot keep up with Volta’s HBM2 – despite the similar bandwidths: streaming algorithms are around 30% slower on Turing. The only win is 64-bit integer workload that is 7% faster on Turing likely due to integer units optimisations.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 17,230 / 17,000 [-7%] 18,480 / 18,860 10,710 / 10,560 Turing is just 7% slower than Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 1,530 / 1,370 [-82%] 8,660 / 8,500 1,400 / 1,340 With FP64, Turing falls to under 1/5th of Volta’s speed.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 4,280 / 4,250 [+4%] 4,130 / 4,110 2,220 / 2,230 Binomial uses thread shared data thus stresses the SMX’s memory system – Turing is 4% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 164 / 163 [-91%] 1,920 / 2,000 131 / 134 With FP64 code Turing drops to about 1/10th the speed.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 11,440 / 11,740 [+1%] 11,340 / 12,900 8,100 / 6,000 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – Turing is just 1% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 327 / 263 [-92%] 4,330 / 3,590 304 / 274 Switching to FP64, Turing is again roughly 1/10th the speed.
For financial workloads, as long as you only need FP32 (or FP16), Turing can match and slightly outperform Volta; considering the cost that is no mean feat. However, if you do need FP64 precision – as we saw before, there is no contest – Volta is 10x (ten times) faster.
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 34,080 [-16%] 40,790 Using the new Tensor cores, Turing is just 16% slower.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 7,400 / 7,330 [-33%] 11,000 / 10,870 6,280 / 6,600 Perhaps surprisingly, Turing is 33% slower than Volta here.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 502 / 498 [-89%] 4,470 / 4,550 335 / 332 With FP64 precision, Turing is roughly a tenth of Volta’s speed.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 1,000 [+2%] 979 FFT somehow allows Turing to match Volta in performance.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 512 / 573 [-5%] 540 / 599 242 / 227 With FP32, Turing is just 5% slower.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 302 / 302 [+1%] 298 / 375 207 / 191 Completely memory bound, Turing matches Volta here.
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 9,000 [-2%] 9,160 N-Body simulation with FP16 is just 2% slower.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 9,330 / 8,120 [+27%] 7,320 / 6,620 5,600 / 4,870 N-Body simulation allows Turing to dominate.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 222 / 295 [-94%] 3,910 / 5,130 275 / 275 With FP64 precision, Turing again falls to a small fraction of Volta’s speed.
The scientific scores are a bit more mixed – but again Turing can match or slightly exceed Volta with FP32/FP16 precision – as long as we’re not memory limited; there Volta is still around 30% faster. With FP64 it’s the same story: Turing is roughly 1/10th the speed.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 23,090 / 19,000 [-14%] 26,860 / 29,820 17,860 / 13,680 In this 3×3 convolution algorithm, Turing is 14% slower. Convolution is also used in neural nets (CNN).
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 28,240 [=] 28,310 1,570 With FP16 precision, Turing matches Volta in performance.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 6,000 / 4,350 [-35%] 9,230 / 7,250 4,800 / 3,460 Same algorithm but more shared data makes Turing 35% slower.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 10,580 [-38%] 14,676 609 With FP16 Volta is almost 40% faster than Turing.
GPGPU Image Processing Motion-Blur (7×7) Filter single/FP32 (MPix/s) 6,180 / 4,570 [-33%] 9,420 / 7,470 4,830 / 3,620 Again the same algorithm but with even more shared data; Turing is 33% slower.
GPGPU Image Processing Motion-Blur (7×7) Filter half/FP16 (MPix/s) 10,160 [-31%] 14,651 325 With FP16 nothing much changes in this algorithm.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 6,220 / 4,340 [-30%] 8,890 / 7,000 4,740 / 3,450 Still convolution but with 2 filters – Turing is 30% slower.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 10,100 [-25%] 13,446 309 Just as we have seen above, Turing is about 25% slower than Volta.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 52.53 / 59.9 [-50%] 108 / 66.34 36 / 55 A different algorithm and we see the biggest delta: Turing is 50% slower.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 121 [-40%] 204 71 With FP16 Turing reduces the loss to just 40%.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 20.28 / 25.64 [-50%] 41.38 / 23.14 15.14 / 15.3 Without major processing, this filter flies on Volta; again Turing is 50% slower.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 59.55 [-54%] 129 50.75 FP16 precision does not change things.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 24,600 / 29,640 [+1%] 24,400 / 24,870 19,480 / 14,000 This algorithm is 64-bit integer heavy and here Turing is 1% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 22,400 [-8%] 24,292 6,090 FP16 does not help here as we’re at maximum performance.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 3,000 / 10,500 [-20%] 3,771 / 8,760 1,288 / 6,530 One of the most complex and largest filters, Turing is 20% slower than Volta.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 7,850 [-4%] 8,137 461 Switching to FP16, Turing stays within 4% of Volta and is over 2x faster than its own FP32 result.
For image processing, Turing is generally 20-35% slower than Volta, somewhat in line with memory performance. Where FP16 is sufficient, Turing can match Volta in some filters – something that old Pascal could never do.

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Memory Benchmarks nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 494 / 485 [-7%] 534 / 530 356 / 354 GDDR6 provides good bandwidth, only 7% less than HBM2.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.3 / 10.4 [-1%] 11.4 / 11.4 11.4 / 9 Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4!
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 11.9 / 12.3 [-1%] 12.1 / 12.3 12.2 / 8.9 Again no significant difference but we were not expecting any.
Turing’s GDDR6 memory provides almost the same bandwidth as Volta’s expensive HBM2. All cards use PCIe3 x16 connections thus similar upload/download bandwidth. Hopefully the move to PCIe4/5 will improve transfers.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 135 / 143 [-25%] 180 / 187 201 / 230 From the start we see global access latencies reduced by 25% – not huge, but it will help.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 243 / 248 [-22%] 311 / 317 286 / 311 Full range random accesses are also 22% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 40 / 43 [-25%] 53 / 57 89 / 121 Sequential accesses have also dropped 25%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 77 / 80 [+2%] 75 / 76 117 / 174 Constant memory latencies seem about the same.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 10.6 / 71 [-41%] 18 / 85 18.7 / 53 Shared memory latencies seem to be improved.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 157 / 217 [-26%] 212 / 279 195 / 196 Texture access latencies have also reduced by 26%.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 268 / 329 [-22%] 344 / 313 282 / 278 As we’ve seen with global memory, we see reduced latencies by 22%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 67 / 138 [-24%] 88 / 163 87 / 123 With sequential access we also see a 24% reduction.
The high data rate of Turing’s GDDR6 brings reduced latencies across the board over HBM2, although as we’ve seen in the compute benchmarks this does not always translate into better performance. Still, some algorithms – especially less optimised ones – may benefit, at a much lower cost.
We see L1 cache effects between 32-64kB, tallying with an L1D of 32-48kB (depending on configuration), with the other inflexion between 4-8MB matching the ~6MB L2 cache.
As with global memory, we see the same L1D (32kB) and L2 (6MB) cache effects with similar latencies. Both are significant upgrades over the Titan X’s caches.
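For readers wondering how those inflexion points reveal the cache sizes: the latency benchmarks walk a chain of dependent accesses over ever larger working sets, and the time per access jumps at each cache boundary. A minimal host-side C++ sketch of the same principle (an illustration only, not the actual GPGPU benchmark code):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Pointer-chase: every load depends on the previous one, so prefetching cannot
// hide latency. The ns/hop figure steps up whenever the working set no longer
// fits a cache level - the "inflexion points" discussed above.
static double ns_per_hop(std::size_t elems, std::size_t hops = 1u << 24) {
    // Build one big cycle visiting every element in random order.
    std::vector<std::size_t> order(elems);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    std::vector<std::size_t> next(elems);
    for (std::size_t i = 0; i + 1 < elems; ++i) next[order[i]] = order[i + 1];
    next[order[elems - 1]] = order[0];

    volatile std::size_t idx = order[0];          // volatile keeps the chase alive
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < hops; ++i) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
}

int main() {
    // Sweep working-set sizes; latency steps reveal the L1 / L2 / memory boundaries.
    for (std::size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        std::printf("%8zu kB : %6.1f ns/hop\n", kb, ns_per_hop(kb * 1024 / sizeof(std::size_t)));
    return 0;
}
```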

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

If you wanted to upgrade your old Pascal Titan X but could not afford the Volta’s Titan V – then you can now get a cheap RTX 2080Ti or Titan RTX and get similar if not slightly faster FP16/FP32 performance that blows the not-so-old Titan X out of the water! If you can make do with FP16 and use Tensor cores, we’re looking at 6-8x performance over FP32 using a single card.

Naturally, the FP64 performance is again “gimped” at 1/32x so if that’s what you require, Turing cannot help you there – you will have to get a Volta. But then again the Titan X was similarly “gimped” thus if that’s what you had you still get a decent performance upgrade.

The GDDR6 memory may have similar bandwidth on paper, but in streaming algorithms it is about 33% slower than HBM2, so there Turing cannot match Volta – considering the cost it is a good trade. You also get 1GB less than the Titan V/X (11GB vs. 12GB) but again, not a surprise. Global/Constant/Texture memory access latencies are lower due to the high data rate, which should help algorithms that are memory-access limited (if you cannot otherwise hide the accesses).

As we’re testing GPGPU performance here, we have not touched on the ray-tracing (RTX) units, but should you happen to play a game or two when you are “resting”, then the Titan RTX / 2080TI might just impress you even more. Here, not even Volta can match it!

All in all – Titan RTX is a compelling (relatively) cheap upgrade over the old Titan X if you don’t require FP64 precision.

nVidia Titan RTX (Turing)

SiSoftware Sandra 20/20/9 (2020 R9) Update – GPGPU updates, fixes

Update Wizard

We are pleased to release R9 (version 30.69) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

GPGPU Benchmarks

  • CUDA SDK 11.3+ update for nVidia “Ampere” (SM8.x) – deprecated SM3.x support
  • OpenCL SDK updated to 1.2 minimum, 2.x recommended and 3.x experimental.
  • Increased addressing in all benchmarks (GPGPU Processing, Cryptography, Scientific Analysis, Financial Analysis, Image Processing, etc.) to 64-bit for large VRAM cards (12GB and larger)
  • Increased limits allowing bigger grids/workloads on large memory systems (e.g. 24GB+ VRAM, 64GB+ RAM)
  • Optimised (vectorised/predicated) some image processing kernels for higher performance on VLIW GPGPUs.

CPU / SVM Benchmarks

  • Vectorised the CPU Multi-Media 128-bit integer benchmark to support ADX/BMI(2) instructions as well as AVX512-IFMA(52) (52-bit integer FMA, 104-bit intermediate result) on Intel “IceLake” and newer CPUs (also supporting AVX512-DQ, AVX2 and SSE4). See the “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article and the sketch after this list.
  • Optimised .Net and Java Multi-Media 128-bit integer benchmarks similar to CPU version.
  • Increased limits/addressing allowing bigger grids/workloads on large thread/memory systems (e.g. 256-thread, 256GB RAM)
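For readers unfamiliar with IFMA(52), a minimal sketch of the building block is shown below. The intrinsics are the standard AVX512-IFMA ones; the kernel itself is illustrative only and is not Sandra’s actual benchmark code.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// AVX512-IFMA(52) sketch: multiply packed unsigned 52-bit integers (forming a
// 104-bit intermediate product) and accumulate its low/high 52 bits into
// 64-bit lanes. Requires an IceLake/TigerLake-class CPU and a compiler flag
// such as -mavx512ifma (GCC/Clang).
int main() {
    const uint64_t MASK52 = (1ULL << 52) - 1;
    __m512i acc = _mm512_setzero_si512();                 // accumulator
    __m512i b   = _mm512_set1_epi64((long long)MASK52);   // 2^52 - 1 in each lane
    __m512i c   = _mm512_set1_epi64(3);

    __m512i lo = _mm512_madd52lo_epu64(acc, b, c);        // acc + low  52 bits of b*c
    __m512i hi = _mm512_madd52hi_epu64(acc, b, c);        // acc + high 52 bits of b*c

    alignas(64) uint64_t out_lo[8], out_hi[8];
    _mm512_store_epi64(out_lo, lo);
    _mm512_store_epi64(out_hi, hi);
    std::printf("lo=%llx hi=%llx\n",
                (unsigned long long)out_lo[0], (unsigned long long)out_hi[0]);
    return 0;
}
```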

Bug Fixes

  • Fixed an incorrect rounding function resulting in negative numbers being displayed for zero values (!). Display issue only; actual values were not affected.
  • Fixed display for large scores in GPGPU Processing benchmarks (result would overflow the display routine). Display issue only, scores stored internally (database) or sent/received from Ranker were correct and will display correctly upon update.
  • Fixed (sub)domain for Information Engine. Updated microcode, firmware, BIOS, driver versions are displayed again when available.

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Benchmarks of JCC Erratum Mitigation – Intel CPUs

What is the JCC Erratum?

It is a bug (“erratum”) that affects pretty much all recent Intel Core processors (from the 6th generation “Skylake” SKL to the 10th generation “Comet Lake” CML) but not the next-generation Core (10th generation “Ice Lake” ICL and later). JCC stands for “Jump Conditional Code”, i.e. conditional instructions (compare/test and jump) that are very common in all software. Intel has indicated that some conditions may cause “unpredictable behaviour” that could perhaps be “weaponised”, thus it had to be patched through microcode (firmware). This affects all code: privileged/kernel as well as user-mode programs.

Unfortunately the patch can result in a somewhat significant performance regression, since any “jumps” that span a 32-byte boundary (not uncommon) can no longer be cached by the decoded instruction cache (“decoded iCache” aka “DSB”). The DSB caches decoded instructions so they don’t need to be decoded again when the same code executes again (pretty likely).

Such code is forced down the “legacy (decode) pipeline”, which is naturally much slower than the decoded (DSB) path, and it also incurs a time penalty (latency) when switching between the two.

Can these performance penalties be reduced?

By rebuilding software (if possible, i.e. the source code is available) with updated tools (compilers, assemblers), this condition can be avoided by aligning “jumps” so they do not cross (or end on) 32-byte boundaries. This way the mitigation is never engaged, so the performance regression is avoided. However, everything must be rebuilt – programs, libraries (DLL, object), device drivers and so on – old software in object form cannot be “automagically” fixed at run-time.

The alignment is done through “padding” with dummy code (“no-op”s or longer, superfluous encodings) and thus does increase code size, aka “bloat”. Fortunately, on average the code size increases by only 3-5%, which is manageable.
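To make this concrete, here is a hedged sketch (our own illustration, not Sandra’s code) of what “rebuilding” means in practice: the source is unchanged, only the build options ask the assembler to keep branches clear of 32-byte boundaries. The flag names below are as we understand the current toolchains and may change.

```cpp
// Nothing in the source changes - the mitigation is a (re)build option that
// makes the assembler pad/realign code so no jump (or macro-fused cmp+jcc
// pair) crosses or ends on a 32-byte boundary. As we understand it:
//
//   MSVC (VS2019 16.5+): cl /O2 /QIntel-jcc-erratum hot.cpp
//   GCC / Clang        : g++ -O2 -Wa,-mbranches-within-32B-boundaries hot.cpp
//
// The padding (NOPs or longer encodings) is what grows binaries by a few percent.
#include <cstddef>
#include <cstdint>

uint64_t sum_even(const uint32_t* v, std::size_t n) {
    uint64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if ((v[i] & 1u) == 0)   // a conditional jump like this one, if it happens
            s += v[i];          // to straddle a 32-byte boundary, loses DSB caching
    }                           // unless rebuilt with the options above
    return s;
}
```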

What about JIT CPU-agnostic byte-code (.Net, Java)?

JIT-compiled code (such as .Net, Java, etc.) will require engine (JVM/CLR) updates but will not require rebuilding. Current engines and libraries are not likely to be patched retroactively – thus this will require new versions (e.g. Java 8/11/12 to Java 13) that will need to be tested for full compatibility.

What software can and has been updated so far?

Open-source software (“Linux”, “FreeBSD”, etc.) can easily be rebuilt as the source code (except proprietary blobs) is available. Current versions of the main distributions have not been updated so far, but future versions are likely to be, starting with 2020 updates.

Microsoft has indicated that Windows 20/04 has been rebuilt and future versions are likely to be updated; naturally, all older versions of client & server (19XX, 18XX, etc.) will not be. Thus servers rather than clients are more likely to be affected by this change, as they are unlikely to be updated until the next major long-term refresh.

What has been updated in Sandra?

Sandra 20/20/8 – aka Release 8 / version 30.50 – and later has been built with updated tools (Visual Studio 2019 latest version, ML/MASM, TASM assemblers) and JCC mitigation enabled. This includes all benchmarks including assembler code (x86 and x64). Note that assembler code needs to be modified by hand by adding alignment instructions where necessary.

We are still analysing and instrumenting the benchmarks on a variety of processors and are continuing to optimise the code where required.

To compare against the other processors, please see our other articles:

Hardware Specifications

We are comparing common Intel Core/X architectures (gen 7, 8, 9) that are affected by the JCC erratum and on which the mitigating microcode has been installed. In this article we test the effect on Intel hardware only; see the other article for the effect on AMD hardware.

CPU Specifications Intel i9-7900X (10C/20T) (Skylake-X) Intel i9-9900K (8C/16T) (CoffeeLake-R) Intel i7-8700K (6C/12T) (Coffeelake) Comments
Cores (CU) / Threads (SP) 10C / 20T 8C / 16T 6C / 12T Various core counts.
Special Instruction Sets AVX512 AVX2/FMA AVX2/FMA 512 or 256-bit SIMD.
Microcode (no JCC) 5E Ax, Bx Ax, Bx
Microcode (with JCC) 65 Dx Cx More revisions.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2/FMA, AVX, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. Latest JCC-enabling microcode has been installed either through the latest BIOS or Windows itself.

Native Benchmarks Intel i9-7900X (10C/20T) (Skylake-X) Intel i9-9900K (8C/16T) (CoffeeLake-R) Intel i7-8700K (6C/12T) (Coffeelake) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) -0.79% +0.57% +10.67% Except CFL gaining 10%, little variation.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) -0.28% +6.72% +13.85% With a 64-bit integer – nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) +2.78% +8% -1.16% With floating-point, CFL-R gains 8%.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) +2.36% +2.05% +2.88% With FP64 we see a 3% improvement.
While CFL (8700K) gains 10% in legacy integer workload and CFL-R (9900K) gains 8% in legacy floating-point workload, there are only minor variations. It seems CFL-series shows more variability than the older SKL series.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) -2.56% +5.49% +2.82% With AVX2 integer CFL/R both gain 3-5%.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) -0.18% +1.44% +3.28% With a 64-bit AVX2 integer we see smaller improvement.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) +0.8% +9.14% +1.51% A tough test using long integers to emulate Int128 without SIMD, CFL-R gains 9%.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) +2.35% +0.1% +0.57% Floating-point shows minor variation
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 0% +0.18% +1.45% Switching to FP64 SIMD  nothing much changes.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) +1.75% +1.63% +0.23% A heavy algorithm using FP64 to mantissa-extend FP128; only minor changes.
With heavily vectorised SIMD workloads (written in assembler) we see smaller variation, although both CFL/R processors are marginally faster (~3%). Unlike high-level code (C/C++, etc.), assembler code is less dependent on the tools used for building – thus it shows less variability across versions.
BenchCrypt Crypto AES-256 (GB/s) -0.47% +0.18% +0.19% Memory bandwidth rules here thus minor variation.
BenchCrypt Crypto AES-128 (GB/s) +0.04% +0.30% +0.06% No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) +0.54% +0.32% +0.86% No change with SIMD code.
BenchCrypt Crypto SHA1 (GB/s) +1.98% +6.16% +0.62% Less compute intensive SHA1 does not change things.
BenchCrypt Crypto SHA2-512 (GB/s) +0.32% +6.53% +1.41% 64-bit SIMD does not change results.
The memory sub-system is crucial here, thus we see the least variation in performance; even SIMD workloads are not affected much. Again we see CFL-R showing the biggest gain while SKL-X remains pretty much constant.
BenchFinance Black-Scholes float/FP32 (MOPT/s) +9.26% +1.04% +3.83% B/S does not use much shared data and here SKL-X shows a large gain.
BenchFinance Black-Scholes double/FP64 (MOPT/s) +2.14% +2.02% -1.06% Using FP64 code variability decreases.
BenchFinance Binomial float/FP32 (kOPT/s) +4.55% +1.63% -0.05% Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) +1.7% +0.43% 0% With 64-bit code we see less delta.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) +1.38% 0% +7.1% Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) +1.61% -7.96% +4.08% Switching to FP64 we see mixed results.
With non-SIMD financial workloads, we see bigger differences with either CFL or SKL-X showing big improvements in some tests but not in others. Overall it shows that, perhaps the tools still need some work as the gains/losses are not consistent.
BenchScience SGEMM (GFLOPS) float/FP32 +1.18% +6.13% +4.97% In this tough vectorised workload CFL/R gains most.
BenchScience DGEMM (GFLOPS) double/FP64 +12.64% +8.58% +7.08% With FP64 vectorised all CPUs gain big.
BenchScience SFFT (GFLOPS) float/FP32 +1.52% +1.12% +0.21% FFT is also heavily vectorised but memory dependent we see little variation.
BenchScience DFFT (GFLOPS) double/FP64 +1.38% +1.18% +0.10% With FP64 code, nothing much changes.
BenchScience SNBODY (GFLOPS) float/FP32 +3.5% +1.04% +0.82% N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 +4.6% +0.59% 0% With FP64 code SKL-X improves.
With highly vectorised SIMD code (scientific workloads), SKL-X finally improves while CFL/R does not change much – although this could be due to optimisations elsewhere. Some algorithms are completely memory latency/bandwidth dependent and thus will not be affected by JCC.
Neural Networks NeuralNet CNN Inference (Samples/s) +3.09% +7.8% +2.9% We see a decent improvement in inference of 3-8%.
Neural Networks NeuralNet CNN Training (Samples/s) +6.31% +8.1% +12.98% Training seems to improve even more.
Neural Networks NeuralNet RNN Inference (Samples/s) -0.19% +2.7% +2.31% RNN inference shows little variation.
Neural Networks NeuralNet RNN Training (Samples/s) -3.81% -0.49% -3.83% Strangely all CPUs are slower here.
Despite heavy use of vectorised SIMD code, using intrinsics (C++) rather than assembler can result in larger performance variation from one compiler version (and its code-generation options) to the next. While some tests gain, others show regressions which will likely be addressed by future versions.
CPU Image Processing Blur (3×3) Filter (MPix/s) +2.5% +11.5% +5% In this vectorised integer workload CFL/R gains 5-10%
CPU Image Processing Sharpen (5×5) Filter (MPix/s) -1.6% +24.38% +24.84% Same algorithm but more shared data; CFL/R zooms to 24%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) -1.5% +9.31% 0% Again same algorithm but even more data shared brings 10%
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) +4.93% +23.77% +20% Different algorithm but still a vectorised workload; CFL/R is again 20% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) +7.62% +7% +7.13% Still vectorised code; all CPUs gain around 7% here.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) +3.08% 0% +0.53% Not much improvement here.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) +12.69% 0% +1.85% With integer workload, SKL-X is 12% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) +3.69% +0.33% -0.44% In this final test, again with an integer workload, only minor changes.
Similar to what we saw before, intrinsic (thus compiled) code shows larger gains than hand-optimised assembler code, and here again CFL/R gains most while the old SKL-X shows almost no variation.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

JCC is perhaps a more problematic erratum than the other vulnerabilities (“Meltdown”, “Spectre”, etc.) that have affected Intel Core processors – in the sense that it affects all software (both kernel and user mode) and requires rebuilding everything (programs, libraries, device drivers, etc.) using updated tools. While open-source software is likely to do so – on Windows only the very newest versions (2020+) and actively maintained software (such as Sandra) are likely to be updated; all older software will not be.

Server software – hypervisors, server operating systems (LTS – long-term support), server programs (database servers, web servers, storage servers, etc.) – is very unlikely to be updated despite the performance regressions, as (re)testing would be required for various certifications and compatibility.

As the microcode updates for JCC also include the previous mitigations for the older “Meltdown”/“Spectre” vulnerabilities – you cannot patch JCC only. With microcode updates being pushed aggressively by both BIOS and operating systems, it is now much harder not to update. [Some users have chosen to remain on old microcode either due to incompatibilities or performance regressions, despite the “risks”.]

While the older gen 6/7 “Skylake” (SKL/X) does not show much variation, the newer gen 8/9/10 “CoffeeLake” (CFL/R) gains the most from new code, especially high-level C/C++ (or intrinsics); hand-written assembler code (suitably patched) does not improve as much. Some gains are in the region of 10-20% (or perhaps this is recovering the loss from the new microcode), thus it makes sense to update any and all software with the JCC mitigation if at all possible. [Unfortunately we were unable to test pre-JCC microcode due to the current situation.]

With the “real” gen(eration) 10 “Ice Lake” (ICL) and the soon-to-be-released gen 11 “Tiger Lake” (TGL) not affected by this erratum – nor by the older vulnerabilities (“Meltdown”/“Spectre”) that bring their own performance degradation – it is perhaps a good time to upgrade. To some extent the new processors are faster simply because they are not affected by all these issues!

Note: we have also tested the effect the JCC erratum mitigation has (if any) on the competition – namely AMD.

Should you decide to do so, please check out our other articles:

SiSoftware Sandra 20/20/8 (2020 R8t) Update – JCC, bigLITTLE, Hypervisors + Future Hardware

Sandra 20/20

Note: The original R8 release has been updated to R8t with future hardware support.

We are pleased to release R8t (version 30.61) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

JCC Erratum Mitigation

Recent Intel processors (SKL “Skylake” and later but not ICL “IceLake”) have been found to be impacted by the JCC Erratum that had to be patched through microcode. Naturally this can cause performance degradation depending on benchmark (approx 3% but up to 10%) but can be mitigated through assembler/compiler updates that prevent this issue from happening.

We have updated the tools with which Sandra is built to mitigate JCC and we have tested the performance implications on both Intel and AMD hardware in the linked articles.

bigLITTLE Hybrid Architecture (aka “heterogeneous multi-processing”)

big.Little HMP
While the bigLITTLE arch (combining low-power and high-performance asymmetric cores in the same processor) has been used in many ARM processors, Intel is now introducing it to x86 (e.g. “Lakefield”, built using “Foveros” packaging). Thus we have Atom (low performance but very low power) and Core (high performance but relatively high power) cores in the same processor – scheduled to run or be “parked” depending on compute demands.

As with any new technology, it will naturally require operating system (scheduler) support and may go through various iterations. Do note that as we’ve discussed in our 2015 (!) article – ARM big.LITTLE: The trouble with heterogeneous multi-processing: when 4 are better than 8 (or when 8 is not always the “lucky” number) – software (including benchmarks) using all cores (big & LITTLE) may have trouble correctly assigning workloads and thus not use such processors optimally.

As Sandra uses its own scheduler to assign (benchmarking) threads to logical cores, we have updated it to allow users to benchmark not only “All Threads (MT)” and “Only Cores (MC)” but also “Only big Cores (bMC)” and “Only LITTLE Cores (LMC)”. This way you can compare and contrast the performance of the various cores without BIOS/firmware changes.

The (benchmark) workload scheduler also had to be updated to allow per-thread workloads – with threads scheduled on LITTLE cores assigned less work and threads on big cores assigned more work, depending on their relative performance. The changes to Sandra’s workload scheduler allow each core to be fully utilised – at least when benchmarking.
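A minimal sketch of the idea (our own illustration with made-up relative-performance figures, not Sandra’s actual scheduler): each thread receives a slice of the total workload proportional to the relative performance of the core it runs on, so big and LITTLE cores finish at roughly the same time.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Per-thread workload weighting on a hybrid (big.LITTLE) CPU: threads pinned
// to LITTLE cores receive proportionally less work. The rel_perf values are
// invented for illustration.
struct Core { bool big; double rel_perf; };   // e.g. big = 1.0, LITTLE = 0.4

static std::vector<std::size_t> split_work(std::size_t total_items,
                                           const std::vector<Core>& cores) {
    double sum = 0.0;
    for (const auto& c : cores) sum += c.rel_perf;
    std::vector<std::size_t> chunks;
    for (const auto& c : cores)
        chunks.push_back(static_cast<std::size_t>(total_items * (c.rel_perf / sum)));
    return chunks;                            // remainder handling omitted for brevity
}

int main() {
    std::vector<Core> cores = { {true, 1.0}, {true, 1.0}, {false, 0.4}, {false, 0.4} };
    for (std::size_t n : split_work(1'000'000, cores))
        std::printf("%zu items\n", n);
    return 0;
}
```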

Note: This advanced information is subject to change pending hardware and software releases and updates.

Future Hardware Support

Update R8t adds support for “Tiger Lake” (TGL) as well as updated support for “Ice Lake” (ICL) and future processors.

AMD Power/Performance Determinism

Some of AMD’s server processors allow “determinism” to be changed to either “performance” (consistent speed across nodes/sockets) or “power” (consistent power across nodes/sockets). While workloads normally require predictability and thus “consistent performance” – this can come at the expense of speed (not taking advantage of power/thermal headroom) and even power (too much power consumed by some sockets/nodes).

As “power deterministic” mode allows each processor to run at its maximum performance, there can be reasonable deviations across processors – but this headroom would go unused if each thread were assigned the same workload. In effect, it is similar to the “hybrid” issue above, with some cores able to sustain a different workload than others, so the workload needs to vary accordingly. Again, the changes to Sandra’s workload scheduler allow each core to be fully utilised – at least when benchmarking.

Note: In most systems the deviation between nodes/sockets is relatively small if headroom (thermal/power) is small.

Hypervisors

More and more installations are now running in virtualised mode under a (Type 1) hypervisor: using Hyper-V, Docker, programming tools for various systems (Android, etc.) or even enabling “Memory Integrity” all mean the system will silently be modified to run transparently under a hypervisor (Hyper-V on Windows).

As a result, Sandra will now detect and report hypervisor details when uploading benchmarks to the SiSoftware Official Live Ranker, as even when running transparently (“host mode”) there can be deviations between benchmark scores, especially when I/O operations (disk, network, even memory) are involved; some vulnerability mitigations apply to both the hypervisor and the host/guest operating system, with a “double impact” on performance.
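For the curious, the standard way to detect a hypervisor is via CPUID: leaf 1 ECX bit 31 is the “hypervisor present” bit and leaf 0x40000000 returns a vendor signature (Hyper-V reports “Microsoft Hv”). A minimal sketch below (our own illustration using the MSVC intrinsics, not Sandra’s detection code):

```cpp
#include <intrin.h>    // MSVC intrinsics (Windows, matching the article's environment)
#include <cstdio>
#include <cstring>

int main() {
    int r[4];
    __cpuid(r, 1);                                        // standard leaf 1
    bool hv = (static_cast<unsigned>(r[2]) >> 31) & 1u;   // ECX bit 31: hypervisor present
    std::printf("hypervisor present: %s\n", hv ? "yes" : "no");
    if (hv) {
        __cpuid(r, 0x40000000);                           // hypervisor vendor leaf
        char vendor[13] = {};
        std::memcpy(vendor + 0, &r[1], 4);                // EBX
        std::memcpy(vendor + 4, &r[2], 4);                // ECX
        std::memcpy(vendor + 8, &r[3], 4);                // EDX
        std::printf("vendor signature: %s\n", vendor);
    }
    return 0;
}
```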

Note: We will publish an article detailing the deviation seen with different hypervisors (Hyper-V, VmWare, Xen, etc.).

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Intel Core Gen10 CometLake (i9-10900K) Review & Benchmarks – CPU Performance

Intel Core i9 10th Gen

What is “CometLake”?

It is one of the 10th generation Core architectures (CML) from Intel – the latest revision of the venerable (6th gen!) “Skylake” (SKL) arch; it succeeds the current 8/9th-gen “CoffeeLake” architectures for desktop devices. The “real” 10th generation Core arch is “IceLake” (ICL), which does bring many changes – but it has only been released on mobile (ULV) devices so far. It is likely Intel will skip it altogether on the desktop.

As a result there are no major updates vs. previous “Skylake” (SKL) designs, save an increase in core count on top-end versions and hardware vulnerability mitigations – which can still make a big difference:

  • Up to 10C/20T (from 8C/16T “CoffeeLake” or 4C/8T “Skylake”/”KabyLake”)
  • Increased Turbo ratios and base clocks
  • Hyper-Threading (SMT) enabled on all Core SKUs (i9, i7, i5, i3)
  • 2-channel DDR4-2933 (up from 2667)
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
  • New platform based on LGA1200 socket – thus new motherboards

Unlike CML ULV (mobile), on the desktop we get a modest increase in core count (10C/20T vs. CFL’s 8C/16T), albeit at a higher 125W TDP – but it is still a big increase vs. older designs that always had 4C/8T. Hyper-Threading is no longer disabled on i7 and i5, which – should you wish to keep it enabled – can still provide good performance gains in many applications.

Official DDR4 speed support has gone up to 2933MT/s (~46GB/s bandwidth), up from 2667MT/s, which should help feed all those extra cores.
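As a quick sanity check of that bandwidth figure (our own arithmetic):

```latex
2\ \text{channels} \times 2933\,\text{MT/s} \times 8\,\text{B/transfer} \approx 46.9\ \text{GB/s}
```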

While CFL does mitigate “Meltdown” (CVE-2017-5754 “rogue data cache load”) in hardware and reports “not vulnerable” (can be checked with Sandra or a similar utility) – due to MDS (to which CFL is vulnerable) recent versions of Windows still consider KVA (“kernel VA shadowing”) required and enable it by default. Thus the relatively large overhead of the “Meltdown” mitigation is back. CML reports “not vulnerable” to both “Meltdown” and MDS, thus KVA is neither required nor enabled. Hopefully no further vulnerabilities will be discovered to undo these fixes.

Why review it now?

As “IceLake” (ICL) does not seem to be making its public debut on desktop/workstation, “CometLake” (CML) is the latest desktop CPU from Intel you can buy today; despite being just a revision of “Skylake”, the increased core counts and Turbo ratios may still make it a worthy competitor, not just on cost but also on performance.

As per above, with the additional hardware fixes/mitigations for vulnerabilities discovered since “CoffeeLake” launched – especially “Meltdown” but also “Spectre” variants – the operating system & applications no longer need to deploy slower software mitigations that can affect performance (especially I/O). For some workloads, this may be worth the upgrade alone!

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel desktop CPU with competing architectures (gen 8, 7, 6) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

CPU Specifications Intel i9 10900K (CML) Intel i9 9900K (CFL) AMD Ryzen 9 3900X AMD Ryzen 7 3700X Comments
Cores (CU) / Threads (SP) 10C / 20T 8C / 16T 12C / 24T 8C / 16T 25% increase in core count
Speed (Min / Max / Turbo) 1.6-3.7-5.3GHz 1.6-3.6-5GHz 3.8-4.6GHz 3.6-4.4GHz CML has modest Turbo increase.
Power (TDP) 125W 95W 105W 65W 25% increase in TDP
L1D / L1I Caches 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 32kB 8-way 12x 32kB 8-way / 12x 32kB 8-way 8x 32kB 8-way / 8x 32kB 8-way No L1 changes
L2 Caches 10x 256kB 8-way 8x 256kB 16-way 12x 512kB 16-way 8x512kB 16-way No L2 changes
L3 Caches 20MB 16-way 16MB 16-way 4x 16MB 16-way (64MB) 2x 16MB 16-way (32MB) 25% larger L3
Microcode (Firmware) MU06A505-C8 MU069E0C-9E MU8F71000-21 MU8F7100-13 Revisions just keep on coming.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). “CometLake” (CML) supports all modern instruction sets including AVX2, FMA3 but not AVX512 (like “IceLake”, “Skylake-X”) or SHA HWA (like Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Intel i9 10900K (CML) Intel i9 9900K (CFL) AMD Ryzen 9 3900X AMD Ryzen 7 3700X Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 553 [+38%] 400 572 336 CML starts off 38% faster than CFL with 25% more cores.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 537 [+37%] 393 559 339 With a 64-bit integer workload still 37% faster.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 350 [+48%] 236 338 202 With floating-point workload CML is 48% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 288 [+47%] 196 292 170 With FP64 we see a similar 47% improvement.
With integer (legacy) workloads, CML is almost 40% faster than CFL – more than the core-count increase (+25%) alone. With floating-point we see an even greater improvement of almost 50%! This allows it to get within a whisker of AMD’s 3900X with its 12C.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1,561 [+58%] 985 1,467 1,023 In this vectorised AVX2 integer test CML is ~60% faster than CFL!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 645 [+56%] 414 552 374 With a 64-bit AVX2 integer workload the difference is similar 56%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 13.3 [+97%] 6.75 15.26 6.54 This is a tough test using Long integers to emulate Int128 without SIMD but CML is almost 2x faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1,551 [+70%] 914 1,510 1,000 In this floating-point AVX/FMA vectorised test, CML is 70% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 895 [+67%] 535 931 618 Switching to FP64 SIMD code, nothing much changes still 67% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 34.9 [+52%] 23 35.2 24.2 In this heavy algorithm using FP64 to mantissa extend FP128 with AVX2 – we see 52% improvement.
With heavily vectorised SIMD workloads CML improves even more over CFL, in one case even 2x faster, which again allows it to trade blows with the 3900X and its 12C. All the mitigations must weigh heavy on CFL, as the large improvement is hard to justify otherwise.
BenchCrypt Crypto AES-256 (GB/s) 20.4 [+16%] 17.6 23.9 18.04 With AES/HWA support all CPUs are memory bandwidth bound.
BenchCrypt Crypto AES-128 (GB/s) 20.4 [+16%] 17.6 24.4 18.76 No change with AES128, CML is 16% faster.
BenchCrypt Crypto SHA2-256 (GB/s) 19.46 [+62%] 12 33.6 24.2 Without SHA/HWA Ryzen beats CML.
BenchCrypt Crypto SHA1 (GB/s) 22.9 34 23 Less compute intensive SHA1 allows CML to catch up.
BenchCrypt Crypto SHA2-512 (GB/s) 9 SHA2-512 is not accelerated by SHA/HWA, so CML does better here.
The memory sub-system is crucial here, and CML improves over CFL with faster memory – the extra cores don’t help. But Ryzen is still faster and with SHA/HWA much faster in hashing than even Intel’s AVX2 SIMD units can muster.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 276 With non-vectorised code CML needs to catch up.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 377 [+58%] 238 424 257 Using FP64, CML is 58% faster but cannot beat Ryzen.
BenchFinance Binomial float/FP32 (kOPT/s) 59.9 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 90.1 [+46%] 61.6 113 64 With FP64 code CML is 46% faster than CFL.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 56.5 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 125 [+2.8x] 44.5 184 110 Switching to FP64 nothing much changes, CML is 2.8x faster.
With non-SIMD financial workloads, CML still improves ~50% over CFL, which is a big change; unfortunately AMD’s 3900X is still faster, but at least CML remains competitive where CFL was outclassed. ZEN3 will prove a big challenge though.
BenchScience SGEMM (GFLOPS) float/FP32 375 446 263 In this tough vectorised AVX2/FMA algorithm.
BenchScience DGEMM (GFLOPS) double/FP64 216 [+3%] 209 201 193 With FP64 vectorised code, CML is just 3% faster.
BenchScience SFFT (GFLOPS) float/FP32 22.33 25.13 22.78 FFT is also heavily vectorised (x4 AVX2/FMA) but stresses the memory sub-system more.
BenchScience DFFT (GFLOPS) double/FP64 9.11 [-19%] 11.21 18.62 11.16 With FP64 code, Ryzen is king.
BenchScience SNBODY (GFLOPS) float/FP32 557 689 612 N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 252 [+47%] 171 300 220 With FP64 code CML is ~50% faster.
With highly vectorised SIMD code (scientific workloads) CML’s improvement is variable, but it is there; it is likely some workloads need subtle software tuning due to contention between this many threads and cores. However, the Ryzen 3900X is always faster.
CPU Image Processing Blur (3×3) Filter (MPix/s) 3,823 [+49%] 2,560 3,380 2,564 In this vectorised integer AVX2 workload CML is 50% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 1,530 [+53%] 1,000 1,612 955 Same algorithm but more shared data, 53% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 781 [+50%] 519 819 492 Again same algorithm but even more data shared still 50% faster.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 1,335 [+61%] 827 1,395 832 Different algorithm but still AVX2 vectorised workload now 60% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 123 [+58%] 78 147 90.45 Still AVX2 vectorised code but here just 58% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 63 [+49%] 42.2 40.4 25.3 Similar improvement here of about 49%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 4,145 [+4%] 4,000 1,718 1,763 With integer AVX2 workload, only 4% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 891 [+49%] 596 519 323 In this final test again with integer AVX2 workload CML is 50% faster.
Without support for any new instruction sets (AVX512, SHA/HWA, etc.), CML was never going to be a revolution in performance, but again we see it beat CFL by ~50%, similar to what we’ve seen in the other benchmarks.

Intel themselves did not claim a big performance improvement – possibly as it makes CFL pretty much obsolete, but with slightly more cores and higher clocks/TDP CML can reach Ryzen 3900X levels of performance which is no mean feat. With ZEN3 looking to launch soon, this is not before time.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For some it may be disappointing that we do not have the brand-new, improved “IceLake” (ICL) now rather than a 3rd revision of “Skylake”, but “CometLake” (CML) does seem to improve even over the previous revisions (8/9th gen “CoffeeLake” CFL) due to the modest increase in cores and base/Turbo clocks, but perhaps also due to the hardware-based vulnerability mitigations which no longer require costly software versions.

Thus, somewhat surprisingly CML is able to trade blows with the 3900X and its 12C/24T that shows the heavily-revised “Skylake” core can still pack a punch against AMD’s latest and greatest. Naturally we would have preferred 12-cores not 10 but that would likely “eat” even further into Intel’s HEDT platform.

While owners of 8/9th-gen won’t be upgrading – it is very rare to recommend changing from one generation to the next anyway – owners of older hardware can look forward to over 2x the performance in most workloads for the same power draw, not to mention the additional features.

On the other hand, the competition (AMD Ryzen 3000 series) has more cores (12C and more) for great cost and performance – and still compatible with the old (with BIOS update) AM4 socket mainboards! With CML needing a new motherboard (LGA1200) and future “IceLake”-based CPUs possibly needing new motherboards again, CML is very much a stop-gap solution.

All in all Intel has managed to squeeze all it can from the old “Skylake” arch that while not revolutionary, still has enough to be competitive with current designs; while it goes out on a high, it is likely the end-of-the-road for this core.

In a word: Qualified Recommendation

Please see our other articles on: