SiSoftware Sandra 20/20/8 (2020 R8t) Update – JCC, bigLITTLE, Hypervisors + Future Hardware

Sandra 20/20

Note: The original R8 release has been updated to R8t with future hardware support.

We are pleased to release R8t (version 30.61) update for Sandra 20/20 (2020) with the following updates:

Sandra 20/20 (2020) Press Release

JCC Erratum Mitigation

Recent Intel processors (SKL “Skylake” and later but not ICL “IceLake”) have been found to be impacted by the JCC Erratum that had to be patched through microcode. Naturally this can cause performance degradation depending on benchmark (approx 3% but up to 10%) but can be mitigated through assembler/compiler updates that prevent this issue from happening.

We have updated the tools with with which Sandra is built to mitigate JCC and we have tested the performance implications on both Intel and AMD hardware in the linked articles.

bigLITTLE Hybrid Architecture (aka “heterogeneous multi-processing”)

big.Little HMP
While bigLITTLE arch (including low and high-performance asymmetric cores into the same processor) has been used in many ARM processors, Intel is now introducing it to x86 as “Foveros”. Thus we have have Atom (low performance but very low power) and Core (high performance but relatively high power) into the same processor – scheduled to run or be “parked” depending on compute demands.

As with any new technology, it will naturally require operating system (scheduler) support and may go through various iterations. Do note that as we’ve discussed in our 2015 (!) article – ARM big.LITTLE: The trouble with heterogeneous multi-processing: when 4 are better than 8 (or when 8 is not always the “lucky” number) – software (including benchmarks) using all cores (big & LITTLE) may have trouble correctly assigning workloads and thus not use such processors optimally.

As Sandra uses its own scheduler to assign (benchmarking) threads to logical cores, we have updated it to allow users to benchmarks not only “All Threads (MT)” and “Only Cores (MC)” but also “Only big Cores (bMC)” and “Only LITTLE Cores (LMC)“. This way you can compare and contrast the various cores performance without BIOS/firmware changes.

The (benchmark) workload scheduler also had to be updated to allow per-thread workload – with threads scheduled on LITTLE cores assigned less work and threads on big cores assigned more work depending on their relative performance. The changes to Sandra’s workload scheduler allows each core can be fully utilised – at least when benchmarking.

Note: This advanced information is subject to change pending hardware and software releases and updates.

Future Hardware Support

Update R8t adds support for “Tiger Lake” (TGL) as well as updated support for “Ice Lake” (ICL) and future processors.

AMD Power/Performance Determinism

Some AMD’s server processors allow “determinism” to be changed to either “performance” (consistent speed across nodes/sockets) or “power” (consistent power across nodes/sockets). While normally workloads require predictability and thus “consistent performance” – this can be at the expense of speed (not taking advantage of power/thermal headroom for higher speed) and even power (too much power consumed by some sockets/nodes).

As “power deterministic” mode allows each processor at the maximum performance, there can be reasonable deviations across processors – but this would be unused if each thread has been assigned the same workload. In effect, it is similar to the “hybrid” issue above, with some cores able to sustain a different workload than other cores and the workload needs to vary accordingly. Again, the changes to Sandra’s workload scheduler allows each core to be fully utilised – at least when benchmarking.

Note: In most systems the deviation between nodes/sockets is relatively small if headroom (thermal/power) is small.

Hypervisors

More and more installations are now running in virtualised mode under a (Type 1) Hypervisor: using Hyper-V, Docker, programming tools for various systems (Android, etc.) or even enabling “Memory Integrity” all mean the system will be silently be modified to run in transparently under a hypervisor (Hyper-V on Windows).

As a result, Sandra will now detect and report hypervisor details when uploading benchmarks to the SiSoftware Official Live Ranker as even when running transparently/”host mode” – there can be deviation between benchmark scores especially when I/O operations (disk, network but even memory) are involved; some mitigations for vulnerabilities apply to both the hypervisor and host/guest operating system with a “double-impact” to performance.

Note: We will publish an article detailing the deviation seen with different hypervisors (Hyper-V, VmWare, Xen, etc.).

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Lite

Intel Core Gen10 CometLake (i9-10900K) Review & Benchmarks – CPU Performance

Intel Core i9 10th Gen

What is “CometLake”?

It is one of the 10th generation Core arch (CML) from Intel – the latest revision of the venerable (6th gen!) “Skylake” (SKL) arch; it succeeds the “CofeeLake” 8/9-gen current architectures for desktop devices. The “real” 10th generation Core arch is “IceLake” (ICL) that does bring many changes – but has only released on mobile (ULV) devices so far. It is likely Intel will skip it altogether – on the desktop.

As a result there ar no major updates vs. previous “Skylake” (SKL) designs, save increase in core count top end versions and hardware vulnerability mitigations which can still make a big difference:

  • Up to 10C/20T (from 8C/10T “CoffeeLake” or 4C/8T “Skylake”/”KabyLake”)
  • Increase Turbo ratios, base clocks
  • Hyper-Threading (SMT) enabled on all Core SKUs (i9, i7, i5, i3)
  • 2-channel DDR4-2933 (up from 2667)
  • Thunderbolt 3 integrated
  • Hardware fixes/mitigations for vulnerabilities (“Meltdown”, “MDS”, various “Spectre” types)
  • New platform based on LGA1200 socket – thus new motherboards

Unlike CML ULV – we have a modest increase in core count (10C/20T vs. CFL 8C/16T) in the same 125W TDP power envelope – but it is still a big increase vs older designs that have always had 4C/8T. Hyper-Threading is no longer disabled on i7, i5 that – should you wish to keep it enabled – can still provide good performance gains in many applications.

DDR4 official speed support has gone up to 2993Mt/s (46GB/s bandwidth) up from 2667Mt/s which should help feed all those extra cores.

While CFL does mitigate “Meltdown” (CVE-2017-5754 “rogue data cache load”) and reports “not vulnerable” (can be checked with Sandra or similar utility) – due to MDS (to which CFL is vulnerable) recent versions of Windows do consider KVA (“kernel VA shadowing”) required and enable it by default. Thus the relatively large overhead of “Meltdown” mitigation is back. ML does report “not vulnerable” to both “Meltdown” and MDS and thus KVA is not required nor enabled. Hopefully there will be no further vulnerabilities discovered to undo these fixes.

Why review it now?

As “IceLake” (ICL) does not seem to make its public debut on desktop/workstation, “CometLake” (CML) is the latest APU from Intel you can buy today;despite being just a revision of “Skylake” due to increased core counts/Turbo ratios they may still prove worthy competitors not just in cost but also performance.

As per above, the additional hardware fixes/mitigations for vulnerabilities discovered since “Cofeelake” launched – especially “Meltdown” but also “Spectre” variants – the operating system & applications do not need to deploy slower mitigations that can affect performance (especially I/O). For some workloads, this may be worth an upgrade alone!

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel desktop with competing architectures (gen 8, 7, 6) as well as competiors (AMD) with a view to upgrading to a mid-range but high performance design.

CPU Specifications Intel i9 10900K (CML) Intel i9 9900K (CFL) AMD Ryzen 9 3900X AMD Ryzen 7 3700X Comments
Cores (CU) / Threads (SP) 10C / 20T 8C / 16T 12C / 24T 8C / 16T 25% increase in core count
Speed (Min / Max / Turbo) 1.6-3.7-5.3GHz 1.6-3.6-5GHz 3.8-4.6GHz 3.6-4.4GHz CML has modest Turbo increase.
Power (TDP) 125W 95W 105W 65W 25% increase in TDP
L1D / L1I Caches 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 32kB 8-way 12x 32kB 8-way / 12x 32kB 8-way 8x 32kB 8-way / 8x 32kB 8-way No L1 changes
L2 Caches 10x 256kB 8-way 8x 256kB 16-way 12x 512kB 16-way 8x512kB 16-way No L2 changes
L3 Caches 20MB 16-way 16MB 16-way 4x 16MB 16-way (64MB) 2x 16MB 16-way (32MB) 25% larger L3
Microcode (Firmware) MU06A505-C8 MU069E0C-9E MU8F71000-21 MU8F7100-13 Revisions just keep on coming.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). “CometLake” (CML) supports all modern instruction sets including AVX2, FMA3 but not AVX512 (like “IceLake”, “Skylake-X”) or SHA HWA (like Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Intel i9 10900K (CML) Intel i9 9900K (CFL) AMD Ryzen 9 3900X AMD Ryzen 7 3700X Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 553 [+38%] 400 572 336 CML starts off 38% faster than CFL with 25% more cores.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 537 [+37%] 393 559 339 With a 64-bit integer workload still 37% faster.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 350 [+48%] 236 338 202 With floating-point workload CML is 48% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 288 [+47%] 196 292 170 With FP64 we see a similar 47% improvement.
With integer (legacy) workloads, CML is almost 40% faster than CFL – much more than just core increase (+25%). With floating-point we see an ever greater almost 50% improvement! This allows it to get within a whisker of AMD’s 3900X with its 12C.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1,561 [+58%] 985 1,467 1,023 In this vectorised AVX2 integer test CML is ~60% faster than CFL!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 645 [+56%] 414 552 374 With a 64-bit AVX2 integer workload the difference is similar 56%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 13.3 [+97%] 6.75 15.26 6.54 This is a tough test using Long integers to emulate Int128 without SIMD but CML is almost 2x faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1,551 [+70%] 914 1,510 1,000 In this floating-point AVX/FMA vectorised test, CML- 70% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 895 [+67%] 535 931 618 Switching to FP64 SIMD code, nothing much changes still 67% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 34.9 [+52%] 23 35.2 24.2 In this heavy algorithm using FP64 to mantissa extend FP128 with AVX2 – we see 52% improvement.
With heavily vectorised SIMD workloads CML improves even more over CFL, once even 2x faster, which again allows it to trade blows with 3900X and its 12C. All the mitigations must weigh heavy on CFL as the large improvement is hard to justify otherwise.
BenchCrypt Crypto AES-256 (GB/s) 20.4 [+16%] 17.6 23.9 18.04 With AES/HWA support all CPUs are memory bandwidth bound.
BenchCrypt Crypto AES-128 (GB/s) 20.4 [+16%] 17.6 24.4 18.76 No change with AES128, CML is 16% faster.
BenchCrypt Crypto SHA2-256 (GB/s) 19.46 [+62%] 12 33.6 24.2 Without SHA/HWA Ryzen beats CML.
BenchCrypt Crypto SHA1 (GB/s) 22.9 34 23 Less compute intensive SHA1 allows CML to catch up.
BenchCrypt Crypto SHA2-512 (GB/s) 9 SHA2-512 is not accelerated by SHA/HWA CML does better.
The memory sub-system is crucial here, and CML improves over CFL with faster memory – the extra cores don’t help. But Ryzen is still faster and with SHA/HWA much faster in hashing than even Intel’s AVX2 SIMD units can muster.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 276 With non vectorised CML needs to catch up.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 377 [+58%] 238 424 257 Using FP64 CML is 58% faster but cannot beat Ryzen
BenchFinance Binomial float/FP32 (kOPT/s) 59.9 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 90.1 [+46%] 61.6 113 64 With FP64 code CML is 46% faster than CFL.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 56.5 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 125 [+2.8x] 44.5 184 110 Switching to FP64 nothing much changes, CML is 2.8x faster.
With non-SIMD financial workloads, CML still improves ~50% over CFL that is a big change; unfortunately AMD’s 3900X is still faster but at least CML remains competitive while CFL was outclassed. ZEN3 will prove a big challenge though.
BenchScience SGEMM (GFLOPS) float/FP32 375 446 263 In this tough vectorised AVX2/FMA algorithm.
BenchScience DGEMM (GFLOPS) double/FP64 216 [+3%] 209 201 193 With FP64 vectorised code, CML is just 3% faster.
BenchScience SFFT (GFLOPS) float/FP32 22.33 25.13 22.78 FFT is also heavily vectorised (x4 AVX2/FMA) but stresses the memory sub-system more.
BenchScience DFFT (GFLOPS) double/FP64 9.11 [-19%] 11.21 18.62 11.16 With FP64 code, Ryzen is king.
BenchScience SNBODY (GFLOPS) float/FP32 557 689 612 N-Body simulation is vectorised but with more memory accesses.
BenchScience DNBODY (GFLOPS) double/FP64 252 [+47%] 171 300 220 With FP64 code CML is ~50% faster
With highly vectorised SIMD code (scientific workloads) CML improvement is variable but it is there; it is likely that subtle improvements must be made in software for some workloads due to the core-contention for many-threaded cores. However, Ryzen 3900X is always faster.
CPU Image Processing Blur (3×3) Filter (MPix/s) 3,823 [+49%] 2,560 3,380 2,564 In this vectorised integer AVX2 workload CML is 50% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 1,530 [+53%] 1,000 1,612 955 Same algorithm but more shared data, 53% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 781 [+50%] 519 819 492 Again same algorithm but even more data shared still 50% faster.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 1,335 [+61%] 827 1,395 832 Different algorithm but still AVX2 vectorised workload now 60% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 123 [+58%] 78 147 90.45 Still AVX2 vectorised code but here just 58% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 63 [+49%] 42.2 40.4 25.3 Similar improvement here of about 49%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 4,145 [+4%] 4,000 1,718 1,763 With integer AVX2 workload, only 4% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 891 [+49%] 596 519 323 In this final test again with integer AVX2 workload CML is 50% faster.
Without any new instruction sets (AVX512, SHA/HWA, etc.) support, CML was never going to be a revolution in performance but again we see it beat CFL by ~50% similar to what we’ve seen in other benchmarks.

Intel themselves did not claim a big performance improvement – possibly as it makes CFL pretty much obsolete, but with slightly more cores and higher clocks/TDP CML can reach Ryzen 3900X levels of performance which is no mean feat. With ZEN3 looking to launch soon, this is not before time.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

For some it may be disappointing we do not have brand-new improved “IceLake” (ICL) now rather than a 3-rd revision “Skylake”, but “CometLake” (CML) does seem to improve even over the previous revisions (8/9th gen /”CofeeLake”CFL) due to modest increase in cores, base/Turbo clocks but perhaps also due to hardware-based vunerabilities mitigations which no longer require costly software versions.

Thus, somewhat surprisingly CML is able to trade blows with the 3900X and its 12C/24T that shows the heavily-revised “Skylake” core can still pack a punch against AMD’s latest and greatest. Naturally we would have preferred 12-cores not 10 but that would likely “eat” even further into Intel’s HEDT platform.

While owners of 8/9-th gen won’t be upgrading – it is very rare to recommend changing from one generation to another anyway – owners of older hardware can look forward to over 2x performance increase in most workloads for the same power draw, not to mention the additional features.

On the other hand, the competition (AMD Ryzen 3000 series) has more cores (12C and more) for great cost and performance – and still compatible with the old (with BIOS update) AM4 socket mainboards! With CML needing a new motherboard (LGA1200) and future “IceLake”-based CPUs possibly needing new motherboards again, CML is very much a stop-gap solution.

All in all Intel has managed to squeeze all it can from the old “Skylake” arch that while not revolutionary, still has enough to be competitive with current designs; while it goes out on a high, it is likely the end-of-the-road for this core.

In a word: Qualified Recommendation

Please see our other articles on: