big/Performance Core Performance Analysis (Intel 12th Gen Core AlderLake (i9-12900K))

What is “AlderLake”?

It is the “next-generation” (12th) Core architecture – replacing the short-lived “RocketLake” (RKL) that finally replaced the many, many “Skylake” (SKL) derivative architectures (6th-10th). It is the 1st mainstream “hybrid” arch – i.e. combining big/P(erformant) “Core” cores with LITTLE/E(fficient) “Atom” cores in a single package. While in the ARM world such SoC designs are quite common, this is quite new for x86 – thus operating systems (Windows) and applications may need to be updated.

For more details, please see our Intel 12th Gen Core AlderLake (i9-12900K) Review & Benchmarks – big/LITTLE Performance review. In this article we deal exclusively with the big/P “Core” cores.

big/P(erformance) “Core” core

  • Up to 8C/16T “Golden Cove” cores on the “Intel 7” process (formerly 10nm Enhanced SuperFin) – improved from “Willow Cove” in TGL – claimed +19% IPC uplift
  • Disabled AVX512! in order to match Atom cores (on consumer)
    • (Server versions support AVX512 and new extensions like AMX and FP16 data-format)
  • SMT support still included, 2 threads/core – thus 16 total
  • 6-wide decode (from 4-way until now) + many other front-end upgrades
  • L1I remains at 32kB but iTLB increased 2x (256 vs. 128)
  • L1D remains at 48kB but dTLB increased 50% (96 vs. 64)
  • L2 increased to 1.25MB per core (over 2x RKL’s 512kB) – server versions 2MB

The big news – beside the hybrid arch – is that AVX512 supported by desktop/mobile “Ice Lake” (ICL), “Tiger Lake” (TGL) and “Rocket Lake” (RKL) – is no longer enabled on “Alder Lake” big/P cores in order to match the Atom LITTLE/E cores. Future HEDT/server versions with presumably only big/P cores should support it just like Ice Lake-X (ICL-X).

Note: It seems that AVX512 can be enabled on big/P Cores (at least for now) on some mainboards that provide such a setting; naturally LITTLE/E Atom cores need to be disabled. We plan to test this ASAP.

In order to somewhat compensate – there are now AVX2 versions of AVX512 extensions:

  • VNNI/256 – (Vector Neural Network Instructions, DL Boost INT8/INT16) e.g. convolution
  • VAES/256 – (Vector AES) accelerating block-crypto
  • SHA HWA accelerating hashing (SHA1, SHA2-256 only)
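As a rough illustration of checking for these extensions, the minimal sketch below parses the `flags` line of Linux’s `/proc/cpuinfo`. It is not how Sandra detects features: the helper names (`parse_flags`, `report`) are hypothetical, and we assume the Linux kernel flag names `avx512f`, `avx_vnni`, `vaes` and `sha_ni`.

```python
# Hypothetical helper: report which SIMD/crypto extensions a Linux CPU advertises.
# Assumption: flag names follow the Linux kernel's /proc/cpuinfo naming.
WANTED = {
    "AVX2": "avx2",
    "FMA3": "fma",
    "AVX512F": "avx512f",
    "VNNI/256 (AVX-VNNI)": "avx_vnni",
    "VAES": "vaes",
    "SHA HWA": "sha_ni",
}


def parse_flags(cpuinfo_text):
    """Return the set of flag names from the first 'flags' line of the text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def report(cpuinfo_text):
    """Map each extension of interest to True/False."""
    flags = parse_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in WANTED.items()}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or file unavailable: everything reports "no"
    for name, present in report(text).items():
        print(f"{name:22s} {'yes' if present else 'no'}")
```

On a stock AlderLake desktop we would expect this to show AVX512F absent while AVX-VNNI, VAES and SHA are present, but that depends on the BIOS setting mentioned above.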

We saw in the “RocketLake” review (Intel 11th Gen Core RocketLake AVX512 Performance Improvement vs AVX2/FMA3) that AVX512 makes RKL almost 40% faster vs. AVX2/FMA3 – and despite its high power consumption – it made RKL competitive. Without it, RKL with 2 fewer cores than “Comet Lake” (CML) would not have improved sufficiently to be worth it.

At SiSoftware – with Sandra – we naturally adopted and supported AVX512 from the start (before SKL-X) and added support for the new extensions as they arrived in subsequent cores (VAES, IFMA52, VNNI, BF16, etc.), which makes its removal even more disappointing. While it is not a problem to add AVX2 versions (e.g. VAES, VNNI), their performance cannot be expected to match the original AVX512 versions.

Let’s note that AVX512 originally launched with the Atom-core powered “Phi” GP-GPU accelerators (“Knights Landing” KNL) – thus it would not have been impossible for Intel to add support to the new Atom core – and perhaps we shall see that in a future arch… when an additional compute performance uplift is required (i.e. to deal with AMD competition).

Only big/P Cores, LITTLE/E Atom disabled

Changes in Sandra to support Hybrid

Like Windows (and other operating systems), we have had to make extensive changes to detection, thread scheduling and benchmarks to support hybrid/big-LITTLE. Thankfully, this means we are not dependent on Windows support – you can confidently test AlderLake on older operating systems (e.g. Windows 10 or earlier – or Server 2022/2019/2016 or earlier) – although it is probably best to run the very latest operating systems for the best overall (outside benchmarking) computing experience.

  • Detection Changes
    • Detect big/P and LITTLE/E cores
    • Detect correct number of cores (and type), modules and threads per core -> topology
    • Detect correct cache sizes (L1D, L1I, L2) depending on core
    • Detect multipliers depending on core
  • Scheduling Changes
    • “All Threads (MT/MC)” (all cores + all threads) – e.g. 24T
    • “All Cores (MC aka big+LITTLE) Only” (both core types, no threads) – thus 16T
    • “All Threads big/P Cores Only” (only “Core” cores + their threads) – thus 16T
    • “big/P Cores Only” (only “Core” cores) – thus 8T
    • “LITTLE/E Cores Only” (only “Atom” cores) – thus 8T
    • “Single Thread big/P Core Only” (thus single “Core” core) – thus 1T
    • “Single Thread LITTLE/E Core Only” (thus single “Atom” core) – thus 1T
  • Benchmarking Changes
    • Dynamic/Asymmetric workload allocator – based on each thread’s compute power
      • Note some tests/algorithms are not well-suited for this (here P threads will finish and wait for E threads – thus effectively having only E threads). Different ways to test algorithm(s) will be needed.
    • Dynamic/Asymmetric buffer sizes – based on each thread’s L1D caches
      • Memory/Cache buffer testing using different block/buffer sizes for P/E threads
      • Algorithms (e.g. GEMM) using different block sizes for P/E threads
    • Best performance core/thread default selection – based on test type
      • Some tests/algorithms run best just using cores only (SMT threads would just add overhead)
      • Some tests/algorithms (streaming) run best just using big/P cores only (E cores just too slow and waste memory bandwidth)
      • Some tests/algorithms sharing data run best on same type of cores only (either big/P or LITTLE/E) (sharing between different types of cores incurs higher latencies and lower bandwidth)
    • Reporting the Compute Power Contribution of each thread
      • Thus the big/P and LITTLE/E cores contribution for each algorithm can be presented. In effect, this allows better optimisation of algorithms tested, e.g. detecting when either big/P or LITTLE/E cores are not efficiently used (e.g. overloaded)
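To illustrate the idea behind a dynamic/asymmetric workload allocator, here is a minimal sketch that splits a fixed amount of work in proportion to each thread’s compute power and compares the finish time with a naive equal split. It is not Sandra’s allocator: the function names and the 2:1 power ratio between P and E threads are purely hypothetical.

```python
def split_work(total_items, thread_power):
    """Split total_items across threads in proportion to their relative power."""
    total_power = sum(thread_power)
    exact = [total_items * p / total_power for p in thread_power]
    shares = [int(x) for x in exact]
    leftover = total_items - sum(shares)
    # Hand any remaining items to the threads with the largest fractional part.
    order = sorted(range(len(exact)), key=lambda i: exact[i] - shares[i],
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def finish_time(shares, thread_power):
    """Time until the slowest thread is done (items / items-per-time-unit)."""
    return max(s / p for s, p in zip(shares, thread_power))


# Hypothetical powers: 8 P threads at 2.0 units, 8 E threads at 1.0 unit.
power = [2.0] * 8 + [1.0] * 8
items = 2400

equal = [items // len(power)] * len(power)
proportional = split_work(items, power)

print("equal split finishes at       ", finish_time(equal, power))
print("proportional split finishes at", finish_time(proportional, power))
```

With the equal split the P threads finish early and wait for the E threads, which is the effect described in the note above; the proportional split lets all threads finish together.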

Given the above, developers can be forgiven for restricting their software to the big/Performance threads only and ignoring the LITTLE/Efficient threads altogether – at least for compute-heavy algorithms.
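The scheduling modes listed earlier can be sketched as filters over a description of the logical processors. The example below is only an illustration: the `LogicalCpu` type, the mode names and the enumeration order (P-core threads first, then E cores) are assumptions, not something the operating system or Sandra guarantees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalCpu:
    index: int         # OS logical processor number (assumed enumeration order)
    core: int          # physical core id
    kind: str          # "P" (big) or "E" (LITTLE)
    smt_sibling: bool  # True for the second hardware thread of a core


def example_topology(p_cores=8, e_cores=8):
    """Build a 12900K-like layout: P cores with 2 threads, E cores with 1."""
    cpus = []
    idx = 0
    for core in range(p_cores):
        for t in range(2):
            cpus.append(LogicalCpu(idx, core, "P", t == 1))
            idx += 1
    for core in range(p_cores, p_cores + e_cores):
        cpus.append(LogicalCpu(idx, core, "E", False))
        idx += 1
    return cpus


MODES = {
    "all_threads": lambda c: True,
    "all_cores": lambda c: not c.smt_sibling,
    "all_threads_big": lambda c: c.kind == "P",
    "big_cores": lambda c: c.kind == "P" and not c.smt_sibling,
    "little_cores": lambda c: c.kind == "E",
}


def cpu_set(topology, mode):
    """Return the set of logical processor numbers selected by a mode."""
    return {c.index for c in topology if MODES[mode](c)}


topo = example_topology()
for mode in MODES:
    print(f"{mode:16s} {len(cpu_set(topo, mode)):2d}T")
```

On Linux, such a set could then be passed to `os.sched_setaffinity(0, cpu_set(topo, "big_cores"))` to pin the current process, provided the real topology has been detected rather than assumed.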

For this reason we recommend using the very latest version of Sandra and keeping up with updated versions, which are likely to fix bugs and improve performance and stability.

CPU (big Core) Performance Benchmarking

In this article we test CPU big Core performance.

Hardware Specifications

We are comparing the mythical top-of-the-range Gen 12 Intel with competing architectures as well as competitors (AMD) with a view to upgrading to a top-of-the-range, high performance design.

| Specifications | Intel Core i9 12900K 8C+8c/24T (ADL) big+LITTLE | Intel Core i9 12900K 8C/16T (ADL) big ONLY – no AVX512 | Intel Core i9 11900K 8C/16T (RKL) – no AVX512 | AMD Ryzen 9 5900X 12C/24T (Zen3) | Comments |
|---|---|---|---|---|---|
| Arch(itecture) | Golden Cove + Gracemont / AlderLake | Golden Cove / AlderLake | Cypress Cove / RocketLake | Zen3 / Vermeer | The very latest arch |
| Cores (CU) / Threads (SP) | 8C+8c / 24T | 8C / 16T | 8C / 16T | 2M / 12C / 24T | 8 more LITTLE cores |
| Rated Speed (GHz) | 3.2 big / 2.4 LITTLE | 3.2 big | 3.5 | 3.7 | Base clock is a bit higher |
| All/Single Turbo Speed (GHz) | 5.0 – 5.2 big / 3.7 – 3.9 LITTLE | 5.0 – 5.2 big | 4.8 – 5.3 | 4.5 – 4.8 | Turbo is a bit lower |
| Rated/Turbo Power (W) | 125 – 250 | 125 – 250 | 125 – 228 | 105 – 135 | TDP is the same on paper. |
| L1D / L1I Caches | 8x 48kB/32kB + 8x 64kB/32kB | 8x 48kB/32kB | 8x 48kB 12-way / 8x 32kB 8-way | 12x 32kB 8-way / 12x 32kB 8-way | L1D is 50% larger. |
| L2 Caches | 8x 1.25MB + 2x 2MB (14MB) | 8x 1.25MB (10MB) | 8x 512kB 16-way (4MB) | 12x 512kB 16-way (6MB) | L2 has almost doubled |
| L3 Cache(s) | 30MB 16-way | 20MB 16-way | 16MB 16-way | 2x 32MB 16-way (64MB) | L3 is almost 2x larger |
| Microcode (Firmware) | 090672-0F [updated] | 090672-0F [updated] | 06A701-40 | 8F7100-1009 | Revisions just keep on coming. |
| Special Instruction Sets | VNNI/256, SHA, VAES/256 (AVX512 disabled) | VNNI/256, SHA, VAES/256 (AVX512 disabled) | SHA, VAES/256 (AVX512 disabled) | AVX2/FMA, SHA | Losing AVX512 |
| SIMD Width / Units | 256-bit | 256-bit (AVX512 disabled) | 256-bit (AVX512 disabled) | 256-bit | Same width SIMD units |
| Price / RRP (USD) | $599 | $599 | $539 | $549 | Price is a little higher. |

Disclaimer

This is an independent review (critical appraisal) that has not been endorsed nor sponsored by any entity (e.g. Intel, etc.). All trademarks acknowledged and used for identification only under fair use.

The review contains only public information, none of it provided under NDA or embargo. At publication time, the products have not been directly tested by SiSoftware but submitted to the public Benchmark Ranker; thus the accuracy of the benchmark scores cannot be verified, however, they appear consistent and pass current validation checks.

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets. “AlderLake” (ADL) does not support AVX512 – but it does support 256-bit versions of some original AVX512 extensions.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 11 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

| Native Benchmarks | Intel Core i9 12900K 8C+8c/24T (ADL) big+LITTLE | Intel Core i9 12900K 8C/16T (ADL) big ONLY | Intel Core i9 11900K 8C/16T (RKL) – no AVX512 | AMD Ryzen 9 5900X 12C/24T (Zen3) | Comments |
|---|---|---|---|---|---|
| Native Dhrystone Integer (GIPS) | 694 | 498 [72%] | 545 | 589 | ADL’s big Cores provide 72% of overall performance. |
| Native Dhrystone Long (GIPS) | 703 | 544 [77%] | 551 | 594 | A 64-bit integer workload increases ratio to 77%. |
| Native FP32 (Float) Whetstone (GFLOPS) | 496 | 334 [67%] | 285 | 388 | With floating-point, ADL’s big Cores provide only 67%. |
| Native FP64 (Double) Whetstone (GFLOPS) | 385 | 279 [72%] | 239 | 324 | With FP64 the ratio is back to 72%. |

To start, in legacy integer/floating-point code, we see the big/P Cores in ADL provide about 70% of overall performance; thus the LITTLE/E Atom cores contribute about 30%. This is just enough to beat both RKL and Zen3 competition.
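The bracketed percentages are simply the big-only score divided by the full (big+LITTLE) score. A minimal sketch of that arithmetic, using the Dhrystone/Whetstone figures from the table, is below; treating the remainder as the LITTLE/E contribution assumes the big/P Cores perform the same in both configurations, which is our simplification.

```python
# (full big+LITTLE score, big-only score) taken from the table above
scores = {
    "Dhrystone Integer (GIPS)": (694, 498),
    "Dhrystone Long (GIPS)": (703, 544),
    "Whetstone FP32 (GFLOPS)": (496, 334),
    "Whetstone FP64 (GFLOPS)": (385, 279),
}


def big_share(total, big_only):
    """Percentage of the hybrid score that the big-only run reaches."""
    return 100.0 * big_only / total


for name, (total, big) in scores.items():
    share = big_share(total, big)
    print(f"{name:26s} big/P {share:5.1f}%   LITTLE/E (remainder) {100 - share:5.1f}%")
```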

Without them, and thus just with the big/P Cores, ADL would be about 16% faster than RKL in the floating-point (Whetstone) tests – just a bit less than what Intel claims (19% uplift) – but no faster in the integer (Dhrystone) tests. So while the big/P Cores have been improved, the difference is perhaps not as high as expected considering all the changes (2 generations’ updates, process shrink, etc.)

Note that due to being “legacy” none of the benchmarks support AVX512; while we could update them, they are not vectorise-able in the “spirit they were written” – thus single-lane AVX512 cannot run faster than AVX2/SSEx.

| Native Benchmarks | Intel Core i9 12900K 8C+8c/24T (ADL) big+LITTLE | Intel Core i9 12900K 8C/16T (ADL) big ONLY | Intel Core i9 11900K 8C/16T (RKL) – no AVX512 | AMD Ryzen 9 5900X 12C/24T (Zen3) | Comments |
|---|---|---|---|---|---|
| Native Integer (Int32) Multi-Media (Mpix/s) | 1,699 | 1,223 [72%] | 1,100 | 2,000 | big Cores provide 72% of performance. |
| Native Long (Int64) Multi-Media (Mpix/s) | 695 | 522 [75%] | 504 | 805 | With a 64-bit, the ratio goes to 75%. |
| Native Quad-Int (Int128) Multi-Media (Mpix/s) | 131 | 98.86 [75%] | 96 | 157 | Using 64-bit int to emulate Int128 nothing changes. |
| Native Float/FP32 Multi-Media (Mpix/s) | 1,981 | 1,559 [79%] | 1,160 | 2,000 | In this floating-point vectorised test big Cores provide 79%. |
| Native Double/FP64 Multi-Media (Mpix/s) | 1,126 | 866 [79%] | 636 | 1,190 | Switching to FP64 nothing much changes. |
| Native Quad-Float/FP128 Multi-Media (Mpix/s) | 53.2 | 41.55 [78%] | 30.68 | 49.47 | Using FP64 to mantissa extend FP128 ratio falls to 78%. |

With heavily vectorised SIMD workloads – even without AVX512 support – ADL’s big/P Cores provide about 80% of overall performance (vs. 70% with legacy integer/floating-point code as we’ve seen before); thus the LITTLE/E Atom cores provide only about 20% of performance.

This does make sense: while the LITTLE/E Atom cores now support AVX2/FMA3, they cannot really match the SIMD units of the big/P Cores, thus despite the count being the same (8C vs. 8c) their contribution falls to just 20%. Still, 20% is not to be ignored and it does help ADL get much closer to its Zen3 competition, which it naturally cannot beat.

Against RKL – not using AVX512 – ADL’s big/P cores are up to about 35% faster in the floating-point tests (same number of Cores/threads), which is a pretty big improvement (2 generations’ worth). Naturally, RKL does have AVX512, which when used makes it about 40% faster (as we’ve seen in our article) – thus ADL would still have ended up slower than RKL with AVX512, which is a big loss.
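The per-test uplift of ADL’s big/P Cores over RKL can be recomputed from the Multi-Media table. The minimal sketch below does just that; the function name is our own, and the figures are the big-only and RKL columns as listed.

```python
# (ADL big-only score, RKL score) in Mpix/s, from the Multi-Media table above
multimedia = {
    "Int32": (1223, 1100),
    "Int64": (522, 504),
    "Int128": (98.86, 96),
    "FP32": (1559, 1160),
    "FP64": (866, 636),
    "FP128": (41.55, 30.68),
}


def uplift_pct(new, old):
    """Percentage by which `new` is faster than `old`."""
    return (new / old - 1.0) * 100.0


for name, (adl_big, rkl) in multimedia.items():
    print(f"{name:7s} ADL big-only vs RKL: {uplift_pct(adl_big, rkl):+5.1f}%")
```

By this arithmetic the floating-point rows come out at roughly a third faster, while the integer rows show a noticeably smaller gain.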

BenchCrypt Crypto AES-256 (GB/s) 31.84* 26.9* [85%] 20.81* 18.74 DDR5 memory rules here whatever core type.
BenchCrypt Crypto AES-128 (GB/s) 31.8* 20.82* 18.7 No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 33.08** 20.03** [61%] 13.67** 37** With compute, big Cores provide just 60%.
BenchCrypt Crypto SHA1 (GB/s) 27.43** 39.08** Less compute intensive SHA1.
BenchCrypt Crypto SHA2-512 (GB/s) 10.32 16.56 SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and these (streaming) tests show that using SMT threads is just not worth it. As both big/P Cores and LITTLE/E Atom cores support VAES/AVX and AES HWA (hardware acceleration) – their raw compute power does not matter as much, so the big/P Cores provide about 85% of overall performance, the highest ratio yet.

With compute hashing, SHA HWA (hardware acceleration) again helps both core types and here the big/P Cores (without AVX512) contribute just over 60% of overall performance. Here, the LITTLE/E Atom cores help best – showing that Atom cores are decent cryptographic processors thanks to both AES and SHA hardware acceleration.

We have seen in RKL testing that AVX512 hashing is faster than even SHA HWA due to multi-buffer processing, so here ADL would have performed even better with AVX512.

* using VAES (AVX512-VL or AVX2-VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2. [note multi-buffer AVX2 is slower than SHA hardware-acceleration]

AlderLake Inter-Thread/Core Latency Heatmap (ns)

| Native Benchmarks | Intel Core i9 12900K 8C+8c/24T (ADL) big+LITTLE | Intel Core i9 12900K 8C/16T (ADL) big ONLY | Intel Core i9 11900K 8C/16T (RKL) – no AVX512 | AMD Ryzen 9 5900X 12C/24T (Zen3) | Comments |
|---|---|---|---|---|---|
| Average Inter-Thread Latency (ns) | 38.5 | 29 [75%] | 28.5 | 45.1 | With just big/P Cores average latency is lower. |
| Inter-Thread Latency (Same Core) (ns) | 11 | 11 | 13.2 | 10 | Naturally no changes. |
| Inter-Core Latency (Same Module, Same Type) (ns) | 32.4 | 32.4 | 29.1 | 21.1 | Again no changes. |
| Inter-big-2-LITTLE-Core Latency (Same Module, Different Type) (ns) | 42.9 | | | | Between different core types latency is 32% higher. |
| Inter-Module (CCX) Latency (Same Package) (ns) | | | | 68.1 | Only Zen3 has different CCX/modules. |

With the LITTLE/E Atom cores disabled, the higher inter-LITTLE/E-core and big/P-to-LITTLE/E-core latencies no longer enter the average, which therefore drops to 75% of the hybrid figure (i.e. 25% lower).

All values pretty much match RKL thus we do not have any regressions. In effect it becomes a “standard” non-hybrid CPU with simple two-tier latencies between threads rather than the complex hybrid 4-tier latencies in the default configuration.
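To make the “two-tier” picture concrete, the sketch below averages the two measured tiers (same core, same module) over every pair of hardware threads in the big-only configuration. It is a naive model with equal weight for every pair, which is our assumption; it gives a figure slightly above the 29 ns reported, so the benchmark presumably weights or samples pairs differently.

```python
from itertools import combinations

SAME_CORE_NS = 11.0    # inter-thread latency, same core (from the table)
SAME_MODULE_NS = 32.4  # inter-core latency, same module and type (from the table)


def mean_pair_latency(thread_cores):
    """Average latency over all thread pairs; thread_cores holds one core id per thread."""
    pairs = list(combinations(thread_cores, 2))
    total = sum(SAME_CORE_NS if a == b else SAME_MODULE_NS for a, b in pairs)
    return total / len(pairs)


# big-only configuration: 8 cores, 2 threads per core
big_only = [core for core in range(8) for _ in range(2)]
print(f"naive two-tier average: {mean_pair_latency(big_only):.1f} ns")
```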

CPU Multi-Core Benchmark Total Inter-Thread Bandwidth – Best Pairing (GB/s) 111.9 87.48 147 ADL’s bandwidth is ~30% higher than RKL.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst Pairing (GB/s) 25.55 9 Waiting for data.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 379 542 Black-Scholes is un-vectorised and compute heavy.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 464 380 [82%] 332 449 Using FP64 big/P Cores provide 82%.
BenchFinance Binomial float/FP32 (kOPT/s) 81.83 228 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 155 100 [65%] 85.19 120 With FP64 code big/P Cores provide 65%
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 264 427 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 205 137 [67%] 117 182 Switching to FP64 big/P Cores provide 67%.
With non-SIMD financial workloads, similar to what we’ve seen in legacy floating-point code (Whetstone), ADL’s big/P Cores provide between 65-80% of overall performance dependent on the algorithm. Again, the LITTLE/E Atom cores provide just enough uplift to beat both RKL and more importantly the Zen3 competition.

Perhaps such code is better offloaded to GP-GPUs these days, but lots of financial software still does not use GP-GPUs even today. Thus with such code ADL performs very well, much better than RKL with its AVX512 unused.

BenchScience SGEMM (GFLOPS) float/FP32 405 927 In this tough vectorised algorithm ADL does well.
BenchScience DGEMM (GFLOPS) double/FP64 446 278 [62%] 186 393 With FP64 vectorised code, big/P Cores provide 62%.
BenchScience SFFT (GFLOPS) float/FP32 23.25 26.87 FFT is also heavily vectorised but memory dependent.
BenchScience DFFT (GFLOPS) double/FP64 28.72 24.11 [84%] 10.89 14.68 With FP64 code, big/P Cores provide 84%.
BenchScience SN-BODY (GFLOPS) float/FP32 605 802 N-Body simulation is vectorised but with more memory accesses.
BenchScience DN-BODY (GFLOPS) double/FP64 227 190 [84%] 184 317 With FP64 the big/P Cores provide 84%.
With highly vectorised SIMD code (scientific workloads), as we’ve seen in SIMD Vectorised processing, ADL’s big/P Cores provide up to about 84% of overall performance (62% in DGEMM); thus the LITTLE/E Atom cores can often only provide about 15%.

With RKL not using AVX512, ADL’s big/P Cores with the help of DDR5 are also very much faster which does show that SIMD unit performance has increased vs. RKL. Naturally, we’ve seen that AVX512 provides 40% uplift on RKL which is tough to match.

While it may be somewhat disappointing to see just 15% uplift from the LITTLE/E Atom cores, that is still a decent improvement and with further optimisations perhaps can be increased.

| Native Benchmarks | Intel Core i9 12900K 8C+8c/24T (ADL) big+LITTLE | Intel Core i9 12900K 8C/16T (ADL) big ONLY | Intel Core i9 11900K 8C/16T (RKL) – no AVX512 | AMD Ryzen 9 5900X 12C/24T (Zen3) | Comments |
|---|---|---|---|---|---|
| Blur (3×3) Filter (MPix/s) | 5,823 | 4,253 [73%] | 3,080 | 2,000 | In this vectorised integer workload big/P Cores provide 73%. |
| Sharpen (5×5) Filter (MPix/s) | 2,275 | 1,608 [71%] | 1,310 | 1,270 | Same algorithm but more shared data ratio is 71%. |
| Motion-Blur (7×7) Filter (MPix/s) | 1,117 | 798 [71%] | 641 | 861 | Again same algorithm but even more data shared 71% ratio. |
| Edge Detection (2*5×5) Sobel Filter (MPix/s) | 1,926 | 1,422 [74%] | 1,000 | 1,390 | Different algorithm but still vectorised big/P cores provide 74%. |
| Noise Removal (5×5) Median Filter (MPix/s) | 157 | 108 [69%] | 88.1 | 160 | Still vectorised code ratio falls to 69%. |
| Oil Painting Quantise Filter (MPix/s) | 79.78 | 62.6 [78%] | 52.9 | 52.88 | The big/P Cores contribute 78% here. |
| Diffusion Randomise (XorShift) Filter (MPix/s) | 6,082 | 5,153 [85%] | 4,210 | 1,480 | With integer workload, the ratio is 85%! |
| Marbling Perlin Noise 2D Filter (MPix/s) | 1,016 | 842 [83%] | 737 | 622 | In this final test again with integer workload ratio is unchanged. |

These SIMD benchmarks are not as compute heavy as the Scientific ones and here the big/P Cores provide about 70-85% of the overall performance; naturally, this means the LITTLE/E Atom cores provide about 15-30%.

What is clear is that the big/P ADL Cores have greatly improved vs. RKL’s Cores (without AVX512) and they are about 35% faster, a significant improvement (2 generations’ worth, process shrink, etc.). Naturally RKL has AVX512 which absolutely *loves* these benchmarks.

It is also clear that the LITTLE/E Atom cores do provide generous uplift which is again enough to beat the Zen3 competition. The scores are always better across all tests with the LITTLE/E Atom cores enabled.

Final Thoughts / Conclusions

Summary: Forward-looking but expensive upgrade (desktop)

ADL’s big/P Cores have improved decently over RKL – perhaps not a surprise as it is 2-generations over ICL/RKL and TGL had already shown decent improvements. The loss of AVX512 is painful in heavy compute SIMD algorithms and is missed.

Across most tests, the big/P Cores in ADL (8C/16T) account for ~70-85% of overall performance, thus the LITTLE/E Atom cores (8c/8T) provide about 15-30%. Perhaps overclockers may see better results with the LITTLE/E cores disabled, or power/thermally constrained systems may see better Turbo performance – but at stock, *all benchmarks* show better performance with the LITTLE/E Atom cores *enabled*.

  • In heavy-compute SIMD tests – without AVX512 on RKL thus using AVX2/FMA3 only – ADL’s big Cores are 10-30% faster than RKL’s. This is a decent improvement, but not enough to beat RKL with AVX512 nor AMD’s Zen3 (with 12C/24T).
  • In non-SIMD tests, we see ADL’s big Cores 10-15% faster than RKL – thus here the LITTLE Atom cores provide more uplift. Without them, ADL is not able to pull significantly away from RKL and match competition.
  • Streaming (bandwidth bound) tests benefit greatly from DDR5 bandwidth: just the big Cores provide 85% of ADL’s performance (8T) and are 30-45% faster than RKL with DDR4 (3200). But this requires expensive DDR5 memory.

Perhaps not unexpectedly, the LITTLE Atom cores do provide decent uplift (especially with non-SIMD/legacy code) that makes up for losing AVX512. Without them, ADL is not significantly faster than RKL. By handling background tasks, the LITTLE Atom cores can prevent the big Cores from being interrupted (literally) in their heavy compute work and thus let them perform better. With thousands (~2,000) of threads and hundreds (~200) of processes in Windows just looking at the desktop – this is no joke.

However, with the reduced/no LITTLE Atom core versions priced better – they likely represent better value, at least on the Desktop platform. On the mobile and perhaps SOHO server where power efficiency is at a premium, the LITTLE Atom cores should become more useful.

Long Summary: ADL is a revolutionary design that adds support for many new technologies (hybrid, DDR5, PCIe5, TB4, ThreadDirector, etc.) but the platform will be expensive at launch and requires a full upgrade (DDR5 memory, dedicated PCIe5 GP-GPU, TB4 devices, Windows 11 upgrade, upgraded apps/games/etc.). For mobile devices (laptops/tablets) it is likely to be a better upgrade.
