Windows 10 Arm64 WoW – x86 Emulation Performance

What is “WoW”?

“WoW” stands for “Windows-on-Windows” and is the compatibility/emulation layer that allows applications built for a different architecture to run on the same version of Windows. Originally (in the days of Windows NT) it allowed running 16-bit x86 legacy applications on 32-bit x86 Windows NT. Today, with 64-bit Windows, it allows 32-bit legacy applications to run. [Note that kernel drivers have always needed to be native; there has never been any emulation for drivers.]
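For reference, a process can ask at run time whether it is running under WoW and what the real host architecture is. A minimal sketch (not part of our test harness) using the documented IsWow64Process2 API, available from Windows 10 1709 onwards:

```cpp
// Minimal sketch: detect WoW emulation and the native host machine.
#include <windows.h>
#include <cstdio>

int main() {
    USHORT processMachine = IMAGE_FILE_MACHINE_UNKNOWN;
    USHORT nativeMachine  = IMAGE_FILE_MACHINE_UNKNOWN;

    if (IsWow64Process2(GetCurrentProcess(), &processMachine, &nativeMachine)) {
        // IMAGE_FILE_MACHINE_UNKNOWN for processMachine means "not running under WoW".
        std::printf("process machine: 0x%04x, native machine: 0x%04x\n",
                    processMachine, nativeMachine);

        if (processMachine == IMAGE_FILE_MACHINE_I386 &&
            nativeMachine  == IMAGE_FILE_MACHINE_ARM64)
            std::printf("32-bit x86 binary running emulated on Arm64 Windows\n");
    }
    return 0;
}
```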

In addition, Windows on other architectures – e.g. 64-bit IA64/Itanium – also allowed running 32-bit x86 applications, and in the same vein Arm64 Windows allows 32-bit x86 applications to run. [We also had 64-bit Alpha in the NT 4 days, but that is too far back.]

Similarly, other operating systems can run applications built for a different architecture – Apple’s macOS (formerly OS X) does so with “Rosetta”. Recently, Apple has dropped Intel in favour of ARM (after dropping PowerPC in favour of Intel) and has a similar compatibility layer for x86-to-Arm64 translation. Since they went “all the way”, it seems they have done a more complete job…

What application types can Arm64 Windows emulate under WoW?

Current Arm64 Windows 10/11 can run the following types under WoW emulation:

  • 32-bit ARM (ARMv7 AArch32) applications – YES
  • 32-bit x86 (IA32) applications – YES
  • 64-bit x64 (AMD64/EM64T) applications – NO (Windows 10)!
  • 64-bit IA64 (Itanium) applications – NO, “it’s dead Jim”!

To start, 32-bit ARM applications are supported – so if you happen to have a “Windows RT” ARM version of Microsoft Office, it will work! The few Metro/UWP ARM applications built for Windows RT – which we never bothered with – will also work. Yay!

However, 64-bit x64 applications are *not* currently supported; while it has taken more than a decade to port “dinosaur” applications (like Microsoft Office, aka T-Rex) to x64, pretty much all Windows applications today are 64-bit x64. With Windows 11 dropping support for 32-bit x86 processors, the only market left for 32-bit x86 applications is stalwarts still on Windows 7 x86 – and developers are not going to keep bothering to ship x86 builds.

Another thing to remember is that x86 applications are limited to 2GB of user address space by default (up to 4GB under 64-bit Windows/WoW if built with /LARGEADDRESSAWARE; AWE tricks for more physical memory are rarer still). For over a decade we have been used to 64-bit addressing and ever-increasing file and data sizes – dropping back to 32-bit can be a deal-breaker for some users.
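As a quick illustration (a hedged sketch, not Sandra code), a 32-bit process can query how much user-mode address space it actually gets – roughly 2GB by default, or close to 4GB under WoW when the executable is linked with /LARGEADDRESSAWARE:

```cpp
// Sketch: report the user-mode virtual address space visible to this process.
// A default 32-bit build reports ~2GB; a /LARGEADDRESSAWARE build under
// 64-bit Windows (including Arm64 WoW) reports ~4GB.
#include <windows.h>
#include <cstdio>

int main() {
    MEMORYSTATUSEX ms = { sizeof(ms) };   // dwLength must be set before the call
    if (GlobalMemoryStatusEx(&ms)) {
        std::printf("User-mode virtual address space: %.2f GB\n",
                    ms.ullTotalVirtual / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```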

As such, the lack of 64-bit x64 application emulation is a serious issue for Arm64 Windows adoption, and Microsoft needs to address it relatively quickly – especially as their own Arm64 Surface Pro X range is already a few years old.

Microsoft is also using its “own” SoC called “SQ” – a rebranded Qualcomm 8cx (4 big + 4 LITTLE cores) SoC as found in other competing tablets (Samsung, Lenovo). The number represents the 8cx generation (SQ1 – Gen 1, SQ2 – Gen 2), with the latter bringing faster clocks, updated graphics and generally double the memory (16GB LP-DDR4X vs. 8GB).

What is “Windows Arm64”?

It is the 64-bit version of “desktop” Windows 10/11 for AArch64 ARM devices – analogous to the current x64 Windows 10/11 for Intel & AMD CPUs. While “desktop” Windows 8.x was available for ARM (AArch32) as Windows RT, it did not allow running non-Microsoft native Win32 applications (like Sandra) and did not support emulation of current x86/x64 applications either.

We should also recall that Windows Phone (though now dead) always ran on ARM devices and was first to unify the Windows desktop and Windows CE kernels – so this is not a brand-new port. Windows was first ported to 64-bit on the long-dead Alpha (in the Windows NT 4 era) and then Itanium IA64 (Windows XP 64-bit), which showed the versatility of the NT kernel.

x86 CPU Emulated Features on Arm64 WoW

The WoW emulation layer on Arm64 Windows presents a Virtual x86 CPU with the following characteristics:

  • Type: Intel P4P-T/J (Prescott) Pentium 4E (!)
  • Clock: same as Arm64 host
  • Cores/Threads: same as Arm64 host (but can be restricted)
  • Page Sizes: 4kB, 2MB (thus large pages available)
  • L1D Cache: 16kB WT 4-way (not same as host – ouch!)
  • L1I Cache: 32kB WB 4-way (not same as host – ouch!)
  • L2 Cache: 512kB Adv 8-way (not same as host – ouch!)

WoW seems to be emulating a Pentium 4E in all respects, including cache sizes – which is pretty obsolete now; only very old 32-bit x86 applications would expect to run on a Pentium 4 today. Microsoft should really consider presenting at least a Core 2 Duo, if not a “Nehalem”/“Westmere” aka 1st-generation Core i3/i5/i7.
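The identity above is simply what the emulator returns via CPUID; a hedged sketch of reading the brand string (leaves 0x80000002-0x80000004) from inside the (x86-built) WoW process using the MSVC __cpuid intrinsic:

```cpp
// Sketch: read the (emulated) x86 brand string via CPUID leaves 0x80000002-4.
// Under Arm64 WoW this reports a Pentium 4-class part rather than the host CPU.
#include <intrin.h>
#include <cstdio>
#include <cstring>

int main() {
    char brand[49] = {};          // 3 leaves x 16 bytes + terminating NUL
    int  regs[4]   = {};

    for (int i = 0; i < 3; ++i) {
        __cpuid(regs, static_cast<int>(0x80000002u + i));
        std::memcpy(brand + i * 16, regs, sizeof(regs));
    }
    std::printf("CPUID brand string: %s\n", brand);
    return 0;
}
```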

The cache sizes not matching the host wreak havoc with optimised code – like ours – that creates buffers based on L1D/L2 cache sizes (e.g. GEMM, N-Body, neural networks CNN/RNN) – not to mention that cache bandwidth/latency measurements will be incorrect. Microsoft should really try to report the host cache sizes (L1, L2, L3), although for big/LITTLE hybrid (DynamIQ) CPUs this may be more difficult.
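To illustrate why this matters (an illustrative sketch under assumed sizes, not our actual kernels): a blocked GEMM typically sizes its tiles from the reported L1D, so a 16kB report on a 64kB host leaves most of the real cache unused:

```cpp
// Illustrative sketch: pick a square GEMM tile so ~3 float tiles fit in L1D.
// With the emulated 16kB figure the tile is ~32x32; the host's real 64kB
// L1D could hold ~72x72 - so the blocked kernel under-uses the actual cache.
#include <cmath>
#include <cstddef>

static std::size_t gemm_tile_from_l1d(std::size_t l1d_bytes) {
    const std::size_t floats_per_tile = l1d_bytes / (3 * sizeof(float));
    const std::size_t n = static_cast<std::size_t>(
        std::sqrt(static_cast<double>(floats_per_tile)));
    return n & ~static_cast<std::size_t>(7);   // round down to a multiple of 8
}

// gemm_tile_from_l1d(16 * 1024) == 32   (what the emulated CPU suggests)
// gemm_tile_from_l1d(64 * 1024) == 72   (what the Cortex-A76 host could hold)
```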

Here are the feature flags that this emulated x86 CPU presents (trimmed for brevity):

  • FPU – Integrated Co-Processor : Yes
  • TSC – Time Stamp Counter : Yes
  • CX8 – Compare and Exchange 8-bytes Instruction : Yes
  • CMOV – Conditional Move Instruction : Yes
  • MMX – Multi-Media eXtensions : Yes (!)
  • FXSR – Fast Float Save and Restore : Yes
  • SSE – Streaming SIMD Extensions : Yes
  • SSE2 – Streaming SIMD Extensions v2 : Yes
  • HTT – Hyper-Threading Technology : Yes
  • SSE3 – Streaming SIMD Extensions v3 : Yes
  • SSSE3 – Supplemental SSE3 : Yes
  • SSE4.1 – Streaming SIMD Extensions v4.1 : Yes
  • SSE4.2 – Streaming SIMD Extensions v4.2 : No (Windows 10)
  • POPCNT – Pop Count : Yes
  • AES – Accelerated Cryptography Support : No (Windows 10)
  • AVX – Advanced Vector eXtensions : No
  • AVX2 – Advanced Vector eXtensions v2 : No
  • FMA3 – Fused Multiply-Add: No
  • F16C – Half Precision Float Conversion : No
  • RNG – Random Number Generator : No

Thus we have the SSE2, (S)SSE3 and SSE4.1 SIMD 128-bit wide instruction sets (but not SSE4.2?) as well as compare/exchange and pop-count instructions. This makes sense, as Arm64 NEON SIMD is also 128-bit wide. We don’t have crypto AES, SHA or RNG hardware acceleration support [though note that the Pi 4B host does not have them either].
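x86 applications that ship multiple SIMD code paths pick one at run time from these flags; a minimal dispatch sketch (checking the standard CPUID leaf 1 ECX bits, not Sandra’s actual dispatcher) shows why the emulated CPU lands on the SSE4.1 path at best:

```cpp
// Sketch: run-time SIMD level detection via CPUID leaf 1, ECX feature bits.
// On the emulated CPU above, AVX (bit 28) and SSE4.2 (bit 20) are absent,
// so a dispatcher falls back to its SSE4.1 / SSE2 paths.
#include <intrin.h>

enum class SimdLevel { SSE2, SSE41, AVX };

static SimdLevel detect_simd_level() {
    int regs[4] = {};
    __cpuid(regs, 1);
    const int ecx = regs[2];

    if (ecx & (1 << 28)) return SimdLevel::AVX;    // AVX (a full check also tests OSXSAVE/XGETBV)
    if (ecx & (1 << 19)) return SimdLevel::SSE41;  // SSE4.1
    return SimdLevel::SSE2;                        // baseline for any P4-class x86
}
```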

Naturally, there is no AVX, FMA3 or AVX2 support; that will likely come in time, but not to Windows 10 – more likely Windows 11 only, on SVE-capable Arm64 CPUs.

In addition, a few extended (AMD) feature flags are also present:

  • XD/NX – No-execute Page Protection : Yes
  • RDTSCP – Serialised TSC : Yes
  • AMD64/EM64T Technology : No (Windows 10)
  • 3D Now! Prefetch Technology : Yes (!)

Thus while page protection support is included, naturally there is no 64-bit x64 support reported nor virtualization support – just in case you wanted to run a hypervisor on an emulated CPU 😉

All in all, this is pretty much the bare minimum of x86 support – we’re talking almost 20 years ago (the Pentium 4 “Prescott” launched in 2004!), not exactly what we’d hope for in 2022. Microsoft really needs to improve this if Arm64 Windows is to be a success – but perhaps they keep it this way to keep their long-time partner Intel happy by cutting (some) corners…

Here Apple’s “Rosetta” emulation is far more advanced – but Apple has gone all-in and completely replaced x86 with its own Arm64 (M1+) CPUs, thus it needs good x86 emulation for users to move to the new architecture. Microsoft has no intention of abandoning x86 just yet and is merely hedging its bets (after all, after the “death” of Windows Phone, the ARM team had to have something to do).

Changes in Sandra to support ARM

As a relatively old piece of software (launched all the way back in 1997 (!)), Sandra contains a good amount of legacy but optimised code, with the benchmarks generally written in assembler (MASM, NASM and previously YASM) for x86/x64 using various SIMD instruction sets: SSE2, SSSE3, SSE4.1/4.2, AVX/FMA3, AVX2 and finally AVX512. All this had to be translated into more generic C/C++ code using a templated intrinsics implementation covering both x86/x64 and ARM/Arm64.
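As a rough illustration of the approach (a hedged sketch, not Sandra’s actual templates), a thin wrapper can expose one vector type and a handful of operations, implemented with SSE intrinsics on x86/x64 and NEON intrinsics on ARM/Arm64, so the benchmark kernels compile unchanged for both:

```cpp
// Sketch: one 4 x FP32 vector abstraction, backed by SSE on x86/x64 and NEON on ARM.
#include <cstddef>

#if defined(_M_IX86) || defined(_M_X64)
  #include <emmintrin.h>
  using vec4f = __m128;
  static inline vec4f vload (const float* p)    { return _mm_loadu_ps(p); }
  static inline void  vstore(float* p, vec4f v) { _mm_storeu_ps(p, v); }
  static inline vec4f vadd  (vec4f a, vec4f b)  { return _mm_add_ps(a, b); }
  static inline vec4f vmul  (vec4f a, vec4f b)  { return _mm_mul_ps(a, b); }
  static inline vec4f vsplat(float x)           { return _mm_set1_ps(x); }
#elif defined(_M_ARM) || defined(_M_ARM64)
  #include <arm_neon.h>
  using vec4f = float32x4_t;
  static inline vec4f vload (const float* p)    { return vld1q_f32(p); }
  static inline void  vstore(float* p, vec4f v) { vst1q_f32(p, v); }
  static inline vec4f vadd  (vec4f a, vec4f b)  { return vaddq_f32(a, b); }
  static inline vec4f vmul  (vec4f a, vec4f b)  { return vmulq_f32(a, b); }
  static inline vec4f vsplat(float x)           { return vdupq_n_f32(x); }
#endif

// The same generic kernel then builds natively for either architecture:
// dst[i] = a[i] * k + b[i]   (n assumed to be a multiple of 4 for brevity)
static void scale_add(float* dst, const float* a, const float* b,
                      float k, std::size_t n) {
    const vec4f vk = vsplat(k);
    for (std::size_t i = 0; i < n; i += 4)
        vstore(dst + i, vadd(vmul(vload(a + i), vk), vload(b + i)));
}
```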

As a result, some of the historic benchmarks in Sandra have substantially changed – with the scores changing to some extent. This cannot be helped as it ensures benchmarking fairness between x86/x64 and ARM/Arm64 going forward.

For this reason we recommend using the very latest version of Sandra and keeping up with updated versions, which fix bugs and improve performance and stability.

CPU Performance Benchmarking

In this article we test CPU core performance.

Hardware Specifications

We are comparing the Microsoft/Qualcomm SQ2 (and the Arm64-based Raspberry Pi 4B) with Intel ULV x64 processors of similar vintage – all running current Windows 10 with the latest drivers.

Specifications | Microsoft SQ2 | Raspberry Pi 4B | Intel Core i5-6300 | Intel Core i5-8265 | Comments
Arch(itecture) | Kryo 495 Gold (Cortex-A76) + Kryo 495 Silver (Cortex-A55), 7nm | Cortex-A72, 16nm | Skylake-ULV (Gen 6) | WhiskeyLake-ULV (Gen 8) | big+LITTLE cores vs. SMT
Launch Date | Q3 2020 | 2019 | Q3 2015 | Q3 2018 | Much newer design
Cores (CU) / Threads (SP) | 4C + 4c (8T) | 4C / 4T | 2C / 4T | 4C / 8T | Same number of threads
Rated Speed (GHz) | 1.8 | 1.5 | 2.5 | 1.5 | Similar base clocks
All/Single Turbo Speed (GHz) | 3.14 | 2.0 | 3.0 | 3.9 | Turbo could be higher
Rated/Turbo Power (W) | 5-7 | 4-6 | 15-25 | 15-25 | Much lower TDP, ~1/3 of Intel
L1D / L1I Caches | 4x 64kB / 4x 64kB | 4x 32kB 2-way / 4x 48kB 3-way | 2x 32kB / 2x 32kB | 4x 32kB / 4x 32kB | Similar L1 caches
L2 Caches | 4x 512kB + 4x 128kB | 1MB 16-way | 2x 256kB | 4x 256kB | L2 is 2x larger on the big cores
L3 Cache(s) | 4MB | n/a | 3MB | 4MB | Same size L3
Microcode (Firmware) | | | | | Updates keep on coming
Special Instruction Sets | v8.2-A, VFP4, AES, SHA, TZ, NEON | v8-A, VFP4, TZ, NEON | AVX2/FMA, AES, VT-d/x | AVX2/FMA, AES, VT-d/x | ARMv8.2-A instructions on the SQ2
SIMD Width / Units | 128-bit | 128-bit | 256-bit | 256-bit | NEON is not wide enough
Price / RRP (USD) | $146 | $55 (whole SBC) | $281 | $297 | Pi price is for the whole SBC (including memory)!

Disclaimer

This is an independent article that has not been endorsed nor sponsored by any entity (e.g. Microsoft, Qualcomm, Intel, etc.). All trademarks acknowledged and used for identification only under fair use.

The article contains only public information (available elsewhere on the Internet) and not provided under NDA nor embargoed.

And please, don’t forget small ISVs like ourselves in these very challenging times. Please buy a copy of Sandra if you find our software useful. Your custom means everything to us!

Native Performance

We are testing native and emulated arithmetic, SIMD and cryptography performance using the highest performing instruction sets, on both x64 and Arm64 platforms.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64/Arm64, latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations where supported.
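For reference, this is roughly how a benchmark buffer ends up on 2MB pages on Windows (a hedged sketch; the account running it needs the “Lock pages in memory” / SeLockMemoryPrivilege right enabled):

```cpp
// Sketch: allocate a buffer backed by large (typically 2MB) pages.
// Requires SeLockMemoryPrivilege to be held and enabled by the process.
#include <windows.h>
#include <cstdio>

int main() {
    const SIZE_T largePage = GetLargePageMinimum();        // 0 if unsupported
    if (largePage == 0) { std::printf("large pages not supported\n"); return 1; }

    const SIZE_T size = 16 * largePage;                    // must be a multiple of the minimum
    void* buffer = VirtualAlloc(nullptr, size,
                                MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                                PAGE_READWRITE);
    if (!buffer) { std::printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }

    // ... run the benchmark against 'buffer' ...

    VirtualFree(buffer, 0, MEM_RELEASE);
    return 0;
}
```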

CPU Arithmetic Benchmark | Raspberry Pi 4B (4C 2GHz) Arm64 Native | Raspberry Pi 4B (4C 2GHz) x86 Emulated | Microsoft SQ2 (4C 3.15GHz + 4c 1.8GHz) Arm64 Native | Microsoft SQ2 (4C 3.15GHz + 4c 1.8GHz) x86 Emulated | Comments
Native Dhrystone Integer (GIPS) | 32.6 | 17.61* [-46%] | 100 | 54* | x86 emulation is about 46% slower.
Native Dhrystone Long (GIPS) | 33.42 | 15* [-56%] | 102 | 45.8* | With a 64-bit integer workload, 56% slower.
Native FP32 (Float) Whetstone (GFLOPS) | 25.1 | 12.54* [-51%] | 87.86 | 43.9* | With floating-point, emulation is ~50% slower.
Native FP64 (Double) Whetstone (GFLOPS) | 23.76 | 10.77* [-55%] | 79.25 | 35.9* | With FP64, emulation is 55% slower.
In these standard legacy tests, both SQ and Pi4 are about 50% slower when emulating x86 code vs. native Arm64 code. Both integer and floating-point (scalar) workloads show similar performance degradation.

Microsoft’s SQ is powerful enough to still be usable due to its 8 cores (4C + 4c) but in effect you are down to 4 cores (2C + 2c) which puts it into Intel SKL (Skylake) 2C / 4T performance territory rather than Intel WHL (WhiskeyLake) 4C / 8T…

Native code is still the best and what you need to be running especially on low-powered devices; emulation does not yet cut it.

Note*: using SSE2-3 SIMD processing.

BenchCpuMM | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
Native Integer (Int32) Multi-Media (Mpix/s) | 36.7* | 17.55 [-53%] | 95.12* | 50.87 | x86 emulation is 53% slower.
Native Long (Int64) Multi-Media (Mpix/s) | 24.88* | 8.16** [-68%] | 60.99* | 24.74** | With 64-bit integers, we are 68% slower!
Native Quad-Int (Int128) Multi-Media (Mpix/s) | 3.3* | 2.41** [-27%] | 10.85* | 6.99** | Using 64-bit ints to emulate Int128, we are 27% slower.
Native Float/FP32 Multi-Media (Mpix/s) | 52.7* | 34.54** [-35%] | 187* | 81.38** | In this vectorised FP32 test, emulation is 35% slower.
Native Double/FP64 Multi-Media (Mpix/s) | 29.73* | 17.66** [-41%] | 106* | 42.74** | Switching to FP64, 41% slower.
Native Quad-Float/FP128 Multi-Media (Mpix/s) | 1.31* | 1.27** [-4%] | 4.84* | 4.29** | Using FP64 to mantissa-extend FP128, only 4% slower.
With heavily vectorised SIMD workloads, the emulation degradation is more variable – 27-68% lower – but on average we are about 50% lower, as we’ve seen before. Again, powerful multi-core Arm64 CPUs can manage this performance drop but low-end devices (like the Pi 4B) struggle a bit more.

It will be interesting to see whether Microsoft will emulate AVX2/FMA3 on SVE/SVE2-capable Arm64 devices or just keep things as they are and rely on applications to bridge the emulation gap.

Note*: using SSE2-4 128-bit SIMD processing.

Note**: using NEON 128-bit SIMD processing.

BenchCrypt | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
Crypto AES-256 (GB/s) | 0.333 | 0.147 [-56%] | 0.79 | 6.8* | We are adding AES HWA to the native Arm64 code-path (see below).
Crypto AES-128 (GB/s) | 0.451 | 0.205 [-55%] | 0.96 | | No change with AES128: 55% slower.
Crypto SHA2-256 (GB/s) | 0.599** | 0.267 [-56%] | 1.69 | 0.68 | We need SHA acceleration here.
Crypto SHA1 (GB/s) | 0.996** | 0.536 [-47%] | 3.1 | | The less compute-intensive SHA1 is 47% slower.
Across both AES and SHA without hardware acceleration (HWA) we see the same 47-56% performance degradation we’ve seen before.

One interesting data point is that WoW exposes x86 AES hardware acceleration on the SQ2, which increases the emulated score by roughly 10x (ten times!). Once we add AES hardware acceleration support to Arm64 Sandra we should see even better native numbers.

Note*: using AES HWA (hardware acceleration).

Note**: using NEON multi-buffer hashing.
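For context (a hedged sketch, not Sandra’s crypto code): the emulated x86 path checks the AES feature bit and issues AES-NI rounds, which the WoW layer on the SQ2 appears to map onto the host’s hardware crypto – hence the jump above:

```cpp
// Sketch: detect AES-NI (CPUID leaf 1, ECX bit 25) and perform one AES
// encryption round. Under WoW on the SQ2 this emulated instruction appears
// to be hardware-backed, which would explain the ~10x uplift noted above.
#include <intrin.h>
#include <wmmintrin.h>

static bool has_aesni() {
    int regs[4] = {};
    __cpuid(regs, 1);
    return (regs[2] & (1 << 25)) != 0;
}

static __m128i aes_encrypt_round(__m128i state, __m128i round_key) {
    return _mm_aesenc_si128(state, round_key);   // one full AES encryption round
}
```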

CPU Multi-Core Benchmark | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
Inter-Module (CCX) Latency (Same Package) (ns) | 74.1 | 79.9 [+7%] | 90 | 112 | Latencies rise 7% (Pi) to ~24% (SQ2) vs. native code.
With big and LITTLE core clusters, the inter-core latencies vary greatly between big-to-big, LITTLE-to-LITTLE and big-to-LITTLE pairs. Here we present the overall average across all core pairings. Judicious thread scheduling is needed so that data transfers are efficient.
Total Inter-Thread Bandwidth – Best Pairing (GB/s) | 3** | 1* [1/3x] | 15** | 4.37* | Bandwidth drops to about 1/3 of native.
As with latencies, inter-core bandwidth varies greatly between big-to-big, LITTLE-to-LITTLE and big-to-LITTLE pairs; here we present the aggregate bandwidth between all cores. Judicious thread scheduling is needed so that data transfers are efficient.
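A minimal sketch of the kind of pinning involved (assumed core numbering; which indices map to big vs. LITTLE cores is up to the OS/SoC):

```cpp
// Sketch: pin the current thread to one logical core so that a specific
// big-to-big or LITTLE-to-LITTLE pairing is actually what gets measured.
#include <windows.h>

static bool pin_current_thread_to_core(unsigned coreIndex) {
    const DWORD_PTR mask = static_cast<DWORD_PTR>(1) << coreIndex;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
```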

Note:* using SSE2 128-bit wide transfers.

Note**: using NEON 128-bit wide transfers.

BenchFinance | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
Black-Scholes float/FP32 (MOPT/s) | 24.68 | 11.41 [-54%] | 101 | | Black-Scholes is un-vectorised and compute-heavy.
Black-Scholes double/FP64 (MOPT/s) | 20.53 | 10.88 [-47%] | 82.42 | 33.43 | Using FP64, we see a ~50% drop vs. native.
Binomial float/FP32 (kOPT/s) | 12.8 | 10.63 [-17%] | 40.22 | | Binomial uses thread-shared data, thus stressing the cache & memory system.
Binomial double/FP64 (kOPT/s) | 5.57 | 5.52 [-1%] | 19.73 | 17.63 | With FP64, we are tied!
Monte-Carlo float/FP32 (kOPT/s) | 8.53 | 9 | 36.84 | | Monte-Carlo also uses thread-shared data, but read-only, reducing modify pressure on the caches.
Monte-Carlo double/FP64 (kOPT/s) | 3.48 | 4.37 [+25%] | 18.95 | 14.35 | Switching to FP64, we’re actually 25% faster!
With non-SIMD financial workloads, emulated performance holds up far better – one test shows the usual ~50% drop but the others are close to tied or even faster (!). Here a powerful device like the SQ2 can mitigate the emulation drop and not end up slower than x86 Intel designs.
BenchScience | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
SGEMM (GFLOPS) float/FP32 | 18.77* | 11.36** [-40%] | 57.38* | | In this tough vectorised algorithm we see a 40% drop.
DGEMM (GFLOPS) double/FP64 | 5.87* | 5.19** [-12%] | 19.5* | 16.77** | With FP64 vectorised code, we’re only 12% slower.
SFFT (GFLOPS) float/FP32 | 0.657* | 1.07** [+62%] | 2.01* | | FFT is also heavily vectorised but memory-dependent.
DFFT (GFLOPS) double/FP64 | 0.505* | 0.601** [+19%] | 1.41* | 1.56** | With FP64 code, we’re 19% faster!
SN-BODY (GFLOPS) float/FP32 | 16.49* | 7.89** [-53%] | 79.61* | | N-Body simulation is vectorised but with more memory accesses.
DN-BODY (GFLOPS) double/FP64 | 5.83* | 2.72** [-54%] | 27.36* | 19.24** | With FP64 we’re 54% slower.
With highly vectorised SIMD code (scientific workloads), we still see the usual ~50% drop in performance – except FFT (which is very memory-latency bound), where emulation ends up 20-60% faster. Clearly the emulator is doing a much better job there than our native code!

Note*: using SSE2-4 128-bit SIMD.

Note**: using NEON 128-bit SIMD.

CPU Image Processing | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
Blur (3×3) Filter (MPix/s) | 165** | 76.53* [-54%] | 451** | 241* | In this vectorised integer workload we’re 54% slower.
Sharpen (5×5) Filter (MPix/s) | 56.81** | 30.66* [-47%] | 195** | 105* | Same algorithm but more shared data: 47% slower.
Motion-Blur (7×7) Filter (MPix/s) | 28.08** | 16.27* [-43%] | 109** | 59.84* | Again the same algorithm but even more data: 43% slower.
Edge Detection (2*5×5) Sobel Filter (MPix/s) | 39.71** | 20.63* [-49%] | 156** | 74.03* | Different algorithm but still vectorised: 49% slower.
Noise Removal (5×5) Median Filter (MPix/s) | 3** | 2.12* [-30%] | 9.43** | 6.54* | Still vectorised code: 30% slower under emulation.
Oil Painting Quantise Filter (MPix/s) | 3.57** | 2.25* [-36%] | 9.31** | 6.44* | In this tough filter, we’re 36% slower.
Diffusion Randomise (XorShift) Filter (MPix/s) | 174** | 126* [-28%] | 583** | 423* | With a 64-bit integer workload, we’re 28% slower.
Marbling Perlin Noise 2D Filter (MPix/s) | 35.26** | 21.19* [-40%] | 108** | 65.88* | In this final test (scatter/gather), we’re 40% slower.
We know these benchmarks *love* SIMD, with wider SIMD (e.g. AVX512, AVX2) performing strongly – but we still see the same 30-60% performance degradation for x86 emulation.

Again, as with other compute-heavy algorithms – these days such algorithms would be offloaded to the Cloud or locally to the GP-GPU – thus the CPU does not need to be a SIMD speed demon.

Note*: using SSE2-4 128-bit SIMD.

Note**: using NEON 128-bit SIMD.

Overall | Pi 4B Arm64 Native | Pi 4B x86 Emulated | SQ2 Arm64 Native | SQ2 x86 Emulated | Comments
Aggregate Score (Points) | 360 | 190 [-48%] | 1,846 | 900 | Across all benchmarks, emulation is 48% slower.
Not a surprise – we see x86 emulation about 50% slower across all benchmarks, effectively cutting performance in half. However, as native Arm64 optimisations improve, the ratio is likely to get worse unless WoW emulation is also improved.
Price / RRP (USD) | $55 (whole SBC) | $55 (whole SBC) | $146 | $146 | Price does not change.
Price Efficiency (Perf. vs. Cost) (Points/USD) | 6.55 | 3.45 [-48%] | 12.64 | 6.16 | Naturally efficiency drops by 48%.
Unsurprisingly, value for money (price efficiency) drops by the same ~48%. This does matter, as it puts the SQ2 right into Intel x86 territory (6.26 – 8.48 Points/USD) – in effect losing the price-efficiency advantage over the competition. This may also translate into higher software cost, e.g. getting the native Arm64 version of software may require upgrading to a newer version at extra cost…
Power / TDP (W) | 4W | 4W | 5W | 5W | Power does not change.
Power Efficiency (Perf. vs. Power) (Points/W) | 90 | 47.50 [-48%] | 369 | 180 | Naturally efficiency drops by the same 48%.
While Arm64 power usage is low, cutting performance in half due to x86 emulation means the power-efficiency advantage over native Intel x86 designs is no longer as large. Arm64 still has an edge, but it is only about 50% now rather than the “crushing” 2.5x better seen with native code.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

x86 Emulation on Arm64: Needs Improvement: 6/10

Unfortunately, x86/x64 emulation on Windows 10 Arm64 needs some improvement before most users can move to Arm64 versions of Windows. Primarily, the lack of x64 (64-bit) emulation can be a show-stopper today (in 2022), with most modern applications either no longer shipping a 32-bit x86 version or likely to stop soon. [We at SiSoftware will be stopping soon too.]

Yes, you can run your old 32-bit ARM applications if you have any…

There is also a performance degradation of about 50% (x86 emulated vs. Arm64 native code), which could be improved but is manageable. With Arm64 designs fielding a relatively large number of “real” cores vs. x86 (e.g. 4C + 4c vs. 4C / 8T as standard), there is some compute headroom to absorb the losses.

The hybrid “big.LITTLE” (DynamIQ) technology that ARM has deployed for many years – and Intel has recently adopted with “AlderLake” (ADL) – also provides more efficient power utilisation for low-compute workloads, I/O, etc., further reducing energy usage. For tablets or netbooks (aka small laptops) this is an important metric.

With the good performance of Microsoft’s SQ SoC (and similar Qualcomm 7c, 8cx, etc.) and fantastic power efficiency (typically 5W TDP) vs. the current Intel ULV offerings (15W TDP, 25W Turbo), Arm64 emulation of x86 can even be more power-efficient than native execution on Intel! [Internally, all modern Intel CPUs translate x86 into native micro-ops anyway, so they are not exactly native x86 anymore.]

The performance/price efficiency (which depends on the whole package, not just the SoC) is no longer (much) better under emulation and becomes comparable to native Intel. So if you will mostly be running emulated x86 rather than native Arm64 software, you may be better off with Intel… Note that getting Arm64 versions of software may mean upgrading/updating at a cost – not all ISVs will provide free updates, if any at all.

In summary, the Windows 10 emulation is lacking – and it has been improved in Windows 11 – but this should not stop you from trying a new Arm64 tablet/laptop. Hopefully we will now see good-value but well-spec’d tablets from other OEMs (Dell, HP, etc.), perhaps better spec’d than the existing ones (e.g. Samsung Galaxy Book, Lenovo). We are eagerly waiting to see…

x86 Emulation on Arm64: Needs Improvement: 6/10


