Intel Core i7 8700K CofeeLake Review & Benchmarks – CPU 6-core/12-thread Performance

What is “CofeeLake” CFL?

The 8th generation Intel Core architecture is code-named “CofeeLake” (CFL): unlike previous architectures, it is a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). The server/workstation (SKL-X/KBL-X) CPU core saw new instruction set support (AVX512) as well as other improvements – these have not made the transition yet.

Possibly due limited competition (before AMD Ryzen launch), process issues (still at 14nm) and the disclosure of a whole host of hardware vulnerabilities (Spectre, Meltdown, etc.) which required microcode (firmware) updates – performance improvements have not been forthcoming. This is pretty much unprecedented – while some Core updates were only evolutionary we have not had complete stagnation before; in addition the built-in GPU core has also remained pretty much stagnant – we will investigate this in a subsequent article.

However, CFL does bring up a major change – and that is increased core counts both on desktop and mobile: on desktop we go from 4 to 6 cores (+50%) while on mobile (ULV) we go from 2 to 4 (+100%) within the same TDP envelope!

While this article is a bit late in the day considering the 8700K launched last year – we have now also reviewed the brand-new CofeeLake-R (Refresh) Core i9-9900K and it seems a good time to see what has changed performance-wise for the previous top-of-the-range CPU.

In this article we test CPU Core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Gen 8 Core i7 (8700K) with previous generation (6700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i7-8700K CofeeLake
AMD Ryzen2 2700X Pinnacle Ridge
Intel i9-7900X SkyLake-X
Intel i7-6700K SkyLake
Comments
Cores (CU) / Threads (SP) 6C/12T 8C / 16T 10C / 20T 4C / 8T We have 50% more cores compared to SKL/KBL but still not as much as Ryzen/2 with 8 cores.
Speed (Min / Max / Turbo) 0.8-3.7-4.7GHz (8x-37x-47x) 2.2-3.7-4.2GHz (22x-37x-42x) 1.2-3.3-4.3 (12x-33x-43x) 0.8-4.0-4.2GHz (8x-40x-42x) Single-core Turbo has increased close to 5GHz (reserved for Special Edition 8086K) way above SKL/KBL and Ryzen.
Power (TDP) 95W (131) 105W (135) 140W (308) 91W (100) TDP has only increased by 4% and is still below Ryzen though Turbo is comparable.
L1D / L1I Caches 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way No change in L1 caches. Just more of them.
L2 Caches 6x 256kB 8-way 8x 512kB 8-way 10x 1MB 8-way 4x 256kB 8-way No change in L2 caches. Just more of them.
L3 Caches 12MB 16-way 2x 8MB 16-way 13.75MB 11-way 8MB 16-way L3 has also increased by 50% in line with cores, but still below Ryzen’s 16MB.
Microcode/Firmware MU069E0A-96 MU8F0802-04 MU065504-49 MU065E03-C2 We have a new model and somewhat newer microcode.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i7-8700K CofeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Intel i7-6700K SkyLake Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 291 [-13%] 334 485 190 In the old Drystone integer workload, CFL is still 13% slower than Ryzen 2 depite the huge lead over SKL.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 296 [-12%] 335 485 192 With a 64-bit integer workload – nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 170 [-14%] 198 262 105 Switching to floating-point, CFL is still 14% slower in the old Whetstone also a micro-benchmark.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 143 [-15%] 169 223 89 With FP64 nothing much changes.
From integer workloads in Dhyrstone to floating-point workloads in Whestone, CFL is still 12-15% slower than Ryzen 2 with its 2 more cores (8 vs. 2), but much faster than the old SKL with 4 cores. We begin to see now why Intel is adding more cores in CFL-R.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 741 [+29%] 574 1590 (AVX512) 474 In this vectorised AVX2 integer test we see CFL beating Ryzen by ~30% despite less cores.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 305 [+63%] 187 581 (AVX512) 194 With a 64-bit AVX2 integer vectorised workload, CFL is now 63% faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 4.9 [-16%] 5.8 7.6 3 This is a tough test using Long integers to emulate Int128 without SIMD: Ryzen 2 thus wins this one with CFL slower by 16%.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 678 [+14%] 596 1760 (AVX512) 446 In this floating-point AVX/FMA vectorised test, CFL is again 14% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 402 [+20%] 335 533 (AVX512) 268 Switching to FP64 SIMD code, CFL is again 20% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 16.7 [+7%] 15.6 40.3 (AVX512) 11 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – CFL is just 7% faster but does win.
In vectorised SIMD code we see the power of Intel’s SIMD units that can execute 256-bit instructions in one go; CFL soundly beats Ryzen2 despite fewer cores (7-60%). SKL-X shows that AVX512 brings further gains and is a pity CFL still does not support them.
BenchCrypt Crypto AES-256 (GB/s) 17.8 [+11%] 16.1 23 15 With AES HWA support all CPUs are memory bandwidth bound; unfortunately Ryzen 2 is at 2667 vs CFL/SKL-X at 3200 which means CFL is 11% faster.
BenchCrypt Crypto AES-128 (GB/s) 17.8 [+11%] 16.1 23 15 What we saw with AES-256 just repeats with AES-128.
BenchCrypt Crypto SHA2-256 (GB/s) 9 [-51%] 18.6 26 (AVX512) 5.9 With SHA HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust; CFL is thus 50% slower.
BenchCrypt Crypto SHA1 (GB/s) 17.3 [-9%] 19.3 38 (AVX512) 11.2 Ryzen also accelerates the soon-to-be-defunct SHA1 but the algorithm is less compute heavy thus CFL is only 9% slower.
BenchCrypt Crypto SHA2-512 (GB/s) 6.65 [+77%] 3.77 21 (AVX512) 4.4 SHA2-512 is not accelerated by SHA HWA, allowing CFL to use its SIMD units and be 77% faster.
AES HWA is memory bound and here CFL comfortably works with 3200Mt/s memory thus is faster than Ryzen 2 with 2667Mt/s memory (our sample); likely both would score similarly at 3200Mt/s. Ryzen 2 SHA HWA allows it to easily beat all other CPUs but only in SHA1/SHA256 – in others the CFL SIMD units win the day.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 207 [-19%] 257 309 128 In this non-vectorised test CFL cannot match Ryzen 2 and is ~20% slower.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 180 [-18%] 219 277 113 Switching to FP64 code, nothing much changes, Ryzen 2 is still faster.
BenchFinance Binomial float/FP32 (kOPT/s) 47 [-56%] 107 70.5 29.3 Binomial uses thread shared data thus stresses the cache & memory system; Ryzen 2 does very well here with CFL almost 60% slower.
BenchFinance Binomial double/FP64 (kOPT/s) 44.2 [-27%] 60.6 68 27.3 With FP64 code Ryzen2’s lead diminishes, CFL is “only” 27% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 41.6 [-23%] 54.2 63 25.7 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen 2 also wins this one, CFL is 23% slower.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 32.9 [-20%] 41 50.5 20.3 Switching to FP64 nothing much changes, CFL is still 20% slower.
Without SIMD support, CFL loses to Ryzen 2 as we saw with Dhrystone/Whetstone – between 20 and 50%. As we noted before, Intel will still need to add more cores in order to beat Ryzen 2. Still big improvement over the old SKL/KBL as expected.
BenchScience SGEMM (GFLOPS) float/FP32 385 [+28%] 300 413 (AVX512) 268 In this tough vectorised AVX2/FMA algorithm CFL is ~30% faster.
BenchScience DGEMM (GFLOPS) double/FP64 135 [+13%] 119 212 (AVX512) 130 With FP64 vectorised code, CFL’s lead reduces to 13% over Ryzen 2.
BenchScience SFFT (GFLOPS) float/FP32 24 [167%] 9 28.6 (AVX512) 16.1 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; here CFL is over 2.5x faster
BenchScience DFFT (GFLOPS) double/FP64 11.9 [+51%] 7.92 14.6 (AVX512) 7.2 With FP64 code, CFL’s lead reduces to ~50%.
BenchScience SNBODY (GFLOPS) float/FP32 411 [+47%] 280 638 (AVX512) 271 N-Body simulation is vectorised but many memory accesses to shared data but CFL remains ~50% faster.
BenchScience DNBODY (GFLOPS) double/FP64 127 [+13%] 113 195 (AVX512) 79 With FP64 code CFL’s lead reduces to 13% over Ryzen 2.
With highly vectorised SIMD code CFL performs well soundly beating Ryzen 2 between 13-167% as well as significantly improving over older SKL/KBL. As long as SIMD code is used Intel has little to fear.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1700 [+39%] 1220 4540 (AVX512) 1090 In this vectorised integer AVX2 workload CFL enjoys a ~40% lead over Ryzen 2.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 675 [+25%] 542 1790 (AVX512) 433 Same algorithm but more shared data reduces the lead to 25% still significant.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 362 [+19%] 303 940 (AVX512) 233 Again same algorithm but even more data shared reduces the lead to 20%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 589 [+30%] 453 1520 (AVX512) 381 Different algorithm but still AVX2 vectorised workload means CFL is 30% faster than Ryzen 2.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 57.8 [-17%] 69.7 223 (AVX512) 37.6 Still AVX2 vectorised code but CFL stumbles a bit here – it’s 17% slower than Ryzen 2.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 31.8 [+29%] 24.6 70.8 (AVX512) 20 Again we see CFL ~30% faster.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 3480 [+140%] 1450 3570 (AVX512) 2300 CFL (like all Intel CPUs) does very well here – it’s a huge 140% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 448 [+84%] 243 909 (AVX512) 283 In this final test, CFL is almost 2x faster than Ryzen 2.

The addition of 2 more cores brings big performance gains (not to mention the higher Turbo clock) over the old SKL/KBL which is pretty impressive considering TDP has stayed the same. With SIMD code (AVX/AVX2/FMA3) CFL has no problem beating Ryzen 2 by a pretty large margin (up to 2x faster) – but any algorithm not vectorised allows Ryzen 2 to win – though not by much 12-20%.

Streaming tests likely benefit from the higher supported memory frequencies that while in theory could be used on the older SKL/KBL (memory overclock) they were not supported officially nor stable in all cases. We shall test memory performance in a forthcoming article.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

VM Benchmarks Intel i7-8700K CofeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Intel i7-6700K SkyLake Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 41 [-33%] 61 52 28 .Net CLR integer performance starts off well over old SKL but still 33% slower than Ryzen 2.
BenchDotNetAA .Net Dhrystone Long (GIPS) 41 [-32%] 60 54 27 With 64-bit integers nothing much changes.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 78 [-24%] 102 107 49 Floating-Point CLR performance does not change much, CFL is still 25% slower than Ryzen despite big gain over old SKL.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 95 [-19%] 117 137 62 FP64 performance is similar to FP32.
Ryzen 2 performs exceedingly well in .Net workloads – soundly beating all Intel CPUs, with CFL between 20-33% slower. More cores will be needed for parity with Ryzen 2, but at least CFL improves a lot over SKL/KBL.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 93.5 [-16%] 111 144 57 Just as we saw with Dhrystone, this integer workload sees CFL improve greatly over SKL but Ryzen 2 is still faster.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 93.1 [-15%] 109 143 57 With 64-bit integer workload nothing much changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 361 [-8%] 392 585 228 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code but CFL is still 8% slower than Ryzen 2.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 198 [-9%] 217 314 128 Switching to FP64 SIMD vector code – still running AVX/FMA – CFL is still slower.
We see a similar improvement for CFL but again not enough to beat Ryzen 2; even using RyuJit’s vectorised support CFL cannot beat it – just reduce the loss to 8-9%.
Java Arithmetic Java Dhrystone Integer (GIPS) 557 [-3%] 573 877 352 Java JVM performance is almost neck-and-neck with Ryzen 2 despite 2 less cores.
Java Arithmetic Java Dhrystone Long (GIPS) 488 [-12%] 553 772 308 With 64-bit integers, CFL does fall behind Ryzen2 by 12%.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 101 [-23%] 131 156 62 Floating-point JVM performance is worse though, CFL is now 23% slower.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 103 [-26%] 139 160 64 With 64-bit precision nothing much changes.
While CFL improves markedly over the old SKL and almost ties with Ryzen 2 in integer workloads, it does fall behind in floating-point by a good amount.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 100 [-12%] 113 140 63 Without SIMD acceleration we see the usual delta (around 12%) with integer workload.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 89 [-12%] 101 152 62 Nothing changes when changing to 64-bit integer workloads.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 64 [-34%] 97 98 41 With floating-point non-SIMD accelerated we see a bigger delta of about 30% less vs. Ryzen 2.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 64 [-29%] 90 99 41 With 64-bit floatint-point precision nothing much changes.
With compute heavy vectorised code but not SIMD accelerated, CFL cannot keep up with Ryzen 2 and the difference increases to about 30% less. Intel really needs to get Oracle to add SIMD extensions similar to .Net’s new CLR.

Ryzen dominates .Net and Java benchmarks – Intel will need more cores in order to compete; while the additional 2 cores helped a lot, it is not enough!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Due to Core improvement stagnation (4 on desktop, 2 on mobile), Intel had no choice really but increase core counts in light of new competition from AMD with Ryzen/2 with twice as many cores (8) and SMT (Hyper-Threading) as well! While KBL had increased base/Turbo cores appreciably over SKL within the same power envelope, CFL had to add more cores in order to compete.

With 50% more cores (6) CFL performs much better than the older SKL/KBL as expected but that is not enough in non-vectorised loads; Ryzen 2 with 2 more cores is still faster, not by much (12-20%) but still faster. Once vectorised code is used, the power of Intel’s SIMD units shows – with CFL soundly beating Ryzen despite not supporting AVX512 nor SHA HWA which is a pity. AMD still has work to do with Ryzen if it wants to be competitive in vectorised workloads (integer or floating-point).

We now see why CFL-R (CofeeLake Refresh) will add even more cores (8C/16T with 9900K which we review in a subsequent article) – it is the only way to beat Ryzen 2 in all workloads. In effect AMD’s has reached parity performance with Intel in all but SIMD workloads – a great achievement!

Unfortunately (unlike AMD’s AM4 Ryzen) CFL does require new chipset/boards (series 300) which makes it an expensive upgrade for SKL/KBL owners; otherwise it would have been a pretty no-brainer upgrade for those needing more compute power. While the new platform does bring some improvements (USB 3.1 Gen 2 aka 10GB/s, more PCIe lanes, integrated 802.11ac WiFi – at least on mobile) it’s nothing over the competition.

Roll on CofeeLake Refresh and the new CPUs: they are sorely needed…

nVidia Titan V: Volta GPGPU performance in CUDA and OpenCL

What is “Titan V”?

It is the latest high-end “pro-sumer” card from nVidia with the next-generation “Volta” architecture, the next generation to the current “Pascal” architecture on the Series 10 cards. Based on the top-end 100 chipset (not lower 102 or 104) it boasts full speed FP64/FP16 performance as well as brand-new “tensor cores” (matrix multipliers) for scientific and deep-learning workloads. It also comes with on-chip HBM2 (high-bandwidth) memory unlike more traditional GDDRX stand-alone memory.

For this reason the price is also far higher than previous Titan X/XP cards but considering the features/performance are more akin to “Tesla” series it would still be worth it depending on workload.

While using the additional cores provided in FP64/FP16 workloads is automatic – save usual code optimisations – tensor cores support requires custom code and existing libraries and apps need to be updated to make use of them. It is unknown at this time if consumer cards based on “Volta” will also include them. As they support FP16 precision only, not workloads may be able to use them – but DL (deep learning) and AI (artificial intelligence) are generally fine using lower precision thus for such tasks it is ideal.

See these other articles on Titan performance:

Hardware Specifications

We are comparing the top-of-the-range Titan V with previous generation Titans and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications nVidia Titan V
nVidia Titan X (P)
nVidia 980 GTX (M2)
Comments
Arch Chipset Volta VP100 (7.0) Pascal GP102 (6.1) Maxwell 2 GM204 (5.2) The V is the only one using the top-end 100 chip not 102 or 104 lower-end versions
Cores (CU) / Threads (SP) 80 / 5120 28 / 3584 16 / 2048 The V boasts 80 CU units but these contain 64 FP32 units only not 128 like lower-end chips thus equivalent with 40.
FP32 / FP64 / Tensor Cores 5120 / 2560 / 640 3584 / 112 / no 2048 / 64 / no Titan V is the only one with tensor cores and also huge amount of FP64 cores that Titan X simply cannot match; it also has full speed FP16 support.
Speed (Min-Turbo) 1.2GHz (135-1.455) 1.531GHz (139-1910) 1.126GHz (135-1.215) Slightly lower clocked than the X it will will make up for it with sheer CU units.
Power (TDP) 300W 250W (125-300) 180W (120-225) TDP increases by 50W but it is not unexpected considering the additional units.
ROP / TMU
96 / 320 96 / 224 64 / 128 Not a “gaming card” but while ROPs stay the same the number of TMUs has increased – likely required for compute tasks using textures.
Global Memory 12GB HBM2 850Mhz 3072-bit 12GB GDDR5X 10Gbps 384-bit 4GB GDDR5 7Gbps 256-bit Memory size stays the same at 12GB but now uses on-chip HBM2 for much higher bandwidth
Memory Bandwidth (GB/s)
652 512 224 In addition to the modest bandwidth increase, latencies are also meant to have decreased by a good amount.
L2 Cache 4.5MB 3MB 2MB L2 cache has gone up by about 50% to feed all the cores.
FP64/double ratio
1/2 1/32 1/32 For FP64 workloads the V has huge advantage as consumer and previous Titan X had far less FP64 units.
FP16/half ratio
2x 1/64 n/a The V has an even bigger advantage here with over 128x more units for FP16 tasks like DL and AI.

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan V CUDA/OpenCL
nVidia Titan X CUDA/OpenCL
nVidia GTX 980 CUDA/OpenCL
Comments
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 22,400 [+25%] / 20,000 17,870 / 16,000 7,000 / 6,100 Right off the bat, the V is just 25% faster than the X some optimisations may be required.
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 33,300 [135x] / n/a 245 / n/a n/a For FP16 workloads the V shows its power: it is an astonishing 135 *times* (times not %) faster than the X.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 11,000 [+16.7x] / 11,000 661 / 672 259 / 265 For FP64 precision workloads the V shines again, it is 16 times faster than the X.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 458 [+17.7x] / 455 25 / 24 10.8 / 10.7 With emulated FP128 precision the V is again 17 times faster.
As expected FP64 and FP16 performance is much improved on Titan V, with FP64 over 16x times faster than the X; FP16 performance is over 50% faster than FP32 performance making it almost 2x faster than Titan X. For workloads that need it, the performance of Titan V is stellar.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 71 [+79%] / 87 40 / 38 16 / 16 Titan V is almost 80% faster than the X here a significant improvement.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 91 [+75%] / 116 52 / 51 23 / 21 Not a lot changes here, with the V still 7% faster than the X.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 253 [+89%] / 252 134 / 142 58 / 59 In this integer workload, Titan V is almost 2x faster than the X.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 130 [+21%] / 134
107 / 114 50 / 54 SHA1 is mysteriously slower than SHA256 and here the V is just 21% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 173 [+2.4x] / 176 72 / 42 32 / 24 With 64-bit integer workload, Titan V shines again – it is almost 2.5x (times) faster than the X!
Historically, nVidia cards have not been tuned for integer workloads, but Titan V is almost 2x faster in 32-bit hashing and almost 3x faster in 64-bit hashing than the older X. For algorithms that use integer computation this can be quite significant.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 18,460 [+61%] / 18,870
11,480 / 11,470 5,280 / 5,280 Titan V manages to be 60% faster in this FP32 financial workload.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 8,400 [+6.1x] / 9,200
1,370 / 1,300 547 / 511 Switching to FP64 code, the V is over 6x (times) faster than the X.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 4,180 [+81%] / 4,190
2,240 / 2,240 1,200 / 1,140 Binomial uses thread shared data thus stresses the SMX’s memory system: but the V is 80% faster than the X.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 2,000 [+15.5x] / 2,000
129 / 133 51 / 51 With FP64 code the V is much faster – 15x (times) faster!
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 12,550 [+2.35x] / 12,610
5,350 / 5,150 2,140 / 2,000 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here the V is over 2x faster than the X and that is FP32 code!
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 4,440 [+15.1x] / 4,100
294 / 267 118 / 106 Switching to FP64 the V is again over 15x (times) faster!
For financial workloads, the Titan V is significantly faster, almost twice as fast as Titan X on FP32 but over 15x (times) faster on FP64 workloads. If time is money, then this can be money well-spent!
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 9,860 [+57%] / 10,350
6,280 / 6,600 2,550 / 2,550 Without using the new “tensor cores”, Titan V is about 60% faster than the X.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 3,830 [+11.4x] / 3,920 335 / 332 130 / 129 With FP64 precision, the V crushes the X again it is 11x (times) faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 605 [+2.5x] / 391 242 / 227 148 / 136 FFT allows the V to do even better – no doubt due to HBM2 memory.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 280 [+35%] / 245 207 / 191 89 / 82 We may need some optimisations here, otherwise the V is just 35% faster.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,390 [+15%] / 4,630
5,600 / 4,870 2,100 / 2,000 N-Body simulation also needs some optimisations as the V is just 15% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 4,270 [+15.5x] / 4,200
275 / 275 82 / 81 With FP64 precision, the V again crushes the X – it is 15x faster.
The scientific scores are a bit more mixed – GEMM will require code paths to take advantage of the new “tensor cores” and some optimisations may be required – otherwise FP64 code simply flies on Titan V.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 26,790 [50%] / 26,660
17,860 / 13,680 7,310 / 5,530 In this 3×3 convolution algorithm, Titan V is 50% faster than the X. Convolution is also used in neural nets (CNN) thus performance here counts.
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 29,200 [+18.6x]
1,570 n/a With FP16 precision, Titan V shines it is 18x (times faster than X) but 12% faster than FP32.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 9,295 [+94%] / 6,750
4,800 / 3,460 1,870 / 1,380 Same algorithm but more shared data allows the V to be almost 2x faster than the X.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 14,900 [24.4x]
609 n/a With FP16 Titan V is almost 25x (times) faster than X and also 60% faster than Fp32.
GPGPU Image Processing Motion-Blur (7×7) Filter single/FP32 (MPix/s) 9,428 [+2x] / 7,260
4,830 / 3,620 1,910 / 1,440 Again same algorithm but even more data shared the V is 2x faster than the X.
GPGPU Image Processing Motion-Blur (7×7) Filter half/FP16 (MPix/s) 14,790 [+45x] 325 n/a With FP16 the V is now45x (times) faster than the X showing the usefulness of FP16 support.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 9,079 [1.92x] / 7,380
4,740 / 3450 1,860 / 1,370 Still convolution but with 2 filters – Titan V is almost 2x faster again.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 13,740 [+44x]
309 n/a Just as we seen above, the V is an astonishing 44x (times) faster than the X, and also ~20% faster than FP32 code.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 111 [+3x] / 66
36 / 55 20 / 25 Different algorithm but here the V is even faster, 3x faster than the X!
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 206 [+2.89x]
71 n/a With FP16 the V is “only” 3x faster than the X but also 2x faster than FP32 code-path again a big gain for FP16 processing
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 157 [+10x] / 24
15 / 15 12 / 11 Without major processing, this filter flies on the V – it is 10x faster than the X.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 215 [+4x] 50 FP16 precision is “just” 4x faster but it is also ~40% faster than FP32.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 24,370 / 22,780 [+25%] 19,480 / 14,000 7,600 / 6,640 This algorithm is 64-bit integer heavy and here Titan V is 25% faster than the X.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 24,180 [+4x] 6,090 FP16 does not help a lot here, but still the V is 4x faster than the X.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 846 [+3x] / 874 288 / 635 210 / 308 One of the most complex and largest filters, Titan V does very well here, it is 3x faster than the X.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 1,712 [+3.7x]
461 n/a Switching to FP16, the V is almost 4x (times) faster than the X and over 2x faster than FP32 code.
For image processing, Titan V brings big performance increases from 50% to 4x (times) faster than Titan X a big upgrade. If you are willing to drop to FP16 precision, then it is an extra 50% to 2x faster again – while naturally FP16 is not really usable on the X. With potential 8x times better performance Titan V powers through image processing tasks.

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

HBM2 does seem to increase latencies slightly by about 10% but for sequential accesses Titan V does perform a lot better than the X with 20-40% lower latencies, likely due to the the new architecture. Thus code using coalesce memory accesses will perform faster but for code using random access pattern over large data sets

 

Memory Benchmarks nVidia Titan V CUDA/OpenCL
nVidia Titan X CUDA/OpenCL
nVidia GTX 980 CUDA/OpenCL
Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 536 [+51%] / 530
356 / 354 145 / 144 HBM2 brings about 50% more raw bandwidth to feed all the extra compute cores, a significant upgrade.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.47 / 11,4
11.4 / 9 12.1 / 12 Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4!
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.3 / 12.3
12.2 / 8.9 11.5 / 12.2 Again no significant difference but we were not expecting any.
Titan V’s HBM2 brings 50% more memory bandwidth but as it still uses the PCIe3 x16 connection there is no change to host upload/download bandwidth which may be a bit of a bottleneck trying to keep all those cores fed with data. Even more streaming load/save is required and code will need to be optimised to use all that processing power
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 180 [-10%] / 187
201 / 230 230 From the start we see global latency accesses reduced by 10%, not a lot but will help.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 311 [+9%] / 317
286 / 311 306 Full range random accesses do seem to be 9% slower which may be due to the architecture.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53 [-40%] / 57 89 / 121 97 However, sequential accesses seem to have dropped a huge 40% likely due to better prefetchers on the Titan V.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75 [-36%] / 76 117 / 174 126 Constant memory latencies also seem to have dropped by almost 40% a great result.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18 / 85 18 / 53 21 No significant change in shared memory latencies.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 [+9%] / 279 195 / 196 208 Texture access latencies seem to have increased by 9%
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 [+22%] / 313 282 / 278 308 As we’ve seen with global memory, we see increased latencies here by about 20%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88 / 163 87 /123 102 With sequential access there is no appreciable delta in latencies.
HBM2 does seem to increase latencies slightly by about 10% but for sequential accesses Titan V does perform a lot better than the X with 20-40% lower latencies, likely due to the the new architecture. Thus code using coalesce memory accesses will perform faster but for code using random access pattern over large data sets
We see L1 cache effects between 64-128kB tallying with an L1D of 96kB – 4x more than what we’ve seen on Titan X (at 16kB). The other inflexion is at 4MB – matching the 4.5MB L2 cache size – which is 50% more than what we saw on Titan X (at 3MB).
As with global memory we see the same L1D (64kB) and L2 (4.5MB) cache affects with similar latencies. Both are significant upgrades over Titan X’ caches.

Titan V’s memory performance does not disappoint – HBM2 obviously brings large bandwidth increase – latency depends on access pattern, when prefetchers can engage they are much lowers but in random accesses out-of-page they are a big higher but nothing significant. We’re also limited by the PCIe3 bus for transfers which requires judicious overlap of memory transfers and compute to keep the cores busy.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

“Volta” architecture does bring good improvements in FP32 performance which we hope to see soon in consumer (Series 11?) graphics cards – as well as lower-end Titan cards.

But here (on Titan V) we have the top-end chip with full-power FP64 and FP16 units more akin to Tesla which simply power through any and all algorithms you can throw at them. This is really the “Titan” you were looking for and upgrading from the previous Titan X (Pascal) is a huge upgrade admittedly for quite a bit more money.

If you have workloads that requires double/FP64 precision – Titan V is 15-16x times faster than Titan X – thus great value for money. If code can make do with FP16 precision then you can gain up to 2x extra performance again – as well as save storage for large data-sets – again Titan X cannot cut it here running at 1/64 rate.

We have not yet shown tensor core performance which is an additional reason for choosing such a card – if you have code that can make use of them you can gain an extra 16x (times) performance that really puts Titan V heads and shoulders over the Titan X.

All in all Titan V is a compelling upgrade if you need more power than Titan X and are (or thinking of) using multiple cards – there is simply no point. One Titan V can replace 4 or more Titan X cards on FP64 or FP16 workloads and that is before you make any optimisations. Obviously you are still “stuck” with 12GB memory and PCIe bus for transfers but with judicious optimisations this should not impact performance significantly.

nVidia Titan X: Pascal GPGPU Performance in CUDA and OpenCL

What is “Titan X (Pascal)”?

It is the current high-end “pro-sumer” card from nVidia using the current generation “Pascal” architecture – equivalent to the Series 10 cards. It is based on the 2nd-from-the-top 102 chipset (not the top-end 100) thus it does not feature full speed FP64/FP16 performance that is generally reserved for the “Quadro/Tesla” professional range of cards. It does however come with more memory to fit more datasets and is engineered for 24/7 performance.

Pricing has increased a bit from previous generation X/XP but that is a general trend today from all manufacturers.

See these other articles on Titan performance:

Hardware Specifications

We are comparing the top-of-the-range Titan X with previous generation cards and competing architectures with a view to upgrading to a mid-range high performance design.

GPGPU Specifications nVidia Titan X (P) nVidia 980 GTX (M2) AMD Vega 56 AMD Fury Comments
Arch Chipset Pascal GP102 (6.1) Maxwell 2 GM204 (5.2) Vega 10 Fiji The X uses the current Pascal architecture that is also powering the current Series 10 consumer cards
Cores (CU) / Threads (SP) 28 / 3584 16 / 2048 56 / 3584 64 / 4096 We’ve got 28CU/SMX here down from 32 on GP100/Tesla but should still be sufficient to power through tasks.
FP32 / FP64 / Tensor Cores 3584 / 112 / no 2048 / 64 / no 3584 / 448 / no 4096 / 512 / no Only 112 FP64 units – a lot less than competition from AMD, this is a card geared for FP32 workloads.
Speed (Min-Turbo) 1.531GHz (139-1910) 1.126GHz (135-1.215) 1.64GHz 1GHz Higher clocked that previous generation and comparative with competition.
Power (TDP) 250W (125-300) 180W (120-225) 200W 150W TDP has also increased to 250W but again that is inline with top-end cards that are pushing over 200W.
ROP / TMU
96 / 224 64 / 128 64 / 224 64 / 256 As it may also be used as top-end graphics card, it has a good amount of ROPs (50% more than competition) and similar numbers of TMUs.
Global Memory 12GB GDDR5X 10Gbps 384-bit 4GB GDDR5 7Gbps 256-bit 8GB HBM2 2Gbps 2048-bit 4GB HBM 1Gbps 4096-bit Titan X comes with a huge 12GB of current GDDR5X memory while the competition has switched to HBM2 for top-end cards.
Memory Bandwidth (GB/s)
512 224 483 512 Due to high speed GDDR5X, the X has plenty of memory bandwidth even higher than HBM2 competition.
L2 Cache 3MB 2MB L2 cache has increased by 50% over previous arch to keep all cores fed.
FP64/double ratio
1/32 1/32 1/8 1/8 The X is not really meant for FP64 workloads as it uses the same ratio 1:32 as normal consumer cards.
FP16/half ratio
1/64 n/a 1/1 1/1 With 1:64 ratio FP16 is not really usable on Titan X but can only really be used for compatibility testing.

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers from both nVidia and competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

GPGPU Image ProcessingMotion-Blur (7×7) Filter single/FP32 (MPix/s)4,830 / 3,6201,910 / 1,440

Again same algorithm but even more data shared the V is 2x faster than the X.

Processing Benchmarks nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL AMD Vega 56 OpenCL AMD Fury OpenCL Comments
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 17,870 [37%] / 16,000 7,000 / 6,100 13,000 8,720 Titan X makes a good start beating the Vega by almost 40%.
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 245 [-98%] / n/a n/a 13,130 7,890 FP16 is so slow that it is unusable – just for testing.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 661 [-47%] / 672 259 / 265 1,250 901 FP64 is also quite slow though a lot faster than on the GTX 980.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 25 [-67%] / 24 10.8 / 10.7 77.3 55 Emulated FP128 precision depends entirely on FP64 performance and thus is… slow.
With FP32 “normal” workloads Titan X is quite fast, ~40% faster than Vega and about 2.5x faster than an older GTX 980 thus quite an improvement. But FP16 workloads should not apply – better off with FP32 – and FP64 is also about 1/2 the performance of a Vega – also slower than even a Fiji. As long as all workloads are FP32 there should be no problems.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 40 [-38%] / 38 16 / 16 65 46 Titan X is a lot faster than previous gen but still ~40% slower than a Vega
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 52 [-38%] / 51 23 / 21 84 60 Nothing changes here , the X still about 40% slower than a Vega.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 134 [+4%] / 142 58 / 59 129 82 In this integer workload, somehow Titan X manages to beat the Vega by 4%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 [-34%] / 114 50 / 54 163 124 SHA1 is mysteriously slower thus the X is ~35% slower than a Vega.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 72 [+2.3x] / 42 32 / 24 31 13.8 With 64-bit integer workload, Titan X is a massive 2.3x times faster than a Vega.
Historically, nVidia cards have not been tuned for integer workloads, but Titan X still manages to beat a Vega – the “gold standard” for crypto-currency hashing – on both SHA256 and especially on 64-bit integer SHA2-512! Perhaps for the first time a nVidia card is competitive on integer workloads and even much faster on 64-bit integer workloads.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 11,480 [+28%] / 11,470 5,280 / 5,280 9,000 11,220 In this FP32 financial workload Titan X is almost 30% faster than a Vega.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 1,370 [-36%] / 1,300 547 / 511 1,850 1,290 Switching to FP64 code, the X remains competitive and is about 35% slower.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 2,240 [-8%] / 2,240 1,200 / 1,140 2,440 1,760 Binomial uses thread shared data thus stresses the SMX’s memory system and here Vega surprisingly does better by 8%
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 129 [-20%] / 133 51 / 51 161 115 With FP64 code the X is only 20% slower than a Vega.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,350 [+47%] / 5,150 2,140 / 2,000 3,630 2,470 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Titan X is almost 50% faster!
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 294 [-34%] / 267 118 / 106 385 332 Switching to FP64 the X is again 34% slower than a Vega.
For financial FP32 workloads, the Titan X generally beats the Vega by a good amount or at least ties with it; with FP64 precision it is about 1/2 the speed which is to be expected. As long as you have FP32 workloads this should not be a problem.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,280 [+19%] / 6,600 2,550 / 2,550 5,260 3,630 Using 32-bit precision Titan X beats the Vega by 20%.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 335 [-40%] / 332 130 / 129 555 381 With FP64 precision, unsurprisingly the X is 40% slower.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 242 [-20%] / 227 148 / 136 306 348 FFT does better with HBM memory and here Titan X is 20% slower than a Vega.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 207 / 191 89 / 82 139 116 Surprisingly the X does very well here and manages to beat all cards by almost 50%!
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 5,600 [+20%] / 4,870 2,100 / 2,000 4,670 3,080 Titan X does well in this algorithm, beating the Vega by 20%.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 275 [-20%] / 275 82 / 81 343 303 With FP64 precision, the X is again 20% slower.
The scientific scores are similar to the financial ones but the gain/loss is about 20% not 40% – in FP32 workloads Titan X is 20% faster while in FP64 it is about 20% slower than a Vega – a closer result than expected.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 14,550 [-60%] / 10,880 7,310 / 5,530 36,000 28,000 In this 3×3 convolution algorithm, somehow Titan X is over 50% slower than a Vega and even a Fury.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 3,840 [-11%] / 2,750 1,870 / 1,380 4,300 3,150 Same algorithm but more shared data reduces the gap to 10% but still a loss.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 3,920 [-10%] / 2,930 1,910 / 1,440 4,350 3,200 With even more data the gap remains at 10%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 3,740 [-11%] / 2,760 1,860 / 1,370 4,210 3,130 Still convolution but with 2 filters – Titan X is 10% slower again.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.7 / 55 [+52%] 20.6 / 25.4 36.3 30.8 Different algorithm allows the X to finally beat the Vega by 50%.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 15.6 [-60%] / 15.3 12.2 / 11.4 38.7 14.3 Without major processing, this filter does not like the X much it runs 1/2 slower than the Vega.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 16,480 [-57%] / 14,000 7,600 / 6,640 38,730 28,500 This algorithm is 64-bit integer heavy but again Titan X is 1/2 the speed of Vega.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 290 / 6,350 [+13%] 210 / 3,080 5,600 4,410 One of the most complex and largest filters, Titan X finally beats the Vega by over 10%.
For image processing using FP32 precision Titan X surprisingly does not do as well as expected – either in CUDA or OpenCL – with the Vega beating it by a good margin on most filters – a pretty surprising result. Perhaps more optimisations are needed on nVidia hardware. We obviously did not test FP16 performance at all as that would have been far slower.

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers from nVidia and competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 398.36, CUDA 9.2, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

HBM2 does seem to increase latencies slightly by about 10% but for sequential accesses Titan V does perform a lot better than the X with 20-40% lower latencies, likely due to the the new architecture. Thus code using coalesce memory accesses will perform faster but for code using random access pattern over large data sets

 

Memory Benchmarks nVidia Titan X CUDA/OpenCL nVidia GTX 980 CUDA/OpenCL AMD Vega 56 OpenCL AMD Fury OpenCL Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 356 [+13%] / 354 145 / 144 316 387 Titan X brings more bandwidth than a Vega (+13%) but the old Fury takes the crown.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.4 / 9 12.1 / 12 12.1 11 All cards use PCIe3 x16 and thus no appreciable delta.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.2 / 8.9 11.5 / 12.2 10 9.8 Again no significant difference but we were not expecting any.
Titan X uses current GDDR5X but with high data rate allowing it to bring more bandwidth that some HBM2 designs – a pretty impressive feat. Naturally high-end cards using HBM2 should have even higher bandwidth.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 201 / 230 230 273 343 Compared to previous generation, Titan X has better latency due to higher data rate.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 286 / 311 306 399 525 Similarly, even full random accesses are faster,
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89 / 121 97 129 216 Sequential access has similarly low latencies but nothing special.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 / 174 126 269 353 Constant memory latencies have also dropped.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18 / 53 21 49 112 Even shared memory latencies have dropped likely due to higher core clocks.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 / 196 208 121 Texture access latencies have come down as well.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 / 278 308 And even full range latencies have decreased.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87 /123 102 With sequential access there is no appreciable delta in latencies.
We’re only comparing CUDA latencies here (as OpenCL is quite variable) – thus compared to the previous generation (GTX 980) all latencies are down, either due to higher memory data rate or core clock increases – but nothing spectacular. Still good progress and everything helps.
We see L1 cache effects until 16kB (same as previous arch) and between 2-4MB tallying with the 3MB cache. While fast perhaps they could be a bit bigger.
As with global memory we see the same L1D and L2 cache affects with similar latencies. All in all good performance but we could do with bigger caches.

Titan X’s memory performance is what you’d expect from higher clocked GDDR5X memory – it is competitive even with the latest HBM2 powered competition – both bandwidth and latency wise. There are no major surprises here and everything works nicely.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Titan X based on the current “Pascal” architecture performs very well in FP32 workloads – it is much faster than previous generation for a modest price increase and is competitive with the AMD’s Vega offers. But it is likely due to be replaced soon as next-generation “Volta” architecture is already out on the high-end (Titan V) and likely due to filter down the stack to both consumer (Series 11?) cards and “pro-sumer” cheaper Titan cards than the Titan V.

For FP64 workloads it is perhaps best to choose an older Quadro/Tesla card with more FP64 units as performance is naturally much lower. FP16 performance is also restricted and pretty much not usable – good for compatibility testing should you hope to upgrade to a full-speed FP16 card in the future. For both these workloads – the high-end Titan V is the card you probably want – but at a much higher price.

Still for the money, Titan X has its place and the most common FP32 workloads (financial, scientific, high precision image processing, etc.) that do not require FP64 nor FP16 optimisations perform very well and this card is all you need.

AMD Ryzen+ 2700X Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “Ryzen+” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen+” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU – with integrated “Vega RX” graphics” – as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen+:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we are testing them in this article!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Ryzen+ (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen+ 2700X Pinnacle Ridge AMD Ryzen+ 2600 Pinnacle Ridge
AMD Ryzen 1700X Summit Ridge
Intel i7-6700K SkyLake
Comments
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 6x 32kB 8-way / 6x 64kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen+ data/instruction caches is unchanged; icache is still 2x as big as Intel’s.
L2 Caches 8x 512kB 8-way 6x 512kB 8-way 8x 512kB 8-way 4x 256kB 8-way Ryzen+ L2 cache is unchanged but we’re told latencies have been improved. And 4x bigger than Intel’s!
L3 Caches 2x 8MB 16-way 2x 8MB 16-way 2x 8MB 16-way 8MB 16-way Ryzen+ L3 caches are also unchanged – but again lantencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
TLB 4kB pages
64 full-way 1536 8-way 64 full-way 1536 8-way 64 full-way 1536 8-way 64 8-way 1536 6-way No TLB changes.
TLB 2MB pages
64 full-way 1536 2-way 64 full-way 1536 2-way 64 full-way 1536 2-way 8 full-way 1536 6-way No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) 600-1200 600-1200 600-1200 1200-4000 Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max
2400 / 2933 2400 / 2933 2400 / 2666 2533 / 2400 Ryzen+ how supports up to 2933MHz (officially) which should improve its performance quite a bit – unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width
2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Timing (clocks)
14-16-16-32 7-54-18-9 2T 14-16-16-32 7-54-18-9 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T Memory runs at the same timings on both Ryzen+ and Ryzen but we shall see if measured latencies are different.

Core Topology and Testing

As discussed in the previous article, cores on Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX would ensure lower latencies and higher bandwidth which we will test with presently.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Ryzen+ 2700X 8C/16T Pinnacle Ridge
Ryzen+ 2600 6C/12T Pinnacle Ridge
Ryzen 1700X 8C/16T Summit Ridge
i7-6700K 4C/8T SkyLake
Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 54.9 [+15%] 46.5 47.8 39 Ryzen+ manages 15% higher bandwidth between its cores, slightly better than just 11% clock increase – signalling some improvements under the hood.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 5.89 [+2%] 5.53 5.8 16.3 In worst-case pairs on Ryzen must go across CCXes – and with this link running at the same clock (1200MHz) on Ryzen+ we can only manage a 2% increase in bandwidth. This is why faster memory is needed.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 13.5 [-13%] 15.4 15.6 16.2 Within the same core (sharing L1D/L2), Ryzen+ manages a 13% reduction in latency, again better than just clock speed increase.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 40.1 [-7%] 43.5 43.2 47.3 Within the same compute unit (sharing L3), the latency decreased by 7% on Ryzen+ thus L3 seems to have improved also.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 128 [-6%] 132 236 Going inter-CCX we still see a 6% reduction in latency on Ryzen+ – with the CCX link at the same speed – a welcome surprise.
The multiple CCX design still presents some challenges to programmers requiring threads to be carefully scheduled – but we see a decent 6-7% reduction in L3/CCX latencies on Ryzen+ even when running at the same clock as Ryzen.
Aggregated L1D Bandwidth (GB/s) 862 [+18%] 615 730 837 Right off we see a 18% bandwidth increase – almost 2x higher (than the 11% clock increase) – thus some improvements have been made to the cache system. It allows Ryzen+ to finally beat the i7 with its wide L1 data paths (512-bit) though with 2x more caches (8 vs 4).
Aggregated L2 Bandwidth (GB/s) 736 [+32%] 542 556 329 We see a huge 32% increase in L2 cache bandwidth – almost 3x clock increase (the 11%) suggesting the L2 caches have been improved also. Ryzen+ has thus 2x the L2 bandwidth of i7 though with 2x more caches (8 vs 4).
Aggregated L3 Bandwidth (GB/s) 339 [+19%] 398 284 238 The bandwidth of the L3 caches has also increased by 19% (2x clock increase) though we see the 6-core 2600 doing better (398 vs 339) likely due to less threads competing for the same L3 caches (12 vs 16). Ryzen+ L3 caches are not just 2x bigger than Intel but also 2x more bandwidth.
Aggregated Memory (GB/s) 30.2 [+2%] 30.2 29.6 29.1 With the same memory clock, Ryzen+ does still manage a small 2% improvement – signalling memory controller improvements. We also see Ryzen’s memory at 2400Mt/s having better bandwidth than Intel at 2533.
We see big improvements on Ryzen+ for all caches L1D/L2/L3 of 20-30% – more than just raw clock increase (11%) – so AMD has indeed made improvements – which to be fair needed to be done. The memory controller is also a bit more efficient (2%) though it can run at higher clocks than tested (2400Mt/s) – hopefully fast DDR4 memory will become more affordable.
Data In-Page Random Latency (ns) 66.4 (4-12-31) [-6%] [0][-5][-4] 66.4 (4-12-31) 70.5 (4-17-35) 20.4 (4-12-21) In-page latency has decreased by a noticeable 6% on Ryzen+ (both 2700X and 2600) – we see 5 clocks reduction for L2 and 4 for L3 a welcome improvement. But still a way to go to catch Intel which has 1/3x (three times less) latency.
Data Full Random Latency (ns) 80.9 (4-12-32) [-8%] [0][-5][-4] 79.4 (4-12-32) 87.6 (4-17-36) 63.9 (4-12-34) Out-of-page latencies have also been reduced by 8% on Ryzen+ (same memory) and we see the same 5 and 4 clock reduction for L2 and L3 (on both 2700X and 2600 it’s no fluke). Again these are welcome but still have a way to go to catch Intel.
Data Sequential Latency (ns) 3.4 (4-6-7) [-8%] [0][-1][0] 3.5 (4-6-7) 3.7 (4-7-7) 4.1 (4-12-13) Ryzen’s prefetchers are working well with sequential access pattern latency and we see a 8% latency drop for Ryzen+.
Ryzen’s issue was high memory latencies (in-page/full random) and Ryzen+ has reduced them all by 6-8%. While it is a good improvement, they are still pretty high compared to Intel’s thus more work needs to be done here.
Code In-Page Random Latency (ns) 14.2 (4-9-24) [-9%] [0][0][0] 14.6 (4-9-24) 15.6 (4-9-24) 10.1 (2-10-21) Code latencies were not a problem on Ryzen but we still see a welcome reduction of 9% on Ryzen+. (no clocks delta)
Code Full Random Latency (ns) 88.6 (4-14-49) [-9%] [0][+1][+2] 89.3 (4-14-49) 97.4 (4-13-47) 70.7 (2-11-46) Out-of-page latency also sees a 9% decrease on Ryzen+ but somewhat surprisingly a 1-2 clock increase.
Code Sequential Latency (ns) 7.6 (4-12-20) [-8%] [0][+1][+1] 7.8 (4-12-20) 8.3 (4-11-19) 5.0 (2-4-9) Ryzen’s prefetchers are working well with sequential access pattern latency and we see a 8% reduction on Ryzen+.
While code access latencies were not a problem on Ryzen and they also see a 8% improvement on Ryzen+ which is welcome. Note code L1i cache is 2x Intel’s (64kB vs 32).
Memory Update Transactional (MTPS) 4.7 [+10%] 5 4.28 33.2 HLE Ryzen+ is 10% faster than Ryzen but naturally without HLE support it cannot match the i7. But with Intel disabling HLE on all but top-end CPUs AMD does not have much to worry.
Memory Update Record Only (MTPS) 4.6 [+11%] 4.75 4.16 23 HLE With only record updates we still see an 11% increase.

Ryzen+ brings nice updates – good bandwidth increases to all caches L1D/L2/L3 and also well-needed latency reduction for data (and code) accesses. Yes, there is still work to be done to bring the latencies down further – but it may be just enough to beat Intel to 2nd place for a good while.

At the high-end, ThreadRipper2 will likely benefit most as it’s going against many-core SKL-X AVX512-enabled competitor which is a lot “tougher” than the normal SKL/KBL/CFL consumer versions.

SiSoftware Official Ranker Scores

 

Final Thoughts / Conclusions

As with original Ryzen, the cache and memory system performance is not the clean-sweep we’ve seen in CPU testing – but Ryzen+ does bring welcome improvements in bandwidth and latency – which hopefully will further improve with firmware/BIOS updates (AGESA firmware).

With the potential to use faster DDR4 memory – Ryzen+ can do far better than in this test (e.g. with 2933/3200MHz memory). Unfortunately at this time DDR4 – especially high-end fast versions – memory is hideously expensive which is a bit of a problem. You may be better off using less but fast(er) memory with Ryzen designs.

Ryzen+ is a great update that will not disappoint upgraders and is likely to increase AMD’s market share. AMD is here to stay!

AMD Ryzen+ 2700X Review & Benchmarks – CPU 8-core/16-thread Performance

What is “Ryzen+” ZEN+?

After the very successful launch of the original “Ryzen” (Zen/Zeppelin – “Summit Ridge” on 14nm), AMD has been hard at work optimising and improving the design: “Ryzen+” (code-name “Pinnacle Ridge”) is thus a 12nm die shrink that also includes APU – with integrated “Vega RX” graphics” – as well as traditional CPU versions.

While new chipsets (400 series) will also be introduced, the CPUs do work with existing AM4 300-series chipsets (e.g. X370, B350, A320) with a BIOS/firmware update which makes them great upgrades.

Here’s what AMD says it has done for Ryzen+:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Ryzen+ (2700X, 2600) with previous generation (1700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications AMD Ryzen+ 2700X Pinnacle Ridge
AMD Ryzen+ 2600 Pinnacle Ridge
AMD Ryzen 1700X Summit Ridge
Intel i7-6700K SkyLake
Comments
Cores (CU) / Threads (SP) 8C / 16T 6C / 12T 8C / 16T 4C / 8T Ryzen+ like its predecessor has the most cores and threads; it thus be down to IPC and clock speeds for performance improvements.
Speed (Min / Max / Turbo) 2.2-3.7-4.2GHz (22x-37x-42x) [+9% rated, +11% turbo] 1.55-3.4-3.9GHz (15x-34x-39x) 2.2-3.3-3.8GHz (22x-34x-38x) 0.8-4.0-4.2GHz (8x-40x-42x) Ryzen+ base clock is 9% higher while Turbo/Boost/XFR is 11% higher; we thus expect at least about 10% improvement in CPU benchmarks.
Power (TDP) 105W 65W 95W 91W Ryzen+ also increases TDP by 11% (105W vs 95) which may require a bit more cooling especially when overclocking.
L1D / L1I Caches 8x 32kB 8-way / 8x 64kB 8-way 6x 32kB 8-way / 6x 64kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen+ data/instruction caches is unchanged; icache is still 2x as big as Intel’s.
L2 Caches 8x 512kB 8-way 6x 512kB 8-way 8x 512kB 8-way 4x 256kB 8-way Ryzen+ L2 cache is unchanged but we’re told latencies have been improved. 4x bigger than Intel’s.
L3 Caches 2x 8MB 16-way 2x 8MB 16-way 2x 8MB 16-way 8MB 16-way Ryzen+ L3 caches are also unchanged – but again lantencies are meant to have improved. With each CCX having 8MB even the 2600 has 2x as much cache as an i7.
SIMD Units 128-bit AVX/FMA3/AVX2 128-bit AVX/FMA3/AVX2 128-bit AVX/FMA3/AVX2 256-bit AVX/FMA3/AVX2 Ryzen+ still uses the 128-bit SIMD units going against the 256-bit SIMD units that all Intel CPUs have had since Sandy Bridge!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen+ supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Ryzen+ 2700X 8C/16T Pinnacle Ridge
Ryzen+ 2600 6C/12T Pinnacle Ridge
Ryzen 1700X 8C/16T Summit Ridge
i7-6700K 4C/8T Skylake
Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 323 [+8%] 236 298 194 Right off Ryzen+ is 8% faster than Ryzen, let’s hope it does better. Even 2600 beats the i7 easily
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 337 [+12%] 238 301 194 With a 64-bit integer workload – we finally get into gear, Ryzen+ is 12% faster than its old brother.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 204 [+12%] 144 182 107 Even in this floating-point test, Ryzen+ is again 12% faster. All AMD CPUs beat the i7 into dust.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 172 [+11%] 123 155 89 With FP64 nothing much changes, Ryzen+ is still 11% faster.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, Ryzen+ is about 10% faster than Ryzen: this is exactly in line with the speed increase (9-11%) but if you were expecting more you may be a tiny bit disappointed.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 619 [+16%] 428 535 510 In this vectorised AVX2 integer test Ryzen+ starts to pull ahead and is 16% faster than Ryzen; perhaps some of the arch improvements benefit SIMD vectorised workloads.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 187 [+10%] 132 170 197 With a 64-bit AVX2 integer vectorised workload, Ryzen+ drops to just 10% but still in line with speed increase.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 5.83 [+7%] 4.12 5.47 3 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen+ drops to just 7% faster than Ryzen but still a decent improvement.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 577 [+11%] 409 520 453 In this floating-point AVX/FMA vectorised test, Ryzen+ is the standard 11% faster than Ryzen.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 332 [+11%] 236 299 267 Switching to FP64 SIMD code, again Ryzen+ is just the standard 11% faster than Ryzen.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 15.6 [+15%] 11 13.7 11 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – Ryzen+ manages to pull ahead further and is 15% faster.
In vectorised AVX2/FMA code we see a similar story with 10% average improvement (7-15%). It seems the SIMD units are unchanged. In any case the i7 is left in the dust.
BenchCrypt Crypto AES-256 (GB/s) 14.1 [+1%] 14.1 13.9 14.7 With AES HWA support all CPUs are memory bandwidth bound; as we’re testing Ryzen+ running at the same memory speed/timings there is still a very small improvement of 1%. But its advantage is that the memory controller is rated for 2933Mt/s operation (vs. 2533) thus with faster memory it could run considerably faster.
BenchCrypt Crypto AES-128 (GB/s) 14.2 [+1%] 14.2 14 14.8 What we saw with AES-256 just repeats with AES-128; Ryzen+ is marginally faster but the improvement is there.
BenchCrypt Crypto SHA2-256 (GB/s) 18.4 [+12%] 13.2 16.5 5.9 With SHA HWA Ryzen+ similarly powers through hashing tests leaving Intel in the dust; SHA is still memory bound but with just one (1) buffer it has larger headroom. Thus Ryzen+ can use its speed advantage and be 12% faster – impressive.
BenchCrypt Crypto SHA1 (GB/s) 19.2 [+14%] 13.1 16.8 11.3 Ryzen+ also accelerates the soon-to-be-defunct SHA1 and here it is even faster – 14% faster than Ryzen.
BenchCrypt Crypto SHA2-512 (GB/s) 3.75 [+12%] 2.66 3.34 4.4 SHA2-512 is not accelerated by SHA HWA (version 1) thus Ryzen+ has to use the same vectorised AVX2 code path – it still is 12% faster than Ryzen but still loses to the i7. Those SIMD units are tough to beat.
In memory bandwidth bound algorithms, Ryzen+ will have to be used with faster memory (up to 2933Mt/s officially) in order to significantly beat its older Ryzen brother. Otherwise there is only a tiny 1% improvement.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 260 [+11%] 184 235 126 In this non-vectorised test we see Ryzen+ is the standard 11% faster than Ryzen.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 221 [+11%] 157 199 112 Switching to FP64 code, nothing changes, Ryzen+ is still 11% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 106 [+23%] 76 86 27 Binomial uses thread shared data thus stresses the cache & memory system; here the arch(itecture) improvements do show, Ryzen+ 23% faster – 2x more than expected. Not to mention 3x (three times) faster than the i7.
BenchFinance Binomial double/FP64 (kOPT/s) 60.8 [+28%] 43.2 47.5 29.2 With FP64 code Ryzen+ is now even faster – 28% faster than Ryzen not to mention 2x faster than the i7. Indeed it seems there improvements to the cache and memory system.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.4 [+11%] 38.6 49.2 49.2 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen+ does not seem to be able to reproduce its previous gain and is just the standard 11% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 41.2 [+10%] 29.1 37.3 20.3 Switching to FP64 nothing much changes, Ryzen+ is 10% faster.
Ryzen dies very well in these algorithms, but Ryzen+ does even better – especially when thread-local data is involved managing 23-28% improvement. For financial workloads Intel does not seem to have a chance anymore – Ryzen is impossible to beat.
BenchScience SGEMM (GFLOPS) float/FP32 275 [+10%] 238 250 267 In this tough vectorised AVX2/FMA algorithm Ryzen+ is still “just” the 10% faster than older Ryzen – but it finally manages to beat the i7.
BenchScience DGEMM (GFLOPS) double/FP64 113 [+4%] 103 109 116 With FP64 vectorised code, Ryzen+ only manages to be 4% faster. It seems the memory is holding it back thus faster memory would allow it to do much better.
BenchScience SFFT (GFLOPS) float/FP32 8.56 [+4%] 7.36 8.2 19.4 FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; Ryzen+ is just 4% faster again and is still 1/2x the speed of the i7. Again it seems faster memory would help.
BenchScience DFFT (GFLOPS) double/FP64 7.42 [+1%] 6.87 7.32 9.19 With FP64 code, Ryzen+’s improvement reduces to just 1% over Ryzen and again slower than the i7.
BenchScience SNBODY (GFLOPS) float/FP32 279 [+12%] 197 249 269 N-Body simulation is vectorised but many memory accesses to shared data and Ryzen+ gets back to 12% improvement over Ryzen. This allows it to finally overtake the i7.
BenchScience DNBODY (GFLOPS) double/FP64 114 [+13%] 80 101 79 With FP64 code nothing much changes, Ryzen+ is still 13% faster.
With highly vectorised SIMD code Ryzen+ still improves by the standard 10-12% but in memory-heavy code it needs to run at higher memory speed to significantly overtake Ryzen. But it allows it to beat the i7 in more algorithms.
CPU Image Processing Blur (3×3) Filter (MPix/s) 1290 [+11%] 913 1160 1170 In this vectorised integer AVX2 workload Ryzen+ is 11% faster allowing it to soundly beat the i7.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 551 [+11%] 391 497 435 Same algorithm but more shared data does not change things for Ryzen+. Only the i7 falls behind.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 307 [+11%] 218 276 233 Again same algorithm but even more data shared does not change anything, but now the i7 is so far behind Ryzen+ is 50% faster. Incredible.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 461 [+11%] 326 415 384 Different algorithm but still AVX2 vectorised workload still changes nothing – Ryzen+ is 11% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 69.7 [+12%] 49.7 62 38 Still AVX2 vectorised code and still nothing changes; the i7 falls even further behind with Ryzen+ 2x (two times) as fast.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 24.7 [+11%] 17.5 22.3 20 Again we see Ryzen+ 11% faster than the older Ryzen and pulling away from the i7.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1460 [+8%] 1130 1350 1670 Here Ryzen+ is just 8% faster than Ryzen but strangely it’s not enough to beat the i7. Those SIMD units are way fast.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 243 [+11%] 172 219 268 In this final test, Ryzen+ returns to being 11% faster and again strangely not enough to beat the i7.

With all the modern instruction sets supported (AVX2, FMA, AES and SHA HWA) Ryzen+ does extremely well in all workloads – but it generally improves only by the 11% as per clock speed increase, except in some cases which seem to show improvements in the cache and memory system (which we have not tested yet).

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest drivers. .Net 4.7.x (RyuJit), Java 1.9.x. Turbo / Boost was enabled on all configurations.

VM Benchmarks Ryzen+ 2700X 8C/16T Pinnacle Ridge
Ryzen+ 2600 6C/12T Pinnacle Ridge
Ryzen 1700X 8C/16T Summit Ridge
i7-6700K 4C/8T Skylake
Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 63.2 [+8%] 30 58.6 26 .Net CLR integer performance starts off OK with Ryzen+ just 8% faster than Ryzen but now almost 3x (three times) faster than i7.
BenchDotNetAA .Net Dhrystone Long (GIPS) 49.6 [+20%] 33.6 41.2 27 Ryzen seems to favour 64-bit integer workloads, with Ryzen+ 20% faster a lot higher than expected.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 104 [+15%] 71.2 90.5 54.3 Floating-Point CLR performance was pretty spectacular with Ryzen already, but Ryzen+ is 15% than Ryzen still.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 122 [+20%] 88.2 102 65.6 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen+ even faster by 20%.
Ryzen’s performance in .Net was pretty incredible but Ryzen+ is even faster – even faster than expected by mere clock speed increase. There is only one game in town now for .Net applications.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 106 [+9%] 74 97 54 Just as we saw with Dhrystone, this integer workload sees a 9% improvement for Ryzen+ which makes it 2x faster than the i7.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 111 [+8%] 78 103 57 With 64-bit integer workload we see a similar story – Ryzen+ is 8% faster and again 2x faster than the i7.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 387 [+11%] 278 348 240 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Ryzen+ is 11% faster but still almost 2x faster than i7 despite its fast SIMD units
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 217 [+12%] 153 194 48.6 Switching to FP64 SIMD vector code – still running AVX/FMA – Ryzen+ is still 12% faster. i7 is truly left in the dust 1/4x the speed.
Ryzen+ is the usual 9-12% faster than Ryzen here but it means that even RyuJit’s SIMD support cannot save Intel’s i7 – it would take 2x as many cores (not 50%) to beat Ryzen+.
Java Arithmetic Java Dhrystone Integer (GIPS) 574 [+12%] 399 514 We start JVM integer performance with the usual 12% gain over Ryzen.
Java Arithmetic Java Dhrystone Long (GIPS) 559 [+12%] 392 500 Nothing much changes with 64-bit integer workload, we have Ryzen+ 12% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 138 [+13%] 99 122 With a floating-point workload Ryzen+ performance improvement is 13%.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 137 [+7%] 97 128 With FP64 workload Ryzen+ is just 7% faster but still welcome
Java performance improves by the expected amount 7-13% on Ryzen+ and allows it to completely dominate the i7.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 108 [+15%] 76 94 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR but here Ryzen+ manages a 15% lead over Ryzen.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 114 [+24%] 73 92 With 64-bit vectorised workload Ryzen+ (similar to .Net) increases its lead by 24%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 99 [+14%] 69 87 Switching to floating-point we return to the usual 14% speed improvement.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 93 [+1%] 64 92 With FP64 workload Ryzen+’s lead somewhat unexplicably drops to 1%.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives Ryzen+ free reign to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

Ryzen dominated the .Net and Java benchmarks – but now Ryzen+ extends that dominance out-of-reach. It would take a very much improved run-time or Intel CPU to get anywhere close. For .Net and Java code, Ryzen is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen+ is a worthy update but its speed increase is generally due to its faster clock speed – similar to Intel’s SkyLake > KabyLake (gen 6 to gen 7) transition. But coming at the same price, a “free” performance increase of 10% or so is obviously not to be ignored. Let’s not forget that Ryzen+ can still use all the existing series 300 mainboards – subject to BIOS update.

The process shrink and power optimisations does allow Ryzen+ to run at lower voltages and consume less power – even though TDP has increased at least “on paper”.

Some algorithms do seem to show that the cache and memory system has been improved – but Ryzen+’s advantage is that it can (much) faster memory. Unfortunately at this time DDR4 memory, especially fast versions, are very expensive. Here Intel does (still) have an advantage in that fast DDR4 memory is not required except for bandwidth bound algorithms.

One advantage is that by now operating systems (and applications) have been updated to deal with its dual-CCX design that used to be so much trouble when we benchmarked Ryzen initially. With AMD increasing its market share no high-performance application can afford to ignore AMD CPUs.

We (just) cannot wait to see the new improvements in future AMD designs and especially the ThreadRipper2 update!

AVX512 performance improvement for SKL-X in Sandra SP2

Intel Skylake-X Core i9

What is AVX512?

AVX512 (Advanced Vector eXtensions) is the 512-bit SIMD instruction set that follows from previous 256-bit AVX2/FMA3/AVX instruction set. Originally introduced by Intel with its “Xeon Phi” GPGPU accelerators – albeit in a somewhat different form – it has finally made it to its CPU lines with Skylake-X (SKL-X/EX/EP) – for now HEDT (i9) and Server (Xeon) – and hopefully to mainstream at some point.

Note it is rumoured the current Skylake (SKL)/Kabylake (KBL) are also supposed to support it based on core changes (widening of ports to 512-bit, unit changes, etc.) – nevertheless no public way of engaging them has been found.

AVX512 consists of multiple extensions and not all CPUs (or GPGPUs) may implement them all:

  • AVX512F – Foundation – most floating-point single/double instructions widened to 512-bit. [supported by SKL-X, Phi]
  • AVX512DQ – Double-Word & Quad-Word – most 32 and 64-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512BW – Byte & Word – most 8-bit and 16-bit integer instructions widened to 512-bit [supported by SKL-X]
  • AVX512VL – Vector Length eXtensions – most AVX512 instructions on previous 256-bit and 128-bit SIMD registers [supported by SKL-X]
  • AVX512CD – Conflict Detection – loop vectorisation through predication [not supported by SKL-X but Phi]
  • AVX512ER – Exponential & Reciprocal – transcedental operations [not supported by SKL-X but Phi]
  • more sets will be introduced in future versions

As with anything, simply doubling register width does not automagically increase performance by 2x as dependencies, memory load/store latencies and even data characteristics limit performance gains – some of which may require future arch or even tools to realise their true potential.

In this article we test AVX512 core performance; please see our other articles on:

Native SIMD Performance

We are testing native SIMD performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks SKL-X AVX512 SKL-X AVX2/FMA3 Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  1460 [+23%]  1180 For integer workloads we manage only 23% improvement, not quite the 100% we were hoping but still decent.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  519 [+19%]  435 With a 64-bit integer workload the improvement reduces to 19%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  7.72 [=]  7.62 No SIMD here
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1800 [+80%]  1000 In this floating-point test we finally see the power of AVX512 – it is 80% faster than AVX2/FMA3 – a huge improvement.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  1150 [+85%]  622 Switching to FP64 increases the improvement to 85% a huge gain.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  36 [+50%]  24 In this heavy algorithm using FP64 to mantissa extend FP128 we see only 50% improvement still nothing to ignore.
AVX512 cannot bring 100% improvement but does manage up to 85% improvement – a no mean feat! While integer workload is only 20-25% it is still decent. Heavy compute algorithms will greatly benefit from AVX512.
BenchCrypt Crypto SHA2-256 (GB/s) 26 [+78%]  14.6 With no data dependency – we get good scaling of almost 80% even with this integer workload.
BenchCrypt Crypto SHA1 (GB/s)  39.8 [+51%]  26.4 Here we see only 50% improvement likely due to lack of (more) memory bandwidth – it likely would scale higher.
BenchCrypt Crypto SHA2-512 (GB/s)  21.2 [+94%]  10.9 With 64-bit integer workload we see almost perfect scaling of 94%.
As we work on different buffers and have no dependencies, AVX512 brings up to 94% performance improvement – only limited by memory bandwidth with even 4 channel DDR4 @ 3200Mt/s not enough for 10C/20T CPU. AVX512 is absolutely worth it to drive the system to the limit.
BenchScience SGEMM (GFLOPS) float/FP32  558 [-7%]  605 Unfortunately the current compiler does not seem to help.
BenchScience DGEMM (GFLOPS) double/FP64  235 [+2%]  229 Changing to FP64 at least allows AVX512 to with by a meagre 2%.
BenchScience SFFT (GFLOPS) float/FP32  35.3 [=]  35.3 Again the compiler does not seem to help here.
BenchScience DFFT (GFLOPS) double/FP64  19.9 [-2%]  20.2 With FP64 nothing much happens.
BenchScience SNBODY (GFLOPS) float/FP32  585 [-1%]  591 No help from the compiler here either.
BenchScience DNBODY (GFLOPS) double/FP64  175 [-1%]  178 With FP64 workload nothing much changes.
With complex SIMD code – not written in assembler the compiler has some ways to go and performance is not great. But at least the performance is not worse.
CPU Image Processing Blur (3×3) Filter (MPix/s)  3830 [+60%]  2390 We start well here with AVX512 60% faster with float FP32 workload.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  1700 [+70%]  1000 Same algorithm but more shared data improves by 70%.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  885 [+56%]  566 Again same algorithm but even more data shared now brings the improvement down to 56%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  1290 [+56%]  826 Using two buffers does not change much still 56% improvement.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  136 [+59%]  85 Different algorithm keeps the AVX512 advantage the same at about 60%.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  65.6 [+31.7%]  49.8 Using the new scatter/gather in AVX512 still brings 30% better performance.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  3920 [+3%]  3800 Here we have a 64-bit integer workload algorithm with many gathers with AVX512 likely memory latency bound thus almost no improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  770 [+2%]  755 Again loads of gathers does not allow AVX512 to shine but still decent performance
As with other SIMD tests, AVX512 brings between 60-70% performance increase, very impressive. However in algorithms that involve heavy memory access (scatter/gather) we are limited by memory latency and thus we see almost no delta but at least it is not slower.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that even for a 1st-generation CPU with AVX512 support, SKL-X greatly benefits from the new instruction set – with anything between 50-95% performance improvement. However compiler/tools are raw (VC++ 2017 only added support in the recent 15.3 version) and performance sketchy where hand-crafted assembler is not used. But these will get better and future CPU generations (CFL-X, etc.) will likely improve performance.

Also let’s remember that some SKUs have 2x FMA (aka 512-bit) (and other instructions) licence – while most SKUs have only 1x FMA (aka 256-bit); the former SKUs likely benefit even more from AVX512 and it is something Intel may be more generous in enabling in future generations.

In algorithms heavily dependent on memory bandwidth or latency AVX512 cannot work miracles, but at least will extract the maximum possible compute performance from the CPU. SKUs with lower number of cores (8, 6, 4, etc.) likely to gain even more from AVX512.

In addition let’s not forget the “Phi” accelerators – that also support AVX512 – thus porting code will allow great performance on many-core (MIC) architecture too.

NUMA performance improvement for AMD ThreadRipper in Sandra SP2

What is NUMA?

Modern CPUs have had a built-in memory controller for many years now, starting with the K8/Opteron, in order to higher better bandwidth and lower latency. As a result in SMP systems each CPU has their own memory controller and its own system memory that it can access at high speed – while to access other memory it must send requests to the other CPUs. NUMA is a way of describing such systems and allow the operating system and applications to allocate memory on the node they are running on for best performance.

As ThreadRipper is really two (2) Ryzen dies connected internally through InfinityFabric – it is basically a 2-CPU SMP system and thus a 2-node NUMA system.

While it is possible to configure it in UMA (Uniform Memory Access mode) where all memory appears to be unified and interleaved between nodes, for best performance the NUMA mode is recommended when the operating system and applications support it.

While Sandra has always supported NUMA in the standard benchmarks – some of the new benchmarks have not been updated with NUMA support especially since multi-core systems have pretty much killed SMP systems on the desktop – with only expensive severs left to bring SMP / NUMA support.

Note that all the NUMA improvements here would apply to competitor NUMA (e.g. Intel) systems, thus it is not just for ThreadRipper – with EPYC systems likely showing a far higher improvement too.

In this article we test NUMA performance; please see our other articles on:

Native Performance

We are testing native performance using various instruction sets: AVX512, AVX2/FMA3, AVX to determine the gains the new instruction sets bring.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks NUMA 2-nodes
UMA single-node
Comments
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s)  965 [+2.8%]  938 The ‘lightest’ workload should show some NUMA overhead but we can only manage 3% here.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s)  312 [+2.3%] 305 With a 64-bit integer workload the improvement drops to 2%.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s)  10.9 [=]  10.9 Emulating int128 means far increased compute workload with NUMA overhead insignificant.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s)  997 [+1.2%]  985 Again no measured improvement here.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s)  562 [+1%]  556 Again no measured improvement here.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s)  27 [=]  26.85 In this heavy algorithm using FP64 to mantissa extend FP128 we see no improvement.
Fractals are compute intensive with few memory accesses – mainly to store results – thus we see a maximum of 3% improvement with NUMA support with the rest insignificant. However, this is a simple 2-node system – bigger 4/8-node systems would likely show bigger gains.
BenchCrypt Crypto AES256 (GB/s) 27.1 [+139%] 11.3 AES hardware accelerated is memory bandwidth bound thus NUMA support matters; even in this 2-node system we see a over 2x improvement of 139%!
BenchCrypt Crypto AES128 (GB/s) 27.4 [+142%] 11.3 Similar to above we see a massive 142% improvement by allocating memory on the right NUMA node.
BenchCrypt Crypto SHA2-256 (GB/s)  32.3 [+50%] 21.4 SHA is also hardware accelerated but operates on a single input buffer (with a small output hash value buffer) and here out improvement drops to 50%, still very much significant.
BenchCrypt Crypto SHA1 (GB/s) 34.2  [+56%]  21.8 Similar to above we see an even larger 56% improvement for supporting NUMA.
BenchCrypt Crypto SHA2-512 (GB/s)  6.36 [=]  6.35 SHA2-256 is not hardware accelerated (AVX2 used) but heavy compute bound thus our improvement drops to nothing.
Finally in streaming algorithms we see just how much NUMA support matters: even on this 2-note system we see over 2x improvement of 140% when working with 2 buffers (in/out). When using a single buffer our improvement drops to 50% but still very much significant. TR needs NUMA suppport to shine.
BenchScience SGEMM (GFLOPS) float/FP32  395 [114%]  184 As with crypto, GEMM benefits greatly from NUMA support with an incredible 114% improvement by allocating the (3) buffers on the right NUMA nodes.
BenchScience DGEMM (GFLOPS) double/FP64  183 [131%]  79 Changing to FP64 brings an even more incredible 131%.
BenchScience SFFT (GFLOPS) float/FP32  11.6 [86%]  6.25 FFT also shows big gains from NUMA support with 86% improvement just by allocating the buffers (2+1 const) on the right nodes.
BenchScience DFFT (GFLOPS) double/FP64  10.6 [112%]  5 With FP64 again increases
BenchScience SNBODY (GFLOPS) float/FP32  479 [=]  483 Strangely N-Body does not benefit much from NUMA support with no appreciable improvement.
BenchScience DNBODY (GFLOPS) double/FP64  189 [=]  191 With FP64 workload nothing much changes.
As with crypto, buffer heavy algorithms (GEMM, FFT, N-Body) greatly benefit from NUMA support with performance doubling (86-131%) by allocating on the right NUMA nodes; in effect TR needs NUMA in order to perform better than a standard Ryzen!
CPU Image Processing Blur (3×3) Filter (MPix/s)  2090 [+71%]  1220 Least compute brings highest benefit from NUMA support – here it is 71%.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  886 [=]  890 Same algorithm but more compute brings the improvement to nothing.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  494 [=]  495 Again same algorithm but even more compute again no benefit.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  720 [=]  719 Using two buffers does not seem to show any benefit either.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  116 [=]  117 Different algorithm keeps with more compute means no benefit either.
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  40.3 [=]  40.7 Using the new scatter/gather in AVX2 does not help matters even with NUMA support.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1880 [+90%]  982 Here we have a 64-bit integer workload algorithm with many gathers not compute heavy brings 90% improvement.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s)  397 [=]  396 Heavy compute brings down the improvement to nothing.
As with other SIMD tests,  low compute algorithms see 70-90% improvement from NUMA support; heavy compute algorithms bring the improvement down to zero. It all depends whether the overhead of accessing other nodes can be masked by compute; in effect TR seems to perform pretty well.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It is clear that ThreadRipper needs NUMA support in applications – just like any other SMP system today to shine: we see over 2x improvement in bandwidth-heavy algorithms. However, in compute-heavy algorithms TR is able to mask the overhead pretty well – with NUMA bringing almost no improvement. For non NUMA supporting software the UMA mode should be employed.

Let’s remember we are only testing a 2-node system, here, a 4+ node system is likely to show higher improvements and with EPYC systems stating at 1-socket 4-node we can potentially have common 4-socket 16-node systems that absolutely need NUMA for best performance. We look forward to testing such a system as soon as possible 😉

 

AMD Threadripper 1950X Review & Benchmarks – CPU 16-core/32-thread Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket – thus up to 32C/64T – can be enabled in the server (“EPYC”) designs – while current HEDT systems only use 2 – but AMD may release versions with more dies later on.

AMD Epyc/Threadripper DieIn this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Threadripper (1950X) with HEDT competition (Intel SKL-X) as well as normal desktop solutions (Ryzen, Skylake) which also serves to compare HEDT with the “normal” desktop solution.

CPU Specifications AMD Threadripper 1950X Intel i9 9700X (SKL-X) AMD Ryzen 1700X Intel i7 6700K (SKL) Comments
Cores (CU) / Threads (SP) 16C / 32T 10C / 20T 8C / 16T 4C / 8T Just as Ryzen, TR has the most cores though Intel has just announced new SKL-X with more cores.
Speed (Min / Max / Turbo) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 1.2-3.3-4.3GHz (12x-33x-43x) 2.2-3.4-3.9GHz (22x-34x-39x) [note ES sample] 0.8-4.0-4.2GHz (8x-40x-42x) SKL has the highest base clock but all CPUs have similar Turbo clocks
Power (TDP) 180W 150W 95W 91W TR has higher TDP than SKL-X just like Ryzen so may need a beefier cooling system
L1D / L1I Caches 16x 32kB 8-way / 16x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 4x 32kB 8-way / 4x 32kB 8-way TR and Ryzen’s instruction caches are 2x data (and SKL/X) but all caches are 8-way.
L2 Caches 16x 512kB 8-way (8MB total) 20x 1MB 16-way (20MB total) 8x 512kB 8-way (4MB total) 4x 256kB 8-way (1MB total) SKL-X has really pushed the boat out with a 1MB L2 cache that dwarfs all other CPUs.
L3 Caches 4x 8MB 16-way (32MB total) 13.75MB 11-way 2x 8MB 16-way (16MB total) 8MB 16-way TR actually has 2 sets of 2 L3 caches rather than a combined L3 cache like SKL/X.
NUMA Nodes
2x 16GB each no, unified 32GB no, unified 16GB no, unified 16GB Only TR has 2 NUMA nodes

Thread Scheduling and Windows

Threadripper’s topology (4 cores in each CCX, with 2 CCX in one node and 2 nodes) makes things even more compilcated for operating system (Windows) schedulers. Effectively we have a 2-tier NUMA SMP system where CCXes are level 1 and nodes are level 2 thus the scheduling of threads matters a lot.

Also keep in mind this is a NUMA system (2 nodes) with each node having its own memory; while for compatibility AMD recommends (and the BIOS defaults) to “UMA” (Unified) “interleaving across nodes” – for best performance the non-interleaving mode (or “interleaving across CCX”) should be used.

What all this means is that you likely need a reasonably new operating system – thus Windows 10 / Server 2016 – with a kernel that has been updated to support Ryzen/TR as Microsoft is not likely to care about old verions.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen/TR support all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 447 [-2%] 454 226 186 TR can keep up with SKL-X and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 459 [+1%] 456 236 184 An Int64 load does not change results.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 352 [+30%] 269 184 107 Finally TR soundly beats SKL-X by 30% and scales well vs. Ryzen.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 295 [+32%] 223 154 89 With a FP64 work-load the lead inceases slightly.
Unlike Ryzen which soundly dominated Skylake (albeit with 2x more cores, 8 vs. 4), Threadripper does not have the same advantage (16 vs. 10) thus it can only beat SKL-X in floating-point work-loads where it is 30% faster, still a good result.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 918 [-22%] 1180 535 527 With AVX2/FMA SKL-X is just too strong, with TR 22% slower.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 307 [-29%] 435 161 191 With Int64 AVX2 TR is almost 20% slower than SKL-X.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 7 [+30%] 5.4 3.6 2 This is a tough test using Long integers to emulate Int128 without SIMD and here TR manages to be 30 faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 996 [=] 1000 518 466 In this floating-point AVX2/FMA vectorised test  TR manages to tie with SKL-X.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 559 [-10%] 622 299 273 Switching to FP64 SIMD code, TR is now 10% slower than SKL-X.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 27 [+12%] 24 13.7 10.7 In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – TR manages a 12% win.
In vectorised AVX2/FMA code we see TR lose in most tests, or tie in one – and only shine in emulation tests not using SIMD instruction sets. Intel’s SIMD units – even without AVX512 that SKL-X brings – are just too strong for TR just as we saw Ryzen struggle against normal Skylake.
BenchCrypt Crypto AES-256 (GB/s) 27.1 [-21%] 34.4  14  15 All CPUs support AES HWA – but TR/Ryzen memory is just 2400Mt/s vs 3200 that SKL-X enjoys (+33%) thus this is a good result; TR seems to use its channels pretty effectively.
BenchCrypt Crypto AES-128 (GB/s)  27.4 [-18%]  33.5  14  15 Similar to what we saw above TR is just 18% slower which is a good result; unfortunately we cannot get the memory over 2400Mt/s.
BenchCrypt Crypto SHA2-256 (GB/s)  32.2 [+2.2x]
 14.6  17.1  5.9 Like Ryzen, TR’s secret weapon is SHA HWA which allows it to soundly beat SKL-X over 2.2x faster!
BenchCrypt Crypto SHA1 (GB/s) 34.2 [+30%] 26.4  17.7  11.3 Even with SHA HWA, the multi-buffer AVX2 implementation allows SKL-X to beat TR by 16% but it still scores well.
BenchCrypt Crypto SHA2-512 (GB/s)  6.34 [-41%]  10.9  3.35  4.38 SHA2-512 is not accelerated by SHA HWA (version 1) thus TR has to use the same vectorised AVX2 code thus is 41% slower.
TR’s secret crypto weapon (as Ryzen) is SHA HWA which allows it to soundly beat SKL-X even with 33% less memory bandwidth; provided software is NUMA-enabled it seems TR can effectively use its 4-channel memory controllers.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 436 [+35%] 322  234.6  129 In this non-vectorised test TR bets SKL-X by 35%. The choice for financial analysis?
BenchFinance Black-Scholes double/FP64 (MOPT/s)  366 [+32%]
277  198.6  109 Switching to FP64 code,TR still beats SKL-X by over 30%. So far so great.
BenchFinance Binomial float/FP32 (kOPT/s)  165 [+2.46x]
 67.3  85.6  27.25 Binomial uses thread shared data thus stresses the cache & memory system; we would expect TR to falter – but nothing of the sort – it is actually over 2.5x faster than SKL-X leaving it in the dust!
BenchFinance Binomial double/FP64 (kOPT/s)  83.7 [+27%]
 65.6  45.6  25.54 With FP64 code the situation changes somewhat – TR is only 27% faster but still an appreciable lead. Very strange not to see Intel dominating this test.
BenchFinance Monte-Carlo float/FP32 (kOPT/s)  91.6 [+42]
 64.3  49.1  25.92 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; TR reigns supreme being 40% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s)  68.7 [+34%]
 51.2  37.1  19 Switching to FP64, TR is just 34% faster but still a good lead
Intel should be worried: across all financial tests, 64-bit or 32-bit floating-point workloads TR soundly beats SKL-X by a big margin that even a 16-core version may not be able to match. But should these tests be vectorisable using SIMD – especially AVX512 – then we would fully expect Intel to win. But for now – for financial workloads there is only one choice: TR/Ryzen!!!
BenchScience SGEMM (GFLOPS) float/FP32  165 [?] 623  240.7  268 We need to implement NUMA fixes here to allow TR to scale.
BenchScience DGEMM (GFLOPS) double/FP64  75.9 [?]  216  102.2  92.2 We need to implement NUMA fixes here to allow TR to scale.
BenchScience SFFT (GFLOPS) float/FP32  16.6 [-51%]  34.3  8.57  19 FFT is also heavily vectorised but stresses the memory sub-system more; here TR cannot beat SKL-X and is 50% slower – but scales well against Ryzen.
BenchScience DFFT (GFLOPS) double/FP64  8 [-65%]  23.18  7.6  11.13 With FP64 code, the gap only widens with TR over 65% slower than SKL-X and little scaling over Ryzen.
BenchScience SNBODY (GFLOPS) float/FP32  456 [-22%]  587  234  272 N-Body simulation is vectorised but has many memory accesses to shared data – and here TR is only 22% slower than SKL-X but again scales well vs Ryzen.
BenchScience DNBODY (GFLOPS) double/FP64  173 [-2%]  178  87.2  79.6 With FP64 code TR almost catches up with SKL-X
With highly vectorised SIMD code TR cannot do as well – but an additional issue is that NUMA support needs to be improved – F/D-GEMM shows how much of a problem this can be as all memory traffic is using a single NUMA node.
CPU Image Processing Blur (3×3) Filter (MPix/s)  1470 [-6%] 1560  775  634 In this vectorised integer AVX2 workload TR does surprisingly well against SKL-X just 6% slower.
CPU Image Processing Sharpen (5×5) Filter (MPix/s)  617 [-10%]  693  327  280 Same algorithm but more shared data used sees TR now 10%, more NUMA optimisations needed.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s)  361 [-6%]  384  192  154 Again same algorithm but even more data shared now TR is 6% slower.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s)  570 [-6%]  609  307  271 Different algorithm but still AVX2 vectorised workload – TR is still 6% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s)  106 [+35%]  78.3  57.3  34.9 Still AVX2 vectorised code but TR does far better, it is no less than 35% faster than SKL-X!
CPU Image Processing Oil Painting Quantise Filter (MPix/s)  37.8 [-17%]  45.8  20  18.1 TR does worst here, it is 17% slower than SKL-X but still scales well vs. Ryzen.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s)  1260 [?]  4260  1160  2280 This 64-bit SIMD integer workload is a problem for TR but likely NUMA issue again as not much scaling vs. Ryzen.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 420 [-45%]  777  175  359 TR really does not do well here but does scale well vs. Ryzen, likely some code optimisation is needed.

As TR (like Ryzen) supports most modern instruction sets now (AVX2, FMA, AES/SHA HWA) it does well but generally not enough to beat SKL-X; unfortunately the latter with AVX512 can potentially get even faster (up to 100%) increasing the gap even more.

While we’ve not tested memory performance in this article, we see that in streaming tests (e.g. AES, SHA) – even more memory bandwidth is needed to feed all the 16 cores (32 threads) and being able to run the memory at higher speeds would be appreciated.

NUMA support is crucial – as non-NUMA algorithms take a big hit (see GEMM) where performance can be even lower than Ryzen. While complex server or scientific software won’t have this problem, most programs will not be NUMA aware.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. .Net 4.7.x (RyuJit), Java 1.8.x. Turbo / Dynamic Overclocking was enabled on both configurations.

VM Benchmarks AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS)  111 [+88%]  59  61.5  29 .Net CLR integer performance starts off very well with TR 88% faster than SKL-X an incredible result! This is *not* a fluke as Ryzen scores incredibly too.
BenchDotNetAA .Net Dhrystone Long (GIPS) 62.9 [+3%]  61  41  29 TR cannot match the same gain with 64-bit integer, but still just about manages to beat SKL-X.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS)  193 [+82%]  106  103  50 Floating-Point CLR performance is pretty spectacular with TR (like Ryzen) dominating – it is no less than 82% faster than SKL-X!
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS)  225 [+67%]  134  111  63 FP64 performance is also great with TR 67% faster than SKL-X an absolutely huge win!
It’s pretty incredible, for .Net applications TR – like Ryzen – is king! It is pretty incredible that is is between 60-80% faster in all tests (except 64-bit integer). With more and more applications (apps?) running under the CLR, TR (like Ryzen) has a bright future.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s)  195 [+38%]
141  92.6  53.4 In this non-vectorised test, TR is almost 40% faster than SKL-X not as high as what we’ve seen before but still significant.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s)  192 [+34%]
 143  97.6  56.5 With 64-bit integer workload this time we see no changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s)  626 [+27%]
 491  347  241 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code; Intel strikes back through its SIMD units but TR is a comfortably 27% faster than it.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s)  344 [+14%]
 301  192  135 Switching to FP64 SIMD vector code – still running AVX/FMA – TR’s lead falls to 14% but it is still a win!
Taking advantage of RyuJit’s support for vectors/SIMD (through SSE2, AVX/FMA) allows SKL-X to gain some traction – TR remains very much faster up to 40%. Whatever the workload, it seems TR just loves it.
Java Arithmetic Java Dhrystone Integer (GIPS)  1000 [+16%]  857 JVM integer performance is only 16% faster on TR than SKL-X – but a win is a win.
Java Arithmetic Java Dhrystone Long (GIPS)  974 [+26%]  771 With 64-bit integer workloads, TR is now 26% faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS)  231 [+48%]  156 With a floating-point workload TR increases its lead to a massive 48%, a pretty incredible result.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS)  183 [+14%]  160 With FP64 workload the gap reduces way down to 14% but it is still faster than SKL-X.
Java performance is not as incredible as we’ve seen with .Net, but TR is still 15-50% faster than SKL-X – no mean feat! Again if you have Java workloads, then TR should be the CPU of choice.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s)  200 [+45%]  137 The JVM does not support SIMD/vectors, thus TR uses its scalar prowess to be 45% faster.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s)  186 [+33%]  139 With 64-bit vectorised workload Ryzen is still 33% faster.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s)  169 [+69%]  100 With floating-point, TR is a massive 69% faster than SKL-X a pretty incredible result.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s)  159 [+59%]  100 With FP64 workload TR’s lead falls just a little to 59% – a huge win over SKL-X.
Java’s lack of vectorised primitives to allow the JVM to use SIMD instruction sets (aka SSE2, AVX/FMA) gives TR (like Ryzen) free reign to dominate all the tests, be they integer or floating-point. It is pretty incredible that neither Intel CPU can come close to its performance.

TR (like Ryzen) absolutely dominates .Net and Java benchmarks with CLR and JVM code running much faster than the latest Intel SKL-X – thus current and future applications running under CLR (WPF/Metro/UWP/etc.) as well as server JVM workloads run great on TR. For .Net and Java code, TR is the CPU to get!

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

It may be difficult to decide whether AMD’s design (multiple CCX units, multiple dies/nodes on a socket) is “cool” and supporting it effectively is not easy for programmers – be they OS/kernel or application – but when it works it works extremely well! There is no doubt that Threadripper can beat Skylake-X at the same cost (approx 1,000$) though using more coress just as its little (single-die) brother Ryzen.

Scalar workloads, .Net/Java workloads just fly on it – but highly vectorised AVX2/FMA workloads only perform competitively; unfortunately once AVX512 support is added SKL-X is likely to dominate effectively these workloads though for now it’s early days.

It’s multiple NUMA node design – unless running in UMA (unified) mode – requires both OS and application support, otherwise performance can tank to Ryzen levels; while server and scientific programs are likely to be so – this is a problem for most applications. Then we have its dual-CCX design which further complicate workloads, effectively being a 2nd NUMA level; we can see inter-core latencies being 4 tiers while SKL-X only has 2 tiers.

In effect both platforms will get better in the future: Intel’s SKL-X with AVX512 support and AMD’s Threadripper with NUMA/CCX memory optimisations (and hopefully AVX512 support at one point). Intel are also already launching newer versions with more cores (up to 18C/36T) while AMD can release some server EPYC versions with 4 dies (and thus up to 32C/64T) that will both push power envelopes to the maximum.

For now, Threadripper is a return to form from AMD.

AMD Threadripper 1950X Review & Benchmarks – 4-channel DDR4 Cache & Memory Performance

What is “Threadripper”?

“Threadripper” (code-name ZP aka “Zeppelin”) is simply a combination of inter-connected Ryzen dies (“nodes”) on a single socket (TR4) that in effect provide a SMP system-on-a-single-socket – without the expense of multiple sockets, cooling solutions, etc. It also allows additional memory channels (4 in total) to be provided – thus equaling Intel’s HEDT solution.

It is worth noting that up to 4 dies/nodes can be provided on the socket – thus up to 32C/64T – can be enabled in the server (“EPYC”) designs – while current HEDT systems only use 2 – but AMD may release versions with more dies later on. The large socket allows for 4 DDR4 memory channels greatly increasing bandwidth over Ryzen, just as with Intel.

AMD Threadripper die

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the 2nd-from-the-top Ryzen (1700X) with previous generation competing architectures (i7 Skylake 4C and i7 Haswell-E 6C) with a view to upgrading to a mid-range high performance design. Another article compares the top-of-the-range Ryzen (1800X) with the latest generation competing architectures (i7 Kabylake 4C and i7 Broadwell-E 8C) with a view to upgrading to the top-of-the-range design.

CPU Specifications AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
TLB 4kB pages
64 full-way
1536 8-way
64 8-way
1536 6-way
64 full-way
1536 8-way
64 8-way
1536 6-way
TR/Ryzen has comparatively “better” TLBs 8-way vs 6-way and full-way vs 8-way.
TLB 2MB pages
64 full-way
1536 2-way
8 full-way
1536 6-way
64 full-way
1536 2-way
8 full-way
1536 6-way
Nothing much changes for 2MB pages with TR/Ryzen leading the pack again.
Memory Controller Speed (MHz) 600-1200 800-3300 600-1200 800-4000 TR/Ryzen’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (Mhz) Max
2400 / 2666 2533 / 2400 2400 / 2666 2533 / 2400 TR/Ryzen supports up to 2666MHz memory but is happier running at 2400; SKL/X supports only up to 2400 officially but happily runs at 3200MHz a big advantage.
Memory Channels / Width
4 / 256-bit 4 / 256-bit 2 / 128-bit 2 / 128-bit Both TR and SKL-X enjoy 256-bit memory channels.
Memory Timing (clocks)
14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T Despite faster memory, TR/Ryzen can run lower timings than HSW-E and SKL reducing its overall latencies.

Core Topology and Testing

As discussed in the previous article, cores on TR/Ryzen are grouped in blocks (CCX or compute units) each with its own 8MB L3 cache – but connected via a 256-bit bus running at memory controller clock. This is better than older designs like Intel Core 2 Quad or Pentium D which were effectively 2 CPU dies on the same socket – but not as good as a unified design where all cores are part of the same unit.

Running algorithms that require data to be shared between threads – e.g. producer/consumer – scheduling those threads on the same CCX would ensure lower latencies and higher bandwidth which we will test with presently.

In addition, Threadripper is a NUMA SMP design – with the other nodes effectively different CPUs; thus sharing data between cores on different nodes is equivalent to different CPUs in a SMP system.

We have thus modified Sandra’s ‘CPU Multi-Core Efficiency Benchmark‘ to report the latencies of each producer/consumer unit combination (e.g. same core, same CCX, different CCX) as well as providing different matching algorithms when selecting the producer/consumer units: best match (lowest latency), worst match (highest latency) thus allowing us to test inter-CCX bandwidth also. We hope users and reviewers alike will find the new features useful!

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). TR (like Ryzen) supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks AMD Threadripper 1950X Intel 9700X (SKL-X) AMD Ryzen 1700X Intel 6700K (SKL) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s)  92.2 [+7%]  85.5  47.2  39.5 With 16 cores (and thus 16 pairs) TR’s inter-core bandwidth beats SKL-X by over 7% – assuming threads are scheduled correctly.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 7.51 [1/3]  24.4  5.75  16 In worst-case pairs on TR go not to just different CCX but NUMA nodes thus bandwidth is 1/3 that of SKL-X.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns)  15.4 [-1%]
15.8  15.5  16.1 Within the same core (sharing L1D/L2) , TR/Ryzen inter-unit is ~15ns comparative with both Intel’s CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Core (ns)  46.4 [-36%]  72.3  44.3  45 Within the same compute unit (sharing L3), the latency is ~45ns is much lower than SKL-X
CPU Multi-Core Benchmark Inter-Unit Latency – Different CCX (ns)  184.7 [+4x]  135 Going inter-CCX increases the latency by 4 times thus threads sharing data must be properly scheduled.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Node(ns)  274.4 [+6x] Going inter-node increases the latency yet again by 6 times, thus scheduling is everything.
The multiple CCX design does present some challenges to programmers and threads will have to be carefully scheduled – as latencies are much larger than inter-core; going off node increases latencies yet again but not by a lot; if anything inter-node interconnect seems pretty low latency comparatively.
Aggregated L1D Bandwidth (GB/s)  1372 [-40%] 2252  739  878 SKL/X has 512-bit data ports (for AVX512) so TR/Ryzen cannot compete but they would do better against older designs.
Aggregated L2 Bandwidth (GB/s)  990 [-2%]  1010  565  402 The 16x L2 caches have similar bandwidth to the 10x much bigger caches on SKL-X.
Aggregated L3 Bandwidth (GB/s)  749 [+2.6x]
 289  300  247 The 4x L3 caches have much higher bandwidth than the single SKL-X cache.
Aggregated Memory (GB/s)  56 [-18%]  69  28  31 Running at lower memory speed TR cannot beat SKL-X but has comparatively higher memory efficiency
Even with 16x L1D and L2 caches, TR cannot match the much faster SKL-X 10x caches – that have been updated for 512-bit support but they are competitive; the 4x L3 caches do soundly beat the unified one on SKL-X but then again sharing data not within the same CCX is going to be very much slower.

At 2400Mt/s TR is running 33% slower than SKL-X at 3200Mt/s but its bandwidth is just 18% lower – thus its 4x DDR4 controllers are more efficient – not something we’re used to seeing.

Data In-Page Random Latency (ns)  72.8 [4-17-37] [+2.75x]  26.4 [4-13-33]  70.7 [4-17-37]  20 [4-12-21] What we saw previously with Ryzen was not accident; TR also suffers from surprisingly large in-page latency, almost 3x of Intel designs. Either the TLBs are very slow or not working.
Data Full Random Latency (ns)  111.5 [4-17-44] [+47%]  75.5 [4-13-70]  87.9 [4-17-37]  65 [4-12-34] Out-of-page latencies are ‘better’ with TR/Ryzen ‘only’ ~50% slower than SKL/X.
Data Sequential Latency (ns)  5.5 [4-7-8] [=]  5.4 [4-11-13]  3.8 [4-7-8]
 4.1 [4-12-13] TR’s prefetchers are working well with sequential access pattern latency at ~5ns matching SKL-X.
We finally discover an issue – TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge – almost 3x higher than Intel’s. It is a mystery as to why, as both out-of-page random and sequential are competitive. It does point to something with the TLBs as to whether they do work or are just very much slower for some reason.
Code In-Page Random Latency (ns)  17.2 [4-10-26] [+43%] 12 [4-14-28]  16.1 [4-9-25]  10 [4-11-21] With code we don’t see the same problem – with in-page latency a bit higher than SKL-X (40%) but nowhere as high as what we saw before.
Code Full Random Latency (ns)  178 [4-15-60] [+2x]  86.1 [4-16-106]  95.4 [4-13-49]  70 [4-11-47] Out-of-page latency is a bit higher than SKL-X but not as bad as before.
Code Sequential Latency (ns)  8.7 [4-10-20] [+33%]  6.5 [4-7-12]  8.4 [4-9-18]  5.3 [4-9-20] Ryzen’s prefetchers are working well with sequential access pattern latency at ~9ns and thus 33% higher than SKL-X.
While code access latencies are higher than the new SKL-X – they are comparative with the older designs and not as bad as we’ve seen with data. Overall it seems TR (like Ryzen) will need some memory controller optimisations regarding latencies – though bandwidth seems just great.
Memory Update Transactional (MTPS)  1.9 52.2 [HLE]  4.18  32.4 [HLE] SKL/X is in a world of its own due to support for HLE/RTM and there is not much TR/Ryzen can do about it.
Memory Update Record Only (MTPS)  1.88  57.23 [HLE]  4.22  25.4 [HLE] We see a similar pattern here.
Without HLE/RTM TR (like Ryzen) don’t have much chance against SKL/X but considering support for it is disabled in most SKUs, there’s not much AMD has to be worried about – no to mention Intel disabling it in the older HSW and BRW designs. But should AMD enable it in future designs Intel will have a problem on its hands…

Threadripper’s core, memory and cache bandwidths are great, in many cases much higher than its Intel rivals partly due to more cores and more caches (16 vs 10); overall latencies are also fine for caches and memory – except the crucial ‘in-page random access’ data latencies which are far higher – about 3 times – TLB issues? We’ve been here before with Bulldozer which could not be easily fixed – but if AMD does manage it this time Ryzen’s performance will literally fly!

Still, despite this issue we’ve seen in the previous article that TR’s CPU performance is very strong thus it may not be such a big problem.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

TR’s memory performance is not the clean-sweep we’ve seen in CPU testing but it is competitive with Intel’s designs,and especially against older designs. The bandwidths are all competitive and especially the memory controllers seem to be more efficient – but latencies are a bit of a problem which AMD may have to improve in future designs.

Overall we’d still recommend TR over Intel CPUs unless you want absolutely tried and tested design which have already been patched by microcode and firmware/BIOS updates.

Intel Core i9 (SKL-X) Review & Benchmarks – 4-channel @ 3200Mt/s Cache & Memory Performance

Intel Skylake-X Core i9

What is “SKL-X”?

“Skylake-X” (E/EP) is the server/workstation/HEDT version of desktop/mobile Skylake CPU – the 6-th gen Core/Xeon replacing the current Haswell/Broadwell-E designs. It naturally does not contain an integrated GPU but what does contain is more cores, more PCIe lanes and more memory channels (up to 6 64-bit) for huge memory bandwidth.

While it may seem an “old core”, the 7-th gen Kabylake core is not much more than a stepping update with even the future 8-th gen Coffeelake rumored again to use the very same core. But what it does do is include the much expected 512-bit AVX512 instruction set (ISA) that are are not enabled in the current desktop/mobile parts.

SKL-X does not only support DDR4 but also NVM-DIMMs (non-volatile memory DIMMs) and PMem (Persistent Memory) that should revolutionise future computing with no need for memory refresh or immediate sleep/resume (no need to save/restore memory from storage).

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the top-end desktop Core i9 with current competing architectures from both AMD and Intel as well as its previous version.

CPU Specifications Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
TLB 4kB pages
64 4-way / 64 8-way
1536 8-way
64 full-way
1536 8-way
64 4-way / 64 8-way
1536 6-way
64 4-way
1024 8-way
Ryzen has comparatively ‘better’ TLBs than all Intel CPUs.
TLB 2MB pages
8 full-way
1536 2-way
64 full-way
1536 2-way
8 full-way
1536 6-way
8 full-way
1024 8-way
Again Ryzen has ‘better’ TLBs than all Intel versions
Memory Controller Speed (MHz) 800-3300 600-1200 800-4000 1200-4000 Intel’s UNC clock runs higher than Ryzen
Memory Speed (Mhz) Max
3200 / 2667 2400 / 2667 2533 /2667 2133 / 2133 SKL-X can officially go as high as Ryzen and normal SKL @ 2667 but runs happily at 3200Mt/s.
Memory Channels / Width
4 / 256-bit (max 8 / 384-bit) 2 / 128-bit 2 / 128-bit 4 / 256-bit SKL-X has 2 memory controllers each with up to 3 channels each for massive memory bandwidth.
Memory Timing (clocks)
16-18-18-36 6-54-19-4 2T 14-16-16-32 7-54-18-9 2T 16-18-18-36 5-54-21-10 2T 14-15-15-36 4-51-16-3 2T SKL-X can run as tight timings as normal SKL or Ryzen.

Core Topology and Testing

Intel has dropped the (dual) ring bus(es) and instead opted for a mesh inter-connect between cores; on desktop parts this should not cause latency differences between cores (as with Ryzen) but on high-end server parts with many cores (up to 28) this may not be the case. The much increased L2 cache (1MB vs. old 256kB) should alleviate this issue – though the L3 cache seems to have been reduced quite a bit.

Native Performance

We are testing bandwidth and latency performance using all the available SIMD instruction sets (AVX, AVX2/FMA, AVX512) supported by the CPUs.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. Turbo / Dynamic Overclocking was enabled on both configurations.

Native Benchmarks Intel i9 7900X (Skylake-X) AMD Ryzen 1700X Intel i7 6700K (Skylake) Intel i7 5820K (Haswell-E) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 87 [+85%] 47.7 39 46 With 10 cores SKL-X has massive aggregated inter-core bandwidth, almost 2x Ryzen or HSW-E.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 19 [+46%] 13 16 17 In worst-case pairs  SKL-X does well but not far away from normal SKL or HSW-E.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 15.2 15.7 16 13.4 [-12%]
Within the same core all modern CPUs seem to have about 15-16ns latency.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 80 45 [-43%] 49 58 Surprisingly we see massive latency increase almost 2x Ryzen or SKL.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 131 Naturally Ryzen scores worst when going off-CCX.
It seems the mesh inter-connect between cores has decent bandwidth but much higher latency than the older HSW-E or even the current SKL.
Aggregated L1D Bandwidth (GB/s) 2200 [+3x] 727 878 1150 SKL-X has 512-bit data ports thus massive L1D bandwidth over 2x HSW-E and 3x over Ryzen.
Aggregated L2 Bandwidth (GB/s) 1010 [+81%] 557 402 500 The large L2 caches also have 2x more bandwidth than either HSW-E or Ryzen.
Aggregated L3 Bandwidth (GB/s) 289 392 [+35%] 247 205 The 2 Ryzen L3 caches have higher bandwidth than all Intel CPUs.
Aggregated Memory (GB/s) 69.3 [+2.4x] 28.5 31 42.5 With its 4 channels SKL-X reigns supreme with almost 2.5x more bandwidth than Ryzen.
The widened ports on the L1 and L2 caches allow SKL-X to demolish the competition with over 2x more bandwidth than either Ryzen or older HSW-E; only the smaller L3 cache falters. Its 4 channels running at 3200Mt/s yield huge memory bandwidth that greatly help streaming algorithms. SKL-X is a monster – we can only speculate what the server 6-channel version would score.
Data In-Page Random Latency (ns) 26 [1/2.84x] (4-13-33) 74 (4-17-36) 20 (4-12-21) 25 (4-12-26) SKL-X has comparable lantecy with SKL and HSW-E and much better than Ryzen.
Data Full Random Latency (ns) 75 [-21%] (4-13-70) 95 (4-17-37) 65 (4-12-34) 72 (4-13-52) Full random latencies are a bit higher than expected but on part with HSW-E and better than Ryzen.
Data Sequential Latency (ns) 5.4 [+28%] (4-11-13) 4.2 (4-7-7) 4.1 (4-12-13) 7 (4-12-13) Strangely SKL-X does not do as well as SKL here or Ryzen but at least it beats HSW-E.
If you were hoping SKL-E to match normal SKL that is sadly not the case even at similar Turbo clock they are higher across the board, even allowing Ryzen a win. Perhaps further platform optimisations are needed.
Code In-Page Random Latency (ns) 12 [-27%] (4-14-28) 16.6 (4-9-25) 10 (4-11-21) 15.8 (3-20-29) With code SKL-X performs better though not enough to catch normal SKL.
Code Full Random Latency (ns) 86 [-15%] (4-16-106) 102 (4-13-49) 70 (4-11-47) 85 (3-20-58) Out-of-page code latency takes a bigger hit but nothing to worry about.
Code Sequential Latency (ns) 6.5 [-27%] (4-7-12) 8.9 (4-9-18) 5.3 (4-9-20) 10.2 (3-8-16) Again nothing much changes here.
SKL-X again does not manage to match normal SKL but soundly trounces both Ryzen and its older HSW-E brother, delivering a good result overall. Code access seems to perform more consistently than data for some reason we need to investigate.
Memory Update Transactional (MTPS) 52.2 [+12x] HLE 4.23 32.4 HLE 7 SKL-X with working HLE is over 12-times faster than Ryzen and older HSW-E.
Memory Update Record Only (MTPS) 57.2 [+13.6x] HLE 4.19 25.4 HLE 5.47 SKL-X is king of the hill with nothing getting close.
Yes – Intel has finally fixed HLE/RTL which owners of HSW-E and BRW-E must feel very hard done by considering it was “working” before having it disabled due to the errata. Thus after so many years we have both HLE, RTL and AVX512! Great!

If there was any doubt, SKL-X does not disappoint – massive cache (L1D and L2) aggregate and memory bandwidths with server versions likely even more; the smaller L3 cache does falter though which is a bit of a surprise – the larger L2 caches must have forced some compromises to be made.

Latency is a bit disappointing compared to the “normal” SKL/KBL we have on desktop, but are still better than older HSW-E and also Ryzen competitor. Again the L1 and L2 caches (despite being 4-times bigger) clock latencies are OK with the L3 and memory controller being the source of the increased latencies.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

After a strong CPU performance we did not expect the cache and memory performance to disappoint – and it does not. SKL-X is a big improvement over the older versions (HSW-E) and competition with few weaknesses.

The mesh interconnect does seem to exhibit higher inter-core latencies with small increase in bandwidth; perhaps this can be fixed.

The very much reduced L3 cache does disappoint both bandwidth and latency wise; the memory controllers provide huge bandwidth but at the expense of higher latencies.

All in all, if you can afford it, there is no question that SKL-X is worth it. But better wait to see what AMD’s Threadripper has in store before making your choice… 😉