AMD Radeon 5700XT: Navi GPGPU Performance in OpenCL

What is “Navi”?

“Navi” is the code-name of AMD’s new GPU, the first of the brand-new RDNA (Radeon DNA) architecture – replacing “Vega”, the last of the GCN (Graphics Core Next) architecture. It is a mid-range GPU optimised for gaming and thus not expected to set records, but GPUs today are used for many other tasks as well (mining, encoding, algorithm/compute acceleration, etc.).

The RDNA architecture brings big changes from the various GCN revisions we’ve seen previously, but this first iteration does not add any major new features, at least in the compute domain. Hopefully the next versions will bring tensor units (matrix multipliers) and other accelerated instruction sets.

See these other articles on GPGPU performance:

Hardware Specifications

We are comparing the mid-range Radeon with previous-generation cards and competing architectures, with a view to upgrading to a mid-range, high-performance design.

GPGPU Specifications | AMD Radeon 5700XT (Navi) | AMD Radeon VII (Vega2) | nVidia Titan X (Pascal) | AMD Radeon 56 (Vega1) | Comments
Arch Chipset | RDNA / Navi 10 | GCN5.1 / Vega 20 | Pascal / GP102 | GCN5.0 / Vega 10 | The first of the Navi chips.
Cores (CU) / Threads (SP) | 40 / 2560 | 60 / 3840 | 28 / 3584 | 56 / 3584 | Fewer CUs than Vega1 but the same (64x) SPs per CU.
SIMD per CU / Width | 2 / 32 [2x] | 4 / 16 | n/a | 4 / 16 | Navi doubles the SIMD width but halves the count per CU.
Wave/Warp Size | 32 [1/2x] | 64 | 32 | 64 | Wave size is halved to match nVidia.
Speed (Min-Turbo) | 1.6 / 1.755GHz | 1.4 / 1.75GHz | 1.531 / 1.91GHz | 1.156 / 1.471GHz | ~40% faster base and ~20% faster turbo than Vega1.
Power (TDP) | 225W | 295W | 250W | 210W | Slightly higher TDP than Vega1 but nothing significant.
ROP / TMU | 64 / 160 | 64 / 240 | 96 / 224 | 64 / 224 | ROPs are the same but we see ~30% fewer TMUs.
Shared Memory | 64kB [+2x] | 32kB | 48kB / 96kB per SM | 32kB | 2x more shared memory, allowing bigger kernels.
Constant Memory | 4GB | 8GB | 64kB dedicated | 4GB | No dedicated constant memory, but large.
Global Memory | 8GB GDDR6 14Gt/s 256-bit | 16GB HBM2 2Gt/s 4096-bit | 12GB GDDR5X 10Gt/s 384-bit | 8GB HBM2 1.6Gt/s 2048-bit | Sadly no HBM this time; faster-clocked but much narrower.
Memory Bandwidth (GB/s) | 448 [+9%] | 1024 | 512 | 410 | Still, bandwidth is 9% higher than Vega1’s.
L1 Caches | ? x 40 | 16kB x 60 | 48kB x 28 | 16kB x 56 | L1 does not appear to have changed, but details are unclear.
L2 Cache | 4MB | 4MB | 3MB | 4MB | L2 has not changed.
Maximum Work-group Size | 1024 / 1024 | 256 / 1024 | 1024 / 2048 per SM | 256 / 1024 | AMD has increased the default work-group size 4x (to 1024).
FP64/double ratio | 1/16x | 1/4x | 1/32x | 1/16x | Same ratio as consumer Vega1 rather than pro Vega2.
FP16/half ratio | 2x | 2x | 1/64x | 2x | Same 2x ratio as the other AMD cards.
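
The wave-size change is visible to OpenCL code. As a minimal sketch (assuming the AMD cl_amd_device_attribute_query extension is exposed by the driver), a host program can query the native wavefront width, which should report 32 on Navi and 64 on the GCN cards:

```c
#include <stdio.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>   /* defines CL_DEVICE_WAVEFRONT_WIDTH_AMD on AMD platforms */

/* Print the native wavefront width of the first GPU device. Assumes the
   cl_amd_device_attribute_query extension is present (AMD drivers). */
int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_uint wave = 0;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    if (clGetDeviceInfo(dev, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                        sizeof(wave), &wave, NULL) == CL_SUCCESS)
        printf("wavefront width: %u\n", wave);
    else
        printf("cl_amd_device_attribute_query not supported\n");
    return 0;
}
```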

Processing Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 18,265 [-7%] 29,057 245 19,580 Navi starts well but cannot beat Vega1.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 11,863 [-13%] 17,991 17,870 13,550 Standard FP32 increases the gap to 13%.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 1,047 [-16%] 5,031 661 1,240 FP64 does not change much, Navi is 16% slower.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 43 [-45%] 226 25 77 Emulated FP128 is hard on FP64 units and here Navi is almost 1/2 Vega1.
Starting out, Navi does not seem able to beat Vega1 in heavily vectorised compute loads: FP16 is the most efficient (almost parity), while emulated FP128 runs at about half Vega1’s speed.
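
For reference, this is the kind of iterative, compute-dense workload the Mandelbrot tests represent – an illustrative OpenCL C kernel (not SiSoftware’s own implementation), one work-item per pixel:

```c
/* Minimal FP32 Mandelbrot kernel: iterate z = z^2 + c until escape. */
__kernel void mandel_fp32(__global uint *out, float x0, float y0,
                          float dx, float dy, uint max_iter)
{
    const int px = get_global_id(0), py = get_global_id(1);
    const float cr = x0 + px * dx, ci = y0 + py * dy;
    float zr = 0.0f, zi = 0.0f;
    uint  i  = 0;
    while (i < max_iter && zr * zr + zi * zi < 4.0f) {
        float t = zr * zr - zi * zi + cr;   /* real part of z^2 + c */
        zi = 2.0f * zr * zi + ci;           /* imaginary part       */
        zr = t;
        ++i;
    }
    out[py * get_global_size(0) + px] = i;  /* iteration count = colour */
}
```

The FP16/FP64/FP128 variants follow the same structure at different precisions, which is why the results track the hardware’s per-precision rates so closely.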
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 51 [-25%] 91 42 67 Despite more bandwidth Navi is 25% slower than Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 176 [+40%] 209 145 125 Navi shows its power here beating Vega1 by a huge 40%!
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 76 32
Despite GDDR6’s higher bandwidth, streaming algorithms still work better on the “old” HBM2, thus Navi cannot beat Vega. But in pure integer-compute algorithms like hashing it is faster by a significant amount, which bodes well for the future.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 12,459 [+31%] 23,164 11,480 9,500 In this FP32 financial workload Navi is 30% faster than Vega1!
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 1,370 1,880
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 850 [1/3x] 3,501 2,240 2,530 Binomial uses thread shared data thus stresses the memory system and here we have some optimisation to do.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 129 164
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 5,027 [+30%] 6,249 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Navi is again 30% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 294 472
For financial FP32 workloads, Navi is ~30% faster than Vega1 – a pretty good improvement – though it naturally cannot compete with Vega2 in FP64 due to the consumer ratio (1/16x). Crypto-currency fans will love Navi.
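
As an illustration of what these financial tests compute (a sketch, not the benchmark’s own kernel), here is a standard Black-Scholes call-option kernel in FP32, one option per work-item, using the OpenCL built-in erf() for the cumulative normal:

```c
/* Cumulative normal distribution via the built-in error function. */
inline float cnd(float x) { return 0.5f * (1.0f + erf(x * M_SQRT1_2_F)); }

__kernel void black_scholes(__global const float *S,  /* spot prices    */
                            __global const float *K,  /* strike prices  */
                            __global const float *T,  /* times to expiry */
                            float r, float v,         /* rate, volatility */
                            __global float *call)
{
    const int i = get_global_id(0);
    const float sqrtT = sqrt(T[i]);
    const float d1 = (log(S[i] / K[i]) + (r + 0.5f * v * v) * T[i]) / (v * sqrtT);
    const float d2 = d1 - v * sqrtT;
    call[i] = S[i] * cnd(d1) - K[i] * exp(-r * T[i]) * cnd(d2);
}
```

Black-Scholes is embarrassingly parallel with no shared data, which is why it scales so cleanly with raw FP throughput, unlike Binomial below.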
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 5,165 [+2%] 6,634 6,073 5,066 GEMM can only bring a measly 2% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 340 620
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 376 [+2%] 643 235 369 FFT loves HBM but Navi is still 2% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 207 175
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 4,534 [-6%] 6,846 5,720 4,840 Navi can’t manage as well in N-Body and ends up 6% slower.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 275 447
The scientific scores don’t show the same improvement as the financial ones, likely due to heavy use of shared memory, with Navi just matching Vega1. Perhaps the larger shared memory can allow larger work-groups (see the sketch below).
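
Exploring that optimisation means asking the runtime what a given kernel can actually use. A minimal host-side sketch (assuming an already-built cl_kernel and its cl_device_id):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Report the per-kernel work-group limit and the device's local (LDS)
   memory size -- the two limits worth tuning against on Navi
   (1024 work-items and 64kB respectively, per the table above). */
static void report_limits(cl_kernel kernel, cl_device_id device)
{
    size_t   max_wg    = 0;  /* largest work-group this kernel supports */
    cl_ulong local_mem = 0;  /* local memory bytes per work-group       */

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    printf("max work-group %zu, local memory %llu bytes\n",
           max_wg, (unsigned long long)local_mem);
}
```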
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 8,674 [1/2.1x] 25,418 18,410 19,130 In this 3×3 convolution algorithm, Navi is 1/2x the speed of Vega1.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,734 [1/3x] 5,275 5,000 4,340 Same algorithm but more shared data makes Navi even slower.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 1,802 [1/2.5x] 5,510 5,080 4,450 With even more data the gap remains at 1/2.5x.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,723 [1/2.5x] 5,273 4,800 4,300 Still convolution but with 2 filters – same 1/2.5x performance.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 48.44 [=] 92.53 37 48 Different algorithm allows Navi to tie with Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 97.34 [+2.5x] 57.66 12.7 38 Without major processing, this filter performs well on Navi.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 32,050 [+1.5x] 47,349 19,480 20,880 This algorithm is 64-bit integer heavy and Navi is 50% faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 9,516 [+1.6x] 7,708 305 6,000 One of the most complex and largest filters, Navi is again ~60% faster.
For image processing using FP32 precision, Navi goes from 1/2.5x Vega1 performance (convolution) to ~50-60% faster (complex algorithms with integer processing). It seems the convolution algorithms need some optimisation.
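
For context, the convolution pattern these filters share looks like the following illustrative OpenCL C kernel (a sketch, not the benchmark’s own code) – a 3×3 box blur with clamped borders, one work-item per output pixel:

```c
__kernel void blur3x3(__global const float *src, __global float *dst,
                      int w, int h)
{
    const int x = get_global_id(0), y = get_global_id(1);
    if (x >= w || y >= h) return;

    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)       /* 3x3 neighbourhood */
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = clamp(x + dx, 0, w - 1);   /* clamp at borders */
            int sy = clamp(y + dy, 0, h - 1);
            acc += src[sy * w + sx];
        }
    dst[y * w + x] = acc * (1.0f / 9.0f);  /* normalise the box sum */
}
```

Neighbouring work-items re-read overlapping pixels, so convolutions are dominated by how well the cache/LDS hierarchy absorbs that shared traffic – which is exactly where Navi currently falls behind Vega1.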

Memory Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon 5700XT (Navi) AMD Radeon VII (Vega2) nVidia Titan X (Pascal) AMD Radeon 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 376 [+13%] 627 356 333 Navi’s GDDR6 manages 13% more bandwidth than Vega1.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 21.56 [+77%] 12.37 11.4 12.18 PCIe 4.0 brings almost 80% more upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 22.28 [+84%] 12.95 12.2 12.08 Again almost 2x more bandwidth.
Navi’s PCIe 4.0 interface (on 500-series motherboards) brings, as expected, almost 2x more upload/download bandwidth, while its high-clocked GDDR6 manages just over 10% higher internal bandwidth than Vega1’s HBM2.
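
Transfer bandwidth like this can be approximated with a profiling-enabled queue and a timed write. A rough sketch (the queue must be created with CL_QUEUE_PROFILING_ENABLE; error handling omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

/* Time one large host->device transfer with an OpenCL profiling event.
   Since the timestamps are in nanoseconds, bytes/ns equals GB/s. */
static double upload_gbs(cl_context ctx, cl_command_queue q, size_t bytes)
{
    void *host = malloc(bytes);
    memset(host, 0x5a, bytes);                 /* touch pages first */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

    cl_event ev;
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, &ev);

    cl_ulong t0, t1;                           /* device timestamps, ns */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);

    clReleaseEvent(ev); clReleaseMemObject(buf); free(host);
    return (double)bytes / (double)(t1 - t0);
}
```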
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 276 [+11%] 202 201 247 Navi’s GDDR6 brings a slight latency increase (+11%).
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 286 353
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 87.6 80
Not unexpectedly, GDDR6’s latencies are higher than HBM2’s, although not by as much as we were fearing.
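
Latency tests of this kind are typically built on dependent loads, where each address comes from the previous load. A minimal sketch of the idea (idx[] is assumed to hold a pre-shuffled permutation; run with a single work-item to expose raw latency):

```c
/* Pointer-chasing latency probe: the loads are serialised because each
   one depends on the previous result, so total time / steps ~ latency. */
__kernel void chase(__global const uint *idx, uint steps,
                    __global uint *sink)
{
    uint p = get_global_id(0);
    for (uint s = 0; s < steps; ++s)
        p = idx[p];                    /* dependent, latency-bound load */
    sink[get_global_id(0)] = p;        /* keep the chain alive */
}
```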

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

“Navi” is an interesting chip to be sure, and perhaps more was expected of it; as always, the drivers are the weak link, and it is hard to determine which issues will be fixed driver-side and which will need compute kernels to be optimised.

Thus, performance-wise, it oscillates between half Vega1’s performance and 50-60% faster depending on algorithm, with compute-heavy algorithms (especially crypto-currencies) doing best and shared/local-memory-heavy algorithms doing worst. The 2x bigger shared memory (64kB vs. 32kB) in conjunction with the larger default work-group sizes (1024 vs. 256) presents future optimisation opportunities. AMD has also reduced the wave size to match nVidia’s warp size – a historic change.

Memory-wise, the cost-cutting change from HBM2 to high-speed GDDR6 still brings slightly more bandwidth, though naturally higher latencies – and PCIe 4.0 doubles upload/download bandwidth, which will become much more important on higher-capacity (16GB+) cards in the future.

Overall it is hard to recommend for compute workloads unless your particular algorithm (crypto, financial) does well on Navi; otherwise the much older Vega 56/64 offers a better performance/cost ratio, especially today. However, as drivers mature and implementations are optimised for it, Navi is likely to start performing better.

We are looking forward to the next iterations of Navi, especially the rumoured “big Navi” version optimised for compute…

AMD Radeon VII: Vega2 GPGPU Performance in OpenCL

What is “Vega2”?

It is the code-name of the updated “Vega” GPU architecture, the last of the GCN (Graphics Core Next) line (version 5.1), shrunk to 7nm before being replaced by the forthcoming “Navi”. Originally designed for the professional/workstation high-end market and for compute (scientific, machine learning, etc.) workloads, “Vega2”/“big Vega” was pressed into service to battle the latest 2000-series “Turing”/RTX competition.

As a result it contains many high-end features not normally found on consumer cards:

  • 1/4 FP64 rate (instead of 1/16 or worse)
  • 16GB HBM2 memory (instead of 8-12)
  • 4096-bit HBM2 memory with 1TB/s bandwidth (instead of 400-500GB/s)
  • Int8/Int4 support for AI/ML workloads
  • PCIe 4.0 capable but not enabled at this time

See these other articles on GPGPU performance:

Hardware Specifications

We are comparing the top-of-the-range Radeon with previous-generation cards and competing architectures, with a view to upgrading to a high-performance compute design.

GPGPU Specifications | AMD Radeon VII (Vega2) | nVidia Titan V (Volta) | nVidia Titan X (Pascal) | AMD Vega 56 (Vega1) | Comments
Arch Chipset | Vega 20 / GCN 5.1 | GV100 / 7.0 | GP102 / 6.1 | Vega 10 / GCN 5.0 | A minor revision of Vega1.
Cores (CU) / Threads (SP) | 60 / 3840 | 80 / 5120 | 28 / 3584 | 56 / 3584 | More CUs than normal Vega but not the full 64.
SIMD per CU / Width | 4 / 16 | n/a | n/a | 4 / 16 | Naturally the same SIMD count and width as Vega1.
Wave/Warp Size | 64 | 32 | 32 | 64 | AMD’s wave size has always been 2x nVidia’s.
Speed (Min-Turbo) | 1.4 / 1.75GHz [+21%] | 1.35 / 1.455GHz | 1.531 / 1.91GHz | 1.156 / 1.471GHz | Base clock is ~21% higher than Vega1’s and turbo ~19% higher.
Power (TDP) | 300W [+42%] | 300W | 250W | 210W | TDP has gone up by ~40% vs. Vega1.
ROP / TMU | 64 / 240 | 96 / 320 | 96 / 224 | 64 / 224 | ROPs unchanged; slightly more TMUs than Vega1.
Shared Memory | 32kB | 48 / 96kB | 48 / 96kB | 32kB | No shared memory changes.
Constant Memory | 8GB | 64kB | 64kB | 4GB | No dedicated constant memory, but large.
Global Memory | 16GB HBM2 2Gt/s 4096-bit | 12GB HBM2 1.7Gt/s (2x 850Mt/s) 3072-bit | 12GB GDDR5X 10Gt/s 384-bit | 8GB HBM2 1.6Gt/s 2048-bit | 2x as big and 2x as wide HBM2 – a huge improvement.
Memory Bandwidth (GB/s) | 1000 [+2.4x] | 652 | 512 | 410 | Bandwidth is ~2.4x Vega1’s and ~1.5x Volta’s.
L1 Caches | 16kB x 60 | 96kB x 80 | 48kB x 28 | 16kB x 56 | L1 has not changed.
L2 Cache | 4MB | 4.5MB | 3MB | 4MB | L2 has not changed.
Maximum Work-group Size | 256 / 1024 | 1024 / 2048 | 1024 / 2048 | 256 / 1024 | Same work-group sizes as Vega1.
FP64/double ratio | 1/4x | 1/2x | 1/32x | 1/16x | Ratio is 4x better than Vega1’s.
FP16/half ratio | 2x | 2x | 1/64x | 2x | Same 2x ratio everywhere except gimped Pascal.
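
Since the FP64 and FP16 rates vary so much between these cards, a host program usually checks for the corresponding OpenCL extensions before dispatching precision variants. A minimal sketch (OpenCL only reports availability; the rate itself still has to come from vendor documentation):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

/* Check whether the device advertises FP64 (cl_khr_fp64) and
   FP16 (cl_khr_fp16) support in its extension string. */
static void check_fp_support(cl_device_id dev)
{
    size_t len = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, 0, NULL, &len);
    char *ext = malloc(len);
    clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, len, ext, NULL);
    printf("FP64: %s\n", strstr(ext, "cl_khr_fp64") ? "yes" : "no");
    printf("FP16: %s\n", strstr(ext, "cl_khr_fp16") ? "yes" : "no");
    free(ext);
}
```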

Processing Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 29,057 [+48%] 33,860 245 19,580 Vega2 starts strong with a 48% lead over Vega1 and almost catching Volta.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 18,340 [+35%] 22,680 17,870 13,550 Good improvement here +35% over Vega1 again close to Volta.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 5,377 [+4.3x] 11,000 661 1,240 1/4 FP64 rate makes it over four (4x) times faster than Vega1.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 234 [+3x] 458 25.77 77 Similar to above, Vega2 is over three (3x) faster.
Vega2 looks about 35-50% faster than Vega1 in FP32/FP16 and 3-4x faster in FP64 due to its 1/4 FP64 rate. It won’t beat real workstation cards with a 1/2 FP64 rate though, thus the Titan V has nothing to worry about.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 91 [+36%] 70 42 67 The fast HBM2 memory allows it to beat even Volta not just Vega1.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 93 58 88
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 209 [+67%] 245 145 125 Vega2 is a huge 70% faster in integer/crypto workloads.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 129 107 162
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 176 76 32
Vega2 increases its lead in integer workloads, even streaming ones, no doubt due to its very fast HBM2 memory – making it the crypto king of the hill, though its cost may be an issue.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 23,164 [+2.3x] 18,570 11,480 9,500 Vega2 is over 2x faster than Vega1 also beating Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 7,272 [+3.84x] 8,400 1,370 1,880 In FP64 it’s almost 4x faster, just below Volta!
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 3,501 [+38%] 4,200 2,240 2,530 Binomial uses thread-shared data thus stresses the memory system; Vega2 is still ~40% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 789 [+4.8x] 2,000 129 164 With FP64 we’re almost 5x faster than Vega1.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 6,249 [+62%] 11,920 5,350 3,840 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – here Vega2 is 60% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 1,676 [+3.55x] 4,440 294 472 With FP64 we’re over 3.5x faster.
For financial FP32 workloads, Vega2 is 40-60% faster than Vega1, a decent improvement; naturally, in FP64 it is 4-5x faster, thus a significant upgrade for algorithms that require such precision.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 6,634 [+30%] 11,000 6,073 5,066 GEMM still brings a 30% improvement over Vega1.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 2,339 [+3.77x] 3,830 340 620 But DGEMM is almost 4x faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 643 [+74%] 617 235 369 FFT loves HBM thus Vega2 is 75% faster.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 365 [+2.1x] 280 207 175 DFFT is tough but Vega2 is still twice as fast.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 6,846 [+41%] 7,790 5,720 4,840 In N-Body physics Vega2 is 40% faster.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 1,752 [+3.9x] 4,270 275 447 And in FP64 physics Vega2 is almost 4x faster.
The scientific scores show a similar improvement, with FP32 30-75% better and FP64 a whopping 2-4x faster than Vega1 and, in some algorithms, matching the hugely expensive Volta.
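
GEMM is the classic case of a kernel that lives or dies by shared/local memory. As an illustrative sketch (not the benchmark’s implementation), here is the standard tiled SGEMM pattern: each work-group stages tiles of A and B into local memory so every element is read from global memory once per tile instead of once per multiply (N is assumed to be a multiple of TILE, work-group size TILE×TILE):

```c
#define TILE 16

__kernel void sgemm(const int N,
                    __global const float *A,
                    __global const float *B,
                    __global float *C)          /* C = A * B, N x N */
{
    __local float As[TILE][TILE];
    __local float Bs[TILE][TILE];

    const int row = get_global_id(1);
    const int col = get_global_id(0);
    const int lr  = get_local_id(1), lc = get_local_id(0);

    float acc = 0.0f;
    for (int t = 0; t < N; t += TILE) {
        /* cooperatively stage one tile of A and one of B */
        As[lr][lc] = A[row * N + (t + lc)];
        Bs[lr][lc] = B[(t + lr) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; ++k)
            acc += As[lr][k] * Bs[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[row * N + col] = acc;
}
```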
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 25,418 [+32%] 26,790 18,410 19,130 In this 3×3 convolution algorithm, Vega2 is 32% faster than Vega1
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 5,275 [+21%] 9,295 5,000 4,340 Same algorithm but more shared data reduces the lead to 21%.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 5,510 [+24%] 9,428 5,080 4,450 With even more data the gap remains constant.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 5,273 [+23%] 9,079 4,800 4,300 Still convolution but with 2 filters – similar 23% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 92 [+91%] 112 37 48 Different algorithm makes Vega2 almost 2x faster than Vega1.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 57 [+50%] 42 12.7 38 Without major processing, this filter is 50% faster on Vega2.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 47,349 [+2.3x] 24,370 19,480 20,880 This algorithm is 64-bit integer heavy and Vega2 flies 2x faster than Vega1.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 7,708 [+28%] 8,460 305 6,000 One of the most complex and largest filters, Vega2 is 28% faster.
For image processing using FP32 precision, Vega2 goes from 21% to 2x faster than Vega1 – overall a decent improvement if you are processing a large number of images. In many filters it beats the far more expensive Volta competition.

Memory Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both AMD and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest AMD and nVidia drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks AMD Radeon VII (Vega2) nVidia Titan V (Volta) nVidia Titan X (Pascal) AMD Vega 56 (Vega1) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 627 [+88%] 536 356 333 Vega2’s wide HBM2 is almost 2x faster as expected.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 12.37 [+2%] 11.47 11.4 12.18 Using PCIe 3.0 similar upload bandwidth.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 12.95 [+7%] 12.27 12.2 12.08 Again similar bandwidth.
Vega2 benefits greatly from its very wide (4096-bit) HBM2 memory, which provides almost 2x the real bandwidth as expected. But while PCIe 4.0 capable, for now it has to make do with PCIe 3.0 and thus the same upload/download bandwidth. Here’s hoping for a BIOS update once new motherboards come out.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 202 [-19%] 180 201 247 The higher clock allows Vega2 a 20% latency reduction.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 341 [-4%] 311 286 353 Full range is only 4% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 53.4 89.8 115
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 75.4 117 237
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 18.1 18.7 55
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 212 195 193
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 344 282 301
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 88.5 87.6 80
Not unexpectedly, latencies generally improve over Vega1 thanks to the higher memory clock, though texture access latencies slightly trail the nVidia competition.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Vega2 (“big Vega”) is a big improvement over the normal Vega1 and its workstation-class pedigree shows. For FP16/FP32 workloads, though, the 30-40% performance improvement may not be worth the much higher price; naturally, FP64 performance is almost 4x better due to the 1/4 FP64 rate, though not as good as professional cards, or the Titan competition, with a 1/2 rate.

While the GCN core (rev 5.1) has seen internal updates, there is nothing new to support/optimise for on the compute side, thus any code working well on Vega1 should work just as well on Vega2.

The 16GB of wide HBM2 memory also helps big workloads, with 2x higher bandwidth and lower latency due to the higher clock. For some workloads this alone makes it a definite buy when the competition stops at 12GB.

Unfortunately the card has had a limited release at a relatively high price, thus the value/price ratio depends entirely on your workload – if FP64 with large datasets, it is very much worth it; if FP32/FP16 with datasets that fit in a standard 8GB of memory, the older Vega1 is much better value – you can even get two for the price of one Vega2.

For revolutionary change we need to wait for Navi and its brand new RDNA (Radeon DNA) arch(itecture)…

AMD Ryzen2 3700X Review & Benchmarks – CPU 8-core/16-thread Performance

What is “Ryzen2” ZEN2?

AMD’s ZEN2 (“Matisse”) is the “true” 2nd-generation ZEN core on a 7nm process shrink – the previous ZEN+ (“Pinnacle Ridge”) core was just an optimisation of the original ZEN (“Summit Ridge”) core. While socket compatible, it introduces many design improvements over both previous cores. An APU version (with integrated “Navi” graphics) is scheduled to be launched later.

While new chipsets (500 series) will also be introduced – and are required for some new features (PCIe 4.0) – older boards may support the new CPUs with a BIOS/firmware update, allowing upgrades of existing systems with more cores and thus more performance. [Note: older boards will not be enabled for PCIe 4.0 after all.]

The list of changes vs. the previous ZEN/ZEN+ is extensive, thus the performance delta is likely to be large as well:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 8MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 2 chiplets on desktop platform thus up to 2x2x4C (16C/32T 3950X) (same amount as old ThreadRipper 1950X/2950X)
  • 2x larger L3 cache per CCX thus up to 2x2x16MB (64MB) L3 cache (3900X+)
  • 24 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 2x DDR4 memory controllers up to 4266Mt/s

To upgrade from Ryzen+/Ryzen1 or not?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units with 2x FMA pipes (fixing a major deficiency of the ZEN/ZEN+ cores; see the FMA sketch after this list)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies are claimed to be reduced through higher-speed memory (note all requests go through the IF to the central I/O hub with the memory controllers)
  • Load/Store 32bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run 1x or 1/2x vs. DRAM clock (when higher than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities bar two (BTI/”Spectre”, SSB/”Spectre v4″) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transitions latencies reduced (ACPI/CPPCv2)
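
To make the 256-bit change concrete, here is a minimal C sketch (illustrative, not SiSoftware’s benchmark code) of an FMA-based SAXPY using AVX2/FMA3 intrinsics; compile with -mavx2 -mfma on gcc/clang. On ZEN/ZEN+ each 256-bit op was cracked into two 128-bit micro-ops; ZEN2 executes it in a single pass:

```c
#include <immintrin.h>

/* y = a*x + y over n floats (n assumed to be a multiple of 8).
   One _mm256_fmadd_ps processes 8 packed FP32 lanes per instruction. */
void fma_saxpy(float *y, const float *x, float a, int n)
{
    const __m256 va = _mm256_set1_ps(a);       /* broadcast scalar a */
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);      /* fused multiply-add */
        _mm256_storeu_ps(y + i, vy);
    }
}
```

Note the matching Load/Store widening in the list above: without 32 bytes/cycle of L1D bandwidth, the wider FMA units would simply starve.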

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the middle-of-the-range Ryzen2 (3700X) with previous generation Ryzen+ (2700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | AMD Ryzen 9 3900X (Matisse) | AMD Ryzen 7 3700X (Matisse) | AMD Ryzen 7 2700X (Pinnacle Ridge) | Intel i9 9900K (Coffeelake-R) | Intel i9 7900X (Skylake-X) | Comments
Cores (CU) / Threads (SP) | 12C / 24T | 8C / 16T | 8C / 16T | 8C / 16T | 10C / 20T | Core counts remain the same (3700X vs. 2700X).
Topology | 2 chiplets, each 2 CCX, each 3 cores (1 disabled) (12C) | 1 chiplet, 2 CCX, each 4 cores (8C) | 2 CCX, each 4 cores (8C) | Monolithic die | Monolithic die | 1 CPU chiplet + 1 I/O die rather than a single die.
Speed (Min / Max / Turbo) | 3.8 / 4.6GHz | 3.6 / 4.4GHz | 3.7 / 4.2GHz | 3.6 / 5GHz | 3.3 / 4.3GHz | The 3700X’s base clock is lower than the 2700X’s but its turbo is higher.
Power (TDP / Turbo) | 105 / 135W | 65 / 90W | 105 / 135W | 95 / 135W | 140 / 308W | TDP has been greatly reduced vs. ZEN+.
L1D / L1I Caches | 12x 32kB 8-way / 12x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 4-way | 8x 32kB 8-way / 8x 32kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | L1I has been halved but with more ways (8 vs. 4).
L2 Caches | 12x 512kB (6MB) 8-way | 8x 512kB (4MB) 8-way | 8x 512kB (4MB) 8-way | 8x 256kB (2MB) 16-way | 10x 1MB (10MB) 16-way | No changes to L2.
L3 Caches | 2x2x 16MB (64MB) 16-way | 2x 16MB (32MB) 16-way | 2x 8MB (16MB) 16-way | 16MB 16-way | 13.75MB 11-way | L3 is 2x ZEN+.
Mitigations for Vulnerabilities | BTI/”Spectre”, SSB/”Spectre v4″ in hardware | BTI/”Spectre”, SSB/”Spectre v4″ in hardware | BTI/”Spectre”, SSB/”Spectre v4″ in software/firmware | RDCL/”Meltdown”, L1TF in hardware; BTI/”Spectre”, MDS/”Zombieload” in software/firmware | RDCL/”Meltdown”, L1TF, BTI/”Spectre”, MDS/”Zombieload” all in software/firmware | Ryzen2 addresses the remaining 2 vulnerabilities in hardware, while Intel was forced to add MDS to its long list…
Microcode | MU-8F7100-11 | MU-8F7100-11 | MU-8F0802-04 | MU-069E0C-9E | MU-065504-49 | The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units | 256-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 128-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 512-bit AVX512 | ZEN2’s SIMD units are 2x wider than ZEN+’s.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2, FMA3 and even SHA HWA, but not AVX512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks AMD Ryzen 7 3700X (Matisse) AMD Ryzen 7 2700X (Pinnacle Ridge) Intel i9 9900K (Coffeelake-R) Intel i9 7900X (Skylake-X) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 336 [=] 334 400 485 We start with no improvement over ZEN+
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 339 [=] 335 393 485 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 202 [+2%] 198 236 262 Floating-point performance does not change delta either – only 2% faster
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 170 [=] 169 196 223 With FP64 nothing much changes again.
In the legacy integer/floating-point benchmarks ZEN2 is not any faster than ZEN+ despite the change in clocks. Perhaps future microcode updates will help?
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1023 [+78%] 574 985 1590 ZEN2 is ~80% faster than ZEN+ despite what we’ve seen before.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 374 [+2x] 187 414 581 With a 64-bit AVX2 integer vectorised workload, ZEN2 is now 2x faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 6.56 [+13%] 5.8 6.75 7.56 This is a tough test using Long integers to emulate Int128 without SIMD; here ZEN2 is still 13% faster.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1001 [+68%] 596 914 1760 In this floating-point AVX/FMA vectorised test, ZEN2 is ~70% faster.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 618 [+84%] 335 535 533 Switching to FP64 SIMD code, ZEN2 is now ~85% faster than ZEN+.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 24.22 [+55%] 15.6 23 40.3 In this heavy algorithm using FP64 to mantissa extend FP128, ZEN2 is still 55% faster
With its brand-new 256-bit SIMD units, ZEN2 is anywhere from 55% to 100% faster than ZEN+/ZEN1 a huge upgrade from one generation to the next. For SIMD loads upgrading to ZEN2 gives a huge performance uplift.
BenchCrypt Crypto AES-256 (GB/s) 18 [+12%] 16.1 17.63 23 With AES/HWA support all CPUs are memory bandwidth bound  but ZEN2 manages a 12% improvement.
BenchCrypt Crypto AES-128 (GB/s) 18.76 [+17%] 16.1 17.61 23 What we saw with AES-256 just repeats with AES-128; ZEN2 is now 17% faster.
BenchCrypt Crypto SHA2-256 (GB/s) 20.21 [+9%] 18.6 12 26 With SHA/HWA ZEN2 similarly powers through hashing tests leaving Intel in the dust – and is still ~10% faster than ZEN+
BenchCrypt Crypto SHA1 (GB/s) 20.41 [+6%] 19.3 22.9 38 The less compute-intensive SHA1 does not change things due to acceleration.
BenchCrypt Crypto SHA2-512 (GB/s) 3.77 9 21
ZEN2 with AES/SHA HWA is memory-bound like all other CPUs, but it still manages 6-17% better performance than ZEN+ using the same memory. As ZEN2 is rated for faster memory, using such memory would improve the results further.
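
For reference, this is the hardware path the AES tests exercise – a hedged C sketch of AES-128 block encryption via the AES-NI intrinsics (compile with -maes; the round keys are assumed to be pre-expanded, key schedule omitted). With each round being a single instruction, the cipher itself is nearly free and memory bandwidth dominates:

```c
#include <wmmintrin.h>   /* AES-NI intrinsics */

/* Encrypt one 128-bit block with pre-expanded round keys rk[0..10]. */
__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);          /* initial whitening */
    for (int r = 1; r < 10; ++r)
        block = _mm_aesenc_si128(block, rk[r]);   /* 9 full rounds     */
    return _mm_aesenclast_si128(block, rk[10]);   /* final round       */
}
```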
BenchFinance Black-Scholes float/FP32 (MOPT/s) 257 276 309
BenchFinance Black-Scholes double/FP64 (MOPT/s) 229 [+5%] 219 238 277 Switching to FP64 code, ZEN2 is just 5% faster.
BenchFinance Binomial float/FP32 (kOPT/s) 107 59.9 70.5 Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) 57.98 [-4%] 60.6 61.6 68 With FP64 code ZEN2 is 4% slower.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.2 56.5 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 46.34 [+13%] 41 44.5 50.5 Switching to FP64 nothing much changes, ZEN2 is 13% faster.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: ZEN2 does not improve much and is pretty much tied with ZEN+ – thus for non SIMD workloads you might as well stick with the older versions.
BenchScience SGEMM (GFLOPS) float/FP32 263 [-12%] 300 375 413 In this tough vectorised algorithm ZEN2 is strangely slower.
BenchScience DGEMM (GFLOPS) double/FP64 193 [+63%] 119 209 212 With FP64 vectorised code, ZEN2 comes back to be over 60% faster.
BenchScience SFFT (GFLOPS) float/FP32 22.78 [+2.5x] 9 22.33 28.6 FFT is also heavily vectorised but stresses the memory sub-system more; ZEN2 is 2.5x (times) faster.
BenchScience DFFT (GFLOPS) double/FP64 11.16 [+41%] 7.92 11.21 14.6 With FP64 code, ZEN2 is ~40% faster.
BenchScience SNBODY (GFLOPS) float/FP32 612 [+2.2x] 280 557 638 N-Body simulation is vectorised but fewer memory accesses; ZEN2 is over 2x faster.
BenchScience DNBODY (GFLOPS) double/FP64 220 [+2x] 113 171 195 With FP64 precision ZEN2 is almost 2x faster.
With highly vectorised SIMD code ZEN2 improves greatly over ZEN+, sometimes managing to be over 2x faster using the same memory.
CPU Image Processing Blur (3×3) Filter (MPix/s) 2049 [+42%] 1440 2560 4880 In this vectorised integer workload ZEN2 starts over 40% faster than ZEN+.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 950 [+52%] 627 1000 1920 Same algorithm but more shared data makes ZEN2 over 50% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 495 [+52%] 325 519 1000 Again same algorithm but even more data shared still 50% faster
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 826 [+67%] 495 827 1500 Different algorithm but still vectorised workload ZEN2 is almost 70% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 89.68 [+24%] 72.1 78 221 Still vectorised code now ZEN2 drops to just 25% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 25.05 [+5%] 23.9 42.2 66.7 This test has always been tough for Ryzen so ZEN2 does not improve much.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1763 [+76%] 1000 4000 4070 With integer workload, Intel CPUs seem to do much better but ZEN2 is still almost 80% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 321 [+32%] 243 596 777 In this final test again with integer workload ZEN2 is 32% faster
As we’ve seen before, the new SIMD units are anywhere from 5% (worst-case) to 2x faster than ZEN+/1, a huge performance improvement.
Aggregate Score (Points) 8,200 [+40%] 5,850 7,930 11,810 Across all benchmarks, ZEN2 is ~40% faster than ZEN+.
Aggregating all the various scores, the result was never in doubt: ZEN2 (3700X) is 40% faster than the old ZEN+ (2700X) that itself improved over the original 1700X.

ZEN2’s 256-bit wide SIMD units are a big upgrade and show their power in every SIMD workload; otherwise there is only minor improvement.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: For SIMD workloads you really have to upgrade to Ryzen2; otherwise stick with Ryzen+ unless lower power is preferred. 9/10 overall.

The big change in Ryzen2 is the 256-bit-wide SIMD units, and all vectorised workloads (multi-media, scientific, image processing, AI/machine learning, etc.) using AVX/FMA will benefit greatly – anything between 50-100%, which is a significant increase from just one generation to the next.

But for all other workloads (e.g. Financial, legacy, etc.) there is not much improvement over Ryzen+/1 which were already doing very well against competition.

Naturally it all comes at a lower TDP (65W vs. 105W), which may help with overclocking and also lowers noise (from the cooling system) and power consumption (if electricity is expensive or you are running it continuously) – thus performance/Watt is greatly improved.

Overall the 3700X does represent a decent improvement over the old 2700X (which is no slouch and was a nice upgrade over 1700X due to better Turbo speeds) and should still be usable in older AM4 300/400-series mainboards with just a BIOS upgrade (without PCIe 4.0).

However, while the 2700X (and 1700X/1800X) were top-of-the-line, the 3700X is just middle-ground, with the new top CPUs being the 3900X and even the 3950X with twice (2x) the cores and thus potentially huge performance rivalling HEDT ThreadRipper. The goal-posts have moved, and far higher performance can be yours by just upgrading the CPU. The future is bright…

AMD Ryzen2 3900X Review & Benchmarks – CPU 12-core/24-thread Performance

What is “Ryzen2” ZEN2?

AMD’s ZEN2 (“Matisse”) is the “true” 2nd-generation ZEN core on a 7nm process shrink – the previous ZEN+ (“Pinnacle Ridge”) core was just an optimisation of the original ZEN (“Summit Ridge”) core. While socket compatible, it introduces many design improvements over both previous cores. An APU version (with integrated “Navi” graphics) is scheduled to be launched later.

While new chipsets (500 series) will also be introduced – and are required for some new features (PCIe 4.0) – older boards may support the new CPUs with a BIOS/firmware update, allowing upgrades of existing systems with more cores and thus more performance. [Note: older boards will not be enabled for PCIe 4.0 after all.]

The list of changes vs. the previous ZEN/ZEN+ is extensive, thus the performance delta is likely to be large as well:

  • Built around “chiplets” of up to 2 CCX (“core complexes”) each of 4C/8T and 8MB L3 cache (7nm)
  • Central I/O hub with memory controller(s) and PCIe 4.0 bridges connected through IF (“Infinity Fabric”) (12nm)
  • Up to 2 chiplets on desktop platform thus up to 2x2x4C (16C/32T 3950X) (same amount as old ThreadRipper 1950X/2950X)
  • 2x larger L3 cache per CCX thus up to 2x2x16MB (64MB) L3 cache (3900X+)
  • 24 PCIe 4.0 lanes (2x higher transfer rate over PCIe 3.0)
  • 2x DDR4 memory controllers up to 4266Mt/s

AMD Ryzen2 3950X chiplets

What’s new in the Ryzen2 core?

Micro-architecturally there are more changes that should improve performance:

  • 256-bit (single-op) SIMD units with 2x FMA pipes (fixing a major deficiency of the ZEN/ZEN+ cores)
  • TLB (2nd level) increased (should help out-of-page access latencies that are somewhat high on ZEN/ZEN+)
  • Memory latencies are claimed to be reduced through higher-speed memory (note all requests go through the IF to the central I/O hub with the memory controllers)
  • Load/Store 32bytes/cycle (2x ZEN/ZEN+) to keep up with the 256-bit SIMD units (L1D bandwidth should be 2x)
  • L3 cache is 2x ZEN/ZEN+ but higher latency (cache is exclusive)
  • Infinity Fabric is 512-bit (2x ZEN/ZEN+) and can run 1x or 1/2x vs. DRAM clock (when higher than 3733Mt/s)
  • AMD processors have thankfully not been affected by most of the vulnerabilities bar two (BTI/”Spectre”, SSB/”Spectre v4″) that have now been addressed in hardware.
  • HWM-P (hardware performance state management) transitions latencies reduced (ACPI/CPPCv2)

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 (3900X, 3700X) with previous generation Ryzen+ (2700X) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications | AMD Ryzen 9 3900X (Matisse) | AMD Ryzen 7 3700X (Matisse) | AMD Ryzen 7 2700X (Pinnacle Ridge) | Intel i9 9900K (Coffeelake-R) | Intel i9 7900X (Skylake-X) | Comments
Cores (CU) / Threads (SP) | 12C / 24T | 8C / 16T | 8C / 16T | 8C / 16T | 10C / 20T | Matching core count with CFL, but the 3900X has 50% more cores – more than SKL-X.
Topology | 2 chiplets, each 2 CCX, each 3 cores (1 disabled) (12C) | 1 chiplet, 2 CCX, each 4 cores (8C) | 2 CCX, each 4 cores (8C) | Monolithic die | Monolithic die | AMD uses discrete dies/chiplets, unlike Intel.
Speed (Min / Max / Turbo) | 3.8 / 4.6GHz | 3.6 / 4.4GHz | 3.7 / 4.2GHz | 3.6 / 5GHz | 3.3 / 4.3GHz | Base and turbo clocks are competitive, with the 3900X having the higher turbo.
Power (TDP / Turbo) | 105 / 135W | 65 / 90W | 105 / 135W | 95 / 135W | 140 / 308W | TDP remains the same, but the 3900X may exceed it, having more cores.
L1D / L1I Caches | 12x 32kB 8-way / 12x 32kB 8-way | 8x 32kB 8-way / 8x 32kB 8-way | 8x 32kB 8-way / 8x 64kB 4-way | 8x 32kB 8-way / 8x 32kB 8-way | 10x 32kB 8-way / 10x 32kB 8-way | ZEN2 matches L1I with CFL/SKL-X (1/2x ZEN+ but 8-way); L1D is unchanged (also matches Intel).
L2 Caches | 12x 512kB (6MB) 8-way | 8x 512kB (4MB) 8-way | 8x 512kB (4MB) 8-way | 8x 256kB (2MB) 16-way | 10x 1MB (10MB) 16-way | No changes to L2, still 2x CFL. Only SKL-X has a massive 1MB L2 per core, which the 3900X almost matches in total.
L3 Caches | 2x2x 16MB (64MB) 16-way | 2x 16MB (32MB) 16-way | 2x 8MB (16MB) 16-way | 16MB 16-way | 13.75MB 11-way | L3 is 2x ZEN/ZEN+ and thus 2x CFL, with the 3900X having a massive 64MB, unheard of on the desktop platform! SKL-X can’t match it either.
Mitigations for Vulnerabilities | BTI/”Spectre”, SSB/”Spectre v4″ in hardware | BTI/”Spectre”, SSB/”Spectre v4″ in hardware | BTI/”Spectre”, SSB/”Spectre v4″ in software/firmware | RDCL/”Meltdown”, L1TF in hardware; BTI/”Spectre”, MDS/”Zombieload” in software/firmware | RDCL/”Meltdown”, L1TF, BTI/”Spectre”, MDS/”Zombieload” all in software/firmware | Ryzen2 addresses the remaining 2 vulnerabilities in hardware, while Intel was forced to add MDS to its long list…
Microcode | MU-8F7100-11 | MU-8F7100-11 | MU-8F0802-04 | MU-069E0C-9E | MU-065504-49 | The latest microcodes included in the respective BIOS/Windows have been loaded.
SIMD Units | 256-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 128-bit AVX/FMA3/AVX2 | 256-bit AVX/FMA3/AVX2 | 512-bit AVX512 | ZEN2 finally matches Intel/CFL, but SKL-X’s secret weapon is AVX512, with even consumer CPUs able to do 2x 512-bit FMA ops.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, FMA3, AVX, etc.). Ryzen2 supports all modern instruction sets including AVX2, FMA3 and even SHA HWA, but not AVX512.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations. All mitigations for vulnerabilities (Meltdown, Spectre, L1TF, MDS, etc.) were enabled as per Windows default where applicable.

Native Benchmarks AMD Ryzen 9 3900X (Matisse) AMD Ryzen 7 2700X (Pinnacle Ridge) Intel i9 9900K (Coffeelake-R) Intel i9 7900X (Skylake-X) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 551 [+38%] 334 400 485 Right off Ryzen2 demolishes all CPUs, it is 40% faster than CFL-R!
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 556 [+41%] 335 393 485 With a 64-bit integer workload nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 331 [+40%] 198 236 262 Floating-point performance does not change delta either – still 40% faster!
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 280 [+43%] 169 196 223 With FP64 nothing much changes again.
Ryzen2 starts with an astonishing display, with the 3900X demolishing both the 9900K and 7900X, winning all tests by a large margin (38-43%)! It does have 50% more cores (12 vs. 8), but it is not easy to realise gains just by increasing core counts. Intel will need to add far more cores in future CPUs in order to compete!
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1449 [+47%] 574 985 1590 Ryzen2 starts off by blowing CFL-R away by 47% and almost matching SKL-X with AVX512!
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 553 [+34%] 187 414 581 With a 64-bit AVX2 integer vectorised workload, Ryzen2 is still 34% faster!
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 9.52 [+41%] 5.8 6.75 7.56 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 is again 41% faster!
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 1480 [+62%] 596 914 1760 In this floating-point AVX/FMA vectorised test, Ryzen2 is now over 60% faster than CFL-R and not far off SKL-X!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 906 [+69%] 335 535 533 Switching to FP64 SIMD code, Ryzen2 is now 70% faster even beating SKL-X!!!
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 35.23 [+53%] 15.6 23 40.3 In this heavy algorithm using FP64 to mantissa extend FP128, Ryzen2 is still 53% faster!
With its brand-new 256-bit SIMD units, Ryzen2 finally goes toe-to-toe with Intel, soundly beating CFL-R in all benchmarks (+34-69%) sometimes by more than just core count increase (+50%). Only SKL-X with AVX512 manages to be faster (but also with its extra 2 cores). Intel had better release AVX512 for desktop soon but even that will not be enough without increasing core counts to match AMD.
BenchCrypt Crypto AES-256 (GB/s) 15.44 [-12%] 16.1 17.63 23 With AES/HWA support all CPUs are memory bandwidth bound – thus Ryzen2 scores less than Ryzen+ and CFL-R.
BenchCrypt Crypto AES-128 (GB/s) 15.44 [-12%] 16.1 17.61 23 What we saw with AES-256 just repeats with AES-128; Ryzen2 is again slower by 12%.
BenchCrypt Crypto SHA2-256 (GB/s) 29.84 [+2.5x] 18.6 12 26 With SHA/HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust – 2.5x faster than CFL-R and beating SKL-X with AVX512!
BenchCrypt Crypto SHA1 (GB/s) 19.3 22.9 38
BenchCrypt Crypto SHA2-512 (GB/s) 3.77 9 21
Ryzen2 with AES/SHA HWA is memory bound thus needs faster memory than 3200Mt/s in order to feed all the cores; otherwise due to increased contention for the same bandwidth it may end up slower than Ryzen+ and Intel designs. Here you see the need for HEDT platforms and thus ThreadRipper but at much increased cost.
BenchFinance Black-Scholes float/FP32 (MOPT/s) 257 276 309
BenchFinance Black-Scholes double/FP64 (MOPT/s) 379 [+55%] 219 238 277 Switching to FP64 code, nothing much changes, Ryzen2 55% faster than CFL-R.
BenchFinance Binomial float/FP32 (kOPT/s) 107 59.9 70.5 Binomial uses thread shared data thus stresses the cache & memory system;
BenchFinance Binomial double/FP64 (kOPT/s) 95.73 [+55%] 60.6 61.6 68 With FP64 code Ryzen2 is still 55% faster!
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 54.2 56.5 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches;
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 76.72 [+72%] 41 44.5 50.5 Switching to FP64 nothing much changes, Ryzen2 is 70% faster than CFL-R and still beating SKL-X.
Ryzen always did well on non-SIMD floating-point algorithms and here it does not disappoint: Ryzen2 is over 50% faster than CFL-R (+55-72%) and soundly beats SKL-X too! As before for financial algorithms there is only one choice and that is Ryzen, be it Ryzen1, Ryzen+ or Ryzen2!
BenchScience SGEMM (GFLOPS) float/FP32 300 375 413 GEMM is a tough vectorised algorithm that also stresses the caches;
BenchScience DGEMM (GFLOPS) double/FP64 212 [+1%] 119 209 212 With FP64 vectorised code, Ryzen2 matches CFL-R and SKL-X.
BenchScience SFFT (GFLOPS) float/FP32 9 22.33 28.6 FFT is also heavily vectorised but stresses the memory sub-system more;
BenchScience DFFT (GFLOPS) double/FP64 12.69 [+13%] 7.92 11.21 14.6 With FP64 code, Ryzen2 is 13% faster than CFL-R.
BenchScience SNBODY (GFLOPS) float/FP32 280 557 638 N-Body simulation is vectorised but fewer memory accesses;
BenchScience DNBODY (GFLOPS) double/FP64 332 [+94%] 113 171 195 With FP64 precision Ryzen2 is almost 2x faster than CFL-R.
With highly vectorised SIMD code Ryzen2 remains competitive but finds some algorithms tougher than others. The new 256-bit SIMD units help but it seems the cores are starved of bandwidth (especially due to SMT) and some workloads may perform better with SMT off.
CPU Image Processing Blur (3×3) Filter (MPix/s) 3056 [+20%] 1440 2560 4880 In this vectorised integer workload Ryzen2 is 20% faster than CFL-R.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 1499 [+50%] 627 1000 1920 Same algorithm but more shared data makes Ryzen2 50% faster!
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 767 [+48%] 325 519 1000 Again same algorithm but even more data shared still 50% faster
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 1298 [+57%] 495 827 1500 Different algorithm but still vectorised workload Ryzen2 is almost 60% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 136 [+74%] 72.1 78 221 Still vectorised code now Ryzen2 is 70% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 38.23 [-9%] 23.9 42.2 66.7 This test has always been tough for Ryzen but Ryzen2 is competitive.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 1384 [-65%] 1000 4000 4070 With integer workload, Intel CPUs seem to do much better.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 487 [-18%] 243 596 777 In this final test again with integer workload Ryzen2 is 20% slower.
Thanks to AVX512, SKL-X does win all tests, but Ryzen2 beats CFL-R by 20-74%, with a few tests mixing integer & floating-point SIMD instructions seemingly heavily favouring Intel. Overall, for image processing Ryzen2 should be your 1st choice.
Aggregate Score (Points) 10,250 [+29%] 5,850 7,930 11,810 Across all benchmarks, Ryzen2 is ~30% faster than CFL-R!
Aggregating all the various scores, the result was never in doubt: Ryzen2 (3900X) is ~75% faster than Ryzen+ (2700X) and ~30% faster than CFL-R, almost catching the HEDT SKL-X.

Ryzen2 (unlike Ryzen1/+) has no trouble with SIMD code due to its widened (256-bit) SIMD units and thus soundly beats the opposition (the CFL-R 9900K flagship), sometimes by more than the core-count increase alone (+50%, i.e. 12 cores vs. 8). Sometimes it even beats the AVX512-enabled SKL-X 7900X (12 cores vs. 10).

The only “problematic” algorithms are the memory-bound ones, where the cores/threads (24 with SMT!) are starved of data and, due to contention, we see lower performance than on devices with fewer cores. While larger caches help (hence the massive 4x 16MB L3), higher-clocked memory should be used to match the additional cores’ requirements.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Executive Summary: Ryzen2 is phenomenal and a huge upgrade over Ryzen1/+ that (most) AM4 users can enjoy and Intel has no answer to. 10/10.

Just as the original Ryzen forced Intel to increase (really, double) core counts to match (from 4 to 6, then 8), Ryzen2 will force Intel to come up with even more (and better) cores in order to compete. The 3900X with its 12 cores soundly beats the CFL-R 9900K (8 cores) in just about all benchmarks, and in some tests goes toe-to-toe with the HEDT, AVX512-enabled SKL-X (10 cores) – except in memory-bound algorithms, where SKL-X’s 4 DDR4 memory channels provide 2x the bandwidth. For that you need ThreadRipper!

Ryzen1/+ was already competitive with Intel on integer and floating-point (non-SIMD) workloads but would fare badly on SIMD (AVX/FMA3/AVX2) workloads due to its 128-bit units; Ryzen2 “fixes” this issue, with its 256-bit units matching Intel. Only SKL-X with its 512-bit units (AVX512) is faster and Intel will have to finally include AVX512 for consumer CPUs in order to compete (IceLake?).

For compute-bound workloads, the forthcoming 3950X with its 16 cores/32 threads brings unprecedented performance to the consumer/desktop segment – pretty much unheard of just a few years ago, when 4 cores/8 threads (e.g. 7700K) was all you could hope for unless paying a lot more for HEDT, where 8/10-core CPUs were far, far more expensive. Naturally we shall see how the reduced memory bandwidth affects its performance, with likely very fast DDR4 memory (4300Mt/s+) required for best results.

Let’s also remember that Ryzen2 adds hardware mitigations for its 2 remaining vulnerabilities, while Intel has been forced to add MDS/”Zombieload” mitigations even to its very latest CFL-R – which thus loses its trump card, the hardware RDCL/”Meltdown” fix – not to mention the recommendation to disable SMT/Hyper-Threading, which would mean a sizeable performance drop.

What is astonishing is that the TDP has remained similar, and with a BIOS/firmware upgrade owners of older 300-series boards can now upgrade to these CPUs – likely without even changing the cooler! Naturally, for PCIe 4.0 a 500-series board is recommended, and 400-series boards do support more Ryzen2/+ features – let’s remember that on Intel you can only go back/forward 1 generation, even though there is pretty much no core difference from Skylake (Gen 6) to Coffeelake-R (Gen 9)!

From top-end (3950X), high-end (3800X) to low-end/APU (3200G) Ryzen2 is such a compelling choice it is hard to recommend anything else… at least at this time…

AMD Ryzen 2 Mobile (2500U) Vega 8 GP(GPU) Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited mobile APU (“Raven Ridge”) version of the desktop Ryzen, with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on the desktop we had the original Ryzen1/ThreadRipper, there was no (at least released) APU or mobile version – leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design in order to hit the TDP (15-35W) range for mobile devices. It has also added the brand-new Vega graphics cores to the APU, which have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (core complex), thus do not require operating-system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2 mobile:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Radeon RX Vega graphics core (DirectX 12.1)
  • Optimised boost (aka Turbo) algorithm – sharing between CPU & GPU cores

In this article we test GP(GPU) integrated graphics performance; please see our other articles on:

Hardware Specifications

We are comparing the graphics unit of Ryzen2 mobile with competing APUs with integrated graphics to determine whether it is good enough for modest use, especially for compute (GPGPU) use supporting the CPU.

GPGPU Specifications | AMD Radeon RX Vega 8 (2500U) | Intel UHD 630 (7200U) | Intel HD Iris 520 (6500U) | Intel HD Iris 540 (6550U) | Comments
Arch Chipset | GCN1.5 | GT2 / EV9.5 | GT2 / EV9 | GT3 / EV9 | All graphics cores are minor revisions of previous cores with extra functionality.
Cores (CU) / Threads (SP) | 8 / 512 | 24 / 192 | 24 / 192 | 48 / 384 | Vega has the most SPs, though organised in fewer but more powerful CUs.
ROPs / TMUs | 8 / 32 | 8 / 16 | 8 / 16 | 16 / 24 | Vega has fewer ROPs than GT3 but more TMUs.
Speed (Min-Turbo) | 300-1100MHz | 300-1000MHz | 300-1000MHz | 300-950MHz | Turbo boost puts Vega in the top position, power permitting.
Power (TDP) | 25-35W | 15-25W | 15-25W | 15-25W | TDP is about the same for all, though Ryzen2 has a somewhat higher (25W+) TDP.
Constant Memory | 2.7GB | 1.6GB | 1.6GB | 3.2GB | There is no dedicated constant memory, thus a large chunk (GB) of system memory is available – unlike a dedicated video card with very fast but small (kB) constant memory.
Shared (Local) Memory | 32kB | 64kB | 64kB | 64kB | Intel has 2x larger shared/local memory, but it is slower (likely not dedicated), unlike Vega’s.
Global Memory | 2.7 / 3GB | 1.6 / 3.2GB | 1.6 / 3.2GB | 3.2 / 6.4GB | About 50% of main memory can be used as global memory – thus pretty large workloads can be run.
Memory System | 128-bit DDR4 2400Mt/s | 128-bit DDR3L 1866Mt/s | 128-bit DDR3L 1866Mt/s | 128-bit DDR4 2133Mt/s | Ryzen2’s memory controller is rated for higher data rates, thus should be able to use faster (laptop) memory.
Memory Bandwidth (GB/s) | 36 | 30 | 30 | 33 | The higher data rate of DDR4 results in higher bandwidth, useful for the GPU cores.
L2 Cache | ? | 512kB | 512kB | 1MB | L2 is comparable to the Intel units.
FP64/double ratio | Yes, 1/16x | Yes, 1/8x | Yes, 1/8x | Yes, 1/8x | FP64 is supported at a good ratio, though lower than Intel’s.
FP16/half ratio | Yes, 2x | Yes, 2x | Yes, 2x | Yes, 2x | FP16 is supported at twice the rate – again unlike gimped dedicated cards.

Processing Performance

We are testing OpenCL performance using the latest SDKs / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) AMD Radeon RX Vega 8 (2500U) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 831 927 1630 2000 [+23%] Thanks to FP16 support we see double the performance over FP32 but Vega is only 23% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 476 478 865 1350 [+56%] Vega rules FP32 and is over 50% faster than GT3.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 113 122 209 111 [-47%] FP64 lower rate makes Vega 1/2 the speed of GT3 and only matching GT2 units.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 5.71 6.29 10.78 7.11 [-34%] Emulated FP128 precision depends entirely on FP64 performance thus not a lot changes.
Vega is over 50% faster than Intel’s top-end Iris/GT3 graphics, but only in FP32 precision – while it gains from FP16, Intel scales better, reducing the lead to just ~25%. In FP64 precision, its relatively low 1/16x ratio means it only ties with the low-end GT2 models, while GT3 is 2x as fast. Pity.
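
Using FP16 in OpenCL requires enabling an extension pragma in the kernel source. A minimal sketch (assuming the device advertises cl_khr_fp16, as all the APUs above do) of a half-precision AXPY:

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

/* y = a*x + y at half precision: half-size elements both double the
   peak arithmetic rate (the 2x ratio above) and halve the bandwidth
   needed per element. */
__kernel void haxpy(half a, __global const half *x, __global half *y)
{
    const int i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```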
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 0.858 0.87 1.23 2.58 [+2.1x] No wonder AMD is crypto-king: Vega is over 2x faster than even GT3.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 1 1.08 1.52 3.3 [+2.17x] Nothing changes here, Vega is over 2.2x faster.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 2.72 3 4.7 14.29 [+3x] In this heavy integer workload, Vega is now 3x faster no wonder it’s used for crypto mining.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 6 6.64 11.59 18.77 [+62%] SHA1 is less compute intensive allowing Intel to catch up but Vega is still over 60% faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 1.019 1.08 1.86 3.36 [+81%] With 64-bit integer workload, Vega does better and is 80% (almost 2x) faster than GT3.
Nobody will be using integrated graphics for crypto-mining any time soon, but if you needed to (perhaps using encrypted containers, VMs, etc.) then Vega is your choice – even GT3 is left in the dust despite big improvement over low-end GT2. Intel would need at least 2x more cores to be competitive here.
GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) 1000 1140 1470 1720 [+17%] If 16-bit precision is sufficient for financial work, Vega is almost 20% faster than GT3.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 694 697 794 829 [+4%] In this relatively simple FP32 financial workload Vega is just 4% faster than GT3.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 142 154 281 185 [-33%] Switching to FP64 precision, Vega is 33% slower than GT3.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 86 95 155 270 [+74%] Switching to 16-bit precision allows Vega to extend its lead over GT3 to nearly 75%.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 92 93 153 254 [+66%] Binomial uses thread shared data thus stresses the internal memory sub-system, and here Vega shows its power – it is 66% faster than GT3.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 18 18.86 32 15.67 [-51%] With FP64 precision Vega loses again vs. GT3 at 1/2 the speed and just matches GT2 units.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 211 236 395 584 [+48%] With 16-bit precision, Vega dominates again and is almost 50% faster than GT3.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 223 236 412 362 [-12%] Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – but Vega somehow loses against GT3.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 29.5 33.36 58.7 47.13 [-20%] Switching to FP64 precision as expected Vega is slower.
Financial algorithms perform well on Vega – at least in FP16 & FP32 precision but FP64 is too “gimped” (1/16x FP32 rate) and thus loses against GT3 despite more powerful cores.
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 127 140 236 884 [+3.75x] With 16-bit precision Vega runs away with GEMM and is almost 4x faster than GT3.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 105 107 175 214 [+79%] GEMM makes heavy use of shared/local memory which is likely why Vega is 80% faster than GT3.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 38.8 41.69 70 62.6 [-11%] As expected, due to gimped FP64 rate Vega falls behind GT3 but only by just 11%.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 34.2 34.7 45.85 61.34 [+34%] 16-bit precision helps reduce memory bandwidth pressure thus Vega is 34% faster.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 20.9 21.45 29.69 31.48 [+6%] FFT is memory access bound but Vega does well to beat GT3.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 4.3 5.4 6.07 14.19 [+2.34x] Despite the FP64 rate, Vega manages its memory accesses better beating GT3 by over 2x (two times).
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 270 284 449 623 [+39%] 16-bit precision still benefits N-Body and here Vega is 40% faster than GT3.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 162 181 291 537 [+85%] Back to FP32 and Vega has a pretty large 85% lead – almost 2x GT3.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 22.73 26.1 43.34 44 [+2%] With FP64 precision, Vega and GT3 are pretty much tied.
Vega performs well on compute heavy scientific algorithms (making heavy use of shared/local memory) and also benefits from half/FP16 to reduce memory bandwidth pressure, but FP64 rate comes back to haunt it where it loses against Intel’s GT3. Pity.
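As noted above, GEMM stresses shared/local memory: each work-group stages tiles of the input matrices and reuses them many times, which is where Vega’s fast dedicated local memory pays off. A simplified OpenCL kernel sketch of that tiling pattern (TILE and the row-major layout are illustrative assumptions; N must be a multiple of TILE):

```c
#define TILE 16

__kernel void sgemm_tiled(const int N,
                          __global const float* A,
                          __global const float* B,
                          __global float* C) {
    __local float tileA[TILE][TILE];   // staged in fast shared/local memory
    __local float tileB[TILE][TILE];

    const int row = get_global_id(1), col = get_global_id(0);
    const int lr  = get_local_id(1),  lc  = get_local_id(0);
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // each work-item stages one element of each tile
        tileA[lr][lc] = A[row * N + (t + lc)];
        tileB[lr][lc] = B[(t + lr) * N + col];
        barrier(CLK_LOCAL_MEM_FENCE);            // tile fully loaded

        for (int k = 0; k < TILE; ++k)           // TILE-fold reuse per element
            acc += tileA[lr][k] * tileB[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);            // done before overwriting
    }
    C[row * N + col] = acc;
}
```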
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 888 937 1390 2273 [+64%] With 16-bit precision Vega doubles its lead over GT3 to 64%, despite GT3’s own gains from FP16.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 461 491 613 781 [+27%] In this 3×3 convolution algorithm, Vega does well but only 30% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 279 302 409 582 [+42%] Again a huge gain by using FP16, over 40% faster than GT3.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 100 107 144 157 [+9%] Same algorithm but more shared data reduces the gap to 9%.
GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) 254 272 396 619 [+56%] Large gain again by switching to FP16 with 3x performance over FP32.
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 103 111 156 161 [+3%] With even more shared data the gap falls to just 3%.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 259 281 363 595 [+64%] Another huge gain and over 3x improvement over FP32.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 99 106 145 155 [+7%] Still convolution but with 2 filters – the gap is similar to 5×5 – Vega is 7% faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 7.39 9.4 8.56 7.688 [-18%] Big gain but not enough to beat GT3 here.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 7 7.57 7.08 4 [-47%] Vega does not like this algorithm (lots of branching causing divergence) and is 1/2 GT3 speed.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 8.55 9.32 9.22 <BSOD> This test would cause BSOD; we are investigating.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 8 8.65 6.77 2.59 [-70%] Vega does not like this algorithm either (complex branching) and neither does GT3.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 941 967 1580 2091 [+32%] In order to prevent artifacts most of this test runs in FP32 thus not much gain here.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 878 952 1550 2100 [+35%] This algorithm is 64-bit integer heavy allowing Vega 35% better performance over GT3.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 341 390 343 1046 [+2.5x] Switching to FP16 makes a huge difference to Vega which is over 2x faster.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 384 425 652 608 [-7%] One of the most complex and largest filters, Vega is a bit slower than GT3 by 7%.
For image processing Vega generally performs well in FP32, beating GT3 hands down; but there are a few algorithms (likely needing optimisation for it) that don’t perform as well as expected. Switching to FP16 though doubles/triples scores – suggesting Vega is starved of memory bandwidth.
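The FP16 gains above come largely from halving memory traffic: with the cl_khr_fp16 extension a kernel can both store and process pixels as 16-bit halves. A minimal box-blur sketch of the idea (names and the plain box weights are illustrative, not the benchmark’s actual filter):

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void blur3x3_fp16(__global const half* src,
                           __global half* dst,
                           const int width, const int height) {
    const int x = get_global_id(0), y = get_global_id(1);
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

    half acc = (half)0.0f;
    for (int dy = -1; dy <= 1; ++dy)         // 3x3 neighbourhood
        for (int dx = -1; dx <= 1; ++dx)
            acc += src[(y + dy) * width + (x + dx)];

    dst[y * width + x] = acc / (half)9.0f;   // 16-bit loads/stores: half the traffic
}
```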

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both AMD and the competition.

Results Interpretation: Higher values (MB/s, etc.) mean better performance. Lower time values (ns, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers, OpenCL 2.x. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel UHD 630 (7200U) Intel HD Iris 520 (6500U) Intel HD Iris 540 (6550U) AMD Radeon RX Vega 8 (2500U) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 12.17 21.2 24 27.32 [+14%] With higher speed DDR4 memory, Vega has 14% more bandwidth.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 6 10.4 11.7 4.74 [-60%] The GPU<>CPU link seems a bit slow here at 1/2 bandwidth of Intel.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 6 10.5 11.75 5 [-57%] Download bandwidth shows a similar issue, 1/2 bandwidth expected.
All designs have to rely on the shared memory controller and Vega performs as expected, with good internal bandwidth due to the higher-speed DDR4 memory. But transfer up/down speeds are disappointing – possibly a driver issue, as “zero-copy” mode should be engaged and working on such transfers (APU mode).
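On an APU, “zero-copy” means allocating a buffer in host-visible memory and mapping it, so the GPU works on the same physical pages instead of copying across a (non-existent) bus. A hedged sketch of the pattern (ctx, queue, input and size are assumed to exist; error handling trimmed):

```c
#include <CL/cl.h>
#include <string.h>

// Fill a buffer via map/unmap rather than clEnqueueWriteBuffer; with
// CL_MEM_ALLOC_HOST_PTR on an APU the map should be a pointer fix-up,
// not a transfer over the shared memory controller.
cl_mem create_zero_copy_buffer(cl_context ctx, cl_command_queue queue,
                               const void* input, size_t size) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);
    void* p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, size, 0, NULL, NULL, &err);
    memcpy(p, input, size);                    // write in place: no upload
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    return buf;                                // ready for kernels
}
```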
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 246 244 288 412 [+49%] As with the CPU data latencies, global “in-page/random” (aka “TLB hit”) latencies are a bit high, though not by a huge amount.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 365 372 436 519 [+19%] Due to faster memory clock but increased timings “full/random” latencies appear a bit higher.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 156 158 213 201 [-6%] Sequential access latencies are less than competition by 6%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 245 243 252 411 [+63%] None have dedicated constant memory thus we see a similar picture to global memory: somewhat high latencies.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 82 84 100 22.5 [1/5x] Vega has dedicated shared/local memory and it shows – it’s about 5x faster than Intel’s designs.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 1152 1157 1500 278 [1/5x] Texture access is also very fast on Vega, with latencies 5x lower (aka 1/5) than Intel’s designs.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 1178 1162 1533 418 [1/3x] Even full/random accesses are fast, 3x (three times) faster than Intel’s.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 1077 1081 1324 122 [1/10x] With sequential access we see a crazy 10x lower latency as if AMD uses prefetchers and Intel does not.
As we’ve seen in Ryzen 2’s data latency tests – “in-page/random” latencies are higher than the competition’s but the rest are comparable, with sequential (prefetched) latencies especially small. But dedicated shared/local memory is far faster (5x) and texture accesses are also very fast (3-5x), which should greatly help algorithms making use of them.
Plotting the global (or constant) memory latencies together, we see that the “in-page/random” access latencies should perhaps peak somewhat lower, but they are still nothing close to what we’ve seen in the (CPU) data memory latencies article. Unlike with the texture latencies graph, it is not very clear where the caches are located.
The texture latencies graph is far clearer: we can see each level’s caches, and unlike the global (or constant) latencies the “in-page/random” latency peaks and holds at a somewhat lower level (4MB).
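For reference, latency tests of this kind are typically pointer-chases: every load depends on the previous one, so the average time per hop approximates the access latency, and the access pattern (random cycle vs. sequential, buffer size vs. page/TLB coverage) selects what gets measured. A minimal CPU-side sketch of the technique (parameters are illustrative):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t n = 1 << 24;                    // 64MB of 32-bit indices
    std::vector<uint32_t> order(n), next(n);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (size_t i = 0; i + 1 < n; ++i)           // link into one random cycle
        next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    uint32_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < n; ++i) idx = next[idx];  // serially dependent loads
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
    std::printf("~%.1f ns per access (idx=%u)\n", ns, idx);  // print idx: no DCE
    return 0;
}
```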

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Vega mobile, like its desktop big siblings, is undoubtedly powerful and a good upgrade over the older integrated GPU cores; it also supports modern features like half/FP16 compute (which needs vectorisation to what the driver reports as the “optimised width”) and relishes complex algorithms making use of its efficient shared/local memory. However, Intel’s GT3 EV9.x can get close to it in some workloads and – due to its better FP64 ratio (1/8x vs 1/16x) – even beat it in most FP64 precision tests, which is somewhat disappointing.

Luckily for AMD, the GT3 variant is very rare and thus Vega has an easy job defeating GT2 in just about all tests; but it shows that should Intel “get serious” and continue to improve integrated graphics (and CPUs) like it used to before Skylake (SKL/KBL) – AMD might have more serious competition on its hands.

Note that until recently (2019) Ryzen2 mobile APUs were not supported by AMD’s main drivers (“Adrenalin”) and had to rely on pretty old OEM (HP, etc.) drivers that were somewhat problematic – especially with Windows 10 changing every 6 months while the drivers were almost a year old. Thankfully this has now changed and users (including us) can benefit from updated, stable and performant drivers.

In any case, if you want a laptop/ultraportable with just an APU and no dedicated graphics, then Vega – which means a Ryzen2 system – is pretty much your only choice. That alone makes it worthy of a recommendation.

In a word: Highly Recommended

In this article we test GP(GPU) integrated graphics performance; please see our other articles on:

AMD Ryzen 2 Mobile 2500U Review & Benchmarks – Cache & Memory Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited mobile APU (“Raven Ridge”) version of the desktop Ryzen2 with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper – there was no (at least released) APU or mobile version – leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design in order to hit the TDP (15-35W) range for mobile devices. It has also added the brand-new Vega graphics cores to the APU, which have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (core complex) thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

Why review it now?

With Ryzen3 soon to be released later this year (2019) – with a corresponding Ryzen3 APU for mobile – it is good to re-test the platform, especially in light of the many BIOS/firmware updates, the many video/GPU driver updates and, not forgetting, the many operating system (Windows) vulnerability (“Spectre”) mitigations that have greatly affected performance – sometimes for the good (firmware, drivers, optimisations), sometimes for the bad (mitigations).

In this article we test CPU Cache and Memory performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 mobile (2500U) with competing architectures (Intel gen 6, 7, 8) with a view to upgrading to a mid-range but high performance design.

 

CPU Specifications AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 2x 32kB 8-way / 2x 32kB 8-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen2 icache is 2x of Intel with matching dcache.
L2 Caches 4x 512kB 8-way 2x 256kB 16-way 2x 256kB 16-way 4x 256kB 16-way Ryzen2’s L2 caches are 2x bigger than Intel’s (512 vs 256kB per core) and thus 4x larger in total than older 2-core SKL/KBL-U.
L3 Caches 4MB 16-way 4MB 16-way 4MB 16-way 6MB 16-way Here CFL-U brings 50% bigger L3 cache (6 vs 4MB) which may help some workloads.
TLB 4kB pages 64 full-way / 1536 8-way 64 8-way / 1536 6-way 64 8-way / 1536 6-way 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages 64 full-way / 1536 2-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way 8 full-way / 1536 6-way No TLB changes, same as 4kB pages.
Memory Controller Speed (MHz) 600 2600 (400-3100) 2700 (400-3500) 1600 (400-3400) Ryzen2’s memory controller runs at memory clock (MCLK) base rate thus depends on memory installed. Intel’s UNC (uncore) runs between min and max CPU clock thus perhaps faster.
Memory Speed (MHz) Max 1200-2400 (2667) 1033-1866 (2133) 1067-2133 (2400) 1200-2400 (2533) Ryzen2 now supports up to 2667MHz (officially) which should improve its performance quite a bit – unfortunately fast DDR4 is very expensive right now.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Timing (clocks) 17-17-17-39 8-56-18-9 1T 14-17-17-40 10-57-16-11 2T 15-15-15-36 4-51-17-8 2T 19-19-19-43 5-63-21-9 2T Timings naturally depend on memory which for laptops is somewhat limited and quite expensive.
Memory Controller Firmware 2.1.0 3.6.0 3.6.4 Firmware is the same as on desktop devices.

Core Topology and Testing

As discussed in the previous articles (Ryzen1 and Ryzen2 reviews), cores on Ryzen are grouped in blocks (CCX, or core complexes), each with its own L3 cache – but connected via a 256-bit bus running at memory controller clock. However – unlike on desktop/workstations – so far all Ryzen2 mobile designs have a single (1) CCX, thus the issues that “plagued” the desktop/workstation Ryzen designs do not apply here.

However, AMD could have released higher-core mobile designs to go against Intel’s H-line (beefed up to 6-core / 12-threads with CFL-H), which would have likely required 2 CCX blocks. At this time (early 2019), considering that Ryzen3 (mobile) will launch soon, that seems unlikely to happen…

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen2 mobile supports all modern instruction sets including AVX2, FMA3 and even more.

Results Interpretation: Higher rate values (GOPS, MB/s, etc.) mean better performance. Lower latencies (ns, ms, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.
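For context, “2MB large pages” on Windows are requested per allocation; a sketch of how a benchmark might back its working set with them (requires the “Lock pages in memory” right, SeLockMemoryPrivilege, to be granted and enabled first):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    const SIZE_T large = GetLargePageMinimum();  // typically 2MB on x64
    if (large == 0) { std::puts("large pages unsupported"); return 1; }

    // Size must be a multiple of the large-page granularity.
    void* p = VirtualAlloc(nullptr, 16 * large,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (!p) { std::printf("VirtualAlloc failed: %lu\n", GetLastError()); return 1; }

    // One TLB entry now covers 2MB instead of 4kB, cutting TLB misses.
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```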

Native Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 18.65 [-21%] 16.81 18.93 23.65 Ryzen2 L1D is not as wide as Intel’s designs (512-bit) thus inter-core transfers in L1D are 20% slower.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 9.29 [=] 6.62 7.4 9.3 Using the unified L3 caches – both Ryzen2 and CFL-U manage the same bandwidths.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 16 [-24%] 21 18 19 Within the same core (share L1D) Ryzen2 has lower latencies by 24% than all Intel CPUs.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 46 [-23%] 61 54 56 Within the same compute unit (shared L3) Ryzen2 again yields 23% lower latencies.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) n/a n/a n/a n/a With a single CCX we have no latency issues.
While the L1D cache on Ryzen2 is not as wide as on Intel SKL/KBL/CFL-U to yield the same bandwidth (20% lower), both it and L3 manage lower latencies by a relatively large ~25%. With a single CCX design we have none of the issues seen on the desktop/workstation CPUs.
Aggregated L1D Bandwidth (GB/s) 267 [-67%] 315 302 628 Ryzen2’s L1D is just not wide enough – even 2-core SKL/KBL-U have more bandwidth and CFL-U has almost 3x more.
Aggregated L2 Bandwidth (GB/s) 225 [-29%] 119 148 318 The 2x larger L2 caches (512 vs 256kB) perform better but still CFL-U manages 30% more bandwidth.
Aggregated L3 Bandwidth (GB/s) 130 [-31%] 90 95 188 CFL-U not only has 50% bigger L3 (6 vs 4MB) but also somehow manages 30% more bandwidth too while SKL/KBL-U are left in the dust.
Aggregated Memory (GB/s) 24 [=] 21 21 24 With the same memory clock, Ryzen2 ties with CFL-U which means good bandwidth for the cores.
While we saw big improvements on Ryzen2 (desktop) for all caches (L1D/L2/L3) – more work needs to be done: in particular the L1D caches are not wide enough compared to Intel’s CPUs – and even L2/L3 need to be wider. Most likely Ryzen3, with native wide 256-bit SIMD (unlike the 128-bit of Ryzen1/2), will have twice-as-wide L1D/L2 which should be sufficient to match Intel.

The memory controller performs well matching CFL-U and is officially rated for higher DDR4 memory – though on laptops the choices are more limited and more expensive.

Data In-Page Random Latency (ns) 91.8 [4-13-32] [+2.75x] 34.6 [3-10-17] 27.6 [4-12-22] 24.5 As on desktop Ryzen1/2, in-page random latencies are large compared to the competition; L1D/L2 are OK but L3 is also somewhat large.
Data Full Random Latency (ns) 117 [4-13-32] [-16%] 108 [3-10-27] 84.7 [4-12-33] 139 Out-of-page latencies are not much different which means Ryzen2 is a lot more competitive but still somewhat high.
Data Sequential Latency (ns) 4.1 [4-6-7] [-31%] 5.6 [3-10-11] 6.5 [4-12-13] 5.9 Ryzen’s prefetchers are working well with sequential access, yielding lower latencies than Intel.
Ryzen1/2 desktop issues were high memory latencies (in-page/full random) and nothing much changes here. “In-Page/Random pattern” (TLB hit) latencies are almost 3x higher – actually not much lower than the “Full/Random pattern” (TLB miss) latencies – which are comparable to Intel’s SKL/KBL/CFL. On the other hand, the “Sequential pattern” yields lower latencies (30% less) than Intel, thus simple access patterns work better than complex/random ones.
Looking at the data access latencies’ graph for Ryzen2 mobile – we see the “in-page/random” following the “full/random” latencies all the way to 8MB block where they plateau; we would have expected them to plateau at a lower value. See the “code access latencies” graph below.
Code In-Page Random Latency (ns) 17.6 [5-9-25] [+14%] 13.3 [2-9-18] 14.9 [2-11-21] 15.5 Code latencies were not a problem on Ryzen1/2 and they are OK here, 14% higher.
Code Full Random Latency (ns) 108 [5-15-48] [+19%] 91.8 [2-10-38] 90.4 [2-11-45] 91 Out-of-page latency is also competitive and just 20% higher.
Code Sequential Latency (ns) 8.2 [5-13-20] [+37%] 5.9 [2-4-8] 7.8 [2-4-9] 6 Ryzen’s prefetchers keep sequential access pattern latencies low, though not as low as Intel’s.
Unlike data, code latencies (any pattern) are competitive with Intel though CFL-U does have lower latencies (between 15-20%) but in exchange you get a 2x bigger L1I (64 vs 32kB) which should help complex software.
This graph for code access latencies is what we expected to see for data: “in-page/random” latencies plateau much earlier than “full/random” thus “TLB hit” latencies being much lower than “TLB miss” latencies.
Memory Update Transactional (MTPS) 7.17 [-7%] 6.5 7.72 7.2 As none of Intel’s CPUs here have HLE enabled, Ryzen2 performs really well with just 7% fewer transactions/second.
Memory Update Record Only (MTPS) 5.66 [+5%] 4.66 5.25 5.4 With only record updates it manages to be 5% faster.
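For reference, hardware-transactional updates of the kind this test exercises look roughly as follows; this is a hedged sketch using the RTM intrinsics of Intel TSX (a related interface to the HLE the text mentions; compile with -mrtm on a TSX-enabled CPU, names are illustrative):

```cpp
#include <immintrin.h>
#include <atomic>

static std::atomic<int> fallback_lock{0};
static long counter = 0;

void transactional_increment() {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        if (fallback_lock.load(std::memory_order_relaxed))
            _xabort(0xff);      // lock held: abort and take the fallback path
        ++counter;              // buffered in cache, commits atomically
        _xend();
        return;
    }
    // Abort/fallback path: an ordinary spinlock serialises the update.
    while (fallback_lock.exchange(1, std::memory_order_acquire)) { }
    ++counter;
    fallback_lock.store(0, std::memory_order_release);
}
```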

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

We saw good improvement on Ryzen2 (desktop/workstation) but still not enough to beat Intel; a lot more work is needed both on L1/L2 cache bandwidth/widening and on memory latency (the “in-page” aka “TLB hit” random access pattern) which cannot be improved with firmware/BIOS updates (AGESA firmware). Ryzen2 mobile does have the potential to use faster DDR4 memory (officially rated 2667MHz) and thus could overtake Intel with faster memory – but laptop DDR4 SODIMM choice is limited.

Regardless of these differences – the CPU results we’ve seen are solid thus sufficient to recommend Ryzen2 mobile especially when at a much lower cost than competing designs. Even if you do choose Intel – you will be picking up a better design due to Ryzen2 mobile competition – just compare the SKL/KBL-U and CFL/WHL-U results.

We are looking forward to seeing what improvements Ryzen3 brings to the mobile platform.

In a word: Recommended – with reservations

In this article we tested CPU Cache and Memory performance; please see our other articles on:

AMD Ryzen 2 Mobile 2500U Review & Benchmarks – CPU Performance

What is “Ryzen2” ZEN+ Mobile?

It is the long-awaited mobile APU (“Raven Ridge”) version of the desktop Ryzen2 with integrated Vega graphics (the latest GPU architecture from AMD) for mobile devices. While on desktop we had the original Ryzen1/ThreadRipper – there was no (at least released) APU or mobile version – leaving only the much older designs that were never competitive against Intel’s ULV and H APUs.

After the very successful launch of the original “Ryzen1”, AMD has been hard at work optimising and improving the design in order to hit the TDP (15-35W) range for mobile devices. It has also added the brand-new Vega graphics cores to the APU, which have been incredibly performant in the desktop space. Note that mobile versions have a single CCX (core complex) thus do not require operating system kernel patches for best thread scheduling/power optimisation.

Here’s what AMD says it has done for Ryzen2:

  • Process technology optimisations (12nm vs 14nm) – lower power but higher frequencies
  • Improvements for cache & memory speed & latencies (we shall test that ourselves!)
  • Multi-core optimised boost (aka Turbo) algorithm – XFR2 – higher speeds

Why review it now?

With Ryzen3 soon to be released later this year (2019) – with a corresponding Ryzen3 APU for mobile – it is good to re-test the platform, especially in light of the many BIOS/firmware updates, the many video/GPU driver updates and, not forgetting, the many operating system (Windows) vulnerability (“Spectre”) mitigations that have greatly affected performance – sometimes for the good (firmware, drivers, optimisations), sometimes for the bad (mitigations).

In this article we test CPU core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Ryzen2 mobile (2500U) with competing architectures (Intel gen 6, 7, 8) with a view to upgrading to a mid-range but high performance design.

 

CPU Specifications AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
Cores (CU) / Threads (SP) 4C / 8T 2C / 4T 2C / 4T 4C / 8T Ryzen has double the cores of ULV Skylake/Kabylake and only recently Intel has caught up by also doubling cores.
Speed (Min / Max / Turbo) 1.6-2.0-3.6GHz (16x-20x-36x) 0.4-2.6-3.1GHz (4x-26x-31x) 0.4-2.7-3.5GHz (4x-27x-35x) 0.4-1.6-3.4GHz (4x-16x-34x) Ryzen2 has higher base and turbo than CFL-U and higher turbo than all Intel competition.
Power (TDP) 25-35W 15-25W 15-25W 25-35W Both Ryzen2 and CFL-U have higher TDP at 25W and turbo up to 35W depending on configuration while older devices were mostly 15W with turbo 20-25W.
L1D / L1I Caches 4x 32kB 8-way / 4x 64kB 4-way 2x 32kB 8-way / 2x 32kB 8-way 2x 32kB 8-way / 2x 32kB 8-way 4x 32kB 8-way / 4x 32kB 8-way Ryzen2 icache is 2x of Intel with matching dcache.
L2 Caches 4x 512kB 8-way 2x 256kB 16-way 2x 256kB 16-way 4x 256kB 16-way Ryzen2 L2 cache is 2x bigger than Intel and thus 4x larger than older SKL/KBL-U.
L3 Caches 4MB 16-way 4MB 16-way 4MB 16-way 6MB 16-way Here CFL-U brings 50% bigger L3 cache (6 vs 4MB) which may help some workloads.
Microcode (Firmware) MU8F1100-0B MU064E03-C6 MU068E09-8E MU068E09-96 On Intel you can see just how many updates the platforms have had – we’re now at CX versions but even Ryzen2 has had a few.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). Ryzen supports all modern instruction sets including AVX2, FMA3 and even more like SHA HWA (supported by Intel’s Atom only) but has dropped all AMD’s variations like FMA4 and XOP likely due to low usage.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 103 [-6%] 52 73 109 Right off Ryzen2 does not beat CFL-U but is very close, soundly beating the older Intel designs.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 102 [-4%] 51 74 106 With a 64-bit integer workload – the difference drops to 4%.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 79 [+18%] 39 45 67 Somewhat surprisingly, Ryzen2 is almost 20% faster than CFL-U here.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 67 [+22%] 33 37 55 With FP64 nothing much changes, with Ryzen2 over 20% faster.
You can see why Intel needed to double the cores for ULV: otherwise even top-of-the-line i7 SKL/KBL-U are pounded into dust by Ryzen2. CFL-U does trade blows with it and manages to pull ahead in Dhrystone but Ryzen2 is 20% faster in floating-point. Whatever you choose you can thank AMD for forcing Intel’s hand.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 239 [-32%] 183 193 350 In this vectorised AVX2 integer test Ryzen2 starts 30% slower than CFL-U but does beat the older designs.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 53.4 [-58%] 68.2 75 127 With a 64-bit AVX2 integer vectorised workload, Ryzen2 is even slower.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 2.41 [+12%] 1.15 1.12 2.15 This is a tough test using Long integers to emulate Int128 without SIMD; here Ryzen2 has its 1st win by 12% over CFL-U.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 222 [-20%] 149 159 277 In this floating-point AVX/FMA vectorised test, Ryzen2 is still slower but only by 20%.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 126 [-22%] 88.3 94.8 163 Switching to FP64 SIMD code, nothing much changes still 20% slower.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 6.23 [-16%] 3.79 4.04 7.4 In this heavy algorithm using FP64 to mantissa extend FP128 with AVX2 – Ryzen2 is less than 20% slower.
Just as on desktop, we did not expect AMD’s Ryzen2 mobile to beat 4-core CFL-U (with Intel’s wide SIMD units) and it doesn’t: but it remains very competitive and is just 20% slower. In any case, it soundly beats all older but ex-top-of-the-line i7 SKL/KBL-U thus making them all obsolete at a stroke.
BenchCrypt Crypto AES-256 (GB/s) 10.9 [+1%] 6.29 7.28 10.8 With AES/HWA support all CPUs are memory bandwidth bound – here Ryzen2 ties with CFL-U and soundly beats older versions.
BenchCrypt Crypto AES-128 (GB/s) 10.9 [+1%] 8.84 9.07 10.8 What we saw with AES-256 just repeats with AES-128; Ryzen2 is marginally faster but the improvement is there.
BenchCrypt Crypto SHA2-256 (GB/s) 6.78 [+60%] 2 2.55 4.24 With SHA/HWA Ryzen2 similarly powers through hashing tests leaving Intel in the dust; SHA is still memory bound but Ryzen2 is 60% faster than CFL-U.
BenchCrypt Crypto SHA1 (GB/s) 7.13 [+2%] 3.88 4.07 7.02 Ryzen also accelerates the soon-to-be-defunct SHA1 but CFL-U with AVX2 has caught up.
BenchCrypt Crypto SHA2-512 (GB/s) 1.48 [-44%] 1.47 1.54 2.66 SHA2-512 is not accelerated by SHA/HWA thus Ryzen2 falls behind here.
Ryzen2 mobile (like its desktop brother) gets a boost from SHA/HWA but otherwise ties with CFL-U which is helped by its SIMD units. As before older 2-core i7 SKL/KBL-U are left with no hope and cannot even saturate the memory bandwidth.
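The AES and SHA results above hinge on hardware support, which software detects via CPUID before choosing a code path. A minimal sketch with MSVC intrinsics (on GCC/Clang, __get_cpuid_count does the same job):

```cpp
#include <intrin.h>
#include <cstdio>

int main() {
    int leaf1[4], leaf7[4];
    __cpuid(leaf1, 1);               // EAX=1: basic feature flags
    __cpuidex(leaf7, 7, 0);          // EAX=7, ECX=0: extended feature flags

    const bool aes  = leaf1[2] & (1 << 25);  // ECX bit 25: AES-NI
    const bool sha  = leaf7[1] & (1 << 29);  // EBX bit 29: SHA extensions
    const bool avx2 = leaf7[1] & (1 << 5);   // EBX bit 5: AVX2

    std::printf("AES-NI: %d  SHA: %d  AVX2: %d\n", aes, sha, avx2);
    return 0;
}
```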
BenchFinance Black-Scholes float/FP32 (MOPT/s) 93.3 [-4%] 44.7 49.3 97 In this non-vectorised test we see Ryzen2 matches CFL-U.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 77.8 [-8%] 39 43.3 84.7 Switching to FP64 code, nothing much changes, Ryzen2 is 8% slower.
BenchFinance Binomial float/FP32 (kOPT/s) 35.5 [+61%] 10.4 12.3 22 Binomial uses thread shared data thus stresses the cache & memory system; here the arch(itecture) improvements do show, Ryzen2 is 60% faster than CFL-U.
BenchFinance Binomial double/FP64 (kOPT/s) 19.5 [-7%] 10.1 11.4 21 With FP64 code Ryzen2 drops back from its previous win.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 20.1 [+1%] 9.24 9.87 19.8 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; Ryzen2 cannot match its previous gain.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 15.3 [-3%] 7.38 7.88 15.8 Switching to FP64 nothing much changes, Ryzen2 matches CFL-U.
Unlike on desktop where Ryzen2 is unstoppable, here we see a more mixed result – with CFL-U able to trade blows with it except in one test where Ryzen2 is 60% faster. Otherwise CFL-U does manage to be just a bit faster in the other tests, but nothing significant.
BenchScience SGEMM (GFLOPS) float/FP32 107 [+16%] 92 76 85 In this tough vectorised AVX2/FMA algorithm Ryzen2 manages to be almost 20% faster than CFL-U.
BenchScience DGEMM (GFLOPS) double/FP64 47.2 [-6%] 44.2 31.7 50.5 With FP64 vectorised code, Ryzen2 drops down to 6% slower.
BenchScience SFFT (GFLOPS) float/FP32 3.75 [-53%] 7.17 7.21 8 FFT is also heavily vectorised (x4 AVX2/FMA) but stresses the memory sub-system more; Ryzen2 does not like it much.
BenchScience DFFT (GFLOPS) double/FP64 4 [-7%] 3.23 3.95 4.3 With FP64 code, Ryzen2 does better and is just 7% slower.
BenchScience SNBODY (GFLOPS) float/FP32 112 [-27%] 96.6 104.9 154 N-Body simulation is vectorised but many memory accesses and not a Ryzen2 favourite.
BenchScience DNBODY (GFLOPS) double/FP64 45.3 [-30%] 29.6 30.64 64.8 With FP64 code nothing much changes.
With highly vectorised SIMD code Ryzen2 remains competitive but finds some algorithms tougher than others. Just as with desktop Ryzen1/2 it may require SIMD code changes for best performance due to its 128-bit units; Ryzen3 with 256-bit units should fix that.
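To see why the 128-bit units matter, consider a typical AVX2/FMA inner loop: on Intel each 256-bit FMA is one native operation, while Ryzen1/2 crack it into two 128-bit halves, roughly halving peak SIMD throughput. An illustrative sketch (not the benchmark’s actual kernel):

```cpp
#include <immintrin.h>
#include <cstddef>

// y[i] = a * x[i] + y[i] over n floats, 8 lanes per 256-bit FMA.
void fma_axpy(float* y, const float* x, float a, size_t n) {
    const __m256 va = _mm256_set1_ps(a);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));  // fused a*x+y
    }
    for (; i < n; ++i) y[i] = a * x[i] + y[i];   // scalar tail
}
```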
CPU Image Processing Blur (3×3) Filter (MPix/s) 532 [-39%] 418 474 872 In this vectorised integer AVX2 workload Ryzen2 is quite a bit slower than CFL-U.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 146 [-58%] 168 191 350 Same algorithm but more shared data makes Ryzen2 even slower, 1/2 CFL-U.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 123 [-32%] 87.6 98 181 Again same algorithm but even more data shared reduces the delta to 1/3.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 185 [-37%] 136 164 295 Different algorithm but still AVX2 vectorised workload still Ryzen2 is ~35% slower.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 26.5 [-1%] 13.3 14.4 26.7 Still AVX2 vectorised code but here Ryzen2 ties with CFL-U.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 9.38 [-38%] 7.21 7.63 15.09 Again we see Ryzen2 fall behind CFL-U.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 660 [-53%] 730 764 1394 With integer AVX2 workload, Ryzen2 falls behind even SKL/KBL-U.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 94.1 [-55%] 99.6 105 209 In this final test again with integer AVX2 workload Ryzen2 is 1/2 speed of CFL-U.

With all the modern instruction sets supported (AVX2, FMA, AES and SHA/HWA) Ryzen2 does extremely well in all workloads – and makes all older i7 SKL/KBL-U designs obsolete and unable to compete. As we said – Intel pretty much had to double the number of cores in CFL-U to stay competitive – and it does – but it is all thanks to AMD.

Even so, Ryzen2 does beat CFL-U in non-SIMD tests, with the latter helped tremendously by its wide (256-bit) SIMD units which greatly benefit from AVX2/FMA workloads. Ryzen3, with double-width SIMD units, should be much faster still and thus a serious threat to Intel’s designs.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest drivers. .Net 4.7.x (RyuJit), Java 1.9.x. Turbo / Boost was enabled on all configurations.

VM Benchmarks AMD Ryzen2 2500U Raven Ridge Intel i7 6500U (Skylake ULV) Intel i7 7500U (Kabylake ULV) Intel i5 8250U (Coffeelake ULV) Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 22.7 [+39%] 9.58 12.1 16.36 .Net CLR integer performance starts great – Ryzen2 is 40% faster than CFL-U.
BenchDotNetAA .Net Dhrystone Long (GIPS) 22 [+34%] 9.24 12.1 16.4 64-bit integer workloads also favour Ryzen2, still 35% faster.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 40.5 [+9%] 18.7 22.5 37.1 Floating-Point CLR performance is also good but just about 10% faster than CFL-U.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 49.6 [+6%] 23.7 28.8 46.8 FP64 performance is also great (CLR seems to promote FP32 to FP64 anyway) with Ryzen2 faster by 6%.
.Net CLR performance was always incredible on Ryzen1 and 2 (desktop/workstation) and here is no exception – all Intel designs are left in the dust, with even CFL-U soundly beaten by anything between 10-40%.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 43.23 [+20%] 21.32 25 35 Just as we saw with Dhrystone, this integer workload sees a big 20% improvement for Ryzen2.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 44.71 [+21%] 21.27 26 37 With 64-bit integer workload we see a similar story – 21% better.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 137 [+46%] 78.17 94 56 Here we make use of RyuJit’s support for SIMD vectors, thus running AVX2/FMA code – Ryzen2 does even better here, almost 50% faster than CFL-U.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 75.2 [+45%] 43.59 52 35 Switching to FP64 SIMD vector code – still running AVX2/FMA – we see a similar gain.
As before Ryzen2 dominates .Net CLR performance – even when using RyuJit’s SIMD instructions we see big gains of 20-45% over CFL-U.
Java Arithmetic Java Dhrystone Integer (GIPS) 222 [+13%] 119 150 196 We start JVM integer performance with a 13% lead over CFL-U.
Java Arithmetic Java Dhrystone Long (GIPS) 208 [+12%] 101 131 185 Nothing much changes with 64-bit integer workload – Ryzen2 still faster.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 50.9 [+9%] 23.13 27.8 46.6 With a floating-point workload Ryzen2 performance improvement drops a bit.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 54 [+13%] 23.74 28.7 47.7 With FP64 workload Ryzen2 gets back to 13% faster.
The Java JVM performance delta is not as high as .Net’s but still a decent 10%+ over CFL-U – similar to what we’ve seen on desktop.
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 48.74 [+15%] 20.5 24 42.5 Oracle’s JVM does not yet support native vector to SIMD translation like .Net’s CLR but Ryzen2 is still 15% faster.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 46.75 [+4%] 20.3 24.8 44.8 With 64-bit vectorised workload Ryzen2’s lead drops to 4%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 38.2 [+9%] 14.59 17.6 35 Switching to floating-point we return to a somewhat expected 9% improvement.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 35.7 [+2%] 14.59 17.4 35 With FP64 workload Ryzen2’s lead somewhat inexplicably drops to 2%.
Java’s lack of vectorised primitives (which would allow the JVM to use SIMD instruction sets) lets Ryzen2 do well and overtake CFL-U by 2-15%.

Ryzen2 on desktop dominated the .Net and Java benchmarks – and Ryzen2 mobile does not disappoint – it is consistently faster than CFL-U which does not bode well for Intel. If you mainly run .Net and Java apps on your laptop then Ryzen2 is the one to get.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Ryzen2 was a worthy update on the desktop and Ryzen2 mobile does not disappoint; it instantly obsoleted all older Intel designs (SKL/KBL-U) with only the very latest 4-core ULV (CFL/WHL-U) being able to match it. You can see from the results how AMD forced Intel’s hand to double cores in order to stay competitive.

Even then, Ryzen2 manages to beat CFL-U in non-SIMD workloads and remains competitive in SIMD AVX2/FMA workloads (only 20% or so slower), while soundly beating SKL/KBL-U with their 2 cores and wide SIMD units. With the soon-to-be-released Ryzen3 sporting wide SIMD units (256-bit, as CFL/WHL-U), Intel will need AVX512 to stay competitive – however that has its own issues which may be problematic in the mobile/ULV space.

Both Ryzen2 mobile and CFL/WHL-U have increased TDP (~25W) in order to manage the increased number of cores (instead of the 15W of older 2-core designs), with short-term turbo power as high as 35W. This means that while larger 14/15″ designs with good cooling are able to extract top performance, smaller 12/13″ designs are forced to use a lower cTDP of 15W (20-25W turbo) and thus deliver lower multi-threaded performance.

Also consider that Ryzen2 is not affected by most “Spectre” vulnerabilities nor by “Meltdown”, thus it does not need KVA (kernel pages virtualisation) which greatly impacts I/O workloads. Only the very latest Whiskey-Lake ULV (WHL-U, gen 8 refresh) has hardware “Meltdown” fixes – thus there is little point buying CFL-U (gen 8 original) and even less point buying older SKL/KBL-U.

In light of the above – Ryzen2 mobile is a compelling choice especially as it comes at a (much) lower price-point: its competition is really only the very latest WHL-U i5/i7 which do not come cheap – with most vendors still selling CFL-U and even KBL-U inventory. The only issue is the small choice of laptops available with it – hopefully the vendors (Dell, HP, etc.) will continue to release more versions especially with Ryzen 3 mobile.

In a word: Highly Recommended!

Please see our other articles on:

Intel Core i9 9900K CoffeeLake-R Review & Benchmarks – 2-channel DDR4 Cache & Memory Performance

What is “CoffeeLake-R” CFL-R?

It is the “refresh” (updated) version of the 8th generation Intel Core architecture (CFL) – itself a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). While ordinarily this would not be much of an event – this time we do have more significant changes:

  • Patched vulnerabilities in hardware: this can help restore I/O workload performance degradation due to OS mitigations
    • Kernel Page Table Isolation (KPTI) aka “Meltdown” – Patched in hardware
    • L1TF/Foreshadow – Patched in hardware
    • (IBPB/IBRS) “Spectre 2” – OS mitigation needed
    • Speculative Store Bypass disabling (SSBD) “Spectre 4” – OS mitigation needed
  • Increased core counts yet again: CFL-R top-end now has 8 cores, not 6.

Intel CPUs bore the brunt of the vulnerabilities disclosed at the start of 2018 with “Meltdown” operating system mitigations (KVA) likely having the biggest performance impact in I/O workloads. While modern features (e.g. PCID (process context id) acceleration) could help reduce performance impact somewhat on recent architectures (4th gen and newer) the impact can still be significant. The CFL-R hardware fixes (thus not needing KVA) may thus prove very important.

On the desktop we also see increased cores (again!) now up to 8 (thus 16 threads with HyperThreading) – double what KBL and SKL brought and matching AMD.

We also see increased clocks, mainly Turbo, but this still allows 1 or 2 cores to boost clocks higher than CFL could and thus help workloads not massively threaded. This can improve responsiveness as single tasks can be run at top speed when there is little thread utilization.

While rated TDP has not changed, in practice we are likely to see increased “real” power consumption especially due to higher clocks – with Turbo pushing power consumption even higher – close to SKL/KBL-X.

In this article we test CPU Core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Core i9 (9900K) with the previous generation (8700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
L1D / L1I Caches 8x 32kB 8-way / 8x 32kB 8-way 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way No L1D/I changes, Ryzen’s L1I is twice as big.
L2 Caches 8x 256kB 4-way 6x 256kB 4-way 8x 512kB 8-way 10x 1MB 16-way No L2 changes, Ryzen’s L2 is twice as big again.
L3 Caches 16MB 16-way 12MB 16-way 2x 8MB 16-way 2x 8MB 16-way L3 has also increased with the number of cores, and now matches Ryzen.
TLB 4kB pages 64 4-way / 64 8-way / 1536 6-way 64 4-way / 64 8-way / 1536 6-way 64 full-way 1536 8-way 64 4-way / 64 8-way / 1536 6-way No TLB changes.
TLB 2MB pages 8 full-way / 1536 6-way 8 full-way / 1536 6-way 64 full-way 1536 2-way 8 full-way / 1536 6-way No TLB changes.
Memory Controller Speed (MHz) 1200-5000 1200-4400 1333-2667 1200-2700 The uncore (memory controller) runs at faster clock due to higher rated clock but not a lot in it.
Memory Data Speed (MHz) 3200 3200 2667 3200 CFL/R can easily run at 3200Mt/s while KBL/SKL were not as reliable. We could not get Ryzen past 2667 while it does support 2933.
Memory Channels / Width 2 / 128-bit 2 / 128-bit 2 / 128-bit 2 / 128-bit All have 128-bit total channel width.
Memory Bandwidth (GB/s) 50 50 42 100 Bandwidth has naturally increased with memory clock speed but latencies are higher.
Uncore / Memory Controller Firmware 2.6.2 2.6.2 We’re on firmware 2.6.x on both.
Memory Timing (clocks) 16-16-16-36 6-52-25-12 2T 16-16-16-36 6-52-25-12 2T 16-17-17-35 7-60-20-10 2T Timings are very much BIOS dependent and vary a lot.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL-R supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1807), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

Native Benchmarks Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Best (GB/s) 70.7 [+28%] 52.5 55.3 86 CFL-R finally overtakes Ryzen2 in inter-core bandwidth with almost 30% more bandwidth.
CPU Multi-Core Benchmark Total Inter-Core Bandwidth – Worst (GB/s) 15.4 [-1%] 15.5 6.35 25.7 In the worst case, pairs on Ryzen2 must go across CCXes – unlike on Intel’s CPUs – thus CFL-R can muster over 2x more bandwidth.
CFL-R manages a good bandwidth improvement with its 2 extra cores, allowing it to dominate Ryzen2; worst-case bandwidth does not improve as the inter-core connector has remained the same.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Core (ns) 13.4 [-7%] 14.4 13.5 15 With its faster clock, CFL-R manages lower inter-core latency with 7% drop.
CPU Multi-Core Benchmark Inter-Unit Latency – Same Compute Unit (ns) 43.7 [-3%] 45 40 75 Within the same unit, Ryzen2 is again faster than CFL/R.
CPU Multi-Core Benchmark Inter-Unit Latency – Different Compute Unit (ns) 115 Obviously going across CCXes is slow – about 3x slower – which needs careful thread scheduling.
The multiple-CCX design of Ryzen2 still presents some challenges to programmers, requiring threads to be carefully scheduled – thus the unified CFL-R, just like CFL before it, enjoys lower latencies throughout.
Aggregated L1D Bandwidth (GB/s) 1890 [+39%] 1630 854 2220 Intel’s wide L1D in CFL/R means almost 2x more bandwidth than Ryzen2.
Aggregated L2 Bandwidth (GB/s) 618 [+8%] 571 720 985 But Ryzen2’s L2 caches are not only twice as big but also very wide – CFL/R surprisingly cannot beat it.
Aggregated L3 Bandwidth (GB/s) 326 [=] 327 339 464 Ryzen2’s two L3 caches also provide good bandwidth, matching CFL’s unified L3 cache.
Aggregated Memory (GB/s) 35.5 [=] 35.6 32.2 70 Running at 3200Mt/s, CFL obviously enjoys higher bandwidth than Ryzen2 at 2667Mt/s, but somehow the latter has better efficiency.
Nothing much has changed in CFL/R vs. the old SKL/KBL: the L1 caches are wide and thus fast, but the L2/L3 are not as impressive; the memory controller, while competitive, does not seem as efficient as Ryzen2’s – though it is more stable at high data rates, allowing for higher bandwidth.
Data In-Page Random Latency (ns) 17.5 (3-10-21) 17.4 (4-11-20) [-73%] 63.4 (4-12-31) 25.5 (4-13-30) While clock latencies have not changed vs. the old KBL/SKL, CFL-R enjoys lower latencies due to higher data rates. Ryzen2 has problems here.
Data Full Random Latency (ns) 54.3 (3-10-36) 53.4 (4-11-42) [-30%] 76.2 (4-12-32) 74 (4-13-62) Out-of-page clock latencies have increased but still overall lower. Ryzen2 has almost caught up here.
Data Sequential Latency (ns) 3.8 (3-10-11) 3.8 (4-11-12) 3.3 (4-6-7) 5.3 (4-12-12) With sequential access, Ryzen2 is now faster as CFL/R’s clock latencies have not changed.
CFL-R does not improve over CFL (same memory controller) but is lucky here, as Ryzen2 still has high latencies in random accesses (either in-page or full range) though it manages to be faster with sequential access. Intel will need to improve going forward, as its clock latencies, while good, have really not improved at all.
Code In-Page Random Latency (ns) 8.6 (2-9-19) 8.7 (2-10-21) 13.8 (4-9-24) 11.8 (4-14-25) Code clock latencies also have not changed, and while Ryzen2 performs a lot better, CFL/R manages to be ~35% faster.
Code Full Random Latency (ns) 60.1 (2-9-48) 59.8 (2-10-48) 85.7 (4-14-49) 83.6 (4-15-74) Out-of-page clock latencies also have not changed and here CFL/R is 20% faster over Ryzen2.
Code Sequential Latency (ns) 4.3 (2-3-8) 4.5 (2-4-10) 7.4 (4-12-20) 6.8 (4-7-11) Ryzen2 is competitive but again CFL/R manages to be almost 40% faster.
CFL/R does not improve over CFL but still dominates here, enjoying 30-40% lower latency than Ryzen2 – though the latter has improved a lot over time.
Memory Update Transactional (MTPS) 73.3 [+36%] 54 5 59 Finally all top-end Intel CPUs have HLE enabled and working and thus enjoy a huge performance increase.
Memory Update Record Only (MTPS) 53.4 [+41%] 38 4.58 59 Nothing much changes here. CFL-R can do over 40% more transactions.

CFL-R does not really perform any different cache/memory wise vs. old CFL as the caches and memory controller are unchanged.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

CFL-R just adds more cores, thus enjoying higher aggregated L1D/L2 bandwidths vs. CFL, but the L3 is still disappointing – especially as it now has to feed 33% more cores/threads (8/16 vs 6/12). Latencies (in clocks) do not change either, but as it can clock higher they do decrease in real terms (ns).

The memory controller is the very same (even running the same firmware), thus performs the same – though now it has to feed 33% more cores/threads (8/16 vs 6/12), so when all cores/threads are used the aggregated bandwidth falls due to extra contention. In fairness Ryzen2 has the same issue (too many cores/threads for too little bandwidth) – thus SKL/KBL-X is where you should be looking for more bandwidth.

nVidia Titan V/X: FP16 and Tensor CUDA Performance

What is FP16 (“half”)?

FP16 (aka “half” floating-point) is the lower-precision IEEE floating-point representation that has recently begun to be supported by GPGPUs for compute (e.g. Intel EV9+ Skylake GPU, nVidia Pascal/Turing), with CPUs soon to follow (BFloat16). While originally meant for mobile devices in order to reduce memory and compute requirements – it also allows workstations/servers to handle deep neural-network workloads that have exploded in both size and compute power.

While not all algorithms can use such low precision and thus may require parts to use normal precision, nevertheless FP16 can still be used in many instances and thus needs to be implemented and benchmarked.
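Even where FP16 arithmetic is unavailable, FP16 is useful as a storage format: on x86, the F16C instructions convert between FP32 and FP16 so buffers take half the space. A small sketch (assumes an F16C-capable CPU; the function names are ours, not a library’s):

```cpp
#include <immintrin.h>
#include <cstdint>

// 8x FP32 -> 8x FP16 (16 bytes instead of 32).
void store_as_half(const float* src, uint16_t* dst) {
    __m256 f = _mm256_loadu_ps(src);
    __m128i h = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), h);
}

// 8x FP16 -> 8x FP32 for full-precision compute.
void load_from_half(const uint16_t* src, float* dst) {
    __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(h));
}
```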

In addition we see the introduction of specialised compute engines that specifically support FP16 (and not higher precision like FP32/FP64) like “Tensor Engines”.

What are “Tensors”?

A tensor engine (hardware accelerator) is a specialised processing unit that accelerates matrix multiplication in hardware – in this case on the latest nVidia GPGPU architectures (Volta/Turing). While the former was targeted at workstations (Titan V), the latter powers all consumer (series 2000) graphics cards – thus it has entered the mainstream. In addition, the FP16 speed restrictions of older consumer designs (e.g. Pascal FP16 processing was limited to 1/64 FP32 speed) have been lifted.

While usable in other algorithms, it is primarily intended to accelerate neural networks (so-called “AI”) that are now being used in mainstream local workloads like image/video processing (scaling, de-noising, etc.) and games (anti-aliasing, de-noising when used with ray-tracing, bots/NPCs, procedural world-building, etc.).
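In CUDA, the usual way to reach the tensor cores is through a library such as cuBLAS, asking for FP16 inputs with FP32 accumulation. A hedged sketch against the CUDA 9/10-era API (dA/dB/dC are assumed pre-filled device buffers; error checks omitted):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A * B for n x n matrices, dispatched to tensor cores where available.
void tensor_gemm(cublasHandle_t handle, int n,
                 const __half* dA, const __half* dB, __half* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow tensor-op math
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, dA, CUDA_R_16F, n,
                         dB, CUDA_R_16F, n,
                 &beta,  dC, CUDA_R_16F, n,
                 CUDA_R_32F,                            // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```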

In this article we are investigating both FP16/half performance vs. standard FP32 as well as the performance improvement when using tensors.

FP16/half Performance

We are testing GPGPU performance of the GPUs in CUDA as it supports both FP16/half operations and tensors; hopefully both OpenCL and DirectX will also be updated to support both FP16/half (compute) and tensors.

Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers (Jan 2019). Turbo / Dynamic Overclocking was enabled on all configurations.

For image processing, Titan V brings big performance increases – from 50% to 4x (times) faster than Titan X – a big upgrade. If you are willing to drop to FP16 precision, it is an extra 50% to 2x faster again – while FP16 is not really usable on the X. With potentially 8x better performance, Titan V powers through image processing tasks.

Processing Benchmarks nVidia Titan X FP32 nVidia Titan X FP16/half nVidia Titan V FP32 nVidia Titan V FP16/half Comments
GPGPU Arithmetic Benchmark Mandel (Mpix/s) 17.82 0.244 22.68 33.86 [+49%] In a purely compute-heavy algorithm FP16 can bring 50% improvement.
When FP16 precision is sufficient, compute-heavy algorithms improve greatly on the unlocked Titan V; we see that on the previous Titan X FP16 is not worth using as its performance is way too low.
GPGPU Finance Benchmark Black-Scholes (MOPT/s) 11.32 0.726 18.57 37 [+99%] B/S benefits greatly from FP16 both through decreased memory storage and low precision compute.
GPGPU Finance Benchmark Binomial (kOPT/s) 2.38 2.34 4.2 4.22 [+1%] Binomial requires higher precision for the results to make sense thus sees almost no benefit.
GPGPU Finance Benchmark Monte-Carlo (kOPT/s) 5.82 0.137 11.92 12.61 [+6%] M/C also uses thread shared data but read-only but still requires higher precision.
For financial workloads, FP16 is generally too low-precision and most parts of the algorithm do need to be performed in FP32; we can still use FP16 as data storage but the heavy compute sees little benefit. When FP16 can be deployed, as in B/S, we see far higher performance benefits.
GPGPU Science Benchmark GEMM (GFLOPS) 6 0.191 11 15.8 [+44%] / 42 Tensor [+4x] Here we see the power of the tensor cores – in FP16 Titan V is 4 times faster! Normal compute is still almost 50% faster, a good result.
GPGPU Science Benchmark FFT (GFLOPS) 0.227 0.078 0.617 0.962 [+56%] FFT also benefits from FP16 due to reduced memory pressure.
GPGPU Science Benchmark NBODY (GFLOPS) 5.72 0.061 7.79 8.25 [+6%] N-Body simulation needs some parts in FP32 thus does not benefit as much.
The new Tensor cores show their power in FP16 GEMM – we see 4x (times) higher performance than in FP32, which can go a long way towards making neural network processing much faster – not to mention FP16’s 1/2 memory size requirement vs. FP32.
GPGPU Image Processing Blur (3×3) Filter (MPix/s) 18.41 1.65 27 27.53 [+2%] Surprisingly, 3×3 does not seem to benefit much from FP16 processing performance.
GPGPU Image Processing Sharpen (5×5) Filter (MPix/s) 5 0.618 9.29 14.66 [+58%] Same algorithm but more shared data brings over 50% performance improvement.
GPGPU Image Processing Motion-Blur (7×7) (MPix/s) 5.08 0.332 9.48 14.39 [+52%] Again the same algorithm but with even more data shared also brings around 50% better performance.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 4.8 0.316 9 13.36 [+48%] Still convolution but with 2 filters – similar 50% improvement.
GPGPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 37.14 7.4 112.5 207.37 [+84%] Median filter benefits greatly from FP16 processing, it’s almost 2x faster.
GPGPU Image Processing Oil Painting Quantise Filter (MPix/s) 12.73 28 42.38 141.8 [+235%] Without major processing, quantisation benefits even more from FP16, it’s over 3x faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 21 6.43 24.14 24.58 [+2%] This algorithm is 64-bit integer heavy thus shows almost no benefit.
GPGPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 0.305 0.49 0.8125 2 [+148%] One of the most complex and largest filters, FP16 makes it over 2.5x faster.
FP16/half brings huge performance improvement in image processing as long as the results are acceptable – again some parts have to use higher precision (FP32) in order to prevent artifacts. Convolution can be implemented through matrix multiplication thus would benefit even more from Tensor core hardware acceleration.

Final Thoughts / Conclusions

FP16/half support when unlocked can greatly benefit many algorithms – if the lower precision is acceptable: in general performance improves by about 50% – though in some cases it can reach 200%.

When using the new Tensor cores – performance improves hugely: in GEMM we see 4x performance improvement (vs. FP32). It thus makes great sense to modify algorithms (like convolution) to use matrix multiplication and thus the Tensor cores – which will greatly accelerate image processing and neural networks. With the new nVidia 2000 series – this kind of performance is available in the mainstream right now and is pretty amazing to see.

Expect to see similar hardware accelerator units in other GPUs and soon in CPUs with AVX512-VNNI as well as FP16 processing support (BFloat16), which will allow multi-core wide-SIMD CPUs to be competitive.
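For reference, BFloat16 is attractive precisely because it is cheap to support: it is simply the top 16 bits of an IEEE-754 FP32 value (same 8-bit exponent, mantissa cut to 7 bits). A minimal C sketch of the conversion (truncating for simplicity; real hardware typically rounds to nearest-even):

    #include <stdint.h>
    #include <string.h>

    /* BF16 = upper half of an FP32 bit pattern; conversion is a shift. */
    static uint16_t fp32_to_bf16(float f)   /* truncating, for simplicity */
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);
    }

    static float bf16_to_fp32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }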

Intel Core i9 9900K CoffeeLake-R Review & Benchmarks – CPU 8-core/16-thread Performance

What is “CoffeeLake-R” (CFL-R)?

It is the “refresh” (updated) version of the 8th generation Intel Core architecture (CFL) – itself a minor stepping of the previous 7th generation “KabyLake” (KBL), itself a minor update of the 6th generation “SkyLake” (SKL). While ordinarily this would not be much of an event – this time we do have more significant changes:

  • Patched vulnerabilities in hardware: this can help recover the I/O workload performance lost to OS mitigations
    • Kernel Page Table Isolation (KPTI) aka “Meltdown” – Patched in hardware
    • L1TF/Foreshadow – Patched in hardware
    • (IBPB/IBRS) “Spectre 2” – OS mitigation needed
    • Speculative Store Bypass disabling (SSBD) “Spectre 4” – OS mitigation needed
  • Increased core counts yet again: CFL-R top-end now has 8 cores, not 6.

Intel CPUs bore the brunt of the vulnerabilities disclosed at the start of 2018 with “Meltdown” operating system mitigations (KVA) likely having the biggest performance impact in I/O workloads. While modern features (e.g. PCID (process context id) acceleration) could help reduce performance impact somewhat on recent architectures (4th gen and newer) the impact can still be significant. The CFL-R hardware fixes (thus not needing KVA) may thus prove very important.

On the desktop we also see increased cores (again!) now up to 8 (thus 16 threads with HyperThreading) – double what KBL and SKL brought and matching AMD.

We also see increased clocks, mainly Turbo, allowing 1 or 2 cores to boost higher than CFL could and thus helping workloads that are not massively threaded. This can improve responsiveness, as single tasks can run at top speed when overall thread utilisation is low.

While rated TDP has not changed, in practice we are likely to see increased “real” power consumption especially due to higher clocks – with Turbo pushing power consumption even higher – close to SKL/KBL-X.

In this article we test CPU Core performance; please see our other articles on:

Hardware Specifications

We are comparing the top-of-the-range Core i9 (9900K) with the previous generation (8700K) and competing architectures with a view to upgrading to a mid-range high performance design.

CPU Specifications Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
Cores (CU) / Threads (SP) 8C / 16T 6C / 12T 8C / 16T 10C / 20T We have 33% more cores than CFL, matching Ryzen2 and approaching the HEDT SKL-X!
Speed (Min / Max / Turbo) 0.8-3.6-5GHz (8x-36x-50x) 0.8-3.7-4.7GHz (8x-37x-47x) 2.2-3.7-4.2GHz (22x-37x-42x) 1.2-3.3-4.3GHz (12x-33x-43x) Single/Dual core Turbo has now reached 5GHz, same as the 8086K special edition.
Power (TDP) 95W (135) 95W (131) 105W (135) 140W (308) TDP is the same but overall power consumption likely far higher.
L1D / L1I Caches 8x 32kB 8-way / 8x 32kB 8-way 6x 32kB 8-way / 6x 32kB 8-way 8x 32kB 8-way / 8x 64kB 8-way 10x 32kB 8-way / 10x 32kB 8-way No change in L1 caches. Just more of them.
L2 Caches 8x 256kB 8-way 6x 256kB 8-way 8x 512kB 8-way 10x 1MB 8-way No change in L2 caches. Just more of them.
L3 Caches 16MB 16-way 12MB 16-way 2x 8MB 16-way 13.75MB 11-way L3 has also increased by 33% in line with cores matching Ryzen.
Microcode/Firmware MU069E0C-9E MU069E0A-96 MU8F0802-04 MU065504-49 We have a new stepping and slightly newer microcode.

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX2, AVX, etc.). CFL supports most modern instruction sets (AVX2, FMA3) but not the latest SKL/KBL-X AVX512 nor a few others like SHA HWA (Atom, Ryzen).
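As a quick sketch of how a benchmark (or any dispatcher) picks the highest-performing code path at run-time, the snippet below queries the instruction sets mentioned above via the GCC/Clang builtin __builtin_cpu_supports; this is our illustration, not Sandra's own detection code, and MSVC builds would use __cpuidex instead:

    #include <stdio.h>

    int main(void)
    {
        /* __builtin_cpu_supports is a GCC/Clang builtin (x86 only). */
        printf("AVX2   : %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
        printf("FMA3   : %s\n", __builtin_cpu_supports("fma")     ? "yes" : "no");
        printf("AVX512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
        return 0;
    }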

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1809), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).
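For readers wondering what the 2MB “large pages” above mean in practice: on Windows they are requested per allocation, roughly as in the hedged C sketch below. The helper name and fallback policy are ours; the allocation needs the SeLockMemoryPrivilege to succeed.

    #include <windows.h>

    void *alloc_large(SIZE_T bytes)
    {
        /* Large-page requests must be a multiple of the minimum large-page
           size (2MB on x64) - round up, assuming it is a power of two. */
        SIZE_T min = GetLargePageMinimum();
        SIZE_T size = (bytes + min - 1) & ~(min - 1);
        void *p = VirtualAlloc(NULL, size,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (!p) /* privilege missing or memory fragmented: use 4kB pages */
            p = VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT,
                             PAGE_READWRITE);
        return p;
    }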

Native Benchmarks Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 400 [+20%] 291 334 485 In the old Dhrystone integer workload, CFL-R finally beats Ryzen by 20%.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 393 [+17%] 296 335 485 With a 64-bit integer workload – nothing much changes.
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 236 [+19%] 170 198 262 Switching to floating-point, CFL-R still beats Ryzen by 20% in old Whetstone.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 196 [+16%] 143 169 223 With FP64 nothing much changes.
From integer workloads in Dhrystone to floating-point workloads in Whetstone, CFL-R handily beats Ryzen by about 20% and is also 33-39% faster than old CFL. It’s “king of the hill”.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 1000 [+74%] 741 574 1590 (AVX512) In this vectorised AVX2 integer test CFL-R is almost 2x faster than Ryzen2.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 416 [+122%] 305 187 581 (AVX512) With a 64-bit AVX2 integer vectorised workload, CFL-R is now over 2.2x faster than Ryzen.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 6.75 [+16%] 4.9 5.8 7.6 This is a tough test using Long integers to emulate Int128 without SIMD: still CFL-R is fastest.
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 927 [+56%] 678 596 1760 (AVX512) In this floating-point AVX/FMA vectorised test, CFL-R is 56% faster than Ryzen.
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 544 [+62%] 402 335 533 (AVX512) Switching to FP64 SIMD code, CFL-R increases its lead to 62%.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 23.3 [+49%] 16.7 15.6 40.3 (AVX512) In this heavy algorithm using FP64 to mantissa extend FP128 but not vectorised – CFL-R is 50% faster.
In vectorised SIMD code we know Intel’s SIMD units (which can execute 256-bit instructions in one go) are much more powerful than AMD’s, and it shows: Ryzen2 is soundly beaten by a big 50-100% margin. Naturally CFL-R cannot beat its “big brother” SKL-X with AVX512 – which is likely why Intel has not enabled AVX512 on it.
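The gap is easy to picture: the vectorised tests boil down to loops like the hedged C sketch below, where a single _mm256_fmadd_ps processes 8 FP32 values per instruction. CFL-R executes such 256-bit ops natively while Ryzen2 cracks them into two 128-bit halves (the function name and loop are illustrative, not the benchmark's code):

    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] = a*x[i] + y[i], 8 FP32 lanes per iteration via AVX2/FMA3. */
    void saxpy_avx2(float *y, const float *x, float a, size_t n)
    {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
        }
        for (; i < n; ++i)  /* scalar tail */
            y[i] = a * x[i] + y[i];
    }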
BenchCrypt Crypto AES-256 (GB/s) 17.6 [+9%] 17.8 16.1 23 With AES HWA support all CPUs are memory bandwidth bound; core contention (8 vs 6) means CFL-R scores slightly worse than CFL.
BenchCrypt Crypto AES-128 (GB/s) 17.6 [+9%] 17.8 16.1 23 What we saw with AES-256 just repeats with AES-128.
BenchCrypt Crypto SHA2-256 (GB/s) 12.2 [-34%] 9 18.6 26 (AVX512) With SHA HWA Ryzen2 powers through but CFL-R is only 34% slower.
BenchCrypt Crypto SHA1 (GB/s) 23 [+19%] 17.3 19.3 38 (AVX512) Ryzen also accelerates the soon-to-be-defunct SHA1 but the algorithm is less compute heavy allowing CFL-R to beat it.
BenchCrypt Crypto SHA2-512 (GB/s) 9 [+139%] 6.65 3.77 21 (AVX512) SHA2-512 is not accelerated by SHA HWA, allowing CFL-R to use its SIMD units and be 139% faster.
AES HWA is memory bound and here CFL-R also enjoys the 3200Mt/s memory – but it now feeds 8C / 16T which all contend for the bandwidth. Thus CFL-R scores slightly less than CFL and obviously gets left in the dust by SKL-X with 4 memory channels. Ryzen2’s SHA HWA does manage a lonely win but anything SIMD accelerated belongs to Intel.
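Why AES HWA ends up memory bound is clear from the inner loop: with AES-NI an entire AES round is one instruction, as in this minimal C sketch of an AES-128 block encrypt (it assumes the 11 round keys are already expanded; our illustration, not the benchmark's code):

    #include <wmmintrin.h>  /* AES-NI intrinsics (pulls in SSE2) */

    /* One 16-byte block through AES-128: 1 whitening XOR, 9 full rounds,
       1 final round - each full round is a single _mm_aesenc_si128. */
    __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
    {
        block = _mm_xor_si128(block, rk[0]);
        for (int i = 1; i < 10; ++i)
            block = _mm_aesenc_si128(block, rk[i]);
        return _mm_aesenclast_si128(block, rk[10]);
    }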
BenchFinance Black-Scholes float/FP32 (MOPT/s) 276 [+7%] 207 257 309 In this non-vectorised test CFL-R is just a bit faster than Ryzen2.
BenchFinance Black-Scholes double/FP64 (MOPT/s) 240 [+10%] 180 219 277 Switching to FP64 code, nothing much changes, CFL-R is 10% faster
BenchFinance Binomial float/FP32 (kOPT/s) 59.9 [-44%] 47 107 70.5 Binomial uses thread shared data thus stresses the cache & memory system; CFL-R strangely loses by 44%.
BenchFinance Binomial double/FP64 (kOPT/s) 61.9 [+2%] 44.2 60.6 68 With FP64 code Ryzen2’s lead diminishes; CFL-R is pretty much tied with it.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 56.5 [+4%] 41.6 54.2 63 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches; CFL-R is just 4% faster.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 44.3 [+8%] 32.9 41 50.5 Switching to FP64 CFL-R increases its lead to 8%.
Without SIMD support, CFL-R relies on its increased thread count (matching Ryzen2) to win by a small margin overall, losing just one test. But in tests that AMD used to always win (with Ryzen 1/2), Intel now has the lead.
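For context, the Black-Scholes test evaluates a closed-form expression – straight-line scalar FP code with a few transcendentals – as in this hedged C sketch of a call-option price (the standard textbook formula; names and layout are ours, not the benchmark's):

    #include <math.h>

    static double cnd(double x)  /* standard normal CDF via erfc */
    {
        return 0.5 * erfc(-x / sqrt(2.0));
    }

    /* Call price for spot S, strike K, time T, rate r, volatility v. */
    double bs_call(double S, double K, double T, double r, double v)
    {
        double d1 = (log(S / K) + (r + 0.5 * v * v) * T) / (v * sqrt(T));
        double d2 = d1 - v * sqrt(T);
        return S * cnd(d1) - K * exp(-r * T) * cnd(d2);
    }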
BenchScience SGEMM (GFLOPS) float/FP32 403 [+34%] 385 300 413 (AVX512) In this tough vectorised AVX2/FMA algorithm CFL-R is 34% faster than Ryzen2.
BenchScience DGEMM (GFLOPS) double/FP64 269 [+126%] 135 119 212 (AVX512) With FP64 vectorised code, CFL-R is over 2x faster.
BenchScience SFFT (GFLOPS) float/FP32 23.4 [+160%] 24 9 28.6 (AVX512) FFT is also heavily vectorised (x4 AVX/FMA) but stresses the memory sub-system more; CFL-R is over 2.5x faster than Ryzen2.
BenchScience DFFT (GFLOPS) double/FP64 11.2 [+41%] 11.9 7.92 14.6 (AVX512) With FP64 code, CFL-R’s lead reduces to 41%.
BenchScience SNBODY (GFLOPS) float/FP32 550 [+96%] 411 280 638 (AVX512) N-Body simulation is vectorised but with many memory accesses to shared data; still, CFL-R is 2x faster than Ryzen2.
BenchScience DNBODY (GFLOPS) double/FP64 172 [+52%] 127 113 195 (AVX512) With FP64 code CFL-R’s lead reduces to 52% over Ryzen2.
With highly vectorised SIMD code CFL-R performs very well – dominating Ryzen2: in some tests it is over 2x faster! Then again CFL did not have any issues here either, Intel is just extending their lead…
CPU Image Processing Blur (3×3) Filter (MPix/s) 2270 [+86%] 1700 1220 4540 (AVX512) In this vectorised integer AVX2 workload CFL-R is almost 2x faster than Ryzen2.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 903 [+67%] 675 542 1790 (AVX512) Same algorithm but more shared data reduces the lead to 67% – still significant.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 488 [+61%] 362 303 940 (AVX512) Again same algorithm but even more data shared reduces the lead to 61%.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 784 [+73%] 589 453 1520 (AVX512) Different algorithm but still AVX2 vectorised workload means CFL-R is 73% faster than Ryzen 2.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 78.6 [+13%] 57.8 69.7 223 (AVX512) Still AVX2 vectorised code but CFL-R stumbles a bit here – but it’s still 13% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 43.4 [+76%] 31.8 24.6 70.8 (AVX512) CFL-R recovers its dominance over Ryzen2.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 4470 [+208%] 3480 1450 3570 (AVX512) CFL-R (like all Intel CPUs) does very well here – it’s a huge 3x faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 614 [+153%] 448 243 909 (AVX512) In this final test, CFL-R is over 2x faster than Ryzen 2.

Adding 2 more cores (not to mention the higher Turbo clock) brings additional performance gains over CFL – which itself showed big gains over the old SKL/KBL – again within the same (rated) TDP. Intel never had any problem with SIMD code (AVX/AVX2/FMA3), beating Ryzen2 by a large margin (now over 2x faster), but it now also wins pretty much all the other tests.

It is consistently 33-40% faster than CFL (8700K), in line with the core/speed increases, which bodes well for compute-heavy code; streaming performance can be lower due to increased core contention for bandwidth, and here faster (though more expensive) memory would help.

No – it cannot beat its “older big brother” SKL-X with AVX512 – not to mention increased core/thread count as well as memory channels, but in some tests it is competitive.

Software VM (.Net/Java) Performance

We are testing arithmetic and vectorised performance of software virtual machines (SVM), i.e. Java and .Net. With operating systems – like Windows 10 – favouring SVM applications over “legacy” native, the performance of .Net CLR (and Java JVM) has become far more important.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64 (1809), latest drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Spectre / Meltdown Windows Mitigations: all were enabled as per default (BTI enabled, RDCL/KVA enabled, PCID enabled).

VM Benchmarks Intel i9-9900K CoffeeLake-R Intel i7-8700K CoffeeLake AMD Ryzen2 2700X Pinnacle Ridge Intel i9-7900X SkyLake-X Comments
BenchDotNetAA .Net Dhrystone Integer (GIPS) 54.3 [-11%] 41 61 52 .Net CLR integer performance starts off well but CFL-R is 10% slower.
BenchDotNetAA .Net Dhrystone Long (GIPS) 55.8 [-7%] 41 60 54 With 64-bit integers the gap lowers to 7%.
BenchDotNetAA .Net Whetstone float/FP32 (GFLOPS) 95.5 [-6%] 78 102 107 Floating-Point CLR performance does not change much, CFL-R is still 6% slower than Ryzen2 despite big gain over old CFL/KBL/SKL.
BenchDotNetAA .Net Whetstone double/FP64 (GFLOPS) 126 [+8%] 95 117 137 FP64 performance allows CFL-R to win in the end.
Ryzen 2 performs exceedingly well in .Net workloads – but CFL-R can hold its own and overall it is not significantly slower (under 10%) and even wins 1 test out of 4.
BenchDotNetMM .Net Integer Vectorised/Multi-Media (MPix/s) 144 [+30%] 93.5 111 144 Unlike CFL, here CFL-R is 30% faster than Ryzen2.
BenchDotNetMM .Net Long Vectorised/Multi-Media (MPix/s) 139 [+28%] 93.1 109 143 With 64-bit integer workload nothing much changes.
BenchDotNetMM .Net Float/FP32 Vectorised/Multi-Media (MPix/s) 499 [+27%] 361 392 585 Here we make use of RyuJit’s support for SIMD vectors thus running AVX/FMA code and CFL-R is 27% faster than Ryzen2.
BenchDotNetMM .Net Double/FP64 Vectorised/Multi-Media (MPix/s) 274 [+26%] 198 217 314 Switching to FP64 SIMD vector code – still running AVX/FMA – CFL-R is still faster.
We see a big improvement in CFL-R even against old CFL: this allows it to soundly beat Ryzen 2 (which used to win this test) by about 30%, a significant margin. It is possible the hardware fixes (“Meltdown”) are having an effect here.
Java Arithmetic Java Dhrystone Integer (GIPS) 614 [+7%] 557 573 877 Java JVM performance starts well with a 7% lead over Ryzen2.
Java Arithmetic Java Dhrystone Long (GIPS) 644 [+16%] 488 553 772 With 64-bit integers, CFL-R more than doubles its lead to 16%.
Java Arithmetic Java Whetstone float/FP32 (GFLOPS) 143 [+9%] 101 131 156 Floating-point JVM performance is similar – 9% faster.
Java Arithmetic Java Whetstone double/FP64 (GFLOPS) 147 [+6%] 103 139 160 With 64-bit precision the lead drops to 6%.
CFL-R improves by a good amount over CFL (which itself improved greatly over KBL/SKL) and now beats Ryzen2 by a good margin (7-16%).
Java Multi-Media Java Integer Vectorised/Multi-Media (MPix/s) 147 [+30%] 100 113 140 Without SIMD acceleration we still see CFL-R 30% faster than Ryzen2 in this integer workload.
Java Multi-Media Java Long Vectorised/Multi-Media (MPix/s) 142 [+41%] 89 101 152 Changing to 64-bit integer increases the lead to 41%.
Java Multi-Media Java Float/FP32 Vectorised/Multi-Media (MPix/s) 91 [-6%] 64 97 98 With floating-point code (not SIMD accelerated) Ryzen2 is faster.
Java Multi-Media Java Double/FP64 Vectorised/Multi-Media (MPix/s) 93 [+3%] 64 90 99 With 64-bit floating-point precision CFL-R is back on top by just 3%.
With compute heavy vectorised code but not SIMD accelerated, CFL-R is still faster or at least ties with Ryzen2.

CFL-R now beats or at least matches Ryzen2 in VM tests the latter used to easily win. It may not have the lead that native SIMD vectorised code gives it – but it has no trouble keeping up – unlike its older SKL/KBL “brothers”.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

While Intel had finally increased core counts with CFL (after many years) in light of new competition from AMD with Ryzen/2 – it still relied on fewer but more powerful (at least in SIMD) cores to compete. CFL-R finally brings core (8 vs 8) and thread (16 vs 16) parity with the competition to ensure domination.

And with few exceptions – that’s what it has achieved – the 9900K is the fastest desktop CPU at this time (November 2018), and you can upgrade on the same, old, 300-series (Z370) mainboards (though naturally the newer Z390 is recommended). The performance improvement (33-40% over the 8700K) is significant enough to justify an upgrade – again, if you can afford it – considering it is a “mere refresh”.

The “Meltdown” fixes in hardware are also likely to bring big improvement in I/O workloads – or to be precise – restore the performance loss of OS mitigations (KVA) that have been deployed this year (2018). Still, in very rough terms, now you don’t have to decide between “speed” and “security” – though perhaps KVA should be used by default just in case any CPU (not just Intel) leaks information between user/kernel spaces by a yet undiscovered side-channel vulnerability.

But despite the “i9” moniker – don’t think you’re getting a workstation-class CPU on the cheap: SKL-X not only (still) has more cores/threads and 2x memory channels but also supports AVX512 beating it soundly. It will also be refreshed soon – sporting the same “Meltdown” in-hardware fixes. But again considering the costs (almost 2x) CFL-R is likely the performance/price winner on most workloads.

For now, on the desktop, the 9900K is “king of the hill”!