SiSoftware Sandra 20/20/10 (2020 R10) Update – optimisations and fixes

Update Wizard

We are pleased to release the R10 (version 30.71) update for Sandra 20/20 (2020) with the following changes:

Sandra 20/20 (2020) Press Release

Latest Sandra Version

We are moving towards a tiered system where different versions (R numbers) are provided to different customers depending on their needs and version type. This allows us, in these tough times, to prioritise our customers while still providing stable releases with the best features to the community. We will still aim to release all versions together where possible, but we no longer guarantee it.

  • Manufacturers/OEM, Tech Support, Reviewers:
    • Latest Sandra (Beta) version, R+1
  • Commercial (Professional/Business/Engineer/Enterprise):
    • Current Sandra (Stable) version, R (R+1 if required*)
  • Lite (Evaluation):
    • Previous Sandra (Stable) version, R-1

Note (*): we will provide access to Beta versions if a customer is affected by an issue resolved in the next release.

GP-GPU (CUDA / OpenCL / DirectX Compute) Benchmarks

  • Additional performance improvements for nVidia “Ampere”
  • Additional performance improvements for “Image Processing” Benchmarks
  • Relaxed limits further for better performance on high-end/multiple GP-GPUs [up to 8]

CPU Benchmarks

  • Fixed possible lock-up in “Scientific Analysis” Benchmarks
  • Revised benchmarks for asymmetric work-loads for hybrid CPUs

Bug Fixes

  • Fixed (possible) crash on Intel graphics with 64-bit PCIe memory addressing
  • Reviewed all device code that deals with 64-bit PCIe memory addressing
  • Fixed TigerLake (TGL) memory information/timings for (LP)DDR5
  • Fixed TigerLake (TGL) integrated graphics memory information
  • Additional IceLake (ICL) memory information

Reviews using Sandra 20/20:

Update & Download

Commercial version customers can download the free updates from their software distributor; Lite users please download from your favourite download site.

Download Sandra Commercial (Pro/Biz/Eng/Ent)

Download Sandra Lite

AVX512-IFMA(52) Improvement for IceLake and TigerLake

CPU Multi-Media Vectorised SIMD

What is Sandra’s Multi-Media benchmark?

The “multi-media” benchmark in Sandra was introduced way back with Intel’s MMX instruction set (and thus the Pentium MMX) to show the difference vectorisation brings to common algorithms – in this case (Mandelbrot) fractal generation. While MMX did not have floating-point support, we can emulate floating-point computation using integers of various widths (short/16-bit, int/32-bit, long/int64/64-bit, etc.).
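
For illustration, here is a minimal sketch (not Sandra’s actual kernel) of an escape-time Mandelbrot iteration computed with 32-bit fixed-point integers instead of floating-point; the Q8.24 format is an arbitrary choice for the example:

```cpp
// Fixed-point Mandelbrot iteration: emulating fractional arithmetic with int32.
#include <cstdint>

constexpr int     FRAC_BITS = 24;                       // fractional bits (Q8.24)
constexpr int32_t ONE       = INT32_C(1) << FRAC_BITS;  // 1.0 in fixed point

// Fixed-point multiply: widen to 64-bit for the product, then rescale back.
static inline int32_t fxmul(int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> FRAC_BITS);
}

// Iteration count for the point c = (cx, cy), both in Q8.24 fixed point.
int mandel_iters(int32_t cx, int32_t cy, int max_iters) {
    int32_t zx = 0, zy = 0;
    for (int i = 0; i < max_iters; ++i) {
        const int32_t zx2 = fxmul(zx, zx), zy2 = fxmul(zy, zy);
        if (zx2 + zy2 > 4 * ONE)                 // escape radius 2: |z|^2 > 4
            return i;
        const int32_t nzx = zx2 - zy2 + cx;      // z = z^2 + c (real part)
        zy = 2 * fxmul(zx, zy) + cy;             // z = z^2 + c (imaginary part)
        zx = nzx;
    }
    return max_iters;
}
```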

The benchmark thus contains various precision tests using both integer and floating-point data – currently six (single/double/quad floating-point and short/int/long integer) – with more to come in the near future (half/FP16 floating-point, etc.). Larger widths provide more precision and thus generate more accurate fractals (images) but are slower to compute (they also take more memory to store).

While the latest instruction sets (AVX(2)/FMA, AVX512) naturally support floating-point data, integer compute performance is still very much important and thus needs to be tested. As quantities become larger (e.g. memory/disk sizes, pointers/address spaces, etc.) we have moved from int/32-bit to long/64-bit processing, with some algorithms being exclusively 64-bit (e.g. SHA512 hashing).

What is the “trouble” with 64-bit integers?

While all native 64-bit processors (e.g. x64, IA64, etc.) support 64-bit integer operations, these are generally scalar with limited SIMD (vectorised) support. Multiplication is especially “problematic” as it can generate results up to twice (2x) the number of bits – multiplying two 64-bit integers can thus generate a full 128-bit result, for which there was no (SIMD) support.

Intel added native full 128-bit multiplication support (MULX) with BMI2 (Bit Manipulation Instructions Version 2), but that is still scalar (non-SIMD); not even the latest AVX512-DQ instruction set brought full-result support. While we could emulate a full 128-bit multiplication using native 32-bit-halves multiplications, we have chosen to wait for native support. An additional issue (for us) is that we use “signed integers” (i.e. able to hold both positive (+ve) and negative (-ve) values) while most multiplication instructions operate on “unsigned integers” (holding only positive values) – thus we need to adjust the result for our needs, which incurs overheads.
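
For reference, the two scalar routes mentioned above look roughly like this – an illustrative sketch (the type and helper names are ours, not Sandra’s):

```cpp
// 64x64 -> 128-bit unsigned multiply: (1) emulated via four 32x32 -> 64-bit partial
// products ("schoolbook"), (2) native scalar via BMI2 MULX (_mulx_u64).
// Compile with e.g. -mbmi2 for the MULX path (GCC/Clang).
#include <immintrin.h>
#include <cstdint>

struct u128 { uint64_t lo, hi; };

u128 mul_64x64_emulated(uint64_t a, uint64_t b) {
    const uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    const uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    const uint64_t p0 = a_lo * b_lo;   // bits   0..63
    const uint64_t p1 = a_lo * b_hi;   // bits  32..95
    const uint64_t p2 = a_hi * b_lo;   // bits  32..95
    const uint64_t p3 = a_hi * b_hi;   // bits  64..127

    const uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;   // middle column + carry
    return { (p0 & 0xFFFFFFFFull) | (mid << 32),
             p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32) };
}

u128 mul_64x64_mulx(uint64_t a, uint64_t b) {
    unsigned long long hi;
    const uint64_t lo = _mulx_u64(a, b, &hi);   // full 128-bit result in one instruction
    return { lo, hi };
}
```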

Thus the long/64-bit integer benchmark in Sandra remained non-vectorised until the introduction of AVX512-IFMA52.

What is AVX512-IFMA52?

IFMA52 is one of the new extensions of AVX512 introduced with “IceLake” (ICL) that supports native 52-bit fused multiply-add with a 104-bit result. As it is 512-bit wide, we can multiply-add eight (8) pairs of 64-bit integers in one go every 2 clocks (0.5/clock throughput, 4-clock latency on ICL) – especially useful for algorithms like (Mandelbrot) fractals where we can operate on many pixels independently.

As it generates a 104-bit full result, the operands are (as per the name) only 52-bit integers, thus we need to restrict our integers to 52 bits. It also operates on unsigned integers only, so the result needs to be adjusted for our signed-integer purposes. Note also that while it is a fused multiply-add, we have chosen to use only the multiply feature here (in this Sandra version 20/20 R9); future versions (of Sandra) may use the full multiply-add feature for even better performance.
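
A minimal sketch of the IFMA52 intrinsics used as a pure multiply (zero addend), as described above – illustrative only, not Sandra’s actual kernel:

```cpp
// AVX512-IFMA52 sketch: multiply eight pairs of unsigned 52-bit integers (held in
// 64-bit lanes) per instruction. Compile with e.g. -mavx512f -mavx512ifma.
#include <immintrin.h>

// Low 52 bits of each 104-bit product (vpmadd52luq with a zero accumulator).
__m512i mul52_lo(__m512i a, __m512i b) {
    return _mm512_madd52lo_epu64(_mm512_setzero_si512(), a, b);
}

// High 52 bits of each 104-bit product (vpmadd52huq), for when the full result is needed.
__m512i mul52_hi(__m512i a, __m512i b) {
    return _mm512_madd52hi_epu64(_mm512_setzero_si512(), a, b);
}
```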

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.).

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks Intel Core i7 1065G7 (IceLake ULV) Intel Core i7 1165G7 (TigerLake ULV) Comments
BenchCpuMM Emulated Int64 ALU64 (Mpix/s) 3.67 4.34 While native, scalar int64 processing is pretty slow.
BenchCpuMM Native Int64 ADX/BMI2 (Mpix/s) 21.24 [+5.78x] Using BMI2 for 64-bit multiplication increases (scalar) performance by 6x!
BenchCpuMM Emulated Int64 SSE4 (Mpix/s) 13.92 [-35%] Using vectorisation through SSE4 (2x wide) is not enough to beat ADX/BMI2.
BenchCpuMM Emulated Int64 AVX2 (Mpix/s) 22.8 [+64%] AVX2 is 4x wide (256-bit) and just about beats scalar ADX/BMI2.
BenchCpuMM Emulated Int64 AVX512/DQ (Mpix/s) 33.53 [+47%] 512-bit wide AVX512 is 47% faster than AVX2.
BenchCpuMM Native Int64 AVX512/IFMA52 (Mpix/s) 55.87 [+66%] / [+15x over ALU64] 70.41 [+16x over ALU64] IFMA52 is 66% faster than normal AVX512 and over 15x faster than scalar ALU.
With IFMA52 we finally see a big performance gain through native 64-bit integer multiplication and vectorisation (512-bit wide, thus 8x 64-bit integer pairs): it is over 15x faster on ICL and 16x faster on TGL! In fairness, scalar ADX/BMI2 (21.24) is still within striking distance of emulated AVX512/DQ (33.53) – showing how much native instructions help processing.

Conclusion

AVX512 continues to bring performance improvements by adding more sub-instruction sets – like AVX512-IFMA(52) – that help 64-bit integer processing. With 64-bit integers taking over most computations due to increased sizes (data, pointers, etc.) this is becoming more and more important, and not before time.

While not a full 128-bit multiplier, the 104-bit result allows complete 52-bit integer operation, which is sufficient for most tasks – today. Perhaps in the future an IFMA64 will be provided for full 128-bit multiply-result integer support.

Intel Iris Plus G7 Gen12 XE TigerLake ULV (i7-1165G7) Review & Benchmarks – GPGPU Performance

Intel Iris Xe Gen 12

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel – the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch, built on an updated 10nm++ process and again launched for mobile ULV (U/Y) devices – and perhaps other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Gen12 (XE-LP) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each, 5400Mt/s)
  • No eDRAM cache unfortunately (like CrystalWell and co)
  • New Image Processing Unit (IPU6) up to 4K90 resolution
  • New 2x Media Encoders HEVC 4K60-10b 4:4:4 & 8K30-10b 4:2:0
  • PCIe 4.0

While ICL had already greatly upgraded the GP-GPU to gen 11 cores (and more than doubled the EU count to 64 for G7), TGL upgrades them yet again to “Xe”-LP gen 12 cores – now all the way up to 96 EUs. While again most features seem to be geared towards gaming and media (with new image processing and media encoders), there should be a few new instructions for AI – hopefully exposed through an OpenCL extension.

Again there is no FP64 support (!) while FP16 is naturally supported at 2x rate as before. BF16 should also be supported by a future driver. Int32, Int16 performance has reportedly doubled with Int8 now supported and DP4A accelerated.

The new memory controller supports DDR5 / LPDDR5 (5400Mt/s) which should – once such memory becomes readily available – provide more bandwidth for the EU cores; until then, LPDDR4X can clock faster than before (4267Mt/s vs. 3733Mt/s on ICL). There is no mention of eDRAM (L4) cache at all.

We do hope to see more GPGPU-friendly features in upcoming versions now that Intel is taking graphics seriously – perhaps with the forthcoming DG1 discrete graphics.

GPGPU (Xe-LP G7) Performance Benchmarking

In this article we test GPGPU core performance; please see our other articles on:

To compare against the other Gen10 SoC, please see our other articles:

Hardware Specifications

We are comparing the mid-range Intel integrated GP-GPUs with the previous generation, as well as competing architectures, with a view to upgrading to a brand-new, high-performance design.

GPGPU Specifications Intel Iris XE-LP G7 Intel XE-LP G1 Intel Iris Plus (IceLake) G7 AMD Vega 8 (Ryzen5) Comments
Arch Chipset EV12 / G7 EV12 / G1 EV11 / G7 GCN1.5 The first G12 from Intel.
Cores (CU) / Threads (SP) 96 / 768 32 / 256 64 / 512 8 / 512 50% more cores vs. G11
SIMD per CU / Width 8 8 8 64 Same SIMD width
Wave/Warp Size 32 32 16/32 64 Wave size matches nVidia
Speed (Min-Turbo) 1.2GHz 1.15GHz 1.1GHz 1.1GHz Turbo speed has slightly increased.
Power (TDP) 15-35W 15-35W 15-35W 15-35W Similar power envelope.
ROP / TMU 24 / 48 8 / 16 16 / 32 8 / 32 ROPs and TMUs have also increased 50%.
Shared Memory 64kB 64kB 64kB 32kB Same shared memory but 2x Vega.
Constant Memory 3.2GB 3.2GB 2.7GB 3.2GB No dedicated constant memory but large.
Global Memory 2x LP-DDR4X 4267Mt/s (LPDDR5 5400Mt/s) 2x LP-DDR4X 4267Mt/s 2x LP-DDR4X 3733Mt/s 2x DDR4-2400 Can support faster (LP)DDR5 in the future.
Memory Bandwidth 42GB/s 42GB/s 58GB/s 42GB/s Bandwidth depends on the memory fitted; our TGL sample uses slower memory than ICL’s.
L1 Caches 64kB x 6 64kB x 2 16kB x 8 16kB x 8 L1 is much larger.
L3 Cache 3.8MB ? 3MB ? L3 has modestly increased.
Maximum Work-group Size 256×256 256×256 256×256 1024×1024 Vega supports 4x bigger workgroups.
FP64/double ratio No! No! No! Yes, 1/16x No FP64 support in current drivers!
FP16/half ratio 2x 2x 2x 2x Same 2x ratio

Processing Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Processing Benchmarks Intel Iris XE-LP G7 96EV Intel XE-LP G1 32EV Intel Iris Plus (IceLake) G7 64EV AMD Vega 8 (Ryzen5) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 4,342 [+54%] 1,419 2,820 2,000 Xe beats EV11 by over 50% using FP16!
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 2,062 [+55%] 654 1,330 1,350 Standard FP32 is just as fast, 55% faster.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 98.6* [+41%] 31.3* 70* 111 Without native FP64 support Xe craters like old EV11.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 9.91* [+31%] 3.49* 7.54* 7.11 Emulated FP128 is even harder for Xe.
Starting off, we see almost perfect scaling with the 50% increase in EUs – Xe is over 50% faster than the old EV11 using FP16 and FP32. Unfortunately, again without native FP64 support, it cannot match the competition. For FP64 workloads you’ll have to use the CPU; for ULV that may be OK, but for the discrete DG1 it is not so great.

* Emulated FP64 through FP32.

GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 7.9 [+3x] 2.54 2.6 2.58 Integer performance is 3x faster than EV11
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 3.54 3.38 3.3 Nothing much changes when changing to 128bit.
GPGPU Crypto Benchmark Crypto SHA2-256 (GB/s) 20.52 [+3x] 6.81 6.9 14.29 Xe beats Vega even with its acceleration.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 13.34 14.18 18.77 With 128-bit Xe is even faster.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 2.26 3.36 64-bit integer workload is also stellar.
Despite our sample using slower DDR4 memory vs. the LP-DDR4X of ICL/EV11, integer performance is 3x faster – a huge upgrade. It even manages to beat AMD’s Vega with its crypto acceleration instructions (media ops). While the crypto-currency frenzy has died down (nobody is likely to mine coins on ULV GP-GPUs), the dedicated DG1 may be a serious crypto-cracker GPU.
GPGPU Finance Benchmark Black-Scholes half/FP16 (MOPT/s) 1,111 2,340 1,720 With FP16 we see the Xe G7 win again by ~35%.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 1,603 [+22%] 993 1,310 829 With FP32 Xe is 22% faster.
GPGPU Finance Benchmark Binomial half/FP16 (kOPT/s) 116 292 270 Binomial uses thread-shared data thus stresses the memory system.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 334 [+14%] 111 292 254 With FP32, Xe is just ~14% faster.
GPGPU Finance Benchmark Monte-Carlo half/FP16 (kOPT/s) 470 667 584 Monte-Carlo also uses thread shared data but read-only.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 1,385 [+94%] 444 719 362 With FP32 code Xe is 2x faster than EV11.
For financial FP32/FP16 workloads, Xe is not always much faster than EV11 – two algorithms are just 14-22% faster while one is 2x as fast. Again, due to the lack of FP64 support it cannot run high-precision workloads, which may be a problem for some algorithms.

This does not bode well for the dedicated DG1, as it would be the only discrete card without native FP64 support, unlike the competition. However, it is likely (some) FP64 units will be included unless Intel aims it squarely at gamers (only).

GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 528 563 884 Vega still has great performance with FP16.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 683 [+64%] 419 314 With FP32, Xe is 64% faster than EV11.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 33.32 61.4 61.34 Vega does very well here also with FP16.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 52.7 [+34%] 39.2 31.5 With FP32, Xe is 34% faster.
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 652 930 623 All Intel GPUs do well here.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 908 [+60%] 566 537 With FP32, Xe is 60% faster.
On scientific algorithms, Xe does much better and manages 35-65% better performance than EV11 and generally trouncing Vega on FP32 though not quite on FP16. Shall we mention lack of FP64 again?
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 3,520 2,273
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 4,725 [+3x] 1,649 1,570 782 In this 3×3 convolution algorithm, Xe is 3x faster!
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 1,000 582
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 1,354 [+4.2x] 436 319 157 Same algorithm but more shared data; Xe is over 4x faster.
GPGPU Image Processing Motion Blur (7×7) Filter half/FP16 (MPix/s) 924 619
GPGPU Image Processing Motion Blur (7×7) Filter single/FP32 (MPix/s) 727 [+2.2x] 232 328 161 With even more data Xe is over 2x faster.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 1,000 595
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 1,354 [+4.26x] 435 318 155 Still convolution but with 2 filters – 4.3x faster.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 26.63 7.69
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 35.73 [+33%] 16.27 26.91 4.06 A different algorithm; Xe is just 33% faster.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 24.34
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 23.95 [+22%] 11.11 19.63 2.59 Without major processing, Xe is only 22% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 1,740 2,091
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 2,772 [+48%] 1,175 1,870 2,100 This algorithm is 64-bit integer heavy; Xe is 48% faster than EV11 with FP32 but trails Vega with FP16.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 215 1,046
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 916 [-4%] 551 950 608 One of the most complex and largest filters; Xe ties with EV11.
For image processing tasks, Xe seems to do best, with up to 4x better performance – likely due to the updated compiler and drivers. In any case, for such tasks upgrading to TGL will give you a huge boost. (Fortunately, no FP64 processing is needed here.)

Memory Performance

We are testing OpenCL performance using the latest SDK / libraries / drivers from both Intel and the competition.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest Intel and AMD drivers. Turbo / Boost was enabled on all configurations.

Memory Benchmarks Intel Iris XE-LP G7 Intel XE-LP G1 Intel Iris Plus (IceLake) G7 AMD Vega 8 (Ryzen5) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 44.92 [+27%] 45.9 36.3 27.2 Xe manages to squeeze more bandwidth out of DDR4.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 7.75 [-54%] 7.7 17 4.74 Uploads are about half the speed at this time.
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 7.6 [-58%] 7.6 18.5 Download bandwidth is not much better.
Even with slower DDR4 memory, Xe manages higher internal bandwidth than EV11; with future DDR5 / LPDDR5 this will increase further. At this time, perhaps due to the driver, the upload/download bandwidths are about half of ICL’s.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

Once again Intel seems to be taking graphics seriously: for the 2nd time in a row we have a major graphics upgrade, with Xe bringing big increases in EU count, performance and bandwidth. Overall it seems to be 50% faster than EV11, with lower-end devices benefiting most from the upgrade. While the competition once seemed unassailable, Intel has managed to close the gap and overtake.

However, this is still a core aimed at gamers and it does not provide much for GP-GPU use; the improved integer performance is very much welcome – 3-times better (!) – but there are only a few specific instructions aimed at AI. The lack of FP64 makes it unsuitable for high-precision financial and scientific workloads; something that the old EV7-9 cores could do reasonably well (all things considered).

For integrated graphics this is not a problem – not many people would expect a ULV GPU core to run compute-heavy workloads; however, the dedicated DG1 card would really be out-spec’d by the competition, with even old, low-end devices providing more features. Then again, the dedicated DG1 is likely to include (some) FP64 units and/or additional units, unlike the low-power (LP ULV) integrated versions.

Getting back to ULV, Xe-LP’s performance completely obsoletes devices (e.g. SKL/KBL/WHL/CML-ULV) using the older EV9x cores – unless you really don’t plan on using them except for “business 2D graphics” or displaying the desktop.

If you have not upgraded to ICL yet, TGL is a far better, more compelling proposition that should be your (current) top choice for long-term use. For ICL owners there is still a worthwhile upgrade here, though not as massive a jump as from anything released previously.

In a word: Highly Recommended!

Please see our other articles on:

Intel Core Gen11 TigerLake ULV (i7-1165G7) Review & Benchmarks – CPU AVX512 Performance

Intel Core i7 Gen 11

What is “TigerLake”?

It is the 3rd update of the “next generation” Core (gen 11) architecture (TGL/TigerLake) from Intel – the one that replaced the ageing “Skylake (SKL)” arch and its many derivatives that are still with us (“CometLake (CML)”, “RocketLake (RKL)”, etc.). It is an optimisation of the “IceLake (ICL)” arch, built on an updated 10nm++ process and again launched for mobile ULV (U/Y) devices – and perhaps other platforms too.

While not a “revolution” like ICL was, it still contains big changes across the SoC: CPU, GPU and memory controller:

  • 10nm++ process (lower voltage, higher performance benefits)
  • Up to 4C/8T “Willow Cove” on ULV  (CometLake up to 6C/12T)
  • Gen12 (Xe) graphics (up to 96 EU, similar to discrete DG1 graphics)
  • AVX512 and more of its friends
  • Increased L2 cache from 512kB to 1.25MB per core (2.5x)
  • Increased L3 cache from 8MB to 12MB (+50%)
  • DDR5 / LPDDR5 memory controller support (2 controllers, 2 channels each)
  • PCIe 4.0
  • Thunderbolt 4 (and thus USB 4.0 support) integrated
  • Hardware fixes/mitigations for vulnerabilities (“JCC”, “Meltdown”, “MDS”, various “Spectre” types)

While IceLake introduced AVX512 to the mainstream, TigerLake adds even more of its derivatives effectively overtaking the ageing HEDT platform that is still on old SKL-X derived cores:

  • AVX512-VNNI (Vector Neural Network Instructions – also on ICL)
  • AVX512-VP2INTERSECT (Vector Pair Intersection)

While some software may not yet have been updated to AVX512 – as it was previously reserved for HEDT/servers – this mainstream launch pretty much guarantees that just about all vectorised algorithms (already ported to AVX2/FMA) will soon be ported over. VNNI and IFMA support can accelerate the low-precision neural networks that are likely to be used on mobile platforms.
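
For illustration, this is roughly what the VNNI path looks like with compiler intrinsics – an assumed, minimal sketch (not Sandra’s kernels), accumulating int8 products into int32 lanes via vpdpbusd:

```cpp
// Minimal AVX512-VNNI sketch: dot-product of unsigned-int8 activations with
// signed-int8 weights, accumulated into 32-bit integers (4 MACs per int32 lane).
// Compile with e.g. -mavx512f -mavx512vnni (GCC/Clang); illustrative only.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

int32_t dot_u8s8_vnni(const uint8_t* a, const int8_t* w, size_t n) {  // n: multiple of 64
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 64) {
        const __m512i va = _mm512_loadu_si512(a + i);   // 64 unsigned 8-bit activations
        const __m512i vw = _mm512_loadu_si512(w + i);   // 64 signed 8-bit weights
        acc = _mm512_dpbusd_epi32(acc, va, vw);         // vpdpbusd: 4x int8 MACs per lane
    }
    return _mm512_reduce_add_epi32(acc);                // horizontal sum of the 16 lanes
}
```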

The caches are finally getting updated and enlarged, considering that the competition has deployed much larger caches in its latest products. L2 more than doubles (2.5x) while L3 is “only” 50% larger. Note that ICL had previously doubled L2 over SKL (and current CML) derivatives, which means TGL’s L2 is 5x larger than those older designs.

From a security point-of-view, TGL mitigates all (currently reported) vulnerabilities in hardware/firmware (Spectre 2, 3/a, 4; L1TF, MDS) except BCB (Bounds Check Bypass, i.e. Spectre V1, which does not have a hardware solution), thus it should not require the slower software mitigations that affect performance (especially I/O). Like ICL, it is also not affected by the JCC erratum that is still being addressed through software (compiler) changes – though old software will never be updated.

DDR5 / LPDDR5 will ensure even more memory bandwidth and faster data rates (up to 5400Mt/s), without the need for multiple (SO)DIMMs to enable at least dual-channel operation; naturally, populating all channels will allow even higher bandwidth. Higher data rates should also reduce effective memory latencies, assuming the timings do not increase too much. Unfortunately there are no public DDR5 modules for us to test. LPDDR4X also gets a bump to a maximum of 4267Mt/s.

PCIe 4.0 finally arrives on Intel and should drive wide adoption for both discrete graphics (GP-GPUs, including Intel’s) and NVMe SSDs, with ~8GB/s transfer (x4 lanes) on ULV and up to 32GB/s (x16) on desktop. Note that the DMI/OPI link between the CPU and the I/O Hub is also updated to PCIe 4.0 speeds, improving CPU/Hub transfers.
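
A quick calculation behind those figures (assuming standard PCIe 4.0 signalling; not a measurement):

```cpp
// PCIe 4.0: 16 GT/s per lane with 128b/130b line encoding.
#include <cstdio>

int main() {
    const double per_lane_GBps = 16.0 * (128.0 / 130.0) / 8.0;  // ~1.97 GB/s per lane
    std::printf("x4 : %.1f GB/s\n",  4 * per_lane_GBps);        // ~7.9 GB/s (ULV NVMe)
    std::printf("x16: %.1f GB/s\n", 16 * per_lane_GBps);        // ~31.5 GB/s (desktop GPU)
    return 0;
}
```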

Thunderbolt 4 brings support for the upcoming USB 4.0 protocol and its data rates (32Gbps), which will also enable new peripherals including external eGPUs for discrete graphics.

Finally the GPU cores have been updated again to XE (Gen 12) cores, up to 96 on some SKUs that represent huge compute and graphics performance increases over the old (Gen 9.x) cores used by gen 10 APUs (see corresponding article).

CPU (Core) Performance Benchmarking

In this article we test CPU core performance; please see our other articles on:

To compare against the other Gen10 CPU, please see our other articles:

Hardware Specifications

We are comparing the top-of-the-range Intel ULV with competing architectures (gen 10, 11) as well as competitors (AMD), with a view to upgrading to a mid-range but high-performance design.

CPU Specifications AMD Ryzen 4500U Intel Core i7 10510U (CometLake ULV) Intel Core i7 1065G7 (IceLake ULV) Intel Core i7 1165G7 (TigerLake ULV) Comments
Cores (CU) / Threads (SP) 6C / 6T 4C / 8T 4C / 8T 4C / 8T No change in core count.
Speed (Min / Max / Turbo) 1.6-2.3-4.0GHz 0.4-1.8-4.9GHz (1.8GHz @ 15W, 2.3GHz @ 25W) 0.4-1.5-3.9GHz (1.0GHz @ 12W, 1.5GHz @ 25W) 0.4-2.1-4.7GHz (1.2GHz @ 12W, 2.8GHz @ 28W) Both base and Turbo clocks are way up.
Power (TDP) 15-35W 15-35W 15-35W 12-35W Similar power envelope, possibly higher.
L1D / L1I Caches 6x 32kB 8-way / 6x 64kB 4-way 4x 32kB 8-way / 4x 32kB 8-way 4x 48kB 12-way / 4x 32kB 8-way 4x 48kB 12-way / 4x 32kB 8-way No change in L1D.
L2 Caches 6x 512kB 8-way 4x 256kB 16-way 4x 512kB 16-way 4x 1.25MB L2 has more than doubled (2.5x)!
L3 Caches 2x 4MB 16-way 8MB 16-way 8MB 16-way 12MB 16-way L3 is 50% larger.
Microcode (Firmware) n/a MU-068E09-CC MU-067E05-6A MU-TBD Revisions just keep on coming.
Special Instruction Sets AVX2/FMA, SHA AVX2/FMA AVX512, VNNI, SHA, VAES, IFMA AVX512, VNNI, SHA, VAES, IFMA More AVX512!
SIMD Width / Units 256-bit 256-bit 512-bit 512-bit Widest SIMD units ever

Native Performance

We are testing native arithmetic, SIMD and cryptography performance using the highest performing instruction sets (AVX512, AVX2, AVX, etc.). “TigerLake” (TGL), like “IceLake” (ICL), supports all modern instruction sets including AVX512, VNNI, SHA HWA, VAES and naturally the older AVX2/FMA, AES HWA.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest AMD and Intel drivers. 2MB “large pages” were enabled and in use. Turbo / Boost was enabled on all configurations.

Native Benchmarks AMD Ryzen 4500U Intel Core i7 10510U (CometLake ULV) Intel Core i7 1065G7 (IceLake ULV) Intel Core i7 1165G7 (TigerLake ULV) Comments
CPU Arithmetic Benchmark Native Dhrystone Integer (GIPS) 208 134 154 169 [+10%] TGL is 10% faster than ICL but not enough to beat AMD.
CPU Arithmetic Benchmark Native Dhrystone Long (GIPS) 191 135 151 167 [+11%] With a 64-bit integer workload – 11% increase
CPU Arithmetic Benchmark Native FP32 (Float) Whetstone (GFLOPS) 89 85 90 99.5 [+10%] With floating-point, TGL is only 10% faster but enough to beat AMD.
CPU Arithmetic Benchmark Native FP64 (Double) Whetstone (GFLOPS) 75 70 74 83 [+12%] With FP64 we see a 12% improvement.
With integer (legacy) workloads (not using SIMD), TGL is not much faster than ICL even with its higher-clocked cores; still, a 10-12% improvement is welcome as it allows it to beat the 6-core Ryzen Mobile competition.
BenchCpuMM Native Integer (Int32) Multi-Media (Mpix/s) 506 409 504* 709* [+41%] With AVX512 TGL is over 40% faster than ICL.
BenchCpuMM Native Long (Int64) Multi-Media (Mpix/s) 193 149 145* 216* [+49%] With a 64-bit AVX512 integer workload TGL is 50% faster.
BenchCpuMM Native Quad-Int (Int128) Multi-Media (Mpix/s) 4.47 2.54 3.67** 4.34** [+18%] A tough test using long integers to emulate Int128 without SIMD; TGL is just 18% faster. [**]
BenchCpuMM Native Float/FP32 Multi-Media (Mpix/s) 433 328 414* 666* [+61%] In this floating-point vectorised test TGL is 61% faster!
BenchCpuMM Native Double/FP64 Multi-Media (Mpix/s) 251 194 232* 381* [+64%] Switching to FP64 SIMD AVX512 code, TGL is 64% faster.
BenchCpuMM Native Quad-Float/FP128 Multi-Media (Mpix/s) 11.23 8.22 10.2* 15.28* [+50%] A heavy algorithm using FP64 to mantissa-extend FP128; TGL is still 50% faster than ICL.
With heavily vectorised SIMD workloads TGL can leverage its AVX512 support to not only soundly beat Ryzen Mobile and its six 256-bit SIMD cores, but also run 40-60% faster than ICL. Intel seems to have managed to get the SIMD units to run much faster than on ICL, even within a similar power envelope!

* using AVX512 instead of AVX2/FMA.

** note test has been rewritten in Sandra 20/20 R9: now vectorised and AVX512-IFMA enabled – see “AVX512-IFMA(52) Improvement for IceLake and TigerLake” article.

BenchCrypt Crypto AES-256 (GB/s) 13.46 12.11 21.3*  19.72* [-7%] Memory bandwidth rules here so TGL is similar to ICL in speed.
BenchCrypt Crypto AES-128 (GB/s) 13.5 12.11 21.3* 19.8* [-7%] No change with AES128.
BenchCrypt Crypto SHA2-256 (GB/s) 7.03** 4.28 9*** 13.87*** [+54%] Despite SHA HWA, TGL soundly beats Ryzen using AVX512.
BenchCrypt Crypto SHA1 (GB/s) 7.19 15.71***   Less compute intensive SHA1 does not help.
BenchCrypt Crypto SHA2-512 (GB/s) 7.09*** SHA2-512 is not accelerated by SHA HWA.
The memory sub-system is crucial here, and despite Ryzen Mobile having SHA HWA – TGL is much faster using AVX512 and as we’ve seen before, 50% faster than ICL!  AVX512 helps even against native hashing acceleration.

* using VAES (AVX512 VL) instead of AES HWA.

** using SHA HWA instead of multi-buffer AVX2.

*** using AVX512 B/W
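
The VAES note (*) above refers to the 512-bit forms of the AES round instructions; a minimal sketch (illustrative only, not Sandra’s actual code) of AES-128 encryption over four independent blocks per instruction:

```cpp
// Minimal VAES (AVX512) sketch: AES-128 encryption of four 128-bit blocks in
// parallel, assuming the 11 round keys are already expanded and broadcast into
// 512-bit registers (keys[0..10]). Compile with e.g. -mavx512f -mvaes.
#include <immintrin.h>

__m512i aes128_encrypt_x4(__m512i blocks, const __m512i keys[11]) {
    __m512i state = _mm512_xor_si512(blocks, keys[0]);   // initial AddRoundKey
    for (int r = 1; r < 10; ++r)
        state = _mm512_aesenc_epi128(state, keys[r]);    // rounds 1..9 (vaesenc)
    return _mm512_aesenclast_epi128(state, keys[10]);    // final round (vaesenclast)
}
```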

BenchFinance Black-Scholes float/FP32 (MOPT/s) 64.16 109
BenchFinance Black-Scholes double/FP64 (MOPT/s) 91.48 87.17 91 132 [+45%] Using FP64 TGL is 45% faster than ICL.
BenchFinance Binomial float/FP32 (kOPT/s) 16.34 23.55 Binomial uses thread shared data thus stresses the cache & memory system.
BenchFinance Binomial double/FP64 (kOPT/s) 31.2 21 27 37.23 [+38%] With FP64 code TGL is 38% faster.
BenchFinance Monte-Carlo float/FP32 (kOPT/s) 12.48 79.9 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure on the caches.
BenchFinance Monte-Carlo double/FP64 (kOPT/s) 45.59 16.5 33 45.98 [+39%] Switching to FP64 TGL is 40% faster.
With non-SIMD financial workloads, TGL still improves by a decent 38-45% over ICL and it is enough to beat the 6-core Ryzen Mobile – no mean feat considering just how much Ryzen Mobile has improved. Still, it is more likely that the GPGPU will be used for such workloads today.
BenchScience SGEMM (GFLOPS) float/FP32 158 185* 294* [+59%] In this tough vectorised algorithm, TGL is 60% faster!
BenchScience DGEMM (GFLOPS) double/FP64 76.86 69.2 91.7* 167* [+82%] With FP64 vectorised code, TGL is over 80% faster!
BenchScience SFFT (GFLOPS) float/FP32 13.9 31.7* 31.14* [-2%] FFT is also heavily vectorised but memory dependent, so TGL does not improve over ICL.
BenchScience DFFT (GFLOPS) double/FP64 7.15 7.35 17.7* 16.41* [-3%] With FP64 code, nothing much changes.
BenchScience SNBODY (GFLOPS) float/FP32 169 200* 286* [+43%] N-Body simulation is vectorised but with more memory accesses; TGL is 43% faster.
BenchScience DNBODY (GFLOPS) double/FP64 98.7 64.2 61.8* 81.61* [+32%] With FP64 code TGL is 32% faster.
With highly vectorised SIMD code (scientific workloads), TGL again shows us the power of AVX512 – beating ICL by 30-80% and naturally Ryzen Mobile too. Some algorithms that are completely memory latency/bandwidth dependent cannot improve and instead require faster memory.

* using AVX512 instead of AVX2/FMA

Neural Networks NeuralNet CNN Inference (Samples/s) 19.33 25.62*  
Neural Networks NeuralNet CNN Training (Samples/s) 3.33 4.56*
Neural Networks NeuralNet RNN Inference (Samples/s) 23.88 24.93*
Neural Networks NeuralNet RNN Training (Samples/s) 1.57 2.97*
* using AVX512 instead of AVX2/FMA (not using VNNI yet)
CPU Image Processing Blur (3×3) Filter (MPix/s) 1060 891 1580* 2276* [+44%] In this vectorised integer workload TGL is 44% faster.
CPU Image Processing Sharpen (5×5) Filter (MPix/s) 441 359 633* 912* [+44%] Same algorithm but more shared data; TGL is still 44% faster.
CPU Image Processing Motion-Blur (7×7) Filter (MPix/s) 231 186 326* 480* [+47%] Again the same algorithm, but even more shared data brings a 47% gain.
CPU Image Processing Edge Detection (2*5×5) Sobel Filter (MPix/s) 363 302 502* 751* [+50%] A different but still vectorised workload – 50% faster.
CPU Image Processing Noise Removal (5×5) Median Filter (MPix/s) 28.02 27.7 72.9* 109* [+49%] Still vectorised code; TGL is again ~50% faster.
CPU Image Processing Oil Painting Quantise Filter (MPix/s) 12.23 15.7 24.7* 34.74* [+40%] A similar improvement here of about 40%.
CPU Image Processing Diffusion Randomise (XorShift) Filter (MPix/s) 936 1580 2100* 2998* [+43%] With an integer workload, 43% faster.
CPU Image Processing Marbling Perlin Noise 2D Filter (MPix/s) 127 214 307* 430* [+40%] In this final test, again with an integer workload, 40% faster.
Similar to what we saw before, TGL is 40-50% faster than ICL at a similar power envelope and far faster than Ryzen Mobile with its 6 cores. Again we see the huge improvement AVX512 brings, even at low-power ULV envelopes.

* using AVX512 instead of AVX2/FMA

Perhaps due to the relatively meagre ULV power envelope, ICL’s AVX512 SIMD units were unable to decisively beat “older” architectures with more cores (Ryzen Mobile or CometLake with 6 cores) – but TGL improves things considerably: anywhere between 40-50% across algorithms. Considering the power envelope remains similar, this is a pretty impressive improvement that makes TGL compelling for modern, vectorised software using AVX512.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

With AMD making big improvements with Ryzen Mobile (Zen2), its updated 256-bit SIMD units and more cores (6+), Intel had to improve: and improve it did. While, due to high power consumption, AVX512 was never a good fit for mobile and its meagre ULV power envelopes (15-25W, etc.), somehow “TigerLake” (TGL) manages to run its AVX512 units much faster – 40-50% faster than “IceLake” – and thus beat the competition.

TGL’s performance – still within a ULV power budget in a thin & light laptop (e.g. Dell XPS 13) – is pretty compelling and soundly beats not only older (bigger) mobile processors with more cores (4-6 at 35-45W) but also older desktop processors! It is truly astonishing what AVX512 can bring to a modern, efficient design.

TGL also brings PCIe 4.0 thus faster NVMe/Optane storage I/O, Thunderbolt 4 / USB 4.0 compatibility and thus faster external I/O as well. DDR5 & LPDDR5 also promise even higher bandwidth in order to feed the new cores not to mention the updated GPGPU engine with its many more cores (up to 96 EU now!) that require a lot more bandwidth.

TGL is a huge improvement over older architectures (even 8th gen) that improves everything: greater compute power, greater graphics/GPGPU compute power, faster memory, faster storage and faster external I/O! If you thought that ICL – despite its own big improvements – did not quite reach the “upgrade threshold”, TGL does everything and much more. The times of small, incremental improvements are finally over and ICL/TGL are just what was needed. Let’s hope Intel can keep it up!

In a word: Highly Recommended!

Please see our other articles on:

nVidia Titan RTX / 2080Ti: Turing GPGPU performance in CUDA and OpenCL

What is “Titan RTX / 2080Ti”?

It is the latest high-end “pro-sumer” card from nVidia with the next-generation “Turing” architecture, the update to the current “Volta” architecture that has had a limited release in Titan/Quadro cards. It powers the new Series 20 top-end (with RTX) and Series 16 mainstream (without RTX) cards that replace the old Series 10 “Pascal” series.

As “Volta” is intended for AI/scientific/financial data-centers, it features high-end HBM2 memory; since “Turing” is meant for gaming, rendering, etc., it has “normal” GDDR6 memory. Similarly, “Turing” has the new RTX (ray-tracing) cores for high-fidelity visualisation and image generation – in addition to the Tensor cores that “Volta” introduced.

While “Volta” has 1/2 FP64 ratio cores (vs. FP32), “Turing” has the normal 1/32 FP64 ratio cores: for high-precision computation – you need “Volta”. However, as “Turing” maintains the 2x FP16 rate (vs. FP32) it can run low-precision AI (neural networks) at full speed. Old “Pascal” had 1/64x FP16 ratio making it pretty much unusable in most cases.

“Turing” does not have high-end on-package HBM2 memory but instead high-speed GDDR6 memory that has decent bandwidth but is less plentiful – with 1GB missing (11GB instead of 12GB).

With the soon-to-be-unveiled “Ampere” (Series 30) architecture on the horizon, we look at whether you can get “cheap” Titan V performance out of a Turing 2080Ti consumer card.

See these other articles on Titan performance:

Hardware Specifications

We are comparing the top-of-the-range Titan RTX / 2080Ti with previous-generation Titans and competing architectures, with a view to upgrading to a mid-range but high-performance design.

GPGPU Specifications nVidia Titan RTX / 2080Ti (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
Arch Chipset Turing TU102 (7.5) Volta GV100 (7.0) Pascal GP102 (6.1) The V is the only one using the top-end 100 chip, not the lower-end 102/104 versions.
Cores (CU) / Threads (SP) 68 / 4352 80 / 5120 28 / 3584 Not as many cores as Volta but still decent.
ROPs / TMUs 88 / 272 96 / 320 96 / 224 Cannot match Volta but more ROPs per CU for gaming.
FP32 / FP64 / Tensor Cores 4352 / 136 / 544 5120 / 2560 / 640 3584 / 112 / no Maintains the Tensor cores important for AI tasks (neural networks, etc.).
Speed (Min-Turbo) 1.35GHz (136-1.635) 1.2GHz (135-1.455) 1.531GHz (135-1.910) Clocks have improved over Volta, likely due to the lower number of SMs.
Power (TDP) 260W 300W 250W (125-300) TDP is lower due to the lower CU count.
Global Memory 11GB GDDR6 14GHz 352-bit 12GB HBM2 850MHz 3072-bit 11GB GDDR5X 10GHz 384-bit As a pro-sumer card it has 1GB less than Volta, the same as Pascal.
Memory Bandwidth (GB/s) 616 652 512 Despite no HBM2, bandwidth almost matches due to the high speed of GDDR6.
L1 Cache 2x (32kB + 64kB) 2x 24kB / 96kB shared L1/shared is still the same but the ratios have changed.
L2 Cache 5.5MB (6MB?) 4.5MB (3MB?) 3MB Reported L2 cache has increased by ~25%.
FP64/double ratio 1/32x 1/2x 1/32x Low ratio like all consumer cards; Volta dominates here.
FP16/half ratio 2x 2x 1/32x Same rate as Volta, 2x over FP32.
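
As a quick sanity check of the bandwidth row above, peak memory bandwidth is simply the per-pin data rate multiplied by the bus width – a rough sketch assuming the nominal rates listed (352-bit GDDR6 @ 14Gbps, 3072-bit HBM2 @ 850MHz DDR):

```cpp
// Peak theoretical bandwidth = data rate per pin (Gbps) x bus width (bits) / 8.
constexpr double kTuringGddr6GBps = 14.0 * 352 / 8.0;         // = 616 GB/s (2080Ti)
constexpr double kVoltaHbm2GBps   = 0.850 * 2.0 * 3072 / 8.0; // = 652.8 GB/s (Titan V)
```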

nVidia RTX 2080 TI (Turing)

Processing Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2 (latest nVidia provides). Turbo / Boost was enabled on all configurations.

Processing Benchmarks nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
GPGPU Arithmetic Benchmark Mandel FP16/Half (Mpix/s) 41,080 / n/a [=] 40,920 / n/a 336 / n/a Right off the bat, Turing matches Volta and is miles faster than old Pascal.
GPGPU Arithmetic Benchmark Mandel FP32/Single (Mpix/s) 25,000 / 23,360 [+11%] 22,530 / 21,320 18,000 / 16,000 With standard FP32, Turing even manages to be 11% faster despite having fewer CUs.
GPGPU Arithmetic Benchmark Mandel FP64/Double (Mpix/s) 812 / 772 [-93%] 11,300 / 10,500 641 / 642 For FP64 you don’t want Turing, you want Volta. At any cost.
GPGPU Arithmetic Benchmark Mandel FP128/Quad (Mpix/s) 30.4 / 29.1 [-94%] 472 / 468 24.4 / 27 With emulated FP128 precision Turing is again demolished.
Turing manages to improve over Volta in FP16/FP32 despite having fewer CUs – most likely due to faster clocks and optimisations. However, if you do need FP64 precision then Volta reigns supreme – the 1/32 rate of Turing & Pascal just does not cut it.
GPGPU Crypto Benchmark Crypto AES-256 (GB/s) 48 / 52 [-33%] 72 / 86 42 / 41 Streaming workloads love Volta’s HBM2 memory, Turing is 33% slower.
GPGPU Crypto Benchmark Crypto AES-128 (GB/s) 64 / 70 [-30%] 92 / 115 57 / 54 Not a lot changes here, Turing is 30% slower.
GPGPU Crypto Benchmark Crypto SHA2-512 (GB/s) 192 / 182 [+7%] 179 / 181 72 / 83 With 64-bit integer workload, Turing manages a 7% win despite “slower” memory.
GPGPU Crypto Benchmark Crypto SHA256 (GB/s) 170 / 125 [-33%] 253 / 188 95 / 60 As with AES, hashing loves HBM2 so Turing is 33% slower than Volta.
GPGPU Crypto Benchmark Crypto SHA1 (GB/s) 161 / 125 [+56%] 103 / 113 69 / 74 While Turing wins, it is likely a compiler optimisation.
It seems that Turing’s GDDR6 memory cannot keep up with Volta’s HBM2 – despite the similar bandwidths: streaming algorithms are around 30% slower on Turing. The only win is the 64-bit integer workload that is 7% faster on Turing, likely due to integer unit optimisations.
GPGPU Finance Benchmark Black-Scholes float/FP32 (MOPT/s) 17,230 / 17,000 [-7%] 18,480 / 18,860 10,710 / 10,560 Turing is just 7% slower than Volta.
GPGPU Finance Benchmark Black-Scholes double/FP64 (MOPT/s) 1,530 / 1,370 [-82%] 8,660 / 8,500 1,400 / 1,340 With FP64, Turing is over 5x slower.
GPGPU Finance Benchmark Binomial float/FP32 (kOPT/s) 4,280 / 4,250 [+4%] 4,130 / 4,110 2,220 / 2,230 Binomial uses thread shared data thus stresses the SMX’s memory system – Turing is 4% faster.
GPGPU Finance Benchmark Binomial double/FP64 (kOPT/s) 164 / 163 [-91%] 1,920 / 2,000 131 / 134 With FP64 code Turing is over 10x slower.
GPGPU Finance Benchmark Monte-Carlo float/FP32 (kOPT/s) 11,440 / 11,740 [+1%] 11,340 / 12,900 8,100 / 6,000 Monte-Carlo also uses thread shared data but read-only thus reducing modify pressure – Turing is just 1% faster.
GPGPU Finance Benchmark Monte-Carlo double/FP64 (kOPT/s) 327 / 263 [-92%] 4,330 / 3,590 304 / 274 Switching to FP64, Turing is again over 10x slower.
For financial workloads, as long as you only need FP32 (or FP16), Turing can match and slightly outperform Volta; considering the cost that is no mean feat. However, if you do need FP64 precision – as we saw before, there is no contest – Volta is 10x (ten times) faster.
GPGPU Science Benchmark HGEMM (GFLOPS) half/FP16 34,080 [-16%] 40,790 Using the new Tensor cores, Turing is just 16% slower.
GPGPU Science Benchmark SGEMM (GFLOPS) float/FP32 7,400 / 7,330 [-33%] 11,000 / 10,870 6,280 / 6,600 Perhaps surprisingly, Turing is 33% slower than Volta here.
GPGPU Science Benchmark DGEMM (GFLOPS) double/FP64 502 / 498 [-89%] 4,470 / 4,550 335 / 332 With FP64 precision, Turing is almost 10x slower than Volta.
GPGPU Science Benchmark HFFT (GFLOPS) half/FP16 1,000 [+2%] 979 FFT somehow allows Turing to match Volta in performance.
GPGPU Science Benchmark SFFT (GFLOPS) float/FP32 512 / 573 [-5%] 540 / 599 242 / 227 With FP32, Turing is just 5% slower.
GPGPU Science Benchmark DFFT (GFLOPS) double/FP64 302 / 302 [+1%] 298 / 375 207 / 191 Completely memory bound, Turing matches Volta here.
GPGPU Science Benchmark HNBODY (GFLOPS) half/FP16 9,000 [-2%] 9,160 N-Body simulation with FP16 is just 2% slower.
GPGPU Science Benchmark SNBODY (GFLOPS) float/FP32 9,330 / 8,120 [+27%] 7,320 / 6,620 5,600 / 4,870 N-Body simulation allows Turing to dominate.
GPGPU Science Benchmark DNBODY (GFLOPS) double/FP64 222 / 295 [-94%] 3,910 / 5,130 275 / 275 With FP64 precision, Turing is again an order of magnitude slower than Volta.
The scientific scores are a bit more mixed – but again Turing can match or slightly exceed Volta at FP32/FP16 precision, as long as we’re not memory limited; there Volta is still around 30% faster. With FP64 it’s the same story: Turing is roughly 10x (or more) slower.
GPGPU Image Processing Blur (3×3) Filter single/FP32 (MPix/s) 23,090 / 19,000 [-14%] 26,860 / 29,820 17,860 / 13,680 In this 3×3 convolution algorithm, Turing is 14% slower. Convolution is also used in neural nets (CNN).
GPGPU Image Processing Blur (3×3) Filter half/FP16 (MPix/s) 28,240 [=] 28,310 1,570 With FP16 precision, Turing matches Volta in performance.
GPGPU Image Processing Sharpen (5×5) Filter single/FP32 (MPix/s) 6,000 / 4,350 [-35%] 9,230 / 7,250 4,800 / 3,460 Same algorithm but more shared data makes Turing 35% slower.
GPGPU Image Processing Sharpen (5×5) Filter half/FP16 (MPix/s) 10,580 [-38%] 14,676 609 With FP16 Volta is almost 40% faster over Turing.
GPGPU Image Processing Motion-Blur (7×7) Filter single/FP32 (MPix/s) 6,180 / 4,570 [-33%] 9,420 / 7,470 4,830 / 3,620 Again same algorithm but even more data shared Turing is 33% slower.
GPGPU Image Processing Motion-Blur (7×7) Filter half/FP16 (MPix/s) 10,160 [-31%] 14,651 325 With FP16 nothing much changes in this algorithm.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter single/FP32 (MPix/s) 6,220 / 4,340 [-30%] 8,890 / 7,000 4,740 / 3,450 Still convolution but with 2 filters – Turing is 30% slower.
GPGPU Image Processing Edge Detection (2*5×5) Sobel Filter half/FP16 (MPix/s) 10,100 [-25%] 13,446 309 Just as we seen above, Turing is about 25% slower than Volta.
GPGPU Image Processing Noise Removal (5×5) Median Filter single/FP32 (MPix/s) 52.53 / 59.9 [-50%] 108 / 66.34 36 / 55 A different algorithm shows the biggest delta – Turing is 50% slower.
GPGPU Image Processing Noise Removal (5×5) Median Filter half/FP16 (MPix/s) 121 [-40%] 204 71 With FP16 Turing reduces the loss to just 40%.
GPGPU Image Processing Oil Painting Quantise Filter single/FP32 (MPix/s) 20.28 / 25.64 [-50%] 41.38 / 23.14 15.14 / 15.3 Without major processing, this filter flies on Volta, again Turing is 50% slower.
GPGPU Image Processing Oil Painting Quantise Filter half/FP16 (MPix/s) 59.55 [-54%] 129 50.75 FP16 precision does not change things.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter single/FP32 (MPix/s) 24,600 / 29,640 [+1%] 24,400 / 24,870 19,480 / 14,000 This algorithm is 64-bit integer heavy and here Turing is 1% faster.
GPGPU Image Processing Diffusion Randomise (XorShift) Filter half/FP16 (MPix/s) 22,400 [-8%] 24,292 6,090 FP16 does not help here as we’re at maximum performance.
GPGPU Image Processing Marbling Perlin Noise 2D Filter single/FP32 (MPix/s) 3,000 / 10,500 [-20%] 3,771 / 8,760 1,288 / 6,530 One of the most complex and largest filters, Turing is 20% slower than Volta.
GPGPU Image Processing Marbling Perlin Noise 2D Filter half/FP16 (MPix/s) 7,850 [-4%] 8,137 461 Switching to FP16, Turing is within 4% of Volta and well over 2x faster than its own FP32 (CUDA) result.
For image processing, Turing is generally 20-35% slower than Volta somewhat in line with memory performance. If FP16 is sufficient, then we see Turing matching Volta in performance – something that old Pascal could never do.

Memory Performance

We are testing both CUDA native as well as OpenCL performance using the latest SDK / libraries / drivers.

Results Interpretation: For bandwidth tests (MB/s, etc.) high values mean better performance, for latency tests (ns, etc.) low values mean better performance.

Environment: Windows 10 x64, latest nVidia drivers 452, CUDA 11.3, OpenCL 1.2. Turbo / Boost was enabled on all configurations.

Memory Benchmarks nVidia Titan RTX / 2080TI (Turing) nVidia Titan V (Volta) nVidia Titan X (Pascal) Comments
GPGPU Memory Bandwidth Internal Memory Bandwidth (GB/s) 494 / 485 [-7%] 534 / 530 356 / 354 GDDR6 provides good bandwidth, only 7% less than HBM2.
GPGPU Memory Bandwidth Upload Bandwidth (GB/s) 11.3 / 10.4 [-1%] 11.4 / 11.4 11.4 / 9 Still using PCIe3 x16 there is no change in upload bandwidth. Roll on PCIe4!
GPGPU Memory Bandwidth Download Bandwidth (GB/s) 11.9 / 12.3 [-1%] 12.1 / 12.3 12.2 / 8.9 Again no significant difference but we were not expecting any.
Turing’s GDDR6 memory provides almost the same bandwidth as Volta’s expensive HBM2. All cards use PCIe3 x16 connections thus similar upload/download bandwidth. Hopefully the move to PCIe4/5 will improve transfers.
GPGPU Memory Latency Global (In-Page Random Access) Latency (ns) 135 / 143 [-25%] 180 / 187 201 / 230 From the start we see global latency accesses reduced by 25%, not a lot but will help.
GPGPU Memory Latency Global (Full Range Random Access) Latency (ns) 243 / 248 [-22%] 311 / 317 286 / 311 Full range random accesses are also 22% faster.
GPGPU Memory Latency Global (Sequential Access) Latency (ns) 40 / 43 [-25%] 53 / 57 89 / 121 Sequential accesses have also dropped 25%.
GPGPU Memory Latency Constant Memory (In-Page Random Access) Latency (ns) 77 / 80 [+2%] 75 / 76 117 / 174 Constant memory latencies seem about the same.
GPGPU Memory Latency Shared Memory (In-Page Random Access) Latency (ns) 10.6 / 71 [-41%] 18 / 85 18.7 / 53 Shared memory latencies seem to be improved.
GPGPU Memory Latency Texture (In-Page Random Access) Latency (ns) 157 / 217 [-26%] 212 / 279 195 / 196 Texture access latencies have also reduced by 26%.
GPGPU Memory Latency Texture (Full Range Random Access) Latency (ns) 268 / 329 [-22%] 344 / 313 282 / 278 As we’ve seen with global memory, we see reduced latencies by 22%.
GPGPU Memory Latency Texture (Sequential Access) Latency (ns) 67 / 138 [-24%] 88 / 163 87 / 123 With sequential access we also see a 24% reduction.
The high data rate of Turing’s GDDR6 brings reduced latencies across the board over HBM2 although as we’ve seen in the compute benchmarks, this does not always translate in better performance. Still some algorithms, especially less optimised ones may still benefit at much lower cost.
We see L1 cache effects between 32-64kB tallying with an L1D of 32-48kB (depending on setting) with the other inflexion between 4-8MB matching the 6MB L2 cache.
As with global memory, we see the same L1D (32kB) and L2 (6MB) cache effects with similar latencies. Both are significant upgrades over the Titan X’s caches.

SiSoftware Official Ranker Scores

Final Thoughts / Conclusions

If you wanted to upgrade your old Pascal Titan X but could not afford the Volta’s Titan V – then you can now get a cheap RTX 2080Ti or Titan RTX and get similar if not slightly faster FP16/FP32 performance that blows the not-so-old Titan X out of the water! If you can make do with FP16 and use Tensor cores, we’re looking at 6-8x performance over FP32 using a single card.

Naturally, the FP64 performance is again “gimped” at 1/32x so if that’s what you require, Turing cannot help you there – you will have to get a Volta. But then again the Titan X was similarly “gimped” thus if that’s what you had you still get a decent performance upgrade.

The GDDR6 memory may have similar bandwidth on paper, but in streaming algorithms it is about 33% slower than HBM2 – so there Turing cannot match Volta; considering the cost, though, it is a good trade. You will also lose 1GB (vs. Volta’s 12GB), just as with the Titan X – but again, not a surprise. Global/constant/texture memory access latencies are lower due to the high data rate, which should help algorithms that are memory-access limited (if you cannot otherwise hide the latencies).

As we’re testing GPGPU performance here, we have not touched on the ray-tracing (RTX) units, but should you happen to play a game or two when you are “resting”, then the Titan RTX / 2080TI might just impress you even more. Here, not even Volta can match it!

All in all – Titan RTX is a compelling (relatively) cheap upgrade over the old Titan X if you don’t require FP64 precision.

nVidia Titan RTX (Turing)