
Cache and Memory Latency

Benchmarks : Measuring GP (GPU/APU) Cache and Memory Latencies (updated)


Update: Added Intel GT3 (Haswell), nV GeForce 660 TI, nV GeForce 260 GTX, nV GeForce 8800 GTS.

What is Latency?

In this context, latency is the time (either in clocks or nano-seconds) taken to transfer a block of data either from main memory or GPU caches. We want the data as quickly as possible, thus the lower the time the better. The size of the data block we request is usually the size of a native pointer (4 bytes in 32-bit, 8 in 64-bit).

As a GPU (or APU) executes instructions, both the instructions themselves and the data they operate on must be brought into registers; until the instruction/data is available, the GPU cannot proceed and must wait; even advanced designs that can execute out-of-order eventually need data.

Latency is generally measured in core "clocks" (one clock = 1/frequency) for caches (as they usually run at GPU speed) and in nano-seconds (10^-9 seconds) for the main memory.
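As a quick illustration of how the two units relate, here is a minimal Python sketch. The function names are ours (hypothetical, not part of Sandra); the clock counts and core speeds are the worst-case global memory figures and core speeds quoted in the tables of this article.

```python
def clocks_to_ns(clocks, core_mhz):
    """Convert a latency in core clocks to nano-seconds."""
    return clocks * 1000.0 / core_mhz


def ns_to_clocks(ns, core_mhz):
    """Convert a latency in nano-seconds to core clocks."""
    return ns * core_mhz / 1000.0


# (device, worst-case global memory latency in clocks, core speed in MHz)
SAMPLES = [
    ("GeForce GT 555M", 680, 1180),
    ("Radeon HD 6850", 545, 775),
    ("AMD Llano APU", 493, 444),
]

for name, clk, mhz in SAMPLES:
    print(f"{name}: {clk} clk at {mhz} MHz = {clocks_to_ns(clk, mhz):.0f} ns")
# prints 576 ns, 703 ns and 1110 ns respectively
```

The same latency in clocks can therefore mean very different real-time latencies: the slower the core clock, the longer each clock lasts.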

Why is it important to measure it?

The latency of the main memory directly influences the efficiency of the GPU and thus its performance: reducing wait time can be more important than increasing execution speed. Unfortunately, memory latency is huge relative to the core clock (today, 100 clocks or more): a GPU that waits 100 clocks for each data item would run at about 1/100 efficiency, i.e. 1% of theoretical performance!

Modern GPUs have internal "memory caches" that mirror instructions/data from main memory but at far lower latencies; they allow the GPU to get data much faster and thus increase efficiency. Unfortunately the faster the cache the smaller it needs to be, thus modern GPUs contain various cache hierarchy levels (L1, L2, L3) that get progressively bigger but slower.
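To see why a hierarchy of caches helps, the following Python sketch computes an average access latency from per-level hit rates. It is a simplified model under stated assumptions: the hit rates are hypothetical, and each latency is taken as the full cost of an access served by that level. The latencies used in the usage example are the "Fermi" global memory figures measured later in this article.

```python
def average_latency(levels, memory_clk):
    """Average access latency in clocks.

    levels: list of (hit_rate, latency_clk), fastest cache first, where
            hit_rate is the fraction of accesses reaching that level
            which are served by it.
    memory_clk: latency of an access that misses every cache level.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, latency_clk in levels:
        total += reach * hit_rate * latency_clk
        reach *= 1.0 - hit_rate
    return total + reach * memory_clk


# Hypothetical hit rates, latencies from the "Fermi" global memory results
fermi_levels = [(0.8, 20), (0.5, 100), (0.5, 320)]
print(average_latency(fermi_levels, 680))  # about 76 clocks
print(average_latency([], 680))            # no caches: always 680 clocks
```

Even with these modest hit rates the average drops from ~680clk to under 80clk, which is why uncached memory types are so much more costly.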

Memory is not only differentiated by the speed it runs at (MHz) but also its type (e.g. DDR3, GDDR3, GDDR5, etc.) and also the timings (command latencies) it supports (e.g. tCAS/CL, tRP, tRCD, tRAS, etc.). The lower the timings the lower the overall latency of memory.

What kinds of memory do GPUs have?

While CPUs have "data" and "code/instruction" memory/caches, GPUs have additional memory types that, from a compute perspective, serve different purposes and have different characteristics. As GPUs are generally SIMT designs, threads execute in groups (blocks/warps) - not independently as with multi-core/threaded CPUs.

  • Global Memory: Total GPU memory (usually GB), accessible by all threads. Generally not data cached but TLB cached.
  • Constant Memory: Read-only memory, limited in size (e.g. 64kB), accessible by all threads. Generally cached.
  • Shared Memory: For data sharing between threads of a block, limited in size (e.g. 32kB). Not cached.
  • Private Memory: Per thread memory that does not fit in local registers (overspill). Generally not cached.
  • Texture Memory: GPU memory used for textures for graphics rendering. Generally cached.
  • Code/Instruction Memory: Read-only global memory that holds instructions to be executed, not data to be used by them. Generally cached.

Are the Cache / Memory latencies fixed?

No. Modern GPUs also contain "data prefetchers" which bring data into the caches speculatively, i.e. they guess which instruction/data will be needed next and fetch it to be ready when needed. Thus the GPU does not need to wait for the data to be brought all the way from main memory but can get it from the cache.

Prefetchers work by recognising patterns in the access of data (spatial, temporal, etc.) when executing code. Thus the latency of accessing data depends largely on whether the prefetchers have "understood" the pattern and have fetched the right data into the caches.

Sandra allows you to test various access patterns and thus observe the latencies of the various cache levels and memory, as well as the effect of the prefetchers.

Are there any other latencies that influence the result?

Yes. Kernels do not access physical memory directly ("real mode") but virtualised memory through "paging". Paging simplifies memory management by mapping (non-contiguous) physical memory into contiguous virtual space as well as extending "real memory" with other, non-local memory (e.g. system memory).

Memory is thus allocated and managed in fixed-size blocks ("page size") while the memory manager (or run-time memory manager) manages application memory requests. Unlike CPUs, GPU page size is typically not known.

What is the TLB?

The "Page Table" is what maps virtual to physical addresses and thus virtual pages to real memory. The TLB (translation look-aside buffer) is a hardware feature that caches the recent mappings from the page table.

If the TLB does not contain the required mapping, i.e. a "TLB miss", the page table itself must be searched, which is very much slower: the "page-walk hit". GPUs may contain multiple TLB levels - just like cache levels - but a typical TLB covers only 512 entries x 4kB page = 2MB ("TLB range"). This is relatively small compared to the 8-16GB memory of today's computers.
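The "TLB range" arithmetic can be sketched in a few lines of Python. The helper name is ours (hypothetical); the 512-entry, 4kB-page and 8GB figures are the ones quoted above.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def tlb_range_bytes(entries, page_bytes):
    """Amount of memory whose address mappings fit in the TLB at once."""
    return entries * page_bytes


tlb_range = tlb_range_bytes(512, 4 * KB)
print(tlb_range // MB, "MB")   # prints "2 MB"
print((8 * GB) // tlb_range)   # prints 4096
```

In other words, with 8GB of memory the TLB can map only 1/4096 of it at any one time.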

How does this relate to latency measurement?

As the TLB range is relatively small, an algorithm accessing a large memory block in a random pattern is likely to miss the TLB and thus incur the "TLB miss". Thus the total latency to access a data item not cached in L1D/L2 caches is not just the L3/Memory access latency but this additional latency.
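A rough way to picture this is the Python sketch below. It assumes uniformly random accesses over the block and a TLB that always holds one "TLB range" worth of that block's pages. This is a simplification that ignores replacement policy and multi-level TLBs, and the page-walk cost in the usage example is a hypothetical figure, not a measured one.

```python
KB = 1024
MB = 1024 * KB


def tlb_miss_rate(block_bytes, tlb_range_bytes):
    """Estimated fraction of random accesses whose page is not in the TLB."""
    if block_bytes <= tlb_range_bytes:
        return 0.0  # the whole block is mapped once the TLB is warm
    return 1.0 - tlb_range_bytes / block_bytes


def effective_latency(base_clk, miss_rate, page_walk_clk):
    """Cache/memory latency plus the average share of the page-walk cost."""
    return base_clk + miss_rate * page_walk_clk


for block in (1 * MB, 4 * MB, 64 * MB):
    rate = tlb_miss_rate(block, 2 * MB)
    # 200 clocks per page walk is a made-up value, for illustration only
    print(block // MB, "MB:", rate, effective_latency(300, rate, 200))
```

Under these assumptions a block only twice the TLB range already misses on half of its accesses, and larger blocks miss on nearly all of them.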

The latency values published by the manufacturers are naturally "best case", and include only L1D/L2/L3/Memory access times and not any additional latencies incurred in practice.

Given the small native page size and thus small TLB range, we do not believe it is realistic to expect that algorithms would avoid the "page-walk hit" when accessing memory outside L1D/L2.

What are the memory access patterns Sandra uses?

Sandra provides the following access patterns:

  • Sequential Access Pattern: Memory is accessed sequentially which is an easy pattern for prefetchers - "a show-case for prefetchers"; thus the latencies will be "best case", very much reduced.

  • In-Page Random Access Pattern: Memory is accessed in a random pattern within the page (either native or large): this ensures there are no "TLB miss" latencies, just raw cache/memory latencies. Some prefetchers (e.g. "adjacent line prefetcher") still have an impact.

  • Full Random Access Pattern: Memory is accessed in a random pattern within the whole block. Large blocks may incur a "TLB miss" depending on the "TLB range".
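The three patterns can be pictured as "pointer-chasing" chains, where each slot of a block holds the index of the next slot to visit, so every access depends on the previous one. The Python sketch below builds such chains. It is an illustration of the idea only, not Sandra's actual implementation; all function names and the slot/page sizes are our own assumptions.

```python
import random


def chain_from_order(order):
    """Link the slots in the given visiting order into one closed cycle."""
    nxt = [0] * len(order)
    for here, there in zip(order, order[1:] + order[:1]):
        nxt[here] = there
    return nxt


def sequential_chain(n):
    """Slot i points to slot i+1: the easiest pattern for a prefetcher."""
    return chain_from_order(list(range(n)))


def in_page_random_chain(n, page_len, rng):
    """Random order inside each page; pages are visited one after another."""
    order = []
    for start in range(0, n, page_len):
        page = list(range(start, min(start + page_len, n)))
        rng.shuffle(page)
        order.extend(page)
    return chain_from_order(order)


def full_random_chain(n, rng):
    """Random order over the whole block; may cross a page on every access."""
    order = list(range(n))
    rng.shuffle(order)
    return chain_from_order(order)


def walk(nxt, start=0):
    """Follow the chain once around, as a latency kernel would."""
    visited, i = [], start
    for _ in range(len(nxt)):
        visited.append(i)
        i = nxt[i]
    return visited


rng = random.Random(2015)
print(walk(sequential_chain(8)))
print(walk(in_page_random_chain(8, 4, rng)))
print(walk(full_random_chain(8, rng)))
```

Every chain visits each slot exactly once per lap, so the three patterns touch the same data and differ only in the order, which is what exposes the prefetcher and TLB effects.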

Note: OpenCL was used as it is supported by all GPUs/APUs. The tests are also available through CUDA, which provides more precise clock timings thanks to a core clock tick counter that is not available in OpenCL or DirectX ComputeShader.

Hardware Specifications

Here are the GPUs and APUs we are comparing in this article:

GPU / APU | Core (CU) Speed / Turbo | Cores (CU) / Threads (SP) | Memory / Speed | Registers / Const / Shared / L2+L3+L4 cache
GeForce 8800 GTS (G80) | 1188MHz | 12C / 96SP | 640MB GDDR3 800MHz 320-bit | 8k / 64kB / 16kB
nVidia GeForce GTX 260 (GT200) | 1295MHz | 24C / 192SP | 896MB GDDR3 1GHz 448-bit | 16k / 64kB / 16kB
nVidia GeForce 555M (Fermi) | 1180MHz | 3C / 144SP | 1.5GB DDR3 1.8GHz 192-bit | 32k / 64kB / 48kB / 384kB
nVidia GeForce 660 TI (Kepler) | 980MHz / 1100MHz | 7C / SP | 2GB GDDR5 6GHz 192-bit | 64k / 64kB / 48kB / 384kB
AMD A6-3650 APU (Llano) / Radeon HD 6530D | 444MHz | 4C / 320SP | 512MB DDR3 1.33GHz 128-bit (shared out of 8GB) | 16k / 64kB / 32kB / 64kB
AMD Radeon HD 6850 (Barts) | 775MHz | 12C / 960SP | 1GB GDDR5 4GHz 256-bit | 16k / 64kB / 32kB / 256kB
Intel i7-3xxxM APU (Ivy Bridge) / GT2 HD 4000 | 650MHz / 1050MHz | 16C / 16SP* | 512MB DDR3 1.33GHz 128-bit (shared out of 8GB) | 16k / 64kB / 64kB / 2MB
Intel i7-4xxxM APU (Haswell) / GT3 HD 5200 | 600MHz / 800MHz | 40C / 40SP* | 512MB DDR3 1.6GHz 128-bit (shared out of 8GB) | 16k / 64kB / 64kB / 2MB + 128MB eDRAM

Global Latency

Global Cache/Memory Latency

"Global Memory" is device memory, either dedicated in the case of GPU or shared system memory in the case of APU. It can hold any data type and can be read or written and accessed by any thread running on the GP.

While not all GPUs cache global memory, they do have TLB caches - just like modern CPUs. The "random in-page" access pattern that Sandra uses is especially designed to avoid TLB misses and thus measure the "real" cache/memory latencies. The "full random" access pattern can be used to measure TLB miss penalties where desired.

Global Latency

GPU | L1D (clk) | L2 (clk) | L3 (clk) | Memory (clk) | Comment
GeForce 8800 GTS | - | - | - | ~502clk / ~577ns | It's pretty clear that there are no caches for global memory on the old G80. Over 4MB we see TLB miss penalties as we're using the full random access pattern.
GeForce 260 GTX | - | - | - | ~493clk / ~380ns | No caching effects on GT200 either, with pretty small TLB miss penalties.
GeForce GT 555M | 4kB ~20clk | 32kB ~100clk | 256kB ~320clk | ~680clk / ~575ns | Fermi adds caching to global memory in pretty much a textbook result, with extremely low latency small L1D/L2D caches and a reasonably performant global L3 cache. TLB penalties are pretty significant, with worst-case memory latency very high - DDR3 memory does not help.
Radeon HD 6850 | 8kB ~320clk | 256kB ~365clk | - | ~545clk / ~703ns | L1D cache as slow as Fermi's L3 (!) and L2D not much help but better than nothing. At least TLB miss penalties are not as bad as Fermi's, but while memory latency is lower in terms of clocks, it is slower in real time even with GDDR5 (~703ns vs ~575ns).
AMD Llano APU | 8kB ~320clk | 64kB ~363clk | - | ~493clk / ~1110ns | Same L1D as HD 6850 but smaller L2D, but at least not worse than a dedicated GPU - for a 1st gen APU that's not bad. Memory latency is higher, no doubt due to shared DDR3 memory.
Intel Ivy Bridge APU | 128kB ~90clk | - | - | ~300clk / ~272ns | We only find one cache here (L1D), but it's 3 times faster than AMD's (90clk vs 363clk), matches Fermi's L2 and is reasonably large. Main memory latency is extremely low for an APU, even including TLB miss penalties, only ~270ns - again that is 3 times faster than Llano using the same 1.33GHz DDR3 memory.
Intel Haswell APU | 256kB ~100clk | - | - | ~410ns | We only find one cache here (L1D), double Ivy size and 10clk slower. However, main memory latency is much higher, about 1.5x (410ns vs. 272ns). Whatever changes were made to the RingBus, latencies seem to have increased considerably - perhaps that is the reason for the doubling of L1D?

An impressive result for "Fermi" and a good result for APUs, with an impressive memory controller on "Ivy Bridge". The Radeon 6850 comes off worst even though it is the only dedicated GPU using GDDR5 memory.

Constant Latency

Constant Cache/Memory Latency

"Constant Memory" is read-only memory and as such more likely to be cached. As it is limited in size (e.g. 64kB) it needs to be used judiciously. No TLBs are generally needed as it may span 1 or very few pages.

Constant Latency

GPU | L1D (clk) | L2 (clk) | Memory (clk) | Comment
GeForce 8800 GTS | 2kB ~90clk | 32kB ~210clk | ~365clk / ~307ns | Unlike global memory, const memory is cached, still much slower than shared memory.
GeForce 260 GTX | 2kB ~105clk | 32kB ~140clk | ~396clk / ~305ns | Unlike global memory, const memory is cached, still 3x shared memory latency.
GeForce GT 555M | 4kB ~20clk | 32kB ~100clk | ~234clk / ~198ns | Similar cache sizes/latencies to global memory - thus excellent result. Worst-case latency low at ~200ns.
Radeon HD 6850 | 8kB ~348clk | - | ~385clk / ~497ns | Worse L1D latency than global memory (+20clk), might as well not bother with constant memory at all! That is not what we expect here at all. High worst-case latency as well (~500ns), over 2x as much as Fermi (~200ns) - and this is a dedicated GPU!
AMD Llano APU | 8kB ~351clk | - | ~391clk / ~882ns | Worse L1D latency than global memory again, here +30clk (!) - not good news. Large worst-case latency also (~880ns) - 4x Fermi's, 9x Ivy Bridge's - that is pretty significant.
Intel Ivy Bridge APU | - | - | ~90clk / ~90ns | Likely due to the large L1D cache we saw with global memory, latency is constant throughout the range - and the same as global memory. Very low worst-case time of ~90ns, lowest by miles - even Fermi is 2x slower.
Intel Haswell APU | - | - | ~100clk / ~135ns | As with Ivy, we don't detect any caches here; assuming the only cache is the large L1D one, the constant cache is too small to observe any caching effects. As with Global, latency seems 10clk higher.

Disastrous result for AMD's Radeon 6850 and Llano APU - just don't use constant memory. Great results on Fermi at small block sizes but overall Intel's Ivy Bridge has great performance throughout the range.

Shared Latency

Shared Memory Latency

"Shared Memory" is thread-group memory used to transfer data between threads running on the same group. As such it is pretty limited in size (e.g. 16-32kB) and not cached.

Shared Latency

GPU | Memory (clk) | Comment
GeForce 8800 GTS | ~24clk / ~20ns | Very fast shared memory, much faster than global/const memory.
GeForce 260 GTX | ~24clk / ~18ns | Very fast shared memory, much faster than global/const memory.
GeForce GT 555M | ~30clk / ~25ns | Extremely fast memory, though does not beat global/const L1D cache (~20clk).
Radeon HD 6850 | ~164clk / ~211ns | Pretty slow against the competition (5x slower than Fermi!) but 1/2 latency of global/const L1D cache. It may be worth copying constant memory data into shared memory if possible!
AMD Llano APU | ~163clk / ~368ns | Similar to Radeon 6850 result, slow, but not as slow as global/const L1D cache.
Intel Ivy Bridge APU | ~76clk / ~78ns | Somewhat high latency, but still lower than global/const L1D, thus normal optimisations still apply.
Intel Haswell APU | ~84clk / ~108ns | Again, 10clk higher latency than Ivy but still lower than global/const as well as competitor APUs. Still, somewhat disappointing that the latency has not improved.

Bad result for AMD's Radeon 6850 and Llano APU again - though an opportunity for optimisation arises: copying constant data to shared memory reduces latency by half! That's exactly how we improved the GP Cryptography benchmarks (AES encrypt/decrypt kernels) with great success! This optimisation also benefits Ivy Bridge but can be worse on Fermi where global/const L1D cache is 50% faster.
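The comparison behind this advice can be tabulated with a short Python sketch. The latencies are the best-case constant memory and shared memory figures from the tables in this article; the helper name is ours (hypothetical), and the one-off cost of copying the data into shared memory is not modelled.

```python
# device: (best-case constant memory latency, shared memory latency) in clocks
LATENCY_CLK = {
    "GeForce GT 555M": (20, 30),
    "Radeon HD 6850": (348, 164),
    "AMD Llano APU": (351, 163),
    "Intel Ivy Bridge APU": (90, 76),
}


def shared_speedup(const_clk, shared_clk):
    """How many times faster a shared memory access is than a constant one."""
    return const_clk / shared_clk


for device, (const_clk, shared_clk) in LATENCY_CLK.items():
    ratio = shared_speedup(const_clk, shared_clk)
    verdict = "copy to shared" if ratio > 1.0 else "keep in constant"
    print(f"{device}: {ratio:.2f}x -> {verdict}")
```

A ratio above 1 only pays off if the data is read often enough to amortise the initial copy, so the verdict is a starting point for profiling rather than a rule.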

Private Latency

Private Memory Latency

"Private Memory" is thread local memory, used for thread data manipulation. Each thread has a limited number of registers available for this purpose (total registers per CU / number of active threads per CU) - any "overspill" causes global memory to be used. As global memory latency is huge compared to registers (1clk), overspill has to be avoided at all costs.

Here we measure the "overspill" penalty hit, i.e. thread-accessible global memory latency, as register latency is not useful.
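The register budget can be sketched in Python as below, reading it as the CU's register file divided among its active threads. This is our interpretation: the function names are hypothetical, the "32k" register file is assumed to mean 32 x 1024 registers per CU, the thread counts are made-up examples, and hardware limits on registers per thread are not modelled.

```python
def registers_per_thread(register_file, active_threads):
    """Registers each thread can use when the file is shared evenly."""
    return register_file // active_threads


def spilled_registers(needed, register_file, active_threads):
    """Values per thread that no longer fit in registers (overspill)."""
    budget = registers_per_thread(register_file, active_threads)
    return max(0, needed - budget)


REGISTER_FILE = 32 * 1024  # assumed register count per CU

for threads in (512, 1024, 1536):
    budget = registers_per_thread(REGISTER_FILE, threads)
    spill = spilled_registers(40, REGISTER_FILE, threads)
    print(f"{threads} threads: {budget} registers each, "
          f"{spill} spilled for a kernel needing 40")
```

The more threads are kept active per CU, the smaller each thread's budget, so a kernel that fits comfortably at low occupancy can start to overspill at high occupancy.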

Private Latency

GPU | L1D (clk) | L2 (clk) | Memory (clk) | Comment
GeForce 8800 GTS | 8kB ~460clk | - | ~607clk / ~510ns | Up to 8kB latency is similar to global memory; over that size latencies are even higher.
GeForce 260 GTX | 8kB ~450clk | - | ~480clk / ~370ns | Similarly to G80, up to 8kB latency is similar to global memory and higher over that size. Overspills are costly.
GeForce GT 555M | 1/2kB? 280clk | 32kB ~323clk | ~558clk / ~472ns | It is not conclusive whether there is a L1D of 1/2kB but the L2D is clearly visible; while "slow", it does help - the competition has no caches at all! Worst-case latency is high though, comparable to global memory latency.
Radeon HD 6850 | - | - | ~532clk / 687ns | No caching is visible and overspills are costly: worst-case latency is high (~687ns) - though comparable with the competition.
AMD Llano APU | - | - | ~514clk / 1159ns | Similar to Radeon 6850 result, no caching visible and costly overspills: while slightly lower in terms of clocks, the real-time latency (~1160ns) is very high.
Intel Ivy Bridge APU | 128kB ~119clk | - | ~119clk / 113ns | While it may appear there is no caching here either, the L1D cache we saw in global/const has similar latencies to what we see here - we are nowhere near global memory worst-case latencies. Real-time worst-case latency (113ns) is 1/10 that of Llano and 1/5 Fermi!
Intel Haswell APU | TBA | TBA | TBA | TBA

Bad result for AMD's Radeon 6850 and Llano APU yet again, with costly overspill penalties - Ivy Bridge rules both APUs and GPUs! Fermi's honour is saved by the caches.

Texture Latency

Texture Cache/Memory Latency

This article does not investigate texture cache/memory latencies.

nVidia

GeForce 8800 GTS

The 8800 (G80) was the World's first "mass-market" GPGPU, supporting CUDA 1.0 - and DirectX 10 - and thus just as revolutionary as the original GPU (Riva TNT). Its unified shaders could, for the 1st time, perform a more varied set of tasks - like GPGPU. Even today it can run CUDA and OpenCL applications. Its 8 SP per SM design remained unchanged until CUDA 2.0.

GeForce 8800 GTS Latency

Its GDDR3 memory and wide bus are holding their own against modern DDR3/GDDR3 competition, and while global memory is not cached, most CUDA/OpenCL applications should have taken this into account long ago. Constant memory is cached with decently fast L1D and L2 and shared memory is fast also - similar to modern GPGPUs.

nVidia

GeForce 260 GTX

The 260 GTX, like its big brother the 280 GTX, was based on the 2nd generation GPGPU architecture (GT200), supporting CUDA 1.3 and, for the first time, double/FP64 support in hardware. High-precision scientific applications (that required 64-bit precision) could finally be ported to GPGPU.

GeForce 260 GTX Latency

While the GDDR3 memory is wider (448 vs 320-bit) and faster, latencies are pretty much similar to the previous G80. Global memory is still uncached, and constant and shared memory latencies are comparable.

nVidia

nVidia GeForce 555M (Fermi)

GeForce 555M is a "Fermi" (v2) CUDA 2.1 mobile dedicated GPU, but comprises multiple versions - here we have the GF106 (3C 144SP) 192-bit DDR3 version. CUDA 2.1 devices contain 3 shader groups (3x16 SFU/SM) vs. 2 groups in Fermi v1 CUDA 2.0 versions (2x16 SFU/SM) and have superscalar features in order to keep all shaders occupied.

GeForce 555M Latency

"Fermi" (CUDA 2.x) has one major improvement over previous G80/GT200 (CUDA 1.x) architectures: global memory is now cached, with a 3-level cache visible - same as constant memory. Previously, only TLB caches existed for global memory (L1 TLB, L2 TLB), but constant memory was cached. Fermi is now more "forgiving" in terms of memory accesses, though optimisation is still required.

Its 3-level cache architecture (L1D, L2D, L3D) is the most complex design here, but not unexpected - it is similar to the architecture of modern CPUs.

Very fast but small L1D (4kB ~20clk) and L2D (32kB ~100clk) keep the latencies down for global and constant memories, but memory latencies increase with block size until the worst-case value (~680clk) is over 30x higher! This is the only design here where L1D is faster than shared memory.
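Cache levels show up as "steps" in a latency-versus-block-size curve, and a short Python sketch can pick them out. The curve below is an idealised one built from the Fermi figures quoted in this article (flat plateaus, no gradual transitions), not raw measurement data. The function name and the 1.5x jump threshold are our own assumptions.

```python
KB = 1024

# Idealised (block size in bytes, latency in clocks) curve, Fermi-like
FERMI_LIKE_CURVE = [
    (1 * KB, 20), (2 * KB, 20), (4 * KB, 20),
    (8 * KB, 100), (16 * KB, 100), (32 * KB, 100),
    (64 * KB, 320), (128 * KB, 320), (256 * KB, 320),
    (512 * KB, 680), (1024 * KB, 680),
]


def find_latency_steps(curve, jump=1.5):
    """Return (size, latency) of the last point before each latency jump.

    curve: list of (block_bytes, latency_clk) sorted by block size.
    jump:  a step is reported when the next latency exceeds jump x current.
    """
    steps = []
    for (size, latency), (_, next_latency) in zip(curve, curve[1:]):
        if next_latency > latency * jump:
            steps.append((size, latency))
    return steps


for size, latency in find_latency_steps(FERMI_LIKE_CURVE):
    print(f"cache level up to {size // KB}kB at ~{latency}clk")
```

On real measurements the transitions are more gradual, so the threshold would need tuning and some smoothing would likely be required.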

Shared memory is fast with constant latency throughout the range. Private memory used for overspills is slow but, due to caching, faster than the competition.

AMD

AMD A6-3650 APU (Llano) / Radeon HD 6530D

"Llano" was the 1st mainstream (both desktop and mobile) APU and as such it has enjoyed mass-market appeal. While recently replaced with "Trinity", it is still used in a vast number of systems.

Its 32nm DirectX 11 GPU (BeaverCreek/Sumo) is based on the Radeon 5500 series (Redwood) and is thus a VLIW5 design with 80 SP per CU and 4-5CUs. Here we test the 4C 320SP version.

Llano Latency

It is quite clear that there is little point in using constant memory: stick to global (-20clk). If possible, copy const data into shared memory, which is over 2 times faster (~163clk vs. ~351clk). Private memory used for overspill is very slow (~490clk), thus overspills have to be avoided like the plague.

AMD

AMD Radeon HD 6850 (Barts)

"Barts" (6800 series) is the successor of "Cypress" (5800 series), the 1st of the "Northern Islands" family but still a VLIW5 design on 40nm. While it boasts various improvements, it also lacks some features (e.g. double/FP64 support) and, with far fewer SIMD units, actually performs worse - it is all about lower costs.

You need to look at the 6900 series for a worthy successor to the 5800 series or even the 7800 series.

Radeon 6850 Latency

Somewhat surprisingly, the same comments as "Llano" apply here, even though it is a dedicated GPU and not an APU. Code optimised for one will run equally well on the other.

Intel

Intel i7-3xxxM APU (Ivy Bridge) / GT2

"Ivy Bridge" is the first Intel APU (GT1/GT2 EU v7) as it includes GPGPU capabilities - the previous "Sandy Bridge" model did contain a built-in GPU (GT1/GT2 EU v6) but it did not have GPGPU capabilities - they were emulated in software on the CPU.

World's 1st 22nm device, it has few but complex EUs (CU) with an undisclosed number of SPs (6-8) per EU. It also contains a 2MB cache for code and data.

Ivy Bridge Latency

Very low latencies for all memory types, even worst-case global memory with TLB miss penalties (~285clk) - beating dedicated GPUs with GDDR5 memory! Reasonably large L1 cache (128kB) keeps latencies low for all memory types.

Intel

Intel i7-4xxxM APU (Haswell) / GT3 HD 5200

"Haswell" is a 2nd generation Intel APU (GT1/GT2 and GT3 EU v7.5) on the same 22nm process. Our GT3 sample has more than double the EUs (40!) of GT2 (16) as well as 128MB eDRAM/L4 cache; while core speed is only slightly lower (600MHz vs. 650MHz), Turbo speed is much lower (800MHz vs. 1.1GHz). All things considered GT3 has higher performance for less power but it is not cheap!

Haswell Latency

Final Thoughts / Conclusions

While nVidia's GPUs' latencies have been investigated by other parties in more detail before, here we compare GPUs and APUs from multiple vendors using the common OpenCL interface (and DirectX ComputeShader). Fermi shows one major improvement - cached global memory - and while its 3-level cache architecture is complex, it works as expected, similar to modern CPUs.

The AMD GPU and APU do throw a few surprises that, whatever their nature (hardware, compiler, etc.) mean that some kernels may need to be optimised differently for best performance. We (SiSoftware) ourselves have used the latency results to optimise the GP Cryptography benchmarks (AES encrypt/decrypt kernels) with great success (details in future article).

Intel's first APU is also interesting to test: it behaves similarly to AMD's own APU and GPU (albeit with far lower latencies) - rather than nVidia's GPU. While its cache architecture is very simple (large L1) it works well.

We have shown that there is no "one latency", but latencies greatly vary with memory type and access pattern. The way kernels access memory (access pattern) and the type of memory used have direct influence on the latencies they will experience.
