Q & A – Memory Benchmark

This document answers some frequently asked questions about Sandra’s memory benchmark. Please read the Help File as well!

Q: What is STREAM?

A: STREAM is a popular memory bandwidth benchmark that has been used on everything from personal computers to supercomputers. It measures sustained memory bandwidth, not burst or peak bandwidth, so its results may be lower than those of other benchmarks. Sandra’s memory benchmark is based on STREAM.
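
To give a feel for what "sustained" bandwidth means, here is a minimal sketch of a copy test in Python. It is not STREAM's or Sandra's code: the function name, buffer size and trial count are assumptions chosen for illustration. It copies a buffer much larger than typical CPU caches several times, keeps the fastest run, and counts bytes read plus bytes written, which is how STREAM accounts for its Copy kernel.

```python
import time

def copy_bandwidth_mb_s(size_bytes=64 * 1024 * 1024, trials=5):
    """Estimate sustained copy bandwidth in MB/s (1 MB = 10**6 bytes).

    Hypothetical helper for illustration; sizes and trial count are assumptions.
    """
    src = bytearray(size_bytes)  # source buffer, zero-filled
    dst = bytearray(size_bytes)  # destination buffer of the same size
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        dst[:] = src             # one bulk copy: reads src, writes dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # keep the fastest trial
    # Bytes moved = bytes read + bytes written.
    return 2 * size_bytes / best / 1e6
```

Calling copy_bandwidth_mb_s() with the defaults allocates two 64MB buffers; the figure it returns will vary from run to run and should not be compared with Sandra or STREAM scores.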

Q: How is Sandra’s Memory Benchmark different from STREAM?

A: STREAM 2.0 uses static data (about 12M); Sandra uses dynamic data (around 40-60% of physical system RAM). This means that on computers with fast memory Sandra may yield lower results than STREAM. It is not feasible to make Sandra use static data: Sandra is much more than a benchmark, so statically allocated arrays would needlessly tie up memory.

A major difference is that Sandra’s algorithm is multi-threaded on SMP/SMT systems. It splits the arrays and lets each thread work on its own portion. Sandra creates one thread for each CPU in the system and assigns each thread to an individual CPU.
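
The array splitting can be sketched as follows. This is an illustration only, not Sandra's implementation: split_ranges and threaded_copy are hypothetical names, the threads are not pinned to individual CPUs, and CPython's global interpreter lock may serialise the copies, so the sketch shows the partitioning rather than any speed-up.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def split_ranges(n_items, n_threads):
    """Return n_threads contiguous (start, stop) ranges that cover n_items."""
    base, extra = divmod(n_items, n_threads)
    ranges, start = [], 0
    for i in range(n_threads):
        stop = start + base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, stop))
        start = stop
    return ranges

def threaded_copy(src, dst, n_threads=None):
    """Copy src into dst, each thread copying only its own slice."""
    n_threads = n_threads or os.cpu_count() or 1  # default: one thread per CPU
    src_view, dst_view = memoryview(src), memoryview(dst)

    def work(bounds):
        start, stop = bounds
        dst_view[start:stop] = src_view[start:stop]  # this thread's portion only

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, split_ranges(len(src), n_threads)))
```

For example, split_ranges(10, 3) gives the three ranges (0, 4), (4, 7) and (7, 10), so no two threads touch the same elements.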

Another difference is the aggressive scheduling/overlapping of instructions in order to maximise memory throughput even on “slower” processors. The loops should always be memory bound rather than CPU bound on all modern processors.

The other major difference is the use of alignment. Sandra dynamically changes the alignment of the streams until it finds the best combination, then tests that combination repeatedly to estimate the maximum throughput of the system. You can change the alignment in STREAM and recompile, but generally it is set to 0 (i.e. no offset).
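
The search-then-retest idea can be sketched like this. It is a simplified illustration under stated assumptions, not Sandra's algorithm: find_best_offset and make_copy_measure are hypothetical names, only the destination is shifted, and Python cannot control the absolute address of a buffer, so the offsets are relative to wherever the buffer happens to start.

```python
import time

def find_best_offset(measure, offsets, repeats=5):
    """Search the offsets for the fastest one, then re-test that offset.

    measure(offset) must return a throughput figure (higher is better).
    Returns (best_offset, best_rate_seen_while_retesting).
    """
    best_offset = max(offsets, key=measure)  # search phase: one run per offset
    best_rate = max(measure(best_offset) for _ in range(repeats))  # re-test phase
    return best_offset, best_rate

def make_copy_measure(size_bytes=8 * 1024 * 1024, max_offset=256):
    """Build a measure() that times one copy into a shifted destination."""
    src = memoryview(bytearray(size_bytes))
    dst = memoryview(bytearray(size_bytes + max_offset))

    def measure(offset):
        # offset must not exceed max_offset, or the slice lengths will differ
        start = time.perf_counter()
        dst[offset:offset + size_bytes] = src  # destination shifted by offset
        elapsed = time.perf_counter() - start
        return 2 * size_bytes / elapsed / 1e6  # MB/s, counting read + write

    return measure
```

A typical call would be find_best_offset(make_copy_measure(), [0, 16, 32, 64]). Passing the measurement in as a function keeps the search logic separate from the timing code, so either part can be swapped out.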

Q: Is Sandra compatible with STREAM?

A: No. See above for the main differences. Both benchmarks should show a similar relative difference between two computers, but the absolute scores are not directly comparable.

Q: Does Sandra detect NUMA systems?

A: Yes, Sandra supports NUMA systems. You also need Windows XP/2003 or later for proper NUMA support.

Q: Will the rating be correct on my NUMA system?

A: Assuming that both the OS and Sandra detect NUMA support and the NUMA nodes correctly, the rating should be correct. This depends on each thread’s (i.e. each CPU’s) test memory blocks being allocated on that thread’s own NUMA node.

Q: Why does the rating change between runs?

A: Make sure you have enough RAM (16MB or more) and that only Sandra is running. If you see the hard disk light come on, your computer is swapping. Accurate results can only be obtained if the computer is not swapping.

Q: What’s the deal with the new block pre-fetch/buffering SSE(2)/EMMX benchmarks?

A: In a nutshell, the new tests use the pre-fetching instructions to bring data into the CPU and store the data directly into memory, bypassing the caches. In order to maximise throughput, buffers are used to pre-fetch data into the caches so that it is already there when needed and to reduce switches between different data streams.

For more information, please read the following whitepapers:

Intel Pentium III – SGI Whitepaper on memory copy using new SSE instructions.
Intel Pentium 4 – see Pentium 4 Code Optimisation Manual.
AMD Athlon/Duron – Athlon/Duron Optimisation Document. The relevant parts of the guide are Chapter 5, p. 66, Optimizing Main Memory Performance for Large Arrays, and the sample code in Chapter 10, p. 180, which has an Athlon/Duron-specific optimized memcpy() that works for any size of memory block.

Q: Why do I see different reference score lists from Sandra on different systems?

A: There are currently 4 different reference lists, depending on the tests run (options, CPU capabilities):

Uni-Processor legacy (ALU/FPU) tests
Multi-Processor (SMP) legacy (ALU/FPU) tests
Uni-Processor advanced (buffering/code-prefetch EMMX/SSE/SSE2) tests
Multi-Processor (SMP) advanced (buffering/code-prefetch EMMX/SSE/SSE2) tests

While this may appear confusing, it was done so that systems are compared using the same types of tests, avoiding apples-to-oranges comparisons as far as possible. This should ensure a fair test and result.

Q: How am I supposed to know what (kind of) test was run?

A: Pay attention to the result bar; it should tell you all about the test as well as the result in MB/s. It should say:

Type of unit(s) used. E.g. ALU or FPU.
Type of data used. E.g. integer or floating-point.
Any techniques used. E.g. block pre-fetch, buffering, etc.
Any instruction sets used. E.g. MMX, EMMX, SSE, SSE2, etc.
The score in MB/s.

Q: Why do some systems show close to 95% bandwidth efficiency and others less than 80%?

A: While the code for all CPUs is heavily optimised, the efficiency depends on chipset performance and memory settings. Only the most aggressive settings are likely to yield more than 80% efficiency, so anything higher is a bonus.
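
As a worked example, bandwidth efficiency is taken here to mean the measured score as a percentage of the theoretical peak bandwidth. That definition is an assumption, since this document does not spell it out, and the function name and the figures below are made up for illustration.

```python
def bandwidth_efficiency(measured_mb_s, peak_mb_s):
    """Return measured bandwidth as a percentage of theoretical peak."""
    if peak_mb_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * measured_mb_s / peak_mb_s

# Hypothetical scores against a hypothetical 800MB/s peak:
print(bandwidth_efficiency(640, 800))  # 80.0
print(bandwidth_efficiency(760, 800))  # 95.0
```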

Q: Which memory has better efficiency: SDRAM, DDR, DDR2, RDRAM, etc?

A: Mostly, memory bandwidth performance depends on chipset architecture and memory timings, thus performance varies. A generic answer is beyond the scope of this document.

Q: Why do I get lower indexes with the new buffering benchmarks & SSE(2)/EMMX in SMP mode?

A: While the ALU/FPU benchmarks were also memory bound, the CPU caches were used implicitly by the system; the new benchmarks completely bypass the caches to achieve greater throughput. However, this causes more collisions/congestion if the CPU bus is shared, which results in a lower index.

This is why L2 and L3 caches are so much more important in an SMP system. It is up to you to decide whether you want to measure the pure, maximum performance or the SMP performance. If you want to measure the former, disable SMP support in the benchmark’s options.

Q: Why don’t I get higher scores with HyperThreading/SMT enabled?

A: SMT does NOT help with memory transfers. The bandwidth available to each physical CPU is the same, so using all logical CPUs would only increase overhead, resulting in lower scores. We’re looking into using SMT for prefetching in future versions of the benchmark.

Q: If the benchmark is multi-threaded, why don’t I get higher indexes on an SMP system?

A: The benchmark is OK. You can verify by looking at the load, number of threads and memory utilisation in Task Manager.

The issue is the bus that connects the CPUs. If it is shared rather than point-to-point (e.g. Intel’s (A)GTL+ as used in PPro/PII/PIII/P4), the CPUs share the same bandwidth, so you won’t see much of an increase given the huge amount of data transferred by the benchmark. Since the benchmark is memory limited (as it must be to be correct), adding CPUs won’t make much difference because the memory bus is the bottleneck. When the bus is lightly utilised you get close to an N-fold increase in performance (where N is the number of CPUs); otherwise you get little or no performance gain.
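
A deliberately simplified model of this behaviour: aggregate throughput grows with the number of CPUs until it hits the shared bus limit. The function names and the numbers are assumptions for illustration, and the model ignores the extra collision overhead on a shared bus, which can push real scores lower still.

```python
def smp_throughput(n_cpus, per_cpu_mb_s, shared_bus_mb_s):
    """Aggregate throughput when n_cpus share one bus of fixed bandwidth."""
    return min(n_cpus * per_cpu_mb_s, shared_bus_mb_s)

def speedup(n_cpus, per_cpu_mb_s, shared_bus_mb_s):
    """Gain of n_cpus over a single CPU under the same bus limit."""
    one = smp_throughput(1, per_cpu_mb_s, shared_bus_mb_s)
    many = smp_throughput(n_cpus, per_cpu_mb_s, shared_bus_mb_s)
    return many / one

# Bus lightly utilised by one CPU: close to N-fold gain.
print(speedup(2, 200, 800))  # 2.0
# One CPU already nearly saturates the bus: small gain (about 1.14).
print(speedup(2, 700, 800))
```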

Q: In my SMP system all memory benchmarks (ALU, FPU, MMX, SSE2 etc.) return the same score! Why is that?

A: See above. This shows that the benchmark is working, i.e. the limit of memory throughput has been reached: no matter which instructions you use to load/store, it makes no difference.

Q: My system is supposed to have a bandwidth of X MB/s (e.g. 800MB/s for PC100 SDRAM). Why does Sandra show less than 1/2 of it?

A: The number quoted by the manufacturer is the best-case sequential read throughput. Sandra both reads from and writes to memory, using different areas in SMP mode. This puts greater stress on the memory system (including the cache controllers), resulting in a lower but more realistic index. Most programs read, compute and write back data rather than just read it. Please update to Sandra 2002 or later, which uses the new instruction sets and techniques to obtain better efficiency.
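
For reference, the quoted figure is simply clock rate times bus width. The sketch below reproduces the 800MB/s for PC100 SDRAM (100MHz, 64-bit bus, one transfer per clock); the function name and parameters are illustrative, not part of Sandra.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_width_bits=64, transfers_per_clock=1):
    """Theoretical peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return clock_mhz * transfers_per_clock * bus_width_bits / 8

print(peak_bandwidth_mb_s(100))                         # PC100 SDRAM: 800.0
print(peak_bandwidth_mb_s(100, transfers_per_clock=2))  # DDR at 100MHz: 1600.0
```

This is an upper bound for back-to-back sequential transfers, which is why a test that both reads and writes reports less.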

Q: Why does the Win64 version of Sandra not test over 1TB of memory?

A: The current Win64 version of Sandra cannot handle more than 1TB of memory. Future versions will support more, although they may not support the whole 64-bit address space.

Q: Why does the Win32 version of Sandra not test over 2GB of memory?

A: The current Win32 version of Sandra cannot handle more than 2GB of memory. Future versions will support more, although they won’t be able to support the 36-bit address space.
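
The sizes quoted in these answers are powers of two, assuming GB and TB are meant in the binary sense (the text does not say). A quick check of how address bits map to sizes, using a hypothetical helper:

```python
GB = 2 ** 30  # binary gigabyte, an assumption about the units used here

def address_space_bytes(address_bits):
    """Number of bytes addressable with the given number of address bits."""
    return 2 ** address_bits

print(address_space_bytes(31) // GB)  # 2    (2GB)
print(address_space_bytes(36) // GB)  # 64   (the 36-bit address space)
print(address_space_bytes(40) // GB)  # 1024 (1TB)
```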

Q: Why does the WinCE 3.0 version of Sandra not test over 16MB of memory?

A: The current WinCE 3.0 version of Sandra cannot handle more than 16MB of memory. WinCE .Net versions can handle up to 2GB of memory.

Q: Why doesn’t the benchmark include my super-duper XXXXGHz CPU?

A: While we do buy and test each and every CPU model on the market, we cannot afford to buy all the very latest speed grades of each CPU. Even if we could, we could not update the benchmark every time a new speed grade is released; we’d need to do it every week.