Q & A – CPU Benchmark

This document answers some frequently asked questions about the Sandra CPU benchmarks. Please read the Help File as well!

Q: What is the Dhrystone benchmark?
A: The original Dhrystone benchmark, in its various versions and variants, is still widely used in industry to measure CPU performance. It is designed to contain a representative sample of the types of operations, mostly numerical, used by applications. Unfortunately it does not always reflect real-life performance, but it is useful for comparing the speed of various CPUs.

The Dhrystone benchmark used here is a multi-threaded, 32/64-bit variant of the original one which runs under UNIX. Up to 64 CPUs in SMP systems are supported. The result is determined by measuring the time it takes to perform certain sequences of instructions. Due to various changes, the result is not directly comparable with other Dhrystone benchmarks. However, the MIPS (Millions of Instructions Per Second) rating should be the same for the same system (±5-10% variation) between benchmarks.

While the original benchmark does not compute anything, this version checks the results against the expected ones in case there are problems with the CPU/memory.
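
As a rough illustration of the approach described above (time a fixed sequence of operations, then check the result against an expected value), here is a minimal Python sketch. It is not Sandra's code: the function names and the workload (a simple integer sum) are assumptions made for the example, and it counts loop iterations rather than machine instructions, so the figure it returns is not a MIPS rating.

```python
import time

def checked_loop(n):
    """Hypothetical workload: add up the integers 0 .. n-1."""
    total = 0
    for i in range(n):
        total += i
    return total

def run_benchmark(n=1_000_000):
    """Time the workload, verify its result, return millions of iterations/s."""
    expected = n * (n - 1) // 2            # closed form of 0 + 1 + ... + (n-1)
    start = time.perf_counter()
    result = checked_loop(n)
    elapsed = time.perf_counter() - start
    if result != expected:
        # A mismatch here would point at a CPU/memory problem, not a slow run.
        raise RuntimeError("result mismatch: possible CPU/memory problem")
    return n / max(elapsed, 1e-9) / 1e6    # guard against a zero time reading
```

Calling run_benchmark() several times and comparing the returned values gives a feel for the run-to-run variation discussed later in this document.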

Q: What is the Whetstone benchmark?
A: The Whetstone benchmark is widely used in the computer industry as a measure of FPU or Co-Processor performance. Floating-point arithmetic is most significant in programs that require a Co-Processor. These are mostly scientific, engineering, statistical and computer-aided design programs.

The Whetstone benchmark used here is a multi-threaded, 32/64-bit variant of the original one which runs under UNIX. Up to 64 CPUs in SMP systems are supported. The result is determined by measuring the time it takes to perform certain sequences of floating-point instructions. Due to various changes, the result is not directly comparable with other Whetstone benchmarks. However, the MFLOPS (Millions of FLoating-point OPerations per Second) rating should be the same for the same system (±5-10% variation) between benchmarks.

Q: What are the SIMD (SSE2, SSE3, etc.) Whetstone benchmarks?
A: With the introduction of SSE2 and its support for double-precision (64-bit) floats, it is now possible to write code that does not use the legacy FPU at all. This version shows that the full Whetstone benchmark can be implemented using SSE2 and thus take advantage of the SIMD mode of operation.

Q: Why does the rating vary between sequential runs?
A: On most systems, the value of the rating shouldn’t change by more than about +/-5% from run to run. On systems with limited memory it may vary by +/-10% due to memory swapping. If you’re seeing variations higher than this, some hardware or software is probably to blame. Do note that “limited memory” depends on the operating system, installed drivers, running programs, etc. A badly configured system with 64MB may be worse memory-wise than a 16MB system…

Software Causes:

Other programs running. Close all programs. If this doesn’t solve the problem, check for programs that are loaded in the Startup group, Run list, etc.
Problematic device drivers. Drivers which are incorrectly configured or incompatible may slow down the system. Also, some drivers “poll” the system for various reasons, e.g., the CD-ROM auto-run feature.
Power-saving options are turned on (APM). These options may cause the CPU to automatically slow down after even a few seconds of inactivity. Often, “inactivity” is defined as nobody typing on the keyboard or moving the mouse, even if a program is working away.
If you’re benchmarking an SMP system, some variation may occur due to the synchronising algorithm in the benchmark. Since the work done by threads must be synchronised, there is a small overhead which may be more apparent on some systems.

Hardware Causes:

Insufficient memory for the program to function properly. Close all programs. You may also try unloading drivers which you no longer need.
Insufficient secondary (L2/L3) cache memory or poor cache controller.
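
To judge whether your own results fall inside the +/-5% band mentioned above, you can compare several scores against their average. The Python sketch below is only an illustration: the function names are hypothetical, and measuring variation as the largest deviation from the mean is an assumption, since this document does not say how the percentage is defined.

```python
def variation_percent(scores):
    """Largest deviation from the mean score, as a percentage of the mean."""
    mean = sum(scores) / len(scores)
    return max(abs(s - mean) for s in scores) / mean * 100.0

def within_tolerance(scores, tolerance=5.0):
    """True if no run strays further than `tolerance` percent from the mean."""
    return variation_percent(scores) <= tolerance
```

For example, three runs scoring 1000, 1020 and 980 deviate from their mean by 2% at most, which is within the expected range; runs scoring 1000, 1200 and 800 deviate by 20%, which suggests one of the causes listed above.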

Q: Why does the rating vary between random runs?
A: You may find that you get a higher benchmark rating when running the program straight after Windows loads than after running and closing other programs. This should not happen very often in practice, but it sometimes does.

Software Causes:

Programs which do not clean up after closing down. Crashed programs leave “orphaned” objects which take up system resources. While Windows does garbage-collecting, sometimes you just have to restart…
Memory fragmentation. As programs load and close down, memory is allocated and freed and may be left fragmented. Again, for various reasons de-fragmentation is not always 100% effective.

Hardware Causes:

Insufficient secondary (L2/L3) cache memory or poor cache controller.

Q: The benchmark scores in the latest Sandra are different from earlier versions!
A: The benchmarks change from major release to major release in order to keep up with new technology developments. Please compare results using the same major version of Sandra.

Q: While my P4 Hyper-Threaded/SMT system does well in Whetstone FPU/SSE it does not do much better on Dhrystone! Why?
Q: While my P4 Hyper-Threaded/SMT system does well in Multi-Media Float SSE2 it does not do much better in Multi-Media Integer SSE2! Why?
A: The floating-point units were under-utilised in the original P4, so they gain a good improvement from SMT (~50%); the ALUs (even double-clocked) were better utilised and already gave great performance, so they gain only some improvement (~20%).

Q: Why do some CPUs with advanced FPUs score low in Whetstone?
A: The Whetstone benchmark uses the slowest functions of the FPU (computing transcendentals, e.g. sin/cos/tan) in a way that cannot be parallelised (one serial chain). This was done in order to prevent cheating by manufacturers as much as possible. Unfortunately, this means that features like out-of-order execution, pipelining, etc. are bypassed. These are paramount to today’s processors but that’s not what is tested here.

Thus on processors where transcendental instructions have been optimised, the benchmark index will be much higher. Some manufacturers have chosen to optimise other instructions – which they consider more widely used than transcendentals.
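
The difference between a serial chain and independent work can be sketched in Python as follows. This is an illustration of the idea only, not the Whetstone code: the function names and the particular sin/cos expression are assumptions chosen for the example.

```python
import math

def serial_chain(x, n):
    """Each step needs the previous result, so steps cannot overlap."""
    for _ in range(n):
        x = math.sin(x) + math.cos(x)   # input of step k is output of step k-1
    return x

def independent_ops(values):
    """Each element is computed on its own; a CPU is free to overlap these."""
    return [math.sin(v) + math.cos(v) for v in values]
```

In the first function the latency of every transcendental call adds up, which is the situation described above; in the second, out-of-order execution and pipelining could hide much of that latency.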

Q: My SMP/multi-core system scores the same or lower than a single CPU system! What’s wrong?
A: Here are the most likely causes:

Make sure you’re using the SMP kernel. You should see all your CPUs in Task Manager, for example.
If you’re using the SMP ACPI kernel, make sure each CPU can go up to 100%. Some BIOSes contain bugs that do not allow full usage of all CPUs.
Make sure no background processes are using the CPU(s), i.e. utilisation is 0% before you start the benchmark.

Q: Why does the benchmark use as many threads as CPUs?
A: A proper CPU benchmark is CPU bound, thus one thread per CPU should use 100% of its power. Using more threads just increases the synchronisation overhead. If the benchmark were not CPU bound, more threads would help.

Q: When should I use the Dynamic Load Balance?
A: Always use the Static Load Balance unless the CPUs are running at different speeds internally. The calibration and synchronisation algorithms of the Dynamic Load Balance are more complex and have a higher overhead, thus they should be avoided. We’ve provided them for testing purposes only.

Q: What is this CPU Multi-Media Benchmark? What does it do?
A: This benchmark generates a picture (of various sizes) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements SIMD instructions bring to such an algorithm.
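
For readers unfamiliar with the algorithm, the per-pixel work looks roughly like the Python sketch below. It is the textbook escape-time form of the Mandelbrot iteration with a 255-iteration limit, not Sandra's implementation: the function name is hypothetical, and the early exit once a point escapes is an assumption (this document does not say whether the benchmark always runs all 255 iterations).

```python
MAX_ITER = 255   # iteration limit per pixel, as stated above

def mandelbrot_iterations(cx, cy, max_iter=MAX_ITER):
    """Iterate z = z*z + c from z = 0; return how many steps until |z| > 2."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:      # |z|^2 > 4 means the point escaped
            return i
        # Complex square plus c, written out in real and imaginary parts.
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter                      # never escaped within the limit
```

A point inside the set, such as (0, 0), uses the full 255 iterations, while a point far outside escapes after a step or two.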

The benchmark is multi-threaded for up to 64 CPUs on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assigns each thread to a different CPU.
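
One way to picture this interlacing is a shared “next column” counter that every thread draws from. The Python sketch below is an assumption about how such a scheme could look, not Sandra's code: the names render_interlaced and compute_column are hypothetical, and pinning each thread to its own CPU is not shown.

```python
import threading

def render_interlaced(width, n_threads, compute_column):
    """Let n_threads share the columns 0 .. width-1; return who did which."""
    next_column = 0
    lock = threading.Lock()
    done_by = [None] * width             # done_by[col] = id of the thread used

    def worker(tid):
        nonlocal next_column
        while True:
            with lock:                   # claim the next unclaimed column
                col = next_column
                next_column += 1
            if col >= width:
                return                   # nothing left to compute
            compute_column(col)
            done_by[col] = tid

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done_by
```

Because columns are claimed one at a time, a thread that finishes a cheap column early simply picks up the next one, which keeps the CPUs evenly loaded. Note that standard Python threads do not run CPU-bound code in parallel, so this shows the work distribution only, not a speed-up.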

The benchmark contains many versions (ALU, MMX, (Wireless) MMX, SSE, SSE2, SSSE3) that use integers to simulate floating point numbers, as well as many versions that use floating point numbers (FPU, SSE, SSE2, SSSE3). This illustrates the difference between ALU and FPU power.
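
“Using integers to simulate floating point numbers” usually means fixed-point arithmetic. A minimal Python sketch of the idea follows; the 16 fractional bits and the function names are assumptions for the example, as this document does not state which format the benchmark uses.

```python
FRAC_BITS = 16                 # assumed number of fractional bits
ONE = 1 << FRAC_BITS           # the integer that represents 1.0

def to_fixed(x):
    """Convert a float to its fixed-point integer representation."""
    return int(round(x * ONE))

def to_float(f):
    """Convert a fixed-point integer back to a float."""
    return f / ONE

def fixed_mul(a, b):
    """Multiply two fixed-point numbers using integer operations only."""
    return (a * b) >> FRAC_BITS   # rescale: the raw product has 2x the bits
```

For example, to_float(fixed_mul(to_fixed(1.5), to_fixed(2.0))) gives 3.0 using only an integer multiply and a shift, which is the kind of work the ALU and integer SIMD versions perform.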

The SIMD versions compute 2/4/8 Mandelbrot point iterations at once – rather than one at a time – thus taking advantage of the SIMD instructions. Even so, a 2/4/8x improvement cannot be expected (due to other overheads); generally a 2.5-3x improvement has been achieved. The ALUs and FPUs of 6th/7th generation processors are very advanced (e.g. 2+ execution units), which narrows the gap as well.

Q: What does the metric “it/s” stand for?
A: It stands for computed “Mandelbrot iterations per second” × 1,000 (thus strictly milli-iterations per second, mit/s). An index of 1,000 therefore means that 1 iteration has been completed in 1 second.

Q: How do I compute the computed pixel rate from the index? (i.e. how fast is the algorithm?)
A: The image rendered is 640×480 pixels, 32 colours. Computed pixel rate = 640*480*index/1000. Thus, for example, if the index is 1,000, the pixel rate = 640*480*1000/1000 = 307,200 (about 307k) pixels/second.
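
The same formula as a small Python helper, for convenience. The function and constant names are hypothetical; the arithmetic is exactly the formula given above.

```python
WIDTH, HEIGHT = 640, 480       # rendered image size, as stated above

def pixel_rate(index):
    """Computed pixels per second for a given benchmark index (in it/s)."""
    return WIDTH * HEIGHT * index / 1000
```

pixel_rate(1000) returns 307200.0, matching the worked example.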

Q: Isn’t comparing an ALU/FPU index to a SIMD index like comparing apples to oranges?
A: It depends on what you’re trying to test; the index shows what gain the new instructions bring in getting the task (computing the Mandelbrot fractal in this case) done. If you want to test how two processors perform using the same test, go to Options and disable the test(s) that use(s) the more advanced instructions.

Q: How am I supposed to know what (kind of) test was run?
A: Pay attention to the result bar; it should tell you all about the test as well as the result in it/s. It should say:

Type of unit(s) used. E.g. ALU or FPU.
Type of data used. E.g. integer or floating-point.
Any instruction sets used. E.g. SSE2, etc.
The score in it/s. (See above for how to calculate the pixel rate.)

Q: Are the tests in the CPU Multi-Media Benchmark optimised for a specific CPU?
A: Yes, the tests are optimised as far as possible but without introducing instructions that would generate large penalties on other processors.

ALU (Integer) Test – blend optimised for P6+.
FPU (Floating Point) Test – blend optimised for P6+.
SSE2, SSSE3, SSE4.1, SSE4.2 (Integer & Floating Point) Test – blend optimised for P7+.

Q: I have 2 CPUs with different steppings in my SMP system and the benchmarks are strange!
A: Make sure the CPU with the lowest stepping is the boot (BSP) CPU. That way, software will use only the features/timings supported by all processors.

Q: Why doesn’t the benchmark include my super-duper CPU?
A: While we do buy and test each and every CPU model on the market, we cannot afford to buy all the very latest speed grades of each CPU. Even if we could, we could not update the benchmark every time a new speed grade is released – we’d need to do it every week.