Q & A : Benchmarks : Multi-Core Efficiency
What is it?
A benchmark specifically designed to measure the efficiency of multi-core processors across their different architectures, as well as to compare their performance to traditional multi-processor (SMP) systems.
Performance-measuring benchmarks do not show at a glance the differences between multi-core processors, nor how multi-threaded programs should be designed to best take advantage of the underlying architecture while avoiding the “gotchas”.
This benchmark does not test processor unit computational performance, i.e. how fast the cores of the processors are; it tests only how fast the connections between them are (inter-core bandwidth and latency).
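To make “the connection between them” concrete: inter-core latency is typically estimated by ping-ponging a flag between two threads and timing the round trips. The minimal C++ sketch below illustrates only that idea – it is not Sandra's implementation, the iteration count is arbitrary, and thread placement is left to the OS scheduler here (real code would pin each thread to a specific processor unit).

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    constexpr int kIters = 100000;
    std::atomic<int> turn{0};          // 0: main's turn, 1: peer's turn

    // Peer thread: waits for the flag, then hands it straight back.
    std::thread peer([&] {
        for (int i = 0; i < kIters; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
            turn.store(0, std::memory_order_release);
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        turn.store(1, std::memory_order_release);               // ping
        while (turn.load(std::memory_order_acquire) != 0) { }   // wait for pong
    }
    auto t1 = std::chrono::steady_clock::now();
    peer.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Each iteration is one round trip, i.e. two cache-line handovers.
    std::printf("approx. one-way inter-core latency: %.1f ns\n", ns / kIters / 2.0);
    return 0;
}
```

Inter-core bandwidth is measured in the same spirit, but by streaming blocks of data from a producer thread to a consumer thread rather than bouncing a single flag.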
Why do we measure it?
There are many multi-core processors available, so much so that traditional single-core processors are disappearing; even (some) traditional multi-processor systems are being replaced by a single multi-core processor. However, not all are made equal, as their architectures differ greatly.
For example, which is faster: a new, single dual-core processor, or a traditional dual-processor system built from single-core processors?
Typical Results from Processors on the Market
Testing various current processors or just checking out the reference results makes the differences in architectures and implementations very clear. Let’s see a few examples:
Processor | Inter-core Bandwidth | Inter-core Latency | Commentary |
AMD Athlon X2 | 2.88GB/s @ 2.6GHz | 98ns @ 2.6GHz | The integrated memory controller and SRI/Crossbar interface give it very low inter-core latency; however, the lack of a shared L2 cache means the inter-core bandwidth cannot match designs that share a cache. Still, passing data between threads is faster than on a comparable SMP system. |
AMD Phenom X4 | 3.93GB/s @ 2.4GHz | 159ns | With a shared L3 cache, inter-core bandwidth for small/medium transfers is high with low latency, while larger transfers still benefit from the built-in memory controllers. Its bandwidth efficiency is also higher, resulting in higher achievable bandwidth. |
Intel Core Duo | 2.9GB/s @ 1.67GHz | 160ns @ 1.67GHz | The first design with a shared L2 cache; the results are very good, though the L2 cache is smaller and the bandwidth lower than on newer Core 2 processors. |
Intel Core 2 Duo | 8.11GB/s @ 2.67GHz | 90ns @ 2.67GHz | Its dual-core performance is clearly exemplary thanks to the large, shared L2 cache. Except for the very largest combination, all transfers use the shared L2 cache and don't need to touch main memory. Passing data between threads is extremely fast and latency is very low, with almost no penalty, unlike traditional SMP systems! |
Intel Core 2 Quad | 17.54GB/s @ 3GHz | 79ns @ 3GHz | Effectively 2 Core 2 Duos in 1 package; by carefully pairing the threads, the penalty of going off-chip between the two dies can be managed, with each of the 2 thread pairs using one of the shared L2 caches. Software using 2 or more threads needs to ensure that threads exchanging data are scheduled on the same die, while threads working on different data can run on different dies (see the affinity sketch below the table). |
Intel Pentium D | 700MB/s @ 2.67GHz | 265ns @ 2.67GHz | The first dual-core design with 2 processor dies in 1 package. Transferring data between threads through the shared FSB to/from main memory is very slow and completely dependent on memory bandwidth. Large L2 caches do not help here, except to buffer any common data that does not change, thus freeing the FSB for other transfers. This is how traditional SMP systems behaved, so it is no worse than a dual-CPU system, but no better either. |
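As the Core 2 Quad commentary notes, software can pin communicating threads to logical CPUs that share a cache. The Win32/C++ fragment below is a hedged sketch of such pinning: the masks 0x1 and 0x2 (logical CPUs 0 and 1) are an assumption and do not necessarily map to the same die on every processor, so real code should query the cache topology first, for example via GetLogicalProcessorInformation.

```cpp
#include <windows.h>

#include <thread>

// Pin the calling thread to the logical CPUs selected by 'mask'.
static void pin_current_thread(DWORD_PTR mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

int main() {
    // ASSUMPTION: logical CPUs 0 and 1 share an L2 cache (i.e. sit on the same
    // die). Verify with GetLogicalProcessorInformation before relying on this.
    std::thread producer([] {
        pin_current_thread(0x1);   // logical CPU 0
        // ... produce data into a shared buffer ...
    });
    std::thread consumer([] {
        pin_current_thread(0x2);   // logical CPU 1
        // ... consume the same buffer: the data stays in the shared L2 ...
    });
    producer.join();
    consumer.join();
    return 0;
}
```

The same mechanism works the other way round: threads that do not exchange data can be deliberately spread across dies so that each communicating pair gets its own shared cache.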
How does it work?
The benchmark has 2 stages of operation:
1. Scheduling: first, it quickly works out the best producer/consumer scheduling by trying out all the combinations of processor units – naturally, if you have just 2 units there is only 1 combination.
If you have, say, 4 processor units there are 6 combinations to test (0-1, 0-2, 0-3, 1-2, 1-3, 2-3); the benchmark then works out which are the fastest to use for the actual test. If you look at Task Manager, you can see the utilisation of the various processors going up and then down. (A minimal sketch of this pair probing appears after this list.)
Why are we doing this? Surely we know the best scheduling already, based on cores-per-package/SMTs-per-core/NUMA nodes – after all, Sandra has its own scheduler, which other benchmarks make use of to schedule their threads in the most efficient fashion? Yes, we could use it, but the data gathered by trying out the bad combinations is useful as well.
2. Testing: now that we have the optimum thread scheduling, the buffers are created, the producer/consumer chains are initialised and the testing begins. The following chain sizes x buffer sizes are tested: 2x 8kB, 4x 8kB, 2x 32kB, 4x 32kB, 16x 8kB, 2x 128kB, 16x 32kB, 64x 8kB, 16x 128kB, 64x 32kB, 64x 128kB. A combined value is worked out based on the individual results.
Different combinations of chain x block size transfer similar amounts of data overall; the variations show the difference between buffering the data and transferring one big block versus working on smaller blocks and transferring them as they are completed. (A minimal sketch of this transfer loop also appears after this list.)
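To illustrate stage 1, the sketch below enumerates every pair of processor units, runs a short ping-pong probe on each pair with both threads pinned via SetThreadAffinityMask, and keeps the fastest pair. The probe length and output are illustrative assumptions; this is a sketch of the idea, not Sandra's actual scheduling code.

```cpp
#include <windows.h>

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <utility>

// Approximate one-way latency (ns) of a flag handover between two threads
// pinned to logical CPUs 'a' and 'b'. The simple 1<<cpu masks assume at most
// 64 logical CPUs (one Windows processor group).
static double probe_pair_ns(unsigned a, unsigned b) {
    constexpr int kIters = 20000;
    std::atomic<int> turn{0};

    std::thread peer([&] {
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << b);
        for (int i = 0; i < kIters; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
            turn.store(0, std::memory_order_release);
        }
    });

    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << a);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        turn.store(1, std::memory_order_release);
        while (turn.load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    auto t1 = std::chrono::steady_clock::now();
    peer.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / kIters / 2.0;   // one round trip = two handovers
}

int main() {
    unsigned n = std::thread::hardware_concurrency();   // processor units
    std::pair<unsigned, unsigned> best{0, 1};
    double best_ns = 1e300;

    // n units give n*(n-1)/2 pairs, e.g. 6 for n = 4:
    // 0-1, 0-2, 0-3, 1-2, 1-3, 2-3.
    for (unsigned a = 0; a < n; ++a)
        for (unsigned b = a + 1; b < n; ++b) {
            double ns = probe_pair_ns(a, b);
            std::printf("pair %u-%u: ~%.0f ns\n", a, b, ns);
            if (ns < best_ns) { best_ns = ns; best = {a, b}; }
        }

    std::printf("fastest pair: %u-%u\n", best.first, best.second);
    return 0;
}
```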
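Stage 2 can be sketched as a ring (“chain”) of fixed-size blocks that a producer thread fills and a consumer thread drains; running it over the chain x block combinations listed above yields per-combination transfer rates which could then be combined into a single figure. This is a minimal sketch under simplifying assumptions (no thread pinning, naive per-block flags, arbitrary pass count), not Sandra's actual buffering or scoring.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <utility>
#include <vector>

// One producer fills a ring ("chain") of blocks, one consumer drains it;
// returns the transfer rate in MB/s for that chain length x block size.
static double run_chain_mbs(size_t chain, size_t block, size_t passes) {
    std::vector<std::vector<char>> buf(chain, std::vector<char>(block));
    std::vector<std::atomic<int>> ready(chain);           // 0 = free, 1 = full
    for (auto& r : ready) r.store(0);

    auto t0 = std::chrono::steady_clock::now();

    std::thread producer([&] {
        for (size_t p = 0; p < passes; ++p)
            for (size_t i = 0; i < chain; ++i) {
                while (ready[i].load(std::memory_order_acquire) != 0) { }  // wait until free
                std::memset(buf[i].data(), int(p & 0xff), block);          // "produce"
                ready[i].store(1, std::memory_order_release);
            }
    });

    volatile char sink = 0;                                // keep the reads alive
    for (size_t p = 0; p < passes; ++p)
        for (size_t i = 0; i < chain; ++i) {
            while (ready[i].load(std::memory_order_acquire) != 1) { }      // wait until full
            for (size_t k = 0; k < block; ++k) sink ^= buf[i][k];          // "consume"
            ready[i].store(0, std::memory_order_release);
        }
    producer.join();

    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    return double(passes) * chain * block / (1024.0 * 1024.0) / secs;
}

int main() {
    const size_t kB = 1024;
    // Chain length x block size combinations from the description above.
    const std::pair<size_t, size_t> combos[] = {
        {2, 8 * kB},  {4, 8 * kB},   {2, 32 * kB},  {4, 32 * kB},
        {16, 8 * kB}, {2, 128 * kB}, {16, 32 * kB}, {64, 8 * kB},
        {16, 128 * kB}, {64, 32 * kB}, {64, 128 * kB},
    };
    for (auto [chain, block] : combos)
        std::printf("%2zu x %3zukB: %8.1f MB/s\n",
                    chain, block / kB, run_chain_mbs(chain, block, 64));
    return 0;
}
```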
Technical Information
- Algorithm/paradigm: based on the producer/consumer paradigm using different chain sizes and transfer block sizes
- Systems supported: multi-core, SMP, SMT, SMP/SMT multi-core systems, NUMA systems
- Operating systems supported: native 32-bit, 64-bit ports; Windows XP/Server 2003/Vista/Server 2008
- Threading: as many threads as processor units are used
- Instruction Sets: SSE2 required
- Options: The operation is fully automatic; there are no user-configurable settings that affect benchmark operation.