Q & A : Benchmarks : Multi-Core Efficiency
What is it?
A benchmark specifically designed to measure the efficiency of multi-core processors across their different architectures, as well as to compare their performance to traditional multi-processor (SMP) systems.
Performance-measuring benchmarks do not show at a glance the differences between multi-core processors, nor how multi-threaded programs should be designed to best take advantage of the underlying architecture while avoiding the “gotchas”.
This benchmark does not test processor unit computational performance, i.e. how fast the cores of the processors are; it tests only how fast the connections between them are (inter-core bandwidth and latency).
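To make “the connection between them” concrete: inter-core latency is typically estimated by ping-ponging a flag between two threads and timing the round trips. The minimal C++ sketch below illustrates only that idea – it is not Sandra's implementation, the iteration count is arbitrary, and thread placement is left to the OS scheduler here (real code would pin each thread to a specific processor unit).

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    constexpr int kIters = 100000;
    std::atomic<int> turn{0};          // 0: main's turn, 1: peer's turn

    // Peer thread: waits for the flag, then hands it straight back.
    std::thread peer([&] {
        for (int i = 0; i < kIters; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
            turn.store(0, std::memory_order_release);
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        turn.store(1, std::memory_order_release);               // ping
        while (turn.load(std::memory_order_acquire) != 0) { }   // wait for pong
    }
    auto t1 = std::chrono::steady_clock::now();
    peer.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    // Each iteration is one round trip, i.e. two cache-line handovers.
    std::printf("approx. one-way inter-core latency: %.1f ns\n", ns / kIters / 2.0);
    return 0;
}
```

Inter-core bandwidth is measured in the same spirit, but by streaming blocks of data from a producer thread to a consumer thread rather than bouncing a single flag.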
Why do we measure it?
There are many multi-core processors available, so much so that traditional single-core processors are disappearing; even (some) traditional multi-processor systems are being replaced by a single multi-core processor. However, not all are made equal, as their architectures differ greatly.
For example, which is faster: a new, single dual-core processor, or a traditional dual-processor system built from single-core processors?
Typical Results from Processors on the Market
Testing various current processors or just checking out the reference results makes the differences in architectures and implementations very clear. Let’s see a few examples:
Processor | Inter-core Bandwidth | Inter-core Latency | Commentary |
AMD Athlon X2 | 2.88GB/s @ 2.6GHz | 98ns @ 2.6GHz | The integrated memory controller and SRI/Crossbar interface give it very low inter-core latency; however, the lack of a shared L2 cache means the inter-core bandwidth cannot match designs that share a cache. Still, passing data between threads is faster than on a comparable SMP system. |
AMD Phenom X4 | 3.93GB/s @ 2.4GHz | 159ns | With a shared L3 cache, inter-core bandwidth for small/medium transfers is high with low latency, while larger transfers still benefit from the built-in memory controllers. Its bandwidth efficiency is also higher, resulting in higher achievable bandwidth. |
Intel Core Duo | 2.9GB/s @ 1.67GHz | 160ns @ 1.67GHz | The first design with a shared L2 cache; the results are very good, though the L2 cache is smaller and the bandwidth lower than on newer Core 2 processors. |
Intel Core 2 Duo | 8.11GB/s @ 2.67GHz | 90ns @ 2.67GHz | Its dual-core performance is clearly exemplary thanks to the large, shared L2 cache. Except for the very largest combination, all transfers use the shared L2 cache and don't need to touch main memory. Passing data between threads is extremely fast and latency is very low, with almost no penalty, unlike traditional SMP systems! |
Intel Core 2 Quad | 17.54GB/s @ 3GHz | 79ns @ 3GHz | Effectively 2 Core 2 Duos in 1 package; by carefully pairing the threads, the penalty of going off-chip between the two dies can be managed, with each of the 2 thread pairs using one of the shared L2 caches. Software using 2 or more threads needs to ensure that threads exchanging data are scheduled on the same die, while threads working on different data can run on different dies (see the affinity sketch below the table). |
Intel Pentium D | 700MB/s @ 2.67GHz | 265ns @ 2.67GHz | The first dual-core design with 2 processor dies in 1 package. Transferring data between threads through the shared FSB to/from main memory is very slow and completely dependent on memory bandwidth. Large L2 caches do not help here, except to buffer any common data that does not change, thus freeing the FSB for other transfers. This is how traditional SMP systems behaved, so it is no worse than a dual-CPU system, but no better either. |
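As the Core 2 Quad commentary notes, software can pin communicating threads to logical CPUs that share a cache. The Win32/C++ fragment below is a hedged sketch of such pinning: the masks 0x1 and 0x2 (logical CPUs 0 and 1) are an assumption and do not necessarily map to the same die on every processor, so real code should query the cache topology first, for example via GetLogicalProcessorInformation.

```cpp
#include <windows.h>

#include <thread>

// Pin the calling thread to the logical CPUs selected by 'mask'.
static void pin_current_thread(DWORD_PTR mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

int main() {
    // ASSUMPTION: logical CPUs 0 and 1 share an L2 cache (i.e. sit on the same
    // die). Verify with GetLogicalProcessorInformation before relying on this.
    std::thread producer([] {
        pin_current_thread(0x1);   // logical CPU 0
        // ... produce data into a shared buffer ...
    });
    std::thread consumer([] {
        pin_current_thread(0x2);   // logical CPU 1
        // ... consume the same buffer: the data stays in the shared L2 ...
    });
    producer.join();
    consumer.join();
    return 0;
}
```

The same mechanism works the other way round: threads that do not exchange data can be deliberately spread across dies so that each communicating pair gets its own shared cache.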
How does it work?
The benchmark has 2 stages of operation:
1. Scheduling: first, it quickly works out the best producer/consumer scheduling by trying out all the combinations of processor units – naturally, if you have just 2 units there is only 1 combination.
If you have, say, 4 processor units there are 6 combinations to test (0-1, 0-2, 0-3, 1-2, 1-3, 2-3); the benchmark then works out which are the fastest to use for the actual test. If you look at Task Manager, you can see the utilisation of the various processors going up and then down. (A minimal sketch of this pair probing appears after this list.)
Why are we doing this? Surely we know the best scheduling already, based on cores-per-package/SMTs-per-core/NUMA nodes – after all, Sandra has its own scheduler, which other benchmarks make use of to schedule their threads in the most efficient fashion? Yes, we could use it, but the data gathered by trying out the bad combinations is useful as well.
2. Testing: now that we have the optimum thread scheduling, the buffers are created, the producer/consumer chains are initialised and the testing begins. The following chain sizes x buffer sizes are tested: 2x 8kB, 4x 8kB, 2x 32kB, 4x 32kB, 16x 8kB, 2x 128kB, 16x 32kB, 64x 8kB, 16x 128kB, 64x 32kB, 64x 128kB. A combined value is worked out based on the individual results.
Different combinations of chain x block size transfer similar amounts of data overall; the variations show the difference between buffering the data and transferring one big block versus working on smaller blocks and transferring them as they are completed. (A minimal sketch of this transfer loop also appears after this list.)
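To illustrate stage 1, the sketch below enumerates every pair of processor units, runs a short ping-pong probe on each pair with both threads pinned via SetThreadAffinityMask, and keeps the fastest pair. The probe length and output are illustrative assumptions; this is a sketch of the idea, not Sandra's actual scheduling code.

```cpp
#include <windows.h>

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <utility>

// Approximate one-way latency (ns) of a flag handover between two threads
// pinned to logical CPUs 'a' and 'b'. The simple 1<<cpu masks assume at most
// 64 logical CPUs (one Windows processor group).
static double probe_pair_ns(unsigned a, unsigned b) {
    constexpr int kIters = 20000;
    std::atomic<int> turn{0};

    std::thread peer([&] {
        SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << b);
        for (int i = 0; i < kIters; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) { /* spin */ }
            turn.store(0, std::memory_order_release);
        }
    });

    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << a);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        turn.store(1, std::memory_order_release);
        while (turn.load(std::memory_order_acquire) != 0) { /* spin */ }
    }
    auto t1 = std::chrono::steady_clock::now();
    peer.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / kIters / 2.0;   // one round trip = two handovers
}

int main() {
    unsigned n = std::thread::hardware_concurrency();   // processor units
    std::pair<unsigned, unsigned> best{0, 1};
    double best_ns = 1e300;

    // n units give n*(n-1)/2 pairs, e.g. 6 for n = 4:
    // 0-1, 0-2, 0-3, 1-2, 1-3, 2-3.
    for (unsigned a = 0; a < n; ++a)
        for (unsigned b = a + 1; b < n; ++b) {
            double ns = probe_pair_ns(a, b);
            std::printf("pair %u-%u: ~%.0f ns\n", a, b, ns);
            if (ns < best_ns) { best_ns = ns; best = {a, b}; }
        }

    std::printf("fastest pair: %u-%u\n", best.first, best.second);
    return 0;
}
```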
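Stage 2 can be sketched as a ring (“chain”) of fixed-size blocks that a producer thread fills and a consumer thread drains; running it over the chain x block combinations listed above yields per-combination transfer rates which could then be combined into a single figure. This is a minimal sketch under simplifying assumptions (no thread pinning, naive per-block flags, arbitrary pass count), not Sandra's actual buffering or scoring.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <thread>
#include <utility>
#include <vector>

// One producer fills a ring ("chain") of blocks, one consumer drains it;
// returns the transfer rate in MB/s for that chain length x block size.
static double run_chain_mbs(size_t chain, size_t block, size_t passes) {
    std::vector<std::vector<char>> buf(chain, std::vector<char>(block));
    std::vector<std::atomic<int>> ready(chain);           // 0 = free, 1 = full
    for (auto& r : ready) r.store(0);

    auto t0 = std::chrono::steady_clock::now();

    std::thread producer([&] {
        for (size_t p = 0; p < passes; ++p)
            for (size_t i = 0; i < chain; ++i) {
                while (ready[i].load(std::memory_order_acquire) != 0) { }  // wait until free
                std::memset(buf[i].data(), int(p & 0xff), block);          // "produce"
                ready[i].store(1, std::memory_order_release);
            }
    });

    volatile char sink = 0;                                // keep the reads alive
    for (size_t p = 0; p < passes; ++p)
        for (size_t i = 0; i < chain; ++i) {
            while (ready[i].load(std::memory_order_acquire) != 1) { }      // wait until full
            for (size_t k = 0; k < block; ++k) sink ^= buf[i][k];          // "consume"
            ready[i].store(0, std::memory_order_release);
        }
    producer.join();

    double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    return double(passes) * chain * block / (1024.0 * 1024.0) / secs;
}

int main() {
    const size_t kB = 1024;
    // Chain length x block size combinations from the description above.
    const std::pair<size_t, size_t> combos[] = {
        {2, 8 * kB},  {4, 8 * kB},   {2, 32 * kB},  {4, 32 * kB},
        {16, 8 * kB}, {2, 128 * kB}, {16, 32 * kB}, {64, 8 * kB},
        {16, 128 * kB}, {64, 32 * kB}, {64, 128 * kB},
    };
    for (auto [chain, block] : combos)
        std::printf("%2zu x %3zukB: %8.1f MB/s\n",
                    chain, block / kB, run_chain_mbs(chain, block, 64));
    return 0;
}
```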
Technical Information
- Algorithm/paradigm: based on the producer/consumer paradigm using different chain sizes and transfer block sizes
- Systems supported: multi-core, SMP, SMT, SMP/SMT multi-core systems, NUMA systems
- Operating systems supported: native 32-bit, 64-bit ports; Windows XP/Server 2003/Vista/Server 2008
- Threading: as many threads as processor units are used
- Instruction Sets: SSE2 required
- Options: The operation is fully automatic; there are no user-configurable settings that affect benchmark operation.