Benchmarks : High-End Desktop Performance: AMD "Bulldozer" CPU/APU


What is it?

"Bulldozer" is the codename for a whole new generation of CPUs/APUs from AMD, the most important after K8's launch in 2003. It comprises three variants, two for the server market ("Interlagos" dual-chip and "Valencia" single-chip) and one for the desktops ("Zambezi"). There are APU versions (that include a GPU) and CPU only versions.

AMD took a modular approach with Bulldozer based on "Compute Units" (CUs): each CU has a shared L2 cache (up to 2MB), two 128-bit FMA-capable FPUs (unified as one 256-bit FPU for AVX) and two integer cores (each featuring 4 pipelines; the fetch/decode stage is shared). The CUs in turn share an L3 cache. The shared FPU could be an issue in 256-bit mode when using AVX/FMA and future instruction sets: there may not be much of a gain - if any - to make the effort worthwhile.

The Bulldozer "CU" design resembles Intel's Hyper-Threading; the main advantage of Bulldozer is that it provides each thread with dedicated schedulers and integer units, though the FPU is shared in 256-bit mode. Note: the Windows 7/Server 2008 R2 kernel does not yet schedule threads based on CU affinity, which is why Sandra uses "hard affinity" through its own scheduler.
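To illustrate why CU affinity matters, here is a minimal sketch of CU-aware thread placement. It is not Sandra's scheduler (the article does not describe it); the core numbering (logical cores 2k and 2k+1 belonging to CU k) and the function names are assumptions made for the example. The idea is simply to give every CU one thread before any CU receives a second, so that lightly threaded workloads do not share an FPU and front-end.

```python
# Assumed topology: logical cores 2k and 2k+1 belong to CU k.
# This numbering is hypothetical and only serves the illustration.
CORES_PER_CU = 2


def cu_aware_order(num_cus, cores_per_cu=CORES_PER_CU):
    """Return core indices so each CU gets one thread before any gets a second."""
    order = []
    for slot in range(cores_per_cu):      # first core of every CU, then the second
        for cu in range(num_cus):
            order.append(cu * cores_per_cu + slot)
    return order


def assign_threads(num_threads, num_cus):
    """Map thread index -> core index using the CU-aware order (wraps around)."""
    order = cu_aware_order(num_cus)
    return {t: order[t % len(order)] for t in range(num_threads)}


if __name__ == "__main__":
    # Four threads on a 4CU part land on four different CUs.
    print(assign_threads(4, 4))
```

Actually pinning a thread to the chosen core would need an operating system call (for example `SetThreadAffinityMask` on Windows); that part is left out of the sketch.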

It is a pity that the memory controller is only dual-channel (PC3-15000 native support) for the desktop part, whereas Intel already has a tri-channel controller and aims for quad-channel in future releases. However, Opteron will feature a quad-channel memory controller with support for PC3-12800 DDR3 Reg RAM.

  • AVX (Advanced Vector eXtensions) – a new instruction set with 256-bit width (double of SSE2/3/4) that greatly enhances performance of SIMD code. First implemented by Intel in Sandy Bridge, now launched in AMD chips.
  • FMA4 (4-operand fused multiply-add) - adding to AVX, allowing fast float multiply-add operations. Here AMD is ahead of the competition; however, Intel will add FMA3 in its future chips, which won't be compatible despite having the same functionality! Not since the SSE vs. 3D Now! SIMD instruction wars have we had such an issue.
  • XOP (former SSE5) - a set of instructions not covered by AVX but part of the original SSE5 AMD specification. Some may be supported by AVX2, until then they remain AMD only.
  • AES acceleration, SHLD (shift left double - used for SHA hashing) and ADC (add with carry) greatly improve cryptographic performance in the most popular algorithms today: AES for encryption/decryption (AES256/AES128) and SHA for hashing/signing (SHA256/SHA1). Crypto has long been mainstream and hardware acceleration removes the performance penalties slower CPUs suffered from.
  • "GPAPU" is General Processing (GP) using both the GPU (GPGPU) and the CPU (GPCPU) concurrently. We have added support for APUs to all our GP benchmarks (Processing, Cryptography, Bandwidth) in Sandra 2011.
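As a side note on what "fused" means in FMA: the multiply and the add are performed with a single rounding at the end, rather than rounding after each step. The sketch below emulates that in software with exact rational arithmetic; it says nothing about how the hardware implements it, and the function names are made up for the example.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """a*b + c with two roundings: one after the multiply, one after the add."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
    # so the unfused version loses the small term entirely.
    print(unfused_multiply_add(a, b, c))
    # The fused version keeps it: the result is -2**-60.
    print(fused_multiply_add(a, b, c))
```

FMA4 and FMA3 both deliver this single-rounding behaviour; they differ in how many operands the instruction encodes, not in the arithmetic.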

Hardware Specifications

We are comparing an AMD "Zambezi" chip against Intel's "Sandy Bridge".

CPU Specs. | AMD "Zambezi" | Intel "Sandy Bridge"
Speed - Turbo | 2800MHz (28x) - 3800MHz (38x) | 3000MHz (30x) - 3600MHz (36x)
CU / Cores / Threads | 4CU / 8C / 8T | 4C / 8T
Caches L1/L2/L3 | 8x16kB / 4x2MB / 8MB | 4x32kB / 4x256kB / 6MB
Power (TDP) | 95W | 95W
Cost (USD) | |
Memory | 2x DDR3 PC3-10700 | 2x DDR3 PC3-10700
Speed/Timing | 2x 667MHz (1333MHz) 9-9-9-24 4-33-10-5 | 2x 667MHz (1333MHz) 9-9-9-25 4-34-10-5
Memory Controller Speed | 1100MHz (11x) | 3000MHz (30x)

Standard Processing Performance

We are testing native and software VM processing performance of the processors themselves, i.e. not using (GP)GPU - aka "traditional" benchmarking.

Results Interpretation: Higher values (GOPS) mean better performance.

Base 10 Multipliers: 1GOPS = 1000MOPS, 1MOPS = 1000kOPS, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

CPU Processing Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Native Dhrystone/Whetstone (GOPS) | 54.64 (34% lower) | 85.15 (baseline) | The native CPU benchmark reveals that Intel has done a very good job with "Sandy Bridge", and it is hard for AMD, even with a fresh design, to keep up with its very latest. Had it arrived earlier, it could have beaten the older "Lynnfield/Nehalem" (Core gen 1) versions.
Java Dhrystone/Whetstone (GOPS) | 37 (49% lower) | 72.7 (baseline) | Java VM is not Zambezi's strong point: it scores almost 50% lower! Considering Sun Microsystems (now part of Oracle) sold AMD Opteron servers, you would expect the JVM to be tuned for AMD.
.Net Dhrystone/Whetstone (GOPS) | 17.66 (33% lower) | 26.55 (baseline) | In the .Net environment AMD catches up a little, but it is still not up to par.

Multi-Media Performance

We test raw native multi-media (SIMD) performance using any of the supported instruction sets (FMA, AVX, SSE, etc.) against GPAPU/GPGPU using any of the supported interfaces (OpenCL, DirectX ComputeShader/DirectCompute, CUDA).

Results Interpretation: Higher values (MPix/s) mean better performance.

Base 10 Multipliers: 1MPix/s = 1000kPix/s, 1kPix/s = 1000Pix/s, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

Multi-Media Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
CPU Multi-Media SSE/128-bit (MPix/s) | 132.3 (15% lower) | 155 (baseline) | AMD CPUs were always good performers in Multi-Media tasks and this is a respectable result. Let's hope that the gap will not be wider when "Ivy Bridge" launches (Core gen 3).
CPU Multi-Media AVX/256-bit (MPix/s) | 147.4 (33% lower) | 217.5 (baseline) | Using AVX, "Sandy Bridge" is ~40% faster than with SSE(x) (217.5 vs. 155) while Zambezi does not improve much (most likely due to the shared FPU within CUs); the performance gap has thus doubled. At least you don't need to pay to upgrade your software to AVX.
GPU Multi-Media (MPix/s) | - | 20.13 | A pretty woeful result from Sandy Bridge's so-called GPU; any GPU AMD includes in the APU version should blow this away (Llano's GPU certainly does).
GPCPU OpenCL (MPix/s) | 46 (17% lower) | 55.6 (baseline) | A decent result from the Zambezi chip; however, a difference that cannot be ignored remains between the two.
Java Multi-Media (MPix/s) | 22.86 (10% lower) | 25.32 (baseline) | Java VM is a Multi-Media scenario that is quite favourable to the AMD CPU, and with optimisations it could overtake its competition as the gap is small.
.Net Multi-Media (MPix/s) | 15.73 (22% lower) | 20.15 (baseline) | The .Net environment is not as kind to the AMD CPU: we're seeing twice the gap of Java.

Cryptographic Performance

We test cryptographic performance of the strongest common algorithms (AES256, SHA256) using any of the supported instruction sets (AVX, SSE, etc.) against GPAPU/GPGPU using any of the supported interfaces (OpenCL, DirectX ComputeShader/DirectCompute, CUDA).

Results Interpretation: Higher values (MB/s) mean better performance.

Base 2 Multipliers: 1MB/s = 1024kB/s, 1kB/s = 1024bytes/s, 1byte = 8bits, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

Cryptography Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Crypto (MB/s) | 925 (16% higher) | 797 (baseline) | In ALU mode (no AES and no AVX) the AMD chip scores its first win by a good margin (16% faster)! Finally the "true core" design shines through!
Crypto AES/AVX (MB/s) | 1277 (44% lower) | 2270 (baseline) | Despite having both AES and AVX support, Zambezi is little more than half as fast; looking at the individual results, its AVX hashing performance seems to be the issue (337MB/s vs. 943MB/s) - again most likely due to its shared FPU design. AES performance is competitive.
GPGPU Crypto (MB/s) | - | 417 (baseline) | The integrated GPU in Sandy Bridge only manages 50% of non-accelerated CPU performance and 20% of accelerated AES/AVX performance. Why bother? Again, just about any GPU in the APU Zambezi will decimate it.
GPCPU OpenCL Crypto (MB/s) | 583 (18% higher) | 494 (baseline) | In OpenCL AMD takes the lead by almost 20%, though to be fair it is AMD's OpenCL run-time and they are not going to optimise for the competition.
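The "% higher/lower" figures quoted against the baseline can be reproduced with a one-line calculation. The sketch below uses the cryptography scores from the table; the function name and the rounding to whole percent are choices made for this example.

```python
def relative_to_baseline(value, baseline):
    """Signed difference of value versus baseline, in whole percent."""
    return round((value / baseline - 1.0) * 100)


# (AMD "Zambezi", Intel "Sandy Bridge") scores in MB/s, taken from the table.
crypto_results = {
    "Crypto": (925, 797),
    "Crypto AES/AVX": (1277, 2270),
    "GPCPU OpenCL Crypto": (583, 494),
}

if __name__ == "__main__":
    for name, (amd, intel) in crypto_results.items():
        delta = relative_to_baseline(amd, intel)
        direction = "higher" if delta > 0 else "lower"
        print(f"{name}: {abs(delta)}% {direction}")
```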

Memory Bandwidth Performance

Here we test the memory controller of the CPU with the same memory modules for objective comparison.

Results Interpretation: Higher values (MB/s) mean better performance.

Base 2 Multipliers: 1MB/s = 1024kB/s, 1kB/s = 1024bytes/s, 1byte = 8bits, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

Memory Bandwidth Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Memory Bandwidth (GB/s) | 15.33 (13% lower) | 17.58 (baseline) | While the memory controller in the AMD CPU runs at barely over a third of the speed (1100MHz vs. 3000MHz), bandwidth is only ~13% lower, so it looks to be more efficient. If only it could squeeze more bandwidth out of the memory; at least it does better than previous Core CPUs (Lynnfield/Nehalem 2 channel).
Memory Bandwidth AVX (GB/s) | 15.41 (12% lower) | 17.56 (baseline) | With AVX, Zambezi improves a bit, though not enough to be noticeable.
Memory Bandwidth AVX, Internal GPU enabled (GB/s) | - | 17.3 (baseline) | Unlike many designs, Sandy Bridge does not lose performance when the GPU is enabled. We shall have to see how the AMD APU behaves.
GPCPU OpenCL Memory Bandwidth (GB/s) | 8.44 (34% lower) | 12.73 (baseline) | Even though we're using AMD's own OpenCL runtime, memory transfers are about 35% slower, roughly 3x the gap we saw in the native memory bandwidth benchmarks. It seems AMD's OpenCL team has some optimisations to do for Zambezi!
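To put the measured figures in context, they can be compared with the theoretical peak of the memory configuration used. The sketch below assumes standard 64-bit (8-byte) DDR3 channels and the 1333 MT/s dual-channel setup from the specifications, and it assumes the reported GB/s use the base 2 multipliers stated above; the function names are illustrative.

```python
BYTES_PER_TRANSFER = 8   # assumed 64-bit DDR3 channel
GIB = 1024 ** 3          # base 2 multiplier, as stated in the article


def peak_bandwidth_gib(mega_transfers_per_sec, channels):
    """Theoretical peak bandwidth in base 2 GB/s."""
    bytes_per_sec = mega_transfers_per_sec * 1_000_000 * BYTES_PER_TRANSFER * channels
    return bytes_per_sec / GIB


def controller_efficiency(measured_gib, mega_transfers_per_sec=1333, channels=2):
    """Measured bandwidth as a fraction of the theoretical peak."""
    return measured_gib / peak_bandwidth_gib(mega_transfers_per_sec, channels)


if __name__ == "__main__":
    print(f"peak: {peak_bandwidth_gib(1333, 2):.2f} GB/s")
    print(f"Sandy Bridge: {controller_efficiency(17.58):.1%}")
    print(f"Zambezi: {controller_efficiency(15.33):.1%}")
```

Under these assumptions the measured results work out to roughly 88% of peak for Sandy Bridge and 77% for Zambezi.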

Transcoding Performance

This tests the transcoding performance of the CPU alongside the GPU. These days there are many audio and video formats, so hardware-accelerated transcoding as implemented in GPUs is very useful.

Results Interpretation: Higher values (kB/s) mean better performance.

Base 2 Multipliers: 1MB/s = 1024kB/s, 1kB/s = 1024bytes/s, 1byte = 8bits, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

Transcoding Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
CPU Transcode (kB/s) | 657 (22% lower) | 837 (baseline) | Even with its 8 real cores and improved multi-threading, Zambezi still cannot match Sandy Bridge even in software mode.
GPU Transcode (kB/s) | - | 4800 (baseline) | It is not fair to compare the transcoding capabilities found in a modern GPU with those of a CPU, but we cannot help it: Intel's QuickSync is a "killer feature" even for high-end CPUs. You'd need 6 times more cores (24 = 4x6) to match the performance in software! Zambezi would need ~8 times more cores (64 = 8x8), so keep an eye out for that 8-way Opteron system.
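The core counts quoted for matching QuickSync in software follow from dividing the GPU rate by each CPU's software rate and rounding up to whole CPUs. A small sketch of that arithmetic, assuming (optimistically) perfectly linear scaling with core count; the function name is made up for the example:

```python
import math


def cores_to_match(gpu_rate, cpu_rate, cpu_cores):
    """Return (whole CPUs needed, total cores) to reach gpu_rate in software.

    Assumes throughput scales linearly with the number of cores.
    """
    multiples = math.ceil(gpu_rate / cpu_rate)
    return multiples, multiples * cpu_cores


if __name__ == "__main__":
    # Rates in kB/s from the table above.
    print(cores_to_match(4800, 837, 4))  # Sandy Bridge, 4 cores
    print(cores_to_match(4800, 657, 8))  # Zambezi, 8 cores
```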

Cache and Memory Performance

The internal caches are being tested with this benchmark to see if the "Bulldozer" design makes a difference.

Results Interpretation: Higher values (MB/s) mean better performance.

Base 2 Multipliers: 1MB/s = 1024kB/s, 1kB/s = 1024bytes/s, 1byte = 8bits, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

Cache and Memory Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Cache & Memory SSE-128 (GB/s) | 66.45 (25% lower) | 88.7 (baseline) | The new approach with the shared L2 cache in Zambezi may not be the best, given the bandwidth measured.
Cache & Memory AVX-256 (GB/s) | 69.7 (25% lower) | 93.16 (baseline) | We see the same difference in bandwidth when the AVX instructions are used; both systems improve marginally by using AVX.
Cache & Memory int. GPU AVX-256 (GB/s) | - | 88.63 (baseline) | It remains to be seen how much the APU version of Zambezi is affected when the internal GPU is enabled; Sandy Bridge is not affected much - unlike older systems with shared graphics.

Latency Performance

This benchmark measures the response time of the processors' caches and memory. Cache latency is usually measured in processor clocks, as it depends on the processor clock speed.

Results Interpretation: Lower values (ns) mean better performance.

Base 10 Multipliers: 1sec = 1000ms, 1ms = 1000us, 1us = 1000ns, etc.

Environment: Windows 7 x64 SP1, latest AMD Catalyst and Intel drivers.

Latency Benchmarks | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Linear (ns) | 12.3 (73% higher) | 7.1 (baseline) | Zambezi's result is not satisfactory in latency terms: another item for the list of improvements needed in future revisions, or perhaps a BIOS optimisation is enough.
Random (ns) | 98.4 (34% higher) | 73.5 (baseline) | Zambezi fares better in the random latency test, but the result is still too high considering we use the same memory speed/timings. The memory controller needs some optimisations to be competitive.
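Since latency can be quoted either in nanoseconds or in processor clocks, a conversion between the two is handy when comparing chips running at different speeds. The sketch below assumes each CPU is running at its base clock from the specifications (with Turbo active the clock counts would be higher); the function name is illustrative.

```python
def ns_to_clocks(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to processor clocks at the given speed."""
    return latency_ns * clock_mhz / 1000.0


if __name__ == "__main__":
    # Linear latency results, at the assumed base clocks.
    print(f"Zambezi: {ns_to_clocks(12.3, 2800):.1f} clocks")
    print(f"Sandy Bridge: {ns_to_clocks(7.1, 3000):.1f} clocks")
```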

Efficiencies

Because not all things in life are evaluated to their true value, the next measurements will take into consideration various efficiency aspects:

Power Efficiency (this measures the efficiency of power design, or TDP) | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Standard Performance vs. Power | 54.64GOPS, 95W, 59.52MOPS/W (34% lower) | 85.15GOPS, 95W, 89.63MOPS/W | The performance gap in the native processing benchmark is reflected here as both chips have the same TDP, with Zambezi fielding 4 CUs with 8 cores against Sandy Bridge's 4 cores and 8 threads. At least Zambezi is more efficient than previous Core CPUs ("Nehalem"/"Westmere" - these are ~40-50% less efficient than "Sandy Bridge" while Zambezi is "only" 34% less efficient).
Multi-Media Performance vs. Power | 147Mpix/s, 95W, 1.55Mpix/W (32% lower) | 217Mpix/s, 95W, 2.28Mpix/W | Similar difference in AVX mode, but in SSE(x)/128-bit mode it's "only" 15% less efficient; so for current and older software the gap is not significant. It also beats previous Core CPUs as they don't support AVX anyway.
Cryptographic Performance vs. Power | 1277MB/s, 95W, 13.4MB/W (44% lower) | 2270MB/s, 95W, 23.9MB/W | The highest difference yet, even though Zambezi has 8 real cores, not 4 like Sandy Bridge (and AES & AVX support).
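The per-watt figures are simply the benchmark score divided by the TDP. A short sketch using the multi-media and cryptography rows; the function name is an assumption for the example, and TDP is a rated value rather than measured power draw.

```python
def per_watt(score, tdp_watts):
    """Benchmark score per watt of rated TDP."""
    return score / tdp_watts


if __name__ == "__main__":
    tdp = 95  # both chips are rated at 95W
    print(f"Multi-Media: {per_watt(147, tdp):.2f} vs {per_watt(217, tdp):.2f} Mpix/W")
    print(f"Crypto: {per_watt(1277, tdp):.1f} vs {per_watt(2270, tdp):.1f} MB/W")
```

Because both chips share the same TDP, the relative gap per watt is the same as the gap in the raw scores.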

Note: Turbo was enabled on all processors; thus each processor was free to Turbo or not.

Speed Efficiency (how performance scales with speed and how they perform at the same speed) | AMD "Zambezi" | Intel "Sandy Bridge" | Comments
Standard Performance vs. Speed | 54.64GOPS, 2800MHz-3800MHz Turbo, 2.02MOPS/MHz (29% lower) | 85.15GOPS, 3000MHz-3600MHz Turbo, 2.84MOPS/MHz | Similar to the power efficiency, Zambezi is 30% less efficient clock for clock; this means it is less efficient than even previous Core CPUs (which are only ~10-15% less efficient than Sandy Bridge).
Multi-Media Performance vs. Speed | 147Mpix/s, 2800MHz-3800MHz Turbo, 53kpix/MHz (26% lower) | 217Mpix/s, 3000MHz-3600MHz Turbo, 72kpix/MHz | Similar to the power efficiency result, Zambezi is 26% less efficient in AVX mode clock for clock.
Cryptographic Performance vs. Speed | 1277MB/s, 2800MHz-3800MHz Turbo, 0.45MB/MHz (40% lower) | 2270MB/s, 3000MHz-3600MHz Turbo, 0.75MB/MHz | The highest difference of all tests; 40% less is pretty significant clock for clock.
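The clock-for-clock comparison divides each score by a clock speed before comparing. The sketch below assumes the base clocks are the ones used, which is consistent with the cryptography row; since Turbo was enabled, the effective clocks during the run may have been higher, and the last digit of the recomputed per-MHz values may differ from the table depending on rounding. The function names are illustrative.

```python
def per_mhz(score, clock_mhz):
    """Benchmark score normalised by clock speed."""
    return score / clock_mhz


def clock_for_clock_gap(score_a, mhz_a, score_b, mhz_b):
    """Signed percentage difference of A versus B after normalising by clock."""
    return round((per_mhz(score_a, mhz_a) / per_mhz(score_b, mhz_b) - 1.0) * 100)


if __name__ == "__main__":
    # Cryptography AES/AVX scores (MB/s) at the assumed base clocks.
    print(f"Zambezi: {per_mhz(1277, 2800):.3f} MB/MHz")
    print(f"Sandy Bridge: {per_mhz(2270, 3000):.3f} MB/MHz")
    print(f"gap: {clock_for_clock_gap(1277, 2800, 2270, 3000)}%")
```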

Final Thoughts / Conclusions

Looking back at all the benchmarks of the new CPU, it is obvious that AMD is a generation behind Intel performance-wise. The "all new" design is better than the K8 variants, and in some results (e.g. Cryptography, Multi-Media) it comes close to or beats the competition, but the other results are disappointing. 256-bit performance (AVX/FMA) - what modern software will use - is "slow", most likely due to the shared FPU within a CU; the competition saw great gains from AVX, but this is not the case here.

It will be challenging for AMD to compete in the "enthusiast market" with a CPU with (only) dual-channel memory against tri-channel; maybe that's why Intel has not bothered to upgrade the 3-year-old X58, but when it does (X79) the gap will widen even more.

For AMD fans the upgrade is definitely worth it, especially if they already use a socket AM3+ motherboard; those with socket AM3 boards will need to check whether a BIOS update is enough to ensure compatibility. In any case, the CPU is much faster than the mainstream "Llano" (see AMD Desktop "Llano" APU (CPU+GPU)).

Pricing is currently the main unknown, but AMD has not been as hungry for cash - at least from an upgrade point of view (you still need a modern board).
