Benchmarks : Intel Mobile Haswell (CrystalWell): Memory Sub-System


What is Haswell?

Haswell (HSW) is the next-generation Core "APU" from Intel (v4, 2013), replacing the current Ivy Bridge (IVB) (v3, 2012) line-up in both desktop and mobile platforms. While built on the same 22nm process as Ivy, it is a major architecture refresh and introduces new instruction sets (AVX2, FMA3) and features. It is not socket compatible (LGA 1150 on desktop), so existing systems cannot be upgraded.

In this article we test memory sub-system performance; please see our other articles for the rest of the Haswell series.

Hardware Specifications

We are comparing three processors with similar core and Turbo speeds - Turbo is enabled - for a "clock-for-clock" comparison. New models may run at higher (or lower) frequencies than the ones they replace, so the performance delta can vary.

Processor (CPU) Specifications | Intel Haswell (CrystalWell) Mobile | Intel Ivy Bridge Mobile | Intel Sandy Bridge Mobile | Comments
L0 Instruction/Code Cache | 4x 1.5k uop (~6kB) | 4x 1.5k uop (~6kB) | 4x 1.5k uop (~6kB) | No change.
L1 Data Cache | 4x 32kB 8-way 512-bit width | 4x 32kB 8-way 256-bit width | 4x 32kB 8-way 256-bit width | Same size/ways but the L1D is twice as wide - double-width data ports. Bandwidth should also double, especially using AVX2.
L1 Instruction/Code Cache | 4x 32kB 8-way 128-bit width | 4x 32kB 8-way 128-bit width | 4x 32kB 8-way 128-bit width | No change.
L2 Shared Cache | 4x 256kB 8-way 256-bit width? | 4x 256kB 8-way 128-bit width | 4x 256kB 8-way 128-bit width | Same size/ways but the L2 is twice as wide - double data ports. Bandwidth should also double, especially using AVX2.
L3 Shared Cache | 6MB 12-way (fully inclusive) 128-bit width | 6MB 12-way (fully inclusive) 128-bit width | 6MB 12-way (fully inclusive) 128-bit width | Same size, ways and width. However, unlike Ivy/Sandy, the memory controller/ring-bus/L3 can run at a higher speed than the core clock - which should especially improve GP(GPU) bandwidth.
L4/eDRAM Package Cache | 128MB 16-way 1.2GHz eDRAM 128-bit width? | none | none | Unfortunately only the CrystalWell variant has the L4/eDRAM, and here we test the highest-capacity (128MB) version. While running at only 1.2GHz, its huge size should help many algorithms, both CPU and (GP)GPU.
1st Level Data TLB | 64 4-way | 64 4-way | 64 4-way | No change in the 1st-level data TLB.
1st Level Code TLB | 64 8-way | 64 4-way | 64 4-way | The 1st-level code TLB is now 8-way.
2nd Level Shared TLB | 1024 8-way | 512 4-way | 512 4-way | The 2nd-level shared TLB is now double the size and 8-way, which should improve memory latencies when using many pages, especially small (4kB) ones. Few programs use large (2MB) or huge (1GB) pages even today (see the sketch after this table).
Memory Size / Type / Speed / Width / Timings | 2x 4GB DDR3 1.6GHz 128-bit 11-11-11-28 5-39-12-6 | 2x 4GB DDR3 1.6GHz 128-bit 11-11-11-28 5-39-12-6 | 2x 4GB DDR3 1.6GHz 128-bit 11-11-11-28 5-39-12-6 | Same memory size, type, width and timings.
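As noted in the TLB row above, few programs use large (2MB) pages. A minimal sketch of how a Windows program can request them is shown below - illustrative only, not Sandra's code; it assumes the process has been granted the "Lock pages in memory" (SeLockMemoryPrivilege) right, without which the allocation fails.

```c
/* Minimal sketch: allocating a buffer backed by 2MB "large" pages on
 * Windows, which cuts TLB pressure for big working sets. Assumes the
 * process holds SeLockMemoryPrivilege; otherwise VirtualAlloc fails
 * with ERROR_PRIVILEGE_NOT_HELD. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T large = GetLargePageMinimum();   /* typically 2MB on x64 */
    if (large == 0) {
        fprintf(stderr, "Large pages not supported\n");
        return 1;
    }

    SIZE_T size = 64 * large;               /* must be a multiple of the large-page size */
    void *buf = VirtualAlloc(NULL, size,
                             MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                             PAGE_READWRITE);
    if (!buf) {
        fprintf(stderr, "VirtualAlloc(MEM_LARGE_PAGES) failed: %lu\n",
                GetLastError());
        return 1;
    }

    /* ... benchmark against the same allocation without MEM_LARGE_PAGES
       to see the TLB effect ... */
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}
```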


Cache and Memory Bandwidth

We are testing native cache and memory bandwidth using the highest-performing instruction sets (AVX2, AVX, etc.). Haswell introduces AVX2, which allows 256-bit integer SIMD (AVX only allowed 128-bit), and FMA3 - finally bringing "fused multiply-add" to Intel CPUs. CPUs are now getting as wide as GPUs - not far from the 512-bit "AVX3" in "Phi".
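For readers who want a feel for what such a test does, here is a minimal sketch of a 256-bit streaming-read kernel - an assumption about the general technique, not Sandra's actual code. With AVX2, each load moves 32 bytes of integer data per instruction.

```c
/* Sketch of a 256-bit streaming-read bandwidth kernel (compile with
 * -mavx2 or /arch:AVX2). Reads 'bytes' (a multiple of 128) from a
 * 32-byte-aligned 'src'; the XOR accumulator stops the compiler from
 * eliminating the loads as dead code. Time many passes over a block
 * sized to the cache level under test. */
#include <immintrin.h>
#include <stddef.h>

__m256i read_stream(const void *src, size_t bytes)
{
    const __m256i *p = (const __m256i *)src;
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < bytes / 32; i += 4) {
        acc = _mm256_xor_si256(acc, _mm256_load_si256(p + i));
        acc = _mm256_xor_si256(acc, _mm256_load_si256(p + i + 1));
        acc = _mm256_xor_si256(acc, _mm256_load_si256(p + i + 2));
        acc = _mm256_xor_si256(acc, _mm256_load_si256(p + i + 3));
    }
    return acc;
}
```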

Results Interpretation: Higher values (GOPS, MB/s, etc.) mean better performance.

Environment: Windows 7 x64 SP1, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Bandwidth Benchmarks | Intel Haswell (CrystalWell) Mobile | Intel Ivy Bridge Mobile | Intel Sandy Bridge Mobile | Comments
L1D Bandwidth (2 threads sharing) | 699GB/s AVX2+FMA3/256-bit (+91%) | 365GB/s AVX/128-bit | 333GB/s AVX/128-bit | We know the L1D data path is twice as wide in Haswell, and here we see the improvement: L1D bandwidth is up 91% when using 256-bit transfers - at the same clock! Let's hope the latencies have remained the same.
L2 Bandwidth (2 threads sharing) | 228GB/s AVX2+FMA3/256-bit (+15%) | 204GB/s AVX/128-bit | 199GB/s AVX/128-bit | We are told the L2 data path is also twice as wide in Haswell, but here bandwidth goes up only 15% even when using 256-bit transfers. That is pretty disappointing - unless there is an underlying reason why the extra bandwidth cannot be realised. Dropping to AVX/128-bit transfers does not change anything either.
L3 Bandwidth (8 threads sharing) | 112GB/s AVX2+FMA3/256-bit (-6%) | 119GB/s AVX/128-bit | 119GB/s AVX/128-bit | Surprisingly, L3 bandwidth is 6% lower than either Ivy/Sandy - even though the memory controller/ring-bus are supposed to be improved. It is not a big deal, but we would expect some improvement, especially considering the use of 256-bit transfers and AVX2/FMA3.
L4/eDRAM Bandwidth (8 threads sharing) | 42GB/s AVX2+FMA3/256-bit (+110%) | n/a | n/a | The 128MB eDRAM (the largest size) mounted on the same package provides 2x the bandwidth that 128-bit DDR3 @ 1.6GHz can deliver - similar to a 4-channel Sandy/Ivy-EX system! Here it runs at 1.2GHz, but other models may run it faster or slower. Data that fits in this "cache" can thus be accessed 2x as fast as main memory - let's hope the latency is decent also.
Integer Memory Bandwidth | 17GB/s AVX2/256-bit, 17GB/s AVX/128-bit (-15%) | 20GB/s AVX/128-bit | 20GB/s AVX/128-bit | Even though on Haswell we use 256-bit memory transfers (not 128-bit as on Ivy/Sandy) through AVX2, memory bandwidth is 15% lower. While this "issue" was supposed to be fixed in retail steppings, it is still present today. Using AVX/128-bit transfers does not improve the result.
Float Memory Bandwidth | 17GB/s FMA3/256-bit, 17GB/s AVX/128-bit | 20GB/s AVX/128-bit | 20GB/s AVX/128-bit | Floating-point code uses 256-bit transfers and FMA3 to increase the "triad" STREAM execution rate (see the sketch below), but still cannot improve bandwidth efficiency. We do know the memory controller/ring-bus now runs asynchronously to the cores - but does that reduce efficiency? 15% slower is not a lot, but it is significant.
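For reference, the "triad" kernel the float row refers to computes a[i] = b[i] + s*c[i]; a minimal FMA3 version might look like the following (an illustrative assumption, not Sandra's implementation):

```c
/* Sketch of a STREAM-style "triad" kernel using 256-bit FMA3
 * (compile with -mfma -mavx2 or /arch:AVX2). Assumes n is a multiple
 * of 8 and all three arrays are 32-byte aligned. */
#include <immintrin.h>
#include <stddef.h>

void triad_fma(float *a, const float *b, const float *c,
               float s, size_t n)
{
    __m256 vs = _mm256_set1_ps(s);
    for (size_t i = 0; i < n; i += 8) {
        __m256 vb = _mm256_load_ps(b + i);
        __m256 vc = _mm256_load_ps(c + i);
        /* one fused multiply-add replaces a separate mul + add */
        _mm256_store_ps(a + i, _mm256_fmadd_ps(vs, vc, vb));
    }
}
```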

Haswell's wider caches can be "seen" when streaming 256-bit transfers - L1D is up 91%, but L2 is up only 15% and L3 is down 6%. Cache sizes have not changed either - the L3 especially could have increased to 8MB. The ace card is the L4/eDRAM on the same package, which provides 2x main memory bandwidth - unfortunately most Haswells will not have it.

Overall, somewhat disappointing results - still, algorithms whose data fits in either L1D or eDRAM will run much faster on Haswell, and streaming algorithms can take advantage of these faster buffers, for example by processing data in chunks as sketched below.
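One way to exploit the eDRAM is cache blocking: process a large dataset in chunks small enough to stay resident in the 128MB so that repeated passes hit eDRAM rather than DRAM. The sketch below assumes a 64MB chunk size and two hypothetical processing stages (pass1/pass2) standing in for real work.

```c
/* Sketch: cache-blocked two-pass processing. CHUNK is an assumed
 * tuning value, comfortably under the 128MB eDRAM; the first pass
 * pulls each chunk into the eDRAM, the second re-reads it from there. */
#include <stddef.h>

#define CHUNK (64u * 1024 * 1024)   /* 64MB: half the eDRAM, leaves slack */

static void pass1(char *p, size_t n)   /* stand-in processing stages */
{
    for (size_t i = 0; i < n; i++) p[i] += 1;
}

static void pass2(char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) p[i] ^= 0x5A;
}

void process_blocked(char *data, size_t total)
{
    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
        pass1(data + off, n);   /* chunk now resident in eDRAM */
        pass2(data + off, n);   /* second pass runs at eDRAM speed */
    }
}
```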

Cache and Memory Latencies

We are testing native cache and memory latencies - using different access patterns (in-page, full random, sequential) for both code and data.
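A latency test of this kind typically uses a dependent "pointer chase": each load's address comes from the previous load, so accesses cannot overlap and the time per step approximates raw access latency. Below is a minimal sketch of the general technique (an assumption about the method, not Sandra's code), with the "full random" pattern built via Sattolo's single-cycle shuffle.

```c
/* Sketch of a full-random pointer-chase latency kernel. chain[i]
 * holds the index of the next element to visit; the shuffle below
 * guarantees one cycle covering all n slots, defeating prefetchers. */
#include <stdlib.h>
#include <stddef.h>

void build_chain(size_t *chain, size_t n)
{
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {       /* Sattolo: single cycle */
        size_t j = (size_t)rand() % i;
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        chain[perm[i]] = perm[(i + 1) % n];
    free(perm);
}

size_t chase(const size_t *chain, size_t steps)
{
    size_t i = 0;
    while (steps--)
        i = chain[i];   /* serialised: next address needs this load */
    return i;           /* returned so the loop isn't optimised away */
}
```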

Results Interpretation: Lower values (ns, seconds, etc.) mean better performance.

Environment: Windows 7 x64 SP1, latest Intel drivers. Turbo / Dynamic Overclocking was enabled on all configurations.

Latency Benchmarks | Intel Haswell (CrystalWell) Mobile | Intel Ivy Bridge Mobile | Intel Sandy Bridge Mobile | Comments
L1D Data Latency In-Page | 4 clocks | 4 clocks | 4 clocks | No change in L1D latency.
L2 Data Latency In-Page | 12 clocks | 12 clocks | 12 clocks | No change in L2 latency either.
L3 Data Latency In-Page | 21 clocks (+1clk) | 23 clocks | 20 clocks | We see a +1 clock difference here (within the margin of error at this level), so L3 latency can still be considered unchanged. Ivy's +3 clock increase is more significant.
L4/eDRAM Data Latency In-Page | 61 clocks (-15%) | equivalent 71 clocks at eDRAM block size | equivalent 69 clocks at eDRAM block size | Haswell's eDRAM latency is 8-10 clocks less than Ivy/Sandy at the equivalent block size, i.e. about 15% faster. So not only is the bandwidth 2x main memory, the latencies are lower too. There is no doubt eDRAM helps; whether the cost is worth it is another matter ;)
Memory Data Latency In-Page | 31ns (+25%) | 28.4ns | 24.8ns | Main memory latencies have also increased for Haswell compared to Ivy/Sandy: while bandwidth is 15% lower, latencies have gone up by 25% - a pretty large amount. Whether this is due to the asynchronous memory controller/ring-bus or the additional L4/eDRAM level, it is not good news. Algorithms using a large amount of data will "hope" that data affinity (temporal/spatial) allows it to be cached - otherwise they may just run slower on Haswell.
L1D Data Latency Full-Random | 4 clocks | 4 clocks | 4 clocks | No change in L1D latency.
L2 Data Latency Full-Random | 12 clocks | 12 clocks | 12 clocks | No change in L2 latency either.
L3 Data Latency Full-Random | 36 clocks (+24%) | 29 clocks | 29 clocks | Even though the data TLBs have been improved, out-of-page latencies go up by 24% - unlike "in-page", where latencies were the same. The reason for this is not clear.
L4/eDRAM Data Latency Full-Random | 138 clocks (-62%) | equivalent 176 clocks at eDRAM block size | equivalent 224 clocks at eDRAM block size | Haswell's eDRAM latency is 38-86 clocks less than Ivy/Sandy at the equivalent block size - when out-of-page! Considering that most software uses 4kB pages (due to OS limitations/permission restrictions), that can happen pretty often - making data accesses up to 62% faster.
Memory Data Latency Full-Random | 85ns (+5%) | 77.8ns | 80.8ns | Haswell is only 5% slower than Sandy and 9% slower than Ivy - we were expecting worse considering the L3 results and the "in-page" memory latencies. It is disappointing that Haswell's L3/memory latencies have gone up, though.
L1I Code Latency In-Page | 2 clocks | 2 clocks | 2 clocks | No change in L1I latency.
L2 Code Latency In-Page | 19 clocks (-2clk) | 21 clocks | 21 clocks | Haswell's L2 appears 2 clocks faster, possibly due to the improved code TLB.
L3 Code Latency In-Page | 24 clocks (-2clk) | 26 clocks | 26 clocks | Haswell's L3 also appears 2 clocks faster - a good result.
L4/eDRAM Code Latency In-Page | 41 clocks (-25%) | equivalent 54 clocks at eDRAM block size | equivalent 57 clocks at eDRAM block size | Haswell's eDRAM latency is 13-16 clocks less than Ivy/Sandy at the equivalent block size, i.e. about 25% faster - better than what we saw for data. Complex algorithms that require a large amount of code (many functions across many libraries) should benefit - especially object-oriented or VM (.Net or Java) code.
Memory Code Latency In-Page | 20ns | 21.6ns | 19.9ns | Main memory latency for "in-page" code accesses matches Ivy/Sandy; as data accesses are 25% slower, the code-side improvements in Haswell must balance them out to end up equal.
L1I Code Latency Full-Random | 2 clocks | 2 clocks | 2 clocks | No change in L1I latency.
L2 Code Latency Full-Random | 20 clocks (-1clk) | 21 clocks | 21 clocks | No appreciable change in L2 latency either.
L3 Code Latency Full-Random | 44 clocks (+15%) | 38 clocks | 39 clocks | Even though the code TLBs have greatly improved (Haswell has a 2-level code TLB), going out-of-page is still ~5 clocks slower, i.e. about 15% slower.
L4/eDRAM Code Latency Full-Random | 130 clocks (-52%) | equivalent 217 clocks at eDRAM block size | equivalent 249 clocks at eDRAM block size | Haswell's eDRAM latency is 87-119 clocks less than Ivy/Sandy at the equivalent block size - when out-of-page! For very complex algorithms, eDRAM will help a lot.
Memory Code Latency Full-Random | 91ns (+5%) | 133ns | 88.6ns | Haswell is only 5% slower than Sandy - similar to the data latencies for full-random accesses.

The non-news is that latencies for L1D, L1I and L2 are unchanged. For data accesses, L3 latencies have increased; luckily, the improved (2-level) code TLB means that code access latencies are about the same. Memory latencies for both data and code are around 15-25% higher - so not only has bandwidth decreased, but latencies have increased.

Naturally, the eDRAM provides at least 15-50% lower latencies at the same block size - not quite the 2x improvement we've seen in bandwidth, but a good improvement nonetheless, and at a cost. Unfortunately, most Haswells will not have it, so no improvement can be expected there.


Final Thoughts / Conclusions

Even with all the improvements, Haswell does not improve much over Ivy/Sandy: yes, L1D bandwidth is 2x and L2 bandwidth is up 15%, while latencies are unchanged. But L3 size is the same, its bandwidth is lower (-6%) and its latencies higher (+15%) for both data and code. The improved code TLB does manage to reduce code latencies for both L3 and main memory (by approx. 25%), allowing Haswell to match Ivy/Sandy.

Main memory bandwidth is reduced even further (-15%) and latencies are even higher (+15-25%) - with the same memory size, type, speed and timings. Algorithms whose code and data are not cached in L1D/L1I/L2 may not perform better on Haswell at all.

Whether the drop in L3/memory efficiency is due to the asynchronous memory controller/ring-bus (designed to run faster than the cores, but able to run slower too) or the additional L4/eDRAM level remains to be seen - however, the results are consistent and not a "stepping issue". Perhaps it will be fixed in next year's minor update.

Naturally, the eDRAM provides much higher bandwidth (2x) and much lower latencies (-25/-50%), but we shall have to see the cost to determine whether it is worth it. Where algorithm code and data fit in the 128MB, performance will be considerably improved.

Given the patchy performance of Haswell's cache and memory sub-system (some things better, some things worse), whether performance improves depends on the specific algorithm - there is no easy way to tell.
