Benchmarks: Haswell's TSX and Memory Transaction Throughput (HLE and RTM)
What are transactions and why are they important to software?
A transaction is a sequence of sub-operations that are treated as a "single unit" in order to ensure the integrity of the data it operates upon. It must complete in its entirety; any failure must revert all the changes that were made to the data since the transaction started (aka "rollback").
While "transactions" are most commonly encountered where databases are concerned, any multi-threaded operating system or application must use similar mechanisms when dealing with data shared between threads: read, write (aka modify) and delete (which modifies other data structures).
Multi-threading is required to take advantage of the increasing number of cores and threads of modern CPUs. While every effort must be made to reduce the data shared between threads, some data has to be shared; synchronisation mechanisms are required to protect the integrity of the shared data by serialising operations - often through the use of locks and critical sections.
Locks and critical sections ensure operations on the shared data are executed transactionally.
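To make the idea concrete, here is a minimal sketch of a lock-protected "transaction" with rollback. The `Accounts` structure and `transfer` function are hypothetical names chosen for illustration, not anything taken from Sandra; the sketch assumes a single coarse `std::mutex` guards all the shared data.

```cpp
#include <mutex>

// Hypothetical shared data: two balances that must be updated together.
struct Accounts {
    std::mutex lock;  // one coarse lock guarding both balances
    long a = 100;
    long b = 0;
};

// Moves `amount` from a to b as a single unit. Returns false, with both
// balances left unchanged, if the transfer cannot be completed.
bool transfer(Accounts& acc, long amount) {
    std::lock_guard<std::mutex> guard(acc.lock);  // serialise access
    const long saved_a = acc.a;                   // snapshot for rollback
    const long saved_b = acc.b;
    acc.a -= amount;                              // sub-operation 1
    acc.b += amount;                              // sub-operation 2
    if (acc.a < 0) {                              // failure: revert everything
        acc.a = saved_a;
        acc.b = saved_b;
        return false;
    }
    return true;
}
```

No other thread can observe the intermediate state (money removed from `a` but not yet added to `b`) because both sub-operations happen while the lock is held.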
Why do we need transactional synchronisation extensions (TSX)?
While locks and critical sections ensure shared data integrity (by allowing only one thread to access/modify the shared data at any one time), concurrency and thus performance is reduced. To reduce the time during which data is locked, "fine-grained" locking can be used, with multiple locks protecting different parts of the shared data. However, ensuring correct operation is difficult and error-prone.
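A minimal sketch of fine-grained locking follows, assuming the shared data can be split into independent "stripes" that each carry their own lock. `StripedCounters` and its members are hypothetical names used only for this example.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

// Hypothetical counter table split into stripes, each with its own lock.
class StripedCounters {
public:
    static constexpr std::size_t kStripes = 8;

    // Only the stripe that owns `key` is locked, so threads working on
    // different stripes do not block each other.
    void add(std::size_t key, long delta) {
        Stripe& s = stripes_[key % kStripes];
        std::lock_guard<std::mutex> guard(s.lock);
        s.value += delta;
    }

    // Visits every stripe in turn. Each stripe is read consistently, but
    // the sum is not a global snapshot if writers are still running.
    long total() {
        long sum = 0;
        for (Stripe& s : stripes_) {
            std::lock_guard<std::mutex> guard(s.lock);
            sum += s.value;
        }
        return sum;
    }

private:
    struct Stripe {
        std::mutex lock;
        long value = 0;
    };
    std::array<Stripe, kStripes> stripes_;
};
```

Even in this small case the extra complexity shows: `total()` no longer sees one consistent state of all the data, which is exactly the kind of subtlety that makes fine-grained locking error-prone.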
"Coarse-grained" locking (i.e. using few locks) reduces complexity but often introduces excessive locking. One way to reduce it is to lock only when needed; whether and when to lock can be determined using static information, but it can also be determined dynamically by the hardware.
The new transactional synchronisation extensions (TSX) allow the CPU to determine dynamically whether and when to lock, and to serialise only when required. Lock time is reduced, concurrency is improved and thus performance increases.
What interfaces are available in TSX?
TSX is available on (some) 4th generation Intel Core CPUs (Haswell) as well as (some) future AMD CPUs and comprises the following interfaces:
- Hardware Lock Elision (HLE) is a legacy-compatible instruction set extension, i.e. transparent to CPUs that do not support TSX. The very same code can execute on TSX-capable CPUs - and benefit - and still works on legacy CPUs without performance penalty.
How is this possible? Intel has redefined what an existing sequence of opcodes "means": TSX-capable CPUs interpret the XACQUIRE/XRELEASE prefixes (which reuse the REPNE/REPE prefix encodings) as HLE hints, while legacy CPUs (including AMD ones) ignore them.
- Restricted Transactional Memory (RTM) is a new instruction set that requires a TSX-compatible CPU but allows the transactional regions to be specified in a more flexible manner.
That means new instructions that require an updated assembler or compiler - though it is always possible to emit the new opcodes "manually".
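As an illustration of the HLE interface, the sketch below builds a spinlock using GCC's `__ATOMIC_HLE_ACQUIRE` / `__ATOMIC_HLE_RELEASE` hint flags, a GCC extension for x86 that emits the XACQUIRE/XRELEASE prefixes. `HleSpinLock` and `count_with_lock` are hypothetical names of our own and this is not Sandra's implementation; where the compiler does not define the flags, the hints are set to 0 and the lock degrades to an ordinary spinlock.

```cpp
#include <thread>
#include <vector>

// HLE hint bits are a GCC extension on x86; elsewhere use no hint at all.
#if defined(__ATOMIC_HLE_ACQUIRE) && defined(__ATOMIC_HLE_RELEASE)
#define HLE_ACQUIRE_HINT __ATOMIC_HLE_ACQUIRE
#define HLE_RELEASE_HINT __ATOMIC_HLE_RELEASE
#else
#define HLE_ACQUIRE_HINT 0
#define HLE_RELEASE_HINT 0
#endif

class HleSpinLock {
public:
    void lock() {
        // With the hint, a TSX-capable CPU may elide the write to flag_ and
        // run the critical section transactionally; other CPUs ignore the
        // prefix and perform a normal atomic exchange.
        while (__atomic_exchange_n(&flag_, 1,
                                   __ATOMIC_ACQUIRE | HLE_ACQUIRE_HINT)) {
            // Lock is really held: wait with plain loads before retrying.
            while (__atomic_load_n(&flag_, __ATOMIC_RELAXED)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() {
        __atomic_store_n(&flag_, 0, __ATOMIC_RELEASE | HLE_RELEASE_HINT);
    }

private:
    int flag_ = 0;
};

// Usage sketch: several threads increment one counter under the lock.
long count_with_lock(int threads, int iterations) {
    HleSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```

Note that this usage example is a worst case for elision: every thread writes the same counter, so transactions would conflict constantly. Elision pays off when threads holding the same lock mostly touch different data.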
Both interfaces can be used by applications directly (i.e. to implement locking/transactions) without specific operating system (OS) support (e.g. like AVX, etc.), but applications using OS provided locks (mutexes, semaphores, etc.) will not benefit until their implementation in the OS is updated to support TSX.
[Neither Windows 7 nor Windows 8 uses TSX at this time, but it is possible that they will be updated or that the Windows 8.1/"Blue" kernel will be required for support]
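Because RTM instructions fault on CPUs without TSX, an application using RTM directly has to detect support first and keep a conventional fallback path. The sketch below is one common way to do that with the `_xbegin` / `_xend` / `_xabort` intrinsics from `<immintrin.h>`, and is not Sandra's code. `RtmCounter`, `cpu_has_rtm` and the retry count of 3 are hypothetical choices for illustration; the sketch assumes GCC or Clang on x86 and falls back to a plain spinlock everywhere else.

```cpp
#include <atomic>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

// CPUID leaf 7, sub-leaf 0: EBX bit 11 reports RTM (bit 4 reports HLE).
static bool cpu_has_rtm() {
#if SKETCH_X86
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx & (1u << 11)) != 0;
#else
    return false;
#endif
}

class RtmCounter {
public:
    RtmCounter() : use_rtm_(cpu_has_rtm()) {}

    void increment() {
#if SKETCH_X86
        if (use_rtm_ && try_transactional_increment()) return;
#endif
        // Fallback path: a plain spinlock, taken non-transactionally.
        while (locked_.exchange(true, std::memory_order_acquire)) { }
        ++value_;
        locked_.store(false, std::memory_order_release);
    }

    long value() const { return value_; }
    bool uses_rtm() const { return use_rtm_; }

private:
#if SKETCH_X86
    // Compiled with RTM enabled for this function only; it is called
    // solely when CPUID has reported RTM support.
    __attribute__((target("rtm"))) bool try_transactional_increment() {
        for (int attempt = 0; attempt < 3; ++attempt) {
            if (_xbegin() == _XBEGIN_STARTED) {
                // Reading the fallback flag puts it in the read set, so a
                // thread that takes the lock aborts this transaction.
                if (locked_.load(std::memory_order_relaxed)) _xabort(0xff);
                ++value_;
                _xend();  // commit
                return true;
            }
            // On abort all changes are rolled back and we arrive here.
        }
        return false;
    }
#endif

    std::atomic<bool> locked_{false};
    long value_ = 0;
    bool use_rtm_;
};
```

Unlike HLE, the abort handler is explicit here: the application decides how often to retry and what the non-transactional fallback looks like, which is the extra flexibility RTM offers.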
What are the types of locks Sandra includes?
Sandra allows you to compare and contrast TSX performance against common standard software locks used by OS and applications (including Sandra itself!). Other (more complex) locks may be added in the future (if required).
- Software (Classic): standard "coarse-grained" locking, i.e. lock before accessing the shared data and unlock afterwards.
- Software (Read/Write): an improved lock that only locks when data is modified (on write/delete) while allowing reads to succeed (aka "dirty" lock). While more complex, it reduces lock time as long as the modify probability is low. As the modify probability increases, the efficiency of the lock decreases, but at worst it should equal the classic lock.
- Hardware TSX (HLE or RTM): the "classic" lock implemented using HLE or RTM. The shared data is accessed optimistically without synchronisation; if no conflicting access occurs, the execution commits without any serialisation. Otherwise the transaction is aborted, the state is restored and the execution is resumed from the beginning non-transactionally. Just as with a Read/Write lock, lock time is reduced as long as the modify probability is low, and its efficiency also decreases as the modify probability increases.
We expect to see TSX lock performance similar to the classic software lock on legacy (non-TSX-capable) CPUs, and higher than either software lock on TSX-capable CPUs.
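For comparison with the classic lock, here is a minimal sketch of the read/write idea using the standard C++14 `std::shared_timed_mutex`: readers share the lock, while writers take it exclusively. This is a swapped-in standard technique and only an approximation of the "dirty" lock described above, whose exact implementation in Sandra is not given; `RecordStore` is a hypothetical name.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

class RecordStore {
public:
    // Readers take the lock in shared mode, so any number may run at once.
    bool read(int key, std::string& out) const {
        std::shared_lock<std::shared_timed_mutex> guard(lock_);
        auto it = records_.find(key);
        if (it == records_.end()) return false;
        out = it->second;
        return true;
    }

    // Writers take the lock exclusively, blocking readers and other writers.
    void write(int key, const std::string& value) {
        std::unique_lock<std::shared_timed_mutex> guard(lock_);
        records_[key] = value;
    }

    bool erase(int key) {
        std::unique_lock<std::shared_timed_mutex> guard(lock_);
        return records_.erase(key) > 0;
    }

private:
    mutable std::shared_timed_mutex lock_;
    std::map<int, std::string> records_;
};
```

With few writes, almost every operation takes the cheap shared path; as the share of writes grows, more operations serialise and the behaviour approaches that of a single classic lock.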
What are the types of updates Sandra includes?
Sandra provides multiple ways to access the shared data and thus determine TSX impact:
- Database Transactional: using a B-Tree to store the data, we test all the basic operations on it: read a record (search), insert a new record, update the data of an existing record, delete a record. The probabilities of insert and update are 1/2 the total modify probability.
A global lock is the simple solution to ensure tree integrity when multiple threads execute operations on it (insert, remove records) at the expense of concurrency; using multiple locks would add significant complexity and possibly hard-to-find race conditions and deadlocks - which is what TSX was designed to avoid.
- Record Update Only: here the tree is already populated and the number of elements is constant (no adds and no deletes). We only test read (search) and update (an existing record) - the update probability being the total modify probability.
As we do not change the tree itself, we only need to lock access to the record data - thus there is a better chance of parallelism. Global locking here would really hurt performance (modifying any record locks all) but record-locking (each record has a lock) could be really wasteful in resources if the record data is small.
As before, we expect to see TSX lock performance similar to the classic software lock on legacy (non-TSX-capable) CPUs, and higher than either software lock on TSX-capable CPUs.
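The "Record Update Only" workload can be pictured with a short sketch: each operation picks a record at random and either reads or updates it according to the modify probability. This is only our illustration of the idea under stated assumptions, not Sandra's benchmark code; `run_record_update_mix`, `MixResult`, the flat `std::vector` standing in for the tree, and the single global `std::mutex` are all hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <random>
#include <vector>

struct MixResult {
    std::size_t reads = 0;
    std::size_t updates = 0;
};

// Runs `operations` reads/updates on `records` (which must not be empty).
// Each operation is an update with probability `modify_probability`.
MixResult run_record_update_mix(std::vector<long>& records,
                                double modify_probability,
                                std::size_t operations,
                                unsigned seed) {
    std::mutex lock;  // stand-in for whichever lock is being measured
    std::mt19937 rng(seed);
    std::bernoulli_distribution modify(modify_probability);
    std::uniform_int_distribution<std::size_t> pick(0, records.size() - 1);

    MixResult result;
    long sink = 0;  // keeps the reads from being optimised away
    for (std::size_t i = 0; i < operations; ++i) {
        const std::size_t index = pick(rng);
        const bool is_update = modify(rng);
        std::lock_guard<std::mutex> guard(lock);
        if (is_update) {
            records[index] += 1;
            ++result.updates;
        } else {
            sink += records[index];
            ++result.reads;
        }
    }
    (void)sink;
    return result;
}
```

Running the same loop on several threads, swapping the lock type and sweeping the modify probability from 0.0 to 1.0 would give a throughput curve per lock; dividing the completed operations by the elapsed seconds and by one million gives a figure in millions of transactions per second (MTPS).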
Here are the CPUs and memory systems we are comparing in this article:
| | Intel i7-2xxxM (Sandy Bridge) | Intel i7-4xxxM (Haswell) | |
| Speed - Turbo | 2.2 / 3.1GHz | 2.2 / 3.1GHz | same speed/turbo programmed |
| Cores (CU) / Threads (SP) | 4C / 8T | 4C / 8T | |
| Caches (L1 / L2 / L3 / L4) | 4x 32kB / 4x 256kB / 6MB | 4x 32kB / 4x 256kB / 6MB / 128MB | same configuration with additional eDRAM/L4 |
| Memory (Speed / Latency) | 2x DDR3 1600MHz 11-11-11-29 | 2x DDR3 1600MHz 11-11-11-29 | same memory and timings |
Here we compare the performance of different locks when executing typical database transactions (find/read, insert, replace, delete) with different modify probabilities - from 0.0 (no modifies at all) to 1.0 (always modify).
Database Update Transaction
||Again, classic lock and HLE lock performance on non-TSX legacy CPUs is similar; HLE does not affect performance on legacy CPUs.
||18.97 MTPS (~4x classic lock)
||5.72 MTPS (matches SNB)
||4.69 MTPS (matches SNB)
||HLE allows 4x more transactions than the basic classic lock - a massive increase! There is no question that HLE greatly improves transactional performance.
Here we compare the performance of different locks when updating (modifying) the record itself only - no inserts and no deletes - with different modify probabilities - from 0.0 (no modifies at all) to 1.0 (always modify).
Record Update Transaction
||As only a record modification is made (no inserts, deletes) the performance of the classic/HLE locks varies little with probability while the read/write lock decreases as the number of modifies increases.
||14.59 MTPS (~5x classic lock)
||4.85 MTPS (matches SNB)
||3.80 MTPS (matches SNB)
||HLE allows the rate of modify-only transactions to increase by a massive 5x - blowing both classic and even R/W locks out of the water! Applications that use many threads and locks will see a huge increase in performance when changing their locking to HLE.
Final Thoughts / Conclusions
We have seen in the preliminary testing that TSX does deliver what it was designed to do:
- It is relatively straightforward to replace classic software locks with TSX versions. Applications can implement it now using existing compilers by using a little assembler.
- TSX locks using HLE will run with no appreciable performance delta on legacy CPUs - not better but no worse.
- TSX appears to provide significant improvement in concurrency - and thus performance - on TSX-capable CPUs which is very important as the number of threads and cores continues to increase in modern CPUs.
- Once operating systems are updated to use it, the performance of the OS as well as of its locking APIs should improve. New locking APIs may also be provided to give applications easy ways to use TSX.