
Financial Analysis Performance

Benchmarks : Measuring GP (GPU/APU) Financial Analysis Performance (Double/FP64)

What are Valuation Models in Financial Analysis?

Financial analysis is employed to determine the metrics of a financial entity, be it a business, asset, option, etc. Here, various models are used to determine the future worth of "options" in organised option trading. An "option" is a contract to buy/sell an asset at a specified price ("strike price") at (or before) an expiration date. Determining an option's worth is thus very important - it determines whether the option should be traded or not!

Mathematical models are employed to estimate option worth and are implemented in most financial or trading software; some are compute intensive which is where GPGPU acceleration comes in.

The principal valuation models are Black-Scholes, Binomial (tree) and Monte Carlo. Other models (e.g. Finite difference, etc.) are not included here.

Why is it important to measure?

Financial analysis algorithms are complex, floating-point (of different precision) workloads that represent "real-life" examples of workstation use, which stress FPUs and FP accelerators. They use a variety of non-basic mathematical functions (e.g. exp, log, sqrt, etc.) while processing a good amount of data that stress memory and cache sub-systems.

Some of the functions used in these algorithms (e.g. exp, log and other transcendentals) are GPU-accelerated, as they are used extensively in graphics (shaders), while CPUs do not accelerate them as they are relatively "rare". However, since GPUs historically "cut corners" and did not implement them at "full accuracy" (which was not required for graphics), higher-precision or slower, non-accelerated versions may be needed.

Fractal algorithms (whose performance Sandra measures) use FMA (fused-multiply-add) extensively while generating little data (pixel colour index) - thus are pretty close to the maximum theoretical throughput of a GPU/vector processor. Most algorithms, however, are far more complex.

Crypto algorithms (e.g. AES, SHA, whose performance Sandra also measures) do process large amounts of data but employ integer data and specific accelerated functions (shift, rotate, bit count, etc.), while (at least historically) GP(GPU)s only dealt with floating-point data.

What precision is used to implement the algorithms?

The financial analysis benchmarks use IEEE double/FP64 (64-bit double-precision) format, the standard higher precision floating-point format used in computing that is supported by most FPUs and modern GP(GPU)s. It provides "enough" precision for just about all algorithms except where very large or very small numbers are processed causing errors to mount up.
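A minimal sketch of why float/FP32 "runs out" of precision while double/FP64 does not: repeatedly accumulating a value (0.1) that is not exactly representable in binary. This is a hypothetical illustration, not taken from the benchmark itself.

```python
import numpy as np

n = 100_000
acc32 = np.float32(0.0)
acc64 = 0.0  # Python float is IEEE double/FP64

for _ in range(n):
    acc32 += np.float32(0.1)  # each add rounds to 24-bit mantissa
    acc64 += 0.1              # each add rounds to 53-bit mantissa

exact = n * 0.1  # 10000.0 (to within FP64 rounding)
err32 = abs(float(acc32) - exact)
err64 = abs(acc64 - exact)
# err32 is many orders of magnitude larger than err64
```

In a long iterative computation (e.g. a binomial tree reduction) these rounding errors compound in exactly this fashion, which is why FP64 is the standard for financial work.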

If you are interested in single precision performance, please see Financial Analysis Performance (Float/FP32).

Note: OpenCL was used as it is supported by all GPUs/APUs. The tests are also available through CUDA which provides better optimisations for nV hardware.

Hardware Specifications

Below are the GPUs and APUs we are comparing in this article:

| GPU / APU | Cores (CU) / (SP) | Normal / Turbo Speed | Memory / Speed | Registers / Constant / Shared / L2 Cache |
|---|---|---|---|---|
| nV GeForce GTX Titan (GK110 Kepler / CUDA 3.5) | 14 / 2688 (192 SP/CU) | 938MHz / 1.2GHz | 6GB 6GHz GDDR5 384-bit | 64k / 64kB / 16kB / 1.5MB |
| nV GeForce GTX 680 (GK104 Kepler / CUDA 3.0) | 8 / 1536 (192 SP/CU) | 1.1GHz / 1.2GHz | 2GB 6GHz GDDR5 256-bit | 32k / 64kB / 16kB / 512kB |
| nV GeForce GTX 660 TI (GK104 Kepler / CUDA 3.0) | 7 / 1344 (192 SP/CU) | 980MHz / 1.1GHz | 2GB 6GHz GDDR5 192-bit | 32k / 64kB / 16kB / 384kB |
| nV GeForce GTX 460 (Fermi / CUDA 2.0) | 7 / 224 (32 SP/CU) | 1.78GHz | 1GB 4GHz 256-bit | 16k / 64kB / 16kB / 512kB |
| AMD Radeon HD 7970 (Tahiti XT / GCN1) | 32 / 2048 (64 SP/CU) | 925MHz / 1GHz | 3GB 5.8GHz GDDR5 384-bit | 64k / 128kB / 64kB / 768kB |
| AMD Radeon HD 7850 (Pitcairn Pro / GCN1) | 16 / 1024 (64 SP/CU) | 860MHz | 2GB 4.8GHz GDDR5 256-bit | 64k / 128kB / 64kB / 512kB |
| AMD Radeon HD 7790 (GCN2) | 14 / 896 (64 SP/CU) | 1075MHz | 1GB 6GHz GDDR5 128-bit | 32k / 64kB / 32kB / 256kB |
| ATI Radeon HD 5870 (Cypress XT / VLIW5) | 20 / 1600 (80 SP/CU) | 850MHz | 1GB 4.8GHz GDDR5 256-bit | 16k / 64kB / 32kB / 512kB |


The Black-Scholes (European Options) Model

The Black-Scholes model is implemented using a closed-form expression that is relatively "simple"; this allows each option to be processed by a single thread while the relatively small number of registers required allows large groups of threads to be executed on each compute unit - thus good CU utilisation.

Little memory is required (input and output parameters), so a large number of options can be processed, up to device memory capacity; however, using doubles greatly increases register pressure - thus only smaller work-groups can be executed.
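The closed-form expression can be sketched as follows - in Python rather than the OpenCL/CUDA kernels the benchmark uses, and with helper names of my own; on the GPU each thread would evaluate this for one option:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """Closed-form Black-Scholes price of a European call option.

    S: spot price, K: strike price, T: time to expiry (years),
    r: risk-free rate, sigma: volatility.
    """
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Example: at-the-money call, 1 year, 5% rate, 20% volatility -> ~10.45
price = black_scholes_call(100.0, 100.0, 1.0, 0.05, 0.2)
```

Note the reliance on exp, log, sqrt and erf - exactly the transcendental functions whose GPU acceleration (and FP64 accuracy) this benchmark stresses.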


| GPU | Black-Scholes (Double/FP64) | Comment |
|---|---|---|
| nV GeForce GTX Titan | 1.78 GOPT/s | Turning on "full FP64" allows the Titan to overtake its 7970 rival and greatly outperform the GTX 680. However, with 8x more DFP units, even better scaling may have been expected. Finally the Titan comes into its own! |
| nV GeForce GTX 680 | 0.47 GOPT/s | With 1/8 the DFP units of Titan (1/24 DFP:FP ratio), performance is far lower - even a 2-year-old 5870 manages to outperform it. GK104 is just not suited to double/FP64 compute-intensive loads. Good scaling w.r.t. the 670/660 TI and it does beat the 7850 soundly, but poor overall. |
| nV GeForce GTX 660 TI | 0.33 GOPT/s | Just manages to beat the GTX 460 and good value w.r.t. the GTX 680 (a 2-card system would outperform it), but not great overall. |
| nV GeForce GTX 460 | 0.29 GOPT/s | Not the slowest here - surprising considering "Fermi" had more DFP units per FP than "Kepler". However, "Fermi" has 1/2 the registers per SMU, which becomes a problem when using double/FP64. |
| AMD 7970 | 1.7 GOPT/s | Great result, even if outperformed by Titan - a far more expensive card. Beats all other cards by a huge margin, ensuring that even the low-cost 7900 LE cannot be matched. |
| AMD 7850 | 0.2 GOPT/s | Very low performance compared to its bigger brother, the 7970 - 1/2 the performance of a 2-year-old 5870. What we saw in Monte Carlo FP32 is thus no isolated incident: the 7850 is not 1/2 the 7970's performance but can be much, much slower. |
| ATI 5870 | 0.55 GOPT/s | Fantastic performance for a 2-year-old card, coming 3rd and beating far newer, more expensive cards (680, 7850, etc.) in this algorithm just as in float/FP32. Its performance is thus no fluke; though old, this card can hold its own! |

Unlike Black-Scholes float/FP32, there are few surprises here - though old cards do very well. Top-end cards show their worth by dominating double/FP64 loads by a large margin - you have to pay if you want great double/FP64 performance!


The Binomial (European Options) Model

The Binomial tree model uses an extra parameter - the number of time steps (equally spaced) from the current date to the expiry date. The price of the option must be computed iteratively for each node/leaf of the binomial tree, and the price array then reduced. Shared memory is needed, but being a very limited resource (usually 16kB) it can only be used as a "buffer" for the much slower global memory.

The benchmark uses 1024 time steps, with number of options based on device memory.
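The iterate-then-reduce structure can be sketched as below - a serial Python version using the common Cox-Ross-Rubinstein parameterisation (an assumption; the benchmark's exact tree parameters aren't stated). On the GPU the inner reduction is what a work-group performs in shared memory:

```python
from math import exp, sqrt

def binomial_call(S, K, T, r, sigma, steps=1024):
    """European call via a CRR binomial tree with backward induction."""
    dt = T / steps
    u = exp(sigma * sqrt(dt))          # up-move factor
    d = 1.0 / u                        # down-move factor
    disc = exp(-r * dt)                # per-step discount
    p = (exp(r * dt) - d) / (u - d)    # risk-neutral up probability

    # Terminal payoffs at the leaves (j = number of up-moves)
    prices = [max(S * u ** j * d ** (steps - j) - K, 0.0)
              for j in range(steps + 1)]

    # Reduce the price array level by level back to the root
    for i in range(steps, 0, -1):
        prices = [disc * (p * prices[j + 1] + (1 - p) * prices[j])
                  for j in range(i)]
    return prices[0]

# With 1024 steps this converges on the Black-Scholes value (~10.45)
price = binomial_call(100.0, 100.0, 1.0, 0.05, 0.2)
```

Each pass over `prices` is one level of the tree - 1024 dependent reduction passes per option, which is why this algorithm leans so heavily on shared memory and registers.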

Each option is computed by an entire work-group, with each thread in the group computing a time-step. Thus, Binomial is a far more complex algorithm (than Black-Scholes) that uses more registers per thread/group and stresses shared memory, taxing all but the latest GPUs. Here, using doubles puts even more register pressure on the GPU - thus only pretty small work-groups can be executed.

For example on CUDA 2.x devices, 52 registers per thread are needed (!); with 32k registers per CU we can "fit" only 576 threads and thus only 38% CU occupancy (1536 max)!
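A back-of-envelope version of that occupancy arithmetic, using the figures quoted above. The work-group size of 192 is an assumption chosen to reproduce the 576-thread / 38% figures; the hardware allocates registers per whole group, not per individual thread:

```python
REGS_PER_CU = 32 * 1024    # 32k registers per compute unit (CUDA 2.x)
REGS_PER_THREAD = 52       # registers needed per thread for this kernel
MAX_THREADS = 1536         # maximum resident threads per CU
GROUP_SIZE = 192           # assumed work-group size (hypothetical)

regs_per_group = REGS_PER_THREAD * GROUP_SIZE  # 9984 registers per group
groups = REGS_PER_CU // regs_per_group         # only 3 groups fit
threads = groups * GROUP_SIZE                  # 576 resident threads
occupancy = threads / MAX_THREADS              # 0.375 -> ~38%
```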


| GPU | Binomial (Double/FP64) | Comment |
|---|---|---|
| nV GeForce GTX Titan | 83 kOPT/s | Even in "full FP64" mode the Titan cannot beat the 7970, with pretty poor scaling compared to the GTX 680. Much better performance was expected considering the huge number of DFP units. |
| nV GeForce GTX 680 | 45 kOPT/s | Good scaling w.r.t. the 670/660 TI (again!) - a 2-card system would likely match Titan's performance for far lower cost, if nV hardware is required. It does beat the 7850 soundly but is far slower than the 7970. |
| nV GeForce GTX 660 TI | 34 kOPT/s | Soundly outperforms the older GTX 460 even with fewer DFP units, showing that the memory improvements and higher number of registers per CU matter more than raw performance, even in double. |
| nV GeForce GTX 460 | 21 kOPT/s | Slowest card here - surprising considering Fermi's performance elsewhere. |
| AMD 7970 | 112 kOPT/s | Fastest again by a large margin, soundly beating even the Titan and all other cards. Best by far. |
| AMD 7850 | 34 kOPT/s | About 1/3 the speed of the 7970 - not the catastrophic drop seen in other algorithms - and it matches the 660 TI. |
| ATI 5870 | 23 kOPT/s | A good showing considering its age - it manages to beat the GTX 460 - but still not a good showing against modern cards. |

Binomial shows that modern GP(GPU)s have improved - they all beat their older equivalents. Double/FP64 almost doubles the number of registers required per thread and thus puts even higher pressure on CU while moving 2x more memory per option. Top end GP(GPUs) do show their power again by greatly outperforming lesser series.


The Monte Carlo (European Options) Model

The Monte Carlo model uses a normally distributed sample (number of paths) to estimate the price. Parallel reduction is used to compute the final price/confidence from the partial sums. For a small number of paths, shared memory can be used for efficient reduction (but is limited to 16kB), with each option computed by a work-group and each thread in the group computing a path.

For a large number of paths, we have to use global memory to store the partial sums, thus we can use multiple groups per option with the group size dependent on the hardware resources (larger groups where more registers per thread).

The benchmark uses 32,000 simulation paths, with number of options based on device memory.
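The per-path work can be sketched as below - a serial Python version under geometric Brownian motion with one normal sample per path (an assumption matching the "normally distributed sample" description; function and seed are mine). On the GPU, each thread would compute one path's payoff and the loop body's sum becomes the parallel reduction:

```python
import random
from math import exp, sqrt

def monte_carlo_call(S, K, T, r, sigma, paths=32000, seed=42):
    """European call via Monte Carlo simulation of terminal prices."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T  # risk-neutral drift term
    vol = sigma * sqrt(T)

    total = 0.0
    for _ in range(paths):
        z = rng.gauss(0.0, 1.0)             # one normal sample per path
        ST = S * exp(drift + vol * z)       # terminal asset price
        total += max(ST - K, 0.0)           # call payoff
    return exp(-r * T) * total / paths      # discounted mean payoff

# With 32,000 paths this lands near the Black-Scholes value (~10.45),
# give or take Monte Carlo sampling error
price = monte_carlo_call(100.0, 100.0, 1.0, 0.05, 0.2)
```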

Again, Monte Carlo is far more complex (than Black-Scholes) using a similar number of registers (to Binomial) and fast shared memory and thus performs best on the latest GPUs.

For example on CUDA 2.x devices, 52 registers per thread are needed (!); with 32k registers per CU we can "fit" only 576 threads and thus only 38% CU occupancy (1536 max)!


| GPU | Monte Carlo (Double/FP64) | Comment |
|---|---|---|
| nV GeForce GTX Titan | 560 kOPT/s | Far better performance than the GTX 680 in "full FP64" mode but still not enough to outperform the 7970. If nV hardware is required, the Titan is hard to beat - not even 5 GTX 680 cards would match it. |
| nV GeForce GTX 680 | 92 kOPT/s | Good scaling w.r.t. the 670/660 TI (yet again) but the low number of DFP units makes it unsuitable for double/FP64 work. |
| nV GeForce GTX 660 TI | 67 kOPT/s | Narrowly beats the old GTX 460 (though with far fewer DFP units), which again shows the improvements in "Kepler" do help. It is, however, greatly outperformed by the 7850 and thus the 7900 LE, making it a poor choice. |
| nV GeForce GTX 460 | 65 kOPT/s | Slowest card here (just as in Binomial), showing that register pressure reduces CU utilisation and thus performance in complex algorithms. |
| AMD 7970 | 622 kOPT/s | Great performance, soundly beating the Titan by a good margin (again). Best value by far - there's nothing this card cannot do! |
| AMD 7850 | 86 kOPT/s | Almost ties with the far more expensive GTX 680 competition but still loses to the 2-year-old 5870 and is 7x slower than its bigger brother, the 7970. A 7900 LE (or 7870 XT) would thus be better value. |
| ATI 5870 | 198 kOPT/s | A good showing by this very old card - almost 2x faster than the 7850 and coming 3rd yet again! Its double/FP64 performance was way ahead of its time; even all the improvements of newer hardware cannot outperform it. |

Like Binomial, Monte Carlo also favours modern GP(GPU)s, needing fast (shared) memory and far more registers per thread to prevent expensive "spills" to global memory - as well as fast caches for common/constant data. However, past a certain point register pressure affects both old and new cards alike, allowing older cards to perform better than expected (as CU utilisation is low for all cards).

Final Thoughts / Conclusions

Financial analysis algorithms are a good match for GP(GPUs) - however, double/FP64 was never used for graphics (such high precision is not required). Unlike FPUs, where going from float to double reduces performance by approximately 1/2, many GP(GPU)s have far fewer DFP units than FP units (e.g. "Kepler" GK104 has a 1:24 DFP:FP ratio) - thus double/FP64 performance is far lower.

While performance is not likely to drop by the same amount, 5-10x slower is a significant difference requiring careful assessment as to whether high precision is required for the whole computation or just specific parts. High-end cards that contain more DFP units are thus greatly favoured.

Complex models (Binomial, Monte Carlo) add even more register pressure in double/FP64 mode, as much as 2x which greatly reduces the number of threads that can execute on a CU and thus work-group size.

As data for an option (input, output, shared, etc.) is 2x as large in double/FP64 mode, the stress on the memory and cache sub-system is also higher thus favouring high end cards with faster memory.

AMD's GCN 7900 series outperforms the top-end Titan in the complex algorithms (Binomial, Monte Carlo) by a good margin while scoring slightly lower in the simpler one (Black-Scholes), and thus represents tremendous value. GCN does very well in double/FP64 mode also - a true best buy.

AMD's GCN 7800 series does pretty well compared to the competition thus still good value - but much slower than 7900 and sometimes beaten even by a 5870; again, paying more for a 7900 LE (aka 7870 XT) is worth it.

nV's GTX Titan finally shows its true power in "full FP64" mode, trading blows with the 7900 but its high cost ultimately defeats it. However, if CUDA or nV hardware is required it represents tremendous performance that older GK104 cards cannot match.

nV's GTX 680, with only 1/24 DFP units, has far too low double/FP64 performance to be good value - with 5 cards needed to match Titan performance. It is great for float/FP32 workloads; if that precision is sufficient, the picture changes completely.

nV's GTX 660 TI just like GTX 680, not suited to double/FP64 workloads - with 8-10 cards needed to match Titan!

If you are interested in single precision performance, please see Financial Analysis Performance (Float/FP32).

Help us Help you!

Please remember that unlike review sites we buy our hardware, thus we cannot afford to test many devices on the market. Please test your GP(GPU) and submit your scores to the Ranker - this allows us to provide you and other users with aggregated scores from other GP(GPU)s as well as certifying whether your scores are statistically valid.
