
Benchmarks: Measuring GP (GPU/APU) Financial Analysis Performance (Double/FP64)
What are Valuation Models in Financial Analysis?
Financial analysis is employed to determine the metrics of a financial entity, be it a business, asset, option, etc. Here, various models are used to determine the future worth of "options" in organised option trading. An "option" is a contract to buy/sell an asset at a specified price ("strike price") at (or before) an expiration date. Determining an option's worth is thus very important: it determines whether the option should be traded at all!
Mathematical models are employed to estimate option worth and are implemented in most financial or trading software; some are compute-intensive, which is where GPGPU acceleration comes in.
The principal valuation models are Black-Scholes, Binomial (tree) and Monte Carlo. Other models (e.g. finite difference) are not included here.
Why is it important to measure?
Financial analysis algorithms are complex floating-point workloads (of differing precision) that represent "real-life" examples of workstation use, stressing FPUs and FP accelerators. They use a variety of non-basic mathematical functions (e.g. exp, log, sqrt) while processing a good amount of data that stresses the memory and cache subsystems.
Some of the functions used in these algorithms (e.g. exp, log and other transcendentals) are GPU-accelerated as they are used extensively in graphics (shaders), while CPUs do not accelerate them as they are relatively "rare". However, as GPUs historically "cut corners" and did not implement them in full accuracy (not required for graphics), higher-precision, or slower non-accelerated, versions may be needed.
Fractal algorithms (whose performance Sandra measures) use FMA (fused multiply-add) extensively while generating little data (a pixel colour index) and thus come pretty close to the maximum theoretical throughput of a GPU/vector processor. Most algorithms, however, are far more complex.
Crypto algorithms (e.g. AES, SHA, whose performance Sandra also measures) do process large amounts of data, but they employ integer data and specific accelerated functions (shift, rotate, bit count, etc.), while (at least historically) GP(GPU)s only dealt with floating-point data.
What precision is used to implement the algorithms?
The financial analysis benchmarks use IEEE double/FP64 (64-bit double-precision) format, the standard higher-precision floating-point format used in computing, supported by most FPUs and modern GP(GPU)s. It provides "enough" precision for just about all algorithms, except where very large or very small numbers are processed and errors mount up.
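As a quick illustration of why precision matters, the toy Python snippet below (illustrative only, not Sandra's actual test code) accumulates the same sum with single- and double-precision rounding; the single-precision running sum drifts visibly once rounding errors mount up:

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Add 0.1 one million times; the exact answer is 100,000.
s32 = 0.0   # accumulator rounded to float/FP32 after every add
s64 = 0.0   # accumulator kept in double/FP64
step = to_f32(0.1)
for _ in range(1_000_000):
    s32 = to_f32(s32 + step)
    s64 += 0.1

print(s64)  # within a tiny fraction of 100,000
print(s32)  # visibly off: FP32 rounding errors have accumulated
```

The same effect applies to long iterative computations such as the Binomial and Monte Carlo models below, which is why double/FP64 is used throughout these benchmarks.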
If you are interested in single precision performance, please see Financial Analysis Performance (Float/FP32).
Note: OpenCL was used as it is supported by all GPUs/APUs tested. The tests are also available through CUDA, which provides better optimisations for nV hardware.
Hardware Specifications
Below are the GPUs and APUs we are comparing in this article:
| GPU / APU | Cores (CU) / (SP) | Normal / Turbo Speed | Memory / Speed | Registers / Const / Shared / Cache |
|---|---|---|---|---|
| nV GeForce GTX Titan (GK110 Kepler / CUDA 3.5) | 14 / 2688 (192 SP/CU) | 938MHz / 1.2GHz | 6GB 6GHz GDDR5 384-bit | 64k / 64kB / 16kB / 1.5MB |
| nV GeForce GTX 680 (GK104 Kepler / CUDA 3.0) | 8 / 1536 (192 SP/CU) | 1.1GHz / 1.2GHz | 2GB 6GHz GDDR5 256-bit | 32k / 64kB / 16kB / 512kB |
| nV GeForce GTX 660 TI (GK104 Kepler / CUDA 3.0) | 7 / 1344 (192 SP/CU) | 980MHz / 1.1GHz | 2GB 6GHz GDDR5 192-bit | 32k / 64kB / 16kB / 384kB |
| nV GeForce GTX 460 (Fermi / CUDA 2.0) | 7 / 224 (32 SP/CU) | 1.78GHz | 1GB 4GHz 256-bit | 16k / 64kB / 16kB / 512kB |
| AMD Radeon HD 7970 (Tahiti XT / GCN1) | 32 / 2048 (64 SP/CU) | 925MHz / 1GHz | 3GB 5.8GHz GDDR5 384-bit | 64k / 128kB / 64kB / 768kB |
| AMD Radeon HD 7850 (Pitcairn Pro / GCN2) | 16 / 1024 (64 SP/CU) | 860MHz | 2GB 4.8GHz GDDR5 256-bit | 64k / 128kB / 64kB / 512kB |
| AMD Radeon HD 7790 (GCN2) | 14 / 896 (64 SP/CU) | 1075MHz | 1GB 6GHz GDDR5 128-bit | 32k / 64kB / 32kB / 256kB |
| ATI Radeon HD 5870 (Cypress XT / VLIW5) | 20 / 1600 (80 SP/CU) | 850MHz | 1GB 4.8GHz GDDR5 256-bit | 16k / 64kB / 32kB / 512kB |

The Black-Scholes (European Options) Model
The Black-Scholes model is implemented using a closed-form expression that is relatively "simple"; this allows each option to be processed by a single thread, while the relatively small number of registers required allows large groups of threads to be executed on each compute unit, thus good CU utilisation.
Little memory is required (input and output parameters only), so a large number of options can be processed, up to device memory size; however, using doubles greatly increases register pressure, thus only smaller workgroups can be executed.
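For reference, the closed-form expression is simple enough to sketch in a few lines. The Python below is an illustrative scalar version (not the benchmark's OpenCL/CUDA kernel), using the error function for the normal CDF; on a GPU, one thread would evaluate this per option:

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """European call option price (Black-Scholes closed form).
    S: spot price, K: strike, T: years to expiry,
    r: risk-free rate, sigma: volatility."""
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# An at-the-money 1-year call; the textbook value is about 10.45
print(black_scholes_call(100.0, 100.0, 1.0, 0.05, 0.2))
```

Note the exp/log/sqrt/erf calls: these are exactly the "non-basic" functions whose GPU acceleration (and accuracy) was discussed above.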
| GPU | Black-Scholes (Double/FP64) | Comment |
|---|---|---|
| nV GeForce GTX Titan | 1.78 GOPT/s | Turning on "full FP64" allows the Titan to overtake its 7970 rival and greatly outperform the GTX 680. However, with 8x more DFP units, even better scaling might have been expected. Finally the Titan comes into its own! |
| nV GeForce GTX 680 | 0.47 GOPT/s | With 8x fewer DFP units than Titan (1/24 FP rate), performance is far lower: even a 2-year-old 5870 manages to outperform it. GK104 is just not suited to double/FP64 compute-intensive loads. Good scaling w.r.t. the 670/660 TI, and it does beat the 7850 soundly, but poor overall. |
| nV GeForce GTX 660 TI | 0.33 GOPT/s | Just manages to beat the GTX 460 and good value w.r.t. the GTX 680 (a 2-card system would outperform it), but not great overall. |
| nV GeForce GTX 460 | 0.29 GOPT/s | Not the slowest here, which is surprising considering "Fermi" had more DFP units per FP unit than "Kepler". However, "Fermi" has half the registers per SMU, which becomes a problem when using double/FP64. |
| AMD 7970 | 1.7 GOPT/s | Great result, even if outperformed by the Titan, a far more expensive card. Beats all other cards by a huge margin, ensuring that even the low-cost 7900 LE cannot be matched. |
| AMD 7850 | 0.2 GOPT/s | Very low performance compared to its bigger brother the 7970, and half the performance of a 2-year-old 5870. What we saw in Monte Carlo FP32 is thus no isolated incident: the 7850 is not half the 7970's performance but can be much, much slower. |
| ATI 5870 | 0.55 GOPT/s | Fantastic performance for a 2-year-old card, coming 3rd and beating far newer and more expensive cards (680, 7850, etc.) in this algorithm, just as in float/FP32. Its performance is thus no fluke; though old, this card can hold its own! |
Unlike Black-Scholes float/FP32, there are few surprises here, though the old cards do very well. Top-end cards show their worth by dominating double/FP64 loads by a large margin: you have to pay if you want great double/FP64 performance!

The Binomial (European Options) Model 
The Binomial tree model uses an extra parameter: the number of time steps (equally spaced) from current date to expiry date. The price of the option for each node/leaf of the binomial tree must be computed iteratively, and the price array then reduced. Shared memory is needed, which, being a very limited resource (usually 16kB), can only be used as a "buffer" for the much slower global memory.
The benchmark uses 1024 time steps, with the number of options based on device memory.
Each option is computed by an entire workgroup, with each thread in the group computing a time step. Binomial is thus a far more complex algorithm than Black-Scholes: it uses more registers per thread/group and stresses shared memory, taxing all but the latest GPUs. Using doubles puts even more register pressure on the GPU, thus only pretty small workgroups can be executed.
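To make the structure concrete, here is a minimal scalar sketch of such a binomial pricer in Python (a CRR-style tree; illustrative only, not Sandra's kernel). The backward per-step loop is what a workgroup parallelises, one thread per node, with the price array staged through shared memory:

```python
from math import exp, sqrt

def binomial_call(S, K, T, r, sigma, steps=1024):
    """European call priced on a recombining binomial tree.
    CRR parameterisation assumed -- illustrative only."""
    dt = T / steps
    u = exp(sigma * sqrt(dt))            # up-move factor
    d = 1.0 / u                          # down-move factor
    p = (exp(r * dt) - d) / (u - d)      # risk-neutral up probability
    disc = exp(-r * dt)
    # option payoff at every leaf of the tree (expiry)
    v = [max(S * u**j * d**(steps - j) - K, 0.0) for j in range(steps + 1)]
    # walk back to the root, reducing the price array one level per step
    for level in range(steps, 0, -1):
        v = [disc * (p * v[j + 1] + (1 - p) * v[j]) for j in range(level)]
    return v[0]

# converges towards the Black-Scholes closed-form value (about 10.45)
print(binomial_call(100.0, 100.0, 1.0, 0.05, 0.2))
```

The 1024-iteration reduction loop over a shared price array is what makes this model so much more register- and memory-hungry than Black-Scholes.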
For example, on CUDA 2.x devices 52 registers per thread are needed (!); with 32k registers per CU, we can "fit" only 576 threads, i.e. only 38% CU occupancy (1536 max)!
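That figure can be reproduced with simple arithmetic. The sketch below assumes registers are allocated per whole workgroup and a workgroup size of 192 threads (our assumption; the article does not state the group size used):

```python
# Back-of-envelope CU occupancy under register pressure (CUDA 2.x case).
regs_per_cu  = 32 * 1024   # registers available per compute unit
regs_per_thd = 52          # registers needed by each thread
group_size   = 192         # assumed threads per workgroup (hypothetical)
max_threads  = 1536        # maximum resident threads per CU

groups    = regs_per_cu // (group_size * regs_per_thd)  # whole groups that fit
threads   = groups * group_size
occupancy = threads / max_threads
print(threads, occupancy)  # 576 resident threads, i.e. 0.375 (~38%) occupancy
```

Since doubles roughly double the register count per thread, occupancy (and thus latency hiding) drops sharply in double/FP64 mode.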
| GPU | Binomial (Double/FP64) | Comment |
|---|---|---|
| nV GeForce GTX Titan | 83 kOPT/s | Even in "full FP64" mode the Titan cannot beat the 7970, with pretty poor scaling compared to the GTX 680. Much better performance was expected considering the huge number of DFP units. |
| nV GeForce GTX 680 | 45 kOPT/s | Good scaling w.r.t. the 670/660 TI (again!), with a 2-card system likely to match the Titan for far lower cost, if nV hardware is required. It does beat the 7850 soundly but is far slower than the 7970. |
| nV GeForce GTX 660 TI | 34 kOPT/s | Soundly outperforms the older GTX 460 even with fewer DFP units, showing that the memory improvements and higher number of registers per CU matter more than raw throughput, even for double/FP64. |
| nV GeForce GTX 460 | 21 kOPT/s | Slowest card here, surprising considering Fermi's performance elsewhere. |
| AMD 7970 | 112 kOPT/s | Fastest again by a large margin, soundly beating even the Titan and all other cards. Best by far. |
| AMD 7850 | 34 kOPT/s | About 1/3 the performance of the 7970; no catastrophic drop as in other algorithms, and it matches the 660 TI. |
| ATI 5870 | 23 kOPT/s | A decent result considering its age, managing to beat the GTX 460, but no match for modern cards. |
Binomial shows that modern GP(GPU)s have improved: they all beat their older equivalents. Double/FP64 almost doubles the number of registers required per thread, putting even higher pressure on the CU while moving 2x more memory per option. Top-end GP(GPU)s again show their power by greatly outperforming the lesser series.

The Monte Carlo (European Options) Model 
The Monte Carlo model uses a normally distributed sample (the number of paths) to estimate the price. Parallel reduction is used to compute the final price/confidence from the partial sums. For a small number of paths, shared memory can be used for efficient reduction (but is limited to 16kB), with each option computed by a workgroup and each thread in the group computing a path.
For a large number of paths, we have to use global memory to store the partial sums; thus we can use multiple groups per option, with the group size dependent on hardware resources (larger groups where more registers per thread are available).
The benchmark uses 32,000 simulation paths, with the number of options based on device memory.
Again, Monte Carlo is far more complex than Black-Scholes, using a similar number of registers to Binomial as well as fast shared memory, and thus performs best on the latest GPUs.
For example, on CUDA 2.x devices 52 registers per thread are needed (!); with 32k registers per CU, we can "fit" only 576 threads, i.e. only 38% CU occupancy (1536 max)!
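Here too, a minimal scalar sketch in Python shows the structure (illustrative only, not Sandra's kernel; on the GPU, each loop iteration below is one thread/path and the sum is performed by parallel reduction):

```python
import random
from math import exp, sqrt

def monte_carlo_call(S, K, T, r, sigma, paths=32_000, seed=42):
    """European call via Monte Carlo under geometric Brownian motion.
    Each path draws one standard normal sample and prices the terminal
    payoff; the payoffs are then averaged (the 'reduction' step)."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma * sigma) * T
    vol = sigma * sqrt(T)
    total = 0.0
    for _ in range(paths):
        z = rng.gauss(0.0, 1.0)   # one normally distributed sample per path
        total += max(S * exp(drift + vol * z) - K, 0.0)
    return exp(-r * T) * total / paths

# statistically close to the closed-form value (about 10.45) at 32,000 paths
print(monte_carlo_call(100.0, 100.0, 1.0, 0.05, 0.2))
```

Note that the per-path work is dominated by exp and the normal sampling, while the final accuracy depends on the precision of the accumulated sum: exactly the double/FP64 pressure points these benchmarks measure.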
| GPU | Monte Carlo (Double/FP64) | Comment |
|---|---|---|
| nV GeForce GTX Titan | 560 kOPT/s | Far better performance than the GTX 680 in "full FP64" mode but still not enough to outperform the 7970. If nV hardware is required, the Titan is hard to beat: not even 5 GTX 680 cards would match it. |
| nV GeForce GTX 680 | 92 kOPT/s | Good scaling w.r.t. the 670/660 TI (yet again), but the low number of DFP units makes it unsuitable for double/FP64 work. |
| nV GeForce GTX 660 TI | 67 kOPT/s | Narrowly beats the old GTX 460 (though with far fewer DFP units), which again shows the improvements in "Kepler" do help. It is, however, greatly outperformed by the 7850, and thus the 7900 LE, making it a poor choice. |
| nV GeForce GTX 460 | 65 kOPT/s | Slowest card here (just as in Binomial), showing that register pressure reduces CU utilisation and thus performance in complex algorithms. |
| AMD 7970 | 622 kOPT/s | Great performance, beating the Titan by a good margin (again). Best value by far: there's nothing this card cannot do! |
| AMD 7850 | 86 kOPT/s | Almost ties with the far more expensive GTX 680 but still loses to the 2-year-old 5870 and is about 7x slower than its bigger 7970 brother. A 7900 LE (or 7870 XT) would thus be better value. |
| ATI 5870 | 198 kOPT/s | A good showing by this very old card, more than 2x faster than the 7850 and coming 3rd yet again! Its double/FP64 performance was way ahead of its time, and even the improvements of newer mid-range hardware cannot overcome it. |
Like Binomial, Monte Carlo also favours modern GP(GPU)s, needing fast (shared) memory and far more registers per thread to prevent expensive "spills" to global memory, as well as fast caches for common/constant data. However, beyond a certain point, register pressure affects old and new cards alike, allowing older cards to perform better than expected (as CU utilisation is low for all cards).
Final Thoughts / Conclusions
Financial analysis algorithms are a good match for GP(GPU)s; however, double/FP64 was never used for graphics (such high precision is not required). Unlike FPUs, where going from float to double reduces performance by approximately 1/2, many GP(GPU)s have far fewer DFP units than FP units (e.g. "Kepler" has a 1/24 DFP/FP ratio), thus double/FP64 performance is far lower.
While performance is not likely to drop by the full unit ratio, 5-10x slower is a significant difference, requiring careful assessment of whether high precision is needed for the whole computation or just specific parts. High-end cards that contain more DFP units are thus greatly favoured.
Complex models (Binomial, Monte Carlo) add even more register pressure in double/FP64 mode, as much as 2x, which greatly reduces the number of threads that can execute on a CU and thus the workgroup size.
As the data for an option (input, output, shared, etc.) is 2x as large in double/FP64 mode, the stress on the memory and cache subsystems is also higher, again favouring high-end cards with faster memory.
AMD's GCN 7900 series outperforms the top-end Titan in the complex algorithms (Binomial, Monte Carlo) by a good margin, while scoring slightly lower in the simpler one (Black-Scholes), and thus represents tremendous value. GCN does very well in double/FP64 mode too: a true best buy.
AMD's GCN 7800 series does pretty well against the competition and is thus still good value, but it is much slower than the 7900 and sometimes beaten even by a 5870; again, paying more for a 7900 LE (aka 7870 XT) is worth it.
nV's GTX Titan finally shows its true power in "full FP64" mode, trading blows with the 7900, but its high cost ultimately counts against it. However, if CUDA or nV hardware is required, it delivers tremendous performance that the older GK104 cards cannot match.
nV's GTX 680, with only 1/24-rate DFP units, has double/FP64 performance far too low to be good value, with 5 cards needed to match the Titan. It is great for float/FP32 workloads, though; if that precision is sufficient, the picture changes completely.
nV's GTX 660 TI, just like the GTX 680, is not suited to double/FP64 workloads, with 8-10 cards needed to match the Titan!
If you are interested in single precision performance, please see Financial Analysis Performance (Float/FP32).
Help us Help you!
Please remember that, unlike review sites, we buy our own hardware, so we cannot afford to test every device on the market. Please test your GP(GPU) and submit your scores to the Ranker: this allows us to provide you and other users with aggregated scores from other GP(GPU)s, as well as to certify whether your scores are statistically valid.
