Why did you need to build the Bayes Analytic engine when so many analytic engines are available on the market?
The simple answer is that I had specific requirements and could not meet them with the software available to me. The business answer is to provide a unique competitive advantage to the hedge funds we partner with.
The default was using readily available trading software. I started with software supplied by the brokers, but it was designed for typical investors rather than engineers: it was simplistic and took over half an hour to run even a simple analysis across a few years of 1-minute bar data. It also did not include any sort of optimizer. The built-in back-testing systems were primitive, and I could not specify the kind of complex relationships I wanted to use for trade triggers. They did have one distinct advantage in that the code for the most popular indicators was already embedded, but adding new custom indicators was either impossible, extremely cumbersome, or very slow. There were existing trading engines that could be scripted, including one with .NET hooks, but they were fairly expensive, and every one I researched had performance complaints and outdated run-time complaints in their blogs, so the risk did not seem worth it. There were quite a few complaints that one of the more popular systems had not released a 64-bit version, so it was still trapped in the 2 GB memory limit. The .NET version was very rigid in the object model it exposed and slow to deliver data, so it was impractical for anything using genetic optimizers, which require many passes.
What about existing analytic and inference engines?
I looked at several open-source engines for machine learning, analytics, and artificial intelligence, but after prototyping with the first three I found they were not designed to answer questions in the form I desired. I was fairly confident my logical approach was sound, so I was unwilling to compromise. Even when I wrote vastly simplified versions that fit the design intent of their engines, I found that every library I tested took hours to accomplish what I needed done in seconds, or at most minutes. In addition, two of the three ran out of memory on an 8 GB box unless I reduced the data set.
I have had some successful projects using compiled Prolog, which is now available in 64-bit, and it seemed well suited to the task. I ultimately did not choose Prolog because I was uncomfortable with the back-tracking overhead at this volume of data. The project's run-time cost is heavily weighted toward feature computation, where Prolog would add no value. The high-speed 64-bit compiler was not free, and the company seemed to be struggling, so it was not a good bet to base my technology on, especially if an acquisition was anywhere in the future.
The Decision to build Bayes Analytic
I ultimately decided I would have to build a new engine if I wanted to fully express the ideas I had developed while analyzing stock patterns over the prior two years. I have 25 years of experience building algorithmically complex engines to solve equally difficult problems, so there may have been a bit of the interesting-challenge factor involved as well.
I have historically been well compensated because I tend to drill down to the very core and learn how to solve these problems from the ground up. This allows me to design and deliver fast, novel solutions my customers have appreciated. These unique skills have repeatedly been instrumental in obtaining promotions and contracts, so a new engine was a good skills investment.
Additional decision factors:
- Any commercial runtime I used would require that I pay its fees for every new customer. After having Oracle and Progress kill a few deals for me in the past, this is not a risk to be taken lightly.
- I thought there was a good chance of selling the skills for building an engine from the bottom up at my normal rates, while consulting rates for customizing existing engines seemed to be dropping.
- If I ended up with something valuable and salable, I didn't want to be encumbered by core IP received from other products.
- If I built on top of another company's engine I was largely trapped. I have been forced multiple times to rebuild systems from scratch late in a project when proprietary engines failed to deliver the reliability, scalability, or speed required. Not all full run-time engines are flaky, but you cannot fix these problems yourself and are at the mercy of the vendor to solve them. I once had a Microsoft bug in the PocketPC BIOS cost me a 250K deal I really needed because it took them 6 months to release a patch. Large run-times from smaller companies are even riskier.
- In past projects I have used several proprietary 4GL inference and decision-support engines, and every customer had performance issues when I was called in. In some instances I was forced to rewrite core components in C to meet the project goals.
What about R? It is built for statistics, after all
A college professor I knew was very excited about using R for exactly this problem, and he had enough background in the field for me to take him seriously. From all outward appearances R looked great, and it was quite powerful for answering ad hoc questions, but as I started trying to use it to apply complex statistical inference across hundreds of features it rapidly became clear that R was simply not designed for this level of complexity.
I built an early version of the Bayes Analytic system in R. I found the R interpreter to be about 500 times too slow. If you could do everything you needed in matrix operations it was very powerful, but as soon as you needed to reference the contents of one row from another row in a rolling window, it had to be done with array slices or with procedural indexing, both of which were very slow. The interpreter itself was slow, and while the slices could be faster when you applied matrix operations to them, they consumed horrible amounts of memory. By the time I was a tenth of the way through the algorithm I gave up and went back to a more traditional language. I also found the R runtime insufficiently stable to use as a production component unless you could keep humans in the loop to restart the VM, though that may have been because of the amount of memory it was using. The amount of data I am using now is at least 50 times more than I had when I bailed on R.
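The rolling-window tradeoff described above can be sketched in a few lines. This is a hypothetical illustration in Python/NumPy, not the original R code; the feature (ratio of each bar's close to the close N bars earlier), the variable names, and the data are all invented for the example. It shows the two options the text mentions: a per-row interpreted loop, and a vectorized shifted slice that does the same work in one operation at the cost of building intermediate array copies.

```python
import numpy as np

# Hypothetical feature: ratio of each close price to the close N bars earlier.
N = 5
close = np.array([10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3, 11.1])

# Option 1 -- procedural indexing: one interpreted loop iteration per row.
# This is the pattern that is prohibitively slow in a pure interpreter.
ratio_loop = np.full(close.shape, np.nan)
for i in range(N, len(close)):
    ratio_loop[i] = close[i] / close[i - N]

# Option 2 -- shifted array slices: one vectorized operation replaces the
# loop, but each slice materializes an intermediate copy of the data,
# which is where the memory cost comes from on large data sets.
ratio_vec = np.full(close.shape, np.nan)
ratio_vec[N:] = close[N:] / close[:-N]

# Both approaches compute the same feature values.
assert np.allclose(ratio_loop[N:], ratio_vec[N:])
```

On a few years of 1-minute bars the loop form runs millions of interpreted iterations per feature, which is the kind of overhead being described; the slice form trades that CPU cost for extra memory.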
There is always a chance an R expert could have done better, but I tried quite hard to make it work, and the people I interacted with on the user group were surprised I accomplished as much as I did without buying the expensive commercial R versions. After talking at length with the technical experts at a couple of the R vendors, I was under the impression that my desired design approach was simply outside the design intent of the system. Their multi-core distributed system with a compiler might have been able to reach my minimum performance criteria, but it would have been a large investment with considerable risk. I have gambled on several similar 4GL systems in the past and ended up spending months of my own time, on my own dime, rewriting in C after their engines failed, so perhaps I am a bit of a skeptic.
What about SPSS?
IBM has built a large set of libraries targeted at a number of risk-modeling domains in SPSS, and it has both a macro language and a scripting language that seem vaguely similar to BASIC. Based on the companies it has been applied to, I would guess it could tackle the data size I needed, but it was unclear whether it could deliver the turnaround time.
Bayes Analytic started as a personal project, which limited my viable out-of-pocket expense. I didn't give SPSS serious consideration because I was unwilling to use any environment that charged for run-times. Even if I were willing to pay the run-time fees once we reached commercial-scale production, I certainly was not willing to pay them during development, testing, and market creation.