Why Language Performance Matters & Some Measurements

An optimization pass that takes 10 minutes in one language could take 15 hours in the slower language.

I work from the philosophy that I want the highest performance I can afford, but this is traded off against development costs and delivery time, because a great solution six months later may be less valuable than a good enough solution six months sooner.

We must also account for maintenance costs and the flexibility to change the system to meet new and emerging business requirements. At one extreme I could deliver the highest performance with hand-optimized assembler, but it would take so long to deliver that we might never realize the business value. At the other extreme, if the system was delivered immediately but could not perform fast enough, or could not be changed to meet changing requirements and deliver a competitive advantage, then it also doesn't deliver the desired business value.

Ultimately it all works out to a trade-off where performance is better than adequate but I can deliver a working system flexible enough to grow and adapt with the fewest development and maintenance man-hours.

I will adapt this slightly and assert that if the environment is so slow that I spend a majority of our best engineering hours trying to make it fast enough, rather than focusing on new features, then it dramatically reduces the ROI from the system. I want the underlying language and technical infrastructure to be fast enough that I can invest the maximum development time delivering business value rather than struggling to keep it fast enough.

I still believe proper design, architecture, coding and proper stress testing are just as important as language and run-time selection for large-scale systems. This goes hand in hand with retaining enough very senior people to ensure adequate quality.

Continuous Optimization in Machine Learning consumes a lot of CPU

In the Bayes Analytic Engine we face a continuous optimization problem where the stock prediction engine is always seeking to improve its performance based on current and historical market conditions. Unlike most eCommerce work, this requires a continuous compute load. Our training set is millions of transactions per day and we need the ability to predict in near real time. A simple change from the optimizer may require partial re-training of the engine, and we may need thousands to tens of thousands of these changes to find the most optimal configuration, which itself changes on a minute-to-minute basis. As a result we may need to run through relatively complex data sets thousands of times only to have rapidly changing market conditions require new optimizations. Due to the number of repeated iterations and the continuous load, speed is king.

Speed = trading profits

Our ability to trade stocks on the data is directly related to how fast the stock analytic engine can work through all this data and spit out recommendations. We obviously implemented design-level enhancements to maximize performance, such as caching partial computations that are repeated, but in the end it ultimately boils down to how fast the system can run. We can always run it in the cloud, but there are cost trade-offs there, and we found, at least in this instance, that since we have a near-continuous optimization problem it would cost us nearly as much every month to consume the cloud resources as it cost us to buy equivalent compute power.

Machine learning on large data sets is CPU and RAM intensive

Many machine learning and related AI (Artificial Intelligence) methods effectively use an approach where they change one or more parameters and then run a pass to see if that change improved the ability to meet a goal, and each pass may require recomputing intermediate variables for the entire result set. This is the essence of genetic algorithms. It is easy for small data sets with simple relationships, but when you have millions of inputs with complex relationships that require many more intermediate calculations it becomes challenging.
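To make the shape of this loop concrete, here is a minimal Go sketch of the change-a-parameter-and-re-score cycle. The evaluate function, parameter vector and training data are hypothetical stand-ins, and a real genetic algorithm would maintain a population with crossover rather than this simple hill-climb; the point is only that every candidate change forces another full pass over the data.

```go
package main

import (
	"fmt"
	"math/rand"
)

// evaluate re-runs a full pass over the training data for one candidate
// parameter set and returns a fitness score (hypothetical placeholder).
func evaluate(params []float64, trainingSet [][]float64) float64 {
	score := 0.0
	for _, row := range trainingSet {
		for i, p := range params {
			if i < len(row) {
				score += p * row[i] // stand-in for the real intermediate computations
			}
		}
	}
	return score
}

func main() {
	trainingSet := [][]float64{{1, 2, 3}, {4, 5, 6}} // placeholder data
	best := []float64{0.5, 0.5, 0.5}
	bestScore := evaluate(best, trainingSet)

	// Each iteration perturbs one parameter and re-runs the full pass.
	// With millions of rows and thousands of iterations this inner
	// evaluate() call is where raw language speed dominates.
	for iter := 0; iter < 10000; iter++ {
		candidate := append([]float64(nil), best...)
		i := rand.Intn(len(candidate))
		candidate[i] += (rand.Float64() - 0.5) * 0.1
		if s := evaluate(candidate, trainingSet); s > bestScore {
			best, bestScore = candidate, s
		}
	}
	fmt.Println("best score:", bestScore)
}
```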

Many of the test cases for AI algorithms are published with relatively small data sets typical of an email spam filter. Our data set is orders of magnitude larger and its rate of change is high. When I tried to use various reference implementations from common AI toolkits, especially those in Python and JavaScript, they simply broke down and either ran a 30 GB machine out of RAM or ran for days to accomplish what we needed in seconds. Java was an interesting case: the best algorithms in Java performed pretty well, but there seems to be more poorly written code in common Java libraries, which performed worse than the Python code. I guess this proves you should not always blame the run-time or language for performance issues.

Some languages 90 times slower

One of our common requirements was to load the 1-minute or 10-minute pricing data and compute one or more indicators such as an SMA (Simple Moving Average), Distribution Rank or RSI for multiple periods such as 10 minute, 30 minute, 60 minute, 90 minute, 1/2 day, 1 day, 2 day, 10 day and 300 day. These are sub-tasks the optimizer runs, but it must run them millions of times. I used this task as a placeholder with the assertion that if a representative algorithm was 90 times slower to run in a given language, then any larger system which used similar algorithms would also be 90 times slower. The language-specific testing ultimately showed that a task which required 10 minutes in one language would take 15 hours in the slower language.
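For illustration, here is a minimal Go sketch of the kind of rolling SMA the optimizer re-runs millions of times; the period is expressed in bars, so a 10-minute SMA over 1-minute bars would use a period of 10. This is not the Bayes Analytic implementation, only the shape of the work.

```go
package main

import "fmt"

// sma computes a simple moving average series over a price slice using a
// rolling sum; period is in bars (e.g. 10 one-minute bars for a 10-minute SMA).
func sma(prices []float64, period int) []float64 {
	out := make([]float64, 0, len(prices))
	sum := 0.0
	for i, p := range prices {
		sum += p
		if i >= period {
			sum -= prices[i-period] // drop the bar that just left the window
		}
		if i >= period-1 {
			out = append(out, sum/float64(period))
		}
	}
	return out
}

func main() {
	prices := []float64{10, 11, 12, 13, 14, 15} // placeholder bar closes
	fmt.Println(sma(prices, 3))                 // [11 12 13 14]
}
```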

Across the set of languages I saw differences large enough that some languages could have required weeks to complete what another language could complete in less than an hour. Some of this can be overcome with more sophisticated algorithms and distributed computing, but these require more effort, more code, more maintenance, higher complexity and more talented engineers at every phase of the SDLC. The industry largely ignores this issue, but companies keep hiring me to solve performance-related problems, so performance must still be important.

Raw speed alternatives

There are several ways to work around language speed issues.

  1. Better algorithms
  2. Caching pre-computed partial answers
  3. Buy faster hardware
  4. Use distributed computing to share the load
  5. Reduce memory management overhead
  6. Simplify or recast the problem to allow slower response

I use all of these at various times but they always have cost and ROI impacts. I assert that our overall goal is to maximize project ROI, which is largely driven by run-time costs, development costs, development timeline and profits delivered. Each of these techniques has a cost. A faster language / run-time can reduce the total investment required to implement these optimization techniques.

Better Algorithms – The perfect example of this is using a binary search on a sorted data set rather than a simple linear scan. The binary search can save a lot of time but it carries the extra overhead of keeping the data set sorted. In modern environments we receive a large number of libraries that provide many functions in a semi-optimized fashion; for example, I would have a difficult time building a hash table implementation that is faster than what comes built into Python's dictionary. This means we focus our algorithm optimization at the next layer of complexity and depend on the standard libraries providing a good enough implementation. This is normally a viable assertion, but I have found a number of libraries, especially the streaming IO libraries in GO, where the standard libraries deliver horrible performance.
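As a small illustration of the binary search trade-off, here is a Go sketch using the standard library's sort.SearchInts; the cost it hides is that the slice must be kept sorted between look-ups.

```go
package main

import (
	"fmt"
	"sort"
)

// containsLinear is the simple O(n) scan.
func containsLinear(xs []int, target int) bool {
	for _, x := range xs {
		if x == target {
			return true
		}
	}
	return false
}

// containsBinary uses the standard library's binary search (O(log n));
// the trade-off is that the slice must be kept sorted.
func containsBinary(sorted []int, target int) bool {
	i := sort.SearchInts(sorted, target)
	return i < len(sorted) && sorted[i] == target
}

func main() {
	xs := []int{3, 9, 14, 27, 51, 88} // already sorted
	fmt.Println(containsLinear(xs, 27), containsBinary(xs, 27)) // true true
}
```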

I always try to use the best algorithm I know about and which I can implement cost-effectively. Choosing how much to invest in the performance of any single piece of code is as much an art as a science. Moving to faster and more sophisticated algorithms can require additional investment in code, and there is a trade-off between the sophistication of the algorithm versus cost, ROI and maintainability.

I can choose to improve any algorithm where I find a performance hot spot, and this is true for all languages. Some programmers use this as an excuse to not worry about performance, but this can have profound impacts on the release timeline and total project cost as the weight of these decisions accumulates and becomes highly visible as inadequate performance late in the project development cycle, when it is expensive to fix. I try to walk a line: if I suspect a given piece of code will be performance critical, I at least document how to do it faster, and I put extra effort into ensuring it has good timing metrics at the module level so we can see the need for optimization early in the SDLC (Software Development Life Cycle), when it is less expensive.
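As an example of baking timing metrics into module-level tests, here is a sketch using Go's standard benchmark support. loadBars and makeSampleBars are hypothetical stand-ins for one module's entry point and test-data generator; the file would live alongside the module as something like bars_test.go and run with go test -bench=., with results compared release over release.

```go
package bars

import "testing"

// makeSampleBars is a hypothetical generator, sized near production volume.
func makeSampleBars(n int) []float64 {
	out := make([]float64, n)
	for i := range out {
		out[i] = float64(i)
	}
	return out
}

// loadBars is a hypothetical stand-in for the module entry point under test.
func loadBars(data []float64) float64 {
	sum := 0.0
	for _, v := range data {
		sum += v
	}
	return sum
}

// BenchmarkLoadBars gives a per-release timing metric for this module so
// degradation is visible early in the SDLC rather than late in the project.
func BenchmarkLoadBars(b *testing.B) {
	data := makeSampleBars(100000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		loadBars(data)
	}
}
```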

One side effect of applying sophisticated algorithms and ultra-clever designs to obtain performance is that they can be non-obvious to other engineers, especially those who have never been involved in high performance computing. This is a significant risk for the long-term ROI because it increases the risk of esoteric flaws being introduced. It also increases the risk of the performance-critical areas of the code getting ripped out by future maintenance programmers because they didn't understand them.

A faster core language reduces the number of places where I must invest in rigorously optimized algorithms, which facilitates faster delivery of features, reduced code volume and lower code complexity, all of which improve the ROI of the overall project.

Sloppy engineering can easily destroy any run-time benefits obtained from language speed. I have seen hand-written C systems which were super fast brought to their knees by a few sloppy engineers who didn't pay attention to the impact of their changes. I estimate that in 30% of the projects where I have been engaged to fix a failing system, the core design was sound but maintenance by sloppy or junior engineers overwhelmed what was originally a great design.

Many times I could fix these performance issues with changes to less than 5% of the code base, but it is super expensive to find this problem late in the game. Good module-level stress and performance tests with full data volumes help, but buying better people can provide a substantial boost to the total project ROI.

Caching pre-computed partial answers – This could actually be a sub-class of better algorithm design but is used so heavily that it deserves its own treatment. These computations can either be done and saved on disk in a format that is fast to reload, or stored in RAM. The main advantage comes when looking up the answer is a lot faster than re-performing the computation, which is not as obvious with modern hardware because paging in memory not already cached in the CPU can cost hundreds of CPU cycles. Caching also consumes RAM for each answer, and unless the cache is a dense matrix this RAM consumption is coupled with the overhead of looking up the answer based on the input keys. I heavily use this technique but am limited because one thing the ML optimization algorithms do is change a parameter and try again; if the changed parameter invalidates the cache then the overhead of building the cache is thrown away. I have found places where cached computation has delivered a 50X speedup, and have also taken the time to implement a cache only to find that its memory and look-up overhead caused a net slowdown and increased memory cost. Automated module-level performance testing is essential to know where to implement caches. The performance impact of caches should be built into the module-level tests; when they do not deliver enough benefit they should be stripped out, because they can also introduce esoteric bugs. Do not use junior people to do this work, as minor logical errors here can have multi-year cost and ROI impacts.
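A minimal Go sketch of this kind of cache, keyed by the parameters that produced the answer; the key fields and indicator are hypothetical. Note that a parameter change that is not part of the key would silently serve stale results, while one that is part of the key simply misses, throwing away the work of building the earlier entry.

```go
package main

import "fmt"

// smaKey identifies one cached partial answer; anything the optimizer
// changes must be part of the key or stale results will be served.
type smaKey struct {
	symbol string
	period int
}

type smaCache struct {
	entries map[smaKey][]float64
}

func newSMACache() *smaCache {
	return &smaCache{entries: make(map[smaKey][]float64)}
}

// get returns the cached series or computes and stores it. The cache only
// pays off while this map lookup stays cheaper than re-running compute().
func (c *smaCache) get(k smaKey, compute func() []float64) []float64 {
	if v, ok := c.entries[k]; ok {
		return v
	}
	v := compute()
	c.entries[k] = v
	return v
}

func main() {
	cache := newSMACache()
	series := cache.get(smaKey{"AAPL", 10}, func() []float64 {
		return []float64{11, 12, 13} // stand-in for the expensive computation
	})
	fmt.Println(series)
}
```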

Buy faster hardware – This is a favorite industry-wide and I will admit to having used it many times. It is relatively inexpensive to reach a certain performance bar, but the cost curve steepens rapidly and could quickly exceed the entire IT budget if not used judiciously. In addition, even a computer that runs 3 times faster is still much slower than the same computer with the same algorithm running in a run-time that is 90 times faster. The rate of speed increase from CPUs is slowing, so it is relatively inexpensive to reach a quad-core i7 at 4 gigahertz in commodity-grade systems, but to double performance again can quickly increase costs by several times. When combined with distributed computing, one key ROI strategy is keeping the core systems on commodity-grade hardware which can be scaled up cost-effectively. I will admit that we run our Bayes Analytic instance on the fastest 8-core system with the maximum RAM we could stuff into the motherboard, and that we use high-speed SSDs to increase the rate we can move data through the system. We pushed right to the edge of what we could buy using commodity hardware, but we are already at the point where going to the next grade of faster hardware would cost 10 times more while delivering minor performance improvements.

Use Distributed computing to share the load – This is one of my favorite tricks, which also allows me to solve fault tolerance problems using fleets of toaster-style systems each doing a portion of the work. There are three main ways to use this:

#1 – Web Services or SOA, where many different servers do a portion of the job and contribute answers back to other servers, which eventually have all the data they need to respond to the user or generate the outputs needed. This is a very common and powerful architecture which allows each portion of the system to be implemented in different languages so the best tool for the local job can be used. It also allows code streams to be isolated, presenting less risk of a change in one component breaking another component. This can provide good performance or horrible performance depending on the implementation, but at the very least it will add some latency, network data transfer and parsing overhead at each layer. If a system is very well designed so the coordinator layer can make many downstream calls in parallel and assemble the results, it becomes a form of parallel computing. Unfortunately 95% of all the systems I have seen, including those in market-leading companies, use more serial approaches where they receive the full answer from one component, parse the results, do some local processing and then call the next component. One of my past employers made over 3,000 distributed calls to paint one web screen, many of which were done in series, and was forever battling performance issues. I do use this approach but try to minimize the distributed hops, because the amount of data we deal with would require a lot of coding to make re-establishing context in downstream nodes fast and efficient; by the time we were done, this plumbing layer might exceed the code volume in our core engine, so judicious use is required.
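A minimal Go sketch of the parallel coordinator pattern described above: fire every downstream call at once and assemble the results, so total latency approaches the slowest call rather than the sum of all calls. fetchPart is a hypothetical stand-in for one downstream service call.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchPart stands in for an HTTP/RPC call to one downstream component.
func fetchPart(name string) string {
	return "result-from-" + name
}

func main() {
	parts := []string{"pricing", "inventory", "recommendations"}
	results := make([]string, len(parts))

	var wg sync.WaitGroup
	for i, p := range parts {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			results[i] = fetchPart(p) // all downstream calls run concurrently
		}(i, p)
	}
	wg.Wait() // total latency ~ slowest call, not the sum of all calls

	fmt.Println(results)
}
```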

#2 – Parallel fleets – An approach I have used heavily in eCommerce, search and indexing work is to have many nodes that do similar jobs end to end, where each node can be used as part of a fleet to service requests. Both of these are useful techniques, but they also represent roughly linear cost constraints, and the second approach still requires that each node can complete the work fast enough to deliver responses to the user in a timely fashion. I use this approach mostly for web front ends, search and indexing, where I need to add redundancy for fault tolerance anyway. This can actually be considered a form of #1 above. I tend to use it as the first strategy when I own the sub-system, such as search, and need to rapidly deliver N-scale capability and very high fault tolerance. It does require considerable coding and infrastructure overhead to ensure that each node has access to the data required. It also requires considerable effort and expertise to ensure that no shared resource is used between the nodes, because that shared resource will represent a key failure point and a key limit on scaling. At least 50% of the implementations I have found that use this approach and are still facing reliability and/or performance problems have violated the no-shared-component rule, normally at the database or SAN level. In the end you still have a fundamental rule: your layer of nodes must be able to execute their code path fast enough that they can deliver an answer to the next layer and still leave enough time to deliver the end result to the user in reasonable time. In N-tier SOA architectures this means even major layers like search engines need sub-500ms response times, and sometimes a lot lower; in my last metadata retrieval layer we needed to provide average worst-case response times under 20ms for 50 records and average response times under 5ms. Depending on the complexity of the job, a fast run-time can make it easier to meet these goals.

#3 – Parallel distributed computing – This is a powerful technique where the job is divided up, separate pieces of the work are sent to different nodes to work on in parallel, and the results are re-assembled at the end or at mid-points along the way. Typical examples of this are Hadoop map-reduce and the earlier Yahoo Pipes, where some pipes can run simultaneously. We use this technique in multiple locations: we have a layer which focuses on interacting with external servers to obtain new purchase data, while another layer is simultaneously updating the training data set, and yet another layer is using the training data to develop the predictions. We also have the ability to run the analysis for different symbols on different nodes simultaneously, and we can run the analysis and training steps for the same symbol with different goals simultaneously on distributed nodes. The next step would be to break the primary analysis step up so we could spread the work for a single symbol across multiple CPUs, possibly using a map-reduce. This step is feasible, but it would require substantial investment in code and infrastructure to make robust, since it would need to tolerate node failures and re-dispatch work, and our first round of tests showed a raw slowdown per CPU in map-reduce of over 500% compared to our existing implementation, which means we would have to consume 5 times more hardware before we broke even on speed. Net net, we do not believe that breaking this task out to the next smaller granularity would yield sufficient ROI to make that investment. Due to the high speed of the core algorithms and designs it is probably adequate to spread the symbols across multiple nodes. There is very interesting research in the universities to allow more automatic task breakout across massive compute clusters. We will follow this work with interest but it does not yet appear to be viable for our use.
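A rough Go sketch of the layered arrangement described above, with ingestion, training-update and prediction stages running concurrently and connected by channels; the stage bodies and symbols are placeholders, and a real deployment would put these layers on separate nodes rather than goroutines.

```go
package main

import "fmt"

func main() {
	raw := make(chan string, 8)
	training := make(chan string, 8)

	go func() { // ingestion layer: talks to external servers for new data
		for _, sym := range []string{"AAPL", "MSFT", "GOOG"} {
			raw <- sym + ":new-bars"
		}
		close(raw)
	}()

	go func() { // training-update layer: merges new data into the training set
		for rec := range raw {
			training <- rec + ":merged"
		}
		close(training)
	}()

	for rec := range training { // prediction layer: consumes the updated training data
		fmt.Println("predict on", rec)
	}
}
```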

Reduce Memory Management Overhead

This could be considered part of better algorithms, but it is such a large contributor to performance issues in applications I have been asked to fix that it is worth some discussion on its own. One of the interesting aspects of the Java, Python, Node and .NET environments is that every object generally requires a separate memory allocation. In the small, these memory allocations are normally fast, but if you have enough of them the system works harder and harder to deliver the required memory and eventually must stop and do some cleanup (garbage collection) to obtain more. I have seen object allocation abuse take a system that performed well in one version of software and turn it into completely abysmal performance a few months later. This problem started with the C++ programmers and has become widespread over time. Even in fully garbage-collected languages like Java it can yield dramatic performance differences. One of the challenges in Java is that many of the popular and built-in libraries indiscriminately allocate memory even if the application program using them has been careful. Depending on when this problem is caught, the fix can be fast or nearly impossible; the longer it goes, the more expensive fixing it becomes.

One of the best ways to control the issue is with careful module-level stress and performance tests which show degradation early, while it is still relatively easy to fix.

A good example of this was an analysis system I built a few years ago. In the first approach we used a line-by-line parsing strategy which created a collection of records, each with about 9 separately allocated attributes, which for a several-million-record input set ran about 20 minutes. I modified the algorithm to pre-scan the input set with zero allocations so it knew enough to allocate the memory in larger chunks, and brought the load time down to 2 minutes for a 10X improvement. To obtain more improvement I dropped back to some tricks popular in the K&R C days: after pre-scanning the records, rather than allocating a large number of record objects we flipped the memory structure around into arrays of simple fields, such as an array of integers, so rather than millions of allocations we were down to a few hundred. This brought us from 2 minutes down to 22 seconds, so we were 54 times faster. The next speedup required replacing the line-by-line read with a block read, where I scanned for the line tokens and then scanned each line to find the field boundaries and stored them in a re-usable integer array, which eliminated dozens of memory allocations per record. I then parsed each line's fields using the lowest-level primitive parse functions directly into their indexed field positions, eliminating several automatic block-level allocations. This change brought us from 22 seconds down to 4 seconds, over a 300 times speedup overall. I tested this strategy in a range of languages: C#, Java, Go, Node.js (JavaScript) and Python. It yielded the greatest benefits in C# and GO, with good results in Java. It actually slowed down Node.js because it treats array index operations as more expensive hash look-ups rather than simple computed memory positions.
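A rough Go sketch of the block-read idea, assuming a comma-separated bar file with the close price in the 5th column; the layout and function name are hypothetical. The whole file is read in one block, lines and fields are located by scanning bytes, and values go straight into a pre-sized column slice. A custom byte-level numeric parser would remove the remaining per-field string conversion, in the spirit of the lowest-level primitive parse functions mentioned above.

```go
package bars

import (
	"bytes"
	"os"
	"strconv"
)

// loadCloseColumn reads the file in one block and extracts the assumed
// close-price column into a pre-allocated slice, avoiding per-line record
// objects and streaming line IO.
func loadCloseColumn(path string, approxRows int) ([]float64, error) {
	data, err := os.ReadFile(path) // one block read instead of line-by-line IO
	if err != nil {
		return nil, err
	}
	closes := make([]float64, 0, approxRows) // pre-sized to avoid repeated regrowth

	for len(data) > 0 {
		nl := bytes.IndexByte(data, '\n')
		line := data
		if nl >= 0 {
			line, data = data[:nl], data[nl+1:]
		} else {
			data = nil
		}
		// walk the line once to reach the assumed close-price column (5th field)
		field, col := line, 0
		for col < 4 {
			c := bytes.IndexByte(field, ',')
			if c < 0 {
				break
			}
			field, col = field[c+1:], col+1
		}
		if c := bytes.IndexByte(field, ','); c >= 0 {
			field = field[:c]
		}
		if v, err := strconv.ParseFloat(string(bytes.TrimSpace(field)), 64); err == nil {
			closes = append(closes, v) // malformed or header lines are simply skipped
		}
	}
	return closes, nil
}
```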

As this shows, memory allocation can be very expensive: with the same machine and run-time performance it can cause a 300X speed degradation. Not every application needs this kind of optimization, but it is important to have people with these skills available when you encounter the issue. One of my favorite training tactics for 5th-year engineers is to give them a system which I suspect has this problem and force them to go through the entire optimization exercise. I have found this quite successful at preventing most of these engineers from re-creating memory allocation nightmares. One interesting aspect is that during these and other tests I re-confirmed that not all garbage collection systems are equal, and that some VMs (virtual machines) which are very fast before collection starts can become the slowest over longer periods of stress. This is one reason I like to see module-level stress tests run with sample data sets that are at least 300% larger than production data sets and which can be configured to run for multiple days, replaying real production queries when possible.

This entire set of changes required modifying less than 10% of the code of the entire system, but I have been involved in other systems where the problem was so widespread and diffuse that it required touching over 80% of the code, where it may have been cheaper to scrap the system and start over.

Simplify or Recast the problem to allow slower response – I will admit to having used this more than once, but it is generally undesirable since part of the job of both principal architects and engineering directors is to negotiate the customer feature set down to the minimum needed to deliver the business ROI before the project starts. I prefer to recast the sub-problems and use the other tactics mentioned in this white paper wherever possible. I can normally solve this by breaking the task apart so I can execute with fleets operating in parallel, but sometimes time-to-market and cost constraints make this the only viable choice. Having a very fast language obviously helps.

 

Bar file parser comparison in Node, Python, Go, F# and Java (Feb 2013)

This code is a fairly simple test case designed to mimic some functionality that is heavily used in the Bayes Analytic Engine. It parses a simple file line by line, extracts 5 numeric fields into an output array, and then computes a common statistical function, the SMA (Simple Moving Average). The sample data tested was 96,964 lines totaling 5.96 MB.

Two approaches were used. #1 is the most obvious: an array of objects. The other is a set of parallel arrays, each containing all the values for a single column from the CSV. The second approach was used to see if it would make statistical functions, which often work on a single column, faster because they move through contiguous memory. The parallel-vector approach seems to be faster to load and faster to calculate than individual records for most languages that support allocating large blocks of contiguous memory, such as C and GO.
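The two layouts, sketched in Go for illustration; the field names are hypothetical stand-ins for the five parsed numeric columns.

```go
package bars

// Layout #1: an array of record objects, one per bar.
type Bar struct {
	Open, High, Low, Close, Volume float64
}

// Layout #2: parallel arrays, one contiguous slice per column.
type Columns struct {
	Open, High, Low, Close, Volume []float64
}

// A column-wise statistic only touches one contiguous slice in the
// parallel-array layout, which is why it tends to load and compute faster
// in languages that allocate large contiguous blocks such as C and GO.
func avgClose(c Columns) float64 {
	if len(c.Close) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range c.Close {
		sum += v
	}
	return sum / float64(len(c.Close))
}
```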

The Microsoft F# solution provides a unique combination of a reasonable slowdown (9.7X) coupled with the highest reduction in code, even when writing for maintenance. My main concern with F# is that it can be cryptic to read and has a steep learning curve. F# is a strange hybrid where you can choose either an idiomatic F# technique or a more traditional data structure technique optimized for performance. When using the idiomatic technique, F# performance degrades to substantially worse than Lua and in some instances is worse than Python. The long-term challenge is that most programmers who want to program in F# learn the idiomatic approach and will find the alternative performance-oriented approach alien and difficult to maintain or enhance. I also found that the code savings in F# drop as the size and complexity of the total system rise. Finally, I found that only the Microsoft run-time performs well; the open source Mono run-time consistently performed worse than Java and was quite often slower than Python.

Language Notes:

  • ANSI C provided by far the best performance and seems like a viable choice, but code volume expanded rapidly as I started adding more functionality and when refactoring for re-use. With this said, if I had a deep pool of programmers trained in the pre-C++ days to draw from, it would likely yield the fastest solution. The main challenge is that to get C to deliver maximum performance we must skip some of the core libraries and use proprietary techniques that many programmers will find difficult to maintain. A few months after doing this work I re-coded the C version into C++ using common collection libraries and it actually performed worse than GO. I believe this means C is viable, but only if you are able to staff the team with people qualified to do real-time programming using techniques common for optimizing operating systems.
  • Microsoft F# provided a 9.7 times slowdown for a code reduction of 2.2. In Python we have a 61.3 times slowdown for a code savings of 2.32, for a ratio of 26.4. I love the Python syntax, but in this instance it would be difficult to justify accepting a 61X slowdown. I also wrote the same thing in C# later and found that when using hand-optimized F# versus C# the performance was nearly identical; as soon as idiomatic F# strategies were used, or as soon as heavy use of Linq was made in either language, they slowed down to become comparable to Python. The original F# implementation was 70 times slower than C. I had almost decided to implement the entire system in C but decided to test a direct port of the optimized C design to F#, using C optimization techniques such as large allocations of contiguous vectors and parallel vector views instead of arrays of records. The F# only grew by 8% in code volume but moved from 70 times slower than C to one of the fastest, at only 9 times slower than C. I explicitly chose to violate some of the purity of F# to make it more competitive as a general purpose compute tool. In some cases the F# language design made it harder to implement the optimization techniques; it would have been easier if F# had supplied true support for standard concepts such as break, continue and early function returns. In general I find that the lack of a small number of these features makes F# a far less effective tool than it could have been, while making the resulting F# code less clear and less maintainable than it could have been. Our calculations of SMA, EMA and Stochastic require thousands of computations over rolling windows of the original vectors. The idiomatic F# slicing and currying techniques delivered poor performance, but when you pass a reference to the original array along with begin and end indexes it was only 9 times slower than C, which made it faster than our best Java port at 22 times slower. Only C, GO and D were faster than F# in the tight loop when passing in the entire vector reference along with begin and end indexes. GO supplies a unique feature for slices which creates a virtual slice (think of it as a begin and end pointer) that does not require any data copying. I think F# should borrow the same approach; if it adopted this feature I suspect I could use idiomatic F# most of the time and still deliver good performance.
  • Java provided a 22.6 times slowdown for a code savings of 1.36, for a ratio of 17.4, so it is technically better than Python but far worse than F#. In reality the Java saves very little code compared to C, and was actually longer than the C after the first pass of refactoring for use as a library. Even with this said, for most medium to large customers I may choose Java simply due to the large number of programmers available.
  • Google GO provided the best absolute performance after C while using a memory-safe architecture entirely inside the GO run-time. The GO code is so close to the C in size that the small code savings would not justify any performance degradation. I found the GO compiler produced such poor error messages that it dramatically slowed down debugging compared to gcc, which produces very nice errors, and the GO build cycle was slower than the GCC C compiler. The GO performance was initially very poor; I eventually tracked this down to a problem in the streaming IO library and was able to re-write the code to use block IO with a custom parser, after which the GO performance jumped to the front of the pack. I really liked the GO virtual slice functionality and believe all languages should consider duplicating it. When the GO code crashed, it tended to do so with less informative error messages than the C code, which made debugging very expensive. I also found that GO code expanded in volume rapidly when refactored for common library use, in a way that seemed even worse than C. The GO memory management seemed OK, but it is prone to very long pauses, and when stress tested it got slower and slower over time even after a long cleanup process, where both the .NET and Java run-times stabilized.
  • Javascript / Node.js – As a long-term Python programmer I liked a lot of things about Node, and quite frankly Node was my default choice for the Bayes Analytic project. There were already some good JavaScript ML libraries with a lot of traffic and it seemed the best long-term choice for a truly vendor-neutral run-time. I had previously run tests with Node and it had performed well, so I expected it to perform well here. Unfortunately the Node run-time and garbage collector is simply not ready for prime time: it failed to successfully run with our test data set, running out of memory or going into what appear to be memory panic loops where it would run for half an hour or more before giving up out of memory. I simplified and reduced the data set to run something in Node and found an interesting limitation: while it is very fast in tight loops, it has serious performance problems accessing elements in numerically indexed arrays, which caused it to perform the statistical calculations several times slower than even Python. I will continue to monitor Node and the competing JavaScript environments because I think they still show a lot of promise, and we have to retain these skills for use in the HTML5 front end anyway.

Some interesting findings:

  • The C version loaded 2.5 times faster than the next fastest, Java.
  • The C version computes averages 25 times faster than Java.
  • The Node code was the slowest at calculating the averages. I think this is because it does not have a true array type with pre-allocated contiguous storage and must work harder to compute the location of an array element.
  • There is a convenience method in F#, Python and Java to read all lines, which was used. The vector-load version in F# was changed to use a .NET 4 iterator. I think I could take out another 100ms by converting this to byte-level reads.
  • Lines of code are counted with all comments and blank lines removed.

 

| Language / Variant | Total Load Time (ms) | Compute Average (ms) | Lines of Code | Lines as % of C Code | Times Slower in Avg() than C | Slowdown ÷ Code Savings |
|---|---|---|---|---|---|---|
| ANSI C – GCC vectors | 372 | 0.31 | 121 | 100% | 1.0 | |
| Idiomatic C++ | | | | | | |
| Java 7 Struct | 1,071 | 13.00 | 95 | 79% | 41.9 | |
| Java 7 Vectors | 1,090 | 7.00 | 89 | 74% | 22.6 | 17.4 |
| GO Struct | 1,835 | 1.39 | 117 | 97% | 4.5 | |
| GO Vectors | 1,767 | 0.63 | 111 | 92% | 2.0 | |
| GO modified to C design | | | | | | |
| Python Struct | 1,325 | 25.00 | 41 | 34% | 80.7 | |
| Python Vector | 1,393 | 19.00 | 52 | 43% | 61.3 | 26.4 |
| F# Struct | 890 | 4.0 | 37 | 31% | 12.9 | |
| F# Vectors | 905 | 3.0 | 55 | 45% | 9.7 | 4.4 |
| Node.js | 1,337 | 145.00 | 64 | 53% | 467.7 | |
| D Vectors | 661 | 1.25 | 123 | 102% | 4.0 | |
| Mozilla Rust | | | | | | |

Smaller is better in every column. Rows without numbers (Idiomatic C++, GO modified to C design, Mozilla Rust) appeared in the original results but had no measurements recorded.

 

ROI Balancing of compute performance to programming overhead

I work from a core assertion that higher code volume is more expensive to maintain, but only to the point where the cost of hardware and hacking on performance begins to outweigh the savings in code. I also assert that even large initial savings in code volume can be lost if the environment forces a lot of lost time working around its limitations. Eventually code performance will offset code reduction: as compute performance degrades, larger computers are required, and at some point slower compute time requires more work in distributed architectures to share the work across multiple computers, which requires more plumbing code and higher operational sophistication, which can offset the original savings. As you approach the limit of available compute power before moving to the next level of distribution, you will invest more time debugging and optimizing, which will eventually offset the original gain obtained from faster programming.

 

Performance optimization summary

It still works to think about how I would optimize the algorithm to maximize speed in K&R C (pre-C++) and then try to mimic that design as closely as possible in the target language. Object allocation, destruction, and currying data into different shapes to pass parameters are time consuming and should be minimized in high performance code. This rule still holds true for F#, Java, C#, GO, C, C++ and many other languages, provided the language allows you to modify the code design to minimize allocation overhead.

The level of effort invested in performance and optimization must be moderated by ROI and time to market, but I seriously believe it cannot be ignored.

Module-level stress tests which exercise the entire system and each module at data volumes 300% the size of production data are still the best way to isolate where to invest in performance optimization. These tests work best when the performance of the module tests is compared across time, with careful scrutiny given to modules which show significant degradation compared to prior releases.

I remain a big fan of distributed architectures built with open protocols like REST and JSON, especially when they are schema-free and able to expand to accommodate new requirements. An important driver is that if the system has been effectively partitioned on distributed boundaries, then I can choose to re-write any single component in a different high performance language, run-time and operating system without requiring changes in other components. If your distributed architecture does not allow this flexibility then you are likely losing a large portion of the benefit and ROI possible from distributed architectures.

Thanks Joe Ellsworth phone-number-for-bayes-analytic

 

3 Replies to “Why Language Performance Matters & Some Measurements”

  1. I recently wrote a new version of this test which includes source code in Lua and Julia. I tested Lua when I first ran these tests but for some reason I didn't include it in the results, even though it was the fastest of the interpreted languages by far. At that time it was 2X slower than the F# code. When I wrote my recent LuaJIT tests, many of the key statistical compute loops ran 26X faster in the JIT, with some running up to 80X faster. With a 26X speedup in the JIT this would push Lua to be faster than the F#. When I have a chance I will go back, re-visit the tests and produce a new set of results for true comparisons. I have been less happy with F# over the last while, as it is unclear whether the F# command-line compiler has long-term Microsoft commitment. As such I am porting at least parts of the system over to Lua for the next round of the engine so I am not trapped on the Microsoft platform. The new Lua article with source is at: http://bayesanalytic.com/lua_jit_faster_than_julia_stock_prediction/
