KNN and Ensemble for stock price prediction

Applying KNN to stock price prediction

© Jan-25-2014 Joseph Ellsworth All Rights Reserved   phone-number-for-bayes-analytic

How to apply KNN (K-nearest_neighbors) algorithms to predict future stock price movements.  How to combine KNN,  score normalization and  and Ensemble techniques for trading predictions.

See my newer work for a Quantized Classifier.    It is much faster than KNN and can deliver comparable or better results in many situations.   The code is freely available on BitBucket and I released it under MIT license so it is safe to use in commercial projects.   I hope you will buy our consulting services to help solve your problems but feel free to use the code anyway.    There is also free code showing how to read and classify machine learning CSV files using Deep Learning from TensorFlow.   

History never repeats identically but similar patterns do occur and they tend to produce similar results.

I wrote this article in response to a question about how to apply KNN (K-nearest_neighbors) algorithms to predict future stock price movements.       Please send feedback on how to make it more clear.  I wrote this at the level of abstraction where I prefer to receive knowledge.  If there is strong demand for sample code then I may be willing to provide it in a second article.

Basic KNN process for trading prediction.

There are many KNN introductions available but at the core it is a statistical technique used to predict or classify data based on similarity to the nearest neighbor.

It requires a few extra steps to apply KNN to predict stock prices.  The first step of this process requires figuring out which indicators you want to measure and the second is specifying a  goal such as the stock price rising 1% with 10 days.         For this example I will use K=6 which means using the scores from the 6 neighbors with the closest score.

Maury Osborne introduced a concept that 2 Consecutive Higher ticks tend to have a higher probability of the next tick going higher.  I extended this to an indicator that measures the percentage of bars or ticks with higher highs over a specified period of time.     The Percentage of higher high for 20 bars is abbreviated to PHH(20).   I use PHH in this article because it is easy to calculate.

For this example I will use a simple goal of the closing price to rise by at least 1% in the next 40 bars.   You can consider the 1% gain as magnitude and the 40 bars the duration component of the goal.   Some implementations require both magnitude and duration but my favored version only requires duration.

Let’s say that your measurement for the most recent bar is that 16 out of the previous 20 bars  had a high that was above than the high from the prior bar.    This would yield a raw PHH(20) indicator score of 16/20 =  0.80.

The raw indicator score is used to select the K scores closest to your computed measurement.  With K=6 then you would take the 3 with scores that are immediately above and 3 scores immediately below our raw score 0.80.       If 6 out of 8 of them had met the goal then you have an effective KNN score of 5/8 = 0.625.

KNN purists  may say you had a score of 1 (Majority consensus) where each measurement was  1 if the goal was met and 0 if the goal was not met.     I personally like the decimal number as it can be used as a level of confidence.

Using the 3 closest scores above and below is an arbitrary choice for distance computation.  I could also have used the 3 mathematically closest.

It is possible for our raw indicator score of 0.80 to match more than K neighbors.   When this occurs we use the neighbors which are closest in time as the secondary selection filter  It is also reasonable to expand K when more exact matches are available.    Since we are using arrays with numeric indexes where bars and trades are known to be sorted by time we can compute a distance in time from any bar as the absolute value of it’s index minus the index of the matching candidate.

Some KNN implementations always use a time component as part of the section criteria for the neighbors and in that case a more complex selection algorithm.    You could use the time measured in # bars, minutes, hours or days as the X axis and the raw PHH(20) score for the Y axis.  The net difference in distances is computed by finding time difference and the feature difference between the candidate bar and all other bars.   These are then combined by applying the Pythagorean theorem hypotenuse formula.  Once these are computed then a separate selection process used to find the bars with the shortest hypotenuse.    The more difficult part when including time as a selection point is to find a way to normalize the time contribution against the raw indicator output scale so you can properly balance the input of both.      I tend to prefer restricting the total compare set and using only the raw indicator score for selection of the neighbors  but there are valid arguments either way.  A larger system is likely to use both approaches.

Extending for Normalized Magnitude score retention

I like to extend the native KNN output to retain the magnitude of gain within the next X bars where X is the duration portion of the goal.       With K=6 you may have percentage gains of  [1.3%, 0.2%, -2%, 1%, 0.5%, 0.4%].   These can be used to compute a score that is an average % change for the selected neighbors.

I like this input because it allows the magnitude of the change to be represented both positive and negative as a component of the output numbers without requiring explicit statement in the goal.  For this sample it would be a raw score of 0.667%.

Normalize this score against the entire set which allows easy combination of disparate scores in the ensemble described below.   (Think of it as poor man’s gauge normalization).    If the entire set had an average score of 0.183%  and our average score from our neighbors is 0.667%   the normalized score is computed as 0.667% / 0.183% = 3.64.

This can be interpreted as this bars neighbors had an average gain 3.64 times higher than the average gain for the entire set.    Since this bar is similar then it is likely to produce a similar result.

It is reasonable to compute the level of confidence and the normalized magnitude of change for both sets.   These can be used as separate inputs to the optimizer described below as they both provide interesting inputs where one predicts the amount of movement and the other predicts how likely that movement is to occur.

Performance Note: If you pre-compute the indicator to produce an array containing the output metric and index into the trade or bar array  and then sort by the output metric you can binary search for closest matching scores very fast or you can use a tree structure with is fast to look-up and fast to update.   I prefer a tree structure which makes it easy to add new measurements without inuring the sort overhead.   A linear search also works but is much slower.

The scores for any bars too new to compute success or to provide the forward looking score must be suppressed from the historical set when choosing neighbors.

Extending with basic Ensemble

Once you have processed the data set as described above for a single indicator,  you have a good single KNN measurement that is probably useful but is unlikely to be highly accurate for predicting future price movements.

Ensembles combine the output from multiple models with the assertion that the combined output is likely to yield more accurate results than the prediction from any single model.       Ensemble enthusiasts also claim that ensemble output are less likely to be negatively impacted by special cases and are less affected by over learning.

The next step is to repeat the process for several additional measures and different time frames.        For example you may choose something like the following.

  •   Percentage of higher High (10 bar)
  •   Percentage of higher high (20 bar)
  •   Percentage of lower Low (10 bar)
  •   Percentage SMA(5) above or below  SMA(10)
  •   Distribution rank of this max compared to the last (100) max scores.
  •   Sky’s the limit.

I recommend that  you keep your individual indicators easy to code so you can spend more time on the core KNN,  Ensemble and optimizer algorithms then you can add more indicators as needed.

Once you have the full set of measures and the KNN score for the most recent bar based for each of these measure sets you can average those measures to produce the ensemble score which hopefully will be more accurate than the score for any single indicator.

The score magnitude normalization step above is critical to allow easy combination via averaging because all the scores are already relative to the entire set.        One extension that can yield substantial benefits is to extend the ensemble averaging method so it can apply different weights for each measure set.     An optimizer can be used to find the set of weights which yield the highest success rate for the entire set.          You can think of this as choosing which expert you trust the most and giving their input the greatest weight.

Rejecting some neighbors to maximize accuracy:

One weakness of the average based ensemble combination is that a  single score from a single neighbor for a single feature can drive too much change in the overall score.   I have seen instances where 5 out of 6 neighbors had an average gain of 0.6% over the next 40 bars while one of the 6 neighbors had a loss of 4%.  Statistically the outlier changed the net score more than should have been allowed which made the output less accurate than desired.   This is only an issue when using the normalized score mechanism mentioned above.      If you are using the pure consensus rules then the statistical outlier is automatically eliminated.

Maximize statistical impact of  the most similar neighbors

One way to work adjust for this is to keep the K neighbors for each feature.   Those neighbors which are selected the most times by multiple features are those which have exhibited the greatest similarity.   Since they are the most similar you can  you increase their weighted contribution to the ensemble score relative to those which matched on fewer facets.    This allows you to accept some input from all the neighbors but give the greatest impact to those neighbors which are most similar.     This requires that you defer the calculation of the feature average until after you have gathered the set of neighbors all of the active features.

Selecting a larger set and suppressing statistical outliers

Another way to avoid this is to use a higher K value such and then reject the a subset of the neighbors that have future gains that are the furthest from the mean for the set.   Think of this as throwing out the statistical outliers from a Cauchy Distribution, Levy Distribution.   I am generally not a fan of throwing away data but in this instance our sole goal is to increase prediction accuracy and in this special case it can be beneficial.      Benoît B. Mandelbrot  introduced the idea that while stocks normally behave according to standard distribution rules that they occasionally exhibit wild moves outside of statistical expectations.   We normally protect against these wild moves with hedges and stop losses but they have a secondary effect of reducing prediction accuracy.    We can extend Mandelbrot’s original  assertion to say that “Some conditions produce wildly different effects that are unlikely to repeat in a useful period of time”.  If you can collect these wild effects into a cluster and use them, they can be aids in prediction as kind of a reverse neighbor look-up.     In most conditions they represent a small number of wild outlier readings in a otherwise rational set that can skew  statical accuracy and need to be suppressed.    Our ultimate goal is to maximize the number of times we correctly predict the price movement magnitude or direction and to meet this goal suppressing the impact  wild inputs may help.

If more than a small number of neighbors are showing wild forward looking effects  outside the statistical norm then it is prudent to pay attention as it can be a warning of severe risk.     In this case we may actually keep the wild inputs and either tune down or suppress of the input normal results.    A high number of wild results in any set of neighbors can also be used to dramatically reduce the confidence in the prediction or possibly eliminate the prediction all together.     This can be used when modeling to find and cluster wild effects which can be a component of an early warning system for major market moves.   This kind of warning signal can be beneficial to  the  portfolio risk manager to suppress purchases,  adjust hedge levels  and cause early sale of long positions.      Using a system for wild behavior identification is a perfect example of  when more years of data and a larger number of neighbors can yield beneficial results by providing more wild events to learn from.

Any easy way of filtering wild input is to  reject any neighbors that produce predictor outputs that are are more X standard deviations away from the mean of the K neighbors.   A faster selection mechanism is to throw away the upper and lower 10% of the readings keeping only the 80% in the center.     The nice part of this approach is that it allows the normalized score to be computed earlier which simplifies some data management.    These approaches can also be combined.    I tend to prefer maximizing the contribution from neighbors that have been judged the most similar but it could be beneficial to  combine this with suppressing wild readings form less similar neighbors.

The ability to reject or adjust neighbors contribution relatively inexpensively is a benefit of the KNN over some of other classification systems where it is more difficult to identify and adjust for the outliers.

Calculating the future price

I am not a huge fan of attempting to predict a specific future price simply because  I have never seen it deliver pin point accuracy and it sends the wrong message to the user’s who inevitably try to use it that way.      On the other hand I have seen it deliver high accuracy when predicting which stocks will rise in price versus which will drop.  It is also beneficial when comparing purchase opportunities where the greatest profit may be from the purchase with the larger  predicted price increase.      If you use it as a directional indicator rather than an absolute prediction I think you will see greater trading success.

If you applied the score magnitude normalization above then you can can take the output score from the ensemble and multiply to obtain the future price prediction.      The formula for price change would be  Last  Bar Closing Price *  Avg Perc Movement during duration period * Normalized Score output.


Bar close price = 184.20
Average price move during goal period =  0.183% = 0.00183
Ensemble or KNN normalized magnitude score =  3.64
target price movement = (184.2 * 0.183% * 3.64) =  1.2269
target Price = 184.20 + 1.2269 = 185.423

I have not tested this but I suspect the probability of accurately predicting the exact price movement is relatively low.   It is more useful when used as a +- 20% indicator so the actual price movement would be fairly likely to be between $0.97 and $1.46.  One way to measure the accuracy of the system is to compute average % deviation between predicted score and actual score.    I suspect that the most accurate predictions will be for the top 85% to 95% scores and the 5% to 15% scores but this would need to be proven with read data tests.

Scoring and measuring

Scoring and measuring serve two purposes.  The most important is to give you a good idea of how much you should trust the output of the algorithm to influence live trading.   The second aspect is that scoring is an essential component of optimization.     It is important to keep these two aspects distinct in the mind of the engineer because for optimization the scoring must be fast to allow many passes whereas scoring for trust is best done with a separate scoring mechanism that can be somewhat more expensive.

When optimizing you typically score against the training data set but and when scoring for trust you score against the portion of data set aside for testing and which  the training and optimization algorithms have never seen.

Simple Ranked scoring

You apply the algorithm to every bar in your historical data set to obtain it’s computed output score.  Then sort these bars to find the top 10% of the bars with the best score.     If you have done everything right and had some luck in choosing the features (indicators)  then you should find that your bars which have the top 10 percentile of the combined scores will have a success rate substantially higher than the average score for the total set.

You must suppress any bars from the earliest part of your history where you do not have enough data to accurately compute the indicators you have chosen.

You can measure success based on maximizing the number of successful bars in the top 10 percentile of your computed scores.     A more accurate measure for the optimizer is obtained by  running a quick back-test after each change.  Consider the change good if it results in increased net profit from the back-test.

Make sure you include transaction fees for the back-test or you will get inaccurate results.

Prediction Accuracy scoring scaled

  1. Compute the the target price move for every bar using the algorithm described above.
  2. Choose a target zone such as prediction of price move +- 20%.
  3. Classify each items as a success if the price fell inside the predicted price range or a failure if the price fell outside this range.
  4. Rank  order all items based on absolute magnitude score.
  5. iterate through the items keeping a count of total items reviewed so far versus successful items.
  6. Current success rate = # of success / # items reviewed
  7. The items success rate = Min (last items success rate,  current items sucess rate)
  8. Stop when current sucess rate drops below a threshold such as 90%
  9. Classification Score for the pass is computed by # of sucess before threshold was encountered / total set size.     This score can be easily used by the optimizer.
  • A similar approach can be used but rather than choosing a range,  Use the logic of if the magnitude of the change was positive and the move within duration target was positive then sucess or if the magnitude of predicted change was negative and the move was negative then sucess.  Otherwise failure.
  • A similar approach can be used but rather than choosing an absolute range and threshold compute the percentage difference between predicted move and actual move and keep a running average for absolute value of this difference.     When the value of this average rises above a threshold then exit.    The number of records selected before reaching the threshold  represents a classification accuracy that can be used
  • In trading we are generally most interested in the records at both ends of the extreme movement scores because that is where trades represent the greatest potential profit.    You would run this score mechanism from both directions high to low and low to high and it is quite probable that you may find a higher selection efficiency at either end of the spectrum.

 Prediction Accuracy scoring zoned

  1. Compute the the target price move for every bar using the algorithm described above.    Create a target zone such as prediction of price move +- 20%.
  2. Classify each items as a success if the price fell inside the predicted price range or a failure if the price fell outside this range.
  3. Take the absolute range of bucket scores and divide by a target number of buckets to obtain a target bucket size.   For example if the absolute move magnitude scores range from -6.83 to 5.2 and you want 100 buckets then the average bucket size will be 12.03 / 100 = 0.1203.
  4. Group each historical bar into it’s appropriate bucket.
  5. Score each bucket based on the  # of successful price predictions /  total # of items in the bucket.
  6. Select the buckets which have the highest probability of success.
  7. Give priority to the buckets with the highest success score but only when they have sufficient number of items in them.     Buckets with too few of items should be considered statistical outliers and rejected.    If you have too many buckets with  too few of items then reduce the number of buckets or use an alternative approach.

Back Test scoring

 Back-test scoring is exactly what it sounds like.   You essentially emulate buying and selling stocks or options based on the predicted price.    Back-test results which yield the greatest provide are considered the most successful.

In this system you may choose to place long purchases  any stock / bar combination where the last purchase  or last bar produced a score in the top 20% of the ranked scores and sell when the last purchase or last bar produces a score lower than the top 70% of the top scores.         You may  also include a limit which requires the predicted profit to be at least 300% of commission and fees costs.    Force all transactions open at end of the test to exit as if sold at market at that time.    You should also include a maximum long value at risk so you can accurately predict the effect on a portfolio.  Use of a stop loss should be tested as a optional component.    Finally test for bankruptcy and abort the test when you reach the minim allowed account balance.

I suspect that if using the Absolute normalized scoring above that the top 5% and bottom 5% of scores will be less accurate than those between the 5% to 25% and those 75% to 95%.  I also expect those in the 40% to 60% to be less accurate when used for prediction directional movement.     If you use the ratio of accuracy for predicting a two part goal shown then the most accurate should be those at the very top of the score range.     I have been surprised many times by so this is really a wild guess, the only real way to know if with extensive testing.


Once you have the first set of scores for the ensemble you may find that it is very accurate for some symbols and much less accurate for other symbols.    I have seen this many times and it tends to result from different stocks exhibiting different trading personalities.   A given indicator such as a SMA(5) as percent of SMA(50) may be very valuable for SPY and produce negative results for AIG.  I believe this is driven by the internal processes and personalities of the institutional traders who are  actively trading those symbols and these influences do change over time.

Optimizing is the essentially the process we use to find combined set of factors which produce the best result for a given subset of stocks during a given time-frame.    The essential process is:

  1. Change one or more parameter values,
  2. Recompute the scores,
  3. Determine if the change resulted in a positive result,
  4. Keep the change if positive,  reverse the change if negative.
  5. Repeat for a Specified number of passes or until the system has reached a semi-stable state.

The Optimizer can be  used to change many factor such as:

  • The contribution weight for various indicators or measure sets of the ensemble
  • Change the durations of indicators chosen.
  • Add or remove some indicators or indicator durations.
  • Change K (number of neighbors) either for the set or individual indicators.
  • Change the number of days in the input data set.
  • Change from limit or magnitude or directional sucess criteria to one of the others.
  • The sky is the limit

The set of combinations grows rapidly so I generally suggest optimizing  1 axis at a time.  It also saves a lot of time if you load the key values that can be changed by the optimizer from a data file so you can save the results after the run.    I always save the intermediate results after each major pass so we can use those combinations as input to more advanced genetic optimizers.

Designing optimizers is a art as much as a science and there is a risk of the optimizer finding micro cases which look good but actually produce negative trading results.  The easiest optimizer is generally to randomly choose a feature to change,   randomly choose a new value within a range +-50% of current value and test for that single change.     More advanced genetic optimizers have been used but be fore-warned that it is easy to sink hundreds of hours into testing and tweaking optimizers.

Many optimizers have a tendency to go down the wrong path which can cause them to never discover better optimization opportunities.     One way to work around this is to periodically clear the entire optimization set to default state and then run new passes.   If you keep the results from these clean slate passes as data files  then it is relatively easy to use them as the input as the input genome for future  genetic optimizer passes.

Early in my ML (Machine Learning) work I found that genetic optimizers depending on pure random generated genome values very often yielded negative results for the entire set and they never recovered.  I invested weeks testing various ways to add random perturbation to the genome combination process and found several that worked but none of them seemed to work as well or where as efficient  as independently generating the starting genome settings from known good result sets that where produced by a less sophisticated random sequence optimizer.

Save some data for testing

Ensure you train against one data set and reserve some data from the same series to measure how well the system will perform.   The test data must be completely isolated from the training data or you will not get obtain an accurate measurement of your effective lift.

The ultimate measure of success is how many of the bars from your reserved test data set which have a score in the top (filter-range eg 90%) of the set succeed.

When testing new algorithms I like to reserve 20% of my total time for use as testing data.   You should expect your classification success rate in the test data set to be less than when applying it to the training data set.   One measure of brittleness is how much the difference there is between the two scores.

I would be pretty cool if your top 10% scores had 3 times the average win rate and your bottom 10% of scores had 3 times the average loss rate.

Thanks Joe Ellsworth
CTO Bayes Analytic
Feel free to contact me privately for any questions.
Contact us

I like as a quick intro to the KNN algorithm.


Required Disclaimer

Forex, futures, stock, and options trading is not appropriate for everyone. There is a substantial risk of loss associated with trading these markets. Losses can and will occur. No system or methodology has ever been developed that can guarantee profits or ensure freedom from losses. No representation or implication is being made that using this methodology or system or the information in this letter will generate profits or ensure freedom from losses.  Forex and Option trading can result in losses that exceed the original principal balance.

Hypothetical or simulated performance results have certain limitations. Unlike an actual performance record, simulated results do not represent actual trading. Also, since the trades have not been executed, the results may have under-or-over compensated for the impact, if any, of certain market factors, such as lack of liquidity. Simulated trading programs in general are also subject to the fact that they are designed with the benefit of hindsight. No representation is being made that any account will or is likely to achieve profit or losses similar to those shown.

Even results from Live cash trading can be subject to specific market conditions that may not repeat in the future and as such, duplicate results from future trading is unlikely to duplicate past results.    Changing the dollar amount traded can cause different behavior in live trading markets especially when trading large positons that can exceed the liquidity available in the market and cause changes in pricing behavior.

Bayes Analytic LLC provides software that can produce trading signals.  The customer is responsible for choosing a configuration and parameters for the software that meets their own goals.  The customer is responsible for conducting their own tests and only the customer can activate the software to start trading.   The software runs in an account the customer has logged into and then activated the software.   Bayes Analytic has no control of,  influence over or visibility to the signals specific to given user because we have no visibility into configuration parameters the user has chosen to operate with.    The Bayes Analytic software is provided without Warranty on a As Is, Where is basis.  It is the customers responsibility to test the software to ensure it meets their trading requirements.   Every time Bayes Analytic releases a new version of the software the customer should conduct new tests to validate the new version continues to meet their requirements because every software change could have unexpected side effects that may not be obvious until the customer has tested it in their environment with their configurations.   The Bayes Analytic software may run as a script inside of other software packages or talking to API that Bayes Analytic has no control of or Influence over so the customer should test entire ecosystem to ensure it meets their trading requirements.    Bayes Analytic may provide the software in source form since that is required by some trading systems but it remains the exclusive copyrighted property of Bayes Analytic and may not be reverse engineered or redistributed.    The customer is responsible  for choosing their own broker and installing the Bayes Analytic software so it can trade using the desired account.  Bayes Analytic has no control over or influence of the broker and many brokers have different ways of quoting spreads,  charging commissions,  flow of orders and latency of information.  As such a strategy and software that performs well at one broker may and probably will require changes to perform well at other brokers.  It is the customers responsibility to test the software with their selected broker to ensure it meets their trading requirements.