Machine Learning Failure

Research has the problem that sometimes only positive findings are published. In my case, my confirmation bias probably makes me notice good things about machine learning. At least today, my confirmation bias is being held at bay.

I ran a machine learning program using the returns of each of the last 12 weeks—for one of my sims/ports—as factors.

So, simply a neural net using technical factors on one of my sims. I am looking for a signal to get in and out of my port.

So, you could probably look at multiple things, but the correlation between my predicted returns and the actual returns (in a validation set) was only 0.02. Not too surprisingly, this small correlation was not statistically significant.

Copied from Python: “SpearmanrResult(correlation=0.022931135723975947, pvalue=0.7123258875857394)”
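For anyone who wants to reproduce that check, here is a minimal sketch of the Spearman test quoted above. The arrays are synthetic stand-ins (the actual predictions were not posted), so the numbers will differ from Jim's.

```python
# Sketch of the validation-set check: Spearman rank correlation between
# predicted and actual weekly returns. Synthetic stand-in data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
predicted = rng.normal(size=260)                  # ~5 years of weekly predictions
actual = 0.02 * predicted + rng.normal(size=260)  # nearly uncorrelated actuals

result = spearmanr(predicted, actual)
print(result)  # correlation near zero, large p-value
```

A correlation this small with a p-value far above 0.05 is exactly the "no signal" outcome described in the post.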

I am doing some other things that seem to have more potential. But I wouldn’t want to share just my positive results.

And I probably need to (and probably will) try this with a recurrent neural network.

Machine learning can tell you when using a strategy would be a mistake. There’s that at least.

Best,

Jim

Well duh! That’s about as big a secret as Donald Trump’s spray-on tan.

Ouch. I didn’t have to read further.

I guess the advantage of machine learning is that the machine will compute anything you tell it to compute. A market-knowledgeable human would have reacted to that task by saying “Are you f-ing kidding me? I’m not computing that sh*t. I already know that absent dumb luck your result is going to suck. I’ve got better things to do with my time so go feed your garbage into some dumb-ass machine that doesn’t know any better and will play along.”

Jim, my mathematical/statistical knowledge is probably about 0.01 of yours, maybe a bit less, and my machine learning knowledge is nonexistent. But I will claim to know that you can’t get good answers if you don’t ask good questions. While there may be plenty of quants who print out Wikipedia articles on correlation to put under their pillows while they sleep every night, when I think of successful investors who actually make money in the market over prolonged periods of time, I’m not sure how many can even spell correlation, much less care enough to compute it. It’s not that a correlation of 0.02 is satisfactory. For questions whose answer depends on correlation, it’s not. It’s terrible. The issue is whether the problem you’re trying to solve really is one for which a correlation between predicted and actual returns should be part of the answer you seek.

Re-think your target vectors, or dependent variables. Just because some data miners who parlayed their nonsense into fame and tenure used returns as their dependent variables and correlation as their metric doesn’t mean the rest of the world has to do likewise forever and ever. Don’t piss away all your creativity on technique (machine learning); save some of it for study design.

Hint: We’re back to the D-phrase: Domain Knowledge. Look among the strategies stored in your account. Pick one that turned out well and one that turned out badly. Download the transactions of each into Excel. Go off of p123 and analyze both of them as if you were an investor who was sent them by your adviser. Ask yourself why you liked this one and not that one. Then move into study-design mode by trying to explain your reactions with increasing objectivity and systemization until you find yourself with a set of researchable ideas. See what happens. The reason I tag this as domain knowledge is that your study is not going to grow out of what a math or statistics book tells you but out of what you, as an investor, see is associated with what you regard as a good outcome.

Marc,

You may think we disagree about statistics. But I am on your side most of the time, agreeing with you, for example, about the misuse of backtests.

I am not sure, however, that looking at this with machine learning was misguided.

I also specifically tried this with multiple moving average crossovers which also did not work. The machine learning method should have picked up on any moving average crossover effects as well as more complex interactions.

But a specific look at moving average crossovers was also negative (for this sim, anyway). Moving average crossovers did not work in a test set for this sim.

Many at P123 like their moving averages, so I did not expect as much knee-jerk agreement on this and therefore did not post the part about the moving averages. I try not to be controversial anymore.

Anyway, all in all, I am probably the person in the forum who most agrees with you about statistics. The bad statistics, anyway. And there are a lot of bad methods that get a pass on the forum.

In fact, we ultimately agree on this. And I am willing to post my negative results.

Can you please post the results of one of your tests of the Chaikin Oscillator, if you have any?

I was planning on testing that too and perhaps you could save me some time.

Much appreciated.

Best regards,

-Jim

Jim,

I’m not commenting on the factors you may have used. I am concerned that your study design—the criteria for a good/bad result—may need re-thinking. For example, I’ve done work, and am doing other work, where returns are not what I look at: my dependent variables are measures of volatility. I’ve done other work where the dependent variables are boolean: whether the stock returned x or higher within a specified period of time. Be creative . . . looking for correlations between predicted y and actual y is stale and not necessarily aimed at pointing you toward good investment outcomes. Be statistical . . . that’s you. But be creative too . . . especially if you have avant-garde computing capability at your disposal.

I know I was at the other end, opposing brute-force ML, AI and other stuff in this direction.

But guess what, I changed my mind and I am in the process of going in this direction and testing it out.
(@Jim → Thank you for the whole discussion process, it helped me a ton!)

Got to know a trader who uses a tool (programmed by himself) that could be labeled a genetic programming tool.

The design of the tool is the following: it uses brute force, but it tries to avoid overoptimization in two ways: through a good study design before the brute-force test, and afterwards through statistical tests (some of which are not OOS based).

1000 indicators (a lot of bogus stuff, but interesting stuff too, like momentum)

  • Intermarket analysis
  • 4 rules max
  • Time exits (e.g. no sell rule, but a time-based exit is possible)
  • Intermarket data (e.g. you push QQQ, TLT, VIX, HYG into the data pool and let the machine find out the behavior)
  • 20 years of data
  • 50% OOS tests (and more) possible
  • System portfolios
  • Filters for systems (e.g. only run long systems if the S&P 500 is above its 200-day MA, etc.)
  • Intelligent stops based on vola (and vola regime changes)

Then the machine runs based on the indicators you choose.
So here comes the thing: if you are experienced and know what works (e.g. momentum, based on a ton of academic research), you choose those indicators.

Then the machine calculates up to 15,000 (not a typo) strats per second and puts the best 500 strats in the results.
Then you test the strats with statistical indicators, some of which do not assume a normal distribution (and for this you need a lot of experience; steep learning curve!).

Then the most important step comes.

You ask yourself: why could the strat work (or not work) in the future?

The testing methods are: how did the strat do in different markets (e.g. if you tested QQQ, how did IWM do?), testing against random strats built from randomly selected indicators in the machine, and a ton of stats tested against the random distributions.

So, the steps are:

  1. Make sure your test setup (as far as possible) is not prone to overoptimization (as few rules as possible, intermarket data, indicators that are proven to have some merit by academic papers, at least 50% OOS, realistic slippage, etc.)
  2. Let the machine run
  3. Test the strats with statistical methods
  4. Ask yourself what principles (backed by academic papers) the found strats are using and why they might work in the future
  5. Real-time test (for me, at least 2 years)
  6. Trade them.
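Step 3's "test against random strats" idea can be sketched like this. The normal draw below is a toy stand-in for the null distribution; the real tool builds its null from randomly assembled rule sets, and the candidate return is a hypothetical number.

```python
# Compare a candidate strategy's backtest return against a null
# distribution of random strategies (toy normal draw as a stand-in).
import numpy as np

rng = np.random.default_rng(1)
random_strat_returns = rng.normal(loc=0.0, scale=1.0, size=1000)
candidate_return = 2.5  # hypothetical backtest result

# Fraction of random strategies the candidate beats.
pct_rank = (random_strat_returns < candidate_return).mean()
print(f"candidate beats {pct_rank:.1%} of random strategies")
```

If the candidate does not clearly beat the bulk of the random distribution, the "edge" is indistinguishable from luck.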

So it is a different process from “1. idea first, 2. math second”; it’s more like: 1. ideas first (design the setup), 2. the machine spits out ideas, 3. ask whether the machine found something that could work in the future (because it uses principles backed by a lot of academic papers).

I used the tool for 3 weeks and some pretty interesting stuff came up (basically trend-following intermarket systems based on QQQ, TLT and HYG), which is already in the real-time test.

If P123 is interested, I can give you the contact of that person (he has been trading for 10 years, works for a high-frequency prop trading firm and built that tool in his spare time). Maybe a good team-up after the implementation of international stocks.
It could be a great add-on tool for P123 with a different price tag and a shared business model.

Best Regards

Andreas

Thank you Andreas.

Everybody . . . Print that out and paste it on a wall or someplace you can see as you work on p123 backtests.

And impose on yourselves an absolute rule that you will never ever present or discuss a backtest you did, on p123 or anywhere, without addressing those questions. Also, if you see a post by someone else that does not address this, then call them out and let them know the omission is noticed.

Andreas,

I think I know the platform you’re referring to. Isn’t it limited to using just “technical” price/volume-based indicators? Are you paper trading the system now?

Walter

FYI, I also failed at building an ML program with a decent probability of predicting market corrections (say, a negative return of SPY over a month or a quarter). I used more than 100 technical features across various assets (the idea was intermarket analysis), and I tried random forests, bagging SVMs, and bagging KNNs. I was at first fooled by apparently excellent out-of-sample accuracy (>80%) in cross validation. The issue is, I was using a standard function designed for shuffled cross validation. As the market has a memory, shuffling the data results in generalized data leaks (overfitting without knowing it). NEVER shuffle data in cross validation when using financial time series. Standard (unshuffled) K-fold validation told me the inconvenient truth. Anyway, I learned a few things in the process; it was a fun video game and I will play it again.
Edit: btw, “the market has a memory” is also the reason why I think using Monte Carlo simulations in finance is very questionable. The Monte Carlo method is designed for a Markov process. I am quite bad at statistics, but I understand financial data are not Markovian.
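The leak Frederic describes is easy to see with scikit-learn's splitters: a time-ordered splitter keeps every training index before every test index, while a shuffled K-fold trains on samples from the test fold's future. A minimal sketch:

```python
# TimeSeriesSplit keeps chronology; shuffled KFold leaks the future.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 weekly samples in time order

# Time-ordered splits: every training index precedes every test index.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()

# Shuffled folds: training sets contain samples newer than test samples.
leaked = any(
    train_idx.max() > test_idx.min()
    for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X)
)
print("shuffled folds train on the future:", leaked)  # True
```

With autocorrelated (memory-bearing) series, that future-in-training overlap is exactly what inflates cross-validation accuracy.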

Frederic,

Thank you for this EXCELLENT point.

There is another point specific to P123 that fooled me initially and is very important.

For excess returns, you should use only the next close for your model.

The problem is that the benchmark return is always computed at the next close. If you use the next open for your model, you get a peek at Monday’s data when you lag your data. You think you are only seeing data from Monday open to Monday open, but the benchmark data at Monday’s close gets snuck in, and the neural net is smart enough to use that information and find a pattern.

More simply, the data is not PIT this way. Not PIT because you could not really calculate the excess returns until Monday’s close.

Bottom line: use the next close if you are going to use excess returns and download time-series data from P123.
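A toy sketch of that alignment rule (hypothetical price series): compute the target close-to-close, then shift the same series so the features at week t only contain returns already realized by week t-1.

```python
# Lagged close-to-close features aligned with a close-to-close target,
# so nothing later than the target's own close leaks into the features.
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0], name="close")
ret = close.pct_change()  # weekly close-to-close returns

features = pd.DataFrame({
    "lag1": ret.shift(1),  # return realized one week before the target
    "lag2": ret.shift(2),
})
target = ret  # point-in-time consistent with a benchmark computed at the close
print(features.join(target.rename("target")).dropna())
```

If the target were instead built from the next open while the benchmark stayed at the close, the "excess return" row for week t would quietly contain close-of-week information the model should not see.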

Frederic, thank you for sharing this.

Regards,

Jim

Jim - I have a question for you. If you take the target output from your training set and plot a histogram, what do you get? If possible, can you share the results with us?

SteveA

Hi SteveA,

These are the excess weekly returns in percent for the training set. The factors were the lagged data. In other words, f1 = data lagged one week, f2 = data lagged two weeks, etc. I went up to 12 weeks of lagged data but tried different numbers of factors (weeks of lagged data).

The training set was 2000 to 2009 (inclusive). The validation set was 2010 to 2014 (not shown).

I did not use the test set (2015 to now) because the validation set results were so poor.

As you probably remember from your experience with neural nets, there are different ways to avoid overfitting. The easiest is called “early stopping”: you just stop training when you begin to see overfitting on the validation data. This is what I used here.
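Early stopping is just "quit when the validation loss stops improving." In Keras it is the `EarlyStopping` callback; the underlying logic is a few lines, sketched here with a made-up validation-loss series:

```python
# Hand-rolled early stopping: stop once validation loss has not
# improved for `patience` consecutive epochs.
def early_stop_epoch(val_losses, patience=3):
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop; weights from best_epoch would be restored
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]  # overfitting starts at epoch 3
print(early_stop_epoch(losses))  # stops at epoch 5
```

The `patience` parameter trades off noise tolerance against wasted epochs; in practice you also keep the weights from the best epoch rather than the stopping epoch.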

Neural nets are no secret and I am happy to share any code if anyone is interested.

Best,

Jim


Jim - are you processing the outputs further or feeding the raw percentages into the NN? Same question for inputs. I was under the impression that the inputs at least were converted to a range from 0 to 1.

Can you show me the histogram for the validation outputs?

** EDIT ** What NN program are you using?

SteveA

SteveA,

Initially I standardized the inputs. But from my understanding this is just to put the factors on the same scale. For example, a value ratio and market cap are on very different scales. Since this is just a lag of the same data, and it is already all on one scale, I stopped doing that after confirming that it made no difference.

And you are right: you generally normalize from 0 to 1, or standardize.

I used TensorFlow in Python.

The image is from the P123 distribution (not the excess returns). I will get you the excess returns later if you want those too.

Regards,

Jim


I can give you some suggestions; I can’t guarantee that any of them will work. I can guarantee it will be a lot more work, however.

For starters, you want to make sure you have a good target (output). By that, I mean that you want a histogram that is pretty much flat across the spectrum, i.e. an equal number of samples in each bin. It doesn’t have to be perfect, but you don’t want 99% of the samples close to zero and 1% large positive or negative returns.

I suggest starting by using the log of the output instead of the output itself. I think you can instead use the arctan, if memory serves me correctly. Second, you probably want to throw out most of the entries where the output values are around zero and perhaps duplicate the entries with large output swings to get a better approximation of a flat histogram.
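The two compressions suggested above can be sketched on toy weekly returns; a signed log (plain log fails on negative returns, so the sign is restored after taking `log1p` of the magnitude) and arctan both squash the tails so the target histogram is less dominated by near-zero weeks:

```python
# Tail-compressing target transforms for weekly % returns (toy values).
import numpy as np

returns_pct = np.array([-12.0, -0.5, 0.1, 0.4, 9.0])

# Signed log: log1p of the magnitude, sign restored (handles negatives).
signed_log = np.sign(returns_pct) * np.log1p(np.abs(returns_pct))
# Arctan: bounded alternative, roughly linear near zero.
arctan_t = np.arctan(returns_pct)

print(signed_log.round(3))
print(arctan_t.round(3))
```

Both transforms are monotonic, so the rank ordering of returns is preserved while extreme weeks are pulled in toward the bulk of the distribution.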

You didn’t give me the same type of graph for the validation set as for the training set so I can’t compare the two. But you should strive to get a similar histogram for both sets.

You might also want to consider changing the definition of the output. For example, instead of the %change, you could calculate a 10-sample best fit line through the weekly data. Then take the difference between the next week actual versus the trend line projection and use that as the output. Convert to log value of course.
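That trend-line target could look like the following hypothetical helper: fit a line to the last 10 weekly closes, project one step ahead, and use the newest close's deviation from the projection as the output.

```python
# Deviation of the newest close from a 10-sample best-fit projection.
import numpy as np

def trend_residual(prices):
    window = np.asarray(prices[-11:-1], dtype=float)  # last 10 known closes
    x = np.arange(10)
    slope, intercept = np.polyfit(x, window, 1)
    projection = slope * 10 + intercept               # one step ahead
    return prices[-1] - projection

prices = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 13.0]
print(trend_residual(prices))  # 2.0: actual beat the trend projection by 2
```

SteveA's further suggestion would then apply the log compression to this residual before feeding it to the net.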

For the inputs, you want to use timeframes that are larger than for the outputs. In other words, if your output is one week change, then your inputs should be 1-3 months change, not one week change. You also want to be more focused on intermarket and/or breadth data as inputs, again using 1-3 months lookback. Using the same time series for both input and output won’t get you too far.

Stay away from binary inputs to start i.e. moving average crossover…

Keep it simple: not too many inputs, and try to work with inputs that you have experience with or know have a chance of working. For example, you could start with Georg’s inputs to his iMarket timer. I can imagine that it might be difficult getting all of the data, but you gotta do what you gotta do.

Also, minimize the number of internal nodes. Don’t create a deep neural net, because all you will succeed in doing is memorizing the data. The number of internal nodes should be 50% or less of the number of inputs. The less memorization that occurs, the more averaging that occurs. That is why you want flat histograms. Otherwise, you will end up with an output that is an average, typically positive.

It is also good practice to normalize the inputs from 0 to 1 or some other limits. This way you won’t end up with an input that is off the scale in the OOS data.
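One way to follow that advice is to fit the min-max range on the training window only and reuse it out-of-sample, so an off-scale OOS value shows up as a number outside [0, 1] instead of being silently rescaled. A sketch:

```python
# Min-max scaler locked to the training window's range.
import numpy as np

def fit_minmax(train):
    lo, hi = float(train.min()), float(train.max())
    return lambda x: (x - lo) / (hi - lo)

train = np.array([2.0, 4.0, 6.0])
scale = fit_minmax(train)
print(scale(train))            # [0.  0.5 1. ]
print(scale(np.array([8.0])))  # [1.5] -> OOS value outside the fitted range
```

Refitting the scaler on the full data instead would hide the out-of-range condition (and leak test-set statistics into training).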

As a final suggestion, you want to minimize the noise in both the input and output data. If you are working with weekly data, I suggest that you consider applying a gaussian filter on daily data. Instead of using the close one week ahead, you would use (1*day3 + 4*day4 + 6*day5 + 4*day6 + 1*day7) / 16.
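The 1-4-6-4-1 weighting is a binomial kernel (a discrete approximation of a gaussian); normalized so the weights sum to one, it can be applied with a simple convolution:

```python
# Binomial (gaussian-like) smoothing of a daily close series.
import numpy as np

kernel = np.array([1, 4, 6, 4, 1]) / 16.0  # weights sum to 1

daily_close = np.array([100.0, 100.0, 100.0, 100.0, 100.0, 100.0, 100.0])
smoothed = np.convolve(daily_close, kernel, mode="valid")
print(smoothed)  # a flat series is unchanged: [100. 100. 100.]
```

Because the kernel sums to one, smoothing does not shift the level of the series; it only damps day-to-day noise around the center day.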

Anyways, these are my suggestions that may or may not help.
Good luck
SteveA

Friends, I am not clued up enough on any of this to join the conversation. But a strategy that could actually work is to follow successful hedge fund managers’ stock selections.

All one has to do is get the quarterly 13F filings of the funds’ holdings. This public information is available 45 days after the end of the quarter to which it refers. For example, end-of-December holdings are available 45 days later, around Feb-15.

I used the stock factor tool to upload this point-in-time holdings information into P123 and was able to match the average return of the best five fund managers. There are no special buy or sell rules in my model. A screen with all the holdings showed an amazing annualized return of 30% over the last 10 years, without market timing.

I will not post the backtest because of P123 restrictions, but if you want to see it you can email me and I will send it to you. There is a screen holding all the stocks, and a strategy with a ranking of 15 stocks which shows an AR of 35% over the last 10 years, outperforming SPY in every year of this period with a low annual turnover of 190%.

Attached is the allocation of the current holdings.


Georg, you’re taking cherry-picking to an extreme here. What’s the difference between choosing, after the fact, the best five fund managers and choosing, after the fact, the five best-performing stocks or ETFs over the last ten years? And what does this have to do with machine learning? Sorry to get a bit heated up here, but anybody can outperform the market after the fact by picking stuff that did well. What we want to do here is to figure out how to outperform the market WITHOUT picking stuff that did well in the past, because that’s look-ahead bias. Whether it’s piggyback suggestions or market-timing rules, you can’t pick stuff that did well in the past and expect it to do well in the future. Look-ahead bias has been called by one statistician “the gravest mistake a data scientist can make.” It’s the number-one thing any machine-learning program has to avoid. Piggybacking on successful ETFs and hedge funds, like a lot of market timing, relies on 20-20 hindsight. It’s bad data science, and it’s not going to hold up out-of-sample.

How does piggybacking or market timing differ from backtested quantitative stock-picking? Aren’t they all forms of prognostication? Why is it different to rely on a value factor that has worked in the past than to rely on a hedge fund that has gotten good returns?

Because of look-ahead bias. When we backtest a value factor, or a growth or quality factor, we are looking at thousands of stocks over thousands of dates without knowing in advance which ones will do well, and we’re applying sensible criteria to differentiate those stocks from one another. There’s no look-ahead bias there. But when we backtest a particular hedge fund or ETF or economic indicator BECAUSE WE KNOW IT DID WELL and not because it was the most sensible choice at the time, that’s pure look-ahead bias.

We have to be vigilant about this, whether we’re doing machine-learning or developing investing systems. It’s the only way we’re ever going to get data to cooperate.

Hi Walter,

Yes, I use the tool with only technical information; they have developed a full-blown version with fundamental data.

Hi Marc,

Yes, yes, yes. I have a lot of skin in the game and ask myself that question (will that strategy work in the future, and why?).

Best Regards

Andreas

SteveA,

Thank you. These are all good points. Let me expand on a few.

I have found that using the log can be VERY helpful too. Using excess returns to get rid of some of the noise, and the log to reduce the effect of outliers in part, can be mandatory to get any kind of result. For whatever reason, neural nets seem to be more forgiving about the use of the log than, say, a random forest. I will try this with the log of the excess returns, as I have with random forests and boosting.

To be clear, when I checked for moving average crossover effects for this sim I did not use machine learning. I just optimized the crossover on one set and then saw how it did on a holdout sample.

Amen. Some of this stuff is supposed to have automatic feature selection. But you end up having to do as you suggest: keep the number of features low to begin with and remove the ones that are not adding anything.

I agree. And I did do this. I will probably go back to doing this to be sure it is not a problem here.

Wow! I did try the daily data that P123 provides for a moving average crossover test WITH A KALMAN FILTER.

SteveA,

These are good ideas and I will incorporate and/or retry most of them.

I am not done looking at technical data yet. I will try this with a recurrent neural network, for example.

I also want to do more with fundamental data.

Thank you.

Best regards,

Jim

Jim and Steve, why do you think NNs are better for this purpose than random forests, bagging SVMs or bagging KNNs? Maybe it’s my ignorance, but NNs are complete black boxes to me. I think decision trees, KNNs and SVMs show more obvious (mathematical) common sense. I have also read somewhere that NNs were not so successful in Kaggle competitions (maybe that is old information). Any thoughts?

Frederic,

I used to believe exactly as you do with boosting being my favorite method.

I now believe that NNs can be better for some sets of data. I believe I have seen situations where this is the case. That does not imply that this will always be the case for all data, as you know.

In fact the “No Free Lunch Theorem” may prove there is no single best method for all situations. Neural nets have been working well for me recently, however.

Plus, they are surprisingly fast.

-Jim