I just wrote a very long post on this that got deleted - I wish P123 would autosave replies!
(All of the below numbers are from simple tests run with the ranking system).
Alan,
Thanks for sharing. Over the past 2-3 years, I have read many papers in this area.
EDIT (I attached the wrong paper)
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2249314
This is a paper by ‘smart people.’ They have another paper showing that many of the 360 factors in peer-reviewed journals likely aren’t statistically robust once the number of trials is taken into account. How many of the P123 factors are?
For example, there is a huge debate in the academic literature as to whether the ‘small cap’ effect is real if (a) January is removed, (b) tiny stocks (penny stocks, etc.) are eliminated, and (c) we only look post-1980. It may not be ‘robust.’
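To see the scale of the multiple-testing problem, here is a generic sketch (this is a simple Bonferroni-style correction, not the paper’s actual method, and the trial counts are illustrative): the t-stat a factor must clear rises sharply with the number of trials assumed.

```python
# Illustrative only: how the significance bar moves as the assumed number
# of 'trials' on the data grows. Bonferroni is the crudest correction;
# the paper above uses more sophisticated machinery.
from statistics import NormalDist

def bonferroni_t_threshold(alpha=0.05, n_trials=1):
    # Two-sided per-trial threshold after splitting alpha across n_trials.
    return NormalDist().inv_cdf(1 - alpha / (2 * n_trials))

for n in (1, 100, 10000):
    print(n, round(bonferroni_t_threshold(n_trials=n), 2))
# With 1 trial the familiar ~1.96 bar applies; with thousands of
# (uncounted) trials the bar climbs well past 4.
```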
However, their method for ‘discounting’ Sharpe ratios also likely won’t work.
POINT 1: The idea that you can ever count trials is wrong.
All of these papers miss a fundamental issue. Georg gives a simple example of this. It’s impossible to count the history of an idea.
Take academic research. There are tens of thousands of researchers and PhD students running ‘backtests’ on data every year - and this has been going on for decades, increasingly with computers. Those who find ‘meaningful’ patterns then team up with a mentor and seek academic publication. A near-infinite number of ‘trials’ has been run on the data, with only the ‘successful’ factors being published. This is massive data ‘overfitting.’ (The authors above talk about this.)
Given that ‘counting backtests’ is impossible unless we can perfectly trace the origins of our ideas, we have to assume that every ‘factor’ is the result of infinite permutations having been run on the historical data.
POINT 2: Discounting Sharpe ratios is also wrong.
My time on P123 has taught me that ‘discounting Sharpe ratios’ doesn’t work. The most overfit backtests often have the highest Sharpes to begin with and still (sometimes) pass all kinds of ‘sensitivity’ tests. ‘Math’ is good, but it’s not enough.
So, we should just assume that every system / variable has been the product of an infinite number of trials, and then ask ourselves what ‘parameter sensitivity’ checks we can run after the fact to gain an understanding of the system’s sensitivity to parameter shifts / changes.
POINT 3: Running ‘batch permutations’ on tests - especially on black boxes to be sold to others - is the minimum thing to do.
Simple ‘trading day’ permutations will usually not hold up to parameter shifts. If a 5% gain over 20 days works, but 4% over 19 or 6% over 21 doesn’t, I likely don’t have a ‘robust’ factor. Some highly overoptimized systems will break in these tests. Some won’t.
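That perturbation check can be mechanized. Here is a minimal sketch (the backtest function is a made-up placeholder standing in for a real sim run; the numbers and the 5-point "cliff" threshold are assumptions for illustration):

```python
# Hypothetical parameter-perturbation sweep around a '5% gain over 20 days'
# rule. A robust rule should degrade gracefully at neighboring settings;
# a cliff at the exact base parameters suggests overfitting.
import itertools

def backtest_ar(gain_pct, days):
    # Placeholder for a real backtest run; returns annualized return in %.
    # This fake rule only 'works' at exactly (5, 20) - the overfit case.
    return 15.0 if (gain_pct, days) == (5, 20) else 6.0

base = (5, 20)
grid = list(itertools.product([4, 5, 6], [19, 20, 21]))
results = {p: backtest_ar(*p) for p in grid}

neighbors = [r for p, r in results.items() if p != base]
fragile = results[base] - max(neighbors) > 5.0  # big cliff => likely overfit
print("base AR%:", results[base], "fragile:", fragile)
```

In a real run, `backtest_ar` would call the simulation engine, and the grid would cover every tunable parameter, not just two.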
Many of the most ‘statistically robust’ and highest-Sharpe systems will have the highest turnover. They will be very sensitive to slippage / trade execution - and to others continuing to trade them. Those will be the critical inputs / stress points.
Each system will have its own stress points.
POINT 4: More factors can be good or bad
If I start with the ADT100 $1MM universe and close>$2 and just rebalance 100 stocks annually, I earn 8.2% AR - which beats the SP500 handily.
If I choose a ranking system to do this, and I pick 2 random factors, I won’t be too badly off if I don’t do much else.
My annual returns will equal
a) the Universe returns
minus
b) slippage and trading costs
minus
c) taxes
plus / minus
d) tracking error (which is huge for a low number of holdings and low turnover, but falls with high holdings and/or high turnover)
plus/minus
e) timing / hedge modules
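The decomposition above is just arithmetic, so it can be written down directly. All of the numbers below except the 8.2% universe AR (from the test above) are invented placeholders:

```python
# Back-of-envelope net return decomposition, per the list above.
# Only universe_ar comes from the test in this post; the rest are assumed.
universe_ar  = 8.2   # % AR of the ADT100 $1MM universe (from above)
slippage     = 1.0   # % lost to slippage and trading costs (assumed)
taxes        = 1.5   # % lost to taxes (assumed)
tracking_err = -0.5  # % tracking error vs. the universe (can be +/-)
hedge_effect = 0.0   # % timing / hedge module contribution (can be +/-)

net_ar = universe_ar - slippage - taxes + tracking_err + hedge_effect
print(f"expected net AR: {net_ar:.1f}%")  # prints 5.2% with these inputs
```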
Most R2G’s are so complex, they are almost impossible to evaluate from the outside. Most very-small-holdings, high-turnover systems will perform well below their backtests… some won’t, but they are impossible to select beforehand. That doesn’t make them ‘bad’, but they are hard to allocate large amounts of money to and are very sensitive to very small parameter changes (i.e. assumed slippage or fill rates on the ‘best’ stocks).
POINT 4B: Sometimes more factors are good and sometimes they aren’t
More factors can be used to increase overoptimization and the likelihood of system failure, or to reduce it. Nearly every designer likely thinks they are doing the latter. Are they?
Let’s say I have 2 factors. Factor 1 is a true factor with AR% of 18% for the top 5% of this universe and 14.5% for the top 20% of this universe.
Factor 2 is random. In this case, the top 5% bucket will fall to 14% (about a 20% reduction in ‘true AR’) and the top 20% bucket will fall to 13% (about a 10% fall in AR).
Let’s say now I have 9 factors. 8 are random. One has a true AR% of 18% for top 5% of the uni and 14.5% for top 20% of uni.
Now, the top 5% bucket falls to 11% or so and the top 20% bucket to 12%.
In both cases, trading only the top 5% in a well-designed system can still deliver more alpha.
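The dilution effect in these two examples can be reproduced with a quick Monte Carlo. This sketch is illustrative only - the signal strengths, noise levels and bucket sizes are assumptions, not the actual ranking-system numbers above - but the pattern (top-bucket edge shrinking as random factors are averaged in) is the same:

```python
# Monte Carlo sketch: averaging random factors with one true factor
# dilutes the top bucket's edge. All magnitudes are made up.
import random, statistics
random.seed(0)

def top_bucket_mean(n_random, n_stocks=1000, top_frac=0.05, trials=100):
    means = []
    for _ in range(trials):
        true_f  = [random.gauss(0, 1) for _ in range(n_stocks)]
        returns = [f + random.gauss(0, 3) for f in true_f]  # noisy payoff
        noise   = [[random.gauss(0, 1) for _ in range(n_stocks)]
                   for _ in range(n_random)]
        # Equal-weight composite of the true factor plus n_random noise factors.
        comp = [(true_f[i] + sum(n[i] for n in noise)) / (1 + n_random)
                for i in range(n_stocks)]
        order = sorted(range(n_stocks), key=lambda i: comp[i], reverse=True)
        k = int(n_stocks * top_frac)
        means.append(statistics.mean(returns[i] for i in order[:k]))
    return statistics.mean(means)

pure    = top_bucket_mean(0)  # true factor alone
one_rnd = top_bucket_mean(1)  # one random factor mixed in
eight   = top_bucket_mean(8)  # eight random factors mixed in
print(round(pure, 2), round(one_rnd, 2), round(eight, 2))
# The top-bucket edge shrinks as the random-factor count grows.
```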
So… given that we can assume that many of our factors are likely random, or will behave as if they are random in the out-of-sample period, it makes sense to build some redundancies into every system (at the ranking, universe and/or buy and sell rules - I use them all).
POINT 5: Every ‘factor’ has periods when it works and periods when it won’t. When it doesn’t it can behave negatively or behave as if it’s random. Even if it’s a ‘true factor.’
I may have heard that ‘value works’ and run one test that confirms it - but ‘value works’ came from thousands of backtests over decades confirming it (or curve fitting that found it, followed by a rush to copy it).
We should attempt to understand why it works, but should also be skeptical of our own stories.
For example, maybe value works initially because people over-discount unattractive companies that no one wants. One person learns that baskets of them are good buys because they sell at huge discounts to their ‘intrinsic values.’ She makes millions. But over time more people learn this, and ‘value’ multiples get elevated by lots of people (or computers) trading them because ‘value always works’ relative to growth. At that point value stocks no longer have any real value (they are overpriced relative to intrinsic company value), and value won’t keep working unless new naive value buyers and volume keep coming along. So ‘value works’ will keep working if enough people follow it, even though the real intrinsic-value edge relative to growth is no longer there - but it won’t work if people stop believing in it.
There are countless ‘factors’ we can run these thought experiments for, but nearly all (if not 100%) of them are giant ‘shell games.’
POINT 6: Better book backtesting and ‘rule based’ sub-system weighting/selection functionality is the ONLY reliable fix - it’s how the best multi-asset and multi-system managers have solved this problem for decades. P123 needs a huge upgrade here.
We all (if we are managing money) therefore need to ask how best to construct mechanical systems for backtesting and, in real time, for building more robust portfolios of systems with low/negative ‘peak stress event’ correlations. Having the book feature allow us to adjust weights dynamically based on a ‘basket of underlying’ systems and their trailing correlations, volatility and behavior would be a great starting point.
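One of the simplest rule-based weighting schemes in that family is inverse trailing volatility. This is a minimal sketch of the idea, not a proposal for the exact P123 feature - the return series are invented, and real book logic would also fold in trailing correlations:

```python
# Hedged sketch: weight each sub-system by inverse trailing volatility,
# so calmer systems get more capital. Return series below are invented.
import statistics

def inverse_vol_weights(return_series, lookback=12):
    vols = [statistics.stdev(r[-lookback:]) for r in return_series]
    inv  = [1.0 / v for v in vols]
    total = sum(inv)
    return [w / total for w in inv]

sys_a = [0.02, -0.01, 0.03, 0.01, -0.02, 0.02] * 3   # calmer system
sys_b = [0.08, -0.06, 0.07, -0.05, 0.09, -0.07] * 3  # wilder system
w = inverse_vol_weights([sys_a, sys_b])
print([round(x, 2) for x in w])  # the calmer system gets the larger weight
```

A production version would recompute the weights each rebalance and could swap in risk-parity or correlation-penalized weights without changing the structure.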
Being able to run backtests on a PIT, no-survivorship-bias R2G ‘graveyard’ would be another great step.
I have been beating this drum for years, because it’s the only way to better manage my family’s money. Please vote for this and/or lobby for it.
POINT 7: Unique, proprietary data sets are one of the most reliable sources for ‘alpha’. Alpha is limited by nature.
Upgrading ‘data pack’ add-ons, and/or finding a way for PIT data to be sold by third-party vendors after it passes PIT tests, is something P123 should look at very seriously and the community should get behind. Please vote for this.
Please also vote for ‘user defined’ short-term and long-term tax rates, so that we can model things like tax-loss harvesting and look at sim’s with after-tax returns.
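What ‘user defined tax rates’ would let us model is simple to state. A minimal sketch (all rates, gains and the harvest-offsets-short-term-first assumption are hypothetical inputs, not how P123 would necessarily implement it):

```python
# Hypothetical after-tax calculation with user-defined ST/LT rates and a
# tax-loss-harvesting offset. All numbers here are illustrative.
st_rate, lt_rate = 0.37, 0.20       # user-defined short/long-term rates
st_gain, lt_gain = 4000.0, 6000.0   # realized gains over the year ($)
harvested_loss   = 1500.0           # assumed to offset ST gains first

st_taxable = max(st_gain - harvested_loss, 0.0)
tax = st_taxable * st_rate + lt_gain * lt_rate
after_tax_gain = st_gain + lt_gain - harvested_loss - tax
print(f"tax owed: ${tax:.2f}, after-tax gain: ${after_tax_gain:.2f}")
```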
Everything else - including counting ‘backtests’ or relying on ‘time tested’ fundamentals - misses the point. The market is a multi-player game long divorced from what businesses would ‘buy’ and ‘sell’ at, and we need other traders following rules similar to ours for stocks to be driven higher (or lower if we’re shorting) in predictable ways. We can’t be sure of any single system, so we need to combine large numbers of them in a rule-based way to create reliable return streams.
Best,
Tom