Bootstrapping: any comments welcome

Guenter,
Thank you!
That sounds like the way to go. I will download that and start learning it at work. It will probably mean "Boot Camp" for Mac to bootstrap at home. But definitely the smart way to go.

Thanks for the other ideas too!

BTW, you probably already know that it was your post that made me aware of Aronson’s book: a string of great and helpful ideas!!!

-Jim

Just a reminder: even the best statistical practices can produce bad results if they are applied to situations they were not designed for.

I.I.D. and randomness are not part of the world of the financial markets, and this is probably why famous quants have had much more success publishing books and papers than they have had in producing real-world dollars-and-cents results. There's a big difference between saying something is random and saying something has definite causes that are hard to identify and define in advance. We wrestle with the latter.

I suppose this is a good time to catch up on something I had forgotten to mention. I added an Introduction to the on-line strategy design class. It's available in Help>>Tutorials>>Courses>>Portfolio123 Virtual Strategy Design Class. Hopefully, it will help frame the nature of the research endeavors we undertake and make it clear that randomness and I.I.D. are not part of our world.

Marc,
Just want to say I agree with you.

Not that I have succeeded, but using bootstrapping, smaller p-values, etc., are all attempts to avoid at least some of the pitfalls (e.g., the normality assumption and excessive data mining leading to "alpha inflation").

Still I may not have adequately addressed other issues—as I have said—like i.i.d. I do not try to hide this.

I do not want to give details. But I have found things with Excel spreadsheets that have proved statistically significant. Things that cannot be tested using P123. Things that can be shown, using simple accounting, to have made me money, so far anyway. Things that can be found in the literature, but that I found only after noticing them in a spreadsheet of my trades and then searching the literature for them.

Furthermore, even bad statistics can tell you things. Maybe a regression cannot be done due to outliers and lack of linearity for example. But it can still be good to ask: “What is that extreme outlier doing there?”

The so-called "French Paradox," which found a low incidence of heart disease in French people, is an example. It is speculated that the French have less heart disease, and are extreme outliers despite other risk factors, because they drink wine. I am not so sure that regression was really legitimate; maybe it should have been thrown out of that paper. But it has the undeniable benefit of giving me an excuse to drink some wine now and again;-)

We are programmed to see patterns that just are not there, as Aronson points out. Any statistical method that exposes these false patterns is good.

But we also miss pretty extreme patterns if we do not look for them. It can be shown that we humans regularly miss relationships with correlations as high as 0.70 unless we specifically look for them.

Don’t even get me started on the fact that much of what is published and is supposed to work does not work in some ports while it works just fine in others. Or sometimes it is so well understood and published that following the herd is actually harmful. Not trying to sort this out is just laziness. Without a doubt an advanced degree in finance is the best way to get the best answers. But will that answer all of my questions on short-term and long-term momentum after I am done?

So I will not give up on the statistics—I know you are not recommending that I do.

-Jim

@Jrinne,

So… am I getting this straight? You have excess returns which, annualized as e^(m*252) where m is the mean of your daily excess natural-log returns (with some light assumptions), equal 13% at a 99.9% confidence level? Is this real-world after transaction costs? If so, can I just give you my money???
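
Spelled out in R (with a made-up value for m, just to show the arithmetic):

m <- 0.000485        # hypothetical mean daily excess log return
exp(m * 252) - 1     # about 0.13, i.e., roughly a 13% annualized excess return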

Also, I can second Cyberjoe’s motions for using Python or R. Of the two, I prefer Python because it’s more general purpose (and the syntax is also much cleaner). Someday, you may want to use a language for something… I dunno… other than statistics?

And thirdly, I agree with Marc G's views on the mis-assumptions of normality. But a far worse thing is to overfit the data, for example by using higher-level sample moments: if a sample distribution is not normal, then adding skew and kurtosis to a normal distribution doesn't fix the problem (i.e., Post-Modern Portfolio Theory is post-mortem). The key, though, I think, is being able to differentiate models from reality. This sounds easy enough, but the trouble comes from the fuzzy lines we draw between models and reality. Is not a sufficiently detailed model of the universe indistinguishable from the universe? I'm not saying we live in The Matrix, but I am saying that you and I have a vested interest in simulating reality as closely as possible. When models become sufficiently reflective of reality, human psychology has trouble distinguishing between the two.

Btw, I am semi-serious about the money thing.

David,
Good post. Let me start with the most interesting question:

I do too. That is the reason for using bootstrapping. If I were not concerned (I am) I would have used a simple paired t-test of daily returns of my sim and my benchmark.

I hope people more knowledgeable than I am will comment on how well bootstrapping addresses this issue. Getting those comments is the purpose of this post.

And I do not limit my concerns to questions of normality. I take all of Marc’s concerns seriously and rather than ignore them I like to look at them. When I can, I will do better statistics. Otherwise, I will “fudge down” my estimates when I cannot calculate objective numbers.

In the meantime, should I just sit and watch CNBC and let them cite their correlations, thinking that they have somehow done it better? It is always safe to follow the authority figures, and the herd in general, I think. You should always just use their statistics. Or better, just let them tell you the way it is.

Naw. That is no fun. I’ll go down using my ideas if I go down.

I think the math is correct. This is absolutely not anything out-of-sample. It is a sim. A serious sim with variable slippage, etc., but still just a sim.

As with most sims at P123, I can guarantee a few things. 1) It will revert to the mean. 2) There is, at the very best, data mining with guaranteed "alpha inflation." If you adjust the p-value based on the number of trials, it needs to be adjusted A LOT.

I made the interval as wide as I could with this program (or with my present knowledge of this program) to address this "alpha inflation" problem.

Also, I think overfitting is a slightly different topic than “alpha inflation” but however you view that there is always some overfitting too.

Finally on confidence intervals. I look forward to the day that a port performs near the upper range of a confidence interval over a reasonably long time-period. I truly look forward to it but I have not seen it yet.

All that having been said, it remains possible that, as the statistics suggest, it will beat the benchmark.

I absolutely will do one or both of those, starting with R, I think. I had kind of looked at R before. But that is why I post: I get great ideas and suggestions, and this is obviously a good one.

David, on a separate topic: slippage. I did not post on your thread because I think you are talking about different liquidity than I have experience with. But, depending on how you measure your slippage, it is possible to get to pretty accurate numbers, with a small standard error, pretty quickly.

Recommendation: Make your best estimate on slippage. Start a little small so that any errors do not cost you much. Then adjust based on your data. Surely, we can all agree on this use of statistics.
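
To make that concrete, here is a minimal sketch, assuming you log a slippage number for every fill (slip_bps is a made-up name for per-trade slippage in basis points):

slip_bps <- c(12, 8, 15, 9, 11, 14, 7, 10)    # hypothetical per-trade slippage, in basis points
mean(slip_bps)                                 # running estimate of your true slippage
sd(slip_bps) / sqrt(length(slip_bps))          # standard error: shrinks quickly as trades accumulate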

-Jim

Wow!!! Thank you Guenter!!!

Just finished my first read of Aronson’s book and started with a little bit of his type of analysis: see above bootstrapping.

But his book is not about bootstrapping. Well, it is about a lot of things.

People should read this book to understand DATA-MINING BIAS. I have been calling it regression toward the mean, but that term often refers to other issues and is a very poor fit here. Read the book and let someone who really understands this walk you through it. I am a neophyte, and not necessarily the most promising one at that. But people who have not read the book and just look at the pretty graphs on P123 are at a disadvantage.

If I have taken away nothing else from the book it is that: “The data miner’s mistake is using the best rule’s back-tested performance to estimate its expected performance.”

We all kind of know this. But he shows why this can never be avoided, accepts it, and moves on to doing the best data mining possible.

In other words, use the annualized return as part of your decision as to which sims to turn into ports but never use the annualized return as an estimate of your future return.

Now onto a few minor things:

With regard to normality: non-parametric tests of the above example sim using SPSS continue to show a good p-value. But a paired t-test also gives the EXACT SAME confidence interval as the bootstrapping: I looked at a 95% confidence interval for this. However you look at it, normality does not seem to be a big issue here. Bootstrapping is intended to take fat tails into account and did accomplish this in some of my anecdotal tests (I do not think SPSS is the best program for bootstrapping, either). But generally speaking, the central limit theorem really is a theorem and not just someone's opinion.
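
If you want to run that comparison yourself, here is a minimal sketch (sx is a placeholder for the daily excess log returns, as in the code near the end of this thread):

library(boot)
sx <- rnorm(4585, 0.0004, 0.01)        # placeholder; use your own daily excess log returns
t.test(sx)$conf.int                    # classical 95% CI for the mean excess return
b <- boot(sx, function(x, d) mean(x[d]), R = 10000)
boot.ci(b, type = "basic")             # bootstrap 95% CI; compare the endpoints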

On i.i.d.: this is to be taken seriously. By accident, and because I mimicked what Aronson had done, I at least made my analysis stationary; he uses differencing and detrending. As I understand it, using natural logs may help with this too. I will keep learning about this.
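
For what it is worth, a minimal sketch of that transformation (prices is a made-up daily price series):

set.seed(7)
prices <- cumprod(1 + rnorm(1000, 0.0004, 0.01)) * 100   # hypothetical daily closes
log_returns <- diff(log(prices))     # differencing the log prices gives (roughly) stationary returns
plot(log_returns, type = "l")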

Finally, I do not think this argues against anything Marc has said. If anything it strongly makes his point that statistics can be very badly misused. We may—or may not—have a minor difference of opinion on whether properly done statistics can ever tell you anything.

But I think Marc and I would probably be in agreement that if I show you a sim in isolation, claim it is good and say "I have proved its value to the p < X level of significance and you can expect this kind of return going forward," it is… Well, I'll let Marc use his own expletives on that: I would not be surprised if we are using the same ones.

Aronson's book could even be used to show why doing very few backtests and no data-mining works: as it does for Marc.

Also, a careful read on data-mining bias would make one want to use rational rules that have the highest chance of being effective (the best Bayesian priors).

These are the rules that Marc is recommending based on a great deal of experience and education. I use them: see above about being a neophyte and not the most promising one at that, however.

Thank you Marc, David (Primus) and Guenter. Discussing this is a big part of my beginning to understand it. BTW, you can do R on Macs (thanks).

-Jim

Good luck, Jim! Sounds like you have an exciting journey ahead. I haven't read Aronson's book, but it's on my Amazon wish list. I am much more heavily invested in commodities-based business valuation. It sucks. Don't get into it. In my experience, it's far better to be a generalist investor – more opportunities, less backlash from being wrong, only required to be generally right, etc…

All:

I take bootstrapping very seriously. In particular, I use R to "rebuild" the equity curve 1,000 times. I then analyze these curves for various metrics, including return and peak-to-trough drawdown. In my experience, "good" systems show roughly normally distributed returns over shorter time frames and roughly exponentially distributed drawdowns in the tail of the distribution. This is one of the reasons that I have high confidence in the systems I trade.
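
A stripped-down sketch of the idea (not my production code; daily_returns is a made-up stand-in for a system's actual daily return series):

set.seed(1)
daily_returns <- rnorm(2520, 0.0006, 0.012)     # hypothetical ten years of daily returns
max_dd <- function(r) {                         # peak-to-trough drawdown of one rebuilt curve
  curve <- cumprod(1 + r)
  max(1 - curve / cummax(curve))
}
dd <- replicate(1000, max_dd(sample(daily_returns, replace = TRUE)))
quantile(dd, c(0.50, 0.95))                     # median and tail drawdown across 1,000 rebuilds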

Bill

I’m wondering if any of you can explain this to someone whose knowledge of statistics doesn’t go beyond what’s available on Excel. I’ve read this thread several times and I can’t figure out what bootstrapping is. Or what an equity curve is. Or what p-value and t-test mean. Or what an I.I.D. is. I take it R means correlation (as in r-squared)? When I look these up, I get just as confused. What does randomness have to do with P123 results? Do I need to take a stats course, or can someone explain in simple terms what you’re doing? Are you manipulating screens and/or simulations in Excel (which is what I do)? Does this have anything to do with alpha and standard deviation or is it something altogether different? And isn’t applying statistics to technical analysis like applying Newton’s laws of motion to astrology?

Yuval,

I’ll have a go and try not to confuse you …

Standard statistical methods that test hypotheses or estimate confidence intervals assume that the sample data comes from a so-called normal distribution. Unfortunately, this assumption is not 100% true for stock market return data. Returns usually have fat tails and are skewed.

Therefore, the traditional way of calculating hypothesis tests or confidence intervals leads to "wrong" results. It usually produces confidence intervals that are too narrow, misleading investors.

Bootstrapping is a method that does not assume a normal distribution. This method can be used for data sampled from unknown probability distributions, or small samples, or samples with outliers (common in return data).

Bootstrapping basically creates a distribution of a test statistic by repeated random sampling with replacement from the original sample. Without making any assumptions about the underlying theoretic distribution, confidence intervals can be estimated based on the distribution of the test statistic.

A confidence interval describes the range of an (unknown) population parameter based on a random sample. Roulette is a good example. If you spin the wheel 10,000 times, you would expect the number of zeros to be "close" to (1/37)*10000 = 270 (never been to Las Vegas, but I think American roulette wheels have 2 zeros, so the expected number of zeros would be double). Confidence interval methods calculate the probability that the number of zeros will be within a certain range. In the roulette example, the number of zeros can be expected to be between 240 and 300 in 94% of cases. So, if you play at a table for 10,000 spins, and the number zero comes up 290 times, you would say this is as expected. If the zero shows up 310 times, you should have strong doubts about the fairness of the table.
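
You can reproduce those numbers in R from the binomial distribution (single-zero wheel):

qbinom(c(0.03, 0.97), size = 10000, prob = 1/37)   # roughly 240 and 300: the central 94% range for the zeros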

As for Jim, he can take as his null hypothesis that his true excess returns are 0. Sometimes the measured excess returns are a bit higher than 0, at other times a bit lower. Bootstrapping is expected to confirm whether or not his excess returns are statistically significantly different from 0.

As for some of your other questions:

  • R (capital R) is an open source software for statistical analysis.

  • R squared is the proportion of the variance of a dependent variable explained by the independent variable(s). This statistic is used to describe the quality of a regression model.

  • i.i.d. is the assumption that variables are independent and identically distributed.

  • t-test and p-value are terms used for hypothesis testing.

PS: Statistics is not that hard to learn. A good college textbook and a few weekends should be enough. You will benefit a lot. It will make you smile when people sell their stocks, because the unemployment rate is 0.1 percentage points “higher” than it was three months ago.

Thank you! I will definitely try to learn some more about this. - Yuval

y:

There are also free/low-cost online classes. You might want to check out Coursera or edX. C: you did a really good job of explaining a lot of not-so-simple concepts.

Best,

Bill

After learning a little bit of R I was able to get a histogram of some bootstrapping results.

The central limit theorem is alive and well. There is no doubt that the distribution of daily stock returns is not normal.

But 100,000 sample means, with 4585 daily returns in each sample (i.e., the MAX period on a sim), are looking pretty normal to me. The central limit theorem really does turn a non-normal distribution of returns into a normal distribution of sample means when the sample size is large, it seems.

Conclusion: as long as you are talking about large sample sizes, there is not a lot of error introduced by just using Sharpe ratios, t-tests, etc. (without bootstrapping), I THINK. Bootstrapping is cool, however.
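
Here is roughly what I did, as a sketch anyone can run (a fat-tailed fake return series stands in for my sim's daily returns; it takes a minute):

set.seed(42)
fake_returns <- rt(4585, df = 3) / 100       # heavy-tailed stand-in for daily returns
means <- replicate(100000, mean(sample(fake_returns, replace = TRUE)))
hist(means, breaks = 100)                    # the sample means look very close to normal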

BTW, R is much, much, much faster than SPSS at this.

-Jim


CoOoOoL.

By the way, one of the side-effects of the central limit theorem is that the distribution of sample means approaches a normal distribution as N → ∞, no matter what the underlying distribution looks like (provided the observations are i.i.d.).

So… burning question… what is the so-what factor from these tests? How does it affect how you are going to invest?

RE: central limit theorem. Yes, assuming a finite variance (as you know). It is amazing, however, how many people criticize any statistics based on the lack of normality of the underlying distribution.

And I still do not know how large the sample really has to be, on a practical level. Somewhere between 30 and 4585, it would seem, based on this test and what the textbooks say.

My last post does specifically address the original topic of this thread: is bootstrapping really superior to other methods?

So, the main thing I get from this is the data-mining bias. This can actually be quantified, to some degree, using "White's Reality Check" based on bootstrapping and Aronson's general methods. And on a more basic level, Aronson's book makes it crystal clear why this occurs. I will forever understand it on a gut level.

In short, when I look at a sim I will have some idea of how much to subtract from the sim’s returns to get the maximum returns that the sim would likely provide as a port: worse if there is bad overfitting. But I will never again think the port is likely to do as well as the sim out-of-sample: JudgeTrade seems to be the only one who can avoid this rule (amazing results);-).

What is really happening is that I am realizing that my ports are a result of a lot of Data Mining and survivorship bias. I am trying to develop realistic expectations—while getting some idea whether they really have an edge over the benchmark (or not).

But your point is well taken. I do enjoy talking to people who really understand this. And being encouraged to learn R and Python was helpful. Now I need something else to analyze with R. When I get a good hammer that I like to use I do go around looking for nails.

Much appreciated!

-Jim

All:

What I get out of good design, which I check in part by bootstrapping, is a robust system. In general, I have been pleased that my systems live up to their advertisement (i.e., their backtest).

Bill

I looked at a few DMs with high returns. I downloaded the daily performance and calculated monthly returns for 17 years = 204 months. One particular model had 6 months showing positive returns at or over 3 standard deviations above the mean. Needless to say, none of the negative monthly returns were that extreme.

If a data distribution is approximately normal, then about 68 percent of the data values are within one standard deviation of the mean (mathematically, μ ± σ, where μ is the arithmetic mean), about 95 percent are within two standard deviations, and about 99.7 percent lie within three standard deviations. Accordingly, one would expect only 0.3% of the model's monthly returns to be more than 3 standard deviations from the mean, which is not even one month out of the 204.

I then calculated the annualized return for the model by substituting the return of SPY for the 6 months under consideration. This caused the model's annualized return to drop from 39% to under 30%, which is a huge difference. So when considering a DM or your own simulations, have a look at the Distribution Chart of the monthly returns, which should provide an indication of whether a model is over-optimized.
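
In R, the check I did looks roughly like this (model and spy are placeholders for the 204 monthly returns; with the real data the gap above was 39% vs. under 30%):

set.seed(3)
model <- rnorm(204, 0.030, 0.05)                  # placeholder monthly model returns
spy   <- rnorm(204, 0.008, 0.04)                  # placeholder monthly SPY returns
z <- (model - mean(model)) / sd(model)            # standardized monthly returns
adjusted <- ifelse(z >= 3, spy, model)            # substitute SPY for the extreme months
annualize <- function(r) prod(1 + r)^(12 / length(r)) - 1
c(annualize(model), annualize(adjusted))          # compare the two annualized returns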


Putting this here, because it’s a great read, slightly on topic and I don’t know where else to put it.

A famous conjecture linking geometry, probability theory and statistics has been proved at last…

https://www.quantamagazine.org/20170328-statistician-proves-gaussian-correlation-inequality/

@InmanRoshi,

what a fantastic read! Thank you for sharing!

//dpa

So I know I will get a lot of pushback on this but here goes.

“White’s Reality Check” and Aronson’s methods, in general, attempt to quantify how much a method’s performance (a sim in this case) is due to luck.

The assumption is that when we pick a sim, we are picking the best-performing sim. And that sim has performed well due to both luck and "predictive power," to use Aronson's term.

He calls this luck factor the data-mining bias.

I do not think we can ever get away from this. This is because of the data-snooping bias. We read "The Magic Formula," "What Works on Wall Street," Fama and French's work—not to mention all of the public ranks and sims on P123. When we run our very first sim after joining P123, it can be the result of tens of thousands of trials (done by other people). Then we add more of our own trials over the years.

If you take one of your 5-stock sims and zero out the mean return, then it should have no predictive power. But how well can it do just by luck? If you bootstrap the sim and run it 2,000 (or so) times, you will find out.

That is the beginning of finding out how much luck there can be in a sim.
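
As a sketch, with sx as a placeholder for the sim's daily log returns (as in the code at the end of this post):

sx <- rnorm(4585, 0.0004, 0.01)     # placeholder; use the sim's daily log returns
demeaned <- sx - mean(sx)           # zero out the edge: no predictive power left
m <- replicate(2000, mean(sample(demeaned, replace = TRUE)))
lucky <- exp(m * 252) - 1           # annualized return of each zero-edge resample
max(lucky)                          # the best a lucky, edge-free sim can look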

Note that I said "can be." I do believe many of the tests for overfitting reduce this. Yuval (testing 100-150 stocks), Denny (even/odd) and other posts are very important. I agree: MAYBE THE DATA-MINING BIAS CAN EVEN BE REMOVED. I have some additional ideas on this. I will probably agree with other people's ideas on this.

That having been said: a simplified "White's Reality Check" suggests to me that as much as a 35% annualized return in a sim is possible just by luck. This is based on the FACT that you can run 2,000 5-stock sims and have one of them return 35% annualized, on average, even if the returns are zeroed out and you know they have no predictive power. Now, if you do better than 35%, you may be on to something. Just do not expect it to do as well as the backtest going forward. And if you really do a lot of checking—like looking at the entire rank performance and the above ideas on preventing overfitting—maybe you can get by if your sim does less than 35% annualized. Maybe.

And, of course, out-of-sample results change the numbers on this, especially if you are looking at your own ports, which will be a limited number of ports that are not cherry-picked and do not have a large survivorship bias.

But even if you have chosen a perfect port with 100% predictive power, which had no luck at all in the past, that is no guarantee that you will not be unlucky in the future. This exercise addresses that possibility too. Instead of seeing how lucky your sim was in the past, you can get an idea of how unlucky your perfect port might be in the future. Bootstrapping using a 95% confidence interval suggests that a perfect port could underperform by 20% annualized over an 18-year period just because of bad luck. Only someone as unlucky as me would have both happen (pick a lucky sim that turns out to be an unlucky port), but that would be in the range of possibilities.

So even if you have the perfect port, it would be a mistake to dismiss White’s Reality Check entirely.

This is not intended to be negative: just a reality check. After doing this, I have many sims and ports that I have confidence in: more than before. I just do not think I will be retiring next year. It could take two or three years;-)

BTW, the R code for bootstrapping is not that hard (or I could not have done it). You will, of course, name your own variables. s is the name of my Excel document in csv format. You would probably load it with read.csv("directory path to s", header = TRUE). sx is the daily excess log (natural log) returns calculated on the spreadsheet. Make sure to load the boot package. Adjust the confidence interval as desired (possibly based on your data-snooping bias and/or number of sim trials). See if any of this applies to you, assuming it is possible for you to be either lucky or unlucky.

library(boot)                            # bootstrap functions live here

summary(s)                               # s: data frame loaded with read.csv()
attach(s)                                # make the column sx visible by name

simx <- function(sx, d) { return(mean(sx[d])) }    # statistic for boot(): mean of the resampled returns
simxx <- boot(sx, simx, R = 100000)      # 100,000 bootstrap resamples
plot(simxx)                              # histogram and Q-Q plot of the bootstrapped means

ci <- boot.ci(simxx, type = "basic", conf = 0.95)  # 95% confidence interval for the mean
ci
print(mean(simxx$t[, 1]))                # mean of the bootstrapped means

More advanced code implementing White's Reality Check exactly can be found on the web. Maybe even someone like me could cut and paste it. But I am so unconvinced that you could ever get a truly exact number that I have not tried it. It is enough, for me, to see that chance can always be a factor—since there is, sadly, no divine inspiration for making money here.

Best of luck!

-Jim