Single-factor performance using alpha and volume-based slippage

I would caution that if you use a non-adjusted Mktcap filter in your universe, the backtest will probably be skewed, as we expect companies' market caps to grow over time with inflation. I actually need to fix this in some of my universes…

On a different note, I have switched to looking at rank-bucket testing using a 1-week rolling annual backtest (like the screener, not the simulation) with standard-deviation error bars. I have noticed that some single factors have a massive standard deviation in the expected returns of all of the bars, especially the top bar.

Price2SalesTTM with alpha, but no slippage:

This information is not very useful on its own, as this is just a single factor, but it can give you a bit more understanding of the factor.

I am thinking about whether I can make a nice “core” ranking system with, say, 5-10 factors and then test new factors by adding them one at a time and making a similar plot. For context, this is just to help decide which factors to download for ML and optimization. I know that you really need to test your factors in your full system by removing/adding them to see how they affect total performance. But I cannot download all of the factors due to API credits, so a less intensive approximation seems valuable.

Jonpaul,

All great points, I think.

So that last bucket has little meaning in a statistical sense (by itself), and probably little usefulness even if it is meaningful, unless you like volatile models for some reason.

A regression on that data would not satisfy the assumption of homoscedasticity.

Yes. We really have to think about what we download. I am probably going to pay for 50,000 API credits. I have been thinking and working all weekend, making sure that I will download what I need and nothing more.

Jim

Knowing what to download is really hard. I want to be greedy and download it all so I can optimize away, and see if ML can handle the kitchen sink of factors and come out with something amazing. But reality is not so kind, and I think even with ML the input quality is just as important. Also, API credits…

I started parts two and three of the factor evaluation quest: looking at ranking-system factor removal and addition. I have to say it's difficult, as few factors produce “worse” performance across all of my evaluation metrics.

For example, this is a screenshot of factor removal. The first purple line is the baseline system and the second line is with forward earnings yield removed. For removal, if the performance improves, it means the factor is hurting the system.

You can see that the Spearman rank correlation dropped along with annualized alpha, which would say to keep the factor, as things get worse without it. However, the standard deviation dropped and the 5-fold performances are all better when you remove it, which would say to get rid of it. So is it better to keep or remove? I guess it depends on what you care about. Looking at two P123 ranking systems, I see the same situation, where removing the factor is not an improvement across all of my tests.

So the task continues…

Jonpaul,

Get rid of it. Full stop.

So, as background, you can go to the designer models and see what happens to some very smart people using some very good factors when they rely on backtests alone. Much of what they are doing is not wrong.

But that is the nature of cross-validation. Your training performance goes down, but your system “generalizes” better to new data.

It is not immediately intuitive. But we know from the designers that a model can do well in-sample and do nothing out of sample. After a while the designers remove the underperforming model, backtest a new one, and submit it as a new designer model. Rinse, repeat… You will not want to start that cycle for yourself and your own models.

BTW, I think forward earnings yield can be a good factor, if somewhat marginal. Maybe it is being removed due to collinearity or an interaction with another factor (out-of-sample in the k-fold validation). So your belief in the factor (as an individual factor) should not influence whether you remove it from a model with other factors in it.

This assumes that your Spearman rank correlation is not already part of a k-fold cross-validation (maybe code that in, if I understand correctly that it is not part of one now).

I could be missing something, like whether the Spearman rank correlation is already part of a k-fold cross-validation or something else. I hope I am not too far off.

Jim

All,

The use of mod() or subsampling versus cross-validation is confusing to people. For sure I do not always get it myself. I have to at least think about it. It's not natural.

With regard to Python and shuffle (shuffle=True or shuffle=False), I think that shuffle=True will generally give you the equivalent of mod() (subsampling or bootstrapping), while shuffle=False will give you cross-validation for out-of-sample performance. shuffle=False (with your CSV file ordered by date) will train on one time period and test on a different time period, which is what you are looking for as far as understanding what to expect out of sample.
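To make that concrete, here is a tiny sketch (toy data and made-up dates, not my actual workflow) showing what the two settings do to the test folds of a date-sorted file:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy stand-in for a CSV ordered by date: many stocks per weekly date.
dates = pd.date_range("2004-01-02", periods=1000, freq="W-FRI")
df = pd.DataFrame({"date": np.repeat(dates, 50)})

for shuffle in (False, True):
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    print(f"shuffle={shuffle}")
    for fold, (train_idx, test_idx) in enumerate(kf.split(df)):
        test_dates = df.loc[test_idx, "date"]
        # shuffle=False: each test fold is a contiguous block of dates, so you
        # train on one period and test on another. shuffle=True: the test fold
        # spans the whole history, which is closer to mod()-style subsampling.
        print(fold, test_dates.min().date(), "to", test_dates.max().date())
```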

Perhaps mod(), subsampling, and bootstrapping can be thought of as methods to TRAIN on new data, or at least on a set of data that could have happened in the past.

Cross-validation is a chance to TEST on new data that you (or Python with k-fold validation) did not train on.

So you often want to do both: expand what you are training on, and be more aware of what a change in your model will do out-of-sample.

TL;DR: Mod() and cross-validation are 2 separate things that naturally get confused with each other.

Jim

I’m not sure why you’d want to remove factors instead of simply assigning them zero weight for a little while. Considering that type II errors are, in general, more fatal than type I errors, there’s a strong argument to be made for keeping as many factors as possible in the range of possibilities. See https://backland.typepad.com/investigations/2018/09/the-two-types-of-investing-or-trading-errors.html for a discussion of this.

Yuval and Jonpaul,

Yuval points out just ONE problem with using p-values. I could go on. I have used bootstrapping and Bayesian statistics, but it would be just plain wrong to say these solve all the problems with p-values. Just to add one ADDITIONAL problem (in addition to the one Yuval mentions): the multiple-comparison problem makes using p-values difficult. And simply picking p < 0.05 as a cutoff is just wrong.

IF you are going to use p-values, you could use k-fold validation (which is a different subject than p-values). One way to do that would be to get the p-value for each of your factors and equal-weight the included factors in a ranking system.

Then do a k-fold cross-validation grid search (or Bayesian optimization) with the p-value cutoff as a hyperparameter to find the cutoff that works best under cross-validation.
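A rough sketch of that idea in Python, with synthetic data and made-up factor names and p-values (so just an illustration of the mechanics, not a recommendation of specific numbers):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic stand-in for a date-sorted factor download: factor scores plus forward returns.
n = 5000
factors = pd.DataFrame(rng.normal(size=(n, 6)),
                       columns=[f"factor_{i}" for i in range(6)])
fwd_ret = 0.02 * factors["factor_0"] + 0.01 * factors["factor_1"] \
          + rng.normal(scale=1.0, size=n)

# Hypothetical per-factor p-values (in practice from a bootstrap or single-factor test).
pvals = {"factor_0": 0.001, "factor_1": 0.03, "factor_2": 0.10,
         "factor_3": 0.25, "factor_4": 0.50, "factor_5": 0.80}

def cv_score(cutoff, n_splits=5):
    """Mean out-of-fold Spearman correlation of an equal-weighted composite of factors with p < cutoff."""
    keep = [f for f, p in pvals.items() if p < cutoff]
    if not keep:
        return np.nan
    # shuffle=False keeps the folds contiguous in time when the rows are date-sorted.
    kf = KFold(n_splits=n_splits, shuffle=False)
    scores = []
    for _train_idx, test_idx in kf.split(factors):
        # Equal weights mean nothing is fit on the training fold; the folds just
        # provide time-separated scoring periods for the cutoff.
        composite = factors.loc[test_idx, keep].rank().mean(axis=1)
        scores.append(spearmanr(composite, fwd_ret.iloc[test_idx])[0])
    return float(np.mean(scores))

for cutoff in (0.01, 0.05, 0.10, 0.25, 1.00):
    print(cutoff, round(cv_score(cutoff), 4))
```

You would then keep whichever cutoff holds up best across the folds rather than defaulting to p < 0.05.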

TL;DR: Yuval is right, p-values are problematic in many ways (not just the one he mentions). Cross-validation is a different subject than p-values, however: cross-validation does not use p-values, nor does it use type I or type II errors. If you want to think about using a p-value cutoff to control your type I and type II errors, k-fold cross-validation would be an objective way to determine that cutoff. Ultimately you may not do that, but it would, at least, be better than an automatic p < 0.05 cutoff.

BTW, if you are going to use p-values, you might ask ChatGPT how to do it with bootstrapping, which is non-parametric and gets around the usual discussion of whether stock returns have a normal distribution and the need for the central limit theorem.
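Something like this, for example (a toy sketch only, with a hypothetical weekly top-minus-bottom spread series standing in for real data):

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_pvalue(spreads, n_boot=10_000):
    """One-sided bootstrap p-value for the mean of a return-spread series.

    Non-parametric: resampling with replacement, no normality assumption.
    """
    spreads = np.asarray(spreads, dtype=float)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        boot_means[i] = rng.choice(spreads, size=spreads.size, replace=True).mean()
    # Fraction of bootstrap means at or below zero, i.e. evidence the edge is not positive.
    return float((boot_means <= 0).mean())

# Hypothetical weekly top-minus-bottom bucket spreads with a small positive edge.
weekly_spreads = rng.normal(loc=0.001, scale=0.02, size=520)
print(bootstrap_pvalue(weekly_spreads))
```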

Jim

Yuval,

Just an observation. Your link addresses the effect of the total number of false positives (type I errors) and false negatives (type II errors) while keeping the proportion of false positives and false negatives roughly the same. In other words, you are randomly selecting features from a set of features with a fixed proportion of type I and type II errors.

Another study could be done looking at the effect of the false discovery rate (FDR). In other words, keep the number of factors the same but change the proportion of factors that are type I and type II errors. That would clearly be a different study than the one in your link.

This could be done by keeping the number of factors the same but selecting factors based on a p-value cutoff, and doing this with different cutoffs, which would, by the definition of the p-value, change the proportion of type I and type II errors. Determining the proportion of type I and type II errors at a given p-value is a basic homework assignment in any freshman statistics class (or advanced-placement high-school class), as I am sure you know.

You often link to this paper, for good reason I think. One reason I like it is that it is really all about the false discovery rate: Is There a Replication Crisis in Finance?

From the paper: “…we also calculate the posterior probability of false discoveries (false discovery rate, FDR).” The paper also talks about the Benjamini-Hochberg procedure, which is a frequentist method for controlling the false discovery rate.

But really, the entire paper is about the rate of false positives (FDR) in the financial literature. Factors that are “discovered” that cannot be replicated are, by definition, false discoveries.

TL;DR: “Is There a Replication Crisis in Finance?” is a good paper, not just because of the abstract or any conclusions, but because it has concepts and techniques that can be used for controlling the FDR in ranking systems, whatever you think that number should be.

And that number could be empirically determined by using k-fold validation and/or other methods such as the Benjamini-Hochberg procedure (in Python for now).
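For the Benjamini-Hochberg step, statsmodels has this built in; here is a minimal sketch with made-up factor names and p-values, purely for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical per-factor p-values from whatever single-factor test you ran.
factor_pvals = {
    "Price2SalesTTM": 0.004,
    "ForwardEarningsYield": 0.030,
    "Accruals": 0.240,
    "NoiseFactor": 0.610,
}

names = list(factor_pvals)
pvals = list(factor_pvals.values())

# Benjamini-Hochberg keeps the expected false discovery rate at or below alpha.
reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.10, method="fdr_bh")

for name, p, p_adj, keep in zip(names, pvals, pvals_adj, reject):
    print(f"{name}: p={p:.3f}, BH-adjusted={p_adj:.3f}, keep={keep}")
```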

Jim

Jim and Yuval,

Thank you for the great discussion. I think I will need to brush up on my statistics and the practical application of it. Unfortunately, many high school and college courses only introduce these tools in a very abstract manner. That being said, the article you linked explained the idea of type I and II errors well.

I generally understand the idea of statistical significance and hypothesis testing for observations (maybe not the specifics of how p-values are calculated). But I do not understand how you can get a p-value/FDR for a single factor in a multi-factor system, as we know the behavior of factors changes when you combine them. I will have to learn more about this. Maybe I am missing something…

Jonpaul

Jonpaul,

I find the discussion interesting too.

But please be aware that I did not introduce the concept of type I and type II errors into the discussion. Nor do I find it particularly relevant to your question.

As you recall your question was about k-fold cross-validation. Type I and type II errors are a separate topic.

As a direct answer to your question, I would only suggest that you continue to use some sort of cross-validation and consider using the Spearman rank correlation as a k-fold cross-validation measure (and less as a backtest measure), if you are not doing that already.
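If it helps, one way that could look (a sketch with hypothetical column names, not a description of your code) is to compute the cross-sectional Spearman correlation between rank and forward return date by date, within each contiguous, time-ordered fold:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def fold_spearman_ic(df, n_folds=5):
    """Per-fold Spearman rank IC, computed cross-sectionally date by date.

    Assumes df has columns 'date', 'rank', and 'fwd_return' (hypothetical names)
    and is sorted by date.
    """
    unique_dates = np.sort(df["date"].unique())
    date_folds = np.array_split(unique_dates, n_folds)  # contiguous, time-ordered blocks
    rows = []
    for fold_id, fold_dates in enumerate(date_folds):
        block = df[df["date"].isin(fold_dates)]
        daily_ic = block.groupby("date").apply(
            lambda g: spearmanr(g["rank"], g["fwd_return"])[0]
        )
        rows.append({"fold": fold_id, "mean_ic": daily_ic.mean(), "std_ic": daily_ic.std()})
    return pd.DataFrame(rows)

# Usage with a hypothetical date-sorted download:
# print(fold_spearman_ic(ranks_df))
```

Whether the mean IC holds up fold by fold is then the number to watch, rather than a single full-period backtest figure.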

I actually agree that the topic of type I and type II errors is confusing, and if one is going to use it, it has not been adequately addressed in the forum. Nor will it be.

TL;DR: Continue to use cross-validation.

Jim

My most recent questions/comments were on the difficulty of determining which factors are worth downloading given limited resources. One of my methods was looking at results across 5 universe “folds.”

The concept of type I and II errors seems relevant to deciding what factors to include or exclude in a system. But how to test for them is challenging (for me anyway).

I think cross-validation is probably one of the easiest methods to implement for determining how robust a factor or combination of factors is. Maybe p-values and Spearman rank correlation in combination with cross-validation, versus, say, alpha and beta, are also good to consider. That being said, they are more abstract and thus less appealing when first investigating things.

Anyway, I think my overall conclusion is that I should download as many factor ranks for as many stocks as I can afford, as the question of “good” and “bad” factors is not so straightforward, assuming the factor makes sense.

Jonpaul


Jonpaul and Yuval,

Thank you for the discussion. As you recall I have been trying to assess which factors to download myself.

I will be downloading a few more factors than I planned, as I basically agree with Yuval (or anyone) who says the p-value is not the only thing to look at.

Especially if one is going to eventually look at interactions with a random forest or XGBoost. Collinearity clouds the issue also.

And perhaps this echoes that:

I think cross-validation remains a viable means of making the final decisions about a model and the factors to use. But because of this discussion, I will cross-validate a few lower-p-value factors with my downloads to see what the final cross-validation looks like with them, with the idea that interactions may make a difference in some cases even if the p-value is marginal.

Jim

Excellent idea, Jim. This is the approach that Marcos Lopez de Prado and Michael J. Lewis took in their paper “What Is the Optimal Significance Level for Investment Strategies,” in which they concluded that “a particularly low false positive rate (Type I error) can only be achieved at the expense of missing a large proportion of the investment opportunities (Type II error).” At any rate, the idea of removing a factor because it might be a Type I error is what I was trying to address in my own feeble study, in which removing a factor results in fewer factors being tested.

Yuval,

Right. So just a question. You think type II errors are more damaging than type I. I certainly have no objective evidence that would make me disagree with that.

But I assume that with a ranking system that already has a large number of factors (zeroing out any effect the raw number of factors might have), you think that type I errors can cause some harm?

Would you agree that a method that determines the false discovery rate (like the one discussed in “Is There a Replication Crisis in Finance?”) could be useful, allowing for the optimal ratio of both type I and type II errors in a ranking system?

I have actually done that and will be going back to do it with k-fold cross-validation, probably soon, as the cross-validation method I used previously was poor.

Either way, your discussion was helpful for advancing my thinking on this. Thank you.

@yuvaltaylor: plus ChatGPT has been bugging me to do that and will not stop whether I really need it or not. :worried:

Jim.

Certainly type I errors can cause some harm, even in a ranking system that has a large number of factors. And I do think that the false discovery rate is pertinent.

But you also have to take into account that factors interact in weird ways. A factor that measures accruals, for example, is never going to have a good slope in a bucketed backtest. But it's a very useful check when looking at companies that are reporting high earnings, even though it would totally fail a “false discovery rate” test.

Personally, I like everything I do to be evidence-based and rely mostly on backtesting to determine how I invest. But I realize there are real problems with that approach, and many of those are of both the Type I and Type II variety, in that some factors that backtest very well make no real sense and should be discarded and others that don’t seem to backtest well at all should be retained because they provide valuable checks on other factors or help prevent you from buying companies that are lying through their teeth.

Absolutely agree with this. As I said above, I plan to download more factors than I originally planned on, with the idea that some may interact in models that allow for interactions (e.g., random forests or XGBoost).

But agree. Full stop.

Cool. I do wish you would look hard at k-fold cross-validation and understand it; discard the idea if you do not like it. Seriously, we no longer have to reach a consensus on methods, as I can download data and make my own decisions now. But I think you might find it useful. Or again, discard it if you do not find merit in the idea.

Best,

Jim

Duckruck,

Just to clarify your short post.

  1. Is our data a time-series?

  2. Are you saying that because our data is a time series, a time-series validation (like walk-forward) is most appropriate?

  3. You had a link to a paper that used k-fold validation with an embargo period to avoid information leakage. Do you remember that paper? Also de Prado has written on this.

To give more detail about my methods: I have on my list the use of cross-validation with a time series (a walk-forward), and a second item that would use k-fold validation with an embargo period. I will not go into the weeds as to why I will not use the same method for both, although I could easily add a walk-forward or time-series validation to the k-fold with an embargo. Okay, relatively easily, even when it cannot all be done in Python.

Anyway, I am just not sure what your full meaning is. But I think I see your point and fully agree.

Best,

Jim

I am not going to completely ignore what de Prado (and others) say at this point in Advances in Financial Machine Learning, which does not stop me from using other things too. Clearly time-series validation is a good method. There are advantages to both, and I would actually say a walk-forward validation, rather than just a holdout test, should be included.

k-fold will give you more data, as you are training and validating on all of the data, including a validation fold at the beginning of the time series. That is a clear advantage when you are short on data. Personally, I will use it when I am short on data, especially when a holdout test set is eating up data that I could use for training.

de Prado does express the following concerns about k-fold cross-validation. The passage below suggests that he would not like the use of mod() for cross-validation purposes (which is different from subsampling), and that with scikit-learn he would set shuffle=False for cross-validation.

The embargo period is less of a problem, as it addresses potential “data leakage” from “autocorrelation,” which may or may not exist in your data, and autocorrelation is a problem only over the relatively short time frame in which it would have an effect (seldom long by any theory, and non-existent for adherents of the Efficient Market Hypothesis). It is an effect that should disappear quickly if there are any “RoseBud Traders” around, or for those who were formerly known as “statistical arbitrageurs,” now just called “Statistical []”

BUT it is not that hard to include an embargo period. It is a little bit clunky to do it in a for-loop and sometimes it is easier to just manually slice your DataFrame to define the train/test splits.

From Advances in Financial Machine Learning:

"PURGED K-FOLD CV One way to reduce leakage is to purge from the training set all observations whose labels overlapped in time with those labels included in the testing set. I call this process “purging.” In addition, since financial features often incorporate series that exhibit serial correlation (like ARMA processes), we should eliminate from the training set observations that immediately follow an observation in the testing set. I call this process “embargo.”’

de Prado, Marcos López. Advances in Financial Machine Learning (pp. 196-197). Wiley. Kindle Edition.
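Here is roughly how the manual DataFrame slicing could look, covering only the embargo part and not the label purging de Prado also describes (a toy sketch of my own; the 13-week default and the 'date' column name are assumptions):

```python
import numpy as np
import pandas as pd

def purged_time_folds(df, n_folds=5, embargo=pd.Timedelta(weeks=13)):
    """Yield (train_df, test_df) pairs with an embargo after each test window.

    Assumes df is sorted by a 'date' column. Rows whose dates fall inside the
    test window, or within `embargo` after it, are dropped from training.
    (Purging of rows just before the test window whose forward-looking labels
    overlap it is not shown here.)
    """
    unique_dates = np.sort(df["date"].unique())
    for fold_dates in np.array_split(unique_dates, n_folds):
        test_start = pd.Timestamp(fold_dates[0])
        test_end = pd.Timestamp(fold_dates[-1])
        test_mask = (df["date"] >= test_start) & (df["date"] <= test_end)
        # Embargo: also exclude the period immediately after the test window.
        excluded = (df["date"] >= test_start) & (df["date"] <= test_end + embargo)
        yield df[~excluded], df[test_mask]
```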

How long? That is a serious question I have asked ChatGPT. In this context, if one were to use an embargo period, how long should it be? Serious question.

Show me the autocorrelation data proving what you say. It is hard to find autocorrelation that lasts long in my experience.

Is that autocorrelation effect strong enough that you can use it to trade and make a lot of money off of it?

It is always possible to look back and see a pattern that you could have exploited if you knew then what you know now. I think that would be called a backtest. Show me the data proving there are long periods of autocorrelation, and how long that autocorrelation persists, if you have the data.

Another way to put this: have you run a logit regression on some financial data and found an effect that you can use?

Edit: It does seem that serial correlation is the key. One can actually look at the serial correlation in the data. I suspect ChatGPT can do that and determine the magnitude and duration of the serial correlation in your data for you.
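For example, a quick look at how far any serial correlation extends might be as simple as this (toy weekly return series standing in for real data; the lag grid is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical weekly return series for a factor portfolio (stand-in data only).
weekly_returns = pd.Series(
    rng.normal(scale=0.02, size=1040),
    index=pd.date_range("2004-01-02", periods=1040, freq="W-FRI"),
)

# Autocorrelation at lags from about 1 week out to 3 years.
for lag in (1, 4, 13, 26, 52, 104, 156):
    print(f"lag {lag:>3} weeks: autocorr = {weekly_returns.autocorr(lag=lag):+.3f}")

# A rough 95% band for white noise is about +/- 2/sqrt(N).
print("approx. 95% band:", round(2 / np.sqrt(len(weekly_returns)), 3))
```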

Q: So, to be clear, if there is no serial correlation, an embargo period is not necessary?

A: "Yes, you are correct. The primary purpose of implementing an embargo period is to prevent potential data leakage due to serial correlation in the time series data. If the serial correlation is negligible or non-existent, the rationale for incorporating an embargo period diminishes. Here’s a more detailed explanation:

When Serial Correlation is Absent:

  1. Data Leakage Risk is Low: Without serial correlation, the risk of data leakage between adjacent time periods (or between training and testing sets) is low, which makes the necessity for an embargo period less critical.
  2. Efficient Use of Data: By not implementing an embargo, you can use your data more efficiently, as you are not discarding potentially valuable training data.
  3. Simpler Cross-Validation Procedure: Without an embargo, the cross-validation procedure can be simpler and more straightforward, as you do not need to implement additional steps to remove data points adjacent to the testing set."

I think an embargo period of 3 years is possible with over 20 years of P123 data. I have considered such an embargo because I generally download 3 years of data per DataMiner download: I would simply not concatenate the 3-year block following the test data into the training data each time I form a train/test split. I am not aware of any autocorrelation lasting over 3 years.
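If I go that route, the assembly step could be as simple as this sketch (assuming each 3-year DataMiner download has already been read into a DataFrame and the blocks are listed in chronological order):

```python
import pandas as pd

def embargoed_block_splits(blocks):
    """Yield (train_df, test_df) pairs from chronologically ordered 3-year blocks.

    For each test block, the block immediately after it is left out of the
    training data, which acts as the 3-year embargo described above.
    """
    for i, test_df in enumerate(blocks):
        train_blocks = [b for j, b in enumerate(blocks) if j not in (i, i + 1)]
        yield pd.concat(train_blocks, ignore_index=True), test_df
```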

Jim

If you're using a holdout period, do you ensure that you are not using factors that you or someone else came up with during or after the holdout period? Or is that irrelevant? Just curious.