SURVIVORSHIP BIAS?

ETFOptimize · December 31, 2021, 11:35am

Maybe I’m mistaken, but since the 2020 switchover of data providers, it has become my impression that Portfolio123 powered by FactSet might NO LONGER be accurately characterized as Point-In-Time Data. If a database includes significant errors on the day it’s published, we, as disciplined, systematic investors, must assess if it is reliable enough for our needs. Fortunately, in my case, the use of factors from the Fundamental Category has dropped off significantly since 2016, when I concentrated my investment efforts and advisory services 100% on ETFs. ETFs are far superior investment vehicles to common stocks for an evidence-based, data-driven investment process, IMO.

It seems obvious that FactSet (and every other data provider) would want to eliminate ERRORS in the data to the fullest extent possible. After all, if you operate a business hawking farm-fresh chicken eggs each morning from a pedal-powered cart, you wouldn’t stay in business long if you let slightly rotten eggs get through to the customer. The quality of goods sold must be high for any business to be successful. The entire data-use continuum, from FactSet to Portfolio123 to ETFOptimize and finally the end-user investor, can be affected by a subpar quality of goods sold from the perspective of Point-In-Time data.

The two most critical, data-quality characteristics that are necessary for any systematic investment process to be accurate and successful are: 1) Data is POINT-IN-TIME - meaning that historical data is as accurate as possible at the time posted, and 2) there is NO SURVIVORSHIP BIAS in the data.

Eliminating Survivorship Bias means that at any backtested point in time, the companies or ETFs included in the database are the ones that existed at that time. For example, suppose you were to assess today’s list of the ~ 500 constituents of the S&P 500 Index/ETF (SPY). In that case, that list might be significantly different than the list of the 500 companies that were included in the index in 2007 (before the Financial Crisis).

However, if we used today’s list of companies to assess historical market conditions, that list would include only corporate winners. It would omit the terrible stock performance of the 2007-2009 Financial Crisis, when many publicly traded companies failed or were removed from the S&P 500. Businesses such as Bear Stearns, IndyMac, Washington Mutual, and Lehman Brothers are no longer around, so this list would exhibit Survivorship Bias when used to examine the historical performance of the S&P 500.
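To make the distinction concrete, here is a minimal Python sketch of a point-in-time universe lookup. The tickers, dates, and function names are all hypothetical placeholders, not real index data or any vendor’s API. It only illustrates the idea of rebuilding the constituent list as of a backtest date instead of reusing today’s survivors.

```python
from datetime import date

# Hypothetical membership records: (ticker, date added, date removed or None).
# These are made-up placeholders, not actual index constituents.
MEMBERSHIP = [
    ("AAA", date(2000, 1, 3), None),
    ("BBB", date(2001, 6, 1), date(2008, 9, 15)),
    ("CCC", date(2010, 3, 1), None),
]


def constituents_as_of(records, as_of):
    """Return tickers that were members on `as_of` (removal date is exclusive)."""
    return sorted(
        ticker
        for ticker, added, removed in records
        if added <= as_of and (removed is None or as_of < removed)
    )


def survivors_only(records):
    """Return today's members only: the survivorship-biased list."""
    return sorted(ticker for ticker, _, removed in records if removed is None)


pit_2007 = constituents_as_of(MEMBERSHIP, date(2007, 6, 29))
biased = survivors_only(MEMBERSHIP)
print("Point-in-time universe, mid-2007:", pit_2007)
print("Today's survivors:", biased)
```

In this toy data, the point-in-time list for mid-2007 still contains the later-removed ticker and excludes the one that had not yet been added, while the survivors-only list gets both wrong.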

To me, the elimination of Survivorship Bias is another of many examples where accurate Point-In-Time data is invaluable. Quantitative designers must have a precise list of active public companies or ETFs constituting the Universe as it evolves, with correct constituents of the Universe on each specific date. Right now, this is most important to me because I am creating a plethora of breadth indicators. However, to Jim’s point below, I also use earnings estimates for the constituents of ETFs.

Jrinne · December 31, 2021, 2:16pm

All,

I think P123 should be more proactive about avoiding problems with look-ahead bias and maybe telling us if FactSet’s PIT offering of earnings estimates data has time-stamps that can avoid random differences in the sim and port.

I admit to not knowing for sure how much of a problem this is as far as returns. The point is I do not know and never will know with my data alone. For now I just know that the sim seems to always do better than the port. Anecdotal to be sure. I do know for sure that differences in the holdings (and ranks) occur often.

Here are the holdings for my port on auto and the sim for the port. Neither forces positions into the universe.

I think data issues will affect whether anyone wants to join P123 to do machine learning. Those long lectures on how to “impute” data and how much data error can be tolerated before machine learning no longer works may be remembered by anyone who has taken a course in machine learning.

P123 should be proactive in this, I think. That is probably the best business decision.

Jim

marco · December 31, 2021, 3:01pm

Chris,

The objective you are describing cannot be achieved with any data provider. We would have to create our own data collection system that immediately grabs data from the SEC, standardizes it (some AI would be needed for footnotes), applies corporate actions, and so on. A massive $10M undertaking with huge operating costs.

Delays will never go away with data providers. They serve a professional market that cannot invest in most of the listed stocks. Therefore there’s no incentive for them to add more staff to process the data faster during earnings for all companies. This extra staff would be sitting around in between quarters.

And since these delays will never go away, you will always be late to any information alpha vs. other HFT players like Goldman Sachs that do their own thing. Or look at it this way: what is the point of having S&P tell you exactly when they disseminated a certain line item if that very same point data was on the SEC the previous week?

Patience and systematic investing is the true alpha for us. And it’s a better alpha than most hedge funds that do all sorts of crazy stuff. It’s just like investing with IPOs: you are almost always better off waiting a few months and even longer. The market is luring and forgiving: it always gives you a second chance. The only exception I can recall is Google, where the IPO price was a steal. Oh Google… if you are listening… please give me another chance!

ETFOptimize · January 1, 2022, 6:23am

I understand about these limitations, Marco. Thanks. But can you please reassure me that the data is accurate and timely from the perspective of Survivorship Bias?

And lastly, in what way was the previous data superior to the current FactSet data? It seems that a lot of tributes were paid to that data being PIT. Thanks.

Jrinne · January 1, 2022, 10:13am

P123,

I get that CompuStat data is probably superior to FactSet data for fundamentals. Perhaps having to do with the time-stamps. Probably Chris would like an answer from P123 so I will not expand on this. P123 can correct me if I am wrong about time-stamps being a benefit.

P123 simply will not discuss CapitalIQ data which is different from CompuStat data (but provided in the same data package).

CapitalIQ data has the same look-ahead bias that FactSet’s earnings estimates data has. And any time-stamps do not accurately reflect when the data became available to the P123 members. In other words, there is a look-ahead bias. I believe one can look back in the forum and find where this was first posted (by Marco). But maybe we have not been informed of any improvements in the data and P123 can simply update us–making old posts outdated at best and not pertinent to this discussion. Things can change.

We finally have a lag in the FactSet earnings estimates data that may correct some or all of the problems that both FactSet and CapitalIQ have. But is CapitalIQ data lagged now? It is possible that if you use a lot of earnings estimates, P123’s FactSet data is better than CapitalIQ (because of the lag). If I am wrong on this it is only because I cannot get an answer to my questions.
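For anyone wondering what a lag does here, this is a rough Python sketch under my own assumptions. The revision dates, values, one-day lag, and function name are hypothetical, and it is not how P123, FactSet, or CapitalIQ actually process estimates. The idea is that a backtest reading the vendor’s stamped date directly may see a revision on a day when a live user did not yet have it, and shifting the cutoff back by a lag hides that revision until later.

```python
from datetime import date, timedelta

# Hypothetical estimate revisions: (date stamped by the vendor, EPS estimate).
# Values are invented for illustration only.
REVISIONS = [
    (date(2021, 11, 1), 1.00),
    (date(2021, 12, 6), 1.25),
]


def estimate_as_of(revisions, as_of, lag_days=0):
    """Return the latest estimate stamped on or before (as_of - lag_days).

    lag_days is an assumed delay between the vendor's stamp and the moment
    the value is actually usable; returns None if nothing is visible yet.
    """
    cutoff = as_of - timedelta(days=lag_days)
    visible = [(stamp, value) for stamp, value in revisions if stamp <= cutoff]
    if not visible:
        return None
    return max(visible, key=lambda rev: rev[0])[1]


rebalance = date(2021, 12, 6)
print("No lag:   ", estimate_as_of(REVISIONS, rebalance))
print("1-day lag:", estimate_as_of(REVISIONS, rebalance, lag_days=1))
```

With no lag, the rebalance on the revision date already uses the new estimate. With a one-day lag it still uses the prior one, which is the kind of gap that could produce different holdings in a sim and a live port.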

I have asked whether CapitalIQ data has a lag. It would be even better to find that it does not need a lag for some reason. Actually I would be happy to learn anything on the subject.

Anyway, if P123 decides it might be appropriate to inform us about some of the data it provides, I wonder if P123 might answer whether FactSet’s PIT offering has accurate time-stamps. We do not have FactSet’s earnings estimates data with any time-stamps now.

CompuStat uses time-stamps for its fundamental data. P123 uses time-stamps for earnings announcements from FactSet. It seems like time-stamps might be a good thing that is helpful at times.

It is not clear that P123 has looked into FactSet’s PIT offering of earnings estimates (which uses time-stamps). Probably P123 has looked into this and just does not want to discuss it. But P123 does actually charge for providing CapitalIQ data. I think it would be appropriate to include a discussion of CapitalIQ data here. Chris is actually asking about all of the data, including CapitalIQ data, I believe. It is appropriate to ask about the data.

I hope P123 is willing to answer Chris’ question and expand into all of the data it provides in the CompuStat package. Maybe even tell us why time-stamps are a great idea for everything except earnings estimates data.

I have said it before and will say it again here: FactSet’s earnings estimates data with a lag may be fine. Let me just stipulate, for discussion, that it probably is very useful with the lag. It may even be better than CapitalIQ data. But I also think–stepping back and looking at Chris’ main point–things could be better. At a minimum, we could all be better informed before deciding whether to pay $15,000 a year (or more) for earnings estimates that may not be lagged or have accurate time-stamps. And if it is affordable, FactSet’s PIT offering of earnings estimates data is almost certainly better for some situations. Obviously, there is a reason FactSet decided to make this option available. They understand their own data and that there are problems that the PIT offering solves (for anyone informed enough to know there are problems). Actually, Marco just pointed out that there are problems with all data, so everyone with a pulse should know at this point. Marco could not have put it more clearly or simply.

A discussion of some of the specifics would not be the least important thing discussed in this forum, I believe. Very worthy of a discussion and an appropriate question, to be sure. Thank you for the question, Chris.

Jim