I get different simulation results with the exact same system / timeframe

I think having PIT data is critical. Backtesting is otherwise meaningless. It’s like me trying to bet on tennis match outcomes knowing who the 4 semi-finalists are; it’s a huge edge.

I’m OK with OOS underperformance; you can’t predict that. But I’m struggling with the reliability of backtesting itself. It is my opinion that most of the DMs are not doing well not because of a market-regime change but because backtesting has given flawed results.

I have no proof that Compustat was better as I didn’t save any data, but I’m revisiting some of my models and the differences are stark. Here’s an example of one of my models that trades 20 positions weekly over the last 2 years. The first image is the live model. The second is a replica of the live strategy as is.
The difference is enormous.
You can tell that the models are probably trading similar tickers, but the non-PIT data has an edge: the backtested strategy is selling and buying at more opportune times because it has information that the live strategy didn’t have.

I’ve completely given up on backtesting and use p123 merely as a fancy stock screener and a place to get stock information.



Hi all,
let me first preface my comments with the immense respect I have for Yuval and all the great contributions he has made.
Now for the bad news. If I buy a house that has hidden defects that I only find out about after I buy, it does me no good unless I can sue the seller.
I would go so far as to say (and I know I will get heat for this) that yes, ignore the changes. If that was the data that was known at the time, then go with it, because that is what we would have traded on in real time. Of course we all want correct data to work with, but if it is not PIT it is of no use to me. Making a database look pretty is not my goal; making money is. I would love to place bets on Sunday football games the day after; it is just not realistic.

This is a very big problem.

I have been one of the success stories here. I have been a member since 2010. Since I started live trading in 2015, my accounts beat the average stock in live trading for five years in a row!

However since the switch to FS I am seeing huge discrepancies between live portfolios and backtested sims. I am now seriously considering alternatives.

It’s simple. Don’t correct any error unless you know it will never happen again (because you fixed the cause of the error).

For example, if FS or CS misreported EBITDA for a stock, don’t fix it! It shows a flaw in their data collection process (or in company reporting) that will probably reoccur going forward.

OTOH, if you were adding 000s by mistake, go ahead and fix it. That error will not appear again OOS.

The most valuable part of P123 is PIT backtesting. Snapshotting the actual data received from the vendor is the best way to do this. That’s what was done with Reuters data and it worked very very well once they collected PIT snapshots. When they switched to Compustat they started relying on CS to keep accurate snapshots. That was a step down. There were some CS factors that had a look ahead bias in real life that made backtests for those factors misleading. FS seems like another huge step down. I don’t know why, but I am seeing quite different results in live vs sims.

Would you be happy with data that was retroactively cleaned and has a look ahead bias?

I just want to amend this. Nisser’s data is both significant and impressive. I have never seen Nisser make a mistake in his posts, and I assume this is correct until shown otherwise.

I am looking closer at my ports and with an open mind. But I am not the only one with data.

After Nisser’s, Brett’s and Chaim’s posts (people I trust) I feel stronger that this deserves a serious discussion.

A discussion by someone, like Chaim, who understands that Snapshots are not something to be dismissed out of hand.

Someone who knows that P123 has done this before for a provider that has less of a problem with providing PIT data. If not serious consideration then, at least, a serious discussion would be appreciated.

Chaim, the answer to this is obvious. It is strange, I think, that we would debate whether we should like look-ahead bias. We may have to tolerate some of it from FactSet, but it is strange to try to convince us that we should like it.

Jim

Back in 2015, I reported a problem with Canadian data: in 2004 and 2005, some Canadian companies with dividends have Yield = 0 for some weeks. The error was never fixed.

https://www.portfolio123.com/mvnforum/viewthread_thread,8753#!#46138

I assumed this was fixed with the new data provider. But when I looked, the problem was still there! I guess Compustat and FactSet started with the same Canadian data. Should FactSet and/or P123 fix this error? I say yes even though it could cause some backtests to change.

FactSet will provide an earnings estimate revision in the backtested data before it was known publicly. Read FactSet’s brochure about the advantages of their PIT option if anyone questions this. This is true look-ahead bias.

At a minimum, rational people need to be against true look-ahead bias. Maybe tolerate a little (if necessary), but don’t argue that it is good.

Even if you think some data revisions are desirable in some situations, what happened with Nisser’s port is unacceptable. Maybe the Canadian Yield data can be fixed going forward, and look-ahead bias can be avoided when fixing the past.

I certainly expect P123 to understand the different data error problems and not try to convince us that we should like look-ahead bias that keeps repeating in the future.

Jim

No, I understand what you are talking about, Andreas, but maybe I didn’t communicate clearly enough.

What I’m saying is that if, for whatever reason, P123 selects a different stock or ETF for your model on day N, that single difference at the beginning, with the power of compounding, can result in an enormous difference in performance at day N + 10 years or N + 20 years.

I was just offering the concept of starting your model on different days as a way to prove this point to yourself. It’s also a way that P123 members can test for robustness. If the previous iteration of your model selected stock X on day N and the new iteration selects stock Y on day N, there can be a huge difference in the results after 10 or 20 years.

Another possibility is that the PRICE for the same stock (both iterations hold stock X) is slightly different at the beginning of each iteration. I have seen big differences in performance when testing the three pricing options of Next Close, Next Open, or Next Avg of Hi and Low. If this robustness test of price alternatives results in a significant difference in performance, it indicates your model is not robust.

Again, I’m not suggesting this happened to you, but I’m providing examples that can be tested by other readers who want to check their models for robustness. If there are any changes at the beginning of a Sim, it can result in huge differences in the results at the end.
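To make this concrete, here is a toy sketch in Python with made-up numbers (it has nothing to do with any actual P123 sim): two runs of the same system that are identical except for a single early pick. The one-week return gap never washes out; it scales the equity curve from that point forward, and in a real sim the different holding would also change which buy and sell rules fire later.

```python
# Toy illustration with invented numbers: two runs identical except for ONE early pick.
import random

random.seed(1)

weeks = 52 * 10
shared = [random.gauss(0.003, 0.02) for _ in range(weeks)]  # weekly returns both runs share

def compound(returns, override_week=None, override_ret=None):
    """Compound weekly returns, optionally swapping in a different return for one week."""
    value = 1.0
    for w, r in enumerate(returns):
        if w == override_week:
            r = override_ret  # the single week where a different stock was held
        value *= 1.0 + r
    return value

base = compound(shared)                                      # the original pick
alt = compound(shared, override_week=3, override_ret=-0.20)  # the alternative pick lost 20% that week
print(f"original: {base:.2f}x   alternative: {alt:.2f}x   ratio: {alt / base:.2f}")
```

Even this simplest case leaves a persistent gap between the two equity curves; with path-dependent buy and sell rules the two runs can drift much further apart.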

That said, I again state that I agree 100% with you regarding PIT data. If the data is always moving, even slightly, it is impossible to develop strategies to exploit market inefficiencies. However, in the world that we live in in 2020, this subject is one of the biggest challenges that humanity faces. We are well on our way to a world driven by data run on ubiquitous microprocessors.

For the world society in general, data accuracy is an enormous problem. How to solve for these inaccuracies is not an easy challenge to resolve. I think I agree with Yuval that the data should be fixed when an error is identified. I would prefer to have the most accurate possible data than to have a stable environment that doesn’t fix errors.

That said, I believe the optimal solution would be for P123 to offer both a stable, “as initially reported” option and the ever-subtly-changing “refreshed” data. However, this might be an impossible challenge for P123 because of all the iterations of the data that would need to be maintained.

Back to my original post, have you compared the initial transactions of your first version (if you have them) vs. the transactions in the latest version? Is there a difference in the stocks selected or a difference in opening prices? That could be causing your problems. As I pointed out initially, a single change in a stock price could cause enormous differences in performance after spans of years.

I understand your concern with the data - by a factor of infinity. But I doubt there is an easy solution that will satisfy all users.

Chris

I get Yuval’s argument about data bugs and program bugs.

A snapshot setup could help:

  • Snapshot database where everything up to today - 1 is read-only (including bugs!); bugs get fixed only from today - 0 onward
  • Current database with fixes to both bugs and data
  • An option so the user can choose between the two (like with the prelims)

The problem is that this is very hard to accomplish.
So the compromise could be:

  • Snapshot database with fixes for programmatic bugs and data bugs, BUT no overwriting of prelim data (as the new vendor systematically does?), e.g. today - 1 for data. That was the status we had with the old data vendor, and it was a very good compromise. So the idea is to stop the data vendor from systematically updating the past and to get back the PIT quality we had before the switch. Just store the vendor’s changes 1:n below the database.

Something like this:
“Stock” 1:n → “Line Item” (which is read-only as of today - 1, except for monster bugs) 1:n → “Line Item update by vendor”

If a vendor revision arrives for data older than today, it is written to “Line Item update by vendor” rather than overwriting the line item itself.
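Here is a minimal Python sketch of what I mean (my own naming, purely hypothetical, not P123’s actual schema): the value as first loaded is kept read-only, later vendor revisions are appended as dated updates instead of overwriting it, and a backtest can read either the snapshot view or the current view.

```python
# Hypothetical sketch of the proposed data model (not P123's schema).
from dataclasses import dataclass, field
from datetime import date

@dataclass
class VendorUpdate:
    received: date        # when the revision arrived from the vendor
    value: float

@dataclass
class LineItem:
    stock: str
    item: str             # e.g. "EBITDA FY2019"
    as_first_loaded: float
    updates: list[VendorUpdate] = field(default_factory=list)

    def value(self, view: str = "snapshot") -> float:
        """'snapshot' = read-only original load; 'current' = latest vendor revision."""
        if view == "current" and self.updates:
            return max(self.updates, key=lambda u: u.received).value
        return self.as_first_loaded

# Example: the vendor later revises the figure; the original snapshot is preserved.
ebitda = LineItem("XYZ", "EBITDA FY2019", as_first_loaded=102.3)
ebitda.updates.append(VendorUpdate(date(2020, 9, 1), 10.23))

print(ebitda.value("snapshot"))   # 102.3 -> what a PIT backtest would use
print(ebitda.value("current"))    # 10.23 -> what a "cleaned" backtest would use
```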

Best Regards

Andreas


Brett has a way of getting to the point in very few words.

No getting around that being a problem.

But obviously Andreas, Nisser, Chaim and Georg get it too. They understand why this can be a problem.

Marco says this does happen—at least to some extent.

Can we move the discussion from whether this happens to whether it happens enough to be a significant problem?

To be fair, Georg thinks it is not a big problem for his ports. I only started funding a port again fairly recently (and I have modified the port). So I do not have good data. I cannot compare the sim to the port for a long enough time period to draw any conclusions.

I guess I will probably know something in 6 months. I will hold off on my plans to move to a Boosting model using the same factors and keep funding the simple P123 model with no changes for clear comparison to the sim.

I will keep meticulous records and share them. I assume others can do the same (whether they share it or not). We can know how bad the problem is without a lot of debate about how it is good for sims to “know the score ahead of time.”

Maybe we can skip the philosophical debate on this and get to some (more) data defining the magnitude of the problem.

Best,

Jim

Yes Jim, yes Andreas, let’s work on a solution.
P123, what say you? Your awkward silence is deafening.
At the very least, tell us exactly what you do with new data, revised data, and bugs, when you do it, and how you do it.
Transparency at least gives us the opportunity for a workaround.
Ideally, I think we would all like the choice between 1) true PIT data (today - 1 = read-only, bugs and all) or 2) a cleaned-up-after-the-fact database. I just don’t know if that is feasible.
Is that feasible?

I understand the frustration you guys are having, and it’s legitimate, but hey, S&P Global for P123 is still available! You just have to fork over $12,000 USD per year for it. If you cannot afford it, then FactSet non-PIT data is available for a fraction of the cost, which is what you are paying now. However, it does come with the inconvenience you guys are discussing in this thread. Does it suck? Yes it does. It’s not really P123’s fault that we are at this point; blame the previous data supplier and the ludicrous pricing.

It would be nice to have more information.

We do know one independent group looking at this (Quantopian) found a need to lag FactSet’s earnings estimate data.

Furthermore, a second group with good inside knowledge (FactSet themselves) has a PIT version of the earnings estimates data that involves a lag. There is a FactSet brochure advertising the PIT data. The brochure pretty successfully ravages the competition (the FactSet non-PIT data).

No one over there could be as smart as the people here, I know. Maybe we should call them and tell them they have made a mistake. But then again, many people here at P123 see a need for a fix of some kind too.

Anyway, they are not as smart as us (obviously) but some pretty smart people thought a lag in the earnings estimate data was necessary.

And as far as I know they have not found a need to use a lag elsewhere (e.g., the fundamental data). And I am not aware of any other measures being taken (other than the lag of the earnings estimates data).

Could be pretty simple—once we have some data we can hang our hats on.

And it would not be reinventing the wheel.

Best,

Jim

Here’s the problem in a nutshell: FactSet’s data is not PIT. They make revisions. And we don’t know WHEN they make revisions (i.e. corrections; restatements are a different kettle of fish).

So let’s say we followed the suggestions from several users on this thread and took snapshots of data daily and stored them and then let users decide which date’s data to use. (Obviously, this would be a huge task and would mean our data storage capacity would have to be multiplied by thousands.) Even then there would be plenty of non-PIT information in the backtests because of a) revisions that FactSet makes to final data and b) a lack of effective dates (we have no idea when FactSet actually delivered the data that it’s reporting to its clients).
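Just to give a rough sense of scale, here is a back-of-envelope sketch with assumed figures (the database size and retention period below are illustrative guesses, not our actual numbers):

```python
# Back-of-envelope arithmetic with assumed figures (not P123's actual numbers).
db_size_gb = 500                # assumed size of one full copy of the historical database
trading_days_per_year = 252
years_of_snapshots = 20

snapshots = trading_days_per_year * years_of_snapshots
total_tb = snapshots * db_size_gb / 1024
print(f"{snapshots} daily snapshots -> ~{total_tb:,.0f} TB if stored as full copies "
      "(storing deltas instead of full copies would shrink this substantially)")
```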

If we decided to no longer accept FactSet data revisions, we would not be any closer to being PIT than we would if we accepted them because so many data revisions have already been applied. And we would run the risk of not being able to fix errors. And because the data is so interrelated, revisions of one data point will involve revisions to others. For example, let’s say we point out to FactSet that they got their share count wrong for a certain stock at a certain point in the past. They would then revise that share count, which would affect every single value ratio for that stock. The ripple/butterfly effect would be huge. Should we decline that revision? Apparently not. But what if another FactSet client points out that the share count is wrong and they respond in the same way? Should we decline that revision because it wasn’t generated by us? And how are we to differentiate between a revision that’s correcting a data error and a revision that reflects new information?
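To put some illustrative numbers on that ripple effect (all figures below are invented), here is how one share-count correction propagates through every ratio derived from it:

```python
# Hypothetical example: one share-count revision changes every per-share and
# market-cap-based ratio, for every date the bad count covered.
shares_reported = 95_000_000     # share count as originally delivered
shares_corrected = 100_000_000   # share count after the vendor's revision
price, net_income, sales, book_value = 20.0, 150_000_000, 1_200_000_000, 900_000_000

def ratios(shares):
    mkt_cap = price * shares
    return {
        "EPS": net_income / shares,
        "P/E": mkt_cap / net_income,
        "P/S": mkt_cap / sales,
        "P/B": mkt_cap / book_value,
    }

before, after = ratios(shares_reported), ratios(shares_corrected)
for k in before:
    print(f"{k:>4}: {before[k]:8.3f} -> {after[k]:8.3f}")
```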

At the same time, we have been taking a number of steps all along to limit the number of changes that users experience and to make our data as PIT-ish as possible. This includes noting effective dates of announcements and statements post-March 2020, assigning past effective dates as smartly as possible, creating PIT time series, and so on. I’d be happy to consider other suggestions regarding making our data more PIT-ish. But please do take into account that the data we’re starting with isn’t PIT to begin with.

I do not want to dismiss or minimize your concerns. I just want to tell you how difficult it is to satisfy them, and how hard we’re trying to deal with this.

Yuval,

I wonder if you would take the time to look at what Quantopian does and why.

And check whether FactSet’s PIT earnings estimates are an option for us (probably not, I would guess).

To be clear, I am not sure that this is a problem with my personal ports, or, if it is, what the actual problem is.

I wonder if we (including me) could get more information and put a little more thought into this (if it is even a problem for most of us). Although some seem pretty convinced it is a problem. A problem that Quantopian thinks it has addressed adequately.

Quantopian uses FactSet data and must have had a reason for what it did.

Best,

Jim

I will look into this, Jim.

If this helps, here it is for all to look at…

Quantopian has a 2-step approach for earnings estimates. It looks like they started doing what I think of as SnapShots in November 2018. Marco may have a better term for it than SnapShots. (Image)

They lag the data by 24 hours for dates before that (same image).

What they do (if anything) with Fundamental data is less clear to me.
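For anyone curious what that kind of lag looks like mechanically, here is a rough Python sketch (my own code, not Quantopian’s or FactSet’s): an estimate revision only becomes usable one day after its vendor timestamp, so a sim cannot act on it the day the vendor’s database first shows it.

```python
# Rough sketch of lagging estimate revisions by one day (hypothetical implementation).
from datetime import date, timedelta

LAG = timedelta(days=1)

# (vendor_timestamp, estimate) pairs as they appear in the historical database
revisions = [
    (date(2019, 5, 1), 1.10),
    (date(2019, 5, 6), 1.25),
]

def estimate_as_of(asof, history, lag=LAG):
    """Latest estimate whose (timestamp + lag) is on or before the as-of date."""
    usable = [(ts + lag, est) for ts, est in history if ts + lag <= asof]
    return max(usable)[1] if usable else None

print(estimate_as_of(date(2019, 5, 6), revisions))  # 1.10 -> the May 6 revision is not yet usable
print(estimate_as_of(date(2019, 5, 7), revisions))  # 1.25 -> usable the next day
```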


I see at least 2 questions being conflated in this discussion: 1) the accuracy of FactSet’s data vs Compustat’s; and 2) what defines PIT.

  1. Re accuracy of the datasets: during the months when we had access to both datasets, I ran simulations of 189 different models on compustat, on factset-using-prelims and on factset-without-prelims. The differences in simulated AR between the different datasets were entirely within the “noise” – all results on factset data were within 3 standard deviations of the results on compustat data. The r-squared of the factset ARs vs the compustat ARs was 0.97 – so if compustat showed model A as having a higher AR than model B, then FactSet also showed A as having a higher AR than B. I saw nothing to suggest that FactSet data was “less PIT” or less reliable. (A toy sketch of this kind of comparison is at the end of this post.)

  2. Re PIT: I want the simulations to be as accurate as possible, therefore I want bug fixes and data corrections applied to historical data. I don’t care about simulations being reproducible from one day to the next. I care about them being as accurate as we can make them today given the inherent approximations, guesswork and limitations involved in the process of constructing a simulation. If what you care about is a reproducible result, if what you care about is that a sim run yesterday and a sim run today have the exact same AR, then just save the sim results when you run them and later you can refer back to those results rather than rerunning the sim and thereby always “get the same results”. If a model is so sensitive that bug fixes and data corrections “break it”, then whatever data you thought you could derive from the earlier results was not trustworthy.

Bug fixes from P123 increase my confidence in their processes. If they weren’t finding bugs and fixing them, that would be a huge red flag. Likewise and for the same reason, data corrections from FactSet increase my confidence in their processes. If my models continue to hold in the presence of those changes, then that’s a nice bonus test of their durability. I’ve never seen changes implemented by P123 cause changes in results that exceeded the noise inherent in the data. If anyone is seeing changes in results that bother them, I suspect either A) those changes are within the noise and so they should temper their expectations about precision; or B) their model is too fragile/curvefit for any simulation to provide worthwhile data.
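For anyone who wants to run this kind of comparison on their own sims, here is a toy sketch (the ARs below are randomly generated, not my actual 189-model results): compute the r-squared between the annualized returns of the same models run on two datasets.

```python
# Toy sketch with invented numbers: r-squared between ARs of the same models on two datasets.
import random

random.seed(7)

ar_a = [random.uniform(-5, 30) for _ in range(189)]        # pretend ARs (%) on dataset A
ar_b = [x + random.gauss(0, 2) for x in ar_a]              # dataset B = A plus noise

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return (cov * cov) / (vx * vy)

print(f"r-squared of dataset-B ARs vs dataset-A ARs: {r_squared(ar_a, ar_b):.3f}")
```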

SUPirate,

Good stuff about sims.

But did I miss something?

Nisser says he runs a PORT for a while and finds the sim (over the same period) is nothing like the port.

In other words the sim is what the name suggests. An artificial simulation that has nothing to do with reality. Not as good as a video game at the end of the day. And you do not lose that much money in a video game.

Personally, I find that possibility concerning. And Nisser has data to suggest that is the situation.

I am very sure Chaim is saying the same thing. And 99% sure that Brett is saying the same thing.

I see the question of whether the sim has much to do with the port as the important issue. This is what speaks to the question of PIT-ness don’t you think?

I saw PIT mentioned in there somewhere, but is there any evidence that the data is PIT (or not PIT) in all of that? It was not even clear to me whether you want things to be PIT or not. Do you think look-ahead bias is a problem? I truly missed that.

Did you test whether the sims resemble the ports as Nisser has done? How much look-ahead bias (when it exists) can be tolerated before the sim has absolutely nothing to do with the port? Maybe we are finding out.

It is hard to discount what Nisser and Chaim say without some data to suggest otherwise. Data I did not see in your post.

Chaim has posted about this type of thing before. He gets it and his accounts have always been correct. Nisser is generally correct in all of his posts.

Brett is quiet but has the most sense of all of us.

So, I still have questions unless you looked at some ports too. I do not see how running 189 sims (and no ports) answers much that is important.

And me, I do not place much value in someone’s ability to predict when she is consistently given the score of the game before the prediction is made (to use Brett’s analogy). More simply, I do not like look-ahead bias.

That is just common sense. I am not sure what could change my mind on that. Not the r-squared of 189 sims, for sure.

I still wonder why–if all this is not a problem–Quantopian is jumping through all of those hoops.

Best,

Jim

Nisser is the only one comparing a port to a sim. Everyone else appears to be talking about running sims on different days (though for the same simulation period) and getting different results. I have no comment on the port vs sim. I haven’t seen anything in my own investigations that suggests that FactSet data has a look-ahead bias.

Several assertions in this thread seem to suggest that correcting errors in the dataset (E.g., Sales for GE for their 3rd quarter 10-Q was reported by GE as 10.23 but originally incorrectly entered into FactSet’s database as 102.30, and then later, even as late as yesterday, corrected to 10.23) is a form of look-ahead bias. I disagree with that assertion.