Accurate t-tests with no statistical knowledge required


I downloaded the weekly returns for a P123 model (the details of the model are not pertinent here) from the Statistics section in P123 and asked ChatGPT whether my results were statistically significant. I have the ChatGPT desktop app for Mac and was trying it out with some screenshots and downloads from P123.

This was my original question:

You have a screenshot of the out-of-sample results of a stock model I am investing in. In addition, I have uploaded a csv file with the result of my model (column head "Model"), a benchmark ("S&P 1500 Pure Value") and "Excess Return" which is the excess returns of my model versus the benchmark. Are my results statistically significant?

I had to prompt it to do a paired t-test, to make it one-sided, and to decide whether to use a Wilcoxon Signed-Rank test (and/or the Student's t-test). The Wilcoxon Signed-Rank test is good for data that is not normally distributed. But it was right each time about the type of test it was using. I.e., I got a correct two-sided Student's t-test for the model initially (probably the default answer most of the time).
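For anyone who wants to check ChatGPT's work, the tests mentioned above are one-liners in scipy. This is a minimal sketch with made-up weekly returns standing in for the "Model" and benchmark columns of the CSV:

```python
# Sketch of the tests discussed above, assuming two equal-length arrays of
# weekly returns (the "Model" and benchmark columns). The data here is
# randomly generated for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
model = rng.normal(0.003, 0.02, 260)       # ~5 years of weekly returns
benchmark = rng.normal(0.001, 0.02, 260)

# Paired (dependent) Student's t-test: same weeks, two strategies.
t_stat, p_two_sided = stats.ttest_rel(model, benchmark)

# One-sided version (H1: model outperforms the benchmark).
t_stat1, p_one_sided = stats.ttest_rel(model, benchmark, alternative="greater")

# Wilcoxon Signed-Rank test: non-parametric alternative for non-normal returns.
w_stat, p_wilcoxon = stats.wilcoxon(model, benchmark)

print(p_two_sided, p_one_sided, p_wilcoxon)
```

Running the paired t-test on the "Model" and benchmark columns is equivalent to a one-sample t-test on the "Excess Return" column, which is why ChatGPT can work from either.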

Final answer:

Yes, the p-value of 0.021 from the Wilcoxon Signed-Rank test is for a two-sided test. This means it tests whether there is a statistically significant difference in either direction between the returns of your model and the benchmark. If you have a specific direction in mind (e.g., you believe your model performs better), you could perform a one-sided test, which would have a p-value half of the two-sided test. However, since the two-sided p-value is already below 0.05, the result is significant for a one-sided test as well.

TL;DR: You don't have to know a lot about statistics to do statistics now: just the name of the test you want. You do not have to know how to program in Python either. Also, AI/ML is coming. ChatGPT may be able to supplement some of what P123 is doing to make machine learning and statistics easy (or at least easier) and understandable.



Should these tests be run on as long a period as possible, or only the last few years?

Ideally, it should be out-of-sample data from a live port that you funded from the very beginning of the port. This would be truly out-of-sample data, which limits the multiple-comparison problem.

Sims and in-sample data present a lot of problems, not the least of which is the multiple-comparison problem. That is NOT to say that there cannot be some multiple-comparison problems with ports, but you can, at least, count the number of ports and compensate for how many comparisons you are making (or ports you are looking at). You could do that with sims too, I guess, but I think nobody actually does, and they may be OCD if they do.
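"Counting the ports and compensating" can be as simple as a Bonferroni correction: multiply each p-value by the number of ports you looked at, capped at 1. A minimal sketch with made-up p-values:

```python
# Bonferroni correction for the number of ports examined. The p-values
# below are invented for illustration; one per funded port.
p_values = [0.021, 0.30, 0.008, 0.12]
n = len(p_values)

# Multiply each p-value by the number of comparisons, capping at 1.
bonferroni = [min(1.0, p * n) for p in p_values]
print(bonferroni)   # roughly [0.084, 1.0, 0.032, 0.48]
```

Bonferroni is conservative; less strict corrections (e.g. Holm's method) exist, but the point is the same: the more ports (or sims) you looked at before picking one, the stronger the evidence needs to be.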

I should probably add that this link has some ideas that work as well as what P123 is doing with cross-validation. I do not want to give the impression that using out-of-sample ports (and counting the number of ports you are looking at, if that number is large) is all you can possibly do to deal with the problems that occur when looking at sims. But it is hard to have look-ahead bias in an out-of-sample port, and the number of funded ports is constrained (a good thing for evaluating a model).


With the help of Gemini telling me what to do, I ran a t-test on a historically performing designer model. Both one-tailed and two-tailed. Paired. Not having done statistical maths in the past, I used recommendations from the AI and Excel.

What strikes me is that by running the test on different time frames, the p-values are vastly different. Very promising for the full out-of-sample period (>5 years). Very bad for the last year. The model was performing very well over the long run but is now struggling.
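That window effect is easy to reproduce. A sketch with toy weekly excess returns whose drift fades in the final year (the numbers are invented, not from the designer model):

```python
# Toy illustration of how the same test gives different p-values on
# different lookback windows. Assumes weekly excess returns with a
# positive drift that disappears in the last year.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
excess = np.concatenate([
    rng.normal(0.002, 0.02, 208),    # first 4 years: positive drift
    rng.normal(-0.001, 0.02, 52),    # last year: struggling
])

# One-sample t-test on excess returns (equivalent to a paired test of
# model vs. benchmark), over the full sample and the last year only.
_, p_full = stats.ttest_1samp(excess, 0.0, alternative="greater")
_, p_last_year = stats.ttest_1samp(excess[-52:], 0.0, alternative="greater")
print(p_full, p_last_year)
```

Part of the difference is mechanical: with only 52 weekly observations, the last-year test has far less power, so its p-value is noisy even if nothing about the model changed.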

Can one hypothesize that over the long term the outperformance is statistically significant, while in the short term the bad results are just random noise and therefore not statistically significant?

Or should it be interpreted as it worked but it does not anymore?

Trying to understand all of that...

I do not think there is one simple answer to that. My method is used by other members, however, in different forms.

For me it is not an all-or-nothing thing. The model does not magically become statistically significant one day so that I suddenly invest all of my money in it, and then move all of my money out because it loses statistical significance in a downturn that may (or may not) be temporary.

I am pretty numerical in what I do so I actually strictly adhere to Bayesian Model Averaging (BMA).

To get a method similar to BMA with just ChatGPT (and probably other LLMs), try this (a real question I asked):

"Okay. So I have a diversified portfolio and I want to adjust the weights of this stock model according to my confidence in the model (objectively determined confidence). I have used Bayesian Model Averaginb [sic] but I find it comples [sic]. Could you give me the Vovk-Seilke [sic] odds ratio for that data?"

Here is the complete answer:

"To calculate the Vovk-Sellke maximum p-ratio (VS-MPR) manually, you can use the following formula:

VS-MPR = 1 / (-e · p · ln(p))

For the Wilcoxon Signed-Rank test p-value of 0.021:

p = 0.021

Let's calculate it step by step. We need the values of e (Euler's number) and ln(0.021):

e ≈ 2.71828
ln(0.021) ≈ -3.8632

Substituting these values into the formula:

VS-MPR = 1 / (-2.71828 · 0.021 · (-3.8632)) = 1 / 0.2206 ≈ 4.53

The Vovk-Sellke maximum p-ratio for the Wilcoxon Signed-Rank test p-value is approximately 4.53. This ratio provides an upper bound on how much more likely the alternative hypothesis is compared to the null hypothesis based on the given p-value." End of ChatGPT's answer.
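If you would rather not trust the LLM's arithmetic, the formula ChatGPT quotes is a three-line function. Note the bound is only defined for p < 1/e; for larger p-values it is 1:

```python
# Minimal implementation of the Vovk-Sellke maximum p-ratio quoted above.
import math

def vs_mpr(p: float) -> float:
    """Upper bound on the odds in favour of H1 implied by a p-value."""
    if p >= 1 / math.e:
        return 1.0      # the bound is uninformative for p >= 1/e
    return 1.0 / (-math.e * p * math.log(p))

print(round(vs_mpr(0.021), 2))   # 4.53, matching ChatGPT's arithmetic
```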

So this is the MAXIMUM excess weight you should place on this model as part of a diversified portfolio, IMHO. If the model deserves a greater weight, it will prove that to you objectively down the road, with plenty of time for you to become wealthy if it turns out to be a good model. No biases from me or anyone else about the model (my model is not available anyway, but you get the point). Just an objective number based on out-of-sample data that will not put you all-in if it is a bad model, and it will change in the right direction and become more accurate as more data comes in.

Just my method but I do actually use it (with Bayesian statistics which gives similar answers).


I am not quite sure I understand the following: "So this is the MAXIMUM excess weight you should place on this model as part of a diversified portfolio, IMHO." Do you mean 4.53% or something else? I also do not get how the VS-MPR translates to an asset allocation. Based on my fragile understanding, the VS-MPR strengthens the case for the model's validity but does not by itself lead to an allocation.

So 4.53 is a multiplier, i.e., about 4.5 times the original weight. So let's say you have a few small-cap models, large-cap models, and maybe even some fixed-income ETFs.

Some of your ETFs and models may be performing pretty much as expected. You do not give them extra weight.

But if one of your models is outperforming and proving itself with good data, don't you want to give it more weight? This particular method for determining how much additional weight would weight it 4.53 times greater than when you started funding it, because it is proving itself. So if it was, say, 5% of your portfolio when you started, then make it about 23% of your portfolio (or a little less), since the VS-MPR is just what its name implies: the Vovk-Sellke maximum p-ratio. It is the maximum amount of confidence you can have that the model is truly better than random chance (numerically or statistically speaking).
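The arithmetic above can be sketched as follows. The model names and starting weights are hypothetical; only the 4.53 VS-MPR comes from the calculation earlier in the thread. One design choice worth making explicit: after scaling one model's weight up, the other weights need to be renormalised so the portfolio still sums to 100%:

```python
# Sketch of the weighting arithmetic: scale a model's starting weight by
# its VS-MPR, then renormalise. Names and starting weights are made up.
start_weights = {"model_a": 0.05, "model_b": 0.45, "bond_etf": 0.50}
vs_mpr_scores = {"model_a": 4.53, "model_b": 1.0, "bond_etf": 1.0}

# Scale each starting weight by its VS-MPR: 0.05 * 4.53 = 0.2265, i.e. ~23%.
scaled = {k: w * vs_mpr_scores[k] for k, w in start_weights.items()}

# Renormalise so the portfolio weights sum to 100% again.
total = sum(scaled.values())
weights = {k: v / total for k, v in scaled.items()}
print(weights)
```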

It is just a way of weighting your strategies according to your confidence in them, if you want an objective numerical way of determining your confidence level.

Instead of asking Gemini about VS-MPR ask it about Bayesian Model Averaging (BMA). If you do not want to calculate the exact number for BMA after doing that then just use the VS-MPR as a surrogate for the BMA number (and just have Gemini calculate the VS-MPR).


Hmm. Got it. Thank you.
Food for thought for the weekend.