Are our linear AI models homoskedastic? Are they close to linear?

We tend to skip over it, but linear models come with the assumption of homoskedasticity (constant error variance). That might be okay: we make a lot of simplifying assumptions for financial data at P123. We have to, and I understand that. But we shouldn’t forget what those assumptions are either.

We also assume linearity with linear regressions, which makes perfect sense. It’s in the name of the model, after all.

But I’m not so sure the stocks on the right — the ones we’re actually looking to buy — fit neatly into a linear model.

I’m also not sure that treating that rightmost bucket — which contains almost every stock I’m seriously considering — as an outlier is what I am looking for either.

Could we get back on track, respecting both assumptions, if we used inverse-variance weighted regressions?

I don’t think that is a minor question to be ignored, at least at home with our downloads.
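For anyone who wants to try the inverse-variance idea at home, here is a minimal sketch of what such a fit could look like. It assumes NumPy and assumes a variance estimate for each observation is already available (how to estimate it is a separate question). The function name `wls_fit` and the simulated bucket data are made up for illustration; they are not anything P123 provides.

```python
import numpy as np

def wls_fit(x, y, variances):
    """Inverse-variance weighted least squares for y ~ b0 + b1 * x.

    Hypothetical helper: each observation is weighted by 1 / variance,
    so noisier points pull on the line less.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    # Scaling each row by sqrt(w) turns the weighted problem into an
    # ordinary least-squares problem that lstsq can solve directly.
    sw = np.sqrt(w)
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1]  # intercept, slope

# Simulated (made-up) data: noise grows toward the right-hand buckets.
rng = np.random.default_rng(0)
bucket = np.arange(1, 21, dtype=float)
noise_sd = 0.5 + 0.2 * bucket
returns = 2.0 + 0.5 * bucket + rng.normal(0.0, noise_sd)

b0, b1 = wls_fit(bucket, returns, noise_sd ** 2)
print(f"weighted fit: intercept={b0:.3f}, slope={b1:.3f}")
```

With equal variances this reduces to ordinary least squares, so any difference between the two fits comes from the weighting alone.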

Advanced — if P123’s AI consultant is considering looking into this:

The variance might get normalized out in P123 Classic Rank Weights, but only if we assume equal variance across all features. Still, this normalization of rank weights may explain, in part, why P123 is such a successful model.

If the features have unequal variances, we might consider correcting for that somehow. It might be nice to have as an option for normalization in P123’s AI models.

There are probably additional complexities I haven’t considered while writing this, but I thought it was worth briefly revisiting the assumptions we’re making in some of our models.

A few of those assumptions — in practice — aren’t even remotely true.

If your concern is this specific plot, I don't think you need to worry.

First, you may be misunderstanding the purpose of the plot ("the stocks on the right — the ones we’re actually looking to buy — fit neatly into a linear model"). This plot is meant as a diagnostic, an illustration of how predictive your custom factor is.

The model at heart is annualized returns ~ your custom factor. The point is to capture the slope $\beta_1$ (and also the intercept $\beta_0$) as a way to quantify performance. The steeper the slope, the more powerful your custom factor is in ordering stocks by future return.
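As a rough sketch of what capturing the slope and intercept amounts to, the closed-form least-squares estimates can be computed in a few lines. This assumes NumPy; the helper `fit_line` and the five bucket values are hypothetical numbers chosen for illustration, not output from any real factor.

```python
import numpy as np

def fit_line(factor, returns):
    """Closed-form OLS estimates for returns ~ beta0 + beta1 * factor."""
    x = np.asarray(factor, dtype=float)
    y = np.asarray(returns, dtype=float)
    dx = x - x.mean()
    # Slope = covariance(x, y) / variance(x); the line passes through the means.
    beta1 = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1

# Hypothetical numbers: annualized return (%) for each factor-rank bucket.
bucket_rank = np.array([1, 2, 3, 4, 5], dtype=float)
annual_return = np.array([2.0, 4.5, 6.0, 9.5, 11.0])

beta0, beta1 = fit_line(bucket_rank, annual_return)
print(f"intercept={beta0:.2f}, slope={beta1:.2f}")
```

A larger `beta1` here means returns rise faster from one bucket to the next, which is the sense in which a steeper slope indicates a stronger factor.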

Second, the assumptions for linear regression are primarily there to secure valid inference on the parameters. Inference here means making a statement about a population based on a data sample. When the assumptions are met, we can do this with confidence intervals, p-values, etc.

However, the purpose of this plot is not generally to make inferences at all. We just want to quantify the strength of the factor. In that case, we don't need to fret about the assumptions at all.

Ordinary least squares is just an algorithm that finds the line-of-best-fit. It will do that for any data, regardless of assumptions. It only gets tricky when we want to make some statement about a population based on our data sample.
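To illustrate that last point, here is a small sketch, assuming NumPy and using simulated (made-up) data that is deliberately curved and heteroskedastic. The fitted line is still the one with the smallest sum of squared residuals: nudging either coefficient makes the fit worse.

```python
import numpy as np

def sse(x, y, b0, b1):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

# Simulated data that breaks both assumptions on purpose:
# the mean is cubic in x, and the noise grows with x.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
y = x ** 3 + rng.normal(0.0, 0.05 + 0.5 * x)

# np.polyfit returns coefficients from the highest degree down.
b1, b0 = np.polyfit(x, y, 1)
best = sse(x, y, b0, b1)

# Move the intercept or slope slightly in each direction and compare.
nudges = [(0.01, 0.0), (-0.01, 0.0), (0.0, 0.01), (0.0, -0.01)]
all_worse = all(sse(x, y, b0 + d0, b1 + d1) > best for d0, d1 in nudges)
print(f"slope={b1:.3f}, intercept={b0:.3f}, every nudge is worse: {all_worse}")
```

Nothing in this check says the slope's standard error or p-value can be trusted on such data; it only shows that the line itself is well defined.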