Does this approach to Canadian AI Factors make sense? Looking for feedback & suggestions

Hi All, I’m posting this in General rather than AI Factor because these NA issues likely affect anyone creating standard ranking systems for Canadian universes, not just AI workflows.

After testing and a full factor audit using ChatGPT-5, here’s what I’ve discovered so far about building Portfolio123 AI Factors for Canadian stocks:

Heavy fundamentals (EPS, EBITDA, margins) = high NA risk, especially pre-2012.
Analyst estimates, ownership, and short interest? Often ≥30% NA.

So I pivoted to low-NA families, examples:

Momentum (Pr26W%Chg, Pr52W%Chg/PctDev(52,5))
Pullback (RSI(10), Close(0)/SMA(20))
Technical breadth (UpDownRatio(20,0), ATRN(20,0)/ATRN(20,20))
Liquidity & execution (MktCap, (LoopAvg("Spread(CTR)",20)/Price)*-100)

If fundamentals are needed, I gate them with Eval() to avoid NA blowups.
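To decide which fundamentals need gating in the first place, something like this pandas sketch can flag low-coverage columns on an exported factor table. The data and values here are entirely made up for illustration; only the column names echo the factors above.

```python
import pandas as pd

# Toy factor table; values are invented purely for illustration.
df = pd.DataFrame({
    "Pr26W%Chg": [0.12, -0.05, 0.30, 0.08, None],
    "RSI(10)":   [55.0, 61.0, None, 48.0, 70.0],
    "EPS":       [1.2, None, None, None, 0.4],   # fundamentals: spotty coverage
    "MktCap":    [120.0, 45.0, 980.0, 33.0, 210.0],
})

coverage = df.notna().mean()               # share of non-NA values per factor
low_coverage = coverage[coverage < 0.70]   # candidates to gate with Eval() or drop
print(low_coverage)
```

On real data the same two lines give you a quick coverage report per factor family before you commit to a feature list.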

I attached a list of factors and several spreadsheets from my AI Factor builds that had high NA rates. Based on that, ChatGPT-5 generated this infographic comparing factor families by coverage vs NA risk:

Question to the community:

Does this approach make sense for Canadian universes?

Are there other factor families or syntax tricks you’ve found effective for reducing MAX NA?

Any best practices for mixing technicals with fundamentals post 2012?

Looking forward to your thoughts!

I personally only manipulate features with IsNA(x,0) for analyst revision features, surprise features, SUE, and actuals. In general, analyst and actuals data are the worst offenders regarding NAs. I tried to remove all NAs at some point, but I found that leaving NA flags for tickers with a lot of missing data is more feature than bug. I personally don’t want to give tickers the benefit of the doubt for rather basic missing data.

That said, I personally am not a big fan of a Canada-only AIFactor. The number of “good” investible stocks in my case was below 500, which I found is a bit low for ML training. I pair Canada with Europe or go total North Atlantic. But that’s just my preference.

Of course you can also just include all shady tickers and go full technical features, but I would assume that leaves you with a lot of blind spots.

Also, make sure that the share of NA-sensitive features in your total feature list isn’t sitting by coincidence right at the 30% mark. That will leave you with a lot of whipsawing, in live trading too, because the 30% NA limit is applied not only in training but also in live prediction (an AIFactor predictor will return NA for a ticker with >30% NAs as of today’s data, even if it was a top pick on yesterday’s data with <30% NA).
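A rough sketch of that cutoff behaviour, with semantics reconstructed from the description above (the exact arithmetic inside P123 may differ):

```python
import numpy as np

def predict_or_na(features, score=1.0, max_na=0.30):
    """Return the model score unless the NA share exceeds the cutoff."""
    na_share = np.mean([f is None for f in features])
    return None if na_share > max_na else score

# Yesterday: 3 of 10 features missing -> 30% NA, still within the limit.
yesterday = [1.0] * 7 + [None] * 3
# Today: one more field drops to NA -> 40% NA, the prediction vanishes.
today = [1.0] * 6 + [None] * 4

print(predict_or_na(yesterday), predict_or_na(today))
```

This is the whipsaw: a stock hovering near the boundary flips between a valid score and no score at all from one rebalance to the next.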

1 Like

Thanks for the reply, lots of good points to digest. I agree that analyst/actuals are NA nightmares. Pairing Canada with Europe or North Atlantic sounds smart; I might look into adding Europe to my subscription so I can test that. The 30% NA cutoff reminder is huge. I’ve seen picks vanish overnight when ratios flip. Lately my approach has been to gate or drop fundamentals with <70% coverage, keep my core focused on momentum, liquidity, and technicals, and make sure the NA ratio stays well below the cutoff. If you’ve got favorite Europe or North Atlantic factor combos, or tricks for balancing technicals with fundamentals, I’m all ears.

Thanks again for the input!

2 Likes

I don't know anything about NAs in ML models, but in standard ranking systems I've had very good results with Canadian stocks, NAs or not. And I personally think it's a big mistake to not consider quality and value factors (fundamentals) as much as you possibly can. "If fundamentals are needed . . ." They're always needed. Choosing stocks without looking at fundamentals is like buying used cars without looking under the hood. Look at the specific line items that have lots of NAs and figure out workarounds. I use IsNA([factor],0) for a lot of items, if that's appropriate. It might seem like a lot of work, but I think it's well worth the effort.

I'm also curious why you've decided to build AI models rather than ranking systems for Canadian stocks, but that's perhaps a subject for a different thread.

4 Likes

I personally use feature lists which comprise mainly well-known factor definitions I also use in my classic ranking systems. You could start with every metric you find in the Core - Combination System + some obvious ones like MktCap etc. which are missing.

Also, try to iterate and test configurations to get a feeling for it. Always make a 5y OOS predictor and build an actual ranking system with it to use in an actual sim. Screen/validation results of an AIFactor can be highly misleading if most of the return is based on high-turnover mean reversion, which is a slippage nightmare.

I would suggest starting with a 3m total return target and weekly rebalance. Time-series CV with 12-month steps. Also, for the first iterations, try to avoid short-term signals to force at least some auto-correlation. If you use a lot of (noisy) technical signals with sub-3m lookbacks, the system will overfit using those and it likely won't survive slippage. Try a rather prudent feature mix of 3-12m momentum, growth, value, 13w revisions (prefer CY data since it has the highest coverage), quality, (volatility - try both with and without here), profitability, sentiment, etc.

Also, focus on LightGBM, linear, and maybe ExtraTrees first (fast, cheap, promising models).

You can add more short-term features, tweak universe or target later but first you should get a feeling for it.

That said, try a simple screen with NA filters for the most sensitive features first and backtest it to see the number of stocks passing over time… Imo, if the number of stocks is <500 most of the time, your universe is maybe too small to build a reliable ML model (that's why I don't have a pure Canada AIFactor). But maybe you can build a linear model, use the feature stats for a classic ranking system, or combine ML with a classic approach in one system, etc., because for classic (linear) ranking, the number of stocks matters way less if you get the ordering right.
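The universe-size check can be sketched like this; everything here is synthetic (the NA shares would really come from an exported screen backtest), it just shows the shape of the count-over-time series you want to look at.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", periods=8, freq="QS")
n_stocks = 900

# Synthetic per-stock NA shares for each rebalance date (real numbers
# would come from a screen backtest with NA filters on the sensitive
# features).
na_share = rng.uniform(0.0, 0.6, size=(len(dates), n_stocks))

# Count how many stocks stay under the 30% NA cutoff at each date.
passing = pd.Series((na_share <= 0.30).sum(axis=1), index=dates,
                    name="stocks_passing")
print(passing)
print("min count:", passing.min())
```

If the minimum count sits well under 500 for long stretches, that's the "universe maybe too small for ML" warning sign described above.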

3 Likes

Thanks Yuval, really appreciate the response. I’m still relatively new to P123 (about 2 years in), but I’ve built a few ranking systems that are working well. I’ve read all your blog posts on P123 and even used them (with help from ChatGPT-5) to build a best practices checklist.

When I was first building ranking systems, I mainly used the Small & Micro Cap Focus ranking system as a base to start from, and found it performed better for my Canadian universe after dialing back the sentiment composite. That led me to explore why Canadian stocks seem to have more NAs than US ones, especially in fundamentals and sentiment.

I’m now diving into AI Factors and ML models, where I’ve read that NA handling becomes even more critical. Missing data can shrink training sets, distort predictions, or cause tickers to fail live scoring if NA thresholds are hit. So I’ve been leaning on low NA families and gating fundamentals when needed.

Curiously, many of my ranking systems (including ones modified from P123 templates) seem to produce higher CAGR when I treat NAs as neutral rather than negative in simulations. So you’re probably onto something; I just don’t fully understand the outcome yet. Still learning!

Thanks again for the insights.

Thanks Doney. I’ve been using mainly LightGBM and ExtraTrees so far. I’ve read a lot of Andreas Himmelreich’s (Judgetrade) posts and recently started building systems that use LightGBM in the ranking system and ExtraTrees in the buy rules with surprisingly good results.
I’m still experimenting, but your point about avoiding short-term noisy signals early on really resonates. I also agree on the importance of building a proper ranking system and sim around the predictor. I’ll definitely try your suggestion of a simple NA screen to track universe size over time. The Canada universe I created gives me about 900 names; fewer than 500 are probably worthy of a look.

Thanks again for the insights, super helpful!

1 Like

NA neutral basically gives more “benefit of the doubt” to stocks with many NAs. For such a stock to still make it to the top of the ranking, the other subranks will likely be fabulous. With NA negative you take away this possibility. In the end, the devil is in the details of factor weights and universe/factor composition/distribution.
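A toy pandas illustration of that difference (not P123's actual implementation; "neutral" is approximated here as parking the missing value at a 50th-percentile rank):

```python
import numpy as np
import pandas as pd

# Four stocks, one factor; stock B has no data.
values = pd.Series([3.0, np.nan, 1.0, 2.0], index=["A", "B", "C", "D"])

# "NA negative": the missing value is ranked below everything else
# (na_option="top" assigns NaN the smallest rank in pandas).
rank_negative = values.rank(pct=True, na_option="top") * 100

# "NA neutral": the missing value is parked mid-pack at 50, so strong
# subranks elsewhere can still carry the stock to the top.
rank_neutral = (values.rank(pct=True) * 100).fillna(50.0)

print(rank_negative["B"], rank_neutral["B"])
```

Under "negative", B is pinned to the bottom of this subrank no matter what; under "neutral", B only needs its other subranks to be fabulous to reach the top of the composite.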

500 isn’t a hard threshold, of course. But Andreas’s recent posts regarding filtering before vs. after training mainly concern narrow universes, from what I see in my tests. In my global filtered universe with 2000-ish names, “filter first, train narrow” vs. “train broad, filter in rules” makes no huge difference (slight advantage to the former in my case).

In the end, to build conviction the only way is to iterate and make robustness checks over and over and over…

1 Like

Did a round of testing yesterday to see how sturdy my Canadian ML setup really is, based on the responses from Doney & Yuval. A few small tweaks made a big difference.

Treating missing data as neutral (instead of punishing it) gave a solid boost to top ranked stocks.
Example: my custom universe, Canada +30M – CurR + SalesGr

Top bucket CAGR: 21.7% to 28.6%
Slope: 1.44% to 1.81% per quantile
Bottom bucket got worse, but I don’t buy there, so no big deal.

Ran three checks to see if the strategy holds up under pressure:

Nudged weights 10–20%
Tested across 2018–2020, 2020–2022, 2022–2024
Tried different rank cutoffs (60, 70, 80)

Still solid. No regime collapse.
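The weight-nudge check can be sketched like this. Everything below is synthetic (random subranks standing in for a real ranking-system export); it just shows one way to measure whether the top bucket survives ±15% weight perturbations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Synthetic 0-100 subranks for three composites (e.g. momentum,
# quality, value) -- stand-ins for real ranking-system data.
subranks = rng.uniform(0, 100, size=(n, 3))
base_w = np.array([0.5, 0.3, 0.2])

def top_decile(weights):
    """Indices of the top 10% of stocks by weighted composite score."""
    score = subranks @ weights
    return set(np.argsort(score)[-n // 10:])

base = top_decile(base_w)
overlaps = []
for _ in range(50):
    w = base_w * rng.uniform(0.85, 1.15, size=3)  # nudge each weight +/-15%
    overlaps.append(len(base & top_decile(w / w.sum())) / len(base))

print("mean top-bucket overlap:", round(float(np.mean(overlaps)), 3))
```

A mean overlap close to 1.0 means the top picks barely move when weights are nudged; a low overlap would be the "whipsaw under perturbation" warning sign.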

Training Universe:

Canada +30M (1,686 stocks): NA = Negative
Canada +30M – CurR + SalesGr (922 stocks): NA = Neutral

Validation/Trading

CA SmallCap AI-Factor v1 (375 stocks)
CanadaPolicy2025 (910 stocks)
Canada DNP – 2 (662 stocks)

Each has its own flavour, some momentum heavy, some quality tilted.

Kept it clean: a median $Vol > $100–150k filter made a difference.

Appreciate the responses from Doney on robustness and Yuval’s early posts on fundamentals. This setup now gives me a clean ML pipeline with strong monotonicity.

I use NA Negative in my ranking systems, but will override it in rare cases when it makes sense. For example, one strategy is using the North American universe so it is a mix of US and CAN stocks. SIRatio is NA for every CAN stock, so it is not logical to penalize them for the NA.

So if NA, I set the value to the average SIRatio from the universe using this formula:
Eval(SIRatio=NA, Aggregate("SIRatio",#All), SIRatio)
If I wanted to be more precise, I could do this:
Eval(SIRatio=NA and ExchCountry("CAN"), Aggregate("SIRatio",#All), SIRatio)
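For anyone who prototypes outside P123, here is a pandas analogue of those two formulas on a toy mixed-country table. A plain mean stands in for Aggregate("SIRatio", #All), so this is only an approximation of the P123 behavior, and the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy North American universe; values are illustrative.
df = pd.DataFrame({
    "ticker":  ["US1", "US2", "CA1", "CA2"],
    "country": ["USA", "USA", "CAN", "CAN"],
    "SIRatio": [4.0, 6.0, np.nan, np.nan],
})

# Stand-in for Aggregate("SIRatio", #All): a plain universe mean.
universe_avg = df["SIRatio"].mean()

# Eval(SIRatio=NA, Aggregate("SIRatio",#All), SIRatio)
df["filled"] = df["SIRatio"].fillna(universe_avg)

# Eval(SIRatio=NA and ExchCountry("CAN"), Aggregate("SIRatio",#All), SIRatio)
mask = df["SIRatio"].isna() & df["country"].eq("CAN")
df["filled_can"] = df["SIRatio"].mask(mask, universe_avg)

print(df)
```

The second variant only rescues Canadian NAs, so a US stock that is genuinely missing short-interest data would still be penalized.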

2 Likes

Hi Dan, good point on SIRatio and NA handling. I’ve run into similar issues with Canadian tickers missing sentiment fields, and your Eval() workaround is super clean. I’ve actually borrowed a few ideas from your posts in the past; my Canada DNP universe is named after you, so thanks for the ideas.

I’ve been using Eval() more aggressively lately, especially to gate features before feeding them into my ExtraTrees predictors. For example, I’ll wrap things like SalesGr%TTM or OpInc in Eval() to avoid blowing up the model with NAs. It seems to be keeping training sets clean and live scoring stable.

I’m leaning towards NA Negative as a default, but I’ve seen NA Neutral outperform in top buckets for Canadian universes when paired with quality filters. Bottom buckets get messy, but I’m not buying there anyway. Appreciate you sharing the formula, super practical.

2 Likes