Man+Machine – The evolution of fundamental scoring models and ML implications

Dear all,

This article from CFA UK is about the development of scoring models. I think the most important line in the article is:

Surprisingly, Mohanram’s G-Score and Piotroski’s F-Score were the only ones surviving an out-of-sample backtesting delivering statistically significant alpha despite the other rejected models being developed at a later date.

Portfolio123 already has the F-Score built in.
Here are the 8 formulas for the G-Score (namely G1-G8) that you can set up under formulas.

G1 = Eval(ROA%(0,ANN) > Aggregate("ROA%(0,ANN)", #Sector, #Avg),1,0)

G2 = Eval(OperCashFl(0,ANN)/AstTot(0,ANN) > Aggregate("OperCashFl(0,ANN)/AstTot(0,ANN)", #Sector, #Avg),1,0)

G3 = Eval(OperCashFl(0,ANN) > NetIncCFStmt(0,ANN),1,0)

G4 = Eval(LoopStdDev("ROA%(ctr, ANN)",5) < StdDev(Aggregate("ROA%(0,ANN)", #Sector, #Avg), Aggregate("ROA%(4,ANN)", #Sector, #Avg), Aggregate("ROA%(3,ANN)", #Sector, #Avg), Aggregate("ROA%(2,ANN)", #Sector, #Avg), Aggregate("ROA%(1,ANN)", #Sector, #Avg)),1,0)

G5 = Eval(LoopStdDev("Sales(ctr,ANN)/Sales(ctr + 1,ANN) - 1", 5) < StdDev(Aggregate("Sales(0,ANN)/Sales(1,ANN)-1", #Sector, #Avg), Aggregate("Sales(4,ANN)/Sales(5,ANN)-1", #Sector, #Avg), Aggregate("Sales(3,ANN)/Sales(4,ANN)-1", #Sector, #Avg), Aggregate("Sales(2,ANN)/Sales(3,ANN)-1", #Sector, #Avg), Aggregate("Sales(1,ANN)/Sales(2,ANN)-1", #Sector, #Avg)),1,0)

G6 = Eval(RandD(0, ANN)/AstTot(0, ANN) > Aggregate("RandD(0, ANN)/AstTot(0, ANN)", #Sector, #Avg),1,0)

G7 = Eval(CapEx(0, ANN)/AstTot(0,ANN) > Aggregate("CapEx(0, ANN)/AstTot(0,ANN)", #Sector, #Avg),1,0)

G8 = Eval(SGandA(0, ANN)/AstTot(0,ANN) > Aggregate("SGandA(0, ANN)/AstTot(0,ANN)", #Sector, #Avg),1,0)
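For anyone who wants to sanity-check the logic outside P123, here is a minimal Python/pandas sketch of the same binary-sum scheme. The column names (roa, cfo_to_assets, etc.) are illustrative stand-ins, not P123 identifiers, and sector means stand in for the #Avg aggregates above:

```python
import pandas as pd

def g_score(df: pd.DataFrame) -> pd.Series:
    """Toy G-Score: sum of 8 binary signals per stock."""
    score = pd.Series(0, index=df.index)
    # sector averages per column, analogous to Aggregate(..., #Sector, #Avg)
    peer_avg = df.groupby("sector").transform("mean")
    # G1, G2, G6, G7, G8: score 1 when the ratio beats the sector average
    for col in ["roa", "cfo_to_assets", "rnd_to_assets",
                "capex_to_assets", "sga_to_assets"]:
        score += (df[col] > peer_avg[col]).astype(int)
    # G3: cash flow from operations exceeds net income (accruals check)
    score += (df["cfo"] > df["net_income"]).astype(int)
    # G4, G5: earnings and sales-growth variability below the sector average
    for col in ["roa_stdev", "sales_growth_stdev"]:
        score += (df[col] < peer_avg[col]).astype(int)
    return score
```

Each check contributes 0 or 1, so the total ranges from 0 to 8, just like the G1-G8 formulas.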

From my experience, the best way to apply F-Score and G-Score is to use the Frank function.




I am currently using the Mohanram G-Score and Piotroski F-Score in all my stock-based strategies. I hope other P123 members also find them useful in selecting stocks.



We’ll see about adding a built-in G-Score. Makes a perfect pair with F-Score: value + growth.

Also interesting is the cross-validation technique from the ML paper, as we're evaluating which techniques to include in our AI factors (slated for beta release this quarter):

“The training of the model implements a two-year rolling window to predict next year returns; notwithstanding, the authors also display results using other three rolling windows: one, three and four years.”
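In code, the paper's scheme is just a walk-forward split generator. A minimal Python sketch (the function name and shape are mine, not from the paper):

```python
def rolling_splits(years, train_window=2):
    """Rolling-window walk-forward: train on the last `train_window`
    years, then predict the single following year."""
    return [(years[i - train_window:i], years[i])
            for i in range(train_window, len(years))]

# e.g. rolling_splits(list(range(2018, 2024)), 2) yields
# ([2018, 2019], 2020), ([2019, 2020], 2021), ... up to ([2021, 2022], 2023)
```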

It’s interesting because it’s more dynamic. The model used for prediction is continually updated using only a small history (3 years is probably the sweet spot). This makes sense considering this other interesting tidbit from the article:

The "alpha decay" effect documented in 2016 was an eye-opener: portfolio returns are 26% lower out-of-sample and 58% lower post-publication of a recently discovered alpha signal.

58% lower! In other words, if your goal is 25% CAGR you need a robust (not curve-fitted) backtest of roughly 60% CAGR (25% / 0.42, since only 42% of the backtested return survives). Guess we’ll be using this response when users complain about out-of-sample not living up to the backtests.


I have been using the G-Score too, albeit in a ranking system. I wonder how different the stocks it ends up selecting are.


I will leave it for @yuvaltaylor to comment, but I think he has consistently favored longer training periods. Note I don’t think using the optimizer (or not) is a big factor in whether we call this machine learning, reinforcement learning, or a jrinne [or fill in the username of your choice] “special unique proprietary method not to be confused with machine learning.” Anyway, I leave it to Yuval to comment on his own system, and I apologize for any misunderstandings on my part.

In my case I can comment without equivocation. I tried a rolling 10-year training window and a training window that started in 2000 each time, i.e., train on 2000-2010, then test on 2010-2011; then train on 2000-2011 and test on 2011.

Which of the methods won was NOT EVEN CLOSE: starting the training at year 2000 won hands down. That is what cross-validation is for: selecting the best method to use out of sample. For me, for this model, starting at 2000 was best no matter what anyone’s opinion on the matter is.
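For clarity, here is what my anchored (expanding-window) scheme looks like as a Python sketch: training always starts at the first year and grows by one year each step, versus the fixed 10-year rolling window it beat. (Names are mine, for illustration.)

```python
def anchored_splits(years, min_train=10):
    """Expanding-window walk-forward: the training set always starts at
    years[0] and grows one year per step; test on the next year."""
    return [(years[:i], years[i]) for i in range(min_train, len(years))]

# With years 2000-2012 the first split trains on 2000-2009 and tests 2010,
# and the last split trains on 2000-2011 and tests 2012.
```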

Here are my results BTW: 2023 performance - #10 by Jrinne.

FWIW, this is one anecdotal example of a real, live port with P123 data, if we are to have just one way of doing cross-validation at P123.

Me, I would be happy to be able to rebalance daily with our present downloads without downloading data starting at 2000 each time: daily for rank, z-score, and Min/Max. I would pay the API fee for each abbreviated rebalance download, and I would download the more complete data (starting with 24 years for Min/Max).

Ultimately, I would like to have the choice. Maybe not have it left up to someone else.

I DO appreciate what I can do now (see link above). This is just my feedback on the present downloads, which were described as a method of getting feedback from members who are already doing machine learning…

I welcome feedback from Pitmaster, Jonpaul, etc. It would be my preference that they not even have to read my posts if they did not find them helpful (and could do it their own way).


Jim, I’m not sure which methods you are comparing. Rolling 10Y vs. what? In one method you keep the window at 10Y and in the other it grows every year? And which trained model do you use for actual predictions? The latest one (the last roll), or do you average the predictions of all the models generated?


For me personally, 2 years is not enough. 3 years is not enough. 10 years is not even that good.

I, personally, like to train on as much data as I have. And I have evidence about what works best for my model, evidence that I don’t think conflicts even with what Yuval has been doing.

But mostly, if Pitmaster likes 2 years let him do 2 years.

I hope I said that better.


Why use #sector? The paper seems clear that Industry is used, although it’s S&P GICS, which has fewer industries than FactSet (70 vs. 90).

We could of course create a G-Score with a scope parameter so you have total freedom.

FactSet SubSector seems better than Industry for avoiding sparse industries. Here’s the breakdown using the "Easy To Trade USA" universe with 3,764 stocks as of today:

Scope      Count  Min # Stocks  Max # Stocks
Industry      90       1            289
SubSector     32      13            351
Sector        12      53            931

PS You can calculate your own totals with the screener and these two rules:

@cnt:FCount("1", #sector)        // count number of stocks in the scope 
Forder("mktcap",#sector,#desc)=1 // filter the largest stock for the scope
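If you'd rather compute the same breakdown outside the screener, here's a quick pandas sketch (assuming a DataFrame with one row per stock and a column holding the scope assignment):

```python
import pandas as pd

def scope_breakdown(df: pd.DataFrame, scope_col: str) -> dict:
    """Number of groups plus min/max member count per group,
    mirroring the Count / Min / Max columns in the table above."""
    sizes = df.groupby(scope_col).size()
    return {"count": len(sizes),
            "min": int(sizes.min()),
            "max": int(sizes.max())}
```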


I had the G-Score formulas set up before getting the CFA UK paper link. That is why #sector is used.

Personally, I think #sector can be switched to #industry or #subindustry.

For my crypto stock screen, I use the following, which gives the best results:

Frank("PiotFScore", #subindustry, #DESC) > 25 &

The G-Score formulas (G1-G8) use #sector.


I have not yet done any research to claim which training-window approach would work best, but I would rather stick to a longer training window. Below you may find useful research done by Yuval:

You’re more likely to have factors work like they did in the past if you use a 10-year lookback period than a 1-year.

Some interesting avenues of research would be, for example:

  1. Use weighted sampling; two methods are available:
  • undersample less important data (e.g., select 100% of observations from FY2023, 90% from FY2022, etc.)
  • specify the importance of each observation (similarly to the previous point) as a classifier parameter, which puts more emphasis on getting those points right (already implemented in some scikit-learn & Keras classifiers)
  2. Use a dynamic training window: an algorithm would decide how many years of past data to use to predict the next period, based on some smart metrics.
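The first idea (weighting recent years more heavily) is easy to prototype. Here is a sketch using exponential decay; the resulting weights can be passed as `sample_weight` to scikit-learn estimators that accept it. The half-life value is an arbitrary choice for illustration:

```python
import numpy as np

def recency_weights(years: np.ndarray, half_life: float = 3.0) -> np.ndarray:
    """Exponential-decay sample weights: the most recent year gets
    weight 1.0, and the weight halves every `half_life` years."""
    age = years.max() - years
    return 0.5 ** (age / half_life)

# with a supporting estimator:
# clf.fit(X, y, sample_weight=recency_weights(obs_years))
```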


The truth is we all can experiment with these methods and more now. We can.

BUT we can only rebalance ranks once per week with the present downloads. And we cannot rebalance z-scores and Min/Max at all unless (in my case) I download 24 years of data each time.

Not practical for daily rebalance, I think. I run 2 ports each day.

I don’t have insight into what you plan, but you should give Pitmaster, test user, Jonpaul, etc. the option to try different methods and find out through cross-validation what works for them.

I don’t think a discussion on the forum and a vote, feature request etc will work for a lot of machine learners.

As practical as it sounds, I have little interest in exponential weighted sampling, for example, but I absolutely agree that if I become immortal and have unlimited time I should try it.

Someone else may think it is the first thing to try and I VOTE WE LET THEM DO JUST THAT!!! Note again, THEY PROBABLY CAN DO THAT now with Python.

It would be easy to let Pitmaster do that if you let him have a way to rebalance z-scores and/or Min/Max.

I do get that some (including perhaps Pitmaster) can do this with the API with ranks now. But I think he would still need a large download to do this with z-score or Min/Max.

I am sure you are addressing this in a way that will allow Pitmaster to do that, but I thought you said you “might” remember the training for z-score rebalance. That makes me wonder how you could make everyone able to do what they want.

At the end of the day I can stay with what I use now, but I would not mind if your AI/ML worked for a lot of people and attracted new ML members with their own ideas on cross-validation (different from mine or Pitmaster’s, perhaps).

To be clear, my ML method is unique and no one else is going to do it. It is also unique in the sense that I do not need the same downloads as Jonpaul, Pitmaster, etc. Not that I wouldn’t try that route if I could rebalance a method that I trained using a random forest, for example.


I wrote about the F-Score at length here: Why Piotroski's F-Score No Longer Works - Portfolio123 Blog. There are a number of other X-Score thingies out there. I use a modified version of Beneish’s M-Score, which I outlined here: Detecting Financial Fraud: A Close Look at the Beneish M-Score - Portfolio123 Blog as well as Ohlson’s O-Score and Dechow’s F-Score. There’s also Altman’s Z Score. They all have merits and problems. One problem with these scores is that they mainly rely on multilinear regression rather than weighted ranking, and that’s just not as good a method.

Yuval, could you explain this further?

Ranking uses normalized values. Multilinear regression doesn’t, so the M-Score, for example, is terribly susceptible to extreme values and therefore requires you to trim outliers at the 1st and 99th percentiles. Even that often isn’t enough: an extremely high score on one variable will sometimes outweigh all the others.
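A sketch of the trimming step described here (the 1st/99th-percentile cutoffs are the ones mentioned above; ranking needs no such step):

```python
import numpy as np

def winsorize(x: np.ndarray, lower: float = 0.01, upper: float = 0.99) -> np.ndarray:
    """Clamp values outside the 1st and 99th percentiles before feeding
    them to a regression, to limit the pull of extreme observations."""
    lo, hi = np.quantile(x, [lower, upper])
    return np.clip(x, lo, hi)
```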

Dear all,

Based on my understanding, that is not how the Mohanram G-Score and Piotroski F-Score work.

The Piotroski F-Score ranges from 0 to 9 and measures a company’s financials by 9 different attributes (formulas), while the Mohanram G-Score ranges from 0 to 8 and measures 8 different attributes (formulas) of a company’s financials. A score of either 0 or 1 is awarded for each attribute, depending on whether it has been fulfilled.

It is not true that an extremely high value in one criterion will outweigh all the others in the 9-point (Piotroski F-Score) or 8-point (Mohanram G-Score) system.
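A one-line illustration of the point: with binary-sum scoring, each attribute contributes at most one point regardless of its magnitude (a toy sketch, not the actual F/G-Score inputs):

```python
def binary_sum_score(values, thresholds):
    """F/G-Score-style scoring: each attribute is compared to its
    threshold and contributes 0 or 1, so one extreme value cannot
    outweigh the other criteria."""
    return sum(int(v > t) for v, t in zip(values, thresholds))

# A wildly extreme first input still earns only a single point:
# binary_sum_score([1e9, -1, -1], [0, 0, 0]) == 1
```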

Here is the link for the definition of the ratios.



Another reason to be grateful for P123.


That is absolutely correct. But z-score is normalized too (and requires a trim, while rank does not).

Ranking is also nonparametric. Not just normalized.

This is important for regression but not so much for random forests (which are themselves nonparametric and will give the same result for rank or raw data). However, rank is normalized over a different period than z-score and Min/Max, raising the question: which normalization period is better (and is it always better for everything)?
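To make the comparison concrete, here is a sketch of the three normalizations side by side (my own minimal implementations, not P123’s):

```python
import numpy as np

def normalize(x: np.ndarray, method: str) -> np.ndarray:
    """Rank (nonparametric), z-score, and min/max scalings of the
    same raw factor values."""
    if method == "rank":
        # percentile ranks in [0, 1]; ties ignored for brevity
        return x.argsort().argsort() / (len(x) - 1)
    if method == "zscore":
        return (x - x.mean()) / x.std()              # outlier-sensitive
    if method == "minmax":
        return (x - x.min()) / (x.max() - x.min())   # also outlier-sensitive
    raise ValueError(f"unknown method: {method}")
```

On values like [1, 2, 3, 4, 1000], the rank version spreads the stocks evenly while z-score and min/max squash the first four together, which is the rank-vs-raw point in miniature.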

The other thing is, if you think about it, P123 classic NEVER makes a prediction of returns. Almost any other machine learning approach will produce a predicted return, and you will have to sort the predicted returns (which basically ranks them through sorting).

I agree with you: There are some important differences both subtle and huge.