How to mechanically assign a score to a ranking system

Jamie, without meaning to be too argumentative, that’s not what statistical significance means at all. It has nothing to do with smoothly increasing functions. What you have done is to postulate a linear model:
(Bucket Height) = a + b*(Bucket Number) + noise.

There is no reason to make that assumption, nor is there justification.

I do agree with these general notions you and others have floated:
(1) There should be more flexible filtering of the universe for the Performance calculation.
(2) Overall average performance of a bucket is inadequate
(2a) Need to view over time
(2b) Need to drill down and see what’s in the bucket.
(3) No single measure of performance will suffice for all users, for all time, or for all purposes.

Cheers,
J

Steve,

I think you may be onto something, but I’m still not getting it.

If, as you suggested, P123 ranks by the annualised return instead of by a ranking logic, isn't that basically the same as creating a ranking system as it currently functions, with that whole ranking system based on the returns for certain time periods? Examples: create many single-factor ranking systems based on these factors:

Pr52W%Chg
Pr26W%Chg
Pr13W%Chg
etc.

Once again, though, we are both looking for some metric that is not skewed by outliers (i.e., fat tails). In statistical layman's terms, and by example: if all of the returns for the bucket for the 2001 to 2006 period are predominantly from one year, 2003, then there is not a lot of predictive ability, is there?

In trading terms: "the ranking system made me 382% total over the past 6 years, or 30% annualised return." Next, the killjoy asks: "OK, take away the highest annual return for those six years; now what is the annualised return?" Answer: "Well, 2003 had a nice return of 168%, so I guess I'll throw that out just to play along." So, what's your annualised return now? Answer: 382 - 168 = 214% total, or roughly 21% annualised; not 30%.

Basically, it's just throwing out the fat tails; trimming the tails, as the factor feature area of P123 shows. So, why not take the normal distribution of the buckets through time and 'trim the tails'? Steve, I think that may be what you are looking for?
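The killjoy's check can be sketched mechanically. This is a hypothetical illustration (the yearly numbers are made up, and it compounds returns rather than using the back-of-the-envelope subtraction above):

```python
# "Throw out the best year" check: recompute total return with and
# without the single best year. Yearly returns below are illustrative.
annual_pct = [15.0, 168.0, 25.0, 30.0, 20.0, 22.0]  # hypothetical six years

def total_return(returns_pct):
    """Compounded total %gain over the whole period."""
    growth = 1.0
    for r in returns_pct:
        growth *= 1.0 + r / 100.0
    return (growth - 1.0) * 100.0

full = total_return(annual_pct)
trimmed = total_return(sorted(annual_pct)[:-1])  # drop the best year
# If 'trimmed' is far below 'full', one fat year is doing the work.
```

A large gap between the two numbers is exactly the "one year carried the whole period" warning sign described above.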

Jamie states it well: [quote]
if all of the returns for the bucket for the 2001 to 2006 period are predominantly from one year, 2003, then there is not a lot of predictive ability is there?
[/quote]
One is reminded of the statistician who drowned in a river that was, on average, only one foot deep.

Let’s just agree to disagree.

The bigger issue, as you have rightly pointed out, is that the user should be able to ‘roll their own’: whatever floats their boat.

Sometimes, I will also want to look for higher last few buckets, or lower last few buckets (and I do). Sometimes I will want a calculation similar to what Marco has proposed, and what I have proposed with the silly-named CBDT. There’s a happy median (there we go with statistics again) in between.

Now, we could add in many other statistics, and make it really complicated too, or we could just give everyone what they want when they want it: high buckets today, pseudo-statistical significance tomorrow. I, too, would like to search the ranking systems in various ways: slope, correlation, high last bucket, high delta between first and last bucket; anything!

That said, some of the many statistics that could also be included are:

OLS regression
Coefficients of determination
Mean
StdDev
‘T’-Stat
Slope
R-Squared
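For illustration, two of the listed statistics (slope and R-squared) can be computed from the bucket averages with a plain least-squares fit. A minimal Python sketch with made-up data, not a description of how P123 would implement it:

```python
# OLS fit of bucket average returns against bucket number (0, 1, 2, ...),
# returning the slope and the coefficient of determination (R^2).
def ols_line(ys):
    """Fit y = a + b*x over x = 0..n-1; return slope b and R^2."""
    n = len(ys)
    xs = list(range(n))
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return b, 1 - ss_res / ss_tot

# A perfectly linear ramp of bucket returns gives slope 2 and R^2 of 1.
slope, r2 = ols_line([2.0, 4.0, 6.0, 8.0])
```

The same helper could feed the proposed search criteria (slope, correlation) once bucket averages are exported.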

Generally, I would certainly suggest to Marco and his team to take a look at any academic paper that analyses factors, and there will be many flavours of laying out the analyses. SSRN.com has gazillions of papers. I may get around to putting a sample bucket factor table together, but I would also be keen to see what you or others come up with. :slight_smile:

Very true, Jerrod.

Or, “Lies, damned lies, and statistics”

It was Keynes who said that wasn’t it? (I’m guessing without Googling).

Jerrod,

I am also a little rusty with statistics, but I am curious about this concept you mentioned, ‘discrimination’. Do you mean determination, or was that something I missed in statistics?

Personally, I think statistics don’t discriminate, people do :wink: And guns don’t kill people…

[quote]
I am curious about this concept you mentioned, ‘discrimination’. Do you mean determination, or was that something I missed in statistics?
[/quote] Jamie, here's the difference between regression and discrimination (both of which sound pejorative, don't they?)

Regression: Fitting a model to the data so that one can “best” predict what the value of an observation would be if all of the independent variables were known. Generally this means least-squares, but not necessarily.

Discrimination: Fitting a model so that one can best predict which classification an observation would be in, if all of the independent variables were known. Usually limited to two classes, I think.

Example: Let’s say that I want to determine whether some stock is more likely to have an annual gain above or below (say) 30%. That’s discrimination. The output of a discriminant function is the probability that, given all of the independent variables, the dependent value will be above or below the specified threshold value.

Please don’t hold me to all of this. It’s been decades.

Jamie -

I will try to give a very simple example first using the P123 methodology then with the change that I suggested.

The user runs a ranking system performance test with 20 buckets at one-week frequency over a two-week period.

There are three stocks: STKA, STKB, STKC in a larger universe of stocks. The ranks of the three stocks as per some undefined ranking system are 97, 98, 99 respectively at the start of the first week. Their %gain over the next week is -20%, 2%, 40% respectively.

So for the first one-week time period, the %gain using P123 methodology for bucket 95-100 would be (-20+2+40)/3, assuming that only STKA, STKB and STKC fall into the right-most bucket. The average 1-week gain would be 7.333% for the bucket.

Then you go on to the next week, and three other stocks, STKX, STKY and STKZ, fall into the upper bucket with gains of +3%, -2%, +1%. The average for the second week's 95-100 bucket would be (3 - 2 + 1)/3 = 0.666%.

The overall %gain for the bucket would be the average of the two weeks. (7.333 + 0.666)/2 = 4% gain. Annualized %gain would be ??? (not important for this example)

Now let’s repeat this process but ranking the %gains.

If the fundamental ranking system were ideally predictive then all stocks in the upper bucket would have perfect %gain, and their %gain ranks would be 100, 100, 100. But obviously things are not going to be ideal.

The first week STKA, STKB and STKC have %gain of -20%, 2% and 40% gain. Now all of the stocks in the universe are ranked according to their percent gain. Let’s say for example that STKA has a ranking of 5 (-20% gain), STKB a ranking of 51 (2% gain) and STKC has a ranking of 95 (40% gain). Therefore in the 95-100 bucket would be deposited 5, 51 and 95. The average would be (5 + 51 + 95)/3 = 50.333

The second week STKX, STKY and STKZ have %gain of +3%, -2%, +1%. It happens that it is a very quiet week and the range of %gains for the entire stock universe is quite small. For the second week the %gains are ranked for the stock universe and STKX, STKY, STKZ come out as 75, 15, 60, for example. These numbers are dropped into the 95-100 bucket and the average is (75 + 15 + 60)/3 = 50.

The overall output for the 95-100 bucket would be the average of the two weeks. (50.333 + 50)/2 = 50.167. This number has no units. You can’t extract an annual performance gain from this number. All it will tell you is the relative performance of the 95-100 bucket versus all of the other buckets. All of the performance figures across all buckets should in theory average out to 50. So ideally the left most buckets should be below 50 and the right most buckets above 50 if the ranking system is good.

With this method each time period is equally weighted and therefore all time periods contribute equally.
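The two computations in the example can be sketched side by side. A minimal Python sketch, using the illustrative numbers from above (and assuming the second week's ranks average 50, as stated):

```python
# Week 1 / week 2 %gains for the stocks landing in the 95-100 bucket.
week_gains = [[-20.0, 2.0, 40.0], [3.0, -2.0, 1.0]]

def bucket_score(weekly_values):
    """Average each week, then average the weekly averages,
    so every time period contributes equally."""
    weekly_means = [sum(w) / len(w) for w in weekly_values]
    return sum(weekly_means) / len(weekly_means)

# Current P123 methodology: average the raw %gains.
gain_score = bucket_score(week_gains)   # 4.0 (% gain)

# Suggested alternative: rank each stock's %gain within the whole
# universe first, then average the ranks falling in the bucket.
week_ranks = [[5, 51, 95], [75, 15, 60]]  # illustrative universe ranks
rank_score = bucket_score(week_ranks)   # ~50.167 (unitless)
```

The rank-based score is unitless and centers on 50 by construction, which is exactly why no annualized gain can be recovered from it.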

The way that you suggested, creating a normal distribution and throwing out outliers, would also be a very viable solution. There are pros and cons for each implementation. I think that the normal distribution could break down with a smaller number of time periods - 1 year frequency for example. There could be issues with the data thrown out for different buckets being inconsistent (i.e. not the same time periods thrown out). The implementation may be more complex. I'm not sure.

Steve

I agree that what really matters about a ranking system is how the top 5% of stocks perform, in other words the top 10 buckets of a 200 bucket performance test.

So if automatic metrics are added to p123, please give the user the ability to specify whether the metrics are calculated on all 200 buckets, the top 50 buckets, the top 10 (my preference), or the top 5 buckets.

Thank you.

Marco,

Here are the things that would help me most when evaluating performance ranking. Point #1 deals with numeric metrics. Point #2 deals with the presentation visually of the metric(s).

Point #1. Numerical metrics
I would suggest providing 3 metrics rather than trying to find a formula to combine everything into 1 metric. No two users are looking for exactly the same thing, so there will be no agreement if one tries to find a formula that gives a single metric.

I see 3 types of users and thus 3 specific metrics.

  • type a - aggressive - This person wants highest return even if the ride is rough - so give a metric that measures the best gain - p123 already provides this with the column chart of buckets.

  • type b - consistency - This person wants to make some gains every year even if they underperform in a spectacular year like 2003. Such a person would rather get 20%-25% each and every year than 100% for three years and zero for a couple of years. Yes, they will have less than type a in the end, but the ride is smoother.

  • type c - loss averse - for this person drawdowns are cripplingly painful emotionally. So they would like a ranking system that gives very small drawdowns.

Point #2. Visual Presentation of the Metrics
I am a visually oriented person. If the above metrics are just provided as numbers I will be typing them into Excel to display a chart. It would be great if p123 could display the above 3 metrics visually. For example:

  • metric a - (maximum profit) - the current column display does this well.

  • metric b - (consistency) - Enhance the current column bucket display by having each year’s returns show in a different color - i.e., stacked bars for each bucket with different colors for each year. The only challenge would be how to display losing years. Perhaps each year’s returns could be given its own (thin) column with no space between the columns for a given bucket, but a space would separate the column cluster of one bucket from that of the next bucket.

  • metric c - (drawdown) - Several ways to display this. Maximum Drawdown could be a thin red column immediately to the right of the main column for each bucket, or a thin red line in the middle of each bucket. Also, in theory the current Historical chart display shows drawdowns, but it is virtually unreadable for more than 10 buckets. I like to use 200 buckets and really only care about the top 5. Right now if I have 200 buckets it is impossible to see how these behave given the clutter of the other 195. Perhaps the first 195 buckets could be displayed in gray as background and then the top 5 in color using thick lines, something so they would really stand out.

Oh, would it be possible for the user to specify how many buckets p123 would display? For example, I normally test with 200 buckets, but I only look at the top 10, so it would be nice to have just the top 10 (of the 200) displayed. This would allow for the clustering of yearly columns suggested for metric b above.

Thank you.

I don’t think we need a new metric to tell which cap size is making a ranking system work. P123 already provides the tools to do that. Just run the ranking system twice: once with R1000 and once with R2000.

I pulled out an Econometrics textbook just for kicks. It has been decades for me too. I must have missed the class on discrimination or maybe I was snoozing, or maybe I was at the pub or maybe I was at the pub snoozing.

In any case, the example you gave sounds to me like a CDF (Cumulative Distribution Function): Pr(X <= x); or, the probability of the event that X is less than or equal to some value x. Visually, take a normal distribution of whatever, chop it at x, and add up all of the constituent probabilities of those events, all based on the sample population, of course.

There definitely will not be a test as the Prof is at the pub too.

Bonus question: what is the probability that the Prof will consume 5 or greater beers based on the data sample population of 52 Thursday nights?
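The bonus question can be answered mechanically with an empirical CDF. A minimal sketch with entirely made-up data (ten hypothetical Thursday nights rather than the full 52):

```python
# Empirical CDF: Pr(X <= x) estimated as the fraction of observations
# at or below x in the sample.
def empirical_cdf(sample, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

# Hypothetical data: beers consumed on 10 Thursday nights.
beers = [2, 3, 5, 4, 6, 2, 7, 5, 3, 4]

# Pr(Prof consumes 5 or more beers) = 1 - Pr(X <= 4).
p_five_or_more = 1 - empirical_cdf(beers, 4)  # 0.4 for this sample
```

With the real 52-night sample, the same one-liner would answer the Prof question directly.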

On another note, there are some worthy statistical measures to be gleaned for this Ranking System Metric, from any Econometric textbook.

For instance, Skewness and Kurtosis. Here is a riveting part of the Econometrics book I dusted off, which has applicability to the prior discussion about fat tails (lone high buckets, and annualised returns congregated in one time period; both fall under the fat-tails discussion): “Tests for Skewness and Kurtosis: One common application of conditional moment tests is checking the residuals from an econometric model for skewness and excess kurtosis. By ‘excess’ kurtosis, we mean a fourth moment greater than 3SD4 (3 times the standard deviation to the 4th power), the value for the normal distribution…The presence of significant departures from normality may indicate that a model is misspecified, or it may indicate that we should use a different estimation method. For example, although least squares may still perform well in the presence of moderate skewness and excess kurtosis, it cannot be expected to do so when the error terms are extremely skewed or have very thick tails. Both skewness and excess kurtosis are often encountered in returns data from financial markets, especially when the returns are measured over short periods of time. A good model should eliminate, or at least substantially reduce, the skewness and excess kurtosis that is generally evident in daily, weekly, and, to a lesser extent, monthly returns data. Thus one way to evaluate a model for financial returns, such as the ARCH models…is to test the residuals for skewness and excess kurtosis” Source: Econometric Theory and Methods, Davidson & MacKinnon.
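As a concrete companion to the quoted passage, sample skewness and excess kurtosis can be computed directly from the moment definitions. A minimal Python sketch (illustrative only, with no claim about how P123 or any textbook package would implement it):

```python
# Sample skewness and excess kurtosis from the moment definitions.
# Excess kurtosis compares the fourth central moment against 3*sigma^4,
# its value for the normal distribution, as in the quoted passage.
def moments(sample):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / n
    sd = var ** 0.5
    skew = sum((v - mean) ** 3 for v in sample) / n / sd ** 3
    excess_kurt = sum((v - mean) ** 4 for v in sample) / n / sd ** 4 - 3
    return skew, excess_kurt

# A symmetric sample has skewness of 0; a flat one has negative
# excess kurtosis (thinner tails than the normal).
skew, kurt = moments([-2.0, -1.0, 0.0, 1.0, 2.0])
```

Applied to each bucket's per-period returns, these two numbers would flag exactly the "all the gain came from one fat year" situation discussed earlier.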

In any case, I think an array of stats metrics would be a very good addition to the Ranking System performance section.

One thing I think we all agree on is that it would be great to be able to search ranking systems for criteria of our choice, many of which have been discussed.

It would also be nice to query the Ranking System db to find the ‘best’ N ranking systems (based on our chosen criteria) for the past Y days, weeks, months, and then display those ranking systems in order (similar to the presentation of various simulations in the simulation section; rank by your criteria; click and change the sort method), and with the various criteria we are interested in. Now, that, in my opinion, would ‘rock’. It is available through some services, and it would be a wonderful addition to P123.

Point well taken. That is correct.

I simply thought that, since market cap is such a crucial factor in all ranking systems, maybe it could be incorporated into all of them. For instance, as I explained in the example, rather than visually checking a ranking system against a few different market caps, or against indices that target various caps as you suggested, automatically include this in the ranking bucket. It kills a few birds with one stone.

Personally, I think it is significant to see whether a ranking system is driven by its logic or merely by the small-cap effect, and if the various caps are displayed together it gives a lot more information visually, instantaneously.

Alternatively, expand the concept: maybe by hovering the mouse over a bucket a tooltip would appear that shows a pie chart of what actually makes up the bucket. The pie could be based on market cap, sector weight, industry weight or other ideas.

Combining information is always a good thing in my humble opinion rather than pointing and clicking back and forth all over the place: combine the information in one place.

Just an idea.

I know exactly how you feel. There is a way to exclude all the junk and noise by using p123’s “Boolean” option to simulate exclusion filters when testing ranking systems.

I will start a new thread and call it “FILTERS — How to put a FILTER into a Ranking System for Quick and Precise Ranking Development”

Regards,
o806 (b519b)

Steve,

If I understand your example correctly, I think the same thing could be achieved by displaying the buckets as annualised relative returns to the benchmark (the alpha relative to the benchmark), rather than nominally, as they are currently.

That would give the same result as you gave, but the process arriving at the result is different. Does that make sense? After all, it seems you are proposing this:

  1. rank the stocks by some logic (the current standard process)
  2. take the stocks in each bucket, for each time period, and create another ranking of the buckets, this time by returns. Therefore, it is a second internal ranking each time period, based on returns relative to the universe (the new benchmark, instead of the SP500). Then, for each logic-based bucket, take the average returns-based bucket for its component stocks. So, instead of buckets charted by returns, they will still be compared by returns (internally), but they will not get a return number; they will get an average bucket rank.

It’s basically the same as ranking by alpha, except, in this case, you have suggested that the relative benchmark should be the universe of stocks, and not the SP500. You have also said to forget about any mention of returns; just show the ranking. I hear what you are saying, and it would be good as well, but if I were to prioritise, I would develop the relative alpha returns first, and the purely rank-based (no returns at all) next; but then again, they give the same result; they are just presented differently.

Steve, at first I thought your example was a little circuitous, but I learned a long time ago never to throw away ideas, especially creative and ‘different’ ones, regardless of where or from whom they came (for instance, I even have a good idea once in a while). Your idea is good for a very crucial reason, which has something to do with the small-cap effect I mentioned, and a lot to do with statistical significance.

First off, comparing nominal returns is fairly obviously not the best way to go, so relative returns are better. Secondly, if I am thinking about this correctly from your example, the ranking system and its component stocks would be compared to what else? Its universe, which, if you are looking for some aspect of significance in the ranking system, makes a lot of sense. Sure, a bunch of small caps will outperform the S&P 500 for any time period this century, but how have the small-cap stocks, with the current ranking system, performed versus themselves (their universe, their benchmark)? That is the true measure of a ranking system. I am not saying that I do not want to see the outperformance of the small caps, but if one wants to evaluate a ranking system, relative performance to its universe is essential.

So, it is as simple as this: rather than displaying the nominal, annualised returns, display the relative annualised returns versus a benchmark (or non-annualised if you prefer, but I honestly think that gets meaningless; do you really want to know that the bucket made 0.3% this week? I prefer annualised returns). Essentially, this is the definition of alpha: the relative (excess) return versus a benchmark. Everyone thinks of the S&P 500 as the benchmark, but that is just the most common choice. You are suggesting (I think) displaying the excess returns of the bucket versus various benchmarks. In your specific example, you are basically saying: compare the returns to the current universe; that’s all I care about right now. I don’t care about the nominal returns, and I don’t care about the relative returns to the most common benchmark, the S&P; I want to know the relative returns to the current universe. I modified your approach/process, but I think it arrives at the same result. Once again, the choice of benchmark should be an option for the user; some examples:

the standard nominal annualised returns, not compared to anything (what we have currently)
the relative annualised returns, compared to a benchmark (S&P500, the current universe, etc.)
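The second option amounts to a per-period subtraction before averaging. A minimal sketch; all numbers and names here are illustrative:

```python
# Excess ("relative") return of one bucket versus a chosen benchmark,
# computed period by period, then averaged.
def excess_returns(bucket_by_period, benchmark_by_period):
    """Per-period excess return of a bucket versus the benchmark."""
    return [b - m for b, m in zip(bucket_by_period, benchmark_by_period)]

# Bucket gained 2%, 1%, 3% while its universe gained 1%, 1%, 2%.
alpha = excess_returns([2.0, 1.0, 3.0], [1.0, 1.0, 2.0])
avg_alpha = sum(alpha) / len(alpha)  # average excess return per period
```

Swapping the benchmark series (S&P 500 returns, universe returns, etc.) is the only change needed to support every option in the list above.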

The beauty of your idea is that by being able to pick the current universe as a benchmark, you are automatically building in a better display of statistical significance with which to evaluate a ranking system. As I keep harping on about the small-cap effect: show how this ranking system performs relative to the current universe. I guess, once again, it comes back to stats. Your idea is alpha: comparing the returns to a benchmark (universe). But it’s not the common alpha (based on the SP500); it is based on whatever benchmark is available, and specifically, in your example, the benchmark is the current universe of stocks in the ranking system as it is currently being tested.

My understanding is that this thread is about evaluating a ranking system, and your idea does just that. This idea, combined with the other good ideas mentioned throughout the thread would be great tools with which to evaluate ranking systems. And again, if users want to evaluate a ranking system, by looking at the highest bucket, that should be an available search criteria, in addition to other suggestions such as the one you explained.

To be honest I don’t see that much value in expending so much effort in ranking a ranking system.

Some of the best ports are based on ranking systems that on first blush don’t inspire much excitement.

Even if we are to accept that displaying more metrics and statistics could be useful, that would be the case if and only if the search and display abilities are improved as well. All that additional information is useless if it cannot be easily accessed and interpreted. For example, has anyone figured out yet how to use the Reverse Engineering feature to good effect?

I don’t see offhand how ranking ranking systems will help me more than ranking simulations and portfolios, but the current search abilities for doing even the latter leave much room for improvement.

After considering which of the latest feature improvements have been the best, I’d have to conclude that, perhaps surprisingly, the ones I consider most useful are the user interface improvements. The disable/enable checkbox and the run-simulation-now button are ones that help me all the time.

Sterling -

One of the reasons for ranking a ranking system is for automatic optimization of a ranking system. I for one want to get off the beaten path with the existing ranking systems. I don’t want to be buying and selling the same smallcap stocks as hundreds of other people are.

I think it has been demonstrated that there is lots of potential for new ranking systems like Olikea’s system or my Large Cap robust ranking system. It takes enormous effort to develop a new ranking system and I am quite frustrated with the process. And I am not alone.

Jamie -

You understand my thinking quite well. There are a couple of things I would like to add:

(1) Using my approach would be an option. One could optimize with my technique, then flip to annualized gains and display the performance with the already-optimized system. Nothing lost there.
(2) With annualized gains, as a result of the normal distribution's long tails, one can expect that the rightmost bucket will be much larger than one would expect. I see this often with ranking systems. If you use my approach there should be a more linear relationship between performance and bucket, i.e. you will get a better fit to a line. Better metrics.

I’m not a stats expert and I was only suggesting one possibility for solving what I perceive is a problem that needs addressing. I’m sure there are many other good solutions.

Steve

Steve,

To combine two different trains of thought/discussion (Sterling’s comments regarding the Reverse Engineering tool, and yours and mine regarding relative performance), I think the Reverse Engineering feature may have been designed to approximate what you and I have been discussing (relative performance within a universe, however that is calculated).

On another note: does anyone wish to know further information about a bucket? I have heard it said in the forums that this is not necessary, which truly makes me scratch my head.

Steve,

For instance, let me ask your opinion: would you like to hover your mouse over any bucket and get a little basic information on what that bucket actually shows (other than annualised nominal returns, which is what the bucket obviously shows). For instance, at the most basic crucial level: how many stocks are contained in this bucket?

I concede that the factor tool (beta version, with a formula) will be able to show the distribution of the buckets, and I think that will be a great tool, but why not combine that information (distribution, number of stocks) with the ranking system buckets? It makes common sense to me to be able to get the information in one place; that is, hover the mouse over the bucket and it will tell the user: 500 stocks, or 500 stocks in a pie chart with other information (MktCap, Sector, Price, etc.). Numbers are wonderful, but put them together visually and it changes everything.

Marco,

Any thoughts on this? It seems that knowing the number of stocks in a bucket would be a worthy additional tool (fairly crucial as far as I see it); and after that is accomplished further information could be displayed as pie or other display.

Jamie -

It would certainly be an interesting tool. But my prime interest is in ranking system development. For an Excel add-in, metrics have to be nailed down in order to automate the process of optimization. Visual tools will not help.

Also, I haven’t looked into the reverse engineering feature (I don’t understand it), but I want to continue to make the point that the metrics need to include some consistency-through-time factor. Again, this is for purposes of automating or semi-automating the development of optimized ranking systems.

Steve

I’m very happy to see this thread. I wanted to make an add-in for the rankings and, since it would be difficult to import into Excel and sort 100 graphics, I asked Marco for some numerical indicator.
My post wasn’t clear. I posted it after reading posts proposing one single value rather than a set of indicators.
This is a rephrased version:[quote]
There are dozens of values to measure the performance of a simulation and still many users keep asking for more and more indicators.
For the ranking system I would do the same: I would focus more on defining all the indicators so that anyone can interpret them in their own way, than on one single value, because the users would keep asking for more and more indicators.
Instead of suggesting expressions to get one value like Slope * Delta (H bucket - L bucket) * Correl, let’s suggest a list of values Slope, H bucket, L bucket, Correl, etc.
[/quote]
I think we should forget the buckets.
The buckets are good to create a graphical representation of the ranking performance, but if we look for numerical indicators there is no reason to keep using buckets, they only introduce error.
Or, if you like, I would consider ~3000 one-stock buckets.

This is my 2 cents: I would ask for the ratio No. of Stocks/No. of Negative Delta.
Examples:
11 10 12 13 20 50 = 6 / (1+0+0+0+0+0) = 6
12 10 13 11 20 50 = 6 / (2+0+1+0+0+0) = 2
In the first 6-stock example, only the 1st stock is higher than a stock to its right.
In the second example the 1st stock is higher than 2 stocks to its right, and the 3rd is higher than one.
I don’t care about the amount of the difference, I want the order to be as accurate as possible.
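Stefano's ratio can be sketched by counting "inversions": pairs where a stock outranks one to its right. A minimal Python sketch reproducing the two worked examples:

```python
# Ratio of number of stocks to number of inversions, where an inversion
# is any pair (i, j) with i < j but values[i] > values[j].
def order_ratio(values):
    n = len(values)
    inversions = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if values[i] > values[j]
    )
    # A perfectly ordered sequence has no inversions at all.
    return n / inversions if inversions else float("inf")

first = order_ratio([11, 10, 12, 13, 20, 50])   # 6.0 (one inversion)
second = order_ratio([12, 10, 13, 11, 20, 50])  # 2.0 (three inversions)
```

As the examples show, the metric ignores the size of each misordering and cares only about how often the order is wrong, which is exactly the stated intent.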

Two more cents: since ranking systems are often used with a buy rule Rank > X% (and even when that buy rule doesn’t exist, they are never used to pick stocks with low ranks), I would ask the user for X and look only at the top X% of the ranking, ignoring any inaccuracy in the sorting of lower-ranked stocks.
For example X=30% would look at the accuracy of the order only on the top 30% stocks, ignoring the other 70%.
I think that if the ranking is very good on the upper 30% then:

  • there are very few chances to get good quality stocks in the lower part of the ranking, while
  • there are high chances to have a lot of noise down there, where the stocks have messed up data.

I noticed that some people ask for buy rules/screener in the ranking definition.
Today we ask for the price filter, then another filter, then the screen, then the buy rules, then the sell rules, until we end up with a single ranking-simulation editor.
My idea is to create 2 new add-ins:

  • one for the rankings, to edit the definition, run and import the performance data
  • one for ranking-simulation, to edit both ranking and simulation definition at the same time, run the simulation and import the simulation performance data.

Stefano