[@marco. Carsten asks a serious question here, and this is actually the simplest complete statistical answer to it, I believe. Because this is a widely used and accepted method, because it is trivial to calculate (requiring few resources), and because it can be output in a table that everyone at P123 could understand, it would not be wrong for a machine learning site to consider it, IMHO. By the way, I can do this on my own and have no opinion as to whether this could help with marketing at P123, so I won't mind if you think this is too advanced for the type of members you want to attract.]
Hi Carsten,
First, I note that P123 will soon be including t-tests in the rank performance test. That being the case, I would seriously consider the Benjamini-Hochberg procedure.
Second, I note from our correspondence that you are an engineer with serious math and programming skills.
Finally, I note that this can actually be done in a spreadsheet, and that P123 will be providing the p-values in the rank performance test, so once understood, it would be trivial for anyone at P123 to do.
TL;DR: You might look at the Benjamini-Hochberg procedure (with Excel spreadsheet method here) to control the False Discovery Rate (FDR).
The only things you would have to consider are 1) the FDR you are happy with and 2) the length of your rolling window. Both of these could be determined by backtests or cross-validation.
Seems wonky because it is. But the output would be a table and could be calculated in a spreadsheet at the end of the day. The table could be understood by all members who look at this seriously.
(Optional) addendum for those interested:
The test you’re referring to sounds like the Benjamini-Hochberg procedure, which is a method for controlling the false discovery rate (FDR) in multiple hypothesis testing. This procedure is particularly useful when conducting a large number of parallel tests, as it helps to manage the rate of false positives (Type I errors) that occur purely by chance due to the multiple comparisons problem.
Overview of the Benjamini-Hochberg Procedure:

Rank the p-values: From all the tests you’ve conducted, rank the p-values from smallest to largest.

Calculate the Cutoff: For each p-value, calculate its Benjamini-Hochberg critical value using the formula (i/m)Q, where:
 (i) is the rank of the p-value,
 (m) is the total number of tests, and
 (Q) is the false discovery rate you’re willing to accept (e.g., 0.05 or 5%).

Find the Largest Significant Rank: Starting from the largest p-value and working down, find the largest (i) where (p_i \leq (i/m)Q). All tests with a p-value smaller than or equal to that (p_i) are considered significant.

Adjust p-values: Alternatively, adjusted p-values can be computed to compare directly against the desired FDR threshold.
This procedure adjusts the significance level for each test to reflect the number of tests being performed, thus controlling the expected proportion of falsely rejected null hypotheses (false discoveries) among all rejections. [emphasis mine]
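The steps above can be sketched in a few lines of plain Python. A minimal sketch, assuming nothing about P123's output format; the p-values below are made up for illustration, not real rank performance results:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the (0-based) indices of tests declared significant at FDR level q."""
    m = len(p_values)
    # Rank the p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank i (1-based) with p_(i) <= (i/m) * q.
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            largest = rank
    # Every test at or below that rank is significant.
    return sorted(order[:largest])

# Ten hypothetical p-values, one per test:
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
significant = benjamini_hochberg(p_values, q=0.05)
print(significant)  # -> [0, 1]
```

Note that 0.039 fails here even though it would pass a naive single-test 0.05 cutoff: its rank-3 critical value is (3/10) x 0.05 = 0.015, which is exactly the multiple-comparisons protection the procedure provides.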
The key advantage of the Benjamini-Hochberg procedure is that it is less conservative than the Bonferroni correction, which simply divides the desired overall alpha level by the number of tests and can therefore greatly reduce power in cases with many tests. The Benjamini-Hochberg procedure allows for a more powerful test while still controlling the rate of false discoveries, making it highly valuable in fields like genomics, where researchers often deal with large datasets requiring multiple simultaneous hypothesis tests.
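To make the power difference concrete, here is a small sketch (made-up p-values, ten tests, overall level 0.05) counting how many tests each method would declare significant:

```python
# Compare Bonferroni's single fixed cutoff to Benjamini-Hochberg's
# rank-dependent cutoffs for m = 10 tests at Q = 0.05.
m, q = 10, 0.05

# Hypothetical p-values, already sorted smallest to largest.
p_sorted = [0.001, 0.008, 0.012, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

bonferroni_cutoff = q / m                             # one cutoff: 0.005 for every test
bh_cutoffs = [(i / m) * q for i in range(1, m + 1)]   # 0.005, 0.010, ..., 0.050

bonferroni_hits = sum(p <= bonferroni_cutoff for p in p_sorted)
# BH: the largest rank i with p_(i) <= (i/m)*Q; everything up to it is significant.
bh_hits = max((i for i in range(1, m + 1) if p_sorted[i - 1] <= bh_cutoffs[i - 1]),
              default=0)

print(bonferroni_hits, bh_hits)  # -> 1 3
```

With these numbers Bonferroni keeps only the 0.001 result, while Benjamini-Hochberg also keeps 0.008 and 0.012, all while holding the expected proportion of false discoveries at 5%.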