Single factor testing tool developed using the API

Hello Whycliffes, I will share my “input” spreadsheet; I have been adding more factors over time.

I wrote a Python script that extracts every factor from any public ranking system, removes the duplicates, and spits out a new XML ranking system. It has several hundred nodes. It's not perfect, though: it simply grabs each formula/factor from the XML, removes all the whitespace, and excludes exact matches. I can share the script and/or the XML ranking system if anyone is interested.

Tony

I’m quite interested!!


I am also very interested

I would like to see that, too. I need to catalog the factors used in my ranking systems.

I would love that. I have been using this time-consuming workaround to filter them out: Extract Text Between Two Characters - PhraseFix.
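For anyone who prefers to do that kind of extraction in Python rather than a web tool, a minimal regex sketch (the delimiters here are just an example):

```python
import re

def between(text, start, end):
    """Return all substrings of `text` found between `start` and `end`."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)

xml = '<Formula>Pr2SalesTTM</Formula><Formula>ROE%TTM</Formula>'
print(between(xml, "<Formula>", "</Formula>"))  # ['Pr2SalesTTM', 'ROE%TTM']
```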

Here ya go.

Copy this Python code to a file called Dups.py (or whatever you choose; the name makes no difference).

Start with a ranking system you like.

Make sure to copy the RS XML from the “raw editor (no ajax)” section of the ranking system screen, or the XML will not be formatted correctly.

Save the RS XML as a file called “in.xml” in the same directory as the Python program “dups.py”.

Many of the public RSes have old deprecated factors that will give an error when you try to paste them back into an RS on the website.

There is a text file called “invalidFactors.txt”. The program will check each factor against the list in that file and remove the bad factors.

If you come across any more bad factors, you can add them to this file to save yourself future grief.

Copy and paste a bunch of new factors from some RS into “in.xml”, without the beginning/ending <RankingSystem RankType="Higher"> </RankingSystem> tags.

You cannot have those tags more than once in an RS.

Run dups.py

It will remove all the duplicate factors and save the output to “out.xml”.

Copy “out.xml” to “in.xml” (or rename the files) so that “in.xml” now contains all your unique factors.

Repeat: copy another RS to the end of “in.xml” (minus the <RankingSystem> tags) and run dups.py again.

After each running of dups.py, replace in.xml with out.xml.

Make sure any XML you copy to in.xml stays within the
<RankingSystem RankType="Higher"> </RankingSystem> tags.

Those should be the first and last tags in every RS XML file and only appear once.
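If the manual copy/merge step gets tedious, it can be scripted. A minimal sketch using the standard library (the filenames match the ones used above; Dups.py itself uses lxml, but stdlib ElementTree is enough for merging):

```python
import xml.etree.ElementTree as ET

def merge_rs(base_file, extra_file, out_file):
    """Append the children of extra_file's <RankingSystem> root to base_file's,
    so the output still contains exactly one <RankingSystem> element."""
    base = ET.parse(base_file)
    extra = ET.parse(extra_file)
    for node in list(extra.getroot()):
        base.getroot().append(node)
    base.write(out_file)
```

For example, `merge_rs("in.xml", "new_rs.xml", "in.xml")` (where new_rs.xml is a hypothetical file holding another RS's XML), then run dups.py as usual.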

When you are happy with your giant library, you can copy your XML file back into a blank RS on the website using the “text editor” button on the RS page.

If you have no idea how XML files are constructed you may want to read up on it. You don’t need to know much about XML to use this.

The best place I know to find lots of public factors is in the website Search box → Search for Systems and Strategies.

The program strips out whitespace and comments only when doing the duplicate comparison.
It writes the original factor unmolested to out.xml.
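That comparison key is essentially computed like this (a sketch mirroring the normalization logic in Dups.py below):

```python
def normalize(factor: str) -> str:
    """Strip all whitespace, then drop any trailing '//' comment.
    Only the comparison key is normalized; the original text is kept."""
    key = "".join(factor.split())   # remove every whitespace character
    return key.split("//", 1)[0]    # strip out comments

print(normalize("Pr2SalesTTM < 2  // cheap on sales"))  # Pr2SalesTTM<2
```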

I have not yet converted Dan’s list of factors in his Excel file to an RS.

If someone has converted it to an RS, please send it to me.

If you think of any interesting additions you would like to see, please let me know.

If enough people show interest I may add requested features.

# Dups.py

# Will delete duplicate factors using different criteria depending on the factor type
# Duplicate Composites and Conditionals with the same name will be deleted
# For all other factors and formulas, the actual factor/formula text is compared, regardless of whether the names match

import lxml.etree as ET
import pprint

source_XML      = 'in.xml'
destination_XML = 'out.xml'

tree = ET.parse(source_XML)
root = tree.getroot()

with open('invalidFactors.txt') as f:
    BadFactors = f.read().splitlines()

for elem in list(tree.iter()):
    if elem.tag in ("StockFormula","StockFactor"):
        for e in elem:
            if e.text in BadFactors:
                parent = elem.getparent()
                print("Deleting Bad Factor: ", elem.tag, e.text)
                parent.remove(elem)




def find_in_list_of_list(char):
    for sub_list in factorList:
        if char in sub_list:
            return (factorList.index(sub_list), sub_list.index(char))

def childTextsList(node):
    global factorList
    texts= list()
    sep = '//'
    new = False

    if any (factor in node.tag for factor in ["Composite","Conditional"]):
        s = node.attrib['Name']
        factName = "".join(s.split())

        if sep in factName:
            factName = factName.split(sep, 1)[0]  # STRIP OUT COMMENTS

        found = find_in_list_of_list(factName)
        if not found:
            new = [node.tag, factName, 1]
            factorList.append(new)
        else:
            factorList[found[0]][2] += 1
            new = False

    elif any(factor in node.tag for factor in ["StockFactor","IndFactor","StockFormula","IndFormula","SecFormula"]):
        for subchild in list(node):
            s = subchild.text
            # Strip out white spaces
            factName = "".join(s.split())
            if sep in factName:
                factName = factName.split(sep, 1)[0]  # STRIP OUT COMMENTS

            found = find_in_list_of_list(factName)
            if not found:
                new = [node.tag, factName, 1]
                factorList.append(new)
            else:
                factorList[found[0]][2] += 1
                new = False
    return new

nodes = root.xpath('//RankingSystem/*')

StartCount = len(root.xpath("//RankingSystem/*"))
print("*****************************************") 
print("Beginning Total Count: ", StartCount)
print("*****************************************") 
totalDeleted = 0

factorList = list()

for child in nodes:
    newFactor = childTextsList(child)
    if not newFactor:
        child.getparent().remove(child)
        totalDeleted += 1


pprint.pprint(factorList)
print("******************")
print("Duplicates deleted:", totalDeleted)
EndCount =  len(root.xpath("//RankingSystem/*"))

print("******************")
print("Start Count:\t",StartCount)
print("End Count:\t", EndCount)
print(EndCount - StartCount,"factors difference.")



tree.write(destination_XML)

invalidFactors.txt

BV5YCGr%
Sales3YCGr%
PEG
Prc2SalesIncDebt
InsOwnerSh%
EarnYieldOld
ShsOutAvgTTM
Beta
CF5YCGr%
NI%ChgPQ
NI5YCGr%
Sales5YCGr%
SGRP
SSGA
SOPI
RTLR
PEGInclXor
LTGrthRtLow

Does the API not return each credit transaction with the number of credits used and remaining?

Whenever I update APIuniverse or APIrankingsystem, I get an XML string returned with that info.
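If you want to read the credit info out of that string programmatically, the standard library suffices. Note the attribute names below are hypothetical placeholders, not the API's actual field names; check the real response first:

```python
import xml.etree.ElementTree as ET

# Hypothetical example of the kind of status XML the API might return.
response = '<status quotaUsed="12" quotaRemaining="488"/>'
root = ET.fromstring(response)
print(root.get("quotaUsed"), root.get("quotaRemaining"))  # 12 488
```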

Tony

Hi Whycliffes - I replied to the same question in the chat you sent before I saw this post. The tests you were running are still using 2 credits per test (ie each factor tested). I provided more detail from the log files in the chat. If you have questions, it would be better to discuss them in that chat instead of this forum thread.

I use a program on my mac called “EasyTransformData” to do the manipulation back and forth (both ways)

Danp, I really appreciate what you did here. I'm having some trouble, so I'm posting in the hope of finding out what I'm doing wrong. The code seems to work until it gets to factor 85, as seen in the first screenshot. I added the second screenshot in case it shows a useful error message.


The Colab script has been fixed. I also replaced some of the factors in the factor list that were disabled back in 2021. Since the list was created back in 2021, there are probably other factors in it that no longer work, but the script should not fail if any bad formulas are encountered.


For someone like me this is amazing, thank you. Just so I'm clear: as of now there are no other tools that extend factor/ranking analysis under different assumptions, at least not as directly as this, correct? And I would assume the ML implementation is not directly related? Previously, I was spending a lot of time changing one thing at a time and re-running.

This is currently the only tool that automates running rank performance tests on a large number of different factors. We are discussing the possibility of creating a new tool for this that would not require the user to deal with Colab since that is confusing for some users.

This single factor testing tool is not directly related to the AI project but it could be useful to create a list of factors to use as features in an AI model.

We also have the Optimizer which lets you define a set of tests where you vary the weights assigned to each factor in the ranking system and then run all the iterations and return the results.


I converted the Colab code to PyCharm (Python) so this can be run locally. I hardcoded everything (I did not have much time to make changes), but the script works. Maybe a GitHub project would be nice to keep track of the changes.

import p123api
import pandas as pd
import time
import sys
import datetime

def WriteToResultsFile(df_results):
    # Get current datetime to use in file name to make it unique.
    ts = datetime.datetime.now()
    strFileName = 'SingleFactorResults ' + ts.strftime("%m-%d_%H-%M")

    # Include the directory path
    directory_path = '/Users/sraby/Investing/Output/'
    full_file_path = directory_path + strFileName + '.xlsx'

    # Write the settings dataframe and results dataframe to different sheets in the same Excel file
#    with pd.ExcelWriter(strFileName + '.xlsx') as writer:
    with pd.ExcelWriter(full_file_path) as writer:
#        df_settings.to_excel(writer, sheet_name='Settings')
        df_results.to_excel(writer, sheet_name='Results')

    print("Results file created!")


try:
    ##############################
    # HARD CODED INPUT PARAMETERS
    ##############################
    apiId = "xxx"
    apiKey = "xxxxxxxxxxxxxxxxxxxxxxx"
    RankingSystem = "ApiRankingSystem"
    RankingMethod = "2"
    Bench = "SPY"
    Universe ="CAD: Unicorn Universe"
    StartDate = "2014-02-12"
    EndDate = "2024-02-12"
    Freq = "Every Week"
    NumBuckets = 20
    MinPrice = "1"
    PitMethod = "Prelim"
    TransType = "Long"
    Slippage = "0.25"

    #df_settings = pd.read_excel(
    #           '/Users/sraby/Investing/Input/FactorsList_RUN.xlsx',
    #            header=0,
    #            usecols='A:L',
    #            sheet_name='Settings',
    #            engine='openpyxl')
    #   print(df_settings)
    #   df_settings.columns = df_settings.iloc[0]
    #   df_settings = df_settings.iloc[1:]
    #   apiId = str(df_settings.iloc[0]["ID"])
    #    print("API ID:", apiId)
    #    apiKey = str(df_settings.iloc[0]["KEY"])
    #    print("API KEY:", apiKey)
    # RankingSystem = str(df_settings.iloc[0]["RankingSystem"])
    #    RankingSystem = "ApiRankingSystem"  # Hard code for now so users don't accidentally wipe out one of their other ranking systems.
    #    RankingMethod = str(df_settings.iloc[0]["RankingMethod"])
    #    Bench = str(df_settings.iloc[0]["Bench"])
    #    Universe = str(df_settings.iloc[0]["Universe"])
    # Sector = str(df_settings.iloc[0]["Sector"])  #Sector is not working. Have request to Dev to look into it.
    #    StartDate = str(df_settings.iloc[0]["StartDate"])
    #    print("Start Date:", StartDate)
    #    EndDate = str(df_settings.iloc[0]["EndDate"])
    #    print("End Date:", EndDate)
    #    Freq = str(df_settings.iloc[0]["Freq"])
    #    NumBuckets = 20  # str(df_settings.iloc[0]["NumBuckets"])  #Current code below only supports 20 buckets.
    #    MinPrice = str(df_settings.iloc[0]["MinPrice"])
    #    PitMethod = str(df_settings.iloc[0]["PitMethod"])
    #    TransType = str(df_settings.iloc[0]["TransType"])
   #    Slippage = str(df_settings.iloc[0]["Slippage"])

    # Read the entire Excel sheet of factor inputs into a dataframe.
    df_factors = pd.read_excel('/Users/sraby/Investing/Input/FactorsList_RUN.xlsx', sheet_name='FactorList')
    testRows = df_factors.shape[0]  # Get the count from the dataframe
    print("Shape of df_factors:", df_factors.shape)

    # Connect
    client = p123api.Client(api_id=apiId, api_key=apiKey)

    # Create the dataframe and column names for the df that will hold the results.
    df_results = pd.DataFrame(
        columns=['FactNum', 'Category', 'Formula', 'VsInd', 'HighLow', 'Description', 'Bench', '1', '2', '3', '4', '5',
                 '6',
                 '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20'])

    # Start the main loop which calls to API to run the perf tests and capture the results for each factor in the input file.
    for FactorCount in range(0, testRows):
        # Read the next factor/function from the xls file with formula list
        factNum = df_factors.loc[FactorCount, "FactNum"]
        category = df_factors.loc[FactorCount, "Category"]
        formula = df_factors.loc[FactorCount, "Factor/Formula"]
        vsInd = df_factors.loc[FactorCount, "VsIndustry"]
        highLow = df_factors.loc[FactorCount, "HigherOrLower"]
        description = df_factors.loc[FactorCount, "Description"]

        print(str(FactorCount + 1) + " of " + str(testRows) + ". Current factor is " + formula)

        # Future enhancement: I could code this to accept up to a certain number of additional factors (and weight, vsInd, etc) from the spreadsheet. Worthwhile?
        # If do it, create a tab for the additional factors in the input file. Those factors would be added to the ranking system in every run.
        # For example, if we wanted to see what factors complement the FCFGr%TTM and FCFYield factors, I would add the text below:
        #        baseFactors = "<StockFactor Weight=\"0%\" RankType=\"Higher\" Scope=\"Universe\"> \
        #                    <Factor>FCFGr%TTM</Factor> \
        #                     </StockFactor> \
        #                     <StockFactor Weight=\"0%\" RankType=\"Higher\" Scope=\"Universe\"> \
        #                    <Factor>FCFYield</Factor> \
        #                     </StockFactor>"
        baseFactors = ""  # For now, just use this script to test single factors.

        # Change the ApiRankingSystem ranking system to use that factor/function
        strNodes = "<RankingSystem RankType=\"Higher\">" \
                   "<StockFormula Weight=\"0%\" RankType=\"" + highLow + "\" Name=\"TestFactor\" Description=\"\" Scope=\"" + vsInd + "\">" \
                   "<Formula>" + formula + "</Formula>" \
                   "</StockFormula>" \
                   + baseFactors + \
                   "</RankingSystem>"
        data = {
            "type": "stock",
            "rankingMethod": RankingMethod,
            "nodes": strNodes
        }
        dict = client.rank_update(data)

        data = {
            "rankingSystem": RankingSystem, "rankingMethod": RankingMethod, "startDt": StartDate, "endDt": EndDate,
            "rebalFreq": Freq,
            "benchmark": Bench, "universe": Universe, "numBuckets": NumBuckets, "minPrice": MinPrice,
            # bug. Sector not working so removed it "sector": Sector,
            "slippage": Slippage, "pitMethod": PitMethod, "outputType": "ann", "transType": TransType
        }

        try:
            # API call to run the RankPerf
            dict = client.rank_perf(data)
            # Colab was not waiting for rank_perf to complete, which caused the same results to be
            # saved for multiple factors. Added a sleep for now until a proper fix is found.
            time.sleep(3)

            # Append the results from the API call to the dataframe containing the results.
            # Probably a better way to do this? Can only use 20 buckets unless make changes to this code.
            #            df_results = df_results.append({'FactNum': factNum,'Category': category,'Formula': formula,'VsInd': vsInd,'HighLow': highLow,'Description': description,'Bench': dict["benchmarkAnnRet"],
            #                                            '1': dict["bucketAnnRet"][0],'2': dict["bucketAnnRet"][1],'3': dict["bucketAnnRet"][2],'4': dict["bucketAnnRet"][3],'5': dict["bucketAnnRet"][4],'6': dict["bucketAnnRet"][5],
            #                                            '7': dict["bucketAnnRet"][6],'8': dict["bucketAnnRet"][7],'9': dict["bucketAnnRet"][8],'10': dict["bucketAnnRet"][9],'11': dict["bucketAnnRet"][10],'12': dict["bucketAnnRet"][11],
            #                                            '13': dict["bucketAnnRet"][12],'14': dict["bucketAnnRet"][13],'15': dict["bucketAnnRet"][14],'16': dict["bucketAnnRet"][15],'17': dict["bucketAnnRet"][16],'18': dict["bucketAnnRet"][17],
            #                                            '19': dict["bucketAnnRet"][18],'20': dict["bucketAnnRet"][19]
            #                                            }, ignore_index=True)
            new_row = pd.DataFrame([{'FactNum': factNum, 'Category': category, 'Formula': formula, 'VsInd': vsInd,
                                     'HighLow': highLow, 'Description': description, 'Bench': dict["benchmarkAnnRet"],
                                     '1': dict["bucketAnnRet"][0], '2': dict["bucketAnnRet"][1],
                                     '3': dict["bucketAnnRet"][2], '4': dict["bucketAnnRet"][3],
                                     '5': dict["bucketAnnRet"][4], '6': dict["bucketAnnRet"][5],
                                     '7': dict["bucketAnnRet"][6], '8': dict["bucketAnnRet"][7],
                                     '9': dict["bucketAnnRet"][8], '10': dict["bucketAnnRet"][9],
                                     '11': dict["bucketAnnRet"][10], '12': dict["bucketAnnRet"][11],
                                     '13': dict["bucketAnnRet"][12], '14': dict["bucketAnnRet"][13],
                                     '15': dict["bucketAnnRet"][14], '16': dict["bucketAnnRet"][15],
                                     '17': dict["bucketAnnRet"][16], '18': dict["bucketAnnRet"][17],
                                     '19': dict["bucketAnnRet"][18], '20': dict["bucketAnnRet"][19]
                                     }])
            df_results = pd.concat([df_results, new_row], ignore_index=True)

            FactorCount += 1
            # Hopefully the code below is temporary. There is a random issue where authentication fails at the
            # end when writing results to the spreadsheet. The code below saves the results to a file after every
            # 100 factors, so no more than 100 API calls are wasted if that auth error comes up.
            # Only the final file is needed, so the rest can be deleted by the user.
            if FactorCount % 100 == 0:
                WriteToResultsFile(df_results)


        # Need to handle case where the formula being ranked has errors.
        except p123api.ClientException as e:
            print(e)
            s = str(e)
            if 'Invalid command' in s or 'Ranking failed' in s:
                failed_row = pd.DataFrame([
                    {'FactNum': factNum, 'Category': category, 'Formula': formula, 'VsInd': vsInd, 'HighLow': highLow,
                     'Description': description, 'Bench': "NA",
                     '1': "Failed - Bad formula", '2': "NA", '3': "NA", '4': "NA", '5': "NA", '6': "NA", '7': "NA",
                     '8': "NA", '9': "NA", '10': "NA", '11': "NA", '12': "NA", '13': "NA", '14': "NA", '15': "NA",
                     '16': "NA", '17': "NA", '18': "NA", '19': "NA", '20': "NA"}])
                # DataFrame.append was removed in pandas 2.x; use pd.concat as in the success path above.
                df_results = pd.concat([df_results, failed_row], ignore_index=True)
                FactorCount += 1
                # Let it continue with other factors from the input file
            else:
                print('Got some error other than a bad formula and it is not handled, '
                      'so dumping results to the results file and quitting.')
                WriteToResultsFile(df_results)
                sys.exit(1)  # not tested because no errors came up to trigger this path
            continue

    WriteToResultsFile(df_results)

except p123api.ClientException as e:
    print(e)
    print('Got unhandled error so dumping results to results file so they are not lost')
    # noinspection PyUnboundLocalVariable
    WriteToResultsFile(df_results)

Thank you both. I have a 2024 goal of learning some programming so I can be more capable/useful with this sort of thing.

Considering the valuable insights provided by single factor testing and statistics for both beginner and advanced users, it would be awesome to build a dedicated section within the P123 platform.

This section would allow users to easily access and analyze statistics for each factor, refreshed with the weekend data update. Additionally, centralizing this information could reduce the load on the P123 system if multiple users are executing the same single-factor queries. A win-win.

P123 team, could this be something you can consider adding to your shorter-term roadmap?


Thanks @sraby, this type of research tool is something I use very often.

However, for ML you may want to calculate these statistics every period before training. It needs to be fast enough to deal with thousands of factors and millions of rows.

Do you have any estimate of how many milliseconds are needed per factor?

@pitmaster, you will not be able to use that tool for machine learning. It is only good for testing each factor (i.e., its ranking) sequentially. For ML, it is still best to download the data using the new P123 AI download and then run your own Python ML code.


This is correct.
For users who want to perform monotonicity testing for ML, I would recommend implementing this tool in polars rather than pandas. My implementation, which calculates stats similar to P123's, uses 240 ms per factor, which is 8x faster than pandas. I have not yet tested a numpy implementation, which should be even faster.

Edit: I made a mistake in the calculations. It actually uses 62 ms (milliseconds) per factor, or in other words calculates stats for 16 factors per second (my data has 555,000 rows).
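For comparison, here is a numpy-only sketch of the bucket-return calculation underlying such monotonicity stats (the bucket count and the synthetic data are assumptions for illustration, not P123's exact method):

```python
import numpy as np

def bucket_returns(factor, fwd_ret, num_buckets=20):
    """Rank stocks by factor value, split into (near-)equal-size buckets,
    and return each bucket's mean forward return."""
    order = np.argsort(factor)                    # ascending factor rank
    buckets = np.array_split(order, num_buckets)  # bucket 1 = lowest-ranked stocks
    return np.array([fwd_ret[idx].mean() for idx in buckets])

rng = np.random.default_rng(0)
factor = rng.normal(size=100_000)
fwd_ret = 0.01 * factor + rng.normal(scale=0.05, size=100_000)  # synthetic monotone signal
means = bucket_returns(factor, fwd_ret)
assert means[-1] > means[0]  # a monotone signal produces rising bucket returns
```

Timing this per factor (e.g. with `timeit`) gives a baseline to compare against the pandas and polars versions.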
