Data download the day of rebalance for machine learning

Hi Jim,

I just noticed that in the DataMiner, Future%Chg(5) calculates the return from Friday to Friday.

Using 100*((Close(-6)/Close(-1))-1) works for Monday to Monday returns. However, when asked for future values that haven't yet occurred, it doesn't return NAs. I haven't looked at what it returns instead.

Walter,

Thank you VERY much. This is the code that P123 suggested for me. I felt bad about not being a good enough coder to figure it out for myself.

What you have said means my downloads for the last 2 months have serious look-ahead bias: the ranks I downloaded were from Saturday (or perhaps Monday morning) and the purchases would have been made Monday, yet the prices are from Friday. Any real purchase could not have been made before Monday.

It is so helpful to know that my data has look-ahead bias. The API credits I used were burned for nothing of use…

I think some of this is easy for some people most of the time, but I think we can all get stuck and need some help at times. And perhaps this is an example where it may not be trivial, exactly, even for an advanced coder.

I was told this should be easy and did not even require knowledge of Python.

Sometimes it is just a small feature we need, whether it fits into someone else’s established protocols or not. Features have not been quick to come from P123. And essentially none of the ones that have been implemented have been for statistical metrics or machine learning.

As far as feature requests go, even with your fix in my future downloads, it will be difficult to use that information without the download Jonpaul (and I) are requesting.

All that is needed is a simple matrix (or relatively simple, I should say, seeing how this is going) that I have been requesting for 10 years. In truth, assuming one has access to a powerful enough computer, nothing else is needed to do machine learning. The rest of what is needed can be found at Scikit-Learn.

I think maybe things are changing at P123. I hope so. You being here helps.

Thank you for your help. I am EXTREMELY GRATEFUL!

Jim

We’re in uncharted territory.

I'm working through these issues just like you are. What I'm seeing makes sense once you dig into the downloaded data.

For a given weekend, the Rank operation will return ranks identical to those available Monday morning. I did confirm that.

However, w/r/t prices, since the Monday market open hasn't happened yet (from the point of view of the lookup), the latest pricing date is Friday's. So functions like Future%Chg(5) have to use that. It was just unanticipated. Unfortunately, that function doesn't support an offset parameter; if it did, something like Future%Chg(5,-1) would work, I think.

Another issue I need to check is why looking up a future close that hasn't yet occurred doesn't return NA. I consider this issue important: I wanted to have DataMiner return three targets (1-, 4-, and 13-week future returns), and not returning NAs when the data doesn't exist complicates the issue.

Finally, my findings need to be confirmed.
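For anyone who wants to sanity-check the offset arithmetic outside of P123, here is a minimal pandas sketch (not P123 code; the price series is made up) of what these functions compute:

import pandas as pd

# Hypothetical daily closes for seven trading days (Mon 8/14 through Tue 8/22)
close = pd.Series([100.0, 101.0, 99.5, 102.0, 103.0, 102.5, 104.0],
                  index=pd.bdate_range('2023-08-14', periods=7))

# Friday-to-Friday style, anchored on the as-of bar, like Future%Chg(5)
fri_style = 100 * (close.shift(-5) / close - 1)

# Monday-to-Monday style, like 100*((Close(-6)/Close(-1))-1)
mon_style = 100 * (close.shift(-6) / close.shift(-1) - 1)

# shift(-n) yields NaN wherever the future bar does not exist yet, which is
# the behavior one would want from the P123 functions
print(mon_style)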


Which P123 could help you with, and which it will do if and when it gets serious about machine learning.

And it is not as if people who do things they do not consider to be machine learning could not use this too.

Reminder: at the end of the day it is just an array. Features, target, an index, and preferably no look-ahead bias, with updated data at the time of rebalance.
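To make the shape of that array concrete, here is a minimal sketch of how it would feed sklearn. The file and column names are hypothetical; the point is only that a features/target/index matrix is all a model needs:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('data.csv', parse_dates=['date'])  # hypothetical download

# Drop rows whose target has not occurred yet (hence the need for NAs above)
train = df.dropna(subset=['future_return'])

X = train[['feature_1', 'feature_2']]  # whatever feature columns you downloaded
y = train['future_return']             # the target

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)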

I came to the same conclusion about future returns. I want Monday-morning-to-Monday-morning returns, so I used:
(Open(-6)-Open(-1))/Open(-1)

This is the same equation as Walter's, just in a slightly different (unsimplified) form, and with the open instead of the close.

I have not looked into what it returns for a date that does not exist, but I did check that the open prices were correct for Monday to Monday when both Mondays have happened.

I checked that in the screener and it also shows the problem. I think OHLC prices with a negative offset are broken.

This feature request encompasses so many other possible feature requests that are already covered by Sklearn.

Non-linear methods, early stopping, bootstrapping, K-fold cross-validation, recursive feature elimination, regularization, methods addressing collinearity (duckruck's PCA suggestion and others), and not just the ability to calculate metrics not provided by P123, but the ability to use them for cross-validation. No spreadsheet required.

Many of these are features that P123 might not be able to provide quickly, if it is interested in making them available on its platform at all. It remains to be seen how interested P123 is in responding to machine learning feature requests, or in taking suggestions for improving its initial AI/ML offering going forward. But addressing this one request would allow for more focused planning at P123, as no single feature would be crucial to anyone using the downloads with Sklearn. More and more people are able to use those downloads with advanced methods because of many new developments, including (but not limited to) ChatGPT's Code Interpreter and Colab.

P123, I am sure you are working on this, considering how many features can be addressed at once and the enthusiasm you have expressed for machine learning.

Thank you for continuing to work on this.

Here is some academic support for this feature request. It would be nice to rebalance a model, with updated data, that was trained using this: A Gentle Introduction to k-fold Cross-Validation

Here is what would then be available thru Sklearn: K-Folds cross-validator

Notice Sklearn's use of random_state, which would replace mod() and has more functionality. Also, shuffle=True and shuffle=False have different uses covering many previously described at P123 (e.g., shuffle=True is similar to an even/odd universe split but with more options, and shuffle=False allows for selecting different time periods).
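A small sketch of those two modes, assuming rows are sorted by date:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 dummy rows, assumed sorted by date

# shuffle=False keeps folds contiguous, so each held-out fold is a time period
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print('time-period fold:', test_idx)

# shuffle=True with a fixed random_state scatters rows across folds,
# similar in spirit to an even/odd universe split but reproducible
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print('shuffled fold:', test_idx)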

Sklearn allows for optimization with different metrics, and you can code your own metric if Sklearn does not have one you like. And you can automate this optimization.
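For example, a sketch of coding your own metric and handing it to cross-validation (the top-decile metric here is made up for illustration):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

def top_decile_return(y_true, y_pred):
    # Mean realized return of the rows the model scores in its top decile
    cutoff = np.quantile(y_pred, 0.9)
    return y_true[y_pred >= cutoff].mean()

scorer = make_scorer(top_decile_return)  # greater is better by default

X = np.random.rand(200, 5)  # stand-ins for features
y = np.random.rand(200)     # stand-in for future returns
print(cross_val_score(Ridge(), X, y, cv=5, scoring=scorer))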

All of this is available only thru uploads and downloads of csv files to and from Python right now. And Yuval has stressed the usefulness of dividing data into discrete universes himself, something I agree with. So there seems to be wide support for making this available thru spreadsheets at least. Perhaps it would not be wrong to make it available thru multiple means, including Python.

Python does make this easier and better, and adds functionality compared to a spreadsheet using mod(). And I can run it while I sleep.

Bootstrapping, subsampling, recursive feature elimination, and model averaging, as well as many other features, might also be useful to some P123 members.

Earlier in this thread I said I would speak to the developers about the requested enhancement to the API to add the ability to return daily data instead of the weekly data it currently returns. The development team would need resources allocated to this to accurately determine the scope of the project, but the expectation is that this will be a fairly large project. The developers are fully allocated to other projects for the near future including the P123 AI functionality. This enhancement is not on the schedule at this time, so I cannot provide an estimated date for completion.

This thread is a Feature Request created by Jim. As of today, it has received no votes from the user community. Votes are important in determining priority.

Dan,

Thank you for taking the time to understand the request in the first place, for looking into it, for giving it serious consideration, and for getting back to us.

Best,

Jim

Hi Jim,
You wrote in another thread today that I said 'no' to this request. I just want to clarify that the answer was not no; the answer was 'not right now'. We have a lot of projects already in progress and everybody is fully allocated to those. We will look into enhancing the API so it can retrieve daily data, but I can't tell you when.


Daily fundamentals is an old Feature Request (2013!). I moved it to the Roadmap here: Feature Request: Backtests with daily rebalance for a fee

It is something we want to do and might be able to start after we launch AI/ML, so please vote if you want it more than other things.

Thanks

So Marco.

I just need it today (literally just for 8/23/23). Just today. Today. I have no use for daily historical data.

A DataMiner download, or a download from my port, ranking system, or anywhere, for today. Just each factor and an index of the tickers.

Like what we get with the screener (which has a 500-row limit) for today, and can use in our classic ports that use the ranking system (machine learning ports needing something a little different).

I understand we can get it now with screenRun from DataMiner.

But you can get only one factor at a time in the screener. It would be nice to get an array (csv file) with all of the factors you use in your system in one download (ticker as the index is the only other column needed).
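Something like this hypothetical layout (tickers and factor names made up):

ticker,factor1,factor2,factor3
AAPL,87.2,12.5,0.31
MSFT,91.0,45.1,0.08
XOM,63.4,7.9,0.55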

I really think this will at some point give you a good cost/benefit ratio for machine learners!!!

EVERY MACHINE LEARNER WILL USE THIS!!!

Anyway, I am fine if you disagree on the profitability or importance once you understand my request and how useful it is to machine learners.

Thank you very much. Truly. For looking at this.

Jim

Aaron pointed out an important clarification regarding retrieving daily data from the API. The API can return the latest daily data, but only if the asOfDt is today or today-1. For example, today is Friday 8/25/23. An API call with asOfDt set to 2023-08-24 or 2023-08-25 will return the latest available daily data. An asOfDt <= 2023-08-23 will return the weekly data from the Saturday prior to the asOfDt. So those wanting to retrieve the latest daily data to feed into machine learning scripts can use the data_universe or ranks endpoints.

The DataUniverse and Ranks operations in DataMiner cannot return today’s daily data because the minimum Frequency setting is ‘1Week’ which will return data only for Saturday dates. We will look into an enhancement to enable DataMiner to return the daily data for the current day.
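To illustrate the rule, a minimal sketch using the same p123api client that appears in Jonpaul's code below; the ranking system and universe names are placeholders, and omitted parameters are assumed to take their defaults:

import datetime
import p123api

client = p123api.Client(api_id='Your api id', api_key='Your api key')

# asOfDt = today (or today-1) returns the latest daily data;
# anything older falls back to the prior Saturday's weekly data
today = datetime.date.today().strftime('%Y-%m-%d')

ranks = client.rank_ranks({
    'rankingSystem': 'All-Stars: Greenblatt',  # placeholder name
    'universe': 'Easy to Trade US',            # placeholder name
    'asOfDt': today,
}, True)  # True => pandas DataFrame
print(ranks.head())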

I am not sure I understand. Let's say next week, on Wednesday the 30th (after market close), I want to get the data for Wednesday to rebalance the next day. If I make a call with an asOfDt of 8/30/23, will I get that day's information, or will I get the 8/26/23 (prior Saturday) data? Using the python rank_ranks api.

You would get the same data you would get if you ran a screen on the website, as long as you call rank_ranks with the as-of date = the current calendar date (or the calendar date - 1).

I attached a spreadsheet where I tested that scenario today.
verify API can return TODAYS data.xlsx (68.3 KB)

Thanks! I think that makes sense. I will have to try it out next week to make sure it still works on other days of the week.

While this means you cannot download historical daily ranking data, I think it does mean that folks can rebalance using yesterday's data any day of the week!

I posted it before, but here is my download code again. You can paste it into Google Colab or run it locally on your machine. It is a lot of code, but it should have nice explanations to make each function's inputs easier to understand.

Imports and install (you may need to install another way if you are not using Colab):

!pip install --upgrade p123api # This is only for google colab
import p123api
import pandas as pd
import numpy as np
from datetime import timedelta, datetime

Class for getting the rank downloads

class Portfolio123API:
    def __init__(self, api_id, api_key):
        self.client = p123api.Client(api_id=api_id, api_key=api_key)

    def download_rank(self, universe, ranking_name, date, add_data, pit='Prelim', precision=4, method=2, names=False,
                      NACnt=True, finalStmt=True, node_type='factor', currency='USD'):
        """
        Generate a ranking pandas DataFrame based on the provided inputs. Uses 1 api credit per 25,000 data points as of Aug 2023.

        Parameters:
        -----------
        universe : str
            Name of the universe on Portfolio123.
        ranking_name : str
            Name of the ranking system on Portfolio123.
        date : str
            Date for which the ranking is being generated. Format is 'YYYY-MM-DD'
        add_data : list
            Additional data to download, like 'Close(0)', which is Friday's close. Note that some items may require the PIT license.
        pit : str, optional
            Point-in-time method, either 'Prelim' or 'Complete'; default is 'Prelim'.
        precision : int, optional
            Number of decimal places for ranking scores, default is 4.
        method : int, optional
            Numeric code representing the ranking method; default is 2 (percentile, NAs negative), and 4 is percentile, NAs neutral.
        names : bool, optional
            Flag indicating whether to include ticker names in the output, default is False.
        NACnt : bool, optional
            Flag indicating whether to include the count of missing values, default is True.
        finalStmt : bool, optional
            Flag indicating whether to include if the stock has a final statement, default is True.
        node_type : str, optional
            Type of node for ranking, e.g., 'factor' or 'composite', default is 'factor'.
        currency : str, optional
            Currency for monetary values, default is 'USD'. 'USD' | 'CAD' | 'EUR' | 'GBP' | 'CHF'

        Returns:
        --------
        pandas.DataFrame
            A DataFrame containing the generated ranking data. Added the date as a column
      """

        ranking = self.client.rank_ranks({
            'rankingSystem': ranking_name,
            'asOfDt': date,  # Formatted as 'yyyy-mm-dd'
            'universe': universe,
            'pitMethod': pit,
            'precision': precision,
            'rankingMethod': method,  # 2-Percentile NAs Negative, 4-Percentile NAs Neutral
            'includeNames': names,
            'includeNaCnt': NACnt,
            'includeFinalStmt': finalStmt,
            'nodeDetails': node_type,  # 'factor', 'composite'
            'currency': currency,
            'additionalData': add_data  # Example: ['Close(0)', 'mktcap', "ZScore(`Pr2SalesQ`,#All)"]
        }, True)  # True - output to Pandas DataFrame | [False] to JSON.

        dates = pd.to_datetime([date] * len(ranking))
        ranking.insert(0, 'date', dates)
        return ranking

    def download_universe(self, universe, date, formulas, pit='Prelim', precision=4, names=False, currency='USD'):
        """
        Generate a pandas DataFrame based on the provided inputs. Uses 1 api credit per 25,000 data points as of Aug 2023.

        Parameters:
        -----------
        universe : str
            Name of the universe on Portfolio123.
        date : str
            Date for which the ranking is being generated. Format is 'YYYY-MM-DD'
        formulas : list
            Formulas to download, like 'Close(0)', which is Friday's close. Note that some items may require the PIT license.
        pit : str, optional
            Point-in-time method, either 'Prelim' or 'Complete'; default is 'Prelim'.
        precision : int, optional
            Number of decimal places for ranking scores, default is 4.
        names : bool, optional
            Flag indicating whether to include ticker names in the output, default is False.
        currency : str, optional
            Currency for monetary values, default is 'USD'. 'USD' | 'CAD' | 'EUR' | 'GBP' | 'CHF'

        Returns:
        --------
        pandas.DataFrame
            A DataFrame containing the generated ranking data. Added the date as a column
      """
        ranking = self.client.data_universe({
            'universe': universe,
            'asOfDt': date,  # 'yyyy-mm-dd'
            'formulas': formulas,  # ['Close(0)', 'mktcap']
            'pitMethod': pit,
            'precision': precision,
            'includeNames': names,
            'currency': currency
        }, True)  # True - output to Pandas DataFrame | [False] to JSON.

        dates = pd.to_datetime([date] * len(ranking))
        ranking.insert(0, 'Date', dates)
        return ranking

    def download_weekly_ranks(self, universe, ranking_name, start_date, end_date, add_data, pit='Prelim', precision=4,
                              method=2, names=False, NACnt=True, finalStmt=True, node_type='factor', currency='USD'):
        """
        Download rankings for multiple dates. Note that calculating some additional stats, like alpha relative to the universe, requires additional data!
        Uses 1 api credit per 25,000 data points as of Aug 2023.

        Parameters:
        -----------
        universe : str
            Name of the universe on Portfolio123.
        ranking_name : str
            Name of the ranking system on Portfolio123.
        start_date : str
            Start date to get data. Format is 'YYYY-MM-DD'. Note that the resulting dataframe will use the previous Saturday as the date
        end_date : str
            End date to get data. Format is 'YYYY-MM-DD'. Note that the resulting dataframe will use the previous Saturday as the date
        add_data : list
            Additional data to download, like 'Close(0)', which is Friday's close. Note that some items may require the PIT license.
        pit : str, optional
            Point-in-time method, either 'Prelim' or 'Complete'; default is 'Prelim'.
        precision : int, optional
            Number of decimal places for ranking scores, default is 4.
        method : int, optional
            Numeric code representing the ranking method; default is 2 (percentile, NAs negative), and 4 is percentile, NAs neutral.
        names : bool, optional
            Flag indicating whether to include ticker names in the output, default is False.
        NACnt : bool, optional
            Flag indicating whether to include the count of missing values, default is True.
        finalStmt : bool, optional
            Flag indicating whether to include if the stock has a final statement, default is True.
        node_type : str, optional
            Type of node for ranking, e.g., 'factor' or 'composite', default is 'factor'.
        currency : str, optional
            Currency for monetary values, default is 'USD'. 'USD' | 'CAD' | 'EUR' | 'GBP' | 'CHF'

        Returns:
        --------
        pandas.DataFrame
            A DataFrame containing the generated ranking data from one date to another
        """

        current_date = datetime.strptime(start_date, '%Y-%m-%d')
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
        combined_dataframe = pd.DataFrame()
        # These give Monday-open-to-Monday-open gains, which is what I trade. Change them if you trade at another time.
        required_data = ['Open(-6)/Open(-1)-1', 'Open_W(-4)/Open(-1)-1', 'Open_W(-12)/Open(-1)-1']

        while current_date <= end_date:
            previous_sunday = current_date - timedelta(
                days=(current_date.weekday() + 1) % 7)  # This calculates a more accurate asofDate
            previous_sunday_str = previous_sunday.strftime('%Y-%m-%d')
            dataframe = self.download_rank(universe, ranking_name, previous_sunday_str, required_data + add_data, pit,
                                           precision, method, names, NACnt, finalStmt, node_type, currency)
            dataframe.rename(columns={'formula1': 'nw_change',
                                      'formula2': 'nm_change',
                                      'formula3': 'n3m_change'}, inplace=True)

            # Calculate the universe gain and then each stock's alpha!
            nw_univ_gain_percentage = dataframe['nw_change'].mean()  # Universe return over the next week
            dataframe['nw_alpha'] = dataframe['nw_change'] - nw_univ_gain_percentage  # Alpha vs the universe

            nm_univ_gain_percentage = dataframe['nm_change'].mean()  # Universe return over the next month
            dataframe['nm_alpha'] = dataframe['nm_change'] - nm_univ_gain_percentage

            n3m_univ_gain_percentage = dataframe['n3m_change'].mean()  # Universe return over the next 3 months
            dataframe['n3m_alpha'] = dataframe['n3m_change'] - n3m_univ_gain_percentage

            combined_dataframe = pd.concat([combined_dataframe, dataframe], ignore_index=True)  # Append this week's rows

            current_date += timedelta(weeks=1)

        combined_dataframe.columns = combined_dataframe.columns.str.replace(r' \(\d+\.\d+%\)', '', regex=True)
        return combined_dataframe

Finally, examples of how to run the above class:

# Initialize the api class
api_id = 'Your api id'
api_key = 'Your api key'
api = Portfolio123API(api_id, api_key)

#-------------- Examples for each function below -------------------------------

# Download ranks from a ranking system  ------------------------------------
ranks = api.download_rank('Easy to Trade US', 'All-Stars: Greenblatt', '2023-08-25', ['Close(0)'])
print("Ranks from ranking system are:\n")
print(ranks)

# Download data from a universe  ------------------------------------
universe_data = api.download_universe('Easy to Trade US', '2023-08-25', ['Close(0)'])
print("Universe data is as follows:\n")
print(universe_data)

# Download ranks over multiple dates!
# Note that this function adds future universe alpha columns and future 1w, 1m, and 3m changes that are based on Monday open to Monday open.
# Change if you want another time or do something fancy like open to close or the like. It is defined in the function in the class
start_date = '2017-01-15'
end_date ='2017-12-24'
universe = 'Large Cap'
ranking_name = 'All-Stars: Greenblatt'
extra_formulas = ['Close(0)']
weekly_ranks = api.download_weekly_ranks(universe, ranking_name, start_date, end_date, extra_formulas)

# Save to a pickle which is very fast to load, but not human readable
weekly_ranks.to_pickle('data.pkl') # This saves the data as a pickle. You can load it using: weekly_ranks = pd.read_pickle('data.pkl')

# Save to a csv, slow to load, but human readable
weekly_ranks.to_csv('data.csv', index=False) # To load it again use: weekly_ranks = pd.read_csv('data.csv')

Dan, Jonpaul, Walter, Aaron, Marco and others,

Thank you and WOW!!! And just to be clear, Jonpaul should be a target audience. He has an Ultimate membership, BTW. I want to be like him when I grow up. More specifically, I want to learn better Python skills, catch up on Bayesian optimization, etc., keep comparing notes in the forum with him and others, and be able to contribute. I think this is what machine learning at P123 looks like, BTW: machine learning that will attract the Kaggle crowd, undergraduates in any scientific field, aerospace engineers, etc. Maybe a bit in the weeds, but: awesome, P123!!! And thank you.

So, I could probably figure out the API. But for now I use DataMiner to create a csv file and work with it in Jupyter notebooks.

So, a simple question, just to be sure: does the same thing apply to DataMiner?

The focus of my question, like Jonpaul's about the API: will the ranks reflect the overnight update if I do this on Friday morning (after the updates)?

For clarity, the code I will use:

Main:
Operation: Ranks
On Error: Stop # ( [Stop] | Continue )
Precision: 4 # ( [ 2 ] | 3 | 4 )

Default Settings:
PIT Method: Prelim # ( [Complete] | Prelim )
Ranking System: 'M3DM'
Ranking Method: NAsNeutral
Start Date: 2005-01-02
End Date: 2010-01-01
Frequency: 1Week
Universe: Easy to Trade US
Columns: factor #( [ranks] | composite | factor )
Include Names: true #( true | [false] )
Additional Data:

    - 1WkTradedRet: 100*(Close(-6)/Close(-1)-1) 
    - Future 1wkRet: Future%Chg(5)
    - FutureRel%Chg(5,GetSeries("$SPALLPV:USA")) #Relative return vs $SPALLPV:USA

Jim

Based on this post by Dan, DataMiner cannot currently return the daily data.

Dan, Marco, and other folks at P123:
Maybe P123 can provide a Colab file for the downloads like they do for the screener? Or a tutorial or something on how to set it up? The example could include how to write the data to csv; my code above shows how to do this. Feel free to use it if you make a tutorial or Colab file. ChatGPT wrote 90% of it for me anyway…


Jonpaul,

Thank you. So I can probably train my model, which will take a while, knowing I can use the API for daily rebalances when the time comes.

I am slow, but I can usually figure things out. And as you say, ChatGPT can help me nowadays.

Very helpful. Thank you.

Jim