API ranking download feedback for ML

Hi all,

I am looking for feedback on what to download so that I can explore both linear ranking system optimizations and nonlinear systems without burning all of my API credits.

Here is my plan so far:

  • Download API method: the Python API rank_ranks, to get weekly data as of each Friday from my ranking system. From this I can get all factor ranks in the system plus the extras mentioned below. I will call this function repeatedly to cover multiple dates.
  • Ranking system: include the factors of interest plus Sector/Industry. The sector rank does not name a specific sector, but in a large universe each sector's rank should stay relatively constant, so I can use it to split my data into sub-universes if desired.
  • Open_D(-1): this gives next Monday’s open price, which is when I will rebalance. Using this value from consecutive weeks, I can also calculate the total universe return for the next week, and from that “next week alpha” and “previous week alpha”. Future%Chg is measured Friday to Friday, but I cannot buy a stock in the past (Friday is before I rebalance), so the weekend gap would introduce a LOT of error into my returns. Hence the need for the Monday open price.
  • Market cap: this is mostly just for breaking out sub-universes if I want
  • Volume(-1): this will help me calculate slippage
  • Spread last week average: this will help me calculate slippage
  • I may pull other benchmark data from Yahoo Finance; not sure yet
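To make the alpha calculation in the plan concrete, here is a minimal pandas sketch. It assumes a long table with one row per (date, ticker), where `next_open` holds the Open_D(-1) value (next Monday's open, as described above). The column names, the equal-weighted universe return, and the assumption that every ticker has a row for every consecutive week are all illustrative choices, not anything the API guarantees.

```python
import pandas as pd

def add_weekly_alpha(df: pd.DataFrame) -> pd.DataFrame:
    """Add Monday-open-to-Monday-open returns and alpha versus the universe.

    Assumes columns: date (Friday), ticker, next_open (next Monday's open).
    Assumes no missing weeks per ticker; gaps would make shift() span
    more than one week.
    """
    out = df.sort_values(["ticker", "date"]).copy()
    # Open on the Monday one week after the rebalance Monday, per ticker.
    following_open = out.groupby("ticker")["next_open"].shift(-1)
    out["next_week_ret"] = following_open / out["next_open"] - 1.0
    # Equal-weighted universe return for the same holding week (an assumption).
    out["universe_ret"] = out.groupby("date")["next_week_ret"].transform("mean")
    out["next_week_alpha"] = out["next_week_ret"] - out["universe_ret"]
    # Previous week's alpha is known at rebalance time, so it can be a feature.
    out["prev_week_alpha"] = out.groupby("ticker")["next_week_alpha"].shift(1)
    return out

# Tiny made-up example: two tickers, three Fridays.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-06", "2023-01-13", "2023-01-20"] * 2),
    "ticker": ["A"] * 3 + ["B"] * 3,
    "next_open": [10.0, 11.0, 12.1, 20.0, 20.0, 22.0],
})
result = add_weekly_alpha(sample)
```

The last date for each ticker has no following Monday open, so its `next_week_ret` and `next_week_alpha` come out as NaN and should be dropped before training.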

My resulting dataframe (or table for those not familiar with Pandas) will have the following columns:
Date, ticker/ID, factor1, factor2, factor…, open_D(-1), mktcap, vol(-1), spread last week avg
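For stacking the per-date downloads into that one table, something like the sketch below might work. The `fetch_ranks` argument is a hypothetical callable standing in for whatever wraps the real API call; its signature and the columns it returns are assumptions for illustration only. The fake fetcher at the bottom just produces made-up numbers.

```python
import pandas as pd
from typing import Callable, Iterable

def build_panel(
    dates: Iterable[pd.Timestamp],
    fetch_ranks: Callable[[pd.Timestamp], pd.DataFrame],
) -> pd.DataFrame:
    """Call fetch_ranks once per date and stack the results into one long table."""
    frames = []
    for d in dates:
        frame = fetch_ranks(d).copy()
        # Tag every row with its as-of date so the frames can be stacked.
        frame.insert(0, "date", pd.Timestamp(d))
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Every Friday in the range; "W-FRI" is the pandas weekly-on-Friday frequency.
fridays = pd.date_range("2023-01-06", "2023-01-27", freq="W-FRI")

def fake_fetch(as_of: pd.Timestamp) -> pd.DataFrame:
    # Stand-in for the real download; returns made-up ranks for two tickers.
    return pd.DataFrame({"ticker": ["A", "B"], "factor1": [75.0, 25.0]})

panel = build_panel(fridays, fake_fetch)
```

Saving the stacked table to disk right after the download means each date only has to be paid for once.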

Does this sound sufficient to get started with random forest or PCA?

Thanks,
Jonpaul

So just a thought.

Once you have the data for use in Python, you might be able to use the stock ID, or just a random seed, to actually bootstrap your samples.

To the extent that is possible, it would be far superior in some situations. It is like going from your old Ford Pinto (having to use subindustries) to a mid-engine Corvette (random sampling with replacement, 1,000 or 10,000 times, with bootstrapping).

Can you do that maybe?
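Roughly what I have in mind, as a sketch only: resample whole stocks with replacement, using a seeded random generator so the runs are repeatable. The column names `ticker` and `next_week_alpha`, and the choice to resample by stock rather than by individual row, are my assumptions here.

```python
import numpy as np
import pandas as pd

def bootstrap_by_ticker(df, stat_fn, n_boot=1000, seed=0):
    """Resample tickers with replacement n_boot times; return stat_fn per resample."""
    rng = np.random.default_rng(seed)
    # Keep each stock's full history together as one block.
    groups = [g for _, g in df.groupby("ticker")]
    n = len(groups)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        # Draw n stocks with replacement; some repeat, some are left out.
        idx = rng.integers(0, n, size=n)
        resample = pd.concat([groups[j] for j in idx], ignore_index=True)
        stats[i] = stat_fn(resample)
    return stats

# Made-up data: three stocks, two weeks each.
demo = pd.DataFrame({
    "ticker": ["A", "A", "B", "B", "C", "C"],
    "next_week_alpha": [0.02, 0.01, -0.01, 0.00, 0.03, -0.02],
})
boot = bootstrap_by_ticker(demo, lambda s: s["next_week_alpha"].mean(), n_boot=200, seed=42)
low, high = np.percentile(boot, [2.5, 97.5])  # rough 95% interval for the mean
```

The stocks left out of each draw could also serve as a hold-out set for that round.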

Jim

For sure, that seems like the best method for cross-validation. It is more that I want to be able to discard sectors that add a lot of noise, but I still want to download the data initially so I don’t have to do it later if I decide I really wanted a sector I excluded.
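A sketch of how I picture the "download everything, filter later" step. It assumes the downloaded table has a `sector_rank` column, that every sector appears on every date, and that the ordering of sectors by rank does not change between dates. None of that is verified yet, and the column names are placeholders.

```python
import pandas as pd

def add_sector_label(panel: pd.DataFrame, rank_col: str = "sector_rank") -> pd.DataFrame:
    """Turn each date's distinct sector ranks into integer labels 1..k."""
    out = panel.copy()
    # Dense rank within a date: equal sector ranks share one label.
    out["sector_label"] = out.groupby("date")[rank_col].rank(method="dense").astype(int)
    return out

def drop_sectors(panel: pd.DataFrame, excluded: set) -> pd.DataFrame:
    """Return a filtered copy; the full downloaded panel is left untouched."""
    return panel[~panel["sector_label"].isin(excluded)].copy()

# Made-up data: four stocks in three sectors, two dates, slightly drifting ranks.
full = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-06"] * 4 + ["2023-01-13"] * 4),
    "ticker": ["A", "B", "C", "D"] * 2,
    "sector_rank": [90.0, 90.0, 50.0, 10.0, 89.5, 89.5, 50.2, 10.1],
})
labeled = add_sector_label(full)
subset = drop_sectors(labeled, excluded={1})
```

Since the filter returns a copy, an excluded sector can be brought back later without downloading again.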