Machine learning competition on Kaggle?

Dear All,

To promote P123 as a platform for ML, what do you think about hosting a competition on Kaggle? See: Kaggle Competitions

There are about 600K data scientists tinkering on Kaggle, so the exposure would be fantastic. But it would require a sizable monetary prize for someone to make their ML system public.

I examined Kaggle’s past competitions and found none in our space. Why would that be? Perhaps they don’t allow competitions involving the stock market?

I filled out a request and am waiting to hear back.

I just had a quick look at Kaggle and I don’t really understand it, other than it appears to be some sort of crowd-sourced Notebook-style environment for Python coding. If you run a contest, how would it work, what are the criteria for winning the contest, where will the data come from, what should the output look like, is there a cost for being on Kaggle, etc?

Marco,

I would expect that to be an effective way to get P123 known by the machine-learning community.

In fact, I was convinced of this when I recommended it in a post on Dec 13 last year: https://www.portfolio123.com/mvnforum/viewthread_thread,12022#!#70518

Best,

Jim

Marco,

I have seen multiple competitions from the investment community. Examples:

Jane Street Market Prediction: active

Two Sigma: Using News to Predict Stock Movements: completed

Two Sigma Financial Modeling Challenge (Can you uncover predictive value in an uncertain world?): completed

Also, competitions are often for “Kudos” or “Swag,” or for relatively small amounts of money.
Current active example:

Predict Future Sales: Kudos

Completed example:

March Machine Learning Mania 2017 (Predict the 2017 NCAA Basketball Tournament): Swag

Jim

The Jane Street competition has $100K in prize money. Wow. https://www.kaggle.com/c/jane-street-market-prediction/overview

They supply anonymized data for training and determine the winner by running the models live, out of sample. But they deal with real-time data and score out of sample only.

Another competition was this one sponsored by Two Sigma, also a $100K prize. This one has ended and had data from Intrinio & Reuters. https://www.kaggle.com/c/two-sigma-financial-news/overview/evaluation

We could do something similar with anonymized data and run the models live for 6 months to determine the winner. I think out of sample is the only way with this competition.

Steve, take a look at the two examples above for the rules and how they score the models. For the Jane Street competition you can still download the data since it’s ongoing.

Jrinne, yeah you did show the way a year ago. Sorry I missed it. I agree with everything.

I sent an inquiry to Kaggle. Still waiting for a reply.

Hi,
Pretty new here, so great to meet everyone. I am familiar with Kaggle, the data mostly, not the competitions. Given that there is no ML capability within P123, at least that this noob knows of anyway, what is the advantage of getting a bunch of Python ML’ers interested in P123?

Sorry if the answer is obvious.
Steve

Steve,

Marco can tell you his plans better than I can. But he is clearly doing most everything he can to make the data accessible: an API, increased sim downloads, etc.

What machine-learning methods have you used?

I only ask because boosting has become popular on P123 (using various download methods). Steve Auger and Marco have become interested in boosting for a variety of reasons. One advantage of boosting (with decision trees as the base learners) is that, as long as the factors or inputs (predictors) keep their order, it gives the exact same answer no matter how the predictors are transformed.

Specifically, ranks work as well as the raw data. Ranking is designed to keep the order, of course. And for many machine-learning methods, ranks are an advantage in their own right because they are non-parametric. But if you are not a fan of ranks, that is okay: P123 will be offering Z-scores without a data license.
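To make the order-preserving point concrete, here is a minimal sketch in plain Python. It is a toy boosting loop built from one-split trees (stumps) on a single synthetic predictor, not P123 code or any particular library's algorithm. The data, the function names, and the choice to store each threshold at a training value are all assumptions made for illustration.

```python
import math
import random

def fit_stump(x, residual):
    """Best single split on x by squared error: returns (threshold, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        left = [residual[i] for i in order[:k]]
        right = [residual[i] for i in order[k:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            # Threshold = largest x on the left side, so "x <= threshold" goes left.
            best = (sse, x[order[k - 1]], lm, rm)
    return best[1:]

def boost(x, y, rounds=20, learning_rate=0.5):
    """Toy gradient boosting for squared loss; returns predictions on the training rows."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        thr, lm, rm = fit_stump(x, residual)
        pred = [p + learning_rate * (lm if xi <= thr else rm) for p, xi in zip(pred, x)]
    return pred

def ranks(x):
    """Rank of each value (0 = smallest); assumes no ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    out = [0] * len(x)
    for rank, i in enumerate(order):
        out[i] = rank
    return out

random.seed(0)
x = [random.uniform(-2, 2) for _ in range(40)]          # synthetic raw factor
y = [math.sin(3 * xi) + random.gauss(0, 0.1) for xi in x]  # synthetic target

pred_raw = boost(x, y)
pred_rank = boost(ranks(x), y)                    # same order, different spacing
pred_exp = boost([math.exp(xi) for xi in x], y)   # another monotone transform

print(pred_raw == pred_rank == pred_exp)
```

Each split only looks at which rows fall on either side, so any transform that keeps the order produces the same partitions and the same fitted values. Implementations that place the threshold at the midpoint between two training values can route a new point differently after a nonlinear transform, so the exact match is only guaranteed on the training rows.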

Z-scores will keep the (scaled) intervals and will work well for methods such as Ridge Regression.
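A small sketch of what "keeping the scaled intervals" means, and of why that matters for a penalized linear method. The factor values and returns are made up, the population standard deviation is an assumed convention (P123's exact Z-score definition is not stated here), and `ridge_beta` is a hypothetical helper for the one-feature ridge solution.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize: subtract the mean, divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def ridge_beta(z, y, lam):
    """One-feature ridge slope on centered y: sum(z*y) / (sum(z^2) + lambda)."""
    y_bar = mean(y)
    num = sum(zi * (yi - y_bar) for zi, yi in zip(z, y))
    return num / (sum(zi * zi for zi in z) + lam)

raw = [1.0, 2.0, 4.0, 8.0]    # made-up factor values, unevenly spaced
rank = [1.0, 2.0, 3.0, 4.0]   # their ranks: order kept, spacing lost
z = zscore(raw)

# Ratio of the last gap to the first gap.
ratio_raw = (raw[3] - raw[2]) / (raw[1] - raw[0])
ratio_z = (z[3] - z[2]) / (z[1] - z[0])
ratio_rank = (rank[3] - rank[2]) / (rank[1] - rank[0])
print(ratio_raw, round(ratio_z, 6), ratio_rank)

returns = [0.00, 0.01, 0.01, 0.04]   # made-up excess returns
beta_unpenalized = ridge_beta(z, returns, lam=0.0)
beta_ridge = ridge_beta(z, returns, lam=10.0)
print(beta_unpenalized, beta_ridge)
```

The Z-score is just a shift and a rescale, so the gap ratio survives, while ranks flatten every gap to one step. Because the ridge penalty acts on the size of the coefficient, putting every factor on the same scale keeps the penalty from favoring factors that happen to have large units.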

IMHO, P123 is ideal for machine-learning methods, especially for non-linear data like financial data.

For sure one can do boosting with P123. I have downloaded the ranks as factors and trained a model using the excess returns as the target (label). I then made predictions daily by downloading factors in the screener, making predictions on the universe of stocks from those downloads, sorting the predictions (best to worst), and buying (or holding) the stocks predicted to do best over the rebalance period.
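The last part of that loop (predict, sort, pick) can be sketched as follows. Everything here is hypothetical: the tickers, the factor ranks, and the `pick_holdings` helper. The `average_rank` function is only a placeholder standing in for whatever trained model you actually use.

```python
def pick_holdings(tickers, factor_rows, predict, n_hold):
    """Score every stock, sort best to worst, and return the top n_hold tickers."""
    scores = {t: predict(row) for t, row in zip(tickers, factor_rows)}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_hold]

def average_rank(row):
    """Placeholder 'model': mean of the factor ranks (a trained model would go here)."""
    return sum(row) / len(row)

# Made-up screener download: one row of factor ranks (0-100) per stock.
tickers = ["AAA", "BBB", "CCC", "DDD"]
factor_rows = [
    [90, 80],
    [20, 30],
    [70, 95],
    [50, 40],
]

holdings = pick_holdings(tickers, factor_rows, average_rank, n_hold=2)
print(holdings)  # ['AAA', 'CCC']
```

In practice `predict` would be the fitted model's prediction of excess return for one row of downloaded factors, and the list would be recomputed at each rebalance.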

Anyway, machine-learning can be done here. P123 is making serious efforts to make the data for machine-learning available.

At some point the data will probably be accessible enough to screen a large number of factors using feature importance with boosting or with random forests. This could be useful even for those who want to use those factors in the present P123 port methods.
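As a rough illustration of screening factors this way, the sketch below scores each factor by how much one tree split on it reduces squared error. That is a crude stand-in for the importance numbers a boosting or random forest library reports (those add up split gains across many trees), not the same calculation. The factor names and the synthetic data are assumptions.

```python
import random

def best_split_gain(x, y):
    """Fraction of the target's squared error removed by the best single split on x."""
    n = len(y)
    y_bar = sum(y) / n
    total = sum((v - y_bar) ** 2 for v in y)
    order = sorted(range(n), key=lambda i: x[i])
    best = 0.0
    for k in range(1, n):
        left = [y[i] for i in order[:k]]
        right = [y[i] for i in order[k:]]
        lm, rm = sum(left) / k, sum(right) / (n - k)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        best = max(best, total - sse)
    return best / total

random.seed(1)
n = 200
factors = {
    "value_rank": [random.random() for _ in range(n)],  # hypothetical factor with signal
    "noise_a": [random.random() for _ in range(n)],     # pure noise
    "noise_b": [random.random() for _ in range(n)],     # pure noise
}
# Synthetic target that depends on value_rank only.
target = [v + random.gauss(0, 0.1) for v in factors["value_rank"]]

scores = {name: best_split_gain(col, target) for name, col in factors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], {name: round(s, 3) for name, s in scores.items()})
```

Even pure-noise factors get a small positive score from chance splits, so with a large number of candidate factors it helps to compare scores against a noise baseline or to check them on held-out data.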

Frankly, if they can find the Higgs boson over at Kaggle (this is the competition that made Kaggle famous, I think) then they should be able to make some simple stock predictions.

Or at least there may be some people at Kaggle who would like to try it here at P123 and/or in a Kaggle competition.

Hmmm…Mystery of the universe/stock prediction.

Most elusive particle in the universe/stock prediction.

Yeah, right or wrong, I think I can see why they would think it might be worth a try.

Jim

P123 is a niche business and caters to a unique type of individual/business. P123 has been cultivating indicators, formulae, and data curation for 20 years, and ranks up there with platforms that are 10x the cost. This is just a guess, but I doubt that Kaggle can compete when it comes to financial data. These types of websites would have poorly curated data, if any at all. Anyone can work on a Python platform for free; hell, run it on your desktop if you want. But you can’t get P123’s offerings (data curation, financial ratios, formulae, custom series, various universes including custom) anywhere other than P123. It should in fact be a data scientist’s wet dream.

So I am going to go out on a limb and guess that data scientists have above-average incomes and like to torture data. Do you agree? And where better than P123, with its super-clean, well-organized dataset?

Now it is true that P123 is in the early stages of promoting ML, but this is a smart move for Marco because it gives P123 an opportunity to expand into an adjacent area, very much in tune with what it already does very well. And it caters to people who have money, probably want to invest, and are naturally comfortable with what we do here at P123: optimizing trading strategies. This means the possibility of drawing in new subscribers and having a new source of revenue (ML data). Win-win all around.