I hope others will share their code, since there are many more out there who are much better than me at this, but after a lot of work and even more frustration, I managed to put together a draft of a Python ML code that works on P123 Data.
I use it in Visual Studio and have received some assistance from Cody, Tabnine, and BlackBox.
The code is supposed to do this (I don't have enough programming experience to know if this is what it actually does, and the stocks recommended by the trained models surprised me):
- Imports necessary libraries and modules.
- Defines the main() function.
- Loads the dataset from a CSV file using Dask DataFrame.
- Performs data preprocessing:
- Converts the 'Date' column to datetime format.
- Sorts the data by the 'Date' column.
- Performs feature engineering by extracting year, month, and day from the 'Date' column.
- Removes rows with missing values.
- Checks if the 'Price' column exists in the dataset. If not, it uses the first numeric column as the target.
- Defines the training and testing window sizes (in years).
- Defines the path to the folder where the trained models will be saved.
- Tries to load previously saved results from a pickle file. If the file doesn't exist, it performs walk-forward validation:
- Iterates over different start years for the training and testing periods.
- Filters the training and testing data based on the start year and window sizes.
- Checks if there is enough data for the current training and testing periods.
- Defines the features (X) and target (y) for training and testing data.
- Converts Dask DataFrames to Pandas DataFrames.
- Splits the training data into train and validation sets.
- Defines the hyperparameter grids for RandomizedSearchCV.
- Defines the machine learning models (RandomForest, ExtraTrees, XGBoost) with RandomizedSearchCV.
- Creates the model directory if it doesn't exist.
- Trains and evaluates the models on the train and validation sets.
- Calculates and stores the mean squared error (MSE) for each model.
- Saves the best estimator for each model to disk.
- Defines an ensemble model (VotingRegressor) using the best estimators from the individual models.
- Trains and evaluates the ensemble model on the train and validation sets.
- Calculates and stores the MSE for the ensemble model.
- Saves the ensemble model to disk.
- Saves the results (MSE for each model and period) to a pickle file.
- Prints the model evaluation results (average MSE over all periods) for each model.
- Finds the best model based on the lowest average MSE.
- Finds the last start year for which the best model was trained.
- Loads the best model from disk.
- If the best model is loaded successfully:
- Computes the predictions for the entire dataset using the best model.
- Converts the predictions to a Pandas DataFrame.
- Adds the 'Ticker' column to the predictions DataFrame.
- Sorts the predictions DataFrame by the predicted values in descending order.
- Prints the top 10 recommended stocks based on the model predictions.
- If no model was trained, it prints a message indicating that.
- Calls the main() function if the script is run directly.
import dask.dataframe as dd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import joblib
import pandas as pd
import pickle
import os
import os.path
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import VotingRegressor
def main():
    """Walk-forward train and evaluate regressors on P123 price data, then
    print the top-10 tickers ranked by the best model's predictions.

    Pipeline:
      1. Lazily load the CSV with Dask; derive Year/Month/Day features from
         the 'Date' column and drop rows with missing values.
      2. Walk-forward loop over start years: tune RandomForest, ExtraTrees
         and XGBoost with RandomizedSearchCV, score each on a 20% validation
         split of the training window, build a VotingRegressor ensemble from
         the tuned estimators, and persist every fitted model to disk.
      3. Pick the model family with the lowest mean MSE across all periods,
         load its most recently trained snapshot, and rank every row of the
         full dataset by predicted value.

    Per-period MSE results are cached in 'walk_forward_results.pkl'; delete
    that file to force re-training.
    """
    # Load the dataset lazily with Dask.
    file_path = r'C:\XXXX\020624RAA.csv'
    data = dd.read_csv(file_path)

    # Preprocessing: 'Date' is assumed to hold the row dates.
    data['Date'] = dd.to_datetime(data['Date'])
    data = data.sort_values('Date')

    # Feature engineering: calendar components of the date.
    data['Year'] = data['Date'].dt.year
    data['Month'] = data['Date'].dt.month
    data['Day'] = data['Date'].dt.day

    # Drop rows with missing values, if any.
    data = data.dropna()

    # Fall back to the first numeric column when 'Price' is absent.
    target_column = 'Price'
    if target_column not in data.columns:
        print(f"Warning: Target column '{target_column}' not found in dataset. Using the first numeric column as the target.")
        numeric_columns = [col for col in data.columns if data[col].dtype.kind in 'bifc']
        if not numeric_columns:
            print("No numeric columns found in the dataset. Exiting.")
            return
        target_column = numeric_columns[0]

    # Training and test window sizes, in years.
    # NOTE(review): the date arithmetic below actually spans train_window + 1
    # calendar years (e.g. 2001-01-01 .. 2004-12-31) -- confirm this is the
    # intended window length.
    train_window = 3
    test_window = 3

    # Folder where trained models are persisted. Created once, up front;
    # exist_ok avoids the check-then-create race of the original.
    modell_mappe = r'C:\XXX\MAKSKINL LAGRING\LAGREDE MODELLER'
    os.makedirs(modell_mappe, exist_ok=True)

    # Hyperparameter search spaces are loop-invariant: define them once,
    # outside the walk-forward loop.
    rf_param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 5, 10],
        'min_samples_split': [2, 5, 10],
    }
    # Same search space applies to both forest variants.
    et_param_grid = rf_param_grid
    xgb_param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.05, 0.1],
    }

    # Try to load previously saved results; otherwise run the walk-forward
    # validation from scratch.
    try:
        with open('walk_forward_results.pkl', 'rb') as f:
            resultater = pickle.load(f)
    except FileNotFoundError:
        resultater = {}

        # Walk-forward validation over successive start years.
        for start_år in range(2001, 2022 - train_window - test_window + 1):
            train_start_date = f"{start_år}-01-01"
            train_end_date = f"{start_år + train_window}-12-31"
            test_start_date = f"{start_år + train_window + 1}-01-01"
            test_end_date = f"{start_år + train_window + test_window}-12-31"

            # Filter the training and test slices for this period.
            train_data = data[(data['Date'] >= train_start_date) & (data['Date'] <= train_end_date)]
            test_data = data[(data['Date'] >= test_start_date) & (data['Date'] <= test_end_date)]

            # Skip periods without data on either side.
            if len(train_data) == 0 or len(test_data) == 0:
                print(f"Ikke nok data i trenings- eller testdataene for periodene {train_start_date} - {train_end_date} og {test_start_date} - {test_end_date}. Hopper over denne iterasjonen.")
                continue

            # Materialize only the training slice as pandas. The original
            # also computed the test slice, but it was never used for
            # scoring -- that dead (and expensive) .compute() is removed.
            # NOTE(review): scoring below uses a validation split of the
            # training window, not the walk-forward test window -- confirm
            # whether out-of-sample test scoring was intended.
            X_train_pd = train_data[['Year', 'Month', 'Day']].compute()
            y_train_pd = train_data[target_column].compute()

            # Hold out 20% of the training window for validation scoring.
            X_train_pd, X_val_pd, y_train_pd, y_val_pd = train_test_split(
                X_train_pd, y_train_pd, test_size=0.2, random_state=42)

            # Fresh search objects each period so estimators are re-tuned.
            modeller = {
                'RandomForest': RandomizedSearchCV(estimator=RandomForestRegressor(random_state=42), param_distributions=rf_param_grid, n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1),
                'ExtraTrees': RandomizedSearchCV(estimator=ExtraTreesRegressor(random_state=42), param_distributions=et_param_grid, n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1),
                'XGBoost': RandomizedSearchCV(estimator=XGBRegressor(random_state=42, tree_method='hist'), param_distributions=xgb_param_grid, n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=-1),
            }

            # Train, score on the validation split, and persist each model.
            for navn, modell in modeller.items():
                print(f"Trener {navn} for perioden {train_start_date} - {train_end_date}...")
                modell.fit(X_train_pd, y_train_pd)
                prediksjoner = modell.best_estimator_.predict(X_val_pd)
                mse = mean_squared_error(y_val_pd, prediksjoner)
                resultater.setdefault(navn, []).append(mse)
                print(f"{navn} MSE: {mse}")
                joblib.dump(modell.best_estimator_, os.path.join(modell_mappe, f'{navn}_modell_{start_år}.pkl'))

            # Ensemble of the tuned estimators from this period.
            ensemble_modeller = {
                'VotingRegressor': VotingRegressor([
                    ('rf', modeller['RandomForest'].best_estimator_),
                    ('et', modeller['ExtraTrees'].best_estimator_),
                    ('xgb', modeller['XGBoost'].best_estimator_),
                ])
            }

            for navn, modell in ensemble_modeller.items():
                print(f"Trener {navn} for perioden {train_start_date} - {train_end_date}...")
                modell.fit(X_train_pd, y_train_pd)
                prediksjoner = modell.predict(X_val_pd)
                mse = mean_squared_error(y_val_pd, prediksjoner)
                resultater.setdefault(navn, []).append(mse)
                print(f"{navn} MSE: {mse}")
                joblib.dump(modell, os.path.join(modell_mappe, f'{navn}_modell_{start_år}.pkl'))

        # Cache the per-period MSE results for future runs.
        with open('walk_forward_results.pkl', 'wb') as f:
            pickle.dump(resultater, f)

    # Report average MSE per model family.
    print("\nModell evalueringsresultater (gjennomsnittlig MSE over alle perioder):")
    for navn, mse_list in resultater.items():
        gjennomsnitt_mse = sum(mse_list) / len(mse_list)
        print(f"{navn}: Gjennomsnittlig MSE = {gjennomsnitt_mse}")
    print(f"Innholdet av resultater: {resultater}")

    # Guard: with no results at all, min() below would raise ValueError.
    if not resultater:
        print("Ingen modell ble trent.")
        return

    # Best model family = lowest mean MSE across all periods.
    beste_modell_navn = min(resultater, key=lambda navn: sum(resultater[navn]) / len(resultater[navn]))
    print(f"Den beste modellen er: {beste_modell_navn}")

    # Find the most recent start year for which that family was saved.
    siste_start_år = None
    for start_år in range(2001, 2022 - train_window - test_window + 1):
        modell_filnavn = os.path.join(modell_mappe, f'{beste_modell_navn}_modell_{start_år}.pkl')
        if os.path.exists(modell_filnavn):
            siste_start_år = start_år

    # Load the best model from disk, if any snapshot exists.
    beste_modell = None
    if siste_start_år is not None:
        modell_filnavn = os.path.join(modell_mappe, f'{beste_modell_navn}_modell_{siste_start_år}.pkl')
        print(f"Laster inn den lagrede modellen {modell_filnavn}")
        beste_modell = joblib.load(modell_filnavn)
    else:
        print("Ingen lagret modell ble funnet.")

    # Predict over the entire dataset with the best model.
    if beste_modell is not None:
        print("Laster inn den beste modellen.")
        X_full = data[['Year', 'Month', 'Day']].compute()
        y_pred = beste_modell.predict(X_full)

        # Predictions as a DataFrame; ticker column is attached positionally,
        # which is valid because both come from the same Dask frame in the
        # same row order.
        y_pred_df = pd.DataFrame(y_pred, columns=['Predicted'])
        y_pred_df['Ticker'] = data['Ticker'].compute().reset_index(drop=True)

        # Rank by predicted value, highest first.
        y_pred_df = y_pred_df.sort_values('Predicted', ascending=False)

        print("\nDe 10 mest anbefalte aksjene basert på modelltreningen:")
        print(y_pred_df[['Ticker', 'Predicted']].head(10))
    else:
        print("Ingen modell ble trent.")
# Run the full pipeline only when executed as a script, not on import.
if __name__ == "__main__":
    main()