AlgoMan HFRE v2.4 — Hierarchical Factor Ranking Engine (Download)

I did a quick test to show how effective the autocorrelation function is. It was very simple: first I let the app choose 30 anchors with no restriction on autocorrelation. Then I built a ranking system with the metamodel only (no support features, just for this test).

This is the OOS backtest with no limitations on autocorrelation.


Then I did a second test, letting the app choose 30 features with a minimum autocorrelation level of 0.95.

A third test let the app choose 30 features with a minimum autocorrelation level of 0.95 and use 5 support features when building nodes.

*Sell rule - rank<95

6 Likes

If I assign 0 to all NAs, I could then go into the downloaded .csv and change all the 0s to Rand() (which outputs a random number between 0 and 1). Would that eliminate the problem? Or would it cause more problems?

That was my idea above. It completely resolves the spurious correlation problem. Not sure how it fits with the rest of what @algoman is doing in his code. For the regression this may not be ideal.

But your point is also well taken that the NAs may be best handled in the csv without AlgoMan having to change his code.

It would break the ties, but there is a risk that you are introducing new problems. You can always test with one csv with the random number and one without.

When I have time I will do some NA-handling tests. I have a few in mind.

I have an idea. Post-download, take all NAs and substitute 0.5 * Rand() + 0.25. That way you get a random number between 0.25 and 0.75. That would roughly replicate the neutral treatment of NAs without the horrible problems you get from masses of stocks hovering at exactly 0.5. And it avoids the problem of having NA stocks rank extremely high or low, which would add a lot of noise.

I will make a mini app that imports a CSV file, imputes values in place of the NAs (0.5), and exports a new CSV file that can be used in the HFRE app. I want to try a few methods, including randomized imputation.

My biggest concern is that the exported P123 ranking system will differ too much from the rank we see in the app. I might end up having to run two parallel datasets: one imputed set through workflow steps 3 to 5, then run the validation on the original dataset with all the NAs. Will have to give it some thought :face_with_monocle:

2 Likes

This will be easy; I've got the basics from the HFRE app already, and Claude can do the rest of the work.

After some AI conversations I ended up with 5 different methods to trial.

  • Random jitter (as suggested by Jrinne and Yuval)
  • Group-based median: use the median for the Industry/Sector
  • k-Nearest Neighbours (kNN)
  • MICE
  • Rank-preserving

Will do a longer post with explanations of the different methods when the mini app is ready.

Then hopefully at least one of the imputation methods will improve the final HFRE ranking system. Should have it finished during the weekend.

2 Likes

Running final trials; should be ready tomorrow or Sunday, with some screenshots and detailed explanations of all the imputation methods that can be used.

I would still be careful using features with very high NA rates; there is a risk that we essentially create a synthetic feature derived from the other features. But who knows, that might work great?!

Anyone who decides to use the imputation tool, please report back what works and what doesn't. I won't have time to test much before I make it public.

If one of the methods turns out to be superior, I will add it to the HFRE main app as an option.

METHOD 1: RANDOM JITTER

HOW IT WORKS:

For each 0.5 value in a selected feature:
new_value = 0.5 + uniform_random(-range, +range)
new_value = clamp(new_value, 0.0, 1.0)

PARAMETERS:

Jitter Range: Half-width of the random range (default 0.25)

  • 0.25 produces values in [0.25, 0.75]
  • 0.10 produces values in [0.40, 0.60]
  • 0.49 produces values in [0.01, 0.99]

PROS:

  • Simplest method, easiest to understand
  • Breaks the 0.5 cluster that causes tied ranks
  • Does NOT inflate correlations between features
  • Does NOT create mechanical relationships
  • Fast computation

CONS:

  • Adds zero real information
  • The jitter is purely random noise
  • Does not use any relationship between features
  • Stocks that were genuinely median-ranked also get jittered
    (but since we can't distinguish them from NAs, this is
    unavoidable for any method)

WHEN TO USE:

  • As a baseline to compare against smarter methods
  • When you want to break tied ranks without making
    assumptions about what the missing values should be
  • When correlation inflation from other methods is a concern
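The pseudocode above translates almost line-for-line into NumPy. A minimal sketch (assuming, as elsewhere in this thread, that exact 0.5 is the NA sentinel in the exported rank data):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so runs are reproducible

def random_jitter(values: np.ndarray, jitter_range: float = 0.25) -> np.ndarray:
    """Jitter exact-0.5 entries by uniform(-range, +range), clamped to [0, 1]."""
    out = values.astype(float).copy()
    mask = out == 0.5
    # Draw one noise value per sentinel cell, then clamp to the valid rank range
    noise = rng.uniform(-jitter_range, jitter_range, size=int(mask.sum()))
    out[mask] = np.clip(0.5 + noise, 0.0, 1.0)
    return out
```

Note this is also equivalent (at the default range of 0.25) to the 0.5 * Rand() + 0.25 substitution suggested earlier in the thread.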

METHOD 2: GROUP-BASED MEDIAN

HOW IT WORKS:

For each stock with a 0.5 value on a feature:

  1. Look at the stock's classification (e.g., SubIndustry)
  2. Find all OTHER stocks in that group on the same date
    that have real (non-0.5) values for this feature
  3. If the group has enough stocks (>= minimum size),
    use the MEDIAN of their values
  4. If not enough, move up the hierarchy:
    SubIndustry -> Industry -> SubSector -> Sector -> Universe

REQUIRED COLUMNS:

The downloaded P123 CSV dataset MUST include classifier
columns with these EXACT names:

SubIndustry, Industry, SubSector, Sector

These are P123's standard GICS-based classification columns.
Add them to your ranking system's "Additional Formulas" or
universe columns before downloading the CSV. Without at least
one of these columns, Method 2 cannot run.

PARAMETERS:

Minimum Group Size: Min non-NA stocks in a group (default 20)

  • Higher = more reliable medians, coarser groups
  • Lower = more specific groups, noisier medians

HIERARCHY:

SubIndustry (most specific)
-> Industry
-> SubSector
-> Sector
-> Full Universe (least specific)

PROS:

  • Uses economic structure (bank stocks imputed from banks)
  • Intuitive and interpretable
  • Moderate correlation inflation (group membership is a
    weak predictor)

CONS:

  • Requires SubIndustry/Industry/SubSector/Sector columns
    with those exact names in the CSV
  • All missing stocks in a group get the SAME value
  • Coarse groups (Sector) may not be specific enough
  • Does not use factor relationships

WHEN TO USE:

  • When your CSV includes the classifier columns
    (SubIndustry, Industry, SubSector, Sector)
  • When you believe missing values relate to group membership
  • As a middle ground between random jitter and full ML
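A minimal pandas sketch of the fallback logic, assuming the classifier columns described above and 0.5 as the NA sentinel; apply it per date cross-section (e.g. inside a groupby on the date column):

```python
import pandas as pd

HIERARCHY = ["SubIndustry", "Industry", "SubSector", "Sector"]

def group_median_impute(df: pd.DataFrame, feature: str,
                        min_group: int = 20) -> pd.Series:
    """Fill 0.5 sentinels with a group median, walking up the GICS hierarchy."""
    out = df[feature].copy()
    missing = out == 0.5
    known = df.loc[~missing]
    for idx in df.index[missing]:
        for level in HIERARCHY:
            # Peers: stocks in the same group that have a real value
            peers = known.loc[known[level] == df.at[idx, level], feature]
            if len(peers) >= min_group:
                out.at[idx] = peers.median()
                break
        else:
            # Not enough peers at any level: fall back to the universe median
            out.at[idx] = known[feature].median()
    return out
```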

METHOD 3: k-NEAREST NEIGHBOURS (kNN)

HOW IT WORKS:

For each date cross-section independently:

  1. Convert all 0.5 values in selected features to NaN
  2. For each stock with NaN values, find the k stocks
    most similar to it based on NON-missing features
  3. Average the neighbours' values for the missing features
  4. Optionally re-rank-normalise to restore [0, 1] scale

PARAMETERS:

k: Number of neighbours (default 5, range 3-50)

  • Lower k = more specific, noisier
  • Higher k = smoother, more regressive to mean

Weights: How to weight neighbours

  • uniform: All k neighbours weighted equally
  • distance: Closer neighbours contribute more

PROS:

  • Uses full factor structure (all known features inform
    the imputation)
  • Handles staggered missingness naturally
  • Non-parametric (no linearity assumption)
  • Well-understood, widely-used method

CONS:

  • Imputed values are mechanically correlated with the
    features used to find neighbours -> correlation inflation
  • Computationally heavier for large datasets
  • If many features are missing for a stock, neighbours
    are found using fewer dimensions (less reliable)
  • The imputed value lies on the "surface" of existing
    data, potentially understating true variance

WHEN TO USE:

  • Features have moderate NA rates (10-25%)
  • You want data-driven imputation without strong
    parametric assumptions
  • Good general-purpose choice
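Steps 1-3 can be sketched with scikit-learn's KNNImputer, which handles staggered missingness via NaN-aware distances. The "Date" column name and the 0.5 sentinel are assumptions from this thread, and the optional re-rank step is omitted:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def knn_impute_by_date(df: pd.DataFrame, feature_cols: list[str],
                       k: int = 5, weights: str = "uniform") -> pd.DataFrame:
    """Treat 0.5 as NaN, then kNN-impute each date cross-section independently."""
    out = df.copy()
    for _, idx in out.groupby("Date").groups.items():
        block = out.loc[idx, feature_cols].replace(0.5, np.nan)
        imputer = KNNImputer(n_neighbors=k, weights=weights)
        out.loc[idx, feature_cols] = imputer.fit_transform(block)
    return out
```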

METHOD 4: ITERATIVE IMPUTATION (MICE)

HOW IT WORKS:

Multiple Imputation by Chained Equations:

  1. Convert all 0.5 values in selected features to NaN
  2. Fill initial guesses (column medians)
  3. For each feature with NaN values:
    • Fit a regression model using ALL other features
    • Predict the missing values
  4. Repeat step 3 for all features (one full "round")
  5. Cycle through multiple rounds until convergence
  6. Optionally re-rank-normalise to restore [0, 1] scale

Operates per-date cross-section independently.

PARAMETERS:

Max Iterations: Number of full rounds (default 10)

  • More = better convergence, slower
  • 10 is usually sufficient

Estimator: The regression model for each step

  • BayesianRidge: Regularised, handles collinearity well
  • Ridge: Simpler, good for correlated features
  • ExtraTrees: Non-linear, can capture complex patterns
    but slower and may overfit

PROS:

  • Uses the FULL multivariate correlation structure
  • Considered the gold standard for missing data
  • Iterative refinement improves estimates each round
  • Can capture complex conditional relationships

CONS:

  • Most computationally expensive method
  • Strongest correlation inflation risk (every feature
    is predicted from every other feature)
  • Assumes Missing At Random (MAR) - the probability of
    missingness depends only on observed data. P123 data
    partially violates this (missingness is related to
    stock type: pre-revenue, foreign ADRs, etc.)
  • Imputed values are the most tightly coupled to the
    correlation structure, creating circular logic risk

WHEN TO USE:

  • You have relatively few features selected (<30-40)
  • NA rates are moderate (10-20%)
  • You want the most sophisticated imputation
  • You have time to wait for computation
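A minimal sketch of one date cross-section using scikit-learn's IterativeImputer (its single-imputation variant of MICE), with the median initial fill and BayesianRidge estimator described above; the 0.5 sentinel is again an assumption from this thread:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

def mice_impute(block: pd.DataFrame, max_iter: int = 10) -> pd.DataFrame:
    """0.5 -> NaN, then chained-equation imputation on one date cross-section."""
    X = block.replace(0.5, np.nan)
    imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=max_iter,
                               initial_strategy="median", random_state=0)
    return pd.DataFrame(imputer.fit_transform(X),
                        index=block.index, columns=block.columns)
```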

METHOD 5: RANK-PRESERVING IMPUTATION

HOW IT WORKS:

For each date cross-section, for each feature with NAs:

  1. Identify stocks with real values (known) and stocks
    with 0.5/NA values (missing)
  2. Select the top-N features most correlated with the
    target feature (using known-value stocks only)
  3. Using these predictor features, predict a RELATIVE
    ORDERING among the missing stocks:
    • kNN: Find nearest neighbours and average their ranks
    • Ridge: Linear regression to predict rank positions
  4. Insert the missing stocks into the existing ranking
    at their predicted positions
  5. Re-rank the entire feature to [0, 1] scale

This is the only method designed specifically for
rank-normalised data.

PARAMETERS:

Predictor: Model for ordering prediction

  • kNN: Non-parametric, uses k nearest neighbours
  • Ridge: Linear, uses regression coefficients

k (for kNN): Number of neighbours (default 5)

Top-N Features: Number of correlated features to use
as predictors (default 20)

  • Higher = more information, more noise
  • Lower = cleaner signal, may miss relevant features

PROS:

  • Output is inherently rank-normalised (no clamping
    or re-ranking needed)
  • Preserves the rank distribution shape
  • Most natural fit for HFRE's data format
  • Predicts RELATIVE ORDER, not absolute values
  • Less prone to outlier imputation

CONS:

  • Custom implementation (less battle-tested than kNN
    or MICE from sklearn)
  • Still uses other features -> correlation inflation
  • The predicted ordering is approximate
  • Computationally moderate

WHEN TO USE:

  • You want imputation that respects the rank-normalised
    nature of P123 data
  • You care about maintaining the distributional shape
  • Good choice when other methods produce values outside
    [0, 1] that need aggressive clamping
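Since this method is custom, here is a simplified sketch of the idea for one date cross-section. It uses an ordinary least-squares fit in place of ridge (no regularization) as the ordering predictor, and a percentile rank for the final re-rank to (0, 1]:

```python
import numpy as np
import pandas as pd

def rank_preserving_impute(block: pd.DataFrame, target: str,
                           top_n: int = 20) -> pd.Series:
    """Predict an ordering for missing stocks from correlated features, then re-rank."""
    y = block[target].replace(0.5, np.nan)      # step 1: 0.5 sentinel -> NaN
    missing = y.isna()
    if not missing.any() or missing.all():
        return block[target]
    others = [c for c in block.columns if c != target]
    # Step 2: top-N features most correlated with the target, known rows only
    corrs = block.loc[~missing, others].corrwith(y[~missing]).abs()
    preds = corrs.nlargest(min(top_n, len(others))).index.tolist()
    # Step 3: fit known ranks on the predictors (least squares with intercept)
    X_known = np.c_[np.ones(int((~missing).sum())), block.loc[~missing, preds]]
    coef, *_ = np.linalg.lstsq(X_known, y[~missing].to_numpy(), rcond=None)
    X_miss = np.c_[np.ones(int(missing.sum())), block.loc[missing, preds]]
    # Steps 4-5: insert predicted scores, then re-rank the whole column
    filled = y.copy()
    filled[missing] = X_miss @ coef
    return filled.rank(pct=True)
```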
2 Likes

I'm giving up on this project because I can't seem to get the factor download to work. I took all my formulas and converted the ones that were ranked on industry or sector or subindustry or subsector to FRank. I took all my composite nodes and relabeled them so that I could manually combine them once the data is available. I ended up with 300 factors, and I tried to download them for a very large universe in the North Atlantic region for just one year (April 2009 to April 2010). The process failed. Even if it had worked it would have taken an hour for each year. (The reason I tried just one year is that it failed for three years.)

Maybe I'll try again in a week or two, but the process has been excruciating. If anyone has any tips, I'll be glad to try them.

1 Like

I am in the same boat due to my number of factors. Thanks for helping test this! Wish it had worked.

I have not done this level of ML testing before, but following this post has been very interesting. I had an idea that might be helpful if anyone wants to run with it: closest neighbor but limited to same industry or subindustry if feasible.

@AlgoMan congrats on all the hard work

I was able to get the data into AlgoMan’s format, download it, and run through his bundling method several times, though not without several hiccups each time. I think he has created an elaborately documented system with a rational explanation for each step, but at this point I have been unable to develop a set of factors organized to give me performance close to his. As well documented as it is, I still seem to have difficulty. But AlgoMan has done a great service by showing me a software structure that I would like to emulate, with my own ideas and methods. I have been using Jupyter Lab with a different notebook for each step; an integrated program with a built-in report is a much better toolset. I have not been using my P123 subscription for almost 6 months due to circumstances but intend to pick up where I left off. I very much appreciate AlgoMan’s superb accomplishment.

Have you tried Claude Cowork with the API? It chains together Python programs and generates a nice final report (Excel spreadsheet). It is a little irritating that sometimes I have to paste a command into the terminal, and I can see how it might not be as useful for advanced programmers, but it has helped me a lot.

Edit: I have not used Jupyter Lab, but I can see how it does a lot of what Cowork does. I will have to try Jupyter Lab–if I do not get too dependent on Cowork. LOL

I had similar problems with FRanks, ~2000 stocks, and a 2-week period.
What works for me is reducing the number of factors per batch to 200. It may take up to 60 minutes, but it usually works.

AlgoMan has been doing a great job with this app. It may be a bit of a complicated workflow for most users, so probably the way forward is to segment the app into 'simple' and 'advanced' workflows.

I made a script (app) that can merge CSV files. As long as you use the same dates and the same universe in the factor download, you can use it.

If the issues you are experiencing are due to overly large factor files, this should solve the problem.
Just split your download into batches of factors, download the files, then merge them with the app. You could actually bypass the 300-factor limit using this app.

I still suggest starting with 50 features or so to see how the app works; the analysis in the HFRE app will be very slow with large datasets.

The app is also handy if one wants to add an additional target.

Download exe file (for windows)

Script below for python users.

"""
CSV Column Merger — GUI Application

Merges columns from one CSV file (File B) into another (File A) by row position.
Both files must have the same number of rows. Only non-duplicate columns from
File B are added. A row-by-row Ticker comparison warns about mismatches.
"""

import tkinter as tk
from tkinter import filedialog, messagebox, ttk
from pathlib import Path
from threading import Thread
from io import StringIO

import pandas as pd


def find_column(df: pd.DataFrame, candidates: list[str]) -> str | None:
    """Case-insensitive column lookup. Returns actual column name or None."""
    cols_lower = {c.lower(): c for c in df.columns}
    for candidate in candidates:
        if candidate.lower() in cols_lower:
            return cols_lower[candidate.lower()]
    return None


class MergeApp:
    """Tkinter GUI for merging columns from two CSV files."""

    def __init__(self) -> None:
        self.root = tk.Tk()
        self.root.title("CSV Column Merger")
        self.root.geometry("560x340")
        self.root.resizable(False, False)

        self.df_a: pd.DataFrame | None = None
        self.df_b: pd.DataFrame | None = None
        self.path_a: str = ""
        self.path_b: str = ""

        # StringVars for dynamic labels
        self.file_a_var = tk.StringVar(value="No file selected")
        self.file_b_var = tk.StringVar(value="No file selected")
        self.info_a_var = tk.StringVar(value="")
        self.info_b_var = tk.StringVar(value="")
        self.status_var = tk.StringVar(value="Select two CSV files to merge.")

        self._build_gui()

    # ------------------------------------------------------------------ #
    #  GUI construction
    # ------------------------------------------------------------------ #
    def _build_gui(self) -> None:
        pad = {"padx": 10, "pady": 4}

        # --- File A frame ---
        frame_a = tk.LabelFrame(self.root, text="File A  (target)", padx=8, pady=6)
        frame_a.pack(fill="x", **pad)

        row_a = tk.Frame(frame_a)
        row_a.pack(fill="x")
        tk.Button(row_a, text="Browse...", width=10,
                  command=self._browse_file_a).pack(side="left")
        tk.Label(row_a, textvariable=self.file_a_var,
                 anchor="w", fg="grey30").pack(side="left", padx=(8, 0), fill="x", expand=True)

        tk.Label(frame_a, textvariable=self.info_a_var,
                 anchor="w", fg="green4").pack(fill="x")

        # --- File B frame ---
        frame_b = tk.LabelFrame(self.root, text="File B  (source of new columns)", padx=8, pady=6)
        frame_b.pack(fill="x", **pad)

        row_b = tk.Frame(frame_b)
        row_b.pack(fill="x")
        tk.Button(row_b, text="Browse...", width=10,
                  command=self._browse_file_b).pack(side="left")
        tk.Label(row_b, textvariable=self.file_b_var,
                 anchor="w", fg="grey30").pack(side="left", padx=(8, 0), fill="x", expand=True)

        tk.Label(frame_b, textvariable=self.info_b_var,
                 anchor="w", fg="green4").pack(fill="x")

        # --- Merge button ---
        self.merge_btn = tk.Button(
            self.root, text="Merge & Save As...", width=30, height=2,
            command=self._do_merge, state=tk.DISABLED,
        )
        self.merge_btn.pack(pady=12)

        # --- Progress bar (hidden until loading) ---
        self.progress_frame = tk.Frame(self.root)
        self.progress_frame.pack(fill="x", padx=10, pady=(0, 2))
        self.progress_bar = ttk.Progressbar(
            self.progress_frame, mode="determinate", length=400,
        )
        self.progress_label = tk.Label(
            self.progress_frame, text="", anchor="w", fg="grey40",
        )
        # Start hidden — widgets are packed on demand in _show_progress / _hide_progress

        # --- Status bar ---
        tk.Label(
            self.root, textvariable=self.status_var,
            relief="sunken", anchor="w", padx=6,
        ).pack(fill="x", side="bottom", padx=10, pady=(0, 8))

    # ------------------------------------------------------------------ #
    #  File browsing
    # ------------------------------------------------------------------ #
    def _browse_file_a(self) -> None:
        self._load_file("a")

    def _browse_file_b(self) -> None:
        self._load_file("b")

    def _load_file(self, which: str) -> None:
        path = filedialog.askopenfilename(
            title=f"Select File {'A' if which == 'a' else 'B'}",
            filetypes=[("CSV files", "*.csv"), ("All files", "*.*")],
        )
        if not path:
            return

        label = "A" if which == "a" else "B"

        # Show path immediately and set "loading" state
        if which == "a":
            self.file_a_var.set(self._short_path(path))
            self.info_a_var.set("Loading...")
        else:
            self.file_b_var.set(self._short_path(path))
            self.info_b_var.set("Loading...")

        # Disable buttons during load
        self._set_buttons_enabled(False)
        self._show_progress(f"Loading File {label}...")

        # Launch background thread
        thread = Thread(
            target=self._load_file_worker, args=(path, which), daemon=True,
        )
        thread.start()

    def _load_file_worker(self, path: str, which: str) -> None:
        """Background thread: read CSV in chunks and report progress."""
        label = "A" if which == "a" else "B"
        try:
            file_size = Path(path).stat().st_size
            chunk_size = 64 * 1024  # 64 KB read chunks
            bytes_read = 0
            raw_chunks: list[str] = []

            with open(path, "r", encoding="utf-8", errors="replace") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    raw_chunks.append(chunk)
                    bytes_read += len(chunk.encode("utf-8", errors="replace"))
                    pct = min(bytes_read / file_size * 100, 100) if file_size else 100
                    # Schedule UI update on main thread
                    self.root.after(0, self._update_progress, pct,
                                    f"Reading File {label}... {pct:.0f}%")

            # Parse the full text with pandas
            self.root.after(0, self._update_progress, 100,
                            f"Parsing File {label}...")
            full_text = "".join(raw_chunks)
            df = pd.read_csv(StringIO(full_text))

            # Deliver result back to main thread
            self.root.after(0, self._on_file_loaded, which, path, df, None)

        except Exception as e:
            self.root.after(0, self._on_file_loaded, which, path, None, e)

    def _on_file_loaded(self, which: str, path: str,
                        df: pd.DataFrame | None, error: Exception | None) -> None:
        """Main-thread callback after background load finishes."""
        self._hide_progress()
        self._set_buttons_enabled(True)

        if error is not None:
            messagebox.showerror("Load Error", f"Failed to load CSV:\n{error}")
            if which == "a":
                self.file_a_var.set("No file selected")
                self.info_a_var.set("")
                self.df_a = None
            else:
                self.file_b_var.set("No file selected")
                self.info_b_var.set("")
                self.df_b = None
            self._check_ready()
            return

        # Store data
        if which == "a":
            self.df_a = df
            self.path_a = path
            self.info_a_var.set(f"Loaded \u2014 {len(df):,} rows, {len(df.columns)} columns")
        else:
            self.df_b = df
            self.path_b = path
            self.info_b_var.set(f"Loaded \u2014 {len(df):,} rows, {len(df.columns)} columns")

        self._check_ready()

    # ------------------------------------------------------------------ #
    #  Progress bar helpers
    # ------------------------------------------------------------------ #
    def _show_progress(self, text: str) -> None:
        self.progress_bar.pack(fill="x", pady=(0, 2))
        self.progress_label.pack(fill="x")
        self.progress_bar["value"] = 0
        self.progress_label.config(text=text)
        self.status_var.set(text)
        self.root.update_idletasks()

    def _update_progress(self, value: float, text: str) -> None:
        self.progress_bar["value"] = value
        self.progress_label.config(text=text)
        self.status_var.set(text)

    def _hide_progress(self) -> None:
        self.progress_bar.pack_forget()
        self.progress_label.pack_forget()
        self.progress_bar["value"] = 0

    def _set_buttons_enabled(self, enabled: bool) -> None:
        state = tk.NORMAL if enabled else tk.DISABLED
        for widget in self.root.winfo_children():
            if isinstance(widget, tk.LabelFrame):
                for child in widget.winfo_children():
                    if isinstance(child, tk.Frame):
                        for btn in child.winfo_children():
                            if isinstance(btn, tk.Button):
                                btn.config(state=state)
        # Always manage merge button separately via _check_ready
        if enabled:
            self._check_ready()
        else:
            self.merge_btn.config(state=tk.DISABLED)

    @staticmethod
    def _short_path(path: str, max_len: int = 55) -> str:
        """Truncate long paths for display."""
        if len(path) <= max_len:
            return path
        return "..." + path[-(max_len - 3):]

    def _check_ready(self) -> None:
        if self.df_a is not None and self.df_b is not None:
            self.merge_btn.config(state=tk.NORMAL)
            self.status_var.set("Both files loaded. Ready to merge.")
        else:
            self.merge_btn.config(state=tk.DISABLED)

    # ------------------------------------------------------------------ #
    #  Merge logic
    # ------------------------------------------------------------------ #
    def _do_merge(self) -> None:
        assert self.df_a is not None and self.df_b is not None

        # 1 ── Row count check
        if len(self.df_a) != len(self.df_b):
            messagebox.showerror(
                "Row Count Mismatch",
                f"File A has {len(self.df_a):,} rows but File B has "
                f"{len(self.df_b):,} rows.\n\n"
                f"Both files must have the same number of rows.",
            )
            return

        # 2 ── Find Ticker column in both files
        ticker_candidates = ["ticker", "Ticker", "TICKER", "symbol", "Symbol"]
        ticker_a = find_column(self.df_a, ticker_candidates)
        ticker_b = find_column(self.df_b, ticker_candidates)

        if ticker_a is None or ticker_b is None:
            missing = []
            if ticker_a is None:
                missing.append("File A")
            if ticker_b is None:
                missing.append("File B")
            messagebox.showerror(
                "Ticker Column Not Found",
                f"Could not find a Ticker/Symbol column in: {', '.join(missing)}.\n\n"
                f"Expected one of: {', '.join(ticker_candidates)}",
            )
            return

        # 3 ── Row-by-row ticker comparison
        tickers_a = self.df_a[ticker_a].astype(str).values
        tickers_b = self.df_b[ticker_b].astype(str).values

        mismatches: list[tuple[int, str, str]] = []
        for i in range(len(tickers_a)):
            if tickers_a[i] != tickers_b[i]:
                mismatches.append((i + 1, tickers_a[i], tickers_b[i]))

        if mismatches:
            sample = mismatches[:20]
            lines = [f"  Row {row}: '{ta}'  vs  '{tb}'" for row, ta, tb in sample]
            detail = "\n".join(lines)
            if len(mismatches) > 20:
                detail += f"\n  ... and {len(mismatches) - 20:,} more"

            proceed = messagebox.askyesno(
                "Ticker Mismatch Warning",
                f"\u26a0  {len(mismatches):,} of {len(tickers_a):,} rows have "
                f"different Ticker values:\n\n"
                f"{detail}\n\n"
                f"Do you want to continue with the merge anyway?",
            )
            if not proceed:
                self.status_var.set("Merge cancelled by user.")
                return

        # 4 ── Separate new columns vs. duplicates
        cols_a_lower = {c.lower(): c for c in self.df_a.columns}
        new_columns: list[str] = []
        duplicate_columns: list[tuple[str, str]] = []  # (col_b, col_a)

        for col in self.df_b.columns:
            if col.lower() in cols_a_lower:
                existing = cols_a_lower[col.lower()]
                # Skip the shared key columns (Ticker, Date) — never overwrite those
                if col.lower() not in {"ticker", "date", "data", "symbol", "p123 id"}:
                    duplicate_columns.append((col, existing))
            else:
                new_columns.append(col)

        # Ask about overwriting duplicate columns
        overwrite_columns: list[tuple[str, str]] = []
        if duplicate_columns:
            dup_names = [f"  {col_b}" for col_b, _ in duplicate_columns[:20]]
            detail = "\n".join(dup_names)
            if len(duplicate_columns) > 20:
                detail += f"\n  ... and {len(duplicate_columns) - 20} more"

            answer = messagebox.askyesnocancel(
                "Duplicate Columns Found",
                f"{len(duplicate_columns)} column(s) in File B already exist "
                f"in File A:\n\n{detail}\n\n"
                f"Yes = Overwrite with File B values\n"
                f"No = Skip duplicates, only add new columns\n"
                f"Cancel = Abort merge",
            )
            if answer is None:  # Cancel
                self.status_var.set("Merge cancelled by user.")
                return
            if answer:  # Yes — overwrite
                overwrite_columns = duplicate_columns

        if not new_columns and not overwrite_columns:
            messagebox.showinfo(
                "Nothing to Add",
                "No new columns to add and no duplicates selected "
                "for overwrite.\nNothing to merge.",
            )
            return

        # 5 ── Build merged dataframe (positional alignment)
        self.status_var.set("Merging...")
        self.root.update_idletasks()

        result_df = self.df_a.copy()
        for col in new_columns:
            result_df[col] = self.df_b[col].values
        for col_b, col_a in overwrite_columns:
            result_df[col_a] = self.df_b[col_b].values

        # 6 ── Save As dialog
        default_name = Path(self.path_a).stem + "_merged.csv"
        output_path = filedialog.asksaveasfilename(
            title="Save Merged CSV As",
            defaultextension=".csv",
            filetypes=[("CSV files", "*.csv"), ("All files", "*.*")],
            initialfile=default_name,
            initialdir=str(Path(self.path_a).parent),
        )
        if not output_path:
            self.status_var.set("Save cancelled.")
            return

        # 7 ── Write output
        try:
            result_df.to_csv(output_path, index=False)
        except Exception as e:
            messagebox.showerror("Save Error", f"Failed to save file:\n{e}")
            return

        # Build summary
        summary_parts: list[str] = []
        if new_columns:
            col_list = ", ".join(new_columns[:10])
            if len(new_columns) > 10:
                col_list += f" ... (+{len(new_columns) - 10} more)"
            summary_parts.append(f"Added {len(new_columns)} new column(s):\n  {col_list}")
        if overwrite_columns:
            ow_list = ", ".join(col_b for col_b, _ in overwrite_columns[:10])
            if len(overwrite_columns) > 10:
                ow_list += f" ... (+{len(overwrite_columns) - 10} more)"
            summary_parts.append(f"Overwritten {len(overwrite_columns)} column(s):\n  {ow_list}")

        self.status_var.set(f"\u2713  Saved: {Path(output_path).name}")
        messagebox.showinfo(
            "Merge Complete",
            "\n\n".join(summary_parts) + "\n\n"
            f"Result: {len(result_df):,} rows, {len(result_df.columns)} columns\n"
            f"Saved to: {output_path}",
        )


if __name__ == "__main__":
    app = MergeApp()
    app.root.mainloop()

I have addressed the naming issue in the updated version, v2.4.2.
I have tested a few different datasets now and never have any issues with the bundling. I would like to know exactly what hiccups you are experiencing.

Download v2.4.2

I could make an "AUTO" button on the first page that just speeds through the workflow once the dataset is loaded, all the way to the production tab and the export of the rank system. But I'm not sure more people would use it.

What slightly helped me here is throwing the entire help file into NotebookLM and just asking a lot of questions to make sure I understand it. I've made many mistakes so far, but I'm starting to see a little light at the end of the tunnel.

However, I have a problem. I am now at the last step: production. I have trained the model on all the data, and under "export production model" I pasted in my ranking system, which contains all the names and formulas that are in the CSV file. That part seems to work well.

But for some reason, "copy to clipboard" in step 2 doesn't seem to do anything. I am left with the same ranking system, with the same node weights, as the one I pasted in.

And Save XML didn't work either.

[06:24:26] [HEADER] ══════════════════════════════════════════════════
[06:24:26] [HEADER] STEP 8: PRODUCTION MODEL
[06:24:26] [HEADER] ══════════════════════════════════════════════════
[06:24:26] [INFO]
[06:24:26] [INFO] Training Configuration:
[06:24:26] [INFO] Training period: 2014-02-22 to 2026-02-07
[06:24:26] [INFO] Target: TARGET
[06:24:26] [INFO] Features: 181 reduced
[06:24:26] [INFO]
[06:24:26] [INFO] Enabled Anchors (10):
[06:24:26] [INFO] 1. 944K:Justert driftsresultat (kv) mot kor (Higher)
[06:24:26] [INFO] 2. 895K:Justert salg (kv) mot egenkapital (Higher)
[06:24:26] [INFO] 3. 2739M:Glidende snitt (50d vs 200d) (Higher)
[06:24:26] [INFO] 4. AstTotQ (Lower)
[06:24:26] [INFO] 5. 92.VO-share turnover, 3 months (Lower)
[06:24:26] [INFO] 6. 1728G:EPS-vekst (5år) relativt til brans (Higher)
[06:24:26] [INFO] 7. 100S:Uventet resultat (Surprise) siste 2 (Higher)
[06:24:26] [INFO] 8. 2821K:Endring i varelager mot eiendeler (Higher)
[06:24:26] [INFO] 9. 2757VO:Volum (5d snitt) mot (15d snitt) (Higher)
[06:24:26] [INFO] 10. 306. V-Netto FCF per aksje TTM / Pris. (Higher)
[06:24:26] [INFO]
[06:24:26] [INFO] Node Building Settings:
[06:24:26] [INFO] Regularization: L1 ratio = 0.50
[06:24:26] [INFO] Positive weights: Yes
[06:24:26] [INFO] Alpha: manual (0.01)
[06:24:26] [INFO] Max support features: 5
[06:24:26] [INFO] Build mode: Blended (residual_weight=0.30)
[06:24:26] [INFO]
[06:24:26] [INFO] Meta Model Settings:
[06:24:26] [INFO] L1 ratio: 0.50
[06:24:26] [INFO] Non-negative weights: Yes
[06:24:26] [INFO]
[06:24:26] [INFO] Training production model...
[06:24:26] [INFO] Loading data...
[06:24:27] [INFO] Running Feature Analysis...
[06:26:13] [INFO] Building Nodes...
[06:27:31] [INFO] Fitting Meta Model...
[06:27:46] [INFO] Complete!
[06:27:46] [INFO]
[06:27:46] [INFO] ==================================================
[06:27:46] [INFO] PRODUCTION MODEL RESULTS
[06:27:46] [INFO] ==================================================
[06:27:46] [INFO] Training IC: 0.0991
[06:27:46] [INFO] Training IR: 1.308
[06:27:46] [INFO] Nodes trained: 10
[06:27:46] [INFO] Features used: 38
[06:27:46] [INFO]
[06:27:46] [INFO] Top Nodes by Meta Weight:
[06:27:46] [INFO] 2739M:Glidende snitt (50d vs 200d): 0.321 (5 support)
[06:27:46] [INFO] 944K:Justert driftsresultat (kv) mo: 0.204 (5 support)
[06:27:46] [INFO] 306. V-Netto FCF per aksje TTM / Pr: 0.134 (2 support)
[06:27:46] [INFO] 895K:Justert salg (kv) mot egenkapi: 0.127 (5 support)
[06:27:46] [INFO] 1728G:EPS-vekst (5år) relativt til : 0.116 (5 support)
[06:27:46] [INFO] ... and 5 more nodes
[06:27:46] [INFO]
[06:27:46] [SUCCESS] Step 8 complete - Production model ready for export
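The node-building settings in the log (L1 ratio = 0.50, manual alpha = 0.01, positive weights only) read like an elastic-net configuration. A minimal sketch of how such a node regression could be set up with scikit-learn, assuming the app uses an equivalent solver; the data here is synthetic and the variable names are hypothetical, not AlgoMan's actual code:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic stand-in for one anchor plus 5 support features vs. a return target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
true_w = np.array([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])
y = X @ true_w + rng.normal(scale=0.5, size=500)

# Mirrors the logged settings: alpha = 0.01, L1 ratio = 0.50, non-negative weights
node = ElasticNet(alpha=0.01, l1_ratio=0.5, positive=True)
node.fit(X, y)

print(node.coef_)  # all coefficients >= 0; L1 shrinks weak features toward zero
```

With `positive=True`, every fitted weight is constrained to be non-negative, matching the "Positive weights: Yes" line, while the L1 component of the penalty is what lets weak support features drop out entirely.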

You are right, there is a bug in the production export :roll_eyes:
I will fix it tonight.


Let us know how to reproduce the errors in generating datasets. I'm assuming it's the generation part that fails, not the download?

We are currently working on a Streamlit app called FactorMiner that will use generated datasets for factor engineering. It will run locally on our infrastructure, so there is no need to download. But the dataset still has to be generated, so fixing any problems in generation will be our next priority. The idea is to support huge datasets with thousands of factors. And, FYI, any apps that run locally will be able to use raw datasets, since there are no data licensing issues.
