Crowdsource Finding Answers at P123 with the Most Efficient Algorithm Known to Man

TL;DR: Thompson Sampling zeroes in on the best solution astonishingly fast. If you’ve never seen it in action, I’m happy to share a short Python demo that shows how it rapidly identifies the best option, even with noisy data. You’d be truly shocked. I was anyway. Just ask!

I believe a serious machine learning site with enough active, data-driven members would be all over this idea. I understand it may not be practical with our present membership, but I wanted to put it out there for the future, especially if the statistical, data-driven membership at P123 continues to grow.

Once it reaches critical mass, this wouldn’t just boost community involvement—it could become a powerful marketing tool for P123 as a truly data-driven platform.

Instead of just sharing tips and anecdotes, what if we could crowdsource the discovery process itself—and do it with scientific efficiency?

I propose—at some point in P123’s growth, if not now—we form groups where members collaborate to find the best trading tactics using Thompson sampling, the most efficient explore-exploit algorithm available. It’s what top tech companies use to quickly find the best ads, layouts, and recommendations.

How it could work:

  • The group creates a list of trading ideas to test—order types, best times of day to trade, best days of the week, etc.
  • Each member is assigned a tactic to try, selected via Thompson sampling. This means the system directs us to test both new ideas and those already showing promise, efficiently balancing exploration and exploitation (see the sketch after this list).
  • Results are reported back to the group. As more data comes in, Thompson sampling continually steers further testing toward the most promising strategies.
  • The group quickly zeroes in on the most effective tactics, based on actual evidence, not just opinion.
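
To make the assignment step concrete, here is a minimal sketch. The tactic names and tallies are hypothetical; all the system would actually need is a running success/failure count per tactic, pooled across members:

import numpy as np

# Hypothetical running tallies, pooled across all members
tactics = ["VWAP algo", "TWAP algo", "Adaptive algo", "Market-on-open"]
successes = np.array([12, 30, 25, 5])   # reported "wins" per tactic
failures = np.array([18, 25, 30, 10])   # reported "losses" per tactic

# Thompson sampling: draw one sample from each tactic's Beta posterior
# (uniform Beta(1,1) prior), then assign the tactic with the highest draw
samples = np.random.beta(successes + 1, failures + 1)
print("Next member is assigned:", tactics[int(np.argmax(samples))])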

Why do this?

  • It’s proven: Thompson sampling is provably near-optimal for this kind of explore-exploit problem, making it one of the fastest, most reliable ways to discover what works.
  • It’s community-powered: Everyone’s efforts combine to benefit the whole group.
  • It builds real knowledge: We move beyond guesswork to data-driven answers.

If P123 could provide a simple tool to manage these kinds of projects, we could answer practical questions like Doney1000’s faster and better than ever.

We’d start with a handful of ideas the community cares about most—maybe IB algos (even though I don’t use IB myself), or any topic with broad interest. With some support from P123 to set this up, this approach could expand to many more community questions over time.

Would anyone else be interested in this kind of group project, if the tools were available? I’d love to hear your thoughts, suggestions, or additional use cases!

3 Likes

Interesting, but a question: how could each stock trading system be scored? You would need to rank every trade, profit/loss, and time held . . . Seems like this could be used as a scoring method for every existing screen or simulated strategy backtest. I especially appreciate having knowledgeable individuals like yourself and others with different backgrounds experimenting with new concepts.
Yes! I’d definitely like to see your demo.

Great question! Thompson Sampling—or an alternative Bayesian algorithm called Upper Confidence Bound (UCB)—can absolutely handle continuous variables, so you’re not limited to simple “winner/loser” situations.

However, it’s easiest to set up these systems as a binary or categorical problem—think “success” or “failure” for each attempt, or simply tallying a frequency of successes.

That actually makes trading a perfect fit; take trade execution for a buy or sell order with an IB algo as an example. You might define “success” as a trade executing at a price better than the open, or better than the close (same day or previous close). You could also use benchmarks like VWAP.

Ultimately, the beauty is that success or failure can be defined in whatever way the collaborating group decides. It’s flexible—just needs a clear, shared definition for what counts as a “win” for each trading idea.
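
As a minimal sketch of what such a shared definition might look like in code (the function and field names here are hypothetical, not part of any P123 tool):

def trade_success(fill_price: float, benchmark: float, side: str) -> bool:
    # True if the fill beat the agreed benchmark (e.g., same-day VWAP)
    if side == "buy":
        return fill_price < benchmark   # bought cheaper than the benchmark
    return fill_price > benchmark       # sold richer than the benchmark

# Example: a buy filled at 100.25 against a VWAP of 100.40 counts as a "win"
print(trade_success(fill_price=100.25, benchmark=100.40, side="buy"))  # True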

TL;DR:
If you form a group, each individual gets to the best solution MUCH faster than they could by themselves, using the best algorithm available. What more could you ask for?

Background:

Historically, these are called multi-armed bandit problems—think of a row of slot machines (“arms”), each with a different (unknown) payout frequency. You have 500 pulls and keep what you win. Your goal is simple: make the most money possible. To do that, you want to find and pull the arm that pays out best as often as possible—but you need to experiment a bit to discover which one that is.

Thompson Sampling is a proven optimal algorithm for solving this kind of problem: it explores to learn, then exploits to maximize rewards. In real life, Amazon and other companies use similar algorithms for things like testing which web page or ad works best, by showing different users different options and counting “successes.”

For trading, you might use this to find the best trading algo or strategy (e.g., best fill rates)—the idea is exactly the same.

How to use this code:

  • Set the true_success_rates list to whatever probabilities you want (these are the hidden win rates).
  • Run the code (no seed, so you can see different results each time).
  • See how quickly the algorithm finds and sticks to the best arm—visualized and in stats.

Enjoy!

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: set a random seed for reproducibility
#np.random.seed(42)

# Set style for prettier plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Configuration
true_success_rates = [0.10, 0.25, 0.15, 0.40]  # True success probabilities
n_arms = len(true_success_rates)
n_rounds = 500

# Color scheme
colors = plt.cm.viridis(np.linspace(0, 0.9, n_arms))
optimal_arm = np.argmax(true_success_rates)

# Track successes and failures for each arm
successes = np.zeros(n_arms)
failures = np.zeros(n_arms)

# Track choices and rewards
arm_history = []
rewards = []
cumulative_rewards = []

# Run Thompson Sampling
for i in range(n_rounds):
    # For each arm, sample a probability from its Beta distribution
    sampled_probs = [np.random.beta(successes[j] + 1, failures[j] + 1) 
                     for j in range(n_arms)]
    
    # Choose the arm with the highest sampled probability
    chosen_arm = np.argmax(sampled_probs)
    arm_history.append(chosen_arm)
    
    # Simulate pulling the arm
    reward = np.random.rand() < true_success_rates[chosen_arm]
    rewards.append(reward)
    cumulative_rewards.append(sum(rewards))
    
    # Update successes or failures
    if reward:
        successes[chosen_arm] += 1
    else:
        failures[chosen_arm] += 1

# Calculate final statistics
arm_counts = np.bincount(arm_history, minlength=n_arms)
# Posterior-mean estimate (Beta(1,1) prior); avoids divide-by-zero for unpulled arms
estimated_rates = (successes + 1) / (successes + failures + 2)

# Create figure with subplots
fig = plt.figure(figsize=(14, 8))
gs = fig.add_gridspec(2, 2, hspace=0.3, wspace=0.3, height_ratios=[1.5, 1])


# 1. Total arm selection frequency (top right)
ax2 = fig.add_subplot(gs[0, 1])
bars = ax2.bar(range(n_arms), arm_counts, color=colors, edgecolor='black', linewidth=1.5)

# Highlight the optimal arm
bars[optimal_arm].set_edgecolor('darkgreen')
bars[optimal_arm].set_linewidth(3)

# Add value labels on bars
for count, bar in zip(arm_counts, bars):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 5,
             f'{count}', ha='center', va='bottom', fontsize=12, fontweight='bold')

ax2.set_xlabel('Arm', fontsize=14)
ax2.set_ylabel('Times Selected', fontsize=14)
ax2.set_title('Total Times Each Arm Was Selected', fontsize=16, fontweight='bold')
ax2.set_xticks(range(n_arms))
ax2.set_xticklabels([f'Arm {i}' for i in range(n_arms)])
ax2.set_ylim(0, max(arm_counts) * 1.15)

# 2. Summary statistics table (bottom)
ax3 = fig.add_subplot(gs[1, 0])
ax3.axis('tight')
ax3.axis('off')

# Create table data
table_data = []
headers = ['Arm', 'Times Pulled', 'True Success Rate']

for i in range(n_arms):
    row = [
        f'Arm {i}',
        f'{arm_counts[i]}',
        f'{true_success_rates[i]:.3f}'
    ]
    table_data.append(row)

# Create table
table = ax3.table(cellText=table_data, colLabels=headers, 
                  cellLoc='center', loc='center',
                  colWidths=[0.3, 0.35, 0.35])

# Style the table
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.8)

# Color code the cells
for i in range(n_arms):
    # Color the arm column
    table[(i+1, 0)].set_facecolor(colors[i])
    table[(i+1, 0)].set_text_props(weight='bold')
    
    # Highlight the row of the optimal arm
    if i == optimal_arm:
        for j in range(3):
            table[(i+1, j)].set_facecolor('#90EE90')  # Light green
            table[(i+1, j)].set_text_props(weight='bold')

# Style header row
for j in range(3):
    table[(0, j)].set_facecolor('#4472C4')
    table[(0, j)].set_text_props(weight='bold', color='white')

ax3.set_title(f'Summary Statistics After {n_rounds} Trials', fontsize=16, fontweight='bold', pad=20)

# Add main title and key insight
plt.suptitle('Thompson Sampling: Learning Which Arm is Best', fontsize=20, fontweight='bold')

# Add text box with key insight
textstr = (f'Key Insight: Thompson Sampling quickly identified and focused on Arm {optimal_arm} '
           f'({true_success_rates[optimal_arm]:.0%} success rate)\nwhile still exploring other options. '
           f'It pulled the best arm {arm_counts[optimal_arm]} times out of {n_rounds}!')
props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
fig.text(0.5, 0.02, textstr, transform=fig.transFigure, fontsize=14,
         horizontalalignment='center', bbox=props)

plt.tight_layout()
plt.show()

# Print simple summary
print("\n" + "="*50)
print("THOMPSON SAMPLING RESULTS")
print("="*50)
print(f"\nAfter {n_rounds} trials:")
print(f"- Best arm (Arm {optimal_arm}) was pulled {arm_counts[optimal_arm]} times ({arm_counts[optimal_arm]/n_rounds*100:.1f}%)")
print(f"- Algorithm achieved {sum(rewards)} total rewards")
print(f"- Average reward per pull: {np.mean(rewards):.3f}")
print(f"- Optimal average would be: {max(true_success_rates):.3f}")

# Calculate what random selection would have achieved
random_expected_reward = np.mean(true_success_rates)
random_total_expected = random_expected_reward * n_rounds

print(f"\nComparison with random selection:")
print(f"- Random would pull each arm ~{n_rounds/n_arms:.0f} times")
print(f"- Random expected total rewards: {random_total_expected:.0f}")
print(f"- Random expected average reward: {random_expected_reward:.3f}")
print(f"- Thompson Sampling improvement: {(sum(rewards)/random_total_expected - 1)*100:.1f}% better!")

print("\nThe algorithm successfully learned which arm was best!")

Key takeaways:

  • Thompson Sampling finds and focuses on the best option with remarkable speed and efficiency, while still checking the others occasionally.
  • The plots and stats show, at a glance, how well it works.
1 Like

Jim,

Thanks for the detailed description of Thompson Sampling. The topic "Crowdsource Finding Answers at P123 with the Most Efficient Algorithm Known to Man" is definitely eye-catching, and I could not resist looking it up to see how it works.

However, I found the following limitations for its use in building trading strategies in particular. Please take a look and respond if you have time.

Many thanks.

Regards
James

Thompson Sampling is effective in certain situations, like online hypothesis testing and multi-armed bandit problems, but it's crucial to understand its limitations and potential pitfalls when applied to financial markets. Applying Thompson Sampling to investing can be risky because it might not be suited to the complex, dynamic nature of financial markets, where factors beyond simple reward probabilities play a significant role.

  • Thompson Sampling is a reinforcement learning algorithm used to find the best option from a set of possibilities when there's uncertainty about their rewards. It's often visualized using the multi-armed bandit problem, where you're trying to find the slot machine with the highest payout.

  • How it works:

The algorithm maintains a probability distribution for each "arm" (or option). It randomly samples from these distributions and chooses the arm with the highest sample. After receiving a reward, it updates the probability distribution for that arm, reflecting its performance.

  • Why it might be wrong for investing:

    • Uncertainty vs. Complex Market Factors: Financial markets are not just about simple rewards. They're influenced by numerous factors, including economic news, investor sentiment, and unpredictable events. Thompson Sampling's focus on rewards might not capture the nuances of these factors.
    • Time-Dependent Rewards: In financial markets, rewards are not fixed. They change over time, and the relationship between actions and rewards can be complex and unpredictable. Thompson Sampling might struggle to adapt to these dynamic changes.
    • Exploration vs. Exploitation: Thompson Sampling balances exploration and exploitation, meaning it sometimes tries new options (exploration) while focusing on the best ones (exploitation). However, in investing, the consequences of poor decisions can be significant, and relying too much on exploration might not be ideal.
1 Like

Correct! Using IB as an example: if an algo stops working for some reason, it can take a while to realize this with traditional approaches. Thompson sampling is actually better than Upper Confidence Bound at detecting these changes and adjusting accordingly. And if adapting to change is a serious concern, the algorithm can be modified further to address this.

For example, Facebook modifies Thompson sampling to better handle nonstationary environments—where the “best” choice can change over time.

Using a rolling window for the lookback period or applying exponential weighting to past data are a couple of easy ways to make the algorithm more responsive to change.
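
Here is a minimal sketch of the exponential-weighting variant (the decay factor 0.99 is an illustrative choice, not a recommendation): before every update, all success/failure counts are multiplied by a factor slightly below 1, so stale evidence fades and the algorithm can notice when an arm stops working:

import numpy as np

gamma = 0.99   # decay factor (illustrative); closer to 1 means longer memory
n_arms = 4
successes = np.zeros(n_arms)
failures = np.zeros(n_arms)

def choose_arm() -> int:
    # Selection is unchanged: sample each arm's (decayed) Beta posterior
    return int(np.argmax(np.random.beta(successes + 1, failures + 1)))

def discounted_update(chosen_arm: int, reward: bool) -> None:
    # Decay ALL counts first, then credit the chosen arm
    successes[:] = successes * gamma
    failures[:] = failures * gamma
    if reward:
        successes[chosen_arm] += 1
    else:
        failures[chosen_arm] += 1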

Great point. But honestly, the only thing worse than exploring too much is never knowing which algorithm is actually best—again, thinking of IB algos here, where complete ignorance is basically my present state! Also, as I mentioned, having a large group using the algorithm can dramatically reduce the number of explorations required for each individual—which is why I bring it up here.

The reality is, people only join a group like this to find answers they don’t already know. And if you do have complete historical data on every “arm,” you’d use a different approach altogether. For example, with P123 classic stock selection, you have FactSet data showing you what would have happened with any stock (stock = arm). Bandit algorithms like Thompson sampling are for when you don’t have enough historical data to know the best choice in advance.

Thanks for raising these important questions—it’s a great discussion. And just to add: Facebook and other tech giants have already done much of the heavy lifting on these algorithms and ways to adapt them. We just need to apply and adapt their solutions for our purposes.

There are not just articles but entire books about this:

1 Like

Jim,

Thanks for the book suggestions. However, the approach seems to be focused on and employed mostly in website design, like the Facebook example you mentioned in your reply.

Do you have any references that are more related to building trading strategies (proven with statistical significance or out-of-sample performance, perhaps)?

I am also interested to know the views of the investment experts here like @yuvaltaylor.

Thanks again for sharing your view.

Regards
James

Sadly, I am seldom the first with an idea.

Strategy Selection Using Multi-Armed Bandit Algorithms in Financial Markets

What I have added is the advantage of being in a group. I may be the first on that.

Hmmm... I’ve contemplated developing an app, tentatively named “Hive Mind,” designed to crowdsource decision-making processes. The name draws from nature, where collective behavior leads to efficient problem-solving, much like bees in a hive. So actually, nature beat me to the idea of crowdsourcing using a similar explore/exploit algorithm (with bees) a few millennia ago.

I do think explore/exploit strategies have survived in nature this long because, despite any failings, they’re the best solution possible (short of fortunetelling), even for humans.
Less dramatically, these strategies are proven to be optimal for a certain set of problems. Like a hammer, they’re good to have in your toolbox, but they don’t belong in a surgical suite…unless, of course, you’re an orthopedic surgeon.

1 Like

Marco/Yuval,

Glad to know what you think and how this could be applied in P123.

Regards
James

EDIT: I can't believe that the reply above was completely re-written/edited after I gave a like, but it is great to have an academic paper on high-frequency trading from someone at Hainan University, China (maybe with a connection to DeepSeek?).

In case some members skip the link, here is the conclusion from the paper.

Conclusion

This study explored the application of Multi-Armed Bandit (MAB) algorithms in the dynamic selection of trading strategies within financial markets, a novel application that has not been addressed in existing literature. By introducing the Composite Trading Strategy, which integrates trend-following, mean-reversion, and momentum strategies, this research demonstrated how MAB algorithms can enhance decision-making in high-frequency trading environments.

The experimental results indicate that MAB algorithms such as Upper Confidence Bound (UCB), Thompson Sampling, and epsilon-greedy are effective in identifying optimal strategies under certain market conditions. In particular, the introduction of the Composite Trading Strategy led to increased trading opportunities, which enabled the algorithms to more effectively balance exploration and exploitation. However, the performance of these algorithms in adverse market conditions, such as downtrends or range-bound markets, remains limited. While the Composite Strategy improved profitability in favorable conditions, it did not fully overcome the inherent challenges faced by MAB algorithms in more complex and volatile market environments.

The primary contribution of this research is the application of MAB algorithms to real-time strategy selection, a field where these algorithms have not been previously explored. This study demonstrates that MAB algorithms can be applied beyond traditional portfolio optimization and risk management, offering new possibilities for improving trading performance in real-world financial markets.

Despite these findings, there are several limitations to this study. First, the experiments were conducted using simulated market data, which, while useful for controlled testing, may not fully capture the complexity and unpredictability of real-world financial markets. Second, the increased computational complexity introduced by the Composite Trading Strategy poses challenges for real-time application, especially in high-frequency trading scenarios where rapid decision-making is critical.

Future research should focus on further refining MAB algorithms to enhance their adaptability in adverse market conditions. One promising direction could involve integrating predictive models that can better anticipate market shifts, allowing the algorithms to make more context-aware decisions. Additionally, exploring the combination of MAB algorithms with machine learning techniques, such as reinforcement learning, could offer new avenues for improving decision-making under uncertainty. This would not only address the challenges of strategy selection but also expand the practical applicability of MAB algorithms in increasingly complex financial markets. In conclusion, this study has highlighted both the potential and limitations of MAB algorithms in financial strategy selection, underscoring the need for continued research to fully harness their capabilities in dynamic and uncertain trading environments.

Better, shorter, and more informative, I thought. An academic reference seemed like a good idea when proofreading my own post.

Here are a few more papers/posts if you are interested. They show this is not an entirely new idea, and maybe not all that radical either, though I have not read them. My understanding of the optimality of the algorithm (for a certain class of problem) and the Python demonstrations I have seen are enough for me. The above code is easily modified to simulate almost any situation one might encounter, including the number of algorithms being traded (while being tested), the total number of trades, and how similar or different the true success rates of the algorithms being investigated are. With no random seed you can easily rerun it multiple times (it is fast and just takes a click) to see how consistent the results are. I can't imagine what else I would need.

Hedging using reinforcement learning: Contextual k-armed bandit versus Q-learning

Stochastic Multi-armed Bandits: Optimal Trade-off among Optimality, Consistency, and Tail Risk

Maximizing REITurns: A Multi-Armed Bandit Approach to Optimizing Trading Strategies on Real Estate Investment Trusts

Multi-armed bandits applied to order allocation among execution algorithms

Strategy Selection Using Multi-Armed Bandit Algorithms in Financial Markets

Multi Armed Bandit Optimization in Trading

Multi-Armed Bandit (MAB) Methods in Trading

Risk-aware multi-armed bandit problem with application to portfolio selection

Learning the Trading Algorithm in Simulated Markets with Non-stationary Continuum Bandits

If I wanted to digest the papers I would load all of them into Gemini Pro (with its large context window) and see what it said. But I think I will stick to the Python demonstration for now. If someone else wants to do that, I would be interested.

Edit: I am not sure whether ChatGPT 4.1 actually searched those papers (I was too lazy to upload them). But here is its conclusion:

Conclusion:

While the application of MAB algorithms in trading is not new, the integration of group collaboration to enhance their effectiveness presents a novel avenue for exploration. Implementing such a system could offer significant benefits in terms of efficiency and strategy optimization.

When choosing between strategies developed on Portfolio123, backtesting them using simulations as similar as possible to how you'd actually trade them seems like an unbeatable way to go about things. The backtests should be robust (multiply the number of stocks you test, test them on random sub-universes, and so on), but Portfolio123 has two decades of experience enabling users to use their tools.

The challenge comes when choosing between strategies that are not so easy to backtest or compare to one another. For example, how much of one's portfolio should one devote to European, Canadian, and US stocks? Backtests won't help us much here, and even out-of-sample results are questionable. I use quite different systems based on Compustat and FactSet data. How should I weight their use in my portfolio? Is a put-based hedge better than a short-based hedge? How much of one's portfolio should one put into a non-ranking-based machine-learning strategy and how much into a strategy governed by ranking systems? Jim's thought about trading strategy is on point as well. Is VWAP better than TWAP? When is it more appropriate to use not-held/desk orders if the commission is higher? Which of IB's algorithms actually work best for the kinds of orders you want to place? (For this, it's essential to dig deep to understand how the orders really work: often IB's own explanations are misleading or insufficient.)

I don't think ANY of these questions can be helped much by using an MAB algorithm. There are simply too many variables at play. Not only that, but many of these questions may simply be unanswerable.

I am struggling to think of an application where MAB algorithms would work. I'm certainly willing to entertain ideas and would love to find a good application.

A thought for Jim: could one use an MAB algorithm to efficiently assign weights to nodes in a ranking system?

1 Like

BTW, I really recommend this book—even if it’s just for all the other algorithms it covers. If you’re tired of MAB, you can skip that chapter, but it’s still a fantastic read!
It’s the best resource I’ve seen on this topic (just one chapter is about MAB/Upper Confidence Bound, similar to Thompson sampling). The whole book is written for laypeople, is very accessible, and has LOTS of practical algorithms.
What really stuck with me was the medical research example—it’s tragic, and it shows what can go wrong with classical (non-adaptive) methods.
Algorithms to Live By: The Computer Science of Human Decisions

Great question! There might be good algorithms for assigning weights in ranking systems, but classic Multi-Armed Bandit (MAB) isn’t really designed for that purpose.

MABs excel when you’re working with binary or categorical outcomes—for example, “Which of these algos is most likely to outperform a VWAP order?” Each trial gives a simple success/failure outcome, which fits the MAB framework perfectly.

They’re also ideal when you don’t have a large historical dataset—MAB lets you “learn as you go” and start favoring winners quickly, without having to run exhaustive tests on everything upfront.

But if you do have years of historical data (like with FactSet or P123’s classic tools), more traditional optimization and machine learning approaches are generally a better fit and more efficient.

Summary:

MAB works well for binary feedback and situations with little or no historical data. For ranking systems—especially if you have plenty of historical results—other approaches are usually a better fit.

By the way, here’s a set of hypothetical payouts:

true_success_rates = [0.35, 0.50, 0.65, 0.49, 0.51] # True success probabilities.

So, most algos do about the same as VWAP, one is worse (0.35), and one is clearly better (0.65). You can plug in your own numbers, but I think this set is plausible for real-world algo comparisons.

If 10 members did 50 trades each over a month, that’s 500 trades total. Ideally, the MAB algorithm would direct most trades toward the algo that beats VWAP 65% of the time.

Here’s a simulation of what the algorithm would do:
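
A condensed sketch of that simulation, in case you want to run it yourself: it's just the earlier demo's Beta-sampling loop with the five rates above, and 500 rounds standing in for 10 members doing 50 trades each:

import numpy as np

true_success_rates = [0.35, 0.50, 0.65, 0.49, 0.51]
n_arms, n_rounds = len(true_success_rates), 500  # 10 members x 50 trades each

successes = np.zeros(n_arms)
failures = np.zeros(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for _ in range(n_rounds):
    # Sample each arm's Beta posterior; trade with the highest draw
    arm = int(np.argmax(np.random.beta(successes + 1, failures + 1)))
    win = np.random.rand() < true_success_rates[arm]
    pulls[arm] += 1
    successes[arm] += win
    failures[arm] += not win

for i, (n, rate) in enumerate(zip(pulls, true_success_rates)):
    print(f"Arm {i} (true rate {rate:.2f}): chosen {n} times")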

You can judge for yourself how well this works!

And here’s why everyone, including medical researchers, is so interested in this approach:

If you just tried each trading system 100 times and ran a statistical test, you’d end up using the poorly performing algorithms about 400 times—losing money in the process. In medicine, too many in the “large control” group receive suboptimal treatments, which can cost more than just money; sometimes it costs lives.
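
A quick back-of-the-envelope check of that arithmetic, using the five hypothetical rates above:

rates = [0.35, 0.50, 0.65, 0.49, 0.51]
n_per_arm = 100   # fixed design: 100 trades per system, 500 total

expected_wins_fixed = n_per_arm * sum(rates)              # 250 expected wins
expected_wins_best = n_per_arm * len(rates) * max(rates)  # 325 if you had known
print(f"Fixed design: ~{expected_wins_fixed:.0f} wins out of 500")
print(f"Always-best:  ~{expected_wins_best:.0f} wins out of 500")
# The gap (~75 wins) is the price of testing the four inferior arms 400 times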

While MAB isn’t ideal for this particular problem, I do have some thoughts and methods in this area—most of which, as far as I know, aren’t being published or widely shared (except perhaps by a few secretive hedge funds).

I’d be very interested in exploring ways we could collaborate on this to our mutual benefit. I hope you understand that discussing these ideas in a public forum (especially one accessible to non-members) probably wouldn’t be in the best interest of anyone at P123.

If you’re open to it, maybe we could continue this conversation privately.

Also, I’ve come to agree with you that P123 classic is an excellent tool. At this point, it’s really just a question of finding the fastest and most effective way to optimize the rank weights. On that front, my approach might offer some advantages: at the least it would be quite a bit faster, and maybe better. We could explore that together.

A better fit perhaps would be to offer a premium level at P123 for developing ranking systems. There could be 3-way participation there. I would just need to be an adequately reimbursed P123 staff member helping to develop that.

Here’s a continuous version. I used bootstrapping (a non-parametric method) instead of the usual normal distribution for continuous models, since financial data typically has fat tails and isn’t actually normal:

import numpy as np
import matplotlib.pyplot as plt

# True means for each restaurant (hidden from the algorithm)
true_means = [2.5, 4.2, 3.1, 4.0, 4.1, 3.5]
n_arms = len(true_means)
n_rounds = 200

# For bootstrapped TS: keep a list of observed rewards for each arm
rewards = [[] for _ in range(n_arms)]
history = []

# --- Warm start: force a few initial samples per arm ---
init_trials = 5
for arm in range(n_arms):
    for _ in range(init_trials):
        reward = np.random.normal(true_means[arm], 0.7)
        rewards[arm].append(reward)
        history.append(arm)

# --- Main Thompson Sampling loop ---
for t in range(n_arms * init_trials, n_rounds):
    # For each arm, bootstrap a mean from its observed rewards
    sampled_means = []
    for arm in range(n_arms):
        if len(rewards[arm]) > 0:
            sample = np.random.choice(rewards[arm], size=len(rewards[arm]), replace=True)
            sampled_means.append(np.mean(sample))
        else:
            sampled_means.append(0)  # Shouldn't happen with warm start

    choice = np.argmax(sampled_means)
    reward = np.random.normal(true_means[choice], 0.7)
    rewards[choice].append(reward)
    history.append(choice)

# --- Plot how often each arm was chosen ---
plt.figure(figsize=(7, 5))
counts = [history.count(i) for i in range(n_arms)]
plt.bar(range(n_arms), counts, color='steelblue')
plt.xticks(range(n_arms), [f"Arm {i}" for i in range(n_arms)])
plt.title("Bootstrapped Thompson Sampling (Continuous Rewards)")
plt.ylabel("Number of times chosen")
plt.xlabel("Arm")
for i, count in enumerate(counts):
    plt.text(i, count + 3, str(count), ha='center', va='bottom', fontsize=12)
plt.show()

# --- Show average reward per round ---
total_reward = sum([sum(rlist) for rlist in rewards])
print(f"Average reward per round: {total_reward/n_rounds:.2f}")

# --- Show how often the true best arm was chosen ---
best_arm = np.argmax(true_means)
print(f"Best arm (true mean {true_means[best_arm]:.2f}) was chosen {counts[best_arm]} times.")

BTW: If anyone is waiting for a paper to spell out exactly how to use this in real-world finance, you might be waiting a long time. Continuous Thompson sampling isn’t new, but I haven’t seen the bootstrapped version (which is much more robust for fat-tailed financial data) anywhere else. So you’ll have to use your own judgment and creativity on how to put this into practice, and the code above can help you simulate a wide range of scenarios for continuous variables now.

Bottom line: This method is most useful when you don’t have historical data to start with—such as with IB’s trading algos. That’s just one possible application. I’ll leave it to the creative members here to find others.

For developing ranking systems, P123 classic and the newer AI/ML modules remain the gold standard, thanks to the depth of historical data from Compustat and FactSet.

Whatever the application, crowdsourcing reduces the number of times each individual is stuck with an inferior strategy. The group as a whole finds the winner faster. The continuous version expands these possibilities even further.

@marco, you’ve mentioned some of the big problems with medicine. I can only agree looking at it from the inside. Here’s one real solution that’s actually being used more and more in clinical research—and it might have lessons for finance too:

As for when to use this: The answer is deceptively simple—use it whenever you want a statistical study but can’t afford to waste resources or take unnecessary risk on ineffective options, like an IB algo or trading strategy whose true performance is poor (but isn’t known in advance).

The medical analogy is especially poignant: In classic clinical trials, people in the control group (sometimes literally taking sugar pills) don’t get better— and in chemotherapy trials, for example, some may die a painful death. (That’s the stark reality, not an exaggeration.) Modern adaptive algorithms like this one help direct more patients to the better treatment sooner, saving lives instead of just collecting statistics.

In finance, the logic is the same: you want to zero in on what works as quickly (and safely) as possible. I’m not suggesting P123 should immediately adopt this, but whenever anyone is considering a statistical study, it’s worth thinking about the real cost of a control group—and whether there might be a smarter, faster method.

And here’s what I like most about this method:

It sidesteps all the headaches of traditional study design:

  • No need to estimate effect size or statistical power in advance
  • No agonizing over how many trials or trades you’ll need
  • No arbitrary “end date” or “p = 0.06, now what?” dilemmas

With an adaptive algorithm, you just start exploring, and the data itself tells you when you’ve found something that works. If there’s no clear winner, the method hedges gracefully between options. If one approach is truly better, the algorithm shifts toward it—as quickly as the evidence allows.
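
One way (among several) to formalize "the data tells you when to stop" is a posterior stopping rule: periodically estimate the probability that the current leader is truly the best arm, and stop once it clears a threshold you pick in advance. A sketch, where the tallies and the 0.95 threshold are illustrative assumptions:

import numpy as np

def prob_best(successes, failures, n_draws=10_000):
    # Monte Carlo estimate of P(each arm is best) from the Beta posteriors
    draws = np.random.beta(successes + 1, failures + 1,
                           size=(n_draws, len(successes)))
    return np.bincount(np.argmax(draws, axis=1),
                       minlength=len(successes)) / n_draws

# Hypothetical tallies after some rounds of testing
p = prob_best(np.array([10, 40, 12]), np.array([20, 25, 25]))
print(p)
if p.max() > 0.95:   # illustrative stopping threshold
    print(f"Stop: arm {p.argmax()} is best with probability {p.max():.2f}")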

Bottom line: All upside, no downside—for virtually any real-world statistical need.

If the choices perform about the same, you don’t waste extra effort or take extra risk. If there’s a real winner, you find it faster—without having to gamble on sample size calculations or hope that “statistical significance” lines up with real-world performance. No more wasted resources. No more pointless power studies.