I'm 46 years old. I have a commercial pilot certificate, 371.0 flight hours, and a very clear problem: I need 129 more hours to reach the 500-hour Part 135 target. At roughly $200 an hour, that's about $25,800 I need to generate from somewhere that isn't my day job managing a car electronics shop in Merced.
That number has been sitting on my desk for a while. I've been building toward it through a few different angles — an iOS app, KDP workbooks, some content work — all orchestrated through OpenClaw, my AI agent platform. But a few months ago I started thinking about prediction markets. Specifically, Kalshi. Specifically, the weather markets.
Here's where my head went: I'm a pilot. Weather is not a hobby topic for me. I've been reading METARs, TAFs, area forecasts, and prog charts for two years — the same skills I built while earning my commercial and instrument ratings. I know what NOAA's models are good at and where they fall apart. And every time a weather event rolls through, there are real money markets on Kalshi asking things like "Will the high temperature in Phoenix exceed 105°F today?" or "Will it rain more than 0.25 inches in Seattle this week?"
I thought: I probably have better-than-random intuition here. What if I built a bot to make those trades systematically and found out?
The bot ran paper for 35 days: 217 resolved trades, 76.5% win rate, and $1,773 in simulated profit. On March 28 it went live with a $424 bankroll. Later, after stripping noisy results and auditing calibration, the honest answer was uglier: the weather-only live run was not capital-ready, and the current gate still blocks live trading.
April 25 update: Kalshi weather is paused. The newest gate audit fixed bogus candidate-flow blockers, but the model still loses to the market baseline and needs more clean paper/live-gated data. I am leaving the original build notes below because the failure mode is the lesson.
Why Weather Prediction Markets
Kalshi is a regulated prediction market exchange. You trade on yes/no questions about future events. Unlike sports betting or crypto, the underlying "asset" is something with actual data behind it — government forecasts, historical records, numerical weather models that have been refined for decades.
Weather markets appealed to me for a few specific reasons.
First, the edge is knowable. Temperature forecasts from NOAA and Open-Meteo aren't random. They're probabilistic outputs from physics-based models. If I can extract a probability estimate that's better calibrated than what the market is pricing, I have a real edge.
Second, the events resolve fast. A "will it hit 100°F in Phoenix today?" market resolves by end of day. That's very different from holding a position for weeks. Fast resolution means I learn quickly whether my model is working.
Third, as a pilot, I already had the domain knowledge. I didn't need to learn what a 500mb chart means. I didn't need to understand the difference between ensemble spread and deterministic output. That was already in my head from instrument training.
So I started building.
How the Bot Works
Data Sources
The bot pulls from two forecast APIs: NOAA's National Digital Forecast Database (NDFD) and Open-Meteo. NOAA is the authoritative US source. Open-Meteo provides ensemble model data with uncertainty spreads. Together they give me a probability distribution for temperature, precipitation, and wind at any US location.
For each market, the bot identifies the relevant location and variable, queries both APIs, and builds a forecast ensemble. The ensemble spread tells me how confident the models are. Wide spread means high uncertainty; tight spread means the models agree.
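For flavor, here's roughly what that pull looks like in code. This is a minimal sketch, assuming Open-Meteo's public ensemble endpoint and a GFS ensemble model name; rather than rely on exact per-member key names, it just grabs every hourly series whose key starts with temperature_2m:

```python
import statistics
import requests

def fetch_high_ensemble(lat: float, lon: float) -> tuple[float, float]:
    """Return (mean, spread) of today's forecast high across ensemble members."""
    resp = requests.get(
        "https://ensemble-api.open-meteo.com/v1/ensemble",
        params={
            "latitude": lat,
            "longitude": lon,
            "hourly": "temperature_2m",
            "models": "gfs_seamless",          # one ensemble family (assumption)
            "temperature_unit": "fahrenheit",
            "forecast_days": 1,
        },
        timeout=30,
    )
    resp.raise_for_status()
    hourly = resp.json()["hourly"]

    # One forecast high per member: the max of each member's hourly series.
    highs = [
        max(v for v in series if v is not None)
        for key, series in hourly.items()
        if key.startswith("temperature_2m")
    ]
    return statistics.mean(highs), statistics.stdev(highs)

mean_high, spread = fetch_high_ensemble(33.43, -112.01)  # roughly Phoenix
print(f"ensemble high: {mean_high:.1f}F, spread {spread:.1f}F")
```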
Probability Estimation
For bracket markets (things like "will the high exceed X?"), I use a Normal CDF approach. I take the ensemble mean as the expected value and the spread as the standard deviation, then calculate the probability that the outcome falls above or below the threshold.
So if the ensemble says Phoenix high temperature tomorrow has a mean of 103°F with a standard deviation of 4°F, I can calculate the exact probability that it exceeds 105°F. That probability becomes my edge estimate against whatever price the market is offering.
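That calculation is only a few lines. A sketch using nothing beyond the standard library, with erf standing in for the Normal CDF:

```python
from math import erf, sqrt

def prob_above(threshold: float, mean: float, sd: float) -> float:
    """P(outcome > threshold) under a Normal(mean, sd) forecast."""
    z = (threshold - mean) / sd
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

# The Phoenix example: mean 103F, sd 4F, threshold 105F.
print(f"{prob_above(105.0, 103.0, 4.0):.3f}")  # ~0.309
```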
Strategies
The bot runs two strategies, and this is where the real lesson is.
CALIBRATION is the primary strategy. It looks for markets where my probability estimate and the market price disagree by a meaningful margin. If I think there's a 70% chance something happens and the market is paying 55 cents on the dollar (implying 55%), that's a 15-point edge and I take the trade.
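In code, the filter is close to trivial. A sketch with names of my own choosing; the 15-point threshold matches the example above:

```python
from typing import Optional

MIN_EDGE = 0.15  # minimum model-vs-market disagreement worth trading

def calibration_signal(model_prob: float, yes_price_cents: int) -> Optional[str]:
    implied = yes_price_cents / 100.0   # 55 cents implies 55%
    if model_prob - implied >= MIN_EDGE:
        return "BUY_YES"                # market underprices the event
    if implied - model_prob >= MIN_EDGE:
        return "BUY_NO"                 # market overprices the event
    return None                         # no meaningful edge; skip

print(calibration_signal(0.70, 55))  # BUY_YES on a 15-point edge
```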
BOUNDARY_FADE was a second strategy I added, looking to fade markets priced near the boundary — extreme events, thin probability ranges. It seemed clever in theory. The data killed it. BOUNDARY_FADE averaged negative $19.38 per trade. It's disabled now.
Position Sizing: Kelly Criterion
This part matters more than most people think. I use a fractional Kelly Criterion to size each position. Kelly tells you the mathematically optimal fraction of your bankroll to bet given your edge and the odds. Full Kelly is aggressive enough to cause significant drawdowns, so I use a fraction of it.
The standard formula is f* = (bp - q) / b, where b is the net odds (profit per dollar risked), p is my estimated win probability, and q = 1 - p. I then take a fraction of that (typically 25-50% Kelly) based on my confidence in the edge estimate. Position sizes are capped so no single trade risks more than a defined percentage of the bankroll.
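Here's that sizing as a sketch, with illustrative caps rather than the bot's actual numbers. For a yes contract priced at c that pays $1, the net odds are b = (1 - c) / c, so full Kelly reduces to f* = (p - c) / (1 - c):

```python
def kelly_fraction(p: float, price: float) -> float:
    """Full-Kelly bankroll fraction for a yes contract at `price` (0-1)."""
    b = (1.0 - price) / price        # net odds: win (1 - c) for every c risked
    q = 1.0 - p
    return max(0.0, (b * p - q) / b)

def position_size(bankroll: float, p: float, price: float,
                  kelly_mult: float = 0.25, cap_frac: float = 0.05) -> float:
    """Fractional Kelly with a hard per-trade cap (both numbers illustrative)."""
    f = kelly_fraction(p, price) * kelly_mult
    return round(bankroll * min(f, cap_frac), 2)

# 70% model probability against a 55-cent market, quarter Kelly, $424 bankroll:
print(position_size(424.0, 0.70, 0.55))  # 21.2, i.e. capped at 5% of bankroll
```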
Storm Mode
One detail I'm proud of: the bot has a storm mode. Normally it scans for new markets every 30 minutes. When there's a significant weather event in the forecast — a front moving through, a major precipitation event, an extreme heat warning — it shifts to scanning every 10 minutes. More markets open during events, they're more volatile, and the edge windows close faster.
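The cadence switch itself is simple. A sketch against the free NWS active-alerts endpoint; the alert types I key on here are an assumption, not the bot's actual trigger list:

```python
import requests

NORMAL_INTERVAL_MIN = 30
STORM_INTERVAL_MIN = 10
STORM_EVENTS = {
    "Severe Thunderstorm Warning", "Excessive Heat Warning",
    "Winter Storm Warning", "Flood Warning",
}

def scan_interval_minutes(state: str = "AZ") -> int:
    """Scan every 10 minutes if a significant alert is active, else every 30."""
    resp = requests.get(
        "https://api.weather.gov/alerts/active",
        params={"area": state},
        headers={"User-Agent": "weather-bot (you@example.com)"},  # NWS asks for one
        timeout=30,
    )
    resp.raise_for_status()
    active = {f["properties"]["event"] for f in resp.json()["features"]}
    return STORM_INTERVAL_MIN if active & STORM_EVENTS else NORMAL_INTERVAL_MIN
```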
What the Numbers Say
After 217 resolved paper trades on a $1,000 simulated bankroll, the topline was the 76.5% win rate and $1,773 in simulated profit mentioned above. Breaking it down by strategy:
| Strategy | Trades | Profit | Avg/Trade |
|---|---|---|---|
| CALIBRATION | 124 | $907.79 | +$7.32 |
| BOUNDARY_FADE | 13 | −$251.94 | −$19.38 |
CALIBRATION's average win was $21.17 against an average loss of $8.68. That 2.44x win-to-loss ratio is doing a lot of heavy lifting alongside the 52.7% win rate.
BOUNDARY_FADE got disabled the moment I saw its column.
Monte Carlo Validation
I ran 10,000 Monte Carlo simulations using the observed win rates, average wins, average losses, and a realistic position size range, once across the full trade history and once for CALIBRATION alone. The results:
- 98.3% of simulations were profitable
- Median simulated profit: $873
- CALIBRATION-only run: 100% of simulations profitable
That's not a guarantee of future results. Monte Carlo assumes the future looks like the past. But 98.3% profitable across 10,000 runs with median profit of $873 is a signal worth taking seriously.
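If you want to sanity-check that shape yourself, the simulation is only a dozen lines. A minimal re-creation using the CALIBRATION row from the table; the uniform jitter standing in for a realistic position size range is my assumption:

```python
import random

def simulate(n_runs=10_000, n_trades=124, win_rate=0.527,
             avg_win=21.17, avg_loss=8.68):
    """Resample a trade history; return (fraction profitable, median P&L)."""
    totals = []
    for _ in range(n_runs):
        pnl = 0.0
        for _ in range(n_trades):
            if random.random() < win_rate:
                pnl += random.uniform(0.5, 1.5) * avg_win   # jittered win size
            else:
                pnl -= random.uniform(0.5, 1.5) * avg_loss  # jittered loss size
        totals.append(pnl)
    totals.sort()
    return sum(t > 0 for t in totals) / n_runs, totals[n_runs // 2]

frac, median = simulate()
print(f"{frac:.1%} profitable, median ${median:.0f}")
```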
The edge isn't huge. $7.32 per trade on a $1,000 bankroll is not going to fund a Citation X. But it's real, it's consistent, and it compounds.
The Gate That Wasn't Strict Enough
Paper trading is paper trading. I built a gate system: six criteria that had to pass simultaneously before the bot touched real money. On March 28, 2026, the old gate cleared. The newer gate is harsher, and it is right to be harsher.
- ✓ ≥ 80 paper trades completed: 217 resolved
- ✓ ≥ 55% win rate: 76.5%
- ✓ Brier score ≤ 0.20: 0.1947
- ✓ Positive P&L: $1,773
- ✓ Peak drawdown < 20%: 10.6%
- ✓ ≥ 30 consecutive days of trading: 35 days
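The gate itself is not exotic, just an all-or-nothing check. A sketch with the thresholds above and field names of my own:

```python
from dataclasses import dataclass

@dataclass
class PaperStats:
    trades: int
    win_rate: float        # 0-1
    brier: float
    pnl: float             # dollars
    peak_drawdown: float   # 0-1
    trading_days: int

def gate_clears(s: PaperStats) -> bool:
    """All six criteria must pass simultaneously."""
    return all([
        s.trades >= 80,
        s.win_rate >= 0.55,
        s.brier <= 0.20,
        s.pnl > 0,
        s.peak_drawdown < 0.20,
        s.trading_days >= 30,
    ])

march_28 = PaperStats(217, 0.765, 0.1947, 1773.0, 0.106, 35)
print(gate_clears(march_28))  # True: the old gate cleared
```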
Update (March 28, 2026): The bot went live with a $424 bankroll, quarter-Kelly sizing, $15 max per position, and a circuit breaker armed.
Update (April 25, 2026): The bot is paused again. A later gate audit showed candidate flow was partially blocked by bad cluster logic, but after fixing that, the remaining block was real: not enough settled live data, not enough trading days, and final decision quality still worse than the market baseline. That is not a green light. That is the safety system doing its job.
What I Got Wrong
BOUNDARY_FADE is the obvious one. I was pattern-matching to a trading concept that sounds good in a book and didn't do the work to validate it before trading it. By the time I had 13 trades of data, it had already cost me $252 in simulated profit. That's actually a cheap lesson in paper trading terms. In live trading it would have been real money.
The other thing I underestimated: how much the model spread matters. Early on I was using point forecasts from NOAA without accounting for forecast uncertainty. A mean temperature of 98°F means something very different when the ensemble spread is 2°F versus 8°F. I now weight my confidence in an edge by the spread. Tight spread, high confidence, bigger position. Wide spread, uncertain, smaller position or skip.
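The weighting doesn't need to be fancy. A sketch of the idea with illustrative breakpoints, returned as a multiplier on the Kelly stake:

```python
def spread_multiplier(spread_f: float) -> float:
    """Confidence multiplier from ensemble spread (degrees F); breakpoints illustrative."""
    if spread_f <= 2.0:
        return 1.0   # models agree: full confidence
    if spread_f <= 5.0:
        return 0.6   # moderate disagreement: shrink the position
    if spread_f <= 8.0:
        return 0.3   # wide spread: token size at most
    return 0.0       # too uncertain: skip the trade

print(spread_multiplier(2.0), spread_multiplier(8.5))  # 1.0 0.0
```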
What's Next
The short answer: I keep it in paper/gated mode until it earns another shot. No threshold heroics. No tuning it into passing because I want the story to be cleaner. If the model cannot beat the market baseline after enough settled data, the strategy gets killed or rebuilt.
I'm also looking at extending the model to precipitation and wind markets, not just temperature. Temperature is the cleanest signal because NOAA's temperature forecasts are very good. Precipitation is noisier. But there might be edge in specific market types where the noise works in my favor.
The longer game is still the same: build income streams that can fund flight hours. But KDP and digital products are cleaner proof right now than prediction-market bravado. I wrote about the full picture of how I'm using AI to build side income as a pilot, and I keep the current numbers on the Flight Funded tracker.
371.0 hours and counting. The bot can sit in the penalty box until the math stops embarrassing itself.
📘 Want the Exact System Behind the Bot?
I wrote up the OpenClaw setup behind this whole automation stack in The Non-Developer's OpenClaw Playbook. It covers cron jobs, model routing, memory files, coding-agent workflow, safety gates, and the mistakes that cost money before they taught anything useful.