
I trained XGBoost on 461K CSP + covered call trades across 28 tickers to see what a model learns about wheel strike selection — here's what it found (and what the 30-delta rule already gets right)
TL;DR: I've been running the wheel and doing ML for about 20 years each. Finally built the project I'd been putting off: a model that scores individual CSPs and covered calls for "will this hit 50% profit before expiry?" across 28 tickers, from SPY/QQQ/AAPL through the volatile stuff most of us actually wheel now (COIN, MSTR, HOOD, OKLO, PLTR, MARA, RIOT, SOFI). Trained on 461K trades, 2020-2026. The punchline: on SPY/QQQ the standard wheel rules (30-delta, close at 50%, roll at 21 DTE) are hard to beat. On the volatile names, the rules you inherited from SPY-land actively hurt you, and the model learned why. Sharing because strike selection is where most wheel operators (myself included) under-think, and this experiment made me change how I size and select.
Repo (MIT, nothing for sale): https://github.com/caradhras36/options-ml-scoring
Not a product. Not a service. No Discord, no newsletter, no course. Posting because the wheel community is where I've learned the most over the years and this is me putting something back.
Why this matters for wheel operators specifically
The wheel works. "Sell 30-delta CSP, close at 50%, if assigned sell 30-delta covered call, close at 50%, repeat" is a disciplined playbook that beats most of what retail does.
But the moment you extend the wheel past SPY/QQQ into volatile single names — which is where most of the premium is, and also where most of us have gotten blown up — the heuristics stop behaving the same way. A 30-delta on SPY and a 30-delta on COIN are not the same trade, and every experienced wheel operator knows this intuitively. The question I wanted to answer with this project: can a model formalize that intuition, and does doing so actually help?
Short answer: yes, but less than the headline backtest numbers suggest, and mostly by forcing discipline that an honest wheel operator would already apply.
Setup
For every CSP or covered call meeting:
- 21-60 DTE
- |Delta| 0.10-0.40 (normal wheel band)
- Mid > $0.05
…score whether it will hit 50% of max profit before expiry. Five models on the same ~30 features:
- hit_50pct — binary (primary)
- max_profit — how much you realistically clear
- days_to_50 — how fast it cooks
- expected_value — dollar EV
- outcome_category — full win / partial win / breakeven / loss
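For anyone who wants to reproduce the candidate filter and the primary label by hand, here is a minimal sketch. The `Candidate` class, the function names, and the "buy back at the daily mid" convention are my illustrative assumptions, not code from the repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    dte: int      # days to expiration at entry
    delta: float  # signed option delta (puts are negative)
    mid: float    # bid-ask midpoint at entry, dollars per share


def in_wheel_band(c: Candidate) -> bool:
    """Entry filter from the setup: 21-60 DTE, |delta| 0.10-0.40, mid > $0.05."""
    return 21 <= c.dte <= 60 and 0.10 <= abs(c.delta) <= 0.40 and c.mid > 0.05


def days_to_50(entry_credit: float, daily_mids: list[float]) -> Optional[int]:
    """First day (1-based) on which the short option could be bought back for
    half the entry credit or less; None if that never happens.
    Assumption: daily_mids holds one EOD mid per day before expiry."""
    for day, mid in enumerate(daily_mids, start=1):
        if mid <= 0.5 * entry_credit:
            return day
    return None


def hit_50pct(entry_credit: float, daily_mids: list[float]) -> bool:
    """Binary label: did the trade reach 50% of max profit before expiry?"""
    return days_to_50(entry_credit, daily_mids) is not None
```

A put sold for $2.00 whose mid goes 1.80, 1.40, 0.95 hits the label on day 3; one that never trades at or below $1.00 before expiry does not.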
Data from Polygon EOD chains, 28 tickers, Jan 2020 - Mar 2026. Greeks computed via Black-Scholes and cross-checked against OptionsDX to 0.4% delta error.
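If you want to sanity-check deltas yourself, the textbook Black-Scholes delta is only a few lines. This is a sketch under simplifying assumptions that are mine (European exercise, no dividends, a flat `rate` you supply); the repo's actual Greeks code may differ.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_delta(spot: float, strike: float, t_years: float, iv: float,
             rate: float = 0.0, is_call: bool = True) -> float:
    """Black-Scholes delta for a European option on a non-dividend stock.
    iv is annualized implied vol as a decimal (0.45 means 45%)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * iv * iv) * t_years) / (
        iv * math.sqrt(t_years)
    )
    call_delta = norm_cdf(d1)
    # Put delta follows from put-call parity: call delta minus 1.
    return call_delta if is_call else call_delta - 1.0
```

Example: `bs_delta(100, 90, 45 / 365, 0.45, is_call=False)` gives the delta of a 45-DTE put 10% out of the money at 45% IV, which you can compare against whatever your broker shows.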
The question I care about most for wheel operators: given a ticker and a chain, which strike should I actually sell this week?
What the model learned (wheel-specific takeaways)
Three findings that changed how I wheel:
1. Ticker-specific delta bands
If you apply "sell 30-delta CSP" uniformly across 28 tickers, the hit-rate distribution looks like this:
| Ticker | 50%-profit hit rate at ~30-delta CSP |
|---|---|
| SPY | 89% |
| QQQ | 87% |
| AAPL | 84% |
| NVDA | 80% |
| PLTR | 74% |
| COIN | 71% |
| OKLO | 68% |
| MSTR | 67% |
The SPY rule is a SPY rule. On OKLO/MSTR/COIN, 30-delta is meaningfully more likely to go against you than the community heuristic implies. The model learned to shade lower delta (15-25) on the volatile names and hold normal 25-35 on the mega-caps. My manual rule now: on any name with ATM IV > 50%, shift the target delta band down by ~10 points. This by itself — no ML required — probably captures a meaningful chunk of what the model found.
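The manual version of that rule fits in one function. A minimal sketch: the function name and defaults are mine, with the 25-35 base band, the 50% ATM IV cutoff, and the 10-point shift taken from the paragraph above.

```python
def target_delta_band(atm_iv: float,
                      base_band: tuple[float, float] = (0.25, 0.35),
                      iv_cutoff: float = 0.50,
                      shift: float = 0.10) -> tuple[float, float]:
    """Return the (low, high) absolute-delta band to target.
    atm_iv is annualized ATM implied vol as a decimal (0.85 means 85%)."""
    low, high = base_band
    if atm_iv > iv_cutoff:
        # Volatile name: shade the whole band down by `shift`.
        return (round(low - shift, 2), round(high - shift, 2))
    return base_band
```

With the defaults, a mega-cap at 18% ATM IV keeps 0.25-0.35, and a name at 85% ATM IV drops to 0.15-0.25. A hard cutoff is the crude version; scaling the shift with IV would be a natural refinement, but that is not something I've tested.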
2. rv_iv_ratio is the feature I wish I'd tracked for 20 years
The single most useful engineered feature in the model is rv_20d / iv_atm — 20-day realized volatility divided by ATM implied. When it's low, you're selling rich premium relative to what the underlying is actually doing. When it's high, you're selling cheap premium into a stock that's been moving.
Every wheel operator does a version of this in their head ("IV looks juicy") but I'd never actually normalized it ticker-by-ticker. The model treats rv/iv on SPY and rv/iv on COIN as the same signal, which is exactly what you want — it's a relative richness signal.
Practical wheel rule: if rv_iv_ratio > 1.2 (realized exceeding implied), skip the open. Wait for IV to catch up or for realized to cool. Not a model requirement — a rule you can apply from any options data source.
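Here is roughly how you could compute that ratio from daily closes. The conventions are assumptions on my part (log returns, sample standard deviation, 252-day annualization) and the function names are illustrative; the repo's feature code may use different choices.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def realized_vol(closes: list[float], window: int = 20) -> float:
    """Annualized realized vol from the last `window` daily log returns."""
    if len(closes) < window + 1:
        raise ValueError("need at least window + 1 closes")
    recent = closes[-(window + 1):]
    log_returns = [math.log(b / a) for a, b in zip(recent, recent[1:])]
    return statistics.stdev(log_returns) * math.sqrt(TRADING_DAYS)


def rv_iv_ratio(closes: list[float], atm_iv: float, window: int = 20) -> float:
    """20-day realized vol divided by ATM implied vol (both as decimals)."""
    return realized_vol(closes, window) / atm_iv


def should_skip_open(closes: list[float], atm_iv: float,
                     threshold: float = 1.2) -> bool:
    """Skip rule from above: realized running well ahead of implied."""
    return rv_iv_ratio(closes, atm_iv) > threshold
```

Feed it at least 21 daily closes and the ATM IV from your chain. A ratio well under 1 means premium is rich relative to recent movement; above 1.2 is the skip zone.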
3. The 50%-profit label is actually the right wheel target
I was nervous the "hit 50% profit before expiry" label would be weird for wheelers who hold through assignment. Turns out it maps well. For CSPs that end up assigned, the 50%-profit label is rarely hit (the position gets assigned at a loss or at close-to-max — that's what assignment is). The model learned to score low-probability-of-50% trades as "avoid" even when premium looked attractive, which is basically the wheel operator's "do I want to own this at this strike" gut check, formalized.
The model is not a replacement for "pick tickers you're willing to own." It's a filter on top of that.
Results — and the part where I beg you to discount the dollar number
Holdout backtest: Jan 2025 - Mar 2026 (15 months the model never saw):
| Metric | Model (threshold 0.85) | "Sell everything in the band" baseline |
|---|---|---|
| Trades | 193,608 | 285,379 |
| Hit rate | 89.7% | 78.2% |
| Avg P&L / trade | $404 | $95 |
| Precision lift | +11pp | — |
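For reference, this is how I'd expect a hit rate and precision lift like the ones in the table to be computed from scores and labels. It's a sketch with made-up names, and treating "score at or above the threshold" as the selection rule is my assumption.

```python
def hit_rate(labels: list[int]) -> float:
    """Fraction of trades whose hit_50pct label is 1."""
    return sum(labels) / len(labels)


def precision_lift(scores: list[float], labels: list[int],
                   threshold: float = 0.85) -> tuple[float, float, float]:
    """Return (model hit rate, baseline hit rate, lift in percentage points).
    Model = trades scoring >= threshold; baseline = every trade in the band."""
    selected = [y for s, y in zip(scores, labels) if s >= threshold]
    if not selected:
        raise ValueError("no trades at or above the threshold")
    model_rate = hit_rate(selected)
    base_rate = hit_rate(labels)
    return model_rate, base_rate, (model_rate - base_rate) * 100.0
```

The lift is just the gap between the two hit rates, so the three numbers should always reconcile with each other.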
Five reasons the +$400/trade is fiction-adjacent:
- Mid-price fills. Every backtest trade fills at the bid-ask midpoint. In a real wheel account selling on names like MSTR or COIN, you're giving up 10-15% of the credit to spread. That alone knocks $30-60 off the average per-trade figure.
- No capital / margin / concentration constraints. 193K trades over 15 months is ~500/day. No wheel account has that capital. The realistic question is "among the N trades I can actually put on today, does the model's top-N beat the heuristic's top-N?" — and I haven't answered that yet.
- annualized_return is the model's top feature. SHAP analysis shows the single most important input is premium-per-day-per-capital. That's technically known at trade entry so it's not strict leakage, but it means a meaningful chunk of the model's "edge" is just "avoid thin-premium trades" — which is a rule you can write on a napkin. I'm retraining without it to see what survives.
- 15 profitable months = a favorable regime. The backtest window was mostly benign for premium sellers. I have no data on what this does in a 2008-style crisis or a 2017-style low-vol grind where premiums compress.
- No assignment/wheel path modeled. The label is hit-50% or not. It doesn't follow the CSP-into-assignment-into-covered-call cycle that actually defines the wheel. A version that does is on the roadmap but isn't built yet.
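The first two caveats are the easiest to make concrete. Below is a sketch of the comparison I haven't run yet: haircut the mid-price credit for spread, then pick only as many trades as a real account can collateralize. Every name here is hypothetical, the 12.5% haircut is just the midpoint of the 10-15% range above, and collateral is assumed to be strike x 100 for a cash-secured put.

```python
from dataclasses import dataclass


@dataclass
class ScoredTrade:
    ticker: str
    strike: float      # dollars
    credit_mid: float  # per-share credit at the bid-ask midpoint
    score: float       # ranking value (model probability or a heuristic)


def credit_after_spread(credit_mid: float, haircut: float = 0.125) -> float:
    """Credit you might actually collect after giving up part of the spread."""
    return credit_mid * (1.0 - haircut)


def pick_top_n(trades: list[ScoredTrade], capital: float,
               max_trades: int) -> list[ScoredTrade]:
    """Greedy pick: highest score first, skipping anything that no longer
    fits the remaining cash. Assumes one contract per trade."""
    chosen: list[ScoredTrade] = []
    remaining = capital
    for t in sorted(trades, key=lambda t: t.score, reverse=True):
        collateral = t.strike * 100  # cash-secured put, 100 shares
        if len(chosen) < max_trades and collateral <= remaining:
            chosen.append(t)
            remaining -= collateral
    return chosen
```

Run `pick_top_n` once with the model's score and once with a heuristic score (say, closeness to 30-delta) on the same day's candidates, then compare realized P&L on haircut credits. That is the apples-to-apples test the headline table doesn't give you.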
What actually changed in my own wheel
The only thing that matters for wheel operators is "did this make your own book better," so here's the honest account:
- I stopped selling CSPs at the same delta on COIN/MSTR/OKLO that I sell on SPY. This was the biggest behavioral change, and it doesn't actually require the model — the finding is "volatile names need lower delta," and once you know that, you can apply it manually.
- I added rv_iv_ratio as a manual gut check before opening any position. No model required — just a 20-day realized vol calc.
- I do not use the model as a go/no-go signal. I use it as a confirmation check. Model agrees with my intuition + delta + rv/iv → I size up. Model disagrees with my intuition → I shrink size or skip. Never the only input.
- I'm more skeptical of high-premium CSPs on volatile names, not less. The SHAP analysis caught several cases where the model was rewarding high-premium trades that were actually bad (high notional ≠ good trade), and that made me audit my own real-money history. I found two OKLO trades from 2025 that fit the "bad trade that looked good because premium was fat" pattern. Costly lesson, but it's the kind of lesson a model-output review can surface.
The deeper insight (and the one I'd push you to steal regardless of whether you ever touch the model) is that the wheel rules are right most of the time, and where they're wrong, they're wrong in a specific, identifiable direction on volatile tickers. The rules were designed on SPY. If you're wheeling anything with ATM IV above 50%, derate your delta band.
If you want to use this on your own watchlist
Repo has a Jupyter notebook (notebooks/example_inference.ipynb) that walks through scoring a single trade end-to-end — I used an NVDA $130 PUT as the example. You feed it a ticker + chain snapshot, it returns all 5 predictions plus a SHAP breakdown showing which features pushed the score up and down for that specific trade.
That's the fastest path for a wheel operator who wants to actually use this — not to retrain, just to score your own watchlist each week. Everything is MIT-licensed. The 28-ticker pre-trained model is in the repo via Git LFS.
https://github.com/caradhras36/options-ml-scoring
What I'd actually want feedback on
- Is 50% profit the right label for wheel operators? "Close at 50% OR 21 DTE" is more realistic for most of us. I'll probably relabel in v8. Anyone tried both?
- Does rv_iv_ratio > 1.2 = skip match your experience? I think it's generalizable but I've only tested it on these 28 tickers.
- For wheel-specific backtesting, should I model the full CSP→assignment→CC cycle as the label, instead of per-trade hit-50%? That's a bigger project but probably the right one.
- What's your honest per-ticker delta band on volatile names? I moved to 15-25 on COIN/MSTR/OKLO, 25-35 on SPY/QQQ. Curious what other experienced wheel operators do.
I'll be in the thread for the next few hours. Brutal feedback welcome — especially on the mid-price backtest and the annualized_return leakage concern, which are the two things I'm least confident about.