u/Old-Friendship-8013

I've been running my model live since May 3. Here's what actually happened vs backtest expectations. [OC]

Backtest over 8,200 bets looked solid. But everyone knows backtest ≠ reality. So I've been logging every live pick since May 3 to see if it holds.

Live results so far (May 3–13, 2026): 24 picks. 24 correct. 100% hit rate.

Before anyone says "small sample" — yes, obviously. That's the point of logging it live. I'm not claiming it's statistically significant yet. I'm showing the process.

What matched backtest expectations:

  • High-confidence picks (≥70%) performing above average ✓
  • Home/away ELO gap still the strongest signal ✓
  • Low-league matches more volatile, as expected ✓

What surprised me:

  • Model was more conservative live than in backtest — fewer picks hitting the 70% threshold per day
  • Some markets I expected to be strong are underperforming on small live sample

I'll keep updating weekly. Public log here: https://docs.google.com/spreadsheets/d/1ILYXsDL4kzuhTm4k_i1thB3rNb5cRRPoZ2iA4RXE2cc

Happy to answer questions on methodology.

u/Old-Friendship-8013 — 17 hours ago

I built a stacking ensemble for football Over/Under markets across 8,200 bets. ELO gap turned out to be the strongest single predictor. [OC]

Been working on this for about a year. Here's what actually moved the needle.

**The model:**

Stacking ensemble — XGBoost + LightGBM + Random Forest as base learners, Logistic Regression as meta-learner. Isotonic calibration on top. Threshold auto-tuned per market on validation set.
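To make the architecture concrete, here's a minimal sketch of the inference path: three base-learner probabilities fed through a logistic meta-learner. The function name, weights, and bias are made up for illustration — in the real pipeline the meta-learner's coefficients come from fitting on out-of-fold base predictions, not hand-picked numbers.

```python
import math

def meta_predict(p_xgb, p_lgbm, p_rf, weights=(1.2, 1.1, 0.9), bias=-1.5):
    """Logistic-regression-style meta-learner over base-model probabilities.

    Illustrative only: real weights/bias are learned from out-of-fold
    predictions of the three base learners.
    """
    z = bias + weights[0] * p_xgb + weights[1] * p_lgbm + weights[2] * p_rf
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> calibratable probability
```

The point of the meta-learner is exactly what's described below in the lessons: it can weight one base model over another depending on where their probabilities disagree.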

**~165 features per match:**

- ELO ratings with K=30 and +100 home advantage modifier

- Form: last 10 home games and away games tracked separately

- xG luck factor (actual goals vs expected goals delta)

- Rest days, H2H records, league position, referee tendency

- League baseline stats per market
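The ELO feature above fits in a few lines. K=30 and the +100 home-advantage modifier are the parameters from the list; the helper names are mine, and the real feature pipeline obviously tracks ratings per team over a full history.

```python
def expected_home(home_elo, away_elo, home_adv=100):
    """Expected score for the home side, with the +100 home modifier."""
    return 1.0 / (1.0 + 10 ** ((away_elo - (home_elo + home_adv)) / 400))

def elo_update(home_elo, away_elo, home_score, k=30, home_adv=100):
    """home_score: 1.0 home win, 0.5 draw, 0.0 away win. Zero-sum update."""
    e = expected_home(home_elo, away_elo, home_adv)
    delta = k * (home_score - e)
    return home_elo + delta, away_elo - delta
```

The "ELO gap" feature is then just `home_elo + home_adv - away_elo` (or the raw difference, depending on how you want home advantage encoded).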

**Why ELO ended up on top:**

Same finding as the chess ELO post from earlier today — historical rolling averages don't capture opponent quality. A team that faced three weak sides in a row has inflated shot numbers. ELO adjusts for that automatically.

In our feature importance output, ELO gap ranks #1 across goals markets. Especially dominant for Over 0.5 — mismatched games (ELO gap >200) almost never finish 0-0.

**Backtest methodology:**

Time-based 80/20 split — no data leakage. Trained on seasons up to cutoff, tested on what came after. 12 European leagues, 11 betting markets.
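The split itself is the simplest part, but it's the one people get wrong by shuffling. A sketch, assuming each match is a dict with a sortable `"date"` key (field name is illustrative):

```python
def time_split(matches, train_frac=0.8):
    """Chronological 80/20 split: sort by date, cut, never shuffle.

    Everything in the test set is strictly later than the training cut,
    which is what prevents look-ahead leakage.
    """
    ordered = sorted(matches, key=lambda m: m["date"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```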

**Results on 8,200 bets:**

| Market | Hit rate | n |
|-------------------|----------|-------|
| Over 0.5 goals | 93.5% | 1,134 |
| Corners over 12.5 | 78.0% | 1,134 |
| Over 1.5 goals | 77.8% | 1,096 |
| BTTS | 66.2% | 337 |
| High-conf overall | 85.9% | 1,588 |

High-confidence = model probability ≥ 0.70 across all three base learners simultaneously.
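That gate is a one-liner worth stating precisely, since "all three simultaneously" is stricter than averaging. Function name is mine:

```python
def high_confidence(p_xgb, p_lgbm, p_rf, threshold=0.70):
    """True only if every base learner independently clears the threshold.

    Equivalent to requiring the *minimum* base probability >= 0.70,
    so one dissenting model vetoes the pick.
    """
    return min(p_xgb, p_lgbm, p_rf) >= threshold
```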

**What I learned:**

  1. Market selection beats model complexity. Over 0.5 is 93.5% not because the model is smart — it's because only ~6% of top European matches finish 0-0. The model just identifies those 6%.

  2. Stacking beats any single model by 8-12% consistently. The meta-learner learns when to trust XGBoost over LightGBM and vice versa depending on the market.

  3. Isotonic calibration is underrated. Raw probabilities from tree models are poorly calibrated. After isotonic calibration the reliability diagram tightened significantly — matters a lot for threshold selection.

  4. Correct score and first goalscorer have too much irreducible variance. Dropped them early. Focused on high base-rate markets.
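On point 3: the core of isotonic calibration is the pool-adjacent-violators algorithm (PAVA), which fits a monotone step function from raw scores to calibrated probabilities. A pure-Python sketch for intuition — in practice you'd use scikit-learn's `IsotonicRegression` rather than roll your own:

```python
def pava(values, weights=None):
    """Isotonic (non-decreasing) fit of `values` via pool-adjacent-violators.

    Adjacent out-of-order blocks are merged into their weighted mean
    until the whole sequence is non-decreasing.
    """
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [mean, total_weight, count]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            w_sum = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_sum, w_sum, n1 + n2])
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out
```

To calibrate: sort validation (raw score, binary outcome) pairs by score, run `pava` on the outcomes, and the fitted step values are your calibrated probabilities at each score level.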

Happy to discuss feature engineering or calibration approach in the comments. Also tracking picks publicly since May 3 if anyone wants to see live results vs backtest baseline.

u/Old-Friendship-8013 — 1 day ago