u/lobhas1

Looking for honest critique on my 6-fold walk-forward quant backtest — US equities, long-only, daily data

I've been building a cross-sectional equity ranker and want honest critique on the backtest framework + results. Keeping model/feature details abstract (that's the IP I've invested in) but happy to discuss architecture and methodology.

Setup

  • Universe: ~650 US equities (S&P 500 + mid-caps + some delisted names, point-in-time membership)
  • Data: daily OHLCV from Tiingo, 2006-present, adjusted prices
  • Label: 5-day forward excess return vs SPY, decile-ranked for training
  • Model: tree-based cross-sectional ranker
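For concreteness, the label construction above (5-day forward excess return vs SPY, decile-ranked cross-sectionally per date) can be sketched like this — function name and frame shapes are illustrative, not my actual pipeline:

```python
import numpy as np
import pandas as pd

def make_decile_labels(adj_close: pd.DataFrame, spy: pd.Series,
                       horizon: int = 5) -> pd.DataFrame:
    """adj_close: dates x tickers adjusted closes; spy: SPY adjusted closes.
    Returns per-date decile labels 0 (worst) .. 9 (best)."""
    fwd = adj_close.shift(-horizon) / adj_close - 1.0   # 5d forward return
    spy_fwd = spy.shift(-horizon) / spy - 1.0           # SPY benchmark return
    excess = fwd.sub(spy_fwd, axis=0)                   # excess vs SPY
    pct = excess.rank(axis=1, pct=True)                 # cross-sectional rank in (0, 1]
    return np.ceil(pct * 10) - 1                        # bucket into deciles 0..9
```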

Walk-forward validation

  • 6 rolling folds, each 12y train / 1y validation / 1y test
  • 10-day embargo between val and test
  • Non-overlapping test windows spanning 2020-02 to 2026-02
  • Proper point-in-time universe (no look-ahead on ticker membership)
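The fold geometry above can be written down directly. This is schematic date math (calendar days for the embargo here, whereas the real pipeline embargoes trading days):

```python
import pandas as pd

def walk_forward_folds(first_test_start: str, n_folds: int = 6,
                       train_years: int = 12, embargo_days: int = 10):
    """Build rolling (train, val, test) date ranges: 12y train, 1y val,
    an embargo gap, then a 1y test window. Test windows are back-to-back
    and therefore non-overlapping."""
    folds = []
    t0 = pd.Timestamp(first_test_start)
    for k in range(n_folds):
        test_start = t0 + pd.DateOffset(years=k)
        test_end = test_start + pd.DateOffset(years=1)
        val_end = test_start - pd.Timedelta(days=embargo_days)  # embargo before test
        val_start = val_end - pd.DateOffset(years=1)
        train_end = val_start
        train_start = train_end - pd.DateOffset(years=train_years)
        folds.append({"train": (train_start, train_end),
                      "val": (val_start, val_end),
                      "test": (test_start, test_end)})
    return folds
```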

Three portfolio variants run in parallel

| Portfolio | Rebalance | Holding |
| --- | --- | --- |
| TOPN-5 | Every 5 days | Full 5 days |
| TRANCHE | Daily (5 overlapping tranches) | 5 days each |
| MINHOLD | Daily entry | Min 5 days, signal-driven exit |

Per-portfolio sizing

After finding that no single sizing scheme works best across all three variants, my production config runs:

  • TOPN / TRANCHE: rank-based confidence weighting (weights ∝ rank² within top-5)
  • MINHOLD: equal-weighted (daily entry made rank-concentration too noisy)
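Concretely, the rank² scheme for the concentrated books is tiny (sketch; the convention assumed here is that the top-ranked name gets the largest weight):

```python
import numpy as np

def rank_squared_weights(n: int = 5) -> np.ndarray:
    """Weights proportional to rank^2 within the top-N, ordered best-first.
    For n=5 the raw scores are [25, 16, 9, 4, 1], normalized to sum to 1,
    so the top pick carries 25/55 ~ 45% of the book."""
    raw = np.arange(n, 0, -1) ** 2.0
    return raw / raw.sum()
```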

6-fold test-set results (total return, 1-year test each)

| Fold | Period | TOPN | TRANCHE | MINHOLD | SPY |
| --- | --- | --- | --- | --- | --- |
| 1 | 2020-02 → 2021-02 | +72% | +141% | +146% | +9.5% |
| 2 | 2021-02 → 2022-02 | +4% | +18% | +4% | +9.9% |
| 3 | 2022-02 → 2023-02 | +63% | +39% | +55% | −9.7% |
| 4 | 2023-02 → 2024-02 | −15% | +25% | +12% | +23.6% |
| 5 | 2024-02 → 2025-02 | +176% | +159% | +184% | +21.9% |
| 6 | 2025-02 → 2026-02 | +125% | +78% | +101% | +11.7% |
| Avg | | +71% | +76% | +84% | +13% |

Test Sharpe ranges from 0.3 to 3.6 across folds. IC (Spearman) averages 0.02, with a per-fold range of −0.002 to +0.046.
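For clarity on what IC means here: the daily cross-sectional Spearman correlation between model scores and realized forward excess returns, averaged over dates. Schematically (plain-numpy rank correlation, no scipy; frame shapes are illustrative):

```python
import numpy as np
import pandas as pd

def daily_spearman_ic(scores: pd.DataFrame, fwd_excess: pd.DataFrame) -> pd.Series:
    """Per-date cross-sectional Spearman IC: rank-correlate model scores
    with realized forward excess returns across the universe."""
    ics = {}
    for dt in scores.index:
        s = scores.loc[dt].dropna()
        r = fwd_excess.loc[dt].reindex(s.index).dropna()
        s = s.reindex(r.index)
        if len(s) < 3:          # too few names for a meaningful correlation
            continue
        ics[dt] = np.corrcoef(s.rank(), r.rank())[0, 1]
    return pd.Series(ics)
```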

Costs modeled: 1bp fee + 3bp slippage + 5bp spread buffer per trade, plus a 50bp annual borrow rate (which doesn't bite in this long-only config).
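Spelled out, that stack is 9bp of traded notional per trade (assuming the three components simply add, per side):

```python
def trade_cost(notional: float, fee_bps: float = 1.0,
               slip_bps: float = 3.0, spread_bps: float = 5.0) -> float:
    """Dollar cost of one trade: fee + slippage + spread buffer,
    each quoted in basis points of traded notional (9bp total here)."""
    return notional * (fee_bps + slip_bps + spread_bps) / 1e4

def annual_borrow(avg_short_notional: float, borrow_bps: float = 50.0) -> float:
    """50bp/yr borrow charge on the short book; zero when long-only."""
    return avg_short_notional * borrow_bps / 1e4
```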

What I think might actually be alpha

  • Beats SPY in 5/6 folds across all three portfolios
  • TRANCHE's daily-5-tranche structure has the best risk-adjusted numbers — often Sharpe 2-3 on test
  • Consistent across varied regimes: COVID, 2022 drawdown, 2023 AI rally, 2025-26 range
  • Signal appears largely orthogonal to market beta (test fold 3: MINHOLD returned +55% while SPY was −9.7%)

What's concerning me (please pile on)

  1. Fold 2 (2021-22) is universally weak. All three portfolios barely beat, or lost to, SPY in a growth-to-value rotation year. IC is near zero — the model has essentially no signal in that regime, and I haven't found a fix.
  2. TOPN fold 4 was negative despite highest IC (0.046). Broader ranking was correct but the specific top-5 picks got unlucky. Concentrated-bet variance.
  3. An average IC of 0.02 is below the ~0.04 often cited as the tradeable threshold. Returns come from stacking small edges across many trades, which feels thin.
  4. Folds 5 and 6 look almost too good (TOPN +176%, MINHOLD +184%). I've been careful with walk-forward, embargo, point-in-time universe, lag-aware label-derived features, etc. But Sharpe 2-3 on a daily-rebalanced long-only book in test feels too clean. The most likely explanation I can't rule out: subtle feature leakage.
  5. Adjusted-price drift across data refreshes. Tiingo re-applies dividend adjustments retroactively when new dividends are paid, so historical adjClose values shift. I discovered this the hard way — the same code and ticker set, run against different adjClose snapshots, produces different backtest numbers. Roughly 20% of tickers showed 10-100 bps of adjClose drift on historical rows between two fetches a week apart, so results aren't bit-reproducible across refreshes.
  6. TOPN struggled in the 2023 AI rally — the concentrated top-5 missed the Mag-7 concentration. A broader (TRANCHE) basket captured some of it.
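The drift in concern 5 is easy to quantify by diffing two snapshots — roughly what I ran to get the ~20%-of-tickers figure (schematic, assuming both snapshots are date × ticker adjClose frames):

```python
import pandas as pd

def adjclose_drift_bps(old: pd.DataFrame, new: pd.DataFrame) -> pd.Series:
    """Worst-case absolute relative drift (in bps) per ticker between two
    adjClose snapshots, computed over their shared dates and tickers."""
    idx = old.index.intersection(new.index)
    cols = old.columns.intersection(new.columns)
    rel = (new.loc[idx, cols] / old.loc[idx, cols] - 1.0).abs()
    return rel.max() * 1e4   # per-ticker max drift in basis points
```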

Open questions

  1. Low-IC high-return puzzle: is ~+70-84% annual return on low IC (0.02) plausible as alpha, or is there a typical look-ahead trap I should be hunting for?
  2. Rank-based confidence sizing: my ranker produces scores that sigmoid to a narrow band around the mean (not calibrated probabilities). Switching from the standard (p_up − 0.5) confidence weighting to rank-within-top-N added 4-6pp on concentrated portfolios. Is this a common fix for lambda-rank-style models, or is there a more principled approach (isotonic calibration etc.)?
  3. Dividend-adjustment drift: how do people handle this for reproducibility? Snapshot the dataset at a point in time? Use raw close and manually compound dividends? Accept drift and retrain?
  4. Fold-2-style regime change: is there a standard defensive overlay (macro gate, vol target, credit-spread filter) that you've seen actually work, or do most models just accept one bad regime year?
  5. Three correlated portfolio variants — is it defensible to run all three and report the best, or am I just p-hacking the presentation?
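For anyone pondering Q2 with me: the isotonic-calibration route boils down to pool-adjacent-violators over score-sorted outcomes. A bare-bones sketch (illustrative only; scikit-learn's IsotonicRegression is the practical choice):

```python
def pav(y):
    """Pool-adjacent-violators: given outcomes ordered by ranker score,
    return the non-decreasing fit minimizing squared error. Mapping raw
    scores through this fit yields calibrated expected outcomes."""
    vals = [float(v) for v in y]   # block means
    wts = [1.0] * len(vals)        # block weights
    sizes = [1] * len(vals)        # block lengths (to re-expand at the end)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:  # monotonicity violated: pool the blocks
            tot = wts[i] + wts[i + 1]
            merged = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / tot
            vals[i:i + 2] = [merged]
            wts[i:i + 2] = [tot]
            sizes[i:i + 2] = [sizes[i] + sizes[i + 1]]
            i = max(i - 1, 0)      # a merged block may now violate leftward
        else:
            i += 1
    out = []
    for v, n in zip(vals, sizes):  # expand block means back to point fits
        out.extend([v] * n)
    return out
```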
reddit.com
u/lobhas1 — 5 hours ago
