Looking for honest critique on my 6-fold walk-forward quant backtest — US equities, long-only, daily data
I've been building a cross-sectional equity ranker and would like honest critique of the backtest framework and results. I'm keeping model/feature details abstract (that's the IP I've invested in), but I'm happy to discuss architecture and methodology.
**Setup**
- Universe: ~650 US equities (S&P 500 + mid-caps + some delisted names, point-in-time membership)
- Data: daily OHLCV from Tiingo, 2006-present, adjusted prices
- Label: 5-day forward excess return vs SPY, decile-ranked for training
- Model: tree-based cross-sectional ranker
**Walk-forward validation**
- 6 rolling folds, each 12y train / 1y validation / 1y test
- 10-day embargo between val and test
- Non-overlapping test windows spanning 2020-02 to 2026-02
- Proper point-in-time universe (no look-ahead on ticker membership)
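For concreteness, the fold geometry above can be sketched as a small date helper. This is a hypothetical reconstruction from the stated parameters (12y/1y/1y, 10-day embargo, Feb-to-Feb test windows), not the author's actual splitter:

```python
from datetime import date, timedelta

def walk_forward_folds(first_test_year=2020, n_folds=6,
                       train_years=12, val_years=1, embargo_days=10):
    """Rolling walk-forward (train, val, test) date ranges.

    Mirrors the setup described above: 12y train / 1y val / 1y test per
    fold, a 10-day embargo between val end and test start, and
    non-overlapping 1-year test windows on Feb-to-Feb boundaries.
    """
    folds = []
    for k in range(n_folds):
        test_start = date(first_test_year + k, 2, 1)
        test_end = date(first_test_year + k + 1, 2, 1)   # 1y test, non-overlapping
        val_end = test_start - timedelta(days=embargo_days)  # 10-day embargo
        val_start = val_end.replace(year=val_end.year - val_years)
        train_end = val_start
        train_start = train_end.replace(year=train_end.year - train_years)
        folds.append({"train": (train_start, train_end),
                      "val": (val_start, val_end),
                      "test": (test_start, test_end)})
    return folds
```

With 2006+ data, fold 1's train window starts in early 2007, so the history just covers the deepest fold.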
**Three portfolio variants run in parallel**
| Portfolio | Rebalance | Holding |
|---|---|---|
| TOPN-5 | Every 5 days | Full 5 days |
| TRANCHE | Daily (5 overlapping tranches) | 5 days each |
| MINHOLD | Daily entry | Min 5 days, signal-driven exit |
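To make the TRANCHE mechanics explicit: at any point the book is the union of the last five daily top-5 baskets, each tranche at roughly 1/5 of capital. A minimal sketch (function and input names are mine, for illustration):

```python
def tranche_book(daily_top5, n_tranches=5):
    """Names held each day under the TRANCHE variant.

    Each day a new tranche enters that day's top-5 basket and holds it
    for n_tranches days, so the day-d book is the union of the baskets
    chosen on days d-4 .. d (fewer during the warm-up period).
    """
    book = []
    for d in range(len(daily_top5)):
        start = max(0, d - n_tranches + 1)
        held = set()
        for t in range(start, d + 1):   # the last 5 rebalance days
            held |= set(daily_top5[t])
        book.append(sorted(held))
    return book
```

This is why TRANCHE diversifies entry timing: a single bad rebalance day only touches 1/5 of the book.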
**Per-portfolio sizing**
After finding that no single sizing scheme works best across all three variants, my production config runs:
- TOPN / TRANCHE: rank-based confidence weighting (weights ∝ rank² within top-5)
- MINHOLD: equal-weighted (daily entry made rank-concentration too noisy)
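The rank² weighting above, as a sketch (my naming; higher score = better rank, top name gets weight ∝ N², Nth gets ∝ 1²):

```python
def rank_squared_weights(scores, top_n=5):
    """Confidence weights proportional to rank² within the top-N.

    `scores` maps ticker -> model score (higher = better). Names
    outside the top-N get no weight; weights are normalized to sum
    to 1. For top_n=5 the raw weights are 25:16:9:4:1.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    raw = {tkr: (top_n - i) ** 2 for i, tkr in enumerate(top)}
    total = sum(raw.values())
    return {tkr: w / total for tkr, w in raw.items()}
```

Note the concentration this implies: the #1 pick carries 25/55 ≈ 45% of the sleeve, which is relevant to the fold-4 "concentrated-bet variance" issue below.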
**6-fold test-set results (total return, 1-year test windows)**
| Fold | Period | TOPN | TRANCHE | MINHOLD | SPY |
|---|---|---|---|---|---|
| 1 | 2020-02 → 2021-02 | +72% | +141% | +146% | +9.5% |
| 2 | 2021-02 → 2022-02 | +4% | +18% | +4% | +9.9% |
| 3 | 2022-02 → 2023-02 | +63% | +39% | +55% | −9.7% |
| 4 | 2023-02 → 2024-02 | −15% | +25% | +12% | +23.6% |
| 5 | 2024-02 → 2025-02 | +176% | +159% | +184% | +21.9% |
| 6 | 2025-02 → 2026-02 | +125% | +78% | +101% | +11.7% |
| Avg | | +71% | +76% | +84% | +13% |
Test-set Sharpe ranges from 0.3 to 3.6 across folds. IC (Spearman) averages 0.02, with a per-fold range of −0.002 to +0.046.
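For anyone who wants to poke at the IC numbers: the per-date cross-sectional Spearman IC is just a rank correlation between scores and realized forward excess returns. A dependency-free sketch (no tie handling; on real data you'd use `scipy.stats.spearmanr`):

```python
from statistics import mean

def spearman_ic(scores, fwd_returns):
    """Cross-sectional Spearman IC for a single rebalance date.

    Rank-correlates model scores against realized 5-day forward excess
    returns across the universe. Assumes no tied values; the average of
    these daily ICs over a fold is the reported per-fold IC.
    """
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rs, rr = ranks(scores), ranks(fwd_returns)
    mrs, mrr = mean(rs), mean(rr)
    cov = sum((a - mrs) * (b - mrr) for a, b in zip(rs, rr))
    var_s = sum((a - mrs) ** 2 for a in rs)
    var_r = sum((b - mrr) ** 2 for b in rr)
    return cov / (var_s * var_r) ** 0.5
```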
Costs modeled: 1bp fee + 3bp slippage + 5bp spread buffer per trade, 50bp annual borrow (long-only in this config).
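As a sanity check on the cost stack, here is how I'd net a single position's gross return under those numbers. The round-trip and borrow accounting below is my assumption for illustration, not necessarily how the author books it:

```python
def net_return(gross_return, n_trades=2, fee_bp=1.0, slippage_bp=3.0,
               spread_bp=5.0, borrow_bp_annual=50.0,
               holding_days=5, year_days=252):
    """Net a gross position return against the stated cost stack.

    Assumes each trade (entry + exit => n_trades=2 for a round trip)
    costs fee + slippage + spread buffer = 9bp, and the 50bp annual
    borrow accrues pro rata over the holding period.
    """
    per_trade = (fee_bp + slippage_bp + spread_bp) / 1e4
    borrow = (borrow_bp_annual / 1e4) * holding_days / year_days
    return gross_return - n_trades * per_trade - borrow
```

At a 5-day holding period this is roughly 19bp of round-trip drag per position, which compounds to a meaningful hurdle at daily-rebalance turnover.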
**What I think might actually be alpha**
- Each variant beats SPY in at least 4 of 6 folds (TRANCHE in all 6)
- TRANCHE's daily-5-tranche structure has the best risk-adjusted numbers — often Sharpe 2-3 on test
- Consistent across varied regimes: COVID, 2022 drawdown, 2023 AI rally, 2025-26 range
- Signal appears orthogonal to market beta (in test fold 3, MINHOLD returned +55% while SPY was −9.7%)
**What's concerning me (please pile on)**
- Fold 2 (2021-22) is universally weak. All three portfolios barely beat or lose to SPY. Growth-to-value rotation year. IC near zero — model has essentially no signal in that regime. I haven't found a fix.
- TOPN fold 4 was negative despite highest IC (0.046). Broader ranking was correct but the specific top-5 picks got unlucky. Concentrated-bet variance.
- IC of 0.02 is below the usual "tradeable" threshold of 0.04. Returns come from stacking small edges across many trades. Feels thin.
- Fold 5 and 6 look almost too good (TOPN +176%, MINHOLD +184%). I've been careful with walk-forward, embargo, point-in-time universe, label-derived features are lag-aware, etc. But Sharpe 2-3 on daily-rebalanced long-only in test feels too clean. Most likely explanation I can't rule out: subtle feature leakage.
- Adjusted-price drift across data refreshes. Tiingo re-applies dividend adjustments retroactively when new dividends are paid, so historical adjClose values shift. Discovered the hard way: the same code and the same tickers, run against different adjClose snapshots, give different backtest numbers. About 20% of tickers showed 10-100bp adjClose drift on historical rows between two fetches a week apart. Results aren't bit-reproducible across refreshes.
- TOPN struggled in the 2023 AI rally — the concentrated top-5 missed the Mag-7 concentration. A broader (TRANCHE) basket captured some of it.
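On the adjClose drift point, the check I'd run between two refreshes is a per-ticker worst-case relative deviation over shared dates, so anything above a chosen bps threshold gets flagged and the dataset snapshot can be pinned. A sketch (names and data shapes are mine):

```python
def adjclose_drift_bps(snapshot_a, snapshot_b):
    """Per-ticker max adjClose drift, in bps, between two data refreshes.

    Each snapshot maps ticker -> {date_str: adjClose}. Returns, for each
    ticker present in both snapshots, the worst absolute relative
    deviation over the dates the two snapshots share.
    """
    drift = {}
    for tkr in snapshot_a.keys() & snapshot_b.keys():
        a, b = snapshot_a[tkr], snapshot_b[tkr]
        shared = a.keys() & b.keys()
        if not shared:
            continue
        drift[tkr] = max(abs(b[d] / a[d] - 1.0) * 1e4 for d in shared)
    return drift
```

Running this on every fetch and archiving the raw snapshot alongside the backtest config at least makes the non-reproducibility visible and auditable.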
**Open questions**
- Low-IC high-return puzzle: is ~+70-84% annual return on low IC (0.02) plausible as alpha, or is there a typical look-ahead trap I should be hunting for?
- Rank-based confidence sizing: my ranker produces scores that sigmoid into a narrow band around the mean (not calibrated probabilities). Switching from the standard (p_up − 0.5) confidence weighting to rank-within-top-N added 4-6pp on concentrated portfolios. Is this a common fix for lambda-rank-style models, or is there a more principled approach (isotonic calibration, etc.)?
- Dividend-adjustment drift: how do people handle this for reproducibility? Snapshot the dataset at a point in time? Use raw close and compound dividends manually? Accept the drift and retrain?
- Fold-2-style regime change: is there a standard defensive overlay (macro gate, vol target, credit-spread filter) that you've seen actually work, or do most models just accept one bad regime year?
- Three correlated portfolio variants — is it defensible to run all three and report the best, or am I just p-hacking the presentation?
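For the calibration question above, the isotonic route I'd try first: fit a monotone map from validation-set scores to realized hit rates, then size by calibrated probability instead of raw rank. Below is a pure-Python pool-adjacent-violators sketch (sklearn's `IsotonicRegression` implements the same idea; the function and inputs here are hypothetical):

```python
def pav_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit: monotone map score -> calibrated rate.

    `scores` are validation-set model scores, `outcomes` are 0/1 labels
    (e.g. "beat SPY over the next 5 days"). Returns a non-decreasing
    step function as (score_upper_bound, calibrated_value) pairs.
    """
    pairs = sorted(zip(scores, outcomes))
    merged = []  # blocks of [outcome_sum, count, max_score_in_block]
    for x, y in pairs:
        merged.append([float(y), 1, x])
        # pool while the non-decreasing constraint is violated
        while (len(merged) > 1 and
               merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]):
            s2, w2, x2 = merged.pop()
            s1, w1, _ = merged.pop()
            merged.append([s1 + s2, w1 + w2, x2])
    return [(blk[2], blk[0] / blk[1]) for blk in merged]
```

Sizing by these calibrated values would answer the "narrow sigmoid band" problem directly, since the map stretches flat score regions only where the validation data supports it.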