Looking for honest critique on my 6-fold walk-forward quant backtest — US equities, long-only, daily data
I've been building a cross-sectional equity ranker and would like honest critique of the backtest framework and results. I'm keeping model/feature details abstract (that's the IP I've invested in), but I'm happy to discuss architecture and methodology.
**Setup**
- Universe: ~650 US equities (S&P 500 + mid-caps + some delisted names, point-in-time membership)
- Data: daily OHLCV from Tiingo, 2006-present, adjusted prices
- Label: 5-day forward excess return vs SPY, decile-ranked for training
- Model: tree-based cross-sectional ranker
**Walk-forward validation**
- 6 rolling folds, each 12y train / 1y validation / 1y test
- 10-day embargo between val and test
- Non-overlapping test windows spanning 2020-02 to 2026-02
- Proper point-in-time universe (no look-ahead on ticker membership)
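For concreteness, the fold geometry above can be sketched as a small date helper. This is a hypothetical reconstruction from the stated parameters (12y/1y/1y, 10-day embargo, Feb-to-Feb test windows), not the author's actual splitter:

```python
from datetime import date, timedelta

def walk_forward_folds(first_test_year=2020, n_folds=6,
                       train_years=12, val_years=1, embargo_days=10):
    """Rolling walk-forward (train, val, test) date ranges.

    Mirrors the setup described above: 12y train / 1y val / 1y test per
    fold, a 10-day embargo between val end and test start, and
    non-overlapping 1-year test windows on Feb-to-Feb boundaries.
    """
    folds = []
    for k in range(n_folds):
        test_start = date(first_test_year + k, 2, 1)
        test_end = date(first_test_year + k + 1, 2, 1)   # 1y test, non-overlapping
        val_end = test_start - timedelta(days=embargo_days)  # 10-day embargo
        val_start = val_end.replace(year=val_end.year - val_years)
        train_end = val_start
        train_start = train_end.replace(year=train_end.year - train_years)
        folds.append({"train": (train_start, train_end),
                      "val": (val_start, val_end),
                      "test": (test_start, test_end)})
    return folds
```

With 2006+ data, fold 1's train window starts in early 2007, so the history just covers the deepest fold.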
**Three portfolio variants run in parallel**
| Portfolio | Rebalance | Holding |
|---|---|---|
| TOPN-5 | Every 5 days | Full 5 days |
| TRANCHE | Daily (5 overlapping tranches) | 5 days each |
| MINHOLD | Daily entry | Min 5 days, signal-driven exit |
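To make the TRANCHE mechanics explicit: at any point the book is the union of the last five daily top-5 baskets, each tranche at roughly 1/5 of capital. A minimal sketch (function and input names are mine, for illustration):

```python
def tranche_book(daily_top5, n_tranches=5):
    """Names held each day under the TRANCHE variant.

    Each day a new tranche enters that day's top-5 basket and holds it
    for n_tranches days, so the day-d book is the union of the baskets
    chosen on days d-4 .. d (fewer during the warm-up period).
    """
    book = []
    for d in range(len(daily_top5)):
        start = max(0, d - n_tranches + 1)
        held = set()
        for t in range(start, d + 1):   # the last 5 rebalance days
            held |= set(daily_top5[t])
        book.append(sorted(held))
    return book
```

This is why TRANCHE diversifies entry timing: a single bad rebalance day only touches 1/5 of the book.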
**Per-portfolio sizing**
After finding that no single sizing scheme works best across all three variants, my production config runs:
- TOPN / TRANCHE: rank-based confidence weighting (weights ∝ rank² within top-5)
- MINHOLD: equal-weighted (daily entry made rank-concentration too noisy)
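The rank² weighting above, as a sketch (my naming; higher score = better rank, top name gets weight ∝ N², Nth gets ∝ 1²):

```python
def rank_squared_weights(scores, top_n=5):
    """Confidence weights proportional to rank² within the top-N.

    `scores` maps ticker -> model score (higher = better). Names
    outside the top-N get no weight; weights are normalized to sum
    to 1. For top_n=5 the raw weights are 25:16:9:4:1.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    raw = {tkr: (top_n - i) ** 2 for i, tkr in enumerate(top)}
    total = sum(raw.values())
    return {tkr: w / total for tkr, w in raw.items()}
```

Note the concentration this implies: the #1 pick carries 25/55 ≈ 45% of the sleeve, which is relevant to the fold-4 "concentrated-bet variance" issue below.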
**6-fold test-set results (total return, 1-year test windows)**
| Fold | Period | TOPN | TRANCHE | MINHOLD | SPY |
|---|---|---|---|---|---|
| 1 | 2020-02 → 2021-02 | +72% | +141% | +146% | +9.5% |
| 2 | 2021-02 → 2022-02 | +4% | +18% | +4% | +9.9% |
| 3 | 2022-02 → 2023-02 | +63% | +39% | +55% | −9.7% |
| 4 | 2023-02 → 2024-02 | −15% | +25% | +12% | +23.6% |
| 5 | 2024-02 → 2025-02 | +176% | +159% | +184% | +21.9% |
| 6 | 2025-02 → 2026-02 | +125% | +78% | +101% | +11.7% |
| Avg | | +71% | +76% | +84% | +13% |
Test-set Sharpe ranges from 0.3 to 3.6 across folds. IC (Spearman) averages 0.02, with a per-fold range of −0.002 to +0.046.
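For anyone who wants to poke at the IC numbers: the per-date cross-sectional Spearman IC is just a rank correlation between scores and realized forward excess returns. A dependency-free sketch (no tie handling; on real data you'd use `scipy.stats.spearmanr`):

```python
from statistics import mean

def spearman_ic(scores, fwd_returns):
    """Cross-sectional Spearman IC for a single rebalance date.

    Rank-correlates model scores against realized 5-day forward excess
    returns across the universe. Assumes no tied values; the average of
    these daily ICs over a fold is the reported per-fold IC.
    """
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rs, rr = ranks(scores), ranks(fwd_returns)
    mrs, mrr = mean(rs), mean(rr)
    cov = sum((a - mrs) * (b - mrr) for a, b in zip(rs, rr))
    var_s = sum((a - mrs) ** 2 for a in rs)
    var_r = sum((b - mrr) ** 2 for b in rr)
    return cov / (var_s * var_r) ** 0.5
```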
Costs modeled: 1bp fee + 3bp slippage + 5bp spread buffer per trade, 50bp annual borrow (long-only in this config).
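As a sanity check on the cost stack, here is how I'd net a single position's gross return under those numbers. The round-trip and borrow accounting below is my assumption for illustration, not necessarily how the author books it:

```python
def net_return(gross_return, n_trades=2, fee_bp=1.0, slippage_bp=3.0,
               spread_bp=5.0, borrow_bp_annual=50.0,
               holding_days=5, year_days=252):
    """Net a gross position return against the stated cost stack.

    Assumes each trade (entry + exit => n_trades=2 for a round trip)
    costs fee + slippage + spread buffer = 9bp, and the 50bp annual
    borrow accrues pro rata over the holding period.
    """
    per_trade = (fee_bp + slippage_bp + spread_bp) / 1e4
    borrow = (borrow_bp_annual / 1e4) * holding_days / year_days
    return gross_return - n_trades * per_trade - borrow
```

At a 5-day holding period this is roughly 19bp of round-trip drag per position, which compounds to a meaningful hurdle at daily-rebalance turnover.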
**What I think might actually be alpha**
- Each variant beats SPY in at least 4 of 6 folds (TRANCHE in all 6)
- TRANCHE's daily-5-tranche structure has the best risk-adjusted numbers — often Sharpe 2-3 on test
- Consistent across varied regimes: COVID, 2022 drawdown, 2023 AI rally, 2025-26 range
- Signal appears orthogonal to market beta (in test fold 3, MINHOLD returned +55% while SPY was −9.7%)
**What's concerning me (please pile on)**
- Fold 2 (2021-22) is universally weak. All three portfolios barely beat or lose to SPY. Growth-to-value rotation year. IC near zero — model has essentially no signal in that regime. I haven't found a fix.
- TOPN fold 4 was negative despite highest IC (0.046). Broader ranking was correct but the specific top-5 picks got unlucky. Concentrated-bet variance.
- IC of 0.02 is below the usual "tradeable" threshold of 0.04. Returns come from stacking small edges across many trades. Feels thin.
- Fold 5 and 6 look almost too good (TOPN +176%, MINHOLD +184%). I've been careful with walk-forward, embargo, point-in-time universe, label-derived features are lag-aware, etc. But Sharpe 2-3 on daily-rebalanced long-only in test feels too clean. Most likely explanation I can't rule out: subtle feature leakage.
- Adjusted-price drift across data refreshes. Tiingo re-applies dividend adjustments retroactively when new dividends are paid, so historical adjClose values shift. Discovered the hard way: the same code and the same tickers, run against different adjClose snapshots, give different backtest numbers. About 20% of tickers showed 10-100bp adjClose drift on historical rows between two fetches a week apart. Results aren't bit-reproducible across refreshes.
- TOPN struggled in the 2023 AI rally — the concentrated top-5 missed the Mag-7 concentration. A broader (TRANCHE) basket captured some of it.
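On the adjClose drift point, the check I'd run between two refreshes is a per-ticker worst-case relative deviation over shared dates, so anything above a chosen bps threshold gets flagged and the dataset snapshot can be pinned. A sketch (names and data shapes are mine):

```python
def adjclose_drift_bps(snapshot_a, snapshot_b):
    """Per-ticker max adjClose drift, in bps, between two data refreshes.

    Each snapshot maps ticker -> {date_str: adjClose}. Returns, for each
    ticker present in both snapshots, the worst absolute relative
    deviation over the dates the two snapshots share.
    """
    drift = {}
    for tkr in snapshot_a.keys() & snapshot_b.keys():
        a, b = snapshot_a[tkr], snapshot_b[tkr]
        shared = a.keys() & b.keys()
        if not shared:
            continue
        drift[tkr] = max(abs(b[d] / a[d] - 1.0) * 1e4 for d in shared)
    return drift
```

Running this on every fetch and archiving the raw snapshot alongside the backtest config at least makes the non-reproducibility visible and auditable.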
**Open questions**
- Low-IC high-return puzzle: is ~+70-84% annual return on low IC (0.02) plausible as alpha, or is there a typical look-ahead trap I should be hunting for?
- Rank-based confidence sizing: my ranker produces scores that sigmoid into a narrow band around the mean (not calibrated probabilities). Switching from the standard (p_up − 0.5) confidence weighting to rank-within-top-N added 4-6pp on concentrated portfolios. Is this a common fix for lambda-rank-style models, or is there a more principled approach (isotonic calibration, etc.)?
- Dividend-adjustment drift: how do people handle this for reproducibility? Snapshot the dataset at a point in time? Use raw close and compound dividends manually? Accept the drift and retrain?
- Fold-2-style regime change: is there a standard defensive overlay (macro gate, vol target, credit-spread filter) that you've seen actually work, or do most models just accept one bad regime year?
- Three correlated portfolio variants — is it defensible to run all three and report the best, or am I just p-hacking the presentation?
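For the calibration question above, the isotonic route I'd try first: fit a monotone map from validation-set scores to realized hit rates, then size by calibrated probability instead of raw rank. Below is a pure-Python pool-adjacent-violators sketch (sklearn's `IsotonicRegression` implements the same idea; the function and inputs here are hypothetical):

```python
def pav_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit: monotone map score -> calibrated rate.

    `scores` are validation-set model scores, `outcomes` are 0/1 labels
    (e.g. "beat SPY over the next 5 days"). Returns a non-decreasing
    step function as (score_upper_bound, calibrated_value) pairs.
    """
    pairs = sorted(zip(scores, outcomes))
    merged = []  # blocks of [outcome_sum, count, max_score_in_block]
    for x, y in pairs:
        merged.append([float(y), 1, x])
        # pool while the non-decreasing constraint is violated
        while (len(merged) > 1 and
               merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]):
            s2, w2, x2 = merged.pop()
            s1, w1, _ = merged.pop()
            merged.append([s1 + s2, w1 + w2, x2])
    return [(blk[2], blk[0] / blk[1]) for blk in merged]
```

Sizing by these calibrated values would answer the "narrow sigmoid band" problem directly, since the map stretches flat score regions only where the validation data supports it.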