Backtesting & Strategy Validation
A strategy that looks perfect in backtest often fails in production. The difference is usually bias—data the model wouldn't have had, or assumptions that don't hold. Proper validation catches it.
- Backtest = run your strategy on historical data. But the past isn't the future.
- Survivorship bias: testing on data that only includes survivors (e.g. stocks that still exist) inflates returns.
- Lookahead bias: using information you wouldn't have had at decision time. Fatal.
- Purged K-fold: time-aware cross-validation. Train on past, test on future. No data leakage.
- Cost-aware: include fees, slippage, and execution reality. Paper returns ≠ live returns.
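The lookahead-bias bullet above is the most common failure in practice: a signal computed from the same bar's close cannot be traded at that bar's open. A minimal sketch of the fix, using synthetic prices and hypothetical column names, is to lag the signal by one period:

```python
import pandas as pd

# Synthetic daily closes (hypothetical data).
prices = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0, 104.0]})

# Naive signal: "buy if today's close rose." Today's close isn't known
# until the bar ends, so acting on it at today's open is lookahead bias.
signal_leaky = (prices["close"].diff() > 0).astype(int)

# Fix: shift the signal one bar, so the decision for day t only uses
# information available through day t-1.
signal_clean = signal_leaky.shift(1).fillna(0).astype(int)

print(signal_leaky.tolist())  # [0, 1, 0, 1, 0] -- uses same-day info
print(signal_clean.tolist())  # [0, 0, 1, 0, 1] -- tradable at next open
```

The one-bar shift is the simplest case; signals built from rolling windows or labels that span multiple days need a larger lag (see the purge gap discussed below).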
Real-world example
A trading strategy shows 40% annual returns in backtest.
You ran it on 5 years of price data. It looks amazing. You deploy with real money.
- Survivorship bias: Your dataset only had stocks that survived. Bankrupt companies were excluded. Real universe would have included losers. Returns drop 10–15%.
- Lookahead: Your "signal" used that day's closing price—but you'd trade at open. You had information you wouldn't have had. Remove it: returns collapse.
- Costs: Backtest ignored commissions, slippage, and spread. Add 0.1% per trade × 200 trades/year = 20% drag. Returns halve.
- Purged K-fold: Standard cross-validation shuffled data—future leaked into past. Time-aware split: train 2019–2021, test 2022. Out-of-sample performance much worse.
- The "40% strategy" might be 8% after proper validation. Or negative.
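The cost arithmetic in the example is easy to verify directly. A rough sketch with the hypothetical figures used above (simple additive drag, ignoring compounding, which only makes things worse):

```python
# Hypothetical figures from the example above.
gross_annual_return = 0.40   # 40% backtest return
cost_per_trade = 0.001       # 0.1% fees + slippage per trade
trades_per_year = 200

# Additive approximation of annual cost drag.
annual_cost_drag = cost_per_trade * trades_per_year
net_return = gross_annual_return - annual_cost_drag

print(f"cost drag:  {annual_cost_drag:.0%}")  # 20%
print(f"net return: {net_return:.0%}")        # 20% -- returns halved
```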
Backtest results are hypotheses. Validation turns them into defensible estimates.
- Survivorship bias: Data excludes failed companies, delisted stocks, discontinued products. You're testing on winners only.
- Lookahead bias: Using future information in past decisions. "If I had known X" — you wouldn't have.
- Overfitting: Tuning parameters until backtest fits perfectly. You fit noise, not signal.
- Ignoring costs: Real execution has fees, slippage, market impact. Paper returns are optimistic.
Standard K-fold shuffles data, so "future" data leaks into "past" training. Purged K-fold respects time order: train on earlier periods, test on later. A gap (purge) between train and test prevents leakage from overlapping windows.
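A minimal sketch of such a time-ordered split with a purge gap (hypothetical fold sizes; production implementations such as scikit-learn's `TimeSeriesSplit(gap=...)` or López de Prado's purged K-fold also handle overlapping label windows):

```python
def purged_splits(n_samples, n_splits, purge):
    """Yield (train_idx, test_idx) pairs where train strictly precedes
    test, and `purge` samples between them are dropped to prevent
    leakage from overlapping feature/label windows."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        test_start = k * fold
        train_end = max(0, test_start - purge)  # purge gap before test
        train = list(range(0, train_end))
        test = list(range(test_start, min(test_start + fold, n_samples)))
        yield train, test

# 12 samples, 3 folds, 1-sample purge gap (toy sizes).
for train, test in purged_splits(n_samples=12, n_splits=3, purge=1):
    print(train, "->", test)
```

Each fold trains only on data that ends at least `purge` samples before the test window begins, so no future information reaches the training set.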
The rule: the model never sees the future. Validation should mimic production—you decide with what you have, then observe what happens.
- Out-of-sample test: Hold back a time period the model never saw.
- Walk-forward: Retrain periodically, test on the next period. Simulates live deployment.
- Stress-test: How does it perform in 2008, 2020, regime shifts?
- Include all costs: fees, slippage, execution assumptions.
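The walk-forward item in the checklist above can be sketched as a rolling retrain/test loop. This is a toy version with hypothetical window sizes and a deliberately trivial "model" (go long only if the training-window mean return is positive):

```python
def walk_forward(returns, train_window, test_window):
    """Rolling retrain/test: fit on the last `train_window` returns,
    trade the next `test_window`, then slide forward. Collects only
    out-of-sample results, mimicking live deployment."""
    out_of_sample = []
    start = 0
    while start + train_window + test_window <= len(returns):
        train = returns[start : start + train_window]
        test = returns[start + train_window : start + train_window + test_window]
        # Toy "model": long if training mean is positive, else flat.
        position = 1 if sum(train) / len(train) > 0 else 0
        out_of_sample.extend(position * r for r in test)
        start += test_window
    return out_of_sample

# Toy daily return series (hypothetical).
rets = [0.01, -0.02, 0.03, 0.01, -0.01, 0.02, -0.03, 0.01]
print(walk_forward(rets, train_window=4, test_window=2))
```

Every return in the output was earned on data the model had not seen when its parameters were set, which is exactly the property an in-sample backtest lacks.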
- Trusting in-sample backtest as if it were out-of-sample
- Using datasets that exclude failures (survivorship bias)
- Tuning hyperparameters on the same data you use to report performance
- Assuming paper returns translate to live returns without cost modeling