Backtesting & Strategy Validation

A strategy that looks perfect in backtest often fails in production. The difference is usually bias—data the model wouldn't have had, or assumptions that don't hold. Proper validation catches it.

  • Backtest = run your strategy on historical data. But the past isn't the future.
  • Survivorship bias: testing on data that only includes survivors (e.g. stocks that still exist) inflates returns.
  • Lookahead bias: using information you wouldn't have had at decision time. Fatal.
  • Purged K-fold: time-aware cross-validation. Train on past, test on future. No data leakage.
  • Cost-aware: include fees, slippage, and execution reality. Paper returns ≠ live returns.
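The lookahead point is easy to demonstrate. A sketch on synthetic data (all numbers illustrative): a naive momentum signal applied to the same bar that generated it leaks the future, while shifting the signal by one bar does not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic daily returns and prices (illustrative, not real market data)
returns = rng.normal(0.0005, 0.01, 1000)
prices = 100 * np.cumprod(1 + returns)

# Momentum signal computed from today's close vs. yesterday's close
prev_close = np.concatenate(([prices[0]], prices[:-1]))
signal = (prices > prev_close).astype(float)

# Leaky backtest: today's signal earns today's return, i.e. the very move
# that generated the signal (lookahead bias)
leaky = signal * returns

# Honest backtest: today's signal earns tomorrow's return
clean = np.concatenate(([0.0], signal[:-1] * returns[1:]))
```

The leaky version "earns" every up day by construction; the shifted version is what a live trader could actually have captured.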

Real-world example

A trading strategy shows 40% annual returns in a backtest

You ran it on 5 years of price data. It looks amazing. You deploy with real money.

  • Survivorship bias: Your dataset only had stocks that survived. Bankrupt companies were excluded. Real universe would have included losers. Returns drop 10–15%.
  • Lookahead: Your "signal" used that day's closing price—but you'd trade at open. You had information you wouldn't have had. Remove it: returns collapse.
  • Costs: Backtest ignored commissions, slippage, and spread. Add 0.1% per trade × 200 trades/year = 20% drag. Returns halve.
  • Purged K-fold: Standard cross-validation shuffled data—future leaked into past. Time-aware split: train 2019–2021, test 2022. Out-of-sample performance much worse.
  • The "40% strategy" might be 8% after proper validation. Or negative.

Backtest results are hypotheses. Validation turns them into defensible estimates.

  • Survivorship bias: Data excludes failed companies, delisted stocks, discontinued products. You're testing on winners only.
  • Lookahead bias: Using future information in past decisions. "If I had known X" — you wouldn't have.
  • Overfitting: Tuning parameters until backtest fits perfectly. You fit noise, not signal.
  • Ignoring costs: Real execution has fees, slippage, market impact. Paper returns are optimistic.
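The overfitting point can be shown on pure noise: tune a flexible model until it fits in-sample, then watch it fail on fresh data. A synthetic sketch (degree and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = rng.normal(0, 1, 30)          # pure noise: there is no signal to recover

# "Tuning": a degree-12 polynomial fits the in-sample noise closely
coeffs = np.polyfit(x, y, 12)
in_sample_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)

# A fresh draw from the same process: the tuned fit does not generalize
y_new = rng.normal(0, 1, 30)
out_sample_mse = np.mean((np.polyval(coeffs, x) - y_new) ** 2)
```

The in-sample error looks impressive because the model memorized the noise; the out-of-sample error reveals there was nothing to learn.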

Standard K-fold shuffles data, so "future" data leaks into "past" training. Purged K-fold respects time order: train on earlier periods, test on later. A gap (purge) between train and test prevents leakage from overlapping windows.
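A minimal time-ordered split with a purge gap might look like this (hypothetical helper; the train fraction and purge length are arbitrary):

```python
import numpy as np

def purged_time_split(n, train_frac=0.7, purge=5):
    """Time-ordered train/test split with a purge gap to block leakage
    from overlapping windows. Parameter choices are illustrative."""
    cut = int(n * train_frac)
    train = np.arange(0, cut)           # earlier period only
    test = np.arange(cut + purge, n)    # later period, after the gap
    return train, test

train, test = purged_time_split(100)
```

Every training index precedes every test index, and the purge removes the samples whose lookback windows would straddle the boundary.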

The rule: the model never sees the future. Validation should mimic production—you decide with what you have, then observe what happens.

  1. Out-of-sample test: Hold back a time period the model never saw.
  2. Walk-forward: Retrain periodically, test on the next period. Simulates live deployment.
  3. Stress-test: How does it perform in 2008, 2020, regime shifts?
  4. Include all costs: fees, slippage, execution assumptions.

Common pitfalls

  • Trusting an in-sample backtest as if it were out-of-sample
  • Using datasets that exclude failures (survivorship bias)
  • Tuning hyperparameters on the same data you use to report performance
  • Assuming paper returns translate to live returns without cost modeling
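Step 2 above (walk-forward) can be sketched as a rolling split; the window sizes here are assumptions (roughly one trading year of training, one month of testing):

```python
import numpy as np

def walk_forward(n, train_window=252, test_window=21):
    """Yield time-ordered (train_idx, test_idx) pairs that roll forward,
    mimicking periodic retraining in live deployment. Window sizes are
    illustrative assumptions."""
    start = 0
    while start + train_window + test_window <= n:
        train = np.arange(start, start + train_window)
        test = np.arange(start + train_window,
                         start + train_window + test_window)
        yield train, test
        start += test_window              # advance by one test period

splits = list(walk_forward(500))
```

Each model only ever sees data older than its own test window, which is exactly the constraint a live strategy operates under.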
