Backtesting & Strategy Validation

A strategy that looks perfect in backtest often fails in production. The difference is usually bias—data the model wouldn't have had, or assumptions that don't hold. Proper validation catches it.

  • Backtest = run your strategy on historical data. But the past isn't the future.
  • Survivorship bias: testing on data that only includes survivors (e.g. stocks that still exist) inflates returns.
  • Lookahead bias: using information you wouldn't have had at decision time. Fatal.
  • Purged K-fold: time-aware cross-validation. Train on past, test on future. No data leakage.
  • Cost-aware: include fees, slippage, and execution reality. Paper returns ≠ live returns.
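The lookahead point is easy to demonstrate. A sketch on synthetic data (all numbers illustrative): a naive momentum signal applied to the same bar that generated it leaks the future, while shifting the signal by one bar does not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic daily returns and prices (illustrative, not real market data)
returns = rng.normal(0.0005, 0.01, 1000)
prices = 100 * np.cumprod(1 + returns)

# Momentum signal computed from today's close vs. yesterday's close
prev_close = np.concatenate(([prices[0]], prices[:-1]))
signal = (prices > prev_close).astype(float)

# Leaky backtest: today's signal earns today's return, i.e. the very move
# that generated the signal (lookahead bias)
leaky = signal * returns

# Honest backtest: today's signal earns tomorrow's return
clean = np.concatenate(([0.0], signal[:-1] * returns[1:]))
```

The leaky version "earns" every up day by construction; the shifted version is what a live trader could actually have captured.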

Real-world example

A trading strategy shows 40% annual returns in a backtest

You ran it on 5 years of price data. It looks amazing. You deploy with real money.

  • Survivorship bias: Your dataset only had stocks that survived. Bankrupt companies were excluded. Real universe would have included losers. Returns drop 10–15%.
  • Lookahead: Your "signal" used that day's closing price—but you'd trade at open. You had information you wouldn't have had. Remove it: returns collapse.
  • Costs: Backtest ignored commissions, slippage, and spread. Add 0.1% per trade × 200 trades/year = 20% drag. Returns halve.
  • Purged K-fold: Standard cross-validation shuffled data—future leaked into past. Time-aware split: train 2019–2021, test 2022. Out-of-sample performance much worse.
  • The "40% strategy" might be 8% after proper validation. Or negative.

Backtest results are hypotheses. Validation turns them into defensible estimates.

  • Survivorship bias: Data excludes failed companies, delisted stocks, discontinued products. You're testing on winners only.
  • Lookahead bias: Using future information in past decisions. "If I had known X" — you wouldn't have.
  • Overfitting: Tuning parameters until backtest fits perfectly. You fit noise, not signal.
  • Ignoring costs: Real execution has fees, slippage, market impact. Paper returns are optimistic.
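The overfitting point can be shown on pure noise: tune a flexible model until it fits in-sample, then watch it fail on fresh data. A synthetic sketch (degree and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = rng.normal(0, 1, 30)          # pure noise: there is no signal to recover

# "Tuning": a degree-12 polynomial fits the in-sample noise closely
coeffs = np.polyfit(x, y, 12)
in_sample_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)

# A fresh draw from the same process: the tuned fit does not generalize
y_new = rng.normal(0, 1, 30)
out_sample_mse = np.mean((np.polyval(coeffs, x) - y_new) ** 2)
```

The in-sample error looks impressive because the model memorized the noise; the out-of-sample error reveals there was nothing to learn.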

Standard K-fold shuffles data, so "future" data leaks into "past" training. Purged K-fold respects time order: train on earlier periods, test on later. A gap (purge) between train and test prevents leakage from overlapping windows.
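A minimal time-ordered split with a purge gap might look like this (hypothetical helper; the train fraction and purge length are arbitrary):

```python
import numpy as np

def purged_time_split(n, train_frac=0.7, purge=5):
    """Time-ordered train/test split with a purge gap to block leakage
    from overlapping windows. Parameter choices are illustrative."""
    cut = int(n * train_frac)
    train = np.arange(0, cut)           # earlier period only
    test = np.arange(cut + purge, n)    # later period, after the gap
    return train, test

train, test = purged_time_split(100)
```

Every training index precedes every test index, and the purge removes the samples whose lookback windows would straddle the boundary.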

The rule: the model never sees the future. Validation should mimic production—you decide with what you have, then observe what happens.

  1. Out-of-sample test: Hold back a time period the model never saw.
  2. Walk-forward: Retrain periodically, test on the next period. Simulates live deployment.
  3. Stress-test: How does it perform in 2008, 2020, regime shifts?
  4. Include all costs: fees, slippage, execution assumptions.

Common pitfalls

  • Trusting an in-sample backtest as if it were out-of-sample
  • Using datasets that exclude failures (survivorship bias)
  • Tuning hyperparameters on the same data you use to report performance
  • Assuming paper returns translate to live returns without cost modeling
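Step 2 above (walk-forward) can be sketched as a rolling split; the window sizes here are assumptions (roughly one trading year of training, one month of testing):

```python
import numpy as np

def walk_forward(n, train_window=252, test_window=21):
    """Yield time-ordered (train_idx, test_idx) pairs that roll forward,
    mimicking periodic retraining in live deployment. Window sizes are
    illustrative assumptions."""
    start = 0
    while start + train_window + test_window <= n:
        train = np.arange(start, start + train_window)
        test = np.arange(start + train_window,
                         start + train_window + test_window)
        yield train, test
        start += test_window              # advance by one test period

splits = list(walk_forward(500))
```

Each model only ever sees data older than its own test window, which is exactly the constraint a live strategy operates under.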
