Walk-forward validation
Problem: Your backtest shows Sharpe 1.5 but you're worried about overfit. How do you validate out-of-sample?
Solution: Walk-forward CV — train on a rolling window, test on the next window, accumulate out-of-sample performance.
The basic idea
[---- train 252d ----][-- test 63d --]                          First fold
        [---- train 252d ----][-- test 63d --]                  Second fold (rolled forward by step=63)
                [---- train 252d ----][-- test 63d --]          ...
Train on past, test on immediate future, never overlap.
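The rolling scheme above can be sketched in a few lines of plain Python. This is an illustrative sketch, not how mts1b_quantkit builds its folds: the `Fold` tuple and `walk_forward_folds` function are hypothetical names, and the sketch assumes every training window is taken from inside the sample (a library may instead preload history before `start`).

```python
from typing import Iterator, NamedTuple


class Fold(NamedTuple):
    train_start: int  # inclusive index into the list of trading days
    train_end: int    # exclusive
    test_start: int   # inclusive; equals train_end, so train and test never overlap
    test_end: int     # exclusive


def walk_forward_folds(n_days: int, train_window: int,
                       test_window: int, step: int) -> Iterator[Fold]:
    """Yield rolling train/test index ranges over n_days trading days."""
    if step < 1:
        raise ValueError("step must be >= 1")
    start = 0
    # Stop once a full train + test span no longer fits in the sample.
    while start + train_window + test_window <= n_days:
        train_end = start + train_window
        yield Fold(start, train_end, train_end, train_end + test_window)
        start += step


# Two years of daily data, 1-year train, 1-quarter test, rolled one quarter at a time.
folds = list(walk_forward_folds(n_days=504, train_window=252, test_window=63, step=63))
for fold in folds:
    print(fold)
```

With these inputs the generator yields four folds, and the last test window ends exactly at day 504.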
Walk-forward via mts1b_quantkit.cv.walk_forward
from mts1b_quantkit.cv import walk_forward
from mts1b_quantkit.factors import get
cv = walk_forward(
    factor=get("f_momentum_12_1"),
    params_grid={"h_long": [126, 252, 378], "h_skip": [5, 21, 42]},
    universe="us-large-cap",
    start="2014-01-01", end="2024-01-01",
    train_window=252,  # 1 year train
    test_window=63,    # 1 quarter test
    step=63,           # roll forward 1 quarter at a time
    rebal="monthly",
    sizing={"method": "equal_weight_ls", "n_long": 10, "n_short": 10, "gross": 1.0},
    cost_bps=10,
)
print(f"n_folds: {cv['n_folds']}")
print(f"OOS Sharpe: {cv['agg_sharpe']:.2f} CI95={cv['ci95_sharpe']}")
print(f"OOS IC: {cv['agg_ic']:.3f} t-stat={cv['agg_t_stat']:.1f}")
print(f"Best params: {cv['best_params']}")
print(f"Stability: {cv['stability_score']:.2f}")
Output:
n_folds: 40
OOS Sharpe: 1.12 CI95=[0.82, 1.43]
OOS IC: 0.042 t-stat=4.1
Best params: {'h_long': 252, 'h_skip': 21}
Stability: 0.78
Interpreting results
Compare in-sample vs OOS
in_sample = run_single(factor=get("f_momentum_12_1"),
                       params={"h_long": 252, "h_skip": 21},
                       universe="us-large-cap",
                       start="2014-01-01", end="2024-01-01",
                       ...)
print(f"In-sample Sharpe: {in_sample.sharpe:.2f}")
print(f"OOS Sharpe: {cv['agg_sharpe']:.2f}")
If OOS is similar to in-sample → robust. If OOS is much lower → overfit.
| In-sample Sharpe | OOS Sharpe | Verdict |
|---|---|---|
| 1.5 | 1.4 | Robust, ship it |
| 1.5 | 1.0 | Some overfit, acceptable |
| 1.5 | 0.6 | Significant overfit, redesign |
| 1.5 | -0.2 | Pure overfit, abandon |
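One way to read the table is as the fraction of in-sample Sharpe that survives out of sample. The helper below is a rough sketch of that reading. The function name and the cutoffs (0.9 and 0.6) are assumptions chosen so that the four table rows map to their verdicts; they are not thresholds defined by the library.

```python
def overfit_verdict(in_sample_sharpe: float, oos_sharpe: float) -> str:
    """Classify a backtest by how much in-sample Sharpe is retained out of sample."""
    if in_sample_sharpe <= 0:
        raise ValueError("retention is only meaningful for a positive in-sample Sharpe")
    retention = oos_sharpe / in_sample_sharpe
    # Cutoffs are illustrative assumptions, tuned to reproduce the table rows.
    if retention >= 0.9:
        return "robust"
    if retention >= 0.6:
        return "some overfit, acceptable"
    if retention > 0.0:
        return "significant overfit, redesign"
    return "pure overfit, abandon"


for oos in (1.4, 1.0, 0.6, -0.2):
    print(f"IS 1.5 / OOS {oos:+.1f}: {overfit_verdict(1.5, oos)}")
```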
Stability score
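stability_score measures consistency across folds: 1.0 means an identical Sharpe in every fold, 0.0 means no more consistent than random. As a quick cross-check, you can compute the coefficient of variation of the per-fold Sharpes yourself. It is a related but inverse measure (lower means more consistent) and becomes unreliable when the mean Sharpe is close to zero.
import numpy as np
fold_sharpes = cv["fold_sharpes"] # e.g. [1.2, 1.4, 0.9, 1.3, 1.5, ...]
print(np.std(fold_sharpes) / np.mean(fold_sharpes)) # coefficient of variation; lower = more consistent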
If stability < 0.4, the factor is regime-dependent. Either:
- Add a regime gate to the factor (only trade in regime in {bull, chop})
- Combine with a regime-orthogonal factor
- Treat it as a smaller sleeve in your composite
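The first option, a regime gate, amounts to zeroing the signal on days whose regime is not in the allowed set. A minimal sketch follows, assuming you already have one regime label per date that was known at the time (no lookahead). The function name and the label strings are hypothetical and not part of mts1b_quantkit.

```python
from typing import Iterable, Sequence

ALLOWED_REGIMES = frozenset({"bull", "chop"})


def gate_by_regime(signal: Sequence[float], regime: Sequence[str],
                   allowed: Iterable[str] = ALLOWED_REGIMES) -> list[float]:
    """Keep the signal where the regime is allowed; go flat (0.0) elsewhere."""
    if len(signal) != len(regime):
        raise ValueError("signal and regime must have the same length")
    allowed_set = set(allowed)
    return [s if r in allowed_set else 0.0 for s, r in zip(signal, regime)]


signal = [0.8, -0.3, 0.5, 0.1]
regime = ["bull", "bear", "chop", "bear"]
gated = gate_by_regime(signal, regime)
print(gated)  # [0.8, 0.0, 0.5, 0.0]
```

Re-run the walk-forward on the gated signal rather than assuming the gate helps: the gate itself is a parameter that can be overfit.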
Bootstrap confidence interval
ci95_sharpe is the bootstrap 95% CI:
ci95 = cv["ci95_sharpe"] # e.g. [0.82, 1.43]
If the lower bound is above 0, the OOS Sharpe is distinguishable from zero at roughly the 5% level. If the interval crosses 0, your Sharpe could be luck.
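To see where such an interval can come from, here is a minimal percentile-bootstrap sketch over per-fold Sharpes. It is not necessarily how the library computes ci95_sharpe (it may, for example, resample daily returns in blocks). The function name is hypothetical, the input numbers are made up, and the sketch treats folds as independent draws.

```python
import random
import statistics
from typing import Sequence


def bootstrap_ci(values: Sequence[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo_idx = round(alpha / 2 * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    return means[lo_idx], means[hi_idx]


fold_sharpes = [1.2, 1.4, 0.9, 1.3, 1.5, 0.7, 1.1, 1.0]  # illustrative values only
lo, hi = bootstrap_ci(fold_sharpes)
print(f"mean={statistics.fmean(fold_sharpes):.2f}  CI95=[{lo:.2f}, {hi:.2f}]")
```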
Picking the "best" params
Don't pick the absolute best from the grid — pick from a stable plateau:
import pandas as pd
per_params = pd.DataFrame([
    {**params, "sharpe": sharpe, "ic": ic}
    for params, sharpe, ic in cv["params_results"]
])
print(per_params.sort_values("sharpe", ascending=False).head(10))
If multiple param combos cluster near the top, pick one in the center of that cluster (not the literal max). This protects against grid-search overfit.
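# Stable plateau picker
def pick_plateau(per_params: pd.DataFrame, tol: float = 0.1) -> dict:
    """Among params within tol of the best Sharpe, pick the most central grid point."""
    param_cols = [c for c in per_params.columns if c not in ("sharpe", "ic")]
    best_sharpe = per_params["sharpe"].max()
    plateau = per_params[per_params["sharpe"] >= best_sharpe - tol]
    # Median of each param column marks the center of the plateau
    center = plateau[param_cols].median()
    # The median can fall between grid points, so return the tested row nearest to it
    # (distances scaled by each param's grid range so no single param dominates)
    span = (per_params[param_cols].max() - per_params[param_cols].min()).replace(0, 1)
    dist = ((plateau[param_cols] - center) / span).abs().sum(axis=1)
    nearest = plateau.loc[dist.idxmin()]
    # Params in this grid are integer day counts, hence the int() cast
    return {**{c: int(nearest[c]) for c in param_cols}, "sharpe": float(nearest["sharpe"])}
best = pick_plateau(per_params)
print(best)
# e.g. {"h_long": 252, "h_skip": 21, "sharpe": 1.18}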
Purged CV (for cross-sectional / lookback factors)
Standard walk-forward can leak if your factor's lookback overlaps the train/test boundary. Use purged_kfold:
from mts1b_quantkit.cv import purged_kfold
cv = purged_kfold(
    factor=get("f_my_factor"),
    params={"h": 21},
    universe="us-large-cap",
    start="2014-01-01", end="2024-01-01",
    n_folds=10,
    purge_window=21,  # remove the 21 days adjacent to the test set
    embargo=5,        # extra "no-look" buffer after test
)
Use this when the factor has a long lookback (e.g., 252-day momentum), and size purge_window to cover that lookback, as the example does with h=21 and purge_window=21.
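The index logic behind purging and embargoing can be sketched as follows. This is an illustration of the general technique, not the purged_kfold implementation: the function name is hypothetical, and it assumes contiguous test blocks, with purge_window samples dropped on both sides of the test block and embargo extra samples dropped after it.

```python
from typing import Iterator


def purged_kfold_indices(n_samples: int, n_folds: int, purge_window: int,
                         embargo: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, test_idx) pairs with gaps around each test block."""
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test_start = k * fold_size
        # The last fold absorbs any remainder samples.
        test_end = n_samples if k == n_folds - 1 else test_start + fold_size
        # Exclusion zone: purge before the test block, purge + embargo after it.
        excl_start = max(0, test_start - purge_window)
        excl_end = min(n_samples, test_end + purge_window + embargo)
        train = [i for i in range(n_samples) if i < excl_start or i >= excl_end]
        test = list(range(test_start, test_end))
        yield train, test


folds = list(purged_kfold_indices(n_samples=100, n_folds=5, purge_window=3, embargo=2))
train, test = folds[1]
print(test[0], test[-1])                      # 20 39
print(max(i for i in train if i < test[0]))   # 16: three samples purged before the test block
print(min(i for i in train if i > test[-1]))  # 45: three purged plus two embargoed after it
```

Unlike walk-forward, k-fold trains on data from both sides of the test block, which is why the gap has to be enforced on both sides.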
Multi-asset class walk-forward
cv_equities = walk_forward(factor=..., universe="us-large-cap", ...)
cv_crypto = walk_forward(factor=..., universe="crypto-top-10", ...)
cv_fx = walk_forward(factor=..., universe="g10-fx", ...)
print("Equities:", cv_equities["agg_sharpe"])
print("Crypto: ", cv_crypto["agg_sharpe"])
print("FX: ", cv_fx["agg_sharpe"])
A factor that works across multiple asset classes is more robust than one that works in just one.
Common mistakes
Train window too short
# Wrong — 30 days is too little to learn a momentum signal
cv = walk_forward(..., train_window=30, test_window=21)
Rules of thumb:
| Factor decay | train_window (trading days) | test_window (trading days) |
|---|---|---|
| Very fast (5 day) | 63 | 21 |
| Fast (21 day) | 252 | 63 |
| Slow (1 yr) | 504 | 126 |
Test window > step
# Wrong — folds overlap; tests aren't independent
cv = walk_forward(..., test_window=63, step=21)
Always set step >= test_window. Otherwise the same dates appear in multiple test sets, inflating apparent statistical power.
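A quick way to see the overlap is to count how many test sets each day lands in. The sketch below uses a hypothetical helper (not a library function) and assumes folds roll from the start of the sample until a full train + test span no longer fits.

```python
from collections import Counter


def count_test_coverage(n_days: int, train_window: int,
                        test_window: int, step: int) -> Counter:
    """Map each day index to the number of test windows that contain it."""
    counts: Counter = Counter()
    start = 0
    while start + train_window + test_window <= n_days:
        test_start = start + train_window
        counts.update(range(test_start, test_start + test_window))
        start += step
    return counts


overlapping = count_test_coverage(n_days=504, train_window=252, test_window=63, step=21)
clean = count_test_coverage(n_days=504, train_window=252, test_window=63, step=63)
print(max(overlapping.values()))  # 3: some days are scored in three different folds
print(max(clean.values()))        # 1: every test day is scored exactly once
```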
Using OOS to pick the universe
# WRONG — universe selection is a meta-decision; pick on in-sample only
best_universe = max(["sp500", "russell1000", "russell3000"],
                    key=lambda u: walk_forward(..., universe=u)["agg_sharpe"])
This is a form of data snooping. Pick the universe BEFORE running walk-forward (based on capacity, liquidity, business reasons).
Ignoring transaction costs in the CV
# Wrong — cost-free OOS Sharpe is meaningless
cv = walk_forward(..., cost_bps=0)
Always pass realistic cost_bps. If your factor is high-turnover, it might be cost-killed in production — better to discover that in walk-forward.
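The arithmetic behind cost drag is simple enough to check by hand. The sketch below assumes a linear cost model (cost proportional to notional traded) and is not the library's cost engine. The function name and the example numbers are illustrative.

```python
def net_return(gross_return: float, turnover: float, cost_bps: float) -> float:
    """Subtract linear trading costs from a gross return.

    turnover is the total notional traded over the period as a fraction of
    portfolio value; cost_bps is the cost per unit traded, in basis points.
    """
    return gross_return - turnover * cost_bps / 10_000


# 12 monthly rebalances, each trading 60% of the book, at 10 bps per trade.
annual_turnover = 12 * 0.6
net = net_return(gross_return=0.08, turnover=annual_turnover, cost_bps=10)
print(f"gross 8.00% -> net {net:.2%}")  # gross 8.00% -> net 7.28%
```

At the same 10 bps, a daily-rebalanced factor trading 60% of the book each day would pay about 252 * 0.6 * 10 bps, roughly 15% a year, which is how a strong gross Sharpe gets cost-killed.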
See also
- mts1b_quantkit.cv: full API
- Concept: Factor system
- Tutorial: Custom strategy
- Cookbook: Mean reversion + vol filter (example of stability analysis)