
Walk-forward validation

Problem: Your backtest shows Sharpe 1.5 but you're worried about overfit. How do you validate out-of-sample?

Solution: Walk-forward CV — train on a rolling window, test on the next window, accumulate out-of-sample performance.

The basic idea

[---- train 252d ----][-- test 63d --]                    First fold
       [---- train 252d ----][-- test 63d --]             Second fold (rolled forward by step=63)
              [---- train 252d ----][-- test 63d --]      ...

Train on the past, test on the immediate future, and never let a fold's train and test windows overlap.
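The rolling scheme can be sketched as plain index arithmetic. This is an illustrative helper, not part of mts1b_quantkit: the name walk_forward_folds and the Fold tuple are assumptions, and indices count trading days from the start of the sample.

```python
from typing import Iterator, NamedTuple


class Fold(NamedTuple):
    train_start: int  # inclusive
    train_end: int    # exclusive; also the first test index
    test_end: int     # exclusive


def walk_forward_folds(n_obs: int, train_window: int = 252,
                       test_window: int = 63, step: int = 63) -> Iterator[Fold]:
    """Yield rolling train/test index ranges over n_obs trading days."""
    start = 0
    # Stop once a full train + test span no longer fits in the sample.
    while start + train_window + test_window <= n_obs:
        train_end = start + train_window
        yield Fold(start, train_end, train_end + test_window)
        start += step


folds = list(walk_forward_folds(n_obs=504))
for f in folds:
    print(f"train [{f.train_start}, {f.train_end})  test [{f.train_end}, {f.test_end})")
```

With two years of data (504 days) and the default windows, this yields four folds, and each test window starts exactly where its train window ends.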

Walk-forward via mts1b_quantkit.cv.walk_forward

from mts1b_quantkit.cv import walk_forward
from mts1b_quantkit.factors import get


cv = walk_forward(
    factor=get("f_momentum_12_1"),
    params_grid={"h_long": [126, 252, 378], "h_skip": [5, 21, 42]},
    universe="us-large-cap",
    start="2014-01-01", end="2024-01-01",

    train_window=252,  # 1 year train
    test_window=63,    # 1 quarter test
    step=63,           # roll forward 1 quarter at a time

    rebal="monthly",
    sizing={"method": "equal_weight_ls", "n_long": 10, "n_short": 10, "gross": 1.0},
    cost_bps=10,
)

print(f"n_folds: {cv['n_folds']}")
print(f"OOS Sharpe: {cv['agg_sharpe']:.2f} CI95={cv['ci95_sharpe']}")
print(f"OOS IC: {cv['agg_ic']:.3f} t-stat={cv['agg_t_stat']:.1f}")
print(f"Best params: {cv['best_params']}")
print(f"Stability: {cv['stability_score']:.2f}")

Output:

n_folds: 40
OOS Sharpe: 1.12 CI95=[0.82, 1.43]
OOS IC: 0.042 t-stat=4.1
Best params: {'h_long': 252, 'h_skip': 21}
Stability: 0.78
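To make "accumulate out-of-sample performance" concrete, here is a minimal sketch of two common ways to summarize fold results: pool the test-window returns into one series, or compute one Sharpe per fold. It is not necessarily how walk_forward computes agg_sharpe; the function name and the toy returns are assumptions for illustration.

```python
import math
import statistics
from itertools import chain


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Mean over sample standard deviation, scaled to an annual figure."""
    sd = statistics.stdev(daily_returns)
    if sd == 0:
        return float("nan")
    return statistics.mean(daily_returns) / sd * math.sqrt(periods_per_year)


# Toy per-fold OOS daily returns (made-up numbers, for illustration only).
fold_returns = [
    [0.004, -0.002, 0.003, 0.001],
    [-0.001, 0.002, 0.005, -0.003],
    [0.002, 0.001, -0.004, 0.003],
]

# Pooled: stitch the test windows into one OOS series, then compute one Sharpe.
stitched = list(chain.from_iterable(fold_returns))
pooled_sharpe = annualized_sharpe(stitched)

# Per-fold: one Sharpe per test window, useful for checking consistency.
fold_sharpes = [annualized_sharpe(r) for r in fold_returns]

print(f"pooled OOS Sharpe: {pooled_sharpe:.2f}")
print("per-fold Sharpes:", [round(s, 2) for s in fold_sharpes])
```

The pooled figure weights every OOS day equally, while the per-fold list shows how much the result depends on individual windows.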

Interpreting results

Compare in-sample vs OOS

in_sample = run_single(factor=get("f_momentum_12_1"),
                       params={"h_long": 252, "h_skip": 21},
                       universe="us-large-cap",
                       start="2014-01-01", end="2024-01-01",
                       ...)
print(f"In-sample Sharpe: {in_sample.sharpe:.2f}")
print(f"OOS Sharpe: {cv['agg_sharpe']:.2f}")

If OOS is similar to in-sample → robust. If OOS is much lower → overfit.

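| In-sample | OOS  | Verdict                        |
|-----------|------|--------------------------------|
| 1.5       | 1.4  | Robust, ship it                |
| 1.5       | 1.0  | Some overfit, acceptable       |
| 1.5       | 0.6  | Significant overfit, redesign  |
| 1.5       | -0.2 | Pure overfit, abandon          |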
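One way to read the table is as a retention ratio: the share of in-sample Sharpe that survives out of sample. The helper below is a hypothetical convenience, not a library function, and it deliberately sets no cutoffs, since the verdicts above are judgment calls.

```python
def oos_retention(in_sample_sharpe: float, oos_sharpe: float) -> float:
    """Fraction of the in-sample Sharpe that survives out of sample."""
    if in_sample_sharpe <= 0:
        raise ValueError("ratio is only meaningful for a positive in-sample Sharpe")
    return oos_sharpe / in_sample_sharpe


# The four rows of the table above, all with in-sample Sharpe 1.5.
for oos in (1.4, 1.0, 0.6, -0.2):
    print(f"OOS {oos:+.1f} -> retention {oos_retention(1.5, oos):.2f}")
```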

Stability score

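stability_score measures how consistent the Sharpe ratio is across folds: 1.0 means an identical Sharpe in every fold, and values near 0.0 mean the fold Sharpes are no more consistent than noise.

The coefficient of variation of the per-fold Sharpes gives a quick view of the same dispersion. Note that it runs the other way: lower is more consistent.

import numpy as np

fold_sharpes = cv["fold_sharpes"]  # e.g. [1.2, 1.4, 0.9, 1.3, 1.5, ...]
print(np.std(fold_sharpes) / np.mean(fold_sharpes))  # coefficient of variation (lower = more consistent)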

If stability < 0.4, the factor is likely regime-dependent. Options:

  • Add a regime gate to the factor (only trade when the regime is in {bull, chop})
  • Combine with a regime-orthogonal factor
  • Treat it as a smaller sleeve in your composite
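A regime gate, the first option in the list, can be as simple as zeroing the signal outside the allowed regimes. This is a minimal sketch: gate_by_regime and the regime labels are hypothetical, and it assumes each label is known at trade time (no look-ahead).

```python
from typing import Iterable, List, Sequence


def gate_by_regime(signal: Sequence[float], regimes: Sequence[str],
                   allowed: Iterable[str] = ("bull", "chop")) -> List[float]:
    """Zero the signal on dates whose regime label is not in `allowed`."""
    if len(signal) != len(regimes):
        raise ValueError("signal and regimes must be the same length")
    allowed_set = set(allowed)
    return [s if r in allowed_set else 0.0 for s, r in zip(signal, regimes)]


signal = [0.8, -0.3, 0.5, 0.1]
regimes = ["bull", "bear", "chop", "bear"]  # hypothetical labels
print(gate_by_regime(signal, regimes))  # [0.8, 0.0, 0.5, 0.0]
```

The gate itself adds a degree of freedom, so the gated factor should go through the same walk-forward run rather than being tuned on the OOS folds.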

Bootstrap confidence interval

ci95_sharpe is the bootstrap 95% CI:

ci95 = cv["ci95_sharpe"] # e.g. [0.82, 1.43]

If the lower bound is above 0, the OOS Sharpe is statistically distinguishable from zero at roughly the 5% level. If the interval includes 0, your Sharpe could be luck.
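For intuition, a percentile bootstrap of the Sharpe ratio can be sketched in a few lines. This is an assumption about the general technique, not a description of how ci95_sharpe is computed internally: it resamples daily returns independently, which ignores autocorrelation (a block bootstrap would be the usual refinement), and the returns here are synthetic.

```python
import math
import random


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe from per-period returns (sample standard deviation)."""
    n = len(returns)
    mean = math.fsum(returns) / n
    var = math.fsum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def bootstrap_sharpe_ci(returns, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the Sharpe ratio (i.i.d. resampling)."""
    rng = random.Random(seed)
    n = len(returns)
    # Resample n returns with replacement, n_boot times, and sort the Sharpes.
    stats = sorted(sharpe(rng.choices(returns, k=n)) for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic OOS daily returns, for illustration only.
gen = random.Random(42)
oos_returns = [gen.gauss(0.0005, 0.01) for _ in range(500)]

lo, hi = bootstrap_sharpe_ci(oos_returns)
print(f"Sharpe {sharpe(oos_returns):.2f}  CI95=[{lo:.2f}, {hi:.2f}]")
print("lower bound above zero:", lo > 0)
```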

Picking the "best" params

Don't pick the absolute best from the grid — pick from a stable plateau:

import pandas as pd

per_params = pd.DataFrame([
    {**params, "sharpe": sharpe, "ic": ic}
    for params, sharpe, ic in cv["params_results"]
])
print(per_params.sort_values("sharpe", ascending=False).head(10))

If multiple param combos cluster near the top, pick one in the center of that cluster (not the literal max). This protects against grid-search overfit.

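# Stable plateau picker
def pick_plateau(per_params: pd.DataFrame, tol: float = 0.1) -> dict:
    """Among params within tol of the best Sharpe, pick the most central grid point."""
    best_sharpe = per_params["sharpe"].max()
    plateau = per_params[per_params["sharpe"] >= best_sharpe - tol]
    # Parameter columns only (drop the metric columns)
    param_cols = [c for c in plateau.columns if c not in ("sharpe", "ic")]
    center = plateau[param_cols].median()
    # Snap to the tested combo nearest the median, so the result is a real grid point
    scale = per_params[param_cols].std().replace(0, 1).fillna(1)
    dist = ((plateau[param_cols] - center) / scale).abs().sum(axis=1)
    return plateau.loc[dist.idxmin(), param_cols].to_dict()


best = pick_plateau(per_params)
print(best)
# e.g. {"h_long": 252, "h_skip": 21}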

Purged CV (for cross-sectional / lookback factors)

Standard walk-forward can leak if your factor's lookback overlaps the train/test boundary. Use purged_kfold:

from mts1b_quantkit.cv import purged_kfold

cv = purged_kfold(
    factor=get("f_my_factor"),
    params={"h": 21},
    universe="us-large-cap",
    start="2014-01-01", end="2024-01-01",

    n_folds=10,
    purge_window=21,  # remove the 21 days adjacent to the test set
    embargo=5,        # extra "no-look" buffer after test
)

Use this when the factor has a long lookback (e.g., 252-day momentum).
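The effect of purge_window and embargo on a single fold can be sketched with index arithmetic. This illustrates the general idea, not purged_kfold's actual implementation: the helper name is hypothetical, and it assumes the purge applies on both sides of the test block with the embargo added only after it.

```python
from typing import List


def purged_train_indices(n_obs: int, test_start: int, test_end: int,
                         purge_window: int = 21, embargo: int = 5) -> List[int]:
    """Training indices for one fold, with the test block [test_start, test_end) held out.

    Drops `purge_window` observations on each side of the test block, plus
    `embargo` further observations after it.
    """
    left_cut = test_start - purge_window
    right_cut = test_end + purge_window + embargo
    return [i for i in range(n_obs) if i < left_cut or i >= right_cut]


train = purged_train_indices(n_obs=100, test_start=40, test_end=50,
                             purge_window=5, embargo=2)
print(len(train), train[34], train[35])  # 78 34 57
```

In this toy case the training set keeps days 0 to 34 and 57 to 99, so 22 of the 100 days are held out for a 10-day test block.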

Multi-asset class walk-forward

cv_equities = walk_forward(factor=..., universe="us-large-cap", ...)
cv_crypto = walk_forward(factor=..., universe="crypto-top-10", ...)
cv_fx = walk_forward(factor=..., universe="g10-fx", ...)

print("Equities:", cv_equities["agg_sharpe"])
print("Crypto: ", cv_crypto["agg_sharpe"])
print("FX: ", cv_fx["agg_sharpe"])

A factor that works across multiple asset classes is more robust than one that works in just one.

Common mistakes

Train window too short

# Wrong — 30 days is too little to learn a momentum signal
cv = walk_forward(..., train_window=30, test_window=21)

Rules of thumb:

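| Factor decay      | train_window (days) | test_window (days) |
|-------------------|---------------------|--------------------|
| Very fast (5 day) | 63                  | 21                 |
| Fast (21 day)     | 252                 | 63                 |
| Slow (1 yr)       | 504                 | 126                |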

Test window > step

# Wrong — folds overlap; tests aren't independent
cv = walk_forward(..., test_window=63, step=21)

Always set step >= test_window. Otherwise the same dates appear in multiple test sets, so the folds are not independent and the apparent statistical power is inflated.
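The overlap is easy to check by counting how many test windows each date falls into. A small sketch, using a hypothetical helper and day indices rather than dates:

```python
from collections import Counter


def max_test_reuse(n_obs: int, train_window: int, test_window: int, step: int) -> int:
    """Largest number of test windows that any single date falls into."""
    counts = Counter()
    start = 0
    while start + train_window + test_window <= n_obs:
        test_start = start + train_window
        counts.update(range(test_start, test_start + test_window))
        start += step
    return max(counts.values(), default=0)


print(max_test_reuse(504, 252, 63, step=21))  # 3: some dates are tested three times
print(max_test_reuse(504, 252, 63, step=63))  # 1: test windows are disjoint
```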

Using OOS to pick the universe

# WRONG — universe selection is a meta-decision; pick on in-sample only
best_universe = max(["sp500", "russell1000", "russell3000"],
                    key=lambda u: walk_forward(..., universe=u)["agg_sharpe"])

This is a form of data snooping. Pick the universe BEFORE running walk-forward (based on capacity, liquidity, business reasons).

Ignoring transaction costs in the CV

# Wrong — cost-free OOS Sharpe is meaningless
cv = walk_forward(..., cost_bps=0)

Always pass realistic cost_bps. If your factor is high-turnover, it might be cost-killed in production — better to discover that in walk-forward.
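The arithmetic behind "cost-killed" is short. The sketch below assumes a simple linear cost model in which cost_bps is charged on traded notional; whether walk_forward applies costs exactly this way is an assumption, and net_return is a hypothetical helper.

```python
def net_return(gross_return: float, turnover: float, cost_bps: float) -> float:
    """Gross period return minus a linear cost on traded notional.

    turnover is the fraction of the book traded in the period
    (0.5 means half the portfolio changed hands).
    """
    return gross_return - turnover * cost_bps / 10_000


# 10 bps gross return, half the book traded, 10 bps cost per unit traded
low_turnover = net_return(0.0010, turnover=0.5, cost_bps=10)
# Same gross return, but the book turns over twice in the period
high_turnover = net_return(0.0010, turnover=2.0, cost_bps=10)

print(f"{low_turnover:.4f}")   # 0.0005
print(f"{high_turnover:.4f}")  # -0.0010
```

The same gross return goes from positive to negative once turnover is high enough, which is why a cost-free OOS Sharpe says little about production performance.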

See also