Walk-forward validation

Problem: Your backtest shows a Sharpe of 1.5, but you're worried it is overfit. How do you validate it out-of-sample?

Solution: Walk-forward CV — train on a rolling window, test on the next window, accumulate out-of-sample performance.

The basic idea

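   [------------- train 252d -------------][test 63d]                      First fold
             [------------- train 252d -------------][test 63d]            Second fold (rolled forward by step=63)
                       [------------- train 252d -------------][test 63d]  ...

Train on the past, test on the immediate future. Train and test never overlap within a fold. With step equal to test_window, the test windows of successive folds do not overlap either, although consecutive training windows do.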
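The fold layout can be sketched with plain integer positions. This is a minimal illustration of the rolling scheme described above, not the mts1b_quantkit implementation; the helper name walk_forward_folds and the sample length of 1000 observations are assumptions made for the example.

```python
from typing import Iterator, Tuple


def walk_forward_folds(
    n_obs: int, train_window: int, test_window: int, step: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges for a rolling walk-forward split.

    Hypothetical helper for illustration; not part of mts1b_quantkit.
    """
    start = 0
    # Stop once a full train + test block no longer fits in the sample.
    while start + train_window + test_window <= n_obs:
        train = range(start, start + train_window)
        test = range(start + train_window, start + train_window + test_window)
        yield train, test
        start += step


folds = list(walk_forward_folds(n_obs=1000, train_window=252, test_window=63, step=63))
first_train, first_test = folds[0]
print(len(folds))                        # 11 complete folds fit in 1000 observations
print(first_train[-1], first_test[0])    # 251 252: test starts right after train ends
```

Each test range begins at the index immediately after its training range, so no test observation is ever seen during training for that fold.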

Walk-forward via `mts1b_quantkit.cv.walk_forward`

from mts1b_quantkit.cv import walk_forward
from mts1b_quantkit.factors import get


cv = walk_forward(
    factor=get("f_momentum_12_1"),
    params_grid={"h_long": [126, 252, 378], "h_skip": [5, 21, 42]},
    universe="us-large-cap",
    start="2014-01-01", end="2024-01-01",

    train_window=252,             # 1 year train
    test_window=63,                # 1 quarter test
    step=63,                       # roll forward 1 quarter at a time

    rebal="monthly",
    sizing={"method": "equal_weight_ls", "n_long": 10, "n_short": 10, "gross": 1.0},
    cost_bps=10,
)

print(f"n_folds:        {cv['n_folds']}")
print(f"OOS Sharpe:     {cv['agg_sharpe']:.2f}  CI95={cv['ci95_sharpe']}")
print(f"OOS IC:         {cv['agg_ic']:.3f}  t-stat={cv['agg_t_stat']:.1f}")
print(f"Best params:    {cv['best_params']}")
print(f"Stability:      {cv['stability_score']:.2f}")

Output:

n_folds:        40
OOS Sharpe:     1.12  CI95=[0.82, 1.43]
OOS IC:         0.042 t-stat=4.1
Best params:    {'h_long': 252, 'h_skip': 21}
Stability:      0.78

Interpreting results

Compare in-sample vs OOS

in_sample = run_single(factor=get("f_momentum_12_1"),
                       params={"h_long": 252, "h_skip": 21},
                       universe="us-large-cap",
                       start="2014-01-01", end="2024-01-01",
                       ...)
print(f"In-sample Sharpe: {in_sample.sharpe:.2f}")
print(f"OOS Sharpe:       {cv['agg_sharpe']:.2f}")

If OOS is similar to in-sample → robust. If OOS is much lower → overfit.

In-sample Sharpe	OOS Sharpe	Verdict
1.5	1.4	Robust, ship it
1.5	1.0	Some overfit, acceptable
1.5	0.6	Significant overfit, redesign
1.5	-0.2	Pure overfit, abandon
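One way to make the comparison mechanical is to look at the share of the in-sample Sharpe that survives out of sample. The sketch below is only an illustration: the function names are hypothetical, and the cutoffs (0.8, 0.5, 0.0) are assumptions placed between the table rows, not thresholds defined by the library.

```python
def oos_retention(in_sample_sharpe: float, oos_sharpe: float) -> float:
    """Fraction of the in-sample Sharpe that survives out of sample."""
    if in_sample_sharpe <= 0:
        raise ValueError("in-sample Sharpe must be positive for the ratio to be meaningful")
    return oos_sharpe / in_sample_sharpe


def verdict(retention: float) -> str:
    # Cutoffs are assumptions placed between the table rows, not library constants.
    if retention >= 0.8:
        return "robust"
    if retention >= 0.5:
        return "some overfit"
    if retention > 0.0:
        return "significant overfit"
    return "pure overfit"


# Reproduce the four rows of the table (in-sample Sharpe fixed at 1.5).
for oos in (1.4, 1.0, 0.6, -0.2):
    r = oos_retention(1.5, oos)
    print(f"OOS={oos:+.1f}  retention={r:+.2f}  {verdict(r)}")
```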

Stability score

stability_score measures how consistent the Sharpe ratio is across folds: 1.0 means an identical Sharpe in every fold; 0.0 means the fold results look random.

import numpy as np

fold_sharpes = cv["fold_sharpes"]    # e.g. [1.2, 1.4, 0.9, 1.3, 1.5, ...]
print(np.std(fold_sharpes) / np.mean(fold_sharpes))   # coefficient of variation (lower = more consistent)

If stability < 0.4, the factor is regime-dependent. Either:

Add a regime gate to the factor (only trade when the regime is in {bull, chop})
Combine with a regime-orthogonal factor
Treat it as a smaller sleeve in your composite
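Before choosing among these remedies, it can help to see what is driving a low score. The sketch below computes a few per-fold diagnostics with NumPy. It does not reproduce the library's stability_score formula, which this recipe does not specify, and the fold values are the illustrative numbers from the example comment above rather than real output.

```python
import numpy as np

# Illustrative values taken from the example comment above, not real output.
fold_sharpes = np.array([1.2, 1.4, 0.9, 1.3, 1.5])

mean = fold_sharpes.mean()
dispersion = fold_sharpes.std(ddof=1) / abs(mean)   # coefficient of variation
hit_rate = (fold_sharpes > 0).mean()                # share of folds with positive Sharpe
worst = fold_sharpes.min()                          # weakest single fold

print(f"mean={mean:.2f}  cv={dispersion:.2f}  hit_rate={hit_rate:.0%}  worst={worst:.2f}")
```

A low hit rate or a deeply negative worst fold suggests the losses are concentrated in specific periods, which is the pattern a regime gate is meant to address.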

Bootstrap confidence interval

ci95_sharpe is the bootstrap 95% CI:

ci95 = cv["ci95_sharpe"]    # e.g. [0.82, 1.43]

If the lower bound is above 0, the OOS Sharpe is statistically distinguishable from zero at the 95% level. If the interval includes 0, your Sharpe could be luck.
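For intuition, a percentile bootstrap can be sketched in a few lines. This is an assumption about the general technique, not a description of how the library computes ci95_sharpe: it resamples per-fold Sharpes with replacement and treats folds as independent, whereas a production version might use a block bootstrap on daily returns. The function name bootstrap_ci and the input values are illustrative.

```python
import numpy as np


def bootstrap_ci(values, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the mean of `values` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample folds with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


fold_sharpes = [1.2, 1.4, 0.9, 1.3, 1.5]   # illustrative, from the example comment above
lo, hi = bootstrap_ci(fold_sharpes)
print(f"CI95 = [{lo:.2f}, {hi:.2f}]  lower bound above zero: {lo > 0}")
```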

Picking the "best" params

Don't pick the absolute best from the grid — pick from a stable plateau:

import pandas as pd

per_params = pd.DataFrame([
    {**params, "sharpe": sharpe, "ic": ic}
    for params, sharpe, ic in cv["params_results"]
])
print(per_params.sort_values("sharpe", ascending=False).head(10))

If multiple param combos cluster near the top, pick one in the center of that cluster (not the literal max). This protects against grid-search overfit.

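# Stable plateau picker
def pick_plateau(per_params: pd.DataFrame, tol: float = 0.1) -> dict:
    """Among params within tol of the best Sharpe, pick the most central."""
    best_sharpe = per_params["sharpe"].max()
    plateau = per_params[per_params["sharpe"] >= best_sharpe - tol]
    # Use only the parameter columns; the metric columns are not params
    param_cols = [c for c in plateau.columns if c not in ("sharpe", "ic")]
    center = plateau[param_cols].median()
    # Snap to the tested combo nearest the median (a median can fall between grid points)
    scale = per_params[param_cols].std().replace(0, 1)
    nearest = ((plateau[param_cols] - center).abs() / scale).sum(axis=1).idxmin()
    return {c: plateau.loc[nearest, c].item() for c in param_cols}


best = pick_plateau(per_params)
print(best)
# e.g. {"h_long": 252, "h_skip": 21}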

Purged CV (for cross-sectional / lookback factors)

Standard walk-forward can leak if your factor's lookback overlaps the train/test boundary. Use purged_kfold:

from mts1b_quantkit.cv import purged_kfold

cv = purged_kfold(
    factor=get("f_my_factor"),
    params={"h": 21},
    universe="us-large-cap",
    start="2014-01-01", end="2024-01-01",

    n_folds=10,
    purge_window=21,             # drop training days within 21 days of the test set
    embargo=5,                    # extra "no-look" buffer after test
)

Use this when the factor has a long lookback (e.g., 252-day momentum).
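The effect of purge_window and embargo can be illustrated on integer positions. The sketch below is a simplified stand-in, not the mts1b_quantkit implementation: the helper name purged_kfold_indices is hypothetical, and it assumes the purge is applied on both sides of each contiguous test block with the embargo added only after it.

```python
import numpy as np


def purged_kfold_indices(n_obs: int, n_folds: int, purge_window: int, embargo: int):
    """Yield (train_idx, test_idx) arrays for contiguous, purged k-fold splits.

    Hypothetical helper for illustration; not the mts1b_quantkit implementation.
    """
    all_idx = np.arange(n_obs)
    for test_idx in np.array_split(all_idx, n_folds):
        test_start, test_end = test_idx[0], test_idx[-1]
        # Exclude the test block, purge_window observations on each side of it,
        # and an extra embargo after it.
        excluded_from = test_start - purge_window
        excluded_to = test_end + purge_window + embargo
        keep = (all_idx < excluded_from) | (all_idx > excluded_to)
        yield all_idx[keep], test_idx


splits = list(purged_kfold_indices(n_obs=1000, n_folds=10, purge_window=21, embargo=5))
train_idx, test_idx = splits[1]            # second fold: test covers 100..199
before = train_idx[train_idx < test_idx[0]]
after = train_idx[train_idx > test_idx[-1]]
print(before.max(), test_idx[0], test_idx[-1], after.min())   # 78 100 199 226
```

In this fold, 21 observations are dropped before the test block (79..99) and 26 after it (200..225), so no training observation sits within the purge or embargo distance of the test set.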

Multi-asset class walk-forward

cv_equities = walk_forward(factor=..., universe="us-large-cap", ...)
cv_crypto = walk_forward(factor=..., universe="crypto-top-10", ...)
cv_fx = walk_forward(factor=..., universe="g10-fx", ...)

print("Equities:", cv_equities["agg_sharpe"])
print("Crypto:  ", cv_crypto["agg_sharpe"])
print("FX:      ", cv_fx["agg_sharpe"])

A factor that works across multiple asset classes is more robust than one that works in just one.

Common mistakes

Train window too short

# Wrong — 30 days is too little to learn a momentum signal
cv = walk_forward(..., train_window=30, test_window=21)

Rules of thumb:

Factor decay	train_window (days)	test_window (days)
Very fast (5 day)	63	21
Fast (21 day)	252	63
Slow (1 yr)	504	126

Test window > step

# Wrong — folds overlap; tests aren't independent
cv = walk_forward(..., test_window=63, step=21)

Always step >= test_window. Otherwise the same dates appear in multiple test sets, inflating apparent statistical power.
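The duplication is easy to verify by counting how many test sets each date falls into. The sketch below uses integer positions and a hypothetical helper, count_test_appearances, written for this example; the sample length of 1000 is arbitrary.

```python
from collections import Counter


def count_test_appearances(n_obs: int, train_window: int, test_window: int, step: int) -> Counter:
    """Count how many test sets each observation index falls into."""
    counts = Counter()
    start = 0
    while start + train_window + test_window <= n_obs:
        test_start = start + train_window
        counts.update(range(test_start, test_start + test_window))
        start += step
    return counts


overlapping = count_test_appearances(n_obs=1000, train_window=252, test_window=63, step=21)
disjoint = count_test_appearances(n_obs=1000, train_window=252, test_window=63, step=63)
print(max(overlapping.values()))   # 3: interior dates are tested three times
print(max(disjoint.values()))      # 1: each date is tested at most once
```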

Using OOS to pick the universe

# WRONG: universe selection is a meta-decision; don't choose it by OOS Sharpe
best_universe = max(["sp500", "russell1000", "russell3000"],
                     key=lambda u: walk_forward(..., universe=u)["agg_sharpe"])

This is a form of data snooping. Pick the universe BEFORE running walk-forward (based on capacity, liquidity, business reasons).

Ignoring transaction costs in the CV

# Wrong — cost-free OOS Sharpe is meaningless
cv = walk_forward(..., cost_bps=0)

Always pass realistic cost_bps. If your factor is high-turnover, it might be cost-killed in production — better to discover that in walk-forward.
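A rough sense of the cost drag can be had by charging cost_bps on each unit of turnover and recomputing the Sharpe. The sketch below is an assumption about a simple linear cost model, not the library's cost engine; net_sharpe is a hypothetical helper, and the returns and the 20% daily turnover are synthetic values chosen for illustration.

```python
import numpy as np


def net_sharpe(gross_returns, turnover, cost_bps: float, periods_per_year: int = 252) -> float:
    """Annualized Sharpe after charging cost_bps on each unit of turnover."""
    gross_returns = np.asarray(gross_returns, dtype=float)
    turnover = np.asarray(turnover, dtype=float)
    net = gross_returns - turnover * cost_bps / 1e4   # 1 bp = 0.0001
    return float(net.mean() / net.std(ddof=1) * np.sqrt(periods_per_year))


rng = np.random.default_rng(0)
gross = rng.normal(loc=0.0006, scale=0.01, size=2520)   # synthetic daily returns
turnover = np.full(2520, 0.20)                          # 20% of the book traded per day

for bps in (0, 10, 30):
    print(f"cost_bps={bps:>2}  Sharpe={net_sharpe(gross, turnover, bps):.2f}")
```

Because the cost term scales with turnover, a high-turnover factor loses Sharpe much faster than a slow one as cost_bps rises.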
