Parameter Spaces and the Overfitting Boundary

E-mini S&P 500 Futures (CME_MINI_DL:ES1!)
EdgeTools
Part 2 of 5: Professional Strategy Testing in TradingView

Your strategy produced a positive result. The equity curve slopes upward. The settings you chose produce profit. The natural question is: how special are those settings?

This is the question that separates strategy development from curve fitting. A strategy with a genuine edge produces profit across a neighborhood of parameter values. A curve-fitted strategy produces profit at one precise combination and collapses everywhere else. The difference between these two situations is visible, measurable, and the single most important diagnostic you can run before trusting a backtest result.

What a parameter space actually is

Every strategy has parameters. A moving average crossover has two: the fast period and the slow period. A mean-reversion strategy might have a lookback window, a z-score threshold, and a holding period. Each parameter defines one axis of a multi-dimensional space. Every possible combination of parameter values is a point in that space, and at every point the strategy produces a performance metric: net profit, Sharpe ratio, profit factor, or whatever you choose to evaluate.

For two parameters, this space is a surface you can visualize directly. The x-axis is parameter A, the y-axis is parameter B, and the z-axis (height) is performance. The shape of this surface tells you nearly everything about whether your strategy is robust or fragile.

Formally, define the parameter vector theta = (theta_1, ..., theta_k) and the performance function f(theta, D), where D is your historical data. The parameter space is the set of all admissible theta, and f maps each point in that space to a scalar performance value. The topology of f is what we are investigating.

Figure 1: A three-dimensional parameter surface for a two-parameter strategy on ES 1H.
The x-axis is the fast MA period (5 to 50), the y-axis is the slow MA period (20 to 200), and the z-axis is the Sharpe ratio. The broad elevated region (plateau) in the center indicates parameter stability. The strategy is not dependent on precise tuning.

Plateaus versus peaks

The surface shape falls into two archetypes, and you want to be on the right side of this distinction.

A plateau is a broad, elevated region where many neighboring parameter combinations produce similar performance. If your chosen parameters sit on a plateau, small changes in market behavior will not destroy the strategy. The parameters are not precise; they are approximate, and the strategy tolerates approximation. This is robustness in the sense defined by Pardo (2008): a parameter set is robust if its performance degrades gracefully under perturbation.

A peak is a narrow spike where one specific combination produces excellent performance, but moving even slightly in any direction causes performance to collapse. If your chosen parameters sit on a peak, you have found a statistical artifact. The parameters are not capturing a market feature; they are memorizing a sequence of historical prices.

The formal criterion: define the perturbation set P(theta, delta) as all parameter vectors within a relative distance delta of theta. A parameter choice is robust if:

    min{ f(theta') : theta' in P(theta, 0.2) } > 0.7 * f(theta)

In words: the worst performance within a 20% perturbation radius must remain above 70% of the nominal performance. If this condition fails, the strategy depends on precise calibration and is unlikely to survive contact with live markets (Pardo, 2008).

Figure 2: Top-down heatmap of the same parameter surface from Figure 1. Warm colors indicate higher Sharpe ratios. The dashed ellipse marks the robustness region where the 70% criterion holds. The global maximum is near the center of this region, not at its edge. Parameters outside the ellipse produce materially worse results.
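Once the surface has been recorded, the 70% criterion can be checked mechanically. The following is a minimal sketch, not a definitive implementation. It assumes the results are stored in a dictionary keyed by parameter tuples, and it reads "relative distance delta" as every parameter lying within delta times its nominal value, which is only one reasonable interpretation. The function name `is_robust` and the two toy surfaces are hypothetical and exist only for illustration.

```python
def is_robust(surface, theta, delta=0.2, floor=0.7):
    """Check the perturbation criterion on a recorded parameter surface.

    surface: dict mapping parameter tuples to a performance metric.
    theta:   nominal parameter tuple (must be a key of surface).
    Assumption: a point is a neighbor if every parameter is within
    delta * |nominal value| of the nominal value.
    """
    nominal = surface[theta]
    if nominal <= 0:
        return False  # a non-positive nominal result cannot pass
    neighbors = [
        perf for params, perf in surface.items()
        if all(abs(p - t) <= delta * abs(t) for p, t in zip(params, theta))
    ]
    if len(neighbors) < 2:
        raise ValueError("no neighboring grid points inside the perturbation radius")
    # Worst performance in the neighborhood must stay above floor * nominal.
    return min(neighbors) > floor * nominal


# Toy surfaces with made-up Sharpe ratios: (fast MA, slow MA) -> Sharpe.
grid = [(f, s) for f in (18, 20, 22) for s in (90, 100, 110)]
plateau = {point: 1.0 for point in grid}
plateau[(20, 100)] = 1.2
peak = {point: 0.1 for point in grid}
peak[(20, 100)] = 1.2

print(is_robust(plateau, (20, 100)))  # True: worst neighbor 1.0 > 0.84
print(is_robust(peak, (20, 100)))     # False: worst neighbor 0.1 < 0.84
```

Both toy surfaces share the same best value of 1.2, so a ranking by peak performance alone could not tell them apart; only the neighborhood check does.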
How to map the parameter space in TradingView

TradingView does not have a built-in parameter optimizer. The Strategy Tester evaluates one parameter set at a time. This means you must map the parameter space manually or semi-systematically:

1. Define parameter ranges based on economic reasoning (a 10-period MA captures roughly two weeks of daily data; a 200-period MA captures roughly ten months).
2. Choose step sizes that produce 8 to 12 values per parameter.
3. For each combination, adjust the strategy inputs via the settings panel and record the key metrics from the Performance Summary tab.
4. Enter the results into a spreadsheet or script to construct the surface.

For a two-parameter strategy with 10 steps each, this requires 100 manual runs. For three parameters at 10 steps, it requires 1,000 runs.

The key point remains: regardless of how you generate the data, what matters is the shape of the resulting surface. A flat, elevated region means robustness. A narrow spike means curve fitting.

The overfitting boundary: degrees of freedom versus observations

There is a mathematical relationship between the number of free parameters in a strategy and the amount of data required to validate them. This is not a heuristic; it follows directly from estimation theory.

Consider a strategy with k free parameters estimated from N trades. Each parameter consumes one degree of freedom, which leaves N - k effective degrees of freedom for validation. When N - k is small, the parameter estimates are unstable: they absorb noise rather than signal.

The Akaike Information Criterion (Akaike, 1974) formalizes this tradeoff:

    AIC = 2k - 2 * ln(L)

where k is the number of parameters and L is the maximized likelihood. The first term penalizes complexity; the second rewards fit. A lower AIC indicates a better tradeoff. In strategy evaluation, the analogous principle is that every additional parameter must be justified by a proportional increase in out-of-sample explanatory power.
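To make the AIC tradeoff concrete, here is a small sketch that compares two fits. The log-likelihood values are invented for illustration and do not come from any real strategy; the function takes ln(L) directly, as is conventional.

```python
def aic(k, log_likelihood):
    """Akaike Information Criterion: 2k - 2 * ln(L)."""
    return 2 * k - 2 * log_likelihood


# Hypothetical fits: the 5-parameter model fits slightly better
# (higher log-likelihood) but pays a larger complexity penalty.
simple = aic(k=2, log_likelihood=-120.0)    # 2*2 + 240 = 244.0
complex_ = aic(k=5, log_likelihood=-118.5)  # 2*5 + 237 = 247.0

print(simple, complex_)   # 244.0 247.0
print(simple < complex_)  # True: the simpler model has the better tradeoff
```

In this made-up case three extra parameters buy only 1.5 units of log-likelihood, which does not cover the added penalty of 6, so the simpler model is preferred.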
The practical rule from Pardo (2008): a strategy requires a minimum of 30 trades per optimized parameter for the optimization to have statistical meaning. Below this threshold, the parameter estimates are dominated by sampling noise. At 50 or more trades per parameter, discriminatory power becomes adequate.

The ratio that matters:

    R = N / k

where N is the number of closed trades and k is the number of optimized parameters.

- R < 20: overfitting virtually guaranteed (parameter estimates are noise)
- 20 < R < 50: marginal territory (results may or may not generalize)
- R > 50: adequate degrees of freedom for meaningful inference
- R > 100: robust inference possible

Figure 3: The overfitting boundary. The x-axis is the number of optimized parameters; the y-axis is the number of closed trades. The red zone (R < 20) marks combinations where overfitting is statistically likely regardless of reported performance. The green zone (R > 50) marks combinations where genuine edge detection becomes possible. Four example strategies are plotted with their R-ratios.

Consider a concrete example. A strategy with 5 optimized parameters that produces 60 trades has R = 12. That is deep in the red zone. Any reported performance is more likely to be noise than signal. The same strategy producing 300 trades has R = 60. Now you are in territory where statistical inference becomes meaningful. This is why Deep Backtesting (Part 1) matters: you need enough trades to support your parameter count.

The multiple comparisons problem

Mapping a parameter space is an implicit multiple comparison. If you test M parameter combinations and select the best one, you have effectively run M hypothesis tests and selected the winner. Under the null hypothesis (no real edge exists), the probability of finding at least one apparently significant result is given by the family-wise error rate (FWER):

    FWER = 1 - (1 - alpha)^M

where alpha is the significance level for an individual test.
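A short sketch shows how quickly this probability grows with the number of combinations. The function name `fwer` is illustrative. Note that the formula assumes the M tests are independent; neighboring parameter combinations are usually correlated, so the numbers are best read as a rough guide rather than an exact probability.

```python
def fwer(m, alpha=0.05):
    """Probability of at least one false positive among m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


# Evaluate the formula for a few grid sizes.
for m in (1, 10, 50, 100, 200):
    print(f"M = {m:>3}: FWER = {fwer(m):.4f}")
```

Even a modest 10-combination scan already carries roughly a 40% chance of at least one spurious "significant" result at alpha = 0.05.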
At M = 50 combinations and alpha = 0.05:

    FWER = 1 - (1 - 0.05)^50 = 1 - 0.95^50 = 1 - 0.0769 = 0.923

A 92.3% probability of finding at least one false positive. At M = 200:

    FWER = 1 - 0.95^200 > 0.9999

You are virtually guaranteed to find something that looks like an edge even when none exists.

Figure 4: Family-wise error rate as a function of parameter combinations tested. At 50 combinations (alpha = 0.05), FWER is already 92%. At 200 combinations, it exceeds 99.99%. Standard significance thresholds become meaningless without correction for multiplicity.

The Bonferroni correction (Dunn, 1961) adjusts the significance threshold:

    alpha_adjusted = alpha / M

For M = 100 combinations, the adjusted threshold becomes 0.05 / 100 = 0.0005. This is extremely conservative but controls the FWER at the nominal level. Less conservative alternatives include the Holm-Bonferroni procedure (Holm, 1979) and White's Reality Check (White, 2000), which uses bootstrap resampling to estimate the distribution of the best statistic under the null.

Quantifying optimization bias

The expected value of the maximum of M independent draws from a distribution provides a direct estimate of how much the best in-sample result overestimates true performance. For M independent draws from N(mu, sigma^2), the expected maximum is approximately (David and Nagaraja, 2003):

    E[max] ~ mu + sigma * sqrt(2 * ln(M))

If the true Sharpe ratio is mu = 0.5 with estimation noise sigma = 0.4, and you test M = 100 combinations:

    E[max] ~ 0.5 + 0.4 * sqrt(2 * ln(100))
           ~ 0.5 + 0.4 * sqrt(9.21)
           ~ 0.5 + 0.4 * 3.03
           ~ 0.5 + 1.21
           ~ 1.71

The best in-sample Sharpe from 100 tests would be approximately 1.71, while the true value is only 0.5. The optimization bias here is +1.21 Sharpe units. This is not a defect of any particular strategy; it is a mathematical property of selecting the maximum from a distribution.
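The selection effect can also be checked by simulation. The sketch below compares the sqrt(2 * ln(M)) approximation with a Monte Carlo estimate of the expected maximum, using the same illustrative values (mu = 0.5, sigma = 0.4, M = 100). The function names, repetition count, and seed are arbitrary choices. The approximation is asymptotic and acts as an upper bound for finite M, so the simulated figure should come out somewhat lower while still sitting far above the true value of 0.5.

```python
import math
import random


def expected_max_approx(mu, sigma, m):
    """Large-M approximation: mu + sigma * sqrt(2 * ln(M))."""
    return mu + sigma * math.sqrt(2.0 * math.log(m))


def simulated_max(mu, sigma, m, n_rep=2000, seed=0):
    """Monte Carlo estimate of E[max] over m independent normal draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        total += max(rng.gauss(mu, sigma) for _ in range(m))
    return total / n_rep


approx = expected_max_approx(0.5, 0.4, 100)
sim = simulated_max(0.5, 0.4, 100)
print(f"approximation: {approx:.2f}")
print(f"simulation:    {sim:.2f}")
print(f"simulated bias over the true Sharpe of 0.5: {sim - 0.5:.2f}")
```

Either way the conclusion is the same: with no change in the underlying strategy, picking the best of 100 noisy estimates inflates the reported Sharpe ratio by roughly one full unit.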
Bailey and Lopez de Prado (2014) formalize this as the Deflated Sharpe Ratio (DSR), which adjusts the observed Sharpe ratio for the number of trials and for the skewness and kurtosis of returns. In simplified form, the underlying test statistic is:

    z = (SR_observed - SR_expected_max) / sigma_SR

where SR_expected_max accounts for the multiple testing bias and sigma_SR is the standard error of the Sharpe ratio estimate. The DSR itself is the standard normal CDF of this statistic, so it is a probability between 0 and 1. A negative z (equivalently, a DSR below 0.5) indicates that the observed performance is no better than what selection among random trials would be expected to produce.

Figure 5: Upper panels: 3D parameter surfaces for a robust strategy (left, broad plateau) and a fragile strategy (right, narrow peak). Lower panels: cross-sections through each surface at the optimal Param B value. The robust strategy maintains 70%+ of peak performance across 40% of the parameter range. The fragile strategy collapses within 8% of the parameter range.

Practical protocol for parameter validation

Combining these principles into a repeatable workflow:

1. Define parameter ranges based on economic reasoning, not on what produces the best backtest. Example: if you believe weekly momentum matters, test lookback periods from 5 to 30 days. Do not extend to 200 days just because it happens to produce better results.
2. Map the parameter space manually (or with automation) across the full grid.
3. Examine the surface shape: is there a plateau, or only a peak? Apply the robustness criterion: min performance within 20% perturbation > 0.7 * nominal performance.
4. If a plateau exists: select parameters from its center, not from the global maximum. The center is the point that maximizes distance from all edges of the acceptable region.
5. Calculate the trades-per-parameter ratio R = N / k. If R < 50, increase the backtest duration (use Deep Backtesting) or reduce the number of free parameters.
6. Estimate the optimization bias: bias ~ sigma * sqrt(2 * ln(M)), where M is the number of combinations tested. Subtract this from the best observed result to obtain a more realistic performance estimate.
7. Record the bias-adjusted, plateau-center result as your in-sample estimate.
This is still an upper bound, not an expected value.

If steps 3, 4, or 5 fail, the strategy does not pass validation at this stage, regardless of how good the best single result appears.

The distinction that matters

Optimization asks: what parameters maximize historical performance?

Validation asks: what parameters are most likely to produce future performance?

These are different questions with different answers. The parameter surface, the plateau test, the R-ratio, and the optimization bias formula answer the second question. Most people stop at the first. The entire edge of this process is in the second.

Part 3 covers what happens next: splitting your data into what you can see and what you deliberately cannot, so the strategy must prove itself on information it was never trained on.

Next: Walk-Forward Testing and Out-of-Sample Validation

References

Akaike, H. (1974) 'A new look at the statistical model identification', IEEE Transactions on Automatic Control, 19(6), pp. 716-723.

Aronson, D.R. (2006) Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals. Hoboken: John Wiley and Sons.

Bailey, D.H. and Lopez de Prado, M. (2014) 'The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality', Journal of Portfolio Management, 40(5).

David, H.A. and Nagaraja, H.N. (2003) Order Statistics. 3rd edn. Hoboken: John Wiley and Sons.

Dunn, O.J. (1961) 'Multiple Comparisons Among Means', Journal of the American Statistical Association, 56(293).

Holm, S. (1979) 'A Simple Sequentially Rejective Multiple Test Procedure', Scandinavian Journal of Statistics, 6(2).

Pardo, R. (2008) The Evaluation and Optimization of Trading Strategies. 2nd edn. Hoboken: John Wiley and Sons.

White, H. (2000) 'A Reality Check for Data Snooping', Econometrica, 68(5), pp. 1097-1126.