# Why We Made Our Model Look Worse: How Walk-Forward Validation Beats Optimistic Numbers

## The Trap of a Good-Looking Test Score

Every machine learning practitioner has lived through the same heartbreak. You train a model, you run your validation script, and the test AUC comes back at 0.62. That number feels real. You ship it to shadow mode. You start writing the celebratory note to the team. Then, six weeks later, something feels off. The production numbers do not match the offline numbers. The model the test set said was good is mediocre or worse in the wild.

This post is about a day we spent making our own model look worse on paper, and why doing that was the single most useful thing we did all month. We dropped a reported test AUC of 0.616 down to a walk-forward cross-validated mean of 0.553. The model did not get any weaker between morning and evening. The measurement just got more honest. Once we knew the real number, every other decision became easier.

## Where We Started

We run a meta-model on top of our prediction-market trading bots. Our base models output a fair probability for each contract. The meta-model is a second-layer classifier that asks, "given this candidate signal, is this likely to be a winning trade after fees and slippage?" It runs in shadow mode, which means it logs what it would have blocked or allowed but does not actually intercept trades. It was trained on roughly one thousand measured trades using a standard 80/20 train/test split. The reported test AUC was 0.616.

The number looked respectable. AUC of 0.5 is random; 0.62 means the model is meaningfully separating the trades that beat the market close from the ones that lose to it. We had a 0.55 minimum gate built into the cron job — if the retrained model failed to beat 0.55 AUC, the previous version was kept. The current model was clearing that gate with margin. We were ready to start talking about flipping the gate from shadow mode to enforce mode and letting the meta-model actually block trades.

Then someone on the team played critic. The critic asked one question that turned the whole project on its head: how do you know the AUC is real? The 80/20 split is random across the entire trade history. That means trades from the same game can land in both the training set and the test set. That means features that vary slowly within a game — like Elo difference or the bot's identity — could be memorized by the model and look like predictive signal on the test set when they are just leakage. The critic was not saying the model was useless. The critic was saying you have not measured the right thing yet.

That single objection produced a productive day of work.

## Three Categories of Honest Improvements

We organized the day's work into three buckets, ranked by impact per hour of effort.

The first bucket was engineering hygiene. None of these change the model. They are basic plumbing that should have been there from the start. Idempotent logging so the gate decisions are not double-counted. Log rotation so the shadow log file does not balloon past a gigabyte. Pickle backups so we do not lose the trained model to a single corrupted file. Rate-limit handling on the external odds API. Nothing glamorous. All of it makes the next experiment cheaper to run.

The second bucket was statistical methodology. This is where the critic's point really lived. Replace the random 80/20 split with walk-forward cross-validation that respects time order. Hold out entire games instead of individual trades to eliminate within-game leakage. Apply time-decay sample weighting so recent trades count more than ancient ones. Train per-sport submodels in case the global model is averaging away sport-specific signal.

The third bucket was closing the train/serve skew. The model was being trained on rich features that the live bot was not passing through at inference time. Roughly 70% of the model's input columns were ending up as NaN when the model ran in production, because the bot just was not forwarding them. Pass the features. Wait a day or two for the new data to flow through. Retrain. Compare. Keep what helps; rip out what does not.

We shipped all three buckets in one day. Here is what we learned from each.

## Engineering Hygiene Is Invisible But Compounds

The first bucket sounds boring because it is. Idempotent logging is a one-hour task. You add a small time-to-live cache keyed by token identifier. Same signal evaluated every poll cycle gets logged once per minute, not 60 times. The shadow log goes from bloated to readable. The downstream joins go from duplicate-counted to clean. None of that lifts the model AUC. All of it makes every future debugging session faster.
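
Here is a minimal sketch of that cache. The helper name, the in-process dictionary, and the 60-second TTL are illustrative assumptions, not the production implementation:

```python
import time

# Assumed one-minute TTL, keyed by token id; names are illustrative.
TTL_SECONDS = 60
_recently_logged = {}  # token_id -> time it was last logged

def should_log(token_id: str) -> bool:
    """Return True only the first time a token is seen inside the TTL window."""
    now = time.time()
    # Evict expired entries so the cache cannot grow without bound.
    for key in [k for k, t in _recently_logged.items() if now - t > TTL_SECONDS]:
        del _recently_logged[key]
    if token_id in _recently_logged:
        return False  # already logged this minute; skip the duplicate
    _recently_logged[token_id] = now
    return True
```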

Log rotation took thirty minutes. The gate decision log had grown to 891 megabytes and was adding 200 megabytes per week. A simple daily rotation script archives anything older than seven days into a compressed file and starts a fresh log. The disk does not fill. The log queries stay fast. Disaster avoided before it ever materialized.
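
A sketch of what such a rotation job can look like, assuming a logs/ directory of .log files and an archive/ subfolder; the paths and seven-day retention constant are placeholders:

```python
import gzip
import shutil
import time
from pathlib import Path

LOG_DIR = Path("logs")              # assumed layout
ARCHIVE_DIR = LOG_DIR / "archive"
MAX_AGE_DAYS = 7

def rotate_old_logs() -> None:
    """Compress log files older than MAX_AGE_DAYS into the archive folder."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for log_file in LOG_DIR.glob("*.log"):
        if log_file.stat().st_mtime < cutoff:
            gz_path = ARCHIVE_DIR / (log_file.name + ".gz")
            with log_file.open("rb") as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            log_file.unlink()  # the writer starts a fresh log on next open
```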

Pickle backups took thirty minutes. The trained model lived as a single file. If the file got corrupted by an interrupted cron job, we would be retraining on whatever fresh data happened to be available, with no path back to the previous good version. The fix was a wrapper script that copies the pickle to a backups folder before each retrain and keeps the last fourteen versions. Trivial code. Eliminates a class of incidents.
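
A sketch of the backup wrapper under assumed filenames (the model path, backup folder, and fourteen-version retention are placeholders for whatever the real pipeline uses):

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

MODEL_PATH = Path("meta_model.pkl")   # assumed filename
BACKUP_DIR = Path("model_backups")
KEEP_LAST = 14

def backup_model_before_retrain() -> None:
    """Copy the current pickle aside and prune to the newest KEEP_LAST backups."""
    if not MODEL_PATH.exists():
        return
    BACKUP_DIR.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    shutil.copy2(MODEL_PATH, BACKUP_DIR / f"{MODEL_PATH.stem}_{stamp}.pkl")
    backups = sorted(BACKUP_DIR.glob(f"{MODEL_PATH.stem}_*.pkl"))
    for old in backups[:-KEEP_LAST]:
        old.unlink()  # keep only the most recent KEEP_LAST versions
```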

The OddsAPI rate-limit handler was an hour of work. The previous fetcher would just fail outright if the API returned a 429 too-many-requests response. New version respects the Retry-After header, backs off exponentially, and writes a budget warning to the log if we are using more than 80% of our daily quota. None of this changes any model. All of it removes a fragile failure mode that would otherwise eat a Sunday afternoon a few weeks from now.
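
A hedged sketch of a 429-aware fetcher. The function signature, the externally tracked quota counters, and the assumption that Retry-After arrives as a number of seconds are all illustrative rather than the actual fetcher:

```python
import logging
import time

import requests

BUDGET_WARN_FRACTION = 0.8  # warn past 80% of the daily quota

def fetch_with_backoff(url: str, used_today: int, daily_quota: int,
                       max_retries: int = 5) -> requests.Response:
    """GET with exponential backoff that honors a numeric Retry-After header."""
    if daily_quota and used_today / daily_quota > BUDGET_WARN_FRACTION:
        logging.warning("Odds API budget at %.0f%% of daily quota",
                        100 * used_today / daily_quota)
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-rate-limit errors immediately
            return resp
        try:
            wait = float(resp.headers.get("Retry-After", delay))
        except ValueError:
            wait = delay  # header was an HTTP date; fall back to our own delay
        time.sleep(wait)
        delay *= 2  # exponential backoff between attempts
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```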

The general lesson: engineering hygiene is the price of admission for trusting any downstream measurement. You cannot do good science on a flaky pipeline. The walk-forward CV results we would compute later in the day are only believable because the data they ran on is clean.

## Walk-Forward Validation: The Real Test

The second bucket is where the model's reported numbers started to change. The critic's specific worry was that the 80/20 split was contaminated. We replaced it with four-fold expanding-window walk-forward cross-validation.

The mechanics are simple. Take the trades, sort them by timestamp, divide into five equal time slices. Train on slice one, test on slice two. Then train on slices one and two, test on slice three. Then train on one through three, test on four. Then train on one through four, test on five. Each test set is strictly in the future relative to its training set. There is no possible way for a feature to leak from the future into the training data.
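
A minimal sketch of that expanding-window loop, assuming a model_factory callable that returns a fresh, unfitted classifier for each fold (scikit-learn's TimeSeriesSplit can do the same job; this version spells the slicing out):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def walk_forward_auc(model_factory, X, y, timestamps, n_slices=5):
    """Expanding-window walk-forward CV: train on slices 1..k, test on slice k+1."""
    order = np.argsort(timestamps)                # strict time ordering
    X, y = np.asarray(X)[order], np.asarray(y)[order]
    bounds = np.linspace(0, len(y), n_slices + 1, dtype=int)
    aucs = []
    for k in range(1, n_slices):
        model = model_factory()                   # fresh, unfitted classifier per fold
        model.fit(X[:bounds[k]], y[:bounds[k]])
        preds = model.predict_proba(X[bounds[k]:bounds[k + 1]])[:, 1]
        aucs.append(roc_auc_score(y[bounds[k]:bounds[k + 1]], preds))
    return float(np.mean(aucs)), float(np.std(aucs)), aucs
```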

When we ran the four folds, the test AUCs came back as 0.567, 0.482, 0.594, and 0.568. Mean across folds: 0.553. Standard deviation across folds: 0.042. That is a different story than the 0.616 from the random split. The model is not above 0.6 in any sense that a future trade would experience. It is right at the gate boundary, with one fold dipping below 0.5 (worse than random) and the best fold at 0.594. That is what real performance looks like after you strip out the leakage.

Game-level holdout produced a similar correction. We held out entire games — every trade from that game goes into the test set — instead of randomly sampling trades. The holdout AUC came back at 0.585, down from the 0.616 the leaky split had reported. The 3-point drop tells us roughly how much within-game leakage was inflating the original score.
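
A sketch of the game-level split using scikit-learn's grouped splitter, assuming a game_ids array aligned row-for-row with the trades:

```python
from sklearn.model_selection import GroupShuffleSplit

def game_level_split(X, y, game_ids, test_frac=0.2, seed=42):
    """Hold out whole games: no game contributes trades to both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_frac, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=game_ids))
    return train_idx, test_idx
```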

Time-decay sample weighting changed the picture less than we expected. We weighted training samples by exp(-age / 14 days), which cuts the influence of a two-week-old trade to roughly a third (e^-1 ≈ 0.37). The effective sample size shrank to about 280 trades. The fold AUCs were similar to the unweighted run. The interpretation: the model is not overfitting to ancient data; it just does not have enough fresh data to learn dramatic new patterns. We left the weighting on anyway because it does not hurt and it primes the pipeline for when we have more trades.
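
A sketch of the weighting, with the Kish formula shown as one standard way to compute an effective sample size (illustrative, not necessarily the exact definition behind the 280 figure):

```python
import numpy as np

def decay_weights(age_days, scale_days: float = 14.0) -> np.ndarray:
    """Exponential time-decay weights: w = exp(-age / scale_days)."""
    return np.exp(-np.asarray(age_days, dtype=float) / scale_days)

def effective_sample_size(weights: np.ndarray) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return float(weights.sum() ** 2 / np.square(weights).sum())

# Weights go straight into training, e.g. model.fit(X, y, sample_weight=weights).
```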

## The MLB Submodel That Did Not Survive Honesty

Per-sport submodels were the most interesting test. The hypothesis was that the global model is averaging across eleven sports with very different dynamics, and that a sport with a clear signal might do dramatically better with its own model. MLB had the cleanest sample — 204 trades, balanced labels, well-calibrated edge buckets. We trained an MLB-only meta-model.

Under the old leaky validation, this kind of experiment usually goes one way. The smaller, more focused sample lets the model find patterns that the global model could not see, and the test AUC jumps. We had been mentally prepared to celebrate.

Under walk-forward validation, the MLB-only model came back at a mean fold AUC of 0.559, with fold standard deviation of 0.111. The best fold was 0.738. The worst fold was 0.438 — actively worse than random. The model was not finding stable MLB-specific signal. It was finding fold-specific signal. If we had committed to the MLB submodel based on the leaky split, we would have shipped a model that performed brilliantly on one quarter of the year and terribly on another. The walk-forward methodology caught that. We did not save the submodel. We logged the result, left the global model in place, and moved on.

This is what good methodology buys you. It is not that you find more good models. It is that you find fewer bad ones. The cost of a model that looks great in offline testing and falls apart live is enormous — not just in money but in the time you spend trying to figure out why production does not match validation. Walk-forward CV pays for itself the first time it catches a fool's-gold model.

## Train/Serve Skew: The Quiet Killer

The third bucket was technically the most embarrassing. The model had been trained on around twenty features. The bot at inference time was only passing through six of them. The other fourteen were arriving as null values, which the model handled by falling back to default imputations. The model was effectively running on a much weaker feature set in production than the one it was trained on.

This is called train/serve skew, and it is one of the most common silent failures in production machine learning. The model thinks it knows about Elo differences, scoring patterns, pregame confirmation flags, and bot identity. It does not, because the bot is not passing any of those at inference time. The model is using whatever default the imputer produces, which is usually the training mean. The model's behavior in production is dominated by the features it can see, which are a subset of what it was trained on.

We patched the four trading bots to pass the missing fields: token identifier (for log deduplication), position size, both Elo ratings, period, score differential, raw pre-blend fair probability, and pregame confirmation flag. Some fields were not always available — tennis matches do not have periods, for example — but the bots now pass whatever is in scope. The percentage of NaN values at inference time dropped from roughly 70% to roughly 50%.
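
A sketch of how that skew can be measured, assuming the inference-time inputs are collected into a pandas DataFrame and compared against the column list the model was trained on (names are illustrative):

```python
import pandas as pd

def serve_time_nan_rate(inference_rows: pd.DataFrame, trained_columns: list) -> float:
    """Fraction of the model's expected inputs that arrive as NaN at inference."""
    aligned = inference_rows.reindex(columns=trained_columns)  # absent columns become NaN
    return float(aligned.isna().to_numpy().mean())
```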

We did not retrain after the enrichment in the same session. That was deliberate. The new fields need a day or two of accumulation before there is enough enriched data to retrain on. The disciplined version of this experiment is: ship the enrichment, wait 48 hours, retrain on the enriched data, and compare the new walk-forward AUC to the current one. If the AUC moves by 0.02 or more, the features were doing real work and the enrichment was worth shipping. If not, the model was not really using those features and we should rip out the enrichment to keep the data pipeline simple.

That comparison is on the calendar for later this week. The framework matters more than the result. We are no longer adding features and trusting that they help. We are adding them and measuring whether they helped.

## The Numbers After the Day

The before-and-after table is the most useful artifact of the day's work. Before today's improvements, the reported numbers looked good; the reality underneath was murky.

| Metric | Before | After |
|---|---|---|
| Reported test/holdout AUC | 0.616 (random 80/20 split) | 0.585 (game-level holdout) |
| Walk-forward CV mean AUC | unmeasured | 0.553 |
| Standard deviation across folds | unmeasured | 0.042 |
| Distinct prediction values in production | 2 across thousands of calls, because of the NaN problem | 18 across the same volume, thanks to the enriched feature set |
| Train/serve skew (NaN inputs at inference) | roughly 70% | roughly 50% |
| MLB-only submodel | untested | tested and rejected for fold instability |

The reported holdout AUC is three points lower than before, and honest.

A naive reader might look at that table and conclude that we made the model worse. A statistically careful reader would conclude that we made the measurement better. The model is what it was when we woke up this morning. We now know what it actually is.

## Methodology Matters More Than Features Until Sample Size Catches Up

The deeper lesson from a day like this is about priorities. You can spend infinite effort adding features to a model. There is always one more piece of context that might help — orderbook depth, social sentiment, weather, injury reports, paid data feeds. The temptation to keep enriching is constant. And almost none of it matters if your validation methodology is wrong.

A model with walk-forward AUC 0.55 and twenty features is no better than a model with walk-forward AUC 0.55 and four features. The marginal benefit of feature engineering is bounded by the methodology you are using to test it. If you cannot tell whether a new feature actually helped — because your validation is too noisy or too leaky — you should not be adding features. You should be fixing the validation.

The corollary: once your validation is honest, almost every improvement you make will look smaller than it would have under leaky validation. This is uncomfortable. The team morale boost from a one-point AUC jump is real, even when the jump is fake. The hardest thing about disciplined research is being willing to ship results that look less impressive than they could.

We chose to ship the honest numbers. The model is now reported as having a walk-forward AUC right at the gate boundary, with fold instability we have to keep watching. That is the truth. Future work will move the number up or it will not. Either way, we will know.

## The Bottleneck Is Data, Not Engineering

After a full day of improvements across three buckets, our best assessment is this: the meta-model does not need more methodology, more features, or more compute. It needs more measured trades.

At about one thousand trades, our current sample size limits the achievable AUC. The signal-to-noise ratio on a binary target — did this trade beat the closing line — is genuinely low for any single trade. The market is efficient enough that the gap between a winning trade and a losing one is often only a few cents of CLV, with substantial variance around the expectation. Distinguishing the winners from the losers reliably requires somewhere in the 2,000-to-5,000-trade range, not the 1,000 we have now.
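
One back-of-the-envelope way to see why: the Hanley-McNeil approximation for the standard error of an AUC estimate. The balanced label counts and true AUC of 0.55 below are assumptions for illustration, not measured pipeline values:

```python
import math

def auc_standard_error(auc: float, n_pos: int, n_neg: int) -> float:
    """Hanley-McNeil (1982) approximation to the standard error of an AUC estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

# Assuming balanced labels and a true AUC near 0.55:
#   ~1,000 trades (500 vs 500)    -> SE ≈ 0.018, 95% CI roughly ±0.036
#   ~4,000 trades (2,000 vs 2,000) -> SE ≈ 0.009, 95% CI roughly ±0.018
print(auc_standard_error(0.55, 500, 500), auc_standard_error(0.55, 2000, 2000))
```

At the current sample size, a measured AUC of 0.55 is barely distinguishable from random; doubling or quadrupling the trade count is what shrinks the confidence interval enough to trust the verdict.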

The implication: every other improvement we considered is premature until the sample size doubles. We are not going to predict realized profit and loss as a regression target yet, because the sample is too small for the variance to converge. We are not going to add paid data feeds yet, because we cannot tell whether the free features are paying off. We are not going to flip the gate to enforce mode yet, because we do not have joined trade data showing the gate's would-block decisions would have saved money on resolved trades.

We are going to wait. The validation script that joins the meta-model's predictions to realized trade outcomes runs on a Monday cron. The first run with enough resolved trades to give a real verdict is coming up soon. That single verdict will tell us more about whether the meta-model is real than another month of feature engineering would.

This kind of patience is unusual in machine learning culture, which generally rewards visible iteration. But the discipline of waiting for measurement is the same discipline that produced today's improvements. You wait for honest numbers. You ship based on what they say, not what you wish they said.

## The Bottom Line

What we did today was not a model improvement. It was a measurement upgrade. The model AUC on paper went from 0.616 to 0.553 because the new methodology cannot hide the leakage that the old one was masking. The fold instability went from invisible to visible. The MLB submodel went from "promising" to "rejected." The train/serve skew went from "unknown" to "70% then 50% with a plan to measure again." The shadow log went from bloated to rotated. The pickle file went from single-point-of-failure to versioned. The odds fetcher went from fragile to backoff-aware.

None of these are headline-grade wins individually. Stacked, they convert a research project from "results we hope are real" into "results we can defend." That conversion is the entire game. Models do not make money. Models you can trust to behave the way they tested make money.

If you are running anything that looks like a production machine learning system, take a day this month and run the same audit. Replace your random validation split with walk-forward cross-validation that respects time order. Hold out entire groups — games, sessions, users — instead of individual rows. Check what features your training set has that your serving path is silently dropping. Rotate your logs. Back up your pickles. Catch your own model lying before reality catches it for you.

The work is unglamorous. The payoff compounds for years.

---

*ZenHodl runs eleven trading bots on Polymarket with a meta-model gate that decides which signals to allow through. Every probability our API returns is part of the same pipeline that produced the numbers in this post. Walk-forward cross-validation, game-level holdouts, time-decay weighting, and per-sport submodel comparisons are all run on every retrain. The CLV monitoring dashboard at zenhodl.net/admin/CLV shows the model's production decisions in real time. Seven-day free trial at zenhodl.net. The validation methodology is taught in Module 6 of the bot course included with every API plan.*
