MatchModel is a from-scratch statistical forecaster for the 2026 World Cup. It rates every national team, turns those ratings into a full scoreline distribution for any fixture, and simulates the whole tournament tens of thousands of times to produce the probabilities you see across the site. This page documents what's under the hood, why each piece was chosen, and just as importantly, what we tried and threw away.
The shape of the model
The model is three layers stacked on top of one another. An Elo rating tracks each team's strength and updates after every international match. A Dixon-Coles goals model (two Poisson goal counts with a low-score dependence correction) turns the rating gap between two teams into expected goals for each side, and from there into the probability of every plausible scoreline. A Monte-Carlo simulator then plays the entire tournament out tens of thousands of times, sampling scorelines from that model, to estimate how likely each team is to escape its group, reach each knockout round, and lift the trophy.
The layering matters. Win/draw/loss probabilities and the "champion %" are not predicted directly; they emerge from simulating concrete scorelines. That is the only honest way to answer tournament-shaped questions like who tops a group on goal difference, which teams the bracket pairs together, or who survives a penalty-prone knockout run, because every one of those depends on goals, not just on match results.
Why Elo
International football is a low-data sport: teams play a handful of competitive matches a year, squads churn, and friendlies are noisy. Elo is built for exactly this regime. It is data-efficient (every match is one update), needs no box-score detail, degrades gracefully when a team has been quiet, and is naturally well-calibrated, because the rating gap maps directly onto an expected result. More elaborate rating schemes exist, but on this volume of data they tend to overfit. The update rate (the K-factor) was tuned on held-out history rather than picked by feel.
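To make the update concrete, here is a minimal sketch of a standard Elo step. The 400-point scale, the `k=20.0` default and the `home_adv` parameter are illustrative assumptions, not the tuned values this model uses.

```python
def expected_score(rating_a: float, rating_b: float, home_adv: float = 0.0) -> float:
    """Expected result for team A on a 0..1 scale (1 = win, 0.5 = draw, 0 = loss)."""
    gap = rating_a + home_adv - rating_b
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


def elo_update(rating_a, rating_b, score_a, k=20.0, home_adv=0.0):
    """Return both teams' new ratings after one match.

    score_a is the actual result for team A: 1.0 win, 0.5 draw, 0.0 loss.
    k is the update rate (the K-factor); 20 is a placeholder, not a tuned value.
    """
    surprise = score_a - expected_score(rating_a, rating_b, home_adv)
    delta = k * surprise
    # Points are zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Two evenly rated teams each expect 0.5, so a win moves the winner up by half of K and the loser down by the same amount. An upset produces a larger surprise and therefore a larger move.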
From ratings to scorelines: Dixon-Coles
A rating gap tells you who is stronger, not by how many goals. The Dixon-Coles model bridges that. Each team's expected goals are an exponential function of the rating difference plus a home-field term, and goals are drawn from a Poisson process with a low-score dependence correction. That correction is the original Dixon-Coles tweak: it adjusts the probabilities of 0-0, 1-0, 0-1 and 1-1, the four scorelines whose frequencies two independent Poissons get wrong. Sampling from the resulting score matrix gives a full distribution over results, which is what both the per-match win/draw/loss bars and the tournament simulator consume.
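The sketch below shows how a score matrix of this kind can be built. The correction factors follow the published Dixon-Coles form; the coefficients in `expected_goals` (`base`, `beta`, `home`) and the default `rho` are made-up placeholders, not this model's fitted values.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_goals(rating_diff, base=0.2, beta=0.6, home=0.25):
    """Log-linear link from rating gap to expected goals (hypothetical coefficients)."""
    lam = math.exp(base + beta * rating_diff / 400.0 + home)  # home side
    mu = math.exp(base - beta * rating_diff / 400.0)          # away side
    return lam, mu


def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles low-score correction; equals 1 outside the four low scorelines."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def score_matrix(lam, mu, rho=-0.1, max_goals=10):
    """matrix[x][y] = P(home scores x, away scores y), renormalised after truncation."""
    raw = [[dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            for y in range(max_goals + 1)]
           for x in range(max_goals + 1)]
    total = sum(sum(row) for row in raw)
    return [[p / total for p in row] for row in raw]


def outcome_probs(matrix):
    """Collapse the score matrix into (home win, draw, away win)."""
    n = len(matrix)
    home = sum(matrix[x][y] for x in range(n) for y in range(n) if x > y)
    draw = sum(matrix[x][x] for x in range(n))
    away = sum(matrix[x][y] for x in range(n) for y in range(n) if x < y)
    return home, draw, away
```

With a negative `rho`, 0-0 and 1-1 become more likely and 1-0 and 0-1 less likely than under independence, while the total probability mass is unchanged.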
Match context
We tested three contextual adjustments: days of rest between matches, travel distance between venues, and venue altitude. Only altitude survived. Rest and travel moved accuracy by essentially nothing on held-out data and were dropped to avoid adding noise. Altitude earned its place. A sea-level side playing in Mexico City (2,240 m) or Guadalajara (1,566 m) is measurably disadvantaged, so the model carries an altitude term, correctly signed and tuned, that switches on for the affected 2026 venues and cancels out when both teams are equally unaccustomed.
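One way to get a term with those two properties (it switches on only at high venues, and it cancels when both sides are equally unaccustomed) is sketched below. The functional form, the `threshold_m` cutoff, the `per_km` coefficient and the idea of a per-team "home base" altitude are all assumptions for illustration; the page does not specify how the real term is parameterised.

```python
def altitude_penalty(venue_m, home_base_m, threshold_m=1000.0, per_km=0.1):
    """Penalty for a team playing above the altitude it is used to.

    Zero below threshold_m, otherwise proportional to how far the venue
    sits above the team's usual altitude. All constants are placeholders.
    """
    if venue_m < threshold_m:
        return 0.0
    gap_km = max(0.0, venue_m - home_base_m) / 1000.0
    return per_km * gap_km


def altitude_adjustment(venue_m, base_a_m, base_b_m):
    """Net shift in team A's favour: B's penalty minus A's penalty."""
    return altitude_penalty(venue_m, base_b_m) - altitude_penalty(venue_m, base_a_m)
```

Because the adjustment is a difference of two penalties, two sea-level sides meeting in Mexico City net out to zero, while an altitude-acclimatised side facing a sea-level side gets a positive shift.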
How we know it works: walk-forward validation
Every claim on this site rests on a strict walk-forward backtest over roughly 10,700 international matches from 2015 to 2025. The model is only ever shown the past: to predict a match it uses ratings built exclusively from games before that date, and the prediction is then scored against what actually happened. No future information leaks backward. The headline metric is the Ranked Probability Score (RPS), a proper scoring rule that, unlike raw accuracy, rewards probabilities that are both correct and sensibly spread across win/draw/loss, and punishes confident mistakes.
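For readers who want the mechanics, here is a minimal sketch of the RPS for a three-way outcome and of a walk-forward loop that predicts each match before updating on it. The `predict` and `update` callables, the `state` object and the match dictionary keys are hypothetical stand-ins for the real model.

```python
def rps(probs, outcome_index):
    """Ranked Probability Score for ordered outcomes (home, draw, away).

    Compares cumulative forecast with cumulative outcome, so a forecast
    that misses by one category is punished less than one that misses by two.
    """
    cum_forecast = 0.0
    cum_outcome = 0.0
    total = 0.0
    for i in range(len(probs) - 1):
        cum_forecast += probs[i]
        cum_outcome += 1.0 if i == outcome_index else 0.0
        total += (cum_forecast - cum_outcome) ** 2
    return total / (len(probs) - 1)


def walk_forward(matches, predict, update, state):
    """Mean RPS over matches in date order, never peeking at the future.

    predict(state, match) -> (p_home, p_draw, p_away)
    update(state, match)  -> new state, called only after the match is scored
    """
    scores = []
    for match in sorted(matches, key=lambda m: m["date"]):
        forecast = predict(state, match)          # uses only earlier matches
        scores.append(rps(forecast, match["outcome"]))
        state = update(state, match)              # now the result may be learned
    return sum(scores) / len(scores)
```

A perfect forecast scores 0 and a fully confident forecast of the opposite result scores 1, which is why lower is better in the table below.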
| Model | RPS (lower is better) | Accuracy |
|---|---|---|
| Dixon-Coles (this model) | 0.1714 | ~60% |
| Elo + logistic (simpler baseline) | 0.1716 | ~60% |
| Base rate (home/draw/away priors) | 0.2285 | n/a |
The goals model edges the simpler Elo-only baseline and comfortably beats the naive base rate. The gap to the baseline is small (0.0002 RPS), and we don't pretend otherwise: most of the signal in international results is already captured by a good rating. The win is that the goals model also gives us the scoreline distribution the simulator needs.
Calibration
Being right on average isn't enough; the probabilities have to mean what they say. On the same backtest, when the model says an outcome is X% likely it happens almost exactly X% of the time. The average calibration error is under one percentage point. That reliability curve is plotted live on the Matches & Form page. Calibration is a fixed property of the model measured on history, not a figure that drifts during the tournament.
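A reliability curve and an average calibration error can be computed along the following lines. This is one common binned definition, offered as an assumption; the page does not state exactly how the site's figure is calculated, and the ten-bin default is arbitrary.

```python
def calibration_table(forecasts, n_bins=10):
    """Group forecasts into probability bins.

    forecasts: list of (predicted_probability, happened) pairs.
    Returns one (mean_predicted, observed_frequency, count) row per non-empty bin.
    """
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of p, hits
    for p, happened in forecasts:
        i = min(int(p * n_bins), n_bins - 1)     # p == 1.0 goes in the top bin
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += int(happened)
    return [(sum_p / n, hits / n, n) for n, sum_p, hits in bins if n]


def mean_calibration_error(forecasts, n_bins=10):
    """Count-weighted average gap between predicted and observed frequency."""
    rows = calibration_table(forecasts, n_bins)
    total = sum(n for _, _, n in rows)
    return sum(n * abs(mean_p - freq) for mean_p, freq, n in rows) / total
```

Plotting `mean_predicted` against `observed_frequency` gives the reliability curve; a perfectly calibrated model sits on the diagonal and has an error of zero.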
Simulating the tournament
The simulator plays all 104 matches tens of thousands of times. Group games are sampled scoreline by scoreline, tables are built with the real FIFA tiebreakers (points, goal difference, goals for, head-to-head), and the eight best third-placed teams are selected. The round of 32 on the displayed bracket is filled using FIFA's official Annex C allocation table. That table is not a guessed assignment: the third-place rules admit many technically-valid pairings, and only the published table is authoritative. Inside the simulation loop a valid third-place matching is used instead for speed (see Known limitations). Knockouts are then sampled round by round to the final.
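The group-stage bookkeeping can be sketched as follows. This simplified version ranks on points, goal difference and goals for only; head-to-head and any later tiebreakers are left out for brevity, and the function names and tuple layout are illustrative assumptions.

```python
from collections import defaultdict


def rank_key(row):
    """Sort key: more points first, then goal difference, then goals for."""
    _team, stats = row
    return (-stats["pts"], -stats["gd"], -stats["gf"])


def group_table(results):
    """Build a ranked table from (home, away, home_goals, away_goals) tuples."""
    stats = defaultdict(lambda: {"pts": 0, "gd": 0, "gf": 0})
    for home, away, hg, ag in results:
        for team, scored, conceded in ((home, hg, ag), (away, ag, hg)):
            row = stats[team]
            row["gf"] += scored
            row["gd"] += scored - conceded
            row["pts"] += 3 if scored > conceded else 1 if scored == conceded else 0
    return sorted(stats.items(), key=rank_key)


def best_thirds(tables, n=8):
    """Pick the n best third-placed teams across all group tables."""
    thirds = sorted((table[2] for table in tables), key=rank_key)
    return [team for team, _ in thirds[:n]]
```

Because the tables are built from sampled scorelines rather than sampled results, goal difference and goals for fall out naturally in every simulated run.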
Crucially, the simulator conditions on results that have already happened. Once a group match is played its real scoreline is locked in; once a knockout tie is decided the real winner advances; only the unplayed remainder is simulated. That is why a team eliminated in real life drops to a 0% title chance while the survivors' numbers rise: the forecast tracks the tournament as it unfolds rather than re-running a fantasy from scratch each time.
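The conditioning idea is easy to show on a toy knockout bracket. In this sketch, `known` maps an already-decided tie to its real winner and `win_prob` is a hypothetical stand-in for the goals model; the real simulator also handles groups and scorelines, which are omitted here.

```python
import random
from collections import Counter


def play_tie(a, b, win_prob, known, rng):
    """Return the real winner if the tie is already decided, else sample one."""
    decided = known.get(frozenset((a, b)))
    if decided is not None:
        return decided
    return a if rng.random() < win_prob(a, b) else b


def simulate_bracket(teams, win_prob, known, rng):
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    alive = list(teams)
    while len(alive) > 1:
        alive = [play_tie(alive[i], alive[i + 1], win_prob, known, rng)
                 for i in range(0, len(alive), 2)]
    return alive[0]


def title_odds(teams, win_prob, known, n_sims=10_000, seed=0):
    """Monte-Carlo title probabilities, conditioned on the known results."""
    rng = random.Random(seed)
    wins = Counter(simulate_bracket(teams, win_prob, known, rng)
                   for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in teams}
```

For example, with four evenly matched teams and `known = {frozenset(("A", "B")): "B"}`, team A's title chance is exactly zero in every run and its share is redistributed to the survivors.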
What we tried and threw away
The most useful work on this project was deciding what not to keep. Several intuitively-appealing features were tested honestly and cut because they didn't beat the metric.
Squad strength
Surely a team's player quality should help? We built national-team indices from two independent sources, aggregated FIFA player ratings and real Transfermarkt squad market values, and added each to the model under leakage-safe alignment. Both made it worse:
| Added feature | RPS before → after | Verdict |
|---|---|---|
| FIFA player-rating index | 0.1633 → 0.1638 | worse |
| Transfermarkt market value | 0.1735 → 0.1739 | worse |
The reason is subtle and worth stating. Squad strength splits into two parts. The part that correlates with results is already inside the Elo rating: good squads win, and winning is precisely what Elo measures. What's left over is the orthogonal part, the talent that, for whatever reason, hasn't translated into results. That residue turns out to be non-predictive. So squad data either duplicates the rating or adds noise. It was dropped.
Manager and coaching data
The same trap, plus a data problem. A manager's win rate is results-derived, so it is collinear with Elo by construction; the only genuinely new signal would be the disruption around a managerial change, which is weak and rare for nations that play infrequently. Clean, consistent historical coaching tenures also aren't reliably available. Not built.
Against the market
Where bookmaker odds are available, the model's probabilities are compared with the de-vigged market on the Matches & Form page. The model is competitive but tends to sit a touch flatter than the market on heavy favourites. That is expected, and honest: bookmakers price in squad news, injuries and lineups that a results-only model can't see. The model is information-limited, not algorithm-limited: the ceiling here is the data, not the maths.
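De-vigging means stripping the bookmaker's margin so the implied probabilities sum to one. The sketch below uses simple proportional normalisation of decimal odds; that choice is an assumption, since the page does not say which de-vigging method the site applies, and other methods exist.

```python
def devig(decimal_odds):
    """Convert decimal odds to margin-free probabilities.

    Each price implies 1/odds; the implied values sum to more than 1
    (the overround), so divide each by that sum.
    """
    implied = [1.0 / price for price in decimal_odds]
    overround = sum(implied)
    return [p / overround for p in implied]
```

For home/draw/away prices of 1.80, 3.60 and 4.50, the raw implied probabilities sum to about 1.056, and normalising puts the home side at 10/19, roughly 52.6%.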
Known limitations
Better to state these plainly than bury them. The model sees results, not squads, injuries, suspensions, lineups or in-game tactics, so it will miss a favourite hollowed out by absences. Penalty-shootout winners can't be recovered from the results feed, which records the post-extra-time draw with no shootout column, so a tie settled on penalties is treated as not-yet-decided until a later data refresh resolves it. And while the displayed bracket uses the official Annex C table, the simulator uses any valid third-place matching for speed, a difference that barely moves aggregate probabilities but is worth noting.
The pipeline
Everything is reproducible and automated. Results come from the public martj42 international-results dataset; an Elo pass builds current ratings; the walk-forward backtest and tuning run over the full history; the Monte-Carlo simulator produces the tournament probabilities; predictions are logged before kickoff and graded as results arrive; calibration is recomputed from the backtest; and a single pipeline writes the JSON this site reads. In production it runs itself: a scheduler pings a GitHub Actions job every few hours, which reruns the pipeline, commits the refreshed data, and triggers a redeploy. No prediction is ever logged after kickoff, and no future data is ever used to fit the past.
The philosophy
The throughline is honest forecasting. Prefer a simple model that is well-calibrated over a complex one that merely looks clever. Tune and validate on held-out history, never on the test set. Drop any feature that doesn't beat the headline metric, however appealing it is. And publish the predictions and the running scorecard, so the model is judged on what it actually called, not on what it claims it can do.