The Cutback

Simulating Football with Dixon & Coles: A Deep Dive into xG, Data, and Predictive Modeling

How a blend of shot-level xG, historical data, and real match results could reshape football predictions, from the Premier League to Serie A

Hi everyone,

It's been a busy period for me lately, with some positive developments, including an ongoing job selection process, as well as some more difficult moments. Despite these ups and downs, I’ve continued working on various models, notebooks, and metrics, though I hadn’t felt ready to share them.

Today, I’m sharing something that, while not groundbreaking, is a project I’m proud to have completed. So, without further ado…

As you might have guessed from the title, today we're diving into the Dixon and Coles simulation model. Over the past 3 to 4 years, I’ve read quite a bit about season and match simulations, and like many, I initially thought that having more knowledge would be the key to winning bets. However, I never really learned how to properly simulate games — and I wasn’t winning many bets either.

Recently, I decided to take a different approach, aiming to learn both how to simulate games and how to improve my betting strategy, spurred by the Variance Betting project launched by Ted Knutson through his newsletter. This led me to dive into the Dixon and Coles model. Along the way, I also recalled Ben Tovarney’s comprehensive posts about using shot-level xG data to create a more refined model.

So, I put in the work and built a model that:

  • Integrates xG on a shot-by-shot basis.

  • Uses actual match scores.

  • Finds the best blend between the two for the most accurate predictions.

  • Calculates team ratings based on match data, with more recent games weighted more heavily.

  • Simulates both future matches and the final league table.

All of this has been done in Python — unlike Ben’s original environment. Unless I’ve overlooked any issues (and that’s a big “unless”!), I’m happy with the results so far and have uncovered some interesting insights.
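At its core, the Dixon and Coles model gives each team an attack and a defence rating, adds a home-advantage term, and then corrects the independent-Poisson scoreline probabilities for the 0-0, 1-0, 0-1, and 1-1 cases, where low-scoring correlation matters most. Here is a minimal sketch of the scoreline grid; the function and parameter names are my own illustration, not the actual code behind this post:

```python
import math
import numpy as np

def poisson_pmf(k, rate):
    """Poisson probability of exactly k goals at a given scoring rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Dixon & Coles low-score correction factor for scoreline (x, y)."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0

def score_matrix(att_h, def_h, att_a, def_a, home_adv, rho, max_goals=10):
    """Probability grid over scorelines (rows = home goals, cols = away goals)."""
    lam = math.exp(att_h + def_a + home_adv)  # home expected goals
    mu = math.exp(att_a + def_h)              # away expected goals
    p = np.array([[poisson_pmf(x, lam) * poisson_pmf(y, mu) * dc_tau(x, y, lam, mu, rho)
                   for y in range(max_goals + 1)]
                  for x in range(max_goals + 1)])
    return p / p.sum()  # renormalise after truncating at max_goals
```

Summing the lower triangle, diagonal, and upper triangle of this grid gives home-win, draw, and away-win probabilities for a fixture.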

To kick things off, I collected data from 18 different competitions over 5 seasons. Here’s a quick breakdown:

  1. Argentina Liga Profesional: 19/20, 2021, 2022, 2023

  2. Brazil Serie A: 2020, 2022, 2023, 2024

  3. USA MLS: 2020, 2021, 2022, 2023, 2024

  4. Belgium Pro League: 20/21, 21/22, 22/23, 23/24, 24/25

  5. England Championship: 20/21, 21/22, 22/23, 23/24, 24/25

  6. England Premier League: 20/21, 21/22, 22/23, 23/24, 24/25

  7. England League One: 20/21, 21/22, 22/23, 23/24, 24/25

  8. Spain LaLiga: 20/21, 21/22, 22/23, 23/24, 24/25

  9. UEFA Champions League: 20/21, 21/22, 22/23, 23/24, 24/25

  10. UEFA Europa League: 20/21, 21/22, 22/23, 23/24, 24/25

  11. France Ligue 1: 20/21, 21/22, 22/23, 23/24, 24/25

  12. Germany Bundesliga: 20/21, 21/22, 22/23, 23/24, 24/25

  13. Italy Serie A: 20/21, 21/22, 22/23, 23/24, 24/25

  14. Netherlands Eredivisie: 20/21, 21/22, 22/23, 23/24, 24/25

  15. Portugal Primeira Liga: 20/21, 21/22, 22/23, 23/24, 24/25

  16. Russia Premier Liga: 20/21, 21/22, 22/23, 23/24, 24/25

  17. Scotland Premiership: 20/21, 21/22, 22/23, 23/24, 24/25

  18. Turkey Super Lig: 20/21, 21/22, 22/23, 23/24, 24/25

As for the data, I’ve amassed quite a bit. My xG model is trained on around 350,000 shots: not groundbreaking, but certainly solid. For context, Ben Tovarney found the ideal weight between Understat’s xG model and actual match scores to be a 70-30 split. In my case, that balance turned out to be 72.6-27.4 across the entire dataset, which seems promising!

Once I’ve found that weighting, I can use it across any slice of data I choose to simulate. So, while I could be overlooking something (and that’s always possible!), for now, I’m happy with the results.
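The blend itself is straightforward once the weight is known: each team's "goals" input becomes a weighted average of its shot-level xG and its actual score. The post doesn't spell out how the optimal split was found, so the grid search below uses a simple mean-squared-error criterion purely as an illustration, and all names are hypothetical:

```python
import numpy as np

def blended_goals(xg, goals, w_xg=0.726):
    """Weighted mix of shot-level xG and actual goals (the 72.6/27.4 split)."""
    return w_xg * np.asarray(xg, dtype=float) + (1 - w_xg) * np.asarray(goals, dtype=float)

def best_weight(xg, goals, target, weights=np.linspace(0.0, 1.0, 101)):
    """Grid-search the blend weight whose blended input best predicts `target`
    (e.g. later scoring), scored by mean squared error as a stand-in criterion."""
    errors = [np.mean((blended_goals(xg, goals, w) - np.asarray(target)) ** 2)
              for w in weights]
    return float(weights[int(np.argmin(errors))])
```

In practice you would score candidate weights on out-of-sample predictions rather than on the same matches, but the shape of the search is the same.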

As I mentioned, I’ve collected quite a lot of data. Dixon and Coles originally based their model on four seasons of data for their paper. In my case, I found that training the ratings—which are crucial for simulating unplayed matches—on four seasons plus the current one takes around 2 hours and 30 minutes. However, training on just two seasons + the ongoing one cuts that time down to 30 minutes.

So, after some testing, I’ve decided that two seasons' worth of data is the sweet spot for our simulations.
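The "recent games weighted more heavily" part is usually done the way Dixon and Coles themselves did it: an exponential decay on match age, so a fixture from two seasons ago contributes only a fraction of a recent one. A sketch, with an illustrative decay rate rather than the one used for this model:

```python
import numpy as np

def time_weights(days_ago, xi=0.0018):
    """Dixon & Coles-style exponential down-weighting of older matches.
    xi is the decay rate per day; 0.0018 is an illustrative value only."""
    return np.exp(-xi * np.asarray(days_ago, dtype=float))
```

With xi = 0.0018, a match played 730 days ago carries just over a quarter of the weight of one played today, which is why two past seasons add context without drowning out the current campaign.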

Here’s why: Below are the Premier League ratings for this season, based solely on this season’s data. As you can see, the model ranks Fulham closer to Arsenal, which doesn’t quite reflect reality...

So, as you can imagine, Fulham probably won’t finish 4th—despite what the current ratings might suggest. And West Ham in 6th? That’s another eyebrow-raiser:

These placements don’t quite align with what we’d expect, highlighting why using more than just the current season’s data gives us a more reliable picture. Given that the difference in training time between using only the 24/25 season and incorporating data from 22/23 to 24/25 is just ten minutes, I’ve opted for the extra two seasons’ worth of data. The added context is well worth the slight increase in time.

Since it’s still early in the season, we see a lot of weight being given to last season’s data, but that will naturally shift as more games are played. Overall, I’m pretty happy with how the current ratings look, aside from a few anomalies like Fulham. It will be interesting to track how things evolve, and perhaps you’ll follow along with me, as we move toward the end of the season.

To wrap things up, let's take a look at how the Italian Serie A simulations are shaping up using two past seasons plus the current one:
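The table simulation itself can be a plain Monte Carlo loop: draw a scoreline for every unplayed fixture, tally points, repeat thousands of times, and average. The sketch below uses independent Poisson draws and drops the low-score correction for brevity; the data structures are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_table(current_points, fixtures, rates, n_sims=10_000):
    """Monte Carlo the rest of the season. `rates` maps each (home, away)
    fixture to its (home, away) expected-goals pair. Returns the mean final
    points per team across all simulated seasons."""
    teams = list(current_points)
    totals = {t: 0.0 for t in teams}
    for _ in range(n_sims):
        pts = dict(current_points)
        for home, away in fixtures:
            lam, mu = rates[(home, away)]
            hg, ag = rng.poisson(lam), rng.poisson(mu)
            if hg > ag:
                pts[home] += 3
            elif hg < ag:
                pts[away] += 3
            else:            # draw: one point each
                pts[home] += 1
                pts[away] += 1
        for t in teams:
            totals[t] += pts[t]
    return {t: totals[t] / n_sims for t in teams}
```

Sorting the averaged points (or tallying finishing positions per run) gives the simulated final table and, for example, the probability of each team finishing in the top four.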

See you soon!