The Cutback

Player Positioning: A Clustering Approach

Data-Driven Method to Classify Football Players by Pitch Zones and Actions

Davide Gualano
January 03, 2025 • Estimated Reading Time: 5 minutes

Today, I’m not presenting anything groundbreaking but rather showcasing a clustering project aimed at assigning positions to players based on selected features. If you're not interested in the detailed reasoning or methodology, feel free to skip straight to the results:

For comparison, here’s my previous positioning classification:

As you can see, there are significant changes in how players are divided. While there are plenty of caveats and explanations to discuss, I believe this new positioning classification is better suited to the recruitment-focused use case I envision. That said, this is purely a personal project—I’m not working within any organization or as a consultant.

Let’s dive into the details below. If you're interested, keep reading!

Background

This project builds on the 2022 methodology by John Muller, a well-known approach in the football analytics community. Muller’s method focuses on clustering players based on aggregated data—metrics like crosses per 90, shots per 90, and similar statistics. His two-phase clustering process moves from positional groups (e.g., centerbacks, defensive midfielders) to functional roles (e.g., anchors, box-to-box players, progressors).

However, I believe clustering on aggregated performance data is better suited to identifying roles than positions, which is what Muller actually set out to do, given that his post is titled Introducing “The Athletic’s 18 player roles”. When scouting, in my opinion it’s more practical to first group players by the areas of the pitch they operate in, and only then assess their specific contributions using other tools, like radars or comparative metrics.

That way you can say «I need a midfielder», look at who actually operates in midfield, and then search for the one that suits your needs.

Methodology

To reflect this idea, I based my model primarily on a dataset of team-relative frequency counts of the actions each player performed across pitch zones, similar to what a heatmap visualization would show. More specifically, the pitch is divided into exactly as many bins as this one:

Each bin in the heatmap corresponds to a column in my dataframe, emphasizing where actions occur rather than what those actions are. This approach led to some interesting results: for instance, one João Cancelo season is classified as that of a center midfielder.
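As a minimal sketch of the idea above, the snippet below turns a list of action locations into one feature per pitch bin. Everything specific here is an assumption made for illustration: the 100 x 100 coordinate system, the 6 x 5 grid, and the function names are hypothetical, and the shares are computed against the player's own action total, which may differ from the team-relative normalization used in the actual dataset.

```python
from collections import Counter

PITCH_LENGTH = 100.0  # assumed coordinate range along the pitch (x axis)
PITCH_WIDTH = 100.0   # assumed coordinate range across the pitch (y axis)
N_COLS = 6            # hypothetical number of bins along the length
N_ROWS = 5            # hypothetical number of bins across the width


def zone_index(x, y):
    """Map an (x, y) location to a single bin index, in row-major order."""
    # min(...) keeps locations on the far touchline or goal line in the last bin.
    col = min(int(x / PITCH_LENGTH * N_COLS), N_COLS - 1)
    row = min(int(y / PITCH_WIDTH * N_ROWS), N_ROWS - 1)
    return row * N_COLS + col


def zone_shares(actions):
    """Return one value per bin: the share of the given actions that fall in it."""
    counts = Counter(zone_index(x, y) for x, y in actions)
    total = len(actions)
    return [counts[i] / total if total else 0.0 for i in range(N_ROWS * N_COLS)]


# Toy example: four made-up action locations for one hypothetical player.
player_actions = [(10.0, 50.0), (12.0, 55.0), (80.0, 10.0), (100.0, 100.0)]
features = zone_shares(player_actions)  # one dataframe column per entry
```

Each entry of `features` plays the part of one heatmap bin, so two players with similar vectors occupy similar areas of the pitch regardless of what they do there.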

That said, the model struggles to clearly distinguish players who operate in the center of the field in front of the defensive block. To address this, I introduced additional features:

npxG, npxA, and actions per 98 minutes
Each key action type as a percentage of total actions: crosses, defensive actions, shots, carries, take-ons, box touches, and goalkeeping actions
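The two kinds of extra features listed above could be computed roughly as follows. This is only a sketch: the column names, the helper functions, and the season totals are all made up for illustration, and the only figure taken from the post is the 98-minute match length.

```python
MATCH_MINUTES = 98.0  # average match length used in the post

# Hypothetical column names for the key action types listed above.
KEY_ACTIONS = [
    "crosses", "defensive_actions", "shots", "carries",
    "take_ons", "box_touches", "goalkeeping_actions",
]


def per_98(total, minutes_played):
    """Scale a season total to a rate per 98 minutes played."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total / minutes_played * MATCH_MINUTES


def key_action_shares(counts, total_actions):
    """Share of total actions accounted for by each key action type."""
    if total_actions <= 0:
        return {name: 0.0 for name in KEY_ACTIONS}
    return {name: counts.get(name, 0) / total_actions for name in KEY_ACTIONS}


# Made-up season totals for one hypothetical outfield player.
season = {
    "minutes": 1960, "npxg": 6.0, "npxa": 4.0, "total_actions": 2000,
    "crosses": 40, "defensive_actions": 120, "shots": 20,
    "carries": 300, "take_ons": 30, "box_touches": 50,
}

features = {
    "npxg_p98": per_98(season["npxg"], season["minutes"]),
    "npxa_p98": per_98(season["npxa"], season["minutes"]),
    "actions_p98": per_98(season["total_actions"], season["minutes"]),
}
features.update(key_action_shares(season, season["total_actions"]))
```

Expressing the action types as shares rather than raw counts keeps them on the same 0 to 1 scale as the zone features, so no single column dominates purely because of its units.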

These tweaks aim to improve separation between positional groups without introducing excessive bias based on what players do or how good they are (though you could argue that npxG and npxA do exactly that).

(Quick note on using per 98 minutes: it accounts for actual playing time including stoppage time, which averages 98 minutes per game across my entire dataset, instead of the traditional flat 90. A good read on this can be found here.)
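For readers who want to see what the grouping step might look like, here is a small self-contained sketch. It is an assumption, not a description of the model: this post does not say which clustering algorithm is used, so plain k-means (Lloyd's algorithm) stands in for it, and the three-zone feature vectors, the number of clusters, and the starting centroids are invented for the example.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(values) / len(points) for values in zip(*points)]


def kmeans(rows, initial_centroids, max_iter=100):
    """Plain k-means (Lloyd's algorithm) from caller-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        new_labels = [nearest(row, centroids) for row in rows]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        for k in range(len(centroids)):
            members = [row for row, label in zip(rows, labels) if label == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = mean(members)
    return labels, centroids


# Toy data: share of actions in the defensive, middle, and final third
# for six hypothetical players (each row sums to 1).
rows = [
    [0.70, 0.25, 0.05],
    [0.65, 0.30, 0.05],
    [0.20, 0.60, 0.20],
    [0.15, 0.65, 0.20],
    [0.05, 0.25, 0.70],
    [0.05, 0.30, 0.65],
]

# Assumed starting points: one player picked from each apparent group.
labels, centroids = kmeans(rows, initial_centroids=[rows[0], rows[2], rows[4]])
```

With real data the rows would hold the full set of zone and action features, and both the number of clusters and the initialization would need tuning, which is where hard-to-place players are likely to move between groups.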

Results and Challenges

Despite improvements, the model still has limitations. For example, consider the list of players classified as Attacking Midfielders across Juventus, Man City, Lecce, Barcelona, Atalanta, Torino, and Udinese, teams of differing style and quality that I know well enough to use as a benchmark for how well the project worked:

While many players fit the bill, I’m less comfortable with names like Rodrigo Bentancur, Lovric, Linetty, Romeu, Pjanić, and Paredes appearing in this category.

Conclusion and Next Steps

I believe this updated model represents progress, but there’s still room for refinement. Incorporating event-level data extracted from tracking data, or some other form of calibration, might address some of the remaining issues.

For now, I’ll use this classification with an awareness of its current limitations. I’m also open to suggestions on how to improve the model—feedback is always welcome!

Thanks for reading, and see you soon.