SciSports – Player Evaluation in Soccer Using 2D Tracking Data #SWI2018

Company Introduction

SciSports is a spin-off from the University of Twente, founded by Giels Brouwer and Anatoliy Babic in 2012. The company focuses on applying scientific methods in the world of soccer, currently with an emphasis on data analytics. We have offices in Enschede and Amersfoort, and sales representatives in the Netherlands, Belgium, Germany, France and the UK.

Player Ranking

One of the products of SciSports is the SciSkill Index, a player ranking system covering more than 90,000 players worldwide. The player rankings are calculated using historical data and can be used to predict future performance and match outcome distributions. We are always looking for new ways to use data to determine player qualities. The goal of this week at SWI is to improve the data we have about soccer players by finding metrics that quantify the performance of professional soccer players based on 2D tracking data.

Current methods developed by SciSports are based on manually collected data: statistics such as goals scored, successful passes and successful tackles. Most of these metrics lack contextual information; some goals, passes and tackles are much more important than others. Rare events, such as goals, are also subject to a high degree of randomness. The current models can be seen as a summary of historical events.

Current Model – Event Data

Recently, many complex metrics based on event data have been developed in the world of soccer. All relevant things that players do with the ball during a match are registered in databases. These events can be used to add more context to the performance of a player. A well-known idea is to assign a scoring probability to each shot, which yields the expected goals metric. SciSports has extended this idea by assigning values to all events (headers, passes, shots, and saves). This is done by comparing each event with similar events (and their outcomes) in historical data.

The approach involves modeling the game state of the soccer match as a discrete Markov chain that incorporates the most recent event and the $N$ events preceding it. Let $E_t$ denote the $t$-th event and $G_t$ the game state after this event. We have that:

$$\text{GameState}_t = G_t = \left[E_t,E_{t-1},\ldots,E_{t-N}\right]$$

The idea is that we have a random forest algorithm (machine learning algorithm trained on historical data) that determines the value of a game state. This valuation is a mapping from the game state to the expected outcome.

$$V : G_t \longmapsto V(G_t) \in [-1,1]$$

How this mapping works in practice will be explained in more detail. If we are now interested in the (added) value of an event $E_t$, we find that

$$V(E_t) = V(G_t)-V(G_{t-1})$$
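The definitions above can be illustrated with a minimal Python sketch. The event labels, the window length `n`, and the `toy_value` function are all hypothetical: `toy_value` is a hand-written placeholder that returns a number in $[-1,1]$, not the trained random forest described above.

```python
from typing import Callable, List, Sequence, Tuple

GameState = Tuple[str, ...]


def game_states(events: Sequence[str], n: int) -> List[GameState]:
    """Build G_t = [E_t, E_{t-1}, ..., E_{t-N}] for every t with a full history."""
    return [
        tuple(reversed(events[t - n : t + 1]))
        for t in range(n, len(events))
    ]


def event_values(
    states: Sequence[GameState], value: Callable[[GameState], float]
) -> List[float]:
    """Compute V(E_t) = V(G_t) - V(G_{t-1}) for consecutive game states."""
    vals = [value(g) for g in states]
    return [b - a for a, b in zip(vals, vals[1:])]


# Hypothetical per-event scores, used only to make the sketch runnable.
TOY_SCORES = {"pass": 0.05, "header": 0.0, "shot": 0.3, "save": -0.2}


def toy_value(state: GameState) -> float:
    """Placeholder valuation in [-1, 1]; NOT the random forest model."""
    # state[0] is the most recent event, so it gets the largest weight.
    total = sum(TOY_SCORES.get(e, 0.0) / (k + 1) for k, e in enumerate(state))
    return max(-1.0, min(1.0, total))


events = ["pass", "pass", "header", "pass", "shot", "save"]
states = game_states(events, n=2)
added = event_values(states, toy_value)
```

Because the event values are successive differences, they telescope: summed over a sequence of events, they equal the value of the final game state minus the value of the initial one.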

Available Data

At SciSports, we believe that in the future, the most advanced player analysis tools will use 2D positional data. The 2D positional data contains all the positions of all the players and the ball on the field. This can be used to extract all the actions of all the players on the field.

SciSports is developing a system, called BallJames, which uses 14 cameras around a soccer pitch to extract all the positional data by using machine vision algorithms. The camera system has been successfully installed at the Heracles Almelo Stadium and the stadium of a large Premier League club from London.

During the SWI we will provide an anonymized data set of 10 complete soccer matches with positional data. This data set contains the positions $(x, y)$ of each player and the ball throughout each match.

Problem Description

The goal of the project is to find mathematical techniques that can be used to estimate the qualities of soccer players from 2D tracking data, with no further prior knowledge. The qualities to be extracted can be strategic (based on decisions) or physical (based on spatio-temporal movement data). The idea is that the results we obtain can be used to improve the SciSkill Index.

We would like to use the 2D data to model the soccer game as a discrete-time Markov chain (DTMC) with a continuous state space. The idea is that we take snapshots of the positional data and define each snapshot as the current game state. Intuitively, we are using a “picture” on which we can observe the positions of all the players and the ball.

Remark

The current game state is defined not only by the positions of all players but also by their velocities (vectors). We can extract the velocities from the positional displacement between two consecutive snapshots. This makes the game state a bit more complex, but it can still be modeled by a DTMC with a continuous state space.
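As a minimal sketch of this remark, velocities can be approximated by a finite difference between two consecutive snapshots and appended to the positions to form one state vector. The snapshot layout (a dictionary from entity id to position), the ids `ball` and `p1`, the time step `dt`, and the units are assumptions made for illustration; they are not taken from the provided data set.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Snapshot = Dict[str, Point]  # entity id -> (x, y) position


def velocities(prev: Snapshot, curr: Snapshot, dt: float) -> Dict[str, Point]:
    """Finite-difference velocity of each entity present in both snapshots."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return {
        key: (
            (curr[key][0] - prev[key][0]) / dt,
            (curr[key][1] - prev[key][1]) / dt,
        )
        for key in curr
        if key in prev
    }


def game_state(prev: Snapshot, curr: Snapshot, dt: float) -> List[float]:
    """Flatten positions and velocities into [x, y, vx, vy, ...], sorted by id."""
    vel = velocities(prev, curr, dt)
    state: List[float] = []
    for key in sorted(vel):
        x, y = curr[key]
        vx, vy = vel[key]
        state.extend([x, y, vx, vy])
    return state


# Hypothetical snapshots taken 0.1 s apart; coordinates assumed to be in metres.
s0 = {"ball": (50.0, 30.0), "p1": (40.0, 28.0)}
s1 = {"ball": (51.0, 30.5), "p1": (40.5, 28.0)}
state = game_state(s0, s1, dt=0.1)
```

Sorting by entity id keeps the ordering of the state vector stable from one snapshot to the next. With noisy tracking data, the raw finite difference would likely need smoothing before use.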

Questions we would like to have answered

Create an evaluation mapping of the game state. Just as our random forest algorithm does for the event-based game state, we can evaluate the newly defined continuous game state. If we have a reliable evaluation algorithm, we can look at differences between two snapshots and assign these to players. The hope is that we can find which players improve the game state (and how).
Model the Markov chain as a physical system with Newtonian dynamics, where all players and the ball are modeled as particles. These particles can be modeled and parametrized by some movement and restriction parameters. We would like to find estimates for these parameters. Such estimates can be player-specific and may even vary over time.
Extract the playing structure of a soccer team. Is it clear from the positional data what the roles of the players are (defender, midfielder, attacker)? Can we see which players are unpredictable (can we find a definition of chaos and measure which players create chaos)?
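As one possible starting point for the second question, the sketch below computes crude empirical bounds on a single player's speed and acceleration from their track. Reading "movement and restriction parameters" as maximum speed and maximum acceleration is an assumption made here for illustration, as are the function names, the track format and the time step; this is not a method prescribed by SciSports.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def velocity_vectors(track: Sequence[Point], dt: float) -> List[Point]:
    """Finite-difference velocity between consecutive samples of one track."""
    return [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(track, track[1:])
    ]


def movement_parameters(track: Sequence[Point], dt: float) -> Dict[str, float]:
    """Empirical maximum speed and acceleration magnitude for one player."""
    vel = velocity_vectors(track, dt)
    speeds = [math.hypot(vx, vy) for vx, vy in vel]
    # Acceleration is the finite difference of consecutive velocity vectors.
    accels = [
        math.hypot(vx1 - vx0, vy1 - vy0) / dt
        for (vx0, vy0), (vx1, vy1) in zip(vel, vel[1:])
    ]
    return {
        "max_speed": max(speeds, default=0.0),
        "max_acceleration": max(accels, default=0.0),
    }


# Hypothetical track of one player, sampled every 0.5 s along a straight line.
track = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
params = movement_parameters(track, dt=0.5)
```

Raw maxima are sensitive to tracking noise, so a high quantile over smoothed data may be a more robust estimate. Computing the parameters over a sliding time window would show how they vary during a match.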

We hope the above questions are a useful starting point for further discussions. If the group decides to proceed with a completely different approach (one that is in line with the goals formulated at the start of this section), we will encourage this.

Details will follow soon.