Loading StatsBomb data¶
The StatsBombLoader class provides an API client that lets you fetch StatsBomb event stream data as Pandas DataFrames. This document provides an overview of the available data sources and how to access them.
Setup¶
To load StatsBomb data, you first need to install a few additional dependencies that are not included in the default installation of socceraction. Install them by running:
$ pip install "socceraction[statsbomb]"
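Before creating a loader, it can help to confirm that the extra dependencies are actually importable. The following is a minimal sketch; `has_statsbomb_extras` is a hypothetical helper, and it assumes that `statsbombpy` (the package imported in the Open Data example in this document) is among the dependencies installed by the `statsbomb` extra.

```python
import importlib.util


def has_statsbomb_extras() -> bool:
    """Return True if the statsbombpy package can be imported."""
    # Assumption: statsbombpy is one of the packages pulled in by
    # the "statsbomb" extra of socceraction.
    return importlib.util.find_spec("statsbombpy") is not None


if __name__ == "__main__":
    if has_statsbomb_extras():
        print("StatsBomb dependencies found.")
    else:
        print('Missing dependencies; run: pip install "socceraction[statsbomb]"')
```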
Connecting to a data store¶
First, create a StatsBombLoader object and configure it for the data store you want to use. The StatsBombLoader supports loading data from the StatsBomb Open Data repository, from the official StatsBomb API, and from local files.
Open Data repository¶
StatsBomb has made event stream data of certain leagues freely available for public non-commercial use at https://github.com/statsbomb/open-data. This open data can be accessed without authentication, but its use is subject to a user agreement. The code below shows how to set up an API client that can fetch data from the repository.
# optional: suppress warning about missing authentication
import warnings
from statsbombpy.api_client import NoAuthWarning
warnings.simplefilter('ignore', NoAuthWarning)
from socceraction.data.statsbomb import StatsBombLoader
api = StatsBombLoader(getter="remote", creds=None)
Note
If you publish, share or distribute any research, analysis or insights based on this data, StatsBomb requires you to state the data source as StatsBomb and use their logo.
StatsBomb API¶
API access is for paying customers only. To authenticate, set the environment variables SB_USERNAME and SB_PASSWORD to your login credentials. Alternatively, pass your login credentials to the constructor through the creds argument, in the format {"user": "", "passwd": ""}.
from socceraction.data.statsbomb import StatsBombLoader
# set authentication credentials as environment variables
import os
os.environ["SB_USERNAME"] = "your_username"
os.environ["SB_PASSWORD"] = "your_password"
api = StatsBombLoader(getter="remote")
# or provide authentication credentials as a dictionary
api = StatsBombLoader(getter="remote", creds={"user": "", "passwd": ""})
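To keep credentials out of source code, the dictionary can be built from the environment at run time. This is a minimal sketch rather than part of socceraction: `creds_from_env` is a hypothetical helper that returns a dictionary in the documented `{"user": ..., "passwd": ...}` format, or `None` when either variable is missing.

```python
import os
from typing import Dict, Mapping, Optional


def creds_from_env(environ: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Build a creds dictionary from SB_USERNAME and SB_PASSWORD.

    Returns None if either variable is unset or empty.
    """
    user = environ.get("SB_USERNAME")
    passwd = environ.get("SB_PASSWORD")
    if not user or not passwd:
        return None
    return {"user": user, "passwd": passwd}


# Hypothetical usage (requires socceraction to be installed):
# from socceraction.data.statsbomb import StatsBombLoader
# api = StatsBombLoader(getter="remote", creds=creds_from_env())
```

Checking the return value for `None` makes it explicit whether the client is being created with or without credentials.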
Local directory¶
A final option is to load data from a local directory. Specify the path to this directory with the root argument of the constructor.
from socceraction.data.statsbomb import StatsBombLoader
api = StatsBombLoader(getter="local", root="data/statsbomb")
Note that the data should be organized in the same way as the StatsBomb Open Data repository, which corresponds to the following file hierarchy:
root
├── competitions.json
├── events
│   ├── <match_id>.json
│   ├── ...
│   └── ...
├── lineups
│   ├── <match_id>.json
│   └── ...
├── matches
│   ├── <competition_id>
│   │   ├── <season_id>.json
│   │   └── ...
│   └── ...
└── three-sixty
    ├── <match_id>.json
    └── ...
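A quick check of the top-level layout can catch a misplaced directory before any data is loaded. The sketch below uses only the standard library; `missing_entries` is a hypothetical helper, and treating `three-sixty` as optional is an assumption made here, on the basis that 360 data is not available for every game.

```python
from pathlib import Path
from typing import List

# Top-level entries taken from the hierarchy above.
# Assumption: "three-sixty" is optional and therefore not checked.
REQUIRED_FILES = ("competitions.json",)
REQUIRED_DIRS = ("events", "lineups", "matches")


def missing_entries(root) -> List[str]:
    """Return the required top-level files and directories missing under root."""
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    problems = missing_entries("data/statsbomb")
    if problems:
        print(f"missing: {problems}")
    else:
        print("layout looks complete")
```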
Loading data¶
Next, you can load the match event stream data and metadata by calling the corresponding methods on the StatsBombLoader object.
StatsBombLoader.competitions()¶
df_competitions = api.competitions()
| season_id | competition_id | competition_name | country_name | competition_gender | season_name |
|---|---|---|---|---|---|
| 106 | 43 | FIFA World Cup | International | male | 2022 |
| 30 | 72 | Women’s World Cup | International | female | 2019 |
| 3 | 43 | FIFA World Cup | International | male | 2018 |
StatsBombLoader.games()¶
df_games = api.games(competition_id=43, season_id=3)
| game_id | season_id | competition_id | competition_stage | game_day | game_date | home_team_id | away_team_id | home_score | away_score | venue | referee_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 8658 | 3 | 43 | Final | 7 | 2018-07-15 17:00:00 | 771 | 785 | 4 | 2 | Stadion Luzhniki | 730 |
| 8657 | 3 | 43 | 3rd Place Final | 7 | 2018-07-14 16:00:00 | 782 | 768 | 2 | 0 | Saint-Petersburg Stadium | 741 |
StatsBombLoader.teams()¶
df_teams = api.teams(game_id=8658)
| team_id | team_name |
|---|---|
| 771 | France |
| 785 | Croatia |
StatsBombLoader.players()¶
df_players = api.players(game_id=8658)
| game_id | team_id | player_id | player_name | nickname | jersey_number | is_starter | starting_position_id | starting_position_name | minutes_played |
|---|---|---|---|---|---|---|---|---|---|
| 8658 | 771 | 3009 | Kylian Mbappé Lottin | Kylian Mbappé | 10 | True | 12 | Right Midfield | 95 |
| 8658 | 785 | 5463 | Luka Modrić | | 10 | True | 13 | Right Center Midfield | 95 |
StatsBombLoader.events()¶
df_events = api.events(game_id=8658)
| event_id | index | period_id | timestamp | minute | second | type_id | type_name | possession | possession_team_id | possession_team_name | play_pattern_id | play_pattern_name | team_id | team_name | duration | extra | related_events | player_id | player_name | position_id | position_name | location | under_pressure | counterpress | game_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47638847-fd43-4656-b49c-cff64e5cfc0a | 1 | 1 | 1900-01-01 | 0 | 0 | 35 | Starting XI | 1 | 771 | France | 1 | Regular Play | 771 | France | 0.0 | {…} | [] | | | | | | False | False | 8658 |
| 0c04305d-5615-4520-9be5-7c232829954b | 2 | 1 | 1900-01-01 | 0 | 0 | 35 | Starting XI | 1 | 771 | France | 1 | Regular Play | 785 | Croatia | 1.412 | {…} | [] | | | | | | False | False | 8658 |
| c5e17439-efe2-480b-9cff-1600998674d7 | 3 | 1 | 1900-01-01 | 0 | 0 | 18 | Half Start | 1 | 771 | France | 1 | Regular Play | 771 | France | 0.0 | {} | ['7e1460eb-c572-4059-8cd4-cec4857f818d'] | | | | | | False | False | 8658 |
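Because games() returns one row per game, the methods above can be combined to load the events of a whole season. The following is a minimal sketch under that assumption; `load_season_events` is a hypothetical helper, not part of socceraction, and it only relies on the `game_id` column and the keyword arguments shown in the examples above.

```python
from typing import Any, Dict


def load_season_events(api: Any, competition_id: int, season_id: int) -> Dict[int, Any]:
    """Return a mapping from game_id to the events dataframe of that game."""
    df_games = api.games(competition_id=competition_id, season_id=season_id)
    events_by_game = {}
    # game_id is one of the columns of the games dataframe.
    for game_id in df_games["game_id"]:
        # Convert to a plain int in case the column holds NumPy integers.
        events_by_game[int(game_id)] = api.events(game_id=int(game_id))
    return events_by_game


# Hypothetical usage (requires socceraction and network access):
# from socceraction.data.statsbomb import StatsBombLoader
# api = StatsBombLoader(getter="remote", creds=None)
# events_by_game = load_season_events(api, competition_id=43, season_id=3)
```

Each game triggers a separate request when the remote getter is used, so loading a full season this way can take a while.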
If 360 data snapshots are available for the game, you can load them by passing load_360=True to the events() method. This adds two columns to the events dataframe: visible_area_360 and freeze_frame_360. The former contains the visible area of the pitch in the 360 snapshot, while the latter contains the player locations in the 360 snapshot.
df_events = api.events(game_id=3788741, load_360=True)
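As an illustration of working with the player locations, the sketch below groups the players in a single snapshot by role. It assumes that each `freeze_frame_360` value is a list of dictionaries with the keys `teammate`, `actor`, `keeper` and `location`, as in the StatsBomb 360 JSON files; whether socceraction passes this structure through unchanged is an assumption, so inspect one value before relying on it. `split_freeze_frame` is a hypothetical helper and the sample frame is made up.

```python
from typing import Dict, List


def split_freeze_frame(freeze_frame: List[dict]) -> Dict[str, List[List[float]]]:
    """Group the player locations of one 360 snapshot by role."""
    groups: Dict[str, List[List[float]]] = {"actor": [], "teammates": [], "opponents": []}
    for player in freeze_frame:
        # Assumed keys: "actor", "teammate", "location" (StatsBomb 360 JSON format).
        if player.get("actor"):
            groups["actor"].append(player["location"])
        elif player.get("teammate"):
            groups["teammates"].append(player["location"])
        else:
            groups["opponents"].append(player["location"])
    return groups


# Made-up snapshot, for illustration only.
example_frame = [
    {"teammate": True, "actor": True, "keeper": False, "location": [60.0, 40.0]},
    {"teammate": True, "actor": False, "keeper": False, "location": [55.5, 30.2]},
    {"teammate": False, "actor": False, "keeper": True, "location": [118.0, 40.0]},
]
groups = split_freeze_frame(example_frame)
print(len(groups["teammates"]), len(groups["opponents"]))  # prints: 1 1
```

Events without a snapshot may hold a missing value in this column, so filter those rows out before applying such a function to the dataframe.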