Loading Opta data¶
Opta’s event stream data comes in many different flavours. The OptaLoader class provides an API client that lets you fetch data from the following data feeds as Pandas DataFrames:
Opta F1, F9 and F24 JSON feeds
Opta F7 and F24 XML feeds
StatsPerform MA1 and MA3 JSON feeds
WhoScored.com JSON data
Currently, only loading data from local files is supported.
Connecting to a data store¶
First, you have to create an OptaLoader object and configure it for the data feeds you want to use.
Generic setup¶
To set up an OptaLoader, you have to specify the root directory, the filename hierarchy of the feeds and a parser for each feed. For example:
from socceraction.data.opta import OptaLoader, parsers

api = OptaLoader(
    root="data/opta",
    # Filename template for each feed, relative to the root directory
    feeds={
        "f7": "f7-{competition_id}-{season_id}-{game_id}.xml",
        "f24": "f24-{competition_id}-{season_id}-{game_id}.xml",
    },
    # Parser class used for each feed
    parser={
        "f7": parsers.F7XMLParser,
        "f24": parsers.F24XMLParser,
    },
)
Since the loader uses the directory structure and file names to determine which files should be parsed, the root directory must follow the file hierarchy defined in the feeds argument. A wide range of file names and directory structures is supported. However, the competition, season, and game identifiers must be included in the file names so that the loader can locate the corresponding files for each entity. For example, you might have grouped feeds by competition and season as follows:
root
├── competition_<competition_id>
│ ├── season_<season_id>
│ │ ├── f7_<game_id>.xml
│ │ └── f24_<game_id>.xml
│ └── ...
└── ...
In this case, you can use the following feeds configuration:
feeds = {
    "f7": "competition_{competition_id}/season_{season_id}/f7_{game_id}.xml",
    "f24": "competition_{competition_id}/season_{season_id}/f24_{game_id}.xml",
}
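To illustrate why every identifier has to appear in the file name, the sketch below turns such a template into a regular expression and recovers the identifiers from a relative path. It uses only the standard library. The helpers `template_to_regex` and `extract_ids` are hypothetical names introduced here for illustration, and this is not necessarily how OptaLoader resolves files internally.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Build a regex with one named group per {placeholder} in the template."""
    # re.split with a capture group alternates: literal, name, literal, name, ...
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text must match exactly
        else:
            pattern += f"(?P<{part}>[^/]+)"  # an identifier never spans directories
    return re.compile(pattern)


def extract_ids(template: str, relative_path: str):
    """Return the identifiers encoded in relative_path, or None if it does not fit."""
    match = template_to_regex(template).fullmatch(relative_path)
    return match.groupdict() if match else None


template = "competition_{competition_id}/season_{season_id}/f24_{game_id}.xml"

# Building a path from identifiers (example values, not real Opta IDs)
path = template.format(competition_id=8, season_id=2021, game_id=123456)

# Recovering the identifiers from the path
ids = extract_ids(template, path)
missing_season = extract_ids(template, "competition_8/f24_123456.xml")
```

A path that lacks one of the identifiers, such as the last one above, cannot be mapped back to a competition, season and game, which is why the template must contain all three.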
Note
On Windows, the backslash character should be used as a path separator.
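One way to avoid hard-coding the separator is to assemble the templates with `os.path.join`, which inserts the separator of the platform the code runs on. This is a minimal sketch using the example layout above; the variable name `season_dir` is introduced here for illustration only.

```python
import os

# Directory part shared by both feeds in the example layout
season_dir = ("competition_{competition_id}", "season_{season_id}")

# os.path.join uses "\\" on Windows and "/" elsewhere
feeds = {
    "f7": os.path.join(*season_dir, "f7_{game_id}.xml"),
    "f24": os.path.join(*season_dir, "f24_{game_id}.xml"),
}
```

The resulting dictionary can then be passed as the feeds argument in the same way as the hand-written templates.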
Furthermore, a few standard configurations are provided. These are listed below.
Opta F7 and F24 XML feeds¶
from socceraction.data.opta import OptaLoader
api = OptaLoader(root="data/opta", parser="xml")
The root directory should have the following structure:
root
├── f7-{competition_id}-{season_id}-{game_id}.xml
├── f24-{competition_id}-{season_id}-{game_id}.xml
└── ...
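Before creating the loader, it can be useful to check which files in the root directory actually follow the expected naming scheme. The sketch below does this with the standard library by replacing each placeholder with a glob wildcard. The helper `find_feed_files` is a hypothetical name and not part of socceraction, and the demo files are empty stand-ins created in a temporary directory.

```python
import re
import tempfile
from pathlib import Path


def find_feed_files(root, template):
    """Return the files under root whose relative path fits the template."""
    # Replace every {placeholder} with a glob wildcard
    glob_pattern = re.sub(r"\{\w+\}", "*", template)
    return sorted(Path(root).glob(glob_pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two files that follow the F24 naming scheme and one that does not
    for name in ("f24-8-2021-1001.xml", "f24-8-2021-1002.xml", "notes.txt"):
        (root / name).touch()

    template = "f24-{competition_id}-{season_id}-{game_id}.xml"
    found = [p.name for p in find_feed_files(root, template)]
```

Here `found` contains only the two F24 files. The same check works for the other standard layouts by passing their templates instead.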
Opta F1, F9 and F24 JSON feeds¶
from socceraction.data.opta import OptaLoader
api = OptaLoader(root="data/opta", parser="json")
The root directory should have the following structure:
root
├── f1-{competition_id}-{season_id}.json
├── f9-{competition_id}-{season_id}-{game_id}.json
├── f24-{competition_id}-{season_id}-{game_id}.json
└── ...
StatsPerform MA1 and MA3 JSON feeds¶
from socceraction.data.opta import OptaLoader
api = OptaLoader(root="data/statsperform", parser="statsperform")
The root directory should have the following structure:
root
├── ma1-{competition_id}-{season_id}.json
├── ma3-{competition_id}-{season_id}-{game_id}.json
└── ...
WhoScored¶
WhoScored.com is a popular website that provides detailed live match statistics. These statistics are compiled from Opta’s event feed, which can be scraped from the website’s source code using a library such as soccerdata. Once you have downloaded the raw JSON data, you can parse it using the OptaLoader with:
from socceraction.data.opta import OptaLoader
api = OptaLoader(root="data/whoscored", parser="whoscored")
The root directory should have the following structure:
root
├── {competition_id}-{season_id}-{game_id}.json
└── ...
Alternatively, the soccerdata library provides a wrapper that immediately returns an OptaLoader object for a scraped dataset.
import soccerdata as sd

# Set up a scraper for the 2021/2022 Premier League season
ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021)
# Scrape all games and return an OptaLoader object
api = ws.read_events(output_fmt='loader')
Warning
Scraping data from WhoScored.com violates their terms of service, so scraping this data is a legal grey area. If you decide to use this data anyway, you do so at your own responsibility.
Loading data¶
Next, you can load the match event stream data and metadata by calling the corresponding methods on the OptaLoader object.