├── README.md
├── __init__.py
├── data_download.py
├── playerank
│   ├── __init__.py
│   ├── conf
│   │   └── features_weigths.json
│   ├── features
│   │   ├── __init__.py
│   │   ├── abstract.py
│   │   ├── centerOfPerformanceFeature.py
│   │   ├── goalScoredFeatures.py
│   │   ├── matchPlayedFeatures.py
│   │   ├── plainAggregation.py
│   │   ├── playerankFeatures.py
│   │   ├── qualityFeatures.py
│   │   ├── relativeAggregation.py
│   │   ├── requirements.txt
│   │   ├── roleFeatures.py
│   │   └── wyscoutEventsDefinition.py
│   ├── models
│   │   ├── Clusterer.py
│   │   ├── Rater.py
│   │   ├── Weighter.py
│   │   └── __init__.py
│   ├── setup.py
│   └── utils
│       ├── __init__.py
│       ├── compute_features_weight.py
│       ├── compute_features_weight.py~
│       ├── compute_playerank.py
│       ├── compute_playerank.py~
│       └── compute_roles.py
└── playerank_schema_tist.png
/README.md:
--------------------------------------------------------------------------------
1 | # PlayeRank
2 |
3 | PlayeRank is a data-driven framework that offers a principled, multi-dimensional and role-aware evaluation of the performance of soccer players.
4 |
5 | PlayeRank is designed to work with [soccer-logs](https://www.nature.com/articles/s41597-019-0247-7), in which a match consists of a sequence of events, each encoded as a tuple (`id`, `type`, `position`, `timestamp`), where `id` is the identifier of the player who originated the event, `type` is the event type (e.g., pass, shot, goal, tackle), and `position` and `timestamp` denote the spatio-temporal coordinates of the event on the soccer field. PlayeRank assumes that soccer-logs are stored in a database, which is updated with new events after each soccer match.
6 |
7 | As described by the figure below, the PlayeRank framework consists of four main components (see the usage sketch after this list):
8 |
9 | 1. soccer-logs database
10 | 2. rating module
11 | 3. learning module
12 | 4. ranking module
13 |
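As a rough illustration of how the modules under `playerank/features` (shown further down in this repository) map onto these components, the sketch below builds the learning-phase input (relative team features) and a per-player rating table from the JSON files fetched by `data_download.py`. This is a minimal, hedged sketch, not the official pipeline: the `data/...` paths follow `data_download.py`, and `weights.json` is a hypothetical file mapping feature names to weights, i.e., the format that `playerankFeatures.createFeature` reads; role clustering and weight learning themselves are omitted.

```python
# Minimal sketch (assumptions noted above): wiring the feature modules into
# the learning-phase input and a per-player rating table.
from playerank.features.qualityFeatures import qualityFeatures
from playerank.features.goalScoredFeatures import goalScoredFeatures
from playerank.features.relativeAggregation import relativeAggregation
from playerank.features.playerankFeatures import playerankFeatures
from playerank.features.plainAggregation import plainAggregation

events, players, matches = "data/events/*.json", "data/players.json", "data/matches/*.json"

# Learning input: per-team event counts and goals scored, turned into
# team-minus-opponent values for each match (see relativeAggregation.py).
team_counts = qualityFeatures().createFeature(events, players, entity="team")
goals = goalScoredFeatures().createFeature(matches)
learning_input = relativeAggregation()
learning_input.set_features([team_counts, goals])
train_df = learning_input.aggregate(to_dataframe=True)  # one row per (match, team) pair

# Rating: weight per-player counts with learned feature weights.
# "weights.json" is assumed to be a {feature_name: weight} mapping,
# which is what playerankFeatures.createFeature expects.
player_counts = qualityFeatures().createFeature(events, players, entity="player")
rater = playerankFeatures()
rater.set_features([player_counts])
scores = rater.createFeature("weights.json")

# Collect the scores into a (match, player) table.
table = plainAggregation()
table.set_features([scores])
ratings_df = table.aggregate(to_dataframe=True)  # columns: match, entity, playerankScore
```

The complete procedure, including role clustering with `Clusterer` and feature-weight learning, is reproduced in the Colab notebook linked below.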
14 |
15 | ![PlayeRank framework schema](playerank_schema_tist.png)
16 | **Schema of the PlayeRank framework**. Starting from a database of soccer-logs **(a)**, it consists of three main phases. The learning phase **(c)** is an "offline" procedure: it must be executed at least once before the other phases, since it generates information used by the other two phases, but it can then be updated separately. The rating **(b)** and ranking **(d)** phases are online procedures, i.e., they are executed every time a new match is available in the database of soccer-logs.
17 |
18 | An exhaustive description of the PlayeRank framework is available in this paper; if you use PlayeRank, please cite it:
19 |
20 | Pappalardo, Luca, Cintia, Paolo, Ferragina, Paolo, Massucco, Emanuele, Pedreschi, Dino & Giannotti, Fosca (2019) PlayeRank: Data-driven Performance Evaluation and Player Ranking in Soccer via a Machine Learning Approach. ACM Transactions on Intelligent Systems and Technology 10(5), DOI: https://doi.org/10.1145/3343172
21 |
22 | Bibtex:
23 | ```
24 | @article{10.1145/3343172,
25 | author = {Pappalardo, Luca and Cintia, Paolo and Ferragina, Paolo and Massucco, Emanuele and Pedreschi, Dino and Giannotti, Fosca},
26 | title = {PlayeRank: Data-Driven Performance Evaluation and Player Ranking in Soccer via a Machine Learning Approach},
27 | year = {2019},
28 | issue_date = {November 2019},
29 | publisher = {Association for Computing Machinery},
30 | address = {New York, NY, USA},
31 | volume = {10},
32 | number = {5},
33 | issn = {2157-6904},
34 | url = {https://doi.org/10.1145/3343172},
35 | doi = {10.1145/3343172},
36 | journal = {ACM Trans. Intell. Syst. Technol.},
37 | month = sep,
38 | articleno = {Article 59},
39 | numpages = {27},
40 | keywords = {data science, soccer analytics, clustering, searching, multi-dimensional analysis, football analytics, predictive modelling, ranking, big data, Sports analytics}
41 | }
42 | ```
43 |
44 | To build player rankings from soccer-logs data, the following steps are required:
45 |
46 | 1. compute feature weights (learning)
47 | 2. compute roles (learning)
48 | 3. compute performance scores (rating)
49 | 4. aggregate performance scores (ranking)
50 |
51 | The code to reproduce the PlayeRank framework is available as a Google Colab notebook here:
52 | http://bit.ly/playerank_Tutorial
53 |
54 | It works on public soccer-logs data that are freely usable upon citation of this paper:
55 |
56 | Pappalardo, L., Cintia, P., Rossi, A., Massucco, E., Ferragina, P., Pedreschi, D. & Giannotti, F. (2019) A public data set of spatio-temporal match events in soccer competitions. Nature Scientific Data 6(236), doi:10.1038/s41597-019-0247-7
57 |
58 | Bibtex:
59 | ```
60 | @article{10.1038/s41597-019-0247-7,
61 | Abstract = {Soccer analytics is attracting increasing interest in academia and industry, thanks to the availability of sensing technologies that provide high-fidelity data streams for every match. Unfortunately, these detailed data are owned by specialized companies and hence are rarely publicly available for scientific research. To fill this gap, this paper describes the largest open collection of soccer-logs ever released, containing all the spatio-temporal events (passes, shots, fouls, etc.) that occurred during each match for an entire season of seven prominent soccer competitions. Each match event contains information about its position, time, outcome, player and characteristics.
The nature of team sports like soccer, halfway between the abstraction of a game and the reality of complex social systems, combined with the unique size and composition of this dataset, provide an ideal ground for tackling a wide range of data science problems, including the measurement and evaluation of performance, both at individual and at collective level, and the determinants of success and failure.}, 62 | Author = {Pappalardo, Luca and Cintia, Paolo and Rossi, Alessio and Massucco, Emanuele and Ferragina, Paolo and Pedreschi, Dino and Giannotti, Fosca}, 63 | Da = {2019/10/28}, 64 | Date-Added = {2019-12-29 16:44:01 +0000}, 65 | Date-Modified = {2019-12-29 16:44:01 +0000}, 66 | Doi = {10.1038/s41597-019-0247-7}, 67 | Id = {Pappalardo2019}, 68 | Isbn = {2052-4463}, 69 | Journal = {Scientific Data}, 70 | Number = {1}, 71 | Pages = {236}, 72 | Title = {A public data set of spatio-temporal match events in soccer competitions}, 73 | Ty = {JOUR}, 74 | Url = {https://doi.org/10.1038/s41597-019-0247-7}, 75 | Volume = {6}, 76 | Year = {2019}, 77 | Bdsk-Url-1 = {https://doi.org/10.1038/s41597-019-0247-7}, 78 | Bdsk-Url-2 = {http://dx.doi.org/10.1038/s41597-019-0247-7}} 79 | 80 | ``` 81 | 82 | 83 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /data_download.py: -------------------------------------------------------------------------------- 1 | """ 2 | 3 | Downloading script for soccer logs public open dataset: 4 | https://figshare.com/collections/Soccer_match_event_dataset/4415000/2 5 | 6 | Data description available here: 7 | 8 | Please cite the source as: 9 | 10 | 11 | """ 12 | 13 | import requests, zipfile, json, io 14 | 15 | 16 | dataset_links = { 17 | 18 | 'matches' : 'https://ndownloader.figshare.com/files/14464622', 19 | 'events' : 'https://ndownloader.figshare.com/files/14464685', 20 | 'players' : 'https://ndownloader.figshare.com/files/15073721', 21 | 'teams': 'https://ndownloader.figshare.com/files/15073697', 22 | } 23 | 24 | 25 | r = requests.get(dataset_links['matches'], stream=True) 26 | z = zipfile.ZipFile(io.BytesIO(r.content)) 27 | z.extractall("data/matches") 28 | 29 | r = requests.get(dataset_links['events'], stream=True) 30 | z = zipfile.ZipFile(io.BytesIO(r.content)) 31 | z.extractall("data/events") 32 | # 33 | r = requests.get(dataset_links['teams'], stream=False) 34 | print (r.text, file=open('data/teams.json','w')) 35 | 36 | 37 | r = requests.get(dataset_links['players'], stream=False) 38 | print (r.text, file=open('data/players.json','w')) 39 | -------------------------------------------------------------------------------- /playerank/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /playerank/conf/features_weigths.json: -------------------------------------------------------------------------------- 1 | [{"feature_name": "Duel-Air duel", "weight": -0.006539158031210258}, {"feature_name": "Duel-Ground attacking duel", "weight": -0.010449840258461175}, {"feature_name": "Duel-Ground defending duel", "weight": -0.011199884240897105}, {"feature_name": "Duel-Ground loose ball duel", "weight": -0.006563558749954084}, {"feature_name": "Foul-Foul", "weight": -0.0024418993059094306}, {"feature_name": "Foul-Hand foul", "weight": 
-4.691059387662471e-05}, {"feature_name": "Foul-Late card foul", "weight": -4.627009973076854e-06}, {"feature_name": "Free Kick-Corner", "weight": -0.000770251646940783}, {"feature_name": "Free Kick-Free Kick", "weight": -0.0016080494936435326}, {"feature_name": "Free Kick-Free kick cross", "weight": -0.00027956399045302253}, {"feature_name": "Free Kick-Free kick shot", "weight": -2.4486972707553348e-05}, {"feature_name": "Free Kick-Goal kick", "weight": -0.001562937886724554}, {"feature_name": "Free Kick-Penalty", "weight": -2.6608471224568587e-06}, {"feature_name": "Free Kick-Throw in", "weight": -0.00335808302308207}, {"feature_name": "Goalkeeper leaving line-Goalkeeper leaving line", "weight": -0.0001439884391271861}, {"feature_name": "Others on the ball-Acceleration", "weight": -0.0012001055794066166}, {"feature_name": "Others on the ball-Clearance", "weight": -0.0024385592601346476}, {"feature_name": "Others on the ball-Touch", "weight": -0.006009466163102354}, {"feature_name": "Pass-Cross", "weight": -0.0027830544469876215}, {"feature_name": "Pass-Hand pass", "weight": -0.00032481786643807177}, {"feature_name": "Pass-Head pass", "weight": -0.0042851410847089535}, {"feature_name": "Pass-High pass", "weight": -0.005054173369198981}, {"feature_name": "Pass-Launch", "weight": -0.0018547584405906774}, {"feature_name": "Pass-Simple pass", "weight": -0.04355855466251161}, {"feature_name": "Pass-Smart pass", "weight": -0.0007643856879811938}, {"feature_name": "Save attempt-Reflexes", "weight": -0.00032000870309680147}, {"feature_name": "Save attempt-Save attempt", "weight": -0.00030415717728838655}, {"feature_name": "Shot-Shot", "weight": -0.0017280657576526214}, {"feature_name": "entity", "weight": -0.03312147947261782}, {"feature_name": "goal-scored", "weight": -0.8512573718382006}] -------------------------------------------------------------------------------- /playerank/features/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mesosbrodleto/playerank/79dd6464be98bbc35f48f99b4a3b626ea43e9a7e/playerank/features/__init__.py -------------------------------------------------------------------------------- /playerank/features/abstract.py: -------------------------------------------------------------------------------- 1 | import abc 2 | 3 | class Feature(object): 4 | __metaclass__ = abc.ABCMeta 5 | """ 6 | class to wrap all the scripts/method to aggregate features from the database 7 | """ 8 | @abc.abstractmethod 9 | def createFeature(self,collectionName,param): 10 | """ 11 | Method to define how a feature/set of features is computed. 12 | param contains eventual parameters for querying database competion, subset of teams, whatever 13 | Best practice: 14 | features have to be stored into a collection of documents in the form: 15 | {_id: {match: (numeric) unique identifier of the match, 16 | name: (string) name of the feature, 17 | entity: (string) name of the entity target of the aggregation. It could be teamId, playerID, teamID + role or whatever significant for an aggregation}, 18 | value: (numeric) the count for the feature} 19 | 20 | return the name of the collection where the features have been stored 21 | """ 22 | return 23 | 24 | class Aggregation(object): 25 | __metaclass__ = abc.ABCMeta 26 | 27 | """ 28 | defines the methods to aggregate one/more collection of features for each match 29 | it have to provide results as a dataframe, 30 | 31 | e.g. 
32 | it is used to compute relative feature for each match 33 | match -> team (or entity) -> featureTeam - featureOppositor 34 | 35 | """ 36 | 37 | @abc.abstractproperty 38 | def get_features(self): 39 | 40 | return 'Should never get here' 41 | @abc.abstractproperty 42 | def set_features(self, collection_list): 43 | """ 44 | set the list of collection to use for relative features computing 45 | e.g. 46 | we could have a collection of quality features, one for quantity features, one for goals scored etc 47 | """ 48 | return 49 | 50 | @abc.abstractmethod 51 | def aggregate(self): 52 | """ 53 | merge the collections of feature and aggregate by match and team, computing the relative value for each team 54 | e.g. 55 | match -> team (or entity) -> featureTeam - featureOppositor 56 | 57 | returns a dataframe 58 | """ 59 | return -------------------------------------------------------------------------------- /playerank/features/centerOfPerformanceFeature.py: -------------------------------------------------------------------------------- 1 | from .abstract import Feature 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | import glob 5 | import numpy as np 6 | from collections import defaultdict 7 | 8 | class centerOfPerformanceFeature(Feature): 9 | 10 | 11 | def createFeature(self, events_path, players_file, select = None): 12 | 13 | """ 14 | compute centerOfPerformanceFeatures 15 | parameters: 16 | -events_path: folder path of events file 17 | -players_file: file path of players data file 18 | -select: function for filtering matches collection. Default: aggregate over all matches 19 | -entity: it could either 'team' or 'player'. 20 | It selects the aggregation for qualityFeatures among teams or players qualityfeatures. 21 | Note: aggregation by team is exploited during learning phase, for features weights estimation, 22 | while aggregation by players is involved for rating phase. 
23 | 24 | Output: 25 | list of json docs dictionaries in the format: 26 | {matchId : int , entity : int, feature: string , value} 27 | """ 28 | 29 | 30 | events = [] 31 | for file in glob.glob("%s"%events_path): 32 | data = json.load(open(file)) 33 | if select: 34 | data = list(filter(select,data)) 35 | events += data 36 | print ("[centerOfPerformanceFeature] added %s events from %s"%(len(data),file)) 37 | events = filter(lambda x: x['playerId']!=0,events) #filtering out referee 38 | if select: 39 | events = filter(select,events) 40 | players = json.load(open(players_file)) 41 | 42 | goalkeepers_ids = {player['wyId']:'GK' for player in players 43 | if player['role']['name']=='Goalkeeper'} 44 | events = filter(lambda x: x['playerId'] not in goalkeepers_ids,events ) 45 | aggregated_features = defaultdict(lambda : defaultdict(lambda: defaultdict(int))) 46 | 47 | MIN_EVENTS = 10 48 | players_positions = defaultdict(lambda : defaultdict(list)) 49 | for evt in events: 50 | if 'positions' in evt: 51 | player = evt['playerId'] 52 | match = evt['matchId'] 53 | position = (evt['positions'][0]['x'],evt['positions'][0]['y']) 54 | players_positions[match][player].append(position) 55 | 56 | 57 | #formatting features as json document 58 | results = [] 59 | for match,players_pos in players_positions.items(): 60 | for p in players_pos: 61 | positions = players_pos[p] 62 | x,y,count = np.mean([x[0] for x in positions]),np.mean([x[1] for x in positions]),len(positions) 63 | if count>MIN_EVENTS: 64 | documents = [ 65 | {'feature':'avg_x','entity':p,'match':match,'value':int(x)}, 66 | {'feature':'avg_y','entity':p,'match':match,'value':int(y)}, 67 | {'feature':'n_events','entity':p,'match':match,'value':count}, 68 | 69 | ] 70 | results+=documents 71 | 72 | return results 73 | -------------------------------------------------------------------------------- /playerank/features/goalScoredFeatures.py: -------------------------------------------------------------------------------- 1 | from .abstract import Feature 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | from collections import defaultdict 5 | import glob 6 | 7 | 8 | class goalScoredFeatures(Feature): 9 | """ 10 | goals scored by each team in each match 11 | """ 12 | def createFeature(self,matches_path,select = None): 13 | """ 14 | stores qualityFeatures on database 15 | parameters: 16 | -matches_path: file path of matches file 17 | -select: function for filtering matches collection. 
Default: aggregate over all matches 18 | 19 | Output: 20 | list of documents in the format: match: matchId, entity: team, feature: feature, value: value 21 | """ 22 | matches =[] 23 | for file in glob.glob("%s"%matches_path): 24 | data = json.load(open(file)) 25 | matches += data 26 | print ("[GoalScored features] added %s matches"%len(data)) 27 | if select: 28 | matches = filter(select,matches) 29 | result =[] 30 | 31 | for match in matches: 32 | if 'teamsData' in match: 33 | for team in match['teamsData']: 34 | document = {} 35 | document['match'] = match['wyId'] 36 | document['entity'] = team 37 | document['feature'] = 'goal-scored' 38 | document['value'] = match['teamsData'][team]['score'] 39 | result.append(document) 40 | 41 | 42 | return result 43 | -------------------------------------------------------------------------------- /playerank/features/matchPlayedFeatures.py: -------------------------------------------------------------------------------- 1 | from .abstract import Feature 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | import glob 5 | 6 | 7 | class matchPlayedFeatures(Feature): 8 | 9 | def createFeature(self,matches_path,players_file,select = None): 10 | """ 11 | It computes, for each player and match, total time (in minutes) played, 12 | goals scored and 13 | 14 | 15 | Input: 16 | -matches_path: folder with json files corresponding to matches data 17 | -select: function for filtering matches collection. Default: aggregate over all matches 18 | 19 | Output: 20 | 21 | a collection of documents in the f 22 | ormat _id-> {'match': this.wyId, 'player' : player, 23 | 'name': 'minutesPlayed'|'team'|'goalScored'|'timestamp'},value: |; 24 | 25 | """ 26 | players = json.load(open(players_file)) 27 | # filtering out all the events from goalkeepers 28 | goalkeepers_ids = [player['wyId'] for player in players 29 | if player['role']['name']=='Goalkeeper'] 30 | matches= [] 31 | for file in glob.glob("%s"%matches_path): 32 | matches += json.load(open(file)) 33 | if select: 34 | matches = list(filter(select,matches)) 35 | 36 | print ("[matchPlayedFeatures] processing %s matches"%len(matches)) 37 | result = [] 38 | for match in matches: 39 | matchId= match['wyId'] 40 | duration = 90 41 | if match['duration'] != 'Regular': 42 | duration = 120 43 | 44 | timestamp = match['dateutc'] 45 | 46 | for team in match['teamsData']: 47 | minutes_played = {} 48 | goals_scored = {} 49 | if match['teamsData'][team]['hasFormation']==1 and 'substitutions' in match['teamsData'][team]['formation']: 50 | for sub in match['teamsData'][team]['formation']['substitutions']: 51 | if type(sub) == dict: 52 | minute = sub['minute'] 53 | minutes_played[sub['playerOut']] = minute 54 | minutes_played[sub['playerIn']] = duration - minute 55 | if match['teamsData'][team]['hasFormation']==1 and 'lineup' in match['teamsData'][team]['formation']: 56 | for player in match['teamsData'][team]['formation']['lineup']: 57 | goals_scored[player['playerId']] = player['goals'] 58 | if player['playerId'] not in minutes_played: 59 | #player not substituted 60 | minutes_played[player['playerId']] = duration 61 | if match['teamsData'][team]['hasFormation']==1 and 'bench' in match['teamsData'][team]['formation']: 62 | for player in match['teamsData'][team]['formation']['bench']: 63 | goals_scored[player['playerId']] = player['goals'] 64 | if player['playerId'] not in minutes_played: 65 | #player not substituted 66 | minutes_played[player['playerId']] = duration 67 | for player,min in minutes_played.items(): 68 | if player 
not in goalkeepers_ids: 69 | document = {'match':matchId,'entity':player,'feature':'minutesPlayed', 70 | 'value': min} 71 | result.append (document) 72 | 73 | for player,gs in goals_scored.items(): 74 | if player not in goalkeepers_ids: 75 | try: 76 | gs = int(gs) 77 | except: 78 | gs = 0 79 | document = {'match':matchId,'entity':player,'feature':'goalScored', 80 | 'value': gs} 81 | result.append (document) 82 | ## adding timestamp and team for each player 83 | document = {'match':matchId,'entity':player,'feature':'timestamp', 84 | 'value': timestamp} 85 | 86 | result.append (document) 87 | ## adding timestamp and team for each player 88 | document = {'match':matchId,'entity':player,'feature':'team', 89 | 'value': team} 90 | result.append (document) 91 | print ("[matchPlayedFeatures] matches features computed. %s features processed"%(len(result))) 92 | return result 93 | -------------------------------------------------------------------------------- /playerank/features/plainAggregation.py: -------------------------------------------------------------------------------- 1 | from .abstract import Aggregation 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | import pandas as pd 5 | from collections import defaultdict 6 | 7 | class plainAggregation(Aggregation): 8 | """ 9 | merge features for each player and return a data frame 10 | match -> team (or entity) -> feature (playerank, timestamp, team, etc..) 11 | 12 | """ 13 | def set_features(self,collection_list): 14 | self.collections=collection_list 15 | 16 | def get_features(self): 17 | return self.collections 18 | def set_aggregated_collection(self, collection): 19 | self.aggregated_collection = collection 20 | def get_aggregated_collection(self): 21 | return self.aggregated_collection 22 | def aggregate(self, to_dataframe = False): 23 | 24 | 25 | 26 | ### 27 | # prior to aggregation, we merge all the features collections 28 | featdata = [] 29 | for collection in self.collections: 30 | featdata+=collection 31 | print ("[plainAggregation] added %s features"%len(collection)) 32 | """ 33 | single stage: transform aggregated feature per match into a collection of the form: 34 | match -> player -> {feature:value} 35 | """ 36 | aggregated = defaultdict(lambda : defaultdict(dict)) 37 | 38 | 39 | 40 | for document in featdata: 41 | match = document['match'] 42 | entity = int(document['entity']) 43 | feature = document['feature'] 44 | value = document['value'] 45 | aggregated[match][entity].update({feature:value}) 46 | 47 | result = [] 48 | 49 | for match in aggregated: 50 | for entity in aggregated[match]: 51 | 52 | document = {'match':match,'entity':entity} 53 | document.update(aggregated[match][entity]) 54 | result.append(document) 55 | 56 | 57 | print ("[plainAggregation] matches aggregated: %s"%len(result)) 58 | if to_dataframe : 59 | 60 | 61 | df=pd.DataFrame(result).fillna(0) 62 | return df 63 | else: 64 | return result 65 | -------------------------------------------------------------------------------- /playerank/features/playerankFeatures.py: -------------------------------------------------------------------------------- 1 | from .abstract import Feature 2 | from .wyscoutEventsDefinition import * 3 | from collections import defaultdict 4 | import json 5 | 6 | 7 | class playerankFeatures(Feature): 8 | """ 9 | Given a method to aggregate features and the corresponding weight of each feature, 10 | it computes playerank for each player and match 11 | input: 12 | -- features weights, computed within learning phase of playerank 
framework 13 | 14 | output: 15 | -- a collection of json documents in the format: 16 | {match:match_id, name: 'playerankScore', player:player_id, 17 | value: playerankScore(float)} 18 | """ 19 | def set_features(self,collection_list): 20 | self.collections=collection_list 21 | 22 | def get_features(self): 23 | return self.collections 24 | def createFeature(self,weights_file): 25 | 26 | weights=json.load(open(weights_file)) 27 | playerank_scores = defaultdict(lambda: defaultdict(float)) 28 | for feature_list in self.get_features(): 29 | for f in feature_list: 30 | if f['feature'] in weights: #check if feature have not been filtered 31 | playerank_scores[f['match']][f['entity']]+=f['value']*weights[f['feature']] 32 | 33 | result = [] 34 | for match in playerank_scores: 35 | for player in playerank_scores[match]: 36 | document = { 37 | 'match': match, 38 | 'entity': player, 39 | 'feature' : 'playerankScore', 40 | 'value' : float(playerank_scores[match][player]) 41 | } 42 | result.append(document) 43 | print ("[playerankFeatures] playerank scores computed. %s performance processed"%len(result)) 44 | return result 45 | -------------------------------------------------------------------------------- /playerank/features/qualityFeatures.py: -------------------------------------------------------------------------------- 1 | from .abstract import Feature 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | from collections import defaultdict 5 | import glob 6 | 7 | 8 | class qualityFeatures(Feature): 9 | """ 10 | Quality features are the count of events with outcomes. 11 | E.g. 12 | - number of accurate passes 13 | - number of wrong passes 14 | ... 15 | """ 16 | def createFeature(self,events_path,players_file,entity = 'team',select = None): 17 | """ 18 | compute qualityFeatures 19 | parameters: 20 | -events_path: file path of events file 21 | -select: function for filtering events collection. Default: aggregate over all events 22 | -entity: it could either 'team' or 'player'. 
It selects the aggregation for qualityFeatures among teams or players qualityfeatures 23 | 24 | Output: 25 | list of dictionaries in the format: matchId -> entity -> feature -> value 26 | """ 27 | event2subevent2outcome={ 28 | 1:{10: [1801, 1802], 29 | 11: [1801, 1802], 30 | 12: [1801, 1802], 31 | 13: [1801, 1802]}, 32 | 2: [1702, 1703, 1701], #fouls aggregated into macroevent 33 | 34 | 3 :{30: [1801, 1802], 35 | 31: [1801, 1802], 36 | 32: [1801, 1802], 37 | 33: [1801, 1802], 38 | 34: [1801, 1802], 39 | 35: [1802], 40 | 36: [1801, 1802]}, 41 | 4: {40: [1801, 1802]}, 42 | 6: {60: []}, 43 | 7: {70: [1801, 1802,101], 44 | 71: [1801, 1802,101], 45 | 72: [1401, 1302, 201, 1901, 1301, 2001, 301]}, 46 | 8: {80: [1801, 1802,302,301], 47 | 81: [1801, 1802,302,301], 48 | 82: [1801, 1802,302,301], 49 | 83: [1801, 1802,302,301], 50 | 84: [1801, 1802,302,301], 51 | 85: [1801, 1802,302,301], 52 | 86: [1801, 1802,302,301]}, 53 | #90: [1801, 1802], 54 | #91: [1801, 1802], 55 | 10: {100: [1801, 1802]}} 56 | 57 | aggregated_features = defaultdict(lambda : defaultdict(lambda: defaultdict(int))) 58 | 59 | players = json.load(open(players_file)) 60 | # filtering out all the events from goalkeepers 61 | goalkeepers_ids = [player['wyId'] for player in players 62 | if player['role']['name']=='Goalkeeper'] 63 | 64 | events = [] 65 | for file in glob.glob("%s"%events_path): 66 | data = json.load(open(file)) 67 | if select: 68 | data = list(filter(select,data)) 69 | events += list(filter(lambda x: x['matchPeriod'] in ['1H','2H'] and x['playerId'] not in goalkeepers_ids,data)) #excluding penalties events 70 | print ("[qualityFeatures] added %s events from %s"%(len(data), file)) 71 | 72 | 73 | for evt in events: 74 | if evt['eventId'] in event2subevent2outcome: 75 | ent = evt['teamId'] #default 76 | if entity == 'player': 77 | ent = evt['playerId'] 78 | 79 | evtName =evt['eventName'] 80 | 81 | if type(event2subevent2outcome[evt['eventId']]) == dict: 82 | #hierarchy as event->subevent->tags 83 | if evt['subEventId'] not in event2subevent2outcome[evt['eventId']]: 84 | #malformed events 85 | continue #skip to next event 86 | tags = [x for x in evt['tags'] if x['id'] in event2subevent2outcome[evt['eventId']][evt['subEventId']]] 87 | 88 | evtName+="-%s"%evt['subEventName'] 89 | else: 90 | #hierarchy as event->tags 91 | tags = [x for x in evt['tags'] if x['id'] in event2subevent2outcome[evt['eventId']]] 92 | 93 | if len(tags)>0: 94 | for tag in tags: 95 | aggregated_features[evt['matchId']][ent]["%s-%s"%(evtName,tag2name[tag['id']])]+=1 96 | 97 | else: 98 | aggregated_features[evt['matchId']][ent]["%s"%(evtName)]+=1 99 | result =[] 100 | for match in aggregated_features: 101 | for entity in aggregated_features[match]: 102 | for feature in aggregated_features[match][entity]: 103 | document = {} 104 | document['match'] = match 105 | document['entity'] = entity 106 | document['feature'] = feature 107 | document['value'] = aggregated_features[match][entity][feature] 108 | result.append(document) 109 | 110 | return result 111 | -------------------------------------------------------------------------------- /playerank/features/relativeAggregation.py: -------------------------------------------------------------------------------- 1 | from .abstract import Aggregation 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | import pandas as pd 5 | from collections import defaultdict 6 | 7 | class relativeAggregation(Aggregation): 8 | """ 9 | compute relative feature for each match 10 | match -> team (or entity) -> 
featureTeam - featureOpponents 11 | """ 12 | def set_features(self,collection_list): 13 | self.collections=collection_list 14 | 15 | def get_features(self): 16 | return self.collections 17 | def aggregate(self,to_dataframe = False): 18 | """ 19 | compute relative aggregation: give a set of features it compute the A-B 20 | value for each entity in each team. 21 | Ex: 22 | passes for team A in match 111 : 500 23 | passes for team B in match 111 : 300 24 | lead to output: 25 | {'passes': 200} 26 | 27 | this method is involved for feature weight estimation phase of playerank framework. 28 | param 29 | 30 | - to_dataframe : return a dataframe instead of a list of documents 31 | 32 | """ 33 | 34 | featdata = [] 35 | for collection in self.collections: 36 | featdata+=collection 37 | print ("[relativeAggregation] added %s features"%len(collection)) 38 | #selecting teamA e teamB as teams[0] and team[1] 39 | aggregated = defaultdict(lambda : defaultdict(lambda: defaultdict(int))) 40 | #format of aggregation: match,team,feature,valueTeam-valueOppositor 41 | result = [] 42 | for document in featdata: 43 | match = document['match'] 44 | entity = int(document['entity']) 45 | feature = document['feature'] 46 | value = document['value'] 47 | aggregated[match][entity][feature] = int(value) 48 | 49 | 50 | for match in aggregated: 51 | for entity in aggregated[match]: 52 | for feature in aggregated[match][entity]: 53 | opponents = [x for x in aggregated[match] if x!=entity][0] 54 | 55 | result_doc = {} 56 | result_doc['match'] = match 57 | result_doc['entity'] = entity 58 | result_doc['name'] = feature 59 | value = aggregated[match][entity][feature] 60 | if feature in aggregated[match][opponents]: 61 | result_doc['value'] = value - aggregated[match][opponents][feature] 62 | else: 63 | result_doc['value'] = value 64 | result.append(result_doc) 65 | 66 | if to_dataframe : 67 | 68 | featlist = defaultdict(dict) 69 | for data in result: 70 | 71 | featlist["%s-%s"%(data['match'],data['entity'])].update({data['name']:data['value']}) 72 | print ("[relativeAggregation] matches aggregated: %s"%len(featlist.keys())) 73 | 74 | df=pd.DataFrame(list(featlist.values())).fillna(0) 75 | 76 | return df 77 | else: 78 | return result 79 | -------------------------------------------------------------------------------- /playerank/features/requirements.txt: -------------------------------------------------------------------------------- 1 | pandas==0.23.4 2 | numpy==1.11.0 3 | -------------------------------------------------------------------------------- /playerank/features/roleFeatures.py: -------------------------------------------------------------------------------- 1 | from .abstract import Feature 2 | from .wyscoutEventsDefinition import * 3 | import json 4 | from collections import defaultdict 5 | 6 | class roleFeatures(Feature): 7 | 8 | def set_features(self,collection_list): 9 | self.collections=collection_list 10 | 11 | def get_features(self): 12 | return self.collections 13 | def createFeature(self,matrix_role_file): 14 | """ 15 | Given the matrix for roles, it computes, for each player and match, 16 | the role of a player 17 | 18 | A role matrix is a data structure where, given x and y (between 0 and 100), 19 | it contains the correspinding roles for a player having center of performance = x,y 20 | Role_matrix is computed within learning phase of playerank framework 21 | 22 | Input: 23 | role_matrix: file patch for dictionary in the format x->y->role 24 | feature_lists: lists of features for each player in each 
match, describing 25 | players' average position 26 | 27 | """ 28 | 29 | role_matrix = json.load(open(matrix_role_file,"r")) 30 | roles = defaultdict(lambda: defaultdict(dict)) 31 | for feature_list in self.get_features(): 32 | for f in feature_list: 33 | roles[f['match']][f['entity']].update({f['feature']: f['value']}) 34 | ## for each match and player we have 35 | ## avg_x,avg_y,n_events 36 | results = [] 37 | for match in roles: 38 | for player in roles[match]: 39 | match_data = roles[match][player] 40 | role_label = role_matrix[str(match_data['avg_x'])][str(match_data['avg_y'])] 41 | #note: string conversion because loading role matrix from file does set everything as string 42 | document = {'match':match, 'entity':player, 'feature':'roleCluster','value':role_label} 43 | results.append(document) 44 | return results 45 | -------------------------------------------------------------------------------- /playerank/features/wyscoutEventsDefinition.py: -------------------------------------------------------------------------------- 1 | ## events name definition for wyscout data 2 | 3 | 4 | ## CONSTANTS 5 | VICTORY = 1 6 | DEFEAT = 2 7 | DRAW = 0 8 | DRAW_DEFEAT = 0 9 | VICTORY_DRAW = 1 10 | 11 | ## CONSTANTS IDENTIFYING MACRO EVENTS 12 | DUEL = 1 13 | FOUL = 2 14 | FREE_KICK = 3 15 | GOALKEEPER_LEAVING_LINE = 4 16 | INTERRUPTION = 5 17 | OFFSIDE = 6 18 | OTHERS = 7 19 | PASS = 8 20 | SAVE = 9 21 | SHOT = 10 22 | 23 | macroevent2name = {DUEL: 'duel', FOUL: 'foul', FREE_KICK: 'free kick', 24 | GOALKEEPER_LEAVING_LINE: 'goalkeeper leaving line', 25 | INTERRUPTION: 'interruption', 26 | OFFSIDE: 'offside', OTHERS: 'others on the ball', 27 | PASS: 'pass', SAVE: 'save attempt', SHOT: 'shot'} 28 | 29 | macroevent2positions_index = {DUEL: 0, FOUL: 0, FREE_KICK: 0, 30 | GOALKEEPER_LEAVING_LINE: 0, INTERRUPTION: 0, 31 | OFFSIDE: 0, OTHERS: 0, PASS: 0, SAVE: 1, SHOT: 0 32 | } 33 | 34 | ## CONSTANTS IDENTIFYING SUB EVENTS 35 | # duels 36 | AIR_DUEL = 10 37 | GROUND_ATTACKING_DUEL = 11 38 | GROUND_DEFENDING_DUEL = 12 39 | GROUND_LOOSE_BALL_DUEL = 13 40 | 41 | # fouls 42 | NORMAL_FOUL = 20 43 | HAND_FOUL = 21 44 | LATE_CARD_FOUL = 22 45 | OUT_OF_GAME_FOUL = 23 46 | PROTEST_FOUL = 24 47 | SIMULATION_FOUL = 25 48 | TIME_LOST_FOUL = 26 49 | VIOLENT_FOUL = 27 50 | 51 | #free kicks 52 | CORNER_FREE_KICK = 30 53 | NORMAL_FREE_KICK = 31 54 | CROSS_FREE_KICK = 32 55 | SHOT_FREE_KICK = 33 56 | GOAL_FREE_KICK = 34 57 | PENALTY_FREE_KICK = 35 58 | THROW_IN_FREE_KICK = 36 59 | # goalkeeping leaving line 60 | GOALKEEPING_LEAVING_LINE = 40 61 | # interruption 62 | BALL_OUT_INTERRUPTION = 50 63 | WHISTLE_INTERRUPTION = 51 64 | # offside 65 | NORMAL_OFFSIDE=60 66 | # others on the ball 67 | ACCELERATION_OTHERS = 70 68 | CLEARANCE_OTHERS = 71 69 | TOUCH_OTHERS = 72 70 | # pass 71 | CROSS_PASS = 80 72 | HAND_PASS = 81 73 | HEAD_PASS = 82 74 | HIGH_PASS = 83 75 | LAUNCH_PASS = 84 76 | SIMPLE_PASS = 85 77 | SMART_PASS = 86 78 | # save 79 | REFLEXES_SAVE = 90 80 | NORMAL_SAVE = 91 81 | # shot 82 | NORMAL_SHOT = 100 83 | 84 | macroevent2subevents = {DUEL: [AIR_DUEL, GROUND_ATTACKING_DUEL, GROUND_DEFENDING_DUEL, GROUND_LOOSE_BALL_DUEL], 85 | FOUL: [NORMAL_FOUL, HAND_FOUL, LATE_CARD_FOUL, OUT_OF_GAME_FOUL, PROTEST_FOUL, SIMULATION_FOUL, 86 | TIME_LOST_FOUL, VIOLENT_FOUL], 87 | FREE_KICK: [CORNER_FREE_KICK, NORMAL_FREE_KICK, CROSS_FREE_KICK, SHOT_FREE_KICK, GOAL_FREE_KICK, 88 | PENALTY_FREE_KICK, THROW_IN_FREE_KICK], 89 | INTERRUPTION: [BALL_OUT_INTERRUPTION, WHISTLE_INTERRUPTION], 90 | OFFSIDE: [NORMAL_OFFSIDE], 91 | OTHERS: 
[ACCELERATION_OTHERS, CLEARANCE_OTHERS, TOUCH_OTHERS], 92 | PASS: [CROSS_PASS, HAND_PASS, HEAD_PASS, HIGH_PASS, LAUNCH_PASS, SIMPLE_PASS, SMART_PASS], 93 | SAVE: [REFLEXES_SAVE, NORMAL_SAVE], 94 | SHOT: [NORMAL_SHOT] 95 | } 96 | 97 | subevents2name = { 98 | 99 | AIR_DUEL:'air duel', GROUND_ATTACKING_DUEL:'ground attacking duel', 100 | GROUND_DEFENDING_DUEL:'ground defending duel', GROUND_LOOSE_BALL_DUEL:'ground loose ball duel', 101 | 102 | NORMAL_FOUL:'normal foul', HAND_FOUL:'hand foul', LATE_CARD_FOUL:'late card foul', 103 | OUT_OF_GAME_FOUL:'out of game foul', PROTEST_FOUL:'protest foul', SIMULATION_FOUL:'simulation foul', 104 | TIME_LOST_FOUL:'time lost foul', OUT_OF_GAME_FOUL:'out of game foul', PROTEST_FOUL:'protest foul', 105 | SIMULATION_FOUL:'simulation foul', TIME_LOST_FOUL:'time lost foul', VIOLENT_FOUL:'violent foul', 106 | 107 | CORNER_FREE_KICK:'corner free kick', NORMAL_FREE_KICK:'normal free kick', CROSS_FREE_KICK:'cross free kick', 108 | SHOT_FREE_KICK:'shot free kick', GOAL_FREE_KICK:'goal free kick', PENALTY_FREE_KICK:'penalty free kick', 109 | THROW_IN_FREE_KICK:'throw in free kick', 110 | 111 | BALL_OUT_INTERRUPTION:'ball out interruption', WHISTLE_INTERRUPTION:'whistle interruption', 112 | 113 | ACCELERATION_OTHERS:'accelleration', CLEARANCE_OTHERS:'clearance', TOUCH_OTHERS:'touch', 114 | 115 | CROSS_PASS:'cross pass', HAND_PASS:'hand pass', HEAD_PASS:'head pass', HIGH_PASS:'high pass', 116 | LAUNCH_PASS:'launch pass', SIMPLE_PASS:'simple pass', SMART_PASS:'smart pass', 117 | 118 | REFLEXES_SAVE:'reflexes save', NORMAL_SAVE:'normal save', 119 | 120 | NORMAL_SHOT: 'shot', 121 | 122 | NORMAL_OFFSIDE: 'offside', 123 | 124 | GOALKEEPING_LEAVING_LINE: 'goalkeeping leaving line' 125 | } 126 | 127 | 128 | ## CONSTANTS FOR TAGS 129 | GOAL_TAG = 101 130 | OWN_GOAL_TAG = 102 131 | ASSIST_TAG = 301 132 | KEY_PASS_TAG = 302 133 | COUNTER_ATTACK_TAG = 1901 134 | LEFT_FOOT_TAG = 401 135 | RIGHT_FOOT_TAG = 402 136 | HEAD_BODY_TAG = 403 137 | DIRECT_TAG = 1101 138 | INDIRECT_TAG = 1102 139 | DANGEROUS_BALL_LOST_TAG = 2001 140 | BLOCKED_TAG = 2101 141 | HIGH_TAG = 801 142 | LOW_TAG = 802 143 | INTERCEPTION_TAG = 1401 144 | CLEARANCE_TAG = 1501 145 | OPPORTUNITY_TAG = 201 146 | FEINT_TAG = 1301 147 | MISSED_BALL_TAG = 1302 148 | FREE_SPACE_RIGHT_TAG = 501 149 | FREE_SPACE_LEFT_TAG = 502 150 | TAKE_ON_LEFT_TAG = 503 151 | TAKE_ON_RIGHT_TAG = 504 152 | SLIDING_TACKLE_TAG = 1601 153 | ANTICIPATED_TAG = 601 154 | ANTICIPATION_TAG = 602 155 | RED_CARD_TAG = 1701 156 | YELLOW_CARD_TAG = 1702 157 | SECOND_YELLOW_CARD_TAG = 1703 158 | THROUGH_TAG = 901 159 | FAIRPLAY_TAG = 1001 160 | LOST_TAG = 701 161 | NEUTRAL_TAG = 702 162 | WON_TAG = 703 163 | ACCURATE_TAG = 1801 164 | NOT_ACCURATE_TAG = 1802 165 | NO_TAG = -1 166 | tag2name = { 167 | GOAL_TAG: 'GOAL', OWN_GOAL_TAG: 'OWN GOAL', ASSIST_TAG: 'assist', KEY_PASS_TAG:'key pass', 168 | COUNTER_ATTACK_TAG:'counter attack', LEFT_FOOT_TAG:'left foot', RIGHT_FOOT_TAG:'right foot', 169 | HEAD_BODY_TAG:'head_body', DIRECT_TAG:'direct', INDIRECT_TAG:'indirect', 170 | DANGEROUS_BALL_LOST_TAG:'dangerous ball lost', BLOCKED_TAG:'blocked', HIGH_TAG:'high', 171 | LOW_TAG:'low', INTERCEPTION_TAG: 'interception', CLEARANCE_TAG:'clearance', OPPORTUNITY_TAG:'opportunity', 172 | FEINT_TAG:'feint', MISSED_BALL_TAG:'missed ball', FREE_SPACE_RIGHT_TAG:'free space right', 173 | FREE_SPACE_LEFT_TAG:'free space left', TAKE_ON_LEFT_TAG:'takeon left', TAKE_ON_RIGHT_TAG:'takeon right', 174 | SLIDING_TACKLE_TAG:'sliding tackle', ANTICIPATED_TAG:'anticipated', 
ANTICIPATION_TAG:'anticipation', 175 | RED_CARD_TAG:'red card', YELLOW_CARD_TAG: 'yellow card', SECOND_YELLOW_CARD_TAG: 'second yellow card', 176 | THROUGH_TAG:'through', FAIRPLAY_TAG:'fairplay', LOST_TAG:'lost', NEUTRAL_TAG:'neutral', 177 | WON_TAG:'won', ACCURATE_TAG:'accurate', NOT_ACCURATE_TAG:'not accurate', NO_TAG: 'no tag' 178 | } 179 | 180 | 181 | subevent2tags = { 182 | 183 | ## OFFSIDE 184 | NORMAL_OFFSIDE: [], 185 | 186 | ## LEAVING LINE 187 | GOALKEEPING_LEAVING_LINE: [], 188 | 189 | ## SHOT 190 | NORMAL_SHOT: 191 | [LEFT_FOOT_TAG, OPPORTUNITY_TAG, NOT_ACCURATE_TAG, ACCURATE_TAG, HEAD_BODY_TAG, RIGHT_FOOT_TAG, BLOCKED_TAG, GOAL_TAG, INTERCEPTION_TAG, COUNTER_ATTACK_TAG, ASSIST_TAG], 192 | 193 | ##### DUELS ######## 194 | 195 | GROUND_ATTACKING_DUEL: 196 | [LOST_TAG, NOT_ACCURATE_TAG, WON_TAG, ACCURATE_TAG, TAKE_ON_RIGHT_TAG, ANTICIPATION_TAG, FREE_SPACE_LEFT_TAG, TAKE_ON_LEFT_TAG, NEUTRAL_TAG, FREE_SPACE_RIGHT_TAG, DANGEROUS_BALL_LOST_TAG, INTERCEPTION_TAG, COUNTER_ATTACK_TAG, SLIDING_TACKLE_TAG, OPPORTUNITY_TAG], 197 | 198 | AIR_DUEL: 199 | [LOST_TAG, NOT_ACCURATE_TAG, WON_TAG, ACCURATE_TAG, NEUTRAL_TAG, COUNTER_ATTACK_TAG, KEY_PASS_TAG, ASSIST_TAG], 200 | 201 | GROUND_LOOSE_BALL_DUEL: 202 | [LOST_TAG, NOT_ACCURATE_TAG, WON_TAG, ACCURATE_TAG, NEUTRAL_TAG, SLIDING_TACKLE_TAG, COUNTER_ATTACK_TAG, INTERCEPTION_TAG, DANGEROUS_BALL_LOST_TAG], 203 | 204 | GROUND_DEFENDING_DUEL: [SLIDING_TACKLE_TAG, WON_TAG, ACCURATE_TAG, LOST_TAG, NOT_ACCURATE_TAG, TAKE_ON_LEFT_TAG, ANTICIPATED_TAG, FREE_SPACE_RIGHT_TAG, TAKE_ON_RIGHT_TAG, NEUTRAL_TAG, FREE_SPACE_LEFT_TAG, COUNTER_ATTACK_TAG], 205 | 206 | ######### FREE KICKS ########### 207 | 208 | SHOT_FREE_KICK: 209 | [RIGHT_FOOT_TAG, DIRECT_TAG, OPPORTUNITY_TAG, ACCURATE_TAG, BLOCKED_TAG, NOT_ACCURATE_TAG, LEFT_FOOT_TAG, INDIRECT_TAG, GOAL_TAG], 210 | 211 | CROSS_FREE_KICK: 212 | [HIGH_TAG, NOT_ACCURATE_TAG, ASSIST_TAG, ACCURATE_TAG, KEY_PASS_TAG], 213 | 214 | NORMAL_FREE_KICK: 215 | [ACCURATE_TAG, NOT_ACCURATE_TAG, KEY_PASS_TAG], 216 | 217 | CORNER_FREE_KICK: 218 | [HIGH_TAG, NOT_ACCURATE_TAG, ACCURATE_TAG, KEY_PASS_TAG, OPPORTUNITY_TAG, ASSIST_TAG], 219 | 220 | THROW_IN_FREE_KICK: 221 | [ACCURATE_TAG, NOT_ACCURATE_TAG, FAIRPLAY_TAG], 222 | 223 | PENALTY_FREE_KICK: 224 | [GOAL_TAG, RIGHT_FOOT_TAG, ACCURATE_TAG, LEFT_FOOT_TAG, NOT_ACCURATE_TAG], 225 | 226 | GOAL_FREE_KICK: [NO_TAG], 227 | 228 | 229 | #### FOULS #### 230 | 231 | PROTEST_FOUL: 232 | [YELLOW_CARD_TAG, RED_CARD_TAG, SECOND_YELLOW_CARD_TAG], 233 | 234 | SIMULATION_FOUL: 235 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 236 | 237 | TIME_LOST_FOUL: 238 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 239 | 240 | VIOLENT_FOUL: 241 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 242 | 243 | NORMAL_FOUL: 244 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 245 | 246 | HAND_FOUL: 247 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 248 | 249 | LATE_CARD_FOUL: 250 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 251 | 252 | OUT_OF_GAME_FOUL: 253 | [YELLOW_CARD_TAG, SECOND_YELLOW_CARD_TAG, RED_CARD_TAG], 254 | 255 | #WHISTLE_INTERRUPTION: [], 256 | 257 | #BALL_OUT_INTERRUPTION: [], 258 | 259 | ### TOUCHS #### 260 | 261 | TOUCH_OTHERS: 262 | [INTERCEPTION_TAG, MISSED_BALL_TAG, OPPORTUNITY_TAG, COUNTER_ATTACK_TAG, FEINT_TAG, DANGEROUS_BALL_LOST_TAG, ASSIST_TAG, OWN_GOAL_TAG], 263 | 264 | CLEARANCE_OTHERS: 265 | [INTERCEPTION_TAG, NOT_ACCURATE_TAG, ACCURATE_TAG, COUNTER_ATTACK_TAG, FAIRPLAY_TAG, MISSED_BALL_TAG, OWN_GOAL_TAG], 266 
| 267 | ACCELERATION_OTHERS: 268 | [NOT_ACCURATE_TAG, ACCURATE_TAG, COUNTER_ATTACK_TAG, INTERCEPTION_TAG], 269 | 270 | ##### SAVES #### 271 | 272 | NORMAL_SAVE: 273 | [ACCURATE_TAG, GOAL_TAG, NOT_ACCURATE_TAG, COUNTER_ATTACK_TAG], 274 | 275 | REFLEXES_SAVE: 276 | [ACCURATE_TAG, GOAL_TAG, NOT_ACCURATE_TAG, COUNTER_ATTACK_TAG], 277 | 278 | 279 | #### PASSES ##################### 280 | 281 | HEAD_PASS: 282 | [INTERCEPTION_TAG, ACCURATE_TAG, NOT_ACCURATE_TAG, ASSIST_TAG, COUNTER_ATTACK_TAG, KEY_PASS_TAG, DANGEROUS_BALL_LOST_TAG], 283 | 284 | HIGH_PASS: 285 | [NOT_ACCURATE_TAG, ACCURATE_TAG, KEY_PASS_TAG, THROUGH_TAG, COUNTER_ATTACK_TAG, INTERCEPTION_TAG, ASSIST_TAG], 286 | 287 | CROSS_PASS: 288 | [LEFT_FOOT_TAG, HIGH_TAG, NOT_ACCURATE_TAG, RIGHT_FOOT_TAG, ACCURATE_TAG, KEY_PASS_TAG, ASSIST_TAG, BLOCKED_TAG, COUNTER_ATTACK_TAG, INTERCEPTION_TAG], 289 | 290 | HAND_PASS: 291 | [ACCURATE_TAG, NOT_ACCURATE_TAG, INTERCEPTION_TAG, COUNTER_ATTACK_TAG, FAIRPLAY_TAG], 292 | 293 | SMART_PASS: 294 | [NOT_ACCURATE_TAG, ACCURATE_TAG, THROUGH_TAG, KEY_PASS_TAG, ASSIST_TAG, COUNTER_ATTACK_TAG, INTERCEPTION_TAG], 295 | 296 | LAUNCH_PASS: 297 | [ACCURATE_TAG, NOT_ACCURATE_TAG, INTERCEPTION_TAG, FAIRPLAY_TAG, DANGEROUS_BALL_LOST_TAG], 298 | 299 | SIMPLE_PASS: [ACCURATE_TAG, NOT_ACCURATE_TAG, INTERCEPTION_TAG, COUNTER_ATTACK_TAG, KEY_PASS_TAG, FAIRPLAY_TAG, DANGEROUS_BALL_LOST_TAG, ASSIST_TAG, OWN_GOAL_TAG] 300 | 301 | } 302 | 303 | #subevent2outcome={PROTEST_FOUL: 304 | # subevent2tags[PROTEST_FOUL]+ [NO_TAG], 305 | 306 | #SIMULATION_FOUL: 307 | # subevent2tags[SIMULATION_FOUL]+[ NO_TAG], 308 | 309 | #TIME_LOST_FOUL: 310 | # subevent2tags[TIME_LOST_FOUL]+[NO_TAG], 311 | 312 | #VIOLENT_FOUL: 313 | # subevent2tags[VIOLENT_FOUL]+[NO_TAG], 314 | 315 | #NORMAL_FOUL: 316 | # subevent2tags[NORMAL_FOUL]+[NO_TAG], 317 | 318 | #HAND_FOUL: 319 | # subevent2tags[HAND_FOUL]+[NO_TAG ], 320 | 321 | #LATE_CARD_FOUL: 322 | # subevent2tags[LATE_CARD_FOUL]+[ NO_TAG], 323 | 324 | #OUT_OF_GAME_FOUL: 325 | # subevent2tags[OUT_OF_GAME_FOUL]+[NO_TAG], 326 | # 'default':[ACCURATE_TAG, NOT_ACCURATE_TAG], 327 | # TOUCH_OTHERS:subevent2tags[TOUCH_OTHERS], 328 | #} 329 | 330 | 331 | 332 | 333 | -------------------------------------------------------------------------------- /playerank/models/Clusterer.py: -------------------------------------------------------------------------------- 1 | # /usr/local/bin/python 2 | from collections import defaultdict, OrderedDict, Counter 3 | import numpy as np 4 | from scipy import optimize 5 | from scipy.stats import gaussian_kde 6 | #from utils import * 7 | from sklearn.base import BaseEstimator 8 | from sklearn.svm import LinearSVC 9 | from sklearn.model_selection import cross_val_score 10 | from sklearn.dummy import DummyClassifier 11 | from sklearn.feature_selection import VarianceThreshold 12 | from sklearn.model_selection import GridSearchCV, StratifiedKFold 13 | from sklearn.feature_selection import RFECV 14 | from scipy.spatial.distance import euclidean 15 | from sklearn.preprocessing import StandardScaler, LabelEncoder 16 | from sklearn.metrics import silhouette_score, silhouette_samples 17 | from sklearn.cluster import KMeans, MiniBatchKMeans 18 | from sklearn.base import BaseEstimator, ClusterMixin 19 | from joblib import Parallel, delayed 20 | 21 | from sklearn.metrics.pairwise import pairwise_distances 22 | from itertools import combinations 23 | from sklearn.utils import check_random_state 24 | from sklearn.preprocessing import MinMaxScaler 25 | import json 26 | 27 | def 
scalable_silhouette_score(X, labels, metric='euclidean', sample_size=None, 28 | random_state=None, n_jobs=1, **kwds): 29 | """ 30 | Compute the mean Silhouette Coefficient of all samples. 31 | The Silhouette Coefficient is compute using the mean intra-cluster distance (a) 32 | and the mean nearest-cluster distance (b) for each sample. 33 | 34 | The Silhouette Coefficient for a sample is $(b - a) / max(a, b)$. 35 | To clarify, b is the distance between a sample and the nearest cluster 36 | that b is not a part of. 37 | 38 | This function returns the mean Silhoeutte Coefficient over all samples. 39 | To obtain the values for each sample, it uses silhouette_samples. 40 | 41 | The best value is 1 and the worst value is -1. Values near 0 indicate 42 | overlapping clusters. Negative values generally indicate that a sample has 43 | been assigned to the wrong cluster, as a different cluster is more similar. 44 | 45 | Parameters 46 | ---------- 47 | X : array [n_samples_a, n_features] 48 | the Feature array. 49 | 50 | labels : array, shape = [n_samples] 51 | label values for each sample 52 | 53 | metric : string, or callable 54 | The metric to use when calculating distance between instances in a 55 | feature array. If metric is a string, it must be one of the options 56 | allowed by metrics.pairwise.pairwise_distances. If X is the distance 57 | array itself, use "precomputed" as the metric. 58 | 59 | sample_size : int or None 60 | The size of the sample to use when computing the Silhouette 61 | Coefficient. If sample_size is None, no sampling is used. 62 | 63 | random_state : integer or numpy.RandomState, optional 64 | The generator used to initialize the centers. If an integer is 65 | given, it fixes the seed. Defaults to the global numpy random 66 | number generator. 67 | 68 | **kwds : optional keyword parameters 69 | Any further parameters are passed directly to the distance function. 70 | If using a scipy.spatial.distance metric, the parameters are still 71 | metric dependent. See the scipy docs for usage examples. 72 | 73 | Returns 74 | ------- 75 | silhouette : float 76 | the Mean Silhouette Coefficient for all samples. 77 | 78 | References 79 | ---------- 80 | Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the 81 | Interpretation and Validation of Cluster Analysis". Computational 82 | and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7. 83 | http://en.wikipedia.org/wiki/Silhouette_(clustering) 84 | """ 85 | if sample_size is not None: 86 | random_state = check_random_state(random_state) 87 | indices = random_state.permutation(X.shape[0])[:sample_size] 88 | if metric == "precomputed": 89 | raise ValueError('Distance matrix cannot be precomputed') 90 | else: 91 | X, labels = X[indices], labels[indices] 92 | 93 | return np.mean(scalable_silhouette_samples( 94 | X, labels, metric=metric, n_jobs=n_jobs, **kwds)) 95 | 96 | 97 | def scalable_silhouette_samples(X, labels, metric='euclidean', n_jobs=1, **kwds): 98 | """ 99 | Compute the Silhouette Coefficient for each sample. The Silhoeutte Coefficient 100 | is a measure of how well samples are clustered with samples that are similar to themselves. 101 | Clustering models with a high Silhouette Coefficient are said to be dense, 102 | where samples in the same cluster are similar to each other, and well separated, 103 | where samples in different clusters are not very similar to each other. 
104 | 105 | The Silhouette Coefficient is calculated using the mean intra-cluster 106 | distance (a) and the mean nearest-cluster distance (b) for each sample. 107 | 108 | The Silhouette Coefficient for a sample is $(b - a) / max(a, b)$. 109 | This function returns the Silhoeutte Coefficient for each sample. 110 | The best value is 1 and the worst value is -1. Values near 0 indicate 111 | overlapping clusters. 112 | 113 | Parameters 114 | ---------- 115 | X : array [n_samples_a, n_features] 116 | Feature array. 117 | 118 | labels : array, shape = [n_samples] 119 | label values for each sample 120 | 121 | metric : string, or callable 122 | The metric to use when calculating distance between instances in a 123 | feature array. If metric is a string, it must be one of the options 124 | allowed by metrics.pairwise.pairwise_distances. If X is the distance 125 | array itself, use "precomputed" as the metric. 126 | 127 | **kwds : optional keyword parameters 128 | Any further parameters are passed directly to the distance function. 129 | If using a scipy.spatial.distance metric, the parameters are still 130 | metric dependent. See the scipy docs for usage examples. 131 | 132 | Returns 133 | ------- 134 | silhouette : array, shape = [n_samples] 135 | Silhouette Coefficient for each samples. 136 | 137 | References 138 | ---------- 139 | Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the 140 | Interpretation and Validation of Cluster Analysis". Computational 141 | and Applied Mathematics 20: 53-65. doi:10.1016/0377-0427(87)90125-7. 142 | http://en.wikipedia.org/wiki/Silhouette_(clustering) 143 | """ 144 | A = _intra_cluster_distances_block(X, labels, metric, n_jobs=n_jobs, 145 | **kwds) 146 | B = _nearest_cluster_distance_block(X, labels, metric, n_jobs=n_jobs, 147 | **kwds) 148 | sil_samples = (B - A) / np.maximum(A, B) 149 | # nan values are for clusters of size 1, and should be 0 150 | return np.nan_to_num(sil_samples) 151 | 152 | def _intra_cluster_distances_block(X, labels, metric, n_jobs=1, **kwds): 153 | """ 154 | Calculate the mean intra-cluster distance for sample i. 155 | 156 | Parameters 157 | ---------- 158 | X : array [n_samples_a, n_features] 159 | Feature array. 160 | 161 | labels : array, shape = [n_samples] 162 | label values for each sample 163 | 164 | metric : string, or callable 165 | The metric to use when calculating distance between instances in a 166 | feature array. If metric is a string, it must be one of the options 167 | allowed by metrics.pairwise.pairwise_distances. If X is the distance 168 | array itself, use "precomputed" as the metric. 169 | 170 | **kwds : optional keyword parameters 171 | Any further parameters are passed directly to the distance function. 172 | If using a scipy.spatial.distance metric, the parameters are still 173 | metric dependent. See the scipy docs for usage examples. 174 | 175 | Returns 176 | ------- 177 | a : array [n_samples_a] 178 | Mean intra-cluster distance 179 | """ 180 | intra_dist = np.zeros(labels.size, dtype=float) 181 | values = Parallel(n_jobs=n_jobs)( 182 | delayed(_intra_cluster_distances_block_) 183 | (X[np.where(labels == label)[0]], metric, **kwds) 184 | for label in np.unique(labels)) 185 | for label, values_ in zip(np.unique(labels), values): 186 | intra_dist[np.where(labels == label)[0]] = values_ 187 | return intra_dist 188 | 189 | 190 | def _nearest_cluster_distance_block(X, labels, metric, n_jobs=1, **kwds): 191 | """Calculate the mean nearest-cluster distance for sample i. 
192 | 193 | Parameters 194 | ---------- 195 | X : array [n_samples_a, n_features] 196 | Feature array. 197 | 198 | labels : array, shape = [n_samples] 199 | label values for each sample 200 | 201 | metric : string, or callable 202 | The metric to use when calculating distance between instances in a 203 | feature array. If metric is a string, it must be one of the options 204 | allowed by metrics.pairwise.pairwise_distances. If X is the distance 205 | array itself, use "precomputed" as the metric. 206 | 207 | **kwds : optional keyword parameters 208 | Any further parameters are passed directly to the distance function. 209 | If using a scipy.spatial.distance metric, the parameters are still 210 | metric dependent. See the scipy docs for usage examples. 211 | 212 | X : array [n_samples_a, n_features] 213 | Feature array. 214 | 215 | Returns 216 | ------- 217 | b : float 218 | Mean nearest-cluster distance for sample i 219 | """ 220 | inter_dist = np.empty(labels.size, dtype=float) 221 | inter_dist.fill(np.inf) 222 | # Compute cluster distance between pairs of clusters 223 | unique_labels = np.unique(labels) 224 | 225 | values = Parallel(n_jobs=n_jobs)( 226 | delayed(_nearest_cluster_distance_block_)( 227 | X[np.where(labels == label_a)[0]], 228 | X[np.where(labels == label_b)[0]], 229 | metric, **kwds) 230 | for label_a, label_b in combinations(unique_labels, 2)) 231 | 232 | for (label_a, label_b), (values_a, values_b) in \ 233 | zip(combinations(unique_labels, 2), values): 234 | 235 | indices_a = np.where(labels == label_a)[0] 236 | inter_dist[indices_a] = np.minimum(values_a, inter_dist[indices_a]) 237 | del indices_a 238 | indices_b = np.where(labels == label_b)[0] 239 | inter_dist[indices_b] = np.minimum(values_b, inter_dist[indices_b]) 240 | del indices_b 241 | return inter_dist 242 | 243 | def _intra_cluster_distances_block_(subX, metric, **kwds): 244 | distances = pairwise_distances(subX, metric=metric, **kwds) 245 | return distances.sum(axis=1) / (distances.shape[0] - 1) 246 | 247 | def _nearest_cluster_distance_block_(subX_a, subX_b, metric, **kwds): 248 | dist = pairwise_distances(subX_a, subX_b, metric=metric, **kwds) 249 | dist_a = dist.mean(axis=1) 250 | dist_b = dist.mean(axis=0) 251 | return dist_a, dist_b 252 | 253 | class Clusterer(BaseEstimator, ClusterMixin): 254 | """Performance clustering 255 | 256 | Parameters 257 | ---------- 258 | k_range: tuple (pair) 259 | the minimum and the maximum $k$ to try when choosing the best value of $k$ 260 | (the one having the best silhouette score) 261 | 262 | border_threshold: float 263 | the threshold to use for selecting the borderline. 264 | It indicates the max silhouette for a borderline point. 265 | 266 | verbose: boolean 267 | verbosity mode. 268 | default: False 269 | 270 | random_state : int 271 | RandomState instance or None, optional, default: None 272 | If int, random_state is the seed used by the random number generator; 273 | If RandomState instance, random_state is the random number generator; 274 | If None, the random number generator is the RandomState instance used 275 | by `np.random`. 
276 | 277 | sample_size : int 278 | the number of samples (rows) to be used when computing the silhouette score (the function silhouette_score is computationally expensive and may raise a MemoryError when the number of samples is too high) 279 | default: None 280 | 281 | max_rows : int 282 | the maximum number of samples (rows) to be considered for the clustering task (the function silhouette_samples is computationally expensive and may raise a MemoryError when the input matrix has too many rows) 283 | default: 40000 284 | 285 | 286 | Attributes 287 | ---------- 288 | cluster_centers_ : array, [n_clusters, n_features] 289 | Coordinates of cluster centers 290 | n_clusters_: int 291 | number of clusters found by the algorithm 292 | labels_ : 293 | Labels of each point 294 | k_range: tuple 295 | minimum and maximum number of clusters to try 296 | verbose: boolean 297 | whether or not to show details of the execution 298 | random_state: int 299 | RandomState instance or None, optional, default: None 300 | If int, random_state is the seed used by the random number generator; 301 | If RandomState instance, random_state is the random number generator; 302 | If None, the random number generator is the RandomState instance used 303 | by 'np.random'. 304 | sample_size: int or None 305 | kmeans: scikit-learn KMeans object 306 | """ 307 | 308 | def __init__(self, k_range=(2, 15), border_threshold=0.2, verbose=False, random_state=42, 309 | sample_size=None): 310 | self.k_range = k_range 311 | self.border_threshold = border_threshold 312 | self.verbose = verbose 313 | self.sample_size = sample_size 314 | # initialize attributes 315 | self.labels_ = [] 316 | self.random_state = random_state 317 | 318 | def _find_clusters(self, X, make_plot=True): 319 | if self.verbose: 320 | print ('FITTING kmeans...\n') 321 | print ('n_clust\t|silhouette') 322 | print ('---------------------') 323 | 324 | self.k2silhouettes_ = {} 325 | kmin, kmax = self.k_range 326 | range_n_clusters = range(kmin, kmax + 1) 327 | best_k, best_silhouette = 0, 0.0 328 | for k in range_n_clusters: 329 | 330 | # computation 331 | kmeans = MiniBatchKMeans(n_clusters=k, init='k-means++', max_iter=1000, n_init=1, 332 | random_state=self.random_state) 333 | kmeans.fit(X) 334 | cluster_labels = kmeans.labels_ 335 | 336 | silhouette = scalable_silhouette_score(X, cluster_labels, 337 | sample_size=self.sample_size, 338 | random_state=self.random_state) 339 | if self.verbose: 340 | print ('%s\t|%s' % (k, round(silhouette, 4))) 341 | 342 | if silhouette >= best_silhouette: 343 | best_silhouette = silhouette 344 | best_k = k 345 | #best_silhouette_samples = ss 346 | 347 | self.k2silhouettes_[k] = silhouette 348 | 349 | kmeans = MiniBatchKMeans(n_clusters=best_k, init='k-means++', max_iter=10000, n_init=1, 350 | random_state=self.random_state) 351 | kmeans.fit(X) 352 | self.kmeans_ = kmeans 353 | self.n_clusters_ = best_k 354 | self.cluster_centers_ = kmeans.cluster_centers_ 355 | self.labels_ = kmeans.labels_ 356 | if self.verbose: 357 | print ('Best: n_clust=%s (silhouette=%s)\n' % (best_k, round(best_silhouette, 4))) 358 | 359 | def _cluster_borderline(self, X): 360 | """ 361 | Assign clusters to borderline points, according to the border_threshold 362 | specified in the constructor 363 | """ 364 | if self.verbose: 365 | print ('FINDING hybrid centers of performance...\n') 366 | 367 | self.labels_ = [[] for i in range(len(X))] 368 | 369 | ss = scalable_silhouette_samples(X, self.kmeans_.labels_) 370 | for i, (row, silhouette, 
cluster_label) in enumerate(zip(X, ss, self.kmeans_.labels_)): 371 | if silhouette >= self.border_threshold: 372 | self.labels_[i].append(cluster_label) 373 | else: 374 | intra_silhouette = euclidean(row, self.kmeans_.cluster_centers_[cluster_label]) 375 | for label in set(self.kmeans_.labels_): 376 | inter_silhouette = euclidean(row, self.kmeans_.cluster_centers_[label]) 377 | silhouette = (inter_silhouette - intra_silhouette) / max(inter_silhouette, intra_silhouette) 378 | if silhouette <= self.border_threshold: 379 | self.labels_[i].append(label) 380 | 381 | return ss 382 | 383 | def _generate_matrix(self, ss, kind = 'multi'): 384 | """ 385 | Generate a matrix for optimizing the predict function 386 | """ 387 | matrix = {} 388 | X = [] 389 | 390 | for i in range(0, 101): 391 | for j in range(0, 101): 392 | X.append([i, j]) 393 | if kind == 'multi': 394 | multi_labels = self._predict_with_silhouette(X, ss) 395 | for row, labels in zip(X, multi_labels): 396 | matrix[tuple(row)] = labels 397 | else: 398 | for row, labels in zip(X, self.kmeans_.predict(X)): 399 | matrix[tuple(row)] = labels 400 | self._matrix = matrix 401 | 402 | def get_clusters_matrix(self, kind = 'single'): 403 | roles_matrix = {} 404 | m = self._matrix.items() 405 | # if kind != 'single': 406 | # m= self._matrix.items() 407 | # 408 | # else: 409 | # m = self._matrix_single.items() 410 | 411 | for k,v in m: 412 | x,y = int(k[0]),int(k[1]) 413 | if x not in roles_matrix: 414 | roles_matrix[x] = {} 415 | roles_matrix[x][y] = "-".join(map(str,v)) if kind != 'single' else int(v) #casting with python int, otherwise it's not json serializable 416 | return roles_matrix 417 | 418 | def fit(self, player_ids, match_ids, dataframe, y=None, kind='single', filename='clusters'): 419 | """ 420 | Compute performance clustering. 421 | 422 | Parameters 423 | ---------- 424 | player_ids, match_ids : array-like, shape=(n_samples,) identifiers of the player and of the match associated with each performance (one per row of dataframe) 425 | dataframe : pandas DataFrame, shape=(n_samples, n_features) training instances to cluster (e.g., the avg_x, avg_y position of each player in each match) 
426 | 427 | kind: str 428 | single: single cluster 429 | multi: multi cluster 430 | 431 | y: ignored 432 | """ 433 | self.kind_ = kind 434 | X = dataframe.values 435 | 436 | self._find_clusters(X) # find the clusters with kmeans 437 | if kind != 'single': 438 | 439 | 440 | silhouette_scores = self._cluster_borderline(X) # assign multiclusters to borderline performances 441 | self._generate_matrix(silhouette_scores) # generate the matrix for optimizing the predict function 442 | else: 443 | self._generate_matrix(None, kind = 'single') #no silhouette scores if kind single 444 | if self.verbose: 445 | print ("DONE.") 446 | 447 | 448 | 449 | 450 | return self 451 | 452 | def _predict_with_silhouette(self, X, ss): 453 | cluster_labels, threshold = self.kmeans_.predict(X), self.border_threshold 454 | multicluster_labels = [[] for _ in cluster_labels] 455 | if len(set(cluster_labels)) == 1: 456 | return [[cluster_label] for cluster_label in cluster_labels] 457 | for i, (row, silhouette, cluster_label) in enumerate(zip(X, ss, cluster_labels)): 458 | if silhouette >= threshold: 459 | multicluster_labels[i].append(cluster_label) 460 | else: 461 | intra_silhouette = euclidean(row, self.cluster_centers_[cluster_label]) 462 | for label in set(cluster_labels): 463 | inter_silhouette = euclidean(row, self.cluster_centers_[label]) 464 | silhouette = (inter_silhouette - intra_silhouette) / max(inter_silhouette, intra_silhouette) 465 | if silhouette <= threshold: 466 | multicluster_labels[i].append(label) 467 | 468 | return np.array(multicluster_labels) 469 | 470 | def predict(self, X, y=None): 471 | """ 472 | Predict the closest cluster each sample in X belongs to. 473 | In the vector quantization literature, `cluster_centers_` is called 474 | the code book and each value returned by `predict` is the index of 475 | the closest code in the code book. 476 | 477 | Parameters 478 | ---------- 479 | X : {array-like, sparse matrix}, shape = [n_samples, n_features] 480 | New data to predict. 481 | 482 | Returns 483 | ------- 484 | multi_labels : array, shape [n_samples,] 485 | Index of the cluster each sample belongs to. 
486 | """ 487 | if self.kind_ == 'single': 488 | return self.kmeans_predict(X) 489 | else: 490 | multi_labels = [] 491 | for row in X: 492 | x, y = tuple(row) 493 | labels = self._matrix[(int(x), int(y))] 494 | multi_labels.append(labels) 495 | return multi_labels 496 | -------------------------------------------------------------------------------- /playerank/models/Rater.py: -------------------------------------------------------------------------------- 1 | # /usr/local/bin/python 2 | from collections import defaultdict, OrderedDict, Counter 3 | import numpy as np 4 | 5 | from sklearn.preprocessing import MinMaxScaler 6 | 7 | class Rater(): 8 | """Performance rating 9 | 10 | Parameters 11 | ---------- 12 | alpha_goal: float 13 | importance of the goal in the evaluation of performance, in the range [0, 1] 14 | default=0.0 15 | 16 | Attributes 17 | ---------- 18 | ratings_: numpy array 19 | the ratings of the performances 20 | """ 21 | def __init__(self, alpha_goal=0.0): 22 | self.alpha_goal = alpha_goal 23 | self.ratings_ = [] 24 | 25 | def get_rating(self, weighted_sum, goals): 26 | return weighted_sum * (1 - self.alpha_goal) + self.alpha_goal * goals 27 | 28 | def predict(self, dataframe, goal_feature, score_feature, filename='ratings'): 29 | """ 30 | Compute the rating of each performance in X 31 | 32 | Parameters 33 | ---------- 34 | dataframe: dataframe of playerank scores 35 | goal_feature: column name for goal scored dataframe column 36 | score_feature: column name for playerank score dataframe column 37 | 38 | 39 | Returns 40 | ------- 41 | ratings_: numpy array 42 | """ 43 | feature_names = dataframe.columns 44 | X = dataframe.values 45 | 46 | for i, row in enumerate(X): 47 | 48 | goal_index = feature_names.get_loc(goal_feature) 49 | pr_index = feature_names.get_loc(score_feature) 50 | 51 | rating = self.get_rating(float(row[pr_index]), float(row[goal_index]),) 52 | self.ratings_.append(rating) 53 | self.ratings_ = MinMaxScaler().fit_transform(np.array(self.ratings_).reshape(-1, 1))[:, 0] 54 | 55 | 56 | 57 | return self.ratings_ 58 | -------------------------------------------------------------------------------- /playerank/models/Weighter.py: -------------------------------------------------------------------------------- 1 | # /usr/local/bin/python 2 | from sklearn.base import BaseEstimator 3 | from sklearn.svm import LinearSVC 4 | from sklearn.model_selection import cross_val_score 5 | 6 | from sklearn.preprocessing import StandardScaler, LabelEncoder 7 | from sklearn.utils import check_random_state 8 | from sklearn.preprocessing import MinMaxScaler 9 | from sklearn.feature_selection import VarianceThreshold 10 | import json 11 | import pandas as pd 12 | import numpy as np 13 | 14 | class Weighter(BaseEstimator): 15 | """Automatic weighting of performance features 16 | 17 | Parameters 18 | ---------- 19 | label_type: str 20 | the label type associated to the game outcome. 21 | options: w-dl (victory vs draw or defeat), wd-l (victory or draw vs defeat), 22 | w-d-l (victory, draw, defeat) 23 | random_state : int 24 | RandomState instance or None, optional, default: None 25 | If int, random_state is the seed used by the random number generator; 26 | If RandomState instance, random_state is the random number generator; 27 | If None, the random number generator is the RandomState instance used 28 | by `np.random`. 
29 | 30 | Attributes 31 | ---------- 32 | feature_names_ : array, [n_features] 33 | names of the features 34 | label_type_: str 35 | the label type associated to the game outcome. 36 | options: w-dl (victory vs draw or defeat), wd-l (victory or draw vs defeat), 37 | w-d-l (victory, draw, defeat) 38 | clf_: LinearSVC object 39 | the object of the trained classifier 40 | weights_ : array, [n_features] 41 | weights of the features computed by the classifier 42 | random_state_: int 43 | RandomState instance or None, optional, default: None 44 | If int, random_state is the seed used by the random number generator; 45 | If RandomState instance, random_state is the random number generator; 46 | If None, the random number generator is the RandomState instance used 47 | by 'np.random'. 48 | """ 49 | def __init__(self, label_type='w-dl', random_state=42): 50 | self.label_type_ = label_type 51 | self.random_state_ = random_state 52 | 53 | def fit(self, dataframe, target, scaled=False, var_threshold = 0.001 , filename='weights.json'): 54 | """ 55 | Compute weights of features. 56 | 57 | Parameters 58 | ---------- 59 | dataframe : pandas DataFrame 60 | a dataframe containing the feature values and the target values 61 | 62 | target: str 63 | a string indicating the name of the target variable in the dataframe 64 | 65 | scaled: boolean 66 | True if X must be normalized, False otherwise 67 | (optional) 68 | 69 | filename: str 70 | the name of the files to be saved (the json file containing the feature weights, 71 | ) 72 | default: "weights" 73 | """ 74 | ##feature selection by variance, to delete outlier features 75 | feature_names = list(dataframe.columns) 76 | # normalize the data and then eliminate the variables with zero variance 77 | sel = VarianceThreshold(var_threshold) 78 | X = sel.fit_transform(dataframe) 79 | selected_feature_names = [feature_names[i] for i, var in enumerate(list(sel.variances_)) if var > var_threshold] 80 | print ("[Weighter] filtered features:", [(feature_names[i],var) for i, var in enumerate(list(sel.variances_)) if var <= var_threshold]) 81 | dataframe = pd.DataFrame(X, columns=selected_feature_names) 82 | if self.label_type_ == 'w-dl': 83 | y = dataframe[target].apply(lambda x: 1 if x > 0 else -1) 84 | elif self.label_type_ == 'wd-l': 85 | y = dataframe[target].apply(lambda x: 1 if x >= 0 else -1 ) 86 | else: 87 | y = dataframe[target].apply(lambda x: 1 if x > 0 else 0 if x==0 else 2) 88 | X = dataframe.loc[:, dataframe.columns != target].values 89 | y = y.values 90 | 91 | if scaled: 92 | X = StandardScaler().fit_transform(X) 93 | 94 | self.feature_names_ = dataframe.loc[:, dataframe.columns != target].columns 95 | self.clf_ = LinearSVC(fit_intercept=True, dual = False, max_iter = 50000,random_state=self.random_state_) 96 | 97 | #f1_score = np.mean(cross_val_score(self.clf_, X, y, cv=2, scoring='f1_weighted')) 98 | #self.f1_score_ = f1_score 99 | 100 | self.clf_.fit(X, y) 101 | 102 | outcome = 0 103 | if self.label_type_ == 'w-d-l': 104 | outcome = 1 105 | 106 | importances = self.clf_.coef_[outcome] 107 | 108 | sum_importances = sum(np.abs(importances)) 109 | self.weights_ = importances / sum_importances 110 | 111 | ## Save the computed weights into a json file 112 | features_and_weights = {} 113 | for feature, weight in sorted(zip(self.feature_names_, self.weights_),key = lambda x: x[1]): 114 | features_and_weights[feature]= weight 115 | json.dump(features_and_weights, open('%s' %filename, 'w')) 116 | ## Save the object 117 | #pkl.dump(self, open('%s.pkl' %filename, 
'wb')) 118 | 119 | def get_weights(self): 120 | return self.weights_ 121 | 122 | def get_feature_names(self): 123 | return self.feature_names_ 124 | -------------------------------------------------------------------------------- /playerank/models/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mesosbrodleto/playerank/79dd6464be98bbc35f48f99b4a3b626ea43e9a7e/playerank/models/__init__.py -------------------------------------------------------------------------------- /playerank/setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | 3 | setup( 4 | name='playerank', 5 | version='1.0', 6 | packages=['playerank',], 7 | install_requires=[ 8 | 'pandas==0.23.4', 9 | 'scipy==0.17.1', 10 | 'numpy==1.11.0', 11 | 'scikit_learn==0.21.3', 12 | 'joblib' 13 | ], 14 | license='Creative Commons Attribution-Noncommercial-Share Alike license', 15 | long_description=open('README.md').read(), 16 | ) 17 | -------------------------------------------------------------------------------- /playerank/utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mesosbrodleto/playerank/79dd6464be98bbc35f48f99b4a3b626ea43e9a7e/playerank/utils/__init__.py -------------------------------------------------------------------------------- /playerank/utils/compute_features_weight.py: -------------------------------------------------------------------------------- 1 | #import .models 2 | from ..models import Weighter 3 | 4 | from ..features import qualityFeatures, relativeAggregation,goalScoredFeatures 5 | 6 | def compute_feature_weights(output_path): 7 | 8 | qualityFeat = qualityFeatures.qualityFeatures() 9 | quality= qualityFeat.createFeature(events_path = 'playerank/data/events', 10 | players_file='playerank/data/players.json' ,entity = 'team') 11 | #computing goals scored for each team in each match 12 | gs=goalScoredFeatures.goalScoredFeatures() 13 | goals=gs.createFeature('playerank/data/matches') 14 | #merging of quality features and goals scored 15 | aggregation = relativeAggregation.relativeAggregation() 16 | aggregation.set_features([quality,goals]) 17 | df = aggregation.aggregate(to_dataframe = True) 18 | 19 | weighter = Weighter.Weighter(label_type='wd-l') 20 | weighter.fit(df, 'goal-scored', filename=output_path) 21 | print ("features weights stored in %s"%output_path) 22 | 23 | 24 | compute_feature_weights('playerank/conf/features_weights.json') 25 | -------------------------------------------------------------------------------- /playerank/utils/compute_features_weight.py~: -------------------------------------------------------------------------------- 1 | #import .models 2 | from ..models import Weighter 3 | 4 | from ..features import qualityFeatures, relativeAggregation,goalScoredFeatures 5 | import sys 6 | #computing all quality features (passes accurate, passes failed, shots, etc.) 
7 | 8 | def compute_feature_weights(output_path): 9 | 10 | qualityFeat = qualityFeatures.qualityFeatures() 11 | quality= qualityFeat.createFeature('playerank/data/events.json',entity = 'team']) 12 | #computing goals scored for each team in each match 13 | gs=goalScoredFeatures.goalScoredFeatures() 14 | goals=gs.createFeature('playerank/data/matches.json') 15 | #merging of quality features and goals scored 16 | aggregation = relativeAggregation.relativeAggregation() 17 | aggregation.set_features([quality,goals]) 18 | df = aggregation.aggregate(to_dataframe = True) 19 | 20 | weighter = Weighter.Weighter(label_type='w-dl') 21 | weighter.fit(df, 'goal-scored', filename=output_path) 22 | print ("features weights stored in %s"%output_path) 23 | 24 | 25 | compute_feature_weights('playerank/conf/features_weigths.json') 26 | -------------------------------------------------------------------------------- /playerank/utils/compute_playerank.py: -------------------------------------------------------------------------------- 1 | from ..models import Weighter 2 | 3 | from ..features import centerOfPerformanceFeature,qualityFeatures,playerankFeatures, plainAggregation, matchPlayedFeatures,roleFeatures 4 | import sys,json 5 | 6 | weigths_file ='playerank/conf/features_weights.json' 7 | 8 | qualityFeat = qualityFeatures.qualityFeatures() 9 | quality= qualityFeat.createFeature(events_path = 'playerank/data/events', 10 | players_file='playerank/data/players.json' ,entity = 'player') 11 | 12 | 13 | prFeat = playerankFeatures.playerankFeatures() 14 | prFeat.set_features([quality]) 15 | pr= prFeat.createFeature(weigths_file) 16 | 17 | 18 | matchPlayedFeat = matchPlayedFeatures.matchPlayedFeatures() 19 | matchplayed = matchPlayedFeat.createFeature(matches_path = 'playerank/data/matches', 20 | players_file='playerank/data/players.json' ) 21 | 22 | center_performance = centerOfPerformanceFeature.centerOfPerformanceFeature() 23 | 24 | center_performance = center_performance.createFeature(events_path = 'playerank/data/events', 25 | players_file = 'playerank/data/players.json' ) 26 | 27 | 28 | roleFeat = roleFeatures.roleFeatures() 29 | roleFeat.set_features([center_performance]) 30 | roles= roleFeat.createFeature(matrix_role_file = 'playerank/conf/role_matrix.json') 31 | 32 | 33 | aggregation = plainAggregation.plainAggregation() 34 | 35 | aggregation.set_features([matchplayed,pr,roles]) 36 | 37 | df = aggregation.aggregate(to_dataframe = True) 38 | 39 | print (df.head()) 40 | -------------------------------------------------------------------------------- /playerank/utils/compute_playerank.py~: -------------------------------------------------------------------------------- 1 | from ..models import Weighter 2 | 3 | from ..features import playerankFeatures, plainAggregation, matchPlayedFeatures,roleFeatures 4 | import sys,json 5 | 6 | weigths_file = sys.argv[1] 7 | prFeat = playerankFeatures.playerankFeatures() 8 | 9 | pr= prFeat.createFeature(weigths_file,param={'competitionId': 524},limit = 5) 10 | 11 | print ("PlayeRank Score Computed. 
\n %s performance processed"%pr.count()) 12 | mFeat= matchPlayedFeatures.matchPlayedFeatures() 13 | 14 | mins= mFeat.createFeature(param={'competitionId': 524}) 15 | 16 | matrix_role = json.load(open('playerlib/conf/role_matrix.json')) 17 | roleFeat = roleFeatures.roleFeatures() 18 | 19 | roles= roleFeat.createFeature(matrix_role, param={'competitionId': 524}) 20 | 21 | aggregation = plainAggregation.plainAggregation() 22 | 23 | aggregation.set_features([mins,pr]) 24 | 25 | df = aggregation.aggregate(to_dataframe = True, stored_collection_name = 'playerank_scores') 26 | 27 | df.to_csv('playerank.csv') 28 | -------------------------------------------------------------------------------- /playerank/utils/compute_roles.py: -------------------------------------------------------------------------------- 1 | #import .models 2 | from ..models import Clusterer 3 | 4 | from ..features import centerOfPerformanceFeature, plainAggregation 5 | import sys,json 6 | #computing all quality features (passes accurate, passes failed, shots, etc.) 7 | 8 | def compute_roleMatrix(output_path): 9 | #getting average position for each player in each match 10 | centerfeat = centerOfPerformanceFeature.centerOfPerformanceFeature() 11 | centerfeat = centerfeat.createFeature(events_path = 'playerank/data/events', 12 | players_file='playerank/data/players.json') 13 | 14 | #plain aggregation to get a dataframe 15 | aggregation = plainAggregation.plainAggregation() 16 | aggregation.set_features([centerfeat]) 17 | df = aggregation.aggregate(to_dataframe = True ) 18 | 19 | #use clustering object to get the best fit 20 | clusterer = Clusterer.Clusterer(verbose=True, k_range=(8, 9)) 21 | clusterer.fit(df.entity, df.match, df[['avg_x', 'avg_y']], kind='multi') 22 | 23 | matrix_role = clusterer.get_clusters_matrix(kind = 'multi') 24 | 25 | matrix_role = matrix_role 26 | json.dump(matrix_role,open(output_path,'w')) 27 | 28 | 29 | compute_roleMatrix('playerank/conf/role_matrix.json') 30 | -------------------------------------------------------------------------------- /playerank_schema_tist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mesosbrodleto/playerank/79dd6464be98bbc35f48f99b4a3b626ea43e9a7e/playerank_schema_tist.png --------------------------------------------------------------------------------