├── .idea ├── .gitignore ├── deployment.xml ├── inspectionProfiles │ └── profiles_settings.xml ├── misc.xml ├── modules.xml ├── scraping_repo.iml └── vcs.xml ├── Project-ERD.pdf ├── README.md ├── api_data.py ├── chromedriver ├── chromedriver.exe ├── config.py ├── db_control.py ├── html_parser.py ├── main.py ├── requirements.txt └── sofa score demo day.pptx /.idea/.gitignore: -------------------------------------------------------------------------------- 1 | # Default ignored files 2 | /workspace.xml -------------------------------------------------------------------------------- /Project-ERD.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/Project-ERD.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # README - SofaScore Scraper 2 | 3 | ## Overview 4 | 5 | This Data Engineering project mines data from the website [SofaScore](https://www.sofascore.com/), scraping information about football players and managers from the main European leagues. 6 | 7 | All the data is then safely stored in a relational database, ready to be queried by the user. 8 | 9 | ## The Data 10 | 11 | For the selected football leagues, all the team names are collected. 12 | 13 | For every team, each player is scraped, and the following information is stored in the database: 14 | 15 | ``` 16 | 1. Name 17 | 2. Nationality 18 | 3. Date of Birth 19 | 4. Height 20 | 5. Preferred Foot 21 | 6. Position on the Pitch 22 | 7. Shirt Number 23 | ``` 24 | 25 | Information about managers is parsed as well: 26 | 27 | ``` 28 | 1. Name 29 | 2. Date of Birth 30 | 3. Nationality 31 | 4. Preferred Formation 32 | 5. Average Points per Game 33 | 6. Games Won 34 | 7. Games Drawn 35 | 8. Games Lost 36 | ``` 37 | 38 | Moreover, extra data about the various teams is obtained from an external [API](https://thesportsdb.com/api.php) and stored in the DB: 39 | 40 | ``` 41 | 1. Name 42 | 2. Short Name 43 | 3. Alternative Name 44 | 4. Foundation Year 45 | 5. Stadium Name 46 | 6. Stadium Picture URL 47 | 7. Stadium Description 48 | 8. Stadium Location 49 | 9. Stadium Capacity 50 | 10. Team Website 51 | 11. Team Facebook 52 | 12. Team Twitter 53 | 13. Team Instagram 54 | 14. 
Team Description 55 | ``` 56 | 57 | If the data related to a certain league hasn't been scraped before, it is simply added to the database. 58 | 59 | If the data from the selected league is already present in the database, it is overwritten with the newly scraped data. 60 | 61 | ## Usage 62 | 63 | As a first step, set the correct username and password in "config.py". 64 | Then just run the following commands: 65 | 66 | ``` 67 | pip install -r requirements.txt 68 | ``` 69 | 70 | ``` 71 | python main.py -s -p -l -b 72 | ``` 73 | 74 | CLI Arg | Action 75 | ------------ | ------------- 76 | -s | scrapes teams and players from Serie A 77 | -p | scrapes teams and players from Premier League 78 | -l | scrapes teams and players from La Liga 79 | -b | scrapes teams and players from Bundesliga 80 | 81 | The user can choose which league, or combination of leagues, to scrape and to create/update the database with. 82 | 83 | ## Created by: 84 | - `Daniel Saban` 85 | - `Sagi Elfassi` 86 | -------------------------------------------------------------------------------- /api_data.py: -------------------------------------------------------------------------------- 1 | import json 2 | import requests 3 | import config as cfg 4 | 5 | 6 | def get_info_from_api(team_name): 7 | """ 8 | fetching additional information about a team and enriching our database 9 | thanks to the free API from thesportsdb.com/api.php 10 | :param team_name: name of the team to fetch extra info for 11 | :return: a dictionary with some extra info about the team. 
12 | """ 13 | if "-" in team_name: 14 | team_name = team_name.replace("-", "+") 15 | if "brighton" in team_name: # some teams have different names than on SofaScore 16 | team_name = "brighton" 17 | if "leicester" in team_name: 18 | team_name = "leicester" 19 | if "norwich" in team_name: 20 | team_name = "norwich" 21 | if "mallorca" in team_name: 22 | team_name = "mallorca" 23 | if "parma" in team_name: 24 | team_name = "parma+calcio" 25 | if "bayern" in team_name: 26 | team_name = "bayern" 27 | if "koln" in team_name: 28 | team_name = "fc+koln" 29 | if "union+berlin" in team_name: 30 | team_name = "union+berlin" 31 | if "fsv+mainz" in team_name: 32 | team_name = "mainz" 33 | if "hoffenheim" in team_name: 34 | team_name = "hoffenheim" 35 | if "mgladbach" in team_name: 36 | team_name = "borussia+monchengladbach" 37 | if "schalke" in team_name: 38 | team_name = "schalke" 39 | if "leverkusen" in team_name: 40 | team_name = "leverkusen" 41 | if "paderborn" in team_name: 42 | team_name = "paderborn" 43 | print(team_name) 44 | response = requests.get(cfg.API_URL + team_name) 45 | team_data = json.loads(response.text) 46 | return team_data['teams'][0] 47 | 48 | -------------------------------------------------------------------------------- /chromedriver: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/chromedriver -------------------------------------------------------------------------------- /chromedriver.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/chromedriver.exe -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | DB_NAME = "sofa_score" 2 | PASSWD = 
"Sofa_Score_2020" 3 | USERNAME = "Daniel" 4 | HOST = "localhost" 5 | 6 | API_URL = r"https://www.thesportsdb.com/api/v1/json/1/searchteams.php?t=" 7 | 8 | POSSIBLE_FOOT = ['Right', 'Left', 'Both'] 9 | POSSIBLE_POSITION = ['G', 'D', 'M', 'F'] 10 | PLAYER_FIELDS = ['birth_date', 'height', 'prefd_foot', 'position', 11 | 'shirt_num', 'nationality'] 12 | # MANAGER_FIELDS = ['birth_date', 'nationality', 'pref_formation', 'avg_points_per_game', 13 | # 'games_won', 'games_drawn', 'games_lost'] 14 | 15 | 16 | TOP_LEAGUES_URLS = {"seriea": r"https://www.sofascore.com/tournament/football/italy/serie-a/23", 17 | "laliga": r"https://www.sofascore.com/tournament/football/spain/laliga/8", 18 | "premier": r"https://www.sofascore.com/tournament/football/england/premier-league/17", 19 | "bundes": r"https://www.sofascore.com/tournament/football/germany/bundesliga/35"} 20 | -------------------------------------------------------------------------------- /db_control.py: -------------------------------------------------------------------------------- 1 | import mysql.connector 2 | import config as cfg 3 | import unicodedata 4 | 5 | 6 | def connector(): 7 | """ 8 | creating a connector to an existing db 9 | :return: return the valid connector 10 | """ 11 | return mysql.connector.connect( 12 | host=cfg.HOST, 13 | user=cfg.USERNAME, 14 | password=cfg.PASSWD, 15 | database=cfg.DB_NAME, 16 | charset='utf8mb4', 17 | use_unicode=True 18 | ) 19 | 20 | 21 | def create(): 22 | """ 23 | creating a MySQL database 24 | """ 25 | my_db = mysql.connector.connect( 26 | host=cfg.HOST, 27 | user=cfg.USERNAME, 28 | password=cfg.PASSWD, 29 | ) 30 | cur = my_db.cursor() 31 | cur.execute('''CREATE DATABASE IF NOT EXISTS ''' + cfg.DB_NAME) 32 | 33 | my_db = connector() 34 | cur = my_db.cursor() 35 | cur.execute('''CREATE TABLE IF NOT EXISTS leagues ( 36 | league_id INT PRIMARY KEY AUTO_INCREMENT, 37 | league_name VARCHAR(255) NOT NULL UNIQUE, 38 | number_of_teams INT NOT NULL)''') 39 | cur.execute('''CREATE 
TABLE IF NOT EXISTS teams ( 40 | team_id INT PRIMARY KEY AUTO_INCREMENT, 41 | league_id INT, 42 | team_name VARCHAR(255) NOT NULL UNIQUE, 43 | number_of_players INT, 44 | FOREIGN KEY (league_id) REFERENCES leagues(league_id))''') 45 | cur.execute('''CREATE TABLE IF NOT EXISTS managers ( 46 | manager_id INT PRIMARY KEY AUTO_INCREMENT, 47 | team_id INT, 48 | manager_name VARCHAR(255) NOT NULL, 49 | birth_date DATETIME, 50 | nationality VARCHAR(255), 51 | pref_formation VARCHAR(255), 52 | avg_points_per_game REAL, 53 | games_won INT, 54 | games_drawn INT, 55 | games_lost INT, 56 | FOREIGN KEY (team_id) REFERENCES teams(team_id))''') 57 | cur.execute('''CREATE TABLE IF NOT EXISTS players ( 58 | player_id INT PRIMARY KEY AUTO_INCREMENT, 59 | team_id INT, 60 | player_name VARCHAR(255) NOT NULL, 61 | nationality VARCHAR(255), 62 | birth_date DATETIME, 63 | height_cm INT, 64 | prefd_foot VARCHAR(255), 65 | position VARCHAR(255), 66 | shirt_num INT, 67 | FOREIGN KEY (team_id) REFERENCES teams(team_id))''') 68 | cur.execute('''CREATE TABLE IF NOT EXISTS teams_extras ( 69 | id_Team INT PRIMARY KEY, 70 | team_id INT, 71 | team_name VARCHAR(45) NOT NULL, 72 | team_name_short VARCHAR(45), 73 | alternate_team_name VARCHAR(255), 74 | formed_year INT, 75 | stadium_name VARCHAR(45), 76 | stadium_pic_url VARCHAR(255), 77 | stadium_description LONGTEXT, 78 | stadium_location VARCHAR(255), 79 | stadium_capacity INT, 80 | team_website VARCHAR(255), 81 | team_facebook VARCHAR(255), 82 | team_twitter VARCHAR(255), 83 | team_instagram VARCHAR(255), 84 | team_description LONGTEXT, 85 | FOREIGN KEY (team_id) REFERENCES teams(team_id))''') 86 | cur.close() 87 | 88 | 89 | def write_league(league_info: list): 90 | """ 91 | writing the league info to the database 92 | :param league_info: a list with the league name and the number of teams in this league 93 | """ 94 | my_db = connector() 95 | cur = my_db.cursor() 96 | cur.execute("INSERT INTO leagues (league_name, number_of_teams) VALUES (%s, 
%s)" 97 | "ON DUPLICATE KEY UPDATE league_id=league_id", 98 | (league_info[0], league_info[1])) 99 | my_db.commit() 100 | cur.close() 101 | 102 | 103 | def write_teams(teams_info: list, lg_name): 104 | """ 105 | writing a team's info to the database 106 | :param teams_info: a list with the team name and the number of players in the team 107 | :param lg_name: the league which the teams belong to - foreign key 108 | """ 109 | my_db = connector() 110 | cur = my_db.cursor() 111 | cur.execute("INSERT INTO teams (team_name, number_of_players, league_id) VALUES " 112 | "(%s, %s, (SELECT league_id FROM leagues WHERE league_name='"+lg_name+"'))" 113 | """ON DUPLICATE KEY UPDATE team_id=team_id""", 114 | (teams_info[0], teams_info[1])) 115 | my_db.commit() 116 | cur.close() 117 | 118 | 119 | def write_players(players_info: list, team_n): 120 | """ 121 | writing a list of players to the database 122 | :param players_info: a list of dictionaries, each dict stands for a player with several data fields. 123 | :param team_n: the name of the team which the players belong to 124 | """ 125 | my_db = connector() 126 | cur = my_db.cursor() 127 | for player in players_info: 128 | cur.execute("INSERT INTO players (team_id, player_name, nationality, birth_date, height_cm, prefd_foot," 129 | "position, shirt_num)" 130 | "VALUES ((SELECT team_id FROM teams WHERE team_name='"+team_n+"' LIMIT 1),%s, %s, %s, %s, %s, %s, %s)" 131 | "ON DUPLICATE KEY UPDATE player_id=player_id", 132 | (player['name'], player['nationality'], player['birth_date'], player['height'], 133 | player['prefd_foot'], player['position'], player['shirt_num'])) 134 | my_db.commit() 135 | cur.close() 136 | 137 | 138 | def write_manager(mgr_info, team_n): 139 | """ 140 | writing a manager to the database 141 | :param mgr_info: a dictionary that represents a manager with several data fields. 
142 | :param team_n: the name of the team which the manager belongs to 143 | """ 144 | my_db = connector() 145 | cur = my_db.cursor() 146 | mgr_name = (unicodedata.normalize('NFD', mgr_info['name']).encode('ascii', 'ignore')) 147 | cur.execute("INSERT INTO managers (team_id, manager_name, birth_date, nationality, pref_formation," 148 | "avg_points_per_game, games_won, games_drawn, games_lost)" 149 | "VALUES ((SELECT team_id FROM teams WHERE team_name='"+team_n+"' LIMIT 1),%s, %s, %s, %s, %s, %s, %s, %s)" 150 | "ON DUPLICATE KEY UPDATE manager_id=manager_id", 151 | (mgr_name, mgr_info['birth_date'], mgr_info['nationality'], mgr_info['pref_formation'], 152 | mgr_info['avg_points_per_game'], mgr_info['games_won'], mgr_info['games_drawn'], mgr_info['games_lost'])) 153 | my_db.commit() 154 | cur.close() 155 | 156 | 157 | def write_team_extras(team_info, team_n): 158 | """ 159 | writing additional information about each team, obtained from an external API - thesportsdb.com/api.php 160 | :param team_info: a dictionary with the desired external data 161 | :param team_n: the name of the team the external data belongs to 162 | """ 163 | my_db = connector() 164 | cur = my_db.cursor() 165 | team_des = (unicodedata.normalize('NFD', team_info['strDescriptionEN']).encode('ascii', 'ignore')) 166 | if team_info['strStadiumDescription'] is not None: 167 | stad_des = (unicodedata.normalize('NFD', team_info['strStadiumDescription']).encode('ascii', 'ignore')) 168 | else: 169 | stad_des = None 170 | cur.execute("INSERT INTO teams_extras (team_id, id_Team, team_name, team_name_short, alternate_team_name," 171 | "formed_year, stadium_name, stadium_pic_url, stadium_description, stadium_location, stadium_capacity," 172 | "team_website, team_facebook, team_twitter, team_instagram, team_description)" 173 | "VALUES ((SELECT team_id FROM teams WHERE team_name='"+team_n+"' LIMIT 1), %s, %s, %s, %s, %s, %s, %s," 174 | "%s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE id_Team = id_Team", 175 | 
(team_info['idTeam'], team_info['strTeam'], team_info['strTeamShort'], team_info['strAlternate'], 176 | team_info['intFormedYear'], team_info['strStadium'], team_info['strStadiumThumb'], 177 | stad_des, team_info['strStadiumLocation'], team_info['intStadiumCapacity'], 178 | team_info['strWebsite'], team_info['strFacebook'], team_info['strTwitter'], team_info['strInstagram'], 179 | team_des)) 180 | my_db.commit() 181 | cur.close() 182 | 183 | 184 | def check_and_delete(league_name): 185 | """ 186 | this function checks whether the league already exists in the database and, if so, deletes it along with its teams, players, managers and team extras. 187 | :param league_name: the league name provided by the user 188 | """ 189 | my_db = connector() 190 | cur = my_db.cursor() 191 | cur.execute("SELECT league_id FROM leagues WHERE league_name ='"+league_name+"'") 192 | league_id = cur.fetchall() 193 | if len(league_id) != 0: 194 | cur.execute("SELECT team_id FROM teams WHERE league_id="+str(league_id[0][0])) 195 | team_ids = cur.fetchall() 196 | if len(team_ids) > 0: 197 | for team_id in team_ids: 198 | cur.execute("DELETE FROM players WHERE team_id = "+str(team_id[0])) 199 | cur.execute("DELETE FROM managers WHERE team_id = "+str(team_id[0])) 200 | cur.execute("DELETE FROM teams_extras WHERE team_id = "+str(team_id[0])) 201 | my_db.commit() 202 | cur.execute("DELETE FROM teams WHERE league_id = "+str(league_id[0][0])) 203 | my_db.commit() 204 | cur.execute("DELETE FROM leagues WHERE league_id = "+str(league_id[0][0])) 205 | my_db.commit() 206 | -------------------------------------------------------------------------------- /html_parser.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from selenium import webdriver 3 | from bs4 import BeautifulSoup 4 | from dateutil.parser import parse 5 | from selenium.webdriver.chrome.options import Options 6 | import config as cfg 7 | 8 | arrow_manipu = lambda x: x.replace("<", ">").split(">") 9 | 10 | 11 | def get_driver(): 12 | chrome_options 
= Options() 13 | chrome_options.add_argument("--headless") 14 | # working with selenium google driver as the data is not in the bs4 html 15 | return webdriver.Chrome(executable_path=r'./chromedriver', options=chrome_options) 16 | 17 | 18 | def extract_player_info(player_url): 19 | """ 20 | Starting from the url of a player's page on https://www.sofascore.com, the function extracts 21 | the most relevant info about the player, if available, and returns it in a dict. 22 | :param player_url: url of the player on the https://www.sofascore.com site. 23 | :return: dict with these keys: [name, nationality, birth_date, height, prefd_foot, position, shirt_num] 24 | """ 25 | player_dict = {} 26 | # beautiful_soup of the player 27 | player_html = BeautifulSoup(requests.get(player_url).text, 'html.parser') 28 | # html of the most interesting data 29 | player_panel_html = player_html.find_all("h2", class_="styles__DetailBoxTitle-sc-1ss54tr-5 enIhhc") 30 | details = arrow_manipu(str(player_panel_html)) 31 | player_fields_html = player_html.find_all("div", class_="styles__DetailBoxContent-sc-1ss54tr-6 iAORZR") 32 | fields_list = arrow_manipu(str(player_fields_html)) 33 | player_dict['name'] = player_url.split("/")[-2].replace('-', ' ').title() 34 | if "Nationality" in fields_list: 35 | raw_nationality = player_html.find_all("span", class_="u-pL8") 36 | player_dict['nationality'] = arrow_manipu(str(raw_nationality))[-3] 37 | for field in fields_list: 38 | try: 39 | b_day = parse(field, fuzzy=False) 40 | player_dict['birth_date'] = b_day 41 | except ValueError: 42 | continue 43 | 44 | for i in range(len(details)): 45 | is_a_detail = (r"h2 class=" in details[i] or r"span style" in details[i]) and details[i + 1] != '' 46 | if is_a_detail: 47 | if 'cm' in details[i + 1]: 48 | player_dict['height'] = int(details[i + 1].split()[0]) 49 | elif details[i + 1] in cfg.POSSIBLE_FOOT: 50 | player_dict['prefd_foot'] = details[i + 1] 51 | elif details[i + 1] in 
cfg.POSSIBLE_POSITION: 52 | player_dict['position'] = details[i + 1] 53 | elif "Shirt number" in fields_list: 54 | player_dict['shirt_num'] = int(details[i + 1]) 55 | 56 | for key in cfg.PLAYER_FIELDS: 57 | if key not in player_dict.keys(): 58 | player_dict[key] = None 59 | return player_dict 60 | 61 | 62 | def extract_players_urls(team_url): 63 | """ 64 | this function gets the url of a team and extracts the list of player urls out of it. 65 | then it sends each player url to extract_player_info to get all the info about the player 66 | :param team_url: a url to a team page as a string 67 | """ 68 | players_list = [] 69 | # using bs4 & requests to retrieve html as text 70 | driver = get_driver() 71 | driver.get(team_url) 72 | team_html = BeautifulSoup(driver.page_source, 'html.parser') 73 | # looking for the player info inside the page 74 | all_players_html = team_html.find_all("a", class_="styles__CardWrapper-sc-1dlv1k5-15 czXoLq") 75 | # manipulating the text to extract player links 76 | html_list = str(all_players_html).split() 77 | for line in html_list: 78 | if "href" in line: 79 | players_list.append(extract_player_info("https://www.sofascore.com" + line.split("\"")[1])) 80 | return players_list 81 | 82 | 83 | def extract_mgr_url(team_url): 84 | mgr_link = "" 85 | driver = get_driver() 86 | driver.get(team_url) 87 | soup = BeautifulSoup(driver.page_source, 'html.parser') 88 | mgr_html = soup.find('div', class_="Content-sc-1o55eay-0 styles__ManagerContent-qlwzq-9 dxQrED") 89 | for line in str(mgr_html).split(): 90 | if "href" in line: 91 | mgr_link = "https://www.sofascore.com" + line.split("\"")[1] 92 | return mgr_link 93 | 94 | 95 | def extract_mgr_info(manager_url): 96 | """ 97 | this function gets a team url, resolves the manager's page via extract_mgr_url and extracts all desired data from the source html 98 | :param manager_url: the team url (despite the name, the manager link is resolved internally) 99 | :return: a dictionary that contains all the data about a manager. 
100 | """ 101 | mgr_dict = {} 102 | mgr_soup = BeautifulSoup(requests.get(extract_mgr_url(manager_url)).text, 'html.parser') 103 | mgr_dict['name'] = mgr_soup.find('div', class_="Content-sc-1o55eay-0 gYsVZh u-fs25 fw-medium").get_text() 104 | mgr_dict['nationality'] = mgr_soup.find('span', style="padding-left:6px").get_text() 105 | field_list = arrow_manipu(str(mgr_soup.find_all('div', class_="Content-sc-1o55eay-0 gYsVZh u-txt-2"))) 106 | for field in field_list: 107 | try: 108 | b_day = parse(field, fuzzy=False) 109 | mgr_dict['birth_date'] = b_day 110 | except ValueError: 111 | continue 112 | values_list = arrow_manipu(str(mgr_soup.find_all('div', class_="Content-sc-1o55eay-0 gYsVZh u-fs21 fw-medium"))) 113 | mgr_dict['pref_formation'] = values_list[12] 114 | mgr_dict['avg_points_per_game'] = values_list[20] 115 | mgr_dict['games_won'] = mgr_soup.find('div', class_="Section-sc-1a7xrsb-0 dPyKtM u-txt-green").get_text() 116 | mgr_dict['games_drawn'] = mgr_soup.find('div', class_="Section-sc-1a7xrsb-0 dPyKtM u-txt-2").get_text() 117 | mgr_dict['games_lost'] = mgr_soup.find('div', class_="Section-sc-1a7xrsb-0 dPyKtM u-txt-red").get_text() 118 | 119 | return mgr_dict 120 | 121 | 122 | def extract_teams_urls(league_url): 123 | """ 124 | this function extracts team urls out of the league home page 125 | :param league_url: league url as a string 126 | :return: all team urls, unique and alphabetically sorted, in a list 127 | """ 128 | team_list = [] 129 | driver = get_driver() 130 | driver.get(league_url) # mimicking human behaviour and opening league url 131 | team_html = BeautifulSoup(driver.page_source, 'html.parser') # getting the source with selenium, parsing with bs4 132 | driver.close() 133 | # looking for all team urls 134 | all_teams_html = team_html.find_all("div", class_="Content-sc-1o55eay-0 gYsVZh u-pL8 fw-medium") 135 | html_list = str(all_teams_html).split() # splitting the string by spaces 136 | for line in html_list: 137 | is_line_with_link = "href=\"/team" 
in line and not line.endswith("img") 138 | if is_line_with_link: 139 | # appending it to a list 140 | team_list.append("https://www.sofascore.com" + line.replace("\"", "").split("=")[-1].split(">")[0]) 141 | sorted_teams = sorted(list(set(team_list))) # sorting and removing duplicates 142 | return sorted_teams 143 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import db_control 2 | import config as cfg 3 | import html_parser as hp 4 | import api_data as api 5 | from tqdm import tqdm 6 | import argparse 7 | 8 | 9 | def parsing(): 10 | """ 11 | using argparse to simplify the cli for the user 12 | """ 13 | leagues = [] 14 | parser = argparse.ArgumentParser() 15 | parser.add_argument("-s", "--SerieA", help="download Italian league players", action="store_true") 16 | parser.add_argument("-p", "--PremierLeague", help="download English league players", action="store_true") 17 | parser.add_argument("-l", "--LaLiga", help="download Spanish league players", action="store_true") 18 | parser.add_argument("-b", "--BundesLiga", help="download German league players", action="store_true") 19 | args = parser.parse_args() 20 | 21 | if args.SerieA: 22 | leagues.append("seriea") 23 | if args.PremierLeague: 24 | leagues.append("premier") 25 | if args.LaLiga: 26 | leagues.append("laliga") 27 | if args.BundesLiga: 28 | leagues.append("bundes") 29 | return leagues 30 | 31 | 32 | def main(): 33 | """ 34 | this is the main calling function to extract player data out of https://www.sofascore.com 35 | """ 36 | # validating that the user changed the MySQL password and username to connect 37 | if cfg.PASSWD == "" or cfg.USERNAME == "": 38 | exit("Invalid username or password. Please read README.md!") 39 | 40 | leagues_to_download = parsing() # getting commands from the cli 41 | db_control.create() # will create the database and any tables that do not exist. 
42 | 43 | for league in leagues_to_download: # iterating league links 44 | league_name = cfg.TOP_LEAGUES_URLS[league].split("/")[-2] 45 | db_control.check_and_delete(league_name) 46 | teams = hp.extract_teams_urls(cfg.TOP_LEAGUES_URLS[league]) # extracting team urls from the league page 47 | print("\ngetting teams from " + league_name) # printing a "loading" message for the user in addition to tqdm 48 | db_control.write_league([league_name, len(teams)]) 49 | watch = tqdm(total=len(teams), position=0) 50 | for team_url in teams: # iterating all teams urls 51 | team_name = team_url.split('/')[-2] 52 | manager_info = hp.extract_mgr_info(team_url) 53 | players_list = hp.extract_players_urls(team_url) # extracting the info of every player in the team 54 | db_control.write_teams([team_name, len(players_list)], league_name) 55 | db_control.write_players(players_list, team_name) 56 | db_control.write_manager(manager_info, team_name) 57 | extra_team_info = api.get_info_from_api(team_name) # retrieving external data from the api 58 | db_control.write_team_extras(extra_team_info, team_name) # writing this data into the database 59 | watch.update(1) 60 | 61 | 62 | if __name__ == '__main__': 63 | main() 64 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests 2 | selenium 3 | bs4 4 | python-dateutil 5 | mysql-connector-python 6 | tqdm 7 | -------------------------------------------------------------------------------- /sofa score demo day.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/sofa score demo day.pptx --------------------------------------------------------------------------------
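The long chain of `if` statements in `api_data.get_info_from_api` can be expressed as a data-driven table of substring-to-replacement rules. A minimal sketch, using the same special cases as the original; `normalize_team_name` and `NAME_FIXES` are hypothetical names introduced here. Unlike the original, it returns on the first matching rule, which gives the same result for these keys:

```python
# Substring -> replacement rules mirroring the special cases in api_data.py.
# Keys are matched against the team slug after '-' has been replaced with '+'.
NAME_FIXES = {
    "brighton": "brighton",
    "leicester": "leicester",
    "norwich": "norwich",
    "mallorca": "mallorca",
    "parma": "parma+calcio",
    "bayern": "bayern",
    "koln": "fc+koln",
    "union+berlin": "union+berlin",
    "fsv+mainz": "mainz",
    "hoffenheim": "hoffenheim",
    "mgladbach": "borussia+monchengladbach",
    "schalke": "schalke",
    "leverkusen": "leverkusen",
    "paderborn": "paderborn",
}


def normalize_team_name(team_name: str) -> str:
    """Turn a SofaScore team slug into a thesportsdb search term."""
    team_name = team_name.replace("-", "+")
    for fragment, replacement in NAME_FIXES.items():
        if fragment in team_name:
            return replacement
    return team_name
```

Adding a new special case then means adding one dictionary entry instead of two lines of `if`/assignment.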
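The `write_*` functions in `db_control.py` splice `team_n` and `lg_name` directly into the SQL string, which breaks on names containing a quote and has the classic SQL-injection shape. A sketch of a fully parameterized variant of the players insert, under the same schema; `build_player_insert` is a hypothetical helper, separated from the DB call so the statement can be inspected without a live connection:

```python
def build_player_insert(player: dict, team_n: str):
    """Return (sql, params) for one player row, with every value bound
    as a placeholder -- including the team name used in the sub-select."""
    sql = ("INSERT INTO players (team_id, player_name, nationality, birth_date,"
           " height_cm, prefd_foot, position, shirt_num)"
           " VALUES ((SELECT team_id FROM teams WHERE team_name = %s LIMIT 1),"
           " %s, %s, %s, %s, %s, %s, %s)"
           " ON DUPLICATE KEY UPDATE player_id = player_id")
    params = (team_n, player['name'], player['nationality'], player['birth_date'],
              player['height'], player['prefd_foot'], player['position'],
              player['shirt_num'])
    return sql, params
```

With this helper, `cur.execute(*build_player_insert(player, team_name))` behaves like the original loop body but lets the MySQL driver handle all quoting.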
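The league flags documented in the README map onto keys of `cfg.TOP_LEAGUES_URLS` inside `main.parsing`. The same flag-to-key logic can be exercised without scraping by passing an explicit argv list; `parse_leagues` is a hypothetical, self-contained variant of `parsing` written for that purpose:

```python
import argparse


def parse_leagues(argv=None):
    """Replicates main.parsing(), but accepts an argv list for testing."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--SerieA", help="download Italian league players", action="store_true")
    parser.add_argument("-p", "--PremierLeague", help="download English league players", action="store_true")
    parser.add_argument("-l", "--LaLiga", help="download Spanish league players", action="store_true")
    parser.add_argument("-b", "--BundesLiga", help="download German league players", action="store_true")
    args = parser.parse_args(argv)
    # flag -> league key, in the same order as the original if-chain
    flag_to_key = [(args.SerieA, "seriea"), (args.PremierLeague, "premier"),
                   (args.LaLiga, "laliga"), (args.BundesLiga, "bundes")]
    return [key for enabled, key in flag_to_key if enabled]
```

Called with `argv=None`, `parse_args` falls back to `sys.argv[1:]`, so the function would drop into the original CLI behaviour unchanged.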