.
├── .idea
│   ├── .gitignore
│   ├── deployment.xml
│   ├── inspectionProfiles
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   ├── scraping_repo.iml
│   └── vcs.xml
├── Project-ERD.pdf
├── README.md
├── api_data.py
├── chromedriver
├── chromedriver.exe
├── config.py
├── db_control.py
├── html_parser.py
├── main.py
├── requirements.txt
└── sofa score demo day.pptx
/.idea/.gitignore:
--------------------------------------------------------------------------------
1 | # Default ignored files
2 | /workspace.xml
--------------------------------------------------------------------------------
/.idea/deployment.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/inspectionProfiles/profiles_settings.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/scraping_repo.iml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/Project-ERD.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/Project-ERD.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # README - SofaScore Scraper
2 |
3 | ## Overview
4 |
5 | This Data Engineering project mines the [SofaScore](https://www.sofascore.com/) website, scraping data about football players and managers from the main European leagues.
6 |
7 | All the collected data is then stored in a relational database, ready to be queried by the user.
8 |
9 | ## The Data
10 |
11 | For the selected football leagues, all the team names are collected.
12 |
13 | For every team, each player's page is scraped and the following information is stored in the database:
14 |
15 | ```
16 | 1. Name
17 | 2. Nationality
18 | 3. Date of Birth
19 | 4. Height
20 | 5. Preferred Foot
21 | 6. Position on the Pitch
22 | 7. Shirt Number
23 | ```
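
For illustration, a scraped player can be pictured as a Python dict (hypothetical values, for illustration only; the field names follow `PLAYER_FIELDS` in `config.py`):

```python
# a hypothetical scraped player (illustrative values, not real data);
# the keys mirror config.PLAYER_FIELDS plus 'name'
player = {
    "name": "John Doe",          # hypothetical name
    "nationality": "England",
    "birth_date": "1995-04-01",
    "height": 183,               # centimetres
    "prefd_foot": "Right",       # one of config.POSSIBLE_FOOT
    "position": "M",             # one of config.POSSIBLE_POSITION (G/D/M/F)
    "shirt_num": 10,
}
print(player["position"])  # M
```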
24 |
25 | Information about managers is parsed as well:
26 |
27 | ```
28 | 1. Name
29 | 2. Date of Birth
30 | 3. Nationality
31 | 4. Preferred Formation
32 | 5. Average Points per Game
33 | 6. Games Won
34 | 7. Games Drawn
35 | 8. Games Lost
36 | ```
37 |
38 | Moreover, extra data about the various teams is obtained from an external [API](https://thesportsdb.com/api.php) and stored in the DB:
39 |
40 | ```
41 | 1. Name
42 | 2. Short Name
43 | 3. Alternative Name
44 | 4. Foundation Year
45 | 5. Stadium Name
46 | 6. Stadium Picture URL
47 | 7. Stadium Description
48 | 8. Stadium Location
49 | 9. Stadium Capacity
50 | 10. Team Website
51 | 11. Team Facebook
52 | 12. Team Twitter
53 | 13. Team Instagram
54 | 14. Team Description
55 | ```
56 |
57 | If the data for a given league hasn't been scraped before, it is simply added to the database.
58 |
59 | If the data for the selected league is already present in the database, it is overwritten with the freshly scraped data.
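
This refresh behaviour can be sketched with a minimal, self-contained example (SQLite is used here purely for illustration; the project itself writes to MySQL via `db_control.py`):

```python
import sqlite3

# toy schema mirroring the leagues table (illustration only)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE leagues (league_name TEXT PRIMARY KEY, number_of_teams INT)")

def refresh_league(name, n_teams):
    # delete-then-insert: re-scraping a league overwrites its old rows
    con.execute("DELETE FROM leagues WHERE league_name = ?", (name,))
    con.execute("INSERT INTO leagues VALUES (?, ?)", (name, n_teams))
    con.commit()

refresh_league("serie-a", 20)  # first scrape: rows added
refresh_league("serie-a", 18)  # second scrape: old rows replaced
print(con.execute("SELECT number_of_teams FROM leagues").fetchall())  # [(18,)]
```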
60 |
61 | ## Usage
62 |
63 | As a first step, set your MySQL username and password in `config.py`.
64 | Then run the following commands:
65 |
66 | ```
67 | pip install -r requirements.txt
68 | ```
69 |
70 | ```
71 | python main.py -s -p -l -b
72 | ```
73 |
74 | CLI Arg | Action
75 | ------------ | -------------
76 | -s | scrapes teams and players from Serie A
77 | -p | scrapes teams and players from Premier League
78 | -l | scrapes teams and players from La Liga
79 | -b | scrapes teams and players from Bundesliga
80 |
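The flag-to-league mapping can be sketched as a standalone function (a simplified sketch of `parsing()` in `main.py`; the name `parse_leagues` and the explicit `argv` parameter are ours, added for testability):

```python
import argparse

def parse_leagues(argv):
    # map CLI flags to the internal league keys used in config.TOP_LEAGUES_URLS
    parser = argparse.ArgumentParser()
    parser.add_argument("-s", "--SerieA", action="store_true")
    parser.add_argument("-p", "--PremierLeague", action="store_true")
    parser.add_argument("-l", "--LaLiga", action="store_true")
    parser.add_argument("-b", "--BundesLiga", action="store_true")
    args = parser.parse_args(argv)
    flags = [("SerieA", "seriea"), ("PremierLeague", "premier"),
             ("LaLiga", "laliga"), ("BundesLiga", "bundes")]
    return [key for attr, key in flags if getattr(args, attr)]

print(parse_leagues(["-s", "-l"]))  # ['seriea', 'laliga']
```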
81 | The user can choose any combination of leagues to scrape and to create or update the database with.
82 |
83 | ## Created by:
84 | - `Daniel Saban`
85 | - `Sagi Elfassi`
86 |
--------------------------------------------------------------------------------
/api_data.py:
--------------------------------------------------------------------------------
1 | import json
2 | import requests
3 | import config as cfg
4 |
5 |
6 | # some teams have different names on thesportsdb.com than on sofa-score
7 | NAME_FIXES = {
8 | "brighton": "brighton",
9 | "leicester": "leicester",
10 | "norwich": "norwich",
11 | "mallorca": "mallorca",
12 | "parma": "parma+calcio",
13 | "bayern": "bayern",
14 | "koln": "fc+koln",
15 | "union+berlin": "union+berlin",
16 | "fsv+mainz": "mainz",
17 | "hoffenheim": "hoffenheim",
18 | "mgladbach": "borussia+monchengladbach",
19 | "schalke": "schalke",
20 | "leverkusen": "leverkusen",
21 | "paderborn": "paderborn",
22 | }
23 |
24 |
25 | def get_info_from_api(team_name):
26 | """
27 | getting some additional information about the teams and enriching our database
28 | thanks to the free API from thesportsdb.com/api.php
29 | :param team_name: name of the team to fetch more info for
30 | :return: a dictionary with some extra info about the team.
31 | """
32 | team_name = team_name.replace("-", "+")
33 | for fragment, fixed_name in NAME_FIXES.items():
34 | if fragment in team_name:
35 | team_name = fixed_name
36 | break
37 | print(team_name)
38 | response = requests.get(cfg.API_URL + team_name)
39 | team_data = json.loads(response.text)
40 | return team_data['teams'][0]
41 |
--------------------------------------------------------------------------------
/chromedriver:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/chromedriver
--------------------------------------------------------------------------------
/chromedriver.exe:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/chromedriver.exe
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | DB_NAME = "sofa_score"
2 | PASSWD = ""  # set your MySQL password here
3 | USERNAME = ""  # set your MySQL username here
4 | HOST = "localhost"
5 |
6 | API_URL = r"https://www.thesportsdb.com/api/v1/json/1/searchteams.php?t="
7 |
8 | POSSIBLE_FOOT = ['Right', 'Left', 'Both']
9 | POSSIBLE_POSITION = ['G', 'D', 'M', 'F']
10 | PLAYER_FIELDS = ['birth_date', 'height', 'prefd_foot', 'position',
11 | 'shirt_num', 'nationality']
12 | # MANAGER_FIELDS = ['birth_date', 'nationality', 'pref_formation', 'avg_points_per_game',
13 | # 'games_won', 'games_drawn', 'games_lost']
14 |
15 |
16 | TOP_LEAGUES_URLS = {"seriea": r"https://www.sofascore.com/tournament/football/italy/serie-a/23",
17 | "laliga": r"https://www.sofascore.com/tournament/football/spain/laliga/8",
18 | "premier": r"https://www.sofascore.com/tournament/football/england/premier-league/17",
19 | "bundes": r"https://www.sofascore.com/tournament/football/germany/bundesliga/35"}
20 |
--------------------------------------------------------------------------------
/db_control.py:
--------------------------------------------------------------------------------
1 | import mysql.connector
2 | import config as cfg
3 | import unicodedata
4 |
5 |
6 | def connector():
7 | """
8 | creating a connector to an existing db
9 | :return: return the valid connector
10 | """
11 | return mysql.connector.connect(
12 | host=cfg.HOST,
13 | user=cfg.USERNAME,
14 | password=cfg.PASSWD,
15 | database=cfg.DB_NAME,
16 | charset='utf8mb4',
17 | use_unicode=True
18 | )
19 |
20 |
21 | def create():
22 | """
23 | creating a MySQL database
24 | """
25 | my_db = mysql.connector.connect(
26 | host=cfg.HOST,
27 | user=cfg.USERNAME,
28 | password=cfg.PASSWD,
29 | )
30 | cur = my_db.cursor()
31 | cur.execute('''CREATE DATABASE IF NOT EXISTS ''' + cfg.DB_NAME)
32 |
33 | my_db = connector()
34 | cur = my_db.cursor()
35 | cur.execute('''CREATE TABLE IF NOT EXISTS leagues (
36 | league_id INT PRIMARY KEY AUTO_INCREMENT,
37 | league_name VARCHAR(255) NOT NULL UNIQUE,
38 | number_of_teams INT NOT NULL)''')
39 | cur.execute('''CREATE TABLE IF NOT EXISTS teams (
40 | team_id INT PRIMARY KEY AUTO_INCREMENT,
41 | league_id INT,
42 | team_name VARCHAR(255) NOT NULL UNIQUE,
43 | number_of_players INT,
44 | FOREIGN KEY (league_id) REFERENCES leagues(league_id))''')
45 | cur.execute('''CREATE TABLE IF NOT EXISTS managers (
46 | manager_id INT PRIMARY KEY AUTO_INCREMENT,
47 | team_id INT,
48 | manager_name VARCHAR(255) NOT NULL,
49 | birth_date DATETIME,
50 | nationality VARCHAR(255),
51 | pref_formation VARCHAR(255),
52 | avg_points_per_game REAL,
53 | games_won INT,
54 | games_drawn INT,
55 | games_lost INT,
56 | FOREIGN KEY (team_id) REFERENCES teams(team_id))''')
57 | cur.execute('''CREATE TABLE IF NOT EXISTS players (
58 | player_id INT PRIMARY KEY AUTO_INCREMENT,
59 | team_id INT,
60 | player_name VARCHAR(255) NOT NULL,
61 | nationality VARCHAR(255),
62 | birth_date DATETIME,
63 | height_cm INT,
64 | prefd_foot VARCHAR(255),
65 | position VARCHAR(255),
66 | shirt_num INT,
67 | FOREIGN KEY (team_id) REFERENCES teams(team_id))''')
68 | cur.execute('''CREATE TABLE IF NOT EXISTS teams_extras (
69 | id_Team INT PRIMARY KEY,
70 | team_id INT,
71 | team_name VARCHAR(45) NOT NULL,
72 | team_name_short VARCHAR(45),
73 | alternate_team_name VARCHAR(255),
74 | formed_year INT,
75 | stadium_name VARCHAR(45),
76 | stadium_pic_url VARCHAR(255),
77 | stadium_description LONGTEXT,
78 | stadium_location VARCHAR(255),
79 | stadium_capacity INT,
80 | team_website VARCHAR(255),
81 | team_facebook VARCHAR(255),
82 | team_twitter VARCHAR(255),
83 | team_instagram VARCHAR(255),
84 | team_description LONGTEXT,
85 | FOREIGN KEY (team_id) REFERENCES teams(team_id))''')
86 | cur.close()
87 |
88 |
89 | def write_league(league_info: list):
90 | """
91 | writing the league info to the database
92 | :param league_info: a list with the league name and the number of teams in this league
93 | """
94 | my_db = connector()
95 | cur = my_db.cursor()
96 | cur.execute("INSERT INTO leagues (league_name, number_of_teams) VALUES (%s, %s) "
97 | "ON DUPLICATE KEY UPDATE league_id=league_id",
98 | (league_info[0], league_info[1]))
99 | my_db.commit()
100 | cur.close()
101 |
102 |
103 | def write_teams(teams_info: list, lg_name):
104 | """
105 | writing a team info to the database
106 | :param teams_info: the team name, number of players in the team
107 | :param lg_name: the league which the teams belong to - foreign key
108 | """
109 | my_db = connector()
110 | cur = my_db.cursor()
111 | cur.execute("INSERT INTO teams (team_name, number_of_players, league_id) VALUES "
112 | "(%s, %s, (SELECT league_id FROM leagues WHERE league_name=%s)) "
113 | "ON DUPLICATE KEY UPDATE team_id=team_id",
114 | (teams_info[0], teams_info[1], lg_name))
115 | my_db.commit()
116 | cur.close()
117 |
118 |
119 | def write_players(players_info: list, team_n):
120 | """
121 | writing a list of players to the database
122 | :param players_info: a list of dictionaries, each dict stands for a player with several data fields.
123 | :param team_n: the name of the team which the players belong to
124 | """
125 | my_db = connector()
126 | cur = my_db.cursor()
127 | for player in players_info:
128 | cur.execute("INSERT INTO players (team_id, player_name, nationality, birth_date, height_cm, prefd_foot, "
129 | "position, shirt_num) "
130 | "VALUES ((SELECT team_id FROM teams WHERE team_name=%s LIMIT 1), %s, %s, %s, %s, %s, %s, %s) "
131 | "ON DUPLICATE KEY UPDATE player_id=player_id",
132 | (team_n, player['name'], player['nationality'], player['birth_date'], player['height'],
133 | player['prefd_foot'], player['position'], player['shirt_num']))
134 | my_db.commit()
135 | cur.close()
136 |
137 |
138 | def write_manager(mgr_info, team_n):
139 | """
140 | writing a manager to the database
141 | :param mgr_info: a dictionary that represents a manager with several data fields.
142 | :param team_n: the name of the team which the manager belongs to
143 | """
144 | my_db = connector()
145 | cur = my_db.cursor()
146 | mgr_name = unicodedata.normalize('NFD', mgr_info['name']).encode('ascii', 'ignore').decode('ascii')
147 | cur.execute("INSERT INTO managers (team_id, manager_name, birth_date, nationality, pref_formation, "
148 | "avg_points_per_game, games_won, games_drawn, games_lost) "
149 | "VALUES ((SELECT team_id FROM teams WHERE team_name=%s LIMIT 1), %s, %s, %s, %s, %s, %s, %s, %s) "
150 | "ON DUPLICATE KEY UPDATE manager_id=manager_id",
151 | (team_n, mgr_name, mgr_info['birth_date'], mgr_info['nationality'], mgr_info['pref_formation'],
152 | mgr_info['avg_points_per_game'], mgr_info['games_won'], mgr_info['games_drawn'], mgr_info['games_lost']))
153 | my_db.commit()
154 | cur.close()
155 |
156 |
157 | def write_team_extras(team_info, team_n):
158 | """
159 | writing additional information about each team from an external API - thesportsdb.com/api.php
160 | :param team_info: a dictionary with the desired external data
161 | :param team_n: team name that the external data belongs to
162 | """
163 | my_db = connector()
164 | cur = my_db.cursor()
165 | team_des = unicodedata.normalize('NFD', team_info['strDescriptionEN'] or '').encode('ascii', 'ignore').decode('ascii')
166 | if team_info['strStadiumDescription'] is not None:
167 | stad_des = unicodedata.normalize('NFD', team_info['strStadiumDescription']).encode('ascii', 'ignore').decode('ascii')
168 | else:
169 | stad_des = None
170 | cur.execute("INSERT INTO teams_extras (team_id, id_Team, team_name, team_name_short, alternate_team_name, "
171 | "formed_year, stadium_name, stadium_pic_url, stadium_description, stadium_location, stadium_capacity, "
172 | "team_website, team_facebook, team_twitter, team_instagram, team_description) "
173 | "VALUES ((SELECT team_id FROM teams WHERE team_name=%s LIMIT 1), %s, %s, %s, %s, %s, %s, %s, "
174 | "%s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE id_Team = id_Team",
175 | (team_n, team_info['idTeam'], team_info['strTeam'], team_info['strTeamShort'], team_info['strAlternate'],
176 | team_info['intFormedYear'], team_info['strStadium'], team_info['strStadiumThumb'],
177 | stad_des, team_info['strStadiumLocation'], team_info['intStadiumCapacity'],
178 | team_info['strWebsite'], team_info['strFacebook'], team_info['strTwitter'], team_info['strInstagram'],
179 | team_des))
180 | my_db.commit()
181 | cur.close()
182 |
183 |
184 | def check_and_delete(league_name):
185 | """
186 | this function checks if the league already exists in the database, if so deletes it.
187 | :param league_name: getting the league name from user
188 | """
189 | my_db = connector()
190 | cur = my_db.cursor()
191 | cur.execute("SELECT league_id FROM leagues WHERE league_name = %s", (league_name,))
192 | league_id = cur.fetchall()
193 | if len(league_id) != 0:
194 | cur.execute("SELECT team_id FROM teams WHERE league_id = %s", (league_id[0][0],))
195 | team_ids = cur.fetchall()
196 | if len(team_ids) > 0:
197 | for team_id in team_ids:
198 | cur.execute("DELETE FROM players WHERE team_id = %s", (team_id[0],))
199 | cur.execute("DELETE FROM managers WHERE team_id = %s", (team_id[0],))
200 | cur.execute("DELETE FROM teams_extras WHERE team_id = %s", (team_id[0],))
201 | my_db.commit()
202 | cur.execute("DELETE FROM teams WHERE league_id = %s", (league_id[0][0],))
203 | my_db.commit()
204 | cur.execute("DELETE FROM leagues WHERE league_id = %s", (league_id[0][0],))
205 | my_db.commit()
206 |
--------------------------------------------------------------------------------
/html_parser.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from selenium import webdriver
3 | from bs4 import BeautifulSoup
4 | from dateutil.parser import parse
5 | from selenium.webdriver.chrome.options import Options
6 | import config as cfg
7 |
8 | def arrow_manipu(x):  # split an html string on its angle brackets
9 | return x.replace("<", ">").split(">")
10 |
11 | def get_driver():
12 | chrome_options = Options()
13 | chrome_options.add_argument("--headless")
14 | # working with selenium google driver as the data is not in the bs4 html
15 | return webdriver.Chrome(executable_path=r'./chromedriver', options=chrome_options)
16 |
17 |
18 | def extract_player_info(player_url):
19 | """
20 | Starting from the url of a player's page on https://www.sofascore.com, the function extracts
21 | the most relevant information about the player, if available, and returns it in a dict.
22 | :param player_url: url of the player on the https://www.sofascore.com site.
23 | :return: dict with these keys: [name, nationality, birth_date, height, prefd_foot, position, shirt_num]
24 | """
25 | player_dict = {}
26 | # beautiful_soup of the player
27 | player_html = BeautifulSoup(requests.get(player_url).text, 'html.parser')
28 | # html of the most interesting data
29 | player_panel_html = player_html.find_all("h2", class_="styles__DetailBoxTitle-sc-1ss54tr-5 enIhhc")
30 | details = arrow_manipu(str(player_panel_html))
31 | player_fields_html = player_html.find_all("div", class_="styles__DetailBoxContent-sc-1ss54tr-6 iAORZR")
32 | fields_list = arrow_manipu(str(player_fields_html))
33 | player_dict['name'] = player_url.split("/")[-2].replace('-', ' ').title()
34 | if "Nationality" in fields_list:
35 | raw_nationality = player_html.find_all("span", class_="u-pL8")
36 | player_dict['nationality'] = arrow_manipu(str(raw_nationality))[-3]
37 | for field in fields_list:
38 | try:
39 | b_day = parse(field, fuzzy=False)
40 | player_dict['birth_date'] = b_day
41 | except ValueError:
42 | continue
43 |
44 | for i in range(len(details)):
45 | is_a_detail = (r"h2 class=" in details[i] or r"span style" in details[i]) and details[i + 1] != ''
46 | if is_a_detail:
47 | if 'cm' in details[i + 1]:
48 | player_dict['height'] = int(details[i + 1].split()[0])
49 | elif details[i + 1] in cfg.POSSIBLE_FOOT:
50 | player_dict['prefd_foot'] = details[i + 1]
51 | elif details[i + 1] in cfg.POSSIBLE_POSITION:
52 | player_dict['position'] = details[i + 1]
53 | elif "Shirt number" in fields_list:
54 | player_dict['shirt_num'] = int(details[i + 1])
55 |
56 | for key in cfg.PLAYER_FIELDS:
57 | if key not in player_dict.keys():
58 | player_dict[key] = None
59 | return player_dict
60 |
61 |
62 | def extract_players_urls(team_url):
63 | """
64 | this function gets the url of a team and extracts the list of player urls from it.
65 | then it sends each player url to extract_player_info to get all the info about the player
66 | :param team_url: a url to a team page as a string
67 | """
68 | players_list = []
69 | # using bs4 & requests to retrieve html as text
70 | driver = get_driver()
71 | driver.get(team_url)
72 | team_html = BeautifulSoup(driver.page_source, 'html.parser')
73 | driver.close()  # done with selenium; now looking for the player info inside the page
74 | all_players_html = team_html.find_all("a", class_="styles__CardWrapper-sc-1dlv1k5-15 czXoLq")
75 | # manipulating the text to extract player links
76 | html_list = str(all_players_html).split()
77 | for line in html_list:
78 | if "href" in line:
79 | players_list.append(extract_player_info("https://www.sofascore.com" + line.split("\"")[1]))
80 | return players_list
81 |
82 |
83 | def extract_mgr_url(team_url):
84 | mgr_link = ""
85 | driver = get_driver()
86 | driver.get(team_url)
87 | soup = BeautifulSoup(driver.page_source, 'html.parser')
88 | mgr_html = soup.find('div', class_="Content-sc-1o55eay-0 styles__ManagerContent-qlwzq-9 dxQrED")
89 | for line in str(mgr_html).split():
90 | if "href" in line:
91 | mgr_link = "https://www.sofascore.com" + line.split("\"")[1]
92 | return mgr_link
93 |
94 |
95 | def extract_mgr_info(manager_url):
96 | """
97 | this function gets a team url, resolves the manager url via extract_mgr_url, and extracts all desired data from the source html
98 | :param manager_url: a team url (the manager link is resolved internally)
99 | :return: a dictionary that contains all the data about a manager.
100 | """
101 | mgr_dict = {}
102 | mgr_soup = BeautifulSoup(requests.get(extract_mgr_url(manager_url)).text, 'html.parser')
103 | mgr_dict['name'] = mgr_soup.find('div', class_="Content-sc-1o55eay-0 gYsVZh u-fs25 fw-medium").get_text()
104 | mgr_dict['nationality'] = mgr_soup.find('span', style="padding-left:6px").get_text()
105 | field_list = arrow_manipu(str(mgr_soup.find_all('div', class_="Content-sc-1o55eay-0 gYsVZh u-txt-2")))
106 | for field in field_list:
107 | try:
108 | b_day = parse(field, fuzzy=False)
109 | mgr_dict['birth_date'] = b_day
110 | except ValueError:
111 | continue
112 | values_list = arrow_manipu(str(mgr_soup.find_all('div', class_="Content-sc-1o55eay-0 gYsVZh u-fs21 fw-medium")))
113 | mgr_dict['pref_formation'] = values_list[12]
114 | mgr_dict['avg_points_per_game'] = values_list[20]
115 | mgr_dict['games_won'] = mgr_soup.find('div', class_="Section-sc-1a7xrsb-0 dPyKtM u-txt-green").get_text()
116 | mgr_dict['games_drawn'] = mgr_soup.find('div', class_="Section-sc-1a7xrsb-0 dPyKtM u-txt-2").get_text()
117 | mgr_dict['games_lost'] = mgr_soup.find('div', class_="Section-sc-1a7xrsb-0 dPyKtM u-txt-red").get_text()
118 |
119 | return mgr_dict
120 |
121 |
122 | def extract_teams_urls(league_url):
123 | """
124 | this function extracts team urls out of a league home page
125 | :param league_url: league url as a string
126 | :return: all team urls, unique and alphabetically sorted, in a list
127 | """
128 | team_list = []
129 | driver = get_driver()
130 | driver.get(league_url) # mimicking human behaviour and opening league url
131 | team_html = BeautifulSoup(driver.page_source, 'html.parser') # getting the source with selenium, parsing with bs4
132 | driver.close()
133 | # looking after all teams urls
134 | all_teams_html = team_html.find_all("div", class_="Content-sc-1o55eay-0 gYsVZh u-pL8 fw-medium")
135 | html_list = str(all_teams_html).split() # splitting the string by spaces
136 | for line in html_list:
137 | is_line_with_link = "href=\"/team" in line and not line.endswith("img")
138 | if is_line_with_link:
139 | # appending it to a list
140 | team_list.append("https://www.sofascore.com" + line.replace("\"", "").split("=")[-1].split(">")[0])
141 | sorted_teams = sorted(list(set(team_list))) # sorting and removing duplicates
142 | return sorted_teams
143 |
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import db_control
2 | import config as cfg
3 | import html_parser as hp
4 | import api_data as api
5 | from tqdm import tqdm
6 | import argparse
7 |
8 |
9 | def parsing():
10 | """
11 | using argparse to simplify the cli for the user
12 | """
13 | leagues = []
14 | parser = argparse.ArgumentParser()
15 | parser.add_argument("-s", "--SerieA", help="download Italian league players", action="store_true")
16 | parser.add_argument("-p", "--PremierLeague", help="download English league players", action="store_true")
17 | parser.add_argument("-l", "--LaLiga", help="download Spanish league players", action="store_true")
18 | parser.add_argument("-b", "--BundesLiga", help="download German league players", action="store_true")
19 | args = parser.parse_args()
20 |
21 | if args.SerieA:
22 | leagues.append("seriea")
23 | if args.PremierLeague:
24 | leagues.append("premier")
25 | if args.LaLiga:
26 | leagues.append("laliga")
27 | if args.BundesLiga:
28 | leagues.append("bundes")
29 | return leagues
30 |
31 |
32 | def main():
33 | """
34 | main entry point: extracts players and managers data from https://www.sofascore.com
35 | """
36 | # validating that the user changed the MySQL password and username to connect
37 | if cfg.PASSWD == "" or cfg.USERNAME == "":
38 | exit("Invalid username or password. Please read README.md!")
39 |
40 | leagues_to_download = parsing() # getting commands from the cli
41 | db_control.create()  # creates the database and any tables that do not exist
42 |
43 | for league in leagues_to_download: # iterating league links
44 | league_name = cfg.TOP_LEAGUES_URLS[league].split("/")[-2]
45 | db_control.check_and_delete(league_name)
46 | teams = hp.extract_teams_urls(cfg.TOP_LEAGUES_URLS[league])  # extracting team urls from the league page
47 | print("\ngetting teams from " + league_name) # printing for user "loading" in addition to tqdm
48 | db_control.write_league([league_name, len(teams)])
49 | watch = tqdm(total=len(teams), position=0)
50 | for team_url in teams: # iterating all teams urls
51 | team_name = team_url.split('/')[-2]
52 | manager_info = hp.extract_mgr_info(team_url)
53 | players_list = hp.extract_players_urls(team_url)  # extracting every player's info for this team
54 | db_control.write_teams([team_name, len(players_list)], league_name)
55 | db_control.write_players(players_list, team_name)
56 | db_control.write_manager(manager_info, team_name)
57 | extra_team_info = api.get_info_from_api(team_name) # retrieving external data from the api
58 | db_control.write_team_extras(extra_team_info, team_name) # writing this data into the database
59 | watch.update(1)
60 |
61 |
62 | if __name__ == '__main__':
63 | main()
64 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests
2 | selenium<4
3 | beautifulsoup4
4 | python-dateutil
5 | mysql-connector-python
6 | tqdm
--------------------------------------------------------------------------------
/sofa score demo day.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/danielsaban/data-scraping-sofascore/2b2ac2657db10ef2105c0f89b0b8bd05c7cb9f32/sofa score demo day.pptx
--------------------------------------------------------------------------------