├── .gitattributes ├── .gitignore ├── Dockerfile ├── README.md ├── airflow_requirements └── requirements.txt ├── analytics ├── data_analysis │ ├── analysis.txt │ └── basic_data_facts.txt ├── kill_query.py ├── make_plots.py ├── plots │ ├── blitz_elo_over_time │ │ ├── blitz_elo_over_time_-1000_elo_cutoff.png │ │ ├── blitz_elo_over_time_100_elo_cutoff.png │ │ ├── blitz_elo_over_time_800_elo_cutoff.png │ │ ├── heatmap_elo_gain_count.png │ │ ├── heatmap_elo_gain_time.png │ │ ├── rating_percentiles_1000-1199.png │ │ ├── rating_percentiles_1200-1399.png │ │ ├── rating_percentiles_1400-1599.png │ │ ├── rating_percentiles_1600-1799.png │ │ ├── rating_percentiles_1800-1999.png │ │ ├── rating_percentiles_2000-2199.png │ │ ├── rating_percentiles_2200-2399.png │ │ └── rating_percentiles_800-999.png │ ├── elo_by_total_games_played_blitz.png │ ├── elo_by_total_games_played_per_month_blitz.png │ ├── elo_diff_by_total_games_played_blitz.png │ ├── elo_diff_per_month_by_total_games_per_month_blitz.png │ ├── games_per_elo_bracket.png │ ├── games_per_event.png │ ├── games_per_player.png │ ├── pct_analyzed_per_elo_bracket.png │ ├── players_per_elo_bracket.png │ └── popular_play_times │ │ ├── games_by_day.png │ │ ├── games_by_time_and_day.png │ │ └── games_by_time_of_day.png ├── psycopg2_query.py ├── query_out_storage │ ├── blitz_game_pairs_2013-2015.csv.bz2 │ ├── elo_band_maxs.csv │ ├── elo_band_means.csv │ ├── elo_band_medians.csv │ ├── elo_band_mins.csv │ ├── elo_diff_per_player.csv │ ├── elo_diff_per_player.csv.bz2 │ ├── games_played_by_time_of_day.csv.bz2 │ ├── min_max_ratings_per_player.csv.bz2 │ ├── pct_analyzed_per_elo_bracket.csv │ ├── terminations_by_game_type.csv │ ├── total_blitz_games_per_player_over_time.csv.bz2 │ ├── total_games_per_elo_bracket.csv │ ├── total_games_per_event.csv │ ├── total_games_per_player.csv.bz2 │ ├── total_games_per_player_and_event.csv.bz2 │ └── user_ids.csv.bz2 └── spark.py ├── docker-compose.yml ├── requirements.txt ├── screenshots ├── airflow_dag.png ├── airflow_dag_closeup.png ├── airflow_ui.png ├── german11_blitz.png ├── peschkach.png ├── peschkach_bio.png ├── processed_data.png ├── raw_data_sample.png └── user_ids_table.png └── src ├── CONFIG.py ├── airflow_dag_kafka.py ├── airflow_dag_local.py ├── check_db_status.py ├── consumer.py ├── data_process_util.py ├── database_util.py ├── download_games.py ├── download_links.txt ├── process_file_local.py └── producer.py /.gitattributes: -------------------------------------------------------------------------------- 1 | *.csv filter=lfs diff=lfs merge=lfs -text 2 | *.bz2 filter=lfs diff=lfs merge=lfs -text 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | **/data 2 | **/__pycache__ 3 | **/test.py 4 | **/files_downloaded.txt 5 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.9 2 | RUN useradd --create-home --shell /bin/bash username 3 | #apt-get update && apt-get install -y build-essential && 4 | #libpq-dev 5 | WORKDIR /home/user 6 | COPY ./requirements.txt ./ 7 | RUN pip install --no-cache-dir -r requirements.txt 8 | USER username 9 | COPY --chown=username . . 
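# Copy the DAG definitions into the DAGs folder used by the Bitnami Airflow images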
10 | COPY --chown=username ./src /opt/bitnami/airflow/dags
11 | CMD ["bash"]
12 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # How Long Does It Take Ordinary People To "Get Good" At Chess?
2 | 
3 | TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, most beginners will improve their rating by 100 lichess rating points in 3-6 months. Most "experienced" chess players in the 1400-1800 rating range will take 3-4 years to improve their rating by 100 lichess rating points. There's no strong evidence that playing more games makes you improve quicker.
4 | 
5 | ## Table Of Contents
6 | 
7 | 1. [Backstory](#backstory)
8 | 2. [ETL Process](#etl-process)
9 | 3. [Data Analysis](#data-analysis)
10 |    - [How Long Does It Take To Improve At Chess?](#how-long-does-it-take-to-improve-at-chess)
11 |    - [Does Playing More Games Make You A Better Chess Player?](#does-playing-more-games-make-you-a-better-chess-player)
12 |    - [Misc. Questions Explored Out Of Curiosity](#misc-questions-explored-out-of-curiosity)
13 | 4. [How To Do What I Did](#how-to-do-what-i-did)
14 | 
15 | 
16 | ## Backstory
17 | 
18 | I've been a casual chess player for a few years now. Like most people who get into chess, one question that's been at the back of my mind is, like the title suggests: how long is it going to take me to actually get good at this game?
19 | 
20 | Fortunately for us, lichess.org, the largest open source online chess website, makes all chess games played on their site (including rating data) freely available for the public to download. This gives me just what I need to take a crack at this question.
21 | 
22 | But before we get into the data mining, we first need to define what "good at chess" means. Naturally, that's going to vary from person to person. If your definition of "good" is never losing, that's never going to happen (unless you're picky about who you choose to play, or you're a computer). If your definition of "good" is better than most everyday people you would find off the street, then you would probably get there after spending 30 minutes learning how the pieces move.
23 | 
24 | For what it's worth, the data says the 90th percentile rating is at 1927. On lichess.org/stat/rating/distribution/blitz, I believe the distribution is calculated from currently active players, raising the 90th percentile to 2075. For reference, GM Hikaru Nakamura (one of the world's top blitz players) was able to jump from total beginner (600 USCF, ~1100 lichess) to the 90th percentile (1800 USCF, ~2000 lichess) in just 2 years (Jan 1995 - Jan 1997). So, best case scenario, if you're a chess prodigy you can get into the 90th percentile ballpark with 2 years of serious study.
25 | 
26 | Given the ambiguous nature of this question, I decided to focus on improvement rate rather than trying to answer something like how long it takes to go from total beginner to grandmaster.
27 | 
28 | There are a few reasons for this:
29 | 
30 | 1. People who become grandmasters probably don't represent your typical everyday chess player. I'm more interested in how long it's going to take a "normal" person like me to improve.
31 | 2. While tournament games and rating histories of grandmasters (and other tournament players) are publicly available, over-the-board chess is likely different from online chess, even if the time controls are the same.
I'm more interested in getting better at online chess, which is where I usually play.
32 | 3. I think the rate of improvement is what people (myself included) are *really* interested in. I felt like the core of my question is actually "how long is it going to take until I see noticeable improvement in my chess skills?" It's also a helpful way to benchmark whether your struggle to improve at chess is "normal" or whether there's something wrong with your training plan.
33 | 4. It's unlikely I'm going to find many examples of people who went from beginner to grandmaster in the data. Jumping that gap takes at least 8-10 years, in addition to extreme talent and near full-time study (back to point 1). Lichess data only goes back to 2013, 8 years as of this writing. Because it takes many years to develop strong players, there's a low chance such players have documented their entire progression on lichess. They would have had to start in the early days of lichess, back when it wasn't very popular.
34 | 5. Even if I could get data spanning the many years required to build strong players, the amount of data would be *massive*, even after stripping it down to the bare minimum required, given the exponential growth of chess games being played on lichess. There are ways I could store and analyze that *massive* amount of data, but they would cost me an arm and a leg. Phrasing the question this way allows me to work with the limited data I'm able to get my hands on.
35 | 
36 | So that brings me to the purpose of this project: to figure out typical improvement rates for typical online chess players on lichess.org.
37 | 
38 | ## ETL Process
39 | 
40 | Here's an outline of what I did for this analysis:
41 | 
42 | 1. Extract the data from database.lichess.org
43 | 2. Transform the data into the format I need it in
44 | 3. Load the processed data into a relational database
45 | 4. Analyze the data to answer my questions
46 | 
47 | Lichess has all their games available for download by month (e.g. January 2013). The problem is, the data in its raw format can't be queried easily, nor can it directly answer the question I'm interested in.
48 | 
49 | Here's what the raw data looks like:
50 | 
51 | ![Alt test](./screenshots/raw_data_sample.png?raw=true "Data Sample")
52 | 
53 | I believe it's an export from MongoDB, a NoSQL document database, which does not have a fixed schema. The problem with schemaless databases like MongoDB is that they aren't designed for complex analytical queries (though they're great for application flexibility and scaling for big data).
54 | 
55 | I decided to migrate the data into PostgreSQL, a structured, relational database. That allows me to run complex SQL queries to get the rating per player over time, which is really what I need to answer my question. It also gives me the flexibility to play around with the data for other questions that might pop up in the future.
56 | 
57 | To help manage the ETL process, I built a data pipeline using Airflow to schedule the downloading, processing, and deletion of each data file on lichess. The nice thing about Airflow is that it comes with a web UI that lets you see visually where you are in your data pipeline. It also provides logging and task status to make troubleshooting/debugging a bit easier.
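To give a flavor of what the pipeline does, here's a minimal sketch of a download -> process -> clean-up DAG. The task and function names are illustrative placeholders, not the actual ones in src/airflow_dag_local.py:

    # Simplified sketch of the monthly ETL chain (illustrative only; see
    # src/airflow_dag_local.py for the real pipeline).
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def download_month(month):
        ...  # fetch the monthly dump from database.lichess.org

    def process_month(month):
        ...  # parse the games and load them into postgres

    def delete_month(month):
        ...  # delete the raw file to free up disk space

    with DAG("lichess_etl_sketch", start_date=datetime(2013, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        for month in ["2013-01", "2013-02"]:  # one task chain per monthly file
            download = PythonOperator(task_id=f"download_{month}",
                                      python_callable=download_month, op_args=[month])
            process = PythonOperator(task_id=f"process_{month}",
                                     python_callable=process_month, op_args=[month])
            delete = PythonOperator(task_id=f"delete_{month}",
                                    python_callable=delete_month, op_args=[month])
            download >> process >> delete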
58 | 
59 | Here are a couple of screenshots of what the pipeline looks like:
60 | 
61 | ![Alt test](./screenshots/airflow_dag.png?raw=true "Airflow DAG graph")
62 | ![Alt test](./screenshots/airflow_dag_closeup.png?raw=true "Airflow DAG graph closeup")
63 | 
64 | After a lot of troubleshooting and a few weeks of waiting for the data to download on my dinky 10-year-old laptop, I got about 100 GB of data loaded into postgres, spanning 5.5 years, 450 million games, and 2.3 million players.
65 | 
66 | Here's a peek at what the processed data looks like in my PostgreSQL database (the 'games' table). To save space, I truncated the event names and termination types to single characters (b = bullet, B = blitz, c = classical, C = correspondence, r = rapid; N = finished normally, F = lost/won on time, A = game aborted, ? = other, i.e. violating terms of service/cheating) and took out the 'https://lichess.org/' part of the 'site' column, leaving the unique game identifier:
67 | 
68 | ![Alt test](./screenshots/processed_data.png?raw=true)
69 | 
70 | I also assigned user ids to replace the usernames because an int type takes up less space (4 bytes) than varchar (~10-20 bytes depending on username length). The username can still be looked up by referencing/joining the user\_ids table:
71 | 
72 | ![Alt test](./screenshots/user_ids_table.png?raw=true)
73 | 
74 | Once that was done, I did the analysis with a bit of SQL and pandas, then made the plots with seaborn/matplotlib.
75 | 
76 | ## Data Analysis
77 | 
78 | ### How Long Does It Take To Improve At Chess?
79 | 
80 | Now to the fun stuff.
81 | 
82 | What does the data say about chess improvement rate?
83 | 
84 | After extracting each player's rating over time (including games as white and black), filtering for one time control, calculating monthly averages, aligning everyone's starting dates, binning players by starting rating, and averaging the ratings within each bin (with 95% confidence intervals), I get the plot below:
85 | 
86 | ![Alt test](./analytics/plots/blitz_elo_over_time/blitz_elo_over_time_-1000_elo_cutoff.png?raw=true "blitz elo over time (all)")
87 | 
88 | I analyzed the data from the perspective of a player's monthly average, which should be a better estimate of a player's playing strength than the game-by-game rating fluctuation. I'm not particularly interested in cases of players who managed to jump 100 points in one afternoon blitz binge session. I believe those instances can be attributed to random chance rather than those players suddenly having a "eureka" moment that boosted their playing strength by 100 rating points overnight.
89 | 
90 | From the graph, it looks like improvement rate depends a lot on what your current rating is. As one might expect, lower-rated players have the greatest opportunity to improve quickly, while higher-rated players will take much longer to see improvement. Most players in the 800-1000 rating range (about 6% of players) will see their rating jump up 100 points in just a few months of activity. Most players in the 1600-2000 range (27% of players) will take 4 years or more to move up just 100 rating points.
91 | 
92 | I'm not sure what the weird bump and dip around the 3-year mark is. It may be an artifact of the dataset spanning only 5.5 years, with datapoints heavily clustered around lower month counts (see player churn).
93 | 
94 | 4 years just for 100 rating points? Seems a bit longer than I expected. But it is plausible.
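For concreteness, here's a rough pandas sketch of the aggregation described above, once the per-player rating history has been queried out of postgres. The column names follow analytics/make_plots.py (with "days_since_start" simplified to a plain number; the stored csv actually holds strings like "364 days"); treat this as an illustration, not the exact script:

    import pandas as pd

    # One row per player per active day: "player", "days_since_start", and
    # "min" (that day's rating), as used in analytics/make_plots.py.
    df = pd.read_csv("query_out_storage/total_blitz_games_per_player_over_time.csv.bz2")

    # Collapse daily ratings into a monthly average per player (~30.5 days/month).
    df["months_since_start"] = (df["days_since_start"] // 30.5).astype(int)
    monthly = df.groupby(["player", "months_since_start"], as_index=False)["min"].mean()

    # Bin players into 200-point bands by their month-0 (starting) rating.
    start = monthly[monthly["months_since_start"] == 0]
    for lo in range(800, 2201, 200):
        in_band = start.loc[(start["min"] >= lo) & (start["min"] < lo + 200), "player"]
        monthly.loc[monthly["player"].isin(in_band), "band"] = f"{lo} - {lo + 199}"
    monthly = monthly.dropna()  # drop the sparse extreme bands (600-800 and 2400+)

    # seaborn.lineplot(x="months_since_start", y="min", hue="band", data=monthly)
    # then draws each band's mean with a 95% confidence interval.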
95 | 
96 | There are players in the data with long histories of activity who have not improved their rating despite playing many games over the span of many years. See the player who's played the most games of all time on lichess:
97 | 
98 | ![Alt test](./screenshots/german11_blitz.png?raw=true "german11 blitz stats")
99 | 
100 | But what if the mean ratings are being dragged down by the mass of "casual" players who aren't interested in improving and just play for fun? (Not that there's anything wrong with that.) Is there a way to look at just the players who are serious about improving? I tried filtering the data to players who have gained at least 100 rating points since joining lichess:
101 | 
102 | ![Alt test](./analytics/plots/blitz_elo_over_time/blitz_elo_over_time_100_elo_cutoff.png?raw=true)
103 | 
104 | There's a strange jump in rating in the first month for players in this category that is less prevalent in the entire dataset. It's possible this could be due to players starting out on lichess as underrated and quickly catching up in the first month. I'm going to ignore the first month when calculating the improvement rate.
105 | 
106 | From the chart, players in the 800-1000 range on average improve their rating by 100 points in just 1-2 months. Players in the 1600-2000 range improve their rating by 100 points in about 3-4 years, which isn't much different from the average of all players. One explanation for this is that most players at this rating range are already considered "serious" players, and most of them "have what it takes" to improve at chess. Setting a cutoff in the data at 100 rating points of improvement is not as strong a filter for players in this range.
107 | 
108 | So it looks like for most people, improvement happens over long periods of consistent study/activity: on the scale of months for beginners and years for experienced players.
109 | 
110 | But what about people who seem to have gone from beginner to ~2000 rating in just a couple of years? It's not possible to see what their progressions looked like from the plots above, yet these types of people *are* out there. I wanted to take a look at how many of these people actually exist. Are they really as rare as the above data seems to suggest?
111 | 
112 | Here's a heatmap showing the number of players who have raised their rating by X points, divided up by their starting rating:
113 | 
114 | ![Alt test](./analytics/plots/blitz_elo_over_time/heatmap_elo_gain_count.png?raw=true)
115 | 
116 | So it seems like there is a sizeable number of people who have made *significant* gains since they joined lichess as a beginner (~800-1200 rating). From the data for people in the 800-1200 starting range, there are about 120 people who have improved their rating by over 800 points since joining, and about 1000 people who have improved their rating by at least 500 points. However, it is important to note that the data does not filter out bots, cheaters, or smurfs; it's unclear how much such accounts contribute to these counts, but I suspect it's small.
117 | 
118 | Another interesting question: how long did it take, on average, for these "outliers" to achieve such impressive gains?
119 | 
120 | Here's another heatmap showing the average time it took for people to achieve X rating gain, divided up by their starting rating:
121 | 
122 | ![Alt test](./analytics/plots/blitz_elo_over_time/heatmap_elo_gain_time.png?raw=true)
123 | 
124 | The results were actually pretty surprising.
It looks like these "outliers" made these gains in a little less than *2 years*! Amazingly, that's about the amount of time it took GM Hikaru Nakamura to bridge that gap when he was learning chess as a child. So it seems that there is hope for people looking to become strong players. With serious study and dedication, it looks like it's possible to make massive improvements in a reasonably short amount of time.
125 | 
126 | But one disclaimer I would like to add is to emphasize that these players are *outliers*. Only a few thousand, *maybe* a few tens of thousands, have managed to accomplish this, compared to the 820,000-player population included in the dataset. These people comprise just 1% of all players. While their results are *possible*, one could argue that these are not "ordinary" people.
127 | 
128 | The 0 rating gain column is also a bit deceptive, seemingly implying that more time = more rating gain given the low time values listed in that column. There's some *tiny* hint of truth to that, but I think this number is just being dragged down by the mass of newer players. If we remember some of the earlier analysis, most people who actually stay active for longer do not improve to the extent potentially suggested by this heatmap.
129 | 
130 | To get a more detailed view of how *many* players are actually improving, I plotted the rating gain over time again, but this time divided up the data by percentiles according to each player's net rating change within each starting rating band (with 95% confidence intervals):
131 | 
132 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_800-999.png?raw=true)
133 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_1000-1199.png?raw=true)
134 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_1200-1399.png?raw=true)
135 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_1400-1599.png?raw=true)
136 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_1600-1799.png?raw=true)
137 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_1800-1999.png?raw=true)
138 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_2000-2199.png?raw=true)
139 | ![Alt test](./analytics/plots/blitz_elo_over_time/rating_percentiles_2200-2399.png?raw=true)
140 | 
141 | The data looks consistent with the previous analysis, but I think it better illustrates how only a small percentage of players actually do improve. It looks like only about the top 10% of players achieve meaningful improvement (\> 100 rating gain) over time, with only about 1% gaining more than 200 rating points within a few years. The remaining ~90% of players seem to hover around their initial rating despite being active on lichess for several years.
142 | 
143 | One side note about the data: at the 50th percentile in each plot, the line seems to be very noisy and short. Since the percentile assignments are determined by ending rating minus starting rating, and the most common rating change is 0, the 50th percentile seems to be capturing the players who have only played a small number of games and/or have been active for only a short amount of time. I suspect this floods the 50th percentile with players who generate very sparse datapoints, leading to the noisy 50th percentile line shown in the plots.
144 | 
145 | ### Does Playing More Games Make You A Better Chess Player?
146 | 
147 | But while I have the data here, let's take a moment to answer the question everyone's asking: **does playing more games make you a better chess player?**
148 | 
149 | Intuitively, the answer seems like it should be *yes*. Experience seems like it should be a huge factor in someone's chess strength. Playing more also seems to be the go-to advice many of the top players give for improving at chess.
150 | 
151 | This is what the data says:
152 | 
153 | **Elo Rating vs. Number of Games Played**
154 | ![Alt test](./analytics/plots/elo_by_total_games_played_blitz.png?raw=true "elo vs number of games")
155 | **Elo Rating vs. Number of Games Played Per Month**
156 | ![Alt test](./analytics/plots/elo_by_total_games_played_per_month_blitz.png?raw=true "elo vs number of games/month")
157 | **Net Elo Gain vs. Number of Games Played**
158 | ![Alt test](./analytics/plots/elo_diff_by_total_games_played_blitz.png?raw=true "net elo gain vs number of games")
159 | **Net Elo Gain Per Month vs. Number of Games Played Per Month**
160 | ![Alt test](./analytics/plots/elo_diff_per_month_by_total_games_per_month_blitz.png?raw=true "net elo gain/month vs number of games/month")
161 | 
162 | No matter how the data is sliced up, there does not seem to be a clear linear correlation between improvement rate and number of games played. There's *maybe* a slight upward trend in rating gain vs. games per month, but it's a weak trend at best.
163 | 
164 | However, there does seem to be a sweet spot in the rating gain rate: a large portion of the players who have gained the most elo per month cluster around 100-300 games per month, which comes out to a handful of games per day. That may be evidence that playing at least a few games here and there on a consistent basis gives the best chance of improving, but it could also be due to the fact that there are just more data points for players playing at that rate.
165 | 
166 | In case you're curious about how people gained 100+ points per month, I checked the data, and the majority of them are either cheaters, smurf accounts, or bots. For reference, the most improved player is (as of Oct 2021) an 11-year-old world chess champion from Ukraine. His rate of improvement averaged out to 33 points per month over a period of 6 years.
167 | 
168 | Regardless, the data does not give any indication that bingeing chess games like it's a full-time job is going to make you a stronger chess player any faster than playing a few games per day on your lunch breaks.
169 | 
170 | So what does make you a better chess player? I can only assume that it takes additional study beyond mindlessly playing chess games (like many other skills in life). Perhaps practicing tactics, studying chess strategy, or reviewing your games. That seems to be what all the top players do. I would love to take a look at players' tactics data or study habits in relation to their improvement rate, but that data is not readily available/easily obtainable, especially at the scale of this dataset. As a project manager would say, that's probably out of scope for this project.
171 | 
172 | ### Misc. Questions Explored Out Of Curiosity
173 | 
174 | I had a lot of other ideas I wanted to explore while I had this data stored in my postgres database. That's the nice thing about relational databases: you have a lot of flexibility in how you can slice up and analyze the data.
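For example, an ad-hoc question is just a few lines of Python away (a rough sketch; the connection string mirrors analytics/kill_query.py and may differ in your setup):

    import pandas as pd
    import psycopg2

    # Connection parameters follow analytics/kill_query.py; adjust to your setup.
    conn = psycopg2.connect("dbname=lichess_games_db user=joe")
    # e.g. count games per time control ('event' holds single-char codes, B = blitz)
    df = pd.read_sql("SELECT event, COUNT(*) AS games FROM games GROUP BY event;", conn)
    print(df)
    conn.close()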
175 | 
176 | These questions didn't fit with the narrative I kicked this project off with, so I'm just going to dump them here for you to browse through in case you're interested.
177 | 
178 | **How many games are played per time control?**
179 | 
180 | ![Alt test](./analytics/plots/games_per_event.png?raw=true "games played per time control")
181 | 
182 | Seems like blitz is the most popular, and almost no one plays correspondence. That surprised me a bit; I would have thought bullet would be the most popular, just because more bullet games can be played than blitz games in a fixed amount of time.
183 | 
184 | **When do most people play chess?**
185 | 
186 | ![Alt test](./analytics/plots/popular_play_times/games_by_time_of_day.png?raw=true)
187 | ![Alt test](./analytics/plots/popular_play_times/games_by_day.png?raw=true)
188 | ![Alt test](./analytics/plots/popular_play_times/games_by_time_and_day.png?raw=true)
189 | 
190 | Looks like 6pm UTC is when lichess players are most active, and 4am is when they are least active. I guess most lichess players are from Europe, because that lines up with when most people in European time zones would be getting off work and sleeping.
191 | 
192 | There's not much difference between the days of the week, but maybe Sunday has a slight edge over the other days.
193 | 
194 | **Who has played the most games in the dataset?**
195 | 
196 | The most active player in this dataset played 250,516 games (mostly bullet games), about 0.05% of all games in this database. There are only 10 players who have played over 100,000 games in this dataset (ordered from most to least):
197 | 
198 | 1. german11
199 | 2. ribarisah
200 | 3. decidement
201 | 4. ASONINYA
202 | 5. bernes
203 | 6. Grga1
204 | 7. leko29
205 | 8. rezo2015
206 | 9. jiacomo
207 | 10. PILOTVE
208 | 
209 | **What are the highest/lowest ratings in the dataset?**
210 | 
211 | - Lowest: 630 by JAckermaan
212 | - Highest (bot): 3233 by LeelaChessOfficial
213 | - Highest (non-bot): 3196 by penguingim1 and Kuuhaku\_1
214 | - Highest (non-bot blitz): 3031 by penguingim1, Ultimate\_SHSL\_Gamer, BehaardTheBonobo, galbijjim, Kuuhaku\_1, BeepBeepImAJeep, keithster
215 | 
216 | **How many games are played per rating band?**
217 | 
218 | ![Alt test](./analytics/plots/games_per_elo_bracket.png?raw=true)
219 | 
220 | **How many games does the median, average, and maximum user play?**
221 | 
222 | ![Alt test](./analytics/plots/games_per_player.png?raw=true)
223 | 
224 | Here's the number of games players have played in this dataset:
225 | - Median games played: 27
226 | - Average games played: 482
227 | - Maximum games played: 250,516
228 | 
229 | **Which rating bands analyze their games the most?**
230 | 
231 | ![Alt test](./analytics/plots/pct_analyzed_per_elo_bracket.png?raw=true)
232 | 
233 | Most analyzed games are clustered at the top levels, probably analyzed by viewers, or possibly by someone who decided to have a bunch of master games analyzed in bulk. 1800s seem to analyze the least. Perhaps at the higher levels there is less need to rely on the computer and more reliance on a player's own skill. Intuitively, that seems to make sense, at least from what I've seen strong chess players do on youtube. But then again, the difference is really small (only 1-2%) and probably not significant.
234 | 
235 | **What is the typical change between players' starting and ending ratings?**
236 | 
237 | - **Most negative change (banned):** -1045.96. This account (laurent17320) was closed for violating terms of service (probably for intentionally losing games).
They went from 1950 down to 895 rating.
238 | - **Most negative change (not banned):** -1012.33. This account (Gyarados) went from 2145 rating down to 1133. I think this account used to be played by a strong player around the year 2016, then was handed off to a weaker player in 2017 who proceeded to drop the rating down to where it is now.
239 | - **Median change:** 0. Most players don't play more than ~30 or so games before quitting lichess permanently, much less stay active longer than 1 month. Hence the most common rating change is 0.
240 | - **Mean change:** 22. As expected, players tend to get stronger over time, but the average is probably brought down by the majority of casual, non-serious chess players on lichess.
241 | - **Most positive change:** 1404. This is held by the account PeshkaCh, as of this writing (Oct 2021) an 11-year-old world chess champion from Ukraine. Browsing some of the other top accounts, some of them are bots, and many of them have been closed.
242 | - **Highest positive change/month:** 897. I have explored the accounts with the greatest rating change/month; however, most of them are either bots with a very short activity timespan (ok\_zabu) or accounts that have been closed/banned, probably for cheating (as expected). I don't think these accounts are particularly interesting. For reference, the account with the most positive change had an average increase of 33.45 elo/month.
243 | 
244 | ![Alt test](./screenshots/peschkach_bio.png?raw=true)
245 | ![Alt test](./screenshots/peschkach.png?raw=true)
246 | 
247 | **How do most games finish?**
248 | 
249 | - Closed for fair play violation: 25,251 (0.01%)
250 | - Game abandoned: 770,328 (0.17%)
251 | - Lost due to time: 149,958,742 (33%)
252 | - Finished normally: 301,876,379 (67%)
253 | 
254 | **How many games were played where black and/or white had a provisional rating (rating gain/loss of over 30 points)?**
255 | 
256 | 26,778,555, about 6% of all games played
257 | 
258 | **Who played the most games for each time control?**
259 | 
260 | - german11: 211,781 bullet games
261 | - jlomb: 58,966 blitz games
262 | - tutubelezza: 22,674 rapid games (note: Lichess initially categorized rapid games as either blitz or classical, so the data isn't entirely accurate here)
263 | - Snowden: 45,814 classical games
264 | - lapos: 6,870 correspondence games
265 | 
266 | ## How To Do What I Did
267 | 
268 | If you're interested in exploring some of this data yourself, I tried to make it easy to get everything set up on your machine. It will probably help if you're familiar with Docker, Python, SQL, and Airflow.
269 | 
270 | First, you're going to need Docker and Docker Compose installed on your computer. You can visit https://docs.docker.com/get-docker/ for instructions on how to do that. Once you have that installed, you can follow the steps below:
271 | 
272 | Clone this repository to your computer:
273 | 
274 |     git clone <repository-url>
275 | 
276 | Give global write access to the src folder (for docker-compose):
277 | 
278 |     chmod 777 src
279 | 
280 | Run docker-compose up in the root directory of the repository, where the docker-compose.yml file is.
This will set up several Docker containers: the airflow webserver, airflow worker, airflow scheduler, postgres databases (one for chess game data, one for airflow), a redis database (for airflow), and a bash CLI with Python installed to run any scripts (mainly for plotting/running SQL queries):
281 | 
282 |     docker-compose up
283 | 
284 | Once the containers are up and running, you can get to the airflow webserver UI by going to localhost:8080 in your web browser. You can log in with "username" and "password" (the defaults specified in the docker-compose.yml file).
285 | 
286 | You can click the switch to activate the DAG and start loading data into postgres. That's pretty much all you need to do to start loading data into your postgres database. You may want to modify the "airflow\_dag\_local.py" file if you want to download data from particular months. NOTE: There is a DAG that uses Kafka that I was playing with, but it does not work without setting up Kafka (complicated):
287 | 
288 | ![Alt test](./screenshots/airflow_ui.png?raw=true)
289 | 
290 | You can use the CLI container if you want to play with any of the Python code (e.g. to modify the code that transforms the data from lichess to postgres). Use docker ps to find the container's name, then open a shell in it:
291 | 
292 |     docker ps
293 |     docker container exec -it <cli-container-name> bash
294 | 
295 | Alternatively, you could just install the required Python packages from requirements.txt and run the scripts without Docker:
296 | 
297 |     pip install -r requirements.txt
298 | 
299 | You can run SQL queries directly on the postgres database containing all the chess games you've downloaded (e.g. to peek at what the data looks like going into postgres):
300 | 
301 |     docker ps
302 |     docker container exec -it <postgres-container-name> psql lichess_games username
303 | 
304 | 
--------------------------------------------------------------------------------
/airflow_requirements/requirements.txt:
--------------------------------------------------------------------------------
1 | cycler==0.10.0
2 | decorator==5.1.0
3 | kiwisolver==1.3.2
4 | numpy==1.21.2
5 | pandas==1.3.3
6 | Pillow==8.3.2
7 | psycopg2-binary==2.9.1
8 | py==1.10.0
9 | pyparsing==2.4.7
10 | python-dateutil==2.8.2
11 | pytz==2021.3
12 | retry==0.9.2
13 | scipy==1.7.1
14 | six==1.16.0
15 | tqdm==4.62.3
16 | kafka-python==2.0.2
17 | 
--------------------------------------------------------------------------------
/analytics/data_analysis/analysis.txt:
--------------------------------------------------------------------------------
1 | How long does it take to increase elo by 100?
2 | 
3 | [blitz_elo_over_time_-1000_elo_cutoff.png]
4 | 
5 | It depends a lot on what your current elo is. As one might expect, lower elo ratings have the greatest opportunity to improve quickly, while higher elo ratings will take much longer to see improvement.
6 | 
7 | I analyzed the data from the perspective of a player's monthly average, which should be a better estimate of a player's playing strength than the game-by-game elo fluctuation. I'm not particularly interested in cases of players who managed to jump 100 points in one afternoon blitz binge session. I believe those instances can be attributed to random chance rather than those players suddenly having a "eureka" moment that boosted their playing strength by 100 elo points overnight.
8 | 
9 | From the blitz_elo_over_time plots, most players in the 800-1000 rating range (about 6% of players) will see their elo jump up 100 points in just a few months of activity.
Most players in the 1600-2000 range (27% of players) will take 4 years or more to move up just 100 elo points.
10 | 
11 | [elo_diff_by_total_games_played.png]
12 | 
13 | It appears that elo increase is only weakly correlated with the number of games played. There is a very faint upward trend in the dark blue region in the center of the blob in the 0-100 elo change range. This suggests improvement doesn't come solely from experience with games played, but needs to be supplemented with additional study outside of playing games.
14 | 
15 | =====================================
16 | 
17 | What is the improvement rate for players who have increased their elo by at least 100 points?
18 | 
19 | [blitz_elo_over_time_100_elo_cutoff.png]
20 | 
21 | There's a strange jump in rating in the first month for players in this category. It's possible this could be due to players starting out on lichess as underrated and quickly catching up in the first month. I'm going to ignore the first month when calculating the improvement rate.
22 | 
23 | From the chart, players in the 800-1000 range are able to improve their elo by 200 points in just 11 months. Players in the 1600-2000 range are able to improve their elo by 100 points in about 3-4 years, which isn't much different from the average of all players. One explanation for this is that most players at this rating range are already considered "serious" players and most of them should be looking to invest effort into improving. Setting a cutoff in the data for 100 elo points of improvement is not as strong a filter in this case.
24 | 
25 | =====================================
26 | 
27 | What is the typical change between players' starting and ending elo ratings?
28 | 
29 | [elo_diff_per_player.csv]
30 | 
31 | most negative change (banned): -1045.96. This account (laurent17320) was closed for violating terms of service (probably for intentionally losing games). They went from 1950 down to 895 elo.
32 | most negative change (not banned): -1012.33. This account (Gyarados) went from 2145 elo down to 1133. I think this account used to be played by a strong player around the year 2016, then was handed off to a weaker player in 2017 who proceeded to drop the rating down to where it is now.
33 | median change: 0. Most players don't play more than ~30 or so games before quitting lichess permanently, much less stay active longer than 1 month. Hence the most common rating change is 0.
34 | mean change: 22. As expected, players should tend to get stronger over time, but the average is probably brought down by the majority of casual, non-serious chess players on lichess.
35 | most positive change: 1404. This is held by the account PeshkaCh, as of this writing (Sep 2021) an 11-year-old world chess champion from Ukraine. Browsing some of the other top accounts, some of them are bots, many of them have been closed.
36 | Highest positive change/month: 897. I have explored accounts with the greatest elo change/month; however, most of them are either bots with a very short activity timespan (ok_zabu) or accounts that have been closed/banned, probably for cheating (as expected). I don't think these accounts are particularly interesting. For reference, the account with the most positive change had an average increase of 33.45 elo/month.
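These summary numbers can be reproduced from the stored query output. Python (sketch; column names taken from what make_plots.py writes to elo_diff_per_player.csv):

    import pandas as pd
    df = pd.read_csv("query_out_storage/elo_diff_per_player.csv")
    print(df["diff"].median(), df["diff"].mean())  # median / mean elo change
    print(df["diff"].min(), df["diff"].max())      # most negative / positive change
    # average change per month, skipping players active for less than a month
    active = df[df["months_since_start"] > 0]
    print((active["diff"] / active["months_since_start"]).max())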
37 | 
--------------------------------------------------------------------------------
/analytics/data_analysis/basic_data_facts.txt:
--------------------------------------------------------------------------------
1 | What date ranges are included in the data?
2 | 2012-12-31 23:01:03
3 | 2018-09-30 21:53:02
4 | a little over 5.5 years of data between 2013 and 2018
5 | 
6 | SQL:
7 | SELECT min(date_time), max(date_time) FROM games;
8 | 
9 | =============
10 | 
11 | How many games are contained in this dataset?
12 | 452630700
13 | ~450 Million
14 | 
15 | SQL:
16 | select count(*) from games;
17 | 
18 | ==============
19 | 
20 | How many players are contained in this dataset?
21 | 2296327
22 | ~2.3 Million
23 | 
24 | SQL:
25 | select count(*) from user_ids;
26 | 
27 | ==============
28 | 
29 | What are the all time highest and lowest ratings in the dataset?
30 | Lowest: 630 by JAckermaan
31 | Highest (bot): 3233 by LeelaChessOfficial
32 | Highest (non-bot): 3196 by penguingim1 and Kuuhaku_1
33 | Highest (non-bot blitz): 3031 by penguingim1, Ultimate_SHSL_Gamer, BehaardTheBonobo, galbijjim, Kuuhaku_1, BeepBeepImAJeep, keithster
34 | 
35 | SQL:
36 | select * from games join user_ids u1 on u1.id = games.black
37 | join user_ids u2 on u2.id = games.white
38 | where blackelo = (select max(blackelo) from games)
39 | or whiteelo=(SELECT MAX(whiteelo) from games);
40 | 
41 | select * from games join user_ids u1 on u1.id = games.black
42 | join user_ids u2 on u2.id = games.white
43 | where blackelo = (select max(blackelo) from games where blacktitle <> 'BOT')
44 | or whiteelo=(SELECT MAX(whiteelo) from games where whitetitle <> 'BOT');
45 | 
46 | select * from games join user_ids u1 on u1.id = games.black
47 | join user_ids u2 on u2.id = games.white
48 | where blackelo = (select max(blackelo) from games where blacktitle <> 'BOT' and event = 'B')
49 | or whiteelo=(SELECT MAX(whiteelo) from games where whitetitle <> 'BOT' and event = 'B');
50 | 
51 | ==============
52 | 
53 | How many games are played for each time control?
54 | bullet: 139632730 ~140 Million, (31%)
55 | blitz: 210314205 ~210 Million, (47%)
56 | classical: 66543178 ~67 Million, (15%)
57 | correspondence: 1333618 ~1 Million, (0.2%)
58 | rapid: 34806969 ~35 Million (8%)
59 | 
60 | SQL:
61 | SELECT event, termination, count(*) FROM games group by event, termination;
62 | 
63 | ==============
64 | 
65 | How many games were closed for fair play violation?
66 | 25251
67 | ~25 Thousand (0.01%)
68 | 
69 | How many games were abandoned?
70 | 770328
71 | ~770 Thousand (0.17%)
72 | 
73 | How many games were lost due to time?
74 | 149958742
75 | ~150 Million (33%)
76 | 
77 | How many games finished normally?
78 | 301876379
79 | ~301 Million (67%)
80 | 
81 | SQL:
82 | SELECT event, termination, count(*) FROM games group by event, termination;
83 | 
84 | ==============
85 | 
86 | How many games were played where white had a provisional rating (rating gain/loss of over 30 points)?
87 | 13820422
88 | ~14 Million
89 | 
90 | SQL:
91 | SELECT count(*) FROM GAMES
92 | WHERE abs(whiteratingdiff) > 30;
93 | 
94 | ==============
95 | 
96 | How many games were played where black and/or white had a provisional rating (rating gain/loss of over 30 points)?
97 | 26778555
98 | ~27 Million, about 6% of all games played
99 | --> newer players are paired together
100 | 
101 | SQL:
102 | SELECT count(*) FROM GAMES
103 | WHERE abs(whiteratingdiff) > 30
104 | OR abs(blackratingdiff) > 30;
105 | 
106 | ==============
107 | 
108 | How many games does each player play before quitting lichess?
109 | Median games played: 27 110 | Average games played: 482 111 | Maximum games played: 250516 112 | 113 | SQL: 114 | SELECT sub1.white, sub1.white_game_count + sub2.black_game_count as total_games 115 | FROM 116 | (SELECT White, COUNT(*) as white_game_count 117 | FROM games g1 118 | GROUP BY White) as sub1 119 | JOIN 120 | (SELECT Black, COUNT(*) as black_game_count 121 | FROM games g2 122 | GROUP BY Black) as sub2 123 | ON sub1.white = sub2.black 124 | ORDER BY total_games DESC; 125 | 126 | ============== 127 | 128 | Who played the largest number of games? 129 | The most active player in this dataset played 250516 games (mostly bullet games), about 0.05% of all games in this database. 130 | There are only 10 players who have played over 100,000 games (ordered from most to least): 131 | german11 132 | ribarisah 133 | decidement 134 | ASONINYA 135 | bernes 136 | Grga1 137 | leko29 138 | rezo2015 139 | jiacomo 140 | PILOTVE 141 | 142 | SQL: 143 | SELECT sub1.white, sub1.white_game_count + sub2.black_game_count as total_games 144 | FROM 145 | (SELECT White, COUNT(*) as white_game_count 146 | FROM games g1 147 | GROUP BY White) as sub1 148 | JOIN 149 | (SELECT Black, COUNT(*) as black_game_count 150 | FROM games g2 151 | GROUP BY Black) as sub2 152 | ON sub1.white = sub2.black 153 | ORDER BY total_games DESC; 154 | 155 | ============== 156 | 157 | Who played the most games for each time control? 158 | german11, 211781 bullet games 159 | jlomb, 58966 blitz games 160 | tutubelezza, 22674 rapid games #note: Lichess categorized rapid as either blitz/classical initially, so data isn't entirely accurate here 161 | Snowden, 45814 classical games 162 | lapos, 6870 correspondence games 163 | 164 | SQL: #note: I searched the top players manually from the output, then looked up the user_ids afterwards, but you could 165 | # have the query handle that with a join with the user_ids table and order by the event first 166 | SELECT sub1.white, sub1.event, sub1.white_game_count + sub2.black_game_count as total_games 167 | FROM 168 | (SELECT White, event, COUNT(*) as white_game_count 169 | FROM games g1 170 | GROUP BY White, event) as sub1 171 | JOIN 172 | (SELECT Black, event, COUNT(*) as black_game_count 173 | FROM games g2 174 | GROUP BY Black, event) as sub2 175 | ON sub1.white = sub2.black 176 | AND sub1.event = sub2.event 177 | ORDER BY total_games DESC; 178 | 179 | ============== 180 | 181 | How many games are played in each elo bracket? 
182 | 
183 | see plots/games_per_elo_bracket.png for a pie chart
184 | 
185 | 600 <= elo < 800: 113134 (0.03%)
186 | 800 <= elo < 1000: 9289504 (2.1%)
187 | 1000 <= elo < 1200: 33022554 (7.3%)
188 | 1200 <= elo < 1400: 76131114 (16.8%)
189 | 1400 <= elo < 1600: 118401628 (26.2%)
190 | 1600 <= elo < 1800: 116825676 (26.0%)
191 | 1800 <= elo < 2000: 69078142 (15.3%)
192 | 2000 <= elo < 2200: 23779201 (5.3%)
193 | 2200 <= elo < 2400: 5243316 (1.2%)
194 | 2400 <= elo < 2600: 693187 (0.15%)
195 | 2600 <= elo < 2800: 51112 (0.01%)
196 | 2800 <= elo: 2132 (0.0005%)
197 | 
198 | SQL:
199 | SELECT
200 | CASE WHEN LEAST(Whiteelo, Blackelo) < 600 THEN 'elo < 600'
201 | WHEN LEAST(Whiteelo, Blackelo) >= 600 AND LEAST(Whiteelo, Blackelo) < 800 THEN '600 <= elo < 800'
202 | WHEN LEAST(Whiteelo, Blackelo) >= 800 AND LEAST(Whiteelo, Blackelo) < 1000 Then '800 <= elo < 1000'
203 | WHEN LEAST(Whiteelo, Blackelo) >= 1000 AND LEAST(Whiteelo, Blackelo) < 1200 Then '1000 <= elo < 1200'
204 | WHEN LEAST(Whiteelo, Blackelo) >= 1200 AND LEAST(Whiteelo, Blackelo) < 1400 Then '1200 <= elo < 1400'
205 | WHEN LEAST(Whiteelo, Blackelo) >= 1400 AND LEAST(Whiteelo, Blackelo) < 1600 Then '1400 <= elo < 1600'
206 | WHEN LEAST(Whiteelo, Blackelo) >= 1600 AND LEAST(Whiteelo, Blackelo) < 1800 Then '1600 <= elo < 1800'
207 | WHEN LEAST(Whiteelo, Blackelo) >= 1800 AND LEAST(Whiteelo, Blackelo) < 2000 Then '1800 <= elo < 2000'
208 | WHEN LEAST(Whiteelo, Blackelo) >= 2000 AND LEAST(Whiteelo, Blackelo) < 2200 Then '2000 <= elo < 2200'
209 | WHEN LEAST(Whiteelo, Blackelo) >= 2200 AND LEAST(Whiteelo, Blackelo) < 2400 Then '2200 <= elo < 2400'
210 | WHEN LEAST(Whiteelo, Blackelo) >= 2400 AND LEAST(Whiteelo, Blackelo) < 2600 Then '2400 <= elo < 2600'
211 | WHEN LEAST(Whiteelo, Blackelo) >= 2600 AND LEAST(Whiteelo, Blackelo) < 2800 Then '2600 <= elo < 2800'
212 | WHEN LEAST(Whiteelo, Blackelo) >= 2800 Then '2800 <= elo' END as elo_bracket,
213 | COUNT(*) as total_games
214 | FROM games
215 | GROUP by 1;
216 | 
--------------------------------------------------------------------------------
/analytics/kill_query.py:
--------------------------------------------------------------------------------
1 | import psycopg2
2 | 
3 | if __name__ == "__main__":
4 |     #NOTE: cancels all active queries on the database (other than this session's own); can also be run directly on postgres to kill long-running queries
5 |     DB_NAME = "lichess_games_db"
6 |     DB_USER = "joe"
7 |     connect_string = "dbname=" + DB_NAME + " user=" + DB_USER
8 |     conn = psycopg2.connect(connect_string)
9 | 
10 |     q0 = "select pg_cancel_backend(pid) from pg_stat_activity where state='active' and pid <> pg_backend_pid();"
11 |     cur = conn.cursor()
12 |     cur.execute(q0)
13 |     conn.commit()
14 | 
--------------------------------------------------------------------------------
/analytics/make_plots.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import seaborn as sns
3 | import datetime as datetime
4 | import numpy as np
5 | import matplotlib.pyplot as plt
6 | import re
7 | import math
8 | from pathlib import Path
9 | from matplotlib.colors import LogNorm
10 | 
11 | def convert_day(x):
12 |     if x == 0: x = "Sunday"
13 |     elif x == 1: x = "Monday"
14 |     elif x == 2: x = "Tuesday"
15 |     elif x == 3: x = "Wednesday"
16 |     elif x == 4: x = "Thursday"
17 |     elif x == 5: x = "Friday"
18 |     elif x == 6: x = "Saturday"
19 |     return x
20 | 
21 | def group_date_times():
22 |     """ad-hoc function to read the data and format it for plots
23 |     that plot the number of games by
time""" 24 | df = pd.read_csv("./query_out_storage/games_played_by_time_of_day.csv") 25 | #format hours and minutes into date_time 26 | df["time"] = df.apply(lambda x: datetime.time(int(x["hour"]), int(x["minute"])), axis=1) 27 | df["time_str"] = df["time"].apply(str) 28 | df["time_int"] = df.apply(lambda x: x["hour"]*60 + x["minute"], axis=1) 29 | 30 | #grouping into 48 "bins", one for each 30 minutes in a day 31 | df["time_group"] = pd.qcut(df["time_int"],48) 32 | df["day"] = df["day"].apply(convert_day) 33 | return df 34 | 35 | def barplot_games_by_time_of_day(filename): 36 | df = group_date_times() 37 | df2 = df.groupby(by="time_group").sum() 38 | #ordering color palette by number_of_games 39 | pal = sns.color_palette("flare_r", df2.shape[0]) 40 | rank = df2["number_of_games"].argsort().argsort() 41 | #creating barplot 42 | bar = sns.barplot(x=df2.index, y="number_of_games", data=df2, palette=np.array(pal)[rank]) 43 | bar.set(xlabel='UTC Time', ylabel='Total Games Played (Millions)') 44 | bar.set_xticks(range(0,48,2)) 45 | bar.set_xticklabels(range(0,24)) 46 | bar.set_title("Total Number of Games Played by Time of Day") 47 | fig = bar.get_figure() 48 | fig.savefig(filename, dpi=300) 49 | 50 | def barplot_games_by_day(filename): 51 | df = group_date_times() 52 | df2 = df.groupby(by="day").sum() 53 | df2 = df2.sort_values(by="number_of_games") 54 | bar = sns.barplot(x=df2.index, y="number_of_games", data=df2, palette="flare_r") 55 | bar.set(xlabel='Day of the Week', ylabel='Total Games Played (Millions)') 56 | bar.set_title("Total Number of Games Played by Day of the Week") 57 | fig = bar.get_figure() 58 | fig.savefig(filename, dpi=300) 59 | 60 | def lineplot_games_by_time_and_day(filename): 61 | df = group_date_times() 62 | p = sns.relplot(data=df, x="time_int", y="number_of_games", hue="day", kind="line", linewidth=0.3) 63 | for ax in p.axes.flat: 64 | labels = ax.get_xticklabels() 65 | ax.set_xticks(ticks=range(0,24*60,60)) 66 | ax.set_xticklabels(fontsize=8, labels=range(0,24)) 67 | ax.set_title("Total Number of Games Played by Time and Day") 68 | p.set(xlabel='UTC Time', ylabel='Total Games Played') 69 | p.savefig(filename, dpi=300) 70 | 71 | def pieplot_games_per_elo_band(filename): 72 | df = pd.read_csv("query_out_storage/total_games_per_elo_bracket.csv") 73 | df = df.sort_values(by="elo_bracket") 74 | #reordering > 600 and > 800 elo brackets 75 | df = pd.concat([df.iloc[[11],:], df.drop(11, axis=0)], axis=0) 76 | df = pd.concat([df.iloc[[11],:], df.drop(10, axis=0)], axis=0) 77 | #creating pie plot 78 | pct = 100*df.total_games/df.total_games.sum() 79 | labels = ["{0}: {1:1.2f} %".format(i,j) for i,j in zip(df.elo_bracket, pct)] 80 | pie, ax = plt.subplots(figsize=[10,6]) 81 | patches, text = plt.pie(x=df.total_games, wedgeprops={'linewidth':1, 'linestyle':'-', 'edgecolor':'k'}, \ 82 | pctdistance=1, startangle=90) 83 | ax.set_title("Total Number of Games Played per Elo Bracket") 84 | plt.legend(patches, labels, loc='center right', bbox_to_anchor=(1.4, .5), 85 | fontsize=8) 86 | plt.savefig(filename, dpi=300, bbox_inches='tight') 87 | 88 | def pieplot_players_per_elo_band(filename): 89 | df = pd.read_csv("./query_out_storage/elo_diff_per_player.csv") 90 | df = df.groupby("band").count()["player"] 91 | df = df.reset_index() 92 | df = pd.concat([df.iloc[[7],:], df.drop(7, axis=0)], axis=0) 93 | pct = 100*df.player/df.player.sum() 94 | labels = ["{0}: {1:1.2f} %".format(i,j) for i,j in zip(df.band, pct)] 95 | pie, ax = plt.subplots(figsize=[10,6]) 96 | patches, text = 
plt.pie(x=df.player, wedgeprops={'linewidth':1, 'linestyle':'-', 'edgecolor':'k'}, pctdistance=1, startangle=90) 97 | plt.legend(patches, labels, loc='center right', bbox_to_anchor=(1.4, .5),fontsize=8) 98 | ax.set_title("Total Number of Players per Elo Bracket") 99 | plt.savefig(filename, dpi=300, bbox_inches='tight') 100 | 101 | def pieplot_games_per_event(filename): 102 | df = pd.read_csv("./query_out_storage/total_games_per_event.csv") 103 | labels = ["Bullet", "Blitz", "Classical", "Correspondence", "Rapid"] 104 | pie, ax = plt.subplots(figsize=[10,6]) 105 | patches, text, _ = plt.pie(x=df["count"], labels=labels, wedgeprops={'linewidth':1, 'linestyle':'-', 'edgecolor':'k'}, startangle=90, autopct='%1.1f%%') 106 | plt.legend(patches, labels, loc='center right', bbox_to_anchor=(1.4, .5),fontsize=8) 107 | ax.set_title("Games Played per Time Control") 108 | plt.savefig(filename, dpi=300, bbox_inches='tight') 109 | 110 | def histogram_player_churn(filename): 111 | df = pd.read_csv("query_out_storage/total_games_per_player.csv") 112 | h, ax = plt.subplots(figsize=[10,6]) 113 | n, bins, patches = plt.hist(df.total_games, bins=100, log=False, histtype='step') 114 | logbins = np.logspace(np.log10(bins[0]), np.log10(bins[-1]), len(bins)) 115 | h, ax = plt.subplots(figsize=[10,6]) 116 | plt.hist(df.total_games, bins=logbins, log=False, density=True, cumulative=1, histtype='step') 117 | plt.xscale('log') 118 | ax.set_title("Total Games Played per Player Cumulative Histogram") 119 | ax.set_xlabel("Number of Games Played") 120 | ax.set_ylabel("% of players who played x number of games or less") 121 | plt.savefig(filename, dpi=300, bbox_inches='tight') 122 | print("median games played per player: ", np.median(df.total_games)) 123 | print("average games played per player: ", np.average(df.total_games)) 124 | print("maximum games played by one player: ", max(df.total_games)) 125 | 126 | def barplot_pct_analyzed_per_elo_bracket(filename): 127 | df = pd.read_csv("query_out_storage/pct_analyzed_per_elo_bracket.csv") 128 | df["pct_analyzed"] = df["analyzed_games"]/df["total_games"]*100 129 | #reordering > 600 and > 800 elo brackets 130 | df = pd.concat([df.iloc[[11],:], df.drop(11, axis=0)], axis=0) 131 | df = pd.concat([df.iloc[[11],:], df.drop(10, axis=0)], axis=0) 132 | #ordering color palette by pct_analyzed 133 | pal = sns.color_palette("flare_r", df.shape[0]) 134 | rank = df["pct_analyzed"].argsort().argsort() 135 | #creating barplot 136 | bar = sns.barplot(x="elo_bracket", y="pct_analyzed", data=df, palette=np.array(pal)[rank]) 137 | bar.set(xlabel='elo_bracket', ylabel='pct_analyzed (%)') 138 | bar.set_title("Percentage of Games Analyzed by Elo Bracket") 139 | plt.setp(bar.xaxis.get_majorticklabels(), rotation=45, ha="right") 140 | fig = bar.get_figure() 141 | fig.savefig(filename, dpi=300, bbox_inches='tight') 142 | 143 | def lineplot_elo_vs_days(filename): 144 | print("reading csv...") 145 | df = pd.read_csv("query_out_storage/total_blitz_games_per_player_over_time.csv") 146 | print("converting days since start...") 147 | df["days_since_start"] = df["days_since_start"].apply(lambda x: int(re.search(r'\d+', x).group(0))) 148 | df_starting_elo = df[df["days_since_start"] == 0] 149 | #assign rating bands to players 150 | for lower_elo_lim in range(800,2201,200): 151 | upper_elo_lim = lower_elo_lim + 199 152 | df_band = df_starting_elo[(df_starting_elo["min"] >= lower_elo_lim) & (df_starting_elo["min"] <= upper_elo_lim)] 153 | df.loc[df["player"].isin(df_band["player"]), "band"] = f"{lower_elo_lim} 
- {upper_elo_lim}" 154 | df = df.dropna() #drop extreme elo values (600-800 and 2400+) due to low sample size 155 | print("plotting...") 156 | ax = sns.lineplot(x="days_since_start", y="min", hue="band", data=df) 157 | box = ax.get_position() 158 | ax.set_position([box.x0, box.y0,box.width*.8, box.height]) 159 | handles, labels = ax.get_legend_handles_labels() 160 | labels, handles = zip(*sorted(zip(labels, handles), key=lambda x: int(x[0][0:4]))) #sort by elo number instead of str 161 | ax.legend(handles, labels, loc='center left', bbox_to_anchor=(1,.5), title="starting elo") 162 | ax.set_xlabel("days since player's first stable rating") 163 | ax.set_ylabel('stable elo rating') 164 | ax.set_title('elo rating over time (95% confidence interval)') 165 | fig = ax.get_figure() 166 | print("saving figure...") 167 | fig.savefig(filename, dpi=300) 168 | 169 | def lineplot_elo_vs_months(rating_diff_cutoff=-1000): 170 | filename = f"plots/blitz_elo_over_time/blitz_elo_over_time_{rating_diff_cutoff}_elo_cutoff.png" 171 | print("reading csv...") 172 | df = pd.read_csv("query_out_storage/total_blitz_games_per_player_over_time.csv") 173 | print("converting days since start...") 174 | df["days_since_start"] = df["days_since_start"].apply(lambda x: int(re.search(r'\d+', x).group(0))) 175 | df["months_since_start"] = df["days_since_start"].apply(lambda x: math.floor(x/30.5)) 176 | df = df.groupby(["player","months_since_start"]).mean().reset_index() 177 | df_starting_elo = df[df["months_since_start"] == 0] 178 | #assign rating bands to players 179 | for lower_elo_lim in range(800,2201,200): 180 | upper_elo_lim = lower_elo_lim + 200 181 | df_band = df_starting_elo[(df_starting_elo["min"] >= lower_elo_lim) & (df_starting_elo["min"] < upper_elo_lim)] 182 | df.loc[df["player"].isin(df_band["player"]), "band"] = f"{lower_elo_lim} - {upper_elo_lim-1}" 183 | print("calculating starting and ending elo diff...") 184 | df = df.dropna() #drop extreme elo values (600-800 and 2400+) due to low sample size 185 | #get diff between start and end ratings per player 186 | df_tmp = df.groupby(["player"]).min() 187 | df_start_elo = pd.merge(df[["player","min", "band","months_since_start"]], 188 | df_tmp["months_since_start"], on=["player","months_since_start"]) 189 | df_tmp = df.groupby(["player"]).max() 190 | df_end_elo = pd.merge(df[["player","min","months_since_start", "band"]], 191 | df_tmp["months_since_start"], on=["player","months_since_start"]) 192 | df_end_elo = df_end_elo.set_index("player") 193 | df_starting_elo = df_starting_elo.set_index("player") 194 | df_end_elo["diff"] = df_end_elo["min"] - df_starting_elo["min"] 195 | df_end_elo = df_end_elo.reset_index() 196 | #store statistics on diff data 197 | print("saving data in query_out_storage...") 198 | df_end_elo.to_csv("query_out_storage/elo_diff_per_player.csv") 199 | df_end_elo[["band","min","diff"]].groupby("band").median().to_csv("query_out_storage/elo_band_medians.csv") 200 | df_end_elo[["band","min","diff"]].groupby("band").mean().to_csv("query_out_storage/elo_band_means.csv") 201 | df_end_elo[["band","min","diff"]].groupby("band").min().to_csv("query_out_storage/elo_band_mins.csv") 202 | df_end_elo[["band","min","diff"]].groupby("band").max().to_csv("query_out_storage/elo_band_maxs.csv") 203 | df = df[df["player"].isin(df_end_elo["player"].loc[df_end_elo["diff"] > rating_diff_cutoff])] 204 | print(f"{df.shape[0]} rows remain after using a start-end elo diff cutoff of {rating_diff_cutoff}") 205 | n_players = df['player'].nunique() 206 | 
print(f"{n_players} blitz players have gained {rating_diff_cutoff} rating points since their first stable rating") 207 | print("plotting...") 208 | ax = sns.lineplot(x="months_since_start", y="min", hue="band", data=df, alpha=0.6) 209 | plt.grid(b=True, axis='y', linestyle='--', color='black', alpha=0.3) 210 | box = ax.get_position() 211 | ax.set_position([box.x0, box.y0,box.width*.9, box.height]) 212 | ax.set_yticks(np.arange(1000,2600,200)) 213 | handles, labels = ax.get_legend_handles_labels() 214 | labels, handles = zip(*sorted(zip(labels, handles), key=lambda x: int(x[0][0:4]))) #sort by elo number instead of str 215 | ax.legend(handles, labels, loc='center left', bbox_to_anchor=(1.05,.5), title="starting elo") 216 | ax.set_xlabel("months since player's first stable rating") 217 | ax.set_ylabel('stable elo rating') 218 | ax.set_title(f"elo vs. time: {n_players} users who gained > {rating_diff_cutoff} elo") 219 | fig = ax.get_figure() 220 | print(f"saving figure to {filename}...") 221 | fig.savefig(filename, dpi=300, bbox_inches='tight') 222 | return 223 | 224 | def hexbin_elo_vs_games_played(filename, y="diff", mode="net"): 225 | df_total_games = pd.read_csv("query_out_storage/total_games_per_player.csv") 226 | df_elo_diff = pd.read_csv("query_out_storage/elo_diff_per_player.csv") 227 | df_total_games = df_total_games.rename(columns={"white":"player"}) 228 | df = df_total_games.merge(df_elo_diff, on="player") 229 | if mode == "per_month": 230 | print("reading total_blitz_games_per_player_over_time.csv") 231 | df_time = pd.read_csv("query_out_storage/total_blitz_games_per_player_over_time.csv") 232 | print("converting to months_since_start") 233 | df_time["months_since_start"] = df_time["days_since_start"].apply(lambda x: math.floor(int(re.search(r'\d+', x).group(0))/30.5)) 234 | print("grouping by player") 235 | df_time = df_time.groupby(["player"]).max().reset_index() 236 | df = df.merge(df_time, on="player") 237 | df["total_games"] = df["total_games"]/df["months_since_start_y"] 238 | df.replace([np.inf,-np.inf],np.nan,inplace=True) 239 | if y == "diff": 240 | if mode == "per_month": 241 | df["diff"] = df["diff"]/df["months_since_start_y"] 242 | df.replace([np.inf,-np.inf],np.nan,inplace=True) 243 | df = df.dropna() 244 | g = sns.jointplot(data=df, x="total_games", y="diff", kind="hex", ylim=[-1000,1000], xscale='log', bins='log') 245 | else: 246 | df.replace([np.inf,-np.inf],np.nan,inplace=True) 247 | df = df.dropna() 248 | g = sns.jointplot(data=df, x="total_games", y="min_x", kind="hex", xscale='log', bins='log') 249 | ax = g.ax_joint 250 | cbar = plt.colorbar(location='right') 251 | cbar.set_label('Number of players') 252 | if mode == "per_month": 253 | ax.set_xlabel("Number of Games Played per Month") 254 | else: 255 | ax.set_xlabel("Number of Games Played") 256 | if y == "diff": 257 | if mode == "per_month": 258 | ax.set_ylabel('Average Elo Change per Month') 259 | else: 260 | ax.set_ylabel('Net Elo Change') 261 | else: 262 | ax.set_ylabel('Elo Rating') 263 | fig = ax.get_figure() 264 | fig.savefig(filename, dpi=300, bbox_inches='tight') 265 | 266 | def heatmap_elo_vs_count(filename): 267 | df = pd.read_csv("query_out_storage/elo_diff_per_player.csv") 268 | df_pos = df[df["diff"] >= 0] 269 | df_pos["diff_round"] = df_pos["diff"].apply(lambda x: round(x,-2)) 270 | df_grouped = df_pos.groupby(by=["band", "diff_round"]).mean() 271 | df_grouped["count"] = df_pos.groupby(by=["band", "diff_round"]).count()["player"] 272 | df_grouped = df_grouped.reset_index() 273 | df_count_pivot = 
df_grouped.pivot(index="band",columns="diff_round",values="count") 274 | df_count_pivot["tmp"] = [7,6,5,4,3,2,1,8] 275 | df_count_pivot = df_count_pivot.sort_values('tmp').drop('tmp',1) 276 | ax = sns.heatmap(df_count_pivot, square=True, norm=LogNorm(), cbar_kws={'label':'# of Players'}) 277 | ax.set_xlabel("Elo Gain") 278 | ax.set_ylabel("Starting Elo") 279 | ax.set_title("Count of Elo Gain vs Starting Elo Pairs") 280 | plt.savefig(filename, dpi=300, bbox_inches='tight') 281 | 282 | def heatmap_elo_vs_time(filename): 283 | df = pd.read_csv("query_out_storage/elo_diff_per_player.csv") 284 | df_pos = df[df["diff"] >= 0] 285 | df_pos["diff_round"] = df_pos["diff"].apply(lambda x: round(x,-2)) 286 | df_grouped = df_pos.groupby(by=["band", "diff_round"]).mean() 287 | df_grouped["time"] = df_pos.groupby(by=["band", "diff_round"]).mean()["months_since_start"] 288 | df_grouped["time"] = df_grouped["time"].apply(lambda x: round(x)) 289 | df_grouped = df_grouped.reset_index() 290 | df_count_pivot = df_grouped.pivot(index="band",columns="diff_round",values="time") 291 | df_count_pivot["tmp"] = [7,6,5,4,3,2,1,8] 292 | df_count_pivot = df_count_pivot.sort_values('tmp').drop('tmp',1) 293 | ax = sns.heatmap(df_count_pivot, square=True, cbar_kws={'label':'Time to Reach (Months)'}, annot=True) 294 | ax.set_xlabel("Elo Gain") 295 | ax.set_ylabel("Starting Elo") 296 | ax.set_title("Avg Time to Reach Elo Gain vs. Starting Elo") 297 | plt.savefig(filename, dpi=300, bbox_inches='tight') 298 | 299 | def lineplot_elo_vs_time_percentiles(pct_list=[0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]): 300 | df = pd.read_csv("query_out_storage/total_blitz_games_per_player_over_time.csv") 301 | print("aggregating data by months...") 302 | df["days_since_start"] = df["days_since_start"].apply(lambda x: int(re.search(r'\d+', x).group(0))) 303 | df["months_since_start"] = df["days_since_start"].apply(lambda x: math.floor(x/30.5)) 304 | df = df.groupby(["player","months_since_start"]).mean().reset_index() 305 | print("calculating net elo diff per player...") 306 | df_starting_elo = df[df["months_since_start"] == 0] 307 | for lower_elo_lim in range(800,2201,200): 308 | upper_elo_lim = lower_elo_lim + 200 309 | df_band = df_starting_elo[(df_starting_elo["min"] >= lower_elo_lim) & (df_starting_elo["min"] < upper_elo_lim)] 310 | df.loc[df["player"].isin(df_band["player"]), "band"] = f"{lower_elo_lim} - {upper_elo_lim-1}" 311 | 312 | df = df.dropna() 313 | df_tmp = df.groupby(["player"]).min() 314 | df_start_elo = pd.merge(df[["player","min", "band","months_since_start"]],df_tmp["months_since_start"], on=["player","months_since_start"]) 315 | df_tmp = df.groupby(["player"]).max() 316 | df_end_elo = pd.merge(df[["player","min","months_since_start", "band"]],df_tmp["months_since_start"], on=["player","months_since_start"]) 317 | df_end_elo = df_end_elo.set_index("player") 318 | df_starting_elo = df_starting_elo.set_index("player") 319 | df_end_elo["diff"] = df_end_elo["min"] - df_starting_elo["min"] 320 | df_end_elo = df_end_elo.reset_index() 321 | print("calculating percentiles within each rating band...") 322 | #calculate percentile within starting elo band 323 | df_end_elo = df_end_elo.assign(percentile=df_end_elo.groupby("band")["diff"].rank(pct=True)) 324 | #round to nearest percentile of interest 325 | df_end_elo["percentile"] = df_end_elo["percentile"].apply(lambda x: min(pct_list, key=lambda y: abs(y - x))) 326 | df = df.join(df_end_elo[["player","percentile"]].set_index('player'), on="player") 327 | #plot percentile lines 328 | 
for band in df_end_elo["band"].unique(): 329 | print(f"generating percentile plots for {band}...") 330 | ax = sns.lineplot(data=df[df["band"] == band], x="months_since_start", 331 | y="min", size="percentile", hue="percentile", legend="full") 332 | ax.set_xlabel("months since player's first stable rating") 333 | ax.set_ylabel('stable rating') 334 | ax.set_title(f"rating vs. time percentiles: {band} start rating") 335 | fig = ax.get_figure() 336 | fig.savefig(f"plots/blitz_elo_over_time/rating_percentiles_{band.replace(' ','')}.png", dpi=300, bbox_inches='tight') 337 | fig.clf() 338 | 339 | 340 | if __name__ == "__main__": 341 | Path("./plots/popular_play_times").mkdir(parents=True, exist_ok=True) 342 | Path("./plots/blitz_elo_over_time").mkdir(parents=True, exist_ok=True) 343 | #barplot_games_by_time_of_day("./plots/popular_play_times/games_by_time_of_day.png") 344 | #lineplot_games_by_time_and_day("./plots/popular_play_times/games_by_time_and_day.png") 345 | #barplot_games_by_day("./plots/popular_play_times/games_by_day.png") 346 | #pieplot_games_per_elo_band("./plots/games_per_elo_bracket.png") 347 | #histogram_player_churn("./plots/games_per_player.png") 348 | #barplot_pct_analyzed_per_elo_bracket("./plots/pct_analyzed_per_elo_bracket.png") 349 | #lineplot_elo_vs_days("./plots/blitz_elo_over_time/blitz_elo_over_time.png") 350 | #lineplot_elo_vs_months() 351 | #lineplot_elo_vs_months(rating_diff_cutoff=100) 352 | #lineplot_elo_vs_months(rating_diff_cutoff=800) 353 | #pieplot_players_per_elo_band("plots/players_per_elo_bracket.png") 354 | #hexbin_elo_vs_games_played("plots/elo_diff_by_total_games_played.png") 355 | #hexbin_elo_vs_games_played("plots/elo_by_total_games_played.png", y="elo") 356 | #hexbin_elo_vs_games_played("plots/elo_by_total_games_played_per_month_blitz.png", y="elo", mode="per_month") 357 | #hexbin_elo_vs_games_played("plots/elo_diff_by_total_games_played_per_month_blitz.png", mode="per_month") 358 | #pieplot_games_per_event("./plots/games_per_event.png") 359 | #heatmap_elo_vs_count("./plots/blitz_elo_over_time/heatmap_elo_gain_count.png") 360 | #heatmap_elo_vs_time("./plots/blitz_elo_over_time/heatmap_elo_gain_time.png") 361 | lineplot_elo_vs_time_percentiles() 362 | -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/blitz_elo_over_time_-1000_elo_cutoff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/blitz_elo_over_time_-1000_elo_cutoff.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/blitz_elo_over_time_100_elo_cutoff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/blitz_elo_over_time_100_elo_cutoff.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/blitz_elo_over_time_800_elo_cutoff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/blitz_elo_over_time_800_elo_cutoff.png 
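# A minimal, self-contained sketch (toy numbers, not repo output) of the
# percentile-snapping step in lineplot_elo_vs_time_percentiles from
# make_plots.py above: each player's net elo gain is ranked within their
# starting band, then the rank is snapped to the nearest percentile of
# interest so only a handful of percentile lines get plotted.
import pandas as pd

pct_list = [0.1, 0.5, 0.9]
df = pd.DataFrame({"band": ["800 - 999"] * 4,
                   "diff": [-50, 20, 110, 400]})
#rank(pct=True) gives 0.25, 0.50, 0.75, 1.00 for these four players
df["percentile"] = df.groupby("band")["diff"].rank(pct=True)
#snapping keeps the closest value in pct_list: 0.1, 0.5, 0.9, 0.9
df["percentile"] = df["percentile"].apply(lambda x: min(pct_list, key=lambda y: abs(y - x)))
print(df)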
-------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/heatmap_elo_gain_count.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/heatmap_elo_gain_count.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/heatmap_elo_gain_time.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/heatmap_elo_gain_time.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_1000-1199.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_1000-1199.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_1200-1399.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_1200-1399.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_1400-1599.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_1400-1599.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_1600-1799.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_1600-1799.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_1800-1999.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_1800-1999.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_2000-2199.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_2000-2199.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_2200-2399.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_2200-2399.png -------------------------------------------------------------------------------- /analytics/plots/blitz_elo_over_time/rating_percentiles_800-999.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/blitz_elo_over_time/rating_percentiles_800-999.png -------------------------------------------------------------------------------- /analytics/plots/elo_by_total_games_played_blitz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/elo_by_total_games_played_blitz.png -------------------------------------------------------------------------------- /analytics/plots/elo_by_total_games_played_per_month_blitz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/elo_by_total_games_played_per_month_blitz.png -------------------------------------------------------------------------------- /analytics/plots/elo_diff_by_total_games_played_blitz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/elo_diff_by_total_games_played_blitz.png -------------------------------------------------------------------------------- /analytics/plots/elo_diff_per_month_by_total_games_per_month_blitz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/elo_diff_per_month_by_total_games_per_month_blitz.png -------------------------------------------------------------------------------- /analytics/plots/games_per_elo_bracket.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/games_per_elo_bracket.png -------------------------------------------------------------------------------- /analytics/plots/games_per_event.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/games_per_event.png -------------------------------------------------------------------------------- /analytics/plots/games_per_player.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/games_per_player.png -------------------------------------------------------------------------------- /analytics/plots/pct_analyzed_per_elo_bracket.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/pct_analyzed_per_elo_bracket.png 
-------------------------------------------------------------------------------- /analytics/plots/players_per_elo_bracket.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/players_per_elo_bracket.png -------------------------------------------------------------------------------- /analytics/plots/popular_play_times/games_by_day.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/popular_play_times/games_by_day.png -------------------------------------------------------------------------------- /analytics/plots/popular_play_times/games_by_time_and_day.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/popular_play_times/games_by_time_and_day.png -------------------------------------------------------------------------------- /analytics/plots/popular_play_times/games_by_time_of_day.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/analytics/plots/popular_play_times/games_by_time_of_day.png -------------------------------------------------------------------------------- /analytics/psycopg2_query.py: -------------------------------------------------------------------------------- 1 | import psycopg2 2 | import csv 3 | import os 4 | 5 | def select_query(sql_list, conn, filenames=None): 6 | """takes in a query or list of queries, a psycopg2 connection, and a list of filenames 7 | and writes the output to csv files""" 8 | if isinstance(sql_list, str): 9 | sql_list = (sql_list,) 10 | if filenames is None: 11 | filenames = [] 12 | for i in range(0,len(sql_list)): 13 | filenames.append("results_" + str(i) + ".csv") 14 | elif len(sql_list) != len(filenames): 15 | print("length of filenames != number of sql queries") 16 | quit() 17 | 18 | cur = conn.cursor() 19 | for (sql, filename) in zip(sql_list, filenames): 20 | print("RUNNING: \n" + sql) 21 | cur.execute(sql) 22 | colnames = [desc[0] for desc in cur.description] 23 | with open(filename, 'w') as out: 24 | csv_out = csv.writer(out) 25 | csv_out.writerow(colnames) 26 | for r in cur: 27 | csv_out.writerow(r) 28 | 29 | 30 | if __name__ == "__main__": 31 | DB_NAME = os.getenv('POSTGRESQL_DATABASE', 'lichess_games_db') #env variables come from docker-compose.yml 32 | DB_USER = os.getenv('POSTGRESQL_USERNAME','username') 33 | DB_PASSWORD = os.getenv('POSTGRESQL_PASSWORD','password') 34 | HOSTNAME = os.getenv('HOSTNAME','localhost') 35 | PORT = os.getenv('POSTGRESQL_PORT', '5432') 36 | #connect_string = "host=" + HOSTNAME + " dbname=" + DB_NAME + " user=" + DB_USER + " password=" + DB_PASSWORD \ 37 | # + " port=" + PORT 38 | connect_string = "dbname=lichess_games_db user=joe password=password" 39 | conn = psycopg2.connect(connect_string) 40 | 41 | queries = [] 42 | filenames = [] 43 | #baseline query ~30 minutes to run, 7.5M disk page fetches 44 | q0 = "EXPLAIN SELECT COUNT(*) FROM Games;" 45 | 46 | q00 = "SELECT * FROM user_ids;" 47 | 48 | q1 = """ 49 | (SELECT * FROM GAMES g JOIN user_ids u ON g.white = u.id 50 | WHERE g.whiteratingdiff < 30 51 | limit 100) 52 | 
UNION ALL 53 | (SELECT * FROM GAMES g JOIN user_ids u ON g.black = u.id 54 | WHERE g.blackratingdiff < 30 55 | limit 100); 56 | """ 57 | q2 = """ 58 | EXPLAIN 59 | SELECT EXTRACT(DOW FROM date_time) as Day, 60 | EXTRACT(HOUR FROM date_time) as Hour, 61 | EXTRACT(MINUTE FROM date_time) as Minute, 62 | COUNT(date_time) as number_of_games 63 | FROM games 64 | GROUP BY 65 | EXTRACT(DOW FROM date_time), 66 | EXTRACT(HOUR FROM date_time), 67 | EXTRACT(MINUTE FROM date_time) 68 | ORDER BY Day, Hour, Minute; 69 | """ 70 | q3 = """ 71 | SELECT event, white, black, whiteratingdiff, blackratingdiff, whiteelo, blackelo, date_time, site 72 | FROM GAMES WHERE white = 101000 OR black = 101000 73 | ORDER BY event, date_time 74 | """ 75 | #total games per player 76 | q4 = """ 77 | SELECT sub1.white, sub1.white_game_count + sub2.black_game_count as total_games 78 | FROM 79 | (SELECT White, COUNT(*) as white_game_count 80 | FROM games g1 81 | GROUP BY White) as sub1 82 | JOIN 83 | (SELECT Black, COUNT(*) as black_game_count 84 | FROM games g2 85 | GROUP BY Black) as sub2 86 | ON sub1.white = sub2.black 87 | ORDER BY total_games DESC; 88 | """ 89 | #total games per elo bracket 90 | q5 = """ 91 | SELECT 92 | CASE WHEN LEAST(Whiteelo, Blackelo) < 600 THEN 'elo < 600' 93 | WHEN LEAST(Whiteelo, Blackelo) >= 600 AND LEAST(Whiteelo, Blackelo) < 800 THEN '600 <= elo < 800' 94 | WHEN LEAST(Whiteelo, Blackelo) >= 800 AND LEAST(Whiteelo, Blackelo) < 1000 Then '800 <= elo < 1000' 95 | WHEN LEAST(Whiteelo, Blackelo) >= 1000 AND LEAST(Whiteelo, Blackelo) < 1200 Then '1000 <= elo < 1200' 96 | WHEN LEAST(Whiteelo, Blackelo) >= 1200 AND LEAST(Whiteelo, Blackelo) < 1400 Then '1200 <= elo < 1400' 97 | WHEN LEAST(Whiteelo, Blackelo) >= 1400 AND LEAST(Whiteelo, Blackelo) < 1600 Then '1400 <= elo < 1600' 98 | WHEN LEAST(Whiteelo, Blackelo) >= 1600 AND LEAST(Whiteelo, Blackelo) < 1800 Then '1600 <= elo < 1800' 99 | WHEN LEAST(Whiteelo, Blackelo) >= 1800 AND LEAST(Whiteelo, Blackelo) < 2000 Then '1800 <= elo < 2000' 100 | WHEN LEAST(Whiteelo, Blackelo) >= 2000 AND LEAST(Whiteelo, Blackelo) < 2200 Then '2000 <= elo < 2200' 101 | WHEN LEAST(Whiteelo, Blackelo) >= 2200 AND LEAST(Whiteelo, Blackelo) < 2400 Then '2200 <= elo < 2400' 102 | WHEN LEAST(Whiteelo, Blackelo) >= 2400 AND LEAST(Whiteelo, Blackelo) < 2600 Then '2400 <= elo < 2600' 103 | WHEN LEAST(Whiteelo, Blackelo) >= 2600 AND LEAST(Whiteelo, Blackelo) < 2800 Then '2600 <= elo < 2800' 104 | WHEN LEAST(Whiteelo, Blackelo) >= 2800 Then '2800 <= elo' END as elo_bracket, 105 | COUNT(*) as total_games 106 | FROM games 107 | GROUP by 1; 108 | """ 109 | 110 | q6 = """ 111 | SELECT sub1.white, sub1.event, sub1.white_game_count + sub2.black_game_count as total_games 112 | FROM 113 | (SELECT White, event, COUNT(*) as white_game_count 114 | FROM games g1 115 | GROUP BY White, event) as sub1 116 | JOIN 117 | (SELECT Black, event, COUNT(*) as black_game_count 118 | FROM games g2 119 | GROUP BY Black, event) as sub2 120 | ON sub1.white = sub2.black 121 | AND sub1.event = sub2.event 122 | ORDER BY total_games DESC; 123 | """ 124 | 125 | q7 = """ 126 | SELECT event, count(*) 127 | FROM games 128 | GROUP BY event; 129 | """ 130 | #number of games analyzed per elo bracket 131 | q8 = """ 132 | EXPLAIN 133 | SELECT 134 | CASE WHEN LEAST(Whiteelo, Blackelo) < 600 THEN 'elo < 600' 135 | WHEN LEAST(Whiteelo, Blackelo) >= 600 AND LEAST(Whiteelo, Blackelo) < 800 THEN '600 <= elo < 800' 136 | WHEN LEAST(Whiteelo, Blackelo) >= 800 AND LEAST(Whiteelo, Blackelo) < 1000 Then '800 <= elo < 1000' 137 | WHEN 
LEAST(Whiteelo, Blackelo) >= 1000 AND LEAST(Whiteelo, Blackelo) < 1200 Then '1000 <= elo < 1200' 138 | WHEN LEAST(Whiteelo, Blackelo) >= 1200 AND LEAST(Whiteelo, Blackelo) < 1400 Then '1200 <= elo < 1400' 139 | WHEN LEAST(Whiteelo, Blackelo) >= 1400 AND LEAST(Whiteelo, Blackelo) < 1600 Then '1400 <= elo < 1600' 140 | WHEN LEAST(Whiteelo, Blackelo) >= 1600 AND LEAST(Whiteelo, Blackelo) < 1800 Then '1600 <= elo < 1800' 141 | WHEN LEAST(Whiteelo, Blackelo) >= 1800 AND LEAST(Whiteelo, Blackelo) < 2000 Then '1800 <= elo < 2000' 142 | WHEN LEAST(Whiteelo, Blackelo) >= 2000 AND LEAST(Whiteelo, Blackelo) < 2200 Then '2000 <= elo < 2200' 143 | WHEN LEAST(Whiteelo, Blackelo) >= 2200 AND LEAST(Whiteelo, Blackelo) < 2400 Then '2200 <= elo < 2400' 144 | WHEN LEAST(Whiteelo, Blackelo) >= 2400 AND LEAST(Whiteelo, Blackelo) < 2600 Then '2400 <= elo < 2600' 145 | WHEN LEAST(Whiteelo, Blackelo) >= 2600 AND LEAST(Whiteelo, Blackelo) < 2800 Then '2600 <= elo < 2800' 146 | WHEN LEAST(Whiteelo, Blackelo) >= 2800 Then '2800 <= elo' END as elo_bracket, 147 | SUM(CASE WHEN analyzed THEN 1 ELSE 0 END) as analyzed_games, 148 | COUNT(*) as total_games 149 | FROM games 150 | GROUP by 1; 151 | """ 152 | 153 | q9 = """ 154 | EXPLAIN 155 | SELECT g1.White as player, g1.whiteelo as elo, g1.event, g1.date_time 156 | FROM games g1 157 | UNION ALL 158 | SELECT g1.Black as player, g1.blackelo as elo, g1.event, g1.date_time 159 | FROM games g1; 160 | """ 161 | #select min and max ratings per player with start and end dates of activity (black games) 162 | q10 = """ 163 | SELECT player, event, min(elo), max(elo), min(date_time), max(date_time) FROM 164 | (SELECT white as player, whiteelo as elo, event, date_time 165 | FROM games 166 | WHERE ABS(whiteratingdiff) < 30 167 | UNION ALL 168 | SELECT black as player, blackelo as elo, event, date_time 169 | FROM games 170 | WHERE ABS(blackratingdiff) < 30) as s 171 | GROUP BY player, event; 172 | """ 173 | #games per player by days since player's first stable rating 174 | q11 = """ 175 | SELECT g.Black as player, MIN(g.blackelo), g.date_time::date, 176 | AGE(g.date_time::date, Min(sub.start_date)) as days_since_start 177 | FROM games g 178 | JOIN (SELECT g2.black, min(g2.date_time::date) as start_date 179 | FROM games g2 180 | WHERE g2.event = 'B' 181 | AND ABS(g2.blackratingdiff) < 30 182 | GROUP BY g2.black) sub 183 | on g.black = sub.black 184 | WHERE g.event = 'B' 185 | AND ABS(g.blackratingdiff) < 30 186 | GROUP BY g.Black, g.date_time::date 187 | """ 188 | q12 = """ 189 | SELECT g.Black, g.White, g.blackelo, g.whiteelo, g.date_time 190 | FROM games g 191 | where g.event = 'B' 192 | AND extract(YEAR from g.date_time) <= 2016; 193 | """ 194 | 195 | queries = [q12] 196 | filenames = ["game_pairs.csv"] 197 | select_query(queries, conn, filenames=filenames) 198 | 199 | #conn.commit() 200 | -------------------------------------------------------------------------------- /analytics/query_out_storage/blitz_game_pairs_2013-2015.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:55d0e74752bb56420a5b81a9955f4f7e4a7eb5ae352175d8e65b9ef21fa13d18 3 | size 496063350 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/elo_band_maxs.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid 
sha256:11abfcf851f68c59535c19e23b85c4d7fd0d114f023d157f51d1acd3e5a3b760 3 | size 264 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/elo_band_means.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:774a078b197f7747f7d994f77f505b65736750090728d31469d955ead4b69cd9 3 | size 405 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/elo_band_medians.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:de5d2eb0866868b1fbf472b9c98d202744ffe6ec164749f9bd291452af4dde77 3 | size 256 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/elo_band_mins.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ccbfa84fcc40023bd85ddf199f0d4db6d7cfc232522b4ac4d9016f99e78332ab 3 | size 252 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/elo_diff_per_player.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:9d21ae693c5e0be7b04f58dcf4ccd10f9c47ef0f8718103ff2e3a0fa4d69a31f 3 | size 39584992 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/elo_diff_per_player.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:5c8d5f42c1df2026833c4a9715b07a1532fe1454da339d77b7484592244ac0d4 3 | size 7910205 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/games_played_by_time_of_day.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:bf0842e6211eaa13f41d75100f004f9b994a8a6e2568afc2255b41121fad844b 3 | size 41825 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/min_max_ratings_per_player.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:0e5faa3f979a0efad7d893be49b5e3495d908f14bd0398b4ab84c7f69351c247 3 | size 33545410 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/pct_analyzed_per_elo_bracket.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:fb6b18ad4f20d99bb1dfc298a70b90b2de4f9dc6f59e61b8567d0c7ea75c9f6b 3 | size 454 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/terminations_by_game_type.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:e40d9cbfe16e5c6861a2322b6846780acab026f7db1e4726079cf42deb92eb3d 3 | size 265 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/total_blitz_games_per_player_over_time.csv.bz2: 
-------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:c521496d60009df36e7cf308f58a773155cedb5bb584a75fcca12295c96facba 3 | size 183363101 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/total_games_per_elo_bracket.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:49bbbc291038dc4b25bb8f299273fe5e86474429c3e4fc743eafef301461ae7b 3 | size 352 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/total_games_per_event.csv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:a8b2cae77499f6966c19541aa7d52961f11c6a8454c6f4d7f4728778dfef101b 3 | size 74 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/total_games_per_player.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:a8764c76af20105fed82f695f352cd6c053c3df916fba66d3d3247232c979284 3 | size 7039821 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/total_games_per_player_and_event.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:a389caf9141c36825df213b2e896f0c88fafaba1a7b88fcc682058c2dadd3b80 3 | size 13070841 4 | -------------------------------------------------------------------------------- /analytics/query_out_storage/user_ids.csv.bz2: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:d349e62849c8bcb0d76e6ecee4c2bdc7357883210bc00f50f93b4f5d9d0a0654 3 | size 18783794 4 | -------------------------------------------------------------------------------- /analytics/spark.py: -------------------------------------------------------------------------------- 1 | from pyspark.sql import SparkSession 2 | from pyspark.conf import SparkConf 3 | from pyspark import StorageLevel 4 | from time import time 5 | 6 | if __name__ == '__main__': 7 | 8 | spark = SparkSession.builder \ 9 | .master("local") \ 10 | .appName("lichess_games_analysis") \ 11 | .config("spark.jars", "./postgresql-42.2.23.jar") \ 12 | .config("spark.driver.memory", "3g") \ 13 | .config("spark.executor.memory", "3g") \ 14 | .config("spark.memory.offHeap.enabled", True) \ 15 | .config("spark.memory.offHeap.size", "50g") \ 16 | .getOrCreate() 17 | 18 | df = spark.read \ 19 | .format("jdbc") \ 20 | .option("url", "jdbc:postgresql://localhost:5432/lichess_games_db") \ 21 | .option("dbtable", "games") \ 22 | .option("user", "database_admin") \ 23 | .option("password", "password") \ 24 | .option("driver", "org.postgresql.Driver") \ 25 | .load().repartition(800) \ 26 | .persist(StorageLevel.DISK_ONLY) 27 | 28 | df2 = spark.read \ 29 | .format("jdbc") \ 30 | .option("url", "jdbc:postgresql://localhost:5432/lichess_games_db") \ 31 | .option("dbtable", "(SELECT * FROM user_ids WHERE id = 1) as tmp2") \ 32 | .option("user", "database_admin") \ 33 | .option("password", "password") \ 34 | .option("driver", "org.postgresql.Driver") \ 35 | .load().repartition(8) \ 36 | 
.persist(StorageLevel.DISK_ONLY) 37 | 38 | df3 = spark.read \ 39 | .format("jdbc") \ 40 | .option("url", "jdbc:postgresql://localhost:5432/lichess_games_db") \ 41 | .option("dbtable", "user_ids_p") \ 42 | .option("user", "database_admin") \ 43 | .option("password", "password") \ 44 | .option("driver", "org.postgresql.Driver") \ 45 | .load().repartition(8) \ 46 | .persist(StorageLevel.DISK_ONLY) 47 | 48 | 49 | df.createOrReplaceTempView("games") 50 | df2.createOrReplaceTempView("user_ids") 51 | df3.createOrReplaceTempView("user_ids_p") 52 | 53 | t0 = time() 54 | game_count = spark.sql("""SELECT avg(case when analyzed = TRUE then 1 else 0 END) from games;""") 55 | game_count.explain() 56 | game_count.show() 57 | #user_count = spark.sql("""SELECT count(id) from user_ids;""") 58 | #user_count.explain() 59 | #user_count.show() 60 | 61 | print(time() - t0) 62 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '2' 2 | 3 | services: 4 | cli: 5 | build: . 6 | image: python_cli 7 | environment: 8 | - POSTGRESQL_DATABASE=lichess_games 9 | - POSTGRESQL_PASSWORD=password 10 | - POSTGRESQL_USERNAME=username 11 | - HOSTNAME=postgresql-chess 12 | #- POSTGRESQL_PORT=5000 13 | volumes: 14 | - ./cli_container_data:/home/usr/src 15 | stdin_open: true 16 | tty: true 17 | postgresql-chess: 18 | image: 'bitnami/postgresql:12' 19 | environment: 20 | - POSTGRESQL_DATABASE=lichess_games 21 | - POSTGRESQL_PASSWORD=password 22 | - POSTGRESQL_USERNAME=username 23 | volumes: 24 | - 'postgresql_data:/postgresql_chess' 25 | postgresql: 26 | image: docker.io/bitnami/postgresql:10 27 | volumes: 28 | - 'postgresql_data:/bitnami/postgresql' 29 | environment: 30 | - POSTGRESQL_DATABASE=bitnami_airflow 31 | - POSTGRESQL_USERNAME=bn_airflow 32 | - POSTGRESQL_PASSWORD=bitnami1 33 | # ALLOW_EMPTY_PASSWORD is recommended only for development. 34 | - ALLOW_EMPTY_PASSWORD=yes 35 | redis: 36 | image: docker.io/bitnami/redis:6.0 37 | volumes: 38 | - 'redis_data:/bitnami' 39 | environment: 40 | # ALLOW_EMPTY_PASSWORD is recommended only for development. 
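# (hypothetical hardening sketch, not part of this compose file: outside
# development one would drop ALLOW_EMPTY_PASSWORD and set a real secret
# instead, assuming the standard bitnami/redis password variable)
# - REDIS_PASSWORD=change_me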
41 | - ALLOW_EMPTY_PASSWORD=yes 42 | airflow-scheduler: 43 | # TODO: to be reverted to use proper registry/distro on T39132 44 | # image: docker.io/bitnami/airflow-scheduler:2 45 | image: docker.io/bitnami/airflow-scheduler:2 46 | environment: 47 | - AIRFLOW_DATABASE_NAME=bitnami_airflow 48 | - AIRFLOW_DATABASE_USERNAME=bn_airflow 49 | - AIRFLOW_DATABASE_PASSWORD=bitnami1 50 | - AIRFLOW_EXECUTOR=CeleryExecutor 51 | - AIRFLOW_WEBSERVER_HOST=airflow 52 | - AIRFLOW_LOAD_EXAMPLES=no 53 | depends_on: 54 | - "airflow" 55 | volumes: 56 | - airflow_scheduler_data:/bitnami 57 | - ./src:/opt/bitnami/airflow/dags 58 | - ./airflow_requirements:/bitnami/python/ 59 | airflow-worker: 60 | # TODO: to be reverted to use proper registry/distro on T39132 61 | # image: docker.io/bitnami/airflow-worker:2 62 | image: docker.io/bitnami/airflow-worker:2 63 | environment: 64 | - AIRFLOW_DATABASE_NAME=bitnami_airflow 65 | - AIRFLOW_DATABASE_USERNAME=bn_airflow 66 | - AIRFLOW_DATABASE_PASSWORD=bitnami1 67 | - AIRFLOW_EXECUTOR=CeleryExecutor 68 | - AIRFLOW_WEBSERVER_HOST=airflow 69 | - POSTGRESQL_DATABASE=lichess_games 70 | - POSTGRESQL_PASSWORD=password 71 | - POSTGRESQL_USERNAME=username 72 | - BATCH_SIZE=10000 73 | - HOSTNAME=postgresql-chess 74 | #- POSTGRESQL_PORT=5000 75 | depends_on: 76 | - "airflow" 77 | configs: 78 | - mode: 777 79 | volumes: 80 | - airflow_worker_data:/bitnami 81 | - ./src:/opt/bitnami/airflow/dags 82 | - ./airflow_requirements:/bitnami/python/ 83 | airflow: 84 | image: docker.io/bitnami/airflow:2 85 | environment: 86 | - AIRFLOW_DATABASE_NAME=bitnami_airflow 87 | - AIRFLOW_DATABASE_USERNAME=bn_airflow 88 | - AIRFLOW_DATABASE_PASSWORD=bitnami1 89 | - AIRFLOW_EXECUTOR=CeleryExecutor 90 | - AIRFLOW_USERNAME=username 91 | - AIRFLOW_PASSWORD=password 92 | ports: 93 | - '8080:8080' 94 | volumes: 95 | - airflow_data:/bitnami 96 | volumes: 97 | airflow_scheduler_data: 98 | driver: local 99 | airflow_worker_data: 100 | driver: local 101 | airflow_data: 102 | driver: local 103 | postgresql_data: 104 | driver: local 105 | redis_data: 106 | driver: local 107 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | cycler==0.10.0 2 | decorator==5.1.0 3 | kiwisolver==1.3.2 4 | matplotlib==3.4.3 5 | numpy==1.21.2 6 | pandas==1.3.3 7 | Pillow==8.3.2 8 | psycopg2==2.9.1 9 | py==1.10.0 10 | pyparsing==2.4.7 11 | python-dateutil==2.8.2 12 | pytz==2021.3 13 | retry==0.9.2 14 | scipy==1.7.1 15 | seaborn==0.11.2 16 | six==1.16.0 17 | tqdm==4.62.3 18 | kafka-python==2.0.2 19 | -------------------------------------------------------------------------------- /screenshots/airflow_dag.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/airflow_dag.png -------------------------------------------------------------------------------- /screenshots/airflow_dag_closeup.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/airflow_dag_closeup.png -------------------------------------------------------------------------------- /screenshots/airflow_ui.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/airflow_ui.png -------------------------------------------------------------------------------- /screenshots/german11_blitz.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/german11_blitz.png -------------------------------------------------------------------------------- /screenshots/peschkach.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/peschkach.png -------------------------------------------------------------------------------- /screenshots/peschkach_bio.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/peschkach_bio.png -------------------------------------------------------------------------------- /screenshots/processed_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/processed_data.png -------------------------------------------------------------------------------- /screenshots/raw_data_sample.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/raw_data_sample.png -------------------------------------------------------------------------------- /screenshots/user_ids_table.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jcw024/lichess_database_ETL/9596bfdb0167f1e86ccc71d168aa354fee617654/screenshots/user_ids_table.png -------------------------------------------------------------------------------- /src/CONFIG.py: -------------------------------------------------------------------------------- 1 | DB_NAME = 'lichess_games_db' 2 | DB_USER = 'username' 3 | DB_PASSWORD = 'password' 4 | BATCH_SIZE = 10000 5 | -------------------------------------------------------------------------------- /src/airflow_dag_kafka.py: -------------------------------------------------------------------------------- 1 | from datetime import timedelta, datetime 2 | from textwrap import dedent 3 | from download_games import download_file 4 | from producer import start_producer 5 | from airflow import DAG 6 | from airflow.utils.state import State 7 | from airflow.sensors.external_task import ExternalTaskSensor 8 | from airflow.operators.dummy import DummyOperator 9 | from airflow.operators.python import PythonOperator 10 | from airflow.utils.dates import days_ago 11 | import time 12 | import os 13 | 14 | DAG_PATH = os.path.realpath(__file__) 15 | DAG_PATH = '/' + '/'.join(DAG_PATH.split('/')[1:-1]) + '/' 16 | 17 | """ 18 | def download_game(url): 19 | with open('log.txt', 'a') as f: 20 | f.write(f"{datetime.now()} downloading {url}\n") 21 | time.sleep(5) 22 | 23 | def start_producer(url): 24 | with open('log.txt', 'a') as f: 25 | f.write(f"{datetime.now()} starting producer for {url}...\n") 26 | time.sleep(20) 27 | """ 28 | def delete_file(url): 29 | """takes file url and deletes 
file to save disk space. assumes data has been downloaded in ../data and sent by the producer""" 30 | with open(DAG_PATH + 'log.txt', 'a') as f: 31 | f.write(f"{datetime.now()} deleting {url}...\n") 32 | filepath = os.getcwd() + "/../data/" + url.split("/")[-1] 33 | os.remove(filepath) 34 | 35 | 36 | default_args = { 37 | 'owner': 'Joe', 38 | 'depends_on_past':True, 39 | 'email_on_failure': False, 40 | 'email_on_retry': False, 41 | 'retries': 10, 42 | 'retry_delay':timedelta(seconds=30), 43 | 'retry_exponential_backoff':True, 44 | #'max_retry_delay':timedelta(minutes=30), 45 | 'execution_timeout':None 46 | } 47 | 48 | with DAG( 49 | 'lichess_ETL_pipeline_kafka', 50 | default_args=default_args, 51 | description='download game and send to kafka to write to db', 52 | schedule_interval=timedelta(minutes=5), 53 | start_date=days_ago(7), 54 | #end_date=datetime(2021,8,1), 55 | max_active_runs=1, 56 | catchup=False 57 | ) as dag2: 58 | 59 | with open(DAG_PATH + 'download_links.txt','r') as f: 60 | for line in f: 61 | url_list = f.read().splitlines() 62 | url_list = reversed(url_list) #read the links from oldest to newest 63 | 64 | for i, url in enumerate(url_list): 65 | month = url.split("_")[-1].split(".")[0] 66 | download_task = PythonOperator( 67 | python_callable=download_file, 68 | task_id="downloader_" + month, 69 | op_args=(url,), 70 | op_kwargs={'years_to_download': [2013,2014,2015,2016,2017,2018]} 71 | ) 72 | 73 | producer_task = PythonOperator( 74 | python_callable=start_producer, 75 | task_id="producer_" + month, 76 | op_args=(url,) 77 | ) 78 | delete_file_task = PythonOperator( 79 | python_callable=delete_file, 80 | task_id="deletion_" + month, 81 | op_args=(url,) 82 | ) 83 | 84 | if i == 0: 85 | download_task >> producer_task >> delete_file_task 86 | else: 87 | prev_download_task >> download_task 88 | [prev_producer_task, download_task] >> producer_task >> delete_file_task 89 | if i >= 2: 90 | prev_prev_producer_task >> download_task #downloads are usually faster than producer, we don't want 91 | #downloaders to far outpace processing of files to minimize disk usage 92 | if i >= 1: 93 | prev_prev_producer_task = prev_producer_task 94 | prev_download_task = download_task 95 | prev_producer_task = producer_task 96 | -------------------------------------------------------------------------------- /src/airflow_dag_local.py: -------------------------------------------------------------------------------- 1 | from datetime import timedelta, datetime 2 | from textwrap import dedent 3 | from download_games import download_file 4 | from process_file_local import process_file 5 | from airflow import DAG 6 | from airflow.utils.state import State 7 | from airflow.sensors.external_task import ExternalTaskSensor 8 | from airflow.operators.dummy import DummyOperator 9 | from airflow.operators.python import PythonOperator 10 | from airflow.utils.dates import days_ago 11 | import time 12 | import os 13 | 14 | DAG_PATH = os.path.realpath(__file__) 15 | DAG_PATH = '/' + '/'.join(DAG_PATH.split('/')[1:-1]) + '/' 16 | 17 | """ 18 | def download_game(url): 19 | with open('log.txt', 'a') as f: 20 | f.write(f"{datetime.now()} downloading {url}\n") 21 | time.sleep(5) 22 | 23 | def start_producer(url): 24 | with open('log.txt', 'a') as f: 25 | f.write(f"{datetime.now()} starting producer for {url}...\n") 26 | time.sleep(20) 27 | """ 28 | def delete_file(url): 29 | """takes file url and deletes file to save disk space. 
assumes data has been downloaded in ../data and sent by the producer""" 30 | with open(DAG_PATH + 'log.txt', 'a') as f: 31 | f.write(f"{datetime.now()} deleting {url}...\n") 32 | #filepath = os.getcwd() + "/../data/" + url.split("/")[-1] 33 | filepath = DAG_PATH + url.split("/")[-1] 34 | os.remove(filepath) 35 | 36 | 37 | default_args = { 38 | 'owner': 'Joe', 39 | 'depends_on_past':True, 40 | 'email_on_failure': False, 41 | 'email_on_retry': False, 42 | 'retries': 10, 43 | 'retry_delay':timedelta(seconds=30), 44 | 'retry_exponential_backoff':True, 45 | #'max_retry_delay':timedelta(minutes=30), 46 | 'execution_timeout':None 47 | } 48 | 49 | with DAG( 50 | 'lichess_ETL_pipeline_local_processing', 51 | default_args=default_args, 52 | description='download game and process data into database locally', 53 | schedule_interval=timedelta(minutes=5), 54 | start_date=days_ago(7), 55 | #end_date=datetime(2021,8,1), 56 | max_active_runs=1, 57 | catchup=False 58 | ) as dag2: 59 | 60 | with open(DAG_PATH + 'download_links.txt','r') as f: 61 | for line in f: 62 | url_list = f.read().splitlines() 63 | url_list = reversed(url_list) #read the links from oldest to newest 64 | 65 | for i, url in enumerate(url_list): 66 | month = url.split("_")[-1].split(".")[0] 67 | download_task = PythonOperator( 68 | python_callable=download_file, 69 | task_id="downloader_" + month, 70 | op_args=(url,), 71 | op_kwargs={'years_to_download': [2013,2014,2015,2016,2017,2018]} 72 | ) 73 | 74 | process_task = PythonOperator( 75 | python_callable=process_file, 76 | task_id="processer_" + month, 77 | op_args=(url,) 78 | ) 79 | delete_file_task = PythonOperator( 80 | python_callable=delete_file, 81 | task_id="deletion_" + month, 82 | op_args=(url,) 83 | ) 84 | 85 | if i == 0: 86 | download_task >> process_task >> delete_file_task 87 | else: 88 | prev_download_task >> download_task 89 | [prev_process_task, download_task] >> process_task >> delete_file_task 90 | if i >= 2: 91 | prev_prev_process_task >> download_task #downloads are usually faster than producer, we don't want 92 | #downloaders to far outpace processing of files to minimize disk usage 93 | if i >= 1: 94 | prev_prev_process_task = prev_process_task 95 | prev_download_task = download_task 96 | prev_process_task = process_task 97 | -------------------------------------------------------------------------------- /src/check_db_status.py: -------------------------------------------------------------------------------- 1 | from CONFIG import DB_NAME, DB_USER 2 | import psycopg2 3 | 4 | if __name__ == "__main__": 5 | connect_string = "dbname=" + DB_NAME + " user=" + DB_USER 6 | conn = psycopg2.connect(connect_string) 7 | cur = conn.cursor() 8 | cur.execute("select pg_size_pretty(pg_relation_size('games'));") 9 | print(f"games table size: {cur.fetchone()[0]}") 10 | """ 11 | cur.execute("select * from INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = N'games';") 12 | column_data = cur.fetchall() 13 | columns = [] 14 | for c in column_data: 15 | columns.append(c[3]) 16 | cur.execute("select * from games limit 5;") 17 | print("top 5 rows:") 18 | print(columns) 19 | for row in cur.fetchall(): 20 | print(row) 21 | """ 22 | cur.execute("SELECT COUNT(*) FROM games;") 23 | print(f"total rows: {cur.fetchall()}") 24 | -------------------------------------------------------------------------------- /src/consumer.py: -------------------------------------------------------------------------------- 1 | from kafka import KafkaConsumer 2 | from data_process_util import * 3 | from database_util 
import * 4 | from CONFIG import DB_NAME, DB_USER, BATCH_SIZE #enter these values in CONFIG.public.py, then change CONFIG to CONFIG.public 5 | from datetime import datetime #or rename CONFIG.public.py to CONFIG.py 6 | from tqdm import tqdm 7 | from collections import OrderedDict 8 | from psycopg2.errors import InFailedSqlTransaction 9 | import re 10 | import psycopg2 11 | import psycopg2.extras 12 | import io 13 | 14 | if __name__ == "__main__": 15 | connect_string = "dbname=" + DB_NAME + " user=" + DB_USER 16 | conn = psycopg2.connect(connect_string) 17 | try: #if any exception, write the id_dict to "user_IDs" database table to record new user_IDs before raising error 18 | games_columns = initialize_tables(conn) #create necessary tables in postgresql if they don't already exist 19 | id_dict = load_id_dict(conn) #load dict to assign user IDs to usernames 20 | new_id_dict = {} 21 | 22 | #setup consumer 23 | consumer_configs = { 24 | 'bootstrap_servers':'localhost:9092', 25 | 'group_id':'main_group', 26 | 'auto_offset_reset':'earliest', 27 | 'max_partition_fetch_bytes':1048576*100, 28 | 'enable_auto_commit':True #whether or not to continue where consumer left off or start over 29 | } 30 | 31 | consumer = KafkaConsumer('ChessGamesTopic', **consumer_configs) 32 | print("starting consumer...") 33 | 34 | #consumer will read data until it's read a full game's data, then add the game data to batch 35 | batch = [] #database writes are done in batches to minimize server roundtrips 36 | game = OrderedDict() 37 | for line in tqdm(consumer): 38 | line = line.value.decode('utf-8') 39 | if line == '\n' or line[0] == ' ': continue 40 | try: 41 | key = re.search("\[(.*?) ",line).group(1) 42 | val = re.search(" \"(.*?)\"\]", line).group(1) 43 | if key in ("Date", "Round", "Opening"): continue #skip irrelevant data (adjust if you prefer) 44 | if key not in games_columns + ["UTCDate", "UTCTime"]: continue #if some unforseen data type not in table, skip it 45 | if key in ("White", "Black"): 46 | (val, id_dict, new_id_dict) = assign_user_ID(val, id_dict, new_id_dict) #converts username to user ID and updates id_dict 47 | key, val = format_data(key, val) 48 | game[key] = val 49 | except AttributeError: 50 | pass 51 | 52 | #checks if the line is describing the moves of a game (the line starts with "1"). 
53 | #If so, all the data for the game has been read and we can format the game data 54 | if line[0] == '1': 55 | if 'eval' in line: 56 | game["Analyzed"] = True 57 | else: 58 | game["Analyzed"] = False 59 | game = format_game(game) 60 | if game: 61 | batch.append(game) 62 | game = OrderedDict() #reset game dict variable for the next set of game data 63 | if len(batch) >= BATCH_SIZE: 64 | print(batch) 65 | copy_data(conn, batch, "games") 66 | dump_dict(new_id_dict, conn) 67 | batch = [] 68 | new_id_dict = {} 69 | except (Exception, KeyboardInterrupt) as e: 70 | #on consumer shutdown, write remaining games data and id_dict values to database 71 | print(f"{e} exception raised, writing id_dict to database") 72 | dump_dict(id_dict, conn) 73 | copy_data(conn, batch, "games") 74 | raise e 75 | 76 | 77 | -------------------------------------------------------------------------------- /src/data_process_util.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from collections import OrderedDict 3 | from bz2 import BZ2File as bzopen 4 | import re 5 | 6 | def read_lines(bzip_file): 7 | """takes a bzip file path and returns a generator that yields each line in the file""" 8 | with bzopen(bzip_file,"r") as bzfin: 9 | game_data = [] 10 | for i, line in enumerate(bzfin): 11 | yield line 12 | 13 | def assign_user_ID(username, id_dict, new_id_dict): 14 | """takes a username and gets the ID or assigns a new one if not already in id_dict 15 | returns the ID and id_dict (with the new ID added if a new one was added) 16 | if a new id was added, it will be added to new_id_dict""" 17 | if username in id_dict: 18 | return id_dict[username], id_dict, new_id_dict 19 | elif len(id_dict) == 0: 20 | ID = 1 21 | else: 22 | ID = max(id_dict.values()) + 1 23 | id_dict[username] = ID 24 | new_id_dict[username] = ID 25 | return ID, id_dict, new_id_dict 26 | 27 | def format_data(key, val): 28 | """takes in lichess game key and value and formats the data prior to writing it to the database""" 29 | if key == "Event": 30 | if "bullet" in val.lower(): 31 | val = 'b' 32 | elif "blitz" in val.lower(): 33 | val = 'B' 34 | elif "standard" in val.lower() or "rapid" in val.lower(): 35 | val = 'R' 36 | elif "classical" in val.lower(): 37 | val = 'c' 38 | elif "correspondence" in val.lower(): 39 | val = 'C' 40 | else: 41 | val = '?' 42 | elif key == "UTCDate": 43 | val = datetime.strptime(val, '%Y.%m.%d').date() 44 | elif key == "UTCTime": 45 | val = datetime.strptime(val, '%H:%M:%S').time() 46 | elif key == "Site": 47 | val = re.search("org/(.*)", val).group(1) 48 | elif key in ("WhiteRatingDiff", "BlackRatingDiff", "WhiteElo", "BlackElo"): 49 | if "?" in val: #if any player is "anonymous" or has provisional rating, 50 | val = None #elo data will be NULL. this will trigger the game to be thrown out 51 | else: 52 | val = int(val) 53 | elif key == "Termination": 54 | if val == "Normal": val = 'N' 55 | elif val == "Time forfeit": val = 'F' 56 | elif val == "Abandoned": val = 'A' 57 | else: val = '?' #usually means cheater detected 58 | elif key == "TimeControl": 59 | val = format_time_control(val) 60 | elif key == "Result": 61 | if val == "1/2-1/2": 62 | val = 'D' 63 | elif val == "1-0": 64 | val = 'W' 65 | elif val == "0-1": 66 | val = 'B' 67 | else: 68 | val = '?' 
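#worked examples of the mapping above (illustrative header values of the
#kind lichess PGNs contain; TimeControl goes through format_time_control below):
# format_data("Event", "Rated Blitz game") -> ("Event", 'B')
# format_data("Result", "1-0")             -> ("Result", 'W')
# format_data("WhiteElo", "1500")          -> ("WhiteElo", 1500)
# format_data("WhiteElo", "1500?")         -> ("WhiteElo", None)  #provisional/anonymous rating, game is thrown out later
# format_data("TimeControl", "300+5")      -> ("TimeControl", 500)  #300 + 5*40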
--------------------------------------------------------------------------------
/src/data_process_util.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from collections import OrderedDict
3 | from bz2 import BZ2File as bzopen
4 | import re
5 |
6 | def read_lines(bzip_file):
7 |     """takes a bzip file path and returns a generator that yields each line (as bytes) in the file"""
8 |     #the file is decompressed on the fly, so the multi-GB dumps never have to fit in memory
9 |     with bzopen(bzip_file, "r") as bzfin:
10 |         for line in bzfin:
11 |             yield line
12 |
13 | def assign_user_ID(username, id_dict, new_id_dict):
14 |     """takes a username and returns its ID, assigning the next free ID if the username isn't in id_dict.
15 |     returns the ID along with id_dict and new_id_dict; a newly assigned ID is recorded in both dicts
16 |     so new_id_dict can be written to the user_ids table on the next batch flush"""
17 |     if username in id_dict:
18 |         return id_dict[username], id_dict, new_id_dict
19 |     elif len(id_dict) == 0:
20 |         ID = 1
21 |     else:
22 |         ID = max(id_dict.values()) + 1
23 |     id_dict[username] = ID
24 |     new_id_dict[username] = ID
25 |     return ID, id_dict, new_id_dict
26 |
27 | def format_data(key, val):
28 |     """takes a lichess game key and value and formats the data prior to writing it to the database"""
29 |     if key == "Event":
30 |         if "bullet" in val.lower():
31 |             val = 'b'
32 |         elif "blitz" in val.lower():
33 |             val = 'B'
34 |         elif "standard" in val.lower() or "rapid" in val.lower():
35 |             val = 'R'
36 |         elif "classical" in val.lower():
37 |             val = 'c'
38 |         elif "correspondence" in val.lower():
39 |             val = 'C'
40 |         else:
41 |             val = '?'
42 |     elif key == "UTCDate":
43 |         val = datetime.strptime(val, '%Y.%m.%d').date()
44 |     elif key == "UTCTime":
45 |         val = datetime.strptime(val, '%H:%M:%S').time()
46 |     elif key == "Site":
47 |         val = re.search("org/(.*)", val).group(1) #keep only the game ID from the lichess URL
48 |     elif key in ("WhiteRatingDiff", "BlackRatingDiff", "WhiteElo", "BlackElo"):
49 |         if "?" in val: #if a player is "anonymous" or has a provisional rating,
50 |             val = None #the elo data will be NULL. this will trigger the game to be thrown out
51 |         else:
52 |             val = int(val)
53 |     elif key == "Termination":
54 |         if val == "Normal": val = 'N'
55 |         elif val == "Time forfeit": val = 'F'
56 |         elif val == "Abandoned": val = 'A'
57 |         else: val = '?' #usually means a cheater was detected
58 |     elif key == "TimeControl":
59 |         val = format_time_control(val)
60 |     elif key == "Result":
61 |         if val == "1/2-1/2":
62 |             val = 'D'
63 |         elif val == "1-0":
64 |             val = 'W'
65 |         elif val == "0-1":
66 |             val = 'B'
67 |         else:
68 |             val = '?'
69 |     return (key, val)
70 |
71 | def merge_datetime(game):
72 |     """takes in a game dict and merges the UTCDate and UTCTime fields into a single Date_time field"""
73 |     try:
74 |         game['Date_time'] = datetime.combine(game['UTCDate'], game['UTCTime'])
75 |         del game['UTCDate']
76 |         del game['UTCTime']
77 |     except KeyError: #if either field is missing, just drop whichever one is present
78 |         if 'UTCDate' in game:
79 |             del game['UTCDate']
80 |         if 'UTCTime' in game:
81 |             del game['UTCTime']
82 |     return game
83 |
84 | def format_time_control(time_control):
85 |     """takes a time_control string (e.g. '300+5') and converts it to an int by adding the base time
86 |     to the increment multiplied by 40 moves (which is how lichess categorizes time control type)"""
87 |     try:
88 |         time_control = time_control.split("+")
89 |         return int(time_control[0]) + int(time_control[1])*40
90 |     except ValueError: #non-numeric time controls map to 0
91 |         return 0
92 |
93 | def format_game(game):
94 |     """takes a game dict, throws out games with missing rating data, fills in missing player titles,
95 |     merges the date and time fields, and returns the game sorted by key so the values line up with
96 |     the alphabetically sorted columns of the games table"""
97 |     #game moves are not stored to save disk space, the caller just tracks whether the game was analyzed
98 |     try:
99 |         if any(game[i] is None for i in ["WhiteRatingDiff", "BlackRatingDiff", "WhiteElo", "BlackElo"]):
100 |             return {} #throw out games where any player is "anonymous" or provisionally rated (see format_data)
101 |     except KeyError:
102 |         return {} #a game record missing any of these fields entirely is also thrown out
103 |     if "WhiteTitle" not in game:
104 |         game["WhiteTitle"] = None
105 |     if "BlackTitle" not in game:
106 |         game["BlackTitle"] = None
107 |     game = merge_datetime(game)
108 |     return OrderedDict(sorted(game.items()))
109 |
110 |
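To make the formatting rules concrete, here's what these helpers return for a few sample values (inputs invented; run from the src directory so the import resolves):

    from data_process_util import format_data, format_time_control

    print(format_data("Event", "Rated Blitz game"))  #('Event', 'B')
    print(format_data("WhiteElo", "1650"))           #('WhiteElo', 1650)
    print(format_data("WhiteElo", "1500?"))          #('WhiteElo', None) -> format_game will throw the game out
    print(format_time_control("300+5"))              #300 + 5*40 = 500 (base seconds + 40 moves' worth of increment)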
--------------------------------------------------------------------------------
/src/database_util.py:
--------------------------------------------------------------------------------
1 | from psycopg2.errors import InFailedSqlTransaction
2 | import psycopg2
3 | import psycopg2.extras
4 | import io
5 |
6 | def initialize_tables(conn):
7 |     """creates the games and user_ids tables in postgresql if they do not already exist"""
8 |     #database name and user are configured in CONFIG.py
9 |     cur = conn.cursor()
10 |     #columns are sorted alphabetically to match the key order copy_data relies on when inserting rows
11 |     games_columns = sorted(["Event VARCHAR(1) NOT NULL", "Site VARCHAR(8) PRIMARY KEY",
12 |                             "White INT NOT NULL", "Black INT NOT NULL",
13 |                             "Result VARCHAR(1) NOT NULL", "WhiteElo SMALLINT NOT NULL",
14 |                             "BlackElo SMALLINT NOT NULL", "WhiteRatingDiff SMALLINT NOT NULL",
15 |                             "BlackRatingDiff SMALLINT NOT NULL",
16 |                             "ECO VARCHAR(3) NOT NULL", "TimeControl SMALLINT NOT NULL",
17 |                             "Termination VARCHAR(1) NOT NULL", "BlackTitle VARCHAR(3)",
18 |                             "WhiteTitle VARCHAR(3)", "Analyzed BOOLEAN NOT NULL",
19 |                             "Date_time timestamp NOT NULL"])
20 |     cur.execute(
21 |         "CREATE TABLE IF NOT EXISTS games (" + ', '.join(games_columns) + ");"
22 |     )
23 |     cur.execute(
24 |         """
25 |         CREATE TABLE IF NOT EXISTS user_ids (
26 |             ID INT NOT NULL PRIMARY KEY, username varchar(30) NOT NULL)
27 |         """
28 |     )
29 |     conn.commit()
30 |     #possible improvement: create lookup tables for event, result, ECO, termination
31 |     return [col.split(" ")[0] for col in games_columns]
32 |
33 | def copy_data(conn, batch, table, retry=False):
34 |     """takes a list of games (dicts) and a database connection and bulk-inserts the batch using COPY"""
35 |     try:
36 |         cur = conn.cursor()
37 |         csv_object = io.StringIO()
38 |         for item in batch:
39 |             csv_object.write('|'.join(map(csv_format, item.values())) + '\n')
40 |         csv_object.seek(0)
41 |         try:
42 |             cur.copy_from(csv_object, table, sep='|')
43 |         except psycopg2.errors.UniqueViolation: #at least one row already exists, redo the COPY through a temp table
44 |             copy_conflict(csv_object, conn, table)
45 |         conn.commit()
46 |     except InFailedSqlTransaction: #if the sql transaction failed, do a rollback and retry once
47 |         if not retry:
48 |             conn.rollback()
49 |             copy_data(conn, batch, table, retry=True)
50 |         else:
51 |             raise
52 |
53 | def copy_conflict(csv_object, conn, table):
54 |     """re-runs a failed COPY by loading the rows into a temporary table first, then inserting
55 |     them into the real table with ON CONFLICT DO NOTHING to skip duplicate rows"""
56 |     csv_object.seek(0)
57 |     conn.rollback()
58 |     cur = conn.cursor()
59 |     cur.execute("CREATE temporary table __copy as (select * from " + table + " limit 0);")
60 |     cur.copy_from(csv_object, "__copy", sep='|')
61 |     cur.execute("INSERT INTO " + table + " SELECT * FROM __copy ON CONFLICT DO NOTHING")
62 |     cur.execute("DROP TABLE __copy")
63 |     return
64 |
65 | def csv_format(val):
66 |     """used by copy_data to format values for the csv_object passed to copy_from.
67 |     Takes a val of any type and returns the val converted to a str.
68 |     None is converted to '\\N', indicating a null value"""
69 |     if val is None:
70 |         return r'\N'
71 |     return str(val)
72 |
73 | def write_row(columns, values, conn, table):
74 |     """takes a list of columns and a list of values and writes the row to
75 |     a postgresql database table using the provided psycopg2 connection"""
76 |     if len(columns) == 0: return False
77 |     col_text = ', '.join(columns)
78 |     row_text = "%s" + ", %s"*(len(values)-1)
79 |     command = "INSERT INTO " + table + " (" + col_text + ") VALUES (" + row_text + ") ON CONFLICT DO NOTHING"
80 |     try:
81 |         cur = conn.cursor()
82 |         cur.execute(command, list(values))
83 |     except psycopg2.errors.NotNullViolation:
84 |         pass
85 |     conn.commit()
86 |     return True
87 |
88 | def dump_dict(data_dict, conn):
89 |     """writes a dict of username: ID pairs to the user_ids table"""
90 |     id_list = [{"id":i[1],"username":i[0]} for i in data_dict.items()] #formatting to fit the expected input of copy_data
91 |     copy_data(conn, id_list, "user_ids") #copy_data commits, so no extra commit is needed here
92 |     return
93 |
94 | def load_id_dict(conn):
95 |     """reads the 'user_ids' table and returns the data as a dict of username: ID"""
96 |     cur = conn.cursor()
97 |     cur.execute("SELECT * FROM user_ids;")
98 |     user_IDs = cur.fetchall()
99 |     conn.commit() #close the read-only transaction before returning
100 |     id_dict = {}
101 |     for (ID, username) in user_IDs:
102 |         id_dict[username] = ID
103 |     return id_dict
104 |
105 |
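copy_from has no ON CONFLICT handling, so copy_conflict stages the rows in a temporary clone and lets INSERT ... ON CONFLICT DO NOTHING drop rows whose primary key already exists. A standalone sketch of the same pattern on a toy table (table, rows, and connection values are invented; assumes a reachable postgres):

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=lichess_games user=username")  #placeholder credentials
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS demo (id INT PRIMARY KEY, val TEXT)")
    cur.execute("INSERT INTO demo VALUES (1, 'old') ON CONFLICT DO NOTHING")
    rows = io.StringIO("1|dupe\n2|new\n")  #row 1 collides with the existing primary key
    cur.execute("CREATE TEMPORARY TABLE __copy AS (SELECT * FROM demo LIMIT 0)")  #empty clone, same columns
    cur.copy_from(rows, "__copy", sep='|')  #plain COPY into the clone can't hit a conflict
    cur.execute("INSERT INTO demo SELECT * FROM __copy ON CONFLICT DO NOTHING")  #only row 2 survives
    cur.execute("DROP TABLE __copy")
    conn.commit()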
--------------------------------------------------------------------------------
/src/download_games.py:
--------------------------------------------------------------------------------
1 | from multiprocessing.pool import ThreadPool
2 | from urllib.request import urlopen
3 | from urllib.error import HTTPError
4 | from retry import retry
5 | from pathlib import Path
6 | import glob
7 | import os
8 |
9 |
10 | @retry(HTTPError, tries=-1, delay=60)
11 | def urlopen_retry(url):
12 |     """opens a url, retrying every 60 seconds for as long as the server returns an HTTP error"""
13 |     return urlopen(url)
14 |
15 | def download_file(url, years_to_download=None, chunk_size=16*1024):
16 |     """downloads the file at url into this script's directory in chunks,
17 |     skipping files outside years_to_download or already on disk"""
18 |     DAG_PATH = os.path.dirname(os.path.realpath(__file__)) + '/'
19 |     #Path('../lichess_data').mkdir(exist_ok=True)
20 |     filename = url.split("/")[-1]
21 |     year = int(filename.split("_")[-1][:4])
22 |     downloaded = [os.path.basename(p) for p in glob.glob(DAG_PATH + "lichess*")] #files already on disk
23 |     if years_to_download and year not in years_to_download:
24 |         return (url, False)
25 |     if filename not in downloaded:
26 |         #if int(filename.split("-")[-1][:2]) % 2 == 0: #download even months only (save disk space)
27 |         print(f"downloading {filename}...")
28 |         response = urlopen_retry(url)
29 |         with open(DAG_PATH + filename, 'wb') as local_f:
30 |             while True:
31 |                 chunk = response.read(chunk_size) #stream 16KB at a time so the multi-GB files never sit in memory
32 |                 if not chunk:
33 |                     break
34 |                 local_f.write(chunk)
35 |         return (url, True)
36 |     else:
37 |         return (url, False)
38 |
39 | if __name__ == "__main__":
40 |     urls = []
41 |     with open("download_links.txt","r") as url_f:
42 |         years_to_download = [2018, 2019] #limited to 2018/2019 to save disk space
43 |         for line in url_f:
44 |             urls.append(line.replace("\n",""))
45 |     with ThreadPool() as p:
46 |         results = [p.apply_async(download_file, (url, years_to_download)) for url in urls]
47 |         for r in results:
48 |             print(r.get())
49 |
50 |
--------------------------------------------------------------------------------
/src/download_links.txt:
--------------------------------------------------------------------------------
1 | https://database.lichess.org/standard/lichess_db_standard_rated_2021-05.pgn.bz2
2 | 
https://database.lichess.org/standard/lichess_db_standard_rated_2021-04.pgn.bz2 3 | https://database.lichess.org/standard/lichess_db_standard_rated_2021-03.pgn.bz2 4 | https://database.lichess.org/standard/lichess_db_standard_rated_2021-02.pgn.bz2 5 | https://database.lichess.org/standard/lichess_db_standard_rated_2021-01.pgn.bz2 6 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-12.pgn.bz2 7 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-11.pgn.bz2 8 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-10.pgn.bz2 9 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-09.pgn.bz2 10 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-08.pgn.bz2 11 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-07.pgn.bz2 12 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-06.pgn.bz2 13 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-05.pgn.bz2 14 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-04.pgn.bz2 15 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-03.pgn.bz2 16 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-02.pgn.bz2 17 | https://database.lichess.org/standard/lichess_db_standard_rated_2020-01.pgn.bz2 18 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-12.pgn.bz2 19 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-11.pgn.bz2 20 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-10.pgn.bz2 21 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-09.pgn.bz2 22 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-08.pgn.bz2 23 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-07.pgn.bz2 24 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-06.pgn.bz2 25 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-05.pgn.bz2 26 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-04.pgn.bz2 27 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-03.pgn.bz2 28 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-02.pgn.bz2 29 | https://database.lichess.org/standard/lichess_db_standard_rated_2019-01.pgn.bz2 30 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-12.pgn.bz2 31 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-11.pgn.bz2 32 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-10.pgn.bz2 33 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-09.pgn.bz2 34 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-08.pgn.bz2 35 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-07.pgn.bz2 36 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-06.pgn.bz2 37 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-05.pgn.bz2 38 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-04.pgn.bz2 39 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-03.pgn.bz2 40 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-02.pgn.bz2 41 | https://database.lichess.org/standard/lichess_db_standard_rated_2018-01.pgn.bz2 42 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-12.pgn.bz2 43 | 
https://database.lichess.org/standard/lichess_db_standard_rated_2017-11.pgn.bz2 44 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-10.pgn.bz2 45 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-09.pgn.bz2 46 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-08.pgn.bz2 47 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-07.pgn.bz2 48 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-06.pgn.bz2 49 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-05.pgn.bz2 50 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-04.pgn.bz2 51 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-03.pgn.bz2 52 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-02.pgn.bz2 53 | https://database.lichess.org/standard/lichess_db_standard_rated_2017-01.pgn.bz2 54 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-12.pgn.bz2 55 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-11.pgn.bz2 56 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-10.pgn.bz2 57 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-09.pgn.bz2 58 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-08.pgn.bz2 59 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-07.pgn.bz2 60 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-06.pgn.bz2 61 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-05.pgn.bz2 62 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-04.pgn.bz2 63 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-03.pgn.bz2 64 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-02.pgn.bz2 65 | https://database.lichess.org/standard/lichess_db_standard_rated_2016-01.pgn.bz2 66 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-12.pgn.bz2 67 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-11.pgn.bz2 68 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-10.pgn.bz2 69 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-09.pgn.bz2 70 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-08.pgn.bz2 71 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-07.pgn.bz2 72 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-06.pgn.bz2 73 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-05.pgn.bz2 74 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-04.pgn.bz2 75 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-03.pgn.bz2 76 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-02.pgn.bz2 77 | https://database.lichess.org/standard/lichess_db_standard_rated_2015-01.pgn.bz2 78 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-12.pgn.bz2 79 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-11.pgn.bz2 80 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-10.pgn.bz2 81 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-09.pgn.bz2 82 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-08.pgn.bz2 83 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-07.pgn.bz2 84 | 
https://database.lichess.org/standard/lichess_db_standard_rated_2014-06.pgn.bz2
85 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-05.pgn.bz2
86 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-04.pgn.bz2
87 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-03.pgn.bz2
88 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-02.pgn.bz2
89 | https://database.lichess.org/standard/lichess_db_standard_rated_2014-01.pgn.bz2
90 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-12.pgn.bz2
91 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-11.pgn.bz2
92 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-10.pgn.bz2
93 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-09.pgn.bz2
94 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-08.pgn.bz2
95 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-07.pgn.bz2
96 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-06.pgn.bz2
97 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-05.pgn.bz2
98 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-04.pgn.bz2
99 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-03.pgn.bz2
100 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-02.pgn.bz2
101 | https://database.lichess.org/standard/lichess_db_standard_rated_2013-01.pgn.bz2
102 |
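The list above is mechanical -- one URL per monthly dump, newest (2021-05) to oldest (2013-01) -- so it can be regenerated rather than edited by hand when new months are published; a sketch:

    #rebuild download_links.txt for every monthly dump from 2021-05 back to 2013-01
    months = [(y, m) for y in range(2021, 2012, -1) for m in range(12, 0, -1)]
    months = [(y, m) for (y, m) in months if (2013, 1) <= (y, m) <= (2021, 5)]
    with open("download_links.txt", "w") as f:
        for y, m in months:
            f.write(f"https://database.lichess.org/standard/lichess_db_standard_rated_{y}-{m:02d}.pgn.bz2\n")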
--------------------------------------------------------------------------------
/src/process_file_local.py:
--------------------------------------------------------------------------------
1 | #local (non-kafka) variant of consumer.py: reads a downloaded .pgn.bz2 directly and loads it into postgres
2 | from data_process_util import *
3 | from database_util import *
4 | from datetime import datetime
5 | from tqdm import tqdm
6 | from collections import OrderedDict
7 | from psycopg2.errors import InFailedSqlTransaction
8 | import re
9 | import psycopg2
10 | import psycopg2.extras
11 | import os
12 |
13 |
14 | def process_file(url):
15 |     """python function for the airflow dag.
16 |     takes a url whose file has been downloaded and loads the data into the database"""
17 |     DAG_PATH = os.path.dirname(os.path.realpath(__file__)) + '/'
18 |     DB_NAME = os.getenv('POSTGRESQL_DATABASE', 'lichess_games') #env variables come from docker-compose.yml
19 |     DB_USER = os.getenv('POSTGRESQL_USERNAME','username')
20 |     DB_PASSWORD = os.getenv('POSTGRESQL_PASSWORD','password')
21 |     HOSTNAME = os.getenv('HOSTNAME','localhost')
22 |     PORT = os.getenv('POSTGRESQL_PORT', '5432')
23 |     BATCH_SIZE = int(os.getenv('BATCH_SIZE', 10000))
24 |     connect_string = "host=" + HOSTNAME + " dbname=" + DB_NAME + " user=" + DB_USER + " password=" + DB_PASSWORD \
25 |                      + " port=" + PORT
26 |     conn = psycopg2.connect(connect_string)
27 |     try: #if any exception is raised, write new_id_dict to the "user_ids" table so newly assigned user IDs aren't lost
28 |         games_columns = initialize_tables(conn) #create the necessary postgresql tables if they don't already exist
29 |         id_dict = load_id_dict(conn) #load dict mapping usernames to user IDs
30 |         new_id_dict = {}
31 |
32 |         #read lines until a full game's data has been seen, then add the game to the batch
33 |         batch = [] #database writes are done in batches to minimize server roundtrips
34 |         game = OrderedDict()
35 |         data_path = DAG_PATH #+"../lichess_data/"
36 |         filename = url.split('/')[-1]
37 |         filepath = data_path + filename
38 |         lines = read_lines(filepath)
39 |         for line in tqdm(lines):
40 |             if len(line) <= 1: continue
41 |             line = line.decode('utf-8')
42 |             if line == '\n' or line[0] == ' ': continue
43 |             try:
44 |                 key = re.search(r"\[(.*?) ", line).group(1)
45 |                 val = re.search(r" \"(.*?)\"\]", line).group(1)
46 |                 if key in ("Date", "Round", "Opening"): continue #skip irrelevant data (adjust if you prefer)
47 |                 if key not in games_columns + ["UTCDate", "UTCTime"]: continue #skip any unforeseen tag that has no column in the games table
48 |                 if key in ("White", "Black"):
49 |                     (val, id_dict, new_id_dict) = assign_user_ID(val, id_dict, new_id_dict) #converts username to user ID and updates id_dict
50 |                 key, val = format_data(key, val)
51 |                 game[key] = val
52 |             except AttributeError:
53 |                 pass
54 |
55 |             #check if the line describes the moves of a game (the line starts with "1").
56 |             #If so, all the data for the game has been read and we can format the game data
57 |             if line[0] == '1':
58 |                 if 'eval' in line:
59 |                     game["Analyzed"] = True
60 |                 else:
61 |                     game["Analyzed"] = False
62 |                 game = format_game(game)
63 |                 if game:
64 |                     batch.append(game)
65 |                 game = OrderedDict() #reset game dict variable for the next set of game data
66 |                 if len(batch) >= BATCH_SIZE:
67 |                     copy_data(conn, batch, "games")
68 |                     dump_dict(new_id_dict, conn)
69 |                     batch = []
70 |                     new_id_dict = {}
71 |     except (Exception, KeyboardInterrupt) as e:
72 |         #on shutdown, write the remaining games and any unsaved user IDs to the database
73 |         print(f"{e} exception raised, writing new user IDs to database")
74 |         dump_dict(new_id_dict, conn) #only the IDs assigned since the last flush; the rest are already stored
75 |         copy_data(conn, batch, "games")
76 |         raise e
77 |
78 |
79 | if __name__ == "__main__":
80 |     process_file("https://database.lichess.org/standard/lichess_db_standard_rated_2017-04.pgn.bz2")
81 |
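Note that process_file is the same parse-and-batch loop as consumer.py with a different line source (a local bz2 file instead of a kafka topic). If the duplication ever becomes a maintenance burden, the loop could be pulled into a shared helper that both entry points call; a hypothetical sketch, not in the repo:

    #def load_games(lines, conn, games_columns, batch_size):
    #    """shared parse/format/batch loop; `lines` is any iterable of decoded PGN lines"""
    #consumer.py would pass (m.value.decode('utf-8') for m in consumer),
    #process_file_local.py would pass (l.decode('utf-8') for l in read_lines(filepath))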
--------------------------------------------------------------------------------
/src/producer.py:
--------------------------------------------------------------------------------
1 | from kafka import KafkaProducer
2 | from tqdm import tqdm
3 | from data_process_util import read_lines
4 | import os
5 | import glob
6 |
7 | SRC_PATH = os.getcwd() #assumes the cwd is lichess_games/src
8 | BZ2_DATA = glob.glob(SRC_PATH+"/../data/lichess*")
9 |
10 | def start_producer(url):
11 |     """assumes the lichess bzip file specified by url has been downloaded to ../data; creates a kafka
12 |     producer that sends the file's lines to the kafka broker"""
13 |     data_path = SRC_PATH+"/../data/"
14 |     filename = url.split('/')[-1]
15 |     filepath = data_path + filename
16 |     #the very long linger_ms means sends are batched by size (batch_size) rather than by time
17 |     producer = KafkaProducer(bootstrap_servers='localhost:9092', linger_ms=1000*10000, batch_size=16384*50)
18 |     lines = read_lines(filepath)
19 |     for line in tqdm(lines):
20 |         if len(line) <= 1: continue #skip blank lines so the consumer has less to process
21 |         producer.send('ChessGamesTopic', line)
22 |
23 | if __name__ == "__main__":
24 |     #execute producer.py to test on a small dataset
25 |     producer = KafkaProducer(bootstrap_servers='localhost:9092', linger_ms=1000*10000, batch_size=16384*50)
26 |     lines = read_lines('../data/lichess_db_standard_rated_2013-01.pgn.bak.bz2')
27 |     #for bzip_file in BZ2_DATA:
28 |     #    lines = read_lines(bzip_file)
29 |     for line in tqdm(lines):
30 |         if len(line) <= 1: continue
31 |         producer.send('ChessGamesTopic', line)
32 |
--------------------------------------------------------------------------------
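To sanity-check the kafka leg end to end (a sketch, assuming the broker from docker-compose.yml is listening on localhost:9092): run producer.py against a small month, then peek at the topic before starting consumer.py:

    from kafka import KafkaConsumer

    #print the first few messages on the topic; gives up after 10s if nothing has arrived
    consumer = KafkaConsumer('ChessGamesTopic', bootstrap_servers='localhost:9092',
                             auto_offset_reset='earliest', consumer_timeout_ms=10000)
    for i, message in enumerate(consumer):
        print(message.value.decode('utf-8').rstrip())
        if i >= 4:
            break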