├── .DS_Store
├── README.md
├── .gitignore
├── ProjectProposal_Group035_WI24.ipynb
└── DataCheckpoint_Group035_WI24.ipynb
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/COGS108/Group035_WI24/master/.DS_Store
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | This is your group repo for your final project for COGS108.
2 |
3 | This repository is private, and is only visible to the course instructors and your group mates; it is not visible to anyone else.
4 |
5 | Template notebooks for each component are provided. Only work on each notebook prior to its due date. After each submission is due, move on to the next notebook (for example, after the proposal is due, start working in the Data Checkpoint notebook).
6 |
7 | This repository will be frozen on the final project due date. No further changes can be made after that time.
8 |
9 | Your project proposal and final project will be graded based solely on the corresponding project notebooks in this repository.
10 |
11 | Template Jupyter notebooks have been included, with your group number replacing the XXX in the following file names. By each due date, make sure a notebook with the corresponding name below is present in this repository (where XXX is replaced by your group number):
12 |
13 | - `ProjectProposal_groupXXX.ipynb`
14 | - `DataCheckpoint_groupXXX.ipynb`
15 | - `EDACheckpoint_groupXXX.ipynb`
16 | - `FinalProject_groupXXX.ipynb`
17 |
18 | This is *your* repo. You are free to manage the repo as you see fit, edit this README, add data files, add scripts, etc. So long as there are the four files above on due dates with the required information, the rest is up to you all.
19 |
20 | Also, you are free and encouraged to share this project after the course and to add it to your portfolio. Just be sure to fork it to your GitHub at the end of the quarter!
21 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 | dataset/
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 | cover/
54 |
55 | # Translations
56 | *.mo
57 | *.pot
58 |
59 | # Django stuff:
60 | *.log
61 | local_settings.py
62 | db.sqlite3
63 | db.sqlite3-journal
64 |
65 | # Flask stuff:
66 | instance/
67 | .webassets-cache
68 |
69 | # Scrapy stuff:
70 | .scrapy
71 |
72 | # Sphinx documentation
73 | docs/_build/
74 |
75 | # PyBuilder
76 | .pybuilder/
77 | target/
78 |
79 | # Jupyter Notebook
80 | .ipynb_checkpoints
81 |
82 | # IPython
83 | profile_default/
84 | ipython_config.py
85 |
86 | # pyenv
87 | # For a library or package, you might want to ignore these files since the code is
88 | # intended to run in multiple environments; otherwise, check them in:
89 | # .python-version
90 |
91 | # pipenv
92 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
93 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
94 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
95 | # install all needed dependencies.
96 | #Pipfile.lock
97 |
98 | # poetry
99 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
100 | # This is especially recommended for binary packages to ensure reproducibility, and is more
101 | # commonly ignored for libraries.
102 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
103 | #poetry.lock
104 |
105 | # pdm
106 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
107 | #pdm.lock
108 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
109 | # in version control.
110 | # https://pdm.fming.dev/#use-with-ide
111 | .pdm.toml
112 |
113 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
114 | __pypackages__/
115 |
116 | # Celery stuff
117 | celerybeat-schedule
118 | celerybeat.pid
119 |
120 | # SageMath parsed files
121 | *.sage.py
122 |
123 | # Environments
124 | .env
125 | .venv
126 | env/
127 | venv/
128 | ENV/
129 | env.bak/
130 | venv.bak/
131 |
132 | # Spyder project settings
133 | .spyderproject
134 | .spyproject
135 |
136 | # Rope project settings
137 | .ropeproject
138 |
139 | # mkdocs documentation
140 | /site
141 |
142 | # mypy
143 | .mypy_cache/
144 | .dmypy.json
145 | dmypy.json
146 |
147 | # Pyre type checker
148 | .pyre/
149 |
150 | # pytype static type analyzer
151 | .pytype/
152 |
153 | # Cython debug symbols
154 | cython_debug/
155 |
156 | # PyCharm
157 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
158 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
159 | # and can be added to the global gitignore or merged into this file. For a more nuclear
160 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
161 | #.idea/
162 |
163 |
--------------------------------------------------------------------------------
/ProjectProposal_Group035_WI24.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# COGS 108 - Project Proposal"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "# Names\n",
15 | "\n",
16 | "- Erick Amaro Hernandez\n",
17 | "- Sanjith Devineni\n",
18 | "- Alex Macias\n",
19 | "- Lawrence Ong\n",
20 | "- Mingyang Yao"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "# Research Question"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "When competing in a powerlifting competition, choosing how much weight to attempt depends on many factors, such as how you performed on your previous attempts and how well you can judge their difficulty. Our question is: Does the percentage (%) increase in attempt weight selection vary across categorical variables such as weight class, event, and gender? \n"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "## Background and Prior Work"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "The sport of powerlifting consists of 3 main events: the squat, the bench press, and the deadlift. A powerlifting competition gives you 3 attempts at each event. Within an event, no attempt may be lighter than the previous one, and the heaviest successful lift in each event is summed to give the competitor's \"total\" for a meet. \n",
49 | "\n",
50 | "As such, attempt selection is an incredibly important skill for a powerlifter who wants to be successful in the sport. Seasoned coaches suggest that the first attempt be around 90% of an athlete's max in the event, followed by a jump of about 5% from attempt 1 to 2 and a jump of <3% from attempt 2 to 3. Heavier jumps may occur in the last event, the deadlift, because it is the most exciting event of a powerlifting meet and/or because of the unique rules that come from it being the final event of a meet. [1](#cite_note-1) \n",
51 | "\n",
52 | "Most of the hype around the deadlift comes from it contributing on average >40% of an athlete's total regardless of gender, according to a previous analysis of the dataset by Kaggle user Steven Wilson [2](#cite_note-2). The importance of a lifter's deadlift fuels our hypothesis that this lift comes with higher, riskier jumps in weight per attempt on average. Wilson's analysis also shows that the bench press makes up a larger portion of men's totals (around 25%) than of women's totals (around 20%), which we can build on. Finally, the analysis shows the unsurprising trend that higher athlete body weight is associated with higher totals.\n",
53 | "\n",
54 | "[^](#cite_ref-1) Krawczyk, Bryce (6 August 2023). So You Wanna Be a Powerlifter? Attempt Selection During a Meet. *BarBend.com*. https://www.barbend.com/powerlifting-meet-attempt-selection/\n",
55 | "\n",
56 | "[^](#cite_ref-2) Wilson, Steven (23 September 2020). Strength Differences in Powerlifting by Gender. *Kaggle*. https://www.kaggle.com/code/stevenwilson8/strength-differences-in-powerlifting-by-gender\n"
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "# Hypothesis\n"
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "1. We believe there will be no difference between genders when it comes to attempt selection in general. We think this because the different distributions of weight selection on a per-event basis will balance each other out when comparing by gender alone. For example, although women may choose higher squat attempts, their bench attempts might be lower, and vice versa for men.\n",
71 | "2. The deadlift will have the highest jumps due to the hype around the amount of weight lifted and it being the main contributor to an athlete's total.\n",
72 | "3. Following from the last prediction, and from Wilson's past research on the contribution of each event to an athlete's total, we predict that on a per-event basis there will be a difference between men's and women's jumps. For example, since more of men's totals is made up of the bench press, they might have higher jumps than women in the bench press. \n"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "# Data"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "The ideal dataset would be a log of powerlifting competitions and the results of each competitor. Ideally we'd like some sort of identifier for each lifter so that we could track differences as they gain more experience competing. Even more ideal would be data that includes competitors' best lifts outside of competition, but that is never reported when signing up for a competition, so we believe it is not feasibly attainable. \n",
87 | "\n",
88 | "\n",
89 | "We do not need to collect a dataset ourselves, since a suitable one already exists. OpenPowerlifting (openpowerlifting.org) archives powerlifting competition results and has a well-maintained dataset published on Kaggle, which we will be using for the project. Each row is a competitor's performance at a specific meet and contains bodyweight, weight class, attempt weights, whether each attempt was successful, and various scores that are used to rank lifters across the different federations in powerlifting."
90 | ]
91 | },
92 | {
93 | "cell_type": "markdown",
94 | "metadata": {},
95 | "source": [
96 | "# Ethics & Privacy"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "In conducting our research on attempt selection in powerlifting, it's paramount to address ethical and privacy concerns throughout the data science process. Our dataset, sourced from OpenPowerlifting (openpowerlifting.org), provides valuable information on powerlifting competitions, but we recognize the importance of safeguarding the privacy of individuals involved. We will ensure that the data used is anonymized and that participants' consent to data usage is respected, given that the information is publicly available. Additionally, we acknowledge potential biases within the dataset, including underrepresentation of certain demographic groups in powerlifting competitions. To mitigate these biases, we will conduct thorough exploratory data analysis to identify any imbalances and employ strategies such as stratification or oversampling to ensure equitable analysis.\n",
104 | "\n",
105 | "Throughout the analysis, we will maintain transparency regarding the limitations and potential biases of the data. We will present findings in a fair and balanced manner, highlighting uncertainties and caveats associated with the analysis. Post-analysis, we will critically review our findings to assess their implications on different demographic groups and engage in ongoing dialogue with stakeholders to address concerns related to equity and fairness. Our research will adhere to ethical guidelines and best practices in data science, prioritizing transparency, accountability, and respect for the rights and dignity of research participants. By upholding these principles, we aim to contribute responsibly to the understanding of attempt selection in powerlifting while promoting inclusivity and ethical data use."
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "# Team Expectations "
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
119 | "\n",
120 | "## Communication\n",
121 | "* We understand that we are all students with different coursework and workloads, and that someone may be unavailable for days before they can respond to another member's ideas or questions. \n",
122 | "\n",
123 | "* To give some semblance of \"I have seen your message\", since Discord doesn't provide read receipts, we hope to adopt a practice of reacting with an emote to messages to acknowledge them if we don't have time to craft a response at the time of reading it. [Read about reacting to messages on Discord](https://support.discord.com/hc/en-us/articles/12102061808663-Reactions-and-Super-Reactions-FAQ). The tutorial shows how to do it on PC, and can be done on mobile by holding down on a message.\n",
124 | "\n",
125 | "## Collaboration\n",
126 | "* Since this is a group project, everyone should contribute equally or close to equally. Everyone should take responsibility for a suitable share of the work in data wrangling, analysis, report writing, revision, etc. Once work is distributed in a meeting or on Discord, every team member should keep track of updates on their task and report any difficulties encountered so that we can adjust the work distribution and help each other out when needed.\n",
127 | "* Deadlines set by the group need to be met: every member, despite having a different schedule, should complete their part of the work before the deadline.\n"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 | "# Project Timeline Proposal"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "Our main form of communication is Discord. To account for changing schedules such as midterms and other class deadlines, meeting times will be discussed each week and set as early as all our schedules allow, so that we are all on the same page and have time to work on the tasks we split up each week. We believe the table below, which was provided in the template of this proposal, is a good plan for the quarter. \n",
142 | "\n",
143 | "\n",
144 | "| Meeting Date | Meeting Time| Completed Before Meeting | Discuss at Meeting |\n",
145 | "|---|---|---|---|\n",
146 | "| 1/20 | Discuss on Discord | Read & Think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | \n",
147 | "| 1/26 | Discuss on Discord | Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | \n",
148 | "| 2/11 | Discuss on Discord | Edit, finalize, and submit proposal; Search for datasets | Discuss Wrangling and possible analytical approaches|\n",
149 | "| 2/18 | Discuss on Discord | Assign group members to lead each specific part; Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan|\n",
150 | "| 2/25 | Discuss on Discord | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |\n",
151 | "| 3/13 | Discuss on Discord | Complete analysis; Draft results/conclusion/discussion| Review project report; Edit full project |\n",
152 | "| 3/20 | Before 11:59 PM | NA | Turn in Final Project & Group Project Surveys |"
153 | ]
154 | }
155 | ],
156 | "metadata": {
157 | "kernelspec": {
158 | "display_name": "Python 3 (ipykernel)",
159 | "language": "python",
160 | "name": "python3"
161 | },
162 | "language_info": {
163 | "codemirror_mode": {
164 | "name": "ipython",
165 | "version": 3
166 | },
167 | "file_extension": ".py",
168 | "mimetype": "text/x-python",
169 | "name": "python",
170 | "nbconvert_exporter": "python",
171 | "pygments_lexer": "ipython3",
172 | "version": "3.9.7"
173 | }
174 | },
175 | "nbformat": 4,
176 | "nbformat_minor": 2
177 | }
178 |
--------------------------------------------------------------------------------
/DataCheckpoint_Group035_WI24.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback** \n",
8 | "\n",
9 | "Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.\n",
10 | "\n",
11 | "Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed."
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "# COGS 108 - Data Checkpoint"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "# Names\n",
26 | "\n",
27 | "- Erick Amaro Hernandez\n",
28 | "- Sanjith Devineni\n",
29 | "- Alex Macias\n",
30 | "- Lawrence Ong\n",
31 | "- Mingyang Yao"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "# Research Question"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "Our question is: Do weight class, event (type of lift: squat, bench press, or deadlift), and gender affect the percentage (%) increase in attempt weight selection (the amount of weight the competitor chooses for their next attempt)? For example, given that an athlete has lifted 100 kg in one of the events above, would something like gender or weight class influence whether they pick 105 kg, 110 kg, or more for their next attempt? \n"
46 | ]
47 | },
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "## Background and Prior Work"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "The sport of powerlifting consists of 3 main **events**: the squat, the bench press, and the deadlift. A powerlifting competition gives you 3 attempts at each event. Within an event, no attempt may be lighter than the previous one, and the heaviest successful lift in each event is summed to give the competitor's \"**total**\" for a meet. \n",
60 | "\n",
61 | "When competing in a powerlifting competition, choosing how much weight to attempt depends on many factors, such as how you performed on your previous attempts and how well you can judge their difficulty. As such, attempt selection is an incredibly important skill for a powerlifter who wants to be successful in the sport. Seasoned coaches suggest that the first attempt be around 90% of an athlete's max in the event, followed by a jump of about 5% from attempt 1 to 2 and a jump of <3% from attempt 2 to 3. Heavier jumps may occur in the last event, the deadlift, because it is the most exciting event of a powerlifting meet and/or because of the unique rules that come from it being the final event of a meet. [1](#cite_note-1) \n",
62 | "\n",
63 | "Most of the hype around the deadlift comes from it contributing on average >40% of an athlete's total regardless of gender, according to a previous analysis of the dataset by Kaggle user Steven Wilson [2](#cite_note-2). The importance of a lifter's deadlift fuels our hypothesis that this lift comes with higher, riskier jumps in weight per attempt on average. Wilson's analysis also shows that the bench press makes up a larger portion of men's totals (around 25%) than of women's totals (around 20%), which we can build on. Finally, the analysis shows the unsurprising trend that higher athlete body weight is associated with higher totals.\n",
64 | "\n",
65 | "The prior work mostly looks at powerlifting statistics such as athlete totals rather than the psychology behind the sport. We set out to uncover unconscious biases in weight selection for each attempt. If these biases exist and are significant, knowing about them would help athletes plan their attempts more efficiently. An athlete planning a risky jump could choose to be more conservative (being conscious that they may be susceptible to the biases found in this project), potentially adding more pounds or kilograms to their meet total rather than failing their next attempt. Conversely, an athlete could choose to be more aggressive if we find that lifters in their category tend to be too conservative. \n",
66 | "\n",
67 | "[^](#cite_ref-1) Krawczyk, Bryce (6 August 2023). So You Wanna Be a Powerlifter? Attempt Selection During a Meet. *BarBend.com*. https://www.barbend.com/powerlifting-meet-attempt-selection/\n",
68 | "\n",
69 | "[^](#cite_ref-2) Wilson, Steven (23 September 2020). Strength Differences in Powerlifting by Gender. *Kaggle*. https://www.kaggle.com/code/stevenwilson8/strength-differences-in-powerlifting-by-gender\n"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "# Hypothesis\n"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "1. We believe there will be no difference between genders when it comes to attempt selection in general. We think this because the different distributions of weight selection on a per-event basis will balance each other out when comparing by gender alone. For example, although women may choose higher squat attempts, their bench attempts might be lower, and vice versa for men.\n",
84 | "2. The deadlift will have the highest jumps due to the hype around the amount of weight lifted and it being the main contributor to an athlete's total.\n",
85 | "3. Following from the last prediction, and from Wilson's past research on the contribution of each event to an athlete's total, we predict that on a per-event basis there will be a difference between men's and women's jumps. For example, since more of men's totals is made up of the bench press, they might have higher jumps than women in the bench press. \n"
86 | ]
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "# Data"
93 | ]
94 | },
95 | {
96 | "cell_type": "markdown",
97 | "metadata": {},
98 | "source": [
99 | "The ideal dataset would be a log of powerlifting competitions and the results of each competitor. Ideally we'd like some sort of identifier for each lifter so that we could track differences as they gain more experience competing. Even more ideal would be data that includes competitors' best lifts outside of competition, but that is never reported when signing up for a competition, so we believe it is not feasibly attainable. \n",
100 | "\n",
101 | "\n",
102 | "We do not need to collect a dataset ourselves, since a suitable one already exists. OpenPowerlifting (openpowerlifting.org) archives powerlifting competition results and has a well-maintained dataset published on Kaggle, which we will be using for the project. Each row is a competitor's performance at a specific meet and contains bodyweight, weight class, attempt weights, whether each attempt was successful, and various scores that are used to rank lifters across the different federations in powerlifting."
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "## Data overview\n",
110 | "\n",
111 | "For each dataset include the following information\n",
112 | "- Dataset #1\n",
113 | " - Dataset Name: Open Powerlifting Database\n",
114 | " - Link to the dataset: https://www.kaggle.com/datasets/open-powerlifting/powerlifting-database\n",
115 | " - Number of observations: 3,043,013\n",
116 | " - Number of variables: 41\n",
117 | "\n",
118 | "The OpenPowerlifting dataset is a collection of archived powerlifting competition results from 1946 through 1/1/2023.\n",
119 | "\n",
120 | "Every row of the OpenPowerlifting dataset is a record of how a lifter did during a competition. A row contains their age and weight data, which events they competed in, whether they competed equipped (certain competitions don't allow certain equipment to be used during a lift), geographic data such as country, state, and town, and the name of the competition. It also includes columns for different scoring systems that use bodyweight, gender, and total lifted to compare performances across weight and sex classes. All weight data is stored in kilograms.\n",
121 | "\n",
122 | "Powerlifting is a sport whose rules and philosophy vary drastically between organizations. The dataset contains results from 300+ federations, each with its own rules for what counts as a \"valid lift\" and its own weight classes. Thankfully, the bodyweight a lifter weighed in at on the day of the competition is included as a column, and 4 different scores (Dots, Wilks, Glossbrenner, and Goodlift) are precomputed as columns whenever a lifter's performance could be scored.\n",
123 | "\n",
124 | "In order to compare lifters in this dataset, we will standardize to one set of weight classes and one scoring system. Weight classes will be standardized to those that the International Powerlifting Federation uses (with some caveats, since the cutoffs are different for female and male lifters). DOTS will be our selected scoring system because it does not require that an athlete complete all 3 lifts to be calculated and is actively used today.\n",
125 | "\n",
126 | "We also need the jumps between consecutive attempts, for example between a lifter's Squat 1 and Squat 2 attempts; these will be calculated in the following section. In this dataset, a failed attempt is stored as the negative of the attempt weight, so a -100 in the Squat1Kg column means the lifter attempted 100kg and failed. After a failed attempt, a lifter is not allowed to lower the weight for the next attempt, so a repeated weight appears as a 0% jump. These 0% jumps should be ignored so as not to skew any averages we compute."
127 | ]
128 | },
129 | {
130 | "cell_type": "markdown",
131 | "metadata": {},
132 | "source": [
133 | "## Openpowerlifting Database"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 125,
139 | "metadata": {},
140 | "outputs": [
141 | {
142 | "name": "stderr",
143 | "output_type": "stream",
144 | "text": [
145 | "C:\\Users\\Erick\\AppData\\Local\\Temp\\ipykernel_23204\\3511301325.py:3: DtypeWarning: Columns (31,33,35,38) have mixed types. Specify dtype option on import or set low_memory=False.\n",
146 | " df = pd.read_csv(\"./dataset/pl_data.csv\")\n"
147 | ]
148 | },
149 | {
150 | "data": {
151 | "text/plain": [
152 | "(3043013, 41)"
153 | ]
154 | },
155 | "execution_count": 125,
156 | "metadata": {},
157 | "output_type": "execute_result"
158 | }
159 | ],
160 | "source": [
161 | "import numpy as np\n",
162 | "import pandas as pd\n",
163 | "df = pd.read_csv(\"./dataset/pl_data.csv\")\n",
164 | "df.shape"
165 | ]
166 | },
167 | {
168 | "cell_type": "markdown",
169 | "metadata": {},
170 | "source": [
171 | "## NaN values\n",
172 | "\n",
173 | "Differences between how federations categorize their lifters show up here: some recorded an exact age (Age), some recorded only the age range a lifter fell into (AgeClass), and some combined the two in a \"Division\" column. The \"Tested\" column indicates whether athletes were drug tested; it is True if they were and NaN if they weren't. Some federations do not test for PEDs.\n",
174 | "\n",
175 | "Attempt values like Squat\\[1-3\\]Kg are NaN if an athlete did not show up or chose not to take all 3 attempts. There is also a 4th attempt column for each event, used when an error in competition organization leads to the lifter being granted an extra attempt. \n",
176 | "\n",
177 | "BodyweightKg is also NaN if a lifter is a no-show at a competition."
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 120,
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "data": {
187 | "text/plain": [
188 | "Name 0\n",
189 | "Sex 0\n",
190 | "Event 0\n",
191 | "Equipment 0\n",
192 | "Age 1070337\n",
193 | "AgeClass 814256\n",
194 | "BirthYearClass 1000666\n",
195 | "Division 1435\n",
196 | "BodyweightKg 38368\n",
197 | "WeightClassKg 40757\n",
198 | "Squat1Kg 2137906\n",
199 | "Squat2Kg 2147142\n",
200 | "Squat3Kg 2171053\n",
201 | "Squat4Kg 3034560\n",
202 | "Best3SquatKg 1009169\n",
203 | "Bench1Kg 1652910\n",
204 | "Bench2Kg 1668528\n",
205 | "Bench3Kg 1710945\n",
206 | "Bench4Kg 3022026\n",
207 | "Best3BenchKg 358566\n",
208 | "Deadlift1Kg 2021904\n",
209 | "Deadlift2Kg 2039602\n",
210 | "Deadlift3Kg 2080157\n",
211 | "Deadlift4Kg 3020077\n",
212 | "Best3DeadliftKg 836823\n",
213 | "TotalKg 202421\n",
214 | "Place 0\n",
215 | "Dots 227397\n",
216 | "Wilks 227397\n",
217 | "Glossbrenner 227397\n",
218 | "Goodlift 476889\n",
219 | "Tested 804298\n",
220 | "Country 1284501\n",
221 | "State 2408023\n",
222 | "Federation 0\n",
223 | "ParentFederation 1068468\n",
224 | "Date 0\n",
225 | "MeetCountry 0\n",
226 | "MeetState 859344\n",
227 | "MeetTown 411193\n",
228 | "MeetName 0\n",
229 | "dtype: int64"
230 | ]
231 | },
232 | "execution_count": 120,
233 | "metadata": {},
234 | "output_type": "execute_result"
235 | }
236 | ],
237 | "source": [
238 | "df.isna().sum()"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "metadata": {},
244 | "source": [
245 | "## Anomalies \n",
246 | "There are also some rows that are anomalies. The following competitor competed in a Bench Press/Deadlift-only competition, yet is missing a bodyweight and has a best bench despite having no Bench\\[1,2,3\\]Kg values filled in. It is important to note that there is no official way to keep track of powerlifting events, so each meet is recorded differently (for example, in an Excel spreadsheet), and the dataset is subject to errors in data entry."
247 | ]
248 | },
249 | {
250 | "cell_type": "code",
251 | "execution_count": 124,
252 | "metadata": {},
253 | "outputs": [
254 | {
255 | "data": {
256 | "text/plain": [
257 | "Name Fabricio Ubirajara de Assis\n",
258 | "Sex M\n",
259 | "Event B\n",
260 | "Equipment Raw\n",
261 | "Age NaN\n",
262 | "AgeClass NaN\n",
263 | "BirthYearClass 40-49\n",
264 | "Division MO\n",
265 | "BodyweightKg NaN\n",
266 | "WeightClassKg 140+\n",
267 | "Squat1Kg NaN\n",
268 | "Squat2Kg NaN\n",
269 | "Squat3Kg NaN\n",
270 | "Squat4Kg NaN\n",
271 | "Best3SquatKg NaN\n",
272 | "Bench1Kg NaN\n",
273 | "Bench2Kg NaN\n",
274 | "Bench3Kg NaN\n",
275 | "Bench4Kg NaN\n",
276 | "Best3BenchKg 200.0\n",
277 | "Deadlift1Kg NaN\n",
278 | "Deadlift2Kg NaN\n",
279 | "Deadlift3Kg NaN\n",
280 | "Deadlift4Kg NaN\n",
281 | "Best3DeadliftKg NaN\n",
282 | "TotalKg 200.0\n",
283 | "Place 1\n",
284 | "Dots NaN\n",
285 | "Wilks NaN\n",
286 | "Glossbrenner NaN\n",
287 | "Goodlift NaN\n",
288 | "Tested NaN\n",
289 | "Country NaN\n",
290 | "State NaN\n",
291 | "Federation CONBRAP\n",
292 | "ParentFederation GPC\n",
293 | "Date 2018-04-14\n",
294 | "MeetCountry Brazil\n",
295 | "MeetState NaN\n",
296 | "MeetTown Jaú\n",
297 | "MeetName Campeonato Paulista de Supino e Lev.Terra\n",
298 | "Name: 579, dtype: object"
299 | ]
300 | },
301 | "execution_count": 124,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | }
305 | ],
306 | "source": [
307 | "df[df['Name'] == \"Fabricio Ubirajara de Assis\"].iloc[0]"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "## Dropping Data\n",
315 | "\n",
316 | "Since we are looking at attempt selection, we drop any rows where the lifter signed up but did not compete. These are the rows with NaN for the first attempt of every lift, as well as rows whose placement is \"NS\" (no-show). We also drop the Goodlift, Wilks, and Glossbrenner score columns.\n"
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": 107,
322 | "metadata": {},
323 | "outputs": [],
324 | "source": [
325 | "# Drop the Goodlift, Wilks, and Glossbrenner columns, then filter out rows with no first-attempt data or where the lifter was a no-show\n",
326 | "df = df.drop(['Goodlift','Wilks','Glossbrenner'], axis=1)\n",
327 | "df = df[~((df['Squat1Kg'].isna()) & (df['Bench1Kg'].isna()) & (df['Deadlift1Kg'].isna()))]\n",
328 | "df = df[df['Place'] != 'NS']"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "## Standardizing Weight classes\n",
336 | "\n",
337 | "Earlier we discussed how there are 300+ federations, each possibly defining its own weight classes. In this study, we standardize to the weight classes used by the International Powerlifting Federation (IPF), which hosts world championships and appears as a \"Parent Federation\" in the dataset for federations worldwide, meaning those federations follow IPF guidelines to some degree. We chose the IPF for this standardization because of its worldwide popularity. \n",
338 | "\n",
339 | "The weight classes are as follows, defined on [page 4 of the IPF rulebook](https://www.powerlifting.sport/fileadmin/ipf/data/rules/technical-rules/english/IPF_Technical_Rules_Book_2023__1_.pdf). A weight class like 83kg means the lifter weighs under 83kg, but over the next smallest weight class (74kg). \n",
340 | "\n",
341 | "- Men\n",
342 | " - \\[59kg, 66kg, 74kg, 83kg, 93kg, 105kg, 120kg, 120+kg\\]\n",
343 | "- Women\n",
344 | " - \\[47kg, 52kg, 57kg, 63kg, 69kg, 76kg, 84kg, 84+kg\\]\n",
345 | " \n",
346 | "However, this excludes people who do not identify as either (labeled as Mx in federations that support this), so we merge the two categorizations to account for more bodyweights and identities. The resulting weight classes are listed below: we use the men's weight classes and add the lower weight classes that the women's classes support.\n",
347 | "\n",
348 | "\\[43kg, 47kg, 53kg, 59kg, 66kg, 74kg, 83kg, 93kg, 105kg, 120kg, 120+kg\\]"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 108,
354 | "metadata": {},
355 | "outputs": [],
356 | "source": [
357 | "# Map a bodyweight in kg to its combined IPF-style weight class label\n",
358 | "def standardize_weight_class(weight):\n",
359 | "    # `weight == np.nan` is always False, so missing bodyweights must be tested with np.isnan\n",
360 | "    if np.isnan(weight):\n",
361 | "        return np.nan\n",
362 | "\n",
363 | "    combined_ipf_classes = [43, 47, 53, 59, 66, 74, 83, 93, 105, 120]\n",
364 | "    unlimited_class = str(combined_ipf_classes[-1]) + \"+\"\n",
365 | "    # Classes are in ascending order, so the first limit above the bodyweight is the lifter's class\n",
366 | "    for weight_class in combined_ipf_classes:\n",
367 | "        if weight < weight_class:\n",
368 | "            return str(weight_class)\n",
369 | "    return unlimited_class\n",
370 | "\n",
371 | "df['WeightClassKg'] = df['BodyweightKg'].apply(standardize_weight_class)"
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": 126,
377 | "metadata": {},
378 | "outputs": [
379 | {
380 | "data": {
381 | "text/plain": [
382 | "WeightClassKg\n",
383 | "90 249072\n",
384 | "75 232213\n",
385 | "100 230505\n",
386 | "82.5 225302\n",
387 | "67.5 166052\n",
388 | " ... \n",
389 | "36.5 1\n",
390 | "128 1\n",
391 | "135.9 1\n",
392 | "65.3 1\n",
393 | "82.6 1\n",
394 | "Name: count, Length: 388, dtype: int64"
395 | ]
396 | },
397 | "execution_count": 126,
398 | "metadata": {},
399 | "output_type": "execute_result"
400 | }
401 | ],
402 | "source": [
403 | "df['WeightClassKg'].value_counts()"
404 | ]
405 | },
406 | {
407 | "cell_type": "markdown",
408 | "metadata": {},
409 | "source": [
410 | "## Columns of Percentage Increase\n",
411 | "\n",
412 | "To analyze how each lifter's attempts change within a meet, we add columns holding the relative increase between consecutive attempts of the same lift. Each column is named '{lift}{ij}_percent_increase', where the value is the increase from attempt i to attempt j, stored as a fraction (for example, 0.1 means a 10% increase). \n",
413 | "\n",
414 | "However, the earlier null-value analysis showed that many attempt columns contain nulls, so the increase columns contain nulls as well. Failed attempts are recorded as negative values, so we take the absolute value of each attempt before computing any increase. Otherwise, a lifter who made 100kg on the first attempt and then failed 105kg (recorded as -105kg) would appear to have a negative increase."
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 116,
420 | "metadata": {},
421 | "outputs": [],
422 | "source": [
423 | "df_absolute_value = df[['Squat1Kg','Squat2Kg','Squat3Kg','Bench1Kg','Bench2Kg','Bench3Kg','Deadlift1Kg','Deadlift2Kg','Deadlift3Kg']].abs()\n",
424 | "\n",
425 | "df['squat12_percent_increase'] = ((df_absolute_value['Squat2Kg'] - df_absolute_value['Squat1Kg']) / df_absolute_value['Squat1Kg'])\n",
426 | "df['squat23_percent_increase'] = ((df_absolute_value['Squat3Kg'] - df_absolute_value['Squat2Kg']) / df_absolute_value['Squat2Kg'])\n",
427 | "\n",
428 | "df['bench12_percent_increase'] = ((df_absolute_value['Bench2Kg'] - df_absolute_value['Bench1Kg']) / df_absolute_value['Bench1Kg'])\n",
429 | "df['bench23_percent_increase'] = ((df_absolute_value['Bench3Kg'] - df_absolute_value['Bench2Kg']) / df_absolute_value['Bench2Kg'])\n",
430 | "\n",
431 | "df['Deadlift12_percent_increase'] = ((df_absolute_value['Deadlift2Kg'] - df_absolute_value['Deadlift1Kg']) / df_absolute_value['Deadlift1Kg'])\n",
432 | "df['Deadlift23_percent_increase'] = ((df_absolute_value['Deadlift3Kg'] - df_absolute_value['Deadlift2Kg']) / df_absolute_value['Deadlift2Kg'])\n"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": 117,
438 | "metadata": {},
439 | "outputs": [
440 | {
441 | "data": {
442 | "text/plain": [
443 | "Name Alona Vladi\n",
444 | "Sex F\n",
445 | "Event SBD\n",
446 | "Equipment Raw\n",
447 | "Age 33.0\n",
448 | "AgeClass 24-34\n",
449 | "BirthYearClass 24-39\n",
450 | "Division O\n",
451 | "BodyweightKg 58.3\n",
452 | "WeightClassKg 59\n",
453 | "Squat1Kg 75.0\n",
454 | "Squat2Kg 80.0\n",
455 | "Squat3Kg -90.0\n",
456 | "Squat4Kg NaN\n",
457 | "Best3SquatKg 80.0\n",
458 | "Bench1Kg 50.0\n",
459 | "Bench2Kg 55.0\n",
460 | "Bench3Kg 60.0\n",
461 | "Bench4Kg NaN\n",
462 | "Best3BenchKg 60.0\n",
463 | "Deadlift1Kg 95.0\n",
464 | "Deadlift2Kg 105.0\n",
465 | "Deadlift3Kg 107.5\n",
466 | "Deadlift4Kg NaN\n",
467 | "Best3DeadliftKg 107.5\n",
468 | "TotalKg 247.5\n",
469 | "Place 1\n",
470 | "Dots 279.44\n",
471 | "Tested Yes\n",
472 | "Country Russia\n",
473 | "State NaN\n",
474 | "Federation GFP\n",
475 | "ParentFederation NaN\n",
476 | "Date 2019-05-11\n",
477 | "MeetCountry Russia\n",
478 | "MeetState NaN\n",
479 | "MeetTown Bryansk\n",
480 | "MeetName Open Tournament\n",
481 | "squat12_percent_increase 0.066667\n",
482 | "squat23_percent_increase 0.125\n",
483 | "squat34_percent_increase NaN\n",
484 | "bench12_percent_increase 0.1\n",
485 | "bench23_percent_increase 0.090909\n",
486 | "bench34_percent_increase NaN\n",
487 | "Deadlift12_percent_increase 0.105263\n",
488 | "Deadlift23_percent_increase 0.02381\n",
489 | "Name: 0, dtype: object"
490 | ]
491 | },
492 | "execution_count": 117,
493 | "metadata": {},
494 | "output_type": "execute_result"
495 | }
496 | ],
497 | "source": [
498 | "df.iloc[0]"
499 | ]
500 | },
501 | {
502 | "cell_type": "markdown",
503 | "metadata": {},
504 | "source": [
505 | "# Ethics & Privacy"
506 | ]
507 | },
508 | {
509 | "cell_type": "markdown",
510 | "metadata": {},
511 | "source": [
512 | "In conducting our research on attempt selection in powerlifting, it's paramount to address ethical and privacy concerns throughout the data science process. Our dataset, sourced from Openpowerlifting.com, provides valuable information on powerlifting competitions, but we recognize the importance of safeguarding the privacy of individuals involved. We will ensure that the data used is anonymized and that participants' consent to data usage is respected, given that the information is publicly available. Additionally, we acknowledge potential biases within the dataset, including underrepresentation of certain demographic groups in powerlifting competitions. To mitigate these biases, we will conduct thorough exploratory data analysis to identify any imbalances and employ strategies such as stratification or oversampling to ensure equitable analysis.\n",
513 | "\n",
514 | "Throughout the analysis, we will maintain transparency regarding the limitations and potential biases of the data. We will present findings in a fair and balanced manner, highlighting uncertainties and caveats associated with the analysis. Post-analysis, we will critically review our findings to assess their implications on different demographic groups and engage in ongoing dialogue with stakeholders to address concerns related to equity and fairness. Our research will adhere to ethical guidelines and best practices in data science, prioritizing transparency, accountability, and respect for the rights and dignity of research participants. By upholding these principles, we aim to contribute responsibly to the understanding of attempt selection in powerlifting while promoting inclusivity and ethical data use."
515 | ]
516 | },
517 | {
518 | "cell_type": "markdown",
519 | "metadata": {},
520 | "source": [
521 | "# Team Expectations "
522 | ]
523 | },
524 | {
525 | "cell_type": "markdown",
526 | "metadata": {},
527 | "source": [
528 | "\n",
529 | "## Communication\n",
530 | "* We understand that we are all students with different courses and workloads, and that any of us may be unavailable for days at a time to respond to someone's ideas or questions. \n",
531 | "\n",
532 | "* To give some semblance of \"I have seen your message\", since Discord doesn't provide read receipts, we hope to adopt a practice of reacting with an emote to messages to acknowledge them if we don't have time to craft a response at the time of reading it. [Read about reacting to messages on Discord](https://support.discord.com/hc/en-us/articles/12102061808663-Reactions-and-Super-Reactions-FAQ). The tutorial shows how to do it on PC, and can be done on mobile by holding down on a message.\n",
533 | "\n",
534 | "## Collaboration\n",
535 | "* Since this is a group project, everyone should contribute equally or close to equally. Everyone should take responsibility for a suitable share of the work in data wrangling, analysis, report writing, revision, etc. When work is distributed in a meeting or on Discord, every team member should keep track of updates on their task, report any difficulties encountered so that we can adjust the work distribution, and help each other out when needed.\n",
536 | "* Deadlines set by the group need to be met: every member, despite having a different schedule, should complete their part of the work before the deadline.\n"
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
543 | "# Project Timeline Proposal"
544 | ]
545 | },
546 | {
547 | "cell_type": "markdown",
548 | "metadata": {},
549 | "source": [
550 | "Our main form of communication is Discord. To account for changing schedules such as midterms and other class deadlines, meeting times will be discussed each week and set as early as all our schedules allow, so that we are all on the same page and have time to work on the tasks we split up each week. We believe the table below, which was provided in the template of this proposal, is good for the quarter. \n",
551 | "\n",
552 | "\n",
553 | "| Meeting Date | Meeting Time| Completed Before Meeting | Discuss at Meeting |\n",
554 | "|---|---|---|---|\n",
555 | "| 1/20 | Discuss on Discord | Read & Think about COGS 108 expectations; brainstorm topics/questions | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | \n",
556 | "| 1/26 | Discuss on Discord | Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | \n",
557 | "| 2/11 | Discuss on Discord | Edit, finalize, and submit proposal; Search for datasets | Discuss Wrangling and possible analytical approaches|\n",
558 | "| 2/18 | Discuss on Discord | Assign group members to lead each specific part; Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan|\n",
559 | "| 2/25 | Discuss on Discord | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |\n",
560 | "| 3/13 | Discuss on Discord | Complete analysis; Draft results/conclusion/discussion| Review project report; Edit full project |\n",
561 | "| 3/20 | Before 11:59 PM | NA | Turn in Final Project & Group Project Surveys |"
562 | ]
563 | }
564 | ],
565 | "metadata": {
566 | "kernelspec": {
567 | "display_name": "Python 3 (ipykernel)",
568 | "language": "python",
569 | "name": "python3"
570 | },
571 | "language_info": {
572 | "codemirror_mode": {
573 | "name": "ipython",
574 | "version": 3
575 | },
576 | "file_extension": ".py",
577 | "mimetype": "text/x-python",
578 | "name": "python",
579 | "nbconvert_exporter": "python",
580 | "pygments_lexer": "ipython3",
581 | "version": "3.12.2"
582 | }
583 | },
584 | "nbformat": 4,
585 | "nbformat_minor": 4
586 | }
587 |
--------------------------------------------------------------------------------