├── LICENSE
├── README.md
├── mapping_nba_ids
│   ├── .gitignore
│   ├── README.md
│   ├── mapnbaid.py
│   ├── mapping_nba_ids.csv
│   └── requirements.txt
└── sat_logo.jpeg
/LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution."
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 
179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
<div align="center">
2 | <img src="sat_logo.jpeg" alt="Sport Analytics Tools logo"/>
3 | </div>
4 | 
5 | <h1 align="center">Sport Analytics Tools</h1>
6 | 7 | **Sport Analytics Tools** is a project dedicated to publishing information (code, tutorials, media) to help data scientists and sports enthusiasts work with sports data effectively. 8 | 9 | ## Motivation 10 | 11 | - I believe that open-source solutions are superior to proprietary technologies. 12 | - I believe that if you've solved a problem or possess valuable knowledge, you should share it. 13 | 14 | ## Objective 15 | 16 | The goal of this project is to assist people interested in sports data science in tackling data analysis tasks. 17 | 18 | - I've developed **nba_data**, a repository of NBA data that can be downloaded in seconds, rather than spending hours collecting it through the NBA API. 19 | - I am the author of the **nba-on-court** library, which simplifies working with NBA data. 20 | - I am a contributor to well-known sports libraries such as **nba_api**, **hoopR**, and **worldfootballR**. 21 | 22 | Through this project, I aim to create a comprehensive knowledge base of tools and resources to enhance the workflow with sports data. 23 | 24 | ## List of Projects 25 | 26 | ### 1. NBA Player ID Mapping Tool 🏀 27 | Tool for mapping player IDs between NBA Stats API and Basketball Reference. 28 | 29 | #### Features 30 | - Automated ID mapping between different basketball data sources 31 | - Multiple matching algorithms for high accuracy 32 | - Handles special cases and non-English names 33 | - Easy-to-use Python interface 34 | 35 | #### Requirements 36 | - Python 3.8+ 37 | - Core dependencies: beautifulsoup4, numpy, pandas, requests, nba_api, Levenshtein, lxml 38 | 39 | [Learn more about NBA Player ID Mapping Tool →](https://github.com/shufinskiy/sport_analytics_tools/tree/main/mapping_nba_ids) 40 | 41 | ## Project Table 42 | 43 | |Name|Description| 44 | |------|---------| 45 | |[NBA Player ID Mapping Tool](https://github.com/shufinskiy/sport_analytics_tools/tree/main/mapping_nba_ids)| Code automating the mapping of player IDs between the NBA Stats website and Basketball Reference| 46 | 47 | ## Installation 48 | 49 | ```bash 50 | # Clone the repository 51 | git clone https://github.com/shufinskiy/sport_analytics_tools.git 52 | cd sport_analytics_tools 53 | 54 | # Install dependencies 55 | pip install -r requirements.txt 56 | ``` 57 | 58 | ## Contributing 🤝 59 | Contributions are welcome! Please feel free to submit pull requests, particularly for: 60 | 61 | - Adding new tools for sports analytics 62 | - Improving existing functionality 63 | - Adding documentation and tutorials 64 | - Bug fixes and optimizations 65 | 66 | ## License 📄 67 | Apache License 2.0 68 | 69 | ## Contact 📫 70 | 71 | Create an issue in this repository. 72 | -------------------------------------------------------------------------------- /mapping_nba_ids/.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # UV 98 | # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | #uv.lock 102 | 103 | # poetry 104 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 105 | # This is especially recommended for binary packages to ensure reproducibility, and is more 106 | # commonly ignored for libraries. 107 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 108 | #poetry.lock 109 | 110 | # pdm 111 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 112 | #pdm.lock 113 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 114 | # in version control. 115 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control 116 | .pdm.toml 117 | .pdm-python 118 | .pdm-build/ 119 | 120 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 121 | __pypackages__/ 122 | 123 | # Celery stuff 124 | celerybeat-schedule 125 | celerybeat.pid 126 | 127 | # SageMath parsed files 128 | *.sage.py 129 | 130 | # Environments 131 | .env 132 | .venv 133 | env/ 134 | venv/ 135 | ENV/ 136 | env.bak/ 137 | venv.bak/ 138 | 139 | # Spyder project settings 140 | .spyderproject 141 | .spyproject 142 | 143 | # Rope project settings 144 | .ropeproject 145 | 146 | # mkdocs documentation 147 | /site 148 | 149 | # mypy 150 | .mypy_cache/ 151 | .dmypy.json 152 | dmypy.json 153 | 154 | # Pyre type checker 155 | .pyre/ 156 | 157 | # pytype static type analyzer 158 | .pytype/ 159 | 160 | # Cython debug symbols 161 | cython_debug/ 162 | 163 | # PyCharm 164 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 165 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 166 | # and can be added to the global gitignore or merged into this file.
For a more nuclear 167 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 168 | #.idea/ 169 | -------------------------------------------------------------------------------- /mapping_nba_ids/README.md: -------------------------------------------------------------------------------- 1 | # NBA Player ID Mapping Tool 🏀 2 | 3 | A Python tool for mapping player IDs between NBA Stats API and Basketball Reference. This tool helps solve the common challenge of matching player data across different basketball data sources. 4 | 5 | ## Why This Tool? 🤔 6 | 7 | When working with basketball data, analysts often need to combine data from multiple sources. Two of the most popular sources are: 8 | - NBA Stats API (official NBA statistics) 9 | - Basketball Reference (comprehensive historical data) 10 | 11 | However, these sources use different ID systems for players, making it difficult to merge data. This tool creates a mapping between these IDs, allowing for seamless data integration. 12 | 13 | ## How It Works 🛠️ 14 | 15 | The tool uses a multi-step matching algorithm to ensure the highest possible accuracy: 16 | 17 | 1. **Exact Name Matching** 📋 18 | - First attempts to match players by their exact names 19 | - Separates cases with no matches and multiple matches for further processing 20 | 21 | 2. **Multiple Match Resolution** 🔄 22 | - For players with multiple potential matches, uses additional criteria like active years 23 | - Creates separate handling for special cases 24 | 25 | 3. **Non-English Character Handling** 🌐 26 | - Processes names containing non-English characters 27 | - Attempts various transliterations to find matches 28 | 29 | 4. **Surname-Based Matching** 👥 30 | - Matches players using surnames when full names don't match 31 | - Includes additional verification using career years 32 | 33 | 5. **Fuzzy Matching** 🔍 34 | - Removes punctuation and special characters 35 | - Uses Levenshtein distance for approximate string matching 36 | 37 | 6. **Manual Dictionary Mapping** 📘 38 | - Falls back to a pre-defined mapping for special cases 39 | - Handles edge cases that automated matching can't resolve 40 | 41 | ## Usage 💻 42 | 43 | ```python 44 | from mapping_nba_ids.mapnbaid import mapping_nba_id 45 | 46 | # Basic usage with default parameters 47 | mapped_players = mapping_nba_id() 48 | 49 | # Advanced usage with custom parameters 50 | mapped_players = mapping_nba_id( 51 | verbose=True, # Print progress information 52 | letters='abcde', # Only process players whose names start with these letters 53 | base_url='https://www.basketball-reference.com/players' # Custom base URL 54 | ) 55 | ``` 56 | 57 | ## Requirements 📦 58 | 59 | ### Python Version 60 | - Python 3.8 or higher 61 | 62 | ### Required Libraries 63 | ```txt 64 | nba_api>=1.4.0 65 | numpy>=1.22.2,<2.0.0 66 | pandas>=2.0.0 67 | Levenshtein==0.26.1 68 | beautifulsoup4>=4.10.0 69 | requests>=2.31.0 70 | lxml>=5.2.0 71 | ``` 72 | 73 | ## Output 📊 74 | The tool returns a pandas DataFrame containing: 75 | 76 | - NBA Stats API Player ID 77 | - Player Name 78 | - Basketball Reference ID 79 | - Basketball Reference URL 80 | 81 | **The ID mapping table is located in the mapping_nba_ids.csv file and will be updated periodically. You can run the code locally or simply download this file.** 82 | 83 | ## Contributing 🤝 84 | Contributions are welcome! Here's how you can help: 85 | 86 | 1. Fork the repository 87 | 2. Create your feature branch (`git checkout -b feature/amazing-feature`) 88 | 3.
Commit your changes (`git commit -m 'Add some amazing feature'`) 89 | 4. Push to the branch (`git push origin feature/amazing-feature`) 90 | 5. Open a Pull Request 91 | 92 | ## Author ✍️ 93 | shufinskiy - [GitHub Profile](https://github.com/shufinskiy) 94 | 95 | - 📫 How to reach me: Create an issue in this repository 96 | - 🌟 If you find this tool useful, please consider giving it a star! 97 | 98 | -------------------------------------------------------------------------------- /mapping_nba_ids/mapnbaid.py: -------------------------------------------------------------------------------- 1 | """ 2 | Module for mapping NBA player IDs between different data sources. 3 | This module provides functionality to map player IDs between NBA Stats API and Basketball Reference. 4 | """ 5 | 6 | from string import ascii_lowercase 7 | from pathlib import Path 8 | from typing import Dict, List, Optional, Union 9 | from itertools import product 10 | import re 11 | 12 | import requests 13 | from bs4 import BeautifulSoup 14 | import numpy as np 15 | import pandas as pd 16 | from nba_api.stats.endpoints import CommonAllPlayers 17 | from Levenshtein import distance 18 | 19 | 20 | ENGLISH = np.hstack((np.arange(65, 91), np.arange(97, 123), np.array([32, 45, 46]))) 21 | 22 | MAPPING_DICT = { 23 | 202392: 'blakema01', 24 | 1629129: 'bluietr01', 25 | 1642486: 'buiebo01', 26 | 202221: 'butchbr01', 27 | 1642382: 'carlsbr01', 28 | 1642269: 'cartede02', 29 | 1642353: 'chrisca02', 30 | 1642384: 'crawfis01', 31 | 1642368: 'nfalyda01', 32 | 76521: 'davisdw01', 33 | 1642399: 'edwarje01', 34 | 1642348: 'edwarju01', 35 | 203543: 'favervi01', 36 | 1642280: 'flowetr01', 37 | 202070: 'gaffnto01', 38 | 1641945: 'galloja01', 39 | 1619: 'garriki01', 40 | 2775: 'seungha01', 41 | 202238: 'hasbrke01', 42 | 1641747: 'holmeda01', 43 | 1630258: 'homesca01', 44 | 77082: 'hundlho01', 45 | 201998: 'jerrecu01', 46 | 1642352: 'johnske10', 47 | 77199: 'joneswa01', 48 | 1641752: 'klintbo01', 49 | 1630249: 'krejcvi01', 50 | 986: 'mannma01', 51 | 1641970: 'pereima01', 52 | 77510: 'mcclate01', 53 | 1641755: 'mcculke01', 54 | 203183: 'mitchto03', 55 | 203502: 'mitchto02', 56 | 1642439: 'olivaqu01', 57 | 1629341: 'phillta01', 58 | 1642366: 'postqu01', 59 | 202375: 'rollema01', 60 | 202067: 'simpsdi01', 61 | 1630569: 'stewadj01', 62 | 1630597: 'stewadj02', 63 | 78302: 'taylofa01', 64 | 1642260: 'topicni01', 65 | 201987: 'vadenro01', 66 | 78409: 'vaughch01', 67 | 1630492: 'vildolu01', 68 | 202358: 'whitete01', 69 | 78539: 'williar01', 70 | 1629624: 'wooteke01', 71 | 1642385: 'cuiyo01', 72 | 1631322: 'mccoyja01' 73 | } 74 | 75 | 76 | class PlayerDataBBref(object): 77 | """Class for scraping player data from Basketball Reference website. 78 | 79 | This class handles the scraping of player data from basketball-reference.com, 80 | organizing it by player name's first letter. 81 | 82 | Attributes: 83 | base_url (str): Base URL for basketball-reference player pages. 84 | letters (str): Letters to iterate through for player lookup. 85 | verbose (bool): If True, prints progress information during scraping. 86 | bbref_players (list): List of dictionaries containing player information. 87 | """ 88 | 89 | def __init__(self, 90 | base_url: str="https://www.basketball-reference.com/players", 91 | letters: str=ascii_lowercase, 92 | verbose: bool=False) -> None: 93 | """Initialize PlayerDataBBref. 94 | 95 | Args: 96 | base_url (str, optional): Base URL for basketball-reference.com player pages. 97 | Defaults to "https://www.basketball-reference.com/players".
98 | letters (str, optional): Letters to iterate through. Defaults to ascii_lowercase. 99 | verbose (bool, optional): Whether to print progress information. Defaults to False. 100 | """ 101 | self.base_url = base_url 102 | self.letters = letters 103 | self.verbose = verbose 104 | self.bbref_players: List[Dict[str, Union[str, int]]] = [] 105 | 106 | def bbref_player_data(self) -> pd.DataFrame: 107 | """Scrape player data for all specified letters. 108 | 109 | Returns: 110 | pd.DataFrame: DataFrame containing player information from Basketball Reference. 111 | """ 112 | for letter in self.letters: 113 | self.scrape_player_data(letter) 114 | if self.verbose: 115 | print(f"Letter: {letter} finished") 116 | return pd.DataFrame(self.bbref_players) 117 | 118 | def scrape_player_data(self, letter: str) -> None: 119 | """Scrape player data for a specific letter. 120 | 121 | Args: 122 | letter (str): The letter to scrape player data for. 123 | 124 | Raises: 125 | ValueError: If no player information is found on the page. 126 | """ 127 | url = f"{self.base_url}/{letter}/" 128 | response = requests.get(url) 129 | soup = BeautifulSoup(response.content, 'lxml') 130 | table = soup.find('table', {'id': 'players'}) 131 | if table: 132 | rows = table.find('tbody').find_all('tr') 133 | 134 | if len(rows) != 0: 135 | for row in rows: 136 | player_name = row.find('th').get_text() 137 | player_url = row.find('th').find('a')['href'] if row.find('th').find('a') else None 138 | from_year = row.find("td", {"data-stat": "year_min"}).get_text() if row.find("td", {"data-stat": "year_min"}) else None 139 | to_year = row.find("td", {"data-stat": "year_max"}).get_text() if row.find("td", {"data-stat": "year_max"}) else None 140 | 141 | self.bbref_players.append({ 142 | 'name': player_name.replace("*", ""), 143 | 'url': f"https://www.basketball-reference.com{player_url}" if player_url else None, 144 | 'bbref_id': Path(player_url).stem if player_url else None, 145 | 'from_year': int(from_year) - 1 if from_year else None, 146 | 'to_year': int(to_year) - 1 if to_year else None 147 | }) 148 | else: 149 | raise ValueError(f"No player information found on page {url}") 150 | 151 | 152 | class MergePlayerID(object): 153 | """Class for merging player IDs between NBA Stats and Basketball Reference. 154 | 155 | This class implements various methods to match and merge player identifiers 156 | between NBA Stats API and Basketball Reference data sources. 157 | 158 | Attributes: 159 | nbastats (pd.DataFrame): DataFrame containing NBA Stats API player data. 160 | bbref (pd.DataFrame): DataFrame containing Basketball Reference player data. 161 | zero_df (pd.DataFrame, optional): Players with no matches. 162 | double_df (pd.DataFrame, optional): Players with multiple matches. 163 | non_merge_bbref (pd.DataFrame, optional): Unmatched Basketball Reference players. 164 | non_merge_nbastats (pd.DataFrame, optional): Unmatched NBA Stats players. 165 | full_coincidence_df (pd.DataFrame, optional): Players with full information matches. 166 | """ 167 | 168 | def __init__(self, 169 | nbastats: pd.DataFrame, 170 | bbref: pd.DataFrame) -> None: 171 | """Initialize MergePlayerID. 172 | 173 | Args: 174 | nbastats (pd.DataFrame): NBA Stats API player data. 175 | bbref (pd.DataFrame): Basketball Reference player data.
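 
            Example:
                A minimal sketch of the intended call order, mirroring
                MappingBasketID.__call__ below (the input DataFrames are assumed
                to come from CommonAllPlayers and PlayerDataBBref; variable
                names are illustrative)::
 
                    merger = MergePlayerID(nbastats_df, bbref_df)
                    players = merger.merge_by_name()
                    players = merger.merge_double(players)
                    players = merger.merge_non_english(players)
                    players = merger.merge_surname(players)
                    players = merger.merge_wo_punctuation(players)
                    players = merger.merge_from_dict(players)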
176 | """ 177 | self.nbastats = ( 178 | nbastats 179 | .assign(FIRST_LETTER=lambda df_: [x[:1].lower() for x in df_.DISPLAY_LAST_COMMA_FIRST]) 180 | ) 181 | self.bbref = bbref 182 | self.zero_df: Optional[pd.DataFrame] = None 183 | self.double_df: Optional[pd.DataFrame] = None 184 | self.non_merge_bbref: Optional[pd.DataFrame] = None 185 | self.non_merge_nbastats: Optional[pd.DataFrame] = None 186 | self.full_coincidence_df: Optional[pd.DataFrame] = None 187 | 188 | def merge_by_name(self) -> pd.DataFrame: 189 | """Merge players by exact name matches. 190 | 191 | Returns: 192 | pd.DataFrame: DataFrame of matched players by name. 193 | """ 194 | merge_index = [] 195 | zero_index = [] 196 | double_index = [] 197 | for idx, nba_name in enumerate(self.nbastats.loc[:, "DISPLAY_FIRST_LAST"]): 198 | tmp_df = self.bbref.loc[self.bbref["name"] == nba_name] 199 | if tmp_df.shape[0] == 1: 200 | merge_index.append(idx) 201 | elif tmp_df.shape[0] == 0: 202 | zero_index.append(idx) 203 | elif tmp_df.shape[0] > 1: 204 | double_index.append(idx) 205 | else: 206 | raise ValueError(f"Unexpected match count for player {nba_name}") 207 | 208 | self.zero_df = self.nbastats.iloc[zero_index].reset_index(drop=True) 209 | self.double_df = self.nbastats.iloc[double_index].reset_index(drop=True) 210 | 211 | merge_df = ( 212 | self.nbastats 213 | .iloc[merge_index] 214 | .reset_index(drop=True) 215 | .pipe(lambda df_: df_.loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR"]]) 216 | .pipe(lambda df_: df_.merge(self.bbref, how="inner", left_on="DISPLAY_FIRST_LAST", right_on="name")) 217 | ) 218 | 219 | self.upd_non_merge(merge_df) 220 | 221 | return merge_df 222 | 223 | def merge_double(self, merge_df: pd.DataFrame) -> pd.DataFrame: 224 | """Merge players with multiple potential matches. 225 | 226 | Args: 227 | merge_df (pd.DataFrame): Previously merged player data. 228 | 229 | Returns: 230 | pd.DataFrame: Updated DataFrame with additional matches.
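 
            Note:
                Ambiguous names are first re-joined on name plus
                FROM_YEAR/TO_YEAR. Players whose name and career years still
                match more than one Basketball Reference row are set aside in
                ``full_coincidence_df`` rather than merged blindly, and the
                remaining unmatched rows are re-joined on name alone.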
231 | """ 232 | merge_double = ( 233 | self.double_df 234 | .pipe(lambda df_: df_.loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR"]]) 235 | .astype({'FROM_YEAR': 'int', 'TO_YEAR': 'int'}) 236 | .pipe(lambda df_: df_.merge(self.non_merge_bbref, 237 | how="left", 238 | left_on=["DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR"], 239 | right_on=["name", "from_year", "to_year"] 240 | )) 241 | ) 242 | 243 | non_match = merge_double.loc[pd.isna(merge_double.name), ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR"]] 244 | self.full_coincidence_df = ( 245 | merge_double 246 | .pipe(lambda df_: df_.merge(( 247 | df_ 248 | .groupby(["PERSON_ID"], as_index=False)["TO_YEAR"] 249 | .count() 250 | .pipe(lambda df_: df_.loc[df_.TO_YEAR > 1, "PERSON_ID"]) 251 | ), how="inner", on="PERSON_ID")) 252 | .loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR"]] 253 | .drop_duplicates() 254 | .reset_index(drop=True) 255 | ) 256 | 257 | non_ids = non_match.PERSON_ID.to_list() + self.full_coincidence_df.PERSON_ID.to_list() 258 | 259 | merge_df = pd.concat([merge_df, merge_double.loc[~merge_double.PERSON_ID.isin(non_ids)]], 260 | axis=0, ignore_index=True) 261 | self.upd_non_merge(merge_df) 262 | 263 | merge_non_match = non_match.merge(self.non_merge_bbref, how="left", left_on="DISPLAY_FIRST_LAST", right_on="name") 264 | merge_df = pd.concat([merge_df, merge_non_match], axis=0, ignore_index=True) 265 | 266 | self.upd_non_merge(merge_df) 267 | 268 | return merge_df 269 | 270 | def merge_non_english(self, merge_df: pd.DataFrame) -> pd.DataFrame: 271 | """Merge players with non-English characters in their names. 272 | 273 | Args: 274 | merge_df (pd.DataFrame): Previously merged player data. 275 | 276 | Returns: 277 | pd.DataFrame: Updated DataFrame with additional matches. 
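 
            Note:
                After a direct join attempt, each non-English character in a
                Basketball Reference name is replaced, brute force, with
                candidate letters a-z (a cartesian product for names with
                several such characters) until the lower-cased spelling matches
                an NBA Stats name; a single accented letter therefore yields up
                to 26 candidate spellings to test.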
278 | """ 279 | non_eng_idx = np.array([self._detect_non_english(x) for x in self.non_merge_bbref.name]) 280 | non_eng = self.non_merge_bbref.iloc[non_eng_idx].reset_index(drop=True) 281 | non_eng["non_english_count"] = [self._count_non_english(x) for x in non_eng.name] 282 | non_eng["name_lower"] = [x.lower() for x in non_eng["name"]] 283 | 284 | check_non_eng = ( 285 | self.non_merge_nbastats 286 | .pipe(lambda df_: df_.merge(non_eng, how="inner", left_on="DISPLAY_FIRST_LAST", right_on="name")) 287 | .loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", 288 | "name", "url", "bbref_id", "from_year", "to_year"]] 289 | ) 290 | 291 | if check_non_eng.shape[0] != 0: 292 | merge_df = pd.concat([merge_df, check_non_eng], axis=0, ignore_index=True) 293 | self.upd_non_merge(merge_df) 294 | 295 | transform_nbastats = ( 296 | self.non_merge_nbastats 297 | .assign( 298 | name_lower=lambda df_: [x.lower() for x in df_.DISPLAY_FIRST_LAST], 299 | ) 300 | .loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", "name_lower"]] 301 | ) 302 | 303 | cnt_sym = np.sort(np.unique(non_eng.non_english_count)) 304 | prod_dict = {key: list(product(*[list(ascii_lowercase) for _ in range(key)])) for key in cnt_sym} 306 | 307 | for i, row in enumerate(non_eng.itertuples()): 308 | n = np.array([ord(x) in ENGLISH for x in row.name_lower]) 309 | replace_idx = np.where(~n)[0] 313 | for sym_cand in prod_dict[row.non_english_count]: 314 | name_ = list(row.name_lower) 315 | for pos in range(len(sym_cand)): 316 | name_[replace_idx[pos]] = sym_cand[pos] 317 | new_name = "".join(name_) 318 | check_idx = transform_nbastats.loc[transform_nbastats["name_lower"] == new_name].index 319 | if len(check_idx) == 0: 320 | continue 321 | else: 322 | non_eng.loc[i, "name_lower"] = new_name 323 | break 324 | 325 | merge_non_eng = ( 326 | non_eng 327 | .pipe(lambda df_: df_.merge(transform_nbastats, how="inner", on="name_lower")) 328 | .loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", 329 | "name", "url", "bbref_id", "from_year", "to_year"]] 330 | ) 331 | merge_df = pd.concat([merge_df, merge_non_eng], axis=0, ignore_index=True) 332 | self.upd_non_merge(merge_df) 333 | 334 | return merge_df 335 | 336 | def merge_surname(self, merge_df: pd.DataFrame) -> pd.DataFrame: 337 | """Merge players based on surname matches. 338 | 339 | Args: 340 | merge_df (pd.DataFrame): Previously merged player data. 341 | 342 | Returns: 343 | pd.DataFrame: Updated DataFrame with additional matches.
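 
            Note:
                Two passes are made: first a join on surnames that occur
                exactly once in both unmatched pools, then a join on surname
                plus FROM_YEAR/TO_YEAR. PERSON_IDs 203183 and 203502 are
                excluded here and resolved through MAPPING_DICT instead.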
344 | """ 345 | nbastats_surname = set( 346 | self.non_merge_nbastats 347 | .assign( 348 | CNT_PART = lambda df_: [len(x.split()) for x in df_.DISPLAY_FIRST_LAST], 349 | SURNAME=lambda df_: [x.split()[1] if y > 1 else None for x, y in zip(df_.DISPLAY_FIRST_LAST, df_.CNT_PART)] 350 | ) 351 | .groupby("SURNAME", as_index=False)["PERSON_ID"].count() 352 | .pipe(lambda df_: df_.loc[df_.PERSON_ID == 1]) 353 | .reset_index(drop=True) 354 | .iloc[:, 0] 355 | .to_list() 356 | ) 357 | 358 | bbref_surname = ( 359 | self.non_merge_bbref 360 | .assign(SURNAME=lambda df_: [x.split()[1] for x in df_.name]) 361 | .groupby("SURNAME", as_index=False)["bbref_id"].count() 362 | .pipe(lambda df_: df_.loc[df_.bbref_id == 1]) 363 | .reset_index(drop=True) 364 | .iloc[:, 0] 365 | .to_list() 366 | ) 367 | surname_set = nbastats_surname.intersection(bbref_surname) 368 | 369 | comp_surname = ( 370 | self.non_merge_nbastats 371 | .assign(SURNAME=lambda df_: [x.split()[1] for x in df_.DISPLAY_FIRST_LAST]) 372 | .pipe(lambda df_: df_.loc[df_.SURNAME.isin(surname_set)]) 373 | .reset_index(drop=True) 374 | .pipe(lambda df_: df_.merge( 375 | ( 376 | self.non_merge_bbref 377 | .assign(SURNAME=lambda df_: [x.split()[1] for x in df_.name]) 378 | .pipe(lambda df_: df_.loc[df_.SURNAME.isin(surname_set)]) 379 | .reset_index(drop=True) 380 | ), 381 | how="inner", 382 | on="SURNAME" 383 | )) 384 | .pipe(lambda df_: df_.loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", 385 | "name", "url", "bbref_id", "from_year", "to_year"]]) 386 | ) 387 | 388 | merge_df = pd.concat([merge_df, comp_surname], axis=0, ignore_index=True) 389 | 390 | self.upd_non_merge(merge_df) 391 | 392 | nbastats_surname_year = ( 393 | self.non_merge_nbastats 394 | .assign( 395 | CNT_PART = lambda df_: [len(x.split()) for x in df_.DISPLAY_FIRST_LAST], 396 | SURNAME=lambda df_: [x.split()[1] if y > 1 else None for x, y in zip(df_.DISPLAY_FIRST_LAST, df_.CNT_PART)] 397 | ) 398 | .drop(columns="CNT_PART") 399 | ) 400 | 401 | bbref_surname_year = ( 402 | self.non_merge_bbref 403 | .assign( 404 | CNT_PART=lambda df_: [len(x.split()) for x in df_.name], 405 | SURNAME=lambda df_: [x.split()[1] if y > 1 else None for x, y in 406 | zip(df_.name, df_.CNT_PART)] 407 | ) 408 | .drop(columns="CNT_PART") 409 | ) 410 | 411 | comp_surname_year = ( 412 | nbastats_surname_year 413 | .astype({'FROM_YEAR': 'int', 'TO_YEAR': 'int'}) 414 | .pipe(lambda df_: df_.merge( 415 | bbref_surname_year, 416 | how="inner", 417 | left_on=["SURNAME", "FROM_YEAR", "TO_YEAR"], 418 | right_on=["SURNAME", "from_year", "to_year"] 419 | )) 420 | .pipe(lambda df_: df_.loc[~df_.PERSON_ID.isin([203183, 203502]), 421 | ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", 422 | "name", "url", "bbref_id", "from_year", "to_year"]]) 423 | .reset_index(drop=True) 424 | ) 425 | 426 | merge_df = pd.concat([merge_df, comp_surname_year], axis=0, ignore_index=True) 427 | 428 | self.upd_non_merge(merge_df) 429 | 430 | return merge_df 431 | 432 | def merge_wo_punctuation(self, merge_df: pd.DataFrame) -> pd.DataFrame: 433 | """Merge players after removing punctuation from names. 434 | 435 | Args: 436 | merge_df (pd.DataFrame): Previously merged player data. 437 | 438 | Returns: 439 | pd.DataFrame: Updated DataFrame with additional matches. 
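 
            Note:
                Names are compared after stripping punctuation and Roman
                numeral suffixes (I-V). Players still unmatched are then paired
                by Levenshtein distance, and a pairing is accepted only when
                the closest candidate is at distance 2 or less.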
440 | """ 441 | nba_letters = ( 442 | self.non_merge_nbastats 443 | .assign( 444 | ONLY_LETTER=lambda df_: [re.sub(r'[^a-zA-Z]', "", re.sub(" I$| II$| III$| IV$| V$", "", x)).lower() for 445 | x in df_.DISPLAY_FIRST_LAST]) 446 | ) 447 | 448 | bbref_letters = ( 449 | self.non_merge_bbref 450 | .assign(only_letter=lambda df_: [re.sub(r'[^a-zA-Z]', '', x).lower() for x in df_.name]) 451 | ) 452 | 453 | comp_letter = ( 454 | nba_letters 455 | .pipe(lambda df_: df_.loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", "ONLY_LETTER"]]) 456 | .pipe(lambda df_: df_.merge(bbref_letters, how="inner", left_on="ONLY_LETTER", right_on="only_letter")) 457 | .pipe(lambda df_: df_.loc[~df_["PERSON_ID"].isin([203183, 203502])]) 458 | .reset_index(drop=True) 459 | ) 460 | 461 | merge_df = pd.concat([merge_df, comp_letter.drop(columns=["ONLY_LETTER", "only_letter"])], axis=0, ignore_index=True) 462 | 463 | self.upd_non_merge(merge_df) 464 | 465 | bbref_letters = ( 466 | bbref_letters 467 | .pipe(lambda df_: df_.loc[~df_.bbref_id.isin(comp_letter.bbref_id)]) 468 | .reset_index(drop=True) 469 | ) 470 | 471 | nba_letters = ( 472 | nba_letters 473 | .pipe(lambda df_: df_.loc[~df_.PERSON_ID.isin(comp_letter.PERSON_ID)]) 474 | .reset_index(drop=True) 475 | ) 476 | 477 | list_nba_names = nba_letters.ONLY_LETTER.to_list() 478 | list_bbref_names = bbref_letters.only_letter.to_list() 479 | 480 | best = [] 481 | idx_best = [] 482 | for player in list_nba_names: 483 | min_dist = 10000 484 | second_min_dist = 10000 485 | idx = 0 486 | for idx_comp, player_comp in enumerate(list_bbref_names): 487 | dist = distance(player, player_comp) 488 | if dist <= min_dist: 489 | min_dist = dist 490 | idx = idx_comp 491 | elif dist < second_min_dist: 492 | second_min_dist = dist 493 | else: 494 | pass 495 | best.append(min_dist) 496 | idx_best.append(idx) 497 | 498 | comp_lev = ( 499 | nba_letters 500 | .pipe(lambda df_: df_.loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", "ONLY_LETTER"]]) 501 | .assign( 502 | LEN_LETTER=lambda df_: [len(x) for x in df_.ONLY_LETTER], 503 | BEST_LEV=best, 504 | BEST_IDX=idx_best 505 | ) 506 | .pipe(lambda df_: df_.loc[(df_.BEST_LEV <= 2) & (~df_.PERSON_ID.isin([203183, 203502])), 507 | ["PERSON_ID", "DISPLAY_FIRST_LAST", 508 | "FROM_YEAR", "TO_YEAR", "BEST_IDX"]]) 509 | .reset_index(drop=True) 510 | .pipe(lambda df_: df_.merge( 511 | bbref_letters.assign(IDX=lambda df_: df_.index), 512 | how="inner", 513 | left_on="BEST_IDX", right_on="IDX" 514 | )) 515 | .drop(columns=["BEST_IDX", "only_letter", "IDX"]) 516 | ) 517 | 518 | merge_df = pd.concat([merge_df, comp_lev], axis=0, ignore_index=True) 519 | 520 | self.upd_non_merge(merge_df) 521 | 522 | return merge_df 523 | 524 | def merge_from_dict(self, merge_df: pd.DataFrame) -> pd.DataFrame: 525 | """Merge players using predefined mapping dictionary. 526 | 527 | Args: 528 | merge_df (pd.DataFrame): Previously merged player data. 529 | 530 | Returns: 531 | pd.DataFrame: Final merged DataFrame with all matches. 
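 
            Note:
                This is the last-resort step: each remaining NBA Stats
                PERSON_ID is looked up in the hand-curated MAPPING_DICT, and
                the Basketball Reference URL is rebuilt from the first letter
                of the mapped id; ids absent from the dictionary remain
                unmatched.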
532 | """ 533 | comp_dict = ( 534 | self.non_merge_nbastats 535 | .assign(bbref_id = lambda df_: [self._mapping_dict(x) for x in df_.PERSON_ID]) 536 | .pipe(lambda df_: df_.merge(self.non_merge_bbref, how="left", on="bbref_id")) 537 | .pipe(lambda df_: df_.loc[:, ["PERSON_ID", "DISPLAY_FIRST_LAST", "FROM_YEAR", "TO_YEAR", "name", "url", 538 | "bbref_id", "from_year", "to_year"]]) 539 | .assign( 540 | url=lambda df_: [ 541 | "/".join(["https://www.basketball-reference.com/players", x[0], x + ".html"]) 542 | if isinstance(x, str) else None 543 | for x in df_.bbref_id 544 | ] 545 | ) 546 | ) 547 | 548 | merge_df = pd.concat([merge_df, comp_dict], axis=0, ignore_index=True) 549 | 550 | self.upd_non_merge(merge_df) 551 | 552 | return merge_df.drop(columns=["FROM_YEAR", "TO_YEAR", "from_year", "to_year"]) 553 | 554 | def upd_non_merge(self, merge_df: pd.DataFrame) -> None: 555 | """Update non-merged player lists after each merge operation. 556 | 557 | Args: 558 | merge_df (pd.DataFrame): Current merged player data. 559 | """ 560 | merge_bbref_id = merge_df.bbref_id 561 | merge_person_id = merge_df.PERSON_ID 562 | 563 | if self.non_merge_bbref is not None: 564 | self.non_merge_bbref = self.non_merge_bbref[~self.non_merge_bbref.bbref_id.isin(merge_bbref_id)].reset_index(drop=True) 565 | else: 566 | self.non_merge_bbref = self.bbref.loc[~self.bbref.bbref_id.isin(merge_bbref_id)].reset_index(drop=True) 567 | 568 | if self.non_merge_nbastats is not None: 569 | self.non_merge_nbastats = self.non_merge_nbastats[~self.non_merge_nbastats.PERSON_ID.isin(merge_person_id)].reset_index(drop=True) 570 | else: 571 | self.non_merge_nbastats = self.nbastats.loc[~self.nbastats.PERSON_ID.isin(merge_person_id)].reset_index(drop=True) 572 | 573 | @staticmethod 574 | def _detect_non_english(names: str) -> bool: 575 | """Detect if a name contains non-English characters. 576 | 577 | Args: 578 | names (str): Player name to check. 579 | 580 | Returns: 581 | bool: True if name contains non-English characters, False otherwise. 582 | """ 583 | ord_name = not all([ord(x) in ENGLISH for x in names]) 584 | return ord_name 585 | 586 | @staticmethod 587 | def _count_non_english(names: str) -> int: 588 | """Count number of non-English characters in a name. 589 | 590 | Args: 591 | names (str): Player name to check. 592 | 593 | Returns: 594 | int: Number of non-English characters found. 595 | """ 596 | return np.sum([ord(x) not in ENGLISH for x in names]) 597 | 598 | @staticmethod 599 | def _mapping_dict(person_id: int) -> Optional[str]: 600 | """Get Basketball Reference ID from mapping dictionary. 601 | 602 | Args: 603 | person_id (int): NBA Stats API player ID. 604 | 605 | Returns: 606 | Optional[str]: Basketball Reference ID if found, None otherwise. 607 | """ 608 | try: 609 | bbref_id = MAPPING_DICT[person_id] 610 | except KeyError: 611 | bbref_id = None 612 | return bbref_id 613 | 614 | class MappingBasketID(object): 615 | """Main class for mapping basketball player IDs between different sources. 616 | 617 | This class orchestrates the entire process of mapping player IDs between 618 | NBA Stats API and Basketball Reference data sources. 619 | """ 620 | 621 | def __init__(self): 622 | """Initialize MappingBasketID.""" 623 | pass 624 | 625 | def __call__(self, *args, **kwargs): 626 | """Execute the complete ID mapping process. 627 | 628 | Args: 629 | **kwargs: Keyword arguments including: 630 | verbose (bool): Whether to print progress information. 631 | bbref (pd.DataFrame): Existing Basketball Reference data. 
632 | nbastats (pd.DataFrame): Existing NBA Stats data. 633 | letters (str): Letters to scrape from Basketball Reference. 634 | base_url (str): Base URL for Basketball Reference. 635 | 636 | Returns: 637 | pd.DataFrame: Complete mapping between NBA Stats and Basketball Reference IDs. 638 | """ 639 | 640 | self.verbose = kwargs.get("verbose", False) 641 | self.bbref = kwargs.get("bbref", None) 642 | self.nbastats = kwargs.get("nbastats", None) 643 | self.letters = kwargs.get("letters", ascii_lowercase) 644 | self.base_url = kwargs.get("base_url", "https://www.basketball-reference.com/players") 645 | if self.bbref is None: 646 | bbref_players = PlayerDataBBref(verbose=self.verbose, letters=self.letters, base_url=self.base_url) 647 | self.bbref = bbref_players.bbref_player_data() 648 | if self.nbastats is None: 649 | self.nbastats = CommonAllPlayers().get_data_frames()[0] 650 | merge_players = MergePlayerID(self.nbastats, self.bbref) 651 | players_df = merge_players.merge_by_name() 652 | players_df = merge_players.merge_double(players_df) 653 | players_df = merge_players.merge_non_english(players_df) 654 | players_df = merge_players.merge_surname(players_df) 655 | players_df = merge_players.merge_surname(players_df) 656 | players_df = merge_players.merge_wo_punctuation(players_df) 657 | players_df = merge_players.merge_from_dict(players_df) 658 | 659 | return players_df 660 | 661 | mapping_nba_id = MappingBasketID() 662 | -------------------------------------------------------------------------------- /mapping_nba_ids/requirements.txt: -------------------------------------------------------------------------------- 1 | nba_api>=1.4.0 2 | numpy>=1.22.2,<2.0.0 3 | pandas>=2.0.0 4 | Levenshtein==0.26.1 5 | beautifulsoup4>=4.10.0 6 | requests>=2.31.0 7 | lxml>=5.2.0 -------------------------------------------------------------------------------- /sat_logo.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shufinskiy/sport_analytics_tools/c3b1172790725630953800b3878a472d55153b7b/sat_logo.jpeg --------------------------------------------------------------------------------