├── .dockerignore ├── .gitattributes ├── .github └── workflows │ └── test.yml ├── .gitignore ├── .pre-commit-config.yaml ├── CHANGELOG.md ├── CONTRIBUTING.md ├── Dockerfile ├── LICENSE ├── MANIFEST.in ├── PURPOSE.md ├── README.md ├── STYLE.md ├── dataset ├── Superheroes NLP Dataset │ ├── README.md │ ├── create_dataset.py │ ├── create_download_file.py │ └── helper.py ├── bbcsport.csv ├── countries.geojson └── superheroes_nlp_dataset.csv ├── docker-compose.yml ├── docs ├── Makefile ├── source │ ├── conf.py │ ├── index.rst │ ├── nlp.rst │ ├── preprocessing.rst │ ├── representation.rst │ └── visualization.rst └── to_docusaurus.py ├── examples ├── README.md ├── README.md.ipynb └── getting-started.ipynb ├── github ├── demo.gif ├── logo.png ├── scatterplot_bbcsport.svg ├── scatterplot_bbcsport_kmeans.svg └── screencast.gif ├── requirements.txt ├── scripts ├── check.sh ├── format.sh ├── install-hooks.sh ├── install.sh ├── pre-commit.sh ├── push_pip.sh ├── test_coverage.sh ├── tests.sh └── update_documentation.sh ├── setup.cfg ├── setup.py ├── tests ├── README.md ├── __init__.py ├── conftest.py ├── test_helpers.py ├── test_indexes.py ├── test_nlp.py ├── test_preprocessing.py ├── test_representation.py ├── test_types.py └── test_visualization.py ├── texthero ├── __init__.py ├── _helper.py ├── _types.py ├── nlp.py ├── preprocessing.py ├── representation.py ├── stopwords.py └── visualization.py ├── vercel.json └── website ├── blog ├── 2017-10-24-texthero-welcome.md ├── 2020-04-27-rename-columns-pandas.md ├── 2020-04-27-text-preprocessing-with-pandas.md ├── 2020-05-03-text-mining-with-python.md ├── 2020-05-03-text-unsupervised-learning.md └── 2020-05-08-bar-run-chart-python.md ├── build.sh ├── core ├── AnnouncementBar.js ├── Footer.js ├── Showcase.js └── annonucement-bar.css ├── docs ├── api-nlp.md ├── api-preprocessing.md ├── api-representation.md ├── api-visualization.md ├── api │ ├── texthero.nlp.dependency_parse.md │ ├── texthero.nlp.named_entities.md │ ├── texthero.nlp.noun_chunks.md │ ├── texthero.preprocessing.clean.md │ ├── texthero.preprocessing.drop_no_content.md │ ├── texthero.preprocessing.get_default_pipeline.md │ ├── texthero.preprocessing.has_content.md │ ├── texthero.preprocessing.remove_angle_brackets.md │ ├── texthero.preprocessing.remove_brackets.md │ ├── texthero.preprocessing.remove_curly_brackets.md │ ├── texthero.preprocessing.remove_diacritics.md │ ├── texthero.preprocessing.remove_digits.md │ ├── texthero.preprocessing.remove_html_tags.md │ ├── texthero.preprocessing.remove_punctuation.md │ ├── texthero.preprocessing.remove_round_brackets.md │ ├── texthero.preprocessing.remove_square_brackets.md │ ├── texthero.preprocessing.remove_stopwords.md │ ├── texthero.preprocessing.remove_urls.md │ ├── texthero.preprocessing.remove_whitespace.md │ ├── texthero.preprocessing.replace_punctuation.md │ ├── texthero.preprocessing.replace_stopwords.md │ ├── texthero.preprocessing.replace_urls.md │ ├── texthero.preprocessing.stem.md │ ├── texthero.preprocessing.tokenize.md │ ├── texthero.representation.dbscan.md │ ├── texthero.representation.kmeans.md │ ├── texthero.representation.meanshift.md │ ├── texthero.representation.nmf.md │ ├── texthero.representation.pca.md │ ├── texthero.representation.term_frequency.md │ ├── texthero.representation.tfidf.md │ ├── texthero.representation.tsne.md │ ├── texthero.visualization.scatterplot.md │ ├── texthero.visualization.top_words.md │ └── texthero.visualization.wordcloud.md ├── assets │ └── texthero.png ├── getting-started-installation.md ├── 
getting-started-preprocessing.md ├── getting-started.md └── tutorial-tfidf.md ├── package.json ├── pages └── en │ ├── help.js │ ├── index.js │ ├── index_original.js │ └── users.js ├── sidebars.json ├── siteConfig.js ├── static ├── css │ ├── announcement-bar.css │ ├── code-block-buttons.css │ ├── custom.css │ ├── pygments.css │ └── sphinx_basic.css ├── figure │ └── scatterplot_bccsport_kmeans.svg ├── img │ ├── T.png │ ├── favicon.png │ ├── logo_v2.png │ ├── logo_v2_transparent.png │ ├── oss_logo.png │ ├── scatterplot_bccsport.svg │ ├── undraw_code_review.svg │ ├── undraw_monitor.svg │ ├── undraw_note_list.svg │ ├── undraw_online.svg │ ├── undraw_open_source.svg │ ├── undraw_operating_system.svg │ ├── undraw_react.svg │ ├── undraw_tweetstorm.svg │ └── undraw_youtube_tutorial.svg └── js │ ├── analytics.js │ ├── code-block-buttons.js │ └── start_highlight.js └── vercel.json /.dockerignore: -------------------------------------------------------------------------------- 1 | */node_modules 2 | *.log 3 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | website/* linguist-documentation 2 | Dockerfile -linguist-vendored 3 | -------------------------------------------------------------------------------- /.github/workflows/test.yml: -------------------------------------------------------------------------------- 1 | name: CI 2 | 3 | on: 4 | push: 5 | branches: [master] 6 | pull_request: 7 | 8 | concurrency: 9 | group: ${{ github.workflow }}-${{ github.ref_name }}-${{ github.event.pull_request.number || github.sha }} 10 | cancel-in-progress: true 11 | 12 | jobs: 13 | test: 14 | runs-on: ubuntu-latest 15 | strategy: 16 | matrix: 17 | python-version: ["3.8", "3.9", "3.10", "3.11"] 18 | steps: 19 | - name: Checkout project 20 | uses: actions/checkout@v3 21 | 22 | - name: Set up Python ${{ matrix.python-version }} 23 | uses: actions/setup-python@v3 24 | with: 25 | python-version: ${{ matrix.python-version }} 26 | 27 | - name: Set up venv 28 | shell: bash 29 | run: | 30 | python3 -m pip install --upgrade pip setuptools 31 | python3 -m venv .venv 32 | 33 | - name: Install project 34 | shell: bash 35 | run: | 36 | source .venv/bin/activate 37 | python3 -m pip install ".[dev]" 38 | 39 | - name: Test 40 | run: .venv/bin/python3 -m pytest --cov=texthero --cov-report=term-missing --cov-report xml --cov-branch 41 | 42 | - name: Upload coverage reports to Codecov 43 | uses: codecov/codecov-action@v3 44 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ############################# 2 | # Texthero 3 | ############################# 4 | 5 | build 6 | dist 7 | __pycache__ 8 | .DS_Store 9 | .cache 10 | 11 | .ipynb_checkpoints/ 12 | 13 | *.log 14 | 15 | .idea 16 | 17 | alpha 18 | 19 | texthero.egg-info 20 | 21 | ############################# 22 | # Dataset 23 | ############################# 24 | 25 | raw 26 | download.sh 27 | 28 | 29 | ############################# 30 | # Docusaurus and website 31 | ############################# 32 | 33 | website/translated_docs 34 | website/build/ 35 | website/yarn.lock 36 | website/node_modules 37 | website/i18n/* 38 | 39 | node_modules 40 | 41 | lib/core/metadata.js 42 | lib/core/MetadataBlog.js 43 | 44 | ############################# 45 | # GITHUB gitignore 46 | ############################# 47 | 48 | # Byte-compiled / 
optimized / DLL files 49 | __pycache__/ 50 | *.py[cod] 51 | *$py.class 52 | 53 | # C extensions 54 | *.so 55 | 56 | # Distribution / packaging 57 | .Python 58 | build/ 59 | develop-eggs/ 60 | dist/ 61 | downloads/ 62 | eggs/ 63 | .eggs/ 64 | lib/ 65 | lib64/ 66 | parts/ 67 | sdist/ 68 | var/ 69 | wheels/ 70 | share/python-wheels/ 71 | *.egg-info/ 72 | .installed.cfg 73 | *.egg 74 | MANIFEST 75 | 76 | # PyInstaller 77 | # Usually these files are written by a python script from a template 78 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 79 | *.manifest 80 | *.spec 81 | 82 | # Installer logs 83 | pip-log.txt 84 | pip-delete-this-directory.txt 85 | 86 | # Unit test / coverage reports 87 | htmlcov/ 88 | .tox/ 89 | .nox/ 90 | .coverage 91 | .coverage.* 92 | .cache 93 | nosetests.xml 94 | coverage.xml 95 | *.cover 96 | *.py,cover 97 | .hypothesis/ 98 | .pytest_cache/ 99 | cover/ 100 | 101 | # Translations 102 | *.mo 103 | *.pot 104 | 105 | # Django stuff: 106 | *.log 107 | local_settings.py 108 | db.sqlite3 109 | db.sqlite3-journal 110 | 111 | # Flask stuff: 112 | instance/ 113 | .webassets-cache 114 | 115 | # Scrapy stuff: 116 | .scrapy 117 | 118 | # Sphinx documentation 119 | docs/_build/ 120 | 121 | # PyBuilder 122 | .pybuilder/ 123 | target/ 124 | 125 | # Jupyter Notebook 126 | .ipynb_checkpoints 127 | 128 | # IPython 129 | profile_default/ 130 | ipython_config.py 131 | 132 | # pyenv 133 | # For a library or package, you might want to ignore these files since the code is 134 | # intended to run in multiple environments; otherwise, check them in: 135 | # .python-version 136 | 137 | # pipenv 138 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 139 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 140 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 141 | # install all needed dependencies. 142 | #Pipfile.lock 143 | 144 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 145 | __pypackages__/ 146 | 147 | # Celery stuff 148 | celerybeat-schedule 149 | celerybeat.pid 150 | 151 | # SageMath parsed files 152 | *.sage.py 153 | 154 | # Environments 155 | .env 156 | .venv 157 | env/ 158 | venv/ 159 | ENV/ 160 | env.bak/ 161 | venv.bak/ 162 | 163 | # Spyder project settings 164 | .spyderproject 165 | .spyproject 166 | 167 | # Rope project settings 168 | .ropeproject 169 | 170 | # mkdocs documentation 171 | /site 172 | 173 | # mypy 174 | .mypy_cache/ 175 | .dmypy.json 176 | dmypy.json 177 | 178 | # Pyre type checker 179 | .pyre/ 180 | 181 | # pytype static type analyzer 182 | .pytype/ 183 | 184 | # Cython debug symbols 185 | cython_debug/ 186 | docs/source/api 187 | 188 | 189 | # Hide vs code hidden files 190 | .vs_code 191 | -------------------------------------------------------------------------------- /.pre-commit-config.yaml: -------------------------------------------------------------------------------- 1 | repos: 2 | - repo: https://github.com/ambv/black 3 | rev: stable 4 | hooks: 5 | - id: black 6 | language_version: python3 -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # 2020-04-20 2 | 3 | * Version 1.0 4 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM node:lts 2 | 3 | WORKDIR /app/website 4 | 5 | EXPOSE 3000 35729 6 | COPY ./docs /app/docs 7 | COPY ./website /app/website 8 | RUN yarn install 9 | 10 | CMD ["yarn", "start"] 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2020 Texthero 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include LICENSE 2 | -------------------------------------------------------------------------------- /PURPOSE.md: -------------------------------------------------------------------------------- 1 | # PURPOSE 2 | 3 | This document attempts at defining the purpose of Texthero and it's future enhancements. 
4 | 5 | ### Motivation 6 | 7 | We believe the text mining and text analytics community is missing both a space for learning how to deal with all the different NLP/text mining/text analytics tools and a simple python package based on Pandas to work with text data effortlessly. 8 | 9 | What is missing is a clear "universal text mining and data analysis documentation", and providing it is the main purpose of **texthero**. 10 | 11 | ### Objective 12 | 13 | The objective of Texthero can be decomposed into two parts: 14 | 15 | 1. **Offer an efficient tool to deal with text-based datasets** (the texthero python package). Texthero is mainly a teaching tool and therefore easy to use and understand, but at the same time it is quite efficient and should be able to handle large quantities of data. 16 | 17 | 2. **Provide support to newcomers in the NLP world** so that they can efficiently learn all the main core topics (tf-idf, text cleaning, regular expressions, etc.). As many other tutorials already exist, the main approach is to redirect users to valuable resources and explain any missing points in more depth. This part is done mainly through the *tutorials* on texthero.org. 18 | 19 | 20 | ### Channels 21 | 22 | 1. **Github repository**: development of the texthero python package. The README should mainly discuss the PyPI package and not the extra tutorials. 23 | 24 | 2. **Texthero.org** 25 | The website acts both as the official documentation for the python package and as a source of information on how to deal with textual data. 26 | 27 | - **Getting Started**: a 4/5-page document that explains how to use the Texthero tool. The tutorials assume a very basic understanding of the main topics (representation, tf-idf, word2vec, etc.) but at the same time provide links to internal (tutorials) and external resources. 28 | 29 | - **Tutorials**: a blog-like collection of articles related to NLP and text mining. This includes tutorials on how to use certain texthero tools and on how parts of the Texthero code have been developed, as well as extra articles related to other areas of text analytics. Tutorials should focus on how to analyze large quantities of text. 30 | 31 | - **?**: open to any request. For ideas, open a new issue and/or contact jonathan.besomi__AT__gmail.com 32 | 33 | 34 | ### Python package 35 | 36 | For future development, it is important to have a clear idea of the purpose of Texthero as a python package. 37 | 38 | 39 | **Package core purpose** 40 | 41 | The goal is to extract insights from the whole corpus, i.e. the collection of documents, and not from a single element. 42 | 43 | Generally, a corpus is composed of a __long__ collection of documents, and therefore the techniques used need to be efficient enough to deal with a large amount of text. 44 | 45 | **Neural network** 46 | 47 | Texthero functions (as of now) do not make use of neural network solutions. The main reason is that there is no need for them, as mature libraries already exist (PyTorch and TensorFlow, to name a few). 48 | 49 | What Texthero offers is a tool to be used alongside any other machine learning library. Ideally, texthero should be used before applying any "sophisticated" approach to a dataset, in order to first better understand the underlying data before applying any complex model. 50 | 51 | 52 | Note: a text corpus, or collection of documents, always needs to be in the form of a Pandas Series; "do that on a text corpus" and "do that on a Pandas Series" refer to the same act.
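To make this Series-in/Series-out convention concrete, here is a minimal sketch of the intended workflow. It is illustrative only: the column names and example sentences are made up, and it assumes `clean` and `tfidf` behave as in the current getting-started guide (accepting a text Pandas Series and returning a Pandas Series).

```python
import pandas as pd
import texthero as hero

# A text corpus is always handled as a Pandas Series: one document per row.
df = pd.DataFrame(
    {"text": ["Texthero works on Pandas Series.", "Series in, Series out!"]}
)

# Each function takes a Series and returns a Series, so calls can be chained
# with .pipe and the result can always be appended as a new DataFrame column.
df["clean_text"] = df["text"].pipe(hero.clean)
df["tfidf"] = df["clean_text"].pipe(hero.tfidf)
```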
53 | 54 | **Common usage**: 55 | - Clean a text Pandas Series 56 | - Tokenize a text Pandas Series 57 | - Represent a text Pandas Series 58 | - Benchmark on very simple models (Bayes ?) if changes improved the models 59 | - Understand a text without the need for using complex models such as Transformers. 60 | - Extract the main facts from a Pandas Series 61 | 62 | 63 | **Naive Pandas Support** 64 | 65 | Most of texthero python functions should accept as an argument a Pandas Series and return a Pandas Series. This permits to chain the different functions and also always append the Series to a Pandas Column. 66 | 67 | Few exceptions: 68 | - When representing the data, the results might be very sparse, in this case, the returned value is a _Sparse_ Pandas Series. It's important to underline the difference in the documentation. 69 | 70 | - The "visualization" module might return visualization such as the count of top words. An alternative would be to add a custom `hero` accessor to access this kind of features. 71 | 72 | -------------------------------------------------------------------------------- /STYLE.md: -------------------------------------------------------------------------------- 1 | # Texthero Style 2 | 3 | Color palette: 4 | 5 | (Mango Tango, orange): ff8c42 6 | (Corn, yellow): fff275 7 | (Green blue): 3f88c5 8 | (Crimson, red): d7263d 9 | (Oxford blue): 02182b 10 | 11 | 12 | Orange stronger for menubar: ff7b26 13 | -------------------------------------------------------------------------------- /dataset/Superheroes NLP Dataset/README.md: -------------------------------------------------------------------------------- 1 | # Superheroes NLP Dataset 2 | 3 | A playground dataset to learn and practice NLP, text mining and data analysis while having fun. 4 | 5 | The same dataset can be found on Kaggle: [Superheroes NLP Dataset](https://www.kaggle.com/jonathanbesomi/superheroes-nlp-dataset). 6 | 7 | All data have been scraped with python from [Superhero Database](https://www.superherodb.com/), credits belongs to them. 8 | 9 | ## Dataset summary 10 | 11 | Size: 8 MB. 12 | 13 | Num. columns: 81. 14 | 15 | Num. superheroes: 1447. 16 | 17 | Main columns: 18 | - name 19 | - real_name 20 | - full_name 21 | - overall_score - how powerful is the superhero according to superherodb. 22 | - *history_text* - Superhero's history. 23 | - *powers_text* - Description of superhero's powers 24 | - intelligence_score 25 | - strength_score 26 | - speed_score 27 | - durability_score 28 | - power_score 29 | - combat_score 30 | - alter_egos - List of alternative personality 31 | - aliases 32 | - creator - _DC Comics_ or _Marvel Comics_ for instance. 33 | - alignment - Is the character good or bad? 34 | - occupation 35 | - type_race 36 | - height 37 | - weight 38 | - eye_color 39 | - hair_color 40 | - skin_color 41 | 42 | 43 | ## Getting started 44 | 45 | You can download the complete dataset directly from Github here: [Superheroes NLP Dataset](https://github.com/jbesomi/texthero/tree/master/dataset/Superheroes%20NLP%20Dataset/data). 46 | 47 | If you feel lazy, you can also import it directly from pandas: 48 | 49 | ```python 50 | import pandas as pd 51 | 52 | df = pd.read_csv("https://raw.githubusercontent.com/jbesomi/texthero/master/dataset/superheroes_nlp_dataset.csv") 53 | 54 | df.head() 55 | ``` 56 | 57 | ```bash 58 | name real_name full_name overall_score ... has_durability has_stamina has_agility has_super_strength 59 | 0 3-D Man Delroy Garrett, Jr. Delroy Garrett, Jr. 6 ... 
0.0 0.0 0.0 1.0 60 | 1 514A (Gotham) Bruce Wayne NaN 10 ... 1.0 0.0 0.0 1.0 61 | 2 A-Bomb Richard Milhouse Jones Richard Milhouse Jones 20 ... 1.0 1.0 1.0 1.0 62 | 3 Aa Aa NaN 12 ... 0.0 0.0 0.0 0.0 63 | 4 Aaron Cash Aaron Cash Aaron Cash 5 ... 0.0 0.0 0.0 0.0 64 | 65 | [5 rows x 81 columns] 66 | ``` 67 | -------------------------------------------------------------------------------- /dataset/Superheroes NLP Dataset/create_dataset.py: -------------------------------------------------------------------------------- 1 | """ 2 | Create the dataset from the raw data 3 | """ 4 | 5 | import helper as h 6 | import glob 7 | import pandas as pd 8 | from tqdm import tqdm 9 | import ast 10 | 11 | DOWNLOAD_DIR = "./data/raw/" 12 | 13 | all_files = glob.glob(DOWNLOAD_DIR + "*.html") 14 | 15 | # Get unique names: 16 | all_about = glob.glob(DOWNLOAD_DIR + "*_about.html") 17 | 18 | ids = [h.get_id_from_about(a) for a in all_about] 19 | 20 | dataset = [] 21 | 22 | for id in tqdm(ids): 23 | 24 | filename_about = DOWNLOAD_DIR + id + "_about.html" 25 | filename_history = DOWNLOAD_DIR + id + "_history.html" 26 | filename_powers = DOWNLOAD_DIR + id + "_powers.html" 27 | 28 | try: 29 | h.get_soup(filename_about) 30 | except: 31 | print(filename_about) 32 | 33 | data_about = h.get_soup(filename_about) 34 | data_history = h.get_soup(filename_history) 35 | data_powers = h.get_soup(filename_powers) 36 | 37 | row = h.merge_data(data_about, data_history, data_powers) 38 | 39 | dataset.append(row) 40 | 41 | df = pd.DataFrame(dataset) 42 | 43 | df.columns = df.columns.str.lower() 44 | # Clean dataset 45 | 46 | 47 | def clean_teams(df): 48 | df["teams"] = df["teams"].astype(str) 49 | df["teams"] = df["teams"].str.replace("\nNo teams added.", "no_team") 50 | 51 | df["teams"] = df["teams"].str.replace("\n", "").str.strip() 52 | return df 53 | 54 | 55 | df = clean_teams(df) 56 | 57 | # lowercase all columns 58 | df.columns = df.columns.str.lower().str.replace(" ", "_") 59 | 60 | 61 | # Rename columns 62 | df = df.rename(columns={"type_/_race": "type_race"}) 63 | 64 | power_score = dict( 65 | intelligence="intelligence_score", 66 | strength="strength_score", 67 | speed="speed_score", 68 | durability="durability_score", 69 | power="power_score", 70 | combat="combat_score", 71 | ) 72 | 73 | df = df.rename(columns=power_score) 74 | 75 | df = df.rename(columns=dict(hist_content="history_text", powers_content="powers_text")) 76 | 77 | # Reorder columns 78 | df = df[ 79 | [ 80 | "name", 81 | "real_name", 82 | "full_name", 83 | "overall_score", 84 | "history_text", 85 | "powers_text", 86 | "intelligence_score", 87 | "strength_score", 88 | "speed_score", 89 | "durability_score", 90 | "power_score", 91 | "combat_score", 92 | "superpowers", 93 | "alter_egos", 94 | "aliases", 95 | "place_of_birth", 96 | "first_appearance", 97 | "creator", 98 | "alignment", 99 | "occupation", 100 | "base", 101 | "teams", 102 | "relatives", 103 | "gender", 104 | "type_race", 105 | "height", 106 | "weight", 107 | "eye_color", 108 | "hair_color", 109 | "skin_color", 110 | "img", 111 | ] 112 | ] 113 | 114 | # Extract 'superpowers' data 115 | 116 | df_superpowers = ( 117 | df["superpowers"].apply(pd.Series).stack().pipe(pd.get_dummies).sum(level=0) 118 | ) 119 | 120 | # Keep only most 50 common superpowers 121 | common_superpowers = df_superpowers.sum(axis=0).sort_values().tail(50).index 122 | df_superpowers = df_superpowers[common_superpowers] 123 | df_superpowers.columns = df_superpowers.columns.str.lower().str.replace(" ", "_") 124 | df_superpowers = 
df_superpowers.add_prefix("has_") 125 | 126 | df = df.join(df_superpowers) 127 | 128 | 129 | # Split aliases 130 | df["aliases"] = df["aliases"].str.split("\n") 131 | 132 | print(df.shape) 133 | 134 | # Keep only rows where 'history_text' or 'powers_text' is not null. 135 | df = df[ 136 | ~(df["history_text"].str.strip() == "") | ~(df["powers_text"].str.strip() == "") 137 | ] 138 | 139 | print(df.shape) 140 | 141 | 142 | df.to_csv("./data/superheroes_nlp_dataset.csv", index=False) 143 | -------------------------------------------------------------------------------- /dataset/Superheroes NLP Dataset/create_download_file.py: -------------------------------------------------------------------------------- 1 | """ 2 | Create a "download.sh" file containing a list of all http url that needs to be downloaded. 3 | """ 4 | 5 | import helper as h 6 | 7 | # NUM_PAGE = 1 8 | # data_char = h.get_data("https://www.superherodb.com/characters/?page_nr={}".format(NUM_PAGE)) 9 | # superhero_links = h.get_superheroes_links(data_char) 10 | 11 | 12 | # Get all superheroes link 13 | 14 | TOTAL_PAGES = 33 15 | all_links = [] 16 | 17 | for p in range(1, 33 + 1): 18 | data_char = h.get_data( 19 | "https://www.superherodb.com/characters/?page_nr={}".format(p) 20 | ) 21 | all_links += h.get_superheroes_links(data_char) 22 | 23 | 24 | DOWNLOAD_DIR = "./data/raw/" 25 | 26 | file_content = "" 27 | command = "wget {} -t 5 --limit-rate=20K --show-progress -O {}\n" 28 | 29 | file_content += "#!/bin/sh\n\n\n" 30 | file_content += "mkdir -p {}\n\n\n".format(DOWNLOAD_DIR) 31 | 32 | filename_set = [] 33 | 34 | for link in all_links: 35 | 36 | filename = DOWNLOAD_DIR + link.split("/")[-3] 37 | 38 | # Download about 39 | filename_about = filename + "_about.html" 40 | file_content += command.format(link, filename_about) 41 | 42 | # Download history 43 | filename_history = filename + "_history.html" 44 | file_content += command.format(link + "history/", filename_history) 45 | 46 | # Download powers 47 | filename_powers = filename + "_powers.html" 48 | file_content += command.format(link + "powers/", filename_powers) 49 | 50 | file_content += "\n" 51 | 52 | filename_set.append(filename) 53 | 54 | print("There are ", len(filename_set), " files.") 55 | print("There are ", len(set(filename_set)), "unique files.") 56 | 57 | # with open("download.sh", "w") as file: 58 | # file.write(file_content) 59 | -------------------------------------------------------------------------------- /dataset/Superheroes NLP Dataset/helper.py: -------------------------------------------------------------------------------- 1 | """ 2 | Helper functions to scrape all superrheroes data from 'https://www.superherodb.com/' 3 | """ 4 | 5 | from bs4 import BeautifulSoup 6 | import urllib3 7 | import pandas as pd 8 | from collections import defaultdict 9 | import requests 10 | import re 11 | 12 | 13 | def get_data(url): 14 | """ 15 | Return BeautifulSoup html object. 16 | """ 17 | r = requests.get(url) 18 | data = BeautifulSoup(r.text, "lxml") 19 | return data 20 | 21 | 22 | def get_superheroes_links(data): 23 | herolinks = [] 24 | 25 | home_url = "https://www.superherodb.com" 26 | 27 | for all_li in data.find_all(class_="list"): 28 | for link in all_li.find_all("li"): 29 | for hero in link.find_all("a"): 30 | herolinks.append(home_url + hero["href"]) 31 | return herolinks 32 | 33 | 34 | def get_id_from_about(filename): 35 | """ 36 | Extract id from local filename. 
37 | """ 38 | return filename.replace("_about.html", "").split("/")[-1] 39 | 40 | 41 | def get_soup(filename): 42 | with open(filename, "rb") as f: 43 | file = f.read() 44 | return BeautifulSoup(file, "lxml") 45 | 46 | 47 | """ 48 | Get data 49 | """ 50 | 51 | 52 | def get_data(url): 53 | r = requests.get(url) 54 | data = BeautifulSoup(r.text, "lxml") 55 | return data 56 | 57 | 58 | """ 59 | About 60 | """ 61 | 62 | 63 | def get_image(data_about): 64 | 65 | img = data_about.find(class_="portrait").find("img") 66 | if img: 67 | return dict(img=img["src"]) 68 | else: 69 | return dict(img=None) 70 | 71 | 72 | def get_name_real_name(data_about): 73 | name = data_about.find("h1").text 74 | real_name = data_about.find("h2").text 75 | return dict(name=name, real_name=real_name) 76 | 77 | 78 | def get_overall_score(data_about): 79 | return dict(overall_score=data_about.find(href="#class-info").text) 80 | 81 | 82 | def get_power_stats(data_about): 83 | 84 | scripts = data_about.findAll("script") 85 | # Find script containng the 'stats_shdb' 86 | script = next( 87 | (s.text for s in scripts if s.text.strip().startswith("var stats_shdb = [")) 88 | ) 89 | # Extract the list of powers 90 | values = re.findall(r"(\d+)", script.split(";")[0]) 91 | values = [int(v) for v in values] 92 | 93 | labels = data_about.find(class_="stat-holder").findAll("label") 94 | labels = [l.text for l in labels] 95 | 96 | return dict(zip(labels, values)) 97 | 98 | 99 | def get_super_powers(data_about): 100 | superpowers = data_about.find("h3", text="Super Powers").findParent().findAll("a") 101 | superpowers = [s.text for s in superpowers] 102 | return dict(superpowers=superpowers) 103 | 104 | 105 | def get_all_links(td): 106 | links = td.findAll("a") 107 | links = [a.text for a in links] 108 | return links 109 | 110 | 111 | def get_origin(data_about): 112 | 113 | data = data_about.find("h3", text="Origin").findNext() 114 | 115 | origin = {} 116 | 117 | for row in data.find_all("tr"): 118 | key = row.find_all("td")[0].text 119 | value = row.find_all("td")[1] 120 | 121 | if "alter egos" in key.lower(): 122 | origin[key] = get_all_links(value) 123 | else: 124 | origin[key] = value.text 125 | return origin 126 | 127 | 128 | def get_connections(data_about): 129 | data = data_about.find("h3", text="Connections").findNext() 130 | 131 | connections = {} 132 | 133 | for row in data.find_all("tr"): 134 | key = row.find_all("td")[0].text 135 | value = row.find_all("td")[1] 136 | 137 | if "Teams" in key: 138 | connections[key] = get_all_links(value) 139 | else: 140 | connections[key] = value.text 141 | 142 | return connections 143 | 144 | 145 | def get_appearance(data_about): 146 | table = data_about.find("h3", text="Appearance").findParent() 147 | labels = table.findAll(class_="table-label") 148 | return dict([(l.text, l.findNext().text) for l in labels]) 149 | 150 | 151 | """ 152 | History 153 | """ 154 | 155 | 156 | def get_history(data_history): 157 | content = data_history.find(class_="text-columns-2") 158 | title = content.find("h3").text 159 | subtitles = [s.text for s in content.findAll("h4")] 160 | content = " ".join([p.text for p in content.findAll("p")]).replace("\s+", " ") 161 | return {"hist_title": title, "hist_subtitles": subtitles, "hist_content": content} 162 | 163 | 164 | """ 165 | Powers 166 | """ 167 | 168 | 169 | def get_powers(data_powers): 170 | content = data_powers.find_all(class_="col-8")[1] 171 | title = content.find("h3").text 172 | subtitles = [s.text for s in content.findAll("h4")] 173 | content = " 
".join([p.text for p in content.findAll("p")]).replace("\s+", " ") 174 | return { 175 | "powers_title": title, 176 | "powers_subtitles": subtitles, 177 | "powers_content": content, 178 | } 179 | 180 | 181 | """ 182 | Merge all 183 | """ 184 | 185 | 186 | def merge_data(data_about, data_history, data_powers): 187 | 188 | data = {} 189 | 190 | # Get from about page 191 | data.update(get_image(data_about)) 192 | data.update(get_name_real_name(data_about)) 193 | data.update(get_overall_score(data_about)) 194 | data.update(get_power_stats(data_about)) 195 | data.update(get_super_powers(data_about)) 196 | data.update(get_origin(data_about)) 197 | data.update(get_connections(data_about)) 198 | data.update(get_appearance(data_about)) 199 | 200 | # Get history data 201 | data.update(get_history(data_history)) 202 | 203 | # Get powers data 204 | data.update(get_powers(data_powers)) 205 | 206 | return data 207 | -------------------------------------------------------------------------------- /docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: "3" 2 | 3 | services: 4 | docusaurus: 5 | build: . 6 | ports: 7 | - 3000:3000 8 | - 35729:35729 9 | volumes: 10 | - ./docs:/app/docs 11 | - ./website/blog:/app/website/blog 12 | - ./website/core:/app/website/core 13 | - ./website/i18n:/app/website/i18n 14 | - ./website/pages:/app/website/pages 15 | - ./website/static:/app/website/static 16 | - ./website/sidebars.json:/app/website/sidebars.json 17 | - ./website/siteConfig.js:/app/website/siteConfig.js 18 | working_dir: /app/website 19 | -------------------------------------------------------------------------------- /docs/Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = -q 6 | SPHINXBUILD = sphinx-build 7 | SOURCEDIR = source 8 | BUILDDIR = _build 9 | 10 | # Put it first so that "make" without argument is like "make help". 11 | help: 12 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 13 | 14 | .PHONY: help Makefile 15 | 16 | # Catch-all target: route all unknown targets to Sphinx using the new 17 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 18 | %: Makefile 19 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 20 | @./to_docusaurus.py 21 | @rsync -u ./_build/html/_static/basic.css ../website/static/css/sphinx_basic.css 22 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | # Configuration file for the Sphinx documentation builder. 2 | # 3 | # This file only contains a selection of the most common options. For a full 4 | # list see the documentation: 5 | # https://www.sphinx-doc.org/en/master/usage/configuration.html 6 | 7 | # -- Path setup -------------------------------------------------------------- 8 | 9 | # If extensions (or modules to document with autodoc) are in another directory, 10 | # add these directories to sys.path here. If the directory is relative to the 11 | # documentation root, use os.path.abspath to make it absolute, like shown here. 
12 | # 13 | import os 14 | import sys 15 | 16 | import matplotlib 17 | 18 | sys.path.insert(0, os.path.abspath(".")) 19 | 20 | 21 | # -- Project information ----------------------------------------------------- 22 | 23 | project = "texthero" 24 | copyright = "" # will not be used. 25 | author = "" # will not be used. 26 | 27 | # The full version, including alpha/beta/rc tags 28 | release = "" # will not be used. 29 | 30 | 31 | # -- General configuration --------------------------------------------------- 32 | 33 | # Add any Sphinx extension module names here, as strings. They can be 34 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 35 | # ones. 36 | extensions = [ 37 | "numpydoc", 38 | "sphinx.ext.autodoc", # automatically construct the documentation. 39 | "sphinx.ext.autosummary", 40 | # prefer numpydoc at sphinx.ext.napoleon as it looks nicer. 41 | "sphinx.ext.intersphinx", 42 | "matplotlib.sphinxext.plot_directive", 43 | ] 44 | 45 | # Add any paths that contain templates here, relative to this directory. 46 | templates_path = ["_templates"] 47 | 48 | # List of patterns, relative to source directory, that match files and 49 | # directories to ignore when looking for source files. 50 | # This pattern also affects html_static_path and html_extra_path. 51 | exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "*.md"] 52 | 53 | add_module_names = False 54 | 55 | autosummary_generate = True 56 | 57 | autodoc_typehints = "none" 58 | 59 | 60 | intersphinx_mapping = {"python": ("https://docs.python.org/3", None)} 61 | 62 | # -- Options for HTML output ------------------------------------------------- 63 | 64 | # The theme to use for HTML and HTML Help pages. See the documentation for 65 | # a list of builtin themes. 66 | # 67 | html_theme = "pydata_sphinx_theme" # "alabaster", "pydata_sphinx_theme" 68 | 69 | # html_theme_options = {"nosidebar": "true"} 70 | 71 | # html_use_index = False # Create an extra page containing the index. 72 | 73 | # html_show_sourcelink = False 74 | 75 | # html_file_suffix = ".md" later 76 | 77 | # html_show_copyright = False 78 | 79 | # html_show_sphinx = False 80 | 81 | # html_domain_indices = False 82 | 83 | # Add any paths that contain custom static files (such as style sheets) here, 84 | # relative to this directory. They are copied after the builtin static files, 85 | # so a file named "default.css" will overwrite the builtin "default.css". 86 | html_static_path = [] 87 | 88 | 89 | html_css_files = [ 90 | "css/pigments.css", 91 | "css/custom.css", 92 | ] 93 | 94 | autodoc_typehints = "none" 95 | 96 | source_suffix = [".rst"] 97 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | ************ 2 | Texthero API 3 | ************ 4 | 5 | Preprocessing 6 | ============= 7 | 8 | .. toctree:: 9 | 10 | preprocessing 11 | 12 | NLP 13 | ============== 14 | 15 | .. toctree:: 16 | 17 | nlp 18 | 19 | Representation 20 | ============== 21 | 22 | .. toctree:: 23 | 24 | representation 25 | 26 | Visualization 27 | ============= 28 | 29 | .. toctree:: 30 | 31 | visualization 32 | -------------------------------------------------------------------------------- /docs/source/nlp.rst: -------------------------------------------------------------------------------- 1 | .. automodule:: texthero.nlp 2 | 3 | .. 
autosummary:: 4 | :toctree: api 5 | 6 | named_entities 7 | noun_chunks 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | -------------------------------------------------------------------------------- /docs/source/preprocessing.rst: -------------------------------------------------------------------------------- 1 | .. automodule:: texthero.preprocessing 2 | 3 | .. autosummary:: 4 | :toctree: api 5 | 6 | clean 7 | drop_no_content 8 | get_default_pipeline 9 | has_content 10 | remove_angle_brackets 11 | remove_brackets 12 | remove_curly_brackets 13 | remove_diacritics 14 | remove_digits 15 | remove_html_tags 16 | remove_punctuation 17 | remove_round_brackets 18 | remove_square_brackets 19 | remove_stopwords 20 | remove_urls 21 | replace_urls 22 | remove_whitespace 23 | replace_punctuation 24 | replace_stopwords 25 | tokenize 26 | 27 | 28 | 29 | 30 | 31 | 32 | -------------------------------------------------------------------------------- /docs/source/representation.rst: -------------------------------------------------------------------------------- 1 | .. automodule:: texthero.representation 2 | 3 | .. autosummary:: 4 | :toctree: api 5 | 6 | dbscan 7 | kmeans 8 | meanshift 9 | nmf 10 | pca 11 | term_frequency 12 | tfidf 13 | tsne 14 | 15 | 16 | 17 | 18 | 19 | 20 | -------------------------------------------------------------------------------- /docs/source/visualization.rst: -------------------------------------------------------------------------------- 1 | .. automodule:: texthero.visualization 2 | 3 | 4 | .. autosummary:: 5 | :toctree: api 6 | 7 | scatterplot 8 | top_words 9 | wordcloud 10 | 11 | 12 | 13 | 14 | 15 | 16 | -------------------------------------------------------------------------------- /docs/to_docusaurus.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | """ 4 | Takes the output from Sphinx, clean it and send it to Docusaurus. 5 | 6 | 1. Get four main modules from _build/html/ 7 | - Extract only the 'body' html and store it as a md file under 8 | ./website/docs/api-{module_name}.md 9 | 10 | 2. Get all files under _build/html/api/ 11 | - Extract 'body' html and store it as a md file under 12 | ./website/docs/api/{filenames}.md 13 | 14 | 3. Update 'sidebars.json' with the new markdown files 15 | - Update the 'api' section. 16 | - Add each module under a sub-directory. 17 | """ 18 | 19 | 20 | """ 21 | Takes all relevant html files from the html output sphinx folder, parse it with Beautifulsoup, remove unnecessary html data (such as
) and 22 | save a markdown file. 23 | """ 24 | 25 | from bs4 import BeautifulSoup 26 | import glob 27 | from pathlib import Path 28 | from typing import List 29 | import re 30 | import json 31 | 32 | """ 33 | PARAMETERS 34 | """ 35 | 36 | MODULES = ["preprocessing", "nlp", "representation", "visualization"] 37 | ROOT_HTML_DIRECTORY = "./_build/html" 38 | ROOT_MD_DIRECTORY = "../website/docs/" 39 | SIDEBARS_FILEPATH = "../website/sidebars.json" 40 | """ 41 | Helper functions 42 | """ 43 | 44 | 45 | def get_content(soup): 46 | return soup.find("main").find("div") 47 | 48 | 49 | def add_docusaurus_metadata(content: str, id: str, title: str, hide_title) -> str: 50 | """ 51 | Add docusaurus metadata into content. 52 | """ 53 | return f"---\nid: {id}\ntitle: {title}\nhide_title: {hide_title}\n---\n\n" + content 54 | 55 | 56 | def fix_href(soup, module: str): 57 | """ 58 | Fix internal href to be compatible with docusaurus. 59 | """ 60 | 61 | for a in soup.find_all("a", {"class": "reference internal"}, href=True): 62 | a["href"] = re.sub("^texthero\.", f"/docs/{module}/", a["href"]) 63 | a["href"] = a["href"].lower() 64 | return soup 65 | 66 | 67 | def to_md( 68 | in_html_filepath: str, out_md_filepath: str, id: str, title: str, hide_title: str 69 | ) -> None: 70 | """ 71 | Convert Sphinx-generated html to md. 72 | 73 | Parameters 74 | ---------- 75 | in_html_filepath : str 76 | input html file. Example: ./_build/html/preprocessing.html 77 | out_md_filepath : str 78 | output html file. Example: ../website/docs/preprocessing.md 79 | id : str 80 | Docusaurus document id 81 | title : str 82 | Docusaurus title id 83 | hide_title : str ("true" or "false") 84 | Whether to hide title in Docusaurus. 85 | 86 | """ 87 | 88 | with open(in_html_filepath, "r") as f: 89 | soup = BeautifulSoup(f.read(), "html.parser") 90 | body = get_content(soup) 91 | 92 | with open(out_md_filepath, "w") as f: 93 | content = add_docusaurus_metadata(str(body), id, title, hide_title) 94 | f.write(content) 95 | 96 | 97 | def get_html(module: str) -> List[str]: 98 | """Return all html files on the html/module folder""" 99 | files = glob.glob(f"./html/{module}/*.html") 100 | # remove ./html/module 101 | return [f.replace(f"./html/{module}/texthero.", "") for f in files] 102 | 103 | 104 | def get_prettified_module_name(module_name: str): 105 | """ 106 | Return a prettified version of the module name. 
107 | 108 | Examples 109 | -------- 110 | >>> get_title("preprocessing") 111 | Preprocessing 112 | >>> get_title("nlp") 113 | NLP 114 | """ 115 | module_name = module_name.lower().strip() 116 | if module_name == "nlp": 117 | return "NLP" 118 | else: 119 | return module_name.capitalize() 120 | 121 | 122 | """ 123 | Update sidebars and markdown files 124 | """ 125 | 126 | # make sure folder exists 127 | Path(ROOT_MD_DIRECTORY).mkdir(parents=True, exist_ok=True) 128 | Path(ROOT_MD_DIRECTORY + "api").mkdir(parents=True, exist_ok=True) 129 | 130 | api_sidebars = {} 131 | 132 | for m in MODULES: 133 | in_html_filename = f"{ROOT_HTML_DIRECTORY}/{m}.html" 134 | out_md_filename = f"{ROOT_MD_DIRECTORY}/api-{m}.md" 135 | id = "api-" + m.lower().strip() 136 | title = get_prettified_module_name(m) 137 | 138 | hide_title = "false" 139 | 140 | # initialize api_sidebars 141 | api_sidebars[title] = [id] 142 | 143 | to_md(in_html_filename, out_md_filename, id, title, hide_title) 144 | 145 | 146 | for a in glob.glob("./_build/html/api/*.html"): 147 | object_name = a.split("/")[-1].replace(".html", "") 148 | 149 | id = object_name 150 | (_, module_name, fun_name) = object_name.split(".") 151 | 152 | title = f"{module_name}.{fun_name}" 153 | 154 | module_name = get_prettified_module_name(module_name) 155 | 156 | hide_title = "true" 157 | 158 | api_sidebars[module_name].sort() 159 | 160 | api_sidebars[module_name] = api_sidebars[module_name] + ["api/" + id] 161 | 162 | in_html_filename = f"{ROOT_HTML_DIRECTORY}/api/{object_name}.html" 163 | out_md_filename = f"{ROOT_MD_DIRECTORY}/api/{object_name}.md" 164 | 165 | to_md(in_html_filename, out_md_filename, id, title, hide_title) 166 | 167 | 168 | # Load, update and save again sidebars.json 169 | with open(SIDEBARS_FILEPATH) as js: 170 | sidebars = json.load(js) 171 | 172 | sidebars["api"] = api_sidebars 173 | 174 | with open(SIDEBARS_FILEPATH, "w") as f: 175 | json.dump(sidebars, f, indent=2) 176 | -------------------------------------------------------------------------------- /examples/README.md: -------------------------------------------------------------------------------- 1 | # Examples 2 | 3 | - `getting-started.ipynb` should contains the exact same code shown on the [getting started](https://texthero.org/docs/getting-started) doc page. 
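For readers who want a quick picture before opening the notebook, the core of that getting-started flow looks roughly like the sketch below. Treat it as illustrative: the notebook and the doc page are the authoritative versions, and the dataset URL and the `text`/`topic` column names refer to the repository's `bbcsport.csv` example.

```python
import pandas as pd
import texthero as hero

# Load the BBC Sport dataset shipped with the repository.
df = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")

# Clean the text, represent it with TF-IDF and reduce it to two dimensions with PCA.
df["pca"] = (
    df["text"]
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

# Visualize the documents, colored by their topic.
hero.scatterplot(df, "pca", color="topic", title="PCA BBC Sport news")
```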
4 | -------------------------------------------------------------------------------- /github/demo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jbesomi/texthero/37d09f0299fe14329b4fae5002c3a1950e4f563e/github/demo.gif -------------------------------------------------------------------------------- /github/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jbesomi/texthero/37d09f0299fe14329b4fae5002c3a1950e4f563e/github/logo.png -------------------------------------------------------------------------------- /github/screencast.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jbesomi/texthero/37d09f0299fe14329b4fae5002c3a1950e4f563e/github/screencast.gif -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /scripts/check.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # if any command inside script returns error, exit and return that error 4 | set -e 5 | 6 | # ensure that we're always inside the root of our application 7 | cd "${0%/*}/.." 8 | 9 | cd scripts 10 | 11 | echo "Format code." 12 | ./format.sh 13 | 14 | 15 | echo "Update documentation." 16 | ./update_documentation.sh 17 | 18 | 19 | echo "Test code." 20 | ./tests.sh 21 | 22 | #cd website 23 | #npm run build 24 | -------------------------------------------------------------------------------- /scripts/format.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | cd .. 4 | 5 | black texthero 6 | black tests 7 | -------------------------------------------------------------------------------- /scripts/install-hooks.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | GIT_DIR=$(git rev-parse --git-dir) 4 | 5 | echo "Installing hooks..." 6 | # this command creates symlink to our pre-commit script 7 | ln -sf ../../scripts/pre-commit.sh $GIT_DIR/hooks/pre-commit 8 | echo "Done!" 9 | -------------------------------------------------------------------------------- /scripts/install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | cd .. 4 | 5 | pip3 install -e . 6 | -------------------------------------------------------------------------------- /scripts/pre-commit.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | # Run check. 4 | ./scripts/check.sh 5 | 6 | # $? stores exit value of the last command 7 | if [ $? -ne 0 ]; then 8 | echo "All tests must pass before commit." 9 | exit 1 10 | fi 11 | -------------------------------------------------------------------------------- /scripts/push_pip.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | echo "Formatting code ..." 4 | ./format.sh 5 | 6 | echo "Checking code ..." 7 | ./check.sh 8 | 9 | echo "Updating doc ..." 10 | cd ../docs/ 11 | make html 12 | ./to_docusaurus.py 13 | cd .. 
14 | 15 | python3 setup.py sdist bdist_wheel 16 | twine upload --skip-existing dist/* 17 | -------------------------------------------------------------------------------- /scripts/test_coverage.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | cd .. 4 | 5 | coverage run -m unittest discover -s tests -t . 6 | 7 | coverage report -m 8 | coverage html 9 | -------------------------------------------------------------------------------- /scripts/tests.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | cd .. 4 | 5 | python3 -m unittest discover -s tests -t . 6 | -------------------------------------------------------------------------------- /scripts/update_documentation.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | cd ../docs 4 | make html 5 | ./to_docusaurus.py 6 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | name = texthero 3 | version = 1.0.9 4 | description = Text preprocessing, representation and visualization from zero to hero. 5 | author = Jonathan Besomi 6 | license = MIT 7 | long_description = file: README.md 8 | long_description_content_type = text/markdown 9 | classifiers = 10 | Development Status :: 3 - Alpha 11 | License :: OSI Approved :: MIT License 12 | Intended Audience :: Developers 13 | Programming Language :: Python 14 | Natural Language :: English 15 | Topic :: Scientific/Engineering 16 | keywords = 17 | text mining 18 | text preprocessing 19 | text representation 20 | text visualization 21 | url = https://github.com/jbesomi/texthero 22 | project_urls = 23 | Documentation = https://texthero.org/ 24 | Source Code = https://github.com/jbesomi/texthero 25 | Bug Tracker = https://github.com/jbesomi/texthero/issues 26 | [options] 27 | packages = find: 28 | python_requires = >=3.6.1 29 | install_requires = 30 | numpy>=1.17 31 | scikit-learn>=0.22 32 | spacy<3.0.0 33 | tqdm>=4.3, <5 34 | nltk>=3.3, <4 35 | plotly>=4.2.0, <5 36 | pandas>=1.0.2, <2 37 | wordcloud>=1.5.0, <2 38 | gensim>4.0, <5 39 | matplotlib>=3.1.0, <3.7 40 | # TODO pick the correct version. 
41 | [options.extras_require] 42 | dev = 43 | black==19.10b0 44 | pytest>=4.0.0 45 | pytest-cov 46 | Sphinx>=3.0.3 47 | sphinx-markdown-builder>=0.5.4 48 | recommonmark>=0.6.0 49 | nbsphinx 50 | parameterized>=0.7.4 51 | coverage 52 | pre-commit 53 | pandas>=1.1.0 54 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | if __name__ == "__main__": 4 | setuptools.setup() 5 | -------------------------------------------------------------------------------- /tests/README.md: -------------------------------------------------------------------------------- 1 | # TESTS 2 | 3 | "In most cases, missing type hints in third-party packages is not something you want to be bothered with so you can silence these messages:" 4 | 5 | => mypy --ignore-missing-imports 6 | -------------------------------------------------------------------------------- /tests/__init__.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import pandas as pd 3 | 4 | 5 | class PandasTestCase(unittest.TestCase): 6 | def assertDataframeEqual(self, a, b, msg): 7 | try: 8 | pd.testing.assert_frame_equal(a, b) 9 | except AssertionError as e: 10 | raise self.failureException(msg) from e 11 | 12 | def assertSeriesEqual(self, a, b, msg): 13 | try: 14 | pd.testing.assert_series_equal(a, b) 15 | except AssertionError as e: 16 | raise self.failureException(msg) from e 17 | 18 | def setUp(self): 19 | self.addTypeEqualityFunc(pd.DataFrame, self.assertDataframeEqual) 20 | self.addTypeEqualityFunc(pd.Series, self.assertSeriesEqual) 21 | -------------------------------------------------------------------------------- /tests/conftest.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | 4 | def pytest_addoption(parser): 5 | parser.addoption( 6 | "--no-skip-broken", 7 | action="store_true", 8 | default=False, 9 | help="run tests marked as broken", 10 | ) 11 | 12 | 13 | def pytest_configure(config): 14 | config.addinivalue_line("markers", "skip_broken: mark test broken") 15 | 16 | 17 | def pytest_collection_modifyitems(config, items): 18 | if config.getoption("--no-skip-broken"): 19 | return 20 | 21 | skip_broken = pytest.mark.skip(reason="test marked as broken") 22 | for item in items: 23 | if "skip_broken" in item.keywords: 24 | item.add_marker(skip_broken) 25 | 26 | 27 | def broken_case(*params): 28 | return pytest.param(*params, marks=(pytest.mark.skip_broken)) 29 | -------------------------------------------------------------------------------- /tests/test_helpers.py: -------------------------------------------------------------------------------- 1 | """ 2 | Unit-tests for the helper module. 3 | """ 4 | 5 | import pandas as pd 6 | import numpy as np 7 | 8 | from . import PandasTestCase 9 | import doctest 10 | import unittest 11 | import warnings 12 | 13 | from texthero import _helper 14 | 15 | """ 16 | Doctests. 17 | """ 18 | 19 | 20 | def load_tests(loader, tests, ignore): 21 | tests.addTests(doctest.DocTestSuite(_helper)) 22 | return tests 23 | 24 | 25 | """ 26 | Test Decorators. 27 | """ 28 | 29 | 30 | class TestHelpers(PandasTestCase): 31 | """ 32 | handle_nans. 
33 | """ 34 | 35 | def test_handle_nans(self): 36 | s = pd.Series(["Test", np.nan, pd.NA]) 37 | 38 | @_helper.handle_nans(replace_nans_with="This was a NAN") 39 | def f(s): 40 | return s 41 | 42 | s_true = pd.Series(["Test", "This was a NAN", "This was a NAN"]) 43 | 44 | with warnings.catch_warnings(): 45 | warnings.simplefilter("ignore") 46 | self.assertEqual(f(s), s_true) 47 | 48 | with self.assertWarns(Warning): 49 | f(s) 50 | 51 | def test_handle_nans_no_nans_in_input(self): 52 | s = pd.Series(["Test"]) 53 | 54 | @_helper.handle_nans(replace_nans_with="This was a NAN") 55 | def f(s): 56 | return s 57 | 58 | s_true = pd.Series(["Test"]) 59 | 60 | self.assertEqual(f(s), s_true) 61 | 62 | # This is not in test_indexes.py as it requires a custom test case. 63 | def test_handle_nans_index(self): 64 | s = pd.Series(["Test", np.nan, pd.NA], index=[4, 5, 6]) 65 | 66 | @_helper.handle_nans(replace_nans_with="This was a NAN") 67 | def f(s): 68 | return s 69 | 70 | s_true = pd.Series( 71 | ["Test", "This was a NAN", "This was a NAN"], index=[4, 5, 6] 72 | ) 73 | 74 | with warnings.catch_warnings(): 75 | warnings.simplefilter("ignore") 76 | self.assertTrue(f(s).index.equals(s_true.index)) 77 | -------------------------------------------------------------------------------- /tests/test_indexes.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from texthero import nlp, visualization, preprocessing, representation 3 | 4 | import pytest 5 | 6 | from . import PandasTestCase 7 | import unittest 8 | import string 9 | from parameterized import parameterized 10 | 11 | from .conftest import broken_case 12 | 13 | 14 | # Define valid inputs for different functions. 15 | s_text = pd.Series(["Test"], index=[5]) 16 | s_tokenized_lists = pd.Series([["Test", "Test2"], ["Test3"]], index=[5, 6]) 17 | s_numeric = pd.Series([5.0], index=[5]) 18 | s_numeric_lists = pd.Series([[5.0, 5.0], [6.0, 6.0]], index=[5, 6]) 19 | 20 | # Define all test cases. Every test case is a list 21 | # of [name of test case, function to test, tuple of valid input for the function]. 22 | # First argument of valid input has to be the Pandas Series where we 23 | # want to keep the index. If this is different for a function, a separate 24 | # test case has to implemented in the class below. 25 | # The tests will be run by AbstractIndexTest below through the @parameterized 26 | # decorator. 27 | # The names will be expanded automatically, so e.g. "named_entities" 28 | # creates test cases test_correct_index_named_entities and test_incorrect_index_named_entities. 
29 | 30 | test_cases_nlp = [ 31 | ["named_entities", nlp.named_entities, (s_text,)], 32 | ["noun_chunks", nlp.noun_chunks, (s_text,)], 33 | ["stem", nlp.stem, (s_text,)], 34 | ] 35 | 36 | test_cases_preprocessing = [ 37 | ["fillna", preprocessing.fillna, (s_text,)], 38 | ["lowercase", preprocessing.lowercase, (s_text,)], 39 | ["replace_digits", preprocessing.replace_digits, (s_text, "")], 40 | ["remove_digits", preprocessing.remove_digits, (s_text,)], 41 | ["replace_punctuation", preprocessing.replace_punctuation, (s_text, "")], 42 | ["remove_punctuation", preprocessing.remove_punctuation, (s_text,)], 43 | ["remove_diacritics", preprocessing.remove_diacritics, (s_text,)], 44 | ["remove_whitespace", preprocessing.remove_whitespace, (s_text,)], 45 | ["replace_stopwords", preprocessing.replace_stopwords, (s_text, "")], 46 | ["remove_stopwords", preprocessing.remove_stopwords, (s_text,)], 47 | ["clean", preprocessing.clean, (s_text,)], 48 | ["remove_round_brackets", preprocessing.remove_round_brackets, (s_text,)], 49 | ["remove_curly_brackets", preprocessing.remove_curly_brackets, (s_text,)], 50 | ["remove_square_brackets", preprocessing.remove_square_brackets, (s_text,)], 51 | ["remove_angle_brackets", preprocessing.remove_angle_brackets, (s_text,)], 52 | ["remove_brackets", preprocessing.remove_brackets, (s_text,)], 53 | ["remove_html_tags", preprocessing.remove_html_tags, (s_text,)], 54 | ["tokenize", preprocessing.tokenize, (s_text,)], 55 | broken_case("phrases", preprocessing.phrases, (s_tokenized_lists,)), 56 | ["replace_urls", preprocessing.replace_urls, (s_text, "")], 57 | ["remove_urls", preprocessing.remove_urls, (s_text,)], 58 | ["replace_tags", preprocessing.replace_tags, (s_text, "")], 59 | ["remove_tags", preprocessing.remove_tags, (s_text,)], 60 | ] 61 | 62 | test_cases_representation = [ 63 | broken_case("count", representation.count, (s_tokenized_lists,),), 64 | broken_case("term_frequency", representation.term_frequency, (s_tokenized_lists,),), 65 | broken_case("tfidf", representation.tfidf, (s_tokenized_lists,),), 66 | ["pca", representation.pca, (s_numeric_lists, 0)], 67 | ["nmf", representation.nmf, (s_numeric_lists,)], 68 | broken_case("tsne", representation.tsne, (s_numeric_lists,)), 69 | ["kmeans", representation.kmeans, (s_numeric_lists, 1)], 70 | ["dbscan", representation.dbscan, (s_numeric_lists,)], 71 | ["meanshift", representation.meanshift, (s_numeric_lists,)], 72 | ] 73 | 74 | test_cases = test_cases_nlp + test_cases_preprocessing + test_cases_representation 75 | 76 | 77 | class TestAbstractIndex: 78 | """ 79 | Class for index test cases. Tests for all cases 80 | in test_cases whether the input's index is correctly 81 | preserved by the function. Some function's tests 82 | are implemented manually as they take different inputs. 83 | 84 | """ 85 | 86 | """ 87 | Tests defined in test_cases above. 
88 | """ 89 | 90 | @pytest.mark.parametrize("name, test_function, valid_input", test_cases) 91 | def test_correct_index(self, name, test_function, valid_input): 92 | s = valid_input[0] 93 | result_s = test_function(*valid_input) 94 | t_same_index = pd.Series(s.values, s.index) 95 | assert result_s.index.equals(t_same_index.index) 96 | 97 | @pytest.mark.parametrize("name, test_function, valid_input", test_cases) 98 | def test_incorrect_index(self, name, test_function, valid_input): 99 | s = valid_input[0] 100 | result_s = test_function(*valid_input) 101 | t_different_index = pd.Series(s.values, index=None) 102 | assert not result_s.index.equals(t_different_index.index) 103 | -------------------------------------------------------------------------------- /tests/test_nlp.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | from texthero import nlp 4 | 5 | from . import PandasTestCase 6 | import doctest 7 | import unittest 8 | import string 9 | 10 | """ 11 | Test doctest 12 | """ 13 | 14 | 15 | def load_tests(loader, tests, ignore): 16 | tests.addTests(doctest.DocTestSuite(nlp)) 17 | return tests 18 | 19 | 20 | class TestNLP(PandasTestCase): 21 | """ 22 | Named entity. 23 | """ 24 | 25 | def test_named_entities(self): 26 | s = pd.Series("New York is a big city") 27 | s_true = pd.Series([[("New York", "GPE", 0, 8)]]) 28 | self.assertEqual(nlp.named_entities(s), s_true) 29 | 30 | """ 31 | Noun chunks. 32 | """ 33 | 34 | def test_noun_chunks(self): 35 | s = pd.Series("Today is such a beautiful day") 36 | s_true = pd.Series( 37 | [[("Today", "NP", 0, 5), ("such a beautiful day", "NP", 9, 29)]] 38 | ) 39 | self.assertEqual(nlp.noun_chunks(s), s_true) 40 | 41 | """ 42 | Count sentences. 43 | """ 44 | 45 | def test_count_sentences(self): 46 | s = pd.Series("I think ... it counts correctly. Doesn't it? Great!") 47 | s_true = pd.Series(3) 48 | self.assertEqual(nlp.count_sentences(s), s_true) 49 | 50 | def test_count_sentences_numeric(self): 51 | s = pd.Series([13.0, 42.0]) 52 | self.assertRaises(TypeError, nlp.count_sentences, s) 53 | 54 | def test_count_sentences_missing_value(self): 55 | s = pd.Series(["Test.", np.nan]) 56 | self.assertRaises(TypeError, nlp.count_sentences, s) 57 | 58 | def test_count_sentences_index(self): 59 | s = pd.Series(["Test"], index=[5]) 60 | counted_sentences_s = nlp.count_sentences(s) 61 | t_same_index = pd.Series([""], index=[5]) 62 | 63 | self.assertTrue(counted_sentences_s.index.equals(t_same_index.index)) 64 | 65 | def test_count_sentences_wrong_index(self): 66 | s = pd.Series(["Test", "Test"], index=[5, 6]) 67 | counted_sentences_s = nlp.count_sentences(s) 68 | t_different_index = pd.Series(["", ""], index=[5, 7]) 69 | 70 | self.assertFalse(counted_sentences_s.index.equals(t_different_index.index)) 71 | 72 | """ 73 | POS tagging. 
74 | """ 75 | 76 | def test_pos(self): 77 | s = pd.Series(["Today is such a beautiful day", "São Paulo is a great city"]) 78 | pos_tagging = nlp.pos_tag(s) 79 | s_true = pd.Series( 80 | [ 81 | [ 82 | ("Today", "NOUN", "NN", 0, 5), 83 | ("is", "AUX", "VBZ", 6, 8), 84 | ("such", "DET", "PDT", 9, 13), 85 | ("a", "DET", "DT", 14, 15), 86 | ("beautiful", "ADJ", "JJ", 16, 25), 87 | ("day", "NOUN", "NN", 26, 29), 88 | ], 89 | [ 90 | ("São", "PROPN", "NNP", 0, 3), 91 | ("Paulo", "PROPN", "NNP", 4, 9), 92 | ("is", "AUX", "VBZ", 10, 12), 93 | ("a", "DET", "DT", 13, 14), 94 | ("great", "ADJ", "JJ", 15, 20), 95 | ("city", "NOUN", "NN", 21, 25), 96 | ], 97 | ] 98 | ) 99 | 100 | self.assertEqual(pos_tagging, s_true) 101 | -------------------------------------------------------------------------------- /tests/test_types.py: -------------------------------------------------------------------------------- 1 | """ 2 | Unit-tests for the types module. 3 | """ 4 | 5 | import pandas as pd 6 | import numpy as np 7 | 8 | from . import PandasTestCase 9 | import doctest 10 | import unittest 11 | 12 | from texthero import _types 13 | 14 | """ 15 | Doctests. 16 | """ 17 | 18 | 19 | def load_tests(loader, tests, ignore): 20 | tests.addTests(doctest.DocTestSuite(_types)) 21 | return tests 22 | 23 | 24 | class TestTypes(PandasTestCase): 25 | """ 26 | InputSeries. 27 | """ 28 | 29 | def test_inputseries_function_executes_correctly(self): 30 | @_types.InputSeries(_types.TextSeries) 31 | def f(s, t): 32 | return t 33 | 34 | s = pd.Series("I'm a TextSeries") 35 | t = "test" 36 | self.assertEqual(f(s, t), t) 37 | 38 | def test_inputseries_wrong_type(self): 39 | @_types.InputSeries(_types.TextSeries) 40 | def f(s): 41 | pass 42 | 43 | self.assertRaises(TypeError, f, pd.Series([["token", "ized"]])) 44 | 45 | def test_inputseries_correct_type_textseries(self): 46 | @_types.InputSeries(_types.TextSeries) 47 | def f(s): 48 | pass 49 | 50 | try: 51 | f(pd.Series("I'm a TextSeries")) 52 | except TypeError: 53 | self.fail("Failed although input type is correct.") 54 | 55 | def test_inputseries_correct_type_tokenseries(self): 56 | @_types.InputSeries(_types.TokenSeries) 57 | def f(s): 58 | pass 59 | 60 | try: 61 | f(pd.Series([["token", "ized"]])) 62 | except TypeError: 63 | self.fail("Failed although input type is correct.") 64 | 65 | def test_inputseries_correct_type_vectorseries(self): 66 | @_types.InputSeries(_types.VectorSeries) 67 | def f(s): 68 | pass 69 | 70 | try: 71 | f(pd.Series([[0.0, 1.0]])) 72 | except TypeError: 73 | self.fail("Failed although input type is correct.") 74 | 75 | def test_inputseries_correct_type_DataFrame(self): 76 | @_types.InputSeries(_types.DataFrame) 77 | def f(s): 78 | pass 79 | 80 | try: 81 | f(pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"], dtype="Sparse",)) 82 | except TypeError: 83 | self.fail("Failed although input type is correct.") 84 | 85 | def test_inputseries_correct_type_first_value_is_nan_TextSeries(self): 86 | @_types.InputSeries(_types.TextSeries) 87 | def f(s): 88 | pass 89 | 90 | try: 91 | f(pd.Series([np.nan, pd.NA, "I'm a TextSeries"])) 92 | except TypeError: 93 | self.fail("Failed although input type is correct.") 94 | 95 | def test_inputseries_correct_type_first_value_is_nan_TokenSeries(self): 96 | @_types.InputSeries(_types.TokenSeries) 97 | def f(s): 98 | pass 99 | 100 | try: 101 | f(pd.Series([np.nan, pd.NA, ["Token", "Series"]])) 102 | except TypeError: 103 | self.fail("Failed although input type is correct.") 104 | 105 | def 
test_inputseries_correct_type_first_value_is_nan_VectorSeries(self): 106 | @_types.InputSeries(_types.VectorSeries) 107 | def f(s): 108 | pass 109 | 110 | try: 111 | f(pd.Series([np.nan, pd.NA, [0, 1, 2]])) 112 | except TypeError: 113 | self.fail("Failed although input type is correct.") 114 | 115 | def test_several_possible_types_correct_type(self): 116 | @_types.InputSeries([_types.DataFrame, _types.VectorSeries]) 117 | def f(x): 118 | pass 119 | 120 | try: 121 | f(pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"], dtype="Sparse",)) 122 | 123 | f(pd.Series([[1.0, 2.0]])) 124 | 125 | except TypeError: 126 | self.fail("Failed although input type is correct.") 127 | 128 | def test_several_possible_types_wrong_type(self): 129 | @_types.InputSeries([_types.DataFrame, _types.VectorSeries]) 130 | def f(x): 131 | pass 132 | 133 | self.assertRaises(TypeError, f, pd.Series([["token", "ized"]])) 134 | -------------------------------------------------------------------------------- /tests/test_visualization.py: -------------------------------------------------------------------------------- 1 | import string 2 | 3 | import pandas as pd 4 | import doctest 5 | 6 | from texthero import visualization 7 | from . import PandasTestCase 8 | 9 | 10 | """ 11 | Test doctest 12 | """ 13 | 14 | 15 | def load_tests(loader, tests, ignore): 16 | tests.addTests(doctest.DocTestSuite(visualization)) 17 | return tests 18 | 19 | 20 | class TestVisualization(PandasTestCase): 21 | """ 22 | Test scatterplot. 23 | """ 24 | 25 | def test_scatterplot_dimension_too_high(self): 26 | s = pd.Series([[1, 2, 3, 4], [1, 2, 3, 4]]) 27 | df = pd.DataFrame(s) 28 | self.assertRaises(ValueError, visualization.scatterplot, df, col=0) 29 | 30 | def test_scatterplot_dimension_too_low(self): 31 | s = pd.Series([[1], [1]]) 32 | df = pd.DataFrame(s) 33 | self.assertRaises(ValueError, visualization.scatterplot, df, col=0) 34 | 35 | def test_scatterplot_return_figure(self): 36 | s = pd.Series([[1, 2, 3], [1, 2, 3]]) 37 | df = pd.DataFrame(s) 38 | ret = visualization.scatterplot(df, col=0, return_figure=True) 39 | self.assertIsNotNone(ret) 40 | 41 | """ 42 | Test top_words. 43 | """ 44 | 45 | def test_top_words(self): 46 | s = pd.Series("one two two three three three") 47 | s_true = pd.Series([1, 3, 2], index=["one", "three", "two"]) 48 | self.assertEqual(visualization.top_words(s).sort_index(), s_true) 49 | 50 | def test_top_words_space_char(self): 51 | s = pd.Series("one \n\t") 52 | s_true = pd.Series([1], index=["one"]) 53 | self.assertEqual(visualization.top_words(s), s_true) 54 | 55 | def test_top_words_punctuation_between(self): 56 | s = pd.Series("can't hello-world u.s.a") 57 | s_true = pd.Series([1, 1, 1], index=["can't", "hello-world", "u.s.a"]) 58 | self.assertEqual(visualization.top_words(s).sort_index(), s_true) 59 | 60 | def test_top_words_remove_external_punctuation(self): 61 | s = pd.Series("stop. please!") 62 | s_true = pd.Series([1, 1], index=["please", "stop"]) 63 | self.assertEqual(visualization.top_words(s).sort_index(), s_true) 64 | 65 | def test_top_words_digits(self): 66 | s = pd.Series("123 hello h1n1") 67 | s_true = pd.Series([1, 1, 1], index=["123", "h1n1", "hello"]) 68 | self.assertEqual(visualization.top_words(s).sort_index(), s_true) 69 | 70 | def test_top_words_digits_punctuation(self): 71 | s = pd.Series("123. 
.321 -h1n1 -cov2") 72 | s_true = pd.Series([1, 1, 1, 1], index=["123", "321", "cov2", "h1n1"]) 73 | self.assertEqual(visualization.top_words(s).sort_index(), s_true) 74 | 75 | """ 76 | Test worcloud 77 | """ 78 | 79 | def test_wordcloud(self): 80 | s = pd.Series("one two three") 81 | self.assertEqual(visualization.wordcloud(s), None) 82 | -------------------------------------------------------------------------------- /texthero/__init__.py: -------------------------------------------------------------------------------- 1 | """Texthero: python toolkit for text preprocessing, representation and visualization. 2 | 3 | 4 | 5 | """ 6 | from . import preprocessing 7 | from .preprocessing import * 8 | 9 | from . import representation 10 | from .representation import * 11 | 12 | from . import visualization 13 | from .visualization import * 14 | 15 | from . import nlp 16 | from .nlp import * 17 | 18 | from . import stopwords 19 | -------------------------------------------------------------------------------- /texthero/_helper.py: -------------------------------------------------------------------------------- 1 | """ 2 | Useful helper functions for the texthero library. 3 | """ 4 | 5 | import pandas as pd 6 | import functools 7 | import warnings 8 | 9 | 10 | """ 11 | Warnings. 12 | """ 13 | 14 | _warning_nans_in_input = ( 15 | "There are NaNs (missing values) in the given input series." 16 | " They were replaced with appropriate values before the function" 17 | " was applied. Consider using hero.fillna to replace those NaNs yourself" 18 | " or hero.drop_no_content to remove them." 19 | ) 20 | 21 | 22 | """ 23 | Decorators. 24 | """ 25 | 26 | 27 | def handle_nans(replace_nans_with): 28 | """ 29 | Decorator to handle NaN values in a function's input. 30 | 31 | Using the decorator, if there are NaNs in the input, 32 | they are replaced with replace_nans_with 33 | and a warning is printed. 34 | 35 | The function must take as first input a Pandas Series. 36 | 37 | Examples 38 | -------- 39 | >>> from texthero._helper import handle_nans 40 | >>> import pandas as pd 41 | >>> import numpy as np 42 | >>> @handle_nans(replace_nans_with="I was missing!") 43 | ... def replace_b_with_c(s): 44 | ... return s.str.replace("b", "c") 45 | >>> s_with_nan = pd.Series(["Test b", np.nan]) 46 | >>> replace_b_with_c(s_with_nan) # doctest: +SKIP 47 | 0 Test c 48 | 1 I was missing! 49 | dtype: object 50 | """ 51 | 52 | def decorator(func): 53 | @functools.wraps(func) 54 | def wrapper(*args, **kwargs): 55 | 56 | # Get first input argument (the series) and replace the NaNs. 57 | s = args[0] 58 | if s.isna().values.any(): 59 | warnings.warn(_warning_nans_in_input, UserWarning) 60 | s = s.fillna(value=replace_nans_with) 61 | 62 | # Put the series back into the input. 63 | if args[1:]: 64 | args = (s,) + args[1:] 65 | else: 66 | args = (s,) 67 | 68 | # Apply function as usual. 69 | return func(*args, **kwargs) 70 | 71 | return wrapper 72 | 73 | return decorator 74 | -------------------------------------------------------------------------------- /texthero/stopwords.py: -------------------------------------------------------------------------------- 1 | import nltk 2 | import spacy 3 | 4 | try: 5 | # If not present, download NLTK stopwords. 
6 | nltk.data.find("corpora/stopwords") 7 | except LookupError: 8 | nltk.download("stopwords") 9 | 10 | from nltk.corpus import stopwords as nltk_en_stopwords 11 | from spacy.lang.en import stop_words as spacy_en_stopwords 12 | 13 | DEFAULT = set(nltk_en_stopwords.words("english")) 14 | NLTK_EN = DEFAULT 15 | SPACY_EN = spacy_en_stopwords.STOP_WORDS 16 | -------------------------------------------------------------------------------- /vercel.json: -------------------------------------------------------------------------------- 1 | { 2 | "github": { 3 | "silent": true 4 | } 5 | } 6 | -------------------------------------------------------------------------------- /website/blog/2017-10-24-texthero-welcome.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Texthero welcome 3 | author: Jonathan Besomi 4 | --- 5 | 6 | # Texthero welcome. 7 | 8 | 9 | Welcome to Texthero. 10 | 11 | Texthero is a python package for working with text-based dataset with ease. 12 | 13 | You can start from the online [documentation](https://texthero.org/docs/). 14 | 15 | This tab is a work in progress, soon interesting articles will pop-up. Stay tuned. 16 | -------------------------------------------------------------------------------- /website/blog/2020-04-27-rename-columns-pandas.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Groupby and rename columns in pandas 3 | author: Jonathan Besomi 4 | unlisted: True 5 | --- 6 | 7 | 8 | 9 | ## Groupby and rename columns in pandas 10 | 11 | ``` 12 | df.groupby(['artist']).mean().stack().rename_axis(['one', 'bar']).reset_index(name='ooo') 13 | ``` 14 | 15 | ``` 16 | df_empath = ( 17 | df_empath.groupby(['artist']) 18 | .max() 19 | .stack() 20 | .rename_axis(['artist', 'sentiment']) 21 | .reset_index(name='r') 22 | ) 23 | ``` 24 | -------------------------------------------------------------------------------- /website/blog/2020-04-27-text-preprocessing-with-pandas.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Text preprocessing with Pandas and Texthero. 3 | author: Jonathan Besomi 4 | unlisted: True 5 | --- 6 | 7 | ## Text preprocessing with Pandas and Texthero. 8 | -------------------------------------------------------------------------------- /website/blog/2020-05-03-text-mining-with-python.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Text mining with python 3 | author: Jonathan Besomi 4 | unlisted: True 5 | --- 6 | 7 | # Text mining with python 8 | -------------------------------------------------------------------------------- /website/blog/2020-05-03-text-unsupervised-learning.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Unsupervised text learning 3 | author: Jonathan Besomi 4 | unlisted: True 5 | --- 6 | 7 | ## Pandas and Texthero. 8 | 9 | 10 | ## Introduction 11 | 12 | - Pandas 13 | - Texthero 14 | - Unsupervised learning 15 | - Dataset: 16 | 17 | ## Text representation 18 | 19 | - TF-IDF 20 | - Count 21 | - Word2Vec 22 | 23 | ## If it is due to identical points in the dataset, removing these points may help. 24 | 25 | 26 | ## Clustering 27 | 28 | ## Unsupervised 29 | - k-means 30 | - Totally Random Trees embedding 31 | 32 | ## Semi-supervised 33 | - LDA 34 | 35 | ## Dimensionality reduction 36 | To visualize the structure of a dataset, the dimension must be reduced somehow. 
37 | 38 | - PCA 39 | - NCA 40 | - Multi-dimensional Scaling, a technique used for analyzing similarity or dissimilarity data. It attempts to model dissimilarity or similarity of data as distance in a geometric vector spaces. 41 | - t-sne 42 | 43 | ## 44 | -------------------------------------------------------------------------------- /website/blog/2020-05-08-bar-run-chart-python.md: -------------------------------------------------------------------------------- 1 | --- 2 | title: Bar run chart in python 3 | author: Jonathan Besomi 4 | unlisted: True 5 | --- 6 | 7 | # Bar run chart in python 8 | -------------------------------------------------------------------------------- /website/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | npm run build 4 | -------------------------------------------------------------------------------- /website/core/AnnouncementBar.js: -------------------------------------------------------------------------------- 1 | import React, {useState, useEffect} from 'react'; 2 | 3 | // import styles from './announcement-bar.css'; 4 | 5 | const STORAGE_DISMISS_KEY = 'docusaurus.announcement.dismiss'; 6 | const STORAGE_ID_KEY = 'docusaurus.announcement.id'; 7 | 8 | function AnnouncementBar() { 9 | 10 | 11 | //const {id, content, backgroundColor, textColor} = { 12 | // id: 'supportus', 13 | // content: 14 | // '⭐️ If you like Docusaurus, give it a star on GitHub! ⭐️', 15 | // }; 16 | 17 | const id = "supportus" 18 | const content = "⭐️ If you like Docusaurus" 19 | 20 | const [isClosed, setClosed] = useState(true); 21 | const handleClose = () => { 22 | localStorage.setItem(STORAGE_DISMISS_KEY, true); 23 | setClosed(true); 24 | }; 25 | 26 | useEffect(() => { 27 | const viewedId = localStorage.getItem(STORAGE_ID_KEY); 28 | const isNewAnnouncement = id !== viewedId; 29 | 30 | localStorage.setItem(STORAGE_ID_KEY, id); 31 | 32 | if (isNewAnnouncement) { 33 | localStorage.setItem(STORAGE_DISMISS_KEY, false); 34 | } 35 | 36 | if ( 37 | isNewAnnouncement || 38 | localStorage.getItem(STORAGE_DISMISS_KEY) === 'false' 39 | ) { 40 | setClosed(false); 41 | } 42 | }, []); 43 | 44 | if (!content || isClosed) { 45 | return null; 46 | } 47 | 48 | return ( 49 |Common NLP tasks such as named_entities, noun_chunks, etc.
| Function | Description |
| --- | --- |
| `named_entities` | Return named-entities. |
| `noun_chunks` | Return noun_chunks, group of consecutive words that belong together. |
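As a quick, hedged sketch of how the two functions above are called (the sample sentence is only an illustration; see the per-function pages below for details):

```python
import pandas as pd
import texthero as hero

s = pd.Series("Barack Obama was born in Hawaii")

# Each row becomes a list of (text, label, start character, end character) tuples.
print(hero.named_entities(s)[0])

# Each row becomes a list of (chunk, label, start character, end character) tuples.
print(hero.noun_chunks(s)[0])
```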
Map words into vectors using different algorithms such as TF-IDF, word2vec or GloVe.

| Function | Description |
| --- | --- |
| `dbscan` | Perform DBSCAN clustering. |
| `kmeans` | Perform K-means clustering algorithm. |
| `meanshift` | Perform mean shift clustering. |
| `nmf` | Perform non-negative matrix factorization. |
| `pca` | Perform principal component analysis on the given Pandas Series. |
| `term_frequency` | Represent a text-based Pandas Series using term_frequency. |
| `tfidf` | Represent a text-based Pandas Series using TF-IDF. |
| `tsne` | Perform TSNE on the given pandas series. |
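A minimal, hedged sketch of how these functions are usually chained on a DataFrame (the column names are illustrative, not part of the API):

```python
import pandas as pd
import texthero as hero

df = pd.DataFrame({"text": ["Football is great", "Python is great", "Football and Python"]})

# Clean, represent with TF-IDF, reduce to 2 dimensions, then cluster.
df["clean_text"] = hero.clean(df["text"])
df["tfidf"] = hero.tfidf(df["clean_text"])
df["pca"] = hero.pca(df["tfidf"])
df["kmeans_labels"] = hero.kmeans(df["tfidf"], n_clusters=2)
```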
Visualize insights and statistics of a text-based Pandas DataFrame.

| Function | Description |
| --- | --- |
| `scatterplot` | Show scatterplot using python plotly scatter. |
| `top_words` | Return a pandas series with index the top words and as value the count. |
| `wordcloud` | Plot wordcloud image using WordCloud from word_cloud package. |
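A short, hedged sketch combining the three functions (the `topic` column is hypothetical and only used for coloring the plot):

```python
import pandas as pd
import texthero as hero

df = pd.DataFrame({"text": ["Football is great", "Python is great", "Football and Python"]})
df["pca"] = hero.pca(hero.tfidf(hero.clean(df["text"])))
df["topic"] = ["sport", "tech", "mixed"]

# 2-dimensional scatterplot of the PCA space, colored by topic.
hero.scatterplot(df, col="pca", color="topic", title="PCA space")

# Frequency-based views of the raw text.
hero.wordcloud(df["text"])
print(hero.top_words(df["text"]).head())
```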
named_entities
(s, package='spacy')¶Return named-entities.
14 |Return a Pandas Series where each row contains a list of tuples containing information regarding the given named entities.
15 |Tuple: (entity name, entity label, starting character, ending character)
16 |Under the hood, named_entities makes use of spaCy named entity recognition.
17 |PERSON: People, including fictional.
NORP: Nationalities or religious or political groups.
FAC: Buildings, airports, highways, bridges, etc.
ORG : Companies, agencies, institutions, etc.
GPE: Countries, cities, states.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW: Named documents made into laws.
LANGUAGE: Any named language.
DATE: Absolute or relative dates or periods.
TIME: Times smaller than a day.
PERCENT: Percentage, including ”%“.
MONEY: Monetary values, including unit.
QUANTITY: Measurements, as of weight or distance.
ORDINAL: “first”, “second”, etc.
CARDINAL: Numerals that do not fall under another type.
Examples
41 |>>> import texthero as hero
42 | >>> import pandas as pd
43 | >>> s = pd.Series("Yesterday I was in NY with Bill de Blasio")
44 | >>> hero.named_entities(s)[0]
45 | [('Yesterday', 'DATE', 0, 9), ('NY', 'GPE', 19, 21), ('Bill de Blasio', 'PERSON', 27, 41)]
46 |
clean
(s: pandas.core.series.Series, pipeline=None) → pandas.core.series.Series¶Pre-process a text-based Pandas Series.
14 |texthero.preprocessing.fillna()
texthero.preprocessing.lowercase()
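A hedged usage sketch of the clean function described above (the series content is only an example; the custom pipeline shown is one assumption, any list of preprocessing functions works):

```python
import pandas as pd
import texthero as hero
from texthero import preprocessing

s = pd.Series(["  Texthero 123 cleans <b>messy</b> text!  ", None])

# Default pipeline (fillna, lowercase and the other steps listed above).
print(hero.clean(s))

# Custom pipeline: pass any list of preprocessing functions.
custom_pipeline = [preprocessing.fillna, preprocessing.lowercase, preprocessing.remove_whitespace]
print(hero.clean(s, pipeline=custom_pipeline))
```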
drop_no_content
(s: pandas.core.series.Series)¶Drop all rows without content.
14 |Drop all rows from the given Pandas Series where has_content()
is False.
Examples
16 |>>> s = pd.Series(["content", np.nan, "\t\n", " "])
17 | >>> drop_no_content(s)
18 | 0 content
19 | dtype: object
20 |
get_default_pipeline
() → List[Callable[[pandas.core.series.Series], pandas.core.series.Series]]¶Return a list containing all the methods used in the default cleaning pipeline.
14 |texthero.preprocessing.fillna()
texthero.preprocessing.lowercase()
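A small, hedged sketch of how the returned list can be reused and extended (appending remove_urls here is only an illustration):

```python
import pandas as pd
import texthero as hero
from texthero import preprocessing

pipeline = preprocessing.get_default_pipeline()
pipeline.append(preprocessing.remove_urls)

s = pd.Series("Visit https://texthero.org for more examples 123")
print(hero.clean(s, pipeline=pipeline))
```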
has_content
(s: pandas.core.series.Series)¶Return a Boolean Pandas Series indicating whether each row has content.
14 |Examples
15 |>>> s = pd.Series(["content", np.nan, "\t\n", " "])
16 | >>> has_content(s)
17 | 0 True
18 | 1 False
19 | 2 False
20 | 3 False
21 | dtype: bool
22 |
remove_angle_brackets
(s: pandas.core.series.Series)¶Remove content within angle brackets <> and the angle brackets.
14 |See also
16 |remove_brackets()
remove_round_brackets()
remove_curly_brackets()
remove_square_brackets()
Examples
24 |>>> s = pd.Series("Texthero <is not a superhero!>")
25 | >>> remove_angle_brackets(s)
26 | 0 Texthero
27 | dtype: object
28 |
remove_brackets
(s: pandas.core.series.Series)¶Remove content within brackets and the brackets themselves.
14 |Remove content from any kind of brackets, (), [], {}, <>.
15 |See also
17 |remove_round_brackets()
remove_curly_brackets()
remove_square_brackets()
remove_angle_brackets()
Examples
25 |>>> s = pd.Series("Texthero (round) [square] [curly] [angle]")
26 | >>> remove_brackets(s)
27 | 0 Texthero
28 | dtype: object
29 |
remove_curly_brackets
(s: pandas.core.series.Series)¶Remove content within curly brackets {} and the curly brackets.
14 |See also
16 |remove_brackets()
remove_angle_brackets()
remove_round_brackets()
remove_square_brackets()
Examples
24 |>>> s = pd.Series("Texthero {is not a superhero!}")
25 | >>> remove_curly_brackets(s)
26 | 0 Texthero
27 | dtype: object
28 |
remove_diacritics
(input: pandas.core.series.Series) → pandas.core.series.Series¶Remove all diacritics and accents.
14 |Remove all diacritics and accents from any word and characters from the given Pandas Series. Return a cleaned version of the Pandas Series.
15 |Examples
16 |>>> import texthero as hero
17 | >>> import pandas as pd
18 | >>> s = pd.Series("Noël means Christmas in French")
19 | >>> hero.remove_diacritics(s)
20 | 0 Noel means Christmas in French
21 | dtype: object
22 |
remove_digits
(input: pandas.core.series.Series, only_blocks=True) → pandas.core.series.Series¶Remove all digits and replace them with a single space.
14 |By default, only removes “blocks” of digits. For instance, 1234 falcon9 becomes ` falcon9`.
15 |When the argument only_blocks is set to False, remove any digits.
16 |See also replace_digits()
to replace digits with another string.
Remove only blocks of digits.
22 |Examples
27 |>>> import texthero as hero
28 | >>> import pandas as pd
29 | >>> s = pd.Series("7ex7hero is fun 1111")
30 | >>> hero.preprocessing.remove_digits(s)
31 | 0 7ex7hero is fun
32 | dtype: object
33 | >>> hero.preprocessing.remove_digits(s, only_blocks=False)
34 | 0 ex hero is fun
35 | dtype: object
36 |
Remove html tags from the given Pandas Series.
14 |Remove all html tags of the type <.*?> such as <html>, <p>, <div class="hello">, remove all html escapes of type &nbsp; and return a cleaned Pandas Series.
15 |Examples
16 |>>> s = pd.Series("<html><h1>Title</h1></html>")
17 | >>> remove_html_tags(s)
18 | 0 Title
19 | dtype: object
20 |
remove_punctuation
(input: pandas.core.series.Series) → pandas.core.series.Series¶Replace all punctuation with a single space (” “).
14 |remove_punctuation removes all punctuation from the given Pandas Series and replaces it with a single space. It considers as punctuation characters all string.punctuation
symbols !”#$%&’()*+,-./:;<=>?@[]^_`{|}~).
See also replace_punctuation()
to replace punctuation with a custom symbol.
Examples
17 |>>> import texthero as hero
18 | >>> import pandas as pd
19 | >>> s = pd.Series("Finnaly.")
20 | >>> hero.remove_punctuation(s)
21 | 0 Finnaly
22 | dtype: object
23 |
remove_round_brackets
(s: pandas.core.series.Series)¶Remove content within parentheses () and the parentheses themselves.
14 |See also
16 |remove_brackets()
remove_angle_brackets()
remove_curly_brackets()
remove_square_brackets()
Examples
24 |>>> s = pd.Series("Texthero (is not a superhero!)")
25 | >>> remove_round_brackets(s)
26 | 0 Texthero
27 | dtype: object
28 |
remove_square_brackets
(s: pandas.core.series.Series)¶Remove content within square brackets [] and the square brackets.
14 |See also
16 |remove_brackets()
remove_angle_brackets()
remove_round_brackets()
remove_curly_brackets()
Examples
24 |>>> s = pd.Series("Texthero [is not a superhero!]")
25 | >>> remove_square_brackets(s)
26 | 0 Texthero
27 | dtype: object
28 |
remove_stopwords
(input: pandas.core.series.Series, stopwords: Union[Set[str], NoneType] = None, remove_str_numbers=False) → pandas.core.series.Series¶Remove all instances of words.
14 |By default uses NLTK’s english stopwords of 179 words:
15 |Set of stopword strings to remove. If not passed, by default it uses NLTK English stopwords.
20 |Examples
25 |Using default NLTK list of stopwords:
26 |>>> import texthero as hero
27 | >>> import pandas as pd
28 | >>> s = pd.Series("Texthero is not only for the heroes")
29 | >>> hero.remove_stopwords(s)
30 | 0 Texthero heroes
31 | dtype: object
32 |
Add custom words into the default list of stopwords:
35 |>>> import texthero as hero
36 | >>> from texthero import stopwords
37 | >>> import pandas as pd
38 | >>> default_stopwords = stopwords.DEFAULT
39 | >>> custom_stopwords = default_stopwords.union(set(["heroes"]))
40 | >>> s = pd.Series("Texthero is not only for the heroes")
41 | >>> hero.remove_stopwords(s, custom_stopwords)
42 | 0 Texthero
43 | dtype: object
44 |
remove_urls
(s: pandas.core.series.Series) → pandas.core.series.Series¶Remove all urls from a given Pandas Series.
14 |remove_urls removes any urls and replaces them with a single space.
15 |See also
17 |texthero.preprocessing.replace_urls()
Examples
22 |>>> import texthero as hero
23 | >>> import pandas as pd
24 | >>> s = pd.Series("Go to: https://example.com")
25 | >>> hero.remove_urls(s)
26 | 0 Go to:
27 | dtype: object
28 |
remove_whitespace
(input: pandas.core.series.Series) → pandas.core.series.Series¶Remove any extra white spaces.
14 |Remove any extra whitespace in the given Pandas Series. Also removes newlines, tabs, and any other form of space.
15 |Useful when there is a need to visualize a Pandas Series and most cells have many newlines or other kind of space characters.
16 |Examples
17 |>>> import texthero as hero
18 | >>> import pandas as pd
19 | >>> s = pd.Series("Title \n Subtitle \t ...")
20 | >>> hero.remove_whitespace(s)
21 | 0 Title Subtitle ...
22 | dtype: object
23 |
replace_punctuation
(input: pandas.core.series.Series, symbol: str = ' ') → pandas.core.series.Series¶Replace all punctuation with a given symbol.
14 |replace_punctuation replaces all punctuation in the given Pandas Series with a custom symbol. It considers as punctuation characters all string.punctuation
symbols !”#$%&’()*+,-./:;<=>?@[]^_`{|}~).
Symbol to use as replacement for all string punctuation.
20 |Examples
25 |>>> import texthero as hero
26 | >>> import pandas as pd
27 | >>> s = pd.Series("Finnaly.")
28 | >>> hero.replace_punctuation(s, " <PUNCT> ")
29 | 0 Finnaly <PUNCT>
30 | dtype: object
31 |
replace_stopwords
(input: pandas.core.series.Series, symbol: str, stopwords: Union[Set[str], NoneType] = None) → pandas.core.series.Series¶Replace all instances of words with symbol.
14 |By default uses NLTK’s english stopwords of 179 words.
15 |Character(s) to replace words with.
20 |Set of stopwords string to remove. If not passed, by default it used NLTK English stopwords.
22 |Examples
27 |>>> s = pd.Series("the book of the jungle")
28 | >>> replace_stopwords(s, "X")
29 | 0 X book X X jungle
30 | dtype: object
31 |
replace_urls
(s: pandas.core.series.Series, symbol: str) → pandas.core.series.Series¶Replace all urls with the given symbol.
14 |replace_urls replaces any urls in the given Pandas Series with the given symbol.
15 |See also
17 |texthero.preprocessing.remove_urls()
Examples
22 |>>> import texthero as hero
23 | >>> import pandas as pd
24 | >>> s = pd.Series("Go to: https://example.com")
25 | >>> hero.replace_urls(s, "<URL>")
26 | 0 Go to: <URL>
27 | dtype: object
28 |
stem
(input: pandas.core.series.Series, stem='snowball', language='english') → pandas.core.series.Series¶Stem series using either porter or snowball NLTK stemmers.
14 |The act of stemming means removing the end of a word with a heuristic process. It is useful in contexts where the meaning of the word matters more than its exact derivation. Stemming is very efficient and well suited when the given dataset is large.
15 |texthero.preprocessing.stem make use of two NLTK stemming algorithms known as nltk.stem.SnowballStemmer
and nltk.stem.PorterStemmer
. SnowballStemmer should be used when the Pandas Series contains non-English text, as it has multilanguage support.
Stemming algorithm. It can be either ‘snowball’ or ‘porter’
21 |Supported languages: danish, dutch, english, finnish, french, german , hungarian, italian, norwegian, portuguese, romanian, russian, spanish and swedish.
23 |Notes
28 |By default NLTK stemming algorithms lowercase all text.
29 |Examples
30 |>>> import texthero as hero
31 | >>> import pandas as pd
32 | >>> s = pd.Series("I used to go \t\n running.")
33 | >>> hero.preprocessing.stem(s)
34 | 0 i use to go running.
35 | dtype: object
36 |
tokenize
(s: pandas.core.series.Series) → pandas.core.series.Series¶Tokenize each row of the given Series.
14 |Tokenize each row of the given Pandas Series and return a Pandas Series where each row contains a list of tokens.
15 |Algorithm: add a space around any punctuation symbol, except when the symbol lies between two alphanumeric characters, then split on spaces.
17 |Examples
18 |>>> import texthero as hero
19 | >>> import pandas as pd
20 | >>> s = pd.Series(["Today you're looking great!"])
21 | >>> hero.tokenize(s)
22 | 0 [Today, you're, looking, great, !]
23 | dtype: object
24 |
kmeans
(s: pandas.core.series.Series, n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=- 1, algorithm='auto')¶Perform K-means clustering algorithm.
14 |pca
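No example is given above, so here is a hedged sketch, assuming the Series has already been turned into numeric vectors (for instance with tfidf and pca from the representation module):

```python
import pandas as pd
import texthero as hero

s = pd.Series(["football game", "tennis match", "python code", "java code"])

# Represent the text first, then cluster the resulting vectors.
vectors = hero.pca(hero.tfidf(hero.clean(s)))

# One cluster id per row.
labels = hero.kmeans(vectors, n_clusters=2)
print(labels)
```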
(s, n_components=2)¶Perform principal component analysis on the given Pandas Series.
14 |In general, pca should be called after the text has already been represented.
15 |Number of components to keep. If n_components is not set or None, all components are kept.
20 |Examples
25 |>>> import texthero as hero
26 | >>> import pandas as pd
27 | >>> s = pd.Series(["Sentence one", "Sentence two"])
28 |
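The snippet above stops before the actual call; a hedged continuation (the exact numeric output depends on the TF-IDF weighting, so it is omitted) could look like:

```python
import pandas as pd
import texthero as hero

s = pd.Series(["Sentence one", "Sentence two"])

# Represent the text, then keep the first 2 principal components.
s_pca = hero.pca(hero.tfidf(s), n_components=2)
print(s_pca)
```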
term_frequency
(s: pandas.core.series.Series, max_features: Union[int, NoneType] = None, return_feature_names=False)¶Represent a text-based Pandas Series using term_frequency.
14 |Maximum number of features to keep.
19 |If True, return a tuple (term_frequency_series, features_names)
21 |Examples
26 |>>> import texthero as hero
27 | >>> import pandas as pd
28 | >>> s = pd.Series(["Sentence one", "Sentence two"])
29 | >>> hero.term_frequency(s)
30 | 0 [1, 1, 0]
31 | 1 [1, 0, 1]
32 | dtype: object
33 |
To return the features_names:
36 |>>> import texthero as hero
37 | >>> import pandas as pd
38 | >>> s = pd.Series(["Sentence one", "Sentence two"])
39 | >>> hero.term_frequency(s, return_feature_names=True)
40 | (0 [1, 1, 0]
41 | 1 [1, 0, 1]
42 | dtype: object, ['Sentence', 'one', 'two'])
43 |
tfidf
(s: pandas.core.series.Series, max_features=None, min_df=1, return_feature_names=False)¶Represent a text-based Pandas Series using TF-IDF.
14 |Maximum number of features to keep.
19 |When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
21 |If True, return a tuple (tfidf_series, features_names)
23 |Examples
28 |>>> import texthero as hero
29 | >>> import pandas as pd
30 | >>> s = pd.Series(["Sentence one", "Sentence two"])
31 | >>> hero.tfidf(s)
32 | 0 [0.5797386715376657, 0.8148024746671689, 0.0]
33 | 1 [0.5797386715376657, 0.0, 0.8148024746671689]
34 | dtype: object
35 |
To return the feature_names:
38 |>>> import texthero as hero
39 | >>> import pandas as pd
40 | >>> s = pd.Series(["Sentence one", "Sentence two"])
41 | >>> hero.tfidf(s, return_feature_names=True)
42 | (0 [0.5797386715376657, 0.8148024746671689, 0.0]
43 | 1 [0.5797386715376657, 0.0, 0.8148024746671689]
44 | dtype: object, ['Sentence', 'one', 'two'])
45 |
tsne
(s: pandas.core.series.Series, n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5, n_jobs=- 1)¶Perform TSNE on the given pandas series.
14 |Number of components to keep. If n_components is not set or None, all components are kept.
19 |scatterplot
(df: pandas.core.frame.DataFrame, col: str, color: str = None, hover_data=None, title='', return_figure=False)¶Show scatterplot using python plotly scatter.
14 |The name of the column of the DataFrame used for x and y axis.
19 |top_words
(s: pandas.core.series.Series, normalize=False) → pandas.core.series.Series¶Return a pandas series with index the top words and as value the count.
14 |Tokenization: split by space and remove all punctuations that are not between characters.
15 |When set to true, return normalized values.
19 |wordcloud
(s: pandas.core.series.Series, font_path: str = None, width: int = 400, height: int = 200, max_words=200, mask=None, contour_width=0, contour_color='PAPAYAWHIP', background_color='PAPAYAWHIP', relative_scaling='auto', colormap=None, return_figure=False)¶Plot wordcloud image using WordCloud from word_cloud package.
14 |Most of the arguments are very similar if not equal to the mother function. In contrast, all words are taken into account when computing the wordcloud, including stopwords. They can be easily removed with preprocessing.remove_stopwords.
15 |Words are computed using generate_from_frequencies.
16 |Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don’t have this font, you need to adjust this path.
21 |Width of the canvas.
23 |Height of the canvas.
25 |The maximum number of words.
27 |When set, gives a binary mask on where to draw words. When set, width and height will be ignored and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considered "masked out" while other entries will be free to draw on.
29 |If mask is not None and contour_width > 0, draw the mask contour.
31 |Mask contour color.
33 |Smallest font size to use. Will stop when there is no more room in this size.
35 |Background color for the word cloud image.
37 |Maximum font size for the largest word. If None, height of the image is used.
Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good. If 'auto' it will be set to 0.5 unless repeat is true, in which case it will be set to 0.
47 |Matplotlib colormap to randomly draw colors from for each word.
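A hedged example of the most common parameters (font availability differs per OS, so font_path is left at its default here; the sample text is only an illustration):

```python
import pandas as pd
import texthero as hero

s = pd.Series(["texthero makes text preprocessing representation and visualization easy"])

# Basic word cloud with a custom canvas size and colormap.
hero.wordcloud(s, width=600, height=300, max_words=100, background_color="white", colormap="viridis")
```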
49 |This project is maintained by a dedicated group of people.
47 |This project is used by all these people
186 |This project is used by many folks
35 |Are you using this project?
38 | 39 | Add your company 40 | 41 |