├── .github
│   └── workflows
│       └── build-package-upload-pypi.yml
├── .gitignore
├── LICENSE
├── README.md
├── _assets
│   ├── hf-space.png
│   ├── logo.png
│   └── logo1.png
├── examples
│   └── example_notebook.ipynb
├── keyphrasetransformer
│   ├── __init__.py
│   └── keyphrasetransformer.py
├── requirements.txt
└── setup.py

/.github/workflows/build-package-upload-pypi.yml:
--------------------------------------------------------------------------------
name: Publish KeyPhraseTransformer to PyPI and TestPyPI

on:
  # This workflow uploads the Python package to PyPI when a release is created
  release:
    types: [created]
    branches:
      - main

jobs:
  build-n-publish:
    name: Build and publish KeyPhraseTransformer to PyPI and TestPyPI
    runs-on: ubuntu-18.04

    steps:
      - uses: actions/checkout@main
      - name: Set up Python 3.7
        uses: actions/setup-python@v1
        with:
          python-version: 3.7

      # Only a source distribution is built here; no wheel is produced.
      - name: Build a source tarball
        run: >-
          python setup.py sdist
      - name: Publish a Python distribution to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          user: __token__
          password: ${{ secrets.PYPI_UPLOAD }}
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
__pycache__
*.log
models
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Shivanand Roy

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
Quickly extract key-phrases/topics from your text data with T5 transformer

**KeyPhraseTransformer** is built on the T5 Transformer architecture and trained on 500,000 training samples to extract important phrases/topics/themes from text of any length.

### Why KeyPhraseTransformer?
- You get the power of the T5 architecture.
- The underlying T5 model is trained specifically to extract important phrases from a text corpus, so the results are of superior quality.
- No pre-processing of any kind is needed. Just pass your raw text to the model.
- It does not need any n-gram-related input from the user. It extracts unigrams, bigrams, or trigrams automatically.
- It can process text of any length, because it splits the input into smaller chunks internally.
- It helps automate the topic modeling/keyword extraction process end to end, with no manual intervention.

### Installation:
```bash
pip install keyphrasetransformer
```
### Use:
[KeyPhraseTransformer demo on Hugging Face Spaces](https://huggingface.co/spaces/snrspeaks/KeyPhraseTransformer)

```python
from keyphrasetransformer import KeyPhraseTransformer

kp = KeyPhraseTransformer()

doc = """
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned
on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice.
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework
that converts every language problem into a text-to-text format. Our systematic study compares pretraining objectives,
architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks.
By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”,
we achieve state-of-the-art results on many benchmarks covering summarization, question answering,
text classification, and more. To facilitate future work on transfer learning for NLP,
we release our dataset, pre-trained models, and code.

"""

kp.get_key_phrases(doc)
```
```
['transfer learning',
 'natural language processing (nlp)',
 'nlp',
 'text-to-text',
 'language understanding',
 'transfer approach',
 'pretraining objectives',
 'corpus',
 'summarization',
 'question answering']
```
--------------------------------------------------------------------------------
/_assets/hf-space.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Shivanandroy/KeyPhraseTransformer/80e2c8d7c4869d1e120f6e8004630812443e766b/_assets/hf-space.png
--------------------------------------------------------------------------------
/_assets/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Shivanandroy/KeyPhraseTransformer/80e2c8d7c4869d1e120f6e8004630812443e766b/_assets/logo.png
--------------------------------------------------------------------------------
/_assets/logo1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Shivanandroy/KeyPhraseTransformer/80e2c8d7c4869d1e120f6e8004630812443e766b/_assets/logo1.png
--------------------------------------------------------------------------------
/examples/example_notebook.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from keyphrasetransformer import KeyPhraseTransformer\n",
    "\n",
    "kp = KeyPhraseTransformer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['transfer learning',\n",
       " 'natural language processing (nlp)',\n",
       " 'nlp',\n",
       " 'text-to-text',\n",
       " 'language understanding',\n",
       " 'transfer approach',\n",
       " 'pretraining objectives',\n",
       " 'corpus',\n",
       " 'summarization',\n",
       " 'question answering']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc = \"\"\"Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned \n",
    "on a downstream task, has emerged as a powerful technique in natural language processing (NLP). \n",
    "The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. \n",
    "In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework \n",
    "that converts every language problem into a text-to-text format. Our systematic study compares pretraining objectives, \n",
    "architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. \n",
    "By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, \n",
    "we achieve state-of-the-art results on many benchmarks covering summarization, question answering, \n",
    "text classification, and more. To facilitate future work on transfer learning for NLP, \n",
    "we release our dataset, pre-trained models, and code.\n",
    "\n",
    "\"\"\"\n",
    "\n",
    "kp.get_key_phrases(doc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "7306abc11ed4dec42b6661423f565940329246c8f691811cb840605cdee0cded"
  },
  "kernelspec": {
   "display_name": "Python 3.8.12 ('KPT-ENV')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
/keyphrasetransformer/__init__.py:
--------------------------------------------------------------------------------
from .keyphrasetransformer import KeyPhraseTransformer
--------------------------------------------------------------------------------
/keyphrasetransformer/keyphrasetransformer.py:
--------------------------------------------------------------------------------
# Imports
import nltk
from nltk.corpus import words
from nltk.tokenize import word_tokenize, sent_tokenize
from transformers import AutoTokenizer, T5ForConditionalGeneration, MT5ForConditionalGeneration

# Fetch the NLTK data used for sentence/word tokenization and the English word list.
nltk.download('punkt')
nltk.download("words")


class KeyPhraseTransformer:
    def __init__(self, model_type: str = "t5", model_name: str = "snrspeaks/KeyPhraseTransformer"):
        self.model_name = model_name
        if model_type == "t5":
            self.model = T5ForConditionalGeneration.from_pretrained(self.model_name)
        elif model_type == "mt5":
            self.model = MT5ForConditionalGeneration.from_pretrained(self.model_name)
        else:
            # Fail early instead of leaving self.model undefined.
            raise ValueError(f"Unsupported model_type {model_type!r}; expected 't5' or 'mt5'")
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)

    def split_into_paragraphs(self, doc: str, max_tokens_per_para: int = 128):
        # Greedily pack consecutive sentences into paragraphs that stay
        # under max_tokens_per_para tokenizer tokens.
        sentences = sent_tokenize(doc.strip())
        temp = ""
        temp_list = []
        final_list = []

        for i, sentence in enumerate(sentences):
            temp = temp + " " + sentence
            wc_temp = len(self.tokenizer.tokenize(temp))

            if wc_temp < max_tokens_per_para:
                temp_list.append(sentence)

                if i == len(sentences) - 1:
                    final_list.append(" ".join(temp_list))

            else:
                # Budget exceeded: flush the collected sentences and start
                # a new paragraph with the current sentence.
                final_list.append(" ".join(temp_list))

                temp = sentence
                temp_list = [sentence]

                if i == len(sentences) - 1:
                    final_list.append(" ".join(temp_list))

        return [para for para in final_list if len(para.strip()) != 0]

    def process_outputs(self, outputs):
        # Each prediction is a " | "-separated string. Split, flatten, and
        # de-duplicate while preserving first-seen order.
        temp = [output[0].split(" | ") for output in outputs]
        flatten = [item for sublist in temp for item in sublist]
        return sorted(set(flatten), key=flatten.index)

    def filter_outputs(self, key_phrases, text):
        key_phrases = [elem.lower() for elem in key_phrases]
        text = text.lower()

        # Build the lookup sets once instead of re-tokenizing the text and
        # rescanning the word list for every token.
        text_tokens = set(word_tokenize(text))
        vocabulary = set(words.words())

        valid_phrases = []
        invalid_phrases = []

        for phrase in key_phrases:
            for token in word_tokenize(phrase):
                if (token in text_tokens) or (token in vocabulary):
                    if phrase not in valid_phrases:
                        valid_phrases.append(phrase)
                else:
                    invalid_phrases.append(phrase)

        # Keep only phrases in which every token appears in the text or the word list.
        return [elem for elem in valid_phrases if elem not in invalid_phrases]

    def predict(self, doc: str):
        input_ids = self.tokenizer.encode(
            doc, return_tensors="pt", add_special_tokens=True
        )
        # Beam search decoding. top_p and top_k only take effect when
        # sampling is enabled (do_sample=True), which it is not here.
        generated_ids = self.model.generate(
            input_ids=input_ids,
            num_beams=2,
            max_length=512,
            repetition_penalty=2.5,
            length_penalty=1,
            early_stopping=True,
            top_p=0.95,
            top_k=50,
            num_return_sequences=1,
        )
        preds = [
            self.tokenizer.decode(
                g, skip_special_tokens=True, clean_up_tokenization_spaces=True
            )
            for g in generated_ids
        ]
        return preds

    def get_key_phrases(self, text: str, text_block_size: int = 64):
        # Split the text into blocks, predict key phrases per block, then
        # merge, de-duplicate, and filter the results.
        results = []
        paras = self.split_into_paragraphs(
            doc=text, max_tokens_per_para=text_block_size
        )

        for para in paras:
            results.append(self.predict(para))

        key_phrases = self.filter_outputs(self.process_outputs(results), text)
        return key_phrases

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Shivanandroy/KeyPhraseTransformer/80e2c8d7c4869d1e120f6e8004630812443e766b/requirements.txt
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
import setuptools
from os import path

here = path.abspath(path.dirname(__file__))

with open(path.join(here, "README.md"), encoding="utf-8") as f:
    long_description = f.read()

setuptools.setup(
    name="keyphrasetransformer",
    version="0.0.2",
    # NOTE: the repository's LICENSE file contains the MIT license, while
    # this field and the license classifier below declare Apache 2.0.
    license="apache-2.0",
    author="Shivanand Roy",
    author_email="snrcodes@gmail.com",
    description="Quickly extract key-phrases/topics from your text data with T5 transformer.",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/Shivanandroy/KeyPhraseTransformer",
    project_urls={
        "Repo": "https://github.com/Shivanandroy/KeyPhraseTransformer",
        "Bug Tracker": "https://github.com/Shivanandroy/KeyPhraseTransformer/issues",
    },
    keywords=[
        "keyword extraction",
        "keyphrase extraction",
        "keyphrase",
        "extraction",  # comma added: without it this string merged with "T5"
        "T5",
        "simpleT5",
        "transformers",
        "NLP"
    ],
    packages=setuptools.find_packages(),
    python_requires=">=3.5",
    install_requires=[
        "nltk",
        "transformers"
    ],
    classifiers=[
        "Intended Audience :: Developers",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "License :: OSI Approved :: Apache Software License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.5",
        "Programming Language :: Python :: 3.6",
        "Programming Language :: Python :: 3.7",
        "Programming Language :: Python :: 3.8",
    ],
)
--------------------------------------------------------------------------------
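
The README states that the package handles text of any length by breaking the input into smaller chunks, which the package does in `split_into_paragraphs`. The sketch below illustrates that chunking idea in isolation, so it runs without downloading the model or NLTK data. It is not the package's implementation: `chunk_sentences` and `max_words` are hypothetical names, sentences are split with a simple regular expression rather than NLTK's `sent_tokenize`, and the budget is counted in whitespace-separated words rather than T5 tokens.

```python
import re


def chunk_sentences(text, max_words=64):
    """Greedily pack whole sentences into chunks of at most max_words words.

    Hypothetical helper for illustration only. A single sentence longer
    than max_words is kept whole as its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks = []
    current = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current = []
            current_len = 0
        current.append(sentence)
        current_len += n_words

    # Flush whatever is left after the last sentence.
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six seven. Eight nine."
    for chunk in chunk_sentences(sample, max_words=5):
        print(chunk)
```

Chunks are built from whole sentences so that each piece sent to the model remains readable text. The package applies the same principle with a token budget (`text_block_size`, default 64) measured by the model's tokenizer.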