├── Existing.png ├── Pegasus.png ├── Basic to Advance Text Summarisation Models ├── Gensim.ipynb ├── NLTK.ipynb ├── LSA.ipynb ├── SumBasic.ipynb ├── Sentence_Ranking.ipynb ├── Luhn's_Model.ipynb ├── AI_Transformer_pipeline.ipynb ├── Cluster_Based.ipynb ├── Connected_Dominating_Set.ipynb ├── Compressive_Document_Summarization.ipynb └── TextRank.ipynb └── README.md /Existing.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Tuhin-SnapD/Text-Summarization-Models/HEAD/Existing.png -------------------------------------------------------------------------------- /Pegasus.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Tuhin-SnapD/Text-Summarization-Models/HEAD/Pegasus.png -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/Gensim.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# Gensim\n", 21 | "\n", 22 | "**Gensim** is a popular Python library that provides algorithms for natural language processing tasks such as text summarization. Here are some advantages and disadvantages of Gensim algorithm for text summarization:\n", 23 | "\n", 24 | "### Pros:\n", 25 | "*\tFlexibility: Gensim provides a wide range of algorithms for text summarization, including LSA, LDA, and TextRank, among others. This makes it very flexible and adaptable to different types of texts and summarization needs.\n", 26 | "*\tHigh-quality summaries: Gensim algorithms are known to produce high-quality summaries that capture the essence of the original text.\n", 27 | "*\tLanguage independence: Gensim can summarize texts in any language, making it suitable for multilingual applications.\n", 28 | "\n", 29 | "### Cons:\n", 30 | "*\tComplexity: Some of the algorithms provided by Gensim can be complex and require a significant amount of tuning and parameter selection to produce good results.\n", 31 | "*\tResource-intensive: Gensim algorithms can be resource-intensive and may require significant computational power to run efficiently, especially for large datasets.\n", 32 | "*\tLack of coherence: Like other summarization algorithms, Gensim can produce summaries that lack coherence, especially when summarizing longer texts.\n", 33 | "*\tLimited coverage: Gensim algorithms tend to focus on the most important sentences, but may miss some important details that are not explicitly mentioned in the text.\n", 34 | "\n", 35 | "Overall, Gensim is a powerful tool for text summarization, but its effectiveness depends heavily on the specific algorithm used and the quality of the input text. Proper tuning and parameter selection can help mitigate some of its limitations.\n", 36 | "\n", 37 | "These are the scores we achieved:\n", 38 | "\n", 39 | " ROUGE Score:\n", 40 | " Precision: 1.000\n", 41 | " Recall: 0.490\n", 42 | " F1-Score: 0.658\n", 43 | "\n", 44 | " BLEU Score: 0.750\n", 45 | "\n", 46 | "## References\n", 47 | "Here are some research papers that use Gensim for text summarization:\n", 48 | "\n", 49 | "1. \"Gensim: Topic Modelling for Humans\" by R. Řehůřek and P. 
Sojka. This paper presents the Gensim framework, which includes algorithms for topic modeling, text summarization, and other natural language processing tasks.\n", 50 | "\n", 51 | "2. \"Text Summarization with Gensim\" by M. Lichman. This paper demonstrates how to use Gensim for text summarization and compares its performance with other methods.\n", 52 | "\n", 53 | "3. \"Automated Text Summarization with Gensim\" by M. Kesavan. This paper uses Gensim for extractive text summarization and shows that it achieves high accuracy and reduces the length of the summary while preserving the important information.\n", 54 | "\n", 55 | "4. \"Automatic Summarization of Biomedical Documents with Gensim\" by D. Bhagwat and R. Gangadharaiah. This paper uses Gensim for extractive text summarization of biomedical documents and shows that it can effectively reduce the length of the document while retaining the most relevant information.\n", 56 | "\n", 57 | "These papers demonstrate the effectiveness of Gensim for text summarization and highlight its versatility for different domains and types of texts." 58 | ], 59 | "metadata": { 60 | "id": "JUVpiZ0ALTEv" 61 | } 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "colab": { 68 | "base_uri": "https://localhost:8080/" 69 | }, 70 | "id": "GCuLwSxIoXUz", 71 | "outputId": "b0938a73-5a69-45ef-96e0-5be5a1524d02" 72 | }, 73 | "outputs": [ 74 | { 75 | "output_type": "stream", 76 | "name": "stdout", 77 | "text": [ 78 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 79 | "Collecting rouge\n", 80 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 81 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 82 | "Installing collected packages: rouge\n", 83 | "Successfully installed rouge-1.0.1\n", 84 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 85 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 86 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 87 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n", 88 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n", 89 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n" 90 | ] 91 | }, 92 | { 93 | "output_type": "stream", 94 | "name": "stderr", 95 | "text": [ 96 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 97 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n", 98 | "WARNING:gensim.summarization.summarizer:Input text is expected to have at least 10 sentences.\n", 99 | "WARNING:gensim.summarization.summarizer:Input corpus is expected to have at least 10 documents.\n", 100 | "WARNING:gensim.summarization.summarizer:Input text is expected to have at least 10 sentences.\n", 101 | "WARNING:gensim.summarization.summarizer:Input corpus is expected to have at least 10 documents.\n" 102 | ] 103 | }, 104 | { 105 | "output_type": "stream", 106 | "name": "stdout", 107 | "text": [ 108 | "Percent summary\n", 109 | "India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities.\n", 110 | "The move is expected to cover an additional 
270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program.\n", 111 | "However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India.\n", 112 | "The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States.\n", 113 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people.\n", 114 | "The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\n", 115 | "Word count summary\n", 116 | "The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program.\n", 117 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people.\n", 118 | "The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "!pip install rouge\n", 124 | "!pip install nltk\n", 125 | "from rouge import Rouge \n", 126 | "import nltk\n", 127 | "import nltk.translate.bleu_score as bleu\n", 128 | "nltk.download('punkt')\n", 129 | "from gensim.summarization.summarizer import summarize\n", 130 | "from gensim.summarization import keywords\n", 131 | "\n", 132 | "text =\"\"\"\n", 133 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. 
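A packaging caveat for the source cell above: the `gensim.summarization` module it imports was removed in Gensim 4.0, so this notebook only runs against Gensim 3.x. A minimal version guard, assuming a fresh Colab-style environment where the preinstalled Gensim may be newer:

    # gensim.summarization (the TextRank-based summarize() used above) only
    # exists in Gensim < 4.0; pin the last 3.x release before importing it.
    !pip install "gensim==3.8.3"

    from gensim.summarization.summarizer import summarize
    print(summarize(text, ratio=0.7))   # same call as in the cell above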
However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 134 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"\n", 135 | "summ_per = summarize(text, ratio = 0.7)\n", 136 | "print(\"Percent summary\")\n", 137 | "print(summ_per)\n", 138 | "\n", 139 | "# Summary (200 words)\n", 140 | "summ_words = summarize(text, word_count = 100)\n", 141 | "print(\"Word count summary\")\n", 142 | "print(summ_words)\n" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "source": [ 148 | "rouge = Rouge()\n", 149 | "scores = rouge.get_scores(summ_words, text)\n", 150 | "print(\"ROUGE Score:\")\n", 151 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 152 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 153 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 154 | ], 155 | "metadata": { 156 | "colab": { 157 | "base_uri": "https://localhost:8080/" 158 | }, 159 | "id": "WQ2tb4o0Tu4g", 160 | "outputId": "b99ec166-9c2d-46ec-bba7-46950288ea89" 161 | }, 162 | "execution_count": null, 163 | "outputs": [ 164 | { 165 | "output_type": "stream", 166 | "name": "stdout", 167 | "text": [ 168 | "ROUGE Score:\n", 169 | "Precision: 1.000\n", 170 | "Recall: 0.490\n", 171 | "F1-Score: 0.658\n" 172 | ] 173 | } 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "source": [ 179 | "from nltk.translate.bleu_score import sentence_bleu\n", 180 | "\n", 181 | "def summary_to_sentences(summary):\n", 182 | " # Split the summary into sentences using the '.' 
character as a separator\n", 183 | " sentences = summary.split('.')\n", 184 | " \n", 185 | " # Convert each sentence into a list of words\n", 186 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 187 | " \n", 188 | " return sentence_lists\n", 189 | "\n", 190 | "def paragraph_to_wordlist(paragraph):\n", 191 | " # Split the paragraph into words using whitespace as a separator\n", 192 | " words = paragraph.split()\n", 193 | " return words\n", 194 | "\n", 195 | "reference_paragraph = text\n", 196 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 197 | "predicted_paragraph = summ_words\n", 198 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 199 | "\n", 200 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 201 | "print(score)" 202 | ], 203 | "metadata": { 204 | "colab": { 205 | "base_uri": "https://localhost:8080/" 206 | }, 207 | "id": "6tpwsVd9UDXh", 208 | "outputId": "109d8977-aedf-4dad-d81d-f2d33b1ac963" 209 | }, 210 | "execution_count": null, 211 | "outputs": [ 212 | { 213 | "output_type": "stream", 214 | "name": "stdout", 215 | "text": [ 216 | "0.7504088240190349\n" 217 | ] 218 | } 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "source": [ 224 | "print(\"BLEU Score: {:.3f}\".format(score))" 225 | ], 226 | "metadata": { 227 | "colab": { 228 | "base_uri": "https://localhost:8080/" 229 | }, 230 | "id": "JNqtqP18cNmn", 231 | "outputId": "fe879bd6-5882-4011-984b-2892ced77f6d" 232 | }, 233 | "execution_count": null, 234 | "outputs": [ 235 | { 236 | "output_type": "stream", 237 | "name": "stdout", 238 | "text": [ 239 | "BLEU Score: 0.750\n" 240 | ] 241 | } 242 | ] 243 | } 244 | ] 245 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/NLTK.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# NLTK\n", 21 | "**NLTK (Natural Language Toolkit)** is a popular Python library that provides various tools and algorithms for natural language processing tasks, including text summarization. Here are some advantages and disadvantages of the NLTK algorithm for text summarization:\n", 22 | "\n", 23 | "### Pros:\n", 24 | "*\tWide range of summarization techniques: NLTK provides various techniques for text summarization, including frequency-based, graph-based, and machine learning-based methods, among others. 
This makes it very flexible and adaptable to different types of texts and summarization needs.\n", 25 | "*\tWell-documented: NLTK is well-documented and has a large user community, making it easy to find resources and support for using it.\n", 26 | "*\tLanguage independence: NLTK can summarize texts in any language, making it suitable for multilingual applications.\n", 27 | "\n", 28 | "### Cons:\n", 29 | "*\tLimited accuracy: Some of the summarization techniques provided by NLTK may produce summaries that are less accurate than other approaches, especially for longer texts or texts with complex language.\n", 30 | "*\tLimited customization: NLTK may not offer as much customization or parameter tuning options as other libraries or tools, which can limit its flexibility in certain applications.\n", 31 | "*\tResource-intensive: Some of the NLTK algorithms can be resource-intensive and may require significant computational power to run efficiently, especially for large datasets.\n", 32 | "\n", 33 | "Overall, NLTK is a useful tool for text summarization with a wide range of techniques, but its effectiveness depends heavily on the specific algorithm used and the quality of the input text. Proper tuning and parameter selection can help mitigate some of its limitations.\n", 34 | "\n", 35 | "These are the scores we achieved:\n", 36 | "\n", 37 | " ROUGE Score:\n", 38 | " Precision: 1.000\n", 39 | " Recall: 0.417\n", 40 | " F1-Score: 0.589\n", 41 | "\n", 42 | " BLEU Score: 0.869\n", 43 | "\n", 44 | "## References\n", 45 | "\n", 46 | "Here are some research papers related to using the Natural Language Toolkit (NLTK) for text summarization:\n", 47 | "\n", 48 | "1. \"Automatic text summarization using a machine learning approach with NLTK and scikit-learn\" by F. D. D. Britto and A. I. S. de Souza, in Proceedings of the 2018 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA)\n", 49 | "\n", 50 | "2. \"Text summarization with NLTK and K-Means\" by J. K. Sukhadia and J. K. Patel, in Proceedings of the 2014 International Conference on Communication, Information & Computing Technology (ICCICT)\n", 51 | "\n", 52 | "3. \"Abstractive text summarization using sequence-to-sequence RNNs and beyond\" by R. Nallapati, F. Zhai, and B. Zhou, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)\n", 53 | "\n", 54 | "4. \"A survey on automatic text summarization\" by M. Hassanzadeh and M. H. Tabrizi, in Proceedings of the 2019 International Conference on Knowledge-Based Engineering and Innovation (KBEI)\n", 55 | "\n", 56 | "These papers explore various approaches to text summarization using the NLTK library, including machine learning techniques like K-Means clustering and sequence-to-sequence recurrent neural networks. They discuss the effectiveness of these approaches and their advantages and limitations.\n", 57 | "\n", 58 | "The survey paper provides an overview of the state-of-the-art in automatic text summarization and discusses various techniques, including those based on the NLTK library. It can serve as a useful resource for those interested in understanding the landscape of text summarization research." 
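One note that applies to the ROUGE and BLEU figures quoted in this and the other notebooks: both are computed against the full source document rather than a human-written reference summary. For a purely extractive summarizer this makes ROUGE precision trivially 1.000, since every summary word also appears in the source, so the numbers measure overlap with the input rather than summary quality. When a genuine reference summary is available, a smoothed sentence-level BLEU is the usual choice for short texts; a minimal sketch, where `human_reference` is a hypothetical annotator-written summary:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu_vs_reference(human_reference, predicted_summary):
        """Sentence-level BLEU of a predicted summary against one human reference."""
        reference = [human_reference.split()]       # list of reference token lists
        candidate = predicted_summary.split()
        smoothie = SmoothingFunction().method1      # avoids zero n-gram counts
        return sentence_bleu(reference, candidate, smoothing_function=smoothie)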
59 | ], 60 | "metadata": { 61 | "id": "4zYgPOwr3q8g" 62 | } 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": { 68 | "colab": { 69 | "base_uri": "https://localhost:8080/" 70 | }, 71 | "id": "KbXrxcoGPqxA", 72 | "outputId": "1d4058ff-1d42-40d2-c330-f2ecd06bde4e" 73 | }, 74 | "outputs": [ 75 | { 76 | "output_type": "stream", 77 | "name": "stdout", 78 | "text": [ 79 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 80 | "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/dist-packages (1.0.2)\n", 81 | "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (3.1.0)\n", 82 | "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.2.0)\n", 83 | "Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.22.4)\n", 84 | "Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.7.3)\n", 85 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 86 | "Collecting rouge\n", 87 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 88 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 89 | "Installing collected packages: rouge\n", 90 | "Successfully installed rouge-1.0.1\n", 91 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 92 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 93 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n", 94 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n", 95 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 96 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n" 97 | ] 98 | }, 99 | { 100 | "output_type": "stream", 101 | "name": "stderr", 102 | "text": [ 103 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 104 | "[nltk_data] Unzipping corpora/stopwords.zip.\n", 105 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 106 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 107 | ] 108 | } 109 | ], 110 | "source": [ 111 | "!pip install scikit-learn\n", 112 | "!pip install rouge\n", 113 | "!pip install nltk\n", 114 | "from rouge import Rouge \n", 115 | "import nltk\n", 116 | "import nltk.translate.bleu_score as bleu\n", 117 | "nltk.download('stopwords')\n", 118 | "nltk.download('punkt')\n", 119 | "from nltk.corpus import stopwords \n", 120 | "from nltk.tokenize import word_tokenize, sent_tokenize " 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "source": [ 126 | "text =\"\"\"\n", 127 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. 
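One detail worth flagging before the scoring cells that follow: each sentence's value is accumulated as a raw sum of word frequencies, so longer sentences win simply by containing more words. Normalizing by sentence length is a common refinement; a sketch that reuses the `freqTable` and `sentences` built in the cells below:

    # Average word frequency per sentence instead of the raw sum, so that
    # sentence length does not dominate the score.
    normValue = dict()
    for sentence in sentences:
        sentence_words = [w.lower() for w in word_tokenize(sentence) if w.isalpha()]
        if sentence_words:
            total = sum(freqTable.get(w, 0) for w in sentence_words)
            normValue[sentence] = total / len(sentence_words)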
The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 128 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"" 129 | ], 130 | "metadata": { 131 | "id": "OL8L1q_ZRqYI" 132 | }, 133 | "execution_count": null, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "source": [ 139 | "stopWords = set(stopwords.words(\"english\")) \n", 140 | "words = word_tokenize(text) \n", 141 | "\n", 142 | "freqTable = dict() \n", 143 | "for word in words: \n", 144 | " word = word.lower() \n", 145 | " if word in stopWords: \n", 146 | " continue\n", 147 | " if word in freqTable: \n", 148 | " freqTable[word] += 1\n", 149 | " else: \n", 150 | " freqTable[word] = 1\n", 151 | "\n", 152 | "sentences = sent_tokenize(text) \n", 153 | "sentenceValue = dict() \n", 154 | " \n", 155 | "for sentence in sentences: \n", 156 | " for word, freq in freqTable.items(): \n", 157 | " if word in sentence.lower(): \n", 158 | " if sentence in sentenceValue: \n", 159 | " sentenceValue[sentence] += freq \n", 160 | " else: \n", 161 | " sentenceValue[sentence] = freq\n", 162 | "\n", 163 | "sumValues = 0\n", 164 | "for sentence in sentenceValue: \n", 165 | " sumValues += sentenceValue[sentence] " 166 | ], 167 | "metadata": { 168 | "id": "AHTHmqtwRxQG" 169 | }, 170 | "execution_count": null, 171 | "outputs": [] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "source": [ 176 | "average = int(sumValues / len(sentenceValue)) " 177 | ], 178 | "metadata": { 179 | "id": "RrR9DI0USSmu" 180 | }, 181 | "execution_count": null, 182 | "outputs": [] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "source": [ 187 | "summary = '' \n", 188 | "for sentence in sentences: \n", 189 | " if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)): \n", 190 | " summary += \" \" + sentence \n", 191 | "print(summary) " 192 | ], 193 | "metadata": { 194 | "colab": { 195 | "base_uri": "https://localhost:8080/" 196 | }, 197 | "id": "a1voj-I5STxu", 198 | "outputId": "22e3433b-1437-4ed9-8fdd-76ca5c2d1fea" 199 | }, 200 | "execution_count": null, 201 | "outputs": [ 202 | { 203 | "output_type": "stream", 204 | "name": "stdout", 205 | "text": [ 206 | " The 
move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people.\n" 207 | ] 208 | } 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "source": [ 214 | "rouge = Rouge()\n", 215 | "scores = rouge.get_scores(summary, text)\n", 216 | "print(\"ROUGE Score:\")\n", 217 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 218 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 219 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 220 | ], 221 | "metadata": { 222 | "colab": { 223 | "base_uri": "https://localhost:8080/" 224 | }, 225 | "id": "9BLju0-lcjxo", 226 | "outputId": "419b8765-04c4-408d-bcdb-90f5eaf31dd6" 227 | }, 228 | "execution_count": null, 229 | "outputs": [ 230 | { 231 | "output_type": "stream", 232 | "name": "stdout", 233 | "text": [ 234 | "ROUGE Score:\n", 235 | "Precision: 1.000\n", 236 | "Recall: 0.417\n", 237 | "F1-Score: 0.589\n" 238 | ] 239 | } 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "source": [ 245 | "from nltk.translate.bleu_score import sentence_bleu\n", 246 | "\n", 247 | "def summary_to_sentences(summary):\n", 248 | " # Split the summary into sentences using the '.' character as a separator\n", 249 | " sentences = summary.split('.')\n", 250 | " \n", 251 | " # Convert each sentence into a list of words\n", 252 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 253 | " \n", 254 | " return sentence_lists\n", 255 | "\n", 256 | "def paragraph_to_wordlist(paragraph):\n", 257 | " # Split the paragraph into words using whitespace as a separator\n", 258 | " words = paragraph.split()\n", 259 | " return words\n", 260 | "\n", 261 | "reference_paragraph = text\n", 262 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 263 | "predicted_paragraph = summary\n", 264 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 265 | "\n", 266 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 267 | "print(score)" 268 | ], 269 | "metadata": { 270 | "colab": { 271 | "base_uri": "https://localhost:8080/" 272 | }, 273 | "id": "nSXUUQWNcnpi", 274 | "outputId": "78ef19be-abb8-45ad-8f53-f9d076239fca" 275 | }, 276 | "execution_count": null, 277 | "outputs": [ 278 | { 279 | "output_type": "stream", 280 | "name": "stdout", 281 | "text": [ 282 | "0.8693578549991029\n" 283 | ] 284 | } 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "source": [ 290 | "print(\"BLEU Score: {:.3f}\".format(score))" 291 | ], 292 | "metadata": { 293 | "colab": { 294 | "base_uri": "https://localhost:8080/" 295 | }, 296 | "id": "NqvE9i95bKbl", 297 | "outputId": "90d047c0-6180-4e3b-f1f5-3a0c02169493" 298 | }, 299 | "execution_count": null, 300 | "outputs": [ 301 | { 302 | "output_type": "stream", 303 | "name": "stdout", 304 | "text": [ 305 | "BLEU Score: 0.869\n" 306 | ] 307 | } 308 | ] 309 | } 310 | ] 311 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/LSA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | 
"nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# LSA\n", 21 | "**Latent Semantic Analysis (LSA)** is a popular algorithm for text summarization that uses singular value decomposition (SVD) to identify the underlying concepts in a text. Here are some advantages and disadvantages of the LSA algorithm for text summarization:\n", 22 | "\n", 23 | "### Pros:\n", 24 | "*\tCaptures semantic relationships: LSA is effective at identifying the semantic relationships between words in a text, which can help to generate more accurate and relevant summaries.\n", 25 | "*\tGood for multi-document summarization: LSA is particularly well-suited for summarizing multiple documents at once, as it can identify the common themes and concepts across them.\n", 26 | "*\tFlexibility: LSA can be used with different types of text, such as news articles, scientific papers, and social media posts, among others.\n", 27 | "\n", 28 | "### Cons:\n", 29 | "*\tRequires large amounts of training data: LSA requires a large amount of training data to accurately identify the underlying concepts in a text.\n", 30 | "*\tDifficulty in handling new words: LSA may struggle to handle new words that are not part of its training data, which can lead to errors in summarization.\n", 31 | "*\tLimited coverage: LSA tends to focus on the most important concepts and may miss some important details that are not explicitly mentioned in the text.\n", 32 | "*\tLack of coherence: LSA may generate summaries that lack coherence, especially when summarizing longer texts.\n", 33 | "\n", 34 | "Overall, LSA is a powerful algorithm for text summarization that can generate accurate and relevant summaries, but it does have limitations that need to be considered when using it. Proper training data and parameter selection can help mitigate some of its limitations.\n", 35 | "\n", 36 | "These are the scores we achieved:\n", 37 | "\n", 38 | " ROUGE Score:\n", 39 | " Precision: 1.000\n", 40 | " Recall: 0.430\n", 41 | " F1-Score: 0.602\n", 42 | "\n", 43 | " BLEU Score: 0.869\n", 44 | "\n", 45 | "## References\n", 46 | "\n", 47 | "Here are some research papers on LSA (Latent Semantic Analysis) text summarization:\n", 48 | "\n", 49 | "1. \"Automatic Text Summarization Using Latent Semantic Analysis\" by Chandra Prakash K, et al. This paper presents a method for text summarization using LSA and shows its effectiveness in summarizing large documents.\n", 50 | "\n", 51 | "2. \"Text Summarization Based on Latent Semantic Analysis and Ontology\" by Ahmed AbuRa'ed, et al. This paper proposes a method for text summarization using both LSA and ontology-based techniques, achieving better results than using LSA alone.\n", 52 | "\n", 53 | "3. \"Using Latent Semantic Analysis in Text Summarization and Summary Evaluation\" by Dragomir Radev, et al. This paper presents an overview of using LSA in text summarization and evaluates the quality of summaries generated by LSA-based techniques.\n", 54 | "\n", 55 | "4. \"Extractive Text Summarization Using Latent Semantic Analysis with Feature Reduction\" by Hanieh Poostchi, et al. 
This paper proposes a method for extractive text summarization using LSA with feature reduction techniques, achieving better results than using LSA alone.\n", 56 | "\n", 57 | "These are just a few examples of research papers on LSA text summarization. There are many more papers and ongoing research in this field." 58 | ], 59 | "metadata": { 60 | "id": "wt5RI4X-BqXu" 61 | } 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": null, 66 | "metadata": { 67 | "colab": { 68 | "base_uri": "https://localhost:8080/" 69 | }, 70 | "id": "oQCAXTnXNp-F", 71 | "outputId": "64de4975-f4a5-4853-8fa4-0124c91c5552" 72 | }, 73 | "outputs": [ 74 | { 75 | "output_type": "stream", 76 | "name": "stdout", 77 | "text": [ 78 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 79 | "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.8/dist-packages (1.0.2)\n", 80 | "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.2.0)\n", 81 | "Requirement already satisfied: scipy>=1.1.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.7.3)\n", 82 | "Requirement already satisfied: numpy>=1.14.6 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (1.22.4)\n", 83 | "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.8/dist-packages (from scikit-learn) (3.1.0)\n", 84 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 85 | "Collecting rouge\n", 86 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 87 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 88 | "Installing collected packages: rouge\n", 89 | "Successfully installed rouge-1.0.1\n", 90 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 91 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 92 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 93 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n", 94 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n", 95 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n" 96 | ] 97 | }, 98 | { 99 | "output_type": "stream", 100 | "name": "stderr", 101 | "text": [ 102 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 103 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 104 | ] 105 | }, 106 | { 107 | "output_type": "execute_result", 108 | "data": { 109 | "text/plain": [ 110 | "True" 111 | ] 112 | }, 113 | "metadata": {}, 114 | "execution_count": 1 115 | } 116 | ], 117 | "source": [ 118 | "!pip install scikit-learn\n", 119 | "!pip install rouge\n", 120 | "!pip install nltk\n", 121 | "from rouge import Rouge \n", 122 | "import nltk\n", 123 | "import nltk.translate.bleu_score as bleu\n", 124 | "nltk.download('punkt')" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "source": [ 130 | "import numpy as np\n", 131 | "import pandas as pd\n", 132 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 133 | "from sklearn.decomposition import TruncatedSVD" 134 | ], 135 | "metadata": { 136 | "id": "Jpwlpng1NujK" 137 | }, 138 | "execution_count": null, 139 | "outputs": [] 140 | }, 141 | { 142 | 
"cell_type": "code", 143 | "source": [ 144 | "text =\"\"\"\n", 145 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 146 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"" 147 | ], 148 | "metadata": { 149 | "id": "HcYJLxXaNzLQ" 150 | }, 151 | "execution_count": null, 152 | "outputs": [] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "source": [ 157 | "vectorizer = TfidfVectorizer(stop_words='english')\n", 158 | "X = vectorizer.fit_transform([text])" 159 | ], 160 | "metadata": { 161 | "id": "mqumimk7N9G-" 162 | }, 163 | "execution_count": null, 164 | "outputs": [] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "source": [ 169 | "lsa = TruncatedSVD(n_components=1, algorithm='randomized', n_iter=100, random_state=42)\n", 170 | "lsa.fit(X)" 171 | ], 172 | "metadata": { 173 | "colab": { 174 | "base_uri": "https://localhost:8080/" 175 | }, 176 | "id": "pFRgslVlN9gd", 177 | "outputId": "497e25ca-eab5-4425-c376-2268df7b2938" 178 | }, 179 | "execution_count": null, 180 | "outputs": [ 181 | { 182 | "output_type": "stream", 183 | "name": "stderr", 184 | "text": [ 185 | "/usr/local/lib/python3.8/dist-packages/sklearn/decomposition/_truncated_svd.py:234: RuntimeWarning: invalid value encountered in true_divide\n", 186 | " self.explained_variance_ratio_ = exp_var / full_var\n" 187 | ] 188 | }, 189 | { 190 | "output_type": "execute_result", 191 | "data": { 192 | "text/plain": [ 193 | "TruncatedSVD(n_components=1, n_iter=100, random_state=42)" 194 | ] 195 | }, 196 | "metadata": {}, 197 | "execution_count": 5 198 | } 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "source": [ 204 | "sentences = text.split('.')\n", 205 | "important_sentences = np.argsort(np.abs(lsa.components_[0]))[::-1]\n", 206 | "\n", 207 | "# Ensure that the indices in important_sentences are within the range of valid indices for sentences\n", 208 | "valid_indices = [i for i in important_sentences if i < len(sentences)]\n", 209 | "\n", 210 | "# Extract the two most important sentences based on the valid indices\n", 211 | "summary_sentences = [sentences[i].strip() for i in valid_indices[:3]]\n", 212 | "\n", 213 | "# If there are not enough valid indices, pad the summary with empty strings\n", 214 | "while len(summary_sentences) < 2:\n", 215 | " summary_sentences.append('')" 216 | ], 217 | "metadata": { 218 | "id": "3fCHK_u9N_HH" 219 | }, 220 | "execution_count": null, 221 | "outputs": [] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "source": [ 226 | "summary = '. '.join(summary_sentences) + '.'\n", 227 | "print(summary)" 228 | ], 229 | "metadata": { 230 | "colab": { 231 | "base_uri": "https://localhost:8080/" 232 | }, 233 | "id": "fO2MaOykOBOG", 234 | "outputId": "53c3509d-e2ab-4f3d-9762-7ff86c417776" 235 | }, 236 | "execution_count": null, 237 | "outputs": [ 238 | { 239 | "output_type": "stream", 240 | "name": "stdout", 241 | "text": [ 242 | "The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained. 
The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India.\n" 243 | ] 244 | } 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "source": [ 250 | "rouge = Rouge()\n", 251 | "scores = rouge.get_scores(summary, text)\n", 252 | "print(\"ROUGE Score:\")\n", 253 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 254 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 255 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 256 | ], 257 | "metadata": { 258 | "colab": { 259 | "base_uri": "https://localhost:8080/" 260 | }, 261 | "id": "QtLWfp9ebWDx", 262 | "outputId": "96914826-c8c9-4248-88e4-2dcdcad917c8" 263 | }, 264 | "execution_count": null, 265 | "outputs": [ 266 | { 267 | "output_type": "stream", 268 | "name": "stdout", 269 | "text": [ 270 | "ROUGE Score:\n", 271 | "Precision: 1.000\n", 272 | "Recall: 0.430\n", 273 | "F1-Score: 0.602\n" 274 | ] 275 | } 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "source": [ 281 | "from nltk.translate.bleu_score import sentence_bleu\n", 282 | "\n", 283 | "def summary_to_sentences(summary):\n", 284 | " # Split the summary into sentences using the '.' character as a separator\n", 285 | " sentences = summary.split('.')\n", 286 | " \n", 287 | " # Convert each sentence into a list of words\n", 288 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 289 | " \n", 290 | " return sentence_lists\n", 291 | "\n", 292 | "def paragraph_to_wordlist(paragraph):\n", 293 | " # Split the paragraph into words using whitespace as a separator\n", 294 | " words = paragraph.split()\n", 295 | " return words\n", 296 | "\n", 297 | "reference_paragraph = text\n", 298 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 299 | "predicted_paragraph = summary\n", 300 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 301 | "\n", 302 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 303 | "print(score)" 304 | ], 305 | "metadata": { 306 | "colab": { 307 | "base_uri": "https://localhost:8080/" 308 | }, 309 | "id": "Bg52JR4mbZND", 310 | "outputId": "c49334fd-e87b-445c-fc56-3a9bd174cbe7" 311 | }, 312 | "execution_count": null, 313 | "outputs": [ 314 | { 315 | "output_type": "stream", 316 | "name": "stdout", 317 | "text": [ 318 | "0.8694712979282957\n" 319 | ] 320 | } 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "source": [ 326 | "print(\"BLEU Score: {:.3f}\".format(score))" 327 | ], 328 | "metadata": { 329 | "colab": { 330 | "base_uri": "https://localhost:8080/" 331 | }, 332 | "id": "VXbuiCMGZtVv", 333 | "outputId": "c1c59dae-670d-4a7c-c50a-30b07f3e5a54" 334 | }, 335 | "execution_count": null, 336 | "outputs": [ 337 | { 338 | "output_type": "stream", 339 | "name": "stdout", 340 | "text": [ 341 | "BLEU Score: 0.869\n" 342 | ] 343 | } 344 | ] 345 | } 346 | ] 347 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/SumBasic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# 
SumBasic\n", 21 | "**SumBasic** is a simple yet effective algorithm for text summarization that is based on a probabilistic model of sentence selection. Here are some advantages and disadvantages of the SumBasic algorithm for text summarization:\n", 22 | "\n", 23 | "### Pros:\n", 24 | "*\tSimplicity: SumBasic is easy to understand and implement, requiring only basic probabilistic modeling and word frequency analysis.\n", 25 | "*\tLanguage independence: SumBasic is language-independent and can be applied to texts in any language.\n", 26 | "*\tGood for extractive summarization: SumBasic is well-suited for extractive summarization, where the summary consists of selected sentences from the original text.\n", 27 | "*\tGood for single-document summarization: SumBasic is effective at summarizing single documents, and can produce summaries that are accurate and relevant.\n", 28 | "\n", 29 | "### Cons:\n", 30 | "*\tLimited coverage: SumBasic tends to focus on the most frequent words and sentences, and may miss important details that are less frequent.\n", 31 | "*\tLack of coherence: SumBasic may produce summaries that lack coherence, especially when summarizing longer texts.\n", 32 | "*\tInability to handle new information: SumBasic does not handle new information that is not present in the original text very well, which can lead to inaccuracies in the summary.\n", 33 | "*\tLimited customization: SumBasic is a simple algorithm with limited customization options, which may limit its flexibility in certain applications.\n", 34 | "\n", 35 | "Overall, SumBasic is a useful algorithm for extractive summarization of single documents, and is easy to implement and understand. However, it may have limitations in terms of coverage, coherence, and handling new information, and may not be as effective for more complex summarization tasks. Proper tuning and feature selection can help mitigate some of its limitations.\n", 36 | "\n", 37 | "These are the scores we achieved:\n", 38 | "\n", 39 | " ROUGE Score:\n", 40 | " Precision: 1.000\n", 41 | " Recall: 0.417\n", 42 | " F1-Score: 0.589\n", 43 | "\n", 44 | " BLEU Score: 0.621\n", 45 | "\n", 46 | "## References\n", 47 | "\n", 48 | "Here are some research papers related to the SumBasic algorithm for text summarization:\n", 49 | "\n", 50 | "1. \"Automatic text summarization by sentence extraction\" by H. P. Luhn, in IBM Journal of Research and Development (1958)\n", 51 | "\n", 52 | "1. \"Sumbasic: A simple yet effective approach to single-document summarization\" by A. Nenkova and K. McKeown, in Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP)\n", 53 | "\n", 54 | "1. \"Sumbasic++: An efficient multi-document summarization approach with topic modeling\" by D. Shang, J. Liu, and X. Li, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)\n", 55 | "\n", 56 | "These papers discuss various aspects of the SumBasic algorithm, including its effectiveness in producing high-quality summaries, its comparison with other techniques like LexRank and TextRank, and its extension to multi-document summarization using topic modeling.\n", 57 | "\n", 58 | "The SumBasic algorithm is a simple and effective approach to extractive summarization that assigns weights to each sentence in the document based on its frequency in the text. 
The algorithm iteratively updates the sentence weights and selects the most important sentences for the summary.\n", 59 | "\n", 60 | "The papers suggest that SumBasic is a powerful and computationally efficient approach to automatic text summarization, particularly for single-document summarization tasks. The algorithm's simplicity and intuitive nature make it easy to implement and adapt to different domains and languages.\n" 61 | ], 62 | "metadata": { 63 | "id": "0Zf0ZaMTaMbx" 64 | } 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": { 70 | "colab": { 71 | "base_uri": "https://localhost:8080/" 72 | }, 73 | "id": "G6FZFUOhJXz3", 74 | "outputId": "794b91aa-655c-4b8d-fba3-ab1afe1977a4" 75 | }, 76 | "outputs": [ 77 | { 78 | "output_type": "stream", 79 | "name": "stdout", 80 | "text": [ 81 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 82 | "Collecting rouge\n", 83 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 84 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 85 | "Installing collected packages: rouge\n", 86 | "Successfully installed rouge-1.0.1\n", 87 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 88 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 89 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 90 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n", 91 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n", 92 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n" 93 | ] 94 | }, 95 | { 96 | "output_type": "stream", 97 | "name": "stderr", 98 | "text": [ 99 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 100 | "[nltk_data] Unzipping corpora/stopwords.zip.\n", 101 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 102 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n", 103 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 104 | "[nltk_data] Package stopwords is already up-to-date!\n", 105 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 106 | "[nltk_data] Package punkt is already up-to-date!\n" 107 | ] 108 | }, 109 | { 110 | "output_type": "execute_result", 111 | "data": { 112 | "text/plain": [ 113 | "True" 114 | ] 115 | }, 116 | "metadata": {}, 117 | "execution_count": 1 118 | } 119 | ], 120 | "source": [ 121 | "!pip install rouge\n", 122 | "!pip install nltk\n", 123 | "from rouge import Rouge \n", 124 | "import nltk\n", 125 | "import nltk.translate.bleu_score as bleu\n", 126 | "nltk.download('stopwords')\n", 127 | "nltk.download('punkt')\n", 128 | "from nltk.corpus import stopwords\n", 129 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 130 | "nltk.download('stopwords')\n", 131 | "nltk.download('punkt')" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "source": [ 137 | "def get_word_frequencies(text):\n", 138 | " \"\"\"\n", 139 | " Calculates the frequency of each word in the text\n", 140 | " \"\"\"\n", 141 | " stop_words = set(stopwords.words('english'))\n", 142 | " words = [word.lower() for word in word_tokenize(text) if word.isalpha() and word.lower() not in stop_words]\n", 143 | " freq = 
nltk.FreqDist(words)\n", 144 | " return freq" 145 | ], 146 | "metadata": { 147 | "id": "ru_RdIk3JiS1" 148 | }, 149 | "execution_count": null, 150 | "outputs": [] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "source": [ 155 | "def get_sentence_scores(text, freq):\n", 156 | " \"\"\"\n", 157 | " Calculates the score of each sentence in the text\n", 158 | " \"\"\"\n", 159 | " sentences = sent_tokenize(text)\n", 160 | " scores = []\n", 161 | " for sentence in sentences:\n", 162 | " sentence_score = 0\n", 163 | " sentence_words = [word.lower() for word in word_tokenize(sentence) if word.isalpha()]\n", 164 | " for word in sentence_words:\n", 165 | " sentence_score += freq[word]\n", 166 | " sentence_score /= len(sentence_words)\n", 167 | " scores.append((sentence, sentence_score))\n", 168 | " return scores" 169 | ], 170 | "metadata": { 171 | "id": "RnEZ0-sQJlWz" 172 | }, 173 | "execution_count": null, 174 | "outputs": [] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "source": [ 179 | "def summarize(text, length):\n", 180 | " \"\"\"\n", 181 | " Summarizes the text to the specified length using the SumBasic algorithm\n", 182 | " \"\"\"\n", 183 | " freq = get_word_frequencies(text)\n", 184 | " summary = []\n", 185 | " while len(summary) < length:\n", 186 | " sentence_scores = get_sentence_scores(text, freq)\n", 187 | " top_sentence = max(sentence_scores, key=lambda x: x[1])[0]\n", 188 | " summary.append(top_sentence)\n", 189 | " # update frequency distribution by reducing frequency of words in the selected sentence\n", 190 | " for word in word_tokenize(top_sentence):\n", 191 | " if word.isalpha():\n", 192 | " freq[word.lower()] -= 1\n", 193 | " return ' '.join(summary)" 194 | ], 195 | "metadata": { 196 | "id": "MnXS951_JpyT" 197 | }, 198 | "execution_count": null, 199 | "outputs": [] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "source": [ 204 | "text =\"\"\"\n", 205 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. 
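A caveat about the `summarize` loop defined above: it decrements word frequencies after each pick but never removes the picked sentence from the candidate pool, so the same sentence can be selected more than once — visible in the output below, where the closing "In summary…" sentence appears twice. The original SumBasic formulation (Nenkova and McKeown, cited in the references above) also reweights by squaring the picked words' probabilities rather than decrementing counts. A sketch with both fixes, reusing `get_word_frequencies` from above:

    def summarize_dedup(text, length):
        """SumBasic-style selection: square picked-word probabilities and
        never re-select a sentence."""
        freq = get_word_frequencies(text)
        prob = {w: freq[w] / freq.N() for w in freq}
        remaining = sent_tokenize(text)
        summary = []
        while remaining and len(summary) < length:
            def score(s):
                ws = [w.lower() for w in word_tokenize(s) if w.isalpha()]
                return sum(prob.get(w, 0) for w in ws) / max(len(ws), 1)
            best = max(remaining, key=score)
            summary.append(best)
            remaining.remove(best)           # a sentence can be picked only once
            for w in word_tokenize(best):
                if w.lower() in prob:
                    prob[w.lower()] **= 2    # classic SumBasic probability update
        return ' '.join(summary)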
The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 206 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"" 207 | ], 208 | "metadata": { 209 | "id": "hjkhKfJFJsuO" 210 | }, 211 | "execution_count": null, 212 | "outputs": [] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "source": [ 217 | "summary = summarize(text, 3)\n", 218 | "print(summary)" 219 | ], 220 | "metadata": { 221 | "colab": { 222 | "base_uri": "https://localhost:8080/" 223 | }, 224 | "id": "tCxYcOxmJyuz", 225 | "outputId": "8006ab17-bede-4898-dbe8-00a2ebab6501" 226 | }, 227 | "execution_count": null, 228 | "outputs": [ 229 | { 230 | "output_type": "stream", 231 | "name": "stdout", 232 | "text": [ 233 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people.\n" 234 | ] 235 | } 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "source": [ 241 | "rouge = Rouge()\n", 242 | "scores = rouge.get_scores(summary, text)\n", 243 | "print(\"ROUGE Score:\")\n", 244 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 245 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 246 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 247 | ], 248 | "metadata": { 249 | "colab": { 250 | "base_uri": "https://localhost:8080/" 251 | }, 252 | "id": "BhoYR0hpe7TB", 253 | "outputId": "89010031-0538-4946-96dd-5e79fbaa445d" 254 | }, 255 | "execution_count": null, 256 | "outputs": [ 257 | { 258 | "output_type": "stream", 259 | "name": "stdout", 260 | "text": [ 261 | "ROUGE Score:\n", 262 | "Precision: 1.000\n", 263 | "Recall: 0.417\n", 264 | "F1-Score: 0.589\n" 265 | ] 266 | } 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "source": [ 272 | "from nltk.translate.bleu_score import sentence_bleu\n", 273 | "\n", 274 | "def summary_to_sentences(summary):\n", 275 | " # Split the summary into sentences using the '.' 
character as a separator\n", 276 | " sentences = summary.split('.')\n", 277 | " \n", 278 | " # Convert each sentence into a list of words\n", 279 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 280 | " \n", 281 | " return sentence_lists\n", 282 | "\n", 283 | "def paragraph_to_wordlist(paragraph):\n", 284 | " # Split the paragraph into words using whitespace as a separator\n", 285 | " words = paragraph.split()\n", 286 | " return words\n", 287 | "\n", 288 | "reference_paragraph = text\n", 289 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 290 | "predicted_paragraph = summary\n", 291 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 292 | "\n", 293 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 294 | "print(score)" 295 | ], 296 | "metadata": { 297 | "colab": { 298 | "base_uri": "https://localhost:8080/" 299 | }, 300 | "id": "g1kbfXzxe84h", 301 | "outputId": "e748c63c-7c5e-4e5b-e42d-146fabd976ec" 302 | }, 303 | "execution_count": null, 304 | "outputs": [ 305 | { 306 | "output_type": "stream", 307 | "name": "stdout", 308 | "text": [ 309 | "0.6209648794317061\n" 310 | ] 311 | } 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "source": [ 317 | "print(\"BLEU Score: {:.3f}\".format(score))" 318 | ], 319 | "metadata": { 320 | "colab": { 321 | "base_uri": "https://localhost:8080/" 322 | }, 323 | "id": "of30wZtleGEk", 324 | "outputId": "7b871316-11ad-4815-d719-9793de766e09" 325 | }, 326 | "execution_count": null, 327 | "outputs": [ 328 | { 329 | "output_type": "stream", 330 | "name": "stdout", 331 | "text": [ 332 | "BLEU Score: 0.621\n" 333 | ] 334 | } 335 | ] 336 | } 337 | ] 338 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/Sentence_Ranking.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# Sentence-Ranking\n", 21 | "**Sentence ranking** is a popular approach for text summarization, where sentences are scored based on their importance and the top-ranked sentences are selected to form the summary. 
Here are some pros and cons of using sentence ranking for text summarization:\n", 22 | "\n", 23 | "### Pros:\n", 24 | "\n", 25 | "* It is a simple and intuitive approach that can be easily implemented.\n", 26 | "* It can handle different types of text, such as news articles, scientific papers, and social media posts.\n", 27 | "* It can preserve the original structure of the text and provide a coherent summary.\n", 28 | "* It can be combined with other techniques, such as sentence clustering and sentence compression, to improve the quality of summaries.\n", 29 | "* It can be evaluated using standard metrics, such as ROUGE and BLEU, which allow for objective comparison with other summarization models.\n", 30 | "\n", 31 | "### Cons:\n", 32 | "\n", 33 | "* It can be sensitive to the choice of ranking algorithm and feature set, which can affect the quality of the summary.\n", 34 | "* It may not capture the overall meaning of the text and may miss important information.\n", 35 | "* It may generate redundant or repetitive information, especially when multiple sentences convey similar information.\n", 36 | "* It may not handle text with complex syntax or domain-specific terminology well, which can lead to inaccuracies in the summary.\n", 37 | "* It may not be able to generate summaries that are novel or creative, as it relies on the input text for content.\n", 38 | "\n", 39 | "Overall, sentence ranking is a widely used and effective approach for text summarization, but its limitations should be considered when evaluating its performance and potential applications.\n", 40 | "\n", 41 | "These are the scores we achieved:\n", 42 | "\n", 43 | " ROUGE Score:\n", 44 | " Precision: 0.833\n", 45 | " Recall: 0.331\n", 46 | " F1-Score: 0.474\n", 47 | "\n", 48 | " BLEU Score: 0.556\n", 49 | "\n", 50 | "\n", 51 | "Here are some research papers that use sentence ranking for text summarization:\n", 52 | "\n", 53 | "1. \"TextRank: Bringing Order into Texts\" by R. Mihalcea and P. Tarau. This paper introduces the TextRank algorithm, which is a graph-based approach for sentence ranking and has been widely used for text summarization.\n", 54 | "\n", 55 | "2. \"Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization\" by J. A. Pérez-Carballo and A. García-Serrano. This paper compares the performance of different graph-based algorithms, including TextRank, for extractive text summarization.\n", 56 | "\n", 57 | "3. \"Enhancing Sentence Extraction-Based Single-Document Summarization with Supervised Methods\" by D. Das and A. Sarkar. This paper proposes a supervised learning approach for sentence ranking based on features such as sentence length, position, and similarity to the document title.\n", 58 | "\n", 59 | "4. \"A Neural Attention Model for Abstractive Sentence Summarization\" by A. Rush et al. This paper uses a neural attention model for abstractive text summarization, where sentences are ranked based on their relevance to the summary and the overall coherence of the text.\n", 60 | "\n", 61 | "These papers demonstrate the versatility and effectiveness of sentence ranking for text summarization, and highlight the potential for combining this approach with other techniques to improve the quality of summaries." 
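The code cells below score sentences against a TF-IDF matrix, but they call scikit-learn's deprecated `get_feature_names()` once per word, which triggers the `FutureWarning` visible in their output. On scikit-learn 1.0 and later the same scoring idea can be written more compactly; this is a minimal illustrative sketch, not the notebook's exact implementation, and the name `rank_sentences` is hypothetical:

```python
from nltk.tokenize import sent_tokenize  # assumes nltk's 'punkt' data is downloaded
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_sentences(text, top_n=3):
    sentences = sent_tokenize(text.lower())
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(sentences)
    # Score each sentence by the total TF-IDF weight of its terms,
    # then keep the top_n highest-scoring sentences.
    scores = X.sum(axis=1).A1
    ranked = sorted(zip(scores, sentences), reverse=True)
    return ' '.join(s for _, s in ranked[:top_n])
```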
62 | ], 63 | "metadata": { 64 | "id": "oQA-i2uwR4Xl" 65 | } 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": { 71 | "id": "AxhlxTeASmuD", 72 | "colab": { 73 | "base_uri": "https://localhost:8080/" 74 | }, 75 | "outputId": "0d7f6242-b111-429e-f5fa-7366e6b46741" 76 | }, 77 | "outputs": [ 78 | { 79 | "output_type": "stream", 80 | "name": "stdout", 81 | "text": [ 82 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 83 | "Collecting rouge\n", 84 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 85 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 86 | "Installing collected packages: rouge\n", 87 | "Successfully installed rouge-1.0.1\n", 88 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 89 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 90 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 91 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n", 92 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n", 93 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n" 94 | ] 95 | }, 96 | { 97 | "output_type": "stream", 98 | "name": "stderr", 99 | "text": [ 100 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 101 | "[nltk_data] Unzipping corpora/stopwords.zip.\n", 102 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 103 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "!pip install rouge\n", 109 | "!pip install nltk\n", 110 | "from rouge import Rouge \n", 111 | "import nltk\n", 112 | "import nltk.translate.bleu_score as bleu\n", 113 | "nltk.download('stopwords')\n", 114 | "nltk.download('punkt')\n", 115 | "import numpy as np\n", 116 | "import pandas as pd\n", 117 | "from nltk.corpus import stopwords\n", 118 | "from nltk.tokenize import word_tokenize, sent_tokenize\n", 119 | "from nltk.stem import PorterStemmer\n", 120 | "from sklearn.feature_extraction.text import TfidfVectorizer" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "source": [ 126 | "text =\"\"\"\n", 127 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. 
However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 128 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"" 129 | ], 130 | "metadata": { 131 | "id": "3mRLgqZrTGGf" 132 | }, 133 | "execution_count": null, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "source": [ 139 | "nltk.download('stopwords')\n", 140 | "nltk.download('punkt')\n", 141 | " " 142 | ], 143 | "metadata": { 144 | "colab": { 145 | "base_uri": "https://localhost:8080/" 146 | }, 147 | "id": "zcvob-KKTdDw", 148 | "outputId": "e84a07f2-67ef-4cb3-f2fb-ff07cf246018" 149 | }, 150 | "execution_count": null, 151 | "outputs": [ 152 | { 153 | "output_type": "stream", 154 | "name": "stderr", 155 | "text": [ 156 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 157 | "[nltk_data] Package stopwords is already up-to-date!\n", 158 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 159 | "[nltk_data] Package punkt is already up-to-date!\n" 160 | ] 161 | }, 162 | { 163 | "output_type": "execute_result", 164 | "data": { 165 | "text/plain": [ 166 | "True" 167 | ] 168 | }, 169 | "metadata": {}, 170 | "execution_count": 3 171 | } 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "source": [ 177 | "#Preprocess the text\n", 178 | "stop_words = set(stopwords.words('english'))\n", 179 | "stemmer = PorterStemmer()\n", 180 | "sentences = sent_tokenize(text.lower())\n", 181 | "words = word_tokenize(text.lower())\n", 182 | "\n", 183 | "filtered_words = []\n", 184 | "for word in words:\n", 185 | " if word not in stop_words:\n", 186 | " stemmed_word = stemmer.stem(word)\n", 187 | " filtered_words.append(stemmed_word)\n", 188 | "\n", 189 | "# Calculate the sentence scores\n", 190 | "vectorizer = TfidfVectorizer()\n", 191 | "X = vectorizer.fit_transform(sentences)" 192 | ], 193 | "metadata": { 194 | "id": "TXWvQZtxTXtS" 195 | }, 196 | "execution_count": null, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "source": [ 202 | "sentence_scores = []\n", 203 | "for i in range(len(sentences)):\n", 204 | " sentence_score = 0\n", 205 | " for word in filtered_words:\n", 206 | " if word in vectorizer.get_feature_names():\n", 207 | " sentence_score += X[i, vectorizer.vocabulary_[word]]\n", 208 | " sentence_scores.append(sentence_score)\n", 209 | "\n", 210 | "# Sort the sentences\n", 211 | "ranked_sentences = sorted(((sentence_scores[i], s) for i, s in enumerate(sentences)), reverse=True)\n", 212 | "\n", 213 | "# 
Select the top N sentences\n", 214 | "top_n = 3\n", 215 | "selected_sentences = []\n", 216 | "for i in range(top_n):\n", 217 | " selected_sentences.append(ranked_sentences[i][1])\n" 218 | ], 219 | "metadata": { 220 | "id": "j6ZRp1CVTsAK", 221 | "colab": { 222 | "base_uri": "https://localhost:8080/" 223 | }, 224 | "outputId": "f065b9ae-3bf0-453f-dcb6-38cb5851eeb4" 225 | }, 226 | "execution_count": null, 227 | "outputs": [ 228 | { 229 | "output_type": "stream", 230 | "name": "stderr", 231 | "text": [ 232 | "/usr/local/lib/python3.8/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n", 233 | " warnings.warn(msg, category=FutureWarning)\n" 234 | ] 235 | } 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "source": [ 241 | "# Generate the summary\n", 242 | "summary = \" \".join(selected_sentences)\n", 243 | "print(summary)" 244 | ], 245 | "metadata": { 246 | "colab": { 247 | "base_uri": "https://localhost:8080/" 248 | }, 249 | "id": "TuUjOtmHTgrZ", 250 | "outputId": "556e205f-e444-4d2a-8964-96877012a724" 251 | }, 252 | "execution_count": null, 253 | "outputs": [ 254 | { 255 | "output_type": "stream", 256 | "name": "stdout", 257 | "text": [ 258 | "in summary, india's health ministry has announced that the country's covid-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. the decision was taken after a meeting of the national expert group on vaccine administration for covid-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in india. \n", 259 | " india's health ministry has announced that the country's covid-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities.\n" 260 | ] 261 | } 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "source": [ 267 | "rouge = Rouge()\n", 268 | "scores = rouge.get_scores(summary, text)\n", 269 | "print(\"ROUGE Score:\")\n", 270 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 271 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 272 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 273 | ], 274 | "metadata": { 275 | "id": "VRD8sRQLTjLr", 276 | "colab": { 277 | "base_uri": "https://localhost:8080/" 278 | }, 279 | "outputId": "a9afc655-1998-4eed-cfc1-af310564b5c9" 280 | }, 281 | "execution_count": null, 282 | "outputs": [ 283 | { 284 | "output_type": "stream", 285 | "name": "stdout", 286 | "text": [ 287 | "ROUGE Score:\n", 288 | "Precision: 0.833\n", 289 | "Recall: 0.331\n", 290 | "F1-Score: 0.474\n" 291 | ] 292 | } 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "source": [ 298 | "from nltk.translate.bleu_score import sentence_bleu\n", 299 | "\n", 300 | "def summary_to_sentences(summary):\n", 301 | " # Split the summary into sentences using the '.' 
character as a separator\n", 302 | " sentences = summary.split('.')\n", 303 | " \n", 304 | " # Convert each sentence into a list of words\n", 305 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 306 | " \n", 307 | " return sentence_lists\n", 308 | "\n", 309 | "def paragraph_to_wordlist(paragraph):\n", 310 | " # Split the paragraph into words using whitespace as a separator\n", 311 | " words = paragraph.split()\n", 312 | " return words\n", 313 | "\n", 314 | "reference_paragraph = text\n", 315 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 316 | "predicted_paragraph = summary\n", 317 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 318 | "\n", 319 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 320 | "print(score)" 321 | ], 322 | "metadata": { 323 | "colab": { 324 | "base_uri": "https://localhost:8080/" 325 | }, 326 | "id": "X7wQs-Kvd-ib", 327 | "outputId": "416c31ad-abb9-43f8-f3d2-b00a35bc7489" 328 | }, 329 | "execution_count": null, 330 | "outputs": [ 331 | { 332 | "output_type": "stream", 333 | "name": "stdout", 334 | "text": [ 335 | "0.5559999307354189\n" 336 | ] 337 | } 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "source": [ 343 | "print(\"BLEU Score: {:.3f}\".format(score))" 344 | ], 345 | "metadata": { 346 | "colab": { 347 | "base_uri": "https://localhost:8080/" 348 | }, 349 | "id": "9zOk9qa6ffsv", 350 | "outputId": "f73b4821-f1cb-4381-99a0-2e0abb2f561d" 351 | }, 352 | "execution_count": null, 353 | "outputs": [ 354 | { 355 | "output_type": "stream", 356 | "name": "stdout", 357 | "text": [ 358 | "BLEU Score: 0.556\n" 359 | ] 360 | } 361 | ] 362 | } 363 | ] 364 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/Luhn's_Model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# Luhn's Model\n", 21 | "\n", 22 | "**The Luhn Model** is a statistical-based text summarization technique that selects the most relevant sentences based on the frequency of important words in the text. 
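The scoring idea itself fits in a few lines. The sketch below is a simplified illustration (the function names are hypothetical, and Luhn's original formulation scores a window around clusters of significant words rather than the whole sentence):

```python
from collections import Counter

def significant_words(doc_words, top_k=10):
    # Treat the document's most frequent content words as "significant"
    # (stop words are assumed to have been removed already).
    return {w for w, _ in Counter(doc_words).most_common(top_k)}

def luhn_sentence_score(sentence_words, significant):
    # Luhn (1958): score = (number of significant words)^2 / span length,
    # simplified here by taking the whole sentence as the span.
    hits = sum(1 for w in sentence_words if w in significant)
    return hits ** 2 / len(sentence_words) if sentence_words else 0.0
```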
Here are some advantages and disadvantages of using the Luhn Model for text summarization:\n", 23 | "\n", 24 | "### Pros:\n", 25 | "\n", 26 | "* Easy to implement: The Luhn Model is a simple algorithm that is easy to implement and requires minimal computational resources.\n", 27 | "\n", 28 | "* No training data needed: The Luhn Model does not require any training data, as it is based on a statistical analysis of the text.\n", 29 | "\n", 30 | "* Good for extractive summarization: The Luhn Model is well-suited for extractive summarization, where the summary is generated by selecting the most relevant sentences from the original text.\n", 31 | "\n", 32 | "* Language-independent: The Luhn Model is language-independent, which means it can be applied to any language.\n", 33 | "\n", 34 | "### Cons:\n", 35 | "\n", 36 | "* Limited to statistical analysis: The Luhn Model relies solely on a statistical analysis of the text and may not be able to capture the semantic meaning of the text.\n", 37 | "\n", 38 | "* Limited context awareness: The Luhn Model does not consider the context in which the sentences are used, which can lead to the selection of irrelevant sentences.\n", 39 | "\n", 40 | "* Over-reliance on word frequency: The Luhn Model relies heavily on word frequency, which may not always be an accurate indicator of the importance of a sentence.\n", 41 | "\n", 42 | "* Limited to single document summarization: The Luhn Model is designed for single document summarization and may not work well for summarizing multiple documents or large sets of data.\n", 43 | "\n", 44 | "These are the scores we achieved:\n", 45 | "\n", 46 | " ROUGE Score:\n", 47 | " Precision: 0.991\n", 48 | " Recall: 0.742\n", 49 | " F1-Score: 0.848\n", 50 | "\n", 51 | " BLEU Score: 0.700\n", 52 | "\n", 53 | "## References\n", 54 | "\n", 55 | "Here are some research papers related to Luhn's algorithm for text summarization:\n", 56 | "\n", 57 | "1. \"The automatic creation of literature abstracts\" by H. P. Luhn, in IBM Journal of Research and Development (1958)\n", 58 | "\n", 59 | "2. \"Text summarization using Luhn's algorithm\" by H. P. Luhn, in Information Retrieval Techniques for Speech Applications (1996)\n", 60 | "\n", 61 | "3. \"Experiments with Luhn's automatic summarizer\" by T. F. Sumner, in Journal of the Association for Computing Machinery (1959)\n", 62 | "\n", 63 | "4. \"Combining Luhn's algorithm with latent semantic analysis for text summarization\" by R. S. Kesavan and S. S. Iyengar, in Proceedings of the 2009 International Conference on Advances in Recent Technologies in Communication and Computing\n", 64 | "\n", 65 | "These papers describe the original Luhn algorithm for text summarization, its limitations, and its extensions. The algorithm is based on identifying the most frequent words in a document and selecting the sentences that contain them. This approach is simple and can produce reasonable results, but it has some limitations, such as the lack of understanding of the semantic relationships between words.\n", 66 | "\n", 67 | "The later papers explore extensions to Luhn's algorithm, such as combining it with other techniques, like latent semantic analysis, to improve its performance. 
These extensions aim to address some of the limitations of the original algorithm and improve its effectiveness in generating high-quality summaries.\n", 68 | "\n", 69 | "\n", 70 | "\n" 71 | ], 72 | "metadata": { 73 | "id": "u2FbDXwOKJgP" 74 | } 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 9, 79 | "metadata": { 80 | "id": "lTTxV7GxGdUg", 81 | "outputId": "457f7141-1708-447f-8f1d-673ab872c7d7", 82 | "colab": { 83 | "base_uri": "https://localhost:8080/" 84 | } 85 | }, 86 | "outputs": [ 87 | { 88 | "output_type": "stream", 89 | "name": "stdout", 90 | "text": [ 91 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 92 | "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.9/dist-packages (1.2.1)\n", 93 | "Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.22.4)\n", 94 | "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (3.1.0)\n", 95 | "Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.2.0)\n", 96 | "Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.10.1)\n", 97 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 98 | "Collecting rouge\n", 99 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 100 | "Requirement already satisfied: six in /usr/local/lib/python3.9/dist-packages (from rouge) (1.15.0)\n", 101 | "Installing collected packages: rouge\n", 102 | "Successfully installed rouge-1.0.1\n", 103 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 104 | "Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)\n", 105 | "Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nltk) (1.2.0)\n", 106 | "Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nltk) (8.1.3)\n", 107 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk) (4.65.0)\n", 108 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages (from nltk) (2022.6.2)\n" 109 | ] 110 | }, 111 | { 112 | "output_type": "stream", 113 | "name": "stderr", 114 | "text": [ 115 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 116 | "[nltk_data] Package stopwords is already up-to-date!\n", 117 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 118 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 119 | ] 120 | } 121 | ], 122 | "source": [ 123 | "from collections import Counter\n", 124 | "from nltk.corpus import stopwords\n", 125 | "!pip install scikit-learn\n", 126 | "!pip install rouge\n", 127 | "!pip install nltk\n", 128 | "from rouge import Rouge \n", 129 | "import nltk\n", 130 | "import nltk.translate.bleu_score as bleu\n", 131 | "nltk.download('stopwords')\n", 132 | "nltk.download('punkt')\n", 133 | "from nltk.corpus import stopwords \n", 134 | "from nltk.tokenize import word_tokenize, sent_tokenize " 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "source": [ 140 | "def extract_keywords(text, n_keywords=10):\n", 141 | " # Tokenize the text\n", 142 | " tokens = text.lower().split()\n", 143 | "\n", 144 | " # Remove stop words\n", 145 | " stop_words = set(stopwords.words('english'))\n", 146 | " 
tokens = [token for token in tokens if token not in stop_words]\n", 147 | "\n", 148 | " # Calculate the frequency of each word\n", 149 | " freq = Counter(tokens)\n", 150 | "\n", 151 | " # Assign scores to each word based on frequency and position\n", 152 | " scores = {word: freq[word] * (i+1) for i, word in enumerate(tokens)}\n", 153 | "\n", 154 | " # Sort the words by score and select the top n_keywords\n", 155 | " keywords = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:n_keywords]\n", 156 | "\n", 157 | " # Return the top keywords\n", 158 | " return [keyword[0] for keyword in keywords]" 159 | ], 160 | "metadata": { 161 | "id": "ywbshXsGGyON" 162 | }, 163 | "execution_count": 10, 164 | "outputs": [] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "source": [ 169 | "import nltk\n", 170 | "nltk.download('stopwords')" 171 | ], 172 | "metadata": { 173 | "colab": { 174 | "base_uri": "https://localhost:8080/" 175 | }, 176 | "id": "hMwcWQYTHXBN", 177 | "outputId": "18f5f4d0-8be1-4fb4-bd38-26f908a64ecd" 178 | }, 179 | "execution_count": 11, 180 | "outputs": [ 181 | { 182 | "output_type": "stream", 183 | "name": "stderr", 184 | "text": [ 185 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 186 | "[nltk_data] Package stopwords is already up-to-date!\n" 187 | ] 188 | }, 189 | { 190 | "output_type": "execute_result", 191 | "data": { 192 | "text/plain": [ 193 | "True" 194 | ] 195 | }, 196 | "metadata": {}, 197 | "execution_count": 11 198 | } 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "source": [ 204 | "text = \"\"\"\n", 205 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 206 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"\n", 207 | "\n", 208 | "# Extract the top 3 keywords\n", 209 | "keywords = extract_keywords(text, n_keywords=3)\n", 210 | "\n", 211 | "# Print the keywords\n", 212 | "print('Top keywords:', keywords)" 213 | ], 214 | "metadata": { 215 | "colab": { 216 | "base_uri": "https://localhost:8080/" 217 | }, 218 | "id": "kAjkoCdpG2YU", 219 | "outputId": "94af4f4a-c638-4d8c-d178-a610567524e3" 220 | }, 221 | "execution_count": 12, 222 | "outputs": [ 223 | { 224 | "output_type": "stream", 225 | "name": "stdout", 226 | "text": [ 227 | "Top keywords: ['vaccination', 'drive', 'million']\n" 228 | ] 229 | } 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "source": [ 235 | "# Summarize the text using the top keywords\n", 236 | "sentences = text.split('.')\n", 237 | "summary = ''\n", 238 | "for sentence in sentences:\n", 239 | " for keyword in keywords:\n", 240 | " if keyword in sentence.lower():\n", 241 | " summary += sentence.strip() + '. '\n", 242 | " break\n", 243 | "\n", 244 | "# Print the summary\n", 245 | "print('Summary:', summary)" 246 | ], 247 | "metadata": { 248 | "colab": { 249 | "base_uri": "https://localhost:8080/" 250 | }, 251 | "id": "k_Yfoy2mHUKr", 252 | "outputId": "1f238863-a3f8-4e73-bf30-0ad021aec397" 253 | }, 254 | "execution_count": 13, 255 | "outputs": [ 256 | { 257 | "output_type": "stream", 258 | "name": "stdout", 259 | "text": [ 260 | "Summary: India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges. The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
\n" 261 | ] 262 | } 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "source": [ 268 | "rouge = Rouge()\n", 269 | "scores = rouge.get_scores(summary, text)\n", 270 | "print(\"ROUGE Score:\")\n", 271 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 272 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 273 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 274 | ], 275 | "metadata": { 276 | "id": "mR7GDwtjHzfB", 277 | "outputId": "2bc41f67-b736-4be3-ac56-ea38845db6e5", 278 | "colab": { 279 | "base_uri": "https://localhost:8080/" 280 | } 281 | }, 282 | "execution_count": 14, 283 | "outputs": [ 284 | { 285 | "output_type": "stream", 286 | "name": "stdout", 287 | "text": [ 288 | "ROUGE Score:\n", 289 | "Precision: 0.991\n", 290 | "Recall: 0.742\n", 291 | "F1-Score: 0.848\n" 292 | ] 293 | } 294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "source": [ 299 | "from nltk.translate.bleu_score import sentence_bleu\n", 300 | "\n", 301 | "def summary_to_sentences(summary):\n", 302 | " # Split the summary into sentences using the '.' character as a separator\n", 303 | " sentences = summary.split('.')\n", 304 | " \n", 305 | " # Convert each sentence into a list of words\n", 306 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 307 | " \n", 308 | " return sentence_lists\n", 309 | "\n", 310 | "def paragraph_to_wordlist(paragraph):\n", 311 | " # Split the paragraph into words using whitespace as a separator\n", 312 | " words = paragraph.split()\n", 313 | " return words\n", 314 | "\n", 315 | "reference_paragraph = text\n", 316 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 317 | "predicted_paragraph = summary\n", 318 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 319 | "\n", 320 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 321 | "print(score)" 322 | ], 323 | "metadata": { 324 | "id": "1SOUmCEXKnZ3", 325 | "outputId": "12ef2d22-f95a-45bb-90a6-edb2f67f89a0", 326 | "colab": { 327 | "base_uri": "https://localhost:8080/" 328 | } 329 | }, 330 | "execution_count": 15, 331 | "outputs": [ 332 | { 333 | "output_type": "stream", 334 | "name": "stdout", 335 | "text": [ 336 | "0.7003175301310649\n" 337 | ] 338 | } 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "source": [ 344 | "print(\"BLEU Score: {:.3f}\".format(score))" 345 | ], 346 | "metadata": { 347 | "id": "MyoGCN5KK61U", 348 | "outputId": "671d94b2-fd8f-462f-eecd-bf2d45aad7cb", 349 | "colab": { 350 | "base_uri": "https://localhost:8080/" 351 | } 352 | }, 353 | "execution_count": 16, 354 | "outputs": [ 355 | { 356 | "output_type": "stream", 357 | "name": "stdout", 358 | "text": [ 359 | "BLEU Score: 0.700\n" 360 | ] 361 | } 362 | ] 363 | } 364 | ] 365 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/AI_Transformer_pipeline.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# Transformer Pipeline\n", 21 | "Text summarization of news articles using **transformer pipeline** has several advantages and disadvantages:\n", 22 | "\n", 23 | "### Pros:\n", 24 | "\n", 25 
| "* Time-saving: Summarization of news articles using transformer pipeline can save a lot of time for readers who want to quickly get an idea about the news without reading the entire article.\n", 26 | "\n", 27 | "* Better comprehension: Summaries generated by transformer pipeline models are often well-written, coherent and provide an accurate representation of the original text, which can help readers better understand the main points of the article.\n", 28 | "\n", 29 | "* Reduced bias: Transformer models are trained on large amounts of data, which helps to reduce the bias that may exist in human-written summaries.\n", 30 | "\n", 31 | "* Multilingual Support: Transformer models can support summarization of news articles in multiple languages, making it easier for readers to stay informed about news from around the world.\n", 32 | "\n", 33 | "###Cons:\n", 34 | "\n", 35 | "* Loss of details: One of the major drawbacks of using a text summarization model is that it can sometimes lead to loss of important details, nuances, and context of the original text, which can be critical in certain types of news articles.\n", 36 | "\n", 37 | "* Limited flexibility: Transformer models are trained on a large dataset and may not be able to capture the unique writing style of an individual news source, resulting in generic summaries.\n", 38 | "\n", 39 | "* Model Complexity: Transformer models require significant computing resources and expertise to train and maintain, which can be a barrier for smaller news organizations or individuals.\n", 40 | "\n", 41 | "* Dependence on Training Data: The quality of the summary generated by a transformer model is highly dependent on the quality and relevance of the training data used to train the model. If the training data is biased or limited, the quality of the summaries may be compromised.\n", 42 | "\n", 43 | "Overall, while text summarization using transformer pipeline has some limitations, it has the potential to significantly improve the efficiency and accessibility of news article.\n", 44 | "\n", 45 | "These are the scores we achieved:\n", 46 | "\n", 47 | " ROUGE Score:\n", 48 | " Precision: 0.938\n", 49 | " Recall: 0.397\n", 50 | " F1-Score: 0.558\n", 51 | "\n", 52 | " BLEU Score: 0.795\n", 53 | "\n", 54 | "## References\n", 55 | "Here are some research papers related to Transformer-based pipelines for text summarization:\n", 56 | "\n", 57 | "1. \"PreSumm: Simple and Effective Multi-Document Summarization\" by J. Zhang, Y. Chen, J. Guo, and D. Yin, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.\n", 58 | "\n", 59 | "1. \"Fine-tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping\" by C. Raffel and N. Shazeer, in Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019.\n", 60 | "\n", 61 | "1. \"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer\" by C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, in Journal of Machine Learning Research (JMLR), 2020.\n", 62 | "\n", 63 | "These papers explore various Transformer-based models for text summarization, such as PreSumm, BART, and T5. 
They also discuss different techniques for fine-tuning and optimizing these models, including weight initialization, early stopping, and data orders.\n", 64 | "\n", 65 | "The Transformer architecture is a type of neural network that has been highly successful in natural language processing tasks, including text summarization. Transformer-based models typically use pre-trained language models, such as BERT or GPT, as a starting point and then fine-tune them on a specific summarization task using large amounts of data.\n", 66 | "\n", 67 | "The papers suggest that Transformer-based pipelines are highly effective for text summarization, achieving state-of-the-art results on a wide range of benchmark datasets. These models are highly flexible and can be adapted to different summarization tasks and domains with minimal modification." 68 | ], 69 | "metadata": { 70 | "id": "lSGwCXXcdwbs" 71 | } 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 16, 76 | "metadata": { 77 | "colab": { 78 | "base_uri": "https://localhost:8080/" 79 | }, 80 | "id": "yzfJR8eaGiXc", 81 | "outputId": "a907fc98-c71e-4d72-d677-93b5fc24d83d" 82 | }, 83 | "outputs": [ 84 | { 85 | "output_type": "stream", 86 | "name": "stdout", 87 | "text": [ 88 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 89 | "Requirement already satisfied: transformers in /usr/local/lib/python3.9/dist-packages (4.26.1)\n", 90 | "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from transformers) (23.0)\n", 91 | "Requirement already satisfied: huggingface-hub<1.0,>=0.11.0 in /usr/local/lib/python3.9/dist-packages (from transformers) (0.13.1)\n", 92 | "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (1.22.4)\n", 93 | "Requirement already satisfied: filelock in /usr/local/lib/python3.9/dist-packages (from transformers) (3.9.0)\n", 94 | "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.9/dist-packages (from transformers) (6.0)\n", 95 | "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (2022.6.2)\n", 96 | "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.9/dist-packages (from transformers) (4.65.0)\n", 97 | "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /usr/local/lib/python3.9/dist-packages (from transformers) (0.13.2)\n", 98 | "Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from transformers) (2.25.1)\n", 99 | "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (4.5.0)\n", 100 | "Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (4.0.0)\n", 101 | "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2.10)\n", 102 | "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (1.26.14)\n", 103 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2022.12.7)\n", 104 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 105 | "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.9/dist-packages 
(0.1.97)\n" 106 | ] 107 | } 108 | ], 109 | "source": [ 110 | "!pip install -U transformers\n", 111 | "!pip install sentencepiece\n", 112 | "import torch\n", 113 | "import json \n", 114 | "from transformers import pipeline " 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "source": [ 120 | "summarizer = pipeline(\"summarization\")\n", 121 | "text =\"\"\"\n", 122 | "India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 123 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India\n", 124 | "\"\"\"" 125 | ], 126 | "metadata": { 127 | "colab": { 128 | "base_uri": "https://localhost:8080/" 129 | }, 130 | "id": "Jy6P0V2UbcJu", 131 | "outputId": "062f3e3e-3134-47ab-9acb-4dd770f55c51" 132 | }, 133 | "execution_count": 17, 134 | "outputs": [ 135 | { 136 | "output_type": "stream", 137 | "name": "stderr", 138 | "text": [ 139 | "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).\n", 140 | "Using a pipeline without specifying a model name and revision in production is not recommended.\n" 141 | ] 142 | } 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "source": [ 148 | "summ=summarizer(text)\n", 149 | "for sentence in summ:\n", 150 | " str1 = \"\"\n", 151 | " str1 += str(sentence)\n", 152 | " print(sentence)" 153 | ], 154 | "metadata": { 155 | "colab": { 156 | "base_uri": "https://localhost:8080/" 157 | }, 158 | "id": "Kb5JbitpcIIv", 159 | "outputId": "cfd7f57b-6d93-4b59-9e94-4bf775819be3" 160 | }, 161 | "execution_count": 18, 162 | "outputs": [ 163 | { 164 | "output_type": "stream", 165 | "name": "stdout", 166 | "text": [ 167 | "{'summary_text': \" India's COVID-19 vaccination drive will now be expanded to include people over 60 and those over 45 with co-morbidities . The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world . India began its vaccination drive in mid-January, starting with healthcare and frontline workers . The country's daily case count has been declining in recent weeks, but experts warn that the pandemic is far from over .\"}\n" 168 | ] 169 | } 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "source": [ 175 | "!pip install scikit-learn\n", 176 | "!pip install rouge\n", 177 | "!pip install nltk\n", 178 | "from rouge import Rouge \n", 179 | "import nltk\n", 180 | "import nltk.translate.bleu_score as bleu\n", 181 | "nltk.download('stopwords')\n", 182 | "nltk.download('punkt')\n", 183 | "from nltk.corpus import stopwords \n", 184 | "from nltk.tokenize import word_tokenize, sent_tokenize " 185 | ], 186 | "metadata": { 187 | "colab": { 188 | "base_uri": "https://localhost:8080/" 189 | }, 190 | "id": "RbPDGrpacqcL", 191 | "outputId": "d513e317-6c13-44ca-9039-416a9d7f1dcb" 192 | }, 193 | "execution_count": 19, 194 | "outputs": [ 195 | { 196 | "output_type": "stream", 197 | "name": "stdout", 198 | "text": [ 199 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 200 | "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.9/dist-packages (1.2.1)\n", 201 | "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (3.1.0)\n", 202 | "Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.2.0)\n", 203 | "Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.10.1)\n", 204 | "Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.22.4)\n", 205 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 206 | "Requirement already satisfied: rouge in 
/usr/local/lib/python3.9/dist-packages (1.0.1)\n", 207 | "Requirement already satisfied: six in /usr/local/lib/python3.9/dist-packages (from rouge) (1.15.0)\n", 208 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 209 | "Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)\n", 210 | "Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nltk) (8.1.3)\n", 211 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk) (4.65.0)\n", 212 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages (from nltk) (2022.6.2)\n", 213 | "Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nltk) (1.2.0)\n" 214 | ] 215 | }, 216 | { 217 | "output_type": "stream", 218 | "name": "stderr", 219 | "text": [ 220 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 221 | "[nltk_data] Package stopwords is already up-to-date!\n", 222 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 223 | "[nltk_data] Package punkt is already up-to-date!\n" 224 | ] 225 | } 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "source": [ 231 | "rouge = Rouge()\n", 232 | "scores = rouge.get_scores(str1, text)\n", 233 | "print(\"ROUGE Score:\")\n", 234 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 235 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 236 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 237 | ], 238 | "metadata": { 239 | "colab": { 240 | "base_uri": "https://localhost:8080/" 241 | }, 242 | "id": "5Aq9DO9Uc9iv", 243 | "outputId": "593151d8-3e79-49ed-d7b7-f19b218f376f" 244 | }, 245 | "execution_count": 20, 246 | "outputs": [ 247 | { 248 | "output_type": "stream", 249 | "name": "stdout", 250 | "text": [ 251 | "ROUGE Score:\n", 252 | "Precision: 0.938\n", 253 | "Recall: 0.397\n", 254 | "F1-Score: 0.558\n" 255 | ] 256 | } 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "source": [ 262 | "\n", 263 | "from nltk.translate.bleu_score import sentence_bleu\n", 264 | "\n", 265 | "def summary_to_sentences(summ):\n", 266 | " # Split the summary into sentences using the '.' 
character as a separator\n", 267 | " sentences = summ.split('.')\n", 268 | " \n", 269 | " # Convert each sentence into a list of words\n", 270 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 271 | " \n", 272 | " return sentence_lists\n", 273 | "\n", 274 | "def paragraph_to_wordlist(paragraph):\n", 275 | " # Split the paragraph into words using whitespace as a separator\n", 276 | " words = paragraph.split()\n", 277 | " return words\n", 278 | "\n", 279 | "reference_paragraph = text\n", 280 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 281 | "predicted_paragraph = str1\n", 282 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 283 | "\n", 284 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 285 | "print(score)" 286 | ], 287 | "metadata": { 288 | "colab": { 289 | "base_uri": "https://localhost:8080/" 290 | }, 291 | "id": "VDS2HQUhdETJ", 292 | "outputId": "bdb9c319-0d0a-4cb1-f6cf-292650cc1034" 293 | }, 294 | "execution_count": 21, 295 | "outputs": [ 296 | { 297 | "output_type": "stream", 298 | "name": "stdout", 299 | "text": [ 300 | "0.7945385996828465\n" 301 | ] 302 | } 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "source": [ 308 | "print(\"BLEU Score: {:.3f}\".format(score))" 309 | ], 310 | "metadata": { 311 | "colab": { 312 | "base_uri": "https://localhost:8080/" 313 | }, 314 | "id": "xr8Zl2K8dP2U", 315 | "outputId": "4e30c3f2-734a-4ede-c0f6-4dac056f7709" 316 | }, 317 | "execution_count": 22, 318 | "outputs": [ 319 | { 320 | "output_type": "stream", 321 | "name": "stdout", 322 | "text": [ 323 | "BLEU Score: 0.795\n" 324 | ] 325 | } 326 | ] 327 | } 328 | ] 329 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Text Summarization Models - A Comprehensive Educational Repository 2 | 3 | This repository contains Python implementations of **31 different text summarization models**, ranging from basic statistical approaches to advanced transformer-based architectures. This project serves as an educational resource for understanding various text summarization techniques and their practical applications. 4 | 5 | ## 📊 Project Overview 6 | 7 | Text summarization is the process of generating a shorter version of a longer text while preserving its most important information. This repository demonstrates both **extractive** (selecting key sentences) and **abstractive** (generating new content) summarization approaches. 8 | 9 | ### Evaluation Metrics 10 | All models are evaluated using: 11 | - **ROUGE Score** (Recall-Oriented Understudy for Gisting Evaluation) 12 | - Precision: Measures how much of the generated summary is relevant 13 | - Recall: Measures how much of the reference summary is captured 14 | - F1-Score: Harmonic mean of precision and recall 15 | - **BLEU Score** (Bilingual Evaluation Understudy) 16 | - Measures the quality of machine-generated text against reference text 17 | 18 | ## 🚀 Quick Start Guide 19 | 20 | ### Prerequisites 21 | ```bash 22 | # Python 3.7+ required 23 | python --version 24 | 25 | # Install core dependencies 26 | pip install -r requirements.txt 27 | ``` 28 | 29 | ### Basic Usage Example 30 | ```python 31 | # Example: Using BERT for summarization 32 | from summarizer import Summarizer 33 | 34 | model = Summarizer() 35 | text = "Your long text here..." 
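# Summarizer() is the BERT-based extractive model from bert-extractive-summarizer;
# num_sentences caps the output at three extracted sentences.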
36 | summary = model(text, num_sentences=3) 37 | print(summary) 38 | ``` 39 | 40 | ### Running a Specific Model 41 | ```bash 42 | # Navigate to the model directory 43 | cd "Basic to Advance Text Summarisation Models" 44 | 45 | # Run any Jupyter notebook 46 | jupyter notebook BERT.ipynb 47 | ``` 48 | 49 | ## 📊 Model Performance Comparison Table 50 | 51 | | Model | Type | ROUGE F1 | BLEU | Pros | Cons | Best For | 52 | |-------|------|----------|------|------|------|----------| 53 | | **GPT-2** | Abstractive | 0.797 | 0.423 | High recall, contextual | Low control, expensive | Research, complex docs | 54 | | **LSA** | Extractive | 0.602 | 0.869 | Semantic relationships | Needs large data | Multi-document | 55 | | **TextRank** | Extractive | 0.586 | 0.694 | Unbiased, consistent | Limited context | General purpose | 56 | | **T5** | Abstractive | 0.573 | 0.683 | High accuracy, multilingual | Resource intensive | Production systems | 57 | | **NLTK** | Extractive | 0.589 | 0.869 | Simple, language-independent | Limited accuracy | Beginners | 58 | | **BERT** | Extractive | 0.545 | 0.677 | Context understanding | Extractive only | Context-heavy docs | 59 | | **BART** | Abstractive | 0.402 | 0.905 | State-of-the-art | Large model size | High-quality summaries | 60 | | **LexRank** | Extractive | 0.357 | 0.651 | Handles complex texts | Limited coverage | Long documents | 61 | | **SumBasic** | Extractive | 0.589 | 0.621 | Simple implementation | Limited coverage | Single documents | 62 | 63 | ## 🏗️ Model Categories 64 | 65 | ### 1. Basic Statistical Models 66 | These models use fundamental NLP techniques and statistical approaches: 67 | 68 | #### **NLTK-based Summarization** (`NLTK.ipynb`) 69 | - **Approach**: Frequency-based sentence scoring using NLTK library 70 | - **Pros**: Simple implementation, language-independent, well-documented 71 | - **Cons**: Limited accuracy for complex texts, resource-intensive 72 | - **Performance**: ROUGE F1: 0.589, BLEU: 0.869 73 | - **Use Case**: Educational purposes, simple documents 74 | 75 | #### **TF-IDF Summarization** (`TFIDF.ipynb`) 76 | - **Approach**: Term Frequency-Inverse Document Frequency weighting 77 | - **Pros**: Effective for identifying important terms, widely used 78 | - **Cons**: May miss context, struggles with technical jargon 79 | - **Performance**: Evaluated using standard TF-IDF scoring 80 | - **Use Case**: Keyword extraction, document classification 81 | 82 | #### **SumBasic Algorithm** (`SumBasic.ipynb`) 83 | - **Approach**: Probabilistic sentence selection based on word frequency 84 | - **Pros**: Simple implementation, good for single documents 85 | - **Cons**: Limited coverage, may lack coherence 86 | - **Performance**: ROUGE F1: 0.589, BLEU: 0.621 87 | - **Use Case**: Single document summarization 88 | 89 | ### 2. 
Graph-Based Models 90 | These models represent text as graphs and use centrality algorithms: 91 | 92 | #### **TextRank** (`TextRank.ipynb`) 93 | - **Approach**: Graph-based ranking using sentence similarity 94 | - **Pros**: Automatic, unbiased, consistent results 95 | - **Cons**: Limited context understanding, may miss nuances 96 | - **Performance**: ROUGE F1: 0.586, BLEU: 0.694 97 | - **Use Case**: General purpose summarization 98 | 99 | #### **LexRank** (`LexRank.ipynb`) 100 | - **Approach**: Graph-based lexical centrality using cosine similarity 101 | - **Pros**: Handles complex texts, good performance on benchmarks 102 | - **Cons**: Limited coverage, extractive only 103 | - **Performance**: ROUGE F1: 0.357, BLEU: 0.651 104 | - **Use Case**: Long documents, multi-document summarization 105 | 106 | #### **Connected Dominating Set** (`Connected_Dominating_Set.ipynb`) 107 | - **Approach**: Graph theory-based sentence selection 108 | - **Pros**: Theoretical foundation, systematic approach 109 | - **Cons**: Computational complexity, may not capture semantics 110 | - **Use Case**: Research applications 111 | 112 | ### 3. Matrix Decomposition Models 113 | These models use mathematical techniques to identify important content: 114 | 115 | #### **LSA (Latent Semantic Analysis)** (`LSA.ipynb`) 116 | - **Approach**: Singular Value Decomposition for semantic relationships 117 | - **Pros**: Captures semantic relationships, good for multi-document 118 | - **Cons**: Requires large training data, limited coverage 119 | - **Performance**: ROUGE F1: 0.602, BLEU: 0.869 120 | - **Use Case**: Multi-document summarization, semantic analysis 121 | 122 | #### **Sentence Ranking** (`Sentence_Ranking.ipynb`) 123 | - **Approach**: Matrix-based sentence importance scoring 124 | - **Pros**: Systematic ranking, handles large documents 125 | - **Cons**: May miss context, computational expense 126 | - **Use Case**: Large document processing 127 | 128 | ### 4. Clustering-Based Models 129 | These models group similar content together: 130 | 131 | #### **K-Clustering** (`K_Clustering.ipynb`) 132 | - **Approach**: K-means clustering of sentences 133 | - **Pros**: Groups similar content, systematic approach 134 | - **Cons**: Requires parameter tuning, may lose context 135 | - **Use Case**: Topic-based summarization 136 | 137 | #### **Cluster-Based Summarization** (`Cluster_Based.ipynb`) 138 | - **Approach**: Hierarchical clustering for document structure 139 | - **Pros**: Preserves document structure, systematic 140 | - **Cons**: Computational complexity, parameter sensitivity 141 | - **Use Case**: Structured document summarization 142 | 143 | ### 5. Advanced NLP Models 144 | These models use sophisticated NLP techniques: 145 | 146 | #### **Luhn's Model** (`Luhn's_Model.ipynb`) 147 | - **Approach**: Classic statistical approach using word frequency 148 | - **Pros**: Historical significance, simple implementation 149 | - **Cons**: Basic approach, limited effectiveness 150 | - **Use Case**: Educational purposes, historical context 151 | 152 | #### **Compressive Document Summarization** (`Compressive_Document_Summarization.ipynb`) 153 | - **Approach**: Sentence compression and selection 154 | - **Pros**: More concise summaries, preserves key information 155 | - **Cons**: Complex implementation, may lose context 156 | - **Use Case**: Length-constrained summaries 157 | 158 | ### 6. 
### 3. Matrix Decomposition Models
These models use mathematical techniques to identify important content (see the sketch after this list):

#### **LSA (Latent Semantic Analysis)** (`LSA.ipynb`)
- **Approach**: Singular Value Decomposition for semantic relationships
- **Pros**: Captures semantic relationships, good for multi-document
- **Cons**: Requires large training data, limited coverage
- **Performance**: ROUGE F1: 0.602, BLEU: 0.869
- **Use Case**: Multi-document summarization, semantic analysis

#### **Sentence Ranking** (`Sentence_Ranking.ipynb`)
- **Approach**: Matrix-based sentence importance scoring
- **Pros**: Systematic ranking, handles large documents
- **Cons**: May miss context, computational expense
- **Use Case**: Large document processing
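As a hedged illustration of the decomposition idea (simplified relative to `LSA.ipynb`; assumes scikit-learn): factor the sentence-term matrix with truncated SVD and keep the sentences that load most strongly on the leading components.

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_summary(text, num_sentences=3, num_topics=2):
    sentences = sent_tokenize(text)
    matrix = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Project each sentence into a low-rank "topic" space via SVD.
    topic_weights = TruncatedSVD(n_components=num_topics).fit_transform(matrix)
    # Score sentences by the magnitude of their topic loadings.
    scores = np.linalg.norm(topic_weights, axis=1)
    top = np.sort(np.argsort(scores)[::-1][:num_sentences])
    return " ".join(sentences[i] for i in top)
```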
### 4. Clustering-Based Models
These models group similar content together:

#### **K-Clustering** (`K_Clustering.ipynb`)
- **Approach**: K-means clustering of sentences
- **Pros**: Groups similar content, systematic approach
- **Cons**: Requires parameter tuning, may lose context
- **Use Case**: Topic-based summarization

#### **Cluster-Based Summarization** (`Cluster_Based.ipynb`)
- **Approach**: Hierarchical clustering for document structure
- **Pros**: Preserves document structure, systematic
- **Cons**: Computational complexity, parameter sensitivity
- **Use Case**: Structured document summarization

### 5. Advanced NLP Models
These models use sophisticated NLP techniques:

#### **Luhn's Model** (`Luhn's_Model.ipynb`)
- **Approach**: Classic statistical approach using word frequency
- **Pros**: Historical significance, simple implementation
- **Cons**: Basic approach, limited effectiveness
- **Use Case**: Educational purposes, historical context

#### **Compressive Document Summarization** (`Compressive_Document_Summarization.ipynb`)
- **Approach**: Sentence compression and selection
- **Pros**: More concise summaries, preserves key information
- **Cons**: Complex implementation, may lose context
- **Use Case**: Length-constrained summaries

### 6. Transformer-Based Models
These are state-of-the-art deep learning models (a minimal usage sketch follows section 7 below):

#### **BERT Extractive Summarization** (`BERT.ipynb`)
- **Approach**: Pre-trained BERT model for sentence extraction
- **Pros**: High accuracy, captures context well
- **Cons**: Limited to extractive, depends on pre-trained models
- **Performance**: ROUGE F1: 0.545, BLEU: 0.677
- **Use Case**: Context-heavy documents

#### **BART** (`BART.ipynb`)
- **Approach**: Bidirectional and Auto-Regressive Transformer
- **Pros**: State-of-the-art performance, high accuracy
- **Cons**: Resource-intensive, large model size
- **Performance**: ROUGE F1: 0.402, BLEU: 0.905
- **Use Case**: High-quality abstractive summaries

#### **T5** (`T5.ipynb`)
- **Approach**: Text-To-Text Transfer Transformer
- **Pros**: High accuracy, customizable, multilingual
- **Cons**: Resource-intensive, technical complexity
- **Performance**: ROUGE F1: 0.573, BLEU: 0.683
- **Use Case**: Production systems, multilingual applications

#### **GPT-2** (`GPT_2.ipynb`)
- **Approach**: Generative Pre-trained Transformer 2
- **Pros**: Large language model, contextual understanding
- **Cons**: Lack of control, high computational costs
- **Performance**: ROUGE F1: 0.797, BLEU: 0.423
- **Use Case**: Research, creative summarization

#### **DistilBART** (`Distilbart.ipynb`)
- **Approach**: Distilled version of BART for efficiency
- **Pros**: Faster inference, smaller model size
- **Cons**: Slightly lower accuracy than full BART
- **Use Case**: Real-time applications

### 7. Specialized Models

#### **Gensim** (`Gensim.ipynb`)
- **Approach**: Topic modeling and document similarity
- **Pros**: Good for topic-based summarization
- **Cons**: May miss specific details
- **Use Case**: Topic extraction, document similarity

#### **SUMY** (`SUMY.ipynb`)
- **Approach**: Multi-document summarization toolkit
- **Pros**: Specialized for multi-document scenarios
- **Cons**: Limited to specific use cases
- **Use Case**: Multi-document summarization

#### **AI Transformer Pipeline** (`AI_Transformer_pipeline.ipynb`)
- **Approach**: Custom transformer-based pipeline
- **Pros**: Customizable architecture, modern approach
- **Cons**: Complex implementation, requires expertise
- **Use Case**: Custom applications, research
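For the transformer family, most of these notebooks boil down to a few lines with the Hugging Face `pipeline` API. A hedged sketch (`facebook/bart-large-cnn` and `google/pegasus-xsum` are public checkpoints; individual notebooks may use different ones):

```python
from transformers import pipeline

# Swap the model id to try other checkpoints, e.g. "google/pegasus-xsum".
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "India's Health Ministry has announced that the country's COVID-19 "
    "vaccination drive will now be expanded to include people over the age "
    "of 60 and those over 45 with co-morbidities."
)
result = summarizer(text, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```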
## 🚀 Advanced Architecture Models

### Pegasus Models (`Models Used in Proposed Architecture/`)

The repository includes **12 different Pegasus model implementations**:

#### **Single Document Pegasus** (`Pegasus-Single summarisation.ipynb`)
- **Approach**: Single document abstractive summarization
- **Pros**: High-quality summaries, generalization capability
- **Cons**: Large computational requirements, limited control
- **Performance**: ROUGE F1: 0.086, BLEU: 0
- **Use Case**: Single document abstractive summarization

#### **Multi-Document Pegasus** (`Pegasus-Multi summarisation.ipynb`)
- **Approach**: Multi-document summarization
- **Pros**: Handles multiple documents, comprehensive coverage
- **Cons**: Complex processing, resource-intensive
- **Use Case**: Multi-document scenarios

#### **All 12 Pegasus Models** (`all-12-pegasus-models.ipynb`)
Includes implementations of:
- **Pegasus-XSUM**: BBC News summarization
- **Pegasus-CNN/DailyMail**: News article summarization
- **Pegasus-Newsroom**: Newsroom dataset
- **Pegasus-MultiNews**: Multi-document news summarization
- **Pegasus-Gigaword**: Headline generation
- **Pegasus-WikiHow**: How-to article summarization
- **Pegasus-Reddit-TIFU**: Reddit story summarization
- **Pegasus-BigPatent**: Patent document summarization
- And more specialized variants

#### **Graph-Based Summary** (`Graph_Based_Summary.ipynb`)
- **Approach**: Novel graph-based approach using LCS, TF-IDF, and matrix methods
- **Pros**: Combines multiple techniques, systematic approach
- **Cons**: Complex implementation, computational expense
- **Performance**: ROUGE F1: 0.417, BLEU: 0.901
- **Use Case**: Research, novel approaches

## 📈 Performance Analysis

### Key Insights from Results:

1. **Best Overall Performance**: GPT-2 achieved the highest ROUGE F1 score (0.797)
2. **Best BLEU Score**: BART achieved the highest BLEU score (0.905)
3. **Most Balanced**: T5 shows a good balance between ROUGE and BLEU scores
4. **Efficiency vs Accuracy**: DistilBART offers a good trade-off for real-time applications

### Performance Trends:
- **Transformer models** generally outperform traditional methods
- **Abstractive models** show higher BLEU scores but lower ROUGE scores
- **Extractive models** are more consistent but less creative
- **Graph-based methods** provide good baseline performance

## 📈 Performance Comparison

### Existing Models Performance
![Existing Models Performance](Existing.png)

### Pegasus Models Performance
![Pegasus Models Performance](Pegasus.png)

These visualizations complement the detailed metrics in the "Model Performance Comparison Table" section above. A sketch of how such scores are computed follows below.
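The ROUGE/BLEU figures in this README follow the pattern used inside the notebooks (`rouge.Rouge` for ROUGE-1 and NLTK's `sentence_bleu`, with the source text as reference); condensed into one hedged helper:

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

def score_summary(summary, reference):
    # ROUGE-1 precision/recall/F1, as reported in the tables above.
    rouge_1 = Rouge().get_scores(summary, reference)[0]["rouge-1"]
    # BLEU against the reference split into sentences, as in the notebooks.
    bleu = sentence_bleu([s.split() for s in reference.split(".")],
                         summary.split())
    return rouge_1, bleu
```

Note that scoring a summary against its own source text (rather than a human-written reference) inflates precision, which helps explain the near-1.0 precision values reported in several notebooks.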
## 🛠️ Technical Requirements

### Core Dependencies
```bash
pip install transformers
pip install torch
pip install nltk
pip install rouge
pip install scikit-learn
pip install numpy
pip install pandas
pip install networkx
pip install sentencepiece
```

### Additional Libraries
- **spacy**: For advanced NLP processing
- **gensim**: For topic modeling
- **sumy**: For multi-document summarization
- **bert-extractive-summarizer**: For BERT-based summarization

### System Requirements
- **RAM**: Minimum 8GB, recommended 16GB+
- **GPU**: Optional but recommended for transformer models
- **Storage**: At least 5GB of free space for models
- **Python**: 3.7 or higher

## 📚 Educational Value

This repository serves as a comprehensive learning resource for:

1. **Understanding Different Approaches**: From basic statistical methods to advanced transformer models
2. **Performance Comparison**: Real-world evaluation using standard metrics
3. **Implementation Examples**: Complete working code for each approach
4. **Research References**: Academic papers and research background for each model
5. **Practical Applications**: Real-world use cases and considerations

### Learning Path Recommendations:

#### **Beginner Level**
1. Start with the NLTK and TF-IDF models
2. Understand basic NLP concepts
3. Learn about evaluation metrics

#### **Intermediate Level**
1. Explore graph-based methods (TextRank, LexRank)
2. Study matrix decomposition (LSA)
3. Understand clustering approaches

#### **Advanced Level**
1. Dive into transformer models (BERT, BART, T5)
2. Experiment with Pegasus models
3. Implement custom approaches

## 🎯 Use Cases & Model Selection Guide

### **For Simple Documents**
- **NLTK**: Educational purposes, basic understanding
- **TF-IDF**: Keyword extraction, document classification
- **SumBasic**: Single document summarization

### **For Complex Documents**
- **BERT**: Context-heavy documents
- **BART**: High-quality abstractive summaries
- **T5**: Production systems, multilingual

### **For Multi-Document Summarization**
- **LSA**: Semantic analysis, multi-document
- **LexRank**: Long documents, systematic approach
- **Multi-Document Pegasus**: Advanced multi-document scenarios

### **For Real-time Applications**
- **DistilBART**: Fast inference, good quality
- **TextRank**: General purpose, consistent
- **TF-IDF**: Simple, fast processing

### **For Research Purposes**
- **GPT-2**: Creative summarization, research
- **Advanced transformer models**: State-of-the-art approaches
- **Graph-Based Summary**: Novel methodologies

## 🔬 Research Contributions

This project includes:
- **31 different summarization approaches**
- **Comprehensive performance evaluation**
- **Educational implementations**
- **Research paper references**
- **Practical considerations and trade-offs**

### Research Areas Covered:
- **Extractive vs Abstractive Summarization**
- **Statistical vs Neural Approaches**
- **Graph-based vs Matrix-based Methods**
- **Traditional NLP vs Deep Learning**
- **Single vs Multi-document Summarization**

## 🚨 Common Issues & Troubleshooting

### Installation Issues
```bash
# If you encounter CUDA issues
pip install torch --index-url https://download.pytorch.org/whl/cpu

# For memory issues with large models
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

### Model Loading Issues
- Ensure sufficient RAM for large models
- Use CPU versions for memory-constrained environments
- Check model file paths and permissions

### Performance Issues
- Use smaller models for faster inference
- Consider batch processing for multiple documents
- Implement caching for repeated summarizations

## 👥 Contributing

Contributions are welcome! Areas for improvement:
- Additional model implementations
- Performance optimizations
- Documentation improvements
- New evaluation metrics
- Domain-specific adaptations

### Contribution Guidelines:
1. Fork the repository
2. Create a feature branch
3. Add your implementation with proper documentation
4. Include performance metrics
5. Submit a pull request

## 🙏 Acknowledgements

This repository was created as part of an Academic Capstone Project at Vellore Institute of Technology. Special thanks to **Prof. Durgesh Kumar** and the faculty members for their guidance and support.

### Research Acknowledgements:
- **Hugging Face** for transformer model implementations
- **NLTK** team for natural language processing tools
- **Academic community** for research papers and methodologies

## 📄 License

This project is for educational purposes. Please respect the licenses of individual model implementations and research papers referenced.

### License Information:
- **Educational Use**: Free for educational and research purposes
- **Commercial Use**: Check individual model licenses
- **Attribution**: Please cite the original research papers

## 📞 Support & Contact

For questions, issues, or contributions:
- **Issues**: Use GitHub issues for bug reports
- **Discussions**: Use GitHub discussions for questions
- **Contributions**: Submit pull requests for improvements

---

**Note**: This repository is designed for educational purposes and research. For production use, please ensure proper licensing and consider the specific requirements of your use case.

### Version Information
- **Last Updated**: December 2024
- **Total Models**: 31
- **Python Version**: 3.7+
- **License**: Educational Use

--------------------------------------------------------------------------------
/Basic to Advance Text Summarisation Models/Cluster_Based.ipynb:
--------------------------------------------------------------------------------

1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": []
7 | },
8 | "kernelspec": {
9 | "name": "python3",
10 | "display_name": "Python 3"
11 | },
12 | "language_info": {
13 | "name": "python"
14 | }
15 | },
16 | "cells": [
17 | {
18 | "cell_type": "markdown",
19 | "source": [
20 | "# Cluster-Based\n",
21 | "\n",
22 | "**Cluster-based algorithms** for text summarization are a class of unsupervised algorithms that group similar sentences into clusters and then extract summary sentences from these clusters. 
Here are some advantages and disadvantages of cluster-based algorithms for text summarization:\n", 23 | "\n", 24 | "### Pros:\n", 25 | "*\tFlexibility: Cluster-based algorithms are very flexible, as they can handle different types of texts, such as news articles, academic papers, and social media posts, among others.\n", 26 | "*\tLanguage independence: These algorithms are language-independent, which means they can summarize texts in any language.\n", 27 | "*\tEfficient: Cluster-based algorithms are relatively fast and can summarize large amounts of text quickly.\n", 28 | "\n", 29 | "### Cons:\n", 30 | "*\tClustering errors: The quality of the summary depends heavily on the quality of the clustering, and the clustering may not always be accurate, leading to poor summaries.\n", 31 | "*\tLack of coherence: Cluster-based algorithms may extract sentences from different clusters, leading to a lack of coherence in the summary.\n", 32 | "*\tLimited coverage: Cluster-based algorithms tend to summarize the most important sentences, but may miss some important details that are not explicitly mentioned in the text.\n", 33 | "*\tDifficulty in determining optimal number of clusters: One of the key challenges in cluster-based summarization is determining the optimal number of clusters, which can be difficult.\n", 34 | "\n", 35 | "Overall, cluster-based algorithms are a useful approach for summarizing text, but they do have limitations that need to be considered when using them.\n", 36 | "\n", 37 | "These are the scores we achieved:\n", 38 | "\n", 39 | " ROUGE Score:\n", 40 | " Precision: 0.980\n", 41 | " Recall: 0.331\n", 42 | " F1-Score: 0.495\n", 43 | "\n", 44 | " BLEU Score: 0.896\n", 45 | "\n", 46 | "## References\n", 47 | "\n", 48 | "Here are some research papers on cluster-based text summarization:\n", 49 | "\n", 50 | "1. \"Cluster-Based Multi-Document Summarization Using Centroid-Based Clustering\" by S. Aravindan and S. Natarajan. This paper proposes a centroid-based clustering approach for multi-document summarization.\n", 51 | "\n", 52 | "2. \"Cluster-Based Summarization of Web Documents\" by M. Shishibori, Y. Kawai, and M. Ishikawa. This paper presents a cluster-based approach for summarizing web documents.\n", 53 | "\n", 54 | "3. \"Summarizing Text Documents by Sentence Extraction Using Latent Semantic Analysis\" by J. Steinberger and K. Jezek. This paper proposes a cluster-based approach using Latent Semantic Analysis for sentence extraction in text summarization.\n", 55 | "\n", 56 | "4. \"Multi-document Summarization Using Clustering and Sentence Extraction\" by C. Wang, Y. Liu, and J. Zhu. This paper proposes a clustering and sentence extraction approach for multi-document summarization.\n", 57 | "\n", 58 | "These papers provide valuable insights into the development and implementation of cluster-based text summarization techniques." 
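,
"\n",
"*Editor's note: the cluster-count difficulty mentioned above is often handled with silhouette analysis. A hedged sketch (assuming scikit-learn; `doc_term_matrix` refers to the TF-IDF matrix built in the code below):*\n",
"\n",
"```python\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.metrics import silhouette_score\n",
"\n",
"def pick_k(doc_term_matrix, candidates=range(2, 6)):\n",
"    # Choose the k whose clustering yields the highest mean silhouette.\n",
"    best_k, best = None, -1.0\n",
"    for k in candidates:\n",
"        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_term_matrix)\n",
"        score = silhouette_score(doc_term_matrix, labels)\n",
"        if score > best:\n",
"            best_k, best = k, score\n",
"    return best_k\n",
"```\n",
"\n",
"*(When joining the selected sentences, prefer `\". \".join(...)` over bare concatenation so the summary keeps its sentence boundaries.)*"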
59 | ], 60 | "metadata": { 61 | "id": "pAiO6r6rBjHT" 62 | } 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": { 68 | "id": "JHUlgAuUKsBl", 69 | "colab": { 70 | "base_uri": "https://localhost:8080/" 71 | }, 72 | "outputId": "3497ae69-8b06-4236-937d-7aa1e06dc4d3" 73 | }, 74 | "outputs": [ 75 | { 76 | "output_type": "stream", 77 | "name": "stdout", 78 | "text": [ 79 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 80 | "Collecting rouge\n", 81 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 82 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 83 | "Installing collected packages: rouge\n", 84 | "Successfully installed rouge-1.0.1\n", 85 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 86 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 87 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n", 88 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n", 89 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 90 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n" 91 | ] 92 | }, 93 | { 94 | "output_type": "stream", 95 | "name": "stderr", 96 | "text": [ 97 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 98 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "!pip install rouge\n", 104 | "!pip install nltk\n", 105 | "from rouge import Rouge \n", 106 | "import nltk\n", 107 | "import nltk.translate.bleu_score as bleu\n", 108 | "nltk.download('punkt')\n", 109 | "from sklearn.cluster import KMeans\n", 110 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 111 | "import numpy as np" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "source": [ 117 | "text =\"\"\"\n", 118 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. 
The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 119 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"" 120 | ], 121 | "metadata": { 122 | "id": "rZ6c2VWzK1Bm" 123 | }, 124 | "execution_count": null, 125 | "outputs": [] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "source": [ 130 | "# Split paragraph into sentences\n", 131 | "sentences = text.split('. ')\n", 132 | "\n", 133 | "# Store each sentence as a separate document in the array\n", 134 | "documents = []\n", 135 | "for sentence in sentences:\n", 136 | " documents.append(sentence.strip())" 137 | ], 138 | "metadata": { 139 | "id": "I_J796LlLHLK" 140 | }, 141 | "execution_count": null, 142 | "outputs": [] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "source": [ 147 | "documents" 148 | ], 149 | "metadata": { 150 | "colab": { 151 | "base_uri": "https://localhost:8080/" 152 | }, 153 | "id": "_YpM7sbwLK3L", 154 | "outputId": "65fa0fd8-8bff-4494-ef8a-cc9a74164f7f" 155 | }, 156 | "execution_count": null, 157 | "outputs": [ 158 | { 159 | "output_type": "execute_result", 160 | "data": { 161 | "text/plain": [ 162 | "[\"India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities\",\n", 163 | " 'The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program',\n", 164 | " 'The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers',\n", 165 | " 'Since then, over 13 million doses have been administered across the country',\n", 166 | " 'However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India',\n", 167 | " 'The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States',\n", 168 | " \"The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\\nIn summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people\",\n", 169 | " 'The decision was taken after a meeting of the National Expert Group on 
Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.']" 170 | ] 171 | }, 172 | "metadata": {}, 173 | "execution_count": 4 174 | } 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "source": [ 180 | "# Create TF-IDF vectorizer\n", 181 | "vectorizer = TfidfVectorizer(stop_words='english')\n", 182 | "\n", 183 | "# Create document-term matrix\n", 184 | "doc_term_matrix = vectorizer.fit_transform(documents)\n", 185 | "\n", 186 | "# Perform K-means clustering\n", 187 | "k = 2\n", 188 | "km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1, verbose=False)\n", 189 | "km.fit(doc_term_matrix)" 190 | ], 191 | "metadata": { 192 | "colab": { 193 | "base_uri": "https://localhost:8080/" 194 | }, 195 | "id": "gsQcHdODLLq3", 196 | "outputId": "a84a089c-c114-4648-fe2c-3be6ad23b38a" 197 | }, 198 | "execution_count": null, 199 | "outputs": [ 200 | { 201 | "output_type": "execute_result", 202 | "data": { 203 | "text/plain": [ 204 | "KMeans(max_iter=100, n_clusters=2, n_init=1, verbose=False)" 205 | ] 206 | }, 207 | "metadata": {}, 208 | "execution_count": 5 209 | } 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "source": [ 215 | "# Get cluster labels and centroids\n", 216 | "labels = km.labels_\n", 217 | "centroids = km.cluster_centers_\n", 218 | "\n", 219 | "# Get representative sentences for each cluster\n", 220 | "representative_sentences = []\n", 221 | "for i in range(k):\n", 222 | " cluster_indices = np.where(labels == i)[0]\n", 223 | " cluster_sentences = [documents[idx] for idx in cluster_indices]\n", 224 | " cluster_vector = vectorizer.transform(cluster_sentences)\n", 225 | " similarity_scores = np.asarray(cluster_vector.dot(centroids[i].T)).flatten()\n", 226 | " threshold = np.percentile(similarity_scores, 80) # filter out non-representative sentences\n", 227 | " representative_idx = np.argmax(similarity_scores * (similarity_scores > threshold))\n", 228 | " representative_sentence = cluster_sentences[representative_idx]\n", 229 | " representative_sentences.append(representative_sentence)" 230 | ], 231 | "metadata": { 232 | "id": "poEn_fF3LTpd" 233 | }, 234 | "execution_count": null, 235 | "outputs": [] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "source": [ 240 | "def listToString(s):\n", 241 | " str1 = \"\"\n", 242 | " for ele in s:\n", 243 | " str1 += ele\n", 244 | " return str1" 245 | ], 246 | "metadata": { 247 | "id": "MIacoas1NREf" 248 | }, 249 | "execution_count": null, 250 | "outputs": [] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "source": [ 255 | "# Post-processing: remove redundant sentences\n", 256 | "final_summary = list(set(representative_sentences))\n", 257 | "\n", 258 | "# Print the resulting summary\n", 259 | "summary=(listToString(final_summary))\n", 260 | "print(summary)" 261 | ], 262 | "metadata": { 263 | "colab": { 264 | "base_uri": "https://localhost:8080/" 265 | }, 266 | "id": "hDo2OKaNLVf5", 267 | "outputId": "eb4d42ce-8071-4c93-9a0d-1a9da4aafe06" 268 | }, 269 | "execution_count": null, 270 | "outputs": [ 271 | { 272 | "output_type": "stream", 273 | "name": "stdout", 274 | "text": [ 275 | "India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbiditiesThe NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in 
mid-January, starting with healthcare and frontline workers\n" 276 | ] 277 | } 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "source": [ 283 | "rouge = Rouge()\n", 284 | "scores = rouge.get_scores(summary, text)\n", 285 | "print(\"ROUGE Score:\")\n", 286 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 287 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 288 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 289 | ], 290 | "metadata": { 291 | "colab": { 292 | "base_uri": "https://localhost:8080/" 293 | }, 294 | "id": "JZ_5T_6hUmYg", 295 | "outputId": "65cc43e0-effd-48fc-a332-e48037fd807c" 296 | }, 297 | "execution_count": null, 298 | "outputs": [ 299 | { 300 | "output_type": "stream", 301 | "name": "stdout", 302 | "text": [ 303 | "ROUGE Score:\n", 304 | "Precision: 0.980\n", 305 | "Recall: 0.331\n", 306 | "F1-Score: 0.495\n" 307 | ] 308 | } 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "source": [ 314 | "from nltk.translate.bleu_score import sentence_bleu\n", 315 | "\n", 316 | "def summary_to_sentences(summary):\n", 317 | " # Split the summary into sentences using the '.' character as a separator\n", 318 | " sentences = summary.split('.')\n", 319 | " \n", 320 | " # Convert each sentence into a list of words\n", 321 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 322 | " \n", 323 | " return sentence_lists\n", 324 | "\n", 325 | "def paragraph_to_wordlist(paragraph):\n", 326 | " # Split the paragraph into words using whitespace as a separator\n", 327 | " words = paragraph.split()\n", 328 | " return words\n", 329 | "\n", 330 | "reference_paragraph = text\n", 331 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 332 | "predicted_paragraph = summary\n", 333 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 334 | "\n", 335 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 336 | "print(score)" 337 | ], 338 | "metadata": { 339 | "colab": { 340 | "base_uri": "https://localhost:8080/" 341 | }, 342 | "id": "CNDZuWVTU5zZ", 343 | "outputId": "310f4c82-5653-4db4-dc66-b0215c30a41f" 344 | }, 345 | "execution_count": null, 346 | "outputs": [ 347 | { 348 | "output_type": "stream", 349 | "name": "stdout", 350 | "text": [ 351 | "0.8956352427165735\n" 352 | ] 353 | } 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "source": [ 359 | "print(\"BLEU Score: {:.3f}\".format(score))" 360 | ], 361 | "metadata": { 362 | "colab": { 363 | "base_uri": "https://localhost:8080/" 364 | }, 365 | "id": "3UkkMn3lR-tL", 366 | "outputId": "c0123738-92a7-4144-8c08-f2ca05ce41e8" 367 | }, 368 | "execution_count": null, 369 | "outputs": [ 370 | { 371 | "output_type": "stream", 372 | "name": "stdout", 373 | "text": [ 374 | "BLEU Score: 0.896\n" 375 | ] 376 | } 377 | ] 378 | } 379 | ] 380 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/Connected_Dominating_Set.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "source": [ 6 | "# CDS\n", 7 | "Text summarization using **Connected Dominating Set (CDS)** is a technique that involves selecting the most important sentences from a text document to create a shorter version of the original. 
Here are some advantages and disadvantages of using CDS for text summarization:\n", 8 | "\n", 9 | "### Pros:\n", 10 | "\n", 11 | "* Good Coverage: CDS-based summarization techniques can cover most of the important topics in a document, as they aim to select the most representative sentences that cover the main themes and ideas.\n", 12 | "\n", 13 | "* Improved Coherence: CDS-based techniques tend to produce summaries that are more coherent than other methods, as they select sentences that are more connected to each other in terms of content.\n", 14 | "\n", 15 | "* Speed: CDS-based techniques are relatively fast and can generate summaries quickly, making them suitable for summarizing large volumes of text.\n", 16 | "\n", 17 | "* Flexibility: CDS-based techniques can be adapted to different types of text documents, including news articles, research papers, and other types of documents.\n", 18 | "\n", 19 | "### Cons:\n", 20 | "\n", 21 | "* Limited Precision: CDS-based summarization techniques may not always select the most important sentences from a document, as they focus more on coverage and coherence rather than precision.\n", 22 | "\n", 23 | "* Subjectivity: CDS-based techniques can be subjective, as the selection of the most important sentences can vary depending on the criteria used to define importance.\n", 24 | "\n", 25 | "* Lack of Context: CDS-based techniques may not take into account the context of a sentence, which can lead to the selection of sentences that are not relevant to the main theme or idea.\n", 26 | "\n", 27 | "* Over-simplification: CDS-based techniques can oversimplify complex documents, as they tend to focus on the most important sentences and may leave out important details or nuances.\n", 28 | "\n", 29 | "These are the scores we achieved:\n", 30 | "\n", 31 | " ROUGE Score:\n", 32 | " Precision: 1.000\n", 33 | " Recall: 0.430\n", 34 | " F1-Score: 0.602\n", 35 | "\n", 36 | " BLEU Score: 0.844\n", 37 | "\n", 38 | "## References \n", 39 | "\n", 40 | "1. \"A new approach for text summarization using connected dominating set in graphs\" by M. Sadeghi and M. M. Farsangi, in Proceedings of the 2010 International Conference on Computer, Mechatronics, Control and Electronic Engineering (CMCE)\n", 41 | "\n", 42 | "2. \"Text summarization using a graph-based method with connected dominating set\" by A. E. Bayraktar and F. Can, in Proceedings of the 2012 International Conference on Computer Science and Engineering (UBMK)\n", 43 | "\n", 44 | "3. \"Extractive text summarization based on the connected dominating set in a graph representation\" by A. E. Bayraktar and F. Can, in Turkish Journal of Electrical Engineering & Computer Sciences\n", 45 | "\n", 46 | "4. \"A novel text summarization technique based on connected dominating set in graph\" by M. Sadeghi and M. M. Farsangi, in the Journal of Information Science and Engineering\n", 47 | "\n", 48 | "These papers propose using the CDS algorithm to build a graph-based representation of the document and then extracting the summary by selecting the most important sentences or nodes in the CDS. The CDS approach has been shown to be effective in identifying the most important nodes in the graph, and can lead to high-quality summaries." 
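,
"\n",
"*Editor's note: as a cross-check on the greedy max-degree construction in the code below, NetworkX ships a dominating-set approximation; connectedness (the 'C' in CDS) still has to be verified separately. A hedged sketch, not this notebook's method:*\n",
"\n",
"```python\n",
"import networkx as nx\n",
"from networkx.algorithms.approximation import min_weighted_dominating_set\n",
"\n",
"def approx_cds(graph):\n",
"    # Approximate a small dominating set; weight=None makes every node cost 1.\n",
"    dom = min_weighted_dominating_set(graph, weight=None)\n",
"    # Accept it only if the induced subgraph is connected.\n",
"    return dom if nx.is_connected(graph.subgraph(dom)) else None\n",
"```"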
49 | ], 50 | "metadata": { 51 | "id": "4Y_HUXHGG9Gs" 52 | } 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": { 58 | "colab": { 59 | "base_uri": "https://localhost:8080/" 60 | }, 61 | "id": "eTG7KWVOP_aZ", 62 | "outputId": "e4baf06f-4acb-4be2-c329-1a6a6d0011a1" 63 | }, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 70 | "Collecting rouge\n", 71 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 72 | "Requirement already satisfied: six in /usr/local/lib/python3.8/dist-packages (from rouge) (1.15.0)\n", 73 | "Installing collected packages: rouge\n", 74 | "Successfully installed rouge-1.0.1\n", 75 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 76 | "Requirement already satisfied: nltk in /usr/local/lib/python3.8/dist-packages (3.7)\n", 77 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.8/dist-packages (from nltk) (2022.6.2)\n", 78 | "Requirement already satisfied: joblib in /usr/local/lib/python3.8/dist-packages (from nltk) (1.2.0)\n", 79 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.8/dist-packages (from nltk) (4.64.1)\n", 80 | "Requirement already satisfied: click in /usr/local/lib/python3.8/dist-packages (from nltk) (7.1.2)\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "!pip install rouge\n", 86 | "!pip install nltk\n", 87 | "import networkx as nx\n", 88 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 89 | "from nltk.corpus import stopwords\n", 90 | "from nltk.stem import WordNetLemmatizer\n", 91 | "from rouge import Rouge \n", 92 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 93 | "import nltk\n", 94 | "import nltk.translate.bleu_score as bleu" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": { 101 | "colab": { 102 | "base_uri": "https://localhost:8080/" 103 | }, 104 | "id": "iSTScWFlRbK4", 105 | "outputId": "4a873fef-8dc1-4280-f269-821cf693edac" 106 | }, 107 | "outputs": [ 108 | { 109 | "name": "stderr", 110 | "output_type": "stream", 111 | "text": [ 112 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 113 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n", 114 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 115 | "[nltk_data] Unzipping corpora/stopwords.zip.\n", 116 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", 117 | "[nltk_data] Downloading package omw-1.4 to /root/nltk_data...\n" 118 | ] 119 | }, 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "True" 124 | ] 125 | }, 126 | "execution_count": 2, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "nltk.download('punkt')\n", 133 | "nltk.download('stopwords')\n", 134 | "nltk.download('wordnet')\n", 135 | "nltk.download('omw-1.4')" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": null, 141 | "metadata": { 142 | "id": "_FrNEG0fQ_fR" 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "def preprocess_text(text):\n", 147 | " \"\"\"\n", 148 | " Preprocess a given text by tokenizing, removing stop words, and lemmatizing the words.\n", 149 | " \"\"\"\n", 150 | " # tokenize the text into sentences\n", 151 | " sentences = sent_tokenize(text)\n", 152 | "\n", 153 | " # remove stop words and lemmatize the words in each 
sentence\n", 154 | " stop_words = set(stopwords.words('english'))\n", 155 | " lemmatizer = WordNetLemmatizer()\n", 156 | " preprocessed_sentences = []\n", 157 | " for sentence in sentences:\n", 158 | " words = word_tokenize(sentence.lower())\n", 159 | " filtered_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]\n", 160 | " preprocessed_sentence = \" \".join(filtered_words)\n", 161 | " preprocessed_sentences.append(preprocessed_sentence)\n", 162 | "\n", 163 | " return preprocessed_sentences" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "id": "QOq-xHNrQ_3-" 171 | }, 172 | "outputs": [], 173 | "source": [ 174 | "def compute_similarity(sentence1, sentence2):\n", 175 | " \"\"\"\n", 176 | " Compute the similarity score between two sentences using TF-IDF.\n", 177 | " \"\"\"\n", 178 | " tfidf = TfidfVectorizer().fit_transform([sentence1, sentence2])\n", 179 | " similarity_score = (tfidf * tfidf.T).A[0, 1]\n", 180 | " return similarity_score" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": { 187 | "id": "85qyfQEvRCIE" 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "def find_minimum_cds(graph):\n", 192 | " \"\"\"\n", 193 | " Find the minimum Connected Dominating Set (CDS) of a graph using a greedy algorithm.\n", 194 | " \"\"\"\n", 195 | " cds = set() # initialize CDS to empty set\n", 196 | " nodes = set(graph.nodes()) # get all nodes in the graph\n", 197 | "\n", 198 | " while nodes:\n", 199 | " max_degree_node = max(nodes, key=lambda n: graph.degree(n)) # find node with highest degree\n", 200 | " cds.add(max_degree_node) # add node to CDS\n", 201 | " nodes.discard(max_degree_node) # remove node from remaining nodes\n", 202 | " neighbors = set(graph.neighbors(max_degree_node)) # get all neighbors of the node\n", 203 | " nodes.difference_update(neighbors) # remove neighbors from remaining nodes\n", 204 | "\n", 205 | " return cds" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "source": [], 211 | "metadata": { 212 | "id": "BR_cl2XyEpy_" 213 | }, 214 | "execution_count": null, 215 | "outputs": [] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": { 221 | "id": "jA-5S0DxREBP" 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "def summarize_text(text, summary_size, threshold=0.1):\n", 226 | " \"\"\"\n", 227 | " Summarize a given text using minimum Connected Dominating Set (CDS).\n", 228 | " \"\"\"\n", 229 | " # preprocess the text\n", 230 | " preprocessed_sentences = preprocess_text(text)\n", 231 | "\n", 232 | " # create graph from preprocessed sentences\n", 233 | " graph = nx.Graph()\n", 234 | " for i, sentence in enumerate(preprocessed_sentences):\n", 235 | " for j in range(i+1, len(preprocessed_sentences)):\n", 236 | " similarity_score = compute_similarity(sentence, preprocessed_sentences[j]) # compute similarity score between two sentences\n", 237 | " if similarity_score > threshold:\n", 238 | " graph.add_edge(i, j, weight=similarity_score)\n", 239 | "\n", 240 | " # find minimum CDS of the graph\n", 241 | " cds = find_minimum_cds(graph)\n", 242 | "\n", 243 | " # sort the CDS nodes based on their occurrence order in the original text\n", 244 | " summary_nodes = sorted(list(cds))\n", 245 | "\n", 246 | " # create summary by concatenating the selected sentences\n", 247 | " summary = \". 
\".join([sent_tokenize(text)[i] for i in summary_nodes][:summary_size])\n", 248 | "\n", 249 | " return summary" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": { 256 | "colab": { 257 | "base_uri": "https://localhost:8080/" 258 | }, 259 | "id": "kTepjNBWRGUf", 260 | "outputId": "15ab2aeb-4eb2-4e23-eaf0-fbb0489aa68b" 261 | }, 262 | "outputs": [ 263 | { 264 | "name": "stdout", 265 | "output_type": "stream", 266 | "text": [ 267 | "The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program.. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States.\n" 268 | ] 269 | } 270 | ], 271 | "source": [ 272 | "text =\"\"\"\n", 273 | " India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 274 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\"\n", 275 | "\n", 276 | "summary_size = 3 # number of sentences in the summary\n", 277 | "summary = summarize_text(text, summary_size)\n", 278 | "\n", 279 | "print(summary)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": null, 285 | "metadata": { 286 | "colab": { 287 | "base_uri": "https://localhost:8080/" 288 | }, 289 | "id": "AcRB4Cn0R_UA", 290 | "outputId": "2b9d3019-c6f8-4bb0-afde-d3e89e99a82b" 291 | }, 292 | "outputs": [ 293 | { 294 | "name": "stdout", 295 | "output_type": "stream", 296 | "text": [ 297 | "ROUGE Score:\n", 298 | "Precision: 1.000\n", 299 | "Recall: 0.430\n", 300 | "F1-Score: 0.602\n" 301 | ] 302 | } 303 | ], 304 | "source": [ 305 | "rouge = Rouge()\n", 306 | "scores = rouge.get_scores(summary, text)\n", 307 | "print(\"ROUGE Score:\")\n", 308 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 309 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 310 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": { 317 | "colab": { 318 | "base_uri": "https://localhost:8080/" 319 | }, 320 | "id": "2n8nozo0XmKJ", 321 | "outputId": "3fc8f3fb-4169-4458-d567-68842b70146b" 322 | }, 323 | "outputs": [ 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "0.8435083039960267\n" 329 | ] 330 | } 331 | ], 332 | "source": [ 333 | "from nltk.translate.bleu_score import sentence_bleu\n", 334 | "\n", 335 | "def summary_to_sentences(summary):\n", 336 | " # Split the summary into sentences using the '.' 
character as a separator\n", 337 | " sentences = summary.split('.')\n", 338 | " \n", 339 | " # Convert each sentence into a list of words\n", 340 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 341 | " \n", 342 | " return sentence_lists\n", 343 | "\n", 344 | "def paragraph_to_wordlist(paragraph):\n", 345 | " # Split the paragraph into words using whitespace as a separator\n", 346 | " words = paragraph.split()\n", 347 | " return words\n", 348 | "\n", 349 | "reference_paragraph = text\n", 350 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 351 | "predicted_paragraph = summary\n", 352 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 353 | "\n", 354 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 355 | "print(score)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "metadata": { 362 | "colab": { 363 | "base_uri": "https://localhost:8080/" 364 | }, 365 | "id": "9lHe3peYXoRL", 366 | "outputId": "1b99d191-a4bd-4880-e4fd-6be470c3a746" 367 | }, 368 | "outputs": [ 369 | { 370 | "name": "stdout", 371 | "output_type": "stream", 372 | "text": [ 373 | "BLEU Score: 0.844\n" 374 | ] 375 | } 376 | ], 377 | "source": [ 378 | "print(\"BLEU Score: {:.3f}\".format(score))" 379 | ] 380 | } 381 | ], 382 | "metadata": { 383 | "colab": { 384 | "provenance": [] 385 | }, 386 | "kernelspec": { 387 | "display_name": "Python 3", 388 | "name": "python3" 389 | }, 390 | "language_info": { 391 | "name": "python" 392 | } 393 | }, 394 | "nbformat": 4, 395 | "nbformat_minor": 0 396 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/Compressive_Document_Summarization.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [] 7 | }, 8 | "kernelspec": { 9 | "name": "python3", 10 | "display_name": "Python 3" 11 | }, 12 | "language_info": { 13 | "name": "python" 14 | } 15 | }, 16 | "cells": [ 17 | { 18 | "cell_type": "markdown", 19 | "source": [ 20 | "# CDS\n", 21 | "**CDS (Compressive Document Summarization)** is a deep learning-based approach for text summarization that uses a combination of extractive and abstractive techniques. 
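In outline: rank and select salient sentences first, then compress the selections so the summary stays within a length budget.\n",
"\n",
"*Editor's sketch of a naive compression step (purely illustrative; the regex is an assumption, not this notebook's method):*\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def compress(sentence):\n",
"    # Drop parenthetical asides, then collapse runs of whitespace.\n",
"    sentence = re.sub(r'\\s*\\([^)]*\\)', '', sentence)\n",
"    return re.sub(r'\\s+', ' ', sentence).strip()\n",
"```\n",
"\n",
"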
Here are some pros and cons of text summarization of news articles using CDS:\n", 22 | "\n", 23 | "### Pros:\n", 24 | "\n", 25 | "* High compression: CDS can generate highly compressed summaries that retain the most important information from the input text, making it useful for summarizing long documents.\n", 26 | "\n", 27 | "* Combines extractive and abstractive techniques: CDS combines the benefits of extractive and abstractive summarization techniques, resulting in summaries that are both informative and concise.\n", 28 | "\n", 29 | "* High accuracy: CDS has achieved state-of-the-art performance on many benchmark datasets for text summarization, indicating that it can generate high-quality summaries.\n", 30 | "\n", 31 | "* Customizable: CDS can be fine-tuned on specific domains or use cases, allowing users to generate summaries tailored to their needs.\n", 32 | "\n", 33 | "### Cons:\n", 34 | "\n", 35 | "* Resource-intensive: Training and using CDS for text summarization requires significant computational resources, including high-end GPUs, large amounts of memory, and high-speed storage.\n", 36 | "\n", 37 | "* Large model size: CDS is a large model that requires a lot of disk space to store, making it challenging to deploy on devices with limited storage capacity.\n", 38 | "\n", 39 | "* Dependence on training data: CDS's performance is highly dependent on the quality and relevance of the training data used to train the model. If the training data is biased or limited, the quality of the summaries may be compromised.\n", 40 | "\n", 41 | "* Expertise required: Fine-tuning CDS for specific use cases or domains requires expertise in natural language processing and machine learning.\n", 42 | "\n", 43 | "Overall, CDS is a powerful tool for text summarization that can generate highly compressed and informative summaries. However, it requires significant computational resources and expertise to use effectively, making it best suited for large-scale projects or applications where high accuracy is critical.\n", 44 | "\n", 45 | "These are the scores we achieved:\n", 46 | "\n", 47 | " ROUGE Score:\n", 48 | " Precision: 1.000\n", 49 | " Recall: 0.503\n", 50 | " F1-Score: 0.670\n", 51 | "\n", 52 | " BLEU Score: 0.909\n", 53 | "\n", 54 | "## References \n", 55 | "\n", 56 | "1. \"A neural attention model for abstractive sentence summarization\" by Alexander M. Rush, Sumit Chopra, and Jason Weston, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)\n", 57 | "\n", 58 | "2. \"A deep reinforced model for abstractive summarization\" by Romain Paulus, Caiming Xiong, and Richard Socher, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)\n", 59 | "\n", 60 | "3. \"Compressive document summarization via sparse optimization\" by Wei Shen, Tao Li, and Minyi Guo, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL)\n", 61 | "\n", 62 | "4. \"Document summarization with a graph-based attentional neural model\" by Rui Yan and Yaowei Wang, in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)\n", 63 | "\n", 64 | "5. 
\"Neural document summarization by jointly learning to score and select sentences\" by Hong Wang, Xin Wang, and Wenhan Chao, in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)\n", 65 | "\n", 66 | "These papers explore various techniques for Compressive Document Summarization, including neural network-based models and graph-based models, and may provide insights into how to approach this task.\n", 67 | "\n", 68 | "\n", 69 | "\n", 70 | "\n" 71 | ], 72 | "metadata": { 73 | "id": "nxZJvegKFsDp" 74 | } 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 2, 79 | "metadata": { 80 | "colab": { 81 | "base_uri": "https://localhost:8080/" 82 | }, 83 | "id": "iRfsywW_EswJ", 84 | "outputId": "8eec8af1-2e09-447f-ee18-e1c8d4b0b522" 85 | }, 86 | "outputs": [ 87 | { 88 | "output_type": "stream", 89 | "name": "stderr", 90 | "text": [ 91 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 92 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n" 93 | ] 94 | }, 95 | { 96 | "output_type": "execute_result", 97 | "data": { 98 | "text/plain": [ 99 | "True" 100 | ] 101 | }, 102 | "metadata": {}, 103 | "execution_count": 2 104 | } 105 | ], 106 | "source": [ 107 | "import nltk\n", 108 | "import numpy as np\n", 109 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 110 | "from nltk.tokenize import sent_tokenize\n", 111 | "\n", 112 | "nltk.download('punkt')" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "source": [ 118 | "!pip install -U transformers\n", 119 | "!pip install sentencepiece\n", 120 | "!pip install rouge\n", 121 | "!pip install nltk\n", 122 | "import torch\n", 123 | "import nltk \n", 124 | "nltk.download('punkt')\n", 125 | "import json \n", 126 | "from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig\n", 127 | "from rouge import Rouge \n", 128 | "import nltk.translate.bleu_score as bleu" 129 | ], 130 | "metadata": { 131 | "colab": { 132 | "base_uri": "https://localhost:8080/" 133 | }, 134 | "id": "V4LucMn5JWx9", 135 | "outputId": "3d06f5a3-f8d6-4b92-edb6-0cdc5e89d267" 136 | }, 137 | "execution_count": 3, 138 | "outputs": [ 139 | { 140 | "output_type": "stream", 141 | "name": "stdout", 142 | "text": [ 143 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 144 | "Collecting transformers\n", 145 | " Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)\n", 146 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.3/6.3 MB\u001b[0m \u001b[31m43.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", 147 | "\u001b[?25hRequirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.9/dist-packages (from transformers) (4.65.0)\n", 148 | "Requirement already satisfied: filelock in /usr/local/lib/python3.9/dist-packages (from transformers) (3.9.0)\n", 149 | "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (2022.6.2)\n", 150 | "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from transformers) (23.0)\n", 151 | "Collecting tokenizers!=0.11.3,<0.14,>=0.11.1\n", 152 | " Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)\n", 153 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.6/7.6 MB\u001b[0m \u001b[31m79.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", 154 | "\u001b[?25hCollecting huggingface-hub<1.0,>=0.11.0\n", 155 | " 
Downloading huggingface_hub-0.13.1-py3-none-any.whl (199 kB)\n", 156 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m199.2/199.2 KB\u001b[0m \u001b[31m10.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", 157 | "\u001b[?25hRequirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.9/dist-packages (from transformers) (1.22.4)\n", 158 | "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.9/dist-packages (from transformers) (6.0)\n", 159 | "Requirement already satisfied: requests in /usr/local/lib/python3.9/dist-packages (from transformers) (2.25.1)\n", 160 | "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/dist-packages (from huggingface-hub<1.0,>=0.11.0->transformers) (4.5.0)\n", 161 | "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (1.26.14)\n", 162 | "Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (4.0.0)\n", 163 | "Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2.10)\n", 164 | "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests->transformers) (2022.12.7)\n", 165 | "Installing collected packages: tokenizers, huggingface-hub, transformers\n", 166 | "Successfully installed huggingface-hub-0.13.1 tokenizers-0.13.2 transformers-4.26.1\n", 167 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 168 | "Collecting sentencepiece\n", 169 | " Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n", 170 | "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m16.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", 171 | "\u001b[?25hInstalling collected packages: sentencepiece\n", 172 | "Successfully installed sentencepiece-0.1.97\n", 173 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 174 | "Collecting rouge\n", 175 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 176 | "Requirement already satisfied: six in /usr/local/lib/python3.9/dist-packages (from rouge) (1.15.0)\n", 177 | "Installing collected packages: rouge\n", 178 | "Successfully installed rouge-1.0.1\n", 179 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 180 | "Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.7)\n", 181 | "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages (from nltk) (2022.6.2)\n", 182 | "Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nltk) (1.2.0)\n", 183 | "Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk) (4.65.0)\n", 184 | "Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nltk) (8.1.3)\n" 185 | ] 186 | }, 187 | { 188 | "output_type": "stream", 189 | "name": "stderr", 190 | "text": [ 191 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 192 | "[nltk_data] Package punkt is already up-to-date!\n" 193 | ] 194 | } 195 | ] 196 | }, 197 | { 198 | "cell_type": "code", 199 | "source": [ 200 | "def pagerank(A, eps=0.0001, d=0.85):\n", 201 | " n = A.shape[0]\n", 202 | " P = np.ones(n) / 
n\n", 203 | " A_norm = A / A.sum(axis=0, keepdims=True) # normalize A\n", 204 | " while True:\n", 205 | " new_P = (1 - d) / n + d * A_norm.T.dot(P)\n", 206 | " delta = abs(new_P - P).sum()\n", 207 | " if delta <= eps:\n", 208 | " return new_P\n", 209 | " P = new_P" 210 | ], 211 | "metadata": { 212 | "id": "rijAh5PHEtRG" 213 | }, 214 | "execution_count": 4, 215 | "outputs": [] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "source": [ 220 | "def textrank(text, n=3):\n", 221 | " sentences = sent_tokenize(text)\n", 222 | " vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')\n", 223 | " X = vectorizer.fit_transform(sentences)\n", 224 | " A = X.dot(X.T).toarray()\n", 225 | " P = pagerank(A)\n", 226 | " idx = P.argsort()[-n:]\n", 227 | " return [sentences[i] for i in idx]" 228 | ], 229 | "metadata": { 230 | "id": "9yq1DyvWE0IX" 231 | }, 232 | "execution_count": 5, 233 | "outputs": [] 234 | }, 235 | { 236 | "cell_type": "code", 237 | "source": [ 238 | "# Example usage\n", 239 | "text = \"\"\"India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 240 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"\"\" \n", 241 | "\n", 242 | "summary = textrank(text)\n", 243 | "print(summary)" 244 | ], 245 | "metadata": { 246 | "colab": { 247 | "base_uri": "https://localhost:8080/" 248 | }, 249 | "id": "7NQkJGoQE2g9", 250 | "outputId": "a8f757b2-7964-4639-e8ed-3cf06adea02c" 251 | }, 252 | "execution_count": 6, 253 | "outputs": [ 254 | { 255 | "output_type": "stream", 256 | "name": "stdout", 257 | "text": [ 258 | "[\"The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\", 'The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.', \"In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people.\"]\n" 259 | ] 260 | } 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "source": [ 266 | "def listToString(s):\n", 267 | "    \n", 268 | "    # initialize an empty string\n", 269 | "    str1 = \"\"\n", 270 | "    \n", 271 | "    # append each summary sentence followed by a space, so words do not run together\n", 272 | "    for ele in s:\n", 273 | "        str1 += ele + \" \"\n", 274 | "    \n", 275 | "    # return the joined summary string, without the trailing space\n", 276 | "    return str1.strip()" 277 | ], 278 | "metadata": { 279 | "id": "VRb4fu2JBg0E" 280 | }, 281 | "execution_count": 7, 282 | "outputs": [] 283 | }, 284 | { 285 | "cell_type": "code", 286 | "source": [ 287 | "summ = listToString(summary)" 288 | ], 289 | "metadata": { 290 | "id": "d4wd3J5hBuri" 291 | }, 292 | "execution_count": 8, 293 | "outputs": [] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "source": [ 298 | "rouge = Rouge()\n", 299 | "scores = rouge.get_scores(summ, text)\n", 300 | "print(\"ROUGE Score:\")\n", 301 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 302 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 303 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 304 | ], 305 | "metadata": { 306 | "colab": { 307 | "base_uri": "https://localhost:8080/" 308 | }, 309 | "id": "mJoQYwYcJsmd", 310 | "outputId": "3580f821-0843-4b58-ce9c-f546b706c919" 311 | }, 312 | "execution_count": 9, 313 | "outputs": [ 314 | { 315 | "output_type": "stream", 316 | "name": "stdout", 317 | "text": [ 318 | "ROUGE Score:\n", 319 | "Precision: 1.000\n", 320 | "Recall: 0.503\n", 321 | "F1-Score: 0.670\n" 322 | ] 323 | } 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "source": [ 329 | "from nltk.translate.bleu_score import sentence_bleu\n", 330 | "\n", 331 | "def summary_to_sentences(summ):\n", 332 | "    # Split the summary into sentences using the '.' 
character as a separator\n", 333 | " sentences = summ.split('.')\n", 334 | " \n", 335 | " # Convert each sentence into a list of words\n", 336 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 337 | " \n", 338 | " return sentence_lists\n", 339 | "\n", 340 | "def paragraph_to_wordlist(paragraph):\n", 341 | " # Split the paragraph into words using whitespace as a separator\n", 342 | " words = paragraph.split()\n", 343 | " return words\n", 344 | "\n", 345 | "reference_paragraph = text\n", 346 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 347 | "predicted_paragraph = summ\n", 348 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 349 | "\n", 350 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 351 | "print(score)" 352 | ], 353 | "metadata": { 354 | "colab": { 355 | "base_uri": "https://localhost:8080/" 356 | }, 357 | "id": "WLNp6NT8K43A", 358 | "outputId": "12bf11f3-35d6-42c5-e493-689d8c7590e9" 359 | }, 360 | "execution_count": 10, 361 | "outputs": [ 362 | { 363 | "output_type": "stream", 364 | "name": "stdout", 365 | "text": [ 366 | "0.9088741852620328\n" 367 | ] 368 | } 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "source": [ 374 | "print(\"BLEU Score: {:.3f}\".format(score))" 375 | ], 376 | "metadata": { 377 | "id": "lo9aK4jLLDIm", 378 | "colab": { 379 | "base_uri": "https://localhost:8080/" 380 | }, 381 | "outputId": "c0a0e12c-315d-4172-9901-d87224d448e8" 382 | }, 383 | "execution_count": 11, 384 | "outputs": [ 385 | { 386 | "output_type": "stream", 387 | "name": "stdout", 388 | "text": [ 389 | "BLEU Score: 0.909\n" 390 | ] 391 | } 392 | ] 393 | } 394 | ] 395 | } -------------------------------------------------------------------------------- /Basic to Advance Text Summarisation Models/TextRank.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "papermill": { 7 | "duration": 0.022087, 8 | "end_time": "2023-02-06T20:15:13.557824", 9 | "exception": false, 10 | "start_time": "2023-02-06T20:15:13.535737", 11 | "status": "completed" 12 | }, 13 | "tags": [], 14 | "id": "HKcUzs3gq4nq" 15 | }, 16 | "source": [ 17 | "# TextRank\n", 18 | "**TextRank** algorithm has its own advantages and disadvantages. Here are some of the pros and cons:\n", 19 | "\n", 20 | "### Pros:\n", 21 | "\n", 22 | "* Automatic: Text summarization using TextRank is an automatic process that does not require human intervention. It can summarize large amounts of text in a very short period of time.\n", 23 | "\n", 24 | "* Unbiased: TextRank algorithm is unbiased and does not take into account the author's opinion or perspective while summarizing the text. It summarizes the text based on the frequency of the most important keywords.\n", 25 | "\n", 26 | "* Saves time: Text summarization using TextRank saves time and effort. It can quickly provide a summary of the main points of a large text without having to read the entire document.\n", 27 | "\n", 28 | "* Consistency: TextRank algorithm provides consistent summaries every time. The algorithm uses a fixed set of rules to summarize the text and does not get influenced by external factors.\n", 29 | "\n", 30 | "* Customizable: TextRank algorithm can be customized to suit specific needs. 
The algorithm can be modified to prioritize certain keywords or phrases to provide a more targeted summary.\n", 31 | "\n", 32 | "### Cons:\n", 33 | "\n", 34 | "* Limited context: TextRank algorithm focuses on the most important keywords and may miss out on important context that is not captured by those keywords.\n", 35 | "\n", 36 | "* Limited accuracy: TextRank algorithm may not provide accurate summaries if the text is poorly written or has grammatical errors.\n", 37 | "\n", 38 | "* Limited understanding: TextRank algorithm lacks human-like understanding of the text. It may not understand the nuances of language, sarcasm, or irony, which can affect the accuracy of the summary.\n", 39 | "\n", 40 | "* Limited coverage: TextRank algorithm may not be able to summarize all types of text. It is more effective for summarizing factual texts such as news articles or scientific papers.\n", 41 | "\n", 42 | "* Limited creativity: TextRank algorithm cannot provide creative summaries that are outside the scope of the text. It can only summarize what is already present in the text.\n", 43 | "\n", 44 | "These are the scores we achieved:\n", 45 | "\n", 46 | " ROUGE Score:\n", 47 | " Precision: 1.000\n", 48 | " Recall: 0.414\n", 49 | " F1-Score: 0.586\n", 50 | "\n", 51 | " BLEU Score: 0.694\n", 52 | "\n", 53 | "## References\n", 54 | "Here are a few research papers on text summarization using TextRank:\n", 55 | "\n", 56 | "1. \"TextRank: Bringing Order into Texts\" by Rada Mihalcea and Paul Tarau (2004)\n", 57 | "This paper introduced the TextRank algorithm, which is a graph-based ranking algorithm for text summarization. The authors applied TextRank to several datasets and demonstrated its effectiveness in producing high-quality summaries.\n", 58 | "\n", 59 | "2. \"A Comparative Study of Text Summarization Techniques\" by G. Pandey and P. Pal (2007)\n", 60 | "This paper compares various text summarization techniques, including TextRank, and evaluates their effectiveness on different types of datasets. The authors found that TextRank outperformed other techniques in terms of precision and recall.\n", 61 | "\n", 62 | "3. \"An Improved TextRank Algorithm for Text Summarization\" by X. Wu et al. (2018)\n", 63 | "This paper proposes an improved version of TextRank for text summarization that takes into account sentence length and position in the text. The authors evaluated the effectiveness of the improved TextRank on several datasets and found that it outperformed the original TextRank algorithm.\n", 64 | "\n", 65 | "4. \"Text Summarization Using TextRank and Latent Semantic Analysis\" by K. Murthy et al. (2020)\n", 66 | "This paper combines TextRank with Latent Semantic Analysis (LSA) for text summarization and evaluates its effectiveness on several datasets. 
The authors found that the combination of TextRank and LSA produced higher-quality summaries than either technique alone.\n", 67 | "\n", 68 | "\n", 69 | "\n", 70 | "\n", 71 | "\n", 72 | "\n", 73 | "\n", 74 | " \n", 75 | "\n", 76 | " \n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 1, 82 | "metadata": { 83 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 84 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", 85 | "execution": { 86 | "iopub.execute_input": "2023-02-06T20:15:13.646002Z", 87 | "iopub.status.busy": "2023-02-06T20:15:13.645180Z", 88 | "iopub.status.idle": "2023-02-06T20:15:15.323507Z", 89 | "shell.execute_reply": "2023-02-06T20:15:15.322719Z", 90 | "shell.execute_reply.started": "2023-02-06T18:58:15.631794Z" 91 | }, 92 | "papermill": { 93 | "duration": 1.702261, 94 | "end_time": "2023-02-06T20:15:15.323789", 95 | "exception": false, 96 | "start_time": "2023-02-06T20:15:13.621528", 97 | "status": "completed" 98 | }, 99 | "tags": [], 100 | "id": "9XgO2uq3q4nv" 101 | }, 102 | "outputs": [], 103 | "source": [ 104 | "import nltk\n", 105 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 106 | "from nltk.corpus import stopwords\n", 107 | "from string import punctuation\n", 108 | "from collections import defaultdict\n" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "source": [ 114 | "import nltk\n", 115 | "nltk.download('punkt')\n", 116 | "nltk.download('stopwords')" 117 | ], 118 | "metadata": { 119 | "id": "TvCtAPTlumfw", 120 | "outputId": "306b7f92-e3f6-466d-a6b4-e5d8c474f777", 121 | "colab": { 122 | "base_uri": "https://localhost:8080/" 123 | } 124 | }, 125 | "execution_count": 2, 126 | "outputs": [ 127 | { 128 | "output_type": "stream", 129 | "name": "stderr", 130 | "text": [ 131 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 132 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n", 133 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 134 | "[nltk_data] Unzipping corpora/stopwords.zip.\n" 135 | ] 136 | }, 137 | { 138 | "output_type": "execute_result", 139 | "data": { 140 | "text/plain": [ 141 | "True" 142 | ] 143 | }, 144 | "metadata": {}, 145 | "execution_count": 2 146 | } 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "source": [ 152 | "def calculate_similarity(s1, s2):\n", 153 | " \"\"\"\n", 154 | " Calculates the similarity between two sentences based on the overlap of their words.\n", 155 | " \"\"\"\n", 156 | " s1 = set(s1)\n", 157 | " s2 = set(s2)\n", 158 | " overlap = len(s1.intersection(s2))\n", 159 | " return overlap / (len(s1) + len(s2))\n", 160 | "\n", 161 | "def summarize(text, num_sentences=3):\n", 162 | " \"\"\"\n", 163 | " Summarizes the given text using the TextRank algorithm.\n", 164 | " \"\"\"\n", 165 | " # Tokenize the text into sentences and words\n", 166 | " sentences = sent_tokenize(text)\n", 167 | " words = [word_tokenize(sentence.lower()) for sentence in sentences]\n", 168 | "\n", 169 | " # Remove stopwords and punctuation\n", 170 | " stop_words = set(stopwords.words('english') + list(punctuation))\n", 171 | " filtered_words = [[word for word in sentence if word not in stop_words] for sentence in words]\n", 172 | "\n", 173 | " # Create a dictionary to hold the word frequencies\n", 174 | " word_freq = defaultdict(int)\n", 175 | " for sentence in filtered_words:\n", 176 | " for word in sentence:\n", 177 | " word_freq[word] += 1\n", 178 | "\n", 179 | " # Calculate the sentence scores based on word frequencies and similarity\n", 180 | " sentence_scores = 
defaultdict(int)\n", 181 | " for i, sentence in enumerate(filtered_words):\n", 182 | " for word in sentence:\n", 183 | " sentence_scores[i] += word_freq[word] / sum(word_freq.values())\n", 184 | " for i, sentence in enumerate(filtered_words):\n", 185 | " for j, other_sentence in enumerate(filtered_words):\n", 186 | " if i == j:\n", 187 | " continue\n", 188 | " similarity = calculate_similarity(sentence, other_sentence)\n", 189 | " sentence_scores[i] += similarity\n", 190 | "\n", 191 | " # Sort the sentences by score and select the top ones\n", 192 | " top_sentences = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)[:num_sentences]\n", 193 | " top_sentences = [sentences[i] for i, score in top_sentences]\n", 194 | "\n", 195 | " # Combine the top sentences into a summary\n", 196 | " summary = ' '.join(top_sentences)\n", 197 | "\n", 198 | " return summary" 199 | ], 200 | "metadata": { 201 | "id": "Qmi5sZmDt5Ls" 202 | }, 203 | "execution_count": 3, 204 | "outputs": [] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "source": [ 209 | "article = \"\"\"\n", 210 | "India's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. The move is expected to cover an additional 270 million people, making it one of the largest vaccination drives in the world.The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19 (NEGVAC), which recommended the expansion of the vaccination program. The NEGVAC also suggested that private hospitals may be allowed to administer the vaccine, although the details of this are yet to be finalized.India began its vaccination drive in mid-January, starting with healthcare and frontline workers. Since then, over 13 million doses have been administered across the country. However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India. The Health Ministry has also urged eligible individuals to come forward and get vaccinated at the earliest.India has reported over 11 million cases of COVID-19, making it the second-worst affected country in the world after the United States. The country's daily case count has been declining in recent weeks, but experts have warned that the pandemic is far from over and that precautions need to be maintained.\n", 211 | "In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. 
The decision was taken after a meeting of the National Expert Group on Vaccine Administration for COVID-19, and is a major step towards achieving herd immunity and controlling the spread of the virus in India.\n", 212 | "\"\"\"" 213 | ], 214 | "metadata": { 215 | "id": "v2bxfHSFrmUU" 216 | }, 217 | "execution_count": 4, 218 | "outputs": [] 219 | }, 220 | { 221 | "cell_type": "code", 222 | "execution_count": 5, 223 | "metadata": { 224 | "execution": { 225 | "iopub.execute_input": "2023-02-06T20:15:39.724629Z", 226 | "iopub.status.busy": "2023-02-06T20:15:39.723903Z", 227 | "iopub.status.idle": "2023-02-06T20:15:39.731190Z", 228 | "shell.execute_reply": "2023-02-06T20:15:39.731760Z", 229 | "shell.execute_reply.started": "2023-02-06T18:59:22.375306Z" 230 | }, 231 | "papermill": { 232 | "duration": 0.045246, 233 | "end_time": "2023-02-06T20:15:39.731964", 234 | "exception": false, 235 | "start_time": "2023-02-06T20:15:39.686718", 236 | "status": "completed" 237 | }, 238 | "tags": [], 239 | "id": "we8FZmJMq4n9", 240 | "outputId": "70e72ea7-8509-447c-e0c0-81ce489a7f72", 241 | "colab": { 242 | "base_uri": "https://localhost:8080/" 243 | } 244 | }, 245 | "outputs": [ 246 | { 247 | "output_type": "stream", 248 | "name": "stdout", 249 | "text": [ 250 | "The Actual length of the article is : 1981\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "print (\"The Actual length of the article is : \", len(article))" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "source": [ 261 | "# Generating the summary\n", 262 | "summary = summarize(article, num_sentences=3)" 263 | ], 264 | "metadata": { 265 | "id": "kn1nScvls-K-" 266 | }, 267 | "execution_count": 6, 268 | "outputs": [] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 7, 273 | "metadata": { 274 | "execution": { 275 | "iopub.execute_input": "2023-02-06T20:15:39.802483Z", 276 | "iopub.status.busy": "2023-02-06T20:15:39.801651Z", 277 | "iopub.status.idle": "2023-02-06T20:15:39.808370Z", 278 | "shell.execute_reply": "2023-02-06T20:15:39.808970Z", 279 | "shell.execute_reply.started": "2023-02-06T19:00:00.882637Z" 280 | }, 281 | "papermill": { 282 | "duration": 0.043779, 283 | "end_time": "2023-02-06T20:15:39.809176", 284 | "exception": false, 285 | "start_time": "2023-02-06T20:15:39.765397", 286 | "status": "completed" 287 | }, 288 | "tags": [], 289 | "id": "vlyLXUBjq4n-", 290 | "outputId": "a3cc7aed-cfbf-4261-b594-27d5525ab127", 291 | "colab": { 292 | "base_uri": "https://localhost:8080/", 293 | "height": 122 294 | } 295 | }, 296 | "outputs": [ 297 | { 298 | "output_type": "stream", 299 | "name": "stdout", 300 | "text": [ 301 | "The length of the summarized article is : 736\n" 302 | ] 303 | }, 304 | { 305 | "output_type": "execute_result", 306 | "data": { 307 | "text/plain": [ 308 | "\"In summary, India's Health Ministry has announced that the country's COVID-19 vaccination drive will be expanded to include people over 60 and those over 45 with co-morbidities, covering an additional 270 million people. \\nIndia's Health Ministry has announced that the country's COVID-19 vaccination drive will now be expanded to include people over the age of 60 and those over 45 with co-morbidities. 
However, the pace of the vaccination drive has been slower than expected, with concerns raised over vaccine hesitancy and logistical challenges.The expansion of the vaccination drive to include the elderly and those with co-morbidities is a major step towards achieving herd immunity and controlling the spread of the virus in India.\"" 309 | ], 310 | "application/vnd.google.colaboratory.intrinsic+json": { 311 | "type": "string" 312 | } 313 | }, 314 | "metadata": {}, 315 | "execution_count": 7 316 | } 317 | ], 318 | "source": [ 319 | "print (\"The length of the summarized article is : \", len(summary))\n", 320 | "summary" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 8, 326 | "metadata": { 327 | "execution": { 328 | "iopub.execute_input": "2023-02-06T20:15:39.882470Z", 329 | "iopub.status.busy": "2023-02-06T20:15:39.881754Z", 330 | "iopub.status.idle": "2023-02-06T20:15:48.492936Z", 331 | "shell.execute_reply": "2023-02-06T20:15:48.492308Z", 332 | "shell.execute_reply.started": "2023-02-06T19:00:12.779459Z" 333 | }, 334 | "papermill": { 335 | "duration": 8.649153, 336 | "end_time": "2023-02-06T20:15:48.493115", 337 | "exception": false, 338 | "start_time": "2023-02-06T20:15:39.843962", 339 | "status": "completed" 340 | }, 341 | "tags": [], 342 | "id": "d-K12ns1q4n-", 343 | "outputId": "0d1bf0c3-0a0b-44dc-ad0a-1d0571f88ed5", 344 | "colab": { 345 | "base_uri": "https://localhost:8080/" 346 | } 347 | }, 348 | "outputs": [ 349 | { 350 | "output_type": "stream", 351 | "name": "stdout", 352 | "text": [ 353 | "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n", 354 | "Collecting rouge\n", 355 | " Downloading rouge-1.0.1-py3-none-any.whl (13 kB)\n", 356 | "Requirement already satisfied: six in /usr/local/lib/python3.9/dist-packages (from rouge) (1.16.0)\n", 357 | "Installing collected packages: rouge\n", 358 | "Successfully installed rouge-1.0.1\n" 359 | ] 360 | } 361 | ], 362 | "source": [ 363 | "!pip install rouge" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 9, 369 | "metadata": { 370 | "execution": { 371 | "iopub.execute_input": "2023-02-06T20:15:48.681423Z", 372 | "iopub.status.busy": "2023-02-06T20:15:48.680280Z", 373 | "iopub.status.idle": "2023-02-06T20:15:57.396462Z", 374 | "shell.execute_reply": "2023-02-06T20:15:57.395761Z", 375 | "shell.execute_reply.started": "2023-02-06T19:01:18.088037Z" 376 | }, 377 | "papermill": { 378 | "duration": 8.766006, 379 | "end_time": "2023-02-06T20:15:57.396638", 380 | "exception": false, 381 | "start_time": "2023-02-06T20:15:48.630632", 382 | "status": "completed" 383 | }, 384 | "tags": [], 385 | "id": "brqgJZFsq4n-", 386 | "outputId": "61aa3232-13c4-4279-8353-3a4bf6fc172e", 387 | "colab": { 388 | "base_uri": "https://localhost:8080/" 389 | } 390 | }, 391 | "outputs": [ 392 | { 393 | "output_type": "stream", 394 | "name": "stdout", 395 | "text": [ 396 | "ROUGE Score:\n", 397 | "Precision: 1.000\n", 398 | "Recall: 0.414\n", 399 | "F1-Score: 0.586\n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | "from rouge import Rouge\n", 405 | "rouge = Rouge()\n", 406 | "scores = rouge.get_scores(summary, article)\n", 407 | "print(\"ROUGE Score:\")\n", 408 | "print(\"Precision: {:.3f}\".format(scores[0]['rouge-1']['p']))\n", 409 | "print(\"Recall: {:.3f}\".format(scores[0]['rouge-1']['r']))\n", 410 | "print(\"F1-Score: {:.3f}\".format(scores[0]['rouge-1']['f']))" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 10, 416 | 
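A note on the ROUGE figures printed above (Precision: 1.000 here, and likewise in the earlier notebook): perfect precision is expected rather than remarkable for an extractive summarizer. Every sentence of the summary is copied verbatim from the article, so every summary unigram also occurs in the reference text, ROUGE-1 precision saturates at 1.0, and recall simply reflects how much of the article the selected sentences cover. A minimal sketch of the computation, assuming plain whitespace tokenization (the rouge package's internal tokenizer differs in detail):

    from collections import Counter

    def rouge1(hypothesis, reference):
        # Unigram overlap between the candidate summary and the reference text.
        hyp = Counter(hypothesis.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum(min(count, ref[word]) for word, count in hyp.items())
        precision = overlap / max(sum(hyp.values()), 1)  # 1.0 when every summary token appears in the reference
        recall = overlap / max(sum(ref.values()), 1)     # share of the reference covered by the summary
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return precision, recall, f1

    # Toy check with an extractive "summary" of a longer "reference":
    print(rouge1("the cat sat", "the cat sat on the mat"))  # (1.0, 0.5, ~0.667)

This also explains why the F1 score tracks summary length in these notebooks: a longer extract raises recall while precision stays pinned at 1.0.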
"metadata": { 417 | "execution": { 418 | "iopub.execute_input": "2023-02-06T20:16:08.478265Z", 419 | "iopub.status.busy": "2023-02-06T20:16:08.457169Z", 420 | "iopub.status.idle": "2023-02-06T20:16:09.461662Z", 421 | "shell.execute_reply": "2023-02-06T20:16:09.460888Z", 422 | "shell.execute_reply.started": "2023-02-06T19:01:58.064978Z" 423 | }, 424 | "papermill": { 425 | "duration": 1.061696, 426 | "end_time": "2023-02-06T20:16:09.461865", 427 | "exception": false, 428 | "start_time": "2023-02-06T20:16:08.400169", 429 | "status": "completed" 430 | }, 431 | "tags": [], 432 | "id": "4RyzItToq4n_", 433 | "outputId": "e5a091e8-c2b0-4e4f-d4bd-4249e49870b5", 434 | "colab": { 435 | "base_uri": "https://localhost:8080/" 436 | } 437 | }, 438 | "outputs": [ 439 | { 440 | "output_type": "stream", 441 | "name": "stdout", 442 | "text": [ 443 | "BLEU Score: 0.694\n" 444 | ] 445 | } 446 | ], 447 | "source": [ 448 | "from nltk.translate.bleu_score import sentence_bleu\n", 449 | "\n", 450 | "def summary_to_sentences(summary):\n", 451 | " # Split the summary into sentences using the '.' character as a separator\n", 452 | " sentences = summary.split('.')\n", 453 | " \n", 454 | " # Convert each sentence into a list of words\n", 455 | " sentence_lists = [sentence.split() for sentence in sentences]\n", 456 | " \n", 457 | " return sentence_lists\n", 458 | "\n", 459 | "def paragraph_to_wordlist(paragraph):\n", 460 | " # Split the paragraph into words using whitespace as a separator\n", 461 | " words = paragraph.split()\n", 462 | " return words\n", 463 | "\n", 464 | "reference_paragraph = article\n", 465 | "reference_summary = summary_to_sentences(reference_paragraph)\n", 466 | "predicted_paragraph = summary\n", 467 | "predicted_summary = paragraph_to_wordlist(predicted_paragraph)\n", 468 | "\n", 469 | "\n", 470 | "\n", 471 | "score = sentence_bleu(reference_summary, predicted_summary)\n", 472 | "print(\"BLEU Score: {:.3f}\".format(score))" 473 | ] 474 | } 475 | ], 476 | "metadata": { 477 | "kernelspec": { 478 | "display_name": "Python 3", 479 | "language": "python", 480 | "name": "python3" 481 | }, 482 | "language_info": { 483 | "codemirror_mode": { 484 | "name": "ipython", 485 | "version": 3 486 | }, 487 | "file_extension": ".py", 488 | "mimetype": "text/x-python", 489 | "name": "python", 490 | "nbconvert_exporter": "python", 491 | "pygments_lexer": "ipython3", 492 | "version": "3.7.9" 493 | }, 494 | "papermill": { 495 | "default_parameters": {}, 496 | "duration": 63.141359, 497 | "end_time": "2023-02-06T20:16:10.903364", 498 | "environment_variables": {}, 499 | "exception": null, 500 | "input_path": "__notebook__.ipynb", 501 | "output_path": "__notebook__.ipynb", 502 | "parameters": {}, 503 | "start_time": "2023-02-06T20:15:07.762005", 504 | "version": "2.2.2" 505 | }, 506 | "colab": { 507 | "provenance": [] 508 | } 509 | }, 510 | "nbformat": 4, 511 | "nbformat_minor": 0 512 | } --------------------------------------------------------------------------------