├── README.md ├── SumBasic.ipynb ├── Text_Rank_.ipynb └── luhn_sum.py /README.md: -------------------------------------------------------------------------------- 1 | # Extractive Text Summarization 2 | 3 | Tried out the following algorithms for extractive summarization; each approach is sketched briefly after the list.
4 | * [TextRank](Text_Rank_.ipynb)
5 | * [SumBasic](SumBasic.ipynb)
6 | * [Luhn's Summarization](luhn_sum.py)
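
## How the algorithms work

* **TextRank** builds a graph whose nodes are sentences, connects two sentences when the cosine similarity of their averaged GloVe word vectors exceeds a threshold, ranks the nodes with a PageRank-style iteration, and keeps the top-ranked sentences.
* **SumBasic** estimates a probability for each word from its frequency, repeatedly picks the best sentence containing the currently most probable word, then squares the probabilities of the words it just used so that later picks favour new content.
* **Luhn's method** scores each sentence by the density of frequent, non-stopword "significant" words it contains and keeps the highest-scoring sentences.

The snippet below is a minimal, self-contained sketch of the SumBasic selection loop, not the exact notebook code: it assumes plain whitespace tokenization and skips the lemmatization and stop-word removal the notebook performs (the `sumbasic` function name and the toy sentences are only for illustration).

```python
from collections import Counter

def sumbasic(sentences, n_sentences=2):
    """Greedy SumBasic selection over pre-split sentences (simplified sketch)."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(word for words in tokenized for word in words)
    total = sum(counts.values())
    prob = {word: count / total for word, count in counts.items()}

    chosen = []
    while len(chosen) < min(n_sentences, len(sentences)):
        remaining = [i for i in range(len(sentences)) if i not in chosen]
        # Words that still appear in at least one unchosen sentence.
        available = {word for i in remaining for word in tokenized[i]}
        if not available:  # only empty sentences are left
            break
        # 1. Pick the currently most probable available word.
        top_word = max(available, key=lambda word: prob[word])
        # 2. Among unchosen sentences containing that word, take the one with
        #    the highest average word probability.
        best = max((i for i in remaining if top_word in tokenized[i]),
                   key=lambda i: sum(prob[w] for w in tokenized[i]) / len(tokenized[i]))
        chosen.append(best)
        # 3. Square the probabilities of the words just used, so the next pick
        #    favours sentences that add new information.
        for word in set(tokenized[best]):
            prob[word] **= 2
    return [sentences[i] for i in sorted(chosen)]

print(sumbasic([
    "Microsoft announced Intelligent Cloud Hub to build an AI-ready workforce.",
    "The program will support around 100 institutions with AI infrastructure.",
    "Microsoft announced the program to support institutions with AI infrastructure.",
], n_sentences=2))
```

The notebook version additionally lemmatizes the text and drops stop words before scoring, while the TextRank notebook derives sentence similarity from pre-trained GloVe embeddings and `luhn_sum.py` restricts scoring to the most frequent words in the article.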
7 | 8 | 9 | ## Sample Results 10 | 11 | ### **Document :** 12 | 13 | ```"In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, 'With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow.' The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills."``` 14 | 15 | ### Outputs: 16 | 17 | #### TextRank: 18 | 19 | ```In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. 
The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, 'With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset."``` 20 | 21 | #### SumBasic: 22 | 23 | ``` 24 | Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public.As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. 25 | The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry.Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. 26 | ``` 27 | 28 | #### Luhn's Summarization: 29 | 30 | ```The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, 'With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. 
As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses.``` 31 | 32 | 33 | ## Built With 34 | 35 | * [nltk](https://www.nltk.org/) 36 | * [numpy](https://numpy.org/) 37 | 38 | 39 | ## Acknowledgments 40 | 41 | * Research paper on SumBasic Text Summarization https://www.cs.bgu.ac.il/~elhadad/nlp09/sumbasic.pdf 42 | * A youtube video which explains PageRank Algorithm https://www.youtube.com/watch?v=P8Kt6Abq_rM 43 | * Research paper on TextRank Algorithm https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf 44 | * Masa Nekic's talk on Automatic Text Summarization https://www.youtube.com/watch?v=_d0OXm0dRZ4 45 | -------------------------------------------------------------------------------- /SumBasic.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "SumBasic.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyOOnfjNIXLmlSVIZrhvEVHX", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "metadata": { 31 | "id": "z9G5P11zLisE", 32 | "colab_type": "code", 33 | "colab": {} 34 | }, 35 | "source": [ 36 | "import numpy as np \n", 37 | "from nltk.corpus import stopwords\n", 38 | "from nltk.stem import WordNetLemmatizer\n", 39 | "from nltk.tokenize import sent_tokenize,word_tokenize\n", 40 | "from bs4 import BeautifulSoup\n", 41 | "import requests\n", 42 | "import re" 43 | ], 44 | "execution_count": 0, 45 | "outputs": [] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "metadata": { 50 | "id": "ZNYXEzqDLrx9", 51 | "colab_type": "code", 52 | "outputId": "d9ca6bd4-98ba-4091-dd43-e3032f6d77f4", 53 | "colab": { 54 | "base_uri": "https://localhost:8080/", 55 | "height": 134 56 | } 57 | }, 58 | "source": [ 59 | "import nltk\n", 60 | "nltk.download('wordnet')\n", 61 | "nltk.download('punkt')\n", 62 | "nltk.download('stopwords')" 63 | ], 64 | "execution_count": 2, 65 | "outputs": [ 66 | { 67 | "output_type": "stream", 68 | "text": [ 69 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", 70 | "[nltk_data] Unzipping corpora/wordnet.zip.\n", 71 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 72 | "[nltk_data] Unzipping tokenizers/punkt.zip.\n", 73 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 74 | "[nltk_data] Unzipping corpora/stopwords.zip.\n" 75 | ], 76 | "name": "stdout" 77 | }, 78 | { 79 | "output_type": "execute_result", 80 | "data": { 81 | "text/plain": [ 82 | "True" 83 | ] 84 | }, 85 | "metadata": { 86 | "tags": [] 87 | }, 88 | "execution_count": 2 89 | } 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "id": "fXyJQD1yLvAf", 96 | "colab_type": "code", 97 | "colab": {} 98 | }, 99 | "source": [ 100 | "def _input(topic):\n", 101 | "\tarticle = \"\"\n", 102 | "\tlink = \"https://en.wikipedia.org/wiki/\" + topic.strip() \n", 103 | "\tpage = requests.get(link)\n", 104 | "\tcontent = BeautifulSoup(page.content,'html.parser')\n", 105 | "\tparagraphs = content.find_all('p')\n", 106 | "\tfor 
paragraph in paragraphs:\n", 107 | "\t\tarticle+= paragraph.text+\" \"\n", 108 | "\treturn article\n" 109 | ], 110 | "execution_count": 0, 111 | "outputs": [] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "metadata": { 116 | "id": "QMlzG6l1L0pK", 117 | "colab_type": "code", 118 | "colab": {} 119 | }, 120 | "source": [ 121 | "def clean(sentences):\n", 122 | "\tlemmatizer = WordNetLemmatizer()\n", 123 | "\tcleaned_sentences = []\n", 124 | "\tfor sentence in sentences:\n", 125 | "\t\tsentence = sentence.lower()\n", 126 | "\t\tsentence = re.sub(r'[^a-zA-Z]',' ',sentence)\n", 127 | "\t\tsentence = sentence.split()\n", 128 | "\t\tsentence = [lemmatizer.lemmatize(word) for word in sentence if word not in set(stopwords.words('english'))]\n", 129 | "\t\tsentence = ' '.join(sentence)\n", 130 | "\t\tcleaned_sentences.append(sentence)\n", 131 | "\treturn cleaned_sentences" 132 | ], 133 | "execution_count": 0, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "metadata": { 139 | "id": "njsyZChlMDQL", 140 | "colab_type": "code", 141 | "colab": {} 142 | }, 143 | "source": [ 144 | "def init_probability(sentences):\n", 145 | "\tprobability_dict = {}\n", 146 | "\twords = word_tokenize('. '.join(sentences))\n", 147 | "\ttotal_words = len(set(words))\n", 148 | "\tfor word in words:\n", 149 | "\t\tif word!='.':\n", 150 | "\t\t\tif not probability_dict.get(word):\n", 151 | "\t\t\t\tprobability_dict[word] = 1\n", 152 | "\t\t\telse:\n", 153 | "\t\t\t\tprobability_dict[word] += 1\n", 154 | "\n", 155 | "\tfor word,count in probability_dict.items():\n", 156 | "\t\tprobability_dict[word] = count/total_words \n", 157 | "\t\n", 158 | "\treturn probability_dict" 159 | ], 160 | "execution_count": 0, 161 | "outputs": [] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "metadata": { 166 | "id": "Gcx847gPMF8S", 167 | "colab_type": "code", 168 | "colab": {} 169 | }, 170 | "source": [ 171 | "def update_probability(probability_dict,word):\n", 172 | "\tif probability_dict.get(word):\n", 173 | "\t\tprobability_dict[word] = probability_dict[word]**2\n", 174 | "\treturn probability_dict" 175 | ], 176 | "execution_count": 0, 177 | "outputs": [] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "metadata": { 182 | "id": "E2BF3MeaMOgU", 183 | "colab_type": "code", 184 | "colab": {} 185 | }, 186 | "source": [ 187 | "def average_sentence_weights(sentences,probability_dict):\n", 188 | "\tsentence_weights = {}\n", 189 | "\tfor index,sentence in enumerate(sentences):\n", 190 | "\t\tif len(sentence) != 0:\n", 191 | "\t\t\taverage_proba = sum([probability_dict[word] for word in sentence if word in probability_dict.keys()])\n", 192 | "\t\t\taverage_proba /= len(sentence)\n", 193 | "\t\t\tsentence_weights[index] = average_proba \n", 194 | "\treturn sentence_weights\n" 195 | ], 196 | "execution_count": 0, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "metadata": { 202 | "id": "nzXqF7hWMQTh", 203 | "colab_type": "code", 204 | "colab": {} 205 | }, 206 | "source": [ 207 | "def generate_summary(sentence_weights,probability_dict,cleaned_article,tokenized_article,summary_length = 30):\n", 208 | "\tsummary = \"\"\n", 209 | "\tcurrent_length = 0\n", 210 | "\twhile current_length < summary_length :\n", 211 | "\t\thighest_probability_word = max(probability_dict,key=probability_dict.get)\n", 212 | "\t\tsentences_with_max_word= [index for index,sentence in enumerate(cleaned_article) if highest_probability_word in set(word_tokenize(sentence))]\n", 213 | "\t\tsentence_list = 
sorted([[index,sentence_weights[index]] for index in sentences_with_max_word],key=lambda x:x[1],reverse=True)\n", 214 | "\t\tsummary += tokenized_article[sentence_list[0][0]] + \"\\n\"\n", 215 | "\t\tfor word in word_tokenize(cleaned_article[sentence_list[0][0]]):\n", 216 | "\t\t\tprobability_dict = update_probability(probability_dict,word)\n", 217 | "\t\tcurrent_length+=1\n", 218 | "\treturn summary" 219 | ], 220 | "execution_count": 0, 221 | "outputs": [] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "metadata": { 226 | "id": "fp_63zzsMS0b", 227 | "colab_type": "code", 228 | "colab": {} 229 | }, 230 | "source": [ 231 | "def main():\n", 232 | "\ttopic = input(\"Enter the title of the wikipedia article to be scraped----->\")\n", 233 | "\tarticle = _input(topic)\n", 234 | "\trequired_length = int(input(\"Enter the number of required sentences\"))\n", 235 | "\ttokenized_article = sent_tokenize(article)\n", 236 | "\tcleaned_article = clean(tokenized_article) \n", 237 | "\tprobability_dict = init_probability(cleaned_article)\n", 238 | "\tsentence_weights = average_sentence_weights(cleaned_article,probability_dict)\n", 239 | "\tsummary = generate_summary(sentence_weights,probability_dict,cleaned_article,tokenized_article,required_length)\n", 240 | "\tprint(summary)" 241 | ], 242 | "execution_count": 0, 243 | "outputs": [] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "metadata": { 248 | "id": "MBaXuFyiMVv2", 249 | "colab_type": "code", 250 | "outputId": "bc516d62-a86e-4b20-a56f-a725eaa35412", 251 | "colab": { 252 | "base_uri": "https://localhost:8080/", 253 | "height": 185 254 | } 255 | }, 256 | "source": [ 257 | "if __name__ == \"__main__\":\n", 258 | "\tmain()" 259 | ], 260 | "execution_count": 12, 261 | "outputs": [ 262 | { 263 | "output_type": "stream", 264 | "text": [ 265 | "Enter the title of the wikipedia article to be scraped----->Github\n", 266 | "Enter the number of required sentences5\n", 267 | "[93] On January 10, 2015, GitHub was unblocked.\n", 268 | "Previously, only public repositories were free.\n", 269 | "On July 5, 2009, GitHub announced that the site was being harnessed by over 100,000 users.\n", 270 | "[5] Free GitHub accounts are commonly used to host open source projects.\n", 271 | "Those services include:\n", 272 | " Site: https://github.community/\n", 273 | " GitHub maintains a community forum where users can ask questions publicly or answer questions of other users.\n", 274 | "\n" 275 | ], 276 | "name": "stdout" 277 | } 278 | ] 279 | }, 280 | { 281 | "cell_type": "code", 282 | "metadata": { 283 | "id": "LPuiaLUiMYkc", 284 | "colab_type": "code", 285 | "colab": {} 286 | }, 287 | "source": [ 288 | "" 289 | ], 290 | "execution_count": 0, 291 | "outputs": [] 292 | } 293 | ] 294 | } -------------------------------------------------------------------------------- /Text_Rank_.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Text_Rank .ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true, 10 | "mount_file_id": "1wDoCJ6_t9zR-F3FSm41ilYrOblnf5ioD", 11 | "authorship_tag": "ABX9TyOO2xe/oy3dQt47khu4BEWx", 12 | "include_colab_link": true 13 | }, 14 | "kernelspec": { 15 | "name": "python3", 16 | "display_name": "Python 3" 17 | } 18 | }, 19 | "cells": [ 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "id": "view-in-github", 24 | "colab_type": "text" 25 | }, 26 | "source": [ 27 | "\"Open" 28 | ] 29 | }, 
30 | { 31 | "cell_type": "code", 32 | "metadata": { 33 | "id": "qBKp0xYVwZGn", 34 | "colab_type": "code", 35 | "colab": {} 36 | }, 37 | "source": [ 38 | "import numpy as np\n", 39 | "import pandas as pd\n", 40 | "import gensim\n", 41 | "import nltk\n", 42 | "from nltk.corpus import stopwords\n", 43 | "from nltk.tokenize import word_tokenize,sent_tokenize" 44 | ], 45 | "execution_count": 0, 46 | "outputs": [] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "metadata": { 51 | "id": "7GkeGFkJORpN", 52 | "colab_type": "code", 53 | "outputId": "933beec8-8c7e-4a91-e920-0980eb0ccd83", 54 | "colab": { 55 | "base_uri": "https://localhost:8080/", 56 | "height": 134 57 | } 58 | }, 59 | "source": [ 60 | "nltk.download('punkt')\n", 61 | "nltk.download('stopwords')\n", 62 | "nltk.download('wordnet')" 63 | ], 64 | "execution_count": 55, 65 | "outputs": [ 66 | { 67 | "output_type": "stream", 68 | "text": [ 69 | "[nltk_data] Downloading package punkt to /root/nltk_data...\n", 70 | "[nltk_data] Package punkt is already up-to-date!\n", 71 | "[nltk_data] Downloading package stopwords to /root/nltk_data...\n", 72 | "[nltk_data] Package stopwords is already up-to-date!\n", 73 | "[nltk_data] Downloading package wordnet to /root/nltk_data...\n", 74 | "[nltk_data] Package wordnet is already up-to-date!\n" 75 | ], 76 | "name": "stdout" 77 | }, 78 | { 79 | "output_type": "execute_result", 80 | "data": { 81 | "text/plain": [ 82 | "True" 83 | ] 84 | }, 85 | "metadata": { 86 | "tags": [] 87 | }, 88 | "execution_count": 55 89 | } 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "metadata": { 95 | "id": "1v6JIBzRzwyt", 96 | "colab_type": "code", 97 | "outputId": "74f39990-464c-45b0-e730-16e0e99ba2a3", 98 | "colab": { 99 | "base_uri": "https://localhost:8080/", 100 | "height": 34 101 | } 102 | }, 103 | "source": [ 104 | "from google.colab import drive\n", 105 | "drive.mount('/content/drive')" 106 | ], 107 | "execution_count": 56, 108 | "outputs": [ 109 | { 110 | "output_type": "stream", 111 | "text": [ 112 | "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" 113 | ], 114 | "name": "stdout" 115 | } 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "metadata": { 121 | "id": "o_DdDijmOeXZ", 122 | "colab_type": "code", 123 | "colab": {} 124 | }, 125 | "source": [ 126 | "import pickle\n", 127 | "\n", 128 | "path = '/content/drive/My Drive/Embeddings/glove.840B.300d.pkl'\n", 129 | "\n", 130 | "with open(path,'rb') as f:\n", 131 | " embeddings = pickle.load(f)" 132 | ], 133 | "execution_count": 0, 134 | "outputs": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "metadata": { 139 | "id": "6yUeROgrOhBC", 140 | "colab_type": "code", 141 | "colab": {} 142 | }, 143 | "source": [ 144 | "from nltk.stem import WordNetLemmatizer\n", 145 | "import re\n", 146 | "lem = WordNetLemmatizer()" 147 | ], 148 | "execution_count": 0, 149 | "outputs": [] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "metadata": { 154 | "id": "OG0BdFI2OljD", 155 | "colab_type": "code", 156 | "colab": {} 157 | }, 158 | "source": [ 159 | "def clean(sentence):\n", 160 | " sentence = sentence.lower()\n", 161 | " sentence = re.sub(r'http\\S+',' ',sentence)\n", 162 | " sentence = re.sub(r'[^a-zA-Z]',' ',sentence)\n", 163 | " sentence = sentence.split()\n", 164 | " sentence = [lem.lemmatize(word) for word in sentence if word not in stopwords.words('english')]\n", 165 | " sentence = ' '.join(sentence)\n", 166 | " return sentence" 167 | ], 168 | "execution_count": 0, 169 | 
"outputs": [] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "metadata": { 174 | "id": "iyQNRZw3Ota1", 175 | "colab_type": "code", 176 | "colab": {} 177 | }, 178 | "source": [ 179 | "def average_vector(sentence):\n", 180 | " words = sentence.split()\n", 181 | " size = len(words)\n", 182 | " average_vector = np.zeros((size,300))\n", 183 | " unknown_words=[]\n", 184 | "\n", 185 | " for index,word in enumerate(words):\n", 186 | " try: \n", 187 | " average_vector[index] = embeddings[word].reshape(1,-1)\n", 188 | " except Exception as e:\n", 189 | " unknown_words.append(word)\n", 190 | " average_vector[index] = 0\n", 191 | "\n", 192 | " if size!=0:\n", 193 | " average_vector = sum(average_vector)/size\n", 194 | " return average_vector,unknown_words" 195 | ], 196 | "execution_count": 0, 197 | "outputs": [] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "metadata": { 202 | "id": "WIqJTmiuOwnZ", 203 | "colab_type": "code", 204 | "colab": {} 205 | }, 206 | "source": [ 207 | "def cosine_similarity(vector_1,vector_2):\n", 208 | " cosine_similarity = 0\n", 209 | " try:\n", 210 | " cosine_similarity = (np.dot(vector_1,vector_2)/(np.linalg.norm(vector_1)*np.linalg.norm(vector_2)))\n", 211 | " except Exception as e :\n", 212 | " # print(\"Exception occured\",str(e))\n", 213 | " pass\n", 214 | " return cosine_similarity" 215 | ], 216 | "execution_count": 0, 217 | "outputs": [] 218 | }, 219 | { 220 | "cell_type": "code", 221 | "metadata": { 222 | "id": "xw4v2G6NOzTV", 223 | "colab_type": "code", 224 | "colab": {} 225 | }, 226 | "source": [ 227 | "def find_similarity(string1,string2):\n", 228 | " # string1,string2 = clean(string1),clean(string2)\n", 229 | " vector1,unknown_words1 = average_vector(string1)\n", 230 | " vector2,unknown_words2 = average_vector(string2)\n", 231 | " similarity = cosine_similarity(vector1,vector2)\n", 232 | " return similarity" 233 | ], 234 | "execution_count": 0, 235 | "outputs": [] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "metadata": { 240 | "id": "yvt1rweuO4H4", 241 | "colab_type": "code", 242 | "colab": { 243 | "base_uri": "https://localhost:8080/", 244 | "height": 34 245 | }, 246 | "outputId": "d5785ea1-6abe-4eb0-d547-c89912693f22" 247 | }, 248 | "source": [ 249 | "from bs4 import BeautifulSoup\n", 250 | "import requests\n", 251 | "\n", 252 | "subject = input(\"Enter the wikipedia topic to be summarised\")\n", 253 | "base_url = \"https://en.wikipedia.org/wiki/\"+subject\n", 254 | "page = requests.get(base_url)\n", 255 | "\n", 256 | "soup = BeautifulSoup(page.content,'html.parser')\n", 257 | "paragraphs = soup.find_all('p')\n", 258 | "\n", 259 | "content=\"\"\n", 260 | "for paragraph in paragraphs:\n", 261 | " content+=paragraph.text\n" 262 | ], 263 | "execution_count": 63, 264 | "outputs": [ 265 | { 266 | "output_type": "stream", 267 | "text": [ 268 | "Enter the wikipedia topic to be summariseddeep learning\n" 269 | ], 270 | "name": "stdout" 271 | } 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "metadata": { 277 | "id": "vGL38jR_PmBn", 278 | "colab_type": "code", 279 | "colab": {} 280 | }, 281 | "source": [ 282 | "sentences = sent_tokenize(content)" 283 | ], 284 | "execution_count": 0, 285 | "outputs": [] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "metadata": { 290 | "id": "PDzNoXa_P6h3", 291 | "colab_type": "code", 292 | "colab": {} 293 | }, 294 | "source": [ 295 | "cleaned_sentences=[]\n", 296 | "for sentence in sentences:\n", 297 | " cleaned_sentences.append(clean(sentence))" 298 | ], 299 | "execution_count": 0, 300 | "outputs": [] 301 | 
}, 302 | { 303 | "cell_type": "code", 304 | "metadata": { 305 | "id": "lkekWLWdQFQK", 306 | "colab_type": "code", 307 | "colab": {} 308 | }, 309 | "source": [ 310 | "similarity_matrix = np.zeros((len(cleaned_sentences),len(cleaned_sentences)))\n", 311 | "\n", 312 | "for i in range(0,len(cleaned_sentences)):\n", 313 | " for j in range(0,len(cleaned_sentences)):\n", 314 | " if type(find_similarity(cleaned_sentences[i],cleaned_sentences[j])) == np.float64 :\n", 315 | " similarity_matrix[i,j] = find_similarity(cleaned_sentences[i],cleaned_sentences[j])" 316 | ], 317 | "execution_count": 0, 318 | "outputs": [] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "metadata": { 323 | "id": "rH_H_Q0SUjDc", 324 | "colab_type": "code", 325 | "outputId": "942d7b1f-d248-4faf-bcc5-0ff56120c2e4", 326 | "colab": { 327 | "base_uri": "https://localhost:8080/", 328 | "height": 235 329 | } 330 | }, 331 | "source": [ 332 | "similarity_matrix" 333 | ], 334 | "execution_count": 68, 335 | "outputs": [ 336 | { 337 | "output_type": "execute_result", 338 | "data": { 339 | "text/plain": [ 340 | "array([[1. , 0.59638508, 0.92437967, ..., 0.73430471, 0.53050314,\n", 341 | " 0. ],\n", 342 | " [0.59638508, 1. , 0.51069377, ..., 0.37824843, 0.48714121,\n", 343 | " 0. ],\n", 344 | " [0.92437967, 0.51069377, 1. , ..., 0.81988919, 0.5482137 ,\n", 345 | " 0. ],\n", 346 | " ...,\n", 347 | " [0.73430471, 0.37824843, 0.81988919, ..., 1. , 0.58826502,\n", 348 | " 0. ],\n", 349 | " [0.53050314, 0.48714121, 0.5482137 , ..., 0.58826502, 1. ,\n", 350 | " 0. ],\n", 351 | " [0. , 0. , 0. , ..., 0. , 0. ,\n", 352 | " 0. ]])" 353 | ] 354 | }, 355 | "metadata": { 356 | "tags": [] 357 | }, 358 | "execution_count": 68 359 | } 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "metadata": { 365 | "id": "KP_WPKa2TwB7", 366 | "colab_type": "code", 367 | "colab": {} 368 | }, 369 | "source": [ 370 | "class Graph:\n", 371 | " \n", 372 | " def __init__(self,graph_dictionary):\n", 373 | " if not graph_dictionary:\n", 374 | " graph_dictionary={}\n", 375 | " self.graph_dictionary = graph_dictionary\n", 376 | " \n", 377 | " def vertices(self):\n", 378 | " return self.graph_dictionary.keys()\n", 379 | " \n", 380 | " def edges(self):\n", 381 | " return self.generate_edges()\n", 382 | "\n", 383 | " def add_vertex(self,vertex):\n", 384 | " if vertex not in graph_dictionary.keys():\n", 385 | " graph_dictionary[vertex] = []\n", 386 | " \n", 387 | " def add_edge(self,edge):\n", 388 | " vertex1,vertex2 = tuple(set(edge))\n", 389 | " if vertex1 in graph_dictionary.keys():\n", 390 | " graph_dictionary[vertex1].append(vertex2)\n", 391 | " else:\n", 392 | " graph_dictionary[vertex1] = [vertex2]\n", 393 | "\n", 394 | " def generate_edges(self):\n", 395 | " edges = set()\n", 396 | " for vertex in graph_dictionary.keys():\n", 397 | " for edges in graph_dictionary[vertex]:\n", 398 | " edges.add([vertex,edge])\n", 399 | " return list(edges)" 400 | ], 401 | "execution_count": 0, 402 | "outputs": [] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "metadata": { 407 | "id": "FmD2SVC4ZbIE", 408 | "colab_type": "code", 409 | "colab": {} 410 | }, 411 | "source": [ 412 | "similarity_threshold = 0.70\n", 413 | "network_dictionary = {}\n", 414 | "\n", 415 | "for i in range(len(cleaned_sentences)):\n", 416 | " network_dictionary[i] = [] \n", 417 | "\n", 418 | "for i in range(len(cleaned_sentences)):\n", 419 | " for j in range(len(cleaned_sentences)):\n", 420 | " if similarity_matrix[i][j] > similarity_threshold:\n", 421 | " if j not in network_dictionary[i]:\n", 422 | " 
network_dictionary[i].append(j)\n", 423 | " if i not in network_dictionary[j]:\n", 424 | " network_dictionary[j].append(i)\n" 425 | ], 426 | "execution_count": 0, 427 | "outputs": [] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "metadata": { 432 | "id": "3B-TQfbya-pN", 433 | "colab_type": "code", 434 | "outputId": "c70eb0bd-c912-4bc8-ce90-ac925d726af5", 435 | "colab": { 436 | "base_uri": "https://localhost:8080/", 437 | "height": 235 438 | } 439 | }, 440 | "source": [ 441 | "similarity_matrix" 442 | ], 443 | "execution_count": 71, 444 | "outputs": [ 445 | { 446 | "output_type": "execute_result", 447 | "data": { 448 | "text/plain": [ 449 | "array([[1. , 0.59638508, 0.92437967, ..., 0.73430471, 0.53050314,\n", 450 | " 0. ],\n", 451 | " [0.59638508, 1. , 0.51069377, ..., 0.37824843, 0.48714121,\n", 452 | " 0. ],\n", 453 | " [0.92437967, 0.51069377, 1. , ..., 0.81988919, 0.5482137 ,\n", 454 | " 0. ],\n", 455 | " ...,\n", 456 | " [0.73430471, 0.37824843, 0.81988919, ..., 1. , 0.58826502,\n", 457 | " 0. ],\n", 458 | " [0.53050314, 0.48714121, 0.5482137 , ..., 0.58826502, 1. ,\n", 459 | " 0. ],\n", 460 | " [0. , 0. , 0. , ..., 0. , 0. ,\n", 461 | " 0. ]])" 462 | ] 463 | }, 464 | "metadata": { 465 | "tags": [] 466 | }, 467 | "execution_count": 71 468 | } 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "metadata": { 474 | "id": "3_bswL7laqpS", 475 | "colab_type": "code", 476 | "colab": {} 477 | }, 478 | "source": [ 479 | "graph = Graph(network_dictionary)" 480 | ], 481 | "execution_count": 0, 482 | "outputs": [] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "metadata": { 487 | "id": "v92awKu6b-0v", 488 | "colab_type": "code", 489 | "colab": {} 490 | }, 491 | "source": [ 492 | "def page_rank(graph,iterations = 50,sentences=20):\n", 493 | " ranks = []\n", 494 | " # ranks = {}\n", 495 | " network = graph.graph_dictionary\n", 496 | " current_ranks = np.squeeze(np.zeros((1,len(cleaned_sentences))))\n", 497 | " prev_ranks = np.array([1/len(cleaned_sentences)]*len(cleaned_sentences))\n", 498 | " for iteration in range(0,iterations):\n", 499 | " for i in range(0,len(list(network.keys()))):\n", 500 | " current_score = 0\n", 501 | " adjacent_vertices = network[list(network.keys())[i]]\n", 502 | " for vertex in adjacent_vertices:\n", 503 | " current_score += prev_ranks[vertex]/len(network[vertex])\n", 504 | " current_ranks[i] = current_score\n", 505 | " prev_ranks = current_ranks\n", 506 | " \n", 507 | " for index in range(len(cleaned_sentences)):\n", 508 | " # ranks[index] = prev_ranks[index]\n", 509 | " if prev_ranks[index]: \n", 510 | " ranks.append((index,prev_ranks[index]))\n", 511 | " # ranks = {index:rank for index,rank in sorted(ranks.items(),key=ranks.get,reverse=True)}[:sentences]\n", 512 | " ranks = sorted(ranks,key = lambda x:x[1],reverse=True)[:sentences]\n", 513 | "\n", 514 | " return ranks\n", 515 | "\n", 516 | "ranks = page_rank(graph,iterations=1000)" 517 | ], 518 | "execution_count": 0, 519 | "outputs": [] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "metadata": { 524 | "id": "tSX5ULHlvNkx", 525 | "colab_type": "code", 526 | "colab": {} 527 | }, 528 | "source": [ 529 | "summary = \"\"\n", 530 | "for index,rank in ranks:\n", 531 | " summary+=sentences[index]+\" \"" 532 | ], 533 | "execution_count": 0, 534 | "outputs": [] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "metadata": { 539 | "id": "qpNeEqVfvmbX", 540 | "colab_type": "code", 541 | "outputId": "902a53af-0135-4210-acb7-fdff6f8a9b69", 542 | "colab": { 543 | "base_uri": "https://localhost:8080/", 544 | 
"height": 54 545 | } 546 | }, 547 | "source": [ 548 | "summary" 549 | ], 550 | "execution_count": 77, 551 | "outputs": [ 552 | { 553 | "output_type": "execute_result", 554 | "data": { 555 | "text/plain": [ 556 | "'[206]\\nIn further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained[207] demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian\\'s[208] website. [75] However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems. [170]\\nDeep learning has been shown to produce competitive results in medical application such as cancer cell classification, lesion detection, organ segmentation and image enhancement[171][172]\\nFinding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and analyzed before a target segment can be created and used in ad serving by any ad server. As Mühlhoff argues, involvement of human users to generate training and verification data is so typical for most commercial end-user applications of Deep Learning that such systems may be referred to as \"human-aided artificial intelligence\". [1][2][3]\\nDeep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance. [65][77][75][80]\\nIn 2010, researchers extended deep learning from TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by decision trees. These developmental models share the property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) support the self-organization somewhat analogous to the neural networks utilized in deep learning models. [65][76] The nature of the recognition errors produced by the two types of systems was characteristically different,[77][74] offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems. The 2009 NIPS Workshop on Deep Learning for Speech Recognition[74] was motivated by the limitations of deep generative models of speech, and the possibility that given more capable hardware and large-scale data sets that deep neural nets (DNN) might become practical. 
[1]\\nFor supervised learning tasks, deep learning methods eliminate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation. The training process can be guaranteed to converge in one step with a new batch of data, and the computational complexity of the training algorithm is linear with respect to the number of neurons involved. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. [45][46]\\nSimpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of artificial neural network\\'s (ANN) computational cost and a lack of understanding of how the brain wires its biological networks. Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information. [209] These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar[212] decompositions of observed entities and events. Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as a range of large-vocabulary speech recognition tasks have steadily improved. [99] In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced, following a similar trend in large-scale speech recognition. For example, the computations performed by deep learning units could be similar to those of actual neurons[188][189] and neural populations. [55]\\nMany aspects of speech recognition were taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997. [108] That way the algorithm can make certain parameters more influential, until it determines the correct mathematical manipulation to fully process the data. 
'" 557 | ] 558 | }, 559 | "metadata": { 560 | "tags": [] 561 | }, 562 | "execution_count": 77 563 | } 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "metadata": { 569 | "id": "CGcXlNEw1UWF", 570 | "colab_type": "code", 571 | "colab": {} 572 | }, 573 | "source": [ 574 | "" 575 | ], 576 | "execution_count": 0, 577 | "outputs": [] 578 | } 579 | ] 580 | } -------------------------------------------------------------------------------- /luhn_sum.py: -------------------------------------------------------------------------------- 1 | from bs4 import BeautifulSoup 2 | import requests 3 | from nltk.corpus import stopwords 4 | from nltk.stem import WordNetLemmatizer 5 | from nltk.tokenize import word_tokenize,sent_tokenize 6 | import re 7 | 8 | 9 | word_limit = 300 10 | 11 | 12 | def get_content(topic): 13 | base_url = "https://en.wikipedia.org/wiki/"+topic 14 | page = requests.get(base_url) 15 | soup = BeautifulSoup(page.content , 'html.parser') 16 | paragraphs = soup.find_all('p') 17 | content = "" 18 | for para in paragraphs: 19 | content+=para.text 20 | return content 21 | 22 | 23 | def clean(article): 24 | lem = WordNetLemmatizer() 25 | article = re.sub(r'\[[^\]]*\]','',article) 26 | article = sent_tokenize(article) 27 | cleaned_list=[] 28 | for sent in article: 29 | sent = sent.lower() 30 | word_list = [] 31 | words = word_tokenize(sent) 32 | for word in words: 33 | word_list.append(lem.lemmatize(word.lower())) 34 | cleaned_list.append(' '.join(word_list)) 35 | return cleaned_list 36 | 37 | def get_frequency_dictionary(content): 38 | frequency = {} 39 | for sentence in content: 40 | word_list = word_tokenize(sentence) 41 | for word in word_list: 42 | if word not in set(stopwords.words('english')).union({',','.',';','%',')','(','``'}): 43 | if frequency.get(word) is None: 44 | frequency[word]=1 45 | else: 46 | frequency[word]+=1 47 | return frequency 48 | 49 | def get_score(content,frequency_dictionary): 50 | sentence_score={} 51 | for sentence in content: 52 | score=0 53 | word_list = word_tokenize(sentence) 54 | start_idx,end_idx = -1,len(word_list)+1 55 | index_list=[] 56 | for word in word_list: 57 | if word not in set(stopwords.words('english')).union({',','.',';','%',')','(','``'}) and word in frequency_dictionary.keys(): 58 | index_list.append(word_list.index(word)) 59 | if index_list: 60 | score = len(index_list)**2/(max(index_list) - min(index_list)) 61 | sentence_score[content.index(sentence)] = score 62 | return sentence_score 63 | 64 | 65 | def get_summary(sentence_scores,content,threshold): 66 | summary = "" 67 | sentence_indexes = sorted(sentence_scores,key=sentence_scores.get,reverse=True)[:threshold-1] 68 | for index in sentence_indexes: 69 | summary+=content[index]+" " 70 | return summary 71 | 72 | def main(): 73 | topic_name = input("Enter the topic name for wikipedia article") 74 | content = get_content(topic_name) 75 | cleaned_content = clean(content) 76 | threshold = len(cleaned_content)//40 77 | frequency_dictionary = get_frequency_dictionary(cleaned_content) 78 | sorted_dictionary = {key: frequency_dictionary[key] for key in sorted(frequency_dictionary,key=frequency_dictionary.get,reverse=True)[:word_limit]} 79 | sentence_scores = get_score(cleaned_content,sorted_dictionary) 80 | summary = get_summary(sentence_scores,sent_tokenize(content),threshold) 81 | print(summary) 82 | 83 | if __name__ == "__main__": 84 | main() 85 | --------------------------------------------------------------------------------