├── LICENSE
├── README.md
├── resources
│   └── images
│       ├── CLR.png
│       ├── CLR_graph.jpg
│       ├── SS1.png
│       ├── SS2.png
│       ├── archi_diag.png
│       ├── conceptnet-numberbatch.png
│       └── model.jpg
├── text_summarisation-v2.ipynb
├── text_summarisation-v3_enc-dec.ipynb
├── text_summarisation-v4.ipynb
├── text_summarisation-v5.ipynb
└── text_summarisation.ipynb
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Nikhil Gupta
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Text-Summarization Using Deep Learning
2 | Jupyter notebooks for text summarization using Deep Learning techniques
3 |
4 | #### -- Project Status: Active
5 |
6 | ## Introduction
7 | The purpose of this project is to build a model for abstractive text summarization, starting from an RNN encoder-decoder as the baseline. From there, we examine how effective different attention mechanisms are for abstractive summarization. Unlike extractive methods, these models must first truly understand the document and then re-express that understanding concisely, possibly using new words and phrases. We use an encoder-decoder recurrent neural network with LSTM units and attention to generate a summary from a given text.
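The attention step described above can be sketched in plain NumPy: at each decoder step, score every encoder hidden state against the current decoder state, softmax the scores, and take the weighted sum as the context vector. This is a minimal illustration of dot-product attention, not the exact notebook code; the names `encoder_states` and `decoder_state` are hypothetical.

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    """Dot-product attention: weight each encoder state by its
    (softmax-normalised) similarity to the decoder state, then
    return the weights and the weighted sum (context vector)."""
    scores = encoder_states @ decoder_state       # one score per time step
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states            # weighted sum of encoder states
    return weights, context

# Toy example: 4 encoder time steps, hidden size 3
enc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
dec = np.array([1.0, 0.0, 0.0])
w, ctx = attention_context(enc, dec)
```

The decoder then consumes the context vector together with its own state when predicting the next summary word.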
8 |
9 | ### Methods Used
10 | * Word embeddings (GloVe and ConceptNet Numberbatch)
11 | * Encoder-decoder using an RNN (Recurrent Neural Network) with attention
12 |
13 | ### Technologies
14 | * Python
15 | * Keras Library
16 | * TensorFlow
17 | * Jupyter
19 |
20 | ## Description
21 | In this project we used a [sample dataset of news articles (CNN, Daily Mail)](https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail). We are currently facing a problem implementing the pointer-generator network.
22 |
23 | ## Architecture
24 | 
25 |
26 | ## Learning Rate Configuration
27 | 
28 |
29 | `CyclicLR(mode='triangular2', base_lr= 0.001, max_lr= 0.2,
30 | step_size= (len(padded_sorted_texts)*0.9/BATCH_SIZE) * 2)`
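For reference, the `triangular2` policy used above can be written as a plain function following Leslie Smith's cyclical-learning-rate formulation (an illustrative re-implementation, not the `CyclicLR` callback itself): the rate oscillates between `base_lr` and `max_lr` over `2 * step_size` iterations, with the amplitude halved after every full cycle.

```python
import math

def triangular2_lr(iteration, base_lr=0.001, max_lr=0.2, step_size=2000):
    """Cyclical learning rate, 'triangular2' mode: oscillate between
    base_lr and max_lr over 2*step_size iterations, halving the
    triangle's amplitude after each completed cycle."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)   # position in cycle, in [0, 1]
    scale = 1 / (2 ** (cycle - 1))                   # halve amplitude each cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) * scale

lr0 = triangular2_lr(0)          # start of training: base_lr
lr_peak = triangular2_lr(2000)   # middle of the first cycle: max_lr
```

A callback would evaluate this function at every batch and assign the result to the optimizer's learning rate.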
31 |
32 | ## Word Embeddings
33 | 
34 |
35 | * ConceptNet Numberbatch word embeddings were used to encode word meanings
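Using pre-trained vectors typically means building an embedding matrix indexed by the vocabulary, with rows for out-of-vocabulary words left as zeros. Below is a generic sketch of that step (the helper name and the toy vocabulary are illustrative, not taken from the notebooks):

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Map each vocabulary word to its pre-trained vector.
    word_index: word -> integer id (1-based; 0 is reserved for padding).
    vectors: word -> list of floats. Unknown words keep a zero row."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix

# Toy vocabulary and 3-dimensional "pre-trained" vectors
word_index = {"news": 1, "story": 2, "qwerty": 3}
vectors = {"news": [0.1, 0.2, 0.3], "story": [0.4, 0.5, 0.6]}
emb = build_embedding_matrix(word_index, vectors, dim=3)
```

The resulting matrix is what gets passed as the (frozen or trainable) weights of the model's embedding layer.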
36 |
37 | ## Getting Started
38 |
39 | 1. Clone this repo (for help see this [tutorial](https://help.github.com/articles/cloning-a-repository/)).
40 | 2. Raw data is kept in local storage at ~/Text-Summarization/Original_data/cnn/stories
41 |
42 | 3. Data processing/transformation scripts are kept in the Jupyter notebooks in the repository root (e.g. [text_summarisation.ipynb](text_summarisation.ipynb))
43 | 4. Installation steps:
44 | ```
45 | git clone https://github.com/nikhilcss97/Text-Summarization.git
46 | cd Text-Summarization
47 | pip install tensorflow keras numpy pandas jupyter
48 | ```
51 |
52 | ## Featured Notebooks/Analysis/Deliverables
53 | * [text_summarisation.ipynb](text_summarisation.ipynb)
54 | * [text_summarisation-v2.ipynb](text_summarisation-v2.ipynb)
55 | * [text_summarisation-v3_enc-dec.ipynb](text_summarisation-v3_enc-dec.ipynb)
56 | * [text_summarisation-v4.ipynb](text_summarisation-v4.ipynb)
57 | * [text_summarisation-v5.ipynb](text_summarisation-v5.ipynb)
57 |
58 |
59 | ## Contributing Team Members
60 |
61 | **Team Lead (Contacts) : [Nikhil Gupta](https://github.com/nikhilcss97)**
62 |
63 | #### Other Members:
64 |
65 | [Blair Fernandes](https://github.com/blair49), Asjad Baig
66 |
67 | ## Contact
68 | * Feel free to contact me at nikhil.css97@gmail.com with any questions or if you are interested in contributing!
69 |
--------------------------------------------------------------------------------
/resources/images/CLR.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/CLR.png
--------------------------------------------------------------------------------
/resources/images/CLR_graph.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/CLR_graph.jpg
--------------------------------------------------------------------------------
/resources/images/SS1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/SS1.png
--------------------------------------------------------------------------------
/resources/images/SS2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/SS2.png
--------------------------------------------------------------------------------
/resources/images/archi_diag.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/archi_diag.png
--------------------------------------------------------------------------------
/resources/images/conceptnet-numberbatch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/conceptnet-numberbatch.png
--------------------------------------------------------------------------------
/resources/images/model.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nikhilcss97/Text-Summarization/eb6c8c736de68475c5e5b1d06243c77f50fbdffa/resources/images/model.jpg
--------------------------------------------------------------------------------
/text_summarisation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Imports"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# os, string and pickle are part of the Python standard library; no pip install needed\n",
17 | "from os import listdir\n",
18 | "import string\n",
19 | "from pickle import dump,load"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {},
25 | "source": [
26 | "## Loading the Data"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 2,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "class LoadData:\n",
36 | " def __init__(self, directory):\n",
37 | " self.directory= directory\n",
38 | " \n",
39 | " def load_stories(self):\n",
40 | " \"\"\"\n",
41 | " Load the data and store it in a list of dictionaries\n",
42 | " \n",
43 | " \"\"\"\n",
44 | " all_stories= list()\n",
45 | " \n",
46 | " def load_doc(filename):\n",
47 | " \"\"\"\n",
48 | " Return the data from a given filename\n",
49 | " \"\"\"\n",
50 | " file = open(filename, encoding='utf-8')\n",
51 | " text = file.read()\n",
52 | " file.close()\n",
53 | " return text\n",
54 | " \n",
55 | " def split_story(doc):\n",
56 | " \"\"\"\n",
57 | " Split story from summaries based on the separator -> \"@highlight\"\n",
58 | " \"\"\"\n",
59 | " index = doc.find('@highlight')\n",
60 | " story, highlights = doc[:index], doc[index:].split('@highlight')\n",
61 | " highlights = [h.strip() for h in highlights if len(h) > 0]\n",
62 | " return story, highlights\n",
63 | " \n",
64 | " list_of_files= listdir(self.directory)\n",
65 | " for name in list_of_files[:1000]:\n",
66 | " filename = self.directory + '/' + name\n",
67 | " doc = load_doc(filename)\n",
68 | " story, highlights= split_story(doc)\n",
69 | " all_stories.append({'story': story, 'highlights': highlights})\n",
70 | " \n",
71 | " return all_stories"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 3,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "DIR_PATH= \"/home/nikhil/Downloads/cnn/stories\"\n",
81 | "obj= LoadData(DIR_PATH)\n",
82 | "stories= obj.load_stories()"
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": 4,
88 | "metadata": {},
89 | "outputs": [
90 | {
91 | "data": {
92 | "text/plain": [
93 | "1000"
94 | ]
95 | },
96 | "execution_count": 4,
97 | "metadata": {},
98 | "output_type": "execute_result"
99 | }
100 | ],
101 | "source": [
102 | "len(stories)"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 5,
108 | "metadata": {},
109 | "outputs": [
110 | {
111 | "name": "stdout",
112 | "output_type": "stream",
113 | "text": [
114 | "['Shirley Sotloff pleads directly to the leader of ISIS', '\"Please release my child,\" she says', 'Steven Sotloff disappeared while reporting in Syria last year']\n",
115 | "\n",
116 | "A mother's plea to the terrorists holding her son hostage: No individual should be punished for events he cannot control.\n",
117 | "\n",
118 | "The mother is Shirley Sotloff, and she speaks directly to ISIS leader Abu Bakr al-Baghdadi in a video broadcast Wednesday on Al Arabiya Network.\n",
119 | "\n",
120 | "Her son, freelance journalist Steven Sotloff, appeared last week in an ISIS video showing the decapitation of American journalist James Foley.\n",
121 | "\n",
122 | "The militant in the video warns that Steven Sotloff's fate depends on what President Barack Obama does next in Iraq.\n",
123 | "\n",
124 | "A day after the video was posted, Obama vowed that the United States would be \"relentless\" in striking back against ISIS.\n",
125 | "\n",
126 | "\"Steven is a journalist who traveled to the Middle East to cover the suffering of Muslims at the hands of tyrants. Steven is a loyal and generous son, brother and grandson,\" Shirley Sotloff said in the rare public appeal. \"He is an honorable man and has always tried to help the weak.\"\n",
127 | "\n",
128 | "The journalist has no control over what the United States government does, and he should not be held responsible for its actions, she says.\n",
129 | "\n",
130 | "\"He's an innocent journalist,\" she said.\n",
131 | "\n",
132 | "Friends of ISIS captive Sotloff speak out admiringly of his talent, passion\n",
133 | "\n",
134 | "The mother appeals to al-Baghdadi's self-declared title of caliph of the Islamic State.\n",
135 | "\n",
136 | "As caliph, he has the power to grant amnesty to Steven Sotloff, the mother said.\n",
137 | "\n",
138 | "\"I ask you to please release my child,\" she said.\n",
139 | "\n",
140 | "Steven Sotloff disappeared while reporting from Syria in August 2013, but his family kept the news secret, fearing harm to him if they went public.\n",
141 | "\n",
142 | "Out of public view, the family and a number of government agencies have been trying to gain Sotloff's release for the past year.\n",
143 | "\n",
144 | "Sotloff, 31, grew up in South Florida with his mother, father and younger sister. He majored in journalism at the University of Central Florida. His personal Facebook page lists musicians like the Dave Matthews Band, Phish, Miles Davis and movies like \"Lawrence of Arabia\" and \"The Big Lebowski\" as favorites. On his Twitter page, he playfully identifies himself as a \"stand-up philosopher from Miami.\"\n",
145 | "\n",
146 | "In 2004, Sotloff left UCF and moved back to the Miami area.\n",
147 | "\n",
148 | "He graduated from another college, began taking Arabic classes and subsequently picked up freelance writing work for a number of publications, including Time, Foreign Policy, World Affairs and the Christian Science Monitor. His travels took him to Yemen, Saudi Arabia, Qatar and Turkey -- among other countries -- and eventually Syria.\n",
149 | "\n",
150 | "The following is the full text of Shirley Sotloff's video statement:\n",
151 | "\n",
152 | "I'm sending this message to you, Abu Bakr al-Baghdadi ... the caliph of the Islamic State.\n",
153 | "\n",
154 | "I am Shirley Sotloff. My son, Steven, is in your hands. Steven is a journalist who traveled to the Middle East to cover the suffering of Muslims at the hands of tyrants.\n",
155 | "\n",
156 | "Steven is a loyal and generous son, brother and grandson. He's an honorable man and has always tried to help the weak.\n",
157 | "\n",
158 | "We've not seen Stephen for over a year, and we miss him very much. We want to see him home safe and sound and to hug him.\n",
159 | "\n",
160 | "Since Stephen's capture, I've learned a lot about Islam. I've learned that Islam teaches that no individual should be held responsible for the sins of others.\n",
161 | "\n",
162 | "Stephen has no control over the actions of the U.S. government. He's an innocent journalist.\n",
163 | "\n",
164 | "I've always learned that you, the caliph, can grant amnesty. I ask you to please release my child. As a mother, I ask your justice to be merciful and not punish my son for matters he has no control over.\n",
165 | "\n",
166 | "I ask you to use your authority to spare his life and to set the example of the Prophet Mohammed, who protected people of the Book.\n",
167 | "\n",
168 | "I want what every mother wants: to live to see her children's children.\n",
169 | "\n",
170 | "I plead with you to grant me this.\n",
171 | "\n",
172 | "Opinion: Foley is a reminder why freelance reporting is so dangerous\n",
173 | "\n",
174 | "\n"
175 | ]
176 | }
177 | ],
178 | "source": [
179 | "print(stories[10]['highlights'])\n",
180 | "print()\n",
181 | "print(stories[10]['story'])"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 6,
187 | "metadata": {},
188 | "outputs": [
189 | {
190 | "data": {
191 | "text/plain": [
192 | "[{'highlights': ['168 police officers and 67 firefighters are laid off in Camden, New Jersey',\n",
193 | " \"City's mayor said she couldn't get $8 million in budget concessions to save jobs\",\n",
194 | " 'The mayor had been asking police and firefighters for concessions',\n",
195 | " 'They were asked to pay more for health care and accept salary freezes or reductions'],\n",
196 | " 'story': '(CNN) -- The mayor of crime-ridden Camden, New Jersey, has announced layoffs of nearly half of the city\\'s police force and close to a third of its fire department.\\n\\nOne hundred sixty-eight police officers and 67 firefighters were laid off Tuesday, as officials struggle to close a $26.5 million budget gap through a series of belt-tightening measures, Mayor Dana Redd told reporters. The layoffs take effect immediately.\\n\\nRedd said she was unable to secure the $8 million in budget concessions that she says she needed to save the jobs of up to 100 police officers and many of the city\\'s firefighters.\\n\\nThe mayor -- who said she will continue negotiations with police and fire unions -- had been asking the workers to pay more for their health care, freeze or reduce their salaries and take furlough days.\\n\\nThe apparent impasse has left administrators of a city with the second-highest crime rate in the nation scrambling to figure out solutions to keep residents safe. Camden is second only to St. Louis, Missouri, in annual rankings of cities based on compilations of FBI crime statistics.\\n\\nSome clerical officers were demoted and reassigned to the streets, the mayor said, pledging that the cuts would not affect public safety.\\n\\n\"We\\'re still going to protect our residents,\" said Robert Corrales, spokesman for Redd. Public safety \"will remain our top concern. We\\'ll shift our resources to be more efficient with what we have.\"\\n\\nBut police and firefighter union officials say the layoffs will most certainly have an impact.\\n\\n\"It\\'s absolutely, physically impossible to cover the same amount of ground in the same amount of time with less people,\" said John Williamson, president of the Fraternal Order of Police union in Camden. \"Response times will be slower.\"\\n\\nOne local business owner, David Brown, said he does not \"understand how you can do more with less.\"\\n\\n\"I don\\'t want to be a pessimist, but I can\\'t be optimistic.\"\\n\\nCamden resident and sanitation worker Gloria Valentin said she is now fearful that the city does not have enough police protection to keep people safe.\\n\\n\"Today is a real sad day in the city of Camden,\" she said.\\n\\n'},\n",
197 | " {'highlights': ['The leaders of Brazil, Russia, India, China and South Africa hold talks Thursday',\n",
198 | " 'They agree to study the idea of setting up their own development bank',\n",
199 | " 'The leaders also call for reform on the International Monetary Fund',\n",
200 | " 'They say the problems in Syria and Iran can only solved through dialogue'],\n",
201 | " 'story': 'New Delhi (CNN) -- The leaders of five of the world\\'s top emerging economies moved closer Thursday toward establishing a development bank that could one day serve as an alternative to the World Bank.\\n\\nThe leaders of Brazil, Russia, India, China and South Africa -- collectively known as the BRICS -- \"agreed to examine in greater detail a proposal to set up a BRICS-led South-South Development Bank, funded and managed by the BRICS and other developing countries,\" said Prime Minister Manmohan Singh of India.\\n\\nThe leaders were meeting in New Delhi on Thursday for their fourth annual summit. Finance ministers from the five countries have been directed to look into the idea of the development bank and report back at the next summit, Singh said.\\n\\nThe leaders also asked the International Monetary Fund to speed up changes in its governance to better represent the developing world as a voting bloc.\\n\\n\"The rapid recovery of the BRICS economies from the financial crisis highlighted their role as growth drivers of the global economy,\" Singh said.\\n\\nTogether, the BRICS nations make up more than 40% of the world population and one-fifth of the global economy.\\n\\nThe group called for peaceful resolutions of the Syria and Iranian crises.\\n\\n\"We agreed that a lasting solution to the problems in Syria and Iran can only be found through dialogue,\" Singh said.\\n\\nBesides the IMF, the leaders also urged reforms of the United Nations and other international bodies for a larger voice for the developing world.\\n\\n\"Institutions of global political and economic governance created more than six decades ago have not kept pace with the changing world. While some progress has been made in international financial institutions, there is lack of movement on the political side,\" Singh said. \"BRICS should speak with one voice on important issues such as the reform of the UN Security Council.\"\\n\\n'}]"
202 | ]
203 | },
204 | "execution_count": 6,
205 | "metadata": {},
206 | "output_type": "execute_result"
207 | }
208 | ],
209 | "source": [
210 | "stories[:2]"
211 | ]
212 | },
213 | {
214 | "cell_type": "markdown",
215 | "metadata": {},
216 | "source": [
217 | "## Data Cleaning"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 8,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "class Clean_data:\n",
227 | " def __init__(self):\n",
228 | " pass\n",
229 | " \n",
230 | " def clean_lines(self, lines):\n",
231 | " cleaned = list()\n",
232 | " table = str.maketrans('', '', string.punctuation)\n",
233 | " \n",
234 | " for line in lines:\n",
235 | " index = line.find('(CNN)')\n",
236 | " if index >= 0:\n",
237 | " line = line[index + len('(CNN)'):]\n",
238 | "\n",
239 | " split_line = line.split()\n",
240 | " \n",
241 | " split_line = [word.lower() for word in split_line]\n",
242 | " split_line = [w.translate(table) for w in split_line]\n",
243 | " \n",
244 | " split_line = [word for word in split_line if word.isalpha()]\n",
245 | " cleaned.append(' '.join(split_line))\n",
246 | " cleaned = [c for c in cleaned if len(c) > 0]\n",
247 | " return cleaned"
248 | ]
249 | },
250 | {
251 | "cell_type": "code",
252 | "execution_count": 12,
253 | "metadata": {},
254 | "outputs": [],
255 | "source": [
256 | "obj1= Clean_data()\n",
257 | "cleaned_stories= list()\n",
258 | "for example in stories[:100]:\n",
259 | " cleaned_stories.append({'story': obj1.clean_lines(example['story'].split('\\n')), 'highlights': obj1.clean_lines(example['highlights'])}) "
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 13,
265 | "metadata": {},
266 | "outputs": [
267 | {
268 | "data": {
269 | "text/plain": [
270 | "{'highlights': ['stepmother claims girls father dismembered body',\n",
271 | " 'a judge releases search warrants from detectives probing zahra bakers death',\n",
272 | " 'zahras remains were found november a month after she was reported missing'],\n",
273 | " 'story': ['the stepmother of zahra baker told police the girl was killed two weeks before she was reported missing according to search warrants released tuesday',\n",
274 | " 'stepmother elisa baker also told police in hickory north carolina that the disabled girls body was disposed of the next day september in various locations according to the documents',\n",
275 | " 'she told police on november that the girls father adam baker dismembered the girl and the couple disposed of the remains',\n",
276 | " 'while elisa baker has been charged with obstruction of justice for writing a fake ransom note and leaving it at the familys hickory home no one has been charged directly in the girls death elisa baker also is accused of writing worthless checks',\n",
277 | " 'police have said she had been cooperating with investigators',\n",
278 | " 'one of the search warrants details an online conversation a web user said she had with adam andor elisa baker regarding their involvement with chainsaw massacre roleplaying',\n",
279 | " 'according to the warrant the date of september was given regarding their virtual family doing a murder with chainsaws elisa bakers former husband also was involved in roleplaying according to the conversation',\n",
280 | " 'catawba superior court judge robert c ervin released the search warrants that provide more insight into the grisly killing of the freckledface girl who lost part of her left leg at age to cancer eleven search warrants were released in december',\n",
281 | " 'police found some of zahra bakers remains on november just over a month after she was reported her missing',\n",
282 | " 'in october adam baker who denied he was involved in zahras disappearance tearfully called for the return of his daughter',\n",
283 | " 'investigators reviewed adam bakers cell phone records and the phones gps locator from september the warrants say according to those records adam bakers phone was not in the areas where zahras remains were found on the day that elisa baker said he disposed of them the warrants say',\n",
284 | " 'elisa bakers cell phone records indicate that her phone was in those areas on that day the warrants say',\n",
285 | " 'according to the warrants elisa baker told adam baker that a man she said was her brother actually was her husband she married adam baker before the divorce police said',\n",
286 | " 'cnn left messages tuesday evening with the attorneys for elisa and adam baker',\n",
287 | " 'in one of the search warrants released in december a tipster told police that zahra had been at a north carolina home with two men and one of the men said he had done something very bad and needed to leave town',\n",
288 | " 'one of the men was associated with zahras stepmother but was not zahras father the tipster said',\n",
289 | " 'the tipster also told police that zahra had been raped by both men and that she had blood on her private area and legs the search warrant said the tipster told police that he got the information about the alleged rape from a friend who was told about it from his sister',\n",
290 | " 'police went to the home to see if they could confirm the fourthhand information and found a mattress at the side of the house that had a large dark stain in the middle the search warrant said',\n",
291 | " 'the tipster said the men did not admit to killing the girl but did say that they might have hit her in the head the search warrant said',\n",
292 | " 'adam baker was arrested in late october in nearby catawba county on eight charges five counts of writing bad checks and three counts of failing to appear in court authorities said the charges were unrelated to zahras disappearance and he was later released on bail',\n",
293 | " 'the disappearance of zahra made international news the girl whose biological parents were both from australia lost part of her left leg at age and lost hearing in both ears while being treated for cancer']}"
294 | ]
295 | },
296 | "execution_count": 13,
297 | "metadata": {},
298 | "output_type": "execute_result"
299 | }
300 | ],
301 | "source": [
302 | "cleaned_stories[60]"
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": 14,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "dump(cleaned_stories, open('/home/nikhil/Downloads/cnn/processed_sample_data/cnn_dataset.pkl', 'wb'))"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 15,
317 | "metadata": {},
318 | "outputs": [
319 | {
320 | "name": "stdout",
321 | "output_type": "stream",
322 | "text": [
323 | "Loaded Stories 100\n"
324 | ]
325 | }
326 | ],
327 | "source": [
328 | "cleaned_stories = load(open('/home/nikhil/Downloads/cnn/processed_sample_data/cnn_dataset.pkl', 'rb'))\n",
329 | "print('Loaded Stories %d' % len(cleaned_stories))"
330 | ]
331 | },
332 | {
333 | "cell_type": "markdown",
334 | "metadata": {},
335 | "source": [
336 | "---"
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {},
342 | "source": [
343 | "## Amazon Fine Food Reviews Dataset"
344 | ]
345 | },
346 | {
347 | "cell_type": "markdown",
348 | "metadata": {},
349 | "source": [
350 | "## Imports"
351 | ]
352 | },
353 | {
354 | "cell_type": "code",
355 | "execution_count": 38,
356 | "metadata": {},
357 | "outputs": [],
358 | "source": [
359 | "import pandas as pd\n",
360 | "import numpy as np\n",
361 | "import re"
362 | ]
363 | },
364 | {
365 | "cell_type": "code",
366 | "execution_count": 39,
367 | "metadata": {},
368 | "outputs": [],
369 | "source": [
370 | "AMAZON_DATA_PATH= '/home/nikhil/Downloads/amazon-fine-food-reviews/Reviews.csv'"
371 | ]
372 | },
373 | {
374 | "cell_type": "code",
375 | "execution_count": 45,
376 | "metadata": {},
377 | "outputs": [],
378 | "source": [
379 | "class Load_amazon_data:\n",
380 | " def __init__(self, dir_path, seed= 0):\n",
381 | " self.dir_path= dir_path\n",
382 | " np.random.seed(seed)\n",
383 | " \n",
384 | " def load(self):\n",
385 | " \"\"\"\n",
386 | " Reads data from the given directory path\n",
387 | " \"\"\"\n",
388 | " return pd.read_csv(self.dir_path)\n",
389 | " \n",
390 | " def drop(self):\n",
391 | " \"\"\"\n",
392 | " Drops unnecessary columns\n",
393 | " \"\"\"\n",
394 | " \n",
395 | " data= self.load()\n",
396 | " \n",
397 | " data = data.dropna()\n",
398 | " data= data.iloc[:, -2:]\n",
399 | " data = data.reset_index(drop= True)\n",
400 | " \n",
401 | " return data\n",
402 | " \n",
403 | " def analyze_data(self):\n",
404 | " \"\"\"\n",
405 | " Prints some sample data points from the cleaned data\n",
406 | " \"\"\"\n",
407 | " data= self.drop()\n",
408 | " \n",
409 | " for sr_no, i in enumerate(np.random.randint(10, 100, size= 5)):\n",
410 | " print(\"_________________________\")\n",
411 | " print(\"Data Point {0}\".format(sr_no + 1))\n",
412 | " print(\"Summary:\")\n",
413 | " print(data['Summary'].iloc[i])\n",
414 | " print(\"Full Text:\")\n",
415 | " print(data['Text'].iloc[i])"
416 | ]
417 | },
418 | {
419 | "cell_type": "code",
420 | "execution_count": 46,
421 | "metadata": {},
422 | "outputs": [],
423 | "source": [
424 | "obj= Load_amazon_data(AMAZON_DATA_PATH, seed= 1)"
425 | ]
426 | },
427 | {
428 | "cell_type": "markdown",
429 | "metadata": {},
430 | "source": [
431 | "## Load the Data"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 47,
437 | "metadata": {},
438 | "outputs": [
439 | {
440 | "data": {
441 | "text/html": [
442 | "
\n",
443 | "\n",
456 | "
\n",
457 | " \n",
458 | " \n",
459 | " | \n",
460 | " Id | \n",
461 | " ProductId | \n",
462 | " UserId | \n",
463 | " ProfileName | \n",
464 | " HelpfulnessNumerator | \n",
465 | " HelpfulnessDenominator | \n",
466 | " Score | \n",
467 | " Time | \n",
468 | " Summary | \n",
469 | " Text | \n",
470 | "
\n",
471 | " \n",
472 | " \n",
473 | " \n",
474 | " 0 | \n",
475 | " 1 | \n",
476 | " B001E4KFG0 | \n",
477 | " A3SGXH7AUHU8GW | \n",
478 | " delmartian | \n",
479 | " 1 | \n",
480 | " 1 | \n",
481 | " 5 | \n",
482 | " 1303862400 | \n",
483 | " Good Quality Dog Food | \n",
484 | " I have bought several of the Vitality canned d... | \n",
485 | "
\n",
486 | " \n",
487 | " 1 | \n",
488 | " 2 | \n",
489 | " B00813GRG4 | \n",
490 | " A1D87F6ZCVE5NK | \n",
491 | " dll pa | \n",
492 | " 0 | \n",
493 | " 0 | \n",
494 | " 1 | \n",
495 | " 1346976000 | \n",
496 | " Not as Advertised | \n",
497 | " Product arrived labeled as Jumbo Salted Peanut... | \n",
498 | "
\n",
499 | " \n",
500 | " 2 | \n",
501 | " 3 | \n",
502 | " B000LQOCH0 | \n",
503 | " ABXLMWJIXXAIN | \n",
504 | " Natalia Corres \"Natalia Corres\" | \n",
505 | " 1 | \n",
506 | " 1 | \n",
507 | " 4 | \n",
508 | " 1219017600 | \n",
509 | " \"Delight\" says it all | \n",
510 | " This is a confection that has been around a fe... | \n",
511 | "
\n",
512 | " \n",
513 | " 3 | \n",
514 | " 4 | \n",
515 | " B000UA0QIQ | \n",
516 | " A395BORC6FGVXV | \n",
517 | " Karl | \n",
518 | " 3 | \n",
519 | " 3 | \n",
520 | " 2 | \n",
521 | " 1307923200 | \n",
522 | " Cough Medicine | \n",
523 | " If you are looking for the secret ingredient i... | \n",
524 | "
\n",
525 | " \n",
526 | " 4 | \n",
527 | " 5 | \n",
528 | " B006K2ZZ7K | \n",
529 | " A1UQRSCLF8GW1T | \n",
530 | " Michael D. Bigham \"M. Wassir\" | \n",
531 | " 0 | \n",
532 | " 0 | \n",
533 | " 5 | \n",
534 | " 1350777600 | \n",
535 | " Great taffy | \n",
536 | " Great taffy at a great price. There was a wid... | \n",
537 | "
\n",
538 | " \n",
539 | "
\n",
540 | "
"
541 | ],
542 | "text/plain": [
543 | " Id ProductId UserId ProfileName \\\n",
544 | "0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian \n",
545 | "1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa \n",
546 | "2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres \"Natalia Corres\" \n",
547 | "3 4 B000UA0QIQ A395BORC6FGVXV Karl \n",
548 | "4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham \"M. Wassir\" \n",
549 | "\n",
550 | " HelpfulnessNumerator HelpfulnessDenominator Score Time \\\n",
551 | "0 1 1 5 1303862400 \n",
552 | "1 0 0 1 1346976000 \n",
553 | "2 1 1 4 1219017600 \n",
554 | "3 3 3 2 1307923200 \n",
555 | "4 0 0 5 1350777600 \n",
556 | "\n",
557 | " Summary Text \n",
558 | "0 Good Quality Dog Food I have bought several of the Vitality canned d... \n",
559 | "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut... \n",
560 | "2 \"Delight\" says it all This is a confection that has been around a fe... \n",
561 | "3 Cough Medicine If you are looking for the secret ingredient i... \n",
562 | "4 Great taffy Great taffy at a great price. There was a wid... "
563 | ]
564 | },
565 | "execution_count": 47,
566 | "metadata": {},
567 | "output_type": "execute_result"
568 | }
569 | ],
570 | "source": [
571 | "data= obj.load()\n",
572 | "data.head()"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 |     "## Dropping Unnecessary Columns"
580 | ]
581 | },
582 | {
583 | "cell_type": "code",
584 | "execution_count": 48,
585 | "metadata": {},
586 | "outputs": [
587 | {
588 | "data": {
589 | "text/html": [
590 | "<div>\n",
591 | "<style scoped>\n",
592 | "    .dataframe tbody tr th:only-of-type {\n",
593 | "        vertical-align: middle;\n",
594 | "    }\n",
595 | "\n",
596 | "    .dataframe tbody tr th {\n",
597 | "        vertical-align: top;\n",
598 | "    }\n",
599 | "\n",
600 | "    .dataframe thead th {\n",
601 | "        text-align: right;\n",
602 | "    }\n",
603 | "</style>\n",
604 | "<table border=\"1\" class=\"dataframe\">\n",
605 | "  <thead>\n",
606 | "    <tr style=\"text-align: right;\">\n",
607 | "      <th></th>\n",
608 | "      <th>Summary</th>\n",
609 | "      <th>Text</th>\n",
610 | "    </tr>\n",
611 | "  </thead>\n",
612 | "  <tbody>\n",
613 | "    <tr>\n",
614 | "      <th>0</th>\n",
615 | "      <td>Good Quality Dog Food</td>\n",
616 | "      <td>I have bought several of the Vitality canned d...</td>\n",
617 | "    </tr>\n",
618 | "    <tr>\n",
619 | "      <th>1</th>\n",
620 | "      <td>Not as Advertised</td>\n",
621 | "      <td>Product arrived labeled as Jumbo Salted Peanut...</td>\n",
622 | "    </tr>\n",
623 | "    <tr>\n",
624 | "      <th>2</th>\n",
625 | "      <td>\"Delight\" says it all</td>\n",
626 | "      <td>This is a confection that has been around a fe...</td>\n",
627 | "    </tr>\n",
628 | "    <tr>\n",
629 | "      <th>3</th>\n",
630 | "      <td>Cough Medicine</td>\n",
631 | "      <td>If you are looking for the secret ingredient i...</td>\n",
632 | "    </tr>\n",
633 | "    <tr>\n",
634 | "      <th>4</th>\n",
635 | "      <td>Great taffy</td>\n",
636 | "      <td>Great taffy at a great price.  There was a wid...</td>\n",
637 | "    </tr>\n",
638 | "  </tbody>\n",
639 | "</table>\n",
640 | "</div>"
641 | ],
642 | "text/plain": [
643 | " Summary Text\n",
644 | "0 Good Quality Dog Food I have bought several of the Vitality canned d...\n",
645 | "1 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...\n",
646 | "2 \"Delight\" says it all This is a confection that has been around a fe...\n",
647 | "3 Cough Medicine If you are looking for the secret ingredient i...\n",
648 | "4 Great taffy Great taffy at a great price. There was a wid..."
649 | ]
650 | },
651 | "execution_count": 48,
652 | "metadata": {},
653 | "output_type": "execute_result"
654 | }
655 | ],
656 | "source": [
657 | "data= obj.drop()\n",
658 | "data.head()"
659 | ]
660 | },
661 | {
662 | "cell_type": "markdown",
663 | "metadata": {},
664 | "source": [
665 | "## Analyze the data"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 49,
671 | "metadata": {},
672 | "outputs": [
673 | {
674 | "name": "stdout",
675 | "output_type": "stream",
676 | "text": [
677 | "_________________________\n",
678 | "Data Point 1\n",
679 | "Summary:\n",
680 | "Mushy\n",
681 | "Full Text:\n",
682 | "The flavors are good. However, I do not see any differce between this and Oaker Oats brand - they are both mushy.\n",
683 | "_________________________\n",
684 | "Data Point 2\n",
685 | "Summary:\n",
686 | "Delicious product!\n",
687 | "Full Text:\n",
688 | "I can remember buying this candy as a kid and the quality hasn't dropped in all these years. Still a superb product you won't be disappointed with.\n",
689 | "_________________________\n",
690 | "Data Point 3\n",
691 | "Summary:\n",
692 | "Forget Molecular Gastronomy - this stuff rockes a coffee creamer!\n",
693 | "Full Text:\n",
694 | "I know the product title says Molecular Gastronomy, but don't let that scare you off. I have been looking for this for a while now, not for food science, but for something more down to earth. I use it to make my own coffee creamer.<br />I have to have my coffee blonde and sweet - but the flavored creamers are full of the bad kinds of fat, and honestly, I hate to use manufactured \"food\" items. I really don't think they are good for the body. On the other hand, I hate using cold milk or cream, because I like HOT coffee.<br />I stumbled across this on Amazon one day and got the idea of making my own creamer. I also bought low-fat (non-instant) milk powder and regular milk powder. The non-instant lowfat milk is a little sweeter and tastes fresher than regular instant low-fat milk, but does not dissolve good in cold water - which is not a problem for hot coffee. You will have to play with the ratios - I would not do just the heavy cream, it made the coffee too rich. Also, I think the powder is too expensive to just use on it's own. I like mixing 1/3 of each together.<br />For flavoring, I bough cocoa bean powder, vanilla bean powder, and caster (superfine) sugar. I mix up small batches along with spices like cinnamon and nutmeg to make my own flavored creamers. If you wanted, you could use a fake sweetner powder instead. I make up small amounts that I store in jelly canning jars. I also use my little food chopper/food processor to blend everything, so the sugar is not heavier and sinks to the bottom. Let it settle for a bit before opening the top though.<br />This stuff tastes WAY better than the storebought creamers and it is fun to experiment and come up with your own flavors. I am going to try using some essential oils next and see if I can get a good chocolate/orange mix.<br />All of the ingredients I mentioned are here online. Take the time to experiment. Maybe you don't use any low-fat milk. Or don't add any flavorings. It is up to you. Also, would make great housewarming/host(ess) gifts.<br />I am sure other molecular people will be able to tell you more of what you can do with it, and I am sure I will experiment with it in cooking - but the main reason I bought it was to make my own creamer and it worked out great.\n",
695 | "_________________________\n",
696 | "Data Point 4\n",
697 | "Summary:\n",
698 | "Home delivered twizlers\n",
699 | "Full Text:\n",
700 | "Candy was delivered very fast and was purchased at a reasonable price. I was home bound and unable to get to a store so this was perfect for me.\n",
701 | "_________________________\n",
702 | "Data Point 5\n",
703 | "Summary:\n",
704 | "Great food!\n",
705 | "Full Text:\n",
706 | "We have three dogs and all of them love this food! We bought it specifically for one of our dogs who has food allergies and it works great for him, no more hot spots or tummy problems.<br />I LOVE that it ships right to our door with free shipping.\n"
707 | ]
708 | }
709 | ],
710 | "source": [
711 | "obj.analyze_data()"
712 | ]
713 | },
714 | {
715 | "cell_type": "code",
716 | "execution_count": 50,
717 | "metadata": {},
718 | "outputs": [],
719 | "source": [
720 | "contractions = { \n",
721 | "\"ain't\": \"am not\",\n",
722 | "\"aren't\": \"are not\",\n",
723 | "\"can't\": \"cannot\",\n",
724 | "\"can't've\": \"cannot have\",\n",
725 | "\"'cause\": \"because\",\n",
726 | "\"could've\": \"could have\",\n",
727 | "\"couldn't\": \"could not\",\n",
728 | "\"couldn't've\": \"could not have\",\n",
729 | "\"didn't\": \"did not\",\n",
730 | "\"doesn't\": \"does not\",\n",
731 | "\"don't\": \"do not\",\n",
732 | "\"hadn't\": \"had not\",\n",
733 | "\"hadn't've\": \"had not have\",\n",
734 | "\"hasn't\": \"has not\",\n",
735 | "\"haven't\": \"have not\",\n",
736 | "\"he'd\": \"he would\",\n",
737 | "\"he'd've\": \"he would have\",\n",
738 | "\"he'll\": \"he will\",\n",
739 | "\"he's\": \"he is\",\n",
740 | "\"how'd\": \"how did\",\n",
741 | "\"how'll\": \"how will\",\n",
742 | "\"how's\": \"how is\",\n",
743 | "\"i'd\": \"i would\",\n",
744 | "\"i'll\": \"i will\",\n",
745 | "\"i'm\": \"i am\",\n",
746 | "\"i've\": \"i have\",\n",
747 | "\"isn't\": \"is not\",\n",
748 | "\"it'd\": \"it would\",\n",
749 | "\"it'll\": \"it will\",\n",
750 | "\"it's\": \"it is\",\n",
751 | "\"let's\": \"let us\",\n",
752 | "\"ma'am\": \"madam\",\n",
753 | "\"mayn't\": \"may not\",\n",
754 | "\"might've\": \"might have\",\n",
755 | "\"mightn't\": \"might not\",\n",
756 | "\"must've\": \"must have\",\n",
757 | "\"mustn't\": \"must not\",\n",
758 | "\"needn't\": \"need not\",\n",
759 | "\"oughtn't\": \"ought not\",\n",
760 | "\"shan't\": \"shall not\",\n",
761 | "\"sha'n't\": \"shall not\",\n",
762 | "\"she'd\": \"she would\",\n",
763 | "\"she'll\": \"she will\",\n",
764 | "\"she's\": \"she is\",\n",
765 | "\"should've\": \"should have\",\n",
766 | "\"shouldn't\": \"should not\",\n",
767 | "\"that'd\": \"that would\",\n",
768 | "\"that's\": \"that is\",\n",
769 | "\"there'd\": \"there had\",\n",
770 | "\"there's\": \"there is\",\n",
771 | "\"they'd\": \"they would\",\n",
772 | "\"they'll\": \"they will\",\n",
773 | "\"they're\": \"they are\",\n",
774 | "\"they've\": \"they have\",\n",
775 | "\"wasn't\": \"was not\",\n",
776 | "\"we'd\": \"we would\",\n",
777 | "\"we'll\": \"we will\",\n",
778 | "\"we're\": \"we are\",\n",
779 | "\"we've\": \"we have\",\n",
780 | "\"weren't\": \"were not\",\n",
781 | "\"what'll\": \"what will\",\n",
782 | "\"what're\": \"what are\",\n",
783 | "\"what's\": \"what is\",\n",
784 | "\"what've\": \"what have\",\n",
785 | "\"where'd\": \"where did\",\n",
786 | "\"where's\": \"where is\",\n",
787 | "\"who'll\": \"who will\",\n",
788 | "\"who's\": \"who is\",\n",
789 | "\"won't\": \"will not\",\n",
790 | "\"wouldn't\": \"would not\",\n",
791 | "\"you'd\": \"you would\",\n",
792 | "\"you'll\": \"you will\",\n",
793 | "\"you're\": \"you are\"\n",
794 | "}"
795 | ]
796 | },
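The contraction map above drives a simple word-level substitution: lowercase the text, split on whitespace, and swap any token found in the dictionary. A minimal standalone sketch of that lookup (the helper name `expand_contractions` is illustrative, not from the notebook, and only a few entries of the map are reproduced here):

```python
# Illustrative subset of the full contraction map defined in the notebook
contractions = {
    "can't": "cannot",
    "it's": "it is",
    "won't": "will not",
}

def expand_contractions(text, mapping):
    # Replace each whitespace-delimited token that appears in the map
    return " ".join(mapping.get(word, word) for word in text.lower().split())

print(expand_contractions("It's great but I can't stop", contractions))
# -> it is great but i cannot stop
```

Note that this lookup only matches clean tokens, which is one reason the cleaning step below expands contractions before stripping apostrophes and other punctuation.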
797 | {
798 | "cell_type": "code",
799 | "execution_count": 61,
800 | "metadata": {},
801 | "outputs": [],
802 | "source": [
803 | "class Data_cleaning:\n",
804 | " def __init__(self):\n",
805 | " self.clean_summaries= []\n",
806 | " self.clean_texts= []\n",
807 | "\n",
808 |     "    def clean_text(self, text, remove_stopwords=False):\n",
809 |     "        \"\"\"\n",
810 |     "        Applies a series of cleaning operations to a single string\n",
811 |     "        \"\"\"\n",
812 | " text = text.lower()\n",
813 | "\n",
814 |     "        # Replace contractions with their longer forms\n",
815 |     "        text = text.split()\n",
816 |     "        new_text = []\n",
817 |     "        for word in text:\n",
818 |     "            if word in contractions:\n",
819 |     "                new_text.append(contractions[word])\n",
820 |     "            else:\n",
821 |     "                new_text.append(word)\n",
822 |     "        text = \" \".join(new_text)\n",
823 | "\n",
824 | " text = re.sub(r'https?:\\/\\/.*[\\r\\n]*', '', text, flags=re.MULTILINE)\n",
825 |     "        text = re.sub(r'\\<a href', ' ', text)\n",
829 |     "        text = re.sub(r'<br />', ' ', text)\n",
830 |     "        text = re.sub(r'<br>', ' ', text)\n",
831 |     "        text = re.sub(r'\\'', ' ', text)\n",
832 | "\n",
833 | " # Optionally, remove stop words\n",
834 | " if remove_stopwords:\n",
835 | " text = text.split()\n",
836 | " stops = set(stopwords.words(\"english\"))\n",
837 |     "            text = [w for w in text if w not in stops]\n",
838 | " text = \" \".join(text)\n",
839 | "\n",
840 | " return text\n",
841 | " \n",
842 | " def clean(self, data):\n",
843 | " \"\"\"\n",
844 |     "        Applies clean_text() to every summary and text in the dataset\n",
845 | " \"\"\"\n",
846 | " for summary in data.Summary:\n",
847 | " self.clean_summaries.append(self.clean_text(summary))\n",
848 | "\n",
849 | " print(\"Summaries are complete.\")\n",
850 | "\n",
851 | " for text in data.Text:\n",
852 | " self.clean_texts.append(self.clean_text(text))\n",
853 | "\n",
854 | " print(\"Texts are complete.\")\n",
855 | " \n",
856 | " return self.clean_summaries, self.clean_texts"
857 | ]
858 | },
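End to end, `clean_text` lowercases, expands contractions, strips URLs, HTML remnants, and punctuation, and optionally drops stopwords. A self-contained sketch of the same sequence of steps (the punctuation class here is simplified and the tiny `STOPWORDS` set is a stand-in for NLTK's English stopword list, so this is illustrative rather than a copy of the notebook's code):

```python
import re

STOPWORDS = {"the", "a", "is", "for", "at"}  # stand-in for stopwords.words("english")

def clean_text(text, remove_stopwords=False):
    """Simplified mirror of the notebook's per-string cleaning steps."""
    text = text.lower()
    text = re.sub(r'https?:\/\/\S*', '', text)                  # drop URLs
    text = re.sub(r'<br\s*/?>', ' ', text)                      # drop HTML line breaks
    text = re.sub(r'[_"\-;%()|+&=*.,!?:#$@\[\]/]', ' ', text)   # drop punctuation
    text = re.sub(r"'", ' ', text)                              # drop apostrophes
    if remove_stopwords:
        text = " ".join(w for w in text.split() if w not in STOPWORDS)
    return text

print(clean_text("Great taffy at a great price! See http://example.com"))
```

As in the `clean` method above, both summaries and texts can be passed through with `remove_stopwords=False` (the default), keeping stopwords in place.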
859 | {
860 | "cell_type": "code",
861 | "execution_count": 62,
862 | "metadata": {},
863 | "outputs": [
864 | {
865 | "name": "stdout",
866 | "output_type": "stream",
867 | "text": [
868 | "Summaries are complete.\n",
869 | "Texts are complete.\n"
870 | ]
871 | }
872 | ],
873 | "source": [
874 | "# import nltk\n",
875 | "# nltk.download('stopwords')\n",
876 | "\n",
877 | "clean_obj= Data_cleaning()\n",
878 | "clean_summaries, clean_texts= clean_obj.clean(data)"
879 | ]
880 | },
881 | {
882 | "cell_type": "code",
883 | "execution_count": 63,
884 | "metadata": {},
885 | "outputs": [
886 | {
887 | "name": "stdout",
888 | "output_type": "stream",
889 | "text": [
890 | "_________________________\n",
891 | "Data Point #1\n",
892 | "Summary:\n",
893 | "mushy\n",
894 | "Full Text:\n",
895 | "the flavors are good however i do not see any differce between this and oaker oats brand they are both mushy \n",
896 | "_________________________\n",
897 | "Data Point #2\n",
898 | "Summary:\n",
899 | "delicious product \n",
900 | "Full Text:\n",
901 | "i can remember buying this candy as a kid and the quality has not dropped in all these years still a superb product you will not be disappointed with \n",
902 | "_________________________\n",
903 | "Data Point #3\n",
904 | "Summary:\n",
905 | "forget molecular gastronomy this stuff rockes a coffee creamer \n",
906 | "Full Text:\n",
907 |     "i know the product title says molecular gastronomy but do not let that scare you off i have been looking for this for a while now not for food science but for something more down to earth i use it to make my own coffee creamer i have to have my coffee blonde and sweet but the flavored creamers are full of the bad kinds of fat and honestly i hate to use manufactured food items i really do not think they are good for the body on the other hand i hate using cold milk or cream because i like hot coffee i stumbled across this on amazon one day and got the idea of making my own creamer i also bought low fat non instant milk powder and regular milk powder the non instant lowfat milk is a little sweeter and tastes fresher than regular instant low fat milk but does not dissolve good in cold water which is not a problem for hot coffee you will have to play with the ratios i would not do just the heavy cream it made the coffee too rich also i think the powder is too expensive to just use on it is own i like mixing 1 3 of each together for flavoring i bough cocoa bean powder vanilla bean powder and caster superfine sugar i mix up small batches along with spices like cinnamon and nutmeg to make my own flavored creamers if you wanted you could use a fake sweetner powder instead i make up small amounts that i store in jelly canning jars i also use my little food chopper food processor to blend everything so the sugar is not heavier and sinks to the bottom let it settle for a bit before opening the top though this stuff tastes way better than the storebought creamers and it is fun to experiment and come up with your own flavors i am going to try using some essential oils next and see if i can get a good chocolate orange mix all of the ingredients i mentioned are here online take the time to experiment maybe you do not use any low fat milk or do not add any flavorings it is up to you also would make great housewarming host ess gifts i am sure other molecular people will be able to tell you more of what you can do with it and i am sure i will experiment with it in cooking but the main reason i bought it was to make my own creamer and it worked out great \n",
908 | "_________________________\n",
909 | "Data Point #4\n",
910 | "Summary:\n",
911 | "home delivered twizlers\n",
912 | "Full Text:\n",
913 | "candy was delivered very fast and was purchased at a reasonable price i was home bound and unable to get to a store so this was perfect for me \n",
914 | "_________________________\n",
915 | "Data Point #5\n",
916 | "Summary:\n",
917 | "great food \n",
918 | "Full Text:\n",
919 | "we have three dogs and all of them love this food we bought it specifically for one of our dogs who has food allergies and it works great for him no more hot spots or tummy problems i love that it ships right to our door with free shipping \n"
920 | ]
921 | }
922 | ],
923 | "source": [
924 | "np.random.seed(1)\n",
925 | "\n",
926 | "for sr_no, i in enumerate(np.random.randint(10, 100, size= 5)):\n",
927 | " print(\"_________________________\")\n",
928 | " print(\"Data Point #{0}\".format(sr_no + 1))\n",
929 | " print(\"Summary:\")\n",
930 | " print(clean_summaries[i])\n",
931 | " print(\"Full Text:\")\n",
932 | " print(clean_texts[i])"
933 | ]
934 | }
935 | ],
936 | "metadata": {
937 | "kernelspec": {
938 | "display_name": "Python 3",
939 | "language": "python",
940 | "name": "python3"
941 | },
942 | "language_info": {
943 | "codemirror_mode": {
944 | "name": "ipython",
945 | "version": 3
946 | },
947 | "file_extension": ".py",
948 | "mimetype": "text/x-python",
949 | "name": "python",
950 | "nbconvert_exporter": "python",
951 | "pygments_lexer": "ipython3",
952 | "version": "3.6.7"
953 | }
954 | },
955 | "nbformat": 4,
956 | "nbformat_minor": 2
957 | }
958 |
--------------------------------------------------------------------------------