├── DataPreProcessing.pdf ├── README.md └── data-preprocessing.ipynb /DataPreProcessing.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/miohana/data-preprocessing/HEAD/DataPreProcessing.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Example of Data Preprocessing using Python :snake: 2 | 3 | We all produce a lot of data. All the time. 4 | 5 | We need to process all that data in order to make it useful and to extract high-quality information from text that can be used for predictions and natural language processing. 6 | 7 | The main objective here is to give a brief overview of some tools that data scientists have been using for data mining. 8 | 9 | It's important to always focus on the business and choose the tools that fit it best. 10 | 11 | ## The language 12 | 13 | In this project I used Python, version 3.6.8. 14 | 15 | ## The content 16 | 17 | We are using content extracted from [this book](https://alex.smola.org/drafts/thebook.pdf), written by Alex Smola, about Machine Learning (great stuff, btw). 18 | 19 | ## About the techniques used 20 | 21 | The techniques that we are going to use are: 22 | 23 | 1-Case alignment 24 | 25 | 2-Tokenization 26 | 27 | 3-Stopwords removal 28 | 29 | 4-Stemming 30 | 31 | 5-Lemmatization 32 |

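The five steps listed above can be sketched end to end in plain Python. This is a minimal, dependency-free illustration only: the notebook itself uses NLTK (`word_tokenize`, `stopwords`, `SnowballStemmer`, `WordNetLemmatizer`); the tiny stopword set and the regex "stemmer" below are illustrative stand-ins, and lemmatization is left out because it needs a dictionary such as WordNet.

```python
import re

# Illustrative stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "are", "in", "that"}

def preprocess(text):
    # 1 - Case alignment: "Computer" and "computer" become the same token
    text = text.lower()
    # 2 - Tokenization (by word): keep runs of letters and digits
    tokens = re.findall(r"[a-z0-9]+", text)
    # 3 - Stopwords removal: drop frequent, low-information words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4 - Stemming: crude suffix stripping (SnowballStemmer does this properly)
    tokens = [re.sub(r"(?:ing|ed|es|s)$", "", t) for t in tokens]
    return tokens

print(preprocess("The Machines are learning"))  # → ['machin', 'learn']
```

With NLTK installed, the regex step would be replaced by `SnowballStemmer('english').stem`, and step 5 (lemmatization) by `WordNetLemmatizer().lemmatize`, exactly as the notebook does.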
33 | 34 | You can see more information in the notebook, the data-preprocessing.ipynb file, and in the presentation that guides the content, DataPreProcessing.pdf. 35 | 36 | Enjoy! :purple_heart: 37 | -------------------------------------------------------------------------------- /data-preprocessing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Data Preprocessing - why is it important?\n", 8 | "\n", 9 | "The main focus of this project is to show a few __techniques__ that have been used in __data science projects__.\n", 10 | "

\n", 11 | "The techniques that we are going to use are:\n", 12 | "
\n", 13 | "

1-Case alignment

\n", 14 | "

2-Tokenization

\n", 15 | "

3-Stopwords removal

\n", 16 | "

4-Stemming

\n", 17 | "

5-Lemmatization

" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## Collecting the data\n", 25 | "\n", 26 | "We are going to use this pdf archive* as the source of all the preprocessing implementation: alex.smola.org/drafts/thebook.pdf. _(By the way, very great content!)_\n", 27 | "
\n", 28 | "*Not the entire pdf, just two pages, because the main objective of this notebook is to show the techniques.\n", 29 | "


\n", 30 | "For this, we will use the pdftotext library, which provides text extraction support for PDF files.\n", 31 | "\n", 32 | "In case you want to test this project on your own, you will need to download [the pdf file](https://alex.smola.org/drafts/thebook.pdf) that I've used and update the path variable with the directory where the file is located on your computer." 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 160, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "name": "stdout", 42 | "output_type": "stream", 43 | "text": [ 44 | " 1\n", 45 | " Introduction\n", 46 | "Over the past two decades Machine Learning has become one of the main-\n", 47 | "stays of information technology and with that, a rather central, albeit usually\n", 48 | "hidden, part of our life. With the ever increasing amounts of data becoming\n", 49 | "available there is good reason to believe that smart data analysis will become\n", 50 | "even more pervasive as a necessary ingredient for technological progress.\n", 51 | " The purpose of this chapter is to provide the reader with an overview over\n", 52 | "the vast range of applications which have at their heart a machine learning\n", 53 | "problem and to bring some degree of order to the zoo of problems. After\n", 54 | "that, we will discuss some basic tools from statistics and probability theory,\n", 55 | "since they form the language in which many machine learning problems must\n", 56 | "be phrased to become amenable to solving. Finally, we will outline a set of\n", 57 | "fairly basic yet effective algorithms to solve an important problem, namely\n", 58 | "that of classification. More sophisticated tools, a discussion of more general\n", 59 | "problems and a detailed analysis will follow in later parts of the book.\n", 60 | "1.1 A Taste of Machine Learning\n", 61 | "Machine learning can appear in many guises. 
We now discuss a number of\n", 62 | "applications, the types of data they deal with, and finally, we formalize the\n", 63 | "problems in a somewhat more stylized fashion. The latter is key if we want to\n", 64 | "avoid reinventing the wheel for every new application. Instead, much of the\n", 65 | "art of machine learning is to reduce a range of fairly disparate problems to\n", 66 | "a set of fairly narrow prototypes. Much of the science of machine learning is\n", 67 | "then to solve those problems and provide good guarantees for the solutions.\n", 68 | "1.1.1 Applications\n", 69 | "Most readers will be familiar with the concept of web page ranking. That\n", 70 | "is, the process of submitting a query to a search engine, which then finds\n", 71 | "webpages relevant to the query and which returns them in their order of\n", 72 | "relevance. See e.g. Figure 1.1 for an example of the query results for “ma-\n", 73 | "chine learning”. That is, the search engine returns a sorted list of webpages\n", 74 | "given a query. To achieve this goal, a search engine needs to ‘know’ which\n", 75 | " 3\n", 76 | "\n" 77 | ] 78 | } 79 | ], 80 | "source": [ 81 | "import pdftotext\n", 82 | "import io\n", 83 | "\n", 84 | "pages = ''\n", 85 | "pages_dict = {}\n", 86 | "path = \"/home/miohana/thebook.pdf\"\n", 87 | "\n", 88 | "fileObject = open(path, 'rb')\n", 89 | "pdf = pdftotext.PDF(fileObject)\n", 90 | "\n", 91 | "# we only want page 11\n", 92 | "for index, page in enumerate(pdf, 1): \n", 93 | "    if index == 11:\n", 94 | "        pages += str(page) \n", 95 | "        pages_dict[index] = page\n", 96 | "        break\n", 97 | "\n", 98 | "result = pages\n", 99 | "print(result)" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## 1 - Case alignment\n", 107 | "\n", 108 | "Avoids duplicate tokens - \"Computer\" and \"computer\" have the same meaning." 
109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 161, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "result = result.lower()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "## Removing the \"Introduction\" heading and the section titles - they aren't necessary\n", 125 | "For this, we need to find the starting index of the word \"over\", because that is the one the text will start with." 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 162, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "85\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "print(result.find('over'))" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 163, 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "over the past two decades machine learning has become one of the main-\n", 155 | "stays of information technology and with that, a rather central, albeit usually\n", 156 | "hidden, part of our life. with the ever increasing amounts of data becoming\n", 157 | "available there is good reason to believe that smart data analysis will become\n", 158 | "even more pervasive as a necessary ingredient for technological progress.\n", 159 | " the purpose of this chapter is to provide the reader with an overview over\n", 160 | "the vast range of applications which have at their heart a machine learning\n", 161 | "problem and to bring some degree of order to the zoo of problems. after\n", 162 | "that, we will discuss some basic tools from statistics and probability theory,\n", 163 | "since they form the language in which many machine learning problems must\n", 164 | "be phrased to become amenable to solving. 
finally, we will outline a set of\n", 165 | "fairly basic yet effective algorithms to solve an important problem, namely\n", 166 | "that of classification. more sophisticated tools, a discussion of more general\n", 167 | "problems and a detailed analysis will follow in later parts of the book.\n", 168 | "1.1 a taste of machine learning\n", 169 | "machine learning can appear in many guises. we now discuss a number of\n", 170 | "applications, the types of data they deal with, and finally, we formalize the\n", 171 | "problems in a somewhat more stylized fashion. the latter is key if we want to\n", 172 | "avoid reinventing the wheel for every new application. instead, much of the\n", 173 | "art of machine learning is to reduce a range of fairly disparate problems to\n", 174 | "a set of fairly narrow prototypes. much of the science of machine learning is\n", 175 | "then to solve those problems and provide good guarantees for the solutions.\n", 176 | "1.1.1 applications\n", 177 | "most readers will be familiar with the concept of web page ranking. that\n", 178 | "is, the process of submitting a query to a search engine, which then finds\n", 179 | "webpages relevant to the query and which returns them in their order of\n", 180 | "relevance. see e.g. figure 1.1 for an example of the query results for “ma-\n", 181 | "chine learning”. that is, the search engine returns a sorted list of webpages\n", 182 | "given a query. 
to achieve this goal, a search engine needs to ‘know’ which\n", 183 | " 3\n", 184 | "\n" 185 | ] 186 | } 187 | ], 188 | "source": [ 189 | "print(result[85:])\n", 190 | "result = result[85:]" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 165, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "result = result.replace('1.1 a taste of machine learning', '')" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 166, 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "result = result.replace('1.1.1 applications', '')" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 167, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "name": "stdout", 218 | "output_type": "stream", 219 | "text": [ 220 | "over the past two decades machine learning has become one of the main-\n", 221 | "stays of information technology and with that, a rather central, albeit usually\n", 222 | "hidden, part of our life. with the ever increasing amounts of data becoming\n", 223 | "available there is good reason to believe that smart data analysis will become\n", 224 | "even more pervasive as a necessary ingredient for technological progress.\n", 225 | " the purpose of this chapter is to provide the reader with an overview over\n", 226 | "the vast range of applications which have at their heart a machine learning\n", 227 | "problem and to bring some degree of order to the zoo of problems. after\n", 228 | "that, we will discuss some basic tools from statistics and probability theory,\n", 229 | "since they form the language in which many machine learning problems must\n", 230 | "be phrased to become amenable to solving. finally, we will outline a set of\n", 231 | "fairly basic yet effective algorithms to solve an important problem, namely\n", 232 | "that of classification. 
more sophisticated tools, a discussion of more general\n", 233 | "problems and a detailed analysis will follow in later parts of the book.\n", 234 | "\n", 235 | "machine learning can appear in many guises. we now discuss a number of\n", 236 | "applications, the types of data they deal with, and finally, we formalize the\n", 237 | "problems in a somewhat more stylized fashion. the latter is key if we want to\n", 238 | "avoid reinventing the wheel for every new application. instead, much of the\n", 239 | "art of machine learning is to reduce a range of fairly disparate problems to\n", 240 | "a set of fairly narrow prototypes. much of the science of machine learning is\n", 241 | "then to solve those problems and provide good guarantees for the solutions.\n", 242 | "\n", 243 | "most readers will be familiar with the concept of web page ranking. that\n", 244 | "is, the process of submitting a query to a search engine, which then finds\n", 245 | "webpages relevant to the query and which returns them in their order of\n", 246 | "relevance. see e.g. figure 1.1 for an example of the query results for “ma-\n", 247 | "chine learning”. that is, the search engine returns a sorted list of webpages\n", 248 | "given a query. to achieve this goal, a search engine needs to ‘know’ which\n", 249 | " 3\n", 250 | "\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "print(result)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "### Also, we don't want the final sentence of the text, nor the page number." 
263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 168, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "1988\n" 275 | ] 276 | } 277 | ], 278 | "source": [ 279 | "print(result.find('to achieve this goal'))" 280 | ] 281 | }, 282 | { 283 | "cell_type": "code", 284 | "execution_count": 169, 285 | "metadata": {}, 286 | "outputs": [ 287 | { 288 | "name": "stdout", 289 | "output_type": "stream", 290 | "text": [ 291 | "over the past two decades machine learning has become one of the main-\n", 292 | "stays of information technology and with that, a rather central, albeit usually\n", 293 | "hidden, part of our life. with the ever increasing amounts of data becoming\n", 294 | "available there is good reason to believe that smart data analysis will become\n", 295 | "even more pervasive as a necessary ingredient for technological progress.\n", 296 | " the purpose of this chapter is to provide the reader with an overview over\n", 297 | "the vast range of applications which have at their heart a machine learning\n", 298 | "problem and to bring some degree of order to the zoo of problems. after\n", 299 | "that, we will discuss some basic tools from statistics and probability theory,\n", 300 | "since they form the language in which many machine learning problems must\n", 301 | "be phrased to become amenable to solving. finally, we will outline a set of\n", 302 | "fairly basic yet effective algorithms to solve an important problem, namely\n", 303 | "that of classification. more sophisticated tools, a discussion of more general\n", 304 | "problems and a detailed analysis will follow in later parts of the book.\n", 305 | "\n", 306 | "machine learning can appear in many guises. we now discuss a number of\n", 307 | "applications, the types of data they deal with, and finally, we formalize the\n", 308 | "problems in a somewhat more stylized fashion. 
the latter is key if we want to\n", 309 | "avoid reinventing the wheel for every new application. instead, much of the\n", 310 | "art of machine learning is to reduce a range of fairly disparate problems to\n", 311 | "a set of fairly narrow prototypes. much of the science of machine learning is\n", 312 | "then to solve those problems and provide good guarantees for the solutions.\n", 313 | "\n", 314 | "most readers will be familiar with the concept of web page ranking. that\n", 315 | "is, the process of submitting a query to a search engine, which then finds\n", 316 | "webpages relevant to the query and which returns them in their order of\n", 317 | "relevance. see e.g. figure 1.1 for an example of the query results for “ma-\n", 318 | "chine learning”. that is, the search engine returns a sorted list of webpages\n", 319 | "given a query. \n" 320 | ] 321 | } 322 | ], 323 | "source": [ 324 | "print(result[:1988])\n", 325 | "result = result[:1988]" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "## 2 - Tokenization (by sentence)" 333 | ] 334 | }, 335 | { 336 | "cell_type": "code", 337 | "execution_count": 170, 338 | "metadata": {}, 339 | "outputs": [], 340 | "source": [ 341 | "from nltk.tokenize import sent_tokenize\n", 342 | "from nltk.corpus import stopwords\n", 343 | "\n", 344 | "tokens_by_sentence = ''\n", 345 | "\n", 346 | "tokens_by_sentence = sent_tokenize(result)\n", 347 | "tokens_by_sentence = [w.replace('\\n', ' ').replace('- ', '') for w in tokens_by_sentence]" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 171, 353 | "metadata": {}, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "['over the past two decades machine learning has become one of the mainstays of information technology and with that, a rather central, albeit usually hidden, part of our life.', 'with the ever increasing amounts of data becoming available 
there is good reason to believe that smart data analysis will become even more pervasive as a necessary ingredient for technological progress.', 'the purpose of this chapter is to provide the reader with an overview over the vast range of applications which have at their heart a machine learning problem and to bring some degree of order to the zoo of problems.', 'after that, we will discuss some basic tools from statistics and probability theory, since they form the language in which many machine learning problems must be phrased to become amenable to solving.', 'finally, we will outline a set of fairly basic yet effective algorithms to solve an important problem, namely that of classification.', 'more sophisticated tools, a discussion of more general problems and a detailed analysis will follow in later parts of the book.', 'machine learning can appear in many guises.', 'we now discuss a number of applications, the types of data they deal with, and finally, we formalize the problems in a somewhat more stylized fashion.', 'the latter is key if we want to avoid reinventing the wheel for every new application.', 'instead, much of the art of machine learning is to reduce a range of fairly disparate problems to a set of fairly narrow prototypes.', 'much of the science of machine learning is then to solve those problems and provide good guarantees for the solutions.', 'most readers will be familiar with the concept of web page ranking.', 'that is, the process of submitting a query to a search engine, which then finds webpages relevant to the query and which returns them in their order of relevance.', 'see e.g.', 'figure 1.1 for an example of the query results for “machine learning”.', 'that is, the search engine returns a sorted list of webpages given a query.']\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "print(tokens_by_sentence)" 365 | ] 366 | }, 367 | { 368 | "cell_type": "markdown", 369 | "metadata": {}, 370 | "source": [ 371 | "## 2 - Tokenization (by 
word)" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 172, 377 | "metadata": {}, 378 | "outputs": [], 379 | "source": [ 380 | "from nltk.tokenize import word_tokenize\n", 381 | "\n", 382 | "tokens_by_word = ''\n", 383 | "\n", 384 | "tokens_by_word = word_tokenize(result)\n", 385 | "tokens_by_word = [w.replace('\\n', ' ').replace('- ', '') for w in tokens_by_word]" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 173, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "name": "stdout", 395 | "output_type": "stream", 396 | "text": [ 397 | "['over', 'the', 'past', 'two', 'decades', 'machine', 'learning', 'has', 'become', 'one', 'of', 'the', 'main-', 'stays', 'of', 'information', 'technology', 'and', 'with', 'that', ',', 'a', 'rather', 'central', ',', 'albeit', 'usually', 'hidden', ',', 'part', 'of', 'our', 'life', '.', 'with', 'the', 'ever', 'increasing', 'amounts', 'of', 'data', 'becoming', 'available', 'there', 'is', 'good', 'reason', 'to', 'believe', 'that', 'smart', 'data', 'analysis', 'will', 'become', 'even', 'more', 'pervasive', 'as', 'a', 'necessary', 'ingredient', 'for', 'technological', 'progress', '.', 'the', 'purpose', 'of', 'this', 'chapter', 'is', 'to', 'provide', 'the', 'reader', 'with', 'an', 'overview', 'over', 'the', 'vast', 'range', 'of', 'applications', 'which', 'have', 'at', 'their', 'heart', 'a', 'machine', 'learning', 'problem', 'and', 'to', 'bring', 'some', 'degree', 'of', 'order', 'to', 'the', 'zoo', 'of', 'problems', '.', 'after', 'that', ',', 'we', 'will', 'discuss', 'some', 'basic', 'tools', 'from', 'statistics', 'and', 'probability', 'theory', ',', 'since', 'they', 'form', 'the', 'language', 'in', 'which', 'many', 'machine', 'learning', 'problems', 'must', 'be', 'phrased', 'to', 'become', 'amenable', 'to', 'solving', '.', 'finally', ',', 'we', 'will', 'outline', 'a', 'set', 'of', 'fairly', 'basic', 'yet', 'effective', 'algorithms', 'to', 'solve', 'an', 'important', 'problem', 
',', 'namely', 'that', 'of', 'classification', '.', 'more', 'sophisticated', 'tools', ',', 'a', 'discussion', 'of', 'more', 'general', 'problems', 'and', 'a', 'detailed', 'analysis', 'will', 'follow', 'in', 'later', 'parts', 'of', 'the', 'book', '.', 'machine', 'learning', 'can', 'appear', 'in', 'many', 'guises', '.', 'we', 'now', 'discuss', 'a', 'number', 'of', 'applications', ',', 'the', 'types', 'of', 'data', 'they', 'deal', 'with', ',', 'and', 'finally', ',', 'we', 'formalize', 'the', 'problems', 'in', 'a', 'somewhat', 'more', 'stylized', 'fashion', '.', 'the', 'latter', 'is', 'key', 'if', 'we', 'want', 'to', 'avoid', 'reinventing', 'the', 'wheel', 'for', 'every', 'new', 'application', '.', 'instead', ',', 'much', 'of', 'the', 'art', 'of', 'machine', 'learning', 'is', 'to', 'reduce', 'a', 'range', 'of', 'fairly', 'disparate', 'problems', 'to', 'a', 'set', 'of', 'fairly', 'narrow', 'prototypes', '.', 'much', 'of', 'the', 'science', 'of', 'machine', 'learning', 'is', 'then', 'to', 'solve', 'those', 'problems', 'and', 'provide', 'good', 'guarantees', 'for', 'the', 'solutions', '.', 'most', 'readers', 'will', 'be', 'familiar', 'with', 'the', 'concept', 'of', 'web', 'page', 'ranking', '.', 'that', 'is', ',', 'the', 'process', 'of', 'submitting', 'a', 'query', 'to', 'a', 'search', 'engine', ',', 'which', 'then', 'finds', 'webpages', 'relevant', 'to', 'the', 'query', 'and', 'which', 'returns', 'them', 'in', 'their', 'order', 'of', 'relevance', '.', 'see', 'e.g', '.', 'figure', '1.1', 'for', 'an', 'example', 'of', 'the', 'query', 'results', 'for', '“', 'ma-', 'chine', 'learning', '”', '.', 'that', 'is', ',', 'the', 'search', 'engine', 'returns', 'a', 'sorted', 'list', 'of', 'webpages', 'given', 'a', 'query', '.']\n" 398 | ] 399 | } 400 | ], 401 | "source": [ 402 | "print(tokens_by_word)" 403 | ] 404 | }, 405 | { 406 | "cell_type": "markdown", 407 | "metadata": {}, 408 | "source": [ 409 | "## 3 - Stopwords removal" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 
| "execution_count": 174, 415 | "metadata": {}, 416 | "outputs": [ 417 | { 418 | "name": "stdout", 419 | "output_type": "stream", 420 | "text": [ 421 | "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n" 422 | ] 423 | } 424 | ], 425 | "source": [ 426 | "print(stopwords.words('english'))" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 175, 432 | "metadata": {}, 433 | "outputs": [ 434 | { 435 | "name": "stdout", 436 | "output_type": "stream", 437 | "text": [ 438 | "['de', 'a', 'o', 'que', 
'e', 'do', 'da', 'em', 'um', 'para']\n" 439 | ] 440 | } 441 | ], 442 | "source": [ 443 | "print(stopwords.words('portuguese')[:10])" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": 176, 449 | "metadata": {}, 450 | "outputs": [], 451 | "source": [ 452 | "def remove_stopwords(text):\n", 453 | " stopWords = stopwords.words('english')\n", 454 | " not_stopword = [word for word in text if not word in stopWords]\n", 455 | " return not_stopword\n", 456 | "\n", 457 | "result = remove_stopwords(tokens_by_word)" 458 | ] 459 | }, 460 | { 461 | "cell_type": "code", 462 | "execution_count": 177, 463 | "metadata": {}, 464 | "outputs": [ 465 | { 466 | "name": "stdout", 467 | "output_type": "stream", 468 | "text": [ 469 | "['past', 'two', 'decades', 'machine', 'learning', 'become', 'one', 'main-', 'stays', 'information', 'technology', ',', 'rather', 'central', ',', 'albeit', 'usually', 'hidden', ',', 'part', 'life', '.', 'ever', 'increasing', 'amounts', 'data', 'becoming', 'available', 'good', 'reason', 'believe', 'smart', 'data', 'analysis', 'become', 'even', 'pervasive', 'necessary', 'ingredient', 'technological', 'progress', '.', 'purpose', 'chapter', 'provide', 'reader', 'overview', 'vast', 'range', 'applications', 'heart', 'machine', 'learning', 'problem', 'bring', 'degree', 'order', 'zoo', 'problems', '.', ',', 'discuss', 'basic', 'tools', 'statistics', 'probability', 'theory', ',', 'since', 'form', 'language', 'many', 'machine', 'learning', 'problems', 'must', 'phrased', 'become', 'amenable', 'solving', '.', 'finally', ',', 'outline', 'set', 'fairly', 'basic', 'yet', 'effective', 'algorithms', 'solve', 'important', 'problem', ',', 'namely', 'classification', '.', 'sophisticated', 'tools', ',', 'discussion', 'general', 'problems', 'detailed', 'analysis', 'follow', 'later', 'parts', 'book', '.', 'machine', 'learning', 'appear', 'many', 'guises', '.', 'discuss', 'number', 'applications', ',', 'types', 'data', 'deal', ',', 'finally', ',', 
'formalize', 'problems', 'somewhat', 'stylized', 'fashion', '.', 'latter', 'key', 'want', 'avoid', 'reinventing', 'wheel', 'every', 'new', 'application', '.', 'instead', ',', 'much', 'art', 'machine', 'learning', 'reduce', 'range', 'fairly', 'disparate', 'problems', 'set', 'fairly', 'narrow', 'prototypes', '.', 'much', 'science', 'machine', 'learning', 'solve', 'problems', 'provide', 'good', 'guarantees', 'solutions', '.', 'readers', 'familiar', 'concept', 'web', 'page', 'ranking', '.', ',', 'process', 'submitting', 'query', 'search', 'engine', ',', 'finds', 'webpages', 'relevant', 'query', 'returns', 'order', 'relevance', '.', 'see', 'e.g', '.', 'figure', '1.1', 'example', 'query', 'results', '“', 'ma-', 'chine', 'learning', '”', '.', ',', 'search', 'engine', 'returns', 'sorted', 'list', 'webpages', 'given', 'query', '.']\n" 470 | ] 471 | } 472 | ], 473 | "source": [ 474 | "print(result)" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "## 4 - Stemming\n", 482 | "\n", 483 | "For Non-English projects (such as portuguese), a good library is __RSLP Stemmer__.\n", 484 | "
\n", 485 | "For our purpose, we will use the __SnowballStemmer__, which provides great support for the English language.\n", 486 | "

\n", 487 | "We need to remember that there are many libraries with similar functionality. The right thing to do is to test each one to make sure we choose the one with the best results.\n", 488 | "
\n", 489 | "Examples of other stemmers:\n", 490 | "* PorterStemmer (English)\n", 491 | "* LancasterStemmer (English)\n", 492 | "* ISRIStemmer (Arabic)\n", 493 | "* RSLPStemmer (Portuguese)" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 194, 499 | "metadata": {}, 500 | "outputs": [ 501 | { 502 | "name": "stdout", 503 | "output_type": "stream", 504 | "text": [ 505 | "['over', 'the', 'past', 'two', 'decad', 'machin', 'learn', 'has', 'becom', 'one', 'of', 'the', 'main-', 'stay', 'of', 'inform', 'technolog', 'and', 'with', 'that', ',', 'a', 'rather', 'central', ',', 'albeit', 'usual', 'hidden', ',', 'part', 'of', 'our', 'life', '.', 'with', 'the', 'ever', 'increas', 'amount', 'of', 'data', 'becom', 'avail', 'there', 'is', 'good', 'reason', 'to', 'believ', 'that']\n" 506 | ] 507 | } 508 | ], 509 | "source": [ 510 | "from nltk.stem.snowball import SnowballStemmer\n", 511 | "\n", 512 | "englishStemmer = SnowballStemmer('english')\n", 513 | "\n", 514 | "# stem every token from the word tokenization (punctuation included)\n", 515 | "\n", 516 | "stemmed = [englishStemmer.stem(word) for word in tokens_by_word]\n", 517 | "print(stemmed[:50])" 518 | ] 519 | }, 520 | { 521 | "cell_type": "markdown", 522 | "metadata": {}, 523 | "source": [ 524 | "## 5 - Lemmatization" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": 197, 530 | "metadata": {}, 531 | "outputs": [ 532 | { 533 | "name": "stdout", 534 | "output_type": "stream", 535 | "text": [ 536 | "['over', 'the', 'past', 'two', 'decades', 'machine', 'learn', 'have', 'become', 'one', 'of', 'the', 'main-', 'stay', 'of', 'information', 'technology', 'and', 'with', 'that', ',', 'a', 'rather', 'central', ',', 'albeit', 'usually', 'hide', ',', 'part', 'of', 'our', 'life', '.', 'with', 'the', 'ever', 'increase', 'amount', 'of', 'data', 'become', 'available', 'there', 'be', 'good', 'reason', 'to', 'believe', 'that']\n" 537 | ] 538 | } 539 | ], 540 | "source": [ 541 | "from nltk.stem import 
WordNetLemmatizer\n", 542 | "\n", 543 | "wordnet_lemmatizer = WordNetLemmatizer()\n", 544 | "\n", 545 | "lemmatized = [wordnet_lemmatizer.lemmatize(word, pos=\"v\") for word in tokens_by_word]\n", 546 | "print(lemmatized[:50])" 547 | ] 548 | } 549 | ], 550 | "metadata": { 551 | "kernelspec": { 552 | "display_name": "Python 3", 553 | "language": "python", 554 | "name": "python3" 555 | }, 556 | "language_info": { 557 | "codemirror_mode": { 558 | "name": "ipython", 559 | "version": 3 560 | }, 561 | "file_extension": ".py", 562 | "mimetype": "text/x-python", 563 | "name": "python", 564 | "nbconvert_exporter": "python", 565 | "pygments_lexer": "ipython3", 566 | "version": "3.6.8" 567 | } 568 | }, 569 | "nbformat": 4, 570 | "nbformat_minor": 2 571 | } 572 | --------------------------------------------------------------------------------