├── .github └── stale.yml ├── .gitignore ├── LICENSE ├── README.md ├── TextAnalysis.Rproj ├── TextAnalysis.wpr ├── data ├── fake_news_data │ └── liar_dataset │ │ ├── README.md │ │ ├── test.tsv │ │ ├── train.tsv │ │ └── valid.tsv ├── kaggle_spooky_authors │ ├── sample_submission.csv │ ├── test.csv │ └── train.csv ├── novelwordsonly.txt ├── plainText │ ├── austen.txt │ └── melville.txt ├── readme.md └── text_data_for_analysis.txt ├── figures ├── Rplot-wordcloud01.png └── readme.md ├── resources └── readme.md └── scripts ├── .gitignore ├── R ├── HP_preprocess_01.R ├── austen_text_analysis.R ├── dispersion_plots.R ├── dracula_text_analysis.R ├── expr_kaggle-reddit_EDA.R ├── expr_kaggle-reddit_text-analysis.R ├── initialScript-1.R ├── initialScript.R ├── mobydick_novel_text_analysis.R ├── structural_topic_modeling_00.R ├── text_analysis_example01.R ├── text_analysis_example02.R ├── token_distribution_analysis.R ├── topic_modeling_00.R ├── topic_modeling_01.R ├── wuthering_heights_sentiment_analysis.R └── wuthering_heights_text_analysis.R ├── python ├── .gitignore ├── 00_topic_modelling_fundamentals.ipynb ├── 00_topic_modelling_theoretical_concepts.ipynb ├── 01_topic_modelling_fundamentals.ipynb ├── Tutorial1-An introduction to NLP with SpaCy.ipynb ├── check_pkgs.py ├── extract_table_from_pdf.py ├── extract_text_data_from_pdf.py ├── extract_text_data_from_pdf_2.py ├── extract_text_data_from_pdf_3.py ├── extract_text_data_from_textfiles.py ├── func_text_preprocess.py ├── kaggle_hotel_review_rating.py ├── kaggle_spooky_authors_ml_model.ipynb └── learn_text_analysis.py └── readme.md /.github/stale.yml: -------------------------------------------------------------------------------- 1 | # Number of days of inactivity before an issue becomes stale 2 | daysUntilStale: 60 3 | # Number of days of inactivity before a stale issue is closed 4 | daysUntilClose: 7 5 | # Issues with these labels will never be considered stale 6 | exemptLabels: 7 | - pinned 8 | - security 9 | - notes 10 | # Label to use when marking an issue as stale 11 | staleLabel: wontfix 12 | # Comment to post when marking an issue as stale. Set to `false` to disable 13 | markComment: > 14 | This issue has been automatically marked as stale because it has not had 15 | recent activity. It will be closed if no further activity occurs. Thank you 16 | for your contributions. 17 | # Comment to post when closing a stale issue. 
Set to `false` to disable 18 | closeComment: true 19 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | /*.idea 6 | /data 7 | *.xml 8 | *.iml 9 | *.idea 10 | /.ipynb_checkpoints/ 11 | /*.cpython-37.pyc 12 | /scripts/python/__pycache__ 13 | *.spyproject 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Ashish Dutt 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Readme 2 | [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/duttashi/text-analysis/graphs/commit-activity) 3 | [![Issues](https://img.shields.io/github/issues/duttashi/text-analysis.svg)](https://github.com/duttashi/text-analysis/issues) 4 | [![Popularity Score](https://img.shields.io/github/forks/duttashi/text-analysis.svg)](https://github.com/duttashi/text-analysis/network) 5 | [![Interested](https://img.shields.io/github/stars/duttashi/text-analysis.svg)](https://github.com/duttashi/text-analysis/stargazers) 6 | [![License](https://img.shields.io/github/license/duttashi/text-analysis.svg)](https://github.com/duttashi/text-analysis/blob/master/LICENSE) 7 | [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1489352.svg)](https://doi.org/10.5281/zenodo.1489352) 8 | ### Cleaning and Analyzing textual data 9 | 10 | Weaving analytic stories from text data 11 | 12 | The repository consist of the following folders namely, `data`, `scripts`, `resources` and `figures`. 13 | 14 | ###### Have a question? 15 | 16 | Ask your question on [Stack Overflow](http://stackoverflow.com/questions/tagged/r) 17 | or the [R-SIG-Finance](https://stat.ethz.ch/mailman/listinfo/r-sig-finance) 18 | mailing list (you must subscribe to post). 19 | 20 | ## Contributing [![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/duttashi/text-analysis/issues) 21 | 22 | Please see the [contributing guide](CONTRIBUTING.md). 
23 | 24 | ## Author 25 | [Ashish Dutt](https://duttashi.github.io/about/) 26 | 27 | 28 |

29 | 30 | 31 | 32 |

33 | 34 | 35 | -------------------------------------------------------------------------------- /TextAnalysis.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /TextAnalysis.wpr: -------------------------------------------------------------------------------- 1 | #!wing 2 | #!version=7.0 3 | ################################################################## 4 | # Wing project file # 5 | ################################################################## 6 | [project attributes] 7 | proj.directory-list = [{'dirloc': loc('data'), 8 | 'excludes': (), 9 | 'filter': u'*', 10 | 'include_hidden': False, 11 | 'recursive': True, 12 | 'watch_for_changes': True}] 13 | proj.file-list = [loc('scripts/python/basic_text_processing_00.py'), 14 | loc('scripts/python/basic_text_processing_01.py')] 15 | proj.file-type = 'normal' 16 | [user attributes] 17 | debug.show-args-dialog = {loc('scripts/python/basic_text_processing_00.py'): False, 18 | loc('scripts/python/basic_text_processing_01.py'): False, 19 | loc('scripts/python/extract_table_from_pdf.py'): False, 20 | loc('scripts/python/extract_text_data_from_pdf.py'): False, 21 | loc('scripts/python/list_environ_path_details.py'): False, 22 | loc('scripts/python/pdf_reader.py'): False, 23 | loc('scripts/python/pdf_table_reader.py'): False, 24 | loc('unknown: #1'): False} 25 | guimgr.overall-gui-state = {'windowing-policy': 'combined-window', 26 | 'windows': [{'name': '9z79HX4zahWgrg1gEbMAmcgYyp'\ 27 | 'VZEoEv', 28 | 'size-state': '', 29 | 'type': 'dock', 30 | 'view': {'area': 'tall', 31 | 'constraint': None, 32 | 'current_pages': [0], 33 | 'full-screen': False, 34 | 'notebook_display': 'normal', 35 | 'notebook_percent': 0.25, 36 | 'override_title': None, 37 | 'pagelist': [('project', 38 | 'tall', 39 | 0, 40 | {'tree-state': {'file-sort-method': 'by name', 41 | 'list-files-first': False, 42 | 'tree-states': {'deep': {'expanded-nodes': [], 43 | 'selected-nodes': [(1, 44 | 0)], 45 | 'top-node': (0,)}, 46 | 'flat': {'expanded-nodes': [(6,)], 47 | 'selected-nodes': [(6,)], 48 | 'top-node': (0,)}}, 49 | 'tree-style': 'flat'}}), 50 | ('source-assistant', 51 | 'tall', 52 | 2, 53 | {}), 54 | ('debug-stack', 55 | 'tall', 56 | 1, 57 | {'codeline-mode': 'below'}), 58 | ('browser', 59 | 'tall', 60 | 0, 61 | {'all_tree_states': {loc('scripts/python/basic_text_processing_00.py'): {'e'\ 62 | 'xpanded-nodes': [], 63 | 'selected-nodes': [[('generic attribute', 64 | loc('scripts/python/basic_text_processing_00.py'), 65 | 'e')]], 66 | 'top-node': [('generic attribute', 67 | loc('scripts/python/basic_text_processing_00.py'), 68 | 'e')]}, 69 | loc('scripts/python/extract_table_from_pdf.py'): {'expanded-nodes': [], 70 | 'selected-nodes': [], 71 | 'top-node': [('generic attribute', 72 | loc('scripts/python/extract_table_from_pdf.py'), 73 | 'df')]}, 74 | loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'): {'expande'\ 75 | 'd-nodes': [], 76 | 'selected-nodes': [], 77 | 'top-node': [('function def', 78 | loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 79 | 'build_options')]}}, 80 | 'browse_mode': u'Current Module', 81 | 'follow-selection': False, 82 | 'sort_mode': 'Alphabetically', 83 | 
'visibility_options': {u'Derived Classes': False, 84 | u'Imported': False, 85 | u'Modules': True}}), 86 | ('indent', 87 | 'tall', 88 | 2, 89 | {})], 90 | 'primary_view_state': {'area': 'wide', 91 | 'constraint': None, 92 | 'current_pages': [0, 93 | 1], 94 | 'notebook_display': 'normal', 95 | 'notebook_percent': 0.37662337662337664, 96 | 'override_title': None, 97 | 'pagelist': [('batch-search', 98 | 'wide', 99 | 0, 100 | {'fScope': {'fFileSetName': 'All Source Files', 101 | 'fLocation': None, 102 | 'fRecursive': True, 103 | 'fType': 'project-files'}, 104 | 'fSearchSpec': {'fEndPos': None, 105 | 'fIncludeLinenos': True, 106 | 'fInterpretBackslashes': False, 107 | 'fMatchCase': False, 108 | 'fOmitBinary': True, 109 | 'fRegexFlags': 42, 110 | 'fReplaceText': '', 111 | 'fReverse': False, 112 | 'fSearchText': '', 113 | 'fStartPos': 0, 114 | 'fStyle': 'text', 115 | 'fWholeWords': False, 116 | 'fWrap': True}, 117 | 'fUIOptions': {'fAutoBackground': True, 118 | 'fFilePrefix': 'short-file', 119 | 'fFindAfterReplace': True, 120 | 'fInSelection': False, 121 | 'fIncremental': True, 122 | 'fReplaceOnDisk': False, 123 | 'fShowFirstMatch': False, 124 | 'fShowLineno': True, 125 | 'fShowReplaceWidgets': False}, 126 | 'replace-entry-expanded': False, 127 | 'search-entry-expanded': False}), 128 | ('interactive-search', 129 | 'wide', 130 | 0, 131 | {'fScope': {'fFileSetName': 'All Source Files', 132 | 'fLocation': None, 133 | 'fRecursive': True, 134 | 'fType': 'project-files'}, 135 | 'fSearchSpec': {'fEndPos': None, 136 | 'fIncludeLinenos': True, 137 | 'fInterpretBackslashes': False, 138 | 'fMatchCase': False, 139 | 'fOmitBinary': True, 140 | 'fRegexFlags': 42, 141 | 'fReplaceText': '', 142 | 'fReverse': False, 143 | 'fSearchText': '', 144 | 'fStartPos': 0, 145 | 'fStyle': 'text', 146 | 'fWholeWords': False, 147 | 'fWrap': True}, 148 | 'fUIOptions': {'fAutoBackground': True, 149 | 'fFilePrefix': 'short-file', 150 | 'fFindAfterReplace': True, 151 | 'fInSelection': False, 152 | 'fIncremental': True, 153 | 'fReplaceOnDisk': False, 154 | 'fShowFirstMatch': False, 155 | 'fShowLineno': True, 156 | 'fShowReplaceWidgets': False}}), 157 | ('debug-data', 158 | 'wide', 159 | 0, 160 | {}), 161 | ('debug-io', 162 | 'wide', 163 | 1, 164 | {}), 165 | ('debug-exceptions', 166 | 'wide', 167 | 1, 168 | {}), 169 | ('python-shell', 170 | 'wide', 171 | 2, 172 | {'active-range': (None, 173 | -1, 174 | -1), 175 | 'attrib-starts': [], 176 | 'code-line': '', 177 | 'first-line': 0L, 178 | 'folded-linenos': [], 179 | 'history': {}, 180 | 'launch-id': None, 181 | 'sel-line': 2L, 182 | 'sel-line-start': 145L, 183 | 'selection_end': 145L, 184 | 'selection_start': 145L, 185 | 'zoom': 0L}), 186 | ('messages', 187 | 'wide', 188 | 2, 189 | {'current-domain': 0}), 190 | ('os-command', 191 | 'wide', 192 | 1, 193 | {'last-percent': 0.8, 194 | 'toolbox-percent': 1.0, 195 | 'toolbox-tree-sel': ''})], 196 | 'primary_view_state': {'editor_states': ({'bookmarks': ([[loc('scripts/python/basic_text_processing_00.py'), 197 | {'attrib-starts': [], 198 | 'code-line': ' countlist = Counter(stripped) \r\n', 199 | 'first-line': 6L, 200 | 'folded-linenos': [], 201 | 'sel-line': 44L, 202 | 'sel-line-start': 2135L, 203 | 'selection_end': 2173L, 204 | 'selection_start': 2143L, 205 | 'zoom': 0L}, 206 | 1586504429.747], 207 | [loc('scripts/python/extract_text_data_from_textfiles.py'), 208 | {'attrib-starts': [], 209 | 'code-line': 'filePath = "C:\\\\Users\\\\Ashoo\\\\Documents\\\\Pyt'\ 210 | 'honPlayground\\\\text-analysis\\\\data\\\\plainText"'\ 211 | '\r\n', 212 
| 'first-line': 33L, 213 | 'folded-linenos': [], 214 | 'sel-line': 12L, 215 | 'sel-line-start': 439L, 216 | 'selection_end': 502L, 217 | 'selection_start': 502L, 218 | 'zoom': 0L}, 219 | 1586504435.198], 220 | [loc('scripts/python/basic_text_processing_00.py'), 221 | {'attrib-starts': [], 222 | 'code-line': ' countlist = Counter(stripped) \r\n', 223 | 'first-line': 6L, 224 | 'folded-linenos': [], 225 | 'sel-line': 44L, 226 | 'sel-line-start': 2135L, 227 | 'selection_end': 2173L, 228 | 'selection_start': 2143L, 229 | 'zoom': 0L}, 230 | 1586504437.538], 231 | [loc('scripts/python/extract_text_data_from_pdf.py'), 232 | {'attrib-starts': [], 233 | 'code-line': '\r\n', 234 | 'first-line': 36L, 235 | 'folded-linenos': [], 236 | 'sel-line': 11L, 237 | 'sel-line-start': 717L, 238 | 'selection_end': 717L, 239 | 'selection_start': 717L, 240 | 'zoom': 0L}, 241 | 1586504686.472], 242 | [loc('scripts/python/extract_table_from_pdf.py'), 243 | {'attrib-starts': [], 244 | 'code-line': ' tabula.convert_into(pdfFiles, "pdf_to_cs'\ 245 | 'v.csv", output_format="csv")\r\n', 246 | 'first-line': 10L, 247 | 'folded-linenos': [], 248 | 'sel-line': 28L, 249 | 'sel-line-start': 1279L, 250 | 'selection_end': 1359L, 251 | 'selection_start': 1359L, 252 | 'zoom': 0L}, 253 | 1586505684.863], 254 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 255 | {'attrib-starts': [('convert_into|0|', 256 | 505)], 257 | 'code-line': ' "{} is empty. Check the file, or downloa'\ 258 | 'd it manually.".format(path)\n', 259 | 'first-line': 541L, 260 | 'folded-linenos': [], 261 | 'sel-line': 557L, 262 | 'sel-line-start': 23803L, 263 | 'selection_end': 23803L, 264 | 'selection_start': 23803L, 265 | 'zoom': 0L}, 266 | 1586505728.867], 267 | [loc('scripts/python/extract_table_from_pdf.py'), 268 | {'attrib-starts': [], 269 | 'code-line': ' tabula.convert_into(pdfFiles, "pdf_to_cs'\ 270 | 'v.csv", output_format="csv")\r\n', 271 | 'first-line': 12L, 272 | 'folded-linenos': [], 273 | 'sel-line': 28L, 274 | 'sel-line-start': 1279L, 275 | 'selection_end': 1297L, 276 | 'selection_start': 1297L, 277 | 'zoom': 0L}, 278 | 1586505861.747], 279 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 280 | {'attrib-starts': [('convert_into|0|', 281 | 505)], 282 | 'code-line': ' "{} is empty. 
Check the file, or downloa'\ 283 | 'd it manually.".format(path)\n', 284 | 'first-line': 538L, 285 | 'folded-linenos': [], 286 | 'sel-line': 557L, 287 | 'sel-line-start': 23803L, 288 | 'selection_end': 23883L, 289 | 'selection_start': 23816L, 290 | 'zoom': 0L}, 291 | 1586505996.978], 292 | [loc('scripts/python/extract_table_from_pdf.py'), 293 | {'attrib-starts': [], 294 | 'code-line': " df= tabula.read_pdf(pdfFiles, pages=\"al"\ 295 | "l\", encoding='utf-8', spreadsheet=True)\r\n", 296 | 'first-line': 12L, 297 | 'folded-linenos': [], 298 | 'sel-line': 24L, 299 | 'sel-line-start': 1139L, 300 | 'selection_end': 1228L, 301 | 'selection_start': 1228L, 302 | 'zoom': 0L}, 303 | 1586506020.948], 304 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 305 | {'attrib-starts': [('_run|0|', 306 | 53)], 307 | 'code-line': ' built_options = build_options(**options)\n', 308 | 'first-line': 68L, 309 | 'folded-linenos': [], 310 | 'sel-line': 73L, 311 | 'sel-line-start': 2296L, 312 | 'selection_end': 2296L, 313 | 'selection_start': 2296L, 314 | 'zoom': 0L}, 315 | 1586506030.692], 316 | [loc('scripts/python/extract_table_from_pdf.py'), 317 | {'attrib-starts': [], 318 | 'code-line': " df= tabula.read_pdf(pdfFiles, pages=\"al"\ 319 | "l\", encoding='utf-8')\r\n", 320 | 'first-line': 12L, 321 | 'folded-linenos': [], 322 | 'sel-line': 24L, 323 | 'sel-line-start': 1139L, 324 | 'selection_end': 1210L, 325 | 'selection_start': 1210L, 326 | 'zoom': 0L}, 327 | 1586506053.037], 328 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 329 | {'attrib-starts': [('convert_into|0|', 330 | 505)], 331 | 'code-line': ' "{} is empty. Check the file, or downloa'\ 332 | 'd it manually.".format(path)\n', 333 | 'first-line': 541L, 334 | 'folded-linenos': [], 335 | 'sel-line': 557L, 336 | 'sel-line-start': 23803L, 337 | 'selection_end': 23803L, 338 | 'selection_start': 23803L, 339 | 'zoom': 0L}, 340 | 1586506058.31], 341 | [loc('scripts/python/extract_table_from_pdf.py'), 342 | {'attrib-starts': [], 343 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\ 344 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\ 345 | "\n", 346 | 'first-line': 12L, 347 | 'folded-linenos': [], 348 | 'sel-line': 28L, 349 | 'sel-line-start': 1297L, 350 | 'selection_end': 1394L, 351 | 'selection_start': 1394L, 352 | 'zoom': 0L}, 353 | 1586506153.545], 354 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 355 | {'attrib-starts': [('convert_into|0|', 356 | 505)], 357 | 'code-line': ' "{} is empty. Check the file, or downloa'\ 358 | 'd it manually.".format(path)\n', 359 | 'first-line': 547L, 360 | 'folded-linenos': [], 361 | 'sel-line': 557L, 362 | 'sel-line-start': 23803L, 363 | 'selection_end': 23803L, 364 | 'selection_start': 23803L, 365 | 'zoom': 0L}, 366 | 1586506173.045], 367 | [loc('scripts/python/extract_table_from_pdf.py'), 368 | {'attrib-starts': [], 369 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\ 370 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\ 371 | "\n", 372 | 'first-line': 12L, 373 | 'folded-linenos': [], 374 | 'sel-line': 28L, 375 | 'sel-line-start': 1297L, 376 | 'selection_end': 1394L, 377 | 'selection_start': 1394L, 378 | 'zoom': 0L}, 379 | 1586506203.318], 380 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 381 | {'attrib-starts': [('convert_into|0|', 382 | 505)], 383 | 'code-line': ' "{} is empty. 
Check the file, or downloa'\ 384 | 'd it manually.".format(path)\n', 385 | 'first-line': 547L, 386 | 'folded-linenos': [], 387 | 'sel-line': 557L, 388 | 'sel-line-start': 23803L, 389 | 'selection_end': 23803L, 390 | 'selection_start': 23803L, 391 | 'zoom': 0L}, 392 | 1586506208.703], 393 | [loc('scripts/python/extract_table_from_pdf.py'), 394 | {'attrib-starts': [], 395 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\ 396 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\ 397 | "\n", 398 | 'first-line': 12L, 399 | 'folded-linenos': [], 400 | 'sel-line': 28L, 401 | 'sel-line-start': 1297L, 402 | 'selection_end': 1394L, 403 | 'selection_start': 1394L, 404 | 'zoom': 0L}, 405 | 1586506210.434], 406 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 407 | {'attrib-starts': [('convert_into|0|', 408 | 505)], 409 | 'code-line': ' "{} is empty. Check the file, or downloa'\ 410 | 'd it manually.".format(path)\n', 411 | 'first-line': 535L, 412 | 'folded-linenos': [], 413 | 'sel-line': 557L, 414 | 'sel-line-start': 23803L, 415 | 'selection_end': 23803L, 416 | 'selection_start': 23803L, 417 | 'zoom': 0L}, 418 | 1586506219.002], 419 | [loc('scripts/python/extract_table_from_pdf.py'), 420 | {'attrib-starts': [], 421 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\ 422 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\ 423 | "\n", 424 | 'first-line': 12L, 425 | 'folded-linenos': [], 426 | 'sel-line': 28L, 427 | 'sel-line-start': 1297L, 428 | 'selection_end': 1350L, 429 | 'selection_start': 1340L, 430 | 'zoom': 0L}, 431 | 1586506558.008], 432 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'), 433 | {'attrib-starts': [('convert_into|0|', 434 | 505)], 435 | 'code-line': ' "{} is empty. Check the file, or downloa'\ 436 | 'd it manually.".format(path)\n', 437 | 'first-line': 541L, 438 | 'folded-linenos': [], 439 | 'sel-line': 557L, 440 | 'sel-line-start': 23803L, 441 | 'selection_end': 23803L, 442 | 'selection_start': 23803L, 443 | 'zoom': 0L}, 444 | 1586506571.33]], 445 | 20), 446 | 'current-loc': loc('scripts/python/extract_table_from_pdf.py'), 447 | 'editor-state-list': [(loc('scripts/python/extract_table_from_pdf.py'), 448 | {'attrib-starts': [], 449 | 'code-line': ' df= tabula.read_pd'\ 450 | 'f(pdfFiles, pages="all")\r\n', 451 | 'first-line': 12L, 452 | 'folded-linenos': [], 453 | 'sel-line': 24L, 454 | 'sel-line-start': 1139L, 455 | 'selection_end': 1192L, 456 | 'selection_start': 1192L, 457 | 'zoom': 0L})], 458 | 'has-focus': True, 459 | 'locked': False}, 460 | [loc('scripts/python/extract_table_from_pdf.py')]), 461 | 'open_files': [u'scripts/python/extract_table_from_pdf.py']}, 462 | 'saved_notebook_display': None, 463 | 'split_percents': {0: 0.338}, 464 | 'splits': 2, 465 | 'tab_location': 'top', 466 | 'traversal_pos': ((1, 467 | 1), 468 | 1586506557.899), 469 | 'user_data': {}}, 470 | 'saved_notebook_display': None, 471 | 'split_percents': {}, 472 | 'splits': 1, 473 | 'tab_location': 'left', 474 | 'traversal_pos': ((0, 475 | 0), 476 | 1586504780.092), 477 | 'user_data': {}}, 478 | 'window-alloc': (0, 479 | -1, 480 | 1380, 481 | 739)}]} 482 | guimgr.recent-documents = [loc('scripts/python/extract_table_from_pdf.py')] 483 | guimgr.visual-state = {loc('scripts/python/basic_text_processing_00.py'): {'a'\ 484 | 'ttrib-starts': [], 485 | 'code-line': ' countlist = Counter(stripped) \r\n', 486 | 'first-line': 6L, 487 | 'folded-linenos': [], 488 | 'sel-line': 44L, 489 | 'sel-line-start': 2135L, 490 | 'selection_end': 2173L, 491 | 'selection_start': 
2143L, 492 | 'zoom': 0L}, 493 | loc('scripts/python/extract_table_from_pdf.py'): {'at'\ 494 | 'trib-starts': [], 495 | 'code-line': ' df= tabula.read_pdf(f, stream=True)[0]\r\n', 496 | 'first-line': 0L, 497 | 'folded-linenos': [], 498 | 'sel-line': 21L, 499 | 'sel-line-start': 1031L, 500 | 'selection_end': 1064L, 501 | 'selection_start': 1064L, 502 | 'zoom': 0L}, 503 | loc('scripts/python/extract_text_data_from_pdf.py'): {'a'\ 504 | 'ttrib-starts': [], 505 | 'code-line': '\r\n', 506 | 'first-line': 36L, 507 | 'folded-linenos': [], 508 | 'sel-line': 11L, 509 | 'sel-line-start': 717L, 510 | 'selection_end': 717L, 511 | 'selection_start': 717L, 512 | 'zoom': 0L}, 513 | loc('scripts/python/extract_text_data_from_textfiles.py'): {'a'\ 514 | 'ttrib-starts': [], 515 | 'code-line': 'filePath = "C:\\\\Users\\\\Ashoo\\\\Documents\\\\Pytho'\ 516 | 'nPlayground\\\\text-analysis\\\\data\\\\plainText"\r\n', 517 | 'first-line': 33L, 518 | 'folded-linenos': [], 519 | 'sel-line': 12L, 520 | 'sel-line-start': 439L, 521 | 'selection_end': 502L, 522 | 'selection_start': 502L, 523 | 'zoom': 0L}, 524 | loc('scripts/python/list_environ_path_details.py'): {'a'\ 525 | 'ttrib-starts': [], 526 | 'code-line': ' print(item)', 527 | 'first-line': 0L, 528 | 'folded-linenos': [], 529 | 'sel-line': 3L, 530 | 'sel-line-start': 62L, 531 | 'selection_end': 77L, 532 | 'selection_start': 77L, 533 | 'zoom': 0L}, 534 | loc('scripts/python/pdf_reader.py'): {'attrib-starts': [], 535 | 'code-line': '# Objective 1: Read multiple pdf files from a director'\ 536 | 'y into a list\r\n', 537 | 'first-line': 39L, 538 | 'folded-linenos': [], 539 | 'sel-line': 0L, 540 | 'sel-line-start': 0L, 541 | 'selection_end': 0L, 542 | 'selection_start': 0L, 543 | 'zoom': 0L}, 544 | loc('scripts/python/pdf_table_reader.py'): {'attrib-s'\ 545 | 'tarts': [], 546 | 'code-line': '\r\n', 547 | 'first-line': 9L, 548 | 'folded-linenos': [], 549 | 'sel-line': 17L, 550 | 'sel-line-start': 677L, 551 | 'selection_end': 677L, 552 | 'selection_start': 677L, 553 | 'zoom': 0L}, 554 | loc('../../../Miniconda3/Lib/site-packages/PyPDF2/pdf.py'): {'a'\ 555 | 'ttrib-starts': [('PdfFileReader|0|', 556 | 1043), 557 | ('PdfFileReader|0|.read|0|', 558 | 1684)], 559 | 'code-line': ' raise utils.PdfReadError("EOF marker n'\ 560 | 'ot found")\n', 561 | 'first-line': 1678L, 562 | 'folded-linenos': [], 563 | 'sel-line': 1695L, 564 | 'sel-line-start': 67771L, 565 | 'selection_end': 67771L, 566 | 'selection_start': 67771L, 567 | 'zoom': 0L}, 568 | loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'): {'a'\ 569 | 'ttrib-starts': [('convert_into|0|', 570 | 505)], 571 | 'code-line': ' "{} is empty. 
Check the file, or download '\ 572 | 'it manually.".format(path)\n', 573 | 'first-line': 541L, 574 | 'folded-linenos': [], 575 | 'sel-line': 557L, 576 | 'sel-line-start': 23803L, 577 | 'selection_end': 23803L, 578 | 'selection_start': 23803L, 579 | 'zoom': 0L}, 580 | loc('../../../Miniconda3/Lib/site-packages/tika/tika.py'): {'a'\ 581 | 'ttrib-starts': [('checkTikaServer|0|', 582 | 568)], 583 | 'code-line': ' raise RuntimeError("Unable to start Ti'\ 584 | 'ka server.")\n', 585 | 'first-line': 583L, 586 | 'folded-linenos': [], 587 | 'sel-line': 600L, 588 | 'sel-line-start': 23544L, 589 | 'selection_end': 23544L, 590 | 'selection_start': 23544L, 591 | 'zoom': 0L}, 592 | loc('../../../Miniconda3/Lib/subprocess.py'): {'attri'\ 593 | 'b-starts': [('run|0|', 594 | 430)], 595 | 'code-line': ' output=stdout, st'\ 596 | 'derr=stderr)\n', 597 | 'first-line': 469L, 598 | 'folded-linenos': [], 599 | 'sel-line': 486L, 600 | 'sel-line-start': 17536L, 601 | 'selection_end': 17536L, 602 | 'selection_start': 17536L, 603 | 'zoom': 0L}} 604 | -------------------------------------------------------------------------------- /data/fake_news_data/liar_dataset/README.md: -------------------------------------------------------------------------------- 1 | LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION 2 | 3 | William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL. 4 | ===================================================================== 5 | Description of the TSV format: 6 | 7 | Column 1: the ID of the statement ([ID].json). 8 | Column 2: the label. 9 | Column 3: the statement. 10 | Column 4: the subject(s). 11 | Column 5: the speaker. 12 | Column 6: the speaker's job title. 13 | Column 7: the state info. 14 | Column 8: the party affiliation. 15 | Column 9-13: the total credit history count, including the current statement. 16 | 9: barely true counts. 17 | 10: false counts. 18 | 11: half true counts. 19 | 12: mostly true counts. 20 | 13: pants on fire counts. 21 | Column 14: the context (venue / location of the speech or statement). 22 | 23 | Note that we do not provide the full-text verdict report in this current version of the dataset, 24 | but you can use the following command to access the full verdict report and links to the source documents: 25 | wget http://www.politifact.com//api/v/2/statement/[ID]/?format=json 26 | 27 | ====================================================================== 28 | The original sources retain the copyright of the data. 29 | 30 | Note that there are absolutely no guarantees with this data, 31 | and we provide this dataset "as is", 32 | but you are welcome to report the issues of the preliminary version 33 | of this data. 34 | 35 | You are allowed to use this dataset for research purposes only. 
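For illustration only (not part of the original dataset README): a minimal sketch of reading the TSV files described above in R, assuming the `readr` package is available; the column names below are shorthand derived from the field descriptions above and are not official names.

    library(readr)
    liar_cols <- c("id", "label", "statement", "subject", "speaker", "speaker_job", "state_info",
                   "party_affiliation", "barely_true_counts", "false_counts", "half_true_counts",
                   "mostly_true_counts", "pants_on_fire_counts", "context")
    train <- read_tsv("data/fake_news_data/liar_dataset/train.tsv", col_names = liar_cols)
    table(train$label)   # distribution of the truthfulness labels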
36 | 37 | For more question about the dataset, please contact: 38 | William Wang, william@cs.ucsb.edu -------------------------------------------------------------------------------- /data/readme.md: -------------------------------------------------------------------------------- 1 | data files for text analysis are kept here 2 | -------------------------------------------------------------------------------- /data/text_data_for_analysis.txt: -------------------------------------------------------------------------------- 1 | As a term, data analytics predominantly refers to an assortment of applications, from basic business 2 | intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced 3 | analytics. In that sense, it's similar in nature to business analytics, another umbrella term for 4 | approaches to analyzing data -- with the difference that the latter is oriented to business uses, while 5 | data analytics has a broader focus. The expansive view of the term isn't universal, though: In some 6 | cases, people use data analytics specifically to mean advanced analytics, treating BI as a separate 7 | category. Data analytics initiatives can help businesses increase revenues, improve operational 8 | efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to 9 | emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of 10 | boosting business performance. Depending on the particular application, the data that's analyzed 11 | can consist of either historical records or new information that has been processed for real-time 12 | analytics uses. In addition, it can come from a mix of internal systems and external data sources. At 13 | a high level, data analytics methodologies include exploratory data analysis (EDA), which aims to find 14 | patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical 15 | techniques to determine whether hypotheses about a data set are true or false. EDA is often 16 | compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a 17 | distinction first drawn by statistician John W. Tukey in his 1977 book Exploratory Data Analysis. Data 18 | analytics can also be separated into quantitative data analysis and qualitative data analysis. The 19 | former involves analysis of numerical data with quantifiable variables that can be compared or 20 | measured statistically. The qualitative approach is more interpretive -- it focuses on understanding 21 | the content of non-numerical data like text, images, audio and video, including common phrases, 22 | themes and points of view. 
-------------------------------------------------------------------------------- /figures/Rplot-wordcloud01.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/duttashi/text-analysis/154b34ff3c2fac60e5e4068e48597dbdc815b894/figures/Rplot-wordcloud01.png -------------------------------------------------------------------------------- /figures/readme.md: -------------------------------------------------------------------------------- 1 | All plots/figures live here 2 | -------------------------------------------------------------------------------- /resources/readme.md: -------------------------------------------------------------------------------- 1 | All references to learning resources live here 2 | -------------------------------------------------------------------------------- /scripts/.gitignore: -------------------------------------------------------------------------------- 1 | /.ipynb_checkpoints/ 2 | -------------------------------------------------------------------------------- /scripts/R/HP_preprocess_01.R: -------------------------------------------------------------------------------- 1 | 2 | 3 | # clear the workspace 4 | rm(list=ls()) 5 | 6 | # load harrypotter books from the package harrypotter 7 | library(devtools) 8 | devtools::install_github("bradleyboehmke/harrypotter") 9 | # load the relevant libraries 10 | library(tidyverse) # data manipulation & plotting 11 | library(stringr) # text cleaning and regular expressions 12 | library(tidytext) # provides additional text mining functions 13 | library(harrypotter) # provides the first seven novels of the Harry Potter series 14 | library(ggplot2) # for data visualization 15 | 16 | # read the first two books 17 | titles <- c("Philosopher's Stone", "Chamber of Secrets") 18 | books <- list(philosophers_stone, chamber_of_secrets) 19 | 20 | # sneak peak into the first two chapters of chamber of secrets book 21 | chamber_of_secrets[1:2] 22 | 23 | # Tidying the text 24 | # To properly analyze the text, we need to convert it into a data frame or a tibble. 25 | typeof(chamber_of_secrets) # as you can see, at the moment it is a character vector 26 | text_tb<- tibble(chapter= seq_along(philosophers_stone), 27 | text=philosophers_stone) 28 | str(text_tb) 29 | 30 | # Unnest the text 31 | # Its important to note that the unnest_token function does the following; splits the text into single words, strips all punctuation and converts each word to lowercase for easy comparability. 32 | clean<-text_tb %>% 33 | unnest_tokens(word, text) 34 | 35 | clean_book<- tibble() 36 | clean_book<- rbind(clean_book, clean) 37 | clean_book 38 | 39 | # Basic calculations 40 | # calculate word frequency 41 | word_freq <- clean_book %>% 42 | count(word, sort=TRUE) 43 | word_freq 44 | # lots of stop words like the, and, to, a etc. Let's remove the stop words. 45 | # We can remove the stop words from our tibble with anti_join and the built-in stop_words data set provided by tidytext. 
46 | clean_book %>% 47 | anti_join(stop_words) %>% 48 | count(word, sort=TRUE) %>% 49 | top_n(10) %>% 50 | ggplot(aes(word,n))+ 51 | geom_bar(stat = "identity") 52 | 53 | 54 | 55 | 56 | 57 | -------------------------------------------------------------------------------- /scripts/R/austen_text_analysis.R: -------------------------------------------------------------------------------- 1 | # Script title: austen_text_analysis.R 2 | 3 | # clear the workspace first, then load the data (clearing after loading would erase it) 4 | rm(list=ls()) 5 | 6 | # Load the data 7 | austen.data<- scan("data/plainText/austen.txt", what = "character", sep = "\n") 8 | 9 | # find out the beginning & end of the main text 10 | austen.start<-which(austen.data=="CHAPTER 1") 11 | austen.start # main text begins from line 17 12 | austen.end<-which(austen.data=="THE END") 13 | austen.end # main text ends at line 10609 14 | 15 | # save the metadata to a separate object 16 | austen.startmeta<- austen.data[1:(austen.start-1)] 17 | austen.endmeta<- austen.data[(austen.end+1):length(austen.data)] 18 | metadata<- c(austen.startmeta, austen.endmeta) 19 | novel.data<- austen.data[austen.start: austen.end] 20 | head(novel.data) 21 | tail(novel.data) 22 | 23 | # Formatting the text for subsequent data analysis 24 | novel.data<- paste(novel.data, collapse = " ") 25 | 26 | # Data preprocessing 27 | ## convert all words to lowercase 28 | novel.lower<- tolower(novel.data) 29 | ## extract all words only 30 | novel.words<- strsplit(novel.lower, "\\W") # where \\W is a regex to match any non-word character. 31 | head(novel.words) 32 | str(novel.words) # The words are contained in a list format 33 | novel.words<- unlist(novel.words) 34 | str(novel.words) # Convert list to a character vector 35 | 36 | ## Removing the blanks. First, find out the non blank positions 37 | notblanks<- which(novel.words!="") 38 | head(notblanks) 39 | 40 | novel.wordsonly<-novel.words[notblanks] 41 | head(novel.wordsonly) 42 | totalwords<-length(novel.wordsonly) 43 | 44 | ### Practice: Find the top 10 most frequent words in the text 45 | novel.wordsonly.freq<- table(novel.wordsonly) 46 | novel.wordsonly.freq.sorted<- sort(novel.wordsonly.freq, decreasing = TRUE) 47 | novel.wordsonly.freq.sorted[c(1:10)] 48 | ### Practice: Visualize the top 10 most frequent words in the text 49 | plot(novel.wordsonly.freq.sorted[c(1:10)]) 50 | ## calculating the relative frequency of the words 51 | novel.wordsonly.relfreq<- 100*(novel.wordsonly.freq.sorted/sum(novel.wordsonly.freq.sorted)) 52 | 53 | plot(novel.wordsonly.relfreq[1:10], type="b", 54 | xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n") 55 | axis(1,1:10, labels=names(novel.wordsonly.relfreq [1:10])) -------------------------------------------------------------------------------- /scripts/R/dispersion_plots.R: -------------------------------------------------------------------------------- 1 | # Script name: dispersion_plots.R 2 | # vector to use: novel.wordsonly 3 | # RQ: How to identify where in the text different words occur and how they behave over the course of the text/novel?
4 | 5 | # list files in working directory 6 | ls() 7 | 8 | # step 1: create an integer vector indicating the position of each word in the text 9 | novel.time.v<- seq_along(novel.wordsonly) # n.time.v 10 | # step 2: Say, I want to see all occurrences of the word, "family" in the text 11 | family.v<- which(novel.wordsonly == "family") 12 | family.v 13 | ## Ultimately we want to create a dispersion plot where the x-axis is novel.time.v and the y-axis marks the positions where an instance of "family" is found (1) and is left empty (NA) where it is not 14 | family.v.count<- rep(NA, length(novel.time.v)) # initialize a vector full of NA values 15 | family.v.count[family.v]<-1 # using the numerical positions stored in the family.v object, so the resetting is simple with this expression 16 | family.v.count 17 | 18 | ## Plot showing the distribution of the word, "family" across the novel 19 | plot(family.v.count, main="Dispersion Plot of `family' in Moby Dick", 20 | xlab="Novel Time", ylab="family", type="h", ylim=c(0,1), yaxt='n') 21 | 22 | man.v<- which(novel.wordsonly == "man") 23 | man.v 24 | ## Ultimately we want to create a dispersion plot where the x-axis is novel.time.v and the y-axis marks the positions where an instance of "man" is found (1) and is left empty (NA) where it is not 25 | man.v.count<- rep(NA, length(novel.time.v)) # initialize a vector full of NA values 26 | man.v.count[man.v]<-1 # using the numerical positions stored in the man.v object, so the resetting is simple with this expression 27 | man.v.count 28 | 29 | ## Plot showing the distribution of the word, "man" across the novel 30 | plot(man.v.count, main="Dispersion Plot of `man' in Moby Dick", 31 | xlab="Novel Time", ylab="man", type="h", ylim=c(0,1), yaxt='n') -------------------------------------------------------------------------------- /scripts/R/dracula_text_analysis.R: -------------------------------------------------------------------------------- 1 | # Objective: Read a novel from project gutenberg and then tidy it.
Thereafter, count the number of words and perform basic sentiment analysis 2 | # Script name: dracula_text_analysis.R 3 | 4 | library(gutenbergr) 5 | library(tidytext) 6 | library(tidyverse) 7 | library(tm) # for Corpus() 8 | 9 | # Show all gutenberg works in the gutenbergr package 10 | gutenberg_works<- gutenberg_works(languages = "en") 11 | View(gutenberg_works) 12 | # The id for Bram Stoker's Dracula is 345 13 | dracula<- gutenberg_download(345) 14 | View(dracula) 15 | str(dracula) 16 | 17 | #head(dracula_stripped,169) 18 | #tail(dracula_stripped,15482) 19 | 20 | dracula_tidy<- dracula%>% 21 | unnest_tokens(word, text) %>% 22 | anti_join(stop_words, by="word") 23 | # DATA PREPROCESSING & INITIAL VISUALIZATION 24 | 25 | # Let's create a custom theme 26 | mytheme<- theme_bw()+ 27 | theme(plot.title = element_text(color = "darkred"))+ 28 | theme(panel.border = element_rect(color = "steelblue", size = 2))+ 29 | theme(plot.title = element_text(hjust = 0.5)) # where 0.5 is to center 30 | 31 | # From the count of common words in the above code, now plotting the most common words 32 | dracula_tidy %>% 33 | count(word, sort = TRUE) %>% 34 | filter(n >100 & n<400) %>% 35 | mutate(word = reorder(word, n)) %>% 36 | ggplot(aes(word, n)) + 37 | geom_col() + 38 | #xlab(NULL) + 39 | coord_flip()+ 40 | mytheme+ 41 | ggtitle("Top words in Bram Stoker's Dracula") 42 | 43 | # convert to corpus 44 | novel_corpus<- Corpus(VectorSource(dracula_tidy[,2])) 45 | # preprocessing the novel data; each step must build on the previously cleaned corpus, not the raw one 46 | novel_corpus_clean<- tm_map(novel_corpus, content_transformer(tolower)) 47 | novel_corpus_clean<- tm_map(novel_corpus_clean, removeNumbers) 48 | novel_corpus_clean<- tm_map(novel_corpus_clean, removeWords, stopwords("english")) 49 | novel_corpus_clean<- tm_map(novel_corpus_clean, removePunctuation) 50 | # build a document term matrix from the cleaned corpus. A document term matrix is a table containing the frequency of the words. 51 | dtm<- DocumentTermMatrix(novel_corpus_clean) 52 | # To see the first few documents in the text file 53 | inspect(novel_corpus_clean[1:10]) 54 | # explicitly convert the document term matrix table to matrix format 55 | m<- as.matrix(dtm) 56 | # sort the matrix and store in a new data frame 57 | word_freq<- sort(colSums(m), decreasing = TRUE) 58 | # look at the top 5 words 59 | head(word_freq,5) 60 | # create a character vector 61 | words<- names(word_freq) 62 | # create a data frame having the character vector and its associated number of occurrences or frequency 63 | words_df<- data.frame(word=words, freq=word_freq) 64 | # Plot word frequencies 65 | barplot(words_df[1:10,]$freq, las = 2, names.arg = words_df[1:10,]$word, 66 | col ="lightblue", main ="Most frequent words", 67 | ylab = "Word frequencies") -------------------------------------------------------------------------------- /scripts/R/expr_kaggle-reddit_EDA.R: -------------------------------------------------------------------------------- 1 | # data source: https://www.kaggle.com/maksymshkliarevskyi/reddit-data-science-posts 2 | 3 | # clean the workspace 4 | rm(list = ls()) 5 | 6 | # load required libraries 7 | library(tidyverse) 8 | library(tidyr) 9 | 10 | # 1. reading multiple data files from a folder into separate dataframes 11 | filesPath = "data/kaggle_reddit_data" 12 | temp = list.files(path = filesPath, pattern = "*.csv", full.names = TRUE) 13 | for (i in 1:length(temp)){ 14 | nam <- paste("df",i, sep = "_") 15 | assign(nam, read_csv(temp[i], na=c("","NA"))) 16 | } 17 | 18 | #2.
Read multiple dataframe created at step 1 into a list 19 | df_lst<- lapply(ls(pattern="df_[0-9]+"), function(x) get(x)) 20 | typeof(df_lst) 21 | #3. combining a list of dataframes into a single data frame 22 | # df<- bind_rows(df_lst) 23 | df<- plyr::ldply(df_lst, data.frame) 24 | str(df) 25 | dim(df) # [1] 476970 22 26 | 27 | # lowercase all character variables 28 | df<- df %>% 29 | mutate(across(where(is.character), tolower)) 30 | 31 | # Data Engineering: split col created_date into date and time 32 | df$create_date<- as.Date(df$created_date) 33 | df$create_time<- format(df$created_date,"%H:%M:%S") 34 | df<- separate(df, create_date, c('create_year', 'create_month', 'create_day'), sep = "-",remove = TRUE) 35 | df<- separate(df, create_time, c('create_hour', 'create_min', 'create_sec'), sep = ":",remove = TRUE) 36 | 37 | # drop cols not required for further analysis 38 | df$X1<- NULL 39 | df$created_timestamp<- NULL 40 | df$author_created_utc<- NULL 41 | df$full_link<- NULL 42 | df$post_text<- NULL 43 | df$postURL<- NULL 44 | df$title_text<- NULL 45 | df$created_date<- NULL 46 | 47 | # extract url from post and save as separate column 48 | url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+" 49 | df$postURL <- str_extract_all(df$post, url_pattern) 50 | df$postURL<- NULL 51 | # df$post_text<- str_extract_all(df$post, boundary("word")) 52 | # df$post_text<- str_extract_all(df$post, "[a-z]+") 53 | # df$title_text<- str_extract_all(df$title, "[a-z]+") 54 | 55 | # extract data from list 56 | # head(df$post_text) 57 | # df$postText<- sapply(df$post_text, function(x) x[1]) 58 | # str(df) 59 | 60 | # filter out selected missing values 61 | colSums(is.na(df)) 62 | df1<- df 63 | df1 <- df1 %>% 64 | # filter number of comments less than 0. this will take care of 0 & -1 comments 65 | # filter(num_comments > 0 ) %>% 66 | # filter posts with NA 67 | filter(!is.na(post)) %>% 68 | # filter subreddit_subscribers with NA 69 | filter(!is.na(subreddit_subscribers)) %>% 70 | # filter crossposts with NA 71 | filter(!is.na(num_crossposts)) %>% 72 | # # filter create date, create time with NA 73 | filter(!is.na(create_year)) %>% 74 | filter(!is.na(create_month)) %>% 75 | filter(!is.na(create_day)) %>% 76 | filter(!is.na(create_hour)) %>% 77 | filter(!is.na(create_min)) %>% 78 | filter(!is.na(create_sec)) 79 | colSums(is.na(df1)) 80 | # rearrange the cols 81 | df1<- df1[,c(3,10,14:15,11:13,1:2,4:9)] 82 | 83 | df_clean <- df1 84 | str(df_clean) 85 | # 4. write combined partially clean data to disk 86 | write_csv(df_clean, file = "data/kaggle_reddit_data/reddit_data_clean.csv") 87 | 88 | 89 | 90 | # filter character cols with only text data. 
remove all special characters data 91 | # filter(str_detect(str_to_lower(author), "[a-zA-Z]")) %>% 92 | # filter(str_detect(str_to_lower(title), "[a-zA-Z]")) %>% 93 | # filter(str_detect(str_to_lower(id),"[a-zA-Z0-9]")) %>% 94 | # filter(str_detect(str_to_lower(post),"[a-zA-Z]")) %>% 95 | # # separate date into 3 columns 96 | # separate(create_date, into = c("create_year","create_month","create_day")) %>% 97 | # # separate time into 3 columns 98 | # separate(create_time, into = c("create_hour","create_min","create_sec")) %>% 99 | # # coerce all character cols into factor 100 | # mutate_if(is.character,as.factor) 101 | 102 | ## 103 | # df_clean<-df_clean %>% 104 | # filter(str_detect(str_to_lower(post), url_pattern)) -------------------------------------------------------------------------------- /scripts/R/expr_kaggle-reddit_text-analysis.R: -------------------------------------------------------------------------------- 1 | # read kaggle reddit comments clean file 2 | # required file: data/kaggle_reddit_data/reddit_data_clean.csv 3 | 4 | # clean the workspace 5 | rm(list = ls()) 6 | # load required libraries 7 | library(readr) 8 | library(stringr) 9 | library(tidyverse) 10 | library(tidytext) 11 | 12 | df<- read_csv(file = "data/kaggle_reddit_data/reddit_data_clean.csv") 13 | str(df) 14 | colnames(df) 15 | 16 | # clean the 'post' column 17 | # remove any url contained in post 18 | df$post_c <- gsub('http.* *', "", df$post) 19 | # remove any brackets contained in title 20 | df$title_c <- gsub('\\[|]',"", df$title) 21 | df$title_c <- gsub('\\?',"", df$title_c) 22 | df$title_c <- gsub('\\!',"", df$title_c) 23 | 24 | df1<- df %>% 25 | # add row number 26 | mutate(line=row_number()) %>% 27 | mutate(post_ct = replace(post_c, post_c == '', 'None')) %>% 28 | unnest_tokens(word, post_ct) %>% 29 | anti_join(get_stopwords()) %>% 30 | #mutate(word_count = count(word)) %>% 31 | inner_join(get_sentiments("bing")) %>% # pull out only sentiment words 32 | count(sentiment, word) %>% # count the # of positive & negative words 33 | spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow 34 | mutate(sentiment = positive - negative) # # of positive words - # of negative owrds 35 | str(df1) 36 | 37 | df2<- df %>% 38 | # add row number 39 | mutate(line=row_number()) %>% 40 | mutate(post_ct = replace(post_c, post_c == '', 'None')) %>% 41 | unnest_tokens(word, post_ct) %>% 42 | anti_join(get_stopwords()) %>% 43 | group_by(author) %>% 44 | inner_join(get_sentiments("bing"))%>% 45 | count(word, sentiment, sort = TRUE) %>% 46 | ungroup() %>% 47 | #slice_max(n, n = 10) %>% 48 | #ungroup() %>% 49 | mutate(word = reorder(word, n)) 50 | view(df2) 51 | 52 | df2 %>% 53 | group_by(sentiment) %>% 54 | slice_max(n, n = 10) %>% 55 | ungroup() %>% 56 | mutate(word = reorder(word, n)) %>% 57 | ggplot(aes(n, word, fill = sentiment)) + 58 | geom_col(show.legend = FALSE) + 59 | facet_wrap(~sentiment, scales = "free_y") + 60 | labs(x = "Contribution to sentiment", 61 | y = NULL)+ 62 | theme_bw() 63 | 64 | -------------------------------------------------------------------------------- /scripts/R/initialScript-1.R: -------------------------------------------------------------------------------- 1 | # script title: initialScript-1.R 2 | # Task: Accessing and comparing word frequency data 3 | 4 | ## comparing the usage of he vs. she and him vs. 
her 5 | novel.wordsonly.freq.sorted["he"] # "he" occurs 1876 times 6 | novel.wordsonly.freq.sorted["she"] # "she" occurs 114 times 7 | novel.wordsonly.freq.sorted["him"] # "him" occurs 1058 times 8 | novel.wordsonly.freq.sorted["her"] # "her" occurs 330 times 9 | ### This clearly indicates that moby dick is a male oriented book 10 | novel.wordsonly.freq.sorted["him"]/novel.wordsonly.freq.sorted["her"] # him is 3.2 times more frequent than her 11 | novel.wordsonly.freq.sorted["he"]/novel.wordsonly.freq.sorted["she"] # he is 16 times more frequent than she 12 | 13 | ## calculating the relative frequency of the words 14 | novel.wordsonly.relfreq<- 100*(novel.wordsonly.freq.sorted/sum(novel.wordsonly.freq.sorted)) 15 | novel.wordsonly.relfreq["the"] # "the" occurs 6.6 times per every 100 words in the novel Moby Dick 16 | 17 | plot(novel.wordsonly.relfreq[1:10], type="b", 18 | xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n") 19 | axis(1,1:10, labels=names(novel.wordsonly.relfreq [1:10])) 20 | -------------------------------------------------------------------------------- /scripts/R/initialScript.R: -------------------------------------------------------------------------------- 1 | 2 | # Load the data 3 | text.data<- scan("data/plainText/melville.txt", what = "character", sep = "\n") 4 | str(text.data) 5 | text.data[1] 6 | text.data[408] # The main text of the novel, "moby dick" begins at line 408 7 | text.data[18576] # The main text of the novel, "moby dick" ends at line 18576 8 | 9 | # To find out the begining & end of of the main text, do the following 10 | text.start<- which(text.data=="CHAPTER 1. Loomings.") #408L 11 | text.end<- which(text.data=="orphan.") #18576L 12 | text.start 13 | text.end 14 | 15 | # what is the length of text 16 | length(text.data) # there are 18,874 lines of text in the file 17 | # save the metadata to a separate object 18 | text.startmetadata<- text.data[1:text.start-1] 19 | text.endmetadata<- text.data[(text.end+1):length(text.end)] 20 | metadata<- c(text.startmetadata, text.endmetadata) 21 | novel.data<- text.data[text.start: text.end] 22 | head(novel.data) 23 | tail(novel.data) 24 | 25 | # Formatting the text for subsequent data analysis 26 | 27 | ## Get rid of line breaks "\n" such that all line are put into one long string. 28 | ## This is achived using paste function to join and collapse all lines into one long string 29 | novel.data<- paste(novel.data, collapse = " ") 30 | ## The paste function with the collapse argument provides a way of gluing together a bunch of separate pieces using a glue character that you define as the value for the collapse argument. In this case, you are going to glue together the lines (the pieces) using a blank space character (the glue). 31 | 32 | # Data preprocessing 33 | ## convert all words to lowercase 34 | novel.lower<- tolower(novel.data) 35 | ## extract all words only 36 | novel.words<- strsplit(novel.lower, "\\W") # where \\W is a regex to match any non-word character. 37 | head(novel.words) 38 | str(novel.words) # The words are contained in a list format 39 | 40 | novel.words<- unlist(novel.words) 41 | str(novel.words) # Convert list to a character vector 42 | 43 | ## Removing the blanks. 
First, find out the non blank positions 44 | notblanks<- which(novel.words!="") 45 | head(notblanks) 46 | 47 | novel.wordsonly<-novel.words[notblanks] 48 | head(novel.wordsonly) 49 | totalwords<-length(novel.wordsonly) # total words=214889 words 50 | whale.hits<-length(novel.wordsonly[which(novel.wordsonly=="whale")]) # the word 'whale' occurs 1150 times 51 | whale.hits.perct<- whale.hits/totalwords 52 | whale.hits.perct # The word 'whale' makes up a proportion of about 0.005 (roughly 0.5%) of the whole text 53 | length(unique(novel.wordsonly)) # there are 16,872 unique words in the novel 54 | 55 | novel.wordsonly.freq<- table(novel.wordsonly) # the table() will build up the contingency table that contains the count of every word occurrence 56 | novel.wordsonly.freq.sorted<- sort(novel.wordsonly.freq, decreasing = TRUE) # sort the data with most frequent words first followed by least frequent words 57 | head(novel.wordsonly.freq.sorted) 58 | tail(novel.wordsonly.freq.sorted) 59 | 60 | ### Practice: Find the top 10 most frequent words in the text 61 | novel.wordsonly.freq.sorted[c(1:10)] 62 | ### Practice: Visualize the top 10 most frequent words in the text 63 | plot(novel.wordsonly.freq.sorted[c(1:10)]) 64 | -------------------------------------------------------------------------------- /scripts/R/mobydick_novel_text_analysis.R: -------------------------------------------------------------------------------- 1 | # clear the workspace 2 | rm(list = ls()) 3 | 4 | # Load the moby dick novel 5 | mobydick.data<- scan("data/plainText/melville.txt", what = "character", sep = "\n") 6 | 7 | # find out the beginning & end of the main text 8 | mobydick.start<-which(mobydick.data=="CHAPTER 1. Loomings.") 9 | mobydick.end<-which(mobydick.data=="orphan.") 10 | 11 | 12 | # save the metadata to a separate object 13 | mobydick.startmeta<- mobydick.data[1:(mobydick.start-1)] 14 | mobydick.endmeta<- mobydick.data[(mobydick.end+1):length(mobydick.data)] 15 | metadata<- c(mobydick.startmeta, mobydick.endmeta) 16 | novel.data<- mobydick.data[mobydick.start: mobydick.end] 17 | head(novel.data) 18 | tail(novel.data) 19 | 20 | # Formatting the text for subsequent data analysis 21 | novel.data<- paste(novel.data, collapse = " ") 22 | 23 | # Data preprocessing 24 | ## convert all words to lowercase 25 | novel.lower<- tolower(novel.data) 26 | ## extract all words only 27 | novel.words<- strsplit(novel.lower, "\\W") # where \\W is a regex to match any non-word character. 28 | head(novel.words) 29 | str(novel.words) # The words are contained in a list format 30 | novel.words<- unlist(novel.words) 31 | str(novel.words) # Convert list to a character vector 32 | 33 | ## Removing the blanks.
First, find out the non blank positions 34 | notblanks<- which(novel.words!="") 35 | head(notblanks) 36 | 37 | novel.wordsonly<-novel.words[notblanks] 38 | head(novel.wordsonly) 39 | totalwords<-length(novel.wordsonly) 40 | 41 | # Count the unique word occurrences 42 | length(unique(novel.wordsonly)) # 16,872 unique words 43 | # count the number of whale hits 44 | length(novel.wordsonly[novel.wordsonly=="whale"]) # 1,150 times the word whale appears 45 | 46 | # Accessing and understanding word data: build the frequency table first, then look up individual words 47 | novel.wordsonly.freq<- table(novel.wordsonly) 48 | novel.wordsonly.freq.sorted<- sort(novel.wordsonly.freq, decreasing = TRUE) 49 | novel.wordsonly.freq.sorted["he"] # The word he occurs 1876 times 50 | novel.wordsonly.freq.sorted["she"] # 114 51 | novel.wordsonly.freq.sorted["him"] # 1058 52 | novel.wordsonly.freq.sorted["her"] # 330 53 | 54 | ### Practice: Find the top 10 most frequent words in the text 55 | novel.wordsonly.freq.sorted[c(1:10)] 56 | ### Practice: Visualize the top 10 most frequent words in the text 57 | plot(novel.wordsonly.freq.sorted[c(1:10)]) 58 | ## calculating the relative frequency of the words 59 | novel.wordsonly.relfreq<- 100*(novel.wordsonly.freq.sorted/sum(novel.wordsonly.freq.sorted)) 60 | 61 | plot(novel.wordsonly.relfreq[1:10], type="b", 62 | xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n") 63 | axis(1,1:10, labels=names(novel.wordsonly.relfreq [1:10])) 64 | 65 | # Dispersion Analysis 66 | whales.v<- which(novel.wordsonly=="whale") 67 | 68 | 69 | 70 | 71 | -------------------------------------------------------------------------------- /scripts/R/structural_topic_modeling_00.R: -------------------------------------------------------------------------------- 1 | # Objective: To explore topic modeling and structural topic modeling: what are they and how are they useful for text mining? 2 | # script create date: 10/11/2018 3 | # script modified date: 4 | # script name: structural_topic_modeling_00.R 5 | 6 | # Topic modelling: In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. 7 | # The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. 8 | # Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in “health”, “doctor”, “patient”, “hospital” for a topic such as Healthcare, and “farm”, “crops”, “wheat” for a topic such as “Farming”.
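# Illustrative aside (not part of the original script): a minimal, self-contained sketch of the
# topic-model idea described above, assuming the topicmodels and tidytext packages are installed.
# It builds a toy document-term matrix and fits a two-topic LDA model, mirroring the
# Healthcare/Farming intuition; with real data the word counts would come from a corpus.
library(tidytext)
library(topicmodels)
toy_counts <- tibble::tibble(
  doc  = c("d1", "d1", "d1", "d2", "d2", "d2"),
  word = c("doctor", "patient", "hospital", "farm", "crops", "wheat"),
  n    = c(3, 2, 2, 3, 2, 2))
toy_dtm <- cast_dtm(toy_counts, doc, word, n)               # cast the tidy counts to a document-term matrix
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 123))  # fit Latent Dirichlet Allocation with 2 topics
tidy(toy_lda, matrix = "beta")                              # per-topic word probabilities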
9 | 10 | # Algorithms for Topic Modeling 11 | # Term Frequency, 12 | # Inverse Document Frequency, 13 | # NonNegative Matrix Factorization techniques, 14 | # Latent Dirichlet Allocation is the most popular topic modeling technique 15 | 16 | # reference: http://www.structuraltopicmodel.com/ 17 | # reference: https://en.wikipedia.org/wiki/Topic_model 18 | # reference: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/ 19 | # reference: https://juliasilge.com/blog/sherlock-holmes-stm/ 20 | # reference: https://blogs.uoregon.edu/rclub/2016/04/05/structural-topic-modeling/ 21 | # reference: https://juliasilge.github.io/ibm-ai-day/slides.html#39 22 | 23 | # load the required packages 24 | library(tidyverse) 25 | library(gutenbergr) 26 | library(tidytext) # for unnest_tokens() 27 | library(stm) 28 | 29 | # Show all gutenberg works in the gutenbergr package 30 | gutenberg_works<- gutenberg_works(languages = "en") 31 | View(gutenberg_works) 32 | # The id for novel, "The Adventures of Sherlock Holmes" is 1661 33 | sherlock_raw<- gutenberg_download(1661) 34 | 35 | sherlock <- sherlock_raw %>% 36 | mutate(story = ifelse(str_detect(text, "ADVENTURE"), 37 | text, 38 | NA)) %>% 39 | fill(story) %>% 40 | filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>% 41 | mutate(story = factor(story, levels = unique(story))) 42 | 43 | # create a custom function to reorder a column before plotting with facetting, such that the values are ordered within each facet. 44 | # reference: https://github.com/dgrtwo/drlib 45 | reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) { 46 | new_x <- paste(x, within, sep = sep) 47 | stats::reorder(new_x, by, FUN = fun) 48 | } 49 | 50 | scale_x_reordered <- function(..., sep = "___") { 51 | reg <- paste0(sep, ".+$") 52 | ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...) 53 | } 54 | 55 | scale_y_reordered <- function(..., sep = "___") { 56 | reg <- paste0(sep, ".+$") 57 | ggplot2::scale_y_discrete(labels = function(x) gsub(reg, "", x), ...) 58 | } 59 | 60 | # Transform the text data into tidy format using `unnest_tokens()` and remove stopwords 61 | sherlock_tidy<- sherlock %>% 62 | unnest_tokens(word, text) %>% 63 | anti_join(stop_words, by="word") %>% 64 | filter(word != "holmes") 65 | 66 | sherlock_tidy %>% 67 | count(word, sort = TRUE) 68 | 69 | # Determine the highest tf_idf words in the 12 stories on sherlock holmes? we will use the function bind_tf_idf() from the tidytext package 70 | 71 | sherlock_tf_idf <- sherlock_tidy %>% 72 | count(story, word, sort = TRUE) %>% 73 | bind_tf_idf(word, story, n) %>% 74 | arrange(-tf_idf) %>% 75 | group_by(story) %>% 76 | top_n(10) %>% 77 | ungroup 78 | 79 | sherlock_tf_idf %>% 80 | mutate(word = reorder_within(word, tf_idf, story)) %>% 81 | ggplot(aes(word, tf_idf, fill = story)) + 82 | geom_col(alpha = 0.8, show.legend = FALSE) + 83 | facet_wrap(~ story, scales = "free", ncol = 3) + 84 | scale_x_reordered() + 85 | coord_flip() + 86 | theme(strip.text=element_text(size=11)) + 87 | labs(x = NULL, y = "tf-idf", 88 | title = "Highest tf-idf words in Sherlock Holmes short stories", 89 | subtitle = "Individual stories focus on different characters and narrative elements") 90 | 91 | # Exploring tf-idf can be helpful before training topic models. 92 | # let’s get started on a topic model! Using the stm package. 
93 | sherlock_sparse <- sherlock_tidy %>% 94 | count(story, word, sort = TRUE) %>% 95 | cast_sparse(story, word, n) 96 | 97 | # Now training a topic model with 6 topics, but the stm includes lots of functions and support for choosing an appropriate number of topics for your model. 98 | topic_model <- stm(sherlock_sparse, K = 6, 99 | verbose = FALSE, init.type = "Spectral") -------------------------------------------------------------------------------- /scripts/R/text_analysis_example01.R: -------------------------------------------------------------------------------- 1 | # Sample script for text analysis with R 2 | # Reference: Below is an example from https://github.com/juliasilge/tidytext 3 | library(janeaustenr) 4 | library(dplyr) 5 | 6 | original_books <- austen_books() %>% 7 | group_by(book) %>% 8 | mutate(linenumber = row_number()) %>% 9 | ungroup() 10 | 11 | original_books 12 | 13 | library(tidytext) 14 | tidy_books <- original_books %>% 15 | unnest_tokens(word, text) 16 | 17 | tidy_books 18 | 19 | data("stop_words") 20 | tidy_books <- tidy_books %>% 21 | anti_join(stop_words) 22 | tidy_books %>% 23 | count(word, sort = TRUE) 24 | # Sentiment analysis can be done as an inner join. Three sentiment lexicons are available via the get_sentiments() function. 25 | # sentiment analysis 26 | library(tidyr) 27 | get_sentiments("bing") 28 | 29 | janeaustensentiment <- tidy_books %>% 30 | inner_join(get_sentiments("bing"), by = "word") %>% 31 | count(book, index = linenumber %/% 80, sentiment) %>% 32 | spread(sentiment, n, fill = 0) %>% 33 | mutate(sentiment = positive - negative) 34 | 35 | janeaustensentiment 36 | 37 | library(ggplot2) 38 | 39 | ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) + 40 | geom_bar(stat = "identity", show.legend = FALSE) + 41 | facet_wrap(~book, ncol = 2, scales = "free_x") 42 | 43 | -------------------------------------------------------------------------------- /scripts/R/text_analysis_example02.R: -------------------------------------------------------------------------------- 1 | # clear the workspace 2 | rm(list = ls()) 3 | # load the required libraries 4 | library(tidytext) 5 | library(magrittr) 6 | 7 | # Read the text file 8 | df<- scan("data/text_data_for_analysis.txt", what = "character", sep = "\n") 9 | 10 | # basic eda 11 | str(df) 12 | typeof(df) # character vector 13 | 14 | # convert all words to lowercase 15 | df<- tolower(df) 16 | ## extract all words only 17 | df<- strsplit(df, "\\W") # where \\W is a regex to match any non-word character. 18 | head(df) 19 | df.words<- unlist(df) 20 | ## Removing the blanks. 
First, find out the non blank positions 21 | notblanks<- which(df.words!="") 22 | head(notblanks) 23 | df.wordsonly<-df.words[notblanks] 24 | # Count the unique word occurrences 25 | length(unique(df.wordsonly)) # 194 unique words 26 | 27 | # Accessing and understanding word data 28 | df.wordsonly.freq<- table(df.wordsonly) 29 | df.wordsonly.freq_sorted<- sort(df.wordsonly.freq, decreasing = TRUE) # sort the frequency table, not the raw word vector 30 | df.wordsonly.freq_sorted[c(1:10)] 31 | 32 | length(df.wordsonly) 33 | library(tidytext) # for unnest_tokens() 34 | library(dplyr) # for tibble() and anti_join() 35 | df.tbl<- tibble(idx=c(1:length(df.wordsonly)), text= df.wordsonly) 36 | str(df.tbl) 37 | df.tbl.tidy<- df.tbl %>% 38 | unnest_tokens(word, text) # output column first, then the input column 39 | str(df.tbl.tidy) 40 | 41 | df.tbl.tidy %>% 42 | count(word, sort = TRUE) 43 | 44 | 45 | -------------------------------------------------------------------------------- /scripts/R/token_distribution_analysis.R: -------------------------------------------------------------------------------- 1 | 2 | # clean the workspace 3 | rm(list = ls()) 4 | 5 | # Load the data 6 | text.v <- scan("data/plainText/melville.txt", what="character", sep="\n") 7 | start.v <- which(text.v == "CHAPTER 1. Loomings.") 8 | end.v <- which(text.v == "orphan.") 9 | novel.lines.v <- text.v[start.v:end.v] 10 | novel.lines.v 11 | 12 | ## Identify the chapter break positions in the vector using the grep function 13 | chap.positions.v <- grep("^CHAPTER \\d", novel.lines.v) # the start of a line is marked by the caret symbol ^, followed by the capitalized word CHAPTER, a space, and then any digit (digits are represented by an escaped d, as in \\d) 14 | novel.lines.v[chap.positions.v] # we can see there are 135 chapter headings 15 | 16 | ## Identify the text break positions in chapters 17 | novel.lines.v <- c(novel.lines.v, "END") 18 | last.position.v <- length(novel.lines.v) 19 | chap.positions.v <- c(chap.positions.v , last.position.v) 20 | ## Now, we have to figure out how to process the text, that is, the actual content of each chapter that appears between each of these chapter markers.
21 | ## We will use the for loop 22 | 23 | for(i in 1:length(chap.positions.v)){ 24 | print(paste("Chapter ",i, " begins at position ", 25 | chap.positions.v[i]), sep="") 26 | } # end for 27 | chapter.raws.l <- list() 28 | chapter.freqs.l <- list() 29 | 30 | for(i in 1:length(chap.positions.v)){ 31 | if(i != length(chap.positions.v)){ 32 | chapter.title <- novel.lines.v[chap.positions.v[i]] 33 | start <- chap.positions.v[i]+1 34 | end <- chap.positions.v[i+1]-1 35 | chapter.lines.v <- novel.lines.v[start:end] 36 | chapter.words.v <- tolower(paste(chapter.lines.v, collapse=" ")) 37 | chapter.words.l <- strsplit(chapter.words.v, "\\W") 38 | chapter.word.v <- unlist(chapter.words.l) 39 | chapter.word.v <- chapter.word.v[which(chapter.word.v!="")] 40 | chapter.freqs.t <- table(chapter.word.v) 41 | chapter.raws.l[[chapter.title]] <- chapter.freqs.t 42 | chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t)) 43 | chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel 44 | } 45 | } 46 | -------------------------------------------------------------------------------- /scripts/R/topic_modeling_00.R: -------------------------------------------------------------------------------- 1 | # Objective: Topic Modelling 2 | # script create date: 27/4/2019 3 | # reference: https://www.tidytextmining.com/topicmodeling.html 4 | 5 | library(topicmodels) 6 | data("AssociatedPress") 7 | AssociatedPress 8 | # set a seed so that the output of the model is predictable 9 | ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234)) 10 | #ap_lda 11 | 12 | # word topic probabilities 13 | library(tidytext) 14 | ap_topics <- tidy(ap_lda, matrix = "beta") 15 | ap_topics # the model is turned into a one-topic-per-term-per-row format. 16 | # For each combination, the model computes the probability of that term being generated from that topic. For example, the term “aaron” has a \(1.686917\times 10^{-12}\) probability of being generated from topic 1, but a \(3.8959408\times 10^{-5}\) probability of being generated from topic 2. 
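# A quick sanity check of that interpretation (a sketch added for illustration; it assumes the ap_topics
# tibble created above with tidy(ap_lda, matrix = "beta")): because beta is a per-topic probability
# distribution over terms, the beta values within each topic should sum to 1.
# library(dplyr)
# ap_topics %>%
#   group_by(topic) %>%
#   summarise(total_beta = sum(beta))  # expect a value of ~1 for topic 1 and for topic 2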
17 | 18 | library(ggplot2) 19 | library(dplyr) 20 | 21 | ap_top_terms <- ap_topics %>% 22 | group_by(topic) %>% 23 | top_n(10, beta) %>% 24 | ungroup() %>% 25 | arrange(topic, -beta) 26 | ap_top_terms %>% 27 | mutate(term = reorder(term, beta)) %>% 28 | ggplot(aes(term, beta, fill = factor(topic))) + 29 | geom_col(show.legend = FALSE) + 30 | facet_wrap(~ topic, scales = "free") + 31 | coord_flip() 32 | -------------------------------------------------------------------------------- /scripts/R/topic_modeling_01.R: -------------------------------------------------------------------------------- 1 | # Topic Modelling - 01 2 | 3 | # resources 4 | # https://www.kaggle.com/regiso/tips-and-tricks-for-building-topic-models-in-r 5 | # https://www.tidytextmining.com/topicmodeling.html 6 | # http://www.bernhardlearns.com/2017/05/topic-models-lda-and-ctm-in-r-with.html 7 | # https://cran.r-project.org/web/packages/tidytext/vignettes/topic_modeling.html 8 | 9 | 10 | library(dplyr) 11 | library(gutenbergr) 12 | titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds", 13 | "Pride and Prejudice", "Great Expectations") 14 | books <- gutenberg_works(title %in% titles) %>% 15 | gutenberg_download(meta_fields = "title") 16 | books 17 | 18 | library(tidytext) 19 | library(stringr) 20 | library(tidyr) 21 | 22 | by_chapter <- books %>% 23 | group_by(title) %>% 24 | mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>% 25 | ungroup() %>% 26 | filter(chapter > 0) 27 | 28 | by_chapter_word <- by_chapter %>% 29 | unite(title_chapter, title, chapter) %>% 30 | unnest_tokens(word, text) 31 | 32 | word_counts <- by_chapter_word %>% 33 | anti_join(stop_words) %>% 34 | count(title_chapter, word, sort = TRUE) 35 | 36 | word_counts 37 | 38 | # Latent Dirichlet Allocation with the topicmodels package 39 | # Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a DocumentTermMatrix (from the tm package). As described in this vignette, we can cast a one-token-per-row table into a DocumentTermMatrix with tidytext’s cast_dtm: 40 | 41 | chapters_dtm <- word_counts %>% 42 | cast_dtm(title_chapter, word, n) 43 | 44 | chapters_dtm 45 | 46 | # Now we are ready to use the topicmodels package to create a four topic LDA model. 
47 | 48 | library(topicmodels) 49 | chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234)) 50 | chapters_lda 51 | chapters_lda_td <- tidy(chapters_lda) 52 | chapters_lda_td 53 | top_terms <- chapters_lda_td %>% 54 | group_by(topic) %>% 55 | top_n(5, beta) %>% 56 | ungroup() %>% 57 | arrange(topic, -beta) 58 | 59 | top_terms 60 | library(ggplot2) 61 | theme_set(theme_bw()) 62 | 63 | top_terms %>% 64 | mutate(term = reorder(term, beta)) %>% 65 | ggplot(aes(term, beta)) + 66 | geom_bar(stat = "identity") + 67 | facet_wrap(~ topic, scales = "free") + 68 | theme(axis.text.x = element_text(size = 15, angle = 65, hjust = 1)) 69 | -------------------------------------------------------------------------------- /scripts/R/wuthering_heights_sentiment_analysis.R: -------------------------------------------------------------------------------- 1 | # SENTIMENT ANALYSIS 2 | 3 | # required packages 4 | library(tidyverse) 5 | library(tidytext) 6 | library(stringi) 7 | library(SentimentAnalysis) 8 | 9 | # Analyze sentiment 10 | sentiment<- analyzeSentiment(dtm, language = "english", 11 | removeStopwords = TRUE) 12 | # Extract dictionary-based sentiment according to the QDAP dictionary 13 | sentiment$SentimentQDAP 14 | # View sentiment direction (i.e. positive, neutral and negative) 15 | convertToDirection(sentiment$SentimentQDAP) 16 | 17 | # The three general-purpose lexicons available in tidytext package are 18 | # 19 | # AFINN from Finn Årup Nielsen, 20 | # bing from Bing Liu and collaborators, and 21 | # nrc from Saif Mohammad and Peter Turney. 22 | 23 | # I will use Bing lexicon which is simply a tibble with words and positive and negative words 24 | get_sentiments("bing") %>% head 25 | 26 | wuthering_heights %>% 27 | # split text into words 28 | unnest_tokens(word, text) %>% 29 | # remove stop words 30 | anti_join(stop_words, by = "word") %>% 31 | # add sentiment scores to words 32 | left_join(get_sentiments("bing"), by = "word") %>% 33 | # count number of negative and positive words 34 | count(word, sentiment) %>% 35 | spread(key = sentiment, value = n) %>% 36 | #ungroup %>% 37 | # create centered score 38 | mutate(sentiment = positive - negative - 39 | mean(positive - negative)) %>% 40 | # select_if(sentiment) %>% 41 | # reorder word levels 42 | #mutate(word = factor(as.character(word), levels = levels(word)[61:1])) %>% 43 | # plot 44 | ggplot(aes(x = word, y = sentiment)) + 45 | geom_bar(stat = "identity", aes(fill = word)) + 46 | theme_classic() + 47 | theme(axis.text.x = element_text(angle = 90)) + 48 | coord_flip() + 49 | #ylim(0, 200) + 50 | coord_cartesian(ylim=c(0,40)) + # Explain ggplot2 warning: “Removed k rows containing missing values” 51 | ggtitle("Centered sentiment scores", 52 | subtitle = "for Wuthering Heights") -------------------------------------------------------------------------------- /scripts/R/wuthering_heights_text_analysis.R: -------------------------------------------------------------------------------- 1 | # clean the workspace environment 2 | # Reference: http://tidytextmining.com/tidytext.html 3 | # Reference: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html 4 | 5 | rm(list = ls()) 6 | 7 | # Load the required libraries 8 | library(gutenbergr) 9 | library(dplyr) 10 | library(tidytext) 11 | library(ggplot2) 12 | 13 | # READ or DOWNLOAD THE DATA 14 | # find the gutenberg title for analysis. 
Look for the "gutenberg_id" 15 | gutenberg_works(title == "Wuthering Heights") 16 | 17 | # Once the id is found, pass it as a value to function "gutenberg_download() as shown below 18 | wuthering_heights <- gutenberg_download(768) 19 | 20 | # Look at the data structure, its a tibble 21 | head(wuthering_heights) 22 | # Notice, it has two columns, the first column contains "gutenberg_id" and the second col contains the novel text. 23 | # Do not remove the first column even though it contains a single repeating value 768, because, this column will be used subsequently for filtering data 24 | 25 | # DATA PREPROCESSING & INITIAL VISUALIZATION 26 | 27 | ## Step 1: To create a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function. 28 | tidy_novel_data<- wuthering_heights %>% 29 | unnest_tokens(word, text) %>% 30 | anti_join(stop_words, by="word") 31 | 32 | # use dplyr’s count() to find the most common words in the novel 33 | 34 | tidy_novel_data %>% 35 | count(word, sort=TRUE) 36 | 37 | # Lets create a custom theme 38 | mytheme<- theme_bw()+ 39 | theme(plot.title = element_text(color = "darkred"))+ 40 | theme(panel.border = element_rect(color = "steelblue", size = 2))+ 41 | theme(plot.title = element_text(hjust = 0.5)) # where 0.5 is to center 42 | 43 | # From the count of common words from above code, now plotting the most common words 44 | tidy_novel_data %>% 45 | count(word, sort = TRUE) %>% 46 | filter(n >100 & n<400) %>% 47 | mutate(word = reorder(word, n)) %>% 48 | ggplot(aes(word, n)) + 49 | geom_col() + 50 | #xlab(NULL) + 51 | coord_flip()+ 52 | mytheme+ 53 | ggtitle("Top words in Wuthering Heights") 54 | 55 | # WORDCLOUD 56 | # load libraries 57 | library("tm") 58 | library("SnowballC") 59 | library("wordcloud") 60 | 61 | # convert to corpus 62 | novel_corpus<- Corpus(VectorSource(tidy_novel_data[,2])) 63 | # preprocessing the novel data 64 | novel_corpus_clean<- tm_map(novel_corpus, tolower) 65 | novel_corpus_clean<- tm_map(novel_corpus, removeNumbers) 66 | novel_corpus_clean<- tm_map(novel_corpus, removeWords, stopwords("english")) 67 | novel_corpus_clean<- tm_map(novel_corpus, removePunctuation) 68 | # To see the first few documents in the text file 69 | inspect(novel_corpus)[1:10] 70 | 71 | # build a document term matrix. Document matrix is a table containing the frequency of the words. 
72 | dtm<- DocumentTermMatrix(novel_corpus_clean) 73 | # explicitly convert the document term matrix table to matrix format 74 | m<- as.matrix(dtm) 75 | # sort the matrix and store in a new data frame 76 | word_freq<- sort(colSums(m), decreasing = TRUE) 77 | # look at the top 5 words 78 | head(word_freq,5) 79 | # create a character vector 80 | words<- names(word_freq) 81 | # create a data frame having the character vector and its associated number of occurences or frequency 82 | words_df<- data.frame(word=words, freq=word_freq) 83 | 84 | # create the first word cloud 85 | set.seed(1234) 86 | wordcloud (words_df$word, words_df$freq, scale=c(4,0.5), random.order=FALSE, rot.per=0.35, 87 | use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"), max.words = 100) 88 | # create the second word cloud 89 | wordcloud (words_df$word, words_df$freq, scale=c(4,0.5), random.order=FALSE, rot.per=0.35, 90 | use.r.layout=FALSE, colors=brewer.pal(8, "Accent"), max.words = 100) 91 | 92 | # Explore frequent terms and their association with each other 93 | findFreqTerms(dtm, lowfreq = 4) 94 | findMostFreqTerms(dtm) 95 | # You can analyze the association between frequent terms (i.e., terms which correlate) using findAssocs() function. 96 | #findAssocs(dtm, terms = "master", corlimit = 0.3) 97 | 98 | # Plot word frequencies 99 | barplot(words_df[1:10,]$freq, las = 2, names.arg = words_df[1:10,]$word, 100 | col ="lightblue", main ="Most frequent words", 101 | ylab = "Word frequencies") 102 | 103 | -------------------------------------------------------------------------------- /scripts/python/.gitignore: -------------------------------------------------------------------------------- 1 | /.ipynb_checkpoints/ 2 | -------------------------------------------------------------------------------- /scripts/python/00_topic_modelling_fundamentals.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### Installing the required libraries (*if not already installed*)" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# conda install -c anaconda nltk\n", 17 | "# conda install -c anaconda gensim" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 7, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "#Here are the sample documents combining together to form a corpus.\n", 27 | "\n", 28 | "doc1 = \"Sugar is bad to consume. 
My sister likes to have sugar, but not my father.\"\n", 29 | "doc2 = \"My father spends a lot of time driving my sister around to dance practice.\"\n", 30 | "doc3 = \"Doctors suggest that driving may cause increased stress and blood pressure.\"\n", 31 | "doc4 = \"Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.\"\n", 32 | "doc5 = \"Health experts say that Sugar is not good for your lifestyle.\"\n", 33 | "\n", 34 | "# compile documents\n", 35 | "doc_complete = [doc1, doc2, doc3, doc4, doc5]" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "#### Preprocessing" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 8, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "# required libraries\n", 52 | "from nltk.corpus import stopwords \n", 53 | "from nltk.stem.wordnet import WordNetLemmatizer\n", 54 | "import string" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "import nltk\n", 64 | "from nltk.corpus import stopwords\n", 65 | "nltk.download('stopwords')\n", 66 | "stopwords = stopwords.words('english')" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 9, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "stop = set(stopwords.words('english'))\n", 76 | "exclude = set(string.punctuation) \n", 77 | "lemma = WordNetLemmatizer()" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": 10, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "def clean(doc):\n", 87 | " stop_free = \" \".join([i for i in doc.lower().split() if i not in stop])\n", 88 | " punc_free = ''.join(ch for ch in stop_free if ch not in exclude)\n", 89 | " normalized = \" \".join(lemma.lemmatize(word) for word in punc_free.split())\n", 90 | " return normalized" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 11, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "doc_clean = [clean(doc).split() for doc in doc_complete]" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "#### Preparing Document-Term Matrix\n", 107 | "\n", 108 | "All the text documents combined is known as the corpus. To run any mathematical model on text corpus, it is a good practice to convert it into a matrix representation. LDA model looks for repeating term patterns in the entire DT matrix. Python provides many great libraries for text mining practices, **gensim** is one such clean and beautiful library to handle text data. It is scalable, robust and efficient. Following code shows how to convert a corpus into a document-term matrix." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 6, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "# install gensim library" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 13, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "import gensim\n", 127 | "from gensim import corpora" 128 | ] 129 | }, 130 | { 131 | "cell_type": "code", 132 | "execution_count": 14, 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "# Creating the term dictionary of our courpus, where every unique term is assigned an index. 
\n", 137 | "dictionary = corpora.Dictionary(doc_clean)\n", 138 | "\n", 139 | "# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.\n", 140 | "doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "#### Running LDA Model\n", 148 | "Next step is to create an object for LDA model and train it on Document-Term matrix. The training also requires few parameters as input which are explained in the above section. The gensim module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents." 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 15, 154 | "metadata": {}, 155 | "outputs": [], 156 | "source": [ 157 | "# Creating the object for LDA model using gensim library\n", 158 | "Lda = gensim.models.ldamodel.LdaModel\n", 159 | "\n", 160 | "# Running and Trainign LDA model on the document term matrix.\n", 161 | "ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "markdown", 166 | "metadata": {}, 167 | "source": [ 168 | "#### Results" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 16, 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "name": "stdout", 178 | "output_type": "stream", 179 | "text": [ 180 | "[(0, '0.076*\"sugar\" + 0.075*\"good\" + 0.075*\"health\"'), (1, '0.065*\"driving\" + 0.065*\"pressure\" + 0.064*\"doctor\"'), (2, '0.084*\"sister\" + 0.084*\"father\" + 0.059*\"sugar\"')]\n" 181 | ] 182 | } 183 | ], 184 | "source": [ 185 | "print(ldamodel.print_topics(num_topics=3, num_words=3))" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "#### Tips to improve results of topic modeling\n", 193 | "\n", 194 | "- Feature selection\n", 195 | "- Part of Speech Tag Filter" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [] 204 | } 205 | ], 206 | "metadata": { 207 | "kernelspec": { 208 | "display_name": "Python 3", 209 | "language": "python", 210 | "name": "python3" 211 | }, 212 | "language_info": { 213 | "codemirror_mode": { 214 | "name": "ipython", 215 | "version": 3 216 | }, 217 | "file_extension": ".py", 218 | "mimetype": "text/x-python", 219 | "name": "python", 220 | "nbconvert_exporter": "python", 221 | "pygments_lexer": "ipython3", 222 | "version": "3.7.1" 223 | } 224 | }, 225 | "nbformat": 4, 226 | "nbformat_minor": 2 227 | } 228 | -------------------------------------------------------------------------------- /scripts/python/00_topic_modelling_theoretical_concepts.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "One such technique in the field of text mining is Topic Modelling. As the name suggests, it is a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus. Thus, assisting better decision making.\n", 8 | "\n", 9 | "**Topic Modelling** is different from rule-based text mining approaches that use regular expressions or dictionary based keyword searching techniques. 
It is an **unsupervised approach** used for finding and observing the bunch of words (called “topics”) in large clusters of texts.\n", 10 | "\n", 11 | "Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – “Farming”.\n", 12 | "\n", 13 | "Topic Models are **very useful for the purpose for document clustering**, organizing large blocks of textual data, information retrieval from unstructured text and feature selection. For Example – New York Times are using topic models to boost their user – article recommendation engines. Various professionals are using topic models for recruitment industries where they aim to extract latent features of job descriptions and map them to right candidates. They are being used to organize large datasets of emails, customer reviews, and user social media profiles.\n", 14 | "\n", 15 | "**Approaches for Topic Modelling**\n", 16 | "\n", 17 | "There are many approaches for obtaining topics from a text such as \n", 18 | "\n", 19 | "- Term Frequency and Inverse Document Frequency\n", 20 | "- NonNegative Matrix Factorization techniques. \n", 21 | "- Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique and in this article, we will discuss the same.\n", 22 | "\n", 23 | "**How LDA works**\n", 24 | "\n", 25 | "LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.\n", 26 | "\n", 27 | "LDA is a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. 
" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [] 36 | } 37 | ], 38 | "metadata": { 39 | "kernelspec": { 40 | "display_name": "Python 3", 41 | "language": "python", 42 | "name": "python3" 43 | }, 44 | "language_info": { 45 | "codemirror_mode": { 46 | "name": "ipython", 47 | "version": 3 48 | }, 49 | "file_extension": ".py", 50 | "mimetype": "text/x-python", 51 | "name": "python", 52 | "nbconvert_exporter": "python", 53 | "pygments_lexer": "ipython3", 54 | "version": "3.7.1" 55 | } 56 | }, 57 | "nbformat": 4, 58 | "nbformat_minor": 2 59 | } 60 | -------------------------------------------------------------------------------- /scripts/python/01_topic_modelling_fundamentals.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**Objective**: \n", 8 | "\n", 9 | " - Given a list of docuemnts, determine the topic.\n", 10 | " - Then aggregrate the documents and print the topic the documents belong too.\n", 11 | " \n", 12 | "**Example**:\n", 13 | " \n", 14 | " Data\n", 15 | " \n", 16 | " it's very hot outside summer\n", 17 | " there are not many flowers in winter\n", 18 | " in the winter we eat hot food\n", 19 | " in the summer we go to the sea\n", 20 | " in winter we used many clothes\n", 21 | " in summer we are on vacation\n", 22 | " winter and summer are two seasons of the year\n", 23 | " \n", 24 | " **Output**\n", 25 | " \n", 26 | " Topic 1\n", 27 | " it's very hot outside summer\n", 28 | " in the summer we go to the sea\n", 29 | " in summer we are on vacation\n", 30 | " winter and summer are two seasons of the year\n", 31 | "\n", 32 | " Topic 2\n", 33 | " there are not many flowers in winter\n", 34 | " in the winter we eat hot food\n", 35 | " in winter we used many clothes\n", 36 | " winter and summer are two seasons of the year" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 8, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "#Here are the sample documents combining together to form a corpus.\n", 46 | "\n", 47 | "comments = ['it is very hot outside summer', 'there are not many flowers in winter','in the winter we eat hot food',\n", 48 | " 'in the summer we go to the sea','in winter we used many clothes','in summer we are on vacation',\n", 49 | " 'winter and summer are two seasons of the year']" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 7, 55 | "metadata": {}, 56 | "outputs": [], 57 | "source": [ 58 | "# load the required libraries\n", 59 | "from sklearn.feature_extraction.text import CountVectorizer\n", 60 | "from sklearn.decomposition import LatentDirichletAllocation\n", 61 | "import numpy as np" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 17, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "data": { 71 | "text/plain": [ 72 | "['it is very hot outside summer',\n", 73 | " 'there are not many flowers in winter',\n", 74 | " 'in the winter we eat hot food',\n", 75 | " 'in the summer we go to the sea',\n", 76 | " 'in winter we used many clothes',\n", 77 | " 'in summer we are on vacation',\n", 78 | " 'winter and summer are two seasons of the year']" 79 | ] 80 | }, 81 | "execution_count": 17, 82 | "metadata": {}, 83 | "output_type": "execute_result" 84 | } 85 | ], 86 | "source": [ 87 | "comments" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 18, 93 | 
"metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "vect = CountVectorizer()\n", 97 | "X = vect.fit_transform(comments)" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 20, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "lda = LatentDirichletAllocation(n_components = 2, learning_method = \"batch\", max_iter = 25, random_state = 0)\n", 107 | "document_topics = lda.fit_transform(X)" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 21, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "sorting = np.argsort(lda.components_, axis = 1)[:, ::-1]\n", 117 | "feature_names = np.array(vect.get_feature_names())" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 26, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "docs = np.argsort(comments)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 30, 132 | "metadata": {}, 133 | "outputs": [ 134 | { 135 | "name": "stdout", 136 | "output_type": "stream", 137 | "text": [ 138 | "in summer we are on vacation\n", 139 | "\n", 140 | "in the summer we go to the sea\n", 141 | "\n", 142 | "in the winter we eat hot food\n", 143 | "\n", 144 | "in winter we used many clothes\n", 145 | "\n", 146 | "it is very hot outside summer\n", 147 | "\n" 148 | ] 149 | } 150 | ], 151 | "source": [ 152 | "for i in docs[:5]:\n", 153 | " print(\" \".join(comments[i].split(\",\")[:2]) + \"\\n\")" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 1, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "import nltk" 163 | ] 164 | }, 165 | { 166 | "cell_type": "code", 167 | "execution_count": 2, 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "from nltk.corpus import names" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 3, 177 | "metadata": {}, 178 | "outputs": [ 179 | { 180 | "name": "stdout", 181 | "output_type": "stream", 182 | "text": [ 183 | "\n" 184 | ] 185 | } 186 | ], 187 | "source": [ 188 | "print(names)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "code", 193 | "execution_count": null, 194 | "metadata": {}, 195 | "outputs": [], 196 | "source": [] 197 | } 198 | ], 199 | "metadata": { 200 | "kernelspec": { 201 | "display_name": "Python 3", 202 | "language": "python", 203 | "name": "python3" 204 | }, 205 | "language_info": { 206 | "codemirror_mode": { 207 | "name": "ipython", 208 | "version": 3 209 | }, 210 | "file_extension": ".py", 211 | "mimetype": "text/x-python", 212 | "name": "python", 213 | "nbconvert_exporter": "python", 214 | "pygments_lexer": "ipython3", 215 | "version": "3.7.1" 216 | } 217 | }, 218 | "nbformat": 4, 219 | "nbformat_minor": 2 220 | } 221 | -------------------------------------------------------------------------------- /scripts/python/Tutorial1-An introduction to NLP with SpaCy.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "#### Natural Language Processing with spaCy" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "#### How to install SpaCy package in Windows environment\n", 15 | "\n", 16 | "- Open a regular command prompt, figure out where anaconda/miniconda is installed. 
In my case, the location is `C:\\Users\\username\\Miniconda3`\n", 17 | "\n", 18 | "- cd to the directory, `C:\\Users\\username\\miniconda3\\Scripts` and type `activate` and press the enter key.\n", 19 | "\n", 20 | "- Now type the command, `conda install -c conda-forge spacy` OR `pip install spacy` to install the required package. " 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "##### Tutorial from the [documentation](https://course.spacy.io/chapter1)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 1, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "# import the English language class\n", 37 | "from spacy.lang.en import English" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "# Create the nlp object\n", 47 | "nlp = English()" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 3, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "name": "stdout", 57 | "output_type": "stream", 58 | "text": [ 59 | "Hello\n", 60 | "world\n", 61 | "!\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "# Created by processing a string of text with the nlp object\n", 67 | "doc = nlp(\"Hello world!\")\n", 68 | "\n", 69 | "# Iterate over tokens in a Doc\n", 70 | "for token in doc:\n", 71 | " print(token.text)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 15, 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "world\n" 84 | ] 85 | } 86 | ], 87 | "source": [ 88 | "doc = nlp(\"Hello world!\")\n", 89 | "\n", 90 | "# Index into the Doc to get a single Token\n", 91 | "token = doc[1]\n", 92 | "\n", 93 | "# Get the token text via the .text attribute\n", 94 | "print(token.text)" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 16, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "name": "stdout", 104 | "output_type": "stream", 105 | "text": [ 106 | "Index: [0, 1, 2, 3, 4]\n", 107 | "Text: ['It', 'costs', '$', '5', '.']\n", 108 | "is_alpha: [True, True, False, False, False]\n", 109 | "is_punct: [False, False, False, False, True]\n", 110 | "like_num: [False, False, False, True, False]\n" 111 | ] 112 | } 113 | ], 114 | "source": [ 115 | "# Lexical attributes\n", 116 | "doc = nlp(\"It costs $5.\")\n", 117 | "print('Index: ', [token.i for token in doc])\n", 118 | "print('Text: ', [token.text for token in doc])\n", 119 | "\n", 120 | "print('is_alpha:', [token.is_alpha for token in doc])\n", 121 | "print('is_punct:', [token.is_punct for token in doc])\n", 122 | "print('like_num:', [token.like_num for token in doc])" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 17, 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "# process the text\n", 132 | "doc = nlp(\"I like tree kangaroos and narwhals.\")" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 19, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "name": "stdout", 142 | "output_type": "stream", 143 | "text": [ 144 | "like\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "# Select the first token\n", 150 | "first_token = doc[1]\n", 151 | "print(first_token.text)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": {}, 158 | "outputs": [], 159 | "source": [] 160 | } 161 | ], 162 | "metadata": { 163 | "kernelspec": { 164 | "display_name": 
"Python 3", 165 | "language": "python", 166 | "name": "python3" 167 | }, 168 | "language_info": { 169 | "codemirror_mode": { 170 | "name": "ipython", 171 | "version": 3 172 | }, 173 | "file_extension": ".py", 174 | "mimetype": "text/x-python", 175 | "name": "python", 176 | "nbconvert_exporter": "python", 177 | "pygments_lexer": "ipython3", 178 | "version": "3.7.1" 179 | } 180 | }, 181 | "nbformat": 4, 182 | "nbformat_minor": 2 183 | } 184 | -------------------------------------------------------------------------------- /scripts/python/check_pkgs.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/duttashi/text-analysis/154b34ff3c2fac60e5e4068e48597dbdc815b894/scripts/python/check_pkgs.py -------------------------------------------------------------------------------- /scripts/python/extract_table_from_pdf.py: -------------------------------------------------------------------------------- 1 | # Objective 1: Read table from multiple pdf files contained in a directory to a list 2 | # Obkective 2: clean the pdf text data and find unique words then save them to a dictionary 3 | 4 | # To read tables contained within the pdf files, I'm using the tabula.py library 5 | # To install tabula.py on Python3 in windows OS, ensure Java version 8 is installed. 6 | # Next, open a command-prompt window, browse to python directory and execute the command, pip3 install tabula.py 7 | 8 | import tabula, os, re, string 9 | from collections import Counter 10 | 11 | # path to pdf files 12 | filePath = "C:\\Users\\Ashoo\\Documents\\PythonPlayground\\text-analysis\\data\\pdf" 13 | 14 | stripped = [] # initialize an empty string 15 | 16 | for filename in os.listdir(filePath): 17 | # search for files ending with .txt extension and read them in memory 18 | if filename.strip().endswith('.pdf'): 19 | print(filename) 20 | # Mote: python will read the pdf file as 'rb' or 'wb' as a binary read and write format 21 | with(open(os.path.join(filePath,filename),'rb')) as pdfFiles: 22 | #df= tabula.read_pdf(f, stream=True)[0] 23 | 24 | # read all pdf pages 25 | df= tabula.read_pdf(pdfFiles, pages="all") 26 | print(df) 27 | 28 | # convert pdf table to csv format 29 | tabula.convert_into(pdfFiles, "pdf_to_csv.csv", output_format="csv", stream=True) 30 | pdfFiles.close() 31 | -------------------------------------------------------------------------------- /scripts/python/extract_text_data_from_pdf.py: -------------------------------------------------------------------------------- 1 | # Objective 1: Read multiple pdf files from a directory into a list 2 | # Obkective 2: clean the pdf text data and find unique words then save them to a dictionary 3 | 4 | # To read pdf files, I'm using the PyPDF2 library 5 | # To install it on Python3 in windows OS, open a command-prompt window, browse to python directory and execute the command, pip3 install PyPDF2 6 | # In my environment, I installed it like, `C:\Users\Ashoo\Miniconda3\pip3 install PyPDF2` 7 | # Read the documentation of PyPDF2 library at https://pythonhosted.org/PyPDF2/ 8 | # # PyPDF2 uses a zero-based index for getting pages: The first page is page 0, the second is Introduction, and so on. 
9 | 10 | import re, os, string, PyPDF2 11 | from collections import Counter 12 | 13 | # path to pdf files 14 | filePath = "C:\\Users\\Ashoo\\Documents\\PythonPlayground\\text-analysis\\data\\pdf" 15 | 16 | stripped = [] # initialize an empty string 17 | 18 | for filename in os.listdir(filePath): 19 | # search for files ending with .txt extension and read them in memory 20 | if filename.strip().endswith('.pdf'): 21 | print(filename) 22 | # Mote: python will read the pdf file as 'rb' or 'wb' as a binary read and write format 23 | with(open(os.path.join(filePath,filename),'rb')) as f: 24 | 25 | # creating a pdf reader object 26 | pdfReader = PyPDF2.PdfFileReader(f) 27 | 28 | # print the number of pages in pdf file 29 | number_of_pages = 0 30 | number_of_pages = pdfReader.getNumPages() 31 | #print(pdfReader.numPages) 32 | print("Number of pages in pdf file are: ", number_of_pages) 33 | 34 | # create a page object and specify the page number to read.. say the first page 35 | #pageObj = pdfReader.getPage(0) 36 | # create a page object and read all pages in the pdf file 37 | # since page numbering starts from 0, therefore decreasing the value returned by method getNumPages by 1 38 | number_of_pages-=1 39 | pageObj = pdfReader.getPage(number_of_pages) 40 | 41 | # extract the text from the pageObj variable 42 | pageContent = pageObj.extractText() 43 | print (pageContent.encode('utf-8')) 44 | 45 | # basic text cleaning process 46 | # using re library, split the text data into words 47 | words = re.split(r'\W+', pageContent) 48 | # add split words to a list and lowercase them 49 | words = [word.lower() for word in words] 50 | # using string library, remove punctuation from words and save in table format 51 | table = str.maketrans('', '', string.punctuation) 52 | # concatenating clean data to list 53 | stripped += [w.translate(table) for w in words] 54 | 55 | # create a dictionary that stores the unique words found in the text files 56 | countlist = Counter(stripped) 57 | 58 | 59 | # print the unique word and its frequency of occurence 60 | for key, val in countlist.items(): 61 | print(key, val) 62 | 63 | f.close() 64 | # print end of program statement 65 | print("End") 66 | -------------------------------------------------------------------------------- /scripts/python/extract_text_data_from_pdf_2.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Fri Jun 25 09:12:31 2021 4 | Pdf file data downloaded from: https://ncrb.gov.in/en/crime-in-india-table-addtional-table-and-chapter-contents 5 | @author: Ashish 6 | """ 7 | 8 | import pdfplumber as pp 9 | import os 10 | 11 | cur_path = os.path.dirname(__file__) 12 | print(cur_path) 13 | 14 | # set path for loading pdf file in memory 15 | new_path = os.path.relpath('..\\..\\data\\pdf\\pdf_tbl1.pdf', cur_path) 16 | 17 | # read pdf, extract data 18 | with pp.open(new_path) as pdf: 19 | page = pdf.pages[0] 20 | text=page.extract_text() 21 | print(text) 22 | 23 | # set table settings in pdf file 24 | table_settings = { 25 | "vertical_strategy": "lines", 26 | "horizontal_strategy": "text", 27 | "intersection_x_tolerance": 15 28 | } 29 | 30 | # read pdf file, extract table data 31 | with pp.open(new_path) as pdf: 32 | mypage = pdf.pages[0] 33 | tbls = mypage.find_tables() 34 | print("Tables: ", tbls) 35 | tbl_content = tbls[0].extract(x_tolerance=5) 36 | print("Table content: ",tbl_content) 37 | 38 | 39 | 
-------------------------------------------------------------------------------- /scripts/python/extract_text_data_from_pdf_3.py: -------------------------------------------------------------------------------- 1 | import pdfplumber 2 | data_path = "../../data/pdf/file1.pdf" 3 | with pdfplumber.open(data_path) as pdf: 4 | first_page = pdf.pages[0] 5 | text = first_page.extract_text() 6 | print(text) 7 | # print(first_page.chars[0]) -------------------------------------------------------------------------------- /scripts/python/extract_text_data_from_textfiles.py: -------------------------------------------------------------------------------- 1 | # Objective #1: To read multiple text files in a given directory & write content of each text file to a list 2 | # Objective #2: To search for a given text string in each file and write its line number to a dictionary and print it 3 | 4 | import re, os, string, logging, sys 5 | from collections import Counter 6 | 7 | fileData = [] # initialize an empty list 8 | stripped = [] 9 | phrase = [] # initialize an empty list 10 | searchPhrase = 'beauty' 11 | 12 | # the path to text files 13 | # Note the usage of relative path in here.. Since the script is in a folder which is different from the data folder. 14 | # Pandas will start looking from where your current python file is located. Therefore you can move from your current directory to where your data is located with '..' 15 | 16 | filePath = "../../data/plainText" 17 | for filename in os.listdir(filePath): 18 | # search for files ending with .txt extension and read them in memory 19 | if filename.strip().endswith('.txt'): 20 | print(filename) 21 | with(open(os.path.join(filePath,filename),'rt')) as f: 22 | try: 23 | # read all text with newlines included and save data to variable 24 | fileTxt = f.read() 25 | f.close() 26 | 27 | # basic text cleaning process 28 | # using re library, split the text data into words 29 | words = re.split(r'\W+', fileTxt) 30 | # add split words to a list and lowercase them 31 | words = [word.lower() for word in words] 32 | # using string library, remove punctuation from words and save in table format 33 | table = str.maketrans('', '', string.punctuation) 34 | # concatenating clean data to list 35 | stripped += [w.translate(table) for w in words] 36 | 37 | # search for a given string in clean text. 
If found then print the line number 38 | for num, line in enumerate(stripped,1): 39 | if(searchPhrase in line): 40 | print(searchPhrase, " found at line number: ",num) 41 | 42 | # if file not found, log an error 43 | except IOError as e: 44 | logging.error('Operation failed: {}' .format(e.strerror)) 45 | sys.exit(0) 46 | 47 | # create a dictionary that stores the unique words found in the text files 48 | countlist = Counter(stripped) 49 | 50 | print("Count of uniique words in text files", filename, " are: ") 51 | # print the unique word and its frequency of occurence 52 | for key, val in countlist.items(): 53 | print(key, val) 54 | 55 | # print end of program statement 56 | print("End") -------------------------------------------------------------------------------- /scripts/python/func_text_preprocess.py: -------------------------------------------------------------------------------- 1 | # We cannot work with the text data in machine learning so we need to convert them into numerical vectors, 2 | # This script has all techniques for conversion 3 | # A collection of functions for text data cleaning 4 | # Script create date: May 1, 2023 location: Jhs, UP 5 | 6 | def read_multiple_files(source_dir_path): 7 | import os 8 | dir_csv = [file for file in os.listdir(source_dir_path) if file.endswith('.csv')] 9 | files_csv_list = [] 10 | for file in dir_csv: 11 | df = pd.read_csv(os.path.join(dir_csv, file)) 12 | files_csv_list.append(df) 13 | return df 14 | 15 | def clean_text(df_text): 16 | from nltk.corpus import stopwords 17 | from nltk.stem import SnowballStemmer 18 | import re 19 | stop = set(stopwords.words('english')) 20 | temp =[] 21 | snow = SnowballStemmer('english') 22 | for sentence in df_text: 23 | sentence = sentence.lower() # Converting to lowercase 24 | cleanr = re.compile('<.*?>') 25 | sentence = re.sub(cleanr, ' ', sentence) #Removing HTML tags 26 | sentence = re.sub(r'[?|!|\'|"|#]',r'',sentence) 27 | sentence = re.sub(r'[.|,|)|(|\|/]',r' ',sentence) #Removing Punctuations 28 | 29 | words = [snow.stem(word) for word in sentence.split() if word not in stopwords.words('english')] # Stemming and removing stopwords 30 | temp.append(words) 31 | 32 | sent = [] 33 | for row in temp: 34 | sequ = '' 35 | for word in row: 36 | sequ = sequ + ' ' + word 37 | sent.append(sequ) 38 | 39 | clean_sents = sent 40 | return clean_sents 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | -------------------------------------------------------------------------------- /scripts/python/kaggle_hotel_review_rating.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Sun Jul 11 11:33:01 2021 4 | 5 | @author: Ashish 6 | 7 | Data source: https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews 8 | Objective: Hotel reviews rating prediction 9 | """ 10 | 11 | import numpy as np # linear algebra 12 | import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv) 13 | import matplotlib.pyplot as plt 14 | import seaborn as sns 15 | import nltk 16 | from nltk.corpus import stopwords 17 | import string 18 | import xgboost as xgb 19 | from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer 20 | from sklearn.decomposition import TruncatedSVD 21 | from sklearn import ensemble, metrics, model_selection, naive_bayes 22 | color = sns.color_palette() 23 | 24 | eng_stopwords = set(stopwords.words("english")) 25 | pd.options.mode.chained_assignment = None 26 | 27 | # import os 28 | 29 | # cur_path = os.path.dirname(__file__) 30 | # print(cur_path) 31 | 32 | # Reference: https://github.com/SudalaiRajkumar/Kaggle/blob/master/SpookyAuthor/simple_fe_notebook_spooky_author.ipynb 33 | # load the data 34 | df = pd.read_csv("..\\..\\data\\kaggle_tripadvisor_hotel_reviews.csv") 35 | 36 | # print(df.head(), df.shape) 37 | 38 | ## Feature Engineering 39 | # Now let us come try to do some feature engineering. This consists of two main parts. 40 | 41 | # Meta features - features that are extracted from the text like number of words, number of stop words, number of punctuations etc 42 | # Text based features - features directly based on the text / words like frequency, svd, word2vec etc. 43 | # Meta Features: 44 | 45 | # We will start with creating meta featues and see how good are they at predicting the spooky authors. The feature list is as follows: 46 | 47 | # Number of words in the text 48 | # Number of unique words in the text 49 | # Number of characters in the text 50 | # Number of stopwords 51 | # Number of punctuations 52 | # Number of upper case words 53 | # Number of title case words 54 | # Average length of the words 55 | ## Number of words in the text ## 56 | df["num_words"] = df["Review"].apply(lambda x: len(str(x).split())) 57 | ## Number of unique words in the text ## 58 | df["num_unique_words"] = df["Review"].apply(lambda x: len(set(str(x).split()))) 59 | 60 | # Number of characters in text 61 | df["num_chars"] = df["Review"].apply(lambda x: len(str(x))) 62 | 63 | ## Number of stopwords in the text ## 64 | df["num_stopwords"] = df["Review"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords])) 65 | 66 | ## Number of punctuations in the text ## 67 | df["num_punctuations"] =df['Review'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) ) 68 | 69 | ## Number of title case words in the text ## 70 | df["num_words_upper"] = df["Review"].apply(lambda x: len([w for w in str(x).split() if w.isupper()])) 71 | 72 | ## Number of title case words in the text ## 73 | df["num_words_title"] = df["Review"].apply(lambda x: len([w for w in str(x).split() if w.istitle()])) 74 | 75 | ## Average length of the words in the text ## 76 | df["mean_word_len"] = df["Review"].apply(lambda x: np.mean([len(w) for w in str(x).split()])) 77 | 78 | 79 | print(df.columns) 80 | 81 | # Let us now plot some of our new variables to see of they will be helpful in predictions. 
82 | 83 | df['num_words'].loc[df['num_words']>80] = 80 #truncation for better visuals 84 | plt.figure(figsize=(12,8)) 85 | sns.violinplot(x='Rating', y='num_words', data=df) 86 | plt.xlabel('Rating', fontsize=12) 87 | plt.ylabel('Number of words in text', fontsize=12) 88 | plt.title("Number of words by review", fontsize=15) 89 | plt.show() 90 | 91 | df['num_punctuations'].loc[df['num_punctuations']>10] = 10 #truncation for better visuals 92 | plt.figure(figsize=(12,8)) 93 | sns.violinplot(x='Rating', y='num_punctuations', data=df) 94 | plt.xlabel('Rating', fontsize=12) 95 | plt.ylabel('Number of puntuations in text', fontsize=12) 96 | plt.title("Number of punctuations by rating", fontsize=15) 97 | plt.show() 98 | 99 | 100 | # Initial Model building 101 | ## Prepare the data for modeling ### 102 | review_mapping_dict = {1:"worse", 2:"bad", 3:"ok",4:"good",5:"best"} 103 | train_y = df['Rating'].map(review_mapping_dict) 104 | 105 | ### recompute the trauncated variables again ### 106 | df["num_words"] = df["Review"].apply(lambda x: len(str(x).split())) 107 | df["mean_word_len"] = df["Review"].apply(lambda x: np.mean([len(w) for w in str(x).split()])) 108 | 109 | cols_to_drop = ['Review'] 110 | train_X = df.drop(cols_to_drop+['Rating'], axis=1) 111 | test_X = df.drop(cols_to_drop, axis=1) 112 | 113 | # Training a simple XGBoost model 114 | def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, child=1, colsample=0.3): 115 | param = {} 116 | param['objective'] = 'multi:softprob' 117 | param['eta'] = 0.1 118 | param['max_depth'] = 3 119 | param['silent'] = 1 120 | param['num_class'] = 3 121 | param['eval_metric'] = "mlogloss" 122 | param['min_child_weight'] = child 123 | param['subsample'] = 0.8 124 | param['colsample_bytree'] = colsample 125 | param['seed'] = seed_val 126 | num_rounds = 2000 127 | 128 | plst = list(param.items()) 129 | xgtrain = xgb.DMatrix(train_X, label=train_y) 130 | 131 | if test_y is not None: 132 | xgtest = xgb.DMatrix(test_X, label=test_y) 133 | watchlist = [ (xgtrain,'train'), (xgtest, 'test') ] 134 | model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=50, verbose_eval=20) 135 | else: 136 | xgtest = xgb.DMatrix(test_X) 137 | model = xgb.train(plst, xgtrain, num_rounds) 138 | 139 | pred_test_y = model.predict(xgtest, ntree_limit = model.best_ntree_limit) 140 | if test_X2 is not None: 141 | xgtest2 = xgb.DMatrix(test_X2) 142 | pred_test_y2 = model.predict(xgtest2, ntree_limit = model.best_ntree_limit) 143 | return pred_test_y, pred_test_y2, model 144 | 145 | kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017) 146 | cv_scores = [] 147 | pred_full_test = 0 148 | pred_train = np.zeros([df.shape[0], 3]) 149 | for dev_index, val_index in kf.split(train_X): 150 | dev_X, val_X = train_X.loc[dev_index], train_X.loc[val_index] 151 | dev_y, val_y = train_y[dev_index], train_y[val_index] 152 | pred_val_y, pred_test_y, model = runXGB(dev_X, dev_y, val_X, val_y, test_X, seed_val=0) 153 | pred_full_test = pred_full_test + pred_test_y 154 | pred_train[val_index,:] = pred_val_y 155 | cv_scores.append(metrics.log_loss(val_y, pred_val_y)) 156 | break 157 | print("cv scores : ", cv_scores) 158 | -------------------------------------------------------------------------------- /scripts/python/kaggle_spooky_authors_ml_model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Objective: To predict 
the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. \n", 8 | "\n", 9 | "Category: Text Analysis/ Natural Language Processing\n", 10 | " \n", 11 | "Data Source: https://www.kaggle.com/c/spooky-author-identification" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [ 19 | { 20 | "data": { 21 | "text/plain": [ 22 | "'C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\scripts\\\\python'" 23 | ] 24 | }, 25 | "execution_count": 1, 26 | "metadata": {}, 27 | "output_type": "execute_result" 28 | } 29 | ], 30 | "source": [ 31 | "# Print the current working directory\n", 32 | "import os\n", 33 | "os.getcwd()" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 36, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "# Load the required libraries\n", 43 | "import pandas as pd\n", 44 | "import numpy as np\n", 45 | "from sklearn.feature_extraction.text import CountVectorizer as CV\n", 46 | "from sklearn.naive_bayes import MultinomialNB as MNB" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 17, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "C:\\Users\\Ashoo\\Documents\\R playground\\text-analysis\\data\\kaggle_spooky_authors\n" 59 | ] 60 | } 61 | ], 62 | "source": [ 63 | "# set data path\n", 64 | "data_path = \"\"\n", 65 | "print(data_path)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 20, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/html": [ 76 | "
\n", 77 | "\n", 90 | "\n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | "
idtextauthor
0id26305This process, however, afforded me no means of...EAP
1id17569It never once occurred to me that the fumbling...HPL
2id11008In his left hand was a gold snuff box, from wh...EAP
3id27763How lovely is spring As we looked from Windsor...MWS
4id12958Finding nothing else, not even gold, the Super...HPL
\n", 132 | "
" 133 | ], 134 | "text/plain": [ 135 | " id text author\n", 136 | "0 id26305 This process, however, afforded me no means of... EAP\n", 137 | "1 id17569 It never once occurred to me that the fumbling... HPL\n", 138 | "2 id11008 In his left hand was a gold snuff box, from wh... EAP\n", 139 | "3 id27763 How lovely is spring As we looked from Windsor... MWS\n", 140 | "4 id12958 Finding nothing else, not even gold, the Super... HPL" 141 | ] 142 | }, 143 | "execution_count": 20, 144 | "metadata": {}, 145 | "output_type": "execute_result" 146 | } 147 | ], 148 | "source": [ 149 | "# Load the data\n", 150 | "training_data = pd.read_csv(\"C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\data\\\\kaggle_spooky_authors\\\\train.csv\")\n", 151 | "testing_data=pd.read_csv(\"C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\data\\\\kaggle_spooky_authors\\\\test.csv\")\n", 152 | "training_data.head()" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Map \"EAP\" to 0 \"HPL\" to 1 and \"MWS\" to 2\n", 160 | "\n", 161 | "Next we take all the rows under the column named \"text\" and put it in X ( a variable in python)\n", 162 | "\n", 163 | "Similarly we take all rows under the column named \"author_num\" and put it in y (a variable in python)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 21, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "name": "stdout", 173 | "output_type": "stream", 174 | "text": [ 175 | "0 This process, however, afforded me no means of...\n", 176 | "1 It never once occurred to me that the fumbling...\n", 177 | "2 In his left hand was a gold snuff box, from wh...\n", 178 | "3 How lovely is spring As we looked from Windsor...\n", 179 | "4 Finding nothing else, not even gold, the Super...\n", 180 | "Name: text, dtype: object\n", 181 | "0 0\n", 182 | "1 1\n", 183 | "2 0\n", 184 | "3 2\n", 185 | "4 1\n", 186 | "Name: author_num, dtype: int64\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "training_data['author_num'] = training_data.author.map({'EAP':0, 'HPL':1, 'MWS':2})\n", 192 | "X = training_data['text']\n", 193 | "y = training_data['author_num']\n", 194 | "print (X.head())\n", 195 | "print (y.head())" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "Now we need to split the data into training set and testing set. We train the model on the training set. Model testing is done on the test set.\n", 203 | "\n", 204 | "So we are going to split it into 70% for training and 30% for testing." 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 22, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "per=int(float(0.7)* len(X))\n", 214 | "X_train=X[:per]\n", 215 | "X_test=X[per:]\n", 216 | "y_train=y[:per]\n", 217 | "y_test=y[per:]" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "##### Converting text data into numbers or in other words, `Vectorization`\n", 225 | "\n", 226 | "Computers get crazy with text, It only understands numbers, but we have got to classify text. Now what do we do? We do tokenization and vectorization to save the count of each word. 
" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 31, 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "name": "stdout", 236 | "output_type": "stream", 237 | "text": [ 238 | "{'My': 0, 'name': 4, 'is': 2, 'Sindabad': 1, 'the': 6, 'sailor': 5, 'man': 3}\n", 239 | "1\n", 240 | "1\n", 241 | "1\n", 242 | "1\n", 243 | "1\n" 244 | ] 245 | } 246 | ], 247 | "source": [ 248 | "#toy example\n", 249 | "text=[\"My name is Sindabad the sailor man\"]\n", 250 | "toy = CV(lowercase=False, token_pattern=r'\\w+|\\,')\n", 251 | "toy.fit_transform(text)\n", 252 | "print (toy.vocabulary_)\n", 253 | "matrix=toy.transform(text)\n", 254 | "print (matrix[0,0])\n", 255 | "print (matrix[0,1])\n", 256 | "print (matrix[0,2])\n", 257 | "print (matrix[0,3])\n", 258 | "print (matrix[0,4])" 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": 30, 264 | "metadata": {}, 265 | "outputs": [ 266 | { 267 | "name": "stdout", 268 | "output_type": "stream", 269 | "text": [ 270 | "(13705, 27497)\n" 271 | ] 272 | } 273 | ], 274 | "source": [ 275 | "vect = CV(lowercase=False, token_pattern=r'\\w+|\\,')\n", 276 | "X_cv=vect.fit_transform(X)\n", 277 | "X_train_cv = vect.transform(X_train)\n", 278 | "X_test_cv = vect.transform(X_test)\n", 279 | "print (X_train_cv.shape)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "The final step We give the data to the `clf.fit` for training and test it for score. Let's check the accuracy on training set" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 37, 292 | "metadata": {}, 293 | "outputs": [ 294 | { 295 | "data": { 296 | "text/plain": [ 297 | "0.8432073544433095" 298 | ] 299 | }, 300 | "execution_count": 37, 301 | "metadata": {}, 302 | "output_type": "execute_result" 303 | } 304 | ], 305 | "source": [ 306 | "MNB=MultinomialNB()\n", 307 | "MNB.fit(X_train_cv, y_train)\n", 308 | "MNB.score(X_test_cv, y_test)\n", 309 | "\n", 310 | "\n", 311 | "#clf=MultinomialNB()\n", 312 | "#clf.fit(X_train_cv, y_train)\n", 313 | "#clf.score(X_test_cv, y_test)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "Now, lets check the accuracy on the test set. But first vectorize the test set just like we did it for the training set." 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 38, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "X_test=vect.transform(testing_data[\"text\"])" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 39, 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "data": { 339 | "text/plain": [ 340 | "(8392, 3)" 341 | ] 342 | }, 343 | "execution_count": 39, 344 | "metadata": {}, 345 | "output_type": "execute_result" 346 | } 347 | ], 348 | "source": [ 349 | "MNB=MultinomialNB()\n", 350 | "MNB.fit(X_cv, y)\n", 351 | "predicted_result=MNB.predict_proba(X_test)\n", 352 | "predicted_result.shape" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "We see that we got a result with 8392 rows presenting each text and 3 columns each column representing probability of each author." 360 | ] 361 | }, 362 | { 363 | "cell_type": "code", 364 | "execution_count": 40, 365 | "metadata": {}, 366 | "outputs": [ 367 | { 368 | "data": { 369 | "text/html": [ 370 | "
\n", 371 | "\n", 384 | "\n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | "
idEAPHPLMWS
0id023100.0009242.968107e-059.990459e-01
1id245411.0000003.262255e-079.234086e-09
2id001340.0031939.968065e-018.737549e-07
3id277570.9209857.901473e-024.811097e-07
4id040810.9531585.981884e-034.086011e-02
\n", 432 | "
" 433 | ], 434 | "text/plain": [ 435 | " id EAP HPL MWS\n", 436 | "0 id02310 0.000924 2.968107e-05 9.990459e-01\n", 437 | "1 id24541 1.000000 3.262255e-07 9.234086e-09\n", 438 | "2 id00134 0.003193 9.968065e-01 8.737549e-07\n", 439 | "3 id27757 0.920985 7.901473e-02 4.811097e-07\n", 440 | "4 id04081 0.953158 5.981884e-03 4.086011e-02" 441 | ] 442 | }, 443 | "execution_count": 40, 444 | "metadata": {}, 445 | "output_type": "execute_result" 446 | } 447 | ], 448 | "source": [ 449 | "#NOW WE CREATE A RESULT DATA FRAME AND ADD THE COLUMNS NECESSARY FOR KAGGLE SUBMISSION\n", 450 | "result=pd.DataFrame()\n", 451 | "result[\"id\"]=testing_data[\"id\"]\n", 452 | "result[\"EAP\"]=predicted_result[:,0]\n", 453 | "result[\"HPL\"]=predicted_result[:,1]\n", 454 | "result[\"MWS\"]=predicted_result[:,2]\n", 455 | "result.head()" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "metadata": {}, 461 | "source": [ 462 | "FINALLY WE SUBMIT THE RESULT TO KAGGLE FOR EVALUATION" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": 41, 468 | "metadata": {}, 469 | "outputs": [], 470 | "source": [ 471 | "result.to_csv(\"TO_SUBMIT.csv\", index=False)" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": {}, 478 | "outputs": [], 479 | "source": [] 480 | } 481 | ], 482 | "metadata": { 483 | "kernelspec": { 484 | "display_name": "Python 3", 485 | "language": "python", 486 | "name": "python3" 487 | }, 488 | "language_info": { 489 | "codemirror_mode": { 490 | "name": "ipython", 491 | "version": 3 492 | }, 493 | "file_extension": ".py", 494 | "mimetype": "text/x-python", 495 | "name": "python", 496 | "nbconvert_exporter": "python", 497 | "pygments_lexer": "ipython3", 498 | "version": "3.6.5" 499 | } 500 | }, 501 | "nbformat": 4, 502 | "nbformat_minor": 2 503 | } 504 | -------------------------------------------------------------------------------- /scripts/python/learn_text_analysis.py: -------------------------------------------------------------------------------- 1 | import spacy 2 | import numpy as np 3 | import pandas as pd 4 | 5 | # import the english language model 6 | nlp = spacy.load("en_core_web_sm") 7 | 8 | # dataset: https://www.kaggle.com/datasets/mrisdal/fake-news 9 | # reference: https://www.kaggle.com/code/sudalairajkumar/getting-started-with-spacy 10 | df = pd.read_csv("../../data/fake_news_data/fake.csv") 11 | print(df.shape) 12 | print(df.head()) 13 | print(df.columns) 14 | print(df['title'].head()) 15 | 16 | txt = df['title'][0] 17 | print(txt) 18 | doc = nlp(txt) 19 | olist = [] 20 | for token in doc: 21 | l = [token.text, 22 | token.idx, 23 | token.lemma_, 24 | token.is_punct, 25 | token.is_space, 26 | token.shape_, 27 | token.pos_, 28 | token.tag_] 29 | olist.append(l) 30 | 31 | odf = pd.DataFrame(olist) 32 | odf.columns = ["Text", "StartIndex", "Lemma", "IsPunctuation", "IsSpace", "WordShape", "PartOfSpeech", "POSTag"] 33 | print(odf) 34 | 35 | # Named Entity Recognition: 36 | # A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. 37 | # We also get named entity recognition as part of spacy package. It is inbuilt in the english language model and we can also train our own entities if needed. 
38 | doc = nlp(txt) 39 | olist = [] 40 | for ent in doc.ents: 41 | olist.append([ent.text, ent.label_]) 42 | 43 | odf = pd.DataFrame(olist) 44 | odf.columns = ["Text", "EntityType"] 45 | print(odf) -------------------------------------------------------------------------------- /scripts/readme.md: -------------------------------------------------------------------------------- 1 | All code scripts live here 2 | --------------------------------------------------------------------------------