├── .github
│   └── stale.yml
├── .gitignore
├── LICENSE
├── README.md
├── TextAnalysis.Rproj
├── TextAnalysis.wpr
├── data
│   ├── fake_news_data
│   │   └── liar_dataset
│   │       ├── README.md
│   │       ├── test.tsv
│   │       ├── train.tsv
│   │       └── valid.tsv
│   ├── kaggle_spooky_authors
│   │   ├── sample_submission.csv
│   │   ├── test.csv
│   │   └── train.csv
│   ├── novelwordsonly.txt
│   ├── plainText
│   │   ├── austen.txt
│   │   └── melville.txt
│   ├── readme.md
│   └── text_data_for_analysis.txt
├── figures
│   ├── Rplot-wordcloud01.png
│   └── readme.md
├── resources
│   └── readme.md
└── scripts
    ├── .gitignore
    ├── R
    │   ├── HP_preprocess_01.R
    │   ├── austen_text_analysis.R
    │   ├── dispersion_plots.R
    │   ├── dracula_text_analysis.R
    │   ├── expr_kaggle-reddit_EDA.R
    │   ├── expr_kaggle-reddit_text-analysis.R
    │   ├── initialScript-1.R
    │   ├── initialScript.R
    │   ├── mobydick_novel_text_analysis.R
    │   ├── structural_topic_modeling_00.R
    │   ├── text_analysis_example01.R
    │   ├── text_analysis_example02.R
    │   ├── token_distribution_analysis.R
    │   ├── topic_modeling_00.R
    │   ├── topic_modeling_01.R
    │   ├── wuthering_heights_sentiment_analysis.R
    │   └── wuthering_heights_text_analysis.R
    ├── python
    │   ├── .gitignore
    │   ├── 00_topic_modelling_fundamentals.ipynb
    │   ├── 00_topic_modelling_theoretical_concepts.ipynb
    │   ├── 01_topic_modelling_fundamentals.ipynb
    │   ├── Tutorial1-An introduction to NLP with SpaCy.ipynb
    │   ├── check_pkgs.py
    │   ├── extract_table_from_pdf.py
    │   ├── extract_text_data_from_pdf.py
    │   ├── extract_text_data_from_pdf_2.py
    │   ├── extract_text_data_from_pdf_3.py
    │   ├── extract_text_data_from_textfiles.py
    │   ├── func_text_preprocess.py
    │   ├── kaggle_hotel_review_rating.py
    │   ├── kaggle_spooky_authors_ml_model.ipynb
    │   └── learn_text_analysis.py
    └── readme.md
/.github/stale.yml:
--------------------------------------------------------------------------------
1 | # Number of days of inactivity before an issue becomes stale
2 | daysUntilStale: 60
3 | # Number of days of inactivity before a stale issue is closed
4 | daysUntilClose: 7
5 | # Issues with these labels will never be considered stale
6 | exemptLabels:
7 | - pinned
8 | - security
9 | - notes
10 | # Label to use when marking an issue as stale
11 | staleLabel: wontfix
12 | # Comment to post when marking an issue as stale. Set to `false` to disable
13 | markComment: >
14 | This issue has been automatically marked as stale because it has not had
15 | recent activity. It will be closed if no further activity occurs. Thank you
16 | for your contributions.
17 | # Comment to post when closing a stale issue. Set to `false` to disable
18 | closeComment: true
19 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | .Rproj.user
2 | .Rhistory
3 | .RData
4 | .Ruserdata
5 | /*.idea
6 | /data
7 | *.xml
8 | *.iml
9 | *.idea
10 | /.ipynb_checkpoints/
11 | /*.cpython-37.pyc
12 | /scripts/python/__pycache__
13 | *.spyproject
14 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Ashish Dutt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Readme
2 | [Commit activity](https://github.com/duttashi/text-analysis/graphs/commit-activity)
3 | [Issues](https://github.com/duttashi/text-analysis/issues)
4 | [Forks](https://github.com/duttashi/text-analysis/network)
5 | [Stargazers](https://github.com/duttashi/text-analysis/stargazers)
6 | [License](https://github.com/duttashi/text-analysis/blob/master/LICENSE)
7 | [DOI](https://doi.org/10.5281/zenodo.1489352)
8 | ### Cleaning and Analyzing textual data
9 |
10 | Weaving analytic stories from text data
11 |
12 | The repository consists of the following folders: `data`, `scripts`, `resources` and `figures`.
13 |
14 | ###### Have a question?
15 |
16 | Ask your question on [Stack Overflow](http://stackoverflow.com/questions/tagged/r)
17 | or the [R-SIG-Finance](https://stat.ethz.ch/mailman/listinfo/r-sig-finance)
18 | mailing list (you must subscribe to post).
19 |
20 | ## Contributing ([open an issue](https://github.com/duttashi/text-analysis/issues))
21 |
22 | Please see the [contributing guide](CONTRIBUTING.md).
23 |
24 | ## Author
25 | [Ashish Dutt](https://duttashi.github.io/about/)
26 |
--------------------------------------------------------------------------------
/TextAnalysis.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 |
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 |
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 |
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 |
--------------------------------------------------------------------------------
/TextAnalysis.wpr:
--------------------------------------------------------------------------------
1 | #!wing
2 | #!version=7.0
3 | ##################################################################
4 | # Wing project file #
5 | ##################################################################
6 | [project attributes]
7 | proj.directory-list = [{'dirloc': loc('data'),
8 | 'excludes': (),
9 | 'filter': u'*',
10 | 'include_hidden': False,
11 | 'recursive': True,
12 | 'watch_for_changes': True}]
13 | proj.file-list = [loc('scripts/python/basic_text_processing_00.py'),
14 | loc('scripts/python/basic_text_processing_01.py')]
15 | proj.file-type = 'normal'
16 | [user attributes]
17 | debug.show-args-dialog = {loc('scripts/python/basic_text_processing_00.py'): False,
18 | loc('scripts/python/basic_text_processing_01.py'): False,
19 | loc('scripts/python/extract_table_from_pdf.py'): False,
20 | loc('scripts/python/extract_text_data_from_pdf.py'): False,
21 | loc('scripts/python/list_environ_path_details.py'): False,
22 | loc('scripts/python/pdf_reader.py'): False,
23 | loc('scripts/python/pdf_table_reader.py'): False,
24 | loc('unknown: #1'): False}
25 | guimgr.overall-gui-state = {'windowing-policy': 'combined-window',
26 | 'windows': [{'name': '9z79HX4zahWgrg1gEbMAmcgYyp'\
27 | 'VZEoEv',
28 | 'size-state': '',
29 | 'type': 'dock',
30 | 'view': {'area': 'tall',
31 | 'constraint': None,
32 | 'current_pages': [0],
33 | 'full-screen': False,
34 | 'notebook_display': 'normal',
35 | 'notebook_percent': 0.25,
36 | 'override_title': None,
37 | 'pagelist': [('project',
38 | 'tall',
39 | 0,
40 | {'tree-state': {'file-sort-method': 'by name',
41 | 'list-files-first': False,
42 | 'tree-states': {'deep': {'expanded-nodes': [],
43 | 'selected-nodes': [(1,
44 | 0)],
45 | 'top-node': (0,)},
46 | 'flat': {'expanded-nodes': [(6,)],
47 | 'selected-nodes': [(6,)],
48 | 'top-node': (0,)}},
49 | 'tree-style': 'flat'}}),
50 | ('source-assistant',
51 | 'tall',
52 | 2,
53 | {}),
54 | ('debug-stack',
55 | 'tall',
56 | 1,
57 | {'codeline-mode': 'below'}),
58 | ('browser',
59 | 'tall',
60 | 0,
61 | {'all_tree_states': {loc('scripts/python/basic_text_processing_00.py'): {'e'\
62 | 'xpanded-nodes': [],
63 | 'selected-nodes': [[('generic attribute',
64 | loc('scripts/python/basic_text_processing_00.py'),
65 | 'e')]],
66 | 'top-node': [('generic attribute',
67 | loc('scripts/python/basic_text_processing_00.py'),
68 | 'e')]},
69 | loc('scripts/python/extract_table_from_pdf.py'): {'expanded-nodes': [],
70 | 'selected-nodes': [],
71 | 'top-node': [('generic attribute',
72 | loc('scripts/python/extract_table_from_pdf.py'),
73 | 'df')]},
74 | loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'): {'expande'\
75 | 'd-nodes': [],
76 | 'selected-nodes': [],
77 | 'top-node': [('function def',
78 | loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
79 | 'build_options')]}},
80 | 'browse_mode': u'Current Module',
81 | 'follow-selection': False,
82 | 'sort_mode': 'Alphabetically',
83 | 'visibility_options': {u'Derived Classes': False,
84 | u'Imported': False,
85 | u'Modules': True}}),
86 | ('indent',
87 | 'tall',
88 | 2,
89 | {})],
90 | 'primary_view_state': {'area': 'wide',
91 | 'constraint': None,
92 | 'current_pages': [0,
93 | 1],
94 | 'notebook_display': 'normal',
95 | 'notebook_percent': 0.37662337662337664,
96 | 'override_title': None,
97 | 'pagelist': [('batch-search',
98 | 'wide',
99 | 0,
100 | {'fScope': {'fFileSetName': 'All Source Files',
101 | 'fLocation': None,
102 | 'fRecursive': True,
103 | 'fType': 'project-files'},
104 | 'fSearchSpec': {'fEndPos': None,
105 | 'fIncludeLinenos': True,
106 | 'fInterpretBackslashes': False,
107 | 'fMatchCase': False,
108 | 'fOmitBinary': True,
109 | 'fRegexFlags': 42,
110 | 'fReplaceText': '',
111 | 'fReverse': False,
112 | 'fSearchText': '',
113 | 'fStartPos': 0,
114 | 'fStyle': 'text',
115 | 'fWholeWords': False,
116 | 'fWrap': True},
117 | 'fUIOptions': {'fAutoBackground': True,
118 | 'fFilePrefix': 'short-file',
119 | 'fFindAfterReplace': True,
120 | 'fInSelection': False,
121 | 'fIncremental': True,
122 | 'fReplaceOnDisk': False,
123 | 'fShowFirstMatch': False,
124 | 'fShowLineno': True,
125 | 'fShowReplaceWidgets': False},
126 | 'replace-entry-expanded': False,
127 | 'search-entry-expanded': False}),
128 | ('interactive-search',
129 | 'wide',
130 | 0,
131 | {'fScope': {'fFileSetName': 'All Source Files',
132 | 'fLocation': None,
133 | 'fRecursive': True,
134 | 'fType': 'project-files'},
135 | 'fSearchSpec': {'fEndPos': None,
136 | 'fIncludeLinenos': True,
137 | 'fInterpretBackslashes': False,
138 | 'fMatchCase': False,
139 | 'fOmitBinary': True,
140 | 'fRegexFlags': 42,
141 | 'fReplaceText': '',
142 | 'fReverse': False,
143 | 'fSearchText': '',
144 | 'fStartPos': 0,
145 | 'fStyle': 'text',
146 | 'fWholeWords': False,
147 | 'fWrap': True},
148 | 'fUIOptions': {'fAutoBackground': True,
149 | 'fFilePrefix': 'short-file',
150 | 'fFindAfterReplace': True,
151 | 'fInSelection': False,
152 | 'fIncremental': True,
153 | 'fReplaceOnDisk': False,
154 | 'fShowFirstMatch': False,
155 | 'fShowLineno': True,
156 | 'fShowReplaceWidgets': False}}),
157 | ('debug-data',
158 | 'wide',
159 | 0,
160 | {}),
161 | ('debug-io',
162 | 'wide',
163 | 1,
164 | {}),
165 | ('debug-exceptions',
166 | 'wide',
167 | 1,
168 | {}),
169 | ('python-shell',
170 | 'wide',
171 | 2,
172 | {'active-range': (None,
173 | -1,
174 | -1),
175 | 'attrib-starts': [],
176 | 'code-line': '',
177 | 'first-line': 0L,
178 | 'folded-linenos': [],
179 | 'history': {},
180 | 'launch-id': None,
181 | 'sel-line': 2L,
182 | 'sel-line-start': 145L,
183 | 'selection_end': 145L,
184 | 'selection_start': 145L,
185 | 'zoom': 0L}),
186 | ('messages',
187 | 'wide',
188 | 2,
189 | {'current-domain': 0}),
190 | ('os-command',
191 | 'wide',
192 | 1,
193 | {'last-percent': 0.8,
194 | 'toolbox-percent': 1.0,
195 | 'toolbox-tree-sel': ''})],
196 | 'primary_view_state': {'editor_states': ({'bookmarks': ([[loc('scripts/python/basic_text_processing_00.py'),
197 | {'attrib-starts': [],
198 | 'code-line': ' countlist = Counter(stripped) \r\n',
199 | 'first-line': 6L,
200 | 'folded-linenos': [],
201 | 'sel-line': 44L,
202 | 'sel-line-start': 2135L,
203 | 'selection_end': 2173L,
204 | 'selection_start': 2143L,
205 | 'zoom': 0L},
206 | 1586504429.747],
207 | [loc('scripts/python/extract_text_data_from_textfiles.py'),
208 | {'attrib-starts': [],
209 | 'code-line': 'filePath = "C:\\\\Users\\\\Ashoo\\\\Documents\\\\Pyt'\
210 | 'honPlayground\\\\text-analysis\\\\data\\\\plainText"'\
211 | '\r\n',
212 | 'first-line': 33L,
213 | 'folded-linenos': [],
214 | 'sel-line': 12L,
215 | 'sel-line-start': 439L,
216 | 'selection_end': 502L,
217 | 'selection_start': 502L,
218 | 'zoom': 0L},
219 | 1586504435.198],
220 | [loc('scripts/python/basic_text_processing_00.py'),
221 | {'attrib-starts': [],
222 | 'code-line': ' countlist = Counter(stripped) \r\n',
223 | 'first-line': 6L,
224 | 'folded-linenos': [],
225 | 'sel-line': 44L,
226 | 'sel-line-start': 2135L,
227 | 'selection_end': 2173L,
228 | 'selection_start': 2143L,
229 | 'zoom': 0L},
230 | 1586504437.538],
231 | [loc('scripts/python/extract_text_data_from_pdf.py'),
232 | {'attrib-starts': [],
233 | 'code-line': '\r\n',
234 | 'first-line': 36L,
235 | 'folded-linenos': [],
236 | 'sel-line': 11L,
237 | 'sel-line-start': 717L,
238 | 'selection_end': 717L,
239 | 'selection_start': 717L,
240 | 'zoom': 0L},
241 | 1586504686.472],
242 | [loc('scripts/python/extract_table_from_pdf.py'),
243 | {'attrib-starts': [],
244 | 'code-line': ' tabula.convert_into(pdfFiles, "pdf_to_cs'\
245 | 'v.csv", output_format="csv")\r\n',
246 | 'first-line': 10L,
247 | 'folded-linenos': [],
248 | 'sel-line': 28L,
249 | 'sel-line-start': 1279L,
250 | 'selection_end': 1359L,
251 | 'selection_start': 1359L,
252 | 'zoom': 0L},
253 | 1586505684.863],
254 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
255 | {'attrib-starts': [('convert_into|0|',
256 | 505)],
257 | 'code-line': ' "{} is empty. Check the file, or downloa'\
258 | 'd it manually.".format(path)\n',
259 | 'first-line': 541L,
260 | 'folded-linenos': [],
261 | 'sel-line': 557L,
262 | 'sel-line-start': 23803L,
263 | 'selection_end': 23803L,
264 | 'selection_start': 23803L,
265 | 'zoom': 0L},
266 | 1586505728.867],
267 | [loc('scripts/python/extract_table_from_pdf.py'),
268 | {'attrib-starts': [],
269 | 'code-line': ' tabula.convert_into(pdfFiles, "pdf_to_cs'\
270 | 'v.csv", output_format="csv")\r\n',
271 | 'first-line': 12L,
272 | 'folded-linenos': [],
273 | 'sel-line': 28L,
274 | 'sel-line-start': 1279L,
275 | 'selection_end': 1297L,
276 | 'selection_start': 1297L,
277 | 'zoom': 0L},
278 | 1586505861.747],
279 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
280 | {'attrib-starts': [('convert_into|0|',
281 | 505)],
282 | 'code-line': ' "{} is empty. Check the file, or downloa'\
283 | 'd it manually.".format(path)\n',
284 | 'first-line': 538L,
285 | 'folded-linenos': [],
286 | 'sel-line': 557L,
287 | 'sel-line-start': 23803L,
288 | 'selection_end': 23883L,
289 | 'selection_start': 23816L,
290 | 'zoom': 0L},
291 | 1586505996.978],
292 | [loc('scripts/python/extract_table_from_pdf.py'),
293 | {'attrib-starts': [],
294 | 'code-line': " df= tabula.read_pdf(pdfFiles, pages=\"al"\
295 | "l\", encoding='utf-8', spreadsheet=True)\r\n",
296 | 'first-line': 12L,
297 | 'folded-linenos': [],
298 | 'sel-line': 24L,
299 | 'sel-line-start': 1139L,
300 | 'selection_end': 1228L,
301 | 'selection_start': 1228L,
302 | 'zoom': 0L},
303 | 1586506020.948],
304 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
305 | {'attrib-starts': [('_run|0|',
306 | 53)],
307 | 'code-line': ' built_options = build_options(**options)\n',
308 | 'first-line': 68L,
309 | 'folded-linenos': [],
310 | 'sel-line': 73L,
311 | 'sel-line-start': 2296L,
312 | 'selection_end': 2296L,
313 | 'selection_start': 2296L,
314 | 'zoom': 0L},
315 | 1586506030.692],
316 | [loc('scripts/python/extract_table_from_pdf.py'),
317 | {'attrib-starts': [],
318 | 'code-line': " df= tabula.read_pdf(pdfFiles, pages=\"al"\
319 | "l\", encoding='utf-8')\r\n",
320 | 'first-line': 12L,
321 | 'folded-linenos': [],
322 | 'sel-line': 24L,
323 | 'sel-line-start': 1139L,
324 | 'selection_end': 1210L,
325 | 'selection_start': 1210L,
326 | 'zoom': 0L},
327 | 1586506053.037],
328 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
329 | {'attrib-starts': [('convert_into|0|',
330 | 505)],
331 | 'code-line': ' "{} is empty. Check the file, or downloa'\
332 | 'd it manually.".format(path)\n',
333 | 'first-line': 541L,
334 | 'folded-linenos': [],
335 | 'sel-line': 557L,
336 | 'sel-line-start': 23803L,
337 | 'selection_end': 23803L,
338 | 'selection_start': 23803L,
339 | 'zoom': 0L},
340 | 1586506058.31],
341 | [loc('scripts/python/extract_table_from_pdf.py'),
342 | {'attrib-starts': [],
343 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\
344 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\
345 | "\n",
346 | 'first-line': 12L,
347 | 'folded-linenos': [],
348 | 'sel-line': 28L,
349 | 'sel-line-start': 1297L,
350 | 'selection_end': 1394L,
351 | 'selection_start': 1394L,
352 | 'zoom': 0L},
353 | 1586506153.545],
354 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
355 | {'attrib-starts': [('convert_into|0|',
356 | 505)],
357 | 'code-line': ' "{} is empty. Check the file, or downloa'\
358 | 'd it manually.".format(path)\n',
359 | 'first-line': 547L,
360 | 'folded-linenos': [],
361 | 'sel-line': 557L,
362 | 'sel-line-start': 23803L,
363 | 'selection_end': 23803L,
364 | 'selection_start': 23803L,
365 | 'zoom': 0L},
366 | 1586506173.045],
367 | [loc('scripts/python/extract_table_from_pdf.py'),
368 | {'attrib-starts': [],
369 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\
370 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\
371 | "\n",
372 | 'first-line': 12L,
373 | 'folded-linenos': [],
374 | 'sel-line': 28L,
375 | 'sel-line-start': 1297L,
376 | 'selection_end': 1394L,
377 | 'selection_start': 1394L,
378 | 'zoom': 0L},
379 | 1586506203.318],
380 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
381 | {'attrib-starts': [('convert_into|0|',
382 | 505)],
383 | 'code-line': ' "{} is empty. Check the file, or downloa'\
384 | 'd it manually.".format(path)\n',
385 | 'first-line': 547L,
386 | 'folded-linenos': [],
387 | 'sel-line': 557L,
388 | 'sel-line-start': 23803L,
389 | 'selection_end': 23803L,
390 | 'selection_start': 23803L,
391 | 'zoom': 0L},
392 | 1586506208.703],
393 | [loc('scripts/python/extract_table_from_pdf.py'),
394 | {'attrib-starts': [],
395 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\
396 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\
397 | "\n",
398 | 'first-line': 12L,
399 | 'folded-linenos': [],
400 | 'sel-line': 28L,
401 | 'sel-line-start': 1297L,
402 | 'selection_end': 1394L,
403 | 'selection_start': 1394L,
404 | 'zoom': 0L},
405 | 1586506210.434],
406 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
407 | {'attrib-starts': [('convert_into|0|',
408 | 505)],
409 | 'code-line': ' "{} is empty. Check the file, or downloa'\
410 | 'd it manually.".format(path)\n',
411 | 'first-line': 535L,
412 | 'folded-linenos': [],
413 | 'sel-line': 557L,
414 | 'sel-line-start': 23803L,
415 | 'selection_end': 23803L,
416 | 'selection_start': 23803L,
417 | 'zoom': 0L},
418 | 1586506219.002],
419 | [loc('scripts/python/extract_table_from_pdf.py'),
420 | {'attrib-starts': [],
421 | 'code-line': " tabula.convert_into(pdfFiles, \"pdf_to_c"\
422 | "sv.csv\", output_format=\"csv\", encoding='utf-8')\r"\
423 | "\n",
424 | 'first-line': 12L,
425 | 'folded-linenos': [],
426 | 'sel-line': 28L,
427 | 'sel-line-start': 1297L,
428 | 'selection_end': 1350L,
429 | 'selection_start': 1340L,
430 | 'zoom': 0L},
431 | 1586506558.008],
432 | [loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'),
433 | {'attrib-starts': [('convert_into|0|',
434 | 505)],
435 | 'code-line': ' "{} is empty. Check the file, or downloa'\
436 | 'd it manually.".format(path)\n',
437 | 'first-line': 541L,
438 | 'folded-linenos': [],
439 | 'sel-line': 557L,
440 | 'sel-line-start': 23803L,
441 | 'selection_end': 23803L,
442 | 'selection_start': 23803L,
443 | 'zoom': 0L},
444 | 1586506571.33]],
445 | 20),
446 | 'current-loc': loc('scripts/python/extract_table_from_pdf.py'),
447 | 'editor-state-list': [(loc('scripts/python/extract_table_from_pdf.py'),
448 | {'attrib-starts': [],
449 | 'code-line': ' df= tabula.read_pd'\
450 | 'f(pdfFiles, pages="all")\r\n',
451 | 'first-line': 12L,
452 | 'folded-linenos': [],
453 | 'sel-line': 24L,
454 | 'sel-line-start': 1139L,
455 | 'selection_end': 1192L,
456 | 'selection_start': 1192L,
457 | 'zoom': 0L})],
458 | 'has-focus': True,
459 | 'locked': False},
460 | [loc('scripts/python/extract_table_from_pdf.py')]),
461 | 'open_files': [u'scripts/python/extract_table_from_pdf.py']},
462 | 'saved_notebook_display': None,
463 | 'split_percents': {0: 0.338},
464 | 'splits': 2,
465 | 'tab_location': 'top',
466 | 'traversal_pos': ((1,
467 | 1),
468 | 1586506557.899),
469 | 'user_data': {}},
470 | 'saved_notebook_display': None,
471 | 'split_percents': {},
472 | 'splits': 1,
473 | 'tab_location': 'left',
474 | 'traversal_pos': ((0,
475 | 0),
476 | 1586504780.092),
477 | 'user_data': {}},
478 | 'window-alloc': (0,
479 | -1,
480 | 1380,
481 | 739)}]}
482 | guimgr.recent-documents = [loc('scripts/python/extract_table_from_pdf.py')]
483 | guimgr.visual-state = {loc('scripts/python/basic_text_processing_00.py'): {'a'\
484 | 'ttrib-starts': [],
485 | 'code-line': ' countlist = Counter(stripped) \r\n',
486 | 'first-line': 6L,
487 | 'folded-linenos': [],
488 | 'sel-line': 44L,
489 | 'sel-line-start': 2135L,
490 | 'selection_end': 2173L,
491 | 'selection_start': 2143L,
492 | 'zoom': 0L},
493 | loc('scripts/python/extract_table_from_pdf.py'): {'at'\
494 | 'trib-starts': [],
495 | 'code-line': ' df= tabula.read_pdf(f, stream=True)[0]\r\n',
496 | 'first-line': 0L,
497 | 'folded-linenos': [],
498 | 'sel-line': 21L,
499 | 'sel-line-start': 1031L,
500 | 'selection_end': 1064L,
501 | 'selection_start': 1064L,
502 | 'zoom': 0L},
503 | loc('scripts/python/extract_text_data_from_pdf.py'): {'a'\
504 | 'ttrib-starts': [],
505 | 'code-line': '\r\n',
506 | 'first-line': 36L,
507 | 'folded-linenos': [],
508 | 'sel-line': 11L,
509 | 'sel-line-start': 717L,
510 | 'selection_end': 717L,
511 | 'selection_start': 717L,
512 | 'zoom': 0L},
513 | loc('scripts/python/extract_text_data_from_textfiles.py'): {'a'\
514 | 'ttrib-starts': [],
515 | 'code-line': 'filePath = "C:\\\\Users\\\\Ashoo\\\\Documents\\\\Pytho'\
516 | 'nPlayground\\\\text-analysis\\\\data\\\\plainText"\r\n',
517 | 'first-line': 33L,
518 | 'folded-linenos': [],
519 | 'sel-line': 12L,
520 | 'sel-line-start': 439L,
521 | 'selection_end': 502L,
522 | 'selection_start': 502L,
523 | 'zoom': 0L},
524 | loc('scripts/python/list_environ_path_details.py'): {'a'\
525 | 'ttrib-starts': [],
526 | 'code-line': ' print(item)',
527 | 'first-line': 0L,
528 | 'folded-linenos': [],
529 | 'sel-line': 3L,
530 | 'sel-line-start': 62L,
531 | 'selection_end': 77L,
532 | 'selection_start': 77L,
533 | 'zoom': 0L},
534 | loc('scripts/python/pdf_reader.py'): {'attrib-starts': [],
535 | 'code-line': '# Objective 1: Read multiple pdf files from a director'\
536 | 'y into a list\r\n',
537 | 'first-line': 39L,
538 | 'folded-linenos': [],
539 | 'sel-line': 0L,
540 | 'sel-line-start': 0L,
541 | 'selection_end': 0L,
542 | 'selection_start': 0L,
543 | 'zoom': 0L},
544 | loc('scripts/python/pdf_table_reader.py'): {'attrib-s'\
545 | 'tarts': [],
546 | 'code-line': '\r\n',
547 | 'first-line': 9L,
548 | 'folded-linenos': [],
549 | 'sel-line': 17L,
550 | 'sel-line-start': 677L,
551 | 'selection_end': 677L,
552 | 'selection_start': 677L,
553 | 'zoom': 0L},
554 | loc('../../../Miniconda3/Lib/site-packages/PyPDF2/pdf.py'): {'a'\
555 | 'ttrib-starts': [('PdfFileReader|0|',
556 | 1043),
557 | ('PdfFileReader|0|.read|0|',
558 | 1684)],
559 | 'code-line': ' raise utils.PdfReadError("EOF marker n'\
560 | 'ot found")\n',
561 | 'first-line': 1678L,
562 | 'folded-linenos': [],
563 | 'sel-line': 1695L,
564 | 'sel-line-start': 67771L,
565 | 'selection_end': 67771L,
566 | 'selection_start': 67771L,
567 | 'zoom': 0L},
568 | loc('../../../Miniconda3/Lib/site-packages/tabula/io.py'): {'a'\
569 | 'ttrib-starts': [('convert_into|0|',
570 | 505)],
571 | 'code-line': ' "{} is empty. Check the file, or download '\
572 | 'it manually.".format(path)\n',
573 | 'first-line': 541L,
574 | 'folded-linenos': [],
575 | 'sel-line': 557L,
576 | 'sel-line-start': 23803L,
577 | 'selection_end': 23803L,
578 | 'selection_start': 23803L,
579 | 'zoom': 0L},
580 | loc('../../../Miniconda3/Lib/site-packages/tika/tika.py'): {'a'\
581 | 'ttrib-starts': [('checkTikaServer|0|',
582 | 568)],
583 | 'code-line': ' raise RuntimeError("Unable to start Ti'\
584 | 'ka server.")\n',
585 | 'first-line': 583L,
586 | 'folded-linenos': [],
587 | 'sel-line': 600L,
588 | 'sel-line-start': 23544L,
589 | 'selection_end': 23544L,
590 | 'selection_start': 23544L,
591 | 'zoom': 0L},
592 | loc('../../../Miniconda3/Lib/subprocess.py'): {'attri'\
593 | 'b-starts': [('run|0|',
594 | 430)],
595 | 'code-line': ' output=stdout, st'\
596 | 'derr=stderr)\n',
597 | 'first-line': 469L,
598 | 'folded-linenos': [],
599 | 'sel-line': 486L,
600 | 'sel-line-start': 17536L,
601 | 'selection_end': 17536L,
602 | 'selection_start': 17536L,
603 | 'zoom': 0L}}
604 |
--------------------------------------------------------------------------------
/data/fake_news_data/liar_dataset/README.md:
--------------------------------------------------------------------------------
1 | LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION
2 |
3 | William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.
4 | =====================================================================
5 | Description of the TSV format:
6 |
7 | Column 1: the ID of the statement ([ID].json).
8 | Column 2: the label.
9 | Column 3: the statement.
10 | Column 4: the subject(s).
11 | Column 5: the speaker.
12 | Column 6: the speaker's job title.
13 | Column 7: the state info.
14 | Column 8: the party affiliation.
15 | Column 9-13: the total credit history count, including the current statement.
16 | 9: barely true counts.
17 | 10: false counts.
18 | 11: half true counts.
19 | 12: mostly true counts.
20 | 13: pants on fire counts.
21 | Column 14: the context (venue / location of the speech or statement).
22 |
23 | Note that we do not provide the full-text verdict report in this current version of the dataset,
24 | but you can use the following command to access the full verdict report and links to the source documents:
25 | wget http://www.politifact.com//api/v/2/statement/[ID]/?format=json
26 |
27 | ======================================================================
28 | The original sources retain the copyright of the data.
29 |
30 | Note that there are absolutely no guarantees with this data,
31 | and we provide this dataset "as is",
32 | but you are welcome to report the issues of the preliminary version
33 | of this data.
34 |
35 | You are allowed to use this dataset for research purposes only.
36 |
37 | For more questions about the dataset, please contact:
38 | William Wang, william@cs.ucsb.edu
--------------------------------------------------------------------------------
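The TSV splits described above are distributed without a header row, so a minimal loading sketch in R might look like the following; the column names are illustrative shorthand for the 14 columns listed in the README, not names supplied by the files:

library(readr)

liar_cols <- c("id", "label", "statement", "subject", "speaker",
               "speaker_job", "state", "party",
               "barely_true_counts", "false_counts", "half_true_counts",
               "mostly_true_counts", "pants_on_fire_counts", "context")

# read the training split; test.tsv and valid.tsv follow the same layout
liar_train <- read_tsv("data/fake_news_data/liar_dataset/train.tsv",
                       col_names = liar_cols)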
/data/readme.md:
--------------------------------------------------------------------------------
1 | data files for text analysis are kept here
2 |
--------------------------------------------------------------------------------
/data/text_data_for_analysis.txt:
--------------------------------------------------------------------------------
1 | As a term, data analytics predominantly refers to an assortment of applications, from basic business
2 | intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced
3 | analytics. In that sense, it's similar in nature to business analytics, another umbrella term for
4 | approaches to analyzing data -- with the difference that the latter is oriented to business uses, while
5 | data analytics has a broader focus. The expansive view of the term isn't universal, though: In some
6 | cases, people use data analytics specifically to mean advanced analytics, treating BI as a separate
7 | category. Data analytics initiatives can help businesses increase revenues, improve operational
8 | efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to
9 | emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of
10 | boosting business performance. Depending on the particular application, the data that's analyzed
11 | can consist of either historical records or new information that has been processed for real-time
12 | analytics uses. In addition, it can come from a mix of internal systems and external data sources. At
13 | a high level, data analytics methodologies include exploratory data analysis (EDA), which aims to find
14 | patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical
15 | techniques to determine whether hypotheses about a data set are true or false. EDA is often
16 | compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a
17 | distinction first drawn by statistician John W. Tukey in his 1977 book Exploratory Data Analysis. Data
18 | analytics can also be separated into quantitative data analysis and qualitative data analysis. The
19 | former involves analysis of numerical data with quantifiable variables that can be compared or
20 | measured statistically. The qualitative approach is more interpretive -- it focuses on understanding
21 | the content of non-numerical data like text, images, audio and video, including common phrases,
22 | themes and points of view.
--------------------------------------------------------------------------------
/figures/Rplot-wordcloud01.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/duttashi/text-analysis/154b34ff3c2fac60e5e4068e48597dbdc815b894/figures/Rplot-wordcloud01.png
--------------------------------------------------------------------------------
/figures/readme.md:
--------------------------------------------------------------------------------
1 | All plots/figures live here
2 |
--------------------------------------------------------------------------------
/resources/readme.md:
--------------------------------------------------------------------------------
1 | All references to learning resources live here
2 |
--------------------------------------------------------------------------------
/scripts/.gitignore:
--------------------------------------------------------------------------------
1 | /.ipynb_checkpoints/
2 |
--------------------------------------------------------------------------------
/scripts/R/HP_preprocess_01.R:
--------------------------------------------------------------------------------
1 |
2 |
3 | # clear the workspace
4 | rm(list=ls())
5 |
6 | # load harrypotter books from the package harrypotter
7 | library(devtools)
8 | devtools::install_github("bradleyboehmke/harrypotter")
9 | # load the relevant libraries
10 | library(tidyverse) # data manipulation & plotting
11 | library(stringr) # text cleaning and regular expressions
12 | library(tidytext) # provides additional text mining functions
13 | library(harrypotter) # provides the first seven novels of the Harry Potter series
14 | library(ggplot2) # for data visualization
15 |
16 | # read the first two books
17 | titles <- c("Philosopher's Stone", "Chamber of Secrets")
18 | books <- list(philosophers_stone, chamber_of_secrets)
19 |
20 | # sneak peek into the first two chapters of the Chamber of Secrets book
21 | chamber_of_secrets[1:2]
22 |
23 | # Tidying the text
24 | # To properly analyze the text, we need to convert it into a data frame or a tibble.
25 | typeof(chamber_of_secrets) # as you can see, at the moment it is a character vector
26 | text_tb<- tibble(chapter= seq_along(philosophers_stone),
27 | text=philosophers_stone)
28 | str(text_tb)
29 |
30 | # Unnest the text
31 | # It's important to note that the unnest_tokens() function splits the text into single words, strips all punctuation, and converts each word to lowercase for easy comparability.
32 | clean<-text_tb %>%
33 | unnest_tokens(word, text)
34 |
35 | clean_book<- tibble()
36 | clean_book<- rbind(clean_book, clean)
37 | clean_book
38 |
39 | # Basic calculations
40 | # calculate word frequency
41 | word_freq <- clean_book %>%
42 | count(word, sort=TRUE)
43 | word_freq
44 | # lots of stop words like the, and, to, a etc. Let's remove the stop words.
45 | # We can remove the stop words from our tibble with anti_join and the built-in stop_words data set provided by tidytext.
46 | clean_book %>%
47 | anti_join(stop_words) %>%
48 | count(word, sort=TRUE) %>%
49 | top_n(10) %>%
50 | ggplot(aes(word,n))+
51 | geom_bar(stat = "identity")
52 |
53 |
54 |
55 |
56 |
57 |
--------------------------------------------------------------------------------
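As a quick check of the unnest_tokens() behaviour described in the comments above (split into single words, strip punctuation, lowercase), a toy example:

library(tibble)
library(tidytext)

toy <- tibble(chapter = 1, text = "The Boy Who Lived!")
unnest_tokens(toy, word, text)
# one row per token: "the", "boy", "who", "lived" (lowercased, punctuation dropped)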
/scripts/R/austen_text_analysis.R:
--------------------------------------------------------------------------------
1 | # Script title: austen_text_analysis.R
2 |
3 | # clear the workspace
4 | rm(list=ls())
5 | 
6 | # Load the data
7 | austen.data<- scan("data/plainText/austen.txt", what = "character", sep = "\n")
8 |
9 | # find out the beginning & end of the main text
10 | austen.start<-which(austen.data=="CHAPTER 1")
11 | austen.start # main text begins from line 17
12 | austen.end<-which(austen.data=="THE END")
13 | austen.end # main text ends at line 10609
14 |
15 | # save the metadata to a separate object
16 | austen.startmeta<- austen.data[1:(austen.start-1)]
17 | austen.endmeta<- austen.data[(austen.end+1):length(austen.data)]
18 | metadata<- c(austen.startmeta, austen.endmeta)
19 | novel.data<- austen.data[austen.start: austen.end]
20 | head(novel.data)
21 | tail(novel.data)
22 |
23 | # Formatting the text for subsequent data analysis
24 | novel.data<- paste(novel.data, collapse = " ")
25 |
26 | # Data preprocessing
27 | ## convert all words to lowercase
28 | novel.lower<- tolower(novel.data)
29 | ## extract all words only
30 | novel.words<- strsplit(novel.lower, "\\W") # where \\W is a regex to match any non-word character.
31 | head(novel.words)
32 | str(novel.words) # The words are contained in a list format
33 | novel.words<- unlist(novel.words)
34 | str(novel.words) # Convert list to a character vector
35 |
36 | ## Removing the blanks. First, find out the non blank positions
37 | notblanks<- which(novel.words!="")
38 | head(notblanks)
39 |
40 | novel.wordsonly<-novel.words[notblanks]
41 | head(novel.wordsonly)
42 | totalwords<-length(novel.wordsonly)
43 |
44 | ### Practice: Find the top 10 most frequent words in the text
45 | novel.wordsonly.freq<- table(novel.wordsonly)
46 | novel.wordsonly.freq.sorted<- sort(novel.wordsonly.freq, decreasing = TRUE)
47 | novel.wordsonly.freq.sorted[c(1:10)]
48 | ### Practice: Visualize the top 10 most frequent words in the text
49 | plot(novel.wordsonly.freq.sorted[c(1:10)])
50 | ## calculating the relative frequency of the words
51 | novel.wordsonly.relfreq<- 100*(novel.wordsonly.freq.sorted/sum(novel.wordsonly.freq.sorted))
52 |
53 | plot(novel.wordsonly.relfreq[1:10], type="b",
54 | xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n")
55 | axis(1,1:10, labels=names(novel.wordsonly.relfreq [1:10]))
--------------------------------------------------------------------------------
/scripts/R/dispersion_plots.R:
--------------------------------------------------------------------------------
1 | # Script name: dispersion_plots.R
2 | # vector to use: novel.wordsonly
3 | # RQ: How can we identify where in the text different words occur, and how they behave over the course of the novel?
4 |
5 | # list the objects currently in the workspace
6 | ls()
7 |
8 | # step 1: create an integer vector indicating the position of each word in the text
9 | novel.time.v<- seq_along(novel.wordsonly)
10 | # step 2: Say, I want to see all occurrences of the word "family" in the text
11 | family.v<- which(novel.wordsonly == "family")
12 | family.v
13 | ## Ultimately we want a dispersion plot where the x-axis is novel.time.v and the y-axis simply flags the positions where "family" occurs (1 where it is found, NA where it is not)
14 | family.v.count<- rep(NA, length(novel.time.v)) # initialize a vector full of NA values
15 | family.v.count[family.v]<-1 # using the numerical positions stored in the family.v object, so the resetting is simple with this expression
16 | family.v.count
17 |
18 | ## Plot showing the distribution of the word, "family" across the novel
19 | plot(family.v.count, main="Dispersion Plot of `family' in Moby Dick",
20 | xlab="Novel Time", ylab="family", type="h", ylim=c(0,1), yaxt='n')
21 |
22 | man.v<- which(novel.wordsonly == "man")
23 | man.v
24 | ## As above, the y-axis simply flags the positions where "man" occurs (1 where it is found, NA where it is not)
25 | man.v.count<- rep(NA, length(novel.time.v)) # initialize a vector full of NA values
26 | man.v.count[man.v]<-1 # set 1 at the numerical positions stored in the man.v object
27 | man.v.count
28 |
29 | ## Plot showing the distribution of the word "man" across the novel
30 | plot(man.v.count, main="Dispersion Plot of `man' in Moby Dick",
31 |      xlab="Novel Time", ylab="man", type="h", ylim=c(0,1), yaxt='n')
--------------------------------------------------------------------------------
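Because the same NA-vector construction is repeated for each word above, a small helper can remove the duplication. This is only a sketch (the dispersion_plot() name is my own, and it assumes novel.wordsonly from initialScript.R is in the workspace):

dispersion_plot <- function(words, term, novel_title = "Moby Dick") {
  hits <- rep(NA, length(words))   # one slot per running word position
  hits[words == term] <- 1         # mark every position where the term occurs
  plot(hits, main = paste0("Dispersion Plot of '", term, "' in ", novel_title),
       xlab = "Novel Time", ylab = term, type = "h", ylim = c(0, 1), yaxt = "n")
}

dispersion_plot(novel.wordsonly, "family")
dispersion_plot(novel.wordsonly, "man")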
/scripts/R/dracula_text_analysis.R:
--------------------------------------------------------------------------------
1 | # Objective: Read a novel from project gutenberg and then tidy it. Thereafter, count the number of words, and perform basic sentiment analysis
2 | # Script name: dracula_text_analysis.R
3 |
4 | library(gutenbergr)
5 | library(tidytext)
6 | library(tidyverse)
7 | library(tm) # for Corpus()
8 |
9 | # Show all gutenberg works in the gutenbergr package
10 | gutenberg_works<- gutenberg_works(languages = "en")
11 | View(gutenberg_works)
12 | # The id for Bram Stoker's Dracula is 345
13 | dracula<- gutenberg_download(345)
14 | View(dracula)
15 | str(dracula)
16 |
17 | #head(dracula_stripped,169)
18 | #tail(dracula_stripped,15482)
19 |
20 | dracula_tidy<- dracula%>%
21 | unnest_tokens(word, text) %>%
22 | anti_join(stop_words, by="word")
23 | # DATA PREPROCESSING & INITIAL VISUALIZATION
24 |
25 | # Lets create a custom theme
26 | mytheme<- theme_bw()+
27 | theme(plot.title = element_text(color = "darkred"))+
28 | theme(panel.border = element_rect(color = "steelblue", size = 2))+
29 | theme(plot.title = element_text(hjust = 0.5)) # where 0.5 is to center
30 |
31 | # From the count of common words from above code, now plotting the most common words
32 | dracula_tidy %>%
33 | count(word, sort = TRUE) %>%
34 | filter(n >100 & n<400) %>%
35 | mutate(word = reorder(word, n)) %>%
36 | ggplot(aes(word, n)) +
37 | geom_col() +
38 | #xlab(NULL) +
39 | coord_flip()+
40 | mytheme+
41 |   ggtitle("Top words in Bram Stoker's Dracula")
42 |
43 | # convert to corpus
44 | novel_corpus<- Corpus(VectorSource(dracula_tidy[,2]))
45 | # preprocess the novel data; note that each tm_map() call must chain on the cleaned corpus
46 | novel_corpus_clean<- tm_map(novel_corpus, content_transformer(tolower))
47 | novel_corpus_clean<- tm_map(novel_corpus_clean, removeNumbers)
48 | novel_corpus_clean<- tm_map(novel_corpus_clean, removeWords, stopwords("english"))
49 | novel_corpus_clean<- tm_map(novel_corpus_clean, removePunctuation)
50 | # To see the first few documents in the cleaned corpus
51 | inspect(novel_corpus_clean[1:10])
52 | # build a document-term matrix from the cleaned corpus. A document-term matrix is a table containing the frequency of the words.
53 | dtm<- DocumentTermMatrix(novel_corpus_clean)
54 | # explicitly convert the document-term matrix to matrix format
55 | m<- as.matrix(dtm)
56 | # sort the matrix and store in a new data frame
57 | word_freq<- sort(colSums(m), decreasing = TRUE)
58 | # look at the top 5 words
59 | head(word_freq,5)
60 | # create a character vector
61 | words<- names(word_freq)
62 | # create a data frame having the character vector and its associated number of occurrences (frequency)
63 | words_df<- data.frame(word=words, freq=word_freq)
64 | # Plot word frequencies
65 | barplot(words_df[1:10,]$freq, las = 2, names.arg = words_df[1:10,]$word,
66 | col ="lightblue", main ="Most frequent words",
67 | ylab = "Word frequencies")
--------------------------------------------------------------------------------
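The figures folder contains Rplot-wordcloud01.png; a hedged sketch of how such a figure could be generated from the words_df data frame built above (the wordcloud and RColorBrewer packages are assumptions here, they are not loaded in the script):

library(wordcloud)
library(RColorBrewer)

set.seed(1234)  # reproducible layout
wordcloud(words = words_df$word, freq = words_df$freq,
          min.freq = 50, max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))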
/scripts/R/expr_kaggle-reddit_EDA.R:
--------------------------------------------------------------------------------
1 | # data source: https://www.kaggle.com/maksymshkliarevskyi/reddit-data-science-posts
2 |
3 | # clean the workspace
4 | rm(list = ls())
5 |
6 | # load required libraries
7 | library(tidyverse)
8 | library(tidyr)
9 |
10 | # 1. reading multiple data files from a folder into separate dataframes
11 | filesPath = "data/kaggle_reddit_data"
12 | temp = list.files(path = filesPath, pattern = "*.csv", full.names = TRUE)
13 | for (i in 1:length(temp)){
14 | nam <- paste("df",i, sep = "_")
15 | assign(nam, read_csv(temp[i], na=c("","NA")))
16 | }
17 |
18 | #2. Read the multiple dataframes created at step 1 into a list
19 | df_lst<- lapply(ls(pattern="df_[0-9]+"), function(x) get(x))
20 | typeof(df_lst)
21 | #3. combining a list of dataframes into a single data frame
22 | # df<- bind_rows(df_lst)
23 | df<- plyr::ldply(df_lst, data.frame)
24 | str(df)
25 | dim(df) # [1] 476970 22
26 |
27 | # lowercase all character variables
28 | df<- df %>%
29 | mutate(across(where(is.character), tolower))
30 |
31 | # Data Engineering: split col created_date into date and time
32 | df$create_date<- as.Date(df$created_date)
33 | df$create_time<- format(df$created_date,"%H:%M:%S")
34 | df<- separate(df, create_date, c('create_year', 'create_month', 'create_day'), sep = "-",remove = TRUE)
35 | df<- separate(df, create_time, c('create_hour', 'create_min', 'create_sec'), sep = ":",remove = TRUE)
36 |
37 | # drop cols not required for further analysis
38 | df$X1<- NULL
39 | df$created_timestamp<- NULL
40 | df$author_created_utc<- NULL
41 | df$full_link<- NULL
42 | df$post_text<- NULL
43 | df$postURL<- NULL
44 | df$title_text<- NULL
45 | df$created_date<- NULL
46 |
47 | # extract url from post and save as separate column
48 | url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
49 | df$postURL <- str_extract_all(df$post, url_pattern)
50 | df$postURL<- NULL
51 | # df$post_text<- str_extract_all(df$post, boundary("word"))
52 | # df$post_text<- str_extract_all(df$post, "[a-z]+")
53 | # df$title_text<- str_extract_all(df$title, "[a-z]+")
54 |
55 | # extract data from list
56 | # head(df$post_text)
57 | # df$postText<- sapply(df$post_text, function(x) x[1])
58 | # str(df)
59 |
60 | # filter out selected missing values
61 | colSums(is.na(df))
62 | df1<- df
63 | df1 <- df1 %>%
64 | # filter number of comments less than 0. this will take care of 0 & -1 comments
65 | # filter(num_comments > 0 ) %>%
66 | # filter posts with NA
67 | filter(!is.na(post)) %>%
68 | # filter subreddit_subscribers with NA
69 | filter(!is.na(subreddit_subscribers)) %>%
70 | # filter crossposts with NA
71 | filter(!is.na(num_crossposts)) %>%
72 | # # filter create date, create time with NA
73 | filter(!is.na(create_year)) %>%
74 | filter(!is.na(create_month)) %>%
75 | filter(!is.na(create_day)) %>%
76 | filter(!is.na(create_hour)) %>%
77 | filter(!is.na(create_min)) %>%
78 | filter(!is.na(create_sec))
79 | colSums(is.na(df1))
80 | # rearrange the cols
81 | df1<- df1[,c(3,10,14:15,11:13,1:2,4:9)]
82 |
83 | df_clean <- df1
84 | str(df_clean)
85 | # 4. write combined partially clean data to disk
86 | write_csv(df_clean, file = "data/kaggle_reddit_data/reddit_data_clean.csv")
87 |
88 |
89 |
90 | # filter character cols with only text data. remove all special characters data
91 | # filter(str_detect(str_to_lower(author), "[a-zA-Z]")) %>%
92 | # filter(str_detect(str_to_lower(title), "[a-zA-Z]")) %>%
93 | # filter(str_detect(str_to_lower(id),"[a-zA-Z0-9]")) %>%
94 | # filter(str_detect(str_to_lower(post),"[a-zA-Z]")) %>%
95 | # # separate date into 3 columns
96 | # separate(create_date, into = c("create_year","create_month","create_day")) %>%
97 | # # separate time into 3 columns
98 | # separate(create_time, into = c("create_hour","create_min","create_sec")) %>%
99 | # # coerce all character cols into factor
100 | # mutate_if(is.character,as.factor)
101 |
102 | ##
103 | # df_clean<-df_clean %>%
104 | # filter(str_detect(str_to_lower(post), url_pattern))
--------------------------------------------------------------------------------
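Steps 1-3 above (a for loop with assign(), then collecting the data frames back with ls() and plyr::ldply()) can also be written more compactly. A sketch of an equivalent approach using purrr, offered as an alternative rather than what the script does:

library(tidyverse)

filesPath <- "data/kaggle_reddit_data"
csv_files <- list.files(path = filesPath, pattern = "\\.csv$", full.names = TRUE)

# read every CSV and row-bind the results into one data frame in a single step
df <- purrr::map_dfr(csv_files, readr::read_csv, na = c("", "NA"))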
/scripts/R/expr_kaggle-reddit_text-analysis.R:
--------------------------------------------------------------------------------
1 | # read kaggle reddit comments clean file
2 | # required file: data/kaggle_reddit_data/reddit_data_clean.csv
3 |
4 | # clean the workspace
5 | rm(list = ls())
6 | # load required libraries
7 | library(readr)
8 | library(stringr)
9 | library(tidyverse)
10 | library(tidytext)
11 |
12 | df<- read_csv(file = "data/kaggle_reddit_data/reddit_data_clean.csv")
13 | str(df)
14 | colnames(df)
15 |
16 | # clean the 'post' column
17 | # remove any url contained in post
18 | df$post_c <- gsub('http.* *', "", df$post)
19 | # remove any brackets contained in title
20 | df$title_c <- gsub('\\[|]',"", df$title)
21 | df$title_c <- gsub('\\?',"", df$title_c)
22 | df$title_c <- gsub('\\!',"", df$title_c)
23 |
24 | df1<- df %>%
25 | # add row number
26 | mutate(line=row_number()) %>%
27 | mutate(post_ct = replace(post_c, post_c == '', 'None')) %>%
28 | unnest_tokens(word, post_ct) %>%
29 | anti_join(get_stopwords()) %>%
30 | #mutate(word_count = count(word)) %>%
31 | inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
32 | count(sentiment, word) %>% # count the # of positive & negative words
33 | spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
34 |   mutate(sentiment = positive - negative) # number of positive words minus number of negative words
35 | str(df1)
36 |
37 | df2<- df %>%
38 | # add row number
39 | mutate(line=row_number()) %>%
40 | mutate(post_ct = replace(post_c, post_c == '', 'None')) %>%
41 | unnest_tokens(word, post_ct) %>%
42 | anti_join(get_stopwords()) %>%
43 | group_by(author) %>%
44 | inner_join(get_sentiments("bing"))%>%
45 | count(word, sentiment, sort = TRUE) %>%
46 | ungroup() %>%
47 | #slice_max(n, n = 10) %>%
48 | #ungroup() %>%
49 | mutate(word = reorder(word, n))
50 | view(df2)
51 |
52 | df2 %>%
53 | group_by(sentiment) %>%
54 | slice_max(n, n = 10) %>%
55 | ungroup() %>%
56 | mutate(word = reorder(word, n)) %>%
57 | ggplot(aes(n, word, fill = sentiment)) +
58 | geom_col(show.legend = FALSE) +
59 | facet_wrap(~sentiment, scales = "free_y") +
60 | labs(x = "Contribution to sentiment",
61 | y = NULL)+
62 | theme_bw()
63 |
64 |
--------------------------------------------------------------------------------
/scripts/R/initialScript-1.R:
--------------------------------------------------------------------------------
1 | # script title: initialScript-1.R
2 | # Task: Accessing and comparing word frequency data
3 |
4 | ## comparing the usage of he vs. she and him vs. her
5 | novel.wordsonly.freq.sorted["he"] # "he" occurs 1876 times
6 | novel.wordsonly.freq.sorted["she"] # "she" occurs 114 times
7 | novel.wordsonly.freq.sorted["him"] # "him" occurs 1058 times
8 | novel.wordsonly.freq.sorted["her"] # "her" occurs 330 times
9 | ### These pronoun counts suggest that Moby Dick is a male-oriented book
10 | novel.wordsonly.freq.sorted["him"]/novel.wordsonly.freq.sorted["her"] # him is 3.2 times more frequent than her
11 | novel.wordsonly.freq.sorted["he"]/novel.wordsonly.freq.sorted["she"] # he is 16 times more frequent than she
12 |
13 | ## calculating the relative frequency of the words
14 | novel.wordsonly.relfreq<- 100*(novel.wordsonly.freq.sorted/sum(novel.wordsonly.freq.sorted))
15 | novel.wordsonly.relfreq["the"] # "the" occurs about 6.6 times per 100 words in the novel Moby Dick
16 |
17 | plot(novel.wordsonly.relfreq[1:10], type="b",
18 | xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n")
19 | axis(1,1:10, labels=names(novel.wordsonly.relfreq [1:10]))
20 |
--------------------------------------------------------------------------------
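A quick sanity check on the relative-frequency calculation above (a sketch, assuming the objects from initialScript.R are still in the workspace): since every word contributes to the total, the percentages should sum to 100.

sum(novel.wordsonly.relfreq)   # should be approximately 100
# a single word's raw count can be recovered from its percentage:
novel.wordsonly.relfreq["the"] / 100 * sum(novel.wordsonly.freq.sorted)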
/scripts/R/initialScript.R:
--------------------------------------------------------------------------------
1 |
2 | # Load the data
3 | text.data<- scan("data/plainText/melville.txt", what = "character", sep = "\n")
4 | str(text.data)
5 | text.data[1]
6 | text.data[408] # The main text of the novel, "moby dick" begins at line 408
7 | text.data[18576] # The main text of the novel, "moby dick" ends at line 18576
8 |
9 | # To find out the beginning & end of the main text, do the following
10 | text.start<- which(text.data=="CHAPTER 1. Loomings.") #408L
11 | text.end<- which(text.data=="orphan.") #18576L
12 | text.start
13 | text.end
14 |
15 | # what is the length of text
16 | length(text.data) # there are 18,874 lines of text in the file
17 | # save the metadata to a separate object
18 | text.startmetadata<- text.data[1:(text.start-1)]
19 | text.endmetadata<- text.data[(text.end+1):length(text.data)]
20 | metadata<- c(text.startmetadata, text.endmetadata)
21 | novel.data<- text.data[text.start: text.end]
22 | head(novel.data)
23 | tail(novel.data)
24 |
25 | # Formatting the text for subsequent data analysis
26 |
27 | ## Get rid of line breaks "\n" such that all line are put into one long string.
28 | ## This is achieved using the paste function to join and collapse all lines into one long string
29 | novel.data<- paste(novel.data, collapse = " ")
30 | ## The paste function with the collapse argument provides a way of gluing together a bunch of separate pieces using a glue character that you define as the value for the collapse argument. In this case, you are going to glue together the lines (the pieces) using a blank space character (the glue).
31 |
32 | # Data preprocessing
33 | ## convert all words to lowercase
34 | novel.lower<- tolower(novel.data)
35 | ## extract all words only
36 | novel.words<- strsplit(novel.lower, "\\W") # where \\W is a regex to match any non-word character.
37 | head(novel.words)
38 | str(novel.words) # The words are contained in a list format
39 |
40 | novel.words<- unlist(novel.words)
41 | str(novel.words) # Convert list to a character vector
42 |
43 | ## Removing the blanks. First, find out the non blank positions
44 | notblanks<- which(novel.words!="")
45 | head(notblanks)
46 |
47 | novel.wordsonly<-novel.words[notblanks]
48 | head(novel.wordsonly)
49 | totalwords<-length(novel.wordsonly) # total words=214889 words
50 | whale.hits<-length(novel.wordsonly[which(novel.wordsonly=="whale")]) # the word 'whale' occurs 1150 times
51 | whale.hits.perct<- whale.hits/totalwords
52 | whale.hits.perct # the word 'whale' accounts for roughly 0.5% of all words (proportion ~0.0054)
53 | length(unique(novel.wordsonly)) # there are 16,872 unique words in the novel
54 |
55 | novel.wordsonly.freq<- table(novel.wordsonly) # table() builds a contingency table containing the count of every word occurrence
56 | novel.wordsonly.freq.sorted<- sort(novel.wordsonly.freq, decreasing = TRUE) # sort the data with most frequent words first followed by least frequent words
57 | head(novel.wordsonly.freq.sorted)
58 | tail(novel.wordsonly.freq.sorted)
59 |
60 | ### Practice: Find the top 10 most frequent words in the text
61 | novel.wordsonly.freq.sorted[c(1:10)]
62 | ### Practice: Visualize the top 10 most frequent words in the text
63 | plot(novel.wordsonly.freq.sorted[c(1:10)])
64 |
--------------------------------------------------------------------------------
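The paste(collapse = " ") step explained above can be checked on a toy vector; a minimal illustration (not part of the script):

toy_lines <- c("Call me", "Ishmael.")
paste(toy_lines, collapse = " ")
# [1] "Call me Ishmael."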
/scripts/R/mobydick_novel_text_analysis.R:
--------------------------------------------------------------------------------
1 | # clear the workspace
2 | rm(list = ls())
3 |
4 | # Load the moby dick novel
5 | mobydick.data<- scan("data/plainText/melville.txt", what = "character", sep = "\n")
6 |
7 | # find out the beginning & end of the main text
8 | mobydick.start<-which(mobydick.data=="CHAPTER 1. Loomings.")
9 | mobydick.end<-which(mobydick.data=="orphan.")
10 |
11 |
12 | # save the metadata to a separate object
13 | mobydick.startmeta<- mobydick.data[1:(mobydick.start-1)]
14 | mobydick.endmeta<- mobydick.data[(mobydick.end+1):length(mobydick.data)]
15 | metadata<- c(mobydick.startmeta, mobydick.endmeta)
16 | novel.data<- mobydick.data[mobydick.start: mobydick.end]
17 | head(novel.data)
18 | tail(novel.data)
19 |
20 | # Formatting the text for subsequent data analysis
21 | novel.data<- paste(novel.data, collapse = " ")
22 |
23 | # Data preprocessing
24 | ## convert all words to lowercase
25 | novel.lower<- tolower(novel.data)
26 | ## extract all words only
27 | novel.words<- strsplit(novel.lower, "\\W") # where \\W is a regex to match any non-word character.
28 | head(novel.words)
29 | str(novel.words) # The words are contained in a list format
30 | novel.words<- unlist(novel.words)
31 | str(novel.words) # Convert list to a character vector
32 |
33 | ## Removing the blanks. First, find out the non blank positions
34 | notblanks<- which(novel.words!="")
35 | head(notblanks)
36 |
37 | novel.wordsonly<-novel.words[notblanks]
38 | head(novel.wordsonly)
39 | totalwords<-length(novel.wordsonly)
40 |
41 | # Count the unique word occurrences
42 | length(unique(novel.wordsonly)) # 16,872 unique words
43 | # count the number of whale hits
44 | length(novel.wordsonly[novel.wordsonly=="whale"]) # 1,150 times the word whale appears
45 |
46 | # Build a sorted word-frequency table first
47 | novel.wordsonly.freq<- table(novel.wordsonly)
48 | novel.wordsonly.freq.sorted<- sort(novel.wordsonly.freq, decreasing = TRUE)
49 | # Accessing and understanding word data
50 | novel.wordsonly.freq.sorted["he"] # The word "he" occurs 1876 times
51 | novel.wordsonly.freq.sorted["she"] # 114
52 | novel.wordsonly.freq.sorted["him"] # 1058
53 | novel.wordsonly.freq.sorted["her"] # 330
54 | ### Practice: Find the top 10 most frequent words in the text
55 | novel.wordsonly.freq.sorted[c(1:10)]
56 | ### Practice: Visualize the top 10 most frequent words in the text
57 | plot(novel.wordsonly.freq.sorted[c(1:10)])
58 | ## calculating the relative frequency of the words
59 | novel.wordsonly.relfreq<- 100*(novel.wordsonly.freq.sorted/sum(novel.wordsonly.freq.sorted))
60 |
61 | plot(novel.wordsonly.relfreq[1:10], type="b",
62 | xlab="Top Ten Words", ylab="Percentage of Full Text", xaxt ="n")
63 | axis(1,1:10, labels=names(novel.wordsonly.relfreq [1:10]))
64 |
65 | # Dispersion Analysis
66 | whales.v<- which(novel.wordsonly == "whale")
67 |
68 |
69 |
70 |
71 |
--------------------------------------------------------------------------------
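The dispersion analysis at the end of the script stops after locating the hits. A sketch of the corresponding plot, mirroring dispersion_plots.R (the whales.count name is my own):

whales.count <- rep(NA, length(novel.wordsonly))  # one slot per running word position
whales.count[whales.v] <- 1                       # mark every occurrence of "whale"
plot(whales.count, main = "Dispersion Plot of 'whale' in Moby Dick",
     xlab = "Novel Time", ylab = "whale", type = "h", ylim = c(0, 1), yaxt = "n")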
/scripts/R/structural_topic_modeling_00.R:
--------------------------------------------------------------------------------
1 | # Objective: Explore topic modeling and structural topic modeling - what they are and how they are useful for text mining
2 | # script create date: 10/11/2018
3 | # script modified date:
4 | # script name: structural_topic_modeling_00.R
5 |
6 | # Topic modelling- In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.
7 | # The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
8 | # Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – “Farming”.
9 |
10 | # Algorithms for Topic Modeling
11 | # Term Frequency,
12 | # Inverse Document Frequency,
13 | # NonNegative Matrix Factorization techniques,
14 | # Latent Dirichlet Allocation is the most popular topic modeling technique
15 |
16 | # reference: http://www.structuraltopicmodel.com/
17 | # reference: https://en.wikipedia.org/wiki/Topic_model
18 | # reference: https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
19 | # reference: https://juliasilge.com/blog/sherlock-holmes-stm/
20 | # reference: https://blogs.uoregon.edu/rclub/2016/04/05/structural-topic-modeling/
21 | # reference: https://juliasilge.github.io/ibm-ai-day/slides.html#39
22 |
23 | # load the required packages
24 | library(tidyverse)
25 | library(gutenbergr)
26 | library(tidytext) # for unnest_tokens()
27 | library(stm)
28 |
29 | # Show all gutenberg works in the gutenbergr package
30 | gutenberg_works<- gutenberg_works(languages = "en")
31 | View(gutenberg_works)
32 | # The id for the novel "The Adventures of Sherlock Holmes" is 1661
33 | sherlock_raw<- gutenberg_download(1661)
34 |
35 | sherlock <- sherlock_raw %>%
36 | mutate(story = ifelse(str_detect(text, "ADVENTURE"),
37 | text,
38 | NA)) %>%
39 | fill(story) %>%
40 | filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>%
41 | mutate(story = factor(story, levels = unique(story)))
42 |
43 | # create a custom function to reorder a column before plotting with facetting, such that the values are ordered within each facet.
44 | # reference: https://github.com/dgrtwo/drlib
45 | reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
46 | new_x <- paste(x, within, sep = sep)
47 | stats::reorder(new_x, by, FUN = fun)
48 | }
49 |
50 | scale_x_reordered <- function(..., sep = "___") {
51 | reg <- paste0(sep, ".+$")
52 | ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
53 | }
54 |
55 | scale_y_reordered <- function(..., sep = "___") {
56 | reg <- paste0(sep, ".+$")
57 | ggplot2::scale_y_discrete(labels = function(x) gsub(reg, "", x), ...)
58 | }
59 |
60 | # Transform the text data into tidy format using `unnest_tokens()` and remove stopwords
61 | sherlock_tidy<- sherlock %>%
62 | unnest_tokens(word, text) %>%
63 | anti_join(stop_words, by="word") %>%
64 | filter(word != "holmes")
65 |
66 | sherlock_tidy %>%
67 | count(word, sort = TRUE)
68 |
69 | # Determine the highest tf-idf words in the 12 Sherlock Holmes stories. We will use the function bind_tf_idf() from the tidytext package
70 |
71 | sherlock_tf_idf <- sherlock_tidy %>%
72 | count(story, word, sort = TRUE) %>%
73 | bind_tf_idf(word, story, n) %>%
74 | arrange(-tf_idf) %>%
75 | group_by(story) %>%
76 | top_n(10) %>%
77 | ungroup
78 |
79 | sherlock_tf_idf %>%
80 | mutate(word = reorder_within(word, tf_idf, story)) %>%
81 | ggplot(aes(word, tf_idf, fill = story)) +
82 | geom_col(alpha = 0.8, show.legend = FALSE) +
83 | facet_wrap(~ story, scales = "free", ncol = 3) +
84 | scale_x_reordered() +
85 | coord_flip() +
86 | theme(strip.text=element_text(size=11)) +
87 | labs(x = NULL, y = "tf-idf",
88 | title = "Highest tf-idf words in Sherlock Holmes short stories",
89 | subtitle = "Individual stories focus on different characters and narrative elements")
90 |
91 | # Exploring tf-idf can be helpful before training topic models.
92 | # let’s get started on a topic model! Using the stm package.
93 | sherlock_sparse <- sherlock_tidy %>%
94 | count(story, word, sort = TRUE) %>%
95 | cast_sparse(story, word, n)
96 |
97 | # Now train a topic model with 6 topics. Note: the stm package includes many functions and diagnostics to help choose an appropriate number of topics for your model.
98 | topic_model <- stm(sherlock_sparse, K = 6,
99 | verbose = FALSE, init.type = "Spectral")
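100 | 
101 | # A short follow-up sketch: inspect the fitted topic model.
102 | # summary() prints the top words per topic; tidy() from tidytext extracts the
103 | # per-topic word probabilities (beta) for further tidyverse-style exploration.
104 | summary(topic_model)
105 | td_beta <- tidy(topic_model)
106 | td_beta %>%
107 |   group_by(topic) %>%
108 |   top_n(10, beta) %>%
109 |   arrange(topic, -beta)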
--------------------------------------------------------------------------------
/scripts/R/text_analysis_example01.R:
--------------------------------------------------------------------------------
1 | # Sample script for text analysis with R
2 | # Reference: Below is an example from https://github.com/juliasilge/tidytext
3 | library(janeaustenr)
4 | library(dplyr)
5 |
6 | original_books <- austen_books() %>%
7 | group_by(book) %>%
8 | mutate(linenumber = row_number()) %>%
9 | ungroup()
10 |
11 | original_books
12 |
13 | library(tidytext)
14 | tidy_books <- original_books %>%
15 | unnest_tokens(word, text)
16 |
17 | tidy_books
18 |
19 | data("stop_words")
20 | tidy_books <- tidy_books %>%
21 | anti_join(stop_words)
22 | tidy_books %>%
23 | count(word, sort = TRUE)
24 | # Sentiment analysis can be done as an inner join. Three sentiment lexicons are available via the get_sentiments() function.
25 | # sentiment analysis
26 | library(tidyr)
27 | get_sentiments("bing")
28 |
29 | janeaustensentiment <- tidy_books %>%
30 | inner_join(get_sentiments("bing"), by = "word") %>%
31 | count(book, index = linenumber %/% 80, sentiment) %>%
32 | spread(sentiment, n, fill = 0) %>%
33 | mutate(sentiment = positive - negative)
34 |
35 | janeaustensentiment
36 |
37 | library(ggplot2)
38 |
39 | ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
40 | geom_bar(stat = "identity", show.legend = FALSE) +
41 | facet_wrap(~book, ncol = 2, scales = "free_x")
42 |
43 |
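44 | # Follow-up sketch: count which words contribute most to each sentiment
45 | bing_word_counts <- tidy_books %>%
46 |   inner_join(get_sentiments("bing"), by = "word") %>%
47 |   count(word, sentiment, sort = TRUE)
48 | bing_word_counts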
--------------------------------------------------------------------------------
/scripts/R/text_analysis_example02.R:
--------------------------------------------------------------------------------
1 | # clear the workspace
2 | rm(list = ls())
3 | # load the required libraries
4 | library(tidytext)
5 | library(magrittr)
6 |
7 | # Read the text file
8 | df<- scan("data/text_data_for_analysis.txt", what = "character", sep = "\n")
9 |
10 | # basic eda
11 | str(df)
12 | typeof(df) # character vector
13 |
14 | # convert all words to lowercase
15 | df<- tolower(df)
16 | ## extract all words only
17 | df<- strsplit(df, "\\W") # where \\W is a regex to match any non-word character.
18 | head(df)
19 | df.words<- unlist(df)
20 | ## Removing the blanks. First, find out the non blank positions
21 | notblanks<- which(df.words!="")
22 | head(notblanks)
23 | df.wordsonly<-df.words[notblanks]
24 | # Count the unique word occurrences
25 | length(unique(df.wordsonly)) # 194 unique words
26 |
27 | # Accessing and understanding word data
28 | df.wordsonly.freq<- table(df.wordsonly)
29 | df.wordsonly.freq_sorted<- sort(df.wordsonly.freq, decreasing = TRUE)
30 | df.wordsonly.freq_sorted[c(1:10)]
31 |
32 | length(df.wordsonly)
33 | library(dplyr) # for tibble(), anti_join() and count()
34 | df.tbl<- tibble(idx=c(1:length(df.wordsonly)), text= df.wordsonly)
35 | str(df.tbl)
36 | # tokenize the text column into one word per row (tidytext is already loaded above)
37 | df.tbl.tidy<- df.tbl %>%
38 |   unnest_tokens(word, text)
39 | str(df.tbl.tidy)
40 | 
41 | df.tbl.tidy %>%
42 |   count(word, sort = TRUE)
43 |
44 |
45 |
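46 | # Follow-up sketch: drop the stopwords before counting, using tidytext's stop_words list
47 | data("stop_words")
48 | df.tbl.tidy %>%
49 |   anti_join(stop_words, by = "word") %>%
50 |   count(word, sort = TRUE)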
--------------------------------------------------------------------------------
/scripts/R/token_distribution_analysis.R:
--------------------------------------------------------------------------------
1 |
2 | # clean the workspace
3 | rm(list = ls())
4 |
5 | # Load the data
6 | text.v <- scan("data/plainText/melville.txt", what="character", sep="\n")
7 | start.v <- which(text.v == "CHAPTER 1. Loomings.")
8 | end.v <- which(text.v == "orphan.")
9 | novel.lines.v <- text.v[start.v:end.v]
10 | novel.lines.v
11 |
12 | ## Identify the chapter break positions in the vector using the grep function
13 | chap.positions.v <- grep("^CHAPTER \\d", novel.lines.v) # the start of a line is marked by the caret symbol (^), followed by the capitalized word CHAPTER, a space, and then any digit (digits are written with an escaped d, as in \\d)
14 | novel.lines.v[chap.positions.v] # we can see there are 135 chapter headings
15 |
16 | ## Identify the text break positions in chapters
17 | novel.lines.v <- c(novel.lines.v, "END")
18 | last.position.v <- length(novel.lines.v)
19 | chap.positions.v <- c(chap.positions.v , last.position.v)
20 | ## Now, we have to figure out how to process the text, that is, the actual content of each chapter that appears between each of these chapter markers.
21 | ## We will use the for loop
22 |
23 | for(i in 1:length(chap.positions.v)){
24 |   print(paste("Chapter ", i, " begins at position ",
25 |               chap.positions.v[i], sep=""))
26 | } # end for
27 | chapter.raws.l <- list()
28 | chapter.freqs.l <- list()
29 |
30 | for(i in 1:length(chap.positions.v)){
31 | if(i != length(chap.positions.v)){
32 | chapter.title <- novel.lines.v[chap.positions.v[i]]
33 | start <- chap.positions.v[i]+1
34 | end <- chap.positions.v[i+1]-1
35 | chapter.lines.v <- novel.lines.v[start:end]
36 | chapter.words.v <- tolower(paste(chapter.lines.v, collapse=" "))
37 | chapter.words.l <- strsplit(chapter.words.v, "\\W")
38 | chapter.word.v <- unlist(chapter.words.l)
39 | chapter.word.v <- chapter.word.v[which(chapter.word.v!="")]
40 | chapter.freqs.t <- table(chapter.word.v)
41 | chapter.raws.l[[chapter.title]] <- chapter.freqs.t
42 | chapter.freqs.t.rel <- 100*(chapter.freqs.t/sum(chapter.freqs.t))
43 | chapter.freqs.l[[chapter.title]] <- chapter.freqs.t.rel
44 | }
45 | }
46 |
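47 | # Follow-up sketch: plot the relative frequency of "whale" across chapters,
48 | # pulling the per-chapter percentages out of chapter.freqs.l.
49 | whale.l <- lapply(chapter.freqs.l, '[', 'whale')
50 | whales.m <- do.call(rbind, whale.l)
51 | whale.v <- as.vector(whales.m[,1])
52 | whale.v[is.na(whale.v)] <- 0
53 | barplot(whale.v, names.arg = 1:length(whale.v),
54 |         xlab = "Chapter", ylab = "Relative frequency of 'whale' (%)")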
--------------------------------------------------------------------------------
/scripts/R/topic_modeling_00.R:
--------------------------------------------------------------------------------
1 | # Objective: Topic Modelling
2 | # script create date: 27/4/2019
3 | # reference: https://www.tidytextmining.com/topicmodeling.html
4 |
5 | library(topicmodels)
6 | data("AssociatedPress")
7 | AssociatedPress
8 | # set a seed so that the output of the model is predictable
9 | ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
10 | #ap_lda
11 |
12 | # word topic probabilities
13 | library(tidytext)
14 | ap_topics <- tidy(ap_lda, matrix = "beta")
15 | ap_topics # the model is turned into a one-topic-per-term-per-row format.
16 | # For each combination, the model computes the probability of that term being generated from that topic. For example, the term "aaron" has a 1.686917e-12 probability of being generated from topic 1, but a 3.8959408e-05 probability of being generated from topic 2.
17 |
18 | library(ggplot2)
19 | library(dplyr)
20 |
21 | ap_top_terms <- ap_topics %>%
22 | group_by(topic) %>%
23 | top_n(10, beta) %>%
24 | ungroup() %>%
25 | arrange(topic, -beta)
26 | ap_top_terms %>%
27 | mutate(term = reorder(term, beta)) %>%
28 | ggplot(aes(term, beta, fill = factor(topic))) +
29 | geom_col(show.legend = FALSE) +
30 | facet_wrap(~ topic, scales = "free") +
31 | coord_flip()
32 |
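33 | # Follow-up sketch: besides the per-topic word probabilities (beta), tidy() can
34 | # also extract the per-document topic proportions (gamma).
35 | ap_documents <- tidy(ap_lda, matrix = "gamma")
36 | ap_documents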
--------------------------------------------------------------------------------
/scripts/R/topic_modeling_01.R:
--------------------------------------------------------------------------------
1 | # Topic Modelling - 01
2 |
3 | # resources
4 | # https://www.kaggle.com/regiso/tips-and-tricks-for-building-topic-models-in-r
5 | # https://www.tidytextmining.com/topicmodeling.html
6 | # http://www.bernhardlearns.com/2017/05/topic-models-lda-and-ctm-in-r-with.html
7 | # https://cran.r-project.org/web/packages/tidytext/vignettes/topic_modeling.html
8 |
9 |
10 | library(dplyr)
11 | library(gutenbergr)
12 | titles <- c("Twenty Thousand Leagues under the Sea", "The War of the Worlds",
13 | "Pride and Prejudice", "Great Expectations")
14 | books <- gutenberg_works(title %in% titles) %>%
15 | gutenberg_download(meta_fields = "title")
16 | books
17 |
18 | library(tidytext)
19 | library(stringr)
20 | library(tidyr)
21 |
22 | by_chapter <- books %>%
23 | group_by(title) %>%
24 | mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
25 | ungroup() %>%
26 | filter(chapter > 0)
27 |
28 | by_chapter_word <- by_chapter %>%
29 | unite(title_chapter, title, chapter) %>%
30 | unnest_tokens(word, text)
31 |
32 | word_counts <- by_chapter_word %>%
33 | anti_join(stop_words) %>%
34 | count(title_chapter, word, sort = TRUE)
35 |
36 | word_counts
37 |
38 | # Latent Dirichlet Allocation with the topicmodels package
39 | # Right now this data frame is in a tidy form, with one-term-per-document-per-row. However, the topicmodels package requires a DocumentTermMatrix (from the tm package). As described in this vignette, we can cast a one-token-per-row table into a DocumentTermMatrix with tidytext’s cast_dtm:
40 |
41 | chapters_dtm <- word_counts %>%
42 | cast_dtm(title_chapter, word, n)
43 |
44 | chapters_dtm
45 |
46 | # Now we are ready to use the topicmodels package to create a four topic LDA model.
47 |
48 | library(topicmodels)
49 | chapters_lda <- LDA(chapters_dtm, k = 4, control = list(seed = 1234))
50 | chapters_lda
51 | chapters_lda_td <- tidy(chapters_lda)
52 | chapters_lda_td
53 | top_terms <- chapters_lda_td %>%
54 | group_by(topic) %>%
55 | top_n(5, beta) %>%
56 | ungroup() %>%
57 | arrange(topic, -beta)
58 |
59 | top_terms
60 | library(ggplot2)
61 | theme_set(theme_bw())
62 |
63 | top_terms %>%
64 | mutate(term = reorder(term, beta)) %>%
65 | ggplot(aes(term, beta)) +
66 | geom_bar(stat = "identity") +
67 | facet_wrap(~ topic, scales = "free") +
68 | theme(axis.text.x = element_text(size = 15, angle = 65, hjust = 1))
69 |
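70 | # Follow-up sketch: per-document (here, per-chapter) topic probabilities (gamma),
71 | # splitting the combined "title_chapter" id back into title and chapter with tidyr.
72 | chapters_lda_gamma <- tidy(chapters_lda, matrix = "gamma")
73 | chapters_lda_gamma %>%
74 |   separate(document, c("title", "chapter"), sep = "_", convert = TRUE)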
--------------------------------------------------------------------------------
/scripts/R/wuthering_heights_sentiment_analysis.R:
--------------------------------------------------------------------------------
1 | # SENTIMENT ANALYSIS
2 |
3 | # required packages
4 | library(tidyverse)
5 | library(tidytext)
6 | library(stringi)
7 | library(SentimentAnalysis)
8 |
9 | # Analyze sentiment. Note: `dtm` is assumed to be the document-term matrix built in
10 | # wuthering_heights_text_analysis.R; run that script first so that `dtm` exists.
11 | sentiment<- analyzeSentiment(dtm, language = "english", removeStopwords = TRUE)
12 | # Extract dictionary-based sentiment according to the QDAP dictionary
13 | sentiment$SentimentQDAP
14 | # View sentiment direction (i.e. positive, neutral and negative)
15 | convertToDirection(sentiment$SentimentQDAP)
16 |
17 | # The three general-purpose lexicons available in tidytext package are
18 | #
19 | # AFINN from Finn Årup Nielsen,
20 | # bing from Bing Liu and collaborators, and
21 | # nrc from Saif Mohammad and Peter Turney.
22 |
23 | # I will use Bing lexicon which is simply a tibble with words and positive and negative words
24 | get_sentiments("bing") %>% head
25 | # Note: `wuthering_heights` is the novel text downloaded in wuthering_heights_text_analysis.R via gutenberg_download(768)
26 | wuthering_heights %>%
27 | # split text into words
28 | unnest_tokens(word, text) %>%
29 | # remove stop words
30 | anti_join(stop_words, by = "word") %>%
31 | # add sentiment scores to words
32 | left_join(get_sentiments("bing"), by = "word") %>%
33 | # count number of negative and positive words
34 | count(word, sentiment) %>%
35 | spread(key = sentiment, value = n) %>%
36 | #ungroup %>%
37 | # create centered score
38 | mutate(sentiment = positive - negative -
39 | mean(positive - negative)) %>%
40 | # select_if(sentiment) %>%
41 | # reorder word levels
42 | #mutate(word = factor(as.character(word), levels = levels(word)[61:1])) %>%
43 | # plot
44 | ggplot(aes(x = word, y = sentiment)) +
45 | geom_bar(stat = "identity", aes(fill = word)) +
46 | theme_classic() +
47 | theme(axis.text.x = element_text(angle = 90)) +
48 | coord_flip() +
49 | #ylim(0, 200) +
50 | coord_cartesian(ylim=c(0,40)) + # Explain ggplot2 warning: “Removed k rows containing missing values”
51 | ggtitle("Centered sentiment scores",
52 | subtitle = "for Wuthering Heights")
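53 | 
54 | # Follow-up sketch: net sentiment across the narrative, counted in blocks of 80 lines
55 | # (again assuming `wuthering_heights` is loaded as noted above).
56 | wuthering_heights %>%
57 |   mutate(linenumber = row_number()) %>%
58 |   unnest_tokens(word, text) %>%
59 |   inner_join(get_sentiments("bing"), by = "word") %>%
60 |   count(index = linenumber %/% 80, sentiment) %>%
61 |   spread(key = sentiment, value = n, fill = 0) %>%
62 |   mutate(sentiment = positive - negative) %>%
63 |   ggplot(aes(index, sentiment)) +
64 |   geom_col(show.legend = FALSE) +
65 |   ggtitle("Net sentiment trajectory", subtitle = "for Wuthering Heights")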
--------------------------------------------------------------------------------
/scripts/R/wuthering_heights_text_analysis.R:
--------------------------------------------------------------------------------
1 | # clean the workspace environment
2 | # Reference: http://tidytextmining.com/tidytext.html
3 | # Reference: https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
4 |
5 | rm(list = ls())
6 |
7 | # Load the required libraries
8 | library(gutenbergr)
9 | library(dplyr)
10 | library(tidytext)
11 | library(ggplot2)
12 |
13 | # READ or DOWNLOAD THE DATA
14 | # find the gutenberg title for analysis. Look for the "gutenberg_id"
15 | gutenberg_works(title == "Wuthering Heights")
16 |
17 | # Once the id is found, pass it as a value to function "gutenberg_download() as shown below
18 | wuthering_heights <- gutenberg_download(768)
19 |
20 | # Look at the data structure, it's a tibble
21 | head(wuthering_heights)
22 | # Notice, it has two columns: the first column contains "gutenberg_id" and the second column contains the novel text.
23 | # Do not remove the first column even though it contains the single repeating value 768, because this column will be used subsequently for filtering the data
24 |
25 | # DATA PREPROCESSING & INITIAL VISUALIZATION
26 |
27 | ## Step 1: To create a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.
28 | tidy_novel_data<- wuthering_heights %>%
29 | unnest_tokens(word, text) %>%
30 | anti_join(stop_words, by="word")
31 |
32 | # use dplyr’s count() to find the most common words in the novel
33 |
34 | tidy_novel_data %>%
35 | count(word, sort=TRUE)
36 |
37 | # Lets create a custom theme
38 | mytheme<- theme_bw()+
39 | theme(plot.title = element_text(color = "darkred"))+
40 | theme(panel.border = element_rect(color = "steelblue", size = 2))+
41 | theme(plot.title = element_text(hjust = 0.5)) # where 0.5 is to center
42 |
43 | # From the count of common words from above code, now plotting the most common words
44 | tidy_novel_data %>%
45 | count(word, sort = TRUE) %>%
46 | filter(n >100 & n<400) %>%
47 | mutate(word = reorder(word, n)) %>%
48 | ggplot(aes(word, n)) +
49 | geom_col() +
50 | #xlab(NULL) +
51 | coord_flip()+
52 | mytheme+
53 | ggtitle("Top words in Wuthering Heights")
54 |
55 | # WORDCLOUD
56 | # load libraries
57 | library("tm")
58 | library("SnowballC")
59 | library("wordcloud")
60 |
61 | # convert the word column to a corpus
62 | novel_corpus<- Corpus(VectorSource(tidy_novel_data$word))
63 | # preprocessing the novel data: note that each step must be applied to the result of the previous step
64 | novel_corpus_clean<- tm_map(novel_corpus, tolower)
65 | novel_corpus_clean<- tm_map(novel_corpus_clean, removeNumbers)
66 | novel_corpus_clean<- tm_map(novel_corpus_clean, removeWords, stopwords("english"))
67 | novel_corpus_clean<- tm_map(novel_corpus_clean, removePunctuation)
68 | # To see the first few documents in the cleaned corpus
69 | inspect(novel_corpus_clean[1:10])
70 |
71 | # build a document term matrix. Document matrix is a table containing the frequency of the words.
72 | dtm<- DocumentTermMatrix(novel_corpus_clean)
73 | # explicitly convert the document term matrix table to matrix format
74 | m<- as.matrix(dtm)
75 | # sort the matrix and store in a new data frame
76 | word_freq<- sort(colSums(m), decreasing = TRUE)
77 | # look at the top 5 words
78 | head(word_freq,5)
79 | # create a character vector
80 | words<- names(word_freq)
81 | # create a data frame having the character vector and its associated number of occurrences or frequency
82 | words_df<- data.frame(word=words, freq=word_freq)
83 |
84 | # create the first word cloud
85 | set.seed(1234)
86 | wordcloud (words_df$word, words_df$freq, scale=c(4,0.5), random.order=FALSE, rot.per=0.35,
87 | use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"), max.words = 100)
88 | # create the second word cloud
89 | wordcloud (words_df$word, words_df$freq, scale=c(4,0.5), random.order=FALSE, rot.per=0.35,
90 | use.r.layout=FALSE, colors=brewer.pal(8, "Accent"), max.words = 100)
91 |
92 | # Explore frequent terms and their association with each other
93 | findFreqTerms(dtm, lowfreq = 4)
94 | findMostFreqTerms(dtm)
95 | # You can analyze the association between frequent terms (i.e., terms which correlate) using findAssocs() function.
96 | #findAssocs(dtm, terms = "master", corlimit = 0.3)
97 |
98 | # Plot word frequencies
99 | barplot(words_df[1:10,]$freq, las = 2, names.arg = words_df[1:10,]$word,
100 | col ="lightblue", main ="Most frequent words",
101 | ylab = "Word frequencies")
102 |
103 |
--------------------------------------------------------------------------------
/scripts/python/.gitignore:
--------------------------------------------------------------------------------
1 | /.ipynb_checkpoints/
2 |
--------------------------------------------------------------------------------
/scripts/python/00_topic_modelling_fundamentals.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "#### Installing the required libraries (*if not already installed*)"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# conda install -c anaconda nltk\n",
17 | "# conda install -c anaconda gensim"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 7,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "#Here are the sample documents combining together to form a corpus.\n",
27 | "\n",
28 | "doc1 = \"Sugar is bad to consume. My sister likes to have sugar, but not my father.\"\n",
29 | "doc2 = \"My father spends a lot of time driving my sister around to dance practice.\"\n",
30 | "doc3 = \"Doctors suggest that driving may cause increased stress and blood pressure.\"\n",
31 | "doc4 = \"Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.\"\n",
32 | "doc5 = \"Health experts say that Sugar is not good for your lifestyle.\"\n",
33 | "\n",
34 | "# compile documents\n",
35 | "doc_complete = [doc1, doc2, doc3, doc4, doc5]"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "#### Preprocessing"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 8,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "# required libraries\n",
52 | "from nltk.corpus import stopwords \n",
53 | "from nltk.stem.wordnet import WordNetLemmatizer\n",
54 | "import string"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": [
63 | "import nltk\n",
64 | "from nltk.corpus import stopwords\n",
65 | "nltk.download('stopwords')\n",
66 | "stopwords = stopwords.words('english')"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": 9,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "stop = set(stopwords.words('english'))\n",
76 | "exclude = set(string.punctuation) \n",
77 | "lemma = WordNetLemmatizer()"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": 10,
83 | "metadata": {},
84 | "outputs": [],
85 | "source": [
86 | "def clean(doc):\n",
87 | " stop_free = \" \".join([i for i in doc.lower().split() if i not in stop])\n",
88 | " punc_free = ''.join(ch for ch in stop_free if ch not in exclude)\n",
89 | " normalized = \" \".join(lemma.lemmatize(word) for word in punc_free.split())\n",
90 | " return normalized"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": 11,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "doc_clean = [clean(doc).split() for doc in doc_complete]"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "#### Preparing Document-Term Matrix\n",
107 | "\n",
108 | "All the text documents combined is known as the corpus. To run any mathematical model on text corpus, it is a good practice to convert it into a matrix representation. LDA model looks for repeating term patterns in the entire DT matrix. Python provides many great libraries for text mining practices, **gensim** is one such clean and beautiful library to handle text data. It is scalable, robust and efficient. Following code shows how to convert a corpus into a document-term matrix."
109 | ]
110 | },
111 | {
112 | "cell_type": "code",
113 | "execution_count": 6,
114 | "metadata": {},
115 | "outputs": [],
116 | "source": [
117 | "# install gensim library"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 13,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "import gensim\n",
127 | "from gensim import corpora"
128 | ]
129 | },
130 | {
131 | "cell_type": "code",
132 | "execution_count": 14,
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "# Creating the term dictionary of our courpus, where every unique term is assigned an index. \n",
137 | "dictionary = corpora.Dictionary(doc_clean)\n",
138 | "\n",
139 | "# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.\n",
140 | "doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "metadata": {},
146 | "source": [
147 | "#### Running LDA Model\n",
148 | "Next step is to create an object for LDA model and train it on Document-Term matrix. The training also requires few parameters as input which are explained in the above section. The gensim module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents."
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 15,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "# Creating the object for LDA model using gensim library\n",
158 | "Lda = gensim.models.ldamodel.LdaModel\n",
159 | "\n",
160 | "# Running and Trainign LDA model on the document term matrix.\n",
161 | "ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)"
162 | ]
163 | },
164 | {
165 | "cell_type": "markdown",
166 | "metadata": {},
167 | "source": [
168 | "#### Results"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 16,
174 | "metadata": {},
175 | "outputs": [
176 | {
177 | "name": "stdout",
178 | "output_type": "stream",
179 | "text": [
180 | "[(0, '0.076*\"sugar\" + 0.075*\"good\" + 0.075*\"health\"'), (1, '0.065*\"driving\" + 0.065*\"pressure\" + 0.064*\"doctor\"'), (2, '0.084*\"sister\" + 0.084*\"father\" + 0.059*\"sugar\"')]\n"
181 | ]
182 | }
183 | ],
184 | "source": [
185 | "print(ldamodel.print_topics(num_topics=3, num_words=3))"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "#### Tips to improve results of topic modeling\n",
193 | "\n",
194 | "- Feature selection\n",
195 | "- Part of Speech Tag Filter"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": null,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": []
204 | }
205 | ],
206 | "metadata": {
207 | "kernelspec": {
208 | "display_name": "Python 3",
209 | "language": "python",
210 | "name": "python3"
211 | },
212 | "language_info": {
213 | "codemirror_mode": {
214 | "name": "ipython",
215 | "version": 3
216 | },
217 | "file_extension": ".py",
218 | "mimetype": "text/x-python",
219 | "name": "python",
220 | "nbconvert_exporter": "python",
221 | "pygments_lexer": "ipython3",
222 | "version": "3.7.1"
223 | }
224 | },
225 | "nbformat": 4,
226 | "nbformat_minor": 2
227 | }
228 |
--------------------------------------------------------------------------------
/scripts/python/00_topic_modelling_theoretical_concepts.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "One such technique in the field of text mining is Topic Modelling. As the name suggests, it is a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus. Thus, assisting better decision making.\n",
8 | "\n",
9 | "**Topic Modelling** is different from rule-based text mining approaches that use regular expressions or dictionary based keyword searching techniques. It is an **unsupervised approach** used for finding and observing the bunch of words (called “topics”) in large clusters of texts.\n",
10 | "\n",
11 | "Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should result in – “health”, “doctor”, “patient”, “hospital” for a topic – Healthcare, and “farm”, “crops”, “wheat” for a topic – “Farming”.\n",
12 | "\n",
13 | "Topic Models are **very useful for the purpose for document clustering**, organizing large blocks of textual data, information retrieval from unstructured text and feature selection. For Example – New York Times are using topic models to boost their user – article recommendation engines. Various professionals are using topic models for recruitment industries where they aim to extract latent features of job descriptions and map them to right candidates. They are being used to organize large datasets of emails, customer reviews, and user social media profiles.\n",
14 | "\n",
15 | "**Approaches for Topic Modelling**\n",
16 | "\n",
17 | "There are many approaches for obtaining topics from a text such as \n",
18 | "\n",
19 | "- Term Frequency and Inverse Document Frequency\n",
20 | "- NonNegative Matrix Factorization techniques. \n",
21 | "- Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique and in this article, we will discuss the same.\n",
22 | "\n",
23 | "**How LDA works**\n",
24 | "\n",
25 | "LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.\n",
26 | "\n",
27 | "LDA is a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix. "
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": null,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": []
36 | }
37 | ],
38 | "metadata": {
39 | "kernelspec": {
40 | "display_name": "Python 3",
41 | "language": "python",
42 | "name": "python3"
43 | },
44 | "language_info": {
45 | "codemirror_mode": {
46 | "name": "ipython",
47 | "version": 3
48 | },
49 | "file_extension": ".py",
50 | "mimetype": "text/x-python",
51 | "name": "python",
52 | "nbconvert_exporter": "python",
53 | "pygments_lexer": "ipython3",
54 | "version": "3.7.1"
55 | }
56 | },
57 | "nbformat": 4,
58 | "nbformat_minor": 2
59 | }
60 |
--------------------------------------------------------------------------------
/scripts/python/01_topic_modelling_fundamentals.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "**Objective**: \n",
8 | "\n",
9 | " - Given a list of docuemnts, determine the topic.\n",
10 | " - Then aggregrate the documents and print the topic the documents belong too.\n",
11 | " \n",
12 | "**Example**:\n",
13 | " \n",
14 | " Data\n",
15 | " \n",
16 | " it's very hot outside summer\n",
17 | " there are not many flowers in winter\n",
18 | " in the winter we eat hot food\n",
19 | " in the summer we go to the sea\n",
20 | " in winter we used many clothes\n",
21 | " in summer we are on vacation\n",
22 | " winter and summer are two seasons of the year\n",
23 | " \n",
24 | " **Output**\n",
25 | " \n",
26 | " Topic 1\n",
27 | " it's very hot outside summer\n",
28 | " in the summer we go to the sea\n",
29 | " in summer we are on vacation\n",
30 | " winter and summer are two seasons of the year\n",
31 | "\n",
32 | " Topic 2\n",
33 | " there are not many flowers in winter\n",
34 | " in the winter we eat hot food\n",
35 | " in winter we used many clothes\n",
36 | " winter and summer are two seasons of the year"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 8,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "#Here are the sample documents combining together to form a corpus.\n",
46 | "\n",
47 | "comments = ['it is very hot outside summer', 'there are not many flowers in winter','in the winter we eat hot food',\n",
48 | " 'in the summer we go to the sea','in winter we used many clothes','in summer we are on vacation',\n",
49 | " 'winter and summer are two seasons of the year']"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 7,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "# load the required libraries\n",
59 | "from sklearn.feature_extraction.text import CountVectorizer\n",
60 | "from sklearn.decomposition import LatentDirichletAllocation\n",
61 | "import numpy as np"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 17,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "data": {
71 | "text/plain": [
72 | "['it is very hot outside summer',\n",
73 | " 'there are not many flowers in winter',\n",
74 | " 'in the winter we eat hot food',\n",
75 | " 'in the summer we go to the sea',\n",
76 | " 'in winter we used many clothes',\n",
77 | " 'in summer we are on vacation',\n",
78 | " 'winter and summer are two seasons of the year']"
79 | ]
80 | },
81 | "execution_count": 17,
82 | "metadata": {},
83 | "output_type": "execute_result"
84 | }
85 | ],
86 | "source": [
87 | "comments"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 18,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "vect = CountVectorizer()\n",
97 | "X = vect.fit_transform(comments)"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 20,
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "lda = LatentDirichletAllocation(n_components = 2, learning_method = \"batch\", max_iter = 25, random_state = 0)\n",
107 | "document_topics = lda.fit_transform(X)"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 21,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "sorting = np.argsort(lda.components_, axis = 1)[:, ::-1]\n",
117 | "feature_names = np.array(vect.get_feature_names())"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 26,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "docs = np.argsort(comments)"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 30,
132 | "metadata": {},
133 | "outputs": [
134 | {
135 | "name": "stdout",
136 | "output_type": "stream",
137 | "text": [
138 | "in summer we are on vacation\n",
139 | "\n",
140 | "in the summer we go to the sea\n",
141 | "\n",
142 | "in the winter we eat hot food\n",
143 | "\n",
144 | "in winter we used many clothes\n",
145 | "\n",
146 | "it is very hot outside summer\n",
147 | "\n"
148 | ]
149 | }
150 | ],
151 | "source": [
152 | "for i in docs[:5]:\n",
153 | " print(\" \".join(comments[i].split(\",\")[:2]) + \"\\n\")"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 1,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "import nltk"
163 | ]
164 | },
165 | {
166 | "cell_type": "code",
167 | "execution_count": 2,
168 | "metadata": {},
169 | "outputs": [],
170 | "source": [
171 | "from nltk.corpus import names"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 3,
177 | "metadata": {},
178 | "outputs": [
179 | {
180 | "name": "stdout",
181 | "output_type": "stream",
182 | "text": [
183 | "\n"
184 | ]
185 | }
186 | ],
187 | "source": [
188 | "print(names)"
189 | ]
190 | },
191 | {
192 | "cell_type": "code",
193 | "execution_count": null,
194 | "metadata": {},
195 | "outputs": [],
196 | "source": []
197 | }
198 | ],
199 | "metadata": {
200 | "kernelspec": {
201 | "display_name": "Python 3",
202 | "language": "python",
203 | "name": "python3"
204 | },
205 | "language_info": {
206 | "codemirror_mode": {
207 | "name": "ipython",
208 | "version": 3
209 | },
210 | "file_extension": ".py",
211 | "mimetype": "text/x-python",
212 | "name": "python",
213 | "nbconvert_exporter": "python",
214 | "pygments_lexer": "ipython3",
215 | "version": "3.7.1"
216 | }
217 | },
218 | "nbformat": 4,
219 | "nbformat_minor": 2
220 | }
221 |
--------------------------------------------------------------------------------
/scripts/python/Tutorial1-An introduction to NLP with SpaCy.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "#### Natural Language Processing with spaCy"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "#### How to install SpaCy package in Windows environment\n",
15 | "\n",
16 | "- Open a regular command prompt, figure out where anaconda/miniconda is installed. In my case, the location is `C:\\Users\\username\\Miniconda3`\n",
17 | "\n",
18 | "- cd to the directory, `C:\\Users\\username\\miniconda3\\Scripts` and type `activate` and press the enter key.\n",
19 | "\n",
20 | "- Now type the command, `conda install -c conda-forge spacy` OR `pip install spacy` to install the required package. "
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {},
26 | "source": [
27 | "##### Tutorial from the [documentation](https://course.spacy.io/chapter1)"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 1,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "# import the English language class\n",
37 | "from spacy.lang.en import English"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "# Create the nlp object\n",
47 | "nlp = English()"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 3,
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "name": "stdout",
57 | "output_type": "stream",
58 | "text": [
59 | "Hello\n",
60 | "world\n",
61 | "!\n"
62 | ]
63 | }
64 | ],
65 | "source": [
66 | "# Created by processing a string of text with the nlp object\n",
67 | "doc = nlp(\"Hello world!\")\n",
68 | "\n",
69 | "# Iterate over tokens in a Doc\n",
70 | "for token in doc:\n",
71 | " print(token.text)"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 15,
77 | "metadata": {},
78 | "outputs": [
79 | {
80 | "name": "stdout",
81 | "output_type": "stream",
82 | "text": [
83 | "world\n"
84 | ]
85 | }
86 | ],
87 | "source": [
88 | "doc = nlp(\"Hello world!\")\n",
89 | "\n",
90 | "# Index into the Doc to get a single Token\n",
91 | "token = doc[1]\n",
92 | "\n",
93 | "# Get the token text via the .text attribute\n",
94 | "print(token.text)"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 16,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "Index: [0, 1, 2, 3, 4]\n",
107 | "Text: ['It', 'costs', '$', '5', '.']\n",
108 | "is_alpha: [True, True, False, False, False]\n",
109 | "is_punct: [False, False, False, False, True]\n",
110 | "like_num: [False, False, False, True, False]\n"
111 | ]
112 | }
113 | ],
114 | "source": [
115 | "# Lexical attributes\n",
116 | "doc = nlp(\"It costs $5.\")\n",
117 | "print('Index: ', [token.i for token in doc])\n",
118 | "print('Text: ', [token.text for token in doc])\n",
119 | "\n",
120 | "print('is_alpha:', [token.is_alpha for token in doc])\n",
121 | "print('is_punct:', [token.is_punct for token in doc])\n",
122 | "print('like_num:', [token.like_num for token in doc])"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 17,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "# process the text\n",
132 | "doc = nlp(\"I like tree kangaroos and narwhals.\")"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 19,
138 | "metadata": {},
139 | "outputs": [
140 | {
141 | "name": "stdout",
142 | "output_type": "stream",
143 | "text": [
144 | "like\n"
145 | ]
146 | }
147 | ],
148 | "source": [
149 | "# Select the first token\n",
150 | "first_token = doc[1]\n",
151 | "print(first_token.text)"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": []
160 | }
161 | ],
162 | "metadata": {
163 | "kernelspec": {
164 | "display_name": "Python 3",
165 | "language": "python",
166 | "name": "python3"
167 | },
168 | "language_info": {
169 | "codemirror_mode": {
170 | "name": "ipython",
171 | "version": 3
172 | },
173 | "file_extension": ".py",
174 | "mimetype": "text/x-python",
175 | "name": "python",
176 | "nbconvert_exporter": "python",
177 | "pygments_lexer": "ipython3",
178 | "version": "3.7.1"
179 | }
180 | },
181 | "nbformat": 4,
182 | "nbformat_minor": 2
183 | }
184 |
--------------------------------------------------------------------------------
/scripts/python/check_pkgs.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/duttashi/text-analysis/154b34ff3c2fac60e5e4068e48597dbdc815b894/scripts/python/check_pkgs.py
--------------------------------------------------------------------------------
/scripts/python/extract_table_from_pdf.py:
--------------------------------------------------------------------------------
1 | # Objective 1: Read table from multiple pdf files contained in a directory to a list
2 | # Objective 2: clean the pdf text data, find the unique words, then save them to a dictionary
3 | 
4 | # To read tables contained within the pdf files, I'm using the tabula-py library
5 | # To install tabula-py on Python 3 in Windows OS, ensure Java version 8 is installed.
6 | # Next, open a command-prompt window, browse to the python directory and execute the command, pip3 install tabula-py
7 |
8 | import tabula, os, re, string
9 | from collections import Counter
10 |
11 | # path to pdf files
12 | filePath = "C:\\Users\\Ashoo\\Documents\\PythonPlayground\\text-analysis\\data\\pdf"
13 |
14 | stripped = [] # initialize an empty string
15 |
16 | for filename in os.listdir(filePath):
17 | # search for files ending with .txt extension and read them in memory
18 | if filename.strip().endswith('.pdf'):
19 | print(filename)
20 | # Note: python will read the pdf file in 'rb' or 'wb' mode, i.e. binary read or write format
21 | with(open(os.path.join(filePath,filename),'rb')) as pdfFiles:
22 | #df= tabula.read_pdf(f, stream=True)[0]
23 |
24 | # read all pdf pages
25 | df= tabula.read_pdf(pdfFiles, pages="all")
26 | print(df)
27 |
28 | # convert pdf table to csv format
29 | tabula.convert_into(pdfFiles, "pdf_to_csv.csv", output_format="csv", stream=True)
30 | pdfFiles.close()
31 |
--------------------------------------------------------------------------------
/scripts/python/extract_text_data_from_pdf.py:
--------------------------------------------------------------------------------
1 | # Objective 1: Read multiple pdf files from a directory into a list
2 | # Objective 2: clean the pdf text data, find the unique words, then save them to a dictionary
3 |
4 | # To read pdf files, I'm using the PyPDF2 library
5 | # To install it on Python3 in windows OS, open a command-prompt window, browse to python directory and execute the command, pip3 install PyPDF2
6 | # In my environment, I installed it like, `C:\Users\Ashoo\Miniconda3\pip3 install PyPDF2`
7 | # Read the documentation of PyPDF2 library at https://pythonhosted.org/PyPDF2/
8 | # PyPDF2 uses a zero-based index for getting pages: the first page is page 0, the second is page 1, and so on.
9 |
10 | import re, os, string, PyPDF2
11 | from collections import Counter
12 |
13 | # path to pdf files
14 | filePath = "C:\\Users\\Ashoo\\Documents\\PythonPlayground\\text-analysis\\data\\pdf"
15 |
16 | stripped = [] # initialize an empty string
17 |
18 | for filename in os.listdir(filePath):
19 | # search for files ending with .txt extension and read them in memory
20 | if filename.strip().endswith('.pdf'):
21 | print(filename)
22 | # Note: python will read the pdf file in 'rb' or 'wb' mode, i.e. binary read or write format
23 | with(open(os.path.join(filePath,filename),'rb')) as f:
24 |
25 | # creating a pdf reader object
26 | pdfReader = PyPDF2.PdfFileReader(f)
27 |
28 | # print the number of pages in pdf file
29 | number_of_pages = 0
30 | number_of_pages = pdfReader.getNumPages()
31 | #print(pdfReader.numPages)
32 | print("Number of pages in pdf file are: ", number_of_pages)
33 |
34 | # create a page object and specify the page number to read.. say the first page
35 | #pageObj = pdfReader.getPage(0)
36 | # create a page object and read all pages in the pdf file
37 | # since page numbering starts from 0, therefore decreasing the value returned by method getNumPages by 1
38 | number_of_pages-=1
39 | pageObj = pdfReader.getPage(number_of_pages)
40 |
41 | # extract the text from the pageObj variable
42 | pageContent = pageObj.extractText()
43 | print (pageContent.encode('utf-8'))
44 |
45 | # basic text cleaning process
46 | # using re library, split the text data into words
47 | words = re.split(r'\W+', pageContent)
48 | # add split words to a list and lowercase them
49 | words = [word.lower() for word in words]
50 | # using string library, remove punctuation from words and save in table format
51 | table = str.maketrans('', '', string.punctuation)
52 | # concatenating clean data to list
53 | stripped += [w.translate(table) for w in words]
54 |
55 | # create a dictionary that stores the unique words found in the text files
56 | countlist = Counter(stripped)
57 |
58 |
59 | # print the unique word and its frequency of occurence
60 | for key, val in countlist.items():
61 | print(key, val)
62 |
63 | f.close()
64 | # print end of program statement
65 | print("End")
66 |
--------------------------------------------------------------------------------
/scripts/python/extract_text_data_from_pdf_2.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Fri Jun 25 09:12:31 2021
4 | Pdf file data downloaded from: https://ncrb.gov.in/en/crime-in-india-table-addtional-table-and-chapter-contents
5 | @author: Ashish
6 | """
7 |
8 | import pdfplumber as pp
9 | import os
10 |
11 | cur_path = os.path.dirname(__file__)
12 | print(cur_path)
13 |
14 | # set path for loading pdf file in memory
15 | new_path = os.path.relpath('..\\..\\data\\pdf\\pdf_tbl1.pdf', cur_path)
16 |
17 | # read pdf, extract data
18 | with pp.open(new_path) as pdf:
19 | page = pdf.pages[0]
20 | text=page.extract_text()
21 | print(text)
22 |
23 | # set table settings in pdf file
24 | table_settings = {
25 | "vertical_strategy": "lines",
26 | "horizontal_strategy": "text",
27 | "intersection_x_tolerance": 15
28 | }
29 |
30 | # read pdf file, extract table data
31 | with pp.open(new_path) as pdf:
32 | mypage = pdf.pages[0]
33 | tbls = mypage.find_tables()
34 | print("Tables: ", tbls)
35 | tbl_content = tbls[0].extract(x_tolerance=5)
36 | print("Table content: ",tbl_content)
37 |
38 |
39 |
--------------------------------------------------------------------------------
/scripts/python/extract_text_data_from_pdf_3.py:
--------------------------------------------------------------------------------
1 | import pdfplumber
2 | data_path = "../../data/pdf/file1.pdf"
3 | with pdfplumber.open(data_path) as pdf:
4 | first_page = pdf.pages[0]
5 | text = first_page.extract_text()
6 | print(text)
7 | # print(first_page.chars[0])
--------------------------------------------------------------------------------
/scripts/python/extract_text_data_from_textfiles.py:
--------------------------------------------------------------------------------
1 | # Objective #1: To read multiple text files in a given directory & write content of each text file to a list
2 | # Objective #2: To search for a given text string in each file and write its line number to a dictionary and print it
3 |
4 | import re, os, string, logging, sys
5 | from collections import Counter
6 |
7 | fileData = [] # initialize an empty list
8 | stripped = []
9 | phrase = [] # initialize an empty list
10 | searchPhrase = 'beauty'
11 |
12 | # the path to text files
13 | # Note the usage of relative path in here.. Since the script is in a folder which is different from the data folder.
14 | # Python will start looking from where your current python file is located. Therefore you can move from your current directory to where your data is located with '..'
15 |
16 | filePath = "../../data/plainText"
17 | for filename in os.listdir(filePath):
18 | # search for files ending with .txt extension and read them in memory
19 | if filename.strip().endswith('.txt'):
20 | print(filename)
21 | with(open(os.path.join(filePath,filename),'rt')) as f:
22 | try:
23 | # read all text with newlines included and save data to variable
24 | fileTxt = f.read()
25 | f.close()
26 |
27 | # basic text cleaning process
28 | # using re library, split the text data into words
29 | words = re.split(r'\W+', fileTxt)
30 | # add split words to a list and lowercase them
31 | words = [word.lower() for word in words]
32 | # using string library, remove punctuation from words and save in table format
33 | table = str.maketrans('', '', string.punctuation)
34 | # concatenating clean data to list
35 | stripped += [w.translate(table) for w in words]
36 |
37 | # search for a given string in clean text. If found then print the line number
38 | for num, line in enumerate(stripped,1):
39 | if(searchPhrase in line):
40 | print(searchPhrase, " found at line number: ",num)
41 |
42 | # if file not found, log an error
43 | except IOError as e:
44 | logging.error('Operation failed: {}' .format(e.strerror))
45 | sys.exit(0)
46 |
47 | # create a dictionary that stores the unique words found in the text files
48 | countlist = Counter(stripped)
49 |
50 | print("Count of uniique words in text files", filename, " are: ")
51 | # print the unique word and its frequency of occurence
52 | for key, val in countlist.items():
53 | print(key, val)
54 |
55 | # print end of program statement
56 | print("End")
--------------------------------------------------------------------------------
/scripts/python/func_text_preprocess.py:
--------------------------------------------------------------------------------
1 | # Machine learning models cannot work with raw text directly, so we need to convert it into numerical vectors.
2 | # This script has all techniques for conversion
3 | # A collection of functions for text data cleaning
4 | # Script create date: May 1, 2023 location: Jhs, UP
5 |
6 | def read_multiple_files(source_dir_path):
7 |     import os, pandas as pd
8 |     dir_csv = [file for file in os.listdir(source_dir_path) if file.endswith('.csv')]
9 |     files_csv_list = []
10 |     for file in dir_csv:
11 |         df = pd.read_csv(os.path.join(source_dir_path, file))  # read each csv from the source directory
12 |         files_csv_list.append(df)
13 |     return files_csv_list  # return the list of data frames, not just the last one
14 |
15 | def clean_text(df_text):
16 | from nltk.corpus import stopwords
17 | from nltk.stem import SnowballStemmer
18 | import re
19 | stop = set(stopwords.words('english'))
20 | temp =[]
21 | snow = SnowballStemmer('english')
22 | for sentence in df_text:
23 | sentence = sentence.lower() # Converting to lowercase
24 | cleanr = re.compile('<.*?>')
25 | sentence = re.sub(cleanr, ' ', sentence) #Removing HTML tags
26 | sentence = re.sub(r'[?|!|\'|"|#]',r'',sentence)
27 | sentence = re.sub(r'[.|,|)|(|\|/]',r' ',sentence) #Removing Punctuations
28 |
29 | words = [snow.stem(word) for word in sentence.split() if word not in stopwords.words('english')] # Stemming and removing stopwords
30 | temp.append(words)
31 |
32 | sent = []
33 | for row in temp:
34 | sequ = ''
35 | for word in row:
36 | sequ = sequ + ' ' + word
37 | sent.append(sequ)
38 |
39 | clean_sents = sent
40 | return clean_sents
41 |
42 |
43 |
44 |
45 |
46 |
47 |
48 |
49 |
--------------------------------------------------------------------------------
/scripts/python/kaggle_hotel_review_rating.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Jul 11 11:33:01 2021
4 |
5 | @author: Ashish
6 |
7 | Data source: https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews
8 | Objective: Hotel reviews rating prediction
9 | """
10 |
11 | import numpy as np # linear algebra
12 | import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
13 | import matplotlib.pyplot as plt
14 | import seaborn as sns
15 | import nltk
16 | from nltk.corpus import stopwords
17 | import string
18 | import xgboost as xgb
19 | from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
20 | from sklearn.decomposition import TruncatedSVD
21 | from sklearn import ensemble, metrics, model_selection, naive_bayes
22 | color = sns.color_palette()
23 |
24 | eng_stopwords = set(stopwords.words("english"))
25 | pd.options.mode.chained_assignment = None
26 |
27 | # import os
28 |
29 | # cur_path = os.path.dirname(__file__)
30 | # print(cur_path)
31 |
32 | # Reference: https://github.com/SudalaiRajkumar/Kaggle/blob/master/SpookyAuthor/simple_fe_notebook_spooky_author.ipynb
33 | # load the data
34 | df = pd.read_csv("..\\..\\data\\kaggle_tripadvisor_hotel_reviews.csv")
35 |
36 | # print(df.head(), df.shape)
37 |
38 | ## Feature Engineering
39 | # Now let us try to do some feature engineering. This consists of two main parts.
40 |
41 | # Meta features - features that are extracted from the text like number of words, number of stop words, number of punctuations etc
42 | # Text based features - features directly based on the text / words like frequency, svd, word2vec etc.
43 | # Meta Features:
44 |
45 | # We will start by creating meta features and see how good they are at predicting the hotel review ratings. The feature list is as follows:
46 |
47 | # Number of words in the text
48 | # Number of unique words in the text
49 | # Number of characters in the text
50 | # Number of stopwords
51 | # Number of punctuations
52 | # Number of upper case words
53 | # Number of title case words
54 | # Average length of the words
55 | ## Number of words in the text ##
56 | df["num_words"] = df["Review"].apply(lambda x: len(str(x).split()))
57 | ## Number of unique words in the text ##
58 | df["num_unique_words"] = df["Review"].apply(lambda x: len(set(str(x).split())))
59 |
60 | # Number of characters in text
61 | df["num_chars"] = df["Review"].apply(lambda x: len(str(x)))
62 |
63 | ## Number of stopwords in the text ##
64 | df["num_stopwords"] = df["Review"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
65 |
66 | ## Number of punctuations in the text ##
67 | df["num_punctuations"] =df['Review'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
68 |
69 | ## Number of upper case words in the text ##
70 | df["num_words_upper"] = df["Review"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
71 |
72 | ## Number of title case words in the text ##
73 | df["num_words_title"] = df["Review"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
74 |
75 | ## Average length of the words in the text ##
76 | df["mean_word_len"] = df["Review"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
77 |
78 |
79 | print(df.columns)
80 |
81 | # Let us now plot some of our new variables to see if they will be helpful in predictions.
82 |
83 | df['num_words'].loc[df['num_words']>80] = 80 #truncation for better visuals
84 | plt.figure(figsize=(12,8))
85 | sns.violinplot(x='Rating', y='num_words', data=df)
86 | plt.xlabel('Rating', fontsize=12)
87 | plt.ylabel('Number of words in text', fontsize=12)
88 | plt.title("Number of words by review", fontsize=15)
89 | plt.show()
90 |
91 | df['num_punctuations'].loc[df['num_punctuations']>10] = 10 #truncation for better visuals
92 | plt.figure(figsize=(12,8))
93 | sns.violinplot(x='Rating', y='num_punctuations', data=df)
94 | plt.xlabel('Rating', fontsize=12)
95 | plt.ylabel('Number of punctuations in text', fontsize=12)
96 | plt.title("Number of punctuations by rating", fontsize=15)
97 | plt.show()
98 |
99 |
100 | # Initial Model building
101 | ## Prepare the data for modeling: xgboost's multi:softprob objective needs 0-based integer class labels ###
102 | review_mapping_dict = {1:0, 2:1, 3:2, 4:3, 5:4}
103 | train_y = df['Rating'].map(review_mapping_dict)
104 |
105 | ### recompute the truncated variables again ###
106 | df["num_words"] = df["Review"].apply(lambda x: len(str(x).split()))
107 | df["mean_word_len"] = df["Review"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
108 |
109 | cols_to_drop = ['Review']
110 | train_X = df.drop(cols_to_drop+['Rating'], axis=1)
111 | test_X = df.drop(cols_to_drop, axis=1)
112 |
113 | # Training a simple XGBoost model
114 | def runXGB(train_X, train_y, test_X, test_y=None, test_X2=None, seed_val=0, child=1, colsample=0.3):
115 | param = {}
116 | param['objective'] = 'multi:softprob'
117 | param['eta'] = 0.1
118 | param['max_depth'] = 3
119 | param['silent'] = 1
120 |     param['num_class'] = 5  # five rating classes (ratings 1-5 mapped to 0-4)
121 | param['eval_metric'] = "mlogloss"
122 | param['min_child_weight'] = child
123 | param['subsample'] = 0.8
124 | param['colsample_bytree'] = colsample
125 | param['seed'] = seed_val
126 | num_rounds = 2000
127 |
128 | plst = list(param.items())
129 | xgtrain = xgb.DMatrix(train_X, label=train_y)
130 |
131 | if test_y is not None:
132 | xgtest = xgb.DMatrix(test_X, label=test_y)
133 | watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
134 | model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=50, verbose_eval=20)
135 | else:
136 | xgtest = xgb.DMatrix(test_X)
137 | model = xgb.train(plst, xgtrain, num_rounds)
138 |
139 | pred_test_y = model.predict(xgtest, ntree_limit = model.best_ntree_limit)
140 | if test_X2 is not None:
141 | xgtest2 = xgb.DMatrix(test_X2)
142 | pred_test_y2 = model.predict(xgtest2, ntree_limit = model.best_ntree_limit)
143 | return pred_test_y, pred_test_y2, model
144 |
145 | kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2017)
146 | cv_scores = []
147 | pred_full_test = 0
148 | pred_train = np.zeros([df.shape[0], 5])  # one column per rating class
149 | for dev_index, val_index in kf.split(train_X):
150 | dev_X, val_X = train_X.loc[dev_index], train_X.loc[val_index]
151 | dev_y, val_y = train_y[dev_index], train_y[val_index]
152 | pred_val_y, pred_test_y, model = runXGB(dev_X, dev_y, val_X, val_y, test_X, seed_val=0)
153 | pred_full_test = pred_full_test + pred_test_y
154 | pred_train[val_index,:] = pred_val_y
155 | cv_scores.append(metrics.log_loss(val_y, pred_val_y))
156 | break
157 | print("cv scores : ", cv_scores)
158 |
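159 | # --- Illustrative addition (not part of the original script): a quick sanity check on the
160 | # out-of-fold predictions from the single fold trained above. Because the loop breaks after
161 | # the first fold, only the rows in `val_index` hold real probabilities, so we score those rows only.
162 | val_pred_labels = np.argmax(pred_train[val_index, :], axis=1)
163 | print("validation accuracy : ", metrics.accuracy_score(train_y[val_index], val_pred_labels))
164 | print(metrics.confusion_matrix(train_y[val_index], val_pred_labels))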
--------------------------------------------------------------------------------
/scripts/python/kaggle_spooky_authors_ml_model.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Objective: To predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. \n",
8 | "\n",
9 | "Category: Text Analysis/ Natural Language Processing\n",
10 | " \n",
11 | "Data Source: https://www.kaggle.com/c/spooky-author-identification"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {},
18 | "outputs": [
19 | {
20 | "data": {
21 | "text/plain": [
22 | "'C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\scripts\\\\python'"
23 | ]
24 | },
25 | "execution_count": 1,
26 | "metadata": {},
27 | "output_type": "execute_result"
28 | }
29 | ],
30 | "source": [
31 | "# Print the current working directory\n",
32 | "import os\n",
33 | "os.getcwd()"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 36,
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "# Load the required libraries\n",
43 | "import pandas as pd\n",
44 | "import numpy as np\n",
45 | "from sklearn.feature_extraction.text import CountVectorizer as CV\n",
46 | "from sklearn.naive_bayes import MultinomialNB"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 17,
52 | "metadata": {},
53 | "outputs": [
54 | {
55 | "name": "stdout",
56 | "output_type": "stream",
57 | "text": [
58 | "C:\\Users\\Ashoo\\Documents\\R playground\\text-analysis\\data\\kaggle_spooky_authors\n"
59 | ]
60 | }
61 | ],
62 | "source": [
63 | "# set data path\n",
64 | "data_path = \"C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\data\\\\kaggle_spooky_authors\"\n",
65 | "print(data_path)"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 20,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
134 | "text/plain": [
135 | " id text author\n",
136 | "0 id26305 This process, however, afforded me no means of... EAP\n",
137 | "1 id17569 It never once occurred to me that the fumbling... HPL\n",
138 | "2 id11008 In his left hand was a gold snuff box, from wh... EAP\n",
139 | "3 id27763 How lovely is spring As we looked from Windsor... MWS\n",
140 | "4 id12958 Finding nothing else, not even gold, the Super... HPL"
141 | ]
142 | },
143 | "execution_count": 20,
144 | "metadata": {},
145 | "output_type": "execute_result"
146 | }
147 | ],
148 | "source": [
149 | "# Load the data\n",
150 | "training_data = pd.read_csv(\"C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\data\\\\kaggle_spooky_authors\\\\train.csv\")\n",
151 | "testing_data=pd.read_csv(\"C:\\\\Users\\\\Ashoo\\\\Documents\\\\R playground\\\\text-analysis\\\\data\\\\kaggle_spooky_authors\\\\test.csv\")\n",
152 | "training_data.head()"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "Map \"EAP\" to 0, \"HPL\" to 1, and \"MWS\" to 2.\n",
160 | "\n",
161 | "Next, we take all the rows under the column named \"text\" and put them in X (a Python variable).\n",
162 | "\n",
163 | "Similarly, we take all the rows under the column named \"author_num\" and put them in y (another Python variable)."
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 21,
169 | "metadata": {},
170 | "outputs": [
171 | {
172 | "name": "stdout",
173 | "output_type": "stream",
174 | "text": [
175 | "0 This process, however, afforded me no means of...\n",
176 | "1 It never once occurred to me that the fumbling...\n",
177 | "2 In his left hand was a gold snuff box, from wh...\n",
178 | "3 How lovely is spring As we looked from Windsor...\n",
179 | "4 Finding nothing else, not even gold, the Super...\n",
180 | "Name: text, dtype: object\n",
181 | "0 0\n",
182 | "1 1\n",
183 | "2 0\n",
184 | "3 2\n",
185 | "4 1\n",
186 | "Name: author_num, dtype: int64\n"
187 | ]
188 | }
189 | ],
190 | "source": [
191 | "training_data['author_num'] = training_data.author.map({'EAP':0, 'HPL':1, 'MWS':2})\n",
192 | "X = training_data['text']\n",
193 | "y = training_data['author_num']\n",
194 | "print (X.head())\n",
195 | "print (y.head())"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "Now we need to split the data into a training set and a testing set: we train the model on the training set and evaluate it on the testing set.\n",
203 | "\n",
204 | "Here we use a simple sequential split: the first 70% of the rows for training and the remaining 30% for testing."
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": 22,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "per = int(0.7 * len(X))\n",
214 | "X_train = X[:per]\n",
215 | "X_test = X[per:]\n",
216 | "y_train = y[:per]\n",
217 | "y_test = y[per:]"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "##### Converting text data into numbers, or in other words, `vectorization`\n",
225 | "\n",
226 | "Computers cannot work with raw text directly; they only understand numbers, yet we need to classify text. So we tokenize the text and vectorize it, storing the count of each word."
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 31,
232 | "metadata": {},
233 | "outputs": [
234 | {
235 | "name": "stdout",
236 | "output_type": "stream",
237 | "text": [
238 | "{'My': 0, 'name': 4, 'is': 2, 'Sindabad': 1, 'the': 6, 'sailor': 5, 'man': 3}\n",
239 | "1\n",
240 | "1\n",
241 | "1\n",
242 | "1\n",
243 | "1\n"
244 | ]
245 | }
246 | ],
247 | "source": [
248 | "#toy example\n",
249 | "text=[\"My name is Sindabad the sailor man\"]\n",
250 | "toy = CV(lowercase=False, token_pattern=r'\\w+|\\,')\n",
251 | "toy.fit_transform(text)\n",
252 | "print (toy.vocabulary_)\n",
253 | "matrix=toy.transform(text)\n",
254 | "print (matrix[0,0])\n",
255 | "print (matrix[0,1])\n",
256 | "print (matrix[0,2])\n",
257 | "print (matrix[0,3])\n",
258 | "print (matrix[0,4])"
259 | ]
260 | },
261 | {
262 | "cell_type": "code",
263 | "execution_count": 30,
264 | "metadata": {},
265 | "outputs": [
266 | {
267 | "name": "stdout",
268 | "output_type": "stream",
269 | "text": [
270 | "(13705, 27497)\n"
271 | ]
272 | }
273 | ],
274 | "source": [
275 | "vect = CV(lowercase=False, token_pattern=r'\\w+|\\,')\n",
276 | "X_cv=vect.fit_transform(X)\n",
277 | "X_train_cv = vect.transform(X_train)\n",
278 | "X_test_cv = vect.transform(X_test)\n",
279 | "print (X_train_cv.shape)"
280 | ]
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "metadata": {},
285 | "source": [
286 | "In the final step we fit the classifier on the vectorized training split and then score it. Let's check the accuracy on the held-out 30% split."
287 | ]
288 | },
289 | {
290 | "cell_type": "code",
291 | "execution_count": 37,
292 | "metadata": {},
293 | "outputs": [
294 | {
295 | "data": {
296 | "text/plain": [
297 | "0.8432073544433095"
298 | ]
299 | },
300 | "execution_count": 37,
301 | "metadata": {},
302 | "output_type": "execute_result"
303 | }
304 | ],
305 | "source": [
306 | "MNB=MultinomialNB()\n",
307 | "MNB.fit(X_train_cv, y_train)\n",
308 | "MNB.score(X_test_cv, y_test)\n",
309 | "\n",
310 | "\n",
311 | "#clf=MultinomialNB()\n",
312 | "#clf.fit(X_train_cv, y_train)\n",
313 | "#clf.score(X_test_cv, y_test)"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "metadata": {},
319 | "source": [
320 | "Now let's generate predictions for the Kaggle test set (it has no labels, so we predict class probabilities rather than checking accuracy). But first we vectorize it just as we did for the training data."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": 38,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": [
329 | "X_test=vect.transform(testing_data[\"text\"])"
330 | ]
331 | },
332 | {
333 | "cell_type": "code",
334 | "execution_count": 39,
335 | "metadata": {},
336 | "outputs": [
337 | {
338 | "data": {
339 | "text/plain": [
340 | "(8392, 3)"
341 | ]
342 | },
343 | "execution_count": 39,
344 | "metadata": {},
345 | "output_type": "execute_result"
346 | }
347 | ],
348 | "source": [
349 | "MNB=MultinomialNB()\n",
350 | "MNB.fit(X_cv, y)\n",
351 | "predicted_result=MNB.predict_proba(X_test)\n",
352 | "predicted_result.shape"
353 | ]
354 | },
355 | {
356 | "cell_type": "markdown",
357 | "metadata": {},
358 | "source": [
359 | "We get a result with 8392 rows, one per test text, and 3 columns, each holding the predicted probability for one author."
360 | ]
361 | },
362 | {
363 | "cell_type": "code",
364 | "execution_count": 40,
365 | "metadata": {},
366 | "outputs": [
367 | {
368 | "data": {
434 | "text/plain": [
435 | " id EAP HPL MWS\n",
436 | "0 id02310 0.000924 2.968107e-05 9.990459e-01\n",
437 | "1 id24541 1.000000 3.262255e-07 9.234086e-09\n",
438 | "2 id00134 0.003193 9.968065e-01 8.737549e-07\n",
439 | "3 id27757 0.920985 7.901473e-02 4.811097e-07\n",
440 | "4 id04081 0.953158 5.981884e-03 4.086011e-02"
441 | ]
442 | },
443 | "execution_count": 40,
444 | "metadata": {},
445 | "output_type": "execute_result"
446 | }
447 | ],
448 | "source": [
449 | "# Now we create a result data frame with the columns required for the Kaggle submission\n",
450 | "result=pd.DataFrame()\n",
451 | "result[\"id\"]=testing_data[\"id\"]\n",
452 | "result[\"EAP\"]=predicted_result[:,0]\n",
453 | "result[\"HPL\"]=predicted_result[:,1]\n",
454 | "result[\"MWS\"]=predicted_result[:,2]\n",
455 | "result.head()"
456 | ]
457 | },
458 | {
459 | "cell_type": "markdown",
460 | "metadata": {},
461 | "source": [
462 | "Finally, we write the result to a CSV file for submission to Kaggle."
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": 41,
468 | "metadata": {},
469 | "outputs": [],
470 | "source": [
471 | "result.to_csv(\"TO_SUBMIT.csv\", index=False)"
472 | ]
473 | },
474 | {
475 | "cell_type": "code",
476 | "execution_count": null,
477 | "metadata": {},
478 | "outputs": [],
479 | "source": []
480 | }
481 | ],
482 | "metadata": {
483 | "kernelspec": {
484 | "display_name": "Python 3",
485 | "language": "python",
486 | "name": "python3"
487 | },
488 | "language_info": {
489 | "codemirror_mode": {
490 | "name": "ipython",
491 | "version": 3
492 | },
493 | "file_extension": ".py",
494 | "mimetype": "text/x-python",
495 | "name": "python",
496 | "nbconvert_exporter": "python",
497 | "pygments_lexer": "ipython3",
498 | "version": "3.6.5"
499 | }
500 | },
501 | "nbformat": 4,
502 | "nbformat_minor": 2
503 | }
504 |
--------------------------------------------------------------------------------
/scripts/python/learn_text_analysis.py:
--------------------------------------------------------------------------------
1 | import spacy
2 | import numpy as np
3 | import pandas as pd
4 |
5 | # Import the English language model
6 | nlp = spacy.load("en_core_web_sm")
7 |
8 | # dataset: https://www.kaggle.com/datasets/mrisdal/fake-news
9 | # reference: https://www.kaggle.com/code/sudalairajkumar/getting-started-with-spacy
10 | df = pd.read_csv("../../data/fake_news_data/fake.csv")
11 | print(df.shape)
12 | print(df.head())
13 | print(df.columns)
14 | print(df['title'].head())
15 |
16 | txt = df['title'][0]
17 | print(txt)
18 | doc = nlp(txt)
19 | olist = []
20 | for token in doc:
21 | l = [token.text,
22 | token.idx,
23 | token.lemma_,
24 | token.is_punct,
25 | token.is_space,
26 | token.shape_,
27 | token.pos_,
28 | token.tag_]
29 | olist.append(l)
30 |
31 | odf = pd.DataFrame(olist)
32 | odf.columns = ["Text", "StartIndex", "Lemma", "IsPunctuation", "IsSpace", "WordShape", "PartOfSpeech", "POSTag"]
33 | print(odf)
34 |
35 | # Named Entity Recognition:
36 | # A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title.
37 | # Named entity recognition ships with the spaCy package: it is built into the English language model, and we can also train our own entity types if needed.
38 | doc = nlp(txt)
39 | olist = []
40 | for ent in doc.ents:
41 | olist.append([ent.text, ent.label_])
42 |
43 | odf = pd.DataFrame(olist)
44 | odf.columns = ["Text", "EntityType"]
45 | print(odf)
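46 | 
47 | # --- Illustrative addition (not in the original script): the same entity extraction, run over
48 | # the first few titles at once. nlp.pipe() streams the texts through the pipeline efficiently.
49 | olist = []
50 | for i, doc in enumerate(nlp.pipe(df['title'].head(10).astype(str))):
51 |     for ent in doc.ents:
52 |         olist.append([i, ent.text, ent.label_])
53 | 
54 | odf = pd.DataFrame(olist, columns=["TitleIndex", "Text", "EntityType"])
55 | print(odf)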
--------------------------------------------------------------------------------
/scripts/readme.md:
--------------------------------------------------------------------------------
1 | All code scripts live here
2 |
--------------------------------------------------------------------------------