├── .gitignore ├── .travis.yml ├── CITATION.R ├── LICENSES.txt ├── MANIFEST.in ├── README.rst ├── __init__.py ├── bin └── sentimentpy_logo │ ├── py_sentimentpy.png │ ├── py_sentimentpya.png │ ├── py_sentimentpyb.png │ ├── py_sentimentr.pptx │ └── resize_icon.txt ├── sentimentpy.Rproj ├── sentimentpy ├── __init__.py └── split_sentences.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | 3 | sudo: false 4 | 5 | python: 6 | - "3.5" 7 | - "3.6" 8 | 9 | 10 | install: 11 | - "./travis.sh" 12 | 13 | 14 | script: 15 | - pytest 16 | 17 | notifications: 18 | email: 19 | on_success: change 20 | on_failure: change 21 | 22 | cache: pip -------------------------------------------------------------------------------- /CITATION.R: -------------------------------------------------------------------------------- 1 | @Manual{, 2 | title = {{sentimentpy}: Calculate Text Polarity Sentiment}, 3 | author = {Tyler W. Rinker}, 4 | address = {Buffalo, New York}, 5 | note = {version 2.7.0}, 6 | year = {2018}, 7 | url = {http://github.com/trinker/sentimentpy}, 8 | } 9 | -------------------------------------------------------------------------------- /LICENSES.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2018 Tyler W. 
Rinker 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include *.txt *.py *.rst 2 | recursive-include sentimentpy *.txt *.py 3 | recursive-include additional_resources *.tar.gz -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | sentimentpy 2 | =========== 3 | 4 | .. image:: https://www.repostatus.org/badges/latest/wip.svg 5 | :alt: Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public. 6 | :target: https://www.repostatus.org/#wip 7 | 8 | .. 
image:: https://img.shields.io/travis/trinker/sentimentpy/master.svg?style=flat-square&logo=travis
 9 |    :target: https://travis-ci.org/trinker/sentimentpy
10 |    :alt: Build Status
11 | 
12 | .. image:: bin/sentimentpy_logo/py_sentimentpyb.png
13 |    :alt: Module Logo
14 | 
15 | 
16 | 
17 | 
18 | **sentimentpy** is designed to quickly calculate text polarity sentiment at the sentence level. The user can aggregate these scores by grouping variable(s) using built-in aggregate functions.
19 | 
20 | 
21 | **sentimentpy** (a Python port of the R `sentimentr package `_) is a response to my own sentiment-detection needs that were not addressed by the current **R** tools. My own `polarity` function in the R **qdap** package is slower on larger data sets. It is a dictionary-lookup approach that tries to incorporate weighting for valence shifters (negation and amplifiers/deamplifiers). Matthew Jockers created the `syuzhet `_ R package, which utilizes dictionary lookups for the Bing, NRC, and Afinn methods as well as a custom dictionary. He also provides a wrapper for the `Stanford coreNLP `_ parser, which uses much more sophisticated analysis. Jockers's dictionary methods are fast but are more prone to error in the case of valence shifters. Jockers `addressed these critiques `_, explaining that the method works well for analyzing general sentiment in a piece of literature; he points to the accuracy of the Stanford detection as well. In my own work I need better accuracy than a simple dictionary lookup provides: something that considers valence shifters yet optimizes speed, which the Stanford parser does not. This leads to a trade-off between speed and accuracy. Simply put, **sentimentpy** attempts to balance the two.
22 | 
23 | 
24 | Installation
25 | ============
26 | 
27 | 
28 | Currently, this is a GitHub package.
To install use: 29 | 30 | ``pip install git+https://github.com/trinker/sentimentpy`` 31 | 32 | 33 | Sentence Splitting 34 | ================== 35 | 36 | :: 37 | 38 | import sentimentpy.split_sentences as ss 39 | 40 | s = [ 41 | ' I like you. P.S. I like carrots too mrs. dunbar. Well let\'s go to 100th st. around the corner. ', 42 | 'Hello Dr. Livingstone. How are you?', 43 | 'This is sill an incomplete thou.' 44 | 45 | ] 46 | 47 | ss.split_sentences(s) 48 | 49 | :: 50 | 51 | ['I like you.', 52 | 'P.S. I like carrots too mrs. dunbar.', 53 | "Well let's go to 100th st. around the corner.", 54 | 'Hello Dr. Livingstone.', 55 | 'How are you?', 56 | 'This is sill an incomplete thou.'] 57 | 58 | :: 59 | 60 | x = [ 61 | " ".join( 62 | ["Mr. Brown comes! He says hello. i give him coffee. i will ", 63 | "go at 5 p. m. eastern time. Or somewhere in between!go there" 64 | ]), 65 | " ".join( 66 | ["Marvin K. Mooney Will You Please Go Now!", "The time has come.", 67 | "The time has come. The time is now. Just go. Go. GO!", 68 | "I don't care how." 69 | ]) 70 | ] 71 | 72 | ss.split_sentences(x) 73 | 74 | :: 75 | 76 | ['Mr. Brown comes!', 77 | 'He says hello.', 78 | 'i give him coffee.', 79 | 'i will go at 5 p.m. eastern time.', 80 | 'Or somewhere in between!', 81 | 'go there', 82 | 'Marvin K. 
Mooney Will You Please Go Now!', 83 | 'The time has come.', 84 | 'The time has come.', 85 | 'The time is now.', 86 | 'Just go.', 87 | 'Go.', 88 | 'GO!', 89 | "I don't care how."] -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trinker/sentimentpy/ce960456d5d9ac4c211e910dd3d379fc895d2d9b/__init__.py -------------------------------------------------------------------------------- /bin/sentimentpy_logo/py_sentimentpy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trinker/sentimentpy/ce960456d5d9ac4c211e910dd3d379fc895d2d9b/bin/sentimentpy_logo/py_sentimentpy.png -------------------------------------------------------------------------------- /bin/sentimentpy_logo/py_sentimentpya.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trinker/sentimentpy/ce960456d5d9ac4c211e910dd3d379fc895d2d9b/bin/sentimentpy_logo/py_sentimentpya.png -------------------------------------------------------------------------------- /bin/sentimentpy_logo/py_sentimentpyb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trinker/sentimentpy/ce960456d5d9ac4c211e910dd3d379fc895d2d9b/bin/sentimentpy_logo/py_sentimentpyb.png -------------------------------------------------------------------------------- /bin/sentimentpy_logo/py_sentimentr.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trinker/sentimentpy/ce960456d5d9ac4c211e910dd3d379fc895d2d9b/bin/sentimentpy_logo/py_sentimentr.pptx -------------------------------------------------------------------------------- /bin/sentimentpy_logo/resize_icon.txt: 
-------------------------------------------------------------------------------- 1 | ffmpeg -i py_sentimentpya.png -vf scale=150:-1 py_sentimentpy.png 2 | 3 | convert py_sentimentpya.png -transparent white -resize 25% -crop 0x0-30-30 py_sentimentpy.png 4 | convert py_sentimentpya.png -transparent white -resize 16% -crop 0x0-18-19 py_sentimentpyb.png 5 | -------------------------------------------------------------------------------- /sentimentpy.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 4 10 | Encoding: UTF-8 11 | 12 | RnwWeave: knitr 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /sentimentpy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/trinker/sentimentpy/ce960456d5d9ac4c211e910dd3d379fc895d2d9b/sentimentpy/__init__.py -------------------------------------------------------------------------------- /sentimentpy/split_sentences.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on Mon Nov 5 19:07:23 2018 4 | 5 | @author: trinker 6 | """ 7 | 8 | import re 9 | import numpy as np 10 | 11 | 12 | abbr_rep_1_json = { 13 | "Titles": [ 14 | "[mM]r", 15 | "[mM]rs", 16 | "[mM]s", 17 | "[dD]r", 18 | "[pP]rof", 19 | "[sS]en", 20 | "[rR]ep", 21 | "[rR]ev", 22 | "[gG]ov", 23 | "[aA]tty", 24 | "[sS]upt", 25 | "[dD]et", 26 | "[rR]ev", 27 | "[cC]ol", 28 | "[gG]en", 29 | "[lL]t", 30 | "[cC]mdr", 31 | "[aA]dm", 32 | "[cC]apt", 33 | "[sS]gt", 34 | "[cC]pl", 35 | "[mM]aj" 36 | ], 37 | "Entities": [ 38 | "[dD]ept", 39 | "[uU]niv", 40 | "[uU]ni", 41 | "[aA]ssn" 42 | ], 43 | "Misc": [ 44 | "[vV]s", 45 | "[mM]t" 46 | ], 47 | "Streets": [ 
48 | "[sS]t" 49 | ] 50 | } 51 | 52 | 53 | abbr_rep_2_json = { 54 | "Titles": [ 55 | "[jJ]r", 56 | "[sS]r" 57 | ], 58 | "Entities": [ 59 | "[bB]ros", 60 | "[iI]nc", 61 | "[lL]td", 62 | "[cC]o", 63 | "[cC]orp", 64 | "[pP]lc" 65 | ], 66 | "Months": [ 67 | "[jJ]an", 68 | "[fF]eb", 69 | "[mM]ar", 70 | "[aA]pr", 71 | "[mM]ay", 72 | "[jJ]un", 73 | "[jJ]ul", 74 | "[aA]ug", 75 | "[sS]ep", 76 | "[oO]ct", 77 | "[nN]ov", 78 | "[dD]ec", 79 | "[sS]ept" 80 | ], 81 | "Days": [ 82 | "[mM]on", 83 | "[tT]ue", 84 | "[wW]ed", 85 | "[tT]hu", 86 | "[fF]ri", 87 | "[sS]at", 88 | "[sS]un" 89 | ], 90 | "Misc": [ 91 | "[eE]tc", 92 | "[eE]sp", 93 | "[cC]f", 94 | "[aA]l" 95 | ], 96 | "Streets": [ 97 | "[aA]ve", 98 | "[bB]ld", 99 | "[bB]lvd", 100 | "[cC]l", 101 | "[cC]t", 102 | "[cC]res", 103 | "[rR]d" 104 | ], 105 | "Measurement": [ 106 | "[fF]t", 107 | "[gG]al", 108 | "[mM]i", 109 | "[tT]bsp", 110 | "[tT]sp", 111 | "[yY]d", 112 | "[qQ]t", 113 | "[sS]q", 114 | "[pP]t", 115 | "[lL]b", 116 | "[lL]bs" 117 | ] 118 | } 119 | 120 | 121 | 122 | period_reg = '{}|{}|{}|{}'.format( 123 | r"(?:(?<=[a-z])\.\s(?=[a-z]\.))", 124 | r"(?:(?<=([ .][a-z]))\.)(?!(?:\s[A-Z]|$)|(?:\s\s))", 125 | r"(?:(?<=[A-Z])\.(?=\s??[A-Z]\.))", 126 | r"(?:(?<=[A-Z])\.(?!\s+[A-Z][A-Za-z]))" 127 | ) 128 | 129 | 130 | 131 | abbr_rep_1 = [item for sublist in list(abbr_rep_1_json.values()) for item in sublist] 132 | abbr_rep_1_results = [] 133 | 134 | for i in range(len(abbr_rep_1)): 135 | abbr_rep_1_results.append(r"((?<=\b({}))\.)".format(abbr_rep_1[i])) 136 | 137 | 138 | 139 | abbr_rep_2 = [item for sublist in list(abbr_rep_2_json.values()) for item in sublist] 140 | abbr_rep_2_results = [] 141 | 142 | for i in range(len(abbr_rep_2)): 143 | abbr_rep_2_results.append(r"((?<=\b({}))\.(?!\s+[A-Z]))".format(abbr_rep_2[i])) 144 | 145 | 146 | sent_regex = "{}|{}|{}|({})".format( 147 | "|".join(abbr_rep_1_results), 148 | "|".join(abbr_rep_2_results), 149 | period_reg, 150 | r'\.(?=\d+)' 151 | ) 152 | 153 | 154 | 155 | 156 | ## This works 
on a single string. Need to loop through and apply.
157 | def break_sentence(x):
158 | 
159 |     y = re.sub(
160 |         pattern = r'([Pp])(\.)(\s*[Ss])(\.)',
161 |         repl = r'\1<<<DOT>>>\3<<<DOT>>>',
162 |         string = x.strip()
163 |     )
164 | 
165 |     y = re.sub(
166 |         pattern = sent_regex,
167 |         repl = "<<<DOT>>>",
168 |         string = y
169 |     )
170 | 
171 |     y = re.sub(
172 |         pattern = r'(\b[Nn]o)(\.)(\s+\d)',
173 |         repl = r'\1<<<DOT>>>\3',
174 |         string = y
175 |     )
176 | 
177 |     y = re.sub(
178 |         pattern = r'(\b\d+\s+in)(\.)(\s[a-z])',
179 |         repl = r'\1<<<DOT>>>\3',
180 |         string = y
181 |     )
182 | 
183 |     y = re.sub(
184 |         pattern = r'([?.!]+)([\'])([^,])',
185 |         repl = r'<<<SQ>>>\1 \3',
186 |         string = y
187 |     )
188 | 
189 |     y = re.sub(
190 |         pattern = r'([?.!]+)(["])([^,])',
191 |         repl = r'<<<DQ>>>\1 \3',
192 |         string = y
193 |     )
194 | 
195 |     ## middle name handling
196 |     y = re.sub(
197 |         pattern = r'(\b[A-Z][a-z]+\s[A-Z])(\.)(\s[A-Z][a-z]+\b)',
198 |         repl = r'\1<<<DOT>>>\3',
199 |         string = y
200 |     )
201 | 
202 |     ## 2 middle names
203 |     y = re.sub(
204 |         pattern = r'(\b[A-Z][a-z]+\s[A-Z])(\.)(\s[A-Z])(\.)(\s[A-Z][a-z]+\b)',
205 |         repl = r'\1<<<DOT>>>\3<<<DOT>>>\5',
206 |         string = y
207 |     )
208 | 
209 |     y = re.split(
210 |         pattern = r"{}{}".format(
211 |             r"(?:(?<=[?.!])|(?<=[?.!]['\"]))",
212 |             r"\s*"
213 |         ),
214 |         string = y
215 |     )
216 | 
217 |     ## the zero-width split also breaks where the space after the
218 |     ## terminal punctuation is missing; drop any empty pieces
219 |     return [i for i in y if i.strip()]
220 | 
221 | 
222 | def break_sentences(x):
223 |     """Apply break_sentence to each string in a list of strings."""
224 |     return [break_sentence(i) for i in x]
225 | 
226 | 
227 | ## ---- swap the placeholder tokens back for the original characters ----
228 | 
229 | 
230 | def restore_sentence(x):
231 |     """Restore the <<<DOT>>>, <<<SQ>>>, and <<<DQ>>> placeholders to the
232 |     period, single-quote, and double-quote characters that break_sentence
233 |     masked prior to splitting.
234 |     """
235 | 
236 |     y = re.sub(
237 |         pattern = r'<<<DOT>>>',
238 |         repl = r'.',
239 |         string = x.strip()
240 |     )
241 | 
242 |     y = re.sub(
243 |         pattern = r'(<<<SQ>>>)([?.!]+)',
244 |         repl = r"\2'",
245 |         string = y
246 |     )
247 | 
248 |     y = re.sub(
249 |         pattern = r'(<<<DQ>>>)([?.!]+)',
250 |         repl = r'\2"',
251 |         string = y
252 |     )
253 | 
254 |     return y
255 | 
256 | 
257 | 
258 | 
259 | def split_sentences(x):
260 | 
261 |     y = break_sentences(x)
262 | 
263 |     element_id = np.repeat(range(len(y)), [len(i) for i in y])
264 |     sentence_id = [range(len(i)) for i in y]
265 |     sentence_id = [item for sublist in sentence_id for item in sublist]
266 | 
267 |     # locs = (np.cumsum([len(x) for x in y]) + 1)[:-1]
268 |     ## TO DO: this should return a pandas object with an element_id, sentence_id, and the text
269 | 
270 |     sents = [restore_sentence(sentence).strip() for element in
y for sentence in element]
271 |     return sents
272 | 
273 | ## Identical to ^^^
274 | # =============================================================================
275 | # list_of_words = []
276 | # for element in y:
277 | #     for sentence in element:
278 | #         list_of_words.append(restore_sentence(sentence))
279 | # 
280 | # list_of_words
281 | # =============================================================================
282 | 
283 | 
284 | if __name__ == '__main__':
285 |     # --- examples -------
286 |     s = [
287 |         ' I like you. P.S. I like carrots too mrs. dunbar. Well let\'s go to 100th st. around the corner. ',
288 |         'Hello Dr. Livingstone. How are you?',
289 |         'This is sill an incomplete thou.'
290 | 
291 |     ]
292 | 
293 |     print(split_sentences(s))
294 | 
295 |     x = [
296 |         " ".join(
297 |             ["Mr. Brown comes! He says hello. i give him coffee. i will ",
298 |              "go at 5 p. m. eastern time. Or somewhere in between!go there"
299 |             ]),
300 |         " ".join(
301 |             ["Marvin K. Mooney Will You Please Go Now!", "The time has come.",
302 |              "The time has come. The time is now. Just go. Go. GO!",
303 |              "I don't care how."
304 |             ])
305 |     ]
306 | 
307 |     print(split_sentences(x))
308 | 
309 | 
310 | 
311 | 
312 | 
313 | 
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import codecs
2 | import os
3 | from setuptools import setup, find_packages
4 | 
5 | HERE = os.path.abspath(os.path.dirname(__file__))
6 | def read(*parts):
7 |     """
8 |     Build an absolute path from *parts* and return the contents of the
9 |     resulting file. Assume UTF-8 encoding.
10 |     """
11 |     with codecs.open(os.path.join(HERE, *parts), "rb", "utf-8") as f:
12 |         return f.read()
13 | 
14 | 
15 | setup(
16 |     name = 'sentimentpy',
17 |     packages = find_packages(exclude=['tests*']),  # discover packages automatically to avoid typo/transposition errors
18 |     include_package_data=True,
19 |     version = '2.7.0',
20 |     description = 'sentimentpy: Calculate Text Polarity Sentiment',
21 |     long_description = read("README.rst"),
22 |     long_description_content_type = 'text/x-rst',
23 |     author = 'Tyler W. Rinker',
24 |     author_email = 'tyler.rinker@gmail.com',
25 |     license = 'MIT License: http://opensource.org/licenses/MIT',
26 |     url = 'https://github.com/trinker/sentimentpy',
27 |     download_url = 'https://github.com/trinker/sentimentpy/archive/master.zip',
28 |     keywords = ['sentiment'],
29 |     classifiers = ['Development Status :: 4 - Beta', 'Intended Audience :: Science/Research',
30 |                    'License :: OSI Approved :: MIT License', 'Natural Language :: English',
31 |                    'Programming Language :: Python :: 3.6', 'Topic :: Scientific/Engineering :: Artificial Intelligence',
32 |                    'Topic :: Scientific/Engineering :: Information Analysis', 'Topic :: Text Processing :: Linguistic',
33 |                    'Topic :: Text Processing :: General'],
34 | )
35 | 
--------------------------------------------------------------------------------
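
The protect/split/restore strategy that ``sentimentpy/split_sentences.py`` builds up (mask abbreviation periods with a placeholder token, split on terminal punctuation, then swap the token back) can be sketched in a few self-contained lines. This is an illustrative miniature only, not the package's API; the abbreviation subset, the placeholder token, and the name ``split_sentences_mini`` are hypothetical:

```python
import re

# Tiny subset of abbreviations purely for illustration; the real module builds
# a much larger alternation from its abbreviation tables.
ABBREVIATIONS = r"(?:[mM]r|[mM]rs|[dD]r|[sS]t)"
PLACEHOLDER = "<<<DOT>>>"  # hypothetical placeholder token


def split_sentences_mini(text):
    """Protect abbreviation periods, split on terminal punctuation, restore."""
    # 1. protect: swap periods that follow a known abbreviation for a token
    protected = re.sub(r"(\b{})\.".format(ABBREVIATIONS), r"\1" + PLACEHOLDER, text)
    # 2. split: break after ., !, or ? followed by whitespace
    pieces = re.split(r"(?<=[.!?])\s+", protected.strip())
    # 3. restore: put the protected periods back
    return [p.replace(PLACEHOLDER, ".") for p in pieces if p]


print(split_sentences_mini("Hello Dr. Livingstone. How are you?"))
# → ['Hello Dr. Livingstone.', 'How are you?']
```

The real module goes further than this sketch: its split is zero-width so sentences still break when the space after the punctuation is missing ("between!go"), and quotes, initials, and measurements each get their own placeholder passes.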