├── README.md ├── slang.txt ├── techniques.py └── preprocess.py /README.md: -------------------------------------------------------------------------------- 1 | # text-preprocessing-techniques 2 | ## 16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis. 3 | 4 | These techniques were used in comparison in our paper **"A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis"**. If you use this material please [cite](https://link.springer.com/chapter/10.1007/978-3-319-67008-9_31) the paper. An extended paper for this work can be found [here](https://www.sciencedirect.com/science/article/pii/S0957417418303683), with the title **"A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis"**. Please cite. 5 | 6 | Most of these techniques are generic and can be used in various applications except Sentiment Analysis. 7 | They are the following: 8 | 9 | #### 0. Remove Unicode Strings and Noise 10 | #### 1. Replace URLs, User Mentions and Hashtags 11 | #### 2. Replcae Slang and Abbreviations 12 | #### 3. Replace Contractions 13 | #### 4. Remove Numbers 14 | #### 5. Replace Repetitions of Punctuation 15 | #### 6. Replace Negations with Antonyms 16 | #### 7. Remove Punctuation 17 | #### 8. Handling Capitalized Words 18 | #### 9. Lowercase 19 | #### 10. Remove Stopwords 20 | #### 11. Replace Elongated Words 21 | #### 12. Spelling Correction 22 | #### 13. Part of Speech Tagging 23 | #### 14. Lemmatizing 24 | #### 15. Stemming 25 | 26 | This scripts also prints some statistics for the text file like: 27 | 28 | - Total Sentences 29 | - Total Words before and after preprocess 30 | - Total Unique words before and after preprocess 31 | - Average Words per Sentence before and after preprocess 32 | - Total Run time 33 | - Total Emoticons found 34 | - Total Slangs and Abbreviations found 35 | - 20 Most Commong Sland and Abbreviations and plots them 36 | - Total Elongated words 37 | - Total multi Exclamation 38 | - question and stop marks 39 | - Total All Capitalized words 40 | - 100 Most Common words and plots them and most common bigram and trigram collocations 41 | 42 | The text file that we included here is a sample (2000 tweets) of the SS-Twitter dataset. 43 | 44 | The file "preprocess.py" includes many comments and in order to use a technique you have to uncomment the appropriate line/lines. The initial script uses all techniques. So if you want to use only specific techniques, comment out the others. 45 | -------------------------------------------------------------------------------- /slang.txt: -------------------------------------------------------------------------------- 1 | 2day today 2 | 2nite tonight 3 | 4u for you 4 | 4ward forward 5 | a3 anyplace, anywhere, anytime 6 | a/n author note 7 | a/w anyway 8 | a/s/l age, sex, location 9 | adn any day now 10 | afaic as far as i'm concerned 11 | afaik as far as I know 12 | afk away from keyboard 13 | aggro aggresive 14 | aight alright 15 | airhead stupid 16 | aka as known as 17 | alol actually laughing out loud 18 | amigo friend 19 | amz amazing 20 | app application 21 | armpit undesirable 22 | asap as soon as possible 23 | atm at the moment 24 | atw all the way 25 | b/c because 26 | b-day birthday 27 | b4 before 28 | b4n bye for now 29 | bae before anyone else 30 | bak back at the keyboard 31 | bbl bee back later 32 | bday birthday 33 | becuz because 34 | bent angry 35 | bestie best friend 36 | besty best friend 37 | bf boyfriend 38 | bff best friends forever 39 | bffe best friends forever 40 | bfn bye for now 41 | bg big grin 42 | bmfe best mates forever 43 | bmfl best mates life 44 | bozo idiot 45 | brah friend 46 | bravo well done 47 | brb be right back 48 | bro brother 49 | bta but then again 50 | btdt been there, done that 51 | btr better 52 | btw by the way 53 | buddy friend 54 | c'mon came on 55 | cid crying in disgrace 56 | congrats congratulations 57 | copacetic excellent 58 | coz beacause 59 | cu see you 60 | cuddy friends 61 | cul see you later 62 | cul8r see you later  63 | cutie cute 64 | cuz because 65 | cya bye 66 | cyo see you online  67 | dbau doing business as usual 68 | deets details 69 | dmn damn 70 | dobe idiot 71 | dope stupid 72 | dork strange 73 | dunno don't know 74 | dwi deal with it 75 | dyd don't you dare 76 | ermahgerd oh my gosh 77 | eu europe 78 | ez easy 79 | f9 fine 80 | fav favorite 81 | far-out great 82 | fb facebook 83 | flick movie 84 | fml fuck my life 85 | foxy sexy 86 | friggin freaking 87 | fttn for the time being 88 | ftw for the win 89 | fud fear, uncertainty, and doubt 90 | fwiw for what it's worth  91 | fyi for your information 92 | g grin  93 | g2g got to go  94 | ga go ahead  95 | gal get a life 96 | getcha understand  97 | gf girlfriend  98 | gfn gone for now 99 | gg good game 100 | gj good job 101 | gky go kill yourself 102 | gl good luck 103 | glhf good luck have fun 104 | gmab give me a break 105 | gmbo giggling my butt off  106 | gmta great minds think alike  107 | goof idiot 108 | goofy idiot 109 | gr8 great 110 | gtg got to go 111 | gud good 112 | h8 hate 113 | hagn have a good night  114 | hdop help delete online predators  115 | hf have fun 116 | hml hate my life 117 | hoas hold on a second 118 | hhis hanging head in shame  119 | hmu hit me up 120 | hru how are you 121 | twt hope this helps 122 | hw homework 123 | i'ma i am going to 124 | iac in any case  125 | ic I see  126 | icymi in case you missed it 127 | idk I don't know 128 | iggy ignore  129 | iht i hate this 130 | ikr i know, right? 131 | ilt i like that 132 | ily i love you 133 | ima i am going to  134 | imao in my arrogant opinion 135 | imnsho in my not so humble opinion  136 | imo in my opinion  137 | imy i miss you 138 | iou i owe you 139 | iow in other words  140 | ipn I’m posting naked  141 | irl in real life  142 | j/k just kidding 143 | jdi just do it 144 | jk just kidding 145 | jkn joking 146 | jyeah yeah 147 | kinda kind of 148 | l8 late 149 | l8r later 150 | lbh let's be honest 151 | ld later, dude 152 | ldi let's do it   153 | ldr long distance relationship  154 | lees beautiful  155 | lfm looking for more 156 | lil little 157 | llta lots and lots of thunderous applause  158 | lmao laugh my ass off 159 | lmirl let's meet in real life  160 | lmk let me know 161 | lol laugh out loud 162 | lolz laugh out loud 163 | lotta lot of 164 | lsr loser 165 | ltr longterm relationship 166 | lua love you always  167 | lub love 168 | lubb love 169 | lulab love you like a brother  170 | lulas love you like a sister  171 | lul laugh 172 | luls laugh 173 | lulz laugh 174 | lumu love you miss you 175 | luv love 176 | lux luxury 177 | lwm laugh with me 178 | lwp laugh with passion 179 | lvl level 180 | m/f male or female 181 | m2 me too 182 | m8 mate 183 | me2 me too 184 | milf mother I would like to fuck 185 | mma meet me at 186 | mmb message me back 187 | mvp most valueable player 188 | msg message 189 | mtf more to follow 190 | myob mind your own business 191 | nah no 192 | nc no comment 193 | nk not kidding 194 | ngl not gonna lie 195 | nlt no later than 196 | nm not much 197 | no1 no one 198 | np no problem 199 | nsfw not safe for work 200 | nuh no 201 | nvm nevermind 202 | obo or best offer 203 | oic oh, i see 204 | oll online love  205 | omg oh my god 206 | omw on my way 207 | osm awesome 208 | otoh on the other hand  209 | perv pervert 210 | pervy pervert 211 | phat pretty hot and tempting 212 | pir parent in room 213 | pls please 214 | plz please 215 | ppl people 216 | pro professional 217 | pwnd owned 218 | qq crying 219 | r are 220 | rly really 221 | rofl roll on the floor laughing 222 | rolf roll on the floor laughing 223 | rpg role playing games 224 | ru are you 225 | s2u shame to you 226 | scrub loser 227 | sec second 228 | shid slaps head in disgust 229 | shoulda should have 230 | sff so funny 231 | smexy smart and sexy 232 | smh shaking my head 233 | somy sick of me yet 234 | sot short of time  235 | sry sorry 236 | str8 straight 237 | sux sucks 238 | swag style 239 | taze irritate 240 | tba to be announced 241 | tbfu too bad for you 242 | tbc to be continued 243 | tbd to be determined 244 | tbr to be rude 245 | tc take care 246 | thx thanks 247 | thanx thanks 248 | thx thanks 249 | tfw that feeling when 250 | til today i learned 251 | ttyl talk to you later  252 | ty thank you 253 | tyvm thank you very much 254 | u you 255 | uber the best 256 | ugh disgusted 257 | ur you are 258 | uw you are welcome  259 | vs versus 260 | w2f way too funny 261 | w8 wait 262 | wak weird 263 | wanna want to 264 | wb welcome back 265 | whiz talented 266 | whoa surprise 267 | whoah surprise 268 | wfm works for me  269 | wibni wouldn't it be nice if  270 | wmd weapon of mass destruction 271 | wot what 272 | wtf what the fuck 273 | wtg way to go 274 | wtgp want to go private 275 | wu what's up 276 | wuh what? 277 | wuv love 278 | ym young man 279 | yawn boring 280 | yum good 281 | x kiss 282 | xxx kiss 283 | xdd laughing 284 | y why 285 | yolo you only live once 286 | yuge huge 287 | yw you are welcome 288 | ywa you are welcome anyway 289 | zomg oh my god! 290 | zzz sleeping -------------------------------------------------------------------------------- /techniques.py: -------------------------------------------------------------------------------- 1 | """ Copyright 2017, Dimitrios Effrosynidis, All rights reserved. """ 2 | 3 | import re 4 | from functools import partial 5 | from collections import Counter 6 | import nltk 7 | from nltk.corpus import wordnet 8 | from nltk.corpus import stopwords 9 | from nltk.stem import WordNetLemmatizer 10 | from nltk.stem.porter import PorterStemmer 11 | 12 | def removeUnicode(text): 13 | """ Removes unicode strings like "\u002c" and "x96" """ 14 | text = re.sub(r'(\\u[0-9A-Fa-f]+)',r'', text) 15 | text = re.sub(r'[^\x00-\x7f]',r'',text) 16 | return text 17 | 18 | def replaceURL(text): 19 | """ Replaces url address with "url" """ 20 | text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','url',text) 21 | text = re.sub(r'#([^\s]+)', r'\1', text) 22 | return text 23 | 24 | def replaceAtUser(text): 25 | """ Replaces "@user" with "atUser" """ 26 | text = re.sub('@[^\s]+','atUser',text) 27 | return text 28 | 29 | def removeHashtagInFrontOfWord(text): 30 | """ Removes hastag in front of a word """ 31 | text = re.sub(r'#([^\s]+)', r'\1', text) 32 | return text 33 | 34 | def removeNumbers(text): 35 | """ Removes integers """ 36 | text = ''.join([i for i in text if not i.isdigit()]) 37 | return text 38 | 39 | def replaceMultiExclamationMark(text): 40 | """ Replaces repetitions of exlamation marks """ 41 | text = re.sub(r"(\!)\1+", ' multiExclamation ', text) 42 | return text 43 | 44 | def replaceMultiQuestionMark(text): 45 | """ Replaces repetitions of question marks """ 46 | text = re.sub(r"(\?)\1+", ' multiQuestion ', text) 47 | return text 48 | 49 | def replaceMultiStopMark(text): 50 | """ Replaces repetitions of stop marks """ 51 | text = re.sub(r"(\.)\1+", ' multiStop ', text) 52 | return text 53 | 54 | def countMultiExclamationMarks(text): 55 | """ Replaces repetitions of exlamation marks """ 56 | return len(re.findall(r"(\!)\1+", text)) 57 | 58 | def countMultiQuestionMarks(text): 59 | """ Count repetitions of question marks """ 60 | return len(re.findall(r"(\?)\1+", text)) 61 | 62 | def countMultiStopMarks(text): 63 | """ Count repetitions of stop marks """ 64 | return len(re.findall(r"(\.)\1+", text)) 65 | 66 | def countElongated(text): 67 | """ Input: a text, Output: how many words are elongated """ 68 | regex = re.compile(r"(.)\1{2}") 69 | return len([word for word in text.split() if regex.search(word)]) 70 | 71 | def countAllCaps(text): 72 | """ Input: a text, Output: how many words are all caps """ 73 | return len(re.findall("[A-Z0-9]{3,}", text)) 74 | 75 | """ Creates a dictionary with slangs and their equivalents and replaces them """ 76 | with open('slang.txt') as file: 77 | slang_map = dict(map(str.strip, line.partition('\t')[::2]) 78 | for line in file if line.strip()) 79 | 80 | slang_words = sorted(slang_map, key=len, reverse=True) # longest first for regex 81 | regex = re.compile(r"\b({})\b".format("|".join(map(re.escape, slang_words)))) 82 | replaceSlang = partial(regex.sub, lambda m: slang_map[m.group(1)]) 83 | 84 | def countSlang(text): 85 | """ Input: a text, Output: how many slang words and a list of found slangs """ 86 | slangCounter = 0 87 | slangsFound = [] 88 | tokens = nltk.word_tokenize(text) 89 | for word in tokens: 90 | if word in slang_words: 91 | slangsFound.append(word) 92 | slangCounter += 1 93 | return slangCounter, slangsFound 94 | 95 | """ Replaces contractions from a string to their equivalents """ 96 | contraction_patterns = [ (r'won\'t', 'will not'), (r'can\'t', 'cannot'), (r'i\'m', 'i am'), (r'ain\'t', 'is not'), (r'(\w+)\'ll', '\g<1> will'), (r'(\w+)n\'t', '\g<1> not'), 97 | (r'(\w+)\'ve', '\g<1> have'), (r'(\w+)\'s', '\g<1> is'), (r'(\w+)\'re', '\g<1> are'), (r'(\w+)\'d', '\g<1> would'), (r'&', 'and'), (r'dammit', 'damn it'), (r'dont', 'do not'), (r'wont', 'will not') ] 98 | def replaceContraction(text): 99 | patterns = [(re.compile(regex), repl) for (regex, repl) in contraction_patterns] 100 | for (pattern, repl) in patterns: 101 | (text, count) = re.subn(pattern, repl, text) 102 | return text 103 | 104 | def replaceElongated(word): 105 | """ Replaces an elongated word with its basic form, unless the word exists in the lexicon """ 106 | 107 | repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)') 108 | repl = r'\1\2\3' 109 | if wordnet.synsets(word): 110 | return word 111 | repl_word = repeat_regexp.sub(repl, word) 112 | if repl_word != word: 113 | return replaceElongated(repl_word) 114 | else: 115 | return repl_word 116 | 117 | def removeEmoticons(text): 118 | """ Removes emoticons from text """ 119 | text = re.sub(':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', '', text) 120 | return text 121 | 122 | def countEmoticons(text): 123 | """ Input: a text, Output: how many emoticons """ 124 | return len(re.findall(':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', text)) 125 | 126 | 127 | ### Spell Correction begin ### 128 | """ Spell Correction http://norvig.com/spell-correct.html """ 129 | def words(text): return re.findall(r'\w+', text.lower()) 130 | 131 | WORDS = Counter(words(open('corporaForSpellCorrection.txt').read())) 132 | 133 | def P(word, N=sum(WORDS.values())): 134 | """P robability of `word`. """ 135 | return WORDS[word] / N 136 | 137 | def spellCorrection(word): 138 | """ Most probable spelling correction for word. """ 139 | return max(candidates(word), key=P) 140 | 141 | def candidates(word): 142 | """ Generate possible spelling corrections for word. """ 143 | return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word]) 144 | 145 | def known(words): 146 | """ The subset of `words` that appear in the dictionary of WORDS. """ 147 | return set(w for w in words if w in WORDS) 148 | 149 | def edits1(word): 150 | """ All edits that are one edit away from `word`. """ 151 | letters = 'abcdefghijklmnopqrstuvwxyz' 152 | splits = [(word[:i], word[i:]) for i in range(len(word) + 1)] 153 | deletes = [L + R[1:] for L, R in splits if R] 154 | transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1] 155 | replaces = [L + c + R[1:] for L, R in splits if R for c in letters] 156 | inserts = [L + c + R for L, R in splits for c in letters] 157 | return set(deletes + transposes + replaces + inserts) 158 | 159 | def edits2(word): 160 | """ All edits that are two edits away from `word`. """ 161 | return (e2 for e1 in edits1(word) for e2 in edits1(e1)) 162 | 163 | ### Spell Correction End ### 164 | 165 | ### Replace Negations Begin ### 166 | 167 | def replace(word, pos=None): 168 | """ Creates a set of all antonyms for the word and if there is only one antonym, it returns it """ 169 | antonyms = set() 170 | for syn in wordnet.synsets(word, pos=pos): 171 | for lemma in syn.lemmas(): 172 | for antonym in lemma.antonyms(): 173 | antonyms.add(antonym.name()) 174 | if len(antonyms) == 1: 175 | return antonyms.pop() 176 | else: 177 | return None 178 | 179 | def replaceNegations(text): 180 | """ Finds "not" and antonym for the next word and if found, replaces not and the next word with the antonym """ 181 | i, l = 0, len(text) 182 | words = [] 183 | while i < l: 184 | word = text[i] 185 | if word == 'not' and i+1 < l: 186 | ant = replace(text[i+1]) 187 | if ant: 188 | words.append(ant) 189 | i += 2 190 | continue 191 | words.append(word) 192 | i += 1 193 | return words 194 | 195 | ### Replace Negations End ### 196 | 197 | def addNotTag(text): 198 | """ Finds "not,never,no" and adds the tag NEG_ to all words that follow until the next punctuation """ 199 | transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]', 200 | lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)), 201 | text, 202 | flags=re.IGNORECASE) 203 | return transformed 204 | 205 | def addCapTag(word): 206 | """ Finds a word with at least 3 characters capitalized and adds the tag ALL_CAPS_ """ 207 | if(len(re.findall("[A-Z]{3,}", word))): 208 | word = word.replace('\\', '' ) 209 | transformed = re.sub("[A-Z]{3,}", "ALL_CAPS_"+word, word) 210 | return transformed 211 | else: 212 | return word 213 | -------------------------------------------------------------------------------- /preprocess.py: -------------------------------------------------------------------------------- 1 | """ Copyright 2017, Dimitrios Effrosynidis, All rights reserved. """ 2 | 3 | from time import time 4 | import numpy as np 5 | import string 6 | 7 | from techniques import * 8 | 9 | print("Starting preprocess..\n") 10 | 11 | """ Tokenizes a text to its words, removes and replaces some of them """ 12 | finalTokens = [] # all tokens 13 | stoplist = stopwords.words('english') 14 | my_stopwords = "multiexclamation multiquestion multistop url atuser st rd nd th am pm" # my extra stopwords 15 | stoplist = stoplist + my_stopwords.split() 16 | allowedWordTypes = ["J","R","V","N"] # J is Adject, R is Adverb, V is Verb, N is Noun. These are used for POS Tagging 17 | lemmatizer = WordNetLemmatizer() # set lemmatizer 18 | stemmer = PorterStemmer() # set stemmer 19 | 20 | def tokenize(text, wordCountBefore, textID, y): 21 | totalAdjectives = 0 22 | totalAdverbs = 0 23 | totalVerbs = 0 24 | onlyOneSentenceTokens = [] # tokens of one sentence each time 25 | 26 | tokens = nltk.word_tokenize(text) 27 | 28 | tokens = replaceNegations(tokens) # Technique 6: finds "not" and antonym for the next word and if found, replaces not and the next word with the antonym 29 | 30 | translator = str.maketrans('', '', string.punctuation) 31 | text = text.translate(translator) # Technique 7: remove punctuation 32 | 33 | tokens = nltk.word_tokenize(text) # it takes a text as an input and provides a list of every token in it 34 | 35 | ### NO POS TAGGING BEGIN (If you don't want to use POS Tagging keep this section uncommented) ### 36 | 37 | ## for w in tokens: 38 | ## 39 | ## if (w not in stoplist): # Technique 10: remove stopwords 40 | ## final_word = addCapTag(w) # Technique 8: Finds a word with at least 3 characters capitalized and adds the tag ALL_CAPS_ 41 | ## final_word = final_word.lower() # Technique 9: lowercases all characters 42 | ## final_word = replaceElongated(final_word) # Technique 11: replaces an elongated word with its basic form, unless the word exists in the lexicon 43 | ## if len(final_word)>1: 44 | ## final_word = spellCorrection(final_word) # Technique 12: correction of spelling errors 45 | ## final_word = lemmatizer.lemmatize(final_word) # Technique 14: lemmatizes words 46 | ## final_word = stemmer.stem(final_word) # Technique 15: apply stemming to words 47 | 48 | ### NO POS TAGGING END ### 49 | 50 | 51 | ### POS TAGGING BEGIN (If you want to exclude words using POS Tagging, keep this section uncommented and comment the above) ### 52 | 53 | tagged = nltk.pos_tag(tokens) # Technique 13: part of speech tagging 54 | for w in tagged: 55 | 56 | if (w[1][0] in allowedWordTypes and w[0] not in stoplist): 57 | final_word = addCapTag(w[0]) 58 | #final_word = final_word.lower() 59 | final_word = replaceElongated(final_word) 60 | if len(final_word)>1: 61 | final_word = spellCorrection(final_word) 62 | final_word = lemmatizer.lemmatize(final_word) 63 | final_word = stemmer.stem(final_word) 64 | 65 | ### POS TAGGING END ### 66 | 67 | onlyOneSentenceTokens.append(final_word) 68 | finalTokens.append(final_word) 69 | 70 | 71 | onlyOneSentence = " ".join(onlyOneSentenceTokens) # form again the sentence from the list of tokens 72 | #print(onlyOneSentence) # print final sentence 73 | 74 | 75 | """ Write the preprocessed text to file """ 76 | with open("result.txt", "a") as result: 77 | result.write(textID+"\t"+y+"\t"+onlyOneSentence+"\n") 78 | 79 | return finalTokens 80 | 81 | 82 | f = open("ss-twitterfinal.txt","r", encoding="utf8", errors='replace').read() 83 | 84 | t0 = time() 85 | totalSentences = 0 86 | totalEmoticons = 0 87 | totalSlangs = 0 88 | totalSlangsFound = [] 89 | totalElongated = 0 90 | totalMultiExclamationMarks = 0 91 | totalMultiQuestionMarks = 0 92 | totalMultiStopMarks = 0 93 | totalAllCaps = 0 94 | 95 | for line in f.split('\n'): 96 | totalSentences += 1 97 | feat = [] 98 | columns = line.split('\t') 99 | columns = [col.strip() for col in columns] 100 | 101 | textID = (columns[0]) 102 | y = (columns[2]) 103 | 104 | text = removeUnicode(columns[1]) # Technique 0 105 | #print(text) # print initial text 106 | wordCountBefore = len(re.findall(r'\w+', text)) # word count of one sentence before preprocess 107 | #print("Words before preprocess: ",wordCountBefore,"\n") 108 | 109 | text = replaceURL(text) # Technique 1 110 | text = replaceAtUser(text) # Technique 1 111 | text = removeHashtagInFrontOfWord(text) # Technique 1 112 | 113 | temp_slangs, temp_slangsFound = countSlang(text) 114 | totalSlangs += temp_slangs # total slangs for all sentences 115 | for word in temp_slangsFound: 116 | totalSlangsFound.append(word) # all the slangs found in all sentences 117 | 118 | text = replaceSlang(text) # Technique 2: replaces slang words and abbreviations with their equivalents 119 | text = replaceContraction(text) # Technique 3: replaces contractions to their equivalents 120 | text = removeNumbers(text) # Technique 4: remove integers from text 121 | 122 | emoticons = countEmoticons(text) # how many emoticons in this sentence 123 | totalEmoticons += emoticons 124 | 125 | text = removeEmoticons(text) # removes emoticons from text 126 | 127 | 128 | totalAllCaps += countAllCaps(text) 129 | 130 | totalMultiExclamationMarks += countMultiExclamationMarks(text) # how many repetitions of exlamation marks in this sentence 131 | totalMultiQuestionMarks += countMultiQuestionMarks(text) # how many repetitions of question marks in this sentence 132 | totalMultiStopMarks += countMultiStopMarks(text) # how many repetitions of stop marks in this sentence 133 | 134 | text = replaceMultiExclamationMark(text) # Technique 5: replaces repetitions of exlamation marks with the tag "multiExclamation" 135 | text = replaceMultiQuestionMark(text) # Technique 5: replaces repetitions of question marks with the tag "multiQuestion" 136 | text = replaceMultiStopMark(text) # Technique 5: replaces repetitions of stop marks with the tag "multiStop" 137 | 138 | totalElongated += countElongated(text) # how many elongated words emoticons in this sentence 139 | 140 | tokens = tokenize(text, wordCountBefore, textID, y) 141 | 142 | 143 | print("Total sentences: ",totalSentences,"\n") 144 | print("Total Words before preprocess: ",len(re.findall(r'\w+', f))) 145 | print("Total Distinct Tokens before preprocess: ",len(set(re.findall(r'\w+', f)))) 146 | print("Average word/sentence before preprocess: ",len(re.findall(r'\w+', f))/totalSentences,"\n") 147 | print("Total Words after preprocess: ",len(tokens)) 148 | print("Total Distinct Tokens after preprocess: ",len(set(tokens))) 149 | print("Average word/sentence after preprocess: ",len(tokens)/totalSentences,"\n") 150 | 151 | 152 | print("Total run time: ",time() - t0," seconds\n") 153 | 154 | print("Total emoticons: ",totalEmoticons,"\n") 155 | print("Total slangs: ",totalSlangs,"\n") 156 | commonSlangs = nltk.FreqDist(totalSlangsFound) 157 | for (word, count) in commonSlangs.most_common(20): # most common slangs across all texts 158 | print(word,"\t",count) 159 | 160 | commonSlangs.plot(20, cumulative=False) # plot most common slangs 161 | 162 | print("Total elongated words: ",totalElongated,"\n") 163 | print("Total multi exclamation marks: ",totalMultiExclamationMarks) 164 | print("Total multi question marks: ",totalMultiQuestionMarks) 165 | print("Total multi stop marks: ",totalMultiStopMarks,"\n") 166 | print("Total all capitalized words: ",totalAllCaps,"\n") 167 | 168 | #print(tokens) 169 | commonWords = nltk.FreqDist(tokens) 170 | print("Most common words ") 171 | print("Word\tCount") 172 | for (word, count) in commonWords.most_common(100): # most common words across all texts 173 | print(word,"\t",count) 174 | 175 | commonWords.plot(100, cumulative=False) # plot most common words 176 | 177 | 178 | bgm = nltk.collocations.BigramAssocMeasures() 179 | tgm = nltk.collocations.TrigramAssocMeasures() 180 | bgm_finder = nltk.collocations.BigramCollocationFinder.from_words(tokens) 181 | tgm_finder = nltk.collocations.TrigramCollocationFinder.from_words(tokens) 182 | bgm_finder.apply_freq_filter(5) # bigrams that occur at least 5 times 183 | print("Most common collocations (bigrams)") 184 | print(bgm_finder.nbest(bgm.pmi, 50)) # top 50 bigram collocations 185 | tgm_finder.apply_freq_filter(5) # trigrams that occur at least 5 times 186 | print("Most common collocations (trigrams)") 187 | print(tgm_finder.nbest(tgm.pmi, 20)) # top 20 trigrams collocations 188 | --------------------------------------------------------------------------------