├── README.md
├── slang.txt
├── techniques.py
└── preprocess.py


/README.md:
--------------------------------------------------------------------------------
 1 | # text-preprocessing-techniques
 2 | ## 16 Text Preprocessing Techniques in Python for Twitter Sentiment Analysis.
 3 | 
 4 | These techniques were used in comparison in our paper **"A Comparison of Pre-processing Techniques for Twitter Sentiment Analysis"**. If you use this material please [cite](https://link.springer.com/chapter/10.1007/978-3-319-67008-9_31) the paper. An extended paper for this work can be found [here](https://www.sciencedirect.com/science/article/pii/S0957417418303683), with the title **"A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis"**. Please cite.
 5 |  
 6 | Most of these techniques are generic and can be used in various applications except Sentiment Analysis. 
 7 | They are the following:
 8 | 
 9 | #### 0. Remove Unicode Strings and Noise
10 | #### 1. Replace URLs, User Mentions and Hashtags
11 | #### 2. Replcae Slang and Abbreviations
12 | #### 3. Replace Contractions
13 | #### 4. Remove Numbers
14 | #### 5. Replace Repetitions of Punctuation
15 | #### 6. Replace Negations with Antonyms
16 | #### 7. Remove Punctuation
17 | #### 8. Handling Capitalized Words
18 | #### 9. Lowercase
19 | #### 10. Remove Stopwords
20 | #### 11. Replace Elongated Words
21 | #### 12. Spelling Correction
22 | #### 13. Part of Speech Tagging
23 | #### 14. Lemmatizing
24 | #### 15. Stemming
25 | 
26 | This scripts also prints some statistics for the text file like: 
27 | 
28 | - Total Sentences
29 | - Total Words before and after preprocess
30 | - Total Unique words before and after preprocess
31 | - Average Words per Sentence before and after preprocess
32 | - Total Run time
33 | - Total Emoticons found
34 | - Total Slangs and Abbreviations found
35 | - 20 Most Commong Sland and Abbreviations and plots them
36 | - Total Elongated words
37 | - Total multi Exclamation
38 | - question and stop marks
39 | - Total All Capitalized words
40 | - 100 Most Common words and plots them and most common bigram and trigram collocations
41 | 
42 | The text file that we included here is a sample (2000 tweets) of the SS-Twitter dataset.
43 | 
44 | The file "preprocess.py" includes many comments and in order to use a technique you have to uncomment the appropriate line/lines. The initial script uses all techniques. So if you want to use only specific techniques, comment out the others.
45 | 


--------------------------------------------------------------------------------
/slang.txt:
--------------------------------------------------------------------------------
  1 | 2day	today
  2 | 2nite	tonight
  3 | 4u	for you
  4 | 4ward	forward
  5 | a3	anyplace, anywhere, anytime
  6 | a/n	author note
  7 | a/w	anyway
  8 | a/s/l	age, sex, location
  9 | adn	any day now
 10 | afaic	as far as i'm concerned
 11 | afaik	as far as I know
 12 | afk	away from keyboard
 13 | aggro	aggresive
 14 | aight	alright
 15 | airhead	stupid
 16 | aka	as known as
 17 | alol	actually laughing out loud
 18 | amigo	friend
 19 | amz	amazing
 20 | app	application
 21 | armpit	undesirable
 22 | asap	as soon as possible
 23 | atm	at the moment
 24 | atw	all the way
 25 | b/c	because
 26 | b-day	birthday
 27 | b4	before
 28 | b4n	bye for now
 29 | bae	before anyone else
 30 | bak	back at the keyboard
 31 | bbl	bee back later
 32 | bday	birthday
 33 | becuz	because
 34 | bent	angry
 35 | bestie	best friend
 36 | besty	best friend
 37 | bf	boyfriend
 38 | bff	best friends forever
 39 | bffe	best friends forever
 40 | bfn	bye for now
 41 | bg	big grin
 42 | bmfe	best mates forever
 43 | bmfl	best mates life
 44 | bozo	idiot
 45 | brah	friend
 46 | bravo	well done
 47 | brb	be right back
 48 | bro	brother
 49 | bta	but then again
 50 | btdt	been there, done that
 51 | btr	better
 52 | btw	by the way
 53 | buddy	friend
 54 | c'mon	came on
 55 | cid crying in disgrace
 56 | congrats congratulations
 57 | copacetic excellent
 58 | coz beacause
 59 | cu	see you
 60 | cuddy	friends
 61 | cul	see you later
 62 | cul8r	see you later 
 63 | cutie	cute
 64 | cuz	because
 65 | cya	bye
 66 | cyo	see you online 
 67 | dbau	doing business as usual
 68 | deets	details
 69 | dmn	damn
 70 | dobe	idiot
 71 | dope	stupid
 72 | dork	strange
 73 | dunno	don't know
 74 | dwi	deal with it
 75 | dyd	don't you dare
 76 | ermahgerd	oh my gosh
 77 | eu	europe
 78 | ez	easy
 79 | f9	fine
 80 | fav	favorite
 81 | far-out	great
 82 | fb	facebook
 83 | flick	movie
 84 | fml	fuck my life
 85 | foxy	sexy
 86 | friggin	freaking
 87 | fttn	for the time being
 88 | ftw	for the win
 89 | fud	fear, uncertainty, and doubt
 90 | fwiw	for what it's worth 
 91 | fyi	for your information
 92 | g	grin 
 93 | g2g	got to go 
 94 | ga	go ahead 
 95 | gal	get a life
 96 | getcha	understand 
 97 | gf	girlfriend 
 98 | gfn	gone for now
 99 | gg	good game
100 | gj	good job
101 | gky	go kill yourself
102 | gl	good luck
103 | glhf	good luck have fun
104 | gmab	give me a break
105 | gmbo	giggling my butt off 
106 | gmta	great minds think alike 
107 | goof	idiot
108 | goofy	idiot
109 | gr8	great
110 | gtg	got to go
111 | gud	good
112 | h8	hate
113 | hagn	have a good night 
114 | hdop	help delete online predators 
115 | hf	have fun
116 | hml	hate my life
117 | hoas	hold on a second
118 | hhis	hanging head in shame 
119 | hmu	hit me up
120 | hru	how are you
121 | twt	hope this helps
122 | hw	homework
123 | i'ma	i am going to
124 | iac	in any case 
125 | ic	I see 
126 | icymi	in case you missed it
127 | idk	I don't know
128 | iggy	ignore 
129 | iht	i hate this
130 | ikr	i know, right?
131 | ilt	i like that
132 | ily	i love you
133 | ima	i am going to 
134 | imao	in my arrogant opinion
135 | imnsho	in my not so humble opinion 
136 | imo	in my opinion 
137 | imy	i miss you
138 | iou	i owe you
139 | iow	in other words 
140 | ipn	I’m posting naked 
141 | irl	in real life 
142 | j/k	just kidding
143 | jdi	just do it
144 | jk	just kidding
145 | jkn	joking
146 | jyeah	yeah
147 | kinda	kind of
148 | l8	late
149 | l8r	later
150 | lbh	let's be honest
151 | ld	later, dude
152 | ldi	let's do it  
153 | ldr	long distance relationship 
154 | lees	beautiful 
155 | lfm	looking for more
156 | lil	little
157 | llta	lots and lots of thunderous applause 
158 | lmao	laugh my ass off
159 | lmirl	let's meet in real life 
160 | lmk	let me know
161 | lol	laugh out loud
162 | lolz	laugh out loud
163 | lotta	lot of
164 | lsr	loser
165 | ltr	longterm relationship
166 | lua	love you always 
167 | lub	love
168 | lubb	love
169 | lulab	love you like a brother 
170 | lulas	love you like a sister 
171 | lul	laugh
172 | luls	laugh
173 | lulz	laugh
174 | lumu	love you miss you
175 | luv	love
176 | lux	luxury
177 | lwm	laugh with me
178 | lwp	laugh with passion
179 | lvl	level
180 | m/f	male or female
181 | m2	me too
182 | m8	mate
183 | me2	me too
184 | milf	mother I would like to fuck
185 | mma	meet me at
186 | mmb	message me back
187 | mvp	most valueable player
188 | msg	message
189 | mtf	more to follow
190 | myob	mind your own business
191 | nah	no
192 | nc	no comment
193 | nk	not kidding
194 | ngl	not gonna lie
195 | nlt	no later than
196 | nm	not much
197 | no1	no one
198 | np	no problem
199 | nsfw	not safe for work
200 | nuh	no
201 | nvm	nevermind
202 | obo	or best offer
203 | oic	oh, i see
204 | oll	online love 
205 | omg	oh my god
206 | omw	on my way
207 | osm	awesome
208 | otoh	on the other hand 
209 | perv	pervert
210 | pervy	pervert
211 | phat	pretty hot and tempting
212 | pir	parent in room
213 | pls	please
214 | plz	please
215 | ppl	people
216 | pro	professional
217 | pwnd	owned
218 | qq	crying
219 | r	are
220 | rly	really
221 | rofl	roll on the floor laughing
222 | rolf	roll on the floor laughing
223 | rpg	role playing games
224 | ru	are you
225 | s2u	shame to you
226 | scrub	loser
227 | sec	second
228 | shid	slaps head in disgust
229 | shoulda	should have
230 | sff	so funny
231 | smexy	smart and sexy
232 | smh	shaking my head
233 | somy	sick of me yet
234 | sot	short of time 
235 | sry	sorry
236 | str8	straight
237 | sux	sucks
238 | swag	style
239 | taze	irritate
240 | tba	to be announced
241 | tbfu	too bad for you
242 | tbc	to be continued
243 | tbd	to be determined
244 | tbr	to be rude
245 | tc	take care
246 | thx	thanks
247 | thanx	thanks
248 | thx	thanks
249 | tfw	that feeling  when
250 | til	today i learned
251 | ttyl	talk to you later 
252 | ty	thank you
253 | tyvm	thank you very much
254 | u	you
255 | uber	the best
256 | ugh	disgusted
257 | ur	you are
258 | uw	you are welcome 
259 | vs	versus
260 | w2f	way too funny
261 | w8	wait
262 | wak	weird
263 | wanna	want to
264 | wb	welcome back
265 | whiz	talented
266 | whoa	surprise
267 | whoah	surprise
268 | wfm	works for me 
269 | wibni	wouldn't it be nice if 
270 | wmd	weapon of mass destruction
271 | wot	what
272 | wtf	what the fuck
273 | wtg	way to go
274 | wtgp	want to go private
275 | wu	what's up
276 | wuh	what?
277 | wuv	love
278 | ym	young man
279 | yawn	boring
280 | yum	good
281 | x	kiss
282 | xxx	kiss
283 | xdd	laughing
284 | y	why
285 | yolo	you only live once
286 | yuge	huge
287 | yw	you are welcome
288 | ywa	you are welcome anyway
289 | zomg	oh my god!
290 | zzz	sleeping


--------------------------------------------------------------------------------
/techniques.py:
--------------------------------------------------------------------------------
  1 | """ Copyright 2017, Dimitrios Effrosynidis, All rights reserved. """
  2 | 
  3 | import re
  4 | from functools import partial
  5 | from collections import Counter
  6 | import nltk
  7 | from nltk.corpus import wordnet
  8 | from nltk.corpus import stopwords
  9 | from nltk.stem import WordNetLemmatizer
 10 | from nltk.stem.porter import PorterStemmer
 11 | 
 12 | def removeUnicode(text):
 13 |     """ Removes unicode strings like "\u002c" and "x96" """
 14 |     text = re.sub(r'(\\u[0-9A-Fa-f]+)',r'', text)       
 15 |     text = re.sub(r'[^\x00-\x7f]',r'',text)
 16 |     return text
 17 | 
 18 | def replaceURL(text):
 19 |     """ Replaces url address with "url" """
 20 |     text = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','url',text)
 21 |     text = re.sub(r'#([^\s]+)', r'\1', text)
 22 |     return text
 23 | 
 24 | def replaceAtUser(text):
 25 |     """ Replaces "@user" with "atUser" """
 26 |     text = re.sub('@[^\s]+','atUser',text)
 27 |     return text
 28 | 
 29 | def removeHashtagInFrontOfWord(text):
 30 |     """ Removes hastag in front of a word """
 31 |     text = re.sub(r'#([^\s]+)', r'\1', text)
 32 |     return text
 33 | 
 34 | def removeNumbers(text):
 35 |     """ Removes integers """
 36 |     text = ''.join([i for i in text if not i.isdigit()])         
 37 |     return text
 38 | 
 39 | def replaceMultiExclamationMark(text):
 40 |     """ Replaces repetitions of exlamation marks """
 41 |     text = re.sub(r"(\!)\1+", ' multiExclamation ', text)
 42 |     return text
 43 | 
 44 | def replaceMultiQuestionMark(text):
 45 |     """ Replaces repetitions of question marks """
 46 |     text = re.sub(r"(\?)\1+", ' multiQuestion ', text)
 47 |     return text
 48 | 
 49 | def replaceMultiStopMark(text):
 50 |     """ Replaces repetitions of stop marks """
 51 |     text = re.sub(r"(\.)\1+", ' multiStop ', text)
 52 |     return text
 53 | 
 54 | def countMultiExclamationMarks(text):
 55 |     """ Replaces repetitions of exlamation marks """
 56 |     return len(re.findall(r"(\!)\1+", text))
 57 | 
 58 | def countMultiQuestionMarks(text):
 59 |     """ Count repetitions of question marks """
 60 |     return len(re.findall(r"(\?)\1+", text))
 61 | 
 62 | def countMultiStopMarks(text):
 63 |     """ Count repetitions of stop marks """
 64 |     return len(re.findall(r"(\.)\1+", text))
 65 | 
 66 | def countElongated(text):
 67 |     """ Input: a text, Output: how many words are elongated """
 68 |     regex = re.compile(r"(.)\1{2}")
 69 |     return len([word for word in text.split() if regex.search(word)])
 70 | 
 71 | def countAllCaps(text):
 72 |     """ Input: a text, Output: how many words are all caps """
 73 |     return len(re.findall("[A-Z0-9]{3,}", text))
 74 | 
 75 | """ Creates a dictionary with slangs and their equivalents and replaces them """
 76 | with open('slang.txt') as file:
 77 |     slang_map = dict(map(str.strip, line.partition('\t')[::2])
 78 |     for line in file if line.strip())
 79 | 
 80 | slang_words = sorted(slang_map, key=len, reverse=True) # longest first for regex
 81 | regex = re.compile(r"\b({})\b".format("|".join(map(re.escape, slang_words))))
 82 | replaceSlang = partial(regex.sub, lambda m: slang_map[m.group(1)])
 83 | 
 84 | def countSlang(text):
 85 |     """ Input: a text, Output: how many slang words and a list of found slangs """
 86 |     slangCounter = 0
 87 |     slangsFound = []
 88 |     tokens = nltk.word_tokenize(text)
 89 |     for word in tokens:
 90 |         if word in slang_words:
 91 |             slangsFound.append(word)
 92 |             slangCounter += 1
 93 |     return slangCounter, slangsFound
 94 | 
 95 | """ Replaces contractions from a string to their equivalents """
 96 | contraction_patterns = [ (r'won\'t', 'will not'), (r'can\'t', 'cannot'), (r'i\'m', 'i am'), (r'ain\'t', 'is not'), (r'(\w+)\'ll', '\g<1> will'), (r'(\w+)n\'t', '\g<1> not'),
 97 |                          (r'(\w+)\'ve', '\g<1> have'), (r'(\w+)\'s', '\g<1> is'), (r'(\w+)\'re', '\g<1> are'), (r'(\w+)\'d', '\g<1> would'), (r'&', 'and'), (r'dammit', 'damn it'), (r'dont', 'do not'), (r'wont', 'will not') ]
 98 | def replaceContraction(text):
 99 |     patterns = [(re.compile(regex), repl) for (regex, repl) in contraction_patterns]
100 |     for (pattern, repl) in patterns:
101 |         (text, count) = re.subn(pattern, repl, text)
102 |     return text
103 | 
104 | def replaceElongated(word):
105 |     """ Replaces an elongated word with its basic form, unless the word exists in the lexicon """
106 | 
107 |     repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
108 |     repl = r'\1\2\3'
109 |     if wordnet.synsets(word):
110 |         return word
111 |     repl_word = repeat_regexp.sub(repl, word)
112 |     if repl_word != word:      
113 |         return replaceElongated(repl_word)
114 |     else:       
115 |         return repl_word
116 | 
117 | def removeEmoticons(text):
118 |     """ Removes emoticons from text """
119 |     text = re.sub(':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', '', text)
120 |     return text
121 | 
122 | def countEmoticons(text):
123 |     """ Input: a text, Output: how many emoticons """
124 |     return len(re.findall(':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', text))
125 | 
126 | 
127 | ### Spell Correction begin ###
128 | """ Spell Correction http://norvig.com/spell-correct.html """
129 | def words(text): return re.findall(r'\w+', text.lower())
130 | 
131 | WORDS = Counter(words(open('corporaForSpellCorrection.txt').read()))
132 | 
133 | def P(word, N=sum(WORDS.values())): 
134 |     """P robability of `word`. """
135 |     return WORDS[word] / N
136 | 
137 | def spellCorrection(word): 
138 |     """ Most probable spelling correction for word. """
139 |     return max(candidates(word), key=P)
140 | 
141 | def candidates(word): 
142 |     """ Generate possible spelling corrections for word. """
143 |     return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
144 | 
145 | def known(words): 
146 |     """ The subset of `words` that appear in the dictionary of WORDS. """
147 |     return set(w for w in words if w in WORDS)
148 | 
149 | def edits1(word):
150 |     """ All edits that are one edit away from `word`. """
151 |     letters    = 'abcdefghijklmnopqrstuvwxyz'
152 |     splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
153 |     deletes    = [L + R[1:]               for L, R in splits if R]
154 |     transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
155 |     replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
156 |     inserts    = [L + c + R               for L, R in splits for c in letters]
157 |     return set(deletes + transposes + replaces + inserts)
158 | 
159 | def edits2(word): 
160 |     """ All edits that are two edits away from `word`. """
161 |     return (e2 for e1 in edits1(word) for e2 in edits1(e1))
162 | 
163 | ### Spell Correction End ###
164 | 
165 | ### Replace Negations Begin ###
166 | 
167 | def replace(word, pos=None):
168 |     """ Creates a set of all antonyms for the word and if there is only one antonym, it returns it """
169 |     antonyms = set()
170 |     for syn in wordnet.synsets(word, pos=pos):
171 |       for lemma in syn.lemmas():
172 |         for antonym in lemma.antonyms():
173 |           antonyms.add(antonym.name())
174 |     if len(antonyms) == 1:
175 |       return antonyms.pop()
176 |     else:
177 |       return None
178 | 
179 | def replaceNegations(text):
180 |     """ Finds "not" and antonym for the next word and if found, replaces not and the next word with the antonym """
181 |     i, l = 0, len(text)
182 |     words = []
183 |     while i < l:
184 |       word = text[i]
185 |       if word == 'not' and i+1 < l:
186 |         ant = replace(text[i+1])
187 |         if ant:
188 |           words.append(ant)
189 |           i += 2
190 |           continue
191 |       words.append(word)
192 |       i += 1
193 |     return words
194 | 
195 | ### Replace Negations End ###
196 | 
197 | def addNotTag(text):
198 | 	""" Finds "not,never,no" and adds the tag NEG_ to all words that follow until the next punctuation """
199 | 	transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]', 
200 |        lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)), 
201 |        text,
202 |        flags=re.IGNORECASE)
203 | 	return transformed
204 | 
205 | def addCapTag(word):
206 |     """ Finds a word with at least 3 characters capitalized and adds the tag ALL_CAPS_ """
207 |     if(len(re.findall("[A-Z]{3,}", word))):
208 |         word = word.replace('\\', '' )
209 |         transformed = re.sub("[A-Z]{3,}", "ALL_CAPS_"+word, word)
210 |         return transformed
211 |     else:
212 |         return word
213 | 


--------------------------------------------------------------------------------
/preprocess.py:
--------------------------------------------------------------------------------
  1 | """ Copyright 2017, Dimitrios Effrosynidis, All rights reserved. """
  2 | 
  3 | from time import time
  4 | import numpy as np
  5 | import string
  6 | 
  7 | from techniques import *
  8 | 
  9 | print("Starting preprocess..\n")
 10 | 
 11 | """ Tokenizes a text to its words, removes and replaces some of them """    
 12 | finalTokens = [] # all tokens
 13 | stoplist = stopwords.words('english')
 14 | my_stopwords = "multiexclamation multiquestion multistop url atuser st rd nd th am pm" # my extra stopwords
 15 | stoplist = stoplist + my_stopwords.split()
 16 | allowedWordTypes = ["J","R","V","N"] #  J is Adject, R is Adverb, V is Verb, N is Noun. These are used for POS Tagging
 17 | lemmatizer = WordNetLemmatizer() # set lemmatizer
 18 | stemmer = PorterStemmer() # set stemmer
 19 | 
 20 | def tokenize(text, wordCountBefore, textID, y):
 21 |     totalAdjectives = 0
 22 |     totalAdverbs = 0
 23 |     totalVerbs = 0
 24 |     onlyOneSentenceTokens = [] # tokens of one sentence each time
 25 | 
 26 |     tokens = nltk.word_tokenize(text)
 27 |     
 28 |     tokens = replaceNegations(tokens) # Technique 6: finds "not" and antonym for the next word and if found, replaces not and the next word with the antonym
 29 |     
 30 |     translator = str.maketrans('', '', string.punctuation)
 31 |     text = text.translate(translator) # Technique 7: remove punctuation
 32 | 
 33 |     tokens = nltk.word_tokenize(text) # it takes a text as an input and provides a list of every token in it
 34 |     
 35 | ### NO POS TAGGING BEGIN (If you don't want to use POS Tagging keep this section uncommented) ###
 36 |     
 37 | ##    for w in tokens:
 38 | ##
 39 | ##        if (w not in stoplist): # Technique 10: remove stopwords
 40 | ##            final_word = addCapTag(w) # Technique 8: Finds a word with at least 3 characters capitalized and adds the tag ALL_CAPS_
 41 | ##            final_word = final_word.lower() # Technique 9: lowercases all characters
 42 | ##            final_word = replaceElongated(final_word) # Technique 11: replaces an elongated word with its basic form, unless the word exists in the lexicon
 43 | ##            if len(final_word)>1:
 44 | ##                final_word = spellCorrection(final_word) # Technique 12: correction of spelling errors
 45 | ##            final_word = lemmatizer.lemmatize(final_word) # Technique 14: lemmatizes words
 46 | ##            final_word = stemmer.stem(final_word) # Technique 15: apply stemming to words
 47 | 
 48 | ### NO POS TAGGING END ###
 49 | 
 50 | 
 51 | ### POS TAGGING BEGIN (If you want to exclude words using POS Tagging, keep this section uncommented and comment the above) ###          
 52 |             
 53 |     tagged = nltk.pos_tag(tokens) # Technique 13: part of speech tagging  
 54 |     for w in tagged:
 55 | 
 56 |         if (w[1][0] in allowedWordTypes and w[0] not in stoplist):
 57 |             final_word = addCapTag(w[0])
 58 |             #final_word = final_word.lower()
 59 |             final_word = replaceElongated(final_word)
 60 |             if len(final_word)>1:
 61 |                 final_word = spellCorrection(final_word)
 62 |             final_word = lemmatizer.lemmatize(final_word)
 63 |             final_word = stemmer.stem(final_word)
 64 | 
 65 | ### POS TAGGING END ###
 66 |                 
 67 |             onlyOneSentenceTokens.append(final_word)           
 68 |             finalTokens.append(final_word)
 69 | 
 70 |          
 71 |     onlyOneSentence = " ".join(onlyOneSentenceTokens) # form again the sentence from the list of tokens
 72 |     #print(onlyOneSentence) # print final sentence
 73 | 
 74 |     
 75 |     """ Write the preprocessed text to file """
 76 |     with open("result.txt", "a") as result:
 77 |         result.write(textID+"\t"+y+"\t"+onlyOneSentence+"\n")
 78 |         
 79 |     return finalTokens
 80 | 
 81 | 
 82 | f = open("ss-twitterfinal.txt","r", encoding="utf8", errors='replace').read()
 83 | 
 84 | t0 = time()
 85 | totalSentences = 0
 86 | totalEmoticons = 0
 87 | totalSlangs = 0
 88 | totalSlangsFound = []
 89 | totalElongated = 0
 90 | totalMultiExclamationMarks = 0
 91 | totalMultiQuestionMarks = 0
 92 | totalMultiStopMarks = 0
 93 | totalAllCaps = 0
 94 | 
 95 | for line in f.split('\n'):
 96 |     totalSentences += 1
 97 |     feat = []
 98 |     columns = line.split('\t')
 99 |     columns = [col.strip() for col in columns]
100 | 
101 |     textID = (columns[0])
102 |     y = (columns[2])
103 | 
104 |     text = removeUnicode(columns[1]) # Technique 0
105 |     #print(text) # print initial text
106 |     wordCountBefore = len(re.findall(r'\w+', text)) # word count of one sentence before preprocess    
107 |     #print("Words before preprocess: ",wordCountBefore,"\n")
108 |     
109 |     text = replaceURL(text) # Technique 1
110 |     text = replaceAtUser(text) # Technique 1
111 |     text = removeHashtagInFrontOfWord(text) # Technique 1
112 | 
113 |     temp_slangs, temp_slangsFound = countSlang(text)
114 |     totalSlangs += temp_slangs # total slangs for all sentences
115 |     for word in temp_slangsFound:
116 |         totalSlangsFound.append(word) # all the slangs found in all sentences
117 |     
118 |     text = replaceSlang(text) # Technique 2: replaces slang words and abbreviations with their equivalents
119 |     text = replaceContraction(text) # Technique 3: replaces contractions to their equivalents
120 |     text = removeNumbers(text) # Technique 4: remove integers from text
121 | 
122 |     emoticons = countEmoticons(text) # how many emoticons in this sentence
123 |     totalEmoticons += emoticons
124 |     
125 |     text = removeEmoticons(text) # removes emoticons from text
126 | 
127 |     
128 |     totalAllCaps += countAllCaps(text)
129 | 
130 |     totalMultiExclamationMarks += countMultiExclamationMarks(text) # how many repetitions of exlamation marks in this sentence
131 |     totalMultiQuestionMarks += countMultiQuestionMarks(text) # how many repetitions of question marks in this sentence
132 |     totalMultiStopMarks += countMultiStopMarks(text) # how many repetitions of stop marks in this sentence
133 |     
134 |     text = replaceMultiExclamationMark(text) # Technique 5: replaces repetitions of exlamation marks with the tag "multiExclamation"
135 |     text = replaceMultiQuestionMark(text) # Technique 5: replaces repetitions of question marks with the tag "multiQuestion"
136 |     text = replaceMultiStopMark(text) # Technique 5: replaces repetitions of stop marks with the tag "multiStop"
137 | 
138 |     totalElongated += countElongated(text) # how many elongated words emoticons in this sentence
139 |     
140 |     tokens = tokenize(text, wordCountBefore, textID, y)  
141 |     
142 |     
143 | print("Total sentences: ",totalSentences,"\n")
144 | print("Total Words before preprocess: ",len(re.findall(r'\w+', f)))
145 | print("Total Distinct Tokens before preprocess: ",len(set(re.findall(r'\w+', f))))
146 | print("Average word/sentence before preprocess: ",len(re.findall(r'\w+', f))/totalSentences,"\n")
147 | print("Total Words after preprocess: ",len(tokens))
148 | print("Total Distinct Tokens after preprocess: ",len(set(tokens)))
149 | print("Average word/sentence after preprocess: ",len(tokens)/totalSentences,"\n")
150 | 
151 | 
152 | print("Total run time: ",time() - t0," seconds\n")
153 | 
154 | print("Total emoticons: ",totalEmoticons,"\n")
155 | print("Total slangs: ",totalSlangs,"\n")
156 | commonSlangs = nltk.FreqDist(totalSlangsFound)
157 | for (word, count) in commonSlangs.most_common(20): # most common slangs across all texts
158 |     print(word,"\t",count)
159 | 
160 | commonSlangs.plot(20, cumulative=False) # plot most common slangs
161 | 
162 | print("Total elongated words: ",totalElongated,"\n")
163 | print("Total multi exclamation marks: ",totalMultiExclamationMarks)
164 | print("Total multi question marks: ",totalMultiQuestionMarks)
165 | print("Total multi stop marks: ",totalMultiStopMarks,"\n")
166 | print("Total all capitalized words: ",totalAllCaps,"\n")
167 | 
168 | #print(tokens)
169 | commonWords = nltk.FreqDist(tokens)
170 | print("Most common words ")
171 | print("Word\tCount")
172 | for (word, count) in commonWords.most_common(100): # most common words across all texts
173 |     print(word,"\t",count)
174 | 
175 | commonWords.plot(100, cumulative=False) # plot most common words
176 | 
177 | 
178 | bgm = nltk.collocations.BigramAssocMeasures()
179 | tgm = nltk.collocations.TrigramAssocMeasures()
180 | bgm_finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
181 | tgm_finder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)
182 | bgm_finder.apply_freq_filter(5) # bigrams that occur at least 5 times
183 | print("Most common collocations (bigrams)")
184 | print(bgm_finder.nbest(bgm.pmi, 50)) # top 50 bigram collocations
185 | tgm_finder.apply_freq_filter(5) # trigrams that occur at least 5 times
186 | print("Most common collocations (trigrams)")
187 | print(tgm_finder.nbest(tgm.pmi, 20)) # top 20 trigrams collocations
188 | 


--------------------------------------------------------------------------------