├── .gitignore
├── CHANGES
├── LICENSE
├── README.md
├── documentation
│   ├── api-objects.txt
│   ├── class-tree.html
│   ├── crarr.png
│   ├── epydoc.css
│   ├── epydoc.js
│   ├── frames.html
│   ├── help.html
│   ├── identifier-index.html
│   ├── index.html
│   ├── module-tree.html
│   ├── redirect.html
│   ├── toc-everything.html
│   ├── toc-tweetokenize-module.html
│   ├── toc.html
│   ├── tweetokenize-module.html
│   ├── tweetokenize-pysrc.html
│   ├── tweetokenize.Tokenizer-class.html
│   └── tweetokenize.Tokenizer.TokenizerException-class.html
├── setup.py
├── tests
│   ├── __main__.py
│   └── test_tweetokenize.py
└── tweetokenize
    ├── __init__.py
    ├── lexicons
    │   ├── domains.txt
    │   ├── emoticons.txt
    │   └── stopwords.txt
    └── tokenizer.py
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | *.pyc
3 | .gitignore
4 | build
5 | bench
6 |
--------------------------------------------------------------------------------
/CHANGES:
--------------------------------------------------------------------------------
1 | Changes
2 | =======
3 |
4 | 1.0.1 (2013-08-15)
5 | ------------------
6 |
7 | - Module docstring
8 | - Changes to `setup.py`
9 | - Refactored: gained ~15% speedup in tokenization
10 |
11 |
12 | 1.0.0 (2013-05-11 - 2013-06-25)
13 | -------------------------------
14 |
15 | - First version
16 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2013, Jared Suttles.
2 | All rights reserved.
3 |
4 | Redistribution and use in source and binary forms, with or without modification,
5 | are permitted provided that the following conditions are met:
6 |
7 | 1. Redistributions of source code must retain the above copyright notice,
8 | this list of conditions and the following disclaimer.
9 |
10 | 2. Redistributions in binary form must reproduce the above copyright
11 | notice, this list of conditions and the following disclaimer in the
12 | documentation and/or other materials provided with the distribution.
13 |
14 | 3. Neither the name of tweetokenize nor the names of its contributors may be
15 | used to endorse or promote products derived from this software without
16 | specific prior written permission.
17 |
18 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
19 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
20 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
21 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
22 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
23 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
24 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
25 | ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
26 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
27 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
28 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | tweetokenize
2 | ============
3 |
4 | Regular-expression-based tokenizer for Twitter. Focused on tokenization
5 | and pre-processing to train classifiers for sentiment, emotion, or mood.
6 |
7 | Intended as glue between Python wrappers for the Twitter API and the
8 | machine learning algorithms of the Natural Language Toolkit (NLTK), but
9 | probably applicable to tokenizing any short messages of the social
10 | networking variety.
11 |
12 | ```python
13 | from tweetokenize import Tokenizer
14 | gettokens = Tokenizer()
15 | gettokens.tokenize('hey playa!:):3.....@SHAQ can you still dunk?#old🍕🍔😵LOL')
16 | [u'hey', u'playa', u'!', u':)', u':3', u'...', u'USERNAME', u'can', u'you', u'still', u'dunk', u'?', u'#old', u'🍕', u'🍔', u'😵', u'LOL']
17 | ```
18 |
19 | Features
20 | --------
21 |
22 | * Can easily replace tweet features like usernames, URLs, phone numbers,
23 |   times, etc. with placeholder tokens, reducing feature-set complexity
24 |   and improving classifier performance (see the example below)
25 | * Allows user-defined sets of emoticons to be used in tokenization
26 | * Correctly separates emoji, written consecutively, into individual tokens
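
For example, replacement tokens can be customized through the `Tokenizer`
constructor (a minimal sketch; the token values here are choices, not
defaults, and the output comment is illustrative):

```python
from tweetokenize import Tokenizer

# Setting a token to "" (empty string), "DELETE", or "REMOVE" strips
# that feature from the message entirely.
gettokens = Tokenizer(usernames='USER', urls='URL', phonenumbers='PHONE')
gettokens.tokenize('@SHAQ hit me at 555-123-4567 or http://example.com')
# [u'USER', u'hit', u'me', u'at', u'PHONE', u'or', u'URL']
# User-defined emoticon sets are supported as well; see Tokenizer.emoticons
# in the documentation.
```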
27 |
28 | Installation
29 | ------------
30 |
31 | python setup.py install
32 |
33 | After installation, you can make sure everything is working by running the following inside the project root folder:
34 |
35 | python tests
36 |
37 | Documentation
38 | -------------
39 |
40 | http://htmlpreview.github.io/?https://raw.github.com/jaredks/tweetokenize/master/documentation/tweetokenize.Tokenizer-class.html
41 |
42 | License
43 | -------
44 |
45 | "Modified BSD License". See LICENSE for details. Copyright Jared Suttles, 2013.
46 |
--------------------------------------------------------------------------------
/documentation/api-objects.txt:
--------------------------------------------------------------------------------
1 | tweetokenize tweetokenize-module.html
2 | tweetokenize.__package__ tweetokenize-module.html#__package__
3 | tweetokenize.Tokenizer tweetokenize.Tokenizer-class.html
4 | tweetokenize.Tokenizer.repeating_re tweetokenize.Tokenizer-class.html#repeating_re
5 | tweetokenize.Tokenizer._cleanword tweetokenize.Tokenizer-class.html#_cleanword
6 | tweetokenize.Tokenizer.phonenumbers_re tweetokenize.Tokenizer-class.html#phonenumbers_re
7 | tweetokenize.Tokenizer._unicode tweetokenize.Tokenizer-class.html#_unicode
8 | tweetokenize.Tokenizer.usernames_re tweetokenize.Tokenizer-class.html#usernames_re
9 | tweetokenize.Tokenizer._replacetokens tweetokenize.Tokenizer-class.html#_replacetokens
10 | tweetokenize.Tokenizer.quotes_re tweetokenize.Tokenizer-class.html#quotes_re
11 | tweetokenize.Tokenizer._topleveldomains tweetokenize.Tokenizer-class.html#_topleveldomains
12 | tweetokenize.Tokenizer.__init__ tweetokenize.Tokenizer-class.html#__init__
13 | tweetokenize.Tokenizer.emoticons tweetokenize.Tokenizer-class.html#emoticons
14 | tweetokenize.Tokenizer.TokenizerException tweetokenize.Tokenizer.TokenizerException-class.html
15 | tweetokenize.Tokenizer.punctuation tweetokenize.Tokenizer-class.html#punctuation
16 | tweetokenize.Tokenizer._collectset tweetokenize.Tokenizer-class.html#_collectset
17 | tweetokenize.Tokenizer._isemoji tweetokenize.Tokenizer-class.html#_isemoji
18 | tweetokenize.Tokenizer.numbers_re tweetokenize.Tokenizer-class.html#numbers_re
19 | tweetokenize.Tokenizer.times_re tweetokenize.Tokenizer-class.html#times_re
20 | tweetokenize.Tokenizer.tokenize tweetokenize.Tokenizer-class.html#tokenize
21 | tweetokenize.Tokenizer.__call__ tweetokenize.Tokenizer-class.html#__call__
22 | tweetokenize.Tokenizer._converthtmlentities tweetokenize.Tokenizer-class.html#_converthtmlentities
23 | tweetokenize.Tokenizer._number tweetokenize.Tokenizer-class.html#_number
24 | tweetokenize.Tokenizer._separate_emoticons_punctuation tweetokenize.Tokenizer-class.html#_separate_emoticons_punctuation
25 | tweetokenize.Tokenizer.update tweetokenize.Tokenizer-class.html#update
26 | tweetokenize.Tokenizer.word_re tweetokenize.Tokenizer-class.html#word_re
27 | tweetokenize.Tokenizer.tokenize_re tweetokenize.Tokenizer-class.html#tokenize_re
28 | tweetokenize.Tokenizer.html_entities tweetokenize.Tokenizer-class.html#html_entities
29 | tweetokenize.Tokenizer.urls_re tweetokenize.Tokenizer-class.html#urls_re
30 | tweetokenize.Tokenizer._doublequotes tweetokenize.Tokenizer-class.html#_doublequotes
31 | tweetokenize.Tokenizer._token_regexs tweetokenize.Tokenizer-class.html#_token_regexs
32 | tweetokenize.Tokenizer.other_re tweetokenize.Tokenizer-class.html#other_re
33 | tweetokenize.Tokenizer.ellipsis_re tweetokenize.Tokenizer-class.html#ellipsis_re
34 | tweetokenize.Tokenizer.stopwords tweetokenize.Tokenizer-class.html#stopwords
35 | tweetokenize.Tokenizer.html_entities_re tweetokenize.Tokenizer-class.html#html_entities_re
36 | tweetokenize.Tokenizer.hashtags_re tweetokenize.Tokenizer-class.html#hashtags_re
37 | tweetokenize.Tokenizer.__default_args tweetokenize.Tokenizer-class.html#__default_args
38 | tweetokenize.Tokenizer.TokenizerException tweetokenize.Tokenizer.TokenizerException-class.html
39 |
--------------------------------------------------------------------------------
/documentation/class-tree.html:
--------------------------------------------------------------------------------
[Epydoc-generated class hierarchy page. One class is documented:]

tweetokenize.Tokenizer
    Can be used to tokenize a string representation of a message,
    adjusting features based on the given configuration details, to
    enable further processing in feature extraction and training stages.
--------------------------------------------------------------------------------
/documentation/help.html:
--------------------------------------------------------------------------------
[Standard epydoc help page. It describes the layout of the generated API
documentation: the per-package, per-module, and per-class pages; the module
and class trees; the term and identifier indices; the frames-based table of
contents; the navigation bar labels; the "show private"/"hide private"
toggle; and the timestamp below the bottom navigation bar.]
--------------------------------------------------------------------------------
/documentation/identifier-index.html:
--------------------------------------------------------------------------------
[Epydoc identifier index: an A-Z listing linking each package, module,
class, method, function, and variable name to its documentation.]
--------------------------------------------------------------------------------
/documentation/redirect.html:
--------------------------------------------------------------------------------
When JavaScript is enabled, this page redirects URLs of the form
redirect.html#dotted.name to the documentation for the object with the
given fully-qualified dotted name.
--------------------------------------------------------------------------------
/documentation/tweetokenize-module.html:
--------------------------------------------------------------------------------
Module tweetokenize

Tokenization and pre-processing for social media data used to train
classifiers. Focused on classification of sentiment, emotion, or mood.

Intended as glue between Python wrappers for the Twitter API and the
Natural Language Toolkit (NLTK), but probably applicable to tokenizing
any short messages of the social networking variety.

In many cases, reducing feature-set complexity can increase the
performance of classifiers trained to detect sentiment. The available
settings are based on commonly modified and normalized features in
classification research using content from Twitter.

Classes:
    Tokenizer - Can be used to tokenize a string representation of a
        message, adjusting features based on the given configuration
        details, to enable further processing in feature extraction and
        training stages.
--------------------------------------------------------------------------------
/documentation/tweetokenize.Tokenizer-class.html:
--------------------------------------------------------------------------------
Class Tokenizer

Can be used to tokenize a string representation of a message, adjusting
features based on the given configuration details, to enable further
processing in feature extraction and training stages.

An example usage:

>>> from tweetokenize import Tokenizer
>>> gettokens = Tokenizer(usernames='USER', urls='')
>>> gettokens.tokenize('@justinbeiber yo man!love you#inlove#wantyou in a totally straight way #brotime <3:p:D www.justinbeiber.com')
[u'USER', u'yo', u'man', u'!', u'love', u'you', u'#inlove', u'#wantyou', u'in', u'a', u'totally', u'straight', u'way', u'#brotime', u'<3', u':p', u':D']
Method summary:

    stopwords(self, iterable=None, filename=None)
        Consumes an iterable of stopwords that the tokenizer will ignore
        if the stopwords setting is True.
__init__(self, ...)

    Constructs a new Tokenizer. Custom settings can be given for the
    various feature normalizations.

    Any feature with a replacement token can be removed from the message
    by setting the token to the empty string (""), "DELETE", or "REMOVE".

    Parameters:
        lowercase (bool) - If True, lowercases words, excluding those
            with all letters capitalized.
        allcapskeep (bool) - If True, maintains capitalization for words
            with all letters in capitals. Otherwise, capitalization for
            such words depends on lowercase.
        normalize (int) - The number of repeating letters kept when
            normalizing arbitrary letter elongations.

            Example:
                Heyyyyyy i lovvvvvvve youuuuuuuuu <3
            becomes:
                Heyyy i lovvve youuu <3

            Not sure why you would want to change this (maybe just for
            fun?? :P)
        usernames - Replacement token for anything that parses as a
            Twitter username, e.g. @rayj. Setting this to False means
            usernames are left unchanged.
        urls - Replacement token for anything that parses as a URL,
            e.g. bit.ly or http://example.com. Setting this to False
            means URLs are left unchanged.
        hashtags - Replacement token for anything that parses as a
            Twitter hashtag, e.g. #ihititfirst or #onedirection. Setting
            this to False means hashtags are left unchanged.
        phonenumbers - Replacement token for phone numbers.
        times - Replacement token for times.
        numbers - Replacement token for any other kinds of numbers.
        ignorequotes (bool) - If True, removes various types of quotes
            and the contents within.
        ignorestopwords (bool) - If True, removes any stopwords. The
            default set includes 'I', 'me', 'itself', 'against',
            'should', etc.
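
For illustration, a doctest-style sketch combining a few of these settings;
the replacement token values are choices, and the output line is an
assumption based on the parameter descriptions above, not taken from the
source:

>>> from tweetokenize import Tokenizer
>>> gettokens = Tokenizer(phonenumbers='PHONE', times='TIME', normalize=2)
>>> gettokens.tokenize('call me at 555-123-4567 around 9:30pm heyyyyyy')
[u'call', u'me', u'at', u'PHONE', u'around', u'TIME', u'heyy']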
stopwords(self, iterable=None, filename=None)

    Consumes an iterable of stopwords that the tokenizer will ignore if
    the stopwords setting is True. The default set is taken from NLTK's
    English list.

    Parameters:
        iterable - Object capable of iteration, providing stopword
            strings.
        filename (str) - Path to a file containing stopwords delimited
            by newlines. Strips trailing whitespace and skips blank
            lines.
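
A short usage sketch for this method; whether the supplied words replace or
extend the default NLTK set is an assumption here, and the output shown is
illustrative:

>>> gettokens = Tokenizer(ignorestopwords=True)
>>> gettokens.stopwords(['the', 'a', 'an'])  # or stopwords(filename='path/to/stopwords.txt')
>>> gettokens.tokenize('the cat sat on a mat')
[u'cat', u'sat', u'on', u'mat']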