├── .gitignore
├── .gitmodules
├── .travis.yml
├── LICENSE
├── MANIFEST.in
├── README.md
├── __init__.py
├── requirements.txt
├── setup.py
├── tests.py
└── twitter_text
    ├── __init__.py
    ├── autolink.py
    ├── extractor.py
    ├── highlighter.py
    ├── regex.py
    ├── templatetags
    │   ├── __init__.py
    │   └── twitterize.py
    ├── unicode.py
    └── validation.py

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
build*
*.egg*
dist

--------------------------------------------------------------------------------
/.gitmodules:
--------------------------------------------------------------------------------
[submodule "twitter-text-conformance"]
	path = twitter-text-conformance
	url = https://github.com/dryan/twitter-text-conformance.git

--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
python:
  - "2.6"
  - "2.7"

install:
  - "pip install . --use-mirrors"
  - "pip install -r requirements.txt --use-mirrors"
script: "python ./tests.py"
notifications:
  email: false

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2010, Daniel Ryan
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.
* Neither the name of "dryan" nor Daniel Ryan nor the names of its contributors may be used
  to endorse or promote products derived from this software without
  specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

==================

This project is modeled after http://github.com/mzsanford/twitter-text-rb under the Apache License, 2.0.

Copyright 2010 Twitter, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
include LICENSE
include *.textile
recursive-include twitter_text *.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
A port of the Ruby gem [twitter-text-rb](https://github.com/twitter/twitter-text-rb) to Python.

[![Build Status](https://travis-ci.org/dryan/twitter-text-py.png?branch=master)](https://travis-ci.org/dryan/twitter-text-py)

# Changes in 2.0

See the [pull request](https://github.com/dryan/twitter-text-py/pull/17) for details.

# Usage

You can either create a new TwitterText object with the text of the tweet you want to process, e.g. `TwitterText('twitter-text-py is #awesome')`, or use any of the submodule objects directly (Autolink, Extractor, HitHighlighter or Validation), passing in the tweet text as an argument.

The library also contains a Django template filter that applies the auto_link method to the passed-in text. It can also optionally apply the hit_highlight method. Example:

    {% load twitterize %}

    {{ obj.body|twitter_text }}
    {{ obj.body|twitter_text:"my term" }}

You can test that the library is working correctly by running `python tests.py` in the root of the repository (the `twitter-text-conformance` submodule needs to be checked out first).

## TwitterText(text)

### Properties:

* text: the original text you passed in, or the modified version if you've called any functions on the object.
* original_text: the original text you passed in; never modified. Useful as a fallback or for comparisons.
* has_been_linked: boolean denoting whether any of the Autolink functions have been called. (Mostly for internal use.)
* tweet_length: the value returned by `validation.tweet_length` or None if that function has not yet been called.
* tweet_is_valid: boolean set by `validation.tweet_invalid` or None if that function has not yet been called.
* validation_error: the validation error string returned by `validation.tweet_invalid` or None if that function has not yet been called.
* autolink: property pointing to an Autolink object initialized with `text`
* extractor: property pointing to an Extractor object initialized with `text`
* highlighter: property pointing to a HitHighlighter object initialized with `text`
* validation: property pointing to a Validation object initialized with `text`

## Autolink(text)

This object modifies the text passed to it (and the parent TwitterText.text if present).

### Defaults

These may be overridden by the options dict passed to a particular method.

* url_class = 'tweet-url'
* list_class = 'list-slug'
* username_class = 'username'
* hashtag_class = 'hashtag'
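For example, a minimal usage sketch (the tweet text and the exact output markup here are illustrative; the class names come from the defaults above):

    from twitter_text import Autolink

    link = Autolink(u'Hello @world, check out #python')
    html = link.auto_link({'suppress_no_follow': True})
    # html now wraps the entities in <a> tags, e.g.
    # @<a class="tweet-url username" href="https://twitter.com/world">world</a>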
### Methods:

__auto_link(self, options = {})__

Add `<a></a>` tags around the usernames, lists, hashtags and URLs in the provided text. The `<a>` tags can be controlled with the following options:

* url_class: class to add to all `<a>` tags
* list_class: class to add to list `<a>` tags
* username_class: class to add to username `<a>` tags
* hashtag_class: class to add to hashtag `<a>` tags
* username_url_base: the value for the href attribute on username links. The @username (minus the @) will be appended at the end of this.
* list_url_base: the value for the href attribute on list links. The @username/list (minus the @) will be appended at the end of this.
* hashtag_url_base: the value for the href attribute on hashtag links. The #hashtag (minus the #) will be appended at the end of this.
* suppress_lists: disable auto-linking to lists
* suppress_no_follow: do not add rel="nofollow" to auto-linked items
* html_attrs: a dictionary of HTML attributes to add to non-Twitter links

__auto_link_usernames_or_lists(self, options = {})__

Add `<a></a>` tags around the usernames and lists in the provided text. The `<a>` tags can be controlled with the following options:

* url_class: class to add to all `<a>` tags
* list_class: class to add to list `<a>` tags
* username_class: class to add to username `<a>` tags
* username_url_base: the value for the href attribute on username links. The @username (minus the @) will be appended at the end of this.
* list_url_base: the value for the href attribute on list links. The @username/list (minus the @) will be appended at the end of this.
* suppress_lists: disable auto-linking to lists
* suppress_no_follow: do not add rel="nofollow" to auto-linked items

__auto_link_hashtags(self, options = {})__

Add `<a></a>` tags around the hashtags in the provided text. The `<a>` tags can be controlled with the following options:

* url_class: class to add to all `<a>` tags
* hashtag_class: class to add to hashtag `<a>` tags
* hashtag_url_base: the value for the href attribute. The hashtag text (minus the #) will be appended at the end of this.
* suppress_no_follow: do not add rel="nofollow" to auto-linked items

__auto_link_urls(self, options = {})__

Add `<a></a>` tags around the URLs in the provided text. Any elements in the options dict (except `suppress_no_follow`) will be converted to HTML attributes and placed in the `<a>` tag. Unless the options contain `suppress_no_follow`, the rel="nofollow" attribute will be added.

## Extractor

This object does not modify the text passed to it (or the parent TwitterText.text if present).

### Methods

__extract_mentioned_screen_names__

Extracts a list of all usernames mentioned in the Tweet text. If the text contains no username mentions an empty list will be returned.

If a transform is given, then it will be called with each username.

__extract_mentioned_screen_names_with_indices__

Extracts a list of all usernames mentioned in the Tweet text along with the indices for where the mention occurred, in the format:

    {
        'screen_name': username_string,
        'indices': ( start_position, end_position )
    }

If the text contains no username mentions, an empty list will be returned.

If a transform is given, then it will be called with each username, the start index, and the end index in the text.
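A short sketch of this return format in practice (the output shown is illustrative; dict key order may vary):

    from twitter_text import Extractor

    extractor = Extractor(u'@dryan ported twitter-text to Python')
    print extractor.extract_mentioned_screen_names_with_indices()
    # [{'screen_name': u'dryan', 'indices': [0, 6]}]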
__extract_reply_screen_name__

Extracts the first username replied to in the Tweet text. If the text does not contain a reply, None will be returned.

If a transform is given then it will be called with the username replied to (if any).

__extract_urls__

Extracts a list of all URLs included in the Tweet text. If the text contains no URLs an empty list will be returned.

If a transform is given then it will be called for each URL.

__extract_urls_with_indices__

Extracts a list of all URLs included in the Tweet text along with the indices, in the format:

    {
        'url': url_string,
        'indices': ( start_position, end_position )
    }

If the text contains no URLs an empty list will be returned.

If a transform is given then it will be called for each URL, the start index, and the end index in the text.

__extract_hashtags__

Extracts a list of all hashtags included in the Tweet text. If the text contains no hashtags an empty list will be returned. The list returned will not include the leading # character.

If a transform is given then it will be called for each hashtag.

__extract_hashtags_with_indices__

Extracts a list of all hashtags included in the Tweet text along with the indices, in the format:

    {
        'hashtag': hashtag_text,
        'indices': ( start_position, end_position )
    }

If the text contains no hashtags an empty list will be returned. The list returned will not include the leading # character.

If a transform is given then it will be called for each hashtag.

## HitHighlighter

### Defaults

These may be overridden by kwargs on a particular method.

* highlight_tag = 'em'
* highlight_class = 'search-hit'

### Methods

__hit_highlight(self, hits = [], **kwargs)__

Add `<em></em>` tags around the hits provided in the text, except for occurrences inside hashtags. Hits can be passed in as a list of (start, end) index pairs; if none are given, the text is searched for the query kwarg instead.

The highlight tag can be overridden using the tag kwarg. For example:

    HitHighlighter('test hit here').hit_highlight(query='hit', tag='strong')
    # => u'test <strong>hit</strong> here'

## Validation

### Methods

__tweet_length__

Returns the length of the string as it would be displayed. This is equivalent to the length of the Unicode NFC (see: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a string no matter which actual form was transmitted. For example:

    U+0065 Latin Small Letter E
    + U+0301 Combining Acute Accent
    ----------
    = 2 code points, displayed as é (1 visual glyph)

The NFC of {U+0065, U+0301} is {U+00E9}, which is a single character and a display length of 1.

The string could also contain U+00E9 already, in which case the canonicalization will not change the value.

__tweet_invalid__

Check the text for any reason that it may not be valid as a Tweet. This is meant as a pre-validation before posting to api.twitter.com. There are several server-side reasons for Tweets to fail, but this pre-validation will allow quicker feedback.

Returns False if this text is valid. Otherwise one of the following error strings will be returned:

* "Too long": if the text is too long
* "Empty text": if the text is empty
* "Invalid characters": if the text contains non-Unicode or any of the disallowed Unicode characters
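A brief sketch of the validation calls described above (method names follow this README; the return values shown are illustrative):

    from twitter_text import Validation

    validation = Validation(u'just setting up my twttr')
    print validation.tweet_length()        # 24
    print validation.tweet_invalid()       # False, i.e. the text is valid
    print Validation(u'').tweet_invalid()  # "Empty text"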
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dryan/twitter-text-py/143ee74751597efd782417df4773d586720428a4/__init__.py

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
argparse==1.2.1
PyYAML==3.10
beautifulsoup4==4.2.0

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import setup, find_packages

setup(
    name='twitter-text-py',
    version='2.0.2',
    description='A library for auto-converting URLs, mentions, hashtags, lists, etc. in Twitter text. Also does tweet validation and search term highlighting.',
    author='Daniel Ryan',
    author_email='dryan@dryan.com',
    url='http://github.com/dryan/twitter-text-py',
    packages=find_packages(),
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Environment :: Web Environment',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Framework :: Django',
    ],
    include_package_data=True,
    install_requires=['setuptools'],
    license = "BSD"
)

--------------------------------------------------------------------------------
/tests.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import twitter_text, sys, os, json, argparse, re
from twitter_text.unicode import force_unicode

# detect narrow (UCS-2) Python builds, which cannot represent characters
# outside the Basic Multilingual Plane with unichr
narrow_build = True
try:
    unichr(0x20000)
    narrow_build = False
except ValueError:
    pass

parser = argparse.ArgumentParser(description = u'Run the integration tests for twitter_text')
parser.add_argument('--ignore-narrow-errors', '-i', help = u'Ignore errors caused by narrow builds', default = False, action = 'store_true')
args = parser.parse_args()

try:
    import yaml
except ImportError:
    raise Exception('You need to install PyYAML to run the tests')
# from http://stackoverflow.com/questions/2890146/how-to-force-pyyaml-to-load-strings-as-unicode-objects
from yaml import Loader, SafeLoader
def construct_yaml_str(self, node):
    return self.construct_scalar(node)
Loader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)
SafeLoader.add_constructor(u'tag:yaml.org,2002:str', construct_yaml_str)

try:
    from bs4 import BeautifulSoup
except ImportError:
    try:
        from BeautifulSoup import BeautifulSoup
    except ImportError:
        raise Exception('You need to install BeautifulSoup to run the tests')

def success(text):
    return (u'\033[92m%s\033[0m\n' % text).encode('utf-8')

def error(text):
    return (u'\033[91m%s\033[0m\n' % text).encode('utf-8')
attempted = 0

def assert_equal_without_attribute_order(result, test, failure_message = None):
    global attempted
    attempted += 1
    # Beautiful Soup sorts the attributes for us so we can skip all the hoops the ruby version jumps through
    assert BeautifulSoup(result) == BeautifulSoup(test.get('expected')), error(u'Test %d Failed: %s' % (attempted, test.get('description')))
    sys.stdout.write(success(u'Test %d Passed: %s' % (attempted, test.get('description'))))
    sys.stdout.flush()

def assert_equal(result, test):
    global attempted
    attempted += 1
    assert result == test.get('expected'), error(u'\nTest %d Failed: %s%s' % (attempted, test.get('description'), u'\n%s' % test.get('hits') if test.get('hits') else ''))
    sys.stdout.write(success(u'Test %d Passed: %s' % (attempted, test.get('description'))))
    sys.stdout.flush()

# extractor section
extractor_file = open(os.path.join('twitter-text-conformance', 'extract.yml'), 'r')
extractor_tests = yaml.load(force_unicode(extractor_file.read()))
extractor_file.close()

sys.stdout.write('Testing Extractor\n')
sys.stdout.flush()

for section in extractor_tests.get('tests'):
    sys.stdout.write('\nTesting Extractor: %s\n' % section)
    sys.stdout.flush()
    for test in extractor_tests.get('tests').get(section):
        if (args.ignore_narrow_errors or narrow_build) and section in ['hashtags'] and test.get('description') in ['Hashtag with ideographic iteration mark']:
            sys.stdout.write('Skipping: %s\n' % test.get('description'))
            sys.stdout.flush()
            continue
        extractor = twitter_text.extractor.Extractor(test.get('text'))
        if section == 'mentions':
            assert_equal(extractor.extract_mentioned_screen_names(), test)
        elif section == 'mentions_with_indices':
            assert_equal(extractor.extract_mentioned_screen_names_with_indices(), test)
        elif section == 'mentions_or_lists_with_indices':
            assert_equal(extractor.extract_mentions_or_lists_with_indices(), test)
        elif section == 'replies':
            assert_equal(extractor.extract_reply_screen_name(), test)
        elif section == 'urls':
            assert_equal(extractor.extract_urls(), test)
        elif section == 'urls_with_indices':
            assert_equal(extractor.extract_urls_with_indices(), test)
        elif section == 'hashtags':
            assert_equal(extractor.extract_hashtags(), test)
        elif section == 'cashtags':
            assert_equal(extractor.extract_cashtags(), test)
        elif section == 'hashtags_with_indices':
            assert_equal(extractor.extract_hashtags_with_indices(), test)
        elif section == 'cashtags_with_indices':
            assert_equal(extractor.extract_cashtags_with_indices(), test)

# autolink section
autolink_file = open(os.path.join('twitter-text-conformance', 'autolink.yml'), 'r')
autolink_tests = yaml.load(force_unicode(autolink_file.read()))
autolink_file.close()

sys.stdout.write('\nTesting Autolink\n')
sys.stdout.flush()

autolink_options = {'suppress_no_follow': True}

for section in autolink_tests.get('tests'):
    sys.stdout.write('\nTesting Autolink: %s\n' % section)
    for test in autolink_tests.get('tests').get(section):
        if (args.ignore_narrow_errors or narrow_build) and section in ['hashtags'] and test.get('description') in ['Autolink a hashtag containing ideographic iteration mark']:
            sys.stdout.write('Skipping: %s\n' % test.get('description'))
            sys.stdout.flush()
            continue
        autolink = twitter_text.autolink.Autolink(test.get('text'))
        if section == 'usernames':
            assert_equal_without_attribute_order(autolink.auto_link_usernames_or_lists(autolink_options), test)
        elif section == 'cashtags':
            assert_equal_without_attribute_order(autolink.auto_link_cashtags(autolink_options), test)
        elif section == 'urls':
            assert_equal_without_attribute_order(autolink.auto_link_urls(autolink_options), test)
        elif section == 'hashtags':
            assert_equal_without_attribute_order(autolink.auto_link_hashtags(autolink_options), test)
        elif section == 'all':
            assert_equal_without_attribute_order(autolink.auto_link(autolink_options), test)
        elif section == 'lists':
            assert_equal_without_attribute_order(autolink.auto_link_usernames_or_lists(autolink_options), test)
        elif section == 'json':
            assert_equal_without_attribute_order(autolink.auto_link_with_json(json.loads(test.get('json')), autolink_options), test)

# hit_highlighting section
hit_highlighting_file = open(os.path.join('twitter-text-conformance', 'hit_highlighting.yml'), 'r')
hit_highlighting_tests = yaml.load(force_unicode(hit_highlighting_file.read()))
hit_highlighting_file.close()

sys.stdout.write('\nTesting Hit Highlighting\n')
sys.stdout.flush()

for section in hit_highlighting_tests.get('tests'):
    sys.stdout.write('\nTesting Hit Highlighting: %s\n' % section)
    for test in hit_highlighting_tests.get('tests').get(section):
        hit_highlighter = twitter_text.highlighter.HitHighlighter(test.get('text'))
        if section == 'plain_text':
            assert_equal(hit_highlighter.hit_highlight(hits = test.get('hits')), test)
        elif section == 'with_links':
            assert_equal_without_attribute_order(hit_highlighter.hit_highlight(hits = test.get('hits')), test)

# validation section
validation_tested = False
validate_tests = None
try:
    validate_file = open(os.path.join('twitter-text-conformance', 'validate.yml'), 'r')
    validate_file_contents = validate_file.read()
    validate_tests = yaml.load(re.sub(ur'\\n', '\n', validate_file_contents.encode('unicode-escape')))
    validate_file.close()
except ValueError:
    sys.stdout.write('\nValidation tests were skipped because of wide character issues\n')
    sys.stdout.flush()

if validate_tests:
    sys.stdout.write('\nTesting Validation\n')
    sys.stdout.flush()

    for section in validate_tests.get('tests'):
        sys.stdout.write('\nTesting Validation: %s\n' % section)
        for test in validate_tests.get('tests').get(section):
            validator = twitter_text.validation.Validation(test.get('text'))
            if section == 'tweets':
                assert_equal(not validator.tweet_invalid(), test)
            elif section == 'usernames':
                assert_equal(validator.valid_username(), test)
            elif section == 'lists':
                assert_equal(validator.valid_list(), test)
            elif section == 'hashtags':
                assert_equal(validator.valid_hashtag(), test)
            elif section == 'urls':
                assert_equal(validator.valid_url(), test)

sys.stdout.write(u'\033[0m-------\n\033[92m%d tests passed.\033[0m\n' % attempted)
sys.stdout.flush()
sys.exit(os.EX_OK)

--------------------------------------------------------------------------------
/twitter_text/__init__.py:
--------------------------------------------------------------------------------
# encoding=utf-8

from twitter_text.autolink import Autolink
from twitter_text.extractor import Extractor
from twitter_text.highlighter import HitHighlighter
from twitter_text.validation import Validation
from twitter_text.unicode import force_unicode

class TwitterText(object):
    def __init__(self, text):
        self.text = force_unicode(text) # this will get modified by some functions
        self.original_text = self.text # this never changes; use it as a fallback or for comparison
        self.has_been_linked = False
        self.tweet_length = None # gets changed by validation method
        self.tweet_is_valid = None # gets changed by validation method
        self.validation_error = None # gets changed by validation method

    def __unicode__(self):
        return self.text

    def __repr__(self):
        return self.__unicode__()

    @property
    def autolink(self):
        return Autolink(self.text, parent = self)

    @property
    def extractor(self):
        return Extractor(self.text)

    @property
    def highlighter(self):
        return HitHighlighter(self.text, parent = self)

    @property
    def validation(self):
        return Validation(self.text, parent = self)

--------------------------------------------------------------------------------
/twitter_text/autolink.py:
--------------------------------------------------------------------------------
# encoding=utf-8

import re, cgi

from twitter_text.regex import REGEXEN
from twitter_text.unicode import force_unicode
from twitter_text.extractor import Extractor

# Default CSS class for auto-linked lists
DEFAULT_LIST_CLASS = "tweet-url list-slug"
# Default CSS class for auto-linked usernames
DEFAULT_USERNAME_CLASS = "tweet-url username"
# Default CSS class for auto-linked hashtags
DEFAULT_HASHTAG_CLASS = "tweet-url hashtag"
# Default CSS class for auto-linked cashtags
DEFAULT_CASHTAG_CLASS = "tweet-url cashtag"

# Default URL base for auto-linked usernames
DEFAULT_USERNAME_URL_BASE = "https://twitter.com/"
# Default URL base for auto-linked lists
DEFAULT_LIST_URL_BASE = "https://twitter.com/"
# Default URL base for auto-linked hashtags
DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%23"
# Default URL base for auto-linked cashtags
DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%24"

# Default attributes for invisible span tag
DEFAULT_INVISIBLE_TAG_ATTRS = "style='position:absolute;left:-9999px;'"

DEFAULT_OPTIONS = {
    'list_class': DEFAULT_LIST_CLASS,
    'username_class': DEFAULT_USERNAME_CLASS,
    'hashtag_class': DEFAULT_HASHTAG_CLASS,
    'cashtag_class': DEFAULT_CASHTAG_CLASS,

    'username_url_base': DEFAULT_USERNAME_URL_BASE,
    'list_url_base': DEFAULT_LIST_URL_BASE,
    'hashtag_url_base': DEFAULT_HASHTAG_URL_BASE,
    'cashtag_url_base': DEFAULT_CASHTAG_URL_BASE,

    'invisible_tag_attrs': DEFAULT_INVISIBLE_TAG_ATTRS,
}

OPTIONS_NOT_ATTRIBUTES = (
    'url_class',
    'list_class',
    'username_class',
    'hashtag_class',
    'cashtag_class',
    'username_url_base',
    'list_url_base',
    'hashtag_url_base',
    'cashtag_url_base',
    'username_url_transform',
    'list_url_transform',
    'hashtag_url_transform',
    'cashtag_url_transform',
    'link_url_transform',
    'username_include_symbol',
    'suppress_lists',
    'suppress_no_follow',
    'url_entities',
    'invisible_tag_attrs',
    'symbol_tag',
    'text_with_symbol_tag',
    'url_target',
    'link_attribute_transform',
    'link_text_transform',
)

HTML_ENTITIES = {
    '&': '&amp;',
    '>': '&gt;',
    '<': '&lt;',
    '"': '&quot;',
    "'": '&#39;',
}

BOOLEAN_ATTRIBUTES = (
    'disabled',
    'readonly',
    'multiple',
    'checked',
)

def default_transform(entity, text):
    return text

class Autolink(object):
    def __init__(self, text, **kwargs):
        self.text = force_unicode(text)
        self.parent = kwargs.get('parent', False)
        self.extractor = Extractor(self.text)

    def auto_link_with_json(self, json_obj, options = {}):
        # concatenate entities
        entities = []
        if 'entities' in json_obj:
            json_obj = json_obj.get('entities')
        for key in json_obj:
            if type(json_obj[key]) == list:
                entities = entities + json_obj[key]

        # map JSON entity to twitter_text entity
        for entity in entities:
            if 'text' in entity:
                entity['hashtag'] = entity.get('text')

        return self.auto_link_entities(entities, options)

    def auto_link_entities(self, entities = [], options = {}):
        if not self.text:
            return self.text

        # NOTE deprecate these attributes not options keys in options hash, then use html_attrs
        options = dict(DEFAULT_OPTIONS.items() + options.items())
        options['html_attrs'] = self._extract_html_attrs_from_options(options)
        if not options.get('suppress_no_follow', False):
            options['html_attrs']['rel'] = "nofollow"

        entities.sort(key = lambda entity: entity['indices'][0], reverse = True)
        chars = self.text

        for entity in entities:
            if 'url' in entity:
                chars = self._link_to_url(entity, chars, options)
            elif 'hashtag' in entity:
                chars = self._link_to_hashtag(entity, chars, options)
            elif 'screen_name' in entity:
                chars = self._link_to_screen_name(entity, chars, options)
            elif 'cashtag' in entity:
                chars = self._link_to_cashtag(entity, chars, options)

        return chars
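    # Illustrative note (these example values are not from the original source):
    # auto_link_entities expects entity dicts shaped like the Extractor output,
    # e.g.
    #   {'screen_name': u'dryan', 'indices': [0, 6]}
    #   {'hashtag': u'python', 'indices': [10, 17]}
    #   {'url': u'http://example.com', 'indices': [20, 38]}
    # Entities are sorted and substituted from right to left so that the
    # indices of entities earlier in the text stay valid while linking.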
    def auto_link(self, options = {}):
        """
        Add <a></a> tags around the usernames, lists, hashtags and URLs in the provided text.
        The <a> tags can be controlled with the following entries in the options hash.
        Also any elements in the options hash will be converted to HTML attributes
        and placed in the <a> tag.

        @url_class                  class to add to url <a> tags
        @list_class                 class to add to list <a> tags
        @username_class             class to add to username <a> tags
        @hashtag_class              class to add to hashtag <a> tags
        @cashtag_class              class to add to cashtag <a> tags
        @username_url_base          the value for href attribute on username links. The @username (minus the @) will be appended at the end of this.
        @list_url_base              the value for href attribute on list links. The @username/list (minus the @) will be appended at the end of this.
        @hashtag_url_base           the value for href attribute on hashtag links. The #hashtag (minus the #) will be appended at the end of this.
        @cashtag_url_base           the value for href attribute on cashtag links. The $cashtag (minus the $) will be appended at the end of this.
        @invisible_tag_attrs        HTML attribute to add to invisible span tags
        @username_include_symbol    place the @ symbol within username and list links
        @suppress_lists             disable auto-linking to lists
        @suppress_no_follow         do not add rel="nofollow" to auto-linked items
        @symbol_tag                 tag to apply around symbol (@, #, $) in username / hashtag / cashtag links
        @text_with_symbol_tag       tag to apply around text part in username / hashtag / cashtag links
        @url_target                 the value for target attribute on URL links.
        @link_attribute_transform   function to modify the attributes of a link based on the entity. called with (entity, attributes) params, and should modify the attributes hash.
        @link_text_transform        function to modify the text of a link based on the entity. called with (entity, text) params, and should return a modified text.
        """
        return self.auto_link_entities(self.extractor.extract_entities_with_indices({'extract_url_without_protocol': False}), options)

    def auto_link_usernames_or_lists(self, options = {}):
        """
        Add <a></a> tags around the usernames and lists in the provided text. The
        <a> tags can be controlled with the following entries in the options hash.
        Also any elements in the options hash will be converted to HTML attributes
        and placed in the <a> tag.

        @list_class                 class to add to list <a> tags
        @username_class             class to add to username <a> tags
        @username_url_base          the value for href attribute on username links. The @username (minus the @) will be appended at the end of this.
        @list_url_base              the value for href attribute on list links. The @username/list (minus the @) will be appended at the end of this.
        @username_include_symbol    place the @ symbol within username and list links
        @suppress_lists             disable auto-linking to lists
        @suppress_no_follow         do not add rel="nofollow" to auto-linked items
        @symbol_tag                 tag to apply around symbol (@, #, $) in username / hashtag / cashtag links
        @text_with_symbol_tag       tag to apply around text part in username / hashtag / cashtag links
        @link_attribute_transform   function to modify the attributes of a link based on the entity. called with (entity, attributes) params, and should modify the attributes hash.
        @link_text_transform        function to modify the text of a link based on the entity. called with (entity, text) params, and should return a modified text.
        """
        return self.auto_link_entities(self.extractor.extract_mentions_or_lists_with_indices(), options)

    def auto_link_hashtags(self, options = {}):
        """
        Add <a></a> tags around the hashtags in the provided text.
        The <a> tags can be controlled with the following entries in the options hash.
        Also any elements in the options hash will be converted to HTML attributes
        and placed in the <a> tag.

        @hashtag_class              class to add to hashtag <a> tags
        @hashtag_url_base           the value for href attribute. The hashtag text (minus the #) will be appended at the end of this.
        @suppress_no_follow         do not add rel="nofollow" to auto-linked items
        @symbol_tag                 tag to apply around symbol (@, #, $) in username / hashtag / cashtag links
        @text_with_symbol_tag       tag to apply around text part in username / hashtag / cashtag links
        @link_attribute_transform   function to modify the attributes of a link based on the entity. called with (entity, attributes) params, and should modify the attributes hash.
        @link_text_transform        function to modify the text of a link based on the entity. called with (entity, text) params, and should return a modified text.
        """
        return self.auto_link_entities(self.extractor.extract_hashtags_with_indices(), options)

    def auto_link_cashtags(self, options = {}):
        """
        Add <a></a> tags around the cashtags in the provided text.
        The <a> tags can be controlled with the following entries in the options hash.
        Also any elements in the options hash will be converted to HTML attributes
        and placed in the <a> tag.

        @cashtag_class              class to add to cashtag <a> tags
        @cashtag_url_base           the value for href attribute. The cashtag text (minus the $) will be appended at the end of this.
        @suppress_no_follow         do not add rel="nofollow" to auto-linked items
        @symbol_tag                 tag to apply around symbol (@, #, $) in username / hashtag / cashtag links
        @text_with_symbol_tag       tag to apply around text part in username / hashtag / cashtag links
        @link_attribute_transform   function to modify the attributes of a link based on the entity. called with (entity, attributes) params, and should modify the attributes hash.
        @link_text_transform        function to modify the text of a link based on the entity. called with (entity, text) params, and should return a modified text.
        """
        return self.auto_link_entities(self.extractor.extract_cashtags_with_indices(), options)

    def auto_link_urls(self, options = {}):
        """
        Add <a></a> tags around the URLs in the provided text.
        The <a> tags can be controlled with the following entries in the options hash.
        Also any elements in the options hash will be converted to HTML attributes
        and placed in the <a> tag.

        @url_class                  class to add to url <a> tags
        @invisible_tag_attrs        HTML attribute to add to invisible span tags
        @suppress_no_follow         do not add rel="nofollow" to auto-linked items
        @symbol_tag                 tag to apply around symbol (@, #, $) in username / hashtag / cashtag links
        @text_with_symbol_tag       tag to apply around text part in username / hashtag / cashtag links
        @url_target                 the value for target attribute on URL links.
        @link_attribute_transform   function to modify the attributes of a link based on the entity. called with (entity, attributes) params, and should modify the attributes hash.
        @link_text_transform        function to modify the text of a link based on the entity. called with (entity, text) params, and should return a modified text.
234 | """ 235 | return self.auto_link_entities(self.extractor.extract_urls_with_indices({'extract_url_without_protocol': False}), options) 236 | 237 | # begin private methods 238 | def _html_escape(self, text): 239 | for char in HTML_ENTITIES: 240 | text = text.replace(char, HTML_ENTITIES[char]) 241 | return text 242 | 243 | def _extract_html_attrs_from_options(self, options = {}): 244 | html_attrs = options.get('html_attrs', {}) 245 | options = options.copy() 246 | if 'html_attrs' in options: 247 | del(options['html_attrs']) 248 | for option in options.keys(): 249 | if not option in OPTIONS_NOT_ATTRIBUTES: 250 | html_attrs[option] = options[option] 251 | return html_attrs 252 | 253 | def _url_entities_hash(self, url_entities): 254 | entities = {} 255 | for entity in url_entities: 256 | entities[entity.get('url')] = entity 257 | return entities 258 | 259 | def _link_to_url(self, entity, chars, options = {}): 260 | url = entity.get('url') 261 | 262 | href = options.get('link_url_transform', lambda x: x)(url) 263 | 264 | # NOTE auto link to urls do not use any default values and options 265 | # like url_class but use suppress_no_follow. 266 | html_attrs = self._extract_html_attrs_from_options(options) 267 | if options.get('url_class'): 268 | html_attrs['class'] = options.get('url_class') 269 | 270 | # add target attribute only if @url_target is specified 271 | if options.get('url_target'): 272 | html_attrs['target'] = options.get('url_target') 273 | 274 | url_entities = self._url_entities_hash(options.get('url_entities', {})) 275 | 276 | # use entity from @url_entities if available 277 | url_entity = url_entities.get(url, entity) 278 | if url_entity.get('display_url'): 279 | html_attrs['title'] = url_entity.get('expanded_url') 280 | link_text = self._link_url_with_entity(url_entity, options) 281 | else: 282 | link_text = self._html_escape(url) 283 | 284 | link = self._link_to_text(entity, link_text, href, html_attrs, options) 285 | return chars[:entity['indices'][0]] + link + chars[entity['indices'][1]:] 286 | 287 | def _link_url_with_entity(self, entity, options = {}): 288 | """ 289 | Goal: If a user copies and pastes a tweet containing t.co'ed link, the resulting paste 290 | should contain the full original URL (expanded_url), not the display URL. 291 | 292 | Method: Whenever possible, we actually emit HTML that contains expanded_url, and use 293 | font-size:0 to hide those parts that should not be displayed (because they are not part of display_url). 294 | Elements with font-size:0 get copied even though they are not visible. 295 | Note that display:none doesn't work here. Elements with display:none don't get copied. 296 | 297 | Additionally, we want to *display* ellipses, but we don't want them copied. To make this happen we 298 | wrap the ellipses in a tco-ellipsis class and provide an onCopy handler that sets display:none on 299 | everything with the tco-ellipsis class. 300 | 301 | Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/#!/username/status/1234/photo/1 302 | For those URLs, display_url is not a substring of expanded_url, so we don't do anything special to render the elided parts. 303 | For a pic.twitter.com URL, the only elided part will be the "https://", so this is fine. 
304 | """ 305 | display_url = entity.get('display_url').decode('utf-8') 306 | expanded_url = entity.get('expanded_url') 307 | invisible_tag_attrs = options.get('invisible_tag_attrs', DEFAULT_INVISIBLE_TAG_ATTRS) 308 | 309 | display_url_sans_ellipses = re.sub(ur'…', u'', display_url) 310 | 311 | if expanded_url.find(display_url_sans_ellipses) > -1: 312 | before_display_url, after_display_url = expanded_url.split(display_url_sans_ellipses, 2) 313 | preceding_ellipsis = re.search(ur'\A…', display_url) 314 | following_ellipsis = re.search(ur'…\z', display_url) 315 | if preceding_ellipsis is not None: 316 | preceding_ellipsis = preceding_ellipsis.group() 317 | else: 318 | preceding_ellipsis = '' 319 | if following_ellipsis is not None: 320 | following_ellipsis = following_ellipsis.group() 321 | else: 322 | following_ellipsis = '' 323 | 324 | # As an example: The user tweets "hi http://longdomainname.com/foo" 325 | # This gets shortened to "hi http://t.co/xyzabc", with display_url = "…nname.com/foo" 326 | # This will get rendered as: 327 | # 328 | # … 329 | # 337 | # http://longdomai 338 | # 339 | # 340 | # nname.com/foo 341 | # 342 | # 343 | #   344 | # … 345 | # 346 | 347 | return u"%s %s%s%s %s" % (preceding_ellipsis, invisible_tag_attrs, invisible_tag_attrs, self._html_escape(before_display_url), self._html_escape(display_url_sans_ellipses), invisible_tag_attrs, self._html_escape(after_display_url), invisible_tag_attrs, following_ellipsis) 348 | else: 349 | return self._html_escape(display_url) 350 | 351 | def _link_to_hashtag(self, entity, chars, options = {}): 352 | hashchar = chars[entity['indices'][0]] 353 | hashtag = entity['hashtag'] 354 | hashtag_class = options.get('hashtag_class') 355 | 356 | if REGEXEN['rtl_chars'].search(hashtag): 357 | hashtag_class += ' rtl' 358 | 359 | href = options.get('hashtag_url_transform', lambda ht: u'%s%s' % (options.get('hashtag_url_base'), ht))(hashtag) 360 | 361 | html_attrs = {} 362 | html_attrs.update(options.get('html_attrs', {})) 363 | html_attrs = { 364 | 'class': hashtag_class, 365 | 'title': u'#%s' % hashtag, 366 | } 367 | 368 | link = self._link_to_text_with_symbol(entity, hashchar, hashtag, href, html_attrs, options) 369 | return chars[:entity['indices'][0]] + link + chars[entity['indices'][1]:] 370 | 371 | def _link_to_cashtag(self, entity, chars, options = {}): 372 | dollar = chars[entity['indices'][0]] 373 | cashtag = entity['cashtag'] 374 | 375 | href = options.get('cashtag_url_transform', lambda ct: u'%s%s' % (options.get('cashtag_url_base'), ct))(cashtag) 376 | 377 | html_attrs = { 378 | 'class': options.get('cashtag_class'), 379 | 'title': u'$%s' % cashtag 380 | } 381 | html_attrs.update(options.get('html_attrs', {})) 382 | 383 | link = self._link_to_text_with_symbol(entity, dollar, cashtag, href, html_attrs, options) 384 | return chars[:entity['indices'][0]] + link + chars[entity['indices'][1]:] 385 | 386 | def _link_to_screen_name(self, entity, chars, options = {}): 387 | name = u'%s%s' % (entity['screen_name'], entity.get('list_slug') or '') 388 | chunk = options.get('link_text_transform', default_transform)(entity, name) 389 | name = name.lower() 390 | 391 | at = chars[entity['indices'][0]] 392 | 393 | html_attrs = options.get('html_attrs', {}).copy() 394 | if 'title' in html_attrs: 395 | del(html_attrs['title']) 396 | 397 | if entity.get('list_slug') and not options.get('supress_lists'): 398 | href = options.get('list_url_transform', lambda sn: u'%s%s' % (options.get('list_url_base'), sn))(name) 399 | html_attrs['class'] = 
        else:
            href = options.get('username_url_transform', lambda sn: u'%s%s' % (options.get('username_url_base'), sn))(name)
            html_attrs['class'] = options.get('username_class')

        link = self._link_to_text_with_symbol(entity, at, chunk, href, html_attrs, options)
        return chars[:entity['indices'][0]] + link + chars[entity['indices'][1]:]

    def _link_to_text_with_symbol(self, entity, symbol, text, href, attributes = {}, options = {}):
        tagged_symbol = u'<%s>%s</%s>' % (options.get('symbol_tag'), symbol, options.get('symbol_tag')) if options.get('symbol_tag') else symbol
        text = self._html_escape(text)
        tagged_text = u'<%s>%s</%s>' % (options.get('text_with_symbol_tag'), text, options.get('text_with_symbol_tag')) if options.get('text_with_symbol_tag') else text
        if options.get('username_include_symbol') or not REGEXEN['at_signs'].match(symbol):
            return u'%s' % self._link_to_text(entity, tagged_symbol + tagged_text, href, attributes, options)
        else:
            return u'%s%s' % (tagged_symbol, self._link_to_text(entity, tagged_text, href, attributes, options))

    def _link_to_text(self, entity, text, href, attributes = {}, options = {}):
        attributes['href'] = href
        if options.get('link_attribute_transform'):
            attributes = options.get('link_attribute_transform')(entity, attributes)
        text = options.get('link_text_transform', default_transform)(entity, text)
        return u'<a %s>%s</a>' % (self._tag_attrs(attributes), text)

    def _tag_attrs(self, attributes = {}):
        attrs = []
        for key in sorted(attributes.keys()):
            value = attributes[key]
            if key in BOOLEAN_ATTRIBUTES:
                attrs.append(key)
                continue
            if type(value) == list:
                value = u' '.join(value)
            attrs.append(u'%s="%s"' % (self._html_escape(key), self._html_escape(value)))

        return u' '.join(attrs)

--------------------------------------------------------------------------------
/twitter_text/extractor.py:
--------------------------------------------------------------------------------
# encoding=utf-8

from twitter_text.regex import REGEXEN
from twitter_text.unicode import force_unicode

class Extractor(object):
    """
    A module for including Tweet parsing in a class. This module provides functions for the extraction and processing
    of usernames, lists, URLs and hashtags.
    """

    def __init__(self, text):
        self.text = force_unicode(text)

    def _remove_overlapping_entities(self, entities):
        """
        Remove overlapping entities.
        This returns a new list with no overlapping entities.
        """

        # sort by start index
        entities.sort(key = lambda entity: entity['indices'][0])

        # remove duplicates
        prev = None
        for entity in [e for e in entities]:
            if prev and prev['indices'][1] > entity['indices'][0]:
                entities.remove(entity)
            else:
                prev = entity
        return entities

    def extract_entities_with_indices(self, options = {}, transform = lambda x: x):
        """
        Extracts all usernames, lists, hashtags and URLs in the Tweet text
        along with the indices for where the entity occurred.
        If the text is None or contains no entities, an empty list
        will be returned.

        If a transform is given then it will be called for each entity.
        """
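        # Illustrative note (example values, not from the original source):
        # for u'@dryan wrote about #python' this returns a mixed list of
        # mention and hashtag entity dicts, each carrying an 'indices' pair,
        # with any overlapping entities removed.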
41 | """ 42 | if not self.text: 43 | return [] 44 | 45 | # extract all entities 46 | entities = self.extract_urls_with_indices(options) + \ 47 | self.extract_hashtags_with_indices({'check_url_overlap': False}) + \ 48 | self.extract_mentions_or_lists_with_indices() + \ 49 | self.extract_cashtags_with_indices() 50 | 51 | entities = self._remove_overlapping_entities(entities) 52 | 53 | for entity in entities: 54 | entity = transform(entity) 55 | 56 | return entities 57 | 58 | def extract_mentioned_screen_names(self, transform = lambda x: x): 59 | """ 60 | Extracts a list of all usernames mentioned in the Tweet text. If the 61 | text is None or contains no username mentions an empty list 62 | will be returned. 63 | 64 | If a transform is given then it will be called for each username. 65 | """ 66 | return [transform(mention['screen_name']) for mention in self.extract_mentioned_screen_names_with_indices()] 67 | 68 | def extract_mentioned_screen_names_with_indices(self, transform = lambda x: x): 69 | """ 70 | Extracts a list of all usernames mentioned in the Tweet text 71 | along with the indices for where the mention ocurred. If the 72 | text is None or contains no username mentions, an empty list 73 | will be returned. 74 | 75 | If a transform is given, then it will be called with each username, the start 76 | index, and the end index in the text. 77 | """ 78 | if not self.text: 79 | return [] 80 | 81 | possible_screen_names = [] 82 | for match in self.extract_mentions_or_lists_with_indices(): 83 | if not match['list_slug']: 84 | possible_screen_names.append({ 85 | 'screen_name': transform(match['screen_name']), 86 | 'indices': match['indices'] 87 | }) 88 | return possible_screen_names 89 | 90 | def extract_mentions_or_lists_with_indices(self, transform = lambda x: x): 91 | """ 92 | Extracts a list of all usernames or lists mentioned in the Tweet text 93 | along with the indices for where the mention ocurred. If the 94 | text is None or contains no username or list mentions, an empty list 95 | will be returned. 96 | 97 | If a transform is given, then it will be called with each username, list slug, the start 98 | index, and the end index in the text. The list_slug will be an empty stirng 99 | if this is a username mention. 100 | """ 101 | if not REGEXEN['at_signs'].search(self.text): 102 | return [] 103 | 104 | possible_entries = [] 105 | for match in REGEXEN['valid_mention_or_list'].finditer(self.text): 106 | try: 107 | after = self.text[match.end()] 108 | except IndexError: 109 | # the mention was the last character in the string 110 | after = None 111 | if after and REGEXEN['end_mention_match'].match(after) or match.groups()[2].find('http') == 0: 112 | continue 113 | possible_entries.append({ 114 | 'screen_name': transform(match.groups()[2]), 115 | 'list_slug': match.groups()[3] or '', 116 | 'indices': [match.start() + len(match.groups()[0]), match.end()] 117 | }) 118 | 119 | return possible_entries 120 | 121 | def extract_reply_screen_name(self, transform = lambda x: x): 122 | """ 123 | Extracts the username username replied to in the Tweet text. If the 124 | text is None or is not a reply None will be returned. 

        If a transform is given then it will be called with the username replied to (if any).
        """
        if not self.text:
            return None

        possible_screen_name = REGEXEN['valid_reply'].match(self.text)
        if possible_screen_name is not None:
            if possible_screen_name.group(1).find('http') > -1:
                possible_screen_name = None
            else:
                possible_screen_name = transform(possible_screen_name.group(1))
        return possible_screen_name

    def extract_urls(self, transform = lambda x: x):
        """
        Extracts a list of all URLs included in the Tweet text. If the
        text is None or contains no URLs an empty list
        will be returned.

        If a transform is given then it will be called for each URL.
        """
        return [transform(url['url']) for url in self.extract_urls_with_indices()]

    def extract_urls_with_indices(self, options = {'extract_url_without_protocol': True}):
        """
        Extracts a list of all URLs included in the Tweet text along
        with the indices. If the text is None or contains no
        URLs an empty list will be returned.
        """
        urls = []
        for match in REGEXEN['valid_url'].finditer(self.text):
            complete, before, url, protocol, domain, port, path, query = match.groups()
            start_position = match.start() + len(before or '')
            end_position = match.end()
            # If protocol is missing and domain contains non-ASCII characters,
            # extract ASCII-only domains.
            if not protocol:
                if not options.get('extract_url_without_protocol') or REGEXEN['invalid_url_without_protocol_preceding_chars'].search(before):
                    continue
                last_url = None
                last_url_invalid_match = None
                for ascii_domain in REGEXEN['valid_ascii_domain'].finditer(domain):
                    ascii_domain = ascii_domain.group()
                    last_url = {
                        'url': ascii_domain,
                        'indices': [start_position - len(before or '') + complete.find(ascii_domain), start_position - len(before or '') + complete.find(ascii_domain) + len(ascii_domain)]
                    }
                    last_url_invalid_match = REGEXEN['invalid_short_domain'].search(ascii_domain) is not None
                    if not last_url_invalid_match:
                        urls.append(last_url)
                # no ASCII-only domain found. Skip the entire URL
                if not last_url:
                    continue
                if path:
                    last_url['url'] = url.replace(domain, last_url['url'])
                    last_url['indices'][1] = end_position
                    if last_url_invalid_match:
                        urls.append(last_url)
            else:
                if REGEXEN['valid_tco_url'].match(url):
                    url = REGEXEN['valid_tco_url'].match(url).group()
                    end_position = start_position + len(url)
                urls.append({
                    'url': url,
                    'indices': [start_position, end_position]
                })
        return urls

    def extract_hashtags(self, transform = lambda x: x):
        """
        Extracts a list of all hashtags included in the Tweet text. If the
        text is None or contains no hashtags an empty list
        will be returned. The list returned will not include the leading #
        character.

        If a transform is given then it will be called for each hashtag.
        """
        return [transform(hashtag['hashtag']) for hashtag in self.extract_hashtags_with_indices()]

    def extract_hashtags_with_indices(self, options = {'check_url_overlap': True}, transform = lambda x: x):
        """
        Extracts a list of all hashtags included in the Tweet text.
        If the text is None or contains no hashtags an empty list
        will be returned. The list returned will not include the leading #
        character.

        If a transform is given then it will be called for each hashtag.
        """
        tags = []
        for match in REGEXEN['valid_hashtag'].finditer(self.text):
            before, hashchar, hashtext = match.groups()
            start_position, end_position = match.span()
            start_position = start_position + len(before)
            if not (REGEXEN['end_hashtag_match'].match(self.text[end_position]) if len(self.text) > end_position else None) and not hashtext.find('http') == 0 and not REGEXEN['numeric_only'].match(hashtext):
                tags.append({
                    'hashtag': hashtext,
                    'indices': [start_position, end_position]
                })

        if options.get('check_url_overlap'):
            urls = self.extract_urls_with_indices()
            if len(urls):
                tags = tags + urls
                # remove duplicates
                tags = self._remove_overlapping_entities(tags)
                tags = [tag for tag in tags if 'hashtag' in tag]

        return tags

    def extract_cashtags(self, transform = lambda x: x):
        """
        Extracts a list of all cashtags included in the Tweet text. If the
        text is None or contains no cashtags an empty list
        will be returned. The list returned will not include the leading $
        character.

        If a transform is given then it will be called for each cashtag.
        """
        return [transform(cashtag['cashtag']) for cashtag in self.extract_cashtags_with_indices()]
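    # Illustrative sketch (example values, not from the original source):
    # Extractor(u'Watching $TWTR today').extract_cashtags() returns
    # [u'TWTR'], and extract_cashtags_with_indices() returns
    # [{'cashtag': u'TWTR', 'indices': [9, 14]}].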
256 | """ 257 | if not self.text or self.text.find('$') == -1: 258 | return [] 259 | 260 | tags = [] 261 | for match in REGEXEN['valid_cashtag'].finditer(self.text): 262 | before, dollar, cashtext = match.groups() 263 | start_position, end_position = match.span() 264 | start_position = start_position + len(before or '') 265 | tags.append({ 266 | 'cashtag': cashtext, 267 | 'indices': [start_position, end_position] 268 | }) 269 | 270 | return tags -------------------------------------------------------------------------------- /twitter_text/highlighter.py: -------------------------------------------------------------------------------- 1 | # encoding=utf-8 2 | 3 | import re 4 | from HTMLParser import HTMLParser 5 | 6 | from twitter_text.regex import UNICODE_SPACES 7 | from twitter_text.unicode import force_unicode 8 | 9 | DEFAULT_HIGHLIGHT_TAG = 'em' 10 | 11 | # from http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python 12 | class MLStripper(HTMLParser): 13 | def __init__(self): 14 | self.reset() 15 | self.fed = [] 16 | def handle_data(self, d): 17 | self.fed.append(d) 18 | def get_data(self): 19 | return ''.join(self.fed) 20 | 21 | def strip_tags(html): 22 | s = MLStripper() 23 | s.feed(html) 24 | return s.get_data() 25 | 26 | class HitHighlighter(object): 27 | def __init__(self, text, **kwargs): 28 | self.text = force_unicode(text) 29 | self.parent = kwargs.get('parent', False) 30 | 31 | def hit_highlight(self, hits = [], **kwargs): 32 | if not hits and not kwargs.get('query'): 33 | return self.text 34 | 35 | if not hits and kwargs.get('query'): 36 | stripped_text = strip_tags(self.text) 37 | for match in re.finditer(ur'%s' % kwargs.get('query'), stripped_text): 38 | hits.append(match.span()) 39 | 40 | if hits and not type(hits) == list: 41 | raise Exception('The syntax for the hit_highlight method has changed. 
44 |
45 |         tag_name = kwargs.get('tag', DEFAULT_HIGHLIGHT_TAG)
46 |         tags = [u'<%s>' % tag_name, u'</%s>' % tag_name]
47 |
48 |         text = self.text
49 |         chunks = re.split(r'[<>]', text)
50 |         text_chunks = []
51 |         for index, chunk in enumerate(chunks):
52 |             if not index % 2:
53 |                 text_chunks.append(chunk)
54 |         for hit in sorted(hits, key = lambda chunk: chunk[1], reverse = True):
55 |             hit_start, hit_end = hit
56 |             placed = 0
57 |             for index, chunk in enumerate(chunks):
58 |                 if placed == 2:
59 |                     continue
60 |                 if index % 2:
61 |                     # we're inside a <tag>
62 |                     continue
63 |                 chunk_start = len(u''.join(text_chunks[0:index / 2]))
64 |                 chunk_end = chunk_start + len(chunk)
65 |                 if hit_start >= chunk_start and hit_start < chunk_end:
66 |                     chunk = chunk[:hit_start - chunk_start] + tags[0] + chunk[hit_start - chunk_start:]
67 |                     if hit_end <= chunk_end:
68 |                         hit_end += len(tags[0])
69 |                         chunk_end += len(tags[0])
70 |                     placed = 1
71 |                 if hit_end > chunk_start and hit_end <= chunk_end:
72 |                     chunk = chunk[:hit_end - chunk_start] + tags[1] + chunk[hit_end - chunk_start:]
73 |                     placed = 2
74 |                 chunks[index] = chunk
75 |             if placed == 1:
76 |                 chunks[-1] = chunks[-1] + tags[1]
77 |         result = []
78 |         for index, chunk in enumerate(chunks):
79 |             if index % 2:
80 |                 # we're inside a <tag>
81 |                 result.append(u'<%s>' % chunk)
82 |             else:
83 |                 result.append(chunk)
84 |         self.text = u''.join(result)
85 |         return self.text
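# Usage sketch (illustrative):
#
#     HitHighlighter(u'I love #sushi!').hit_highlight(query = u'sushi')
#     # => u'I love #<em>sushi</em>!'
#
# Explicit hits work too; each hit is a (start, end) offset pair into the
# tag-stripped text:
#
#     HitHighlighter(u'I love #sushi!').hit_highlight(hits = [(8, 13)])
#     # => u'I love #<em>sushi</em>!'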
--------------------------------------------------------------------------------
/twitter_text/regex.py:
--------------------------------------------------------------------------------
1 | # encoding=utf-8
2 |
3 | # A collection of regular expressions for parsing Tweet text. The regular expression
4 | # list is frozen at load time to ensure immutability. These regular expressions are
5 | # used throughout the Twitter classes. Special care has been taken to make
6 | # sure these regular expressions work with Tweets in all languages.
7 | import re, string
8 |
9 | REGEXEN = {} # :nodoc:
10 |
11 | def regex_range(start, end = None):
12 |     if end:
13 |         return u'%s-%s' % (unichr(start), unichr(end))
14 |     else:
15 |         return u'%s' % unichr(start)
16 |
17 | # Space is more than %20, U+3000 for example is the full-width space used with Kanji. Provide a short-hand
18 | # to access both the list of characters and a pattern suitable for use with String#split
19 | # Taken from: ActiveSupport::Multibyte::Handlers::UTF8Handler::UNICODE_WHITESPACE
20 | UNICODE_SPACES = []
21 | for space in reduce(lambda x,y: x + y if type(y) == list else x + [y], [
22 |     range(0x0009, 0x000E), # White_Space # Cc [5] <control-0009>..<control-000D> (range end is exclusive)
23 |     0x0020, # White_Space # Zs SPACE
24 |     0x0085, # White_Space # Cc <control-0085>
25 |     0x00A0, # White_Space # Zs NO-BREAK SPACE
26 |     0x1680, # White_Space # Zs OGHAM SPACE MARK
27 |     0x180E, # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
28 |     range(0x2000, 0x200B), # White_Space # Zs [11] EN QUAD..HAIR SPACE (range end is exclusive)
29 |     0x2028, # White_Space # Zl LINE SEPARATOR
30 |     0x2029, # White_Space # Zp PARAGRAPH SEPARATOR
31 |     0x202F, # White_Space # Zs NARROW NO-BREAK SPACE
32 |     0x205F, # White_Space # Zs MEDIUM MATHEMATICAL SPACE
33 |     0x3000, # White_Space # Zs IDEOGRAPHIC SPACE
34 | ]):
35 |     UNICODE_SPACES.append(unichr(space))
36 | REGEXEN['spaces'] = re.compile(ur''.join(UNICODE_SPACES))
37 |
38 | # Characters not allowed in Tweets
39 | INVALID_CHARACTERS = [
40 |     0xFFFE, 0xFEFF, # BOM
41 |     0xFFFF, # Special
42 |     0x202A, 0x202B, 0x202C, 0x202D, 0x202E, # Directional change
43 | ]
44 | REGEXEN['invalid_control_characters'] = [unichr(x) for x in INVALID_CHARACTERS]
45 |
46 | REGEXEN['list_name'] = re.compile(ur'^[a-zA-Z][a-zA-Z0-9_\-\u0080-\u00ff]{0,24}$')
47 |
48 | # Latin accented characters
49 | # Excludes 0xd7 from the range (the multiplication sign, confusable with "x").
50 | # Also excludes 0xf7, the division sign
51 | LATIN_ACCENTS = [
52 |     regex_range(0x00c0, 0x00d6),
53 |     regex_range(0x00d8, 0x00f6),
54 |     regex_range(0x00f8, 0x00ff),
55 |     regex_range(0x0100, 0x024f),
56 |     regex_range(0x0253, 0x0254),
57 |     regex_range(0x0256, 0x0257),
58 |     regex_range(0x0259),
59 |     regex_range(0x025b),
60 |     regex_range(0x0263),
61 |     regex_range(0x0268),
62 |     regex_range(0x026f),
63 |     regex_range(0x0272),
64 |     regex_range(0x0289),
65 |     regex_range(0x028b),
66 |     regex_range(0x02bb),
67 |     regex_range(0x0300, 0x036f),
68 |     regex_range(0x1e00, 0x1eff),
69 | ]
70 | REGEXEN['latin_accents'] = re.compile(ur''.join(LATIN_ACCENTS), re.IGNORECASE | re.UNICODE)
71 | LATIN_ACCENTS = u''.join(LATIN_ACCENTS)
72 |
73 | RTL_CHARACTERS = ''.join([
74 |     regex_range(0x0600,0x06FF),
75 |     regex_range(0x0750,0x077F),
76 |     regex_range(0x0590,0x05FF),
77 |     regex_range(0xFE70,0xFEFF)
78 | ])
79 |
80 | NON_LATIN_HASHTAG_CHARS = ''.join([
81 |     # Cyrillic (Russian, Ukrainian, etc.)
82 |     regex_range(0x0400, 0x04ff), # Cyrillic
83 |     regex_range(0x0500, 0x0527), # Cyrillic Supplement
84 |     regex_range(0x2de0, 0x2dff), # Cyrillic Extended A
85 |     regex_range(0xa640, 0xa69f), # Cyrillic Extended B
86 |     regex_range(0x0591, 0x05bf), # Hebrew
87 |     regex_range(0x05c1, 0x05c2),
88 |     regex_range(0x05c4, 0x05c5),
89 |     regex_range(0x05c7),
90 |     regex_range(0x05d0, 0x05ea),
91 |     regex_range(0x05f0, 0x05f4),
92 |     regex_range(0xfb12, 0xfb28), # Hebrew Presentation Forms
93 |     regex_range(0xfb2a, 0xfb36),
94 |     regex_range(0xfb38, 0xfb3c),
95 |     regex_range(0xfb3e),
96 |     regex_range(0xfb40, 0xfb41),
97 |     regex_range(0xfb43, 0xfb44),
98 |     regex_range(0xfb46, 0xfb4f),
99 |     regex_range(0x0610, 0x061a), # Arabic
100 |     regex_range(0x0620, 0x065f),
101 |     regex_range(0x066e, 0x06d3),
102 |     regex_range(0x06d5, 0x06dc),
103 |     regex_range(0x06de, 0x06e8),
104 |     regex_range(0x06ea, 0x06ef),
105 |     regex_range(0x06fa, 0x06fc),
106 |     regex_range(0x06ff),
107 |     regex_range(0x0750, 0x077f), # Arabic Supplement
108 |     regex_range(0x08a0), # Arabic Extended A
109 |     regex_range(0x08a2, 0x08ac),
110 |     regex_range(0x08e4, 0x08fe),
111 |     regex_range(0xfb50, 0xfbb1), # Arabic Pres. Forms A
112 |     regex_range(0xfbd3, 0xfd3d),
113 |     regex_range(0xfd50, 0xfd8f),
114 |     regex_range(0xfd92, 0xfdc7),
115 |     regex_range(0xfdf0, 0xfdfb),
116 |     regex_range(0xfe70, 0xfe74), # Arabic Pres. Forms B
117 |     regex_range(0xfe76, 0xfefc),
118 |     regex_range(0x200c, 0x200c), # Zero-Width Non-Joiner
119 |     regex_range(0x0e01, 0x0e3a), # Thai
120 |     regex_range(0x0e40, 0x0e4e), # Thai
121 |     regex_range(0x1100, 0x11ff), # Hangul (Korean): Hangul Jamo
122 |     regex_range(0x3130, 0x3185), # Hangul Compatibility Jamo
123 |     regex_range(0xA960, 0xA97F), # Hangul Jamo Extended-A
124 |     regex_range(0xAC00, 0xD7AF), # Hangul Syllables
125 |     regex_range(0xD7B0, 0xD7FF), # Hangul Jamo Extended-B
126 |     regex_range(0xFFA1, 0xFFDC) # Half-width Hangul
127 | ])
128 |
129 | CJ_HASHTAG_CHARACTERS = ''.join([
130 |     regex_range(0x30A1, 0x30FA), regex_range(0x30FC, 0x30FE), # Katakana (full-width)
131 |     regex_range(0xFF66, 0xFF9F), # Katakana (half-width)
132 |     regex_range(0xFF10, 0xFF19), regex_range(0xFF21, 0xFF3A), regex_range(0xFF41, 0xFF5A), # Latin (full-width)
133 |     regex_range(0x3041, 0x3096), regex_range(0x3099, 0x309E), # Hiragana
134 |     regex_range(0x3400, 0x4DBF), # Kanji (CJK Extension A)
135 |     regex_range(0x4E00, 0x9FFF), # Kanji (Unified)
136 | ])
137 |
138 | try:
139 |     CJ_HASHTAG_CHARACTERS = ''.join([
140 |         CJ_HASHTAG_CHARACTERS,
141 |         regex_range(0x20000, 0x2A6DF), # Kanji (CJK Extension B)
142 |         regex_range(0x2A700, 0x2B73F), # Kanji (CJK Extension C)
143 |         regex_range(0x2B740, 0x2B81F), # Kanji (CJK Extension D)
144 |         regex_range(0x2F800, 0x2FA1F), regex_range(0x3003), regex_range(0x3005), regex_range(0x303B) # Kanji (CJK supplement)
145 |     ])
146 | except ValueError:
147 |     # this is a narrow python build so these extended Kanji characters won't work
148 |     pass
149 |
150 | PUNCTUATION_CHARS = ur'!"#$%&\'()*+,-./:;<=>?@\[\]^_\`{|}~'
151 | SPACE_CHARS = ur" \t\n\x0B\f\r"
152 | CTRL_CHARS = ur"\x00-\x1F\x7F"
153 |
154 | # A hashtag must contain latin characters, numbers and underscores, but not all numbers.
155 | HASHTAG_ALPHA = ur'[a-z_%s]' % (LATIN_ACCENTS + NON_LATIN_HASHTAG_CHARS + CJ_HASHTAG_CHARACTERS)
156 | HASHTAG_ALPHANUMERIC = ur'[a-z0-9_%s]' % (LATIN_ACCENTS + NON_LATIN_HASHTAG_CHARS + CJ_HASHTAG_CHARACTERS)
157 | HASHTAG_BOUNDARY = ur'\A|\z|\[|[^&a-z0-9_%s]' % (LATIN_ACCENTS + NON_LATIN_HASHTAG_CHARS + CJ_HASHTAG_CHARACTERS)
158 |
159 | HASHTAG = re.compile(ur'(%s)(#|＃)(%s*%s%s*)' % (HASHTAG_BOUNDARY, HASHTAG_ALPHANUMERIC, HASHTAG_ALPHA, HASHTAG_ALPHANUMERIC), re.IGNORECASE)
160 |
161 | REGEXEN['valid_hashtag'] = HASHTAG
162 | REGEXEN['end_hashtag_match'] = re.compile(ur'\A(?:[#＃]|:\/\/)', re.IGNORECASE | re.UNICODE)
163 | REGEXEN['numeric_only'] = re.compile(ur'^[\d]+$')
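# Illustrative example: REGEXEN['valid_hashtag'] captures three groups -- the
# boundary character (if any), the hash character itself and the hashtag text:
#
#     m = REGEXEN['valid_hashtag'].search(u'go #python3 now')
#     m.groups()  # => (u' ', u'#', u'python3')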
164 |
165 | REGEXEN['valid_mention_preceding_chars'] = re.compile(r'(?:[^a-zA-Z0-9_!#\$%&*@＠]|^|RT:?)')
166 | REGEXEN['at_signs'] = re.compile(ur'[@＠]')
167 | REGEXEN['valid_mention_or_list'] = re.compile(
168 |     ur'(%s)' % REGEXEN['valid_mention_preceding_chars'].pattern.decode('utf-8') + # preceding character
169 |     ur'(%s)' % REGEXEN['at_signs'].pattern + # at mark
170 |     ur'([a-zA-Z0-9_]{1,20})' + # screen name
171 |     ur'(\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?' # list (optional)
172 | )
173 | REGEXEN['valid_reply'] = re.compile(ur'^(?:[%s])*%s([a-zA-Z0-9_]{1,20})' % (REGEXEN['spaces'].pattern, REGEXEN['at_signs'].pattern), re.IGNORECASE | re.UNICODE)
174 | # Used in Extractor for final filtering
175 | REGEXEN['end_mention_match'] = re.compile(ur'\A(?:%s|[%s]|:\/\/)' % (REGEXEN['at_signs'].pattern, REGEXEN['latin_accents'].pattern), re.IGNORECASE | re.UNICODE)
176 |
177 | # URL related hash regex collection
178 | REGEXEN['valid_url_preceding_chars'] = re.compile(ur'(?:[^A-Z0-9@＠$#＃%s]|^)' % ur''.join(REGEXEN['invalid_control_characters']), re.IGNORECASE | re.UNICODE)
179 | REGEXEN['invalid_url_without_protocol_preceding_chars'] = re.compile(ur'[-_.\/]$')
180 | DOMAIN_VALID_CHARS = ur'[^%s%s%s%s%s]' % (PUNCTUATION_CHARS, SPACE_CHARS, CTRL_CHARS, ur''.join(REGEXEN['invalid_control_characters']), ur''.join(UNICODE_SPACES))
181 | REGEXEN['valid_subdomain'] = re.compile(ur'(?:(?:%s(?:[_-]|%s)*)?%s\.)' % (DOMAIN_VALID_CHARS, DOMAIN_VALID_CHARS, DOMAIN_VALID_CHARS), re.IGNORECASE | re.UNICODE)
182 | REGEXEN['valid_domain_name'] = re.compile(ur'(?:(?:%s(?:[-]|%s)*)?%s\.)' % (DOMAIN_VALID_CHARS, DOMAIN_VALID_CHARS, DOMAIN_VALID_CHARS), re.IGNORECASE | re.UNICODE)
183 | REGEXEN['valid_gTLD'] = re.compile(ur'(?:(?:academy|actor|aero|agency|arpa|asia|bar|bargains|berlin|best|bid|bike|biz|blue|boutique|build|builders|buzz|cab|camera|camp|cards|careers|cat|catering|center|ceo|cheap|christmas|cleaning|clothing|club|codes|coffee|com|community|company|computer|construction|contractors|cool|coop|cruises|dance|dating|democrat|diamonds|directory|domains|edu|education|email|enterprises|equipment|estate|events|expert|exposed|farm|fish|flights|florist|foundation|futbol|gallery|gift|glass|gov|graphics|guitars|guru|holdings|holiday|house|immobilien|industries|info|institute|int|international|jobs|kaufen|kim|kitchen|kiwi|koeln|kred|land|lighting|limo|link|luxury|management|mango|marketing|menu|mil|mobi|moda|monash|museum|nagoya|name|net|neustar|ninja|okinawa|onl|org|partners|parts|photo|photography|photos|pics|pink|plumbing|post|pro|productions|properties|pub|qpon|recipes|red|rentals|repair|report|reviews|rich|ruhr|sexy|shiksha|shoes|singles|social|solar|solutions|supplies|supply|support|systems|tattoo|technology|tel|tienda|tips|today|tokyo|tools|training|travel|uno|vacations|ventures|viajes|villas|vision|vote|voting|voto|voyage|wang|watch|wed|wien|wiki|works|xxx|xyz|zone|дети|онлайн|орг|сайт|بازار|شبكة|みんな|中信|中文网|公司|公益|在线|我爱你|政务|游戏|移动|网络|集团|삼성)(?=[^0-9a-z]|$))', re.IGNORECASE | re.UNICODE)
184 | REGEXEN['valid_ccTLD'] = re.compile(ur'(?:(?:ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|um|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw|мон|рф|срб|укр|қаз|الاردن|الجزائر|السعودية|المغرب|امارات|ایران|بھارت|تونس|سودان|سورية|عمان|فلسطين|قطر|مصر|مليسيا|پاکستان|भारत|বাংলা|ভারত|ਭਾਰਤ|ભારત|இந்தியா|இலங்கை|சிங்கப்பூர்|భారత్|ලංකා|ไทย|გე|中国|中國|台湾|台灣|新加坡|香港|한국)(?=[^0-9a-z]|$))', re.IGNORECASE | re.UNICODE)
185 | REGEXEN['valid_punycode'] = re.compile(ur'(?:xn--[0-9a-z]+)', re.IGNORECASE | re.UNICODE)
186 |
187 | REGEXEN['valid_domain'] = re.compile(ur'(?:%s*%s(?:%s|%s|%s))' % (REGEXEN['valid_subdomain'].pattern, REGEXEN['valid_domain_name'].pattern, REGEXEN['valid_gTLD'].pattern, REGEXEN['valid_ccTLD'].pattern, REGEXEN['valid_punycode'].pattern), re.IGNORECASE | re.UNICODE)
188 |
189 | # This is used in Extractor
190 | REGEXEN['valid_ascii_domain'] = re.compile(ur'(?:(?:[A-Za-z0-9\-_]|[%s])+\.)+(?:%s|%s|%s)' % (REGEXEN['latin_accents'].pattern, REGEXEN['valid_gTLD'].pattern, REGEXEN['valid_ccTLD'].pattern, REGEXEN['valid_punycode'].pattern), re.IGNORECASE | re.UNICODE)
191 |
192 | # This is used in Extractor for stricter t.co URL extraction
193 | REGEXEN['valid_tco_url'] = re.compile(ur'^https?:\/\/t\.co\/[a-z0-9]+', re.IGNORECASE | re.UNICODE)
194 |
195 | # This is used in Extractor to filter out unwanted URLs.
196 | REGEXEN['invalid_short_domain'] = re.compile(ur'\A%s%s\Z' % (REGEXEN['valid_domain_name'].pattern, REGEXEN['valid_ccTLD'].pattern), re.IGNORECASE | re.UNICODE)
197 |
198 | REGEXEN['valid_port_number'] = re.compile(ur'[0-9]+')
199 |
200 | REGEXEN['valid_general_url_path_chars'] = re.compile(ur"[a-z0-9!\*';:=\+\,\.\$\/%%#\[\]\-_~&|@%s]" % LATIN_ACCENTS, re.IGNORECASE | re.UNICODE)
201 | # Allow URL paths to contain balanced parens
202 | #   1. Used in Wikipedia URLs like /Primer_(film)
203 | #   2. Used in IIS sessions like /S(dfd346)/
204 | REGEXEN['valid_url_balanced_parens'] = re.compile(ur'\(%s+\)' % REGEXEN['valid_general_url_path_chars'].pattern, re.IGNORECASE | re.UNICODE)
205 | # Valid end-of-path characters (so /foo. does not gobble the period).
206 | #   1. Allow =&# for empty URL parameters and other URL-join artifacts
207 | REGEXEN['valid_url_path_ending_chars'] = re.compile(ur'[a-z0-9=_#\/\+\-%s]|(?:%s)' % (LATIN_ACCENTS, REGEXEN['valid_url_balanced_parens'].pattern), re.IGNORECASE | re.UNICODE)
208 | REGEXEN['valid_url_path'] = re.compile(ur'(?:(?:%s*(?:%s%s*)*%s)|(?:%s+\/))' % (REGEXEN['valid_general_url_path_chars'].pattern, REGEXEN['valid_url_balanced_parens'].pattern, REGEXEN['valid_general_url_path_chars'].pattern, REGEXEN['valid_url_path_ending_chars'].pattern, REGEXEN['valid_general_url_path_chars'].pattern), re.IGNORECASE | re.UNICODE)
209 |
210 | REGEXEN['valid_url_query_chars'] = re.compile(ur"[a-z0-9!?\*'\(\);:&=\+\$\/%#\[\]\-_\.,~|@]", re.IGNORECASE | re.UNICODE)
211 | REGEXEN['valid_url_query_ending_chars'] = re.compile(ur'[a-z0-9_&=#\/]', re.IGNORECASE | re.UNICODE)
212 | REGEXEN['valid_url'] = re.compile(ur'((%s)((https?:\/\/)?(%s)(?::(%s))?(/%s*)?(\?%s*%s)?))' % (
213 |     REGEXEN['valid_url_preceding_chars'].pattern,
214 |     REGEXEN['valid_domain'].pattern,
215 |     REGEXEN['valid_port_number'].pattern,
216 |     REGEXEN['valid_url_path'].pattern,
217 |     REGEXEN['valid_url_query_chars'].pattern,
218 |     REGEXEN['valid_url_query_ending_chars'].pattern
219 | ), re.IGNORECASE | re.UNICODE)
220 | # Matches
221 | # $1 total match
222 | # $2 Preceding character
223 | # $3 URL
224 | # $4 Protocol (optional)
225 | # $5 Domain(s)
226 | # $6 Port number (optional)
227 | # $7 URL Path and anchor
228 | # $8 Query String
229 |
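# Illustrative use of the groups above (group numbers are 1-based, so $3 is
# m.groups()[2]):
#
#     m = REGEXEN['valid_url'].search(u'see https://example.com/foo?bar=1 now')
#     m.groups()[2]  # => u'https://example.com/foo?bar=1'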
230 | REGEXEN['cashtag'] = re.compile(ur'[a-z]{1,6}(?:[._][a-z]{1,2})?', re.IGNORECASE)
231 | REGEXEN['valid_cashtag'] = re.compile(ur'(^|[%s])(\$|＄|﹩)(%s)(?=$|\s|[%s])' % (REGEXEN['spaces'].pattern, REGEXEN['cashtag'].pattern, PUNCTUATION_CHARS), re.IGNORECASE)
232 |
233 | # These URL validation pattern strings are based on the ABNF from RFC 3986
234 | REGEXEN['validate_url_unreserved'] = re.compile(ur'[a-z0-9\-._~]', re.IGNORECASE | re.UNICODE)
235 | REGEXEN['validate_url_pct_encoded'] = re.compile(ur'(?:%[0-9a-f]{2})', re.IGNORECASE | re.UNICODE)
236 | REGEXEN['validate_url_sub_delims'] = re.compile(ur"[!$&'()*+,;=]", re.IGNORECASE | re.UNICODE)
237 | REGEXEN['validate_url_pchar'] = re.compile(ur'(?:%s|%s|%s|[:\|@])' % (REGEXEN['validate_url_unreserved'].pattern, REGEXEN['validate_url_pct_encoded'].pattern, REGEXEN['validate_url_sub_delims'].pattern), re.IGNORECASE | re.UNICODE)
238 |
239 | REGEXEN['validate_url_scheme'] = re.compile(ur'(?:[a-z][a-z0-9+\-.]*)', re.IGNORECASE | re.UNICODE)
240 | REGEXEN['validate_url_userinfo'] = re.compile(ur'(?:%s|%s|%s|:)*' % (REGEXEN['validate_url_unreserved'].pattern, REGEXEN['validate_url_pct_encoded'].pattern, REGEXEN['validate_url_sub_delims'].pattern), re.IGNORECASE | re.UNICODE)
241 |
242 | REGEXEN['validate_url_dec_octet'] = re.compile(ur'(?:[0-9]|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5]))', re.IGNORECASE | re.UNICODE)
243 | REGEXEN['validate_url_ipv4'] = re.compile(ur'(?:%s(?:\.%s){3})' % (REGEXEN['validate_url_dec_octet'].pattern, REGEXEN['validate_url_dec_octet'].pattern), re.IGNORECASE | re.UNICODE)
244 |
245 | # Punting on real IPv6 validation for now
246 | REGEXEN['validate_url_ipv6'] = re.compile(ur'(?:\[[a-f0-9:\.]+\])', re.IGNORECASE | re.UNICODE)
247 |
248 | # Also punting on IPvFuture for now
249 | REGEXEN['validate_url_ip'] = re.compile(ur'(?:%s|%s)' % (REGEXEN['validate_url_ipv4'].pattern, REGEXEN['validate_url_ipv6'].pattern), re.IGNORECASE | re.UNICODE)
250 |
251 | # This is more strict than the rfc specifies
252 | REGEXEN['validate_url_subdomain_segment'] = re.compile(ur'(?:[a-z0-9](?:[a-z0-9_\-]*[a-z0-9])?)', re.IGNORECASE | re.UNICODE)
253 | REGEXEN['validate_url_domain_segment'] = re.compile(ur'(?:[a-z0-9](?:[a-z0-9\-]*[a-z0-9])?)', re.IGNORECASE | re.UNICODE)
254 | REGEXEN['validate_url_domain_tld'] = re.compile(ur'(?:[a-z](?:[a-z0-9\-]*[a-z0-9])?)', re.IGNORECASE | re.UNICODE)
255 | REGEXEN['validate_url_domain'] = re.compile(ur'(?:(?:%s\.)*(?:%s\.)%s)' % (REGEXEN['validate_url_subdomain_segment'].pattern, REGEXEN['validate_url_domain_segment'].pattern, REGEXEN['validate_url_domain_tld'].pattern), re.IGNORECASE | re.UNICODE)
256 |
257 | REGEXEN['validate_url_host'] = re.compile(ur'(?:%s|%s)' % (REGEXEN['validate_url_ip'].pattern, REGEXEN['validate_url_domain'].pattern), re.IGNORECASE | re.UNICODE)
258 |
259 | # Unencoded internationalized domains - this doesn't check for invalid UTF-8 sequences
260 | REGEXEN['validate_url_unicode_subdomain_segment'] = re.compile(ur'(?:(?:[a-z0-9]|[^\x00-\x7f])(?:(?:[a-z0-9_\-]|[^\x00-\x7f])*(?:[a-z0-9]|[^\x00-\x7f]))?)', re.IGNORECASE | re.UNICODE)
261 | REGEXEN['validate_url_unicode_domain_segment'] = re.compile(ur'(?:(?:[a-z0-9]|[^\x00-\x7f])(?:(?:[a-z0-9\-]|[^\x00-\x7f])*(?:[a-z0-9]|[^\x00-\x7f]))?)', re.IGNORECASE | re.UNICODE)
262 | REGEXEN['validate_url_unicode_domain_tld'] = re.compile(ur'(?:(?:[a-z]|[^\x00-\x7f])(?:(?:[a-z0-9\-]|[^\x00-\x7f])*(?:[a-z0-9]|[^\x00-\x7f]))?)', re.IGNORECASE | re.UNICODE)
263 | REGEXEN['validate_url_unicode_domain'] = re.compile(ur'(?:(?:%s\.)*(?:%s\.)%s)' % (REGEXEN['validate_url_unicode_subdomain_segment'].pattern, REGEXEN['validate_url_unicode_domain_segment'].pattern, REGEXEN['validate_url_unicode_domain_tld'].pattern), re.IGNORECASE | re.UNICODE)
264 |
265 | REGEXEN['validate_url_unicode_host'] = re.compile(ur'(?:%s|%s)' % (REGEXEN['validate_url_ip'].pattern, REGEXEN['validate_url_unicode_domain'].pattern), re.IGNORECASE | re.UNICODE)
266 |
267 | REGEXEN['validate_url_port'] = re.compile(ur'[0-9]{1,5}')
268 |
269 | REGEXEN['validate_url_unicode_authority'] = re.compile(ur'(?:(%s)@)?(%s)(?::(%s))?' % (REGEXEN['validate_url_userinfo'].pattern, REGEXEN['validate_url_unicode_host'].pattern, REGEXEN['validate_url_port'].pattern), re.IGNORECASE | re.UNICODE)
270 |
271 | REGEXEN['validate_url_authority'] = re.compile(ur'(?:(%s)@)?(%s)(?::(%s))?' % (REGEXEN['validate_url_userinfo'].pattern, REGEXEN['validate_url_host'].pattern, REGEXEN['validate_url_port'].pattern), re.IGNORECASE | re.UNICODE)
272 |
273 | REGEXEN['validate_url_path'] = re.compile(ur'(/%s*)*' % REGEXEN['validate_url_pchar'].pattern, re.IGNORECASE | re.UNICODE)
274 | REGEXEN['validate_url_query'] = re.compile(ur'(%s|/|\?)*' % REGEXEN['validate_url_pchar'].pattern, re.IGNORECASE | re.UNICODE)
275 | REGEXEN['validate_url_fragment'] = re.compile(ur'(%s|/|\?)*' % REGEXEN['validate_url_pchar'].pattern, re.IGNORECASE | re.UNICODE)
276 |
277 | # Modified version of RFC 3986 Appendix B
278 | REGEXEN['validate_url_unencoded'] = re.compile(ur'\A(?:([^:/?#]+)://)?([^/?#]*)([^?#]*)(?:\?([^#]*))?(?:\#(.*))?\Z', re.IGNORECASE | re.UNICODE)
279 |
280 | REGEXEN['rtl_chars'] = re.compile(ur'[%s]' % RTL_CHARACTERS, re.IGNORECASE | re.UNICODE)
281 |
--------------------------------------------------------------------------------
/twitter_text/templatetags/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dryan/twitter-text-py/143ee74751597efd782417df4773d586720428a4/twitter_text/templatetags/__init__.py
--------------------------------------------------------------------------------
/twitter_text/templatetags/twitterize.py:
--------------------------------------------------------------------------------
1 | try:
2 |     from django.template import Library
3 |     from django.template.defaultfilters import stringfilter
4 | except ImportError:
5 |     raise Exception('Django is not installed.')
6 |
7 | from twitter_text import TwitterText
8 |
9 | register = Library()
10 |
11 | @register.filter(name = 'twitter_text')
12 | @stringfilter
13 | def twitter_text(text, search_query = False):
14 |     """
15 |     Parses a text string through the TwitterText auto_link method and, if search_query is passed, through the hit_highlight method.
16 |     """
17 |     tt = TwitterText(text)
18 |     if search_query:
19 |         tt.text = tt.highlighter.hit_highlight(query = search_query)
20 |     tt.text = tt.autolink.auto_link()
21 |     return tt.text
22 | twitter_text.is_safe = True
--------------------------------------------------------------------------------
/twitter_text/unicode.py:
--------------------------------------------------------------------------------
1 | import types, datetime
2 | from decimal import Decimal
3 |
4 | # borrowed from django.utils.encoding
5 | class TwitterTextUnicodeDecodeError(UnicodeDecodeError):
6 |     def __init__(self, obj, *args):
7 |         self.obj = obj
8 |         UnicodeDecodeError.__init__(self, *args)
9 |
10 |     def __str__(self):
11 |         original = UnicodeDecodeError.__str__(self)
12 |         return '%s. You passed in %r (%s)' % (original, self.obj,
13 |             type(self.obj))
14 |
15 | def is_protected_type(obj):
16 |     """Determine if the object instance is of a protected type.
17 |
18 |     Objects of protected types are preserved as-is when passed to
19 |     force_unicode(strings_only=True).
20 |     """
21 |     return isinstance(obj, (
22 |         types.NoneType,
23 |         int, long,
24 |         datetime.datetime, datetime.date, datetime.time,
25 |         float, Decimal)
26 |     )
27 |
28 | def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'):
29 |     """
30 |     Similar to smart_unicode, except that lazy instances are resolved to
31 |     strings, rather than kept as lazy objects.
32 |
33 |     If strings_only is True, don't convert (some) non-string-like objects.
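
    Example (illustrative):

        force_unicode('caf\xc3\xa9')            # => u'caf\xe9'
        force_unicode(None, strings_only=True)  # => None (protected type, returned unchanged)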
34 |     """
35 |     if strings_only and is_protected_type(s):
36 |         return s
37 |     try:
38 |         if not isinstance(s, basestring):
39 |             if hasattr(s, '__unicode__'):
40 |                 s = unicode(s)
41 |             else:
42 |                 try:
43 |                     s = unicode(str(s), encoding, errors)
44 |                 except UnicodeEncodeError:
45 |                     if not isinstance(s, Exception):
46 |                         raise
47 |                     # If we get to here, the caller has passed in an Exception
48 |                     # subclass populated with non-ASCII data without special
49 |                     # handling to display as a string. We need to handle this
50 |                     # without raising a further exception. We do an
51 |                     # approximation to what the Exception's standard str()
52 |                     # output should be.
53 |                     s = ' '.join([force_unicode(arg, encoding, strings_only,
54 |                         errors) for arg in s])
55 |         elif not isinstance(s, unicode):
56 |             # Note: We use .decode() here, instead of unicode(s, encoding,
57 |             # errors), so that if s is a SafeString, it ends up being a
58 |             # SafeUnicode at the end.
59 |             s = s.decode(encoding, errors)
60 |     except UnicodeDecodeError, e:
61 |         if not isinstance(s, Exception):
62 |             raise TwitterTextUnicodeDecodeError(s, *e.args)
63 |         else:
64 |             # If we get to here, the caller has passed in an Exception
65 |             # subclass populated with non-ASCII bytestring data without a
66 |             # working unicode method. Try to handle this without raising a
67 |             # further exception by individually forcing the exception args
68 |             # to unicode.
69 |             s = ' '.join([force_unicode(arg, encoding, strings_only,
70 |                 errors) for arg in s])
71 |     return s
72 |
--------------------------------------------------------------------------------
/twitter_text/validation.py:
--------------------------------------------------------------------------------
1 | # encoding=utf-8
2 |
3 | import re
4 |
5 | from twitter_text.unicode import force_unicode
6 | from twitter_text.extractor import Extractor
7 | from twitter_text.regex import REGEXEN
8 |
9 | MAX_LENGTH = 140
10 |
11 | DEFAULT_TCO_URL_LENGTHS = {
12 |     'short_url_length': 22,
13 |     'short_url_length_https': 23,
14 |     'characters_reserved_per_media': 22,
15 | }
16 |
17 | class Validation(object):
18 |     def __init__(self, text, **kwargs):
19 |         self.text = force_unicode(text)
20 |         self.parent = kwargs.get('parent', False)
21 |
22 |     def tweet_length(self, options = {}):
23 |         """
24 |         Returns the length of the string as it would be displayed. This is equivalent to the length of the Unicode NFC
25 |         (see: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
26 |         string no matter which actual form was transmitted. For example:
27 |
28 |             U+0065 Latin Small Letter E
29 |         +   U+0301 Combining Acute Accent
30 |         ----------
31 |         =   2 bytes, 2 characters, displayed as é (1 visual glyph)
32 |             The NFC of {U+0065, U+0301} is {U+00E9}, which is a single character with a display length of 1.
33 |
34 |         The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
35 |         """
36 |
37 |         assert (not self.parent or not getattr(self.parent, 'has_been_linked', False)), 'The validator should only be run on text before it has been modified.'
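        # Illustrative arithmetic: in u'go http://example.com now' (25 characters)
        # the 18-character URL is counted as a t.co replacement instead, so the
        # reported length is 25 - 18 + short_url_length (22) = 29.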
38 |         options = dict(options) # work on a copy so neither the caller's dict nor the shared default is mutated
39 |         for key in DEFAULT_TCO_URL_LENGTHS:
40 |             if not key in options:
41 |                 options[key] = DEFAULT_TCO_URL_LENGTHS[key]
42 |
43 |         length = len(self.text)
44 |         # thanks force_unicode for making this so much simpler than the ruby version
45 |
46 |         for url in Extractor(self.text).extract_urls_with_indices():
47 |             # remove the length of the original URL
48 |             length += url['indices'][0] - url['indices'][1]
49 |             # add the length of the t.co URL that will replace it
50 |             length += options.get('short_url_length_https') if url['url'].lower().find('https://') > -1 else options.get('short_url_length')
51 |
52 |         if self.parent and hasattr(self.parent, 'tweet_length'):
53 |             self.parent.tweet_length = length
54 |         return length
55 |
56 |     def tweet_invalid(self):
57 |         """
58 |         Check the text for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
59 |         before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
60 |         will allow quicker feedback.
61 |
62 |         Returns False if this text is valid. Otherwise one of the following error strings will be returned:
63 |
64 |         "Too long":: if the text is too long
65 |         "Empty text":: if the text is empty
66 |         "Invalid characters":: if the text contains non-Unicode or any of the disallowed Unicode characters
67 |         """
68 |
69 |         valid = True # optimism
70 |         validation_error = None
71 |
72 |         if not self.tweet_length():
73 |             valid, validation_error = False, 'Empty text'
74 |
75 |         if self.tweet_length() > MAX_LENGTH:
76 |             valid, validation_error = False, 'Too long'
77 |
78 |         if re.search(ur'[%s]' % ur''.join(REGEXEN['invalid_control_characters']), self.text):
79 |             valid, validation_error = False, 'Invalid characters'
80 |
81 |         if self.parent and hasattr(self.parent, 'tweet_is_valid'):
82 |             self.parent.tweet_is_valid = valid
83 |         if self.parent and hasattr(self.parent, 'tweet_validation_error'):
84 |             self.parent.tweet_validation_error = validation_error
85 |
86 |         return validation_error if not valid else False
87 |
88 |     def valid_tweet_text(self):
89 |         return not self.tweet_invalid()
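    # Illustrative behavior:
    #
    #     Validation(u'').tweet_invalid()          # => 'Empty text'
    #     Validation(u'Hello!').tweet_invalid()    # => False
    #     Validation(u'Hello!').valid_tweet_text() # => True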
90 |
91 |     def valid_username(self):
92 |         if not self.text:
93 |             return False
94 |
95 |         extracted = Extractor(self.text).extract_mentioned_screen_names()
96 |
97 |         return len(extracted) == 1 and extracted[0] == self.text[1:]
98 |
99 |     def valid_list(self):
100 |         match = re.compile(ur'^%s$' % REGEXEN['valid_mention_or_list'].pattern).search(self.text)
101 |         return bool(match is not None and match.groups()[0] == "" and match.groups()[3])
102 |
103 |     def valid_hashtag(self):
104 |         if not self.text:
105 |             return False
106 |
107 |         extracted = Extractor(self.text).extract_hashtags()
108 |
109 |         return len(extracted) == 1 and extracted[0] == self.text[1:]
110 |
111 |     def valid_url(self, unicode_domains = True, require_protocol = True):
112 |         if not self.text:
113 |             return False
114 |
115 |         url_parts = REGEXEN['validate_url_unencoded'].match(self.text)
116 |
117 |         if not (url_parts and url_parts.string == self.text):
118 |             return False
119 |
120 |         scheme, authority, path, query, fragment = url_parts.groups()
121 |
122 |         if not (
123 |             (
124 |                 not require_protocol
125 |                 or (
126 |                     self._valid_match(scheme, REGEXEN['validate_url_scheme'])
127 |                     and re.compile(ur'^https?$', re.IGNORECASE).match(scheme)
128 |                 )
129 |             )
130 |             and (
131 |                 path == ''
132 |                 or self._valid_match(path, REGEXEN['validate_url_path'])
133 |             )
134 |             and self._valid_match(query, REGEXEN['validate_url_query'], True)
135 |             and self._valid_match(fragment, REGEXEN['validate_url_fragment'], True)
136 |         ):
137 |             return False
138 |
139 |         return bool(
140 |             (
141 |                 unicode_domains
142 |                 and self._valid_match(authority, REGEXEN['validate_url_unicode_authority'])
143 |                 and REGEXEN['validate_url_unicode_authority'].match(authority).string == authority
144 |             )
145 |             or (
146 |                 not unicode_domains
147 |                 and self._valid_match(authority, REGEXEN['validate_url_authority'])
148 |                 and REGEXEN['validate_url_authority'].match(authority).string == authority
149 |             )
150 |         )
151 |
152 |     def _valid_match(self, string, re_obj, optional = False):
153 |         if string is None:
154 |             return optional # a missing component is only acceptable when it is optional; matching None would raise a TypeError
155 |         match = re_obj.match(string)
156 |         if optional:
157 |             return not (string and (match is None or not match.string[match.span()[0]:match.span()[1]] == string))
158 |         else:
159 |             return bool(string and match and match.string[match.span()[0]:match.span()[1]] == string)
--------------------------------------------------------------------------------