├── emoji_extractor ├── __init__.py ├── data │ ├── big_regex.pkl │ ├── tme_regex.pkl │ └── possible_emoji.pkl ├── __pycache__ │ ├── __init__.cpython-36.pyc │ └── extract.cpython-36.pyc └── extract.py ├── .gitignore ├── LICENSE.txt ├── README.md ├── README.txt ├── update_regex └── update_regex.py └── notebooks ├── examples.ipynb └── .ipynb_checkpoints └── examples-checkpoint.ipynb /emoji_extractor/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | dist/ 2 | .git/ 3 | .ipynb_checkpoints/ 4 | make regex/ 5 | __pycache__/ 6 | setup.py 7 | MANIFEST.in 8 | MANIFEST -------------------------------------------------------------------------------- /emoji_extractor/data/big_regex.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexanderrobertson/emoji-extractor/HEAD/emoji_extractor/data/big_regex.pkl -------------------------------------------------------------------------------- /emoji_extractor/data/tme_regex.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexanderrobertson/emoji-extractor/HEAD/emoji_extractor/data/tme_regex.pkl -------------------------------------------------------------------------------- /emoji_extractor/data/possible_emoji.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexanderrobertson/emoji-extractor/HEAD/emoji_extractor/data/possible_emoji.pkl -------------------------------------------------------------------------------- /emoji_extractor/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexanderrobertson/emoji-extractor/HEAD/emoji_extractor/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /emoji_extractor/__pycache__/extract.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/alexanderrobertson/emoji-extractor/HEAD/emoji_extractor/__pycache__/extract.cpython-36.pyc -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2017 Alexander Robertson 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Emoji extractor/counter
2 | 
3 | # Installation
4 | 
5 | ```pip install emoji_extractor```
6 | 
7 | ```conda install emoji-extractor -c conda-forge```
8 | 
9 | Usage examples: [see this Jupyter notebook](https://github.com/alexanderrobertson/emoji-extractor/blob/master/notebooks/examples.ipynb)
10 | 
11 | # Info
12 | 
13 | It counts the emoji in a string, returning the emoji and their counts. That's it! It should properly detect and count all current multi-part emoji.
14 | 
15 | There is an update script in `update_regex` which can be used to update to the latest Unicode version, or to detect only emoji for a specific Unicode version.
16 | 
17 | # Details
18 | 
19 | * Uses [v16.0 of the current emoji test data](https://unicode.org/Public/emoji/16.0/emoji-test.txt).
20 | 
21 | * `possible_emoji.pkl` is a pickled set of possible emoji, used to check for their presence in a string, along with a few additional characters like the exciting [VARIATION-SELECTOR-16](https://emojipedia.org/variation-selector-16/) and the individual characters which make up flag sequences.
22 | 
23 | * `big_regex.pkl` is a pickled compiled regular expression. It's just lots of regular expressions piped together in order of decreasing length. This is important to make sure that you can count multi-codepoint sequences like '💁🏽\u200d♂️' and so on.
24 | 
25 | * Some emoji include the variation selector U+FE0F, but some platforms strip it and still render the emoji form. However, the regex used here will capture both '👁️\u200d🗨️' (U+FE0F after each emoji codepoint) and '👁\u200d🗨' (no U+FE0F), and even cases where some component codepoints carry an optional variation selector while others which could carry one don't. See Unicode's Full Emoji List and search for 'FE0F' to see which emoji this potentially affects.
26 | 
27 | # Other work
28 | 
29 | If you want to do stuff more complicated than simply detecting, extracting and counting emoji, then you might find [this Python package useful](https://github.com/carpedm20/emoji/).
30 | 
31 | # To do
32 | 
33 | It may be possible to speed up the extraction/counting process by limiting the regular expressions used to only those which are possible, given the unique detected characters. I guess it would depend on how quickly the new, smaller regex can be compiled. Storing them in advance might be possible, but the number of combinations is likely to be prohibitive.
34 | 
35 | # Anything else
36 | 
37 | Feel free to email me about any of this stuff.
38 | 
--------------------------------------------------------------------------------
/README.txt:
--------------------------------------------------------------------------------
1 | # Emoji extractor/counter
2 | 
3 | # Installation
4 | 
5 | ```pip install emoji_extractor```
6 | ```conda install emoji-extractor -c conda-forge```
7 | 
8 | Usage examples: [see this Jupyter notebook](https://github.com/alexanderrobertson/emoji-extractor/blob/master/notebooks/examples.ipynb)
9 | 
10 | # Info
11 | 
12 | It counts the emoji in a string, returning the emoji and their counts. That's it! It should properly detect and count all current multi-part emoji.
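As a rough illustration of that (a minimal sketch using the package's own `Extractor` class from `emoji_extractor/extract.py`; the input strings are made up — see the notebook linked above for real data and timings):

```python
from emoji_extractor.extract import Extractor

extract = Extractor()

# Count emoji in a single string. With check_first=True the big regex is
# only run if the string contains at least one possible emoji character.
print(extract.count_emoji("Kick-off at 7 ⏰🏆", check_first=True))
# Counter({'⏰': 1, '🏆': 1})

# Aggregate counts over an iterable of strings (made-up examples).
totals = extract.count_all_emoji(["😂😂 so good", "no emoji here", "❤"])
print(totals.most_common())
# [('😂', 2), ('❤', 1)]
```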
13 | 
14 | There is an update script in `update_regex` which can be used to update to the latest Unicode version, or to detect only emoji for a specific Unicode version.
15 | 
16 | # Details
17 | 
18 | * Uses [v16.0 of the current emoji test data](https://unicode.org/Public/emoji/16.0/emoji-test.txt).
19 | 
20 | * `possible_emoji.pkl` is a pickled set of possible emoji, used to check for their presence in a string, along with a few additional characters like the exciting [VARIATION-SELECTOR-16](https://emojipedia.org/variation-selector-16/) and the individual characters which make up flag sequences.
21 | 
22 | * `big_regex.pkl` is a pickled compiled regular expression. It's just lots of regular expressions piped together in order of decreasing length. This is important to make sure that you can count multi-codepoint sequences like '💁🏽\u200d♂️' and so on.
23 | 
24 | * Some emoji include the variation selector U+FE0F, but some platforms strip it and still render the emoji form. However, the regex used here will capture both '👁️\u200d🗨️' (U+FE0F after each emoji codepoint) and '👁\u200d🗨' (no U+FE0F), and even cases where some component codepoints carry an optional variation selector while others which could carry one don't. See Unicode's Full Emoji List and search for 'FE0F' to see which emoji this potentially affects.
25 | 
26 | # Other work
27 | 
28 | If you want to do stuff more complicated than simply detecting, extracting and counting emoji, then you might find [this Python package useful](https://github.com/carpedm20/emoji/).
29 | 
30 | # To do
31 | 
32 | It may be possible to speed up the extraction/counting process by limiting the regular expressions used to only those which are possible, given the unique detected characters (a rough sketch of the idea is included at the end of this file). I guess it would depend on how quickly the new, smaller regex can be compiled. Storing them in advance might be possible, but the number of combinations is likely to be prohibitive.
33 | 
34 | I probably need to update this package to automatically check Unicode's public emoji files for updates so that I don't need to do it manually every time...
35 | 
36 | # Anything else
37 | 
38 | Feel free to email me about any of this stuff.
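Here is the rough sketch referred to in the To do section. It is not part of the package: a hypothetical helper (name and signature invented here) which assumes you have the set of emoji sequences that `big_regex` is built from, for example the `allowed_emoji` set produced by `update_regex/update_regex.py`, and which compiles a smaller alternation limited to the sequences whose characters actually occur in the text:

```python
import re

def build_reduced_regex(text, emoji_sequences):
    """Hypothetical helper, not part of emoji_extractor.

    Compile an alternation covering only the emoji sequences whose
    component characters all occur in `text`, longest first — the same
    ordering trick big_regex relies on for multi-codepoint sequences."""
    present = set(text)
    candidates = sorted((e for e in emoji_sequences if set(e) <= present),
                        key=len, reverse=True)
    if not candidates:
        return None
    return re.compile('|'.join(re.escape(e) for e in candidates), re.UNICODE)
```

Whether this actually wins would depend, as noted above, on how quickly the reduced pattern compiles compared with simply running the full `big_regex`.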
39 | -------------------------------------------------------------------------------- /update_regex/update_regex.py: -------------------------------------------------------------------------------- 1 | import re 2 | import pickle 3 | import requests 4 | from collections import defaultdict 5 | 6 | def convert_code(original): 7 | codes = [] 8 | 9 | for data in original.split(' '): 10 | c = f"\\U{int(data, 16):08x}" 11 | 12 | codes.append(c) 13 | 14 | return ''.join(codes).upper() 15 | 16 | def shorten_name(name): 17 | for i in [": light skin tone", ": medium-light skin tone", ": medium skin tone", ": medium-dark skin tone", ": dark skin tone", 18 | ", light skin tone", ", medium-light skin tone", ", medium skin tone", ", medium-dark skin tone", ", dark skin tone"]: 19 | name = name.replace(i, '') 20 | 21 | return name 22 | 23 | 24 | print('Loading latest Unicode data...') 25 | 26 | raw_data = requests.get('https://unicode.org/Public/emoji/latest/emoji-test.txt') 27 | 28 | 29 | print('\tDone.') 30 | 31 | data = raw_data.content.decode().splitlines() 32 | 33 | current_group = None 34 | current_subgroup = None 35 | version = None 36 | 37 | name_to_codepoint = defaultdict(list) 38 | allowed_emoji = set() 39 | tme_names = set() 40 | 41 | 42 | print('Processing...') 43 | for line in data: 44 | if line == '': 45 | continue 46 | 47 | if line == '#EOF': 48 | break 49 | 50 | if line[0] != '#': 51 | codepoints = re.sub(r'\s+', ' ', line.split(';')[0].strip()) 52 | 53 | emoji = ''.join([chr(int(c, 16)) for c in codepoints.split()]) 54 | 55 | status = line.split(';')[1].split('#')[0].strip() 56 | 57 | name = re.split(r' E\d+\.\d+ ', line)[1].strip() 58 | 59 | if status != 'component': 60 | allowed_emoji.add(emoji) 61 | 62 | c = convert_code(codepoints) 63 | 64 | name_to_codepoint[name.split(':')[0]].append(c) 65 | 66 | if 'skin tone' in name: 67 | tme_names.add(name.split(':')[0]) 68 | 69 | continue 70 | 71 | if line.startswith('# Version:'): 72 | version = line.split(' ')[-1] 73 | 74 | codepoint_data = sorted([i for j in name_to_codepoint.values() for i in j], key=lambda x: len(x), reverse=True) 75 | 76 | tme_codepoints = [] 77 | 78 | for tme in tme_names: 79 | tme_codepoints.extend(name_to_codepoint[tme]) 80 | 81 | tme_codepoints = sorted(tme_codepoints, key=lambda x: len(x), reverse=True) 82 | 83 | print('\tDone.') 84 | print(f'\tFound {len(codepoint_data)} emoji codepoint sequences.') 85 | 86 | big_regex = re.compile('|'.join(codepoint_data), re.UNICODE) 87 | tme_regex = re.compile('|'.join(tme_codepoints), re.UNICODE) 88 | 89 | 90 | print('Saving to disk...') 91 | with open('../emoji_extractor/data/big_regex.pkl', 'wb') as f: 92 | pickle.dump(big_regex, f) 93 | 94 | with open('../emoji_extractor/data/possible_emoji.pkl', 'wb') as f: 95 | pickle.dump(allowed_emoji, f) 96 | 97 | with open('../emoji_extractor/data/tme_regex.pkl', 'wb') as f: 98 | pickle.dump(tme_regex, f) 99 | 100 | print('\tDone.') -------------------------------------------------------------------------------- /emoji_extractor/extract.py: -------------------------------------------------------------------------------- 1 | import re 2 | import pickle 3 | 4 | from importlib import resources 5 | 6 | from collections import Counter 7 | from collections.abc import Iterable 8 | 9 | data_path = resources.files('emoji_extractor') / 'data' 10 | 11 | regex_file = data_path / 'big_regex.pkl' 12 | emoji_file = data_path / 'possible_emoji.pkl' 13 | tme_regex_file = data_path / 'tme_regex.pkl' 14 | 15 | 16 | class Extractor: 17 | """ 18 | Extract 
emoji from strings.
19 |     Return a count of the emoji found.
20 |     """
21 |     def __init__(self, regex=regex_file, emoji=emoji_file, tme=tme_regex_file):
22 |         with open(regex, 'rb') as f:
23 |             self.big_regex = pickle.load(f)
24 | 
25 |         with open(emoji, 'rb') as f:
26 |             self.possible_emoji = pickle.load(f)
27 | 
28 |         # 'tme' = tone-modifiable emoji, i.e. emoji which can take a skin-tone modifier
29 |         with open(tme, 'rb') as f:
30 |             self.tme = pickle.load(f)
31 | 
32 |         # the five skin-tone modifier characters themselves
33 |         self.tones_re = re.compile(r'[🏻🏼🏽🏾🏿]', re.UNICODE)
34 | 
35 |     def detect_emoji(self, string):
36 |         # cheap check: does the string contain any character that could be (part of) an emoji?
37 |         return set(string).intersection(self.possible_emoji) != set()
38 | 
39 |     def count_emoji(self, string, check_first=True):
40 |         if check_first:
41 |             if self.detect_emoji(string):
42 |                 return Counter(self.big_regex.findall(string))
43 |             else:
44 |                 return Counter()
45 |         else:
46 |             return Counter(self.big_regex.findall(string))
47 | 
48 |     def count_tme(self, string, check_first=True):
49 |         if check_first:
50 |             if self.detect_emoji(string):
51 |                 return Counter(self.tme.findall(string))
52 |             else:
53 |                 return Counter()
54 |         else:
55 |             return Counter(self.tme.findall(string))
56 | 
57 |     def count_tones(self, string, check_first=True):
58 |         if check_first:
59 |             if self.detect_emoji(string):
60 |                 return Counter(self.tones_re.findall(string))
61 |             else:
62 |                 return Counter()
63 |         else:
64 |             return Counter(self.tones_re.findall(string))
65 | 
66 |     def count_all_tones(self, iterable, check_first=True):
67 |         running_total = Counter()
68 | 
69 |         if isinstance(iterable, str):
70 |             raise TypeError("This method is not for single strings. Use count_tones() instead")
71 | 
72 |         try:
73 |             for string in iterable:
74 |                 running_total.update(self.count_tones(string, check_first=check_first))
75 | 
76 |             return running_total
77 |         except TypeError:
78 |             raise TypeError('This method requires an iterable of strings.')
79 | 
80 |     def count_all_emoji(self, iterable, check_first=True):
81 |         running_total = Counter()
82 | 
83 |         if isinstance(iterable, str):
84 |             raise TypeError("This method is not for single strings. Use count_emoji() instead")
85 | 
86 |         try:
87 |             for string in iterable:
88 |                 running_total.update(self.count_emoji(string, check_first=check_first))
89 | 
90 |             return running_total
91 |         except TypeError:
92 |             raise TypeError('This method requires an iterable of strings.')
93 | 
94 |     def count_all_tme(self, iterable, check_first=True):
95 |         running_total = Counter()
96 | 
97 |         if isinstance(iterable, str):
98 |             raise TypeError("This method is not for single strings. Use count_tme() instead")
99 | 
100 |         try:
101 |             for string in iterable:
102 |                 running_total.update(self.count_tme(string, check_first=check_first))
103 | 
104 |             return running_total
105 |         except TypeError:
106 |             raise TypeError('This method requires an iterable of strings.')
--------------------------------------------------------------------------------
/notebooks/examples.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {},
7 |    "outputs": [],
8 |    "source": [
9 |     "# 10,000 strings. 1,580 contain emoji\n",
10 |     "with open('random_10k.txt', 'r', encoding='utf-8') as f:\n",
11 |     "    random_10k = f.read().splitlines()\n",
12 |     "\n",
13 |     "# 10,000 strings. 
All 10,000 contain emoji\n", 14 | "with open('emojis_10k.txt', 'r', encoding='utf-8') as f:\n", 15 | " emojis_10k = f.read().splitlines()" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "from emoji_extractor.extract import Extractor\n", 25 | "\n", 26 | "extract = Extractor()" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "# Usage 1\n", 34 | "## Pass in strings without knowing beforehand if they contain emoji" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "With ```check_first=True```, determine if there are actually any emoji present to count.\n", 42 | "\n", 43 | "If there are, then count them. If not, then return an empty dictionary - which would have been the result anyway." 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "Counter()\n", 56 | "Counter({'👀': 1})\n", 57 | "Counter({'🌟': 1, '💕': 1})\n", 58 | "Counter()\n", 59 | "Counter()\n", 60 | "Counter()\n", 61 | "Counter()\n", 62 | "Counter({'⏰': 1, '🥅': 1, '🆚': 1, '🏆': 1})\n", 63 | "Counter()\n", 64 | "Counter()\n" 65 | ] 66 | } 67 | ], 68 | "source": [ 69 | "for t in random_10k[0:10]:\n", 70 | " print(extract.count_emoji(t, check_first=True))" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "A collections.Counter object is returned. If no emoji were counted, then the Counter will be empty." 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "# Usage 2\n", 85 | "\n", 86 | "## Pass in strings that you already know contain emoji" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "With ```check_first=False```, just assume that there will be emoji present.\n", 94 | "\n", 95 | "Perhaps because you've already filtered the data." 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 4, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "Counter({'😇': 2, '💦': 2})\n", 108 | "Counter({'😭': 1})\n", 109 | "Counter({'🛫': 1, '🏆': 1})\n", 110 | "Counter({'🍎': 1})\n", 111 | "Counter({'💁🏻': 1})\n", 112 | "Counter({'🔴': 1})\n", 113 | "Counter({'☺': 1})\n", 114 | "Counter({'😿': 1, '💛': 1})\n", 115 | "Counter({'😿': 1, '💛': 1})\n", 116 | "Counter({'😿': 1, '💛': 1})\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "for t in emojis_10k[0:10]: \n", 122 | " print(extract.count_emoji(t, check_first=False))" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "# Usage 3\n", 130 | "## Pass an iterable of strings\n", 131 | "\n", 132 | "Counters have a useful ```most_common``` method." 
133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 5, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "[('😿', 3),\n", 144 | " ('💛', 3),\n", 145 | " ('😇', 2),\n", 146 | " ('💦', 2),\n", 147 | " ('😭', 1),\n", 148 | " ('🛫', 1),\n", 149 | " ('🏆', 1),\n", 150 | " ('🍎', 1),\n", 151 | " ('💁🏻', 1),\n", 152 | " ('🔴', 1),\n", 153 | " ('☺', 1)]" 154 | ] 155 | }, 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "count = extract.count_all_emoji(emojis_10k[0:10])\n", 163 | "\n", 164 | "count.most_common()" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 6, 170 | "metadata": {}, 171 | "outputs": [ 172 | { 173 | "data": { 174 | "text/plain": [ 175 | "[('😂', 2813),\n", 176 | " ('❤', 1150),\n", 177 | " ('😍', 974),\n", 178 | " ('😭', 933),\n", 179 | " ('💕', 552),\n", 180 | " ('🔥', 485),\n", 181 | " ('✨', 430),\n", 182 | " ('♥', 286),\n", 183 | " ('😊', 277),\n", 184 | " ('😘', 236)]" 185 | ] 186 | }, 187 | "execution_count": 6, 188 | "metadata": {}, 189 | "output_type": "execute_result" 190 | } 191 | ], 192 | "source": [ 193 | "count2 = extract.count_all_emoji(emojis_10k)\n", 194 | "\n", 195 | "count2.most_common(n=10)" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "# Speed comparison\n", 203 | "When you're not sure if you have emoji in all strings, it's obviously faster to check first before trying to count since counting involves a lot of searches." 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "metadata": {}, 209 | "source": [ 210 | "Example of how much slower it is without checking first:" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 7, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "2.58 s ± 69.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 223 | ] 224 | } 225 | ], 226 | "source": [ 227 | "%%timeit\n", 228 | "for t in random_10k:\n", 229 | " extract.count_emoji(t, check_first=False)" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "And with checking, where you only run the counter for 15% of the strings:" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 8, 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "496 ms ± 13.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" 249 | ] 250 | } 251 | ], 252 | "source": [ 253 | "%%timeit\n", 254 | "for t in random_10k:\n", 255 | " extract.count_emoji(t, check_first=True)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "If you already know that every string has emoji, then checking makes no difference:" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 9, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "2.52 s ± 41.2 ms per loop (mean ± std. dev. 
of 7 runs, 1 loop each)\n"
275 |      ]
276 |     }
277 |    ],
278 |    "source": [
279 |     "%%timeit\n",
280 |     "for t in emojis_10k: \n",
281 |     "    extract.count_emoji(t, check_first=False)"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "code",
286 |    "execution_count": 10,
287 |    "metadata": {},
288 |    "outputs": [
289 |     {
290 |      "name": "stdout",
291 |      "output_type": "stream",
292 |      "text": [
293 |       "2.57 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
294 |      ]
295 |     }
296 |    ],
297 |    "source": [
298 |     "%%timeit\n",
299 |     "for t in emojis_10k: \n",
300 |     "    extract.count_emoji(t, check_first=True)"
301 |    ]
302 |   },
303 |   {
304 |    "cell_type": "markdown",
305 |    "metadata": {},
306 |    "source": [
307 |     "# Tone-modifiable emoji\n",
308 |     "\n",
309 |     "These are emoji which can be modified for skin tone with {'🏻', '🏼', '🏽', '🏾', '🏿'}"
310 |    ]
311 |   },
312 |   {
313 |    "cell_type": "code",
314 |    "execution_count": 34,
315 |    "metadata": {},
316 |    "outputs": [
317 |     {
318 |      "name": "stdout",
319 |      "output_type": "stream",
320 |      "text": [
321 |       "dict_items([('💁🏻', 1)]) \t ['🏻']\n",
322 |       "dict_items([('👉', 1)]) \t ['Y']\n",
323 |       "dict_items([('🤷🏾', 1)]) \t ['🏾']\n",
324 |       "dict_items([('🤹🏻', 1)]) \t ['🏻']\n",
325 |       "dict_items([('🤷🏽', 1)]) \t ['🏽']\n",
326 |       "dict_items([('✋', 1)]) \t ['Y']\n",
327 |       "dict_items([('💪🏻', 1)]) \t ['🏻']\n",
328 |       "dict_items([('👍', 1)]) \t ['Y']\n",
329 |       "dict_items([('🤘', 1)]) \t ['Y']\n",
330 |       "dict_items([('🤙🏾', 1), ('🤘🏾', 1)]) \t ['🏾', '🏾']\n",
331 |       "dict_items([('👨', 2), ('👦', 2)]) \t ['Y', 'Y']\n",
332 |       "dict_items([('👏', 1)]) \t ['Y']\n",
333 |       "dict_items([('💪', 3)]) \t ['Y']\n",
334 |       "dict_items([('👏', 4)]) \t ['Y']\n"
335 |      ]
336 |     }
337 |    ],
338 |    "source": [
339 |     "for t in emojis_10k[0:100]: \n",
340 |     "    e = extract.count_tme(t, check_first=False)\n",
341 |     "    \n",
342 |     "    if e:\n",
343 |     "        print(e.items(), '\\t', [i[1] if len(i) == 2 else 'Y' for i in e])\n"
344 |    ]
345 |   }
346 |  ],
347 |  "metadata": {
348 |   "kernelspec": {
349 |    "display_name": "Python [default]",
350 |    "language": "python",
351 |    "name": "python3"
352 |   },
353 |   "language_info": {
354 |    "codemirror_mode": {
355 |     "name": "ipython",
356 |     "version": 3
357 |    },
358 |    "file_extension": ".py",
359 |    "mimetype": "text/x-python",
360 |    "name": "python",
361 |    "nbconvert_exporter": "python",
362 |    "pygments_lexer": "ipython3",
363 |    "version": "3.6.6"
364 |   }
365 |  },
366 |  "nbformat": 4,
367 |  "nbformat_minor": 2
368 | }
369 | 
--------------------------------------------------------------------------------