AraNet is a deep learning toolkit for Arabic social media processing. AraNet predicts age, dialect, gender, emotion, irony, and sentiment from social media posts, delivering either state-of-the-art or competitive performance on these tasks. It also has the advantage of using a unified, simple framework based on the recently-developed BERT model. AraNet has the potential to alleviate issues related to comparing across different Arabic social media NLP tasks, by providing one way to test new models against AraNet predictions (i.e., model-based comparisons). AraNet can be used to make important discoveries about the Arab world, a vast geographical region of strategic importance. It can also enhance our understanding of Arabic online communities and Arabic digital culture in general.
## How to install
- Using pip
```shell
pip install git+https://github.com/UBC-NLP/aranet
```
- Clone and install
```shell
git clone https://github.com/UBC-NLP/aranet
cd aranet
pip install .
```
## Download models
- [Sentiment](https://drive.google.com/file/d/1W5171aT-1rYwK2iQyKGF0C2kT1UPjq11/view?usp=sharing)
- [Dialect](https://drive.google.com/file/d/1e85Y1bhvPc9yjwKq-lSIewHIRgCDa4YS/view?usp=sharing)
- [Emotion](https://drive.google.com/file/d/1D1nvw715Yp_yK6XYYPfUxnxygfPEEK2k/view?usp=sharing)
- [Irony](https://drive.google.com/file/d/1FzWmCNISoWwGJbdNM65frDJi4QVs_-g1/view?usp=sharing)
- Gender
- Age
## How to use
*You can easily use AraNet in your own code.*

**Load the model**
```python
from aranet import aranet
dialect_obj = aranet.AraNet(model_path)
```
**Predict one sentence**
```python
dialect_obj.predict(text=text_str)
```
**Load from file/batch**
```python
dialect_obj.predict(path=file_path)
```

*You can also use AraNet from the terminal:*
```shell
python ./aranet/aranet.py \
    --path model_path \
    --batch file_path
```
## Examples
### Dialect
```python
# load the AraNet dialect model
model_path = "./models/dialect_aranet/"
dialect_obj = aranet.AraNet(model_path)
```
```python
text_str = "انا هاخد ده لو سمحت"
dialect_obj.predict(text=text_str)
```
[('Egypt', 0.9993844)]
```python
text_str = "العشا اليوم كان عند الشيخ علي حمدي الحداد ، لمؤخذة بقى على الخيانة ، ايش مشاك غادي"
dialect_obj.predict(text=text_str)
```
[('Libya', 0.763)]
```python
text_str = "يعيشك برقا"
dialect_obj.predict(text=text_str)
```
[('Tunisia', 0.998887)]
### Sentiment
```python
# load the AraNet sentiment model
model_path = "./models/sentiment_aranet/"
senti_obj = aranet.AraNet(model_path)
```
```python
text_str = "ما اكره واحد قد هذا المنافق"
senti_obj.predict(text=text_str)
```
[('neg', 0.8975404)]
```python
text_str = "يعيشك برقا"
senti_obj.predict(text=text_str)
```
[('pos', 0.747435)]
### Emotion
```python
# load the AraNet emotion model
model_path = "./models/emotion_aranet/"
emo_obj = aranet.AraNet(model_path)
```
```python
text_str = "الله عليكي و انتي دائما مفرحانا"
emo_obj.predict(text=text_str)
```
[('happy', 0.89688617)]
```python
text_str = "لم اعرف المستحيل يوما"
emo_obj.predict(text=text_str)
```
[('trust', 0.27242294)]
### Gender
```python
# load the AraNet gender model
model_path = "./models/gender_aranet/"
gender_obj = aranet.AraNet(model_path)
text_str = "الله عليكي و انتي دائما مفرحانا"
gender_obj.predict(text=text_str)
```
[('female', 0.8405795)]
### Load from file/batch
```
input_text file: one sentence per line, for example
--------------
انا هاخد ده لو سمحت
العشا اليوم كان عند الشيخ علي حمدي الحداد ، لمؤخذة بقى على الخيانة ، ايش مشاك غادي
----------------
```
```python
model_path = "./models/dialect_aranet/"
dialect_obj = aranet.AraNet(model_path)
dialect_obj.predict(path=file_path)
```
[('Egypt', 0.9993844), ('Libya', 0.76300025)]
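### Full label distribution
`predict` also accepts a `with_dist` flag (see `aranet/aranet.py` below) that additionally returns the full probability distribution over labels. A minimal sketch, reusing the dialect model from the examples above:
```python
model_path = "./models/dialect_aranet/"
dialect_obj = aranet.AraNet(model_path)
# with_dist=True also returns the full (label, probability) distribution
dialect_obj.predict(text="انا هاخد ده لو سمحت", with_dist=True)
```
From the terminal, the same behavior is available through the `--dist` (`-d`) option:
```shell
python ./aranet/aranet.py --path model_path --batch file_path --dist
```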
# Inquiries?
If you have any questions about this toolkit, please contact us at *muhammad.mageed[at]ubc[dot]ca*.


---
## Reference/Citation:
Please cite our work:
```
@inproceedings{abdul-mageed-etal-2020-aranet,
    title = "{A}ra{N}et: A Deep Learning Toolkit for {A}rabic Social Media",
    author = "Abdul-Mageed, Muhammad and Zhang, Chiyu and Hashemi, Azadeh and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resource Association",
    url = "https://www.aclweb.org/anthology/2020.osact-1.3",
    pages = "16--23",
    language = "English",
    ISBN = "979-10-95546-51-1",
}
```
-------------------------------------------------------------------------------- /aranet/__init.py: --------------------------------------------------------------------------------
from .aranet import AraNet
-------------------------------------------------------------------------------- /aranet/aranet.py: --------------------------------------------------------------------------------
# encoding: utf-8
import optparse
import os, json
import numpy as np
import pandas as pd
from keras.preprocessing.sequence import pad_sequences
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification


class AraNet():
    '''
    Loads a fine-tuned BERT classification model and predicts labels
    (with probabilities) for single sentences or batches of sentences.
    '''

    def __init__(self, path):
        # check if a GPU is available
        self.__device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.__n_gpu = torch.cuda.device_count()

        # check if the path exists
        if path is None:
            raise Exception('Undefined path to model')
        if not os.path.exists(path):
            raise Exception("Couldn't find the path %s" % path)
        self.__path = path

        # check if the path contains a model directory
        model_path = os.path.join(self.__path, 'model')
        if not os.path.exists(model_path):
            raise Exception("Couldn't find the path %s" % model_path)

        # load the model
        try:
            self.__tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=True)
            self.__model = BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_path)

            self.__model = self.__model.to(self.__device)

        except Exception as e:
            raise Exception("Couldn't load the model", e)

        # load the labels dictionary and build the inverse (index -> label) mapping
        dict_path = os.path.join(self.__path, 'labels-dict.json')
        if not os.path.exists(dict_path):
            raise Exception("Couldn't find the path %s" % dict_path)
        with open(dict_path) as json_file:
            self.__lab2ind = json.load(json_file)
        self.__ind2lab = {}
        for label in self.__lab2ind.keys():
            self.__ind2lab[self.__lab2ind[label]] = label
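
    # labels-dict.json is expected to map each label name to its integer
    # class index, e.g. (hypothetical contents for a dialect model):
    #   {"Egypt": 0, "Libya": 1, "Tunisia": 2}
    # The inverse mapping built above turns argmax indices back into labels.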

    def predict(self, text=None, path=None, with_dist=False):
        # init the normalizer
        language_normalizer = _araNorm()

        if text is not None:
            sentences = [text]
        elif path is not None:
            if not os.path.exists(path):
                raise Exception("File not found %s" % path)

            # read the batch file (one sentence per line)
            df = pd.read_csv(path, delimiter='\t', header=None, names=['sentence'])

            # create the sentence list
            sentences = df.sentence.values
        else:
            raise Exception('No text/path specified')

        # normalize each sentence and prepend BERT's [CLS] classification token
        sentences = ["[CLS] " + language_normalizer.run(sentence) for sentence in sentences]

        # load test data
        test_inputs, test_masks = self.__data_prepare(sentences=sentences)
        test_data = TensorDataset(test_inputs, test_masks)
        test_dataloader = DataLoader(test_data, batch_size=100)

        # set the model to evaluation mode
        self.__model.eval()

        results = []
        for batch in test_dataloader:
            batch = tuple(t.to(self.__device) for t in batch)

            # unpack the inputs from the dataloader
            b_input_ids, b_input_mask = batch
            del batch

            # tell the model not to compute or store gradients, saving memory and speeding up inference
            with torch.no_grad():
                # forward pass, calculate logit predictions
                logits = self.__model(input_ids=b_input_ids, attention_mask=b_input_mask)

            # move logits to CPU
            logits = logits[0].cpu()

            pred_probs = nn.functional.softmax(logits, dim=1)
            max_values, max_indices = torch.max(pred_probs, 1)
            max_values = max_values.numpy()
            max_indices = max_indices.numpy()
            pred_probs = pred_probs.numpy()

            for i in range(len(max_indices)):
                if with_dist:
                    results.extend([(self.__ind2lab[max_indices[i]], max_values[i],
                                     tuple(zip(self.__ind2lab.values(), pred_probs[i])))])
                else:
                    results.extend([(self.__ind2lab[max_indices[i]], max_values[i])])

        return results
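
    # With with_dist=False, predict returns one (label, probability) tuple per
    # sentence; with with_dist=True, each tuple also carries the full
    # distribution, e.g. (hypothetical scores):
    #   [('Egypt', 0.99, (('Egypt', 0.99), ('Libya', 0.01)))]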

    def __data_prepare(self, sentences, MAX_LEN=50):
        # tokenize the text with the BERT tokenizer
        tokenized_texts = [self.__tokenizer.tokenize(sent) for sent in sentences]
        # convert the tokens to their index numbers in the BERT vocabulary
        input_ids = [self.__tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

        # pad (and truncate) the input token ids to MAX_LEN
        input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype=np.int_, truncating="post", padding="post")

        # create attention masks: 1 for each real token, 0 for padding
        attention_masks = []
        for seq in input_ids:
            seq_mask = [float(i > 0) for i in seq]
            attention_masks.append(seq_mask)

        # convert all of the data into torch tensors, the required datatype for the model
        inputs = torch.LongTensor(input_ids)
        masks = torch.tensor(attention_masks)

        return inputs, masks
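
# A rough end-to-end sketch of the normalizer defined below (hypothetical input):
#   _araNorm().run("@elmadany كوووول http://example.com 123")
# should yield approximately: "USER كوول URL NUM"
# (usernames -> USER, links -> URL, numbers -> NUM, long character runs reduced)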
'''------------------------------------------------------------------------------------
Script: Normalization class
Authors: Abdel-Rahim Elmadany and Muhammad Abdul-Mageed
Creation date: November, 2018
Last update: Jan, 2019
input: text
output: normalized text
------------------------------------------------------------------------------------
Normalization functions:
- Check if the text contains at least one Arabic letter, then run the normalizer
- Normalize Alef and Yeh forms
- Remove tashkeel (diacritics) from Arabic text
- Reduce repetitions of more than 2 characters at a time
- Replace links with URL
- Replace Twitter usernames with the word USER
- Replace numbers with NUM
- Remove non-letter characters such as emoticons
------------------------------------------------------------------------------------'''
import re


class _araNorm():
    '''
    araNorm is a normalizer class for Arabic text
    '''

    def __init__(self):
        '''
        List of normalized characters
        '''
        self.normalize_chars = {u"\u0622": u"\u0627", u"\u0623": u"\u0627", u"\u0625": u"\u0627",
                                # all Alef forms to bare Alef (no hamza)
                                u"\u0649": u"\u064A",  # ALEF MAKSURA to YEH
                                u"\u0629": u"\u0647"   # TEH MARBUTA to HEH
                                }
        '''
        Tashkeel (diacritic) and tatweel unicode characters, with their transliterations
        '''
        self.Tashkeel_underscore_chars = {u"\u0640": "_", u"\u064E": 'a', u"\u064F": 'u', u"\u0650": 'i',
                                          u"\u0651": '~', u"\u0652": 'o', u"\u064B": 'F', u"\u064C": 'N',
                                          u"\u064D": 'K'}

    def __normalizeChar(self, inputText):
        '''
        step #2: Normalize Alef and Yeh forms
        '''
        norm = ""
        for char in inputText:
            if char in self.normalize_chars:
                norm = norm + self.normalize_chars[char]
            else:
                norm = norm + char
        return norm

    def __remover_tashkeel(self, inputText):
        '''
        step #3: Remove tashkeel (diacritics) from Arabic text
        '''
        text_without_Tashkeel = ""
        for char in inputText:
            if char not in self.Tashkeel_underscore_chars:
                text_without_Tashkeel += char
        return text_without_Tashkeel

    def __reduce_characters(self, inputText):
        '''
        step #4: Reduce repetitions of more than 2 characters at a time
        For example: the word 'cooooool' is converted to 'cool'
        '''
        # pattern to look for three or more repetitions of any character, including newlines
        pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
        reduced_text = pattern.sub(r"\1\1", inputText)
        return reduced_text

    def __replace_links(self, inputText):
        '''
        step #5: replace links with URL
        For example: http://too.gl/sadsad322 is replaced with URL
        '''
        text = re.sub('(\w+:\/\/[ ]*\S+)', '+++++++++', inputText)
        text = re.sub('\++', 'URL', text)
        return re.sub('(URL\s*)+', ' URL ', text)

    def __replace_username(self, inputText):
        '''
        step #6: replace Twitter usernames with the word USER
        For example: @elmadany is replaced with USER
        '''
        text = re.sub('(@[a-zA-Z0-9_]+)', 'USER', inputText)
        return re.sub('(USER\s*)+', ' USER ', text)

    def __replace_Number(self, inputText):
        '''
        step #7: replace numbers with NUM
        For example: \d+ is replaced with NUM
        '''
        text = re.sub('[\d\.]+', 'NUM', inputText)
        return re.sub('(NUM\s*)+', ' NUM ', text)

    def __remove_nonLetters_Digits(self, inputText):
        '''
        step #8: Remove non-letter and digit characters (e.g., emoticons)
        This step is very important for word2vec and similar models, and for dictionaries
        '''
        p1 = re.compile('[\W_\d\s]', re.IGNORECASE | re.UNICODE)
        sent = re.sub(p1, ' ', inputText)
        p1 = re.compile('\s+')
        sent = re.sub(p1, ' ', sent)
        return sent

    def run(self, text):
        text = self.__normalizeChar(text)
        text = self.__remover_tashkeel(text)
        text = self.__reduce_characters(text)
        text = self.__replace_links(text)
        text = self.__replace_username(text)
        text = self.__replace_Number(text)
        text = self.__remove_nonLetters_Digits(text)
        # collapse whitespace and strip leading/trailing spaces
        return re.sub('\s+', ' ', text).strip()


def main():
    parser = optparse.OptionParser()
    parser.add_option('-p', '--path', action="store", default=None,
                      help='specify the model path on the command line')
    parser.add_option('-b', '--batch', action="store", default=None,
                      help='specify a file path on the command line')
    parser.add_option('-d', '--dist', action='store_true', default=False,
                      help='show the full probability distribution over labels')

    options, args = parser.parse_args()

    identifier = AraNet(options.path)
    if options.batch is not None:
        # ==== Batch Mode ====
        predictions = identifier.predict(text=None, path=options.batch, with_dist=options.dist)
        print(predictions)
    else:
        import sys
        if sys.stdin.isatty():
            # ==== Interactive Mode ====
            while True:
                try:
                    print(">>>", end=' ')
                    text = input()
                except Exception as e:
                    print(e)
                    break
                predictions = identifier.predict(text=text, path=None, with_dist=options.dist)
                print(predictions)
        else:
            # ==== Redirected Mode ====
            # read one sentence per line from stdin
            predictions = []
            for text in sys.stdin.read().splitlines():
                predictions.extend(identifier.predict(text=text, path=None, with_dist=options.dist))
            print(predictions)


if __name__ == "__main__":
    main()
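
# Example invocations (assuming a model downloaded to ./models/dialect_aranet/
# and a hypothetical input file tweets.txt):
#   python aranet/aranet.py --path ./models/dialect_aranet/ --batch tweets.txt
#   echo "انا هاخد ده لو سمحت" | python aranet/aranet.py --path ./models/dialect_aranet/ --dist
# After `pip install .`, the console script declared in setup.py should also work:
#   aranet --path ./models/dialect_aranet/ --batch tweets.txt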
-------------------------------------------------------------------------------- /setup.py: --------------------------------------------------------------------------------
from setuptools import setup

setup(name='aranet',
      version='0.1',
      description='AraNet: a deep learning toolkit for Arabic social media processing',
      url='https://github.com/UBC-NLP/aranet',
      author='',
      author_email='',
      license='GNU',
      packages=['aranet'],
      install_requires=[
          'tensorflow-gpu',
          'torch',
          'scikit-learn',
          'transformers==2.3.0', 'keras', 'pandas', 'numpy'
      ],

      entry_points={
          'console_scripts': [
              'aranet = aranet.aranet:main',
          ],
      },
      package_data={'aranet': ['resources/*']},
      include_package_data=True,
      zip_safe=False)