├── hate_speech_mlma.zip
├── notes
├── requirements.txt
├── LICENSE
├── constants.py
├── annotated_data_processing.py
├── keywords.txt
├── README.md
├── baseline_classifiers.py
├── predictors.py
├── run_sluice_net.py
├── utils.py
├── guidelines.tar
├── pilot_dataset_tweets_only.tar
└── sluice_net.py
/hate_speech_mlma.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HKUST-KnowComp/MLMA_hate_speech/HEAD/hate_speech_mlma.zip
--------------------------------------------------------------------------------
/notes:
--------------------------------------------------------------------------------
1 | Although we have done our best to avoid scams and to counter common misconceptions, there are still, as in any
2 | complicated human annotation task, unreliable annotations.
3 | For instance, some annotators of the French data were not aware that the word m******
4 | is an insulting slur towards people with Down syndrome rather than a reference to origin.
5 | Similarly, annotators sometimes confused wordings of nationality with those of gender/gender identity in Arabic.
6 |
7 | We will publish more insights about the data, our annotation guidelines, and the dataset with emojis soon.
8 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | To replicate our experiments, you will need:
2 | Python 3.6 onwards
3 | DyNet 0.0.0 and its dependencies (follow the instructions on https://dynet.readthedocs.io/en/latest/python.html)
4 |
5 | On a side note, when you install DyNet make sure that you are using CUDA 9 and CUDNN for CUDA 9. I used the following command:
6 | CUDNN_ROOT=/path_to_conda/pkgs/cudnn-7.3.1-cuda10.0_0 BACKEND=/path_to_conda/pkgs/cudatoolkit-10.0.130-0 pip install git+https://github.com/clab/dynet#egg=dynet
7 |
8 | Using CUDA 10 will generate an error when calling DyNet.
9 |
10 | Sluice Networks (Ruder et al., 2017)
11 | Babylon/MUSE embeddings
12 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 HKUST-KnowComp
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/constants.py:
--------------------------------------------------------------------------------
1 | """
2 | Constants shared across files.
3 | """
4 | import re
5 |
6 | # special tokens and number regex
7 | UNK = '_UNK' # unk/OOV word/char
8 | WORD_START = '' # word start
9 | WORD_END = '' # word end
10 | NUM = 'NUM' # number normalization string
11 | NUMBERREGEX = re.compile("[0-9]+|[0-9]+\\.[0-9]+|[0-9]+[0-9,]+")
12 |
13 | # tasks
14 |
15 | TASK_NAMES = ['group', #The target group of the tweet
16 | 'annotator_sentiment', #The sentiment of the annotator with respect to the tweet
17 | 'directness', #Whether the tweet is direct or indirect hate speech
18 | 'target', #The characteristic based on which the tweet discriminates people (e.g., race).
19 | 'sentiment' ] #The sentiment expressed by the tweet
20 |
21 | # word embeddings
22 | EMBEDS = ['babylon', 'muse', 'umwe', None]
23 |
24 | EMBEDS_FILES = {'babylon': '../data/bi-embedding-babylon78/transformed_embeds/',
25 | 'muse': '../data/bi-embedding-muse/',
26 | 'umwe':' '}
27 |
28 |
29 | #Dictionary of tasks and corresponding labels
30 | LABELS = {'group':['arabs', 'other', 'african_descent', 'left_wing_people', 'asians',
31 | 'hispanics', 'muslims', 'individual', 'special_needs', 'christian', 'immigrants', 'jews' ,
32 | 'women', 'indian/hindu', 'gay', 'refugees'],
33 | 'annotator_sentiment':['indifference', 'sadness', 'disgust', 'shock', 'confusion',
34 | 'anger', 'fear'],
35 | 'directness':['direct', 'indirect'],
36 | 'target':['origin', 'religion', 'disability', 'gender', 'sexual_orientation', 'other'],
37 | 'sentiment':['disrespectful', 'fearful', 'offensive', 'abusive', 'hateful', 'normal']}
38 | MODIFIED_LABELS = {'group':['arabs', 'other', 'african_descent', 'left_wing_people', 'asians',
39 | 'hispanics', 'muslims', 'individual', 'special_needs', 'christian', 'immigrants', 'jews' ,
40 | 'women', 'indian/hindu', 'gay', 'refugees'],
41 | 'annotator_sentiment':['indifference', 'sadness', 'shock', 'confusion','anger', 'fear'],
42 | 'directness':['direct', 'indirect'],
43 | 'target':['origin', 'religion', 'disability', 'gender', 'sexual_orientation', 'other'],
44 | 'sentiment':['somewhatoffensive', 'offensive', 'veryoffensive', 'normal']}
45 |
46 | #'directness':['direct', 'indirect', 'none'], #to be added
47 |
48 | # languages
49 | LANGUAGES = ['ar', 'en', 'fr']
50 | FULL_LANG = {'ar': 'Arabic', 'en': 'English', 'fr': 'French'}
51 |
52 |
53 |
54 |
55 | # optimizers
56 | SGD = 'sgd'
57 | ADAM = 'adam'
58 |
59 |
60 | # cross-stitch and layer-stitch initialization schemes
61 | BALANCED = 'balanced'
62 | IMBALANCED = 'imbalanced'
63 |
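64 | # Minimal usage sketch (illustrative only, not used elsewhere in this repository):
65 | # build a label-to-id mapping per task, similar in shape to the task2label2id
66 | # dictionaries handled in utils.py.
67 | if __name__ == '__main__':
68 |     task2label2id = {task: {label: idx for idx, label in enumerate(LABELS[task])}
69 |                      for task in TASK_NAMES}
70 |     print(task2label2id['directness'])  # {'direct': 0, 'indirect': 1}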
--------------------------------------------------------------------------------
/annotated_data_processing.py:
--------------------------------------------------------------------------------
1 | import re
2 | from nltk.corpus import stopwords
3 | import numpy as np
4 | import pandas as pd
5 |
6 | def clean_text(text):
7 | """
8 | text: a string
9 |
10 | return: modified initial string
11 | """
12 | replace_by_blank_symbols = re.compile('\u00bb|\u00a0|\u00d7|\u00a3|\u00eb|\u00fb|\u00fb|\u00f4|\u00c7|\u00ab|\u00a0\ude4c|\udf99|\udfc1|\ude1b|\ude22|\u200b|\u2b07|\uddd0|\ude02|\ud83d|\u2026|\u201c|\udfe2|\u2018|\ude2a|\ud83c|\u2018|\u201d|\u201c|\udc69|\udc97|\ud83e|\udd18|\udffb|\ude2d|\udc80|\ud83e|\udd2a|\ud83e|\udd26|\u200d|\u2642|\ufe0f|\u25b7|\u25c1|\ud83e|\udd26|\udffd|\u200d|\u2642|\ufe0f|\udd21|\ude12|\ud83e|\udd14|\ude03|\ude03|\ude03|\ude1c|\udd81|\ude03|\ude10|\u2728|\udf7f|\ude48|\udc4d|\udffb|\udc47|\ude11|\udd26|\udffe|\u200d|\u2642|\ufe0f|\udd37|\ude44|\udffb|\u200d|\u2640|\udd23|\u2764|\ufe0f|\udc93|\udffc|\u2800|\u275b|\u275c|\udd37|\udffd|\u200d|\u2640|\ufe0f|\u2764|\ude48|\u2728|\ude05|\udc40|\udf8a|\u203c|\u266a|\u203c|\u2744|\u2665|\u23f0|\udea2|\u26a1|\u2022|\u25e1|\uff3f|\u2665|\u270b|\u270a|\udca6|\u203c|\u270c|\u270b|\u270a|\ude14|\u263a|\udf08|\u2753|\udd28|\u20ac|\u266b|\ude35|\ude1a|\u2622|\u263a|\ude09|\udd20|\udd15|\ude08|\udd2c|\ude21|\ude2b|\ude18|\udd25|\udc83|\ude24|\udc3e|\udd95|\udc96|\ude0f|\udc46|\udc4a|\udc7b|\udca8|\udec5|\udca8|\udd94|\ude08|\udca3|\ude2b|\ude24|\ude23|\ude16|\udd8d|\ude06|\ude09|\udd2b|\ude00|\udd95|\ude0d|\udc9e|\udca9|\udf33|\udc0b|\ude21|\udde3|\ude37|\udd2c|\ude21|\ude09|\ude39|\ude42|\ude41|\udc96|\udd24|\udf4f|\ude2b|\ude4a|\udf69|\udd2e|\ude09|\ude01|\udcf7|\ude2f|\ude21|\ude28|\ude43|\udc4a|\uddfa|\uddf2|\udc4a|\ude95|\ude0d|\udf39|\udded|\uddf7|\udded|\udd2c|\udd4a|\udc48|\udc42|\udc41|\udc43|\udc4c|\udd11|\ude0f|\ude29|\ude15|\ude18|\ude01|\udd2d|\ude43|\udd1d|\ude2e|\ude29|\ude00|\ude1f|\udd71|\uddf8|\ude20|\udc4a|\udeab|\udd19|\ude29|\udd42|\udc4a|\udc96|\ude08|\ude0d|\udc43|\udff3|\udc13|\ude0f|\udc4f|\udff9|\udd1d|\udc4a|\udc95|\udcaf|\udd12|\udd95|\udd38|\ude01|\ude2c|\udc49|\ude01|\udf89|\udc36|\ude0f|\udfff|\udd29|\udc4f|\ude0a|\ude1e|\udd2d|\uff46|\uff41|\uff54|\uff45|\uffe3|\u300a|\u300b|\u2708|\u2044|\u25d5|\u273f|\udc8b|\udc8d|\udc51|\udd8b|\udd54|\udc81|\udd80|\uded1|\udd27|\udc4b|\udc8b|\udc51|\udd90|\ude0e')
13 | replace_by_apostrophe_symbol = re.compile('\u2019')
14 | replace_by_dash_symbol = re.compile('\u2014')
15 | replace_by_u_symbols = re.compile('\u00fb|\u00f9')
16 | replace_by_a_symbols = re.compile('\u00e2|\u00e0')
17 | replace_by_c_symbols = re.compile('\u00e7')
18 | replace_by_i_symbols = re.compile('\u00ee|\u00ef')
19 | replace_by_o_symbols = re.compile('\u00f4')
20 | replace_by_oe_symbols = re.compile('\u0153')
21 | replace_by_e_symbols = re.compile('\u00e9|\u00ea|\u0117|\u00e8')
22 | REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|,;]')
23 | text = replace_by_e_symbols.sub('e', text)
24 | text = replace_by_a_symbols.sub('a', text)
25 | text = replace_by_o_symbols.sub('o', text)
26 | text = replace_by_oe_symbols.sub('oe', text)
27 | text = replace_by_u_symbols.sub('u', text)
28 | text = replace_by_i_symbols.sub('i', text)
29 | text = replace_by_c_symbols.sub('c', text)
30 | text = replace_by_apostrophe_symbol.sub("'", text)
31 | text = replace_by_dash_symbol.sub("_", text)
32 | text = replace_by_blank_symbols.sub('', text)
33 |
34 | #For English
35 | #text = ''.join([c for c in text if ord(c) < 128])
36 | text = text.replace("\\", "")
37 | #text = text.encode("ascii", errors="ignore").decode()
38 | text = text.lower() # lowercase text
39 | STOPWORDS = set(stopwords.words('english'))
40 | text = text.lower() # lowercase text
41 | text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
42 | text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
43 | return text
44 |
45 |
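46 | # Example usage (a minimal sketch; the sample string below is made up for illustration and
47 | # assumes the NLTK stopwords corpus has been downloaded via nltk.download('stopwords')).
48 | if __name__ == '__main__':
49 |     sample = "RT @user: Refugees are welcome here!!! #NoHate"
50 |     print(clean_text(sample))  # lowercased, bracket/comma symbols replaced by spaces, English stopwords removed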
--------------------------------------------------------------------------------
/keywords.txt:
--------------------------------------------------------------------------------
1 | *** Below are keywords/phrases that we used for our search. They include different spellings of common slurs, controversial topics,
2 | insulting patterns during debates, ... ***
3 |
4 | ENGLISH
5 | ___________
6 | Immigrants
7 | refugees
8 | Africa
9 | AllLivesMatter
10 | immigrant terror
11 | abusive women
12 | harassment
13 | fake empowerment
14 | African attack
15 | protect from immigrants
16 | Eurafrica
17 | Immigrant not welcome
18 | illegal immigrants
19 | It's okay to be white
20 | feminazi
21 | violence
22 | refugees impact
23 | asians
24 | terrorist leftist
25 | terrorist threat
26 | ching chong
27 | dirty jew
28 | dirty pig
29 | redneck
30 | bring Ebola
31 | bring disease
32 | negro
33 | illegal aliens
34 | intellectual terrorist
35 | intellectual terrorism
36 | women threat
37 | woman terrorist
38 | shithole country
39 | shithole countries
40 | go back to where you come from
41 | go back to where they come from
42 | loud mexican
43 | muslim countries
44 | muslim country
45 | muslim immigrant
46 | muslim refugee
47 | white christian
48 | christian country
49 | immigrant invasion
50 | muslim invasion
51 | refugee invasion
52 | antiwhite
53 | anti white
54 | eurafrica
55 | jihadi
56 | russian rat
57 | mongol
58 | mongoloid
59 | mongy
60 | downy
61 | retard
62 | boojie
63 | surrender monkey
64 | chinaman
65 | hillbilly
66 | whigger
67 | white nigger
68 | wigger
69 | wigerette
70 | bitter clinger
71 | conspiracy theorist
72 | redneck
73 | trailer trash
74 | white trash
75 | yobbo
76 | retard
77 | nigger
78 | white trash
79 | trailer trash
80 | raghead
81 | camel fucker
82 | moon cricket
83 | wetback
84 | spic
85 | cunt
86 | dyke
87 | twat
88 | chinaman
89 | muzzie
90 | faggot
91 | __________________________________________________________________________________________________
92 | French
93 | _________
94 | invasion réfugié
95 | invasion migration
96 | invasion musulman
97 | invasion islam
98 | invasion migrant
99 | complotiste
100 | complotisme
101 | réfugiés
102 | migrants
103 | Afrique
104 | migrants africains
105 | réfugiés
106 | migrants
107 | Afrique
108 | debile
109 | débile
110 | mongol
111 | attardé
112 | attarde
113 | facho
114 | e-facho
115 | action sociale
116 | vive front national
117 | djihadiste
118 | eurafrique
119 | anti blanc
120 | entiblanc
121 | beurres
122 | rebeus
123 | renois
124 | chinetoque
125 | mongol
126 | terroristes
127 | arabes
128 | invasion des réfugiés
129 | France musulmane
130 | islamisation
131 | triage des migrants
132 | contre l'avortement
133 | femme autonome
134 | liberte de la femme
135 | harcèlement
136 | blanc fier
137 | implantation des migrants
138 | violence
139 | blackos
140 | terrorisme musulman
141 | terrorisme islam
142 | terrorisme intellectuel
143 | expulser migrants
144 | islam gauchiste
145 | gauchiste
146 | expulser migrant
147 | ching chong
148 | sale arabe
149 | sale africain
150 | negre
151 | negre maghrebins
152 | noich
153 | sale rebeu
154 | sale renoi
155 | africain ramene madadie
156 | migrant maladie
157 | ramene maladie
158 | arabe chameau
159 | africain chameau
160 | sale russe
161 | ramene ebola
162 | negro
163 | Afreurope
164 | __________________________________________________________________________________________________
165 |
166 | Arabic
167 | _________
168 | اللاجئون
169 | اللاجئون ضرر
170 | اللاجئون خطر
171 | اللاجئون و الامراض
172 | افارقة ايبولا
173 | افارقة كوليرا
174 | افارقة مرض
175 | خنازير
176 | حرمة قاصر
177 | سوري وسخ
178 | هندوس
179 | بايرة
180 | مراة عورة
181 | مراة متسلطة
182 | شينغ شونغ
183 | لاجئ وسخ
184 | مراة قاصر
185 | افارقة نجس
186 | افارقة وسخ
187 | عرب وسخ
188 | بول البعير
189 | سواقة حريم
190 | مراة قاصر
191 | خنزير
192 | وسخ
193 | منحرفة
194 | العادات و التقاليد
195 | الحريم
196 | بهيم
197 | الافارقة
198 | السود
199 | متبرجة
200 | الزواج و الطلاق للمراة
201 | ديوث
202 | فمينيست
203 | العنف
204 | اسقاط ولاية المراة
205 | الصينيون
206 | افريقيا السمراء
207 | التحرش
208 | افريقيا
209 | اغبياء
210 | منغول
211 | مغول
212 | ديوث
213 | وسخ شيعة
214 | ايران وسخ
215 | طائفة
216 | طائفي
217 | علماني
218 | جهاد
219 | كحالش
220 | معاق ذهني
221 | شيعة رافضة
222 | شيعي
223 | رافضي
224 | علوي
225 | اكراد
226 | لجوء احتلال
227 | لاجئ احتلال
228 | مثلية
229 | مثليين
230 | كلب ايران
231 | فارسي وسخ
232 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Dataset and code of our EMNLP 2019 Paper (Multilingual and Multi-Aspect Hate Speech Analysis)
2 | If you use our dataset or our guidelines, please cite this paper:
3 |
4 | @inproceedings{ousidhoum-etal-multilingual-hate-speech-2019,
5 | title = "Multilingual and Multi-Aspect Hate Speech Analysis",
6 | author = "Ousidhoum, Nedjma
7 | and Lin, Zizheng
8 | and Zhang, Hongming
9 | and Song, Yangqiu
10 | and Yeung, Dit-Yan",
11 | booktitle = "Proceedings of EMNLP",
12 | year = "2019",
13 | publisher = "Association for Computational Linguistics",
14 | }
15 |
16 |
17 | ## Update
18 | - If you need the individual labels of some data instances (unfortunately, I could not find all the batches on the cloud years later), please send me an email at OusidhoumN(at)cardiff(dot)ac(dot)uk.
19 |
20 | - The dataset is available on HuggingFace https://huggingface.co/datasets/nedjmaou/MLMA_hate_speech
21 |
22 | ## Clarification
23 | The multi-labeled tasks are *the hostility type of the tweet* and the *annotator's sentiment*. (We kept labels on which at least two annotators agreed.)
24 |
25 | ## Taxonomy
26 | In further experiments that involved binary classification tasks of the hostility/hate/abuse type, we considered single-labeled *normal* instances to be *non-hate/non-toxic* and all the other instances to be *toxic*.
27 |
28 | ## Dataset
29 | Our dataset is composed of three CSV files, one per language. They contain the tweets and the annotations described in our paper:
30 |
31 | the hostility type *(column: tweet sentiment)*
32 |
33 | hostility directness *(column: directness)*
34 |
35 | target attribute *(column: target)*
36 |
37 | target group *(column: group)*
38 |
39 | annotator's sentiment *(column: annotator sentiment)*.
40 |
41 | ## Experiments
42 |
43 | To replicate our experiments, please follow the guidelines below.
44 |
45 | ### Requirements
46 | Python 3.6 onwards,
47 |
48 | DyNet 0.0.0 and its dependencies (follow the instructions on https://dynet.readthedocs.io/en/latest/python.html).
49 |
50 | [On a side note, when you install DyNet make sure that you are using CUDA 9 and CUDNN for CUDA 9. I used the following command:
51 |
52 | CUDNN_ROOT=/path/to/conda/pkgs/cudnn-7.3.1-cuda10.0_0 \
53 | BACKEND=/path/to/conda/pkgs/cudatoolkit-10.0.130-0 \
54 | pip install git+https://github.com/clab/dynet#egg=dynet
55 |
56 | Using CUDA 10 will generate an error when calling DyNet for GPUs.]
57 |
58 | Cross-lingual word embeddings (Babylon or MUSE; the reported results were obtained using Babylon).
59 |
60 |
61 | ### Python files
62 |
63 | - annotated_data_processing.py contains a normalization function that cleans the content of the tweets.
64 |
65 | - constants.py defines constants used across all the files.
66 |
67 | - utils.py contains utility functions for data processing.
68 |
69 | - baseline_classifiers.py allows you to run majority voting and logistic regression by calling:
70 |
71 | run_majority_voting(train_filename, dev_filename, test_filename, attribute)
72 |
73 | or
74 |
75 | run_logistic_regression(train_filename, dev_filename, test_filename, attribute)
76 |
77 | on CSV files of the same form as the dataset.
78 |
79 | - predictors.py contains classes for sequence predictors and layers.
80 | 
81 | - run_sluice_net.py: the script to train, load, and evaluate the SluiceNetwork.
82 | 
83 | - sluice_net.py: the main logic of the SluiceNetwork (Ruder et al., 2017). More details on the implementation of sluice networks can be found here.
84 |
85 | ### How to run the program
86 | To save and load the trained model, you need to create a directory (e.g., model/) and specify its name with the --model-dir argument on the command line.
87 | 
88 | To save the log files of training and evaluation, you need to create a directory (e.g., log/) and specify its name with the --log-dir argument on the command line.
89 |
90 | #### Example:
91 | python run_sluice_net.py --dynet-autobatch 1 --dynet-gpus 3 --dynet-seed 123 \
92 | --h-layers 1 \
93 | --cross-stitch\
94 | --num-subspaces 2 --constraint-weight 0.1 \
95 | --constrain-matrices 1 2 --patience 3 \
96 | --languages ar en fr \
97 | --test-languages ar en fr \
98 | --model-dir model/ --log-dir log/\
99 | --task-names annotator_sentiment sentiment directness group target \
100 | --train-dir '/path/to/train' \
101 | --dev-dir '/path/to/dev' \
102 | --test-dir '/path/to/test' \
103 | --embeds babylon --h-dim 200 \
104 | --cross-stitch-init-scheme imbalanced \
105 | --threshold 0.1
106 |
107 | ### NB
108 | - The meaning of each argument can be found in run_sluice_net.py.
109 | 
110 | - '--task-names' refers to the list of task names (e.g., annotator_sentiment).
111 | 
112 | - '--languages' refers to the language datasets used for training.
113 | 
114 | - '--test-languages' can only be a subset of '--languages'.
115 |
116 |
117 |
--------------------------------------------------------------------------------
/baseline_classifiers.py:
--------------------------------------------------------------------------------
1 | import re
2 | from collections import Counter
3 | import os
4 | import matplotlib
5 | import numpy as np
6 | import pandas as pd
7 | from pandas import Series
8 | from sklearn.pipeline import Pipeline
9 | from sklearn.model_selection import train_test_split
10 | from sklearn.feature_extraction.text import CountVectorizer
11 | from sklearn.feature_extraction.text import TfidfTransformer
12 | from sklearn.naive_bayes import MultinomialNB
13 | from sklearn.metrics import accuracy_score
14 | from sklearn.linear_model import LogisticRegression
15 | from sklearn.pipeline import Pipeline
16 | from sklearn.preprocessing import LabelBinarizer, LabelEncoder
17 | from sklearn.metrics import classification_report
18 | from annotated_data_processing import clean_text
19 | from sklearn.preprocessing import MultiLabelBinarizer
20 | from sklearn.metrics import accuracy_score
21 | from sklearn.metrics import f1_score
22 | from skmultilearn.problem_transform import ClassifierChain
23 | from sklearn.dummy import DummyClassifier
24 | from constants import LABELS
25 |
26 | #logistic regression for multilabel tasks: annotator's sentiment and hostility type (tweet sentiment)
27 | def lr_multilabel_classification(train_filename, dev_filename, test_filename, attribute):
28 | df_train = pd.read_csv(train_filename)
29 | df_dev = pd.read_csv(dev_filename)
30 | df_test = pd.read_csv(test_filename)
31 | mlb = MultiLabelBinarizer()
32 | X_train = df_train.tweet.apply(clean_text)
33 | y_train_text = df_train[attribute].apply(lambda x: x.split('_'))
34 | y_train = mlb.fit_transform(y_train_text)
35 | X_dev = df_dev.tweet.apply(clean_text)
36 | y_dev_text = df_dev[attribute].apply(lambda x: x.split('_'))
37 | y_dev = mlb.fit_transform(y_dev_text)
38 | X_test = df_test.tweet.apply(clean_text)
39 | y_test_text = df_test[attribute].apply(lambda x: x.split('_'))
40 | y_test = mlb.fit_transform(y_test_text)
41 | count_vect = CountVectorizer()
42 | X_train_counts = count_vect.fit_transform(X_train)
43 | tfidf_transformer = TfidfTransformer()
44 | X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
45 | Y = mlb.fit_transform(y_train_text)
46 | classifier = Pipeline([
47 | ('vectorizer', CountVectorizer()),
48 | ('tfidf', TfidfTransformer()),
49 | ('clf', ClassifierChain(LogisticRegression()))])
50 | classifier.fit(X_train, y_train)
51 | y_pred = classifier.predict(X_test)
52 | print('accuracy %s' % accuracy_score(y_pred, y_test))
53 | print('Test macro F1 score is %s' % f1_score(y_test, y_pred, average='macro'))
54 | print('Test micro F1 score is %s' % f1_score(y_test, y_pred, average='micro'))
55 |
56 | #majority voting for multilabel tasks: annotator's sentiment and hostility type (tweet sentiment)
57 | def majority_voting_multilabel_classification(train_filename, dev_filename, test_filename, attribute):
58 | df_train = pd.read_csv(train_filename)
59 | df_dev = pd.read_csv(dev_filename)
60 | df_test = pd.read_csv(test_filename)
61 | mlb = MultiLabelBinarizer()
62 | X_train = df_train.tweet.apply(clean_text)
63 | y_train_text = df_train[attribute].apply(lambda x: x.split('_'))
64 | y_train = mlb.fit_transform(y_train_text)
65 | X_dev = df_dev.tweet.apply(clean_text)
66 | y_dev_text = df_dev[attribute].apply(lambda x: x.split('_'))
67 | y_dev = mlb.fit_transform(y_dev_text)
68 | X_test = df_test.tweet.apply(clean_text)
69 | y_test_text = df_test[attribute].apply(lambda x: x.split('_'))
70 | y_test = mlb.fit_transform(y_test_text)
71 | count_vect = CountVectorizer()
72 | X_train_counts = count_vect.fit_transform(X_train)
73 | tfidf_transformer = TfidfTransformer()
74 | X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
75 | Y = mlb.fit_transform(y_train_text)
76 | classifier = Pipeline([
77 | ('vectorizer', CountVectorizer()),
78 | ('tfidf', TfidfTransformer()),
79 | ('clf', ClassifierChain(DummyClassifier()))])
80 |
81 | classifier.fit(X_train, y_train)
82 | y_pred = classifier.predict(X_test)
83 | print('Accuracy %s' % accuracy_score(y_pred, y_test))
84 | print('Test macro F1 score is %s' % f1_score(y_test, y_pred, average='macro'))
85 | print('Test micro F1 score is %s' % f1_score(y_test, y_pred, average='micro'))
86 |
87 |
88 | #majority voting for non-multilabel tasks, namely: target, group, and directness
89 | def majority_voting_non_multilabel_classification(train_filename, dev_filename, test_filename, attribute):
90 | my_labels=LABELS[attribute]
91 | df_train = pd.read_csv(train_filename)
92 | df_dev = pd.read_csv(dev_filename)
93 | df_test = pd.read_csv(test_filename)
94 | X_train = df_train.tweet.apply(clean_text)
95 | y_train = df_train[attribute]
96 | X_dev = df_dev.tweet.apply(clean_text)
97 | y_dev = df_dev[attribute]
98 | X_test = df_test.tweet.apply(clean_text)
99 | y_test = df_test[attribute]
100 | count_vect = CountVectorizer()
101 | X_train_counts = count_vect.fit_transform(X_train)
102 | tfidf_transformer = TfidfTransformer()
103 | X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
104 | dummy = Pipeline([('vect', CountVectorizer()),
105 | ('tfidf', TfidfTransformer()),
106 | ('clf', DummyClassifier()),
107 | ])
108 | dummy.fit(X_train, y_train)
109 | y_pred = dummy.predict(X_test)
110 | print('Accuracy %s' % accuracy_score(y_pred, y_test))
111 | print(classification_report(y_test, y_pred,target_names=my_labels,labels=my_labels))
112 | print('Test macro F1 score is %s' % f1_score(y_test, y_pred, average='macro'))
113 | print('Test micro F1 score is %s' % f1_score(y_test, y_pred, average='micro'))
114 |
115 |
116 | #logistic regression for non-multilabel tasks, namely: target, group, and directness
117 | def lr_non_multilabel_classification(train_filename, dev_filename, test_filename, attribute):
118 | my_labels=LABELS[attribute]
119 | df_train = pd.read_csv(train_filename)
120 | df_dev = pd.read_csv(dev_filename)
121 | df_test = pd.read_csv(test_filename)
122 | X_train = df_train.tweet.apply(clean_text)
123 | y_train = df_train[attribute]
124 | X_dev = df_dev.tweet.apply(clean_text)
125 | y_dev = df_dev[attribute]
126 | X_test = df_test.tweet.apply(clean_text)
127 | y_test = df_test[attribute]
128 | count_vect = CountVectorizer()
129 | X_train_counts = count_vect.fit_transform(X_train)
130 | tfidf_transformer = TfidfTransformer()
131 | X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
132 | logreg = Pipeline([('vect', CountVectorizer()),
133 | ('tfidf', TfidfTransformer()),
134 | ('clf', LogisticRegression(n_jobs=1, C=1e5)),
135 | ])
136 | logreg.fit(X_train, y_train)
137 | y_pred = logreg.predict(X_test)
138 | print('accuracy %s' % accuracy_score(y_pred, y_test))
139 | print('Test macro F1 score is %s' % f1_score(y_test, y_pred, average='macro'))
140 | print('Test micro F1 score is %s' % f1_score(y_test, y_pred, average='micro'))
141 |
142 |
143 | def run_majority_voting(train_filename, dev_filename, test_filename, attribute):
144 | #multilabel tasks
145 | if(attribute=='sentiment' or attribute=='annotator_sentiment'):
146 | return majority_voting_multilabel_classification(train_filename, dev_filename, test_filename, attribute)
147 | #non-multilabel tasks
148 | elif(attribute=='target' or attribute =='group' or attribute=='directness'):
149 | return majority_voting_non_multilabel_classification(train_filename, dev_filename, test_filename, attribute)
150 |
151 | def run_logistic_regression(train_filename, dev_filename, test_filename, attribute):
152 | #multilabel tasks
153 | if(attribute=='sentiment' or attribute=='annotator_sentiment'):
154 | return lr_multilabel_classification(train_filename, dev_filename, test_filename, attribute)
155 | #non-multilabel tasks
156 | elif(attribute=='target' or attribute =='group' or attribute=='directness'):
157 | return lr_non_multilabel_classification(train_filename, dev_filename, test_filename, attribute)
158 |
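159 | # Example usage (a minimal sketch; the CSV paths below are placeholders and are assumed to
160 | # have the same columns as the released dataset files, e.g. tweet, sentiment, directness,
161 | # annotator_sentiment, target, group):
162 | #
163 | # run_majority_voting('en_train.csv', 'en_dev.csv', 'en_test.csv', 'sentiment')
164 | # run_logistic_regression('en_train.csv', 'en_dev.csv', 'en_test.csv', 'target')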
--------------------------------------------------------------------------------
/predictors.py:
--------------------------------------------------------------------------------
1 | """
2 | Classes for predictors and special layers.
3 | """
4 | import dynet
5 | import numpy as np
6 |
7 | from constants import BALANCED, IMBALANCED
8 |
9 |
10 | class SequencePredictor:
11 | """Convenience class to wrap a sequence prediction model."""
12 | def __init__(self, builder):
13 | """Initializes the model. Expects a LSTMBuilder or SimpleRNNBuilder."""
14 | self.builder = builder
15 |
16 | def predict_sequence(self, inputs):
17 | """Predicts the output of a sequence."""
18 | return [self.builder(x) for x in inputs]
19 |
20 |
21 | class RNNSequencePredictor(SequencePredictor):
22 | """Convenience class to wrap an RNN model."""
23 | def predict_sequence(self, inputs):
24 | s_init = self.builder.initial_state()
25 | return [x.output() for x in s_init.add_inputs(inputs)]
26 |
27 |
28 | class BiRNNSequencePredictor(SequencePredictor):
29 | """Convenience class to wrap an LSTM builder."""
30 | def predict_sequence(self, f_inputs, b_inputs):
31 | f_init = self.builder.initial_state()
32 | b_init = self.builder.initial_state()
33 | forward_sequence = [x.output() for x in f_init.add_inputs(f_inputs)]
34 | backward_sequence = [x.output() for x in b_init.add_inputs(
35 | reversed(b_inputs))]
36 | return forward_sequence, backward_sequence
37 |
38 |
39 | class CrossStitchLayer:
40 | """Cross-stitch layer class."""
41 | def __init__(self, model, num_tasks, hidden_dim, num_subspaces=1,
42 | init_scheme=BALANCED):
43 | """
44 | Initializes a CrossStitchLayer.
45 | :param model: the DyNet Model
46 | :param num_tasks: the number of tasks
47 | :param hidden_dim: the # of hidden dimensions of the previous LSTM layer
48 | :param num_subspaces: the number of subspaces
49 | :param init_scheme: the initialization scheme; balanced or imbalanced
50 | """
51 | print('Using %d subspaces...' % num_subspaces, flush=True)
52 | alpha_params = np.full((num_tasks * num_subspaces,
53 | num_tasks * num_subspaces),
54 | 1. / (num_tasks * num_subspaces))
55 | if init_scheme == IMBALANCED:
56 | if num_subspaces == 1:
57 | alpha_params = np.full((num_tasks, num_tasks),
58 | 0.1 / (num_tasks - 1))
59 | for i in range(num_tasks):
60 | alpha_params[i, i] = 0.9
61 | else:
62 | # 0 1 0 1
63 | # 0 1 0 1
64 | # 1 0 1 0
65 | # 1 0 1 0
66 | for (x, y), value in np.ndenumerate(alpha_params):
67 | if (y + 1) % num_subspaces == 0 and not \
68 | (x in range(num_tasks, num_tasks+num_subspaces)):
69 | alpha_params[x, y] = 0.95
70 | elif (y + num_subspaces) % num_subspaces == 0 and x \
71 | in range(num_tasks, num_tasks+num_subspaces):
72 | alpha_params[x, y] = 0.95
73 | else:
74 | alpha_params[x, y] = 0.05
75 |
76 | self.alphas = model.add_parameters(
77 | (num_tasks*num_subspaces, num_tasks*num_subspaces),
78 | init=dynet.NumpyInitializer(alpha_params))
79 | print('Initializing cross-stitch units to:', flush=True)
80 | print(dynet.parameter(self.alphas).value(), flush=True)
81 | self.num_tasks = num_tasks
82 | self.num_subspaces = num_subspaces
83 | self.hidden_dim = hidden_dim
84 |
85 | def stitch(self, predictions):
86 | """
87 | Takes as inputs a list of the predicted states of the previous layers of
88 | each task, e.g. for two tasks a list containing two lists of
89 | n-dimensional output states. For every time step, the predictions of
90 | each previous task layer are then multiplied with the cross-stitch
91 | units to obtain a linear combination. In the end, we obtain a list of
92 | lists of linear combinations of states for every subsequent task layer.
93 | :param predictions: a list of length num_tasks containing the predicted
94 | states for each task
95 | :return: a list of length num_tasks containing the linear combination of
96 | predictions for each task
97 | """
98 | assert self.num_tasks == len(predictions)
99 | linear_combinations = []
100 | # iterate over tuples of predictions of each task at every time step
101 | for task_predictions in zip(*predictions):
102 | # concatenate the predicted state for all tasks to a matrix of shape
103 | # (num_tasks*num_subspaces, hidden_dim/num_subspaces);
104 | # we can multiply this directly with the alpha values
105 | concat_task_predictions = dynet.reshape(
106 | dynet.concatenate_cols(list(task_predictions)),
107 | (self.num_tasks*self.num_subspaces,
108 | self.hidden_dim / self.num_subspaces))
109 |
110 | # multiply the alpha matrix with the concatenated predictions to
111 | # produce a linear combination of predictions
112 | alphas = dynet.parameter(self.alphas)
113 | product = alphas * concat_task_predictions
114 | if self.num_subspaces != 1:
115 | product = dynet.reshape(product,
116 | (self.num_tasks, self.hidden_dim))
117 | linear_combinations.append(product)
118 |
119 | stitched = [linear_combination for linear_combination in
120 | zip(*linear_combinations)]
121 | return stitched
122 |
123 |
124 | class LayerStitchLayer:
125 | """Layer-stitch layer class."""
126 | def __init__(self, model, num_layers, hidden_dim, init_scheme=IMBALANCED):
127 | """
128 | Initializes a LayerStitchLayer.
129 | :param model: the DyNet model
130 | :param num_layers: the number of layers
131 | :param hidden_dim: the hidden dimensions of the LSTM layers
132 | :param init_scheme: the initialisation scheme; balanced or imbalanced
133 | """
134 | if init_scheme == IMBALANCED:
135 | beta_params = np.full((num_layers), 0.1 / (num_layers - 1))
136 | beta_params[-1] = 0.9
137 | elif init_scheme == BALANCED:
138 | beta_params = np.full((num_layers), 1. / num_layers)
139 | else:
140 | raise ValueError('Invalid initialization scheme for layer-stitch '
141 | 'units: %s.' % init_scheme)
142 | self.betas = model.add_parameters(
143 | num_layers, init=dynet.NumpyInitializer(beta_params))
144 | print('Initializing layer-stitch units to:', flush=True)
145 | print(dynet.parameter(self.betas).value(), flush=True)
146 | self.num_layers = num_layers
147 | self.hidden_dim = hidden_dim
148 |
149 | def stitch(self, layer_predictions):
150 | """
151 | Takes as input the predicted states of all the layers of a task-specific
152 | network and produces a linear combination of them.
153 | :param layer_predictions: a list of length num_layers containing lists
154 | of length seq_len of predicted states for
155 | each layer
156 | :return: a list of linear combinations of the predicted states at every
157 | time step for each layer
158 | """
159 | assert len(layer_predictions) == self.num_layers
160 |
161 | concatenated_layer_states = dynet.reshape(dynet.concatenate_cols(\
162 | list(layer_predictions)), (self.num_layers, self.hidden_dim))
163 |
164 | product = None
165 | if(self.num_layers > 1):
166 | product = dynet.transpose(dynet.parameter(
167 | self.betas)) * concatenated_layer_states
168 | else:
169 | product = dynet.parameter(self.betas) * concatenated_layer_states
170 |
171 | reshaped = dynet.reshape(product, (self.hidden_dim,))
172 |
173 | return reshaped
174 |
175 |
176 |
177 | class Layer:
178 | """Class for a single layer or a two-layer MLP."""
179 | def __init__(self, model, in_dim, output_dim, activation=dynet.tanh,
180 | mlp=False):
181 | """
182 | Initialize the layer and add its parameters to the model.
183 | :param model: the DyNet Model
184 | :param in_dim: the input dimension
185 | :param output_dim: the output dimension
186 | :param activation: the activation function that should be used
187 | :param mlp: if True, add a hidden layer with 100 dimensions
188 | """
189 | self.act = activation
190 | self.mlp = mlp
191 | if mlp:
192 | mlp_dim = 100
193 | self.W_mlp = model.add_parameters((mlp_dim, in_dim))
194 | self.b_mlp = model.add_parameters((mlp_dim))
195 | else:
196 | mlp_dim = in_dim
197 | self.W_out = model.add_parameters((output_dim, mlp_dim))
198 | self.b_out = model.add_parameters((output_dim))
199 |
200 | def __call__(self, x):
201 | if self.mlp:
202 | W_mlp = dynet.parameter(self.W_mlp)
203 | b_mlp = dynet.parameter(self.b_mlp)
204 | input = dynet.rectify(W_mlp*x + b_mlp)
205 | else:
206 | input = x
207 | W_out = dynet.parameter(self.W_out)
208 | b_out = dynet.parameter(self.b_out)
209 | act = self.act(W_out*input + b_out)
210 | return act
211 |
212 |
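213 | # A small numpy-only sketch (illustrative, not part of the original code) of the linear
214 | # combination computed by CrossStitchLayer.stitch at a single time step with
215 | # num_tasks=2 and num_subspaces=1:
216 | if __name__ == '__main__':
217 |     hidden_dim = 4
218 |     alphas = np.array([[0.9, 0.1],                    # imbalanced initialization for two tasks
219 |                        [0.1, 0.9]])
220 |     task_states = np.stack([np.ones(hidden_dim),      # hidden state of task 1 at this time step
221 |                             2 * np.ones(hidden_dim)])  # hidden state of task 2 at this time step
222 |     stitched = alphas @ task_states                   # shape (num_tasks, hidden_dim)
223 |     print(stitched)  # row i is the stitched input fed to task i's next layer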
--------------------------------------------------------------------------------
/run_sluice_net.py:
--------------------------------------------------------------------------------
1 | """
2 | Main script
3 | """
4 | import argparse
5 | import os
6 | import random
7 | import sys
8 |
9 | import numpy as np
10 |
11 | import dynet
12 |
13 | from constants import TASK_NAMES, LANGUAGES, EMBEDS, BALANCED, IMBALANCED, SGD, ADAM
14 | from sluice_net import SluiceNetwork, load
15 | import utils
16 |
17 |
18 | def check_activation_function(arg):
19 | """Checks allowed argument for --ac option."""
20 | try:
21 | functions = [dynet.rectify, dynet.tanh]
22 | functions = {function.__name__: function for function in functions}
23 | functions['None'] = None
24 | return functions[str(arg)]
25 | except:
26 | raise argparse.ArgumentTypeError(
27 | 'String {} does not match required format'.format(arg, ))
28 |
29 |
30 | def main(args):
31 |
32 |
33 | train_score = {task: 0 for task in args.task_names}
34 | dev_score = {task: 0 for task in args.task_names}
35 | avg_train_score = 0
36 | avg_dev_score = 0
37 |
38 | if args.load:
39 | assert os.path.exists(args.model_dir),\
40 | ('Error: Trying to load the model but %s does not exist.' %
41 | args.model_dir)
42 | print('Loading model from directory %s...' % args.model_dir)
43 |
44 | model_file = None
45 | params_file = None
46 |
47 | #Load models from different directory based on the type (STSL, MTSL, STML, MTML)
48 | if(len(args.task_names) ==1):
49 |
50 | if(len(args.languages) == 1):
51 |
52 | model_file = os.path.join(args.model_dir, 'STSL/{}_{}.model'.format(args.languages[0],args.task_names[0]))
53 | params_file = os.path.join(args.model_dir, 'STSL/{}_{}.pkl'.format(args.languages[0],args.task_names[0]))
54 |
55 | else:
56 |
57 | model_file = os.path.join(args.model_dir, 'STML/{}.model'.format(args.task_names[0]))
58 |
59 | params_file = os.path.join(args.model_dir, 'STML/{}.pkl'.format(args.task_names[0]))
60 |
61 | else:
62 |
63 | if(len(args.languages) ==1):
64 |
65 | model_file = os.path.join(args.model_dir, 'MTSL/{}.model'.format(args.languages[0]))
66 | params_file = os.path.join(args.model_dir, 'MTSL/{}.pkl'.format(args.languages[0]))
67 | else:
68 | model_file = os.path.join(args.model_dir, 'MTML/MTML.model')
69 |
70 | params_file = os.path.join(args.model_dir, 'MTML/MTML.pkl')
71 |
72 |
73 |
74 | model, train_score, dev_score, avg_train_score, avg_dev_score = load(params_file, model_file, args)
75 |
76 | if(args.load_action == 'train'):#Continue to train the loaded model
77 | train_score, dev_score, avg_train_score, avg_dev_score= model.fit(args.languages, args.test_languages, args.epochs, args.patience, args.opt, args.threshold,
78 | train_dir=args.train_dir, dev_dir=args.dev_dir)#added args.threshold
79 |
80 | else:
81 | model = SluiceNetwork(args.h_dim,
82 | args.h_layers,
83 | args.model_dir,
84 | args.log_dir,
85 | embeds=args.embeds,
86 | activation=args.activation,
87 | lower=args.lower,
88 | noise_sigma=args.sigma,
89 | task_names=args.task_names,
90 | languages = args.languages,
91 | cross_stitch=args.cross_stitch,
92 | num_subspaces=args.num_subspaces,
93 | constraint_weight=args.constraint_weight,
94 | constrain_matrices=args.constrain_matrices,
95 | cross_stitch_init_scheme=
96 | args.cross_stitch_init_scheme,
97 | layer_stitch_init_scheme=
98 | args.layer_stitch_init_scheme)
99 | train_score, dev_score, avg_train_score, avg_dev_score = model.fit(args.languages, args.test_languages, args.epochs, args.patience, args.opt, args.threshold, train_dir=args.train_dir, dev_dir=args.dev_dir)
100 |
101 |
102 |
103 | print('='*50)
104 | print('Start testing', ','.join(args.test_languages))
105 |
106 | for test_lang in args.test_languages:
107 | test_X, test_Y, _ = utils.get_data(
108 | [test_lang], model.task_names, model.word2id,
109 | model.task2label2id, data_dir=args.test_dir, train=False)
110 |
111 | test_score = model.evaluate(test_X, test_Y, test_lang, args.threshold)
112 |
113 |
114 |
115 |
116 | print('='*50)
117 | print('\tStart logging {}'.format(test_lang))
118 |
119 |
120 | utils.log_score(args.log_dir, args.languages, [test_lang], args.task_names, args.embeds, args.h_dim, args.cross_stitch_init_scheme,
121 | args.constraint_weight, args.sigma, args.opt, train_score, dev_score, test_score)
122 |
123 |
124 | print('\tFinished logging {}'.format(test_lang))
125 |
126 |
127 |
128 |
129 | if __name__ == '__main__':
130 | parser = argparse.ArgumentParser(
131 | description='Run the Sluice Network',
132 | formatter_class=argparse.ArgumentDefaultsHelpFormatter)
133 |
134 | # DyNet parameters
135 |
136 | parser.add_argument('--dynet-autobatch', type=int, #automatically batch some operations to speed up computations
137 | help='use auto-batching (1) (should be first argument)')
138 | parser.add_argument('--dynet-gpus', type=int,
139 | help='Specify how many GPUs you want to use, if DyNet is compiled with CUDA')
140 |
141 | parser.add_argument('--dynet-devices', nargs='+', choices=['CPU', 'GPU:0', 'GPU:1', 'GPU:2', 'GPU:3'],
142 | help='Specify which devices to use')
143 | parser.add_argument('--dynet-seed', type=int, help='random seed for DyNet')
144 | parser.add_argument('--dynet-mem', type=int, help='memory for DyNet')
145 |
146 | # languages, tasks, and paths
147 | parser.add_argument('--languages', nargs='+', choices=LANGUAGES,
148 | help='the language datasets to be trained on ')
149 |
150 | parser.add_argument('--test-languages', nargs='+', choices=LANGUAGES,
151 | help='the language datasets to be tested on')
152 |
153 | parser.add_argument('--train-dir', required=True,
154 | help='the directory containing the training data')
155 | parser.add_argument('--dev-dir', required=True,
156 | help='the directory containing the development data')
157 | parser.add_argument('--test-dir', required=True,
158 | help='the directory containing the test data')
159 |
160 | parser.add_argument('--load', action='store_true',
161 | help='load the pre-trained model')
162 |
163 | parser.add_argument('--load-action', default='test',
164 | choices=['train', 'test'],
165 | help='action after loading the model')
166 |
167 | parser.add_argument('--task-names', nargs='+', default=TASK_NAMES,
168 | choices=TASK_NAMES,
169 | help='the names of the tasks (main task is first)')
170 | parser.add_argument('--model-dir', required=True,
171 | help='directory where to save model and param files')
172 | parser.add_argument('--log-dir', required=True,
173 | help='the directory where the results should be logged')
174 | parser.add_argument('--w-in-dim', type=int, default=64,
175 | help='default word embeddings dimension [default: 64]')
176 | #parser.add_argument('--c-in-dim', type=int, default=100,
177 | # help='input dim for char embeddings [default:100]')
178 | parser.add_argument('--h-dim', type=int, default=100,
179 | help='hidden dimension [default: 100]')
180 | parser.add_argument('--h-layers', type=int, default=1,
181 | help='number of stacked LSTMs [default: 1=no stacking]')
182 | parser.add_argument('--lower', action='store_true',
183 | help='lowercase words (not used)')
184 | parser.add_argument('--embeds', nargs='?',help='word embeddings file',
185 | choices=EMBEDS, default=None)
186 |
187 |
188 | parser.add_argument('--sigma', help='noise sigma', default=0.2, type=float)
189 | parser.add_argument('--activation', default='tanh',
190 | help='activation function [rectify, tanh, ...]',
191 | type=check_activation_function)
192 | parser.add_argument('--opt', '--optimizer', default=SGD,
193 | choices=[SGD, ADAM],
194 | help='trainer [sgd, adam] default: sgd')
195 |
196 | # training hyperparameters
197 | parser.add_argument('--epochs', type=int, default=30,
198 | help='training epochs [default: 30]')
199 | parser.add_argument('--patience', default=1, type=int,
200 | help='patience for early stopping')
201 |
202 | parser.add_argument('--cross-stitch', action='store_true',
203 | help='use cross-stitch units between LSTM layers')
204 |
205 |
206 |
207 | parser.add_argument('--num-subspaces', default=1, type=int, choices=[1, 2],
208 | help='the number of subspaces for cross-stitching; '
209 | 'only 1 (no subspace) or 2 allowed currently')
210 | parser.add_argument('--constraint-weight', type=float, default=0.,
211 | help='weighting factor for orthogonality constraint on '
212 | 'cross-stitch subspaces; 0 = no constraint')
213 | parser.add_argument('--constrain-matrices', type=int, nargs='+',
214 | default=[1, 2],
215 | help='the indices of the LSTM matrices that should be '
216 | 'constrained; indices correspond to: Wix,Wih,Wic,'
217 | 'bi,Wox,Woh,Woc,bo,Wcx,Wch,bc. Best indices so '
218 | 'far: [1, 2] http://dynet.readthedocs.io/en/latest/python_ref.html#dynet.LSTMBuilder.get_parameter_expressions)')
219 | parser.add_argument('--cross-stitch-init-scheme', type=str,
220 | default=BALANCED, choices=[IMBALANCED, BALANCED],
221 | help='which initialisation scheme to use for the '
222 | 'alpha matrix - currently available: imbalanced '
223 | 'and balanced (which sets all to '
224 | '1/(num_tasks*num_subspaces)). Only available '
225 | 'with subspaces.')
226 | parser.add_argument('--layer-stitch-init-scheme', type=str,
227 | default=BALANCED,
228 | choices=[BALANCED, IMBALANCED],
229 | help='initialisation scheme for layer-stitch units; '
230 | 'imbalanced sets .9 for the last layer weight; '
231 | 'balanced sets 1. / num_layers for every layer.')
232 |
233 | parser.add_argument('--threshold', type=float,default=0.5,
234 | help='threshold for classification')
235 | args = parser.parse_args()
236 | main(args)
237 |
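238 | # Minimal single-task, single-language (STSL) invocation (a sketch; the data and output
239 | # directories are placeholders):
240 | #
241 | # python run_sluice_net.py --h-layers 1 --languages en --test-languages en \
242 | #     --task-names sentiment --model-dir model/ --log-dir log/ \
243 | #     --train-dir /path/to/train --dev-dir /path/to/dev --test-dir /path/to/test \
244 | #     --embeds babylon --h-dim 200 --patience 3 --threshold 0.5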
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | """
2 | Utility methods for data processing.
3 | """
4 | import os
5 | from glob import glob
6 | import itertools
7 | import csv
8 | import pandas as pd
9 | from constants import NUM, NUMBERREGEX, UNK, WORD_START, WORD_END, EMBEDS_FILES, FULL_LANG, LABELS, MODIFIED_LABELS
10 |
11 | def print_task_labels(task_name, label2id, id_sequence, file):
12 | #Convert label_id sequence to label sequence and write to file
13 | #changed the original function completely
14 | with open(file, 'a+') as f:
15 | writer = csv.writer(f,delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)
16 | writer.writerow(['ID', task_name])
17 | label_list=dict()
18 | for task_id, labels_ids in label2id.items():
19 | if task_name==task_id:
20 | for label, idx in labels_ids.items():
21 | label_list[label] = idx
22 | #print(label_list)
23 |
24 | count = 1
25 | #with open(file, 'a+') as f:
26 | for label_idx_seq in id_sequence:
27 | #Create a label_sequence for each tweet
28 | label_seq = []
29 | for task, label_idx in label_idx_seq.items():
30 | #initialize_values
31 | #target_val=''
32 | #group_val=''
33 | #annotator_val=[]
34 | #sentiment_val=[]
35 | #Non multilabel_tasks, labels are of the form [1, [7], [12], ...
36 | if task==task_name:
37 | if task=='target' or task =='group' or task=='directness':
38 | for target_label, indice in label2id[task].items():
39 | if indice==label_idx[0]:
40 | if task=='target':
41 | val=target_label
42 | else:
43 | val=target_label
44 | #Multilabel tasks, labels are of the form [1, 0, 0, 1, 0, 0], ... such that each column represents one label
45 | elif task=='annotator_sentiment':
46 | val=[]
47 | for j in range(len(label_idx)):
48 | if label_idx[j]>0:
49 | for label, indice in label2id[task].items():
50 | #if labels[j]==1 or label number j ==1 append the name of the label
51 | if indice==j:
52 | val.append(label)
53 | elif task=='sentiment':
54 | val=[]
55 | for j in range(len(label_idx)):
56 | if label_idx[j]>0:
57 | for label, indice in label2id[task].items():
58 | #if labels[j]==1 or label number j ==1 append the name of the label
59 | if indice==j:
60 | val.append(label)
61 | writer.writerow([count,val])
62 | count+=1
63 | #target_val=''
64 | #group_val=''
65 | #annotator_val=[]
66 | #sentiment_val=[]
67 |
68 |
69 | f.close()
70 |
71 |
72 |
73 |
74 | #write functions for studying correlations
75 | def save_generated_labels_in_csv_file(label2id, id_sequence, file):
76 | #Convert label_id sequence to label sequence and write to file
77 | #changed the original function completely
78 | with open(file, 'a+') as f:
79 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
80 | writer.writerow(['ID','annotator_sentiment','sentiment','group','target'])
81 | label_list=dict()
82 | for task, labels_ids in label2id.items():
83 | #print (task)
84 | for label, idx in labels_ids.items():
85 | label_list[label] = idx
86 | #print(label_list)
87 |
88 | count = 1
89 | #with open(file, 'a+') as f:
90 | for label_idx_seq in id_sequence:
91 | #Create a label_sequence for each tweet
92 | label_seq = []
93 | for task, label_idx in label_idx_seq.items():
94 | #initialize_values
95 | #target_val=''
96 | #group_val=''
97 | #annotator_val=[]
98 | #sentiment_val=[]
99 | #Non multilabel_tasks, labels are of the form [1, [7], [12], ...
100 | if task=='target' or task =='group':
101 | for target_label, indice in label2id[task].items():
102 | if indice==label_idx[0]:
103 | if task=='target':
104 | target_val=target_label
105 | else:
106 | group_val=target_label
107 | #Multilabel tasks, labels are of the form [1, 0, 0, 1, 0, 0], ... such that each column represents one label
108 | elif task=='annotator_sentiment':
109 | annotator_val=[]
110 | for j in range(len(label_idx)):
111 | if label_idx[j]>0:
112 | for label, indice in label2id[task].items():
113 | #if labels[j]==1 or label number j ==1 append the name of the label
114 | if indice==j:
115 | annotator_val.append(label)
116 | elif task=='sentiment':
117 | sentiment_val=[]
118 | for j in range(len(label_idx)):
119 | if label_idx[j]>0:
120 | for label, indice in label2id[task].items():
121 | #if labels[j]==1 or label number j ==1 append the name of the label
122 | if indice==j:
123 | sentiment_val.append(label)
124 | writer.writerow([count,sentiment_val,target_val,group_val,annotator_val])
125 | #target_val=''
126 | #group_val=''
127 | #annotator_val=[]
128 | #sentiment_val=[]
129 | count+=1
130 |
131 | f.close()
132 |
133 |
134 |
135 | def get_label(label2id, id_sequence, file):
136 | #Convert label_id sequence to label sequence and write to file
137 | #changed the original function completely
138 | label_list=dict()
139 | for task, labels_ids in label2id.items():
140 | #print (task)
141 | for label, idx in labels_ids.items():
142 | label_list[label] = idx
143 | #print(label_list)
144 |
145 |
146 | count = 1
147 | with open(file, 'a+') as f:
148 | for label_idx_seq in id_sequence:
149 | #Create a label_sequence for each tweet
150 | label_seq = []
151 | for task, label_idx in label_idx_seq.items():
152 | #Non multilabel_tasks, labels are of the form [1, [7], [12], ...
153 | if task=='target' or task =='group':
154 | for target_label, indice in label2id[task].items():
155 | if indice==label_idx[0]:
156 | label_seq.append(target_label)
157 | #Multilabel tasks, labels are of the form [1, 0, 0, 1, 0, 0], ... such that each column represents one label
158 | elif task=='annotator_sentiment' or task =='sentiment':
159 | for j in range(len(label_idx)):
160 | if label_idx[j]>0:
161 | for label, indice in label2id[task].items():
162 | #if labels[j]==1 or label number j ==1 append the name of the label
163 | if indice==j:
164 | label_seq.append(label)
165 | f.write(str(count) +'.\t'+','.join(label_seq) +'\n')
166 | count+=1
167 |
168 | f.close()
169 |
170 | def normalize(word):
171 | """Normalize a word by lower-casing it or replacing it if it is a number."""
172 | return NUM if NUMBERREGEX.match(word) else word.lower()
173 |
174 | def average_by_task(score_dict):
175 | #Compute unweighted average of all metrics among all tasks
176 | total = 0
177 | count = 0
178 |
179 | for key in score_dict:
180 |
181 | total+=(score_dict[key]['micro_f1'] + score_dict[key]['macro_f1'])
182 | count+=2
183 |
184 |
185 | return total/float(count)
186 |
187 | def average_by_lang(score_list, data_size_list, total_data_size):
188 | #Compute weighted average of all languages
189 | res = 0
190 |
191 | for idx in range(len(score_list)):
192 | ratio = float(data_size_list[idx]) / total_data_size
193 | res += ratio * score_list[idx]
194 |
195 | return res
196 |
197 | def load_embeddings_file(embeds, languages, sep=" ", lower=False):
198 | """Loads a word embedding file."""
199 |
200 |
201 | embed_dir = EMBEDS_FILES[embeds]
202 | file_name_list = []
203 | for f in os.listdir(embed_dir):
204 | if (any([f.endswith(lang+'.vec') for lang in languages])):
205 | file_name_list.append(os.path.join(embed_dir,f))
206 |
207 |
208 | word2vec = {}
209 | total_num_words = 0
210 | embed_dim = 0
211 | encoding = None
212 | for file_name in file_name_list:
213 | print('\n\n Loading {}.....\n\n'.format(file_name))
214 | if(file_name.endswith('ar.vec') or file_name.endswith('fr.vec')):
215 | encoding='utf-8'
216 | with open(file=file_name, mode='r', encoding=encoding) as f:
217 | (num_words, embed_dim) = (int(x) for x in f.readline().rstrip('\n').split(' '))
218 | total_num_words+=num_words
219 | for idx, line in enumerate(f):
220 | if((idx+1)%(1e+5)==0):
221 | print('Loading {}/{} words'.format(idx+1, num_words))
222 | fields = line.rstrip('\n').split(sep)
223 | vec = [float(x) for x in fields[1:]]
224 | word = fields[0]
225 | if lower:
226 | word = word.lower()
227 | word2vec[word] = vec
228 | print('Loaded pre-trained embeddings of dimension: {}, size: {}, lower: {}'
229 | .format(embed_dim, total_num_words, lower))
230 | return word2vec, embed_dim
231 |
232 |
233 |
234 |
235 |
236 |
237 | def get_data(languages, task_names, word2id=None, task2label2id=None, data_dir=None,
238 | train=True, verbose=False):
239 | """
240 | :param languages: a list of languages from which to obtain the data
241 | :param task_names: a list of task names
242 | :param word2id: a mapping of words to their ids
243 | :param char2id: a mapping of characters to their ids
244 | :param task2label2id: a mapping of tasks to a label-to-id dictionary
245 | :param data_dir: the directory containing the data
246 | :param train: whether data is used for training (default: True)
247 | :param verbose: whether to print more information re file reading
248 | :return X: a list of lists of word indices (one list per tweet);
249 |           Y: a list of dictionaries mapping each task to its label
250 |              indices: a single index wrapped in a list for single-label
251 |              tasks (target, group, directness) and a binary indicator
252 |              vector for multi-label tasks (sentiment, annotator_sentiment);
253 |           word2id: the word-to-id mapping (built from the training data
254 |              when train=True, otherwise the mapping that was passed in).
255 | 
256 | """
257 | X = []
258 | Y = []
259 | org_X = []
260 | org_Y = []
261 |
262 | # for training, we initialize all mappings; for testing, we require mappings
263 | if train:
264 |
265 | # create word-to-id, character-to-id, and task-to-label-to-id mappings
266 | word2id = {}
267 |
268 |
269 | # set the indices of the special characters
270 | word2id[UNK] = 0 # unk word / OOV
271 |
272 |
273 | for language in languages:
274 | num_sentences = 0
275 | num_tokens = 0
276 |
277 | full_lang = FULL_LANG[language]
278 | #file_reader = iter(())
279 | language_path = os.path.join(data_dir, full_lang)
280 |
281 |
282 | assert os.path.exists(language_path), ('language path %s does not exist.'
283 | % language_path)
284 |
285 | csv_file = os.path.join(language_path,os.listdir(language_path)[0])
286 |
287 | df = pd.read_csv(csv_file)
288 |
289 |
290 | #Column headers are HITId, tweet, sentiment, directness, annotator_sentiment, target, group
291 |
292 | for index, instance in df.iterrows():
293 | num_sentences+=1
294 | #sentence = instance['tweet'].split()
295 | sentence = instance['tweet'].split()
296 |
297 | sentence_word_indices = [] # sequence of word indices
298 | sentence_char_indices = [] # sequence of char indices
299 |
300 | # keep track of the label indices and labels for each task
301 | sentence_task2label_indices = {}
302 |
303 | for i, word in enumerate(sentence):
304 | num_tokens+=1
305 |
306 | if train and word not in word2id:
307 | word2id[word] = len(word2id)
308 |
309 | sentence_word_indices.append(word2id.get(word, word2id[UNK]))
310 |
311 |
312 |
313 |
314 | labels = None
315 |
316 | for task in task2label2id.keys():
317 | if('sentiment' in task):
318 | labels = instance[task].split('_')
319 | else:
320 | labels = [instance[task]]
321 |
322 | if('sentiment' in task):#Multi-label
323 |
324 | sentence_task2label_indices[task]=[0]*len(task2label2id[task])
325 |
326 | for label in labels:
327 | label_idx = task2label2id[task][label]
328 | sentence_task2label_indices[task][label_idx]=1
329 |
330 |
331 | else:
332 |
333 | sentence_task2label_indices[task] = [task2label2id[task][labels[0]]]
334 |
335 |
336 | X.append(sentence_word_indices)
337 | Y.append(sentence_task2label_indices)
338 |
339 | assert len(X) == len(Y)
340 | return X, Y, word2id
341 |
342 |
343 |
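# --- Usage sketch (illustrative, not part of the original code) --------------
# A minimal example of how get_data might be called. The language codes, task
# name, label strings, and data_dir below are placeholders, not the exact
# values used in the dataset.
#
#   task2label2id = {'directness': {'direct': 0, 'indirect': 1}}
#   # training: the word-to-id mapping is built from the training tweets
#   train_X, train_Y, word2id = get_data(['en', 'fr'], ['directness'],
#                                        task2label2id=task2label2id,
#                                        data_dir='path/to/data', train=True)
#   # testing: reuse the training mapping so unseen words fall back to UNK
#   test_X, test_Y, _ = get_data(['ar'], ['directness'], word2id=word2id,
#                                task2label2id=task2label2id,
#                                data_dir='path/to/data', train=False)
# ------------------------------------------------------------------------------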
344 | #Log the training process
345 |
346 | def log_fit(log_dir, epoch, languages, test_lang, task_names, train_score, dev_score):
347 | if(len(task_names) ==1):
348 | task_name = task_names[0]
349 |
350 | if(len(languages) == 1):
351 | task_directory = os.path.join(log_dir,'STSL/')
352 | if not os.path.exists(task_directory):
353 | os.mkdir(task_directory)
354 | file = os.path.join(log_dir, 'STSL/{}_{}.csv'.format(languages[0],task_names[0]))
355 |
356 | else:
357 | task_directory = os.path.join(log_dir,'STML/')
358 | if not os.path.exists(task_directory):
359 | os.mkdir(task_directory)
360 | file = os.path.join(log_dir, 'STML/{}.csv'.format(task_names[0]))
361 |
362 | #This function needs to be changed
363 | if(os.path.exists(file)):
364 | with open(file, 'a') as f:
365 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
366 |
367 |
368 | writer.writerow([epoch, test_lang, train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
369 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1']])
370 |
371 | else:
372 | with open(file, 'a') as f:
373 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
374 |
375 | writer.writerow(['epoch', 'test_lang', task_name+'-train-micro-f1', task_name+'-train-macro-f1',
376 | task_name+'-dev-micro-f1', task_name+'-dev-macro-f1'])
377 |
378 | writer.writerow([epoch, test_lang, train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
379 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1']])
380 |
381 | f.close()
382 |
383 | else:
384 |
385 | if(len(languages) ==1):
386 | task_directory = os.path.join(log_dir,'MTSL/')
387 | if not os.path.exists(task_directory):
388 | os.mkdir(task_directory)
389 | file = os.path.join(log_dir, 'MTSL/{}.csv'.format(languages[0]))
390 |
391 |
392 | else:
393 | task_directory = os.path.join(log_dir,'MTML/')
394 | if not os.path.exists(task_directory):
395 | os.mkdir(task_directory)
396 |
397 | file = os.path.join(log_dir, 'MTML/log.csv')
398 |
399 |
400 | task_name_list = []
401 |
402 | task_f1_list = []
403 |         # collect per-task header names and F1 scores
404 | for task_name in task_names:
405 | task_name_list+=[task_name+'-train-micro-f1', task_name+'-train-macro-f1',
406 | task_name+'-dev-micro-f1', task_name+'-dev-macro-f1']
407 |
408 | task_f1_list +=[train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
409 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1']]
410 |
411 |
412 | if(os.path.exists(file)):
413 | #print("File exists: ")
414 | #print(file)
415 | #file = open(file, 'a')
416 | with open(file, 'a') as f:
417 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
418 | writer.writerow([epoch, test_lang]+ task_f1_list)
419 |
420 | f.close()
421 |
422 | else:
423 | #print("File does not exist: ")
424 | #print(file)
425 | with open(file, 'a') as f:
426 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
427 | writer.writerow(['epoch', 'test_lang'] + task_name_list )
428 | writer.writerow([epoch, test_lang]+ task_f1_list )
429 |
430 |
431 | f.close()
432 |
433 |
434 |
435 |
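# --- Note on the logging layout used by log_fit (above) and log_score (below):
# results are written to one of four sub-directories of log_dir, depending on
# the experimental setting:
#   STSL/<language>_<task>.csv    single task,   single language
#   STML/<task>.csv               single task,   multiple languages
#   MTSL/<language>.csv           multiple tasks, single language
#   MTML/log.csv                  multiple tasks, multiple languages
# A header row is written only when the CSV file is first created; subsequent
# calls append one row per epoch (log_fit) or per finished run (log_score).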
436 | #Log the final score
437 |
438 | def log_score(log_dir, languages, test_lang, task_names, embeds,h_dim, cross_stitch_init,
439 | constraint_weight, sigma, optimizer, train_score, dev_score, test_score):
440 |
441 |
442 | if(len(task_names) ==1):
443 | task_name = task_names[0]
444 |
445 | if(len(languages) == 1):
446 | task_directory = os.path.join(log_dir,'STSL/')
447 | if not os.path.exists(task_directory):
448 | os.mkdir(task_directory)
449 | file = os.path.join(log_dir, 'STSL/{}_{}.csv'.format(languages[0],task_names[0]))
450 |
451 | else:
452 | task_directory = os.path.join(log_dir,'STML/')
453 | if not os.path.exists(task_directory):
454 | os.mkdir(task_directory)
455 | file = os.path.join(log_dir, 'STML/{}.csv'.format(task_names[0]))
456 |
457 |
458 | if(os.path.exists(file)):
459 | with open(file, 'a') as f:
460 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
461 | writer.writerow([embeds,test_lang, h_dim, cross_stitch_init, constraint_weight, sigma, optimizer,
462 | train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
463 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1'],
464 | test_score[task_name]['micro_f1'], test_score[task_name]['macro_f1']])
465 | print([embeds,test_lang, h_dim, cross_stitch_init, constraint_weight, sigma, optimizer,
466 | train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
467 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1'],
468 | test_score[task_name]['micro_f1'], test_score[task_name]['macro_f1']])
469 |
470 | else:
471 | with open(file, 'a') as f:
472 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
473 |
474 | writer.writerow(['embeds', 'test_lang', 'h_dim', 'cross_stitch_init', 'constraint_weight', 'sigma', 'optimizer',
475 | task_name+'-train-micro-f1', task_name+'-train-macro-f1', task_name+'-dev-micro-f1', task_name+'-dev-macro-f1',
476 | task_name+'-test-micro-f1', task_name+'-test-macro-f1'])
477 | print(['embeds', 'test_lang', 'h_dim', 'cross_stitch_init', 'constraint_weight', 'sigma', 'optimizer',
478 | task_name+'-train-micro-f1', task_name+'-train-macro-f1', task_name+'-dev-micro-f1', task_name+'-dev-macro-f1',
479 | task_name+'-test-micro-f1', task_name+'-test-macro-f1'])
480 |
481 | writer.writerow([embeds,test_lang, h_dim, cross_stitch_init, constraint_weight, sigma, optimizer,\
482 | train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
483 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1'],
484 | test_score[task_name]['micro_f1'], test_score[task_name]['macro_f1']])
485 |
486 |
487 |
488 | print([embeds,test_lang, h_dim, cross_stitch_init, constraint_weight, sigma, optimizer,\
489 | train_score[task_name]['micro_f1'], train_score[task_name]['macro_f1'],
490 | dev_score[task_name]['micro_f1'], dev_score[task_name]['macro_f1'],
491 | test_score[task_name]['micro_f1'], test_score[task_name]['macro_f1']])
492 |
493 |
494 | f.close()
495 |
496 | else:
497 |
498 | if(len(languages) ==1):
499 | task_directory = os.path.join(log_dir,'MTSL/')
500 | if not os.path.exists(task_directory):
501 | os.mkdir(task_directory)
502 | file = os.path.join(log_dir, 'MTSL/{}.csv'.format(languages[0]))
503 |
504 | else:
505 | task_directory = os.path.join(log_dir,'MTML/')
506 | if not os.path.exists(task_directory):
507 | os.mkdir(task_directory)
508 | file = os.path.join(log_dir, 'MTML/log.csv')
509 |
510 |
511 | task_name_list = []
512 |
513 | task_f1_list = []
514 |
515 | for task in task_names:
516 | task_name_list+=[task+'-train-micro-f1', task+'-train-macro-f1', task+'-dev-micro-f1', task+'-dev-macro-f1', task+'-test-micro-f1', task+'-test-macro-f1']
517 |
518 | task_f1_list +=[ train_score[task]['micro_f1'], train_score[task]['macro_f1'], dev_score[task]['micro_f1'], dev_score[task]['macro_f1'], test_score[task]['micro_f1'], test_score[task]['macro_f1']]
519 |
520 | if(os.path.exists(file)):
521 | with open(file, 'a') as f:
522 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
523 | writer.writerow([embeds, test_lang, h_dim, cross_stitch_init, constraint_weight, sigma,optimizer]+\
524 | task_f1_list)
525 | print([embeds, test_lang, h_dim, cross_stitch_init, constraint_weight, sigma,optimizer]+\
526 | task_f1_list)
527 |
528 |
529 | f.close()
530 |
531 | else:
532 | with open(file, 'a') as f:
533 | writer = csv.writer(f,delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
534 |                 writer.writerow(['embeds', 'test_lang', 'h_dim', 'cross_stitch_init', 'constraint_weight', 'sigma', 'optimizer']\
535 | +task_name_list)
536 | writer.writerow([embeds, test_lang,h_dim, cross_stitch_init, constraint_weight, sigma,optimizer]+\
537 | task_f1_list )
538 |                 print(['embeds', 'test_lang', 'h_dim', 'cross_stitch_init', 'constraint_weight', 'sigma', 'optimizer']\
539 | +task_name_list)
540 | print([embeds, test_lang,h_dim, cross_stitch_init, constraint_weight, sigma,optimizer]+\
541 | task_f1_list )
542 |
543 |
544 | f.close()
545 |
546 |
547 |
548 |
549 |
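# --- Usage sketch (illustrative, not part of the original code) --------------
# How log_score might be called at the end of a multi-task run. The score
# dictionaries follow the structure consumed above, i.e.
# score[task]['micro_f1'] and score[task]['macro_f1']; all names and numbers
# below are dummy placeholders.
#
#   dummy = {'sentiment': {'micro_f1': 0.0, 'macro_f1': 0.0},
#            'target':    {'micro_f1': 0.0, 'macro_f1': 0.0}}
#   log_score('logs/', languages=['en', 'fr'], test_lang='ar',
#             task_names=['sentiment', 'target'], embeds='babylon',
#             h_dim=100, cross_stitch_init=0.9, constraint_weight=0.1,
#             sigma=0.1, optimizer='sgd', train_score=dummy,
#             dev_score=dummy, test_score=dummy)
# ------------------------------------------------------------------------------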
--------------------------------------------------------------------------------
/guidelines.tar:
--------------------------------------------------------------------------------
1 | arabic_guidelines.html
2 | In what follows, we study hate speech, which can be defined as any statement that endorses incitement to harm (in particular discrimination, hostility, or violence) against a target because they belong to a social or demographic group; such groups are usually vulnerable groups and minorities.
3 | We classify hate speech according to the sentiment of the tweet's author, the sentiment of the reader, and the group targeted by the tweet. When labeling a tweet, please take its tone and your impression of its author into account.
4 | Hate speech can be direct or indirect, depending on the kind of wording used, how explicitly the idea promoted in the tweet is expressed, and whether a demeaning figure of speech is used.
5 | The author's sentiment is judged in the context of the tweet: the author may come across as hateful (if inciting violence), abusive or obscene, offensive, fearful, disrespectful, or normal.
6 | The tweet may make you feel shock, anger, sadness, fear, confusion (if you are unsure), or indifference if you think it is not aggressive.
7 | Finally, choose the group targeted by the tweet based on the attribute shared by the community members it refers to. This attribute may be nationality, religion, ethnicity, origin, gender, gender identity, sexual orientation, or disability (mental or physical), together with the specific group concerned, such as women, refugees, ...
8 | Note
9 | Twitter usernames and all links have been anonymized out of respect for users' privacy.
10 | The speech in this tweet is
11 |
12 | ${TWEET}
13 |
14 | Direct, Indirect
15 |
16 | How does the author of the tweet appear to you?
17 | Abusive/Obscene, Hateful, Offensive, Fearful, Disrespectful, Normal
18 | How do you feel after reading the tweet?
19 | Shock, Anger, Sadness, Fear, Confusion, Indifference
20 | What do members of the group targeted by the tweet have in common?
21 | Origin, Religion, Ethnicity, Nationality, Gender, Disability, Sexual orientation, Gender identity, Other
22 | What is the targeted group?
23 | Asian people, People with disabilities, People of African descent, Hispanic/Latino people, Left-wing people, Indians, Hindus, Christians, Women, Dark-skinned people, Chinese people, Arabs, Muslims, Jews, Immigrants, Refugees, Gay people, Other