├── MyBert_paragraph_document_TPU.ipynb ├── README.md ├── _config.yml └── pre_post_processing_steps ├── 10_data_stats.py ├── 11_masked_lm_prepare_datasets.py ├── 1_preprocess_EMNLP.py ├── 2_core.py ├── 3_entity_recognition.py ├── 4_prepare_mturk_input.py ├── 5_save_links.py ├── 6_get_final_sentiment_compute_agreement_create_reannotation.py ├── 7_seperate_train_test.py ├── 8_masked_data_prepare_datasets.py ├── 9_combine_pre_new_sets.py ├── [7]_combine_4_votes.py └── data_distribution.py /README.md: -------------------------------------------------------------------------------- 1 | ## What is PerSenT? 2 | ### Person SenTiment: a challenge dataset for author's sentiment prediction in the news domain. 3 | 4 | 5 | You can find our paper here: [Author's Sentiment Prediction](https://arxiv.org/abs/2011.06128) 6 | 7 | Mohaddeseh Bastan, Mahnaz Koupaee, Youngseo Son, Richard Sicoli, Niranjan Balasubramanian. COLING 2020. 8 | 9 | We introduce PerSenT, a crowd-sourced dataset that captures the sentiment of an author towards the main entity in a news article. The dataset contains annotations for 5.3k documents and 38k paragraphs covering 3.2k unique entities. 10 | 11 | ### Example 12 | In the following example, we see a 4-paragraph document about an entity (Donald Trump). Each paragraph is labeled separately, and the author's sentiment towards the whole document is given in the last row. 13 | 14 | 15 | Image of PerSenT example 16 | 17 | 18 | ### Dataset Statistics 19 | To split the dataset, we separated the entities into 4 mutually exclusive sets. Due to the nature of news collections, some entities tend to dominate the collection. In our collection, there were four entities which were the main entity in nearly 800 articles. To keep these entities from dominating the train or test splits, we moved them to a separate test collection. We split the remaining documents into training, dev, and test sets at random. Thus our collection includes one standard test set consisting of articles drawn at random (Test Standard), and a second test set which contains multiple articles about a small number of popular entities (Test Frequent). 20 | Image of PerSenT stats 21 | 22 | ### Download the data 23 | You can download the dataset URLs from [here](https://github.com/MHDBST/PerSenT/blob/main/train_dev_test_URLs.pkl). 24 | 25 | The processed version of the dataset, which contains the used paragraphs together with document-level and paragraph-level labels, can be downloaded separately as [train](https://github.com/MHDBST/PerSenT/blob/main/train.csv), [dev](https://github.com/MHDBST/PerSenT/blob/main/dev.csv), [random test](https://github.com/MHDBST/PerSenT/blob/main/random_test.csv), and [fixed test](https://github.com/MHDBST/PerSenT/blob/main/fixed_test.csv). 26 | 27 | To recreate the results from the paper, you can follow the instructions in the README file of the [source code](https://github.com/StonyBrookNLP/PerSenT/tree/main/pre_post_processing_steps). 28 |
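As a quick check, the processed splits can be loaded with pandas. This is a minimal sketch: it assumes the four CSV files above have been downloaded into the working directory, and it uses the column names produced by the processing scripts in this repository (`TARGET_ENTITY`, `DOCUMENT`, `TRUE_SENTIMENT`, and the per-paragraph label columns `Paragraph0` ... `Paragraph15`).

```python
import pandas as pd

# Assumes train.csv, dev.csv, random_test.csv, and fixed_test.csv (linked above)
# have been downloaded into the current directory.
train = pd.read_csv('train.csv')
dev = pd.read_csv('dev.csv')
random_test = pd.read_csv('random_test.csv')
fixed_test = pd.read_csv('fixed_test.csv')

# Document-level labels are Negative / Neutral / Positive.
print(train['TRUE_SENTIMENT'].value_counts())

# Paragraphs are newline-separated in DOCUMENT; Paragraph0 ... Paragraph15
# hold the paragraph-level labels (NaN where a document has fewer paragraphs).
example = train.iloc[0]
print(example['TARGET_ENTITY'])
print(len(example['DOCUMENT'].split('\n')), 'paragraphs')
```

29 | ### Liked us? Cite us!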
30 | 31 | Please use the following bibtex entry: 32 | 33 | ``` 34 | @inproceedings{bastan-etal-2020-authors, 35 | title = "Author{'}s Sentiment Prediction", 36 | author = "Bastan, Mohaddeseh and 37 | Koupaee, Mahnaz and 38 | Son, Youngseo and 39 | Sicoli, Richard and 40 | Balasubramanian, Niranjan", 41 | booktitle = "Proceedings of the 28th International Conference on Computational Linguistics", 42 | month = dec, 43 | year = "2020", 44 | address = "Barcelona, Spain (Online)", 45 | publisher = "International Committee on Computational Linguistics", 46 | url = "https://aclanthology.org/2020.coling-main.52", 47 | doi = "10.18653/v1/2020.coling-main.52", 48 | pages = "604--615", 49 | } 50 | ``` 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | theme: jekyll-theme-slate 2 | -------------------------------------------------------------------------------- /pre_post_processing_steps/10_data_stats.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import collections 3 | import random 4 | import matplotlib.pyplot as plt 5 | import matplotlib.pylab as pylab 6 | 7 | train_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_train.csv') 8 | dev_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_dev.csv') 9 | random_test_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_random_test.csv') 10 | fixed_test_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_fixed_test.csv') 11 | 12 | def count_entities(df): 13 | entities = df['TARGET_ENTITY'] 14 | unique_entities = entities.unique().tolist() 15 | return unique_entities 16 | 17 | def plot_class_distribution(dataset,negative,positive,neutral): 18 | print('distribution of %s set among three classes"'%dataset) 19 | print('negative:\n ', negative) 20 | print('positive:\n ', positive) 21 | print('neutral:\n ', neutral) 22 | # Data to plot 23 | plt.figure() 24 | labels = 'Negative', 'Positive', 'Neutral' 25 | sizes = [negative, 26 | positive 27 | , neutral] 28 | 29 | colors = ['Red', 'Green', 'Yellow'] 30 | 31 | # Plot 32 | plt.pie(sizes, labels=labels, colors=colors, 33 | autopct='%1.1f%%', shadow=True, startangle=140) 34 | 35 | plt.axis('equal') 36 | plt.title(dataset, horizontalalignment='center', verticalalignment='bottom') 37 | # plt.show() 38 | pylab.savefig('%s.png'%dataset) 39 | 40 | 41 | def entity_frequency(df): 42 | plt.figure() 43 | entities = df['TARGET_ENTITY'] 44 | entity_count = collections.Counter(entities) 45 | data = entity_count.most_common(10) 46 | plot_df = pd.DataFrame(data, columns=['entity', 'frequency']) 47 | plot_df.plot(kind='bar', x='entity') 48 | pylab.savefig('%s.png' % 'entity_frequency') 49 | 50 | def paragraph_distribution(df,plot=True): 51 | # plot sentence information 52 | plt.figure() 53 | documents = df['DOCUMENT'].tolist() 54 | doc_length = [] 55 | for document in documents: 56 | try: 57 | doc_length.append(len(document.split('\n'))) 58 | except: 59 | continue 60 | if plot: 61 | plt.hist(doc_length,len(set(doc_length))) 62 | plt.xlabel('Number of Paragraphs') 63 | plt.ylabel('Frequency') 64 | plt.axis([0, 25, 0, 4000]) 65 | pylab.savefig('%s.png' % 'paragraph_freq') 66 | # plt.legend() 67 | print('total number of sentences:%d'% sum(doc_length)) 68 | print('plot done') 69 | return doc_length 70 | 71 | def 
word_distribution(df,plot=True,plot_name='train_words'): 72 | # plot sentence information 73 | plt.figure() 74 | documents = df['DOCUMENT'].tolist() 75 | sentence_length = [] 76 | for document in documents: 77 | try: 78 | sentences = document.split('\n') 79 | 80 | except: 81 | continue 82 | for sentence in sentences: 83 | sentence_length.append(len(sentence.split())) 84 | if plot: 85 | plt.hist(sentence_length,len(set(sentence_length))) 86 | plt.xlabel('Number of Words in a Sentence') 87 | plt.ylabel('Frequency') 88 | plt.axis([0, 120, 0, 7500]) 89 | pylab.savefig('%s.png' % plot_name) 90 | # plt.legend() 91 | print('total number of sentences:%d'% sum(sentence_length)) 92 | print('plot done') 93 | return sentence_length 94 | 95 | 96 | train_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_train.csv') 97 | dev_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_dev.csv') 98 | random_test_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_random_test.csv') 99 | fixed_test_set = pd.read_csv('./combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_fixed_test.csv') 100 | 101 | train_entities = count_entities(train_set) 102 | dev_entities = count_entities(dev_set) 103 | random_test_entities = count_entities(random_test_set) 104 | fixed_test_entities = count_entities(fixed_test_set) 105 | 106 | print('number of unique entities in train set is %d' % len(train_entities)) 107 | print('number of unique entities in dev set is %d' %len(dev_entities)) 108 | print('number of unique entities in random test set is %d'%len(random_test_entities)) 109 | print('number of unique entities in fixed test set is %d'%len(fixed_test_entities)) 110 | 111 | 112 | # 113 | # 114 | plot_class_distribution('train',len(train_set[train_set['TRUE_SENTIMENT']=='Negative']), 115 | len(train_set[train_set['TRUE_SENTIMENT'] == 'Positive']), 116 | len(train_set[train_set['TRUE_SENTIMENT'] == 'Neutral'])) 117 | 118 | plot_class_distribution('dev',len(dev_set[dev_set['TRUE_SENTIMENT']=='Negative']), 119 | len(dev_set[dev_set['TRUE_SENTIMENT'] == 'Positive']), 120 | len(dev_set[dev_set['TRUE_SENTIMENT'] == 'Neutral'])) 121 | 122 | plot_class_distribution('random_test',len(random_test_set[random_test_set['TRUE_SENTIMENT']=='Negative']), 123 | len(random_test_set[random_test_set['TRUE_SENTIMENT'] == 'Positive']), 124 | len(random_test_set[random_test_set['TRUE_SENTIMENT'] == 'Neutral'])) 125 | 126 | plot_class_distribution('fixed_test',len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT']=='Negative']), 127 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Positive']), 128 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Neutral'])) 129 | 130 | 131 | all_used_docs = train_set.append(dev_set).append(random_test_set).append(fixed_test_set) 132 | plot_class_distribution('fixed_test',len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT']=='Negative']), 133 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Positive']), 134 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Neutral'])) 135 | 136 | paragraph_distribution(all_used_docs,plot=True) 137 | 138 | para_train = paragraph_distribution(train_set,plot=False) 139 | para_dev = paragraph_distribution(dev_set,plot=False) 140 | para_fixed_test = paragraph_distribution(fixed_test_set,plot=False) 141 | para_random_test = paragraph_distribution(random_test_set,plot=False) 142 | print('paragraphs in train: %d, dev %d, fixed test %d, random test %d' 
%(sum(para_train),sum(para_dev),sum(para_fixed_test),sum(para_random_test))) 143 | print('max paragraphs in train: %d, dev %d, fixed test %d, random test %d' %(max(para_train),max(para_dev),max(para_fixed_test),max(para_random_test))) 144 | 145 | word_distribution(all_used_docs,plot=True) 146 | 147 | sent_train = word_distribution(train_set.append(dev_set),plot=False) 148 | sent_dev = word_distribution(dev_set,plot=False) 149 | sent_fixed_test = word_distribution(fixed_test_set,plot=False) 150 | sent_random_test = word_distribution(random_test_set,plot=False) 151 | print('words in train: %d, dev %d, fixed test %d, random test %d' %(sum(sent_train),sum(sent_dev),sum(sent_fixed_test),sum(sent_random_test))) 152 | print('max wordsin train: %d, dev %d, fixed test %d, random test %d' %(max(sent_train),max(sent_dev),max(sent_fixed_test),max(sent_random_test))) 153 | 154 | 155 | -------------------------------------------------------------------------------- /pre_post_processing_steps/11_masked_lm_prepare_datasets.py: -------------------------------------------------------------------------------- 1 | ##### create mask data set from the main dataset 2 | ##### for each document, replace all occurences of the main entity with [TGT], for each document, replace 3 | ##### one of the [TGT] occurrences with [MASK] and add the new document to the new dataset with label TRUE 4 | ##### for each document, replace each occurrences of all other entities with [MASK] and add the new document 5 | ##### to the dataset with FALSE label. 6 | import pandas as pd 7 | from nltk.tag import StanfordNERTagger 8 | from nltk.tokenize import word_tokenize 9 | import time 10 | st = StanfordNERTagger('../entityRecognition/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz', 11 | '../entityRecognition/stanford-ner-2018-02-27/stanford-ner.jar', 12 | encoding='utf-8') 13 | 14 | def prepare_date(df): 15 | 16 | 17 | ### new dataframe has 3 columnds, the initial data document id, the masked document(new) and the new label (true or false) 18 | mask_lm_df = pd.DataFrame(columns=['DOCUMENT_INDEX','DOCUMENT','LABEL']) 19 | ind = -1 20 | 21 | now = time.time() 22 | for doc in list(df['MASKED_DOCUMENT']): 23 | if ind %100 == 0: 24 | print('document %d processed'%ind) 25 | true_docs = [] 26 | ind += 1 27 | doc_id = df['DOCUMENT_INDEX'].iloc[ind] 28 | mask_count = doc.count('[TGT]') 29 | for i in range(1,mask_count+1): 30 | ## replace [TGT] one by one based on the occurrence number 31 | true_doc = doc.replace('[TGT]','[MASK]',i).replace('[MASK]','[TGT]',i-1) 32 | if not true_doc in true_docs: 33 | true_docs.append(true_doc) 34 | try: 35 | tokenized_text = word_tokenize(doc) 36 | except: 37 | tokenized_text = word_tokenize(doc.decode('utf-8')) 38 | classified_text = st.tag(tokenized_text) 39 | false_docs = [] 40 | i = 0 41 | 42 | previous_entity = ("","") 43 | entity = "" 44 | # read all entities in document and their entity tags 45 | for pair in classified_text: 46 | 47 | # if the entity is person, find the whole person name, replace it with mask add it to the false arrays and keep going 48 | if (pair[1] != 'PERSON' and previous_entity[1] == 'PERSON'): 49 | false_doc = doc.replace(entity,'[MASK]') 50 | if not false_doc in false_docs : 51 | false_docs.append(false_doc) 52 | if (pair[1] == 'PERSON' and previous_entity[1] != 'PERSON'): 53 | entity = pair[0] 54 | elif (pair[1] == 'PERSON' and previous_entity[1] == 'PERSON'): 55 | entity += " "+ pair[0] 56 | 57 | previous_entity = pair 58 | ### add all documents in the 
false/true array to the data frame with False/True labels 59 | for item in true_docs: 60 | mask_lm_df = mask_lm_df.append({'DOCUMENT_INDEX':doc_id,'DOCUMENT': item,'LABEL':'TRUE'}, ignore_index=True) 61 | for item in false_docs: 62 | mask_lm_df = mask_lm_df.append({'DOCUMENT_INDEX':doc_id,'DOCUMENT': item,'LABEL':'FALSE'}, ignore_index=True) 63 | 64 | print('processing took %d seconds'%(time.time()-now)) 65 | print(len(mask_lm_df[mask_lm_df['LABEL']=='TRUE'])) 66 | print(len(mask_lm_df[mask_lm_df['LABEL']=='FALSE'])) 67 | 68 | return mask_lm_df 69 | 70 | 71 | 72 | ##### load data set with MASKED_ENTITY column where main entities are replaced with [TGT] 73 | #df_train = pd.read_csv('combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_train.csv', encoding='latin-1') 74 | #mask_train = prepare_date(df_train) 75 | #mask_train.to_csv('masked_lm/mask_lm_combined_shuffled_3Dec_7Dec_aug19_reindex_train.csv', encoding='latin-1') 76 | # 77 | #df_dev = pd.read_csv('combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_dev.csv', encoding='latin-1') 78 | #mask_dev = prepare_date(df_dev) 79 | #mask_dev.to_csv('masked_lm/mask_lm_combined_shuffled_3Dec_7Dec_aug19_reindex_dev.csv', encoding='latin-1') 80 | # 81 | df_rTest = pd.read_csv('combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_random_test.csv', encoding='latin-1') 82 | mask_rTest = prepare_date(df_rTest) 83 | mask_rTest.to_csv('masked_lm/mask_lm_combined_shuffled_3Dec_7Dec_aug19_reindex_random_test.csv', encoding='latin-1') 84 | 85 | #df_fTest = pd.read_csv('combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_fixed_test.csv', encoding='latin-1') 86 | #mask_fTest = prepare_date(df_fTest) 87 | #mask_fTest.to_csv('masked_lm/mask_lm_combined_shuffled_3Dec_7Dec_aug19_reindex_fixed_test.csv', encoding='latin-1') 88 | -------------------------------------------------------------------------------- /pre_post_processing_steps/1_preprocess_EMNLP.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import json 3 | import random 4 | import os 5 | 6 | source_path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_resource/EMNLP_Data_Junting' 7 | train_input = source_path + '/content_df_test_filtered.csv' 8 | title_train_input = source_path + '/titles/title_selected_sources_ids_with_targets_test.csv' 9 | 10 | 11 | 12 | 13 | text_df = pd.read_csv(train_input, error_bad_lines=False,delimiter='\t') 14 | title_df = pd.read_csv(title_train_input, error_bad_lines=False,delimiter='\t') 15 | 16 | input_json = json.load(open(source_path + '/raw_docs.json')) 17 | title_input = source_path + '/titles/title_selected_sources_ids_with_targets_train.csv'#'emnlp18_data/titles/title_selected_sources_ids_with_targets_train.csv' 18 | title_df = pd.read_csv(open(title_input ), error_bad_lines=False,delimiter='\t') 19 | 20 | 21 | data_path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/' 22 | subdir= 'emnlp_paragraph_seperated_Aug19_part2/' 23 | 24 | 25 | 26 | ##select No_of_samples random documents 27 | No_of_samples = 5000 28 | doc_ids = random.sample(input_json.keys(), No_of_samples) 29 | for index in doc_ids: 30 | 31 | 32 | for root, dirs, files in os.walk(data_path): 33 | 34 | if index in files: 35 | print ("File exists") 36 | continue 37 | 38 | # if the file exsits, don't create it again 39 | try: 40 | # if it has been chosen previously, don't choose it again 41 | out_file = open(data_path+subdir+str(index)) 42 | 
# out_file = open('./masked_entity_lm/'+str(index)) 43 | continue 44 | except: 45 | pass 46 | # select random documents from source file and read the main document 47 | paragraphs = input_json[index].split('\n\n') 48 | 49 | if len(paragraphs) < 3: 50 | continue 51 | 52 | title = title_df[title_df['docid'] == int(index)] 53 | try: 54 | main_title = title.iloc[0]['title'] 55 | out_file = open(data_path+subdir+str(index),'w') 56 | # out_file = open('./masked_entity_lm/'+str(index),'w') 57 | out_file.write(main_title) 58 | out_file.write('\n') 59 | 60 | except: # this index is not in title file 61 | continue 62 | for paragraph in paragraphs: 63 | 64 | try: 65 | out_file.write(paragraph) 66 | except: 67 | out_file.write(paragraph.encode('utf-8')) 68 | out_file.write('\n') 69 | out_file.close() 70 | -------------------------------------------------------------------------------- /pre_post_processing_steps/2_core.py: -------------------------------------------------------------------------------- 1 | # This Python file uses the following encoding: utf-8 2 | # import en_coref_lg 3 | import spacy 4 | import neuralcoref 5 | import os 6 | import re 7 | # coref = en_coref_lg.load() 8 | coref = spacy.load('en') 9 | neuralcoref.add_to_pipe(coref) 10 | #source = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/LDC2014E13_output/' 11 | source = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/' 12 | # source = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/' 13 | # subdir = 'crowdsource/' 14 | # subdir = 'KBP/' 15 | # subdir = 'emnlp/' 16 | # subdir = 'emnlp_paragraph_seperated_batch3/' 17 | # subdir = 'masked_entity_lm/' 18 | subdir = 'emnlp_paragraph_seperated_Aug19_part2/' 19 | # read all files in directory (the reports) 20 | for filename in os.listdir(source+subdir): 21 | 22 | # print(filename) 23 | textfile = open(source+subdir + filename) 24 | 25 | 26 | text = textfile.read() 27 | text = re.sub('[^0-9a-zA-Z.\n,!?@#$%^&*()_+\"\';:=<>[]}{\|~`]+', ' ', text) 28 | indd = 0 29 | if os.path.exists(source+'coref_%s/'%subdir+filename): 30 | continue 31 | try: 32 | doc = coref(text.decode('utf-8')) 33 | 34 | except Exception as e: 35 | print('error occured: %s'%str(e)) 36 | doc = coref(text) 37 | # continue 38 | try: 39 | outputText = doc._.coref_resolved 40 | except: 41 | print('not being processed %s'%filename) 42 | continue 43 | 44 | indd = indd+1 45 | outputfile = open(source+'coref_%s/'%subdir+filename,'w') 46 | try: 47 | outputfile.write(outputText) 48 | except: 49 | outputfile.write(outputText.encode('utf-8')) 50 | outputfile.write('\n') 51 | outputfile.close() 52 | del doc 53 | 54 | print('all coreferences found') 55 | 56 | print('output table created') 57 | 58 | 59 | 60 | 61 | 62 | -------------------------------------------------------------------------------- /pre_post_processing_steps/3_entity_recognition.py: -------------------------------------------------------------------------------- 1 | #-*- coding: utf-8 -*- 2 | from nltk.tag import StanfordNERTagger 3 | from nltk.tokenize import word_tokenize 4 | import spacy 5 | import neuralcoref 6 | import os 7 | import re 8 | coref = spacy.load('en') 9 | neuralcoref.add_to_pipe(coref) 10 | st = StanfordNERTagger('/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/entityRecognition/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz', 11 | 
'/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/entityRecognition/stanford-ner-2018-02-27/stanford-ner.jar', 12 | encoding='utf-8') 13 | 14 | 15 | # find the most frequent entity in document and write them on file 16 | import os 17 | 18 | 19 | source = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/' 20 | subdir = 'emnlp_paragraph_seperated_Aug19_part2' 21 | 22 | ### reads all the file in a subdirectory and process one by one 23 | for filename in os.listdir(source+'coref_%s/'%subdir): 24 | 25 | ### if the entity recognition for on document is solved (it is saved in another direcotry) skip that document 26 | if os.path.exists(source +'pairentity_%s/'%subdir +filename): 27 | continue 28 | print(filename) 29 | text = open(source+'coref_%s/'%subdir + filename) 30 | # a dictionary for each document to find the entity 31 | allEntities = {} 32 | lines = text.read() 33 | lines = lines.replace('”','"').replace('’','\'') 34 | try: 35 | tokenized_text = word_tokenize(lines) 36 | except: 37 | tokenized_text = word_tokenize(lines.decode('utf-8')) 38 | classified_text = st.tag(tokenized_text) 39 | i = 0 40 | previous_entity = ("","") 41 | two_previous_entity = ("","") 42 | # read all entities in document and double check to keep the PERSON ones 43 | for pair in classified_text: 44 | # if the entity is person, save it in hash 45 | if (pair[1] == 'PERSON' and previous_entity[1] != 'PERSON'): 46 | if pair[0] in allEntities: 47 | allEntities[pair[0]] = allEntities[pair[0]] +1 48 | else: 49 | allEntities[pair[0]] = 1 50 | elif (pair[1] == 'PERSON' and previous_entity[1] == 'PERSON' and two_previous_entity[1] != 'PERSON'): 51 | entity = previous_entity[0]+" "+pair[0] 52 | if entity in allEntities: 53 | allEntities[entity] = allEntities[entity] +1 54 | else: 55 | allEntities[entity] = 1 56 | elif (pair[1] == 'PERSON' and previous_entity[1] == 'PERSON' and two_previous_entity[1] == 'PERSON'): 57 | # then add the new pairs as new entity 58 | entity = two_previous_entity[0]+" "+ previous_entity[0]+" "+pair[0] 59 | if entity in allEntities: 60 | allEntities[entity] = allEntities[entity] +1 61 | else: 62 | allEntities[entity] = 1 63 | two_previous_entity = previous_entity 64 | previous_entity = pair 65 | if len(allEntities) ==0: 66 | print('no Entities in %s'%filename) 67 | continue 68 | sortedEntities = sorted(allEntities, key=allEntities.get, reverse=True) 69 | maxEntity = sortedEntities[0] 70 | counted = 0 71 | for item in sortedEntities: 72 | if maxEntity in item: 73 | counted += allEntities[item] 74 | maxEntity = item 75 | # if number of entities is less than 3 do not save that document 76 | if counted <3: 77 | print('number of dominate entity %d'%counted) 78 | continue 79 | print(maxEntity) 80 | outfile = open(source +'pairentity_%s/'%subdir +filename,'w') 81 | try: 82 | outfile.write(maxEntity) 83 | except: 84 | outfile.write(maxEntity.encode('utf-8')) 85 | outfile.write('\n') 86 | outfile.write(''.join(lines)) 87 | outfile.close() -------------------------------------------------------------------------------- /pre_post_processing_steps/4_prepare_mturk_input.py: -------------------------------------------------------------------------------- 1 | # create input file for mechanical turk, it has three types of columns, entity, title, content. content itself may be 2 | # at most 15 paragraphs and at least 5 paragraphs. we disregard the rest. Finally all entities are highlighted not from 3 | # the corefrenced document, from the main documents. 
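# Each row written to the MTurk input CSV below is one HIT: the target entity, the article title, and a 'content' field holding the article body with the entity mentions highlighted, a sentiment question per paragraph, and a final question asking for the whole article's view towards the entity.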
4 | import os 5 | import re 6 | source = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/' 7 | import csv 8 | import spacy 9 | import neuralcoref 10 | 11 | 12 | 13 | docStat= {} 14 | # subdir = 'crowdsource/' 15 | # subdir = 'KBP/' 16 | # subdir = 'emnlp/' 17 | # subdir = 'emnlp_paragraph_seperated/' 18 | # subdir = 'emnlp_paragraph_seperated_batch3/' 19 | subdir = 'emnlp_paragraph_seperated_Aug19_part2/' 20 | 21 | # read all files in directory (the reports) 22 | 23 | # csvfile = open('input_KBP.csv', 'wb') 24 | # csvfile = open('./input_emnlp_PS.csv', 'wb') 25 | csvfile = open('./input_emnlp_PS_Aug19_part3.csv', 'w') 26 | 27 | writer = csv.writer(csvfile, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL) 28 | writer.writerow(['entity','title','content']) 29 | fileCounter = 0 30 | coref = spacy.load('en') 31 | neuralcoref.add_to_pipe(coref) 32 | cnt = 0 33 | longdocs = [] 34 | shortdocs = [] 35 | notFound = [] 36 | countNF = 0 37 | for filename in os.listdir(source+subdir): 38 | fileCounter += 1 39 | if fileCounter%100 == 0: 40 | print('processing document: %d'%fileCounter) 41 | 42 | ### if the file is processes, skip it 43 | try: 44 | open(source+'used_documents_in_MTurk_%s'%subdir+filename) 45 | print('file found %s' %filename) 46 | continue 47 | except: 48 | pass 49 | 50 | 51 | 52 | 53 | try: 54 | coref_text = open(source+'pairentity_%s'%subdir + filename) 55 | except Exception as e: 56 | if str(e).startswith('[Errno 2]'): 57 | continue 58 | print('error darim: %s'%str(e)) 59 | # print('pair entity file not found %s'%filename) 60 | # print(filename) 61 | notFound.append(filename) 62 | countNF += 1 63 | continue 64 | entity = coref_text.readline().rstrip().replace(',',' ') 65 | 66 | try: 67 | main_text = open(source+subdir+ filename) 68 | text = main_text.read().decode('utf-8').strip().replace(',',' ') 69 | except: 70 | main_text = open(source+subdir+ filename) 71 | text = main_text.read().strip().replace(',',' ') 72 | new_text = "" 73 | sc = 0 74 | try: 75 | doc = coref(text) 76 | except Exception as e: 77 | print('there is an error in %s which is %s'%(filename,str(e))) 78 | continue 79 | entity_in_coref = False 80 | corefs = [] 81 | try: 82 | # find all mentions and coreferences 83 | for item in doc._.coref_clusters: 84 | # if the head of cluster is the intended entity then highlight all of the mentions 85 | if entity in item.main.text or item.main.text in entity: 86 | 87 | entity_in_coref = True 88 | for span in item: 89 | corefs.append(span) 90 | for item in sorted(corefs): 91 | pronoun = item.text 92 | ec = item.start_char 93 | if ec < sc: 94 | continue 95 | new_text += text[sc:ec] + ' '+pronoun+' ' 96 | sc = item.end_char 97 | # print(new_text) 98 | new_text += text[sc:] 99 | 100 | new_text = new_text.replace(entity,''+entity+'' ) 101 | 102 | new_text = new_text.rstrip().replace(',',' ') 103 | new_text = new_text.replace('|',' ') 104 | if not entity_in_coref: 105 | continue 106 | 107 | except Exception as e: 108 | print('the error is %s'%str(e)) 109 | print('coreference not resolved %s' % filename) 110 | continue 111 | 112 | main_lines = new_text.split('\n') 113 | content = "" 114 | # if len( main_lines) < 3 : 115 | # continue 116 | header = main_lines[0] 117 | used_text = entity+ '\n'+ header.replace('','').replace('','') + '\n' 118 | ind=0 119 | for main_line in main_lines[1:]: 120 | ### for one paragraph text 121 | # if main_line.count('15: 127 | break 128 | main_line = main_line.replace('|',' ') 129 | i = str(ind) 130 | #### for one 
paragraph document 131 | # SENTIMENT = '' 132 | # content += '' + main_line + '' 133 | ##### for multiple paragraph documents 134 | SENTIMENT = '' 135 | content += '' + main_line + '' + SENTIMENT+'' 136 | 137 | used_text += main_line.replace('','').replace('','') + '\n' 138 | 139 | ind = ind+1 140 | if ind in docStat: 141 | 142 | docStat[ind] = docStat[ind]+1 143 | else: 144 | docStat[ind]= 1 145 | i = str(ind) 146 | total_sentiment = 'The whole article\'s view towards '+entity +' is:' 147 | total_sentiment = total_sentiment.replace('|',' ') 148 | # if ind >3 and ind<8 : 149 | # fileSize = 'small/' 150 | # cnt += 1 151 | # shortdocs.append([entity,header,''+content+total_sentiment+'']) 152 | # elif ind >7 and ind < 16: 153 | # fileSize = 'large/' 154 | 155 | # longdocs.append([entity,header,''+content+total_sentiment+'']) 156 | # for kbp and crowdsource data 157 | if ind>3: 158 | # for emnlp data with one paragraph 159 | # if ind>0: 160 | 161 | row = ''+content+total_sentiment+'
' 162 | try: 163 | writer.writerow([entity,header,row]) 164 | except: 165 | row = unidecode(row) 166 | try: 167 | writer.writerow([entity,header,row]) 168 | 169 | 170 | except: 171 | print('unicode error') 172 | continue 173 | # write the paragraphs which are used in a file for future usage 174 | # datafile = open(source+'used_documents_in_MTurk_merge/'+fileSize+filename,'wb') 175 | datafile = open(source+'used_documents_in_MTurk_%s'%subdir+filename,'wb') 176 | datafile.write(used_text.encode('utf-8')) 177 | datafile.close() 178 | cnt +=1 179 | else: 180 | print('not enough paragraphs have entity mention %s' %filename) 181 | # if cnt == 1000: 182 | # break 183 | 184 | print("acceptable documents %d"% cnt) 185 | csvfile.close() -------------------------------------------------------------------------------- /pre_post_processing_steps/5_save_links.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import requests 4 | import pickle 5 | import pandas as pd 6 | # 7 | # path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_resource/EMNLP_Data_Junting/article_url_cmplt.json' 8 | # with open(path) as article_url_file: 9 | # article_url_map = json.load(article_url_file) 10 | # 11 | # 12 | # headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'} 13 | # def add_url_to_archie(filename_archive_url,path,URL): 14 | # i = 0 15 | # err = 0 16 | # vld = 0 17 | # for filename in os.listdir(path): 18 | # i += 1 19 | # 20 | # if i%100 ==0: 21 | # print('i is %d'%i) 22 | # if filename in filename_archive_url: 23 | # vld+= 1 24 | # continue 25 | # added_url = article_url_map[filename] 26 | # try: 27 | # r2 = requests.get(url = URL+'/save/'+added_url,headers=headers) 28 | # except Exception as e: 29 | # print('error occured: %s'%str(e)) 30 | # continue 31 | # if r2.status_code == 502: 32 | # for _ in range(3): 33 | # print('attempt 10 times to get something else than 502') 34 | # r2 = requests.get(url = URL+'/save/'+added_url,headers=headers) 35 | # if r2.status_code == 200: 36 | # print('succeed') 37 | # break 38 | # if r2.status_code == 200: 39 | # new_url = r2.headers['Content-Location'] 40 | # 41 | # # if added_url in new_url: 42 | # rsp = requests.head(url = URL+new_url,headers=headers) 43 | # if rsp.status_code == 200: 44 | # filename_archive_url[filename] = URL+new_url 45 | # vld+= 1 46 | # else: 47 | # err += 1 48 | # print('error on file %s with url %s with error code: %d (while saving)'%(filename,added_url,rsp.status_code)) 49 | # continue 50 | # # else: 51 | # # print('added url %s'%(added_url)) 52 | # # print('new url %s'%(new_url)) 53 | # # err += 1 54 | # # print('error on file %s with url %s with error code: %d ( new url not same as requested url)' 55 | # # %(filename,added_url,r2.status_code)) 56 | # # continue 57 | # 58 | # else: 59 | # err += 1 60 | # print('error on file %s with url %s with error code: %d '%(filename,added_url,r2.status_code)) 61 | # continue 62 | # 63 | # print('total: %d, valid: %d, error: %d, len map: %d'%(i,vld,err,len(filename_archive_url)) ) 64 | # return(filename_archive_url) 65 | # 66 | # 67 | # URL = 'http://web.archive.org' 68 | # filename_archive_url = {} 69 | # 70 | # path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_Aug19_part2/' 71 | # filename_archive_url = 
add_url_to_archie(filename_archive_url,path,URL) 72 | # 73 | # 74 | # ### check all urls in the archive to be valid 75 | # import requests 76 | # i = 0 77 | # should_be_deleted = [] 78 | # for key in filename_archive_url: 79 | # i += 1 80 | # if i % 100 == 0: 81 | # print('i is %d'%i) 82 | # url = filename_archive_url[key] 83 | # # if 'err' in url: 84 | # # print('error in address %s, key is %s'%(url,key)) 85 | # rr = requests.head(url = url,headers=headers) 86 | # if rr.status_code != 200: 87 | # print(rr.status_code) 88 | # print('error in file %s with url %s'%(key,article_url_map[key])) 89 | # should_be_deleted.append(key) 90 | # print('documents which should be deleted: \n', should_be_deleted) 91 | # 92 | # pickle.dump(filename_archive_url,open('file_url_Aug19_part3','wb')) 93 | # 94 | # 95 | # 96 | # ### check the folders to see how many news artciles were not retrieved with status code 97 | # filename_ids_notadded = [] 98 | # URL = 'http://web.archive.org' 99 | # path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_Aug19_part2/' 100 | # 101 | # for path in [path]: 102 | # for filename in os.listdir(path): 103 | # if not filename.startswith('_'): 104 | # if filename not in filename_archive_url: 105 | # filename_ids_notadded.append(filename) 106 | # added_url = article_url_map[filename] 107 | # r2 = requests.get(url = URL+'/save/'+added_url) 108 | # status = r2.status_code 109 | # print('error on file %s with url %s with status %d'%(filename,added_url,status)) 110 | # else: 111 | # filename_ids_notadded.append(filename[1:]) 112 | # 113 | # pickle.dump(filename_ids_notadded,open('file_not_added_Aug19_part3','wb')) 114 | # 115 | 116 | def map_title_doc(): 117 | 118 | #### find the titles for specific doc ids and map title to doc id for both removed docs and kept docs 119 | titles = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_resource/EMNLP_Data_Junting/titles/title_selected_sources_ids_with_targets_' 120 | df_titles = pd.DataFrame([]) 121 | for name in ['train','test','val']: 122 | df_file = pd.read_csv(open(titles+name+'.csv'),sep='\t') 123 | # print(len(df_file)) 124 | df_titles = df_titles.append(df_file) 125 | print(len(df_titles)) 126 | print('--------- Dataframe of titles created ---------') 127 | 128 | docs = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_resource/EMNLP_Data_Junting/content_df_' 129 | df_docs = pd.DataFrame([]) 130 | for name in ['train','test','val']: 131 | df_file = pd.read_csv(open(docs+name+'_filtered.csv'),sep='\t') 132 | # print(len(df_file)) 133 | df_docs = df_docs.append(df_file) 134 | print(len(df_docs)) 135 | print('--------- Dataframe of documents created ---------') 136 | 137 | to_be_deleted_title = {} 138 | to_be_deleted_doc = {} 139 | i = 0 140 | for item in list(df_titles['docid']): 141 | if str(item) in filename_ids_notadded: 142 | to_be_deleted_title[item] = df_titles[df_titles['docid']==item]['title'].values[0] 143 | to_be_deleted_doc[item] = df_docs[df_docs['docid']==item]['content'].values[0] 144 | i += 1 145 | 146 | to_be_kept_title = {} 147 | to_be_kept_doc = {} 148 | i = 0 149 | for item in list(df_titles['docid']): 150 | if str(item) in filename_archive_url: 151 | to_be_kept_title[item] = df_titles[df_titles['docid']==item]['title'].values[0] 152 | to_be_kept_doc[item] = df_docs[df_docs['docid']==item]['content'].values[0] 153 | i += 1 154 | print('--------- Deleting and keeping lists of title and documents 
created ---------') 155 | print('--------- Statistics about the dataset ---------') 156 | print('length of the list of titles to be deleted: %d'%len(to_be_deleted_title)) 157 | print('length of the list of titles to be kept: %d'%len(to_be_kept_title)) 158 | print('length of the set of titles to be kept: %d'%len(set(to_be_kept_title.values()))) 159 | print('length of the list of docs to be deleted: %d'%len(to_be_deleted_doc)) 160 | print('length of the list of docs to be kept: %d'%len(to_be_kept_doc)) 161 | print('length of the set of docs to be kept: %d'%len(set(to_be_kept_doc.values()))) 162 | 163 | ### If we have repeated docs, remove one of them by adding it to the to_be_deleted list 164 | print('--------- List of the doc ids which has the same content ---------') 165 | repeated_docs = [] 166 | key_list = list(to_be_kept_doc) 167 | 168 | for i in range(len(key_list)): 169 | key1 = key_list[i] 170 | value1 = to_be_kept_doc[key1] 171 | for j in range(i+1,len(key_list)): 172 | key2 = key_list[j] 173 | value2 = to_be_kept_doc[key2] 174 | if key1 != key2 and value1 == value2: 175 | print(key1,key2) 176 | repeated_docs.append(key1) 177 | to_be_deleted_doc[key1] = value1 178 | to_be_deleted_title[key1] = to_be_kept_title[key1] 179 | # print(value1) 180 | print('--------- delete the repeated docs from the list of kept docs ---------') 181 | for key in repeated_docs: 182 | del to_be_kept_doc[key] 183 | del to_be_kept_title[key] 184 | print('length of the list of docs to be kept: %d'%len(to_be_kept_doc)) 185 | return (to_be_deleted_doc, to_be_deleted_title) 186 | 187 | 188 | 189 | def del_unsused_doc_title(to_be_deleted_title,data_names=['part3'],path = './'): 190 | ### remove document which doesn't have any link in archive in mturk input 191 | 192 | remove_from_dataset = {} 193 | ### collect the ids 194 | for name in data_names: 195 | print('-----------------\tAnalyzing %s\t-----------------'%name) 196 | data_path = path + 'input_emnlp_PS_Aug19_%s.csv'%(name) 197 | df = pd.read_csv(data_path) 198 | remove_set = [] 199 | i = 0 200 | print(len(df['title'])) 201 | for title in df['title']: 202 | if title in to_be_deleted_title.values(): 203 | remove_set.append(i) 204 | i += 1 205 | remove_from_dataset[name] = remove_set 206 | print(remove_from_dataset) 207 | 208 | ### remove the collected ids 209 | for dataset in remove_from_dataset.keys(): 210 | print('working on dataset %s'%dataset) 211 | 212 | data_path = path + 'input_emnlp_PS_Aug19_%s.csv'%(dataset) 213 | df = pd.read_csv(data_path) 214 | print('len df before dropping: %d'%len(df)) 215 | drop_list = remove_from_dataset[dataset] 216 | print(drop_list) 217 | df = df.drop(df.index[drop_list]) 218 | print('len df after dropping: %d'%len(df)) 219 | df.to_csv(path + 'input_emnlp_PS_dropped_Aug19_%s.csv'%(dataset),index=False) 220 | 221 | 222 | 223 | filename_archive_url = pickle.load(open('data/file_url_Aug19_part3_cop','rb')) 224 | filename_ids_notadded = pickle.load(open('data/file_not_added_Aug19_part3_cop','rb')) 225 | (to_be_deleted_doc, to_be_deleted_title) = map_title_doc() 226 | del_unsused_doc_title(to_be_deleted_title,data_names=['part3'],path = './data/') 227 | 228 | 229 | -------------------------------------------------------------------------------- /pre_post_processing_steps/6_get_final_sentiment_compute_agreement_create_reannotation.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from nltk import agreement 3 | from scipy import stats 4 | import numpy as np 5 
| import csv 6 | import pickle 7 | import pandas as pd 8 | 9 | ######## !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! the order of paragraphs should be fixed before running this code !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 10 | 11 | 12 | from nltk.metrics.agreement import AnnotationTask 13 | from nltk.metrics import custom_distance as ds 14 | 15 | # incase of smallset and largeset where the last column with label is the sentiment label 16 | def get_final_sentiment(input_csv = 'smallset1.csv'): 17 | # input: csv file 18 | # output: the last column of the csv as sentiment label of the document 19 | # warning: this method is just useful if there is no Fianl sentiment label in the document and the last column 20 | # of the csv is the final label with another name 21 | df = pd.read_csv(input_csv) 22 | df_out = df.ffill(axis=1).iloc[:, [-1]] 23 | return(df_out) 24 | 25 | 26 | # get the array of the sentiments according to different annotators and return the label for the document 27 | def getCombinedSentiment(sentiments): 28 | # input: array of the sentiments 29 | # if the variance is more than 2 it will be marked for reassignment to the new annotator 30 | # output: compute the sentiment based on the labels, return one final sentiment 31 | weights = {'Negative': -2, 'Slightly_Negative': -1, 'Neutral': 0, 'Slightly_Positive': 1, 'Positive': 2} 32 | labels = ['Negative', 'Neutral', 'Positive'] 33 | avg = 0 34 | sent = [] 35 | kappa_input = [0,0,0] 36 | for s in sentiments: 37 | try: 38 | avg += weights[s] 39 | sent.append(weights[s]) 40 | if 'Neg' in s: 41 | kappa_input[0] += 1 42 | elif 'Pos' in s: 43 | kappa_input[2] += 1 44 | elif 'Neu' in s: 45 | kappa_input[1] += 1 46 | # skip nan labels 47 | except: 48 | continue 49 | var = np.var(sent) 50 | reannotate = False 51 | 52 | if var > 2: 53 | reannotate = True 54 | # print('var is : %f'%var, 'sent is ' , sent) 55 | avg = float(avg)/float(len(sentiments)) 56 | if avg >= 0.5: 57 | return (2,labels[2]),reannotate 58 | elif avg <= -0.5: 59 | return (0,labels[0]),reannotate 60 | else: 61 | return (1,labels[1]),reannotate 62 | 63 | 64 | # preprocess sentences mostly used for titles, remove extra whitespace, tags for HTML multiple extra spaces 65 | def preprocess(sent): 66 | # input: text to be processed 67 | # output: the removed whitespace version of the text with no HTML tag 68 | return sent.replace('','').replace('','').replace(',','').replace(' ',' ').strip() 69 | 70 | 71 | # read all documents in crowdsource folder and keep the document text based on the title in hash 72 | def create_hash(path,entity_file=True): 73 | import os 74 | # input: path: directory of file source ,entity_file: whether the first line of the file is entity name or not 75 | # output: a hash with key of title and value of document 76 | title_text_hash= {} 77 | for file_name in os.listdir(path): 78 | document = open(path+file_name) 79 | # if we read from the documents where the first line is the entity name, we should skip it to get the title 80 | if (entity_file): 81 | ent = document.readline() 82 | document_title = document.readline() 83 | preprocessed_title = preprocess(document_title) 84 | text = document.read() 85 | title_text_hash[preprocessed_title,ent.strip()] = text.strip() 86 | return title_text_hash 87 | 88 | 89 | def create_annotation_data(labels,global_task_id): 90 | data_annot = [] 91 | for ind in range(len(labels)): 92 | data_annot.append((global_task_id*10+ind,global_task_id,frozenset([labels[ind]]))) 93 | t2 = AnnotationTask(data=data_annot 
,distance=ds('cs_3class.txt')) 94 | return t2.avg_Ao() 95 | 96 | # load amazon turk asnwer file as csv and data frame and return documents, entitie, titles, class label 97 | # Also it saves the pickle file of title and labels as hash to use for multiple submissions 98 | 99 | 100 | # load amazon turk asnwer file as csv and data frame and return documents, entitie, titles, class label for each paragraph 101 | # Also it saves the pickle file of title and labels as hash to use for multiple submissions 102 | #also it computes pairwise agreement 103 | 104 | 105 | all_labels = [] 106 | all_labels = [] 107 | def mturk_output_creator_paragraph_level(input_csv = 'long_sentences_with_second_batch_submission_multipleSubmission.csv' 108 | ,path = 'LDC2014E13_output/crowdsource/', 109 | pickle_reader = True, save_file_name = 'alldata_3Dec.csv', 110 | num_labels=3): 111 | # input: pickle_reader: there is a hash file of title and labels saved as pickle file, if this variable is true 112 | # this hash is read and used for furthur usage, otherwise it will be created 113 | # path: the path of the crowdsource files 114 | # header_id: for general csv file, we don't have index after column, for merged csv file, we have id of 2, 115 | # this parameter determines whether to add the index to the end of title and entity column or not 116 | # save_id: when we need to run multiple times and save multiple files, we use different indexing to prevent confusing 117 | # input_csv: input csv file which is the output of MTurk 118 | # output: a hash table of titles and different labels by the user 119 | # a csv file of the entity, doc index, title, sentiment and document text 120 | # title_text_hash = create_hash(path) 121 | final_labels_all = [] 122 | para_labels_all = [] 123 | title_column = 'Input.title' 124 | title_text_hash = create_hash(path) 125 | 126 | if pickle_reader: 127 | title_labels_hash = pickle.load(open('title_labels.pickle', 'rb')) 128 | doc_id = len(title_labels_hash) 129 | else: 130 | title_labels_hash = {} 131 | doc_id = 0 132 | columns = ['TARGET_ENTITY','DOCUMENT_INDEX','TITLE','DOCUMENT','TRUE_SENTIMENT','Paragraph0','Paragraph1' 133 | ,'Paragraph2','Paragraph3','Paragraph4','Paragraph5' ,'Paragraph6','Paragraph7','Paragraph8', 134 | 'Paragraph9','Paragraph10','Paragraph11','Paragraph12','Paragraph13' ,'Paragraph14','Paragraph15','Reannotate','HITId'] 135 | 136 | try: 137 | open(save_file_name) 138 | output_file = open(save_file_name, 'a+') 139 | fileWriter = csv.writer(output_file) 140 | except: 141 | output_file = open(save_file_name, 'a+') 142 | fileWriter = csv.writer(output_file) 143 | fileWriter.writerow(columns) 144 | data_df = pd.read_csv(input_csv) 145 | 146 | columns =[col for col in list(data_df) if col.startswith('Answer.sentiment_Pargraph')] 147 | print('ghabls az amal ', len(data_df)) 148 | data_df = data_df[data_df['AssignmentStatus']== 'Approved'] 149 | title = data_df[title_column] 150 | print('bad az amal ' , len(title)) 151 | print('approved assignment %d out of %d assignments' %(len(data_df[data_df['AssignmentStatus']== 'Approved']),len(data_df))) 152 | entity = data_df['Input.entity'] 153 | entity_decision= data_df['Answer.entity_decision'] 154 | 155 | # all_doc_titles = title.unique() 156 | hitids = data_df['HITId'].unique() 157 | skipped_doc_entity = 0 158 | not_found_label_documents = 0 159 | 160 | print("number of documents %d" % len(hitids)) 161 | 162 | global_id = 0 163 | try: 164 | Final_Sentiment = data_df[['Answer.Final_sentiment']] 165 | except: 166 | Final_Sentiment = 
get_final_sentiment(input_csv) 167 | final_sentiment_column_name = list(Final_Sentiment)[0] 168 | Final_Sentiment = pd.concat([Final_Sentiment,data_df['HITId']],axis=1) 169 | 170 | # for item in all_doc_titles: 171 | for hid in hitids: 172 | item = list(data_df[data_df['HITId']==hid][title_column])[0] 173 | entity_decision_array = entity_decision[data_df[title_column]==item].tolist() 174 | document_entity = entity[data_df[title_column]==item].tolist()[0] 175 | 176 | # if (item,document_entity) in title_labels_hash: 177 | # print('item is in the hash %s' % item) 178 | # continue 179 | 180 | 181 | final_entity_decision = max(set(entity_decision_array), key=entity_decision_array.count) 182 | # if most of the annotator mark the document as not related to the specified entity, skip it 183 | if final_entity_decision =='No': 184 | skipped_doc_entity += 1 185 | continue 186 | try: 187 | clean_title = preprocess(item) 188 | except: 189 | clean_title = preprocess(item.encode('utf-8')) 190 | try: 191 | 192 | document = title_text_hash[clean_title,document_entity] 193 | # document = data_df[data_df['HITId']==hid][title_column] 194 | except: 195 | print(document_entity) 196 | print('title not found ') 197 | continue 198 | row = [document_entity,doc_id,clean_title,document] 199 | df_allparagraphs = data_df[data_df['HITId']==hid] 200 | df_finalsentiment = Final_Sentiment[Final_Sentiment['HITId']==hid] 201 | 202 | final_labels = df_finalsentiment[final_sentiment_column_name].dropna().tolist() 203 | l,reannotate = getCombinedSentiment(final_labels) 204 | label = l[1] 205 | if len(final_labels) == num_labels : 206 | final_labels_all.append(final_labels) 207 | 208 | title_labels_hash[(item,document_entity)] = final_labels 209 | 210 | 211 | 212 | row.append(label) 213 | write_to_file = True 214 | ### IF the last paragraph label is the document label, it doesn't exclude it, we should remove it (TO DO) 215 | for column in columns: 216 | 217 | labels = df_allparagraphs[column].dropna().tolist() 218 | 219 | if len(labels) > 0: 220 | all_labels.append(labels) 221 | if len(labels) > num_labels: 222 | write_to_file = False 223 | print('larger than expect4ed, %s'%item) 224 | if len(labels) < num_labels: 225 | write_to_file = False 226 | print('hid %s has %d labels for column %s and length of the final labels is %d' %(str(hid),len(labels),str(column),len(final_labels) ) ) 227 | l,reann = getCombinedSentiment(labels) 228 | label = l[1] 229 | if reann: 230 | reannotate = True 231 | global_id += 1 232 | 233 | else: 234 | label = 'NaN' 235 | row.append(label) 236 | row.append(reannotate) 237 | row.append(hid) 238 | 239 | 240 | if write_to_file: 241 | fileWriter.writerow(row) 242 | doc_id += 1 243 | with open('title_labels.pickle', 'wb') as title_labels_pickle: 244 | pickle.dump(title_labels_hash, title_labels_pickle) 245 | output_file.close() 246 | print('skipped %d'%skipped_doc_entity) 247 | return final_labels_all,all_labels 248 | 249 | 250 | def kapa_computer(new_final_labels,weighted=False, weights= None): 251 | table = 1 * np.asarray(new_final_labels) #avoid integer division 252 | 253 | n_sub, n_cat = table.shape 254 | 255 | n_total = table.sum() 256 | n_rater = table.sum(1) 257 | n_rat = n_rater.max() 258 | #assume fully ranked 259 | assert n_total == n_sub * n_rat 260 | 261 | #marginal frequency of categories 262 | p_cat = table.sum(0) / n_total 263 | 264 | if weighted: 265 | table_weight = 1 * np.asarray(weights) 266 | table2 = np.matmul(table , table_weight) 267 | table2 = np.multiply(table2,table) 268 | else: 
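# Unweighted case: squaring the per-item category counts gives sum_j n_ij**2 for each item, so p_rat below is the standard Fleiss' kappa per-item agreement; the weighted branch above generalizes it with the cosine-based similarity weights between label categories.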
269 | table2 = table * table 270 | 271 | p_rat = (table2.sum(1) - n_rat) / (n_rat * (n_rat - 1.)) 272 | p_mean = p_rat.mean() 273 | 274 | 275 | 276 | p_mean_exp = (p_cat*p_cat).sum() 277 | 278 | kappa = float(p_mean - p_mean_exp) / (1- p_mean_exp) 279 | 280 | 281 | return kappa 282 | 283 | import statsmodels.stats.inter_rater as ir 284 | import random 285 | import statsmodels.stats.inter_rater as ir 286 | 287 | ### the method to return the array of number of votes per category for each item. 288 | ### it can come in 3 categories as pos, neut, neg or 5 categories, pos, slig pos, neut, slig neg, neg 289 | def create_sent_arr(finals,number_class=3,number_voters=3): 290 | new_final_labels = [] 291 | wrong_voters = 0 292 | for item in finals: 293 | new_l = [] 294 | if len(item) != number_voters: 295 | # print(item) 296 | wrong_voters += 1 297 | continue 298 | negs = 0 299 | neus = 0 300 | pos = 0 301 | slneg = 0 302 | slpos = 0 303 | for l in item: 304 | if l == 'Negative': 305 | negs += 1 306 | elif l == 'Slightly_Negative': 307 | if number_class == 5: 308 | slneg += 1 309 | else: 310 | negs += 1 311 | elif 'Neutral' in l: 312 | neus += 1 313 | if number_class == 2: 314 | negs += 1 315 | 316 | 317 | elif l == 'Positive': 318 | pos += 1 319 | 320 | else: # slightly positive 321 | if number_class == 5: 322 | slpos += 1 323 | else: 324 | pos += 1 325 | if number_class == 5: 326 | new_final_labels.append([negs,slneg,neus,slpos,pos]) 327 | elif number_class== 2: 328 | new_final_labels.append([negs,pos]) 329 | else: 330 | new_final_labels.append([negs,neus,pos]) 331 | print('wrong number of voters: %d'%wrong_voters) 332 | return new_final_labels 333 | 334 | 335 | ## create reannotation input data to assign to new annotators in mturk 336 | ## there are two input files, the first one is the original annotation from the mturk output, 337 | ## the second one is the final sentiment file in which it shows the whether each input needs to be reannotated or not 338 | ## based on the variance of the paragraph and document labels. 
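## A HIT is flagged for reannotation when getCombinedSentiment sees a label variance above 2 on its -2..+2 scale; e.g. three votes of Negative, Neutral, Positive map to (-2, 0, 2), whose variance of about 2.67 sends the document back for a new annotator.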
339 | 340 | def create_reannotate_mturk_input(original_csv='./mturk_out_Aug19_part2.csv', 341 | final_sentiment_csv = './deleteme.csv', 342 | outfile_resubmit='mturk_input_part2_v2.csv', 343 | outfile_accepted= 'mturk_out_accepted_part2_v2.csv'): 344 | 345 | final_sent = pd.read_csv(final_sentiment_csv) 346 | original_docs = pd.read_csv(original_csv) 347 | reann_docs = final_sent[final_sent['Reannotate']==True] 348 | 349 | out_df = open(outfile_resubmit, 'w') 350 | fileWriter = csv.writer(out_df) 351 | fileWriter.writerow(['entity','title','content']) 352 | 353 | hids = reann_docs['HITId'] 354 | 355 | 356 | for id in hids: 357 | 358 | entity = list(original_docs[original_docs['HITId']==id]['Input.entity'])[0] 359 | title = list(original_docs[original_docs['HITId']==id]['Input.title'])[0] 360 | doc = list(original_docs[original_docs['HITId']==id]['Input.content'])[0] 361 | fileWriter.writerow([entity,title,doc]) 362 | 363 | accepted_docs = final_sent[final_sent['Reannotate']==False] 364 | out_df = open(outfile_accepted, 'w') 365 | fileWriter = csv.writer(out_df) 366 | fileWriter.writerow(['entity','title','content']) 367 | 368 | hids = reann_docs['HITId'] 369 | 370 | 371 | for id in hids: 372 | 373 | entity = list(original_docs[original_docs['HITId']==id]['Input.entity'])[0] 374 | title = list(original_docs[original_docs['HITId']==id]['Input.title'])[0] 375 | doc = list(original_docs[original_docs['HITId']==id]['Input.content'])[0] 376 | fileWriter.writerow([entity,title,doc]) 377 | 378 | 379 | out_df.close() 380 | 381 | 382 | 383 | path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_Aug19/' 384 | part= 'part1' 385 | save_path = 'mturk_final_sentiemnt_%s.csv' % part 386 | original_csv='./data/mturk_out_Aug19_%s.csv' % part 387 | 388 | 389 | 390 | number_class = 5 391 | weighted = True 392 | w_cons = 45*np.pi/180 393 | 394 | try: 395 | with open('final_labels__.out','rb') as f: 396 | final_labels=pickle.load(f) 397 | with open('par_lables__.out','rb') as f: 398 | par_labels=pickle.load(f) 399 | 400 | # final_labels = [] 401 | # par_labels = [] 402 | # 403 | # final6,all_labels = \ 404 | # mturk_output_creator_paragraph_level(input_csv = 'testi.csv' , 405 | # path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_aug19/', 406 | # pickle_reader=False,num_labels=3) 407 | # 408 | # 409 | # final_labels.extend(final6) 410 | # par_labels.extend(all_labels) 411 | 412 | except Exception as e: 413 | print('error occurred: %s'%str(e)) 414 | final_labels = [] 415 | par_labels = [] 416 | final_labels,all_lables = mturk_output_creator_paragraph_level(input_csv = original_csv,path=path,pickle_reader=False,save_file_name =save_path) 417 | final6,all_labels = mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out/Mturk_dataset_for_crowdsource/mergeset.csv', 418 | path='../dataAnalysis/LDC2014E13_output/used_documents_in_MTurk_merge/small/', 419 | pickle_reader=False, 420 | save_file_name = 'all_data_aug19.csv') 421 | final_labels.extend(final6) 422 | par_labels.extend(all_labels) 423 | 424 | for i in [1,2,3,4,5]: 425 | final,all_labels = \ 426 | mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out_1Dec/emnlp_%s.csv'%str(i), 427 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp/', 428 | pickle_reader=True,save_file_name = 'all_data_aug19.csv') 429 | final_labels.extend(final) 430 | par_labels.extend(all_labels) 431 | 432 | 433 | 
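# Each remaining MTurk batch (EMNLP paragraph-separated, KBP, crowdsource, and the Aug19 parts) is appended to the same final_labels / par_labels pools; agreement is computed over the combined pools at the end of the script.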
final8,all_labels = \ 434 | mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out/emnlp_PS_out.csv', 435 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated/', 436 | pickle_reader=True,save_file_name = 'all_data_aug19.csv') 437 | final_labels.extend(final8) 438 | par_labels.extend(all_labels) 439 | 440 | final9,all_labels = \ 441 | mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out/kbp_mturk_out.csv', 442 | path='../dataAnalysis/LDC2014E13_output/used_documents_in_MTurk_KBP/', 443 | pickle_reader=True, 444 | save_file_name = 'all_data_aug19.csv') 445 | final_labels.extend(final9) 446 | par_labels.extend(all_labels) 447 | 448 | final10,all_labels = \ 449 | mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out/Mturk_dataset_for_crowdsource/smallset1.csv', 450 | path='../dataAnalysis/LDC2014E13_output/used_documents_in_MTurk_merge/small/',pickle_reader=True, 451 | save_file_name = 'all_data_aug19.csv') 452 | final_labels.extend(final10) 453 | par_labels.extend(all_labels) 454 | 455 | final11,all_labels = \ 456 | mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out/Mturk_dataset_for_crowdsource/largeset1.csv', 457 | path='../dataAnalysis/LDC2014E13_output/used_documents_in_MTurk_merge/large/',pickle_reader=True, 458 | save_file_name = 'all_data_aug19.csv') 459 | 460 | final_labels.extend(final11) 461 | par_labels.extend(all_labels) 462 | 463 | 464 | final12,all_labels = \ 465 | mturk_output_creator_paragraph_level(input_csv ='../dataAnalysis/mturk_out_1Dec/long_1.csv', 466 | path='../dataAnalysis/LDC2014E13_output/used_documents_in_MTurk_merge/large/',pickle_reader=True, 467 | save_file_name = 'all_data_aug19.csv') 468 | final_labels.extend(final12) 469 | par_labels.extend(all_labels) 470 | 471 | 472 | final13,all_labels = \ 473 | mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out_4Dec/emnlp_batch2_mturk_out.csv', 474 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_batch2/' 475 | ,pickle_reader=True, 476 | save_file_name = 'all_data_aug19.csv') 477 | final_labels.extend(final13) 478 | par_labels.extend(all_labels) 479 | final14,all_labels = mturk_output_creator_paragraph_level(input_csv = '../dataAnalysis/mturk_out_4Dec/emnlp_batch3_mturk_out.csv', 480 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_batch3/' 481 | ,pickle_reader=True,save_file_name = 'all_data_aug19.csv') 482 | final_labels.extend(final14) 483 | par_labels.extend(all_labels) 484 | 485 | #### This is the documents before re-assignment 486 | # final14,all_labels = \ 487 | # mturk_output_creator_paragraph_level(input_csv = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/data/mturk_out_aug19_part1.csv', 488 | # path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_aug19/' 489 | # ,pickle_reader=False,save_file_name = 'all_data_aug19.csv') 490 | # 491 | # final_labels.extend(final14) 492 | # par_labels.extend(all_labels) 493 | # 494 | # 495 | # final14,all_labels = \ 496 | # mturk_output_creator_paragraph_level(input_csv = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/data/mturk_out_aug19_part2.csv', 497 | # path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_aug19/' 498 | # ,pickle_reader=False,save_file_name = 'all_data_aug19.csv') 499 | # 500 | # 
final_labels.extend(final14) 501 | # par_labels.extend(all_labels) 502 | # 503 | #### part 3 is not reassigned, so we have just one version of it 504 | final14,all_labels = \ 505 | mturk_output_creator_paragraph_level(input_csv = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/data/mturk_out_aug19_part3.csv', 506 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_aug19_part2/' 507 | ,pickle_reader=False,save_file_name = 'all_data_aug19.csv') 508 | 509 | final_labels.extend(final14) 510 | par_labels.extend(all_labels) 511 | 512 | #### This is the documents after re-assignment. 3 out of 4 are selected which are the least diverse 513 | final14,all_labels = \ 514 | mturk_output_creator_paragraph_level(input_csv = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/mturk_out_combined_versions_part1.csv', 515 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_aug19/' 516 | ,pickle_reader=False,save_file_name = 'all_data_aug19.csv') 517 | 518 | final_labels.extend(final14) 519 | par_labels.extend(all_labels) 520 | 521 | final14,all_labels = \ 522 | mturk_output_creator_paragraph_level(input_csv = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/mturk_out_combined_versions_part2.csv', 523 | path='../dataAnalysis/emnlp18_data/used_documents_in_MTurk_emnlp_paragraph_seperated_aug19/' 524 | ,pickle_reader=True,save_file_name = 'all_data_aug19.csv') 525 | 526 | final_labels.extend(final14) 527 | par_labels.extend(all_labels) 528 | print(len(final_labels)) 529 | print(len(par_labels)) 530 | 531 | with open('final_labels.out','wb') as f: 532 | pickle.dump(final_labels,f) 533 | with open('par_lables.out','wb') as f: 534 | pickle.dump(par_labels,f) 535 | 536 | 537 | if number_class== 5: 538 | # weights = np.ones([5,5]) 539 | weights = [[1,np.cos(w_cons/2),np.cos(w_cons),np.cos(w_cons*3/2),0], # negative 540 | [np.cos(w_cons/2),1,np.cos(w_cons/2),np.cos(w_cons),np.cos(w_cons*3/2)], #Slightly Negative 541 | [np.cos(w_cons),np.cos(w_cons/2),1,np.cos(w_cons/2),np.cos(w_cons)], #Neutral 542 | [np.cos(w_cons*3/2),np.cos(w_cons),np.cos(w_cons/2),1,np.cos(w_cons/2)], #Slightly Positive 543 | [0,np.cos(w_cons*3/2),np.cos(w_cons),np.cos(w_cons/2),1]] #Positive 544 | else: 545 | # weights = np.ones([3,3]) 546 | weights = [[1 ,np.cos(w_cons) ,0], 547 | [np.cos(w_cons) ,1 ,np.cos(w_cons)], 548 | [0 ,np.cos(w_cons) ,1]] 549 | 550 | ######## !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! the order of paragraphs should be fixed before running this code !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
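Editor's note: both hard-coded weight matrices above follow the same rule — the agreement weight between two classes is the cosine of their class distance, scaled so that the two extreme labels sit 90 degrees apart and get weight 0 (with `w_cons = 45*np.pi/180`, adjacent classes get `cos(w_cons/2)` in the 5-class case and `cos(w_cons)` in the 3-class case). A short sketch that generates the matrix for any number of classes and checks it against the hard-coded versions; `cosine_weights` is a name introduced here for illustration:

```python
import numpy as np

def cosine_weights(num_classes):
    """Agreement weight between classes i and j: the cosine of their angular
    distance, with the full label range spread over 90 degrees so the two
    extreme classes get weight 0."""
    step = (np.pi / 2) / (num_classes - 1)   # 22.5 deg for 5 classes, 45 deg for 3
    idx = np.arange(num_classes)
    return np.cos(step * np.abs(idx[:, None] - idx[None, :]))

w_cons = 45 * np.pi / 180
manual_3 = np.array([[1, np.cos(w_cons), 0],
                     [np.cos(w_cons), 1, np.cos(w_cons)],
                     [0, np.cos(w_cons), 1]])
assert np.allclose(cosine_weights(3), manual_3)
assert np.isclose(cosine_weights(5)[0, 0], 1.0) and np.isclose(cosine_weights(5)[0, 4], 0.0)
```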
551 | 552 | print('number of paragraphs: ',len( par_labels)) 553 | 554 | arr = create_sent_arr(par_labels,number_class,number_voters=3) 555 | print('length of the arr is: %d'%len(arr)) 556 | agg_par = kapa_computer(arr, weighted=weighted,weights=weights) 557 | print('paragraph level agreement number is : %s'%str(agg_par)) 558 | 559 | 560 | print('number of documents: ',len( final_labels)) 561 | arr = create_sent_arr(final_labels,number_class,number_voters=3) 562 | agg_doc = kapa_computer(arr, weighted=weighted, weights=weights) 563 | print('document level agreement number is : %s'%str(agg_doc)) 564 | 565 | # # create_reannotate_mturk_input(original_csv=original_csv, 566 | # # final_sentiment_csv = save_path, 567 | # # outfile_resubmit='mturk_input_%s_v2.csv'%part, 568 | # # outfile_accepted= 'mturk_out_accepted_%s_v2.csv'%part) 569 | -------------------------------------------------------------------------------- /pre_post_processing_steps/7_seperate_train_test.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import collections 3 | import random 4 | import matplotlib.pyplot as plt 5 | import matplotlib.pylab as pylab 6 | def separate_train_test(path ): 7 | df = pd.read_csv(path) 8 | print('before dropping dupicate:' , len(df)) 9 | df = df.drop_duplicates(subset='DOCUMENT', keep=False) 10 | print('after dropping dupicate:' , len(df)) 11 | # dropped.to_csv('all_data_combined_shuffled_drop_dup_3Dec_7Dec_aug19_%s.csv'%str(set_name),index=False) 12 | 13 | 14 | 15 | entities = df['TARGET_ENTITY'] 16 | # en_freq = collections.Counter(entities) 17 | unique_entities = entities.unique().tolist() 18 | # print(unique_entities) 19 | data_size = len(df) 20 | print('whole dataset size is %d'%data_size) 21 | # print(en_freq) 22 | # fixed test set 23 | # fixed_test_entities = ['Barack Obama', 'LeBron James', 'Hillary Clinton'] 24 | fixed_test_entities = ['Donald Trump', 'Barack Obama','Trump','Obama'] 25 | fixed_test_set = pd.DataFrame([],columns=['TARGET_ENTITY','DOCUMENT_INDEX','TITLE','DOCUMENT','TRUE_SENTIMENT']) 26 | for entity in fixed_test_entities: 27 | unique_entities.remove(entity) 28 | df_entity = df[df['TARGET_ENTITY'] == entity] 29 | fixed_test_set = fixed_test_set.append(df_entity) 30 | fixed_test_size = len(fixed_test_set) 31 | print('fixed test set size is %d' %fixed_test_size) 32 | 33 | random_test_set = pd.DataFrame([],columns=['TARGET_ENTITY','DOCUMENT_INDEX','TITLE','DOCUMENT','TRUE_SENTIMENT']) 34 | random_test_size = len(random_test_set) 35 | random_test_entities = [] 36 | # random test set 37 | while random_test_size < float(data_size)/10: 38 | # print(unique_entities) 39 | entity = random.choice(unique_entities) 40 | random_test_entities.append(entity) 41 | # print(entity) 42 | unique_entities.remove(entity) 43 | df_entity = df[df['TARGET_ENTITY']== entity] 44 | df_entity = df_entity.head(30) 45 | random_test_set = random_test_set.append(df_entity) 46 | random_test_size = len(random_test_set) 47 | print('random test set size is now %d'%random_test_size) 48 | 49 | dev_set = pd.DataFrame([],columns=['TARGET_ENTITY','DOCUMENT_INDEX','TITLE','DOCUMENT','TRUE_SENTIMENT']) 50 | dev_size = len(dev_set) 51 | dev_entities = [] 52 | while dev_size < float(data_size) / 10: 53 | entity = random.choice(unique_entities) 54 | dev_entities.append(entity) 55 | unique_entities.remove(entity) 56 | df_entity = df[df['TARGET_ENTITY']== entity] 57 | df_entity = df_entity.head(30) 58 | dev_set = dev_set.append(df_entity) 59 | dev_size = 
len(dev_set) 60 | print('dev set size is now %d' % dev_size) 61 | train_set = pd.DataFrame([],columns=['TARGET_ENTITY','DOCUMENT_INDEX','TITLE','DOCUMENT','TRUE_SENTIMENT']) 62 | #enforce the train set to have just firt 30 documents of each entity 63 | for entity in unique_entities: 64 | df_entity = df[df['TARGET_ENTITY'] == entity] 65 | df_entity = df_entity.head(30) 66 | train_set = train_set.append(df_entity) 67 | 68 | # train_set = df[df['TARGET_ENTITY'].isin( unique_entities)] 69 | print('train set size is %d' % len(train_set)) 70 | return train_set,unique_entities,dev_set, dev_entities,random_test_set,random_test_entities,fixed_test_set,fixed_test_entities 71 | 72 | 73 | def plot_class_distribution(dataset,negative,positive,neutral): 74 | print('distribution of %s set among three classes"'%dataset) 75 | print('negative:\n ', negative) 76 | print('positive:\n ', positive) 77 | print('neutral:\n ', neutral) 78 | # Data to plot 79 | plt.figure() 80 | labels = 'Negative', 'Positive', 'Neutral' 81 | sizes = [negative, 82 | positive 83 | , neutral] 84 | 85 | colors = ['Red', 'Green', 'Yellow'] 86 | 87 | # Plot 88 | plt.pie(sizes, labels=labels, colors=colors, 89 | autopct='%1.1f%%', shadow=True, startangle=140) 90 | 91 | plt.axis('equal') 92 | plt.title(dataset, horizontalalignment='center', verticalalignment='bottom') 93 | # plt.show() 94 | pylab.savefig('%s.png'%dataset) 95 | 96 | 97 | def entity_frequency(df): 98 | plt.figure() 99 | entities = df['TARGET_ENTITY'] 100 | entity_count = collections.Counter(entities) 101 | data = entity_count.most_common(10) 102 | plot_df = pd.DataFrame(data, columns=['entity', 'frequency']) 103 | plot_df.plot(kind='bar', x='entity') 104 | pylab.savefig('%s.png' % 'entity_frequency') 105 | 106 | def paragraph_distribution(df,plot=True): 107 | # plot sentence information 108 | plt.figure() 109 | documents = df['DOCUMENT'].tolist() 110 | doc_length = [] 111 | for document in documents: 112 | try: 113 | doc_length.append(len(document.split('\n'))) 114 | except: 115 | continue 116 | if plot: 117 | plt.hist(doc_length,len(set(doc_length))) 118 | plt.xlabel('Number of Paragraphs') 119 | plt.ylabel('Frequency') 120 | plt.axis([0, 30, 0, 2000]) 121 | pylab.savefig('%s.png' % 'paragraph_freq') 122 | # plt.legend() 123 | print('total number of sentences:%d'% sum(doc_length)) 124 | print('plot done') 125 | return doc_length 126 | 127 | def word_distribution(df,plot=True,plot_name='train_dev'): 128 | # plot sentence information 129 | plt.figure() 130 | documents = df['DOCUMENT'].tolist() 131 | sentence_length = [] 132 | for document in documents: 133 | try: 134 | sentences = document.split('\n') 135 | 136 | except: 137 | continue 138 | for sentence in sentences: 139 | sentence_length.append(len(sentence.split())) 140 | if plot: 141 | plt.hist(sentence_length,len(set(sentence_length))) 142 | plt.xlabel('Number of Words in Sentence') 143 | plt.ylabel('Frequency') 144 | plt.axis([0, 120, 0, 5000]) 145 | pylab.savefig('%s.png' % plot_name) 146 | # plt.legend() 147 | print('total number of sentences:%d'% sum(sentence_length)) 148 | print('plot done') 149 | return sentence_length 150 | 151 | 152 | # path = '/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/data/alldata_12Nov.csv' 153 | path = 'all_data_aug19.csv' 154 | train_set,train_entities,dev_set, dev_entities,random_test_set,random_test_entities,fixed_test_set,fixed_test_entities = separate_train_test(path) 155 | train_set.to_csv('alldata_aug19_train.csv',index=False) 156 | 
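Editor's note: the split logic above accumulates rows with `DataFrame.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0. A small `pd.concat`-based equivalent of the per-entity accumulation (at most 30 documents per entity, as in the loops above); `take_first_docs` is a helper name introduced here as a sketch:

```python
import pandas as pd

def take_first_docs(df, entities, per_entity=30):
    """Collect at most `per_entity` documents per entity, as in the loops above,
    using pd.concat instead of the removed DataFrame.append."""
    parts = [df[df['TARGET_ENTITY'] == entity].head(per_entity) for entity in entities]
    return pd.concat(parts, ignore_index=True) if parts else df.iloc[0:0]

# e.g. train_set = take_first_docs(df, unique_entities)
```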
dev_set.to_csv('alldata_aug19_dev.csv',index=False) 157 | random_test_set.to_csv('alldata_aug19_random_test.csv',index= False) 158 | fixed_test_set.to_csv('alldata_aug19_fixed_test.csv',index= False) 159 | 160 | print('train set is: \n',train_entities) 161 | print('dev set is: \n',dev_entities) 162 | print('random test set is: \n',random_test_entities) 163 | print('fixed test set is: \n',fixed_test_entities) 164 | 165 | print('number of unique entities in train set is %d' % len(train_entities)) 166 | print('number of unique entities in dev set is %d' %len(dev_entities)) 167 | print('number of unique entities in random test set is %d'%len(random_test_entities)) 168 | print('number of unique entities in fixed test set is %d'%len(fixed_test_entities)) 169 | 170 | 171 | # 172 | # 173 | plot_class_distribution('train',len(train_set[train_set['TRUE_SENTIMENT']=='Negative']), 174 | len(train_set[train_set['TRUE_SENTIMENT'] == 'Positive']), 175 | len(train_set[train_set['TRUE_SENTIMENT'] == 'Neutral'])) 176 | 177 | plot_class_distribution('dev',len(dev_set[dev_set['TRUE_SENTIMENT']=='Negative']), 178 | len(dev_set[dev_set['TRUE_SENTIMENT'] == 'Positive']), 179 | len(dev_set[dev_set['TRUE_SENTIMENT'] == 'Neutral'])) 180 | 181 | plot_class_distribution('random_test',len(random_test_set[random_test_set['TRUE_SENTIMENT']=='Negative']), 182 | len(random_test_set[random_test_set['TRUE_SENTIMENT'] == 'Positive']), 183 | len(random_test_set[random_test_set['TRUE_SENTIMENT'] == 'Neutral'])) 184 | 185 | plot_class_distribution('fixed_test',len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT']=='Negative']), 186 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Positive']), 187 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Neutral'])) 188 | 189 | 190 | train_set= pd.read_csv('alldata_aug19_train.csv') 191 | dev_set = pd.read_csv('alldata_aug19_dev.csv') 192 | random_test_set= pd.read_csv('alldata_aug19_random_test.csv') 193 | fixed_test_set = pd.read_csv('alldata_aug19_fixed_test.csv') 194 | all_used_docs = train_set.append(dev_set).append(random_test_set).append(fixed_test_set) 195 | plot_class_distribution('fixed_test',len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT']=='Negative']), 196 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Positive']), 197 | len(fixed_test_set[fixed_test_set['TRUE_SENTIMENT'] == 'Neutral'])) 198 | # all_used_docs.to_csv('/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/dataAnalysis/data/alldata_12Nov_30Nov.csv',index= False) 199 | # entity_frequency(all_used_docs) 200 | paragraph_distribution(all_used_docs,plot=True) 201 | 202 | para_train = paragraph_distribution(train_set,plot=False) 203 | para_dev = paragraph_distribution(dev_set,plot=False) 204 | para_fixed_test = paragraph_distribution(fixed_test_set,plot=False) 205 | para_random_test = paragraph_distribution(random_test_set,plot=False) 206 | print('paragraphs in train: %d, dev %d, fixed test %d, random test %d' %(sum(para_train),sum(para_dev),sum(para_fixed_test),sum(para_random_test))) 207 | print('max paragraphs in train: %d, dev %d, fixed test %d, random test %d' %(max(para_train),max(para_dev),max(para_fixed_test),max(para_random_test))) 208 | 209 | word_distribution(all_used_docs,plot=True) 210 | 211 | sent_train = word_distribution(train_set.append(dev_set),plot=False) 212 | sent_dev = word_distribution(dev_set,plot=False) 213 | sent_fixed_test = word_distribution(fixed_test_set,plot=False) 214 | sent_random_test = word_distribution(random_test_set,plot=False) 215 | 
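Editor's note: the per-class counts passed to `plot_class_distribution` are computed with three separate boolean filters per split; `value_counts` gives the same numbers in one pass. Note also that `sent_train` just above is computed on `train_set.append(dev_set)`, so the printed "words in train" figure appears to include the dev split as well. A sketch of the one-pass counting, with `class_counts` introduced here for illustration:

```python
def class_counts(df):
    """Per-class document counts for one split (same numbers as the three filters above)."""
    counts = df['TRUE_SENTIMENT'].value_counts()
    return counts.get('Negative', 0), counts.get('Positive', 0), counts.get('Neutral', 0)

# e.g. plot_class_distribution('train', *class_counts(train_set))
```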
print('words in train: %d, dev %d, fixed test %d, random test %d' %(sum(sent_train),sum(sent_dev),sum(sent_fixed_test),sum(sent_random_test))) 216 | print('max wordsin train: %d, dev %d, fixed test %d, random test %d' %(max(sent_train),max(sent_dev),max(sent_fixed_test),max(sent_random_test))) 217 | 218 | 219 | -------------------------------------------------------------------------------- /pre_post_processing_steps/8_masked_data_prepare_datasets.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | import pandas as pd 5 | import spacy 6 | nlp = spacy.load('en') 7 | 8 | # Add neural coref to SpaCy's pipe 9 | import neuralcoref 10 | neuralcoref.add_to_pipe(nlp) 11 | 12 | 13 | # In[3]: 14 | 15 | 16 | #### Create a dataframe from the main dataset with two new columns: 17 | #### enclose entity and its corefrences with and tags save in ENCLOSED_ENTITY column 18 | #### replace all occurrences of main entity with [TGT] and save in MASKED_ENTITY column 19 | 20 | 21 | def enclose_mask_entity(df): 22 | new_df = df 23 | docs = df['DOCUMENT'] 24 | print(len(list(docs))) 25 | ind = -1 26 | enclosed_docs = [] 27 | masked_docs = [] 28 | for text in list(docs): 29 | if ind %100 == 0: 30 | print('document %d processed'%ind) 31 | 32 | ind +=1 33 | doc_id = df['DOCUMENT_INDEX'].iloc[ind] 34 | entity = df['TARGET_ENTITY'].iloc[ind] 35 | 36 | sc = 0 37 | new_text_tag = "" 38 | new_text_mask = "" 39 | corefs = [] 40 | 41 | try: 42 | ## preprocess for coreference resolution 43 | doc = nlp(text) 44 | 45 | # find all mentions and coreferences 46 | for item in doc._.coref_clusters: 47 | # if the head of cluster is the intended entity then add all of the mentions 48 | if entity in item.main.text or item.main.text in entity: 49 | for span in item: 50 | corefs.append(span) 51 | 52 | for item in sorted(corefs): 53 | 54 | 55 | pronoun = item.text 56 | 57 | ec = item.start_char 58 | 59 | if ec < sc: 60 | continue 61 | 62 | 63 | new_text_tag += text[sc:ec] + ' '+pronoun+' ' 64 | new_text_mask += text[sc:ec] + ' [TGT] ' 65 | if '\n' in item.text: 66 | new_text_mask += '\n' 67 | new_text_tag += '\n' 68 | sc = item.end_char 69 | 70 | if len(corefs) >0: 71 | new_text_tag += text[sc:] 72 | new_text_mask += text[sc:] 73 | 74 | else: 75 | new_text_tag = text.replace(entity,' '+entity+' ' ) 76 | new_text_mask = text.replace(entity,' [TGT] ' ) 77 | # print('coref couldnt find the main entity in document %d'%doc_id) 78 | except Exception as e: 79 | print('can not resolve coref for document %d, error is %s '% (doc_id,str(e))) 80 | new_text_tag = text 81 | new_text_mask = text 82 | enclosed_docs.append(new_text_tag) 83 | masked_docs.append(new_text_mask) 84 | new_df['ENCLOSED_DOCUMENT'] = pd.Series(enclosed_docs) 85 | new_df['MASKED_DOCUMENT'] = pd.Series(masked_docs) 86 | return new_df 87 | 88 | 89 | data_train = pd.read_csv('alldata_aug19_train.csv', encoding='latin-1') 90 | new_df = enclose_mask_entity(data_train) 91 | new_df.to_csv('alldata_aug19_enclosed_masked_train.csv', encoding='latin-1') 92 | # 93 | data_dev = pd.read_csv('alldata_aug19_dev.csv', encoding='latin-1') 94 | new_df = enclose_mask_entity(data_dev) 95 | new_df.to_csv('alldata_aug19_enclosed_masked_dev.csv', encoding='latin-1') 96 | # 97 | # 98 | data_test = pd.read_csv('alldata_aug19_random_test.csv', encoding='latin-1') 99 | new_df = enclose_mask_entity(data_test) 100 | new_df.to_csv('alldata_aug19_enclosed_masked_random_test.csv', encoding='latin-1') 101 | # 102 | # 103 | 
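Editor's note: `8_masked_data_prepare_datasets.py` relies on `neuralcoref`, which only supports spaCy 2.x, and on the bare `spacy.load('en')` shortcut that was removed in spaCy 3. A minimal, self-contained sketch of the same masking idea, assuming a spaCy 2.1 environment with `neuralcoref` and `en_core_web_sm` installed (`mask_entity` is a name introduced here):

```python
# Assumes a spaCy 2.1.x environment with neuralcoref and en_core_web_sm installed;
# neuralcoref does not support spaCy 3, where the bare 'en' shortcut was removed.
import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)

def mask_entity(text, entity, mask='[TGT]'):
    """Replace the target entity and its coreferent mentions with a mask token,
    mirroring the MASKED_DOCUMENT column built above."""
    doc = nlp(text)
    spans = [span for cluster in doc._.coref_clusters
             if entity in cluster.main.text or cluster.main.text in entity
             for span in cluster]
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start_char):
        if span.start_char < cursor:   # skip overlapping mentions
            continue
        out.append(text[cursor:span.start_char] + ' %s ' % mask)
        cursor = span.end_char
    out.append(text[cursor:])
    return ''.join(out) if spans else text.replace(entity, ' %s ' % mask)

# print(mask_entity("Donald Trump said he would run again.", "Donald Trump"))
```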
data_test_fixed = pd.read_csv('alldata_aug19_fixed_test.csv', encoding='latin-1') 104 | new_df = enclose_mask_entity(data_test_fixed) 105 | new_df.to_csv('alldata_aug19_enclosed_masked_fixed_test.csv', encoding='latin-1') 106 | # 107 | 108 | 109 | 110 | # # DataStats Statistics 111 | 112 | # In[42]: 113 | 114 | 115 | def data_dist(input_file): 116 | doc_lengths = [] 117 | par_length_withDoc = [] 118 | par_length_woDoc = [] 119 | i=-1 120 | for doc in input_file['DOCUMENT']: 121 | i += 1 122 | length = len(doc.split(' ')) 123 | doc_lengths.append(length) 124 | if pd.notnull(input_file['Paragraph0'].iloc[i]): 125 | pars = doc.split('\n') 126 | for par in pars: 127 | par_length_withDoc.append(len(par.split(' '))) 128 | par_length_woDoc.append(len(par.split(' '))) 129 | else: 130 | par_length_woDoc.append(length) 131 | 132 | return(doc_lengths,par_length_withDoc,par_length_woDoc) 133 | 134 | 135 | 136 | 137 | 138 | # In[43]: 139 | 140 | 141 | from matplotlib import pyplot 142 | (doc_lengths,par_length_withDoc,par_length_woDoc)=data_dist(data_train) 143 | pyplot.figure() 144 | pyplot.hist(doc_lengths,bins=[0,128,256,384,512]); 145 | pyplot.title('Distributaion of Document Length in Train Set') 146 | pyplot.xlabel('Document Length') 147 | pyplot.ylabel('Frequency') 148 | 149 | 150 | pyplot.figure() 151 | pyplot.hist(par_length_withDoc,bins=[0,20,40,60,80,100,120,140,160]); 152 | pyplot.title('Distributaion of Paragraph Length in Train Set') 153 | pyplot.xlabel('Paragraph Length') 154 | pyplot.ylabel('Frequency') 155 | 156 | 157 | 158 | 159 | # In[44]: 160 | 161 | 162 | (doc_lengths,par_length_withDoc,par_length_woDoc)=data_dist(data_dev) 163 | 164 | pyplot.figure() 165 | pyplot.hist(doc_lengths,bins=[0,128,256,384,512]); 166 | pyplot.title('Distributaion of Document Length in Dev Set') 167 | pyplot.xlabel('Document Length') 168 | pyplot.ylabel('Frequency') 169 | 170 | 171 | 172 | 173 | 174 | 175 | pyplot.figure() 176 | pyplot.hist(par_length_woDoc,bins=[0,20,40,60,80,100,120,140,160]); 177 | pyplot.title('Distributaion of Paragraph Length in Dev Set') 178 | pyplot.xlabel('Paragraph Length') 179 | pyplot.ylabel('Frequency') 180 | 181 | pyplot.show() 182 | 183 | -------------------------------------------------------------------------------- /pre_post_processing_steps/9_combine_pre_new_sets.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | sets_name = ['train','dev','random_test','fixed_test'] 3 | pre_set_prefix = 'alldata_3Dec_7Dec_PS_reindex_enclosed_masked' 4 | new_set_prefix = 'alldata_aug19_enclosed_masked' 5 | pre_sets = [] # train, dev, random_test, fixed_test 6 | new_sets = [] 7 | concat_sets= [] 8 | shuffle_sets = [] 9 | 10 | column_names=['DOCUMENT_INDEX','TITLE','TARGET_ENTITY','DOCUMENT','MASKED_DOCUMENT','TRUE_SENTIMENT','Paragraph0','Paragraph1','Paragraph2','Paragraph3','Paragraph4','Paragraph5','Paragraph6','Paragraph7','Paragraph8','Paragraph9','Paragraph10','Paragraph11','Paragraph12','Paragraph13','Paragraph14','Paragraph15'] 11 | 12 | for set_name in sets_name: 13 | pre_set = pd.read_csv('./pre_set/%s_%s_v3.csv'%(str(pre_set_prefix),str(set_name))) 14 | pre_sets.append(pre_set) 15 | 16 | new_set = pd.read_csv('./new_set/%s_%s.csv'%(str(new_set_prefix),str(set_name)),encoding = "ISO-8859-1") 17 | new_set['DOCUMENT_INDEX'] = new_set['DOCUMENT_INDEX']+ 3000 18 | new_sets.append(new_set) 19 | concat_set = pd.concat([pre_set, 
new_set[['TARGET_ENTITY','DOCUMENT_INDEX','TITLE','DOCUMENT','TRUE_SENTIMENT','Paragraph0','Paragraph1','Paragraph2','Paragraph3','Paragraph4','Paragraph5','Paragraph6','Paragraph7','Paragraph8','Paragraph9','Paragraph10','Paragraph11','Paragraph12','Paragraph13','Paragraph14','Paragraph15','MASKED_DOCUMENT']]],sort=True,ignore_index=True) 20 | # concat_set = pd.concat([pre_set, new_set],names=['DOCUMENT_INDEX','TITLE','TARGET_ENTITY','DOCUMENT','MASKED_DOCUMENT','TRUE_SENTIMENT','Paragraph0','Paragraph1','Paragraph2','Paragraph3','Paragraph4','Paragraph5','Paragraph6','Paragraph7','Paragraph8','Paragraph9','Paragraph10','Paragraph11','Paragraph12','Paragraph13','Paragraph14','Paragraph15'],sort=False,ignore_index=True) 21 | concat_set = concat_set[column_names] 22 | shuffle_set = concat_set.sample(frac=1) 23 | concat_sets.append(concat_set) 24 | shuffle_sets.append(shuffle_set) 25 | shuffle_set.to_csv('all_data_combined_shuffled_3Dec_7Dec_aug19_%s.csv'%str(set_name),index=False) 26 | -------------------------------------------------------------------------------- /pre_post_processing_steps/[7]_combine_4_votes.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | #### compute agreement between the new batch and previous batch ( when some batches are submitted for reannotating, this code 4 | #### will combine hits with 3 and 4 votes, remove the most divers answer out of 4 answers and create an output like all other outputs 5 | #### with three answers per hit 6 | 7 | 8 | import pandas as pd 9 | new_batch_file = 'data/mturk_out_aug19_part2_v2.csv' 10 | pre_batch_file = 'data/mturk_out_aug19_part2.csv' 11 | 12 | new_batch = pd.read_csv(new_batch_file) 13 | pre_batch = pd.read_csv(pre_batch_file) 14 | new_titles = new_batch['Input.title'] 15 | pre_titles = pre_batch['Input.title'] 16 | 17 | 18 | new_df_appended = pd.DataFrame() 19 | new_df_notappended = pd.DataFrame() 20 | for title in new_titles: 21 | entity = list(new_batch[new_batch['Input.title'] == title]['Input.entity'])[0] 22 | hitid = list(pre_batch[pre_batch['Input.title'] == title]['HITId'])[0] 23 | pre_df = pre_batch[pre_batch['Input.title'] == title] 24 | pre_df = pre_df [ pre_df['Input.entity'] == entity] 25 | new_df_appended = new_df_appended.append(pre_df) 26 | batch = new_batch[new_batch['Input.title'] == title] 27 | batch['HITId'] = hitid 28 | 29 | new_df_appended = new_df_appended.append(batch) 30 | new_df_notappended= new_df_notappended.append(pre_df) 31 | 32 | print(len(new_batch),len(pre_batch),len(new_df_appended),len(new_df_notappended)) 33 | new_df_appended.to_csv('appended_part2.csv') 34 | 35 | 36 | hitids = new_df_appended['HITId'].unique() 37 | arr_map = {'Negative': -2, 'Slightly_Negative':-1 , 'Neutral': 0 , 'Slightly_Positive': 1, 'Positive':2} 38 | sent_arr = [] 39 | counter = 0 40 | to_be_removed_indices = [] 41 | for hid in hitids: 42 | hit_arr = [] 43 | df = new_df_appended[new_df_appended['HITId']==hid] 44 | df1 = df[df['AssignmentStatus']=='Approved'] 45 | if len(df1)>3: 46 | sentiments = list(df1['Answer.Final_sentiment']) 47 | if np.nan in sentiments: 48 | remove_idx = sentiments.index(np.nan) 49 | else: 50 | for item in sentiments: 51 | hit_arr.append(arr_map[item]) 52 | min_var = 1000 53 | for index in range(len(hit_arr)-1,-1,-1): 54 | new_arr = [hit_arr[idx] for idx in range(len(hit_arr)) if idx != index ] 55 | new_var = np.var(new_arr) 56 | if new_var < min_var: 57 | min_var = new_var 58 | remove_idx = index 59 | 
to_be_removed_indices.append(counter+remove_idx) 60 | 61 | counter += 4 62 | 63 | cleaned_df = new_df_appended.drop(new_df_appended.index[to_be_removed_indices],0) 64 | cleaned_df.to_csv('mturk_out_combined_versions_part2.csv') -------------------------------------------------------------------------------- /pre_post_processing_steps/data_distribution.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | data_train = pd.read_csv('/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_train.csv', encoding='latin-1') 3 | data_dev = pd.read_csv('/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_dev.csv', encoding='latin-1') 4 | data_test = pd.read_csv('/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_random_test.csv', encoding='latin-1') 5 | data_test_fixed = pd.read_csv('/Users/mohaddeseh/Documents/EntitySentimentAnalyzer-master/data_preparation_process/combined_set/all_data_combined_shuffled_3Dec_7Dec_aug19_reindex_fixed_test.csv', encoding='latin-1') 6 | 7 | 8 | def data_dist(input_file): 9 | doc_lengths = [] 10 | par_length_withDoc = [] 11 | par_length_woDoc = [] 12 | i=-1 13 | for doc in input_file['DOCUMENT']: 14 | i += 1 15 | length = len(doc.split(' ')) 16 | doc_lengths.append(length) 17 | if pd.notnull(input_file['Paragraph0'].iloc[i]): 18 | pars = doc.split('\n') 19 | for par in pars: 20 | par_length_withDoc.append(len(par.split(' '))) 21 | par_length_woDoc.append(len(par.split(' '))) 22 | else: 23 | par_length_woDoc.append(length) 24 | 25 | return(doc_lengths,par_length_withDoc,par_length_woDoc) 26 | 27 | 28 | from matplotlib import pyplot 29 | #(doc_lengths,par_length_withDoc,par_length_woDoc)=data_dist(data_train) 30 | #pyplot.figure() 31 | #pyplot.hist(doc_lengths,bins=[0,150,300,450,600,750,900.1050],color='b',alpha=1,edgecolor='black'); 32 | #pyplot.title('Distribution of Document Length in Train Set') 33 | #pyplot.xlabel('Document Length') 34 | #pyplot.ylabel('Frequency') 35 | #pyplot.show() 36 | 37 | 38 | #pyplot.figure() 39 | #pyplot.hist(par_length_withDoc,bins=[0,20,40,60,80,100,120,140,160],edgecolor='black'); 40 | #pyplot.title('Distribution of Paragraph Length in Train Set')# (including docs without paragraph label)') 41 | #pyplot.xlabel('Paragraph Length') 42 | #pyplot.ylabel('Frequency') 43 | #pyplot.show() 44 | 45 | 46 | # pyplot.figure() 47 | # pyplot.hist(par_length_woDoc,bins=[0,20,40,60,80,100,120,140,160],edgecolor='black'); 48 | # pyplot.title('Distribution of Paragraph Length in Train Set')# (excluding docs without paragraph label)') 49 | # pyplot.xlabel('Paragraph Length') 50 | # pyplot.ylabel('Frequency') 51 | # pyplot.show() 52 | 53 | (doc_lengths,par_length_withDoc,par_length_woDoc)=data_dist(data_dev) 54 | 55 | pyplot.figure() 56 | pyplot.hist(doc_lengths,bins=[0,100,200,300,400,500,600,700,800,900,1000,1100,1200],edgecolor='black'); 57 | pyplot.title('Distribution of Document Length in Dev Set') 58 | pyplot.xlabel('Document Length') 59 | pyplot.ylabel('Frequency') 60 | pyplot.show() 61 | 62 | # 63 | #pyplot.figure() 64 | #pyplot.hist(par_length_withDoc,bins=[0,20,40,60,80,100,120,140,160],edgecolor='black'); 65 | #pyplot.title('Distribution of Paragraph Length in Dev Set')# (including docs without paragraph 
label)') 66 | #pyplot.xlabel('Paragraph Length') 67 | #pyplot.ylabel('Frequency') 68 | #pyplot.show() 69 | # 70 | # 71 | # pyplot.figure() 72 | # pyplot.hist(par_length_woDoc,bins=[0,20,40,60,80,100,120,140,160],edgecolor='black'); 73 | # pyplot.title('Distribution of Paragraph Length in Dev Set')# (excluding docs without paragraph label)') 74 | # pyplot.xlabel('Paragraph Length') 75 | # pyplot.ylabel('Frequency') 76 | # pyplot.show() 77 | --------------------------------------------------------------------------------
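Editor's note: `data_distribution.py` repeats the same figure/hist/title/label block for every split and length type, with most variants commented out. A small helper keeps a single copy of that block; `plot_length_hist` is a name introduced here as a sketch, assuming matplotlib as already imported in the script:

```python
from matplotlib import pyplot

def plot_length_hist(lengths, title, xlabel, bins):
    """One histogram block, reused for document and paragraph lengths of any split."""
    pyplot.figure()
    pyplot.hist(lengths, bins=bins, edgecolor='black')
    pyplot.title(title)
    pyplot.xlabel(xlabel)
    pyplot.ylabel('Frequency')
    pyplot.show()

# e.g., for the dev split computed above:
# doc_lengths, par_with, par_wo = data_dist(data_dev)
# plot_length_hist(doc_lengths, 'Distribution of Document Length in Dev Set',
#                  'Document Length', bins=list(range(0, 1300, 100)))
```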