├── models
│   └── models.txt
├── requirements.txt
├── embeddings
│   └── dictionaries.txt
├── training_data
│   └── training.txt
├── test_input_data_sample.txt
├── README.md
└── run_citation_need_model.py

--------------------------------------------------------------------------------
/models/models.txt:
--------------------------------------------------------------------------------
The TensorFlow models to detect Citation Need for English, French, and Italian Wikipedia can be found here:

https://drive.google.com/drive/folders/166ok0FmW-SiMNJl9ZYpeVjc8BeO1195W?usp=sharing

The format of the (zipped) files is:
fa_<language_code>_model_rnn_attention_section.h5.gz

Where <language_code> can be: [en, fr, it].

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
absl-py==0.8.0
astor==0.8.0
backports.weakref==1.0.post1
bleach==1.5.0
enum34==1.1.6
funcsigs==1.0.2
futures==3.3.0
gast==0.3.2
grpcio==1.24.1
h5py==2.10.0
html5lib==0.9999999
Keras==2.1.5
Markdown==3.1.1
mock==3.0.5
numpy==1.16.5
pandas==0.24.2
protobuf==3.10.0
python-dateutil==2.8.0
pytz==2019.3
PyYAML==5.1.2
scikit-learn==0.18.1
scipy==1.2.2
six==1.12.0
tensorboard==1.7.0
tensorflow==1.7.0
termcolor==1.1.0
Werkzeug==0.16.0

--------------------------------------------------------------------------------
/embeddings/dictionaries.txt:
--------------------------------------------------------------------------------
To run the models, you will need two dictionaries:
* Sentence dictionary: the embeddings for the words in the sentences.
* Section dictionary: the embeddings for the section titles.

Any word or section title that is not in these two dictionaries is assigned the UNK embedding.

All dictionaries, in pickle format, can be found here: https://drive.google.com/drive/folders/1dlocPHPz6Giv9nS8rR4t6kes8nlJ3inX?usp=sharing

The file format is:
<type>_dict_<language_code>.pck

Where:
* <type> in [word, section]
* <language_code> in [en, fr, it]
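A minimal lookup sketch (file names follow the pattern above and are illustrative), mirroring how run_citation_need_model.py falls back to UNK:

    import pickle

    # word and section dictionaries for English (names are illustrative)
    vocab = pickle.load(open('word_dict_en.pck', 'rb'))
    section_dict = pickle.load(open('section_dict_en.pck', 'rb'))

    # unknown words fall back to the UNK entry; unknown section titles
    # fall back to index 0, as in run_citation_need_model.py
    tokens = ['the', 'fergies', 'is', 'a', 'band']
    ids = [vocab[t] if t in vocab else vocab['UNK'] for t in tokens]
    section_id = section_dict['history'] if 'history' in section_dict else 0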
--------------------------------------------------------------------------------
/training_data/training.txt:
--------------------------------------------------------------------------------
In [1] you can download the samples we used for evaluation.

The folders are “fa” (featured articles), “lqn” (citation needed), and “rnd” (random articles).

In each of these folders you will find 2 files:
* “all_citations*” files contain the positive instances, that is, statements that have a citation.
* “no_citations” files contain the statements that do not require a citation (negative instances).

The “lqn” dataset contains an additional file, “cn_citations”. Its statements are analogous to the positive instances, except that instead of an actual citation they carry a “citation needed” marker.

[1] https://drive.google.com/drive/folders/1zG6orf0_h2jYBvGvso1pSy3ikbNiW0xJ?usp=sharing

--------------------------------------------------------------------------------
/test_input_data_sample.txt:
--------------------------------------------------------------------------------
entity_id	revision_id	timestamp	entity_title	section_id	section	prg_idx	sentence_idx	statement	citations
36492250	870654167	3ed52b90-012b-11e9-bfca-21613666d588	The_Fergies	0	MAIN_SECTION	0	-1	The Fergies is a folk/indie/rock/pop band from Brisbane, Australia formed by the five Ferguson siblings, Kahlia, Daniel, Joel, Nathan, and Shani. Their music has grown in popularity due to their busking performances on Queen Street Mall and originals uploaded to YouTube.	False
36492250	870654167	3ed52b90-012b-11e9-bfca-21613666d588	The_Fergies	0	MAIN_SECTION	1	1	The Fergies' lead singer, Kahlia Ferguson, won the senior category of the Australian Children's Music Foundation's (ACMF) National Song Writing Competition two years in a row, with Little Bird in 2008 and Soldier Boy in 2009	True
1865327	868915520	2f051d20-08c3-11e9-b9d2-062454f1b6b4	Honours_degree	0	MAIN_SECTION	0	-1	The term "honours degree" (or "honors degree") has various meanings in the context of different degrees and education systems. Most commonly it refers to a variant of the undergraduate bachelor's degree containing a larger volume of material or a higher standard of study, or both, rather than an "ordinary", "general" or "pass" bachelor's degree. Honours degrees are sometimes indicated by "Hons" after the degree abbreviation, with various punctuation according to local custom, e.g. "BA (Hons)", "B.A., Hons", etc.	False
1865327	868915520	2f051d20-08c3-11e9-b9d2-062454f1b6b4	Honours_degree	1	Australia	1	3	Students receiving high marks in their Honours program have the option of continuing to candidature of a Doctoral program, such as Doctor of Philosophy, without having to complete a master's degree.	True

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability
Repository of data and code to use the models described in the paper "Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability".

### Detecting Citation Need
The repository provides data and models to score sentences according to their "citation need", i.e. whether an inline citation is needed to support the information in the sentence. Models are implemented in Keras.

#### Using Citation Need Models
The *run_citation_need_model.py* script takes as input a text file containing statements to be classified, and gives as output a "citation need" score for each sentence.

To run the script, you can use the following command:
```
python run_citation_need_model.py -i input_file.txt -m models/model.h5 -v dicts/word_dict.pck -s dicts/section_dict.pck -o output_folder -l it
```

Where:
- **'-i', '--input'** is the input file from which we read the statements. It must contain at least the following tab-separated columns (and the corresponding header):
  - "statement", i.e. the text of the sentence to be classified
  - "section", i.e. the section title where the sentence is located
  - "citations", the binary label indicating whether the sentence has a citation in the original text. This can be set to 0 if no evaluation is needed.

  An example input file is provided in *test_input_data_sample.txt*; a minimal one can also be built programmatically, as shown in the sketch after this list.

- **'-o', '--out_dir'** is the output directory where we store the results.
- **'-m', '--model'** is the path to the model which we use for classifying the statements.
- **'-v', '--vocab'** is the path to the vocabulary of words we use to represent the statements.
- **'-s', '--sections'** is the path to the vocabulary of sections with which we trained our model.
- **'-l', '--lang'** is the language that we are parsing, e.g. "en" or "it".
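For illustration, a minimal sketch (not part of the repository) that builds a valid input file with pandas; the file name and values are arbitrary:
```
import pandas as pd

# only the "statement", "section", and "citations" columns are read by the script
rows = [{
    "statement": "The Fergies is a folk/indie/rock/pop band from Brisbane, Australia.",
    "section": "MAIN_SECTION",
    "citations": False,
}]
pd.DataFrame(rows).to_csv("input_file.txt", sep="\t", index=False)
```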
#### Models and Data
The _models_ folder contains the links to the citation need models for _English_, _French_, and _Italian_ Wikipedias, trained on a large corpus of Wikipedia's Featured Articles. The training data for these models, as well as training data for the other models described in the paper, can be found in the _training_data_ folder.

### Data about citation reason and qualitative analysis
The paper also reports a qualitative analysis we performed to identify a taxonomy of reasons for adding citations.
The supporting material for this analysis can be found here: https://figshare.com/articles/Summaries_of_Policies_and_Rules_for_Adding_Citations_to_Wikipedia/7751027

The crowdsourced data with sentences annotated with their citation reasons can be found here: https://figshare.com/articles/%20Citation_Reason_Dataset/7756226

### System Requirements

Python 2.7

### Installing dependencies
The script has a number of dependencies. Below are instructions for installing them.

First, install the virtualenv package via pip:
```
$ pip install virtualenv
```
Then, run the following commands to create a virtualenv and activate it:
```
$ virtualenv ENV
$ source ENV/bin/activate
```
The name of the virtual environment (`ENV` in this case) can be anything. Once you activate the virtualenv, you will see `(ENV)` before the shell prompt.

Now, go to the repository folder and install the dependencies listed in `requirements.txt`:
```
(ENV) $ cd citation-needed-paper
(ENV) $ pip install -r requirements.txt
```

When you are done working in the virtualenv, use the following command to leave it:
```
(ENV) $ deactivate
```
For more information on how to use virtualenv, please see [Virtualenv](https://virtualenv.pypa.io/en/stable/).
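Once the dependencies are installed and a model has been downloaded and unzipped (see the _models_ folder), you can check that it loads correctly. A minimal sketch, assuming the English model file name from _models/models.txt_:
```
from keras.models import load_model

# path is illustrative; the download link is in models/models.txt
model = load_model('models/fa_en_model_rnn_attention_section.h5')

# the second dimension of the first input is the maximum statement
# length that run_citation_need_model.py pads/truncates to
print(model.input[0].shape[1].value)
```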
### Citing this work
When using this dataset, please use the following citation:

```
@inproceedings{Redi:2019:CNT:3308558.3313618,
 author = {Redi, Miriam and Fetahu, Besnik and Morgan, Jonathan and Taraborelli, Dario},
 title = {Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability},
 booktitle = {The World Wide Web Conference},
 series = {WWW '19},
 year = {2019},
 isbn = {978-1-4503-6674-8},
 location = {San Francisco, CA, USA},
 pages = {1567--1578},
 numpages = {12},
 url = {http://doi.acm.org/10.1145/3308558.3313618},
 doi = {10.1145/3308558.3313618},
 acmid = {3313618},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {Citations, Crowdsourcing, Neural Networks, Wikipedia},
}
```

--------------------------------------------------------------------------------
/run_citation_need_model.py:
--------------------------------------------------------------------------------
import re
import argparse
import pandas as pd
import pickle
import numpy as np

from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelBinarizer

from keras.utils import to_categorical

from keras import backend as K

# cap TensorFlow at 10 intra-op and 10 inter-op threads
K.set_session(K.tf.Session(config=K.tf.ConfigProto(intra_op_parallelism_threads=10, inter_op_parallelism_threads=10)))

'''
Set up the arguments and parse them.
'''


def get_arguments():
    parser = argparse.ArgumentParser(
        description='Use this script to determine whether a statement needs a citation or not.')
    parser.add_argument('-i', '--input', help='The input file from which we read the statements. Each line contains tab-separated values: the statement, the section header, and the binary label corresponding to whether the sentence has a citation or not in the original text. The label can be set to 0 if no evaluation is needed.', required=True)
    parser.add_argument('-o', '--out_dir', help='The output directory where we store the results.', required=True)
    parser.add_argument('-m', '--model', help='The path to the model which we use for classifying the statements.', required=True)
    parser.add_argument('-v', '--vocab', help='The path to the vocabulary of words we use to represent the statements.', required=True)
    parser.add_argument('-s', '--sections', help='The path to the vocabulary of sections with which we trained our model.', required=True)
    parser.add_argument('-l', '--lang', help='The language that we are parsing now.', required=False, default='en')

    return parser.parse_args()
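
# Example invocation (all paths are illustrative; see the README for details):
#   python run_citation_need_model.py -i test_input_data_sample.txt \
#       -m models/model.h5 -v dicts/word_dict.pck -s dicts/section_dict.pck \
#       -o output_folder -l en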

'''
Parse and construct the word representation for a sentence.
'''


def text_to_word_list(text):
    # check first if the statement is longer than a single sentence.
    sentences = re.compile(r'\.\s+').split(str(text))
    if len(sentences) != 1:
        # keep only the first sentence
        text = sentences[0]

    text = str(text).lower()

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    text = text.strip().split()

    return text
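
# Illustration only (hypothetical input, not called anywhere): the heuristics
# above would tokenize
#   text_to_word_list("The Fergies' lead singer won in 2008.")
# into
#   ['the', 'fergies', 'lead', 'singer', 'won', 'in', '2008']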

'''
Create the instances from our datasets.
'''


def construct_instance_reasons(statement_path, section_dict_path, vocab_w2v_path, max_len=-1):
    # Load the word vocabulary.
    vocab_w2v = pickle.load(open(vocab_w2v_path, 'rb'))

    # Load the section dictionary.
    section_dict = pickle.load(open(section_dict_path, 'rb'))

    # Load the statements. Expected columns: entity_id, revision_id, timestamp,
    # entity_title, section_id, section, prg_idx, sentence_idx, statement, citations;
    # "statement" holds the text and "citations" the label (True or False).
    statements = pd.read_csv(statement_path, sep='\t', index_col=None, error_bad_lines=False, warn_bad_lines=False)

    # construct the instances
    X = []
    sections = []
    y = []
    outstring = []
    for index, row in statements.iterrows():
        try:
            statement_text = text_to_word_list(row['statement'])

            X_inst = []
            for word in statement_text:
                if max_len != -1 and len(X_inst) >= max_len:
                    continue
                if word not in vocab_w2v:
                    X_inst.append(vocab_w2v['UNK'])
                else:
                    X_inst.append(vocab_w2v[word])

            # extract the section; if the section does not exist in the model, assign UNK (index 0)
            section = row['section'].strip().lower()
            sections.append(np.array([section_dict[section] if section in section_dict else 0]))

            label = row['citations']

            # some of the rows are corrupt, thus we need to check that the labels are actually boolean.
            if not isinstance(label, bool):
                continue

            y.append(label)
            X.append(X_inst)
            outstring.append(str(row['statement']))

        except Exception as e:
            print(row)
            print(e)

    # pad all statements to max_len with the UNK token
    X = pad_sequences(X, maxlen=max_len, value=vocab_w2v['UNK'], padding='pre')

    encoder = LabelBinarizer()
    y = encoder.fit_transform(y)
    y = to_categorical(y)

    return X, np.array(sections), y, encoder, outstring


if __name__ == '__main__':
    p = get_arguments()

    # load the model
    model = load_model(p.model)

    # the model's first input fixes the maximum statement length
    max_seq_length = model.input[0].shape[1].value

    # load the data
    X, sections, y, encoder, outstring = construct_instance_reasons(p.input, p.sections, p.vocab, max_seq_length)

    # classify the data
    pred = model.predict([X, sections])

    # store the predictions: the sentence text, the prediction score, and the original citation label.
    outstr = 'Text\tPrediction\tCitation\n'
    for idx, y_pred in enumerate(pred):
        outstr += outstring[idx] + '\t' + str(y_pred[0]) + '\t' + str(y[idx]) + '\n'

    fout = open(p.out_dir + '/' + p.lang + '_predictions_sections.tsv', 'wt')
    fout.write(outstr)
    fout.flush()
    fout.close()

--------------------------------------------------------------------------------