# French-SQuAD : French Machine Reading for Question Answering

# SQuAD v1.1 dataset in French

You will find in this repository the SQuAD v1.1 dev dataset in French. Since the train file is too big, you can download it from:
https://drive.google.com/drive/folders/1HzvQv4W7sveWkuxbw04tTfgiDq0ZKW8U?usp=sharing


# How was the dataset built?

Almost all the models available are trained on English datasets, but for our work we needed to train on a French dataset.
Since we did not find any substantial French Q&A dataset, we had to build one. Instead of starting from scratch and
spending weeks asking crowd workers to read articles, create questions, and report answers with their start and end
positions in the context, we preferred to translate the SQuAD v1.1 training and dev datasets (Rajpurkar et al., 2016)
from English to French.

SQuAD contains 107.7K question-answer pairs, with 87.5K for training, 10.1K for validation, and another 10.1K for
testing. Only the training and validation data are publicly available.
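For reference, every SQuAD v1.1 example is stored as nested JSON in which each answer carries a character offset (`answer_start`) into its context — the very field that has to be recovered after translation. A minimal sketch of one entry (the text values here are invented for illustration, not taken from the dataset):

```python
# Minimal SQuAD v1.1-style entry; the text values are invented for illustration.
example = {
    "data": [{
        "title": "Normandy",
        "paragraphs": [{
            "context": "The Normans were the people who gave their name to Normandy.",
            "qas": [{
                "id": "0001",
                "question": "What did the Normans give their name to?",
                "answers": [{"text": "Normandy", "answer_start": 51}],
            }],
        }],
    }],
    "version": "1.1",
}

# The answer span must be recoverable from the character offset:
ans = example["data"][0]["paragraphs"][0]["qas"][0]["answers"][0]
ctx = example["data"][0]["paragraphs"][0]["context"]
assert ctx[ans["answer_start"]:ans["answer_start"] + len(ans["text"])] == ans["text"]
```

Keeping this offset consistent after translation is the hard part, since Google Translate returns plain strings with no alignment information.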
Each training example of SQuAD is a triple (d, q, a), in which document d is a multi-sentence paragraph, q is the question, and a is the answer to the question.
For Machine Translation, we used the publicly available Google Translate API through the GoogleTrans package provided by
SuHun Han at https://github.com/ssut/py-googletrans.

However, translating (d, q, a) from English to French is not enough. All the models need the answer span and its
position in the context, i.e. its start and end. Therefore, we need to find the start and end of the French answer in the
French context.

Since the translations of the context and the answer are not always aligned, it is not always possible to find the
translated answer verbatim in the context. In our translation, less than 2/3 of the answers were found in the context. For
the rest, we had to reconstitute the answer from the English one (EnA) and the French translated one (FrA).
To reconstitute the French answer, we first split the strings (EnA) and (FrA) into a list of words (Lowa), then look for
the string in the context whose length equals the maximum of the (EnA) and (FrA) lengths, Lmax, and whose
words are closest to the words in (Lowa). We used three kinds of methods to determine how close two words are:
- Exact match = 1
- Ratio Levenshtein distance
- Jaro-Winkler distance

For each string of length Lmax in the context, we add up the word distances and take the string with the highest
score. We did this both for strings including stop words and punctuation and for strings without them (non-normalized
and normalized).

*Closest string in the context: the string in the context of a length of 40 (Lmax) with the highest score (exact match, Ratio Levenshtein, or Jaro-Winkler).
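The sliding-window search described above can be sketched in Python. This is an illustrative reconstruction, not the repository's actual code: the function names are invented, and only the exact-match and Ratio-Levenshtein similarities are implemented (Jaro-Winkler would slot into `word_sim` the same way).

```python
# Sketch of the closest-string search: slide a window of length Lmax over the
# context and score each window by word similarity to the answer words (Lowa).
# Names and details are assumptions, not the authors' implementation.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def word_sim(w1: str, w2: str) -> float:
    if w1 == w2:                     # exact match = 1
        return 1.0
    m = max(len(w1), len(w2))
    return 1.0 - levenshtein(w1, w2) / m if m else 1.0   # Ratio Levenshtein

def closest_string(context: str, en_answer: str, fr_answer: str) -> str:
    # Lmax: the longer of the two answer strings; Lowa: their words.
    lmax = max(len(en_answer), len(fr_answer))
    lowa = (en_answer + " " + fr_answer).split()
    best_span, best_score = "", -1.0
    for start in range(len(context) - lmax + 1):
        span = context[start:start + lmax]
        # Sum, for each answer word, its similarity to the closest window word.
        score = sum(
            max((word_sim(w, cw) for cw in span.split()), default=0.0)
            for w in lowa
        )
        if score > best_score:
            best_span, best_score = span, score
    return best_span
```

For example, `closest_string("La constitution française a été adoptée en 1958.", "the French constitution", "la constitution française")` returns a 25-character window covering the answer region of the context.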

