# French-SQuAD : French Machine Reading for Question Answering

# SQuAD v1.1 dataset in French

You will find in this repository the SQuAD v1.1 dev dataset in French. Since the train file is too big, you can download it from:
https://drive.google.com/drive/folders/1HzvQv4W7sveWkuxbw04tTfgiDq0ZKW8U?usp=sharing


# How was the dataset built?

Almost all the models available are trained on English datasets, but for our work we needed to train on a French dataset.
Since we did not find any substantial French Q&A dataset, we had to build one. Instead of starting from scratch and
spending weeks asking crowd workers to read articles, create questions, and report answers with their start and end
positions in the context, we preferred to translate the SQuAD v1.1 training and dev datasets (Rajpurkar et al., 2016)
from English to French.

SQuAD contains 107.7K question-answer pairs, with 87.5K for training, 10.1K for validation, and another 10.1K for
testing. Only the training and validation data are publicly available.
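For reference, every SQuAD v1.1 example is stored as nested JSON in which each answer carries a character offset (`answer_start`) into its context — the very field that has to be recovered after translation. A minimal sketch of one entry (the text values here are invented for illustration, not taken from the dataset):

```python
# Minimal SQuAD v1.1-style entry; the text values are invented for illustration.
example = {
    "data": [{
        "title": "Normandy",
        "paragraphs": [{
            "context": "The Normans were the people who gave their name to Normandy.",
            "qas": [{
                "id": "0001",
                "question": "What did the Normans give their name to?",
                "answers": [{"text": "Normandy", "answer_start": 51}],
            }],
        }],
    }],
    "version": "1.1",
}

# The answer span must be recoverable from the character offset:
ans = example["data"][0]["paragraphs"][0]["qas"][0]["answers"][0]
ctx = example["data"][0]["paragraphs"][0]["context"]
assert ctx[ans["answer_start"]:ans["answer_start"] + len(ans["text"])] == ans["text"]
```

Keeping this offset consistent after translation is the hard part, since Google Translate returns plain strings with no alignment information.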
Each training example of SQuAD is a triple (d, q, a), in which document d is a multi-sentence paragraph, q is the question, and a is the answer to the question.
For Machine Translation, we used the publicly available Google Translate API through the GoogleTrans package provided by
SuHun Han at https://github.com/ssut/py-googletrans.

However, translating (d, q, a) from English to French is not enough. All the models need the answer span and its
position in the context, i.e. its start and end. Therefore, we need to find the start and end of the French answer in the
French context.

Since the translations of the context and the answer are not always aligned, it is not always possible to find the
translated answer verbatim in the context. In our translation, less than 2/3 of the answers were found in the context. For
the rest, we had to reconstitute the answer from the English one (EnA) and the French translated one (FrA).
To reconstitute the French answer, we first split the strings (EnA) and (FrA) into a list of words (Lowa), then look for
the string in the context whose length equals the maximum of the (EnA) and (FrA) lengths, Lmax, and whose
words are closest to the words in (Lowa). We used three kinds of methods to determine how close two words are:
- Exact match = 1
- Ratio Levenshtein distance
- Jaro-Winkler distance

For each string of length Lmax in the context, we add up the word distances and take the string with the highest
score. We did this both for strings including stop words and punctuation and for strings without them (non-normalized
and normalized).

*Closest string in the context: the string in the context of a length of 40 (Lmax) with the highest score (exact match, Ratio Levenshtein, or Jaro-Winkler).
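The sliding-window search described above can be sketched in Python. This is an illustrative reconstruction, not the repository's actual code: the function names are invented, and only the exact-match and Ratio-Levenshtein similarities are implemented (Jaro-Winkler would slot into `word_sim` the same way).

```python
# Sketch of the closest-string search: slide a window of length Lmax over the
# context and score each window by word similarity to the answer words (Lowa).
# Names and details are assumptions, not the authors' implementation.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def word_sim(w1: str, w2: str) -> float:
    if w1 == w2:                     # exact match = 1
        return 1.0
    m = max(len(w1), len(w2))
    return 1.0 - levenshtein(w1, w2) / m if m else 1.0   # Ratio Levenshtein

def closest_string(context: str, en_answer: str, fr_answer: str) -> str:
    # Lmax: the longer of the two answer strings; Lowa: their words.
    lmax = max(len(en_answer), len(fr_answer))
    lowa = (en_answer + " " + fr_answer).split()
    best_span, best_score = "", -1.0
    for start in range(len(context) - lmax + 1):
        span = context[start:start + lmax]
        # Sum, for each answer word, its similarity to the closest window word.
        score = sum(
            max((word_sim(w, cw) for cw in span.split()), default=0.0)
            for w in lowa
        )
        if score > best_score:
            best_span, best_score = span, score
    return best_span
```

For example, `closest_string("La constitution française a été adoptée en 1958.", "the French constitution", "la constitution française")` returns a 25-character window covering the answer region of the context.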

