├── README.md
└── TurkishQuestionPairs.ipynb


/README.md:
--------------------------------------------------------------------------------
# Question Paraphrases Dataset for the Turkish Language
This dataset contains 1377 positive question paraphrase pairs in Turkish and 2187 negative examples.
The original questions are taken from the Istanbul Bilgi University Frequently Asked Questions page:
https://www.bilgi.edu.tr/tr/yasam/ogrenci/sss/. Later, we manually created two additional, truly semantically equivalent questions. These question pairs are treated as positive examples of question paraphrasing.

## Negative Sampling
We also added 2187 negative examples, labeled 0. The negative examples are pairs of "related questions" which, although pertaining to similar topics, are not truly semantically equivalent. This is very similar to the Quora Question Pairs dataset:
https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs


## Paraphrasing Detection
To classify whether two questions are duplicates, we trained a simple classifier based on BERT. It should be especially helpful for chatbot frameworks in Turkish.

The following code shows how to run the detection model:


```
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("savasy/TurkQP")
model = AutoModelForSequenceClassification.from_pretrained("savasy/TurkQP")
model.eval()  # inference mode (disables dropout)

s0 = "Bugün yağmur yağacak mı ?"    # "Will it rain today?"
s1 = "Yağmur yağabilir mi bugün ?"  # "Could it rain today?"

# Encode the two questions together as one sentence-pair input.
inputs = tokenizer(s0, s1, add_special_tokens=True, return_tensors="pt")

# The first element of the model output holds the classification logits.
with torch.no_grad():
    output = model(inputs["input_ids"], token_type_ids=inputs["token_type_ids"])[0]
pred = output.argmax().item()

pred == 1  # True when the pair is predicted to be a paraphrase (label 1)
```


/TurkishQuestionPairs.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "d=pd.read_csv(\"TurkQP.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
|  | q1 | q2 | duplicate |
|---|---|---|---|
| 774 | Öğrenci kulüplerine sağlanan imkanlar nelerdir? | öğrenci kulüplerine ne tip imkanlar sunuluyor | 1 |
| 2370 | Diplomamı kaybettim yeni diploma alabilir miyim? | Diplomamı benim yerime bir başkası alabilir mi? | 0 |
| 470 | ders kaydımda çakışma olduğunda ne yapmalıyım | Çakışan derslerim var ders kaydım onaylanır mı? | 1 |
| 2403 | Mezuniyet başarı sıralamamı öğrenebilir miyim? | Mezuniyet başarı sıralamamı öğrenebilir miyim? | 0 |
| 1775 | Hangi üniversitelerden yatay geçiş kabul ediyo... | Ana dal yan dal çift anadal programlarında der... | 0 |