├── README.md
└── SimpleNotebook.ipynb


/README.md:
--------------------------------------------------------------------------------
  1 | # Bert-base Turkish Sentiment Model
  2 | 
  3 | https://huggingface.co/savasy/bert-base-turkish-sentiment-cased
  4 | 
  5 | This model performs sentiment analysis on Turkish text. It is based on BERTurk, a BERT model for the Turkish language: https://huggingface.co/dbmdz/bert-base-turkish-cased
  6 | 
  7 | 
  8 | # Dataset
  9 | 
 10 | The dataset merges the data from studies [2] and [3].
 11 | 
 12 | * Study [2] gathered movie and product reviews. The product categories are book, DVD, electronics, and kitchen.
 13 | The movie dataset was taken from a cinema website (www.beyazperde.com) and contains
 14 | 5331 positive and 5331 negative sentences. Reviews on the site are rated on a
 15 | scale from 0 to 5 by the users who wrote them. The study considered a review
 16 | positive if its rating was greater than or equal to 4, and negative if it was less than
 17 | or equal to 2. The authors also built a Turkish product review dataset from an online retailer's
 18 | website, constructing a benchmark dataset of reviews for several product categories
 19 | (book, DVD, etc.). These reviews are rated on a scale from 1 to 5,
 20 | and the majority of reviews have a rating of 5. Each category has 700 positive and 700 negative
 21 | reviews; the average rating is 2.27 for negative reviews and 4.5 for positive
 22 | reviews. This dataset is also used in study [1].
 23 | 
 24 | * Study [3] collected a dataset of tweets. It proposed a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.
 25 | 
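The rating thresholds described for study [2] can be sketched as a small labeling function. This is only an illustration: the function name `rating_to_label` is hypothetical, the 1/0 encoding is assumed to match the labels used elsewhere in this README, and returning `None` for ratings between 2 and 4 is an assumption, since the text does not say how such reviews were handled.

```python
from typing import Optional


def rating_to_label(rating: float) -> Optional[int]:
    """Map a user rating to a sentiment label using the thresholds from study [2]."""
    if rating >= 4:
        return 1  # positive: rating greater than or equal to 4
    if rating <= 2:
        return 0  # negative: rating less than or equal to 2
    return None  # assumption: in-between ratings get no label


for r in (5, 4, 3, 2, 0):
    print(r, rating_to_label(r))
```
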
 26 | *Merged Dataset* 
 27 | 
 28 | | *size*   | *data* |
 29 | |--------|----|
 30 | |   8000 |dev.tsv|
 31 | |   8262 |test.tsv|
 32 | |  32000 |train.tsv|
 33 | |  *48290* |*total*|
 34 | 
 35 | 
 36 | The dataset is used in the following papers:
 37 |  
 38 | * [1] Yildirim, Savaş. (2020). Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. DOI: 10.1007/978-981-15-1216-2_12.
 39 | * [2] Demirtas, Erkin and Mykola Pechenizkiy. (2013). Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment
 40 | Discovery and Opinion Mining (WISDOM ’13).
 41 | * [3] Hayran, A., Sert, M. (2017). Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques. IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey.
 42 | 
 43 | # Training
 44 | 
 45 | ```
 46 | export GLUE_DIR="./sst-2-newall"
 47 | export TASK_NAME=SST-2
 48 |  
 49 | 
 50 | python3 run_glue.py \
 51 |   --model_type bert \
 52 |   --model_name_or_path dbmdz/bert-base-turkish-uncased \
 53 |   --task_name "$TASK_NAME" \
 54 |   --do_train \
 55 |   --do_eval \
 56 |   --data_dir "$GLUE_DIR" \
 57 |   --max_seq_length 128 \
 58 |   --per_gpu_train_batch_size 32 \
 59 |   --learning_rate 2e-5 \
 60 |   --num_train_epochs 3.0 \
 61 |   --output_dir "./model"
 62 | 
 63 | ```
 64 | 
 65 | 
 66 | 
 67 | 
 68 | # Results
 69 | 
 70 | > 05/10/2020 17:00:43 - INFO - transformers.trainer -   ***** Running Evaluation *****
 71 | 
 72 | > 05/10/2020 17:00:43 - INFO - transformers.trainer -     Num examples = 7999
 73 | 
 74 | > 05/10/2020 17:00:43 - INFO - transformers.trainer -     Batch size = 8
 75 | 
 76 | >Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]
 77 | 
 78 | >05/10/2020 17:01:17 - INFO - __main__ -   ***** Eval results sst-2 *****
 79 | 
 80 | >05/10/2020 17:01:17 - INFO - __main__ -     acc = 0.9539942492811602
 81 | 
 82 | >05/10/2020 17:01:17 - INFO - __main__ -     loss = 0.16348013816401363
 83 | 
 84 | 
 85 | Accuracy is about *95.4%*
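
As a quick sanity check on the log above, the arithmetic below relates the reported numbers to each other. It only uses values copied from the log (7999 examples, batch size 8, the reported `acc`); the variable names are illustrative.

```python
import math

num_examples = 7999            # "Num examples" from the log
batch_size = 8                 # "Batch size" from the log
accuracy = 0.9539942492811602  # "acc" from the log

# Number of evaluation steps: the last batch is only partially filled.
eval_steps = math.ceil(num_examples / batch_size)

# Number of correctly classified examples implied by the accuracy.
correct = round(accuracy * num_examples)

print(eval_steps)  # 1000, matching the "1000/1000" progress bar
print(correct)     # 7631
```
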
 86 | # Code Usage
 87 | 
 88 | ```
 89 | from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
 90 | model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 91 | tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
 92 | sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
 93 | 
 94 | p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
 95 | print(p)
 96 | # [{'label': 'LABEL_1', 'score': 0.9871089}]
 97 | print(p[0]['label'] == 'LABEL_1')
 98 | # True
 99 | 
100 | 
101 | p = sa("Film çok kötü ve çok sahteydi")
102 | print(p)
103 | # [{'label': 'LABEL_0', 'score': 0.9975505}]
104 | print(p[0]['label'] == 'LABEL_1')
105 | # False
106 | ```
107 | 
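The raw `LABEL_0` / `LABEL_1` names can be mapped to readable sentiments with a small helper. The sketch below is an assumption based only on the two examples above, where the positive review received `LABEL_1` and the negative one `LABEL_0`; the names `LABEL_NAMES` and `readable` are hypothetical. It works on the list of dicts that the pipeline returns, so a sample output is used here instead of loading the model.

```python
# Assumed mapping, inferred from the two example predictions above.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions):
    """Replace the raw label of each prediction with a readable sentiment name."""
    return [
        {"sentiment": LABEL_NAMES[p["label"]], "score": p["score"]}
        for p in predictions
    ]


# Sample pipeline output copied from the example above.
sample = [{"label": "LABEL_1", "score": 0.9871089}]
print(readable(sample))
# [{'sentiment': 'positive', 'score': 0.9871089}]
```
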
108 | # Test your data
109 | 
110 | Suppose your file has one comment per line, followed by a tab and its label (1 or 0):
111 | 
112 | > comment1 ... \t label
113 | 
114 | > comment2 ... \t label
115 |  
116 | > ...
117 | 
118 | 
119 | 
120 | ```
121 | from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
122 | 
123 | f = "/path/to/your/file/yourfile.tsv"
124 | model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
125 | tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
126 | sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)
127 | 
128 | i, crr = 0, 0  # examples seen, correct predictions
129 | with open(f, encoding="utf-8") as fh:
130 |     for line in fh:
131 |         fields = line.strip().split("\t")
132 |         if len(fields) != 2:
133 |             continue  # skip lines that are not "comment<TAB>label"
134 |         i += 1
135 |         if i % 100 == 0:
136 |             print(i)  # progress
137 |         pred = sa(fields[0])[0]["label"].split("_")[1]  # "LABEL_1" -> "1"
138 |         crr += pred == fields[1]  # True counts as 1
139 | 
140 | print(crr, i, crr / i if i else 0.0)  # correct, total, accuracy
141 | ```
142 | 
143 | 
144 | 
145 | 
146 | 
147 | 


--------------------------------------------------------------------------------
/SimpleNotebook.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "code",
 5 |    "execution_count": 1,
 6 |    "metadata": {},
 7 |    "outputs": [],
 8 |    "source": [
 9 |     "# A simple notebook"
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "code",
14 |    "execution_count": null,
15 |    "metadata": {},
16 |    "outputs": [],
17 |    "source": [
18 |     "print(\"hello world\")"
19 |    ]
20 |   }
21 |  ],
22 |  "metadata": {
23 |   "kernelspec": {
24 |    "display_name": "Python 3",
25 |    "language": "python",
26 |    "name": "python3"
27 |   },
28 |   "language_info": {
29 |    "codemirror_mode": {
30 |     "name": "ipython",
31 |     "version": 3
32 |    },
33 |    "file_extension": ".py",
34 |    "mimetype": "text/x-python",
35 |    "name": "python",
36 |    "nbconvert_exporter": "python",
37 |    "pygments_lexer": "ipython3",
38 |    "version": "3.7.5"
39 |   }
40 |  },
41 |  "nbformat": 4,
42 |  "nbformat_minor": 4
43 | }
44 | 


--------------------------------------------------------------------------------