├── README.md
└── SimpleNotebook.ipynb

/README.md:
--------------------------------------------------------------------------------
# Bert-base Turkish Sentiment Model

https://huggingface.co/savasy/bert-base-turkish-sentiment-cased

This model performs sentiment analysis for Turkish. It is based on BERTurk: https://huggingface.co/dbmdz/bert-base-turkish-cased

# Dataset

The dataset is taken from studies [2] and [3] and merged.

* Study [2] gathered movie and product reviews. The product categories are book, DVD, electronics, and kitchen. The movie dataset is taken from a cinema web page (www.beyazperde.com) and contains 5331 positive and 5331 negative sentences. Reviews on the web page are rated on a scale from 0 to 5 by the users who wrote them. The study considered a review positive if its rating is 4 or higher, and negative if it is 2 or lower. The authors also built a Turkish product review dataset from an online retailer's web page. They constructed a benchmark dataset consisting of reviews of several products (book, DVD, etc.). Likewise, these reviews are rated from 1 to 5, and the majority of reviews have a rating of 5. Each category has 700 positive and 700 negative reviews; the average rating is 2.27 for negative reviews and 4.5 for positive reviews. This dataset is also used in study [1].

* Study [3] collected a tweet dataset. The authors proposed a new approach for automatically classifying the sentiment of microblog messages, based on robust feature representation and fusion.

*Merged Dataset*

| *size* | *data* |
|--------|----|
| 8000 |dev.tsv|
| 8262 |test.tsv|
| 32000 |train.tsv|
| *48290* |*total*|

The dataset is used in the following papers:

* [1] Yildirim, Savaş. (2020).
Comparing Deep Neural Networks to Traditional Models for Sentiment Analysis in Turkish Language. doi:10.1007/978-981-15-1216-2_12.
* [2] Demirtas, Erkin and Mykola Pechenizkiy. 2013. Cross-lingual polarity detection with machine translation. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining (WISDOM ’13)
* [3] Hayran, A., Sert, M. (2017), "Sentiment Analysis on Microblog Data based on Word Embedding and Fusion Techniques", IEEE 25th Signal Processing and Communications Applications Conference (SIU 2017), Belek, Turkey

# Training

```
# Directory holding train.tsv / dev.tsv in SST-2 format, and the GLUE task name
export GLUE_DIR="./sst-2-newall"
export TASK_NAME=SST-2

python3 run_glue.py \
  --model_type bert \
  --model_name_or_path dbmdz/bert-base-turkish-uncased \
  --task_name "$TASK_NAME" \
  --do_train \
  --do_eval \
  --data_dir "$GLUE_DIR" \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir "./model"
```

# Results

> 05/10/2020 17:00:43 - INFO - transformers.trainer - ***** Running Evaluation *****

> 05/10/2020 17:00:43 - INFO - transformers.trainer - Num examples = 7999

> 05/10/2020 17:00:43 - INFO - transformers.trainer - Batch size = 8

> Evaluation: 100% 1000/1000 [00:34<00:00, 29.04it/s]

> 05/10/2020 17:01:17 - INFO - __main__ - ***** Eval results sst-2 *****

> 05/10/2020 17:01:17 - INFO - __main__ - acc = 0.9539942492811602

> 05/10/2020 17:01:17 - INFO - __main__ - loss = 0.16348013816401363

Accuracy is about *95.4%*.

# Code Usage

```
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# "These phone models are very high quality; I think every part is very special"
p = sa("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(p)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
print(p[0]['label'] == 'LABEL_1')
# True

# "The movie was very bad and very fake"
p = sa("Film çok kötü ve çok sahteydi")
print(p)
# [{'label': 'LABEL_0', 'score': 0.9975505}]
print(p[0]['label'] == 'LABEL_1')
# False
```

# Test your data

Suppose your file has many lines, each holding a comment followed by its label (1 or 0), separated by a tab:

> comment1 ... \t label

> comment2 ... \t label

> ...

```
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

f = "/path/to/your/file/yourfile.tsv"
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model)

# i counts evaluated examples, crr counts correct predictions
i, crr = 0, 0
with open(f, encoding="utf-8") as fh:
    for line in fh:
        fields = line.strip().split("\t")
        # Skip lines that are not exactly "comment<TAB>label"
        if len(fields) == 2:
            i = i + 1
            if i % 100 == 0:
                print(i)  # progress indicator
            pred = sa(fields[0])
            # Turn "LABEL_1" into "1" so it can be compared with the file's label
            pred = pred[0]["label"].split("_")[1]
            if pred == fields[1]:
                crr = crr + 1

# Correct count, total count, accuracy (0.0 if no valid line was found)
print(crr, i, crr / i if i else 0.0)
```

--------------------------------------------------------------------------------
/SimpleNotebook.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A simple notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"hello world\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

--------------------------------------------------------------------------------