├── .gitignore
├── LICENSE
├── MirasIrony
    ├── MirasIrony.zip
    └── README.md
├── MirasOpinion
    ├── MirasOpinion.zip
    └── README.md
├── MirasText
    ├── MirasText_sample.txt
    ├── README.md
    └── parser.py
└── README.md


/.gitignore:
--------------------------------------------------------------------------------
1 | passwords
2 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 miras-tech
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/MirasIrony/MirasIrony.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/miras-tech/MirasText/e60eac95fffd90e4ecbb0526b17464a95992395d/MirasIrony/MirasIrony.zip


--------------------------------------------------------------------------------
/MirasIrony/README.md:
--------------------------------------------------------------------------------
 1 | MirasIrony: The Persian Irony Dataset
 2 | -------------------------
 3 | 
 4 | This repository contains information about the MirasIrony dataset. To use the complete dataset, submit a request to behnamsabeti@gmail.com and we will provide the password needed to extract MirasIrony.zip.
 5 | 
 6 | The irony dataset is constructed from Persian tweets. In total, 2942 tweets are labeled; their statistics are as follows:
 7 | 
 8 | | Property    | Ironic | Non-Ironic   |
 9 | | :------- | ----: | :---: |
10 | | No. of tweets | 1278 |  1664    |
11 | | Avg. No. of tokens per tweet    | 37.36   |  28.34   |
12 | | Max. No. of tokens per tweet   | 50    |  50  |
13 | 
14 |  
15 | The data file is in CSV format; each row contains a tweet and its corresponding label. The label is either 1 for an ironic tweet or 0 for a non-ironic tweet.
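
A minimal sketch of reading such a file with Python's standard `csv` module. The column order (tweet first, label second) and the absence of extra columns are assumptions, not something this README confirms; the helper name `load_labeled_tweets` and the sample rows are made up for illustration.

```python
import csv
import io


def load_labeled_tweets(handle):
    """Return (tweet, label) pairs from a two-column CSV stream.

    Assumes the tweet is in the first column and the label in the second.
    Rows whose label is not "0" or "1" (for example a header row) are skipped.
    """
    pairs = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue
        text, label = row[0], row[1].strip()
        if label not in ("0", "1"):
            continue
        pairs.append((text, int(label)))
    return pairs


# Illustrative in-memory sample, not taken from the dataset.
sample = io.StringIO('"a tweet, with a comma",1\nanother tweet,0\n')
pairs = load_labeled_tweets(sample)
ironic = sum(label for _, label in pairs)
print(len(pairs), ironic)
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the CSV extracted from MirasIrony.zip.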
16 | 
17 | 
18 | ## Cite
19 | Please cite the following paper in your publication if you are using MirasIrony in your research:
20 | 
21 | ```bibtex
22 | @inproceedings{golazizian2020irony,
23 |   title={Irony Detection in Persian Language: A Transfer Learning Approach Using Emoji Prediction},
24 |   author={Golazizian, Preni and Sabeti, Behnam and Asli, Seyed Arad Ashrafi and Majdabadi, Zahra and Momenzadeh, Omid and others},
25 |   booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
26 |   pages={2839--2845},
27 |   year={2020}
28 | }
29 | ```


--------------------------------------------------------------------------------
/MirasOpinion/MirasOpinion.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/miras-tech/MirasText/e60eac95fffd90e4ecbb0526b17464a95992395d/MirasOpinion/MirasOpinion.zip


--------------------------------------------------------------------------------
/MirasOpinion/README.md:
--------------------------------------------------------------------------------
 1 | # MirasOpinion: A Sentiment Analysis Dataset for Persian
 2 | This repository contains information about the MirasOpinion dataset, the largest Persian sentiment analysis dataset to date, along with a demo file that contains 20 documents with their corresponding labels. MirasOpinion.zip contains the whole dataset. To use the complete dataset, submit a request to behnamsabeti@gmail.com and we will provide the password needed to extract MirasOpinion.zip.
 3 | 
 4 | ## What Is MirasOpinion?
 5 | MirasOpinion is crawled from Digikala, one of the largest e-commerce websites in Iran. We crawled 2.5 million comments and, after some pre-processing, reduced the corpus to one million comments. The corpus was then labeled through crowdsourcing: a Telegram bot sends unlabeled documents to several users and asks them to label each document as positive, negative, or neutral.
 6 | 
 7 | Here is a summary of our dataset statistics:
 8 | 
 9 | | Total Documents   | 93,868 |
10 | |-------------------|--------|
11 | | Max Length        | 1434   |
12 | | Min Length        | 3      |
13 | | Mean Length       | 38.15  |
14 | | Positive Comments | 49515  |
15 | | Negative Comments | 14882  |
16 | | Neutral Comments  | 29471  |
17 | 
18 | 
19 | ## Cite
20 | Please cite the following paper in your publication if you are using MirasOpinion in your research:
21 | ```bibtex
22 | @inproceedings{ashrafi-asli-etal-2020-optimizing,
23 |     title = "Optimizing Annotation Effort Using Active Learning Strategies: A Sentiment Analysis Case Study in {P}ersian",
24 |     author = "Ashrafi Asli, Seyed Arad  and
25 |       Sabeti, Behnam  and
26 |       Majdabadi, Zahra  and
27 |       Golazizian, Preni  and
28 |       fahmi, reza  and
29 |       Momenzadeh, Omid",
30 |     booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
31 |     month = may,
32 |     year = "2020",
33 |     address = "Marseille, France",
34 |     publisher = "European Language Resources Association",
35 |     url = "https://www.aclweb.org/anthology/2020.lrec-1.348",
36 |     pages = "2855--2861",
37 |     language = "English",
38 |     ISBN = "979-10-95546-34-4",
39 | }
40 | ```


--------------------------------------------------------------------------------
/MirasText/README.md:
--------------------------------------------------------------------------------
 1 | # MirasText: An Automatically Generated Text Corpus for Persian
 2 | This repository contains the MirasText corpus and its description, along with what it has been used for and what it can be used for. A sample of the dataset, containing 1000 documents, is provided in MirasText_sample.txt. The full dataset is hosted on Google Drive and can be downloaded [here](https://drive.google.com/file/d/1QNHPv4B22d-Dj7oYoOKQNx2zUfFzsAUL/view?usp=sharing). To use the complete dataset, submit a request to behnamsabeti@gmail.com and we will provide the password needed to extract MirasText.zip.
 3 | 
 4 | ## MirasText Description
 5 | MirasText is the result of crawling more than 250 Persian news websites. Each article in MirasText has the following attributes:
 6 | * Content: The article content
 7 | * Description: The summary provided for each article (may be missing for some articles, shown as nan)
 8 | * Keywords: The keywords associated with each article (may be missing for some articles, shown as nan)
 9 | * Title: Title of the article
10 | * Website: The website from which the article is crawled
11 | * URL: The article absolute URL
12 | 
13 | MirasText has more than 2.8 million articles and over 1.4 billion content words. The following table summarizes the statistics of the corpus:
14 | 
15 | |       Total Documents      |   2,835,414   |
16 | |:--------------------------:|:-------------:|
17 | |     Total Content Words    | 1,429,878,960 |
18 | |   Average Content Length   |     504.3     |
19 | |      Average Keywords      |      8.4      |
20 | | Average Description Length |      19.8     |
21 | |    Average Title Length    |      9.5      |
22 | 
23 | ## What it has been used for
24 | At Miras Technologies International, we use MirasText to develop NLP applications, including:
25 | * Document Classification
26 | * Word Embedding Extraction
27 | * Summarization
28 | * Keyword Extraction
29 | 
30 | Please let us know if you have used MirasText for any other purpose so that we can add it to this list.
31 | 
32 | ## What it can be used for
33 | MirasText can be used for a variety of NLP tasks. Besides the applications mentioned above, it can also be used for:
34 | * Language Modeling
35 | * Title Extraction
36 | * Named Entity Recognition (for an unsupervised approach)
37 | 
38 | ## Dataset Description
39 | The dataset is provided in MirasText.zip. Extract it first, then use parser.py to read the dataset line by line. Each line contains one article. The attributes of each article are separated by a special delimiter, `***`, to avoid conflicts with the article text.
40 | 
41 | Each article is provided in the following format:
42 | content *** description *** keywords *** title *** website *** url
43 | (note that *** is the delimiter used to separate the attributes)
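
The format above can be sketched as a small function that turns one line into a dictionary. This is an illustration rather than the repository's parser.py: the names `FIELDS`, `DELIMITER`, and `parse_article` are hypothetical, the sample line is made up, and treating the literal string `nan` as a missing value is an assumption based on the attribute list above.

```python
FIELDS = ("content", "description", "keywords", "title", "website", "url")
DELIMITER = "***"


def parse_article(line):
    """Split one corpus line into a dict keyed by attribute name."""
    parts = [part.strip() for part in line.rstrip("\n").split(DELIMITER)]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    article = dict(zip(FIELDS, parts))
    # Assumption: a missing description or keyword list is stored as "nan".
    for key in ("description", "keywords"):
        if article[key] == "nan":
            article[key] = None
    return article


# Illustrative line, not taken from the corpus.
example = "Body text *** nan *** k1, k2 *** A title *** example.com *** http://example.com/a\n"
article = parse_article(example)
print(article["title"], article["description"])
```

Raising `ValueError` on a wrong field count makes malformed lines visible instead of silently shifting attributes; a caller that prefers to skip such lines can catch the exception.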
44 | 
45 | ## Cite
46 | Please cite the following paper in your publication if you are using MirasText in your research:
47 | ```bibtex
48 | @InProceedings{SABETI18.385,
49 |   author = {Behnam Sabeti and Hossein Abedi Firouzjaee and Ali Janalizadeh Choobbasti and Seyed hani elamahdi Mortazavi Najafabadi and Amir Vaheb},
50 |   title = {MirasText: An Automatically Generated Text Corpus for Persian},
51 |   booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
52 |   year = {2018},
53 |   month = {may},
54 |   date = {7-12},
55 |   location = {Miyazaki, Japan},
56 |   editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
57 |   publisher = {European Language Resources Association (ELRA)},
58 |   address = {Paris, France},
59 |   isbn = {979-10-95546-00-9},
60 |   language = {english}
61 |   }
62 |   ```


--------------------------------------------------------------------------------
/MirasText/parser.py:
--------------------------------------------------------------------------------
 1 | 
 2 | with open('MirasText.txt', 'r', encoding='utf8') as file:
 3 |     for line in file:
 4 |         tokens = [token.strip() for token in line.split('***')]  # drop padding and the trailing newline
 5 |         content = tokens[0]
 6 |         description = tokens[1]
 7 |         keywords = tokens[2]
 8 |         title = tokens[3]
 9 |         website = tokens[4]
10 |         url = tokens[5]
11 | 
12 |         print('content : ' + content)
13 |         print('description : ' + description)
14 |         print('keywords : ' + keywords)
15 |         print('title : ' + title)
16 |         print('website : ' + website)
17 |         print('url : ' + url)
18 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Miras Datasets for Persian NLP
2 | 
3 | This repository contains information about three datasets for Persian NLP:
4 | 
5 | - MirasText
6 | - MirasOpinion
7 | - MirasIrony
8 | 
9 | Proceed to the corresponding folder for more information about each dataset.


--------------------------------------------------------------------------------