├── ScandiSent.zip
├── ScandiSent-mt.zip
└── README.md

/ScandiSent.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/timpal0l/ScandiSent/HEAD/ScandiSent.zip

--------------------------------------------------------------------------------
/ScandiSent-mt.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/timpal0l/ScandiSent/HEAD/ScandiSent-mt.zip

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ScandiSent
Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴󠁧󠁢󠁥󠁮󠁧󠁿)

## Information
The corpus is crawled from [se.trustpilot.com](https://se.trustpilot.com/), [no.trustpilot.com](https://no.trustpilot.com/), [dk.trustpilot.com](https://dk.trustpilot.com/), [fi.trustpilot.com](https://fi.trustpilot.com/) and [trustpilot.com](https://trustpilot.com/).
It consists of reviews from all 22 corresponding categories:

```javascript
categories = ['animals_pets', 'electronics_technology', 'events_entertainment', 'vehicles_transportation',
              'business_services', 'health_medical', 'home_garden', 'hobbies_crafts', 'home_services',
              'legal_services_government', 'construction_manufactoring', 'food_beverages_tobacco', 'media_publishing',
              'money_insurance', 'travel_vacation', 'restaurants_bars', 'public_local_services', 'shopping_fashion',
              'education_training', 'beauty_wellbeing', 'sports', 'housing_utility_company']
```

Each language has 10 000 texts, evenly balanced between positive and negative reviews. A positive review is a text rated `4` or `5`, and a negative review is a text rated `1` or `2`. Texts rated `3` were not used.
The zip files consist of CSV files for each language with the columns `text` and `label`, where `label` == `1` is a positive review and `label` == `0` is a negative review.

For our paper: [Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?](https://arxiv.org/pdf/2104.10441.pdf) we used the first 7500 texts for training and the last 2500 texts for evaluating.

#### [ScandiSent.zip](ScandiSent.zip) 🇸🇪 🇳🇴 🇩🇰 🇫🇮 + 🏴󠁧󠁢󠁥󠁮󠁧󠁿
The raw data for each language. We used [fastText](https://fasttext.cc/docs/en/language-identification.html) language identification to ensure that the texts were in the right language.

#### [ScandiSent-mt.zip](ScandiSent-mt.zip) 🏴󠁧󠁢󠁥󠁮󠁧󠁿
The raw data from `ScandiSent`, machine translated to English 🏴󠁧󠁢󠁥󠁮󠁧󠁿 using Google's Neural Machine Translation API.

## Version 1.0
2021-02-06

--------------------------------------------------------------------------------
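
The labeling rule and the 7500/2500 split described in the README can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names `rating_to_label`, `read_rows`, and `split_train_eval` are hypothetical, and it assumes each CSV starts with a header row naming the `text` and `label` columns. The file names inside the zip archives are not stated in the README, so the sketch reads from any open text handle rather than guessing a path.

```python
import csv
import io


def rating_to_label(rating):
    """Map a star rating to the corpus label: 4-5 -> 1, 1-2 -> 0, 3 -> None (unused)."""
    if rating in (4, 5):
        return 1
    if rating in (1, 2):
        return 0
    return None


def read_rows(handle):
    """Read (text, label) pairs from an open CSV handle.

    Assumption: the file has a header row with the columns `text` and `label`.
    """
    reader = csv.DictReader(handle)
    return [(row["text"], int(row["label"])) for row in reader]


def split_train_eval(rows, n_train=7500):
    """Return the first n_train rows for training and the remaining rows for evaluation."""
    return rows[:n_train], rows[n_train:]


if __name__ == "__main__":
    # Tiny in-memory stand-in for one of the per-language CSV files.
    sample = io.StringIO("text,label\nMycket bra service,1\nAldrig igen,0\n")
    rows = read_rows(sample)
    train, evaluation = split_train_eval(rows, n_train=1)
    print(len(train), len(evaluation))
```

With a real 10 000-row file, calling `split_train_eval(rows)` with the default `n_train=7500` yields the first 7500 texts for training and the last 2500 for evaluation, matching the split the README describes.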