├── ScandiSent.zip
├── ScandiSent-mt.zip
└── README.md


/ScandiSent.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/timpal0l/ScandiSent/HEAD/ScandiSent.zip


--------------------------------------------------------------------------------
/ScandiSent-mt.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/timpal0l/ScandiSent/HEAD/ScandiSent-mt.zip


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # ScandiSent
 2 | Sentiment Corpus for Swedish 🇸🇪 Norwegian 🇳🇴 Danish 🇩🇰 Finnish 🇫🇮 (and English 🏴󠁧󠁢󠁥󠁮󠁧󠁿)
 3 | 
 4 | ## Information
 5 | The corpus is crawled from [se.trustpilot.com](https://se.trustpilot.com/), [no.trustpilot.com](https://no.trustpilot.com/), [dk.trustpilot.com](https://dk.trustpilot.com/), [fi.trustpilot.com](https://fi.trustpilot.com/) and [trustpilot.com](https://trustpilot.com/).
 6 | It consists of reviews from all 22 corresponding categories:
 7 | 
 8 | ```javascript
 9 | categories = ['animals_pets', 'electronics_technology', 'events_entertainment', 'vehicles_transportation',
10 | 'business_services', 'health_medical', 'home_garden', 'hobbies_crafts', 'home_services',
11 | 'legal_services_government', 'construction_manufactoring', 'food_beverages_tobacco', 'media_publishing',
12 | 'money_insurance', 'travel_vacation', 'restaurants_bars', 'public_local_services', 'shopping_fashion',
13 | 'education_training', 'beauty_wellbeing', 'sports', 'housing_utility_company']
14 | ```
15 | 
16 | The corpus contains 10 000 texts per language, evenly balanced between positive and negative reviews. A positive review is a text rated `4` or `5`, and a negative review is one rated `1` or `2`. Texts rated `3` were not used. The zip files contain one CSV file per language with the columns `text` and `label`, where `label` == `1` is a positive review and `label` == `0` is a negative review.
17 | 
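The labelling rule above can be sketched in a few lines of Python. This is only an illustration of the mapping from star rating to `label`, not the code used to build the corpus; the function name `rating_to_label` is hypothetical.

```python
from typing import Optional


def rating_to_label(rating: int) -> Optional[int]:
    """Map a 1-5 star rating to the binary label described above.

    Hypothetical helper for illustration; not part of the corpus tooling.
    """
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"rating must be between 1 and 5, got {rating!r}")
    if rating >= 4:
        return 1  # positive review (rated 4 or 5)
    if rating <= 2:
        return 0  # negative review (rated 1 or 2)
    return None  # rated 3: such texts were not used
```

Returning `None` for a rating of `3` lets a caller filter those texts out before writing the `text` and `label` columns.
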
18 | For our paper, [Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?](https://arxiv.org/pdf/2104.10441.pdf), we used the first 7500 texts for training and the last 2500 texts for evaluation.
19 | 
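A minimal sketch of that split, assuming one language's CSV file has already been extracted from the zip archive. The helper names `load_rows` and `split_rows`, the example path, and the UTF-8 encoding are assumptions made for illustration; only the `text` and `label` columns and the 7500/2500 sizes come from the description above.

```python
import csv
from typing import Dict, List, Tuple

Row = Dict[str, str]


def load_rows(csv_path: str) -> List[Row]:
    """Read a CSV file with `text` and `label` columns, preserving row order.

    UTF-8 encoding is an assumption, not something the README states.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def split_rows(
    rows: List[Row], n_train: int = 7500, n_eval: int = 2500
) -> Tuple[List[Row], List[Row]]:
    """Return the first `n_train` rows and the last `n_eval` rows."""
    if n_train + n_eval > len(rows):
        raise ValueError("not enough rows for the requested split")
    return rows[:n_train], rows[len(rows) - n_eval:]


# Example usage (the file name is hypothetical):
# train, evaluation = split_rows(load_rows("path/to/language.csv"))
```

Because the split is positional rather than random, the row order of the file must be left untouched between loading and splitting.
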
20 | #### [ScandiSent.zip](ScandiSent.zip) 🇸🇪 🇳🇴 🇩🇰 🇫🇮 + 🏴󠁧󠁢󠁥󠁮󠁧󠁿
21 | Contains the raw data for each language. We used [fastText](https://fasttext.cc/docs/en/language-identification.html) language identification to ensure that the texts were in the right language.
22 | 
23 | #### [ScandiSent-mt.zip](ScandiSent-mt.zip) 🏴󠁧󠁢󠁥󠁮󠁧󠁿
24 | Consists of the raw data from `ScandiSent`, machine translated to English 🏴󠁧󠁢󠁥󠁮󠁧󠁿 using Google's Neural Machine Translation API.
25 | 
26 | ## Version 1.0
27 | 2021-02-06
28 | 


--------------------------------------------------------------------------------