├── _config.yml
├── run.sh
├── requirements.txt
├── LICENSE
├── gather_urls.py
├── .gitignore
├── scrape_reviews.py
├── post_process.py
└── README.md

/_config.yml:
--------------------------------------------------------------------------------
theme: jekyll-theme-cayman
--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
#!/usr/bin/env sh

python3 ./gather_urls.py urls.txt
python3 ./scrape_reviews.py urls.txt reviews.json
python3 ./post_process.py reviews.json 110kDBRD
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
Click==7.0
requests==2.21.0
beautifulsoup4==4.7.1
selenium==3.141.0
progressbar2==3.39.2
lxml==4.6.3
pandas==0.24.1
scikit-learn==0.20.2
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Benjamin van der Burgh

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/gather_urls.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import click
import requests
from bs4 import BeautifulSoup


TEMPLATE_URL = 'https://www.hebban.nl/main/Review/more?offset={}&step={}'


@click.command()
@click.argument('outfile')
@click.option('--offset', default=0, help='Review offset.')
@click.option('--step', default=1000, help='Number of review urls to fetch per request.')
def gather(outfile, offset, step):
    """
    This script gathers review urls from Hebban and writes them to OUTFILE.
    """
    urls = []
    while True:
        target_url = TEMPLATE_URL.format(offset, step)
        r = requests.get(target_url)
        data = r.json()

        if not data['html']:
            break

        soup = BeautifulSoup(data['html'], 'lxml')
        new_urls = [div['data-url'] for div in soup('div', {'class': 'item'})]
        print(f"Fetched {len(new_urls)} urls from {target_url}")
        urls.extend(new_urls)
        offset += step

    with open(outfile, 'w') as f:
        for url in urls:
            f.write(url)
            f.write('\n')


if __name__ == '__main__':
    gather()
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/

*.tgz
*.json
*.txt
110kDBRD/
--------------------------------------------------------------------------------
/scrape_reviews.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import time
import codecs
import json

import click
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from progressbar import ProgressBar


# NOTE: retrying is often needed when a page is still loading
def retry(fun, arg, max_retries=10, sleep=0.5):
    """
    Retry executing function FUN with argument ARG.
    :param fun: function to execute
    :param arg: argument to pass to FUN
    :param max_retries: maximum number of retries
    :param sleep: number of seconds to sleep in between retries
    :return: the result of FUN(ARG), or an empty list if all retries fail
    """
    data = fun(arg)
    retries = 0
    while not data and retries < max_retries:
        time.sleep(sleep)
        data = fun(arg)
        retries += 1
    return data if data else []


@click.command()
@click.argument('infile')
@click.argument('outfile')
@click.option('--encoding', default='utf-8', help='Output file encoding.')
@click.option('--indent', default=2, help='Indentation level for the output JSON file.')
def scrape(infile, outfile, encoding, indent):
    """
    Iterate over review urls in INFILE text file, scrape review data and output to OUTFILE.
    """
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)

    urls = [line.strip() for line in open(infile)]

    reviews = []
    bar = ProgressBar()
    errors = []
    for url in bar(urls):
        try:
            driver.get(url)
            title = retry(driver.find_elements_by_css_selector, "div[itemprop='itemReviewed']")
            author = retry(driver.find_elements_by_css_selector, "a[class='author']")
            reviewer = retry(driver.find_elements_by_class_name, 'user-excerpt-name')
            rating = retry(driver.find_elements_by_css_selector, '.fa-star.full')
            text = retry(driver.find_elements_by_xpath, '//../following-sibling::p')
            published = retry(driver.find_elements_by_css_selector, "meta[itemprop='datePublished']")

            if text and rating:
                text = '\n'.join([p.text.strip() for p in text]).strip()
                if text:
                    reviews.append({
                        'url': url,
                        'title': title[0].get_attribute('data-url').strip() if title else None,
                        'author': author[0].get_attribute('href').strip() if author else None,
                        'reviewer': reviewer[0].get_attribute('href').strip() if reviewer else None,
                        'rating': len(rating),  # number of filled stars
                        'text': text,
                        'published': published[0].get_attribute('content').strip() if published else None
                    })
        except Exception:
            errors.append(url)
            print(f"Error {len(errors)}: {url}")
            continue

    print(f"Finished scraping {len(urls)} urls with {len(errors)} errors")

    print(f"Writing reviews to {outfile}")
    with codecs.open(outfile, 'w', encoding=encoding) as f:
        json.dump(reviews, f, ensure_ascii=False, indent=indent)


if __name__ == '__main__':
    scrape()
--------------------------------------------------------------------------------
/post_process.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import codecs
import json
import os

import click
import sklearn.utils
from progressbar import ProgressBar


def load(infile, keep_incorrect_date=False, unique=True, sort=True, encoding='utf-8'):
    """
    Load reviews from JSON input file.
    """
    with codecs.open(infile, encoding=encoding) as f:
        reviews = json.load(f)

    if not keep_incorrect_date:
        # Drop missing entries and reviews with an implausible publication date
        reviews = [x for x in reviews if x is not None and x['published'] >= '2002-09-11T00:00:00+02:00']

    if unique:
        # Define a unique review as one with a unique review text
        u = {review['text']: review for review in reviews}
        reviews = list(u.values())

    if sort:
        reviews = sorted(reviews, key=lambda x: x['published'])

    return reviews


def zipper(list1, list2):
    """
    Zip two equal-length lists, alternating values, i.e. zipper([1,2,3], [4,5,6]) == [1,4,2,5,3,6]
    :return: the interleaved list
    """
    result = [None] * (len(list1) + len(list2))
    result[::2] = list1
    result[1::2] = list2
    return result


def write_supervised(reviews, outdir, start_index):
    """
    Write reviews to OUTDIR with a separate folder for negative and positive reviews.
    """
    os.mkdir(outdir)
    pos_dir = os.path.join(outdir, 'pos')
    neg_dir = os.path.join(outdir, 'neg')
    os.mkdir(pos_dir)
    os.mkdir(neg_dir)
    index = start_index
    bar = ProgressBar()
    for review in bar(reviews):
        rating = review['rating']
        if rating > 3:
            target_dir = pos_dir
        elif rating < 3:
            target_dir = neg_dir
        else:
            raise Exception("rating should be negative or positive!")
        filename = "{}_{}.txt".format(index, rating)
        with codecs.open(os.path.join(target_dir, filename), 'w', encoding='utf-8') as f:
            f.write(review['text'])
        index += 1
    return index


def write_unsupervised(reviews, outdir, start_index):
    """
    Write reviews to OUTDIR (no separate folder for positive and negative).
    """
    os.mkdir(outdir)
    index = start_index
    bar = ProgressBar()
    for review in bar(reviews):
        rating = review['rating']
        filename = "{}_{}.txt".format(index, rating)
        with codecs.open(os.path.join(outdir, filename), 'w', encoding='utf-8') as f:
            f.write(review['text'])
        index += 1
    return index


def write_urls(reviews, outfile):
    """
    Write a provenance file containing a URL for each review.
    """
    with codecs.open(outfile, 'w', encoding='utf-8') as f:
        for review in reviews:
            f.write(review['url'])
            f.write('\n')


@click.command()
@click.argument('infile')
@click.argument('outdir')
@click.option('--encoding', default='utf-8', help='Input file encoding')
@click.option('--keep-incorrect-date', default=False, help='Whether to keep reviews with invalid dates.')
@click.option('--sort', default=True, help='Whether to sort reviews by date.')
@click.option('--valid-size-fraction', default=0.1, help='Fraction of total to set aside as validation.')
@click.option('--shuffle', default=True, help='Shuffle data before saving.')
def process(infile, outdir, encoding, keep_incorrect_date, sort, valid_size_fraction, shuffle):
    reviews = load(infile, keep_incorrect_date=keep_incorrect_date, sort=sort, encoding=encoding)

    if shuffle:
        reviews = sklearn.utils.shuffle(reviews)

    pos = [x for x in reviews if x['rating'] > 3]
    neg = [x for x in reviews if x['rating'] < 3]
    neut = [x for x in reviews if x['rating'] == 3]  # set aside for model fine-tuning

    # Balance the dataset
    train_size = min(len(pos), len(neg))
    train_size -= train_size % 2  # make even
    sup = zipper(pos[:train_size], neg[:train_size])  # alternate positive and negative samples
    unsup = pos[train_size:] + neg[train_size:] + neut

    end = int(round(float(len(sup)) * valid_size_fraction))
    end -= end % 2  # make even

    # Because sup contains alternating labels like [pos, neg, pos, neg, ...] we can split anywhere as long as end is even
    test = sup[:end]
    train = sup[end:]

    print(f"Size all data:\t{len(reviews)}")
    print(f"Size supervised:\t{len(sup)}")
    print(f"Size unsupervised:\t{len(unsup)}")
    print(f"Size training:\t{len(train)}")
    print(f"Size testing:\t{len(test)}")

    os.mkdir(outdir)

    index = 1
    print("Writing train data...")
    index = write_supervised(train, os.path.join(outdir, 'train'), index)

    print("Writing test data...")
    index = write_supervised(test, os.path.join(outdir, 'test'), index)

    print("Writing unsupervised data...")
    index = write_unsupervised(unsup, os.path.join(outdir, 'unsup'), index)

    print("Writing URLs...")
    write_urls(train + test + unsup, os.path.join(outdir, 'urls.txt'))

    print("DONE! :)")


if __name__ == '__main__':
    process()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# DBRD: Dutch Book Reviews Dataset

![GitHub release (with filter)](https://img.shields.io/github/v/release/benjaminvdb/DBRD) ![GitHub](https://img.shields.io/github/license/benjaminvdb/DBRD) ![GitHub all releases](https://img.shields.io/github/downloads/benjaminvdb/DBRD/total) ![GitHub Sponsors](https://img.shields.io/github/sponsors/benjaminvdb)

The DBRD (pronounced *dee-bird*) dataset contains over 110k book reviews along with associated binary sentiment polarity labels. It is greatly influenced by the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) and intended as a benchmark for sentiment classification in Dutch. The scripts that were used to scrape the reviews from [Hebban](https://www.hebban.nl) can be found in the [DBRD GitHub repository](https://github.com/benjaminvdb/DBRD).

# Dataset

## Downloads

The dataset is ~79MB compressed and can be downloaded from here:

**[Dutch Book Reviews Dataset](https://github.com/benjaminvdb/DBRD/releases/download/v3.0/DBRD_v3.tgz)**

A language model trained with [FastAI](https://github.com/fastai/fastai) on Dutch Wikipedia can be downloaded from here:

**[Dutch language model trained on Wikipedia](http://bit.ly/2trOhzq)**

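For example, on a Unix-like system (assuming `wget` and `tar` are available), the dataset archive can be fetched and unpacked with something like:

    wget https://github.com/benjaminvdb/DBRD/releases/download/v3.0/DBRD_v3.tgz
    tar -xzf DBRD_v3.tgz
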
## Overview

### Directory structure

The dataset includes three folders with data: `test` (test split), `train` (train split) and `unsup` (remaining reviews).
Each review is assigned a unique identifier; both the identifier and the rating can be read from the filename: `[ID]_[RATING].txt` (see the loading sketch at the end of this overview). *This is different from the Large Movie Review Dataset, where each file in a directory has a unique ID, but IDs are reused between folders.*

Line `L` of `urls.txt` contains the Hebban URL of the book review with ID `L`, i.e., the URL of the book review in `48091_5.txt` can be found on line 48091 of `urls.txt`. It cannot be guaranteed that these pages still exist.

````
.
├── README.md       // the file you're reading
├── test            // balanced 10% test split
│   ├── neg
│   └── pos
├── train           // balanced 90% train split
│   ├── neg
│   └── pos
├── unsup           // unbalanced positive and neutral
└── urls.txt        // urls to reviews on Hebban
````

### Size
````
#all:           118516 (= #supervised + #unsupervised)
#supervised:     22252 (= #training + #testing)
#unsupervised:   96264
#training:       20028
#testing:         2224
````

### Labels

Distribution of labels `positive/negative/neutral` in rounded percentages.
````
training: 50/50/ 0
test:     50/50/ 0
unsup:    72/ 0/28
````

Train and test sets are balanced and contain no neutral reviews (for which `rating==3`).

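Given the layout and filename convention above, reading a labelled split takes only a few lines. Below is a minimal sketch; the `read_split` helper and the `DBRD/` extraction path are illustrative and not part of the dataset itself:

```python
import os


def read_split(root, split):
    """Yield (review_id, rating, label, text) tuples for the 'train' or 'test' split."""
    for folder, label in (('pos', 1), ('neg', 0)):
        split_dir = os.path.join(root, split, folder)
        for filename in sorted(os.listdir(split_dir)):
            review_id, rating = os.path.splitext(filename)[0].split('_')  # "[ID]_[RATING].txt"
            with open(os.path.join(split_dir, filename), encoding='utf-8') as f:
                yield int(review_id), int(rating), label, f.read()


train = list(read_split('DBRD', 'train'))  # adjust the path to wherever you extracted the archive
print(len(train))
```

Reviews in `unsup` can be read in the same way, except that there are no `pos`/`neg` subfolders and the ratings include neutral ones.
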
# Reproduce data

Since scraping Hebban puts a load on their servers, it's best to download the prepared dataset instead. Using the prepared dataset also ensures that your results can be compared to those of others. The scripts and instructions should be used mostly as a starting point for building a scraper for another website.

## Install dependencies

### ChromeDriver
I'm making use of [Selenium](https://www.seleniumhq.org) for automating user actions such as clicks. This library requires a browser driver that provides the rendering backend. I've made use of [ChromeDriver](http://chromedriver.chromium.org/).

#### macOS
If you're on macOS and you have Homebrew installed, you can install ChromeDriver by running:

    brew install chromedriver

#### Other OSes
You can download ChromeDriver from the official [download page](http://chromedriver.chromium.org/downloads).

### Python
The scripts are written for **Python 3**. To install the Python dependencies, run:

    pip3 install -r ./requirements.txt

## Run
Three scripts are provided that should be run in sequence. You can also run `run.sh` to run all of them with defaults; an example invocation with the options written out follows the step descriptions below.

### Gather URLs
The first step is to gather all review URLs from [Hebban](https://www.hebban.nl). Run `gather_urls.py` to fetch them and save them to a text file.

```
Usage: gather_urls.py [OPTIONS] OUTFILE

  This script gathers review urls from Hebban and writes them to OUTFILE.

Options:
  --offset INTEGER  Review offset.
  --step INTEGER    Number of review urls to fetch per request.
  --help            Show this message and exit.
```

### Scrape URLs
The second step is to scrape the URLs for review data. Run `scrape_reviews.py` to iterate over the review URLs and save the scraped data to a JSON file.

```
Usage: scrape_reviews.py [OPTIONS] INFILE OUTFILE

  Iterate over review urls in INFILE text file, scrape review data and
  output to OUTFILE.

Options:
  --encoding TEXT   Output file encoding.
  --indent INTEGER  Indentation level for the output JSON file.
  --help            Show this message and exit.
```

### Post-process

The third and final step is to prepare the dataset from the scraped reviews. By default, duplicates and reviews with invalid dates are filtered out, neutral reviews are set aside as unsupervised data, and balanced train and test sets of 0.9 and 0.1 of the supervised total, respectively, are prepared.

```
Usage: post_process.py [OPTIONS] INFILE OUTDIR

Options:
  --encoding TEXT              Input file encoding
  --keep-incorrect-date TEXT   Whether to keep reviews with invalid dates.
  --sort TEXT                  Whether to sort reviews by date.
  --valid-size-fraction FLOAT  Fraction of total to set aside as validation.
  --shuffle TEXT               Shuffle data before saving.
  --help                       Show this message and exit.
```

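For reference, a full run with the main options spelled out (similar to what `run.sh` does; the `DBRD` output directory name and the option values, which are the defaults, are only examples) could look like:

    python3 gather_urls.py urls.txt
    python3 scrape_reviews.py urls.txt reviews.json --indent 2
    python3 post_process.py reviews.json DBRD --valid-size-fraction 0.1
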
## Changelog

v3: Changed name of the dataset from 110kDBRD to DBRD. The dataset itself remains unchanged.

v2: Removed advertisements from reviews and increased dataset size to 118,516.

v1: Initial release

## Citation

Please use the following citation when making use of this dataset in your work.

```
@article{DBLP:journals/corr/abs-1910-00896,
  author        = {Benjamin van der Burgh and
                   Suzan Verberne},
  title         = {The merits of Universal Language Model Fine-tuning for Small Datasets
                   - a case with Dutch book reviews},
  journal       = {CoRR},
  volume        = {abs/1910.00896},
  year          = {2019},
  url           = {http://arxiv.org/abs/1910.00896},
  archivePrefix = {arXiv},
  eprint        = {1910.00896},
  timestamp     = {Fri, 04 Oct 2019 12:28:06 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1910-00896.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```

## Acknowledgements

This dataset was created for testing out the [ULMFiT](https://arxiv.org/abs/1801.06146) (by Jeremy Howard and Sebastian Ruder) deep learning algorithm for text classification. It is implemented in the [FastAI](https://github.com/fastai/fastai) Python library, which has taught me a lot. I'd also like to thank [Timo Block](https://github.com/tblock) for making his [10kGNAD](https://github.com/tblock/10kGNAD) dataset publicly available and giving me a starting point for this dataset. The dataset structure is based on the [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) by Andrew L. Maas et al. Thanks to [Andreas van Cranenburgh](https://github.com/andreasvc) for pointing out a problem with the dataset.

And of course I'd like to thank all the reviewers on [Hebban](https://www.hebban.nl) for having taken the time to write all these reviews. You've made both book enthusiasts and NLP researchers very happy :)

## License

All code in this repository is licensed under an MIT License.

The dataset is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/).
--------------------------------------------------------------------------------