├── .gitattributes
├── README.md
├── output_data
│   └── README
├── prepare_clang8_dataset.py
├── prepare_clang8_dataset_test.py
├── requirements.txt
├── retokenize.py
├── run.sh
└── targets
    ├── README
    ├── clang8_de.detokenized.tsv
    ├── clang8_en.detokenized.tsv
    └── clang8_ru.detokenized.tsv

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.tsv filter=lfs diff=lfs merge=lfs -text
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # cLang-8 Dataset
2 | 
3 | cLang-8 (“cleaned Lang-8”) is a dataset for grammatical error correction (GEC).
4 | The source sentences originate from the popular
5 | [NAIST Lang-8 Learner Corpora](https://sites.google.com/site/naistlang8corpora/home),
6 | while the target sentences are generated by our state-of-the-art GEC method
7 | called gT5. The method is described in our [ACL-IJCNLP 2021 paper](https://arxiv.org/abs/2106.03830).
8 | 
9 | The paper shows that fine-tuning a
10 | [T5-11B](https://github.com/google-research/text-to-text-transfer-transformer)
11 | model on cLang-8 yields SOTA performance on GEC for English. cLang-8 thus
12 | simplifies a typical GEC training pipeline consisting of multiple fine-tuning
13 | stages.
14 | 
15 | ## Dataset Preparation
16 | 
17 | cLang-8 is generated by combining the target sentences found under the
18 | `targets/` directory of this repository with the source sentences from the
19 | original Lang-8 corpus, which has to be downloaded separately. Specifically,
20 | you need to complete the following steps:
21 | 
22 | 1. [Install](https://git-lfs.github.com/) Git Large File Storage (if not
23 |    already installed) and clone this repository.
24 | 2. Fill in
25 |    [this form](https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k/viewform?edit_requested=true),
26 |    after which you will receive an email with a link to “the raw format
27 |    containing all the data up to 2010”.
28 | 3. Follow the link to download a zip file and extract it.
29 | 4. Update the `LANG8_DIR` variable in `run.sh` to point to the resulting
30 |    extracted directory.
31 | 5. Run the command `sh run.sh`, which will install the required Python 3
32 |    dependencies in a virtualenv and align the source and the target sentences.
33 | 
34 | NB: Running the above script takes about 1 hour when spaCy tokenization is
35 | enabled. Enabling it is recommended to keep tokenization consistent with the
36 | CoNLL-14 (see also the next section) and BEA eval sets.
37 | 
38 | ## Tokenization Post-Processing for CoNLL-14
39 | 
40 | After training a model and computing predictions on the CoNLL-14 test set for
41 | the paper, we ran some post-processing steps found in `retokenize.py` to fix
42 | tokenization discrepancies. This improves the F0.5 scores by about 2.5 points
43 | (for T5 xxl).
44 | 
45 | You may instead want to try applying the post-processing steps to cLang-8
46 | targets before training a model.
47 | 
48 | ## Data Format
49 | 
50 | The resulting cLang-8 data files will be saved under the `./output_data/`
51 | directory. They will be TSV files with a single tab-separated (source, target)
52 | pair per line.
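For example, the unit test in `prepare_clang8_dataset_test.py` expects
spaCy-tokenized pairs of the following shape (the two columns are separated by
a tab; the pair below comes from the test's fake data, not from the actual
dataset):

```
A best	The best
```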
Three separate TSV files will be generated for the following languages:
53 | 
54 | Language | Number of examples
55 | -------- | ------------------
56 | English | 2,372,119
57 | German | 114,405
58 | Russian | 44,830
59 | 
60 | ## How to Cite cLang-8
61 | 
62 | Please cite the following works if you use cLang-8:
63 | 
64 | ```
65 | @inproceedings{rothe2021a,
66 |   title = {{A Simple Recipe for Multilingual Grammatical Error Correction}},
67 |   author = {Rothe, Sascha and Mallinson, Jonathan and Malmi, Eric and Krause, Sebastian and Severyn, Aliaksei},
68 |   booktitle = {Proc. of ACL-IJCNLP},
69 |   year = {2021}
70 | }
71 | 
72 | @inproceedings{mizumoto2011mining,
73 |   title = {{Mining revision log of language learning SNS for automated Japanese error correction of second language learners}},
74 |   author = {Mizumoto, Tomoya and Komachi, Mamoru and Nagata, Masaaki and Matsumoto, Yuji},
75 |   booktitle = {Proc. of 5th International Joint Conference on Natural Language Processing},
76 |   pages = {147--155},
77 |   year = {2011}
78 | }
79 | ```
80 | 
81 | ## License
82 | 
83 | Like the original Lang-8 corpus, cLang-8 is distributed for research and
84 | educational purposes only. Specifically, cLang-8 is released under the
85 | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
86 | 
87 | The code is distributed under the Apache 2.0 license.
88 | 
89 | ## Contact Us
90 | 
91 | If you have a technical question regarding the dataset, code, or publication,
92 | please create an issue in this repository.
93 | 
--------------------------------------------------------------------------------
/output_data/README:
--------------------------------------------------------------------------------
1 | The prepared data files will be stored here (by default) once you run `run.sh`.
--------------------------------------------------------------------------------
/prepare_clang8_dataset.py:
--------------------------------------------------------------------------------
1 | """Main file to combine cLang-8 targets with the original Lang-8 sources.
2 | 
3 | Before running this, download the Lang-8 raw corpus from:
4 | 
5 | https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k/viewform?edit_requested=true
6 | 
7 | and provide the download directory path via the `lang8_dir` flag.
8 | """
9 | 
10 | import collections
11 | import json
12 | import os
13 | 
14 | from typing import Iterable, Iterator, List, Mapping, Sequence, Tuple
15 | 
16 | from absl import app
17 | from absl import flags
18 | 
19 | import spacy
20 | import tqdm
21 | 
22 | FLAGS = flags.FLAGS
23 | 
24 | flags.DEFINE_string(
25 |     'lang8_dir', '',
26 |     'Path to the directory containing the Lang-8 raw corpus, specifically the '
27 |     'following version of it: lang-8-20111007-L1-v2.dat')
28 | flags.DEFINE_string(
29 |     'clang8_dir', './targets',
30 |     'Path to the directory containing the cLang-8 files downloaded from '
31 |     'GitHub.')
32 | flags.DEFINE_string(
33 |     'output_dir', './output_data',
34 |     'Path to the directory where the output files are written.')
35 | flags.DEFINE_bool(
36 |     'tokenize_text', True,
37 |     'Whether to tokenize sources and targets using spaCy.')
38 | flags.DEFINE_list(
39 |     'languages', 'ru,en,de',
40 |     'Comma-separated list of languages for which to generate cLang-8.')
41 | 
42 | 
43 | def _yield_lang8_raw_dicts(lang8_raw_dir: str):
44 |   """Yields JSON rows from the Lang-8 raw corpus.
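
  Each row is a JSON list. This script relies only on row[0] (the journal id),
  row[1] (the sentence id), and row[4] (the list of learner sentences); the
  fake rows written in prepare_clang8_dataset_test.py illustrate the shape.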
45 | 
46 |   Format of the rows is documented at:
47 |   https://sites.google.com/site/naistlang8corpora/home/readme-raw
48 | 
49 |   Args:
50 |     lang8_raw_dir: Directory containing the Lang-8 raw corpus, specifically the
51 |       following version of it: lang-8-20111007-L1-v2.dat
52 |   """
53 |   path = os.path.join(lang8_raw_dir, 'lang-8-20111007-L1-v2.dat')
54 |   num_rows = 0
55 |   with open(path) as f:
56 |     for line in f:
57 |       try:
58 |         row = json.loads(line)
59 |         yield row
60 |         num_rows += 1
61 |       except json.decoder.JSONDecodeError:
62 |         pass
63 |   print(f'{num_rows} Lang-8 raw documents read.')
64 | 
65 | 
66 | def _read_clang8_targets(
67 |     path: str) -> Tuple[Mapping[Tuple[str, str], List[Tuple[str, str]]], int]:
68 |   """Reads cLang-8 targets generated by gT5.
69 | 
70 |   Args:
71 |     path: Path to a language-specific cLang-8 targets file.
72 | 
73 |   Returns:
74 |     A tuple of (1) a mapping from Lang-8 raw (journal_id, sentence_id) pairs to
75 |     lists of (sentence_number, target) tuples, where sentence_number is the
76 |     index of the learner sentence, and (2) the total number of targets read.
77 |   """
78 |   ids_2_targets = collections.defaultdict(list)
79 |   num_targets = 0
80 |   with open(path) as f:
81 |     for line in f:
82 |       (journal_id, sentence_id, sentence_number, has_correction,
83 |        target) = line.rstrip('\n').split('\t')
84 |       del has_correction
85 |       ids_2_targets[(journal_id, sentence_id)].append((sentence_number, target))
86 |       num_targets += 1
87 |   print(f'{num_targets} cLang-8 targets read.')
88 |   return ids_2_targets, num_targets
89 | 
90 | 
91 | def _yield_clang8_source_target_pairs(
92 |     clang8_path: str, lang8_raw_dir: str) -> Iterator[Tuple[str, str]]:
93 |   """Yields cLang-8 source-target pairs.
94 | 
95 |   The pairs are obtained by combining the cLang-8 target file and the original
96 |   Lang-8 raw corpus.
97 | 
98 |   Args:
99 |     clang8_path: Path to a language-specific cLang-8 targets file.
100 |     lang8_raw_dir: Directory containing the Lang-8 raw corpus, specifically the
101 |       following version of it: lang-8-20111007-L1-v2.dat
102 |   """
103 |   ids_2_targets, num_targets = _read_clang8_targets(clang8_path)
104 |   num_combined = 0
105 |   with tqdm.tqdm(total=num_targets) as progress_bar:
106 |     for row in _yield_lang8_raw_dicts(lang8_raw_dir):
107 |       lang8_raw_ids = (row[0], row[1])
108 |       for sentence_number, target in ids_2_targets.get(lang8_raw_ids, []):
109 |         source = row[4][int(sentence_number)]
110 |         yield source, target
111 |         num_combined += 1
112 |         progress_bar.update(1)
113 |   print(f'{num_combined} sources mapped to cLang-8 targets.')
114 | 
115 | 
116 | def _tokenize(pairs: Iterable[Tuple[str, str]],
117 |               nlp: spacy.Language,
118 |               batch_size: int = 1000) -> Iterator[Tuple[str, str]]:
119 |   """Yields the input source-target pairs after tokenizing them.
120 | 
121 |   NB: This function loads all source-target pairs into memory at once.
122 | 
123 |   Args:
124 |     pairs: Untokenized (source, target) pairs.
125 |     nlp: SpaCy pipeline.
126 |     batch_size: Batch size used with `nlp.pipe`.
127 | 
128 |   Yields:
129 |     (tokenized source, tokenized target) pairs.
130 |   """
131 |   # Convert iterator to list to be able to separate sources and targets so that
132 |   # we can use `nlp.pipe` with batching for increased throughput.
133 |   pairs = list(pairs)
134 |   print('Tokenizing...')
135 |   source_docs = nlp.pipe([pair[0] for pair in pairs], batch_size=batch_size)
136 |   target_docs = nlp.pipe([pair[1] for pair in pairs], batch_size=batch_size)
137 |   with tqdm.tqdm(total=len(pairs)) as progress_bar:
138 |     for source, target in zip(source_docs, target_docs):
139 |       source_tokenized = ' '.join([token.text for token in source])
140 |       target_tokenized = ' '.join([token.text for token in target])
141 |       yield source_tokenized, target_tokenized
142 |       progress_bar.update(1)
143 | 
144 | 
145 | def _clean_spaces(text: str) -> str:
146 |   """Removes tabs and newlines for saving as TSV."""
147 |   return text.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
148 | 
149 | 
150 | def _write_source_target_pairs_to_tsv(pairs: Iterable[Tuple[str, str]],
151 |                                       output_path: str) -> None:
152 |   """Saves source-target pairs separated with a tab to a file."""
153 |   with open(output_path, 'w') as f:
154 |     for source, target in pairs:
155 |       source = _clean_spaces(source)
156 |       target = _clean_spaces(target)
157 |       f.write(f'{source}\t{target}\n')
158 |   print(f'Wrote the source-target pairs to:\n{output_path}')
159 | 
160 | 
161 | def _prepare_clang8(language: str, clang8_targets_dir: str, lang8_dir: str,
162 |                     output_dir: str, tokenize_text: bool) -> None:
163 |   """Prepares the cLang-8 dataset for a single language."""
164 |   # Load tokenizer.
165 |   if language == 'en':
166 |     model_path = 'en_core_web_sm'
167 |   elif language == 'de':
168 |     model_path = 'de_core_news_sm'
169 |   elif language == 'ru':
170 |     model_path = 'ru_core_news_sm'
171 |   else:
172 |     raise ValueError(f'Unsupported language: {language}')
173 |   disabled_components = ['lemmatizer', 'parser', 'tagger', 'ner']
174 |   nlp = spacy.load(model_path, disable=disabled_components)
175 | 
176 |   clang8_targets_path = os.path.join(clang8_targets_dir,
177 |                                      f'clang8_{language}.detokenized.tsv')
178 |   source_target_pairs = _yield_clang8_source_target_pairs(clang8_targets_path,
179 |                                                           lang8_dir)
180 |   tokenization_label = ''
181 |   if tokenize_text:
182 |     tokenization_label = '.spacy_tokenized'
183 |     source_target_pairs = _tokenize(source_target_pairs, nlp)
184 |   output_path = os.path.join(
185 |       output_dir, f'clang8_source_target_{language}{tokenization_label}.tsv')
186 |   _write_source_target_pairs_to_tsv(source_target_pairs, output_path)
187 | 
188 | 
189 | def main(argv: Sequence[str]) -> None:
190 |   if len(argv) > 1:
191 |     raise app.UsageError('Too many command-line arguments.')
192 | 
193 |   for language in FLAGS.languages:
194 |     print(f'\n{language}')
195 |     _prepare_clang8(language, FLAGS.clang8_dir, FLAGS.lang8_dir,
196 |                     FLAGS.output_dir, FLAGS.tokenize_text)
197 | 
198 | 
199 | if __name__ == '__main__':
200 |   app.run(main)
201 | 
--------------------------------------------------------------------------------
/prepare_clang8_dataset_test.py:
--------------------------------------------------------------------------------
1 | """Tests for prepare_clang8_dataset."""
2 | 
3 | import os
4 | from absl.testing import absltest
5 | 
6 | import prepare_clang8_dataset
7 | 
8 | 
9 | class PrepareClang8DatasetTest(absltest.TestCase):
10 | 
11 |   def test_preparing_fake_clang8(self):
12 |     lang8_dir = self.create_tempdir().full_path
13 |     lang8_path = os.path.join(lang8_dir, 'lang-8-20111007-L1-v2.dat')
14 |     with open(lang8_path, 'w') as f2:
15 |       f2.write("""["123","111","Japanese","English","""
16 |                """["This isn't gramatical.","This is"],[[],[]]]\n""")
17 |       f2.write("""["123","222","Japanese","English",["A best"],[[]]]\n""")
18 | 19 | clang8_targets_dir = self.create_tempdir().full_path 20 | targets_path = os.path.join(clang8_targets_dir, 'clang8_en.detokenized.tsv') 21 | with open(targets_path, 'w') as f: 22 | f.write("123\t111\t0\tFalse\tThis isn't grammatical.\n") 23 | f.write('123\t111\t1\tFalse\tThis is\n') 24 | f.write('123\t222\t0\tTrue\tThe best\n') 25 | 26 | language = 'en' 27 | output_dir = self.create_tempdir().full_path 28 | prepare_clang8_dataset._prepare_clang8( 29 | language, clang8_targets_dir, lang8_dir, output_dir, tokenize_text=True) 30 | output_path = os.path.join( 31 | output_dir, f'clang8_source_target_{language}.spacy_tokenized.tsv') 32 | with open(output_path) as f: 33 | output_lines = f.readlines() 34 | self.assertEqual(output_lines, [ 35 | "This is n't gramatical .\tThis is n't grammatical .\n", 36 | 'This is\tThis is\n', 37 | 'A best\tThe best\n', 38 | ]) 39 | 40 | 41 | if __name__ == '__main__': 42 | absltest.main() 43 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py>=0.12.0 2 | spacy>=3.0.6 3 | tqdm>=4.60.0 4 | -------------------------------------------------------------------------------- /retokenize.py: -------------------------------------------------------------------------------- 1 | """Simple regular expressions to fix tokenization issues for CoNLL. 2 | 3 | Usage: 4 | $ python3 retokenize.py [model_predictions_file] > [retokenized_predictions_file] 5 | """ 6 | import fileinput 7 | import re 8 | 9 | retokenization_rules = [ 10 | # Remove extra space around single quotes, hyphens, and slashes. 11 | (" ' (.*?) ' ", " '\\1' "), 12 | (" - ", "-"), 13 | (" / ", "/"), 14 | # Ensure there are spaces around parentheses and brackets. 15 | (r"([\]\[\(\){}<>])", " \\1 "), 16 | (r"\s+", " "), 17 | ] 18 | 19 | for line in fileinput.input(): 20 | for rule in retokenization_rules: 21 | line = re.sub(rule[0], rule[1], line) 22 | print(line.strip()) 23 | -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | set -x 4 | 5 | # Download the Lang-8 raw corpus from: 6 | # https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k/viewform?edit_requested=true 7 | # and provide the directory here. 8 | readonly LANG8_DIR='' 9 | 10 | echo "Installing required packages..." 11 | virtualenv -p python3 . 12 | source ./bin/activate 13 | 14 | pip install -r requirements.txt 15 | 16 | python -m spacy download en_core_web_sm 17 | python -m spacy download de_core_news_sm 18 | python -m spacy download ru_core_news_sm 19 | 20 | echo "Running a test..." 21 | python -m prepare_clang8_dataset_test 22 | 23 | echo "Generating the cLang-8 dataset for three languages: ru, de, and en" 24 | python -m prepare_clang8_dataset \ 25 | --lang8_dir="${LANG8_DIR}" \ 26 | --tokenize_text='True' \ 27 | --languages='ru,de,en' 28 | -------------------------------------------------------------------------------- /targets/README: -------------------------------------------------------------------------------- 1 | # cLang-8 target sentences are stored to this directory. 
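Each line of the clang8_<language>.detokenized.tsv files has five tab-separated
columns: journal_id, sentence_id, sentence_number (the index of the learner
sentence within the journal entry), has_correction, and the target sentence;
see _read_clang8_targets in prepare_clang8_dataset.py for how these columns are
parsed. An illustrative row, taken from the fake data used in
prepare_clang8_dataset_test.py (not from the actual target files):

123	222	0	True	The best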
-------------------------------------------------------------------------------- /targets/clang8_de.detokenized.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:b4300f5ce9cf8c70de56ab4062176d01c70161e0303a384eafd4800a03e9b70d 3 | size 9349297 4 | -------------------------------------------------------------------------------- /targets/clang8_en.detokenized.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ea8fbdfc6adea720852e595a1609368884dd4cc23101c6cc152aa96bef1d8a1f 3 | size 181351414 4 | -------------------------------------------------------------------------------- /targets/clang8_ru.detokenized.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ca02d4e5a2236fa612168dc3dc76590209396d218b6b5b00dec80fae5c5bd53a 3 | size 5277693 4 | --------------------------------------------------------------------------------