├── .gitattributes
├── README.md
├── output_data
│   └── README
├── prepare_clang8_dataset.py
├── prepare_clang8_dataset_test.py
├── requirements.txt
├── retokenize.py
├── run.sh
└── targets
    ├── README
    ├── clang8_de.detokenized.tsv
    ├── clang8_en.detokenized.tsv
    └── clang8_ru.detokenized.tsv

/.gitattributes:
--------------------------------------------------------------------------------
1 | *.tsv filter=lfs diff=lfs merge=lfs -text
2 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # cLang-8 Dataset
2 | 
3 | cLang-8 (“cleaned Lang-8”) is a dataset for grammatical error correction (GEC).
4 | The source sentences originate from the popular
5 | [NAIST Lang-8 Learner Corpora](https://sites.google.com/site/naistlang8corpora/home),
6 | while the target sentences are generated by our state-of-the-art GEC method
7 | called gT5. The method is described in our [ACL-IJCNLP 2021 paper](https://arxiv.org/abs/2106.03830).
8 | 
9 | The paper shows that fine-tuning a
10 | [T5-11B](https://github.com/google-research/text-to-text-transfer-transformer)
11 | model on cLang-8 yields SOTA performance on GEC for English. cLang-8 thus
12 | simplifies a typical GEC training pipeline consisting of multiple fine-tuning
13 | stages.
14 | 
15 | ## Dataset Preparation
16 | 
17 | cLang-8 is generated by combining the target sentences found under the
18 | `targets/` directory of this repository with the source sentences from the
19 | original Lang-8 corpus, which has to be downloaded separately. Specifically,
20 | you need to complete the following steps:
21 | 
22 | 1. [Install](https://git-lfs.github.com/) Git Large File Storage (if not
23 |    already installed) and clone this repository.
24 | 2. Fill in
25 |    [this form](https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k/viewform?edit_requested=true),
26 |    after which you will receive an email with a link to “the raw format
27 |    containing all the data up to 2010”.
28 | 3. Follow the link to download a zip file and extract it.
29 | 4. Update the `LANG8_DIR` variable in `run.sh` to point to the resulting
30 |    extracted directory.
31 | 5. Run the command `sh run.sh`, which will install the required Python 3
32 |    dependencies in a virtualenv and align the source and the target sentences.
33 | 
34 | NB: Running the above script takes about 1 hour when spaCy tokenization is
35 | enabled. Enabling it is recommended to keep tokenization consistent with the
36 | CoNLL-14 (see also the next section) and BEA eval sets.
37 | 
38 | ## Tokenization Post-Processing for CoNLL-14
39 | 
40 | After training a model and computing predictions on the CoNLL-14 test set for
41 | the paper, we ran some post-processing steps found in `retokenize.py` to fix
42 | tokenization discrepancies. This improves the F0.5 scores by about 2.5 points
43 | (for T5 xxl).
44 | 
45 | You may instead want to try applying the post-processing steps to cLang-8
46 | targets before training a model.
47 | 
48 | ## Data Format
49 | 
50 | The resulting cLang-8 data files will be saved under the `./output_data/`
51 | directory. They will be TSV files with a single tab-separated (source, target)
52 | pair per line.
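For example, the unit test in `prepare_clang8_dataset_test.py` expects
spaCy-tokenized pairs of the following shape (the two columns are separated by
a tab; the pair below comes from the test's fake data, not from the actual
dataset):

```
A best	The best
```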
Three separate TSV files will be generated for the following languages:
53 | 
54 | Language | Number of examples
55 | -------- | ------------------
56 | English | 2,372,119
57 | German | 114,405
58 | Russian | 44,830
59 | 
60 | ## How to Cite cLang-8
61 | 
62 | Please cite the following works if you use cLang-8:
63 | 
64 | ```
65 | @inproceedings{rothe2021a,
66 |   title = {{A Simple Recipe for Multilingual Grammatical Error Correction}},
67 |   author = {Rothe, Sascha and Mallinson, Jonathan and Malmi, Eric and Krause, Sebastian and Severyn, Aliaksei},
68 |   booktitle = {Proc. of ACL-IJCNLP},
69 |   year = {2021}
70 | }
71 | 
72 | @inproceedings{mizumoto2011mining,
73 |   title = {{Mining revision log of language learning SNS for automated Japanese error correction of second language learners}},
74 |   author = {Mizumoto, Tomoya and Komachi, Mamoru and Nagata, Masaaki and Matsumoto, Yuji},
75 |   booktitle = {Proc. of 5th International Joint Conference on Natural Language Processing},
76 |   pages = {147--155},
77 |   year = {2011}
78 | }
79 | ```
80 | 
81 | ## License
82 | 
83 | Like the original Lang-8 corpus, cLang-8 is distributed for research and
84 | educational purposes only. Specifically, cLang-8 is released under the
85 | [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
86 | 
87 | The code is distributed under the Apache 2.0 license.
88 | 
89 | ## Contact Us
90 | 
91 | If you have a technical question regarding the dataset, code, or publication,
92 | please create an issue in this repository.
93 | 
--------------------------------------------------------------------------------
/output_data/README:
--------------------------------------------------------------------------------
1 | The prepared data files will be stored here (by default) once you run `run.sh`.
--------------------------------------------------------------------------------
/prepare_clang8_dataset.py:
--------------------------------------------------------------------------------
1 | """Main file to combine cLang-8 targets with the original Lang-8 sources.
2 | 
3 | Before running this, download the Lang-8 raw corpus from:
4 | 
5 | https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k/viewform?edit_requested=true
6 | 
7 | and provide the download directory path via the `lang8_dir` flag.
8 | """
9 | 
10 | import collections
11 | import json
12 | import os
13 | 
14 | from typing import Iterable, Iterator, List, Mapping, Sequence, Tuple
15 | 
16 | from absl import app
17 | from absl import flags
18 | 
19 | import spacy
20 | import tqdm
21 | 
22 | FLAGS = flags.FLAGS
23 | 
24 | flags.DEFINE_string(
25 |     'lang8_dir', '',
26 |     'Path to the directory containing the Lang-8 raw corpus, specifically the '
27 |     'following version of it: lang-8-20111007-L1-v2.dat')
28 | flags.DEFINE_string(
29 |     'clang8_dir', './targets',
30 |     'Path to the directory containing the cLang-8 files downloaded from '
31 |     'GitHub.')
32 | flags.DEFINE_string(
33 |     'output_dir', './output_data',
34 |     'Path to the directory where the output files are written.')
35 | flags.DEFINE_bool(
36 |     'tokenize_text', True,
37 |     'Whether to tokenize sources and targets using spaCy.')
38 | flags.DEFINE_list(
39 |     'languages', 'ru,en,de',
40 |     'Comma-separated list of languages for which to generate cLang-8.')
41 | 
42 | 
43 | def _yield_lang8_raw_dicts(lang8_raw_dir: str):
44 |   """Yields JSON rows from the Lang-8 raw corpus.
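
  Each row is a JSON list. This script relies only on row[0] (the journal id),
  row[1] (the sentence id), and row[4] (the list of learner sentences); the
  fake rows written in prepare_clang8_dataset_test.py illustrate the shape.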
45 | 
46 |   Format of the rows is documented at:
47 |   https://sites.google.com/site/naistlang8corpora/home/readme-raw
48 | 
49 |   Args:
50 |     lang8_raw_dir: Directory containing the Lang-8 raw corpus, specifically the
51 |       following version of it: lang-8-20111007-L1-v2.dat
52 |   """
53 |   path = os.path.join(lang8_raw_dir, 'lang-8-20111007-L1-v2.dat')
54 |   num_rows = 0
55 |   with open(path) as f:
56 |     for line in f:
57 |       try:
58 |         row = json.loads(line)
59 |         yield row
60 |         num_rows += 1
61 |       except json.decoder.JSONDecodeError:
62 |         pass
63 |   print(f'{num_rows} Lang-8 raw documents read.')
64 | 
65 | 
66 | def _read_clang8_targets(
67 |     path: str) -> Tuple[Mapping[Tuple[str, str], List[Tuple[str, str]]], int]:
68 |   """Reads cLang-8 targets generated by gT5.
69 | 
70 |   Args:
71 |     path: Path to a language-specific cLang-8 targets file.
72 | 
73 |   Returns:
74 |     A tuple of (1) a mapping from Lang-8 raw (journal_id, sentence_id) pairs to
75 |     lists of (sentence_number, target) tuples, where sentence_number is the
76 |     index of the learner sentence, and (2) the total number of targets read.
77 |   """
78 |   ids_2_targets = collections.defaultdict(list)
79 |   num_targets = 0
80 |   with open(path) as f:
81 |     for line in f:
82 |       (journal_id, sentence_id, sentence_number, has_correction,
83 |        target) = line.rstrip('\n').split('\t')
84 |       del has_correction
85 |       ids_2_targets[(journal_id, sentence_id)].append((sentence_number, target))
86 |       num_targets += 1
87 |   print(f'{num_targets} cLang-8 targets read.')
88 |   return ids_2_targets, num_targets
89 | 
90 | 
91 | def _yield_clang8_source_target_pairs(
92 |     clang8_path: str, lang8_raw_dir: str) -> Iterator[Tuple[str, str]]:
93 |   """Yields cLang-8 source-target pairs.
94 | 
95 |   The pairs are obtained by combining the cLang-8 target file and the original
96 |   Lang-8 raw corpus.
97 | 
98 |   Args:
99 |     clang8_path: Path to a language-specific cLang-8 targets file.
100 |     lang8_raw_dir: Directory containing the Lang-8 raw corpus, specifically the
101 |       following version of it: lang-8-20111007-L1-v2.dat
102 |   """
103 |   ids_2_targets, num_targets = _read_clang8_targets(clang8_path)
104 |   num_combined = 0
105 |   with tqdm.tqdm(total=num_targets) as progress_bar:
106 |     for row in _yield_lang8_raw_dicts(lang8_raw_dir):
107 |       lang8_raw_ids = (row[0], row[1])
108 |       for sentence_number, target in ids_2_targets.get(lang8_raw_ids, []):
109 |         source = row[4][int(sentence_number)]
110 |         yield source, target
111 |         num_combined += 1
112 |         progress_bar.update(1)
113 |   print(f'{num_combined} sources mapped to cLang-8 targets.')
114 | 
115 | 
116 | def _tokenize(pairs: Iterable[Tuple[str, str]],
117 |               nlp: spacy.Language,
118 |               batch_size: int = 1000) -> Iterator[Tuple[str, str]]:
119 |   """Yields the input source-target pairs after tokenizing them.
120 | 
121 |   NB: This function loads all source-target pairs into memory at once.
122 | 
123 |   Args:
124 |     pairs: Untokenized (source, target) pairs.
125 |     nlp: SpaCy pipeline.
126 |     batch_size: Batch size used with `nlp.pipe`.
127 | 
128 |   Yields:
129 |     (tokenized source, tokenized target) pairs.
130 |   """
131 |   # Convert iterator to list to be able to separate sources and targets so that
132 |   # we can use `nlp.pipe` with batching for increased throughput.
133 |   pairs = list(pairs)
134 |   print('Tokenizing...')
135 |   source_docs = nlp.pipe([pair[0] for pair in pairs], batch_size=batch_size)
136 |   target_docs = nlp.pipe([pair[1] for pair in pairs], batch_size=batch_size)
137 |   with tqdm.tqdm(total=len(pairs)) as progress_bar:
138 |     for source, target in zip(source_docs, target_docs):
139 |       source_tokenized = ' '.join([token.text for token in source])
140 |       target_tokenized = ' '.join([token.text for token in target])
141 |       yield source_tokenized, target_tokenized
142 |       progress_bar.update(1)
143 | 
144 | 
145 | def _clean_spaces(text: str) -> str:
146 |   """Removes tabs and newlines for saving as TSV."""
147 |   return text.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
148 | 
149 | 
150 | def _write_source_target_pairs_to_tsv(pairs: Iterable[Tuple[str, str]],
151 |                                       output_path: str) -> None:
152 |   """Saves source-target pairs separated with a tab to a file."""
153 |   with open(output_path, 'w') as f:
154 |     for source, target in pairs:
155 |       source = _clean_spaces(source)
156 |       target = _clean_spaces(target)
157 |       f.write(f'{source}\t{target}\n')
158 |   print(f'Wrote the source-target pairs to:\n{output_path}')
159 | 
160 | 
161 | def _prepare_clang8(language: str, clang8_targets_dir: str, lang8_dir: str,
162 |                     output_dir: str, tokenize_text: bool) -> None:
163 |   """Prepares the cLang-8 dataset for a single language."""
164 |   # Load tokenizer.
165 |   if language == 'en':
166 |     model_path = 'en_core_web_sm'
167 |   elif language == 'de':
168 |     model_path = 'de_core_news_sm'
169 |   elif language == 'ru':
170 |     model_path = 'ru_core_news_sm'
171 |   else:
172 |     raise ValueError(f'Unsupported language: {language}')
173 |   disabled_components = ['lemmatizer', 'parser', 'tagger', 'ner']
174 |   nlp = spacy.load(model_path, disable=disabled_components)
175 | 
176 |   clang8_targets_path = os.path.join(clang8_targets_dir,
177 |                                      f'clang8_{language}.detokenized.tsv')
178 |   source_target_pairs = _yield_clang8_source_target_pairs(clang8_targets_path,
179 |                                                           lang8_dir)
180 |   tokenization_label = ''
181 |   if tokenize_text:
182 |     tokenization_label = '.spacy_tokenized'
183 |     source_target_pairs = _tokenize(source_target_pairs, nlp)
184 |   output_path = os.path.join(
185 |       output_dir, f'clang8_source_target_{language}{tokenization_label}.tsv')
186 |   _write_source_target_pairs_to_tsv(source_target_pairs, output_path)
187 | 
188 | 
189 | def main(argv: Sequence[str]) -> None:
190 |   if len(argv) > 1:
191 |     raise app.UsageError('Too many command-line arguments.')
192 | 
193 |   for language in FLAGS.languages:
194 |     print(f'\n{language}')
195 |     _prepare_clang8(language, FLAGS.clang8_dir, FLAGS.lang8_dir,
196 |                     FLAGS.output_dir, FLAGS.tokenize_text)
197 | 
198 | 
199 | if __name__ == '__main__':
200 |   app.run(main)
201 | 
--------------------------------------------------------------------------------
/prepare_clang8_dataset_test.py:
--------------------------------------------------------------------------------
1 | """Tests for prepare_clang8_dataset."""
2 | 
3 | import os
4 | from absl.testing import absltest
5 | 
6 | import prepare_clang8_dataset
7 | 
8 | 
9 | class PrepareClang8DatasetTest(absltest.TestCase):
10 | 
11 |   def test_preparing_fake_clang8(self):
12 |     lang8_dir = self.create_tempdir().full_path
13 |     lang8_path = os.path.join(lang8_dir, 'lang-8-20111007-L1-v2.dat')
14 |     with open(lang8_path, 'w') as f2:
15 |       f2.write("""["123","111","Japanese","English","""
16 |                """["This isn't gramatical.","This is"],[[],[]]]\n""")
17 |       f2.write("""["123","222","Japanese","English",["A best"],[[]]]\n""")
18 | 19 | clang8_targets_dir = self.create_tempdir().full_path 20 | targets_path = os.path.join(clang8_targets_dir, 'clang8_en.detokenized.tsv') 21 | with open(targets_path, 'w') as f: 22 | f.write("123\t111\t0\tFalse\tThis isn't grammatical.\n") 23 | f.write('123\t111\t1\tFalse\tThis is\n') 24 | f.write('123\t222\t0\tTrue\tThe best\n') 25 | 26 | language = 'en' 27 | output_dir = self.create_tempdir().full_path 28 | prepare_clang8_dataset._prepare_clang8( 29 | language, clang8_targets_dir, lang8_dir, output_dir, tokenize_text=True) 30 | output_path = os.path.join( 31 | output_dir, f'clang8_source_target_{language}.spacy_tokenized.tsv') 32 | with open(output_path) as f: 33 | output_lines = f.readlines() 34 | self.assertEqual(output_lines, [ 35 | "This is n't gramatical .\tThis is n't grammatical .\n", 36 | 'This is\tThis is\n', 37 | 'A best\tThe best\n', 38 | ]) 39 | 40 | 41 | if __name__ == '__main__': 42 | absltest.main() 43 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py>=0.12.0 2 | spacy>=3.0.6 3 | tqdm>=4.60.0 4 | -------------------------------------------------------------------------------- /retokenize.py: -------------------------------------------------------------------------------- 1 | """Simple regular expressions to fix tokenization issues for CoNLL. 2 | 3 | Usage: 4 | $ python3 retokenize.py [model_predictions_file] > [retokenized_predictions_file] 5 | """ 6 | import fileinput 7 | import re 8 | 9 | retokenization_rules = [ 10 | # Remove extra space around single quotes, hyphens, and slashes. 11 | (" ' (.*?) ' ", " '\\1' "), 12 | (" - ", "-"), 13 | (" / ", "/"), 14 | # Ensure there are spaces around parentheses and brackets. 15 | (r"([\]\[\(\){}<>])", " \\1 "), 16 | (r"\s+", " "), 17 | ] 18 | 19 | for line in fileinput.input(): 20 | for rule in retokenization_rules: 21 | line = re.sub(rule[0], rule[1], line) 22 | print(line.strip()) 23 | -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | set -x 4 | 5 | # Download the Lang-8 raw corpus from: 6 | # https://docs.google.com/forms/d/17gZZsC_rnaACMXmPiab3kjqBEtRHPMz0UG9Dk-x_F0k/viewform?edit_requested=true 7 | # and provide the directory here. 8 | readonly LANG8_DIR='' 9 | 10 | echo "Installing required packages..." 11 | virtualenv -p python3 . 12 | source ./bin/activate 13 | 14 | pip install -r requirements.txt 15 | 16 | python -m spacy download en_core_web_sm 17 | python -m spacy download de_core_news_sm 18 | python -m spacy download ru_core_news_sm 19 | 20 | echo "Running a test..." 21 | python -m prepare_clang8_dataset_test 22 | 23 | echo "Generating the cLang-8 dataset for three languages: ru, de, and en" 24 | python -m prepare_clang8_dataset \ 25 | --lang8_dir="${LANG8_DIR}" \ 26 | --tokenize_text='True' \ 27 | --languages='ru,de,en' 28 | -------------------------------------------------------------------------------- /targets/README: -------------------------------------------------------------------------------- 1 | # cLang-8 target sentences are stored to this directory. 
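Each line of the clang8_<language>.detokenized.tsv files has five tab-separated
columns: journal_id, sentence_id, sentence_number (the index of the learner
sentence within the journal entry), has_correction, and the target sentence;
see _read_clang8_targets in prepare_clang8_dataset.py for how these columns are
parsed. An illustrative row, taken from the fake data used in
prepare_clang8_dataset_test.py (not from the actual target files):

123	222	0	True	The best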
-------------------------------------------------------------------------------- /targets/clang8_de.detokenized.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:b4300f5ce9cf8c70de56ab4062176d01c70161e0303a384eafd4800a03e9b70d 3 | size 9349297 4 | -------------------------------------------------------------------------------- /targets/clang8_en.detokenized.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ea8fbdfc6adea720852e595a1609368884dd4cc23101c6cc152aa96bef1d8a1f 3 | size 181351414 4 | -------------------------------------------------------------------------------- /targets/clang8_ru.detokenized.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ca02d4e5a2236fa612168dc3dc76590209396d218b6b5b00dec80fae5c5bd53a 3 | size 5277693 4 | --------------------------------------------------------------------------------