├── .gitignore
├── README.rst
├── dialog2017
│   ├── __init__.py
│   ├── conll.py
│   ├── crf_baseline.py
│   ├── evaluate.py
│   ├── pymorphy2_baseline.py
│   ├── to_conll.py
│   ├── to_json.py
│   └── utils.py
└── requirements.txt

/.gitignore:
--------------------------------------------------------------------------------
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Sphinx documentation
docs/_build/

# Jupyter Notebook
.ipynb_checkpoints

# IDEs
.idea/

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
Morphological disambiguation for the Russian language.
See https://github.com/dialogue-evaluation/morphoRuEval-2017.

Requirements
============

The scripts require Python 3.4+.
Install the dependencies from ``requirements.txt`` using pip.


Data preparation
================

To unify the training data, unpack it, then use ``python -m dialog2017.to_json``
to convert it to JSON. For the OpenCorpora export the script should be called
with the ``--opencorpora`` flag::

    python -m dialog2017.to_json ../morphoRuEval-2017/train/RNCgoldInUD_Morpho.conll data/rnc.json
    python -m dialog2017.to_json ../morphoRuEval-2017/train/gikrya_fixed.txt data/gikrya.json
    python -m dialog2017.to_json ../morphoRuEval-2017/train/syntagrus_full_fixed.ud data/syntagrus.json
    python -m dialog2017.to_json ../morphoRuEval-2017/train/unamb_sent_14_6.conllu data/opencorpora.json --opencorpora

To convert JSON data back to the CONLL format use the ``dialog2017.to_conll``
script, e.g.::

    python -m dialog2017.to_conll data/gikrya.json data/gikrya.txt
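The resulting JSON file is a list of sentences; each sentence is a list of
``[word, lemma, pos, tags]`` rows, with ``tags`` stored as a dict
(see ``to_token`` in ``dialog2017/conll.py``). An illustrative fragment;
the formatting and tag values below are simplified, not taken from a real
corpus::

    [
      [
        ["Мама", "мама", "NOUN", {"Case": "Nom", "Gender": "Fem", "Number": "Sing"}],
        ["мыла", "мыть", "VERB", {"Gender": "Fem", "Number": "Sing", "Tense": "Past"}],
        ["раму", "рама", "NOUN", {"Case": "Acc", "Gender": "Fem", "Number": "Sing"}],
        [".", ".", "PUNCT", {}]
      ]
    ]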
Evaluation
==========

Alternative evaluation script::

    python -m dialog2017.evaluate corpus_gold.txt corpus_pred.txt

It works with both CONLL and JSON corpora and can print the errors it finds.
The metrics are the same as in the official evaluation script.
Lemmatization quality measurement is not implemented yet.

pymorphy2 baseline
==================

The ``python -m dialog2017.pymorphy2_baseline`` script takes the first
pymorphy2 prediction for each token and converts the resulting tag to the
Dialog 2017 format.

::

    $ python -m dialog2017.pymorphy2_baseline ../morphoRuEval-2017/Baseline/source/gikrya_test.txt ./data/gikrya-pred-test.txt
    reading...
    parsing...
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 20787/20787 [00:05<00:00, 4022.44it/s]
    saving...
    evaluating...
    130443 out of 171550 (skipped: 98714); accuracy: 76.04%

    $ python -m dialog2017.pymorphy2_baseline ../morphoRuEval-2017/Baseline/source/syntagrus_test.txt ./data/syntagrus-pred-test.txt
    reading...
    parsing...
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 12529/12529 [00:05<00:00, 2475.93it/s]
    saving...
    evaluating...
    109688 out of 146817 (skipped: 85066); accuracy: 74.71%

    $ python -m dialog2017.pymorphy2_baseline ./data/opencorpora.txt ./data/opencorpora-pred.txt
    reading...
    parsing...
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 38508/38508 [00:09<00:00, 4184.98it/s]
    saving...
    evaluating...
    224714 out of 270063 (skipped: 187520); accuracy: 83.21%

    $ python -m dialog2017.pymorphy2_baseline ./data/rnc.txt ./data/rnc-pred.txt
    reading...
    parsing...
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 98892/98892 [00:15<00:00, 6233.93it/s]
    saving...
    evaluating...
    544275 out of 797823 (skipped: 547467); accuracy: 68.22%


CRF baseline
============

This is an "obvious" CRF-based baseline: the features are grammemes extracted
using pymorphy2, the words themselves, and the grammemes and surface forms of
nearby words; the output labels are the full tags as-is (so there are ~300
output labels).

::

    python -m dialog2017.crf_baseline \
        ../morphoRuEval-2017/Baseline/source/gikrya_train.txt \
        ../morphoRuEval-2017/Baseline/source/gikrya_test.txt \
        ./data/gikrya-pred-test-crf.txt \
        model.joblib
    <..snip..>
    evaluating...
    evaluated 171550 tokens out of 270264 (63.47%)
    full tags: 162213 correct; accuracy=94.56%
    POS: 169297 correct; accuracy=98.69%
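Each CRF output label is the POS followed by the serialized tag string (see
``join_tag`` in ``dialog2017/crf_baseline.py``, which sorts the tag keys
alphabetically), so a label looks like, for example::

    NOUN Case=Nom|Gender=Fem|Number=Sing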
--------------------------------------------------------------------------------
/dialog2017/__init__.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
--------------------------------------------------------------------------------
/dialog2017/conll.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Module with utilities for reading dialog2017 annotated corpora.

Parse a corpus::

    sents = read_sents('corpus.txt')

"""
import codecs

from .utils import read_json, read_lines


def read_sents(path, opencorpora=False):
    if path.endswith(".json"):
        return read_json(path)
    else:
        return read_sents_conll(path, opencorpora=opencorpora)


def read_sents_conll(path, opencorpora=False):
    corpus_lines = read_lines(path)
    sents_iter = iter_sentences(corpus_lines, opencorpora)
    return list(sents_iter)


def is_token_line(line):
    if not line.strip():
        return False
    if line == "==newfile==" or line.startswith("==>"):
        # see https://github.com/dialogue-evaluation/morphoRuEval-2017/issues/5
        return False
    return True


def parse_tag(tag):
    tags_list = tag.split("|") if tag != "_" else []
    return dict(t.split("=") for t in tags_list)


def to_token(line, opencorpora=False):
    parts = line.split('\t')[1:]
    if opencorpora:
        # see https://github.com/dialogue-evaluation/morphoRuEval-2017/issues/5
        parts = parts[:3] + parts[4:5]
    parts[3] = parse_tag(parts[3])  # tag
    # parts[4] = parse_tag(parts[4])  # extra tags
    return parts[:4]  # [word, lemma, pos, tags]


def iter_sentences(corpus, opencorpora=False):
    sent = []
    for line in corpus:
        if not is_token_line(line):
            if sent:
                yield sent
                sent = []
        else:
            try:
                sent.append(to_token(line, opencorpora))
            except Exception:
                print(line)
                raise
    if sent:
        yield sent


def tag2conll(parts):
    if not parts:
        parts = "_"
    elif isinstance(parts, (list, tuple)):
        parts = "|".join(parts)
    elif isinstance(parts, dict):
        parts = "|".join("%s=%s" % (k, v) for (k, v) in sorted(parts.items()))
    return parts


def conll_line(idx, word, lemma, pos, tags):
    return "\t".join([str(idx), word, lemma, pos, tag2conll(tags)])


def write_sents_to_file(sents, fp):
    """ Write sentences to a file ``fp`` in CONLL format """
    for sent in sents:
        for idx, (word, lemma, pos, tags) in enumerate(sent, start=1):
            line = conll_line(idx, word, lemma, pos, tags)
            fp.write(line + "\n")
        fp.write("\n")


def write_sents(sents, path):
    with codecs.open(path, 'w', encoding='utf8') as f:
        write_sents_to_file(sents, f)
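A minimal round-trip sketch for this module; the corpus path is a placeholder
for any file produced by the data preparation step::

    from dialog2017 import conll

    # each row is [word, lemma, pos, tags], with tags as a dict
    sents = conll.read_sents("data/gikrya.txt")
    word, lemma, pos, tags = sents[0][0]

    # write the sentences back out in CONLL format
    conll.write_sents(sents, "data/gikrya-roundtrip.txt")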
--------------------------------------------------------------------------------
/dialog2017/crf_baseline.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
from itertools import chain
import argparse

import joblib
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from pymorphy2 import MorphAnalyzer
from pymorphy2.cache import memoized_with_single_argument
from morphine import features
from morphine.feature_extractor import FeatureExtractor
from sklearn_crfsuite import CRF

from dialog2017 import conll, evaluate

_cache = {}

morph = MorphAnalyzer()


class TaggerFeatureExtractor(FeatureExtractor):
    IGNORE = set()

    def __init__(self):
        super(TaggerFeatureExtractor, self).__init__(
            token_features=[
                features.bias,
                features.token_lower,
                features.Grammeme(threshold=0.01, add_unambig=True,
                                  ignore=self.IGNORE),
                # features.GrammemePair(threshold=0.0, add_unambig=True, ignore=self.IGNORE),
            ],
            global_features=[
                features.sentence_start,
                features.sentence_end,

                features.Pattern([-1, 'token_lower']),
                features.Pattern([-2, 'token_lower']),

                features.Pattern([-1, 'Grammeme']),
                features.Pattern([+1, 'Grammeme']),

                # features.Pattern([-2, 'Grammeme'], [-1, 'Grammeme']),
                # features.Pattern([-1, 'Grammeme'], [0, 'Grammeme']),
                # features.Pattern([-1, 'Grammeme'], [0, 'GrammemePair']),
                #
                # features.Pattern([-1, 'GrammemePair']),
                # features.Pattern([+1, 'GrammemePair']),
            ],
        )


# def flatten_result(X):
#     return [ItemSequence(xseq).items() for xseq in X]


@memoized_with_single_argument(_cache)
def parse_token(tok):
    return morph.parse(tok)


def parse_sent(morph, sent):
    tokens = [r[0] for r in sent]
    parsed_tokens = [parse_token(tok) for tok in tokens]
    return tokens, parsed_tokens


def parse_corpus(morph, sents):
    return [parse_sent(morph, sent) for sent in tqdm(sents)]


def flatten(s):
    return list(chain.from_iterable(s))


def join_tag(pos, tags):
    return "%s %s" % (pos, conll.tag2conll(tags))


def get_y(corpus):
    return [[join_tag(r[2], r[3]) for r in sent] for sent in corpus]


def parse_tag_str(tag_str):
    pos, tags = tag_str.split()
    return pos, conll.parse_tag(tags)


def load_corpus(path, take_first=0):
    corpus = conll.read_sents(path)
    if take_first:
        corpus = corpus[:take_first]
    X_raw = parse_corpus(morph, corpus)
    y = get_y(corpus)
    return corpus, X_raw, y


def y_pred_to_sents_pred(sents_gold, y_pred):
    return [
        [
            [r[0], r[1], *parse_tag_str(tag)]
            for (r, tag) in zip(s, yseq)
        ]
        for (s, yseq) in zip(sents_gold, y_pred)
    ]


def main(path_train, path_test, path_pred, path_crf, take_first, dev_size):
    print("loading train corpus..")
    _, X_raw, y = load_corpus(path_train, take_first=take_first)
    print("extracting features from train corpus..")
    fe = TaggerFeatureExtractor()
    X = fe.fit_transform(tqdm(X_raw))
    print("training..")
    crf = CRF(algorithm='ap', verbose=True, max_iterations=10)
    if dev_size:
        X, X_dev, y, y_dev = train_test_split(X, y, test_size=dev_size)
    else:
        X_dev, y_dev = None, None
    crf.fit(X, y, X_dev, y_dev)

    print("saving..")
    joblib.dump({'fe': fe, 'crf': crf}, path_crf, compress=2)

    print("loading test corpus..")
    corpus, X_test_raw, y_test = load_corpus(path_test)
    print("extracting features from test corpus..")
    X_test = fe.transform(X_test_raw)
    print("predicting..")
    y_pred = crf.predict(tqdm(X_test))

    print("saving results..")
    sents_pred = y_pred_to_sents_pred(corpus, y_pred)
    conll.write_sents(sents_pred, path_pred)
model.joblib") 146 | p.add_argument("--take-first", type=int, default=0, 147 | help="use only first N sentences. 0 => 'use all sentences'.") 148 | p.add_argument("--dev-size", type=float, default=0.0) 149 | p.add_argument("--no-eval", type=bool, default=False) 150 | args = p.parse_args() 151 | main( 152 | path_train=args.path_train, 153 | path_test=args.path_test, 154 | path_pred=args.path_pred, 155 | path_crf=args.path_crf, 156 | take_first=args.take_first, 157 | dev_size=args.dev_size 158 | ) 159 | 160 | if not args.no_eval: 161 | print("evaluating...") 162 | evaluate.main(args.path_test, args.path_pred) 163 | -------------------------------------------------------------------------------- /dialog2017/evaluate.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ Evaluation utilities """ 3 | import argparse 4 | from typing import Dict, List 5 | 6 | from . import conll 7 | 8 | 9 | POS_TO_MEASURE = ["NOUN", "ADV", "PRON", "DET", "ADJ", "VERB", "NUM"] 10 | DOUBT_ADVERBS = {"как", "когда", "пока", "так", "где"} 11 | CATS_TO_MEASURE = { 12 | "NOUN": {"Gender", "Number", "Case"}, 13 | "ADJ": {"Gender", "Number", "Case", "Variant", "Degree"}, 14 | "PRON": {"Gender", "Number", "Case"}, 15 | "DET": {"Gender", "Number", "Case"}, 16 | "VERB": {"Gender", "Number", "VerbForm", "Mood", "Tense"}, 17 | "ADV": {"Degree"}, 18 | "NUM": {"Gender", "Number", "Case", "NumForm"}, 19 | } 20 | 21 | 22 | def simplify_pos(pos: str) -> str: 23 | if pos == 'PROPN': 24 | pos = 'NOUN' 25 | return pos 26 | 27 | 28 | def simplify_tags( 29 | pos: str, 30 | tags: Dict[str, str], 31 | ) -> Dict[str, str]: 32 | tags = tags.copy() 33 | cats_to_measure = CATS_TO_MEASURE.get(pos, set()) 34 | for g in list(tags.keys()): 35 | if g not in cats_to_measure: 36 | del tags[g] 37 | 38 | if pos == 'VERB' and tags.get('Tense') in {'Pres', 'Fut'}: 39 | tags['Tense'] = 'Notpast' 40 | 41 | if tags.get('Variant') == 'Short': 42 | tags['Variant'] = 'Brev' 43 | 44 | return tags 45 | 46 | 47 | def should_match_parses( 48 | word: str, 49 | pos_gold: str, 50 | ) -> bool: 51 | pos = simplify_pos(pos_gold) 52 | if pos not in POS_TO_MEASURE: 53 | return False 54 | if pos == 'ADV' and word.lower() in DOUBT_ADVERBS: 55 | return False 56 | return True 57 | 58 | 59 | def parses_pos_match( 60 | word: str, 61 | pos_gold: str, 62 | pos_pred: str, 63 | verbose: bool, 64 | ) -> bool: 65 | if not should_match_parses(word, pos_gold): 66 | return True 67 | pos_gold = simplify_pos(pos_gold) 68 | pos_pred = simplify_pos(pos_pred) 69 | 70 | if pos_gold != pos_pred: 71 | if verbose: 72 | print("%s: %s != %s" % (word, pos_gold, pos_pred)) 73 | return False 74 | 75 | return True 76 | 77 | 78 | def parses_full_match( 79 | word: str, 80 | pos_gold: str, 81 | tags_gold: Dict[str, str], 82 | pos_pred: str, 83 | tags_pred: Dict[str, str], 84 | verbose: bool, 85 | ) -> bool: 86 | if not parses_pos_match(word, pos_gold, pos_pred, verbose): 87 | return False 88 | 89 | pos_gold = simplify_pos(pos_gold) 90 | pos_pred = simplify_pos(pos_pred) 91 | tags_gold = simplify_tags(pos_gold, tags_gold) 92 | tags_pred = simplify_tags(pos_pred, tags_pred) 93 | 94 | if tags_gold == tags_pred: 95 | return True 96 | 97 | if set(tags_gold.values()) <= set(tags_pred.values()): 98 | # this is how official script works - extra tags are allowed 99 | # in prediction result 100 | return True 101 | 102 | if verbose: 103 | print("%s (%s): %s != %s, diff=%r" % ( 104 | word, pos_gold, tags_gold, tags_pred, 105 | set(tags_gold.items()) 
--------------------------------------------------------------------------------
/dialog2017/evaluate.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
""" Evaluation utilities """
import argparse
from typing import Dict, List

from . import conll


POS_TO_MEASURE = ["NOUN", "ADV", "PRON", "DET", "ADJ", "VERB", "NUM"]
DOUBT_ADVERBS = {"как", "когда", "пока", "так", "где"}
CATS_TO_MEASURE = {
    "NOUN": {"Gender", "Number", "Case"},
    "ADJ": {"Gender", "Number", "Case", "Variant", "Degree"},
    "PRON": {"Gender", "Number", "Case"},
    "DET": {"Gender", "Number", "Case"},
    "VERB": {"Gender", "Number", "VerbForm", "Mood", "Tense"},
    "ADV": {"Degree"},
    "NUM": {"Gender", "Number", "Case", "NumForm"},
}


def simplify_pos(pos: str) -> str:
    if pos == 'PROPN':
        pos = 'NOUN'
    return pos


def simplify_tags(
        pos: str,
        tags: Dict[str, str],
) -> Dict[str, str]:
    tags = tags.copy()
    cats_to_measure = CATS_TO_MEASURE.get(pos, set())
    for g in list(tags.keys()):
        if g not in cats_to_measure:
            del tags[g]

    if pos == 'VERB' and tags.get('Tense') in {'Pres', 'Fut'}:
        tags['Tense'] = 'Notpast'

    if tags.get('Variant') == 'Short':
        tags['Variant'] = 'Brev'

    return tags


def should_match_parses(
        word: str,
        pos_gold: str,
) -> bool:
    pos = simplify_pos(pos_gold)
    if pos not in POS_TO_MEASURE:
        return False
    if pos == 'ADV' and word.lower() in DOUBT_ADVERBS:
        return False
    return True


def parses_pos_match(
        word: str,
        pos_gold: str,
        pos_pred: str,
        verbose: bool,
) -> bool:
    if not should_match_parses(word, pos_gold):
        return True
    pos_gold = simplify_pos(pos_gold)
    pos_pred = simplify_pos(pos_pred)

    if pos_gold != pos_pred:
        if verbose:
            print("%s: %s != %s" % (word, pos_gold, pos_pred))
        return False

    return True


def parses_full_match(
        word: str,
        pos_gold: str,
        tags_gold: Dict[str, str],
        pos_pred: str,
        tags_pred: Dict[str, str],
        verbose: bool,
) -> bool:
    if not parses_pos_match(word, pos_gold, pos_pred, verbose):
        return False

    pos_gold = simplify_pos(pos_gold)
    pos_pred = simplify_pos(pos_pred)
    tags_gold = simplify_tags(pos_gold, tags_gold)
    tags_pred = simplify_tags(pos_pred, tags_pred)

    if tags_gold == tags_pred:
        return True

    if set(tags_gold.values()) <= set(tags_pred.values()):
        # this is how the official script works: extra tags are allowed
        # in the prediction result
        return True

    if verbose:
        print("%s (%s): %s != %s, diff=%r" % (
            word, pos_gold, tags_gold, tags_pred,
            set(tags_gold.items()) ^ set(tags_pred.items())
        ))
    return False


def rows_pos_match(
        row_gold: List,
        row_pred: List,
        verbose: bool = False
) -> bool:
    assert row_gold[0] == row_pred[0]
    return parses_pos_match(
        word=row_gold[0],
        pos_gold=row_gold[2],
        pos_pred=row_pred[2],
        verbose=verbose,
    )


def rows_full_match(
        row_gold: List,
        row_pred: List,
        verbose: bool = False
) -> bool:
    assert row_gold[0] == row_pred[0]
    return parses_full_match(
        word=row_gold[0],
        pos_gold=row_gold[2],
        tags_gold=row_gold[3],
        pos_pred=row_pred[2],
        tags_pred=row_pred[3],
        verbose=verbose,
    )


def should_match_rows(row_gold: List, row_pred: List) -> bool:
    assert row_gold[0] == row_pred[0]
    return should_match_parses(
        word=row_gold[0],
        pos_gold=row_gold[2],
    )


def measure_sents(sents_gold: List[List], sents_pred: List[List],
                  verbose_max_errors=0):
    measured, total, correct_full, correct_pos = 0, 0, 0, 0

    for sent_gold, sent_pred in zip(sents_gold, sents_pred):
        for row_gold, row_pred in zip(sent_gold, sent_pred):
            total += 1
            if should_match_rows(row_gold, row_pred):
                measured += 1
                verbose = (measured - correct_full) <= verbose_max_errors
                if rows_pos_match(row_gold, row_pred, verbose=verbose):
                    correct_pos += 1
                if rows_full_match(row_gold, row_pred, verbose=verbose):
                    correct_full += 1

    return measured, total, correct_full, correct_pos


def measure_conll(path_gold: str, path_pred: str, verbose_max_errors=0):
    sents_gold = conll.read_sents(path_gold)
    sents_pred = conll.read_sents(path_pred)
    assert len(sents_gold) == len(sents_pred)
    return measure_sents(sents_gold, sents_pred, verbose_max_errors)


def main(path_gold, path_pred, n_errors=0):
    measured, total, correct_full, correct_pos = measure_conll(
        path_gold,
        path_pred,
        verbose_max_errors=n_errors
    )
    print("evaluated {} tokens out of {} ({:.2%})".format(measured, total,
                                                          measured / total))
    print("full tags: {} correct; accuracy={:.2%}".format(
        correct_full, correct_full / measured
    ))
    print("POS: {} correct; accuracy={:.2%}".format(
        correct_pos, correct_pos / measured
    ))


if __name__ == '__main__':
    p = argparse.ArgumentParser()
    p.add_argument("path_gold", help="path to a file in conll or json format")
    p.add_argument("path_pred", help="path to a file in conll or json format")
    p.add_argument("--n-errors", default=0, type=int,
                   help="print first N errors")
    args = p.parse_args()
    main(args.path_gold, args.path_pred, args.n_errors)
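A worked example of the matching rules above: ``Pres`` is normalized to
``Notpast``, and categories outside ``CATS_TO_MEASURE`` (here ``Person``) are
dropped before comparison, so the following returns ``True``::

    from dialog2017.evaluate import parses_full_match

    parses_full_match(
        word="идёт", pos_gold="VERB",
        tags_gold={"Mood": "Ind", "Number": "Sing",
                   "Tense": "Pres", "VerbForm": "Fin"},
        pos_pred="VERB",
        tags_pred={"Mood": "Ind", "Number": "Sing", "Person": "3",
                   "Tense": "Notpast", "VerbForm": "Fin"},
        verbose=False,
    )  # -> True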
4 | """ 5 | from __future__ import absolute_import 6 | import argparse 7 | 8 | from tqdm import tqdm 9 | import pymorphy2 10 | from pymorphy2.shapes import is_punctuation 11 | from pymorphy2.cache import memoized_with_single_argument 12 | from russian_tagsets import dialog2017 13 | 14 | from dialog2017 import conll, evaluate 15 | from dialog2017.utils import write_json 16 | 17 | 18 | def _normalized_for_gikrya(p): 19 | # Если действовать по ГИКРЯ, то 20 | # причастия надо нормализовывать в причастия. 21 | # в OpenCorpora и НКРЯ - в глагол. 22 | if p.tag.POS in {'PRTS', 'PRTF'}: 23 | return p.inflect({'PRTF', 'sing', 'masc', 'nomn'}) 24 | 25 | # todo: он/она/они 26 | return p.normalized 27 | 28 | 29 | def _pymorphy2_to_dialog(p, word): 30 | pos, tag = dialog2017.from_opencorpora(str(p.tag), word).split() 31 | norm = _normalized_for_gikrya(p) 32 | return word, norm.word, pos, conll.parse_tag(tag) 33 | 34 | 35 | def pymorphy2_best_parse(morph, word): 36 | word_orig = word 37 | if word.endswith('.') and not is_punctuation(word): 38 | word = word[:-1] # abbreviation 39 | return _pymorphy2_to_dialog(morph.parse(word)[0], word_orig) 40 | 41 | 42 | def parse_corpus(sents): 43 | morph = pymorphy2.MorphAnalyzer() 44 | 45 | @memoized_with_single_argument({}) 46 | def parse(word): 47 | return pymorphy2_best_parse(morph, word) 48 | 49 | return [[parse(row[0]) for row in sent] for sent in tqdm(sents)] 50 | 51 | 52 | def main(path_gold, path_pred): 53 | print("reading...") 54 | sents_gold = conll.read_sents(path_gold) 55 | print("parsing...") 56 | sents_pred = parse_corpus(sents_gold) 57 | print("saving...") 58 | if path_pred.endswith('.json'): 59 | write_json(sents_pred, path_pred) 60 | else: 61 | conll.write_sents(sents_pred, path_pred) 62 | 63 | 64 | if __name__ == '__main__': 65 | p = argparse.ArgumentParser() 66 | p.add_argument("path_gold", help="path to a file in conll or json format") 67 | p.add_argument("path_pred", help="path to a file to be created") 68 | p.add_argument("--no-eval", help="don't run evaluation", default=False) 69 | args = p.parse_args() 70 | main(args.path_gold, args.path_pred) 71 | 72 | if not args.no_eval: 73 | print("evaluating...") 74 | evaluate.main(args.path_gold, args.path_pred, 0) 75 | -------------------------------------------------------------------------------- /dialog2017/to_conll.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import argparse 4 | 5 | import tqdm 6 | 7 | from dialog2017 import conll 8 | from .utils import read_json 9 | 10 | 11 | def convert(input_path, output_path): 12 | print("reading json corpus...") 13 | sents = read_json(input_path) 14 | print("converting to conll format...") 15 | conll.write_sents(tqdm.tqdm(sents, unit=' sentences'), output_path) 16 | 17 | 18 | if __name__ == '__main__': 19 | p = argparse.ArgumentParser() 20 | p.add_argument("input_path", help="path to input .json corpus") 21 | p.add_argument("result_path", help="path to .txt conll result to create") 22 | args = p.parse_args() 23 | convert(args.input_path, args.result_path) 24 | -------------------------------------------------------------------------------- /dialog2017/to_json.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import argparse 4 | 5 | from dialog2017 import conll 6 | from dialog2017.utils import write_json 7 | 8 | 9 | def convert(input_path, output_path, opencorpora): 10 | 
print("reading & parsing...") 11 | sents = conll.read_sents_conll(input_path, opencorpora=opencorpora) 12 | print("saving to json...") 13 | write_json(sents, output_path) 14 | print("done.") 15 | 16 | 17 | if __name__ == '__main__': 18 | p = argparse.ArgumentParser() 19 | p.add_argument("input_path", help="path to an unpacked corpus file") 20 | p.add_argument("result_path", help="path to the .json result to create") 21 | p.add_argument("--opencorpora", help="apply a fix for opencorpora export", 22 | action="store_true") 23 | args = p.parse_args() 24 | convert(args.input_path, args.result_path, args.opencorpora) 25 | -------------------------------------------------------------------------------- /dialog2017/utils.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | import codecs 4 | 5 | 6 | def read_json(path): 7 | with codecs.open(path, 'r', encoding='utf8') as f: 8 | return json.load(f) 9 | 10 | 11 | def write_json(obj, path): 12 | with codecs.open(path, 'w', encoding='utf8') as f: 13 | json.dump(obj, f, ensure_ascii=False, indent=2, sort_keys=True) 14 | 15 | 16 | def read_lines(path): 17 | with open(path, 'rb') as f: 18 | return f.read().decode('utf8').splitlines() 19 | 20 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.12.0 2 | scipy==0.18.1 3 | scikit-learn==0.18.1 4 | tqdm 5 | typing 6 | 7 | # use latest git versions 8 | git+https://github.com/kmike/pymorphy2.git#egg=pymorphy2 9 | git+https://github.com/kmike/russian-tagsets.git#egg=russian-tagsets 10 | 11 | # required for CRF baseline 12 | joblib==0.11 13 | sklearn-crfsuite==0.3.5 14 | git+https://github.com/kmike/morphine.git#egg=morphine 15 | --------------------------------------------------------------------------------