├── img
│   ├── 深度截图_选择区域_20200618190314.png
│   └── 深度截图_选择区域_20200618190344.png
├── README.md
└── covert_into_bio_bmes_format.py

/img/深度截图_选择区域_20200618190314.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HuHsinpang/Ontonotes5.0-pretreatment/HEAD/img/深度截图_选择区域_20200618190314.png
--------------------------------------------------------------------------------
/img/深度截图_选择区域_20200618190344.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HuHsinpang/Ontonotes5.0-pretreatment/HEAD/img/深度截图_选择区域_20200618190344.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Ontonotes5.0-pretreatment
OntoNotes 5.0 preprocessing: splits the data into training, development, and test sets following the official procedure. (Ontonotes5.0 pretreatment)

## Why this repository exists
I recently needed to run NER experiments, but the walkthroughs on GitHub were either not detailed enough or contained errors. For example, the tutorial from [yhcc](https://github.com/yhcc/OntoNotes-5.0-NER) produces data without a test set, and several others have assorted mistakes. After reading up and studying other people's code, I found the correct way to process the OntoNotes 5.0 data and, building on [yhcc](https://github.com/yhcc/OntoNotes-5.0-NER)'s project among others, wrote a simple program that converts the CoNLL-format files into the format my projects use. I'm sharing it here. You are also welcome to look at the preprocessed [bosonNER](https://github.com/HuHsinpang/BosonNER-Pretreatment) and [weiboNER](https://github.com/HuHsinpang/weiboNER-pretreatment) datasets in my other repositories.

## Workflow
### Download the processing scripts and the dataset
1. Download the [processing scripts](http://conll.cemantix.org/2012/data.html) and [Ontonotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), and put them in the same folder; before extraction it should look like Figure 1. Downloading the OntoNotes 5.0 data requires registering on the LDC website first, and the access procedure is rather involved, so it is worth reading up on in advance. Some users have reported that the site hosting the processing scripts is down. I wanted to upload those files here, but the 25 MB limit prevents it. Users in the issues have provided an alternative download link [here](http://conll.cemantix.org/2012/download/); make sure what you download matches the figure.
![Figure 1](./img/深度截图_选择区域_20200618190314.png)

2. Extract everything into the current folder; the extracted contents should look like Figure 2. I recommend extracting under Linux: on Windows a few duplicate-file errors popped up during extraction, for reasons I never tracked down.<br>
![Figure 2](./img/深度截图_选择区域_20200618190344.png)

### Data processing
1. Open a terminal in the current folder, create a Python 2.7 environment, and run the first processing step:
```
conda create -n py27 python=2.7
source activate py27
./conll-2012/v3/scripts/skeleton2conll.sh -D ./ontonotes-release-5.0/data/files/data/ ./conll-2012/
```

2. Switch back to your Python 3 environment:
```
source deactivate
```

3. Put the Python program from this repository (`covert_into_bio_bmes_format.py`) into the folder containing the extracted files and run it. It generates a `result` folder holding the preprocessed output for each language. Since my own interest is NER, the output is tagged with the BMESO scheme; feel free to adapt the program to other tasks.

### Related experiments
I compared several recent named-entity recognition algorithms and wrote a paper about it; discussion is welcome:

```胡新棒, 于溆乔, 李邵梅, 张建朋. 基于知识增强的中文命名实体识别[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0059810```
--------------------------------------------------------------------------------
/covert_into_bio_bmes_format.py:
--------------------------------------------------------------------------------
import os, glob, itertools


def generate_collection(data_tag, dir_name, lang):
    folder = './conll-2012/v4/data/' + data_tag + '/data/' + lang
    results = itertools.chain.from_iterable(glob.iglob(os.path.join(root, '*.v4_gold_conll'))
                                            for root, dirs, files in os.walk(folder))

    text, word_count, sent_count = "", 0, 0
    for cur_file in results:
        with open(cur_file, 'r', encoding='utf-8') as f:
            flag = None
            for line in f.readlines():
                l = ' '.join(line.strip().split())
                ls = l.split(" ")
                if len(ls) >= 11:
                    word = ls[3]
                    pos = ls[4]
                    cons = ls[5]
                    ori_ner = ls[10]
                    ner = ori_ner
                    # print(word, pos, cons, ner)
                    if ori_ner == "*":
                        if flag is None:
                            ner = "O"
                        else:
                            ner = "I-" + flag
                    elif ori_ner == "*)":
                        ner = "I-" + flag
                        flag = None
                    elif ori_ner.startswith("(") and ori_ner.endswith("*") and len(ori_ner) > 2:
                        flag = ori_ner[1:-1]
                        ner = "B-" + flag
                    elif ori_ner.startswith("(") and ori_ner.endswith(")") and len(ori_ner) > 2 and flag is None:
                        ner = "B-" + ori_ner[1:-1]

                    text += "\t".join([word, pos, cons, ner]) + '\n'
                    word_count += 1
                else:
                    text += '\n'
                    if not line.startswith('#'):
                        sent_count += 1
        text += '\n'
        # break

    if data_tag == 'development':
        data_tag = 'dev'

    filepath = os.path.join(dir_name, data_tag + '.bio')
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(text)

    filepath = os.path.join(dir_name, data_tag + '.info.txt')
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write("For file:{}, there are {} sentences, {} tokens.".format(filepath, sent_count, word_count))


def nertag_bio2bioes(dir_name):
    for bio_file in glob.glob(dir_name + '/*.bio'):
        out_file = bio_file.rsplit('/', 1)[0] + '/ontonotes5.' + bio_file.rsplit('/', 1)[1].rstrip('bio') + 'bmes'
        with open(out_file, 'w', encoding='utf-8') as fout, open(bio_file, 'r', encoding='utf-8') as fin:
            lines = fin.readlines()
            for idx in range(len(lines)):
                if len(lines[idx]) < 3:  # end of sentence
                    fout.write('\n')
                    continue

                # without additional features
                word, pos, label = lines[idx].split()[0], lines[idx].split()[1], lines[idx].split()[-1]
                if "-" not in label:  # O
                    for i in range(len(word)):
                        fout.write(word[i] + ' O\n')
                else:
                    label_type = label.split('-')[-1]
                    if 'B-' in label:  # B
                        if (idx
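The heart of `generate_collection()` is the little state machine that turns the bracketed NER column of a `*.v4_gold_conll` file (`(PERSON*`, `*`, `*)`, `(GPE)`, ...) into BIO labels. The standalone function below mirrors that logic for a single sentence so it can be tested in isolation; the function name is my own, not part of the repository.

```python
def conll_ner_to_bio(tags):
    """Convert one sentence's bracketed NER column (ls[10] in the
    script above) into BIO labels, mirroring generate_collection()."""
    out, flag = [], None              # flag = type of the currently open entity
    for t in tags:
        if t == '*':                  # inside an open entity, or outside any
            out.append('O' if flag is None else 'I-' + flag)
        elif t == '*)':               # closes the open entity
            out.append('I-' + flag)
            flag = None
        elif t.startswith('(') and t.endswith('*') and len(t) > 2:
            flag = t[1:-1]            # opens a multi-token entity
            out.append('B-' + flag)
        elif t.startswith('(') and t.endswith(')') and len(t) > 2 and flag is None:
            out.append('B-' + t[1:-1])  # single-token entity
        else:
            out.append(t)             # anything else passes through, as in the script
    return out
```

For example, `conll_ner_to_bio(['(PERSON*', '*', '*)', '*', '(GPE)'])` yields `['B-PERSON', 'I-PERSON', 'I-PERSON', 'O', 'B-GPE']`.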
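The listing in this dump is cut off partway through `nertag_bio2bioes()`. For reference, here is an independent sketch of the character-level BIO-to-BMES conversion that the README's BMESO output describes (each word is split into characters; entity characters get B/M/E, a single-character entity gets S, and everything else gets O). This is my illustration of the scheme, not the repository's actual code.

```python
def bio_to_bmes(pairs):
    """pairs: list of (word, BIO-label) tokens for one sentence.
    Returns a list of (character, BMES-O label) pairs.
    A sketch of the scheme; orphan I- tokens are treated as O here."""
    out, i = [], 0
    while i < len(pairs):
        word, label = pairs[i]
        if not label.startswith('B-'):
            out.extend((ch, 'O') for ch in word)   # O (or stray I-) token
            i += 1
            continue
        etype = label[2:]
        chars = list(word)                          # collect the whole entity
        i += 1
        while i < len(pairs) and pairs[i][1] == 'I-' + etype:
            chars += list(pairs[i][0])
            i += 1
        if len(chars) == 1:
            out.append((chars[0], 'S-' + etype))    # single-character entity
        else:
            out.append((chars[0], 'B-' + etype))
            out.extend((ch, 'M-' + etype) for ch in chars[1:-1])
            out.append((chars[-1], 'E-' + etype))
    return out
```

For example, `bio_to_bmes([('纽约', 'B-GPE'), ('市', 'I-GPE')])` returns `[('纽', 'B-GPE'), ('约', 'M-GPE'), ('市', 'E-GPE')]`.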