├── img
│   ├── 深度截图_选择区域_20200618190314.png
│   └── 深度截图_选择区域_20200618190344.png
├── README.md
└── covert_into_bio_bmes_format.py
/img/深度截图_选择区域_20200618190314.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HuHsinpang/Ontonotes5.0-pretreatment/HEAD/img/深度截图_选择区域_20200618190314.png
--------------------------------------------------------------------------------
/img/深度截图_选择区域_20200618190344.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HuHsinpang/Ontonotes5.0-pretreatment/HEAD/img/深度截图_选择区域_20200618190344.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Ontonotes5.0-pretreatment

Preprocessing for OntoNotes 5.0: splits the data into training, development, and test sets following the official procedure.

## Motivation

I recently needed to run NER experiments, but the walkthroughs on GitHub are either not detailed enough or contain mistakes; for example, the one by [yhcc](https://github.com/yhcc/OntoNotes-5.0-NER) produces data without a test set, and several others have errors of their own. After reading around and studying other people's code, I found the correct way to process the OntoNotes 5.0 data and, building on [yhcc](https://github.com/yhcc/OntoNotes-5.0-NER)'s project among others, wrote a simple program that converts the CoNLL-format files into the format my projects use. I'm sharing it here. You are also welcome to have a look at the [bosonNER](https://github.com/HuHsinpang/BosonNER-Pretreatment) and [weiboNER](https://github.com/HuHsinpang/weiboNER-pretreatment) datasets I preprocessed.

## Usage

### Download the processing scripts and the dataset

1. Download the processing scripts and the data from the [CoNLL-2012 page](http://conll.cemantix.org/2012/data.html) and [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), and place everything in one folder; Figure 1 shows it before extraction. To obtain the OntoNotes 5.0 data you must first register with the LDC, and the access procedure is rather involved, so it is worth reading up on in advance. Some users have reported that the site hosting the processing scripts is down; I wanted to upload those files here, but the 25 MB limit makes that impossible. Alternative download links are provided in the issues, and a newer link is available [here](http://conll.cemantix.org/2012/download/); whichever mirror you use, make sure the files match the ones in the figure.

![Figure 1](img/深度截图_选择区域_20200618190314.png)

2. Extract everything into the current folder; Figure 2 shows the result, and the expected layout is sketched in the note after this list. I recommend extracting under Linux; when I tried it on Windows, a few duplicate-file errors popped up for reasons I never tracked down.

![Figure 2](img/深度截图_选择区域_20200618190344.png)
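
For reference, the steps below expect the extracted tree to contain at least the following paths (collected from the paths hard-coded in the commands and in `covert_into_bio_bmes_format.py`; `<language>` stands for english, chinese, or arabic):

```
ontonotes-release-5.0/data/files/data/
conll-2012/v3/scripts/skeleton2conll.sh
conll-2012/v4/data/train/data/<language>/
conll-2012/v4/data/development/data/<language>/
conll-2012/v4/data/test/data/<language>/
```
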
### Data processing

1. Open a terminal in the current folder, create a Python 2.7 environment, and run the first processing step, which fills the `*_skel` annotation files with the raw OntoNotes text and produces the `*.v4_gold_conll` files used below:
```
conda create -n py27 python=2.7
source activate py27
./conll-2012/v3/scripts/skeleton2conll.sh -D ./ontonotes-release-5.0/data/files/data/ ./conll-2012/
```

2. Switch back to the Python 3 environment:
```
source deactivate
```

3. Put the Python script from this repository into the folder that holds the extracted files and run `python covert_into_bio_bmes_format.py`. It creates a `result` folder containing the preprocessed data for each language. Since my own interest is NER, the data is tagged in BMES format (a sample is shown after this list); if you need it for other tasks, the script is easy to adapt.

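For reference, each split comes out as a word-level `.bio` file (one token per line with four tab-separated columns: word, POS tag, parse bit, NER tag) plus a character-level `ontonotes5.*.bmes` file, with blank lines separating sentences. An illustrative (made-up) `.bmes` fragment:

```
上 B-GPE
海 E-GPE
很 O
大 O
```
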
### Related experiments

I compared several recent named-entity recognition algorithms and wrote a paper about it; discussion is welcome:

```胡新棒, 于溆乔, 李邵梅, 张建朋. 基于知识增强的中文命名实体识别 (Knowledge-Enhanced Chinese Named Entity Recognition) [J]. 计算机工程 (Computer Engineering), doi: 10.19678/j.issn.1000-3428.0059810```

--------------------------------------------------------------------------------
/covert_into_bio_bmes_format.py:
--------------------------------------------------------------------------------
import glob
import itertools
import os


def generate_collection(data_tag, dir_name, lang):
    """Merge every *.v4_gold_conll file of one split/language into a single
    file of tab-separated columns (word, POS, parse bit, NER tag in BIO)."""
    folder = './conll-2012/v4/data/' + data_tag + '/data/' + lang
    results = itertools.chain.from_iterable(
        glob.iglob(os.path.join(root, '*.v4_gold_conll'))
        for root, dirs, files in os.walk(folder))

    text, word_count, sent_count = "", 0, 0
    for cur_file in results:
        with open(cur_file, 'r', encoding='utf-8') as f:
            flag = None  # entity type of the currently open NE bracket, if any
            for line in f:
                l = ' '.join(line.strip().split())
                ls = l.split(" ")
                if len(ls) >= 11:
                    # CoNLL-2012 columns: 3 = word, 4 = POS, 5 = parse bit,
                    # 10 = named-entity span in bracket notation
                    word = ls[3]
                    pos = ls[4]
                    cons = ls[5]
                    ori_ner = ls[10]
                    ner = ori_ner
                    if ori_ner == "*":  # no bracket: inside an entity or outside all
                        if flag is None:
                            ner = "O"
                        else:
                            ner = "I-" + flag
                    elif ori_ner == "*)":  # closes the currently open entity
                        ner = "I-" + flag
                        flag = None
                    elif ori_ner.startswith("(") and ori_ner.endswith("*") and len(ori_ner) > 2:
                        flag = ori_ner[1:-1]  # opens a multi-token entity
                        ner = "B-" + flag
                    elif ori_ner.startswith("(") and ori_ner.endswith(")") and len(ori_ner) > 2 and flag is None:
                        ner = "B-" + ori_ner[1:-1]  # single-token entity

                    text += "\t".join([word, pos, cons, ner]) + '\n'
                    word_count += 1
                else:
                    # short line: sentence boundary (or a '#begin/#end' comment)
                    text += '\n'
                    if not line.startswith('#'):
                        sent_count += 1
        text += '\n'

    if data_tag == 'development':
        data_tag = 'dev'

    filepath = os.path.join(dir_name, data_tag + '.bio')
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(text)

    filepath = os.path.join(dir_name, data_tag + '.info.txt')
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write("For file:{}, there are {} sentences, {} tokens.".format(
            filepath, sent_count, word_count))

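# Illustrative walk-through of the bracket decoding above (tokens invented for
# the example). In column 10 an entity span is opened with "(TYPE*" and closed
# with "*)", and generate_collection turns that into BIO tags:
#
#     word    column 10    ->  NER tag
#     New     (ORG*        ->  B-ORG
#     York    *            ->  I-ORG
#     City    *)           ->  I-ORG
#     is      *            ->  O
#
# A one-token entity such as "(PERSON)" becomes B-PERSON directly.

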
def nertag_bio2bioes(dir_name):
    """Convert each word-level BIO file into a character-level BMES file
    (one character per line), the tagging scheme used for Chinese NER."""
    for bio_file in glob.glob(os.path.join(dir_name, '*.bio')):
        head, tail = os.path.split(bio_file)
        bmes_file = os.path.join(head, 'ontonotes5.' + tail[:-len('bio')] + 'bmes')
        with open(bmes_file, 'w', encoding='utf-8') as fout, \
                open(bio_file, 'r', encoding='utf-8') as fin:
            lines = fin.readlines()
            for idx in range(len(lines)):
                if len(lines[idx]) < 3:  # blank line: sentence boundary
                    fout.write('\n')
                    continue
                # without additional features: keep only the word and the NER
                # label (the POS column is dropped)
                parts = lines[idx].split()
                word, label = parts[0], parts[-1]
                # lookahead: does the next line continue this entity (I- tag)?
                cont = (idx + 1 < len(lines) and len(lines[idx + 1]) >= 3
                        and lines[idx + 1].split()[-1].startswith('I-'))
                if '-' not in label:  # O: every character is tagged O
                    for ch in word:
                        fout.write(ch + ' O\n')
                    continue
                # standard word-level BIO -> character-level BMES conversion
                label_type = label.split('-')[-1]
                for i, ch in enumerate(word):
                    begins = label.startswith('B-') and i == 0
                    ends = not cont and i == len(word) - 1
                    tag = ('S-' if begins and ends else 'B-' if begins
                           else 'E-' if ends else 'M-')
                    fout.write(ch + ' ' + tag + label_type + '\n')
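

# A minimal driver sketch along the lines of the README; the split and
# language names below are assumptions based on the conll-2012 directory
# layout, so adjust them to the portion of the data you actually have.
if __name__ == '__main__':
    for lang in ('english', 'chinese', 'arabic'):
        out_dir = os.path.join('result', lang)
        os.makedirs(out_dir, exist_ok=True)
        for split in ('train', 'development', 'test'):
            generate_collection(split, out_dir, lang)
        # character-level BMES conversion (mainly meaningful for Chinese)
        nertag_bio2bioes(out_dir)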