├── .idea
│   ├── .gitignore
│   ├── BIO-sequence-label.iml
│   ├── inspectionProfiles
│   │   └── profiles_settings.xml
│   ├── misc.xml
│   ├── modules.xml
│   └── vcs.xml
├── README.md
├── bio_label.py
├── image
│   ├── text_labeled.jpg
│   └── word_dict.png
├── text_labeled.txt
├── text_unlabel.txt
└── word_dict.txt

/.idea/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | # Default ignored files
3 | /workspace.xml
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # BIO-sequence-label
2 | A BIO-scheme sequence labeling tool for tasks such as named entity recognition
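3 | [Corresponding CSDN blog post](https://blog.csdn.net/broccoli2/article/details/103561708)
4 |
5 | My research focuses on NLP knowledge extraction, and my entity recognition experiments required labeled training data.
6 | ## Workflow
7 | (1) Use rules to extract the entities to be labeled, such as "盈方微电子股份有限公司", save them to word_dict.txt as a dictionary, and define one label of your own for each entity class.
8 | (2) Put each sample to be labeled on a single line, so that one line is one sample.
9 | (3) Choose the output format you need: either "token<space>label" pairs in a single file, or tokens and labels in separate files.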
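The two output layouts mentioned in step (3) can be sketched as follows. This is a minimal illustration, not code from this repository: the function names `format_combined` and `format_separate` are hypothetical, and the space-separated layout for the separate files is an assumption.

```python
def format_combined(tokens, labels):
    """One "token label" pair per line, with a blank line ending the sample."""
    lines = [f"{tok} {lab}" for tok, lab in zip(tokens, labels)]
    return "\n".join(lines) + "\n\n"


def format_separate(tokens, labels):
    """Tokens and labels as two space-separated lines for two parallel files.

    The space-separated layout is an assumption made for this sketch.
    """
    return " ".join(tokens) + "\n", " ".join(labels) + "\n"


# Illustrative three-character sample (not a real dictionary entry).
tokens = ["启", "迪", "于"]
labels = ["B-INT", "I-INT", "O"]

combined_text = format_combined(tokens, labels)
token_line, label_line = format_separate(tokens, labels)
```

The combined layout is the one used by the labeled example in this README; the separate layout keeps line N of the token file aligned with line N of the label file.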
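10 |
11 | INT and BON are the labels used for this text.
12 | The line `占位词 NONE` (a placeholder entry) is required; it serves as the dictionary's terminating keyword.
13 |
14 | The word_dict.txt file looks like this:
15 | ![Label dictionary](image/word_dict.png)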
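As a rough sketch of how such a dictionary could be read, the snippet below parses "entity LABEL" lines into a mapping. The function name `parse_word_dict` is hypothetical, and stopping at the `占位词` line is only one possible reading of "terminating keyword"; it is an assumption of this sketch.

```python
def parse_word_dict(lines, stop_word="占位词"):
    """Parse "entity LABEL" lines into a dict, stopping at the placeholder line."""
    word_to_tag = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, tag = parts[0], parts[1]
        if word == stop_word:
            break  # assumed meaning of "terminating keyword"
        word_to_tag[word] = tag
    return word_to_tag


# A few lines in the same format as word_dict.txt.
sample_lines = [
    "盈方微电子股份有限公司 INT\n",
    "江西省政府一般债券 BON\n",
    "占位词 NONE\n",
]
word_to_tag = parse_word_dict(sample_lines)
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.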
16 |
17 |
18 |
19 | Labeled data looks like this:
20 | 启 B-INT
21 | 迪 I-INT
22 | 设 I-INT
23 | 计 I-INT
24 | 集 I-INT
25 | 团 I-INT
26 | 股 I-INT
27 | 份 I-INT
28 | 有 I-INT
29 | 限 I-INT
30 | 公 I-INT
31 | 司 I-INT
32 | 于 O
33 | 今 O
34 | 日 O
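To check labeled output like the sample above, it can help to decode the tags back into entity spans. The following is a minimal sketch under the usual BIO convention (`B-` opens an entity, `I-` with the same label continues it, `O` is outside); the function name `bio_to_entities` is hypothetical and not part of this repository.

```python
def bio_to_entities(tokens, tags):
    """Collect (text, label, start, end) spans from parallel token/tag lists."""
    entities = []
    start, label = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sample.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            entities.append(("".join(tokens[start:i]), label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities


# The sample shown above: a 12-character company name followed by "于今日".
tokens = list("启迪设计集团股份有限公司于今日")
tags = ["B-INT"] + ["I-INT"] * 11 + ["O"] * 3
entities = bio_to_entities(tokens, tags)
```

In this sketch, stray `I-` tags that do not follow a matching `B-` are ignored rather than treated as entity starts; a stricter checker might report them as errors.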
35 |
36 |
37 |
38 | If you run into problems, please open an issue. If you find this useful, a star is welcome~
39 |
40 |
41 |
42 |
43 |
--------------------------------------------------------------------------------
/bio_label.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-

'''
Label data in BIO format using an external dictionary.
Author: 西兰
Date: 2019-8-26
E-mail: zhkinfo@163.com
'''

# Paths point to the sample files in the repository root.
dict_path = './word_dict.txt'
file_input = './text_unlabel.txt'
file_output = './text_labeled.txt'
error_path = './error.txt'

'''
Build the keyword list and the keyword -> tag dictionary
(the keyword is the key, the tag is the value).
'''
features_list = []
word_to_tag = {}
with open(dict_path, 'r', encoding='utf-8') as f:
    for line in f:
        item = line.split()
        if len(item) > 1:
            features_list.append(item[0])
            word_to_tag[item[0]] = item[1]
        elif item:
            # Log non-empty lines that have no tag instead of using them as keywords.
            with open(error_path, 'a', encoding='utf-8') as err_f:
                err_f.write(line.strip() + '\n')

'''
Label automatically from the dictionary: search the unlabeled text for each
keyword and, on a match, assign the tag stored as that keyword's value.
'''
with open(file_input, 'r', encoding='utf-8') as f, \
        open(file_output, 'w', encoding='utf-8') as output_f:
    for line in f:
        # Strip first so that match indices line up with word_list.
        text = line.strip()
        word_list = list(text)
        tag_list = ['O'] * len(word_list)

        for keyword in features_list:
            index_log = 0
            while True:
                index_start_tag = text.find(keyword, index_log)
                # Keyword not found (any more): move on to the next keyword.
                if index_start_tag == -1:
                    break
                index_end_tag = index_start_tag + len(keyword)
                index_log = index_start_tag + 1
                # Label only spans that are still entirely unlabeled,
                # to prevent nested or partial labels.
                if all(t == 'O' for t in tag_list[index_start_tag:index_end_tag]):
                    tag = word_to_tag[keyword]
                    tag_list[index_start_tag] = 'B-' + tag  # first character
                    for i in range(index_start_tag + 1, index_end_tag):
                        tag_list[i] = 'I-' + tag  # subsequent characters

        for w, t in zip(word_list, tag_list):
            # Skip whitespace characters (tabs, spaces).
            if not w.isspace():
                output_f.write(w + ' ' + t + '\n')
        # A blank line separates samples.
        output_f.write('\n')
--------------------------------------------------------------------------------
/image/text_labeled.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/broccolik/BIO-sequence-label/a71ce7a67c75d0f83f8f818bca7c79f8e8cd037f/image/text_labeled.jpg
--------------------------------------------------------------------------------
/image/word_dict.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/broccolik/BIO-sequence-label/a71ce7a67c75d0f83f8f818bca7c79f8e8cd037f/image/word_dict.png
--------------------------------------------------------------------------------
/text_labeled.txt:
--------------------------------------------------------------------------------
1 | 盈 B-INT
2 | 方 B-INT
3 | 微 B-INT
4 | 电 B-INT
5 | 子 B-INT
6 | 股 B-INT
7 | 份 B-INT
8 | 有 B-INT
9 | 限 B-INT
10 | 公 B-INT
11 | 司 B-INT
12 | （ O
13 | 股 O
14 | 票 O
15 | 简 O
16 | 称 O
17 | ：O
18 | * O
19 | S O
20 | T O
21 | 盈 O
22 | ...
23 |
--------------------------------------------------------------------------------
/text_unlabel.txt:
--------------------------------------------------------------------------------
1 | 盈方微电子股份有限公司（股票简称：*ST盈方，股票代码：000670）2019年年度报告显示其2017年、2018年、2019年三个会计年度经审计的净利润连续为负值。根据本所《股票上市规则（2018年11月修订）》第14.1.1条的规定以及本所上市委员会的审核意见，本所决定盈方微电子股份有限公司股票自2020年4月7日起暂停上市。
--------------------------------------------------------------------------------
/word_dict.txt:
--------------------------------------------------------------------------------
1 | 盈方微电子股份有限公司 INT
2 | 北京光环新网科技股份有限公司 INT
3 | 周口市综合投资有限公司 INT
4 | 上海汉得信息技术股份有限公司 INT
5 | 湖南湘江新区投资集团有限公司 INT
6 | 融信福建投资集团有限公司 INT
7 | 湖南尔康制药股份有限公司 INT
8 | 厦门灿坤实业股份有限公司 INT
9 | 中融国证钢铁行业指数分级证券投资基金 BON
10 | 华中证空天一体军工指数证券投资基金 BON
11 | 富国新兴成长量化精选混合型证券投资基金 BON
12 | 江西省政府一般债券 BON
13 | 占位词 NONE
14 |
--------------------------------------------------------------------------------