9 | TagRuler synthesizes labeling functions based on your annotations, allowing you to quickly and easily generate large amounts of training data for span annotation, without the need to program.
10 |
11 |
12 |
13 |
14 | # What is TagRuler?
15 |
16 | In 2020, we introduced [Ruler](https://github.com/megagonlabs/ruler), a novel data programming by demonstration (DPBD) system that lets domain experts leverage data programming without writing code. Ruler generates document classification rules, but a bigger challenge remained: span-level annotation. This is one of the most time-consuming labeling tasks, and building a DPBD system for it is hard because of the sheer size of the space of labeling functions over spans.
17 |
18 | We believe this is a critical extension of the DPBD paradigm, and by open-sourcing it we hope to help with all kinds of labeling needs.
19 |
20 | # How to use the source code in this repo
21 |
22 | Follow these instructions to run the system locally, plug in your own data, and save the resulting labels, models, and annotations.
23 |
24 | ## 1. Server
25 |
26 | ### 1-1. Install Dependencies :wrench:
27 |
28 | ```shell
29 | cd server
30 | pip install -r requirements.txt
31 | python -m spacy download en_core_web_sm
32 | ```
33 |
34 | ### 1-2. (Optional) Download Data Files
35 |
36 | - **BC5CDR** ([Download Preprocessed Data](https://drive.google.com/file/d/1kKeINUOjtCVGr1_L3aC3qDo3-O-jr5hR/view?usp=sharing)): PubMed articles for Chemical-Disease annotation
37 | Li, Jiao & Sun, Yueping & Johnson, Robin & Sciaky, Daniela & Wei, Chih-Hsuan & Leaman, Robert & Davis, Allan Peter & Mattingly, Carolyn & Wiegers, Thomas & Lu, Zhiyong. (2016). Original database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
38 |
39 | - **Your Own Data** See instructions in [server/datasets](server/datasets)
40 |
41 | ### 1-3. Run :runner:
42 |
43 | ```shell
44 | python api/server.py
45 | ```
46 |
47 | ## 2. User Interface
48 |
49 | ### 2-1. Install Node.js
50 |
51 | [You can download Node.js here.](https://nodejs.org/en/)
52 |
53 | To confirm that you have Node.js installed, run `node -v`.
54 |
55 | ### 2-2. Run
56 |
57 | ```shell
58 | cd ui
59 | npm install
60 | npm start
61 | ```
62 |
63 | By default, the app will make calls to `localhost:5000`, assuming that you have the server running on your machine. (See the [server instructions above](#1-server).)
64 |
65 | Once you have both of these running, navigate to `localhost:3000`.
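To double-check that the backend is reachable, you can hit the `/` route served by `api/server.py` from Python (a minimal sketch, assuming the default port 5000 used above):

```python
# Quick sanity check that the TagRuler backend is up (default port assumed).
import urllib.request

with urllib.request.urlopen("http://localhost:5000/") as resp:
    print(resp.status)  # expect 200 when the Flask/connexion server is running
```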
66 |
67 |
68 | # Issues?
69 |
70 | ...or other inquiries? Please open an issue on this repository or contact the authors.
71 |
--------------------------------------------------------------------------------
/package-lock.json:
--------------------------------------------------------------------------------
1 | {
2 | "lockfileVersion": 1
3 | }
4 |
--------------------------------------------------------------------------------
/server/api/server.py:
--------------------------------------------------------------------------------
1 | """
2 | Main module of the server file
3 | """
4 |
5 | from flask import render_template
6 | import connexion
7 | from flask_cors import CORS
8 |
9 |
10 | # create the application instance
11 | app = connexion.App(__name__, specification_dir="./")
12 |
13 | # Read the swagger.yml file to configure the endpoints
14 | app.add_api("swagger.yml")
15 |
16 | CORS(app.app, resources={r"/*": {"origins": "*"}})
17 |
18 | # Create a URL route in our application for "/"
19 | @app.route("/")
20 | def home():
21 | """
22 | This function just responds to the browser URL
23 | localhost:5000/
24 |
25 | :return: the rendered template "home.html"
26 | """
27 | return render_template("home.html")
28 |
29 |
30 | if __name__ == "__main__":
31 | try:
32 | app.run(debug=False, threaded=False)
33 | except KeyboardInterrupt:
34 | pass
--------------------------------------------------------------------------------
/server/api/templates/home.html:
--------------------------------------------------------------------------------
1 | <!DOCTYPE html>
2 | <html>
3 | <head>
4 | <meta charset="utf-8">
5 | <title>IDEA2 API</title>
6 | </head>
7 | <body>
8 |
9 | <h1>Welcome to IDEA2 API page!</h1>
10 |
11 |
12 | <p>Please navigate to /api/ui for more information.</p>
13 | </body>
14 | </html>
--------------------------------------------------------------------------------
/server/datasets/README.md:
--------------------------------------------------------------------------------
1 | # Using Your Own Data
2 |
3 | We'll release some preprocessing code soon! Until then, you need to replicate the following file structure:
4 |
5 | ```
6 | .
7 | +-- datasets
8 | | +-- your_dataset_here
9 | | | +-- processed.bert
10 | | | +-- processed.csv
11 | | | +-- processed.elmo
12 | | | +-- processed.nlp
13 | | | +-- processed.sbert
14 | | +-- example_dataset_1
15 | | +-- example_dataset_2
16 | ```
17 |
18 | Each file contains preprocessed data that follows the schema below:
19 |
20 | `processed.csv` (csv format)
21 | | | text | labels | split |
22 | |----|------|--------|-------|
23 | | 85 | angioedema due to ace inhibitors : common and inadequately diagnosed . the estimated incidence of angioedema during angiotensin - converting enzyme ( ace ) inhibitor treatment is between 1 and 7 per thousand patients . this potentially serious adverse effect is often preceded by minor manifestations that may serve as a warning . | I-DI,O,O,I-CH,I-CH,O,O,O,O,O,O,O,O,O,O,I-DI,O,I-CH,I-CH,I-CH,I-CH,I-CH,I-CH,I-CH,I-CH,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O | train |
24 |
25 | 'split' is one of 'train', 'dev', 'test', 'valid', where the latter three have labels (for train, labels can be empty).
26 |
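Before pointing the server at a new dataset, it can help to sanity-check `processed.csv`. The sketch below is illustrative only: the path, the `index_col=0` assumption, and the tokenization assumption (one comma-separated tag per whitespace-separated token, as in the example row) may need adjusting for your data.

```python
# Minimal, illustrative sanity check for processed.csv against the schema above.
import pandas as pd

df = pd.read_csv("datasets/your_dataset_here/processed.csv", index_col=0)

# Required columns and split values
assert {"text", "labels", "split"} <= set(df.columns)
assert df["split"].isin(["train", "dev", "test", "valid"]).all()

# Labeled splits should have exactly one tag per whitespace-separated token
for _, row in df[df["split"] != "train"].iterrows():
    tokens = row["text"].split()
    tags = str(row["labels"]).split(",")
    assert len(tokens) == len(tags), f"token/tag mismatch in row {row.name}"
```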
27 | `processed.bert` ([npy](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format) format)\
28 | A Numpy array of 2-D Numpy arrays. Each 2-D Numpy array is an array of BERT representations of tokens in a text data sample.
29 | ```
30 | array([array([[ 0.024, -0.004, ..., -0.002, 0.061 ],
31 | [ 0.059, -0.004, ..., -0.003, 0.044 ],
32 | ...,
33 | [ 0.048, 0.006, ..., 0.011, -0.016]], dtype=float32),
34 | ...,
35 | array([[-0.039, 0.090, ..., -0.002, -0.002 ],
36 | ...,
37 | [-0.019, 0.027, ..., -0.011, 0.045 ]], dtype=float32)], dtype=object)
38 |
39 | ```
40 |
41 | `processed.elmo` ([npy](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format) format)\
42 | A Numpy array of 2-D Numpy arrays. Each 2-D Numpy array is an array of ELMo representations of tokens in a text data sample.
43 | ```
44 | array([array([[ 0.024, -0.004, ..., -0.002, 0.061 ],
45 | [ 0.059, -0.004, ..., -0.003, 0.044 ],
46 | ...,
47 | [ 0.048, 0.006, ..., 0.011, -0.016]], dtype=float32),
48 | ...,
49 | array([[-0.039, 0.090, ..., -0.002, -0.002 ],
50 | ...,
51 | [-0.019, 0.027, ..., -0.011, 0.045 ]], dtype=float32)], dtype=object)
52 |
53 | ```
54 |
55 | `processed.sbert` ([npy](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html#module-numpy.lib.format) format)\
56 | A 2-D Numpy array which contains Sentence-BERT representations of text data samples. The shape of this array should be (N, V) where N is the number of text samples and V is the length of the Sentence-BERT representation.
57 | ```
58 | array([[ 0.039, 0.011, 0.063, ..., -0.007, -0.004],
59 | [-0.047, -0.048, 0.023, ..., -0.026, -0.054],
60 | ...,
61 | [-0.025, -0.024, 0.054, ..., -0.017, -0.048]], dtype=float32)
62 | ```
63 |
64 | `processed.nlp`\
65 | \\TODO
66 |
67 |
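If you need a starting point for generating the embedding files, here is a minimal sketch. It is not the forthcoming preprocessing script: the dataset path and the `all-MiniLM-L6-v2` model name are assumptions, and the token-level `.bert`/`.elmo` arrays are only indicated in a comment because they depend on the tokenizer and encoder you use.

```python
# Illustrative only: write embedding files in the shapes described above.
# Assumes sentence-transformers is installed; the model name is just an example.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv("datasets/your_dataset_here/processed.csv", index_col=0)
texts = df["text"].tolist()

# processed.sbert: one fixed-length vector per sample -> float32 array of shape (N, V)
sent_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(texts).astype(np.float32)

# processed.bert / processed.elmo would instead hold one (num_tokens x dim) matrix
# per sample, wrapped in an object-dtype array because sample lengths differ, e.g.
#   tok_vecs = np.array([your_token_encoder(t) for t in texts], dtype=object)

# np.save appends ".npy" to unrecognized extensions, so write through a file handle
# to keep the exact filename the server expects.
with open("datasets/your_dataset_here/processed.sbert", "wb") as f:
    np.save(f, sent_vecs)
```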
68 | # Data Attribution
69 |
70 |
71 | ## BC5CDR
72 | Li, Jiao & Sun, Yueping & Johnson, Robin & Sciaky, Daniela & Wei, Chih-Hsuan & Leaman, Robert & Davis, Allan Peter & Mattingly, Carolyn & Wiegers, Thomas & Lu, Zhiyong. (2016).
73 |
74 | BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database. 2016. baw068. 10.1093/database/baw068. Community-run, formal evaluations and manually annotated text corpora are critically important for advancing biomedical text-mining research. Recently in BioCreative V, a new challenge was organized for the tasks of disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. Given the nature of both tasks, a test collection is required to contain both disease/chemical annotations and relation annotations in the same set of articles. Despite previous efforts in biomedical corpus construction, none was found to be sufficient for the task. Thus, we developed our own corpus called BC5CDR during the challenge by inviting a team of Medical Subject Headings (MeSH) indexers for disease/chemical entity annotation and Comparative Toxicogenomics Database (CTD) curators for CID relation annotation. To ensure high annotation quality and productivity, detailed annotation guidelines and automatic annotation tools were provided. The resulting BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions. Each entity annotation includes both the mention text spans and normalized concept identifiers, using MeSH as the controlled vocabulary. To ensure accuracy, the entities were first captured independently by two annotators followed by a consensus annotation: The average inter-annotator agreement (IAA) scores were 87.49% and 96.05% for the disease and chemicals, respectively, in the test set according to the Jaccard similarity coefficient. Our corpus was successfully used for the BioCreative V challenge tasks and should serve as a valuable resource for the text-mining research community.
75 |
76 | Database URL: http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/
77 |
78 |
--------------------------------------------------------------------------------
/server/datasets/bc5cdr/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/server/datasets/bc5cdr/.gitkeep
--------------------------------------------------------------------------------
/server/datasets/bc5cdr_example/processed.bert:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/server/datasets/bc5cdr_example/processed.bert
--------------------------------------------------------------------------------
/server/datasets/bc5cdr_example/processed.elmo:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/server/datasets/bc5cdr_example/processed.elmo
--------------------------------------------------------------------------------
/server/datasets/bc5cdr_example/processed.nlp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/server/datasets/bc5cdr_example/processed.nlp
--------------------------------------------------------------------------------
/server/datasets/bc5cdr_example/processed.sbert:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/server/datasets/bc5cdr_example/processed.sbert
--------------------------------------------------------------------------------
/server/datasets/data_process_example/datareader.py:
--------------------------------------------------------------------------------
1 | from allennlp.data import Instance
2 | from allennlp.data.dataset_readers import DatasetReader
3 | from allennlp.data.fields import TextField, SequenceLabelField
4 | from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
5 | from allennlp.data.tokenizers import Token
6 | from allennlp.data.tokenizers.word_tokenizer import WordTokenizer
7 | from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter, JustSpacesWordSplitter
8 | from tqdm.auto import tqdm
9 | from typing import Iterator, List, Dict
10 | from xml.etree import ElementTree
11 | import pandas as pd
12 |
13 | @DatasetReader.register('text')
14 | class TextDatasetReader(DatasetReader):
15 | """
16 | DatasetReader for Laptop Reviews corpus available at
17 | http://alt.qcri.org/semeval2014/task4/.
18 | """
19 | def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
20 | super().__init__(lazy=False)
21 | self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}
22 |
23 | def text_to_instance(self, doc_id: str, tokens: List[Token], tags: List[str] = None) -> Instance:
24 | tokens_field = TextField(tokens, self.token_indexers)
25 | fields = {"tokens": tokens_field}
26 |
27 | if tags:
28 | tags_field = SequenceLabelField(labels=tags, sequence_field=tokens_field)
29 | fields["tags"] = tags_field
30 |
31 | return Instance(fields)
32 |
33 | def _read(self, file_path: str) -> Iterator[Instance]:
34 | #splitter = JustSpacesWordSplitter()
35 | splitter = SpacyWordSplitter('en_core_web_sm', True, True, True)
36 | tokenizer = WordTokenizer(word_splitter=splitter)
37 | df = pd.read_csv(file_path)
38 | #TODO: fix based on correct index
39 | for i,text in tqdm(enumerate(df['text'])):
40 | # Tokenizes the sentence
41 | tokens = tokenizer.tokenize(text)
42 | space_split = text.split(' ')
43 | tokens_merged = []
44 | ii=0
45 | jj=0
46 | while ii 1:
23 | mul_tok_link_key += 1
24 | for crnt_token_text in crnt_token_texts:
25 | # crnt_token: text
26 | #crnt_token_text = origin_text[annotation["start_offset"]:annotation["end_offset"]]
27 | # initialize a new Token class
28 | crnt_token = Token(crnt_token_text, annotation['isPositive'])
29 | token_list.append(crnt_token)
30 |
31 | # Find the concept from annotation
32 | crnt_concept_name = None
33 | if ("label" in annotation) and (annotation["label"]):
34 | crnt_concept_name = annotation["label"]
35 |
36 | crnt_token.assign_concept(crnt_concept_name)
37 |
38 | # Find the relationship from annotation
39 | """
40 | crnt_rel_code = None
41 | if "link" in annotation:
42 | if not annotation["link"] is None:
43 | crnt_rel_code = int(annotation["link"])
44 | """
45 |
46 | crnt_rel_code = None
47 | if len(crnt_token_texts) > 1:
48 | crnt_rel_code = mul_tok_link_key
49 | rel_code_list.append(crnt_rel_code)
50 |
51 | # Find the index of current annotation
52 | flag: bool = False
53 | # print(annotated_text)
54 | for crnt_sent in sentences:
55 | sent_start = origin_doc[crnt_sent.start].idx
56 | sent_end = origin_doc[crnt_sent.end-1].idx + len(origin_doc[crnt_sent.end-1])
57 | if (annotation["start_offset"] >= sent_start) and (annotation["end_offset"]<=sent_end):
58 | crnt_token.assign_sent_idx(sentences.index(crnt_sent))
59 | flag = True
60 | break
61 | if not flag:
62 | print("No sentence found for the annotation: \"{}\"\nsentences: {}".format(annotation, sentences))
63 |
64 | # Find the named entity of current annotation
65 | #TODO if this is too slow, this can be done O(n) out of the loop
66 | crnt_token.assign_span_label(annotation['spanLabel'])
67 | crnt_tokens.append(crnt_token)
68 | offset = 0
69 | for crnt_token in crnt_tokens:
70 | for i,tk in enumerate(origin_doc):
71 | #TODO handle cases where selected span is not a token
72 | if tk.idx <= annotation['start_offset'] + offset and tk.idx+len(tk.text) >= annotation['start_offset'] + offset:
73 | if len(tk.ent_type_)>0: crnt_token.assign_ner_type(tk.ent_type_)
74 | crnt_token.assign_pos_type(tk.pos_)
75 | crnt_token.assign_dep_rel(tk.dep_)
76 | crnt_token.assign_tok_id(i)
77 | offset += (len(crnt_token.text) + 1) #TODO what if there are double spaces?
78 | break
79 |
80 |
81 | # Match existing concepts
82 | augment_concept(token_list, concepts)
83 |
84 | return token_list, rel_code_list
85 |
86 |
87 | def augment_concept(token_list, concepts: dict):
88 | for crnt_token in token_list:
89 | if crnt_token.concept_name is not None:
90 | continue
91 |
92 | for key in concepts.keys():
93 | if crnt_token.text in concepts[key]:
94 | crnt_token.assign_concept(key)
95 | break
96 |
97 |
98 | # remove stop word and punct
99 | def simple_parse(text: str, concepts: dict):
100 | token_list = []
101 | doc = nlp(text)
102 | tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
103 | #print(tokens)
104 |
105 | if len(doc) == len(tokens):
106 | # early return
107 | return token_list
108 |
109 | ner_dict = dict()
110 |
111 | # merge multiple tokens if falling into one ent.text
112 | for ent in doc.ents:
113 | matched_text = []
114 | for i in range(len(tokens)):
115 | if tokens[i] in ent.text:
116 | matched_text.append(tokens[i])
117 |
118 | if len(matched_text) > 1:
119 | new_text = ""
120 | for crnt_text in matched_text:
121 | new_text += crnt_text
122 | tokens.remove(crnt_text)
123 |
124 | tokens.append(ent.text)
125 | ner_dict[ent.text] = ent.label_
126 |
127 | for crnt_text in tokens:
128 | crnt_token = Token(crnt_text)
129 | if crnt_text in ner_dict.keys():
130 | crnt_token.assign_ner_type(ner_dict[crnt_text])
131 | token_list.append(crnt_token)
132 |
133 | augment_concept(token_list, concepts)
134 |
135 | return token_list
136 |
137 |
--------------------------------------------------------------------------------
/server/synthesizer/synthesizer.py:
--------------------------------------------------------------------------------
1 | from synthesizer.parser import parse, simple_parse
2 | from synthesizer.gll import *
3 |
4 |
5 | class Synthesizer:
6 |
7 | token_list = []
8 | rel_code_list = []
9 |
10 | def __init__(self, t_origin, annots, t_label, de, cs: dict, sent_id, label_dict):
11 | self.origin_text = t_origin
12 | self.annotations = annots
13 | self.delimiter = de
14 | self.concepts = cs
15 | self.label = t_label
16 | self.instances = []
17 | self.sent_id = sent_id
18 | self.label_dict= label_dict
19 |
20 | def print(self):
21 | print(self.instances)
22 |
23 | def run(self):
24 | """
25 | Based on one label, suggest LFs
26 |
27 | Returns:
28 | List(Dict): labeling functions represented as dicts with fields:
29 | 'Conditions', 'Connective', 'Direction', 'Label', and 'Weight'
30 | """
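# Illustrative (hypothetical) shape of one suggested labeling function, using the
# field names from the docstring above; real keys and values come from synthesizer.gll:
#   {'Conditions': [{'string': 'room', 'type': 0, 'positive': True}],
#    'Connective': 0, 'Direction': False, 'Label': 1, 'Weight': 1}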
31 | self.token_list, self.rel_code_list = \
32 | parse(self.annotations, self.origin_text, self.delimiter, self.concepts)
33 |
34 | assert len(self.token_list) == len(self.rel_code_list)
35 |
36 | relationship_set = None
37 | relationship_undirected = dict()
38 | relationship_directed = dict()
39 |
40 | for i in range(len(self.rel_code_list)):
41 | crnt_rel_code = self.rel_code_list[i]
42 | crnt_token = self.token_list[i]
43 |
44 | if crnt_rel_code is None:
45 | if relationship_set is None:
46 | relationship_set = Relationship(RelationshipType.SET)
47 | relationship_set.add(crnt_token, crnt_rel_code)
48 |
49 | else:
50 | if abs(crnt_rel_code) < 100:
51 | # no direction relationship
52 | if crnt_rel_code not in relationship_undirected:
53 | relationship_undirected[crnt_rel_code] = Relationship(RelationshipType.UNDIRECTED)
54 |
55 | relationship_undirected[crnt_rel_code].add(crnt_token, crnt_rel_code)
56 |
57 | else:
58 | # directed relationship
59 | abs_code = abs(crnt_rel_code)
60 | if abs_code not in relationship_directed:
61 | relationship_directed[abs_code] = Relationship(RelationshipType.DIRECTED)
62 |
63 | relationship_directed[abs_code].add(crnt_token, crnt_rel_code)
64 |
65 | # for each relationship, generate instances
66 | if relationship_set is not None: # TODO: utilize this code for merging rules later
67 | self.instances.extend(relationship_set.get_instances(self.concepts, self.sent_id, self.label_dict))
68 | for k, v in relationship_undirected.items():
69 | self.instances.extend(v.get_instances(self.concepts, self.sent_id, self.label_dict))
70 |
71 | # add label to each instance
72 | for i, crnt_instance in enumerate(self.instances):
73 | # remove repeated conditions
74 | conditions = crnt_instance[CONDS]
75 | #crnt_instance[CONDS] = [dict(t) for t in {tuple(d.items()) for d in conditions}]
76 |
77 | # sort instances based on weight
78 | calc_weight(self.instances, self.concepts)
79 | self.instances.sort(key=lambda x: x[WEIGHT], reverse=True)
80 | return self.instances#[:20]
81 |
82 | def single_condition(self):
83 | extended_instances = []
84 | crnt_instance = self.instances[0]
85 | if len(crnt_instance[CONDS]) == 1:
86 | single_cond = crnt_instance[CONDS][0]
87 | crnt_text = list(single_cond.keys())[0]
88 | crnt_type = single_cond[crnt_text]
89 |
90 | # only process when the highlighted is token
91 | if crnt_type == KeyType[TOKEN]:
92 | # pipeline to process the crnt_text
93 | # remove stopwords and punct
94 | token_list = simple_parse(crnt_text, self.concepts)
95 |
96 | if len(token_list) == 0:
97 | return
98 |
99 | # relationship set
100 | relationship_set = Relationship(RelationshipType.SET)
101 | for crnt_token in token_list:
102 | relationship_set.add(crnt_token, None)
103 | extended_instances.extend( relationship_set.get_instances(self.concepts) )
104 | else:
105 | for condition in crnt_instance[CONDS]:
106 | new_inst = crnt_instance.copy()
107 | new_inst[CONDS] = [condition]
108 | extended_instances.extend([new_inst])
109 | self.instances = extended_instances
110 |
111 |
112 |
113 | def calc_weight(instances, concepts):
114 | # TODO better weight calc
115 | for crnt_instance in instances:
116 | crnt_weight = 1
117 | for crnt_cond in crnt_instance[CONDS]:
118 | k = crnt_cond.get("string")
119 | v = crnt_cond.get("type")
120 |
121 | KeyType = {TOKEN: 0, CONCEPT: 1, NER: 2, REGEXP: 3, POS: 4, DEP: 5, ELMO_SIMILAR: 6, BERT_SIMILAR: 7,
122 | SIM: 8}
123 |
124 | if v == KeyType[POS]:
125 | crnt_weight += 2
126 | elif v == KeyType[DEP]:
127 | crnt_weight += 1
128 | crnt_instance[WEIGHT] = crnt_weight
129 |
130 |
131 | def test_synthesizer():
132 | text_origin = "the room is clean."
133 | annotations = [{"id":5, "start_offset": 5, "end_offset":9}, {"id":12, "start_offset": 12, "end_offset":17}]
134 | label = 1.0
135 | de = '#'
136 | concepts = dict()
137 |
138 | concepts['Hotel'] = ['room']
139 |
140 | crnt_syn = Synthesizer(text_origin, annotations, label, de, concepts)
141 |
142 |
143 |
144 | #test_synthesizer()
145 |
--------------------------------------------------------------------------------
/server/verifier/DB.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | import json
3 | import os
4 | import pandas as pd
5 | from verifier.translator import make_lf
6 |
7 |
8 | class InteractionDBSingleton:
9 | filename = "interactionDB.json"
10 | def __init__(self):
11 | self.db = {}
12 | self.count = 0
13 |
14 | def add(self, interaction: dict):
15 | if "index" in interaction:
16 | index = interaction["index"]
17 | interaction['time_submitted'] = datetime.now()
18 | self.db[index].update(interaction)
19 | else:
20 | index = self.count
21 | interaction["index"] = index
22 | interaction['time_first_seen'] = datetime.now()
23 | self.db[index] = interaction
24 | self.count += 1
25 | return index
26 |
27 | def update(self, index: int, selected_lf_ids: list):
28 | self.db[index]["lfs"] = selected_lf_ids
29 |
30 | def get(self, index: int):
31 | return self.db[index]
32 |
33 | def save(self, dirname):
34 | with open(os.path.join(dirname, self.filename), "w+") as file:
35 | json.dump({
36 | "db": self.db,
37 | "count": self.count
38 | }, file, default=str)
39 |
40 | def load(self, dirname):
41 | with open(os.path.join(dirname, self.filename), "r") as file:
42 | data = json.load(file)
43 | self.db = data['db']
44 | self.count = data['count']
45 |
46 | interactionDB = InteractionDBSingleton()
47 |
48 | class LFDBSingleton:
49 | filename = "LF_DB.json"
50 | def __init__(self):
51 | self.db = {}
52 | self.lf_index = 0
53 |
54 | def get(self, lf_id):
55 | return self.db[lf_id]
56 |
57 | def add_lfs(self, lf_dicts: dict, all_concepts, emb_dict, ui_label_dict):
58 | new_lfs = {}
59 | for lf_hash, lf_explanation in lf_dicts.items():
60 | if not lf_hash in self.db:
61 | lf_explanation["time_submitted"] = datetime.now()
62 | lf_explanation["ID"] = self.lf_index
63 | lf_explanation["active"] = True
64 | self.lf_index += 1
65 |
66 | crnt_lf = make_lf(lf_explanation, all_concepts.get_dict(), emb_dict, ui_label_dict)
67 | self.db[lf_hash] = lf_explanation
68 | new_lfs[lf_hash] = crnt_lf
69 | else:
70 | self.db[lf_hash]["active"] = True
71 | new_lfs[lf_hash] = make_lf(self.db[lf_hash], all_concepts.get_dict(), emb_dict, ui_label_dict)
72 |
73 | return new_lfs
74 |
75 | def delete(self, lf_id: str):
76 | return self.db.pop(lf_id)
77 |
78 | def deactivate(self, lf_id: str):
79 | self.db[lf_id]["active"] = False
80 | return self.db[lf_id]
81 |
82 | def update(self, stats: dict):
83 | for lf_id, stats_dict in stats.items():
84 | self.db[lf_id].update(stats_dict)
85 | return self.db.copy()
86 |
87 | def __contains__(self, item: str):
88 | return item in self.db
89 |
90 | def __len__(self):
91 | return len(self.db)
92 |
93 | def save(self, dirname):
94 | with open(os.path.join(dirname, self.filename), "w+") as file:
95 | json.dump({
96 | "db": self.db,
97 | "lf_index": self.lf_index
98 | }, file, default=str)
99 |
100 | def load(self, dirname, all_concepts):
101 | with open(os.path.join(dirname, self.filename), "r") as file:
102 | data = json.load(file)
103 | lfs = data['db']
104 | self.db.update({k:v for k,v in lfs.items() if not v['active']})
105 | self.lf_index = data['lf_index']
106 | return self.add_lfs({k:v for k,v in lfs.items() if v['active']}, all_concepts)
107 |
108 |
109 | LF_DB = LFDBSingleton()
--------------------------------------------------------------------------------
/server/verifier/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/server/verifier/__init__.py
--------------------------------------------------------------------------------
/server/verifier/translator.py:
--------------------------------------------------------------------------------
1 | """Translate a dictionary explaining a labeling rule into a function
2 | """
3 | import re
4 | from wiser.rules import TaggingRule # A TaggingRule instance is defined for a tagging LF.
5 | from snorkel.labeling import LabelingFunction
6 | from synthesizer.gll import *
7 | import numpy as np
8 |
9 |
10 | def raw_stringify(s):
11 | """From a string, create a regular expression that finds occurrences of that string as an entire word
12 |
13 | Args:
14 | s (string): the string to look for
15 |
16 | Returns:
17 | string: a regular expression that looks for this string surrounded by non-word characters
18 | """
19 | return r"(?:(?<=\W)|(?<=^))({})(?=\W|$)".format(re.escape(s))
20 |
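# Example (illustrative): raw_stringify("ace") yields a pattern that matches
# "ace" as a standalone word in "the ace inhibitor", but not inside "surface".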
21 |
22 | def find_indices(cond_dict: dict, text: str):
23 | """Find all instances of this condition in the text
24 | """
25 | v = cond_dict.get("type")
26 | k = cond_dict.get("string")
27 | case_sensitive = True if cond_dict.get("case_sensitive") else False
28 |
29 | if v == KeyType[NER]:
30 | doc = nlp(text)
31 | for ent in doc.ents:
32 | if ent.label_ == k:
33 | return [(doc[ent.start].idx, doc[ent.end-1].idx + len(doc[ent.end-1].text))]
34 | return []
35 | elif v == KeyType[POS] or v == KeyType[DEP] or v == KeyType[ELMO_SIMILAR] or v == KeyType[BERT_SIMILAR]:
36 | return [(0,0)]
37 | elif case_sensitive:
38 | return [(m.start(), m.end()) for m in re.finditer(k, text)]
39 | else:
40 | return [(m.start(), m.end()) for m in re.finditer(k, text, re.IGNORECASE)]
41 |
42 |
43 | def make_lf(instance, concepts, emb_dict, ui_label_dict):
44 | def apply_instance(self, instance):
45 | """
46 | An ``Instance`` is a collection of :class:`~allennlp.data.fields.field.Field` objects,
47 | specifying the inputs and outputs to
48 | some model. We don't make a distinction between inputs and outputs here, though - all
49 | operations are done on all fields, and when we return arrays, we return them as dictionaries
50 | keyed by field name. A model can then decide which fields it wants to use as inputs as which
51 | as outputs.
52 |
53 | The ``Fields`` in an ``Instance`` can start out either indexed or un-indexed. During the data
54 | processing pipeline, all fields will be indexed, after which multiple instances can be combined
55 | into a ``Batch`` and then converted into padded arrays.
56 |
57 | Parameters
58 | ----------
59 | instance : An ``Instance`` is a collection of :class:`~allennlp.data.fields.field.Field` objects.
60 | instance['fields'] : ``Dict[str, Field]``
61 | ex:
62 | instance['fields']['tags'] = ['ABS', 'I-OP', 'ABS', 'ABS']
63 | """
64 | label = self.lf_dict.get(LABEL)
65 | direction = bool(self.lf_dict.get(DIRECTION))
66 | conn = self.lf_dict.get(CONNECTIVE)
67 | conds = self.lf_dict.get(CONDS)
68 | types = [cond.get("TYPE_") for cond in conds]
69 | strs_ = [cond.get("string") for cond in conds]
70 | #labels = np.array([np.array(['ABS'] * len(instance['tokens']), dtype=object)] * len(conds))
71 |
72 | type_ = conds[0].get("TYPE_")
73 | str_ = conds[0].get("string")
74 | is_pos = conds[0].get("positive")
75 | labels = np.array(['ABS'] * len(instance['tokens']), dtype=object)
76 | if type_ == BERT_SIMILAR or type_ == ELMO_SIMILAR:
77 | emb_type = 'bert' if type_ == BERT_SIMILAR else 'elmo'
78 | emb_thres = BERT_THRESHOLD if type_ == BERT_SIMILAR else ELMO_THRESHOLD
79 | sent_id = self.lf_dict.get(SENT_ID)
80 | tok_id = self.lf_dict.get(TOK_ID)
81 | if type(tok_id) == list:
82 | target_emb = self.emb_dict.emb_dict[emb_type][sent_id][tok_id]
83 | emb = instance[emb_type]
84 | cos_scores = target_emb @ emb.T
85 | if is_pos:
86 | similar_inds = (cos_scores > emb_thres)
87 | else:
88 | similar_inds = (cos_scores <= emb_thres)
89 | similar_ind = similar_inds[0][:-len(tok_id)+1]
90 | for i in range(1,len(tok_id)):
91 | similar_ind = (similar_ind) & (similar_inds[i][i:len(similar_inds[i])-len(tok_id)+1+i])
92 | for i in range(len(tok_id)):
93 | labels[i:len(labels)-len(tok_id)+1+i][similar_ind] = ui_label_dict[label]
94 | else:
95 | # target_emb is the emb vec corresponding to the similarity target word
96 | target_emb = self.emb_dict.emb_dict[emb_type][sent_id][tok_id]
97 | # emb is an N x M matrix containing token embeddings in a sentence
98 | emb = instance[emb_type]
99 | cos_scores = np.dot(emb, target_emb)
100 | if is_pos:
101 | labels[cos_scores > emb_thres] = ui_label_dict[label]
102 | else:
103 | labels[cos_scores <= emb_thres] = ui_label_dict[label]
104 | # try:
105 | # except:
106 | # print('error')
107 | elif type_ == POS:
108 | for i, pos in enumerate([token.pos_ for token in instance['tokens']]):
109 | if (pos in str_ and is_pos) or (pos not in str_ and not is_pos):
110 | labels[i] = ui_label_dict[label]
111 | elif type_ == DEP:
112 | for i, dep in enumerate([token.dep_ for token in instance['tokens']]):
113 | if (dep in str_ and is_pos) or (dep not in str_ and not is_pos):
114 | labels[i] = ui_label_dict[label]
115 | elif type_ == NER:
116 | for i, ner in enumerate([token.ent_type_ for token in instance['tokens']]):
117 | if (ner in str_ and is_pos) or (ner not in str_ and not is_pos):
118 | labels[i] = ui_label_dict[label]
119 | return list(labels)
120 |
121 |
122 | def lf_init(self):
123 | pass
124 |
125 |
126 | LF_class = type(instance['name'], (TaggingRule,), {"__init__":lf_init, "lf_dict":instance, "apply_instance":apply_instance, "emb_dict":emb_dict, "name":instance['name']} )
127 | return LF_class()
128 |
--------------------------------------------------------------------------------
/tagruler-teaser.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/tagruler-teaser.gif
--------------------------------------------------------------------------------
/ui/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "dpbd",
3 | "version": "0.1.0",
4 | "private": true,
5 | "dependencies": {
6 | "@material-ui/core": "^4.4.0",
7 | "@material-ui/icons": "^4.2.1",
8 | "axios": "^0.19.0",
9 | "clsx": "^1.0.4",
10 | "d3": "^5.12.0",
11 | "jquery": "^3.4.0",
12 | "material-ui-dropzone": "^3.3.0",
13 | "prop-types": "^15.7.2",
14 | "react": "^16.9.0",
15 | "react-dom": "^16.9.0",
16 | "react-redux": "^7.1.1",
17 | "react-router-dom": "^5.2.0",
18 | "react-scripts": "^3.4.1",
19 | "redux": "^4.0.4",
20 | "redux-thunk": "^2.3.0",
21 | "typeface-roboto": "0.0.75"
22 | },
23 | "scripts": {
24 | "start": "react-scripts start",
25 | "build": "react-scripts build",
26 | "test": "react-scripts test",
27 | "eject": "react-scripts eject"
28 | },
29 | "eslintConfig": {
30 | "extends": "react-app"
31 | },
32 | "browserslist": {
33 | "production": [
34 | ">0.2%",
35 | "not dead",
36 | "not op_mini all"
37 | ],
38 | "development": [
39 | "last 1 chrome version",
40 | "last 1 firefox version",
41 | "last 1 safari version"
42 | ]
43 | }
44 | }
45 |
--------------------------------------------------------------------------------
/ui/public/favicon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/megagonlabs/tagruler/f680478e82ffc913a05af35b5837eee477e7d322/ui/public/favicon.png
--------------------------------------------------------------------------------
/ui/public/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
12 |
16 |
17 |
26 | Interactive Span-level Annotation
27 |
28 |
29 |
30 |
31 |
41 |
42 |
43 |
--------------------------------------------------------------------------------
/ui/public/manifest.json:
--------------------------------------------------------------------------------
1 | {
2 | "short_name": "Ruler",
3 | "name": "Ruler: Data Programming by Demonstration for Text",
4 | "icons": [
5 | {
6 | "src": "favicon.png",
7 | "sizes": "64x64",
8 | "type": "image/x-png"
9 | }
10 | ],
11 | "start_url": ".",
12 | "display": "standalone",
13 | "theme_color": "#000000",
14 | "background_color": "#ffffff"
15 | }
16 |
--------------------------------------------------------------------------------
/ui/public/robots.txt:
--------------------------------------------------------------------------------
1 | # https://www.robotstxt.org/robotstxt.html
2 | User-agent: *
3 |
--------------------------------------------------------------------------------
/ui/src/AnnotationDisplayCollapse.js:
--------------------------------------------------------------------------------
1 | import React from 'react';
2 | import PropTypes from 'prop-types';
3 |
4 | import AnnotationDisplay from './AnnotationDisplay'
5 | import ExpandMoreIcon from '@material-ui/icons/ExpandMore';
6 | import ExpandLessIcon from '@material-ui/icons/ExpandLess';
7 | import Grid from '@material-ui/core/Grid';
8 |
9 | class AnnotationDisplayCollapse extends React.Component {
10 | constructor(props) {
11 | super(props);
12 | this.state = {
13 | open: false
14 | }
15 | this.collapseText(this.props.text);
16 | }
17 |
18 | collapseText(text, MAX_COLLAPSED_MARGIN = 100) {
19 | const max_start = this.props.annotations[0].start_offset || 0;
20 | const min_end = this.props.annotations[0].end_offset || 0;
21 |
22 | var coll_annotations = this.props.annotations;
23 | if (max_start > MAX_COLLAPSED_MARGIN) {
24 | const start = text.indexOf(" ", max_start - MAX_COLLAPSED_MARGIN);
25 | if (start > 30) {
26 | coll_annotations = this.props.annotations.map((ann) => {
27 | return ({
28 | ...ann,
29 | start_offset: ann.start_offset - start + 3,
30 | end_offset: ann.end_offset - start + 3
31 | })
32 | })
33 | text = "..." + text.slice(start);
34 | }
35 | }
36 |
37 | if (min_end < text.length - MAX_COLLAPSED_MARGIN) {
38 | const end = text.lastIndexOf(" ", min_end + MAX_COLLAPSED_MARGIN);
39 | if (text.length - end > 30){
40 | text = text.slice(0, end) + "...";
41 | }
42 | }
43 |
44 | this.coll_text = text;
45 | this.coll_annotations = coll_annotations;
46 | }
47 |
48 | toggleOpen() {
49 | this.setState({
50 | open: !this.state.open
51 | })
52 | }
53 |
54 | toggleIcon() {
55 | if (this.state.open) {
56 | return <ExpandLessIcon onClick={() => this.toggleOpen()}/>
57 | } else {
58 | return <ExpandMoreIcon onClick={() => this.toggleOpen()}/>
59 | }
60 |
61 | }
62 |
63 | render() {
64 |
65 | var collapsedText = this.coll_text;
66 | var coll_annotations = this.coll_annotations;
67 |
68 | return(
69 |
74 |
79 | { this.props.text !== collapsedText ? this.toggleIcon() : null}
80 |
81 | )
82 | }
83 | }
84 |
85 | AnnotationDisplayCollapse.propTypes = {
86 | text: PropTypes.string,
87 | annotations: PropTypes.array
88 | }
89 |
90 | export default AnnotationDisplayCollapse
--------------------------------------------------------------------------------
/ui/src/App.js:
--------------------------------------------------------------------------------
1 | import React from 'react';
2 | import { createMuiTheme } from '@material-ui/core/styles';
3 | import { makeStyles } from '@material-ui/core/styles';
4 | import { ThemeProvider } from '@material-ui/styles';
5 | import ColorPalette from './ColorPalette';
6 | import Main from './Main';
7 |
8 | const windowHeight = Math.max(
9 | document.documentElement.clientHeight,
10 | window.innerHeight || 0);
11 |
12 | const useStyles = makeStyles(theme => ({
13 | root: {
14 | width: '100%',
15 | height: windowHeight,
16 | }
17 | }));
18 |
19 |
20 | const App = () => {
21 | const theme = createMuiTheme(ColorPalette),
22 | classes = useStyles();
23 |
24 | return (
25 |
26 |
27 |