├── .gitattributes
├── webred_21.tfrecord
├── webred_5.tfrecord
└── README.md

/.gitattributes:
--------------------------------------------------------------------------------
*.tfrecord filter=lfs diff=lfs merge=lfs -text
--------------------------------------------------------------------------------
/webred_21.tfrecord:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:500ff49030ad90b809db3bcd9026517d44596c14b161a7411ee29b43d944c9d8
size 53089472
--------------------------------------------------------------------------------
/webred_5.tfrecord:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:12f20320af609dc598acfde88f4ce7282419075a9cc8d769939cee555a03caf7
size 1883002
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# WebRED - Web Relation Extraction Dataset

A dataset for extracting relations from a variety of text found on the World
Wide Web. Text on the web has diverse surface forms, spanning different writing
styles, levels of complexity, and grammar. This dataset collects sentences from
a variety of webpages and documents that represent those categories. In each
sentence, the subject and object entities are tagged as `SUBJ{...}` and
`OBJ{...}`, respectively. The two entities are either related by one of a
pre-defined set of relations or have no relation.

More information about the dataset can be found in
[our paper](https://arxiv.org/abs/2102.09681).
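For example, a tagged sentence looks like the following (a constructed illustration, not an actual sentence from the dataset):

```
SUBJ{Albert Einstein} was born in OBJ{Ulm}.
```

Here the subject and object entities are connected by the pre-defined relation `place of birth` (WikiData `P19`).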
If you use this dataset, please cite this work as:

```
@misc{ormandi2021webred,
  title={WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web},
  author={Robert Ormandi and Mohammad Saleh and Erin Winter and Vinay Rao},
  year={2021},
  eprint={2102.09681},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2102.09681},
}
```

We compare our dataset against other publicly available relation extraction
corpora in the table below. Notably, ours is the only dataset whose text is
drawn from the web, where the input sources and writing styles vary widely.

The table only includes the human-annotated subsets of each dataset. The
numbers here may differ from the ones reported in our paper, as we were unable
to release the full version of our dataset due to legal concerns.

| Dataset                                             | No. of relations | No. of examples |
|-----------------------------------------------------|------------------|-----------------|
| [TACRED](https://nlp.stanford.edu/projects/tacred/) | 42               | 106,264         |
| [DocRED](https://github.com/thunlp/DocRED)          | 96               | 63,427          |
| *WebRED 5*                                          | 523              | 3,898           |
| *WebRED 2+1*                                        | 523              | 107,819         |

Each example in `WebRED 5` was annotated by exactly `5` independent human
annotators. In `WebRED 2+1`, each example was annotated by `2` independent
annotators; if they disagreed, an additional annotator (`+1`) was assigned to
the example to provide a disambiguating annotation.

In our paper, we used the `WebRED` data to fine-tune a model trained on a large
unsupervised dataset. The collection of the pre-training data, the unsupervised
model training, the supervised fine-tuning on `WebRED 2+1`, and the evaluation
on `WebRED 5` are described in the paper.
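As an aside, the `2+1` resolution rule described above can be sketched in a few lines of Python (an illustrative sketch only; the function and the label values are hypothetical and not part of the released data or tooling):

```python
def resolve_2_plus_1(first, second, tie_breaker=None):
    """Resolves a 2+1 annotation: two labels, plus a third on disagreement."""
    if first == second:
        return first
    # The two initial annotators disagreed; the +1 annotator disambiguates.
    if tie_breaker is None:
        raise ValueError('A tie-breaking annotation is required on disagreement.')
    return tie_breaker

# Agreement case: the shared label is final.
print(resolve_2_plus_1('P19', 'P19'))                  # P19
# Disagreement case: the third annotation decides.
print(resolve_2_plus_1('P19', 'no_relation', 'P19'))   # P19
```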
The paper also compares this model against others built on the datasets
mentioned in the table above.

## Preparation
First, download the data onto your disk:

```bash
cd ~
git clone https://github.com/google-research-datasets/WebRED.git
cd WebRED/
```

## Using the data
The dataset is distributed as `tf.train.Example` protos encoded in the
[`TFRecord`](https://www.tensorflow.org/tutorials/load_data/tfrecord) format.

One can easily read the content of the dataset using
[TensorFlow's data API](https://www.tensorflow.org/api_docs/python/tf/data):

```python
import tensorflow as tf

path_to_webred = '...'  # Path to where the WebRED data was downloaded.

def read_examples(*dataset_paths):
    """Parses serialized `tf.train.Example` protos from TFRecord files."""
    examples = []
    dataset = tf.data.TFRecordDataset(dataset_paths)
    for raw_sentence in dataset:
        sentence = tf.train.Example()
        sentence.ParseFromString(raw_sentence.numpy())
        examples.append(sentence)
    return examples

webred_sentences = read_examples(path_to_webred)
sentence = webred_sentences[0]  # An instance of `tf.train.Example`.
```

Description of the features:

* `num_pos_raters`: Number of unique human raters who thought that the
  sentence expresses the given relation.
* `num_raters`: Number of unique human raters who annotated the sentence-fact
  pair.
* `relation_id`: The
  [WikiData relation ID](https://www.wikidata.org/wiki/Wikidata:Identifiers)
  of the fact.
* `relation_name`: Human-readable name of the relation of the fact.
* `sentence`: The sentence with the subject and object annotations in it.
* `source_name`: The name of the subject (source) entity.
* `target_name`: The name of the object (target) entity.
* `url`: Original URL of the page containing the annotated sentence.

The individual features of an example can be accessed, e.g.,
as follows:

```python
def get_feature(sentence, feature_name, idx=0):
    """Returns the `idx`-th value of a named feature of a `tf.train.Example`."""
    feature = sentence.features.feature[feature_name]
    return getattr(feature, feature.WhichOneof('kind')).value[idx]

annotated_sentence_text = get_feature(sentence, 'sentence').decode('utf-8')
relation_name = get_feature(sentence, 'relation_name').decode('utf-8')
# Fraction of raters who judged that the sentence expresses the relation.
empirical_probability_of_the_sentence_expressing_the_relation = (
    get_feature(sentence, 'num_pos_raters') /
    get_feature(sentence, 'num_raters'))
```

## License

This data is licensed by Google LLC under a [Creative Commons Attribution 4.0
International License](http://creativecommons.org/licenses/by/4.0/).
Users are allowed to modify and repost it, and we encourage them to analyze
and publish research based on the data.

## Contact Us

If you have a technical question regarding the dataset, code or publication,
please create an issue in this repository. You may also reach us at
webred@google.com.
--------------------------------------------------------------------------------