├── .gitattributes
├── webred_21.tfrecord
├── webred_5.tfrecord
└── README.md

/.gitattributes:
--------------------------------------------------------------------------------
*.tfrecord filter=lfs diff=lfs merge=lfs -text
--------------------------------------------------------------------------------
/webred_21.tfrecord:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:500ff49030ad90b809db3bcd9026517d44596c14b161a7411ee29b43d944c9d8
size 53089472
--------------------------------------------------------------------------------
/webred_5.tfrecord:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:12f20320af609dc598acfde88f4ce7282419075a9cc8d769939cee555a03caf7
size 1883002
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# WebRED - Web Relation Extraction Dataset

A dataset for extracting relations from a variety of text found on the World
Wide Web. Text on the web has diverse surface forms, spanning different writing
styles, levels of complexity, and grammar. This dataset collects sentences from
a variety of webpages and documents that represent those categories. In each
sentence, the subject and object entities are tagged as `SUBJ{...}` and
`OBJ{...}`, respectively. The two entities are either related by one of a
pre-defined set of relations or have no relation.

More information about the dataset can be found in
[our paper](https://arxiv.org/abs/2102.09681).
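For example, a tagged sentence looks like the following (a constructed illustration, not an actual sentence from the dataset):

```
SUBJ{Albert Einstein} was born in OBJ{Ulm}.
```

Here the subject and object entities are connected by the pre-defined relation `place of birth` (WikiData `P19`).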
If you use this dataset, please cite this work as:

```
@misc{ormandi2021webred,
  title={WebRED: Effective Pretraining And Finetuning For Relation Extraction On The Web},
  author={Robert Ormandi and Mohammad Saleh and Erin Winter and Vinay Rao},
  year={2021},
  eprint={2102.09681},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2102.09681},
}
```

We compare our dataset against other publicly available relation extraction
corpora in the table below. Notably, ours is the only dataset whose text is
drawn from the web, where the input sources and writing styles vary widely.

The table only includes the human-annotated subsets of each dataset. The
numbers here may differ from the ones reported in our paper, as we were unable
to release the full version of our dataset due to legal concerns.

| Dataset                                             | No. of relations | No. of examples |
|-----------------------------------------------------|------------------|-----------------|
| [TACRED](https://nlp.stanford.edu/projects/tacred/) | 42               | 106,264         |
| [DocRED](https://github.com/thunlp/DocRED)          | 96               | 63,427          |
| *WebRED 5*                                          | 523              | 3,898           |
| *WebRED 2+1*                                        | 523              | 107,819         |

Each example in `WebRED 5` was annotated by exactly `5` independent human
annotators. In `WebRED 2+1`, each example was annotated by `2` independent
annotators; if they disagreed, an additional annotator (`+1`) was assigned to
the example to provide a disambiguating annotation.

In our paper, we used the `WebRED` data to fine-tune a model trained on a large
unsupervised dataset. The collection of the pre-training data, the unsupervised
model training, the supervised fine-tuning on `WebRED 2+1`, and the evaluation
on `WebRED 5` are described in the paper.
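As an aside, the `2+1` resolution rule described above can be sketched in a few lines of Python (an illustrative sketch only; the function and the label values are hypothetical and not part of the released data or tooling):

```python
def resolve_2_plus_1(first, second, tie_breaker=None):
    """Resolves a 2+1 annotation: two labels, plus a third on disagreement."""
    if first == second:
        return first
    # The two initial annotators disagreed; the +1 annotator disambiguates.
    if tie_breaker is None:
        raise ValueError('A tie-breaking annotation is required on disagreement.')
    return tie_breaker

# Agreement case: the shared label is final.
print(resolve_2_plus_1('P19', 'P19'))                  # P19
# Disagreement case: the third annotation decides.
print(resolve_2_plus_1('P19', 'no_relation', 'P19'))   # P19
```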
The paper also compares this model against others built on the datasets
mentioned in the table above.

## Preparation
First, download the data onto your disk:

```bash
cd ~
git clone https://github.com/google-research-datasets/WebRED.git
cd WebRED/
```

## Using the data
The dataset is distributed as `tf.train.Example` protos encoded in the
[`TFRecord`](https://www.tensorflow.org/tutorials/load_data/tfrecord) format.

One can easily read the content of the dataset using
[TensorFlow's data API](https://www.tensorflow.org/api_docs/python/tf/data):

```python
import tensorflow as tf

path_to_webred = '...'  # Path to where the WebRED data was downloaded.

def read_examples(*dataset_paths):
    """Parses serialized `tf.train.Example` protos from TFRecord files."""
    examples = []
    dataset = tf.data.TFRecordDataset(dataset_paths)
    for raw_sentence in dataset:
        sentence = tf.train.Example()
        sentence.ParseFromString(raw_sentence.numpy())
        examples.append(sentence)
    return examples

webred_sentences = read_examples(path_to_webred)
sentence = webred_sentences[0]  # An instance of `tf.train.Example`.
```

Description of the features:

* `num_pos_raters`: Number of unique human raters who thought that the
  sentence expresses the given relation.
* `num_raters`: Number of unique human raters who annotated the sentence-fact
  pair.
* `relation_id`: The
  [WikiData relation ID](https://www.wikidata.org/wiki/Wikidata:Identifiers)
  of the fact.
* `relation_name`: Human-readable name of the relation of the fact.
* `sentence`: The sentence with the subject and object annotations in it.
* `source_name`: The name of the subject (source) entity.
* `target_name`: The name of the object (target) entity.
* `url`: Original URL of the page containing the annotated sentence.

The individual features of an example can be accessed, e.g.,
as follows:

```python
def get_feature(sentence, feature_name, idx=0):
    """Returns the `idx`-th value of a named feature of a `tf.train.Example`."""
    feature = sentence.features.feature[feature_name]
    return getattr(feature, feature.WhichOneof('kind')).value[idx]

annotated_sentence_text = get_feature(sentence, 'sentence').decode('utf-8')
relation_name = get_feature(sentence, 'relation_name').decode('utf-8')
# Fraction of raters who judged that the sentence expresses the relation.
empirical_probability_of_the_sentence_expressing_the_relation = (
    get_feature(sentence, 'num_pos_raters') /
    get_feature(sentence, 'num_raters'))
```

## License

This data is licensed by Google LLC under a [Creative Commons Attribution 4.0
International License](http://creativecommons.org/licenses/by/4.0/).
Users are allowed to modify and repost it, and we encourage them to analyze
and publish research based on the data.

## Contact Us

If you have a technical question regarding the dataset, code or publication,
please create an issue in this repository. You may also reach us at
webred@google.com.
--------------------------------------------------------------------------------