├── README.md
├── converted
│   ├── README
│   ├── en-ud-tweet-dev.fixed.conllu
│   ├── en-ud-tweet-test.fixed.conllu
│   ├── en-ud-tweet-train.fixed.conllu
│   └── fixRoot.py
├── en-ud-tweet-dev.conllu
├── en-ud-tweet-test.conllu
└── en-ud-tweet-train.conllu

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Summary

Tweebank v2 is a collection of English tweets annotated in Universal Dependencies that can be used to train NLP systems and improve their performance on social media text.

# Introduction

Tweebank v2 is built on the original data of Tweebank v1 (840 unique tweets, 639/201 in the training/test split), along with an additional 210 tweets sampled from the POS-tagged dataset of Gimpel et al. (2011) and 2,500 tweets sampled from the Twitter stream from February 2016 to July 2016.
The latter data source consists of 147.4M English tweets. As in Kong et al. (2014),
the reference unit is always the tweet in its entirety,
which may thus consist of multiple sentences, not the sentence alone.
Before annotation, we use simple regular expressions to anonymize usernames and URLs.
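A minimal sketch of such an anonymization step (the actual patterns and placeholder tokens are not documented in this release, so everything below is an illustrative assumption):

```python
import re

# Hypothetical patterns and placeholders; the treebank's actual regular
# expressions are not published here, so treat both as assumptions.
USERNAME_RE = re.compile(r'@\w+')
URL_RE = re.compile(r'https?://\S+')

def anonymize(tweet):
    """Replace usernames and URLs with placeholder tokens."""
    return URL_RE.sub('URL', USERNAME_RE.sub('@USER', tweet))

print(anonymize('@alice look at http://t.co/abc123'))
# -> '@USER look at URL'
```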
Our annotation process was conducted in two stages.
In the first stage, 18 researchers worked on the Tweebank v1
portion and the additional 210 tweets and created the initial annotations in one day.
Before annotating, they were given a tutorial overview of the general UD
annotation conventions and our guidelines specifically for annotating tweets.
During this exercise, both the guidelines and the annotations
were further refined by the authors of this paper to increase
the coverage of the guidelines and to resolve inconsistencies between
annotators. In the second stage, a tokenizer, a POS tagger, and a
parser were trained on the annotated data from the first stage (1,050 tweets in total)
and used to automatically analyze the sampled 2,500 tweets. The authors
then manually corrected the parsed data, yielding 3,550 labeled tweets in total.

# Corpus splitting

The treebank has been randomly split as follows:

* en-ud-tweet-train.conllu: 1,639 tweets (24,753 words)
* en-ud-tweet-dev.conllu: 710 tweets (11,742 words)
* en-ud-tweet-test.conllu: 1,201 tweets (19,112 words)

# References

* Yijia Liu, Yi Zhu, Wanxiang Che, Bing Qin, Nathan Schneider, and Noah A. Smith. 2018. [Parsing Tweets into Universal Dependencies](https://www.aclanthology.org/N18-1088/). In Proc. of NAACL.
* Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. [A Dependency Parser for Tweets](https://www.aclanthology.org/D14-1108/). In Proc. of EMNLP.
* Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. [Part-of-speech tagging for Twitter: Annotation, features, and experiments](http://www.aclanthology.org/P11-2008). In Proc. of ACL.

# Changelog

2018-04-15 v2.0

* initial release

# Metadata

```
Data available since: UD v2.1
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: social
Lemmas: automatic
UPOS: automatic with corrections
Relations: automatic with corrections
Contributors: Liu, Yijia; Zhu, Yi; Schneider, Nathan; Smith, Noah A.
Contributing: elsewhere
Contact: oneplus.lau@gmail.com
```
--------------------------------------------------------------------------------
/converted/README:
--------------------------------------------------------------------------------
This is a conversion of the original Tweebank. fixRoot.py is used to connect each separate tree within a tweet to the first root with the parataxis relation (following https://www.aclweb.org/anthology/D18-1542.pdf). This version can be used with the official evaluation script.

This is also the exact version used in the MaChAmp paper (https://arxiv.org/abs/2005.14672).

Rob 21-03-2021
--------------------------------------------------------------------------------
/converted/fixRoot.py:
--------------------------------------------------------------------------------
import sys

# Reattach every extra root in a tweet to the tweet's first root with the
# parataxis relation, so that each tweet forms a single well-formed UD tree.
# Reads the CoNLL-U file named on the command line and writes to stdout;
# the final tree is flushed by the blank line that terminates a CoNLL-U file.

tree = []

with open(sys.argv[1]) as conllu:
    for line in conllu:
        if line.strip() == '':
            # Blank line: end of the current tweet. Repair its roots, print it.
            if len(tree) != 0:
                firstRoot = ''
                for wordIdx in range(len(tree)):
                    if tree[wordIdx][7] == 'root':  # DEPREL column
                        if tree[wordIdx][6] != '0':  # HEAD column
                            # Labeled 'root' but attached to a word: relabel.
                            tree[wordIdx][7] = 'parataxis'
                        elif firstRoot == '':
                            # Keep the first true root as the tree's root.
                            firstRoot = tree[wordIdx][0]
                        else:
                            # Attach any later root to the first root.
                            tree[wordIdx][7] = 'parataxis'
                            tree[wordIdx][6] = firstRoot

                for sent in tree:
                    print('\t'.join(sent))
                print()

            tree = []
        elif line.startswith('#'):
            # Pass comment lines (sent_id, text, ...) through unchanged.
            print(line, end='')
        else:
            tree.append(line.strip().split('\t'))
--------------------------------------------------------------------------------
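To illustrate what fixRoot.py does, take a hypothetical tweet whose two sentences were annotated as two separate trees (only the ID, FORM, HEAD, and DEPREL columns are filled in; the command below is likewise an assumption, suggested by the `.fixed.conllu` names in `converted/`):

```
python3 fixRoot.py en-ud-tweet-train.conllu > en-ud-tweet-train.fixed.conllu
```

Input fragment with two roots:

```
1	Good	_	_	_	_	2	amod	_	_
2	morning	_	_	_	_	0	root	_	_
3	I	_	_	_	_	4	nsubj	_	_
4	agree	_	_	_	_	0	root	_	_
```

Output, with the second root reattached to the first via parataxis:

```
1	Good	_	_	_	_	2	amod	_	_
2	morning	_	_	_	_	0	root	_	_
3	I	_	_	_	_	4	nsubj	_	_
4	agree	_	_	_	_	2	parataxis	_	_
```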