├── README.md
├── converted
│   ├── README
│   ├── en-ud-tweet-dev.fixed.conllu
│   ├── en-ud-tweet-test.fixed.conllu
│   ├── en-ud-tweet-train.fixed.conllu
│   └── fixRoot.py
├── en-ud-tweet-dev.conllu
├── en-ud-tweet-test.conllu
└── en-ud-tweet-train.conllu

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Summary

Tweebank v2 is a collection of English tweets annotated in Universal Dependencies that can be used to train NLP systems and improve their performance on social media text.

# Introduction

Tweebank v2 is built on the original data of Tweebank v1 (840 unique tweets, 639/201 in the training/test split), along with an additional 210 tweets sampled from the POS-tagged dataset of Gimpel et al. (2011) and 2,500 tweets sampled from the Twitter stream from February 2016 to July 2016.
The latter data source consists of 147.4M English tweets. As in Kong et al. (2014),
the reference unit is always the tweet in its entirety,
which may thus consist of multiple sentences, not the sentence alone.
Before annotation, we use simple regular expressions to anonymize usernames and URLs.
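A minimal sketch of such an anonymization step (the actual patterns and placeholder tokens are not documented in this release, so everything below is an illustrative assumption):

```python
import re

# Hypothetical patterns and placeholders; the treebank's actual regular
# expressions are not published here, so treat both as assumptions.
USERNAME_RE = re.compile(r'@\w+')
URL_RE = re.compile(r'https?://\S+')

def anonymize(tweet):
    """Replace usernames and URLs with placeholder tokens."""
    return URL_RE.sub('URL', USERNAME_RE.sub('@USER', tweet))

print(anonymize('@alice look at http://t.co/abc123'))
# -> '@USER look at URL'
```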
Our annotation process was conducted in two stages.
In the first stage, 18 researchers worked on the Tweebank v1
portion and the additional 210 tweets and created the initial annotations in one day.
Before annotating, they were given a tutorial overview of the general UD
annotation conventions and our guidelines specifically for annotating tweets.
During this exercise, both the guidelines and the annotations
were further refined by the authors of this paper to increase
the coverage of the guidelines and to resolve inconsistencies between
annotators. In the second stage, a tokenizer, a POS tagger, and a
parser were trained on the annotated data from the first stage (1,050 tweets in total)
and used to automatically analyze the sampled 2,500 tweets. The authors
then manually corrected the parsed data, yielding 3,550 labeled tweets in total.

# Corpus splitting

The treebank has been randomly split as follows:

* en-ud-tweet-train.conllu: 1,639 tweets (24,753 words)
* en-ud-tweet-dev.conllu: 710 tweets (11,742 words)
* en-ud-tweet-test.conllu: 1,201 tweets (19,112 words)

# References

* Yijia Liu, Yi Zhu, Wanxiang Che, Bing Qin, Nathan Schneider, and Noah A. Smith. 2018. [Parsing Tweets into Universal Dependencies](https://www.aclanthology.org/N18-1088/). In Proc. of NAACL.
* Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. [A Dependency Parser for Tweets](https://www.aclanthology.org/D14-1108/). In Proc. of EMNLP.
* Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. [Part-of-speech tagging for Twitter: Annotation, features, and experiments](http://www.aclanthology.org/P11-2008). In Proc. of ACL.

# Changelog

2018-04-15 v2.0

* initial release

# Metadata

```
Data available since: UD v2.1
License: CC BY-NC-SA 4.0
Includes text: yes
Genre: social
Lemmas: automatic
UPOS: automatic with corrections
Relations: automatic with corrections
Contributors: Liu, Yijia; Zhu, Yi; Schneider, Nathan; Smith, Noah A.
Contributing: elsewhere
Contact: oneplus.lau@gmail.com
```
--------------------------------------------------------------------------------
/converted/README:
--------------------------------------------------------------------------------
This is a conversion of the original Tweebank. fixRoot.py is used to connect each separate tree within a tweet to the first root with the parataxis relation (following https://www.aclweb.org/anthology/D18-1542.pdf). This version can be used with the official evaluation script.

This is also the exact version used in the MaChAmp paper (https://arxiv.org/abs/2005.14672).

Rob 21-03-2021
--------------------------------------------------------------------------------
/converted/fixRoot.py:
--------------------------------------------------------------------------------
import sys

# Reattach every extra root in a tweet to the tweet's first root with the
# parataxis relation, so that each tweet forms a single well-formed UD tree.
# Reads the CoNLL-U file named on the command line and writes to stdout;
# the final tree is flushed by the blank line that terminates a CoNLL-U file.

tree = []

with open(sys.argv[1]) as conllu:
    for line in conllu:
        if line.strip() == '':
            # Blank line: end of the current tweet. Repair its roots, print it.
            if len(tree) != 0:
                firstRoot = ''
                for wordIdx in range(len(tree)):
                    if tree[wordIdx][7] == 'root':  # DEPREL column
                        if tree[wordIdx][6] != '0':  # HEAD column
                            # Labeled 'root' but attached to a word: relabel.
                            tree[wordIdx][7] = 'parataxis'
                        elif firstRoot == '':
                            # Keep the first true root as the tree's root.
                            firstRoot = tree[wordIdx][0]
                        else:
                            # Attach any later root to the first root.
                            tree[wordIdx][7] = 'parataxis'
                            tree[wordIdx][6] = firstRoot

                for sent in tree:
                    print('\t'.join(sent))
                print()

            tree = []
        elif line.startswith('#'):
            # Pass comment lines (sent_id, text, ...) through unchanged.
            print(line, end='')
        else:
            tree.append(line.strip().split('\t'))
--------------------------------------------------------------------------------
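To illustrate what fixRoot.py does, take a hypothetical tweet whose two sentences were annotated as two separate trees (only the ID, FORM, HEAD, and DEPREL columns are filled in; the command below is likewise an assumption, suggested by the `.fixed.conllu` names in `converted/`):

```
python3 fixRoot.py en-ud-tweet-train.conllu > en-ud-tweet-train.fixed.conllu
```

Input fragment with two roots:

```
1	Good	_	_	_	_	2	amod	_	_
2	morning	_	_	_	_	0	root	_	_
3	I	_	_	_	_	4	nsubj	_	_
4	agree	_	_	_	_	0	root	_	_
```

Output, with the second root reattached to the first via parataxis:

```
1	Good	_	_	_	_	2	amod	_	_
2	morning	_	_	_	_	0	root	_	_
3	I	_	_	_	_	4	nsubj	_	_
4	agree	_	_	_	_	2	parataxis	_	_
```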