# Syntax-Enhanced_Pre-trained_Model (Draft)
Source data of the ACL 2021 paper "Syntax-Enhanced Pre-trained Model".


## Summary of Paper
In this paper, we present SEPREM, which leverages syntax information to enhance pre-trained models. To inject syntactic information, we introduce a syntax-aware attention layer and a newly designed pre-training task. Experimental results show that our method achieves state-of-the-art performance over six datasets. Further analysis shows that the proposed dependency distance prediction task performs better than the dependency head prediction task.

For more details about our paper, we refer interested readers to the paper [here](https://arxiv.org/pdf/2012.14116.pdf).

## Pre-training Data
We randomly collected 1B sentences from the publicly released Common Crawl news dataset ([CCNews](https://commoncrawl.org/)), which contains English news articles crawled between December 2016 and March 2019. Then, we adopted the off-the-shelf [Stanza](https://github.com/stanfordnlp/stanza) toolkit to automatically generate the syntax information for each sentence.
Running on 64 V100-32G GPUs, it took a month and a half to obtain the results. The average token length of each sentence is 25.34, and the average depth of the syntax trees is 5.15.

Now, we release the constructed 1B sentences with the corresponding syntax information to the community.
You can download the data from my [OneDrive](https://mail2sysueducn-my.sharepoint.com/:f:/g/personal/xuzn_mail2_sysu_edu_cn/ElrJsiEbzK9KlRInoBbmr1oBuCmUdRPVTdDvyk05GLPtcw) (uploading started on 2021/05/11 and finished on 2021/05/18).
Please note that the total size of all files should be above 800 GB, but we can only provide 722 GB.
Since I am using my student account, the data on OneDrive will expire in 2023.

### 1. File Structure
Due to the large amount of data, we split the raw syntax information into **11** sections instead of storing it in a single file.
Each section generally contains **10** folders, and each folder contains about **1000** json files.
Unfortunately, the first section was deleted by mistake, so only the **2nd~11th** sections can be provided. Folders 9/6 and 9/8 are also missing.
If you find that some json files are broken due to unstable network transmission, please leave an issue and I will re-upload them as soon as possible.

We provide the statistics of the results as follows:
|Section Number|Number of Folders|Provided|Total Size (GB)|Total Number of Sentences / Json Files|
|:-:|:-:|:-:|:-:|:-:|
|1||:x:|||
|2|10|:grinning:|78.7|96988985 / 9699|
|3|10|:grinning:|76.1|94198706 / 9420|
|4|10|:grinning:|72.7|90297083 / 9030|
|5|10|:grinning:|73.1|91042200 / 9105|
|6|9|:grinning:|68.3|86357503 / 8636|
|7|10|:grinning:|73.5|91920280 / 9193|
|8|9|:grinning:|71.3|89769348 / 8977|
|9|7|:grinning:|53.5|66958763 / 6696|
|10|9|:grinning:|69.8|86494425 / 8650|
|11|11|:grinning:|85.3|109427451 / 10943|
|Sum|||722.3|903454744 / 90349|

### 2. Data Format
The storage unit of the raw syntactic information is the json file mentioned above.
Each json file contains about **10000** parsed sentences, each of which stores the *lemma*, *xpos*, *upos*, *head* and *deprel* fields.
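These five fields correspond to Stanza's per-word attributes. As a rough reference, the following is a minimal sketch of how one such record can be produced, assuming Stanza's default English pipeline (the exact pipeline options used to build the release may differ):

```python
import stanza

# Download the English models on first use:
# stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("You should stick with your kid.")
for sent in doc.sentences:
    record = {
        "lemma":  [w.lemma  for w in sent.words],
        "xpos":   [w.xpos   for w in sent.words],
        "upos":   [w.upos   for w in sent.words],
        "head":   [w.head   for w in sent.words],   # 1-based parent index; 0 marks the root
        "deprel": [w.deprel for w in sent.words],
    }
    print(record)
```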
We take one item from the file **2/1/1_1_10000.json** as an example:

|Field|Value|
|:-:|:-|
|lemma|['you', 'should', 'stick', 'with', 'you', 'kid', '.']|
|xpos|['PRP', 'MD', 'VB', 'IN', 'PRP$', 'NN', '.']|
|upos|['PRON', 'AUX', 'VERB', 'ADP', 'PRON', 'NOUN', 'PUNCT']|
|head|[3, 3, 0, 6, 6, 3, 3]|
|deprel|['nsubj', 'aux', 'root', 'case', 'nmod:poss', 'obl', 'punct']|
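To consume the data, it is enough to read a file and walk the *head* and *deprel* arrays in parallel. Below is a minimal sketch, assuming each json file stores a json list of items like the one above (the top-level layout of the files is otherwise unspecified here):

```python
import json

# Example file from the layout described above (section 2, folder 1).
path = "2/1/1_1_10000.json"

with open(path, encoding="utf-8") as f:
    items = json.load(f)  # assumed: a json list of records like the example

item = items[0]
# head[i] is the 1-based index of token i's parent; 0 marks the root.
for i, (h, rel) in enumerate(zip(item["head"], item["deprel"])):
    child = item["lemma"][i]
    parent = "ROOT" if h == 0 else item["lemma"][h - 1]
    print(f"{child} --{rel}--> {parent}")
```

For the example item, this prints arcs such as `you --nsubj--> stick` and `stick --root--> ROOT`.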