# Syntax-Enhanced_Pre-trained_Model (Draft)
Source data of the ACL 2021 paper "Syntax-Enhanced Pre-trained Model".


## Summary of Paper
In this paper, we present SEPREM, which leverages syntax information to enhance pre-trained models. To inject syntactic information, we introduce a syntax-aware attention layer and a newly designed pre-training task. Experimental results show that our method achieves state-of-the-art performance over six datasets. Further analysis shows that the proposed dependency distance prediction task performs better than the dependency head prediction task.

For more details about our paper, we refer interested readers to the paper [here](https://arxiv.org/pdf/2012.14116.pdf).

## Pre-training Data
We randomly collected 1B sentences from the publicly released Common Crawl news dataset ([CCNews](https://commoncrawl.org/)), which contains English news articles crawled between December 2016 and March 2019. Then, we adopted the off-the-shelf [Stanza](https://github.com/stanfordnlp/stanza) toolkit to automatically generate the syntax information for each sentence.
Running on 64 V100-32G GPUs, it took a month and a half to obtain the results. The average token length of each sentence is 25.34, and the average depth of the syntax trees is 5.15.

Now, we release the constructed 1B sentences with the corresponding syntax information to the community.
You can download the data from my [OneDrive](https://mail2sysueducn-my.sharepoint.com/:f:/g/personal/xuzn_mail2_sysu_edu_cn/ElrJsiEbzK9KlRInoBbmr1oBuCmUdRPVTdDvyk05GLPtcw) (uploading started on 2021/05/11 and finished on 2021/05/18).
Please note that the total size of all files should be above 800 GB, but we can only provide 722 GB.
Since I am using my student account, the data on OneDrive will expire in 2023.

### 1. File Structure
Due to the large amount of data, we split the raw syntax information into **11** sections instead of storing it in a single file.
Each section generally contains **10** folders, and each folder contains about **1000** json files.
Unfortunately, the first section was deleted by mistake, so only the **2nd~11th** sections can be provided. Folders 9/6 and 9/8 are also missing.
If you find that some json files are broken due to unstable network transmission, please leave an issue and I will re-upload them as soon as possible.

We provide the statistics of the results as follows:
|Section Number|Number of Folders|Provided|Total Size (GB)|Total Number of Sentences / Json Files|
|:-:|:-:|:-:|:-:|:-:|
|1||:x:|||
|2|10|:grinning:|78.7|96988985 / 9699|
|3|10|:grinning:|76.1|94198706 / 9420|
|4|10|:grinning:|72.7|90297083 / 9030|
|5|10|:grinning:|73.1|91042200 / 9105|
|6|9|:grinning:|68.3|86357503 / 8636|
|7|10|:grinning:|73.5|91920280 / 9193|
|8|9|:grinning:|71.3|89769348 / 8977|
|9|7|:grinning:|53.5|66958763 / 6696|
|10|9|:grinning:|69.8|86494425 / 8650|
|11|11|:grinning:|85.3|109427451 / 10943|
|Sum|||722.3|903454744 / 90349|

### 2. Data Format
The storage unit of the raw syntactic information is the json file mentioned above.
Each json file contains about **10000** parsed sentences, each of which stores the *lemma*, *xpos*, *upos*, *head* and *deprel* fields.
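These five fields correspond to Stanza's per-word attributes. As a rough reference, the following is a minimal sketch of how one such record can be produced, assuming Stanza's default English pipeline (the exact pipeline options used to build the release may differ):

```python
import stanza

# Download the English models on first use:
# stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("You should stick with your kid.")
for sent in doc.sentences:
    record = {
        "lemma":  [w.lemma  for w in sent.words],
        "xpos":   [w.xpos   for w in sent.words],
        "upos":   [w.upos   for w in sent.words],
        "head":   [w.head   for w in sent.words],   # 1-based parent index; 0 marks the root
        "deprel": [w.deprel for w in sent.words],
    }
    print(record)
```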
We take one item from the file **2/1/1_1_10000.json** as an example:

|Field|Value|
|:-:|:-|
|lemma|['you', 'should', 'stick', 'with', 'you', 'kid', '.']|
|xpos|['PRP', 'MD', 'VB', 'IN', 'PRP$', 'NN', '.']|
|upos|['PRON', 'AUX', 'VERB', 'ADP', 'PRON', 'NOUN', 'PUNCT']|
|head|[3, 3, 0, 6, 6, 3, 3]|
|deprel|['nsubj', 'aux', 'root', 'case', 'nmod:poss', 'obl', 'punct']|
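To consume the data, it is enough to read a file and walk the *head* and *deprel* arrays in parallel. Below is a minimal sketch, assuming each json file stores a json list of items like the one above (the top-level layout of the files is otherwise unspecified here):

```python
import json

# Example file from the layout described above (section 2, folder 1).
path = "2/1/1_1_10000.json"

with open(path, encoding="utf-8") as f:
    items = json.load(f)  # assumed: a json list of records like the example

item = items[0]
# head[i] is the 1-based index of token i's parent; 0 marks the root.
for i, (h, rel) in enumerate(zip(item["head"], item["deprel"])):
    child = item["lemma"][i]
    parent = "ROOT" if h == 0 else item["lemma"][h - 1]
    print(f"{child} --{rel}--> {parent}")
```

For the example item, this prints arcs such as `you --nsubj--> stick` and `stick --root--> ROOT`.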