├── ACL17 └── Overview.md ├── Contributors.md ├── Embedding └── README.md ├── Informaiton Extraction and Text Mining └── Readme.md ├── LICENSE ├── Multilinguality └── Readme.md ├── Phonology Morphology and Word Segment └── README.md ├── README.md ├── Sentence-level Semantics └── README.md ├── Summerization └── README.md ├── Tagging Chunking Syntax and Parsing └── README.md └── _config.yml /ACL17/Overview.md: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Contributors.md: -------------------------------------------------------------------------------- 1 | ## 贡献者 Contributors 2 | 3 | | Contributor | Affiliation | 4 | | ----------- | ----------- | 5 | | [Nativeatom](https://github.com/Nativeatom) | Carnegie Mellon University | 6 | -------------------------------------------------------------------------------- /Embedding/README.md: -------------------------------------------------------------------------------- 1 | ## Embedding 2 | 3 | ### Overview 4 | * [Tutorial at EMNLP 14'](http://emnlp2014.org/tutorials/8_notes.pdf) 5 | * [Distributional and Word Embedding Models](https://github.com/Lambda-3/Indra/wiki/Distributional-and-Word-Embedding-Models) 6 | 7 | ### Word2Vec 8 | * Overview by Xin Rong *[word2vec Parameter Learning Explained](https://arxiv.org/pdf/1411.2738.pdf)* [[Video](https://www.youtube.com/watch?v=D-ekE-Wlcds)] 9 | * Explanation of negative sampling: *[word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method](https://arxiv.org/pdf/1402.3722.pdf)* 10 | * Connection between negative sampling and mutual information: *[Neural Word Embedding as Implicit Matrix Factorization](https://pdfs.semanticscholar.org/e0d7/69c5698275c2f745c1aac2aebc323f51aa24.pdf?_ga=2.100986117.540785117.1516862569-662515382.1516862569)* 14' 11 | 12 | #### Variant 13 | * emoji2vec 14 | Ben Eisner *[emoji2vec: 15 | Learning Emoji 
Representations from their Description](http://aclweb.org/anthology/W16-6208)* 16' 16 | 17 | ### GloVe 18 | 19 | 20 | 21 | -------------------------------------------------------------------------------- /Informaiton Extraction and Text Mining/Readme.md: -------------------------------------------------------------------------------- 1 | ## Information Extraction and Text Mining 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 
29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 
61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /Multilinguality/Readme.md: -------------------------------------------------------------------------------- 1 | ## Multilinguality 2 | -------------------------------------------------------------------------------- /Phonology Morphology and Word Segment/README.md: -------------------------------------------------------------------------------- 1 | ## Phonology, Morphology and Word Segmentation 2 | - Task 3 | - Word Segmentation 4 | - Model 5 | - Hidden Markov Model (HMM) 6 | - Conditional Random Fields (CRFs) 7 | - [CRFs for Word Segmentation (in Chinese)](https://www.cnblogs.com/liufanping/p/4899842.html) 8 | - [Parameter Learning in CRFs (in Chinese)](http://www.cnblogs.com/pinard/p/7068574.html) 9 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Natural Language Processing 2 | This repository collects basic concepts of Natural Language Processing, well-regarded textbooks and blogs, influential papers, and more. 
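As a concrete illustration of the classic sequence models named in the word-segmentation notes above (HMMs and CRFs), here is a toy character-based HMM segmenter using the standard BMES tagging scheme with Viterbi decoding. All probabilities are made-up numbers chosen purely for illustration, and the uniform emission model is a placeholder for corpus-estimated statistics — this is a sketch of the technique, not a trained segmenter.

```python
import math

# Toy character-based word segmentation with an HMM + Viterbi decoding.
# States follow the standard BMES tagging scheme:
#   B = begin of a multi-char word, M = middle, E = end, S = single-char word.
# All probabilities below are illustrative, NOT estimated from a corpus.

STATES = "BMES"
# Log initial probabilities: a sentence can only start with B or S.
START = {"B": math.log(0.6), "M": float("-inf"),
         "E": float("-inf"), "S": math.log(0.4)}
# Log transition probabilities TRANS[prev][next]; missing pairs are impossible.
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.4), "E": math.log(0.6)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}

def emit(state, ch):
    # A real model would look up log P(ch | state) learned from data; a uniform
    # placeholder is used here, so decoding is driven by the transitions alone.
    return math.log(0.001)

def viterbi(sentence):
    """Return the most likely BMES tag sequence for `sentence`."""
    V = [{s: START[s] + emit(s, sentence[0]) for s in STATES}]
    back = []
    for ch in sentence[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            best_prev, best = max(
                ((p, V[-1][p] + TRANS[p].get(s, float("-inf"))) for p in STATES),
                key=lambda x: x[1])
            scores[s] = best + emit(s, ch)
            ptr[s] = best_prev
        V.append(scores)
        back.append(ptr)
    # Backtrack from the best final state (a word must end with E or S).
    last = max(("E", "S"), key=lambda s: V[-1][s])
    tags = [last]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return "".join(reversed(tags))

def segment(sentence):
    """Cut `sentence` into words at E/S tag boundaries."""
    tags, words, cur = viterbi(sentence), [], ""
    for ch, t in zip(sentence, tags):
        cur += ch
        if t in "ES":
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words
```

Swapping `emit` for per-state character statistics counted from a segmented corpus turns this sketch into the standard supervised HMM baseline; the CRF formulation replaces the generative emission/transition scores with discriminatively trained feature weights over the same BMES tags.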
3 | 4 | This is also the Natural Language Processing part of [Machine Learning Resources](https://github.com/jindongwang/MachineLearning), created by a group of people including [jindongwang](https://github.com/jindongwang). 5 | 6 | Contributors are welcome to work together and make it BETTER! 7 | 8 | ## Resources: Textbooks and Lectures 9 | 10 | ### Mathematical and Statistical Foundations 11 | * Linear Algebra 12 | - 18.06 MIT (Gilbert Strang) [[pdf](http://www.math.hcmus.edu.vn/~bxthang/Linear%20algebra%20and%20its%20applications.pdf)] [[video](https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/)] 13 | 14 | * Matrix Analysis 15 | 16 | * Convex Optimization 17 | - EE364A Stanford (Stephen Boyd) [[pdf](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf)] [[website](http://web.stanford.edu/~boyd/cvxbook/)] 18 | - Introductory Lectures on Convex Programming (Yu. Nesterov) [[pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.855&rep=rep1&type=pdf)] 19 | 20 | 21 | ### Machine Learning 22 | * [The Elements of Statistical Learning (ESL) - Hastie, Tibshirani and Friedman](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) 23 | * [CS228 Probabilistic Graphical Models - Stanford](https://cs.stanford.edu/~ermon/cs228/index.html) 24 | * [10-708 Probabilistic Graphical Models - CMU](http://www.cs.cmu.edu/~epxing/Class/10708/index.html) 25 | 26 | ### Deep Learning 27 | * [Deep Learning - Ian Goodfellow, Yoshua Bengio, Aaron Courville](https://github.com/HFTrader/DeepLearningBook) 28 | * [CS231n Convolutional Neural Networks for Visual Recognition - Stanford](http://cs231n.stanford.edu/) 29 | 30 | ### Natural Language Processing 31 | * Foundations of Statistical Natural Language Processing - 32 | Chris Manning and Hinrich Schütze 33 | * [Speech and Language Processing - Daniel Jurafsky and James H. 
Martin](http://www.cs.colorado.edu/~martin/slp2.html#Chapter3) 34 | * 统计学习方法 (Statistical Learning Methods) - 李航 (Li Hang) 35 | * [Advanced Natural Language Processing - MIT](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-864-advanced-natural-language-processing-fall-2005/index.htm) 36 | * [CS 224n Natural Language Processing with Deep Learning - Stanford](http://web.stanford.edu/class/cs224n/) 37 | * [Deep Learning for NLP at Oxford with DeepMind - Oxford](https://www.youtube.com/playlist?list=PL613dYIGMXoZBtZhbyiBqb0QtgK6oJbpm) 38 | * [11-747 Neural Networks for NLP - CMU](http://www.phontron.com/class/nn4nlp2021/) 39 | * [11-737 Multilingual NLP - CMU](http://demo.clab.cs.cmu.edu/11737fa20/) 40 | * [Some Knowledge about Machine Learning](https://www.youtube.com/playlist?list=PL613dYIGMXoZBtZhbyiBqb0QtgK6oJbpm) 41 | * [A list of datasets](https://github.com/awesomedata/awesome-public-datasets#naturallanguage) 42 | 43 | ## Models and Applications 44 | - Probabilistic Graphical Models 45 | - Hidden Markov Model 46 | - Conditional Random Fields 47 | 48 | - Topic Models 49 | - Latent Dirichlet Allocation ([paper](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)) 50 | 51 | - Deep Learning Models 52 | - [Long Short-Term Memory (LSTM)](https://www.bioinf.jku.at/publications/older/2604.pdf) *Sepp Hochreiter and Jürgen Schmidhuber, 1997* 53 | - [Interpretation](https://arxiv.org/pdf/1805.03716.pdf) *Omer Levy, University of Washington, 2018* 54 | - Recurrent Neural Network (RNN) 55 | - Seq2Seq ([TensorFlow tutorial](https://github.com/llSourcell/seq2seq_model_live/blob/master/2-seq2seq-advanced.ipynb)) 56 | - [Machine Translation TensorFlow implementation](https://github.com/tensorflow/nmt) 57 | - Convolutional Neural Network (CNN) 58 | - Attention Model 59 | - Overview ([in Chinese](https://blog.csdn.net/BVL10101111/article/details/78470716)) 60 | - Generative Adversarial Network (GAN) 61 | - Transformer 62 | - [Training Tips](https://arxiv.org/abs/1804.00247) 63 | - 
[Bidirectional Encoder Representations from Transformers (BERT)](https://arxiv.org/abs/1810.04805) *Jacob Devlin et al., Google, 2018* 64 | 65 | ## Blogs and Tutorials 66 | - [RNNs in TensorFlow: a practical guide and undocumented features](http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/) 67 | - [The Unreasonable Effectiveness of Recurrent Neural Networks](https://web.stanford.edu/class/cs379c/class_messages_listing/content/Artificial_Neural_Network_Technology_Tutorials/KarparthyUNREASONABLY-EFFECTIVE-RNN-15.pdf) 68 | 69 | ## Topics and Tasks 70 | The categorization of areas follows the tracks of ACL 2018, ACL 2020, and EMNLP 2020. 71 | ### [Summarization](https://github.com/Nativeatom/NaturalLanguageProcessing/tree/master/Summerization) 72 | - Task 73 | - Summarization 74 | - Opinion Summarization 75 | - Evaluation 76 | - Model 77 | - Extractive 78 | - Generative 79 | - [Pointer-Generator Networks, ACL2017](https://arxiv.org/pdf/1704.04368.pdf) 80 | - [BART, ACL2020](https://www.aclweb.org/anthology/2020.acl-main.703/) 81 | - [Pegasus, ICML2020](https://arxiv.org/pdf/1912.08777.pdf) 82 | - Hybrid 83 | - Dataset 84 | - [XSum, EMNLP2018](https://github.com/EdinburghNLP/XSum) [[paper](https://arxiv.org/abs/1808.08745)] 85 | - [CNN/DailyMail](https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail) 86 | - NEWSROOM 87 | - Multi-News 88 | - Gigaword 89 | - arXiv 90 | - PubMed 91 | - BIGPATENT 92 | - WikiHow 93 | - Reddit TIFU (long, short) 94 | - AESLC 95 | - BillSum 96 | 97 | ### [Embedding](https://github.com/Nativeatom/NaturalLanguageProcessing/tree/master/Embedding) 98 | - Model 99 | - Word2Vec 100 | - Pre-trained Embedding 101 | - GloVe 102 | - word2vec 103 | - FastText 104 | - Contextual Word Embedding 105 | - ELMo 106 | - GPT 107 | - BERT 108 | - XLNet 109 | - BART 110 | - T5 111 | 112 | ### Sentiment Analysis and Argument Mining 113 | 114 | ### Named Entity Recognition 115 | 116 | ### Tagging, Chunking 117 | - Task 118 | - Word Segmentation 119 | - Syntactic 
Parsing 120 | - Model 121 | - Hidden Markov Model (HMM) 122 | - Conditional Random Fields (CRFs) 123 | - Finetuned Language Models 124 | 125 | ### Syntax, Parsing 126 | - Task 127 | - Constituency Parsing 128 | - Dependency Parsing 129 | - Visually Grounded Syntax Acquisition 130 | - Model 131 | - [Invertible Neural Projections, EMNLP 2018](https://arxiv.org/abs/1808.09111) 132 | - [Unsupervised Recurrent Neural Network Grammars (URNNG), NAACL 2019](https://arxiv.org/abs/1904.03746) 133 | - [Compound PCFG, ACL 2019](https://arxiv.org/abs/1906.10225) 134 | - [Lexical Compound PCFG, TACL 2020](https://www.aclweb.org/anthology/2020.tacl-1.42/) 135 | - [VG-NSL, ACL 2019](https://www.aclweb.org/anthology/P19-1180/) 136 | - Dataset 137 | - [Penn Treebank (PTB)](https://catalog.ldc.upenn.edu/LDC99T42) 138 | 139 | ### Document Analysis 140 | 141 | ### [Sentence-level Semantics](https://github.com/Nativeatom/NaturalLanguageProcessing/tree/master/Sentence-level%20Semantics) 142 | - Tasks 143 | - Semantic Parsing 144 | - AMR-to-text 145 | - Text-to-AMR 146 | - Table-to-text 147 | - Code Generation 148 | - Model 149 | - [TRANX](https://www.aclweb.org/anthology/D18-2002/) 150 | 151 | - Dataset 152 | 153 | ### Semantics: Lexical 154 | - Tasks 155 | - Word Sense Disambiguation 156 | 157 | ### [Information Extraction and Text Mining](https://github.com/Nativeatom/NaturalLanguageProcessing/tree/master/Informaiton%20Extraction%20and%20Text%20Mining) 158 | - Tasks 159 | - Topic Extraction 160 | - Sentiment Extraction 161 | - Aspect Extraction 162 | 163 | ### Machine Translation 164 | - Task 165 | - Machine Translation 166 | - Non-autoregressive Machine Translation 167 | - Word Alignment 168 | - Model 169 | - [Back-translation, ACL 2016](https://www.aclweb.org/anthology/P16-1009/) 170 | - [OpenNMT, ACL 2017](https://www.aclweb.org/anthology/P17-4012/) [[project](https://opennmt.net/)] [[code](https://github.com/OpenNMT/OpenNMT-py)] 171 | - [Transformer, NeurIPS 
2017](https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html) 172 | - Dataset 173 | - WMT 174 | 175 | ### Text Generation 176 | 177 | ### Text Classification 178 | - Task 179 | - Spam Classification 180 | - Sentiment Analysis 181 | - Model 182 | - [CNN-sentence, EMNLP2014](https://www.aclweb.org/anthology/D14-1181.pdf) 183 | - [CharCNN, NeurIPS2015](https://arxiv.org/pdf/1509.01626.pdf) 184 | - Dataset 185 | - [Yelp Dataset](https://www.yelp.com/dataset/challenge) 186 | - [IMDb](https://www.imdb.com/interfaces/) 187 | - [Stanford Sentiment Treebank (SST)](https://nlp.stanford.edu/sentiment/index.html) 188 | 189 | ### Dialogue and Interactive Systems 190 | 191 | ### Question Answering 192 | - Task 193 | - Dataset 194 | - [CNN/DailyMail](https://cs.nyu.edu/~kcho/DMQA/) 195 | - [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) 196 | - Benchmark: F1 86.967 [BERT + Synthetic Self-Training (ensemble)](https://github.com/google-research/bert) *Jan 10, 2019* 197 | - [RACE](https://www.cs.cmu.edu/~glai1/data/race/) 198 | - Benchmark: RACE 83.2, RACE-M 86.5, RACE-H 81.3 [RoBERTa](https://arxiv.org/abs/1907.11692) *July 2019* 199 | 200 | ### Resources and Evaluation 201 | 202 | ### Linguistic Theories and Cognitive Modeling 203 | 204 | ### [Multilinguality](https://github.com/Nativeatom/NaturalLanguageProcessing/tree/master/Multilinguality) 205 | - Task 206 | - Code-Switching 207 | - Multilingual Translation 208 | - Model 209 | - Dataset 210 | - [Linguistic Data Consortium](https://catalog.ldc.upenn.edu/) [[list](https://github.com/isi-nlp/mt_lit_lrec16/tree/master/ontology)] [[ordered by year](https://linguistics.cornell.edu/language-corpora)] 211 | - [CALLFRIEND Mandarin - Mainland Dialect](https://catalog.ldc.upenn.edu/LDC96S55) 212 | - [CALLHOME Mandarin Chinese](https://catalog.ldc.upenn.edu/LDC2008T17) 213 | - [Chinese-English Named Entity List](https://catalog.ldc.upenn.edu/LDC2005T34) 214 | - [List of Chinese 
Dataset](https://github.com/Lab41/sunny-side-up/wiki/Chinese-Datasets) 215 | - [Cantonese Dataset](http://compling.hss.ntu.edu.sg/hkcancor/) 216 | - [SEAME (Mandarin-English Code-Switching in South-East Asia)](https://catalog.ldc.upenn.edu/LDC2015S04) 217 | 218 | ### [Phonology, Morphology and Word Segmentation](https://github.com/Nativeatom/NaturalLanguageProcessing/tree/master/Phonology%20Morphology%20and%20Word%20Segment) 219 | 220 | ### Textual Inference 221 | 222 | ### Vision, Robotics, Speech, Multimodal 223 | 224 | ### Language Modeling 225 | - Tasks 226 | - Model 227 | - N-gram 228 | - [ELMo, NAACL2018](https://www.aclweb.org/anthology/N18-1202.pdf) 229 | - GPT 230 | - [GPT-2, arXiv2019](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) 231 | - [GPT-3, NeurIPS2020](https://arxiv.org/pdf/2005.14165.pdf) 232 | - [BERT, NAACL2019](https://www.aclweb.org/anthology/N19-1423/) 233 | - [RoBERTa, arXiv 2019](https://arxiv.org/pdf/1907.11692.pdf) 234 | - [SpanBERT, TACL 2020](https://www.aclweb.org/anthology/2020.tacl-1.5/) 235 | - Efficient 236 | - [ALBERT, arXiv 2020](https://arxiv.org/abs/1909.11942) 237 | - [SqueezeBERT, SustainNLP@EMNLP 2020](https://www.aclweb.org/anthology/2020.sustainlp-1.17/) 238 | - Domain Specific 239 | - [SciBERT, EMNLP 2019](https://www.aclweb.org/anthology/D19-1371/) 240 | - [BioBERT, Bioinformatics 2019](https://arxiv.org/pdf/1901.08746.pdf) 241 | - [patentBERT, arXiv 2019](https://arxiv.org/abs/1906.02124) 242 | - [FinBERT, arXiv 2020](https://arxiv.org/abs/2006.08097) 243 | - [MedBERT, arXiv 2020](https://arxiv.org/pdf/2005.12833.pdf) [[code](https://github.com/ZhiGroup/Med-BERT)] 244 | - [ClinicalBERT, CHIL 2020](https://arxiv.org/abs/1904.05342) 245 | - [LEGAL-BERT, EMNLP Findings 2020](https://www.aclweb.org/anthology/2020.findings-emnlp.261/) 246 | - [Tutorial](https://mccormickml.com/2020/06/22/domain-specific-bert-tutorial/) 247 | - Language Specific [[Latin 
BERT](https://arxiv.org/abs/2009.10053), [German BERT](https://deepset.ai/german-bert), [Italian BERT](http://ceur-ws.org/Vol-2481/paper57.pdf), [Chinese BERT](https://arxiv.org/abs/2004.13922)] 248 | - [BERTology, TACL 2020](https://www.aclweb.org/anthology/2020.tacl-1.54/) 249 | - [XLNet, NeurIPS2019](https://arxiv.org/pdf/1906.08237.pdf) 250 | - [MASS, ICML2019](https://arxiv.org/pdf/1905.02450.pdf) [[code](https://github.com/microsoft/MASS)] 251 | - [ELECTRA, ICLR2020](https://openreview.net/forum?id=r1xMH1BtvB) [[code](https://github.com/google-research/electra)] 252 | - [T5, JMLR2020](https://arxiv.org/abs/1910.10683) 253 | - [BART, ACL2020](https://www.aclweb.org/anthology/2020.acl-main.703/) 254 | - Finetuning 255 | - Invasive (LM not fixed) 256 | - Regular finetuning 257 | - [Re-initialization for few-shot learning](https://arxiv.org/pdf/2006.05987.pdf) ICLR2021 258 | - Non-invasive (LM fixed) 259 | - [Prefix-tuning](https://arxiv.org/pdf/2101.00190.pdf), arXiv2021 260 | - Language Models as 261 | - Evaluation metric: [BERTScore](https://arxiv.org/pdf/1904.09675.pdf), ICLR2020 262 | - Few-shot learners 263 | - [Bias in few-shot examples](https://arxiv.org/pdf/2102.09690.pdf), arXiv2021 264 | - Knowledge bases: [EMNLP2019](https://www.aclweb.org/anthology/D19-1250.pdf), [Tutorial@AAAI2021](https://usc-isi-i2.github.io/AAAI21Tutorial/) 265 | - Dataset 266 | - [CommonCrawl](https://commoncrawl.org/) 267 | - WikiText 268 | - STORIES 269 | - [C4](https://arxiv.org/abs/1910.10683) [[TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/c4)] 270 | 271 | ### Computational Social Science and Social Media 272 | 273 | ### Discourse and Pragmatics 274 | 275 | ### Information Retrieval and Text Mining 276 | 277 | ### Language Grounding to Vision, Robotics and Beyond 278 | - Papers 279 | - [Experience Grounds Language, EMNLP2020](https://arxiv.org/abs/2004.10151) 280 | 281 | ### Machine Learning for NLP 282 | 283 | ### Theory and Formalism in NLP 284 | 285 | ### Ethics in NLP 286 | 287 | 
### Commonsense Knowledge 288 | - Tasks 289 | - Fact Verification 290 | - Commonsense Reasoning 291 | - Word-level Rationales 292 | - Factually Consistent Generation 293 | 294 | - Model 295 | - [ConceptNet, AAAI2017](https://conceptnet.io/) 296 | - [COMET, ACL2019](https://mosaickg.apps.allenai.org/) [[paper](https://www.aclweb.org/anthology/P19-1470/)] 297 | 298 | - Dataset 299 | - [FEVER](https://arxiv.org/abs/1803.05355) 300 | - [FEVER 2.0 Shared Task, EMNLP 2019](https://www.aclweb.org/anthology/D19-6601/) [[FEVER@EMNLP2021](https://fever.ai/index.html)] 301 | - [CommonGen, EMNLP Findings 2020](https://inklab.usc.edu/CommonGen/) [[paper](https://arxiv.org/abs/1911.03705)] 302 | - [VitaminC, NAACL 2021](https://github.com/TalSchuster/VitaminC) [[paper](https://arxiv.org/abs/2103.08541)] 303 | 304 | ### Interpretability 305 | 306 | ### NLP Applications 307 | - Tasks 308 | - Grammatical Error Correction (GEC) [[BEA@NAACL2018](https://www.cs.rochester.edu/~tetreaul/naacl-bea13.html), [BEA@ACL2019](https://sig-edu.org/bea/2019), [BEA@ACL2020](https://sig-edu.org/bea/2020), [BEA@EACL2021](https://sig-edu.org/bea/current)] 309 | - Lexical Substitution 310 | - Lexical Simplification 311 | - Model 312 | - BERT-based Lexical Substitution/GEC [[ACL2019](https://www.aclweb.org/anthology/P19-1328/), [AAAI2020](https://arxiv.org/pdf/2001.03521.pdf)] 313 | - Dataset 314 | - [ETS](https://www.ets.org/research/contact/data_requests/) 315 | 316 | ## Resources and Benchmarks 317 | - [Huggingface Datasets](https://huggingface.co/datasets) 318 | - [GLUE](https://gluebenchmark.com/) 319 | - [SuperGLUE](https://super.gluebenchmark.com/) 320 | - Leaderboards 321 | - [NLP-Progress](http://nlpprogress.com/) 322 | - [PapersWithCode](https://paperswithcode.com/sota) 323 | 324 | ## Interesting NLP 325 | - [Google Books Ngram Viewer](https://books.google.com/ngrams) 326 | 327 | 328 | ## Packages 329 | - Machine Learning Packages and Frameworks 330 | - scikit-learn 331 | - TensorFlow 332 | - 
Caffe2 333 | - PyTorch 334 | - MXNet 335 | - NLTK 336 | - gensim 337 | - jieba 338 | - Stanford NLP 339 | - Transformers (Hugging Face) 340 | 341 | ## How to Contribute 342 | 343 | If you are interested in this project, you are very welcome to join! 344 | 345 | - Regular participation: simply fork the repository and open pull requests. 346 | - Uploading files: please do **not** commit files directly to the repository, since that bloats the git history. Instead, add a **hyperlink** to the file. If the file is already available online (papers, for example, all have links), just add the link; for files or data of your own that you want to share, given the state of cloud-storage services in mainland China, please proceed as follows: 347 | - (Inside mainland China) No good solution has been found so far; share a link, for example to your own cloud drive. 348 | - (Outside mainland China) First upload the file at [UPLOAD](https://my.pcloud.com/#page=puplink&code=4e9Z0Vwpmfzvx0y2OqTTTMzkrRUz8q9V) (no account registration required); after the upload succeeds, find the file at [DOWNLOAD](https://my.pcloud.com/publink/show?code=kZWtboZbDDVguCHGV49QkmlLliNPJRMHrFX) and share its link. 349 | 350 | ## How to Collaborate on the Project 351 | 352 | [Learn how to collaborate through GitHub](http://hucaihua.cn/2016/12/02/github_cooperation/) 353 | 354 | [Keep a fork up to date](https://jinlong.github.io/2015/10/12/syncing-a-fork/) 355 | 356 | [How to commit in git](http://blog.csdn.net/u012150179/article/details/17172211) 357 | 358 | [Fetch and Merge in Git](https://www.oschina.net/translate/git-fetch-and-merge?print) 359 | 360 | #### [Contributors](https://github.com/Nativeatom/NaturalLanguageProcessing/blob/master/Contributors.md) 361 | 362 | -------------------------------------------------------------------------------- /Sentence-level Semantics/README.md: -------------------------------------------------------------------------------- 1 | ### Sentence-level Semantics 2 | 3 | *At the other end of the spectrum, in the traditional semantic parsing literature, the problem is framed as predicting compositional semantic representations. These are mainly geared towards question answering rather than task completion, and are usually directly executed against a knowledge base. Some of the standard datasets in this area include GeoQuery [24] and WebQuestions [2]. Within both of these areas, neural approaches have supplanted previous feature-engineering-based approaches in recent years [10, 14]. 
In the context of tree-structured semantic parsing, some other interesting approaches include Seq2Tree [7] which modifies the standard Seq2Seq decoder to better output trees; SCANNER [5, 4] which extends the RNNG formulation specifically for semantic parsing such that the output is no longer coupled with the input; and TRANX [23] and Abstract Syntax Network [19] which generate code along a programming language schema. For graph-structured semantic parsing [1, 11], SLING [21] produces graph-structured parses by modeling semantic parsing as a neural transition parsing problem with a more expressive transition tag set. While graph structures can provide more detailed semantics, they are more difficult to parse and can be an overkill for understanding task oriented utterances.* *[Improving Semantic Parsing for Task Oriented Dialog](https://arxiv.org/pdf/1902.06000.pdf)*
4 | 
5 | 
6 | - Tasks
7 |     - Semantic Parsing
8 |     - Code Generation
9 |         - NL2SQL
10 |         - NL2Python
11 |             - Seq2Seq based
12 |             - Retrieval-based
13 |                 - [Retrieval-Based Neural Code Generation](https://arxiv.org/abs/1808.10025)
14 |             - CNN based
15 |                 - [A Grammar-Based Structural CNN Decoder for Code Generation AAAI 19'](https://arxiv.org/abs/1811.06837)
16 |             - Transition-based
17 |                 - [TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation ACL 18'](https://arxiv.org/abs/1810.02720)
18 |         - NL2Java
19 |         - Python2Java
20 | 
21 | - Models
22 |     - TRANX (*[TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation](https://arxiv.org/abs/1810.02720)*) **ACL 18'**
23 | 
24 | 
25 | ### Dataset and Benchmark
26 | #### HearthStone
27 | * ##### Description
28 | * ##### Download
29 |     [Github](https://github.com/deepmind/card2code) - Wang Ling *[Latent Predictor Networks for Code Generation](https://arxiv.org/abs/1603.06744)* **ACL 16'**
30 | 
31 | * ##### Benchmark
32 |     - Zeyu Sun *[A Grammar-Based Structural CNN Decoder for Code Generation](https://arxiv.org/abs/1811.06837)* **AAAI 19'**
33 |         - Accuracy: 30.3%
34 |         - BLEU: 79.6%
35 | 
36 | #### Django
37 | * ##### Description
38 | * ##### Download
39 |     [Github](https://github.com/odashi/ase15-django-dataset) - Yusuke Oda *[Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation](https://ieeexplore.ieee.org/document/7372045)* **ASE 15'**
40 | 
41 | * ##### Benchmark
42 |     - Pengcheng Yin *[TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation](https://arxiv.org/abs/1810.02720)* **ACL 18'**
43 |         - Accuracy: 73.7%
44 | 
45 | #### WikiSQL
46 | * ##### Description
47 | * ##### Download
48 |     [Github](https://github.com/salesforce/WikiSQL) - Victor Zhong *[Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning](https://arxiv.org/abs/1709.00103)* **ArXiv 18'**
49 | 
50 | * ##### Benchmark
51 | 
52 | #### ATIS
53 | * ##### Description
54 | * ##### Download
55 | 
56 | * ##### Benchmark
57 | 
58 | #### JOBS
59 | * ##### Description
60 | * ##### Download
61 | 
62 | * ##### Benchmark
63 | 
64 | #### GEO
65 | * ##### Description
66 | * ##### Download
67 | 
68 | * ##### Benchmark
69 | 
-------------------------------------------------------------------------------- /Summerization/README.md: --------------------------------------------------------------------------------
1 | ## Summarization
2 | 
3 | ### Task
4 | 
5 | ### Dataset and Benchmark
6 | #### CNN/Daily Mail
7 | * ##### Description
8 | * ##### Download
9 |     - [DeepMind](http://cs.nyu.edu/~kcho/DMQA/)
10 |     - [Tokenized](https://github.com/JafferWilson/Process-Data-of-CNN-DailyMail)
11 | 
12 | * ##### Benchmark
13 |     - Abigail See *[Get To The Point: Summarization with Pointer-Generator Networks](https://arxiv.org/pdf/1704.04368.pdf)* **ACL 17'**
14 |         - ROUGE-1 39.53%
15 |         - ROUGE-2 17.28%
16 |         - ROUGE-L 36.38%
17 | 
18 | #### ACL Anthology Network
19 | * ##### Description
20 | * ##### Download
21 |     - [ACL Anthology](http://clair.eecs.umich.edu/aan/index.php)
22 | 
23 | * ##### Benchmark
24 |     - Qingyun Wang *[Paper Abstract Writing through Editing Mechanism](https://arxiv.org/pdf/1805.06064.pdf)* **ACL 18'**
25 |         - ROUGE-L 20.3%
26 |         - METEOR 14%
27 | 
-------------------------------------------------------------------------------- /Tagging Chunking Syntax and Parsing/README.md: --------------------------------------------------------------------------------
1 | ### Tagging, Chunking, Syntax and Parsing
2 | 
3 | 
-------------------------------------------------------------------------------- /_config.yml: --------------------------------------------------------------------------------
1 | theme: jekyll-theme-cayman --------------------------------------------------------------------------------