├── README.md ├── chatbot ├── A Compare-Aggregate Model for Matching Text Sequences.md ├── A Knowledeg-Grounded Neural Conversation Model.md ├── A Survey on Dialogue Systems Recent Advances and New Frontiers.md ├── AliMe Chat A Sequence to Sequence and Rerank based Chatbot Engine.md ├── Attentive Pooling Networks.md ├── Bilateral Multi-Perspective Matching for Natural Language Sentences.md ├── Chat More Deepening and Widening the Chatting Topic via A Deep Model.md ├── Chinese Poetry Generation with Planning based Neural Network.md ├── Deep Reinforcement Learning for Dialogue Generation.md ├── Emotional Chatting Machine:Emotional Conversation Generation with Internal and External Memory.md ├── Generating Long and Diverse Responses with Neural Conversation Models.md ├── IRGAN:A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models.md ├── Incorporating Copying Mechanism in Sequence-to-Sequence Learning.md ├── LSTM-based Deep Learning Models For Non-factoid Answer Selection.md ├── Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders.md ├── MOJITALK Generating Emotional Responses at Scale.md ├── Modeling Multi-turn Conversation with Deep Utterance Aggregation.md ├── Neural Responding Machine for Short-Text Conversation.md ├── Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks.md ├── README.md ├── Sequence to Backward and Forward Sequences A Content-Introducing Approach to Generative Short-Text Conversation.md ├── Sequential Match Network-A New Architecture for Multi-trun Response Selection in Retrieval-based Chatbots.md ├── The Design and Implementation of XiaoIce, an Empathetic Social Chatbot.md ├── Topic Aware Neural Response Generation.md └── Topic-to-Essay Generation with Neural Networks.md ├── classification ├── Attention-based LSTM for Aspect-level Sentiment Classification.md ├── Bag of Tricks for Efficient Text Classification.md ├── Baselines and Bigrams Simple Good Sentiment and Topic Classification.md ├── Classifying Relations by Ranking with Convolutional Neural Networks.md ├── Convolutional Neural Networks for Sentence Classification.md ├── ETH-DS3Lab at SemEval-2018 Task 7 Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction.md ├── Hierarchical Attention Networks for Document Classification.md ├── How to Fine-Tune BERT for Text Classification.md ├── SGM Sequence Generation Model for Multi-Label Classification.md ├── Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks.md ├── Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.md └── readme.md ├── embedding ├── A Simple Approach to Learn Polysemous Word Embeddings.md ├── A Simple But Tough to Beat Baseline for Sentence Embeddings.md ├── Advances in Pre-Training Distributed Word Representations.md ├── An Efficient Framework for Learning Sentence Representations.md ├── Baseline Needs More Love On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms.md ├── Deep contextualized word representations.md ├── Distributed Representations of Sentences and Documents.md ├── Enriching Word Vectors with Subword Information.md ├── Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention.md ├── Learning Semantic Similarity for Very Short Texts.md ├── Learning Semantic Textual Similarity from Conversations.md ├── Linguistic Regularities in Continuous 
Space Word Representations.md ├── Mimicking Word Embeddings using Subword RNNs.md ├── On the Dimensionality of Word Embedding.md ├── Order-Embeddings of Images and Language.md ├── Quantifying Mental Health from Social Media with Neural User Embeddings.md ├── Semi-Supervised Sequence Modeling with Cross-View Training.md ├── Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.md └── Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation.md ├── information retrieval └── Personalizing Search via Automated Analysis of Interests and Activities.md ├── leaderboard.md ├── multi-task ├── Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts.md ├── Multi-Task Deep Neural Networks for Natural Language Understanding.md └── Overcoming catastrophic forgetting in neural networks.md ├── name entity recognition ├── Chinese NER Using Lattice LSTM.md ├── Neural Architectures for Fine-grained Entity Type Classification.md ├── Neural Architectures for Named Entity Recognition.md └── Semi-supervised Multitask Learning for Sequence Labeling.md ├── neural network ├── A Unified Architecture for Natural Language Processing.md ├── Adding Gradient Noise Improves Learning for Very Deep Networks.md ├── An overview of gradient descent optimization algorithms.md ├── Attentive Language Models.md ├── Calculus on Computational Graphs Backpropagation.md ├── Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models A Generative Approach to Sentiment Analysis.md ├── Deep Nets Don't Learn Via Memorization.md ├── Dialog Context Language Modeling with Recurrent Neural Networks.md ├── Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.md ├── Exploring Sparsity in Recurrent Neural Networks.md ├── How Does Batch Normalization Help Optimization (No, It Is Not About Internal Covariate Shift) .md ├── How to Construct Deep Recurrent Neural Networks.md ├── Is it Time to Swish Comparing Deep Learning Activation Functions Across NLP tasks.md ├── Learning to Generate Reviews and Discovering Sentiment.md ├── Pointer Sentinel Mixture Models.md ├── Recurrent Dropout without Memory Loss.md ├── Recurrent Neural Network Regularization.md ├── Semi-supervised Sequence Learning.md ├── Stacked Denoising Autoencoders.md ├── Targeted Dropout.md ├── Tying Word Vectors and Word Classifiers A Loss Framework for Language Modeling.md ├── Understanding Deep Learning Requires Rethinking Generalization.md ├── Understanding the difficulty of training deep feedforward neural networks.md ├── Using the Output Embedding to Improve Language Models.md └── Visualizing and Understanding Recurrent Networks.md ├── others └── Neural Machine Translation of Rare Words with Subword.md ├── pic ├── Chinese NER Using Lattice LSTM-pic1.png ├── Chinese NER Using Lattice LSTM-pic2.png ├── Chinese NER Using Lattice LSTM-pic3.png ├── Chinese NER Using Lattice LSTM-pic4.png ├── Chinese NER Using Lattice LSTM-pic5.png ├── Chinese_Poetry_Generation_with_Planning_based_Neural_Network_flow.png ├── Chinese_Poetry_Generation_with_Planning_based_Neural_Network_model.png ├── Deep & Cross Network for Ad Click Predictions-pic1.png ├── Deep & Cross Network for Ad Click Predictions-pic2.png ├── Deep & Cross Network for Ad Click Predictions-pic3.png ├── Deep & Cross Network for Ad Click Predictions-pic4.png ├── Deep Content-User Embedding Model for Music Recommendation.png ├── Deep Neural Networks for YouTube 
Recommendations-pic1.png ├── Deep Neural Networks for YouTube Recommendations-pic2.png ├── Deep Neural Networks for YouTube Recommendations-pic3.png ├── Deep Neural Networks for YouTube Recommendations-pic4.png ├── DeepFM-pic1.png ├── DeepFM-pic2.png ├── DeepFM-pic3.png ├── Distributed Representations of Sentences and Documents.png ├── Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction.png ├── Factorization Machines-pic1.png ├── HowtoFine-TuneBERTforTextClassification-1.png ├── HowtoFine-TuneBERTforTextClassification-2.png ├── HowtoFine-TuneBERTforTextClassification-3.png ├── HowtoFine-TuneBERTforTextClassification-4.png ├── HowtoFine-TuneBERTforTextClassification-5.png ├── HowtoFine-TuneBERTforTextClassification-6.png ├── HowtoFine-TuneBERTforTextClassification-7.png ├── HowtoFine-TuneBERTforTextClassification-8.png ├── HowtoFine-TuneBERTforTextClassification-9.png ├── Implicit Feedback for Recommender Systems-pic1.png ├── Implicit Feedback for Recommender Systems-pic2.png ├── Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic1.png ├── Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic2.png ├── Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic3.png ├── Multi-Head Attention.png ├── Quantifying Mental Health from Social Media with Neural User Embeddings.png ├── Scaled Dot-Product Attention.png ├── Semi-Supervised Sequence Modeling with Cross-View Training.png ├── Stacked_Denoising_Autoencoders_autoencoders.PNG ├── Stacked_Denoising_Autoencoders_denoising_autoencoders.PNG ├── Stacked_Denoising_Autoencoders_stacked_denoising_autoencoders.png ├── Topic-to-Essay_Generation_with_Neural_Networks3.PNG ├── Wide & Deep Learning for Recommender Systems.png ├── copyNet_copy_mode.PNG ├── copyNet_generate_mode.PNG ├── copyNet_model.PNG ├── copyNet_state.PNG ├── copyNet_vocab.PNG ├── mimick model architecture.png └── transformer.png ├── reading comprehension └── Word or Characters, Fine-grained Gating For Reading Comprehension.md ├── reasoning └── A simple neural network module for relational reasoning.md ├── recommender system ├── Deep & Cross Network for Ad Click Predictions.md ├── Deep Content-User Embedding Model for Music Recommendation.md ├── Deep Neural Networks for YouTube Recommendations.md ├── DeepFM:A Factorization-Machine based Neural Network for CTR Prediction.md ├── Factorization Machines.md ├── Implicit Feedback for Recommender Systems.md └── Wide & Deep Learning for Recommender Systems.md ├── regularization ├── Adversarial Training Methods for Semi-Supervised Text Classification.md ├── Batch Normalization Accelerating Deep Network Training by Reducing Internal Covariate Shift.md ├── Batch Normalized Recurrent Neural Networks.md ├── Batch Renormalization Towards Reducing Minibatch Dependence in Batch-Normalized Models.md ├── Distributional Smoothing with Virtual Adversarial Training.md ├── Dropout is a special case of the stochastic delta rule faster and more accurate deep learning.md ├── Explaining and Harnessing Adversarial Examples.md ├── Layer Normalization.md ├── Recurrent Batch Normalization.md ├── Regularization of Neural Networks using DropConnect.md ├── Weight Normalization A Simple Reparameterization to Accelerate Training of Deep Neural Networks.md └── curriculum learning.md ├── self-supervised learning └── A Simple Framework for Contrastive Learning of Visual Representations.md ├── sequence to sequence 
├── Fluency Boost Learning and Inference for Neural Grammatical Error Correction.md ├── Get To The Point Summarization with Pointer-Generator Networks.md ├── Grammar as a Foreign Language.md ├── Learning to Skim Text.md ├── Neural Machine Translation by Jontly Learning to Align and Translate.md ├── On Using Very Large Target Vocabulary for Neural Machine Translation.md ├── Pointer Networks.md ├── Sequence to Sequence Learning with Neural Networks.md └── Skip-Thought Vectors.md └── transformer ├── ALBERT A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS.md ├── Attention is All You Need.md ├── BERT Pre-training of Deep Bidirectional Transformers for Language Understanding.md ├── ERNIE Enhanced Representation through Knowledge Integration.md ├── Improving Language Understanding by Generative Pre-Training.md ├── RoBERTa_A Robustly Optimized BERT Pretraining Approach.md └── Universal Sentence Encoder.md /chatbot/A Compare-Aggregate Model for Matching Text Sequences.md: -------------------------------------------------------------------------------- 1 | ### A Compare-Aggregate Model for Matching Text Sequences 2 | 3 | #### note: 4 |   该paper更像是一篇实验性论文,在“general”框架下对其中某一块使用不同的方法比较、组合,该框架思路大致如下: 5 | 6 | 1. preprocessing。将sentence转化为matrix,paper使用了LSTM,故每个step emit一个vector; 7 | 2. attention。将Q得到的matrix与A每个step得到的vector进行计算,得到A中每个vector对应于Q加权后的vector h,类似于seq2seq中attention的机制; 8 | 3. comparison。对A中每个step的vector与其对应h进行计算,得到vector t; 9 | 4. aggregation。对步骤3得到的vector t序列使用CNN进行整合。 10 | 11 | 而paper主要是对步骤3不同的整合方法进行了实验、比较,使用的方法有: 12 | 13 | * standard neural network layer(见式3) 14 | * neural tensor network(见式4) 15 | * Euclidean distance or cosine similarity(见式5) 16 | * element-wise subtraction(见式6) 17 | * element-wise multiplication(见式7) 18 | 19 | #### comment: 20 |   通过实验得到(subtraction+multiplication+nn)结果比(Euclidean distance or cosine similarity)效果更好。原因可能在于,前一种方式得到的是高维的matrix,而后一种方式只是二维的向量,表现能力比较弱,高维包含了更细致的信息; 21 | 22 | #### highlight: 23 |   We can see that SUB is closely related to Euclidean distance in that Euclidean distance is the sum of all the entries of the vector tj produced by SUB. But by not summing up these entries, SUB preserves some information about the different dimensions of the original two vectors. Similarly, MULT is closely related to cosine similarity but preserves some information about the original two vectors. 24 | 25 | #### question: 26 | 1. QA matching的方式用于闲聊真的合适?paper3.4部分可视化结果表明,匹配正确的case中,model更多学到了Q与A相同词之间的关系,而闲聊中,这种相同词的情况比较少。 27 | 2. aggregation用LSTM效果会更好? 28 | 29 | #### [code](https://github.com/shuohangwang/SeqMatchSeq) 30 | -------------------------------------------------------------------------------- /chatbot/A Knowledeg-Grounded Neural Conversation Model.md: -------------------------------------------------------------------------------- 1 | ### A Knowledeg-Grounded Neural Conversation Model 2 | 3 | #### note: 4 |   paper提出使用data-driven和knowledge-grounded方式(23M general dialogue dataset+1M grounded dialogue dataset+1.1M tips),使seq2seq生成更加contentful的response,大致思路如下: 5 | 6 | 1. 对post信息进行encode,得到vector u; 7 | 2. 使用post文本信息检索与之相关的tips(根据tf-idf计算相似度得到),并用式2得到vector c; 8 | 3. 使用u计算各个c的权重,并进行加权求和(式4),最终相加得到seq2seq中decoder部分的隐层输入(式5)。 9 | 10 | #### comment: 11 | 1. paper提出了使用post“相关的其它信息”使生成的回复更具实质内容,这个出发点不错,引入其它信息从某个角度讲是降低了post的“熵”; 12 | 2. 模型在deploy某个post的其它information时,用了类似attention的机制,每条tip信息输出权重由post得到的vector u衡量,这一trick在多篇paper都有体现,而且效果还不错; 13 | 3. 
用BLEU作为对话系统的衡量指标,paper最好的BLEU得到为1.08,而翻译模型其BLEU一般在30+,使用BLEU评价生成式对话系统的好坏确实有失偏颇。 14 | -------------------------------------------------------------------------------- /chatbot/A Survey on Dialogue Systems Recent Advances and New Frontiers.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   本文是一篇survey文章,介绍了任务型和非任务型使用深度学习的主流技术路线,其中非任务型主要包含以下几点: 3 | 1. seq2seq(用pair对训练,但会出现生成多样性-一直觉得这个是伪命题、上下文一致性,同时在其中可引入主题,情感,dialog act等信息,除了生成,也可改成rank形式,这方面Google有多篇paper) 4 | 2. memory network(也是用pair训练,但通过memory机制引入了额外信息,除了生成,也可改成rank形式,这方面Facebook有多篇paper) 5 | 3. retrieval-based(在特定的知识领域运用较多,而且相对成熟,如:客服领域) 6 | 7 | #### comment: 8 | 1. 文章所在概述算是比较全面,比较建议非任务型方面再加一篇2017年3月份的论文:[Learning Discourse-level Diversity for Neural Dialog Models using CVAE](https://github.com/xwzhong/papernote/blob/master/chatbot/Learning%20Discourse-level%20Diversity%20for%20Neural%20Dialog%20Models%20using%20Conditional%20Variational%20Autoencoders.md),个人认为CVAE是一个不错的方向; 9 | 2. 非任务型生成式有因“信息不足”有各种缺陷,检索式在大语料下会给人出乎意料的惊喜,但不够细致,结合此两者就目前来讲或许是不错的商用方案。 10 | -------------------------------------------------------------------------------- /chatbot/AliMe Chat A Sequence to Sequence and Rerank based Chatbot Engine.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper提出了一种结合"检索+seq2seq"的对话模型,整体思路是,先使用bm25检索post,得到最可能的k个问答对,然后再用seq2seq训练得到的生成模型对k个问答对进行重新排序,其主要步骤如下: 3 | 1. First, we use the IR model to retrieve a set of k candidate QA pairs(k = 10); 4 | 2. Second, we pair q with each candidate answer ri and calculate a confidence score o(ri) = s(q, ri) for each pair using the scoring function in Eqn. 2 of the rerank model; 5 | 3. Third, we consider the answer r with the maximal score o(r) = max o(ri): if o(r) ≥ T, take the answer r; otherwise output a reply r 0 from the generation based model; 6 | 7 | 8 | #### comment: 9 | 1. 文中对比了seq2seq是否加入attention机制,经实验,加了attention机制的模型表现更优; 10 | 2. 使用seq2seq训练得到的模型对检索得到的问答对进行重排,这点比较新颖,能表达数据更细致的信息,但是仍未避免seq2seq用于生成对话内容时的缺陷(通用性回复多,句子偏短等), 只是将出现这种类型的回复比例降低了; 11 | 3. 实验给出的最高评分(suitable数+neutral数占总回复比)为60.01%. 12 | 13 | #### more reading: 14 |   [揭秘阿里小蜜:基于检索模型和生成模型相结合的聊天引擎](http://blog.csdn.net/uwr44uouqcnsuqb60zk2/article/details/78849003) 15 | -------------------------------------------------------------------------------- /chatbot/Attentive Pooling Networks.md: -------------------------------------------------------------------------------- 1 | ### Attentive Pooling Networks 2 | 3 | #### note: 4 |   paper是在QA-LSTM/CNN的基础上进行的改进,在介绍QA-LSTM论文中,作者提出了引入单向attention的机制,而在本文中则进行了加强,引入了双向attention,其考虑是基于question和answer能互相影响其生成的vector。 5 | 6 | #### comment: 7 | 1. 该paper的亮点在于如何做到question和answer之间相互影响其生成的vector的表达,而原文中的方法有点巧妙,又有点像“杂交”; 8 | 2. 实验结果显示,在insuranceQA test1的结果:QA-LSTM-max66.6%,QA-LSTM attention avg68.1%,AP-LSTM71.7%。 9 | 10 | #### highlight: 11 |   In the context of pair-wise ranking or classification with neural networks, AP enables the pooling layer to be aware of the current input pair, in a way that information from the two input items can directly influence the computation of each other's representations. 
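
#### sketch:
  A minimal PyTorch sketch of the two-way pooling summarized above, written for this note rather than taken from the paper's released code: a soft alignment G = tanh(Q·U·Aᵀ) between question and answer hidden states, row-/column-wise max pooling, and softmax weights, so that each side's pooled vector depends on the other; the cosine score can then feed the usual margin/hinge ranking loss used by these QA models. The module name, initialization, and exact shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Two-way pooling: Q and A attend to each other before max/softmax pooling."""
    def __init__(self, hidden_dim):
        super().__init__()
        # U parameterises the soft alignment G = tanh(Q U A^T)
        self.U = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.01)

    def forward(self, Q, A):
        # Q: (batch, M, d) question hidden states; A: (batch, L, d) answer hidden states
        G = torch.tanh(Q @ self.U @ A.transpose(1, 2))        # (batch, M, L) alignment scores
        att_q = F.softmax(G.max(dim=2).values, dim=-1)        # importance of each question step
        att_a = F.softmax(G.max(dim=1).values, dim=-1)        # importance of each answer step
        r_q = torch.bmm(att_q.unsqueeze(1), Q).squeeze(1)     # (batch, d) pooled question vector
        r_a = torch.bmm(att_a.unsqueeze(1), A).squeeze(1)     # (batch, d) pooled answer vector
        return F.cosine_similarity(r_q, r_a, dim=-1)          # matching score per pair

# toy usage: batch of 2, question length 8, answer length 20, hidden size 16
pool = AttentivePooling(16)
score = pool(torch.randn(2, 8, 16), torch.randn(2, 20, 16))   # shape (2,)
```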
12 | -------------------------------------------------------------------------------- /chatbot/Bilateral Multi-Perspective Matching for Natural Language Sentences.md: -------------------------------------------------------------------------------- 1 | ### Bilateral Multi-Perspective Matching for Natural Language Sentences 2 | 3 | #### note: 4 |   在两个sentence进行match时,paper提出当前sentence每个step emit的vector与另一sentence所有step emit的vector相关,思路步骤大致如下: 5 | 6 | 1. Given two sentences P and Q, our model first encodes them with a BiLSTM encoder. 7 | 2. Next, we match the two encoded sentences in two directions P against Q and Q against P. In each matching direction, each time step of one sentence is matched against all timesteps of the other sentence from multiple perspectives. 8 | 3. Then, another BiLSTM layer is utilized to aggregate the matching results into a fixed-length matching vector. 9 | 4. Finally, based on the matching vector, a decision is made through a fully connected layer. 10 | 11 | #### comment: 12 | 1. 文中提到的Bilateral与3.2部分w的维度l对应,可以理解为:同样的计算方法,由于初始空间位置不同,从而学到不同的特点,eg:对于式子a-b,对a和b输入不同的值能得到不同的结果,但是都是进行减法运算。不过更进一步,作者仍没有很好地解释初始化的l个向量有何更确切的含义。 13 | 2. 模型的关键在于步骤2——如何衡量vector与matrix的关系,文中提出了四种不同的策略,其中max attentive matching的效果最好。 14 | -------------------------------------------------------------------------------- /chatbot/Chat More Deepening and Widening the Chatting Topic via A Deep Model.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper针对生成式闲聊问题,主体架构以seq2seq模型为基础,提出了一种使回复更deep和wide的方法。fig1详细地介绍了deep和wide的含义,所谓deep就是使之前会话的内容进行更深层次的分析(如果此前提到了下雨,这时可强调“雨确实是大”,wide即从此前话题进行关联扩展(针对上文提到下雨,可回复“你带伞了吗?”)。该模型主要分四部分: 3 | 1. 使用tf-idf进行关键词提取(此前的历史会话暂称为history,当前回合用户说的话为post,待回复为response)。history+post中提取到的关键词构成context keyword集合,由context keyword集合筛选得到deeper keyword,最后由response提取得到wider keyword集合; 4 | 2. global channel。用GRU对句子信息进行编码,得到所有词的高维输出; 5 | 3. deep channel。从context keyword到deeper keyword过程,用了MLP组件计算context keyword各个词的权重; 6 | 4. wide channel。用attention based RNN结合global channel信息+context keyword来预测wider keyword。 7 | 8 | #### comment: 9 | 1. paper提出的方法过于依赖关键词,在实验部分,也是根据关键词进行了筛选(在response中至少包含大于某阈值的两个keyword); 10 | 2. post加入resp keyword训练,进行rerank时通用性回复score仍然较低,比较合理的方式可能是限制性条件下定向选取。 11 | 12 | #### more: 13 | 1. [通过深度模型加深和拓宽聊天话题,让你与机器多聊两句](https://www.jiqizhixin.com/articles/2018-04-24-4) 14 | 2. [Chat More: Deepening and Widening the Chatting Topic via A Deep Model (code)](https://sigirdawnet.wixsite.com/dawnet) 15 | -------------------------------------------------------------------------------- /chatbot/Chinese Poetry Generation with Planning based Neural Network.md: -------------------------------------------------------------------------------- 1 | ### Chinese Poetry Generation with Planning based Neural Network 2 | 3 | #### note: 4 |   文章介绍了根据一个句子来生成古诗的模型,流程如下: 5 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese_Poetry_Generation_with_Planning_based_Neural_Network_flow.png) 6 | 7 | 实施步骤如下: 8 | 1. 使用textrank提取关键词; 9 | 2. 扩展词数(使其与待生成的古诗行数相同): 10 | + 方案一:利用训练数据中的古诗,使用textrank抽取一个关键词,每行各有一个,然后利用rnn训练并预测; 11 | + 方案二:Knowledge-based,利用百科、wordnet进行扩展; 12 | 3. 古诗按照如下结构一行一行生成,使用attention引入当前行一定要出现的keyword(生成第二行时,第一行的信息和第二行的keyword作为上下文信息),结构图如下: 13 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese_Poetry_Generation_with_Planning_based_Neural_Network_model.png) 14 | 15 | #### comment: 16 | 1. 文章主要亮点在于生成古诗时,确保了keyword一定会出现在句子中; 17 | 2. 
keyword表示部分建议使用向量进行mask,用于区分上一句古诗(假设不是生成第一句)。 18 | 19 | 20 | #### more reading: 21 | 1. [Chinese Poetry Generation with Planning based Neural Network](https://x-algo.cn/index.php/2018/03/20/chinese-poetry-generation-with-planning-based-neural-network/) 22 | -------------------------------------------------------------------------------- /chatbot/Deep Reinforcement Learning for Dialogue Generation.md: -------------------------------------------------------------------------------- 1 | ### Deep Reinforcement Learning for Dialogue Generation 2 | 3 | #### note: 4 |   paper想通过引入RL机制来解决使用seq2seq做对话系统时遗留的难题,如通用性回复较多。在具体实现中,作者首先使用seq2seq方法pre-train一个base模型,最后使用RL机制,以两个agent互相对话最终得到的reward来调整base model的参数。 5 | 6 | #### comment: 7 | 1. 使用RL的过程很清晰,定义了RL机制涉及到的action,state,policy,reward,可以当做RL的简单应用学习; 8 | 2. 纵观全文,训练结果的好坏取决于reward公式的设计;在paper中,Ease of answer设计有以偏概全的嫌疑(你不能直接说many of these responses are likely to fall into similar regions in the vector space,需要更科学的解释或证明); 9 | 3. 文章使用RL机制时,有种“为了实现对话特点而设计”,从个人角度观点出发,更应该从“对话目的”角度来设计,而且,简单的使用RL机制来实现对话存疑。 10 | 11 | #### code: 12 | * [lua](https://github.com/jiweil/Neural-Dialogue-Generation) 13 | * [python](https://github.com/Marsan-Ma/tf_chatbot_seq2seq_antilm) 14 | -------------------------------------------------------------------------------- /chatbot/Emotional Chatting Machine:Emotional Conversation Generation with Internal and External Memory.md: -------------------------------------------------------------------------------- 1 | ### Emotional Chatting Machine:Emotional Conversation Generation with Internal and External Memory 2 | 3 | #### note: 4 |   paper把resp的emotion信息embed到一个高维的向量,提出了三种不同的方法来deploy这向量: 5 | 1. emotion category embedding。计算resp的情绪类别,并将该类别转为指定维度向量v跟decoder部分的word embedding放在同一级嵌入到seq2seq中; 6 | 2. internal memory。将1得到的情绪向量v在生成时逐步递减,最后降为0; 7 | 3. external memory。将decoder部分的词典分为不相交的通用和情绪词典,根据预先计算的情绪显示控制每个step输出的word。 8 | 9 | #### comment: 10 | 1. paper的亮点是可以根据预先设定的情绪来生成文本,由被动化为主动; 11 | 2. 个人意见,所提到的三种方案中,emotion category embedding从可扩展性,合理性讲都比较好;方案2中,使用向量来表征情绪本身就是抽象的,一个向量全为0不一定等于“没有情绪”; 12 | 3. 所提方案确实能降低perplexity,因为引入了response中的信息; 13 | 4. 论文有些表述不敢苟同,可能写得还不够细致,流程图解释得也不够,如fig3中M是怎么引入到式16的。 14 | 15 | #### more reading: 16 | 17 | 18 | #### [code](https://github.com/loadder/ECM-tf) 19 | -------------------------------------------------------------------------------- /chatbot/Generating Long and Diverse Responses with Neural Conversation Models.md: -------------------------------------------------------------------------------- 1 | ### Generating Long and Diverse Responses with Neural Conversation Models 2 | 3 | #### note: 4 |   paper仍旧是想解决使用seq2seq训练chatbot得到的回复具有长度短、通用性回复多的问题; 5 | 1. 解决生成的句子短:在encoder部分引入已生成的response内容,同时因为显存大小的限制,提出了glimpse模型; 6 | 2. 解决通用性回复多:在test/eval部分,第一个timestep随机选取候选word,从第二个timestep开始才使用常规beam search方式进行搜索。 7 | 8 | #### comment: 9 | 1. seq2seq中,encoder部分的文本可以尽可能地长,因为encoder部分没有output;相反,decoder部分的output部分需要计算softmax,显存占用明显,特别是上decoder部分的vocabulary非常大的时候,因此paper才提出note1中的glimpse模型; 10 | 2. 文中提到“self attention”,即target side attention是指在训练时,encoder部分加入已生成的内容。这样做是因为the decoder has to keep track of everything output so far in its fixed-lenght hidden state vector, which leads to incoherent or even contradictory。(难道用常见的attention mechanism还不能很好地解决??); 11 | 3. 在test/eval部分,第一步随机选取word,该策略可能的前提是有大量的训练语料,把一个post所有可能的回复都尽可能“学习过”(ps:本文用了23亿数据); 12 | 4. 使用seq2seq做生成式对话,同长度post下,用model生成的回复具有的平均长度比训练语料小4左右(post在train中出现不同次数同样适用); 13 | 5. 
如果glimpse大小取训练语料中response最大长度的二分之一,与常规训练方式相比,该方法得到的perplexity会小一半左右,同时出现perplexity等于0的情况(response为空); 14 | 6. 猜想:在post和resp唯一对应的情况下,perplexity适合用来评估对话系统的性能。 15 | -------------------------------------------------------------------------------- /chatbot/IRGAN:A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models.md: -------------------------------------------------------------------------------- 1 | ### IRGAN:A Minimax Game for Unifying Generative and Discriminative Information Retrieval Models 2 | 3 | #### note: 4 |   文中提出的IRGAN模型给人眼前一亮,其使用GAN较好地整合了“生成”和“判别”这两种经典思想。原文2.1.1部分对其做了通俗易懂的介绍:the generative retrieval model would try to generate (or select) relevant documents that look like the groundtruth relevant documents and therefore could fool the discriminative retrieval model, whereas the discriminative retrieval model would try to draw a clear distinction between the ground-truth relevant documents and the generated ones made by its opponent generative retrieval model。以该方法用于检索式的对话系统为例,IRGAN主要过程如下: 5 | 6 | 1. 生成模型的参数更新。用生成模型找出对query最合适的answer(源代码中包括了groundtruth),然后用判别模型计算这些unlabel answer的reward(其代表的正式判别模型的loss),最后再使用这个reward更新生成模型的参数。更新的特点是,unlabel answer离query的groundtruth越近,reward越小,该answer在生成模型的空间变化越小; 7 | 2. 判别模型的参数更新。首先用生成模型找出最适合当前query的unlabel answer,然后使判别模型尽可能地分开groundtruth和这些answer。 8 | 9 | #### comment: 10 | 1. IRGAN这种思想,不仅可以运用于文中提到的生成和判别模型中,甚至可以尝试运用到分类中,生成模型选出最模糊的样本令判别模型尽可能地去区分这些样本; 11 | 2. IRGAN的关键之处是“生成”和“判别”模型的交互过程; 12 | + a. 由生成模型选出最具对抗性的样本; 13 | + b. 判别模型针对a得到的样本进行训练。 14 | 3. 判别模型的accuracy会不断上升,生成模型的accuracy会不断下降,在第二次训练时,将判别模型的参数更新到生成模型中,不知道训练效果会不会进一步提升; 15 | 4. paper所讲的内容虽然比较绕,但并不会使论文褪色,结合作者给出的代码更容易理解。 16 | 17 | #### [code](https://github.com/geek-ai/irgan) 18 | 19 | #### more reading: 20 | [阿里SIGIR 2017论文:GAN在信息检索领域的应用](https://www.jiqizhixin.com/articles/44049401-05c0-4b77-884a-74a21695cb27) 21 | -------------------------------------------------------------------------------- /chatbot/Incorporating Copying Mechanism in Sequence-to-Sequence Learning.md: -------------------------------------------------------------------------------- 1 | ### Incorporating Copying Mechanism in Sequence-to-Sequence Learning 2 | 3 | #### note: 4 |   针对seq2seq无法生成OOV词但属于encoder中的词情况,作者提出了一种复制encoder词汇的策略,词的分布示意图如下: 5 | 6 | ![](https://github.com/xwzhong/papernote/blob/master/pic/copyNet_vocab.PNG) 7 | 8 | + 当前step各个词的概率为以下两部分概率分布的和: 9 | + a. 生成模式得分p(yt, g|·):按照常规计算方式得到,详见下式: 10 | ![](https://github.com/xwzhong/papernote/blob/master/pic/copyNet_generate_mode.PNG) 11 | 12 | + b. 复制模式得分p(yt, c|·):在encoder中出现的词向量经映射后激活,再乘以隐层,详见下式: 13 | ![](https://github.com/xwzhong/papernote/blob/master/pic/copyNet_copy_mode.PNG) 14 | 15 | + 隐层状态的更新:常规做法是根据一个隐层状态和输出y_t-1,以及attention c,copyNet在y_t-1路径新增了复制机制,词y_t-1的表示为其本身的向量和encoder输出加权并拼接的结果,详见下列公式: 16 | ![](https://github.com/xwzhong/papernote/blob/master/pic/copyNet_state.PNG) 17 | 18 | 模型结构图如下: 19 | ![](https://github.com/xwzhong/papernote/blob/master/pic/copyNet_model.PNG) 20 | 21 | #### comment: 22 | 1. copyNet在encoder-decoder模型任务中,如果decoder的文本大部分出现在encoder中(比如:文本摘要,抽取式生成),该模型的效果会表现比较好; 23 | 2. copy模式在decoder输出部分各个词的概率分布部分计算巧妙。 24 | 25 | #### more reading: 26 | 1. 
[CopyNet解读](https://blog.csdn.net/xvshiting/article/details/82216112) 27 | -------------------------------------------------------------------------------- /chatbot/LSTM-based Deep Learning Models For Non-factoid Answer Selection.md: -------------------------------------------------------------------------------- 1 | ### LSTM-based Deep Learning Models For Non-factoid Answer Selection 2 | 3 | #### note: 4 |   paper比较了在Q&A领域answer选取的几种方法,并提出了自己的model,总体思想是在计算Q&A cos相似度的情况下,使groundtruth与question的cos值,比unlabel的回复更大(见式7),以下是计算cos的几种方式: 5 | 6 | 1. QA-LSTM:分别计算Q&A biLSTM的hidden state,然后可按avg、max、head/tail三种方式获取向量(见fig1); 7 | 2. QA-LSTM/CNN:先用biLSTM获取hidden state构成一个矩阵,然后使用CNN获取合成后的向量(见fig2); 8 | 3. Attention-based QA-LSTM:对于question部分的vector q,用方法1获取,对于answer部分的vecter a,每个timestep的output都结合q进行计算,接下来可使用avg、max、head/tail三种方式合成; 9 | 4. QA-LSTM/CNN with Attention:结合方法2和3得到。 10 | 11 | #### comment: 12 | 1. cosine similarity(式7):最后的一层并不是通常的分类或者回归的方法,而是采用了计算两个向量(Q&A)夹角的方法。M是需要设定的参数margin,q、a+、a-分别是问题、正向答案、负向答案(准确地说是unlabel答案)对应的语义表示向量。损失函数的意义就是:让正向答案和问题之间的向量cosine值要大于负向答案和问题的向量cosine值,大多少,就是margin这个参数来定义的。cosine值越大,两个向量越相近,所以通俗的说这个Loss就是要让正向的答案和问题愈来愈"相似",让负向的答案和问题越来越不"相似"; 13 | 2. paper提出的方案比非DL state of art方案更好(insuranceQA和TREC-QA数据集); 14 | 3. 方法3引入了attention,扩展了attention这种机制的思维,非常不错; 15 | 4. 试验中加入了不同文本长度做对比,非常有借鉴意义(结合insuranceQA test1和test2,方案3在answer文本长度小于55时表现更有优势); 16 | 5. 在QA-LSTM中,acc排序:max>avg>head/tail(相邻之间相差4%左右),head/tail在QA-LSTM attention中表现仍是最差,而max表现更优可能是因为max学到了更多局部、有用的特征; 17 | 6. 4.3最后一部分指出使用了MLP计算,得到最终的分数,但效果不理想(作者提到可能是因为参数过多,难拟合); 18 | 7. Q&A共参数集。分析,同参最终得到的两个vec对应的维度表征都一样; 19 | 8. 预训练语言模型效果会更好。 20 | 21 | #### [code](https://github.com/sujitpal/dl-models-for-qa) 22 | -------------------------------------------------------------------------------- /chatbot/Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章还是针对seq2seq结构在闲聊领域中的问题。闲聊中,说出当前句子背后有很多额外的知识,但是seq2seq不能对这部分信息进行有效编码(个人观念是某些信息在当前pair中既不会出现,也很难从pair推理得来),针对这个问题,作者提出使用隐变量z来代替这个未知的信息(CVAE),同时在此基础上,为了方便训练(出于正常情况训练语料少),提出显示表征隐变量z的策略(kgCVAE): 3 | 1. CVAE: 4 | a. 训练——对resp进行编码,经recognition网络得到分布z,使用KL拟合prior网络分布z'(用于测试),结合z和encoder输出对resp进行生成; 5 | b. 测试——使用prior网络随机输出分布z',结合z'和encoder输出进行生成; 6 | 2. kgCVAE: 7 | a. 训练——在CVAE基础上,主动引入resp的细粒度信息(此处为dialog act),由细粒度信息推算其分布y,用y影响z生成,同时y、z和encoder输出对resp进行decode,利用MLPy主动拟合分布y 8 | b. 测试——由MLPy生成细粒度信息(dialog act)分布y',再由y'、z'和encoder输出对resp进行decode。 9 | 10 | #### comment: 11 | 1. 作者的思路非常好,知道单纯的pair缺失了很多信息,因此考虑使用隐变量来表示这种差异(如果是一对一,或近似一对一问题,eg:翻译,那么这种差异就没那么紧迫); 12 | 2. 根据实验结果,CVAE虽然能明显降低perplexity(相对于simple seq2seq),但在BLEU等评测中结果与baseline相近,这也是正常的,虽然CVAE中隐变量能编码这些信息,但是测试集中post-resp对是固定的,隐变量随机后的分布不一定刚好是resp对应的分布; 13 | 3. kgCVAE是一个好的方向,在隐变量z和resp中间再加入了其它信息的分布,其实这个分布y是显示地告诉模型去生成符合分布y的结果,更高层面讲是生成符合指定dialog act的句子,可能作者甚至想直接根据输入的某些类型(eg:情感,句式等)直接生成句子,照这样发展,需要剖析一个句子的生成跟什么因素相关。 14 | 15 | #### more: 16 | 1. [《Learning Discourse-level Diversity for Neural Dialog Models Using Conditional VAE》阅读笔记](https://zhuanlan.zhihu.com/p/26898768) 17 | 2. 
[code](https://github.com/snakeztc/NeuralDialog-CVAE) 18 | -------------------------------------------------------------------------------- /chatbot/MOJITALK Generating Emotional Responses at Scale.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   在seq2seq生成式闲聊对话机器人中引入情感信息,主要借鉴了[Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm 3 | ](https://arxiv.org/abs/1708.00524)和[Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders](https://arxiv.org/abs/1703.10960)。 4 | 5 | #### comment: 6 |   文章创新性不高,但是借鉴的两篇论文可细看。 7 | -------------------------------------------------------------------------------- /chatbot/Modeling Multi-turn Conversation with Deep Utterance Aggregation.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper依旧是想使用retrieval based解决多轮会话问题,在常用的Ubuntu、豆瓣数据集上表现良好。 3 | 4 | #### comment: 5 |     paper提出的改进策略将模型复杂化,提出的改进点看似解释得通,但并没有清晰的理论证明,只能从实验对比中发现其结果。其中比较好的一个做法是将“当前用户说的话”与“上文”和“待回复”做了更明确的区分。 6 | 7 | #### more: 8 |     [tensorflow code](https://github.com/cooelf/OneshotQA) 9 | -------------------------------------------------------------------------------- /chatbot/Neural Responding Machine for Short-Text Conversation.md: -------------------------------------------------------------------------------- 1 | ### Neural Responding Machine for Short-Text Conversation 2 | #### note: 3 | 4 | 1. 提出的“翻译和对话系统差异”观点个人赞同,闲聊对话是一对多的关系(同个post有多种回复),进一步讲是多对多,而不像文本翻译,原文确定了,译文也“基本”确定了; 5 | 2. 文中提的一个概念比较好,使用NRM-glo获取整个post的summarization,NRM-loc表证词的语义(当然,根据单向RNN的特点,当前词的语义自然包含了其之前的语义,如果使用bidretional-RNN就能获取当前词结合前后文本的语义); 6 | 3. 个人之见,NRM-loc已经包含了NRM-glc部分信息,这个可以从两者之见的结构可以看出,文中3.3部分也提出到了它们之见的关系,所以在训练过程中对NRM-loc和NRM-glo分开训练,最后fine tuning,感觉有点牵强; 7 | 4. 在实际使用这个模型中(NRM-loc和NRM-glo),算法所学到的attention跟回复中的attention并不对应,个人猜测这是结果不理想(通用性回复多,句子短等)的主要原因,也是开放域的对话系统所遇到的共有难题。所谓回复中的attention,首先得有一个回复的具体点,在这点上再用该模型应该就能很好的生成有语义的具体的回复,而现在的模型学到的是由这些点构成的面(seq2seq算法在多对多语料中训练的结果),从而导致了上述不理想的结果。这是模型在多对多数据下的结果; 8 | 5. 有个方面不错,先分开训练NRM-glo和NRM-loc,再合并,因为这种分开训练方式能整合多种机制,特别是无监督学习,如果能通过无监督学习学到大量的语义信息,再整合到seq2seq中,效果可能出乎意料,如skip thought来表征句子语义,OpenAI提出使用语言模型发现的sentiment unit(可以用来控制生成文本的情感)等。 9 | 10 | #### comment: 11 | 12 | * 实验设置思考。 13 | * 测试集数据量少,test post只有110; 14 | * 在fine tuning所用数据集中,post:response比值约等于1:1,是不是应该和原training data中的比例相近? 15 | 16 | #### [data](http://61.93.89.94/Noah_NRM_Data/): 17 | 18 |        实验所用的数据为sina微博,数据质量一般,毕竟是微博评论 19 | -------------------------------------------------------------------------------- /chatbot/Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Noise-Contrastive Estimation for Answer Selection with Deep Neural Networks 2 | 3 | #### note: 4 |   paper指出,在answer selection中将classification问题转为ranking问题(此前已有相关paper是这样做的,eg:QA-LSTM),同时对三种负样本的sample方式进行了实验比较: 5 | 6 | 1. Random Sampling. We randomly select a number of negative samples for each positive answer. 7 | 2. Max Sampling. We select the most difficult negative samples. In each epoch, we compute the similarities between all (p+, p−) pairs using the trained model from the previous training epoch. Then we select the negative answers by maximizing their similarities to the positive answer. 8 | 3. Mix Sampling. 
We take advantages of both random sampling and max sampling by selecting half of the samples from each strategy. 9 | 10 | #### comment: 11 |   此处提出的max sampling策略想法不错,利用上一个epoch最可能降低accuracy的负样本进一步训练,有点对抗的意味,而且实验结果也不错。 12 | 13 | #### [code](https://github.com/castorini/NCE-CNN-Torch) 14 | -------------------------------------------------------------------------------- /chatbot/README.md: -------------------------------------------------------------------------------- 1 | ## insurance QA 2 | | Model | dev | test1 | test2 | paper | note | 3 | | ------------- |:-------------:| :-----:|:-----:|:-----:|:-----:| 4 | | QA-CNN | 0.6160 | 0.6020 |0.5610| [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 5 | | QA-CNN+IRGAN | - | 0.6444 |0.6111|[arxiv](https://arxiv.org/abs/1705.10513)|[note](https://github.com/xwzhong/papernote/blob/master/chatbot/IRGAN%EF%BC%9AA%20Minimax%20Game%20for%20Unifying%20Generative%20and%20Discriminative%20Information%20Retrieval%20Models.md)| 6 | | QA-LSTM/CNN with attention | 0.6720 | 0.6570 | 0.6330 | [arxiv](https://arxiv.org/abs/1511.04108) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/LSTM-based%20Deep%20Learning%20Models%20For%20Non-factoid%20Answer%20Selection.md) | 7 | | QA-LSTM with attention(avg) | 0.6840 | 0.6810 | 0.6220 | [arxiv](https://arxiv.org/abs/1511.04108) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/LSTM-based%20Deep%20Learning%20Models%20For%20Non-factoid%20Answer%20Selection.md) | 8 | | AP-CNN | 0.6880 | 0.6980 | 0.6630 | [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 9 | | AP-BiLSTM | 0.6840 | 0.7170 | 0.6640 | [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 10 | | MULT | 0.7600 | 0.7520 | 0.7340 | [arxiv](https://arxiv.org/abs/1611.01747) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/A%20Compare-Aggregate%20Model%20for%20Matching%20Text%20Sequences.md) | 11 | | SUBMULT+NN | 0.7700 | 0.7560 | 0.7230 | [arxiv](https://arxiv.org/abs/1611.01747) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/A%20Compare-Aggregate%20Model%20for%20Matching%20Text%20Sequences.md) | 12 | | | | | | [arxiv]() | [note]() | 13 | 14 | ## Wiki QA 15 | | Model | MAP | MRR | paper | note | 16 | | ------------- |:-------------:| :-----:|:-----:|:-----:| 17 | | AP-BiLSTM | 0.6705 | 0.6842 | [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 18 | | AP-CNN | 0.6886 | 0.6957 | [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 19 | | SentLevel | 0.6930 | 0.7090 | [uchicago](http://ttic.uchicago.edu/~kgimpel/papers/he+etal.emnlp15.pdf) | - | 20 | | PairwiseRank+WordLevel | 0.6930 | 0.7100 | [semanticscholar](https://pdfs.semanticscholar.org/cf8b/bceaca91f791388a64f3f1a0392d64e41f4f.pdf) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Noise-Contrastive%20Estimation%20for%20Answer%20Selection%20with%20Deep%20Neural%20Networks.md) | 21 | | PairwiseRank+SentLevel | 0.7010 | 0.7180 | [semanticscholar](https://pdfs.semanticscholar.org/cf8b/bceaca91f791388a64f3f1a0392d64e41f4f.pdf) | 
[note](https://github.com/xwzhong/papernote/blob/master/chatbot/Noise-Contrastive%20Estimation%20for%20Answer%20Selection%20with%20Deep%20Neural%20Networks.md) | 22 | | WordLevel | 0.7090 | 0.7230 | [aclweb](http://www.aclweb.org/anthology/N16-1108) | - | 23 | | BiMPM | 0.7180 | 0.7310 | [arxiv](https://arxiv.org/abs/1702.03814v3) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Bilateral%20Multi-Perspective%20Matching%20for%20Natural%20Language%20Sentences.md) | 24 | | SUBMULT+NN | 0.7332 | 0.7477 | [arxiv](https://arxiv.org/abs/1611.01747) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/A%20Compare-Aggregate%20Model%20for%20Matching%20Text%20Sequences.md) | 25 | | MULT | 0.7433 | 0.7545 | [arxiv](https://arxiv.org/abs/1611.01747) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/A%20Compare-Aggregate%20Model%20for%20Matching%20Text%20Sequences.md) | 26 | | | | | [arxiv]() | [note]() | 27 | 28 | ## TREC QA 29 | | Model | MAP clean | MRR clean | MAP raw | MRR raw | paper | note | 30 | | ------------- |:-------------:| :-----:|:-----:|:-----:|:-----:|:-----:| 31 | | QA-LSTM with attention | 0.6896 | 0.7849 | - | - | [arxiv](https://arxiv.org/abs/1511.04108) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/LSTM-based%20Deep%20Learning%20Models%20For%20Non-factoid%20Answer%20Selection.md) | 32 | | AP-BiLSTM | 0.7132 | 0.8032 | - | - | [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 33 | | QA-LSTM/CNN with attention | 0.7279 | 0.8240 | - | - | [arxiv](https://arxiv.org/abs/1511.04108) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/LSTM-based%20Deep%20Learning%20Models%20For%20Non-factoid%20Answer%20Selection.md) | 34 | | WordLevel | 0.7380 | 0.8270 | 0.7550 | 0.8250 | [aclweb](http://www.aclweb.org/anthology/N16-1108) | - | 35 | | AP-CNN | 0.7530 | 0.8511 | - | - | [arxiv](https://arxiv.org/abs/1602.03609v1) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Attentive%20Pooling%20Networks.md) | 36 | | PairwiseRank+WordLevel | 0.7620 | 0.8540 | 0.7640 | 0.8270 | [semanticscholar](https://pdfs.semanticscholar.org/cf8b/bceaca91f791388a64f3f1a0392d64e41f4f.pdf) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Noise-Contrastive%20Estimation%20for%20Answer%20Selection%20with%20Deep%20Neural%20Networks.md) | 37 | | SentLevel | 0.7770 | 0.8360 | 0.7620 | 0.8300 | [uchicago](http://ttic.uchicago.edu/~kgimpel/papers/he+etal.emnlp15.pdf) | - | 38 | | PairwiseRank+SentLevel | 0.8010 | 0.8770 | 0.7800| 0.8340 | [semanticscholar](https://pdfs.semanticscholar.org/cf8b/bceaca91f791388a64f3f1a0392d64e41f4f.pdf) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Noise-Contrastive%20Estimation%20for%20Answer%20Selection%20with%20Deep%20Neural%20Networks.md) | 39 | | BiMPM | 0.8020 | 0.8750 | - | - | [arxiv](https://arxiv.org/abs/1702.03814v3) | [note](https://github.com/xwzhong/papernote/blob/master/chatbot/Bilateral%20Multi-Perspective%20Matching%20for%20Natural%20Language%20Sentences.md) | 40 | | | | | | | [arxiv]() | [note]() | 41 | 42 | -------------------------------------------------------------------------------- /chatbot/Sequence to Backward and Forward Sequences A Content-Introducing Approach to Generative Short-Text Conversation.md: -------------------------------------------------------------------------------- 1 | ### Sequence to Backward and Forward Sequences A 
Content-Introducing Approach to Generative Short-Text Conversation 2 | 3 | #### note: 4 |   paper依旧是想就生成式对话中解决通用性回复的问题,思路比较简单,在decoder时,预选确定一个有意义的“词”,在一定出现该词的情况下进行扩充从而得到完整的回复,关键部分如下: 5 | 6 | 1. train:对于一个完整的post-resp对,将“resp”随机选取一个word分割,前后部分设为head和tail,对head部分reverse,得到post-rehead,post-tail对,用两个不同的seq2seq模型进行训练; 7 | 2. test/eval:使用petrain得到的PMI信息,计算出当前post出现情况下,最“适合”出现的word(“词”级别),再依次使用反向和正向seq2seq模型得到完整的回复; 8 | 9 | #### comment: 10 | 1. paper提出的idea降低通用性回复,主要因为预先使用PMI挑出了相对“有意义”的词; 11 | 2. 如果只是想进行实验,复现模型相对简单,可以直接套用seq2seq的translate代码,再额外写一个PMI相关计算; 12 | 3. 就此模型而言,效果的好坏主要应该在于keywords词表以及test/eval部分keyword的选取; 13 | 4. 在train部分,不知道在选取word进行split部分,使用test/eval一样的选取方式效果会不会更好? 14 | 15 | #### more reading: 16 | 1. [教 Chatbot 生成更有营养的对话](https://zhuanlan.zhihu.com/p/27275391) 17 | 2. [论文引介 | Sequence to Backward and Forward Sequences](http://chuansong.me/n/1613737151949) 18 | -------------------------------------------------------------------------------- /chatbot/Sequential Match Network-A New Architecture for Multi-trun Response Selection in Retrieval-based Chatbots.md: -------------------------------------------------------------------------------- 1 | ### Sequential Match Network-A New Architecture for Multi-trun Response Selection in Retrieval-based Chatbots 2 | 3 | #### note: 4 |   paper的摘要提到,基于检索的方式多轮会话中,所得到的回复忽略了与之前会话内容之间的“关系”以及各自的“重要性”。“关系”体现在,给出正确的回复时,之前的会话彼此之间的关系;而重要性则提现在,正确的回复与之前对话之间的关系。作者提出的解决思路如下: 5 | 6 | 1. “重要性”-a.以词来衡量;回复和每一个utterance的词转化为embedding后,计算两两词之间的相似度,得到矩阵A。b.以句来衡量。回复(utterance亦同)使用GRU encode,每个step的hidden state与每个utterance所有hidden state计算一个相似度。得到矩阵B。 7 | 2. 对1中得到的2-D矩阵(A、B)使用CNN降维,得到n个m维的向量(n为之前的会话轮数) 8 | 3. “关系”。使用GRU(n个m维的input)来描述赋予“重要性”后的会话关系。 9 | 10 | #### comment: 11 | 1. paper提出的architecture使用了n*(GRU+CNN)+GRU的结构(n为之前的会话轮数),从表达能力讲,尽可能地把观察到的结构已表示出来,自然带来的一个负面影响是复杂度的提高,由于本人没实验,引用作者原话:Compared to generation models, discriminative models are much faster due to its small hidden states. Although our model is complicated, it can converage in 8 hours on a K80 GPU. Our model spends 200ms to predict 200 instances (1 batch) in the test process. In general, performance is not an issue if you have a GPU. 12 | 2. 在描述“关系”中,作者使用了三种不同的方法进行比较,这种比较方式可取。 13 | 3. 文章第4部分还给出了response candidate的检索方式,值得借鉴。 14 | 15 | #### more reading: 16 | 17 | 18 | #### [code](https://github.com/MarkWuNLP/MultiTurnResponseSelection) 19 | -------------------------------------------------------------------------------- /chatbot/The Design and Implementation of XiaoIce, an Empathetic Social Chatbot.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章大致介绍了小冰整个对话框架的设计。 3 | + 1. 设计原则 4 | + 1.1. 拥有三个层面的能力 5 | + a. IQ:acquire a range of skills to keep up with the users and help them to complete specific tasks。如:回答知识性问题、根据用户发的图回复内容(paper4.4.1)、作诗(4.4.2); 6 | + b. EQ:meet users’ emotional needs, such as emotional affection and social belonging。分两部分:empathy and social skills,用通俗的话讲为:理解用户情感状态,并用用户能接收的方式回复; 7 | + c. Personality。用于“小冰”角色独特的个性,并保持这种个性的稳定(不会今天是女孩,明天又以说出了男孩才会说出来的话); 8 | + 1.2. 评价指标——CPS(Conversation-turns Per Session, 每次会话对话轮数,“用户说一次”+“机器人回复一次”算一轮); 9 | + 1.3. 层级决策过程 10 | + 将用户会话过程当做马尔科夫决策过程,决策标准CPS:让用户尽可能多发言。 11 | + 2. 系统框架。分三部分: 12 | + 用户入口层。eg:微博、QQ、微信等; 13 | + 会话引擎层。判断用户说的话究竟回复什么内容,是看图评论,还是知识性问题,还是闲聊等; 14 | + 数据层。人物画像(小冰+用户)、数据库(问答对+非问答对)、主题索引和知识图谱。 15 | + 3. 会话引擎层 16 | + 3.1. 会话管理——追踪会话状态 17 | + a. 对话策略。将用户输入分三部分:核心对话、图片/音频评论、任务型; 18 | + b. 主题管理。当分类器检测到接下来无话可说时,将切换有实质性内容的主题,切换时有5个指标判断待回复内容的合理性(上下文关联性、时效性、用户个人兴趣、受欢迎程度、用户接收程度); 19 | + 3.2. 
共情计算(用于后续核心对话) 20 | + a. 上下文理解。实体识别、指代消歧、文本补全; 21 | + b. 用户理解。是否在同一主题、用户想要什么(请求?问候?表述?) 22 | + c. 用户情绪。开心?难过? 23 | + d. 用户态度。对话题的态度; 24 | + e. 用户画像。 25 | + 3.3. 核心对话 26 | + a. 检索pair数据回复; 27 | + b. 生成式回复。如用seg2seq生成; 28 | + c. 检索非pair数据回复。使用知识图谱选取关联词,然后检索非pair数据; 29 | + d. 将以上三种内容结果通过重排方式得到最终结果,排序时考虑维度:与当前回合用户的输入语义是否一致、上下文一致性、共情匹配程度(用3.2部分的特征计算评分),检索分数(eg:BM25); 30 | + e. 如果以上策略仍无结果返回,便回复有指引性的话,让用户多说话。 31 | 32 | #### comment: 33 | 1. 从小冰的设计理念来看,有好的方面——能让用户尽可能地参与,但也有不太好的方面——为了会话而会话。可能这也是其设计评估指标上的考量; 34 | 2. 整个系统看起来比较复杂,某些方面可以简洁的实现; 35 | 3. 个人比较欣赏从非pair数据中获取回复这块,弥补了“检索pair数据更新”和“生成式策略偏向于通用性回复”的缺陷,更进一步,是不是可以在更新pair数据源的基础上直接检索pair数据本身,然后回复pair句子本身? 36 | 4. 不得不说微软为留住用户实现了很多“花式策略”。 37 | 38 | #### more: 39 | 1. [沈向洋等重磅论文:公开微软小冰系统设计](http://www.qianjia.com/html/2018-12/29_318304.html) 40 | -------------------------------------------------------------------------------- /chatbot/Topic Aware Neural Response Generation.md: -------------------------------------------------------------------------------- 1 | ### Topic Aware Neural Response Generation 2 | 3 | #### note: 4 |   在生成式对话中,paper引入了对话的topic信息,从而缩小生成对话时可回答界限,引入topic信息大致分为三步: 5 | 6 | + a. 预先使用LDA训练topic模型,在加入seq2seq模型训练时,针对每组对话的post使用topic模型得到概率最大的主题类,并用该主题类下的词作为特征,同时将通用词进行剔除,每个词仅用对应的向量表示,并引入context最后一步的输出,增加context的语义,进而得到topic attention; 7 | + b. 将topic attention与context attention放在同一层面,计算decoder部分的隐层s; 8 | + c. 在decoder中output的词概率分布分两部分,一个是标准的输出分布,从隐层到输出向量,另一个是context attention与隐层输出结合,得到主题词的概率分布,并将这两个分布相加。 9 | 10 | #### comment: 11 |   在不考虑行文的情况下,有两点值得借鉴: 12 | 1. paper将对话的topic信息joint到decoder部分的隐层; 13 | 2. 在decoder output部分,使用特定的机制偏向了生成的范围。 14 | 15 | #### more reading: 16 | 1. [如何生成主题相关的对话 | 每周一起读 #11](https://zhuanlan.zhihu.com/p/27218817) 17 | 2. [从文本生成看Seq2Seq模型](https://zhuanlan.zhihu.com/p/29967933) 18 | -------------------------------------------------------------------------------- /chatbot/Topic-to-Essay Generation with Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Topic-to-Essay Generation with Neural Networks 2 | 3 | #### note: 4 |   针对具体任务:给出几个主题词,生成该词下的短文本,文章给出了两个常规方案和一个主题衰减方案: 5 | 1. 将主题词的词向量相加求平均,按语言模型方式逐字生成; 6 | 2. 在主题词的之上增加一层attention,类似seq2seq方式生成; 7 | 3. 在2的基础上,给每一步的各个主题词权重进行加权,减小该主题词已生成时再次出现的概率。结构图如下: 8 | 9 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Topic-to-Essay_Generation_with_Neural_Networks3.PNG) 10 | 11 | #### comment: 12 | 1. 文章主要亮点在于主题词权重在生成过程中衰减这块。 13 | 14 | 15 | #### more reading: 16 | 1. [Topic-to-Essay Generation with Neural Networks阅读笔记和部分实验](https://blog.csdn.net/firesolider/article/details/94387427) 17 | 2. [IJCAI 2018 基于主题信息的神经网络作文生成模型](https://www.jiqizhixin.com/articles/2018-06-05) 18 | -------------------------------------------------------------------------------- /classification/Attention-based LSTM for Aspect-level Sentiment Classification.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper很巧妙地利用了LSTM单一结构的attention机制, 使得句子级的aspect情感判断能做到侧重于文本的某些内容, 具体的实现有点像seq2seq中引入attention. 
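
#### sketch:
  A simplified PyTorch sketch of the aspect-conditioned attention described in the note above, written as an illustration rather than the authors' implementation: the aspect embedding is combined with every LSTM hidden state to score how relevant that position is to the given aspect, and the attention-weighted sentence vector feeds the sentiment classifier. Layer names, dimensions, and the exact way the aspect vector is injected are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectAttentionLSTM(nn.Module):
    def __init__(self, vocab_size, n_aspects, emb_dim=100, hid_dim=128, n_classes=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.aspect_emb = nn.Embedding(n_aspects, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.att_proj = nn.Linear(hid_dim + emb_dim, hid_dim)   # mixes hidden state with aspect
        self.att_score = nn.Linear(hid_dim, 1, bias=False)      # scalar relevance per position
        self.out = nn.Linear(hid_dim, n_classes)

    def forward(self, tokens, aspect):
        # tokens: (batch, T) word ids; aspect: (batch,) aspect ids
        H, _ = self.lstm(self.word_emb(tokens))                          # (batch, T, hid)
        a = self.aspect_emb(aspect).unsqueeze(1).expand(-1, H.size(1), -1)
        scores = self.att_score(torch.tanh(self.att_proj(torch.cat([H, a], dim=-1))))
        alpha = F.softmax(scores.squeeze(-1), dim=-1)                    # attention over positions
        r = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)                  # aspect-aware sentence vector
        return self.out(r)                                               # sentiment logits

# toy usage: batch of 2 sentences of length 12, aspect ids 1 and 3
model = AspectAttentionLSTM(vocab_size=5000, n_aspects=5)
logits = model(torch.randint(0, 5000, (2, 12)), torch.tensor([1, 3]))    # shape (2, 3)
```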
3 | 4 | #### more reading: 5 |   [《Attention-based LSTM for Aspect-level Sentiment Classification》阅读笔记](https://zhuanlan.zhihu.com/p/23615176) 6 | -------------------------------------------------------------------------------- /classification/Bag of Tricks for Efficient Text Classification.md: -------------------------------------------------------------------------------- 1 | 2 | #### note: 3 |   文章一作者是word2vec的原作,在word2vec的基础上提出了一种准确,高效的模型.在CBOW模型中,预测的是中间词出现的概率,而fasttext中,则是预测对应文本的类别.同时fasttext使用了hierarchical softmax,每个叶子代表类别. 4 | 5 | #### comment: 6 | 1. fasttext效果好的原因探讨: 7 | 8 | + a. fasttext虽然是线性模型,即使hidden unit很少,但其参数特别多,主要体现在每个词的embedding上,在训练的时候,词的embedding会跟着优化,这也是模型准确率高的一个重要原因,因为同个词在不同文本之间存在关联,而这种联系正是通过embedding建立; 9 | + b. label之间的表示也彼此影响.因为在层次结构中,如果类1和类2存在公共路径,那么在训练类1文本的时候,必然更新公共路径部分的参数; 10 | + c. [subword策略](https://github.com/xwzhong/papernote/blob/master/embedding/Enriching%20Word%20Vectors%20with%20Subword%20Information.md)。将词切分成subword形式,并且加入了尖括号<>在单词外面,因为这样可以区分前缀和后缀,比如一个单词where如果用3-gram来表示,那么可以表示为:; 11 | + d. 文章使用了bigram策略,已有实验在[情感分类任务](https://github.com/xwzhong/papernote/blob/master/classification/Baselines%20and%20Bigrams:%20Simple%2C%20Good%20Sentiment%20and%20Topic%20Classification.md)中表明,该策略能部分描述语序信息,具体引入方式可见[N-gram特征,浅谈FastText文本分类利器解读(2)](https://blog.csdn.net/qq_43019117/article/details/82770124). 12 | 13 | 2. 能较好解决类别多的分类问题(实验测试了312k个类别的分类问题); 14 | 3. 文章没有实验使用预先训练好的word embeddings,如果用了,效果可能会有提升; 15 | 4. 训练,测试速度相比CNN,RNN快几个量级. 16 | 17 | #### highlight: 18 | 1. we show that linear models with a rank constraint and a fast loss approximation can train on a billion words within ten minutes, while achieving performance on par with the state-of-the-art.(这里的rank constraint怎么理解??) 19 | 2. However, linear classifiers do not share parameters among features and classes. This possibly limits their generalization in the context of large output space where some classes have very few examples(表明为什么类别多的问题能较好解决). 20 | 3. On this task, adding bigram information improves the performance by 1-4%. 21 | 22 | #### more reading: 23 | 1. [NLP︱高级词向量表达(二)——FastText(简述、学习笔记)](http://blog.csdn.net/sinat_26917383/article/details/54850933) 24 | 2. [fastText 源码分析](https://heleifz.github.io/14732610572844.html) 25 | 3. [fastText库讲解](https://www.quora.com/How-does-fastText-output-a-vector-for-a-word-that-is-not-in-the-pre-trained-model) 26 | 4. [all kinds of text classificaiton models and more with deep learning](https://github.com/brightmart/text_classification) 27 | 28 | -------------------------------------------------------------------------------- /classification/Baselines and Bigrams Simple Good Sentiment and Topic Classification.md: -------------------------------------------------------------------------------- 1 | #### note: 2 | 3 |   paper指出,在使用naive bayes和SVM时,模型结果容易受到model variant, features和数据集的影响.作者通过对NB和SVM实验,得到以下结论: 4 | 5 | 1. the inclusion of word bigram features gives consistent gains on sentiment analysis tasks;(原文对情感和主题数据集进行了实验,发现同一个分类器,在情感数据使用了bigram其效果都比unigram好,但在主题分类任务中,有些模型的效果反倒变差.另外,实验没用其它领域的数据) 6 | 2. for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds);(此处short snippet数据平均长度为20,最小平均长度为3,最大为24,longer documents指平均长度以百来记,情感数据集中,最小平均长度为231,最大为787) 7 | 3. a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets.(NB log-count的计算公式见原文式2) 8 | 9 | #### comment: 10 | 1. 原文提出的改进方法简洁,而且运算速度快; 11 | 2. 
Multinomial Naive Bayes (MNB)在scikit-learn中有实现; 12 |  3. 情感分类中,bigram比unigram好的原因: 在情感类文本中,往往会有否定词,eg:不/喜欢,此时bigram比unigram更能描述文本,换句话讲:bigram学到部分语序信息. 13 | 14 | #### code: 15 | 1. [作者提供的matlab代码](https://github.com/sidaw/nbsvm) 16 | 2. [个人推荐的python代码](https://github.com/mesnilgr/nbsvm) 17 | -------------------------------------------------------------------------------- /classification/Classifying Relations by Ranking with Convolutional Neural Networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   本文是关系抽取任务中的分类模型,给定句子“周星驰执导了《美人鱼》电影”和待确定pair(周星驰,美人鱼),确定给出的几组关系中其正确的关系类型(假定此处为“导演”)。整体思路是对句子编码最后进行分类: 3 | 1. 获取句子中每个词的embedding表示r; 4 | 2. word position embedding。position embedding——由句子中当前词与pair词的相对位置信息拼接得到wpe,再由r和wpe拼接得到word position embedding(句子中每个词都有); 5 | 3. 使用CNN-max得到最终的句向量; 6 | 4. 添加全连接层,最终维度为类别大小; 7 | 5. 使用rank loss计算损失。 8 | 9 | #### comment: 10 |   文章出彩的地方在于pair词信息的引入和loss设计。loss着重考虑了当前样本最可能被错分的类别,假定正确类别是A,但模型预测“排除A”后,最大概率出现C,则反馈时,会尽量拉开A和C的距离,使当前样本预测为A的概率上升,预测为C的概率下降,这种方式仅考虑极端的情况,在多分类任务中会有比较好的效果。不足之处是,模型类别之间是等价的,假定上例中模型预测为B(A与B从类别上更相近),在反馈时会以同样的梯度去更新参数,而不会减小对B的惩罚力度。 11 | 12 | #### more: 13 | [tensorflow code](https://github.com/pratapbhanu/CRCNN) 14 | -------------------------------------------------------------------------------- /classification/Convolutional Neural Networks for Sentence Classification.md: -------------------------------------------------------------------------------- 1 | ### Convolutional Neural Networks for Sentence Classification 2 | 3 | #### note: 4 |   paper在fig1架构的基础上,通过实验分析词向量在CNN文本分类中对准确率的影响,词向量所做的实验设置如下: 5 | 6 | 1. CNN-rand: 词向量随机初始化并在训练分类器中不断更新; 7 | 2. CNN-static: 词向量使用100B词训练好的word2vec初始化,并保持不变,对于未在pre-train出现的词,随机初始化; 8 | 3. CNN-non-static: 初始化方式与2相同,但在分类器训练过程中不断更新; 9 | 4. CNN-multichannel: 每个词使用两个词向量表示(使用pretrain word2vec),其中一个词向量随分类器不断更新,另一个则保持不变. 10 | 11 | #### comment: 12 | 1. 相对于pretrain词向量, 随机初始化的方式始终是最差的(其中还包括SST情感分析任务,虽然在SST1中提升不多-至少0.5%). 13 | 2. 使用非当前任务的语料(此处为100B Google news)训练word2vec,能得到该语料的一些特征,将该word2vec用于其他任务的问题中,如果使用的是static方式,训练得到的分类模型更多是利用了其它语料的embedding信息,而non-static方式,则兼顾了其它语料及当前语料的信息,但是训练方式的缘故,仍会造成学习上的偏差,体现在:如果该pretrain的词向量是最好的,但因分类器本身并不直接在该embedding下得到最优结果,导致pretrain的词向量出现偏移.因此作者也提出,怎么样尽可能地利用pretrain信息以及该task本身的信息. 14 | 3. paper提出的"CNN-multichannel"方式正是想解决comment2提出的问题,实验中确实也取得一定的效果; 15 | 4. 文章要是能使用各个任务的语料,专门训练其对应的word2vec进行实验比较,或许能发现更多信息,印象中有实验表明,使用其对应的语料训练word2vec效果也不错; 16 | 5. 在实际使用中,我们拥有大语料(eg: 几百万的新闻,上千万的百科数据),而实际待解决的任务可能是特定领域的, 此时是不是也可以在大语料训练好的word2vec基础上,再使用具体任务的语料进行增量训练,通过适当的调整,也能解决pretrain的词向量出现大量OOV的情况(大语料和具体任务的语料同时训练,最后再使用具体任务的语料微调). 17 | 6. table3对CNN-multichannel训练完分类器的结果进行分析,有一个有趣的发现:bad原本与good在空间分布中很接近,但在训练分类器中经过non-static后,bad最终与terrible很接近,是不是可以使用这种方式来解决word2vec训练得到的词向量其反义词相近的情况?non-static在各个任务中表现都不差. 18 | 7. pretrain的word2vec语料可尽可能选择与待解决问题在语法,语义上相近,有助于提升最终的准确率. 19 | 20 | #### hightlight: 21 | 1. Dropout consistently added 2%–4% relative performance. 22 | 2. we obtained slight improve- ments by sampling each dimension from U [−a, a] where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones.([detail](https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py#L88), why???) 23 | 3. 
Adadelta (Zeiler, 2012) gave similar results to Adagrad (Duchi et al., 2011) but required fewer epochs.([更多介绍](https://github.com/xwzhong/papernote/blob/master/neural%20network/An%20overview%20of%20gradient%20descent%20optimization%20algorithms.md)) 24 | 25 | #### [code](https://github.com/yoonkim/CNN_sentence) 26 | 27 | #### more reading: 28 | [卷积神经网络用于文本分类](http://blog.csdn.net/cyl9413/article/details/53432640) 29 | -------------------------------------------------------------------------------- /classification/ETH-DS3Lab at SemEval-2018 Task 7 Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper是semeval-2018竞赛文章,在关系抽取比赛中取得不错的成绩。文章出彩的地方不是提出了什么新颖的结构,而是整合多方面的模块放大其效果。 3 | + 1. 整体结构:word->信息抽取(词、词性、与实体词的位置关系)->embedding编码->rnn和cnn分别输出各个类别的概率->整合rnn和cnn概率得到最终输出。详见下图: 4 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Effectively%20Combining%20Recurrent%20and%20Convolutional%20Neural%20Networks%20for%20Relation%20Classification%20and%20Extraction.png) 5 | + 2. 各个模块整合策略: 6 | + ensemble。使用不同的初始化参数,训练k个分类器,最后对各个类取平均值作为对应类的概率。(文中尝试了1<=k<=30的情况,经测试最好的效果时k=20时) 7 | + word embedding。作者下载了同领域数据,测试发现,word2vec 200维的结果相比于100维和300维F1值更高。(300维的效果相比于200维差了+2%F1) 8 | + 预处理: 9 | + 裁剪句子。作者通过分析Subtask1.1发现,拥有实体关系的词随着彼此距离的增加样本数越来越少,为了减少非实体关系判为实体关系的情况,仅考虑小于一定长度阈值的句子 10 | + 实体标签。在实体词前后插入特殊的标签,用来明确指明中间的词为实体词 11 | + 相对位置策略和类别数。 12 | + 词性。将每个明确的词性用特定向量表示,类似于word embedding 13 | + 充分利用已标注数据。由于“SemEval-2018 Task 7”中关系分类的数据较少,作者整合了Subtasks 1.1和1.2的数据(+3.6%F1) 14 | + 生成额外数据。将实体对直接与另一句子的实体对替换(+0.7%F1) 15 | + 参数优化。使用网格搜索方式进行调参 16 | + 修改loss设计。因各个类含有的数据非常不均,而使用的交叉熵与最终的衡量标准F1有一定的差异,同时考虑到F1对数据量少的类别比较敏感,因此在训练时加大数据量少的类别影响(+1.6%F1) 17 | + upsampling。仍旧是为了缓解数据不均问题(+12.2%F1) 18 | 19 | #### comment: 20 | 1. rnn和cnn的整合策略并不是最后一个向量concat,而是输出的概率根据句子长度加权——since the RNN-based architecture had a tendency to obtain better results than its CNN-based counterpart for long sequences, we combined both predictions in such a way that a higher weight was assigned to the RNN predictions for longer sentences. 21 | 2. 为解决关系分类问题,大部分策略在不同任务中是通用的(eg:实体识别,chatbot,其它类型的分类等),但作者尽量将各项技巧都发挥到最好的效果(多实验验证),需要考虑在后期工作中秉持这种科学的态度和并将公共模块抽取出来; 22 | 23 | #### question: 24 | 1. A shortcoming of this approach is that the cross-entropy loss usually only constitutes a conveniently decomposable proxy for what the ultimate goal of the optimization is (Eban et al., 2017): in this case, the macro-averaged F1 score. 怎么理解这句话? 25 | 26 | #### more: 27 | 1. [文章概述](https://zhuanlan.zhihu.com/p/35845948) 28 | -------------------------------------------------------------------------------- /classification/Hierarchical Attention Networks for Document Classification.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   对于文档分类,作者提出从“词->句子->文档”的整体思路构建模型,其中不同“词”和“句子”对当前分类有不同的重要性,因此对此两步采用attention机制进行重要性衡量并提取特征,详见文章fig2。 3 | 4 | #### comment: 5 |   文章提出的观点切合特定场景的问题,特别是对“部分文本内容指向性高”的分类任务中。 6 | 7 | #### more: 8 | 1. [一个Hierarchical Attention神经网络的实现](https://blog.csdn.net/triplemeng/article/details/78269127) 9 | 2. [基于Attention机制的深度学习模型在文本分类中的应用](https://www.jianshu.com/p/4fbc4939509f) 10 | 3. [tensorflow code](https://github.com/ilivans/tf-rnn-attention) 11 | 4. [几个模型在dbpedia数据集中的对比](https://github.com/TobiasLee/Text-Classification) 12 | 5. 
[fasttext在dbpedia数据集的准确率](https://github.com/facebookresearch/fastText/blob/master/docs/supervised-models.md) 13 | -------------------------------------------------------------------------------- /classification/How to Fine-Tune BERT for Text Classification.md: -------------------------------------------------------------------------------- 1 | 2 | #### note: 3 |     针对bert在文本分类中的应用,作者尝试了不同的微调方式来评测bert的效果,主要分三大类。 4 | + 1. Fine-Tuning Strategies 5 | + 1.1. 长文本(大于512个token) 6 | + a. 方式:取前510个tokens、取后510个tokens、取前128+后382个tokens、层级方法(将文本切分成L/510块,并将最后一层的[CLS]输出进行整合) 7 | + b. 结论:在IMDb and Sogou数据集上都是head+tail表现最好 8 | 9 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-1.png) 10 | + 1.2. 不同层的输出 11 | + a. 方式:使用不同层[CLS]的输出进行映射 12 | + b. 结论:在IMDb数据上最后一层的输出,以及Last 4 layer+max表现最好 13 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-2.png) 14 | + 1.3. 信息遗忘和学习率衰退(学习率的大小) 15 | + a. 方式:调节学习率的初始值 16 | + b. 结论:学习率过高容易将bert预训练学习到的内容遗忘 17 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-3.png) 18 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-4.png) 19 | + 2. Further Pretraining 20 | + 2.1. 具体任务数据再预训练 21 | + 2.2. 同领域数据再预训练 22 | + 2.3. 结论:同领域的数据对具体任务更有帮助,其他领域的数据甚至可能起到反作用 23 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-5.png) 24 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-6.png) 25 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-7.png) 26 | + 3. Multi-task Fine-Tuning 27 | + a. 方式:用不同数据集微调后再用特定任务数据集微调 28 | + b. 结论:效果有提升,但是没特别明显 29 | 30 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-8.png) 31 | + 4. Few-ShotLearning 32 | + a. 方式:使用不同百分比的数据进行微调 33 | + b. 结论:随着微调数据集郑家,error rate越来越低 34 | 35 | ![](https://github.com/xwzhong/papernote/blob/master/pic/HowtoFine-TuneBERTforTextClassi%EF%AC%81cation-9.png) 36 | 37 | #### comment: 38 | + 1. 文章是一篇实验性的文章,把bert尽可能的微调方式都列举了,总的来讲效果比较好的方式是: 39 | + 长文本仍需要进行实验才能确定是用head,tail还是head+tail,这个跟具体的任务特点相关 40 | + 但从准确率来看,一般选择最后一层[CLS]的输出,但如果要考虑线上的时间效率问题,则可以使用中间层输出,个人在其他任务中也发现,随着[CLS]层数的增加,准确率确实不断往上增,但时间复杂度也如此 41 | + 如果具体任务内的无标注数足够多,则用其再预训练,如果不多,则可利用领域内的数据,同时输出中间结果,并进行具体任务的微调测试 42 | + 2. 目前GLUE数据集上超越BERT的结果主要bert+multi task结果,文章的一些实验可能还不算完备 43 | 44 | #### more: 45 | 1. [GLUE leaderboard](https://gluebenchmark.com/leaderboard) 46 | -------------------------------------------------------------------------------- /classification/SGM Sequence Generation Model for Multi-Label Classification.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   作者将multi-label classification任务转化为序列生成形式,本文使用了seq2seq with attention结构,encoder对句子进行编码,decoder对各类别进行解码生成。作者所做主要改动如下: 3 | + a. 在decoder生成第k个label时,如果前k-1步已经出现过了,则对其进行惩罚。(在实操中,具体惩不惩罚需要以实际需求为准); 4 | + b. 提出了general embedding。在decoder阶段,生成第k个label不是仅仅考虑k-1步中最高概率的label,作者还考虑了其它label的概率,主要是避免过分依赖最大概率label造成错误累加。 5 | 6 | #### comment: 7 | + a. 作者对模型的改动不算大,亮点是将多标签分类问题转为序列生成,而使用序列生成能很好的学习不同标签之间的关系; 8 | + b. 
如果标签顺序并不是非常重要,使用transformer效果是不是会更好?(transformer对语序的学习能力相对lstm较差) 9 | 10 | #### more: 11 | [深度学习:如何在多标签分类问题中考虑标签间的相关性?](https://zhuanlan.zhihu.com/p/39535198) 12 | 13 | #### highlight: 14 |   The proposed sequence generation model generates labels sequentially and predicts the next label conditioned on its previously predicted labels. Therefore, it is likely that we would get a succession of wrong label predictions in the following time steps if the prediction is wrong at time-step t, which is also called exposure bias. 15 | -------------------------------------------------------------------------------- /classification/Sequential Short-Text Classification with Recurrent and Convolutional Neural Networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   在多轮会话数据中,当前dialog act不仅与当前句子相关,还与前面说的话相关,针对这个特点,作者设计了利用序列信息的网络(详见fig1和fig2)。网络分两部分: 3 | 1. fig1: 对每个句子进行编码,转化为指定维度的向量; 4 | 2. fig2: 将各个句子进行整合,整合方法主要是将fig1结构得到的vector进行糅合,从而预测最终的类别。 5 | 6 | #### comment: 7 |   作者提出的方法是针对上下文相关的多轮会话问题提出的,算是一个可研究的方向,但是在实际操作中,考虑到时间效率问题似乎不会使用过于复杂的网络结构。 8 | -------------------------------------------------------------------------------- /classification/Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   针对sentiment、emotion和sarcasm任务,paper提出使用emoji标注的大量数据下pretrain后进行transfer learning,要点如下: 3 | 1. pretrain。使用Twitter含有指定emoji集合(64个emoji)的数据,以emoji为类别标签训练分类器(一个训练样本只有一个标签、一条text可能有多个标签-分多条训练样本)。 4 | 2. model architecture。embedding-bilstm-attention sum+skip connection-softmax。 5 | 3. transfer learning-chain thaw。使用2中的model architecture预训练好模型后,使用target task对pretrain好的参数进行fine-tuning,提出了chain thaw策略——先更新最后一层参数,然后再从最低层依次往上更新其参数,其中使用了valid data来判断训练是否结束。 6 | 7 | #### comment: 8 | 1. 模型效果好的原因探索: 9 | + a. pretrain数据量大。64 emoji就有12.4亿数据量,单个emoji数据量最少为510w+,最多为2.3亿,相差40+倍(为了平衡数据需要做平衡操作-文章使用了upsampling); 10 | + b. 模型结构设计。使用了attention机制和skip connection策略,在transfer learning时,使用了chain thaw策略,该方法在NER中也能提升1%的F1值。 11 | 2. paper中绘制的fig6有很大用处: 12 | + a. 使用无监督方式自动区分情感类别; 13 | + b. 描述不同情感之间的距离; 14 | + c. 可将emoji归为不同的大类。 15 | 3. 不足之处是因使用了bilstm训练时间较长,或许使用embdding-position encoding-attention sum+skip connection-softmax也能达到与使用bilstm相近的效果,同时明显提升时间效率。 16 | 4. 文章3.2部分提到对embedding加constrain,使其值映射到[-1, 1],估计是为了后续skip connection部分平齐数值范围而添加。 17 | 18 | #### more: 19 | 1. [论文主页](https://www.media.mit.edu/projects/deepmoji/overview/) 20 | 2. [code](https://github.com/bfelbo/deepmoji) 21 | 3. [pretrain的结果体验](http://deepmoji.mit.edu) 22 | 4. [用python-pandas作图矩阵](https://www.jianshu.com/p/47b54eb35eed) 23 | 5. [SciPy Hierarchical Clustering and Dendrogram Tutorial](https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/#Perform-the-Hierarchical-Clustering) 24 | 6. [How to Handle Imbalanced Classes in Machine Learning](https://elitedatascience.com/imbalanced-classes) 25 | 7. [undersampling和oversampling进行训练和测试时的注意点](https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation) 26 | 8. [这个围笑代表什么 可能麻省理工的AI比你更懂](http://www.qingpingshan.com/m/view.php?aid=310874) 27 | -------------------------------------------------------------------------------- /classification/readme.md: -------------------------------------------------------------------------------- 1 | #### 1. 
先根据要求(准确率,速率)定模型整体方向 2 | 3 |   有些项目中,就算是能提高0.5%也是很大的改进,那么这种模型偏重准确率,而在有些任务中,准确率可以稍稍低一点(可能有其它方面弥补这块的损失,或这分类任务仅仅只是庞大系统的一小块),但因要兼顾整个系统的效率,实时性要求较高.当然,最好的选择是准确率最高同时实时性也最好. 4 | 5 | #### 2. 准确率 vs predict时的速率 6 | 7 |   使用复杂深度学习重点考虑的是其分类准确率(数据表述更细致,参数更多), fasttext更多注重时间效率(c实现,w2v理论,OOV解决较好,使用n-gram考虑了句子中的词序,能解决部分数据不平衡问题,多分类中可以使用层级结构提高效率等, 相比SVM等经典机器学习算法在描述文本能力上更细致); 8 | 9 | #### 3. 模型优化 10 | 11 |   怎么判定一个新模型/优化方案是否能提高准确率? 12 | 13 | + a. 考虑优化的思路是否切合数据的特点.比如,情感分类中使用VSM模型,如果再引入词序特征,准确率有很大的可能性提升,原因很简单,"喜欢不"和"不喜欢"表达的是不同的情感,而VSM模型描绘不出语序信息; 14 | + b. 接下来还要考虑,这种语序信息对整个分类任务影响的比例有多大,如果仅占万分之一,那么优化的意义不大,因为每引入一个假设,必然增加模型一定的error,关键是得确定优化的部分要尽可能大于这个error; 15 | + c. 最后还要考虑优化时,所用的表达方式能否正确代表优化的点,假设一句话的语序信息使用一个标量来表示,而原始的VSM维度上千维,那这个标量在整个分类模型中的作用非常有限(在此并不是说使用一个标量就能正确描述一句话的语序信息,仅做参考) 16 | 17 | #### 4. 数据预处理 18 | ##### 文本 19 |   以是否影响分类结果为标准判定预处理技巧的选取,从而达到减轻模型训练负担(参数个数,训练时间),按理来讲,数据越复杂,不规律,比如线性不可分相比与线性可分,则模型所需参数,训练时间(特别是深度学习模型)更多. 20 | 21 | + a. 假设文本中有繁体,简体,而分类判断对字体不关心,只关心其所述内容,那可将繁体转化为简体,同理大小写,全角字符转半角字符都是常用的技巧; 22 | + b. 对于更深层次的,假设文本中有url,那是否需要剔除这些url也是根据上述标准,假设url对文本判定无影响,那可直接剔除,假设有影响,而且可以根据域名就能判定该url倾向于哪个类别,为了减轻模型训练负担,可直接提取域名,因为url多余的字符可以理解为噪音; 23 | + c. 某一类字符串,虽然形式不同,但是其含有相同的含义也可进行统一映射,此处是根据最终任务的目标反推判断; 24 | + d. 仅用指定字符集,eg:仅用中文; 25 | + f. 连续字符压缩后有无影响?eg:啊啊啊啊啊啊啊->啊啊啊(用正则一行代码搞定, 连续相同的多字符串同理) 26 | + g. 按字还是按词切分?以词为单位, 其包含的语义信息更完整,但量多(中文词以十万记,而常见字仅有3k左右), 以字切分有一个优点,对于模糊数据也能很好处理(eg: '伟.哥'分词后得不到'伟哥'这个词), 在深度学习分类模型中可优先考虑按字切分,同时加大模型结构(因为word embedding的表述空间小了,则需要模型更强的学习能力). 提到这个,在seq2seq中,如果decode部分的vocabulary非常大,则模型训练时间,内存/显存将要求更大, 当然也有sampling的方式缓解这个问题,但个人观点毕竟是取样,存在一定的信息损失. 27 | + h. 文本切分后保留长度.在深度学习中,训练往往是以batch方式进行,在考虑保留长度时需考虑在该长度文本下,文本保留的信息(重点),机器内存/显存的限制和训练时间.文本长度可取95%左右文本都能涵盖的长度,原因:在保证足够的信息被保留的情况下,减少长尾数据对训练时间的制约;注意,如果文本是由多个子文本拼接而来,需要使用拼接后的文本计算; 28 | 29 | ##### 非文本 30 |   非文本此处指格式化的数据,如星期,日期,次数,时间长度等,此处分两类: 31 | 32 | + a. 标量表示. 如时间长度,其为跨度值,不能用枚举的方式表示.对于能用标量表示的特征,需要进行映射,往往使用z-score,使其均值为0,方差为1,也可以使用函数映射到指定范围,对于某些特别重要的特征,可扩大其映射范围; 33 | + b. 元数据表示.如星期几, 此类数据暂称元数据,它的特征是使用枚举的方式得到. 元数据首先不用标量来表示,因为枚举特征之间的模型距离无法衡量, 以前常用的是用one hot(一个星期有七天,可用七个维度表示,对应的星期几赋值为1,其余为0), 但其表达元数据特征之间的关系仍是定量的,有一种比较好的方式是使用embedding,使用过程类似word embedding,只是将word换成了元数据. 在训练模型的过程可以不断学习元数据embedding的表示,达到自我学习的过程.有一点需要注意,某些非文本数值特征使用'标量表示'不一定合理,需从该特征的实际意义来理解并判断用何种表达方法;对于元数据特征embedding维度,可与其枚举数等同. 34 | 35 | #### 6. 深度学习模型结构相关 36 | + a. 激活函数的选择和及其参数初始化. dl模型中相比于梯度爆炸,梯度消失问题更棘手.梯度爆炸可通过clip方式有效缓解,而对于梯度消失,其本身难以检测.2009年[Glorot](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)分析了激活函数与其参数初始化对模型的影响,其中通过实验表明tanh激活函数在使用指定的初始化方式的情况下能有效避开梯度消失,并加快模型的收敛速度([笔记](https://github.com/xwzhong/papernote/blob/master/neural%20network/Understanding%20the%20difficulty%20of%20training%20deep%20feedforward%20neural%20networks.md)),keras有相关该论文初始化方法实现(tf已含有keras模块).(注意,激活函数的选择,初始化策略还与输入数据每个维度的取值范围有关,这点paper好像没有提及,但个人感觉也是比较重要的一点,一般会将各个维度"0均值标准化",即将各个特征转化为"均值为0,方差为1",这个策略不是必要操作,可根据实际的任务仅对某部分特征进行标准化).关于各不同激活函数可视化及优缺点分析可见[链接](https://www.jiqizhixin.com/articles/2017-10-10-3). 37 | + b. RNN变体的选择-simple RNN, GRU和LSTM. simple RNN有天然的缺陷,可以不用考虑.有相关[实验](https://arxiv.org/pdf/1412.3555)([笔记](https://github.com/xwzhong/papernote/blob/master/neural%20network/Empirical%20Evaluation%20of%20Gated%20Recurrent%20Neural%20Networks%20on%20Sequence%20Modeling.md))表明,GRU在某些任务上效果比LSTM更好, 到具体任务时可实验测试,但论文中比较常见的是使用LSTM单元, 如果机器显存,计算能力足够,建议使用LSTM; 38 | + c. 单向RNN还是双向RNN结构. 单向LSTM仅考虑了已出现的文本,而双向结构则考虑了后续待生成的问题(虽然带生成文本未显示得出),运用中常用双向结构; 39 | + d. word embeddings. 40 | + ① 如何初始化分类模型中的词向量-使用word2vec等模型训练还是随机初始化比较好?拿word2vec来说,其训练得到的词向量否定词之间的空间距离比较接近,对于否定词敏感的任务来讲,此时最终分类模型训练得到的结果会打一定折扣,此时可以考虑随机初始化词向量,对比两者差异. 
关于训练word2vec训练情感分析的词向量,可参考论文-Learning word vectors for sentiment analysis, A. Sentiment classification based on supervised latent n-gram analysis, Re-embedding words. 41 | + ② 词向量训练方法的选择;训练词向量建议使用fasttext,其相比于word2vec主要的优点是缓解OOV,因其使用subword表示,使得到的结果兼顾了词的形态信息,这点在错字中有出奇的效果,当然前提是这个错字词能被正确分词,同理,如果token仅仅只是label,在形态上没有间接的语义关联,那么效果会变差.2017年12月出了一篇[论文](https://github.com/xwzhong/papernote/blob/master/embedding/Advances%20in%20Pre-Training%20Distributed%20Word%20Representations.md),介绍了word2vec在考虑Position-dependent Weighting, Phrase representations及Subword information情况下的效果,实验结果令人欣喜; 42 | + ③ 更新词向量与不更新-一般选择更新参数;有论文介绍说结合更新和不更新两种方式能得到更优的上层结果, 更新向量参数能使得到的词向量往上层结构偏移,使之更好拟合上层问题,不随模型更新可更好利用原本训练好词向量语义信息(能学到什么程度与所用词向量模型有关,eg:word2vec, glove和fasttext),个人认为,最好能在上层模型训练到一定程度的时候随上层模型更新词向量参数,原因:上层模型在初始化相关参数后,除词向量含有一定先验知识,其它参数是混沌状态,刚开始训练时,词向量需要兼顾这些混沌状态得到的模型最终结果,而这初始阶段的最终结果一般不是最优的,可想而知词向量在开始训练阶段将被污染,在上层模型不断训练过程中,词向量再往最进一步的优化,如果是刚开始不对词向量进行调整,等上层模型训练到一定程度的时候再进行调整或许能缓解这种影响,目前还没做相关实验验证; 43 | + ④ 使用什么语料训练word embeddings?可以使用分类数据本身训练,有可以使用大型,规整的语料(eg:新闻,百科).已有人实验得出,使用分类数据本身也能达到大语料训练得来的效果.使用大型公开语料,首先数据量多,其次这些类数据比较规范,模型学到的语义信息也更规范.(ps:[400w搜狐新闻](http://blog.sina.com.cn/s/blog_16d74e01f0102x1js.html)) 44 | + ⑤ 对于OOV词,在分类模型中该如何初始化?对于fasttext不存在这个问题的考虑,因为其使用n-gram方式解决OOV词,哪怕这个OOV词拆分后的gram没有出现在fasttext模型中,每次它也能唯一返回一个词向量.对于word2vec, 本人试过使用随机初始化,最大最小值为已在word2vec模型的词向量中的最大最小值. 但看到论文"Convolutional Neural Networks for Sentence Classification"提到这样一句话:When randomly initializing words not in word2vec, we obtained slight improvements by sampling each dimension from U[−a, a] where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones.(如果谁有自己的见解希望[在这里分享下](https://www.zhihu.com/question/68711290)), 在yoonkim提供的代码中,a取值为[0.25](https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py#L95). 45 | + e. 使用什么优化方案sgd还是adam? sgd收敛速度慢, 容易陷入局部最优, 困于鞍点, 针对这些问题的改进方案有Momentum(动量),NAG,Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam等等, 初步调参可使用adam, 其理论,解决的方面相对较多,同时强烈建议细读论文[An overview of gradient descent optimization algorithms](https://arxiv.org/abs/1609.04747)([笔记](https://github.com/xwzhong/papernote/blob/master/neural%20network/An%20overview%20of%20gradient%20descent%20optimization%20algorithms.md)). 46 | + f. 模型调参. 47 | + ① epoch或训练步数的选择.训练模型前,将数据集分三份-训练集,验证集和测试集(可使用划分比例7:1:2).测试集用于评估在当前参数条件下,模型训练到最好时的效果.而这个"最好"的位置(具体是哪个epoch或step)是由"验证集"来确定的.在训练过程中,可在每个epoch结束或在指定间隔step后计算验证集的准确率,综合考察整个训练过程,选择验证集准确率最高的那个epoch或step模型结束训练位置.在训练实际上线模型时,由于使用了百分百数据集,其对应的epoch或step也可增加一点. 48 | + ② 词汇表大小.词汇表获取严格来说不能用验证集和测试集;一般的文本数据集其词汇由词频统计可发现长尾现象,选取词汇时,先由词频由大到小排序,取前k个词,使前k个词的词频和占总词频的指定百分比以上即可(可取98%). 从另一个角度解释,该策略能在尽量保存多信息的前提下剔除更多的tough信息.对于被剔除的词,可用UNKNOWN等字符串代替或直接跳过,对于文本较长的数据,比较建议直接跳过,文本较长本身还有信息量比较多,剔除UNKNOWN token可减少训练时间. 49 | + ③ batch size设置. 在神经网络算法原始的推导是使用整个batch计算梯度,限于运行内存/显存、单步更新速率,从而使用了mini-batch,个人实验发现,mini-batch越大,收敛速度越快,原因是更大的mini-batch更能代表模型整体收敛方向. 如果内存/显存允许,可尽量调高mini-batch数。但也不要忘记对同一个epoch进行shuffle操作,降低同个mini-batch样本与样本间的相互影响;同时,对于64, 128, 256, 512...等数值也是比较建议使用,主要考虑了内存的结构设计,能稍微提高训练速度(embedding size, rnn cell size等同理). 50 | + ④ learning rate. 论文"Don't Decay the Learning Rate, Increase the Batch Size"提到,batch size随训练步数的增加,可以使用较大lr,相当于lr可以维持不变.但是一般的训练任务应该用不上这个技巧,因为其说的batch size在实验中是以千或万来记. lr的选取一般不需要太大,常见的初始化值有0.001, 0.01, 0.1, 1.0; 51 | + ⑤ learning rate decay. 一般取0.9, 0.99, 0.999, 0.9998, 在有些任务中, 0.99和0.999的实验结果可能差很远. tf提供了tf.train.exponential_decay字段调节lr.此处的lr decay需要结合lr的选取来定,如果lr本身很小,那么lr decay也需要相应减小, 特别是当lr非常小时,需特别注意lr decay的调节. 52 | + g. 
是否对softmax层进行取样?判断是否使用该trick需要理解为什么会有"在softmax取样"这一说法.当时学者提出该trick是考虑到若softmax层对应的词汇表非常大, 则将训练过程中如果使用GPU则显存要求高,同时计算量特别大,每个word都需要计算一次,如使用语言模型和[seq2seq的decoder层](https://arxiv.org/abs/1412.2007)([笔记](https://github.com/xwzhong/papernote/blob/master/translation/On%20Using%20Very%20Large%20Target%20Vocabulary%20for%20Neural%20Machine%20Translation.md)).因此,在softmax层output size比较小(以千记)可不使用取样,因为既然是取样自然会有损失在其中. 53 | + h. 是否使用batch normalization或layer normalization.有论文提及batch normalization[用于RNN hidden2hidden层效果不好](https://arxiv.org/abs/1510.01378)([笔记](https://github.com/xwzhong/papernote/blob/master/regularization/Batch%20Normalized%20Recurrent%20Neural%20Networks.md)),也有[论文](https://arxiv.org/abs/1603.09025v5)([笔记](https://github.com/xwzhong/papernote/blob/master/regularization/Recurrent%20Batch%20Normalization.md))指出其用于input2hidden对模型准确率,时间效率都有提升, 其中之一BN作者提出了BN的改进版[Batch Renormalization](https://arxiv.org/abs/1702.03275)([笔记](https://github.com/xwzhong/papernote/blob/master/regularization/Batch%20Renormalization:%20Towards%20Reducing%20Minibatch%20Dependence%20in%20Batch-Normalized%20Models.md)),建议在选择的时候使用Batch Renormalization或[layer normalization](https://arxiv.org/abs/1607.06450)([笔记](https://github.com/xwzhong/papernote/blob/master/regularization/Layer%20Normalization.md)).layer normalization在tf中已有实现,但个人在使用时发现,其需要非常高的内存,同时,单个batch计算效率明显下降,可在最终优化好的模型中测试是否使用normalization技巧. 54 | 55 | #### 7. 加速模型训练的一些技巧 56 | + a. 对于文本数据,先用程序将词或字转化为id,训练模型时直接读入这些id,在训练过程中再进行转化非常耗时,特别是对于每个epoch都要进行id转化的代码. 57 | + b. 如果训练数据比较少,可直接读入内存中,并在训练前进行id转化;反之可通过转化为id后存入文本,或使用tfrecord, tfrecord对数据进行特定形式编码,存入文件中,读取时可按batch,多进程读取. 58 | + c. 课程学习. 其思想是:在训练过程中,先使用简单的部分数据训练,到一定程度后再用整个训练数据集.这个思想非常好,建议阅读论文[curriculum learning](https://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf)([笔记](https://github.com/xwzhong/papernote/blob/master/regularization/curriculum%20learning.md)),该思想不仅实践简单,而且对我们日常标注语料,清理数据有很大的指导作用, 如:相同文本出现在正负类别中该如何处理.课程学习要想实际运用需要思考:如何定义一个问题的难易程度,可从数据角度和分类器角度考虑.从数据角度可思考词频,从分类器角度,以分类器判别难以为准.这里可以提一下[IRGAN](https://arxiv.org/abs/1705.10513)([笔记](https://github.com/xwzhong/papernote/blob/master/chatbot/IRGAN%EF%BC%9AA%20Minimax%20Game%20for%20Unifying%20Generative%20and%20Discriminative%20Information%20Retrieval%20Models.md)),在思想上有共通之处. 59 | -------------------------------------------------------------------------------- /embedding/A Simple Approach to Learn Polysemous Word Embeddings.md: -------------------------------------------------------------------------------- 1 | ### A Simple Approach to Learn Polysemous Word Embeddings 2 | 3 | #### note: 4 |   paper提出一种解决word embedding在一词多义的表示问题.原理比较简单,给定pretrained词向量(eg: word2vec, glove等),并计算词汇表中两两之间的cooccurance(paper式1),最后,指定词的word embedding由当前context下所有词的加权embedding. 5 | 6 | #### comment: 7 | 1. paper提出的方法简单,换一个思路理解,就是利用了当前context的信息来确定该词的表示; 8 | 2. 虽然使用了context文本信息,但仍是一种统计特征; 9 | 3. 在experiment中,不同测试集的结果相对其它模型差异大. 10 | 11 | #### [code](https://github.com/dingwc/multisense) 12 | 13 | #### highlight: 14 |   We note that (Chen et al., 2014) is learned using additional supervision from the WordNet knowledge-base in clustering; therefore, it achieves comparably much higher scores in WSR and CWS tasks in which the evaluation is also based on WordNet. 
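#### 示例(个人补充):
  下面是按个人理解写的极简示意(非论文官方实现,具体加权方式以原文式1为准),其中 emb、cooc 均为假设的输入结构:先统计词对共现信息,预测时用"目标词与上下文各词的共现权重"对上下文词向量加权求和,得到该词在当前语境下的表示。

```python
import numpy as np

def contextual_embedding(word, context, emb, cooc, eps=1e-8):
    """emb: {词: 预训练向量(np.array)}; cooc: {(w, v): 共现次数}; context: 当前句子的词列表"""
    weights, vectors = [], []
    for v in context:
        if v == word or v not in emb:
            continue
        weights.append(cooc.get((word, v), 0.0))
        vectors.append(emb[v])
    if not vectors:                       # 上下文无可用词时退回原始词向量
        return emb.get(word)
    w = np.asarray(weights, dtype=np.float32)
    if w.sum() == 0:                      # 共现信息缺失时退化为简单平均
        w = np.ones_like(w)
    w = w / (w.sum() + eps)               # 归一化,近似论文式1的加权思路
    return (w[:, None] * np.stack(vectors)).sum(axis=0)
```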
15 | -------------------------------------------------------------------------------- /embedding/A Simple But Tough to Beat Baseline for Sentence Embeddings.md: -------------------------------------------------------------------------------- 1 | ### A Simple But Tough to Beat Baseline for Sentence Embeddings 2 | 3 | #### note: 4 |   paper提出了一个很简单的sentence embedding方法(见原文Algorithm 1): 5 | 6 | 1. 使用无监督学习的方法计算word embedding,eg:word2vec,GloVe; 7 | 2. 对待表示的句子,将每个词的词向量加权求和,每个词的权重仅用到了词的概率和人为设定的参数a; 8 | 3. 对第2步得到的句向量使用PCA或SVD进行微调。 9 | 10 | #### summary: 11 | 12 | 1. 在一般任务中,对句子的word embedding使用tfidf方式进行加权比求平均更优; 13 | 2. paper提出的方法中,不同训练集得到的word embedding适用于与其不同domain的数据; 14 | 3. paper提出的方法因为沿用了word2vec或GloVe,因此自然也保留了其无监督方法的特点,如反义词的word embedding相似,造成在情感分析的任务中表现较差; 15 | 4. 论文4.3部分以词序作为变量在多组不同任务(similarity、entailment、sentiment)中进行了实验,可知词序对这些任务都有助提高; 16 | 5. 通过table5、6对比可知 17 | + a. tfidf-GloVe有20个(总数为26)比GloVe-W更优,因此猜想tfidf+GloVe+R会有最优结果; 18 | + b. Twitter数据在GloVe+R中比所有模型高20%左右,这点值得思考; 19 | 20 | 6. SNLI数据集[leaderboard](https://nlp.stanford.edu/projects/snli/)显示,sentence embedding方法最高为84.6%(2017-07-20),本方法为78.2%; 21 | 7. paper所提出的方法确实比较简单,但也给出了相应的理论证明,在某些任务上表现不错。总之,需就待解决的实际问题来选择使用哪方面的方法,考虑是否使用word embedding,词序,tfidf,PCA/SVD等等,需分析什么时候需要,什么时候不需要。 22 | 23 | #### highlight: 24 | 1. Use word embeddings computed using one of the popular methods on unlabeled corpus like Wikipedia, represent the sentence by a weighted average of the word vectors, and then modify them a bit using PCA/SVD. 25 | 2. To estimate cs, we estimate the direction c0 by computing the first principal component of c˜s’s for a set of sentences. 26 | 3. We also note that the top singular vectors c0 of the datasets seem to roughly correspond to the syntactic information or common words. For example, closest words (by cosine similarity) to c0 in the SICK dataset are “just”, “when”, “even”, “one”, “up”, “little”, “way”, “there”, “while”, and “but.” 27 | 4. Finally, we speculate that our method doesn’t outperform RNN’s and LSTM’s for sentiment tasks because (a) the word vectors —and more generally the distributional hypothesis of meaning —has known limitations for capturing sentiment due to the “antonym problem”, (b) also in our weighted average scheme, words like “not” that may be important for sentiment analysis are downweighted a lot. To address (a), there is existing work on learning better word embeddings for sentiment analysis (e.g., (Maas et al., 2011)). To address (b), it is possible to design weighting scheme (or learn weights) for this specific task. 28 | 29 | #### question: 30 |   pca/svd担当的角色主要是什么? 31 | 32 | #### code 33 | 1. [theano](https://github.com/PrincetonML/SIF) 34 | 2. [sklearn](https://github.com/peter3125/sentence2vec) 35 | 3. 使用tf实现可用到tf.svd 36 | -------------------------------------------------------------------------------- /embedding/Advances in Pre-Training Distributed Word Representations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper比较简单,在13年提出的word2vec基础上加了三个近几年提出的改进方案: 3 | 4 | 1. Position-dependent Weighting. 原word2vec中,content向量是word vector相加取平均,没有考虑词序影响,position weighting则考虑了词序影响; 5 | 2. Phrase representations. 原始的cbow是基于unigram的,对词序不敏感,在英文中有些词经常同时出现,如: New York. 通过迭代使用mutual information将高度共现的token拼凑到一起,从而改进unigram出现的问题. 6 | 3. [Subword information](https://github.com/xwzhong/papernote/blob/master/embedding/Enriching%20Word%20Vectors%20with%20Subword%20Information.md). 
原本的模型结果中,词与词之间的向量没有考虑形态学特征(eg: '太阳'和'阳光','阳'这个字包含的信息在训练好的vector中是不共享的),同时无法解决OOV的问题. 7 | 8 | #### comment: 9 | 1. 对于'Phrase representations'用于中文方面,可以直接从分词角度解决; 10 | 2. 句子的词序计算方面,可考虑逆序数,n-gram以及文本提到的Phrase representations; 11 | 3. 文中大部分实验结果是使用fasttext训练得来; 12 | 4. 训练word embeddings时,以句子切分(而不是以段落),同时对句子进行去重,效果会更好(Notably, we have found it very important to de-duplicate sentences in large corpora such as the Common Crawl before training the models); 13 | 5. 实验中,数据比例wiki+news:crawl≈1:31,但使用paper提出的改进,其结果在分类任务中相差不大(avg crawl高0.2%) 14 | 15 | #### highlight: 16 | 1. A critical aspect of their training is thus to capture efficiently as much statistical information as possible from rich and vast sources of data.(终究是统计学思路) 17 | 2. We performed the classification using the standard fastText toolkit running in a supervised mode (Joulin et al., 2016), using the pre-trained models to initialize the classifier. 18 | 3. Our findings indicate that improvements can be achieved by training well-known algorithms on very large text datasets, and that using certain tricks can provide further gains in quality. 19 | 20 | #### question: 21 | 1. 原word2vec使用paper中的式4进行取样,心存疑惑,为什么不是根据词出现的概率? 22 | 2. [Squad](https://rajpurkar.github.io/SQuAD-explorer/)实验中, 具体是怎么使用embedding的?其指标是使用了EM还是F1score? 23 | 24 | #### more reading: 25 |   [论文翻译](http://blog.csdn.net/u014568072/article/details/78940777) 26 | -------------------------------------------------------------------------------- /embedding/An Efficient Framework for Learning Sentence Representations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章提出使用unlabel数据训练句子编码模型,操作很简单,给出一个句子A,预测候选句子(B1,B2,B3...)哪一个是其下一句(分类问题)。 3 | 4 | #### comment: 5 | 1. 作者将所提的想法与同样使用unlabel data训练句向量的[skip-thought](https://github.com/xwzhong/papernote/blob/master/sequence%20to%20sequence/Skip-Thought%20Vectors.md)对比,个人认为skip-thought提出时间已经比较早(2015-06,本文提出时间2018-03),skip-thought已经被很多模型超越; 6 | 2. 就为什么文章所提训练策略效果好,作者给出如后解释:Instead of training a model to reconstruct the surface form of the input sentence or its neighbors, we take the following approach. Use the meaning of the current sentence to predict the meanings of adjacent sentences, where meaning is represented by an embedding of the sentence computed from an encoding function。但是仔细一想,说了跟没说差不多,更深层次的原因没法科学论证; 7 | 3. 从实验效果看确实比skip-thought好很多,BERT模型中task2-next sentence prediction跟文章策略有相通之处,但个人认为仍需对第2点所提问题做探索。 8 | 9 | 10 | #### more: 11 | 1. [tensoflow代码](https://github.com/lajanugen/S2V) 12 | 2. [中文讲解](https://www.jianshu.com/p/50b4094a9c0b) 13 | -------------------------------------------------------------------------------- /embedding/Baseline Needs More Love On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章介绍了由word vector生成sentence vector的方式,并着重介绍了max pooling和hierarchical pooling: 3 | 1. max pooling。对句子中的词向量相同维度取max; 4 | 2. hierarchical pooling。借鉴ngram方式,在指定大小的窗口中取average pooling,最后在所有ngram单元中取max pooling。 5 | 6 | #### comment: 7 | 1. 其提出的hierarchical pooling方式能顾及部分语序信息,但针对长度≤窗口k的句子则学不到语序信息。 8 | 2. max pooling策略在dl中经常用到,没却没有过多的解释,文章4.1.1详细分析了max pooling学到的权重特点: 9 | + a. 学习到的权重更稀疏(大部分为0); 10 | + b. 同一纬度权重最大的前5个词表明了其一般属于相同类别。 11 | 3. 词序对sentiment classification相比于主题模型更重要。 12 | 13 | #### highlight: 14 | 1. 
Simple pooling operations are surprisingly effective at representing longer documents (with hundreds of words), while recurrent/convolutional compositional functions are most effective when constructing representations for short sentences. 15 | 2. Sentiment analysis tasks are more sensitive to word-order features than topic categorization tasks. However, a simple hierarchical pooling layer proposed here achieves comparable results to LSTM/CNN on sentiment analysis tasks. 16 | 3. To match natural language sentences, e.g., textual entailment, answer sentence selection, etc., simple pooling operations already exhibit similar or even superior results, compared to CNN and LSTM. 17 | 4. In SWEM with max-pooling operation, each individual dimension of the word embeddings contains interpretable semantic patterns, and groups together words with a common theme or topic. 18 | -------------------------------------------------------------------------------- /embedding/Deep contextualized word representations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章提出了一种word向量更丰富的表示,主要步骤如下: 3 | 1. 预训练:使用大型语料训练bi-lstm语言模型;使用时,固定bi-lstm参数,获取bi-lstm的输出,经映射并加权相加后得到x的上下文(contextualized)信息y(详见式1); 4 | 2. 针对特定任务,原本是将x输入rnn(与预训练使用的bi-lstm参数不共享),现在将x与y拼接后再输入,或与x经rnn计算后得到的输出h拼接,以此来利用x在当前句子中的上下文信息。 5 | 6 | #### comment: 7 | 1. 从实验的结果上看,该方法在阅读理解(Liu),语义角色标注(He),实体识别(Peters),细粒度情感分类(5类,McCann)模型中提升明显; 8 | 2. 从直观上分析,该方法是从细致化角度表达了word的上下文,该细致化主要归功于lstm的输出(高层次表达),而且通过一定的实验证明了其有效性,但仍像是黑匣子,难以从理论上去分析其原因。 9 | 10 | #### more: 11 | 1. [论文主页](https://allennlp.org/elmo) 12 | 2. [tensorflow code](https://github.com/allenai/bilm-tf) 13 | 3. [tensorflow hub调用(需网络,没中文)](https://www.tensorflow.org/hub/modules/google/elmo/2) 14 | -------------------------------------------------------------------------------- /embedding/Distributed Representations of Sentences and Documents.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   基于word2vec理论,文章提出句子/篇章编码的方法,分两种模型: 3 | 4 | 1. paragraph vector-a distributed memory model。模型分三层,输入层、隐含层和输出层,结构图如下: 5 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Distributed%20Representations%20of%20Sentences%20and%20Documents.png) 6 | 7 | + 1.1. 训练阶段: 8 | + a. 输入层:为paragraph id对应的向量、context前k-1个词的词向量(不像cbow模型,由上下文的词预测中心词,这里是用前k-1个词预测第k个词,同时注意,此处的context由paragraph按窗口大小k切分得来) 9 | + b. 隐含层:步骤a中的向量取平均或拼接 10 | + c. 输出层:softmax层,对应输出的label为context第k个词 11 | 12 | + 1.2. inference阶段 13 | + a. 如果paragraph已经出现在训练语料中,则可直接利用训练好的paragraph id对应的向量 14 | + b. 如果没有,则需初始化这个向量,在固定词向量相关参数的基础上使用梯度下降更新此处初始化的paragraph向量 15 | 16 | 2. paragraph vector without ordering-distributed bag of words。模型一样分三层: 17 | + a. 输入层:paragraph id对应的向量 18 | + b. 隐含层:步骤a中的向量 19 | + c. 输出层:softmax层,从context中随机抽取m个词,最大化这m个词出现的概率 20 | 21 | 22 | #### comment: 23 | 1. 该方法思路秉承了word2vec cbow和skip-gram的思想,细节方面还是有些不同,比如“paragraph vector-a distributed memory model”是用“paragraph向量+前k-1个词”预测第k个词 24 | 2. 个人对“paragraph vector-a distributed memory model”训练和inference步骤的结合上持有异议,训练过程中,词向量相关的参数是变化的,而inference过程中,其是固定的,这样会造成两个过程的不协调,可能是为了加快inference的预测速度而做的改动,个人提出的一个改进策略是,在train阶段分两部分,一部分会改变词向量的参数,另一部分不会,这样就能和预测过程匹配 25 | 3. 缺点: 26 | + a. 参数多。每个paragraph都有一个向量,假定paragraph数跟vocab大小一致,则整个模型的参数量增加一倍,直接拉低训练速度,而往往训练数据的量都比较大,特别是对于短句,动辄上亿,对于长文本,例如新闻,一般也有上百万 27 | + b. 预测速度慢。预测过程中,如果paragraph不在训练语料中(对于长文本,往往都是不在其训练语料中,如新闻),则需要初始化该paragraph向量,再用梯度下降方式更新其向量,长文本还可得到更多context,每个context都得进行一次或多次梯度更新,非常不友好 28 | 4. 
这个模型在2016年11月份左右听同事试验过(反馈说用gensim训练速度慢-一方面是没有c加速,word2vec则有),到目前为止在其它论文的对比实验中,很少提到该模型(一两次) 29 | -------------------------------------------------------------------------------- /embedding/Enriching Word Vectors with Subword Information.md: -------------------------------------------------------------------------------- 1 | ### Enriching Word Vectors with Subword Information 2 | 3 | #### note: 4 |   paper提出了一种改进基于word2vec的方式。核心思想是使用了subword信息。简单来讲,假设有两个词“火影忍者”和“忍者”,由于这两个词存在交集,使用subword方式便将这两者内在的语义联系起来,而不需要这两个词都出现的语料。实现比较简单,仅需在原有代码的基础上,使用n-gram的方式修改数据的输入,eg:“火影忍者”的2-gram表示——<火,火影,影忍,忍者,者>,其中<和>分别为起始和结尾标识符。 5 | 6 | #### comment: 7 | 1. subword的优点: 8 | * a. 学习到词与词形态学方面的信息,词与词交集越多,且语义相近的数据集,其效果越好。(可查看sisg-的效果) 9 | * b. 较好解决OOV的情况,对于长尾数据是很好的补充。(可查看sisg的效果) 10 | 2. 论文在四个任务上进行了比较——word similarity, word analogy, morphological reresentation, language model, 有以下结论: 11 | * a. 不考虑OOV的情况(sisg-),模型提高约3+%左右,考虑OOV(sisg)模型提高约4+%。 12 | * b. 在word analogy任务中,syntactic方面的准确率提高更明显,最多能提高22%+。 13 | 3. n-gram是用在词内部,不是用在词与词之间,所以能学到的是词内部中部分顺序信息。扩展到词与词用n-gram方式便可得到带顺序的句向量(不知道实际的效果在特定任务上效果如何?经思考,如果扩展到词用n-gram,内存要求比较高,对于2-gram,假设中文词有10w,则有10的十次方个token)。 14 | 4. 文中虽然对n的选择进行了分析,但n的选择用于中文仍需实验得到,标准——对最终任务的提升效果。 15 | 5. 该策略为fasttext的一部分,gensim上已有代码实现,可直接调用。 16 | 6. 既然使用subword学习到的是形态学信息,如果其信息是完全不相关的,是不是该把这部分数据剔除,例如人的姓名。 17 | 18 | #### highlight: 19 | 1. A vector representation is associated to each character n-gram; words being represented as the sum of these representations. 20 | 2. Most of these techniques represent each word of the vocabulary by a distinct vector, without parameter sharing. 21 | 3. In particular, they ignore the internal structure of words, which is an important limitation for morphologically rich languages. 22 | 4. As expected, the nearest neighbors for complex, technical and infrequent words using our approach are better than the ones obtained using the baseline model. 23 | 5. we observe that the most important n-grams correspond to valid morphemes. 24 | -------------------------------------------------------------------------------- /embedding/Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention.md: -------------------------------------------------------------------------------- 1 | ### Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention 2 | 3 | #### note: 4 |   在对sentence进行embedding时,当前sentence各个step emit的vector(使用biLSTM生成)能考虑、根据该sentence itself的信息进行加权,故称为inner-attention,该想法的根据是:when human read one sentence, people usually can roughly form an intuition about which part of the sentence is more important according past experience. 5 | 6 | #### comment: 7 | 1. 实验较少,sentence itself信息计算只实验了mean pooling的情况,对比下max pooling应该更好;或引入tfidf进行加权,因为paper既然提到了不同词的重要性,那也可以跟tfidf做个对比; 8 | 2. differenting input算是一个tricky。在预处理中,将premise和hypothesis共有的word从两边剔除,而且还能加快计算速度。 9 | 10 | #### highlight: 11 |   Differentiating input strategy forces the model to focus on different part of the two sentences which may help the classification for Neutral and Contradiction examples as we observed that our model tended to assign unconfident instances to Entailment. 
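#### 示例(个人补充):
  就上面 highlight 提到的 differentiating input 策略,补一个极简示意(个人示例,非论文代码):预处理时把 premise 和 hypothesis 共有的词从两句中剔除,迫使模型关注两句的差异部分,同时还能缩短输入、加快计算。

```python
def differentiate(premise_tokens, hypothesis_tokens):
    """剔除两句共有的词,返回只保留差异部分的 token 列表"""
    shared = set(premise_tokens) & set(hypothesis_tokens)
    p = [t for t in premise_tokens if t not in shared]
    h = [t for t in hypothesis_tokens if t not in shared]
    return p, h

# 例: differentiate(["a", "man", "is", "running"], ["a", "man", "is", "sleeping"])
# -> (["running"], ["sleeping"])
```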
12 | 13 | #### question: 14 |   有了双向LSTM还需要inner attention?BiLSTM主要考虑了前后文、长文本信息,但是并没有考虑前后文中哪些词重要哪些词不重要。 15 | -------------------------------------------------------------------------------- /embedding/Learning Semantic Similarity for Very Short Texts.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章立足与短文本(实验使用了长度为10, 20和30的句子)语义相似度匹配, 分析了从词向量到句向量的不同整合方式,eg: 对每个维度取min,max,min/max拼接,mean,与idf结合等策略,最终提出了一个有监督的权重学习模型. 3 | 4 | #### comment: 5 | 1. 经实验发现以下特征: 6 | 7 | + a. 使用单纯的tf-idf向量匹配,文本越长,匹配错误率越低(length_10_36.15%-length_20_20.09%-length_30_12.55%); 8 | + b. 对于长文本(实验中,长度为30),paper实验的各种整合方式对其错误率的影响相对较小(±4%)vs(±12.8)length10; 9 | + c. 在无监督的方式中, mean+idf weighted的方式在长度为10和20的匹配中表现优异,效果比较:10>20>30; 10 | + d. 在长度为20的文本比较distance计算公式, split error指标中,Mean+Euclidean比Mean+cosine高1.5%; 11 | + e. 在有监督的权重学习中,长度为20的文本其split eror为14.44%,比mean+idf weighted+cosine低1.33%; 12 | 2. paper提出的有监督学习importance factors方案给人眼前一亮,比较可惜的是没在长度为10的文本中进行实验; 13 | 3. 在importance factors策略中,在学习到的权重基础上乘上idf值效果会不会更好?因为importance factors仅学习到不同词之间的相对权重,没有考虑词本身的重要性; 14 | 15 | #### highlight: 16 | 1. Our main contribution is a first step towards a hybrid method that combines the strength of dense distributed representations— as opposed to sparse term matching—with the strength of tf-idf based methods to automatically reduce the impact of less informative terms. 17 | 2. In this paper, we show how word embeddings can be combined into a new vector representation for the entire considered fragment, in which the impact of frequent words, i.e. with a low idf-component, and therefore mostly non-informative—is reduced with respect to more informative words. This leads to a significant increase in the effectiveness of detecting semantically similar shorttext fragments, compared to traditional tf-idf techniques or simple heuristic methods to combine word embeddings. 18 | 3. We regard two text fragments to be semantically similar if their corresponding vector representations lie close to each other according to some distance measure, and dissimilar if the vectors lie farther apart.(paper给出的语义相似度定义) 19 | -------------------------------------------------------------------------------- /embedding/Learning Semantic Textual Similarity from Conversations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章提出使用对话数据+迁移学习(此处使用了SNLI数据集)来生成句向量,从而用于qa中question rerank,answer rerank和sentence相似度计算等任务中。 3 | 1. 框架整体结构。使用transformer结构(f4-b)对input&response进行编码(fig3),得到两个句向量u和v,再对v使用全连接层映射,得到v',对u和v'计算score,其中input encoder和response encoder参数相同。 4 | 2. 数据。reddit社区一问多答语料+SNLI文本推理数据。训练中的整合方式时:When training the multitask model, we initialize the shared parameters with a pretrained Reddit model. We employ a distributed training system with multiple workers, where 95% of workers are used to continue training the Reddit task and 5% of workers are used to train the SNLI task。(个人理解在初始化后是同时训练两份数据集) 5 | 3. 实验。设置了多个不同场景的实验: 6 | 7 | + a. reddit数据集上response prediction(qa中answer重排), 结论:相比于DAN(deep averaging network)结构, transformer高8%左右准确率; 8 | + b. SNLI数据集分类任务。Reddit+SNLI准确率为84.1%; 9 | + c. STS Benchmark句子相似度(pair有相似得分)。在对原本的cos得分映射后,Reddit+SNLI tuned后test结果为0.808。tuned:For adaptation, the second configuration uses an additional transformation matrix of the sentence embeddings. The matrix is parameterized using the STS training data。(tuned的影响比较大,2%+); 10 | + d. CQA Subtask B在qa中对question rerank。 11 | 12 | #### comment: 13 | 1. 
整一篇文章下来,突出了transformer结构+迁移学习的优点: 14 | 15 | + a. transformer结构对比DAN具有明显的优势; 16 | + b. 使用了SNLI数据使得在STS Benchmark任务中表现突出,但作者也分析了其中的原因(4.4.2),指出:, presumably because most of the SNLI sentences are image captions, while Reddit doesn’t contain much captionstyle data. In contrast, the performance of the models are similar in the other two categories.在CQA Subtask B实验结果中也能得到一定的佐证。进一步讲,在迁移学习中,使用领域相关数据对最终的任务改进才比较明显。 17 | + c. 文章introduction提到,假设answer相同,则推测其对应question语义相近,但通篇下来似乎没有着重分析;其次,个人认为这是双向的,一句话有多种回复方式,同时一个回复其也可以有多种提问方式,特别是在闲聊领域中,因此一种比较好的方式是结合此两方面的特点。 18 | 2. 引入SNLI分类数据时,作者借鉴了[Conneau et al. (2017)](https://github.com/xwzhong/papernote/blob/master/embedding/Supervised%20Learning%20of%20Universal%20Sentence%20Representations%20from%20Natural%20Language%20Inference%20Data.md)提出的方法:encode a sentence pair into vectors u1, u2 and construct a feature vector (u1, u2, |u1 − u2|, u1 ∗ u2). 19 | 3. 文章摘要中写道:Our method trains an unsupervised model to predict conversational input-response pairs. 此处说unsupervised似乎不太合理,作者只是没有使用人工标注的数据集,而使用了确定的一问一答语料+SNLI语料。 20 | 4. 在句子相似度匹配上,半监督学习或许是不错的思路。首先使用大语料训练一个无监督模型,然后再设计有监督的模型(该模型可能和无监督模型一样,也可能不一样),利用极少的数据微调模型,主要是考虑到有标注数据的缺失,同时利用领域相近的数据一般能学到相似的知识,[Learning to Generate Reviews and Discovering Sentiment](https://github.com/xwzhong/papernote/blob/master/neural%20network/Learning%20to%20Generate%20Reviews%20and%20Discovering%20Sentiment.md)就是一个例子。 21 | 22 | #### more: 23 | 1. [Learning Semantic Textual Similarity from Conversations 论文实现](https://blog.csdn.net/sinat_30665603/article/details/80174128) 24 | 2. [如何匹配两段文本的语义?](https://www.jiqizhixin.com/articles/2018-06-11-17) 25 | 26 | -------------------------------------------------------------------------------- /embedding/Linguistic Regularities in Continuous Space Word Representations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper提出使用simple RNN语言模型来构建词向量的方法,训练好的模型参数即可转变为词向量,详见式1. 3 | 4 | #### comment: 5 | 1. 相比word2vec,RNNLM生成词向量计算量更大,但得到的语法/语义效果更佳,其中一点在于RNNLM考虑了词序; 6 | 2. 在softmax层遇到了与word2vec同样的问题,矩阵维度大导致计算量大(涉及词汇量大小),应该需要进一步的优化,eg:采样; 7 | -------------------------------------------------------------------------------- /embedding/Mimicking Word Embeddings using Subword RNNs.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对词向量的OOV情况,作者提出使用已有词的向量训练“字符级”有监督的模型来预测位置词的向量(详见paper第3部分):Formally: given a language L, a vocabulary V ⊆ L of size V , and a pre-trained embeddings table W ∈ R^(V×d) where each word wk is assigned a vector e^k of dimension d, our model is trained to find the function f : L → R^d such that the projected function f | V approximates the assignments f (wk) ≈ e^k . Given such a model, a new word wk∗ ∈ L\V can now be assigned an embedding ek∗ = f(wk∗). 
模型结构如下: 3 | 4 | ![](https://github.com/xwzhong/papernote/blob/master/pic/mimick%20model%20architecture.png) 5 | 6 | #### comment: 7 |     相比于[Bite Pair Encoding](https://github.com/xwzhong/papernote/blob/master/others/Neural%20Machine%20Translation%20of%20Rare%20Words%20with%20Subword.md),文章提出的解决OOV的方案有点曲折,使用word2vec或fasttext等其它算法预训练词向量后还需要有监督地训练向量拟合模型 8 | 9 | 10 | #### more: 11 |     [代码](https://github.com/yuvalpinter/Mimick) 12 | -------------------------------------------------------------------------------- /embedding/On the Dimensionality of Word Embedding.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章提出了词向量维度确定的方法,首先指出词向量需满足unitary不变性的特点:无论是词向量夹角(相似性),还是平行度(类比性),亦或是距离(聚类性),假如把词嵌入整体做一个旋转或者镜面对称,这些几何性质都不会发生变化,而Pairwise Inner Product(PIP)损失能着重测量词嵌入unitary不变性质之间的距离,因此被设计为确定词向量维度的损失函数。 3 | 4 | #### comment: 5 | 1. 该策略仍是从词语义分析出发,而不是从后续的高层任务(eg:分类,序列标注等)来确定词嵌入的维度; 6 | 2. paper实验表明,在“词语义”分析上,存在最优词嵌入维度,而且word2vec和glove对过拟合具有鲁棒性,维度非常高时,在词义相似性任务虽有下降,但下降幅度随维度增加并不明显,从这个角度来说,在使用词嵌入的下游任务中是不是也存在最优维度? 7 | 3. [ETH-DS3Lab at SemEval-2018 Task 7](https://github.com/xwzhong/papernote/blob/master/classification/ETH-DS3Lab%20at%20SemEval-2018%20Task%207:%20Effectively%20Combining%20Recurrent%20and%20Convolutional%20Neural%20Networks%20for%20Relation%20Classification%20and%20Extraction.md)为不同词嵌入维度具有不同的效果提供了一个佐证。 8 | 9 | #### more: 10 | 1. [NeurIPS 2018 oral论文解读:如何给词嵌入模型选择最优维度](https://www.jiqizhixin.com/articles/2019-01-03-8) 11 | -------------------------------------------------------------------------------- /embedding/Order-Embeddings of Images and Language.md: -------------------------------------------------------------------------------- 1 | ### Order-Embeddings of Images and Language 2 | 3 | #### note: 4 |   paper提出了一种描述“上下位”的loss设计,该设计所考虑的点主要有两个,“上下关系或传递性”的描述以及“两者关系非对称性”。title中的“order”代表了“传递性”,文中第二部分从理论给出了设计方式,最后提出了描述“上下位关系”的max-margin loss方程(见式3),该式子翻译过来即为:令positive样本满足上下位关系(使式3的前半部分尽量为0),negative样本则相反。 5 | 6 | #### comment: 7 | 1. 传递性例子(见fig1)。As a partial order, this relation is transitive: “woman walking her dog”, “woman walking”, “person walking”, “person”, and “entity” are all valid abstractions of the rightmost image. 8 | 2. 非对称性例子。For instance, (person, organism) is a direct hypernym pair, but it takes eight hypernym edges to get from cat to organism. 9 | 3. 在使用该loss设计方式时,一定得先明确“问题”是否具有上下位关系,这是理论基础,在论文实验中,order-embedding(reversed)设置进一步得到论证。 10 | 4. paper提出的loss设计,有较强的理论基础,而且经实验发现,简单的设计也得到了不错的实验结果。 11 | -------------------------------------------------------------------------------- /embedding/Quantifying Mental Health from Social Media with Neural User Embeddings.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章想根据“用户发的推特”学习用户向量表征,达到以下目的: 3 | + a. 对于“抑郁症患者”“创伤后应激障碍患者”“对照组(未上述疾病用户)”三类数据,在空间上是否有聚类特点 4 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Quantifying%20Mental%20Health%20from%20Social%20Media%20with%20Neural%20User%20Embeddings.png) 5 | + b. 将向量表示用于下游任务 6 | 用户向量训练策略:用户index -> user embedding(类似从词转id中的word embedding)-> 计算“用户向量”和“推特中出现的词向量”的dot product -> 使用hinge loss(负样本词为不在推特文本中的词) 7 | 8 | #### comment: 9 | 1. 该策略类似[paragraph2vector](Distributed Representations of Sentences and Documents),固定paragraph/用户向量,但同时也明显存在缺陷,对于没有出现过的用户回出现“OOV”,可以考虑将用户tag当作用户表征,对于直接使用用户唯一标识作为输入的模型都有一定程度的缺陷; 10 | 2. 用于下游任务时,固定用户embedding后加一个全连接层,效果更佳(文章给的解释是提供了提供了领域和任务的特征) 11 | 12 | #### more: 13 | 1. 
[github代码](https://github.com/samiroid/usr2vec) 14 | -------------------------------------------------------------------------------- /embedding/Semi-Supervised Sequence Modeling with Cross-View Training.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对有监督任务中数据量少的情况,作者提出使用半监督的方式强化句子层面的表示。以下图为例: 3 | 4 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Semi-Supervised%20Sequence%20Modeling%20with%20Cross-View%20Training.png) 5 | 6 | + a. 有标签任务中需要预测Washington为地点; 7 | + b. 在无标签的任务中,仅提取句子部分信息,同样需要预测'Washington'为地点。此处,'Washington'可能不可见,预测时,最后一层softmax层参数不同,其余相同,监督方式为使用KL散度拟合与步骤a的输出分布。 8 | 9 | 10 | #### comment: 11 |     文章主要的想法是深化句子层面的语义表征,跟近期的BERT有相通之处,但使用的数据量相对于BERT少很多。 12 | 13 | 14 | #### more: 15 | 1. [tensoflow代码](https://github.com/tensorflow/models/tree/master/research/cvt_text) 16 | 2. [遗珠之作?谷歌Quoc Le这篇NLP预训练模型论文值得一看](https://www.jiqizhixin.com/articles/2019-01-07-5) 17 | -------------------------------------------------------------------------------- /embedding/Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.md: -------------------------------------------------------------------------------- 1 | ### Supervised Learning of Universal Sentence Representations from Natural Language Inference Data 2 | 3 | #### note: 4 |   paper提出使用SNLI(Stanford Natural Language Inference)数据训练inference模型,再用其中间产物来生成其它task的句子向量。该模型是在fig1结构下进行的。 5 | 6 | #### comment: 7 | 1. 从实验结果来看,通过该有监督学习方法得到的sentence embedding方式确实比无监督学习得到的sentence embedding在多个任务中效果更好,但是背后深层次的原理仍不能很明确地证明出来,一而概之可以说是SNLI为待构建的sentence提供了除该句子之外其它的信息(或者是能学到的,确定的模型同时也确定了其能表达、学到的内容),但是具体的理论证明仍需学者探索,同时也因为这个因素造成其不可控,难以运用在实际项目中; 8 | 2. BiLSTM-max比mean、LSTM等表现更优。 9 | 10 | #### highlight: 11 |   stacked RNN:The goal of a such model is to encourage each recurrent level to operate at a different timescale. 12 | 13 | #### thought: 14 |   使用transfer learning时需要考虑以下几点,pretrain的information是否切合最终的问题,比如pretrain word2vec,如果只是通过简单的加权(mean或tfidf),反倒使情感分类的效果更差(因为word2vec弱化了情感正负面词之间的相似度),反之,pretrain的模型越切合待解决问题,最终的效果会越好,切合的角度有:数据来源(涉及到词分布、语法特点)、问题特点等。 15 | 16 | #### highlight: 17 | * hypothesis:We hypothesize that the semantic nature of NLI makes it a good candidate for learning universal sentence embeddings in a supervised way. 18 | * reason:Models can be trained on SNLI in two different ways: (i) sentence encoding-based models that explicitly separate the encoding of the individual sentences and (ii) joint methods that allow to use encoding of both sentences (to use cross-features or attention from one sentence to the other). 19 | * BiLSTM-Mean does not make sharp choices on which part of the sentence is more important than others. 20 | 21 | #### [code](https://github.com/facebookresearch/) 22 | -------------------------------------------------------------------------------- /embedding/Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper指出使用词语相似度的高阶关系进一步抽取word vector语义信息,具体操作是矩阵相乘(类似于有向图的可达矩阵),同时可将该高阶信息通过线性转化得来; 3 | 4 | #### comment: 5 | 1. paper提出的观点有一定的理论依据,想法挺有意思,同时在不同DL模型中引入方便; 6 | 2. 抛开paper提出的改进策略,各项实验都表明fasttext效果优于word2vec和glove,整体排序: fasttext>glove>word2vec; 7 | 3. 相比于fasttext,该策略对word2vec和glove提升比较明显,同时alpha参数选取比较麻烦,alpha选得差会降低模型的效果(综合来看,选用fasttext时训练word embedding时对模型提高有限); 8 | 9 | 10 | #### more: 11 | 1. 
[CoNLL 2018 | 最佳论文揭晓:词嵌入获得的信息远比我们想象中的要多得多](https://mbd.baidu.com/newspage/data/landingsuper?context=%7B%22nid%22%3A%22news_8705292014518314824%22%7D&n_type=0&p_from=1) 12 | 2. [code](https://github.com/artetxem/uncovec) 13 | -------------------------------------------------------------------------------- /information retrieval/Personalizing Search via Automated Analysis of Interests and Activities.md: -------------------------------------------------------------------------------- 1 | ### Personalizing Search via Automated Analysis of Interests and Activities 2 | 3 | #### note: 4 |   paper提出在索引框架上,根据用户的额外信息(eg:此前输入的query,浏览过的网页,写/看过的邮件等),加入BM25算法中进行reranking,利用额外信息的方式仍立足于词频。 5 | 6 | #### comment: 7 |   模型做的变动较小,但却很好地引入了个人统计信息。 8 | -------------------------------------------------------------------------------- /leaderboard.md: -------------------------------------------------------------------------------- 1 | #### Sentiment analysis 2 | * [IMDb\SST\Yelp\SemEval\Sentihood\SemEval-2014 Task 4](http://nlpprogress.com/english/sentiment_analysis.html) 3 | 4 | #### text similarity 5 | * [STSbenchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) 6 | 7 | #### text classification 8 | * [The Stanford Natural Language Inference (SNLI) Corpus](https://nlp.stanford.edu/projects/snli/) 9 | 10 | #### Natural Language Inference 11 | * [CoQA: A Conversational Question Answering Challenge](https://stanfordnlp.github.io/coqa/) 12 | * [SQuAD: The Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/) 13 | * [Question Answering in Context](http://quac.ai) 14 | 15 | #### QA 16 | * [Question Answering (State of the art)](https://aclweb.org/aclwiki/Question_Answering_(State_of_the_art)) 17 | 18 | #### others 19 | * [glue](https://gluebenchmark.com/leaderboard) 20 | * [SWAG Dataset](https://leaderboard.allenai.org/swag/submissions/public) 21 | * [aclwiki State of the art](https://aclweb.org/aclwiki/State_of_the_art) 22 | -------------------------------------------------------------------------------- /multi-task/Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对参数共享的多任务学习,文章引入expert的概念,不同expert学习不同的知识,再通过gate来控制不同expert提供知识的多少,文章提出路径: 3 | + 背景:常见的multi task是共享底层网络的(Shared-Bottom model),结构如下图(缺点:不同的任务可能关联性差,直接将不同任务的主要参数进行共享,如果不同任务的关联性比较低,模型不容易在两者中都拟合的特别好,跟模型的学习能力相关): 4 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Modeling%20Task%20Relationships%20in%20Multi-task%20Learning%20with%20Multi-gate%20Mixture-of-Experts-pic1.png) 5 | + 改进思路:将共享的参数拆分成多组,每一组相当于一个expert,用以学习特定方面的知识,然后通过gate来控制每个expert对不同任务加权,结构图: 6 | 7 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Modeling%20Task%20Relationships%20in%20Multi-task%20Learning%20with%20Multi-gate%20Mixture-of-Experts-pic2.png) 8 | + 进一步升级:n个任务引入n组gate,原来单gate中,gate参数相同(可以理解为attention参数相同,输入不同),结构图: 9 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Modeling%20Task%20Relationships%20in%20Multi-task%20Learning%20with%20Multi-gate%20Mixture-of-Experts-pic3.png) 10 | 11 | #### comment: 12 | 1. paper主要想解决不同任务关联性低时参数学习存在一定冲突问题,通过引入多组参数+gate控制的想法来解决,学习的目标是两个任务的预测效果都要求比较好 13 | 14 | #### more: 15 | 1. 
[多任务学习模型详解:Multi-gate Mixture-of-Experts(MMoE ,Google,KDD2018)](https://blog.csdn.net/ty44111144ty/article/details/99068255) 16 | -------------------------------------------------------------------------------- /multi-task/Multi-Task Deep Neural Networks for Natural Language Understanding.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper在bert基础上引入多任务学习,与原bert相比,在GLUE数据集上高2.2%绝对值,引入多任务策略比较简单:当前step,随机选择某个任务,然后在预训练的bert基础上微调任务相关(包含bert本身)参数。 3 | 4 | #### comment: 5 | + 1. 文章idea根源:借助不同任务学习到不同的能力在bert参数部分(不同任务共享的参数)得到展现; 6 | + 2. 不同任务在直接如论文方式微调情况下也可能阻碍其它任务的学习。 7 | 8 | #### more: 9 | 1. [官方提供的pytorch代码](https://github.com/namisan/mt-dnn) 10 | -------------------------------------------------------------------------------- /multi-task/Overcoming catastrophic forgetting in neural networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对串行训练的multi-task,考虑到不同任务优化的方向不同,当前task B可能会使已达到最优超参平面的task A远离,因此提出elastic weight consolidation(EWC)改进策略: 3 | + 1. 前提:任务A和任务B存在较优的公共解区间 4 | + 2. 方案:在当前step k(任务A已经训练完),任务B的梯度反向传播时,对于任务A重要的weight施加较小的影响(参数更新幅度小) 5 | + 3. 操作: 6 | + a. 对于任务A,按照常规方法训练,达到解空间x 7 | + b. 对于任务B,先确定解空间x对任务A的重要程度F(利用任务A的验证集,计算x的梯度,然后取平方,换言之,梯度越小,该参数更接近任务A的最优解,因此越重要),并记录解空间x的具体数值star,然后在计算任务B的loss时新增对旧任务的consolidation。 8 | + c. 针对任务C、D...,重复step b 9 | 10 | #### comment: 11 | + 1. 文章提出的观念比较新颖,其中的理论推导相比[代码](https://github.com/ariseff/overcoming-catastrophic)难理解一点,针对多任务,但有不共用的参数部分需要留意。 12 | 13 | #### more: 14 | + 1. [tf示例代码](https://github.com/ariseff/overcoming-catastrophic) 15 | -------------------------------------------------------------------------------- /name entity recognition/Chinese NER Using Lattice LSTM.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     NER模型可以分为char-based和word-based模型,背景: 3 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese%20NER%20Using%20Lattice%20LSTM-pic2.png) 4 | + char-based缺点:char-based模型没有很好地利用词语粒度的信息,char+bichar利用bigram词的embedding信息,char+softword的思想是利用多任务学习共享参数来学习词语粒度的信息 5 | + word-based缺点:word-based需要首先保证分词的准确率;整合char信息的方式主要是在word embedding时concatenate char语义信息(每个char输入lstm或cnn等到一个向量) 6 | + 文章提出的模型(lattice lstm):lattice是基于char-based的NER模型,同时显式地利用了词语粒度信息,利用的地方——在当前字符j的cell vector同时利用了char和word: 7 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese%20NER%20Using%20Lattice%20LSTM-pic1.png) 8 | + i. char级别: 9 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese%20NER%20Using%20Lattice%20LSTM-pic5.png) 10 | + ii. word级别: 11 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese%20NER%20Using%20Lattice%20LSTM-pic3.png) 12 | + iii. 整合char和word: 13 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Chinese%20NER%20Using%20Lattice%20LSTM-pic4.png) 14 | 15 | #### comment: 16 | 1. 文章提出的切入点好,而且引入word的信息相对直接 17 | -------------------------------------------------------------------------------- /name entity recognition/Neural Architectures for Fine-grained Entity Type Classification.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   作者提出了一种细粒度实体识别的方法,主要步骤如下: 3 | 1. 对mention entity进行编码。原文是以英文为基础,以word为单位,将每个word使用embedding转为vector后,将所有vector对应维度相加并取平均。 4 | 2. 对context进行编码。在原句子中以mention entity单词切分,分左右两部分词集left, right。使用bilstm分别对其进行编码(左右两边是不同的lstm参数),然后使用attentive机制对输出的hidden output进行加权,并归一化。 5 | 3. 引入hand-craft特征(tab1)。 6 | 4. 
引入hierarchical label,使具有相似属性的实体label其参数能共享(原文fig2)。 7 | 8 | #### comment: 9 | 1. Fine-grained Entity Type Classification指细粒度的实体识别,比如常用的lstm+crf模型能识别人名,地名,而Fine-grained Entity Type Classification则进一步细分人名,如:医生,演员,作家等。 10 | 2. paper提到,对mention entity进行encoding时选了相对粗糙的方法,是为了降低过拟合的可能。 11 | 3. 步骤2中引入的attentive机制与seq2seq中引入的方式如出一辙,而且经本文实验发现,相比于averaging,attentive模型准确率提高8%+(tab3)。 12 | 4. hand-craft特征对结果影响明显,根据数据不同,影响程度不同,甚至可能会降低acc(tab5)。 13 | 14 | #### [code](https://github.com/shimaokasonse/NFGEC) 15 | -------------------------------------------------------------------------------- /name entity recognition/Neural Architectures for Named Entity Recognition.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   这是一篇使用深度学习进行实体识别的经典论文, biLSTM+CRF结构. 原理比较简单, 先使用biLSTM结构得到每个char/word的logit输出, 输出的维度对应tag个数, 然后使用CRF计算最优路径. 3 | 4 | #### summary: 5 | ##### 1. 简评 6 |   biLSTM+CRF进行实体识别思路比较简单, 在简单优化的情况下就能超越之前精雕细琢的模型. 7 | 8 | ##### 2. "为什么少量的数据(以千计)便能得到良好的实体识别效果" 9 | 10 | + a. 前提: 如果仅根据当前句子本身可利用的信息便能识别出实体, 那么使用识别算法的效果便可做一定的预期, 这是一个很重要的前提, 可预期模型识别在现有标注语料下的识别上限, 正如翻译, 大部分情况下, 仅根据当前语料对训练便能得到较好的效果, 而在对话系统中, 同样使用seq2seq其效果却与翻译的结果相差很多; 11 | + b. 句子结构: 实体识别任务非常需要利用句子本身结构的信息, 整体数据的语法结构越规范, 模型学习后预测的效果会更好, 例如: '你在阿里巴巴工作吗?', 如果类似'在xxx工作'经常出现, 那么模型也会学到'在xxx工作'结构中, 'xxx'很可能代表公司; 12 | + c. 词语记忆: 所谓的记忆是指模型在训练语料中见过某个实体词, 若在预测新数据的时出现该实体词, 则能正确识别的可能性将非常大; 13 | + d. embedding泛化: 此处的泛化是由数据特点决定的, 如, 大部分人名的姓氏出自百家姓, 而这些字(词)可通过embedding预先学习到其相似特征; 时间也是同理, 时间内常包含数字信息, 而数字在embedding中相似的空间的结构能减轻模型学习的负担. 因此embedding在biLSTM+CRF中的重要性不言而喻; 14 | + e. biLSTM+CRF结构; 还有一个不能忽略的是biLSTM+CRF本身的结构, biLSTM能学习语序信息, 其双向结构能记忆上下文, 同时CRF也是非常优秀的统计算法, 在分词任务中已有实验证明其效果非常显著. 15 | 16 | ##### 3. 预训练 17 | 18 | + a. embedding 19 | 20 | + ① 该模型的准确率对embedding比较敏感, 使用与实体识别相似的大规模数据(G级别)训练得到的embedding, 比小规模, 不同类型(大规模)的数据更好; 21 | + ② 在训练biLSTM+CRF模型时, 建议不随模型更新embedding参数(embedding用大规模同类型语料训练得到), 其主要原因是固定同类型大规模语料训练得到char/word之间的关系, 经实验, embedding若一直是trainable将导致准确率下降; 也有学者提到在训练到一定epoch后再更新embedding参数, 个人发现针对同类型大规模得到的embedding, 该策略对F1值影响不大; 22 | + ③ 从embedding对模型的重要程度可知, 适当增加embedding的维度也是提高F1值不错的选择; 23 | + ④ 连接input embedding和output embedding也能提高模型的准确率, 可见[论文笔记](https://github.com/xwzhong/papernote/blob/master/neural%20network/Using%20the%20Output%20Embedding%20to%20Improve%20Language%20Models.md), 该策略的进阶版解释详见[链接](https://github.com/xwzhong/papernote/blob/master/neural%20network/Tying%20Word%20Vectors%20and%20Word%20Classifiers:%20A%20Loss%20Framework%20for%20Language%20Modeling.md); 24 | 25 | + b. 语言模型. 使用语言模型先对biLSTM结构进行预训练发现, biLSTM+CRF模型的拟合速度加快, 但是最终的F1值却有所下降, 关于这点个人仍十分怀疑实验的科学性, 待验证; 26 | + c. 词级别. 27 | 28 | ##### 4. batch size 29 |   该模型偏向于使用较小的batch size, batch size大虽然能加快模型的收敛, 但最终F1值却下降, 可见同个batch的样本彼此之间应该有较大的影响, 可考虑使用较小的batch size或使用normalization策略降低同个batch样本之间的互相影响.(更多解释见"[当前关注点:批量大小、学习率、泛化性能下降](https://www.jiqizhixin.com/articles/051303)"部分) 30 | 31 | ##### 5. projection layer 32 |   在biLSTM output到softmax之间添加前馈神经网络; 33 | 34 | + a. 原因; 结合论文[Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953)理解, biLSTM的输出为高秩, 直接连接softmax层对限制softmax的学习能力, 用个可能不恰当的比喻, 如, 线性可分模型训练非线性数据集将得到较差的结果; 35 | + b. 搭建"连接input embedding和output embedding"的桥梁; 原tying技巧直接将LSTM输出与word embedding连接, 这需要LSTM隐层状态维度与embedding维度相同(如果是biLSTM, 则embedding维度需为LSTM隐层状态维度的两倍), 如果希望LSTM隐层维度与embedding维度不相同, 可再加一层或多层前馈神经层, 此时不影响tying技巧的使用, 因为该技巧核心是input embedding与projection layer output相同. 36 | + c. 
重新初始化projection layer参数, 对第K次训练好的biLSTM参数进行再训练; 此时biLSTM参数相当于已经站在比较好的收敛位置, 此时再初始化projection layer参数进行训练, 将使模型biLSTM+projection参数训练更好的拟合位置. 37 | 38 | ##### 6. tagging schemes 39 |   paper提到了两种打标签方式, IOB(inside, outside, begin)和IOBES(inside, outside, begin, end, singleton), 从5.a提到的论文出发, 个人也是推荐使用IOBES形式的打标签策略. 40 | 41 | ##### 7. 训练数据: 42 | 43 | + a. 标注标准; 不管是人工标注语料, 还是想合并从网上搜集得来的数据集, 其标注标准最好能一致, 比如, 对于别称"小李子", 数据集A不判定为人名, 而数据集B判为人名, 那么数据集A和B合并时, 总数据集内部便有两套标准, 从模型本身的角度出发, 训练数据存在歧义, 自然影响模型最终的预测.(如果数据集本身有很多模糊不定的样本, 要想让模型学习到'正确'的结果会不太现实, 而且训练数据一般比较少). 个人提倡一种"分得开, 合得拢"的标注思想: 以最小单元为标注思想, 往上层构建, 如:2018年4月8日, 可标注为: (2018年, 时间), (4月, 时间), (8日, 时间), 因为某些句子可能只提到了'2018年', 那么'2018年'便是共有的'子单元', 但是(2018年, 时间), (4月, 时间), (8日, 时间)又能完整地合并成一个时间, 就像原子通过组合成为分子. 另外说一个纠正数据的策略, 统计每个词属各个实体类型的分布以及同个类别下不同词的分布情况, 看看是不是有明显标注出错的部分; 44 | + b. 增加训练数据量; 除了网上提供的标准论文数据集, 针对具体业务, 一般需要手工来标注, 不言而喻, 增加训练集的数目能明显提高模型准确率, 针对中文短文本, 以千计的数据量便能得到0.915+的F1值. 如果针对一个具体领域的实体识别任务, 完全没有标注数据(一个原因是数据类型完全不同, 另一个可能是标注标准不同), 可使用网上已有的实体标注模型, 大规模标注, 筛选句子中实体较多的句子进行人工纠正, 这样做能有效地学习一个句子尽可能多的实体. 当有一定量的训练数据, 为提高模型准确率想继续标注, 此时要用训练的模型来预测新的数据, 对新数据中预测错误的样本进行纠正并学习, 该做法便能有针对性地优化模型没学习到的部分; 45 | + c. 不同实体类别的比例及子标签(eg: B-PERSON, I-PERSON等)的分布; 对于一般的对话文本, 其实体类别的分布应该是'时间', '地点', '人物'这三种类型占比较多, 而'公司名', '机构名'相对较少, 模型针对特定类别的识别准确率与其本身的数据量大小应该也有关系的(实际情况待验证, 这个点算是个人一些思考), 因此, 如果想让'人物'识别准确率高一点, 可以有针对性地提高该类别的数据量大小. 46 | 47 | ##### 8. char(字符)和word(词语)的结合 48 |   在汉语中, 可按char或word来构建biLSTM+CRF模型. 按char, 除了可减少embedding规模大小, 最主要的是能利用摆脱'分词'算法本身, 按word则可利用'词语'级别的语义信息, 但是如果一个词是一个tag, 如果分词本身不准确, 将明显影响实体识别的准确率. 在原论文中, 作者提出将'char'和'word'结合起来, 预测的时候对'word'打tag, 而在中文中要利用类似的结构则需要分词. 49 | 50 | ##### 9. 实体识别与分词算法的结合 51 |   可通过实体识别的结果来进行辅助性分词, 当实体识别训练数据达到一定程度后, 则可直接考虑"分词"+"实体识别"同时进行, 当训练数据较少的时候, 不推荐将分词直接引入到biLSTM+CRF中, 个人观点是基于大规模语料的统计分词会比小规模(以千记)类似实体识别语料效果应该要好, 虽然个人还没进行实验. 52 | 53 | 54 | #### more reading: 55 | 1. [NLP论文笔记1:Neural Architectures for Named Entity Recognition](https://blog.csdn.net/juanjuan1314/article/details/78905239) 56 | 57 | -------------------------------------------------------------------------------- /name entity recognition/Semi-supervised Multitask Learning for Sequence Labeling.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper提出在训练序列标注模型(BiLSTM+CRF)时,同时进行语言模型训练,要点记录如下: 3 | + a. loss中加上语言模型的损失; 4 | + b. 在训练语言模型时,当前token_k仅用正向LSTM输出预测token_k+1,反向LSMT输出预测token_k-1。 5 | 6 | #### comment: 7 | 1. 作者提出语言模型想法与[Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A Generative Approach to Sentiment Analysis](https://github.com/xwzhong/papernote/blob/master/neural%20network/Contextual%20Bidirectional%20Long%20Short-Term%20Memory%20Recurrent%20Neural%20Network%20Language%20Models:%20A%20Generative%20Approach%20to%20Sentiment%20Analysis.md)相似,而且都从实验证明的其效果; 8 | 2. [<< Semi-supervised Multitask Learning for Sequence Labeling>>阅读笔记](https://zhuanlan.zhihu.com/p/27899932)文章梳理不错,通读可快速了解文章重点。 9 | 10 | #### highlight: 11 |     This language modeling objective incentivises the system to learn general purpose patterns of semantic and syntactic composition. 
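#### 示例(个人补充):
  就上述"序列标注 + 语言模型"联合训练的 loss,给出一个极简示意(个人理解的 PyTorch 假设写法,非论文官方实现,gamma 为假设的权重超参):正向 LSTM 第 t 步的输出只用来预测第 t+1 个词,反向 LSTM 第 t 步的输出只用来预测第 t-1 个词,两个 LM 损失加权后与 CRF 标注损失相加。

```python
import torch.nn.functional as F

def joint_loss(crf_neg_log_lik, fwd_logits, bwd_logits, token_ids, gamma=0.1):
    """
    crf_neg_log_lik: BiLSTM+CRF 的标注负对数似然(标量)
    fwd_logits: [B, T, V],第 t 步由正向 LSTM 输出,用于预测第 t+1 个词
    bwd_logits: [B, T, V],第 t 步由反向 LSTM 输出,用于预测第 t-1 个词
    token_ids:  [B, T],词 id 序列
    """
    vocab = fwd_logits.size(-1)
    fwd_lm = F.cross_entropy(fwd_logits[:, :-1].reshape(-1, vocab),
                             token_ids[:, 1:].reshape(-1))
    bwd_lm = F.cross_entropy(bwd_logits[:, 1:].reshape(-1, vocab),
                             token_ids[:, :-1].reshape(-1))
    return crf_neg_log_lik + gamma * (fwd_lm + bwd_lm)
```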
12 | -------------------------------------------------------------------------------- /neural network/A Unified Architecture for Natural Language Processing.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章发表于2008年,提出将多项NLP任务(词性标注、分块、实体识别、语言模型[非常规语言模型]、语义相关词)一起训练来提高某项任务(语义角色标注[SRL],因其较复杂)中的效果。大致过程:针对具体某项NLP任务,每个词有k个不同类型的特征,eg:词本身、首字母是否为大小写、与谓词的距离等,对每个特征进行embedding,然后使用神经网络进行训练。要点记录: 3 | 1. 每项任务共用同一份“词本身”的embedding,从而达到jointly训练的效果; 4 | 2. 语言模型不是利用前k-1个词来预测第k个词,而是判断句子中的词是不是与其前后文本相关。训练时,将句子A中间词用随机挑选的词替换后得到的句子B,优化方向是使A得到的score比B尽可能地>1。 5 | 6 | 7 | #### comment: 8 |     作者提出的语言模型新颖,该任务与SRL结合时比其它任务单独或一起训练都更好(详见原文tab2),该策略与[BERT](https://github.com/xwzhong/papernote/blob/master/transformer/BERT:%20Pre-training%20of%20Deep%20Bidirectional%20Transformers%20for%20Language%20Understanding.md)模型的masked language model有相近之处,或许借助神经网络强拟合能力,加上恰当的结构设计(目的是为了表示词、句子等信息)能得到出人意料的效果。 9 | 10 | 11 | #### more: 12 |     Here, we only want to have a good representation of words: we take advantage of the complete context of a word (before and after) to predict its relevance. 13 | -------------------------------------------------------------------------------- /neural network/Adding Gradient Noise Improves Learning for Very Deep Networks.md: -------------------------------------------------------------------------------- 1 | ### Adding Gradient Noise Improves Learning for Very Deep Networks 2 | 3 | #### note: 4 |   paper指出,在"very deep"网络中,可以对梯度增加noise,以提高学习的准确率.操作的方法很简单,在每一个step中,对计算得到的梯度添加一个高斯噪音,该高斯分布均值为0,方差随训练步数递减(详见式1).paper从几个的模型(eg:deep fully connected networks, end2end memory networks, neural programmer, neural random access machines, neural GPUs)的实验中论证了该方法的可行性. 5 | 6 | #### comment: 7 | 1. 针对deep fully connected networks, 该方法有以下好处: 8 | 9 | * a. 即使没有gradient clipping操作,该方法仍表现优异. 10 | * b. 对参数初始化不敏感. 11 | 12 | 2. 论文提出的trick很简单,但是该方法适用于深层(复杂)的网络结构,用于shallow网络反倒可能使效果变差. 13 | 3. 在引入noise时,考虑了模型在收敛时最佳梯度不断变小的情况,因此将step作为noise的参数,同时,对于一般的随机因子,高斯分布是个不错的选择. 14 | 4. noise在我们的观念中都是应该剔除的,但是有些实验表明,对pure数据添加适当的噪音能提高模型准确率,特别是针对容易过拟合的模型,此处noise起到的作用是防止过拟合.同时也有curriculum learning方式指出,应该尽可能得将这种tough(包含noise)样本尽可能地在后期学习(或者剔除).这两种思路都没错,但是都有其适用的前提. 15 | 5. 由引入noise能提升效果可知,neural networks模型在训练过程中仍有很多不定性,如果是确定的,应该就不会引入不确定的变量. 16 | 17 | #### highlight: 18 |   We found that adding noise to the gradient during training helps training and generalization of complicated neural networks. We suspect that the effects are pronounced for complex models because they have many local minima. 19 | -------------------------------------------------------------------------------- /neural network/An overview of gradient descent optimization algorithms.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper对梯度下降可使用的优化算法进行的简要介绍,有以下几个方面: 3 | 4 | 1. Gradient descent variants: batch, stochastic及mini-batch gradient descent; 5 | 2. 梯度优化算法:momentum、nesterov accelerated gradient(NAG,momentum的改进版)、adagrad、adadelta、RMSprop、Adam等、同时给出了[可视化链接](http://cs231n.github.io/neural-networks-3/); 6 | 3. 并行化策略; 7 | 4. 简要提及其它优化策略:shuffling and curriculum learning、batch normalization、early stopping、gradient noise; 8 | 9 | #### comment: 10 | 1. 一般来讲,我们会选用mini-batch的方式训练,但是不要忘记,算法推导之初是使用batch,限于运行内存/显存、单步更新速率,从而使用了mini-batch,个人实验发现,mini-batch越大,收敛速度越快,如果内存/显存允许,可尽量调高mini-batch数。但也不要忘记对同一个epoch进行shuffle操作,降低同个mini-batch样本与样本间的相互影响; 11 | 2. 
梯度优化算法,主要还是针对learning rate的优化,更深入可指每个参数learning rate的优化。从momentum到NAG,再到adagrad、Adam等,可清晰地看到中间演化过程,最终作者也提及,默认可使用Adam优化策略,其用动量的角度引入一阶和二阶梯度。 12 | -------------------------------------------------------------------------------- /neural network/Attentive Language Models.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   作者提出在language model中引入已生成word的attention来预测下一个词,大致步骤如下: 3 | 1. 获取当前step之前各个hidden state,并计算各个hidden state的权重,hidden state加权相加后得到当前step的attention c; 4 | 2. 将当前step的hidden state与attention c拼接,最后再进行softmax映射。 5 | 6 | #### comment: 7 | 1. 文章提出了两种attention计算方式(式5和式6),设计与seq2seq引入方式相近,扩展来讲,在其它任务中,也能使用类似的attention计算策略; 8 | 2. 根据实验结果来看,该方法对perplexity改进明显(使用了LSTM结构),可分析出,LSTM虽然能学习长序列信息,但是针对LM这个问题仍暴露出其在长序列学习上的不足。 9 | -------------------------------------------------------------------------------- /neural network/Calculus on Computational Graphs Backpropagation.md: -------------------------------------------------------------------------------- 1 | ### Calculus on Computational Graphs: Backpropagation 2 | 3 | #### note: 4 | 1. 文章以一个计算公式为基础,讲解使用图来计算其结果、进行微分,进而引入前向微分和反向微分,最后介绍常用的反向微分方式相对于前向的优缺点; 5 | 2. 前向和反向最明显的区别是,如果有k个input,n个output,前向微分需遍历k次,后向微分需遍历n次,当k远大于n时(一般情况也是这样),时间效率相差就非常明显。 6 | 7 | #### comment: 8 |   文章思路清晰,以非常简单的例子逐步讲解这个过程;同时后向微分的方式是常用深度学习框架的设计核心(如theano, tf等),阅读这篇文章对使用这些框架很有帮助。 9 | -------------------------------------------------------------------------------- /neural network/Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models A Generative Approach to Sentiment Analysis.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章指出,使用biLSTM构建语言模型预测下一个词A时(假定完整句子为sABCDe),其也利用了A本身的信息(原因在backforward过程,s的输出由A的输入得到),针对这点,作者提出下一个词(A)的预测仅与其上下文(sBCDe)相关,并在IMDB情感二分类实验中提升0.5%的准确率。 3 | 4 | #### more: 5 |   [tensorflow code](https://github.com/dongjun-Lee/birnn-language-model-tf) 6 | -------------------------------------------------------------------------------- /neural network/Deep Nets Don't Learn Via Memorization.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   论文主要是反驳[Understanding Deep Learning Requires Rethinking Generalization](https://arxiv.org/abs/1611.03530)提出的观点, 在摘要部分已将文章思想讲得很清楚: We use empirical methods to argue that deep neural networks (DNNs) do not achieve their performance by memorizing training data, in spite of overlyexpressive model architectures. Instead, they learn a simple available hypothesis that fits the finite data samples, In support of this view, we establish that there are 3 | qualitative differences when learning noise vs. natural datasets, showing that: 4 | 1. more capacity is needed to fit noise; 5 | 2. time to convergence is longer for random labels, but shorter for random inputs; 6 | 3. DNNs trained on real data examples learn simpler functions than when trained with noise data, as measured by the sharpness of the loss function at convergence; 7 | 最后作者也指出: for appropriately tuned explicit regularization, e.g. dropout, we can degrade DNN training performance on noise datasets without compromising generalization on real data. 8 | 9 | #### comment: 10 | 1. 针对本文提到的几个观点,个人对此实验现象理解如下: 11 | + a. noise data内在的模式比正常标注的数据模式更简单(eg:区分男人/女人比区分不同种族更容易), 因此拟合noise数据需要更多的参数及相应的计算; 12 | + b. 
random label并没有改变输入变量x的模式(pattern),该模型相比对input数据加入噪声更规整,导致从特征到标签的学习难度加大(输入是规整,输出是random);而对input加入噪声,其内部模式发生了变化,因此output对random后的input也是random的(它不需要知道output有什么样的模式,因为output是由input推导而来),所以相比random label更易拟合;在此猜测,如果同时对input和output进行random操作,其拟合速度与仅对input进行random操作接近; 13 | + c. real data有更简洁的pattern因此需要更少的参数及训练时间; 14 | 2. 个人觉得本文提出的反驳意见与rethinking generalization提出的观点不冲突: 15 | + a. rethinking generalization这篇论文侧重表达DL的拟合能力强,哪怕对input数据随机加入噪声,但只要参数足够大,训练时间足够长,training accuracy都能达到百分百; 16 | + b. 本文侧重说明dl不是单纯具有数据的'记忆'能力,从real data和noise data看出, 其本身也具有很强的泛化能力.该观点在rethinking那篇论文中也有提及. 17 | 3. 同一个模型,hidden unit越大,收敛速度越快(fig2). 18 | 19 | #### question: 20 |   在2.3部分-effect of regularization on learning, 除了说明不同正则化方法在noise data下的不同表现还说明了说明? 21 | 22 | #### more reading: 23 | 1. [论文笔记:Understanding Deep Learning Requires Rethinking Generalization](https://github.com/xwzhong/papernote/blob/master/neural%20network/Understanding%20Deep%20Learning%20Requires%20Rethinking%20Generalization.md) 24 | 2. [讨论](https://openreview.net/forum?id=rJv6ZgHYg) 25 | -------------------------------------------------------------------------------- /neural network/Dialog Context Language Modeling with Recurrent Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Dialog Context Language Modeling with Recurrent Neural Networks 2 | 3 | #### note: 4 |   针对language model,paper对dialog level信息的employ进行了总结,从最基本的language model,到多轮对话,值得记录的有以下几点: 5 | 6 | 1. Single-Turn-RNNLM: Conventional RNNLM that operates on single turn level with no context information; 7 | 2. BoW-Context-RNNLM: Contextual RNNLM with BoW representation of preceding text as context; 8 | 3. DRNNLM: Contextual RNNLM with turn level context vector connected to initial RNN state of the target turn; 9 | 4. CCDCLM: Contextual RNNLM with turn level context vector connected to RNN hidden state of the target turn; 10 | 5. DACLM: RNNLM with true dialog act context vector connected to RNN state of the target turn at each time step at each time step; 11 | 6. IDCLM: The last RNN state of turn k?1 serves as the starting RNN state of k+1 and the context vector to turn k, which is fed to turn k’s RNN hidden state at ach time step together with the word input; 12 | 7. ESIDCLM: under IDCLM, use external RNN to model the context changes explicitly 13 | 14 | #### comment: 15 | 1. IDCLM考虑了同一个人的“说话风格”; 16 | 2. ESIDCLM更像是句子层面的recurrent neural network; 17 | 3. 从实验结果来看,仍旧是引入了额外tag信息的DACLM模型表现更优。 18 | -------------------------------------------------------------------------------- /neural network/Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.md: -------------------------------------------------------------------------------- 1 | ### Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling 2 | #### note: 3 | 4 | 1. 总体来讲,GRU和LSTM比传统的RNN好; 5 | 2. 在学习能力上,GRU不一定输给LSTM,在某些数据集上,参数数目相近的情况下,GRU略胜一筹; 6 | 3. 在时间效率方面,如果GRU解决当前问题情况下优于LSTM,那么同时刻下,GRU会比LSTM学习得更充分,反之亦然。 7 | 8 | #### comment: 9 | 10 | 1. 对比RNN,GRU和LSTM三者之间学习能力、时间效率时,考虑了参数数目的影响,思考比较严谨,这点不错; 11 | 2. 在非常细致的问题优化中再考虑RNN变体的选择,因为就算RNN和LSTM对同一个问题有不同的结果,但是在参数数目相近的情况,其结果差距一般不会太大。 12 | 13 | #### [code](https://github.com/jych/librnn) 14 | -------------------------------------------------------------------------------- /neural network/Exploring Sparsity in Recurrent Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Exploring Sparsity in Recurrent Neural Networks 2 | 3 | #### note: 4 | 1. 
paper想通过“pruning weights"的方式提高rnn在测试时的执行速度,此处提到的pruning weights是通过weights的条件限制,使更多的weight为0,模型变得更sparse,进而加快运算; 5 | 2. 在deploy过程中,文中说训练仅增加了一个常量级(根据paper提到的算法,比较可信); 6 | 3. 在测试方面,如果仅以参数中非零个数为变量,同个数下,结果确实是比较好,但是,如果同hidden layer下分为dense和sparse,dense方式的效果更好。文章说测试的准确率比原来的dense方式更优是从非零个数的参数角度考虑的,至于大家批判这种方式是否合理就见仁见智了。 7 | 8 | #### comment: 9 | 1. 既然sparsity能达到90+%,是不是可考虑直接减小模型的规模? 10 | 2. 从某个角度验证了4.17openai发的[paper](https://arxiv.org/abs/1704.05119)思路。 11 | 12 | #### practice: 13 |   这篇文章的改进思路更适合大项目。 14 | -------------------------------------------------------------------------------- /neural network/How Does Batch Normalization Help Optimization (No, It Is Not About Internal Covariate Shift) .md: -------------------------------------------------------------------------------- 1 | #### [highlight](https://mp.weixin.qq.com/s/7obF_oxk5BrF3R-kbAsRrw): 2 |   作者研究了BatchNorm能提高深度神经网络训练有效性的根源,并发现BatchNorm与internal covariate shift之间的关系是微不足道的。特别是,从优化的角度来看,BatchNorm并不会减少internal covariate shift。相反,BatchNorm对训练过程的关键作用在于其重新规划了优化问题,使其Lipschitzness(BatchNorm对损失函数稳定性的影响)稳定和β-smoothness更有效(“有效”在这里指,朝梯度方向移动时测量梯度的变化),这意味着训练中使用的梯度更具有良好的预测性和性能,从而可以更快速、有效地进行优化。 3 | 4 |   同时,作者也展示了这种平滑效果并不是BatchNorm特有的,其他一些自然正则化策略也具有相似的效果,并能带来可比较的性能增益。 5 | -------------------------------------------------------------------------------- /neural network/How to Construct Deep Recurrent Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### How to Construct Deep Recurrent Neural Networks 2 | 3 | #### note: 4 |   paper指出了如何构建深层RNN结构的方向,paper逻辑大致如下: 5 | 6 | 1. 为什么要构建deep rnn?hypothesis:Deep learning is built around a hypothesis that a deep, hierarchical model can be exponentially more efficient at representing some functions than a shallow one (Bengio, 2009); 7 | 2. 构建的方向是什么?从以下三个方向加深rnn:input-to-hidden,hidden-to-hidden,hidden-to-output; 8 | 3. 实验。从实验结果来看,shortcut挺好用,但是容易过拟合。 9 | 10 | #### comment: 11 | 1. 如何区分模型“复杂”和“有效”?“有效”猜想:用尽可能少的参数达到目的,而不是畸形的。 12 | 2. 关于模型结构、参数大小、过拟合的思考: 13 | 14 | a. 首先针对待解决的问题,设计出合适的DNN结构,比如针对语言模型,可以使用纯RNN,GRU,LSTM等,一般情况下,初期仅考虑准确率情况下,在候选结构中,尽量选取在参数不用非常大的情况下,其结果仍非常好的模型,比如GRU就明显比RNN要好,因为参数大小相同的情况下,GRU明显比RNN更具长时记忆。 15 | 16 | b. 适当提高参数大小。设置参数时,考虑选取128,256,512等2的n次方,有利于加快使用GPU的计算。还有神经元参数初始化(使用normalization initialization),learning rate,learning rate decay选取等。 17 | 18 | c. 当参数过大时,容易过拟合,此时可以考虑引入dropout、l2正则等,对于word embedding还可以考虑使用adversarial/virtual training,这两种都是非常不错的正则化方式。 19 | 20 | 3. deep不是目的,解决问题才是。 21 | 22 | #### highlight: 23 |   stacked RNN:The goal of a such model is to encourage each recurrent level to operate at a different timescale. 24 | -------------------------------------------------------------------------------- /neural network/Is it Time to Swish Comparing Deep Learning Activation Functions Across NLP tasks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   该paper为实验性文章,测试了21中激活函数使用MLP、CNN和RNN解决文本方面任务(分类、序列标注)时的效果对比,实验发现,penalized tanh函数在所有任务上表现得最稳定。 3 | 4 | #### more: 5 | 1. [代码](https://github.com/UKPLab/emnlp2018-activation-functions) 6 | 2. [21种NLP任务激活函数大比拼:你一定猜不到谁赢了](https://www.jiqizhixin.com/articles/012701) 7 | 3. [26种神经网络激活函数可视化](https://www.jiqizhixin.com/articles/2017-10-10-3) 8 | 4. [一文概览深度学习中的激活函数](https://www.jiqizhixin.com/articles/2017-11-02-26) 9 | 5. 
[谷歌大脑提出新型激活函数Swish惹争议:可直接替换并优于ReLU?(附机器之心测试)](https://www.jiqizhixin.com/articles/2017-10-21-4) 10 | -------------------------------------------------------------------------------- /neural network/Learning to Generate Reviews and Discovering Sentiment.md: -------------------------------------------------------------------------------- 1 | ### Learning to Generate Reviews and Discovering Sentiment 2 | #### note: 3 | 4 | 1. 文章使用Amazon商品评论数据(38G)训练了一个1层4096个unit的语言模型,这4096个unit中,发现了一个sentiment unit,能指示待encoded中每个字(或词)的情感极性(正面或负面),另外该句子encoded后,还能判断整个句子的情感极性,在IMDB数据集上,错误率降低到7.7%(state of art 方法为5.91%); 5 | 2. 训练好语言模型后,通过很少的标注数据(30个)就能超过在Stanford sentiment treebank数据集下的state of art方法; 6 | 3. 在使用语言模型生成句子时,能通过人工直接控制sentiment unit的值来决定所生成文本的情感。 7 | 8 | #### comment: 9 | 10 | 1. 语言模型仍有很多未知的潜力,对于hidden unit,我们对其仍知之甚少; 11 | 2. 文本中没提及如何寻找这个sentiment unit,但是可以尝试使用已标注的相近领域情感分类数据来找; 12 | 3. 其它unit是不是也反映了数据在某方面的特点,已知的有句子长度,会包含语义上的转折unit? 13 | 4. seq2seq会不会有同样的unit? 14 | 5. 如果language model有很多理想的unit(大家想通过这些unit来控制生成),是不是会有lang2seq模型(language to sequence),这样既利用了language model能用大量数据无监督学习的特点,还能利用seq2seq end2end的特性。 15 | 16 | #### practice: 17 | 18 | 1. 通过领域训练得到的language model在特定领域使用时,如果语料的overlap不高,效果不一定特别好,因此可在通用领域训练好的model基础上,用待解决问题领域的数据进行fine tuning。其它运用还有word2vec。 19 | 2. 利用好这个已发现的sentiment unit,不仅可以减少人工标注数据来训练情感分类器,还能直接控制文本生成等等。 20 | 21 | #### more reading: 22 | 23 | 24 | #### [code](https://github.com/openai/generating-reviews-discovering-sentiment) 25 | -------------------------------------------------------------------------------- /neural network/Pointer Sentinel Mixture Models.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   作者提出了[Pointer Networks](https://github.com/xwzhong/papernote/blob/master/sequence%20to%20sequence/Pointer%20Networks.md)的一种变体。pointer networks仅能使用input text中的词,针对rare word和long-term dependecies,作者提出的pointer sentinel mixture models能结合此前的input和unseen word(相对于input来说的,而不是常指的OOV)。主要步骤如下(文中用于语言模型): 3 | 1. 使用rnn对句子进行编码,得到前N-1个词的hidden state; 4 | 2. 将第N-1个词的hidden state映射到同hidden size的维度(式2),得到q; 5 | 3. 计算q与前N-1个词的内积(式3),同时计算参数g(由参数s与q计算内积得到),concate两者后进行softmax映射(式7),得到a; 6 | 4. 将a[:vocab_size+1]与对应输入词的one_hot相乘(相当于根据上文得到的附加权重),与原始rnn softmax后的vocab proba distribution相加得到最终的词概率分布。 7 | 8 | #### comment: 9 | 1. 该策略能有效上文缓解rare word和长时依赖问题,主要是通过提高上文已出现的词概率来预测当前输出词; 10 | 2. 不完美处是该方法不能生成属于OOV但属于上文中的词; 11 | 3. 文章步骤4讲得不是特别清楚,可以参考[代码](https://github.com/yzh119/Pointer-Sentinel-Mixture-Model/blob/master/src/model.py#L53)进行理解。 12 | 13 | #### [code](https://github.com/yzh119/Pointer-Sentinel-Mixture-Model) 14 | -------------------------------------------------------------------------------- /neural network/Recurrent Dropout without Memory Loss.md: -------------------------------------------------------------------------------- 1 | ### Recurrent Dropout without Memory Loss 2 | #### note: 3 | 4 |   针对RNN、LSTM和GRU具有长时记忆能力,作者提出了在hidden2hidden层间进行dropout的策略,该策略有别于[Recurrent Neural Network Regularization](https://arxiv.org/abs/1409.2329)指出进行dropout的位置,而且经实验发现,paper提出recurrent dropout方式整合input2hidden及hidden2output位置dropout能得到更好的效果。以LSTM为例,在hidden2hidden层,作者仅对“hidden state update”进行dropout(详见式9)。同时dropout方程不同于hinton提出的式3,而为式18(防止在inference过程中dropping theupdate vectors g)。 5 | 6 | #### comment: 7 | 8 | 1. 文章对前人提出在hidden2hidden层进行dropout方式进行比较,实验发现所提出的方法得到的结果最好; 9 | 2. paper提出的改进方式,隐约给人一种“Recurrent Neural Network Regularization”论文中的思想,没有直接对context state进行dropout; 10 | 3. 
从编码的角度看,该方式在tensorflow LSTMCell,GRUCell等很容易添加,仅需几行代码。(tf源码nn_ops.py中实现了式18dropout方程,同时虽然LSTMCell相比BasicLSTMCell添加了很多备选trick,但是并没有实现文中提到的策略) 11 | 12 | #### code: 13 | * [theano](https://github.com/stas-semeniuta/drop-rnn) 14 | 15 | #### highligt: 16 | * We have observed that it helps the network to generalize better when not coupled with the forward dropout, but is usually no longer beneficial when used together with a regular forward dropout. The problem is caused by the scaling of neuron activations during inference. 17 | * In this case the above argument holds as well, but instead of observing exponentially decreasing hidden states during testing, we will observe exponentially increasing values of hidden states during training.Our approach addresses the problem discussed previously by dropping the update vectors g. 18 | 19 | #### q&a: 20 | 1. per-step和per-sequence没有完全理解其区别 21 | 22 | #### more reading: 23 | [谈谈Tensorflow的dropout](http://www.jianshu.com/p/c9f66bc8f96c) 24 | -------------------------------------------------------------------------------- /neural network/Recurrent Neural Network Regularization.md: -------------------------------------------------------------------------------- 1 | ### Recurrent Neural Network Regularization 2 | #### note: 3 | 4 |        文章提出了使用RNN时进行正则化的dropout改进方法。文章中就为什么这个简单的改进可行说得非常好(第三部分结尾),简而言之:传统的dropout对相邻cell中的连接段(connection)也进行了dropout,从而导致“记忆”的丢失,而作者提出的改进只对cell的输入和输出进行dropout,因此既达到了正则化的要求,又不至于丢失long range学到的内容。 5 | #### comment: 6 | 7 | * 这篇文章适合RNN入门,不仅因为其改进简单有效,而且TensorFlow也提供了这篇paper的代码。另外,理解文章的关键是在哪里进行dropout,特别是multi-layer的情况; 8 | * 在涉及到长时记忆单元时不要进行dropout,仅对输入或输出部分dropout,这个策略不仅适用RNN,其他深度学习算法一样适合。 9 | 10 | #### code: 11 | * [python](https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py) 12 | * [lua](https://github.com/wojzaremba/lstm) 13 | 14 | #### more reading: 15 | [what happen to dropout](https://www.reddit.com/r/MachineLearning/comments/5l3f1c/d_what_happened_to_dropout/) 16 | -------------------------------------------------------------------------------- /neural network/Semi-supervised Sequence Learning.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper提出两种文本方面的半监督学习方法: 3 | 1. 对rnn结构训练语言模型; 4 | 2. 对rnn结构训练encoder-decoder模型,其中encoder和decoder文本相同,rnn单元的权重也相同。 5 | 6 | #### comment: 7 | 1. paper对所提出的方法在多项文本分类数据集中进行了测试,超过了当时多个数据集state of art模型,实验多,有对比性; 8 | 2. Semi-supervised策略在数据集比较少的任务中可思考运用,除了常见的分类任务,实体识别也是不错的运用方向; 9 | 10 | #### more: 11 |   [Semi-supervised Sequence Learning #PaperWeekly#](http://rsarxiv.github.io/2016/06/07/Semi-supervised-Sequence-Learning-PaperWeekly/) 12 | -------------------------------------------------------------------------------- /neural network/Stacked Denoising Autoencoders.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper提出“栈式降噪自编码”算法提取高级特征,演变过程如下: 3 | + a. 自编码算法。神经网络只含一个隐层(一般hidden size比输入维度小),输入和输出相等,loss即为输入和输出的差异,训练好隐层即对应高级特征,然后再结合具体的场景微调模型。结构图如下: 4 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Stacked_Denoising_Autoencoders_autoencoders.PNG) 5 | + b. 降噪自编码算法。在自编码算法的基础上,输入中的元素以一定的概率置为0(输出不变),使得模型在输入数据corrupt(损坏)的情况下仍能复原,提高抗噪声能力,同时在一定程度上减轻了训练数据与测试数据的代沟。 6 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Stacked_Denoising_Autoencoders_denoising_autoencoders.PNG) 7 | + c. 
栈式降噪自编码。常规的自编码算法只含有一个隐层,通过栈式叠加,当训练好第k层隐层的参数后,固定1到k层的隐层参数,训练第k+1层的参数,最后再根据具体任务微调,结构图如下: 8 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Stacked_Denoising_Autoencoders_stacked_denoising_autoencoders.png) 9 | 10 | 11 | #### comment: 12 | + a. “降噪”跟常规的dropout类似; 13 | + b. 栈式自编码算法看起来像是逐层获取更高级编码特征。 14 | + c. 单隐层不是可以拟合任意函数,为什么还要栈式自编码: 15 | + 输入的向量维度一般比hidden size高 16 | + 单隐层虽然能编码任意的表达式,但是深层网络往往更简洁(此处是不是应该有理论或实验证明??) 17 | + d. [深层网络拟合能力强,为什么还要自编码?](https://blog.csdn.net/aisikaov5/article/details/51193137)数据获取问题、局部极值问题、过拟合、梯度弥散问题。 18 | 19 | 20 | #### more: 21 | + a. [堆叠降噪自动编码器 Stacked Denoising Auto Encoder(SDAE)](https://blog.csdn.net/zbzcDZF/article/details/86570761) 22 | + b. [Deep Learning 学习笔记(8):自编码器( Autoencoders )](https://www.cnblogs.com/Ponys/p/3315642.html) 23 | + c. [栈式自编码算法](https://blog.csdn.net/aisikaov5/article/details/51193137) -------------------------------------------------------------------------------- /neural network/Targeted Dropout.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper提出了一个对大模型(参数多)剪枝的策略(并不是降低参数的量,而是使参数值尽可能地为0,从而加快预测时的计算速度,也可以说是模型压缩)。该策略可从两个角度进行剪枝: 3 | 1. Unit dropout:Unit dropout randomly drops units (sometimes referred to as neurons) at each training step to reduce dependence between units and prevent overfitting. 4 | 2. Weight dropout(来源[dropconnect](https://github.com/xwzhong/papernote/blob/master/regularization/Regularization%20of%20Neural%20Networks%20using%20DropConnect.md)):Weight dropout randomly drops individual weights in the weight matrices at each training step. Intuitively, this is dropping connections between layers, forcing the network to adapt to a different connectivity at each training step. 5 | 6 | 具体操作: 7 | 1. 以magnitude来衡量unit/weight的重要性,unit pruning和weight pruning计算公式不同(分别为论文式1和式2); 8 | 2. train: 选取γ|θ|个最不重要的值,并以α概率进行dropout; 9 | 3. eval: 直接将“γ|θ|个最不重要的值”设为0; 10 | 11 | #### comment: 12 | 1. train阶段,选出“γ|θ|个最不重要的值”时并没有将其全部设0是因某些元素在往后的训练中权重可能会上升。 13 | 2. 实验表明,该方法相对于直接提高dropout进行剪枝有更好的鲁棒性——剪枝效果明显,但准确率下降幅度小。 14 | 3. 就“dropout与Targeted Dropout关系”有个网友说得好:这个不是替换dropout的,dropout提出来的时候是为了防止overfitting的,这个是用来做模型压缩的,不是一个东西。 15 | 16 | #### more: 17 | 1. [Dropout可能要换了,Hinton等研究者提出神似剪枝的Targeted Dropout](https://zhuanlan.zhihu.com/p/50796723) 18 | 2. [tensorflow code](https://github.com/for-ai/TD)。核心代码是[dropouts.py](https://github.com/for-ai/TD/blob/master/models/utils/dropouts.py)。 19 | -------------------------------------------------------------------------------- /neural network/Tying Word Vectors and Word Classifiers A Loss Framework for Language Modeling.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   作者指出,使用RNN结构实现语言模型,loss方程的设计仅使用了one hot,即仅使用正确的word对应的损失.作者提出使用方程分布的形式添加额外的loss, 考虑如下情况: "高兴"和"开心"是相似词, 如果output layer的target word是"高兴", 那"开心"也"很可能"是正确的target,而one hot设计没有考虑这部分相似信息. paper使用了预训练得到的word embedding来描述词与词这种相似性, 设为groundtruth(其包含了每个词与其它词之间的相似度, 假定该相似关系构成了分布Q), 在projection layer部分,为了使target word下,其余词的分布P与Q尽量接近, 可推得input embedding(即为word embedding)和output projection相等. 判断分布P与分布Q的差异使用了KL散度. 3 | 4 | #### comment: 5 | 1. 文章背后提出的逻辑有依据, 从最初的loss设计考虑分布出发,到推得input embedding和output projection相等的关系,逐步深入.同时,在该论文之前,已有[实验](https://github.com/xwzhong/papernote/blob/master/neural%20network/Using%20the%20Output%20Embedding%20to%20Improve%20Language%20Models.md)证明连接input embedding和output embedding能改进语言模型和翻译模型; 6 | 2. 分布loss结果的好坏很大程度上取决于预训练得到的word embedding质量,因作者假定word embedding为groundtruth, 由P去拟合Q; 7 | 3. 
式3.1和3.7使用softmax层进行归一化,目的是以词概率分布的角度来衡量分布差异; 8 | 4. [图文解说及代码](https://github.com/icoxfog417/tying-wv-and-wc), [augment_loss计算](https://github.com/icoxfog417/tying-wv-and-wc/blob/master/model/augmented_model.py#L32). 9 | 10 | #### question: 11 | 1. 为什么在计算式3.1和3.7时添加了'temperature'因子? 12 |  2. 为什么在lstm+crf+reuse embedding实体识别模型中加入"augmented loss"对模型效果基本无提升? 13 | -------------------------------------------------------------------------------- /neural network/Understanding Deep Learning Requires Rethinking Generalization.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   该论文对deep learning表现优异的原因进行了实验,思考,在实验的基础上提出了一些观点,虽然有些方面缺乏理论证明,但仍能从其思辨角度了解deep learning的特点.部分要点记录如下: 3 | 1. Randomization tests: Deep neural networks easily fit random labels. 对于random label的训练数据,其训练样本是固定的, 虽然模型train error能达到0%,但是模型学习的是固定下来的label,不是学到random这个过程,如果是学到random这个过程,那么训练好的模型准确率将与random的结果相近; 4 | + a. The effective capacity of neural networks is sufficient for memorizing the entire data set. 5 | + b. Even optimization on random labels remains easy. In fact, training time increases only by 6 | a small constant factor compared with training on the true labels. 7 | + c. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged. 8 | 2. The role of explicit regularization: Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error. 9 | 3. Finite sample expressivity: a very simple two-layer ReLU network with p = 2n+d parameters that can express any labeling of any sample of size n in d dimensions.此方面的论证, 可参见Duda的"pattern classification"书籍神经网络部分; 10 | 4. The role of implicit regularization: the algorithm itself is implicitly regularizing the solution. 深度学习模型本身具有泛华能力.个人观点是, SGD本身的拟合方向与数据的特征有一定的交集. 11 | 12 | #### comment: 13 | 1. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise.此处指, 模型本身的框架是核心,外在的正则化策略(eg: weight decay, dropout等)不能根本上影响深度学习模型的效果; 14 | 2. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. 此处,如果数据含有明显的规则,那相对的学习时间,参数个数也可减少,更深一步讲,如果模型本身的优化思路跟数据特点切合,那么所需训练数据量,训练时间,参数等可更少; 15 | 3. 少量的噪音能降低模型过拟合的倾向,但噪音越多,其收敛速度也会下降,而且过多的噪音也会严重降低最终的准确率; 16 | 4. paper给人一个非常震撼的观点,指出deep learning更多是以其"记忆"而表现优异, 通过反思自己使用deep learning模型的过程,比较赞同这个观点,DL依旧是基于统计学的方法,只不过其表达能力比传统的机器学习方法更优秀,对于构造通用型人工智能,DL可能只实现其中部分.并不是否定其在某些领域,如图像,语音识别等的惊人表现; 17 | 5. 文章虽然没有给出针对某个领域实质性的改进思路,但是其对深度学习模型的见解对个人了解DL的技术限度有非常大的好处. 18 | 19 | #### question: 20 |   为什么random label更难收敛(见fig1 a)? random label数据其样本特征相对正确label更杂乱,规则更不明显,增加了模型训练成本. 21 | 22 | #### more reading: 23 | 1. [如何评价 ICLR 2017 中关于 Rethinking Generalization 的那篇文章?](https://www.zhihu.com/question/56151007) 24 | 2. [论文分享:Understanding Deep Learning Requires Rethinking Generalization](https://zhuanlan.zhihu.com/p/26567289) 25 | 3. [论文笔记 understanding deep learning requires rethinking generalization](http://blog.csdn.net/u014380165/article/details/71188924) 26 | 4. 
[打响新年第一炮,Gary Marcus提出对深度学习的系统性批判](https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==&mid=2650735630&idx=1&sn=5840c3e9bed487da3a9080d482fcc58e&scene=21#wechat_redirect) 27 | -------------------------------------------------------------------------------- /neural network/Understanding the difficulty of training deep feedforward neural networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   文章分析了激活函数和参数初始化对neural network的影响,要点如下: 3 | 4 | 1. the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. 详见论文3.1部分的分析。 5 | 2. We find that a new non-linearity that saturates less can often be beneficial. 6 | 3. propose a new initialization scheme that brings substantially faster convergence. 7 | 8 | #### comment: 9 | 1. Two things we want to avoid and that can be revealed from the evolution of activations is excessive saturation of activation functions on one hand (then gradients will not propagate well,因为在饱和点处的梯度很低,通过前面的layer通过后向反馈得到的信息特别少), and overly linear units (they will not compute something interesting???). 10 | 2. sigmoid函数因均值不是0而造成效果差的原因分析。We hypothesize that this behavior is due to the combination of random initialization and the fact that an hidden unit output of 0 corresponds to a saturated sigmoid. 11 | 3. 对称的激活函数(eg:tanh)与sigmoid函数对比。In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients to flow backward and prevent the lower layers from learning useful features。在y≈0处,sigmoid接近饱和,而以tanh为例,此时x≈0,此处的梯度大,有利于最后一层的梯度信息传递给前面隐层。从以上分析中也可以发现,在选用对称激活函数时,初始化策略一定要关于0对称,作者还从理论角度分析,给出了常用的glorot初始化策略,详见式16; 12 | 4. 为什么使用交叉熵代替二次代价函数——在饱和状态,二次代价函数的梯度小,导致参数偏导降低,减缓学习速率,而交叉熵因约去了梯度,进而学习速率教高,详见[文章](http://blog.csdn.net/yqljxr/article/details/52075053);而文中也提到,交叉熵代价函数其平滑的面更少(详见4.1,fig5)。 13 | 5. pretrain在实践中可多做思考。Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. 14 | 15 | #### conclusions from paper: 16 | 1. The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.(因此建议结合tanh激活函数和glorot初始化策略) 17 | 2. The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.(绘制tanh和softsign曲线可知,tanh随x变化更剧烈) 18 | 3. For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward). 19 | 4. Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets. 20 | 5. Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer. 21 | 6. Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. 
with a Jacobian around 1) appears helpful, and allows to eliminate a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning. 22 | 7. Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures. 23 | -------------------------------------------------------------------------------- /neural network/Using the Output Embedding to Improve Language Models.md: -------------------------------------------------------------------------------- 1 | ### Using the Output Embedding to Improve Language Models 2 | 3 | #### note: 4 |   paper提出了一种改进RNN语言模型的方法,该方法操作简单,只需'连接'模型input embedding和output embedding.其给出该方法的核心思想是: 5 | 6 |   _We call U the input embedding, and V the output embedding. In both matrices, we expect rows that correspond to similar words to be similar: for the input embedding, we would like the network to react similarly to synonyms, while in the output embedding, we would like the scores of words that are interchangeable to be similar (Mnih and Teh, 2012)._ 7 | 8 | #### comment: 9 | 1. 其思考角度翻译过来就是,对于output embedding, 如果词与词之间相似,那么在softmax层对应的score也应该相近,直接将word embedding与output embedding进行tying,相当于显示地利用了已训练好的word embedding相似关系,同时减少了参数. 使用显示tying input和output embedding也能看出模型学习能力的不足; 10 | 2. 论文第三部分提到使用projection regularization能对模型进行微调(在lstm output后添加全连接层),但不仅限于此, 论文提到使用tying技巧需要rnn的output hidden size和embedding大小相同, 但在实际运用中, hidden size往往比embedding size大,此时可以通过添加全连接层进行调整,使输入最后softmax层的维度等于vocaburary size; 11 | 3. 文章提出的思想不仅可以用于rnn结构的语言模型,也能用于类似rnn结构的其它模型中,在LSTM+CRF实体识别模型中,使用该技巧能提高0.01左右的F1 score; 12 | 4. "Following (Zaremba et al., 2014), we employ two models: large and small. The large model employs dropout for regularization. The small model is not regularized." 个人对此持有异议:large相比于 small model有更强的表达能力,softmax能较容易地描述其特征,而small模型内部结构复杂,此时使用projection层相当于搭建lstm输出和softmax之间的桥梁. 13 | 14 | #### more reading: 15 |   [Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling](https://github.com/xwzhong/papernote/blob/master/neural%20network/Tying%20Word%20Vectors%20and%20Word%20Classifiers:%20A%20Loss%20Framework%20for%20Language%20Modeling.md) 16 | 17 | #### [code](https://github.com/ofirpress/UsingTheOutputEmbedding) 18 | -------------------------------------------------------------------------------- /neural network/Visualizing and Understanding Recurrent Networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   paper通过实验对语言模型的一些特点进行了定量分析。分析的方面有: 3 | 1. RNN、GRU和LSTM的对比; 4 | 2. LSTM hidden unit与句子长度的定量关系; 5 | 3. 结合fig1和tab2可估算不同参数的RNN变体与n-gram、n-NN模型表达能力的关系等。 6 | 7 | #### comment: 8 | 1. 在language model中,RNN增加layer层数进行对比,可从参数总量不变的条件下去看(eg:1layer-128size VS 2layer-64size对比),可发现RNN、GRU和LSTM在1layer的情况下,效果往往更好; 9 | 2. RNN记忆文本的长度与参数和文本使用该模型学习难易相关。往往参数越大,学习能力自然越强;同时,如果所选模型切合待训练数据,其学习的效果自然更佳; 10 | 3. 文中提到一个“interpretable lsmt cell”,其与[论文](http://de.arxiv.org/pdf/1704.01444)提到的sentiment unit相像,详细分析可见[笔记](https://github.com/xwzhong/papernote/blob/master/neural%20network/Learning%20to%20Generate%20Reviews%20and%20Discovering%20Sentiment.md); 11 | 4. 
从论文的分析可了解到,RNN、GRU和LSTM仍像统计模型。 12 | -------------------------------------------------------------------------------- /others/Neural Machine Translation of Rare Words with Subword.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对文本翻译中不常用词,作者提出使用BPE(Bite Pair Encoding)方式对词进行切分。 3 | 1. 动机:the translation of some words is transparent in that they are translatable by a competent translator even if they are novel to him or her, based on a translation of known subword units such as morphemes or phonemes; 4 | 2. 学习BPE:从单字符开始,不断向上合并相邻字符对(每次取最大字符对频率进行合并),直到大于某阈值或无字符对(有点类似于构建哈弗曼树——用尽可能少的字符对编码文本); 5 | 3. 编码BPE:从单字符开始,将相邻最大频率字符对当做一个字符串,若词典中无该字符串则继续合并。 6 | 7 | #### comment: 8 | 1. 该方法能较好地解决OOV的情况,特别是针对数据量比较大任务中。同时该策略不单可用于文本翻译中,也可以用于各类任务; 9 | 2. 实验结果中对source和target语料放在一起学习BPE的效果比单独学习更好,因一起学习能保持rare word切分一致性(此处跟语料相关,不能盲目套用); 10 | 3. 编码操作除了note中提及的部分,还可以考虑最大前向、后向编码,note中编码继承了学习时的特点——生成尽可能少的词,但是并没有考虑使编码时的元字符串尽可能地少,而最大前向、后向则可以; 11 | 4. 该编码方式能做到对数据集所有文本进行编码,同时,在达到最低词汇表大小要求情况下指定词汇表大小。 12 | 13 | #### more: 14 |     [代码](https://github.com/rsennrich/subword-nmt),其中核心的代码是[learn_bpe.py](https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py) 15 | -------------------------------------------------------------------------------- /pic/Chinese NER Using Lattice LSTM-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese NER Using Lattice LSTM-pic1.png -------------------------------------------------------------------------------- /pic/Chinese NER Using Lattice LSTM-pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese NER Using Lattice LSTM-pic2.png -------------------------------------------------------------------------------- /pic/Chinese NER Using Lattice LSTM-pic3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese NER Using Lattice LSTM-pic3.png -------------------------------------------------------------------------------- /pic/Chinese NER Using Lattice LSTM-pic4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese NER Using Lattice LSTM-pic4.png -------------------------------------------------------------------------------- /pic/Chinese NER Using Lattice LSTM-pic5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese NER Using Lattice LSTM-pic5.png -------------------------------------------------------------------------------- /pic/Chinese_Poetry_Generation_with_Planning_based_Neural_Network_flow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese_Poetry_Generation_with_Planning_based_Neural_Network_flow.png -------------------------------------------------------------------------------- /pic/Chinese_Poetry_Generation_with_Planning_based_Neural_Network_model.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Chinese_Poetry_Generation_with_Planning_based_Neural_Network_model.png -------------------------------------------------------------------------------- /pic/Deep & Cross Network for Ad Click Predictions-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep & Cross Network for Ad Click Predictions-pic1.png -------------------------------------------------------------------------------- /pic/Deep & Cross Network for Ad Click Predictions-pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep & Cross Network for Ad Click Predictions-pic2.png -------------------------------------------------------------------------------- /pic/Deep & Cross Network for Ad Click Predictions-pic3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep & Cross Network for Ad Click Predictions-pic3.png -------------------------------------------------------------------------------- /pic/Deep & Cross Network for Ad Click Predictions-pic4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep & Cross Network for Ad Click Predictions-pic4.png -------------------------------------------------------------------------------- /pic/Deep Content-User Embedding Model for Music Recommendation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep Content-User Embedding Model for Music Recommendation.png -------------------------------------------------------------------------------- /pic/Deep Neural Networks for YouTube Recommendations-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep Neural Networks for YouTube Recommendations-pic1.png -------------------------------------------------------------------------------- /pic/Deep Neural Networks for YouTube Recommendations-pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep Neural Networks for YouTube Recommendations-pic2.png -------------------------------------------------------------------------------- /pic/Deep Neural Networks for YouTube Recommendations-pic3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep Neural Networks for YouTube Recommendations-pic3.png -------------------------------------------------------------------------------- /pic/Deep Neural Networks for YouTube Recommendations-pic4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Deep Neural Networks for YouTube Recommendations-pic4.png -------------------------------------------------------------------------------- /pic/DeepFM-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/DeepFM-pic1.png -------------------------------------------------------------------------------- /pic/DeepFM-pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/DeepFM-pic2.png -------------------------------------------------------------------------------- /pic/DeepFM-pic3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/DeepFM-pic3.png -------------------------------------------------------------------------------- /pic/Distributed Representations of Sentences and Documents.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Distributed Representations of Sentences and Documents.png -------------------------------------------------------------------------------- /pic/Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction.png -------------------------------------------------------------------------------- /pic/Factorization Machines-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Factorization Machines-pic1.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-1.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-2.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-3.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-4.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-5.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-6.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-7.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-8.png -------------------------------------------------------------------------------- /pic/HowtoFine-TuneBERTforTextClassification-9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/HowtoFine-TuneBERTforTextClassification-9.png -------------------------------------------------------------------------------- /pic/Implicit Feedback for Recommender Systems-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Implicit Feedback for Recommender Systems-pic1.png -------------------------------------------------------------------------------- /pic/Implicit Feedback for Recommender Systems-pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Implicit Feedback for Recommender Systems-pic2.png -------------------------------------------------------------------------------- /pic/Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic1.png -------------------------------------------------------------------------------- /pic/Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Modeling Task Relationships in Multi-task Learning with Multi-gate 
Mixture-of-Experts-pic2.png -------------------------------------------------------------------------------- /pic/Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts-pic3.png -------------------------------------------------------------------------------- /pic/Multi-Head Attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Multi-Head Attention.png -------------------------------------------------------------------------------- /pic/Quantifying Mental Health from Social Media with Neural User Embeddings.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Quantifying Mental Health from Social Media with Neural User Embeddings.png -------------------------------------------------------------------------------- /pic/Scaled Dot-Product Attention.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Scaled Dot-Product Attention.png -------------------------------------------------------------------------------- /pic/Semi-Supervised Sequence Modeling with Cross-View Training.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Semi-Supervised Sequence Modeling with Cross-View Training.png -------------------------------------------------------------------------------- /pic/Stacked_Denoising_Autoencoders_autoencoders.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Stacked_Denoising_Autoencoders_autoencoders.PNG -------------------------------------------------------------------------------- /pic/Stacked_Denoising_Autoencoders_denoising_autoencoders.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Stacked_Denoising_Autoencoders_denoising_autoencoders.PNG -------------------------------------------------------------------------------- /pic/Stacked_Denoising_Autoencoders_stacked_denoising_autoencoders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Stacked_Denoising_Autoencoders_stacked_denoising_autoencoders.png -------------------------------------------------------------------------------- /pic/Topic-to-Essay_Generation_with_Neural_Networks3.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Topic-to-Essay_Generation_with_Neural_Networks3.PNG -------------------------------------------------------------------------------- /pic/Wide & Deep Learning 
for Recommender Systems.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/Wide & Deep Learning for Recommender Systems.png -------------------------------------------------------------------------------- /pic/copyNet_copy_mode.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/copyNet_copy_mode.PNG -------------------------------------------------------------------------------- /pic/copyNet_generate_mode.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/copyNet_generate_mode.PNG -------------------------------------------------------------------------------- /pic/copyNet_model.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/copyNet_model.PNG -------------------------------------------------------------------------------- /pic/copyNet_state.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/copyNet_state.PNG -------------------------------------------------------------------------------- /pic/copyNet_vocab.PNG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/copyNet_vocab.PNG -------------------------------------------------------------------------------- /pic/mimick model architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/mimick model architecture.png -------------------------------------------------------------------------------- /pic/transformer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xwzhong/papernote/8bc1e82ff52af2e7ce7304038094e7e90757a8f6/pic/transformer.png -------------------------------------------------------------------------------- /reading comprehension/Word or Characters, Fine-grained Gating For Reading Comprehension.md: -------------------------------------------------------------------------------- 1 | ### Word or Characters, Fine-grained Gating For Reading Comprehension 2 | 3 | #### note: 4 |   paper主要讲了基于word-level和char-level形成high-level vector时的一个改进trick, 该trick是在[miyamoto](https://arxiv.org/abs/1606.01700)基础上提出的,把原来的scalar改为vector。 5 | 6 | #### comment: 7 | 1. paper提到的改进方案的核心思想是使用能计算的额外知识(如:token的词性,频率等)用模型自动学习的能力把这些知识加到模型中; 8 | 2. 
在加入这些“知识”的时候,用了gate的思想,有点像LSTM和GRU。 9 | 10 | #### practice: 11 |   目前该方法在SQuAD测试集下的F1为73.327%(Human Performance 91.221%),排名24(2017-06-20),最新排名见[URL](https://rajpurkar.github.io/SQuAD-explorer/),微软近期开始致力于从文本中挖掘回复,初步以该数据集为标准,学习reading comprehension可以该网站做参考。 12 | -------------------------------------------------------------------------------- /reasoning/A simple neural network module for relational reasoning.md: -------------------------------------------------------------------------------- 1 | ### A simple neural network module for relational reasoning 2 | 3 | #### note: 4 |   paper提出的relational network是为了解决目标(object)之间的关系(relation)问题。 5 | 6 | 1. 应用场景。给定object之间的关系描述,对给出的关系提问进行回答; 7 | 2. paper提出的relational network更多的是一种思想,因此可用于多种形式的关系描述,可以为图像,也可以为文本; 8 | 3. RN的实现部分不难,相对简洁(见式1),假设有object A、B、C,先对AB、BC、AC的组合关系进行MLP降维,得到向量a、b、c,再对这三个向量每个元素对应位置相加,得到向量z,最后再对z进行MLP降维。 9 | 10 | #### comment: 11 | 1. 核心部分在于object该如何利用不同形式(图像、不同类型文本)的语料进行转化; 12 | 2. fig2中,不是一个kernel代表一个object。因为一个kernel时计算整张图的结果,代表的是整幅图完整的含义,不同kernel对同一块区域构成的vector代表的是这一块区域不同的映射特征; 13 | 3. 实验结果非常吸引人,有break through的意味。 14 | -------------------------------------------------------------------------------- /recommender system/Deep & Cross Network for Ad Click Predictions.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     旧不同特征之间的关联问题,文章提出使用cross network进一步的增强特征之间的交互,deep&cross结构如下: 3 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20%26%20Cross%20Network%20for%20Ad%20Click%20Predictions-pic1.png) 4 | + deep network:标准的前馈神经网络 5 | + cross network:第k层的输出为第0层和第k-1层输出交互得到,相当于第0层经过多次特征交互后计算得到的加权输出: 6 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20%26%20Cross%20Network%20for%20Ad%20Click%20Predictions-pic2.png) 7 | + 实验:对比了在同参数量下dnn vs dcn的效果,或同loss下dnn vs dcn需要的参数量,实验相对严谨: 8 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20%26%20Cross%20Network%20for%20Ad%20Click%20Predictions-pic3.png) 9 | 10 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20%26%20Cross%20Network%20for%20Ad%20Click%20Predictions-pic4.png) 11 | 12 | #### comment: 13 | 1. 文章虽然提出cross network公式看起来很漂亮,解释起来有一定的可行性,但是实际带来的收益却比较小,以fig3的logloss下降为例,2layers+32nodes的logloss相对下降0.4%,而实验没有用测试集的auc等直接指标,可能也是因为效果不明显(虽然论文说an improvement of 0.001 in logloss is considered as 14 | practically significant,但个人不买单) 15 | -------------------------------------------------------------------------------- /recommender system/Deep Content-User Embedding Model for Music Recommendation.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章是典型的双塔结构——分别对content和user编码得到一个向量,然后根据两个向量的cosine值来衡量相似度,要点记录: 3 | + a. content和user向量使用pair-wise相似度计算的原因(而不是用交互式,比如concat/element-wise multiplication后映射到一个浮点值):the pair-wise calculation is more similar to the matrix factorization method that can provide high-level user and item factors(个人感觉这种说法本身没有错,但是不能很好的解释为什么不用fusion方式——用不用fusion应该从最终的效果出发) 4 | + b. 计算pair的score时使用cosine而不是dot product:文章只说测试效果发现cosine比较好,但是没有给出理论分析 5 | + c. loss设计使用了max-margin loss而不是softmax function with categorical cross-entropy:文章也是说实验效果margin更好,也没给出理论解释 6 | + d. margin-loss引入了grid-searched:the number of negative samples as 20 and the margin value ∆ as 0.2 7 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20Content-User%20Embedding%20Model%20for%20Music%20Recommendation.png) 8 | 9 | #### comment: 10 | 1. 
paper比较基础,偏实验性,理论解释较少 11 | -------------------------------------------------------------------------------- /recommender system/Deep Neural Networks for YouTube Recommendations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章介绍了YouTube视频推荐的实现逻辑,整体结构分两部分(candidate generation和ranking): 3 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20Neural%20Networks%20for%20YouTube%20Recommendations-pic1.png) 4 | 5 | + candidate generation:目的(粗筛,生成排序的候选集),路径(训练时构建分类任务,线上以向量召回方式使用时),模型结构和各个模块特点依次如下: 6 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20Neural%20Networks%20for%20YouTube%20Recommendations-pic2.png) 7 | * 分类标签:正样本(视频观看时长大于特定值),负样本(candidate sampling + importance weighting)【显示的数据(eg:点赞/踩)比较稀疏,而曝光和点击数据又可能因缩略图迷惑性造成虚假点击】 8 | * 多相特征:连续型(视频上传时长),类别特征(观看过的视频【以id形式编码】,用户搜索的token,地理位置特征等) 9 | * 标签和上下文选择:构建训练/预测时,以待推荐的时间前信息作为输入,而不是将未来的信息也作为输入 10 | 11 | + ranking:目的(有限的预测耗时内,引入更多特征),路径(构建时长预测回归模型),模型结构和优化点依次如下: 12 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20Neural%20Networks%20for%20YouTube%20Recommendations-pic3.png) 13 | * 特征表示: 14 | * 特征工程(eg:在这个主题下上一次用户看是多久前) 15 | * 类别特征(常规编码embedding 方式):曝光过,点击过的,搜索语言 16 | * 归一化连续特征:标准化(均值为0,方差为1),次线性(x^0.5),超线性(x^2)【归一化的原因是神经网络对输入数据分布敏感(参考batch normalization),使用次/超线性使增强单个特征的表达丰富度,实验效果loss降低0.2%】 17 | * 对观看时长建模:正样本(用户有点击),负样本(用户没点击),使用观看时长对正样本score进行加权 18 | * 模型参数(全连接层):不同模型参数的实验效果: 19 | 20 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Deep%20Neural%20Networks%20for%20YouTube%20Recommendations-pic4.png) 21 | 22 | #### more: 23 | 1. The primary role of ranking is to use impression data to specialize and calibrate candidate predictions for the particular user interface. 24 | -------------------------------------------------------------------------------- /recommender system/DeepFM:A Factorization-Machine based Neural Network for CTR Prediction.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章提出的DeepFM可以当作[FM](https://github.com/xwzhong/papernote/blob/master/recommender%20system/Factorization%20Machines.md)和[wide&deep](https://github.com/xwzhong/papernote/blob/master/recommender%20system/Wide%20%26%20Deep%20Learning%20for%20Recommender%20Systems.md)的结合版,整体结构如下: 3 | ![](https://github.com/xwzhong/papernote/blob/master/pic/DeepFM-pic1.png) 4 | 5 | + [FM](https://github.com/xwzhong/papernote/blob/master/recommender%20system/Factorization%20Machines.md):学习低阶特征(单个特征+两个特征交互) 6 | 7 | ![](https://github.com/xwzhong/papernote/blob/master/pic/DeepFM-pic2.png) 8 | + [wird&deep(dnn)](https://github.com/xwzhong/papernote/blob/master/recommender%20system/Wide%20%26%20Deep%20Learning%20for%20Recommender%20Systems.md):学习多个特征交互的特征 9 | 10 | ![](https://github.com/xwzhong/papernote/blob/master/pic/DeepFM-pic3.png) 11 | #### comment: 12 | 1. 文章主要是实验效果好,同时FM并不是只能学习低阶(≤2的特征交互),它也能泛化到更高阶的交互 13 | 2. 
word&deep只是手动的引入了一些人工特征,文章并没有用,特别强调似乎有点过 14 | 15 | #### more: 16 | + 文章超参实验结果: 17 | + 激活函数:relu比tanh效果好 18 | + dropout:0.9效果最好(以0.1间隔枚举0.5至1) 19 | + 神经元大小/隐层数:效果先增后减(过拟合) 20 | + 网络结构:神经元数相同的情况下,递减/相同的方式效果比递增/钻石型效果好 21 | -------------------------------------------------------------------------------- /recommender system/Factorization Machines.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章是因子分解的经典论文,该策略最重要的是缓解了数据稀疏问题: 3 | + 背景:某些特征经过关联,与label之间的相关性强,例如,“USA”与“Thanksgiving”、“China”与“Chinese New Year”这样的关联特征,对用户的点击有着正向的影响。换句话说,来自“China”的用户很可能会在“Chinese New Year”有大量的浏览、购买行为,而在“Thanksgiving”却不会有特别的消费行为。 4 | + 如何解决两个特征之间的组合:一种直接的方法就是采用多项式模型来表示两个特征的组合,x_i为第i个特征的取值,x_i\*x_j表示特征x_i和x_j的特征组合,其系数w_ij即为我们学习的参数,也是x_i\*x_j组合的重要程度 5 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Factorization%20Machines-pic1.png) 6 | + 上述方法存在的缺陷:如果第j,k个特征没有在训练集合组合出现,只出现了(i,j)和(i,k)这时将学不到这个特征 7 | + 文章提出的解决方案:将w_ij拆分为两个向量的点积,得到i和j的向量表示,从而能得到j,k共现时的权重 8 | 9 | #### more: 10 | 1. 文章提出的方案不仅缓解数据稀疏问题,还具有更广的适应性(可用于分类/回归/排序),同时优化参数的方式相同(不像svd++等需要专门设计) 11 | 2. [CTR预估模型FM、FFM、DeepFM](https://www.biaodianfu.com/ctr-fm-ffm-deepfm.html#FM%E6%A8%A1%E5%9E%8B%E5%8E%9F%E7%90%86) 12 | 3. idea看着很简单,但是也从理论上解释了为什么类似word embedding方式的好处 13 | -------------------------------------------------------------------------------- /recommender system/Implicit Feedback for Recommender Systems.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     文章提出三种隐式反馈来构建推荐系统中的数据集,以及两种使用方式: 3 | + 三种隐式反馈: 4 | 5 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Implicit%20Feedback%20for%20Recommender%20Systems-pic1.png) 6 | + 两种使用方式: 7 | 8 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Implicit%20Feedback%20for%20Recommender%20Systems-pic2.png) 9 | #### comment: 10 | 1. 利用隐式反馈数据的背景:显式的标注数据稀缺,构建成本高,多样性差(标注的用户相对集中) 11 | 2. 这种数据构造方式,优点是数据量可以构建的足够大(假设有对应的行为),同时也能缓解噪音带来的影响,缺点是需要进行专门埋点 12 | -------------------------------------------------------------------------------- /recommender system/Wide & Deep Learning for Recommender Systems.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对推荐系统中“**记忆**”和“**泛化**”问题,文章提出使用“**wide**”和“**deep**”策略构建模型: 3 | + a. **记忆(wide)**:线性模型中“记住”非线性特征(对于非线性特征,使用交叉乘积变换方式得到one hot特征) 4 | + b. **泛化(deep)**:针对类别特征,使用低维稠密向量表示,进而泛化未出现组合特征(类似于因式分解,将非线性特征分解成基础因子,线上预测使用这些基础因子组合得到此前少出现的组合特征-非线性特征) 5 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Wide%20%26%20Deep%20Learning%20for%20Recommender%20Systems.png) 6 | 7 | #### more: 8 | 1. ensemble(多个模型单独训练,最后对预测结果进行汇总):参数不共享,单个模型参数量更多 9 | 2. joint(多模型/模块同时训练,只有一个预测结果):相对较好,可只对模型薄弱模块增加参数 10 | 3. While Wide & Deep has a slightly higher offline AUC, the impact is more significant on online traffic. One possible reason is that the impressions and labels in offline data sets are fixed, whereas the online system can generate new exploratory recommendations by blending generalization with memorization, and learn from new user responses. 
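#### sketch:
    A minimal numpy sketch of the joint wide & deep logit described in the note above: the wide part scores a hand-crossed one-hot feature, the deep part feeds dense embeddings through a small MLP, and both logits share a single sigmoid and loss. All feature names, vocabulary sizes and values below are illustrative toys, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# one toy sample with two categorical features (names and sizes are illustrative)
n_country, n_holiday = 10, 5
country_id, holiday_id = 3, 1            # e.g. "China" during "Chinese New Year"

# --- wide part ("memorization"): hand-crossed one-hot feature + linear weights ---
cross_dim = n_country * n_holiday
x_wide = np.zeros(cross_dim)
x_wide[country_id * n_holiday + holiday_id] = 1.0     # cross-product transformation
w_wide = rng.normal(scale=0.01, size=cross_dim)

# --- deep part ("generalization"): dense embeddings fed through a small MLP ---
emb_dim, hidden = 8, 16
E_country = rng.normal(scale=0.1, size=(n_country, emb_dim))
E_holiday = rng.normal(scale=0.1, size=(n_holiday, emb_dim))
x_deep = np.concatenate([E_country[country_id], E_holiday[holiday_id]])

W1, b1 = rng.normal(scale=0.1, size=(2 * emb_dim, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(scale=0.1, size=(hidden, 1)), np.zeros(1)
h = np.maximum(0.0, x_deep @ W1 + b1)                 # ReLU hidden layer

# --- joint prediction: both logits feed one sigmoid, so both parts are trained
#     against the same loss rather than ensembled afterwards ---
logit = w_wide @ x_wide + (h @ W2 + b2).item()
p_click = 1.0 / (1.0 + np.exp(-logit))
print(f"P(click) = {p_click:.4f}")
```

    Because the two parts are trained jointly (one prediction, one loss), the wide weights only need to patch the cases the embedding-based part generalizes poorly on, which is the joint-vs-ensemble point made in the "more" items above.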
11 | -------------------------------------------------------------------------------- /regularization/Adversarial Training Methods for Semi-Supervised Text Classification.md: -------------------------------------------------------------------------------- 1 | ### Adversarial Training Methods for Semi-Supervised Text Classification 2 | 3 | #### note: 4 |   paper提出使用adversarial training和virtual adversarial training方式运用于文本分类中,此前已有相关paper证明,adversarial training和virtual adversarial training在图像分类方面已有很出色的表现。提出的原理在于,如果测试语料中有很小的perturbation,训练出来的model仍可能在测试中轻易判错(可见[样例](https://blog.openai.com/adversarial-example-research/))。在用于文本中时,其运用步骤大致如下: 5 | 6 | 1. 使用word embedding。原因在于离散变量难以实现无穷小的计算,而连续型变量,每增加一点都是有意义的。 7 | 2. 根据词频进行标准化。类似与z-score。 8 | 3. 引入perturbation(干扰项)——adversarial training和virtual adversarial training。 9 | 10 | + a. adv的思想:使用loss来计算embedding接下来最可能的变化方向; 11 | + b. vat的思想:尽可能使model的分界面光滑。 12 | 13 | #### comment: 14 | 1. 步骤2中进行了标准化操作,其中一点是考虑了参数初始化的影响——embedding的某些元素可能特别大,而激活函数的值域一般是(-1, 1)或(0, 1), 为了加快收敛,会尽可能将初始化参数后的激活值限定在中间部分(0或0.5)。 15 | 2. 该paper提出引入文本中的对抗学习,可用于各种各样的文本类型任务中,虽然简单,但是理论很深,而且从实验结果来看,效果非常不错。 16 | 3. [adversarial training笔记](https://github.com/xwzhong/papernote/blob/master/regularization/Explaining%20and%20Harnessing%20Adversarial%20Examples.md)、[virtual adversarial training笔记](https://github.com/xwzhong/papernote/blob/master/regularization/Distributional%20Smoothing%20with%20Virtual%20Adversarial%20Training.md) 17 | 18 | #### highlight: 19 |   Adversarial training (Goodfellow et al., 2015) is a novel regularization method for classifiers to improve robustness to small, approximately **worst** case perturbations. 20 | 21 | 22 | ### more reading: 23 | [note](https://github.com/dennybritz/deeplearning-papernotes/blob/master/notes/adversarial-text-classification.md) 24 | -------------------------------------------------------------------------------- /regularization/Batch Normalization Accelerating Deep Network Training by Reducing Internal Covariate Shift.md: -------------------------------------------------------------------------------- 1 | ### Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift 2 | 3 | #### note: 4 |   作者主要想通过减小deep network中layer与layer之间的covariate shift来加快模型训练的速度。提出的原因是:layer的输入分布本身的变化导致其参数也需要相应的调整,注意此处说的变化,是指“分布”的变化,比如线性分布转为双曲线的分布。paper提出的解决方案是在每个unit进行计算激活值前(称为pre-activation),先对其进行normalization操作,使其均值为0,方差为1,有点像zero-score,考虑到在整个training set进行normalization可行性不高(内存,运算时间,更新时效等),所以在batch层面进行,故而叫做batch normalization。 5 | 6 | #### comment: 7 | 1. 文章讲了该方法提出的很多理论,对深入了解神经网络很有帮助,同时非常建议研读“Understanding the difficulty of training deep feedforward neural networks”; 8 | 2. paper提出的方法有以下特点:可使用更大的learning rate值(更高的lr也对应了更快的decay),加速模型训练(能加速并不特指增加了lr本身,还有其它因素),正则化(从而能降低其它正则化的强度,如dropout,l2等);batch size不能太低(eg: 个位数),理论上越高越好。 9 | 10 | #### highlight: 11 | 1. We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training. To improve the training, we seek to reduce the internal covariate shift. By fixing the distribution of the layer inputs x as the training progresses, we expect to improve the training speed. 12 | 2. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift. 
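#### sketch:
    A minimal numpy sketch of the training-time transform described above: per-feature mean and variance are computed over the mini-batch, the pre-activations are standardized, and learnable γ/β restore scale and shift. The shapes, ε and the toy data are illustrative.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-time batch norm. x: (batch, features); gamma, beta: (features,)."""
    mu = x.mean(axis=0)                       # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # standardize each pre-activation
    return gamma * x_hat + beta               # learnable scale/shift keep capacity

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 4))        # shifted, spread-out pre-activations
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```

    At inference time the batch statistics are replaced by moving averages collected during training; that train/infer gap is exactly what the Batch Renormalization note later in this collection comes back to.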
13 | -------------------------------------------------------------------------------- /regularization/Batch Normalized Recurrent Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Batch Normalized Recurrent Neural Networks 2 | 3 | #### note: 4 |   作者探讨了在RNN中使用batch normalization的可行性,经实验,作者指出对“simple rnn”的hidden2hidden层的每个activation function进行batch normalization时,其训练速度不减反曾,准确率也略微变差,因此作者提出只在input2hidden层进行标准化(详见式18,同时针对multi layer rnn,只对纵向的变换使用normalization操作,类似dropout)。实验结果表明,测试准确率基本没有提升。 5 | 6 | #### comment: 7 | 1. paper对hidden2hidden引入batch normalization进行了实验,其结果却出乎意料地差,其中的缘由值得思考,原文给出的解释是“It may be due to the fact that we estimate new statistics at each time step, or because of the repeated application of γ and β during the recurrent procedure, which could lead to exploding or vanishing gradients.”。从个人从看的paper出发,觉得原因是pre-activation的计算直接糅合了hidden和input的信息,而这两个方向的信息不应该“直接相加计算normalization”,因为它们两个的分布是不一样的,可以考虑单独进行normalization后再进行相加(recurrent batch normalization这篇论文会有实验)。 8 | 2. paper提出的sequence-wise normalization策略不错,可用在其它填充序列进行batch计算的方法。 9 | -------------------------------------------------------------------------------- /regularization/Batch Renormalization Towards Reducing Minibatch Dependence in Batch-Normalized Models.md: -------------------------------------------------------------------------------- 1 | ### Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models.md 2 | 3 | #### note: 4 |   paper作者是原batch normalization方法提出者之一,其指出原有的BN在training过程中具有以下缺点: 5 | 6 | * batch size设置需较大; 7 | * 样本与样本之间存在依赖的问题。 8 | 9 |   造成上述两者的原因,作者猜想:that is due to the dependence of model layer inputs on all the examples in the minibatch, and different activations being produced between training and inference。提出的改进方法是在train和infer过程中employ BN步骤建立映射关系。详见论文第3部分。 10 | 11 | #### comment: 12 | 1. 在建立映射关系时,作者很巧妙地引入了infer过程中的μ和σ。 13 | 2. 在training过程中,如果没有特别要求,尽量保持每个训练的batch彼此之间都不同,不仅是同一个epoch,不同epoch之间同理。 14 | 15 | #### highlight: 16 | 1. However, its effectiveness diminishes when the training minibatches are small, or do not consist of independent samples. 17 | 2. Let us observe that if we have a minibatch and normalize a particular node x using either the minibatch statistics or their moving averages, then the results of these two normalizations are related by an affine transform. 18 | -------------------------------------------------------------------------------- /regularization/Distributional Smoothing with Virtual Adversarial Training.md: -------------------------------------------------------------------------------- 1 | ### Distributional Smoothing with Virtual Adversarial Training 2 | 3 | #### note: 4 |   文章提出了增加数据smothness(平滑度)的方法: 5 | 6 | 1. 动机。input越光滑,其数据本身越整洁(相比于圆形),因此,需要减小其“粗糙”部分(相比于三角形); 7 | 2. smothness定义。使用[KL divergence衡量](https://github.com/xwzhong/articlelist#math)数据的平滑度。最终得到新的loss方程,见式4. 8 | 9 | #### comment: 10 | 1. smothness定义翻译过来,即:输入样本A,随机偏移得到样本B,然后使用KL计算偏移后的分布与原分布的差异(度量光滑度),再尽可能得保证AB被分到同一类中。 11 | 2. vat有三个参数,偏移最大范围ϵ、与原loss方程比例λ和随机次数I; 12 | 3. 无需label标注信息即可用来生成对抗样本; 13 | 4. 
实验结果好。 14 | -------------------------------------------------------------------------------- /regularization/Dropout is a special case of the stochastic delta rule faster and more accurate deep learning.md: -------------------------------------------------------------------------------- 1 | #### note: 2 | 3 |     文章指出dropout是随机delta规则(stochastic delta rule)的特例。 4 | 5 |     dropout数学分布上的解释:To put the Dropout algorithm in probablistic context, consider that a Bernoulli random variable over many trials results in a Binomial distribution with mean np and standard deviation (np(1 − p)). The random variable is the number of removals (“successes”) over learning on the x axis and the probability that the hidden unit will be removed with Binomial (np, np(1 − p)). 6 | 7 |     相比dropout,stochastic delta rule的分布参数不固定,SDR adaptively updates the random variable parameters for subsequent sampling and Dropout samples from a Binomial random variable with fixed parameters (mean, standard deviation at p),在实现时大致步骤如下: 8 | + a. 初始化每个参数方差为0,(此后更新时,以当前loss反向计算的梯度进行更新); 9 | + b. 对于当前参数,以其当前值(作为正态分布均值)和参数方差(作为正太分布方差)进行取样,取样后的值即为更新后的参数值。 10 | 11 | #### comment: 12 | 1. 本文是缓解“过拟合”难得的好文,从理论上看有依据,同时从实验的准确率看表现异常好; 13 | 2. 从论文分析可知,其假定每个神经元服从同样的分布,个人认为这是收敛速度加快的主要原因,同分布减小了参数的搜索空间; 14 | 3. SDR也有点类似于[adversarial training](https://github.com/xwzhong/papernote/blob/master/regularization/Explaining%20and%20Harnessing%20Adversarial%20Examples.md),SDR是将梯度反馈到参数上,对参数进行一定范围的随机,而adversarial training则是将梯度反馈到输入的样本特征中,对样本本身引入额外因子; 15 | 4. 使用SDR约需额外一倍的内存(显存),用于保存梯度不为None的trainable参数的方差; 16 | 5. 文章虽然指出,达到相同准确率时SDR模型不需要过多的epoch,但是每个epoch的计算量应该比基本模型高一倍左右(需要进行额外一次的反向梯度计算)。 17 | 18 | #### more: 19 | 1. [经典防过拟合方法Dropout,只是SDR的特例](https://zhuanlan.zhihu.com/p/43083693) 20 | 2. [tensorflow code](https://github.com/noahfl/densenet-sdr) 21 | 22 | #### question: 23 |     该策略在文本上不对word embedding使用SDR会不会好点?原因在于SDR是针对神经元上的操作。但是从dropout实际使用来说,对word embedding进行dropout也能达到缓解过拟合的情况; 24 | 25 | #### highlight: 26 |     The proposed sequence generation model generates labels sequentially and predicts the next label conditioned on its previously predicted labels. Therefore, it is likely that we would get a succession of wrong label predictions in the following time steps if the prediction is wrong at time-step t, which is also called exposure bias. 27 | -------------------------------------------------------------------------------- /regularization/Explaining and Harnessing Adversarial Examples.md: -------------------------------------------------------------------------------- 1 | ### Explaining and Harnessing Adversarial Examples 2 | 3 | #### note: 4 |   paper主要是想解决adversarial样本在训练好的中容易被错分的情况。 5 | 6 | 1. 作者指出,adversarial干扰主要是由样本的线性特征造成的(文中指出,早期学者已经做出定量分析),因此提出在样本中预先引入这个perturbation(详见paper第3、4部分); 7 | 2. “样本的线性干扰因子”的计算方式。在当前loss下,样本输入的特征x线性变化方向的该变量,当x第i个特征在当前loss导数下为正时,则说明该特征有变大的可能性,原特征加上这个变化的可能性便可得到adversarial loss。 8 | 9 | #### comment: 10 | 1. 不仅是从行文还是所提模型背后的哲理,都值得深究; 11 | 2. 关于计算量。因为需要再从loss反馈到x中,估计需增加一倍左右的计算; 12 | 3. 讨论:train和eval存在鸿沟的原因。现有的训练数据都是有一定的干扰噪音,致使模型并没有学到groundtruth,从而也导致了训练error与测试error之间存在一低一高的问题。引用原文的话是:even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label 13 | 4. 不仅可以用于图像中,还可用于文本中。 14 | 15 | #### highlight: 16 |   The adversarial training procedure can be seen as minimizing the worst case error when the data is perturbed by an adversary. 
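#### sketch:
    A minimal numpy sketch of the fast-gradient idea in the note above, using a toy logistic-regression loss so the gradient with respect to the input can be written in closed form; ε and all toy values are illustrative, and a real network would obtain the input gradient by backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.1          # a "trained" toy logistic-regression model
x, y = rng.normal(size=16), 1.0          # one input and its true label

# gradient of the cross-entropy loss w.r.t. the input
# (closed form for this toy model: (p - y) * w)
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# fast gradient sign step: nudge every input feature in the direction
# that increases the loss the most, within an L_inf ball of radius eps
eps = 0.1
x_adv = x + eps * np.sign(grad_x)

print(f"p(y=1 | x)     = {sigmoid(w @ x + b):.3f}")
print(f"p(y=1 | x_adv) = {sigmoid(w @ x_adv + b):.3f}")  # confidence in the true label drops

# adversarial training then simply adds the loss on (x_adv, y) to the training objective
```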
17 | 18 | #### more reading: 19 | [深度学习:简单而有局限性的求解方式](https://www.jiqizhixin.com/articles/3e949ae2-48ac-4606-98a0-844fb291c4d4) 20 | -------------------------------------------------------------------------------- /regularization/Layer Normalization.md: -------------------------------------------------------------------------------- 1 | ### Layer Normalization 2 | 3 | #### note: 4 |   针对DL模型training时间长的问题,作者提出使用layer normalization的方式提高训练速度。该方法与batch normalization有个极大的不同:与batch size大小无关,从而有以下有点: 5 | 1. training过程中,可使用较小的batch size,不用考虑sample之间的依赖; 6 | 2. test过程使用layer normalization的过程与training完全相同。 7 | 8 |   具体操作:针对每一个sample,以layer为单位(假设有m个hidden unit),计算当前step该layer下m个unit pre-activity的均值和方差,m个unit共享得到的均值和方差并进行标准化。 9 | 10 | #### comment: 11 | 1. 该方法很好地摒弃了mini-batch大小的限制; 12 | 2. 在RNN单元何处使用layer normalization时,paper的做法类似[Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025),而且得到的实验结果不错; 13 | 3. paper fig6进一步证明batch size越大,模型性能越好; 14 | 4. 经实验发现,LSTM使用layer normalization后,在其它参数相同的情况下,效率降低1.x以上。[更多讨论](https://www.reddit.com/r/MachineLearning/comments/4ufmxy/layer_normalization_implemented_in_tensorflow/) 15 | 16 | #### highlight: 17 | 1. Notice that changes in the output of one layer will tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot. 18 | 2. under layer normalization, all the hidden units in a layer share the same normalization terms µ and σ, but different training cases have different normalization terms. Unlike batch normalization, layer normaliztion does not impose any constraint on the size of a mini-batch and it can be used in the pure online regime with batch size 1. 19 | 20 | -------------------------------------------------------------------------------- /regularization/Recurrent Batch Normalization.md: -------------------------------------------------------------------------------- 1 | ### Recurrent Batch Normalization 2 | 3 | #### note: 4 |   paper跟“Batch Normalized Recurrent Neural Networks”一样是探讨对RNN引入batch normalization后的效果,但其deploy策略表明,对模型准确率、训练速度都有较大的提升。从实现的方法上看,其对每个激活函数的input都进行了batch normalization(有sigmoid,tanh,详见式6-8)。 5 | 6 | #### comment: 7 | 1. 从式6可知,激活函数的均值为0,方差却为2,而不是1(两个方差为1但是不相关的分布相加其方差为2,均值为0),同时也侧面表明γ值需要相对较小,参数的初始化需要考虑的点比较多,核心是在当前loss function的基础上,w和x要相契合,w*x值太大,容易造成激活函数饱和,从而降低传播梯度,训练时间加大。 8 | 2. 对比“Batch Normalized Recurrent Neural Networks”论文,其实验结果显示hidden2hidden中引入batch normalization效果差的原因还是这点:pre-activation的计算先对hidden和input的信息相加再进行batch normalization,而这两个方向的信息不应该“直接相加计算normalization”,因为它们两个的分布是不一样; 9 | 3. 在inference中,population方式比batch statistics更好。 10 | 11 | #### highlight: 12 |   Covariate shift is a change in the distribution of the inputs to a model. This occurs continuously during training of feed-forward neural networks, where changing the parameters of a layer affects the distribution of the inputs to all layers above it. As a result, the upper layers are continually adapting to the shifting input distribution and unable to learn effectively. 13 | -------------------------------------------------------------------------------- /regularization/Regularization of Neural Networks using DropConnect.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   DropConnect是传统dropout上的改进,相比于深度学习中使用的Dropout,在理论上更具泛化能力。以某神经元为对象,Dropout对该神经元的输出进行调整(添加mask,共需调整1个元素),而DropConnect则对该神经元的输入进行调整(添加mask,共需调整k个元素,k代表该神经元所在位置上一层的输出维度)。 3 | 4 | #### comment: 5 | 1. 
截止2018-06-29,Google搜索显示该论文引用达到935,但在NLP领域看到使用的不多,存疑。 6 | 7 | #### code: 8 | 1. [DropConnect Implementation in Python and TensorFlow](https://nickcdryan.com/2017/06/13/dropconnect-implementation-in-python-and-tensorflow/) 9 | 2. [Can I use `tf.nn.dropout` to implement DropConnect?](https://stackoverflow.com/questions/44355229/can-i-use-tf-nn-dropout-to-implement-dropconnect) 10 | 11 | #### more: 12 | 1. [Regularization of Neural Networks using DropConnect](https://cs.nyu.edu/~wanli/dropc/) 13 | 2. [Difference between DropOut and DropConnect](https://stats.stackexchange.com/questions/201569/difference-between-dropout-and-dropconnect) 14 | 3. [What is the difference between dropout and dropconnect?](https://www.quora.com/What-is-the-difference-between-dropout-and-dropconnect) 15 | 4. [Deep learning:四十六(DropConnect简单理解)](https://www.cnblogs.com/tornadomeet/p/3430312.html) 16 | -------------------------------------------------------------------------------- /regularization/Weight Normalization A Simple Reparameterization to Accelerate Training of Deep Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks 2 | 3 | #### note: 4 |   paper以batch normalization为基础,提出了weight normalization。针对f(x)=wx+b,BN对x进行了normalization操作,而WN则对w进行调整,具体操作是令w的“值”和“方向”分别用不同参数表示,从而减少“值”与“方向”之间的依赖。 5 | 6 | #### comment: 7 | 1. 与BN相比,WN排除了同个batch中sample与sample之间的依赖关系,同时也解决了BN batch设置需较大,training和infering存在差异的问题; 8 | 2. 缺点,模型参数需专门初始化; 9 | 3. 从f(x)=wx+b中x的角度解释WN,经过WN,x的均值为0,方差为||v||。(v包含了w的方向信息) 10 | 11 | #### highlight: 12 | 1. we now have ||w|| = g, independent of the parameters v. We therefore call this reparameterizaton weight normalization. 13 | 2. By decoupling the norm of the weight vector (g) from the direction of the weight vector (v/||v||), we speed up convergence of our stochastic gradient descent optimization, as we show experimentally in section 5. 14 | -------------------------------------------------------------------------------- /regularization/curriculum learning.md: -------------------------------------------------------------------------------- 1 | ### curriculum learning 2 | 3 | #### note: 4 |   paper提出的curriculum learning有点类似人类学习机制——先学简单的技能,再学困难的。用作者的话说:The basic idea is to start small, learn easier aspects of the task or easier subtasks, and then gradually increase the difficulty level。该trick在运用过程中,仅需改变训练数据集中不同sample的训练数据,对模型的结构不需更改。经实验发现,该方法具有以下优点: 5 | 6 | 1. give rise to improved generalization and faster convergence; 7 | 2. to find better local minima of a non-convex training criterion. 8 | 9 | #### comment: 10 | 1. curriculum learning核心是如何判定样本的学习难易,文中从noise、样本的形式(图形旋转、扭曲)、词频等角度给出了实验,个人不得不惊叹该方法的简洁与可用性; 11 | 2. 个人理解,在employ该方法时,可猜想要使“模型参数、结构简洁”该如何操作; 12 | 3. 文中也对为什么该方法可行做出了猜想,个人看法是,容易学习到的sample是基础,构成了整个收敛面的basin,而tough sample相当于是高层面的样本,先基于简单的样本去学习,尽量保证往后的学习中不会偏离这个基本的basin。 13 | 14 | #### highligt: 15 | 1. A theoretical motivation for deep architectures comes from complexity theory: some functions can be represented compactly with an architecture of depth k, but require an exponential size architecture when the depth is restricted to be less than k. 16 | 2. Erhan et al. (2009) found that unsupervised pre-training makes it possible to start the supervised optimization in a region of parameter space corresponding to solutions that were not much better in terms of final training error but substantially better in terms of test error. 17 | 3. 
From the machine learning point of view, once the success of some curriculum strategies has been established, the important questions are: why? and how? 18 | -------------------------------------------------------------------------------- /self-supervised learning/A Simple Framework for Contrastive Learning of Visual Representations.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     在图像领域,作者提出simCLR框架,在大规模无标注语料上进行自监督训练,结果表明微调模型后ImageNet数据集上表现优异,simCLR任务: 3 | + 1. 设计:将原图片进行增强,在同一图片扩展出来的子图中拉近它们的距离,在不同图片扩展出来的子图中推远其距离(如下图),相似度计算使用MLP后的cosine; 4 | + 2. loss:使用contrastive cross entropy loss(cosine除了一个temperature parameter); 5 | + 3. 受益点:较大的批量、训练时长、网络深度、数据增加策略(迁移到文本中要多尝试)、在损失函数前采用基于MLP的非线性投影(但在finetune时丢弃该层表现会更好,详见原文fig2) 6 | 7 | #### comment: 8 | + 1. 该论文较大的贡献在于少量标注数据集时,如何能尽可能地提高预测性能; 9 | + 2. 作者尝试了多种不同的变换策略,最终发现组合“裁剪”和“色彩失真”可以得到SOTA(单策略不会提高太多性能),说明如果要迁移到文本中,还需要进行多种尝试 10 | + 11 | #### more: 12 | + 1. [自监督学习](https://zhuanlan.zhihu.com/p/66063089):直接从无标注数据中进行学习,eg:language model、next sentence prediction 13 | + 2. [论文解读](https://mp.weixin.qq.com/s?__biz=MzI3MTA0MTk1MA==&mid=2652065896&idx=1&sn=0b539317e18ea705744200dff8c0f8b9) 14 | + 3. [tf源码](https://github.com/google-research/simclr) 15 | 16 | #### question: 17 | + 1. 为什么loss计算时sim要除以tau? 18 | -------------------------------------------------------------------------------- /sequence to sequence/Fluency Boost Learning and Inference for Neural Grammatical Error Correction.md: -------------------------------------------------------------------------------- 1 | #### [main](https://www.jiqizhixin.com/articles/2018-07-22-10): 2 |   作者提出使用流畅度提升学习(fluency boost learning)来对句子进行纠错,其核心原理就是在训练模型的过程中,让seq2seq模型生成出多个结果,然后将结果中流畅度不如目标端正确句子的生成句子和目标端正确句子配对,组成全新的流畅提升句对,作为下一轮训练的训练数据。而流畅度提升推断(fluency boost inference)则是利用seq2seq模型对句子进行多轮修改,直到句子的流畅度不再提升为止。这种多轮修改的策略能够率先改掉句子的一部分语法错误,从而使句子的上下文更加清晰,有助于模型修改剩下的错误。 3 | 4 | #### comment: 5 |   亮点主要有以下方面: 6 | a. 训练数据少的情况下,通过beam search生成多个结果进行再训练; 7 | b. 使用流畅度衡量当前句子与最终正确句子的差距,从而判断是否继续训练/生成(类似不断对句子进行纠正的过程); 8 | 9 | #### more: 10 | 1. [通过全新学习和推断机制提升seq2seq 模型的语法改错性能](https://www.jiqizhixin.com/articles/2018-07-22-10) 11 | 2. [机器语法纠错能力新突破,微软小英变身英语写作老师](https://www.msra.cn/zh-cn/news/features/engkoo-composition-score) 12 | -------------------------------------------------------------------------------- /sequence to sequence/Get To The Point Summarization with Pointer-Generator Networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   作者提出使用Pointer-Generator Networks生成文本摘要,动机是摘要部分的词多为OOV词,使用pointer networks在seq2seq decoder部分既能得到预先定义的vocabulary中各词概率分布,也能使用copy机制,直接从encoder中复制某个词: 3 | 1. encoder:对input进行encode,得到各个step的输出(e1, ..., en); 4 | 2. decoder:对当前step,vocab的概率分两部分(式9): 5 | a. 常规方法计算得到的vocab各词概率; 6 | b. 使用(e1, ..., en)计算当前decoder步下各ek的权重,每个权重对应encoder位置的概率。 7 | 3. loss:为了避免生成的内容出现了连续重复的句子,loss添加了额外的损失,名为converge loss。 8 | 9 | #### comment: 10 | 1. 该方法在decoder部分,既能生成指定vocab中词,也能主动在encoder部分直接copy,如果decoder中的词大部分来自encoder,而且oov词较多,使用该方面能较好的解决上述问题(适用范围); 11 | 2. 该方法巧妙的地方是vocab词与encoder中的词结合部分,其直接在原始的vocab中加权后进行拼接。 12 | 13 | #### [code](https://github.com/abisee/pointer-generator) 14 | 15 | #### highlight: 16 |   In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. 
First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. 17 | 18 | -------------------------------------------------------------------------------- /sequence to sequence/Grammar as a Foreign Language.md: -------------------------------------------------------------------------------- 1 | ### Grammar as a Foreign Language 2 | 3 | #### note: 4 |   paper使用了seq2seq+attention机制运用于句子的语法解析上,新瓶装旧酒,模型几乎没变(详细模型参见[bahdanau](https://arxiv.org/abs/1409.0473)论文),只是实验了运用于语法解析上的效果。 5 | 6 | #### comment: 7 | 1. 语法解析本身的意义有多大?因为DL运用于相关的领域不需要事先具备“语法”知识,当然对于特殊项目的需求除外; 8 | 2. 实验设置中讨论了多个因素对结果的影响,这方面值得一看,如:beam size大小的设置及其影响、dropout和pretrain word2vec等; 9 | 3. 了解beam size对模型的影响需要理解在training decoder部分,每个timestep都是最大化目标idx,再结合test/eval过程中,beam搜索过程,每个timestep对每个beam进行k次扩展,最后取k个结果作为next timestep的搜索种子。 10 | -------------------------------------------------------------------------------- /sequence to sequence/Learning to Skim Text.md: -------------------------------------------------------------------------------- 1 | ### Learning to Skim Text 2 | #### note: 3 |   文章提到,在使用RNN时,需要读入所有的text,但是text中有冗余信息,可以通过skim这部分文本内容来提高decoder时的速度。其中,使用了reinforcement leanring来学习这个skim过程。 4 | 5 | #### comment: 6 | 1. 作者首先有复现冗余信息这样一个理论,想通过jump这些不必要的信息来提高测试时的效率,考虑到jump步数为离散变量,所以使用了RN的方法来解决这一问题,所以核心是在RL的实际deploy过程; 7 | 2. 从模型的角度来思考,其能jump这部分冗余信息,可能是当前已encoded的信息对接下来k步内容已经有了很高的置信度,所以可直接jump这部分信息(这是从测试的时候考虑的); 8 | 3. 模型起到了一定的正则作用; 9 | 4. 对于有jump信息的text,使用文章提到的方法能提高测试时执行的速率,但初步预测,训练时间会比不用该方法要长。如果jump信息少,不仅可能因此增加了训练时间,对测试部分的执行效率也不一定提高。(jump信息:对完整的text剔除掉后仍能不影响模型的预测效果,如分类,文本生成等); 10 | 5. paper3.1部分的试验设置得比较巧妙,人能明显地判断出数据有skip信息,从而据此来训练进而判断模型是否能学到这部分信息; 11 | 6. 不同的实验组,其embedding是否固定不一致,有何依据? 12 | 7. paper3.1部分为什么要curriculum training scheme?training size很大就能不担心overfitting? 13 | 14 | #### practice: 15 | 1. jump信息多的时候再考虑使用skim技巧。 16 | 2. 如何判断jump信息是否多:a.文本长度是否较长;b.重现短语是否较多;c.是否为特定领域文本等。 17 | 3. 对于能提高test部分多少准确率可参照paper不同规格的实验。 18 | 19 | #### more reading: 20 | 21 | -------------------------------------------------------------------------------- /sequence to sequence/Neural Machine Translation by Jontly Learning to Align and Translate.md: -------------------------------------------------------------------------------- 1 | ### Neural Machine Translation by Jointly Learning to Align and Translate 2 | 3 | #### note: 4 |   basic seq2seq把input信息都encode进“一个”固定的vector中,其vector本身的表达能力有限,特别是针对input句子本身比较长的情况,句子越长,vector信息越模糊。paper提出的解决方案是在decoder的每一个step中,不仅考虑decoder在该step之前得到的connection信息s和输入y,还对encoder部分“每一个”STEP进行回顾,并判定各个STEP隐层信息的权重、加权求和,得到该decoder部分step下的输入。 5 | 6 | #### comment: 7 | 1. jointly怎么理解?需根据attention引入的方式来解释(attention就是加权求和后的c),在我看来就是这个加权求和过程; 8 | 2. 猜想。如果encoder-decoder中,两边的词是一对一的对应(或接近一对一),那么模型的效果会比“一对多”和“多对一”好一点。对于一词多义的情况,需要利用好上下文信息,而rnn正有这样的优点; 9 | 3. 该模型对长句子效果好,短句子是否可考虑不用该模型?得清楚长短在此文的界限。根据论文的fig2可知,对于句子长度为10的文本,该模型已经有很大的提升,长度为10已经挺短的,因此一般都会使用attention机制; 10 | 4. appendix中对如何employ c进行了详细的公式解说,非常值得看; 11 | 5. 
本文是难得一遇的好文,对basic seq2seq有明显的提升效果。 12 | -------------------------------------------------------------------------------- /sequence to sequence/On Using Very Large Target Vocabulary for Neural Machine Translation.md: -------------------------------------------------------------------------------- 1 | ### On Using Very Large Target Vocabulary for Neural Machine Translation 2 | 3 | #### note: 4 |   paper是对decoder部分target vocabulary非常大(10w+)的时候建议使用的stragedy。原因:在seq2seq decoder部分,对于每个timestep都需要计算vocabulary size次向量运算(言外之意是需要计算式7),对于vocabulary集合比较大的情况,其运算量和显存(如果使用GPU)会明显增加,因此作者提出了sample softmax技巧,在training过程中,对于decoder的每个timestep中的output部分,只sample指定大小的vocabulary集合替换完整的词集进行计算。 5 | 6 | #### comment: 7 | 1. title提到very large target vocabulary,在本文之前的translation实验中,有取3W、8W大小的词集,所以就此猜测,10W+应该算是比较大的词典; 8 | 2. 为什么是target vocabulary,而encoder部分不考虑?此处需要理解encoder部分是没有output layer的;而且,就算是在decoder中的input vocabulary,其亦可包含两套词集,在此用“target”非常精确; 9 | 3. 理解本文的时候需要理解在decoder部分中,train和test/eval是不同的,在train的时候,groundtruth target是已知的,式6只需计算一个分子,但是在test/eval时,target是未知的,有多少个候选word就需要计算多少个分子,然后再进行排序比较; 10 | 4. 很明显,该方法是在计算能力和显存缺乏的情况下提出的。 11 | -------------------------------------------------------------------------------- /sequence to sequence/Pointer Networks.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |   Pointer Networks基于seq2seq with attention的结构,当前序列的输出集合为当前序列的输入集合,同时输出集合并没有像一般定义好的词集,而直接用位置索引替代,模型逻辑: 3 | 1. encoder:对input进行encode,得到各个step的输出(e1, ..., en); 4 | 2. decoder:对当前step,使用(e1, ..., en)计算当前decoder步下各ek的权重,最大的权重位置对应的encoder输入即为当前decoder的输出(详见paper 2.3部分)。 5 | 6 | #### comment: 7 | 1. 该策略在凸包、旅行商问题中实验中效果良好; 8 | 2. 其能很好的指向输入的unit,但限制在当前序列输入的unit,若其它序列中的unit未在当前序列出现,同样不能生成; 9 | 3. 文章提出的策略非常巧妙,同时也说明seq2seq with attention策略虽然能考虑到某些方面,但针对特定问题,显示的表达想要的特征可能会得到意想不到的结果。 10 | -------------------------------------------------------------------------------- /sequence to sequence/Sequence to Sequence Learning with Neural Networks.md: -------------------------------------------------------------------------------- 1 | ### Sequence to Sequence Learning with Neural Networks 2 | 3 | #### note: 4 |   本文是seq2seq比较原始版本,先使用一个rnn将一段文本映射到固定维度的vector中,再由另一个rnn将该vector作为初始状态进行decoder。同时,本文指出对encoder部分的input进行reverse能有更好的效果。 5 | 6 | #### comment: 7 | 1. 本文得到的测试结果虽比现在(2017-06-21)state of art差得比较远(BLUE34.8vs41.29),但可从其所遇到的问题,了解attention、sample softmax等优秀的策略的提出背景; 8 | 2. paper提到使用reverse效果更好,从侧面指出rnn在长句方面仍有欠缺,而同期Bahdanau等人提出的attention机制刚好能在一定程度上弥补(bidirectional rnn也能弥补)该缺点; 9 | 3. model analysis部分,对于类似的句子,seq2seq encoder得到的hidden state在空间中聚合,这一点发人深省。 10 | 11 | #### more reading: 12 | 13 | -------------------------------------------------------------------------------- /sequence to sequence/Skip-Thought Vectors.md: -------------------------------------------------------------------------------- 1 | ### Skip-Thought Vectors 2 | 3 | #### note: 4 | 1. paper提出了一种将句子转为高维向量,称之为skip-thought vector。训练的时候采用了无监督的方式,基本框架是seq2seq。核心思想是利用某句话前后两个句子来描述该句的语义和语法特点,从而以此来生成该句的高维句向量; 5 | 2. 待测试的语料中出现OOV的情况,文中提出了两个vocabulary空间进行线性映射的方法。 6 | 7 | #### comment: 8 | 1. 算法的核心思想是用了"around"句子来限定句子的表示,在看论文摘要的时候,提出了两个问题: 9 | + a.如何定义语义和语法的判定量度; 10 | + b.如何划定语义和语法之间的相对权重。这两个问题听起来像是传统机器学习所考虑的,但是在这里完全不用人为的去定义一些指标,或许这也是DL方法的一大优势,但也缺乏可解释性; 11 | 2. 为什么要叫“skip”,正是因为encoder句子为中间句子,decoder句子为该句的前一句和后一句; 12 | 3. 
在实验方面,在多个领域中进行了比较,有语义相关性,分类等,实验严谨,有对比性。 13 | 14 | #### [code](https://github.com/tensorflow/models/tree/master/skip_thoughts) 15 | -------------------------------------------------------------------------------- /transformer/ALBERT A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对bert显存占用高、训练慢,作者提出了两个改进策略,同时针对句子之间的连贯性问题提出了对next sentence task的改进策略: 3 | + 1. Factorized embedding parameterization(对word embedding进行因式分解):word embedding用较小的dim参数E,然后映射到transformer隐藏层大小H,参数量从O(V\*H)变为O(V\*E+E\*H),当H远大于E时,参数降低非常明显(在实验中,作者始终将E设为128,即使是albert-base); 4 | + 2. Cross-layer parameter sharing(transformer层与层之间共享参数); 5 | + 3. Inter-sentence coherence loss(句间连贯性损失):bert中next sentence task更多学到的是主题分类任务,因为负样本是从其他文章中随机选取的,从而对“句子之间的连贯性”信息学习不足(因为已经可以通过主题将两个句子分开)。因此作者提出sentence-order prediction(SOP)任务,该策略正样本构造跟原bert方法一样,负样本为正样本句子顺序调换。 6 | 7 | 8 | #### comment: 9 | 1. 作者提出的第一个策略“Factorized embedding parameterization”,解释感觉有点牵强,给人的印象是:为了降低显存,降低了word embedding大小,然后实验效果还行; 10 | 2. 文章给出的时间,是指总的训练时间减少(因为训练步数减少,共享层与层之间的参数有点像batch normalization策略),而不是同数据量预测时间减少; 11 | 12 | #### more: 13 | 1. [作者提供的源码](https://github.com/google-research/ALBERT) 14 | 2. [中文预训练ALBERT](https://github.com/brightmart/albert_zh) 15 | -------------------------------------------------------------------------------- /transformer/Attention is All You Need.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper提出transformer,其以encoder-decoder结构对句子进行编码映射: 3 | 1. encoder: InputEmbedding + PositionalEncoding + N*[LayerNorm(x+MultiHeadAttention(x)) + LayerNorm(x+FeedForward(x))] 4 | 2. decoder: OutputEmbedding + PositionalEncoding + N*[LayerNorm(y+MaskedMultiHeadAttention(y) + LayerNorm(y+MultiHeadAttention(y)) + LayerNorm(y+FeedForward(y))] + Linear + Softmax 5 | 如下图: 6 | ![](https://github.com/xwzhong/papernote/blob/master/pic/transformer.png) 7 | 8 | 9 | 各组件详细介绍: 10 | 1. positional encoding(paper3.5):选用原因——We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_pos+k can be represented as a linear function of PE_pos; 11 | 2. [layer normalization](https://github.com/xwzhong/papernote/blob/master/regularization/Layer%20Normalization.md); 12 | 3. attention, 多个Scaled Dot-Product Attention拼接构成了multi-head attention: 13 | 14 | + a. Scaled Dot-Product Attention(paper3.2.1), 其最后添加scale的原因——We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients,计算见下图: 15 | 16 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Scaled%20Dot-Product%20Attention.png) 17 | 18 | + b. Multi-Head Attention(paper3.2.2). 将V,K,Q进行线性映射,并分别split为num_head,进而对num_head个部分计算Scaled Dot-Product Attention,最后concat并映射,计算见下图: 19 | 20 | ![](https://github.com/xwzhong/papernote/blob/master/pic/Multi-Head%20Attention.png) 21 | 22 | 4. Position-wise Feed-Forward Networks(paper3.3)):FFN(x) = max(0, x*W_1 + b_1)*W_2 + b_2。 23 | 24 | 25 | 26 | #### comment: 27 | 1. transformer核心组件是multi-head attention部分,token向量交叉计算,最后输出加权后的向量,直观上感觉比较简单,但是加上了多个辅助技巧后实验效果确实不错; 28 | 2. transformer结构中使用了很多小技巧,如scale, layer normalization, mask, residual connection, label smoothing等,有些缺乏必要理论解释; 29 | 3. 单个epoch因并行使得训练速度较快,但个人通过一些运用发现,其拟合速度较慢,可能是由于整个网络搜索空间大,致使其需要更多的尝试才能找到收敛方向,同时也预示使用预训练的参数能加快收敛速度。 30 | 31 | 32 | #### more: 33 | 1. 
[论文阅读笔记之Attention Is All You Need](https://blog.csdn.net/mrphd/article/details/79357890) 34 | 2. [tensorflow code](https://github.com/Kyubyong/transformer) 35 | 36 | #### highlight: 37 | 1. why self attention:One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third, the shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. 38 | 2. Label Smoothing During training, we employed label smoothing of value e_ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score. 39 | -------------------------------------------------------------------------------- /transformer/BERT Pre-training of Deep Bidirectional Transformers for Language Understanding.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper在论文"Attention Is All You Need"提出的transformer基础上使用双向编码,并基于“大规模语料”+“两项特定的task”预训练参数,最终在task specific任务中fine tuning,得到论文所有呈现数据集中的最优结果。 3 | 1. bidirectional transformer。[OpenAI GPT](https://blog.openai.com/language-unsupervised/)提出了left-to-right transformer,其仅为单向(every token can only attended to previous tokens in the self-attention layers of the Transformer),同时虽然biLSTM虽然也使用了双向结构,但是这两个方向的LSTM相对独立,paper5.1部分详细实验了去除双向结构的影响。各结构对比详见原文fig1。 4 | 5 | 2. pretrain task。数据使用BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words),同时以下面两种task预训练(最终的loss——the sum of the mean masked LM likelihood and mean next sentence prediction likelihood): 6 | 7 | + Task #1: Masked LM。以15%随机选取句子的中token,同时对于这15%的词——80%用[mask]代替,10%随机选取一个token,10%保留原始的token,对15%token进行细分的原因: 8 | + a mismatch between pre-training and fine- tuning, since the [MASK] token is never seen dur- ing fine-tuning 9 | + downside of using an MLM is that only 15% of tokens are predicted in each batch, which suggests that more pre-training steps may be required for the model to converge 10 | + Task #2: Next Sentence Prediction。由于很多下游任务需要衡量两个句子之间的关系,因此预训练时使用Next Sentence Prediction任务尽量描述这种关系。其中,不同句子部分使用不同的segment embeddings。 11 | 12 | 3. 实验结果: 13 | + [GLUE](https://gluebenchmark.com/leaderboard)。11项任务都摘得第一,BERT_large平均score比原最优高6.7; 14 | + [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/)。single模型在f1值上超越人类并超过所有其它学者提出的模型; 15 | + CoNLL-2003 NER。F1 92.6%(原最优),BERT_large为92.8%; 16 | + [SWAG](https://leaderboard.allenai.org/swag/submissions/public)。Dev:BERT_base 81.6%;Test:BERT_large 86.3%,openAI GPT 77.97%;human 88%。 17 | 18 | 4. paper对比了是否使用bidirectional transformer、Next Sentence Prediction等模块,详见5.1,同时发现(5.2),增大模型大小能明显提高各任务的准确率。 19 | 20 | #### comment: 21 | 1. 文章可谓NLP领域突破性进展,单一模型能达到如此惊人的效果,其主要得益于双向transformer,大语料特定方式的预训练。强烈推荐看more下reddit讨论部分,上面有原作一些回复; 22 | 2. 训练相同参数下的BERT_large在8 P100s下训练要一年?(详见reddit讨论,注:论文max len为512); 23 | 3. 针对SWAG数据,它的形式比较像对话系统,将该算法用于闲聊是不是可以有较大的提升(对于多轮编码,可以使用多个segment embeddings); 24 | 4. pretrain已证明了在各项任务中的优越性,但是针对具体的任务,预训练什么模型结构,用什么样的数据仍需考究。 25 | 26 | #### more: 27 | 1. [reddit讨论](https://www.reddit.com/r/MachineLearning/comments/9nfqxz/r_bert_pretraining_of_deep_bidirectional/) 28 | 2. [全面超越人类!Google称霸SQuAD,BERT横扫11大NLP测试](http://www.qianjia.com/html/2018-10/13_307585.html) 29 | 3. [code](https://github.com/google-research/bert) 30 | 4. [深入理解BERT Transformer ,不仅仅是注意力机制](https://www.jiqizhixin.com/articles/2019-03-19-13) 31 | 5. 
[Bert时代的创新:Bert在NLP各领域的应用进展](https://www.jiqizhixin.com/articles/2019-06-10-7) 32 | 33 | #### highlight: 34 | 1. We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. 35 | 2. Now you have two representations of each word, one left-to-right and one right-to-left, and you can concatenate them together for your downstream task. But intuitively, it would be much better if we could train a single model that was deeply bidirectional(from reddit讨论). 36 | 3. [SWAG (Situations With Adversarial Generations)](https://leaderboard.allenai.org/swag/submissions/public) is a dataset for studying grounded commonsense inference. It consists of 113k multiple choice questions about grounded situations: each question comes from a video caption, with four answer choices about what might happen next in the scene. The correct answer is the (real) video caption for the next event in the video; the three incorrect answers are adversarially generated and human verified, so as to fool machines but not humans. 37 | 38 | -------------------------------------------------------------------------------- /transformer/ERNIE Enhanced Representation through Knowledge Integration.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     针对BERT模型,ERNIE提出两点改进: 3 | + 1. 中文层面,BERT进行mask是基于字的。对于实体词中某个字进行mask的样本,模型很容易根据实体词中其他字预测被mask的部分,比如:阿里[MASK]巴,很容易推测[MASK]为“巴”,因此ERNIE提出对实体词进行mask。 4 | + 2. 增加训练语料来源:百科、新闻、论坛对话类数据。 5 | 6 | #### comment: 7 | + 1. 文章提出的idea比较简单,但是效果提升明显,针对中文数据,使用分词策略也能达到实体词mask的效果; 8 | + 2. 目前百度已经ERNIE模型通过蒸馏的方式用于搜索场景中,据说效果还不错。 9 | 10 | #### more: 11 | + 1. [ERNIE预览:百度 知识增强语义表示模型ERNIE](https://www.jianshu.com/p/fb66f444bb8c) 12 | + 2. [提速1000倍,百度飞桨发布基于ERNIE的语义理解开发套件](http://baijiahao.baidu.com/s?id=1649525069184443515) 13 | -------------------------------------------------------------------------------- /transformer/Improving Language Understanding by Generative Pre-Training.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper提出使用transformer预训练语言模型,再用训练好的参数针对特定任务进行有监督微调,要点记录如下: 3 | 1. 训练时,使用的是“Attention Is All You Need”中decoder:使用当前词来预测下一个词; 4 | 2. 在supervised fine-tuning时,loss不仅包含了specific task损失,还加入了语言模型的损失; 5 | 3. 通过实验发现,9/12任务效果有提升,部分任务效果提升较明显,eg:8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE)。 6 | 7 | #### comment: 8 |     在fine-tuning时加入语言模型的loss应该是想尽量保存语言模型的序列结构,在训练过程中逐步降低语言模型损失的权重会不会好点,因为最终我们是想在specific task中提高准确率,同时训练到一定程度后,非语言模型部分的参数也趋于稳定。 9 | 10 | #### more: 11 | 1. [Improving Language Understanding with Unsupervised Learning](https://blog.openai.com/language-unsupervised/) 12 | 2. [tensorflow code](https://github.com/openai/finetune-transformer-lm/) 13 | 14 | #### highlight: 15 | 1. We’d like to better understand why language model pre-training of transformers is effective. A hypothesis is that the underlying generative model learns to perform many of the tasks we evaluate on in order to improve its language modeling capability and that the more structured attentional memory of the transformer assists in transfer compared to LSTMs. 16 | 2. Overall, the trend suggests that larger datasets benefit from the auxiliary objective but smaller datasets do not. 
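#### sketch:
    A minimal numpy sketch of the fine-tuning objective summarized in the note above: task loss plus a weighted auxiliary language-model loss. The logits are random placeholders standing in for the shared transformer outputs, and the weight `lam` is an illustrative value.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """logits: (n, num_classes); targets: (n,) integer class ids."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

rng = np.random.default_rng(0)
seq_len, vocab, n_classes = 12, 100, 3

# placeholders for what the shared transformer decoder would output
lm_logits   = rng.normal(size=(seq_len, vocab))      # next-token prediction at each step
task_logits = rng.normal(size=(1, n_classes))        # classification head on the last state

lm_targets  = rng.integers(0, vocab, size=seq_len)   # input tokens shifted by one position
task_target = np.array([1])                          # the supervised label

lam = 0.5   # weight of the auxiliary LM loss (illustrative value)
loss = cross_entropy(task_logits, task_target) + lam * cross_entropy(lm_logits, lm_targets)
print(f"fine-tuning loss = {loss:.3f}")
```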
17 | -------------------------------------------------------------------------------- /transformer/RoBERTa_A Robustly Optimized BERT Pretraining Approach.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper针对bert指出,在预训练数据足够多、策略设置得好的情况下,模型效果也能极大的提高: 3 | 1. 数据量:BERT-large 13G -> RoBERTa 160G 4 | 2. 策略: 5 | + 增加预训练步数,batch size尽可能地大(考虑到显存限制,可使用"梯度累积后再更新"的策略) 6 | + 剔除next sentence prediction任务 7 | + 尽可能用长文本训练模型(在训练的前90%step,sample中文本长度为512) 8 | + 动态调整mask策略(不要在每个epoch中对同一个sample进行相同的mask策略) 9 | 10 | #### comment: 11 | 1. 论文首先指出BERT训练耗时长,同时初始化参数对模型影响很大,整个文章偏实验性质; 12 | 2. next sentence prediction是否保留还是应该结合下游的场景进行分析。 13 | 14 | #### more: 15 | 1. [BERT王者归来!Facebook推出RoBERTa碾压XLNet制霸三大排行榜](http://mini.eastday.com/mobile/190730180533658.html) 16 | 2. [pytorch code](https://github.com/pytorch/fairseq/tree/master/examples/roberta) 17 | -------------------------------------------------------------------------------- /transformer/Universal Sentence Encoder.md: -------------------------------------------------------------------------------- 1 | #### note: 2 |     paper指出,使用[SNLI](https://nlp.stanford.edu/projects/snli/snli_1.0.zip)数据预训练[transformer](https://github.com/xwzhong/papernote/blob/master/transformer/Attention%20is%20All%20You%20Need.md)参数,对下游句向量编码的任务效果有明显的提升,要点记录: 3 | 1. 使用SNLI数据预训练被证明有效:[Supervised Learning of Universal Sentence Representations from Natural Language Inference Data(2017-05)](https://github.com/xwzhong/papernote/blob/master/embedding/Supervised%20Learning%20of%20Universal%20Sentence%20Representations%20from%20Natural%20Language%20Inference%20Data.md), [Learning Semantic Textual Similarity from Conversations(2018-04)](https://github.com/xwzhong/papernote/blob/master/embedding/Learning%20Semantic%20Textual%20Similarity%20from%20Conversations.md); 4 | 2. 预训练sentence embedding比word embedding模型效果更好; 5 | 3. 文中提出的预训练方法在下游任务中仅需少量标注语料便能获得较好结果,后续增加有标签的数据对最终准确率增长贡献小; 6 | 4. 使用angular distance替代cosine衡量句子之间的相似度,更多解释可见[Cosine similarity vs angular distance](https://math.stackexchange.com/questions/2874940/cosine-similarity-vs-angular-distance); 7 | 5. [SST Bench](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark)句子相似度数据集(dev0.814/test0.782), [BERT_base(test0.858), BERT_large(test0.865)](https://github.com/xwzhong/papernote/blob/master/transformer/BERT:%20Pre-training%20of%20Deep%20Bidirectional%20Transformers%20for%20Language%20Understanding.md)。 8 | 9 | 10 | #### comment: 11 | 1. 文章更像是一篇实验性文章,印证transformer+预训练SNLI的有效性; 12 | 2. angular distance衡量相似度可考虑使用。 13 | 14 | #### more: 15 |     [Cosine similarity vs angular distance](https://math.stackexchange.com/questions/2874940/cosine-similarity-vs-angular-distance): The problem with the cosine is that when the angle between two vectors is small, the cosine of the angle is very close to 1 and you lose precision... 16 | --------------------------------------------------------------------------------
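#### sketch (Universal Sentence Encoder, angular distance):
    A minimal numpy sketch contrasting cosine similarity with one common form of angular similarity, 1 − arccos(cos)/π, to illustrate the precision point quoted in the Universal Sentence Encoder note above; the toy 512-dimensional vectors and perturbation sizes are illustrative.

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def angular_sim(u, v):
    """Angular similarity: 1 - arccos(cos)/pi, monotone in the angle itself."""
    cos = np.clip(cosine_sim(u, v), -1.0, 1.0)       # guard against rounding past +/-1
    return 1.0 - np.arccos(cos) / np.pi

rng = np.random.default_rng(0)
u = rng.normal(size=512)                             # a toy 512-d sentence vector
v_close  = u + 0.010 * rng.normal(size=512)          # small perturbation
v_closer = u + 0.001 * rng.normal(size=512)          # even smaller perturbation

# cosine crowds both pairs up against 1.0, while the angular transform
# keeps a clearer gap between "close" and "even closer"
print(f"cosine : {cosine_sim(u, v_close):.6f}  {cosine_sim(u, v_closer):.6f}")
print(f"angular: {angular_sim(u, v_close):.6f}  {angular_sim(u, v_closer):.6f}")
```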