# Reinforcement-Learning-For-Dialogue-Systems
Reinforcement learning for dialogue systems: a summary of papers and open-source applications.


## Papers
1. End-to-End Task-Completion Neural Dialogue Systems
https://arxiv.org/pdf/1703.01008

2. A User Simulator for Task-Completion Dialogues (2016)
https://arxiv.org/pdf/1612.05688

3. Adversarial Advantage Actor-Critic Model for Task-Completion Dialogue Policy Learning
ICASSP 2018
https://arxiv.org/pdf/1710.11277

4. Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning
https://arxiv.org/pdf/1704.03084

5. Subgoal Discovery for Hierarchical Dialogue Policy Learning
EMNLP 2018
https://arxiv.org/pdf/1804.07855

6. Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning
ACL 2018
https://arxiv.org/pdf/1801.06176

7. Discriminative Deep Dyna-Q: Robust Planning for Dialogue Policy Learning
EMNLP 2018
https://arxiv.org/pdf/1808.09442

8. Switch-based Active Deep Dyna-Q: Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning
AAAI 2019
https://arxiv.org/pdf/1811.07550

9. Budgeted Policy Learning for Task-Oriented Dialogue Systems
ACL 2019
https://arxiv.org/pdf/1906.00499

10. Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog
EMNLP 2019

11. Su P H, Budzianowski P, Ultes S, et al. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. arXiv preprint arXiv:1707.00130, 2017.

12. Weisz G, Budzianowski P, Su P H, et al. Sample efficient deep reinforcement learning for dialogue systems with large action spaces.

13. He J, Chen J, He X, et al. Deep reinforcement learning with a natural language action space. arXiv preprint arXiv:1511.04636, 2015.

14. Casanueva I, Budzianowski P, Su P H, et al. Feudal reinforcement learning for dialogue management in large domains. arXiv preprint arXiv:1803.03232, 2018.

15. Abel D, Salvatier J, Stuhlmüller A, et al. Agent-agnostic human-in-the-loop reinforcement learning. arXiv preprint arXiv:1701.04079, 2017.

16. Ross S, Gordon G, Bagnell D. A reduction of imitation learning and structured prediction to no-regret online learning. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011: 627-635.

17. Chen L, Zhou X, Chang C, et al. Agent-aware dropout DQN for safe and efficient on-line dialogue policy learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017: 2454-2464.

18. Dialogue Environments are Different from Games: Investigating Variants of Deep Q-Networks for Dialogue Policy


## Open-source implementations
### Microsoft's open-source end-to-end dialogue system framework ConvLab: https://github.com/ConvLab/ConvLab
### RL algorithms used for dialogue policy in ConvLab:

- DQN: 2013 - Playing Atari with Deep Reinforcement Learning
- REINFORCE: 1992 - Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning. (A minimal sketch follows this list.)
- PPO: 2017 - Proximal Policy Optimization Algorithms. https://arxiv.org/pdf/1707.06347.pdf
- PPO's self-imitation variant: 2018 - Self-Imitation Learning. https://arxiv.org/abs/1806.05635
- HRL: 2017 - Composite Task-Completion Dialogue Policy Learning via Hierarchical Deep Reinforcement Learning
- A2C (on-policy): Asynchronous Methods for Deep Reinforcement Learning
- A2C with an extra SIL loss term: Self-Imitation Learning. https://arxiv.org/abs/1806.05635
- SARSA
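The actual policy implementations live in the ConvLab repository above. As a rough illustration of the simplest algorithm in the list, here is a minimal REINFORCE update for a flat dialogue-act policy in PyTorch. This is a sketch, not ConvLab's code: the state featurization, action inventory, network shape, and reward signal are all hypothetical assumptions.

```python
# Minimal REINFORCE sketch for a dialogue policy (hypothetical, not ConvLab's API).
# Assumption: the dialogue state is a fixed-size feature vector and actions are
# indices into a flat list of system dialogue acts.
import torch
import torch.nn as nn


class DialoguePolicy(nn.Module):
    def __init__(self, state_dim: int, num_acts: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_acts),
        )

    def forward(self, state):
        # Returns unnormalized logits over the dialogue-act inventory.
        return self.net(state)


def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """One policy-gradient step from a single finished dialogue.

    episode: list of (state, action, reward) tuples, where state is a
    1-D float tensor and action is an int index.
    """
    # Compute the discounted return G_t for every turn, back to front.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.tensor(returns)
    if returns.numel() > 1:  # normalize returns as a cheap variance reduction
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # REINFORCE loss: maximize E[G_t * log pi(a_t | s_t)].
    loss = 0.0
    for (state, action, _), g in zip(episode, returns):
        log_probs = torch.log_softmax(policy(state), dim=-1)
        loss = loss - log_probs[action] * g
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Usage would look like `policy = DialoguePolicy(state_dim=300, num_acts=40)` with `torch.optim.Adam(policy.parameters(), lr=1e-3)` and, say, a terminal reward of +1 for task success; those numbers are placeholders. Normalizing returns per dialogue is the crudest variance-reduction trick; a learned baseline or critic (as in the A2C variants above) is the more common choice in practice.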
### RL algorithms used for dialogue policy in Tsinghua's dialogue system toolkit tatk (https://github.com/thu-coai/tatk):

- Policy Gradient: Simple statistical gradient-following algorithms for connectionist reinforcement learning
- PPO: Proximal Policy Optimization Algorithms (a sketch of the clipped objective follows this list)
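Both ConvLab and tatk include PPO, whose core is the clipped surrogate objective from "Proximal Policy Optimization Algorithms". Below is a minimal sketch of just that loss term, assuming the advantage estimates and old-policy log-probabilities have already been computed elsewhere; it is an illustration, not either toolkit's actual code.

```python
# Clipped surrogate loss from PPO (a sketch under the assumptions stated above).
import torch


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """All arguments are 1-D tensors over a batch of (state, action) pairs:
    log pi_new(a|s), log pi_old(a|s), and advantage estimates A(s, a)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (element-wise minimum) bound, then negate so that
    # minimizing this loss maximizes the clipped surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```

The clipping keeps each update close to the data-collecting policy, which is why PPO tolerates several epochs of minibatch updates on the same batch of dialogues, unlike plain policy gradient.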