├── ChatGPT与多模态必读论文100篇(2.27日起,每周更新) ├── ChatGPT技术原理解析:从RL之PPO算法、RLHF到GPT4、instructGPT.md ├── README.md ├── Transformer通俗笔记:从Word2Vec、Seq2Seq逐步理解到GPT、BERT.md ├── assets └── images │ ├── RL_simple_primer │ ├── 046b440033c8296f2f9b0f1bf5c3190e.png │ ├── 0bb3ab43b467ce1071d28a89537abc9c.png │ ├── 0dbb490df24a8ddc53d81df6b09c9c76.png │ ├── 1798baf5dba54e21a19508e82d407a8a.png │ ├── 1ecb75833281415497f94e0cbe0279bd.png │ ├── 2c3b63b5cb414b438df7a89e14dae8a4.png │ ├── 2edcff5b6e883269f1dd93f829dbb223.gif │ ├── 445315a773e54b1ba101fee85904fcee.png │ ├── 47088532ca0ef7bde5bae8849728ac8f.png │ ├── 5bbf0a4065be4b5584277e28502f7a7a.png │ ├── 61c34bd9692048c1858665730c1cad15.jpeg │ ├── 6ca0c2b4b623c690c60a749dc1b21291.png │ ├── 756685e2f07b494b99bc97f4ce0f4bf9.png │ ├── 7d5efb19d49a44cbbdf7bf50da94712d.png │ ├── 84f59c98f4b14321acb72db7b8459cee.png │ ├── 9726cb99f9af48bc8ac545ace05804e7.png │ ├── 9dcf9cefeb844a5a93b2b6fd38bf5a80.png │ ├── ab200b2773a547bfa48e639956c52ca0.jpeg │ ├── bad281ae107d49d3adf8aa2d012a08c1.png │ ├── bc1f965ef46e4ea693bb1950fd76d7e8.png │ ├── c3f775519533445db1120f7dc79d4ba1.png │ ├── d3366d92a8ba432b9480a55c7bbc7895.png │ ├── d7fff6ebbf9a42d3ba9c04e1c924d476.png │ ├── dcf2240f8a56451089a314ffe0c6fc08.png │ ├── e402747c8e5780a3631e27ea30232e74.gif │ └── e8fc39f9f7f291dc8b121858a201545b.png │ ├── chatpt_principle │ ├── 029ac3d1134a474abd060e48d838a049.png │ ├── 05b64b744bc74d828e0394a95ce4e487.png │ ├── 05d74bcd64944f7c87902bd59f12ea1d.png │ ├── 1133d2956f234040a30fa4278e2734d0.png │ ├── 1831052632dbed6050771e49dd341516.png │ ├── 1b37ed859770ba388d86273e3c7c6517.png │ ├── 1ecb75833281415497f94e0cbe0279bd.png │ ├── 23e1b2939c3a41a29f99971d5427e1ce.png │ ├── 2805e9b3b99e4fe99f27881c9c188cb7.png │ ├── 2c3b63b5cb414b438df7a89e14dae8a4.png │ ├── 300d2c30d66a2fc2c9a96a2535790a19.png │ ├── 3842d5cfc696477cac1cf9eb5136b4c1.png │ ├── 3bec8b5062e6439f993962b621da0d3e.png │ ├── 452ba38d4bf44c7aafc14e44933e2239.png │ ├── 4ea27677bac93469d6143a5161d5b037.png │ ├── 4f09155231abbc5ede08a1354f25c5a9.png │ ├── 5742da727fbb4fb5a4274b75f9af6c0c.png │ ├── 61a6cc2a71dd2e3b126ff058cd5d045e.png │ ├── 6505c4bd9e3b4d2d94df645f147597c5.png │ ├── 6ce3da16ad6b41c998ee25b8aca3fa75.png │ ├── 6f3b99c974224ad48bc9f5ebbe6929e0.png │ ├── 725d62dd8d0f2997cc2329d5a50977bc.png │ ├── 725f34d9c6654c8983dcda50148b1f02.png │ ├── 72a8c7c487864a4e9444036411dc93c3.png │ ├── 74ab07c2395c4bc08c5bab772095ee99.png │ ├── 78055db0e39e623f2e2b7b4efa4b3593.png │ ├── 79bc702fdf3542a8a9e5b8b89d7a9986.png │ ├── 7b9d2eca7b3548508e9cfa98acc8e371.png │ ├── 8212f02b257a47c98709297e1e812040.png │ ├── 8df866b5d7d4486eb8e5ab9f5ad56191.png │ ├── 93f758832a8345788e79da7935a344ba.png │ ├── 974beb9bee394f0794c56d52de02d25e.png │ ├── 97b7e66fd24947e59fbf90b1b9a57706.png │ ├── a7ff56efd16a42999498d25db1751f1a.png │ ├── a9e3250b63ed40a38ee96d14eaf7635f.png │ ├── b070502a47924402bf65968b53a3f726.png │ ├── b74c354d63fc4bb1bd8fa73b87739185.png │ ├── bd6e7b452a10424eab899855dd4eec9a.png │ ├── c16ccb65bb19b11425eeb4b29e5ccd66.png │ ├── c3f775519533445db1120f7dc79d4ba1.png │ ├── c93a2597c53240dabdafb2b99f12051d.png │ ├── c9e1b0266bc14700a3f9c252126a7df0.png │ ├── d031d2873feecc4c67cf7d45ece4386e.png │ ├── dc30db988ae045d993f1584713592e75.png │ ├── dcf2240f8a56451089a314ffe0c6fc08.png │ ├── e16c9c6245ec496bb23e2f2d3862019f.png │ ├── e94393305b0b40659583bc3f75f44d25.png │ ├── f16b9ac94a7c474891c1b73246afd85f.png │ ├── f9a13aaa56734561abc7b630bec89028.png │ └── f9a531cf142446c9bb594d14bd9d9df0.png │ └── transformer │ ├── 
00219aefb2e4074942c05a232def806e_MD5.png │ ├── 014f4a7b562d69685de1bad63acdae2e_MD5.png │ ├── 03477feb6607035c09e0647ed51c0717_MD5.jpg │ ├── 03835827fb44e543e70cde4c24f4958b_MD5.png │ ├── 0408a0f35170960be731d6dabe8c6121.jpeg │ ├── 05b64b744bc74d828e0394a95ce4e487.png │ ├── 05d8982149148c52d9ec28b8a6700dd3.png │ ├── 066744f17a322d41029315a892b10800.png │ ├── 0813eac0ee0498fc409d021262d0093d_MD5.png │ ├── 081711acdffbd86d5dd901ab11f12e45_MD5.png │ ├── 0a9151721dee403cbe55614f36f6fe02.png │ ├── 0af46ee5f1d80f7899e09bc50bdf9d55.png │ ├── 0b63042d5c388feeeae9255dd4d204b1.png │ ├── 0bd8253f3963faa8d8ac4d90b3beb00b.png │ ├── 0d9a5a90d7aae9d3d4c1c827a6b7e968.png │ ├── 156f5f47cf96ef93621d09dbc8d7f173.png │ ├── 15a9e270ce2848b483324eac1e1ac7b9.png │ ├── 15dfeaa4db05178319010b7d4c86d0dc.jpeg │ ├── 15eac347175fa9bb817d973431a52b44_MD5.png │ ├── 17c9ff5d3d01c44fa3fed1bdf9e93ebc_MD5.jpg │ ├── 19a3e9fed1959e276be2f9ee902c1bb0.jpeg │ ├── 1a52c76159f239b0787ecefa4bcd990b.jpeg │ ├── 1bed15c3749c8760dc9975361148523a.png │ ├── 1ca2f383a4a6d73fe29aaff772304f62_MD5.png │ ├── 1d91a686c492378519bff3a12d384db5.png │ ├── 1dc008cab4294402b44c07329e87d816.png │ ├── 20191023230922448.png │ ├── 20191024181221371.jpg │ ├── 20191024181316166.png │ ├── 2019122823361086.png │ ├── 21772349724bbb0e999a384fc8b92f83_MD5.png │ ├── 22d2ddf0859248e4b87f30a0675e57f8.png │ ├── 24cbf60fdb182ca8df5e22063a31479e_MD5.png │ ├── 2526a376e956527a86a57e7f99ae0c9a.png │ ├── 271317ff7a4defa3ea93b4794e11e409_MD5.png │ ├── 28bdbd7107b9a884eb7b0623872ef9eb_MD5.png │ ├── 28f360e43e6a8ce32d68f8a7cd7a45b9_MD5.png │ ├── 2909d40fc68f1e231ed69366a8a7ecba_MD5.jpg │ ├── 290d55d2215c3b3976ad9c8c1d16da7e.png │ ├── 29508eef85e802223ef53bd31cbb74c2.png │ ├── 2a73281367bf7e0ef3782e14c719f488.png │ ├── 2acce61223fd0d086b8d489b18200e54.png │ ├── 2b00497744edfb93461dcbae45e2a75b.png │ ├── 2b2ea7b8a48be07468c7729fcea4f602_MD5.png │ ├── 2cce32e676a02e898395a940b7f62af5.png │ ├── 2faf7e7bb7284f04b912814b4c23d473.png │ ├── 309f5158c52f4362ccd9b93a673093da.png │ ├── 30bfc0d6863fb24ae629fe8deb16b10a.png │ ├── 31bfa57cc9c715958a059beb99a24887.png │ ├── 3361f4f6b1c1740ddeae3e3ba81eeea5_MD5.png │ ├── 34b76d54bf8f141493febb05dcbf0d20_MD5.jpg │ ├── 36b3e4458dc348608c50afa567807feb.png │ ├── 3851c26c0777e066820492052caf3344.png │ ├── 3919032d75e33cf2d945e264d1cbef6d_MD5.png │ ├── 3974360c263c0d2fe20f0d02d834b69d_MD5.png │ ├── 3a05fa9e0ba85561984fd4b927aacd78_MD5.png │ ├── 3a9d8e9d4940ac8e5eb9985fb2257837_MD5.png │ ├── 3b4ae9d3e22c831b1a5a07855e81bfdd_MD5.jpg │ ├── 3b516347112f14b5c0fde5aa215ba193.png │ ├── 3e46433783c7ebfad45bc9898c723515.png │ ├── 3ea2716e45104ac8ddd5cee6208fdd1b.png │ ├── 3f30702e3b3e8a45025c0a8f41ed21cb_MD5.png │ ├── 4059740ccdd04f6648aa5c9c5ac89ef2_MD5.jpg │ ├── 40bfd7d3ee8fab25f70ad4e47e6b7a67_MD5.jpg │ ├── 42a81fda78cf5263c94c4b53d8c30c25_MD5.png │ ├── 445cf49afb604fe68a75cdf085cfc2f1.png │ ├── 45add25ad0472ac9fddd46261ffcb2b7_MD5.jpg │ ├── 46798ec25cc22fbfda16551feaa0c64c_MD5.png │ ├── 47310365fdd2442e464a9d9362eb4df1.png │ ├── 477df3becf6872bf2ff1d040839b75e7.png │ ├── 4b973001a63c78ff49b7b9b88f67226f_MD5.jpg │ ├── 4c745b794769a28651312412ef1286ef.png │ ├── 4da285ff84827d16dc064548120b2fb2_MD5.jpg │ ├── 4db158cf962f1262b5d9b53e63e7be81_MD5.jpg │ ├── 4ee1ec24925721d45cbf191f26b3b14f_MD5.png │ ├── 508fe2f946b84eaa41a0ff6607c02304_MD5.png │ ├── 50aebf29216b34ee48f4432404748ea0.png │ ├── 50ea89dc0a3e013fcd7a1fd1eb382f7d.png │ ├── 51b6f11a315c19a269a7b103b377e368_MD5.png │ ├── 521be58bc3ebe619b1ef35ab1ea550a7_MD5.png │ ├── 
53bf9c9d6f04d7d6e8abc60795baab3b.png │ ├── 54d783622495456a4ebe4f78e0954228_MD5.png │ ├── 5836e4fade679f7fdcdc37360768fd5c_MD5.png │ ├── 5a872ef655bc85126ea29c230052991f.png │ ├── 5ac1306e01b1b032ce640aac979e0927_MD5.png │ ├── 5d5de25d8b30e3e371204f01afa026fb.jpeg │ ├── 5e7d7ab1b09c5ce83f03c84a15d1ead8_MD5.jpg │ ├── 5f1954870d53077ba0ea0483301816a9_MD5.png │ ├── 60312199b92e2955a3ffb2f8a1217aec_MD5.jpg │ ├── 655085e6ce07069ae50868b0b18bdd42.png │ ├── 664d9bbbe5a7bfe9c1871d57a21434bc_MD5.png │ ├── 66cb19b99a76b56f7c3d1a978e456351_MD5.jpg │ ├── 66e7f8d36c71bb58891c9ec63122908c.png │ ├── 687fa78458782791c69f537dd780d0c3_MD5.jpg │ ├── 68ea4a796f4f321bba91094eb9596701_MD5.png │ ├── 690c0782107b43044ce8584da9dc943f_MD5.png │ ├── 695f99dca949276ba8466607f46742a0_MD5.jpg │ ├── 6a11e3084d834ba585f8f1ca51fd7856.png │ ├── 6f499fde032339f85b299d8763c555aa_MD5.gif │ ├── 6fc06ee880115438b233d12939755797.jpeg │ ├── 71b156192cbbc0427bab79e11815a346_MD5.png │ ├── 71da3125f632d5325d04e0413c81f7a6.png │ ├── 71fe84a2a89d45699da903bf0462f321.png │ ├── 74368756e892a696fbad9171bc37f5c9.png │ ├── 74843c907a1ea3ff862ff898dc22fce5.jpeg │ ├── 756c7616dca078c5a826d5d89e55c5a2_MD5.png │ ├── 77c0c9a43d2a8f2d5870056e8c5d8377_MD5.png │ ├── 77e86008772b231c277afdae86bac801.png │ ├── 78df8537c4b8e681e472668a776f4260.png │ ├── 7c4e16d736617d56d7f0dcb7e2e559db_MD5.png │ ├── 7ccf738d91802c1ad8a7fbf5530dcc39_MD5.png │ ├── 7d8d81a072c77c7fd8bf5839180cec29_MD5.jpg │ ├── 7e9c729734c7ab295cf2797fb6cb85dc.jpeg │ ├── 7eb52d3adc295d26bba1a624c68ecbce_MD5.png │ ├── 7fc3a202f4d625dd2c204d268f444573_MD5.jpg │ ├── 801922b26169dd269be29abc282b1ce8.png │ ├── 85065d3ce99f9487e4b742c4cd25888f_MD5.png │ ├── 85b45b16351cf6a04b8d59bcca2384ed.jpeg │ ├── 89e6b33107a946d5aec33eaa102e57ba.png │ ├── 8a9df8046657ebba0550aa3b2f073dbe_MD5.jpg │ ├── 8ab23e56b7f40d7eea51e82049b02c5f_MD5.png │ ├── 8c4b1b6c2e23e4c3e10be29c14009ac8.png │ ├── 8cc143c0a77c4c44b354360d181d2f39.png │ ├── 8ce7989d10e8a73e30de1b83c79c7c41.png │ ├── 8ef00669ec2c7b35c995c09a29a3ded6_MD5.png │ ├── 8f75803b17a4bb9147582fec634a70a1_MD5.png │ ├── 906501ef6328bf7f3ecc0f02cc57305d.png │ ├── 921a5287939700d160fbb2a27f33f30f.jpeg │ ├── 95a2ef7d1fadd567d04b368ef22cefd7_MD5.png │ ├── 95d0fee620aaa7ae29d675c7f9827cfe_MD5.png │ ├── 98dab2254c03b650a248195525a3ec89_MD5.png │ ├── 9c58346d0fceebd90ada8f1021177e68.png │ ├── 9cc6a1fec7f70374f0aaffceace27a3b_MD5.png │ ├── 9d3070c4862b419e382b8b89ebeb0f04_MD5.png │ ├── 9eaee194b8192e37e24c8563eba9da9a_MD5.png │ ├── 9f07d90b3a5e0d7c42f3f0b4aace80fd.png │ ├── 9f1a69a7c86edca423a70f106b1520bd.jpeg │ ├── 9f4d835b03cf34d9f303df8f2009387c_MD5.jpg │ ├── 9ff0f2f885122ef541024b679cdf7c5c_MD5.jpg │ ├── a1b6a676c59d4cd4babaf46829595381.jpeg │ ├── a550ff894ef7c4dd5fb72473e3fca0d5_MD5.jpg │ ├── ab77150713e2ba2c019e37f7acf70ca0.png │ ├── ad73f95847f33a27bdef611c1a07f484_MD5.gif │ ├── adb96ea87e11431caff3b1363a00601a.png │ ├── afee7380b23f96110e9593354f6c1c99_MD5.jpg │ ├── b0a45c6a0dde463aa6e360ba40344206.png │ ├── b18ad5eac41c5fd8e31ef5025b614d3f.jpeg │ ├── b249dd91bcd1580bc8b360fd56ab816c_MD5.png │ ├── b2c172d8f1f52ef1ee490a99f05c9e44_MD5.png │ ├── b2f9a9dd6628485288965dcf05f36101.png │ ├── b31bf9c40283208ab733992dcd6bebba_MD5.png │ ├── b3a33f23896b4a30c96c2383bf836f2c_MD5.jpg │ ├── b49d1598a6664e20a918f525d09e2693.png │ ├── b54a295fecf8687cc24421d893f4dab6_MD5.png │ ├── b63a62ad53b699a20454fa02bfcfa789_MD5.jpg │ ├── b675304324148d1ddf10e68dc33b8153_MD5.png │ ├── b7ab108e1e948a7a5e9d771f2e366d6c.png │ ├── ba78d968411707c3032b77ca7359d2ae.png │ ├── 
bbe11a2bae24b43f3d943b5e9a96fd43_MD5.png │ ├── c16874e0304388bf5ea54cc204b4ef85_MD5.png │ ├── c2f2e3bbbbd2bfb024f60d94d7252046.png │ ├── c4d818d40f353c792ab0f19e626ad88e.png │ ├── c56f16d937169ee283b7dbdc7785700a_MD5.jpg │ ├── c575fe306df6fe8dd9475d84c2b1785c_MD5.png │ ├── c662ea4c1137ac8be4b1eaa11c6e3f06.png │ ├── c6a6cb296455b36bf3416d3e91a6e3c9.png │ ├── c757183fe507acde590ad1e7d1d73b61.png │ ├── c7ef68d441014d869b3744878549e0b0.png │ ├── c8fef938155d460db008d19950238bfe.png │ ├── ca3fe94491cef6b86588816106c17a85_MD5.png │ ├── cd4d8c6626380ae119976c75726e7357.png │ ├── cdce655d5f7216fc787b1793ce51f146_MD5.jpg │ ├── cf4ba05e16d64b10b197514bcd1f9b44.png │ ├── d31e0e522cb77b9aafe3cb6196376373_MD5.png │ ├── d369d818c140892e9b6bca1eeab4ee39.png │ ├── d5eea65b2dc61b87dd22623363321ce6_MD5.png │ ├── d6431465c1e7bec6d952041358c9edf8_MD5.png │ ├── d66b05cf01ce1c27a2e87793b78fe2ae_MD5.jpg │ ├── d6cd76d803bb8ad54e57aa67d01ed808_MD5.png │ ├── d8d8d1bc376fe0db90599c99ecefad8e_MD5.png │ ├── d9bae5ae137038398e8ac3118191d6e8_MD5.png │ ├── d9c6037ac0cea0dad506b75496c6edfa_MD5.png │ ├── d9f1735f332c29db399e072817da3144_MD5.jpg │ ├── da1a8850c5c75f116349b24aa3a62c1d.png │ ├── dadb2abaf3a8cfb78fab6d4ad04d1b6f.png │ ├── db937132cff0d9444788b4a43d0f0fb9_MD5.png │ ├── de565ecf43e0bda278ce99eebd875322.png │ ├── deaa8c2b8634b11e28524700ce181ce7_MD5.png │ ├── e2b74978dcefbe4e6c14f56d9ee43f95_MD5.png │ ├── e2f3ed201af4091433fa0c2e552c1888_MD5.jpg │ ├── e5e7f107f3f0281d2892fde09ce474ef_MD5.jpg │ ├── e68a76a5b17995b45d6efdea27f29c51.png │ ├── e69279b508c770fac37ee6c24d129331.png │ ├── e6a5088c3ef6310a35b9b4ff41e285eb.png │ ├── e6c505de964ca005b5c72021a7082ced_MD5.png │ ├── e8d2029b3f55cee5adf8946a3e665e7c_MD5.png │ ├── e8de3f53932a45a3e07add8602220d08_MD5.png │ ├── e8e501ad4397f9e68814d2b156d7f703_MD5.jpg │ ├── e8f36c8c3a2afbaf84273c977a6c16b5.png │ ├── e9e771e25442e2dfc1470b2cc422ed2e_MD5.png │ ├── eb96bb41bd547f26bcaeaa80dd9c572d.png │ ├── ebd475e785b842bda203b694566b9f93.png │ ├── ec83795b5d76dba74f63b03e5909b402_MD5.png │ ├── edf65186f1740032d70a9665a14dc5f9_MD5.png │ ├── f061d1fda72a7e14523875d504342e38_MD5.png │ ├── f1bdc252f47440bd467484cf92ecfa31_MD5.png │ ├── f22cb1de22144ad6806b83acb3fb45a4.png │ ├── f3eb7c90b409dbbe7f6a4ebbbbea417c_MD5.jpg │ ├── f5c9bff7f0a59caa72a96d5c8eb80b20.png │ ├── f6650dd74f4288d047e2fdad8aab4803.png │ ├── f87852261bb2c5ef347926e55428615b_MD5.jpg │ ├── fa1e392270b62d2ca973c41e850e71de_MD5.png │ ├── fa3b06a67625a681ba4e127a3be10cc0.png │ ├── fadc2266e03a0d37020d56c897df0a34_MD5.jpg │ ├── fb7b40a52ecf867d81671a218af0edb5_MD5.jpg │ ├── fbacc026424dbdc50c7ad87b36f68ad9.png │ ├── fc657ecca0a6d942ea9bfa5a5c828edb.png │ ├── fd6a975a1cf279e5f4850a2e66918192_MD5.png │ └── feaa01981ce3ab2f50012dcfa86ff40a.png ├── 强化学习极简入门上:通俗理解MDP、DP MC TC和Q学习、策略梯度、PPO.md └── 强化学习极简入门下:通俗理解MDP、DP MC TC和Q学习、策略梯度、PPO.md /ChatGPT与多模态必读论文100篇(2.27日起,每周更新): -------------------------------------------------------------------------------- 1 | 2 | 3 | ``` python 4 | 2023,大模型与ChatGPT火得如日中天; 5 | 想要较好的了解这块技术,最好的方式就是读论文; 6 | 而,读什么论文呢?如果选择只读100篇论文,这100篇是什么? 
7 | 基于此,本文以《ChatGPT与多模态必读论文100篇》推荐笔者及July老师认为最该读的100篇 8 | ``` 9 | 10 | 11 | ## 第一部分 OpenAI/Google的基础语言大模型 12 | **【第001篇】GPT,从最原始GPT开始了解** 13 | [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf "Improving Language Understanding by Generative Pre-Training") 14 | 15 | **【第002篇】GPT2,从GPT进化到GPT2** 16 | [Language Models are Unsupervised Multitask Learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf "Language Models are Unsupervised Multitask Learners") 17 | 18 | **【第003篇】GPT3原始论文,再次进化到GPT3** 19 | [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165 "Language Models are Few-Shot Learners") 20 | 21 | **【第004篇】让GPT3再次进化,InstructGPT原始论文** 22 | [Training language models to follow instructions with human feedback](https://arxiv.org/pdf/2203.02155 "Training language models to follow instructions with human feedback") 23 | 24 | **【第005篇】T5模型,19年10月Google发布** 25 | [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://www.jmlr.org/papers/volume21/20-074/20-074 "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer") 26 | 虽也基于transformer,但区别于BERT的编码器架构与GPT的解码器架构,T5是transformer的encoder-decoder架构 27 | 用的750G的训练数据,其训练方法则为:BERT-style的MASK法/replace span(小段替换)/Drop法,以及类似BERT对文本的15%做破坏、且replace span时对3的小段破坏 28 | 29 | **【第006篇】LaMDA,论文发布于22年1月,显示LaMDA的参数高达137B,用的transformer decoder架构** 30 |  [Language Models for Dialog Applications](https://arxiv.org/pdf/2201.08239 "Language Models for Dialog Applications") 31 | 21年5月,Google对外宣布内部正在研发对话模型LaMDA,基于transformer decoder架构,在微调阶段 使用58K的对话数据,过程类似真人的对话过程,给定一个Query,比如 How old is Rafael Nadal? 
,如果人知道答案,那么直接回答35岁即可,如果不知道,则需要去 Research 一下,借助搜索引擎找到答案,然后再回答35岁 32 | 33 | **【第007篇】FLAN大模型,21年9月Google提出,其基于LaMDA-PT做Instruction Fine-Tuning** 34 | [Finetuned Language Models Are Zero-Shot Learners](https://arxiv.org/pdf/2109.01652 "Finetuned Language Models Are Zero-Shot Learners") 35 | FLAN is the instruction-tuned version of LaMDA-PT 36 | 37 | **【第008篇】PaLM模型,22年4月Google发布,基于Transformer decoder架构** 38 | [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/pdf/2204.02311 "PaLM: Scaling Language Modeling with Pathways") 39 | 22年3月,Google的Barham等人发布了Pathways系统,用于更高效地训练大型模型 40 | Pathways 的愿景 —— 一个很接近人脑的框架:一个模型,可以做多任务,多模态 41 | 且在做任务时,只是 sparsely activated,只使用一部分的参数 42 | 参数规模最大的版本达到惊人的5400亿参数(8B 62B 540B),使用multi-query注意力、SwiGLU激活函数以及RoPE位置嵌入 43 | 且在每个Transformer块中使用 "平行 "表述(Wang & Komatsuzaki,2021) 44 | 是Google的Pathways架构或OpenAI GPT2/3提出的小样本学习的进一步扩展 45 | 46 | PaLM首次展示了Pathways的大规模使用——能够以高效的方式在数千或数万个加速器芯片上训练一个模型 47 | 具体来说,通过Pathways,PaLM 540B在两个通过数据中心网络连接的TPU v4 Pod上训练,使用模型和数据并行的组合,在每个Pod中使用3072个TPU v4芯片,连接到768台主机,能够有效地将训练扩展到6144个芯片,而不需要使用任何pipeline并行,其效率水平是以前这种规模的模型所不能达到的 48 | 49 | 以前的大多数大型语言模型 50 | 要么是在单个TPU系统上训练的(比如GLaM by Du等人2021年,LaMDA by Thopilan等人) 51 | 要么是使用由Huang等人在2019年提出的pipeline并行,从而在GPU集群(Megatron-Turing NLG 530B by Smith等人2022年),或多个TPU v3 pod(Gopher by Rae等人2021年)上扩展,最大规模为4096个TPU v3芯片 52 | 53 | 另,在自然语言、代码和数学推理等任务中表现的都很不错 54 | 此外,预训练数据集由一个7800亿个token组成的语料库,该数据集是由过滤过的网页(占比27%)、书籍(占比13%)、Wikipedia(占比4%)、新闻文章(占比1%)、Github源代码(占比5%,包括Java、HTML、Javascript、Python、PHP、C#、XML、C++和C,总计196GB的源代码),和社交媒体对话(占比50%)组成的,这个数据集是也用于训练LaMDA和GLaM 55 | 56 | **【第009篇】RLAIF** 57 | [Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/pdf/2212.08073 "Constitutional AI: Harmlessness from AI Feedback") 58 | OpenAI之前一副总裁离职搞了个ChatGPT的竞品,ChatGPT用人类偏好训练RM再RL(即RLHF),Claude则基于AI偏好模型训练RM再RL(即RLAIF) 59 | 60 | **【第010篇】Sparrow,DeepMind的Sparrow,发表时间稍晚于instructGPT** 61 | [Improving alignment of dialogue agents via targeted human judgements](https://arxiv.org/pdf/2209.14375 "Improving alignment of dialogue agents via targeted human judgements") 62 | 其大致的技术思路和框架与 instructGPT 的三阶段基本类似,但Sparrow 中把奖励模型分为两个不同 RM 的思路 63 | 64 | **【第011篇】GPT4,当前王牌,支持多模态** 65 |  [GPT-4 Technical Report](https://arxiv.org/pdf/2303.08774 "GPT-4 Technical Report") 66 | 增加了多模态能力的GPT4的技术报告 67 | 68 | ## 第二部分 大语言模型的关键技术 69 | **【第012篇】Transformer原始论文** 70 | [Attention Is All You Need](https://arxiv.org/pdf/1706.03762 "Attention Is All You Need") 71 | 72 | ICL原始论文 73 | 74 | Evaluating Large Language Models Trained on Code,Codex原始论文 75 | 预测当前序列的最后一个词时 可以选取概率最大的词(softmax最高的值),但没法全局最优且不具备多样性,当然 可以使用束搜索 一次性获取多个解 76 | 论文中用的是核采样,预测的各个词根据概率从大到小排序,选取前些个概率加起来为95%的词 77 | 78 | CoT原始论文:Chain-of-Thought Prompting Elicits Reasoning in Large Language Models 79 | 28 Jan 2022 · Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou 80 | 也从侧面印证,instructGPT从22年1月份之前 就开始迭代了 81 | 82 | RLHF原始论文 83 | Proximal Policy Optimization Algorithms 84 | 2017年,OpenAI发布的PPO原始论文,在理解过程中有时会问下GPT4,感叹GPT4的细节能力 虽经常不是很严谨 但细节能力是真6 85 | 86 | 另,这是TRPO论文 87 | Large Language Models are Zero-Shot Reasoners 88 | 来自东京大学和谷歌的工作,关于预训练大型语言模型的推理能力的探究,“Let's think step by step”的梗即来源于此篇论文 89 | Emergent Abilities of Large Language Models 90 | Google 22年8月份发的,探讨大语言模型的涌现能力 91 | Scaling Instruction-Finetuned Language Models,微调PaLM-540B(2022年10月) 92 | 从三个方面改变指令微调,一是改变模型参数,提升到了540B,二是增加到了1836个微调任务,三是加上Chain of thought微调的数据 93 | Multimodal Chain-of-Thought Reasoning in Language Models 94 | 
23年2月,亚马逊的研究者则在这篇论文里提出了基于多模态思维链技术改进语言模型复杂推理能力的思想 95 | Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers代码地址,这篇文章则将ICL看作是一种隐式的Fine-tuning 96 | 97 | 旋转位置嵌入(RoPE)论文 98 | 99 | The Flan Collection: Designing Data and Methods for Effective Instruction Tuning 100 | Fine-Tuning Language Models from Human Preferences,这是论文对应的代码:微调GPT2 101 | 《The Natural Language Decathlon:Multitask Learning as Question Answering》,GPT-1、GPT-2论文的引用文献,Salesforce发表的一篇文章,写出了多任务单模型的根本思想 102 | Meta-learning via Language Model In-context Tuning 103 | Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022 104 | 105 | 106 | ## 第三部分 Meta等公司发布的类ChatGPT开源模型和各种微调 107 | 108 | LLaMA: Open and Efficient Foundation Language Models,2023年2月24日Meta发布了全新的65B参数大语言模型LLaMA,开源,大部分任务的效果好于2020年的GPT-3 109 | 这是针对该论文的解读之一 110 | Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022 111 | SELF-INSTRUCT: Aligning Language Model with Self Generated Instructions,代码地址,解读1、解读2 112 | 3月中旬,斯坦福发布Alpaca:只花100美元,人人都可微调Meta家70亿参数的LLaMA大模型 113 | 而斯坦福团队微调LLaMA的方法,便是来自华盛顿大学Yizhong Wang等去年底提出的这个Self-Instruct 114 | 115 | 具体而言,论文中提出,首先从自生成指令种子集中的175个人工编写的「指令-输出」对开始,然后,提示text-davinci-003使用种子集作为上下文示例来生成更多指令 116 | 而斯坦福版Alpaca,就是花了不到500美元使用OpenAI API生成了5.2万个这样的示例微调LLaMA搞出来的 117 | 118 | Alpaca: A Strong Open-Source Instruction-Following Model 119 | 120 | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model 121 | 122 | 123 | GLM: General Language Model Pretraining with Autoregressive Blank Infilling 124 | 2022年5月,正式提出了GLM框架 125 | GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL,代码地址 126 | GLM-130B便是基于的GLM框架的大语言模型 127 | 128 | 129 | ## 第四部分 具备多模态能力的大语言模型 130 | Language Is Not All You Need: Aligning Perception with Language Models,微软23年3月1日发布的多模态大语言模型Kosmos-1的论文 131 | PaLM-E: An Embodied Multimodal Language Model(论文地址),Google于23年3月6日发布的关于多模态LLM:PaLM-E,可让能听懂人类指令且具备视觉能力的机器人干活 132 | Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models,微软于23年3月8日推出visual ChatGPT(另,3.9日微软德国CTO说,将提供多模态能力的GPT4即将一周后发布) 133 | At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, they are only experts on specific tasks with one round fixed inputs and outputs. 134 | 135 | To this end, We build a system called {Visual ChatGPT}, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 136 | 1) sending and receiving not only languages but also images 137 | 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps. 138 | 3) providing feedback and asking for corrected results. 
139 | 140 | We design a series of prompts to inject the visual model information into ChatGPT, considering models of multiple inputs/outputs and models that require visual feedback 141 | Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks,这是针对该论文的解读之一 142 | 2022年8月,微软提出的多模态预训练模型BEiT-3 143 | 144 | BEiT: BERT Pre-Training of Image Transformers 145 | BEiT-2: Masked Image Modeling with Vector-Quantized Visual Tokenizers 146 | 147 | ​Aligning Text-to-Image Models using Human Feedback,这是解读之一 148 | ChatGPT的主要成功要归结于采用RLHF来精调LLM,近日谷歌AI团队将类似的思路用于文生图大模型:基于人类反馈(Human Feedback)来精调Stable Diffusion模型来提升生成效果 149 | 目前的文生图模型虽然已经能够取得比较好的图像生成效果,但是很多时候往往难以生成与输入文本精确匹配的图像,特别是在组合图像生成方面。为此,谷歌最新的论文提出了基于人类反馈的三步精调方法来改善这个问题 150 | 151 | Segment Anything 152 | 23年4.6日,Meta发布史上首个图像分割基础模型SAM,将NLP领域的prompt范式引进CV,让模型可以通过prompt一键抠图。网友直呼:CV不存在了! 153 | 154 | minigpt-4的介绍页面、GitHub、论文:MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models 155 | 156 | ​Flamingo: a visual language model for few-shot learning 157 | Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022 158 | Language models are unsupervised multitask learners. 2019 159 | Improving language understanding by generative pre-training. 2018 160 | 161 | 162 | ## 第五部分 AI绘画与多模态能力背后的核心技术 163 | End-to-End Object Detection with Transformers 164 | DETR by 2020年5月,这是针对DETR的解读之一 165 | 166 | 回顾下20年之前的模型提出史(我18年写过一篇:一文读懂目标检测:R-CNN、Fast R-CNN、Faster R-CNN、YOLO、SSD) 167 | 2014 R-CNN 168 | 2015 Fast R-CNN、Faster R-CNN 169 | 2016 YOLO、SSD 170 | 2017 Mask R-CNN、YOLOv2 171 | 2018 YOLOv3 172 | 2019 CenterNet 173 | 2020 DETR 174 | 175 | 20年之后,CV迎来了生成式下的多模态时代『我也正在写这个系列博客,AI绘画与CV多模态原理解析:VAE、扩散模型DDPM、DETR、ViT/Swin transformer、CLIP/BLIP到stable diffusion、GPT4(后者待5月中旬发布)』 176 | 2020年 177 | 6月 DDPM 178 | 10月 DDIM、Vision Transformer 179 | 2021年 180 | 1月 CLIP、DALL·E 181 | 3月 Swin Transformer 182 | 11月 MAE、Swin Transformer V2 183 | 2022年 184 | 1月 BLIP 185 | 4月 DALL·E 2 186 | 8月 Stable Diffusion、BEiT-3 187 | 2023年 188 | 1月 BLIP2 189 | 3月 Visual ChatGPT、GPT-4 190 | 191 | AN IMAGE IS WORTH 16X16 WORDS:TRANSFORMERS FOR IMAGE RECOGNITION A​​​​​​T SCALE 192 | 发表于2020年10月的Vision Transformer原始论文,代表Transformer正式杀入CV界 193 | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,发表于21年3月 194 | Swin Transformer V2: Scaling Up Capacity and Resolution 195 | 第一篇的解读戳这,第二篇的解读戳这里 196 | Auto-Encoding Variational Bayes,苏剑林关于VAE的解读之一,这是另外一个作者:基于苏这个VAE的解读对扩散模型的理解 197 | WGAN 198 | Denoising Diffusion Probabilistic Models,2020年6月提出DDPM,即众人口中常说的diffusion model 199 | 这是苏剑林关于DDPM的相对通俗的系列解读,这是另一份解读:What are Diffusion Models?(该解读的中文笔记) 200 | CLIP: Connecting Text and Images - OpenAI,这是针对CLIP论文的解读之一 201 | CLIP由OpenAI在2021年1月发布,超大规模模型预训练提取视觉特征,图片和文本之间的对比学习(简单粗暴理解就是发微博/朋友圈时,人喜欢发一段文字然后再配一张或几张图,CLIP便是学习这种对应关系) 202 | 203 | 2021年10月,Accomplice发布的disco diffusion,便是第一个结合CLIP模型和diffusion模型的AI开源绘画工具,其内核便是采用的CLIP引导扩散模型(CLIP-Guided diffusion model) 204 | 且后续有很多基于CLIP的一系列改进模型,比如Lseg、GroupViT、ViLD、GLIP 205 | Hierarchical Text-Conditional Image Generation with CLIP Latents,这是解读之一 206 | DALL·E 2论文2022年4月发布(至于第一代发布于2021年初),通过CLIP + Diffusion models,达到文本生成图像新高度 207 | High-Resolution Image Synthesis with Latent Diffusion Models 208 | 2022年8月发布的Stable Diffusion基于Latent Diffusion Models,专门用于文图生成任务 209 | 这些是相关解读:图解stable diffusion(翻译版之一)、这是另一解读,这里有篇AI绘画发展史的总结 210 | 211 | Stable Diffusion和之前的Diffusion扩散化模型相比, 重点是做了一件事, 那就是把模型的计算空间,从像素空间经过数学变换,在尽可能保留细节信息的情况下降维到一个称之为潜空间(Latent 
Space)的低维空间里,然后再进行繁重的模型训练和图像生成计算 212 | BLIP (from Salesforce) released with the paper BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi. 213 | BLIP-2 (from Salesforce) released with the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models by Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 214 | 215 | 216 | ## 第六部分 预训练模型的发展演变史 217 | A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT:https://arxiv.org/pdf/2302.09419,预训练基础模型的演变史 218 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer 219 | Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing,作者来自CMU的刘鹏飞,这是相关资源 220 | 221 | 另一篇类似的,Pre-Trained Models: Past, Present and Future 222 | 21年1月初在CCF启智会支持下,文继荣、唐杰和黄民烈三位老师召集了以预训练模型为主题的闭门研讨会,此后22位老师和同学经过近半年准备,共同形成了这篇43页的综述和观点文章 Pre-Trained Models: Past, Present and Future 223 | 224 | 225 | ## 第七部分 其它 226 | Offsite-Tuning: Transfer Learning without Full Model 227 | 对于许多的私有基础模型,数据所有者必须与模型所有者分享他们的数据以微调模型,这是非常昂贵的,并引起了隐私问题(双向的,一个怕泄露模型,一个怕泄露数据) 228 | Deep Residual Learning for Image Recognition,ResNet论文,短短9页,Google学术被引现15万多 229 | 这是李沐针对ResNet的解读,另 这是李沐针对一些paper的解读列表 230 | WHAT LEARNING ALGORITHM IS IN-CONTEXT LEARNING? INVESTIGATIONS WITH LINEAR MODELS 231 | 232 | Transformer-XL: Attentive language models beyond a fixed-length context 233 | Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond 234 | COLT5: Faster Long-Range Transformers with Conditional Computation 235 | 236 | 237 | 238 | 239 | ``` 240 | 原作者为:[v_JULY_v](https://blog.csdn.net/v_JULY_v "v_JULY_v") 241 | ``` 242 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 9 | 10 | ## Project background 11 | ChatGPT一经推出便火爆全球,为了彻底写清楚ChatGPT背后的所有关键细节,July从1月初写到5月底仍未完工,过程中涉及到多篇文章(RL 论文 项目 CV多模态),再加上之前写的Transformer、RL数学基础等多篇笔记,成了一个大系列: 12 | 13 | - [ChatGPT技术原理解析:从RL之PPO算法、RLHF到GPT4、instructGPT](https://github.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/blob/main/ChatGPT%E6%8A%80%E6%9C%AF%E5%8E%9F%E7%90%86%E8%A7%A3%E6%9E%90%EF%BC%9A%E4%BB%8ERL%E4%B9%8BPPO%E7%AE%97%E6%B3%95%E3%80%81RLHF%E5%88%B0GPT4%E3%80%81instructGPT.md) 14 | - [Transformer通俗笔记:从Word2Vec、Seq2Seq逐步理解到GPT、BERT](https://github.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/blob/main/Transformer%E9%80%9A%E4%BF%97%E7%AC%94%E8%AE%B0%EF%BC%9A%E4%BB%8EWord2Vec%E3%80%81Seq2Seq%E9%80%90%E6%AD%A5%E7%90%86%E8%A7%A3%E5%88%B0GPT%E3%80%81BERT.md) 15 | - RL所需的微积分/概率统计基础、最优化基础 16 | - [强化学习极简入门(上):通俗理解MDP、DP MC TC和Q学习](https://github.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/blob/main/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0%E6%9E%81%E7%AE%80%E5%85%A5%E9%97%A8%EF%BC%9A%E9%80%9A%E4%BF%97%E7%90%86%E8%A7%A3MDP%E3%80%81DP%20MC%20TC%E5%92%8CQ%E5%AD%A6%E4%B9%A0%E3%80%81%E7%AD%96%E7%95%A5%E6%A2%AF%E5%BA%A6%E3%80%81PPO.md) 17 | - [强化学习极简入门(下):策略梯度、PPO](https://github.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/blob/main/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0%E6%9E%81%E7%AE%80%E5%85%A5%E9%97%A8%E4%B8%8B%EF%BC%9A%E9%80%9A%E4%BF%97%E7%90%86%E8%A7%A3MDP%E3%80%81DP%20MC%20TC%E5%92%8CQ%E5%AD%A6%E4%B9%A0%E3%80%81%E7%AD%96%E7%95%A5%E6%A2%AF%E5%BA%A6%E3%80%81PPO.md) 18 | - ChatGPT与多模态必读论文100篇(2.27日起,每周更新) 19 | - 
类ChatGPT的部署与微调:从LLaMA、Alpaca/Vicuna/BELLE、中文版、从GLM、ChatGLM到MOSS、ChatDoctor、可商用 20 | - 类ChatGPT代码逐行解读:从零实现Transformer、ChatGLM-6B、从零实现TRL、ChatLLaMA、ColossalChat、DeepSpeed Chat 21 | - AI绘画与CV多模态原理解析:VAE、扩散模型DDPM、DETR、ViT/Swin transformer、CLIP/BLIP到stable diffusion、GPT4(后者待6月中旬发布) 22 | 23 | ———————————————— 24 | 25 | 23年5月9日,七月ChatGPT原理解析课的一学员虞同学在群内建议道:“或者我们自己是否也可以搞一个项目,大家共同参与科研维护”,之后多位同学响应表示支持 26 | 27 | July个人觉得也可以:“比如项目之一,可以先以我博客内部分文章 搬到GitHub上,然后维护修改旧的章节、扩写新的章节,再之后共同开发论文修订助手之类的项目(针对未成形的论文 通过微调某个开源模型 给出修订、审阅意见)”,于此便有了本GitHub:ChatGPT资源库(原理/微调/代码/论文) 28 | 29 | ## 100 papers 30 | LLM/ChatGPT与多模态必读论文150篇(已更至第101篇) 31 | 32 | https://blog.csdn.net/v_JULY_v/article/details/129508065 33 | 34 | ## co-sponsor 35 | 36 | July、七月ChatGPT原理课的十几位同学,他们是:@corleytd、@EdwardSelf、@JusticeGL、@wangzaistone、@windspin2003、@zh2nlp.. 37 | 38 | ---- 39 | 40 | ## 本项目编写规范(初步) 41 | 温馨提示:由于本项目中存在大量的LaTex公式, github 与 原生markdown 适配可能有所差别,故如果将本项目clone到本地阅读可能导致在如Typro 等编辑软件中会出现显示异常,我们建议在github网页中进行浏览。 42 | 43 | 如果您在参与本项目内容贡献过程中遇到问题,可参考以下栏目,也可将新的解决方案或建议列在以下栏目中。 44 | 45 | ### 关于LaTex公式: 46 | 由于Github对于Markdown 原生语法中LaTex公式解析存在的部分缺憾,导致使用Markdown语法书写的数学公式在github网页中展示会出现异常,特于此文档当前栏目记录一些常用的手法,仅供参考。 47 | [Github LaTex 支持文档](https://docs.github.com/zh/get-started/writing-on-github/working-with-advanced-formatting/writing-mathematical-expressions) 48 | * 关于行内公式的书写手法: 49 | * 原生markdown采用"\$LaTex\$"包裹的形式 50 | * Github 中采用"\$\`LaTex\`\$"的形式进行包裹[此方案仅为解决网页版不显示的问题,在这种方案下,github 公式显示正常,但是原生的markdown中会出现多余字符,如有更好的方案,可直接在此处更新方案] 51 | * 关于行间公式: 52 | * 原生markdown采用"\$\$LaTex\$\$"包裹的形式 53 | * Github 中采用以下形式包裹: 54 |
55 |
56 | ```math
57 | Latex
58 | ```
59 |
60 |
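  * 例如,以本仓库强化学习笔记涉及的 PPO 裁剪目标函数为例(此处仅作书写格式示意),行内引用其中的比值函数可写作 $`r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}`$,完整的行间公式则按上述形式包裹:

    ```math
    L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right) \hat{A}_t \right) \right]
    ```

    这样写在 github 网页中可正常渲染为独立公式。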
61 | 62 | * 关于常用的特殊字符公式: 63 | 64 | | 符号 | LaTeX | 备注 | 65 | | :-----: | :-------: | :-------: | 66 | | $`\#`$ | `\#` | | 67 | | $`\%`$ | `\%` | | 68 | | $`^\wedge`$ | `^\wedge` | | 69 | | $`\&`$ | `\&` | | 70 | | $`\_`$ | `\_` | | 71 | | $`\prime`$ | `\prime` | 导数 | 72 | | $`\lt`$ | `\lt` | 小于 | 73 | | $`\le`$ | `\le` | 小于等于 | 74 | | $`\gt`$ | `\gt` | 大于 | 75 | | $`\ge`$ | `\ge` | 大于等于 | 76 | | $`\mid`$ | `\mid` | 条件概率中的竖分割线 | 77 | | $`\cdots`$ | `\cdots` | 垂直居中省略号 | 78 | | $`\ldots`$ | `\ldots` | 底部对齐省略号 | 79 | | $`\omega`$ | `\omega` | | 80 | | $`\Omega`$ | `\Omega` | | 81 | | $`\lim \limits_{x \to \infty} f(x)`$ | `\lim \limits_{x \to \infty} f(x)` | 极限下标 | 82 | | $`PPO _{\theta} `$ | `PPO_{\theta}` | 普通下标 | 83 | | $`\sum \limits_{i = 1} ^ N `$ | `\sum \limits_{i = 1} ^ N` | 求和中的上下标 | 84 | 。。。 85 | 86 | 87 | **备注**:Github中对于某些复杂的LaTex语法暂未支持,如果遇到渲染不出来的情况请酌情修改公式写法。 88 | 89 | 90 | ### 关于图片 91 | 本项目中的所有图片均保存在assets/images/doc_name 目录下。 92 | 93 | --- 94 | -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/046b440033c8296f2f9b0f1bf5c3190e.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/046b440033c8296f2f9b0f1bf5c3190e.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/0bb3ab43b467ce1071d28a89537abc9c.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/0bb3ab43b467ce1071d28a89537abc9c.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/0dbb490df24a8ddc53d81df6b09c9c76.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/0dbb490df24a8ddc53d81df6b09c9c76.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/1798baf5dba54e21a19508e82d407a8a.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/1798baf5dba54e21a19508e82d407a8a.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/1ecb75833281415497f94e0cbe0279bd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/1ecb75833281415497f94e0cbe0279bd.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/2c3b63b5cb414b438df7a89e14dae8a4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/2c3b63b5cb414b438df7a89e14dae8a4.png 
-------------------------------------------------------------------------------- /assets/images/RL_simple_primer/2edcff5b6e883269f1dd93f829dbb223.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/2edcff5b6e883269f1dd93f829dbb223.gif -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/445315a773e54b1ba101fee85904fcee.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/445315a773e54b1ba101fee85904fcee.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/47088532ca0ef7bde5bae8849728ac8f.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/47088532ca0ef7bde5bae8849728ac8f.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/5bbf0a4065be4b5584277e28502f7a7a.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/5bbf0a4065be4b5584277e28502f7a7a.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/61c34bd9692048c1858665730c1cad15.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/61c34bd9692048c1858665730c1cad15.jpeg -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/6ca0c2b4b623c690c60a749dc1b21291.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/6ca0c2b4b623c690c60a749dc1b21291.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/756685e2f07b494b99bc97f4ce0f4bf9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/756685e2f07b494b99bc97f4ce0f4bf9.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/7d5efb19d49a44cbbdf7bf50da94712d.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/7d5efb19d49a44cbbdf7bf50da94712d.png -------------------------------------------------------------------------------- 
/assets/images/RL_simple_primer/84f59c98f4b14321acb72db7b8459cee.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/84f59c98f4b14321acb72db7b8459cee.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/9726cb99f9af48bc8ac545ace05804e7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/9726cb99f9af48bc8ac545ace05804e7.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/9dcf9cefeb844a5a93b2b6fd38bf5a80.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/9dcf9cefeb844a5a93b2b6fd38bf5a80.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/ab200b2773a547bfa48e639956c52ca0.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/ab200b2773a547bfa48e639956c52ca0.jpeg -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/bad281ae107d49d3adf8aa2d012a08c1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/bad281ae107d49d3adf8aa2d012a08c1.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/bc1f965ef46e4ea693bb1950fd76d7e8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/bc1f965ef46e4ea693bb1950fd76d7e8.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/c3f775519533445db1120f7dc79d4ba1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/c3f775519533445db1120f7dc79d4ba1.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/d3366d92a8ba432b9480a55c7bbc7895.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/d3366d92a8ba432b9480a55c7bbc7895.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/d7fff6ebbf9a42d3ba9c04e1c924d476.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/d7fff6ebbf9a42d3ba9c04e1c924d476.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/dcf2240f8a56451089a314ffe0c6fc08.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/dcf2240f8a56451089a314ffe0c6fc08.png -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/e402747c8e5780a3631e27ea30232e74.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/e402747c8e5780a3631e27ea30232e74.gif -------------------------------------------------------------------------------- /assets/images/RL_simple_primer/e8fc39f9f7f291dc8b121858a201545b.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/RL_simple_primer/e8fc39f9f7f291dc8b121858a201545b.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/029ac3d1134a474abd060e48d838a049.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/029ac3d1134a474abd060e48d838a049.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/05b64b744bc74d828e0394a95ce4e487.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/05b64b744bc74d828e0394a95ce4e487.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/05d74bcd64944f7c87902bd59f12ea1d.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/05d74bcd64944f7c87902bd59f12ea1d.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/1133d2956f234040a30fa4278e2734d0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/1133d2956f234040a30fa4278e2734d0.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/1831052632dbed6050771e49dd341516.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/1831052632dbed6050771e49dd341516.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/1b37ed859770ba388d86273e3c7c6517.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/1b37ed859770ba388d86273e3c7c6517.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/1ecb75833281415497f94e0cbe0279bd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/1ecb75833281415497f94e0cbe0279bd.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/23e1b2939c3a41a29f99971d5427e1ce.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/23e1b2939c3a41a29f99971d5427e1ce.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/2805e9b3b99e4fe99f27881c9c188cb7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/2805e9b3b99e4fe99f27881c9c188cb7.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/2c3b63b5cb414b438df7a89e14dae8a4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/2c3b63b5cb414b438df7a89e14dae8a4.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/300d2c30d66a2fc2c9a96a2535790a19.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/300d2c30d66a2fc2c9a96a2535790a19.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/3842d5cfc696477cac1cf9eb5136b4c1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/3842d5cfc696477cac1cf9eb5136b4c1.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/3bec8b5062e6439f993962b621da0d3e.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/3bec8b5062e6439f993962b621da0d3e.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/452ba38d4bf44c7aafc14e44933e2239.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/452ba38d4bf44c7aafc14e44933e2239.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/4ea27677bac93469d6143a5161d5b037.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/4ea27677bac93469d6143a5161d5b037.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/4f09155231abbc5ede08a1354f25c5a9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/4f09155231abbc5ede08a1354f25c5a9.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/5742da727fbb4fb5a4274b75f9af6c0c.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/5742da727fbb4fb5a4274b75f9af6c0c.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/61a6cc2a71dd2e3b126ff058cd5d045e.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/61a6cc2a71dd2e3b126ff058cd5d045e.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/6505c4bd9e3b4d2d94df645f147597c5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/6505c4bd9e3b4d2d94df645f147597c5.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/6ce3da16ad6b41c998ee25b8aca3fa75.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/chatpt_principle/6ce3da16ad6b41c998ee25b8aca3fa75.png -------------------------------------------------------------------------------- /assets/images/chatpt_principle/6f3b99c974224ad48bc9f5ebbe6929e0.png: -------------------------------------------------------------------------------- 
> Note: the remainder of this export is a generated placeholder listing for the binary image assets under `assets/images/chatpt_principle/` and `assets/images/transformer/`. Each entry maps a repository path to its raw download URL at the pinned commit `4684c652878cd00d345531a8240234642e62d439`, following the pattern `https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/<commit>/<path>`. The individual file names are those already shown in the directory tree above.
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/ab77150713e2ba2c019e37f7acf70ca0.png -------------------------------------------------------------------------------- /assets/images/transformer/ad73f95847f33a27bdef611c1a07f484_MD5.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/ad73f95847f33a27bdef611c1a07f484_MD5.gif -------------------------------------------------------------------------------- /assets/images/transformer/adb96ea87e11431caff3b1363a00601a.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/adb96ea87e11431caff3b1363a00601a.png -------------------------------------------------------------------------------- /assets/images/transformer/afee7380b23f96110e9593354f6c1c99_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/afee7380b23f96110e9593354f6c1c99_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/b0a45c6a0dde463aa6e360ba40344206.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b0a45c6a0dde463aa6e360ba40344206.png -------------------------------------------------------------------------------- /assets/images/transformer/b18ad5eac41c5fd8e31ef5025b614d3f.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b18ad5eac41c5fd8e31ef5025b614d3f.jpeg -------------------------------------------------------------------------------- /assets/images/transformer/b249dd91bcd1580bc8b360fd56ab816c_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b249dd91bcd1580bc8b360fd56ab816c_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/b2c172d8f1f52ef1ee490a99f05c9e44_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b2c172d8f1f52ef1ee490a99f05c9e44_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/b2f9a9dd6628485288965dcf05f36101.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b2f9a9dd6628485288965dcf05f36101.png -------------------------------------------------------------------------------- /assets/images/transformer/b31bf9c40283208ab733992dcd6bebba_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b31bf9c40283208ab733992dcd6bebba_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/b3a33f23896b4a30c96c2383bf836f2c_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b3a33f23896b4a30c96c2383bf836f2c_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/b49d1598a6664e20a918f525d09e2693.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b49d1598a6664e20a918f525d09e2693.png -------------------------------------------------------------------------------- /assets/images/transformer/b54a295fecf8687cc24421d893f4dab6_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b54a295fecf8687cc24421d893f4dab6_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/b63a62ad53b699a20454fa02bfcfa789_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b63a62ad53b699a20454fa02bfcfa789_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/b675304324148d1ddf10e68dc33b8153_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b675304324148d1ddf10e68dc33b8153_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/b7ab108e1e948a7a5e9d771f2e366d6c.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/b7ab108e1e948a7a5e9d771f2e366d6c.png -------------------------------------------------------------------------------- /assets/images/transformer/ba78d968411707c3032b77ca7359d2ae.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/ba78d968411707c3032b77ca7359d2ae.png -------------------------------------------------------------------------------- /assets/images/transformer/bbe11a2bae24b43f3d943b5e9a96fd43_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/bbe11a2bae24b43f3d943b5e9a96fd43_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/c16874e0304388bf5ea54cc204b4ef85_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c16874e0304388bf5ea54cc204b4ef85_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/c2f2e3bbbbd2bfb024f60d94d7252046.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c2f2e3bbbbd2bfb024f60d94d7252046.png -------------------------------------------------------------------------------- /assets/images/transformer/c4d818d40f353c792ab0f19e626ad88e.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c4d818d40f353c792ab0f19e626ad88e.png -------------------------------------------------------------------------------- /assets/images/transformer/c56f16d937169ee283b7dbdc7785700a_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c56f16d937169ee283b7dbdc7785700a_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/c575fe306df6fe8dd9475d84c2b1785c_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c575fe306df6fe8dd9475d84c2b1785c_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/c662ea4c1137ac8be4b1eaa11c6e3f06.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c662ea4c1137ac8be4b1eaa11c6e3f06.png -------------------------------------------------------------------------------- /assets/images/transformer/c6a6cb296455b36bf3416d3e91a6e3c9.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c6a6cb296455b36bf3416d3e91a6e3c9.png -------------------------------------------------------------------------------- /assets/images/transformer/c757183fe507acde590ad1e7d1d73b61.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c757183fe507acde590ad1e7d1d73b61.png -------------------------------------------------------------------------------- /assets/images/transformer/c7ef68d441014d869b3744878549e0b0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c7ef68d441014d869b3744878549e0b0.png -------------------------------------------------------------------------------- /assets/images/transformer/c8fef938155d460db008d19950238bfe.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/c8fef938155d460db008d19950238bfe.png -------------------------------------------------------------------------------- /assets/images/transformer/ca3fe94491cef6b86588816106c17a85_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/ca3fe94491cef6b86588816106c17a85_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/cd4d8c6626380ae119976c75726e7357.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/cd4d8c6626380ae119976c75726e7357.png -------------------------------------------------------------------------------- /assets/images/transformer/cdce655d5f7216fc787b1793ce51f146_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/cdce655d5f7216fc787b1793ce51f146_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/cf4ba05e16d64b10b197514bcd1f9b44.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/cf4ba05e16d64b10b197514bcd1f9b44.png -------------------------------------------------------------------------------- /assets/images/transformer/d31e0e522cb77b9aafe3cb6196376373_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d31e0e522cb77b9aafe3cb6196376373_MD5.png 
-------------------------------------------------------------------------------- /assets/images/transformer/d369d818c140892e9b6bca1eeab4ee39.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d369d818c140892e9b6bca1eeab4ee39.png -------------------------------------------------------------------------------- /assets/images/transformer/d5eea65b2dc61b87dd22623363321ce6_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d5eea65b2dc61b87dd22623363321ce6_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/d6431465c1e7bec6d952041358c9edf8_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d6431465c1e7bec6d952041358c9edf8_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/d66b05cf01ce1c27a2e87793b78fe2ae_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d66b05cf01ce1c27a2e87793b78fe2ae_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/d6cd76d803bb8ad54e57aa67d01ed808_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d6cd76d803bb8ad54e57aa67d01ed808_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/d8d8d1bc376fe0db90599c99ecefad8e_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d8d8d1bc376fe0db90599c99ecefad8e_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/d9bae5ae137038398e8ac3118191d6e8_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d9bae5ae137038398e8ac3118191d6e8_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/d9c6037ac0cea0dad506b75496c6edfa_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d9c6037ac0cea0dad506b75496c6edfa_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/d9f1735f332c29db399e072817da3144_MD5.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/d9f1735f332c29db399e072817da3144_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/da1a8850c5c75f116349b24aa3a62c1d.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/da1a8850c5c75f116349b24aa3a62c1d.png -------------------------------------------------------------------------------- /assets/images/transformer/dadb2abaf3a8cfb78fab6d4ad04d1b6f.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/dadb2abaf3a8cfb78fab6d4ad04d1b6f.png -------------------------------------------------------------------------------- /assets/images/transformer/db937132cff0d9444788b4a43d0f0fb9_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/db937132cff0d9444788b4a43d0f0fb9_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/de565ecf43e0bda278ce99eebd875322.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/de565ecf43e0bda278ce99eebd875322.png -------------------------------------------------------------------------------- /assets/images/transformer/deaa8c2b8634b11e28524700ce181ce7_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/deaa8c2b8634b11e28524700ce181ce7_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/e2b74978dcefbe4e6c14f56d9ee43f95_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e2b74978dcefbe4e6c14f56d9ee43f95_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/e2f3ed201af4091433fa0c2e552c1888_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e2f3ed201af4091433fa0c2e552c1888_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/e5e7f107f3f0281d2892fde09ce474ef_MD5.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e5e7f107f3f0281d2892fde09ce474ef_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/e68a76a5b17995b45d6efdea27f29c51.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e68a76a5b17995b45d6efdea27f29c51.png -------------------------------------------------------------------------------- /assets/images/transformer/e69279b508c770fac37ee6c24d129331.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e69279b508c770fac37ee6c24d129331.png -------------------------------------------------------------------------------- /assets/images/transformer/e6a5088c3ef6310a35b9b4ff41e285eb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e6a5088c3ef6310a35b9b4ff41e285eb.png -------------------------------------------------------------------------------- /assets/images/transformer/e6c505de964ca005b5c72021a7082ced_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e6c505de964ca005b5c72021a7082ced_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/e8d2029b3f55cee5adf8946a3e665e7c_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e8d2029b3f55cee5adf8946a3e665e7c_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/e8de3f53932a45a3e07add8602220d08_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e8de3f53932a45a3e07add8602220d08_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/e8e501ad4397f9e68814d2b156d7f703_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e8e501ad4397f9e68814d2b156d7f703_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/e8f36c8c3a2afbaf84273c977a6c16b5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e8f36c8c3a2afbaf84273c977a6c16b5.png -------------------------------------------------------------------------------- /assets/images/transformer/e9e771e25442e2dfc1470b2cc422ed2e_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/e9e771e25442e2dfc1470b2cc422ed2e_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/eb96bb41bd547f26bcaeaa80dd9c572d.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/eb96bb41bd547f26bcaeaa80dd9c572d.png -------------------------------------------------------------------------------- /assets/images/transformer/ebd475e785b842bda203b694566b9f93.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/ebd475e785b842bda203b694566b9f93.png -------------------------------------------------------------------------------- /assets/images/transformer/ec83795b5d76dba74f63b03e5909b402_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/ec83795b5d76dba74f63b03e5909b402_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/edf65186f1740032d70a9665a14dc5f9_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/edf65186f1740032d70a9665a14dc5f9_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/f061d1fda72a7e14523875d504342e38_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f061d1fda72a7e14523875d504342e38_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/f1bdc252f47440bd467484cf92ecfa31_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f1bdc252f47440bd467484cf92ecfa31_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/f22cb1de22144ad6806b83acb3fb45a4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f22cb1de22144ad6806b83acb3fb45a4.png -------------------------------------------------------------------------------- /assets/images/transformer/f3eb7c90b409dbbe7f6a4ebbbbea417c_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f3eb7c90b409dbbe7f6a4ebbbbea417c_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/f5c9bff7f0a59caa72a96d5c8eb80b20.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f5c9bff7f0a59caa72a96d5c8eb80b20.png -------------------------------------------------------------------------------- /assets/images/transformer/f6650dd74f4288d047e2fdad8aab4803.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f6650dd74f4288d047e2fdad8aab4803.png -------------------------------------------------------------------------------- /assets/images/transformer/f87852261bb2c5ef347926e55428615b_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/f87852261bb2c5ef347926e55428615b_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/fa1e392270b62d2ca973c41e850e71de_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fa1e392270b62d2ca973c41e850e71de_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/fa3b06a67625a681ba4e127a3be10cc0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fa3b06a67625a681ba4e127a3be10cc0.png -------------------------------------------------------------------------------- /assets/images/transformer/fadc2266e03a0d37020d56c897df0a34_MD5.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fadc2266e03a0d37020d56c897df0a34_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/fb7b40a52ecf867d81671a218af0edb5_MD5.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fb7b40a52ecf867d81671a218af0edb5_MD5.jpg -------------------------------------------------------------------------------- /assets/images/transformer/fbacc026424dbdc50c7ad87b36f68ad9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fbacc026424dbdc50c7ad87b36f68ad9.png -------------------------------------------------------------------------------- /assets/images/transformer/fc657ecca0a6d942ea9bfa5a5c828edb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fc657ecca0a6d942ea9bfa5a5c828edb.png -------------------------------------------------------------------------------- /assets/images/transformer/fd6a975a1cf279e5f4850a2e66918192_MD5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/fd6a975a1cf279e5f4850a2e66918192_MD5.png -------------------------------------------------------------------------------- /assets/images/transformer/feaa01981ce3ab2f50012dcfa86ff40a.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/julycoding/ChatGPT_principle_fine-tuning_code_paper/4684c652878cd00d345531a8240234642e62d439/assets/images/transformer/feaa01981ce3ab2f50012dcfa86ff40a.png -------------------------------------------------------------------------------- /强化学习极简入门上:通俗理解MDP、DP MC TC和Q学习、策略梯度、PPO.md: -------------------------------------------------------------------------------- 1 | ## **前言** 2 | 22年底/23年初ChatGPT大火,在写《[ChatGPT技术原理解析](https://blog.csdn.net/v_JULY_v/article/details/128579457 "ChatGPT技术原理解析")》的过程中,发现ChatGPT背后技术涉及到了RL/RLHF,于是又深入研究RL,研究RL的过程中又发现里面的数学公式相比ML/DL更多,于此激发我一边深入RL,一边重修微积分、概率统计、最优化,前者成就了本篇RL极简入门,后者成就了另两篇数学笔记:概率统计极简入门(23修订版)、一文通透优化算法(23修订版). 3 | 4 | 如上篇ChatGPT笔记所说,本文最早是作为ChatGPT笔记的第一部分的,但RL细节众多,如果想完全在上篇笔记里全部介绍清楚,最后篇幅将长之又长同时还影响完读率,为了避免因为篇幅限制而导致RL很多细节阐述的不够细致,故把RL相关的部分从上文中抽取出来独立成本文. 5 | 6 | + 一方面,原有内容(第一部分 RL基础:什么是RL与MRP、MDP,和第四部分 **策略学习**:从策略梯度到TRPO、PPO算法)继续完善、改进,完善记录见本文文末 7 | + 二方面,在原有内容上新增了以下这两部分内容的详细阐述: 8 | 第二部分 RL进阶之三大表格求解法:DP、MC、TD 9 | 第三部分 **价值学习**:从n步Sarsa算法到Q-learning、DQN 10 | 11 | 另,本文有两个特色 12 | 13 | 1. 定位入门.过去一个多月,我翻遍了十来本RL中文书,以及网上各种RL资料 14 | 有的真心不错(比如sutton的RL书,但此前从来没有接触过RL的不建议一上来就看该书,除非你看过本文之后) 15 | 其余大部分要么就是堆砌概念/公式,要么对已经入门的不错,但对还没入门的初学者极度不友好,很多背景知识甚至公式说明、符号说明没有交待,让初学者经常看得云里雾里 16 | 本文会假定大部分读者此前从来没有接触过RL,会尽可能多举例、多配图、多交待,100个台阶,一步一步拾级而上,不出现任何断层 17 | 2. 推导细致.本文之前,99%的文章都不会把PPO算法从头推到尾,本文会把PPO从零推到尾,按照“**RL-策略梯度-重要性采样(重要性权重)-增加基线(避免奖励总为正)-TRPO(加进KL散度约束)-PPO(解决TRPO计算量大的问题)**”的顺序逐步介绍每一步推导 18 | 且为彻底照顾初学者,本文会解释/说明清楚每一个公式甚至符号,包括推导过程中不省任何一个必要的中间推导步骤,十步推导绝不略成三步 19 | 20 | 总之,大部分写书、写教材、写文章的人过了那个从不懂到懂的过程,所以懂的人写给不懂的人看,处处都是用已懂的思维去写,而不是用怎么从不懂到懂的思维 去写,未来三年 奋笔疾书,不断给更多初学者普及AI和RL技术 21 | 22 | ## 第一部分 RL基础:什么是RL与MRP、MDP 23 | 24 | ## 1.1 入门强化学习所需掌握的基本概念 25 | 26 | ### 1.1.1 什么是强化学习:依据策略执行动作-感知状态-得到奖励 27 | 28 | 强化学习里面的概念、公式,相比ML/DL特别多,初学者刚学RL时,很容易被接连不断的概念、公式给绕晕,而且经常忘记概念与公式符号表达的一一对应. 
29 | 30 | 为此,我建议学习RL的第一步就是一定要扎实关于RL的一些最基本的概念、公式(不要在扎实基础的阶段图快或图囵吞枣,不然后面得花更多的时间、更大的代价去弥补),且把概念与公式的一一对应关系牢记于心,这很重要.当然,为最大限度的提高本文的可读性,我会尽可能的多举例、多配图. 31 | 32 | 另,RL里面存着大量的数学,考虑到可以为通俗而增加篇幅,但不为了介绍而介绍式的增加篇幅,故 33 | 34 | + 像高数/概率统计里的什么叫[导数](https://zh.wikipedia.org/wiki/%E5%AF%BC%E6%95%B0 "导数"),期望以及什么叫[概率分布](https://blog.csdn.net/v_july_v/article/details/8308762 "概率分布")、熵/香浓熵(Shannon熵)/交叉熵、相对熵(也称KL散度,即[KL divergence](https://zhuanlan.zhihu.com/p/425693597?utm_id=0 "KL divergence"))、多元函数、偏导数,可以参见Wikipedia或《[概率统计极简入门:通俗理解微积分/期望方差/正态分布前世今生(23修订版)](https://blog.csdn.net/v_JULY_v/article/details/8308762 "概率统计极简入门:通俗理解微积分/期望方差/正态分布前世今生(23修订版)")》等类似笔记 35 | + 而AI一些最基本的概念比如损失函数、[梯度](https://zh.wikipedia.org/wiki/%E6%A2%AF%E5%BA%A6 "梯度")、梯度下降、随机梯度下降(SGD)、[学习率](https://paddlepedia.readthedocs.io/en/latest/tutorials/deep_learning/model_tuning/learning_rate.html "学习率")等,可以参考此篇笔记:《[一文通透优化算法:从梯度下降、SGD到牛顿法、共轭梯度(23修订版)](https://blog.csdn.net/v_JULY_v/article/details/81350035 "一文通透优化算法:从梯度下降、SGD到牛顿法、共轭梯度(23修订版)")》,本文则不过多介绍 36 | 37 | 话休絮烦,下面进入正题,且先直接给出强化学习的定义和其流程,然后再逐一拆解、说明. 38 | 39 | 所谓强化学习(Reinforcement Learning,简称RL),是指基于智能体在复杂、不确定的环境中最大化它能获得的奖励,从而达到自主决策的目的. 40 | 41 | 经典的强化学习模型可以总结为下图的形式(你可以理解为任何强化学习都包含这几个基本部分:智能体、行为、环境、状态、奖励): 42 | 43 | ![e8fc39f9f7f291dc8b121858a201545b.png](./assets/images/RL_simple_primer/e8fc39f9f7f291dc8b121858a201545b.png) 44 | 45 | 一般的文章在介绍这些概念时很容易一带而过,这里我把每个概念都逐一解释下 46 | + Agent,一般译为智能体,就是我们要训练的模型,类似玩超级玛丽的时候操纵马里奥做出相应的动作,而这个马里奥就是Agent 47 | 48 | + action(简记为$`a`$),玩超级玛丽的时候你会控制马里奥做三个动作,即向左走、向右走和向上跳,而马里奥做的这三个动作就是action 49 | 50 | + Environment,即环境,它是提供reward的某个对象,它可以是AlphaGo中的人类棋手,也可以是自动驾驶中的人类驾驶员,甚至可以是某些游戏AI里的游戏规则 51 | 52 | + reward(简记为$`r`$),这个奖赏可以类比为在明确目标的情况下,接近目标意味着做得好则奖,远离目标意味着做的不好则惩,最终达到收益/奖励最大化,且这个奖励是强化学习的核心 53 | + State(简介为$`S`$),可以理解成环境的状态,简称状态 54 | 55 | 总的而言,Agent依据策略决策从而执行动作action,然后通过感知环境Environment从而获取环境的状态state,进而,最后得到奖励reward(以便下次再到相同状态时能采取更优的动作),然后再继续按此流程“**依据策略执行动作-感知状态--得到奖励**”循环进行. 56 | 57 | ### 1.1.2 RL与监督学习的区别和RL方法的分类 58 | 59 | 此外,RL和监督学习(supervised learning)的区别: 60 | 61 | + 监督学习有标签告诉算法什么样的输入对应着什么样的输出(譬如分类、回归等问题) 62 | 所以对于监督学习,目标是找到一个最优的模型函数,使其在训练数据集上最小化一个给定的损失函数,相当于最小化预测误差 63 | 最优模型 = arg minE{ [损失函数(标签,模型(特征)] } 64 | RL没有标签告诉它在某种情况下应该做出什么样的行为,只有一个做出一系列行为后最终反馈回来的reward,然后判断当前选择的行为是好是坏 65 | 相当于RL的目标是最大化智能体策略在和动态环境交互过程中的价值,而策略的价值可以等价转换成奖励函数的期望,即最大化累计下来的奖励期望 66 | 最优策略 = arg maxE { [奖励函数(状态,动作)] } 67 | 68 | + 监督学习如果做了比较坏的选择则会立刻反馈给算法 69 | RL的结果反馈有延时,有时候可能需要走了很多步以后才知道之前某步的选择是好还是坏 70 | 71 | + 监督学习中输入是独立分布的,即各项数据之间没有关联 72 | RL面对的输入总是在变化,每当算法做出一个行为,它就影响了下一次决策的输入 73 | 74 | 进一步,RL为得到最优策略从而获取最大化奖励,有 75 | 76 | + **基于值函数的方法**,通过求解一个状态或者状态下某个动作的估值为手段,从而寻找最佳的价值函数,找到价值函数后,再提取最佳策略 77 | 比如Q-learning、DQN等,适合离散的环境下,比如围棋和某些游戏领域 78 | + **基于策略的方法**,一般先进行策略评估,即对当前已经搜索到的策略函数进行估值,得到估值后,进行策略改进,不断重复这两步直至策略收敛 79 | 80 | 比如策略梯度法(policy gradient,简称PG),适合连续动作的场景,比如机器人控制领域 81 | 以及Actor-Criti(一般被翻译为演员-评论家算法),Actor学习参数化的策略即策略函数,Criti学习值函数用来评估状态-动作对,不过,Actor-Criti本质上是属于基于策略的算法,毕竟算法的目标是优化一个带参数的策略,只是会额外学习价值函数,从而帮助策略函数更好的学习. 82 | 83 | 此外,还有对策略梯度算法的改进,比如TRPO算法、PPO算法,当然PPO算法也可称之为是一种Actor-Critic架构,下文会重点阐述. 84 | 85 | 可能你还有点懵懵懂懂,没关系,毕竟还有不少背景知识还没有交待,比如RL其实是一个马尔可夫决策过程(Markov decision process,MDP),而为说清楚MDP,得先从随机过程、马尔可夫过程(Markov process,简称MP)开始讲起,故为考虑逻辑清晰,我们还是把整个继承/脉络梳理下. 86 | 87 | ## 1.2 什么是马尔科夫决策过程 88 | 89 | ### 1.2.1 MDP的前置知识:随机过程、马尔可夫过程、马尔可夫奖励 90 | 91 | 如HMM学习最佳范例中所说,有一类现象是确定性的现象,比如红绿灯系统,红灯之后一定是红黄、接着绿灯、黄灯,最后又红灯,每一个状态之间的变化是确定的 92 | 93 |
94 |
95 | 96 | 但还有一类现象则不是确定的,比如今天是晴天,谁也没法百分百确定明天一定是晴天还是雨天、阴天(即便有天气预报) 97 |
98 |
99 | 100 | 对于这种假设具有$`M`$个状态的模型 101 | 102 | 1. 共有$`M^2`$个状态转移,因为任何一个状态都有可能是所有状态的下一个转移状态 103 | 2. 每一个状态转移都有一个概率值,称为状态转移概率,相当于从一个状态转移到另一个状态的概率 104 | 3. 所有的$`M^2`$个概率可以用一个**状态转移矩阵**表示 105 | 106 | 下面的状态转移矩阵显示的是天气例子中可能的状态转移概率:  107 | ```math 108 | \begin{matrix} 109 | & Today\\ 110 | Yesterday& \begin{array}{c} 111 | \\ 112 | sun\\ 113 | cloud\\ 114 | rain\\ 115 | \end{array}\left[ \begin{matrix} 116 | sun& cloud& rain& \\ 117 | 0.50& 0.375& 0.125& \\ 118 | 0.25& 0.125& 0.625& \\ 119 | 0.25& 0.375& 0.375& \\ 120 | \end{matrix} \right]\\ 121 | \end{matrix} 122 | ``` 123 | 124 | 也就是说,如果昨天是晴天,那么今天是晴天的概率为0.5,是多云的概率为0.375、是雨天的概率为0.125,且这三种天气状态的概率之和必为1. 125 | 126 | 接下来,我们来抽象建模下.正如概率论的研究对象是静态的随机现象,而随机过程的研究对象是随时间演变的随机现象(比如天气随时间的变化): 127 | 128 | + 随机现象在某时刻t的取值是一个向量随机变量,用$`S_t`$表示,比如上述天气转移矩阵便如下图所示 129 | ```math 130 | \begin{bmatrix} s_1\rightarrow s_1,s_1\rightarrow s_2,s_1\rightarrow s_3& \\ s_2\rightarrow s_1,s_2\rightarrow s_2,s_2\rightarrow s_3 & \\ s_3\rightarrow s_1,s_3\rightarrow s_2,s_3\rightarrow s_3 & & \end{bmatrix} 131 | ``` 132 | + 在某时刻t的状态$`S_t`$通常取决于t时刻之前的状态,我们将已知历史信息$`(S_1,\cdots ,S_t)`$时下一个时刻的状态$`S_{t+1}`$的概率表示成$`P(S_{t+1}|S_1,\cdots ,S_t)`$ 133 | 如此,便可以定义一个所有状态对之间的转移概率矩阵 134 | ```math 135 | P=\begin{bmatrix} P(s_1|s_1) \ P(s_2|s_1) \ P(s_3|s_1) \cdots P(s_n|s_1)&\\ P(s_1|s_2) \ P(s_2|s_2) \ P(s_3|s_2) \cdots P(s_n|s_2) &\\ \cdots \cdots \cdots &\\ \cdots \cdots \cdots &\\P(s_1|s_n) \ P(s_2|s_n) \ P(s_3|s_n) \cdots P(s_n|s_n) \end{bmatrix} 136 | ``` 137 | 138 | + 当且仅当某时刻的状态只取决于上一时刻的状态时,一个随机过程被称为具有**马尔可夫性质**,即$`P(S_{t+1}|S_t) = P(S_{t+1}|S_1,\cdots ,S_t)`$,当然了,虽说当前状态只看上一个状态,但上一个状态其实包含了更上一个状态的信息,所以不能说当下与历史是无关的 139 | + 而具有马尔可夫性质的随机过程便是**马尔可夫过程** 140 | 141 | 在马尔可夫过程的基础上加入奖励函数$`R`$和折扣因子$`\gamma`$,就可以得到**马尔可夫奖励过程**(Markov reward process,MRP).其中 142 | + **奖励函数**,某个状态$`s`$的奖励$`R(s)`$,是指转移到该状态s时可以获得奖励的期望,有$`R(s) = E[R_{t+1}|S_t = s]`$ 143 | 144 | **注意,有的书上奖励函数和下面回报公式中的$`R_{t+1}`$的下标$`t+1`$写为t**,其实严格来说,先有t时刻的状态/动作之后才有t+1时刻的奖励,但应用中两种下标法又都存在,读者注意辨别 145 | + 此外,实际中,因为一个状态可以得到的奖励是持久的,所有奖励的衰减之和称为回报,可用$`G`$**表示当下即时奖励和所有持久奖励等一切奖励的加权和** (考虑到一般越往后某个状态给的回报率越低,也即奖励因子或折扣因子越小,用$`\gamma`$表示),从而有 146 | ```math 147 | \begin{aligned}G_t &=R_{t+1} + \gamma \cdot R_{t+2}+ \gamma ^2\cdot R_{t+3} + \gamma ^3\cdot R_{t+4}+\cdots\\&= R_{t+1} + \gamma (R_{t+2}+ \gamma \cdot R_{t+3} + \gamma ^2\cdot R_{t+4}+\cdots) \\&= R_{t+1} + \gamma G_{t+1} \end{aligned} 148 | ``` 149 | 150 | 举个例子,一个少年在面对“上大学、去打工、在家啃老”这三种状态,哪一种更能实现人生的价值呢?相信很多人为长远发展都会选择上大学,因为身边有太多人因为上了大学,而好事连连,比如读研读博留学深造、进入大厂、娶个漂亮老婆、生个聪明孩子,当然了,上大学好处肯定多多,但上大学这个状态对上面4件好事所给予的贡献必然是逐级降低,毕竟越往后,越会有更多或更重要的因素成就更后面的好事,总不能所有好事都百分百归功于最开头选择了“上大学”这个状态/决策嘛 151 | 152 | 而一个状态的期望回报就称之为这个状态的价值,所有状态的价值则组成了所谓的价值函数,用公式表达为$`V(s) = E[G_t|S_t=s]`$,展开一下可得 153 | ```math 154 | \begin{aligned} V(s) &= E[G_t|S_t=s] \\& = E[R_{t+1} + \gamma G_{t+1}|S_t =s]\\& = E[R_{t+1}|S_t =s] + \gamma E[G_{t+1}|S_t =s]\\& = E[R_{t+1}|S_t = s] + \gamma E[V(S_{t+1})|S_t = s] \end{aligned} 155 | ``` 156 | 157 | 在上式最后一个等式中 158 | + 前半部分表示当前状态得到的即时奖励$`E[R_{t+1}|S_t = s] = R(s)`$ 159 | + 后半部分表示当前状态得到的所有持久奖励$`\gamma E[V(S_{t+1})|S_t = s]`$,可以根据从状态s出发的转移概率得到『至于上述推导的最后一步,在于$`E[G_{t+1}|S_t = s]`$等于$`E[V(S_{t+1})|S_t = s)]`$』 160 | 161 | 有个别朋友在我维护的Machine Learning读书会群里说,对上述推导最后一步的推导过程有疑问,考虑到本文追求详尽细致,加之大部分资料都是把这个当结论默认的,故还是把这个推导过程详细写一下 162 | ```math 163 | \begin{aligned} E[G_{t+1}|S_t = s] &= \sum G_{t+1}P\left \{ G_{t+1}|S_t = s \right \} \\& = \sum G_{t+1}\sum_{s'}^{}P\left \{ G_{t+1}|S_{t+1} = s',S_t =s \right \}P\left \{ S_{t+1} = s'|S_t =s \right \} \\& = \sum_{s'}^{}\sum G_{t+1}P\left \{ G_{t+1}|S_{t+1} =s',S_t =s \right 
\}P\left \{ S_{t+1} =s'|S_t =s \right \} \\& = \sum_{s'}^{}E[G_{t+1}|S_{t+1} = s',S_t =s]P\left \{ S_{t+1}=s'|S_t =s \right \} \\& = \sum_{s'}^{}V(S_{t+1})P\left \{ S_{t+1}=s'|S_t =s \right \} \\& = E[V(S_{t+1})|S_t =s] \end{aligned} 164 | ``` 165 | 可能又有同学对上述第二个等式怎么来的又有疑问了,怎么推导呢?我们只需推导出 166 | ```math 167 | P\left \{ G_{t+1}|S_t =s \right \} = \sum_{s'}^{}P\left \{ G_{t+1}|S_{t+1} = s',S_t =s \right \}P\left \{ S_{t+1} = s'|S_t =s \right \} 168 | ``` 169 | 推导过程如下 170 | ```math 171 | \begin{aligned} P\left \{ G_{t+1}|S_t=s \right \} &= \frac{P\left \{ G_{t+1},S_t=s \right \}}{P(S_t=s)} \\&= \frac{\sum_{s'}^{}P\left \{ G_{t+1},S_{t+1}=s',S_t =s \right \}}{P(S_t =s)} \\&= \frac{\sum_{s'}^{}P\left \{ G_{t+1}|S_{t+1}=s',S_t=s \right \}P(S_{t+1}=s',S_t =s)}{P(S_t =s)}\\&= \frac{\sum_{s'}^{}P\left \{ G_{t+1}|S_{t+1}=s',S_t=s \right \}P(S_{t+1}=s'|S_t =s)P(S_t =s)}{P(S_t =s)}\\&= \sum_{s'}^{}P\left \{ G_{t+1}|S_{t+1}=s',S_t=s \right \}P(S_{t+1}=s'|S_t=s) \end{aligned} 172 | ``` 173 | 174 | 从而,综合前后两个部分可得 175 | ```math 176 | V(s) = R(s) + \gamma \sum_{s'\in S}^{}P(s'|s)V(s') 177 | ``` 178 | 179 | 而这就是所谓的**贝尔曼方程**(bellman equation).该公式精准而简洁,其背后浓缩了很多信息,为形象起见,举个例子,比如状态$`S_1`$得到的即时奖励为$`R_{s1}`$,然后接下来,有 180 | + $`P_{12}`$的概率引发$`S_2`$状态,此时状态$`S_2`$得到的即时奖励为$`R_{s2}`$ 181 | 接下来有$`P_{24}`$的概率引发状态$`S_4`$($`S_4`$的即时奖励为$`R_{s4}`$,后续无持久奖励),有$`P_{25}`$的概率引发$`S_5`$状态($`S_5`$的即时奖励为$`R_{s5}`$,后续无持久奖励) 182 | + $`P_{13}`$的概率引发$`S_3`$状态,此时$`S_3`$状态得到的即时奖励为$`R_{s3}`$ 183 | 接下来有$`P_{36}`$的概率引发状态$`S_6`$($`S_6`$的即时奖励为$`R_{s6}`$,后续无持久奖励),,有$`P_{37}`$的概率引发状态$`S_7`$($`S_7`$的即时奖励为$`R_{s7}`$,后续无持久奖励) 184 | 185 | 其中折扣因此为$`\gamma`$,那么因状态$`S_1`$而得到的一切奖励为 186 | ```math 187 | R_{s1} + \gamma (P_{12}R_{s2} + P_{13}R_{s3}) + \gamma^2(P_{24} R_{s4} + P_{25} R_{s5}) + \gamma^2(P_{36} R_{s6} + P_{37}R_{s7}) \\ = R_{s1} + \gamma (P_{12}R_{s2} + P_{13}R_{s3}) + \gamma^2(P_{24} R_{s4} + P_{25} R_{s5} + P_{36} R_{s6} + P_{37}R_{s7}) 188 | ``` 189 | 190 | 类似的,因状态$`S_2`$得到的一切奖励为$`R_{s2} + \gamma (P_{24}R_{s4} + P_{25}R_{s5})`$ 191 | 192 | > 为更加形象起见,再举一个生活中最常见的“吃饭-抽烟/剔牙”例子 193 | > 比如你吃完饭后你自己的心情愉悦值即奖励+5,然后下一个状态,有 194 | > + 0.6的概率是抽烟(抽烟带来的心情愉悦值即奖励+7,要不说 饭后一支烟 赛过活神仙呢) 195 | > + 0.4的概率是剔牙(剔牙带来的奖励值+3) 196 | > 197 | > 假设折扣因子$`\gamma`$(上文说过了,就是一个状态对后续状态的贡献程度)为0.5,且假定 198 | > + 吃饭的状态定义为$`s_1`$,则$`R_{s1} = 5`$ 199 | > + 抽烟的状态定义为$`s_2`$,则$`R_{s2} = 7`$,且由于抽烟之后无后续状态,所以$`G_{s2}`$也是$`7`$ 200 | > + 剔牙的状态定义为$`s_3`$,则$`R_{s3} = 3`$,且由于剔牙之后无后续状态,所以$`G_{s3}`$也是$`3`$ 201 | > 202 | > 从而有: 203 | > 当从$`s_1 \rightarrow s_2`$时, 204 | > $`G_{s1} = R_{s1} + \gamma R_{s2} = 5 + 0.5 \times 7 = 8.5`$ 205 | > 当从$`s_1 \rightarrow s_3`$时, 206 | >.$`G'_{s1} = R_{s1} + \gamma R_{s3} = 5 + 0.5\times 3 = 6.5`$ 207 | > 208 | > 由于状态$`s_2`$和状态$`s_3`$没有后续状态,所以$`s_2`$和$`s_3`$对应的状态值函数分别为 209 | ```math 210 | v_{s2} = R_{s2} = 7 \\v_{s3} = R_{s3} = 3 211 | ``` 212 | > 再根据贝尔曼方程$`V(s) = R(s) + \gamma \sum_{s'\in S}^{}P(s'|s)V(s')`$,可得状态$`s_1`$的状态价值函数为 213 | ```math 214 | \begin{aligned} V(s1) &= R_{s1} + \gamma (P_{12}R_{s2} + P_{13}R_{s3}) \\&= 5+ 0.5 \times (0.6\times 7 + 0.4 \times 3) \\&= 7.7 \end{aligned} 215 | ``` 216 | > 当然,你也可以如此计算(可以很明显的看出,计算量不如上述过程简洁,所以一般优先按上述方式计算) 217 | ```math 218 | \begin{aligned} V(s1) &= E[G_t|S_t=s] \\& = p_{12} \times G^{s2}_{s1} + p_{13} \times G^{s3}_{s1} \\& = P_{12} (R_{s1} + \gamma R_{s2}) + P_{13} (R_{s1} + \gamma R_{s3})\\& = 0.6(5 + 0.5\times 7) + 0.4(5+0.5\times 3) \\& = 7.7 \end{aligned} 219 | ``` 220 | 221 | 上述例子的状态比较少所以计算量不大,但当状态一多,则贝尔曼方程的计算量还是比较大的,而求解较大规模的马尔可夫奖励过程中的价值函数时,可以用的方法包括:动态规划、蒙特卡洛方法、时序差分(temporal difference,简称TD)方法 222 | 
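顺带给一个极简的验证示意(仅为辅助理解的草稿,并非本文/本仓库的正式代码,矩阵与变量命名均为笔者自拟):把上面"吃饭-抽烟/剔牙"这个小MRP写成矩阵形式$`V = R + \gamma PV`$,移项后直接解线性方程组$`(I-\gamma P)V=R`$即可复现$`V(s_1)=7.7`$;也能看出这种直接求解的复杂度约为$`O(n^3)`$,状态一多就不划算了

```python
import numpy as np

# 状态顺序:s1=吃饭、s2=抽烟、s3=剔牙(s2、s3 无后续状态,转移概率全为 0)
P = np.array([[0.0, 0.6, 0.4],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
R = np.array([5.0, 7.0, 3.0])    # 各状态的即时奖励 R(s)
gamma = 0.5                      # 折扣因子

# 贝尔曼方程的矩阵形式 V = R + γPV,等价于解线性方程组 (I - γP)V = R
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)   # 输出 [7.7  7.   3. ],与正文手算的 V(s1)=7.7 一致
```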
当然,其中细节还是不少的,下文第二部分会详述这三大方法 223 | 224 | ### 1.2.2 马尔可夫决策过程(MDP):马尔可夫奖励(MRP) + 智能体动作因素 225 | 226 | 根据上文我们已经得知,在随机过程的基础上 227 | + 增加马尔可夫性质,即可得马尔可夫过程 228 | + 而再增加奖励,则得到了马尔可夫奖励过程(MRP) 229 | + 如果我们再次增加一个来自外界的刺激比如智能体的动作,就得到了**马尔可夫决策过程**(MDP) 230 | 通俗讲,MRP与MDP的区别就类似随波逐流与水手划船的区别 231 | > 在马尔可夫决策过程中,$`S_t`$($`S`$是状态的集合)和$`R_t`$($`R`$是奖励的集合)的每个可能的值出现的概率只取决于前一个状态$`S_{t-1}`$和前一个动作$`A_{t-1}`$($`A`$是动作的集合),并且与更早之前的状态和动作完全无关 232 | 233 | 换言之,当给定当前状态$`S_t`$(比如$`S_t =s`$,以及当前采取的动作$`A_t`$(比如$`A_t = a`$,那么下一个状态$`S_{t+1}`$出现的概率,可由状态转移概率矩阵表示如下 234 | ```math 235 | \begin{aligned}P_{ss'}^{a} &= P(S_{t+1}=s'|S_t =s,A_t = a) \\&= {}P(s'|s,a) \end{aligned} 236 | ``` 237 | 238 | 假定在当前状态和当前动作确定后,其对应的奖励则设为$`R_{t+1} = r`$,故sutton的RL一书中,给的状态转移概率矩阵类似为 239 | ```math 240 | \begin{aligned}p(s',r|s,a)=P\left \{ S_{t+1} = s',R_{t+1} = r |S_t = s,A_t = a \right \} \end{aligned} 241 | ``` 242 | 243 | 从而可得奖励函数即为 244 | ```math 245 | \begin{aligned}R(s,a) &= E[R_{t+1} | S_t = s,A_t = a] \\&=\sum_{s'\in S}^{}p(s',r|s,a) \sum_{r\in R}^{}r \end{aligned} 246 | ``` 247 | 248 | > 考虑到后来有读者对上面这个公式有疑惑,所以解释下推导步骤 249 | > 250 | > 1. 首先,计算在状态$`s`$下采取动作$`a`$后,转移到下一个状态$`s'`$并获得奖励$`r`$的概率,表示为$`p(s',r|s,a)`$ 251 | > 252 | > 2. 然后,我们对所有可能的下一个状态$`s'`$求和,并对所有可能的奖励$`r`$求和(不少情况下,即使状态转移和动作是已知的,奖励$`r`$仍然可能是随机的,比如买股票,股票价格随机波动,导致购买之后的盈亏也具有随机性) 253 | > 3. 最后,我们将这些概率与对应的奖励$`r`$相乘并相加,以得到条件期望 254 | > 255 | > 当然,如果奖励是确定性的,则可以简化公式,去掉对$`r`$的求和,即: 256 | ```math 257 | R(s,a) = \sum p(s',r|s,a) * r 258 | ``` 259 | > 260 | > 相当于此时只需要计算在状态$`s`$下采取动作$`a`$后,转移到下一个状态$`s'`$的概率乘以确定的奖励$`r`$,然后对所有可能的下一个状态$`s'`$求和以得到条件期望 261 | 262 | 至于过程中采取什么样的动作就涉及到策略policy,**策略函数**可以表述为$`\pi`$函数(当然,这里的$`\pi`$跟圆周率没半毛钱关系) 263 | + 从而可得$`a=\pi(s)`$,意味着输入状态$`S`$,策略函数$`\pi`$输出动作$`a`$ 264 | + 此外,还会有这样的表述:$`a = \pi _{\theta }(s)`$,相当于在输入状态$`s`$确定的情况下,输出的动作$`a`$只和参数$`\theta`$有关,这个$`\theta`$就是策略函数$`\pi`$的参数 265 | + 再比如这种$`\pi (a|s) = P(A_t = a| S_t = s)`$,相当于输入一个状态$`s`$下,智能体采取某个动作$`a`$的概率 266 | 267 | 通过上文,我们已经知道不同状态出现的概率不一样(比如今天是晴天,那明天是晴天,还是雨天、阴天不一定),同一状态下执行不同动作的概率也不一样(比如即便在天气预报预测明天大概率是天晴的情况下,你大概率不会带伞,但依然不排除你可能会防止突然下雨而带伞) 268 | 269 | 而有了动作这个因素之后,我们重新梳理下价值函数 270 | + 首先,通过**状态价值函数**对当前状态进行评估 271 | ```math 272 | \begin{aligned} V_{\pi}(s) &= E_\pi [G_t|S_t = s] \\&= E_\pi [R_{t+1} + \gamma G_{t+1} | S_t = s] \\&= E_\pi [R_{t+1} + \gamma V_\pi (S_{t+1}) | S_t = s] \end{aligned} 273 | ``` 274 | 相当于从状态$`s`$出发遵循策略$`\pi`$能获得的期望回报 275 | + 其次,通过“动作价值函数”对动作的评估 276 | ```math 277 | \begin{aligned} Q_\pi (s,a) &= E_\pi [G_t | S_t=s,A_t = a] \\& = E_\pi [R_{t+1} + \gamma G_{t+1}| S_t=s,A_t = a] \\& = E_\pi [R_{t+1} + \gamma Q_\pi (S_{t+1},A_{t+1})| S_t=s,A_t = a] \end{aligned} 278 | ``` 279 | 相当于对当前状态$`s`$依据策略$`\pi`$执行动作$`a`$得到的期望回报,这就是大名鼎鼎的$`Q`$函数,得到$`Q`$函数后,进入某个状态要采取的最优动作便可以通过$`Q`$函数得到 280 | ![5bbf0a4065be4b5584277e28502f7a7a.png](./assets/images/RL_simple_primer/5bbf0a4065be4b5584277e28502f7a7a.png) 281 | 282 | 当有了策略、价值函数和模型3个组成部分后,就形成了一个马尔可夫决策过程(Markov decision process).如下图所示,这个决策过程可视化了状态之间的转移以及采取的动作. 
283 | 284 | ![6ca0c2b4b623c690c60a749dc1b21291.png](./assets/images/RL_simple_primer/6ca0c2b4b623c690c60a749dc1b21291.png) 285 | 286 | 287 | 且通过状态转移概率分布,我们可以揭示状态价值函数和动作价值函数之间的联系了 288 | 289 | + 在使用策略$`\pi`$时,**状态$`S`$的价值等于在该状态下基于策略$`\pi`$采取所有动作的概率与相应的价值相乘再求和的结果** 290 | ```math 291 | V_{\pi}(s) = \sum_{a \in A}^{}\pi (a|s)Q_\pi (s,a) 292 | ``` 293 | 294 | 我猜可能有读者会问怎么来的,简略推导如下 295 | ```math 296 | \begin{aligned} V_{\pi}(s) &= E_\pi [G_t|S_t = s] \\& = \sum_{a \in A}^{}\pi (a|s)E_\pi [G_t|S_t = s,A_t = a]\\& = \sum_{a \in A}^{}\pi (a|s)Q_\pi (s,a) \end{aligned} 297 | ``` 298 | 299 | + 而使用策略$`\pi`$时,在状态$`S`$下采取动作$`a`$!的价值等于当前奖励$`R(s,a)`$,加上经过衰减的所有可能的下一个状态的状态转移概率与相应的价值的乘积 300 | ```math 301 | Q_\pi (s,a) = R(s,a) + \gamma \sum_{s' \in S}^{}P(s'|s,a)V_\pi (s') 302 | ``` 303 | 针对这个公式 大部分资料都会一带而过,但不排除会有不少读者问怎么来的,考虑到**对于数学公式咱们不能想当然靠直觉的自认为**,所以还是得一五一十的推导下 304 | ```math 305 | \begin{aligned} Q_\pi (s,a) &= E[G_t|S_t = s,A_t = a] \\&= E[R_{t+1} + \gamma G_{t+1} | S_t =s,A_t = a] \\&= E[R_{t+1}|S_t = s,A_t = a] + \gamma E[ G_{t+1} | S_t =s,A_t = a] \\&= R(s,a) + \gamma \sum_{s'}^{} V_\pi (S_{t+1}) P[S_{t+1} = s' |S_t =s,A_t = a ] \\&= R(s,a) + \gamma \sum_{s'}^{} P_{ss'}^{a}V_\pi (s') \end{aligned} 306 | ``` 307 | 308 | 上述推导过程总共五个等式,其中,第三个等式到第四个等式依据的是 309 | ```math 310 | E[ G_{t+1} | S_t =s,A_t = a] = \sum_{s'}^{} V_\pi (S_{t+1}) P[S_{t+1} = s' |S_t =s,A_t = a ] 311 | ``` 312 | 至于第四个等式到第五个等式依据的是状态转移概率矩阵的定义 313 | ```math 314 | P_{ss'}^{a} = P(S_{t+1}=s'|S_t =s,A_t = a) 315 | ``` 316 | 317 | 接下来,把上面$`V_\pi (s)`$和$`Q_\pi (s,a)`$的计算结果互相代入,可得**马尔可夫决策的贝尔曼方程** 318 | ```math 319 | V_{\pi}(s) = \sum_{a \in A}^{}\pi (a|s)\left [ R(s,a) + \gamma \sum_{s' \in S}^{}P(s'|s,a)V_\pi (s'))\right ] 320 | ``` 321 | ```math 322 | Q_\pi (s,a) = R(s,a) + \gamma \sum_{s' \in S}^{}P(s'|s,a)\left [ \sum_{a' \in A}^{}\pi (a'|s')Q_\pi (s',a') \right ] 323 | ``` 324 | 325 | 上述过程可用下图形象化表示(配图来自文献21) 326 | 327 | ![ab200b2773a547bfa48e639956c52ca0.jpeg](./assets/images/RL_simple_primer/ab200b2773a547bfa48e639956c52ca0.jpeg) 328 | 329 | 330 | ## 第二部分 RL进阶之三大表格求解法:DP、MC、TD 331 | 332 | ## 2.1 动态规划法 333 | 334 | ### 2.1.1 什么是动态规划 335 | 336 | 上文简单介绍过动态规划,其核心思想在于复杂问题的最优解划分为多个小问题的最优解的求解问题,就像递归一样,且子问题的最优解会被储存起来重复利用 337 | 338 | 举个例子,输入两个整数n和sum,从数列1,2,3.......n 中随意取几个数,使其和等于sum,要求将其中所有的可能组合列出来. 339 | 340 | 注意到取n,和不取n个区别即可,考虑是否取第n个数的策略,可以转化为一个只和前n-1个数相关的问题. 341 | 342 | + 如果取第n个数,那么问题就转化为“取前n-1个数使得它们的和为sum-n”,对应的代码语句就是sumOfkNumber(sum - n, n - 1); 343 | + 如果不取第n个数,那么问题就转化为“取前n-1个数使得他们的和为sum”,对应的代码语句为sumOfkNumber(sum, n - 1) 344 | 345 | 所以其关键代码就是 346 | 347 | ```cpp 348 | list1.push_front(n); //典型的01背包问题 349 | SumOfkNumber(sum - n, n - 1); //“放”n,前n-1个数“填满”sum-n 350 | list1.pop_front(); 351 | SumOfkNumber(sum, n - 1); //不“放”n,前n-1个数“填满”sum 352 | ``` 353 | 354 | > 其实,这是一个典型的0-1背包问题,其具体描述为:有$`N`$)件物品和一个容量为$`V`$的背包.放入第$`i`$件物品耗费的费用是$`C_i`$(也即占用背包的空间容量),得到的价值是$`W_i`$,求解将哪些物品装入背包可使价值总和最大. 355 | > 356 | > 简单分析下:这是最基础的背包问题,特点是每种物品仅有一件,可以选择放或不放.用子问题定义状态:即$`F[i,v]`$表示前$`i`$件物品恰放入一个容量为$`v`$的背包可以获得的最大价值 357 | > 358 | > 对于“将前$`i`$件物品放入容量为$`v`$的背包中”这个子问题,若只考虑第$`i`$件物品的策略(放或不放),那么就可以转化为一个只和前$`i-1`$件物品相关的问题.即:  359 | >+ 如果不放第$`i`$件物品,那么问题就转化为“前$`i-1`$件物品放入容量为$`v`$的背包中”,价值为$`F[i-1,v]`$; 360 | >+ 如果放第$`i`$件物品,那么问题就转化为“前$`i-1`$件物品放入剩下的容量为$`v-C_i`$的背包中”,此时能获得的最大价值就是$`F[i-1,v-C_i]`$再加上通过放入第i件物品获得的价值$`W_i`$ 361 | > 362 | >  则其状态转移方程便是: 363 | ```math 364 | F[i,v]=max\{F[i-1,v],F[i-1,v-C_i]W_i\} 365 | ``` 366 | 367 | 通过上述这两个个例子,相信你已经看出一些端倪,具体而言,动态规划一般也只能应用于有最优子结构的问题.最优子结构的意思是局部最优解能决定全局最优解(对有些问题这个要求并不能完全满足,故有时需要引入一定的近似).简单地说,问题能够分解成子问题来解决. 
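下面给一个上述0-1背包状态转移方程的极简Python实现示意(自底向上填表;示例数据为笔者自拟,仅作演示),注意"放第$`i`$件"这一分支是$`F[i-1,v-C_i]`$再加上$`W_i`$:

```python
def knapsack_01(C, W, V):
    """0-1背包:C[i]、W[i] 为第 i 件物品的费用与价值,V 为背包总容量"""
    n = len(C)
    # F[i][v] 表示"前 i 件物品、容量为 v"这个子问题的最大价值
    F = [[0] * (V + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for v in range(V + 1):
            F[i][v] = F[i - 1][v]                     # 不放第 i 件
            if v >= C[i - 1]:                         # 放第 i 件:F[i-1][v-Ci] 加上 Wi
                F[i][v] = max(F[i][v], F[i - 1][v - C[i - 1]] + W[i - 1])
    return F[n][V]

print(knapsack_01(C=[2, 3, 4], W=[3, 4, 5], V=5))     # 输出 7(选费用为 2、3 的两件)
```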
368 | 369 | 动态规划算法一般分为以下4个步骤: 370 | 371 | 1. 描述最优解的结构 372 | 2. 递归定义最优解的值 373 | 3. 按自底向上的方式计算最优解的值   //此3步构成动态规划解的基础. 374 | 4. 由计算出的结果构造一个最优解      //此步如果只要求计算最优解的值时,可省略 375 | 376 | 换言之,动态规划方法的最优化问题的俩个要素:最优子结构性质,和子问题重叠性质 377 | 378 | + 最优子结构 379 | 如果问题的最优解所包含的子问题的解也是最优的,我们就称该问题具有最优子结构性质(即满足最优化原理).意思就是,总问题包含很多个子问题,而这些子问题的解也是最优的. 380 | + 重叠子问题 381 | 子问题重叠性质是指在用递归算法自顶向下对问题进行求解时,每次产生的子问题并不总是新问题,有些子问题会被重复计算多次.动态规划算法正是利用了这种子问题的重叠性质,对每一个子问题只计算一次,然后将其计算结果保存在一个表格中,当再次需要计算已经计算过的子问题时,只是在表格中简单地查看一下结果,从而获得较高的效率 382 | 383 | 更多请参看此文(上面阐述什么是DP的内容就来自此文):[通俗理解动态规划:由浅入深DP并解决LCS问题(23年修订版)](https://blog.csdn.net/v_JULY_v/article/details/6110269 "通俗理解动态规划:由浅入深DP并解决LCS问题(23年修订版)") 384 | 385 | ### 2.1.2 通过动态规划法求解最优策略 386 | 387 | 如果你之前没接触过RL,你确实可能会认为DL只存在于数据结构与算法里,实际上 388 | 389 | + 最早在1961年,有人首次提出了DP与RL之间的关系 390 | + 1977年,又有人提出了启发式动态规划,强调连续状态问题的梯度下降法 391 | + 再到1989年,Watkins明确的将RL与DP联系起来,并将这一类强化学习方法表征为增量动态规划 392 | 393 | 下面,我们考虑如何求解最优策略$`v_*(s)`$ 394 | 395 | 1. 首先,最优策略可以通过最大化$`q_\pi (s,a)`$找到 396 | ```math 397 | Q_\pi (s,a) = R(s,a) + \gamma \sum_{s' \in S}^{}P(s'|s,a)V_\pi (s') 398 | ``` 399 | 2. 当$`a= argmax \left \{ Q_*(s,a) \right \}`$时,$`\pi _*(a|s) = 1`$ 400 | 401 | 综合上述两点,可得 402 | ```math 403 | v_{*}(s) = max \left \{ R(s,a) + \gamma \sum_{s' \in S}^{}P(s'|s,a)V_\pi (s')) \right \} 404 | ``` 405 | 406 | 另,考虑到 407 | 408 | ```math 409 | \begin{aligned}R(s,a) &= E[R_{t+1} | S_t = s,A_t = a] \\&=\sum_{s'\in S}^{}p(s',r|s,a) \sum_{r\in R}^{}r \end{aligned} 410 | ``` 411 | 412 | 故也可以如sutton的RL一书上,这样写满足贝尔曼最优方程的价值函数$`V_*(s)`$ 413 | 414 | ```math 415 | \begin{aligned} v_*(s) &= max E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t =s,A_t =a] \\&= max \sum_{s',r}^{}p(s',r|s,a) [r + \gamma v_*(s')] \end{aligned} 416 | ``` 417 | 418 | 相当于当知道奖励函数和状态转换函数时,便可以根据下一个状态的价值来更新当前状态的价值,意味着可以把计算下一个可能状态的价值当成一个子问题,而把计算当前状态的价值看做当前问题,这不刚好就可以用DP来求解了 419 | 420 | 于是,sutton的RL一书上给出了DP求解最优策略的算法流程 421 | > 1.初始化 422 | ```math 423 | 对s\in S,任意设定V(s)\in \mathbb{R}以及\pi (s) \in A(s) 424 | ``` 425 | > 2.策略评估 426 | ```math 427 | \begin{array}{l}循环:\\\,\,\,\,\Delta \leftarrow 0 \\\,\,\,\,对每一个s\in S循环:\\\,\,\,\,\,\,\,\,\,\,\, v \leftarrow V(s) \\\,\,\,\,\,\,\,\,\,\,\, V(s) \leftarrow \sum_{s',r}^{} p(s',r|s,\pi (s)) [r + \gamma V(s')] \\\,\,\,\,\,\,\,\,\,\,\,\Delta \leftarrow max(\Delta ,|v - V(s)|) \\直到\Delta <\theta (一个决定估计精度的小正数) \end{array} 428 | ``` 429 | > 3.策略改进 430 | ```math 431 | \begin{array}{l}policy-stable \leftarrow true \\对每一个s\in S:\\\,\,\,\,\,old-actiton \leftarrow \pi (s) \\\,\,\,\,\,\pi (s) \leftarrow argmax_{(a)} \left \{ \sum_{s',r}^{}p(s',r|s,a) [r + \gamma V(s')] \right \}\\\,\,\,\,\,如果old-action \neq \pi (s),那么policy-stable \leftarrow false \\如果policy-stable为true,那么停止并返回V \approx v_*以及\pi \approx \pi _*;否则跳转到2 \end{array} 432 | ``` 433 | 434 | ## 2.2 蒙特卡洛法 435 | 436 | 蒙特卡洛(monte carlo,简称MC)方法,也称为统计模拟方法,就是通过大量的随机样本来估算或近似真实值,比如近似估算圆的面经、近似定积分、近似期望、近似随机梯度 437 | 438 | 比如先看估算圆的面积,如下图 439 | 440 | ![d3366d92a8ba432b9480a55c7bbc7895.png](./assets/images/RL_simple_primer/d3366d92a8ba432b9480a55c7bbc7895.png) 441 | 442 | 可以通过这个式子来近似计算:圆的面积/ 正方形的面积 = 圆中点的个数/正方形中点的个数 443 | 444 | 类似的,我们也可以用蒙特卡洛方法来估计一个策略在一个马尔可夫决策过程中的状态价值.考虑到 一个状态的价值是它的期望回报,那么如果我们用策略在MDP上采样很多条序列,然后计算从这个状态出发的回报再求其期望是否就可以了?好像可行!公式如下: 445 | 446 | ```math 447 | V_\pi (s) = E_\pi [G_t|S_t = s] = \frac{1}{N} \sum_{i=1}^{N}G_{t}^{(i)} 448 | ``` 449 | 450 | 再看下如何估算定积分的值『如果忘了定积分长啥样的,可以通过RL所需数学基础的其中一篇笔记《[概率统计极简入门:通俗理解微积分/期望方差/正态分布前世今生(23修订版)](https://blog.csdn.net/v_JULY_v/article/details/8308762 "概率统计极简入门:通俗理解微积分/期望方差/正态分布前世今生(23修订版)")》回顾下,比如积分可以理解为由无数个无穷小的面积组成的面积S』 451 | 452 | 
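在接着看定积分的估算之前,不妨先用几行代码把上面“随机撒点估算圆面积”的思路落地(一个极简示意,撒点次数可自行调整):

```python
import random

def estimate_pi(n=100_000):
    """在[0,1]x[0,1]的正方形内随机撒n个点,落入四分之一单位圆内的比例约为π/4"""
    inside = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:        # 点落在四分之一圆内
            inside += 1
    return 4.0 * inside / n             # 圆的面积/正方形的面积 = 圆中点的个数/正方形中点的个数

print(estimate_pi())                    # 撒点越多,结果越接近3.14159...
```

撒的点越多,估计值就越接近真实值,这正是蒙特卡洛方法“通过大量随机样本来估算或近似真实值”的含义。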
![445315a773e54b1ba101fee85904fcee.png](./assets/images/RL_simple_primer/445315a773e54b1ba101fee85904fcee.png)

如上图,我们可以通过随机取4个点,然后类似求矩形面积那样(底x高),从而用4个矩形面积$`f(x_i)(b-a)`$的平均值来估算定积分$`\int_{a}^{b}f(x)dx`$的值,为让对面积的估算更准确,我们可以取更多的点,比如取$`N`$个点,当$`N \rightarrow \infty`$时

```math
\int_{a}^{b}f(x)dx = \lim_{N\rightarrow \infty } \frac{1}{N}(b-a)\sum_{i=1}^{N}f(x_i)
```

接下来

1. 假设令$`q(x) = \begin{cases} \frac{1}{b-a} & x\in [a,b] \\ 0 & \text{其他} \end{cases},\\ 且f^*(x) = \begin{cases} \frac{f(x)}{q(x)} & \text{ if } q(x) \neq 0 \\ 0 & \text{ if } q(x) = 0 \end{cases}`$

2. 且考虑到对于连续随机变量$`X`$,其概率密度函数为$`p(x)`$,期望为$`E[X] = \int_{-\infty }^{\infty }xp(x)dx`$,则有
```math
\int_{a}^{b}f(x)dx = \int_{a}^{b}f^*(x)q(x)dx = E_{x\sim q(x)}[f^*(x)]
```

跟蒙特卡洛方法关联的还有一个重要性采样,不过,暂不急,在第四部分用到时再讲.

## 2.3 时序差分法及与DP、MC的区别

当面对状态价值函数的求解时

```math
\begin{aligned} V_{\pi}(s) &= E_\pi [G_t|S_t = s] \\& = E_\pi [R_{t+1} + \gamma G_{t+1} | S_t = s] \\& = E_\pi [R_{t+1} + \gamma V_\pi (S_{t+1}) | S_t = s] \end{aligned}
```

上述公式总共三个等式
+ 动态规划(DP)会把上述第三个等式的估计值作为目标,之所以说是“估计值”,并不是因为期望算不出来(已知环境模型时期望可以直接计算),而是因为真实的$`v_\pi (S_{t+1})`$是未知的,所以只能使用当前的估计值$`V_\pi (S_{t+1})`$来替代
```math
V(S_t) \leftarrow E_\pi [R_{t+1} + \gamma V(S_{t+1})]
```
且DP求解状态$`S_t`$的状态值函数时,需要利用所有后续状态$`S_{t+1}`$
```math
V_{\pi}(s) = \sum_{a \in A}^{}\pi (a|s)\left [ r(s,a) + \gamma \sum_{s' \in S}^{}P(s'|s,a)V_\pi (s')\right ]
```
+ 蒙特卡洛方法(MC)会把上述第一个等式的估计值作为目标,毕竟第一个等式中的期望值是未知的,所以我们用样本回报来代替实际的期望回报
但MC求解状态$`S_t`$的状态值函数时,需要等一个完整序列结束,因为只有到此时,$`G_t`$才是已知的
```math
V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]
```

+ 而时序差分(TD)呢,它既像MC那样通过采样来近似上述第一个等式中的期望值,又像DP那样用上述第三个等式中当前的估计值$`V`$来替代真实值$`v_\pi`$
且TD每过一个time step就利用奖励$`R_{t+1}`$和值函数$`V(S_{t+1})`$更新一次(当然,这里所说的one-step TD方法,也可以两步一更新,三步一更新….)
考虑到$`G_t \approx R_{t+1} + \gamma V(S_{t+1})`$,可得
```math
V(S_t) \leftarrow V(S_t) + \alpha \left [ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right ]
```
此更新法也叫TD(0)法,或者一步时序差分法,
$`R_{t+1} + \gamma V(S_{t+1})`$被称为TD目标,
$`\delta = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)`$被称为TD误差

**TD与DP一致的是**,时序差分方法也无需等待交互的最终结果,而是基于下一个时刻的收益和估计值就可以更新当前状态的价值函数,
不需像MC那样等到N步以后(即等一个完整序列结束后)才能更新$`V(S_t)`$
就像不同学生做题,有的学生是TD派:做完一题就问老师该题做得对不对,然后下一题即更新做题策略;有的学生是MC派:做完全部题才问老师所有题做得对不对,然后下一套试卷才更新做题策略

**TD与DP不一致的是**,TD俗称无模型的RL算法,不需要像DP那样事先知道环境的奖励函数和状态转移函数(和MC一样,可以直接从与环境互动的经验中学习策略,事实上,很多现实环境中,其MDP的状态转移概率无从得知)

总之,TD结合了DP和MC,**与DP一致的点恰与MC不一致,与DP不一致的点恰又与MC一致**,某种意义上来说,结合了前两大方法各自的优点,从而使得在实际使用中更灵活,具体而言如下图所示

![47088532ca0ef7bde5bae8849728ac8f.png](./assets/images/RL_simple_primer/47088532ca0ef7bde5bae8849728ac8f.png)

顺带再举一个例子,好比行军打仗时,为了得到更好的行军路线,将军派一人前去探路
+ MC的做法相当于一条道走到黑,没走个10公里不回头
+ DP相当于所有道比如10条道,每条道都走个1公里,不错过任何一条可能成为最好道的可能,最后10条道都走完1公里后才返回汇报/反馈
+ TD则相当于先选一条道走个1公里即返回汇报/反馈,之后再走下一条道的1公里

为承上启下、更为总结,再说一个问题,即七月ChatGPT课群里有学员提问:校长,在A2C算法中,优势函数的计算$`A(s,a) =Q(s,a) - V(s)`$,其中这个$`Q(s,a)`$和$`V(s)`$是由神经网络模拟出来的吗

> 关于什么是优势函数,下文会具体阐述,咱们就用上文的知识来一步步推导
> 因
> $`\begin{aligned} Q_\pi (s,a) &= E[G_t|S_t = s,A_t = a] \\&= E[R_{t+1} + \gamma G_{t+1} | S_t =s,A_t = a] \\&= E[R_{t+1}|S_t = s,A_t = a] + \gamma E[ G_{t+1} | S_t =s,A_t = a] \\&= R(s,a) + \gamma \sum_{s'}^{} V_\pi (S_{t+1}) P[S_{t+1} = s' |S_t =s,A_t = a ] \\&= R(s,a) + \gamma \sum_{s'}^{} P_{ss'}^{a}V_\pi (s') \end{aligned}`$
> 从而求解$`Q`$时,可以用实际奖励$`R`$加上(衰减后的)$`V`$来估计,而$`V`$通过critic网络学习(比如通过蒙特卡洛或时序差分),
> 最终$`A(s, a) = Q(s,a) - V(s)`$
> 相当于如春天所说,实践中只需要$`V(s)`$用神经网络来实现就行,因为$`Q(s,a)`$已经可以被$`V(s)`$和$`R`$表示了,不需要再另外实现.


## 2.4 RL的分类:基于模型与不基于模型(Value-based/Policy-based)

根据问题求解思路、方法的不同,我们可以将强化学习分为

![9726cb99f9af48bc8ac545ace05804e7.png](./assets/images/RL_simple_primer/9726cb99f9af48bc8ac545ace05804e7.png)
+ 基于模型的强化学习(Model-based RL),可以简单地使用动态规划求解,任务可定义为预测和控制,预测的目的是评估当前策略的好坏,即求解状态价值函数$`V_\pi (S)`$,控制的目的则是寻找最优策略$`\pi ^*`$和$`V_*(s)`$
在这里“模型”的含义是对环境进行建模,具体而言,是否已知其$`P`$和$`R`$,即$`p(s'|s,a)`$和$`R(s,a)`$的取值

$`\rightarrow`$如果有对环境的建模,
那么智能体便可以在执行动作前得知状态转移的情况即$`p(s'|s,a)`$和奖励$`R(s,a)`$,也就不需要实际执行动作收集这些数据;

$`\rightarrow`$ 否则便需要进行采样,通过与环境的交互得到下一步的状态和奖励,然后仅依靠采样得到的数据更新策略

+ 无模型的强化学习(Model-free RL),又分为
**基于价值**的强化学习(Value-based RL),其会学习并贪婪地选择值最大的动作,即$`a =\underset{a}{\arg \max}\ Q(s,a)`$,最经典的便是off-policy模式的Q-learning和on-policy模式的SARSA,一般得到的是确定性策略,下文第三部分重点介绍.

**基于策略**的强化学习(Policy-based RL),其对策略直接进行建模$`\pi (s,a)`$并优化,一般得到的是随机性策略,下文第四部分会重点介绍.至于两类方法在“选动作”上的直观差别,可见下面的小例子
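这个小例子只做一件事:同一个状态下,基于价值的方法按Q值贪婪地取动作(确定性策略),基于策略的方法按参数化的策略分布采样动作(随机性策略)。其中Q值与策略参数均为随意设定的示例数值,仅作示意,并非某个具体算法的实现:

```python
import numpy as np

Q = np.array([1.0, 2.5, 0.3])               # 价值学习:某状态下3个动作的Q值(示例数值)
greedy_action = int(np.argmax(Q))            # 贪婪地取Q值最大的动作 -> 确定性策略

theta = np.array([0.2, 1.0, -0.5])           # 策略学习:直接参数化策略(示例数值)
pi = np.exp(theta) / np.exp(theta).sum()     # softmax得到动作概率分布 π(a|s)
sampled_action = np.random.choice(3, p=pi)   # 按概率分布采样动作 -> 随机性策略

print(greedy_action, pi.round(3), sampled_action)
```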
549 | ![84f59c98f4b14321acb72db7b8459cee.png](./assets/images/RL_simple_primer/84f59c98f4b14321acb72db7b8459cee.png) 550 | 551 | ## 第三部分 价值学习:从n步Sarsa算法到Q-learning、DQN 552 | 553 | ## 3.1 TD(0)控制/Sarsa(0)算法与TD(n)控制/n步Sarsa算法 554 | 555 | 既然上文可以用时序差分来估计状态价值函数,那是否可以用类似策略迭代的方法来评估动作价值函数呢?毕竟在无模型的RL问题中,动作价值函数比状态价值函数更容易被评估 556 | 557 | 如果用类似TD(0)控制的思路寻找最优的动作价值函数并提取出最优策略,便被称作Sarsa(0)算法,所以,Sarsa所做出的改变很简单,它将原本时序差分方法更新$`V`$的过程,变成了更新$`Q`$,即可以如下表达 558 | ```math 559 | Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)] 560 | ``` 561 | 562 | 此外,上文说过,“TD每过一个time step就利用奖励$`R_{t+1}`$和值函数$`V(S_{t+1})`$更新一次,当然,这里所说的one-step TD 方法,也可以两步一更新,三步一更新”,这个所谓的多步一更新我们便称之为N步时序差分法 563 | 564 | ![61c34bd9692048c1858665730c1cad15.jpeg](./assets/images/RL_simple_primer/61c34bd9692048c1858665730c1cad15.jpeg) 565 | 566 | 首先,我们先回复下回报公式的定义,即为(根据前几项可以看出:$`\gamma`$的上标加$`t+1`$即为$`R`$的下标,反过来,当最后一项$`R`$的下标$`T`$确定后,自然便可以得出$`\gamma`$的上标为$`T -t -1`$ 567 | ```math 568 | G_t = R_{t+1} + \gamma R_{t+2} + \gamma ^2 R_{t+3}+\cdots + \gamma ^{T-t-1}R_T 569 | ``` 570 | 从而有 571 | + 单步回报:$`G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})`$,即为TD(0)控制$`/Sarsa(0)`$算法 572 | 573 | + 两步回报:$`G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma ^2V_{t+1}(S_{t+2})`$ 574 | 575 | + n步回报:$`G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma ^{n-1}R_{t+n} + \gamma ^nV_{t+n-1}(S_{t+n})`$ 576 | 此时,类似于TD(0)预测 577 | ```math 578 | V(S_t) \leftarrow V(S_t) + \alpha \left [ R_{t+1} + \gamma V{S_{t+1}} - V(S_t) \right ] 579 | ``` 580 | 581 | 有以下状态值更新规则 582 | ```math 583 | V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \left [ G_{t:t+n} - V_{t+n-1}(S_t) \right ], 0\leq t < T 584 | ``` 585 | 而对于其他任意状态$`s(s\neq S_t)`$的价值估计保持不变:$`V_{t+n}(S) = V_{t+n-1}(S)`$ 586 | 587 | 类似的,当用n步时序差分的思路去更新$`Q`$函数则就是所谓的n步Sarsa算法,当我们重新根据动作价值的估计定义如下的b步方法的回报为 588 | 589 | ```math 590 | G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma ^{n-1}R_{t+n} + \\\gamma ^nQ_{t+n-1}(S_{t+n},A_{t+n}) \\ n\geq 1,0\leq t < T-n 591 | ``` 592 | 593 | 如此,便可以得到如下$`Q`$的更新方式 594 | ```math 595 | Q_{t+n}(S_t) = Q_{t+n-1}(S_t,A_t) + \alpha \left [ G_{t:t+n} - Q_{t+n-1}(S_t,A_t) \right ],\\0\leq t < T 596 | ``` 597 | 598 | ## 3.2 Q-learning 599 | 600 | ### 3.2.1 重要性采样:让同策略完成到异策略的转变 601 | 602 | Q-learning介绍得闲阐明一个问题,即所谓的同策略学习与异策略学习 603 | 604 | 首先,先来明确两个概念: 605 | 606 | + 行动遵循的行动策略与被评估的目标策略是同一个策略(如果要学习的智能体和与环境交互的智能体是相同的),则称之为同策略,比如上文介绍的Sarsa 607 | 608 | + 行动遵循的行动策略和被评估的目标策略是不同的策略(或如果要学习的智能体和与环境交互的智能体不是相同的),则称之为异策略,比如即将介绍的Q-learning 609 | 610 | 而异策略就是基于**重要性采样**的原理实现的(但反过来,*不是说只要采用了重要性采用,就一定是异策略*,比如下文将讲的PPO算法),即通过使用另外一种分布,来逼近所求分布的一种方法 611 | 612 | 但具体怎么操作呢?为说明怎么变换的问题,再举一个例子. 613 | 614 | > 假设有一个函数$`f(x)`$,$`x`$需要从分布$`p`$中采样,应该如何怎么计算$`f(x)`$的期望值呢? 615 | > 616 | > 如果分布$`p`$不能做积分,那么只能从分布$`p`$尽可能多采样更多的$`x^{i}`$,然后全都代入到$`f(x)`$,按照蒙特卡洛方法的原则取它的平均值就可以得到近似$`f(x)`$的期望值: 617 | ```math 618 | \mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^N f(x^i) 619 | ``` 620 | > 当不能在分布$`p`$中采样数据,而只能从另外一个分布!$`q`$中去采样数据时($`q`$可以是任何分布),就需要做些变换,如下三步 621 | > 622 | > 1. 首先,期望值$`\mathbb{E}_{x \sim p}[f(x)]`$的另一种写法是$`\int f(x) p(x) \mathrm{d}x`$,对其进行变换,如下式所示 623 | ```math 624 | \begin{aligned}\int f(x) p(x) \mathrm{d}x&=\int f(x) \frac{p(x)}{q(x)} q(x) \mathrm{d}x \\&=\mathbb{E}_{x \sim q}[f(x){\frac{p(x)}{q(x)}}]\end{aligned} 625 | ``` 626 | > 627 | > 2. 整理下可得(左边是分布$`p`$,右边是分布$`q`$): 628 | ```math 629 | \mathbb{E}_{x \sim p}[f(x)]=\mathbb{E}_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right] 630 | ``` 631 | > 632 | > 3. 
如此,便就可以从$`q`$里面采样 $`x`$,再计算$`f(x) \frac{p(x)}{q(x)}`$,再取期望值 633 | > 634 | > 所以就算我们不能从$`p`$里面采样数据,但只要能从$`q`$里面采样数据,就可以计算从$`p`$采样$`x`$,然后代入$`f`$以后的期望值 635 | 636 | ### 3.2.2 Sarsa算法与Q-learning更新规则的对比 637 | 638 | 和Sarsa(0)算法的更新规则 639 | ```math 640 | Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t)] 641 | ``` 642 | 有点像,Q-learning的动作价值函数更新规则如下 643 | ```math 644 | Q\left(S_{t}, A_{t}\right) = Q\left(S_{t}, A_{t}\right)+\alpha\left[R_{t+1}+\gamma \max _{a'} Q\left(S_{t+1}, a'\right)-Q\left(S_{t}, A_{t}\right)\right] 645 | ``` 646 | 647 | 啥意思呢,一步步来看 648 | 1. Q学习有两种策略:目标策略和行为策略 649 | 目标策略是我们需要学习的策略,一般用$`\pi`$表示,直接在Q表格上使用贪心策略,取它下一步能得到的所有状态,即 650 | ```math 651 | \pi\left(s_{t+1}\right) = \underset{a^{\prime}}{\arg \max}\sim Q\left(s_{t+1}, a^{\prime}\right) 652 | ``` 653 | 行为策略 $`\mu`$可以是一个随机的策略,与环境交互(采集轨迹或数据),但我们采取 $`\varepsilon-`$贪心策略,让行为策略不至于是完全随机的,它是基于Q表格逐渐改进的 654 | 655 | 2. 我们可以构造 Q学习目标,Q学习的下一个动作都是通过arg max 操作选出来 的 (不管行为策略怎么探索、去哪探索,反正就是取奖励最大化下的策略),于是我们可得 656 | ```math 657 | \begin{aligned} R_{t+1}+\gamma Q\left(S_{t+1}, A^{\prime}\right) &=R_{t+1}+\gamma Q\left(S_{t+1},\arg \max ~Q\left(S_{t+1}, a^{\prime}\right)\right) \\ &=R_{t+1}+\gamma \max _{a^{\prime}} Q\left(S_{t+1}, a^{\prime}\right) \end{aligned} 658 | ``` 659 | 660 | 再次总结一下其与Sarsa的区别 661 | 662 | + 在Sarsa算法中,新动作用于更新动作价值函数,并且用于下一时刻的执行工作,这意味着行动策略与目标策略属于同一个策略 663 | + 但在Q-learning算法中,使用确定性策略选出的新动作只用于动作价值函数,而不会被真正执行,当动作价值函数更新后,得到新状态,并基于新状态由$`\varepsilon-`$贪心策略选择得到执行行动,这意味着行动策略与目标策略不属于同一个策略 664 | 665 | ## 3.3 DQN 待更 -------------------------------------------------------------------------------- /强化学习极简入门下:通俗理解MDP、DP MC TC和Q学习、策略梯度、PPO.md: -------------------------------------------------------------------------------- 1 | ## 第四部分 策略学习:从策略梯度、Actor-Criti到TRPO、PPO算法 2 | 3 | ## 4.1 策略梯度与其突出问题:采样效率低下 4 | 5 | 本节推导的核心内容参考自Easy RL教程等资料(但修正了原教程上部分不太准确的描述,且为让初学者更好懂,补充了大量的解释说明和心得理解,倪老师则帮拆解了部分公式)。 6 | 7 | 另,都说多一个公式则少一个读者,本文要打破这点,虽然本节推导很多,但每一步推导都有介绍到,不会省略任何一步推导,故不用担心看不懂(对本文任何内容有任何问题,都欢迎随时留言评论)。 8 | 9 | ### 4.1.1 什么是策略梯度和梯度计算/更新的流程 10 | 11 | 策略梯度的核心算法思想是: 12 | + 参数为$`\theta`$的策略$`\pi_{\theta }`$接受状态,输出动作概率分布,在动作概率分布中采样动作,执行动作(形成运动轨迹$`\tau`$),得到奖励$`r`$,跳到下一个状态 13 | + 在这样的步骤下,可以使用策略$`\pi`$收集一批样本,然后使用梯度下降算法学习这些样本,不过当策略$`\pi`$的参数更新后,这些样本不能继续被使用,还要重新使用策略$`\pi`$与环境互动收集数据 14 | 15 | 比如REINFORCE算法便是常见的策略梯度算法,类似下图所示(下图以及本节大部分配图/公式均来自easy RL教程) 16 | 17 | ![](./assets/images/RL_simple_primer/9dcf9cefeb844a5a93b2b6fd38bf5a80.png) 18 | 19 | 接下来,详细阐述。首先,我们已经知道了策略函数可以如此表示:$`a = \pi _{\theta }(s)`$ 20 | 21 | 其中,$`\pi _{\theta}`$可以理解为一个我们所熟知的神经网络 22 | + 当你对神经网络有所了解的话,你一定知道通过梯度下降求解损失函数的极小值(忘了的,可以复习下:首先通过正向传播产生拟合值,与标签值做“差”计算,产生误差值,然后对误差值求和产生损失函数,最后对损失函数用梯度下降法求极小值,而优化的对象就是神经网络的参数$`\theta`$ 23 | 24 | + 类比到πθ这个问题上,现在是正向传播产生动作,然后动作在环境中产生奖励值,通过奖励值求和产生评价函数,此时可以针对评价函数做梯度上升(gradient ascent),毕竟能求极小值,便能求极大值,正如误差能最小化,奖励/得分就能最大化 25 | 26 | 如何评价策略的好坏呢? 
27 | 28 | 假设机器人在策略$`\pi_{\theta }`$的决策下,形成如下的运动轨迹(类似你玩三国争霸时,你控制角色在各种不同的游戏画面/场景/状态下作出一系列动作,而当完成了系统布置的某个任务时则会得到系统给的奖励,如此,运动轨迹用$`\tau`$表示,从而$`\tau`$表示为一个状态$`s`$、动作$`a`$、奖励值$`r`$不断迁移的过程) 29 | 30 | $`\tau = (s_{1},a_{1},r_{1},s_{2},a_{2},r_{2},...,s_{t},a_{t},r_{t})`$ 31 | 32 | > 可能有读者注意到了,既然奖励是延后的,$`s_t$,$a_t`$后的奖励怎么用$`r_t`$而非$`r_{t+1}`$呢,事实上,sutton RL书上用$`S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,\cdots,S_t,A_t,R_{t+1}`$表示整条轨迹,其实这样更规范,但考虑到不影响大局和下文的推导,本笔记则暂且不细究了 33 | 34 | 给定智能体或演员的策略参数$`\theta`$,可以计算某一条轨迹$`\tau`$发生的概率为『轨迹$`\tau`$来源于在特定的环境状态下采取特定动作的序列,而特定的状态、特定的动作又分别采样自智能体的动作概率分布$`p_{\theta }(a_{t}|s_{t})`$、状态的转换概率分布$`p(s_{t+1}|s_t,a_t)`$ 35 | 36 | $`\begin{aligned} p_{\theta}(\tau) &=p\left(s_{1}\right) p_{\theta}\left(a_{1} | s_{1}\right) p\left(s_{2} | s_{1}, a_{1}\right) p_{\theta}\left(a_{2} | s_{2}\right) p\left(s_{3} | s_{2}, a_{2}\right) \cdots \\ &=p\left(s_{1}\right) \prod_{t=1}^{T} p_{\theta}\left(a_{t} | s_{t}\right) p\left(s_{t+1} | s_{t}, a_{t}\right) \end{aligned}`$ 37 | 38 | 其中,有的资料也会把$`p_{\theta }(a_{t}|s_{t})`$写成为$`\pi _{\theta }(a_{t}|s_{t})`$,但由于毕竟是概率,所以更多资料还是写为$`p_{\theta }(a_{t}|s_{t})`$ 39 | 40 | 如何评价策略呢?这个策略评价函数为方便理解也可以称之为策略价值函数,就像上文的状态价值函数、动作价值函数,说白了,评估策略(包括状态、动作)的价值,就是看其因此得到的期望奖励 41 | 42 | 故考虑到期望的定义,由于每一个轨迹$`\tau`$ 都有其对应的发生概率,**对所有$`\tau`$出现的概率与对应的奖励进行加权最后求和**,即可得期望值: 43 | 44 | $`\bar{R}_{\theta}=\sum_{\tau} R(\tau) p_{\theta}(\tau)=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}[R(\tau)]`$ 45 | 46 | 上述整个过程如下图所示 47 | 48 | ![](./assets/images/RL_simple_primer/756685e2f07b494b99bc97f4ce0f4bf9.png) 49 | 50 | 通过上文已经知道,想让奖励越大越好,可以使用梯度上升来最大化期望奖励。而要进行梯度上升,先要计算期望奖励$`\bar{R}_{\theta}`$的梯度。 51 | 52 | 考虑对$`\bar{R}_{\theta}`$做梯度运算『再次提醒,忘了什么是梯度的,可以通过[一文通透优化算法:从梯度下降、SGD到牛顿法、共轭梯度(23修订版)](https://blog.csdn.net/v_JULY_v/article/details/81350035 "一文通透优化算法:从梯度下降、SGD到牛顿法、共轭梯度(23修订版)")复习下』 53 | 54 | $`\nabla \bar{R}_{\theta}=\sum_{\tau}{R}(\tau )\nabla \mathrm{p}_{\theta}(\tau )`$ 55 | 56 | 其中,只有$`p_{\theta}(\tau)`$与$`\theta`$有关。再考虑到$`\nabla f(x)=f(x)\nabla \log f(x)`$,可得 57 | 58 | $`\frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)}= \nabla\log p_{\theta}(\tau)`$ 59 | 60 | 从而进一步转化,可得$`\begin{aligned} \nabla \bar{R}_{\theta}&=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right] \end{aligned}`$,表示期望的梯度等于对数概率梯度的期望乘以原始函数 61 | 62 | > Em,怎么来的?别急,具体推导是 63 | > 64 | > $`\begin{aligned} \nabla \bar{R}_{\theta}&=\sum_{\tau} R(\tau) \nabla p_{\theta}(\tau)\\&=\sum_{\tau} R(\tau) p_{\theta}(\tau) \frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)} \\&= \sum_{\tau} R(\tau) p_{\theta}(\tau) \nabla \log p_{\theta}(\tau) \\ &=\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right] \end{aligned}`$ 65 | > 66 | > 上述推导总共4个等式3个步骤,其中,第一步 先分母分子都乘以一个$`p_{\theta}(\tau)`$,第二步 把$`\frac{\nabla p_{\theta}(\tau)}{p_{\theta}(\tau)}= \nabla \log p_{\theta}(\tau)`$代入计算,第三步 根据期望的定义$`E[X] = \sum_{i}^{}p_ix_i`$做个简单转换,此处的$`X`$就是$`R(\tau )`$ 67 | > 68 | > 此外,本文一读者在23年2.24日的留言说,还想了解$`\nabla f(x)=f(x)\nabla \log f(x)`$是怎么推导而来的,这个式子可以通过如下推导得到 69 | > 70 | > 首先,对函数$`f(x)`$取对数得: 71 | > 72 | > $`\log f(x)`$ 73 | > 74 | > 对上式求导数得: 75 | > 76 | > $`\frac{d}{dx}\log f(x) = \frac{1}{f(x)}\frac{d}{dx}`$ 77 | > 78 | > 将等式两边同乘以$`f(x)`$,得到: 79 | > 80 | > $`f(x) \frac{d}{dx} \log f(x) = \frac{d}{dx}`$ 81 | > 82 | > 这个等式表明,我们可以用$`\nabla \log f(x)`$来表示$`\nabla f(x)`$,即: 83 | > 84 | > $`\nabla f(x)=f(x)\nabla \log f(x)`$ 85 | 86 | 然不巧的是,期望值$`\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log 
p_{\theta}(\tau)\right]`$无法计算,按照蒙特卡洛方法近似求期望的原则,可以采样$`N`$条轨迹$`\tau`$并计算每一条轨迹的值,再把每一条轨迹的值加起来除以$`N`$取平均,即($`\tau^{n}`$上标$`n`$代表第$`n`$条轨迹,而$`a_{t}^{n}`$、$`s_{t}^{n}`$则分别代表第$`n`$条轨迹里时刻$`t`$的动作、状态) 87 | 88 | $`\begin{aligned} \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right] &\approx \frac{1}{N} \sum_{n=1}^{N} R\left(\tau^{n}\right) \nabla \log p_{\theta}\left(\tau^{n}\right) \\ &=\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} R\left(\tau^{n}\right) \nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right) \end{aligned}`$ 89 | 90 | > 任何必要的中间推导步骤咱不能省,大部分文章基本都是一笔带过,但本文为照顾初学者甚至更初级的初学者,$`\nabla \log p_{\theta}(\tau)`$中间的推导过程还是要尽可能逐一说明下: 91 | > 92 | > 1. 首先,通过上文中关于某一条轨迹$`\tau`$发生概率的定义,可得 93 | > 94 | > $`p_\theta (\tau ) = p(s_{1}) \prod_{t=1}^{T_{n}}p(s_{t+1}|s_t,a_t)p_{\theta }(a_{t}|s_{t})`$ 95 | > 96 | > 2. 然后两边都取对数,可得 97 | > 98 | > $`logp_\theta (\tau ) = logp(s_1)\prod_{t=1}^{T_{n}} p(s_{t+1}|s_t,a_t)p_{\theta }(a_{t}|s_{t})`$ 99 | > 100 | > 由于乘积的对数等于各分量的对数之和,故可得 101 | > 102 | > $`logp_\theta (\tau ) = logp(s_1) + \sum_{t=1}^{T_n}(logp(s_{t+1}|s_t,a_t) + logp_{\theta }(a_{t}|s_{t}))`$ 103 | > 104 | > 3. 接下来,取梯度可得 105 | > 106 | > $`\begin{aligned} \nabla \log p_{\theta}(\tau) &= \nabla \left(\log p(s_1)+ \sum_{t=1}^{T_n}\log p(s_{t+1}|s_t,a_t) + \sum_{t=1}^{T_n}\log p_{\theta}(a_t|s_t) \right) \\ &= \nabla \log p(s_1)+ \nabla \sum_{t=1}^{T_n}\log p(s_{t+1}|s_t,a_t) + \nabla \sum_{t=1}^{T_n}\log p_{\theta}(a_t|s_t) \\ &=\nabla \sum_{t=1}^{T_n}\log p_{\theta}(a_t|s_t)\\ &=\sum_{t=1}^{T_n} \nabla\log p_{\theta}(a_t|s_t) \end{aligned}`$ 107 | > 108 | > 上述过程总共4个等式,在从第2个等式到第3个等式的过程中,之所以消掉了 109 | > 110 | > $`\nabla \log p(s_1)+\nabla \sum_{t=1}^{T_n}{\log}p(s_{t+1}|s_t,a_t)`$ 111 | > 112 | > 是因为其与$`\theta`$无关(环境状态不依赖于策略),其对$`\theta`$的梯度为0。 113 | 114 | 完美!这就是所谓的**策略梯度定理**,我们可以直观地理解该公式 115 | 116 | $`\nabla \bar{R}_{\theta}=\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} R\left(\tau^{n}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)`$ 117 | 118 | 1. 即在采样到的数据里面,采样到在某一个状态$`s_t`$要执行某一个动作$`a_t`$,$`(s_t,a_t)`$是在整个轨迹$`\tau`$的里面的某一个状态和动作的对 119 | 2. 为了最大化奖励,假设在$`s_t`$执行$`a_t`$,最后发现$`\tau`$的奖励是正的,就要增加在$`s_t`$ 执行$`a_t`$的概率。反之,如果在$`s_t`$执行$`a_t`$会导致$`\tau`$的奖励变成负的, 就要减少在$`s_t`$执行$`a_t`$的概率 120 | 3. 最后,用梯度上升来更新参数,原来有一个参数$`\theta`$,把$`\theta`$加上梯度$`\nabla \bar{R}_{\theta}`$,当然要有一个学习率$`\eta`$(类似步长、距离的含义),学习率的调整可用 Adam、RMSProp等方法调整,即 121 | 122 | $`\theta \leftarrow \theta+\eta \nabla \bar{R}_{\theta}`$ 123 | 124 | > 有一点值得说明的是...,为了提高可读性,还是举个例子来说明吧。 125 | > 126 | > 比如到80/90后上大学时喜欢玩的另一个游戏CF(即cross fire,10多年前我在东华理工的时候也经常玩这个,另一个是DNF),虽然玩的是同一个主题比如沙漠战场,但你每场的发挥是不一样的,即便玩到同一个地方(比如A区埋雷的地方),你也可能会控制角色用不同的策略做出不同的动作,比如 127 | >+ 在第一场游戏里面,我们在状态$`s_1`$采取动作 $`s_1`$,在状态$`s_2`$采取动作 $`a_2`$。且你在同样的状态$`s_1`$下, 不是每次都会采取动作$`a_1`$的,所以我们要记录,在状态 $`s^1_1`$ 采取 $`a^1_1`$、在状态 $`s^1_2`$采取$`a^1_1`$等,整场游戏结束以后,得到的奖励是 $`R(\tau^1)`$ 128 | >+ 在第二场游戏里面,在状态$`s^2_1`$采取$`a^2_1`$,在状态 $`s^2_2`$采取$`a^2_2`$,采样到的就是$`\tau^2`$,得到的奖励是$`R(\tau^2)`$ 129 | > 这时就可以把采样到的数据用梯度计算公式把梯度算出来 130 | > 131 | > 1. 也就是把每一个$`s`$与$`a`$的对拿进来,计算在某一个状态下采取某一个动作的对数概率$`\log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)`$,对这个概率取梯度$`\nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)`$ 132 | > 2. 然后在梯度前面乘一个权重$`R\left(\tau^{n}\right)`$,权重就是这场游戏的奖励,这也是和一般分类问题的区别所在 133 | > 134 | > ![](./assets/images/RL_simple_primer/bad281ae107d49d3adf8aa2d012a08c1.png) 135 | > 136 | > 3. 
计算出梯度后,就可以通过$`\theta \leftarrow \theta+\eta \nabla \bar{R}_{\theta}`$更新模型了 137 | 138 | ### 4.1.2 避免采样的数据仅能用一次:重要性采样(为采样q解决p从而增加重要性权重) 139 | 140 | 然而策略梯度有个问题,在于$`\mathbb{E}_{\tau \sim p_{\theta}(\tau)}`$是对策略$`{\pi _{\theta}}`$采样的轨迹$`\tau`$求期望。一旦更新了参数,从$`\theta`$变成$`\theta'`$,在对应状态s下采取动作的概率$`p_\theta(\tau)`$就不对了,之前采样的数据也不能用了。 141 | 142 | 换言之,策略梯度是一个会花很多时间来采样数据的算法,其大多数时间都在采样数据。智能体与环境交互以后,接下来就要更新参数,我们只能更新参数一次,然后就要重新采样数据, 才能再次更新参数。 143 | 144 | ![](./assets/images/RL_simple_primer/0dbb490df24a8ddc53d81df6b09c9c76.png) 145 | 146 | 这显然是非常花时间的,怎么解决这个问题呢?为避免采样到的数据只能使用一次的问题,还记得上文介绍过的重要性采样否,使得 147 | 148 | > 1. 可以用另外一个策略$`\pi_{\theta'}`$与环境交互,用$`\theta'`$采样到的数据去训练$`\theta`$ 149 | > 2. 假设我们可以用$`\theta'`$采样到的数据去训练$`\theta`$,我们可以多次使用$`\theta'`$采样到的数据,可以多次执行梯度上升,可以多次更新参数$`\theta`$, 都只需要用$`\theta'`$采样到的同一批数据 150 | 151 | 故基于重要性采样的原则,我们可以用另外一个策略$`\pi _{\theta^{'}}`$,与环境做互动采样数据来训练$`\theta`$,从而间接计算$`R(\tau) \nabla \log p_{\theta}(\tau)`$,而当我们转用$`\theta'`$去采样数据训练$`\theta`$后 152 | 153 | 1. 只需在$`R(\tau) \nabla \log p_{\theta}(\tau)`$的基础上补上一个重要性权重:$`\frac{p_{\theta}(\tau)}{p_{\theta^{\prime}}(\tau)}`$,这个**重要性权重**针对某一个轨迹$`\tau`$用$`\theta`$算出来的概率除以这个轨迹$`\tau`$用$`\theta^{'}`$算出来的概率 154 | 2. 注意,上面例子中的$`p`$/$`q`$与此处的$`p_{\theta}(\tau)/p_{\theta^{\prime}}(\tau)`$没有任何联系,前者只是为了说明重要性权重的两个普通的分布而已 155 | 156 | 最终加上重要性权重之后,可得 157 | 158 | $`\nabla \bar{R}_{\theta}=\mathbb{E}_{\tau \sim p_{\theta^{\prime}(\tau)}}\left[\frac{p_{\theta}(\tau)}{p_{\theta^{\prime}}(\tau)} R(\tau) \nabla \log p_{\theta}(\tau)\right]`$ 159 | 160 | 怎么来的?完整推导如下 161 | 162 | $`\begin{aligned}\mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \nabla \log p_{\theta}(\tau)\right] &= \sum_{\tau} \left[R(\tau) \nabla \log p_{\theta}(\tau)\right]p_{\theta}(\tau) \\ &= \sum_{\tau} \left[\frac{p_{\theta}(\tau)}{p_{\theta}^\prime(\tau)}R(\tau) \nabla \log p_{\theta}(\tau)\right]p_{\theta}^\prime(\tau) \\ &= \mathbb{E}_{\tau \sim p_{\theta^{\prime}(\tau)}}\left[\frac{p_{\theta}(\tau)}{p_{\theta^{\prime}}(\tau)} R(\tau) \nabla \log p_{\theta}(\tau)\right] \\ & = \nabla \bar{R}_{\theta}\end{aligned}`$ 163 | 164 | ## 4.2 优势演员-评论家算法(Advantage Actor-Criti):为避免奖励总为正增加基线 165 | 166 | 梯度的计算好像差不多了?但实际在做策略梯度的时候,并不是给整个轨迹$`\tau`$都一样的分数,而是每一个状态-动作的对会分开来计算,但通过蒙特卡洛方法进行随机抽样的时候,可能会出问题,比如在采样一条轨迹时可能会出现 167 | 168 | + 问题1:所有动作均为正奖励 169 | + 问题2:出现**比较大的方差**(另,重要性采样时,采样的分布与当前分布之间也可能会出现比较大的方差,具体下一节详述) 170 | 171 | 对于第一个问题,举个例子,比如在某一一个状态,可以执行的动作有a、b、c,但我们可能只采样到动作b或者只采样到动作c,没有采样到动作a 172 | 173 | 1. 但不管采样情况如何,现在所有动作的奖励都是正的,所以采取a、b、c的概率都应该要提高 174 | 2. 可实际最终b、c的概率按预期提高了,但因为a没有被采样到,所以a的概率反而下降了 175 | 3. 
然而问题是a不一定是一个不好的动作,它只是没有被采样到 176 | 177 | ![](./assets/images/RL_simple_primer/bc1f965ef46e4ea693bb1950fd76d7e8.png) 178 | 179 | 为了解决奖励总是正的的问题,也为避免方差过大,需要在之前梯度计算的公式基础上加一个基准线$`b`$『此$`b`$指的baseline,非上面例子中的$`b`$,这个所谓的基准线$`b`$可以是任意函数,只要不依赖于动作$`a`$即可』 180 | 181 | $`\nabla \bar{R}_{\theta} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}}\left(R\left(\tau^{n}\right)-b\right) \nabla \log p_{\theta}\left(a_{t}^{n} \mid s_{t}^{n}\right)`$ 182 | 183 | 上面说$`b`$可以是任意函数,这个“任意”吧,对初学者而言可能跟没说一样,所以$`b`$到底该如何取值呢 184 | + $`b`$有一种选择是使用轨迹上的奖励均值,即 $`b=\frac{1}{T}\sum_{t=1}^{T}R_t(\tau)`$ 185 | 从而使得$`R(\tau)−b`$有正有负 186 | 当$`R(\tau)`$大于平均值$`b`$时,则$`R(\tau)−b`$为正,则增加该动作的概率 187 | 当$`R(\tau)`$小于平均值$`b`$时,则$`R(\tau)−b`$为负,则降低该动作的概率 188 | 如此,对于每条轨迹,平均而言,较好的50%的动作将得到奖励,避免所有奖励均为正或均为负,同时,也减少估计方差 189 | + $`b`$还可以是状态价值函数$`V_{\pi}(st)`$ 190 | 可曾还记得2.1节介绍过的所谓Actor-Criti算法(一般被翻译为演员-评论家算法) 191 | Actor学习参数化的策略即策略函数,Criti学习值函数用来评估状态-动作对,然后根据评估结果改进或更新策略 192 | 193 | 当然,Actor-Criti本质上是属于基于策略的算法,毕竟算法的目标是优化一个带参数的策略(实际用到PPO算法时,会计算一个策略损失),只是会额外学习价值函数(相应的,运用PPO算法时,也会计算一个价值损失),从而帮助策略函数更好的学习,而学习优势函数的演员-评论家算法被称为优势演员-评论家(Advantage Actor-Criti,简称A2C)算法 194 | 195 | 而这个$`R(\tau)-b`$一般被定义为优势函数$`A^{\theta}(s_t,a_t)`$,有几点值得注意: 196 | 197 | 1. 在考虑到评估动作的价值,就看其因此得到的期望奖励,故一般有$`A_\pi (s,a) = Q_\pi (s,a) - V_\pi (s)`$,此举意味着在选择一个动作时,根据该动作相对于特定状态下其他可用动作的执行情况来选择,而不是根据该动作的绝对值(由$`Q`$函数估计) 198 | 且通常我们**只学习$`V_\pi (s)`$「比如通过时序差分法估计」,然后通过$`V_\pi (s)`$与奖励的结合来估计$`Q_\pi`$,即$`Q_{\pi}=R+\gamma V\pi (st+1)`$ 199 | ,从而可得** 200 | $`A\pi (s,a)=Q\pi (s,a)−V\pi (s)=R+\gamma V\pi (st+1)−V\pi (s)`$ 201 | 202 | 2. **总之,$`A^{\theta }(s_{t},a_{t})`$要估测的是在状态$`s_{t}`$采取动作$`a_{t}`$是好的还是不好的:即 203 | $`→`$如果$`A^{\theta }(s_{t},a_{t})`$是正的(即大于0),意味着在状态 $`s_{t}`$ 采取动作 $`a_{t} `$获得的回报比在状态 $`s_{t}`$采取任何其他可能的动作获得的回报都要好,就要增加概率; 204 | $`→`$如果$`A^{\theta }(s_{t},a_{t})`$是负的(即小于0),意味着在状态 $`s_{t}`$ 采取动作 $`a_{t}`$ 得到的回报比其他动作的平均回报要差,要减少概率** 205 | 206 | 3. 
最终在更新梯度的时候,如下式所示『我们用演员$`\theta`$去采样出$`s_{t}`$跟$`a_{t}`$,采样出状态跟动作的对$`(s_{t},a_{t})`$,计算这个状态跟动作对的优势$`A^{\theta }(s_{t},a_{t})`$』 207 | ```math 208 | \mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta}}\left[A^{\theta}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)\right] 209 | ``` 210 | 211 | 212 | 进一步,由于$`A^{\theta}(s_t,a_t)`$是演员$`\theta`$与环境交互的时候计算出来的,基于重要性采样的原则,当从$`\theta`$换到$`\theta'`$的时候,就需要在 213 | ```math 214 | \mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta}}\left[A^{\theta}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)\right] 215 | ``` 216 | 217 | 基础上,$`A^{\theta}(s_t,a_t)`$变换成$`A^{\theta'}(s_t,a_t)`$,一变换便得加个重要性权重(即把$`s_{t}`$、$`a_{t}`$用$`\theta`$采样出来的概率除掉$`s_{t}`$、$`a_{t}`$用$`\theta^{'}`$采样出来的概率),公式如下『Easy RL纸书第1版上把下面公式中的$`A^{\theta'}(s_t,a_t)`$写成了$`A^{\theta}(s_t,a_t)`$』 218 | ```math 219 | \mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(s_{t}, a_{t}\right)}{p_{\theta^{\prime}}\left(s_{t}, a_{t}\right)} A^{\theta'}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)\right] 220 | ``` 221 | 222 | 接下来,我们可以拆解$`p_{\theta}(s_{t}, a_{t})`$和$`p_{\theta'}\left(s_{t}, a_{t}\right)`$,即 223 | ```math 224 | \begin{aligned} p_{\theta}\left(s_{t}, a_{t}\right)&=p_{\theta}\left(a_{t}|s_{t}\right) p_{\theta}(s_t) \\ p_{\theta'}\left(s_{t}, a_{t}\right)&=p_{\theta'}\left(a_{t}|s_{t}\right) p_{\theta'}(s_t) \end{aligned} 225 | ``` 226 | 227 | 于是可得公式 228 | ```math 229 | \mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} \frac{p_{\theta}\left(s_{t}\right)}{p_{\theta^{\prime}}\left(s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)\right] 230 | ``` 231 | 232 | 这里需要做一件事情,假设模型是$`\theta`$的时候,我们看到$`s_{t}`$的概率,跟模型是$`\theta^{'}`$的时候,看到$`s_{t}`$的概率是差不多的,即$`p_{\theta}(s_t)=p_{\theta'}(s_t)`$。 233 | 234 | > 为什么可以这样假设呢?一种直观的解释就是$`p_{\theta}(s_t)`$很难算,这一项有一个参数$`\theta`$,需要拿$`\theta`$去跟环境做互动,算$`s_{t}`$出现的概率。 尤其是如果输入是图片的话,同样的$`s_{t}`$根本就不会出现第二次。我们根本没有办法估这一项,所以就直接无视这个问题。 235 | > 236 | > 但是$`p_{\theta}(a_t|s_t)`$是很好算,我们有$`\theta`$这个参数,它就是个网络。我们就把$`s_{t}`$带进去,$`s_{t}`$就是游戏画面。 我们有个策略的网络,输入状态$`s_{t}`$,它会输出每一个$`a_{t}`$的概率。所以,我们只要知道$`\theta`$和$`\theta'`$的参数就可以算$`\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)}`$。 237 | 238 | 所以,实际上在更新参数的时候,我们就是按照下式来更新参数: 239 | 240 | $`\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)\right]`$ 241 | 242 | 从而最终可以从梯度$`\nabla f(x)=f(x) \nabla \log f(x)`$来反推目标函数,当使用重要性采样的时候,要去优化的目标函数如下式所示,把它记$`J^{\theta^{\prime}}(\theta)`$ 243 | 244 | $`J^{\theta^{\prime}}(\theta)=\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right]`$ 245 | 246 | 终于大功告成! 
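顺带给出这个目标函数在代码层面的一个极简示意(仅是便于理解的草图:假设`new_log_probs`由当前策略$`\theta`$对同一批$`(s_t,a_t)`$重新计算得到,`old_log_probs`与`advantages`来自旧策略$`\theta'`$采样时的记录,变量名均为示意、并非某个库的固定接口):

```python
import torch

def surrogate_loss(new_log_probs, old_log_probs, advantages):
    # 重要性权重 ratio = p_θ(a_t|s_t) / p_θ'(a_t|s_t),由对数概率之差取指数得到
    ratio = torch.exp(new_log_probs - old_log_probs)
    # J^{θ'}(θ) ≈ mean(ratio * A^{θ'});取负号后做梯度下降,等价于对原目标做梯度上升
    return -(ratio * advantages).mean()

# 示意用法:随意构造一小批数据
new_lp = torch.randn(8, requires_grad=True)  # 当前策略θ重新计算的 log p_θ(a_t|s_t)
old_lp = torch.randn(8)                      # 旧策略θ'采样时记录的 log p_θ'(a_t|s_t)
adv = torch.randn(8)                         # 旧策略下估计的优势 A^{θ'}(s_t, a_t)
loss = surrogate_loss(new_lp, old_lp, adv)
loss.backward()                              # 反向传播后即可用优化器更新θ
```

下文4.3与4.4节会在此基础上进一步加入KL散度约束或clip裁剪,以限制新旧策略之间的差距。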
247 | 248 | ## 4.3 基于信任区域的TRPO:加进KL散度解决两个分布相差大或步长难以确定的问题 249 | 250 | 好巧不巧,看似大功告成了,但重要性采样还是有个问题。具体什么问题呢,为更好的说明这个问题,我们回到上文的那个例子中。 251 | 252 | > 还是那两个分布:$`p`$、$`q`$,当不能从$`p`$里面很好的采样数据,而能从$`q`$里面很好的采样数据时,基于重要性采样的原则,虽然我们可以把$`p`$换成任何的$`q`$,但是在实现上,$`p`$和$`q`$的差距不能太大,差距太大,会出问题 253 | $`\mathbb{E}_{x \sim p}[f(x)]=\mathbb{E}_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]`$ 254 | > 255 | > 比如,虽然上述公式成立,但如果不是计算期望值,而是 256 | >+ 计算方差时$`Var_{x∼p}[f(x)]`$和$`Var_{x∼q}[f(x)\frac{p(x)}{q(x)}]`$是不一样的 257 | > 因为两个随机变量的平均值相同,并不代表它们的方差相同 258 | > 259 | > 此话怎讲?以下是推导过程: 260 | > 将$`f(x)`$、$`f(x) \frac{p(x)}{q(x)}`$分别代入方差的公式 261 | ```math 262 | Var[X]=E[X^2]−(E[X])^2 263 | ``` 264 | > 则分别可得(且考虑到不排除会有比初级更初级的初学者学习本文,故把第二个公式拆解的相对较细) 265 | ```math 266 | Var_{x \sim p}[f(x)]=\mathbb{E}_{x \sim p}\left[f(x)^{2}\right]-\left(\mathbb{E}_{x \sim p}[f(x)]\right)^{2} 267 | ``` 268 | ```math 269 | \begin{aligned} Var_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right] &=\mathbb{E}_{x \sim q}\left[\left(f(x) \frac{p(x)}{q(x)}\right)^{2}\right]-\left(\mathbb{E}_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]\right)^{2} \\ &= \int \left(f(x) \frac{p(x)}{q(x)}\right)^{2} q(x) dx - \left(\mathbb{E}_{x \sim p}[f(x)]\right)^{2} \\ &= \int f(x)^{2} \frac{p(x)}{q(x)} p(x)dx - \left(\mathbb{E}_{x \sim p}[f(x)]\right)^{2} \\ &=\mathbb{E}_{x \sim p}\left[f(x)^{2} \frac{p(x)}{q(x)}\right]-\left(\mathbb{E}_{x \sim p}[f(x)]\right)^{2} \end{aligned} 270 | ``` 271 | > 272 | > 上述两个公式前后对比,可以很明显的看出 273 | >后者的第一项多乘了$`p(x)q(x)`$,如果$`p(x)q(x)`$差距很大,$`f(x)p(x)q(x)`$的方差就会很大 274 | 275 | 所以结论就是,如果我们只要对分布$`p`$采样足够多次,对分布$`q`$采样足够多次,得到的期望值会是一样的。但是如果采样的次数不够多,会因为它们的方差差距可能是很大的,所以就可能得到差别非常大的结果。 276 | 277 | 这意味着什么呢,意味着我们目前得到的这个公式里 278 | ```math 279 | J^{\theta^{\prime}}(\theta)=\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right] 280 | ``` 281 | 282 | 如果$`p_\theta(a_t|s_t)`$与$`p_{\theta}'(a_t|s_t)`$相差太多,即这两个分布相差太多,重要性采样的结果就会不好。怎么避免它们相差太多呢?这就是TRPO算法所要解决的。 283 | 284 | 2015年John Schulman等人提出了信任区域策略优化(Trust Region Policy Opimization,简称TRPO),表面上,TRPO的出现解决了同时解决了两个问题,一个是解决重要性采样中两个分布差距太大的问题,一个是解决策略梯度算法中步长难以确定的问题 285 | 286 | + 关于前者,在1.2.2节得到的目标函数基础上(下图第一个公式),增加了一个KL散度约束(如下图第二个公式) 287 | 288 | $`J^{\theta^{\prime}}(\theta)=\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right]`$ 289 | 290 | $`\begin{aligned} J_{\mathrm{TRPO}}^{\theta^{\prime}}(\theta)=\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} | s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right],\mathrm{KL}\left(\theta, \theta^{\prime}\right)<\delta \end{aligned}`$ 291 | 292 | 至此采样效率低效的问题通过重要性采样(重要性权重)、以及增加KL散度约束解决了 293 | 294 | > KL散度(KL divergence),也称相对熵,而相对熵 = 交叉熵 - shannon熵,其衡量的是两个数据分布$`p`$和$`q`$之间的差异 295 | > 296 | > $`D_{KL}(P||Q) = E_x log\frac{P(x)}{Q(x)}`$ 297 | > 298 | > 下图左半边是一组原始输入的概率分布曲线$`p(x)`$,与之并列的是重构值的概率分布曲线$`q(x)`$,下图右半边则显示了两条曲线之间的差异 299 | > 300 | > ![](./assets/images/RL_simple_primer/046b440033c8296f2f9b0f1bf5c3190e.png) 301 | > 302 | > 顺带从零推导下KL散度的公式 303 | > 304 | > 1 所谓**概率**:对于$`x`$,可以定义概率分布为$`p(x)`$或$`q(x)`$ 305 | > 306 | > 2 所谓**信息**:对$`p(x)`$取对数,加符号得正值$`I(p) = -logp(x)`$,概率越高,包含的信息越小,因为事件越来越确定;相反,概率越低,包含的信息越多,因为事件具有很大的不确定性 307 | > 308 | > 3 
所谓**Shannon熵**(熵是信息的平均,直观上,Shannon熵是信息在同一分布下的平均):$`p(x)`$对$`I(p)`$平均,即 309 | > 310 | > $`\begin{aligned} H(p) &= E_{x\sim P} [I(p)] \\&= \sum p(x)I(p) \\&= - \sum p(x)logp(x) \end{aligned}`$ 311 | > 312 | > 4 所谓**交叉熵Cross-Entropy**(直观上,交叉熵是信息在不同分布下的平均),即指$`p(x)`$对$`I(q)`$平均,即 313 | > 314 | > $`\begin{aligned} H(p,q) &= E_{x\sim P} [I(q)] \\&= \sum p(x)I(q) \\&= - \sum p(x)logq(x) \end{aligned}`$ 315 | > 316 | > 5 所谓相对熵或**KL散度 = 交叉熵 - shannon熵**,即 317 | > 318 | > $`\begin{aligned} D_{KL}(p||q) &= H(p,q) - H(p) \\&= -\sum p(x)logq(x) + \sum p(x)logp(x) \\&= -\sum p(x)log\frac{q(x)}{p(x)} \\&= \sum p(x)log\frac{p(x)}{q(x)} \end{aligned}`$ 319 | > 320 | > 所以如果在KL散度表达式的最前面加个负号,再结合Jensen不等式自然有 321 | > 322 | > $`\begin{aligned} -D_{KL}(p||q) &= \sum p(x)log\frac{q(x)}{p(x)} \\& \leq log \sum p(x)\frac{q(x)}{p(x)} \\&= log1 \\&= 0 \end{aligned}`$ 323 | 324 | + 关于后者,具体而言,当策略网络是深度模型时,沿着策略梯度更新参数,很有可能由于步长太长,策略突然显著变差,进而影响训练效果 325 | 326 | 这是1.2.1节,我们已经得到的策略梯度计算、策略梯度更新公式如下(别忘了,学习率$`\eta`$类似步长、距离的含义)分别如下 327 | 328 | $`\nabla \bar{R}_{\theta}=\frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_{n}} R\left(\tau^{n}\right) \nabla \log p_{\theta}\left(a_{t}^{n} | s_{t}^{n}\right)`$ 329 | 330 | $`\theta \leftarrow \theta+\eta \nabla \bar{R}_{\theta}`$ 331 | 332 | 对这个问题,我们考虑在更新时找到一块信任区域(trust region),在这个区域上更新策略时能够得到安全性保证,这就是TRPO算法的主要思想 333 | 334 | 335 | 本质上,其实这两个问题是同一个问题(简言之,**避免两个分布相差大即意味着避免步长过大**)。举个例子,比如爬两面都是悬崖的山,左右脚交替往前迈步,无论哪只脚向前迈步都是一次探索 336 | 337 | 1. 为尽快到达山顶且不掉下悬崖,一方面 你会选择最陡峭的方向,二方面 你会用目光选择一片信任区域,从而尽量远离悬崖边,在信任区域中,首先确定要探索的最大步长(下图的黄色圆圈),然后找到最佳点并从那里继续搜索 338 | 2. 好,现在问题的关键变成了,怎么确定每一步的步长呢?如果每一步的步长太小,则需要很长时间才能到达峰值,但如果步长太大,就会掉下悬崖(像不像两个分布之间差距不能太大) 339 | 3. 具体做法是,从初始猜测开始可选地,然后动态地调整区域大小。例如,如果新策略和当前策略的差异越来越大,可以缩小信赖区域。怎么实现?KL散度约束! 340 | 341 | ![](./assets/images/RL_simple_primer/d7fff6ebbf9a42d3ba9c04e1c924d476.png) 342 | 343 | 总之,TRPO就是考虑到连续动作空间无法每一个动作都搜索一遍,因此大部分情况下只能靠猜。如果要猜,就最好在信任域内部去猜。而TRPO将每一次对策略的更新都限制了信任域内,从而极大地增强了训练的稳定性。 344 | 345 | 至此,PG算法的采样效率低下、步长难以确定的问题都被我们通过TRPO给解决了。但TRPO的问题在哪呢? 346 | 347 | TRPO的问题在于把 KL 散度约束当作一个额外的约束,没有放在目标里面,导致TRPO很难计算,总之因为信任域的计算量太大了,John Schulman等人于2017年又推出了TRPO的改进算法:PPO 348 | 349 | ## 4.4 近端策略优化PPO:解决TRPO的计算量大的问题 350 | 351 | ### 4.4.1 什么是近端策略优化PPO与PPO-penalty 352 | 353 | 如上所述,PPO算法是针对TRPO计算量的大的问题提出来的,正因为PPO基于TROP的基础上改进,故PPO也解决了策略梯度不好确定学习率Learningrate(或步长Stepsize)的问题。毕竟通过上文,我们已经得知 354 | 355 | 1. 如果stepsize过大,学出来的Policy会一直乱动,不会收敛;但如果StepSize太小,想完成训练,我们会等到地老天荒 356 | 2. 而PPO利用NewPolicy和OldPolicy的比例,限制了NewPolicy的更新幅度,让策略梯度对稍微大点的Stepsize不那么敏感 357 | 358 | 具体做法是,PPO算法有两个主要的变种:近端策略优化惩罚(PPO-penalty)和近端策略优化裁剪(PPO-clip),其中PPO-penalty和TRPO一样也用上了KL散度约束。 359 | 360 | 近端策略优化惩罚PPO-penalty的流程如下 361 | 362 | 1. 首先,明确目标函数,通过上节的内容,可知咱们需要优化$`J^{\theta'}(\theta)`$,让其最大化 363 | 364 | $`J^{\theta^{\prime}}(\theta)=\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} \mid s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} \mid s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right]`$ 365 | 366 | 2. 接下来,先初始化一个策略的参数$`\theta`$,在每一个迭代里面,我们用前一个训练的迭代得到的actor的参数$`\theta '`$与环境交互,采样到大量状态-动作对,根据$`\theta '`$交互的结果,估测$`A^{\theta '}(s_t,a_t)`$ 367 | 368 | 3. 
由于目标函数牵涉到重要性采样,而在做重要性采样的时候,$`p_\theta(a_t|s_t)`$不能与$`p_{\theta}'(a_t|s_t)`$相差太多,所以需要在训练的时候加个约束,这个约束就好像正则化的项一样,是$`\theta`$与$`\theta'`$输出动作的 KL散度,用于衡量$`\theta`$与$`theta'`$的相似程度,我们希望在训练的过程中,学习出的$`theta`$与$`theta'`$越相似越好 369 | 所以需要最后使用 PPO 的优化公式:$`\\J_{\mathrm{PPO}}^{\theta^{\prime}}(\theta)=J^{\theta^{\prime}}(\theta)-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)`$ 370 | 371 | 372 | 当然,也可以把上述那两个公式合二为一『如此可以更直观的看出,PPO-penalty把KL散度约束作为惩罚项放在了目标函数中(可用梯度上升的方法去最大化它),此举相对TRPO减少了计算量』 373 | 374 | $`J_{\mathrm{PPO}}^{\theta^{\prime}}(\theta)=\mathbb{E}_{\left(s_{t}, a_{t}\right) \sim \pi_{\theta^{\prime}}}\left[\frac{p_{\theta}\left(a_{t} \mid s_{t}\right)}{p_{\theta^{\prime}}\left(a_{t} \mid s_{t}\right)} A^{\theta^{\prime}}\left(s_{t}, a_{t}\right)\right] \quad-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)`$ 375 | 376 | 377 | 上述流程有一个细节并没有讲到,即$`\beta`$是怎么取值的呢,事实上,$`\beta`$是可以动态调整的,故称之为自适应KL惩罚(adaptive KL penalty),具体而言 378 | 379 | * 先设一个可以接受的 KL 散度的最大值$`KL_{max}`$ 380 | 假设优化完$`J_{\mathrm{PPO}}^{\theta^{\prime}}(\theta)=J^{\theta^{\prime}}(\theta)-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)`$以后,**KL 散度值太大导致$`KL(\theta,\theta^{\prime})> KL_{max}`$,意味着$`\theta`$与$`\theta'`$差距过大(即学习率/步长过大)**,也就代表后面惩罚的项$`\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)`$惩罚效果太弱而没有发挥作用,**故增大惩罚把$`\beta`$增大** 381 | 382 | * 再设一个 KL 散度的最小值$`KL_{min}`$ 383 | 如果优化完$`J_{\mathrm{PPO}}^{\theta^{\prime}}(\theta)=J^{\theta^{\prime}}(\theta)-\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)`$以后, 384 | KL散度值比最小值还要小导致$`KL(\theta ,\theta ^{\prime})< KL_{max}`$,意味着$`\theta`$与$`\theta'`$差距过小,也就代表后面惩罚的项$`\beta \mathrm{KL}\left(\theta, \theta^{\prime}\right)`$惩罚效果太强了,我们担心它只优化后一项,使$`\theta`$与$`\theta'`$一样,这不是我们想要的,所以减小惩罚即减小$`\beta`$ 385 | 386 | > 关于$`\beta`$具体怎么设置的?除了上面提到的自适应KL惩罚(adaptive KL penalty),来自2017年发表的PPO论文 387 | > 388 | > ![](./assets/images/RL_simple_primer/2c3b63b5cb414b438df7a89e14dae8a4.png) 389 | > 390 | > 另外,2019年发表的论文《[Fine-Tuning Language Models from Human Preferences](https://arxiv.org/pdf/1909.08593 "Fine-Tuning Language Models from Human Preferences")》『*也是本博客内另一篇文章“[ChatGPT相关技术论文100篇](https://blog.csdn.net/v_JULY_v/article/details/129508065 "ChatGPT相关技术论文100篇")”中提到的第56篇,另这是论文对应的代码:[微调GPT2](https://github.com/openai/lm-human-preferences "微调GPT2")*』,其中也提到了根据 KL\_target 自适应调节$`\beta`$的算法,这个方法已经被 TRLX/TRL实现 391 | > 392 | > ![](./assets/images/RL_simple_primer/c3f775519533445db1120f7dc79d4ba1.png) 393 | > 394 | > 更多训练细节可以看下instructGPT论文原文 395 | > 396 | > ![](./assets/images/RL_simple_primer/dcf2240f8a56451089a314ffe0c6fc08.png) 397 | 398 | 总之,近端策略优化惩罚可表示为 399 | 400 | $`\begin{aligned} &J_{\text{PPO}}^{\theta'}(\theta)=J^{\theta'}(\theta)-\beta \text{KL}\left(\theta, \theta'\right) \\ &J^{\theta'}(\theta) \approx \sum_{\left(s_{t}, a_{t}\right)} \frac{p_{\theta}\left(a_{t} \mid s_{t}\right)}{p_{\theta'}\left(a_{t} \mid s_{t}\right)} A^{\theta'}\left(s_{t}, a_{t}\right)\end{aligned}`$ 401 | 402 | ### 4.4.2 PPO算法的另一个变种:近端策略优化裁剪PPO-clip 403 | 404 | 如果觉得计算 KL散度很复杂,则还有一个**PPO2**算法,即**近端策略优化裁剪PPO-clip**。近端策略优化裁剪的目标函数里面没有 KL 散度,其要最大化的目标函数为(easy RL上用$`\theta ^k`$代替$`\theta '`$,为上下文统一需要,本笔记的文字部分统一用$`\theta '`$) 405 | 406 | $`\begin{aligned} J_{\mathrm{PPO2}}^{\theta'}(\theta) \approx \sum_{\left(s_{t}, a_{t}\right)} \min &\left(\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta'}\left(a_{t} | s_{t}\right)} A^{\theta'}\left(s_{t}, a_{t}\right),{clip}\left(\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta'}\left(a_{t} | s_{t}\right)}, 1-\varepsilon, 1+\varepsilon\right) A^{\theta'}\left(s_{t}, 
a_{t}\right)\right) \end{aligned}`$ 407 | 408 | 整个目标函数在$`min`$这个大括号里有两部分,最终对比两部分那部分更小,就取哪部分的值,这么做的本质目标就是为了让$`p_{\theta }(a_{t}|s_{t})`$和$`p_{\theta'}(a_{t}|s_{t})`$可以尽可能接近,不致差距太大。 409 | 410 | 换言之,这个裁剪算法和KL散度约束所要做的事情本质上是一样的,都是为了让两个分布之间的差距不致过大,但裁剪算法相对好实现,别看看起来复杂,其实代码很好写 411 | 412 | ```python 413 | // ratios即为重要性权重,exp代表求期望,括号里的environment_log_probs代表用于与环境交互的策略 414 | ratios = torch.exp(log_probs - environment_log_probs) 415 | 416 | // 分别用sur_1、sur_2来计算公式的两部分 417 | // 第一部分是重要性权重乘以优势函数 418 | sur_1 = ratios * advs 419 | 420 | // 第二部分是具体的裁剪过程 421 | sur_2 = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advs 422 | 423 | // 最终看谁更小则取谁 424 | clip_loss = -torch.min(sur_1,sur_2).mean() 425 | ``` 426 | 427 | 回到公式,公式的第一部分我们已经见过了,好理解,咱们来重点分析公式的第二部分 428 | 429 | $`{clip}\left(\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta'}\left(a_{t} | s_{t}\right)}, 1-\varepsilon, 1+\varepsilon\right) A^{\theta'}\left(s_{t}, a_{t}\right)`$ 430 | 431 | * 首先是$`{clip}`$括号里的部分,用一句话简要阐述下其核心含义就是:如果$`{p_{\theta}(a_{t}|s_{t})}`$和$`{p_{\theta'}(a_{t}|s_{t})}`$之间的概率比落在范围$`(1- \varepsilon)`$和$`(1+\varepsilon)`$之外,$`\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta'}\left(a_{t} | s_{t}\right)}`$将被剪裁,使得其值最小不小于$`(1- \varepsilon)`$,最大不大于$`(1+\varepsilon)`$ 432 | 433 | ![](./assets/images/RL_simple_primer/1798baf5dba54e21a19508e82d407a8a.png) 434 | 435 | * 然后是$`{clip}`$括号外乘以$`A^{\theta '}(s_t,a_t)`$,如果$`A^{\theta '}(s_t,a_t)`$大于0,则说明这是好动作,需要增大$`p_{\theta }(a_{t}|s_{t})`$,但$`\frac{p_{\theta}(a_{t}|s_{t})}{p_{\theta'}(a_{t}|s_{t})}`$最大不能超过$`(1+\varepsilon)`$ 436 | 437 | 如果$`A^{\theta '}(s_t,a_t)`$小于0,则说明该动作不是好动作, 438 | 需要减小$`p_{\theta }(a_{t}|s_{t})`$,但$`\frac{p_{\theta }(a_{t}|s_{t})}{p_{\theta'}(a_{t}|s_{t})}`$最小不能小过$`(1-\varepsilon)`$ 439 | 440 | > 最后把公式的两个部分综合起来,针对整个目标函数 441 | 442 | ```math 443 | \begin{aligned} J_{\mathrm{PPO2}}^{\theta'}(\theta) \approx \sum_{\left(s_{t}, a_{t}\right)} \min &\left(\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta'}\left(a_{t} | s_{t}\right)} A^{\theta'}\left(s_{t}, a_{t}\right),{clip}\left(\frac{p_{\theta}\left(a_{t} | s_{t}\right)}{p_{\theta'}\left(a_{t} | s_{t}\right)}, 1-\varepsilon, 1+\varepsilon\right) A^{\theta'}\left(s_{t}, a_{t}\right)\right) \end{aligned} 444 | ``` 445 | 446 | > 如果$`A^{\theta'}(s_t,a_t)`$大于0且$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}`$大于$`(1+\epsilon)`$ 447 | > 则相当于第二部分是$`(1+\epsilon)×A^{\theta'}(s_t,a_t)`$,和第一部分$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}*A^{\theta'}(s_t,a_t)`$对比取更小值当然是$`(1+\epsilon)`$的截断值: $`(1+\epsilon )*A^{\theta \prime}(s_t,a_t)`$ 448 | 449 | > 如果$`A^{\theta'}(s_t,a_t)'大于0且\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}`$小于$`(1-\epsilon)`$ 450 | 则相当于第二部分是$`(1-\epsilon)*A^{\theta'}(s_t,a_t)`$,和第一部分$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}*A^{\theta'}(s_t,a_t)`$对比取更小值当然是原函数值: $`\frac{p_{\theta}(a_t|s_t)}{p_{\theta \prime}(a_t|s_t)}*A^{\theta \prime}(s_t,a_t)`$ 451 | > ![](./assets/images/RL_simple_primer/0bb3ab43b467ce1071d28a89537abc9c.png) 452 | > 453 | > 反之,**如果 $`A^{\theta \prime}(s_t,a_t)`$ 小于0,则最终目标函数的取值为了更小则和 $`A^{\theta \prime}(s_t,a_t)`$大于0时反过来**,毕竟加了个负号自然一切就不同了,为方便初学者一目了然,咱们还是把计算过程列出来,即 454 | >+ 如果$`A^{\theta'}(s_t,a_t)`$小于0且$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}`$大于$`(1+\epsilon)`$ 455 | 则相当于第二部分是$`(1+\epsilon)*A^{\theta'}(s_t,a_t)`$,和第一部分$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}*A^{\theta'}(s_t,a_t)`$对比 456 | 取更小值当然是原函数值: $`\frac{p_{\theta}(a_t|s_t)}{p_{\theta \prime}(a_t|s_t)}*A^{\theta \prime}(s_t,a_t)`$ 457 | > 458 | >+ 
如果$`A^{\theta'}(s_t,a_t)`$小于0且$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}`$小于$`(1-\epsilon)`$ 459 | 则相当于第二部分是$`(1-\epsilon)×A^{\theta'}(s_t,a_t)`$,和第一部分$`\frac{p_{\theta}(a_t|s_t)}{p_{\theta'}(a_t|s_t)}*A^{\theta'}(s_t,a_t)`$对比取更小值当然是$`(1-\epsilon)`$的截断值: $`(1-\epsilon )*A^{\theta \prime}(s_t,a_t)`$ 460 | 461 | ### 4.4.3 PPO算法的一个简单实现:对话机器人 462 | 463 | 综上,PPO算法是一种具体的Actor-Critic算法实现,比如在对话机器人中,输入的prompt是state,输出的response是action,想要得到的策略就是怎么从prompt生成action能够得到最大的reward,也就是拟合人类的偏好。具体实现时,可以按如下两大步骤实现 464 | 465 | 1. **首先定义4个模型:Actor(action\_logits)、SFT(sft\_logits)、Critic(value)、RM「r(x, y)」,和kl\_div、reward、优势函数adv** 466 | 从prompt库中采样出来的prompt在经过SFT(微调过GPT3/GPT3.5的模型称之为SFT)做generate得到一个response,这个『prompt + response』定义为sequence(这个采样的过程是批量采样进行generate,得到一个sequence buffer),然后这个sequence buffer的内容做batched之后输入给4个模型做inference 467 | 468 | ![](./assets/images/RL_simple_primer/1ecb75833281415497f94e0cbe0279bd.png) 469 | 470 | 这4个模型分别为Actor、SFT、Critic、RM,其中: 471 | Actor和SFT都是175B的模型,且Actor参数由SFT初始化(SFT是baseline),Actor输出action\_logits,SFT输出sft\_logits 472 | sft\_logits和action\_logits做kl\_div,为了约束actor模型的更新step不要偏离原始模型SFT太远 473 | 474 | Critic和RM是6B的模型,Critic参数由RM初始化 475 | Critic输出标量value,RM输出标量r(x, y),由r(x, y)和kl\_div计算得到reward, 476 | 477 | reward和value计算得到adv 478 | 2. **其次,通过pg\_loss和value\_loss优化迭代** 479 | Actor的流程是取出sequence,然后inference生成新的logits,再和sequence对应的之前的logits计算ratio,和adv计算出pg\_loss,也就是actor的loss,然后反向传播,优化器迭代 480 | Critic的流程是取出sequence,然后inference得到新的value,和old\_value做**clip\_value,再和reward计算value loss**,然后反向传播,优化器迭代 481 | 482 | ![](./assets/images/RL_simple_primer/7d5efb19d49a44cbbdf7bf50da94712d.png) 483 | 484 | 485 | 至于代码实现可以参阅此文:[类ChatGPT逐行代码解读(2/2):从零实现ChatLLaMA、ColossalChat、DeepSpeed Chat](https://blog.csdn.net/v_JULY_v/article/details/129996493 "类ChatGPT逐行代码解读(2/2):从零实现ChatLLaMA、ColossalChat、DeepSpeed Chat") 486 | 487 | ## 后记 488 | 489 | 1.6日决定只是想写个ChatGPT通俗导论,但为了讲清楚其中的PPO算法,更考虑到之后还会再写一篇强化学习极简入门,故中途花了大半时间去从头彻底搞懂RL,最终把网上关于RL的大部分中英文资料都翻遍之外(详见参考文献与推荐阅读),还专门买了这几本书以系统学习 490 | 491 | + 第1本,《白话强化学习与pytorch》,偏通俗,对初学者友好,缺点是部分内容混乱,且对部分前沿/细节比如PPO算法阐述不足,毕竟19年出版的了 492 | + 第2本,《EazyRL强化学习教程》,基于台大李宏毅等人的公开课,虽有不少小问题且其对MDP和三大表格求解法的阐述比较混乱,但其对策略梯度和PPO的阐述于初学入门而言还不错的 493 | + 第3本,《动手学强化学习》,张伟楠等人著,概念讲解、公式定义都很清晰,且配套代码实现,当然 如果概念讲解能更细致则会对初学者更友好 494 | + 第4本,《深度强化学习》,王树森等人著,偏原理讲解(无代码),此书对于已对RL有一定了解的是不错的选择 495 | + 第5本,《强化学习2》,权威度最高但通俗性不足,当然 也看人,没入门之前你会觉得此书晦涩,入门之后你会发现经典还是经典、不可替代,另书之外可配套七月的强化学习2带读课 496 | + 第6本,《深度强化学习:基于Python的理论与实践》,理论讲的挺清楚,代码实践也不错 497 | + 第7本,《强化学习(微课版)》,这本书是作为RL教材出版的,所以有教材的特征,即对一些关键定理会举例子展示实际计算过程,比如马尔可夫决策的计算过程,对初学者友好 498 | 499 | 总之,RL里面的概念和公式很多(相比ML/DL,RL想要机器/程序具备更好的自主决策能力),而 500 | 501 | + 一方面,绝大部分的资料没有站在初学者的角度去通俗易懂化、没有把概念形象具体化、没有把公式拆解举例化(如果逐一做到了这三点,何愁文章/书籍/课程不通俗) 502 | + 二方面,不够通俗的话,则资料一多,每个人的公式表达习惯不一样便会导致各种形态,虽说它们最终本质上都是一样的,可当初学者还没有完全入门之前,就不一定能一眼看出背后的本质了,然后就不知道该以哪个为准,从而丧失继续前进的勇气(这种情况下,要么硬着头皮继续啃 可能会走弯路,要么通过比如七月在线的课程问老师或人 提高效率少走弯路) 503 | 504 | 比如一个小小的策略梯度的计算公式会有近10来种表达,下面特意贴出其中6种,供读者辨别 505 | 506 | 第一种,本笔记和Easy RL中用的 507 | ```math 508 | \nabla \bar{R}_{\theta}=\frac{1}{N}\sum_{n=1}^N{\sum_{t=1}^{T_n}{R}}\left( \tau ^n \right) \nabla \log p_{\theta}\left( a_{t}^{n}|s_{t}^{n} \right) 509 | ``` 510 | 第二种,Sutton强化学习《Reinforcement Learning: An Introduction》第一版中用的 511 | 512 | ```math 513 | \nabla _{\theta}J(\pi _{\theta})=\sum_s^{}{d^{\pi}}(s)\sum_a^{}{\nabla _{\theta}}\pi _{\theta}(a|s)Q^{\pi}(s,a)=E_{\pi}[\gamma ^t\nabla _{\theta}log\pi _{\theta}(A_t|S_t)Q^{\pi}(S_t,A_t)] 514 | ``` 515 | 516 | 其中 517 | $` 518 | d\pi (s)=\sum_{t=0}^{\infty}{\gamma ^t}Pr(s_0\rightarrow s,t,\pi )=\sum_{t=0}^{\infty}{\gamma 
^t}Pr\left\{ S_t=s|s_0,\pi \right\} 519 | `$叫做discounted state distribution 520 | 521 | 第三种,David sliver在2014年的《Deterministic Policy Gradient Algorithm》论文中用的 522 | 523 | ```math 524 | \nabla _{\theta}J(\pi _{\theta})=\int_S^{}{\rho ^{\pi}}(s)\int_A^{}{\nabla _{\theta}}\pi _{\theta}(a|s)Q^{\pi}(s,a)=E_{s\sim \rho ^{\pi},a\sim \pi _{\theta}}[\nabla _{\theta}log\pi _{\theta}(a|s)Q^{\pi}(s,a)] 525 | ``` 526 | 其中,$`\rho ^{\pi}(s)`$与上述$`d\pi (s)`$相同,都是discounted state distribution。 527 | 528 | 第四种,肖志清《强化学习:原理与Python实现》中用的 529 | 530 | 531 | ```math 532 | \nabla _{\theta}J(\pi _{\theta})=E[\sum_{t=0}^{\infty}{\gamma ^t}\nabla _{\theta}log\pi _{\theta}(A_t|S_t)Q^{\pi}(S_t,A_t)] 533 | ``` 534 | 第五种,Sutton强化学习在2018年的第二版中用的 535 | 536 | ```math 537 | \nabla _{\theta}J(\pi _{\theta})\propto \sum_S^{}{\mu ^{\pi}}(s)\sum_a^{}{\nabla _{\theta}}\pi _{\theta}(a|s)Q^{\pi}(s,a)=E_{\pi}[\nabla _{\theta}log\pi _{\theta}(A_t|S_t)Q^{\pi}(S_t,A_t)] 538 | 539 | ``` 540 | 541 | 其中, 542 | ```math 543 | \mu ^{\pi}(s)=\lim_{t\rightarrow \propto} Pr(S_t=s|s_0,\pi _{\theta}) 544 | ``` 545 | 是stationary distribution (undiscounted state distribution) 546 | 547 | 第六种,Open AI spinning up教程中用的 548 | 549 | ```math 550 | \nabla _{\theta}J(\pi _{\theta})=E_{(\tau \sim \pi )}[\sum_{t=0}^T{\nabla _{\theta}}log\pi _{\theta}(a_t|s_t)Q^{\pi}(s_t,a_t)] 551 | 552 | ``` 553 | 554 | ## 参考文献与推荐阅读 555 | 556 | 1. 关于强化学习入门的一些基本概念 557 | 2. 《白话强化学习与Pytorch》,此书让我1.6日起正式开启RL之旅,没看此书之前,很多RL资料都不好看懂 558 | 3. 《EazyRL强化学习教程》,基于台大李宏毅和UCLA周博磊等人的RL公开课所编著,其[GitHub](https://github.com/datawhalechina/easy-rl "GitHub")、[其在线阅读地址](https://datawhalechina.github.io/easy-rl/#/ "其在线阅读地址") 559 | 4. 《动手学强化学习》,张伟楠等人著 560 | 5. 台大李宏毅RL公开课,这是其:[视频/PPT/PDF下载地址](http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS18.html "视频/PPT/PDF下载地址") 561 | 6. UCLA周博磊RL公开课,这是其[:GitHub地址](https://github.com/zhoubolei/introRL ":GitHub地址") 562 | 7. 关于Q-leaning的几篇文章:1[如何用简单例子讲解 Q - learning 的具体过程?](https://www.zhihu.com/question/26408259 "如何用简单例子讲解 Q - learning 的具体过程?")2[莫烦:什么是 Q-Learning](https://zhuanlan.zhihu.com/p/24808797 "莫烦:什么是 Q-Learning") 563 | 8. AlphaGo作者之一David Silver主讲的[增强学习笔记1](https://zhuanlan.zhihu.com/p/50478310 "增强学习笔记1")、[笔记2](https://www.zhihu.com/column/reinforce "笔记2"),另附其讲授的《UCL Course on RL》的[课件地址](https://www.davidsilver.uk/teaching/ "课件地址") 564 | 9. huggingface的两篇RL教程:[An Introduction to Deep Reinforcement Learning](https://huggingface.co/blog/deep-rl-intro "An Introduction to Deep Reinforcement Learning")、[GitHub:The Hugging Face Deep Reinforcement Learning Course](https://github.com/huggingface/deep-rl-class "GitHub:The Hugging Face Deep Reinforcement Learning Course") 565 | 566 | 10. TRPO原始论文:[Trust Region Policy Optimization](https://arxiv.org/pdf/1502.05477.pdf "Trust Region Policy Optimization") 567 | 11. **PPO原始论文**:[Proximal Policy Optimization Algorithms](https://arxiv.org/pdf/1707.06347.pdf "Proximal Policy Optimization Algorithms") 568 | 12. PPO算法解读(英文2篇):解读1 [RL — Proximal Policy Optimization (PPO) Explained](https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12 "RL — Proximal Policy Optimization (PPO) Explained")、解读2[Proximal Policy Optimization (PPO)](https://huggingface.co/blog/deep-rl-ppo "Proximal Policy Optimization (PPO)") 569 | 13. 
PPO算法解读(中文4篇):[Easy RL上关于PPO的详解](https://datawhalechina.github.io/easy-rl/#/chapter5/chapter5 "Easy RL上关于PPO的详解")、[详解近端策略优化](https://www.cnblogs.com/xingzheai/p/15931681.html "详解近端策略优化")、[详解深度强化学习 PPO算法](https://zhuanlan.zhihu.com/p/88525394?utm_id=0 "详解深度强化学习 PPO算法")、[ChatGPT第二弹:PPO算法](https://mp.weixin.qq.com/s?__biz=MjM5ODkzMzMwMQ==&mid=2650435573&idx=2&sn=427f3ad2cb6ab5120686652fd9a64b8f&chksm=becde9af89ba60b9a6a4306b126b32deff8a78788208fd431a40f89736390c6883f35dc0b71d&mpshare=1&scene=23&srcid=0208gSWRajh40YzsiJ2YqN2B&sharer_sharetime=1675828126672&sharer_shareid=8dff0e13d79dbe85e759d04101e63bbf#rd "ChatGPT第二弹:PPO算法") 570 | 14. PPO算法实现:[GitHub - lvwerra/trl: Train transformer language models with reinforcement learning.](https://github.com/lvwerra/trl "GitHub - lvwerra/trl: Train transformer language models with reinforcement learning.") 571 | 15. [如何选择深度强化学习算法?MuZero/SAC/PPO/TD3/DDPG/DQN/等](http://www.deeprlhub.com/d/166-muzerosacppotd3ddpgdqn "如何选择深度强化学习算法?MuZero/SAC/PPO/TD3/DDPG/DQN/等") 572 | 16. [如何通俗理解隐马尔可夫模型HMM?](https://www.julyedu.com/questions/interview-detail?kp_id=30&cate=NLP&quesId=2765 "如何通俗理解隐马尔可夫模型HMM?") 573 | 17. [HMM学习最佳范例](https://www.52nlp.cn/hmm%E5%AD%A6%E4%B9%A0%E6%9C%80%E4%BD%B3%E8%8C%83%E4%BE%8B%E5%85%A8%E6%96%87pdf%E6%96%87%E6%A1%A3%E5%8F%8A%E7%9B%B8%E5%85%B3%E6%96%87%E7%AB%A0%E7%B4%A2%E5%BC%95 "HMM学习最佳范例") 574 | 18. [强化学习中“策略梯度定理”的规范表达、推导与讨论](https://zhuanlan.zhihu.com/p/490373525?utm_campaign=&utm_medium=social&utm_oi=644502718257958912&utm_psn=1602107506024824832&utm_source=qq "强化学习中“策略梯度定理”的规范表达、推导与讨论") 575 | 19. [强化学习——时序差分算法](https://zhuanlan.zhihu.com/p/34747205?utm_campaign=&utm_medium=social&utm_oi=644502718257958912&utm_psn=1602473260729393152&utm_source=qq "强化学习——时序差分算法") 576 | 20. [KL-Divergence详解](https://zhuanlan.zhihu.com/p/425693597?utm_id=0 "KL-Divergence详解") 577 | 21. 《强化学习(微课版)》,清华大学出版社出版 578 | 22. [使用蒙特卡洛计算定积分(附Python代码)](https://www.guanjihuan.com/archives/1145 "使用蒙特卡洛计算定积分(附Python代码)") 579 | 23. [David Silver 增强学习补充——重要性采样](https://zhuanlan.zhihu.com/p/78720910 "David Silver 增强学习补充——重要性采样")、[强化学习中的重要性采样(Importance Sampling)](https://zhuanlan.zhihu.com/p/371156865 "强化学习中的重要性采样(Importance Sampling)") 580 | 24. [关于Instruct GPT复现的一些细节与想法](https://zhuanlan.zhihu.com/p/609078527 "关于Instruct GPT复现的一些细节与想法") 581 | 25. [类ChatGPT逐行代码解读(2/2):从零起步实现ChatLLaMA和ColossalChat](https://blog.csdn.net/v_JULY_v/article/details/129996493 "类ChatGPT逐行代码解读(2/2):从零起步实现ChatLLaMA和ColossalChat") 582 | 583 | * * * 584 | 585 | ## 附录:修改/完善/新增记录 586 | 587 | RL里的细节、概念、公式繁多,想完全阐述清楚是不容易的,以下是自从23年1.16日以来的修改、完善、新增记录: 588 | 589 | 1. 1.16日,第一轮修改/完善/新增 590 | 修正几个笔误,且考虑到不排除会有比初级更初级的初学者阅读本文,补充部分公式的拆解细节 591 | 2. 1.17日,先后修正了2.2节重要性采样与重要性权重的部分不准确的描述、修正个别公式的笔误,以及补充1.4.2中关于PPO-clip的细节阐述、优化第四部分的相关描述 592 | 3. 1.18日,为措辞更准确,优化1.2.5节“基于信任区域的TRPO:通过KL散度解决两个分布相差大和步长难以确定的问题”的相关描述 593 | 4. 1.19日,为让读者理解起来更一目了然 594 | 优化1.3.1节中关于什么是近端策略优化PPO的描述 595 | 优化1.3.2节中关于“近端策略优化惩罚PPO-penalty关于自适应KL惩罚(adaptive KL penalty)”的描述 596 | 拆解细化关于 597 | $`\nabla \log p_{\theta}(\tau )`$的推导过程 598 | 补充说明对优势函数 599 | $`A^{\theta}(s_t,a_t)`$的介绍 600 | 5. 1.20日,第五轮修改/完善/新增 601 | 通过LaTeX重新编辑部分公式,以及补充说明1.2.1节中关于某一条轨迹$`\tau `$发生概率的定义 602 | 6. 1.21日(大年三十),新增对蒙/新增特卡洛方法的介绍,以及新增$`R(\tau )-b`$中基线$`b`$的定义,且简化2.1.1节中关于强化学习过程的描述 603 | 7. 1.22日,为严谨起见改进第二部分中对回报$`G`$、状态价值函数、动作价值函数、马尔可夫决策、策略评价函数的描述,并纠正个别公式的笔误 604 | 8. 1.23日,梳理1.1节的整体内容结构和顺序脉络,以让逻辑更清晰,补充了MDP的整个来龙去脉(包括且不限于马尔可夫过程、马尔可夫奖励、马尔可夫决策以及贝尔曼方程) 605 | 9. 1.25日,为方便对高数知识有所遗忘的同学更顺畅的读懂本文,新增对导数、期望的简单介绍(后汇总至概率统计极简入门笔记里),以及补充对$`R(\tau )-b`$中基线$`b`$的定义的介绍 606 | 10. 
1.26日,第十轮修改/完善/新增 607 | 优化改进2.2节关于策略梯度的梯度计算的整个推导过程,以让逻辑更清晰 608 | 11. 1.27日,优化关于增加基线baseline和优势函数等内容的描述 609 | 在后记里补充策略梯度计算公式的5种其它表达 610 | 优化关于“近端策略优化惩罚PPO-penalty的流程”的描述 611 | 12. 1.28日,新增对MC和TD方法各自的阐述及两者的对比,优化对KL散度定义的描述,新增近端策略优化裁剪PPO-clip的关键代码 612 | 13. 1.30日,新增马尔可夫决策的贝尔曼方程以及对应的计算图解,以方便一目了然 613 | 简单阐述了下GPT2相比GPT的结构变化,以及完善丰富了下文末的参考文献与推荐阅读,比如增加图解GPT2、图解GPT3的参考文献 614 | 14. 1.31日,为行文严谨,针对1.1.2节中关于马尔可夫奖 励的部分 615 | 规范统一个别公式的大小写表示 616 | 补充状态$`s`$下奖励函数的定义$`R(s)=E[R_{t+1}|S_t=s]`$ 617 | 修正回报公式的笔误$`G_t=R_{t+1}+\gamma \cdot R_{t+2}+\gamma ^2\cdot R_{t+3}+\gamma ^3\cdot R_{t+4}+\cdots `$ 618 | 修正状态价值函数公式的笔误 619 | 且为形象起见,新增一个“吃饭-抽烟/剔牙”的例子以展示利用贝尔曼方程计算的过程 620 | 此外,且为通俗细致,针对1.1.3节中关于马尔科夫决策的部分 621 | 拆解状态价值函数、动作价值函数的定义公式,拆解关于状态价值函数和动作价值函数之间联系的推导过程 622 | 623 | 15. 2.1日,第十五轮修改/完善/新增 624 | 参考sutton RL一书,补充对奖励函数、回报公式、轨迹定义的公式表达规范的说明 625 | 16. 2.12日,为更一目了然,新增对状态价值函数的贝尔曼方程的解释,与例子说明 626 | 17. 2.13日,开始完善动态规划一节的内容,然后发现之前写的一篇关于DP的博客不够通俗,故本周得先修订下那篇旧文,以通俗理解DP 627 | 18. 2.15日,基于已经修订好的DP旧文,完善本文对DP算法的阐述 628 | 并修正第一部分里关于“什么是强化学习”的不够准确的表述 629 | 19. 2.16日,新增对蒙特卡洛方法和时序差分法的介绍,并重点对比DP、MC、TD三者的区别与联系 630 | 明确全文结构,分为四大部分,一一对应:RL基础、RL进阶、RL深入、RL高级 631 | 20. 2.17日,第二十轮修改/完善/新增 632 | 新增第三部分价值学习:从n步Sarsa算法到Q-learning、DQN,并修订部分细节 633 | 润色部分细节,修订部分目录 634 | 21. 2.20日,新增一节“RL的分类:基于模型(Value-base/Policy-based)与不基于模型” 635 | 22. 2.21日,新增KL散度公式的从零推导过程 636 | 23. 2.24日,新增关于$`E[G_{t+1}|S_t = s]`$为何等于$`E[V(S_{t+1}|S_t = s)]`$的详细推导 637 | 24. 2.26日,新增关于$`\nabla f(x)=f(x)\nabla \log f(x)`$如何而来的推导 638 | 25. 3.4日,完善对于“Q学习的两种策略:行为策略和目标策略”的描述 639 | 26. 2年4月,纠正本文下面几个热心读者指出的笔误,大家有啥问题 继续随时反馈,非常感谢 640 | 27. 4.27,针对一读者留言所表达的疑惑,新增1.2.2节中关于「奖励函数$`R(s,a)`$推导」的说明解释 641 | 并新增一节“4.4.3 PPO算法的一个简单实现:对话机器人” 642 | --------------------------------------------------------------------------------