├── .gitignore ├── .gitmodules ├── README.md ├── docs ├── index.html └── resources │ ├── clvr_icon.png │ ├── env_videos │ ├── kitchen.mp4 │ ├── maze.mp4 │ └── office.mp4 │ ├── kitchen_subtask_distribution.png │ ├── policy_videos │ ├── kitchen_skild.mp4 │ ├── kitchen_skillBCSAC.mp4 │ ├── kitchen_spirl.mp4 │ ├── office_skild.mp4 │ ├── office_skillBCSAC.mp4 │ └── office_spirl.mp4 │ ├── skild_downstream_sketch.png │ ├── skild_imitation_results.png │ ├── skild_model.png │ ├── skild_quali_results.png │ ├── skild_quant_results.png │ └── skild_teaser.png ├── requirements.txt ├── setup.py └── skild ├── configs ├── demo_discriminator │ ├── kitchen │ │ └── conf.py │ ├── maze │ │ └── conf.py │ └── office │ │ └── conf.py ├── demo_rl │ ├── kitchen │ │ └── conf.py │ ├── maze │ │ └── conf.py │ └── office │ │ └── conf.py ├── imitation │ ├── kitchen │ │ └── conf.py │ └── maze │ │ └── conf.py ├── skill_posterior │ ├── kitchen │ │ └── conf.py │ ├── maze │ │ └── conf.py │ └── office │ │ └── conf.py └── skill_prior │ ├── kitchen │ └── conf.py │ ├── maze │ └── conf.py │ └── office │ └── conf.py ├── data ├── kitchen │ ├── README.md │ └── kitchen_subtasks.py └── maze │ └── src │ └── maze_agents.py ├── models └── demo_discriminator.py └── rl ├── agents ├── gail_agent.py ├── ppo_agent.py └── skild_agent.py ├── envs └── maze.py └── policies └── posterior_policies.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | __pycache__ 3 | *.egg-info 4 | *.DS_Store 5 | venv 6 | -------------------------------------------------------------------------------- /.gitmodules: -------------------------------------------------------------------------------- 1 | [submodule "spirl"] 2 | path = spirl 3 | url = git@github.com:clvrai/spirl.git 4 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Demonstration-Guided Reinforcement Learning with Learned Skills 2 | #### [[Project Website]](https://clvrai.github.io/skild/) [[Paper]](https://arxiv.org/abs/2107.10253) 3 | 4 | [Karl Pertsch](https://kpertsch.github.io/)1, [Youngwoon Lee](https://youngwoon.github.io/)1, 5 | [Yue Wu](https://ventusyue.github.io/)1, [Joseph Lim](https://www.clvrai.com/)1 6 | 7 | 1CLVR Lab, University of Southern California 8 | 9 | 10 |

11 | 12 |

13 |
14 | 15 | This is the official PyTorch implementation of the paper "**Demonstration-Guided Reinforcement Learning with Learned Skills**". 16 | 17 | ## Requirements 18 | 19 | - python 3.7+ 20 | - mujoco 2.1 (for RL experiments) 21 | - Ubuntu 18.04 22 | 23 | ## Installation Instructions 24 | 25 | Create a virtual environment and install all required packages: 26 | ``` 27 | cd skild 28 | pip3 install virtualenv 29 | virtualenv -p $(which python3) ./venv 30 | source ./venv/bin/activate 31 | 32 | # Install dependencies and package 33 | pip3 install -r requirements.txt 34 | pip3 install -e . 35 | ``` 36 | 37 | Install [SPiRL](https://github.com/clvrai/spirl) as a git submodule: 38 | ``` 39 | # Download SPiRL as a submodule (all requirements should already be installed) 40 | git submodule update --init --recursive 41 | cd spirl 42 | pip3 install -e . 43 | cd .. 44 | ``` 45 | 46 | Set the environment variables that specify the root experiment and data directories. For example: 47 | ``` 48 | mkdir ./experiments 49 | mkdir ./data 50 | export EXP_DIR=./experiments 51 | export DATA_DIR=./data 52 | ``` 53 | 54 | If you are planning to use GPUs, set the target GPU via `export CUDA_VISIBLE_DEVICES=XXX`. 55 | 56 | Finally, for running RL experiments on maze or kitchen environments, install our fork of the 57 | [D4RL benchmark](https://github.com/kpertsch/d4rl) repository by following its installation instructions. Also make sure 58 | to place your Mujoco license file `mj_key.txt` in `~/.mujoco`. 59 | For running RL in the office environment, install our fork of the [Roboverse repo](https://github.com/VentusYue/roboverse) 60 | and follow its installation instructions for installing PyBullet. 61 | 62 | ## Example Commands 63 | Our skill-based imitation / demo-guided RL pipeline runs in four steps: (1) train the skill embedding and skill prior, 64 | (2) train the skill posterior, (3) train the demo discriminator, and (4) use all components for demo-guided RL or imitation learning 65 | on the downstream task. 66 | 67 | All results will be written to [WandB](https://www.wandb.com/). Before running any of the commands below, 68 | create an account and then change the WandB entity and project name at the top of [train.py](https://github.com/clvrai/spirl/blob/master/spirl/train.py) and 69 | [rl/train.py](https://github.com/clvrai/spirl/blob/master/spirl/rl/train.py) to match your account. 70 | 71 | #### Skill Embedding & Prior 72 | To train the skill embedding and skill prior model for the kitchen environment, run: 73 | ``` 74 | python3 spirl/spirl/train.py --path=skild/configs/skill_prior/kitchen --val_data_size=160 --prefix=kitchen_prior 75 | ``` 76 | 77 | #### Skill Posterior 78 | For training the skill posterior on the demonstration data, run: 79 | ``` 80 | python3 spirl/spirl/train.py --path=skild/configs/skill_posterior/kitchen --val_data_size=160 --prefix=kitchen_post 81 | ``` 82 | Note that the skill posterior can only be trained once skill embedding and prior training is completed 83 | since it leverages the pre-trained skill embedding.
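Concretely, this dependency comes from the skill posterior config, which imports the skill prior config, restricts the training data to the target-task demonstrations, and loads the pre-trained skill embedding via `embedding_checkpoint`. The snippet below is quoted (abridged, with one added explanatory comment) from `skild/configs/skill_posterior/kitchen/conf.py`:
```
from skild.configs.skill_prior.kitchen.conf import *

data_config.dataset_spec.filter_indices = [[320, 337], [339, 344]]  # use only demos for one task (here: KBTS)
data_config.dataset_spec.demo_repeats = 10                          # repeat those demos N times

# reuse the skill embedding learned during skill prior training
model_config.embedding_checkpoint = os.path.join(os.environ["EXP_DIR"],
                                                 "skill_prior/kitchen/kitchen_prior/weights")
```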
84 | 85 | #### Demo Discriminator 86 | For training the demonstration discriminator, run: 87 | ``` 88 | python3 spirl/spirl/train.py --path=skild/configs/demo_discriminator/kitchen --val_data_size=160 --prefix=kitchen_discr 89 | ``` 90 | 91 | #### Demonstration-Guided RL 92 | For training a SkiLD agent on the kitchen environment using the pre-trained components from above, run: 93 | ``` 94 | python3 spirl/spirl/rl/train.py --path=skild/configs/demo_rl/kitchen --seed=0 --prefix=SkiLD_demoRL_kitchen_seed0 95 | ``` 96 | 97 | #### Imitation Learning 98 | For training a SkiLD agent on the kitchen environment with pure imitation learning, run: 99 | ``` 100 | python3 spirl/spirl/rl/train.py --path=skild/configs/imitation/kitchen --seed=0 --prefix=SkiLD_IL_kitchen_seed0 101 | ``` 102 | 103 | In all commands above, `kitchen` can be replaced with `maze / office` to run on the respective environment. Before training models 104 | on these environments, the corresponding datasets need to be downloaded (the kitchen dataset gets downloaded automatically) 105 | -- download links are provided below, and a consolidated maze example is shown further below. 106 | 107 | To accelerate RL / IL training, you can use MPI for multi-processing by prepending `mpirun -np XXX` to the above RL / IL commands, where `XXX` corresponds to the number of parallel workers you want to spawn. Also update the corresponding [config file](skild/configs/demo_rl/kitchen/conf.py) by uncommenting the `update_iterations = XXX` line and again replacing `XXX` with the desired number of workers. 108 | 109 | 110 | ## Datasets 111 | 112 | |Dataset | Link | Size | 113 | |:------------- |:-------------|:-----| 114 | | Maze Task-Agnostic | [https://drive.google.com/file/d/103RFpEg4ATnH06fd1ps8ZQL4sTtifrvX/view?usp=sharing](https://drive.google.com/file/d/103RFpEg4ATnH06fd1ps8ZQL4sTtifrvX/view?usp=sharing)| 470MB | 115 | | Maze Demos | [https://drive.google.com/file/d/1wTR9ns5QsEJnrMJRXFEJWCMk-d1s4S9t/view?usp=sharing](https://drive.google.com/file/d/1wTR9ns5QsEJnrMJRXFEJWCMk-d1s4S9t/view?usp=sharing)| 100MB | 116 | | Office Cleanup Task-Agnostic | [https://drive.google.com/file/d/1yNsTZkefMMvdbIBe-dTHJxgPIRXyxzb7/view?usp=sharing](https://drive.google.com/file/d/1yNsTZkefMMvdbIBe-dTHJxgPIRXyxzb7/view?usp=sharing)| 170MB | 117 | | Office Cleanup Demos | [https://drive.google.com/file/d/149trMTyh3A2KnbUOXwt6Lc3ba-1T9SXj/view?usp=sharing](https://drive.google.com/file/d/149trMTyh3A2KnbUOXwt6Lc3ba-1T9SXj/view?usp=sharing)| 6MB | 118 | 119 | To download the dataset files from Google Drive via the command line, you can use the 120 | [gdown](https://github.com/wkentaro/gdown) package. Install it with: 121 | ``` 122 | pip install gdown 123 | ``` 124 | 125 | Then navigate to the folder you want to download the data to and run the following commands: 126 | ``` 127 | # Download Maze Task-Agnostic Dataset 128 | gdown https://drive.google.com/uc?id=103RFpEg4ATnH06fd1ps8ZQL4sTtifrvX 129 | 130 | # Download Maze Demonstration Dataset 131 | gdown https://drive.google.com/uc?id=1wTR9ns5QsEJnrMJRXFEJWCMk-d1s4S9t 132 | ``` 133 | 134 | Finally, unzip the downloaded files with the `unzip` command. 135 | 136 | ## Code Structure & Modifying the Code 137 | For more detailed documentation of the code structure and how to extend the code (adding new environments, models, RL algos), 138 | please check the [documentation in the SPiRL repo](https://github.com/clvrai/spirl#starting-to-modify-the-code).
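## Full Pipeline Example (Maze)
As a consolidated sketch (not an officially tested script), the four pipeline stages for the maze environment look as follows. The commands assume the maze datasets from the table above have been unzipped to `$DATA_DIR/maze_TA` and `$DATA_DIR/maze_demos` (the directory names referenced by the maze configs); the `--prefix` values are chosen to line up with the checkpoint paths expected by `skild/configs/demo_rl/maze/conf.py`, and `--val_data_size=160` is simply carried over from the kitchen examples above.
```
# 1) Skill embedding & skill prior on the task-agnostic maze data
python3 spirl/spirl/train.py --path=skild/configs/skill_prior/maze --val_data_size=160 --prefix=maze_prior

# 2) Skill posterior on the maze demonstrations (requires the pre-trained prior)
python3 spirl/spirl/train.py --path=skild/configs/skill_posterior/maze --val_data_size=160 --prefix=maze_post

# 3) Demo discriminator
python3 spirl/spirl/train.py --path=skild/configs/demo_discriminator/maze --val_data_size=160 --prefix=maze_discr

# 4) Demonstration-guided RL using all pre-trained components
python3 spirl/spirl/rl/train.py --path=skild/configs/demo_rl/maze --seed=0 --prefix=SkiLD_demoRL_maze_seed0
```
To parallelize the RL stage, prepend e.g. `mpirun -np 8` to the last command and set `update_iterations = 8` in the config's `base_agent_params`, as described above.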
139 | 140 | ## Citation 141 | If you find this work useful in your research, please consider citing: 142 | ``` 143 | @article{pertsch2021skild, 144 | title={Demonstration-Guided Reinforcement Learning with Learned Skills}, 145 | author={Karl Pertsch and Youngwoon Lee and Yue Wu and Joseph J. Lim}, 146 | journal={5th Conference on Robot Learning}, 147 | year={2021}, 148 | } 149 | ``` 150 | 151 | ## Acknowledgements 152 | Most of the heavy-lifting in this code is done by the [SPiRL codebase](https://github.com/clvrai/spirl), published as part 153 | of our prior work. 154 | 155 | We thank Justin Fu and Aviral Kumar et al. for providing the [D4RL codebase](https://github.com/rail-berkeley/d4rl) 156 | which we use for some of our experiments. We also thank Avi Singh et al. for open-sourcing the [Roboverse repo](https://github.com/avisingh599/roboverse) 157 | which we build on for our office environment experiments. 158 | 159 | 160 | -------------------------------------------------------------------------------- /docs/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 176 | 177 | 178 | 179 | 187 | 188 | 189 | 190 | 191 | 192 |
193 | 194 | 195 | 196 | Demonstration-Guided Reinforcement Learning with Learned Skills 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | 206 | 207 | 208 | 209 | 210 | 211 | 212 | 213 | 214 | 215 | 216 | 217 | 218 | 219 | 220 | 221 | 222 | 223 | 224 | 225 | 226 | 227 | 228 | 229 |
230 |
Demonstration-Guided Reinforcement Learning
with Learned Skills

231 |
232 |
Karl Pertsch
233 | 234 |
235 | 236 |
Youngwoon Lee
237 | 238 |
239 | 240 |
Yue Wu
241 | 242 |
243 | 244 |
Joseph Lim
245 | 246 |
247 |
248 | 249 | 250 | 251 | 252 |
CLVR Lab, University of Southern California
253 |
Conference on Robot Learning (CoRL), 2021
254 | 255 |
256 |
[Paper]
257 |
[Code]
258 | 259 |
260 | 261 | 262 | 264 | 265 | 266 | 267 |
268 |

269 |
270 | 271 |
272 | Demonstration-guided reinforcement learning (RL) is a promising approach for learning complex behaviors by leveraging both reward feedback and a set of target task demonstrations. Prior approaches for demonstration-guided RL treat every new task as an independent learning problem and attempt to follow the provided demonstrations step-by-step, akin to a human trying to imitate a completely unseen behavior by following the demonstrator's exact muscle movements. Naturally, such learning will be slow, but often new behaviors are not completely unseen: they share subtasks with behaviors we have previously learned. In this work, we aim to exploit this shared subtask structure to increase the efficiency of demonstration-guided RL. We first learn a set of reusable skills from large offline datasets of prior experience collected across many tasks. We then propose Skill-based Learning with Demonstrations (SkiLD), an algorithm for demonstration-guided RL that efficiently leverages the provided demonstrations by following the demonstrated skills instead of the primitive actions, resulting in substantial performance improvements over prior demonstration-guided RL approaches. We validate the effectiveness of our approach on long-horizon maze navigation and complex robot manipulation tasks. 273 |
274 |

275 | 276 | 277 | 278 |

Overview

279 |
280 | Our goal is to use skills extracted from prior experience to improve the efficiency of demonstration-guided RL on a new task. We aim to leverage a set of provided demonstrations by following the performed skills as opposed to the primitive actions. 281 |

282 |

283 |
284 |
285 | Learning in our approach, SkiLD, is performed in three stages. (1): First, we extract a set of reusable skills from prior, task-agnostic experience. We build on prior work in skill-based RL for learning the skill extraction module (SPiRL, Pertsch et al. 2020). (2): We then use the pre-trained skill encoder to infer the skills performed in task-agnostic and demonstration sequences and learn state-conditioned skill distributions, which we call skill prior and skill posterior respectively. (3): Finally, we use both distributions to guide a hierarchical skill policy during learning of the downstream task. 286 |

287 | 288 | 289 | 290 | 291 | 295 | 296 | 303 | 304 |
292 | 293 | 294 | 297 |

Demonstration-Guided Downstream Learning

298 | 299 | While we have learned a state-conditioned distribution over the demonstrated skills, we cannot always trust this skill posterior, since it is only valid within the demonstration support (green region). Thus, to guide the hierarchical policy during downstream learning, SkiLD leverages the skill posterior only within the support of the demonstrations and uses the learned skill prior otherwise, since it was trained on the task-agnostic experience dataset with a much wider support (red region). 300 | 301 | 302 |


305 | 306 | 307 | 308 | 309 | 310 |

Environments

311 | 312 | 313 | 317 | 318 | 322 | 323 | 327 | 328 |
314 |

Maze Navigation

315 | 316 |
319 |

Kitchen Manipulation

320 | 321 |
324 |

Office Cleanup

325 | 326 |

329 | 330 |
331 | We evaluate our approach on three long-horizon tasks: maze navigation, kitchen manipulation and office cleanup. In each environment, we collect a large, task-agnostic dataset and a small set of task-specific demonstrations. 332 |
333 |
334 | 335 | 336 | 337 | 338 |
339 |

How does SkiLD Follow the Demonstrations?

340 |
341 | 342 |

343 |
344 | We analyze the qualitative behavior of our approach in the maze environment: the discriminator D(s) can accurately estimate the support of the demonstrations (green). Thus, the SkiLD policy minimizes divergence to the demonstration-based skill posterior within the demonstration support (third panel, blue) and follows the task-agnostic skill prior otherwise (fourth panel). In summary, the agent learns to follow the demonstrations whenever it's within their support and falls back to prior-based exploration outside the support. 345 |
346 |

347 | 348 | 349 | 350 | 351 |
352 |

Qualitative Results

353 |
354 | 355 | 356 | 361 | 362 | 366 | 367 | 371 | 372 | 376 | 377 | 378 | 379 | 384 | 385 | 388 | 389 | 392 | 393 | 396 | 397 |
357 |
358 | Kitchen Manipulation 359 |
360 |
363 |

SkiLD

364 | 365 |
368 |

SPiRL

369 | 370 |
373 |

SkillBC + SAC

374 | 375 |
380 |
381 | Office Cleanup 382 |
383 |
386 | 387 | 390 | 391 | 394 | 395 |
398 |
399 | Rollouts from the trained policies on the robotic manipulation tasks. In the kitchen environment, the agent needs to perform four subtasks: open microwave, flip light switch, open slide cabinet, open hinge cabinet. In the office cleanup task, it needs to put the correct objects in the correct receptacles. In both environments, our approach SkiLD is the only method that can solve the full task. SPiRL lacks guidance through the demonstrations and thus solves the wrong subtasks and fails at the target task. Skill-based BC with SAC finetuning is brittle and unable to solve more than one subtask. For more qualitative result videos, please check our supplementary website. 400 |


401 | 402 | 403 | 404 |
405 |

Quantitative Results

406 |
407 | 408 |

409 |
410 | 411 | 412 |
413 |

Imitation Learning Results

414 |
415 | 416 |

417 |
418 | We apply SkiLD in the pure imitation setting without access to environment rewards; instead, we use a GAIL-style reward based on our learned discriminator, which is trained to estimate the demonstration support. We show that our approach is able to leverage prior experience through skills for effective imitation of long-horizon tasks. By finetuning the learned discriminator, we can further improve performance on the kitchen manipulation task, which requires more complex control. 419 |


420 | 421 | 422 |

Source Code

423 |
424 | We have released our implementation in PyTorch on the github page. Try our code! 425 |
426 |
427 | [GitHub] 428 |
429 |

430 | 431 | 432 | 433 |

Citation

434 | 435 | 445 | 446 |
436 |

437 |           @article{pertsch2021skild,
438 |             title={Demonstration-Guided Reinforcement Learning with Learned Skills},
439 |             author={Karl Pertsch and Youngwoon Lee and Yue Wu and Joseph J. Lim},
440 |             journal={5th Conference on Robot Learning},
441 |             year={2021},
442 |           }
443 |         
444 |
447 |

448 | 449 | 450 | 454 | 455 | 456 | 459 |
460 | 461 | 462 | -------------------------------------------------------------------------------- /docs/resources/clvr_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/clvr_icon.png -------------------------------------------------------------------------------- /docs/resources/env_videos/kitchen.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/env_videos/kitchen.mp4 -------------------------------------------------------------------------------- /docs/resources/env_videos/maze.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/env_videos/maze.mp4 -------------------------------------------------------------------------------- /docs/resources/env_videos/office.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/env_videos/office.mp4 -------------------------------------------------------------------------------- /docs/resources/kitchen_subtask_distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/kitchen_subtask_distribution.png -------------------------------------------------------------------------------- /docs/resources/policy_videos/kitchen_skild.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/policy_videos/kitchen_skild.mp4 -------------------------------------------------------------------------------- /docs/resources/policy_videos/kitchen_skillBCSAC.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/policy_videos/kitchen_skillBCSAC.mp4 -------------------------------------------------------------------------------- /docs/resources/policy_videos/kitchen_spirl.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/policy_videos/kitchen_spirl.mp4 -------------------------------------------------------------------------------- /docs/resources/policy_videos/office_skild.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/policy_videos/office_skild.mp4 -------------------------------------------------------------------------------- /docs/resources/policy_videos/office_skillBCSAC.mp4: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/policy_videos/office_skillBCSAC.mp4 -------------------------------------------------------------------------------- /docs/resources/policy_videos/office_spirl.mp4: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/policy_videos/office_spirl.mp4 -------------------------------------------------------------------------------- /docs/resources/skild_downstream_sketch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/skild_downstream_sketch.png -------------------------------------------------------------------------------- /docs/resources/skild_imitation_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/skild_imitation_results.png -------------------------------------------------------------------------------- /docs/resources/skild_model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/skild_model.png -------------------------------------------------------------------------------- /docs/resources/skild_quali_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/skild_quali_results.png -------------------------------------------------------------------------------- /docs/resources/skild_quant_results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/skild_quant_results.png -------------------------------------------------------------------------------- /docs/resources/skild_teaser.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/clvrai/skild/91868b1fb1460e7e9013711bd01a047fc7a8ec7b/docs/resources/skild_teaser.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # core 2 | numpy 3 | matplotlib 4 | pillow 5 | h5py==2.10.0 6 | scikit-image 7 | funcsigs 8 | opencv-python 9 | moviepy 10 | torch==1.3.1 11 | torchvision==0.4.2 12 | tensorboard==2.1.1 13 | tensorboardX==2.0 14 | gym==0.15.4 15 | pandas 16 | 17 | # RL 18 | wandb 19 | mpi4py 20 | mujoco_py==2.0.2.9 21 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from distutils.core import setup 2 | 3 | setup(name='skild', version='0.0.1', packages=['skild']) 4 | -------------------------------------------------------------------------------- /skild/configs/demo_discriminator/kitchen/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | 4 | from spirl.utils.general_utils import AttrDict 5 | from skild.models.demo_discriminator import DemoDiscriminator, DemoDiscriminatorLogger 6 | from spirl.configs.default_data_configs.kitchen import data_spec 7 | from spirl.components.evaluator import DummyEvaluator 8 | 9 | 10 | current_dir = os.path.dirname(os.path.realpath(__file__)) 11 | 12 | 13 | 
configuration = { 14 | 'model': DemoDiscriminator, 15 | 'model_test': DemoDiscriminator, 16 | 'logger': DemoDiscriminatorLogger, 17 | 'logger_test': DemoDiscriminatorLogger, 18 | 'data_dir': ".", 19 | 'num_epochs': 100, 20 | 'epoch_cycles_train': 10, 21 | 'evaluator': DummyEvaluator, 22 | } 23 | configuration = AttrDict(configuration) 24 | 25 | model_config = AttrDict( 26 | action_dim=data_spec.n_actions, 27 | normalization='none', 28 | ) 29 | 30 | # Demo Dataset 31 | demo_data_config = AttrDict() 32 | demo_data_config.dataset_spec = copy.deepcopy(data_spec) 33 | demo_data_config.dataset_spec.crop_rand_subseq = True 34 | demo_data_config.dataset_spec.subseq_len = 1+1 35 | demo_data_config.dataset_spec.filter_indices = [[320, 337], [339, 344]] # use only demos for one task (here: KBTS) 36 | demo_data_config.dataset_spec.demo_repeats = 10 # repeat those demos N times 37 | model_config.demo_data_conf = demo_data_config 38 | model_config.demo_data_path = '.' 39 | 40 | # Non-demo Dataset 41 | data_config = AttrDict() 42 | data_config.dataset_spec = data_spec 43 | data_config.dataset_spec.crop_rand_subseq = True 44 | data_config.dataset_spec.subseq_len = 1+1 45 | -------------------------------------------------------------------------------- /skild/configs/demo_discriminator/maze/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | 4 | from spirl.utils.general_utils import AttrDict 5 | from skild.models.demo_discriminator import DemoDiscriminator, DemoDiscriminatorLogger 6 | from spirl.configs.default_data_configs.maze import data_spec 7 | from spirl.components.evaluator import DummyEvaluator 8 | 9 | 10 | current_dir = os.path.dirname(os.path.realpath(__file__)) 11 | 12 | 13 | configuration = { 14 | 'model': DemoDiscriminator, 15 | 'model_test': DemoDiscriminator, 16 | 'logger': DemoDiscriminatorLogger, 17 | 'logger_test': DemoDiscriminatorLogger, 18 | 'data_dir': os.path.join(os.environ['DATA_DIR'], 'maze_TA'), 19 | 'num_epochs': 100, 20 | 'epoch_cycles_train': 200, 21 | 'evaluator': DummyEvaluator, 22 | } 23 | configuration = AttrDict(configuration) 24 | 25 | model_config = AttrDict( 26 | action_dim=data_spec.n_actions, 27 | normalization='none', 28 | ) 29 | 30 | # Demo Dataset 31 | demo_data_config = AttrDict() 32 | demo_data_config.dataset_spec = copy.deepcopy(data_spec) 33 | demo_data_config.dataset_spec.crop_rand_subseq = True 34 | demo_data_config.dataset_spec.subseq_len = 1+1 35 | demo_data_config.dataset_spec.n_seqs = 5 # number of demos used 36 | demo_data_config.dataset_spec.seq_repeat = 30 # repeat those demos N times 37 | model_config.demo_data_conf = demo_data_config 38 | model_config.demo_data_path = os.path.join(os.environ['DATA_DIR'], 'maze_demos') 39 | 40 | # Non-demo Dataset 41 | data_config = AttrDict() 42 | data_config.dataset_spec = data_spec 43 | data_config.dataset_spec.crop_rand_subseq = True 44 | data_config.dataset_spec.subseq_len = 1+1 45 | -------------------------------------------------------------------------------- /skild/configs/demo_discriminator/office/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | 4 | from spirl.utils.general_utils import AttrDict 5 | from skild.models.demo_discriminator import DemoDiscriminator, DemoDiscriminatorLogger 6 | from spirl.configs.default_data_configs.office import data_spec 7 | from spirl.components.evaluator import DummyEvaluator 8 | 9 | 10 | current_dir = 
os.path.dirname(os.path.realpath(__file__)) 11 | 12 | 13 | configuration = { 14 | 'model': DemoDiscriminator, 15 | 'model_test': DemoDiscriminator, 16 | 'logger': DemoDiscriminatorLogger, 17 | 'logger_test': DemoDiscriminatorLogger, 18 | 'data_dir': os.path.join(os.environ['DATA_DIR'], 'office_TA'), 19 | 'num_epochs': 100, 20 | 'epoch_cycles_train': 300, 21 | 'evaluator': DummyEvaluator, 22 | } 23 | configuration = AttrDict(configuration) 24 | 25 | model_config = AttrDict( 26 | action_dim=data_spec.n_actions, 27 | normalization='none', 28 | ) 29 | 30 | # Demo Dataset 31 | demo_data_config = AttrDict() 32 | demo_data_config.dataset_spec = copy.deepcopy(data_spec) 33 | demo_data_config.dataset_spec.crop_rand_subseq = True 34 | demo_data_config.dataset_spec.subseq_len = 1+1 35 | demo_data_config.dataset_spec.n_seqs = 50 # number of demos used 36 | demo_data_config.dataset_spec.seq_repeat = 3 # repeat those demos N times 37 | model_config.demo_data_conf = demo_data_config 38 | model_config.demo_data_path = os.path.join(os.environ['DATA_DIR'], 'office_demos') 39 | 40 | # Non-demo Dataset 41 | data_config = AttrDict() 42 | data_config.dataset_spec = data_spec 43 | data_config.dataset_spec.crop_rand_subseq = True 44 | data_config.dataset_spec.subseq_len = 1+1 45 | -------------------------------------------------------------------------------- /skild/configs/demo_rl/kitchen/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | import torch 4 | 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.rl.components.agent import FixedIntervalHierarchicalAgent 7 | from spirl.rl.components.replay_buffer import UniformReplayBuffer 8 | from spirl.rl.components.sampler import HierarchicalSampler 9 | from spirl.rl.components.critic import MLPCritic, SplitObsMLPCritic 10 | from spirl.rl.agents.ac_agent import SACAgent 11 | from spirl.rl.policies.cl_model_policies import ClModelPolicy 12 | from spirl.rl.envs.kitchen import KitchenEnv 13 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 14 | from spirl.configs.default_data_configs.kitchen import data_spec 15 | 16 | from skild.rl.policies.posterior_policies import LearnedPPPolicy 17 | from skild.models.demo_discriminator import DemoDiscriminator 18 | from skild.rl.agents.skild_agent import SkiLDAgent 19 | 20 | 21 | current_dir = os.path.dirname(os.path.realpath(__file__)) 22 | 23 | notes = 'used to test the RL implementation' 24 | 25 | configuration = { 26 | 'seed': 42, 27 | 'agent': FixedIntervalHierarchicalAgent, 28 | 'environment': KitchenEnv, 29 | 'sampler': HierarchicalSampler, 30 | 'data_dir': '.', 31 | 'num_epochs': 200, 32 | 'max_rollout_len': 280, 33 | 'n_steps_per_epoch': 1e6, 34 | 'log_output_per_epoch': 1000, 35 | 'n_warmup_steps': 2e3, 36 | } 37 | configuration = AttrDict(configuration) 38 | 39 | # Observation Normalization 40 | obs_norm_params = AttrDict( 41 | ) 42 | 43 | base_agent_params = AttrDict( 44 | batch_size=128, 45 | # update_iterations=XXX, 46 | ) 47 | 48 | ###### Low-Level ###### 49 | # LL Policy 50 | ll_model_params = AttrDict( 51 | state_dim=data_spec.state_dim, 52 | action_dim=data_spec.n_actions, 53 | n_rollout_steps=10, 54 | kl_div_weight=5e-4, 55 | nz_vae=10, 56 | nz_enc=128, 57 | nz_mid=128, 58 | n_processing_layers=5, 59 | cond_decode=True, 60 | ) 61 | 62 | # LL Policy 63 | ll_policy_params = AttrDict( 64 | policy_model=ClSPiRLMdl, 65 | policy_model_params=ll_model_params, 66 | policy_model_checkpoint=os.path.join(os.environ["EXP_DIR"], 
"skill_prior/kitchen/kitchen_prior"), 67 | ) 68 | ll_policy_params.update(ll_model_params) 69 | 70 | # LL Critic 71 | ll_critic_params = AttrDict( 72 | action_dim=data_spec.n_actions, 73 | input_dim=data_spec.state_dim, 74 | output_dim=1, 75 | action_input=True, 76 | unused_obs_size=ll_model_params.nz_vae, # ignore HL policy z output in observation for LL critic 77 | ) 78 | 79 | # LL Agent 80 | ll_agent_config = copy.deepcopy(base_agent_params) 81 | ll_agent_config.update(AttrDict( 82 | policy=ClModelPolicy, 83 | policy_params=ll_policy_params, 84 | critic=SplitObsMLPCritic, 85 | critic_params=ll_critic_params, 86 | )) 87 | 88 | ###### High-Level ######## 89 | # HL Policy 90 | hl_policy_params = AttrDict( 91 | action_dim=ll_model_params.nz_vae, # z-dimension of the skill VAE 92 | input_dim=data_spec.state_dim, 93 | squash_output_dist=True, 94 | max_action_range=2., 95 | prior_model_params=ll_policy_params.policy_model_params, 96 | prior_model=ll_policy_params.policy_model, 97 | prior_model_checkpoint=ll_policy_params.policy_model_checkpoint, 98 | posterior_model=ll_policy_params.policy_model, 99 | posterior_model_params=copy.deepcopy(ll_policy_params.policy_model_params), 100 | posterior_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_posterior/kitchen/kitchen_post"), 101 | ) 102 | hl_policy_params.posterior_model_params.batch_size = base_agent_params.batch_size 103 | 104 | hl_policy_params.policy_model = ll_policy_params.policy_model 105 | hl_policy_params.policy_model_params = copy.deepcopy(ll_policy_params.policy_model_params) 106 | hl_policy_params.policy_model_checkpoint = hl_policy_params.posterior_model_checkpoint 107 | hl_policy_params.policy_model_params.batch_size = base_agent_params.batch_size 108 | 109 | 110 | # HL Critic 111 | hl_critic_params = AttrDict( 112 | action_dim=hl_policy_params.action_dim, 113 | input_dim=hl_policy_params.input_dim, 114 | output_dim=1, 115 | n_layers=2, 116 | nz_mid=256, 117 | action_input=True, 118 | ) 119 | 120 | # HL GAIL Demo Dataset 121 | from spirl.components.data_loader import GlobalSplitVideoDataset 122 | data_config = AttrDict() 123 | data_config.dataset_spec = data_spec 124 | data_config.dataset_spec.update(AttrDict( 125 | crop_rand_subseq=True, 126 | subseq_len=2, 127 | filter_indices=[[320, 337], [339, 344]], 128 | demo_repeats=10, 129 | )) 130 | 131 | # HL Pre-Trained Demo Discriminator 132 | demo_discriminator_config = AttrDict( 133 | state_dim=data_spec.state_dim, 134 | normalization='none', 135 | demo_data_conf=data_config, 136 | ) 137 | 138 | # HL Agent 139 | hl_agent_config = copy.deepcopy(base_agent_params) 140 | hl_agent_config.update(AttrDict( 141 | policy=LearnedPPPolicy, 142 | policy_params=hl_policy_params, 143 | critic=MLPCritic, 144 | critic_params=hl_critic_params, 145 | discriminator=DemoDiscriminator, 146 | discriminator_params=demo_discriminator_config, 147 | discriminator_checkpoint=os.path.join(os.environ["EXP_DIR"], "demo_discriminator/kitchen/kitchen_discr"), 148 | freeze_discriminator=True, # don't update pretrained discriminator 149 | buffer=UniformReplayBuffer, 150 | buffer_params={'capacity': 1e6,}, 151 | reset_buffer=False, 152 | replay=UniformReplayBuffer, 153 | replay_params={'dump_replay': False, 'capacity': 2e6}, 154 | expert_data_conf=data_config, 155 | expert_data_path=".", 156 | )) 157 | 158 | # SkiLD Parameters 159 | hl_agent_config.update(AttrDict( 160 | lambda_gail_schedule_params=AttrDict(p=0.9), 161 | fixed_alpha=1e-1, 162 | fixed_alpha_q=1e-1, 163 | )) 164 | 165 | 166 | ##### Joint 
Agent ####### 167 | agent_config = AttrDict( 168 | hl_agent=SkiLDAgent, 169 | hl_agent_params=hl_agent_config, 170 | ll_agent=SACAgent, 171 | ll_agent_params=ll_agent_config, 172 | hl_interval=ll_model_params.n_rollout_steps, 173 | log_videos=True, 174 | update_hl=True, 175 | update_ll=False, 176 | ) 177 | 178 | # Sampler 179 | sampler_config = AttrDict( 180 | ) 181 | 182 | # Environment 183 | env_config = AttrDict( 184 | reward_norm=1, 185 | name='kitchen-kbts-v0', 186 | ) 187 | 188 | -------------------------------------------------------------------------------- /skild/configs/demo_rl/maze/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | import torch 4 | 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.rl.components.agent import FixedIntervalHierarchicalAgent 7 | from spirl.rl.components.replay_buffer import UniformReplayBuffer 8 | from spirl.rl.components.sampler import HierarchicalSampler 9 | from spirl.rl.components.critic import MLPCritic, SplitObsMLPCritic 10 | from spirl.rl.agents.ac_agent import SACAgent 11 | from spirl.rl.policies.cl_model_policies import ClModelPolicy 12 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 13 | from spirl.configs.default_data_configs.maze import data_spec 14 | 15 | from skild.rl.policies.posterior_policies import LearnedPPPolicy 16 | from skild.models.demo_discriminator import DemoDiscriminator 17 | from skild.rl.envs.maze import ACRandMaze0S40Env 18 | from skild.rl.agents.skild_agent import SkiLDAgent 19 | from skild.data.maze.src.maze_agents import MazeSkiLDAgent 20 | 21 | 22 | current_dir = os.path.dirname(os.path.realpath(__file__)) 23 | 24 | notes = 'used to test the RL implementation' 25 | 26 | configuration = { 27 | 'seed': 42, 28 | 'agent': FixedIntervalHierarchicalAgent, 29 | 'environment': ACRandMaze0S40Env, 30 | 'sampler': HierarchicalSampler, 31 | 'data_dir': '.', 32 | 'num_epochs': 200, 33 | 'max_rollout_len': 2000, 34 | 'n_steps_per_epoch': 1e5, 35 | 'log_output_per_epoch': 1000, 36 | 'n_warmup_steps': 2e3, 37 | } 38 | configuration = AttrDict(configuration) 39 | 40 | # Observation Normalization 41 | obs_norm_params = AttrDict( 42 | ) 43 | 44 | base_agent_params = AttrDict( 45 | batch_size=128, 46 | ) 47 | 48 | ###### Low-Level ###### 49 | # LL Policy 50 | ll_model_params = AttrDict( 51 | state_dim=data_spec.state_dim, 52 | action_dim=data_spec.n_actions, 53 | n_rollout_steps=10, 54 | kl_div_weight=1e-3, 55 | nz_vae=10, 56 | nz_enc=128, 57 | nz_mid=128, 58 | n_processing_layers=5, 59 | cond_decode=True, 60 | ) 61 | 62 | # LL Policy 63 | ll_policy_params = AttrDict( 64 | policy_model=ClSPiRLMdl, 65 | policy_model_params=ll_model_params, 66 | policy_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_prior/maze/maze_prior"), 67 | ) 68 | ll_policy_params.update(ll_model_params) 69 | 70 | # LL Critic 71 | ll_critic_params = AttrDict( 72 | action_dim=data_spec.n_actions, 73 | input_dim=data_spec.state_dim, 74 | output_dim=1, 75 | action_input=True, 76 | unused_obs_size=ll_model_params.nz_vae, # ignore HL policy z output in observation for LL critic 77 | ) 78 | 79 | # LL Agent 80 | ll_agent_config = copy.deepcopy(base_agent_params) 81 | ll_agent_config.update(AttrDict( 82 | policy=ClModelPolicy, 83 | policy_params=ll_policy_params, 84 | critic=SplitObsMLPCritic, 85 | critic_params=ll_critic_params, 86 | )) 87 | 88 | ###### High-Level ######## 89 | # HL Policy 90 | hl_policy_params = AttrDict( 91 | action_dim=ll_model_params.nz_vae, 
# z-dimension of the skill VAE 92 | input_dim=data_spec.state_dim, 93 | squash_output_dist=True, 94 | max_action_range=2., 95 | prior_model_params=ll_policy_params.policy_model_params, 96 | prior_model=ll_policy_params.policy_model, 97 | prior_model_checkpoint=ll_policy_params.policy_model_checkpoint, 98 | posterior_model=ll_policy_params.policy_model, 99 | posterior_model_params=copy.deepcopy(ll_policy_params.policy_model_params), 100 | posterior_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_posterior/maze/maze_post"), 101 | ) 102 | hl_policy_params.posterior_model_params.batch_size = base_agent_params.batch_size 103 | 104 | hl_policy_params.policy_model = ll_policy_params.policy_model 105 | hl_policy_params.policy_model_params = copy.deepcopy(ll_policy_params.policy_model_params) 106 | hl_policy_params.policy_model_checkpoint = hl_policy_params.prior_model_checkpoint 107 | hl_policy_params.policy_model_params.batch_size = base_agent_params.batch_size 108 | 109 | 110 | # HL Critic 111 | hl_critic_params = AttrDict( 112 | action_dim=hl_policy_params.action_dim, 113 | input_dim=hl_policy_params.input_dim, 114 | output_dim=1, 115 | n_layers=2, 116 | nz_mid=256, 117 | action_input=True, 118 | ) 119 | 120 | # HL GAIL Demo Dataset 121 | from spirl.components.data_loader import GlobalSplitVideoDataset 122 | data_config = AttrDict() 123 | data_config.dataset_spec = data_spec 124 | data_config.dataset_spec.update(AttrDict( 125 | crop_rand_subseq=True, 126 | subseq_len=2, 127 | n_seqs=10, 128 | seq_repeat=100, 129 | split=AttrDict(train=0.5, val=0.5, test=0.0), 130 | )) 131 | 132 | # HL Pre-Trained Demo Discriminator 133 | demo_discriminator_config = AttrDict( 134 | state_dim=data_spec.state_dim, 135 | normalization='none', 136 | demo_data_conf=data_config, 137 | ) 138 | 139 | # HL Agent 140 | hl_agent_config = copy.deepcopy(base_agent_params) 141 | hl_agent_config.update(AttrDict( 142 | policy=LearnedPPPolicy, 143 | policy_params=hl_policy_params, 144 | critic=MLPCritic, 145 | critic_params=hl_critic_params, 146 | discriminator=DemoDiscriminator, 147 | discriminator_params=demo_discriminator_config, 148 | discriminator_checkpoint=os.path.join(os.environ["EXP_DIR"], "demo_discriminator/maze/maze_discr"), 149 | freeze_discriminator=True, # don't update pretrained discriminator 150 | buffer=UniformReplayBuffer, 151 | buffer_params={'capacity': 1e6,}, 152 | reset_buffer=False, 153 | replay=UniformReplayBuffer, 154 | replay_params={'dump_replay': False, 'capacity': 2e6}, 155 | expert_data_conf=data_config, 156 | expert_data_path=os.path.join(os.environ['DATA_DIR'], 'maze_demos'), 157 | )) 158 | 159 | # SkiLD Parameters 160 | hl_agent_config.update(AttrDict( 161 | lambda_gail_schedule_params=AttrDict(p=0.9), 162 | td_schedule_params=AttrDict(p=10.0), 163 | tdq_schedule_params=AttrDict(p=1.0), 164 | )) 165 | 166 | 167 | ##### Joint Agent ####### 168 | agent_config = AttrDict( 169 | hl_agent=MazeSkiLDAgent, 170 | hl_agent_params=hl_agent_config, 171 | ll_agent=SACAgent, 172 | ll_agent_params=ll_agent_config, 173 | hl_interval=ll_model_params.n_rollout_steps, 174 | log_videos=False, 175 | update_hl=True, 176 | update_ll=False, 177 | ) 178 | 179 | # Sampler 180 | sampler_config = AttrDict( 181 | ) 182 | 183 | # Environment 184 | env_config = AttrDict( 185 | reward_norm=1, 186 | ) 187 | 188 | -------------------------------------------------------------------------------- /skild/configs/demo_rl/office/conf.py: -------------------------------------------------------------------------------- 1 | 
import os 2 | import copy 3 | import torch 4 | 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.rl.components.agent import FixedIntervalHierarchicalAgent 7 | from spirl.rl.components.replay_buffer import UniformReplayBuffer 8 | from spirl.rl.components.sampler import HierarchicalSampler 9 | from spirl.rl.components.critic import MLPCritic, SplitObsMLPCritic 10 | from spirl.rl.agents.ac_agent import SACAgent 11 | from spirl.rl.policies.cl_model_policies import ClModelPolicy 12 | from spirl.rl.envs.office import OfficeEnv 13 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 14 | from spirl.configs.default_data_configs.office import data_spec 15 | 16 | from skild.rl.policies.posterior_policies import LearnedPPPolicy 17 | from skild.models.demo_discriminator import DemoDiscriminator 18 | from skild.rl.agents.skild_agent import SkiLDAgent 19 | 20 | 21 | current_dir = os.path.dirname(os.path.realpath(__file__)) 22 | 23 | notes = 'used to test the RL implementation' 24 | 25 | configuration = { 26 | 'seed': 42, 27 | 'agent': FixedIntervalHierarchicalAgent, 28 | 'environment': OfficeEnv, 29 | 'sampler': HierarchicalSampler, 30 | 'data_dir': '.', 31 | 'num_epochs': 200, 32 | 'max_rollout_len': 350, 33 | 'n_steps_per_epoch': 5e5, 34 | 'log_output_per_epoch': 1000, 35 | 'n_warmup_steps': 2e3, 36 | } 37 | configuration = AttrDict(configuration) 38 | 39 | # Observation Normalization 40 | obs_norm_params = AttrDict( 41 | ) 42 | 43 | base_agent_params = AttrDict( 44 | batch_size=128, 45 | ) 46 | 47 | ###### Low-Level ###### 48 | # LL Policy 49 | ll_model_params = AttrDict( 50 | state_dim=data_spec.state_dim, 51 | action_dim=data_spec.n_actions, 52 | n_rollout_steps=10, 53 | kl_div_weight=5e-4, 54 | nz_vae=10, 55 | nz_enc=128, 56 | nz_mid=128, 57 | n_processing_layers=5, 58 | cond_decode=True, 59 | ) 60 | 61 | # LL Policy 62 | ll_policy_params = AttrDict( 63 | policy_model=ClSPiRLMdl, 64 | policy_model_params=ll_model_params, 65 | policy_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_prior/office/office_prior"), 66 | ) 67 | ll_policy_params.update(ll_model_params) 68 | 69 | # LL Critic 70 | ll_critic_params = AttrDict( 71 | action_dim=data_spec.n_actions, 72 | input_dim=data_spec.state_dim, 73 | output_dim=1, 74 | action_input=True, 75 | unused_obs_size=ll_model_params.nz_vae, # ignore HL policy z output in observation for LL critic 76 | ) 77 | 78 | # LL Agent 79 | ll_agent_config = copy.deepcopy(base_agent_params) 80 | ll_agent_config.update(AttrDict( 81 | policy=ClModelPolicy, 82 | policy_params=ll_policy_params, 83 | critic=SplitObsMLPCritic, 84 | critic_params=ll_critic_params, 85 | )) 86 | 87 | ###### High-Level ######## 88 | # HL Policy 89 | hl_policy_params = AttrDict( 90 | action_dim=ll_model_params.nz_vae, # z-dimension of the skill VAE 91 | input_dim=data_spec.state_dim, 92 | squash_output_dist=True, 93 | max_action_range=2., 94 | prior_model_params=ll_policy_params.policy_model_params, 95 | prior_model=ll_policy_params.policy_model, 96 | prior_model_checkpoint=ll_policy_params.policy_model_checkpoint, 97 | posterior_model=ll_policy_params.policy_model, 98 | posterior_model_params=copy.deepcopy(ll_policy_params.policy_model_params), 99 | posterior_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_posterior/office/office_post"), 100 | ) 101 | hl_policy_params.posterior_model_params.batch_size = base_agent_params.batch_size 102 | 103 | hl_policy_params.policy_model = ll_policy_params.policy_model 104 | hl_policy_params.policy_model_params = 
copy.deepcopy(ll_policy_params.policy_model_params) 105 | hl_policy_params.policy_model_checkpoint = hl_policy_params.posterior_model_checkpoint 106 | hl_policy_params.policy_model_params.batch_size = base_agent_params.batch_size 107 | 108 | 109 | # HL Critic 110 | hl_critic_params = AttrDict( 111 | action_dim=hl_policy_params.action_dim, 112 | input_dim=hl_policy_params.input_dim, 113 | output_dim=1, 114 | n_layers=2, 115 | nz_mid=256, 116 | action_input=True, 117 | ) 118 | 119 | # HL GAIL Demo Dataset 120 | from spirl.components.data_loader import GlobalSplitVideoDataset 121 | data_config = AttrDict() 122 | data_config.dataset_spec = data_spec 123 | data_config.dataset_spec.update(AttrDict( 124 | crop_rand_subseq=True, 125 | subseq_len=2, 126 | n_seqs=100, 127 | seq_repeat=100, 128 | split=AttrDict(train=0.5, val=0.5, test=0.0), 129 | )) 130 | 131 | # HL Pre-Trained Demo Discriminator 132 | demo_discriminator_config = AttrDict( 133 | state_dim=data_spec.state_dim, 134 | normalization='none', 135 | demo_data_conf=data_config, 136 | ) 137 | 138 | # HL Agent 139 | hl_agent_config = copy.deepcopy(base_agent_params) 140 | hl_agent_config.update(AttrDict( 141 | policy=LearnedPPPolicy, 142 | policy_params=hl_policy_params, 143 | critic=MLPCritic, 144 | critic_params=hl_critic_params, 145 | discriminator=DemoDiscriminator, 146 | discriminator_params=demo_discriminator_config, 147 | discriminator_checkpoint=os.path.join(os.environ["EXP_DIR"], "demo_discriminator/office/office_discr"), 148 | freeze_discriminator=True, # don't update pretrained discriminator 149 | buffer=UniformReplayBuffer, 150 | buffer_params={'capacity': 1e6,}, 151 | reset_buffer=False, 152 | replay=UniformReplayBuffer, 153 | replay_params={'dump_replay': False, 'capacity': 2e6}, 154 | expert_data_conf=data_config, 155 | expert_data_path=os.path.join(os.environ['DATA_DIR'], 'office_demos'), 156 | )) 157 | 158 | # SkiLD Parameters 159 | hl_agent_config.update(AttrDict( 160 | lambda_gail_schedule_params=AttrDict(p=0.9), 161 | fixed_alpha=5.0, 162 | fixed_alpha_q=5.0, 163 | )) 164 | 165 | 166 | ##### Joint Agent ####### 167 | agent_config = AttrDict( 168 | hl_agent=SkiLDAgent, 169 | hl_agent_params=hl_agent_config, 170 | ll_agent=SACAgent, 171 | ll_agent_params=ll_agent_config, 172 | hl_interval=ll_model_params.n_rollout_steps, 173 | log_videos=True, 174 | update_hl=True, 175 | update_ll=False, 176 | ) 177 | 178 | # Sampler 179 | sampler_config = AttrDict( 180 | ) 181 | 182 | # Environment 183 | env_config = AttrDict( 184 | reward_norm=1, 185 | ) 186 | 187 | -------------------------------------------------------------------------------- /skild/configs/imitation/kitchen/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | import torch 4 | 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.rl.components.agent import FixedIntervalHierarchicalAgent 7 | from spirl.rl.components.replay_buffer import UniformReplayBuffer 8 | from spirl.rl.components.sampler import HierarchicalSampler 9 | from spirl.rl.components.critic import MLPCritic, SplitObsMLPCritic 10 | from spirl.rl.agents.ac_agent import SACAgent 11 | from spirl.rl.policies.cl_model_policies import ClModelPolicy 12 | from spirl.rl.envs.kitchen import KitchenEnv 13 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 14 | from spirl.configs.default_data_configs.kitchen import data_spec 15 | 16 | from skild.rl.policies.posterior_policies import LearnedPPPolicy 17 | from 
skild.models.demo_discriminator import DemoDiscriminator 18 | from skild.rl.agents.skild_agent import SkiLDAgent 19 | 20 | 21 | current_dir = os.path.dirname(os.path.realpath(__file__)) 22 | 23 | notes = 'used to test the RL implementation' 24 | 25 | configuration = { 26 | 'seed': 42, 27 | 'agent': FixedIntervalHierarchicalAgent, 28 | 'environment': KitchenEnv, 29 | 'sampler': HierarchicalSampler, 30 | 'data_dir': '.', 31 | 'num_epochs': 200, 32 | 'max_rollout_len': 280, 33 | 'n_steps_per_epoch': 1e6, 34 | 'log_output_per_epoch': 1000, 35 | 'n_warmup_steps': 2e3, 36 | } 37 | configuration = AttrDict(configuration) 38 | 39 | # Observation Normalization 40 | obs_norm_params = AttrDict( 41 | ) 42 | 43 | base_agent_params = AttrDict( 44 | batch_size=128, 45 | # update_iterations=XXX, 46 | ) 47 | 48 | ###### Low-Level ###### 49 | # LL Policy 50 | ll_model_params = AttrDict( 51 | state_dim=data_spec.state_dim, 52 | action_dim=data_spec.n_actions, 53 | n_rollout_steps=10, 54 | kl_div_weight=5e-4, 55 | nz_vae=10, 56 | nz_enc=128, 57 | nz_mid=128, 58 | n_processing_layers=5, 59 | cond_decode=True, 60 | ) 61 | 62 | # LL Policy 63 | ll_policy_params = AttrDict( 64 | policy_model=ClSPiRLMdl, 65 | policy_model_params=ll_model_params, 66 | policy_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_prior/kitchen/kitchen_prior"), 67 | ) 68 | ll_policy_params.update(ll_model_params) 69 | 70 | # LL Critic 71 | ll_critic_params = AttrDict( 72 | action_dim=data_spec.n_actions, 73 | input_dim=data_spec.state_dim, 74 | output_dim=1, 75 | action_input=True, 76 | unused_obs_size=ll_model_params.nz_vae, # ignore HL policy z output in observation for LL critic 77 | ) 78 | 79 | # LL Agent 80 | ll_agent_config = copy.deepcopy(base_agent_params) 81 | ll_agent_config.update(AttrDict( 82 | policy=ClModelPolicy, 83 | policy_params=ll_policy_params, 84 | critic=SplitObsMLPCritic, 85 | critic_params=ll_critic_params, 86 | )) 87 | 88 | ###### High-Level ######## 89 | # HL Policy 90 | hl_policy_params = AttrDict( 91 | action_dim=ll_model_params.nz_vae, # z-dimension of the skill VAE 92 | input_dim=data_spec.state_dim, 93 | squash_output_dist=True, 94 | max_action_range=2., 95 | prior_model_params=ll_policy_params.policy_model_params, 96 | prior_model=ll_policy_params.policy_model, 97 | prior_model_checkpoint=ll_policy_params.policy_model_checkpoint, 98 | posterior_model=ll_policy_params.policy_model, 99 | posterior_model_params=copy.deepcopy(ll_policy_params.policy_model_params), 100 | posterior_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_posterior/kitchen/kitchen_post"), 101 | ) 102 | hl_policy_params.posterior_model_params.batch_size = base_agent_params.batch_size 103 | 104 | hl_policy_params.policy_model = ll_policy_params.policy_model 105 | hl_policy_params.policy_model_params = copy.deepcopy(ll_policy_params.policy_model_params) 106 | hl_policy_params.policy_model_checkpoint = hl_policy_params.posterior_model_checkpoint 107 | hl_policy_params.policy_model_params.batch_size = base_agent_params.batch_size 108 | 109 | 110 | # HL Critic 111 | hl_critic_params = AttrDict( 112 | action_dim=hl_policy_params.action_dim, 113 | input_dim=hl_policy_params.input_dim, 114 | output_dim=1, 115 | n_layers=2, 116 | nz_mid=256, 117 | action_input=True, 118 | ) 119 | 120 | # HL GAIL Demo Dataset 121 | from spirl.components.data_loader import GlobalSplitVideoDataset 122 | data_config = AttrDict() 123 | data_config.dataset_spec = data_spec 124 | data_config.dataset_spec.update(AttrDict( 125 | crop_rand_subseq=True, 
126 | subseq_len=2, 127 | filter_indices=[[320, 337], [339, 344]], 128 | demo_repeats=10, 129 | )) 130 | 131 | # HL Pre-Trained Demo Discriminator 132 | demo_discriminator_config = AttrDict( 133 | state_dim=data_spec.state_dim, 134 | normalization='none', 135 | demo_data_conf=data_config, 136 | ) 137 | 138 | # HL Agent 139 | hl_agent_config = copy.deepcopy(base_agent_params) 140 | hl_agent_config.update(AttrDict( 141 | policy=LearnedPPPolicy, 142 | policy_params=hl_policy_params, 143 | critic=MLPCritic, 144 | critic_params=hl_critic_params, 145 | discriminator=DemoDiscriminator, 146 | discriminator_params=demo_discriminator_config, 147 | discriminator_checkpoint=os.path.join(os.environ["EXP_DIR"], "demo_discriminator/kitchen/kitchen_discr"), 148 | freeze_discriminator=False, # don't update pretrained discriminator 149 | discriminator_updates=5e-4, 150 | buffer=UniformReplayBuffer, 151 | buffer_params={'capacity': 1e6,}, 152 | reset_buffer=False, 153 | replay=UniformReplayBuffer, 154 | replay_params={'dump_replay': False, 'capacity': 2e6}, 155 | expert_data_conf=data_config, 156 | expert_data_path=".", 157 | )) 158 | 159 | # SkiLD Parameters 160 | hl_agent_config.update(AttrDict( 161 | lambda_gail_schedule_params=AttrDict(p=0.9), 162 | fixed_alpha=1e-1, 163 | fixed_alpha_q=1e-1, 164 | )) 165 | 166 | 167 | ##### Joint Agent ####### 168 | agent_config = AttrDict( 169 | hl_agent=SkiLDAgent, 170 | hl_agent_params=hl_agent_config, 171 | ll_agent=SACAgent, 172 | ll_agent_params=ll_agent_config, 173 | hl_interval=ll_model_params.n_rollout_steps, 174 | log_videos=True, 175 | update_hl=True, 176 | update_ll=False, 177 | ) 178 | 179 | # Sampler 180 | sampler_config = AttrDict( 181 | ) 182 | 183 | # Environment 184 | env_config = AttrDict( 185 | reward_norm=1, 186 | name='kitchen-kbts-v0', 187 | ) 188 | 189 | -------------------------------------------------------------------------------- /skild/configs/imitation/maze/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | import copy 3 | import torch 4 | 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.rl.components.agent import FixedIntervalHierarchicalAgent 7 | from spirl.rl.components.replay_buffer import UniformReplayBuffer 8 | from spirl.rl.components.sampler import HierarchicalSampler 9 | from spirl.rl.components.critic import MLPCritic, SplitObsMLPCritic 10 | from spirl.rl.agents.ac_agent import SACAgent 11 | from spirl.rl.policies.cl_model_policies import ClModelPolicy 12 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 13 | from spirl.configs.default_data_configs.maze import data_spec 14 | 15 | from skild.rl.policies.posterior_policies import LearnedPPPolicy 16 | from skild.models.demo_discriminator import DemoDiscriminator 17 | from skild.rl.envs.maze import ACRandMaze0S40Env 18 | from skild.rl.agents.skild_agent import SkiLDAgent 19 | from skild.data.maze.src.maze_agents import MazeSkiLDAgent 20 | 21 | 22 | current_dir = os.path.dirname(os.path.realpath(__file__)) 23 | 24 | notes = 'used to test the RL implementation' 25 | 26 | configuration = { 27 | 'seed': 42, 28 | 'agent': FixedIntervalHierarchicalAgent, 29 | 'environment': ACRandMaze0S40Env, 30 | 'sampler': HierarchicalSampler, 31 | 'data_dir': '.', 32 | 'num_epochs': 200, 33 | 'max_rollout_len': 2000, 34 | 'n_steps_per_epoch': 1e5, 35 | 'log_output_per_epoch': 1000, 36 | 'n_warmup_steps': 2e3, 37 | } 38 | configuration = AttrDict(configuration) 39 | 40 | # Observation Normalization 41 | obs_norm_params = 
AttrDict( 42 | ) 43 | 44 | base_agent_params = AttrDict( 45 | batch_size=128, 46 | ) 47 | 48 | ###### Low-Level ###### 49 | # LL Policy 50 | ll_model_params = AttrDict( 51 | state_dim=data_spec.state_dim, 52 | action_dim=data_spec.n_actions, 53 | n_rollout_steps=10, 54 | kl_div_weight=1e-3, 55 | nz_vae=10, 56 | nz_enc=128, 57 | nz_mid=128, 58 | n_processing_layers=5, 59 | cond_decode=True, 60 | ) 61 | 62 | # LL Policy 63 | ll_policy_params = AttrDict( 64 | policy_model=ClSPiRLMdl, 65 | policy_model_params=ll_model_params, 66 | policy_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_prior/maze/maze_prior"), 67 | ) 68 | ll_policy_params.update(ll_model_params) 69 | 70 | # LL Critic 71 | ll_critic_params = AttrDict( 72 | action_dim=data_spec.n_actions, 73 | input_dim=data_spec.state_dim, 74 | output_dim=1, 75 | action_input=True, 76 | unused_obs_size=ll_model_params.nz_vae, # ignore HL policy z output in observation for LL critic 77 | ) 78 | 79 | # LL Agent 80 | ll_agent_config = copy.deepcopy(base_agent_params) 81 | ll_agent_config.update(AttrDict( 82 | policy=ClModelPolicy, 83 | policy_params=ll_policy_params, 84 | critic=SplitObsMLPCritic, 85 | critic_params=ll_critic_params, 86 | )) 87 | 88 | ###### High-Level ######## 89 | # HL Policy 90 | hl_policy_params = AttrDict( 91 | action_dim=ll_model_params.nz_vae, # z-dimension of the skill VAE 92 | input_dim=data_spec.state_dim, 93 | squash_output_dist=True, 94 | max_action_range=2., 95 | prior_model_params=ll_policy_params.policy_model_params, 96 | prior_model=ll_policy_params.policy_model, 97 | prior_model_checkpoint=ll_policy_params.policy_model_checkpoint, 98 | posterior_model=ll_policy_params.policy_model, 99 | posterior_model_params=copy.deepcopy(ll_policy_params.policy_model_params), 100 | posterior_model_checkpoint=os.path.join(os.environ["EXP_DIR"], "skill_posterior/maze/maze_post"), 101 | ) 102 | hl_policy_params.posterior_model_params.batch_size = base_agent_params.batch_size 103 | 104 | hl_policy_params.policy_model = ll_policy_params.policy_model 105 | hl_policy_params.policy_model_params = copy.deepcopy(ll_policy_params.policy_model_params) 106 | hl_policy_params.policy_model_checkpoint = hl_policy_params.prior_model_checkpoint 107 | hl_policy_params.policy_model_params.batch_size = base_agent_params.batch_size 108 | 109 | 110 | # HL Critic 111 | hl_critic_params = AttrDict( 112 | action_dim=hl_policy_params.action_dim, 113 | input_dim=hl_policy_params.input_dim, 114 | output_dim=1, 115 | n_layers=2, 116 | nz_mid=256, 117 | action_input=True, 118 | ) 119 | 120 | # HL GAIL Demo Dataset 121 | from spirl.components.data_loader import GlobalSplitVideoDataset 122 | data_config = AttrDict() 123 | data_config.dataset_spec = data_spec 124 | data_config.dataset_spec.update(AttrDict( 125 | crop_rand_subseq=True, 126 | subseq_len=2, 127 | n_seqs=10, 128 | seq_repeat=100, 129 | split=AttrDict(train=0.5, val=0.5, test=0.0), 130 | )) 131 | 132 | # HL Pre-Trained Demo Discriminator 133 | demo_discriminator_config = AttrDict( 134 | state_dim=data_spec.state_dim, 135 | normalization='none', 136 | demo_data_conf=data_config, 137 | ) 138 | 139 | # HL Agent 140 | hl_agent_config = copy.deepcopy(base_agent_params) 141 | hl_agent_config.update(AttrDict( 142 | policy=LearnedPPPolicy, 143 | policy_params=hl_policy_params, 144 | critic=MLPCritic, 145 | critic_params=hl_critic_params, 146 | discriminator=DemoDiscriminator, 147 | discriminator_params=demo_discriminator_config, 148 | discriminator_checkpoint=os.path.join(os.environ["EXP_DIR"], 
"demo_discriminator/maze/maze_discr"), 149 | freeze_discriminator=False, # don't update pretrained discriminator 150 | discriminator_updates=0.2, 151 | buffer=UniformReplayBuffer, 152 | buffer_params={'capacity': 1e6,}, 153 | reset_buffer=False, 154 | replay=UniformReplayBuffer, 155 | replay_params={'dump_replay': False, 'capacity': 2e6}, 156 | expert_data_conf=data_config, 157 | expert_data_path=os.path.join(os.environ['DATA_DIR'], 'maze_demos'), 158 | )) 159 | 160 | # SkiLD Parameters 161 | hl_agent_config.update(AttrDict( 162 | lambda_gail_schedule_params=AttrDict(p=0.9), 163 | td_schedule_params=AttrDict(p=10.0), 164 | tdq_schedule_params=AttrDict(p=1.0), 165 | )) 166 | 167 | 168 | ##### Joint Agent ####### 169 | agent_config = AttrDict( 170 | hl_agent=MazeSkiLDAgent, 171 | hl_agent_params=hl_agent_config, 172 | ll_agent=SACAgent, 173 | ll_agent_params=ll_agent_config, 174 | hl_interval=ll_model_params.n_rollout_steps, 175 | log_videos=False, 176 | update_hl=True, 177 | update_ll=False, 178 | ) 179 | 180 | # Sampler 181 | sampler_config = AttrDict( 182 | ) 183 | 184 | # Environment 185 | env_config = AttrDict( 186 | reward_norm=1, 187 | ) 188 | 189 | -------------------------------------------------------------------------------- /skild/configs/skill_posterior/kitchen/conf.py: -------------------------------------------------------------------------------- 1 | from skild.configs.skill_prior.kitchen.conf import * 2 | 3 | data_config.dataset_spec.filter_indices = [[320, 337], [339, 344]] # use only demos for one task (here: KBTS) 4 | data_config.dataset_spec.demo_repeats = 10 # repeat those demos N times 5 | 6 | model_config.embedding_checkpoint = os.path.join(os.environ["EXP_DIR"], 7 | "skill_prior/kitchen/kitchen_prior/weights") 8 | -------------------------------------------------------------------------------- /skild/configs/skill_posterior/maze/conf.py: -------------------------------------------------------------------------------- 1 | from skild.configs.skill_prior.maze.conf import * 2 | 3 | configuration['data_dir'] = os.path.join(os.environ['DATA_DIR'], 'maze_demos') 4 | data_config.dataset_spec.n_seqs = 5 # number of demos 5 | data_config.dataset_spec.seq_repeat = 30 # how often to repeat these demos 6 | 7 | configuration['epoch_cycles_train'] = 4200 8 | 9 | model_config.embedding_checkpoint = os.path.join(os.environ["EXP_DIR"], 10 | "skill_prior/maze/maze_prior/weights") 11 | -------------------------------------------------------------------------------- /skild/configs/skill_posterior/office/conf.py: -------------------------------------------------------------------------------- 1 | from skild.configs.skill_prior.office.conf import * 2 | 3 | configuration['data_dir'] = os.path.join(os.environ['DATA_DIR'], 'office_demos') 4 | data_config.dataset_spec.n_seqs = 50 # number of demos 5 | data_config.dataset_spec.seq_repeat = 3 # how often to repeat these demos 6 | 7 | configuration['epoch_cycles_train'] = 6000 8 | 9 | model_config.embedding_checkpoint = os.path.join(os.environ["EXP_DIR"], 10 | "skill_prior/office/office_prior/weights") 11 | -------------------------------------------------------------------------------- /skild/configs/skill_prior/kitchen/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 4 | from spirl.components.logger import Logger 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.configs.default_data_configs.kitchen import 
data_spec 7 | from spirl.components.evaluator import TopOfNSequenceEvaluator 8 | 9 | current_dir = os.path.dirname(os.path.realpath(__file__)) 10 | 11 | 12 | configuration = { 13 | 'model': ClSPiRLMdl, 14 | 'logger': Logger, 15 | 'data_dir': '.', 16 | 'epoch_cycles_train': 50, 17 | 'num_epochs': 100, 18 | 'evaluator': TopOfNSequenceEvaluator, 19 | 'top_of_n_eval': 100, 20 | 'top_comp_metric': 'mse', 21 | } 22 | configuration = AttrDict(configuration) 23 | 24 | model_config = AttrDict( 25 | state_dim=data_spec.state_dim, 26 | action_dim=data_spec.n_actions, 27 | n_rollout_steps=10, 28 | kl_div_weight=5e-4, 29 | nz_enc=128, 30 | nz_mid=128, 31 | n_processing_layers=5, 32 | cond_decode=True, 33 | ) 34 | 35 | # Dataset 36 | data_config = AttrDict() 37 | data_config.dataset_spec = data_spec 38 | data_config.dataset_spec.subseq_len = model_config.n_rollout_steps + 1 # flat last action from seq gets cropped 39 | -------------------------------------------------------------------------------- /skild/configs/skill_prior/maze/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 4 | from spirl.components.logger import Logger 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.configs.default_data_configs.maze import data_spec 7 | from spirl.components.evaluator import TopOfNSequenceEvaluator 8 | 9 | current_dir = os.path.dirname(os.path.realpath(__file__)) 10 | 11 | 12 | configuration = { 13 | 'model': ClSPiRLMdl, 14 | 'logger': Logger, 15 | 'data_dir': os.path.join(os.environ['DATA_DIR'], 'maze_TA'), 16 | 'epoch_cycles_train': 250, 17 | 'num_epochs': 100, 18 | 'evaluator': TopOfNSequenceEvaluator, 19 | 'top_of_n_eval': 100, 20 | 'top_comp_metric': 'mse', 21 | } 22 | configuration = AttrDict(configuration) 23 | 24 | model_config = AttrDict( 25 | state_dim=data_spec.state_dim, 26 | action_dim=data_spec.n_actions, 27 | n_rollout_steps=10, 28 | kl_div_weight=1e-3, 29 | nz_enc=128, 30 | nz_mid=128, 31 | n_processing_layers=5, 32 | cond_decode=True, 33 | ) 34 | 35 | # Dataset 36 | data_config = AttrDict() 37 | data_config.dataset_spec = data_spec 38 | data_config.dataset_spec.subseq_len = model_config.n_rollout_steps + 1 # flat last action from seq gets cropped 39 | -------------------------------------------------------------------------------- /skild/configs/skill_prior/office/conf.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from spirl.models.closed_loop_spirl_mdl import ClSPiRLMdl 4 | from spirl.components.logger import Logger 5 | from spirl.utils.general_utils import AttrDict 6 | from spirl.configs.default_data_configs.office import data_spec 7 | from spirl.components.evaluator import TopOfNSequenceEvaluator 8 | 9 | current_dir = os.path.dirname(os.path.realpath(__file__)) 10 | 11 | 12 | configuration = { 13 | 'model': ClSPiRLMdl, 14 | 'logger': Logger, 15 | 'data_dir': os.path.join(os.environ['DATA_DIR'], 'office_TA'), 16 | 'epoch_cycles_train': 300, 17 | 'num_epochs': 100, 18 | 'evaluator': TopOfNSequenceEvaluator, 19 | 'top_of_n_eval': 100, 20 | 'top_comp_metric': 'mse', 21 | } 22 | configuration = AttrDict(configuration) 23 | 24 | model_config = AttrDict( 25 | state_dim=data_spec.state_dim, 26 | action_dim=data_spec.n_actions, 27 | n_rollout_steps=10, 28 | kl_div_weight=5e-4, 29 | nz_enc=128, 30 | nz_mid=128, 31 | n_processing_layers=5, 32 | cond_decode=True, 33 | ) 34 | 35 | # Dataset 36 | data_config = 
AttrDict() 37 | data_config.dataset_spec = data_spec 38 | data_config.dataset_spec.subseq_len = model_config.n_rollout_steps + 1 # flat last action from seq gets cropped 39 | -------------------------------------------------------------------------------- /skild/data/kitchen/README.md: -------------------------------------------------------------------------------- 1 | # Choosing Kitchen Target Tasks 2 | 3 | In the kitchen environment a task defines the consecutive execution of four subtasks. 4 | Some subtask sequences can be more challenging for agents to learn than others. For SkiLD, as well as for SPiRL and any 5 | other approach that leverages prior experience, the task complexity is mainly influenced by how well the respective 6 | subtask transitions are represented in the prior experience data. 7 | 8 | We use the training data of Gupta et al., 2020 for training our models on the kitchen tasks. 9 | The subtask transitions in this dataset are not uniformly distributed, i.e., certain subtask sequences are more likely 10 | than others. Thus, we can define easier tasks in the kitchen environments as those that require more likely subtask 11 | transitions. Conversely, more challenging tasks will require unlikely or unseen subtask transitions. 12 | 13 | In the SkiLD paper, we analyze the effect of target tasks of differing alignment with the pre-training data 14 | in the kitchen environment (Section 4.4). We also provide an analysis of the subtask transition probabilities in the 15 | dataset of Gupta et al. (Figure 14, see below), which we can use to determine tasks of varying complexity. 16 | 17 |
<img src="../../../docs/resources/kitchen_subtask_distribution.png"/>
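The per-trajectory subtask sequences underlying this analysis can be re-generated with the script provided in this directory (listed in full below). Assuming the environment from the installation instructions is set up (including our D4RL fork, which provides the kitchen dataset), the script can be run directly from the repository root:
```
python3 skild/data/kitchen/kitchen_subtasks.py
```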
18 | 19 | 20 | ## Changing the Kitchen Target Task 21 | 22 | By default, the configs provided in this repository can be used to train a SkiLD agent on the 23 | `Kettle --> Bottom Burner --> Top Burner --> Slide Cabinet` task, whose transitions are relatively well-represented in the 24 | pre-training data. In order to change the target task, the values for `filter_indices` in the configs for training 25 | of the [skill posterior](../../../skild/configs/skill_posterior/kitchen/conf.py#L3) and 26 | [demonstration discriminator](../../../skild/configs/demo_discriminator/kitchen/conf.py#L35) need to be adjusted. 27 | These indices determine the sequences from the pre-training data that are used as demonstrations for the downstream task. 28 | 29 | To obtain a mapping from the sequence of solved subtasks, and thus the task, to the trajectory indices, we provide 30 | [a script](../../../skild/data/kitchen/kitchen_subtasks.py) that determines the subtask sequence for each of the 603 sequences in the pre-training dataset. 31 | If you want to change the target task, simply run this script, determine the indices for the desired subtask sequence, and adjust 32 | the config files linked above. For example, for the challenging task 33 | `Microwave --> Light Switch --> Slide Cabinet --> Hinge Cabinet` used in the paper, the indices need to be changed to 34 | `[190, 210]` for training the skill posterior and discriminator. 35 | 36 | If you want to run RL on the new target task, you need to change the name of the Kitchen environment in the RL config 37 | accordingly, e.g., [HERE](../../../skild/configs/demo_rl/kitchen/conf.py#L185) from `kitchen-kbts-v0` to `kitchen-mlsh-v0`. 38 | 39 | 40 | -------------------------------------------------------------------------------- /skild/data/kitchen/kitchen_subtasks.py: -------------------------------------------------------------------------------- 1 | import tqdm 2 | import gym 3 | import d4rl 4 | import numpy as np 5 | 6 | from spirl.utils.general_utils import AttrDict 7 | from spirl.configs.default_data_configs.kitchen import data_spec 8 | 9 | 10 | OBJS = ['bottom burner', 'top burner', 'light switch', 'slide cabinet', 'hinge cabinet', 'microwave', 'kettle'] 11 | OBS_ELEMENT_INDICES = { 12 | 'bottom burner': np.array([11, 12]), 13 | 'top burner': np.array([15, 16]), 14 | 'light switch': np.array([17, 18]), 15 | 'slide cabinet': np.array([19]), 16 | 'hinge cabinet': np.array([20, 21]), 17 | 'microwave': np.array([22]), 18 | 'kettle': np.array([23, 24, 25, 26, 27, 28, 29]), 19 | } 20 | OBS_ELEMENT_GOALS = { 21 | 'bottom burner': np.array([-0.88, -0.01]), 22 | 'top burner': np.array([-0.92, -0.01]), 23 | 'light switch': np.array([-0.69, -0.05]), 24 | 'slide cabinet': np.array([0.37]), 25 | 'hinge cabinet': np.array([0., 1.45]), 26 | 'microwave': np.array([-0.75]), 27 | 'kettle': np.array([-0.23, 0.75, 1.62, 0.99, 0., 0., -0.06]), 28 | } 29 | BONUS_THRESH = 0.3 30 | 31 | 32 | ## Demo Dataset 33 | demo_data_config = AttrDict() 34 | demo_data_config.device = 'cpu' 35 | demo_data_config.dataset_spec = data_spec 36 | demo_data_config.dataset_spec.crop_rand_subseq = True 37 | demo_data_config.dataset_spec.subseq_len = 1+1+(3-1) 38 | 39 | loader = data_spec.dataset_class('.', demo_data_config, resolution=32, phase='train', shuffle=True, dataset_size=-1) 40 | seqs = loader.seqs 41 | 42 | ## determine achieved subgoals + respective time steps 43 | n_seqs, n_objs = len(seqs), len(OBJS) 44 | subtask_steps = np.Inf * np.ones((n_seqs, n_objs)) 45 | for s_idx, seq in tqdm.tqdm(enumerate(seqs)): 
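# The loop body below records, for each object, the first timestep at which it reaches its goal
# configuration (distance below BONUS_THRESH); objects that are never manipulated keep np.Inf.
# Sorting these first-completion times per sequence (further below) yields the order of the four
# completed subtasks, i.e. the task solved by that demonstration.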
46 | for o_idx, obj in enumerate(OBJS): 47 | for t, state in enumerate(seq.states): 48 | obj_state, obj_goal = state[OBS_ELEMENT_INDICES[obj]], OBS_ELEMENT_GOALS[obj] 49 | dist = np.linalg.norm(obj_state - obj_goal) 50 | if dist < BONUS_THRESH and subtask_steps[s_idx, o_idx] == np.Inf: 51 | subtask_steps[s_idx, o_idx] = t 52 | 53 | ## print subtask orders 54 | print("\n\n") 55 | 56 | subtask_freqs = {k+'_'+j+'_'+i+'_'+kk: 0 for k in OBJS for j in OBJS for i in OBJS for kk in OBJS} 57 | for s_idx, subtasks in enumerate(subtask_steps): 58 | min_task_idxs = np.argsort(subtasks)[:4] 59 | objs = [OBJS[i] for i in min_task_idxs] 60 | subtask_freqs[OBJS[min_task_idxs[0]]+'_'+OBJS[min_task_idxs[1]]\ 61 | +'_'+OBJS[min_task_idxs[2]]+'_'+OBJS[min_task_idxs[3]]] += 1 62 | print("seq {}: {}".format(s_idx, objs)) 63 | 64 | print("\n\n") 65 | for k in subtask_freqs: 66 | if subtask_freqs[k] > 0: 67 | print(k,": ", subtask_freqs[k]) 68 | 69 | 70 | 71 | -------------------------------------------------------------------------------- /skild/data/maze/src/maze_agents.py: -------------------------------------------------------------------------------- 1 | import matplotlib.pyplot as plt 2 | 3 | from spirl.rl.components.replay_buffer import UniformReplayBuffer 4 | from spirl.data.maze.src.maze_agents import MazeAgent 5 | from skild.rl.agents.skild_agent import SkiLDAgent 6 | 7 | 8 | 9 | class MazeSkiLDAgent(SkiLDAgent, MazeAgent): 10 | def __init__(self, *args, **kwargs): 11 | SkiLDAgent.__init__(self, *args, **kwargs) 12 | self.vis_replay_buffer = UniformReplayBuffer({'capacity': 1e7}) # purely for logging purposes 13 | 14 | def visualize(self, logger, rollout_storage, step): 15 | MazeAgent._vis_replay_buffer(self, logger, step) 16 | MazeSkiLDAgent._vis_replay_buffer(self, logger, step) 17 | 18 | def _vis_replay_buffer(self, logger, step): 19 | # visualize discriminator rewards 20 | if 'discr_reward' in self.vis_replay_buffer: 21 | # get data 22 | size = self.vis_replay_buffer.size 23 | start = max(0, size-5000) 24 | states = self.vis_replay_buffer.get().observation[start:size, :2] 25 | rewards = self.vis_replay_buffer.get().discr_reward[start:size] 26 | 27 | fig = plt.figure() 28 | plt.scatter(states[:, 0], states[:, 1], s=5, c=rewards, cmap='RdYlGn') 29 | plt.axis("equal") 30 | logger.log_plot(fig, "discr_reward_vis", step) 31 | plt.close(fig) 32 | -------------------------------------------------------------------------------- /skild/models/demo_discriminator.py: -------------------------------------------------------------------------------- 1 | from contextlib import contextmanager 2 | import copy 3 | import torch 4 | import torch.nn as nn 5 | import numpy as np 6 | from matplotlib import pyplot as plt 7 | 8 | from spirl.components.base_model import BaseModel 9 | from spirl.components.logger import Logger 10 | from spirl.components.checkpointer import CheckpointHandler 11 | from spirl.modules.losses import BCELogitsLoss 12 | from spirl.modules.subnetworks import Predictor, Encoder 13 | from spirl.utils.general_utils import AttrDict, ParamDict, map_dict 14 | from spirl.utils.vis_utils import fig2img 15 | from spirl.utils.pytorch_utils import RemoveSpatial, ResizeSpatial 16 | from spirl.modules.layers import LayerBuilderParams 17 | 18 | 19 | class DemoDiscriminator(BaseModel): 20 | """Simple feed forward predictor network that distinguishes demo and non-demo states.""" 21 | def __init__(self, params, logger=None): 22 | BaseModel.__init__(self, logger) 23 | self._hp = self._default_hparams() 24 | 
self._hp.overwrite(params) # override defaults with config file 25 | self._hp.builder = LayerBuilderParams(self._hp.use_convs, self._hp.normalization) 26 | self.device = self._hp.device 27 | 28 | # set up demo dataset 29 | if self._hp.demo_data_path is not None: 30 | self._hp.demo_data_conf.device = self.device 31 | self._demo_data_loader = self._hp.demo_data_conf.dataset_spec.dataset_class( 32 | self._hp.demo_data_path, self._hp.demo_data_conf, resolution=self._hp.demo_data_conf.dataset_spec.res, 33 | phase="train", shuffle=True).get_data_loader(self._hp.batch_size, 34 | n_repeat=10000) # making new iterators is slow, so repeat often 35 | self._demo_data_iter = iter(self._demo_data_loader) 36 | 37 | self.build_network() 38 | 39 | def _default_hparams(self): 40 | # put new parameters in here: 41 | return super()._default_hparams().overwrite(ParamDict({ 42 | 'state_dim': None, 43 | 'action_dim': None, 44 | 'use_convs': False, 45 | 'device': None, 46 | 'nz_enc': 32, # number of dimensions in encoder-latent space 47 | 'nz_mid': 32, # number of dimensions for internal feature spaces 48 | 'n_processing_layers': 3, # number of layers in MLPs 49 | 'action_input': False, # if True, conditions on action in addition to state 50 | 'demo_data_conf': {}, # data configuration for demo dataset 51 | 'demo_data_path': None, # path to demo data directory 52 | })) 53 | 54 | def build_network(self): 55 | assert not self._hp.use_convs # currently only supports non-image inputs 56 | self.demo_discriminator = self.build_discriminator() 57 | 58 | def forward(self, inputs): 59 | """forward pass at training time""" 60 | # run discriminator in test-mode if no dataset for training is given 61 | if self._hp.demo_data_path is None: 62 | return self.demo_discriminator(inputs) 63 | 64 | output = AttrDict() 65 | 66 | # sample demo inputs 67 | demo_inputs = self._get_demo_batch() 68 | 69 | # run discriminator on demo and non-demo data 70 | output.demo_logits = self.demo_discriminator(self._discriminator_input(demo_inputs)) 71 | output.nondemo_logits = self.demo_discriminator(self._discriminator_input(inputs)) 72 | output.logits = torch.cat((output.demo_logits, output.nondemo_logits)) 73 | 74 | # compute targets for discriminator outputs 75 | output.targets = torch.cat((torch.ones_like(output.demo_logits), torch.zeros_like(output.nondemo_logits))) 76 | 77 | return output 78 | 79 | def loss(self, model_output, inputs): 80 | losses = AttrDict() 81 | 82 | # discriminator loss 83 | losses.discriminator_loss = BCELogitsLoss(1.)(model_output.logits, model_output.targets) 84 | 85 | losses.total = self._compute_total_loss(losses) 86 | return losses 87 | 88 | def _log_outputs(self, model_output, inputs, losses, step, log_images, phase, logger, **logging_kwargs): 89 | # log videos/gifs in tensorboard 90 | if 'demo_logits' in model_output: 91 | logger.log_scalar(torch.sigmoid(model_output.demo_logits).mean(), "p_demo", step, phase) 92 | logger.log_scalar(torch.sigmoid(model_output.nondemo_logits).mean(), "p_nondemo", step, phase) 93 | 94 | def build_discriminator(self): 95 | return Predictor(self._hp, input_size=self.discriminator_input_size, output_size=1, 96 | num_layers=self._hp.n_processing_layers, mid_size=self._hp.nz_mid) 97 | 98 | def _discriminator_input(self, inputs): 99 | if not self._hp.action_input: 100 | return inputs.states[:, 0] 101 | else: 102 | return torch.cat((inputs.states[:, 0], inputs.actions[:, 0]), dim=-1) 103 | 104 | def _get_demo_batch(self): 105 | try: 106 | demo_batch = next(self._demo_data_iter) 107 | 
except StopIteration: 108 | self._demo_data_iter = iter(self._demo_data_loader) 109 | demo_batch = next(self._demo_data_iter) 110 | return AttrDict(map_dict(lambda x: x.to(self.device), demo_batch)) 111 | 112 | def evaluate_discriminator(self, state): 113 | """Evaluates discriminator probability.""" 114 | return nn.Sigmoid()(self.demo_discriminator(state)) 115 | 116 | @property 117 | def resolution(self): 118 | return 64 # return dummy resolution, images are not used by this model 119 | 120 | @property 121 | def discriminator_input_size(self): 122 | return self._hp.state_dim if not self._hp.action_input else self._hp.state_dim + self._hp.action_dim 123 | 124 | @contextmanager 125 | def val_mode(self): 126 | pass 127 | yield 128 | pass 129 | 130 | 131 | class ImageDemoDiscriminator(DemoDiscriminator): 132 | """Implements demo discriminator with image input.""" 133 | def _default_hparams(self): 134 | default_dict = ParamDict({ 135 | 'discriminator_input_res': 32, # input resolution of prior images 136 | 'encoder_ngf': 8, # number of feature maps in shallowest level of encoder 137 | 'n_input_frames': 1, # number of prior input frames 138 | }) 139 | # add new params to parent params 140 | return super()._default_hparams().overwrite(default_dict) 141 | 142 | def _updated_encoder_params(self): 143 | params = copy.deepcopy(self._hp) 144 | return params.overwrite(AttrDict( 145 | use_convs=True, 146 | use_skips=False, # no skip connections needed bc we are not reconstructing 147 | img_sz=self._hp.discriminator_input_res, # image resolution 148 | input_nc=3*self._hp.n_input_frames, # number of input feature maps 149 | ngf=self._hp.encoder_ngf, # number of feature maps in shallowest level 150 | nz_enc=self.discriminator_input_size, # size of image encoder output feature 151 | builder=LayerBuilderParams(use_convs=True, normalization=self._hp.normalization) 152 | )) 153 | 154 | def build_discriminator(self): 155 | return nn.Sequential( 156 | ResizeSpatial(self._hp.discriminator_input_res), 157 | Encoder(self._updated_encoder_params()), 158 | RemoveSpatial(), 159 | super().build_discriminator(), 160 | ) 161 | 162 | def _discriminator_input(self, inputs): 163 | assert not self._hp.action_input # action input currently not supported for image discriminator 164 | return inputs.images[:, :self._hp.n_input_frames]\ 165 | .reshape(inputs.images.shape[0], -1, self.resolution, self.resolution) 166 | 167 | def filter_input(self, raw_input): 168 | assert raw_input.shape[-1] == raw_input.shape[-2] == self.resolution 169 | assert len(raw_input.shape) == 4 # [batch, channels, res, res] 170 | return raw_input[:, :self._hp.n_input_frames*3] 171 | 172 | @property 173 | def discriminator_input_size(self): 174 | return self._hp.nz_mid 175 | 176 | @property 177 | def resolution(self): 178 | return self._hp.discriminator_input_res 179 | 180 | 181 | class DemoDiscriminatorLogger(Logger): 182 | """ 183 | Logger for Skill Space model. No extra methods needed to implement by 184 | environment-specific logger implementation. 
185 | """ 186 | N_LOGGING_SAMPLES = 5000 # number of samples from demo / non-demo used for logging 187 | 188 | def visualize(self, model_output, inputs, losses, step, phase, logger): 189 | pass 190 | 191 | @staticmethod 192 | def plot_discriminator_samples(demo_samples, non_demo_samples, logger, step, phase): 193 | # plot histogram of demo and non-demo sample probabilities 194 | bins = np.linspace(0, 1, 50) 195 | fig = plt.figure() 196 | plt.hist(demo_samples.p_demo, bins, alpha=0.5, label='demo') 197 | plt.hist(non_demo_samples.p_demo, bins, alpha=0.5, label='nondemo') 198 | plt.legend(loc='upper right') 199 | logger.log_images([fig2img(fig)], "p_demo_hist", step, phase) 200 | plt.close(fig) 201 | 202 | # plot 2D map of states with color-coded demo probabilities 203 | fig = plt.figure() 204 | plt.scatter(np.concatenate((demo_samples.states[:, 0], non_demo_samples.states[:, 0])), 205 | np.concatenate((demo_samples.states[:, 1], non_demo_samples.states[:, 1])), s=5, 206 | c=np.concatenate((demo_samples.p_demo, non_demo_samples.p_demo)), cmap='RdYlGn') 207 | plt.axis("equal") 208 | logger.log_images([fig2img(fig)], "maze_p_demo_vis", step, phase) 209 | plt.close(fig) 210 | -------------------------------------------------------------------------------- /skild/rl/agents/gail_agent.py: -------------------------------------------------------------------------------- 1 | import copy 2 | import numpy as np 3 | import torch 4 | from torch.nn import BCEWithLogitsLoss 5 | from torch import autograd 6 | 7 | from spirl.utils.general_utils import ParamDict, AttrDict, map_dict, ConstantSchedule 8 | from spirl.utils.pytorch_utils import map2torch, map2np, ten2ar, update_optimizer_lr 9 | from spirl.rl.agents.ac_agent import SACAgent 10 | from spirl.rl.components.agent import BaseAgent 11 | from spirl.rl.agents.prior_sac_agent import ActionPriorSACAgent 12 | from skild.rl.agents.ppo_agent import PPOAgent 13 | 14 | 15 | class GAILAgent(PPOAgent): 16 | """Implements GAIL-based agent. 
Discriminator determines reward, policy update is inherited.""" 17 | EPS = 1e-20 # constant for numerical stability in computing discriminator-based rewards 18 | 19 | def __init__(self, *args, **kwargs): 20 | super().__init__(*args, **kwargs) 21 | self._init_gail() 22 | 23 | def _init_gail(self): 24 | # set up discriminator 25 | self.discriminator = self._hp.discriminator(self._hp.discriminator_params) 26 | self.discriminator_opt = self._get_optimizer(self._hp.optimizer, self.discriminator, self._hp.discriminator_lr) 27 | if self._hp.discriminator_checkpoint is not None: 28 | BaseAgent.load_model_weights(self.discriminator, 29 | self._hp.discriminator_checkpoint, 30 | self._hp.discriminator_epoch) 31 | 32 | # load demo dataset 33 | self._hp.expert_data_conf.device = self.device.type 34 | self._expert_data_loader = self._hp.expert_data_conf.dataset_spec.dataset_class( 35 | self._hp.expert_data_path, self._hp.expert_data_conf, resolution=self._hp.expert_data_conf.dataset_spec.res, 36 | phase="train", shuffle=True).get_data_loader(self._hp.batch_size, n_repeat=10000) # making new iterators is slow, so repeat often 37 | self._expert_data_iter = iter(self._expert_data_loader) 38 | 39 | # set up trajectory buffer for discriminator training 40 | self.gail_trajectory_buffer = self._hp.buffer(self._hp.buffer_params) \ 41 | if 'buffer' in self._hp and self._hp.buffer is not None \ 42 | else self._hp.replay(self._hp.replay_params) # in case we use GAIL w/ SAC 43 | self.gail_trajectory_buffer.reset() 44 | 45 | # misc 46 | self._discriminator_update_cycles = 0 47 | self._lambda_gail = self._hp.lambda_gail_schedule(self._hp.lambda_gail_schedule_params) 48 | 49 | # optionally run BC for policy init 50 | if self._hp.bc_init_steps > 0: 51 | self._run_bc_init() 52 | 53 | def _default_hparams(self): 54 | default_dict = ParamDict({ 55 | 'discriminator': None, # discriminator class 56 | 'discriminator_params': None, # parameters for the discriminator class 57 | 'discriminator_checkpoint': None, # checkpoint to load discriminator from 58 | 'discriminator_epoch': 'latest', # epoch at which to load discriminator weights 59 | 'discriminator_lr': 3e-4, # learning rate for discriminator update 60 | 'freeze_discriminator': False, # if True, does not update discriminator 61 | 'expert_data_conf': None, # data config for expert sequences 62 | 'expert_data_path': None, # path to expert data sequences 63 | 'reset_buffer': True, # if True, resets online buffer every update iteration 64 | 'discriminator_updates': 5, # number of discriminator updates per PPO policy update cycle 65 | 'lambda_gail_schedule': ConstantSchedule, # schedule for lambda parameter 66 | 'lambda_gail_schedule_params': AttrDict(p=0.0), # factor for original reward when mixing with GAIL reward 67 | 'grad_penalty_coefficient': 0.0, # discriminator gradient penalty coefficient 68 | 'entropy_coefficient_gail': 0.0, # discriminator entropy loss coefficient 69 | 'warmup_cycles': 0, # number of first calls to update() in which only discriminator gets trained 70 | 'bc_init_steps': 0, # number of BC steps for policy before GAIL training starts 71 | }) 72 | return super()._default_hparams().overwrite(default_dict) 73 | 74 | def update(self, experience_batch): 75 | self.gail_info = {} 76 | if self._lr(self._env_steps) < 1e-10: return {} # stop running updates if learning rate is decayed to 0 77 | if self._discriminator_update_cycles < self._hp.warmup_cycles: 78 | # only train discriminator during warmup, do not update policy 79 | 
self._add_experience_discriminator_buffer(experience_batch) 80 | self._update_discriminator() 81 | return self.gail_info 82 | else: 83 | # after warmup we first update discriminator, then policy (both handled by super().update()) 84 | info = super().update(experience_batch) 85 | info.update(self.gail_info) 86 | return info 87 | 88 | def _update_discriminator(self): 89 | """Performs one training update for the discriminator.""" 90 | if self._hp.freeze_discriminator: 91 | return # do not update discriminator if it is frozen 92 | 93 | n_discriminator_updates = self._hp.discriminator_updates if self._hp.discriminator_updates >= 1 else \ 94 | int(np.random.rand() < self._hp.discriminator_updates) 95 | for _ in range(n_discriminator_updates): 96 | # sample expert and policy data batches 97 | expert_batch = self._get_expert_batch() 98 | policy_batch = self.gail_trajectory_buffer.sample(n_samples=self._hp.batch_size) 99 | policy_batch = map2torch(policy_batch, self._hp.device) 100 | 101 | # run discriminator 102 | expert_disc_outputs = self.discriminator(self.discriminator._discriminator_input( 103 | AttrDict(states=expert_batch.states, 104 | actions=expert_batch.actions))) 105 | policy_disc_outputs = self.discriminator(self.discriminator._discriminator_input( 106 | AttrDict(states=policy_batch.observation[:, None], 107 | actions=policy_batch.action[:, None]))) 108 | 109 | # compute discriminator losses: cross-entropy, entropy and gradient penalty loss 110 | expert_logits, policy_logits = expert_disc_outputs, policy_disc_outputs 111 | logits = torch.cat((expert_logits, policy_logits)) 112 | targets = torch.cat((torch.ones_like(expert_logits), torch.zeros_like(policy_logits))) 113 | discriminator_loss = BCEWithLogitsLoss()(logits, targets) 114 | discriminator_entropy = torch.distributions.Bernoulli(logits=logits).entropy().mean() 115 | discriminator_loss -= self._hp.entropy_coefficient_gail * discriminator_entropy 116 | discriminator_accuracy = ((torch.sigmoid(logits) > 0.5).float() == targets).float().mean() 117 | if self._hp.grad_penalty_coefficient > 0: 118 | grad_penalty_loss = self._hp.grad_penalty_coefficient * self._compute_gradient_penalty(expert_batch, 119 | policy_batch) 120 | discriminator_loss += grad_penalty_loss 121 | discriminator_loss += self._regularization_losses(expert_disc_outputs, policy_disc_outputs) 122 | 123 | # update discriminator 124 | self._perform_update(discriminator_loss, self.discriminator_opt, self.discriminator) 125 | 126 | # log info 127 | info = AttrDict( 128 | discriminator_loss=discriminator_loss, 129 | discriminator_entropy=discriminator_entropy, 130 | discriminator_accuracy=discriminator_accuracy, 131 | discr_real_output=torch.sigmoid(expert_logits).mean(), 132 | discr_fake_output=torch.sigmoid(policy_logits).mean(), 133 | ) 134 | info.update(self._get_obs_norm_info()) 135 | if self._hp.grad_penalty_coefficient > 0: 136 | info.update(AttrDict(grad_penalty_loss=grad_penalty_loss)) 137 | self.gail_info = map_dict(ten2ar, info) 138 | self._discriminator_update_cycles += 1 139 | 140 | def _add_experience_discriminator_buffer(self, experience_batch): 141 | """Normalizes experience and adds to discriminator replay buffer.""" 142 | # fill policy trajectories in buffer 143 | if self._hp.reset_buffer: 144 | self.gail_trajectory_buffer.reset() 145 | self.gail_trajectory_buffer.append(map2np(experience_batch)) 146 | 147 | def _aux_updates(self): 148 | """Update discriminator before updating policy & critic.""" 149 | self._update_discriminator() 150 | 151 | def 
add_aux_experience(self, experience_batch): 152 | self._add_experience_discriminator_buffer(experience_batch) 153 | 154 | def _get_expert_batch(self): 155 | try: 156 | expert_batch = next(self._expert_data_iter) 157 | except StopIteration: 158 | self._expert_data_iter = iter(self._expert_data_loader) 159 | expert_batch = next(self._expert_data_iter) 160 | expert_batch = map2np(AttrDict(expert_batch)) 161 | expert_batch.states = self._obs_normalizer(expert_batch.states) 162 | expert_batch = map2torch(expert_batch, device=self.device) 163 | return expert_batch 164 | 165 | def _preprocess_experience(self, experience_batch, policy_outputs=None): 166 | """Trains discriminator and then uses it to relabel rewards.""" 167 | assert isinstance(experience_batch.reward[0], torch.Tensor) # expects tensors as input 168 | with torch.no_grad(): 169 | if 'orig_reward' not in experience_batch: 170 | experience_batch.orig_reward = copy.deepcopy(experience_batch.reward) 171 | experience_batch.discr_reward, experience_batch.p_demo = \ 172 | self._compute_discriminator_reward(experience_batch, policy_outputs) 173 | experience_batch.reward = [(1 - self._lambda_gail(self.schedule_steps)) 174 | * dr + self._lambda_gail(self.schedule_steps) * r \ 175 | for dr, r in zip(experience_batch.discr_reward, experience_batch.orig_reward)] 176 | if isinstance(experience_batch.orig_reward, torch.Tensor): 177 | # merge list into tensor in case input is also tensor not list (during RL update) 178 | experience_batch.reward = torch.tensor(experience_batch.reward, 179 | device=experience_batch.orig_reward.device) 180 | self.gail_info.update({'discriminator_reward': np.mean(map2np(experience_batch.discr_reward)), 181 | 'rl_training_reward': np.mean(map2np(experience_batch.reward)), 182 | 'lambda_gail': self._lambda_gail(self.schedule_steps), 183 | 'buffer_size': self.gail_trajectory_buffer.size,}) 184 | return experience_batch 185 | 186 | def _compute_discriminator_reward(self, experience_batch, unused_policy_outputs): 187 | """Uses discriminator to compute GAIL reward.""" 188 | logits = self._run_discriminator(experience_batch, unused_policy_outputs) 189 | D = torch.sigmoid(logits) 190 | discriminator_reward = (D + self.EPS).log() - (1 - D + self.EPS).log() 191 | return [r for r in discriminator_reward], D 192 | 193 | def _run_discriminator(self, experience_batch, unused_policy_outputs): 194 | """Runs discriminator on experience batch [obs, act], returns logits.""" 195 | input_states = torch.stack(experience_batch.observation) if isinstance(experience_batch.observation, list) \ 196 | else experience_batch.observation 197 | input_actions = torch.stack(experience_batch.action) if isinstance(experience_batch.action, list) \ 198 | else experience_batch.action 199 | discr_output = self.discriminator(self.discriminator._discriminator_input( 200 | AttrDict(states=input_states[:, None], actions=input_actions[:, None]))) 201 | return discr_output[:, 0] 202 | 203 | def _compute_gradient_penalty(self, expert_batch, policy_batch): 204 | """Computes mixup gradient penalty for discriminator.""" 205 | # create mixed policy + expert input 206 | alpha = torch.rand([policy_batch.observation.shape[0], 1], device=policy_batch.observation.device) 207 | mixup_state = alpha * policy_batch.observation + (1-alpha) * expert_batch.states[:, 0] 208 | mixup_action = alpha * policy_batch.action + (1-alpha) * expert_batch.actions[:, 0] 209 | mixup_state.requires_grad = True; mixup_action.requires_grad = True 210 | 211 | # compute discriminator gradients 212 | 
disc_output = self.discriminator(mixup_state, mixup_action).q[:, 0] 213 | grad = torch.cat(autograd.grad(outputs=disc_output, 214 | inputs=[mixup_state, mixup_action], 215 | grad_outputs=torch.ones_like(disc_output), 216 | create_graph=True, 217 | retain_graph=True, 218 | only_inputs=True), dim=-1) 219 | 220 | # compute gradient penalty 221 | grad_penalty = (grad.norm(2, dim=1) - 1).pow(2).mean() 222 | return grad_penalty 223 | 224 | def _regularization_losses(self, *unused_args, **unused_kwargs): 225 | """Optionally add more regularization losses to discriminator update.""" 226 | return 0. 227 | 228 | def _run_bc_init(self): 229 | """Performs BC-based policy initialization.""" 230 | self.to(self.device) 231 | policy_bc_opt = self._get_optimizer(self._hp.optimizer, self.policy, self._hp.policy_lr) 232 | for step in range(self._hp.bc_init_steps): 233 | data = self._get_expert_batch() 234 | policy_output = self.policy(data.states[:, 0]) 235 | loss = -1 * policy_output.dist.log_prob(data.actions[:, 0]).mean() 236 | self._perform_update(loss, policy_bc_opt, self.policy) 237 | if step % int(self._hp.bc_init_steps / 100) == 0: 238 | print("It {}: \tBC loss: {}, \tEntropy: {}" 239 | .format(step, loss, policy_output.dist.entropy().mean().data.cpu().numpy())) 240 | 241 | def _update_lr(self): 242 | super()._update_lr() 243 | if not isinstance(self._lr, ConstantSchedule): 244 | update_optimizer_lr(self.discriminator_opt, self._lr(self._env_steps)) 245 | 246 | 247 | class GAILSACAgent(SACAgent, GAILAgent): 248 | """GAIL agent that optimizes the discriminator reward using SAC.""" 249 | def __init__(self, *args, **kwargs): 250 | super().__init__(*args, **kwargs) 251 | self._init_gail() 252 | 253 | def _default_hparams(self): 254 | params = SACAgent._default_hparams(self) 255 | params.update(GAILAgent._default_hparams(self)) 256 | return params 257 | 258 | def update(self, experience_batch): 259 | if self._discriminator_update_cycles < self._hp.warmup_cycles: 260 | # only train discriminator during warmup, do not update policy 261 | self._add_experience_discriminator_buffer(experience_batch) 262 | self._update_discriminator() 263 | return self.gail_info 264 | else: 265 | # after warmup we first update discriminator, then policy (both handled by super().update()) 266 | info = SACAgent.update(self, experience_batch) 267 | info.update(self.gail_info) 268 | return info 269 | 270 | def add_experience(self, experience_batch): 271 | self._add_experience_discriminator_buffer(experience_batch) 272 | SACAgent.add_experience(self, experience_batch) 273 | 274 | def _preprocess_experience(self, experience_batch, policy_outputs=None): 275 | processed_experience = GAILAgent._preprocess_experience(self, experience_batch, policy_outputs) 276 | if hasattr(self, 'vis_replay_buffer'): 277 | self.vis_replay_buffer.append(map2np(processed_experience)) # for visualization 278 | return processed_experience 279 | 280 | 281 | class GAILActionPriorSACAgent(ActionPriorSACAgent, GAILAgent): 282 | """GAIL agent that optimizes the discriminator reward using SPiRL.""" 283 | def __init__(self, *args, **kwargs): 284 | super().__init__(*args, **kwargs) 285 | self._init_gail() 286 | 287 | def _default_hparams(self): 288 | params = ActionPriorSACAgent._default_hparams(self) 289 | params.update(GAILAgent._default_hparams(self)) 290 | return params 291 | 292 | def update(self, experience_batch): 293 | self.gail_info = {} 294 | if self._discriminator_update_cycles < self._hp.warmup_cycles: 295 | # only train discriminator during warmup, do 
not update policy 296 | self._add_experience_discriminator_buffer(experience_batch) 297 | self._update_discriminator() 298 | return self.gail_info 299 | else: 300 | # after warmup we first update discriminator, then policy (both handled by super().update()) 301 | info = ActionPriorSACAgent.update(self, experience_batch) 302 | info.update(self.gail_info) 303 | return info 304 | 305 | def _preprocess_experience(self, experience_batch, policy_outputs=None): 306 | processed_experience = GAILAgent._preprocess_experience(self, experience_batch, policy_outputs) 307 | if hasattr(self, 'vis_replay_buffer'): 308 | self.vis_replay_buffer.append(map2np(processed_experience)) # for visualization 309 | return processed_experience 310 | 311 | def add_experience(self, experience_batch): 312 | self._add_experience_discriminator_buffer(experience_batch) 313 | ActionPriorSACAgent.add_experience(self, experience_batch) 314 | -------------------------------------------------------------------------------- /skild/rl/agents/ppo_agent.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import torch 3 | 4 | from spirl.utils.general_utils import ParamDict, AttrDict, map_dict, ConstantSchedule 5 | from spirl.utils.pytorch_utils import map2torch, map2np, ten2ar, ar2ten, avg_grad_norm, update_optimizer_lr 6 | from spirl.rl.agents.ac_agent import ACAgent 7 | from spirl.rl.components.normalization import DummyNormalizer 8 | 9 | 10 | class PPOAgent(ACAgent): 11 | """Implements PPO algorithm.""" 12 | def __init__(self, *args, **kwargs): 13 | super().__init__(*args, **kwargs) 14 | # build old actor policy 15 | self.old_policy = self._hp.policy(self._hp.policy_params) 16 | 17 | # build critic and critic optimizer 18 | self.critic = self._hp.critic(self._hp.critic_params) 19 | self.critic_opt = self._get_optimizer(self._hp.optimizer, self.critic, self._hp.critic_lr) 20 | self._lr = self._hp.lr_schedule(self._hp.lr_schedule_params) 21 | 22 | # build trajectory buffer and reward normalizer 23 | self.trajectory_buffer = self._hp.buffer(self._hp.buffer_params) 24 | self._reward_normalizer = self._hp.reward_normalizer(self._hp.reward_normalizer_params) 25 | 26 | self._update_steps = 0 27 | self._env_steps = 0 28 | 29 | def _default_hparams(self): 30 | default_dict = ParamDict({ 31 | 'critic': None, # critic class 32 | 'critic_params': None, # parameters for the critic class 33 | 'critic_lr': 3e-4, # learning rate for critic update 34 | 'buffer': None, # trajectory buffer class 35 | 'buffer_params': None, # parameters for trajectory buffer 36 | 'clip_ratio': 0.2, # policy update clipping value 37 | 'entropy_coefficient': 0.0, # coefficient for weighting of entropy loss 38 | 'gae_lambda': 0.95, # GAE lambda coefficient 39 | 'target_network_update_factor': 1.0, # always overwrite old actor policy completely 40 | 'gradient_clip': 0.5, # overwrite default to cligrad norm at 0.5 41 | 'clip_value_loss': False, # if True, applies clipping to value loss 42 | 'reward_normalizer': DummyNormalizer, # normalizer for rewards 43 | 'reward_normalizer_params': {}, # optional parameters for reward normalizer 44 | 'lr_schedule': ConstantSchedule, # schedule for learning rate 45 | 'lr_schedule_params': AttrDict(p=3e-4), # parameters for learning rate schedule 46 | }) 47 | return super()._default_hparams().overwrite(default_dict) 48 | 49 | def update(self, experience_batch): 50 | """Updates actor and critic.""" 51 | # normalize experience batch 52 | experience_batch = 
self._normalize_batch(experience_batch) 53 | 54 | # perform any auxiliary updates 55 | self.add_aux_experience(experience_batch) 56 | self._aux_updates() 57 | 58 | # prepare experience batch for policy update 59 | self.add_experience(experience_batch) 60 | self._env_steps += self.trajectory_buffer.size 61 | self._update_lr() 62 | 63 | # copy actor weights 64 | self._soft_update_target_network(self.old_policy, self.policy) 65 | 66 | for _ in range(self._hp.update_iterations): 67 | # sample update sample 68 | experience_batch = self.trajectory_buffer.sample(n_samples=self._hp.batch_size) 69 | experience_batch = map2torch(experience_batch, device=self.device) 70 | 71 | # compute policy loss 72 | policy_loss, entropy, pi_ratio = self._compute_policy_loss(experience_batch) 73 | 74 | # compute critic loss 75 | critic_loss = self._compute_critic_loss(experience_batch) 76 | 77 | # update networks & learning rate 78 | self._update_steps += 1 79 | self._perform_update(policy_loss, self.policy_opt, self.policy) 80 | self._perform_update(critic_loss, self.critic_opt, self.critic) 81 | 82 | # log info 83 | info = AttrDict( 84 | policy_loss=policy_loss, 85 | critic_loss=critic_loss, 86 | entropy=entropy, 87 | pi_ratio=pi_ratio.mean(), 88 | lr=self._lr(self.schedule_steps), 89 | ) 90 | if self._update_steps % 100 == 0: 91 | info.update(AttrDict( # gradient norms 92 | policy_grad_norm=avg_grad_norm(self.policy), 93 | critic_grad_norm=avg_grad_norm(self.critic), 94 | )) 95 | info = map_dict(ten2ar, info) 96 | return info 97 | 98 | def _normalize_batch(self, experience_batch): 99 | self._obs_normalizer.update(experience_batch.observation) 100 | self._reward_normalizer.update(experience_batch.reward) 101 | experience_batch.observation = self._obs_normalizer(experience_batch.observation) 102 | experience_batch.observation_next = self._obs_normalizer(experience_batch.observation_next) 103 | experience_batch.reward = self._reward_normalizer(experience_batch.reward) 104 | return experience_batch 105 | 106 | def add_experience(self, experience_batch): 107 | experience_batch = self._preprocess_experience(map2torch(experience_batch, self.device)) 108 | experience_batch = self._compute_advantage(map2np(experience_batch)) 109 | self.trajectory_buffer.reset() 110 | self.trajectory_buffer.append(experience_batch) 111 | 112 | def _compute_advantage(self, experience_batch): 113 | """Computes advantage and return of input trajectories using critic.""" 114 | n_steps = len(experience_batch.observation) - 1 115 | 116 | # compute estimated value 117 | with torch.no_grad(): 118 | value = ten2ar(self.critic( 119 | ar2ten(np.array(experience_batch.observation, dtype=np.float32), device=self.device)).q).squeeze(-1) 120 | 121 | # recursively compute returns and advantage 122 | advantage = np.empty_like(value[:-1]) 123 | last_adv = 0 124 | for t in reversed(range(n_steps)): 125 | advantage[t] = experience_batch.reward[t] \ 126 | + (1 - experience_batch.done[t]) * self._hp.discount_factor * value[t+1] \ 127 | - value[t] \ 128 | + self._hp.discount_factor * self._hp.gae_lambda * (1 - experience_batch.done[t]) * last_adv 129 | last_adv = advantage[t] 130 | 131 | # compute returns and normalized advantage 132 | returns = advantage + value[:-1] 133 | norm_advantage = (advantage - advantage.mean()) / advantage.std() 134 | 135 | # remove final transitions for which we don't have advantages + add computed adv to experience batch 136 | for key in experience_batch: 137 | experience_batch[key] = experience_batch[key][:advantage.shape[0]] 138 
| experience_batch.returns = [r for r in returns] 139 | experience_batch.advantage = [a for a in norm_advantage] 140 | experience_batch.value_pred = [v for v in value[:-1]] 141 | 142 | return experience_batch 143 | 144 | def _compute_policy_loss(self, experience_batch): 145 | """Computes policy update loss.""" 146 | # run actors 147 | policy_output = self.policy(experience_batch.observation) 148 | old_policy_output = self.old_policy(experience_batch.observation) 149 | log_pi, old_log_pi = policy_output.dist.log_prob(experience_batch.action), \ 150 | old_policy_output.dist.log_prob(experience_batch.action) 151 | 152 | # compute actor loss 153 | ratio = torch.exp(log_pi - old_log_pi) 154 | surr1 = ratio * experience_batch.advantage 155 | surr2 = torch.clamp(ratio, 1.0 - self._hp.clip_ratio, 1.0 + self._hp.clip_ratio) * experience_batch.advantage 156 | actor_loss = -torch.min(surr1, surr2).mean() 157 | 158 | # compute entropy loss 159 | entropy_loss = -1 * policy_output.dist.entropy().mean() 160 | 161 | return actor_loss + self._hp.entropy_coefficient * entropy_loss, -1 * entropy_loss, ratio 162 | 163 | def _compute_critic_loss(self, experience_batch): 164 | value = self.critic(experience_batch.observation).q.squeeze(-1) 165 | if not self._hp.clip_value_loss: 166 | return 0.5 * (experience_batch.returns - value).pow(2).mean() 167 | else: 168 | value_clipped = experience_batch.value_pred + \ 169 | (value - experience_batch.value_pred).clamp(-self._hp.clip_ratio, self._hp.clip_ratio) 170 | value_losses = (experience_batch.returns - value).pow(2) 171 | value_losses_clipped = (value_clipped - experience_batch.returns).pow(2) 172 | return 0.5 * torch.max(value_losses, value_losses_clipped).mean() 173 | 174 | def _update_lr(self): 175 | """Updates learning rates with schedule.""" 176 | if not isinstance(self._lr, ConstantSchedule): 177 | update_optimizer_lr(self.policy_opt, self._lr(self.schedule_steps)) 178 | update_optimizer_lr(self.critic_opt, self._lr(self.schedule_steps)) 179 | 180 | @property 181 | def schedule_steps(self): 182 | return self._env_steps 183 | -------------------------------------------------------------------------------- /skild/rl/agents/skild_agent.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import numpy as np 3 | 4 | from spirl.utils.general_utils import ParamDict, ConstantSchedule, AttrDict 5 | from spirl.utils.pytorch_utils import TensorModule, check_shape, ar2ten 6 | from skild.rl.agents.gail_agent import GAILActionPriorSACAgent 7 | 8 | 9 | class SkiLDAgent(GAILActionPriorSACAgent): 10 | """Implements the SkiLD algorithm.""" 11 | def __init__(self, *args, **kwargs): 12 | GAILActionPriorSACAgent.__init__(self, *args, **kwargs) 13 | self._posterior_target_divergence = self._hp.tdq_schedule(self._hp.tdq_schedule_params) 14 | 15 | # define posterior divergence multiplier alpha_q 16 | if self._hp.fixed_alpha_q is not None: 17 | self._log_alpha_q = TensorModule(np.log(self._hp.fixed_alpha_q) 18 | * torch.ones(1, requires_grad=False, device=self._hp.device)) 19 | else: 20 | self._log_alpha_q = TensorModule(torch.zeros(1, requires_grad=True, device=self._hp.device)) 21 | self.alpha_q_opt = self._get_optimizer(self._hp.optimizer, self._log_alpha_q, self._hp.alpha_lr) 22 | 23 | def _default_hparams(self): 24 | return GAILActionPriorSACAgent._default_hparams(self).overwrite(ParamDict({ 25 | 'tdq_schedule': ConstantSchedule, # schedule used for posterior target divergence param 26 | 'tdq_schedule_params': AttrDict( # 
parameters for posterior target divergence schedule 27 | p = 1., 28 | ), 29 | 'action_cond_discriminator': False, # if True, conditions discriminator on actions 30 | 'fixed_alpha_q': None, 31 | })) 32 | 33 | def update(self, experience_batch): 34 | info = GAILActionPriorSACAgent.update(self, experience_batch) 35 | info.posterior_target_divergence = self._posterior_target_divergence(self.schedule_steps) 36 | return info 37 | 38 | def _update_alpha(self, experience_batch, policy_output): 39 | # update alpha_q 40 | if self._hp.fixed_alpha_q is None: 41 | self.alpha_q_loss = (self._compute_alpha_q_loss(policy_output) * experience_batch.p_demo.detach()).mean() 42 | self._perform_update(self.alpha_q_loss, self.alpha_q_opt, self._log_alpha_q) 43 | else: 44 | self.alpha_q_loss = 0. 45 | 46 | # update alpha 47 | if self._hp.fixed_alpha is None: 48 | alpha_loss = (self._compute_alpha_loss(policy_output) * (1-experience_batch.p_demo).detach()).mean() 49 | self._perform_update(alpha_loss, self.alpha_opt, self._log_alpha) 50 | else: 51 | alpha_loss = 0. 52 | return alpha_loss 53 | 54 | def _compute_alpha_q_loss(self, policy_output): 55 | return self.alpha_q * (self._posterior_target_divergence(self.schedule_steps) 56 | - policy_output.posterior_divergence).detach() 57 | 58 | def _compute_alpha_loss(self, policy_output): 59 | return self.alpha * (self._target_divergence(self.schedule_steps) - policy_output.prior_divergence).detach() 60 | 61 | def _compute_policy_loss(self, experience_batch, policy_output): 62 | q_est = torch.min(*[critic(experience_batch.observation, self._prep_action(policy_output.action)).q 63 | for critic in self.critics]) 64 | weighted_divergence = self.alpha * policy_output.prior_divergence[:, None] \ 65 | * (1 - experience_batch.p_demo[:, None]) \ 66 | + self.alpha_q * policy_output.posterior_divergence[:, None] \ 67 | * experience_batch.p_demo[:, None] 68 | policy_loss = -1 * q_est + weighted_divergence 69 | check_shape(policy_loss, [self._hp.batch_size, 1]) 70 | return policy_loss.mean() 71 | 72 | def _compute_next_value(self, experience_batch, policy_output): 73 | q_next = torch.min(*[critic_target(experience_batch.observation_next, self._prep_action(policy_output.action)).q 74 | for critic_target in self.critic_targets]) 75 | weighted_divergence = self.alpha * policy_output.prior_divergence[:, None] \ 76 | * (1 - experience_batch.p_demo[:, None]) \ 77 | + self.alpha_q * policy_output.posterior_divergence[:, None] \ 78 | * experience_batch.p_demo[:, None] 79 | next_val = (q_next - weighted_divergence) 80 | check_shape(next_val, [self._hp.batch_size, 1]) 81 | return next_val.squeeze(-1) 82 | 83 | def _aux_info(self, experience_batch, policy_output): 84 | aux_info = GAILActionPriorSACAgent._aux_info(self, experience_batch, policy_output) 85 | aux_info.update(AttrDict( 86 | prior_divergence=(policy_output.prior_divergence[experience_batch.p_demo < 0.5]).mean(), 87 | posterior_divergence=(policy_output.posterior_divergence[experience_batch.p_demo > 0.5]).mean(), 88 | alpha_q_loss=self.alpha_q_loss, 89 | alpha_q=self.alpha_q, 90 | p_demo=experience_batch.p_demo.mean(), 91 | )) 92 | aux_info.update(AttrDict( # log all reward components 93 | env_reward=self._hp.reward_scale * self._lambda_gail(self.schedule_steps) 94 | * experience_batch.orig_reward.mean(), 95 | gail_reward=self._hp.reward_scale * (1-self._lambda_gail(self.schedule_steps)) 96 | * torch.stack(experience_batch.discr_reward).mean(), 97 | prior_reward=self.alpha * (policy_output.prior_divergence * (1 - 
experience_batch.p_demo)).mean(), 98 | posterior_reward=self.alpha_q * (policy_output.posterior_divergence * experience_batch.p_demo).mean(), 99 | )) 100 | return aux_info 101 | 102 | def _run_discriminator(self, experience_batch, policy_output=None): 103 | # optionally unflatten observation (in case we have image inputs) 104 | if self._hp.action_cond_discriminator and policy_output is None: 105 | # first call -- before policy was called 106 | return torch.zeros_like(experience_batch.observation[:, 0]) 107 | if hasattr(self.policy.net, "unflatten_obs"): 108 | discriminator_input = self.discriminator.filter_input(self.policy.net.unflatten_obs( 109 | ar2ten(experience_batch.observation, device=self.device)).prior_obs) 110 | else: 111 | discriminator_input = ar2ten(experience_batch.observation, device=self.device) 112 | if self._hp.action_cond_discriminator: 113 | discriminator_input = torch.cat((discriminator_input, policy_output.action), dim=-1) 114 | return self.discriminator(discriminator_input)[..., 0] 115 | 116 | def _update_experience(self, experience_batch, policy_outputs): 117 | """Run discriminator with action input.""" 118 | if not self._hp.action_cond_discriminator: 119 | return super()._update_experience(experience_batch, policy_outputs) 120 | return self._preprocess_experience(experience_batch, policy_outputs) 121 | 122 | def state_dict(self, *args, **kwargs): 123 | d = GAILActionPriorSACAgent.state_dict(self) 124 | if hasattr(self, 'alpha_q_opt'): 125 | d['alpha_q_opt'] = self.alpha_q_opt.state_dict() 126 | return d 127 | 128 | def load_state_dict(self, state_dict, *args, **kwargs): 129 | if 'alpha_q_opt' in state_dict: 130 | self.alpha_q_opt.load_state_dict(state_dict.pop('alpha_q_opt')) 131 | GAILActionPriorSACAgent.load_state_dict(self, state_dict, *args, **kwargs) 132 | 133 | @property 134 | def alpha_q(self): 135 | if self._hp.alpha_min is not None: 136 | return torch.clamp(self._log_alpha_q().exp(), min=self._hp.alpha_min) 137 | return self._log_alpha_q().exp() 138 | -------------------------------------------------------------------------------- /skild/rl/envs/maze.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import d4rl 3 | 4 | from spirl.rl.components.environment import GymEnv 5 | from spirl.utils.general_utils import ParamDict, AttrDict 6 | 7 | 8 | class MazeEnv(GymEnv): 9 | """Extends SPiRL maze env with randomized init position and episode termination upon goal reaching.""" 10 | def __init__(self, *args, **kwargs): 11 | super().__init__(*args, **kwargs) 12 | 13 | def _default_hparams(self): 14 | default_dict = ParamDict({ 15 | 'start_rand_range': 2., # range of start position randomization, fixed pos if 0. 16 | }) 17 | return super()._default_hparams().overwrite(default_dict) 18 | 19 | def reset(self): 20 | super().reset() 21 | if self.TARGET_POS is not None and self.START_POS is not None: 22 | start_pos = self.START_POS + self._hp.start_rand_range * (np.random.rand(2) * 2 - 1) 23 | self._env.set_target(self.TARGET_POS) 24 | self._env.reset_to_location(start_pos) 25 | self._env.render(mode='rgb_array') # these are necessary to make sure new state is rendered on first frame 26 | obs, _, _, _ = self._env.step(np.zeros_like(self._env.action_space.sample())) 27 | return self._wrap_observation(obs) 28 | 29 | def step(self, *args, **kwargs): 30 | obs, rew, done, info = super().step(*args, **kwargs) 31 | if rew > 0: 32 | rew *= 100. 
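# a positive reward indicates the goal was reached: scale it up and terminate the episode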
33 | done = True 34 | return obs, np.float64(rew), done, info # casting reward to float64 is important for getting shape later 35 | 36 | 37 | class ACRandMaze0S40Env(MazeEnv): 38 | START_POS = np.array([10., 24.]) 39 | TARGET_POS = np.array([18., 8.]) 40 | 41 | def _default_hparams(self): 42 | default_dict = ParamDict({ 43 | 'name': "maze2d-randMaze0S40-ac-v0", 44 | }) 45 | return super()._default_hparams().overwrite(default_dict) 46 | -------------------------------------------------------------------------------- /skild/rl/policies/posterior_policies.py: -------------------------------------------------------------------------------- 1 | import torch 2 | 3 | from spirl.utils.pytorch_utils import no_batchnorm_update 4 | from spirl.utils.general_utils import ParamDict, AttrDict 5 | from spirl.rl.policies.prior_policies import LearnedPriorAugmentedPIPolicy 6 | from spirl.rl.components.agent import BaseAgent 7 | 8 | 9 | class LearnedPPPolicy(LearnedPriorAugmentedPIPolicy): 10 | """Computes both learned prior and posterior distribution.""" 11 | def __init__(self, *args, **kwargs): 12 | LearnedPriorAugmentedPIPolicy.__init__(self, *args, **kwargs) 13 | self.posterior_net = self._hp.posterior_model(self._hp.posterior_model_params, None) 14 | BaseAgent.load_model_weights(self.posterior_net, 15 | self._hp.posterior_model_checkpoint, 16 | self._hp.posterior_model_epoch) 17 | 18 | def _default_hparams(self): 19 | return LearnedPriorAugmentedPIPolicy._default_hparams(self).overwrite(ParamDict({ 20 | 'posterior_model': None, # posterior model class 21 | 'posterior_model_params': None, # parameters for the posterior model 22 | 'posterior_model_checkpoint': None, # checkpoint path of the posterior model 23 | 'posterior_model_epoch': 'latest', # epoch that checkpoint should be loaded for (defaults to latest) 24 | })) 25 | 26 | def forward(self, obs): 27 | policy_output = LearnedPriorAugmentedPIPolicy.forward(self, obs) 28 | if not self._rollout_mode: 29 | raw_posterior_divergence, policy_output.posterior_dist = \ 30 | self._compute_posterior_divergence(policy_output, obs) 31 | policy_output.posterior_divergence = self.clamp_divergence(raw_posterior_divergence) 32 | return policy_output 33 | 34 | def _compute_posterior_divergence(self, policy_output, obs): 35 | with no_batchnorm_update(self.posterior_net): 36 | posterior_dist = self.posterior_net.compute_learned_prior(obs, first_only=True).detach() 37 | if self._hp.analytic_KL: 38 | return self._analytic_divergence(policy_output, posterior_dist), posterior_dist 39 | return self._mc_divergence(policy_output, posterior_dist), posterior_dist 40 | 41 | 42 | class ACLearnedPPPolicy(LearnedPPPolicy): 43 | """LearnedPPPolicy for case with separate prior obs --> uses prior observation as input only.""" 44 | def forward(self, obs): 45 | if obs.shape[0] == 1: 46 | return super().forward(self.net.unflatten_obs(obs).prior_obs) # use policy_net or batch_size 1 inputs 47 | return super().forward(self.prior_net.unflatten_obs(obs).prior_obs) 48 | --------------------------------------------------------------------------------