Safe Reinforcement Learning via Shielding

Mohammed Alshiekh; Roderick Bloem; Rüdiger Ehlers; Bettina Könighofer; Scott Niekum; Ufuk Topcu

doi:10.1609/aaai.v32i1.11797

Authors

Mohammed Alshiekh University of Texas at Austin
Roderick Bloem Graz University of Technology
Rüdiger Ehlers University of Bremen and DFKI GmbH
Bettina Könighofer Graz University of Technology, Institute for Applied Information Processing and Communications
Scott Niekum University of Texas at Austin
Ufuk Topcu University of Texas at Austin

Keywords:

Reinforcement Learning, Formal Methods

Abstract

Reinforcement learning algorithms discover policies that maximize reward, but do not necessarily guarantee safety during learning or execution phases. We introduce a new approach to learn optimal policies while enforcing properties expressed in temporal logic. To this end, given the temporal logic specification that is to be obeyed by the learning system, we propose to synthesize a reactive system called a shield. The shield monitors the actions from the learner and corrects them only if the chosen action causes a violation of the specification. We discuss which requirements a shield must meet to preserve the convergence guarantees of the learner. Finally, we demonstrate the versatility of our approach on several challenging reinforcement learning scenarios.
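
The monitor-and-correct behavior described in the abstract can be illustrated with a minimal sketch. It is not the authors' construction: in the paper the shield is synthesized from a temporal logic specification, whereas here a hand-written safety predicate stands in for it. The names `Shield`, `correct`, `is_safe`, and the toy corridor environment are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


class Shield:
    """Hypothetical shield: passes safe actions through, replaces unsafe ones."""

    def __init__(self, actions: Sequence[Action],
                 is_safe: Callable[[State, Action], bool]) -> None:
        self.actions = list(actions)
        # Assumption: a plain predicate stands in for a shield synthesized
        # from a temporal logic specification.
        self.is_safe = is_safe

    def correct(self, state: State, proposed: Action) -> Action:
        # Leave the learner's choice untouched whenever it is safe.
        if self.is_safe(state, proposed):
            return proposed
        # Otherwise substitute the first safe action in the fixed ordering.
        for candidate in self.actions:
            if self.is_safe(state, candidate):
                return candidate
        raise ValueError(f"no safe action available in state {state!r}")


# Toy 1-D corridor (assumed for illustration): positions 0..3 are safe.
MOVES = {"left": -1, "stay": 0, "right": 1}


def is_safe(pos: int, action: str) -> bool:
    # An action is safe if the resulting position stays inside the corridor.
    return 0 <= pos + MOVES[action] <= 3


shield = Shield(["left", "stay", "right"], is_safe)
```

In a training loop, the learner would propose an action, `shield.correct(state, action)` would return the action actually executed, and the environment would be stepped with that result. Because the shield interferes only when the proposed action is unsafe, the learner's behavior is otherwise left unchanged.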
