Browsing by Author "Klassen, Toryn Q."
- Item: Learning Reward Machines: A Study in Partially Observable Reinforcement Learning (2023)
  Toro Icarte, Rodrigo Andrés; Klassen, Toryn Q.; Valenzano, Richard; Castro Anich, Margarita; Waldie, Ethan; McIlraith, Sheila A.
  Reinforcement Learning (RL) is a machine learning paradigm wherein an artificial agent interacts with an environment with the purpose of learning behaviour that maximizes the expected cumulative reward it receives from the environment. Reward machines (RMs) provide a structured, automata-based representation of a reward function that enables an RL agent to decompose an RL problem into structured subproblems that can be efficiently learned via off-policy learning. Here we show that RMs can be learned from experience, instead of being specified by the user, and that the resulting problem decomposition can be used to effectively solve partially observable RL problems. We pose the task of learning RMs as a discrete optimization problem where the objective is to find an RM that decomposes the problem into a set of subproblems such that the combination of their optimal memoryless policies is an optimal policy for the original problem. We show the effectiveness of this approach on three partially observable domains, where it significantly outperforms A3C, PPO, and ACER, and discuss its advantages, limitations, and broader potential.
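The decomposition described in this abstract can be sketched concretely. The snippet below is a minimal, hypothetical illustration for a tabular setting, not the paper's code: a reward machine is modelled as a labelled transition system, and the agent keeps one Q-table per RM state, so each subproblem is handled by a memoryless policy over observations. All class and method names are invented for illustration.

```python
# Illustrative sketch (not the paper's code): one memoryless Q-function per
# reward-machine state, which is the decomposition the abstract refers to.
from collections import defaultdict
import random

class RewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: {(rm_state, event): (next_rm_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, rm_state, event):
        # Self-loop with zero reward when no transition matches the event.
        return self.transitions.get((rm_state, event), (rm_state, 0.0))

class PerStateQAgent:
    """Keeps one tabular Q-function per RM state; epsilon-greedy actions."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(lambda: defaultdict(float))  # q[rm_state][(obs, action)]
        self.actions, self.alpha = actions, alpha
        self.gamma, self.epsilon = gamma, epsilon

    def act(self, rm_state, obs):
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[rm_state][(obs, a)])

    def update(self, rm_state, obs, action, reward, next_rm_state, next_obs):
        best_next = max(self.q[next_rm_state][(next_obs, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[rm_state][(obs, action)] += self.alpha * (td_target - self.q[rm_state][(obs, action)])
```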
- Item: Reward Machines for Deep RL in Noisy and Uncertain Environments (2024)
  Li, Andrew C.; Chen, Zizhao; Klassen, Toryn Q.; Vaezipoor, Pashootan; Toro Icarte, Rodrigo Andrés; McIlraith, Sheila A.
  Reward Machines provide an automaton-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing the underlying structure of a reward function, they enable the decomposition of an RL task, leading to impressive gains in sample efficiency. Although Reward Machines and similar formal specifications have a rich history of application towards sequential decision-making problems, they critically rely on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function; such ground-truth interpretations are elusive in the real world due in part to partial observability and noisy sensing. In this work, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that exploit task structure under uncertain interpretation of the domain-specific vocabulary. Through theory and experiments, we expose pitfalls in naive approaches to this problem while simultaneously demonstrating how task structure can be successfully leveraged under noisy interpretations of the vocabulary.
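One way to make the "uncertain interpretation of the vocabulary" concrete is to track a belief over reward-machine states rather than a single state. The sketch below is a hypothetical illustration under that assumption, not the paper's algorithm: event probabilities are supplied by some noisy labelling model, and events are assumed mutually exclusive and exhaustive (including an explicit "no event" symbol).

```python
# Illustrative sketch (not the paper's algorithm): propagate a belief over
# reward-machine states using noisy event probabilities.
def update_rm_belief(belief, event_probs, transitions):
    """belief: {rm_state: probability}
    event_probs: {event: probability the event holds at this step}
    transitions: {(rm_state, event): next_rm_state}; missing pairs self-loop."""
    new_belief = {u: 0.0 for u in belief}
    for u, p_u in belief.items():
        for event, p_e in event_probs.items():
            v = transitions.get((u, event), u)
            new_belief[v] = new_belief.get(v, 0.0) + p_u * p_e
    total = sum(new_belief.values()) or 1.0  # guard against degenerate input
    return {u: p / total for u, p in new_belief.items()}

# Example: a two-state machine that advances on "goal", with a detector that
# is only 80% confident the goal was reached.
belief = {"u0": 1.0, "u1": 0.0}
belief = update_rm_belief(belief, {"goal": 0.8, "none": 0.2}, {("u0", "goal"): "u1"})
print(belief)  # {'u0': 0.2, 'u1': 0.8}
```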
- Item: Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning (2022)
  Toro Icarte, Rodrigo; Klassen, Toryn Q.; Valenzano, Richard; McIlraith, Sheila A.
  Reinforcement learning (RL) methods usually treat reward functions as black boxes. As such, these methods must extensively interact with the environment in order to discover rewards and optimal policies. In most RL applications, however, users have to program the reward function and, hence, there is the opportunity to make the reward function visible - to show the reward function's code to the RL agent so it can exploit the function's internal structure to learn optimal policies in a more sample-efficient manner. In this paper, we show how to accomplish this idea in two steps. First, we propose reward machines, a type of finite state machine that supports the specification of reward functions while exposing reward function structure. We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning. Experiments on tabular and continuous domains, across different tasks and RL agents, show the benefits of exploiting reward structure with respect to sample efficiency and the quality of resultant policies. Finally, by virtue of being a form of finite state machine, reward machines have the expressive power of a regular language and as such support loops, sequences, and conditionals, as well as the expression of temporally extended properties typical of linear temporal logic and non-Markovian reward specification.
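The "counterfactual reasoning with off-policy learning" mentioned in this abstract can be illustrated with a small sketch. The function below is hypothetical, not the paper's implementation: it relabels a single environment transition once per reward-machine state, so an off-policy learner receives one experience for every RM state instead of just the current one. It assumes an object with a step method like the RewardMachine sketch above, returning (next_rm_state, reward).

```python
# Illustrative sketch (not the paper's implementation) of counterfactual
# experience generation from a single environment transition.
def counterfactual_experiences(rm, rm_states, obs, action, next_obs, event):
    """Return one (obs, u, action, reward, next_obs, next_u) tuple per RM state u,
    where rm.step(u, event) yields (next_u, reward)."""
    experiences = []
    for u in rm_states:
        next_u, reward = rm.step(u, event)
        experiences.append((obs, u, action, reward, next_obs, next_u))
    return experiences
```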