Corelab Seminar

Michael M. Zavlanos
Distributed, Non-stationary, and Causal Reinforcement Learning

Reinforcement learning (RL) has been widely used to solve sequential decision-making problems in unknown stochastic environments. In this talk we first present a new zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL) with partial state and action observations, and for online learning in non-stationary environments. Zeroth-order optimization methods enable the optimization of black-box models that are available only in the form of input-output data; such models are common in the training of deep neural networks and in RL. In the absence of an explicit model, exact first- or second-order information (gradients or Hessians) is unavailable and cannot be used for optimization. Zeroth-order methods therefore rely on input-output data to obtain gradient approximations that can be used as descent directions. We present a new one-point policy gradient estimator that we have recently developed, which requires only a single function evaluation per iteration: the gradient is estimated from the residual between two consecutive feedback points. We refer to this scheme as residual feedback. We show that residual feedback in MARL allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards. Moreover, in online learning, one-point policy gradient estimation is the only viable choice. We show that, in both MARL and online learning, residual feedback induces a smaller estimation variance than other one-point feedback methods and therefore improves the learning rate. We also present a new transfer RL method, recently developed by our group, that uses data from an expert agent with access to the environmental context to help a learner agent that cannot observe the same context find an optimal context-unaware policy.
It is well known that disregarding the causal effect of the contextual information can introduce bias in the transition and reward models estimated by the learner, resulting in a suboptimal learned policy. To address this challenge, we have developed a new method that obtains causal bounds on the transition and reward functions from the expert data, which we then use in a new causal bound-constrained Q-learning method that converges to the true value function using only a fraction of the new data samples.
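The residual-feedback estimator described above can be illustrated with a short sketch. The following is a minimal, hypothetical implementation for a generic black-box objective; the function name and hyperparameters are illustrative, not the speaker's, and it shows only the single-evaluation-per-iteration structure, not the MARL or online-learning setting:

```python
import numpy as np

def residual_feedback_descent(f, x0, delta=0.1, lr=0.002, iters=3000, seed=0):
    """Zeroth-order descent with a one-point residual-feedback gradient
    estimator: each iteration issues a single new function query and
    reuses the value stored from the previous iteration."""
    rng = np.random.default_rng(seed)
    d = x0.size
    x = x0.astype(float).copy()
    prev_val = None  # f evaluated at the previous perturbed point
    for _ in range(iters):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)      # random direction on the unit sphere
        val = f(x + delta * u)      # the single evaluation this iteration
        if prev_val is not None:
            # the residual between two consecutive feedback points
            # drives the gradient estimate
            g = (d / delta) * (val - prev_val) * u
            x -= lr * g
        prev_val = val
    return x
```

Because the previous evaluation is reused, each iteration costs one function query; when the iterates move slowly, the residual, and hence the estimator's variance, stays small, which is the intuition behind the variance reduction claimed in the abstract.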
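The bound-constrained Q-learning idea can likewise be sketched. In the toy version below, the causal bounds are taken as given per-state-action intervals; how those intervals are derived from expert data is the talk's contribution and is not reproduced here. Every Q-update is simply projected back into its interval (all names and the environment interface are our own illustrative assumptions):

```python
import numpy as np

def bounded_q_learning(env_step, n_states, n_actions, q_lower, q_upper,
                       episodes=200, horizon=50, alpha=0.1, gamma=0.95,
                       eps=0.1, seed=0):
    """Tabular epsilon-greedy Q-learning in which every update is clipped
    to per-(state, action) intervals [q_lower, q_upper], standing in for
    causal bounds precomputed from expert data."""
    rng = np.random.default_rng(seed)
    Q = (q_lower + q_upper) / 2.0  # initialize inside the bounds
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            if rng.random() < eps:
                a = int(rng.integers(n_actions))  # explore
            else:
                a = int(np.argmax(Q[s]))          # exploit
            s2, r, done = env_step(s, a, rng)
            target = r + gamma * np.max(Q[s2]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            # project the update back into the (assumed) causal bounds
            Q[s, a] = np.clip(Q[s, a], q_lower[s, a], q_upper[s, a])
            if done:
                break
            s = s2
    return Q
```

The projection step is the only change relative to standard Q-learning; tighter intervals shrink the set of admissible value functions, which is the mechanism by which fewer new samples can suffice.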

Short bio: Michael M. Zavlanos received the Diploma in mechanical engineering from the National Technical University of Athens, Greece, in 2002, and the M.S.E. and Ph.D. degrees in electrical and systems engineering from the University of Pennsylvania, Philadelphia, PA, in 2005 and 2008, respectively. He is currently the Yoh Family Associate Professor in the Department of Mechanical Engineering and Materials Science at Duke University, Durham, NC. He also holds secondary appointments in the Department of Electrical and Computer Engineering and the Department of Computer Science. His research focuses on control theory, optimization, learning, and AI, with particular emphasis on autonomous systems and robotics, networked and distributed control systems, and cyber-physical systems. Dr. Zavlanos is a recipient of various awards, including the 2014 ONR YIP Award and the 2011 NSF CAREER Award.