PhD Dissertation Defense: Vahid Tavakol Aghaei
17-06-2019

Markov Chain Monte Carlo Algorithm for Bayesian Policy Search


Vahid Tavakol Aghaei
Mechatronics Engineering, PhD Dissertation, 2019


Thesis Jury

Assoc. Prof. Ahmet Onat, Asst. Prof. Sinan Yıldırım (Thesis Advisors),

Prof. Dr. Kürşat Şendur, Prof. Dr. İlker Birbil

Prof. Dr. Taylan Cemgil, Asst. Prof. Öznur Taştan



Date & Time: 24 June 2019, 14:00

Place: FENS 2019

Keywords: Reinforcement Learning; Markov Chain Monte Carlo; Particle filtering; Risk-sensitive reward; Policy search; Control




The fundamental goal in Reinforcement Learning (RL) is to find optimal parameters for a given parameterized policy. Policy search algorithms have made RL applicable to complex dynamical systems, such as those in robotics, where the environment comprises high-dimensional state and action spaces. Many policy search techniques are based on the widespread policy gradient methods because these methods suit such complex environments, but their performance can suffer from slow convergence or from convergence to local optima. The reason is that they must compute the gradient components of the parameterized policy. In this study, we take a Bayesian approach to the policy search problem within the RL framework. The problem of interest is to control a discrete-time Markov decision process (MDP) with continuous state and action spaces. We contribute to the field by proposing a Particle Markov Chain Monte Carlo (P-MCMC) algorithm that generates samples of the policy parameters from a posterior distribution instead of performing gradient approximations. To do so, we adopt a prior density over the policy parameters and target the posterior distribution in which the ‘likelihood’ is taken to be the expected total reward. Our methodology is well suited to risk-sensitive scenarios, where a multiplicative expected total reward, rather than its cumulative counterpart, is used to measure the performance of the policy: with a reward function in multiplicative form, one can take full advantage of sequential Monte Carlo (SMC), also known as the particle filter, within the iterations of the P-MCMC.

Furthermore, an Adaptive MCMC algorithm is proposed to deal with the challenging problem of policy search in high-dimensional state spaces.
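To make the idea in the abstract concrete, the following is a minimal sketch of a particle (pseudo-marginal) Metropolis-Hastings sampler over a single policy parameter. It is an illustration under stated assumptions, not the algorithm of the thesis: the scalar linear system, the quadratic reward, the Gaussian policy, the standard normal prior, and every function and parameter name below are hypothetical choices made only for this example. A particle filter estimates the expected multiplicative reward, and that estimate stands in for the likelihood in the acceptance ratio.

```python
import numpy as np


def log_prior(theta):
    """Log of an assumed standard normal prior on the policy gain (up to a constant)."""
    return -0.5 * theta ** 2


def log_expected_reward(theta, rng, n_particles=100, horizon=20):
    """Particle-filter (SMC) estimate of log E[prod_t exp(r(x_t, u_t))].

    Assumed toy model: x_{t+1} = x_t + u_t + noise, policy u_t = theta * x_t + noise,
    per-step reward r = -(x^2 + 0.1 * u^2).
    """
    x = 1.0 + 0.1 * rng.standard_normal(n_particles)  # initial states
    log_z = 0.0
    for _ in range(horizon):
        u = theta * x + 0.1 * rng.standard_normal(n_particles)  # sample actions
        log_w = -(x ** 2 + 0.1 * u ** 2)  # log of the multiplicative reward factor
        m = log_w.max()
        w = np.exp(log_w - m)  # stabilised weights, largest equals 1
        log_z += m + np.log(w.mean())  # accumulate log of the mean weight
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())  # resample
        x = x[idx] + u[idx] + 0.1 * rng.standard_normal(n_particles)  # propagate
    return log_z


def particle_mh(n_iters=300, step=0.3, seed=0):
    """Random-walk Metropolis-Hastings using the noisy SMC estimate as 'likelihood'."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    log_post = log_prior(theta) + log_expected_reward(theta, rng)
    samples = np.empty(n_iters)
    for i in range(n_iters):
        proposal = theta + step * rng.standard_normal()
        log_post_prop = log_prior(proposal) + log_expected_reward(proposal, rng)
        # Accept or reject; the current estimate is kept, not recomputed.
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = proposal, log_post_prop
        samples[i] = theta
    return samples


if __name__ == "__main__":
    draws = particle_mh()
    print("posterior mean of the gain (second half of chain):", draws[150:].mean())
```

In this toy setting the reward penalises large states, so the sampled gains should concentrate on negative, stabilising values. No gradient of the policy is computed at any point; only forward simulations of the system under the proposed parameter are needed.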


The thesis is organized as follows. Chapter 1 gives a general introduction and the motivation for the work and highlights the topics to be covered. Chapter 2 reviews the literature relevant to the thesis. Chapter 3 briefly reviews some popular policy gradient based RL methods. Chapter 4 introduces the Bayesian inference paradigm and presents Markov Chain Monte Carlo methods; the original work of the thesis is also formulated in this chapter, where a novel SMC algorithm for policy search in the RL setting is proposed. Chapter 5 presents numerical simulations that demonstrate the effectiveness of the proposed algorithm in learning a parameterized policy. To validate the applicability of the proposed MCMC method in real time, it is implemented on a control problem for a physical two-degree-of-freedom (2-DoF) robotic manipulator; the results appear in Chapter 6. Finally, Chapter 7 presents concluding remarks and future work.