[9]
Victor Gabillon, Mohammad Ghavamzadeh & Bruno Scherrer,
Approximate Dynamic Programming Finally Performs Well in the Game of Tetris,
NIPS 2013,
[Paper]
[Poster]
[Demo-Video]
[Abstract]
Tetris is a video game that has been widely used as a benchmark for various optimization techniques, including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms based (almost) entirely on approximating the value function have performed poorly in Tetris, methods that search directly in the space of policies by learning the policy parameters with a black-box optimizer, such as the cross-entropy (CE) method, have achieved the best reported results. This leads us to conjecture that Tetris is a game in which good policies are easier to represent, and thus learn, than their corresponding value functions. So, to obtain good performance with ADP, we should use ADP algorithms that search in a policy space instead of the more traditional ones that search in a value-function space. In this paper, we put our conjecture to the test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris.
Our experimental results show that, for the first time, an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris on both the small 10 × 10 and the large 10 × 20 boards. Although CBMPI's results are similar to those of the CE method on the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE.
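The policy-space greedy step at the heart of classification-based methods can be sketched in a few lines. This is a toy illustration on a chain-walk problem, not the paper's implementation: the actual CBMPI additionally truncates rollouts with a learned value function and fits a real classifier, whereas here the "classifier" is just nearest-neighbor generalization of the greedy labels computed on the rollout set.

```python
N = 10                       # chain states 0..N-1, reward 1 at the right end
GAMMA = 0.95

def step(s, a):              # a in {-1, +1}
    s2 = max(0, min(N - 1, s + a))
    return s2, 1.0 if s2 == N - 1 else 0.0

def rollout_q(s, a, policy, horizon=20):
    """Monte Carlo estimate of Q(s, a): take a once, then follow `policy`."""
    s, r = step(s, a)
    ret, disc = r, GAMMA
    for _ in range(horizon):
        s, r = step(s, policy(s))
        ret += disc * r
        disc *= GAMMA
    return ret

def greedy_step(policy, rollout_set):
    """Label each rollout state with its empirically best action, then
    generalize by nearest labelled state (a stand-in for the classifier)."""
    labels = {s: max((-1, +1), key=lambda a: rollout_q(s, a, policy))
              for s in rollout_set}
    return lambda s: labels[min(labels, key=lambda t: abs(t - s))]

pi0 = lambda s: -1                       # start from a poor "always left" policy
pi1 = greedy_step(pi0, rollout_set=range(N))
```

One such iteration already flips the policy to "right" near the rewarding end of the chain; iterating the greedy step is what turns this into a policy-iteration scheme that never represents a value function explicitly.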
[7]
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon & Matthieu Geist,
Approximate Modified Policy Iteration,
ICML 2012,
[Paper]
[Poster]
[Slides]
[Video]
[Abstract]
Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods as special cases. Despite its generality, MPI has not been thoroughly studied, especially in its approximate form, which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted value iteration, fitted Q-iteration, and classification-based policy iteration. We provide an error propagation analysis that unifies those of approximate policy and value iteration. For the classification-based implementation, we develop a finite-sample analysis showing that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation error.
@inproceedings{SchGhaGabGei12,
author = {{B}runo {S}cherrer and
{M}ohammad {G}havamzadeh and
{V}ictor {G}abillon and
{M}atthieu {G}eist},
title = {Approximate Modified Policy Iteration},
booktitle = {ICML},
year = {2012},
}
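The role of MPI's main parameter is easiest to see in the exact, tabular case. The following minimal NumPy sketch (not tied to the paper's approximate setting) performs a greedy step followed by m steps of partial policy evaluation: m = 1 recovers value iteration, and large m approaches policy iteration.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, m=5, n_iters=100):
    """Tabular MPI. P: (A, S, S) transition kernels, R: (S, A) rewards.
    m = 1 is value iteration; m -> infinity approaches policy iteration."""
    A, S, _ = P.shape
    v = np.zeros(S)
    for _ in range(n_iters):
        q = R.T + gamma * (P @ v)          # q[a, s]
        pi = np.argmax(q, axis=0)          # greedy policy w.r.t. v
        r_pi = R[np.arange(S), pi]
        P_pi = P[pi, np.arange(S)]         # (S, S) kernel under pi
        for _ in range(m):                 # m steps of partial evaluation
            v = r_pi + gamma * (P_pi @ v)
    return v, pi
```

On a two-state MDP where action 0 stays put and action 1 switches states, with reward 1 in state 1, this converges to the optimal values v = (9, 10) at gamma = 0.9 and the optimal policy "switch in state 0, stay in state 1".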
[5]
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh & Bruno Scherrer,
Classification-based Policy Iteration with a Critic,
ICML 2011,
[Paper]
[Poster]
[Slides]
[Video]
[Abstract]
In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use the critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on LSTD and BRM methods. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
@inproceedings{GabLazGhaSch11,
author = {{V}ictor {G}abillon and
{A}lessandro {L}azaric and
{M}ohammad {G}havamzadeh and
{B}runo {S}cherrer},
title = {{C}lassification-based {P}olicy {I}teration with a {C}ritic},
booktitle = {ICML},
pages = {1049--1056},
year = {2011},
}
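The bias/variance trade-off described in the abstract comes down to a single estimator: truncate the rollout after k steps and bootstrap with the critic. A sketch on a chain-walk toy problem follows; the paper fits the critic with LSTD or BRM, whereas here any value-function approximation `v_hat` can be plugged in.

```python
N = 10                       # chain states 0..N-1, reward 1 at the right end
GAMMA = 0.95

def step(s, a):              # a in {-1, +1}
    s2 = max(0, min(N - 1, s + a))
    return s2, 1.0 if s2 == N - 1 else 0.0

def truncated_q(s, a, policy, v_hat, k=5):
    """DPI-Critic-style estimate: k rollout steps, then bootstrap:
        Q(s, a) ~ sum_{t < k} gamma^t r_t  +  gamma^k v_hat(s_k).
    Small k lowers the variance of the rollout; the critic term
    controls the bias introduced by truncation."""
    s, r = step(s, a)
    ret, disc = r, GAMMA
    for _ in range(k - 1):
        s, r = step(s, policy(s))
        ret += disc * r
        disc *= GAMMA
    return ret + disc * v_hat(s)     # disc == gamma^k here
```

When the critic is exact and the dynamics are deterministic, the truncated estimate matches the full return regardless of k, which is the sense in which a good critic lets short rollouts stand in for long ones.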
[4]
Victor Gabillon, Alessandro Lazaric & Mohammad Ghavamzadeh,
Rollout Allocation Strategies for Classification-based Policy Iteration,
Workshop on Reinforcement Learning and Search in Very Large Spaces, 2010,
[Paper]
[Slides]
[Abstract]
Classification-based policy iteration algorithms are variations of policy iteration that do not use any kind of value function representation. The main idea is 1) to replace the usual value function learning step with rollout estimates of the value function over a finite number of states, called the rollout set, and the actions in the action space, and 2) to cast the policy improvement step as a classification problem. The choice of rollout allocation strategy over states and actions has a significant impact on the performance and computation time of this class of algorithms. In this paper, we present new strategies to allocate the available budget (number of rollouts) at each iteration of the algorithm over states and actions. Our empirical results indicate that, for a fixed budget, using the proposed strategies improves the accuracy of the training set over existing methods.
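One way to make the allocation question concrete is a gap-based rule. This is purely illustrative and not necessarily one of the paper's strategies; it assumes a one-rollout oracle `q_rollout(s, a)`. After a uniform initialization, each remaining rollout goes to the state whose two best empirical action values are closest, i.e. where the greedy label for the classifier is least certain.

```python
def allocate(states, actions, q_rollout, budget, n_init=2):
    """Spend `budget` rollouts over the rollout set. q_rollout(s, a)
    returns one (possibly noisy) rollout estimate of Q(s, a)."""
    counts = {(s, a): 0 for s in states for a in actions}
    sums = {(s, a): 0.0 for s in states for a in actions}

    def pull(s, a):
        sums[(s, a)] += q_rollout(s, a)
        counts[(s, a)] += 1

    for s in states:                    # uniform initialization
        for a in actions:
            for _ in range(n_init):
                pull(s, a)
    spent = n_init * len(states) * len(actions)

    def mean(s, a):
        return sums[(s, a)] / counts[(s, a)]

    def gap(s):                         # best minus second-best action value
        top = sorted((mean(s, a) for a in actions), reverse=True)
        return top[0] - top[1]

    while spent < budget:
        s = min(states, key=gap)                         # most ambiguous state
        a = min(actions, key=lambda a: counts[(s, a)])   # its least-sampled action
        pull(s, a)
        spent += 1
    return {s: {a: mean(s, a) for a in actions} for s in states}
```

Rules of this kind concentrate samples where the greedy classification is hardest, which is the effect the paper measures as training-set accuracy under a fixed budget.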