**2016:**
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner & Peter Bartlett,
**Improved Learning Complexity in Combinatorial Pure Exploration Bandits,**
*AISTATS 2016**,*
[Paper]
[Poster]
[Abstract]

We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. We first construct a new measure of complexity that provably characterizes the learning performance of the algorithms we propose for the fixed confidence and the fixed budget setting. We show that this complexity is never higher than the one in existing work and illustrate a number of configurations in which it can be significantly smaller. While in general this improvement comes at the cost of increased computational complexity, we provide a series of examples, including a planning problem, where this extra cost is not significant.

**2015:**
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, Matthieu Geist,
**Approximate Modified Policy Iteration and its Application to the Game of Tetris,**
*JMLR 2015**, *
[Paper]
[Abstract]

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analysis that unify those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples.

**2014:** * Doctorate/Ph.D thesis*: **Budgeted Classification-based Policy Iteration.**
[Thesis]
[Slides]
[Abstract]

This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms, called classification-based policy iteration (CBPI). Contrary to the standard RL methods, CBPI do not use an explicit representation for value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm, is no longer defined by a (approximated) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements are based on the following two shortcomings of the existing CBPI algorithms: 1) The rollouts that are used to estimate the action-value functions should be truncated and their number is limited, and thus, we have to deal with bias-variance tradeoff in estimating the rollouts, and 2) The rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs.

**2014:**
Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson & S. Muthukrishnan,
**Large-Scale Optimistic Adaptive Submodularity,**
*AAAI 2014**, *
[Paper]
[Poster]
[Slides]
[Abstract]

Maximization of submodular functions has wide applications in artificial intelligence and machine learning. In this paper, we propose a scalable learning algorithm for maximizing an adaptive submodular function. The key structural assumption in our solution is that the state of each item is distributed according to a generalized linear model, which is conditioned on the feature vector of the item. Our objective is to learn the parameters of this model. We analyze the performance of our algorithm, and show that its regret is polylogarithmic in time and quadratic in the number of features. Finally, we evaluate our solution on two problems, preference elicitation and face detection, and show that high-quality policies can be learned sample efficiently.

**2013:**
Victor Gabillon, Mohammad Ghavamzadeh & Bruno Scherrer,
**Approximate Dynamic Programming Finally Performs Well in the Game of Tetris,**
*NIPS 2013**, *
[Paper]
[Poster]
[Demo-Video]
[Abstract]

Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris.
Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 \D7 10 and large 10 \D7 20 boards. Although the CBMPI's results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE.

**2013:**
Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson & S. Muthukrishnan,
**Adaptive Submodular Maximization in Bandit Setting,**
*NIPS 2013**, *
[Paper]
[Poster]
[Slides]
[Abstract]

Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as *Optimistic Adaptive Submodular Maximization (OASM)* because it trades off exploration and exploitation based on the *optimism in the face of uncertainty* principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem.

**2012:**
Victor Gabillon, Mohammad Ghavamzadeh & Alessandro Lazaric,
**Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence,**
*NIPS 2012**, *
[Paper]
[Poster]
[Abstract]

We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: *em fixed budget* and * fixed confidence*. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms.

** 2012:**
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon & Matthieu Geist,
**Approximate Modified Policy Iteration,**
*
ICML 2012, *
[Paper]
[Poster]
[Slides]
[Video]
[Abstract]

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted Q iteration, and classification-based policy iteration. We provide error propagation analysis that unifies those for approximate policy and value iteration. For the classification-based implementation, we develop a finite-sample analysis that shows that MPI's main parameter allows to control the balance between the estimation error of the classifier and the overall value function approximation.

@inproceedings{GabLazGhaSch11,

author = { {B}runo {S}cherrer
{M}ohammad {G}havamzadeh and
{V}ictor {G}abillon and
{M}atthieu {G}eist}

title = {Approximate Modified Policy Iteration},

booktitle = {ICML},

year = {2012},

}

**2011:**
Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric & Sébastien Bubeck,
**Multi-Bandit Best Arm Identification,**
*NIPS 2011**, *
[Paper]
[Slides]
[Poster]
[Abstract]

We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.