**2016:**
**Improved Learning Complexity in Combinatorial Pure Exploration Bandits,**
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner & Peter Bartlett,
*AISTATS 2016**,*
[Paper]
[Poster]
[Abstract]

We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. We first construct a new measure of complexity that provably characterizes the learning performance of the algorithms we propose for the fixed confidence and the fixed budget setting. We show that this complexity is never higher than the one in existing work and illustrate a number of configurations in which it can be significantly smaller. While in general this improvement comes at the cost of increased computational complexity, we provide a series of examples, including a planning problem, where this extra cost is not significant.

**2015:**
**Approximate Modified Policy Iteration and its Application to the Game of Tetris,**
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, Matthieu Geist,
*JMLR 2015**, *
[Paper]
[Abstract]

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of the well-known approximate DP algorithms: fitted-value iteration, fitted-Q iteration, and classification-based policy iteration. We provide error propagation analysis that unify those for approximate policy and value iteration. We develop the finite-sample analysis of these algorithms, which highlights the influence of their parameters. In the classification-based version of the algorithm (CBMPI), the analysis shows that MPI's main parameter controls the balance between the estimation error of the classifier and the overall value function approximation. We illustrate and evaluate the behavior of these new algorithms in the Mountain Car and Tetris problems. Remarkably, in Tetris, CBMPI outperforms the existing DP approaches by a large margin, and competes with the current state-of-the-art methods while using fewer samples.

**2014:**
* Doctorate/Ph.D thesis*: **Budgeted Classification-based Policy Iteration.**
[Thesis]
[Slides]
[Abstract]

This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms, called classification-based policy iteration (CBPI). Contrary to the standard RL methods, CBPI do not use an explicit representation for value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm, is no longer defined by a (approximated) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements are based on the following two shortcomings of the existing CBPI algorithms: 1) The rollouts that are used to estimate the action-value functions should be truncated and their number is limited, and thus, we have to deal with bias-variance tradeoff in estimating the rollouts, and 2) The rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs.

**2014:**
**Large-Scale Optimistic Adaptive Submodularity,**
Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson & S. Muthukrishnan,
*AAAI 2014**, *
[Paper]
[Poster]
[Slides]
[Abstract]

Maximization of submodular functions has wide applications in artificial intelligence and machine learning. In this paper, we propose a scalable learning algorithm for maximizing an adaptive submodular function. The key structural assumption in our solution is that the state of each item is distributed according to a generalized linear model, which is conditioned on the feature vector of the item. Our objective is to learn the parameters of this model. We analyze the performance of our algorithm, and show that its regret is polylogarithmic in time and quadratic in the number of features. Finally, we evaluate our solution on two problems, preference elicitation and face detection, and show that high-quality policies can be learned sample efficiently.

**2013:**
**Approximate Dynamic Programming Finally Performs Well in the Game of Tetris,**
Victor Gabillon, Mohammad Ghavamzadeh & Bruno Scherrer,
*NIPS 2013**, *
[Paper]
[Poster]
[Demo-Video]
[Abstract]

Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris.
Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 \D7 10 and large 10 \D7 20 boards. Although the CBMPI's results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE.

**2013:**
**Adaptive Submodular Maximization in Bandit Setting,**
Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson & S. Muthukrishnan,
*NIPS 2013**, *
[Paper]
[Poster]
[Slides]
[Abstract]

Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as *Optimistic Adaptive Submodular Maximization (OASM)* because it trades off exploration and exploitation based on the *optimism in the face of uncertainty* principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem.

**2012:**
**Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence,**
Victor Gabillon, Mohammad Ghavamzadeh & Alessandro Lazaric,
*NIPS 2012**, *
[Paper]
[Poster]
[Abstract]

We study the problem of identifying the best arm(s) in the stochastic multi-armed bandit setting. This problem has been studied in the literature from two different perspectives: *em fixed budget* and * fixed confidence*. We propose a unifying approach that leads to a meta-algorithm called unified gap-based exploration (UGapE), with a common structure and similar theoretical analysis for these two settings. We prove a performance bound for the two versions of the algorithm showing that the two problems are characterized by the same notion of complexity. We also show how the UGapE algorithm as well as its theoretical analysis can be extended to take into account the variance of the arms and to multiple bandits. Finally, we evaluate the performance of UGapE and compare it with a number of existing fixed budget and fixed confidence algorithms.

** 2012:**
**Approximate Modified Policy Iteration,**
Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon & Matthieu Geist,
*
ICML 2012, *
[Paper]
[Poster]
[Slides]
[Video]
[Abstract]

Modified policy iteration (MPI) is a dynamic programming (DP) algorithm that contains the two celebrated policy and value iteration methods. Despite its generality, MPI has not been thoroughly studied, especially its approximation form which is used when the state and/or action spaces are large or infinite. In this paper, we propose three implementations of approximate MPI (AMPI) that are extensions of well-known approximate DP algorithms: fitted-value iteration, fitted Q iteration, and classification-based policy iteration. We provide error propagation analysis that unifies those for approximate policy and value iteration. For the classification-based implementation, we develop a finite-sample analysis that shows that MPI's main parameter allows to control the balance between the estimation error of the classifier and the overall value function approximation.

@inproceedings{GabLazGhaSch11,

author = { {B}runo {S}cherrer
{M}ohammad {G}havamzadeh and
{V}ictor {G}abillon and
{M}atthieu {G}eist}

title = {Approximate Modified Policy Iteration},

booktitle = {ICML},

year = {2012},

}

**2011:**
**Multi-Bandit Best Arm Identification,**
Victor Gabillon, Mohammad Ghavamzadeh, Alessandro Lazaric & Sébastien Bubeck,
*NIPS 2011**, *
[Paper]
[Slides]
[Poster]
[Abstract]

We study the problem of identifying the best arm in each of the bandits in a multi-bandit multi-armed setting. We first propose an algorithm called Gap-based Exploration (GapE) that focuses on the arms whose mean is close to the mean of the best arm in the same bandit (i.e., small gap). We then introduce an algorithm, called GapE-V, which takes into account the variance of the arms in addition to their gap. We prove an upper-bound on the probability of error for both algorithms. Since GapE and GapE-V need to tune an exploration parameter that depends on the complexity of the problem, which is often unknown in advance, we also introduce variations of these algorithms that estimate this complexity online. Finally, we evaluate the performance of these algorithms and compare them to other allocation strategies on a number of synthetic problems.

** 2011:**
**Classification-based Policy Iteration with a Critic,**
Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh & Bruno Scherrer,
*
ICML 2011, *
[Paper]
[Poster]
[Slides]
[Video]
[Abstract]

In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use the critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on LSTD and BRM methods. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.

@inproceedings{GabLazGhaSch11,

author = {{V}ictor {G}abillon and
{A}lessandro {L}azaric and
{M}ohammad {G}havamzadeh and
{B}runo {S}cherrer},

title = {{C}lassification-based {P}olicy {I}teration with a {C}ritic},

booktitle = {ICML},

pages = {1049-1056},

year = {2011},

}

**2010:**
**Rollout Allocation Strategies for Classification-based Policy Iteration,**
Victor Gabillon, Alessandro Lazaric & Mohammad Ghavamzadeh,
Workshop on Reinforcement Learning and Search in Very Large Spaces,
[Paper]
[Slides]
[Abstract]

Classification-based policy iteration algorithms are variations of policy iteration that do not use any kind of value function representation. The main idea is ** 1)** to replace the usual value function learning step with rollout estimates of the value function over a finite number of states, called the rollout set, and the actions in the action space, and **2) ** to cast the policy improvement step as a classification problem. The choice of rollout allocation strategies over states and actions has significant impact on the performance and computation time of this class of algorithms. In this paper, we present new strategies to allocate the available budget (number of rollouts) at each iteration of the algorithm over states and actions. Our empirical results indicate that for a fixed budget, using the proposed strategies improves the accuracy of the training set over the existing methods.

**2010:**
**Affichage de publicités sur des portails ***web*,
Victor Gabillon, Jérémie Mary & Philippe Preux,
*
EGC 2010,
10th French-speaking International Conference on Knowledge Extraction and Management*, Best applied paper award.
[Paper]
[ Paper]
[Slides]
[Poster]
[Abstract]

We consider the problem of displaying commercial ads on web pages. More and more advertisers are willing to pay only when their ads are clicked. In this emerging "cost per click" model, the ad server has to learn the interest of each type of visitors for the different ads in order to maximize the income. This "exploration versus exploitation" problem is often addressed as if resources were unlimited whereas in a realistic context, budget constraints on the ads like limited number of clicks, as well as the duration of the ad campaigns, should be taken into account. For this purpose, we introduce MAB+LP, a hybrid approach based on linear programming and multi-armed bandits to handle this problem and investigate its performance through simulations on a realistic model designed with an important commercial web actor. These experiments exhibit near optimal performance on the investigated problem setting.

@inproceedings{GabMarPre10,

author = {{V}ictor {G}abillon and
{J}{\'e}r{\'e}mie {M}ary and
{P}hilippe {P}reux},

title = {{A}ffichage de publicit{\'e}s sur des portails web},

booktitle = {EGC},

year = {2010},

pages = {67-78},

}

**2009:**
* Master Internship Report*: **Machine Learning tools for Online Advertisement.**
[Report]
[Abstract]

On-line advertisement is the main income source on the Internet. In the last years, this recent market began to switch from pay per impression to pay per click contracts. The ability to present to the web-visitors advertisements they are likely to click on has now become momentous. Ad servers, such has Orange, anticipate this change by an R&D effort. In this context, we will present a review of solutions born in the field of machine learning to tackle those issues. Moreover, trying to put the existing ideas all together, we finally present and analyse a new model.

@techreport{ GabInt09,

author = {{G}abillon, {V}ictor},

title = {{M}achine {L}earning {T}ools for {O}n-line {A}dvertisement},

institution = {{INRIA} {L}ille - {N}ord {E}urope, {\'E}cole {N}ormale {S}up\'erieure de {C}achan, {TELECOM} {S}ud{P}aris},

year = {2009}

}

** 2009:**
**Asymptotic performance analysis of PCA algorithms based on the weighted subspace criterion**,
Jean-Pierre Delmas & Victor Gabillon,
*
ICASSP 2009,*
[Paper]
[Abstract]

This paper studies the asymptotic distribution of the eigenvectors estimated by some PCA algorithms based on the weighted subspace criterion. This enables us to analyse how the choice of the weighting matrix affects the algorithm's performance, an issue previously overlooked.

@inproceedings{ DelGab09,
author = {{D}elmas, {J}ean {P}ierre and {G}abillon , {V}ictor },

title = {{A}symptotic performance analysis of PCA algorithms based on the weighted subspace criterion},

journal ={Acoustics, Speech, and Signal Processing, IEEE International Conference on},

volume = {0},

year = {2009},

pages = {3237-3240},

publisher = {IEEE Computer Society},

}