Rollout Sampling Approximate Policy Iteration -- Algorithms and Bounds for Sampling-based Approximate Policy Iteration
Christos Dimitrakakis and Michail Lagoudakis
| What | Talk |
|---|---|
| When |
2008-07-02 18:05
2008-07-02 18:30
2008-07-02 from 18:05 to 18:30 |
| Add event to calendar |
|
- Several researchers have recently investigated the connection between reinforcement learning and classification. We are motivated by proposals of approximate policy iteration schemes without value functions which focus on policy representation using classifiers and address policy learning as a supervised learning problem. This paper proposes variants of an improved policy iteration scheme which addresses the core sampling problem in evaluating a policy through simulation as a multi-armed bandit machine. The resulting algorithm offers comparable performance to the previous algorithm achieved, however, with significantly less computational effort. An order of magnitude improvement is demonstrated experimentally in two standard reinforcement learning domains: inverted pendulum and mountain-car.
- Several approximate policy iteration schemes without value functions, which focus on policy representation using classifiers and address policy learning as a supervised learning problem, have been proposed recently. Finding good policies with such methods requires not only an appropriate classifier, but also reliable examples for the best actions, covering all of the state space. One major question is how to find a good covering efficiently. However, up to this time, little work has been done to reduce the sample complexity of such methods, especially in continuous state spaces. This paper focuses on the simplest possible classification strategy / policy representation for such spaces (a discretised grid) and performs a sample-complexity comparison between previously the simplest (and commonly) sample allocation strategy, which allocates samples equally at each state under consideration, and an almost as simple method, which is shown to require significantly fewer samples.




