Reinforcement Learning by Direct Optimal Value Estimation and Regret Minimization
Manuel Loth and Philippe Preux
| What | Talk |
|---|---|
| When | 2008-07-01 from 11:55 to 12:20 |
We introduce an online Reinforcement Learning algorithm that replaces the idea of policy improvement (explicit in policy iteration algorithms, implicit in value iteration ones) with the idea of improving knowledge about the value function of the optimal policy, along a learning policy driven by regret minimization. This is achieved by maintaining, for each state, a probability distribution over its optimal value, relative to the information gathered so far. The approach can achieve high sample efficiency, owing both to the way updates are performed and to the fact that the exploration/exploitation trade-off is handled well by a UCB (upper confidence bound) policy.
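To make the UCB idea concrete, here is a minimal sketch of the standard UCB1 rule on a Bernoulli multi-armed bandit. It is not the algorithm presented in the talk, which maintains a distribution over the optimal value of each state; it only illustrates how an upper-confidence-bound policy trades exploration against exploitation. The names `ucb1_select` and `run_bandit`, the exploration constant `c`, and the reward probabilities are assumptions chosen for illustration.

```python
import math
import random


def ucb1_select(counts, means, c=2.0):
    """Return the index of the arm with the highest UCB1 score.

    counts[i] is how often arm i was pulled, means[i] its average reward.
    The constant c is a hypothetical tuning knob (c=2 gives classic UCB1).
    """
    # Pull every arm once before trusting any estimate.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    # Empirical mean plus a bonus that shrinks as the arm is sampled more.
    scores = [
        mean + math.sqrt(c * math.log(total) / n)
        for mean, n in zip(means, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(true_means, steps=2000, seed=0):
    """Simulate UCB1 on Bernoulli arms (illustrative setup, not from the talk)."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    means = [0.0] * len(true_means)
    for _ in range(steps):
        arm = ucb1_select(counts, means)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


if __name__ == "__main__":
    counts, means = run_bandit([0.2, 0.8])
    print("pull counts:", counts)
    print("estimated means:", [round(m, 3) for m in means])
```

In this sketch the confidence bonus plays the role of optimism under uncertainty: rarely tried arms keep a large bonus and are revisited, while well-sampled arms are judged mostly on their empirical mean, so the number of pulls spent on the worse arm grows only slowly with the horizon.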




