Saturday, November 26, 2016

[Paper] Human-level control through deep reinforcement learning

Human-level control through deep reinforcement learning - Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al.

Recently, Google's AlphaGo defeated the top Korean Go player, Se-dol Lee, by 4 to 1. It was very surprising that an artificial agent could learn superhuman play and establish a game plan to win. The basis of AlphaGo lies in reinforcement learning combined with deep neural networks, an approach known as deep reinforcement learning; the network studied in this paper is called a deep Q-network.


1. Previous Works and DeepMind

The theory of reinforcement learning provides a model of how agents may optimize their control of an environment. However, to use reinforcement learning successfully in situations approaching real-world complexity, agents face a difficult task: they must derive efficient representations of the environment from high-dimensional inputs, and use these to generalize past experience to new situations.

In the past, reinforcement learning agents have achieved some successes in a variety of domains; however, their applicability has previously been limited to ...
(a). domains where useful features can be handcrafted
(b). fully observed domains
(c). low-dimensional state spaces

The paper, Human-level control through deep reinforcement learning, published in Nature in February 2015, describes a deep reinforcement learning system which combines deep neural networks with reinforcement learning for the first time, and is able to master a diverse range of Atari games at a superhuman level with only the raw pixels and score as inputs. This line of work was later extended by DeepMind into AlphaGo, the autonomous agent that defeated Se-dol Lee.


2. Reinforcement Learning


In supervised learning, we have an unambiguous label y for every input x in the training set. In contrast, for many sequential decision-making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the "correct" actions are to make it walk, so we do not know how to provide explicit supervision for a learning algorithm to mimic.

In the reinforcement learning framework, we instead give the algorithm a reward signal, which indicates whether the learning agent is doing well or not. So reinforcement learning is characterized as an interaction between a learner (agent) and an environment that provides evaluative feedback. It should be noted that the environment is usually formulated as a Markov decision process, which includes states s, actions a, and rewards r.


3. Defining state s

In a real game (e.g. Breakout), defining the state s is critical to defining the problem. First, we can come up with hand-crafted state variables, such as:
a. Location of the paddle
b. Location/direction of the ball
c. Number/position of bricks
and so on. However, it is possible to make the state s more universal and more generalized. So, just as human visual perception does, we can use the raw screen pixels as the input (see the preprocessing sketch below).
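As a rough idea of what this pixel-based state looks like in practice, here is a minimal sketch of the kind of preprocessing the paper's φ function performs (grayscale conversion, downsampling to 84x84, and stacking the last four frames). The function names and the crude nearest-neighbour resize are my own illustration, not the authors' code.

```python
import numpy as np

def to_grayscale(frame_rgb):
    """Convert an RGB frame of shape (H, W, 3) to grayscale by averaging channels."""
    return frame_rgb.mean(axis=2).astype(np.float32)

def downsample(frame, out_h=84, out_w=84):
    """Crude nearest-neighbour resize to out_h x out_w (the paper uses 84x84 inputs)."""
    h, w = frame.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows][:, cols]

def preprocess(last_frames, history_len=4):
    """Stack the most recent `history_len` preprocessed frames into one state tensor,
    analogous to the paper's phi(s): output shape is (history_len, 84, 84)."""
    processed = [downsample(to_grayscale(f)) / 255.0 for f in last_frames[-history_len:]]
    return np.stack(processed, axis=0)
```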


4. Value function

Future reward:

    $R_t = r_t + r_{t+1} + \cdots + r_T$

Discounted future reward:

    $R_t = \sum_{t'=t}^{T} \gamma^{\,t'-t} \, r_{t'}$

Optimal action-value function:

    $Q^*(s, a) = \max_{\pi} \mathbb{E}\big[ R_t \mid s_t = s, a_t = a, \pi \big]$

In order to evaluate a certain action a in a given state s, we have to measure its value. We can define R as the future reward. In practical problems, we use the discounted future reward instead, so that less weight is put on rewards that arrive further in the future. The action-value function Q then represents the maximum expected future reward R achievable when we perform action a in state s (π is the policy, a mapping from states to actions, that maximizes the Q function).


5. Q-learning

So, how do we compute the Q function? The optimal action-value function obeys an important identity known as the Bellman equation:

    $Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a \,\big]$

That is, the optimal strategy is to select the action a' maximizing the expected value of $r + \gamma Q^*(s', a')$. Many reinforcement learning algorithms estimate the action-value function by using the Bellman equation as an iterative update:

    $Q_{i+1}(s, a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q_i(s', a') \;\big|\; s, a \,\big]$

However, in practice, this basic approach is impractical because the action-value function is estimated separately for each sequence, without any generalization. That is,
  (a) it handles only a very limited set of states/actions
  (b) it cannot generalize to unobserved states

When we think about the Breakout game, since the input is the pixels of the entire screen, we cannot build a Q table with an entry for every possible screen. Instead, it is common to use a function approximator to estimate the action-value function Q (a tiny tabular sketch below shows what we are trying to replace).
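For contrast, here is a minimal sketch of classic tabular Q-learning, assuming a toy environment with hypothetical reset()/step() methods; the table Q[s][a] it maintains is exactly what becomes infeasible when s is a full screen of pixels.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a])."""
    Q = defaultdict(lambda: [0.0] * n_actions)  # state -> list of action values

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])

            s_next, r, done = env.step(a)          # hypothetical env interface
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])  # Bellman-style iterative update
            s = s_next
    return Q
```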


6. Deep Q-network

In this paper, a deep convolutional network was used to approximate the optimal action-value function; this nonlinear function approximator is called a Q-network. A Q-network can be trained by adjusting its parameters θ_i at iteration i to reduce the mean-squared error in the Bellman equation.
    [Figure: full structure of the deep Q-network.]
Because the approximator is a convolutional neural network, the algorithm utilizes the local spatial correlations present in images and builds in robustness to natural transformations such as changes of viewpoint or scale, which is what lets it cover a wide range of challenging tasks.
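As a rough sketch of this architecture (three convolutional layers followed by a fully connected layer and one output unit per action, following the layer sizes reported in the paper), assuming PyTorch and an input of four stacked 84x84 frames; this is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: input is a stack of 4 preprocessed 84x84 frames,
    output is one estimated action value per legal action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 9x9 -> 7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.head(self.features(x))

# Example: a batch of 32 states, each made of 4 stacked 84x84 frames
q_values = DQN(n_actions=4)(torch.zeros(32, 4, 84, 84))  # shape (32, 4)
```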

We can express the squared error term using the Bellman equation, comparing the network's current prediction with the Bellman target.
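Concretely, the mean-squared error minimized at iteration i in the paper is the following, where θ_i^- are the parameters of the periodically updated target network and D is the replay memory:

    $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\big)^2\Big]$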


7. Solving instability problems when using a neural network

Since a nonlinear function approximator such as a Q-network is known to be unstable, the paper addresses this instability with two new ideas. The first is a method named experience replay, which stores the agent's transitions in a memory and samples random mini-batches of transitions from it for training. The second is an iterative update in which the target action-values are held fixed and only periodically adjusted toward the current parameters.
  a. A method termed experience replay
  b. Target values are updated only periodically
 The agent selects and executes actions according to an ϵ-greedy policy based on Q. Instead of taking histories of arbitrary length as inputs to the neural network, the Q-function works on a fixed-length representation of histories produced by the function φ (a sketch of this action-selection step is shown below). The loop iterates until the game ends at time T.
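A minimal sketch of the ϵ-greedy action selection, reusing the hypothetical preprocess and DQN helpers sketched above:

```python
import random
import torch

def select_action(q_net, last_frames, n_actions, epsilon):
    """Epsilon-greedy policy: with probability epsilon take a random action,
    otherwise take the action with the highest predicted Q-value."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # preprocess() is the earlier sketch; adds a batch dimension -> (1, 4, 84, 84)
    state = torch.from_numpy(preprocess(last_frames)).unsqueeze(0)
    with torch.no_grad():
        q_values = q_net(state)
    return int(q_values.argmax(dim=1).item())
```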
 
Because the agent only observes the current screen, the task is partially observed, and many emulator states are perceptually aliased. Therefore, sequences of actions and observations are given as input to the algorithm, and the algorithm learns game strategies that depend on these historical sequences. The ultimate goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards.

This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites old transitions with recent ones owing to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most. (A sketch of such a uniform replay buffer is shown below.)
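For concreteness, a minimal sketch of such a uniform replay buffer with a fixed capacity N, where pushing beyond capacity silently overwrites the oldest transitions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of transitions; old entries are dropped once capacity is reached."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform sampling: every stored transition is equally likely to be chosen."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```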


8. Result Analysis
 It should be noted that after each peak point, the output value drops a little. Considering the fact that the state does not get worse, this phenomenon does not seem proper, in my opinion. It might be due to the fact that the agent only has a constant-size memory and only stores four consecutive frames. I expect that if we solve this issue, we can further enhance its performance.
 
...