For the first year of my PhD I had given myself the target of learning and implementing some of the state-of-the-art systems in Deep Reinforcement Learning. I felt this would be a good way to get up to speed with the relevant literature and to understand what limitations the field still faces.
With that in mind, I began by implementing Deep Q Networks as outlined by Mnih et al. in their paper "Human-level control through deep reinforcement learning".
Q-learning centres on the equation below: the aim is to learn a function that takes a state and an action and returns the value of taking that action in that state. The value combines the immediate reward for entering the next state with the discounted future reward obtainable from that state.
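In tabular form this is the familiar update Q(s, a) ← Q(s, a) + α (r + γ max_a' Q(s', a') − Q(s, a)). A minimal sketch of that update in plain Python (the function name, learning rate, and discount factor here are my own choices, not from the original post):

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) towards the
    target r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

# Example: a table that defaults every Q value to 0.0.
q = defaultdict(float)
q_update(q, state=0, action="up", reward=1.0, next_state=1,
         actions=["up", "down"])
```

A DQN replaces the table `q` with a neural network that approximates the same function, trained against the same target.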
Deep Q Networks bring together several reinforcement learning techniques; the architecture of an example DQN is shown below. Two convolution layers process the input images, followed by two fully connected layers that output the estimated Q value for each action. The action with the highest Q value is then chosen from the network's output.
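That final step, picking the action from the network's outputs, can be sketched in a few lines. During training, DQNs (as in Mnih et al.) typically act ε-greedily, occasionally taking a random action to keep exploring; the function below is my own illustrative version, not code from the original project:

```python
import random

def choose_action(q_values, epsilon=0.0):
    """Return the index of the highest Q value; with probability
    epsilon, return a random action index instead (exploration)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

# With epsilon=0 this is pure greedy action selection.
action = choose_action([0.1, 0.9, 0.3])  # picks index 1
```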
At the time I started this, TensorFlow had just been open-sourced, and this seemed like as good a project as any to learn how it works. At this stage I had very little experience with deep learning libraries; I had implemented my own feed-forward neural network in C++ as a programming exercise, but nothing that involved convolution layers (though my limited experience with computer vision did help in this area).
I first started by getting the network to train on a very simple game called Grid World. The game randomly places the player at a start position in a grid and randomly places a goal. The player can move in the four cardinal directions and receives a reward on reaching the goal. This proved very useful while learning, as the game trains quickly, allowing bugs to be found faster.
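The environment described above might look something like this minimal sketch (the class name, grid size, and reward scheme are my own assumptions; moves off the edge simply clamp to the grid):

```python
import random

class GridWorld:
    """Minimal grid world: random distinct start and goal cells,
    four cardinal moves, reward of 1.0 on reaching the goal."""
    MOVES = {"up": (0, -1), "down": (0, 1),
             "left": (-1, 0), "right": (1, 0)}

    def __init__(self, size=5, rng=None):
        self.size = size
        self.rng = rng or random.Random()
        self.reset()

    def reset(self):
        cells = [(x, y) for x in range(self.size) for y in range(self.size)]
        # sample() guarantees the start and goal are different cells
        self.player, self.goal = self.rng.sample(cells, 2)
        return self.player

    def step(self, move):
        dx, dy = self.MOVES[move]
        # clamp the new position so the player stays on the grid
        x = min(max(self.player[0] + dx, 0), self.size - 1)
        y = min(max(self.player[1] + dy, 0), self.size - 1)
        self.player = (x, y)
        done = self.player == self.goal
        return self.player, (1.0 if done else 0.0), done
```

Because an optimal episode length is easy to compute here (the Manhattan distance from start to goal), "excess moves" makes a natural training metric.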
Above is a screenshot of a training run for Grid World. I used D3 to plot the excess moves the network made in each game. The graph shows the network learning how to play; it would eventually learn to play perfectly and the graph would "flat line", with no excess moves being made.
Moving on to DOOM
Once I had that working, I learned of a deep reinforcement learning competition taking place at the CIG (Computational Intelligence and Games) conference. The competition was ViZDoom, which requires teams to create agents that can play multiplayer deathmatches in DOOM. This immediately caught my attention: I'm a pretty big fan of DOOM, and the thought of getting to mess around with it while pushing the boundaries of deep reinforcement learning was too strong to ignore. So I began to apply what I had done with Grid World to DOOM. Luckily, I had been smart enough to structure my code so that I only needed a small amount of glue code linking my network class to the DOOM game, and it all worked.
The above video shows the results of training a DQN on the basic scenario in ViZDoom: the network successfully identifies and shoots an enemy target. The graph below shows the training run that produced this agent. The mean value the network outputs for each action steadily increases as the reward from the final state is propagated backwards. The training run was carried out over 1.1 million game ticks.