Reinforcement Learning Capture-The-Flag Tutorial

This tutorial provides an example of a Reinforcement Learning (RL) implementation applied to the Capture the Flag (CTF) scenario. For more detail, please see the CTF documentation Capture the Flag Scenario. This tutorial provides a complete working example by creating OpenAI plugins for the CTF scenario. The implementation concepts used here are only intended to serve as a starting point and guide for developing RL plugins for SCRIMMAGE.

For more information on how to develop an OpenAI environment for SCRIMMAGE, please see the tutorial on Create an OpenAI Plugin. We assume users have already familiarized themselves with the documentation in the links provided above.

The CTF OpenAI repository can be cloned or downloaded from the GitLab CTF RL Demo.

This tutorial uses Q-Learning, a model-free reinforcement learning algorithm, to learn a policy. Since model-free algorithms are sample-inefficient, and to keep training short for this tutorial, a quasi-strategy policy is implemented with built-in logic that selectively sets the observation state for the agent to focus on. We do this by setting the state space to the heading from the agent to the current objective.
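For reference, tabular Q-Learning updates its state-action value table with the standard rule below, where α is the learning rate and γ the discount factor; these are among the hyper-parameters exposed for tuning in ctf_qlearn, though the exact variable names used in qlearn.py may differ:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]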

The CTF scenario is innately complex. It involves multiple objectives: the enemy agent, the enemy flag to take, and the return to the agent’s home base with the taken flag. Additionally, the rules of the game mean the agent can capture the enemy agent only if both the agent and the enemy agent are within the agent’s home base (the blue boundary). Because of this complexity, an RL implementation can take many hours to train properly, even given an ideal combination of states, actions, rewards, hyper-parameters, etc. For this tutorial, we therefore provide a simplified example to get users started, along with guidelines for enhancing the implementation so the agent can learn the full policy on its own.

There are two main plugins for this tutorial: the Autonomy and the Sensor. The examples here are named CTFAutonomy and CTFSensor. The CTFAutonomy provides the action space, action, and reward implementation, while the CTFSensor provides the state space and the state observation at each step. Additionally, there are two main Python modules: ctf_qlearn, which creates the OpenAI Gym environment for the CTF and sets the hyper-parameters for tuning the training, and qlearn, which implements the Q-Learning algorithm used in this tutorial. Finally, the CTF mission scenario is captured in the ctf_mission file, which allows configurable parameters to be passed into the plugins.

CTFSensor

A focus of the CTFSensor is its set_observation_space and get_observation methods, which set the observation state space and return the observation at each step, respectively. The observation consists of the heading from the agent to the objective, discretized into 8 bins.

observation_space.discrete_count.push_back(8);  // one discrete observation dimension with 8 possible values (heading bins)

data[beginning++] = heading_bin;  // write the agent's current heading bin into the observation vector

To speed up training for this tutorial, the heading target is selected by built-in logic based on the observable environment rather than being learned by the agent. The heading targets the enemy agent if the agent and the enemy agent are both within the agent’s home boundary. Otherwise, the heading points to the enemy’s flag position and, once the agent has the flag, back to the agent’s home boundary. In order for the agent to be notified of these observable events (flag taken, for example), the plugin subscribes to the relevant messages. It also checks the enemy’s location. A sketch of this selection and discretization logic is shown below.
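The following is a minimal sketch of the idea inside get_observation; the variable names (enemy_in_our_base, agent_has_flag, own_pos, etc.) are hypothetical stand-ins rather than the plugin's actual members:

// Hypothetical sketch: choose the current objective, then discretize the
// heading to it into one of 8 bins of 45 degrees each.
// All variable names here are illustrative, not the plugin's actual members.
// (Requires <cmath> and <Eigen/Dense>.)
Eigen::Vector3d objective;
if (enemy_in_our_base && agent_in_our_base) {
    objective = enemy_pos;          // chase the intruder inside our boundary
} else if (agent_has_flag) {
    objective = home_base_pos;      // return home with the flag
} else {
    objective = enemy_flag_pos;     // go take the enemy flag
}

double heading = std::atan2(objective(1) - own_pos(1),
                            objective(0) - own_pos(0));   // radians, [-pi, pi]
double heading_deg = (heading + M_PI) * 180.0 / M_PI;     // shift to [0, 360]
int heading_bin = static_cast<int>(heading_deg / 45.0) % 8;

data[beginning++] = heading_bin;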

CTFAutonomy

The following shows the action space in the CTFAutonomy; it corresponds to a 2-dimensional space. Note that SCRIMMAGE itself supports 3-D motion, but for simplicity we limit the actions to 2-D, so each action corresponds to a velocity in the x and y directions.

action_options_[0] = std::make_pair(0, max_speed_);

action_options_[7] = std::make_pair(max_speed_, -max_speed_);
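Only the first and last entries are shown above. One plausible assignment of all eight (x, y) velocity pairs, covering the four cardinal and four diagonal directions, is sketched below; the intermediate entries are an assumption for illustration and may not match the actual plugin's ordering:

// Hypothetical full 8-direction action table (vx, vy); only entries 0 and 7
// are taken from the tutorial, the rest are an illustrative assumption.
action_options_[0] = std::make_pair(0,            max_speed_);    // +y
action_options_[1] = std::make_pair(0,           -max_speed_);    // -y
action_options_[2] = std::make_pair(max_speed_,   0);             // +x
action_options_[3] = std::make_pair(-max_speed_,  0);             // -x
action_options_[4] = std::make_pair(max_speed_,   max_speed_);    // +x, +y
action_options_[5] = std::make_pair(-max_speed_,  max_speed_);    // -x, +y
action_options_[6] = std::make_pair(-max_speed_, -max_speed_);    // -x, -y
action_options_[7] = std::make_pair(max_speed_,  -max_speed_);    // +x, -y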

Reward Calculation

The heart of the CTFAutonomy is the CTFAutonomy::calc_reward method. Here we can shape the training by setting rewards depending on the agent’s state. For this example, rewards are set for the agent taking the flag, for capturing the flag, and for capturing the enemy. A negative reward is given if the agent goes out of the game bounds. To speed up training, there is also a small reward for moving toward the objective, calculated as the difference between the previous and current distance to the objective. The base reward itself can be set in the mission file:

reward_ = get<int>("reward", params, reward_);
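The sketch below shows what such reward shaping might look like inside calc_reward; the flags and distance bookkeeping are hypothetical illustrations, not the plugin's actual members or method signature:

// Hypothetical reward-shaping sketch (member names are illustrative only,
// and the actual calc_reward signature is simplified away here).
double reward = 0.0;

if (flag_taken_)      reward += reward_;   // picked up the enemy flag
if (flag_captured_)   reward += reward_;   // brought the flag back home
if (enemy_captured_)  reward += reward_;   // captured the enemy in our base
if (out_of_bounds_)   reward -= reward_;   // penalty for leaving the game bounds

// Small shaping term: positive when the agent moved closer to the objective.
reward += prev_dist_to_objective_ - dist_to_objective_;
prev_dist_to_objective_ = dist_to_objective_;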

Configurable Mission

There are also key variables that can be set in the ctf_mission file. For CTFAutonomy, they are flag_boundary_id, capture_boundary_id, max_speed, and reward.

These are easily accessible from the CTFAutonomy (see CTFAutonomy::init_helper):

flag_boundary_id_ = get<int>("flag_boundary_id", params, flag_boundary_id_);
capture_boundary_id_ = get<int>("capture_boundary_id", params, capture_boundary_id_);
max_speed_ = get<int>("max_speed", params, max_speed_);
reward_ = get<int>("reward", params, reward_);

For example, max_speed provides the maximum velocity allowed for the agent, set here to 10 m/s in any direction (to match the enemy agent's maximum speed), while the reward variable provides a way to set the reward for arriving at the objective targets.
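In a SCRIMMAGE mission file, plugin parameters are typically passed as attributes on the plugin's tag. A hypothetical snippet is shown below; the actual contents of ctf_mission may differ, and the id and reward values here are placeholders:

<autonomy flag_boundary_id="2" capture_boundary_id="1" max_speed="10" reward="100">CTFAutonomy</autonomy>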

Subscribing to Messaging

SCRIMMAGE provides a rich communication pathway for plugins to send and receive messages from each other. For further information, please see the documentation on Publishers And Subscribers. The CTFSensor and CTFAutonomy plugins subscribe to the FlagTaken, FlagCaptured, Boundary, and NonTeamCapture messages.
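Subscriptions follow the usual SCRIMMAGE pattern of registering a callback for a topic during plugin initialization. A hedged sketch for one of the messages is shown below; the network and topic names, the message namespace, and the message fields used in the callback are assumptions, so check the actual plugin source:

// Hypothetical sketch of subscribing to a flag-taken message in init();
// network/topic names, namespace, and message fields are assumptions.
auto flag_taken_cb = [&](scrimmage::MessagePtr<sm::FlagTaken> msg) {
    if (msg->data.entity_id() == parent_->id().id()) {
        agent_has_flag_ = true;   // our agent now carries the enemy flag
    }
};
subscribe<sm::FlagTaken>("GlobalNetwork", "FlagTaken", flag_taken_cb);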

Python modules for creating CTF gym environment and implementing Q-Learning
  • ctf_qlearn.py
  • qlearn.py

The file ctf_qlearn.py sets up the OpenAI gym environment for the CTF. It allows two modes: training and non-training. Training mode updates the Q-table as the agent explores the environment over many episodes; once training is complete, the module saves the resulting Q-table into a pickle file. In non-training mode, the learned Q-table is loaded and used. Alternatively, the user can run scrimmage using just the 'return_action_func' itself. For more information, please see the OpenAI Plugin documentation. The file qlearn.py contains an example implementation of the Q-Learning algorithm and is not tied to any particular scenario or environment.

Running the Tutorial in Training Mode

The following provides instructions on running this tutorial.

1. Clone or download the repository.
2. Set the required environment variables and paths for SCRIMMAGE: source ~/.scrimmage/setup.bash
3. Create a build directory in ctf_qlearn_demo: mkdir build && cd build
4. Build the code: cmake .. && make
5. Train the agent. Move to the main repository directory, then run: python3 ctf_qlearn.py (for training, make sure train_mode=True)

Training should take anywhere from 2 to 15 minutes, depending on the available computing power and resources.

Once training is complete, summary metrics will be provided, including SimpleCaptureMetrics.

Playback training episodes

Requirement: open a browser with the VNC connection (http://localhost:6901), assuming a local run using vSCRIMMAGE on the default port 6901.

Use the following command to see the GUI rendering of the latest training playbacks: scrimmage-playback ~/.scrimmage/logs/latest

Use the following command to plot the trajectories of the latest training episode: python3 ~/scrimmage/scripts/plot_3d_fr.py ~/.scrimmage/logs/latest

Run in Non-Training Mode

There are two ways to run the tutorial in non-training mode. Both require that the model (pickle) file has already been saved by training the agent first.

Use the Python module: in the code, set test_openai(train_mode=False), then run python3 ctf_qlearn.py

Run using scrimmage: scrimmage missions/ctf-qlearn.xml

Important: Start the run in the browser. The list of available viewer controls is provided here.

Though training used one particular mission scenario, the learned policy is robust enough to handle limited changes to the agent placements. You can experiment with this by changing the starting positions of the blue and red agents in the mission scenario file and then running in non-training mode. Keep the z dimension the same, since the tutorial only considers 2-D space. To make training more robust, future enhancements could include varying the mission during training to cover different starting points.

Further Enhancements

For a full Q-Learning implementation, the built-in logic that selectively sets the heading should be removed and replaced with additional observation states, allowing the agent to learn the complete policy on its own. (Note that a larger state space will increase the training time.) Since the heading currently depends on built-in knowledge of the environment, the goal is to expose that knowledge as observable states, such as the headings to the other possible targets and which boundaries the agents are currently in.

The following are suggestions to enhance the agent:

  • Set additional observation states in CTFSensor (e.g. provide additional heading information for all of the objectives, as well as binary flags for whether the agent and the enemy agent are in the agent's home base, etc.), for example:

observation_space.discrete_count.push_back(8);  // heading_bin_flag
observation_space.discrete_count.push_back(8);  // heading_bin_base
observation_space.discrete_count.push_back(9);  // heading_bin_enemy
observation_space.discrete_count.push_back(2);  // in_base
observation_space.discrete_count.push_back(2);  // has_flag
observation_space.discrete_count.push_back(2);  // closest_enemy_in_base

  • Update the ‘get_observation’ method accordingly to accommodate the additional states (see the sketch after this list).
  • Update the ctf mission file to include more individual rewards/penalties (e.g. an out-of-bounds penalty).
  • Update CTFAutonomy plugin to reflect changes in the mission file and to update the reward calculation.
  • Tune the hyper-parameters in the ctf_qlearn module.
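As a rough illustration of the first two suggestions, an extended get_observation could pack the additional states in the same order as the discrete_count entries above; all names below are hypothetical:

// Hypothetical sketch of an extended observation vector matching the
// discrete_count entries above (all names are illustrative only).
data[beginning++] = heading_bin_flag;         // 0-7: heading to the enemy flag
data[beginning++] = heading_bin_base;         // 0-7: heading to our home base
data[beginning++] = heading_bin_enemy;        // 0-8: heading to the enemy (9 bins)
data[beginning++] = in_base ? 1 : 0;          // agent inside its home boundary
data[beginning++] = has_flag ? 1 : 0;         // agent currently carries the flag
data[beginning++] = enemy_in_base ? 1 : 0;    // closest enemy inside our boundary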