**By Mateusz Wyszyński**.

**Introduction**

We recently witnessed one of the biggest game AI events in history: AlphaGo became the first computer program to beat a world champion at the game of Go. The publication can be found here. Developers at DeepMind combined several techniques from machine learning and tree search to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications beyond game AI. Below, I explain the concept behind MCTS and briefly describe how it was used at the European Space Agency for planning interplanetary flights.

**Perfect Information Games**

Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player has perfect information about all actions that have previously taken place. Examples of such games are Chess, Go or Tic-Tac-Toe. But just because every move is known doesn't mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal board positions in Go is over 10^170.

Every perfect information game can be represented in the form of a tree: the root is the current state of the game, and each edge is a legal move leading to a new state. To decide which move to explore next, MCTS borrows an idea from the Multi-Armed Bandit problem: at every step we choose to play the machine with the highest upper confidence bound for its average pay-out *x̄ᵢ*, i.e. the one maximizing *x̄ᵢ* + √(2 ln n / nᵢ), where nᵢ is the number of times machine *i* has been played and n is the total number of plays (hence the "+": we add an exploration bonus to the empirical mean).

This is a solution to the Multi-Armed Bandit problem. Now note that we can use it for our perfect information game: just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing; this is our pay-out. For simplicity, I will assume that we are only interested in winning, so the pay-out is 1 if we have won and 0 otherwise.
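To make the selection rule concrete, here is a minimal UCB1 sketch in base R. The three "machines" and their win probabilities are made up for the illustration; the pay-out is 1 for a win and 0 otherwise, as above.

```r
# Minimal UCB1 sketch (base R only): each "machine" could equally be a
# candidate move in a game tree. The true win rates are hypothetical.
set.seed(42)
true_win_prob <- c(0.2, 0.3, 0.6)

n_plays <- 10000
wins  <- rep(0, 3)  # total pay-out collected per machine
plays <- rep(0, 3)  # times each machine was played

for (t in 1:n_plays) {
  if (any(plays == 0)) {
    # play every machine once before using the formula
    i <- which(plays == 0)[1]
  } else {
    # pick the machine with the highest upper confidence bound:
    # empirical mean plus an exploration bonus
    ucb <- wins / plays + sqrt(2 * log(t) / plays)
    i <- which.max(ucb)
  }
  wins[i]  <- wins[i] + rbinom(1, 1, true_win_prob[i])
  plays[i] <- plays[i] + 1
}

plays  # the best machine ends up played far more often than the others
```

Note how the bonus term shrinks as a machine is played more often, so poorly-explored machines keep getting occasional tries while the best one dominates the play counts.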

**Real world application example**

MAB algorithms have multiple practical applications in the real world, for example price engine optimization or finding the best online campaign. Let's focus on the first one and see how we can implement it in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on your expert knowledge and experience: 99$, 100$, 115$ and 120$. Now you want to test how those prices perform and decide which one to choose. During the first day of your experiment, 4000 people visited your shop while the first price (99$) was tested and 368 of them bought the product. For the rest of the prices we have the following outcome:

- `100$`: 4060 visits and 355 purchases,
- `115$`: 4011 visits and 373 purchases,
- `120$`: 4007 visits and 230 purchases.

Now let’s look at the calculations in R and check which price was performing best during the first day of our experiment.

```r
library(bandit)

set.seed(123)
visits_day1   <- c(4000, 4060, 4011, 4007)
purchase_day1 <- c(368, 355, 373, 230)
prices        <- c(99, 100, 115, 120)

post_distribution <- sim_post(purchase_day1, visits_day1, ndraws = 10000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##     99    100    115    120
## 0.3960 0.0936 0.5104 0.0000
```

We calculated the Bayesian probability that each price performs best and can see that `115$` has the highest probability (about 0.51). On the other hand, `120$` seems a bit too expensive for the customers.
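For the curious, this calculation can be approximated in a few lines of base R. The sketch below assumes a uniform Beta(1, 1) prior on each conversion rate, a standard choice for this kind of conjugate model (check the `bandit` package documentation for its exact defaults):

```r
# Base-R sketch of the Bayesian comparison, assuming a Beta(1, 1) prior
# on each conversion rate (no bandit package needed).
set.seed(123)
visits    <- c(4000, 4060, 4011, 4007)
purchases <- c(368, 355, 373, 230)
prices    <- c(99, 100, 115, 120)
ndraws    <- 10000

# the posterior of each rate is Beta(1 + successes, 1 + failures)
draws <- sapply(seq_along(prices), function(i) {
  rbeta(ndraws, 1 + purchases[i], 1 + visits[i] - purchases[i])
})

# probability that each price has the highest conversion rate:
# the share of posterior draws in which it comes out on top
winner <- apply(draws, 1, which.max)
prob_best <- table(factor(winner, levels = seq_along(prices))) / ndraws
names(prob_best) <- prices
round(prob_best, 3)
```

The resulting probabilities differ slightly from run to run (it is a Monte Carlo estimate), but `115$` comes out as the most likely winner, matching the `prob_winner` output above.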

The experiment continues for a few more days.

Day 2 results:

```r
visits_day2   <- c(8030, 8060, 8027, 8037)
purchase_day2 <- c(769, 735, 786, 420)

post_distribution <- sim_post(purchase_day2, visits_day2, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##       99      100      115      120
## 0.308623 0.034632 0.656745 0.000000
```

After the second day, price `115$` still shows the best results, with `99$` and `100$` performing very similarly.

Using the `bandit` package we can also perform significance analysis, which is handy for an overall comparison of the proportions using `prop.test`.

```r
significance_analysis(purchase_day2, visits_day2)
##   successes totals estimated_proportion        lower      upper
## 1       769   8030           0.09576588 -0.004545319 0.01369494
## 2       735   8060           0.09119107  0.030860453 0.04700507
## 3       786   8027           0.09791952 -0.007119595 0.01142688
## 4       420   8037           0.05225831           NA         NA
##   significance rank best       p_best
## 1 3.322143e-01    2    1 3.086709e-01
## 2 1.437347e-21    3    1 2.340515e-06
## 3 6.637812e-01    1    1 6.564434e-01
## 4           NA    4    0 1.548068e-39
```

At this point we can see that `120$` is still performing badly, so we drop it from the experiment and continue with the remaining prices the next day. According to `p_best`, the chance that this alternative is the best one is negligible.
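As a quick classical cross-check without the `bandit` package, base R's `prop.test` compares all four conversion rates at once; a tiny sketch using the day 2 numbers:

```r
# Classical k-sample test of equal proportions on the day 2 data:
# are the four conversion rates plausibly the same?
visits_day2   <- c(8030, 8060, 8027, 8037)
purchase_day2 <- c(769, 735, 786, 420)

test_result <- prop.test(purchase_day2, visits_day2)
test_result$p.value  # far below 0.05: the rates clearly differ
```

The tiny p-value is driven mostly by the `120$` arm, whose conversion rate (about 5.2%) sits well below the roughly 9-10% of the other three prices.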

Day 3 results:

```r
visits_day3   <- c(15684, 15690, 15672, 8037)
purchase_day3 <- c(1433, 1440, 1495, 420)

post_distribution <- sim_post(purchase_day3, visits_day3, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning
##       99      100      115      120
## 0.087200 0.115522 0.797278 0.000000
```

```r
value_remaining <- value_remaining(purchase_day3, visits_day3)
potential_value <- quantile(value_remaining, 0.95)
potential_value
##        95%
## 0.02670002
```

The day 3 results lead us to conclude that `115$` will generate the highest conversion rate and revenue. We are still unsure about the exact conversion probability for the best price, `115$`, but whatever it is, one of the other prices might still beat it by as much as 2.67% (the 95% quantile of the value-remaining distribution).
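The value-remaining computation can likewise be sketched in base R. As before, this assumes Beta(1, 1) priors and uses the common definition of value remaining: the relative amount by which the best-looking arm might still be beaten, taken over posterior draws (see the `bandit` documentation for the package's exact formulation):

```r
# Base-R sketch of "value remaining", assuming Beta(1, 1) priors.
set.seed(123)
visits_day3   <- c(15684, 15690, 15672, 8037)
purchase_day3 <- c(1433, 1440, 1495, 420)
ndraws <- 100000

draws <- sapply(seq_along(visits_day3), function(i) {
  rbeta(ndraws, 1 + purchase_day3[i],
        1 + visits_day3[i] - purchase_day3[i])
})

# arm with the best observed conversion rate (115$ in our data)
best <- which.max(purchase_day3 / visits_day3)

# per draw: how much better than the current best could the top arm be,
# relative to the best arm's own rate?
vr_draws <- (apply(draws, 1, max) - draws[, best]) / draws[, best]
quantile(vr_draws, 0.95)
```

In most draws the best-looking arm is itself the maximum, so the value remaining is 0; the 95% quantile summarizes the upside that the other prices might still have.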

The histograms below show what happens to the value-remaining distribution (the distribution of improvement amounts that another price might still have over the current best price) as the experiment continues. With a larger sample we become much more confident about the conversion rate, and over time the other prices have smaller and smaller chances of beating `115$`.

If you found this example interesting, check out our other post about dynamic pricing.

We are now ready to learn how the Monte Carlo Tree Search algorithm works: continue to the next page.