蒙地卡羅樹搜索法的必贏策略以及快速Nonogram解題程式的實作

No Thumbnail Available

Date

2019

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

DeepMind的AlphaZero展現了增強式學習即使在沒有人類知識的情況下也能表現出超越人類世界冠軍的棋力。然而AlphaZero所使用的蒙地卡羅樹搜索法無法根據遊戲理論值來評估盤面好壞。即使遊戲的結果已經被得知,蒙地卡羅樹仍會拜訪這個節點。在這篇論文中我們提出了Exact-win策略來對蒙地卡羅樹進行剪枝。Exact-win讓MCTS不再去處理已知遊戲理論值的節點,增加發現其他關鍵走步的機會。實驗結果顯示了我們的Exact-win方法在一些即死遊戲上顯著提升了原始MCTS的棋力,像是在井字遊戲和連四棋。在使用了Exact-win策略之後,Exact-win與原始版本的Leela Zero、ELF OpenGo和PhoenixGo對下了100盤後分別取得61、58和51場勝場。雖然DeepMind的AlphaZero仍未開源,但我們期待未來我們的方法也能用來加強AlphaZero。就我們所知,這是第一個可以直接加強AlphaZero的方法。 在本篇論文中我們也將揭露我們的Nonogram程式Requiem的實作方式,該程式在近幾次的比賽中都以十分顯著的時間差距贏得冠軍。Nonogram是一個單人的紙筆邏輯遊戲,玩家須根據每一行每一列的提示來對二維的方格填入顏色。我們改進了吳老師等人的方法,藉由自由度參數來減少maximal painting的計算開銷。並結合一個設計好的位元盤面表示法來配合BMI指令架構,在加速運算的同時減少記憶體的負載。我們的Nonogram程式正確地解開了2011年到2018年間的所有錦標賽的題目,並且比歷年的程式都來得快。
DeepMind’s AlphaZero demonstrated that reinforcement learning could reach the strength of over the human champions without human knowledge. However, the Monte-Carlo Tree Search (MCTS) used by AlphaZero cannot know the theoretical value of a board state. Even if the result of a node has been known, MCTS will still revisit this node. In this thesis, we propose the Exact-win strategy to prune the Monte-Carlo tree. Exact-win allows MCTS to no longer deal with the nodes that have the determined theoretical values and increases the chances of discovering other key moves. The experiments show that our Exact-win method substantially enhances the strength of the original MCTS in some sudden death games, such as the Tic-Tac-Toe and the Connect4. After employing the Exact-win strategy, Exact-win won 61, 58, and 51 of per 100 games against the original version of Leela Zero, ELF OpenGo, and PhoenixGo, respectively. DeepMind’s AlphaZero is currently not open source. If they make it open source, we expect that our approach can also promote AlphaZero. As far as we know, this is the first concept to enhance AlphaZero's approach with the concrete experiments performed on Go. Also, in this thesis, we reveal the implementation of our Nonogram program, named Requiem, which won all the games by a significant time gap in recent competitions. Nonogram is a pen and paper single-player logic game in which players paint each cell of a two-dimensional grid according to clues for specific rows and columns. We improve the method proposed by Wu et al. by adding a freedom parameter which can significantly reduce the total computation cost of maximal painting. Combining a well-designed bitboard with BMI instruction, we reduce memory loading while accelerating operations. Our Nonogram program solved all 1000 puzzles in every Nonogram Tournament from 2011 to 2018 faster than all nonogram programs.

Description

Keywords

增強式學習, 蒙地卡羅樹搜索法, AlphaZero, 必贏策略, Nonogram, 謎題, 動態規劃, 位元操作指令, Reinforcement learning, Monte-Carlo Tree Search, AlphaZero, Exact-win, Nonogram, Puzzle, Dynamic Programming, BMI

Citation

Collections

Endorsement

Review

Supplemented By

Referenced By