Improving the Training Effectiveness of AlphaZero in Connect6 Using KataGo Methods and Threat Space Search

Date

2023

Abstract

Since Google DeepMind introduced the AlphaZero algorithm, many computer game programs built on traditional search methods have been replaced by AlphaZero-based approaches. However, AlphaZero requires a very large amount of computing power to reach top-level play. We therefore aim to improve the training efficiency of AlphaZero for Connect6 through program performance improvements and the assistance of traditional methods, so that top-level play can be reached on a personal computer.

This thesis uses the open-source Alpha-Zero-General code as the basis for developing an AlphaZero Connect6 program. We modify the MCTS search following the approach of galvanise_zero, adapt the general-purpose Bitboard proposed by OOGiveMeFive for use in the Connect6 program, and draw on the Connect6 strength improvements proposed by CZF_Connect6 of Yang Ming Chiao Tung University.

This thesis improves the training efficiency of AlphaZero in three respects. The first is program performance: our analysis identifies MCTS as a performance bottleneck of Alpha-Zero-General, so we re-implement MCTS in C++ with parallelization, which greatly improves training efficiency. The second is neural network performance: we modify the network with the Global Pooling and Auxiliary Policy Targets methods proposed by KataGo and apply it to the Connect6 program. The third is training data quality, which we improve with KataGo's Forced Playout and Policy Target Pruning method and with traditional Threat Space Search. In addition, this thesis proposes a new training method that improves the training effect of adding heuristics to AlphaZero.

With C++, parallelization, and batch prediction, MCTS search reaches a speedup of 26.4, and with Bitboard, Threat Space Search reaches a speedup of 6.03.
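To illustrate why a Bitboard can speed up line-based searches such as Threat Space Search, the following is a minimal sketch of six-in-a-row detection with shifts and bitwise AND. It is not the general-purpose Bitboard of OOGiveMeFive nor the implementation used in this thesis; the layout (one padding bit per row), the constants, and the function names `place` and `has_six` are assumptions made for the example.

```python
# Hypothetical layout: a 19x19 board stored in one Python integer, with one
# always-empty padding bit at the end of each row so runs cannot wrap around.
SIZE = 19
STRIDE = SIZE + 1  # 19 playable columns + 1 padding column

# Bit offsets between neighbouring cells along the four line directions:
# horizontal, vertical, diagonal, anti-diagonal.
DIRECTIONS = (1, STRIDE, STRIDE + 1, STRIDE - 1)


def place(stones: int, row: int, col: int) -> int:
    """Return a copy of `stones` with the cell (row, col) occupied."""
    if not (0 <= row < SIZE and 0 <= col < SIZE):
        raise ValueError("cell is outside the board")
    return stones | (1 << (row * STRIDE + col))


def has_six(stones: int) -> bool:
    """Check whether one player's stones contain six or more in a line."""
    for step in DIRECTIONS:
        run = stones
        # After i iterations, a set bit marks the lowest-indexed stone of a
        # run of at least i + 1 stones along this direction.
        for _ in range(5):
            run &= run >> step
        if run:
            return True
    return False
```

Each direction is checked for the whole board with five shift-and-AND steps, instead of scanning every cell and every line individually. A real threat detector would need additional patterns (for example, open fours and fives), which this sketch does not cover.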
In short-term training with equal training time, the original AlphaZero method completes more iterations, yet the KataGo method still achieves a 57.58% win rate against it. The KataGo-TSS Hybrids method, trained for the same amount of time, achieves a 70% win rate against the original AlphaZero method. After training for 500 iterations, all three methods achieve a win rate of more than 65% against NCTU6_Level3.
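For readers unfamiliar with the Forced Playout and Policy Target Pruning idea from KataGo that underlies the KataGo method compared above, the sketch below shows the general shape of the computation at the root node. It is a simplified illustration, not the code of this thesis: child values are held fixed during pruning, visit counts may become fractional, the value `c_puct=1.1` is an assumed placeholder, and the function names are hypothetical. The forced-playout count follows the KataGo formula sqrt(k · P(c) · total visits) with k = 2.

```python
import math


def forced_playouts(prior: float, total_child_visits: float, k: float = 2.0) -> float:
    """Minimum playouts a root child should receive under forced playouts."""
    return math.sqrt(k * prior * total_child_visits)


def prune_policy_target(visits, priors, values, c_puct=1.1, k=2.0):
    """Return a policy target with forced playouts subtracted where possible.

    visits, priors, values: per-child lists at the root (illustrative inputs).
    """
    total = sum(visits)
    best = max(range(len(visits)), key=lambda i: visits[i])
    explore = c_puct * math.sqrt(total)
    best_puct = values[best] + explore * priors[best] / (1 + visits[best])

    pruned = []
    for i, n in enumerate(visits):
        if i == best or n == 0:
            pruned.append(n)
            continue
        gap = best_puct - values[i]
        if gap <= 0:
            # The child's value alone already matches the best child's PUCT
            # score, so no playouts can be removed.
            pruned.append(n)
            continue
        # Visit count at which this child's PUCT score would reach best_puct.
        min_keep = explore * priors[i] / gap - 1
        kept = min(n, max(n - forced_playouts(priors[i], total, k), min_keep))
        # Children reduced to a single playout or fewer are pruned outright.
        pruned.append(0 if kept <= 1 else kept)

    norm = sum(pruned)
    return [v / norm for v in pruned]
```

The intent is that forced playouts make the search explore low-visit moves during self-play, while pruning keeps those exploratory visits from distorting the policy target the network is trained on.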

Keywords

Computer Games, Reinforcement Learning, Connect6, AlphaZero, KataGo, Parallelization
