AlphaGo Zero
In 2016, AlphaGo defeated world champion Lee Sedol 4–1; in 2017, AlphaGo Zero trained from pure self‑play without human data or handcrafted priors.
Milestones
- 2016.03: AlphaGo defeated Lee Sedol, showcasing the power of deep learning + search.
- 2017.10: AlphaGo Zero removed human data and trained entirely via self‑play, achieving superior performance.
- Methodological insight: treat the search‑induced action distribution as a “teacher signal,” and perform end‑to‑end optimization with an integrated policy–value network.
Core Architecture: Policy–Value Network
The neural architecture used in this project is illustrated below (see my GitHub for details):

- Input: multi‑plane encodings of board states (current/history).
- Shared trunk: convolutional layers capturing local and global structural features.
- Dual heads: policy probability distribution and value (win‑rate) estimate.
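For concreteness, the sketch below shows what such a dual-head network can look like in PyTorch; the board size, number of input planes, and layer widths are illustrative assumptions, not the exact configuration from my repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk feeding a policy head and a value head."""
    def __init__(self, board_size=15, in_planes=4):
        super().__init__()
        self.board_size = board_size
        # Shared trunk: extracts local and global board features.
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Policy head: one logit per board point.
        self.policy_conv = nn.Conv2d(128, 4, kernel_size=1)
        self.policy_fc = nn.Linear(4 * board_size * board_size, board_size * board_size)
        # Value head: a single scalar in [-1, 1].
        self.value_conv = nn.Conv2d(128, 2, kernel_size=1)
        self.value_fc1 = nn.Linear(2 * board_size * board_size, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):
        h = self.trunk(x)
        p = F.relu(self.policy_conv(h)).flatten(1)
        log_pi = F.log_softmax(self.policy_fc(p), dim=1)           # log-probabilities over moves
        v = F.relu(self.value_conv(h)).flatten(1)
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(v))))  # win-rate estimate
        return log_pi, v
```

The key point is the shared trunk feeding two heads, so the policy and value estimates are optimized jointly end to end.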
Search: MCTS (Monte Carlo Tree Search)

- Iterate the cycle of selection → expansion → evaluation → backup, using PUCT/UCT to balance exploration and exploitation. Multiple simulations yield action visit counts \(N(a)\), which are normalized to a policy distribution \(\pi\).
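A minimal sketch of the PUCT selection step over these statistics; the node fields and the `c_puct` constant are illustrative choices, not code taken verbatim from the repository.

```python
import math

class Node:
    """One tree node per board state; children are indexed by action."""
    def __init__(self, prior):
        self.prior = prior      # P(s, a) from the policy head
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # W(s, a)
        self.children = {}      # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=5.0):
    """Pick the child maximizing Q + U, balancing exploitation and exploration."""
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        score = child.q() + u
        if score > best_score:
            best_action, best_score = action, score
    return best_action, node.children[best_action]
```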
Training Loop (Engineering View)

- Self‑play: current network + MCTS generate games.
- Samples: extract \((s, \pi, z)\) tuples.
- Forward: output \((\log \pi_\theta, \hat v_\theta)\).
- Loss (abbrev.):
  \[
  L = -\pi^\top \log \pi_\theta + (z - \hat v_\theta)^2 + \lambda \lVert\theta\rVert^2
  \]
  where the first term matches the search distribution (policy loss), the second is the value error, and the third is L2 regularization.
- Optimize and iterate; periodically evaluate and save checkpoints.
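Putting the loss into code, here is a minimal sketch of one training step, assuming the PyTorch network sketched earlier and a standard optimizer; the hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, states, target_pi, target_z, l2_weight=1e-4):
    """One optimization step on a batch of self-play samples (s, pi, z)."""
    log_pi, v = net(states)                                    # forward pass
    policy_loss = -(target_pi * log_pi).sum(dim=1).mean()      # match the search distribution
    value_loss = F.mse_loss(v.squeeze(-1), target_z)           # (z - v)^2
    l2 = sum((p ** 2).sum() for p in net.parameters())         # ||theta||^2
    loss = policy_loss + value_loss + l2_weight * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the L2 term is usually delegated to the optimizer's weight_decay; it is written out explicitly here only to mirror the formula above.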
Origin
In 2016, AlphaGo defeated world champion Lee Sedol 4–1, and in 2017 the upgraded AlphaGo Zero was introduced. At the time I was in middle school and enjoyed playing Go; the news was astonishing and sparked my interest in artificial intelligence. After learning the fundamentals of neural networks in college, I began reproducing AlphaGo and other classic neural models and applying them in practical settings.
Of course, reproducing AlphaGo did not happen out of thin air: I surveyed a substantial body of literature, took notes, and wrote the implementation, which was highly educational. Because the details are numerous and space is limited, this article gives only a brief overview; I will elaborate in subsequent posts.
First, the training process (the machine-learning workflow) comprises several stages: initializing the policy–value network, self-play, and sample generation. Monte Carlo Tree Search (MCTS) expands a search tree over board states and runs multiple simulations. In each simulation, MCTS preferentially expands the more promising actions and evaluates their continuations until the game ends or a specified search depth is reached. By aggregating many simulations, MCTS compiles statistics over actions and produces training data: board states, the simulation-derived policy distribution, and the estimated outcome. These serve as supervision for the neural network.
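To make this concrete, here is a sketch of how root visit counts and the final result become training samples; `visits_to_pi`, `label_game`, and the temperature handling are illustrative names and simplifications, not my exact code.

```python
import numpy as np

def visits_to_pi(visit_counts, temperature=1.0):
    """Normalize root visit counts N(a) into a policy distribution pi."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature == 0:                      # greedy: all mass on the most-visited move
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / temperature)    # pi(a) proportional to N(a)^(1/T)
    return scaled / scaled.sum()

def label_game(states, pis, players, winner):
    """After the game ends, attach the outcome z from each mover's perspective."""
    samples = []
    for s, pi, player in zip(states, pis, players):
        z = 0.0 if winner is None else (1.0 if winner == player else -1.0)
        samples.append((s, pi, z))
    return samples
```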
During training, the policy–value network is continuously optimized. The policy head predicts the move‑probability distribution under the current board state; the value head estimates position strength (win rate). Both are trained on self‑play samples generated by MCTS, progressively improving accuracy. I will discuss algorithmic details and several engineering adjustments in future articles.
Finally, from the perspective of reinforcement-learning interaction, the agent interacts with the environment: it receives a state, outputs an action, and the environment returns the next state and a reward. The game outcome serves as the supervision signal, while the MCTS policy distribution acts as a teacher signal that guides policy learning.
For visualization and verification, I built a Gomoku interface using pygame and set up a local server via sockets to schedule matches, decoupling the UI from the training process and enabling direct observation of policy improvement. I will share implementation details and code snippets in later posts.
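As a flavor of that decoupling, here is a minimal sketch of a move server over a local TCP socket; the newline-delimited JSON protocol, the port, and the `get_move` callback are hypothetical stand-ins for the actual setup (the pygame client would send the current board state and draw the returned move).

```python
import json
import socket

def serve_moves(get_move, host="127.0.0.1", port=9999):
    """Blocking loop: read one JSON board state per line, reply with a JSON move.

    `get_move` is a placeholder callback (e.g. the current network + MCTS)
    mapping a board state to an action index.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        while True:
            conn, _ = server.accept()
            with conn, conn.makefile("rw") as stream:
                for line in stream:                      # one request per line
                    state = json.loads(line)
                    move = get_move(state)
                    stream.write(json.dumps({"move": move}) + "\n")
                    stream.flush()
```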