Human-level control through deep reinforcement learning

Why this mattered

Before this paper, reinforcement learning had produced strong results mainly in domains with hand-crafted features or low-dimensional, fully observed state, but it had not convincingly shown that a single agent could learn a broad range of complex behaviors directly from high-dimensional sensory input. Mnih et al. introduced the deep Q-network (DQN), which combines Q-learning with a convolutional neural network and uses experience replay and a periodically updated target network to stabilize training. Across 49 Atari 2600 games, the same algorithm, network architecture, and hyperparameters learned policies from raw pixels and game scores alone, reaching performance comparable to a professional human games tester and exceeding 75% of the human score on more than half of the games, without game-specific feature engineering. The paradigm shift was not merely higher scores; it was the demonstration that deep learning could turn perception and action into one trainable system.
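
The two stabilizers named above, experience replay and a target network, can be illustrated with a minimal sketch. This is not the paper's implementation: the names `ReplayBuffer` and `td_target`, the buffer capacity, and the default discount `gamma` are assumptions chosen for illustration, and the convolutional network is replaced by a plain list of Q-values that a target network would be assumed to produce.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions.

    Hypothetical helper for illustration; the oldest transitions are dropped
    once the capacity is reached.
    """

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a').

    next_q_values is assumed to come from the slowly updated target network,
    not from the network currently being trained. At episode end the target
    is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a training loop, the online network's estimate Q(s, a) would be regressed toward `td_target(...)` on each sampled minibatch, and the target network's weights would be copied from the online network only every fixed number of steps.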

What became newly possible was a practical route from raw observation to competent control in environments where hand-designed state representations were costly or unavailable. The paper helped reframe reinforcement learning as a scalable representation-learning problem: instead of giving the agent compact symbolic features, researchers could let neural networks learn task-relevant features jointly with value estimates. This made deep reinforcement learning a central research program and gave the field a benchmark-driven proof of concept that general-purpose agents could acquire diverse skills through trial and error.

Its influence is visible in later breakthroughs that extended the same basic ambition: learning powerful policies and value functions with deep networks. AlphaGo and AlphaZero used deep reinforcement learning together with search and self-play to exceed human expert performance in board games; later work in robotics, simulated control, and large-scale game agents built on the idea that learned representations could support sequential decision-making. The paper did not solve general intelligence, sample efficiency, or robustness, but it made the modern deep RL agenda concrete: agents could learn nontrivial control directly from rich perceptual input, using one broadly applicable neural architecture.

Sources