What Is RLHF? A Detailed Guide to Reinforcement Learning from Human Feedback (2024)
Introduction
Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that optimizes the performance of an AI agent by first training a 'reward model' on direct human feedback and then using that reward model to guide reinforcement learning.
Also known as reinforcement learning from human preferences, RLHF is particularly well-suited for tasks with complex, poorly defined, or difficult-to-specify objectives. For example, while it might be impractical or impossible to algorithmically define what constitutes a 'funny' joke in mathematical terms, it is easy for humans to evaluate jokes generated by a large language model (LLM). This human feedback can be refined into a reward function and used to improve the LLM's joke-writing ability.
OpenAI's Paul F. Christiano detailed early successful applications of RLHF in a 2017 paper, co-authored with other researchers from OpenAI and DeepMind, where it was used to train AI models for complex tasks like playing Atari games and simulated robot locomotion.¹ Building on these innovations, video games have continued to be a crucial testing ground for RLHF. In 2019, AI systems trained with RLHF, such as OpenAI Five and DeepMind's AlphaStar, defeated top human professional players in the far more complex games of Dota 2² and StarCraft II³, respectively.
Perhaps most significantly, the 2017 OpenAI work, combined with the Proximal Policy Optimization (PPO) algorithm that OpenAI introduced the same year for updating model weights, drastically reduced the cost of collecting and applying the necessary human feedback. This paved the way for the eventual integration of RLHF with the field of Natural Language Processing (NLP), helping to propel both LLMs and RLHF to the forefront of AI research.
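To illustrate the mechanics behind PPO, the sketch below (assuming PyTorch; the function and tensor names are hypothetical, not code from the cited papers) computes PPO's clipped surrogate loss, which discourages any single update from moving the new policy too far from the policy that collected the data.

# Illustrative sketch of PPO's clipped surrogate objective (assumes PyTorch;
# names and shapes are hypothetical, not code from the cited papers).
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_epsilon=0.2):
    # Probability ratio pi_new(a|s) / pi_old(a|s) for each sampled action.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    # PPO maximizes the smaller of the two terms; negating it gives a loss to minimize.
    return -torch.min(unclipped, clipped).mean()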
The first release of code detailing how to use RLHF on language models came from OpenAI in 2019⁴, followed by the release of InstructGPT, trained with RLHF, in early 2022.⁵ This was a critical step in bridging the gap between GPT-3 and the GPT-3.5 models that powered the launch of ChatGPT.
Since then, RLHF has been used in the training of state-of-the-art LLMs by OpenAI, DeepMind, Google⁶, and Anthropic.⁷
Key Concepts in Reinforcement Learning
Conceptually, reinforcement learning (RL) aims to mimic how humans learn: an AI agent learns holistically through trial and error, with strong incentives for success.
To operationalize this strategy, the mathematical framework for reinforcement learning consists of the following components:
State Space
The state space is all available information about the task at hand that is relevant to the decisions the AI agent might make, including known and unknown variables. The state space typically changes with each decision the agent makes.
Action Space
The action space contains all possible decisions the AI agent can make. In the context of a board game, for example, the action space is discrete and well-defined, consisting of all legal moves available to the AI player at a given moment. In the context of text generation, the action space is vast, encompassing the entire 'vocabulary' of tokens available to the LLM.
Reward Function
The reward is a metric that measures success or progress, providing incentive to the AI agent. In cases like board games, defining success (winning the game, in this case) is objective and straightforward. However, designing an effective reward function can be a significant challenge when the definition of 'success' is ambiguous. In the mathematical framework, this feedback must be converted into a reward signal—a scalar quantification of positive (or negative) feedback.
Constraints
The reward function can be supplemented with penalties (negative rewards) for actions considered counterproductive to the task at hand. For example, a business might want to prohibit a chatbot from using profanity or other crude language, or an autonomous driving model might receive a penalty for colliding or veering out of its lane.
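To make these components concrete, here is a toy, hypothetical Python sketch of an environment with a simple state, a discrete action space, a reward for success, and a penalty for a prohibited outcome; it illustrates the framework above rather than any real RL system.

# Toy illustration of state space, action space, reward, and penalty (hypothetical example).
import random

class LineWorld:
    """A one-dimensional grid: the agent starts at 0 and is rewarded for reaching the goal."""
    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0                       # the state, which changes with each decision

    def action_space(self):
        return ["left", "right"]                # all legal decisions available to the agent

    def step(self, action):
        self.position += 1 if action == "right" else -1
        if self.position == self.goal:
            return self.position, +1.0, True    # reward: the task is accomplished
        if self.position < 0:
            return self.position, -1.0, True    # penalty: the agent left the allowed region
        return self.position, 0.0, False        # no feedback yet; keep acting

# A random policy interacting with the environment:
env = LineWorld()
done = False
while not done:
    state, reward, done = env.step(random.choice(env.action_space()))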
Policy
The policy is essentially the strategy or 'thought process' that drives the AI agent's behavior. Described in general mathematical terms, a policy ('π') is a function that takes a state ('s') as input and returns an action ('a'): π(s)→a.
The goal of an RL algorithm is to optimize the policy to achieve maximum reward. In deep reinforcement learning, the policy is represented by a neural network that is continuously updated during the learning process according to the reward function. The AI agent learns from experience, much like a human.
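As a minimal sketch of what a neural-network policy π(s)→a can look like (assuming PyTorch; the layer sizes and names are illustrative), the model below maps a state vector to a probability distribution over a discrete action space and samples an action from it.

# Minimal sketch of a neural-network policy pi(s) -> a (assumes PyTorch; sizes are illustrative).
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim=8, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),      # one logit per possible action
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)   # the distribution pi(a | s)

policy = PolicyNetwork()
state = torch.randn(1, 8)            # a stand-in state vector
action = policy(state).sample()      # a ~ pi(s): the action the agent takes

A reinforcement learning algorithm would then adjust the network's weights so that actions leading to higher reward become more probable.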
While traditional RL has achieved impressive real-world results in many domains, it can struggle to build effective reward functions for complex tasks where it is difficult to articulate a clear definition of success. A key advantage of RLHF is its ability to capture nuance and subjectivity by using positive human feedback instead of a formally defined objective.
RLHF for Enhancing Large Language Models
One of the most prominent applications of RLHF has been improving the relevance, accuracy, and ethical alignment of LLMs, especially for use as chatbots.
Like all generative AI models, LLMs aim to replicate the probability distribution of their training data. While recent advancements have led to LLMs being used as the engine for chatbots or as reasoning engines for general-purpose AI, these language models are used simply to predict the next word in a given sequence initiated by a prompt, using patterns learned from their training data. At a fundamental level, these models are not actually responding to the prompt; they are appending text to it.
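The sketch below illustrates this 'appending text to the prompt' behavior using the Hugging Face transformers library (an assumed tooling choice; "gpt2" is just an example model): at each step the model scores every token in its vocabulary, and the most likely one is appended to the sequence.

# Sketch of greedy next-token prediction (assumes Hugging Face transformers; "gpt2" is an example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Tell me how to write a resume.", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                                   # generate 20 tokens, one at a time
        logits = model(input_ids).logits                  # a score for every token in the vocabulary
        next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)   # append the prediction to the prompt

print(tokenizer.decode(input_ids[0]))                     # the prompt plus whatever text "completes" it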
Without highly specific instructions, language models have little capacity to understand user intent. Prompt engineering can help provide the necessary context for an LLM to respond in line with a user's needs, but requiring prompt engineering for every conversation with a chatbot is unrealistic.
Furthermore, while off-the-shelf LLMs have been trained in conventional ways to produce grammatically coherent outputs, training an LLM to produce 'good' outputs is a more elusive problem. Concepts like truthfulness, helpfulness, creativity, or even making a snippet of code executable are far more context-dependent than word meaning and linguistic structure.
To create better language models for communication with humans, data scientists turned to reinforcement learning from human feedback. The InstructGPT model, enhanced with RLHF, performed significantly better than its predecessor GPT-3, particularly in following instructions, maintaining factual accuracy, and reducing model hallucinations.⁵ Similarly, research published by OpenAI alongside the GPT-4 launch indicated that RLHF doubled accuracy on adversarial prompts.⁸
An advantage of RLHF is that it can substitute for the value of larger training datasets, enabling the development of more data-efficient models. OpenAI noted that its labelers preferred outputs from the 1.3B parameter version of InstructGPT over outputs from the 175B parameter version of GPT-3.⁵
The RLHF Training Process for LLMs
The process of training an LLM with RLHF typically progresses through four stages:
1. Pre-trained Model
RLHF is generally used not as an end-to-end learning method but to fine-tune and optimize a pre-trained model. For example, InstructGPT used RLHF to improve upon an existing GPT, or Generative Pre-trained Transformer, model. In its announcement of InstructGPT, OpenAI stated that the process could be understood as "unlocking capabilities that GPT-3 already had, but which were difficult to elicit through prompt engineering alone."⁵
Pre-training remains the most resource-intensive phase of the overall process. OpenAI noted that the RLHF training process for InstructGPT required less than 2% of the compute and data needed for GPT-3's pre-training.
2. Supervised Fine-Tuning (SFT)
Before beginning explicit reinforcement learning, supervised fine-tuning (SFT) is used to shape the model to generate responses in the format a user expects.
As mentioned earlier, the LLM pre-training process optimizes the model to complete a sequence by predicting the next word from the user's prompt, replicating the language patterns learned during model pre-training. Sometimes, an LLM may not complete the sequence in the way the user desires. For example, if a user requests, "Tell me how to write a resume," the LLM might respond with, "Use Microsoft Word." While a valid way to complete the sentence, it does not align with the user's goal.
Thus, SFT uses supervised learning to train the model to respond appropriately to various kinds of prompts. Human experts create labeled examples in the format (prompt, response) to demonstrate how to respond to prompts for various use cases, such as question answering, summarization, or translation.
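A minimal sketch of this stage, assuming the Hugging Face transformers library and placeholder demonstration data, looks roughly like the following: each (prompt, response) pair is concatenated, and the model is trained with a standard causal language-modeling loss to continue the prompt with the demonstrated response.

# Minimal sketch of supervised fine-tuning on (prompt, response) demonstrations
# (assumes Hugging Face transformers; the base model and examples are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [
    ("Tell me how to write a resume.", "Start with a short summary of your experience, then list..."),
    ("Summarize this paragraph: ...", "The paragraph argues that..."),
]

model.train()
for prompt, response in demonstrations:
    batch = tokenizer(prompt + "\n" + response, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss    # causal-LM loss over the concatenation
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

In practice, the loss is usually computed only on the response tokens and training is batched over large datasets; this sketch omits those details.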
This demonstration data is powerful but time-consuming and costly to produce. Rather than creating new custom examples from scratch, DeepMind introduced an approach that applied filtering heuristics to isolate prompt/response example pairs already written in a general conversational, 'interview transcript' style within its MassiveWeb dataset.⁹
3. Reward Model Training
For human feedback to power the reward function in reinforcement learning, a reward model is needed that can translate human preferences into a quantifiable reward signal. Designing an effective reward model is a critical step in RLHF because no simple mathematical or logical formula exists that can capture subjective human values.
The primary goal of this stage is to sufficiently supply the reward model with training data consisting of direct feedback from human evaluators, so that it learns to mimic how human preferences would assign rewards to different kinds of model responses. This allows training to continue offline without ongoing human involvement.
The reward model must take a piece of text as input and output a scalar reward value. This value numerically predicts the amount of reward (or penalty) a human user would assign to that text. The output being a scalar is essential for integrating the reward model's output with other components of the RL algorithm.
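As a sketch of what 'text in, scalar out' can look like (assuming a Hugging Face base model; the head and pooling choice are illustrative), the reward model below runs a pre-trained transformer over the text and passes the final token's hidden state through a linear head that outputs a single number.

# Sketch of a reward model: a piece of text in, one scalar reward out
# (assumes a Hugging Face base model; the head and pooling choice are illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name="gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)   # maps text features to one number

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        last_token = hidden[:, -1, :]                        # representation of the final token
        return self.value_head(last_token).squeeze(-1)       # one scalar reward per input sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
reward_model = RewardModel()
batch = tokenizer("A candidate model response to score...", return_tensors="pt")
reward = reward_model(batch["input_ids"], batch["attention_mask"])    # a single scalar value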
While it might seem most intuitive to simply have human raters express their opinion on each model response in a scalar format—such as rating a response on a scale from 1 (worst) to 10 (best)—it is enormously difficult to get all human raters to agree on the relative worth of a given score, let alone on what constitutes a 'good' or 'bad' response in a vacuum. For this reason, applying scalar ratings directly can be noisy and difficult to calibrate.
Instead, the evaluation system is typically constructed by comparing human feedback on different model outputs. A common method is to have users compare two similar text sequences (e.g., outputs from two different language models responding to the same prompt) in a head-to-head matchup, then aggregate a relative ranking for each piece of generated text using a system like the Elo rating system. In a simpler system, users might give a 'thumbs up' or 'thumbs down' to each output, and outputs are ranked based on relative preference. In more complex systems, labelers may rank several responses to the same prompt at once, providing richer comparison data for the reward model.
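As a generic illustration of the Elo-style aggregation mentioned above (a sketch, not any lab's actual pipeline), each head-to-head comparison nudges the ratings of the two responses involved, and the final ratings induce a relative ranking that can be normalized into a reward signal.

# Generic sketch of Elo-style aggregation of head-to-head human comparisons.
def elo_update(winner_rating, loser_rating, k=32):
    """Shift both ratings toward the observed outcome of one comparison."""
    expected_win = 1 / (1 + 10 ** ((loser_rating - winner_rating) / 400))
    winner_rating += k * (1 - expected_win)
    loser_rating -= k * (1 - expected_win)
    return winner_rating, loser_rating

# Every response starts at the same rating; each human judgment updates one pair.
ratings = {"response_a": 1000.0, "response_b": 1000.0, "response_c": 1000.0}
comparisons = [("response_a", "response_b"), ("response_c", "response_a")]   # (winner, loser) pairs
for winner, loser in comparisons:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

ranking = sorted(ratings, key=ratings.get, reverse=True)    # relative ranking of the outputs

In practice, many reward models skip explicit scores entirely and are trained directly on such pairwise comparisons, penalized whenever they rate the rejected response above the preferred one.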