
What Is RLHF? Reinforcement Learning from Human Feedback Explained | Geoz.com.cn

2026/2/8
AI Summary (BLUF)

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that optimizes an AI agent's performance by training a reward model from direct human feedback. It is particularly effective for tasks with complex, ill-defined, or difficult-to-specify objectives, such as improving the relevance, accuracy, and ethics of large language models (LLMs) in chatbot applications. RLHF typically proceeds in four phases: starting from a pre-trained model, then supervised fine-tuning, reward model training, and policy optimization, with Proximal Policy Optimization (PPO) as a key algorithm. While RLHF has delivered remarkable results in training AI agents for complex tasks from robotics to NLP, it faces limitations, including the high cost of human preference data, the subjectivity of human opinions, and the risks of overfitting and bias.

Introduction

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique that uses direct human feedback to train a "reward model," which is then used to optimize the performance of an AI agent.

RLHF, also known as reinforcement learning from human preferences, is particularly well-suited for tasks with complex, ill-defined, or difficult-to-specify goals. For instance, it is unrealistic (if not impossible) to define "funny" in algorithmic or mathematical terms, but it is easy for a human to evaluate a joke generated by a large language model (LLM). This human feedback can be distilled into a reward function and used to improve the LLM's joke-writing ability.

In a seminal 2017 paper, Paul F. Christiano of OpenAI, along with other researchers from OpenAI and DeepMind, discussed the success of RLHF in training AI models to perform complex tasks like playing Atari games and robotic locomotion simulations. Building on this breakthrough, video games have since remained a key testing ground for RLHF. By 2019, AI systems trained with RLHF were defeating top human professional players in far more complex games, such as OpenAI Five in Dota 2 and DeepMind's AlphaStar in StarCraft II.

Perhaps most importantly, the 2017 paper noted that its methodology significantly reduced the cost of collecting and distilling the required human feedback, since useful reward models could be learned from feedback on only a small fraction of the agent's interactions. Together with the Proximal Policy Optimization (PPO) algorithm for updating model weights, which OpenAI introduced in a separate paper that same year, this paved the way for the eventual integration of RLHF with the field of Natural Language Processing (NLP), and the resulting progress helped propel both LLMs and RLHF to the forefront of AI research.

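Because PPO is the workhorse of the policy-optimization phase discussed later, a minimal sketch of its clipped surrogate objective may be useful here. The function below is an illustrative Python/PyTorch sketch; the tensor names and the default clip range are assumptions, not OpenAI's implementation.

```python
import torch

def ppo_clipped_objective(log_probs_new: torch.Tensor,
                          log_probs_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from Schulman et al. (2017), to be maximized.

    All tensors hold per-action values for a batch of sampled (state, action) pairs.
    """
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped surrogate terms; taking the elementwise minimum keeps updates conservative.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()
```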
Code detailing the use of RLHF on language models was first released by OpenAI in 2019. This was followed in early 2022 by the release of InstructGPT, trained with RLHF: a key step in bridging the gap between GPT-3 and the GPT-3.5 models that powered the launch of ChatGPT.

Since then, RLHF has been used in training state-of-the-art LLMs from OpenAI, DeepMind, Google, and Anthropic.

Core Concepts of Reinforcement Learning

Conceptually, reinforcement learning (RL) aims to mimic how humans learn. An AI agent is motivated by a strong incentive for success and learns holistically through trial and error.

To operationalize this strategy, the mathematical framework of reinforcement learning consists of the following components:

State Space

The state space is all the information relevant to the task at hand that bears on the decisions an AI agent might make, including both known and unknown variables. The state space typically changes each time the agent makes a decision.

Action Space

The action space contains all possible decisions an AI agent can make. For example, in a board game, the action space is discrete and well-defined, consisting of all legal moves available to the AI player at a given moment. In text generation, the action space is vast, consisting of the entire "vocabulary" of tokens available to the LLM.

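To make this contrast concrete in the text-generation setting, here is a toy Python sketch in which the state is the token sequence produced so far and the action space is the model's entire vocabulary. The tiny vocabulary and step function are illustrative assumptions, not part of any real system.

```python
# State: the token sequence generated so far; action space: every token in the vocabulary.
vocabulary = ["<eos>", "the", "joke", "about", "robots", "was", "funny", "."]  # toy vocabulary

def action_space(state):
    # For an LLM, every vocabulary token is a legal action at every step,
    # unlike a board game, where legality depends on the current position.
    return vocabulary

def step(state, action):
    # Taking an action appends one token, producing the next state.
    return state + [action]

state = ["the", "joke"]
print(action_space(state))   # all tokens are available
print(step(state, "was"))    # ['the', 'joke', 'was']
```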
Reward Function

Reward is the measure of success or progress that motivates the AI agent. In cases like board games, defining success (winning the game, in this case) is objective and straightforward. However, when the definition of "success" is ambiguous, designing an effective reward function can be a significant challenge. In the mathematical framework, this feedback must be converted into a reward signal—a scalar quantification of positive (or negative) feedback.

Constraints

The reward function can be complemented by negative rewards—penalties for actions considered detrimental to the task at hand. For example, you might want to prohibit your company's chatbot from using profanity or other vulgar language. For an autonomous driving model, you could penalize collisions or lane departures.

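A minimal sketch of how a scalar reward signal and constraint penalties might be combined is shown below. The helper functions task_score and contains_profanity are hypothetical placeholders, not part of any real RLHF library.

```python
def task_score(response: str) -> float:
    # Hypothetical measure of task success (in RLHF this comes from a learned reward model).
    return 1.0 if "resume" in response.lower() else 0.0

def contains_profanity(response: str) -> bool:
    # Hypothetical constraint check; real systems use curated lists or classifiers.
    return any(word in response.lower() for word in ("badword1", "badword2"))

def reward(response: str) -> float:
    """Return a single scalar: positive for task success, negative penalty for violations."""
    score = task_score(response)
    if contains_profanity(response):
        score -= 5.0  # constraint expressed as a negative reward
    return score

print(reward("Here is how to structure your resume ..."))  # 1.0
```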
Policy

The policy is essentially the strategy or "thought process" that drives the AI agent's behavior. In plain mathematical terms, the policy (denoted as "π") is a function that takes a state ("s") as input and returns an action ("a"): π(s) → a.

The goal of an RL algorithm is to optimize the policy to produce the maximum reward. In deep reinforcement learning, the policy is represented as a neural network and is continuously updated during the learning process according to the reward function. The AI agent learns from experience, much like a human.

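As a minimal sketch, the policy π(s) → a can be represented as a small neural network that maps a state vector to a probability distribution over a discrete action space. The dimensions and architecture below are illustrative assumptions.

```python
import torch
from torch import nn

class Policy(nn.Module):
    """pi(s) -> a: takes a state vector, returns a distribution over actions."""

    def __init__(self, state_dim: int = 16, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into action probabilities.
        return torch.softmax(self.net(state), dim=-1)

policy = Policy()
state = torch.randn(1, 16)                                  # one observed state
action = torch.multinomial(policy(state), num_samples=1)    # sample an action from pi(s)
# During training, an RL algorithm (e.g., PPO) updates the network's weights
# so that actions leading to higher reward become more probable.
```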
While traditional RL has achieved remarkable results in many real-world domains, it can be difficult to construct an effective reward function for complex tasks where establishing a clear definition of success is challenging. The primary advantage of RLHF is its ability to capture nuance and subjectivity through positive human feedback, rather than a formally defined objective.

RLHF for Enhancing Large Language Models

One of the most prominent uses of RLHF is to enhance the relevance, accuracy, and ethics of LLMs, particularly for their use as chatbots.

Like all generative AI models, LLMs aim to replicate the probability distribution of their training data. Despite recent advances that have led to their use as chatbot engines or general-purpose AI reasoning engines, these language models are merely using patterns learned from their training data to predict the next word within a given sequence initiated by a prompt. At a fundamental level, such models are not actually responding to the prompt; they are appending text to it.

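A toy sketch of this completion behaviour: the model repeatedly predicts a next token and appends it to the prompt. The next_token function below is a hypothetical stand-in for a real language model's prediction step.

```python
def next_token(tokens):
    # Stand-in for an LLM's next-token prediction; a real model returns the most
    # probable continuation under the distribution learned from its training data.
    canned = {"Teach me how to make a resume": "Use"}
    return canned.get(" ".join(tokens), "<eos>")

def complete(prompt: str, max_new_tokens: int = 20) -> str:
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == "<eos>":
            break
        tokens.append(tok)   # the model only ever appends text to the prompt
    return " ".join(tokens)

print(complete("Teach me how to make a resume"))
```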
Without very specific instructions, language models have little ability to understand user intent. Prompt engineering helps provide the context needed for an LLM to respond in a way that aligns with user needs, but requiring prompt engineering for every interaction with a chatbot is impractical.

Furthermore, conventional training gives off-the-shelf LLMs the ability to produce grammatically coherent output, but training an LLM to produce "good" output is a far murkier problem. Concepts like truthfulness, helpfulness, and creativity, or even what makes a code snippet executable, are far more context-dependent than word meanings or sentence structure.

To make language models suitable for interaction with humans, data scientists turned to reinforcement learning from human feedback. The InstructGPT models, enhanced with RLHF, significantly outperformed their GPT-3 predecessors, particularly in following instructions, maintaining factual accuracy, and avoiding model hallucinations. Similarly, research released by OpenAI at the launch of GPT-4 reported that RLHF doubled accuracy on adversarial questions.

The benefits of RLHF can outweigh the value of larger training datasets, enabling more data-efficient model development: OpenAI's labelers preferred the output of the 1.3B-parameter version of InstructGPT to the output of the 175B-parameter version of GPT-3.

The RLHF Training Pipeline

Training an LLM using RLHF is typically conducted in four phases:

  1. Pre-trained Model
  2. Supervised Fine-Tuning
  3. Reward Model Training
  4. Policy Optimization

Phase 1: Pre-trained Model

RLHF is typically not used as an end-to-end training method but rather to fine-tune and optimize a pre-trained model. For example, InstructGPT used RLHF to enhance an existing GPT (Generative Pre-trained Transformer) model. In its release announcement, OpenAI stated, "One way to think about this process is 'unlocking' capabilities that GPT-3 already had, but that were difficult to elicit through prompt engineering alone." Pre-training remains the most resource-intensive phase within RLHF. OpenAI noted that the RLHF training process for InstructGPT required less than 2% of the compute and data needed for GPT-3's pre-training.

Phase 2: Supervised Fine-Tuning

Before explicit reinforcement learning begins, supervised fine-tuning (SFT) is used to prime the model to generate responses in the format users expect.

As noted earlier, the pre-training process optimizes an LLM for completion: predicting the next word in a sequence that starts from the user's prompt, replicating the language patterns learned during pre-training. Depending on how a request is phrased, the LLM may not complete the sequence in the way the user hopes. For example, given the prompt "Teach me how to make a resume," an LLM might respond with "Use Microsoft Word." The sentence is a valid completion, but it does not advance the user's goal.

SFT therefore uses supervised learning to train the model to respond appropriately to different kinds of prompts. Human experts create labeled examples in the format (prompt, response), demonstrating how to answer prompts for use cases such as question answering, summarization, and translation. This demonstration data is powerful but time-consuming and expensive to produce. Rather than authoring bespoke new examples, DeepMind introduced an alternative: applying a filtering heuristic for text in a general written dialogue format (such as interview transcripts) to isolate suitable prompt/response pairs from within the MassiveWeb dataset.

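A minimal sketch of the supervised fine-tuning objective on a (prompt, response) pair, assuming a causal language model trained with cross-entropy only on the response tokens. The toy model and token ids below are illustrative stand-ins, not a specific library's API.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Toy stand-ins: a tiny "language model" and integer token ids.
vocab_size, hidden = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

prompt_ids   = torch.tensor([5, 8, 13])       # "Teach me how ..."
response_ids = torch.tensor([42, 7, 19, 2])   # demonstration written by a human labeler

def sft_loss(prompt_ids, response_ids):
    # Concatenate prompt and demonstration; the model predicts each next token.
    input_ids = torch.cat([prompt_ids, response_ids])[:-1]
    targets   = torch.cat([prompt_ids, response_ids])[1:]
    logits = model(input_ids)
    # Mask out prompt positions so the loss only supervises the response tokens.
    mask = torch.arange(len(targets)) >= (len(prompt_ids) - 1)
    return F.cross_entropy(logits[mask], targets[mask])

loss = sft_loss(prompt_ids, response_ids)
loss.backward()   # gradients nudge the model toward the demonstrated response format
```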
Phase 3: Reward Model Training

For human feedback to enhance the reward function in reinforcement learning, a reward model is needed to translate human preferences into a numerical reward signal. Since there is no mathematical or logical formula that properly captures subjective human values, designing an effective reward model is a critical step in RLHF.

The main objective of this phase is to give the reward model enough training data, consisting of direct feedback from human evaluators, for it to learn to mimic how humans assign rewards to different kinds of model responses. Once trained, the reward model lets training continue offline, without a human in the loop. The reward model takes in a piece of text and outputs a scalar reward value that numerically predicts how strongly a human user would reward (or penalize) that text. A scalar output is essential for the reward model to be integrated with the other components of the RL algorithm.

It might seem most intuitive to simply ask human evaluators to express their opinion of each model response in scalar form, such as a rating from 1 (worst) to 10 (best). In practice, however, it is extremely difficult to calibrate ratings consistently across different people, and even individual evaluators apply shifting notions of what counts as a "good" or "bad" response, which makes direct scalar ratings noisy and hard to calibrate.

Instead, rating systems are typically built on comparisons of human feedback across different model outputs. A common method is to have users directly compare two similar text sequences, such as the outputs of two different language models responding to the same prompt, and then use a system like Elo ratings to produce an aggregated relative ranking of the generated texts. In simpler setups, users might "upvote" or "downvote" each output, with outputs ranked by relative preference. More complex setups might ask labelers to provide an overall rating and answer categorical questions about the flaws of each answer, then algorithmically aggregate this feedback into a weighted quality score. Whatever the ranking system, its results are ultimately normalized into a scalar reward signal that informs the training of the reward model.

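A minimal sketch of the pairwise-comparison objective commonly used to train reward models: the model assigns a scalar score to each of two responses to the same prompt, and the loss pushes the score of the human-preferred response above the other. The tiny scoring head and placeholder embeddings are illustrative assumptions; production reward models are typically initialized from a language model.

```python
import torch
from torch import nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a text representation to a single scalar reward."""

    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(text_embedding).squeeze(-1)

reward_model = RewardModel()

# Placeholder embeddings for two candidate responses to the same prompt,
# where human labelers preferred the first ("chosen") over the second ("rejected").
chosen_emb   = torch.randn(4, 32)   # batch of 4 preferred responses
rejected_emb = torch.randn(4, 32)   # batch of 4 dispreferred responses

r_chosen   = reward_model(chosen_emb)
r_rejected = reward_model(rejected_emb)

# Bradley-Terry style pairwise loss: -log sigmoid(r_chosen - r_rejected).
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()   # trains the model to give higher scalar rewards to preferred outputs
```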
