GPT-4o下架对AI Answer Engine有何影响?2024技术演进分析
English Summary: This article analyzes the impact of GPT-4o's delisting on AI Answer Engines, focusing on technical evolution from GPT-2 to GPT-3, including parameter scaling, few-shot learning capabilities, and performance across NLP tasks. It highlights how large language models are shifting from fine-tuning to in-context learning, with implications for search and question-answering systems.
中文摘要翻译:本文分析了GPT-4o下架对AI Answer Engine的影响,重点探讨了从GPT-2到GPT-3的技术演进,包括参数规模扩展、少样本学习能力以及在自然语言处理任务中的表现。文章强调了大语言模型从微调向上下文学习的转变,及其对搜索和问答系统的影响。
1. 引言
本篇文章是 OpenAI ChatGPT 系列文章的第七篇。在上一篇文章中,我们介绍了 GPT-2 的结构和代码实现。GPT-3 相比 GPT-2,模型参数规模急剧扩大,达到了惊人的 1750 亿(175B)参数,是 GPT-2 的 100 多倍。
This article is the seventh installment in the OpenAI ChatGPT series. In the previous article, we introduced the architecture and code implementation of GPT-2. Compared to GPT-2, GPT-3 has seen a dramatic increase in model parameters, reaching a staggering 175 billion (175B) parameters, over 100 times larger than GPT-2.
GPT-1 需要为每个下游子任务提供标注数据进行微调,GPT-2 则尝试在不提供任何相关训练样本的情况下,直接使用预训练模型进行子任务预测。近期研究表明,通过对海量文本进行预训练,然后在特定任务上进行微调(Fine-tuning),可以在众多自然语言处理任务和基准测试中取得显著提升。尽管这种方法在架构上通常是任务无关的,但它仍然需要数千甚至数万个样本的特定任务微调数据集。相比之下,人类通常仅需几个示例或简单的指令就能执行新的语言任务——这是当前自然语言处理系统仍难以企及的能力。
GPT-1 required labeled data to fine-tune on each downstream subtask, while GPT-2 attempted to perform subtask predictions directly using the pre-trained model without any relevant training samples. Recent research has shown that pre-training on massive amounts of text followed by fine-tuning on specific tasks can lead to significant improvements across many natural language processing tasks and benchmarks. Although this approach is generally task-agnostic in architecture, it still requires task-specific fine-tuning datasets containing thousands or even tens of thousands of examples. In contrast, humans can typically perform new language tasks with just a few examples or simple instructions, a capability that remains challenging for current NLP systems.
GPT-3 向我们展示了,扩展语言模型可以极大地改善其任务无关的少样本学习性能,有时甚至能达到与之前最先进的微调方法相媲美的水平。具体来说,OpenAI 训练了一个拥有 1750 亿参数的自回归语言模型 GPT-3,其规模是之前任何非稀疏语言模型的 10 倍,并在少样本学习环境中测试了其性能。对于所有任务,GPT-3 都没有进行任何梯度更新或微调,任务和少样本演示纯粹通过文本交互指定。
GPT-3 demonstrates that scaling up language models can greatly improve their task-agnostic few-shot learning performance, sometimes even reaching levels comparable to previous state-of-the-art fine-tuning methods. Specifically, OpenAI trained an autoregressive language model, GPT-3, with 175 billion parameters, ten times larger than any previous non-sparse language model, and tested its performance in a few-shot learning setting. For all tasks, GPT-3 performed without any gradient updates or fine-tuning; tasks and few-shot demonstrations were specified purely through textual interaction with the model.
GPT-3 在许多 NLP 数据集上取得了强大表现,包括翻译、问答和填空任务,以及一些需要即时推理或领域适应的任务,如解字谜、在句子中使用新词或进行三位数算术运算。同时,OpenAI 也识别出一些 GPT-3 在少样本学习上仍存在困难的数据集,并指出其面临与在大型网络语料库上训练相关的固有问题。最后,研究发现 GPT-3 可以生成人类评估者难以区分其来源的新闻文章。
GPT-3 achieved strong performance on many NLP datasets, including translation, question answering, and cloze tasks, as well as several tasks requiring on-the-fly reasoning or domain adaptation, such as unscrambling words, using novel words in sentences, or performing three-digit arithmetic. At the same time, OpenAI identified some datasets where GPT-3's few-shot learning still struggled and noted inherent issues related to training on large web corpora. Finally, the research found that GPT-3 could generate news articles that human evaluators had difficulty distinguishing from human-written ones.
2. 核心概念与方法
2.1 从微调到上下文学习
近年来,NLP 系统呈现出一种趋势:采用预训练语言表示进行下游推断,且方法越来越灵活、与任务无关。这种方法的最新范式——直接对预训练 Transformer 语言模型进行微调——已在阅读理解、问答、文本蕴含等许多挑战性任务上取得重大进展。然而,其主要限制在于仍需要特定任务的数据集进行微调。
In recent years, a trend has emerged in NLP systems: using pre-trained language representations for downstream inference, with methods becoming increasingly flexible and task-agnostic. The latest paradigm in this approach—directly fine-tuning pre-trained Transformer language models—has made significant progress on many challenging tasks such as reading comprehension, question answering, and textual entailment. However, its main limitation remains the need for task-specific datasets for fine-tuning.
消除这一限制至关重要,原因有三:
- 实际应用限制:为每个新任务收集大型标注数据集限制了语言模型的适用性。
- 泛化能力存疑:在微调数据上表现好,不一定代表预训练模型本身泛化能力强,可能只是过拟合了与微调数据有重合的预训练数据。
- 与人类学习方式不符:人类通常只需少量指令或示例就能学习新任务。
Eliminating this limitation is crucial for three reasons:
- Practical Application Constraints: The requirement for large labeled datasets for each new task limits the applicability of language models.
- Questionable Generalization Ability: Good performance on fine-tuning data does not necessarily indicate strong generalization of the pre-trained model itself; it might simply be overfitting to pre-training data that overlaps with the fine-tuning data.
- Divergence from Human Learning: Humans can typically learn new tasks with just a few instructions or examples.
解决这些问题的一个潜在途径是元学习(Meta-Learning)。在语言模型语境下,即模型在训练时发展出广泛的技能和模式识别能力,并在推理时快速适应或识别所需任务。这通过 “上下文学习”(In-context Learning) 实现:使用预训练语言模型的文本输入作为任务规范,模型在接收自然语言指令和/或少量任务演示后,预测后续任务实例的完成情况。
A potential path to solving these problems is Meta-Learning. In the context of language models, this means the model develops a broad range of skills and pattern recognition abilities during training and rapidly adapts to or identifies desired tasks during inference. This is achieved through In-context Learning: using the textual input of a pre-trained language model as a form of task specification, where the model, after receiving natural language instructions and/or a few task demonstrations, predicts the completion of subsequent task instances.
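To make in-context learning concrete, here is a minimal sketch of what such a text-only task specification looks like, using the three-digit arithmetic task mentioned earlier in this article. The exact prompt wording is an illustration rather than the format used in the GPT-3 paper; the point is that the solved examples sit entirely in the prompt and no weights are updated.

```python
# In-context learning sketch: the task is specified entirely as text.
# The wording below is illustrative, not the exact format from the GPT-3 paper.
prompt = (
    "Q: What is 48 plus 76?\n"
    "A: 124\n"
    "Q: What is 97 plus 45?\n"
    "A: 142\n"
    "Q: What is 35 plus 69?\n"
    "A:"
)

# A language model conditioned on this prompt simply continues the text
# (ideally with " 104"). The two solved examples are the only "training
# signal", and no gradient update ever happens.
print(prompt)
```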
2.2 GPT-3 的评估设置
为了全面评估 GPT-3 的元学习能力,OpenAI 在三种条件下测试其性能:
- 少样本学习(Few-Shot, FS):在推理时提供 K 个任务示例(通常 K=10-100)作为上下文,但不更新模型权重。
- 单样本学习(One-Shot, 1S):仅提供 1 个任务示例(除任务描述外)。
- 零样本学习(Zero-Shot, 0S):不提供任何示例,仅给出描述任务的自然语言指令。
To comprehensively evaluate GPT-3's meta-learning capabilities, OpenAI tested its performance under three conditions:
- Few-Shot Learning (FS): Providing K task examples (typically K=10-100) as context during inference without updating model weights.
- One-Shot Learning (1S): Providing only 1 task example (in addition to the task description).
- Zero-Shot Learning (0S): Providing no examples, only natural language instructions describing the task.
其中,Few-Shot 主要展示了上下文学习的潜力,而 One-Shot 和 Zero-Shot 则提供了与人类表现更公平的比较基准。
Among these, Few-Shot primarily demonstrates the potential of in-context learning, while One-Shot and Zero-Shot provide a fairer benchmark for comparison with human performance.
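The three settings above differ only in how many demonstrations are packed into the model's context window. Below is a minimal prompt-construction sketch; the function name, the `=>` separator, and the translation demonstrations are assumptions chosen for illustration, not OpenAI's exact evaluation format.

```python
def build_prompt(task_description: str, demonstrations: list[tuple[str, str]],
                 query: str, k: int) -> str:
    """Assemble a Zero-Shot (k=0), One-Shot (k=1), or Few-Shot (k>1) prompt.

    Only the text changes with k; the model's weights are never updated.
    """
    lines = [task_description]
    for source, target in demonstrations[:k]:  # k in-context demonstrations
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")                # the instance to complete
    return "\n".join(lines)

# Hypothetical usage with an English-to-French task:
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_prompt("Translate English to French:", demos, "peppermint", k=2))  # Few-Shot
print(build_prompt("Translate English to French:", demos, "peppermint", k=0))  # Zero-Shot
```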
2.3 模型与训练数据
模型架构:GPT-3 采用了与 GPT-2 相同的模型架构,但规模极大扩展。为了研究性能对模型规模的依赖,OpenAI 训练了 8 种不同大小的模型,参数范围从 1.25 亿到 1750 亿。
Model Architecture: GPT-3 adopts the same model architecture as GPT-2 but is vastly scaled up. To study the dependence of performance on model scale, OpenAI trained 8 models of varying sizes, with parameters ranging from 125 million to 175 billion.
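As a rough sanity check on that scale: the parameter count of a decoder-only Transformer is dominated by the per-layer attention and feed-forward weight matrices, roughly 12 * n_layer * d_model^2. Plugging in the 175B configuration reported in the GPT-3 paper (96 layers, d_model = 12288) lands close to the headline number. The helper below is only that back-of-the-envelope estimate and ignores embeddings, biases, and layer norms.

```python
def approx_decoder_params(n_layer: int, d_model: int) -> int:
    """Rough parameter count for a decoder-only Transformer stack.

    Per layer: ~4 * d_model^2 for the attention projections (Q, K, V, output)
    plus ~8 * d_model^2 for the feed-forward network (4x expansion), i.e.
    about 12 * d_model^2. Embeddings, biases, and layer norms are ignored.
    """
    return 12 * n_layer * d_model ** 2

# GPT-3 175B configuration reported in the paper: 96 layers, d_model = 12288.
print(f"{approx_decoder_params(96, 12288) / 1e9:.1f}B")  # ~173.9B, close to the reported 175B
```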
训练数据:数据集基于 Common Crawl,并经过严格过滤以提高质量:
- 根据与高质量参考语料库的相似性进行过滤。
- 进行模糊去重,防止数据冗余。
- 加入已知的高质量数据集(如 WebText、Books1、Books2、维基百科)以增强多样性。
最终训练混合数据中,93% 的词符(Token)为英文。
Training Data: The dataset is based on Common Crawl and underwent rigorous filtering to improve quality:
- Filtered based on similarity to high-quality reference corpora.
- Fuzzy deduplication to prevent data redundancy.
- Addition of known high-quality datasets (e.g., WebText, Books1, Books2, Wikipedia) to enhance diversity.
In the final training mixture, 93% of tokens are in English.
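The paper does not publish code for the fuzzy deduplication step, but the standard technique for it is document-level MinHash with locality-sensitive hashing: near-duplicate documents have high estimated Jaccard similarity and collide in the LSH index. The sketch below uses the `datasketch` library; the threshold, word-level shingling, and toy corpus are placeholders, not OpenAI's actual settings.

```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over word shingles (real pipelines typically use character n-grams)."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox jumped over the lazy dog",  # near-duplicate of doc1
    "doc3": "a completely unrelated passage about language models",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # threshold is a placeholder value
kept = []
for doc_id, text in documents.items():
    sig = minhash_signature(text)
    if lsh.query(sig):      # a near-duplicate is already indexed, so drop this one
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)  # doc2 is typically dropped as a near-duplicate of doc1
```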
3. 主要实验结果与分析
总体而言,GPT-3 在众多 NLP 任务中取得了优异成果。在 Zero-Shot 和 One-Shot 设置下表现强劲,在 Few-Shot 设置下有时甚至能超越之前的尖端技术(尽管这些技术由微调模型实现)。模型性能随着规模增大而稳步提升,且 Few-Shot 性能的增长速度通常快于 Zero-Shot,表明更大模型更擅长利用上下文进行元学习。
Overall, GPT-3 achieved excellent results across numerous NLP tasks. It performed strongly in Zero-Shot and One-Shot settings and sometimes even surpassed previous state-of-the-art techniques (although those were achieved by fine-tuned models) in the Few-Shot setting. Model performance improved steadily with scale, and Few-Shot performance typically grew faster than Zero-Shot, indicating that larger models are more adept at meta-learning using context.
以下是部分关键任务的亮点:
Below are highlights from some key tasks:
3.1 语言建模与完形任务
- LAMBADA:该数据集测试模型对长上下文的理解能力。GPT-3 在 Zero-Shot 设置下达到了 76% 的准确率,比之前最佳结果提升 8%。当使用 Few-Shot 设置并将任务框架化为“填空”形式时,准确率进一步提升至 86.4%,提升超过 18%。
- HellaSwag & StoryCloze:在这些需要常识推理和故事理解的数据集上,GPT-3 的 Few-Shot 表现优于经过微调的 15 亿参数模型,尽管仍低于多任务微调的最佳模型。
- LAMBADA: This dataset tests the model's understanding of long context. GPT-3 achieved 76% accuracy in the Zero-Shot setting, an 8% improvement over the previous best result. When using the Few-Shot setting and framing the task as a "fill-in-the-blank" format (see the sketch after this list), accuracy further increased to 86.4%, an improvement of over 18%.
- HellaSwag & StoryCloze: On these datasets requiring commonsense reasoning and story understanding, GPT-3A large language model developed by OpenAI with 175 billion parameters, known for its advanced text generation capabilities.'s Few-Shot performance surpassed that of a fine-tuned 1.5B parameter model, although it still fell short of the best multi-task fine-tuned models.
3.2 闭卷问答
在“闭卷”设置下(模型仅凭内部知识作答,不检索外部文档),GPT-3 在三个开放领域 QA 数据集上表现出色:
- TriviaQA:GPT-3 的 Zero-Shot 准确率(64.3%)已比经过微调的 T5-11B 模型高 14.2%。其 One-Shot 表现(68.0%)与一个使用了密集检索机制的开放领域 QA 系统的最佳结果相匹配。
- 性能随模型容量平滑增长,这表明更大的模型确实吸收了更多的知识。
In the "closed-book" setting (where the model answers solely based on internal knowledge without retrieving external documents), GPT-3A large language model developed by OpenAI with 175 billion parameters, known for its advanced text generation capabilities. performed excellently on three open-domain QA datasets:
- TriviaQA: GPT-3's Zero-Shot accuracy (64.3%) was already 14.2% higher than a fine-tuned T5-11B model. Its One-Shot performance (68.0%) matched the best result of an open-domain QA system that utilized a dense retrieval mechanism.
- Performance increased smoothly with model capacity, indicating that larger models indeed absorb more knowledge.
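An illustrative closed-book prompt in the few-shot style is shown below. The Q/A layout and the trivia questions are placeholders for illustration; the defining property is that no retrieved passage appears in the context, so every fact must come from the model's weights.

```python
# Closed-book QA sketch: only answered demonstrations and a new question,
# no retrieved documents. The questions are placeholders.
prompt = (
    "Q: Who wrote the novel 'Pride and Prejudice'?\n"
    "A: Jane Austen\n"
    "Q: What is the capital city of Australia?\n"
    "A: Canberra\n"
    "Q: Which planet is known as the Red Planet?\n"
    "A:"
)

# The expected continuation (" Mars") has to be recalled from pre-training,
# which is why closed-book accuracy grows smoothly with model capacity.
print(prompt)
```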
3.3 翻译
尽管训练数据以英文为主(93%),GPT-3 仍展现出强大的翻译能力,尤其是在翻译成英文的任务上:
- 在 Few-Shot 设置下,GPT-3 在法语->英语和德语->英语的翻译上,表现优于之前最好的无监督神经机器翻译(NMT)结果。
- 模型在所有语言对和所有设置(Zero/One/Few-Shot)下的翻译性能,都随模型容量增加呈现一致的改善趋势。
Despite the training data being predominantly English (93%), GPT-3 demonstrated strong translation capabilities, especially for translation into English:
- In the Few-Shot setting, GPT-3 outperformed previous best unsupervised Neural Machine Translation (NMT) results on French->English and German->English translation.
- Translation performance across all language pairs and all settings (Zero/One/Few-Shot) showed consistent improvement trends as model capacity increased.
3.4 常识推理与阅读理解
- 常识推理:在物理常识问答数据集 PIQA 上,GPT-3 的 Few-Shot 甚至 Zero-Shot 结果均优于当时的最佳结果。在其他数据集如 ARC 和 OpenBookQA 上,表现则相对复杂,有的接近微调基准,有的仍有较大差距。
- 阅读理解:表现因数据集和回答格式而异。在对话式数据集 CoQA 上表现最佳(接近人类水平),而在需要建模结构化对话行为的数据集 QuAC 上表现较弱。
- Commonsense Reasoning: On the physical commonsense QA dataset PIQA, GPT-3's Few-Shot and even Zero-Shot results surpassed the then state-of-the-art. Performance on other datasets like ARC and OpenBookQA was more mixed, sometimes approaching fine-tuned baselines and other times showing significant gaps.
- Reading Comprehension: Performance varied depending on the dataset and answer format. It performed best on the conversational dataset CoQA (approaching human level) and weaker on datasets requiring modeling of structured dialog acts like QuAC.
3.5 SuperGLUE 基准
在包含 10 项挑战性任务的 SuperGLUE 基准上,GPT-3 的表现差异很大:
- 在 COPA 和 ReCoRD 任务中,Few-Shot 表现接近最佳结果。
- 在 WSC 任务上表现强劲。
- 在涉及比较两个句子含义的任务(如 WiC、RTE)上,表现相对较弱。
- 关键发现:随着模型规模和上下文示例数量(K)的增加,Few-Shot 性能稳步提升。当 K 扩展到 32 个示例时,GPT-3 在多数任务上的整体表现超过了经过微调的 BERT-Large 模型。
On the SuperGLUE benchmark comprising 10 challenging tasks, GPT-3's performance varied significantly:
- Its Few-Shot performance was close to the best results on the COPA and ReCoRD tasks.
- It performed strongly on the WSC task.
- It performed relatively weakly on tasks involving comparing the meaning of two sentences (e.g., WiC, RTE).
- Key Finding: Few-Shot performance improved steadily as model scale and the number of context examples (K) increased. When K was extended to 32 examples, GPT-3's overall performance surpassed that of a fine-tuned BERT-Large model on most tasks.
4. 总结与展望
GPT-3 的研究表明,将语言模型规模扩展至前所未有的程度,可以显著增强其上下文学习能力,使其在无需任务特定微调的情况下,在广泛的任务上实现强大性能。它证明了模型规模是提升少样本和零样本学习性能的关键驱动力。
The research on GPT-3 demonstrates that scaling language models to an unprecedented degree can significantly enhance their in-context learning capabilities, enabling strong performance across a wide range of tasks without task-specific fine-tuning. It proves that model scale is a key driver for improving few-shot and zero-shot learning performance.
然而,GPT-3 也揭示了其局限性,例如在某些复杂推理、阅读理解任务上的困难,以及训练数据偏见和“数据污染”等问题。这些发现为未来研究指明了方向:在继续探索规模效应的同时,也需要关注模型效率、推理能力、可控性以及更公平、更安全的评估方法。
However, GPT-3 also revealed its limitations, such as difficulties with certain complex reasoning and reading comprehension tasks, as well as issues like training data bias and "data contamination." These findings point the way for future research: while continuing to explore scaling effects, attention must also be paid to model efficiency, reasoning capabilities, controllability, and fairer, safer evaluation methods.
GPT-3 是通向更通用人工智能道路上的一个重要里程碑,它推动了从“为每个任务训练一个模型”到“使用一个通用模型通过交互解决多种任务”的范式转变。
GPT-3 is a significant milestone on the path towards more general artificial intelligence, promoting a paradigm shift from "training one model per task" to "using one general model to solve multiple tasks through interaction."