Which Large Language Model Works Best: GPT, LLaMA, or PaLM? (With a Technical Architecture Comparison)
This article provides a comprehensive survey of Large Language Models (LLMs), covering their evolution from early neural models to modern architectures like GPT, LLaMA, and PaLM. It details the technical processes of building LLMs, including data cleaning, tokenization, and training strategies, and explores their applications, limitations, and enhancement techniques such as RAG and prompt engineering. The review also examines popular datasets, evaluation benchmarks, and future research directions, serving as a valuable resource for understanding the current state and potential of LLMs.
Abstract
Since the release of ChatGPT in November 2022, Large Language Models (LLMs) have garnered significant attention due to their powerful performance across a wide range of natural language tasks. As predicted by scaling laws, LLMs acquire general language understanding and generation capabilities by training models with tens to hundreds of billions of parameters on massive text corpora. Although the research field of LLMs is very new, it is rapidly evolving across many different dimensions. In this article, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), discussing their characteristics, contributions, and limitations. We also outline the techniques developed for building and enhancing LLMs. Subsequently, we survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the article by discussing open challenges and future research directions.
I. Introduction
Language modeling is a long-standing research topic dating back to the 1950s when Shannon applied information theory to human language, measuring how well simple n-gram language models predict or compress natural language text. Since then, statistical language modeling has served as the foundation for many natural language understanding and generation tasks, from speech recognition and machine translation to information retrieval.
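The n-gram models mentioned above estimate the probability of each word from counts over its preceding context. As a concrete illustration (not from the survey itself), a minimal maximum-likelihood bigram model can be sketched as follows; all function and variable names are illustrative:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count unigram and bigram frequencies from a list of token lists."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]  # sentence boundary markers
        for w in tokens:
            unigrams[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
print(bigram_prob(uni, bi, "the", "cat"))  # 0.5
```

In Shannon's compression view, a model assigning higher probability to the observed next word yields a shorter code for the text, which is why prediction quality and compression are two faces of the same measure.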
Transformer-based LLMs, pre-trained on web-scale text corpora, have dramatically expanded the capabilities of language models. For instance, OpenAI's ChatGPT and GPT-4 are not only used for natural language processing but also serve as general-purpose task solvers powering systems like Microsoft's Copilot. They can follow human instructions to accomplish complex new tasks and perform multi-step reasoning when needed. Consequently, LLMs are becoming fundamental building blocks for developing general-purpose AI agents or Artificial General Intelligence (AGI).
Given the rapid pace of development in the LLM field, with new discoveries, models, and techniques emerging within months or even weeks, AI researchers and practitioners often find it challenging to determine the best approach for building LLM-based AI systems for their tasks. This article provides a timely survey of the latest advancements in LLMs. We hope this review will serve as a valuable reference resource for students, researchers, and developers.
II. Large Language Models
In this section, we first review early pre-trained neural language models as they form the foundation for LLMs, then focus on three prominent LLM families: GPT, LLaMA, and PaLM. The table below provides an overview and characteristics of these models.
A. Early Pre-trained Neural Language Models
Language modeling using neural networks was pioneered by [38], [39], [40]. Bengio et al. developed one of the early neural language models that was comparable to n-gram models. Subsequently, [14] successfully applied neural language models to machine translation. Mikolov's RNNLM (an open-source neural language modeling toolkit) greatly facilitated the popularization of neural language models. Following this, neural language models based on Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) [19] and Gated Recurrent Units (GRU) [20], were widely used in many natural language applications, including machine translation, text generation, and text classification [43].
The invention of the Transformer architecture marked another milestone in the development of neural language models. By applying self-attention to compute "attention scores" for each word in a sentence or document in parallel, modeling the influence of each word on others, the Transformer allows for more parallelization than RNNs. This enables efficient pre-training of very large language models on GPUs. These Pre-trained Language Models (PLMs) can be fine-tuned for many downstream tasks.
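The parallel "attention score" computation described above reduces to scaled dot-product self-attention. The NumPy sketch below is a minimal illustration, not the survey's own code; it omits multi-head projections and causal masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights and outputs for all positions in parallel.

    Q, K, V: (seq_len, d) arrays of query, key, and value vectors.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # pairwise "attention scores", all at once
    # Numerically stable row-wise softmax turns scores into weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, embedding dimension 8
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

Because the score matrix for all token pairs is produced by a single matrix multiplication, the whole sequence is processed at once, which is exactly the parallelism advantage over the step-by-step recurrence of RNNs.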
Based on their neural architecture, we categorize early Transformer-based PLMs into three main classes: encoder-only, decoder-only, and encoder-decoder models. Comprehensive surveys of early PLMs are provided in [43], [28].
1) Encoder-Only PLMs
As the name suggests, encoder-only models consist solely of encoder networks. These models were initially developed for language understanding tasks, such as text classification, where the model needs to predict a category label for input text. Representative encoder-only models include BERT and its variants, such as RoBERTa, ALBERT, DeBERTa, XLM, XLNet, and UNILM, as described below.
2) Decoder-Only PLMs
Two of the most widely used PLMs are GPT-1 and GPT-2 developed by OpenAI. These models laid the groundwork for the subsequent, more powerful LLMs (i.e., GPT-3 and GPT-4).
3) Encoder-Decoder PLMs
In [52], Raffel et al. showed that almost all NLP tasks can be formulated as sequence-to-sequence generation tasks. Therefore, an encoder-decoder language model is a unified model by design, as it can perform all natural language understanding and generation tasks. Representative encoder-decoder PLMs we will review below are T5, mT5, MASS, and BART.
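Raffel et al.'s unified formulation can be made concrete with a small helper that rewrites heterogeneous tasks as (input text, target text) string pairs. The task prefixes below follow the style of T5's prompts, but the helper itself is a hypothetical sketch, not code from [52]:

```python
def to_text_to_text(task, **fields):
    """Cast heterogeneous NLP tasks into (input_text, target_text) pairs,
    in the spirit of T5's unified text-to-text framing.
    """
    if task == "translate":
        return (f"translate English to German: {fields['src']}", fields["tgt"])
    if task == "classify":
        return (f"sst2 sentence: {fields['text']}", fields["label"])
    if task == "summarize":
        return (f"summarize: {fields['document']}", fields["summary"])
    raise ValueError(f"unknown task: {task}")

inp, tgt = to_text_to_text("classify", text="a gripping film", label="positive")
print(inp)  # sst2 sentence: a gripping film
print(tgt)  # positive
```

Once every task is a string-to-string mapping, a single encoder-decoder model with a single training objective can serve classification, translation, and summarization alike, which is what makes the design "unified".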
B. Large Language Model Families
Large Language Models (LLMs) primarily refer to Transformer-based PLMs containing tens to hundreds of billions of parameters. Compared to the PLMs reviewed above, LLMs are not only much larger in model scale but also exhibit stronger language understanding and generation capabilities, along with emergent abilities not present in smaller-scale models. Next, we review three LLM families: GPT, LLaMA, and PaLM.
1) GPT Family
Generative Pre-trained Transformer (GPT) is a series of decoder-only Transformer-based language models developed by OpenAI, including GPT-1, GPT-2, GPT-3, InstructGPT, ChatGPT, GPT-4, CODEX, and WebGPT. While early GPT models like GPT-1 and GPT-2 were open-source, recent models like GPT-3 and GPT-4 are closed-source and accessible only via API. GPT-1 and GPT-2 models were discussed in the early PLM subsection. We start with GPT-3.
| Model | Release Date | Developer | Key Parameters | Key Features / Contributions | Open Source |
|---|---|---|---|---|---|
| GPT-3 | May 2020 | OpenAI | 175B | Demonstrated emergent in-context learning; first widely recognized LLM. | No |
| InstructGPT / ChatGPT | Jan 2022 / Nov 2022 | OpenAI | ~175B (GPT-3.5) | Aligned with human intent via RLHF; popularized conversational AI. | No |
| GPT-4 | Mar 2023 | OpenAI | Undisclosed (estimated >1T) | Multimodal (text & image input); state-of-the-art performance on professional/academic benchmarks. | No |
| CODEX | Jul 2021 | OpenAI | Based on GPT-3 | Fine-tuned for code generation; powers GitHub Copilot. | No |
| WebGPT | Dec 2021 | OpenAI | Based on GPT-3 | Fine-tuned to answer questions using a text-based web browser. | No |
2) LLaMA Family
LLaMA is a collection of foundation language models developed by Meta. Unlike GPT models, LLaMA models are open-source, meaning the model weights are released to the research community under a non-commercial license. Consequently, the LLaMA family has evolved rapidly, with many research teams using these models to develop better open-source LLMs to compete with closed-source ones or to develop task-specific LLMs for mission-critical applications.
The first set of LLaMA models was released in February 2023, with parameters ranging from 7B to 65B. These models were pre-trained on more than a trillion tokens collected from publicly available datasets. LLaMA uses the GPT-3 Transformer architecture with some minor architectural modifications, including (1) using the SwiGLU activation function instead of ReLU, (2) using rotary positional embeddings instead of absolute positional embeddings, and (3) using root mean square layer normalization (RMSNorm) instead of standard layer normalization. The open-source LLaMA-13B model outperforms GPT-3 (175B) on most benchmarks.
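Two of the modifications listed above, RMSNorm and SwiGLU, are compact enough to sketch directly. The NumPy below is an illustrative simplification under assumed shapes (single projection matrices, no bias terms), not Meta's implementation:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """Root-mean-square layer normalization: rescale by the RMS of each
    row, with a learned gain g; unlike LayerNorm, no mean-centering or bias.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * g

def swiglu(x, W, V):
    """SwiGLU activation: Swish(xW) gated elementwise by a linear branch xV."""
    a = x @ W
    swish = a / (1.0 + np.exp(-a))  # Swish(a) = a * sigmoid(a)
    return swish * (x @ V)

x = np.ones((2, 4))   # 2 tokens, hidden dimension 4
g = np.ones(4)
print(rms_norm(x, g))  # values ~ 1.0, since the RMS of an all-ones row is 1
```

Dropping the mean subtraction makes RMSNorm slightly cheaper than LayerNorm, and the gated SwiGLU feed-forward was reported (in the SwiGLU and LLaMA papers) to improve quality over plain ReLU at equal parameter budgets.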
| Model | Release Date | Key Parameters | Key Features / Contributions | Open Source |
|---|---|---|---|---|
| LLaMA (v1) | Feb 2023 | 7B, 13B, 33B, 65B | First major open-source LLM family; strong performance with efficient architecture. | Yes (Non-commercial) |
| LLaMA 2 | Jul 2023 | 7B, 13B, 70B | Improved pre-training data & safety fine-tuning; released for commercial use. | Yes (Commercial license) |