
What breakthroughs will large language models see in steering, memory, tool integration, and multimodality over the next 6-12 months?

2026/4/17
AI Summary (BLUF)

Leading AI researchers identify four key innovations—steering, memory, tool integration, and multimodality—that will transform LLM capabilities over the next 6-12 months, enabling more reliable, personalized, and actionable AI applications for both enterprise and consumer use cases.

Large language models (LLMs) have taken the tech industry by storm, powering experiences that can only be described as magical—from writing a week’s worth of code in seconds to generating conversations that feel even more empathetic than the ones we have with humans. Trained on trillions of tokens of data with clusters of thousands of GPUs, LLMs demonstrate remarkable natural language understanding and have transformed fields like copy and code, propelling us into the new and exciting generative era of AI. As with any emerging technology, generative AI has been met with some criticism. Though some of this criticism does reflect the current limits of LLMs’ capabilities, we see these roadblocks not as fundamental flaws in the technology, but as opportunities for further innovation.

To better understand the near-term technological breakthroughs for LLMs and prepare founders and operators for what’s around the bend, we spoke to some of the leading generative AI researchers who are actively building and training some of the largest and most cutting-edge models: Dario Amodei, CEO of Anthropic; Aidan Gomez, CEO of Cohere; Noam Shazeer, CEO of Character.AI; and Yoav Shoham of AI21 Labs. These conversations identified four key innovations on the horizon: steering, memory, “arms and legs,” and multimodality. In this piece, we discuss how these key innovations will evolve over the next 6 to 12 months and how founders curious about integrating AI into their own businesses might leverage these new advances.

Steering

Many founders are understandably wary of implementing LLMs in their products and workflows because of these models’ potential to hallucinate and reproduce bias. To address these concerns, several of the leading model companies are working on improved steering—a way to place better controls on LLM outputs—to focus model outputs and help models better understand and execute on complex user demands. Noam Shazeer draws a parallel between LLMs and children in this regard: “it’s a question of how to direct [the model] better… We have this problem with LLMs that we just need the right ways of telling them to do what we want. Small children are like this as well—they make things up sometimes and don’t have a firm grasp of fantasy versus reality.” Though there has been notable progress in steerability among the model providers as well as the emergence of tools like Guardrails and LMQL, researchers are continuing to make advancements, which we believe is key to better productizing LLMs among end users.
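
The validate-and-retry pattern behind guardrail tooling can be sketched in a few lines. Everything below is illustrative: `model_stub` stands in for a real LLM API call, and the retry hint is a hypothetical prompt convention, not the actual API of Guardrails or LMQL.

```python
def model_stub(prompt: str) -> str:
    # Stand-in for a real LLM API call; deliberately verbose on the first try.
    if "retry" in prompt:
        return "positive"
    return "The vibe is positive, I think."

def steer(prompt: str, allowed: set, max_retries: int = 2):
    """Constrain the model to one of `allowed` by validating and re-prompting."""
    out = model_stub(prompt)
    for _ in range(max_retries):
        norm = out.strip().lower().rstrip(".")
        if norm in allowed:
            return norm
        # Re-prompt with an explicit instruction -- a crude form of steering.
        out = model_stub(prompt + " (retry: answer with exactly one word)")
    norm = out.strip().lower().rstrip(".")
    return norm if norm in allowed else None
```

Real guardrail libraries go further, validating outputs against schemas or grammars rather than a fixed label set, but the validate-then-correct loop is the same.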

Improved steering becomes especially important in enterprise companies where the consequences of unpredictable behavior can be costly. Amodei notes that the unpredictability of LLMs “freaks people out” and, as an API provider, he wants to be able to “look a customer in the eye and say ‘no, the model will not do this,’ or at least does it rarely.” By refining LLM outputs, founders can have greater confidence that the model’s performance will align with customer demands. Improved steering will also pave the way for broader adoption in other industries with higher accuracy and reliability requirements, like advertising, where the stakes of ad placement are high. Amodei also sees use cases ranging from “legal use cases, medical use cases, storing financial information and managing financial bets, [to] where you need to preserve the company brand. You don’t want the tech you incorporate to be unpredictable or hard to predict or characterize.” With better steering, LLMs will also be able to do more complex tasks with less prompt engineering, as they will be able to better understand overall intent.

Advances in LLM steering also have the potential to unlock new possibilities in sensitive consumer applications where users expect tailored and accurate responses. While users might be willing to tolerate less accurate outputs from LLMs when engaging with them for conversational or creative purposes, users want more accurate outputs when using LLMs to assist them in daily tasks, advise them on major decisions, or augment professionals like life coaches, therapists, and doctors. Some have pointed out that LLMs are poised to unseat entrenched consumer applications like search, but we likely need better steering to improve model outputs and build user trust before this becomes a real possibility.

Key unlock: users can better tailor the outputs of LLMs.

Memory

Copywriting and ad-generating apps powered by LLMs have already seen great results, leading to quick uptake among marketers, advertisers, and scrappy entrepreneurs. Currently, however, most LLM outputs are relatively generalized, which makes it difficult to leverage them for use cases requiring personalization and contextual understanding. While prompt engineering and fine-tuning can offer some level of personalization, prompt engineering is less scalable and fine-tuning tends to be expensive, since it requires some degree of re-training and often partnering closely with mostly closed source LLMs. It’s often not feasible or desirable to fine-tune a model for every individual user.

In-context learning, where the LLM draws from the content your company has produced, your company’s specific jargon, and your specific context, is the holy grail—creating outputs that are more refined and tailored to your particular use case. In order to unlock this, LLMs need enhanced memory capabilities. There are two primary components to LLM memory: context windows and retrieval. Context windows are the text that the model can process and use to inform its outputs in addition to the data corpus it was trained on. Retrieval refers to retrieving and referencing relevant information and documents from a body of data outside the model’s training data corpus (“contextual data”). Currently, most LLMs have limited context windows and aren’t able to natively retrieve additional information, and so generate less personalized outputs. With bigger context windows and improved retrieval, however, LLMs can directly offer much more refined outputs tailored to individual use cases.

With expanded context windows in particular, models will be able to process larger amounts of text and better maintain context, including maintaining continuity through a conversation. This will, in turn, significantly enhance models’ ability to carry out tasks that require a deeper understanding of longer inputs, such as summarizing lengthy articles or generating coherent and contextually accurate responses in extended conversations. We’re already seeing significant improvement with context windows—GPT-4 has both an 8k and 32k token context window, up from 4k and 16k token context windows with GPT-3.5 and ChatGPT, and Claude recently expanded its context window to an astounding 100k tokens.
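
Because every token in the window costs inference time and money, applications typically trim conversation history to a budget before each call. A minimal sketch, assuming a crude word-count proxy for tokens (real systems use the provider’s tokenizer, e.g. tiktoken for OpenAI models):

```python
def count_tokens(text: str) -> int:
    # Crude proxy: whitespace word count; a real tokenizer gives exact counts.
    return len(text.split())

def fit_to_window(messages: list, budget: int) -> list:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

Dropping the oldest turns first is the simplest policy; summarizing them instead is a common refinement.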

Expanded context windows alone don’t sufficiently improve memory, since cost and time of inference scale quasi-linearly, or even quadratically, with the length of the prompt. Retrieval mechanisms augment and refine the LLM’s original training corpus with contextual data that is most relevant to the prompt. Because LLMs are trained on one body of information and are typically difficult to update, there are two primary benefits of retrieval according to Shoham: “First, it allows you to access information sources you didn’t have at training time. Second, it enables you to focus the language model on information you believe is relevant to the task.” Vector databases like Pinecone have emerged as the de facto standard for the efficient retrieval of relevant information and serve as the memory layer for LLMs, making it easier for models to search and reference the right data amongst vast amounts of information quickly and accurately.
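
The retrieval step can be sketched with a toy bag-of-words “embedding” and cosine similarity. A production system like Pinecone stores learned vector embeddings behind an approximate nearest-neighbor index, but the shape of the lookup—embed the query, rank documents by similarity, return the top k—is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts; a real system uses a learned model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

The retrieved documents are then stuffed into the prompt as contextual data, which is how retrieval sidesteps both stale training data and the context-window cost of sending everything.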

Together, increased context windows and retrieval will be invaluable for enterprise use cases like navigating large knowledge repositories or complex databases. Companies will be able to better leverage their proprietary data, like internal knowledge, historical customer support tickets, or financial results as inputs to LLMs without fine-tuning. Improving LLMs’ memory will lead to improved and deeply customized capabilities in areas like training, reporting, internal search, data analysis and business intelligence, and customer support.

In the consumer space, improved context windows and retrieval will enable powerful personalization features that can revolutionize user experiences. Noam Shazeer believes that “one of the big unlocks will be developing a model that both has a very high memory capacity to customize for each user but can still be served cost-effectively at scale. You want your therapist to know everything about your life; you want your teacher to understand what you know already; you want a life coach who can advise you about things that are going on. They all need context.” Aidan Gomez is similarly excited by this development. “By giving the model access to data that’s unique to you, like your emails, calendar, or direct messages,” he says, “the model will know your relationships with different people and how you like to talk to your friends or your colleagues and can help you within that context to be maximally useful.”

Key unlock: LLMs will be able to take into account vast amounts of relevant information and offer more personalized, tailored, and useful outputs.

“Arms and legs”: giving models the ability to use tools

The real power of LLMs lies in enabling natural language to become the conduit for action. LLMs have a sophisticated understanding of common and well-documented systems, but they can’t execute on any information they extract from those systems. For example, OpenAI’s ChatGPT, Anthropic’s Claude, and Character AI’s Lily can describe, in detail, how to book a flight, but they can’t natively book that flight themselves (though advancements like ChatGPT’s plugins are starting to push this boundary). “There’s a brain that has all this knowledge in theory and is just missing the mapping from names to the button you press,” says Amodei. “It doesn’t take a lot of training to hook those cables together. You have a disembodied brain that knows how to move, but it doesn’t have arms or legs attached yet.”

We’ve seen companies steadily improve LLMs’ ability to use tools over time. Incumbents like Bing and Google and startups like Perplexity and You.com introduced search APIs. AI21 Labs introduced Jurassic-X, which addressed many of the flaws of standalone LLMs by combining models with a predetermined set of tools, including a calculator, weather API, wiki API, and database. OpenAI released a beta of plugins that allow ChatGPT to interact with tools like Expedia, OpenTable, Wolfram, Instacart, Speak, a web browser, and a code interpreter—an unlock that drew comparisons to Apple’s “App Store” moment. And more recently, OpenAI introduced function calling in GPT-3.5 and GPT-4, which allows developers to link GPT’s capabilities to whatever external tools they want.
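
Function calling follows a simple loop: the model emits a structured call, and application code dispatches it to a registered tool and returns the result. A sketch of the dispatch side, with hypothetical tool names (`get_weather`, `add`) and a simplified JSON shape rather than any provider’s exact schema:

```python
import json

# Hypothetical tool registry; names and signatures are illustrative only.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "add": lambda a, b: a + b,
}

def dispatch(model_output: str):
    """Parse a model-emitted JSON function call and invoke the matching tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call.get("arguments", {}))
```

In a full agent loop, the tool’s return value is fed back to the model as another message so it can compose a final answer—or issue the next call.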

By shifting the paradigm from mining knowledge to taking action, giving models arms and legs has the potential to unlock a range of use cases across companies and user types. For consumers, LLMs may soon be able to give you recipe ideas and then order the ingredients you need, or suggest a brunch spot and book you a table. In the enterprise, founders

Frequently Asked Questions (FAQ)

What exactly are the four key innovations in generative AI?

According to leading AI researchers, the four key innovations are steering (controlling model outputs), memory (storing and recalling information), tool integration (enabling models to use external tools), and multimodality (processing multiple types of data).

How can AI models be controlled better to avoid errors?

Through steering, researchers are developing ways to place better controls on LLM outputs, reducing hallucinations and bias and helping models understand and execute complex user demands more accurately—which is critical for enterprise applications.

When will these innovations see practical application?

Researchers expect these innovations to arrive gradually over the next 6-12 months, making AI applications more reliable, personalized, and actionable for both enterprise and consumer users.
