What Is LLMs.txt? A 2024 Guide to the AI Crawler Standard | Geoz.com.cn
LLMs.txt is a proposed web standard designed to help large language models (LLMs) better understand and utilize website content by providing a structured, curated list of important pages in Markdown format. It aims to address challenges AI crawlers face with modern websites, such as JavaScript-loaded content and information overload, potentially improving AI-generated responses and reducing training inefficiencies.
Introduction
The llms.txt file is a proposed standard designed to help large language models (LLMs) better understand and utilize content from websites. The concept is straightforward: instead of allowing AI crawlers to index a site indiscriminately, webmasters can provide a curated list of their most important content. This acts as a guide, telling AI systems which parts of the site are authoritative and relevant for training and response generation.
This initiative draws inspiration from existing web standards like robots.txt and XML sitemaps, which help search engine crawlers navigate websites efficiently. The key difference is that llms.txt is specifically tailored for AI models that may use web content to answer user queries or generate text. There is also speculation that implementing llms.txt could enhance a website's visibility in AI-generated responses and potentially increase referral traffic.
Before delving into the details and evaluating its worth, it's crucial to understand the problems this new standard aims to address.
What Problem Is LLMs.txt Trying to Solve?
The llms.txt standard is proposed to help AI crawlers browse websites more effectively. Currently, these crawlers face two significant challenges:
- Modern Websites Are Hard to Parse: Many contemporary websites rely heavily on JavaScript to load content dynamically. Most AI crawlers can only read basic HTML, meaning they might miss critical information rendered by client-side scripts. llms.txt provides a clear, static, and structured format that helps AI crawlers quickly digest key information without parsing complex JavaScript.
- Information Overload and Relevance: Websites often contain vast amounts of information. When AI crawlers visit a site, they lack the context to discern what is most important or authoritative. They might waste resources scraping outdated blog posts or irrelevant pages, leading to responses based on suboptimal information. llms.txt acts as a curator, guiding crawlers to the most valuable content.
By providing this guidance, llms.txt may also contribute to reducing inefficiencies in large language model training. Training LLMs incurs massive computational costs. Directing models to high-quality, relevant content from the start could minimize resource waste on irrelevant data.
How Are LLMs.txt Files Structured?
According to the proposed specification, llms.txt files should be structured and formatted using Markdown, a lightweight markup language that uses plain text formatting syntax to create structured documents. It is widely used by developers (e.g., in GitHub README files) and is easily parseable by both humans and AI systems.
Common Markdown elements used in an llms.txt file include:
- #, ##, ### for headings (H1, H2, H3, etc.)
- > for blockquotes to highlight important descriptions
- - or * for bullet points in unordered lists
- [text](url) for hyperlinks
- : for adding descriptions next to links
- ``` for code blocks when sharing technical examples
The official specification provides a basic template. However, for larger or more complex websites, you can add more structure, such as using H3/H4 subsections, incorporating tables, or including code snippets, to provide greater context to AI crawlers, as long as valid Markdown syntax is used.
Example llms.txt File:
# Company Name
> Brief description of what your company does
## Products
- [Product 1](https://example.com/product-1): Description of this product
- [Product 2](https://example.com/product-2): Description of this product
## Documentation
- [Getting Started](https://example.com/docs/getting-started): Introduction to our platform
- [API Reference](https://example.com/api): Complete API documentation
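Because the format is just Markdown with a link-list convention, a file like the one above can be sanity-checked with a few lines of script. The following minimal Python sketch (function name and sample text are illustrative, not part of the specification) extracts each H2 section and its link/description entries:

```python
import re

def parse_llms_txt(text):
    """Extract sections and their '- [text](url): description' entries."""
    sections = {}
    current = None
    link_re = re.compile(r"^- \[(?P<label>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<desc>.*))?$")
    for line in text.splitlines():
        if line.startswith("## "):          # H2 marks a new section
            current = line[3:].strip()
            sections[current] = []
        else:
            m = link_re.match(line.strip())
            if m and current:               # link entries belong to the open section
                sections[current].append((m["label"], m["url"], m["desc"] or ""))
    return sections

sample = """# Company Name
> Brief description

## Products
- [Product 1](https://example.com/product-1): Description of this product

## Documentation
- [Getting Started](https://example.com/docs/getting-started): Introduction to our platform
"""
result = parse_llms_txt(sample)
print(result["Products"])  # [('Product 1', 'https://example.com/product-1', 'Description of this product')]
```

A parser this simple also doubles as a pre-publish lint: if a line you meant as a link entry does not show up in the output, its Markdown is probably malformed.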
Current Adoption and Practical Considerations
Are Brands Using the LLMs.txt Standard?
Yes, some companies, particularly in the SaaS and developer tools space, have begun experimenting with llms.txt. However, overall adoption remains very niche. According to data from NerdyData (as of July 2025), only 951 domains had published an llms.txt file—a minuscule fraction of the web.
Notable examples show varied approaches:
| Brand | What the File Focuses On | Overall Structure |
|---|---|---|
| Hugging Face | Developer documentation | Uses multiple heading levels, code examples, and extensive notes, resembling a comprehensive knowledge base. |
| Vercel | Developer documentation | Starts with metadata (title, description, tags) and uses clear headers with step-by-step instructions and code examples. |
| Zapier | Developer documentation | Employs a simple structure with few headings, primarily consisting of a long list of links with descriptions. |
| Cal.com | Developer documentation | Uses basic headings followed by a long, ungrouped list of links without subheadings or summaries. |
A key observation is that none of these early adopters uses llms.txt to represent their entire website; they focus primarily on developer documentation sections. This highlights that the file's scope is a strategic choice.
Should You Implement LLMs.txt on Your Site?
Currently, implementing llms.txt is likely not a priority for most website owners, unless driven by curiosity or a desire to experiment.
The primary reason is the lack of official support. llms.txt remains a proposed community standard. Major AI companies like OpenAI, Google, and Anthropic have not officially announced that their web crawlers (e.g., GPTBot, Google-Extended, ClaudeBot) actively consume or prioritize llms.txt files. Google's John Mueller has also confirmed this lack of official usage on Bluesky.
While there are interesting signals—such as Anthropic publishing an llms.txt file on its own site—this does not confirm operational use. The landscape is currently in an early, speculative phase.
Empirical testing has shown limited impact. For instance, after implementing llms.txt on Search Engine Land in March 2025, no correlation was found between the file and improved visibility in AI search results. Analysis of server logs from mid-August to late October 2025 revealed that the llms.txt page received zero visits from major AI crawlers like Google-Extended, GPTBot, PerplexityBot, or ClaudeBot. While traditional crawlers (Googlebot, Bingbot) accessed the file, they did so infrequently and without special priority.
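A log check like the one described above is straightforward to reproduce. This is a minimal sketch assuming Apache/Nginx combined log format; the sample lines are fabricated for illustration, though the crawler user-agent tokens (GPTBot, Google-Extended, PerplexityBot, ClaudeBot) are the ones the vendors publish:

```python
from collections import Counter

# User-agent substrings for major AI crawlers.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "PerplexityBot", "ClaudeBot"]

def count_llms_txt_hits(log_lines):
    """Count /llms.txt requests per AI crawler in combined-format access logs."""
    hits = Counter()
    for line in log_lines:
        if '"GET /llms.txt' not in line:
            continue
        for bot in AI_CRAWLERS:
            # A UA containing two tokens would be counted twice; fine for a rough tally.
            if bot in line:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [20/Aug/2025:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [20/Aug/2025:10:05:00 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(count_llms_txt_hits(sample))  # Counter({'GPTBot': 1})
```

Run against a few months of real logs, an empty Counter is exactly the "zero visits from major AI crawlers" result reported above.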
How to Create an LLMs.txt File (Step by Step)
If you decide to proceed with implementation for experimental purposes, follow these steps. This process is technical, so involving a developer is recommended.
Step 1: Decide What Content to Feature
Determine which pages or sections of your website should be highlighted for AI crawlers. For a site-wide file, consider including:
- Product or service pages
- Key, up-to-date blog posts or articles
- Pricing page
- About us page
- Contact page
- Core documentation
Step 2: Create the File in Markdown
Open a text editor and create a new file named llms.txt. Structure it using Markdown. Below is an expanded example structure:
# Website Name
> Brief description of your website's purpose and value.
**Important Notes:**
- Key differentiator or important detail about your business.
- Another critical point about what you do or don't do.
- A third key point that helps define your offering.
## Products
- [Product Name 1](https://example.com/product-1): Short description of main feature and benefit.
- [Product Name 2](https://example.com/product-2): Short description of main feature and benefit.
## Blog Content
- [Blog Post Title 1](https://example.com/blog-post-1): Brief description of topic and utility.
- [Blog Post Title 2](https://example.com/blog-post-2): Brief description of topic and utility.
## Company
- [About Us](https://example.com/about): Company background, mission, and team.
- [Contact](https://example.com/contact): How to reach our team.
- [Pricing](https://example.com/pricing): Overview of plans and costs.
Step 3: Upload the File to Your Website
Place the file in the appropriate location on your server so it is accessible via a direct URL.
- For a site-wide file: Upload llms.txt to your website's root directory (e.g., public_html/). It should be accessible at https://yourdomain.com/llms.txt.
- For a section-specific file: Upload it to the corresponding subdirectory (e.g., for docs at docs.yourdomain.com, place it in the /docs/ folder). It would be accessible at https://docs.yourdomain.com/llms.txt.
Use your web hosting control panel (e.g., cPanel File Manager) or FTP client to upload the file. After uploading, verify it is live by visiting the URL directly in a browser. You can also use tools like Semrush's Site Audit to check if the file is detected.
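The live check can also be scripted with Python's standard library. A minimal sketch, assuming the root-directory placement described above (the domain and User-Agent string are placeholders):

```python
from urllib.request import Request, urlopen

def llms_txt_url(base_url):
    """Build the canonical /llms.txt URL for a site or docs subdomain."""
    return base_url.rstrip("/") + "/llms.txt"

def check_llms_txt(base_url, timeout=10):
    """Fetch the file and report whether it is live and starts like Markdown."""
    req = Request(llms_txt_url(base_url),
                  headers={"User-Agent": "llms-txt-check/0.1"})  # placeholder UA
    with urlopen(req, timeout=timeout) as resp:
        status = resp.status
        body = resp.read().decode("utf-8", errors="replace")
    # Per the proposal, the file should open with an H1 heading.
    return status == 200 and body.lstrip().startswith("#")

print(llms_txt_url("https://yourdomain.com/"))  # https://yourdomain.com/llms.txt
# Example (replace with your own domain before running):
# print(check_llms_txt("https://yourdomain.com"))
```

Checking that the body begins with "#" is a crude heuristic: it catches the common failure where the server returns an HTML 404 page with a 200 status instead of the Markdown file.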
Finally, remember to maintain the file. Regularly update it to remove broken links and add new, important content to ensure it remains a useful and accurate guide—should AI crawlers begin to use it in the future.