
llms.txt: A Standard Entry Point for LLMs to Understand Website Content

2026/2/4
AI Summary (BLUF)

llms.txt is an open proposal by Jeremy Howard that gives websites a standardized, machine-readable entry point to help large language models (LLMs) understand site content during the inference phase. It differs from robots.txt, which restricts crawler access, by guiding LLMs toward the most valuable information, and from sitemap.xml by offering curated summaries and key links sized for LLM context windows. The proposal includes a strict Markdown format specification and a Python toolchain, and has been adopted by projects such as FastHTML, Supabase, and Vue.js.

I. What is llms.txt?

llms.txt is an open proposal introduced by Jeremy Howard on September 3, 2024. It aims to provide websites with a standardized, machine-readable entry point specifically designed to help Large Language Models (LLMs) more effectively understand website content during the inference stage.

A brief introduction to Jeremy Howard: he is the founding CEO of answer.ai. Early in his career he was involved in the development of the Perl programming language; he later served as President and Chief Scientist of Kaggle and co-founded fast.ai. Interestingly, he used spaced-repetition learning to acquire practical Chinese skills in just one year. In short, he is a technical heavyweight who values combining theory with practice. His GitHub profile: https://github.com/jph00.

The Core Concept of llms.txt

Unlike robots.txt: The robots.txt file instructs search engine crawlers on "which pages can be crawled and which cannot." In contrast, the llms.txt file tells LLMs "which information is most valuable for understanding this website."

Analogous to sitemap.xml: Similar to sitemap.xml, llms.txt provides an index of website content and links. However, unlike sitemap.xml, it does not attempt to list the entire website. Instead, it offers a curated, structured summary of information and a list of key links to help LLMs quickly build an understanding of the site.

Important Clarification: What is the True Purpose of llms.txt?

If you ask some large models about the purpose of llms.txt, they might incorrectly state that it is used to inform LLMs whether they are permitted to scrape website data for training, similar to how robots.txt controls crawler permissions. Even when searching for related content online, some blog posts propagate this misunderstanding. Initially, I suspected there might be two incompatible versions of llms.txt: one for aiding LLM understanding during inference and another for controlling data scraping permissions for training.

On closer investigation, the first version, the one that helps LLMs understand website content during inference, has a very clear origin: an official website, a companion Python package, and a clearly identified specification author, Jeremy Howard.

However, the second version, concerning data scraping control, lacks any traceable source of definition. Only sporadic personal blog posts mention it. Furthermore, when specifically searching for methods to control AI web scraping permissions, numerous articles suggest simply adding restrictions for specific agents in the robots.txt file. If robots.txt already suffices for this purpose, creating an llms.txt for the same goal would be redundant. Therefore, I reasonably suspect this is content hallucinated by LLMs, which has been mistakenly propagated in technical blogs, potentially even contaminating other models' training data.

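For reference, restricting AI crawlers is already an ordinary robots.txt exercise. The snippet below is illustrative (GPTBot and CCBot are the publicly documented user agents of OpenAI's crawler and Common Crawl); it is not part of the llms.txt proposal:

# robots.txt: access control lives here, not in llms.txt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /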

This incident serves as a stark reminder to exercise extreme caution when evaluating content generated by LLMs, especially technical information. Once erroneous information begins to spread, it can lead to viral proliferation and potentially pollute the broader internet ecosystem.

II. Why is llms.txt Needed?

1. Limitations of Existing Solutions

Sitemap.xml
Sitemaps are an SEO mechanism that lists every page of a website, but they are not well-suited for LLM consumption (see the snippet after this list) because they:

  • Typically do not contain plain-text versions suitable for LLM reading.
  • Do not include links to external resources helpful for understanding the site.
  • Often contain a total volume of content far exceeding an LLM's context window limit.

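To make the contrast concrete, a sitemap entry carries only URLs and crawl metadata, with nothing an LLM could use as a summary (an illustrative snippet, not taken from any real site):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/tutorial.html</loc>
    <lastmod>2024-09-03</lastmod>
  </url>
  <!-- ...often thousands more entries, none with content summaries -->
</urlset>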

Direct HTML Scraping
Modern web pages have complex HTML structures filled with substantial non-core content (ads, navigation, scripts, etc.), making accurate and efficient extraction of valuable information difficult and unreliable.

2. Advantages of llms.txt

  • Context Optimization: By providing refined summaries and curated links, it significantly reduces the amount of irrelevant content an LLM has to process.
  • Structured Navigation: Clear section divisions (e.g., "Docs," "Examples," "Optional") help LLMs quickly locate the information they need.
  • Support for Dynamic Extension: The llms.txt file itself is just a "table of contents." It can point to other .md files, which are Markdown versions of the original web pages, offering cleaner, more structured content ideal for LLM parsing.

III. The llms.txt File Format Specification in Detail

A compliant llms.txt file lives at the website's root path, or optionally at a subpath, for example https://vuejs.org/llms.txt or https://fastht.ml/docs/llms.txt (both are real, accessible URLs). Its structure strictly follows Markdown format.

1. Required Sections

  • H1 Title: Starts with #, representing the name of the project or website. This is the only mandatory field.
    • Example: # FastHTML
  • Quote Summary: A blockquote starting with >, providing a brief description of the project/website. It contains key information necessary for understanding the subsequent content.
    • Example: > FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

2. Optional but Recommended Sections

  • Detail Paragraphs: Between the H1 and the first H2, any number of regular paragraphs, lists, etc., can be included for supplementary explanation.
    • Example:
      Important notes:
      - Although parts of its API are inspired by FastAPI, it is *not* compatible with FastAPI syntax...
      - FastHTML is compatible with JS-native web components and any vanilla JS library...
      
  • H2 Section Lists: Secondary headings starting with ##. Each heading represents an information category (e.g., "Docs," "Examples"). Following each H2 is a Markdown list, where each item contains a link and an optional description.
    • Example:
      ## Docs
      - [FastHTML quick start](https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md): A brief overview of many FastHTML features
      - [HTMX reference](https://github.com/bigskysoftware/htmx/blob/master/www/content/reference.md): Brief description of all HTMX attributes...
      

3. Special "Optional" Section

If a section named ## Optional exists, the links listed within it are considered non-core information. When an LLM's context window is constrained, it can prioritize skipping this content to ensure the complete loading of core information.

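Putting these rules together, a complete minimal file could look like the sketch below, assembled from the FastHTML fragments quoted above (illustrative, not the official file):

# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

Important notes:

- Although parts of its API are inspired by FastAPI, it is *not* compatible with FastAPI syntax

## Docs

- [FastHTML quick start](https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md): A brief overview of many FastHTML features

## Optional

- [Starlette documentation](https://www.starlette.io/): Background on the underlying framework; safe to skip when the context window is tight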

The image below shows the llms.txt file from the FastHTML project, which adheres to the standard format specification:

[Image: FastHTML's llms.txt file]

IV. Technical Implementation and Toolchain

llms.txt is not merely a static file; it is supported by an entire toolchain for practical implementation.

1. Markdown Versions of Web Pages

The proposal recommends that all pages valuable to LLMs on a website should have a clean Markdown version. The rules are as follows:

  • If the original URL is https://example.com/docs/tutorial.html, its Markdown version should be https://example.com/docs/tutorial.html.md.
  • For URLs without a filename (e.g., https://example.com/docs/), the Markdown version is https://example.com/docs/index.html.md.

This ensures LLMs receive the cleanest, most structured textual content.

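As an executable restatement of these two rules, here is a small Python helper; md_url is a hypothetical name for illustration, not part of the official llms-txt package:

from urllib.parse import urlparse

def md_url(url: str) -> str:
    # Map a page URL to its Markdown twin per the llms.txt proposal
    path = urlparse(url).path
    if path == '' or path.endswith('/'):
        # No filename in the URL: append index.html.md
        return url.rstrip('/') + '/index.html.md'
    return url + '.md'

print(md_url('https://example.com/docs/tutorial.html'))  # https://example.com/docs/tutorial.html.md
print(md_url('https://example.com/docs/'))               # https://example.com/docs/index.html.md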

2. Python Tool: The llms-txt Module

The official llms-txt Python package is provided for parsing llms.txt files and generating LLM-friendly context.

Installation

pip install llms-txt

CLI Usage

# Convert llms.txt to XML-formatted context and output to stdout
llms_txt2ctx llms.txt > llms.md

# Include the "Optional" section
llms_txt2ctx llms.txt --optional True > llms-full.md

Python API Example

from llms_txt import parse_llms_file, create_ctx
from pathlib import Path

# Read the file
samp = Path('llms.txt').read_text()

# Parse the file to get structured data
parsed = parse_llms_file(samp)
print(parsed.title)  # Output: FastHTML
print(list(parsed.sections))  # Output: ['Docs', 'Examples', 'Optional']

# Generate XML context usable by LLMs
ctx = create_ctx(samp)
print(ctx[:300])  # View the first 300 characters of the generated context
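
To see it against a live site, the same API can parse a downloaded file. This is a usage sketch: the Vue.js URL comes from the examples earlier in this post, and network access is assumed.

import urllib.request

from llms_txt import parse_llms_file

# Download a published llms.txt (Vue.js serves one at its site root)
with urllib.request.urlopen('https://vuejs.org/llms.txt') as resp:
    txt = resp.read().decode('utf-8')

parsed = parse_llms_file(txt)
print(parsed.title)           # Project name taken from the H1 line
print(list(parsed.sections))  # Names of the H2 sections in the file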

V. Practical Adoption in Real Projects

llms.txt has already been adopted by several real-world website projects. Here are some notable examples.

FastHTML
FastHTML is a framework for building complete HTML applications in Python. It comes from answer.ai, the company founded by Jeremy Howard, the author of the llms.txt specification, so it adopted llms.txt early on. The project's documentation site not only provides an llms.txt file but also generates a Markdown version of every page.

Supabase
Supabase provides an online database service based on PostgreSQL and is also a full-featured backend-as-a-service platform. With Supabase, developers are no longer constrained by the traditional frontend-backend separation model. They can perform database operations directly within frontend code without writing complex backend logic.

The Vue.js Ecosystem
In May 2025, Evan You, a leading figure in the frontend framework space, announced that core projects like Vue.js, Vite, and Rolldown had all added llms.txt files. This announcement provided a significant boost to the promotion and adoption of llms.txt.

The adoption of llms.txt represents a thoughtful step towards a more structured and efficient web for AI agents. By providing a curated, machine-readable guide to a website's most valuable content, it addresses key limitations of existing methods like sitemaps and direct HTML scraping. As the toolchain matures and adoption grows among major projects, llms.txt has the potential to become a fundamental standard for human-AI interaction on the web.
