
llms.txt: A Standard Entry Point for LLMs to Understand Website Content

2026/2/4
AI Summary (BLUF)

llms.txt is an open proposal by Jeremy Howard that gives websites a standardized, machine-readable entry point to help large language models (LLMs) understand site content during the inference phase. It differs from robots.txt, which restricts crawler access, by guiding LLMs toward the most valuable information, and from sitemap.xml by offering curated summaries and key links sized for LLM context windows. The proposal includes a strict Markdown format specification and a Python toolchain, and has been adopted by projects such as FastHTML, Supabase, and Vue.js.

I. What is llms.txt?

llms.txt is an open proposal introduced by Jeremy Howard on September 3, 2024. It aims to provide websites with a standardized, machine-readable entry point specifically designed to help Large Language Models (LLMs) more effectively understand website content during the inference stage.

A brief introduction to Jeremy Howard: he is the founding CEO of answer.ai. Early in his career he was involved in the development of the Perl programming language; he later served as President and Chief Scientist of Kaggle and co-founded fast.ai. Interestingly, he used spaced-repetition learning to acquire practical Chinese skills in just one year. In short, he is a technical heavyweight who values combining theory with practice. His GitHub profile: https://github.com/jph00.

The Core Concept of llms.txt

Unlike robots.txt: The robots.txt file instructs search engine crawlers on "which pages can be crawled and which cannot." In contrast, the llms.txt file tells LLMs "which information is most valuable for understanding this website."

Analogous to sitemap.xml: Similar to sitemap.xml, llms.txt provides an index of website content and links. However, unlike sitemap.xml, it does not attempt to list the entire website. Instead, it offers a curated, structured summary of information and a list of key links to help LLMs quickly build an understanding of the site.

Important Clarification: What is the True Purpose of llms.txt?

If you ask some large models about the purpose of llms.txt, they might incorrectly state that it is used to inform LLMs whether they are permitted to scrape website data for training, similar to how robots.txt controls crawler permissions. Even when searching for related content online, some blog posts propagate this misunderstanding. Initially, I suspected there might be two incompatible versions of llms.txt: one for aiding LLM understanding during inference and another for controlling data scraping permissions for training.

On closer investigation, the first version, the one that helps LLMs understand website content during inference, has a very clear origin: an official website, a companion Python package, and a clearly identified specification author, Jeremy Howard.

However, the second version, concerning data scraping control, lacks any traceable source of definition. Only sporadic personal blog posts mention it. Furthermore, when specifically searching for methods to control AI web scraping permissions, numerous articles suggest simply adding restrictions for specific agents in the robots.txt file. If robots.txt already suffices for this purpose, creating an llms.txt for the same goal would be redundant. Therefore, I reasonably suspect this is content hallucinated by LLMs, which has been mistakenly propagated in technical blogs, potentially even contaminating other models' training data.

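For reference, restricting AI crawlers is already an ordinary robots.txt exercise. The snippet below is illustrative (GPTBot and CCBot are the publicly documented user agents of OpenAI's crawler and Common Crawl); it is not part of the llms.txt proposal:

# robots.txt: access control lives here, not in llms.txt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /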

This incident serves as a stark reminder to exercise extreme caution when evaluating content generated by LLMs, especially technical information. Once erroneous information begins to spread, it can lead to viral proliferation and potentially pollute the broader internet ecosystem.

II. Why is llms.txt Needed?

1. Limitations of Existing Solutions

Sitemap.xml
Sitemaps are an SEO mechanism that lists every page of a website, but they are not well-suited for LLM consumption (see the snippet after this list) because they:

  • Typically do not contain plain-text versions suitable for LLM reading.
  • Do not include links to external resources helpful for understanding the site.
  • Often contain a total volume of content far exceeding an LLM's context window limit.

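To make the contrast concrete, a sitemap entry carries only URLs and crawl metadata, with nothing an LLM could use as a summary (an illustrative snippet, not taken from any real site):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/tutorial.html</loc>
    <lastmod>2024-09-03</lastmod>
  </url>
  <!-- ...often thousands more entries, none with content summaries -->
</urlset>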

Direct HTML Scraping
Modern web pages have complex HTML structures filled with substantial non-core content (ads, navigation, scripts, etc.), making accurate and efficient extraction of valuable information difficult and unreliable.

2. Advantages of llms.txt

  • Context Optimization: By providing refined summaries and curated links, it significantly reduces the amount of irrelevant content an LLM has to process.
  • Structured Navigation: Clear section divisions (e.g., "Docs," "Examples," "Optional") help LLMs quickly locate the information they need.
  • Support for Dynamic Extension: The llms.txt file itself is just a "table of contents." It can point to other .md files, which are Markdown versions of the original web pages, offering cleaner, more structured content ideal for LLM parsing.

III. The llms.txt File Format Specification in Detail

A compliant llms.txt file lives at the website's root path, or optionally at a subpath, for example https://vuejs.org/llms.txt or https://fastht.ml/docs/llms.txt (both are real, accessible URLs). Its structure strictly follows Markdown format.

1. Required Sections

  • H1 Title: Starts with #, representing the name of the project or website. This is the only mandatory field.
    • Example: # FastHTML
  • Quote Summary: A blockquote starting with >, providing a brief description of the project/website. It contains key information necessary for understanding the subsequent content.
    • Example: > FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

2. Optional but Recommended Sections

  • Detail Paragraphs: Between the H1 and the first H2, any number of regular paragraphs, lists, etc., can be included for supplementary explanation.
    • Example:
      Important notes:
      - Although parts of its API are inspired by FastAPI, it is *not* compatible with FastAPI syntax...
      - FastHTML is compatible with JS-native web components and any vanilla JS library...
      
  • H2 Section Lists: Secondary headings starting with ##. Each heading represents an information category (e.g., "Docs," "Examples"). Following each H2 is a Markdown list, where each item contains a link and an optional description.
    • Example:
      ## Docs
      - [FastHTML quick start](https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md): A brief overview of many FastHTML features
      - [HTMX reference](https://github.com/bigskysoftware/htmx/blob/master/www/content/reference.md): Brief description of all HTMX attributes...
      

3. Special "Optional" Section

If a section named ## Optional exists, the links listed within it are considered non-core information. When an LLM's context window is constrained, it can prioritize skipping this content to ensure the complete loading of core information.

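Putting these rules together, a complete minimal file could look like the sketch below, assembled from the FastHTML fragments quoted above (illustrative, not the official file):

# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

Important notes:

- Although parts of its API are inspired by FastAPI, it is *not* compatible with FastAPI syntax

## Docs

- [FastHTML quick start](https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md): A brief overview of many FastHTML features

## Optional

- [Starlette documentation](https://www.starlette.io/): Background on the underlying framework; safe to skip when the context window is tight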

The image below shows the llms.txt file from the FastHTML project, which adheres to the standard format specification:

[Image: FastHTML's llms.txt file]

IV. Technical Implementation and Toolchain

llms.txt is not merely a static file; it is supported by an entire toolchain for practical implementation.

1. Markdown Versions of Web Pages

The proposal recommends that all pages valuable to LLMs on a website should have a clean Markdown version. The rules are as follows:

  • If the original URL is https://example.com/docs/tutorial.html, its Markdown version should be https://example.com/docs/tutorial.html.md.
  • For URLs without a filename (e.g., https://example.com/docs/), the Markdown version is https://example.com/docs/index.html.md.

This ensures LLMs receive the cleanest, most structured textual content.

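As an executable restatement of these two rules, here is a small Python helper; md_url is a hypothetical name for illustration, not part of the official llms-txt package:

from urllib.parse import urlparse

def md_url(url: str) -> str:
    # Map a page URL to its Markdown twin per the llms.txt proposal
    path = urlparse(url).path
    if path == '' or path.endswith('/'):
        # No filename in the URL: append index.html.md
        return url.rstrip('/') + '/index.html.md'
    return url + '.md'

print(md_url('https://example.com/docs/tutorial.html'))  # https://example.com/docs/tutorial.html.md
print(md_url('https://example.com/docs/'))               # https://example.com/docs/index.html.md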

2. Python Tool: The llms-txt Module

The official llms-txt Python package is provided for parsing llms.txt files and generating LLM-friendly context.

Installation

pip install llms-txt

CLI Usage

# Convert llms.txt to XML-formatted context and output to stdout
llms_txt2ctx llms.txt > llms.md

# Include the "Optional" section
llms_txt2ctx llms.txt --optional True > llms-full.md

Python API Example

from llms_txt import parse_llms_file, create_ctx
from pathlib import Path

# Read the file
samp = Path('llms.txt').read_text()

# Parse the file to get structured data
parsed = parse_llms_file(samp)
print(parsed.title)  # Output: FastHTML
print(list(parsed.sections))  # Output: ['Docs', 'Examples', 'Optional']

# Generate XML context usable by LLMs
ctx = create_ctx(samp)
print(ctx[:300])  # View the first 300 characters of the generated context
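
To see it against a live site, the same API can parse a downloaded file. This is a usage sketch: the Vue.js URL comes from the examples earlier in this post, and network access is assumed.

import urllib.request

from llms_txt import parse_llms_file

# Download a published llms.txt (Vue.js serves one at its site root)
with urllib.request.urlopen('https://vuejs.org/llms.txt') as resp:
    txt = resp.read().decode('utf-8')

parsed = parse_llms_file(txt)
print(parsed.title)           # Project name taken from the H1 line
print(list(parsed.sections))  # Names of the H2 sections in the file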

V. Practical Adoption in Real Projects

llms.txt has already been adopted by several real-world website projects. Here are some notable examples.

FastHTML
FastHTML is a framework for building complete HTML applications in Python. It comes from answer.ai, the company founded by Jeremy Howard, the author of the llms.txt specification, so it adopted llms.txt early on. The project's documentation site not only provides an llms.txt file but also generates a Markdown version of every page.

Supabase
Supabase provides an online database service based on PostgreSQL and is also a full-featured backend-as-a-service platform. With Supabase, developers are no longer constrained by the traditional frontend-backend separation model. They can perform database operations directly within frontend code without writing complex backend logic.

The Vue.js Ecosystem
In May 2025, Evan You, a leading figure in the frontend framework space, announced that core projects like Vue.js, Vite, and Rolldown had all added llms.txt files. This announcement provided a significant boost to the promotion and adoption of llms.txt.

The adoption of llms.txt represents a thoughtful step towards a more structured and efficient web for AI agents. By providing a curated, machine-readable guide to a website's most valuable content, it addresses key limitations of existing methods like sitemaps and direct HTML scraping. As the toolchain matures and adoption grows among major projects, llms.txt has the potential to become a fundamental standard for human-AI interaction on the web.
