How to Build a Document Processor with 99% Accuracy: From Workflows to Agents
Introduction
At Extend, our core mission is to build robust APIs for document processing, encompassing parsing, data extraction, classification, and document splitting. We discovered that building a high-accuracy processor is fundamentally an encoding problem, often disguised as prompt engineering. The challenge lies in translating deep human domain expertise—the nuanced intuition developed from handling thousands of documents—into precise field descriptions and classification rules that a Large Language Model (LLM) can execute reliably.
Consider this all-too-common and frustrating scenario:
User: The classifier gets 85% right, but it keeps confusing a "house bill of lading" with a "master bill of lading". Let me update the prompt to explicitly explain the difference.
LLM: Got it! House bills of lading are now classified perfectly.
User: Great. Wait, why is it suddenly failing to recognize invoices?
You fix the invoices, only to find it now misclassifies packing lists. This leads to a cycle of adding a rule, then an exception to that rule, then an exception to the exception. Days can be lost in this iterative loop.
The core issue is that humans are not well-suited for this task. Manually testing prompts, iterating, and updating rules is tedious and error-prone. We cannot easily predict how a change to one rule will ripple through a classifier with 50 categories, nor can we visualize the high-dimensional decision space the model navigates.
However, labeling examples is straightforward. Pointing to a document and stating "this is a house bill, that is a master bill" requires only domain expertise, not prompt engineering skills. Our pivotal insight was: If an automated agent could take those human-provided labels and automatically deduce the optimal rules, then humans would never need to manually craft prompts again.
This is precisely what Composer does. Today, our customers use it to consistently push classification accuracy to ~99% on evaluation sets that previously plateaued at around 85%.
The journey to Composer was not straightforward. It required abandoning our initial architecture, developing strong opinions on problems most agent frameworks overlook, and discovering that its most valuable capability was something we never initially planned for. While our domain is document processing, the lessons learned are generalizable: the trade-offs between workflows and agents, effective context window management, and designing idempotent triggers for long-running operations.
If you are building intelligent agents, we hope sharing our mistakes and learnings saves you valuable time.
First Attempt: The Workflow Approach
Our initial proof-of-concept was built as a deterministic workflow. It was primarily heuristic-based, with AI involved only at the final step of generating improved field descriptions.
The underlying logic was as follows: an extraction schema consists of individual fields. A naive assumption is that each field is independent and can be optimized separately. If the invoice_total field has 73% accuracy, we would analyze the errors for that specific field, generate a better description, and then proceed to the next field. This approach required one LLM inference call per field—a simple decomposition of the problem.
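The per-field decomposition described above can be sketched as a simple loop. This is a minimal illustration only; the helper names (`optimize_schema_naively`, `rewrite_description`) and data shapes are hypothetical, not Extend's actual pipeline:

```python
def optimize_schema_naively(schema_fields, eval_results, rewrite_description):
    """Naive workflow: optimize each field's description in isolation.

    schema_fields:       {field_name: description}
    eval_results:        {field_name: (accuracy, error_examples)}
    rewrite_description: stand-in for one LLM inference call that
                         proposes an improved description for one field.
    """
    improved = {}
    for name, description in schema_fields.items():
        accuracy, errors = eval_results[name]
        if accuracy >= 0.95:
            # Field already performs well: leave its description alone.
            improved[name] = description
        else:
            # One LLM call per underperforming field, analyzed separately.
            improved[name] = rewrite_description(name, description, errors)
    return improved
```

The assumption baked into this loop, that fields never interact, is exactly what breaks down in the next section.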
For simple schemas with a handful of non-interacting, flat fields, this approach worked adequately. Optimizing five truly independent fields separately is a reasonable strategy.
对于只有少数几个互不影响的扁平字段的简单模式,这种方法效果尚可。分别优化五个真正独立的字段是一个合理的策略。
Where the Workflow Approach Broke Down
The assumption of field independence quickly falls apart as schemas become more complex.
- Array fields create multi-level descriptions. A `line_items` array has descriptions at both the array level and for each nested property (e.g., `description`, `quantity`, `unit_price`). These levels interact during the extraction process, but our workflow optimized them in isolation. The heuristics needed to order these optimizations and resolve conflicts between levels became increasingly convoluted and brittle.
- Related fields interfere with each other. Fields like `length`, `width`, and `height` within an `item_dimensions` object should have semantically similar descriptions, as they all extract measurements from the same region of a document. However, optimizing them independently could produce conflicting descriptions that work against each other during extraction.
- Classification exposed a fundamental flaw. The workflow approach completely collapsed when applied to classification. Classification classes are mutually exclusive. To optimize the description for one class, you must consider all other classes in the schema to avoid creating overlapping or ambiguous definitions. Much of classification logic is about disambiguation: "This is a House Bill of Lading, NOT a Master Bill of Lading, because of feature X."
When we attempted to build heuristics to handle all possible pairwise relationships between classification classes, we hit a combinatorial explosion. For a classifier with 50 classes, you cannot feasibly enumerate every interaction and write deterministic rules for each. The code paths multiplied faster than we could manage them, rendering the approach unscalable.
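The combinatorial explosion is easy to quantify: N mutually exclusive classes imply C(N, 2) = N(N−1)/2 unordered pairs, each a potential disambiguation rule. A tiny sketch of the arithmetic:

```python
def pairwise_rule_count(n_classes: int) -> int:
    """Number of unordered class pairs a heuristic disambiguator
    would need to cover: C(n, 2) = n * (n - 1) / 2."""
    return n_classes * (n_classes - 1) // 2

# 50 classes already imply 1,225 pairwise disambiguation rules;
# 200 classes imply 19,900. Hand-written heuristics cannot keep up.
```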
The Pivot to Intelligent Agents
We pivoted after a breakthrough realization: optimizing a complex schema is a high-dimensional search problem, and navigating high-dimensional spaces is precisely what intelligent agents excel at.
Instead of decomposing the problem and manually coding rules for every edge case, we reframed it. We would give the agent the complete picture—the full schema, the error analysis, and the inherent relationships between fields—and equip it with the right tools to autonomously figure out what to optimize and how.
This architectural shift also brought a significant future benefit: as underlying LLMs become more capable, Composer improves automatically. With the old heuristic-based workflow, every model upgrade required manual re-engineering of the rules.
Switching to an agent did not magically solve all problems. It transformed the core challenge from "How do I write the perfect heuristics?" to "How do I structure and present information to the agent so it can effectively do its job?" The following sections detail our key learnings in answering this new question.
Key Takeaways from Building Composer
1/ Context Engineering Is The Real Work
We built the classification optimizer first because workflows failed most spectacularly there (due to the mutual exclusivity problem), and because classification has simpler context requirements. The output is a single label, and evaluation is straightforward: the predicted class either matches the ground truth or it doesn't.
We assembled a minimal agent with basic tools and minimal prompt engineering. It worked immediately. On our hardest internal evaluation sets—involving subtle distinctions between visually similar document types—we watched accuracy climb to ~99%. However, the apparent simplicity of classification hid a critical challenge we encountered later: context window limits. We faced this in two primary ways:
- Error Analysis Overhead: A classifier's confusion matrix is an N×N structure. For a 200-class classifier, that's 40,000 cells. Even with aggressive compression, presenting the full matrix would consume the entire context window before including any document content or detailed error analysis.
- Document Selection: We found superior results by showing the agent only the misclassified examples, explicitly stating "everything else was correct." This focused its computational "attention" on the actual problem areas rather than having it re-evaluate already correct decisions.
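The error-focused filtering described above can be sketched in a few lines. This is a minimal illustration under assumed data shapes (the `build_error_context` helper and prediction record format are hypothetical):

```python
def build_error_context(predictions, max_examples=50):
    """Assemble agent context from misclassified examples only.

    predictions: list of {'doc_id', 'predicted', 'actual'} records.
    Instead of serializing the full N x N confusion matrix, emit one
    line per error plus a single summary line for everything correct.
    """
    errors = [p for p in predictions if p["predicted"] != p["actual"]]
    lines = [
        f'{p["doc_id"]}: predicted {p["predicted"]}, actual {p["actual"]}'
        for p in errors[:max_examples]  # cap to protect the context window
    ]
    correct = len(predictions) - len(errors)
    lines.append(f"All other {correct} documents were classified correctly.")
    return "\n".join(lines)
```

The single trailing summary line buys the same information as thousands of "correct" matrix cells at a cost of a dozen tokens.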
The Lesson: Context is a precious and finite resource. Every token spent on background information or redundant data is a token unavailable for analyzing the core problem. Aggressive, intelligent filtering is not an optimization; it's a necessity.
2/ Small Tool Design Choices Have Major Impact
The design of the tools you provide to your agent significantly influences its reliability and efficiency.
| Design Choice | Initial Approach (Problematic) | Improved Approach (Effective) | Key Benefit |
|---|---|---|---|
| Execution Style | Parallel Tool Calling: Model emits multiple tool calls for concurrent execution. | Batch-style Tools: Model makes a single call (e.g., `update_fields`) with a list of actions. | Eliminates race conditions and ensures atomic, ordered execution of related updates. |
| Data Retrieval | List-then-Get Pattern: `list_fields()` returns IDs, requiring a subsequent `get_field_details(id)` call. | Collapsed Pagination: `get_fields(page, size)` returns full object details directly. | Reduces round trips; avoids the failure mode where the agent forgets to fetch details after listing. |
| Schema Reference | Custom or ad-hoc syntax for pointing to nested fields (e.g., `line_items.0.price`). | JSON Pointer (RFC 6901): Standardized syntax like `/line_items/0/price`. | Leverages the model's pre-existing knowledge of standards; ensures reliable parsing with fewer edge cases. |
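The batch-style call and JSON Pointer addressing from the table compose naturally. A minimal sketch under assumptions: the `update_fields`/`resolve_pointer` helpers and the schema shape are illustrative, not Extend's actual API, and the resolver omits RFC 6901's `~0`/`~1` escape handling:

```python
def resolve_pointer(doc, pointer: str):
    """Walk an RFC 6901-style pointer like /line_items/0/price.

    Simplified: no ~0/~1 escape handling, no empty-token edge cases.
    """
    node = doc
    for token in pointer.lstrip("/").split("/"):
        # List segments are numeric indices; dict segments are keys.
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node

def update_fields(schema, actions):
    """One atomic, batch-style tool call: apply an ordered list of
    description updates instead of emitting N parallel tool calls."""
    for action in actions:
        target = resolve_pointer(schema, action["pointer"])
        target["description"] = action["description"]
    return schema
```

Because all updates arrive in one call, the tool can apply them in order (or reject the whole batch), which is what eliminates the race conditions the table mentions.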
3/ Long-Running, Expensive Operations Require Stateful Design
Most agent frameworks are designed under the assumption that tool calls are fast, cheap, and stateless. Our domain violates all these assumptions.
When Composer evaluates a processor against a dataset, it may need to process hundreds of documents through multiple extractors and classifiers. This can be a long-running (30+ minutes) and expensive operation in terms of both time and cloud inference costs.
We addressed this by building a simple persistence layer that separates triggering an operation from waiting for its result. The key design behaviors are:
- Idempotent Triggers: When the agent requests an evaluation, the system first checks if an identical operation is already in progress. If so, it returns the existing operation ID instead of starting a duplicate.
- Persistent State: Operation state (ID, status, configuration) is written to durable storage immediately after triggering. If the agent process crashes and restarts, it can resume polling for the result.
- Polling over Blocking: Tools return quickly with an operation ID and a status of "PENDING." The agent then uses a separate `poll_operation` tool in a loop to check for completion, allowing for progress updates and graceful timeout handling.
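The three behaviors above can be sketched together. This is a minimal illustration, not Extend's implementation: an in-memory dict stands in for durable storage, and the function names are assumptions:

```python
import hashlib
import json

OPERATIONS = {}  # stand-in for a persistent store (e.g., a database table)

def trigger_evaluation(config: dict) -> str:
    """Idempotent trigger: identical configs map to the same operation ID,
    so re-triggering a run already in flight never starts a duplicate."""
    key = json.dumps(config, sort_keys=True).encode()
    op_id = hashlib.sha256(key).hexdigest()[:12]
    if op_id not in OPERATIONS:
        # State is persisted immediately, before any work begins, so a
        # crashed agent can resume polling after restart.
        OPERATIONS[op_id] = {"status": "PENDING", "config": config, "result": None}
    return op_id

def poll_operation(op_id: str) -> dict:
    """Non-blocking status check the agent calls in a loop."""
    return OPERATIONS[op_id]
```

Deriving the operation ID from a canonical hash of the configuration is one simple way to get idempotency for free: the "is an identical operation in progress?" check collapses into a dictionary lookup.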
This is not an elegant, purely agentic design—it's essentially a lightweight workflow engine bolted onto the agent framework. However, if your domain involves expensive operations, idempotent triggers and persistent state tracking are not optional; they are foundational.
4/ Your Agent Might Solve a Different (Valuable) Problem
The most unexpected and valuable lesson was that Composer's primary value sometimes lies not in its intended function.
We built Composer with the singular goal of making "the accuracy number go up."