
optimize_anything API: The Ultimate Guide to Code and Configuration Optimization (2026)

2026/2/27
AI Summary (BLUF)

optimize_anything is a declarative API that transforms the optimization of any text artifact (e.g., code, configurations) into an LLM-based search problem. Built on GEPA technology, it excels across various domains (like agents, cloud scheduling). Users simply define the optimization goal and evaluation metric.

Today, we are introducing optimize_anything, a declarative API that optimizes any artifact representable as text (e.g., code, prompts, agent architectures, vector graphics, configurations). It extends GEPA (Genetic-Pareto, our state-of-the-art LLM prompt optimizer) far beyond prompts. You declare what to optimize and how to measure it; the system handles the search.

Testing it across several domains, we find optimize_anything consistently matches or outperforms domain-specific tools, including some purpose-built for each task. With one API, you can:

  • Create agent skills achieving near-perfect Claude Code task completion 47% faster.
  • Optimize cloud scheduling policies that cut costs by 40%, beating expert heuristics.
  • Find detailed system prompts to boost GPT's math reasoning accuracy.
  • Discover bespoke agent harnesses that nearly triple Gemini Flash's ARC-AGI accuracy.
  • Write custom solvers to match and exceed Optuna in blackbox mathematical optimization.
  • And... model a 3D unicorn.

The key insight is that a surprisingly wide range of problems can be formulated as optimizing a text artifact: speeding up a CUDA kernel, tuning a scheduling policy, refining a prompt template, or redesigning an agent architecture. If it can be serialized to a string and its quality measured, an LLM can reason about it and propose improvements.

Where prior LLM-evolution frameworks like AlphaEvolve, OpenEvolve, and ShinkaEvolve expose concepts like island topologies, prompt samplers, and cascade evaluation stages, optimize_anything strips the interface down to its essence — and goes further by unifying three optimization modes (single-task search, multi-task search, and generalization) under one declarative API. While prior systems operate exclusively in single-task mode, optimize_anything enables optimization tasks they cannot directly express like discovering agent architectures from scratch, learning prompts that generalize to unseen examples, and optimizing coding agent skills that transfer across models.

Evaluate a text artifact, capture diagnostic feedback (ASI), and let an LLM propose targeted improvements. Code, prompts, configs, agent architectures — if you can measure it, optimize_anything can optimize it.

optimize_anything API

The simplest form

At its core, the API requires just two things: an artifact (or a description of what you want) and an evaluator.

import gepa.optimize_anything as oa

def evaluate(candidate: str) -> float:
    score, diagnostic = run_my_system(candidate)
    oa.log(f"Error: {diagnostic}")  # captured as ASI
    return score

# Start from an existing artifact...
result = oa.optimize_anything(
    seed_candidate="<your initial artifact>",
    evaluator=evaluate,
)

# ...or simply describe what you want.
result = oa.optimize_anything(
    evaluator=evaluate,
    objective="Generate a Python function `reverse()` that reverses a string.",
)

print(result.best_candidate)

That's it. The evaluator takes a candidate string and returns a score (higher is better). oa.log() works just like print(), but routes output to the LLM proposer as Actionable Side Information (ASI) — diagnostic feedback the proposer reads during reflection. For richer diagnostics, return a structured dictionary alongside the score:

def evaluate(candidate: str) -> tuple[float, dict]:
    result = execute_code(candidate)
    return result.score, {
        "Error": result.stderr,
        "Output": result.stdout,
        "Runtime": f"{result.time_ms:.1f}ms",
    }

ASI can be open-ended text, structured data, multi-objectives (through scores), or even images (via gepa.Image) for vision-capable LLMs; anything that would help an expert understand the artifact and diagnose failures. We'll see ASI in action in the SVG demo, then unpack why it matters.
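To make the contract concrete, here is a self-contained sketch of an evaluator for a toy task (the task, test cases, and scoring scheme are illustrative, not part of the library): it executes a candidate `reverse()` implementation and returns a score plus textual diagnostics in the ASI dictionary.

```python
# Illustrative evaluator: score a candidate string implementing
# `reverse()` and surface failures as ASI text the proposer can read.
def evaluate(candidate: str) -> tuple[float, dict]:
    cases = [("abc", "cba"), ("", ""), ("racecar", "racecar")]
    namespace: dict = {}
    try:
        exec(candidate, namespace)  # the artifact is just a string of code
        fn = namespace["reverse"]
    except Exception as e:
        # Even a crashing candidate yields actionable diagnostics.
        return 0.0, {"Error": repr(e)}

    failures = []
    for inp, want in cases:
        try:
            got = fn(inp)
        except Exception as e:
            got = repr(e)
        if got != want:
            failures.append(f"reverse({inp!r}) -> {got!r}, expected {want!r}")

    score = 1.0 - len(failures) / len(cases)
    return score, {"Failures": "\n".join(failures) or "all cases pass"}

score, asi = evaluate("def reverse(s):\n    return s[::-1]")  # score == 1.0
```

Note that a failing candidate is just as informative as a passing one: the proposer sees exactly which inputs broke and what came back, rather than a bare scalar.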

One interface, three optimization modes

optimize_anything unifies three distinct optimization paradigms under one API, determined by whether you provide a dataset and valset:

  1. Single-Task Search: "Solve one hard problem." No dataset needed; the candidate is the solution, and the evaluator scores it directly (no example argument). For example, in circle packing, the artifact is the packing algorithm code and the evaluator returns the score plus geometric diagnostics as ASI. This is the mode that prior LLM-evolution frameworks like AlphaEvolve and OpenEvolve operate in.

    oa.optimize_anything(seed_candidate=..., evaluator=...)
    
  2. Multi-Task Search: "Solve a batch of related problems with cross-transfer." You provide a dataset of related tasks; insights from solving one help solve the others. For example, in CUDA kernel generation, each task is a PyTorch operation to accelerate on the same hardware, and the evaluator compiles and benchmarks the kernel, returning compiler errors and profiler traces as ASI. Even though the kernels perform different computations, multi-task mode converges faster and solves more problems across all speedup thresholds than dedicated single-task optimization, thanks to cross-transfer of optimization patterns. No prior LLM-evolution framework supports this mode.

    oa.optimize_anything(seed_candidate=..., evaluator=..., dataset=tasks)
    
  3. Generalization: "Build a skill that transfers to unseen problems." You provide both a training dataset and a held-out valset; the optimized artifact (a prompt, an agent, a policy) must generalize to unseen examples. This is the mode that GEPA's prompt optimization operates in. optimize_anything generalizes the pattern to any text artifact, not just prompts, abstracting over traditional machine learning and program synthesis. For example, in agent architecture discovery, the artifact is the entire agent, the dataset and valset are ARC-AGI puzzles, and the evaluator runs the agent and returns its errors as ASI. The optimized agent improves from 32.5% to 89.5% on the test set (+57 percentage points). The same mode also powers cloud scheduling policy discovery, where the artifact is an algorithm that must generalize across unseen infrastructure scenarios.

    oa.optimize_anything(seed_candidate=..., evaluator=..., dataset=train, valset=val)
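One practical implication of the modes above (this is our reading of the API; the exact calling convention may differ in the library): in single-task mode the evaluator receives only the candidate, while in the dataset-driven modes it is called once per example, which is what lets scores be tracked per task. A toy per-example evaluator for a prompt-template artifact, with an invented dataset and a stand-in for the model call, might look like:

```python
# Toy per-example evaluator for a prompt-template artifact (modes 2/3).
# `fake_llm` stands in for a real model call; the dataset is invented.
train = [
    {"question": "2+2", "answer": "4"},
    {"question": "3*3", "answer": "9"},
]

def fake_llm(prompt: str) -> str:
    # Pretend model: evaluate the arithmetic expression on the last line.
    return str(eval(prompt.splitlines()[-1]))

def evaluate(candidate: str, example: dict) -> tuple[float, dict]:
    prompt = candidate.format(**example)
    prediction = fake_llm(prompt)
    score = float(prediction == example["answer"])
    # Per-example ASI shows the proposer exactly which inputs fail and why.
    return score, {"Prompt": prompt, "Predicted": prediction}

seed = "Answer with just the number.\n{question}"
scores = [evaluate(seed, ex)[0] for ex in train]  # [1.0, 1.0]
```

Because each example yields its own score and ASI, the optimizer can maintain the per-task bookkeeping that the Pareto-efficient search described later relies on.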
    

The full API signature:

def optimize_anything(
    seed_candidate: str | dict[str, str] | None = None,  # starting artifact (or None for no seed)
    evaluator: Callable,                    # score + ASI
    dataset: list | None = None,            # training examples (modes 2 and 3)
    valset: list | None = None,             # validation set (mode 3)
    objective: str | None = None,           # optimization goal (natural language)
    background: str | None = None,          # domain knowledge and constraints
    config: GEPAConfig | None = None,       # engine, reflection, tracking settings
) -> GEPAResult:
    """
    Call with either (seed_candidate + evaluator) or (evaluator + objective).
    """

Notice what's absent: no mutation prompts, no task-specific instruction templates, no island configurations, no EVOLVE-BLOCK markers (all common in prior LLM-evolution frameworks). You declare the what (your artifact, your evaluator, and any domain knowledge as background) and optimize_anything handles the how: prompt construction, reflection, candidate selection, and search strategy. This declarative design, inspired by DSPy's principle of programming not prompting, means the same API call works whether you're optimizing a CUDA kernel, a cloud scheduling policy, or an agent architecture.

How it works

Classical optimization methods reduce all diagnostic context to a single scalar. They know that a candidate failed, but not why. You can't show a Bayesian optimizer the stack trace that pinpoints the bug. Recent LLM-evolution frameworks changed this by feeding execution results and textual feedback into LLM proposers. However, the "evolutionary" framing these frameworks inherit suggests a blind process — mutate, evaluate, select, repeat. But when an LLM reads a compiler error, diagnoses a logic bug, and proposes a targeted fix, that's not natural selection, it's an engineer iterating on a prototype. optimize_anything leans into this with two key ingredients: diagnostic feedback as a first-class API concept and Pareto-efficient search.

Actionable Side Information

optimize_anything makes diagnostic feedback a first-class part of the evaluator contract. Prior frameworks expose feedback through framework-specific mechanisms; ASI provides a uniform interface that makes it trivial to surface any diagnostic the evaluator can produce, including modalities no prior framework supports, such as rendered images that let a VLM visually inspect its own output. In the pelican demo, the evaluator passed the rendered SVG back as an image so the proposer could literally see what it was improving. During a dedicated reflection step, the proposer reasons over this signal to diagnose failures and propose targeted fixes.

ASI is the text-optimization analogue of the gradient. Where gradients tell a numerical optimizer which direction to move, ASI tells an LLM proposer why a candidate failed and how to fix it.

Pareto-efficient search

Even when optimizing a single objective, evaluating candidates across multiple aspects or examples produces richer signal. The naive approach collapses that signal into one average score and always improves the top candidate. This stalls fast: averaging hides which aspects are strong and which are weak, and the proposer tries to improve everything at once instead of focusing.

optimize_anything does two things differently. First, it tracks scores per task (expressed in the dataset or valset) or per metric (expressed in the score returned by the evaluator and the scores field in ASI) individually and maintains a Pareto frontier: any candidate that is the best at something survives, even if its average is suboptimal. Second, each reflection step shows the proposer a minibatch of just 2–3 examples or metrics instead of all of them. The proposer makes focused, targeted improvements on that subset, and the Pareto frontier ensures these specialized gains are preserved across iterations rather than averaged away. Over iterations, the frontier accumulates complementary strengths, and the best candidates combine them. The same mechanism powers multi-task search: when optimizing across a batch of related problems, the frontier preserves candidates that excel on different tasks, and strategies discovered for one problem transfer to others — which is why multi-task mode outperforms dedicated single-task optimization in CUDA kernel generation.
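The selection rule can be sketched in a few lines (a deliberate simplification; GEPA's actual bookkeeping is richer): a candidate survives if it is best on at least one task or metric, even when its average score is not the highest.

```python
# Simplified Pareto-frontier selection: keep every candidate that is
# best on at least one task, regardless of its average score.
def pareto_frontier(per_task_scores: dict[str, list[float]]) -> set[str]:
    n_tasks = len(next(iter(per_task_scores.values())))
    frontier: set[str] = set()
    for t in range(n_tasks):
        best = max(scores[t] for scores in per_task_scores.values())
        for name, scores in per_task_scores.items():
            if scores[t] == best:
                frontier.add(name)  # best-at-something survives
    return frontier

candidates = {
    "A": [0.9, 0.2, 0.3],  # specialist on task 0 (average 0.47)
    "B": [0.1, 0.9, 0.2],  # specialist on task 1 (average 0.40)
    "C": [0.5, 0.5, 0.5],  # best average, and best on task 2
}
survivors = pareto_frontier(candidates)  # {"A", "B", "C"}
```

A greedy best-average rule would keep only C and discard the specialized strengths of A and B; the frontier retains them so later proposals can combine what each does well.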
