Is the DSPy Framework Pseudoscience? An In-Depth Critique of LLM Optimization Methods in 2025

The Arrival of the Artifact

Imagine humanity discovered an alien artifact - a black box that responds to text input with uncannily intelligent text output. We don't understand its architecture, we can't see its internals, we have no theory of its operation. But it works, sometimes brilliantly.

This is essentially what we have with Large Language Models (LLMs). They are complex mathematical objects sculpted by gradient descent, yet their emergent behaviors remain so opaque that they might as well be artifacts from another world. The training process that creates them is akin to an alien manufacturing line—we can replicate it but struggle to comprehend its inner workings.

The Cargo Cult Response

Faced with this enigmatic artifact, two distinct approaches have emerged. One camp, exemplified by frameworks like DSPy, has chosen to treat the LLM as a magical black box. The methodology involves probing it with various textual inputs, observing the outputs, retaining what appears to "work," and then cloaking this essentially random experimentation in academic terminology. This is the DSPy approach: using one poorly understood artifact (an LLM) to generate prompts for another, labeling this process "optimization."

The DSPy framework can be seen as the pinnacle of cargo cult science in AI. Much like Pacific islanders who built bamboo control towers hoping to summon cargo planes, DSPy constructs elaborate architectures of "optimizers" and "teleprompters" in the hope of conjuring better performance from LLMs. It employs terms like "Bayesian optimization" and "Pareto frontiers" (concepts with precise, rigorous meanings in well-understood mathematical domains) and applies them to the semantic noise of prompt engineering, where they often lose substantive meaning.
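
For reference, the precise meaning of a Pareto frontier fits in a few lines. The sketch below is a generic illustration and is not taken from DSPy; the function names and the (accuracy, negated cost) example points are hypothetical. It assumes every objective is maximized and, importantly, that the scores are exact rather than noisy.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective and
    strictly better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (accuracy, negated cost) pairs, made up for illustration.
    candidates = [(0.70, -3.0), (0.65, -1.0), (0.60, -2.0), (0.70, -4.0)]
    print(pareto_front(candidates))
```

The definition is only as meaningful as the numbers fed into it: if each coordinate is a noisy estimate, membership in the frontier can change from one evaluation run to the next.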

The Illusion of Academic Rigor

What makes DSPy particularly concerning is its veneer of academic legitimacy. Hailing from prestigious institutions and wrapped in the language of peer-reviewed conferences (e.g., ICLR), it carries the imprimatur of scientific authority. However, stripping away the credentials reveals a core process that is remarkably simplistic and ungrounded.

At its heart, the purported "optimization" can be crudely summarized by the following pseudo-code logic:

current_prompt = "Solve this"
while hoping_for_improvement:
    new_prompt = llm.suggest_variation(current_prompt)  # One black box queries another
    if accidentally_scores_higher():  # Evaluation via yet another noisy process
        current_prompt = new_prompt
        publish_paper()

They are essentially using one alien artifact (e.g., GPT-4) to generate random variations of text to feed into another artifact (e.g., Gemini), then declaring "optimization" when random noise occasionally produces a marginally higher score on an arbitrary metric. It is analogous to using one Ouija board to calibrate another.
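
The statistical worry here, that picking the best of many noisy scores manufactures an apparent gain, can be sketched with a small simulation. This is an illustration under stated assumptions, not a model of any real framework: every "prompt variant" is given the same true accuracy, and the function name and the parameters `true_acc`, `n_variants`, and `n_examples` are hypothetical.

```python
import random


def simulate_selection(true_acc=0.60, n_variants=20, n_examples=50, seed=0):
    """Score n_variants prompts that all share the same true accuracy.

    Returns (best observed score, score of a fresh re-test of the 'winner').
    """
    rng = random.Random(seed)

    def noisy_eval():
        # Each example is answered correctly with probability true_acc.
        hits = sum(rng.random() < true_acc for _ in range(n_examples))
        return hits / n_examples

    scores = [noisy_eval() for _ in range(n_variants)]
    best = max(scores)
    # The "winner" is no better than the others, so re-testing it on fresh
    # examples draws from the same distribution as every other evaluation.
    retest = noisy_eval()
    return best, retest


if __name__ == "__main__":
    trials = [simulate_selection(seed=s) for s in range(200)]
    mean_best = sum(b for b, _ in trials) / len(trials)
    mean_retest = sum(r for _, r in trials) / len(trials)
    print(f"mean best-of-20 score: {mean_best:.3f}")
    print(f"mean re-test score:    {mean_retest:.3f}")
```

Under these assumptions the best-of-20 score sits well above the true accuracy while the re-test falls back toward it, which is why a held-out evaluation is needed before any selected prompt is called an improvement.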

The state of its code repository often tells the true story: basic functionalities are buggy, token limits are improperly handled, model integrations fail, and the so-called "optimized" prompts frequently underperform carefully handcrafted alternatives. Reports of artificially inflated GitHub star counts—exceeding actual download figures—paint a picture of a Potemkin framework, designed more to impress venture capitalists and conference reviewers than to solve genuine engineering problems.

A Symptom of a Broader Malady

The uncomfortable truth is that DSPy represents an acute symptom of a more pervasive disease within the LLM application field. A significant portion of what is currently branded as "LLM engineering" suffers from a similar lack of grounding. It involves practitioners poking at inscrutable models with various proverbial sticks, retaining what seems to work in an isolated test, without developing a fundamental understanding of why it works.

How many "prompt engineering guides" are merely collections of accumulated superstitions? How many proclaimed "best practices" are simply patterns that worked on a specific dataset on Tuesday but fail unpredictably on Wednesday? How many frameworks are ultimately just elaborate systems for automating and dignifying random prodding?

The entire discipline of prompt engineering often evokes the feeling of medieval alchemy—a compendium of recipes and incantations devoid of underlying theory. The directive to "add 'let's think step by step' to your prompt" is our modern equivalent of "add eye of newt to the cauldron." Sometimes it yields a better result, we don't have a robust explanation for why, but the practice persists.

The Nature of the Artifacts: They Are Mathematical

The profound tragedy lies in the fact that these LLMs are mathematical objects, not magical ones. The "aliens" who built them are mathematicians and the laws of calculus. The process of gradient descent that forges these models follows precise, deterministic mathematical rules. The artifacts possess internal structure, their behaviors have causes, and they exhibit regularities that can be discovered and exploited.

A number of research labs recognize this fundamental truth and are pursuing quantitatively grounded approaches:

Anthropic pioneers research into mechanistic interpretability, striving to reverse-engineer the actual computational circuits within these models.
OpenAI (in some of its research) examines log probabilities and confidence distributions to rigorously quantify model uncertainty.
DeepMind studies scaling laws and emergent behaviors with mathematical rigor, seeking predictable principles.

These teams treat the artifacts for what they are: immensely complex but ultimately comprehensible mathematical objects that can be understood through careful experimentation, measurement, and theory-building. They analyze attention patterns, trace information flow across layers, and measure uncertainty in terms of log probabilities. They are engaged in science.
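
As a concrete illustration of measuring uncertainty in terms of log probabilities, the sketch below computes three standard quantities from per-token log probabilities. It assumes the log probabilities have already been obtained from some model; the numbers in the usage section are invented for illustration and do not come from any real system.

```python
import math


def sequence_logprob(token_logprobs):
    """Total log probability of a sequence: the sum of its token log probabilities."""
    return sum(token_logprobs)


def perplexity(token_logprobs):
    """exp of the negative mean token log probability; lower means more confident."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def entropy(probs):
    """Shannon entropy, in nats, of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Hypothetical per-token log probabilities, made up for illustration.
    confident = [-0.05, -0.10, -0.02]
    uncertain = [-1.20, -2.30, -0.90]
    print(perplexity(confident), perplexity(uncertain))
    # A peaked distribution has lower entropy than a flat one.
    print(entropy([0.97, 0.02, 0.01]), entropy([0.4, 0.3, 0.3]))
```

Unlike a pass/fail score from a judge, these quantities are deterministic functions of the model's output distribution, so two runs on the same input give the same measurement.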

DSPy as Anti-Science

In stark contrast, DSPy embodies an anti-scientific approach to LLM development. Rather than attempting to understand the artifact, it constructs Rube Goldberg machines around it. Instead of measuring tangible, informative quantities (e.g., log probabilities, attention weight distributions, gradient flows), it treats noisy, task-specific evaluation scores as primary signals. In place of developing testable theory, it proliferates jargon.

The GEPA (Genetic-Pareto) optimizer perfectly exemplifies this. It employs "evolutionary algorithms" to evolve code that prods the LLM differently. While reporting a 5.5% improvement on the ARC-AGI benchmark, this "gain" is rendered meaningless when contextualized: models score 3-4% on a benchmark where humans achieve ~60%. This is not meaningful progress; it is optimizing the path from "complete failure" to "complete failure with slightly different noise."

The Core Epistemological Failure

The fundamental epistemological flaw of DSPy is its treatment of semantic/prompt variation as a continuous, optimizable space in the classical sense. Genuine optimization requires:

A measurable, stable objective function – not noisy, black-box evaluations from another uncalibrated LLM.
An understood relationship between inputs and outputs – not a heuristic like "maybe this synonym works better sometimes."
A theory of change – not merely "let's try different things and see what happens."

DSPy possesses none of these. Its process is more accurately described as:

new_prompt = random_walk_in_semantic_space(old_prompt)
if coin_flip_says_better():  # Where the coin flip is a noisy LLM judge
    claim_optimization()

Yet, this is packaged and presented as "systematic optimization."
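
By contrast, a minimal guard against accepting a change on a coin flip is to ask whether the observed difference survives resampling of the evaluation set. The sketch below uses a paired percentile bootstrap, a standard technique that is swapped in here for illustration and is not attributed to any framework. The function name, its parameters, and the example scores are hypothetical, and per-example scores are assumed to be available for both prompts on the same examples.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return the mean difference (b - a) and a 95% percentile bootstrap
    interval, resampling evaluation examples with replacement."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired per example")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, (lo, hi)


if __name__ == "__main__":
    # Hypothetical 0/1 correctness per example for an old and a new prompt.
    old = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
    new = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
    diff, (lo, hi) = paired_bootstrap(old, new)
    print(f"mean gain {diff:+.2f}, 95% interval [{lo:+.2f}, {hi:+.2f}]")
    print("accept" if lo > 0 else "not distinguishable from noise")
```

Accepting a new prompt only when the lower bound is above zero does not supply a theory of why the prompt works, but it does separate a repeatable gain from favorable variance.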

Conclusion: The Path Forward

The ultimate irony is that these are not truly alien artifacts—they are human-made mathematical creations whose complexity currently outstrips our full understanding. The "aliens" are the mathematical principles of high-dimensional spaces and gradient-based optimization. These artifacts obey mathematical laws, possess mathematical structure, and exhibit mathematical regularities.

Frameworks like DSPy treat them as magic because doing so is easier than the arduous work of genuine understanding. It is easier to build a system that randomly permutes prompts than to decipher why a specific prompt elicits a desired chain of thought. It is easier to claim "optimization" than to admit one is merely hunting for favorable noise in a vast, unexplored space.

The research teams dedicated to mechanistic interpretability, rigorous uncertainty quantification, and theory-building are illuminating the viable path forward. They respect the artifacts as what they are: complex but ultimately comprehensible mathematical objects amenable to careful scientific investigation.

DSPy represents the outcome of choosing cargo cult over science, theater over theory, and the appearance of sophistication over actual understanding. It is a framework constructed on semantic noise, searching for meaning in randomness, and claiming victory within variance. The alien artifact, our own mathematical creation, deserves far better than random prodding. It deserves actual science.
