大语言模型优化代码时,Python和Rust哪个性能更好?(附实测对比)
AI Summary (BLUF)
This article explores the effectiveness of using Large Language Models (LLMs) for code optimization through a practical example of finding numbers with specific digit sums. It compares Python and Rust implementations, revealing both the potential and limitations of LLM-assisted optimization, including missed human insights like algorithmic improvements.
原文翻译: 本文通过一个寻找数位和为特定值的数字的实践案例,探讨了使用大语言模型(LLM)优化代码性能的有效性。对比了 Python 和 Rust 实现,揭示了 LLM 辅助优化的潜力与局限,包括其遗漏的算法改进等人类洞察。
Beyond Prompts: What to Do When Large Language Models Fail at Code Optimization
引言:从一次代码优化实验说起
Introduction: Starting from a Code Optimization Experiment
Max Woolf 最近提出了一个有趣的问题:反复提示大语言模型(LLM)“写出更好的代码”,是否真的能让它写出更好的代码?他的答案可以简要概括为:“某种程度上可以,但存在上限”。然而,在他的文章中,一个明显的优化机会被所有提示都忽略了,这引发了我的思考:LLM 在哪些优化场景下会失效?我们又能走多远?
Max Woolf recently posed an interesting question: Does repeatedly prompting a large language model (LLM) to "write better code" actually make it write better code? His answer can be summarized as: "To some extent, yes, but there is a limit." However, in his article, an obvious optimization opportunity was missed by all prompts, which led me to ponder: In which optimization scenarios do LLMs fail? And how far can we go?
由于他的示例是新鲜的,不太可能已存在于 LLM 的训练语料库中,我将借用他的问题,探索在有或没有 LLM 辅助的情况下,我们能在优化之路上走多远。
Since his example is fresh and unlikely to be in the training corpus of LLMs, I will borrow his problem to explore how far we can go on the path of optimization, with or without the assistance of an LLM.
问题定义与初始实现
Problem Definition and Initial Implementation
我们从一个具体的编程问题开始。以下是 Max 使用的原始提示:
Let's start with a specific programming problem. Here is the original prompt used by Max:
```
Write Python code to solve this problem:
Given a list of 1 million random integers between 1 and 100,000,
find the difference between the smallest and the largest numbers
whose digits sum up to 30.
```
大语言模型的初始解决方案
Initial Solution by the Large Language Model
使用 GPT-4o 通过 Copilot 生成代码,得到了与 Max 文章中基本一致的 Python 实现:
Using GPT-4o via Copilot to generate code yields a Python implementation largely consistent with the one in Max's article:
```python
import random

# Generate a list of 1 million random integers between 1 and 100,000
random_integers = [random.randint(1, 100000) for _ in range(1000000)]

def digit_sum(n):
    return sum(int(digit) for digit in str(n))

# Filter numbers whose digits sum up to 30
filtered_numbers = [num for num in random_integers if digit_sum(num) == 30]

if filtered_numbers:
    min_number = min(filtered_numbers)
    max_number = max(filtered_numbers)
    difference = max_number - min_number
    print(f"The difference between the largest and smallest numbers whose digits sum up to 30 is: {difference}")
else:
    print("No numbers found with digits summing up to 30.")
```
这段代码在我的 M1 MacBook Pro 上运行耗时约 520 毫秒。这是一个不错的基准。从 Max 的文章中我们已经知道,通过几种技术可以使其快得多:
This code takes approximately 520 milliseconds to run on my M1 MacBook Pro. This is a good baseline. From Max's article, we already know it can be made much faster using several techniques:
- 使用 NumPy / 原生代码 (Using NumPy / Native Code)
- 并行化 / 向量化 (Parallelization / Vectorization)
- 减少冗余计算 (Reducing Redundant Work)
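The last item can be made concrete. Below is a hedged sketch, written in Rust (the language used for the rest of this article) with illustrative names of my own choosing rather than code from Max's article: since every input lies in 1..=100,000, the digit sum of each possible value can be precomputed once into a lookup table instead of being recomputed per element.

```rust
// Sketch of "reducing redundant work": precompute all digit sums once.
// `build_table` is an illustrative name, not from the original article.

fn digit_sum(mut n: usize) -> u8 {
    let mut s = 0u8;
    while n > 0 {
        s += (n % 10) as u8;
        n /= 10;
    }
    s
}

// One pass over all possible values (~100 KB of u8), built once.
fn build_table(max: usize) -> Vec<u8> {
    (0..=max).map(digit_sum).collect()
}

fn main() {
    let table = build_table(100_000);
    let numbers = [3_999usize, 42, 99_930, 100_000];
    // Filtering is now a single table index per element.
    let hits: Vec<usize> = numbers.iter().copied().filter(|&n| table[n] == 30).collect();
    println!("{:?}", hits);
}
```

Whether the table pays off depends on how many elements are scanned per possible value; with 1,000,000 elements drawn from 100,000 values, each entry is reused about ten times on average.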
切换到 Rust 以获得更清晰的性能视图
Switching to Rust for a Clearer Performance View
由于 Python 在解决此类问题时引入了一些额外的复杂性(如 JIT 或通过 NumPy 使用原生代码的影响),我们切换到 Rust 语言。在 Rust 中,性能选择的影响可以更清晰地展现。以下是 LLM 生成的相同逻辑的 Rust 初始代码:
Since Python introduces some additional complexity (such as the effects of JIT or using native code via NumPy) when solving this type of problem, we switch to the Rust language. In Rust, the impact of performance choices can be seen more clearly. Here is the initial Rust code generated by the LLM for the same logic:
```rust
use rand::Rng;

fn digit_sum(n: i32) -> i32 {
    n.to_string().chars().map(|c| c.to_digit(10).unwrap() as i32).sum()
}

fn main() {
    let mut rng = rand::thread_rng();
    let numbers: Vec<i32> = (0..1_000_000).map(|_| rng.gen_range(1..=100_000)).collect();

    let mut min_number = i32::MAX;
    let mut max_number = i32::MIN;

    for &number in &numbers {
        if digit_sum(number) == 30 {
            if number < min_number {
                min_number = number;
            }
            if number > max_number {
                max_number = number;
            }
        }
    }

    if min_number == i32::MAX || max_number == i32::MIN {
        println!("No numbers with digit sum of 30 found.");
    } else {
        let difference = max_number - min_number;
        println!("Difference between the largest and smallest numbers with digit sum of 30: {}", difference);
    }
}
```
有趣的是,Rust 代码使用了与 Python 代码相同的“字符串分割”方法来计算数位和。这段代码运行耗时 42 毫秒。从 Python 切换到 Rust 带来的速度提升:约 12 倍。
Interestingly, the Rust code uses the same "string splitting" method as the Python code to calculate the digit sum. This code takes 42 milliseconds to run. Speed improvement from switching Python to Rust: approximately 12x.
第一轮优化:LLM 的贡献与局限
First Round of Optimization: Contributions and Limitations of LLM
优化 `digit_sum` 函数
Optimizing the `digit_sum` Function
初始实现隐藏了一个问题:digit_sum 函数中“字符串分割”的方法非常慢。我们选中该函数代码并输入提示“make this code faster”,LLM 给出了一个改进版本:
The initial implementation hides an issue: the "string splitting" method in the `digit_sum` function is very slow. We select the function's code and enter the prompt "make this code faster"; the LLM provides an improved version:
```rust
fn digit_sum(n: i32) -> i32 {
    let mut sum = 0;
    let mut num = n;
    while num > 0 {
        sum += num % 10;
        num /= 10;
    }
    sum
}
```
这个修改是正确的。新版本的运行时间降至 13.2 毫秒。通过优化 digit_sum 函数带来的速度提升:约 3.2 倍。
This modification is correct. The runtime of the new version drops to 13.2 milliseconds. Speed improvement from optimizing the `digit_sum` function: approximately 3.2x.
值得注意的是,这个优化在 Python 中效果不明显,因为虽然算法更优,但它涉及更多 Python 层面的操作。
It is worth noting that this optimization is less effective in Python because, although the algorithm is better, it involves more Python-level operations.
LLM 在并行化尝试中的失败
LLM's Failure in Parallelization Attempt
接下来,我们选中主函数并再次要求 LLM “make it faster”。Copilot 尝试并行化对数字向量的检查,但 Rust 不允许多个线程同时修改同一个变量,因此生成的代码无法编译:
Next, we select the main function and again ask the LLM to "make it faster". Copilot attempts to parallelize the checking of the number vector, but Rust does not allow multiple threads to modify the same variable simultaneously, so the generated code fails to compile:
```
error[E0594]: cannot assign to `min_number`, as it is a captured variable in a `Fn` closure
```
至此,我们不得不暂时与 LLM 分道扬镳,尽管后续还会少量使用它。
At this point, we have to part ways with our LLM for now, although we will use it a little more later.
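For reference, the compile error is fixable by hand: instead of letting every thread mutate the shared `min_number`/`max_number`, each thread can track a local (min, max) over its own chunk, and the partial results can be merged at the end. Here is a minimal sketch using only the standard library's scoped threads; the function name, thread count, and the deterministic fill are my own choices for illustration, not code from the original article:

```rust
use std::thread;

// The math-based digit sum from the earlier optimization round.
fn digit_sum(mut n: i32) -> i32 {
    let mut s = 0;
    while n > 0 {
        s += n % 10;
        n /= 10;
    }
    s
}

// Each thread scans its own chunk and returns a local (min, max);
// no shared variable is mutated, so the borrow checker is satisfied.
fn min_max_digit_sum(numbers: &[i32], target: i32) -> Option<(i32, i32)> {
    const N_THREADS: usize = 4;
    let chunk_len = ((numbers.len() + N_THREADS - 1) / N_THREADS).max(1);
    let locals: Vec<(i32, i32)> = thread::scope(|scope| {
        let handles: Vec<_> = numbers
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    let (mut lo, mut hi) = (i32::MAX, i32::MIN);
                    for &n in chunk {
                        if digit_sum(n) == target {
                            lo = lo.min(n);
                            hi = hi.max(n);
                        }
                    }
                    (lo, hi)
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let lo = locals.iter().map(|&(lo, _)| lo).min()?;
    let hi = locals.iter().map(|&(_, hi)| hi).max()?;
    if lo <= hi { Some((lo, hi)) } else { None }
}

fn main() {
    // Deterministic stand-in for the random vector, so the sketch
    // runs without external crates.
    let numbers: Vec<i32> = (0..1_000_000).map(|i| i % 100_000 + 1).collect();
    match min_max_digit_sum(&numbers, 30) {
        Some((lo, hi)) => println!("Difference: {}", hi - lo),
        None => println!("No numbers with digit sum of 30 found."),
    }
}
```

A crate like rayon would express the same map-then-reduce pattern more compactly; the point is that the fix requires restructuring the reduction, which is exactly what the LLM's shared-mutation attempt missed.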
关键洞察:LLM 未能发现的人工优化
Key Insight: Human Optimization Missed by LLM
这里出现了我们的第一个人工优化,这是一个简单的优化,但我很惊讶 Max 的尝试或我基于 LLM 的优化尝试都没有发现:在计算相对昂贵的数位和之前,先检查数字本身是否有用。
Here comes our first human optimization, a simple one that I was quite surprised neither Max's attempts nor my LLM-based optimization attempts caught: Check if the number is useful before performing the relatively expensive digit sum test.
所有 LLM 生成的代码都走了相反的路:先按数位和过滤,然后再比较大小。这是一个需要手动完成但很简单的更改:
All LLM-generated code goes the opposite way: filter by digit sum first, then compare sizes. This is a manual change but an easy one:
```rust
for &number in &numbers {
    if number < min_number || number > max_number {
        if digit_sum(number) == 30 {
            if number < min_number {
                min_number = number;
            }
            if number > max_number {
                max_number = number;
            }
        }
    }
}
```
优化后的运行时间为 10.7 毫秒。速度提升:约 1.2 倍。这个提升比在 Python 代码中做相同改动带来的 5 倍提升小得多,因为 Rust 中优化后的 digit_sum 函数已经非常快了。
The optimized runtime is 10.7 milliseconds. Speed improvement: approximately 1.2x. This improvement is much smaller than the 5x speedup the same change produced in the Python code, because the optimized `digit_sum` function in Rust is already very fast.
深入分析:性能剖析与针对性优化
In-Depth Analysis: Profiling and Targeted Optimization
LLM 的一个主要限制是它无法对代码进行性能剖析(Profiling),但我们可以。通过性能剖析,我们发现 Rust 代码的一个主要开销竟然在于生成随机数向量。Rust 的 rand 包默认使用加密安全的随机数生成器(RNG),速度并不快。
A major limitation of LLMs is that they cannot profile the code, but we can. Profiling revealed that a major cost of the Rust code is, surprisingly, generating the vector of random numbers: Rust's `rand` crate defaults to a cryptographically secure random number generator (RNG), which is not very fast.
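Real profilers (e.g. `perf` on Linux, Instruments on macOS) give the full picture, but even a crude timer makes this kind of split visible. Below is a minimal sketch with `std::time::Instant`, using a deterministic fill as a stand-in for the `rand` call so that it runs without external crates; it times the two phases of the program separately:

```rust
use std::time::Instant;

fn digit_sum(mut n: i32) -> i32 {
    let mut s = 0;
    while n > 0 {
        s += n % 10;
        n /= 10;
    }
    s
}

fn main() {
    // Phase 1: building the vector. In the real program this is the
    // rand-based fill; a deterministic fill stands in here.
    let t0 = Instant::now();
    let numbers: Vec<i32> = (0..1_000_000).map(|i| i % 100_000 + 1).collect();
    let fill_time = t0.elapsed();

    // Phase 2: scanning for digit sums of 30.
    let t1 = Instant::now();
    let hits = numbers.iter().filter(|&&n| digit_sum(n) == 30).count();
    let scan_time = t1.elapsed();

    println!("fill: {:?}  scan: {:?}  hits: {}", fill_time, scan_time, hits);
}
```

Timing each phase separately is what surfaces a generation-side cost that no amount of prompting on the scanning code would reveal.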
我们通过切换到 fastrand 包来解决这个问题。请注意,我在这里“作弊”了,因为我选择它是为了便于后续引入一些并行化。我们通过 LLM 提示轻松完成了这个更改:
We address this by switching to the `fastrand` crate. Note that I am "cheating" here, because I chose it to make some parallelization we will introduce later easier. We made this change easily via an LLM prompt:
switch to using fastrand instead of rand
主要的代码变更如下:
The main code change is as follows:
```rust
let numbers: Vec<i32> = (0..1_000_000).map(|_| fastrand::i32(1..=100_000)).collect();
```
新的运行时间:2.8 毫秒。速度提升:约 3.8 倍。
New runtime: 2.8 milliseconds. Speed improvement: approximately 3.8x.
性能优化路径总结
Summary of Performance Optimization Path
下表总结了从初始 Python 实现到当前优化 Rust 实现的关键优化步骤及其性能影响:
The following table summarizes the key optimization steps from the initial Python implementation to the current optimized Rust implementation and their performance impact:
| 优化阶段 / Stage | 实现语言/关键改动 / Key Change | 运行时间 / Runtime (ms) | 累计加速比 / Cumulative Speedup (vs Python) | 阶段加速比 / Stage Speedup | 优化类型 / Optimization Type |
|---|---|---|---|---|---|
| 基线 | Python (LLM 初始代码) | 520 | 1x | - | 基线 / Baseline |
| 语言迁移 | Rust (直接移植) | 42 | ~12.4x | ~12.4x | 系统语言 / Systems Language |
| 算法优化 | Rust (优化 `digit_sum`) | 13.2 | ~39.4x | ~3.2x | 算法改进 / Algorithmic |
| 逻辑重排 | Rust (先比较后计算) | 10.7 | ~48.6x | ~1.2x | 廉价检查前置 / Cheap Check First |
| 依赖库优化 | Rust (使用 `fastrand`) | 2.8 | ~185.7x | ~3.8x | 第三方库 / Third-party Library |
结论与启示
Conclusion and Implications
本实验揭示了当前大语言模型在代码优化方面的能力边界:
This experiment reveals the current boundaries of large language models' capabilities in code optimization:
- 局部微观优化表现出色:LLM 擅长识别和重写低效的代码片段,例如将基于字符串的数位和计算改为基于数学运算的方法。
Excellent at Local Micro-Optimizations: LLMs are good at identifying and rewriting inefficient code snippets, such as changing string-based digit sum calculation to a math-based method.
- 全局算法与逻辑优化存在盲区:LLM 难以进行跨函数的逻辑流重构,例如未能发现“先进行廉价比较,再进行昂贵计算”这一关键优化模式。
Blind Spots in Global Algorithm and Logic Optimization: LLMs struggle with cross-functional logic flow restructuring, such as failing to discover the key optimization pattern of "perform cheap comparisons first, then expensive calculations."
- 缺乏系统与环境上下文:LLM 无法感知运行时的性能瓶颈(如随机数生成开销),也无法安全地进行涉及并发数据访问的复杂重构(如并行化)。
Lack of System and Environmental Context: LLMs cannot perceive runtime performance bottlenecks (such as random number generation overhead) and cannot safely perform complex refactoring involving concurrent data access (such as parallelization).
因此,虽然 LLM 是强大的辅助工具,可以快速提供初始实现和局部改进,但深入的性能优化仍然需要开发者的洞察力、系统知识以及性能剖析工具的结合。未来的方向可能是将 LLM 与性能剖析工具链更深度地集成,使其能够基于真实的运行时数据做出更明智的优化建议。
Therefore, although LLMs are powerful auxiliary tools that can quickly provide initial implementations and local improvements, in-depth performance optimization still requires the developer's insight, system knowledge, and the combination of profiling tools. A future direction may be deeper integration of LLMs with profiling toolchains, enabling them to make more informed optimization suggestions based on real runtime data.
常见问题(FAQ)
大语言模型优化代码时有哪些局限性?
What are the limitations of large language models when optimizing code?
LLM 在代码优化中存在明显局限。如本文所示,它可能错过关键的优化机会,而仅关注表面层的性能调整。
LLMs have clear limitations in code optimization. As this article shows, they can miss key optimization opportunities and focus only on surface-level performance tuning.
为什么在代码优化实验中要切换到 Rust 语言?
Why switch to Rust for the code optimization experiment?
切换到 Rust 可以更清晰地展示性能选择的影响,避免 Python 中 JIT 或 NumPy 等复杂因素的干扰,从而更准确地评估 LLM 优化与人工优化的差异。
Switching to Rust shows the impact of performance choices more clearly, avoiding confounding factors in Python such as JIT or NumPy, and thus allows a more accurate comparison of LLM-driven and human optimization.
当 LLM 无法优化代码时,开发者应该怎么做?
What should developers do when an LLM fails to optimize code?
开发者应结合人工洞察与性能剖析,深入分析问题本质。如本文所示,先做廉价比较再做昂贵的数位和计算、定位随机数生成瓶颈等优化,都超越了 LLM 的表面提示,带来更根本的性能提升。
Developers should combine human insight with profiling to analyze the essence of the problem. As this article shows, optimizations such as doing the cheap comparisons before the expensive digit sum calculation, or locating the random number generation bottleneck, go beyond surface-level LLM suggestions and deliver more fundamental speedups.