
How to Improve LLM Testing Efficiency with Promptfoo: The Complete 2026 Guide to Automated Evaluation

2026/3/14
AI Summary (BLUF)

Promptfoo is a comprehensive LLM testing tool that automates prompt, model, and RAG system evaluation, significantly improving testing efficiency through features like automated assertions, model comparison, and security testing.

Are you still struggling with inefficient testing of your LLM (large language model) applications? Manually comparing different prompts and model outputs is slow and labor-intensive, and it makes validating application quality at scale difficult. Promptfoo, a purpose-built LLM testing tool, helps you automatically evaluate prompts, models, and RAG (Retrieval-Augmented Generation) systems, significantly improving testing efficiency. This article takes you from installation to advanced usage so you can master Promptfoo end to end.

Installation

Promptfoo supports multiple installation methods to meet the needs of different users. Choose any of the following methods based on your environment:

Global Installation via npm

npm install -g promptfoo

Temporary Use via npx

npx promptfoo@latest

Installation via Homebrew (for macOS Users)

brew install promptfoo

After installation, verify that it succeeded with the following command:

promptfoo --version

A successful installation will display the version number, such as 0.114.7. Official installation documentation: site/docs/installation.md

Getting Started

After installation, quickly create your first test project using the initialization command:

Using an Example Project

npx promptfoo@latest init --example getting-started

Interactive Configuration Creation

If you need a custom configuration, you can run the initialization command without the example parameter:

npx promptfoo@latest init

This command will guide you through an interactive configuration process to create a testing environment suitable for you.

After initialization, a promptfooconfig.yaml configuration file and related test resources will be generated in the current directory. Getting Started Guide: site/docs/getting-started.md
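
For reference, a minimal promptfooconfig.yaml combines the three sections described below: prompts, providers, and tests (the model name here is illustrative):

```yaml
description: My first eval
prompts:
  - 'Convert this English to {{language}}: {{input}}'
providers:
  - openai:gpt-5-mini
tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains
        value: bonjour
```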

Configuration Deep Dive

The core configuration file for Promptfoo is promptfooconfig.yaml. Through this file, you can define all aspects of your tests. A complete configuration consists of three main parts: prompts, model providers, and test cases.

Defining Prompts

In the configuration file, use the prompts field to define the prompt templates to be tested. You can define them inline or reference external files:

prompts:
  - 'Convert this English to {{language}}: {{input}}'
  - 'Translate to {{language}}: {{input}}'
  # Reference an external prompt file
  - file://prompts.txt

Variables are defined in prompts using double curly braces {{variable_name}}, which are dynamically replaced with values from test cases during execution. Prompt configuration details: site/docs/configuration/prompts
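
For the file:// reference above, a prompt file is plain text; by convention promptfoo treats a line of `---` as a separator when one file holds several prompts (check the prompts docs for details — the file contents here are illustrative):

```text
Translate the following into {{language}}: {{input}}
---
You are a professional translator. Render this text in {{language}}: {{input}}
```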

Configuring Model Providers

The providers field specifies the AI models to test. Promptfoo supports over 50 model providers, including mainstream APIs such as OpenAI, Anthropic, and Google, as well as local models such as those served by Ollama:

providers:
  - openai:gpt-5
  - openai:gpt-5-mini
  - anthropic:messages:claude-sonnet-4-20250514
  - vertex:gemini-2.5-pro
  # Custom local model
  - ollama:llama3.1
  # Reference a custom provider script
  - file://path/to/custom/provider.py

Most models require authentication information such as API keys, typically configured via environment variables:

export OPENAI_API_KEY=sk-your-key
export ANTHROPIC_API_KEY=your-key

List of supported models: site/docs/providers
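
Beyond the shorthand strings above, a provider entry can carry per-model options by switching to an object with id and config keys; a sketch with illustrative parameter values:

```yaml
providers:
  - id: openai:gpt-5-mini
    config:
      temperature: 0.2
      max_tokens: 256
```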

Creating Test Cases

The tests field defines test input variables and expected results. Each test case contains variable values and optional assertion conditions:

tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains
        value: bonjour
  - vars:
      language: Spanish
      input: Where is the library?
    assert:
      - type: contains
        value: biblioteca

Test case configuration: site/docs/configuration/guide
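
Under the hood, each test case is produced by substituting its vars into the prompt template's double-brace placeholders. Promptfoo uses Nunjucks-style templates; this sketch only illustrates the basic substitution, not promptfoo's actual template engine:

```javascript
// Minimal stand-in for {{variable}} substitution. Unknown placeholders
// are left untouched rather than replaced with an empty string.
function renderPrompt(template, vars) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match
  );
}

console.log(
  renderPrompt("Convert this English to {{language}}: {{input}}", {
    language: "French",
    input: "Hello world",
  })
); // Convert this English to French: Hello world
```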

Running Tests

After configuration is complete, use the eval command to execute the tests:

npx promptfoo@latest eval

Command Line Output

During test execution, real-time progress and result summaries are displayed in the terminal:

Command Line Test Results

Generating HTML Reports

Add the -o parameter to generate a detailed HTML report:

npx promptfoo@latest eval -o output.html

Viewing the Web Interface

After a run completes, use the view command to open the interactive web interface and inspect the results:

npx promptfoo@latest view

The web interface provides rich result display and comparison features, supporting filtering and sorting results by various dimensions:

Web Interface Test Results

Advanced Features

Automatically Evaluating Output Quality

Through the assertions feature, you can automatically evaluate whether model outputs meet expectations. Promptfoo supports multiple assertion types:

Content Checking

assert:
  - type: contains
    value: expected substring
  - type: not-contains
    value: forbidden content
  - type: equals
    value: exact match
  - type: starts-with
    value: beginning of response
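
Conceptually, these deterministic checks reduce to simple string predicates; the sketch below illustrates their semantics (it is not promptfoo's internal implementation):

```javascript
// Illustrative semantics of the deterministic string assertions.
function checkAssertion(type, value, output) {
  switch (type) {
    case "contains":     return output.includes(value);
    case "not-contains": return !output.includes(value);
    case "equals":       return output === value;
    case "starts-with":  return output.startsWith(value);
    default: throw new Error(`unknown assertion type: ${type}`);
  }
}

console.log(checkAssertion("contains", "bonjour", "bonjour tout le monde")); // true
```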

LLM Scoring

Use another LLM as a judge to evaluate outputs based on custom scoring criteria:

assert:
  - type: llm-rubric
    value: "Scoring Criteria: Is the answer clear, accurate, and concise? 10-point scale."
    threshold: 8 # Minimum score requirement

Custom JavaScript Evaluation

Write JavaScript functions for complex custom evaluations:

assert:
  - type: javascript
    value: |
      // Calculate output length score, lower score for longer outputs
      Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));

Assertion configuration details: site/docs/configuration/expected-outputs
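
The JavaScript assertion above returns a score in [0, 1]. As a standalone function, the same formula scores outputs of 100 characters or fewer at 1 and decays linearly to 0 at 1000 characters:

```javascript
// Length-based score: 1.0 up to 100 chars, linearly down to 0.0 at 1000 chars.
function lengthScore(output) {
  return Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));
}

console.log(lengthScore("x".repeat(100)));  // 1
console.log(lengthScore("x".repeat(550)));  // 0.5
console.log(lengthScore("x".repeat(1000))); // 0
```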

Model Comparison Analysis

Promptfoo can test multiple models at the same time, making side-by-side comparison straightforward. The following example configuration compares GPT-5 and Claude:

prompts:
  - "Summarize this text in 50 words: {{input}}"
providers:
  - openai:gpt-5
  - anthropic:messages:claude-sonnet-4-20250514
tests:
  - vars:
      input: "Artificial intelligence (AI) is a branch of computer science dedicated to creating systems that can simulate human intelligence. These systems can learn, reason, adapt, and perform tasks that normally require human intelligence."
  - vars:
      input: "Climate change refers to long-term changes in the Earth's climate system, including shifts in global average temperature, precipitation patterns, and the frequency of extreme weather events. Greenhouse gas emissions, driven mainly by human activity, are the primary driver of current climate change."

After running the tests, you can visually compare the output quality of different models through the web interface:

Model comparison example. Model comparison tutorial: examples/openai-model-comparison

Security Testing (Red Teaming)

Promptfoo also provides powerful red teaming functionality to help you discover security vulnerabilities in LLM applications:

npx promptfoo@latest redteam

Red team testing applies a variety of attack strategies that try to coax the model into generating inappropriate content, surfacing potential security risks. A detailed risk report is generated when the run completes:

Security risk report. Red team testing guide: site/docs/red-team
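
Red team runs can also be driven declaratively from promptfooconfig.yaml under a redteam key. The sketch below is illustrative only; the available plugin and strategy names vary by version, so confirm them against the red team guide:

```yaml
redteam:
  purpose: 'Customer support assistant for a retail site'
  plugins:
    - harmful
    - pii
  strategies:
    - jailbreak
    - prompt-injection
```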

Practical Examples

Translation Application Testing

The following is a complete translation application testing configuration example, comparing the performance of different prompts and models in multilingual translation tasks:

# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Translation App Test - Comparing Translation Quality Across Prompts and Models

prompts:
  - 'Translate the following to {{language}}: {{input}}'
  - 'In {{language}}, this would be: {{input}}'
  - 'Convert the text to {{language}}, keeping the original meaning: {{input}}'

providers:
  - openai:gpt-5-mini
  - anthropic:messages:

## FAQ

### Which installation methods does Promptfoo support?

Promptfoo supports global installation via npm, temporary use via npx, and installation via Homebrew (for macOS users). After installing, verify the version with `promptfoo --version`.

### How do I get started with Promptfoo quickly?

Run `npx promptfoo@latest init --example getting-started` to create an example project, or run `npx promptfoo@latest init` for interactive configuration; either generates a promptfooconfig.yaml configuration file.

### How are prompts defined in Promptfoo's core configuration file?

In the prompts field of promptfooconfig.yaml, you can define templates inline (e.g. 'Convert this English to {{language}}: {{input}}') or reference external files, using the {{variable_name}} syntax.
