# How to Improve LLM Testing Efficiency with Promptfoo: A 2026 Guide to Automated Evaluation (How It Works, Hands-On Steps, FAQ, and Optimization Tips)

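Are you still frustrated by how slowly you can test LLM (large language model) applications? Comparing prompts and model outputs by hand is time-consuming and makes it hard to validate application quality at scale. Promptfoo is a dedicated LLM testing tool that automates the evaluation of prompts, models, and RAG (retrieval-augmented generation) systems, which can greatly reduce manual testing effort. This article walks you through Promptfoo from installation to advanced usage.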

## Installation

Promptfoo can be installed in several ways. Pick whichever one fits your environment:

### Global Installation via npm

```bash
npm install -g promptfoo
```

### Temporary Use via npx

```bash
npx promptfoo@latest
```

### Installation via Homebrew (for macOS Users)

```bash
brew install promptfoo
```

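After installing, verify that the installation succeeded:

```bash
promptfoo --version
```

A successful installation prints the version number, for example 0.114.7. If you only use npx, run `npx promptfoo@latest --version` instead. Official installation docs: site/docs/installation.md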

## Getting Started

Once Promptfoo is installed, use the init command to create your first test project:

### Using an Example Project

```bash
npx promptfoo@latest init --example getting-started
```

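### Interactive Configuration Creation

To build a custom configuration instead, run the init command without the example argument:

```bash
npx promptfoo@latest init
```

The command walks you through an interactive setup and creates a test environment suited to your project.

After initialization, a promptfooconfig.yaml configuration file and related test assets are created in the current directory. Getting-started guide: site/docs/getting-started.md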

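## Configuration Deep Dive

Promptfoo's core configuration file is promptfooconfig.yaml, which defines every aspect of a test run. A complete configuration has three main parts: prompts, model providers, and test cases.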
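By default an eval run crosses these three parts: every prompt is sent to every provider for every test case, so the number of model calls is the product of the three list lengths. The JavaScript sketch below illustrates that cross product using values from the snippets in this article; `expandMatrix` and the plain-object shapes are hypothetical and not part of Promptfoo's API.

```javascript
// Hypothetical sketch: enumerate every (prompt, provider, test) combination,
// mirroring how an eval run crosses the three parts of promptfooconfig.yaml.
function expandMatrix(prompts, providers, tests) {
  const cells = [];
  for (const prompt of prompts) {
    for (const provider of providers) {
      for (const test of tests) {
        cells.push({ prompt, provider, vars: test.vars });
      }
    }
  }
  return cells;
}

// Example values taken from the configuration snippets in this article.
const prompts = [
  'Convert this English to {{language}}: {{input}}',
  'Translate to {{language}}: {{input}}',
];
const providers = ['openai:gpt-5', 'openai:gpt-5-mini'];
const tests = [
  { vars: { language: 'French', input: 'Hello world' } },
  { vars: { language: 'Spanish', input: 'Where is the library?' } },
];

const cells = expandMatrix(prompts, providers, tests);
// 2 prompts x 2 providers x 2 tests = 8 model calls
console.log(cells.length); // prints: 8
```

Because each extra prompt or provider multiplies the total, a quick count like this is a cheap way to estimate API cost before a run.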

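### Defining Prompts

In the configuration file, the prompts field lists the prompt templates to test. Each entry can be an inline template or a reference to an external file:

```yaml
prompts:
  - 'Convert this English to {{language}}: {{input}}'
  - 'Translate to {{language}}: {{input}}'
  # Reference an external prompt file
  - file://prompts.txt
```

Variables are written in double curly braces, {{variable_name}}, and are replaced at run time with the values from each test case. Prompt configuration details: site/docs/configuration/prompts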
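To make the substitution step concrete, here is a simplified JavaScript sketch that replaces `{{name}}` placeholders with test-case values. It is an illustration only: Promptfoo itself renders prompts with the Nunjucks template engine, which also supports filters, loops, and conditionals that this regex-based `renderTemplate` (a hypothetical helper) does not.

```javascript
// Simplified stand-in for template rendering: replace each {{ name }}
// placeholder with the matching value from vars. Unknown names are left
// untouched so missing variables are easy to spot in the rendered prompt.
function renderTemplate(template, vars) {
  return template.replace(/\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? String(vars[name]) : match
  );
}

const prompt = 'Convert this English to {{language}}: {{input}}';
console.log(renderTemplate(prompt, { language: 'French', input: 'Hello world' }));
// prints: Convert this English to French: Hello world
```

Leaving unknown placeholders in place is a deliberate choice for debugging; a real template engine may instead render missing variables as empty strings.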

### Configuring Model Providers

The providers field specifies the AI models to test. Promptfoo supports more than 50 providers, including mainstream APIs such as OpenAI, Anthropic, and Google, as well as local models run through Ollama:

```yaml
providers:
  - openai:gpt-5
  - openai:gpt-5-mini
  - anthropic:messages:claude-sonnet-4-20250514
  - vertex:gemini-2.5-pro
  # Local model served by Ollama
  - ollama:llama3.1
  # Reference a custom provider script
  - file://path/to/custom/provider.py
```

Most hosted models require credentials such as API keys, which are usually supplied through environment variables:

```bash
export OPENAI_API_KEY=sk-your-key
export ANTHROPIC_API_KEY=your-key
```

Supported providers: site/docs/providers
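Before a long run, it can help to confirm that the API keys your providers need are actually set. The sketch below maps provider ID prefixes to the two environment variables shown above; the `REQUIRED_KEYS` table and the `missingKeys` helper are hypothetical, and providers that authenticate differently (for example Vertex AI, a local Ollama server, or a `file://` script) are simply skipped.

```javascript
// Hypothetical pre-flight check: which API-key environment variables are
// missing for the configured providers? Only the two variables from the
// export example are mapped; other provider prefixes are ignored.
const REQUIRED_KEYS = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
};

function missingKeys(providers, env) {
  const missing = new Set();
  for (const id of providers) {
    const prefix = id.split(':')[0]; // e.g. "openai" from "openai:gpt-5"
    const key = REQUIRED_KEYS[prefix];
    if (key && !env[key]) {
      missing.add(key);
    }
  }
  return [...missing];
}

const configured = ['openai:gpt-5', 'anthropic:messages:claude-sonnet-4-20250514', 'ollama:llama3.1'];
console.log(missingKeys(configured, { OPENAI_API_KEY: 'sk-your-key' }));
// prints: [ 'ANTHROPIC_API_KEY' ]
```

In a Node.js script you would pass `process.env` as the second argument.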

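### Creating Test Cases

The tests field defines input variables and expected results. Each test case contains variable values and optional assertions:

```yaml
tests:
  - vars:
      language: French
      input: Hello world
    assert:
      # icontains is case-insensitive, so "Bonjour" also matches
      - type: icontains
        value: bonjour
  - vars:
      language: Spanish
      input: Where is the library?
    assert:
      - type: icontains
        value: biblioteca
```

Test case configuration: site/docs/configuration/guide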

## Running Tests

Once the configuration is ready, run the tests with the eval command:

```bash
npx promptfoo@latest eval
```

### Command Line Output

While the tests run, the terminal shows live progress followed by a results summary.

### Generating HTML Reports

Add the -o flag to write a detailed HTML report:

```bash
npx promptfoo@latest eval -o output.html
```

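### Viewing the Web Interface

After the run, open the interactive web interface with the view command:

```bash
npx promptfoo@latest view
```

The web interface provides detailed result views and side-by-side comparisons, and lets you filter and sort results along several dimensions.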

## Advanced Features

### Automatically Evaluating Output Quality

Assertions let you automatically check whether model output meets expectations. Promptfoo supports many assertion types:

#### Content Checking

```yaml
assert:
  - type: contains
    value: expected substring
  - type: not-contains
    value: forbidden content
  - type: equals
    value: exact match
  - type: starts-with
    value: beginning of response
```
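The four checks above are plain string comparisons. The JavaScript sketch below approximates their pass/fail semantics so you can reason about edge cases such as case sensitivity; it is a simplified model written for this article, not Promptfoo's source code, and `runAssertions` is a hypothetical helper. Note that `contains` is case-sensitive; Promptfoo also offers `icontains` for case-insensitive matching.

```javascript
// Simplified model of basic string assertions (illustration only).
// Each check returns true when the assertion would pass.
const checks = {
  contains: (output, value) => output.includes(value),
  'not-contains': (output, value) => !output.includes(value),
  equals: (output, value) => output === value,
  'starts-with': (output, value) => output.startsWith(value),
  // Case-insensitive variant of "contains"
  icontains: (output, value) => output.toLowerCase().includes(value.toLowerCase()),
};

function runAssertions(output, assertions) {
  return assertions.map(({ type, value }) => {
    const check = checks[type];
    if (!check) throw new Error(`Unsupported assertion type: ${type}`);
    return { type, pass: check(output, value) };
  });
}

// Hypothetical model output: a translation that starts with a capital
// letter, so "contains" fails while "icontains" and "starts-with" pass.
const output = 'Bonjour le monde';
console.log(
  runAssertions(output, [
    { type: 'contains', value: 'bonjour' },
    { type: 'icontains', value: 'bonjour' },
    { type: 'starts-with', value: 'Bonjour' },
  ])
);
```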

#### LLM Scoring

Use a second LLM as a judge that grades the output against custom criteria:

```yaml
assert:
  - type: llm-rubric
    value: 'Is the answer clear, accurate, and concise?'
    threshold: 0.8 # Minimum score; llm-rubric scores are normally between 0 and 1
```
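Conceptually, the judge model returns a verdict with a score, and the threshold turns that score into pass/fail. The sketch below models that step under two stated assumptions: the judge replies with JSON containing `pass`, `score`, and `reason` fields, and scores are on a 0 to 1 scale, so a 10-point rating is divided by 10 before comparison. `gradeWithThreshold` is a hypothetical helper, not Promptfoo's implementation.

```javascript
// Illustrative pass/fail step for an LLM-as-judge assertion.
// Assumptions: the judge returns JSON like {"pass": true, "score": 0.9, "reason": "..."}
// and scores are on a 0-1 scale. JSON.parse throws if the reply is not valid JSON.
function gradeWithThreshold(judgeReply, threshold) {
  const verdict = JSON.parse(judgeReply);
  const score = Number(verdict.score);
  if (!Number.isFinite(score)) {
    return { pass: false, score: 0, reason: 'Judge reply had no numeric score' };
  }
  // When a threshold is set, this sketch decides pass/fail from the score alone.
  return { pass: score >= threshold, score, reason: verdict.reason ?? '' };
}

// Converting a 10-point rating to the 0-1 scale before comparison.
const tenPointRating = 8;
const reply = JSON.stringify({ pass: true, score: tenPointRating / 10, reason: 'Clear and accurate' });
console.log(gradeWithThreshold(reply, 0.8).pass); // prints: true
```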

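#### Custom JavaScript Evaluation

Write inline JavaScript for more complex custom evaluations. The snippet below reads the model response from `output` and returns a numeric score: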

```yaml
assert:
  - type: javascript
    value: |
      // Length-based score: 1 for outputs up to 100 characters,
      // falling linearly to 0 at 1000 characters or more
      Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));
```
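The length-scoring expression above is easier to check as a standalone function. Inside a `javascript` assertion the model response is available as `output`; here it is an ordinary parameter of a hypothetical `lengthScore` function so the formula can be tested in isolation.

```javascript
// Same formula as the javascript assertion: full score (1) up to 100
// characters, falling linearly to 0 at 1000 characters and beyond.
function lengthScore(output) {
  return Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));
}

for (const n of [50, 100, 550, 1000, 2000]) {
  console.log(n, lengthScore('x'.repeat(n)));
}
// prints one pair per line: 50 1, 100 1, 550 0.5, 1000 0, 2000 0
```

The two clamps are what keep the score inside [0, 1]: without `Math.min`, short outputs would score above 1, and without `Math.max`, very long outputs would go negative.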

Assertion configuration details: site/docs/configuration/expected-outputs

### Model Comparison Analysis

Promptfoo can test several models in one run, which makes side-by-side comparison easy. The following configuration compares GPT-5 with Claude:

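```yaml
prompts:
  - "Summarize this text in 50 words: {{input}}"
providers:
  - openai:gpt-5
  - anthropic:messages:claude-sonnet-4-20250514
tests:
  - vars:
      input: "Artificial intelligence (AI) is a branch of computer science dedicated to creating systems that can simulate human intelligence. These systems can learn, reason, adapt, and perform tasks that normally require human intelligence."
  - vars:
      input: "Climate change refers to long-term changes in Earth's climate system, including shifts in global average temperature, precipitation patterns, and the frequency of extreme weather events. Greenhouse gas emissions, mainly from human activity, are the primary driver of current climate change."
```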

After the run, the web interface lets you compare the output quality of each model side by side.

Model comparison tutorial: examples/openai-model-comparison
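The comparison config above defines no assertions, so every output has to be judged by eye. One way to add an objective check is a word limit matching the "50 words" instruction. The hypothetical `wordLimitScore` function below sketches such a check in JavaScript; its logic could be adapted into a `javascript` assertion, and its whitespace-based word splitting is an assumption that undercounts languages written without spaces, such as Chinese or Japanese.

```javascript
// Hypothetical word-limit check for "Summarize this text in 50 words".
// Words are counted by splitting on whitespace.
function wordLimitScore(output, limit = 50) {
  const words = output.trim().split(/\s+/).filter(Boolean);
  // Full score at or under the limit; otherwise the score shrinks in
  // proportion to how far the summary overshoots.
  return words.length <= limit ? 1 : limit / words.length;
}

const shortSummary = 'AI is a branch of computer science that builds systems able to learn and reason.';
console.log(wordLimitScore(shortSummary)); // prints: 1
console.log(wordLimitScore('word '.repeat(100))); // prints: 0.5
```

A proportional score rather than a hard pass/fail makes it easier to see in the web interface which model overshoots the limit, and by how much.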

### Security Testing (Red Teaming)

Promptfoo also includes red-teaming features that help you uncover security vulnerabilities in LLM applications:

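```bash
# Create a red-team configuration, then generate and run the attack tests
npx promptfoo@latest redteam init
npx promptfoo@latest redteam run
```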

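Red teaming applies a range of attack strategies that try to induce the model to produce inappropriate content, exposing potential security risks. When the run finishes, Promptfoo produces a detailed risk report.

Red-teaming guide: site/docs/red-team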

## Practical Examples

### Translation Application Testing

The following test configuration for a translation app compares how different prompts and models perform on multilingual translation:

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Translation App Test - Comparing Translation Quality Across Prompts and Models

prompts:
  - 'Translate the following to {{language}}: {{input}}'
  - 'In {{language}}, this would be: {{input}}'
  - 'Convert the text to {{language}}, keeping the original meaning: {{input}}'

providers:
  - openai:gpt-5-mini
  - anthropic:messages:
```

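## Frequently Asked Questions (FAQ)

### What installation methods does Promptfoo support?

Promptfoo can be installed globally with npm, run on demand with npx, or installed with Homebrew (macOS). After installing, run `promptfoo --version` to check the version number.

### How do I get started with Promptfoo quickly?

Run `npx promptfoo@latest init --example getting-started` to create an example project, or run `npx promptfoo@latest init` for interactive setup. Either command generates a promptfooconfig.yaml configuration file.

### How are prompts defined in Promptfoo's core configuration file?

In the prompts field of promptfooconfig.yaml, either define templates inline (for example, 'Convert this English to {{language}}: {{input}}') or reference external files. Variables use the {{variable_name}} syntax.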
