How Does the Boswell Test Compare AI Large Language Models Through Peer Review?

Overview

The Boswell Test is an automated tool for comparing Large Language Models (LLMs) through peer review: the models grade each other's essays. This implementation is based on the methodology Peter Luh introduced in his article "Is AI Chatbot My Boswell?" (February 2025).

[Figure: Boswell Test Domain Comparison]
[Figure: Aggregate Boswell Quotient Rankings]

🚀 New Modular Architecture

This repository now features a fully modular architecture for better maintainability and extensibility. Full details are in docs/technical/architecture.md.

Key Improvements:

- Clean package structure with separated concerns
- Response caching system that improves performance and reduces redundant API calls
- Enhanced reporting capabilities
- Domain creation utilities
- Expanded free model support with 12 additional LLMs
- Improved documentation and tooling

For detailed documentation, see the docs/ directory.

🚦 Quick Start

Get up and running with the Boswell Test framework in minutes:

# Clone the repository
git clone https://github.com/alanwilhelm/botwell.git
cd botwell

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the package
pip install -e .

# Set your API key
export OPENROUTER_API_KEY="your_api_key_here"

# Run a simple test with free models
botwell --domain pol_sci_1 --free

# Generate a summary report
botwell report --latest

# Run a test with specific models
botwell --domain pol_sci_1 --models "o3-mini-high" "Claude-3.7-Sonnet" "GPT-4o" "o1" "grok2-1212" "Qwen-Max" "Perplexity: Llama 3.1 Sonar 70B" "DeepSeek-R1-Full"

See docs/usage/quick_start.md for more details and docs/usage/advanced_usage.md for advanced usage scenarios.

🌟 How It Works

This tool automates the process of running a Boswell Test across multiple LLMs. Here's how it works:

- Essay Generation: The system prompts multiple LLMs with the same complex question in a specific domain (such as political science or computer science).
- Peer Evaluation: Each LLM grades the essays written by all other models, providing detailed feedback and assigning letter grades (A+, A, A-, etc.).
- Bias Analysis: The system analyzes grading patterns to identify which models grade more strictly or more leniently than the median.
- Boswell Quotient: A comprehensive score (0-100) is calculated for each model by weighting four components equally: performance (grades received), evaluation ability (grading consistency), efficiency (response time), and empathy (emotional intelligence).
- Visualization: The framework generates charts and graphs showing performance metrics, grading distributions, timing data, and Boswell Quotient rankings.
- Comprehensive Reporting: Results are organized in timestamped directories with easy-to-read tables in multiple formats (Markdown, ASCII, CSV, JSON).
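
The bias analysis step can be illustrated with a minimal Python sketch. The numeric grade scale (`GRADE_POINTS`), the `grader_bias` helper, the choice of a per-essay median as the reference point, and the sample data are all assumptions made for illustration; the framework's actual scale and data structures may differ.

```python
from statistics import mean, median

# Hypothetical letter-grade scale (an assumption, not the framework's own scale).
GRADE_POINTS = {
    "A+": 4.3, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
}

def grader_bias(grades):
    """Return each grader's mean deviation from the per-essay median grade.

    `grades` maps grader -> {author -> letter grade}. A negative value means
    the grader is stricter than the median; a positive value, more lenient.
    """
    # Collect the numeric grades received by each essay author.
    received = {}
    for given in grades.values():
        for author, letter in given.items():
            received.setdefault(author, []).append(GRADE_POINTS[letter])
    essay_median = {author: median(points) for author, points in received.items()}

    # Average how far each grader's grades sit from those medians.
    return {
        grader: mean(GRADE_POINTS[letter] - essay_median[author]
                     for author, letter in given.items())
        for grader, given in grades.items()
    }

# Made-up grades for three placeholder models.
sample = {
    "model_a": {"model_b": "B", "model_c": "B+"},
    "model_b": {"model_a": "A", "model_c": "A-"},
    "model_c": {"model_a": "A-", "model_b": "B+"},
}
bias = grader_bias(sample)
for grader, value in sorted(bias.items()):
    print(f"{grader}: {value:+.3f}")
```

In this made-up sample, `model_a` comes out as a strict grader, `model_b` as a lenient one, and `model_c` sits at the median.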

The Boswell Test methodology offers several advantages over traditional benchmarks:

- It captures nuanced evaluation capabilities, not just raw performance.
- It leverages LLMs' own analytical skills to provide detailed feedback.
- It reveals biases in how different models evaluate the same work.
- It creates a multidimensional view of model capabilities across different domains.
- It calculates a comprehensive "Boswell Quotient" that measures a model's ability to serve as an indispensable AI companion.
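
As a rough illustration of the equal weighting behind the Boswell Quotient, the sketch below averages four component scores. It assumes each component has already been normalized to a 0-100 scale; the `ComponentScores` class, its field names, and that normalization are hypothetical, and the framework's actual formula may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentScores:
    """Hypothetical container; each field is assumed to be pre-normalized to 0-100."""
    performance: float  # grades received
    evaluation: float   # grading consistency
    efficiency: float   # response time
    empathy: float      # emotional intelligence

def boswell_quotient(scores: ComponentScores) -> float:
    """Equal-weight (25% each) average of the four components."""
    parts = (scores.performance, scores.evaluation, scores.efficiency, scores.empathy)
    for value in parts:
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"component score out of range: {value}")
    return sum(parts) / len(parts)

example = ComponentScores(performance=90.0, evaluation=80.0, efficiency=70.0, empathy=60.0)
quotient = boswell_quotient(example)  # (90 + 80 + 70 + 60) / 4 = 75.0
print(quotient)
```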

All of this is automated through a simple command-line interface that handles the entire testing process, from essay generation to final report creation.

📋 Available Test Domains

The framework includes multiple testing domains, each with different difficulty levels:


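| Domain | Identifier | Description | Difficulty |
|---|---|---|---|
| Political Science | `pol_sci_1` | AI policy analysis | Level 1 |
| Political Science | `pol_sci_2` | AI governance analysis (strict grading) | Level 2 |
| Programming | `programming_1` | Programming fundamentals | Level 1 |
| Programming | `programming_2` | Advanced algorithms | Level 2 |
| Programming | `programming_3` | Competitive programming challenges | Level 3 |
| Humanities | `humanities_1` | Social philosophy | Level 1 |
| Computer Science | `comp_sci_1` | Algorithm analysis and complexity | Level 1 |
| Computer Science | `comp_sci_2` | System design for distributed applications | Level 2 |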
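
For scripting, the domain table above can be captured as plain data. The following sketch is illustrative only: the `DOMAINS` mapping and the `commands_for_level` helper are hypothetical and not part of the framework; only the identifiers, the levels, and the `--domain` and `--free` flags come from this document.

```python
# Domain identifiers and difficulty levels copied from the table above.
DOMAINS = {
    "pol_sci_1": 1, "pol_sci_2": 2,
    "programming_1": 1, "programming_2": 2, "programming_3": 3,
    "humanities_1": 1,
    "comp_sci_1": 1, "comp_sci_2": 2,
}

def commands_for_level(level, free=False):
    """Build one `botwell` argument list per domain at the given difficulty level."""
    extra = ["--free"] if free else []
    return [["botwell", "--domain", name, *extra]
            for name, lvl in DOMAINS.items() if lvl == level]

for argv in commands_for_level(3):
    # To execute a command, pass argv to subprocess.run(argv, check=True).
    print(" ".join(argv))
```

Building argument lists rather than shell strings keeps quoting simple if model names with spaces are added later.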

🛠️ Setup

Prerequisites

- Python 3.8+
- OpenRouter API key (get one at OpenRouter.ai)

Installation

Clone the repository:

git clone https://github.com/alanwilhelm/botwell.git
cd botwell

Create a virtual environment:
```
python -m venv venv
```
Activate the environment:
- On macOS/Linux:
```
source venv/bin/activate
```
- On Windows:
```
venv\Scripts\activate
```

Install dependencies:

# Either install dependencies directly
pip install -r requirements.txt
# Or install as a package (recommended): pip install -e .

Set your OpenRouter API key:
```
export OPENROUTER_API_KEY="your_api_key_here"
```
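
Before launching a long run, a script can verify that the key is present. This is a minimal sketch; the `require_api_key` helper is hypothetical, and only the `OPENROUTER_API_KEY` variable name and the placeholder value come from this document.

```python
import os

def require_api_key(env=None):
    """Return the OpenRouter key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    # Treat the README's placeholder value as "not configured".
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENROUTER_API_KEY is not set; export it before running botwell")
    return key
```

Calling `require_api_key()` with no argument reads the real environment; passing a dictionary makes the check easy to test.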

🚀 Usage

Basic Usage

Run a test with default settings:

botwell

This uses the newly added command-line interface, which provides a simpler way to run tests and manage the framework. With no arguments, it runs the basic political science test with all verified models.

Advanced Usage

Select a specific domain:

botwell --domain pol_sci_2

Run tests on all available domains:

botwell --all-domains

This runs the tests sequentially on every domain with the same set of models and creates a separate results directory for each domain. When multiple domains are tested, it also generates:

- An aggregate Boswell Quotient analysis across all domains
- Visualizations comparing model performance across domains
- Detailed reports identifying which models are consistent across domains and which are specialized in specific areas

Use specific models:

botwell --models "GPT-4o" "Claude-3-Opus" "Claude-3.7-Sonnet"

Combine options:

botwell --all-domains --models "GPT-4o" "Claude-3.7-Sonnet" --skip-verification

Skip model verification (faster but less reliable):

botwell --skip-verification

Configure retry attempts for API calls:

botwell --max-retries 5
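
Retrying matters because API calls can fail transiently. The sketch below shows one common approach, exponential backoff, under the assumption that `--max-retries` bounds the number of attempts made after the first failure. The `call_with_retries` helper and its parameters are hypothetical; the framework's actual retry policy is not described here.

```python
import time

def call_with_retries(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `func`, retrying up to `max_retries` times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.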

Custom output file (in addition to the organized results directory):

botwell --output custom_results.json

Model Management

Update the local models file with the models available on OpenRouter:

botwell --update-models

This command fetches the current list of available models from OpenRouter's API and saves it to a local JSON file. The output includes model IDs, context lengths, pricing information, and descriptions.
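
A saved model list can be filtered with a few lines of Python, for example to find free models. This sketch assumes each entry carries `id`, `context_length`, and `pricing` fields, with prices stored as strings; that layout is an assumption about the local JSON file, and the `free_models` helper and sample entries are hypothetical.

```python
def free_models(models):
    """Return the IDs of models whose prompt and completion prices are both zero."""
    result = []
    for entry in models:
        pricing = entry.get("pricing", {})
        # A missing price defaults to "1" so the model is treated as paid.
        prices = (pricing.get("prompt", "1"), pricing.get("completion", "1"))
        if all(float(p) == 0.0 for p in prices):
            result.append(entry["id"])
    return result

# Made-up entries for illustration; real IDs and prices come from the saved file.
sample_models = [
    {"id": "vendor/model-free", "context_length": 8192,
     "pricing": {"prompt": "0", "completion": "0"}},
    {"id": "vendor/model-paid", "context_length": 128000,
     "pricing": {"prompt": "0.000003", "completion": "0.000015"}},
]
print(free_models(sample_models))
```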

Specify a custom models file:

botwell --update-models --models-file my_models.json

Cache Management

The Boswell Test includes cache management utilities to improve performance. The response caching system stores API responses to avoid redundant API calls, which is especially useful during development and testing.

# View cache statistics
botwell cache stats

# Clear the entire cache
botwell cache clear

# Clear only expired cache entries
botwell cache clear --expired-only
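
To make the caching idea concrete, here is a minimal in-memory sketch of a response cache with expiry. It is not the framework's implementation: the `ResponseCache` class, its key scheme (a hash of model and prompt), and the time-to-live default are assumptions for illustration.

```python
import hashlib
import time

class ResponseCache:
    """Toy in-memory cache keyed by (model, prompt), with per-entry expiry."""

    def __init__(self, ttl_seconds=3600.0, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, response)

    @staticmethod
    def _key(model, prompt):
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get(self, model, prompt):
        """Return the cached response, or None if it is missing or expired."""
        entry = self._entries.get(self._key(model, prompt))
        if entry is None or self.clock() - entry[0] > self.ttl:
            return None
        return entry[1]

    def put(self, model, prompt, response):
        """Store a response along with the time it was cached."""
        self._entries[self._key(model, prompt)] = (self.clock(), response)

    def clear(self, expired_only=False):
        """Drop all entries, or only expired ones; return how many were removed."""
        now = self.clock()
        doomed = [key for key, (stored_at, _) in self._entries.items()
                  if not expired_only or now - stored_at > self.ttl]
        for key in doomed:
            del self._entries[key]
        return len(doomed)
```

The `clear(expired_only=True)` call loosely mirrors what `botwell cache clear --expired-only` is described as doing, while `clear()` corresponds to clearing the entire cache.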

Domain Creation

The modular architecture of the Boswell Test framework is designed to be extensible, allowing users to define and integrate new testing domains tailored to specific research needs or application scenarios.

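Frequently Asked Questions (FAQ)

How does the Boswell Test evaluate the efficiency of large language models?

The Boswell Test uses peer review: multiple LLMs write essays in a given domain and grade one another, the framework analyzes bias in the grading patterns, and it then computes a composite Boswell Quotient (0-100) for each model. Efficiency, measured by response time, is one of the four equally weighted components of that score.

How do I get started quickly with the Boswell Test framework?

Clone the repository, create a virtual environment, install the package, and set your API key; you can then run tests. The framework offers a modular architecture and a caching system, supports free models, and lets you launch a domain test and generate a report with simple commands.

Which evaluation domains does the Boswell Test support?

The framework supports multiple test domains, including political science and computer science. Users can run the built-in domains or create new ones with the provided utilities, so that models write and peer-grade essays on a specific topic.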
