Ragas or LangChain: Which Is Better for Evaluating LLM Applications? (With a Hands-On Test of the Python Toolkit)

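Q: How do I quickly get started evaluating my RAG system with Ragas?

The fastest way is to run the `ragas quickstart rag_eval` command, which clones a complete RAG evaluation example project. You can also install the package with `pip install ragas` and follow the Quickstart Guide in the documentation.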

Ragas: Supercharge Your LLM Application Evaluations 🚀

Ragas is your ultimate toolkit for evaluating and optimizing Large Language Model (LLM) applications. It provides objective metrics, intelligent test generation, and data-driven insights. Say goodbye to time-consuming, subjective assessments and hello to data-driven, efficient evaluation workflows. Don't have a test dataset ready? Ragas also handles production-aligned test set generation.

Key Features

🎯 Objective Metrics: Evaluate your LLM applications with precision using both LLM-based and traditional metrics.

🧪 Test Data Generation: Automatically create comprehensive test datasets covering a wide range of scenarios.

🔗 Seamless Integrations: Works flawlessly with popular LLM frameworks like LangChain and major observability tools.

📊 Build Feedback Loops: Leverage production data to continually improve your LLM applications.

🛡️ Installation

Install via PyPI:

pip install ragas

Alternatively, install from source:

pip install git+https://github.com/vibrantlabsai/ragas

🔥 Quickstart

Clone a Complete Example Project

The fastest way to get started is to use the ragas quickstart command:

# List available templates
ragas quickstart

# Create a RAG evaluation project
ragas quickstart rag_eval

# Specify where to create the project
ragas quickstart rag_eval -o ./my-project

Available and Upcoming Project Templates

Template Name	Status	Core Function Description
`rag_eval`	Available	Evaluate Retrieval-Augmented Generation (RAG) systems.
`agent_evals`	Coming Soon	Evaluate AI agents.
`benchmark_llm`	Coming Soon	Benchmark and compare LLMs.
`prompt_evals`	Coming Soon	Evaluate prompt variations.
`workflow_eval`	Coming Soon	Evaluate complex workflows.

Evaluate Your LLM App

Ragas comes with pre-built metrics for common evaluation tasks. For example, Aspect Critique uses DiscreteMetric to evaluate any specific aspect of your output:

import asyncio
from openai import AsyncOpenAI
from ragas.metrics import DiscreteMetric
from ragas.llms import llm_factory

# Setup your LLM
client = AsyncOpenAI()
llm = llm_factory("gpt-4o", client=client)

# Create a custom aspect evaluator
metric = DiscreteMetric(
    name="summary_accuracy",
    allowed_values=["accurate", "inaccurate"],
    prompt="""Evaluate if the summary is accurate and captures key information.

Response: {response}

Answer with only 'accurate' or 'inaccurate'."""
)

# Score your application's output
async def main():
    score = await metric.ascore(
        llm=llm,
        response="The summary of the text is..."
    )
    print(f"Score: {score.value}")  # 'accurate' or 'inaccurate'
    print(f"Reason: {score.reason}")

if __name__ == "__main__":
    asyncio.run(main())

Note: Make sure your OPENAI_API_KEY environment variable is set.
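A missing key usually surfaces only when the first request is sent, so it can help to check for it up front. The following is a minimal sketch using only the standard library; `require_env` is a hypothetical helper name, not part of Ragas or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of Ragas.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the evaluation.")
    return value


# Example usage at the top of a script, before any client is created:
# require_env("OPENAI_API_KEY")
```

Calling the helper before creating the client turns a late authentication failure into an immediate, readable error.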

Find the complete Quickstart Guide.
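Once single responses can be scored, a natural next step is to score a batch and tally the discrete verdicts. The sketch below shows one way to do that with `asyncio.gather`. To stay runnable without an API key, it uses `StubMetric` and `StubResult`, which are hypothetical stand-ins written for this example: they only mimic an awaitable `ascore(...)` call that returns an object with `value` and `reason`, and the stub omits the `llm` argument that a real metric call takes.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in for a metric result exposing .value and .reason."""
    value: str
    reason: str


class StubMetric:
    """Hypothetical stand-in for a discrete metric with an async ascore method."""

    async def ascore(self, response: str) -> StubResult:
        # Toy keyword rule for illustration only; a real metric would query an LLM.
        ok = "key point" in response
        return StubResult("accurate" if ok else "inaccurate", "toy keyword check")


async def score_batch(metric, responses):
    """Score all responses concurrently and count each discrete verdict."""
    results = await asyncio.gather(*(metric.ascore(response=r) for r in responses))
    return Counter(result.value for result in results)


tally = asyncio.run(score_batch(StubMetric(), [
    "Covers the key point of the text.",
    "Unrelated rambling.",
    "Another key point summary.",
]))
print(dict(tally))
```

With a real metric, the same pattern should apply as long as each call is awaitable; passing the LLM through to `ascore` and limiting concurrency to respect rate limits are left out of this sketch.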

Want help improving your AI application using evals?

In the past two years, we have seen and helped improve many AI applications using evals. If you want help improving and scaling up your AI application with evals, reach out:

🔗 Book a slot or drop us a line: founders@vibrantlabs.com.

🫂 Community

If you want to get more involved with Ragas, check out our Discord server. It's a fun community where we geek out about LLMs, retrieval, production issues, and more.

Contributors

We welcome contributions from the community! Whether it's bug fixes, feature additions, or documentation improvements, your input is valuable.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

🔍 Open Analytics

At Ragas, we believe in transparency. We collect minimal, anonymized usage data to improve our product and guide our development efforts.

✅ No personal or company-identifying information

✅ Open-source data collection code

✅ Publicly available aggregated data

To opt out, set the RAGAS_DO_NOT_TRACK environment variable to true.
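When exporting the variable in the shell is inconvenient, for example in a notebook, the opt-out can also be set from Python. This is a minimal sketch that assumes Ragas reads the variable from the process environment; setting it before `ragas` is imported covers the case where it is read at import time, which is an assumption rather than documented behavior.

```python
import os

# Set the opt-out flag before importing ragas, so the flag is already present
# if the library reads it at import time (an assumption, not confirmed here).
os.environ["RAGAS_DO_NOT_TRACK"] = "true"
```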

Cite Us

@misc{ragas2024,
  author       = {VibrantLabs},
  title        = {Ragas: Supercharge Your LLM Application Evaluations},
  year         = {2024},
  howpublished = {\url{https://github.com/vibrantlabsai/ragas}},
}
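
Frequently Asked Questions (FAQ)

Which LLM application evaluation pain points does the Ragas toolkit mainly address?

By providing objective metrics and automated test generation, Ragas addresses the time-consuming and subjective nature of traditional manual evaluation. It can also generate production-aligned test datasets automatically, enabling an efficient, data-driven evaluation workflow.

How do I quickly get started evaluating my RAG system with Ragas?

The fastest way is to run the `ragas quickstart rag_eval` command, which clones a complete RAG evaluation example project. You can also install the package with `pip install ragas` and follow the Quickstart Guide in the documentation.

Besides RAG systems, does Ragas support evaluating other types of LLM applications?

Yes. According to the roadmap, Ragas plans to release project templates for evaluating AI agents, benchmarking LLMs, evaluating prompt variations, and evaluating complex workflows. For now, `rag_eval` is the only template listed as available.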


