Tonic Validate Logging: Evaluation and Tracking for Your RAG Application

Tonic Validate helps you develop Retrieval-Augmented Generation (RAG) applications by providing RAG evaluation metrics and a platform for tracking and evaluating experiments and changes to a RAG application. This article introduces Tonic Validate Logging, the logging component of Tonic Validate, which sends your RAG application's outputs to the Tonic Validate application. When you log outputs with Tonic Validate Logging, they are scored with Tonic Validate Metrics, and the outputs and their scores are then sent to the Tonic Validate application. There, the outputs and scores are visualized so that you can track how your RAG application performs.

Tonic Validate Documentation

Quick Start

Follow these steps to integrate Tonic Validate Logging into your workflow.

Sign up: Create a free Tonic Validate account.
Install the SDK: Install Tonic Validate Logging with pip.
```
pip install tvallogging
```
Get and set your API keys:
- Get a Tonic Validate API key and set it as the TONIC_VALIDATE_API_KEY environment variable.
- Tonic Validate uses LLM-assisted evaluation to score your RAG responses, so you also need to set the OPENAI_API_KEY environment variable in order to use OpenAI models.
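
Before calling the SDK, it can help to confirm that both keys are actually present in the environment. The following is a minimal sketch, not part of Tonic Validate Logging: the helper names `missing_env_vars` and `require_env_vars` are made up for illustration, and only the two variable names come from the step above.

```python
import os
from typing import Iterable, List, Mapping

# The two variable names come from the setup step above.
REQUIRED_ENV_VARS = ("TONIC_VALIDATE_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(
    required: Iterable[str] = REQUIRED_ENV_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def require_env_vars(env: Mapping[str, str] = os.environ) -> None:
    """Raise a RuntimeError that lists every missing key (hypothetical helper)."""
    missing = missing_env_vars(env=env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env_vars()` at the top of a logging script fails fast with a readable message instead of failing later inside an API call.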

Set up a project and log responses: Set up a project with a benchmark dataset of questions and reference answers, then log your RAG application's responses to those questions. A basic example:

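```
import os
from typing import Dict, List

# Set the environment variables from Python
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["TONIC_VALIDATE_API_KEY"] = "your-tonic-validate-api-key"

from tvallogging.api import TonicValidateApi
from tvallogging.chat_objects import Benchmark

# Define your project name, benchmark name, and benchmark data.
# The values below are placeholders; replace them with your own.
project_name: str = "my-rag-project"
benchmark_name: str = "my-benchmark"
# question_with_answer_list is a list of dicts of the form:
# [{"question": "question text", "answer": "reference answer"}, ...]
question_with_answer_list: List[Dict[str, str]] = [
    {"question": "question text", "answer": "reference answer"},
]

api = TonicValidateApi()

# Create the benchmark and the project
benchmark = Benchmark.from_json_list(question_with_answer_list)
benchmark_id = api.new_benchmark(benchmark, benchmark_name)
project = api.new_project(benchmark_id, project_name)

# Choose the LLM used for evaluation (for example, GPT-4) and create a new run
llm_evaluator = "gpt-4"
run = project.new_run(llm_evaluator)

# For each benchmark question, log the RAG application's answer and retrieved context
for question_with_answer in run.benchmark.question_with_answer_list:
    question = question_with_answer.question

    # Placeholders: replace with the output of your RAG application for `question`
    llm_answer: str = "answer from your RAG application"
    retrieved_context_list: List[str] = ["context retrieved by your RAG application"]

    # Log the answer and context to Tonic Validate; this step computes the RAG metrics locally
    run.log(question_with_answer, llm_answer, retrieved_context_list)
```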
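
The logging loop above can be wrapped in a small function so that the RAG application is passed in as a callable. This is a sketch under stated assumptions: `log_benchmark_responses`, `RagFunction`, `FakeRun`, and `echo_rag` are hypothetical names introduced here, not part of tvallogging. The function relies only on `run.benchmark.question_with_answer_list` and `run.log(...)`, the same members the example uses, and `FakeRun` is an offline stand-in for trying the loop without API keys.

```python
from types import SimpleNamespace
from typing import Any, Callable, List, Tuple

# Hypothetical signature: takes a question, returns (answer, retrieved contexts).
RagFunction = Callable[[str], Tuple[str, List[str]]]


def log_benchmark_responses(run: Any, rag_fn: RagFunction) -> int:
    """Call rag_fn for every benchmark question and log the result on run.

    Returns the number of responses logged.
    """
    logged = 0
    for question_with_answer in run.benchmark.question_with_answer_list:
        llm_answer, retrieved_context_list = rag_fn(question_with_answer.question)
        run.log(question_with_answer, llm_answer, retrieved_context_list)
        logged += 1
    return logged


class FakeRun:
    """Offline stand-in for a run object; it only records what would be logged."""

    def __init__(self, questions: List[str]) -> None:
        items = [SimpleNamespace(question=q, answer="") for q in questions]
        self.benchmark = SimpleNamespace(question_with_answer_list=items)
        self.logged: List[Tuple[str, str, List[str]]] = []

    def log(self, question_with_answer: Any, llm_answer: str,
            retrieved_context_list: List[str]) -> None:
        self.logged.append(
            (question_with_answer.question, llm_answer, retrieved_context_list)
        )


def echo_rag(question: str) -> Tuple[str, List[str]]:
    """Toy RAG function used only for the dry run."""
    return f"answer to: {question}", [f"context for: {question}"]
```

With a real run, the call would be `log_benchmark_responses(run, my_rag_fn)`, where `my_rag_fn` wraps your own application; with `FakeRun`, nothing is sent or scored.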

Review performance in the UI: Review and analyze how your RAG application is performing in the Tonic Validate UI.

Core RAG Evaluation Metrics

Tonic Validate uses a set of metrics to evaluate RAG system performance. These metrics fall into three main areas: answer relevance, context relevance, and groundedness. The table below outlines the key metrics and what each one focuses on:

Category	Metric Name	Core Evaluation Focus	Description
Answer Quality	Answer Similarity	Compares the LLM answer to the reference answer.	Measures how semantically close the generated answer is to the expected answer.
Retrieval Quality	Answer Context Consistency	Assesses how much of the LLM answer is grounded in the provided context.	Detects "hallucinations" in the answer, ensuring it is evidence-based.
Retrieval Quality	Context Relevance	Evaluates how relevant the retrieved context is to the question.	Measures the retrieval system's ability to find useful information.
Overall Assessment	Overall Score	A weighted composite score calculated from several of the metrics above.	Provides a single view of system performance in the Tonic Validate UI for easy progress tracking.
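
To make the idea of a weighted composite score concrete, here is a minimal sketch of a weighted mean over per-metric scores. It is an illustration only: the text does not say which weights or score scale Tonic Validate uses, so the function name `weighted_overall_score`, the metric keys, the example weights, and the 0 to 1 scale are all assumptions.

```python
from typing import Dict, Optional


def weighted_overall_score(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Return the weighted mean of `scores`; equal weights if none are given."""
    if not scores:
        raise ValueError("scores must not be empty")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical per-metric scores on an assumed 0 to 1 scale.
example_scores = {
    "answer_similarity": 0.8,
    "answer_context_consistency": 1.0,
    "context_relevance": 0.5,
}
# Hypothetical weights that count answer similarity twice as much.
example_weights = {
    "answer_similarity": 2.0,
    "answer_context_consistency": 1.0,
    "context_relevance": 1.0,
}
overall = weighted_overall_score(example_scores, example_weights)
```

Normalizing by the total weight keeps the composite on the same scale as the individual metrics, which makes runs comparable even if the weights change.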

Documentation and Further Resources

To make the most of Tonic Validate, consult the following resources for more in-depth information:
- Tonic Validate Documentation: Extensive information on how Tonic Validate works and on the RAG metrics it uses.
- Tonic Validate Metrics library: An open-source library to consult if you want to learn more about the RAG metrics used in Tonic Validate, or if you are interested in computing them outside of the Tonic Validate platform.

By integrating Tonic Validate Logging, you can establish a continuous evaluation and optimization loop for your RAG application, enabling data-driven improvements to its accuracy, relevance, and reliability.
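
Frequently Asked Questions (FAQ)

How does Tonic Validate Logging help me track my RAG application's performance?
The library logs your RAG application's outputs to the Tonic Validate platform, scores them automatically with LLM-assisted metrics, and displays the results in the platform's UI, so you can clearly track how performance changes.

What steps are needed to get started with Tonic Validate Logging?
Sign up for a free account and get an API key, install the library with pip, set the environment variables, create a project with a benchmark dataset, and then log your RAG responses to view the evaluation results on the platform.

Which aspects of a RAG system does Tonic Validate evaluate?
It evaluates three main categories of metrics: answer relevance (whether the answer matches the question), context relevance (whether the retrieved content is relevant), and groundedness (whether the answer is based on the given context).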
