
How to Choose an AI Data Transformation Framework? A 2024 Guide to CocoIndex, the High-Performance Rust Engine | Geoz.com.cn

2026/2/16
AI Summary (BLUF)

CocoIndex is an ultra-performant data transformation framework for AI applications, featuring a Rust core engine, incremental processing, and built-in data lineage. It enables developers to define transformations in ~100 lines of Python using a dataflow programming model, with plug-and-play components for various sources, targets, and transformations. CocoIndex keeps source and target data in sync effortlessly and supports incremental indexing with minimal recomputation.

Introduction

When building modern AI applications, data transformation is a critical step. Whether you are creating vector indexes for semantic search, building knowledge graphs for context engineering, or performing any other custom data processing, developers need a powerful, efficient, and easy-to-use tool for managing complex data pipelines. Traditional approaches, such as writing one-off scripts or relying on heavyweight ETL tools, often fall short in development speed, maintainability, and incremental processing.

CocoIndex is designed to meet this need. It is an ultra-performant data transformation framework built specifically for AI scenarios. With its core engine written in Rust, it natively supports incremental processing and provides out-of-the-box data lineage tracking. It aims to deliver exceptional developer velocity and achieve a "production-ready on day 0" level of maturity.

Core Concept: Dataflow Programming

At its core, CocoIndex follows the Dataflow Programming model. In this model, each transformation creates a new field based solely on its input fields. There are no hidden states and no value mutations. All data before and after each transformation is observable, and data lineage is built-in by design.

This means developers do not explicitly mutate data by creating, updating, and deleting. They only need to define the transformation logic or formulas for a set of source data. This declarative approach greatly simplifies the construction and reasoning process of complex data pipelines.
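The no-mutation idea can be sketched in plain Python (an illustration of the model, not the CocoIndex API; the `derive` helper is hypothetical):

```python
# A minimal sketch of the dataflow idea: each step derives a *new* field
# from existing fields; nothing is mutated in place.

def derive(record: dict, field: str, fn, *inputs: str) -> dict:
    """Return a new record with `field` computed from the named input fields."""
    return {**record, field: fn(*(record[k] for k in inputs))}

doc = {"content": "Hello, CocoIndex"}
doc2 = derive(doc, "chunks", lambda text: text.split(", "), "content")
doc3 = derive(doc2, "n_chunks", len, "chunks")

# The original record is untouched, so every intermediate stage stays
# observable, and the lineage content -> chunks -> n_chunks is explicit.
```

Because each stage is a pure function of its inputs, the framework can inspect, cache, and re-run any step independently.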

Exceptional Developer Velocity

CocoIndex prioritizes developer productivity. With approximately 100 lines of Python code, you can declare a complete dataflow.

# Import data source
data['content'] = flow_builder.add_source(...)

# Define transformation chain
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# Collect processed data
collector.collect(...)

# Export to target storage (database, vector database, graph database, etc.)
collector.export(...)

Plug-and-Play Building Blocks

The framework ships with built-in components for various data sources, targets, and transformations. These components adhere to standardized interfaces, so switching between them is as easy as swapping building blocks, often a one-line code change.
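A toy sketch of why standardized interfaces make swapping a one-line change (the class names below are hypothetical stand-ins, not real CocoIndex components):

```python
# Every source exposes the same read() contract, so the pipeline body
# never changes when the backend does.

class LocalFileSource:
    def __init__(self, docs): self.docs = docs
    def read(self): return list(self.docs)

class S3Source:  # hypothetical stand-in for a cloud source
    def __init__(self, docs): self.docs = docs
    def read(self): return list(self.docs)

def build_index(source):
    # The transformation logic is written once, against the interface.
    return [doc.upper() for doc in source.read()]

# Swapping backends touches only the constructor line:
local_index = build_index(LocalFileSource(["a.md", "b.md"]))
cloud_index = build_index(S3Source(["a.md", "b.md"]))
```

The same contract-based design applies to targets and transformation functions.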

Data Freshness and Incremental Processing

CocoIndex effortlessly keeps source data and targets in sync. It provides out-of-the-box support for incremental indexing:

  • Minimal Recomputation: Only necessary recalculations are performed when source data or transformation logic changes.
  • Intelligent Cache Reuse: Reuses existing cached results whenever possible, reprocessing only the affected portions.
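The general technique can be sketched with content hashing (a simplified assumption about how incremental engines avoid recomputation, not CocoIndex's actual Rust engine):

```python
# Cache transformation results by content hash: unchanged inputs are
# never recomputed, only new or modified content is processed.
import hashlib

cache: dict[str, str] = {}   # content hash -> computed result
calls = {"n": 0}

def embed(text: str) -> str:
    calls["n"] += 1          # count real computations
    return text[::-1]        # stand-in for an expensive transformation

def process(docs: dict[str, str]) -> dict[str, str]:
    out = {}
    for name, text in docs.items():
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in cache:  # recompute only unseen content
            cache[key] = embed(text)
        out[name] = cache[key]
    return out

process({"a.md": "alpha", "b.md": "beta"})     # both documents computed
process({"a.md": "alpha", "b.md": "beta v2"})  # only b.md recomputed
```

In a real pipeline the cache would be persistent (CocoIndex uses PostgreSQL for this bookkeeping), so freshness survives process restarts.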

Quick Start

If you are new to CocoIndex, we recommend starting with the steps below:

Environment Setup

  1. Install the CocoIndex Python library:

    pip install -U cocoindex

  2. Install PostgreSQL (if you don't have it already). CocoIndex uses it for incremental processing.

  3. (Optional) Install the Claude Code skill for an enhanced development experience. Run these commands in Claude Code:

    /plugin marketplace add cocoindex-io/cocoindex-claude
    /plugin install cocoindex-skills@cocoindex


Define Your First Dataflow

Follow the Quick Start Guide to define your first indexing flow. A typical text embedding flow example looks like this:

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a local directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into the `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk text, put into the `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk data into the collector
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

This flow defines a complete indexing pipeline: reading documents, chunking and embedding them, and storing the results in a vector database.
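A note on the COSINE_SIMILARITY metric chosen in the export above: nearest neighbors are ranked by the angle between embedding vectors rather than their magnitude, which is the usual choice for normalized sentence embeddings. A quick illustration:

```python
# Cosine similarity compares vector direction, ignoring magnitude.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Scaling a vector does not change its cosine similarity to itself:
v = [1.0, 2.0, 2.0]
assert abs(cosine_similarity(v, [2.0, 4.0, 4.0]) - 1.0) < 1e-9
```

At query time, the query text must be embedded with the same model (`sentence-transformers/all-MiniLM-L6-v2` here) so that distances in the index are meaningful.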

Rich Examples and Use Cases

CocoIndex provides a wide range of examples demonstrating its capabilities across various scenarios:

  • Text Embedding: Index text documents with embeddings for semantic search
  • Code Embedding: Index code embeddings for semantic search
  • PDF Parsing & Embedding: Parse PDFs and index text embeddings for semantic search
  • Multimodal PDF Indexing: Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search
  • LLM Extraction from Manuals: Extract structured information from a manual using an LLM
  • Cloud Storage Document Indexing: Index text documents from Amazon S3, Azure Blob Storage, and Google Drive
  • Meeting Notes to Knowledge Graph: Extract structured meeting information from Google Drive and build a knowledge graph
  • Docs to Knowledge Graph: Extract relationships from Markdown documents and build a knowledge graph
  • Vector Database Integration: Index embeddings into vector database collections such as Qdrant and LanceDB
  • Dockerized Service: Run the semantic search server in a Dockerized FastAPI setup
  • Real-Time Product Recommendation: Build real-time product recommendations with an LLM and a graph database
  • Visual Search API: Generate detailed captions for images with a vision model, embed them, and serve live-updating semantic search via FastAPI with a React frontend
  • Face Recognition: Recognize faces in images and build an embedding index
  • Academic Paper Metadata Indexing: Index papers in PDF files and build metadata tables for each paper
  • Custom Data Source: Index HackerNews threads and comments using a custom source

More examples are coming soon!

Join the Community

We love contributions from our community ❤️. Contributions of all kinds are welcome: code improvements, documentation updates, issue reports, feature requests, and discussions on our Discord.

  • 🌟 Star us on GitHub
  • 👋 Join our Discord community
  • ▶️ Subscribe to our YouTube channel
  • 📜 Read our blog

Support Us

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on the GitHub repository to stay tuned and help us grow.

License: CocoIndex is Apache 2.0 licensed.
