What Is Neum AI? A 2024 Guide to the RAG Data Platform

Overview

Neum AI is a data platform that helps developers use their own data to provide context for Large Language Models through Retrieval-Augmented Generation (RAG). It covers three stages: extracting data from existing sources (such as document storage and NoSQL databases), processing the content into vector embeddings, and ingesting those embeddings into vector databases for similarity search. The result is a comprehensive RAG solution that scales with your application and reduces the time spent integrating services such as data connectors, embedding models, and vector databases.

Core Features

The Neum AI platform is designed with performance, flexibility, and ease of use in mind. Its core features are:

🏭 High-throughput distributed architecture: Handles billions of data points and allows a high degree of parallelization to optimize embedding generation and ingestion.
🧱 Built-in data connectors: Offers out-of-the-box connectivity to common data sources, embedding services, and vector stores.
🔄 Real-time data source synchronization: Keeps your retrieval data up to date.
♻️ Customizable data pre-processing: Supports data manipulation in the form of loading, chunking, and selecting.
🤝 Cohesive data management: Supports hybrid retrieval with metadata. Neum AI automatically augments and tracks metadata to deliver a rich retrieval experience.
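
To make "hybrid retrieval with metadata" concrete, here is a minimal, standard-library-only sketch of the general idea: narrow the candidates with an exact-match metadata filter, then rank what remains by vector similarity. This is an illustration only, not Neum AI's implementation; the `search` function, the record layout, and the toy vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(records, query_vector, metadata_filter=None, top_k=3):
    """Keep records whose metadata matches the filter, then rank by similarity."""
    metadata_filter = metadata_filter or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical toy records; real embeddings have hundreds of dimensions.
records = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"source": "website"}},
    {"id": "b", "vector": [0.0, 1.0], "metadata": {"source": "postgres"}},
    {"id": "c", "vector": [0.9, 0.1], "metadata": {"source": "postgres"}},
]

hits = search(records, [1.0, 0.0], metadata_filter={"source": "postgres"}, top_k=1)
print([h["id"] for h in hits])  # ['c']
```

Record "a" is the closest vector overall, but the filter restricts the search to records ingested from Postgres, so "c" is returned instead.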

Getting Started

Neum AI Cloud

Sign up today at dashboard.neum.ai. See our Quickstart Guide to get started. The Neum AI Cloud supports a large-scale, distributed architecture for running vector embedding over millions of documents. For a full feature comparison, see: Cloud vs Local.

Local Development

First, install the neumai package:

pip install neumai

To create your first data pipeline, visit our Quickstart Guide. At a high level, a pipeline consists of one or more sources to pull data from, one embed connector to vectorize the content, and one sink connector to store the resulting vectors.

Example: Creating and Running a Pipeline (Website Connector)

The following code snippet shows how to assemble these components and run a pipeline that ingests data from a website:

from neumai.DataConnectors.WebsiteConnector import WebsiteConnector
from neumai.Shared.Selector import Selector
from neumai.Loaders.HTMLLoader import HTMLLoader
from neumai.Chunkers.RecursiveChunker import RecursiveChunker
from neumai.Sources.SourceConnector import SourceConnector
from neumai.EmbedConnectors import OpenAIEmbed
from neumai.SinkConnectors import WeaviateSink
from neumai.Pipelines import Pipeline

# 1. Configure the data source (website)
website_connector = WebsiteConnector(
    url="https://www.neum.ai/post/retrieval-augmented-generation-at-scale",
    selector=Selector(to_metadata=['url'])  # store the URL in the metadata
)
source = SourceConnector(
    data_connector=website_connector,
    loader=HTMLLoader(),
    chunker=RecursiveChunker()
)

# 2. Configure the embedding model (OpenAI)
openai_embed = OpenAIEmbed(api_key="<YOUR_OPENAI_API_KEY>")

# 3. Configure the vector database sink (Weaviate)
weaviate_sink = WeaviateSink(
    url="your-weaviate-url",
    api_key="your-weaviate-api-key",
    class_name="your-class-name",
)

# 4. Assemble and run the pipeline
pipeline = Pipeline(sources=[source], embed=openai_embed, sink=weaviate_sink)
pipeline.run()

# 5. Run a search against the ingested data
results = pipeline.search(
    query="What are the challenges with scaling RAG?",
    number_of_results=3
)
for result in results:
    print(result.metadata)
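
The source above uses RecursiveChunker to split the loaded page into smaller pieces before embedding. For readers unfamiliar with the technique, the sketch below shows what recursive chunking means in general: try a coarse separator first (paragraphs), and fall back to finer ones (lines, sentences, words) only for pieces that are still too long. It is a simplified, standard-library illustration and not the neumai implementation; the `recursive_chunk` function, its parameters, and its default separators are assumptions made for this example.

```python
def recursive_chunk(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_chars, preferring coarse separators.

    Separators are dropped at chunk boundaries and kept inside merged chunks.
    """
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = [p for p in text.split(sep) if p.strip()]
        chunks, current = [], ""
        for part in parts:
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep merging small pieces together
                continue
            if current:
                chunks.append(current)
            current = ""
            if len(part) <= max_chars:
                current = part
            else:
                # Still too long: recurse with the remaining, finer separators.
                chunks.extend(recursive_chunk(part, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to fixed-size slices.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]


sample = "Para one. It has two sentences.\n\nPara two is here."
for chunk in recursive_chunk(sample, max_chars=30):
    print(repr(chunk))
```

With a 30-character limit, the first paragraph is too long to keep whole, so it is split again at the sentence boundary, while the second paragraph fits and stays intact.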

Example: Creating and Running a Pipeline (PostgreSQL Connector)

You can also ingest data from relational databases such as PostgreSQL:

from neumai.DataConnectors.PostgresConnector import PostgresConnector
from neumai.Shared.Selector import Selector
from neumai.Loaders.JSONLoader import JSONLoader
from neumai.Chunkers.RecursiveChunker import RecursiveChunker
from neumai.Sources.SourceConnector import SourceConnector
from neumai.EmbedConnectors import OpenAIEmbed
from neumai.SinkConnectors import WeaviateSink
from neumai.Pipelines import Pipeline

# 1. Configure the data source (PostgreSQL)
postgres_connector = PostgresConnector(
    connection_string='your-postgres-connection-string',
    query='SELECT * FROM your_table'
)
source = SourceConnector(
    data_connector=postgres_connector,
    loader=JSONLoader(
        id_key='<your_unique_id_column>',
        selector=Selector(
            to_embed=['column1_to_embed', 'column2_to_embed'],  # columns to embed
            to_metadata=['column3_for_metadata']  # columns to store as metadata
        )
    ),
    chunker=RecursiveChunker()
)

# ... (embed and sink connector configuration is the same as in the previous example)

pipeline = Pipeline(sources=[source], embed=openai_embed, sink=weaviate_sink)
pipeline.run()
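
The Selector in this example separates each row into fields that are embedded and fields that are only carried along as metadata. The sketch below illustrates that split on a single row using plain Python. It does not reproduce how neumai formats the text internally; the `split_row` helper, the "column: value" text layout, and the sample row are hypothetical.

```python
def split_row(row, to_embed, to_metadata, id_key):
    """Return (id, text_to_embed, metadata) for one database row given as a dict."""
    # Join the selected columns into one text blob; skip NULL (None) values.
    text = "\n".join(
        f"{col}: {row[col]}" for col in to_embed if row.get(col) is not None
    )
    # Metadata is stored next to the vector but is not embedded itself.
    metadata = {col: row[col] for col in to_metadata if col in row}
    return str(row[id_key]), text, metadata


# Hypothetical row, shaped like one result of `SELECT * FROM your_table`.
row = {"id": 7, "title": "Scaling RAG", "body": "Notes on ingestion.", "author": "sam"}

doc_id, text, metadata = split_row(
    row, to_embed=["title", "body"], to_metadata=["author"], id_key="id"
)
print(doc_id)    # 7
print(text)      # title: Scaling RAG, then body: Notes on ingestion. on a new line
print(metadata)  # {'author': 'sam'}
```

Keeping identifiers and filterable attributes in metadata rather than in the embedded text avoids diluting the embedding while still making those values available at query time.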

Publishing a Pipeline to Neum Cloud

Pipelines developed locally can be published to the Neum AI Cloud for hosting and scaling:

from neumai.Client.NeumClient import NeumClient

client = NeumClient(api_key='<YOUR_NEUM_API_KEY>')  # get your key from https://dashboard.neum.ai
client.create_pipeline(pipeline=pipeline)

Self-Hosted Deployment

If you are interested in deploying Neum AI to your own cloud infrastructure, please contact us at founders@tryneum.com. We have published a sample backend architecture on GitHub that you can use as a starting point.

Available Connectors

For an up-to-date list, please visit our official documentation. The following core connector categories are currently supported:

Source Connectors

Extract data from these sources.

Postgres
Hosted Files
Websites
S3
Azure Blob
SharePoint
SingleStore
Supabase Storage

Embed Connectors

Convert text to vectors using these services.

OpenAI Embeddings
Azure OpenAI Embeddings

Sink Connectors

Store vectors in these databases.

Supabase PostgreSQL (with vector extension)
Weaviate
Qdrant
Pinecone
SingleStore

Roadmap

Our roadmap evolves based on community requests. If you find anything missing, feel free to open an issue or send us a message.

Planned connectors:

MySQL - Source
GitHub - Source
Google Drive - Source
Hugging Face - Embedding
LanceDB - Sink
Marqo - Sink
Milvus - Sink
Chroma - Sink

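Search enhancements:

Retrieval feedback
Filter support
Unified Neum AI filters
Smart routing (based on embedding classification)
Smart routing (based on LLM classification)
Self-query retrieval (with automatic generation of metadata attributes)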

Extensibility:

Langchain / LlamaIndex document transformers
Custom chunking and loading logic

Experimental features:

Async metadata augmentation
Chat history connector
Structured (SQL and GraphQL) search connector

Neum Tooling

Additional tooling for use with Neum AI:

neumai-tools: Contains pre-processing tools for loading and chunking data before generating vector embeddings.

Contact Us

You can reach our team through:

Email: founders@tryneum.com
Discord: Join our community
Call: Schedule a call

Start building your scalable, context-aware AI applications today.