
How Does RAG-Anything Implement Multimodal Document Processing? A Detailed Look at the Latest Features (2026)

2026/4/24

AI Summary (BLUF)

RAG-Anything is an all-in-one multimodal RAG system that processes documents containing text, images, tables, and formulas. It features end-to-end processing pipelines, knowledge graph indexing, and cross-modal retrieval spanning all of these content types.

🚀 RAG-Anything: The All-in-One Multimodal RAG System

This blog post introduces RAG-Anything, a comprehensive, open-source system designed to handle complex, multimodal documents. It goes beyond traditional text-based RAG by seamlessly processing and querying content that includes text, images, tables, and equations.

🎉 What's New (Latest Updates)

  • [2025.08.12] 🎯 RAGAnything now supports VLM-Enhanced Query mode! When documents contain images, the system can automatically pass the image along with the text context directly to a VLM for comprehensive multimodal analysis.

  • [2025.07.05] 🎯 Added the Context Configuration Module to provide relevant context for multimodal content processing.

  • [2025.07.04] 🎯 Now supports multimodal content queries, enabling enhanced retrieval-augmented generation that integrates text, images, tables, and formulas.

  • [2025.07.03] 🎯 RAGAnything reached 1K stars on GitHub! Thank you for your support.


🌟 System Overview

RAG-Anything is a comprehensive multimodal document processing RAG system. It provides a complete retrieval-augmented generation (RAG) solution for complex documents containing text, images, tables, formulas, and other multimodal content.

🎯 Core Features

  • 🔄 End-to-End Multimodal Pipeline: Provides a complete processing chain from document parsing to multimodal query response, ensuring a unified system operation.

  • 📄 Multi-Format Document Support: Supports unified processing and parsing of mainstream document formats, including PDF, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), and images.

  • 🧠 Multimodal Content Analysis Engine: Deploys specialized processors for images, tables, formulas, and general text to ensure precise parsing of diverse content.

  • 🔗 Knowledge Graph-Based Indexing: Automates entity extraction and relationship construction to build a cross-modal semantic connection network.

  • ⚡ Flexible Processing Architecture: Supports both an intelligent parsing mode based on MinerU and a direct multimodal content insertion mode to meet different application needs.

  • 📋 Direct Content List Insertion: Bypasses document parsing to directly insert pre-parsed content lists from external sources, enabling integration of multiple data sources.

  • 🎯 Cross-Modal Retrieval Mechanism: Enables intelligent retrieval across text and multimodal content, providing precise information location and matching capabilities.
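
The "Direct Content List Insertion" feature means callers supply already-parsed items instead of a raw document. The sketch below shows what such a list could look like; the item shapes are modeled loosely on MinerU-style output, and the exact field names RAG-Anything expects may differ:

```python
# Illustrative pre-parsed content list, loosely modeled on MinerU-style
# output; the field names here are assumptions, not the library's schema.
content_list = [
    {"type": "text", "text": "RAG-Anything handles heterogeneous documents.", "page_idx": 0},
    {"type": "image", "img_path": "figures/architecture.png",
     "img_caption": ["System architecture overview"], "page_idx": 1},
    {"type": "table",
     "table_body": "System,Accuracy\nRAGAnything,95.2%",
     "table_caption": ["Performance comparison"], "page_idx": 2},
    {"type": "equation", "latex": r"E = mc^2", "page_idx": 2},
]

# Group items by modality, as a content router would before dispatching
# each item to its modality-specific processor.
by_type = {}
for item in content_list:
    by_type.setdefault(item["type"], []).append(item)

print(sorted(by_type))  # ['equation', 'image', 'table', 'text']
```

Grouping by type mirrors the classification step that precedes modality-specific processing.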


🏗️ Algorithm & Architecture

RAG-Anything employs a flexible, layered architecture to implement a multi-stage, multimodal processing pipeline, extending traditional RAG systems into a comprehensive platform for heterogeneous content types.

The pipeline follows this flow: Document Parsing → Content Analysis → Knowledge Graph → Intelligent Retrieval.

1. Document Parsing Stage

The system builds a high-precision document parsing platform. It uses a structured extraction engine to identify and extract multimodal elements completely. An adaptive content decomposition mechanism intelligently separates heterogeneous content like text, images, tables, and formulas while preserving their semantic relationships.

Core Components:

  • ⚙️ Structured Extraction Engine: Integrates the MinerU framework for precise document structure recognition and content extraction.

  • 🧩 Adaptive Content Decomposition: Automatically identifies and extracts heterogeneous elements (text blocks, images, tables, formulas) while maintaining their semantic relationships.

  • 📁 Multi-Format Compatibility: Deploys a matrix of specialized parsers for unified processing of PDF, Office documents, and images.
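
Multi-format compatibility can be pictured as a small dispatch step ahead of parsing. The sketch below routes a file to a processing path by extension; the function name and extension sets are illustrative and do not reflect RAG-Anything's actual internals:

```python
# Hypothetical format-to-parser routing sketch; extension sets are
# illustrative, not RAG-Anything's real dispatch table.
from pathlib import Path

OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".gif", ".webp"}

def route_document(path: str) -> str:
    """Pick a processing route from the file extension."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        return "pdf_parser"      # parsed directly by the document parser
    if ext in OFFICE_EXTS:
        return "office_parser"   # converted first (e.g. via LibreOffice)
    if ext in IMAGE_EXTS:
        return "image_parser"    # OCR / vision pipeline
    raise ValueError(f"Unsupported format: {ext}")

print(route_document("report.docx"))  # office_parser
```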

2. Multimodal Content Understanding & Processing

This system uses an autonomous classification and routing mechanism to intelligently identify and distribute heterogeneous content. A concurrent multi-pipeline architecture ensures efficient parallel processing of text and multimodal content, maximizing throughput while maintaining content integrity and the original document's hierarchical structure.

Core Components:

  • 🎯 Autonomous Content Routing: Automatically identifies, classifies, and routes different content types to optimized execution channels.

  • ⚡ Concurrent Multi-Pipeline Architecture: Enables parallel execution of text and multimodal content processing for maximum throughput.

  • 🏗️ Document Hierarchy Extraction: Extracts and preserves the original document's hierarchical structure and inter-element relationships during content transformation.

3. Multimodal Analysis Engine

The system deploys modality-aware processing units for heterogeneous data types.

Specialized Analyzers:

| Analyzer | Function | Key Capabilities |
|---|---|---|
| 🔍 Visual Content Analyzer | Image analysis and content recognition | Generates context-aware captions; extracts spatial relationships and hierarchies. |
| 📊 Structured Data Interpreter | Systematic interpretation of tables | Statistical pattern recognition for trend analysis; identifies semantic relationships across datasets. |
| 📐 Mathematical Expression Parser | High-precision parsing of complex formulas | Native LaTeX support; maps equations to domain-specific knowledge. |
| 🔧 Extensible Modality Processor | Configurable framework for custom content | Dynamic integration of new modality processors via a plugin architecture. |

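The "Extensible Modality Processor" row suggests a plugin pattern. Here is a minimal sketch of such a registry, using a hypothetical decorator-based API (not RAG-Anything's actual interface):

```python
# Minimal plugin-style registry sketch for extensible modality processors.
# The decorator API is hypothetical, for illustration only.
from typing import Callable, Dict

PROCESSORS: Dict[str, Callable[[dict], str]] = {}

def register(modality: str):
    """Register a handler function for a content modality."""
    def wrap(fn):
        PROCESSORS[modality] = fn
        return fn
    return wrap

@register("image")
def describe_image(item: dict) -> str:
    return f"caption for {item['path']}"

@register("table")
def summarize_table(item: dict) -> str:
    return f"summary of {item['caption']}"

def process(item: dict) -> str:
    # Dispatch to the registered handler for this item's modality.
    handler = PROCESSORS.get(item["modality"])
    if handler is None:
        raise KeyError(f"no processor for {item['modality']}")
    return handler(item)

print(process({"modality": "image", "path": "fig1.png"}))  # caption for fig1.png
```

New modalities plug in by adding another decorated function, without touching the dispatch logic.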
4. Multimodal Knowledge Graph Indexing

This module converts document content into a structured semantic representation. It extracts multimodal entities, establishes cross-modal relationships, and maintains a hierarchical organization, enabling optimized knowledge retrieval through weighted relevance scoring.

Core Functions:

  • 🔍 Multimodal Entity Extraction: Converts important multimodal elements into structured knowledge graph entities, including semantic annotation and metadata preservation.

  • 🔗 Cross-Modal Relationship Mapping: Establishes semantic connections between text entities and multimodal components using automated reasoning.

  • 🏗️ Hierarchical Structure Preservation: Maintains the original document organization through "belongs-to" relationship chains.

  • ⚖️ Weighted Relationship Scoring: Assigns quantitative relevance scores to relationship types based on semantic proximity and contextual importance.
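
Weighted relationship scoring can be illustrated as a blend of semantic proximity and contextual importance. The linear form and the 0.6/0.4 split below are assumptions for illustration, not the system's actual formula:

```python
# Toy weighted relationship scoring: blend semantic similarity with
# structural/contextual importance. The alpha weight is an assumption.
def relation_score(semantic_sim: float, context_weight: float, alpha: float = 0.6) -> float:
    """Blend semantic similarity with contextual importance; both in [0, 1]."""
    if not (0.0 <= semantic_sim <= 1.0 and 0.0 <= context_weight <= 1.0):
        raise ValueError("inputs must lie in [0, 1]")
    return alpha * semantic_sim + (1 - alpha) * context_weight

# A "belongs-to" edge within the same section outscores a weak cross-reference.
print(round(relation_score(0.9, 0.8), 2))  # 0.86
print(round(relation_score(0.3, 0.1), 2))  # 0.22
```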

5. Modality-Aware Retrieval

The hybrid retrieval system combines vector similarity search with graph traversal algorithms for comprehensive content retrieval. It implements a modality-aware ranking mechanism and maintains relational consistency among retrieved elements to ensure coherent information delivery.

Retrieval Mechanisms:

  • 🔀 Vector-Graph Fusion: Integrates vector similarity search with graph traversal, leveraging both semantic embeddings and structural relationships.

  • 📊 Modality-Aware Ranking: An adaptive scoring mechanism adjusts ranking results based on query-specific modality preferences.

  • 🔗 Relational Consistency Maintenance: Preserves semantic and structural relationships among retrieved elements for coherent information transfer.
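
Vector-graph fusion can be sketched as a weighted combination of embedding similarity and graph proximity. The hop-distance bonus and the 0.7/0.3 weighting below are illustrative assumptions, not the system's real scoring function:

```python
# Toy vector-graph fusion: final score combines cosine similarity of
# embeddings with a graph-proximity bonus of 1 / (1 + hop distance).
import math

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fused_score(query_vec, doc_vec, graph_hops: int, w_vec: float = 0.7) -> float:
    graph_score = 1.0 / (1.0 + graph_hops)  # closer in the graph -> higher bonus
    return w_vec * cosine(query_vec, doc_vec) + (1 - w_vec) * graph_score

q = [1.0, 0.0]
near = fused_score(q, [1.0, 0.0], graph_hops=1)  # identical vector, 1 hop away
far = fused_score(q, [0.0, 1.0], graph_hops=4)   # orthogonal vector, 4 hops away
print(near > far)  # True
```

A candidate that is both semantically similar and structurally close to the query's entities ranks above one that scores well on only one signal.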


🚀 Quick Start

Installation

Option 1: Install from PyPI (Recommended)

# Basic installation
pip install raganything

# Install with optional dependencies for extended format support:
pip install 'raganything[all]'              # All optional features
pip install 'raganything[image]'            # Image format conversion (BMP, TIFF, GIF, WebP)
pip install 'raganything[text]'             # Text file processing (TXT, MD)
pip install 'raganything[image,text]'       # Combination of features

Option 2: Install from Source

git clone https://github.com/HKUDS/RAG-Anything.git
cd RAG-Anything
pip install -e .

# Install optional dependencies
pip install -e '.[all]'

Optional Dependencies:

| Dependency | Description |
|---|---|
| [image] | Enables processing of BMP, TIFF, GIF, WebP formats (requires Pillow). |
| [text] | Enables processing of TXT and MD files (requires ReportLab). |
| [all] | Includes all Python optional dependencies. |

⚠️ Office Document Processing Requirements:

  • Office documents (.doc, .docx, .ppt, .pptx, .xls, .xlsx) require LibreOffice to be installed.

  • Download from the LibreOffice website.

  • Windows: Download the installer from the website.

  • macOS: brew install --cask libreoffice

  • Ubuntu/Debian: sudo apt-get install libreoffice

  • CentOS/RHEL: sudo yum install libreoffice
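
Before processing Office files, it can be handy to verify that LibreOffice is actually on the PATH. This small check assumes the standard `soffice` executable name:

```python
# Check whether the LibreOffice binary that Office document conversion
# depends on is available; assumes the standard `soffice` executable name.
import shutil

def libreoffice_available() -> bool:
    return shutil.which("soffice") is not None

if libreoffice_available():
    print("LibreOffice found - Office documents can be converted")
else:
    print("LibreOffice not found - install it before processing .docx/.pptx/.xlsx")
```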

Verify MinerU Installation:

# Verify installation
mineru --version

# Check configuration
python -c "from raganything import RAGAnything; rag = RAGAnything(); print('✅ MinerU installed correctly' if rag.check_parser_installation() else '❌ MinerU installation issue')"

Models are downloaded automatically on first use. For manual download, refer to the MinerU Model Source Configuration.

Usage Example

1. End-to-End Document Processing

The following example demonstrates a complete workflow: configuring the system, processing a PDF document, and performing both text and multimodal queries.

import asyncio
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc

async def main():
    # Set up API configuration
    api_key = "your-api-key"
    base_url = "your-base-url"  # Optional

    # Create RAGAnything configuration
    config = RAGAnythingConfig(
        working_dir="./rag_storage",
        parser="mineru",  # Choose parser: mineru or docling
        parse_method="auto",  # Parse method: auto, ocr, or txt
        enable_image_processing=True,
        enable_table_processing=True,
        enable_equation_processing=True,
    )

    # Define LLM model function
    def llm_model_func(prompt, system_prompt=None, history_messages=[], **kwargs):
        return openai_complete_if_cache(
            "gpt-4o-mini",
            prompt,
            system_prompt=system_prompt,
            history_messages=history_messages,
            api_key=api_key,
            base_url=base_url,
            **kwargs,
        )

    # Define vision model function for image processing
    def vision_model_func(
        prompt, system_prompt=None, history_messages=[], image_data=None, messages=None, **kwargs
    ):
        # If messages format is provided (for multimodal VLM-enhanced queries), use directly
        if messages:
            return openai_complete_if_cache(
                "gpt-4o",
                "",
                system_prompt=None,
                history_messages=[],
                messages=messages,
                api_key=api_key,
                base_url=base_url,
                **kwargs,
            )
        # Traditional single-image format
        elif image_data:
            return openai_complete_if_cache(
                "gpt-4o",
                "",
                system_prompt=None,
                history_messages=[],
                # Build the message list, dropping the system entry when no
                # system prompt is given (a bare None in the list would be
                # rejected by the API).
                messages=[
                    msg
                    for msg in [
                        {"role": "system", "content": system_prompt}
                        if system_prompt
                        else None,
                        {
                            "role": "user",
                            "content": [
                                {"type": "text", "text": prompt},
                                {
                                    "type": "image_url",
                                    "image_url": {
                                        "url": f"data:image/jpeg;base64,{image_data}"
                                    },
                                },
                            ],
                        },
                    ]
                    if msg is not None
                ],
                api_key=api_key,
                base_url=base_url,
                **kwargs,
            )
        # Plain text format
        else:
            return llm_model_func(prompt, system_prompt, history_messages, **kwargs)

    # Define embedding function
    embedding_func = EmbeddingFunc(
        embedding_dim=3072,
        max_token_size=8192,
        func=lambda texts: openai_embed.func(
            texts,
            model="text-embedding-3-large",
            api_key=api_key,
            base_url=base_url,
        ),
    )

    # Initialize RAGAnything
    rag = RAGAnything(
        config=config,
        llm_model_func=llm_model_func,
        vision_model_func=vision_model_func,
        embedding_func=embedding_func,
    )

    # Process a document
    await rag.process_document_complete(
        file_path="path/to/your/document.pdf",
        output_dir="./output",
        parse_method="auto"
    )

    # Query the processed content
    # Text query - basic knowledge base search
    text_result = await rag.aquery(
        "What is the main content of the document?",
        mode="hybrid"
    )
    print("Text Query Result:", text_result)

    # Multimodal query - query involving specific multimodal content
    multimodal_result = await rag.aquery_with_multimodal(
        "Analyze this performance data and explain its relationship to the existing document content",
        multimodal_content=[{
            "type": "table",
            "table_data": """System,Accuracy,F1 Score
                            RAGAnything,95.2%,0.94
                            Baseline Method,87.3%,0.85""",
            "table_caption": "Performance Comparison Results"
        }],
        mode="hybrid"
    )
    print("Multimodal Query Result:", multimodal_result)

if __name__ == "__main__":
    asyncio.run(main())

This concludes the introduction to the core architecture, features, and a practical example of RAG-Anything. The system represents a significant step forward in making complex, multimodal information accessible and queryable through a unified RAG pipeline.

FAQ

Which document formats does RAG-Anything support?

RAG-Anything supports PDF, Office documents (DOC/DOCX/PPT/PPTX/XLS/XLSX), and image formats. Processing Office documents requires LibreOffice, and parsing relies on MinerU.

How does RAG-Anything handle images and tables within documents?

The system deploys specialized processors for images, tables, formulas, and text. A VLM-enhanced query mode can pass images together with their textual context to a vision-language model for comprehensive analysis.

What are RAG-Anything's installation requirements?

It can be installed via pip. Processing Office documents requires LibreOffice, and document parsing requires MinerU. The system offers two modes: MinerU-based intelligent parsing and direct content list insertion.
