How to Build a Local DeepSeek Knowledge Base on Windows: The Latest RAG Tutorial for 2026

Introduction

This article provides a detailed, beginner-friendly guide to building a local, private knowledge base powered by the DeepSeek large language model on a Windows system. The solution leverages modern Retrieval-Augmented Generation (RAG) techniques, combining the fastembed library for generating text embeddings with the FAISS library for efficient similarity search and vector storage. Furthermore, we introduce a custom-built Tkinter-based Graphical User Interface (GUI) that simplifies the process of building, managing, and incrementally updating your knowledge base, supporting various document formats including PDF, DOCX, EPUB, and TXT.

Core Technical Concepts

Understanding the RAG Pipeline

The system operates on a standard RAG pipeline. The embedder.embed function is responsible for loading the pre-trained embedding model (e.g., BAAI/bge-small-zh-v1.5) and converting text chunks into high-dimensional vector representations. FAISS (Facebook AI Similarity Search) then stores these vectors in an index and computes distances between them for fast retrieval. When a query is made, the system finds the most semantically similar text chunks in the index and passes them as context to the large language model (such as DeepSeek) to generate an informed answer.
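
The retrieve-then-prompt flow can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the tutorial's code: `toy_embed` is a hypothetical stand-in for the fastembed model (it only counts letters), `retrieve` is a brute-force L2 search standing in for the FAISS index, and the prompt wording in `build_prompt` is invented for the example.

```python
import math
from collections import Counter

VOCAB = "abcdefghijklmnopqrstuvwxyz"

def toy_embed(text, vocab=VOCAB):
    """Hypothetical stand-in for a real embedding model: letter counts over vocab."""
    counts = Counter(text.lower())
    return [float(counts[ch]) for ch in vocab]

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, chunks, top_k=2):
    """Brute-force nearest-neighbour search; smaller distance means more similar."""
    q = toy_embed(query)
    scored = [(l2_distance(q, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0])
    return scored[:top_k]

def build_prompt(query, hits):
    """Place the retrieved chunks in front of the question as numbered context."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, (_, chunk) in enumerate(hits))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"

chunks = ["faiss stores vectors", "tkinter draws windows", "pdf files contain pages"]
hits = retrieve("faiss stores vectors", chunks)
prompt = build_prompt("What does faiss store?", hits)
```

In a real pipeline the resulting prompt would be sent to the language model; only the embedding and search steps change when the toy functions are swapped for fastembed and FAISS.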

Key Components of the Implementation

Text Embedding with FastEmbed: Utilizes the BAAI/bge-small-zh-v1.5 model, a compact yet effective model optimized for Chinese text, to generate vector embeddings. A domestic mirror (hf-mirror.com) is configured to accelerate model downloads in China.
Vector Indexing with FAISS: Uses faiss.IndexFlatL2 for exact L2 distance-based similarity search. The faiss.IndexIDMap wrapper is essential here: it attaches stable, caller-assigned IDs to vectors, which is what makes update and delete operations possible.
Document Processing: Supports multiple formats:
- PDF (via pypdf)
- DOCX (via python-docx)
- PPTX (via python-pptx)
- XLSX (via pandas)
- EPUB (via ebooklib and bs4)
- TXT (with multiple encoding fallbacks)
Text Chunking: Implements a sliding window approach (split_text function) to break down documents into overlapping chunks (default size 500 characters, overlap 50 characters) to preserve context.
Incremental Update Logic: The GUI features an incremental update mechanism. It tracks file states (modification time and size) to detect new, modified, or deleted documents. Where possible, it reuses the FAISS IDs of deleted chunks for new ones, which keeps the ID space compact.
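
As an illustration of the sliding-window chunking listed above, here is a minimal sketch. The name `split_text` and the defaults (500 characters, 50 overlap) come from the article; the body is an assumption about how such a function might look, not the author's implementation.

```python
def split_text(text, chunk_size=500, overlap=50):
    """Split text into windows of up to chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = split_text(sample)
```

With the defaults, a 1000-character text yields windows starting at offsets 0, 450, and 900, so each chunk begins with the last 50 characters of the previous one.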

Building the Tkinter GUI Application

Application Structure and Features

The EmbeddingApp class forms the backbone of the GUI application. It is designed for usability, separating the interface into two main panels: one for building/managing the knowledge base and another for performing semantic searches.

Main Features of the Build Panel:

Directory Selection: Allows users to choose or drag-and-drop a folder containing source documents.
Format Selection: Checkboxes to select which document formats (PDF, DOCX, etc.) to include.
Build Actions: Buttons for "New Build" (creates index from scratch) and "Incremental Update" (only processes changed files). The incremental button's state (enabled/disabled) automatically updates based on the presence of an existing knowledge base in the selected directory.
Progress Feedback: A progress bar and status label provide real-time feedback during indexing.
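
One common way to feed a progress bar without freezing a Tkinter window is to let a worker thread post messages to a queue that the UI thread drains periodically. The sketch below shows only that pattern and runs without a display. The queue name `build_queue` follows the article; `build_worker`, `drain`, and the message tuples are hypothetical.

```python
import queue
import threading

def build_worker(files, build_queue):
    """Runs in a background thread; it never touches widgets, it only posts messages."""
    total = len(files)
    for done, name in enumerate(files, start=1):
        # ... parse, chunk and embed `name` here ...
        build_queue.put(("progress", done, total, name))
    build_queue.put(("done", total))

def drain(build_queue, handle):
    """Runs on the UI thread; returns True once the 'done' message was seen.

    In a Tkinter app this would be rescheduled with root.after(100, ...) until
    it returns True, and `handle` would update the progress bar and status label.
    """
    finished = False
    while True:
        try:
            message = build_queue.get_nowait()
        except queue.Empty:
            break
        handle(message)
        if message[0] == "done":
            finished = True
    return finished

build_queue = queue.Queue()
received = []
worker = threading.Thread(
    target=build_worker, args=(["a.pdf", "b.docx"], build_queue), daemon=True
)
worker.start()
worker.join()  # a real GUI would keep polling instead of blocking here
finished = drain(build_queue, received.append)
```

The design choice is that only the UI thread ever calls widget methods; the worker communicates exclusively through the thread-safe queue.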

Main Features of the Search Panel:

Dual Search Modes: "Semantic Search" (using the FAISS vector index) and "Keyword Search" (traditional text matching).
Interactive Results Display: Presents search results in a scrollable text area, showing the similarity score and source document for each retrieved chunk.
Direct Knowledge Base Query: While the core RAG chat integration (chat_rag.py) is separate, this panel allows for direct interrogation of the indexed content.
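
The article does not show how the keyword mode matches text, so the following is only one plausible sketch: a case-insensitive search that keeps chunks containing every query term and ranks them by total occurrences. The function name `keyword_search` and the ranking rule are assumptions.

```python
def keyword_search(query, chunks, top_k=5):
    """Return (hit_count, chunk_index, chunk) for chunks that contain every query term."""
    terms = query.lower().split()
    results = []
    for idx, chunk in enumerate(chunks):
        lowered = chunk.lower()
        counts = [lowered.count(term) for term in terms]
        if terms and all(counts):
            results.append((sum(counts), idx, chunk))
    # Most occurrences first; ties keep the original chunk order.
    results.sort(key=lambda r: (-r[0], r[1]))
    return results[:top_k]

docs = ["FAISS index stores vectors", "Tkinter GUI", "faiss faiss index"]
matches = keyword_search("faiss index", docs)
```

Because the query is split on whitespace, a Chinese query without spaces is treated as a single term and matched as a substring.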

Critical Implementation Details

Threading for Responsive UI: All long-running operations (document processing, embedding generation, index building) are executed in separate background threads. A queue-based mechanism (build_queue, search_queue) is used for safe communication between these threads and the main Tkinter thread, preventing the GUI from freezing.
Incremental Update Algorithm: The update_chunks_incrementally function is the core of this feature.
- It compares the current file system state with a previously saved kb_file_state.json.
- It identifies added, modified, and deleted files.
- For deleted and modified files, it calculates the list of associated FAISS IDs (ids_to_remove).
- IDs from deleted chunks are pooled (previous_id_slot) and prioritized for reuse when adding new chunks, helping to manage ID space.
- The FAISS index is updated by removing old IDs and adding new vectors with their IDs.
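
The bookkeeping behind these steps can be sketched without FAISS. The names `ids_to_remove` and `previous_id_slot` come from the article; the helper functions, the snapshot layout (`{"mtime": ..., "size": ...}` per path), and the chunk metadata layout are assumptions made for this example.

```python
def diff_file_state(old_state, new_state):
    """Compare two {path: {"mtime": ..., "size": ...}} snapshots."""
    added = sorted(p for p in new_state if p not in old_state)
    deleted = sorted(p for p in old_state if p not in new_state)
    modified = sorted(
        p for p in new_state
        if p in old_state and (
            new_state[p]["mtime"] != old_state[p]["mtime"]
            or new_state[p]["size"] != old_state[p]["size"]
        )
    )
    return added, modified, deleted

def collect_ids_to_remove(chunk_info, stale_files):
    """chunk_info: list of {"id": int, "source": path}. Return IDs owned by stale files."""
    stale = set(stale_files)
    return sorted(c["id"] for c in chunk_info if c["source"] in stale)

def allocate_ids(count, previous_id_slot, next_id):
    """Reuse freed IDs first, then mint fresh ones.

    Returns (ids, leftover_slot, next_id).
    """
    reused = previous_id_slot[:count]
    leftover = previous_id_slot[count:]
    fresh_needed = count - len(reused)
    fresh = list(range(next_id, next_id + fresh_needed))
    return reused + fresh, leftover, next_id + fresh_needed

old_state = {"a.txt": {"mtime": 1.0, "size": 10}, "b.txt": {"mtime": 1.0, "size": 20}}
new_state = {"a.txt": {"mtime": 2.0, "size": 12}, "c.txt": {"mtime": 3.0, "size": 5}}
chunk_info = [
    {"id": 0, "source": "a.txt"},
    {"id": 1, "source": "a.txt"},
    {"id": 2, "source": "b.txt"},
]

added, modified, deleted = diff_file_state(old_state, new_state)
ids_to_remove = collect_ids_to_remove(chunk_info, modified + deleted)
# Suppose the added and modified files produce five new chunks in total.
new_ids, previous_id_slot, next_id = allocate_ids(5, ids_to_remove, next_id=3)
```

Reusing an ID is only safe once the old vector carrying it has been removed from the index, so the removal step has to run before the new vectors are added.
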
State Persistence: The application maintains several files to preserve state across sessions:
- kb.index: The FAISS vector index.
- kb_chunks.npy: The actual text chunks.
- kb_id_map.npy: The mapping of FAISS IDs.
- kb_file_state.json: Timestamps and sizes of processed files.
- kb_chunk_info.json: Metadata for each chunk (ID, source file, etc.).
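
To make the file-state part of this list concrete, here is a minimal sketch of taking a snapshot and round-tripping it through kb_file_state.json. Only the file name comes from the article; `snapshot_files`, `save_state`, `load_state`, and the JSON layout are hypothetical.

```python
import json
import os

STATE_FILE = "kb_file_state.json"  # file name taken from the article
SUPPORTED = (".pdf", ".docx", ".pptx", ".xlsx", ".epub", ".txt")

def snapshot_files(directory, extensions=SUPPORTED):
    """Record modification time and size for each supported file under directory."""
    state = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.lower().endswith(extensions):
                path = os.path.join(root, name)
                info = os.stat(path)
                rel = os.path.relpath(path, directory)
                state[rel] = {"mtime": info.st_mtime, "size": info.st_size}
    return state

def save_state(kb_dir, state):
    """Write the snapshot next to the other knowledge base files."""
    with open(os.path.join(kb_dir, STATE_FILE), "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False, indent=2)

def load_state(kb_dir):
    """Return the saved snapshot, or an empty dict when none exists yet."""
    path = os.path.join(kb_dir, STATE_FILE)
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)
```

An empty result from `load_state` is one simple signal that no knowledge base exists in the directory yet, which is the kind of check a GUI could use to enable or disable an incremental update button.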

Code Walkthrough: Key Functions and Classes

The `FaissIncrementalIndex` Class

This custom class encapsulates FAISS operations with a focus on supporting incremental updates through stable ID management.

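import faiss
import numpy as np

class FaissIncrementalIndex:
    def __init__(self, dimension=128, use_l2=True, use_id_map=True):
        # Assumed reconstruction of the elided setup: a flat base index (L2 or
        # inner product), wrapped with IndexIDMap when use_id_map is True.
        # dimension must match the embedding size (512 for BAAI/bge-small-zh-v1.5).
        base = faiss.IndexFlatL2(dimension) if use_l2 else faiss.IndexFlatIP(dimension)
        self.index = faiss.IndexIDMap(base) if use_id_map else base
        self.use_id_map = use_id_map
        self.used_ids = set()  # Tracks assigned IDs
        self.next_id = 0

    def add_vectors(self, vectors, ids=None):
        vectors = np.ascontiguousarray(vectors, dtype=np.float32)
        if ids is None:  # Generate unique IDs if none are provided
            ids = np.arange(self.next_id, self.next_id + len(vectors))
        ids = np.asarray(ids, dtype=np.int64)  # FAISS expects int64 IDs
        if self.use_id_map and self.used_ids & set(ids.tolist()):
            raise ValueError("ID collision")  # Refuse to reuse a live ID
        self.index.add_with_ids(vectors, ids)  # ID-based addition; needs the IndexIDMap wrapper
        self.used_ids.update(ids.tolist())
        self.next_id = max([self.next_id] + [i + 1 for i in ids.tolist()])
        return ids

    def update_vectors(self, vectors, ids):
        # Replaces existing vectors with new ones for the given IDs.
        self.remove_vectors(ids)
        return self.add_vectors(vectors, ids)

    def remove_vectors(self, ids):
        # Removes vectors associated with the given IDs from the index.
        ids = np.asarray(ids, dtype=np.int64)
        self.index.remove_ids(ids)
        self.used_ids.difference_update(ids.tolist())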

Why IndexIDMap is Essential: A bare IndexFlatL2 identifies vectors only by their insertion position. It does not accept caller-supplied IDs through add_with_ids, and removing vectors renumbers the ones that remain. IndexIDMap adds a stable identifier to each vector, allowing us to target them for remove_ids and add_with_ids operations, which is the foundation of the incremental update feature.
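
The value of a stable identifier can be illustrated without FAISS at all. The two classes below are toy models written for this explanation; they are not FAISS classes and only mimic the bookkeeping, with strings standing in for vectors.

```python
class PositionalStore:
    """Toy model: an item's ID is its position, so a removal renumbers later items."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)
        return len(self.items) - 1  # the position doubles as the ID

    def remove(self, position):
        del self.items[position]

    def get(self, position):
        return self.items[position]


class IdMappedStore:
    """Toy model: an explicit ID list travels alongside the items."""

    def __init__(self):
        self.items = []
        self.ids = []

    def add_with_ids(self, items, ids):
        self.items.extend(items)
        self.ids.extend(ids)

    def remove_ids(self, ids):
        drop = set(ids)
        kept = [(i, v) for i, v in zip(self.ids, self.items) if i not in drop]
        self.ids = [i for i, _ in kept]
        self.items = [v for _, v in kept]

    def get(self, item_id):
        return self.items[self.ids.index(item_id)]


flat = PositionalStore()
for chunk in ("chunk-a", "chunk-b", "chunk-c"):
    flat.add(chunk)
flat.remove(0)  # position 1 now refers to "chunk-c", not "chunk-b"

mapped = IdMappedStore()
mapped.add_with_ids(["chunk-a", "chunk-b", "chunk-c"], [10, 11, 12])
mapped.remove_ids([10])  # ID 11 still refers to "chunk-b"
```

With positional numbering, every external record that points at a chunk would have to be rewritten after each deletion; with an ID map, the metadata files can keep referring to the same IDs.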

(Due to the length of the provided code, the detailed walkthrough of the GUI setup functions (setup_ui, setup_build_panel, setup_search_panel), the drag-and-drop initialization (setup_drag_drop), and the core document loading functions (load_pdf, load_epub, etc.) will be continued in the next part. The implementation follows the patterns described above, using Tkinter widgets and threading to create a functional and user-friendly interface.)

Frequently Asked Questions (FAQ)

How do I build a private DeepSeek knowledge base on Windows?

This tutorial provides a step-by-step guide that uses RAG with fastembed to generate vectors and FAISS to store and retrieve them. It also includes a Tkinter GUI that supports incremental updates for PDF, DOCX, and other document formats.

Which document formats does the knowledge base system support?

The system supports PDF, DOCX, PPTX, XLSX, EPUB, and TXT. Each format is handled by a corresponding library (such as pypdf or python-docx), and a sliding-window chunking method preserves context.

How is the incremental update feature implemented?

The GUI detects document changes by tracking file modification times and sizes, manages vector IDs with faiss.IndexIDMap, and can reassign the IDs of deleted chunks to new content, which keeps index updates efficient.