How can you extract structured JSON data from PDFs and images with large language models?

What is Unstract?

Unstract uses large language models (LLMs) to extract structured JSON data from documents such as PDFs, images, and scans. You define what to extract with natural-language prompts, then deploy the result as an API or an ETL pipeline.

The platform is built for teams in finance, insurance, healthcare, and KYC/compliance.

Traditional approach vs. Unstract

The table below compares traditional methods with the Unstract platform for document data extraction tasks.

Task	Without Unstract	With Unstract
Schema definition	Write regex and build templates per vendor	Write a prompt once; it handles variations
New document type	Days of development	Minutes in Prompt Studio
LLM integration	Build your own pipeline	Plug in any provider (OpenAI, Anthropic, Bedrock, Ollama, etc.)
Deployment	Custom infrastructure	Run `./run-platform.sh` or use the managed cloud
Output	Unstructured text blobs	Clean JSON, ready for your database

⭐ If Unstract helps you, please give us a star on GitHub!

✨ Core features

Prompt Studio: Define document extraction schemas in natural language.

API Deployment: Send a document over a REST API and get JSON back.

ETL Pipeline: Pull documents from a folder, process them, and load the results into your warehouse.

MCP Server: Connect to AI agents (Claude, etc.) via the Model Context Protocol.

n8n Node: Drop into existing automation workflows.
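To make the API Deployment feature more concrete, here is a minimal sketch of sending one document with curl. The function name, the URL, the Bearer-token header, and the `files` form field are all assumptions for illustration; copy the real endpoint and key from your own API deployment. By default the function only prints the command so it can be inspected before anything is sent.

```sh
#!/bin/sh
# Sketch: send one document to an API deployment and receive JSON back.
# ASSUMPTIONS: the endpoint layout, the Authorization header, and the `files`
# form field are illustrative and may differ from your deployment.

# Usage: extract_document <api_url> <api_key> <file>
# With DRY_RUN=1 (the default) the curl command is printed, not executed.
extract_document() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'curl -sS -H "Authorization: Bearer %s" -F "files=@%s" "%s"\n' \
      "$2" "$3" "$1"
  else
    curl -sS -H "Authorization: Bearer $2" -F "files=@$3" "$1"
  fi
}

# Placeholder values only; this is not a real endpoint or credential.
extract_document "https://unstract.example.com/api/invoices/" "YOUR_API_KEY" "invoice.pdf"
```

Setting `DRY_RUN=0` before the call sends the request for real. Keeping the dry run as the default is a deliberate choice so that a placeholder key is never posted by accident.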

🚀 Quick start (~5 minutes)

System requirements and prerequisites

Linux or macOS (Intel or M-series)

Docker & Docker Compose

8 GB RAM minimum

Git
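The prerequisites above can be checked before cloning with a small preflight script. This is a sketch and not part of Unstract: the function names are made up for illustration, and the memory check assumes `/proc/meminfo` on Linux or `sysctl hw.memsize` on macOS.

```sh
#!/bin/sh
# Preflight sketch for the listed prerequisites (Git, Docker, Docker Compose, 8 GB RAM).
# The helper names below are illustrative, not Unstract tooling.

# Succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Prints total memory rounded to whole GB, or nothing if it cannot be determined.
total_mem_gb() {
  if [ -r /proc/meminfo ]; then
    # Linux reports MemTotal in kB.
    awk '/^MemTotal:/ { printf "%.0f\n", $2 / 1024 / 1024 }' /proc/meminfo
  elif have_cmd sysctl; then
    # macOS reports hw.memsize in bytes.
    sysctl -n hw.memsize 2>/dev/null | awk '{ printf "%.0f\n", $1 / 1024 / 1024 / 1024 }'
  fi
}

# Prints one line per check and returns non-zero if anything is missing.
preflight() {
  missing=0
  for cmd in git docker; do
    if have_cmd "$cmd"; then echo "ok: $cmd"; else echo "missing: $cmd"; missing=1; fi
  done
  # Compose may be installed as the `docker compose` plugin or as `docker-compose`.
  if have_cmd docker && docker compose version >/dev/null 2>&1; then
    echo "ok: docker compose"
  elif have_cmd docker-compose; then
    echo "ok: docker-compose"
  else
    echo "missing: Docker Compose"; missing=1
  fi
  mem="$(total_mem_gb)"
  if [ -z "$mem" ]; then
    echo "unknown: could not determine installed memory"
  elif [ "$mem" -lt 8 ]; then
    echo "low memory: about ${mem} GB, 8 GB is the stated minimum"; missing=1
  else
    echo "ok: about ${mem} GB memory"
  fi
  return "$missing"
}

preflight || echo "Resolve the items above before running ./run-platform.sh"
```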

Run locally
# Clone the repository and start the platform
git clone https://github.com/Zipstack/unstract.git
cd unstract
./run-platform.sh
That's it!

Visit http://frontend.unstract.localhost in your browser

Log in with username unstract and password unstract

Start extracting data!

📦 Other deployment options

Docker Compose

The ./run-platform.sh script offers several options for running and configuring the platform:
# Pull and run the entire Unstract platform with the default env config.
./run-platform.sh

# Pull and run Docker containers with a specific version tag.
./run-platform.sh -v v0.1.0

# Upgrade an existing Unstract platform setup by pulling the latest available version.
./run-platform.sh -u

# Upgrade an existing Unstract platform setup by pulling a specific version.
./run-platform.sh -u -v v0.2.0

# Build Docker images locally with a specific version tag.
./run-platform.sh -b -v v0.1.0

# Build Docker images locally from the working branch, tagged as the `current` version.
./run-platform.sh -b -v current

# Display the help information.
./run-platform.sh -h

# Only set up the environment files.
./run-platform.sh -e

# Only pull Docker images with a specific version tag.
./run-platform.sh -p -v v0.1.0

# Only obtain Docker images, building them locally with a specific version tag.
./run-platform.sh -p -b -v v0.1.0

# Upgrade an existing Unstract platform setup with Docker images built locally from the working branch, tagged as the `current` version.
./run-platform.sh -u -b -v current

# Pull and run Docker containers in detached mode.
./run-platform.sh -d -v v0.1.0
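When the same version tag has to be used across several of these invocations, a small helper can keep the flag combinations in one place. The sketch below only prints commands and uses only the flag pairings listed above; the function and mode names (`run`, `detached`, `upgrade`, `build`, `pull`) are hypothetical labels, not options of the script.

```sh
#!/bin/sh
# Sketch: map a readable mode name to one of the documented flag combinations.
# The mode names are illustrative; only the flags come from the list above.

# Usage: platform_cmd <mode> <version>
# Prints the ./run-platform.sh command line without executing it.
platform_cmd() {
  case "$1" in
    run)      flags="" ;;
    detached) flags="-d" ;;
    upgrade)  flags="-u" ;;
    build)    flags="-b" ;;
    pull)     flags="-p" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  # ${flags:+ $flags} adds the flag and a leading space only when a flag is set.
  echo "./run-platform.sh${flags:+ $flags} -v $2"
}

platform_cmd upgrade v0.2.0
platform_cmd detached v0.1.0
```

Printing rather than executing keeps the helper safe to experiment with; the printed line can be pasted into a terminal once it looks right.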
🔐 Back up the encryption key

Warning

This key encrypts adapter credentials. Losing it makes existing adapters inaccessible!

Copy the value of ENCRYPTION_KEY from backend/.env or platform-service/.env to a secure location.
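The copy step can be scripted. The following is a minimal sketch that assumes the env file holds a plain `ENCRYPTION_KEY=value` line, optionally wrapped in double quotes; the function names and the backup path in the usage comment are illustrative.

```sh
#!/bin/sh
# Sketch: copy ENCRYPTION_KEY out of an env file into a private backup file.
# ASSUMPTION: the key is stored as a single-line `ENCRYPTION_KEY=value` entry.

# Prints the key value, or returns non-zero if the file has no such entry.
read_encryption_key() {
  # Use the last assignment in case the key is defined more than once.
  value="$(sed -n 's/^ENCRYPTION_KEY=//p' "$1" 2>/dev/null | tail -n 1)"
  # Strip one pair of surrounding double quotes, if present.
  value="${value#\"}"
  value="${value%\"}"
  [ -n "$value" ] && printf '%s\n' "$value"
}

# Usage: backup_encryption_key <env_file> <destination>
backup_encryption_key() {
  key="$(read_encryption_key "$1")" || return 1
  # Create the backup with owner-only permissions.
  ( umask 077; printf 'ENCRYPTION_KEY=%s\n' "$key" > "$2" )
  chmod 600 "$2"
}

# Example (run from the repository root; the destination path is a placeholder):
#   backup_encryption_key backend/.env "$HOME/unstract-encryption-key.bak"
```

A file on the same machine is only a first step; the backup still needs to be moved to a password manager or another secure location.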

🏗️ Unstract architecture
┌────────────────────────────────────────────────────────────┐
│                          Unstract                          │
├─────────────┬─────────────┬─────────────┬──────────────────┤
│  Frontend   │   Backend   │   Worker    │ Platform Service │
│  (React)    │  (Django)   │  (Celery)   │   (FastAPI)      │
├─────────────┴─────────────┴─────────────┴──────────────────┤
│                      Cache (Redis)                         │
├────────────────────────────────────────────────────────────┤
│                  Message Queue (RabbitMQ)                  │
├────────────────────────────────────────────────────────────┤
│                   Database (PostgreSQL)                    │
├────────────────────────────────────────────────────────────┤
│  LLM Adapters    │  Vector DBs    │  Text Extractors       │
│  (OpenAI, etc.)  │ (Qdrant, etc.) │  (LLMWhisperer)        │
└────────────────────────────────────────────────────────────┘
Also see the architecture documentation.

📄 Supported file formats

Unstract supports a wide range of file formats, so you can handle documents from various sources.

Category	Formats
Documents	PDF, DOCX, DOC, ODT, TXT, CSV, JSON
Spreadsheets	XLSX, XLS, ODS
Presentations	PPTX, PPT, ODP
Images	PNG, JPG, JPEG, TIFF, BMP, GIF, WEBP
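Before sending a folder of files for extraction, it can help to filter out anything that is not in the list above. The sketch below maps a file name to its category from the table by extension; the function name is illustrative, and a match on extension says nothing about whether the file content is actually valid.

```sh
#!/bin/sh
# Sketch: classify a file by extension using the supported-formats table.
# The function name is illustrative and not part of Unstract.

# Prints the category, or returns non-zero for unsupported or missing extensions.
document_category() {
  # Names without a dot have no extension to inspect.
  case "$1" in *.*) ;; *) return 1 ;; esac
  # Lowercase the extension so INVOICE.PDF and invoice.pdf match alike.
  ext="$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    pdf|docx|doc|odt|txt|csv|json)  echo "Documents" ;;
    xlsx|xls|ods)                   echo "Spreadsheets" ;;
    pptx|ppt|odp)                   echo "Presentations" ;;
    png|jpg|jpeg|tiff|bmp|gif|webp) echo "Images" ;;
    *) return 1 ;;
  esac
}

for name in invoice.pdf scan.TIFF ledger.xlsx notes.md; do
  if category="$(document_category "$name")"; then
    echo "$name: $category"
  else
    echo "$name: not in the supported list"
  fi
done
```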

🔌 Connectors and adapters

LLM providers

Unstract is designed to be model-agnostic and supports connections to mainstream LLM services.

Provider	Status	Provider	Status
OpenAI	✅	Azure OpenAI	✅
Anthropic Claude	✅	Google Gemini	✅
AWS Bedrock	✅	Mistral AI	✅
Ollama (local)	✅	Anyscale	✅

Vector databases

Unstract supports multiple vector databases for complex extraction scenarios that require semantic search or context enhancement.

Provider	Status	Provider	Status
Qdrant	✅	Pinecone	✅
Weaviate	✅	PostgreSQL	✅
Milvus	✅

Text extractors

Unstract integrates text extraction engines to retrieve text accurately from documents in various formats.

Provider	Status
LLMWhisperer	✅
Unstructured.io	✅
LlamaIndex Parse	✅

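Frequently asked questions (FAQ)
How does the Unstract platform extract structured data from PDF documents?
Unstract uses large language models: you define what to extract with natural-language prompts, and it outputs clean JSON directly from PDFs, images, scans, and other documents, with no need to write complex regular expressions or templates.
What advantages does Unstract have over traditional document processing approaches?
Traditional approaches require custom development for each vendor, which takes days. With Unstract you write a natural-language prompt once in Prompt Studio, handle a new document type in minutes, and plug in any of several LLM providers.
Which deployment modes does the Unstract platform support?
Unstract can be deployed as a REST API that receives documents and returns JSON, or as an ETL pipeline that pulls documents from a folder, processes them, and loads the results into a data warehouse. It can run locally, via Docker Compose, or on the managed cloud service.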


