如何用本地硬件72小时生成1065条高质量LLM微调指令数据集?(附多智能体方案)
AI Summary (BLUF)
This article details a multi-agent autonomous system that generates high-quality instruction datasets for fine-tuning local LLMs, achieving 1,065 professional pairs in 72 hours with zero API costs using a three-agent workflow (Curator, Producer, Critic) and local hardware.
原文翻译: 本文详细介绍了一个多智能体自主系统,用于生成本地大语言模型微调所需的高质量指令数据集。通过三智能体工作流(策划者、生产者、批评者)和本地硬件,在72小时内生成了1,065个专业指令对,且无需API成本。
Building an Autonomous Instruction Dataset Generation System: 1065 High-Quality Data Points in 72 Hours, Zero API Costs
背景与动机
Background and Motivation
我需要高质量的指令数据集来微调本地大语言模型(LLM),但商业选项的价格令人望而却步(一个质量尚可的数据集需要500至2000美元)。
I needed high-quality instruction datasets for fine-tuning local Large Language Models (LLMs), but commercial options were prohibitively expensive, ranging from $500 to $2,000 for a dataset of decent quality.
于是我问自己:如果我能在睡觉时构建一个系统来自主生成这些数据集,会怎样?
So I asked myself: What if I could build a system that generates these datasets autonomously while I sleep?
结果:72小时内生成了1065个专业的指令/响应对,100%本地运行,零API成本。以下是具体的实现方法。
Result: 1,065 professional instruction/response pairs generated in 72 hours, 100% local operation, zero API costs. Here's exactly how I did it.
现有数据集的问题
The Problem with Existing Datasets
当你想要为特定任务微调本地LLM时,通常面临三种选择:
When you want to fine-tune a local LLM for specific tasks, you typically face three options:
- 使用通用数据集 → 与你的领域不匹配 (Use generic datasets → Don't match your domain)
- 手动创建 → 耗时费力,速度慢,无法规模化 (Manual creation → Exhausting, slow, doesn't scale)
- 购买商业数据集 → 昂贵(500-2000美元),定制化有限 (Buy commercial datasets → Expensive ($500-2,000), limited customization)
我需要包含以下特征的代码指令对:
I wanted code instruction pairs with the following characteristics:
- Python/AI/ML/DevOps/数据库主题 (Python/AI/ML/DevOps/Database topics)
- 包含思维链推理(不仅仅是答案) (Chain-of-Thought reasoning (not just answers))
- 具有现实世界的复杂性(非模板化响应) (Real-world complexity (no template responses))
- 能够持续生成(如果可能,24/7不间断) (Continuous generation (24/7 if possible))
现有的选项都无法同时满足这四点要求。
None of the existing options delivered all four requirements.
解决方案:多智能体自主系统
The Solution: Multi-Agent Autonomous System
我构建了一个受学术研究工作流启发的三智能体系统:
I built a three-agent system inspired by academic research workflows:
┌──────────────────┐
│ 策展智能体 │ ← 从知识库中选择主题
│ (Qwen 7B) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 生产智能体 │ ← 生成指令 + 响应 + 思维链
│ (Qwen 7B) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 评审智能体 │ ← 验证逻辑,检查幻觉
│ (DeepSeek 1.5B)│
└────────┬─────────┘
│
▼
[接受/拒绝]
┌──────────────────┐
│  Curator Agent   │ ← Selects topics from knowledge base
│    (Qwen 7B)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Producer Agent  │ ← Generates instruction + response + CoT
│    (Qwen 7B)     │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Critic Agent   │ ← Validates logic, checks hallucinations
│ (DeepSeek 1.5B)  │
└────────┬─────────┘
         │
         ▼
  [Accept/Reject]
智能体角色
Agent Roles
策展智能体:
Curator Agent:
- 读取包含潜在主题的ChromaDB向量存储库 (Reads ChromaDB vector store of potential topics)
- 基于多样性和优先级选择下一个主题 (Selects next topic based on diversity + priority)
- 通过相似性搜索防止重复 (Prevents duplicates via similarity search)
- 可以将其视为研究图书馆员 (Think of it as a research librarian)
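The article doesn't show how the Curator balances "diversity + priority". A toy sketch of that selection logic might look like the following, with simple word-overlap similarity standing in for ChromaDB's embedding search; all names and the scoring scheme are hypothetical, not the author's actual code:

```python
def jaccard(a, b):
    """Word-overlap similarity between two topic strings (0..1)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def pick_next_topic(topics, recent, max_sim=0.5):
    """Pick the highest-priority topic that is not too similar
    to recently generated ones (diversity + priority)."""
    for topic, priority in sorted(topics, key=lambda t: -t[1]):
        if all(jaccard(topic, r) <= max_sim for r in recent):
            return topic
    return None  # everything is too close to recent picks

topics = [
    ("Python asyncio task groups", 3),
    ("Python asyncio event loops", 2),
    ("SQL window functions", 1),
]
recent = ["Python asyncio event loops basics"]
print(pick_next_topic(topics, recent))  # Python asyncio task groups
```

The highest-priority topic wins only if it clears the similarity gate; otherwise the Curator falls through to the next candidate.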
生产智能体:
Producer Agent:
- 生成实际的指令/响应对 (Generates the actual instruction/response pair)
- 包含思维链推理 (Includes Chain-of-Thought reasoning)
- 利用领域知识创建真实场景 (Uses domain knowledge to create realistic scenarios)
- 可以将其视为专家作者 (Think of it as the expert writer)
评审智能体:
Critic Agent:
- 审查生产智能体的输出,检查以下方面: (Reviews Producer's output for:)
- 幻觉 (Hallucinations)
- 逻辑错误 (Logical errors)
- 不完整的推理 (Incomplete reasoning)
- 通用模板响应 (Generic template responses)
- 二元决策:接受或拒绝 (Binary decision: Accept or Reject)
- 可以将其视为同行评审员 (Think of it as the peer reviewer)
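The article doesn't show how the Critic's free-text verdict is turned into the binary accept/reject decision. A minimal fail-closed parser could look like this; the ACCEPT/REJECT token convention and function name are assumptions, not the author's actual protocol:

```python
import re

def parse_verdict(critic_output: str) -> bool:
    """Map the Critic's free-text verdict to a binary accept/reject.
    Looks for an explicit ACCEPT/REJECT token and defaults to reject
    (fail-closed) when the verdict is missing or ambiguous."""
    match = re.search(r"\b(ACCEPT|REJECT)\b", critic_output.upper())
    return bool(match) and match.group(1) == "ACCEPT"

print(parse_verdict("Verdict: ACCEPT - reasoning is sound"))  # True
print(parse_verdict("REJECT: the SQL example hallucinates"))  # False
print(parse_verdict("not sure"))                              # False
```

Defaulting to reject on ambiguous output trades some yield for quality, which matters when the pipeline runs unattended for 72 hours.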
技术栈
Tech Stack
核心组件
Core Components
| 组件 | 版本/型号 | 主要用途 | 备注 |
|---|---|---|---|
| Ollama (本地LLM引擎) | 自定义Modelfile (8k上下文) | 模型推理与生成 | qwen2.5-coder-8k:7b (4.7GB, 主生成器); deepseek-r1-8k:1.5b (1.1GB, 验证器) |
| CrewAI | v1.12+ | 智能体编排与工作流管理 | 定义Agent, Task, Crew |
| ChromaDB | - | 去重与记忆存储 | 向量相似性搜索,防止重复 |
| Flask | - | 实时监控仪表板 | 可视化生成进度与指标 |
| Component | Version/Model | Primary Use | Notes |
|---|---|---|---|
| Ollama (Local LLM Engine) | Custom Modelfile (8k context) | Model inference & generation | qwen2.5-coder-8k:7b (4.7GB, main generator); deepseek-r1-8k:1.5b (1.1GB, validator) |
| CrewAI | v1.12+ | Agent orchestration & workflow management | Defines Agent, Task, Crew |
| ChromaDB | - | Deduplication & memory storage | Vector similarity search, prevents duplicates |
| Flask | - | Real-time monitoring dashboard | Visualizes generation progress & metrics |

硬件配置
Hardware Configuration
| 组件 | 型号/规格 | 备注 |
|---|---|---|
| 处理器 | AMD Ryzen AI 9 HX 370 | 12核心,32GB RAM(共享作为VRAM) |
| 主机 | Geekom A9 Max 迷你PC | 约899美元 |
| 存储 | NVMe SSD | 用于快速模型加载 |
| Component | Model/Specification | Notes |
|---|---|---|
| Processor | AMD Ryzen AI 9 HX 370 | 12 cores, 32GB RAM (shared as VRAM) |
| Host | Geekom A9 Max Mini PC | ~$899 |
| Storage | NVMe SSD | For fast model loading |

总投资:899美元(硬件)+ 约3.60美元(72小时电费)
Total investment: $899 (hardware) + ~$3.60 (electricity for 72 hours)
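As a sanity check on the ~$3.60 electricity figure, a back-of-the-envelope calculation roughly reproduces it; the wattage and tariff below are my assumptions, since the article gives neither:

```python
# Hypothetical figures: the article states neither the power draw nor the tariff.
avg_power_kw = 0.165   # assumed average draw of the mini PC under load (~165 W)
hours = 72             # length of the generation run
price_per_kwh = 0.30   # assumed electricity price in $/kWh

energy_kwh = avg_power_kw * hours
cost = energy_kwh * price_per_kwh
print(f"{energy_kwh:.2f} kWh -> ${cost:.2f}")  # 11.88 kWh -> $3.56
```

Any similar wattage/tariff combination lands in the same few-dollar range, which is the point: the marginal cost of a run is dominated by hardware, not energy.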
实施细节与挑战
Implementation Details and Challenges
挑战一:长时间运行中的内存泄漏
Challenge 1: Memory Leaks in Long Runs
问题:智能体实例会累积状态。在大约100个周期后,性能下降并导致崩溃。
Problem: Agent instances accumulate state. After approximately 100 cycles, performance degrades leading to crashes.
解决方案:每个周期都重新创建所有智能体。
Solution: Recreate all agents from scratch every cycle.
```python
from crewai import Agent, Crew, Task

for cycle in range(1000):
    # 从头重新创建智能体(防止内存泄漏)
    # Recreate agents from scratch (prevents memory leaks)
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        llm=ollama_qwen
    )
    producer = Agent(
        role="Content Producer",
        goal="Generate high-quality instruction pair",
        llm=ollama_qwen
    )
    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        llm=ollama_deepseek
    )

    # 定义顺序工作流
    # Define sequential workflow
    tasks = [
        Task(description="Select topic", agent=curator),
        Task(description="Generate content", agent=producer),
        Task(description="Validate quality", agent=critic)
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # 显式清理
    # Explicit cleanup
    del crew, curator, producer, critic
```

结果:72小时内零崩溃。RAM稳定在24.2 GB。
Result: Zero crashes in 72 hours. Stable RAM usage at 24.2 GB.
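The article attributes the leak to agent instances accumulating state but doesn't show how this was observed. The effect can be reproduced in miniature with the standard-library `tracemalloc`; this is a toy illustration of state accumulation, not the CrewAI internals:

```python
import tracemalloc

tracemalloc.start()

def leaky_cycle(history=[]):       # mutable default: state survives across calls
    history.append("x" * 100_000)  # ~100 KB retained every cycle

def clean_cycle():
    data = "x" * 100_000           # goes out of scope when the call returns
    return len(data)

baseline = tracemalloc.get_traced_memory()[0]
for _ in range(50):
    clean_cycle()
after_clean = tracemalloc.get_traced_memory()[0]

for _ in range(50):
    leaky_cycle()
after_leaky = tracemalloc.get_traced_memory()[0]

print(after_leaky - after_clean > 1_000_000)  # True: several MB of retained state
```

Recreating objects each cycle, as in the CrewAI loop above, is the same cure: nothing long-lived is left holding references between cycles.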
挑战二:CrewAI与Ollama的兼容性
Challenge 2: CrewAI + Ollama Compatibility
问题:CrewAI v1.12+ 要求纯字符串响应,但Ollama默认返回对象。
Problem: CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.
解决方案:将配置嵌入到Modelfiles中(永久性修复)。
Solution: Embed configuration in Modelfiles (permanent fix).
```
# qwen2.5-coder-8k:7b 的 Modelfile
# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM """You are an expert programmer..."""
```

运行命令:
Run commands:
```shell
ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek
```

这使得 `num_ctx: 8192` 在模型级别成为永久设置,无需运行时配置。
This makes `num_ctx: 8192` permanent at the model level. No runtime configuration is needed.
挑战三:重复问题
Challenge 3: Duplicate Questions
问题:随机主题生成会产生重复项。
Problem: Random topic generation creates duplicates.
解决方案:ChromaDB相似性搜索 + 拒绝机制。
Solution: ChromaDB similarity search + rejection mechanism.
```python
import chromadb

client = chromadb.Client()
# 使用余弦距离,便于把距离换算成相似度
# Use cosine distance so the returned distance converts cleanly to a similarity
collection = client.create_collection(
    "generated_questions", metadata={"hnsw:space": "cosine"}
)

def is_duplicate(new_question, threshold=0.85):
    """通过嵌入相似性检查问题是否已存在
    Check if question already exists via embedding similarity"""
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )
    if not results['ids'][0]:
        return False
    # query() 返回距离(越小越相似),换算为相似度
    # query() returns a distance (smaller = more similar); convert to similarity
    similarity = 1 - results['distances'][0][0]
    return similarity > threshold

# 保存前检查
# Check before saving
if not is_duplicate(instruction):
    save_to_dataset(entry)
    collection.add(
        documents=[instruction],
        ids=[unique_id]
    )
```

结果:1065个条目涵盖了452个独特主题(零重复)。
Result: 1,065 entries across 452 unique topics (zero duplicates).
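The article never shows the on-disk format behind `save_to_dataset`. One plausible shape is a JSONL file with instruction, Chain-of-Thought, and response fields; the field names and example content below are my assumption, not the author's actual schema:

```python
import json

# Hypothetical schema: the article lists instruction, response, and
# Chain-of-Thought components but does not show its exact on-disk format.
entry = {
    "instruction": "Write a Python context manager that times a code block.",
    "chain_of_thought": (
        "A context manager needs __enter__/__exit__; "
        "record time.perf_counter() on enter and diff it on exit."
    ),
    "response": "import time\n\nclass Timer:\n    def __enter__(self):\n"
                "        self.start = time.perf_counter()\n        return self\n"
                "    def __exit__(self, *exc):\n"
                "        self.elapsed = time.perf_counter() - self.start\n",
    "topic": "Python",
    "critic_verdict": "accept",
}

# One JSON object per line (JSONL) appends cheaply during a 72-hour run.
line = json.dumps(entry, ensure_ascii=False)
restored = json.loads(line)
print(restored["topic"])  # Python
```

JSONL also matches what most fine-tuning toolchains expect as input, so accepted entries can feed straight into training.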
72小时后的结果
Results After 72 Hours
关键指标
Key Metrics
| 指标 | 数值 |
|---|---|
| 总条目数 | 1,065 |
| 成功率 | 50.9% (605 接受 / 460 拒绝) |
| 独特主题数 | 452 |
| 生成速度 | 14.4 条目/小时 |
| 内存使用率 | 24.2 GB (稳定) |
| CPU使用率 | 31.1% 平均值 |
| 系统崩溃 | 0 |
| Metric | Value |
|---|---|
| Total Entries | 1,065 |
| Success Rate | 50.9% (605 accepted / 460 rejected) |
| Unique Topics | 452 |
| Velocity | 14.4 entries/hour |
| RAM Usage | 24.2 GB (stable) |
| CPU Usage | 31.1% average |
| Crashes | 0 |

类别分布
Distribution by Category
| 类别 | 条目数 | 占比 |
|---|---|---|
| Python | 346 | 33% |
| AI/ML | 287 | 27% |
| 后端/API | 126 | 12% |
| 数据库/SQL | 99 | 9% |
| DevOps/云 | 99 | 9% |
| 其他 | 108 | 10% |

| Category | Entries | Percentage |
|---|---|---|
| Python | 346 | 33% |
| AI/ML | 287 | 27% |
| Backend/API | 126 | 12% |
| Database/SQL | 99 | 9% |
| DevOps/Cloud | 99 | 9% |
| Other | 108 | 10% |

常见问题(FAQ)
Frequently Asked Questions (FAQ)
这个多智能体系统是如何在72小时内生成1065条高质量指令数据的?
How does this multi-agent system generate 1,065 high-quality instruction pairs in 72 hours?
系统采用三智能体工作流:策展智能体选择主题,生产智能体生成指令/响应对及思维链,评审智能体验证逻辑并检查幻觉。通过本地硬件(如Ollama引擎)100%运行,无需API调用,实现高效自主生成。
The system uses a three-agent workflow: the Curator selects topics, the Producer generates instruction/response pairs with Chain-of-Thought, and the Critic validates logic and checks for hallucinations. It runs 100% on local hardware (via the Ollama engine) with no API calls, enabling efficient autonomous generation.
为什么选择多智能体方案而不是手动或购买数据集?
Why use a multi-agent approach instead of manual creation or purchased datasets?
手动创建耗时且无法规模化;商业数据集昂贵(500-2000美元)且定制有限。多智能体系统能自主生成包含Python/AI/ML等专业主题、思维链推理和现实复杂性的高质量数据,解决现有选项的匹配度、成本与可扩展性问题。
Manual creation is slow and doesn't scale; commercial datasets are expensive ($500-2,000) with limited customization. The multi-agent system autonomously generates high-quality data covering professional topics like Python/AI/ML, with Chain-of-Thought reasoning and real-world complexity, solving the fit, cost, and scalability problems of the existing options.
系统的三个智能体具体负责什么工作?
What exactly does each of the three agents do?
策展智能体(Qwen 7B)从知识库选择多样化主题;生产智能体(Qwen 7B)生成指令、响应和思维链;评审智能体(DeepSeek 1.5B)验证逻辑、检查幻觉并决定接受/拒绝,确保数据质量。
The Curator (Qwen 7B) selects diverse topics from the knowledge base; the Producer (Qwen 7B) generates instructions, responses, and Chain-of-Thought; the Critic (DeepSeek 1.5B) validates logic, checks for hallucinations, and makes the accept/reject decision, ensuring data quality.