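How to Generate 1,065 High-Quality LLM Fine-Tuning Instruction Entries in 72 Hours on Local Hardware (with a Multi-Agent Approach)

Q: Why choose a multi-agent approach over manual creation or buying a dataset?

Manual creation is slow and does not scale; commercial datasets are expensive ($500-2,000) and offer limited customization. A multi-agent system can autonomously generate high-quality data covering specialized topics such as Python, AI, and ML, with Chain-of-Thought reasoning and real-world complexity, which addresses the domain-fit, cost, and scalability problems of the existing options.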

Building an Autonomous Instruction Dataset Generation System: 1065 High-Quality Data Points in 72 Hours, Zero API Costs

Background and Motivation

I needed high-quality instruction datasets for fine-tuning local Large Language Models (LLMs), but commercial options were prohibitively expensive, ranging from $500 to $2,000 for a dataset of decent quality.

So I asked myself: What if I could build a system that generates these datasets autonomously while I sleep?

Result: 1,065 professional instruction/response pairs generated in 72 hours, 100% local operation, zero API costs. Here's exactly how I did it.

The Problem with Existing Datasets

When you want to fine-tune a local LLM for specific tasks, you typically face three options:

Use generic datasets → Don't match your domain
Manual creation → Exhausting, slow, doesn't scale
Buy commercial datasets → Expensive ($500-2,000), limited customization

I wanted code instruction pairs with the following characteristics:

Python/AI/ML/DevOps/Database topics
Chain-of-Thought reasoning (not just answers)
Real-world complexity (no template responses)
Continuous generation (24/7 if possible)

None of the existing options delivered all four requirements.

The Solution: Multi-Agent Autonomous System

I built a three-agent system inspired by academic research workflows:

┌──────────────────┐
│   Curator Agent  │ ← Selects topics from knowledge base
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Producer Agent │ ← Generates instruction + response + CoT
│   (Qwen 7B)      │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│   Critic Agent   │ ← Validates logic, checks hallucinations
│   (DeepSeek 1.5B)│
└────────┬─────────┘
         │
         ▼
    [Accept/Reject]

Agent Roles

Curator Agent:

Reads a ChromaDB vector store of potential topics
Selects next topic based on diversity + priority
Prevents duplicates via similarity search
Think of it as a research librarian

Producer Agent:

Generates the actual instruction/response pair
Includes Chain-of-Thought reasoning
Uses domain knowledge to create realistic scenarios
Think of it as the expert writer

Critic Agent:

Reviews Producer's output for:
- Hallucinations
- Logical errors
- Incomplete reasoning
- Generic template responses
Binary decision: Accept or Reject
Think of it as the peer reviewer
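
The three roles above chain into a single pass: pick a topic, write an entry, accept or reject it. A minimal sketch of that control flow follows, with plain functions standing in for the model-backed agents. The `Entry` fields, `run_cycle`, and the `demo_*` functions are hypothetical names chosen for illustration, not the author's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    topic: str
    instruction: str
    response: str
    reasoning: str  # Chain-of-Thought text


def run_cycle(
    curate: Callable[[], str],
    produce: Callable[[str], Entry],
    critique: Callable[[Entry], bool],
) -> Optional[Entry]:
    """Run one Curator -> Producer -> Critic pass; return the entry only if accepted."""
    topic = curate()
    entry = produce(topic)
    return entry if critique(entry) else None


# Stand-in agents so the sketch runs without any model (all hypothetical)
def demo_curate() -> str:
    return "python: context managers"


def demo_produce(topic: str) -> Entry:
    return Entry(topic, f"Explain {topic}", "Use a with-block ...", "Step 1 ...")


def demo_critique(entry: Entry) -> bool:
    # Binary decision: reject entries that carry no reasoning trace
    return bool(entry.reasoning.strip())


accepted = run_cycle(demo_curate, demo_produce, demo_critique)
```

Keeping the Critic's verdict as a plain boolean mirrors the binary accept/reject decision described above and makes rejected entries easy to count.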

Tech Stack

Core Components

Component	Version/Model	Primary Use	Notes
Ollama (Local LLM Engine)	Custom Modelfile (8k context)	Model inference & generation	qwen2.5-coder-8k:7b (4.7GB, main generator); deepseek-r1-8k:1.5b (1.1GB, validator)
CrewAI	v1.12+	Agent orchestration & workflow management	Define Agent, Task, Crew
ChromaDB	-	Deduplication & memory storage	Vector similarity search, prevents duplicates
Flask	-	Real-time monitoring dashboard	Visualizes generation progress & metrics

Hardware Configuration

Component	Model/Specification	Notes
Processor	AMD Ryzen AI 9 HX 370	12 cores, 32GB RAM (shared as VRAM)
Host	Geekom A9 Max Mini PC	~$899
Storage	NVMe SSD	For fast model loading

Total investment: $899 (hardware) + ~$3.60 (electricity for 72 hours)

Implementation Details and Challenges

Challenge 1: Memory Leaks in Long Runs

Problem: Agent instances accumulate state. After approximately 100 cycles, performance degrades and the process eventually crashes.

Solution: Recreate all agents from scratch every cycle.
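
import gc

from crewai import LLM, Agent, Crew, Task

# Local models served by Ollama (default local endpoint assumed)
ollama_qwen = LLM(model="ollama/qwen2.5-coder-8k:7b", base_url="http://localhost:11434")
ollama_deepseek = LLM(model="ollama/deepseek-r1-8k:1.5b", base_url="http://localhost:11434")

for cycle in range(1000):
    # Recreate agents from scratch (prevents memory leaks)
    # backstory and expected_output are required by recent CrewAI versions;
    # the values used here are illustrative placeholders
    curator = Agent(
        role="Topic Curator",
        goal="Select next topic to generate",
        backstory="Research librarian who tracks topic coverage",
        llm=ollama_qwen,
    )

    producer = Agent(
        role="Content Producer",
        goal="Generate high-quality instruction pair",
        backstory="Expert writer of realistic coding scenarios",
        llm=ollama_qwen,
    )

    critic = Agent(
        role="Quality Critic",
        goal="Validate logic and catch hallucinations",
        backstory="Peer reviewer who rejects weak entries",
        llm=ollama_deepseek,
    )

    # Define sequential workflow
    tasks = [
        Task(description="Select topic", expected_output="One topic", agent=curator),
        Task(description="Generate content", expected_output="Instruction, response, reasoning", agent=producer),
        Task(description="Validate quality", expected_output="Accept or Reject", agent=critic),
    ]

    crew = Crew(agents=[curator, producer, critic], tasks=tasks)
    result = crew.kickoff()

    # Explicit cleanup: drop all references, then force a collection pass
    del crew, tasks, curator, producer, critic
    gc.collect()

Result: Zero crashes in 72 hours. Stable RAM usage at 24.2 GB.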
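
To see why rebuilding objects each cycle keeps memory flat, the effect can be reproduced with the standard library alone. The sketch below uses a hypothetical `LeakyWorker` that retains about 100 KB per call, as a stand-in for an agent that accumulates state, and measures retained memory with `tracemalloc`. It illustrates the pattern only and does not measure CrewAI itself.

```python
import gc
import tracemalloc


class LeakyWorker:
    """Hypothetical stand-in for an agent that accumulates state on every call."""

    def __init__(self):
        self.history = []

    def run(self, cycle):
        self.history.append(bytearray(100_000))  # ~100 KB retained per call
        return cycle


def traced_growth(cycles, recreate):
    """Return bytes still allocated after `cycles` runs, relative to the start."""
    gc.collect()
    tracemalloc.start()
    baseline, _ = tracemalloc.get_traced_memory()
    worker = LeakyWorker()
    for cycle in range(cycles):
        if recreate:
            worker = LeakyWorker()  # fresh instance; the old one becomes garbage
        worker.run(cycle)
    current, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current - baseline


reused_bytes = traced_growth(50, recreate=False)  # grows with the cycle count
fresh_bytes = traced_growth(50, recreate=True)    # stays near one cycle's worth
```

With a reused worker the retained memory scales with the number of cycles, while the recreated worker holds only the last cycle's state.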

Challenge 2: CrewAI + Ollama Compatibility

Problem: CrewAI v1.12+ requires string-only responses, but Ollama returns objects by default.

Solution: Embed configuration in Modelfiles (permanent fix).

# Modelfile for qwen2.5-coder-8k:7b
FROM qwen2.5-coder:7b

PARAMETER num_ctx 8192
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM """You are an expert programmer..."""

Run commands:

ollama create qwen2.5-coder-8k:7b -f Modelfile_qwen
ollama create deepseek-r1-8k:1.5b -f Modelfile_deepseek

This makes num_ctx: 8192 permanent at the model level. No runtime configuration is needed.
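
Only the Qwen Modelfile is shown above. One way to keep both files consistent is to render them from a single set of parameters. The sketch below is a possible approach, not the author's method. The `render_modelfile` helper is hypothetical, and the `deepseek-r1:1.5b` base tag, the reuse of the same sampling parameters for the validator, and its system prompt are all assumptions.

```python
from pathlib import Path


def render_modelfile(base: str, params: dict, system: str) -> str:
    """Build Modelfile text from the FROM / PARAMETER / SYSTEM directives."""
    lines = [f"FROM {base}", ""]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    lines += ["", f'SYSTEM """{system}"""', ""]
    return "\n".join(lines)


# Parameters taken from the Qwen Modelfile in the article
SHARED_PARAMS = {"num_ctx": 8192, "temperature": 0.7, "top_p": 0.9}

MODELFILES = {
    "Modelfile_qwen": render_modelfile(
        "qwen2.5-coder:7b", SHARED_PARAMS, "You are an expert programmer..."
    ),
    # Base tag, parameters, and prompt for the validator are assumptions
    "Modelfile_deepseek": render_modelfile(
        "deepseek-r1:1.5b", SHARED_PARAMS, "You are a strict code reviewer..."
    ),
}


def write_modelfiles(directory: str = ".") -> None:
    """Write each rendered Modelfile so `ollama create ... -f <file>` can use it."""
    for filename, text in MODELFILES.items():
        Path(directory, filename).write_text(text, encoding="utf-8")
```

Calling `write_modelfiles()` before the two `ollama create` commands would produce the files those commands expect.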

Challenge 3: Duplicate Questions

Problem: Random topic generation creates duplicates.

Solution: ChromaDB similarity search + rejection mechanism.
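
import uuid

import chromadb

client = chromadb.Client()
# Use cosine distance so that similarity can be computed as 1 - distance
collection = client.get_or_create_collection(
    name="generated_questions",
    metadata={"hnsw:space": "cosine"},
)

def is_duplicate(new_question, threshold=0.85):
    """Check if question already exists via embedding similarity"""
    results = collection.query(
        query_texts=[new_question],
        n_results=1
    )

    # Results are nested per query; an empty inner list means no neighbors yet
    if not results["ids"] or not results["ids"][0]:
        return False

    # Chroma returns distances (lower = closer), so convert to a similarity
    similarity = 1.0 - results["distances"][0][0]
    return similarity > threshold

def save_if_new(instruction, entry):
    """Check before saving; store the entry only if it is not a near-duplicate"""
    if is_duplicate(instruction):
        return False
    save_to_dataset(entry)  # persistence helper from the pipeline (not shown)
    collection.add(
        documents=[instruction],
        ids=[str(uuid.uuid4())]
    )
    return True

Result: 1,065 entries across 452 unique topics (zero duplicates).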
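
The threshold test is easier to reason about without a vector database. The sketch below applies the same "reject if similarity exceeds 0.85" rule using a toy bag-of-words vector and cosine similarity. The `embed` function and the `DedupIndex` class are illustrative assumptions; a real setup would rely on the embedding model configured in ChromaDB.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (1.0 = same direction)."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class DedupIndex:
    """Keeps vectors of accepted texts and rejects near-duplicates."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.vectors = []

    def is_duplicate(self, text):
        vec = embed(text)
        return any(
            cosine_similarity(vec, seen) > self.threshold for seen in self.vectors
        )

    def add_if_new(self, text):
        """Store the text and return True, or return False if it is a duplicate."""
        if self.is_duplicate(text):
            return False
        self.vectors.append(embed(text))
        return True
```

The linear scan over stored vectors is fine for a sketch; a vector store replaces it with an approximate nearest-neighbor lookup.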

Results After 72 Hours

Key Metrics

Metric	Value
Total Entries	1,065
Success Rate	50.9% (605 accepted / 460 rejected)
Unique Topics	452
Velocity	14.4 entries/hour
RAM Usage	24.2 GB (stable)
CPU Usage	31.1% average
Crashes	0
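
The rate-style metrics can be derived from the raw counts. The sketch below shows one way to compute them; the `run_metrics` function name and its output keys are hypothetical. Plugging in the accepted and rejected counts from the table with a 72-hour window gives values that differ somewhat from the reported 50.9% and 14.4 entries/hour, which may reflect a different measurement window; the source does not say.

```python
def run_metrics(accepted, rejected, hours):
    """Derive totals and rates from accept/reject counts over a run."""
    total = accepted + rejected
    return {
        "total": total,
        "acceptance_rate_pct": round(100 * accepted / total, 1),
        "entries_per_hour": round(total / hours, 1),
        "accepted_per_hour": round(accepted / hours, 1),
    }


# Counts from the metrics table, over the stated 72-hour run
metrics = run_metrics(accepted=605, rejected=460, hours=72)
```

Logging these values from the same counters that drive the dashboard keeps the reported percentages consistent with the raw totals.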

Distribution by Category

Category	Entries	Percentage
Python	346	33%
AI/ML	287	27%
Backend/API	126	12%
Database/SQL	99	9%
DevOps/Cloud	99	9%
Other	108	10%

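Frequently Asked Questions (FAQ)
How does this multi-agent system generate 1,065 high-quality instruction entries in 72 hours?
The system uses a three-agent workflow: the Curator Agent selects topics, the Producer Agent generates instruction/response pairs with Chain-of-Thought reasoning, and the Critic Agent validates logic and checks for hallucinations. Everything runs 100% on local hardware through the Ollama engine, with no API calls, which allows efficient autonomous generation.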
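What exactly does each of the three agents do?
The Curator Agent (Qwen 7B) selects diverse topics from the knowledge base; the Producer Agent (Qwen 7B) generates the instruction, response, and Chain-of-Thought; the Critic Agent (DeepSeek 1.5B) validates logic, checks for hallucinations, and decides whether to accept or reject, which safeguards data quality.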



