Qwen3-2507: A Technical Overview and Deployment Guide for Frontier Open-Source LLMs
This article provides a technical overview of Qwen3-2507, the latest iteration of the open-weight Qwen3 large language model family, covering its Instruct and Thinking variants, available model sizes, and practical inference examples for developers and researchers.
Introduction
The Qwen team is proud to announce the latest iteration of its flagship model series: Qwen3-2507. Building upon the strong foundation laid by Qwen3 (released in April 2025), this update introduces significant enhancements across both its "Instruct" (non-thinking) and "Thinking" variants. This release represents a continued commitment to pushing the boundaries of open-weight models in reasoning, instruction following, and long-context understanding, making state-of-the-art AI capabilities more accessible to developers and researchers worldwide.
Key Concepts: Qwen3-2507 Variants
Qwen3-Instruct-2507
Qwen3-Instruct-2507 is the enhanced version of the standard, non-thinking conversational model. It is designed for efficient, high-quality interactions across a broad range of tasks without an explicit, step-by-step reasoning process. Key improvements include:
- Enhanced General Capabilities: Significant gains in instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.
- Broader Knowledge Coverage: Substantially improved coverage of long-tail knowledge across multiple languages.
- Improved Human Alignment: Markedly better alignment with user preferences in subjective and open-ended tasks, leading to more helpful and higher-quality text generation.
- Extended Context: Enhanced 256K-token long-context understanding, extendable up to 1 million tokens.
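Before sending very long inputs, it is worth checking the prompt's token budget against the context window. Below is a minimal sketch; the 262,144-token limit is an assumption based on the stated 256K window, so check the model's config.json for the exact value:

```python
# A rough pre-flight check for long-context prompts (not an official utility).
MAX_CONTEXT = 262_144  # assumed 256K window; verify against the model's config.json

def fits_in_context(text: str, tokenizer, reserve_for_output: int = 4096) -> bool:
    """Return True if `text` plus a reserved output budget fits in the context window."""
    return len(tokenizer.encode(text)) + reserve_for_output <= MAX_CONTEXT
```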
Qwen3-Thinking-2507
Qwen3-Thinking-2507 is the successor to the Qwen3 thinking model, designed for complex problems that benefit from explicit, chain-of-thought reasoning. It features:
- State-of-the-Art Reasoning: Significantly improved performance on tasks requiring deep reasoning, including mathematics, science, coding, and academic benchmarks, achieving top-tier results among open-weight thinking models.
- Enhanced General Capabilities: Marked improvements in instruction following, tool usage, text generation, and alignment with human preferences.
- Extended Context with Reasoning: Maintains enhanced 256K long-context capabilities, extendable to 1 million tokens, even within its reasoning framework.
Model Availability and Sizes
The Qwen3-2507 series is available in three distinct sizes to cater to different computational needs and application scenarios:
- 235B-A22B: A massive Mixture-of-Experts (MoE) model for cutting-edge research and applications demanding the highest performance.
- 30B-A3B: A balanced MoE model offering excellent performance with more manageable resource requirements.
- 4B: A dense model optimized for efficiency and deployment on more constrained hardware.
All models are available in both Instruct-2507 and Thinking-2507 variants.
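The MoE names encode total versus activated parameters: 235B-A22B, for example, has roughly 235 billion parameters in total but activates only about 22 billion per token, so per-token compute scales with the activated count while memory scales with the total. A back-of-the-envelope sketch of weight memory, assuming bf16 weights at 2 bytes per parameter and ignoring KV cache and activations:

```python
# Ballpark VRAM needed just to hold the weights in bf16; not official requirements.
for name, total_params in [("235B-A22B", 235e9), ("30B-A3B", 30e9), ("4B", 4e9)]:
    print(f"{name}: ~{total_params * 2 / 1e9:.0f} GB of weights in bf16")
```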
Getting Started with Qwen3
Accessing the Models
You can access all Qwen3 models, including the new Qwen3-2507 series, through the following platforms:
- 🤗 Hugging Face
- 🤖 ModelScope

Search for checkpoints with names starting with Qwen3- or visit the Qwen3 collection.
Comprehensive Documentation
For detailed guidance, please refer to the bilingual Qwen3 Documentation. The documentation covers:
- Quickstart: Basic usages and demonstrations.
- Inference: Guidance for inference with Transformers, including batch inference and streaming.
- Run Locally: Instructions for running LLMs locally on CPU/GPU with llama.cpp, Ollama, and LM Studio.
- Deployment: Demonstrations for large-scale inference with SGLang, vLLM, TGI, etc.
- Quantization: Practices for quantizing LLMs with GPTQ/AWQ and creating GGUF files.
- Training: Instructions for post-training (SFT, RLHF) with frameworks like Axolotl and LLaMA-Factory.
- Framework: Usage of Qwen within application frameworks for RAG, Agent, etc.
Inference with Transformers
The transformers library is the primary method for running Qwen3 models. Ensure you have transformers>=4.51.0.
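A quick way to verify the installed version before running the examples below (a small convenience check, not part of the official quickstart):

```python
import transformers
from packaging import version  # packaging ships as a transformers dependency

installed = version.parse(transformers.__version__)
assert installed >= version.parse("4.51.0"), (
    f"transformers {installed} is too old; run: pip install -U 'transformers>=4.51.0'"
)
```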
Using Qwen3-Instruct-2507
The following code snippet demonstrates how to use the Qwen3-30B-A3B-Instruct-2507 model. Note that this variant operates exclusively in non-thinking mode.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# Load the tokenizer and model; device_map="auto" spreads weights across available devices
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Build the prompt with the model's chat template
prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**model_inputs, max_new_tokens=16384)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
```
Using Qwen3-Thinking-2507
This snippet shows how to use the Qwen3-30B-A3B-Thinking-2507 model. The default chat template automatically triggers thinking. The model's output contains a </think> tag to denote the end of the reasoning process, which needs to be parsed.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Thinking-2507"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=32768)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Locate the closing </think> tag (token id 151668), searching from the end
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> found, e.g. generation was truncated

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)  # note: no opening <think> tag in the output
print("content:", content)
```
Important Notes:
- Qwen3-Thinking-2507 operates exclusively in thinking mode.
- The model output contains only the closing </think> tag; the opening <think> tag is implicit in the generation prompt.
- A larger max_new_tokens (e.g., 32768) is recommended for complex reasoning tasks to accommodate longer thinking chains.
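For large-scale serving, the official documentation covers SGLang and vLLM in depth. As a taste, here is a minimal offline-inference sketch with vLLM's Python API, assuming a vLLM version recent enough to support these models; the parameters are illustrative:

```python
from vllm import LLM, SamplingParams

# Offline batch inference; for an OpenAI-compatible server, see the deployment docs.
llm = LLM(model="Qwen/Qwen3-30B-A3B-Instruct-2507", tensor_parallel_size=1)
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=1024)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```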
(This post covers the introduction, key concepts, and the initial "Getting Started" and inference sections. Detailed guides for ModelScope, llama.cpp, Ollama, deployment frameworks such as SGLang and vLLM, tool use, fine-tuning, and licensing are covered comprehensively in the official Qwen3 documentation.)