What Is Qwen3.5? A Deep Dive into the 2026 Native Multimodal AI Model
Qwen3.5 is a native multimodal AI model with 397B parameters and 17B activated per inference, featuring hybrid architecture, 201 language support, and superior performance across reasoning, coding, and vision tasks.

QWEN CHAT | GitHub | Hugging Face | ModelScope | DISCORD
We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
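To give a rough intuition for why the linear-attention half of the hybrid architecture is cheap, the sketch below shows a simplified, single-head gated delta-rule update of the kind used in Gated DeltaNet-style layers: the model maintains a fixed-size state matrix instead of a growing key-value cache, so per-token cost and memory stay constant with sequence length. This is an illustrative toy, not Qwen3.5's actual implementation, and the gate values are arbitrary.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update (simplified, single head).

    S:     (d_k, d_v) fixed-size state matrix (replaces a growing KV cache)
    k, v:  current key and value vectors
    alpha: decay gate in (0, 1) that forgets old state
    beta:  write-strength gate in (0, 1)
    """
    S = alpha * S                          # gated decay of the past state
    pred = S.T @ k                         # what the state currently predicts for v
    S = S + beta * np.outer(k, v - pred)   # delta-rule correction toward v
    return S

def linear_attn_readout(S, q):
    """The output for query q is a read from the state: O(1) per token."""
    return S.T @ q

# Toy run: the state stays (d_k, d_v) no matter how many tokens we process.
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(16):                        # 16 tokens, constant memory
    k = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S = gated_delta_step(S, k, v, alpha=0.9, beta=0.5)
print(S.shape)  # (4, 3)
```

Combined with sparse mixture-of-experts routing, which touches only 17B of the 397B parameters per forward pass, this is what lets the model trade almost nothing in capability for large gains in speed and cost.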
- Qwen3.5-Plus is the hosted model available via Alibaba Cloud Model Studio, featuring:
- a 1M context window by default
- official built-in tools and adaptive tool use
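For readers who want to try the hosted model, Model Studio exposes an OpenAI-compatible Chat Completions endpoint. The sketch below only assembles the request body; the endpoint URL and the model id `qwen3.5-plus` are assumptions on our part, so check the Model Studio documentation for the exact values for your region and account before sending real requests.

```python
import json

# Assumed OpenAI-compatible endpoint for Alibaba Cloud Model Studio;
# verify against the official docs before use.
BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"

def build_request(prompt: str) -> dict:
    """Assemble the JSON body for a chat completion request."""
    return {
        "model": "qwen3.5-plus",  # assumed model id, not confirmed by this post
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,           # stream tokens back as they are generated
    }

body = build_request("Summarize the Qwen3.5 release in one sentence.")
print(json.dumps(body)[:40])
```

You would POST this body to `BASE_URL` with your API key in the `Authorization` header, exactly as with any OpenAI-compatible service.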
Performance

Below we present a comprehensive evaluation of our models against frontier models across a wide range of tasks and modalities.
Language
| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| Knowledge | | | | | | |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| Instruction Following | | | | | | |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| Long Context | | | | | | |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| STEM | | | | | | |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48.0 | 37.6 | -- | 37.6 |
| Reasoning | | | | | | |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100.0 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| General Agent | | | | | | |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| Search Agent | | | | | | |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| Multilingual | | | | | | |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| Coding Agent | | | | | | |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |
- HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
- TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated with the fixes proposed in the Claude Opus 4.5 system card.
- MCP-Mark: the GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
- Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative tool-response length reaches a preset threshold, earlier tool responses are pruned from the history to keep the context within limits.
- BrowseComp: we tested two strategies; simple context folding scored 69.0, while the same discard-all strategy used by DeepSeek-V3.2 and Kimi K2.5 reached 78.6.
- WideSearch: we use a 256k context window without any context management.
- MMLU-ProX: we report the average accuracy across 29 languages.
- WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report average scores across 55 languages using XCOMET-XXL.
- MAXIFE: we report accuracy on English + multilingual original prompts (23 settings in total).
- Empty cells (--) indicate scores that are not yet available or not applicable.
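The context-folding strategy described in the notes above can be sketched as follows. The message format and character-based budget are illustrative only, not the actual evaluation harness: once the cumulative tool-response length exceeds the threshold, the oldest tool responses are dropped first while all other history is preserved.

```python
def fold_context(messages, max_tool_chars=256_000):
    """Prune the earliest tool responses once their cumulative length
    exceeds a budget, keeping user/assistant turns intact.

    messages: list of {"role": ..., "content": ...} dicts, where tool
    outputs use role "tool". Threshold and schema are illustrative.
    """
    total = sum(len(m["content"]) for m in messages if m["role"] == "tool")
    if total <= max_tool_chars:
        return messages
    kept = []
    for m in messages:
        # Drop oldest tool responses first until we fit within the budget.
        if m["role"] == "tool" and total > max_tool_chars:
            total -= len(m["content"])
            continue
        kept.append(m)
    return kept

# Toy history: two tool responses around one user turn.
history = [
    {"role": "tool", "content": "x" * 100},
    {"role": "user", "content": "next step?"},
    {"role": "tool", "content": "y" * 50},
]
folded = fold_context(history, max_tool_chars=60)
print(len(folded))  # 2: the oldest tool response was pruned
```

The discard-all variant mentioned for BrowseComp is even simpler: when the budget is hit, every accumulated tool response is dropped rather than only the earliest ones.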
Vision Language
| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| STEM & Puzzles | | | | | | |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| General VQA | | | | | | |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench EN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| Text Recognition & Document Understanding | | | | | | |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 |
FAQ
How well does Qwen3.5 perform?
Qwen3.5-397B-A17B demonstrates outstanding performance across benchmark evaluations of reasoning, coding, agent capabilities, and multimodal understanding, and its hybrid architecture optimizes inference speed and cost without sacrificing capability.
How many languages does Qwen3.5 support?
Qwen3.5 expands its supported languages and dialects from 119 to 201, offering broader accessibility and enhanced language support to users worldwide.
What characterizes Qwen3.5's vision-language capabilities?
As a native vision-language model, Qwen3.5 excels at multimodal understanding, fusing linear attention with a sparse mixture-of-experts for efficient multimodal processing.