What Is Qwen3.5? A Deep Dive into the 2026 Native Multimodal AI Model
Qwen3.5 is a native multimodal AI model with 397B parameters and 17B activated per inference, featuring hybrid architecture, 201 language support, and superior performance across reasoning, coding, and vision tasks.

QWEN CHAT | GitHub | Hugging Face | ModelScope | DISCORD
We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
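To give a rough intuition for why the linear-attention half of the hybrid architecture is cheap, the sketch below shows a simplified, single-head gated delta-rule update of the kind used in Gated DeltaNet-style layers: the model maintains a fixed-size state matrix instead of a growing key-value cache, so per-token cost and memory stay constant with sequence length. This is an illustrative toy, not Qwen3.5's actual implementation, and the gate values are arbitrary.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update (simplified, single head).

    S:     (d_k, d_v) fixed-size state matrix (replaces a growing KV cache)
    k, v:  current key and value vectors
    alpha: decay gate in (0, 1) that forgets old state
    beta:  write-strength gate in (0, 1)
    """
    S = alpha * S                          # gated decay of the past state
    pred = S.T @ k                         # what the state currently predicts for v
    S = S + beta * np.outer(k, v - pred)   # delta-rule correction toward v
    return S

def linear_attn_readout(S, q):
    """The output for query q is a read from the state: O(1) per token."""
    return S.T @ q

# Toy run: the state stays (d_k, d_v) no matter how many tokens we process.
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(16):                        # 16 tokens, constant memory
    k = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S = gated_delta_step(S, k, v, alpha=0.9, beta=0.5)
print(S.shape)  # (4, 3)
```

Combined with sparse mixture-of-experts routing, which touches only 17B of the 397B parameters per forward pass, this is what lets the model trade almost nothing in capability for large gains in speed and cost.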
- Qwen3.5-Plus is the hosted model available via Alibaba Cloud Model Studio, featuring:
- a 1M context window by default
- official built-in tools and adaptive tool use
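For readers who want to try the hosted model, Model Studio exposes an OpenAI-compatible Chat Completions endpoint. The sketch below only assembles the request body; the endpoint URL and the model id `qwen3.5-plus` are assumptions on our part, so check the Model Studio documentation for the exact values for your region and account before sending real requests.

```python
import json

# Assumed OpenAI-compatible endpoint for Alibaba Cloud Model Studio;
# verify against the official docs before use.
BASE_URL = "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"

def build_request(prompt: str) -> dict:
    """Assemble the JSON body for a chat completion request."""
    return {
        "model": "qwen3.5-plus",  # assumed model id, not confirmed by this post
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,           # stream tokens back as they are generated
    }

body = build_request("Summarize the Qwen3.5 release in one sentence.")
print(json.dumps(body)[:40])
```

You would POST this body to `BASE_URL` with your API key in the `Authorization` header, exactly as with any OpenAI-compatible service.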
Performance

Below we present a comprehensive evaluation of our models against frontier models across a wide range of tasks and modalities.
Language
| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| Knowledge | | | | | | |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| Instruction Following | | | | | | |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| Long Context | | | | | | |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| STEM | | | | | | |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48.0 | 37.6 | -- | 37.6 |
| Reasoning | | | | | | |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100.0 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| General Agent | | | | | | |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| Search Agent | | | | | | |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| Multilingual | | | | | | |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| Coding Agent | | | | | | |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |
- HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
- TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated with the fixes proposed in the Claude Opus 4.5 system card.
- MCP-Mark: the GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
- Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative tool-response length reaches a preset threshold, earlier tool responses are pruned from the history to keep the context within limits.
- BrowseComp: we tested two strategies; simple context folding scored 69.0, while the same discard-all strategy used by DeepSeek-V3.2 and Kimi K2.5 reached 78.6.
- WideSearch: we use a 256k context window without any context management.
- MMLU-ProX: we report the average accuracy across 29 languages.
- WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report average scores across 55 languages using XCOMET-XXL.
- MAXIFE: we report accuracy on English + multilingual original prompts (23 settings in total).
- Empty cells (--) indicate scores that are not yet available or not applicable.
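The context-folding strategy described in the notes above can be sketched as follows. The message format and character-based budget are illustrative only, not the actual evaluation harness: once the cumulative tool-response length exceeds the threshold, the oldest tool responses are dropped first while all other history is preserved.

```python
def fold_context(messages, max_tool_chars=256_000):
    """Prune the earliest tool responses once their cumulative length
    exceeds a budget, keeping user/assistant turns intact.

    messages: list of {"role": ..., "content": ...} dicts, where tool
    outputs use role "tool". Threshold and schema are illustrative.
    """
    total = sum(len(m["content"]) for m in messages if m["role"] == "tool")
    if total <= max_tool_chars:
        return messages
    kept = []
    for m in messages:
        # Drop oldest tool responses first until we fit within the budget.
        if m["role"] == "tool" and total > max_tool_chars:
            total -= len(m["content"])
            continue
        kept.append(m)
    return kept

# Toy history: two tool responses around one user turn.
history = [
    {"role": "tool", "content": "x" * 100},
    {"role": "user", "content": "next step?"},
    {"role": "tool", "content": "y" * 50},
]
folded = fold_context(history, max_tool_chars=60)
print(len(folded))  # 2: the oldest tool response was pruned
```

The discard-all variant mentioned for BrowseComp is even simpler: when the budget is hit, every accumulated tool response is dropped rather than only the earliest ones.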
Vision Language
| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| STEM & Puzzles | | | | | | |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| General VQA | | | | | | |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench EN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| Text Recognition & Document Understanding | | | | | | |
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 |
FAQ
How well does Qwen3.5 perform?
Qwen3.5-397B-A17B demonstrates outstanding performance across benchmark evaluations of reasoning, coding, agent capabilities, and multimodal understanding, and its hybrid architecture optimizes inference speed and cost without sacrificing capability.
How many languages does Qwen3.5 support?
Qwen3.5 expands its supported languages and dialects from 119 to 201, offering broader accessibility and enhanced language support to users worldwide.
What characterizes Qwen3.5's vision-language capabilities?
As a native vision-language model, Qwen3.5 excels at multimodal understanding, fusing linear attention with a sparse mixture-of-experts for efficient multimodal processing.