VoxCPM: Breaking the Speech Synthesis Bottleneck with Hierarchical Semantic-Acoustic Modeling for a Leap in Zero-Shot Performance

Introduction

Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has pushed the field toward multi-stage pipelines that rely on pre-trained speech tokenizers, but these pipelines create a semantic-acoustic divide that limits holistic, expressive speech generation.

Hierarchical Semantic-Acoustic Modeling

We resolve this dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations, and present VoxCPM, a novel tokenizer-free TTS model that models speech in continuous space using a diffusion autoregressive architecture. Hierarchical semantic-acoustic modeling decomposes speech generation into a semantic level and an acoustic level, with specialized components handling each. A semi-discrete residual representation sits between discrete tokens and continuous signals, combining the stability of the former with the acoustic richness of the latter. Our framework introduces a differentiable quantization bottleneck that induces natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Model (RALM) recovers fine-grained acoustic details.
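The split between a coarse, quantized plan and a continuous residual can be illustrated with a toy sketch. This is not VoxCPM's implementation: the uniform scalar grid, the function names `quantize` and `split_semi_discrete`, and the `levels` and `bound` parameters are all assumptions chosen for illustration, and the text above does not say which quantizer the model uses. In training, a bottleneck of this kind is commonly made differentiable with a straight-through gradient; that part needs an autodiff framework and is omitted here.

```python
from typing import List, Tuple


def quantize(hidden: List[float], levels: int = 5, bound: float = 1.0) -> List[float]:
    """Snap each value onto a uniform grid of `levels` points in [-bound, bound].

    Hypothetical stand-in for the quantization bottleneck; the grid shape
    is an assumption, not something stated in the source.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    step = 2.0 * bound / (levels - 1)
    snapped = []
    for value in hidden:
        clipped = max(-bound, min(bound, value))
        index = round((clipped + bound) / step)
        snapped.append(-bound + index * step)
    return snapped


def split_semi_discrete(hidden: List[float]) -> Tuple[List[float], List[float]]:
    """Return (coarse, residual) such that coarse + residual == hidden.

    `coarse` plays the role of the stable, discrete-like plan; `residual`
    is the continuous detail a second model would have to recover.
    """
    coarse = quantize(hidden)
    residual = [h - c for h, c in zip(hidden, coarse)]
    return coarse, residual


if __name__ == "__main__":
    coarse, residual = split_semi_discrete([0.12, -0.74, 0.49, 0.93])
    print("coarse:  ", coarse)
    print("residual:", residual)
```

In this picture the coarse part can only take a handful of values, which is what makes it stable to predict, while the residual carries whatever the grid throws away. The sketch computes the residual from a known target; in a generative model it would have to be predicted.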

Technical Architecture

This hierarchical semantic-acoustic representation guides a local diffusion-based decoder to generate high-fidelity speech latents. Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating dependency on external speech tokenizers.
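To make "a simple diffusion objective" concrete, the sketch below computes a standard noise-prediction loss for one latent vector, conditioned on some guiding representation. The text above does not spell out the exact objective VoxCPM uses, so the noise-prediction form, the `alpha_bar` noise level, and the names `diffusion_loss` and `predict_noise` are assumptions used as a stand-in.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical predictor signature: (noisy latent, conditioning, noise level) -> predicted noise.
NoisePredictor = Callable[[List[float], Sequence[float], float], List[float]]


def diffusion_loss(
    x0: Sequence[float],
    cond: Sequence[float],
    predict_noise: NoisePredictor,
    alpha_bar: float,
    rng: random.Random,
) -> float:
    """Mean squared error between the true and predicted noise.

    x0        : clean speech latent (toy vector here)
    cond      : conditioning vector, standing in for the hierarchical
                semantic-acoustic representation
    alpha_bar : fraction of signal variance kept at this noise level, in (0, 1)
    """
    if not 0.0 < alpha_bar < 1.0:
        raise ValueError("alpha_bar must lie strictly between 0 and 1")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    # Corrupt the clean latent with Gaussian noise.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    # Ask the model to recover the noise that was added.
    eps_hat = predict_noise(x_t, cond, alpha_bar)
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(eps)


if __name__ == "__main__":
    def untrained(x_t: List[float], cond: Sequence[float], alpha_bar: float) -> List[float]:
        return [0.0 for _ in x_t]  # a predictor that has learned nothing

    rng = random.Random(0)
    print(diffusion_loss([0.3, -1.2, 0.8], [1.0, 0.0], untrained, 0.5, rng))
```

Because the loss is a single differentiable scalar, gradients from it could in principle flow back through the decoder into whatever produced `cond`, which is the sense in which one objective can train a whole stack end-to-end.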

Performance and Capabilities

Trained on a massive bilingual corpus of 1.8 million hours, our VoxCPM-0.5B model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers expressive and stable synthesis. Zero-shot TTS here means generating high-quality speech without additional training for a specific speaker or style. In addition, VoxCPM can comprehend text to infer and generate appropriate prosody and style, delivering speech with context-aware expressiveness and natural flow.

Open Source Accessibility

To facilitate community-driven research and development, VoxCPM is publicly accessible under the Apache 2.0 license.

Frequently Asked Questions

What is the core innovation of the VoxCPM model?

VoxCPM combines hierarchical semantic-acoustic modeling with semi-discrete residual representations to resolve the trade-off between discrete tokens and continuous signals in conventional speech synthesis, and it trains end-to-end without an external tokenizer.

How does VoxCPM handle semantic and acoustic information?

The model uses a Text-Semantic Language Model (TSLM) to generate semantic-prosodic plans and a Residual Acoustic Model (RALM) to recover acoustic details. A differentiable quantization bottleneck induces this natural division of labor.

How large is VoxCPM's training data?

According to the technical report, the VoxCPM-0.5B model was trained on a bilingual corpus of 1.8 million hours, which is among the larger speech synthesis training sets publicly reported to date.

How does VoxCPM perform on zero-shot TTS?

Among open-source systems, VoxCPM achieves state-of-the-art zero-shot TTS performance and generates speech with context-aware expressiveness and natural flow.

Does VoxCPM support Chinese speech synthesis?

Yes. VoxCPM is trained on a bilingual corpus that includes Chinese, so it can synthesize Chinese speech, and it shows a deep understanding of text prosody and style.