How to Migrate LINE Bot Speech Synthesis to Gemini 3.1 Native TTS? Taiwanese Chinese Configuration and Pitfalls for 2026
AI Summary (BLUF)
Upgrade your LINE Bot's text-to-speech to Google's Gemini 3.1 Flash Native TTS. This guide covers the code evolution, a crucial async/await mistake to avoid, and localization adjustments for Taiwanese Chinese.
Background (背景)
In the previous hands-on session, we used Gemini 3.1 Flash Live for speech recognition and, through a workaround built on the Gemini 2.5 Live API, barely managed to piece together text-to-speech (TTS).
Then, in April 2026, Google officially released Gemini 3.1 Flash TTS: a native model designed specifically for audio output. It no longer requires a Live WebSocket and can output high-quality audio directly through the standard generate_content flow.
As a developer, you naturally want to move to the more elegant, native solution right away. This article shares how to upgrade a LINE Bot's spoken-summary feature to Gemini 3.1 Native TTS, along with the "async pitfall" encountered along the way.
Technical Upgrade: From Live API to Native TTS (技术升级:从 Live API 到原生 TTS)
The previous read-aloud feature was simulated with the Gemini 2.5 Live API. It worked, but it had several shortcomings, which the comparison below makes clear.
The arrival of Gemini 3.1 Flash TTS changes all of this. The two approaches compare as follows:
| Aspect | Gemini 2.5 Live API | Gemini 3.1 Flash TTS |
|---|---|---|
| Complexity | Requires managing the WebSocket connection lifecycle | Uses the familiar generate_content_stream interface |
| Model limitations | Must use a specific native-audio model, primarily available in us-central1 | Model name gemini-3.1-flash-tts-preview, globally available |
| Return format | Fixed 16 kHz sample rate | Dynamic parameters auto-detected from the MIME type, typically 24 kHz |
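For contrast, this is roughly what the old WebSocket lifecycle looked like. This is a simplified sketch, not the original implementation; the model name is illustrative, and the session calls follow the google-genai SDK's Live interface:

```python
import asyncio
from google import genai
from google.genai import types

# Sketch of the old WebSocket-based approach (simplified; model name illustrative)
async def tts_via_live_api(text: str) -> bytes:
    client = genai.Client(api_key="...")
    config = {"response_modalities": ["AUDIO"]}
    pcm = bytearray()
    # The Live API requires an explicit session (WebSocket) lifecycle
    async with client.aio.live.connect(
        model="gemini-2.5-flash-native-audio-preview", config=config
    ) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=text)])
        )
        async for message in session.receive():
            if message.data:
                pcm.extend(message.data)  # fixed 16 kHz PCM
    return bytes(pcm)
```

Holding the session open, handling disconnects, and knowing when the server is done are all chores that the native TTS path below simply does not have.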
Core Code Evolution — tools/tts_tool.py (核心代码演进)
The new implementation is more concise; the key is the response_modalities=["audio"] setting:
```python
import logging

from google import genai
from google.genai import types

logger = logging.getLogger(__name__)
# GOOGLE_AI_API_KEY and parse_rate() are defined elsewhere in tools/tts_tool.py

async def text_to_speech(text: str) -> tuple[bytes, int]:
    client = genai.Client(api_key=GOOGLE_AI_API_KEY, http_options={"api_version": "v1beta"})
    contents = [
        types.Content(
            role="user",
            parts=[
                # Add localization instructions to make the tone more natural
                types.Part.from_text(
                    text=f"Please use Traditional Chinese with Taiwanese vocabulary, "
                    f"and read the following summary in a friendly and natural tone. "
                    f"## Transcript:\n{text}"
                ),
            ],
        ),
    ]
    config = types.GenerateContentConfig(
        response_modalities=["audio"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Zephyr")
            )
        ),
    )
    pcm_chunks = []
    sample_rate = 24000  # Default value
    try:
        # ⚠️ This is the big pitfall that almost kept me up all night (details below)
        response_stream = await client.aio.models.generate_content_stream(
            model="gemini-3.1-flash-tts-preview",
            contents=contents,
            config=config,
        )
        async for chunk in response_stream:
            if chunk.parts:
                for part in chunk.parts:
                    if part.inline_data:
                        pcm_chunks.append(part.inline_data.data)
                        # Get the sample rate dynamically from the MIME type
                        # (e.g. audio/L16;rate=24000)
                        if part.inline_data.mime_type:
                            sample_rate = parse_rate(part.inline_data.mime_type)
    except Exception as e:
        logger.error(f"TTS Error: {e}")
        raise
    pcm_bytes = b"".join(pcm_chunks)
    duration_ms = int(len(pcm_bytes) / (sample_rate * 2) * 1000)
    # Afterwards the PCM is converted to m4a via ffmpeg and sent to LINE...
    return pcm_bytes, duration_ms
```
The code above shows the core logic: call client.aio.models.generate_content_stream with "audio" as the response modality, and use SpeechConfig to specify the voice parameters (such as the Zephyr voice). The key improvements are dynamic sample-rate extraction and the localization prompt.
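The function references two helpers that aren't shown here. Below is a minimal sketch of both, assuming raw 16-bit mono PCM and an ffmpeg binary on the PATH; parse_rate matches the real helper's role, while pcm_to_m4a and its output path are illustrative names, not from the actual repo:

```python
import re
import subprocess

def parse_rate(mime_type: str, default: int = 24000) -> int:
    """Extract the sample rate from a MIME type such as 'audio/L16;rate=24000'."""
    match = re.search(r"rate=(\d+)", mime_type)
    return int(match.group(1)) if match else default

def pcm_to_m4a(pcm_bytes: bytes, sample_rate: int, out_path: str = "summary.m4a") -> str:
    """Convert raw 16-bit mono PCM to AAC/M4A via ffmpeg (LINE expects m4a audio)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-f", "s16le",            # raw 16-bit little-endian PCM input
            "-ar", str(sample_rate),  # sample rate taken from the MIME type
            "-ac", "1",               # mono
            "-i", "pipe:0",           # read PCM from stdin
            "-c:a", "aac",
            out_path,
        ],
        input=pcm_bytes,
        check=True,
    )
    return out_path
```

As a sanity check on the duration math: at 24 kHz, 16-bit mono PCM is 24,000 × 2 = 48,000 bytes per second, so 480,000 bytes of audio yields duration_ms = 10,000.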
The Pitfall: The Missing await (陷阱:缺失的 await)
This upgrade ran into a very subtle TypeError that kept appearing after remote deployment:

```
TypeError: 'async for' requires an object with __aiter__ method, got coroutine
```
❌ The Wrong Way (错误写法)
Following the examples, I intuitively assumed I could run async for directly on the method call:

```python
# This is wrong!
async for chunk in client.aio.models.generate_content_stream(...):
    pass
```
✅ Correct Solution (正确解法)
In the async variant of the Google GenAI Python SDK, generate_content_stream is itself an async function, and awaiting it returns an async iterator. You must therefore await first to obtain the iterator, and only then run async for over it.
```python
# Correct approach: two steps
response_stream = await client.aio.models.generate_content_stream(...)
async for chunk in response_stream:
    pass
```
This detail does not exist in ordinary synchronous code, and may not apply to some older SDKs, but when handling the async stream of 3.1 Flash TTS it is exactly what determines whether the code runs at all.
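To see the pattern in isolation, here is a tiny self-contained demo with no SDK involved. It mimics an async function whose awaited result is an async iterator; uncommenting the wrong line reproduces the exact TypeError above:

```python
import asyncio

async def make_stream():
    """Mimics the SDK: an async function whose awaited result is an async iterator."""
    async def gen():
        for i in range(3):
            yield i
    return gen()

async def main():
    # async for i in make_stream():  # TypeError: 'async for' requires __aiter__, got coroutine
    stream = await make_stream()     # step 1: await the coroutine to get the iterator
    async for i in stream:           # step 2: iterate the iterator
        print(i)

asyncio.run(main())
```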
Localization Adjustment: Making the Bot Speak "Taiwanese" (区域化调整:让机器人说“台湾腔”)
Although the summary itself is already in Traditional Chinese, the TTS model's delivery occasionally carries non-local pronunciation or vocabulary. We solved this with prompt engineering:
"Please use Taiwanese vocabulary in Traditional Chinese, and read it in a friendly and natural tone..."
With this one instruction added, the audio Gemini produces is noticeably closer to Taiwanese users' habits in intonation and phrasing, which makes the "listen to the summary" feature feel much friendlier.
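If you find yourself tweaking this instruction often, it can be pulled out of the f-string into one place. A small refactor sketch; the constant and helper names here are my own, not from the original code:

```python
# Hypothetical refactor: keep the localization instruction in one place
LOCALE_INSTRUCTION = (
    "Please use Traditional Chinese with Taiwanese vocabulary, "
    "and read the following summary in a friendly and natural tone. "
)

def build_tts_prompt(summary: str) -> str:
    """Prepend the localization instruction to the text handed to the TTS model."""
    return f"{LOCALE_INSTRUCTION}## Transcript:\n{summary}"
```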
Summary: Changes Brought by Native TTS (总结:原生 TTS 带来的变化)
After migrating from the Live API to Native TTS:
- More stable connections: no long-lived WebSocket to maintain.
- Better audio quality: native 24 kHz sample-rate support.
- Easier maintenance: roughly 30% less code and more direct logic.
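Purely as an illustration of how the pieces fit together after the migration, here is a hypothetical end-to-end wiring in the v2 line-bot-sdk style; the hosting URL, file path, and the fixed 24 kHz assumption are placeholders:

```python
from linebot.models import AudioSendMessage

async def summary_audio_message(summary: str) -> AudioSendMessage:
    pcm, duration_ms = await text_to_speech(summary)
    pcm_to_m4a(pcm, sample_rate=24000, out_path="static/summary.m4a")  # assumes 24 kHz output
    # LINE fetches audio over HTTPS, so the m4a must be publicly reachable
    return AudioSendMessage(
        original_content_url="https://your-host.example/static/summary.m4a",
        duration=duration_ms,
    )
```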
The experience is also a reminder that even with a seemingly mature SDK, it pays to check return types carefully when working in async mode.
If you also want your LINE Bot to speak, Gemini 3.1 Flash TTS is without doubt the best option available right now.
The complete code has been updated on GitHub. See you next time!
FAQ (常见问题)
Why upgrade from the Live API to Gemini 3.1 Native TTS?
The Live API requires managing a WebSocket connection lifecycle, restricts which models you can use, and fixes the sample rate at 16 kHz. Native TTS uses the standard generate_content_stream interface, supports a dynamic sample rate (24 kHz), connects more reliably, and cuts the code by roughly 30%.
What are the model name and core configuration for Gemini 3.1 Flash TTS?
The model name is gemini-3.1-flash-tts-preview. The core configuration is response_modalities=["audio"] plus a voice specified via speech_config (e.g. Zephyr).
What is the most common mistake with the async SDK, and how do you fix it?
Running async for directly on client.aio.models.generate_content_stream. The call itself returns a coroutine, so you must first await it to obtain the iterator, then iterate with async for: response_stream = await client.aio.models.generate_content_stream(...); async for chunk in response_stream: ...