
How to Migrate LINE Bot Speech Synthesis to Gemini 3.1 Native TTS? Taiwanese Chinese Configuration and Pitfalls for 2026

2026/5/3

AI Summary (BLUF)

Upgrade your LINE Bot's text-to-speech to Google's Gemini 3.1 Flash Native TTS. This guide covers the code evolution, the crucial async/await mistake to avoid, and localization adjustments for Taiwanese Chinese.

Background (背景)

In the previous installment, we used Gemini 3.1 Flash Live for speech recognition and, through a workaround built on the Gemini 2.5 Live API, barely managed a text-to-speech (TTS) feature.


Then, in April 2026, Google officially released Gemini 3.1 Flash TTS: a native model designed specifically for audio output. It no longer requires a Live WebSocket and can emit high-quality audio directly through the standard generate_content flow.


As a developer, you naturally want to adopt the more elegant, native solution right away. This article shares how to upgrade the LINE Bot's spoken-summary feature to Gemini 3.1 Native TTS, along with the "async pitfall" encountered along the way.



Technical Upgrade: From Live API to Native TTS (技术升级:从 Live API 到原生 TTS)

The previous read-aloud feature was simulated with the Gemini 2.5 Live API. It worked, but only barely, and the arrival of Gemini 3.1 Flash TTS changes the picture. The two approaches compare as follows:

Aspect            | Gemini 2.5 Live API                                              | Gemini 3.1 Flash TTS
Complexity        | Requires managing the WebSocket connection lifecycle             | Uses the familiar generate_content_stream interface
Model limitations | Must use a specific native-audio model, primarily in us-central1 | Model name gemini-3.1-flash-tts-preview, globally available
Output format     | Fixed 16 kHz sample rate                                         | Sample rate auto-detected from the MIME type, typically 24 kHz
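The sample-rate row has a direct consequence for payload size and for the duration math used later in the article. A quick back-of-envelope check (assuming 16-bit mono PCM, which matches the article's `rate * 2` duration formula; the function name here is illustrative):

```python
def pcm_bytes_per_second(sample_rate_hz: int, sample_width_bytes: int = 2, channels: int = 1) -> int:
    """Raw PCM byte rate: sample rate x bytes per sample x channels."""
    return sample_rate_hz * sample_width_bytes * channels

print(pcm_bytes_per_second(16000))  # Live API output -> 32000 bytes per second
print(pcm_bytes_per_second(24000))  # Native TTS output -> 48000 bytes per second
```

So the jump from 16 kHz to 24 kHz means roughly 50% more audio data per second of speech, which is worth remembering when sizing buffers or estimating LINE upload sizes.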



Core Code Evolution — tools/tts_tool.py (核心代码演进)

The new implementation is much more concise; the key is the response_modalities=["audio"] setting:


import logging
import re

from google import genai
from google.genai import types

logger = logging.getLogger(__name__)

def parse_rate(mime_type: str, default: int = 24000) -> int:
    # Extract the sample rate from a MIME type such as "audio/L16;rate=24000".
    match = re.search(r"rate=(\d+)", mime_type)
    return int(match.group(1)) if match else default

async def text_to_speech(text: str) -> tuple[bytes, int]:
    # GOOGLE_AI_API_KEY is loaded from configuration elsewhere in the project.
    client = genai.Client(api_key=GOOGLE_AI_API_KEY, http_options={"api_version": "v1beta"})

    contents = [
        types.Content(
            role="user",
            parts=[
                # Add localization instructions to make the tone more natural
                types.Part.from_text(text=f"Please use Traditional Chinese with Taiwanese vocabulary, and read the following summary in a friendly and natural tone. ## Transcript:\n{text}"),
            ],
        ),
    ]

    config = types.GenerateContentConfig(
        response_modalities=["audio"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Zephyr")
            )
        ),
    )

    pcm_chunks = []
    sample_rate = 24000  # Default value; overwritten from the MIME type below

    try:
        # ⚠️ This is the big pitfall that almost cost me a night's sleep:
        # generate_content_stream must be awaited before iterating (see below)
        response_stream = await client.aio.models.generate_content_stream(
            model="gemini-3.1-flash-tts-preview",
            contents=contents,
            config=config,
        )
        async for chunk in response_stream:
            if chunk.parts:
                for part in chunk.parts:
                    if part.inline_data:
                        pcm_chunks.append(part.inline_data.data)
                        # Get the sample rate dynamically from the MIME type (e.g. audio/L16;rate=24000)
                        if part.inline_data.mime_type:
                            sample_rate = parse_rate(part.inline_data.mime_type)
    except Exception as e:
        logger.error(f"TTS Error: {e}")
        raise

    pcm_bytes = b"".join(pcm_chunks)
    # 16-bit mono PCM: 2 bytes per sample, so bytes / (rate * 2) gives seconds
    duration_ms = int(len(pcm_bytes) / (sample_rate * 2) * 1000)

    # Subsequently converted to m4a via ffmpeg and sent to LINE...
    return pcm_bytes, duration_ms

The code above shows the core logic: call client.aio.models.generate_content_stream with "audio" as the response modality, and use SpeechConfig to pick the voice parameters (such as the Zephyr voice). The key improvements are dynamic sample-rate extraction and the localization prompt.
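The article converts the raw PCM to m4a via ffmpeg before sending it to LINE, but that step is not shown. A minimal sketch of the bridge (the helper name `pcm_to_wav` is illustrative): wrapping the PCM in a WAV container with Python's standard `wave` module lets ffmpeg, or any player, consume it without extra format flags.

```python
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container. ffmpeg can then
    transcode the result for LINE, e.g.: ffmpeg -i out.wav -c:a aac out.m4a
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

# 100 ms of silence at 24 kHz: 2400 frames of 2 bytes each
silence = b"\x00\x00" * 2400
wav_bytes = pcm_to_wav(silence, 24000)
print(wav_bytes[:4])  # -> b'RIFF'
```

Passing the dynamically detected sample rate into the container is exactly why the MIME-type parsing above matters: hard-coding 24000 would produce slowed or sped-up audio if the model ever returns a different rate.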


The Pitfall: The Missing await (陷阱:缺失的 await)

This upgrade encountered a very subtle TypeError, which kept popping up after remote deployment:


TypeError: 'async for' requires an object with __aiter__ method, got coroutine

❌ Incorrect Pattern (错误写法)

Following the sample code, I intuitively assumed I could run async for directly on the method call:


# This is wrong!
async for chunk in client.aio.models.generate_content_stream(...):
    pass

✅ Correct Solution (正确解法)

In the asynchronous flavor of the Google GenAI Python SDK, generate_content_stream is itself an async function that returns an async iterator. You must therefore await it to obtain the iterator, and only then run async for over it.


# Correct approach: two steps
response_stream = await client.aio.models.generate_content_stream(...)
async for chunk in response_stream:
    pass

This detail may not exist in ordinary synchronous code or some older SDKs, but when handling the asynchronous stream from 3.1 Flash TTS, it is the difference between running and crashing.

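The failure mode is easy to reproduce without the SDK. The toy stand-in below (all names are illustrative) mimics the shape of the API: an async function whose awaited result is the async iterator.

```python
import asyncio

async def fake_generate_content_stream():
    # Mimics the SDK: an async *function* whose awaited result is the async
    # iterator. Calling it without await yields a coroutine, which has no
    # __aiter__ -- exactly the TypeError quoted above.
    async def chunks():
        for piece in ("chunk-1", "chunk-2"):
            yield piece
    return chunks()

async def main() -> list[str]:
    stream = await fake_generate_content_stream()  # step 1: await the coroutine
    return [c async for c in stream]               # step 2: iterate the result

print(asyncio.run(main()))  # -> ['chunk-1', 'chunk-2']
```

Replacing the two steps in main() with `async for c in fake_generate_content_stream():` reproduces the same TypeError, which is a quick way to confirm you understand the pattern before touching production code.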


Localization Adjustment: Making the Bot Speak "Taiwanese" (区域化调整:让机器人说“台湾腔”)

Although the summary itself is already in Traditional Chinese, the TTS model sometimes produces non-local pronunciations or word choices when reading aloud. We solved this through prompt engineering:


"Please use Taiwanese vocabulary in Traditional Chinese, and read it in a friendly and natural tone..."

After adding this instruction, Gemini's audio output comes much closer to Taiwanese users' habits in intonation and phrasing, which makes the "listen to the summary" feature far more approachable.

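In the article's code the instruction is inlined in an f-string; pulling it out as a tiny helper (the function name and sample text here are illustrative) makes the prompt easy to reuse and unit-test:

```python
def build_tts_prompt(summary: str) -> str:
    """Prepend the Taiwanese-localization instruction to the text sent to TTS."""
    instruction = (
        "Please use Traditional Chinese with Taiwanese vocabulary, "
        "and read the following summary in a friendly and natural tone."
    )
    return f"{instruction} ## Transcript:\n{summary}"

prompt = build_tts_prompt("今天的節目重點整理如下…")
print(prompt.startswith("Please use Traditional Chinese"))  # -> True
```

Keeping the instruction separate from the transcript (here with the `## Transcript:` delimiter, as in the article's code) also reduces the chance of the model reading the instruction itself aloud.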


Summary: Changes Brought by Native TTS (总结:原生 TTS 带来的变化)

After migrating from the Live API to Native TTS:


  • More stable connections: no long-lived WebSocket to maintain.
  • Better audio quality: native 24 kHz sample-rate support.
  • Easier maintenance: roughly 30% less code and more direct logic.

This experience is also a reminder that even with a seemingly mature SDK, you should check return-value types carefully when working in async mode.


If you also want your LINE Bot to speak, Gemini 3.1 Flash TTS is definitely the best choice at the moment.


The complete code has been updated to GitHub, see you next time!


FAQ (常见问题)

Why upgrade from the Live API to Gemini 3.1 native TTS?

The Live API requires managing the WebSocket connection lifecycle, restricts model choice, and fixes the sample rate at 16 kHz. Native TTS uses the standard generate_content_stream interface, supports a dynamic sample rate (24 kHz), offers more stable connections, and cuts code volume by roughly 30%.

What are the model name and core configuration for Gemini 3.1 Flash TTS?

The model name is gemini-3.1-flash-tts-preview. The core configuration sets response_modalities=["audio"] and specifies the voice (such as Zephyr) in speech_config.

What is the most common mistake with the async version of the SDK, and how do you fix it?

The most common mistake is running async for directly on client.aio.models.generate_content_stream. The function is itself a coroutine, so you must first await it to obtain the iterator, then traverse it with async for. Correct pattern: response_stream = await client.aio.models.generate_content_stream(...); async for chunk in response_stream: ...

