LangChain如何用GPT-4自动生成SQL查询？2026年全流程解析

引言

这算得上是一个非常惊艳的功能。

This can be considered a remarkably impressive feature.

主要优点是，你用自然语言提出问题，LangChain把这个问题丢给LLMs，LLMs返回SQL脚本，然后LangChain会自动执行查询语句得到结果后，再丢给LLMs生成一个自然语言风格的答案。

The primary advantage is that you can pose a question in natural language. LangChain passes this question to an LLM, which returns an SQL script. LangChain then automatically executes the query, obtains the results, and passes them back to the LLM to generate a natural language-style answer.

也就是说，LangChain+ChatGPT，就能快速完成 SQL 生成、执行和答案生成的全过程。

In other words, the combination of LangChain and ChatGPT can rapidly complete the entire process of SQL generation, execution, and answer generation.

主流 SQL 生成大模型概述

SQLCoder, 声称已经超越 GPT 3.5，但是没赶上 GPT-4

SQLCoder, which claims to have surpassed GPT-3.5 but has not yet caught up to GPT-4.
GPT 3.5 和 GPT4，目前SQL的准确率整体上属于顶尖

GPT-3.5 and GPT-4 currently represent the top tier in terms of overall SQL accuracy.
补充：Code LlaMa ，基于LlaMa2，主要以Python等为主，不太擅长写SQL

Note: Code LlaMa, based on LlaMa2, is primarily focused on languages like Python and is not particularly adept at writing SQL.
补充：WizardCoder，基于LLama2，不擅长写SQL

Note: WizardCoder, based on LLama2, is not skilled at writing SQL.
补充：StarCoder，小型的通用代码生成模型，不擅长写SQL

Note: StarCoder, a small general-purpose code generation model, is not proficient at writing SQL.

（其实一般所谓的代码生成，默认指的是 Python -- 毕竟是AI的最主流语言）

(In fact, when people generally refer to code generation, they typically default to Python—after all, it's the most mainstream language in AI.)

主流SQL生成模型对比示意图

LangChain 的核心功能

LangChain 作为一个连接 LLMs 的工具，提供以下功能：

As a tool for connecting LLMs, LangChain provides the following functionalities:

Build SQL Queries, 基于你的自然语言问题，自动生成SQL查询脚本

Build SQL Queries: Automatically generates SQL query scripts based on your natural language questions.
Query a SQL database, 自动连接数据库并运行查询脚本

Query a SQL database: Automatically connects to a database and runs the generated query scripts.
Interact with a SQL database，基于 agent，和数据库进行更灵活的交互

Interact with a SQL database: Enables more flexible interaction with databases using an agent-based approach.

快速完整案例

准备工作

首先，需要安装必要的库并配置环境。

First, it's necessary to install the required libraries and configure the environment.

!pip install -q openai
!pip install -q langchain

# 注意：在实际应用中，应从安全的环境变量中读取API密钥
# Note: In practice, API keys should be read from secure environment variables.
openai_api_key = "YOUR_OPENAI_API_KEY"
import os
import openai

!pip install langchain langchain-experimental openai -q

from langchain import OpenAI, SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

# 初始化LLM，设置temperature为0以获得更确定性的输出
# Initialize the LLM, setting temperature to 0 for more deterministic output.
llm = OpenAI(temperature=0, openai_api_key=openai_api_key)

# 连接到示例SQLite数据库
# Connect to the example SQLite database.
sqlite_db_path ='./chinook.db'
db = SQLDatabase.from_uri(f"sqlite:///{sqlite_db_path}")

# 创建SQLDatabaseChain，它将串联整个流程
# Create the SQLDatabaseChain, which will orchestrate the entire pipeline.
db_chain = SQLDatabaseChain(llm=llm, database=db, verbose=True)

这里使用的 chinook 数据库是一个经典的SQL学习数据库，其结构如下：

The chinook database used here is a classic SQL learning database with the following structure:

Chinook数据库结构图

尝试提问

现在，我们可以用自然语言向这个链提问。

Now, we can ask questions to this chain in natural language.

英文提问示例：
db_chain.run("How many unique customers are there?")

英文查询执行过程与结果

中文提问同样有效：

Asking questions in Chinese also works:

中文查询执行过程与结果

处理更复杂的查询：

Handling more complex queries:

复杂查询执行过程与结果

链的执行过程会详细显示：

The execution process of the chain is displayed in detail:

> Entering new SQLDatabaseChain chain...
哪个顾客的花费最多?
SQLQuery:SELECT "FirstName", "LastName", MAX("Total") AS "MaxTotal" FROM customers c INNER JOIN invoices i ON c."CustomerId" = i."CustomerId" GROUP BY c."CustomerId" ORDER BY "MaxTotal" DESC LIMIT 5;
SQLResult: [('Helena', 'Holý', 25.86), ('Richard', 'Cunningham', 23.86), ('Ladislav', 'Kovács', 21.86), ('Hugh', "O'Reilly", 21.86), ('Astrid', 'Gruber', 18.86)]
Answer:Helena Holý的花费最多，为25.86美元。
> Finished chain.
Helena Holý的花费最多，为25.86美元。

小结与模型准确性

不过受限于 ChatGPT 的能力，这些代码的生成并不能达到100%正确。

However, limited by the capabilities of ChatGPT, the generation of these codes cannot achieve 100% accuracy.

ChatGPT（GPT-3.5）的正确率大概在 66%。如果想要更好的结果，那上GPT-4就对了。

The accuracy rate of ChatGPT (GPT-3.5) is approximately 66%. If you want better results, then upgrading to GPT-4 is the right choice.

SQL生成模型准确率对比图

图源：https://github.com/defog-ai/sqlcoder

Image Source: https://github.com/defog-ai/sqlcoder

结论

LangChain 通过将大型语言模型与数据库连接，实现了一个从自然语言问题到结构化查询再到自然语言答案的自动化流程。这极大地降低了非技术用户与数据库交互的门槛，并提升了数据分析的效率。尽管当前模型的准确性尚有提升空间，但随着模型能力的不断进化，这项技术在未来数据驱动决策中的应用前景将非常广阔。

LangChain automates the pipeline from natural language questions to structured queries and back to natural language answers by connecting large language models with databases. This significantly lowers the barrier for non-technical users to interact with databases and enhances the efficiency of data analysis. Although the accuracy of current models has room for improvement, the application prospects of this technology in future data-driven decision-making are very broad as model capabilities continue to evolve.

本文主要参考来源：LangChain官方文档，SQL | ️ Langchain

Primary reference for this article: LangChain Official Documentation, SQL | ️ Langchain

常见问题（FAQ）

LangChain如何实现自然语言生成SQL查询？

LangChain将自然语言问题传递给GPT-3.5/4等大语言模型，模型返回SQL脚本后自动执行查询，再将结果返回给LLM生成自然语言答案。

哪些大模型适合用于LangChain的SQL生成？

GPT-3.5和GPT-4在SQL准确率上表现顶尖，SQLCoder声称超越GPT-3.5但未及GPT-4。Code LlaMa、WizardCoder和StarCoder不擅长SQL生成。

使用LangChain进行SQL查询需要哪些准备工作？

需安装openai和langchain库，配置API密钥，初始化LLM并设置temperature为0，连接SQL数据库如SQLite，创建SQLDatabaseChain串联整个流程。