
How to Deploy Qwen-7B-Chat in the Cloud: A Complete 2026 Guide to vLLM + LangChain + FastAPI

2026/3/8
AI Summary (BLUF)

This tutorial provides a step-by-step guide to deploying the Qwen-7B-Chat large language model on the cloud using vLLM for inference, LangChain for vector database construction, and FastAPI for web services, with deployment on the cost-effective FunHPC cloud platform.


Introduction

Since ChatGPT burst onto the scene at the end of 2022, AI applications have been emerging one after another. Would you like to build your own LLM (Large Language Model) system with popular frameworks and put it into production? If so, this article may be exactly what you are looking for.

This tutorial builds a simple demo step by step. Along the way we will use vLLM for model inference, LangChain to build a vector database, and FastAPI to provide the web service, deploying the model in the cloud on the highly cost-effective FunHPC platform (formerly DeepLn Computing Cloud).

Core Concepts and Preparation

Choosing a Compute Instance

Running deep learning models, especially LLMs, takes significant compute. Although an LLM can be run on a CPU (for example with llama.cpp), a GPU is generally required for smooth, efficient operation. For this tutorial, vLLM currently supports the Int4-quantized version of Qwen-7B-Chat (as of publication, Int8 quantization was not supported in our testing). The Int4 version needs a minimum of about 7 GB of VRAM, so it runs on cards with >= 8 GB of VRAM such as the RTX 3060. Half-precision (FP16/BF16) inference needs at least 16.5 GB of VRAM, which calls for a high-VRAM card such as the RTX 3090 or A100.

Since vLLM does not yet optimize for quantized models, this example uses the unquantized version of the model to obtain better accuracy and higher throughput.

Launching an Instance and Configuring the Environment

Launching the Instance

Open the FunHPC website. If you do not have an account yet, register first and claim the 30 compute credits granted for registering and binding WeChat. Without further ado, click "Computing Market" to browse the platform's highly cost-effective cloud GPUs. We will use an A100 as the example for our configuration.

Click "Available" to enter the host-selection screen, choose an available host, and click "Rent Now". A configuration screen appears for choosing the number of GPUs, the framework, and so on (the value is remarkable: an A100 for only 2.58 CNY per card-hour). A reference configuration follows:

Click "Create Now" to go to the console, where the status shows "Creating".

Once creation finishes, the status changes to "Running", and you can access the instance via code-server or SSH.

Configuring the Environment

Switch the pip index to a domestic mirror with the following commands to speed up package installation:

mkdir -p ~/.pip
printf "[global]\nindex-url=https://pypi.tuna.tsinghua.edu.cn/simple\n" > ~/.pip/pip.conf
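Alternatively, newer pip releases can write the same setting through the built-in `pip config` command, which edits the per-user configuration file for you:

```shell
# Equivalent to writing index-url into ~/.pip/pip.conf by hand
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```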

Then install the core dependencies:

pip install langchain vllm gptcache modelscope
pip install transformers==4.32.0 accelerate tiktoken einops scipy transformers_stream_generator==0.0.4 peft deepspeed

If you plan to use the Int4-quantized model, also install the following:

pip install auto-gptq optimum

Downloading the Model and Testing Offline Inference

This tutorial uses Qwen-7B-Chat. The model's official introduction follows:

**Qwen-7B** (Tongyi Qianwen 7B) is the 7-billion-parameter model in the Qwen series developed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model trained on an ultra-large-scale pre-training corpus that is diverse and wide-ranging, covering large amounts of web text, professional books, code, and more. Building on Qwen-7B, alignment techniques were used to create Qwen-7B-Chat, an AI assistant based on the large language model. Compared with the initially open-sourced Qwen-7B, both the pre-trained and chat models have since been updated to better-performing versions.

We will first test offline LLM inference before deploying the model as a service. Once the model has been downloaded locally from ModelScope or Hugging Face, you can run inference as often as you like.

from vllm import LLM, SamplingParams
import time
import os
# Use ModelScope. Without this environment variable, the model is downloaded from Hugging Face.
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

The code above imports the required libraries and sets the VLLM_USE_MODELSCOPE environment variable to 'True', which downloads the model from ModelScope instead of Hugging Face. Comment out that line if you prefer to download from Hugging Face.

Now download the model. Simply run the following code and the model will be fetched automatically from ModelScope (or Hugging Face) to your local machine.

# Unquantized; minimum VRAM usage is about 16.5 GB.
llm = LLM(model="qwen/Qwen-7B-Chat", trust_remote_code=True)
# Int4-quantized; minimum VRAM usage is about 7 GB.
# llm = LLM(model="qwen/Qwen-7B-Chat-int4", trust_remote_code=True, gpu_memory_utilization=0.35)

Note that if your VRAM is limited, you need to tune the gpu_memory_utilization parameter to a suitable value. It caps the maximum VRAM the model may use (of course, the allocation must exceed the model's minimum requirement, or loading will fail).
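Since gpu_memory_utilization is expressed as a fraction of the card's total memory, it can help to convert a VRAM budget in GB into that fraction. A minimal sketch (the `vram_fraction` helper is illustrative, not part of vLLM):

```python
def vram_fraction(budget_gb: float, total_gb: float) -> float:
    """Fraction of total GPU memory to hand to vLLM for a given VRAM budget."""
    if not 0 < budget_gb <= total_gb:
        raise ValueError("budget must be positive and no larger than total VRAM")
    return budget_gb / total_gb

# Example: allow the Int4 model ~14 GB on a 40 GB A100.
print(round(vram_fraction(14, 40), 2))  # 0.35, matching gpu_memory_utilization=0.35 above
```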

When the download succeeds, the output should resemble the following:

Downloading: 100%|██████████| 8.21k/8.21k [00:00<00:00, 12.1MB/s]
...
Downloading: 100%|██████████| 404k/404k [00:00<00:00, 5.16MB/s]

Next, test the model's inference:

# The Chinese prompt asks: "How many steps does it take to put an elephant into a refrigerator?"
prompts = [
'''
Let's think step by step:
将大象塞到冰箱里面有几个步骤?
'''
]

sampling_params = SamplingParams(temperature=0.8, top_k=10, top_p=0.95, max_tokens=256, stop=["<|endoftext|>", "<|im_end|>"])
start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()
latency = end_time - start_time
print(f"Latency: {latency} seconds")
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt} \nGenerated text: \n{generated_text}")

There are several parameters here, such as temperature and top_k. The Hugging Face documentation on text generation covers them in detail.

Running the code above produces the following output:

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.27it/s]
Latency: 0.30954647064208984 seconds
Prompt:
Let's think step by step:
将大象塞到冰箱里面有几个步骤?

Generated text:
首先,打开冰箱门。其次,将大象塞入冰箱。最后,关闭冰箱门。

The A100 on FunHPC finished inference in about 0.3 seconds, which is very fast, and it answered the question correctly (open the fridge door; put the elephant in; close the door).

Serving Inference over the Web with FastAPI

Now that the model is deployed and offline inference works, let's build an API service with FastAPI. It will handle requests and generate LLM responses with the deployed model.

The following code starts a FastAPI application that hosts the LLM behind the /v1/generateText endpoint on port 5001. If the model was already downloaded during the offline-inference step, it will not be downloaded again here.

from vllm import LLM, SamplingParams
import os
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn

# Use ModelScope. Without this environment variable, the model is downloaded from Hugging Face.
os.environ['VLLM_USE_MODELSCOPE'] = 'True'

app = FastAPI()

llm = LLM(model="qwen/Qwen-7B-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_k=10, top_p=0.95, max_tokens=256, stop=["<|endoftext|>", "<|im_end|>"])

@app.get("/")
def read_root():
    return {"Hello": "World"}

@app.post("/v1/generateText")
async def generateText(request: Request):
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    print(prompt)
    outputs = llm.generate([prompt], sampling_params)
    generated_text = outputs[0].outputs[0].text
    print("Generated text:", generated_text)
    return JSONResponse({"text": generated_text})

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5001)
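One caveat for production use: llm.generate is a synchronous call, so invoking it directly inside an async endpoint blocks the event loop for the duration of inference. A common mitigation is to offload the blocking call to a worker thread. A minimal, self-contained sketch of the pattern (here `blocking_generate` is a stand-in for the real model call, and the names are illustrative):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker thread: the GPU processes one request batch at a time anyway.
executor = ThreadPoolExecutor(max_workers=1)

def blocking_generate(prompt: str) -> str:
    # Stand-in for llm.generate(...); any synchronous, long-running call goes here.
    return f"response to: {prompt}"

async def generate_async(prompt: str) -> str:
    # Run the blocking call in the worker thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_generate, prompt)

print(asyncio.run(generate_async("hello")))  # prints "response to: hello"
```

Inside the FastAPI handler you would `await` generate_async instead of calling llm.generate directly.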

Run the Python program above; its output should resemble:

INFO:     Started server process [30399]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5001 (Press CTRL+C to quit)

You can send requests to the running API as follows:

import requests
import json
import time

# Define the API endpoint
url = "http://127.0.0.1:5001/v1/generateText"

headers = {"Content-Type": "application/json"}

prompt = '''
Let's think step by step:
将大象塞到冰箱里面有几个步骤?
'''
data = {"prompt": prompt}

start_time = time.time()
# Make the POST request
response = requests.post(url, headers=headers, data=json.dumps(data))
end_time = time.time()
latency = end_time - start_time
print(f"Latency: {latency} seconds")
text = json.loads(response.text)
print("LLM response: " + text["text"])

The response I received was:

Latency: 0.42313146591186523 seconds
LLM response:  1. 打开冰箱门
             2. 将大象塞进去
             3. 关闭冰箱门

The network round trip adds a little latency (about 0.1 s), but overall latency remains very low.

Building a Vector Database (Knowledge Base)

To build a smarter system that can draw on external knowledge, we will bring in LangChain and a vector database. This example uses the FAISS library; install the GPU build with the following command:

pip install faiss-gpu

To install the CPU build instead:

pip install faiss-cpu

We also need an embedding model to store knowledge in the vector database. Use the following code to load it; like the LLM earlier, the embedding model is downloaded automatically:

from langchain.vectorstores import FAISS
from langchain.embeddings import ModelScopeEmbeddings
model_id = "damo/nlp_corom_sentence-embedding_english-base"
embeddings = ModelScopeEmbeddings(model_id=model_id)

Once it is loaded, we can import knowledge, supplied here as a list of strings:

# The knowledge string says: "DeepLN is committed to providing cost-effective GPU rentals."
knowledges = ["DeepLN致力于提供高性价比的GPU租赁。"]
vectorstore = FAISS.from_texts(
    knowledges,
    embedding=embeddings
)
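With the vector store in place, the typical next step is to retrieve relevant passages (e.g., via `vectorstore.similarity_search(query, k=2)`, which returns LangChain Document objects with a page_content attribute) and stitch them into the LLM prompt. A minimal, self-contained sketch of the prompt-assembly step, assuming the retrieved texts are already plain strings (the `build_rag_prompt` helper and its template are illustrative, not part of LangChain):

```python
def build_rag_prompt(question: str, contexts: list[str]) -> str:
    # Join retrieved passages into a context block, then append the user question.
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# In the real pipeline the contexts would come from the vector store, e.g.:
#   docs = vectorstore.similarity_search(question, k=2)
#   contexts = [d.page_content for d in docs]
contexts = ["DeepLN is committed to providing cost-effective GPU rentals."]
print(build_rag_prompt("What does DeepLN offer?", contexts))
```

The resulting string would be sent to the /v1/generateText endpoint in place of the raw user question.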

(Note: the source content is truncated here; subsequent topics such as building a Retrieval-Augmented Generation (RAG) pipeline, optimization, and production considerations would build on this framework. The tutorial has covered the core workflow from cloud instance selection and environment configuration through model deployment, a basic API service, and knowledge-base construction, laying a solid foundation for building production-grade LLM applications.)
