LlamaEdge 如何实现轻量级本地部署？

LlamaEdge 基于 Rust+Wasm 技术栈，将 LLM 推理应用编译为可移植的 WebAssembly 模块，无需复杂环境依赖，可在多种平台安全运行，实现真正的轻量级部署。

如何快速开始使用 LlamaEdge 运行本地 LLM？

只需三步：1) 安装 WasmEdge 2) 下载模型文件（如 Llama 3.2 1B）3) 下载并运行 llama-chat.wasm 应用。详细步骤见官方文档的快速开始指南。

LlamaEdge 如何实现轻量级本地部署？

LlamaEdge 基于 Rust+Wasm 技术栈，将 LLM 推理应用编译为可移植的 WebAssembly 模块，无需复杂环境依赖，可在多种平台安全运行，实现真正的轻量级部署。

如何快速开始使用 LlamaEdge 运行本地 LLM？

只需三步：1) 安装 WasmEdge 2) 下载模型文件（如 Llama 3.2 1B）3) 下载并运行 llama-chat.wasm 应用。详细步骤见官方文档的快速开始指南。

LlamaEdge和DeepSeek哪个更适合本地部署大语言模型？

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。项目让您能够轻松地在本地运行大语言模型推理应用，并为开源大语言模型创建 OpenAI 兼容的 API 服务。

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。项目让您能够轻松地在本地运行大语言模型推理应用，并为开源大语言模型创建 OpenAI 兼容的 API 服务。

⭐ 喜欢我们的工作吗？请给我们一个 Star！

⭐ 喜欢我们的工作吗？请给我们一个 Star！

📖 查看我们的官方文档以及关于如何定制开源模型的Manning电子书。

📖 查看我们的官方文档以及关于如何定制开源模型的Manning电子书。

🔥 使用 LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。运行多种模型！
LLM（文本生成） | Embedding | 语音转文本 | 文本转语音 | 文生图 | 多模态

快速开始

Quick Start

前提条件：通过以下命令行安装 WasmEdge一个轻量级、高性能且可扩展的 WebAssembly 运行时，适用于云原生、边缘计算和去中心化应用。它是运行 LlamaEdge 应用的前提条件。。

Prerequisite: Install WasmEdge一个轻量级、高性能且可扩展的 WebAssembly 运行时，适用于云原生、边缘计算和去中心化应用。它是运行 LlamaEdge 应用的前提条件。 via the following command line.

curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install_v2.sh | bash

步骤 1：下载一个 LLM 模型文件。这里我们以 Meta Llama 3.2 1B 模型为例。

Step 1: Download an LLM model file. Here we use the Meta Llama 3.2 1B model as an example.

curl -LO https://huggingface.co/second-state/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q5_K_M.gguf

步骤 2：下载 LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。 CLI 聊天应用。它也是一个跨平台的可移植 Wasm 应用，可以在许多 CPU 和 GPU 设备上运行。

Step 2: Download the LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。 CLI chat app. It is also a cross-platform portable Wasm app that can run on many CPU and GPU devices.

curl -LO https://github.com/second-state/LlamaEdge/releases/latest/download/llama-chat.wasm

步骤 3：运行以下命令与 LLM 进行对话。

Step 3: Run the following command to chat with the LLM.

wasmedge --dir .:. --nn-preload default:GGML:AUTO:Llama-3.2-1B-Instruct-Q5_K_M.gguf llama-chat.wasm -p llama-3-chat

后续步骤：

Next steps:

使用基于 Web 的聊天机器人与您的本地 LLM 交互 (Use a web-based chatbot to interact with your local LLM)
使用您自己的知识库启动 LLM 服务 (Start an LLM service with your own knowledge base)

在 OpenAI 兼容的 Web 服务端点中提供任何 GenAI 模型：

Serve any GenAI model in OpenAI-compatible web service endpoints:

LLM (/v1/chat/completion 端点) -- https://github.com/LlamaEdge/LlamaEdge (此仓库) (LLM (/v1/chat/completion endpoints) -- https://github.com/LlamaEdge/LlamaEdge (this repo))
Embedding (/v1/embeddings 端点) -- https://github.com/LlamaEdge/LlamaEdge (此仓库) (LLM (/v1/embeddings endpoints) -- https://github.com/LlamaEdge/LlamaEdge (this repo))
语音转文本 (/v1/audio/transcriptions 端点) -- https://github.com/LlamaEdge/whisper-api-server (Voice to text (/v1/audio/transcriptions endpoints) -- https://github.com/LlamaEdge/whisper-api-server)
文本转语音 (/v1/audio/speech 端点) -- https://github.com/LlamaEdge/tts-api-server (Text to voice (/v1/audio/speech endpoints) -- https://github.com/LlamaEdge/tts-api-server)
文生图 (/v1/images/generations 端点) -- https://github.com/LlamaEdge/sd-api-server (Text to image (/v1/images/generations endpoints) -- https://github.com/LlamaEdge/sd-api-server)

技术栈

The Tech Stack

Rust + Wasm 技术栈为 AI 推理提供了一个强大的 Python 替代方案。

The Rust+Wasm stack provides a strong alternative to Python in AI inference.


特性	描述	优势
轻量级	运行时总大小约 30MB	资源占用极低，适合边缘设备
高性能	在 GPU 上实现完整的原生速度	推理速度快，延迟低
可移植性	跨 CPU、GPU 和操作系统的单一二进制文件	部署简单，无需为不同平台重新编译
安全性	在不可信设备上进行沙箱化和隔离执行	提供强大的安全边界
容器就绪	支持 Docker, containerd, Podman, Kubernetes	与现代云原生生态无缝集成

Feature Description Advantage

Lightweight Total runtime size is ~30MB Minimal resource footprint, ideal for edge devices

Fast Full native speed on GPUs High inference speed and low latency

Portable Single cross-platform binary on different CPUs, GPUs, and OSes Easy deployment, no recompilation needed for different platforms

Secure Sandboxed and isolated execution on untrusted devices Provides a strong security boundary

Container-ready Supported in Docker, containerd, Podman, and Kubernetes Seamless integration with modern cloud-native ecosystems


Feature	Description	Advantage
Lightweight	Total runtime size is ~30MB	Minimal resource footprint, ideal for edge devices
Fast	Full native speed on GPUs	High inference speed and low latency
Portable	Single cross-platform binary on different CPUs, GPUs, and OSes	Easy deployment, no recompilation needed for different platforms
Secure	Sandboxed and isolated execution on untrusted devices	Provides a strong security boundary
Container-ready	Supported in Docker, containerd, Podman, and Kubernetes	Seamless integration with modern cloud-native ecosystems

欲了解更多信息，请查看《在异构边缘上进行快速且可移植的 Llama2 推理》。

For more information, please check out Fast and Portable Llama2 Inference on the Heterogeneous Edge.

支持的模型与平台

Supported Models and Platforms

模型

Models

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。项目支持所有基于 llama2 框架的大语言模型。模型文件必须为 GGUF 格式。我们致力于持续测试和验证每天涌现的新开源模型。

The LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。 project supports all Large Language Models (LLMs) based on the llama2 framework. The model files must be in the GGUF format. We are committed to continuously testing and validating new open-source models that emerge every day.

点击此处查看支持的模型列表，其中包含每个模型的下载链接和启动命令。如果您成功运行了其他 LLM，请随时通过创建 Pull Request 来帮助扩展此列表。

Click here to see the supported model list with a download link and startup commands for each model. If you have success with other LLMs, don't hesitate to contribute by creating a Pull Request (PR) to help extend this list.

平台

Platforms

编译后的 Wasm 文件是跨平台的。您可以使用同一个 Wasm 文件在不同的操作系统、CPU 和 GPU 上运行 LLM。

The compiled Wasm file is cross platfrom. You can use the same Wasm file to run the LLM across OSes, CPUs, and GPUs.


类别	具体支持
操作系统	macOS, Linux, Windows (WSL)
CPU 架构	x86, ARM, Apple Silicon, RISC-V
GPU 厂商	NVIDIA, Apple

Category Specific Support

Operating Systems macOS, Linux, Windows (WSL)

CPU Architectures x86, ARM, Apple Silicon, RISC-V

GPU Vendors NVIDIA, Apple

从 WasmEdge一个轻量级、高性能且可扩展的 WebAssembly 运行时，适用于云原生、边缘计算和去中心化应用。它是运行 LlamaEdge 应用的前提条件。 0.13.5 开始，安装程序会自动检测 NVIDIA CUDA 驱动程序。如果检测到 CUDA，安装程序将始终尝试安装支持 CUDA 的插件版本。我们的自动化 CI 已在以下平台上测试了 CUDA 支持：

The installer from WasmEdge一个轻量级、高性能且可扩展的 WebAssembly 运行时，适用于云原生、边缘计算和去中心化应用。它是运行 LlamaEdge 应用的前提条件。 0.13.5 will detect NVIDIA CUDA drivers automatically. If CUDA is detected, the installer will always attempt to install a CUDA-enabled version of the plugin. The CUDA support is tested on the following platforms in our automated CI.

Nvidia Jetson AGX Orin 64GB 开发套件 (Nvidia Jetson AGX Orin 64GB developer kit)

Intel i7-10700 + Nvidia GTX 1080 8G GPU (Intel i7-10700 + Nvidia GTX 1080 8G GPU)

AWS EC2 g5.xlarge + Nvidia A10G 24G GPU + Amazon 深度学习基础 Ubuntu 20.04 (AWS EC2 g5.xlarge + Nvidia A10G 24G GPU + Amazon deep learning base Ubuntu 20.04)

如果您使用的是纯 CPU 机器，安装程序将安装插件的 OpenBLAS 版本。您可能需要通过 apt update && apt install -y libopenblas-dev 安装 libopenblas-dev。

If you're using CPU only machine, the installer will install the OpenBLAS version of the plugin instead. You may need to install libopenblas-dev by apt update && apt install -y libopenblas-dev.

故障排除

Troubleshooting

Q: 为什么启动 API 服务器后出现以下错误？

Q: Why I got the following errors after starting the API server?
[2024-03-05 16:09:05.800] [error] instantiation failed: module name conflict, Code: 0x60
[2024-03-05 16:09:05.801] [error]     At AST node: module
A: 模块冲突错误是一个已知问题，这些是误报错误。它们不会影响您程序的功能。

A: The module conflict error is a known issue, and these are false-positive errors. They do not impact your program's functionality.

Q: 即使我的机器内存很大，在问了几个问题后，还是收到了错误信息 'Error: Backend Error: WASI-NNWebAssembly 系统接口的神经网络扩展，允许 WebAssembly 程序访问主机提供的神经网络推理功能。在 LlamaEdge 中，它用于连接 Wasm 应用和底层的 LLM 推理后端（如 ggml）。'。我该怎么办？

Q: Even though my machine has a large RAM, after asking several questions, I received an error message returns 'Error: Backend Error: WASI-NNWebAssembly 系统接口的神经网络扩展，允许 WebAssembly 程序访问主机提供的神经网络推理功能。在 LlamaEdge 中，它用于连接 Wasm 应用和底层的 LLM 推理后端（如 ggml）。'. What should I do?

A: 为了让内存较小的机器（如 8 GB）也能运行 7b 模型，我们默认将上下文大小限制设置为 512。如果您的机器有更多资源，可以使用此处提供的 CLI 选项将上下文大小和批处理大小增加到最多 4096。使用以下命令调整设置：

A: To enable machines with smaller RAM, like 8 GB, to run a 7b model, we've set the context size limit to 512. If your machine has more capacity, you can increase both the context size and batch size up to 4096 using the CLI options available here. Use these commands to adjust the settings:
-c, --ctx-size <CTX_SIZE>
-b, --batch-size <BATCH_SIZE>
Q: 运行 apt update && apt install -y libopenblas-dev 后，可能会遇到以下错误：

Q: After running apt update && apt install -y libopenblas-dev, you may encounter the following error:
...
E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
A: 这表明您没有以 root 身份登录。请尝试使用 sudo 命令再次安装：

A: This indicates that you are not logged in as root. Please try installing again using the sudo command:
sudo apt update && sudo apt install -y libopenblas-dev
Q: 运行 wasmedge 命令后，可能会收到以下错误：

Q: After running the wasmedge command, you may receive the following error:
[2023-10-02 14:30:31.227] [error] loading failed: invalid path, Code: 0x20
[2023-10-02 14:30:31.227] [error]     load library failed:libblas.so.3: cannot open shared object file: No such file or directory
[2023-10-02 14:30:31.227] [error] loading failed: invalid path, Code: 0x20
[2023-10-02 14:30:31.227] [error]     load library failed:libblas.so.3: cannot open shared object file: No such file or directory
unknown option: nn-preload
A: 这表明您的插件安装不成功。要解决此问题，请尝试重新安装所需的插件。

A: This suggests that your plugin installation was not successful. To resolve this issue, please attempt to install your desired plugin again.

Q: 执行 wasmedge 命令后，可能会遇到错误信息：[WASI-NN] GGML backend: Error: unable to init model.

Q: After executing the wasmedge command, you might encounter the error message: [WASI-NN] GGML backend: Error: unable to init model.

A: 此错误表示模型设置不成功。要解决此问题，请验证以下事项：

A:

常见问题（FAQ）

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。如何实现轻量级本地部署？

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。基于 Rust+Wasm 技术栈指使用 Rust 编程语言编写应用，并将其编译为 WebAssembly（Wasm）字节码的技术组合。该组合以高性能、内存安全和可移植性著称，是 LlamaEdge 项目的核心。，将 LLM 推理应用编译为可移植的 WebAssembly 模块，无需复杂环境依赖，可在多种平台安全运行，实现真正的轻量级部署。

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。支持哪些类型的 AI 模型？

支持文本生成（LLM）、Embedding、语音转文本、文本转语音、文生图及多模态模型，并提供对应的 OpenAI 兼容 API 端点，方便集成到现有应用中。

如何快速开始使用 LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。运行本地 LLM？

只需三步：1) 安装 WasmEdge一个轻量级、高性能且可扩展的 WebAssembly 运行时，适用于云原生、边缘计算和去中心化应用。它是运行 LlamaEdge 应用的前提条件。 2) 下载模型文件（如 Llama 3.2 1B）3) 下载并运行 llama-chat.wasm 应用。详细步骤见官方文档的快速开始指南。


Category	Specific Support
Operating Systems	macOS, Linux, Windows (WSL)
CPU Architectures	x86, ARM, Apple Silicon, RISC-V
GPU Vendors	NVIDIA, Apple

AI Summary (BLUF)

快速开始

Quick Start

技术栈

The Tech Stack

支持的模型与平台

Supported Models and Platforms

模型

Models

平台

Platforms

故障排除

Troubleshooting

常见问题（FAQ）

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。 如何实现轻量级本地部署？

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。 支持哪些类型的 AI 模型？

如何快速开始使用 LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。 运行本地 LLM？

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。如何实现轻量级本地部署？

LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。支持哪些类型的 AI 模型？

如何快速开始使用 LlamaEdge一个开源项目，用于在本地环境中便捷地运行大型语言模型（LLM）推理应用，并创建与 OpenAI API 兼容的服务端点。运行本地 LLM？