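SGLangDesk - Implementation Guide for the Local SGLang Service¶

Overview¶

SGLangDesk is the DeskLLM implementation backed by an LLM that is deployed locally with SGLang. SGLang is a high-performance inference framework for large language models. It supports many open-source models and offers fast inference with flexible configuration options.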

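About SGLang¶

SGLang is an LLM inference framework that originated with researchers at UC Berkeley and collaborating institutions. Its main characteristics are:

High performance: an optimized inference engine built for high throughput
Flexible configuration: supports a range of sampling parameters and stop words
Easy to deploy: runs locally and needs no network access once the model weights are downloaded
Multi-model support: works with Llama, Qwen, Mistral, and many other models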

Installing SGLang¶

Install with pip¶

pip install "sglang[all]"

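Starting the SGLang server¶

# Start the server (default port 30000)
python -m sglang.launch_server --model-path meta-llama/Llama-3.2-3B-Instruct --port 30000

# Alternatively, use the sglang-launch-server command, if your installation provides that entry point
sglang-launch-server --model-path meta-llama/Llama-3.2-3B-Instruct --port 30000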

Downloading models¶

# Download the model from Hugging Face
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct

# Or use ModelScope (the model ID on ModelScope may differ from the Hugging Face ID)
modelscope download --model meta-llama/Llama-3.2-3B-Instruct

Quick start¶

Basic usage¶

from tfrobot.brain.chain.llms.generation_llms.desk_llm.sglang_desk import SGLangDesk
from tfrobot.brain.chain.prompt.memo_prompt import MemoPrompt
from tfrobot.schema.message.conversation.message_dto import TextMessage

# Create an SGLangDesk instance
desk_llm = SGLangDesk(
    name="llama-3.2-3b",  # model name
    host="http://localhost",
    port=30000
)

# Configure the system prompt
desk_llm.system_prompt = [
    MemoPrompt(template="You are a professional coding assistant who specializes in Python development.")
]

# Run a completion
result = desk_llm.complete(
    current_input=TextMessage(content="Write a quicksort function for me")
)
print(result.generations[0].text)

Core parameters¶

Basic configuration¶

Parameter	Type	Default	Description
`host`	`str`	`http://localhost`	Address of the SGLang service
`port`	`int`	`30000`	Port of the SGLang service
`timeout`	`int`	`180`	Request timeout in seconds
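The table does not say how `host` and `port` are combined into a request URL. The sketch below shows one plausible way, assuming the two values are simply joined and that requests go to SGLang's native `/generate` endpoint. The helper name `build_endpoint` is hypothetical and is not part of tfrobot.

```python
def build_endpoint(host: str = "http://localhost", port: int = 30000, path: str = "/generate") -> str:
    """Join host, port and path into one URL, tolerating a trailing slash on host."""
    return f"{host.rstrip('/')}:{port}{path}"


# With the defaults from the table this yields http://localhost:30000/generate
print(build_endpoint())
```

Stripping the trailing slash keeps a value such as `http://localhost/` from producing a malformed URL.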

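Model parameters¶

Parameter	Type	Default	Description
`name`	`str`	-	Model name (used for logging only)
`max_tokens`	`int`	-	Maximum generation length (sent as max_new_tokens)
`temperature`	`float`	`0.8`	Sampling temperature (0-1)
`top_p`	`float`	`0.9`	Nucleus sampling threshold
`stop`	`str | list[str]`	-	Stop word or list of stop words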
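To make the mapping concrete, the following sketch turns these parameters into a request body. It assumes SGLangDesk targets SGLang's native `/generate` endpoint, which takes a `text` field and a `sampling_params` object. The function name `build_sampling_params` is hypothetical, and the real class may assemble the payload differently.

```python
from typing import Optional, Union


def build_sampling_params(
    max_tokens: Optional[int] = None,
    temperature: float = 0.8,
    top_p: float = 0.9,
    stop: Union[str, list[str], None] = None,
) -> dict:
    """Translate the table's parameters into a sampling_params dict (assumed mapping)."""
    params: dict = {"temperature": temperature, "top_p": top_p}
    if max_tokens is not None:
        params["max_new_tokens"] = max_tokens  # the table maps max_tokens to max_new_tokens
    if stop is not None:
        # Accept a single stop word or a list, and always send a list
        params["stop"] = [stop] if isinstance(stop, str) else list(stop)
    return params


payload = {
    "text": "def quicksort(arr):",
    "sampling_params": build_sampling_params(max_tokens=256, stop="END"),
}
print(payload)
```

Unset optional values are left out of the dict so that the server's own defaults apply.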

Advanced parameters¶

Parameter	Type	Default	Description
`append`	`str`	-	Text appended to the output after generation stops on a stop word

Use cases¶

Code generation¶

desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000
)

desk_llm.system_prompt = [
    MemoPrompt(template="You are a Python expert who writes high-quality, well-documented code.")
]

desk_llm.purpose_prompt = [
    MemoPrompt(template="Create a Person data class with name and age attributes.")
]

# The prefix prompt opens the code block so the model continues from it
desk_llm.prefix_prompt = [
    MemoPrompt(template="```python\nfrom dataclasses import dataclass\n\n")
]

result = desk_llm.complete(current_input=TextMessage(content="Start generating"))
print(result.generations[0].text)
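When a prefix prompt opens the code block, the returned text may hold only the continuation. The sketch below joins the prefix back onto the generation and strips the surrounding fence markers to leave plain source code. It assumes the result does not already include the prefix, and `assemble_code` is a hypothetical helper rather than a tfrobot function.

```python
FENCE = "`" * 3  # built this way so the example itself contains no literal fence


def assemble_code(prefix: str, generated: str) -> str:
    """Join prefix and continuation, then strip the surrounding code fence if present."""
    lines = (prefix + generated).strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]  # drop the opening fence line, including its language tag
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]  # drop the closing fence line
    return "\n".join(lines).strip() + "\n"


prefix = FENCE + "python\nfrom dataclasses import dataclass\n\n"
generated = "@dataclass\nclass Person:\n    name: str\n    age: int\n" + FENCE
print(assemble_code(prefix, generated))
```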

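Chinese content generation¶

# Use Qwen for better results in Chinese
desk_llm = SGLangDesk(
    name="qwen2.5-7b",
    host="http://localhost",
    port=30000,
    temperature=0.8,
    top_p=0.9
)

# System prompt: "You are a professional copywriting assistant."
desk_llm.system_prompt = [
    MemoPrompt(template="你是一个专业的文案撰写助手。")
]

# Purpose prompt: "Write promotional copy for a smart-home product, highlighting convenience and intelligence."
desk_llm.purpose_prompt = [
    MemoPrompt(template="为一款智能家居产品撰写一段宣传文案，突出便捷性和智能化。")
]

# Input: "Start writing"
result = desk_llm.complete(current_input=TextMessage(content="开始撰写"))
print(result.generations[0].text)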

Using append¶

desk_llm = SGLangDesk(
    name="llama-3.2-3b",
    host="http://localhost",
    port=30000,
    stop=["```"],  # stop when ``` is generated
    append="```"   # append ``` to close the code block
)

desk_llm.purpose_prompt = [
    MemoPrompt(template="Generate a Python quicksort function.")
]

result = desk_llm.complete(current_input=TextMessage(content="Start"))
# The output will include the closing ``` marker
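The reason for pairing `stop` with `append` is that a stop word normally ends generation without appearing in the returned text, which leaves the code block open. The sketch below illustrates that under the assumption that `append` is simply concatenated to the truncated output. Both helper names are hypothetical.

```python
FENCE = "`" * 3  # built this way so the example itself contains no literal fence


def finalize(generated: str, append: str = "") -> str:
    """Append the closing text to an output that was cut off at a stop word."""
    return generated + append


def fences_balanced(text: str) -> bool:
    """A code block is closed when fence markers come in pairs."""
    return text.count(FENCE) % 2 == 0


# Opening fence plus model output, cut just before the closing fence
raw = FENCE + "python\ndef quicksort(arr):\n    ...\n"
print(fences_balanced(raw))                          # False: the block is still open
print(fences_balanced(finalize(raw, append=FENCE)))  # True: append closed it
```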

Configuration tuning¶

Temperature and sampling parameters¶

# Deterministic output (code generation)
desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000,
    temperature=0.0,
    top_p=0.9
)

# Creative output (copywriting)
desk_llm = SGLangDesk(
    name="qwen2.5-7b",
    host="http://localhost",
    port=30000,
    temperature=0.8,
    top_p=0.95
)
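To show why these two settings change the output, the sketch below applies temperature scaling and nucleus (top-p) filtering to a toy set of logits. It is a simplified illustration of the general technique, not SGLang's sampler. A temperature of exactly 0 is typically treated as greedy decoding rather than passed through this formula, since it would divide by zero.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then normalize; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[int]:
    """Return indices of the smallest high-probability set whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.1]  # toy scores for three candidate tokens
for t in (0.2, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], top_p_filter(probs, 0.9))
```

At the lower temperature nearly all probability mass sits on the top token, so only one candidate survives the top-p cut. At 0.8 two candidates remain eligible.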

Using stop words¶

# Cut off unnecessary generation
desk_llm = SGLangDesk(
    name="llama-3.2-3b",
    host="http://localhost",
    port=30000,
    stop=["```", "END", "<|end_of_text|>"]
)
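With several stop words configured, generation ends at whichever one appears first. The server applies this during decoding. The hypothetical helper below reproduces the same rule on a finished string, which can be handy for post-processing or for testing a stop list offline.

```python
def truncate_at_stop(text: str, stop: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop word; the stop word itself is dropped."""
    cut = len(text)
    for word in stop:
        idx = text.find(word)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


sample = "return arr\nEND\n# unrelated chatter"
print(repr(truncate_at_stop(sample, ["END", "<|end_of_text|>"])))
```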

Error handling¶

Connection failure¶

from requests.exceptions import ConnectionError

try:
    result = desk_llm.complete(current_input=user_input)
except ConnectionError:
    print("Cannot connect to the SGLang service; make sure SGLang is running")
    print("Launch command: python -m sglang.launch_server --model-path <model> --port 30000")
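A server that is still loading its model refuses connections for a while, so retrying with a growing delay is often more useful than failing at once. The sketch below is a generic retry wrapper. The name `call_with_retry` and its parameters are illustrative and not part of tfrobot.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    exceptions: tuple = (ConnectionError,),
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x ... the base delay
    raise ValueError("retries must be at least 1")
```

To use it with the example above, wrap the call as `call_with_retry(lambda: desk_llm.complete(current_input=user_input), exceptions=(requests.exceptions.ConnectionError,))`. The `requests` exception must be passed explicitly because it does not derive from Python's built-in `ConnectionError`.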

Timeout errors¶

# Increase the timeout
desk_llm = SGLangDesk(
    name="qwen2.5-32b",  # large models need more time
    host="http://localhost",
    port=30000,
    timeout=300  # 5 minutes
)

JSON parsing errors¶

import json

try:
    result = desk_llm.complete(current_input=user_input)
except json.JSONDecodeError as e:
    print(f"Failed to parse the SGLang response: {str(e)}")
    print("Check that the SGLang service is running normally")
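A parsing failure usually means the service returned something other than JSON, such as an HTML error page from a proxy. The sketch below shows a defensive parser that reports this as `None` instead of raising. `parse_response` is a hypothetical helper and does not describe how SGLangDesk handles responses internally.

```python
import json
from typing import Optional


def parse_response(body: str) -> Optional[dict]:
    """Parse a response body as a JSON object; return None when it is not one."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


print(parse_response('{"text": "hello"}'))
print(parse_response("<html>502 Bad Gateway</html>"))
```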

Advanced usage¶

Multi-round editing¶

desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000
)

# Initial code
original_code = """
def add(a, b):
    return a + b
"""

desk_llm.original_desk_screenshot_prompt = [
    MemoPrompt(template=f"Original code:\n```python\n{original_code}\n```")
]

# Round 1: add type annotations
desk_llm.purpose_prompt = [
    MemoPrompt(template="Add type annotations.")
]

result1 = desk_llm.complete(current_input=TextMessage(content="Start editing"))
current_code = result1.generations[0].text

# Round 2: add a docstring
desk_llm.current_desk_screenshot_prompt = [
    MemoPrompt(template=f"Current code:\n```python\n{current_code}\n```")
]

# Record what has already been done
desk_llm.intermediate_prompt = [
    MemoPrompt(template="Type annotations have been added.")
]

desk_llm.purpose_prompt = [
    MemoPrompt(template="Add a docstring.")
]

result2 = desk_llm.complete(current_input=TextMessage(content="Continue editing"))
print(result2.generations[0].text)
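The two rounds above follow one pattern: feed in the latest code, state the next goal, and record what is already done. The sketch below expresses that pattern as a loop. The model call is replaced by a stand-in function so the example runs without a server, and `run_edit_rounds` and `fake_edit` are both hypothetical names.

```python
from typing import Callable


def run_edit_rounds(
    code: str,
    instructions: list[str],
    edit: Callable[[str, str, list[str]], str],
) -> str:
    """Apply one instruction per round, passing each round the latest code and the finished steps."""
    done: list[str] = []
    for instruction in instructions:
        code = edit(code, instruction, list(done))
        done.append(instruction)
    return code


def fake_edit(code: str, instruction: str, done: list[str]) -> str:
    """Stand-in for the model call: it only records which instruction was applied."""
    return f"{code}# applied: {instruction}\n"


result = run_edit_rounds(
    "def add(a, b):\n    return a + b\n",
    ["add type annotations", "add a docstring"],
    fake_edit,
)
print(result)
```

In real use, `edit` would set the current-code, intermediate, and purpose prompts on the SGLangDesk instance and return the text of the completion.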

Asynchronous calls¶

import asyncio

async def generate():
    desk_llm = SGLangDesk(
        name="qwen2.5-7b",
        host="http://localhost",
        port=30000
    )
    result = await desk_llm.async_complete(
        current_input=TextMessage(content="Generate code")
    )
    return result.generations[0].text

result = asyncio.run(generate())
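The main benefit of the asynchronous interface is running several requests at once. The sketch below shows one way to do that with a cap on in-flight requests so the local server is not flooded. The `fake_generate` coroutine stands in for a call to `desk_llm.async_complete`, and `gather_limited` is a hypothetical helper.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_limited(
    prompts: list[str],
    generate: Callable[[str], Awaitable[str]],
    limit: int = 4,
) -> list[str]:
    """Run generate over all prompts concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def one(prompt: str) -> str:
        async with semaphore:
            return await generate(prompt)

    # gather preserves the order of the inputs in its result list
    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_generate(prompt: str) -> str:
    """Stand-in for the model call; echoes the prompt in upper case after a short pause."""
    await asyncio.sleep(0.01)
    return prompt.upper()


print(asyncio.run(gather_limited(["a", "b", "c"], fake_generate, limit=2)))
```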

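SGLang server configuration¶

Basic launch parameters¶

# Start the server
# --tp sets the tensor-parallel size; --mem-fraction-static 0.9 reserves 90% of GPU memory
# for model weights and the KV cache pool. Comments cannot follow a trailing backslash.
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-3B-Instruct \
    --port 30000 \
    --host 0.0.0.0 \
    --tp 1 \
    --mem-fraction-static 0.9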
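When the server is started from a script instead of a shell, building the argument list in code avoids the line-continuation pitfalls of a multi-line command. The sketch below assembles such a list. It uses the full flag name `--mem-fraction-static`, and the helper name `build_launch_command` is hypothetical. Flag names can differ between SGLang releases, so it is worth confirming them with `python -m sglang.launch_server --help`.

```python
import shlex
import sys


def build_launch_command(
    model_path: str,
    port: int = 30000,
    host: str = "0.0.0.0",
    tp: int = 1,
    mem_fraction_static: float = 0.9,
) -> list[str]:
    """Assemble the argv list for launching the server; pass it to subprocess.Popen to start it."""
    return [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--host", host,
        "--tp", str(tp),
        "--mem-fraction-static", str(mem_fraction_static),
    ]


cmd = build_launch_command("meta-llama/Llama-3.2-3B-Instruct")
print(shlex.join(cmd))  # a copy-pasteable shell form of the same command
```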

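Inference optimization¶

# Use an FP8 KV cache and FP8 weight quantization
# (flag names vary between SGLang releases; confirm with `python -m sglang.launch_server --help`)
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-3B-Instruct \
    --port 30000 \
    --kv-cache-dtype fp8_e5m2 \
    --quantization fp8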

Multi-GPU deployment¶

# Use multiple GPUs: --tp 4 shards the model across 4 GPUs
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-3B-Instruct \
    --port 30000 \
    --tp 4 \
    --mem-fraction-static 0.9
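A quick estimate of weight memory per GPU helps when choosing a tensor-parallel size. The sketch below assumes weights are split evenly across GPUs and stored at 2 bytes per parameter (FP16 or BF16). It ignores the KV cache, activations, and framework overhead, so real usage will be higher. The function name is illustrative.

```python
def weight_gib_per_gpu(params_billion: float, bytes_per_param: float = 2.0, tp: int = 1) -> float:
    """Approximate weight memory per GPU in GiB when weights are sharded evenly across tp GPUs."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / tp / 2**30


# A 70B-parameter model at different tensor-parallel sizes
for tp in (1, 2, 4):
    print(tp, round(weight_gib_per_gpu(70, tp=tp), 1))
```

By this estimate a 70B model needs roughly 130 GiB for weights alone, which comes to about 33 GiB per GPU at a tensor-parallel size of 4.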

Performance optimization¶

1. Choose an appropriate model size¶

# Small model: fast, suited to simple tasks
desk_llm = SGLangDesk(
    name="llama-3.2-3b",
    host="http://localhost",
    port=30000
)

# Large model: better quality, but slower
desk_llm = SGLangDesk(
    name="llama-3.1-70b",
    host="http://localhost",
    port=30000
)

2. Adjust sampling parameters¶

# Use a low temperature for code generation
desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000,
    temperature=0.0,
    top_p=0.9
)

3. Use stop words¶

# Cut off unnecessary generation
desk_llm = SGLangDesk(
    name="llama-3.2-3b",
    host="http://localhost",
    port=30000,
    stop=["```", "END"]
)

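4. Server-side optimization¶

# Give the KV cache more room by raising the static memory fraction
# (confirm the available flags with `python -m sglang.launch_server --help`)
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-3B-Instruct \
    --port 30000 \
    --mem-fraction-static 0.9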

Best practices¶

1. Choose the right model¶

# Chinese-language scenarios: Qwen2.5
desk_llm = SGLangDesk(
    name="qwen2.5-7b",
    host="http://localhost",
    port=30000
)

# Code scenarios: CodeLlama
desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000
)

# English-language scenarios: Llama 3.2
desk_llm = SGLangDesk(
    name="llama-3.2-3b",
    host="http://localhost",
    port=30000
)

2. Set the temperature sensibly¶

# Code generation: low temperature
desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000,
    temperature=0.0
)

# Copywriting: high temperature
desk_llm = SGLangDesk(
    name="qwen2.5-7b",
    host="http://localhost",
    port=30000,
    temperature=0.8
)

3. Use append¶

# Make sure code blocks are closed correctly
desk_llm = SGLangDesk(
    name="codellama-7b",
    host="http://localhost",
    port=30000,
    stop=["```"],
    append="```"
)

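Differences from Ollama¶

Feature	SGLangDesk	OllamaDesk
Framework	SGLang	Ollama
Performance	Higher	Moderate
Configuration flexibility	High	Moderate
Multi-GPU support	✅ Native	⚠️ Limited
Deployment complexity	Moderate	Simple
Documentation completeness	Moderate	Thorough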

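Differences from remote models¶

Feature	SGLangDesk	ClaudeDesk/GPTDesk
Network dependency	None	Required
Data privacy	Fully local	Uploaded to the cloud
Cost	No usage fees (you supply the hardware)	Pay per use
Performance	Depends on hardware	Stable
Customizability	High	Low
Maintenance cost	Requires maintenance	No maintenance needed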
