Quick Deployment Tutorial: Qwen2.5-0.5B Instruct on Ubuntu 20.04

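Want to try running a lightweight large language model on your own Ubuntu server? Qwen2.5-0.5B Instruct is a good choice. It has only about 0.5 billion parameters, yet it follows instructions reasonably well and has modest hardware requirements, which makes it well suited to learning and to building a simple AI chat service.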

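Today I'll walk you through the complete process of deploying this model on Ubuntu 20.04. It isn't complicated: follow the steps and you should see the model running within half an hour.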

1. Preparation before deployment

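Before installing the model, we need to set up the runtime environment. It's like laying the foundation before building a house: once the environment is right, everything after it goes smoothly.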

1.1 Check your system environment

First, open a terminal and confirm your Ubuntu version. The title says 20.04, but the steps carry over to other releases with little change.

lsb_release -a

You'll see output similar to this:

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.6 LTS
Release:        20.04
Codename:       focal

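As long as the system is Ubuntu and the release isn't too old (nothing older than 18.04), you should be fine. Next, update the system packages so everything is current:

sudo apt update
sudo apt upgrade -y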
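If you'd rather check the release from a script than read it off the terminal, here is a minimal sketch. It parses text in the /etc/os-release format; the helper names parse_os_release and is_supported_ubuntu are my own, and the 18.04 floor is just the rule of thumb from this section, not an official requirement.

```python
def parse_os_release(text):
    """Parse KEY=value lines (the /etc/os-release format) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def is_supported_ubuntu(info, minimum=(18, 4)):
    """True if the release is Ubuntu and at least `minimum` (an assumed floor)."""
    if info.get("ID") != "ubuntu":
        return False
    try:
        major, minor = info.get("VERSION_ID", "").split(".")[:2]
        return (int(major), int(minor)) >= minimum
    except ValueError:
        # Missing or oddly shaped version string
        return False


# A sample standing in for the real file contents
sample = 'NAME="Ubuntu"\nID=ubuntu\nVERSION_ID="20.04"\n'
print(is_supported_ubuntu(parse_os_release(sample)))
```

On a real machine you would pass open("/etc/os-release").read() instead of the sample string.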

1.2 Install Python and the necessary tools

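The Qwen2.5 model runs on the Python transformers library, so a Python environment is required. Ubuntu 20.04 usually ships with Python 3.8, but a newer version is the better choice. I recommend Python 3.10 for its good compatibility and performance. Python 3.10 is not in the default Ubuntu 20.04 repositories, so you will most likely need to add a third-party source first, such as the deadsnakes PPA (check that it still publishes packages for your release):

sudo apt install software-properties-common -y
sudo add-apt-repository ppa:deadsnakes/ppa -y
sudo apt update
sudo apt install python3.10 python3.10-venv python3.10-dev python3-pip -y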

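After installation, confirm the Python version. Call python3.10 explicitly, because plain python3 still points to the system default:

python3.10 --version

If it shows Python 3.10.x, you're good. Next, install some dependencies that may be needed for compilation: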

sudo apt install build-essential cmake git -y

1.3 Set up the GPU environment (optional but recommended)

If your server has an NVIDIA GPU, you'll want to use it to speed up inference. If not, that's fine: the model runs on CPU too, just more slowly.

Check whether the NVIDIA driver and CUDA are present:

nvidia-smi

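If the command runs and shows your GPU (an RTX 4090, for example), the driver is installed. You'll also see a CUDA version such as 12.4, which is the highest CUDA version that driver supports. If nothing is shown, you probably need to install the NVIDIA driver and the CUDA Toolkit first.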

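On Ubuntu 20.04 you can install both from the official repositories:

sudo apt install nvidia-driver-550 -y  # adjust the driver version to your GPU
sudo apt install nvidia-cuda-toolkit -y

Reboot after installing, then run nvidia-smi again to confirm. The PyTorch wheels installed later bundle their own CUDA runtime, so the toolkit package mainly matters if you compile CUDA code yourself.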
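The same check can be done from Python, which is handy if you later want a script to decide between GPU and CPU. This is only a sketch: list_gpus and parse_gpu_csv are names I made up, and it relies on the nvidia-smi query flags --query-gpu and --format=csv,noheader.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Turn 'name, memory' CSV lines into a list of (name, memory) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2:
            gpus.append((parts[0], parts[1]))
    return gpus


def list_gpus():
    """Return detected GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


print(list_gpus() or "No NVIDIA GPU detected, falling back to CPU")
```

An empty list simply means the driver tools were not found, which is the CPU-only case described above.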

2. Create an isolated Python environment

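I strongly recommend creating a dedicated virtual environment for this project. It keeps the project from interfering with other Python projects on the system, and it makes cleanup easy later.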

python3.10 -m venv qwen_env

Once it's created, activate the environment:

source qwen_env/bin/activate

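You'll see (qwen_env) appear at the front of your terminal prompt, which means you're inside the virtual environment. From here on, every package installed with pip goes into this environment and leaves the global one untouched.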
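If you ever doubt which interpreter a script is running under, a small check like the following can help. It is a sketch with made-up helper names; the (3, 9) floor is an assumption you should adjust to whatever your packages actually require.

```python
import sys


def in_virtualenv():
    """Inside a venv, sys.prefix points at the environment, base_prefix does not."""
    return sys.prefix != sys.base_prefix


def python_is_new_enough(minimum=(3, 9)):
    """`minimum` is an assumed floor, not an official requirement."""
    return sys.version_info[:2] >= minimum


print("virtualenv active:", in_virtualenv())
print("interpreter:", sys.executable)
print("version ok:", python_is_new_enough())
```

With qwen_env activated, the interpreter path should point inside the qwen_env directory.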

3. Install the core Python dependencies

Now we can install the Python packages the model needs. The two core ones are PyTorch and transformers.

Install PyTorch first. Check the PyTorch website for the current install command and pick the one that matches your CUDA version, for example CUDA 12.1:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

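If you have no GPU, or want to try the CPU first, install the CPU build. On Linux the plain pip command pulls the larger CUDA build by default, so point pip at the CPU index explicitly:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu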
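To confirm which build you ended up with, you can ask PyTorch directly. The sketch below uses only torch.__version__, torch.cuda.is_available() and torch.cuda.get_device_name(); the describe_torch wrapper is my own and simply avoids a crash when PyTorch is not installed yet.

```python
import importlib.util


def describe_torch():
    """Report the PyTorch install without crashing if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    if torch.cuda.is_available():
        return f"PyTorch {torch.__version__}, GPU: {torch.cuda.get_device_name(0)}"
    return f"PyTorch {torch.__version__}, CPU only"


print(describe_torch())
```

If this reports CPU only on a machine with a working nvidia-smi, the usual suspect is a wheel built for a different CUDA version than your driver supports.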

Then install the transformers library, Hugging Face's framework for loading models and running inference:

pip install transformers

Qwen2.5 is a fairly recent model, so make sure transformers is new enough (4.37.0 or later is recommended):

pip install --upgrade transformers

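You also need accelerate, which handles placing the model on the available devices and is required for the device_map="auto" option used below: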

pip install accelerate

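Finally, you can install tiktoken. The first-generation Qwen tokenizer depended on it; as far as I can tell Qwen2.5 loads without it, so treat this step as optional: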

pip install tiktoken
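Before moving on, it's worth confirming that the installed versions are what you expect. Here is a rough sketch using the standard library's importlib.metadata; version_tuple and check_package are hypothetical helpers, and the version comparison is deliberately simple (it ignores suffixes such as +cu121 or .dev0).

```python
from importlib import metadata


def version_tuple(version):
    """'4.37.0' -> (4, 37, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_package(name, minimum=None):
    """Return (installed_version_or_None, ok) for one package."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return None, False
    ok = minimum is None or version_tuple(installed) >= version_tuple(minimum)
    return installed, ok


# The 4.37.0 floor for transformers is the one recommended in this section
for name, minimum in [("torch", None), ("transformers", "4.37.0"), ("accelerate", None)]:
    print(name, *check_package(name, minimum))
```

Each line prints the package name, the installed version (or None), and whether it passes the check.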

4. Download and run the Qwen2.5-0.5B Instruct model

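With the environment ready, we can start playing with the model. There are two ways to get it: let the code download it automatically, or download it manually and load it from disk.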

4.1 Option 1: let the code download it (simplest)

Create a Python script, for example run_qwen.py:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name; it is downloaded from Hugging Face automatically
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

print("Loading the model and tokenizer; the first run downloads the model, please be patient...")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # pick the precision automatically
    device_map="auto"    # pick the device automatically (GPU/CPU)
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Model loaded!")

# Prepare the conversation
prompt = "Explain in simple terms what artificial intelligence is"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Format the conversation into text the model understands
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize and move the tensors to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the reply
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256  # maximum number of tokens to generate
)

# generate() returns prompt + reply, so slice off the prompt tokens
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("\n=== Model reply ===")
print(response)

Run the script:

python run_qwen.py

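The first run downloads the model files, roughly 1 GB in total, so the wait depends on your network speed. Once the download finishes, the model runs inference and prints the result.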
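Two steps in that script are easy to gloss over: what apply_chat_template actually produces, and why the prompt tokens are sliced off afterwards. The sketch below imitates both in plain Python. It assumes the ChatML-style layout with <|im_start|> and <|im_end|> markers that Qwen chat models use; the tokenizer's own template is the authority and handles extras (default system prompts, tool calls) that this approximation ignores. render_chatml and strip_prompt are names I invented.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML layout used by Qwen chat models (no tool calls)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # An open assistant turn tells the model it is its turn to speak
        text += "<|im_start|>assistant\n"
    return text


def strip_prompt(input_ids, output_ids):
    """generate() returns prompt + reply; keep only the newly generated ids."""
    return output_ids[len(input_ids):]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is AI?"},
]
print(render_chatml(messages))
print(strip_prompt([1, 2, 3], [1, 2, 3, 40, 50]))
```

Printing text in run_qwen.py right after apply_chat_template is a quick way to compare the real output against this approximation.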

4.2 Option 2: download the model manually (recommended on slow networks)

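If the automatic download is too slow or keeps getting interrupted, you can download from modelscope (Alibaba Cloud's open model platform) first, which is usually faster.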

Install modelscope first:

pip install modelscope

Then download the model straight from the terminal:

python -c "from modelscope import snapshot_download; snapshot_download('Qwen/Qwen2.5-0.5B-Instruct', cache_dir='./qwen_model')"

Or use the command-line tool:

modelscope download --model Qwen/Qwen2.5-0.5B-Instruct --cache_dir ./qwen_model

After the download, change the Python script to load from the local path:

local_model_path = "./qwen_model/Qwen/Qwen2.5-0.5B-Instruct"  # adjust to the actual path on disk

model = AutoModelForCausalLM.from_pretrained(
    local_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(local_model_path)
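The exact directory layout under cache_dir can differ between modelscope versions, so rather than guessing the path you can search for it. A minimal sketch: find_model_dir is a hypothetical helper that looks for the config.json every transformers checkpoint contains. A throwaway directory stands in for ./qwen_model here so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def find_model_dir(root):
    """Return the first directory under `root` that holds a config.json, else None."""
    root = Path(root)
    if not root.is_dir():
        return None
    for config in sorted(root.rglob("config.json")):
        return str(config.parent)
    return None


# Demo with a temporary stand-in for ./qwen_model
with tempfile.TemporaryDirectory() as root:
    nested = Path(root) / "Qwen" / "Qwen2.5-0.5B-Instruct"
    nested.mkdir(parents=True)
    (nested / "config.json").write_text("{}")
    found = find_model_dir(root)

print(found)
```

In practice you would call find_model_dir("./qwen_model") and pass the result to from_pretrained.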

5. Try different conversation scenarios

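Once the model is running, you can throw all kinds of questions at it. I modified the script so it can hold a multi-turn conversation: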

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


class QwenChatbot:
    def __init__(self, model_path="Qwen/Qwen2.5-0.5B-Instruct"):
        print("Initializing the model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype="auto",
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.history = []
        print("Model ready!")

    def chat(self, user_input, system_prompt=None):
        # Add the system prompt (if any) at the start of a conversation
        if system_prompt and len(self.history) == 0:
            self.history.append({"role": "system", "content": system_prompt})

        # Add the user input
        self.history.append({"role": "user", "content": user_input})

        # Prepare the model input
        text = self.tokenizer.apply_chat_template(
            self.history,
            tokenize=False,
            add_generation_prompt=True
        )
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)

        # Generate the reply
        with torch.no_grad():
            generated_ids = self.model.generate(
                **model_inputs,
                max_new_tokens=512,
                do_sample=True,    # enable sampling for more varied replies
                temperature=0.7,   # controls randomness
                top_p=0.9          # nucleus sampling parameter
            )

        # Extract the reply (drop the prompt tokens)
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # Add the assistant reply to the history
        self.history.append({"role": "assistant", "content": response})

        # Keep the history from growing too long (optional): retain the system
        # prompt plus the 10 most recent messages (5 rounds), so the window
        # always starts on a user message
        has_system = self.history[0]["role"] == "system"
        head = self.history[:1] if has_system else []
        turns = self.history[len(head):]
        if len(turns) > 10:
            self.history = head + turns[-10:]

        return response

    def clear_history(self):
        self.history = []


# Usage example
if __name__ == "__main__":
    bot = QwenChatbot()

    # Test different questions
    test_questions = [
        "Hello, please introduce yourself",
        "How do I read a file in Python?",
        "Write a simple quicksort algorithm",
        "The weather is lovely today, don't you think?"
    ]

    for question in test_questions:
        print(f"\nYou: {question}")
        response = bot.chat(question)
        print(f"Qwen: {response}")
        print("-" * 50)

Run the script and you'll see the model's replies to each question. The 0.5B model is small, but it handles common questions reasonably well.
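History trimming is the part of a chat loop that is easiest to get subtly wrong, because cutting at the wrong index can leave the window starting on an assistant reply with no matching question. Here is one way to isolate it as a standalone function. This is a sketch of my own, not something from the Qwen or transformers APIs; max_messages is assumed to be a positive even number so every kept user message keeps its reply.

```python
def trim_history(history, max_messages=10):
    """Keep the system prompt (if any) plus the most recent messages."""
    head = history[:1] if history and history[0]["role"] == "system" else []
    turns = history[len(head):]
    return head + turns[-max_messages:]


# Build a fake conversation: one system prompt plus 6 user/assistant rounds
history = [{"role": "system", "content": "Be brief."}]
for i in range(6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history)
print(len(history), "->", len(trimmed))
print([m["role"] for m in trimmed[:3]])
```

Keeping the function separate from the model code also makes it trivial to test without loading any weights.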

6. Build a simple API service

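If you want other programs to call the model, you can put a simple web API in front of it. FastAPI makes this easy (install it with pip install fastapi uvicorn):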

import time
from typing import List, Optional

import torch
import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI(title="Qwen2.5-0.5B API")


# Request and response models
class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str = "qwen2.5-0.5b"
    messages: List[Message]
    max_tokens: Optional[int] = 512
    temperature: Optional[float] = 0.7


class ChatResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[dict]
    usage: dict


# Global model and tokenizer
model = None
tokenizer = None


@app.on_event("startup")
async def startup_event():
    """Load the model at startup."""
    global model, tokenizer
    print("Loading the Qwen2.5-0.5B model...")
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-0.5B-Instruct",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
    print("Model loaded!")


@app.post("/v1/chat/completions", response_model=ChatResponse)
async def chat_completion(request: ChatRequest):
    """Handle a chat request."""
    try:
        # Convert the message format
        messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]

        # Apply the chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # Tokenize
        inputs = tokenizer(text, return_tensors="pt").to(model.device)

        # Generate the reply
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=True
            )

        # Decode only the newly generated tokens
        prompt_length = inputs.input_ids.shape[1]
        response_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

        # Build the response
        return ChatResponse(
            id=f"chatcmpl-{int(time.time())}",
            created=int(time.time()),
            model=request.model,
            choices=[{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": response_text
                },
                "finish_reason": "stop"
            }],
            usage={
                "prompt_tokens": prompt_length,
                "completion_tokens": outputs.shape[1] - prompt_length,
                "total_tokens": outputs.shape[1]
            }
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": model is not None}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Save the code above as api_server.py, then run it:

python api_server.py
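One optional hardening step: the server passes temperature and max_tokens from the request straight to generate(). As far as I know, transformers rejects a non-positive temperature when do_sample=True, so a client sending temperature 0 would get a 500 error. A small helper along these lines could sit between the request and the generate() call. generation_kwargs is a hypothetical name, and falling back to greedy decoding is my own design choice rather than anything the original server does.

```python
def generation_kwargs(max_tokens=512, temperature=0.7):
    """Build keyword arguments for model.generate() from request fields."""
    kwargs = {"max_new_tokens": max(1, int(max_tokens))}
    if temperature is not None and temperature > 0:
        kwargs.update(do_sample=True, temperature=float(temperature))
    else:
        # Sampling needs a strictly positive temperature; use greedy decoding
        kwargs["do_sample"] = False
    return kwargs


print(generation_kwargs(100, 0.7))
print(generation_kwargs(100, 0))
```

In the endpoint this would be used as model.generate(**inputs, **generation_kwargs(request.max_tokens, request.temperature)).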

Once the service is up, you can test it with curl:

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, please answer in Chinese"}
    ],
    "max_tokens": 100
  }'

Or use a Python client (pip install requests if it is missing):

import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Write a poem about spring"}],
        "max_tokens": 200
    }
)
print(response.json()["choices"][0]["message"]["content"])
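If you'd rather not depend on requests, the standard library is enough for a small client. The sketch below assumes the server from this section is listening on localhost:8000; build_payload, extract_reply and ask are helper names of my own. Nothing is sent until you call ask().

```python
import json
import urllib.request


def build_payload(prompt, max_tokens=200):
    """Request body in the shape the /v1/chat/completions endpoint expects."""
    return {"messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens}


def extract_reply(body):
    """Pull the assistant text out of a chat-completion style response dict."""
    choices = body.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


def ask(prompt, url="http://localhost:8000/v1/chat/completions", timeout=120):
    """POST the prompt and return the reply text (requires a running server)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Offline check of the parsing logic with a hand-written response
fake = {"choices": [{"message": {"role": "assistant", "content": "hi"}}]}
print(extract_reply(fake))
```

The generous timeout is deliberate: on CPU a 200-token reply can take a while.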

7. Problems you may run into and how to fix them

You may hit a few small snags during deployment. Here are the common ones:

Problem 1: KeyError: 'qwen2'
This means your transformers version is too old to recognize the Qwen2.5 model configuration. The fix:

pip install --upgrade transformers

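Problem 2: out of GPU memory
The 0.5B model is small, but if you run many tasks at once or use a very long context, you can still run out of memory. Things to try:

Lower the max_new_tokens parameter
Load the model in half precision (torch.float16) rather than float32
If you have multiple GPUs, make sure device_map="auto" distributes the model correctly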
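To see why precision matters, it helps to do the arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below uses the round figure of 0.5 billion parameters from this article; real usage is higher because activations and the KV cache come on top. weight_memory_gib is a made-up helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory for the weights alone, in GiB (excludes activations and KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(0.5e9, dtype):.2f} GiB")
```

Half precision halves the weight footprint, which also lines up with the roughly 1 GB download mentioned in section 4.1.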

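Problem 3: the model downloads too slowly
Besides modelscope, you can also:

Use a mirror, such as the Tsinghua mirror for pip packages
Download the model somewhere else first, then scp it to the server
Use the huggingface-cli tool, which supports resuming interrupted downloads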
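On a flaky connection, wrapping the download in a retry loop saves a lot of babysitting. Below is a generic sketch with exponential backoff; retry is my own helper, not part of modelscope or huggingface_hub, and the flaky_download function only simulates a download that fails twice. The sleep function is injectable so the demo runs instantly.

```python
import time


def retry(action, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call `action()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            if attempt == attempts:
                raise
            wait = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({error}); retrying in {wait:.0f}s")
            sleep(wait)


# Simulated download that fails twice before succeeding
calls = {"count": 0}
waits = []


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("connection reset")
    return "./qwen_model"


result = retry(flaky_download, sleep=waits.append)
print(result, waits)
```

With the modelscope call from section 4.2, usage would look like retry(lambda: snapshot_download('Qwen/Qwen2.5-0.5B-Instruct', cache_dir='./qwen_model')).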

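Problem 4: generation is slow
On CPU, generation really is on the slow side. You can:

Make sure you installed the right PyTorch build (one with MKL optimizations)
Shorten the generation length
Consider a quantized version (although 0.5B is already very small)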
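Before tuning anything, measure. Tokens per second is the number to watch, and a tiny timing wrapper is enough to get it. In this sketch measure_throughput is a hypothetical helper and fake_generate stands in for a real call that returns how many new tokens were produced.

```python
import time


def measure_throughput(generate, *args, **kwargs):
    """Time one call to `generate` and report (tokens, seconds, tokens/second).

    `generate` must return the number of newly generated tokens.
    """
    start = time.perf_counter()
    new_tokens = generate(*args, **kwargs)
    elapsed = time.perf_counter() - start
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, elapsed, rate


def fake_generate(n_tokens):
    """Stand-in for model.generate(); pretends to work for a moment."""
    time.sleep(0.05)
    return n_tokens


tokens, seconds, rate = measure_throughput(fake_generate, 64)
print(f"{tokens} tokens in {seconds:.2f}s ({rate:.0f} tokens/s)")
```

With the real model, the wrapped function would run generate() and return the number of generated ids minus the prompt length.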

8. Summary

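All in all, deploying Qwen2.5-0.5B Instruct on Ubuntu 20.04 goes fairly smoothly. The model doesn't have many parameters, but it is more than enough for learning how to deploy large models and for building a simple chat service. Its biggest advantage is the low hardware requirement: an ordinary cloud server with a GPU, or even a decent CPU, can run it.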

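In practice you'll find it does well on common Q&A and simple code generation, but struggles with complex logical reasoning and long-form generation. That is fine: use it as a starting point to understand how large models work, then move on to bigger ones.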

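The most time-consuming parts of the deployment are likely environment setup and the model download. Once those are done, using the model is straightforward. If you plan to run it long term, consider turning the API server into a systemd service or containerizing it with Docker, which makes it easier to manage.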
