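Cosmos-Reason1-7B Code Example: Wrapping the Model in a REST API for Debugging with Postman

1. Project Overview

Cosmos-Reason1-7B is a local large language model inference tool built on NVIDIA's official model and tuned for logical reasoning, mathematical calculation, and programming questions. This article shows how to add REST API support to that tool so it can be called and debugged remotely with Postman and similar clients.

A purely local tool keeps data private and secure, but it cannot be reached from other machines. Wrapping it in a REST API keeps inference on local hardware while allowing remote calls, which is especially useful for team collaboration, system integration, and automated testing.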

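2. Environment Setup and Dependencies

Before wrapping the tool in an API, make sure the following are in place:

Python 3.8 or later
PyTorch and the Transformers library installed
The base Cosmos-Reason1-7B tool running successfully

First install the web framework dependencies: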

pip install fastapi uvicorn pydantic

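These libraries cover everything needed to build the REST API service:

FastAPI: a modern, efficient web framework that generates API documentation automatically
Uvicorn: a lightweight ASGI server used to run the FastAPI application
Pydantic: a data validation library that checks the format of API requests and responses

3. Core API Service Implementation

3.1 Basic Framework

Start by creating the main API service file, api_server.py: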

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import List, Optional
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="Cosmos-Reason1-7B API",
    description="REST API for Cosmos-Reason1-7B Reasoning Model",
    version="1.0.0"
)

# Module-level globals that hold the model and tokenizer
model = None
tokenizer = None
device = "cuda" if torch.cuda.is_available() else "cpu"


class ChatRequest(BaseModel):
    message: str
    conversation_history: Optional[List[dict]] = None
    max_length: Optional[int] = 1024
    temperature: Optional[float] = 0.7


class ChatResponse(BaseModel):
    response: str
    reasoning_process: str
    full_output: str
    status: str


# on_event still works but is deprecated in recent FastAPI releases
# in favor of lifespan handlers.
@app.on_event("startup")
async def load_model():
    """Load the model when the service starts."""
    global model, tokenizer
    try:
        logger.info("Loading the Cosmos-Reason1-7B model...")
        # Model loading code (kept consistent with the original tool).
        # If AutoModelForCausalLM cannot load this checkpoint, use the
        # model class named on the model card instead.
        model_name = "nvidia/Cosmos-Reason1-7B"
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            trust_remote_code=True
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True
        )
        logger.info("Model loaded; ready to accept requests")
    except Exception as e:
        logger.error(f"Model loading failed: {str(e)}")
        raise  # re-raise with the original traceback

3.2 Core Inference API

Next, implement the main chat inference endpoint:

@app.post("/chat", response_model=ChatResponse)
async def chat_completion(request: ChatRequest):
    """Handle a chat inference request."""
    # Checked outside the try block so the 503 is not rewrapped as a 500
    if model is None or tokenizer is None:
        raise HTTPException(status_code=503, detail="Model not ready")
    try:
        # Build the conversation; copy the list so the request is not mutated
        messages = list(request.conversation_history or [])
        messages.append({"role": "user", "content": request.message})

        # Apply the chat template
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        # Encode the input and move it to the device the model was placed on
        model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

        # Run inference
        with torch.no_grad():
            generated_ids = model.generate(
                **model_inputs,
                # max_new_tokens bounds only the generated part;
                # max_length would also count the prompt tokens
                max_new_tokens=request.max_length,
                temperature=request.temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens, dropping the echoed prompt
        prompt_length = model_inputs["input_ids"].shape[1]
        generated_text = tokenizer.batch_decode(
            generated_ids[:, prompt_length:],
            skip_special_tokens=True
        )[0]

        # Separate the reasoning process from the final answer
        reasoning, answer = extract_reasoning_and_answer(generated_text)

        return ChatResponse(
            response=answer,
            reasoning_process=reasoning,
            full_output=generated_text,
            status="success"
        )
    except Exception as e:
        logger.error(f"Inference error: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Inference failed: {str(e)}")


def extract_reasoning_and_answer(text: str):
    """Extract the reasoning process and the final answer from model output."""
    # The reasoning process sits inside <think> tags
    think_start = text.find("<think>")
    think_end = text.find("</think>")

    reasoning = ""
    answer = text
    if think_start != -1 and think_end != -1:
        reasoning = text[think_start + len("<think>"):think_end].strip()
        answer = text[think_end + len("</think>"):].strip()
    return reasoning, answer
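
The tag-based extraction above returns an empty reasoning string when the closing tag is missing, which can happen if generation stops at the token limit. The sketch below is one possible variant that tolerates an unclosed block; the name split_reasoning is hypothetical and not part of the original tool.

```python
import re
from typing import Tuple

# Hypothetical helper: mirrors extract_reasoning_and_answer but also accepts
# a <think> block that was never closed (for example, truncated output).
_THINK_RE = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from raw model output."""
    match = _THINK_RE.search(text)
    if match is None:
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Whatever follows the matched block (possibly nothing) is the answer.
    answer = text[match.end():].strip()
    return reasoning, answer
```

With truncated output such as "<think>cut off", this returns the partial reasoning and an empty answer, so a caller can tell that the model never reached a conclusion.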

3.3 Helper Endpoints

Add a few practical helper endpoints:

@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": model is not None,
        "device": device,
        "gpu_available": torch.cuda.is_available()
    }


@app.post("/clear_memory")
async def clear_memory():
    """Release cached GPU memory."""
    try:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            return {"status": "success", "message": "GPU memory cleared"}
        return {"status": "success", "message": "No GPU memory to clear"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to clear GPU memory: {str(e)}")


@app.get("/model_info")
async def get_model_info():
    """Return model information."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {
        "model_name": "Cosmos-Reason1-7B",
        "framework": "Transformers",
        "precision": "FP16",
        "device": device,
        "parameters": "7B"
    }

4. Starting and Testing the API Service

4.1 Starting the API Server

Create the startup script start_api.py:

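import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "api_server:app",
        host="0.0.0.0",  # listen on all interfaces to allow remote access; the API has no authentication
        port=8000,
        reload=True,     # auto-reload during development; every reload loads the model again
        workers=1        # single worker process
    )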

Run the API service:

python start_api.py

Once the service starts, you should see output similar to the following:

INFO: Started server process [12345]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

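4.2 Testing the API with Postman

The API service can now be tested with Postman:

1. Health check test

Method: GET
URL: http://localhost:8000/health
Expected response: the service status and model loading information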
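
The same health check can be scripted so that automated tests wait for the service before sending inference requests. The following is a minimal sketch using only the standard library; the helper names fetch_health and wait_until_ready, the attempt count, and the delay are assumptions rather than part of the original project.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str) -> dict:
    """GET <base_url>/health and parse the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(fetch: Callable[[], dict],
                     attempts: int = 30,
                     delay: float = 2.0) -> bool:
    """Poll until the health payload reports model_loaded, or give up."""
    for _ in range(attempts):
        try:
            if fetch().get("model_loaded"):
                return True
        except OSError:
            # The server is not accepting connections yet
            # (urllib's URLError is a subclass of OSError).
            pass
        time.sleep(delay)
    return False
```

A typical call would be wait_until_ready(lambda: fetch_health("http://localhost:8000")). Passing the fetch function in as a parameter keeps the polling logic testable without a running server.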

2. Chat inference test

Method: POST
URL: http://localhost:8000/chat
Body (JSON):

{
  "message": "If a circle has a radius of 5 cm, what is its area?",
  "max_length": 512,
  "temperature": 0.7
}

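3. Multi-step reasoning with conversation history

Method: POST
URL: http://localhost:8000/chat
Body (JSON):

{
  "message": "Then what is the circumference?",
  "conversation_history": [
    {
      "role": "user",
      "content": "If a circle has a radius of 5 cm, what is its area?"
    },
    {
      "role": "assistant",
      "content": "<think>The area of a circle is πr². With radius r = 5 cm, the area = 3.14159 × 5² = 3.14159 × 25 ≈ 78.54 square centimeters</think>The area is approximately 78.54 square centimeters."
    }
  ]
}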
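
Because the server keeps no state between calls, the client has to resend earlier turns in conversation_history on every request. A small sketch of how a client might maintain that list follows; build_chat_payload and record_turn are hypothetical helper names, and the field names are taken from the request body shown above.

```python
from typing import Dict, List, Optional


def build_chat_payload(message: str,
                       history: Optional[List[Dict[str, str]]] = None,
                       **options) -> dict:
    """Build the JSON body for POST /chat."""
    payload = {"message": message}
    if history:
        payload["conversation_history"] = list(history)
    payload.update(options)  # for example max_length or temperature
    return payload


def record_turn(history: List[Dict[str, str]],
                user_message: str,
                assistant_output: str) -> List[Dict[str, str]]:
    """Return a new history list with one user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_output},
    ]
```

Whether to store the full output (including the <think> block, as in the example body above) or only the final answer as the assistant turn is a design choice; storing only the answer keeps later prompts shorter.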

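4.3 Advanced Postman Tips

Environment variables: define environment variables in Postman to switch easily between environments:

base_url: http://localhost:8000
api_key: (if authentication is required)

Collection test scripts: add validation scripts in Postman's Tests tab (labelled Scripts > Post-response in newer Postman versions):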

// Check the response status
pm.test("Status code is 200", function () {
    pm.response.to.have.status(200);
});

// Check the response structure
pm.test("Response has correct structure", function () {
    var jsonData = pm.response.json();
    pm.expect(jsonData).to.have.property('response');
    pm.expect(jsonData).to.have.property('reasoning_process');
    pm.expect(jsonData).to.have.property('status', 'success');
});
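
For test runs outside Postman, the same structural checks can be expressed in Python. This is a sketch only: check_chat_response is a hypothetical helper, and the required fields are taken from the ChatResponse model defined in section 3.1.

```python
from typing import List

REQUIRED_FIELDS = ("response", "reasoning_process", "full_output", "status")


def check_chat_response(status_code: int, body: dict) -> List[str]:
    """Return a list of problems; an empty list means the response passed."""
    problems = []
    if status_code != 200:
        problems.append(f"unexpected status code {status_code}")
    for field in REQUIRED_FIELDS:
        if field not in body:
            problems.append(f"missing field '{field}'")
    if body.get("status") != "success":
        problems.append(f"status is {body.get('status')!r}, expected 'success'")
    return problems
```

Returning a list of problems rather than raising on the first failure makes it easier to report every issue in a single test run.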

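Batch testing: create a test collection to run several test cases in one pass:

Health check → model info → simple reasoning → complex reasoning
Error handling tests (for example, sending invalid JSON)
Stress tests (many requests in a row)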
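
A stress test of this kind can also be scripted. The sketch below sends a burst of prompts through a thread pool and summarizes the outcome; run_burst and its send parameter are hypothetical, and send is assumed to be any function that posts one prompt to /chat and returns the HTTP status code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_burst(send: Callable[[str], int],
              prompts: List[str],
              concurrency: int = 4) -> Dict[str, float]:
    """Send all prompts through a thread pool and summarize the results."""

    def timed(prompt: str):
        start = time.perf_counter()
        try:
            status = send(prompt)
        except Exception:
            status = -1  # count transport errors as failures
        return status, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, prompts))

    latencies = [elapsed for _, elapsed in results]
    return {
        "total": len(results),
        "succeeded": sum(1 for status, _ in results if status == 200),
        "max_latency_s": max(latencies) if latencies else 0.0,
    }
```

With a single worker process, the server is likely to handle these requests one at a time, so the maximum latency should grow roughly with the number of queued requests.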

5. Practical Usage Examples

5.1 Solving Math Problems

Use the API to solve math problems:

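import requests


def solve_math_problem(problem):
    url = "http://localhost:8000/chat"
    payload = {
        "message": problem,
        "max_length": 1024
    }
    # Generation can be slow, so allow a generous timeout
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()  # surface HTTP errors instead of failing on missing keys
    result = response.json()

    print(f"Problem: {problem}")
    print(f"Reasoning: {result['reasoning_process']}")
    print(f"Answer: {result['response']}")
    print("---")


# Try several math problems
problems = [
    "Solve the equation: 2x + 5 = 15",
    "Compute: (15 × 4) ÷ (3 + 2)",
    "An arithmetic sequence has first term 3 and common difference 4. Find the 10th term."
]

for problem in problems:
    solve_math_problem(problem)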
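
To judge the model's replies, it helps to have reference answers computed independently. The short sketch below works out the three problems above with exact arithmetic; it is a checking aid and not part of the original tool.

```python
from fractions import Fraction

# 2x + 5 = 15  ->  x = (15 - 5) / 2
x = Fraction(15 - 5, 2)

# (15 × 4) ÷ (3 + 2)
quotient = Fraction(15 * 4, 3 + 2)

# Arithmetic sequence: a_n = a_1 + (n - 1) * d with a_1 = 3, d = 4, n = 10
tenth_term = 3 + (10 - 1) * 4

print(x, quotient, tenth_term)  # prints: 5 12 39
```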

5.2 Answering Programming Questions

Use the API to get programming guidance:

import requests


def get_programming_help(question):
    url = "http://localhost:8000/chat"
    payload = {
        "message": question,
        "max_length": 2048  # programming questions may need longer responses
    }
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()


# Ask for help with Python
help_request = """
How do I implement quicksort in Python? Please provide example code and a brief explanation.
"""

result = get_programming_help(help_request)
print("Programming answer:")
print(result['response'])

5.3 Logical Reasoning Tasks

Handle more involved logical reasoning problems:

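import requests


def logical_reasoning(problem):
    url = "http://localhost:8000/chat"
    payload = {
        "message": problem,
        "temperature": 0.3  # lower temperature for more deterministic reasoning
    }
    response = requests.post(url, json=payload, timeout=300)
    response.raise_for_status()
    result = response.json()

    print("Logic problem:", problem)
    print("Reasoning:", result['reasoning_process'])
    print("Conclusion:", result['response'])
    print("=" * 50)


# Logical reasoning example
logic_problem = """
Among three people A, B, and C, one always tells the truth, one always lies, and one sometimes tells the truth and sometimes lies.
A says: B always tells the truth.
B says: C always lies.
C says: A sometimes tells the truth and sometimes lies.
Who always tells the truth? Who always lies? Who sometimes tells the truth and sometimes lies?
"""

logical_reasoning(logic_problem)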
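
A puzzle this small can be checked by brute force, which gives a reference answer to compare against the model's conclusion. The sketch below tries every assignment of the three roles and keeps those consistent with all three statements; the role labels and function names are illustrative choices.

```python
from itertools import permutations

ROLES = ("truth", "liar", "random")


def consistent(role: str, statement_is_true: bool) -> bool:
    """Check a statement's truth value against the speaker's role."""
    if role == "truth":
        return statement_is_true
    if role == "liar":
        return not statement_is_true
    return True  # the "random" speaker may say anything


def solve():
    solutions = []
    for a, b, c in permutations(ROLES):
        statements = [
            (a, b == "truth"),   # A: "B always tells the truth."
            (b, c == "liar"),    # B: "C always lies."
            (c, a == "random"),  # C: "A sometimes tells the truth and sometimes lies."
        ]
        if all(consistent(role, value) for role, value in statements):
            solutions.append({"A": a, "B": b, "C": c})
    return solutions


print(solve())  # prints: [{'A': 'random', 'B': 'liar', 'C': 'truth'}]
```

Only one assignment survives: C always tells the truth, B always lies, and A is the one who sometimes tells the truth.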