Baichuan-M2-32B-GPTQ-Int4 Model API Development Tutorial: Implementing a RESTful Interface with Flask

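1. Introduction

In medical AI, Baichuan-M2-32B-GPTQ-Int4 is a medical-enhanced reasoning model, quantized to 4 bits with GPTQ, that can provide intelligent support for a range of healthcare applications. This tutorial walks you through building a RESTful API for the model from scratch with the Flask framework, so that developers can call the model's capabilities easily.

By the end of this tutorial you will know:

How to set up a basic Flask application
How to integrate the Baichuan-M2-32B-GPTQ-Int4 model
How to design sensible API routes and request handling
How to apply some simple performance optimizations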

2. Environment Setup and Quick Deployment

2.1 System Requirements

Python 3.8 or later
NVIDIA GPU (RTX 4090 or better recommended)
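Around 24GB of VRAM for the 4-bit quantized version (the weights of a 32B model alone take roughly 16GB at 4 bits, and the KV cache and activations need headroom on top)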
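As a rough sanity check on the VRAM requirement, the weight footprint of a quantized model can be estimated from its parameter count and bit width. The sketch below assumes 32 billion parameters and a hypothetical 10% overhead for quantization scales and zero points. It ignores the KV cache and activations, which need additional memory.

```python
def estimate_weight_gib(num_params: float, bits_per_weight: int,
                        overhead: float = 0.10) -> float:
    """Estimate the memory taken by model weights, in GiB.

    overhead is an assumed fudge factor for quantization metadata
    (scales, zero points); it is not a measured value.
    """
    raw_bytes = num_params * bits_per_weight / 8
    return raw_bytes * (1 + overhead) / 1024 ** 3


if __name__ == "__main__":
    # Compare half-precision weights with 4-bit quantized weights
    for bits in (16, 4):
        gib = estimate_weight_gib(32e9, bits)
        print(f"{bits}-bit weights: ~{gib:.1f} GiB")
```

Under these assumptions the 4-bit weights alone come to a little over 16 GiB, which is why a card with spare headroom is advisable.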

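2.2 Installing Dependencies

First create and activate a Python virtual environment:

    python -m venv baichuan-api
    source baichuan-api/bin/activate    # Linux/macOS
    # or, on Windows:
    # baichuan-api\Scripts\activate

Then install the required Python packages:

    pip install flask flask-cors transformers torch vllm accelerate

The accelerate package is needed for the device_map="auto" option used later in this tutorial. Loading GPTQ weights through transformers also requires a GPTQ backend (for example optimum together with auto-gptq or gptqmodel); check the model card for the versions it expects.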

2.3 Downloading the Model

You can download the model directly with the Hugging Face transformers library. The first call to from_pretrained fetches the files into the local Hugging Face cache:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

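3. Building the Basic Flask Application

3.1 Creating the Application Structure

Create a simple Flask application layout:

    baichuan_api/
    ├── app.py              # main application file
    ├── config.py           # configuration
    ├── requirements.txt    # dependencies
    └── utils/              # helper functions
        └── model_utils.py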
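The layout lists a config.py, but its contents are not shown. One possible sketch is given here, assuming settings come from environment variables with defaults that match the values used elsewhere in this tutorial. The Config class, the load_config function, and the BAICHUAN_* variable names are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Settings for the API; defaults mirror the values used in the tutorial."""
    model_name: str = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"
    host: str = "0.0.0.0"
    port: int = 5000
    default_max_new_tokens: int = 512


def load_config(env=None) -> Config:
    """Build a Config, letting (hypothetical) BAICHUAN_* variables override defaults."""
    env = os.environ if env is None else env
    defaults = Config()
    return Config(
        model_name=env.get("BAICHUAN_MODEL_NAME", defaults.model_name),
        host=env.get("BAICHUAN_API_HOST", defaults.host),
        port=int(env.get("BAICHUAN_API_PORT", defaults.port)),
        default_max_new_tokens=int(
            env.get("BAICHUAN_MAX_NEW_TOKENS", defaults.default_max_new_tokens)
        ),
    )
```

With something like this in place, app.py could call load_config() once at startup and pass the host and port to app.run instead of hard-coding them.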

3.2 Writing the Basic Flask Application

Add the following starter code to app.py:

    from flask import Flask, jsonify, request
    from flask_cors import CORS

    app = Flask(__name__)
    CORS(app)  # allow cross-origin requests


    @app.route('/')
    def home():
        return jsonify({"status": "Baichuan-M2-32B API is running"})


    if __name__ == '__main__':
        # debug=True is for local development only
        app.run(host='0.0.0.0', port=5000, debug=True)

4. Model Integration and API Design

4.1 Wrapping Model Loading

Wrap the model loading and inference logic in utils/model_utils.py:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch


    class BaichuanModel:
        def __init__(self, model_name="baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"):
            self.tokenizer = AutoTokenizer.from_pretrained(
                model_name, trust_remote_code=True
            )
            # device_map="auto" lets accelerate place the weights on the available GPU(s)
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                trust_remote_code=True,
                torch_dtype=torch.float16,
                device_map="auto",
            )

        def generate(self, prompt, max_length=512):
            # Move the inputs to the device that holds the model
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            with torch.inference_mode():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_length,
                    do_sample=True,  # temperature and top_p only apply when sampling
                    temperature=0.7,
                    top_p=0.9,
                )
            # Decode only the newly generated tokens, not the echoed prompt
            new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
            return self.tokenizer.decode(new_tokens, skip_special_tokens=True)
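Loading a 32B model takes a while, and Flask may serve requests from several threads. One possible pattern, sketched here, is to create the wrapper lazily and exactly once behind a lock. The LazyModel class is hypothetical; its loader argument stands in for a call such as BaichuanModel.

```python
import threading
from typing import Any, Callable


class LazyModel:
    """Create an expensive object on first use, at most once across threads."""

    def __init__(self, loader: Callable[[], Any]):
        self._loader = loader
        self._instance = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Fast path: skip the lock once the instance exists
        if self._instance is None:
            with self._lock:
                # Re-check under the lock so only one thread runs the loader
                if self._instance is None:
                    self._instance = self._loader()
        return self._instance
```

In app.py this could look like lazy = LazyModel(BaichuanModel) at import time and lazy.get().generate(prompt) inside the route, so the first request pays the loading cost instead of application startup.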

4.2 Designing the API Route

Add the model inference route to app.py:

    from utils.model_utils import BaichuanModel

    model = BaichuanModel()


    @app.route('/api/generate', methods=['POST'])
    def generate_text():
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        prompt = data.get('prompt', '')
        max_length = data.get('max_length', 512)

        if not prompt:
            return jsonify({"error": "Prompt is required"}), 400

        try:
            result = model.generate(prompt, max_length)
            return jsonify({"result": result})
        except Exception as e:
            return jsonify({"error": str(e)}), 500
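The route checks only that a prompt is present. A stricter validation step could be sketched as follows, assuming the same prompt and max_length fields. The parse_generate_payload helper and the 2048-token cap are hypothetical choices, not part of the original API.

```python
MAX_NEW_TOKENS_LIMIT = 2048  # assumed upper bound; tune it to your hardware


def parse_generate_payload(data, default_max=512, limit=MAX_NEW_TOKENS_LIMIT):
    """Validate a decoded JSON body and return (prompt, max_length).

    Raises ValueError with a client-facing message when the body is invalid.
    """
    if not isinstance(data, dict):
        raise ValueError("Request body must be a JSON object")

    prompt = data.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("Prompt is required")

    max_length = data.get("max_length", default_max)
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(max_length, bool) or not isinstance(max_length, int):
        raise ValueError("max_length must be an integer")
    if not 1 <= max_length <= limit:
        raise ValueError(f"max_length must be between 1 and {limit}")

    return prompt, max_length
```

Inside the route, the ValueError can be caught and turned into a 400 response, which keeps malformed requests from reaching the model and from being reported as 500 errors.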

5. Complete API Implementation and Testing

5.1 Complete Application Code

The complete app.py after putting the pieces together:

    from flask import Flask, jsonify, request
    from flask_cors import CORS
    from utils.model_utils import BaichuanModel
    import logging

    # Configure logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    app = Flask(__name__)
    CORS(app)

    # Initialize the model
    try:
        model = BaichuanModel()
        logger.info("Model loaded successfully")
    except Exception as e:
        logger.error(f"Failed to load model: {str(e)}")
        model = None


    @app.route('/')
    def home():
        return jsonify({"status": "Baichuan-M2-32B API is running"})


    @app.route('/api/generate', methods=['POST'])
    def generate_text():
        if model is None:
            return jsonify({"error": "Model not loaded"}), 503

        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        prompt = data.get('prompt', '')
        max_length = data.get('max_length', 512)

        if not prompt:
            return jsonify({"error": "Prompt is required"}), 400

        try:
            result = model.generate(prompt, max_length)
            return jsonify({"result": result})
        except Exception as e:
            logger.error(f"Generation error: {str(e)}")
            return jsonify({"error": str(e)}), 500


    if __name__ == '__main__':
        # debug=False: the debug reloader starts a second process,
        # which would load the model twice
        app.run(host='0.0.0.0', port=5000, debug=False)

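5.2 Testing the API

You can test the API with curl or Postman:

    curl -X POST http://localhost:5000/api/generate \
      -H "Content-Type: application/json" \
      -d '{"prompt": "The patient reports a headache for three days with a mild fever. What are the possible diagnoses?", "max_length": 200}'

Expected response format (the generated text will vary from run to run):

    {
      "result": "Based on the symptoms described (a headache for three days with a mild fever), possible diagnoses include: 1. upper respiratory tract infection; 2. influenza; 3. sinusitis; 4. migraine. The patient should monitor their temperature and seek medical attention promptly if a high fever persists or other symptoms appear."
    }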
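The same request can also be issued from Python. The sketch below uses only the standard library and assumes the server runs at http://localhost:5000 as in the curl example. The build_generate_request and call_generate helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_length=200,
                           base_url="http://localhost:5000"):
    """Build (but do not send) a POST request for /api/generate."""
    body = json.dumps({"prompt": prompt, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_generate(prompt, max_length=200, timeout=120):
    """Send the request and return the generated text from the JSON response."""
    req = build_generate_request(prompt, max_length)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["result"]
```

Calling call_generate("...") requires the API to be running. A generous timeout is used here because generation on a 32B model can take many seconds.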

6. Performance Optimization and Advanced Features

6.1 Speeding Up Inference with vLLM

Modify model_utils.py to run inference through vLLM:

    from vllm import LLM, SamplingParams


    class BaichuanModel:
        def __init__(self):
            self.llm = LLM(
                model="baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
                trust_remote_code=True,
                quantization="gptq",
            )

        def generate(self, prompt, max_length=512):
            # Build the sampling parameters per call so that concurrent
            # requests do not overwrite each other's max_tokens
            sampling_params = SamplingParams(
                temperature=0.7, top_p=0.9, max_tokens=max_length
            )
            outputs = self.llm.generate([prompt], sampling_params)
            return outputs[0].outputs[0].text

6.2 Adding Batch Support

Add a route that accepts several prompts in one request. Note that this version still runs the prompts one after another, so it saves HTTP round trips rather than GPU time:

    @app.route('/api/batch_generate', methods=['POST'])
    def batch_generate():
        if model is None:
            return jsonify({"error": "Model not loaded"}), 503

        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        prompts = data.get('prompts', [])
        max_length = data.get('max_length', 512)

        if not prompts or not isinstance(prompts, list):
            return jsonify({"error": "Prompts must be a non-empty list"}), 400

        try:
            results = [model.generate(p, max_length) for p in prompts]
            return jsonify({"results": results})
        except Exception as e:
            logger.error(f"Batch generation error: {str(e)}")
            return jsonify({"error": str(e)}), 500
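To let the backend process several prompts in a single call, the prompts can be handed over in bounded chunks. The sketch below assumes a hypothetical batch_fn callable that maps a list of prompts to a list of outputs in the same order. With the vLLM wrapper from section 6.1, batch_fn could wrap self.llm.generate, which accepts a list of prompts. The generate_in_chunks name and the chunk size of 8 are illustrative choices.

```python
from typing import Callable, List, Sequence


def generate_in_chunks(prompts: Sequence[str],
                       batch_fn: Callable[[List[str]], List[str]],
                       chunk_size: int = 8) -> List[str]:
    """Run batch_fn over prompts in chunks, preserving the input order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    results: List[str] = []
    for start in range(0, len(prompts), chunk_size):
        chunk = list(prompts[start:start + chunk_size])
        outputs = batch_fn(chunk)
        # Guard against a backend that drops or adds outputs
        if len(outputs) != len(chunk):
            raise RuntimeError("batch_fn returned a different number of outputs")
        results.extend(outputs)
    return results
```

Capping the chunk size bounds the memory used per call, while each chunk still benefits from whatever batching the backend does internally.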

6.3 Adding a Health Check Endpoint

Add a health check for the model:

    @app.route('/api/health')
    def health_check():
        if model is None:
            return jsonify({"status": "unhealthy", "message": "Model not loaded"}), 503

        try:
            # Run a short test generation to confirm the model responds
            model.generate("Test", max_length=5)
            return jsonify({
                "status": "healthy",
                "model": "Baichuan-M2-32B-GPTQ-Int4"
            })
        except Exception as e:
            return jsonify({
                "status": "unhealthy",
                "message": str(e)
            }), 503
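Running a generation on every health probe can compete with real requests for the GPU. One option, sketched here, is to cache the probe result for a short time. The CachedHealthCheck class and the 30-second TTL are hypothetical; the probe argument stands in for a call such as model.generate("Test", max_length=5).

```python
import time
from typing import Callable, Optional, Tuple


class CachedHealthCheck:
    """Run a probe at most once per TTL window and reuse the last result."""

    def __init__(self, probe: Callable[[], object],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_time: Optional[float] = None
        self._last_result: Tuple[bool, str] = (False, "not checked yet")

    def status(self) -> Tuple[bool, str]:
        """Return (healthy, message), re-running the probe only when stale."""
        now = self._clock()
        if self._last_time is None or now - self._last_time >= self._ttl:
            try:
                self._probe()
                self._last_result = (True, "ok")
            except Exception as exc:
                self._last_result = (False, str(exc))
            self._last_time = now
        return self._last_result
```

The route would then return 200 or 503 depending on the first element of status(), and frequent probes from a load balancer would trigger at most one test generation per TTL window.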