2026/5/15 7:10:42

Deploying Qwen3.6-27B with vLLM

Zhang Xiaoming

Front-end Developer


I. Model Introduction

On the evening of April 22, 2026, Alibaba released and open-sourced Qwen3.6-27B. Qwen3.6 was built from direct community feedback, with an emphasis on stability and practical value, giving developers a more intuitive, more responsive, and genuinely efficient coding experience. The model adds a Thinking Preservation feature, which retains the reasoning context from earlier messages in the conversation.

  • Hardware used in this article: 2 × NVIDIA A30; 24 GB × 2 = 48 GB of VRAM

  • Throughput: 73 tokens/s (measured with CherryStudio)

  • VRAM consumption: model weights take 30.52 GB; after startup, total usage is 19144 MiB × 2 (gpu-memory-utilization 0.8, maximum tokens 70,778)
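As a quick sanity check on the figures above (a back-of-the-envelope sketch; the numbers come from the bullets, and the MiB-to-GiB conversion is standard):

```python
# Per-GPU VRAM budget: a 24 GiB A30 at gpu-memory-utilization 0.8.
budget_gib = 24 * 0.8               # 19.2 GiB usable per GPU
observed_mib = 19144                # per-GPU usage reported after startup
observed_gib = observed_mib / 1024  # ~18.7 GiB

# The 30.52 GB of weights are split across both GPUs by tensor parallelism,
# so each card holds roughly half, leaving the rest of the budget for KV cache.
weights_per_gpu_gb = 30.52 / 2      # ~15.26 GB

print(f"budget per GPU:   {budget_gib:.2f} GiB")
print(f"observed per GPU: {observed_gib:.2f} GiB")
assert observed_gib < budget_gib    # startup usage stays within the 0.8 budget
```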

Official model comparison and benchmark scores (chart from the original release, not reproduced here):

II. Model Deployment

The model can be deployed with the SGLang, KTransformers, or vLLM inference engines; third-party community quantizations can run on engines such as Ollama, llama.cpp, and LM Studio. This article covers deployment with vLLM in detail.

1. Download the Model

This guide uses the official Qwen/Qwen3.6-27B-FP8 model as the example. If you have enough VRAM, you can also use the official Qwen/Qwen3.6-27B directly.

pip install modelscope
modelscope download --model Qwen/Qwen3.6-27B-FP8 --local_dir ./Qwen3.6-27B-FP8

2. Deploy with Docker

docker pull vllm/vllm-openai:v0.19.1-cu130

Whether to use the vllm/vllm-openai:v0.19.1-cu130 image or the vllm/vllm-openai:v0.19.1 image depends on your CUDA environment. My A30 cards run in a CUDA 13.0 environment, so I use vllm/vllm-openai:v0.19.1-cu130.

docker-compose.yml

services:
  Qwen3.6-27B-FP8:
    container_name: Qwen36_27B_FP8
    image: vllm/vllm-openai:v0.19.1-cu130
    restart: always
    ports:
      - "8000:8000"
    command: [
      "/data/MODELS/Qwen3.6-27B-FP8",
      "--max-model-len", "65536",
      "--max-num-seqs", "30",
      "--gpu-memory-utilization", "0.8",
      "--quantization", "fp8",
      "--enable-prefix-caching",
      "--enable-chunked-prefill",
      "--reasoning-parser", "qwen3",
      "--served-model-name", "Qwen3.6",
      "--enable-auto-tool-choice",
      "--tool-call-parser", "qwen3_coder",
      "--host", "0.0.0.0",
      "--port", "8000",
      "--tensor-parallel-size", "2",
      "--kv-cache-dtype", "fp8_e4m3",
      "--mm-encoder-tp-mode", "data",
      "--mm-processor-cache-type", "shm",
      "--limit-mm-per-prompt.video", "0",
      "--swap-space", "8",
      "--async-scheduling",
      "--no-enforce-eager",
      "--compilation-config.mode", "3",
      "--compilation-config.cudagraph_mode", "PIECEWISE",
      "--speculative-config", "{\"method\":\"mtp\",\"num_speculative_tokens\":2}",
    ]
    volumes:
      - ./models:/data/MODELS/Qwen3.6-27B-FP8
    environment:
      - TZ=Asia/Shanghai
      - NCCL_DEBUG=WARN
      - NCCL_IB_DISABLE=1
      - NCCL_SOCKET_IFNAME=eth0
      - HOSTNAME=localhost
      - MASTER_ADDR=localhost
      - MASTER_PORT=12355
      - VLLM_CONFIGURE_LOGGING=1
    extra_hosts:
      - "localhost:127.0.0.1"
      - "host.docker.internal:host-gateway"
    shm_size: 32g
    ulimits:
      memlock: -1
      stack: 67108864
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    runtime: nvidia
docker compose up -d
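The --speculative-config value in the compose file above is a JSON string that has to be escaped by hand inside the YAML command list. A minimal sketch (plain Python; the variable names are mine) that builds and round-trip-checks such a string with json.dumps:

```python
import json

# Speculative decoding config exactly as passed to vLLM in the compose file.
spec_config = {"method": "mtp", "num_speculative_tokens": 2}

# json.dumps produces the string vLLM will parse; no manual escaping is
# needed when generating the compose file programmatically.
arg = json.dumps(spec_config, separators=(",", ":"))
print(arg)  # {"method":"mtp","num_speculative_tokens":2}

# Round-trip to confirm the string parses back to the same config.
assert json.loads(arg) == spec_config
```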

Note:

  1. Do not use qwen3_coder for --tool-call-parser: when connected to an Agent, calls are easily interrupted.
  2. The tclf90/Qwen3.6-27B-AWQ model is too slow on two A30 cards (24 GB × 2 = 48 GB), reaching only 11 tokens/s.

III. Using the Model

Text-only input

from openai import OpenAI

# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.6\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", chat_response)

Image input

from openai import OpenAI

# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", response)

Video input

from openai import OpenAI

# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your
# desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", response)

IV. Usage Recommendations

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

  • Instruct (non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

  • Instruct (non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

  • RoPE scaling (e.g. YaRN) can extend the context length to 1M tokens.

    Edit the model configuration file config.json and change the rope_parameters field under text_config to:

    {
      "mrope_interleaved": true,
      "mrope_section": [11, 11, 10],
      "rope_type": "yarn",
      "rope_theta": 10000000,
      "partial_rotary_factor": 0.25,
      "factor": 4.0,
      "original_max_position_embeddings": 262144
    }

    Or pass the override directly when launching vLLM:

    VLLM_USE_MODELSCOPE=true VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... \
      --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
      --max-model-len 1010000
  • Adjusting the size parameter in the model configuration file video_preprocessor_config.json changes how much video the model can take in. Setting longest_edge to 469,762,048 (corresponding to 224k video tokens) supports higher-frame-rate sampling of hour-long videos for better results:

    {"longest_edge": 469762048, "shortest_edge": 4096}
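The recommendations above can be collected into a small lookup table, and the YaRN numbers can be sanity-checked at the same time (a sketch; the dict name and layout are mine, the values come straight from the bullets):

```python
# Recommended sampling presets from the list above (structure is illustrative).
SAMPLING_PRESETS = {
    "thinking_general":   dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
    "thinking_coding":    dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=0.0, repetition_penalty=1.0),
    "instruct_general":   dict(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
    "instruct_reasoning": dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
}

# YaRN arithmetic: a scaling factor of 4.0 on the 262144-token base window
# gives 4.0 * 262144 = 1,048,576 tokens, which covers --max-model-len 1010000.
extended_ctx = int(4.0 * 262144)
assert extended_ctx == 1_048_576
assert 1_010_000 <= extended_ctx
print(f"extended context: {extended_ctx} tokens")
```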