2026/5/15 7:10:42

Deploying Qwen3.6-27B with vLLM

Zhang Xiaoming

Front-end Developer


I. Model Introduction

On the evening of April 22, 2026, Alibaba released and open-sourced Qwen3.6-27B. Qwen3.6 was built from direct community feedback, with an emphasis on stability and practical value, giving developers a more intuitive, more responsive, and genuinely efficient coding experience. The model adds a Thinking Preservation feature, which retains the reasoning context from earlier messages in the conversation.

  • Hardware used in this article: 2 × NVIDIA A30; 24 GB × 2 = 48 GB of VRAM

  • Throughput: 73 tokens/s (measured with CherryStudio)

  • VRAM consumption: model weights take 30.52 GB; after startup, total usage is 19144 MiB × 2 (gpu-memory-utilization 0.8, maximum tokens 70,778)
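As a quick sanity check on the figures above (a back-of-the-envelope sketch; the numbers come from the bullets, and the MiB-to-GiB conversion is standard):

```python
# Per-GPU VRAM budget: a 24 GiB A30 at gpu-memory-utilization 0.8.
budget_gib = 24 * 0.8               # 19.2 GiB usable per GPU
observed_mib = 19144                # per-GPU usage reported after startup
observed_gib = observed_mib / 1024  # ~18.7 GiB

# The 30.52 GB of weights are split across both GPUs by tensor parallelism,
# so each card holds roughly half, leaving the rest of the budget for KV cache.
weights_per_gpu_gb = 30.52 / 2      # ~15.26 GB

print(f"budget per GPU:   {budget_gib:.2f} GiB")
print(f"observed per GPU: {observed_gib:.2f} GiB")
assert observed_gib < budget_gib    # startup usage stays within the 0.8 budget
```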

Official model comparison and benchmark scores (chart from the original release, not reproduced here):

II. Model Deployment

The model can be deployed with the SGLang, KTransformers, or vLLM inference engines; third-party community quantizations can run on engines such as Ollama, llama.cpp, and LM Studio. This article covers deployment with vLLM in detail.

1. Download the Model

This guide uses the official Qwen/Qwen3.6-27B-FP8 model as the example. If you have enough VRAM, you can also use the official Qwen/Qwen3.6-27B directly.

pip install modelscope
modelscope download --model Qwen/Qwen3.6-27B-FP8 --local_dir ./Qwen3.6-27B-FP8

2. Deploy with Docker

docker pull vllm/vllm-openai:v0.19.1-cu130

Whether to use the vllm/vllm-openai:v0.19.1-cu130 image or the vllm/vllm-openai:v0.19.1 image depends on your CUDA environment. My A30 cards run in a CUDA 13.0 environment, so I use vllm/vllm-openai:v0.19.1-cu130.

docker-compose.yml

services:
  Qwen3.6-27B-FP8:
    container_name: Qwen36_27B_FP8
    image: vllm/vllm-openai:v0.19.1-cu130
    restart: always
    ports:
      - "8000:8000"
    command: [
      "/data/MODELS/Qwen3.6-27B-FP8",
      "--max-model-len", "65536",
      "--max-num-seqs", "30",
      "--gpu-memory-utilization", "0.8",
      "--quantization", "fp8",
      "--enable-prefix-caching",
      "--enable-chunked-prefill",
      "--reasoning-parser", "qwen3",
      "--served-model-name", "Qwen3.6",
      "--enable-auto-tool-choice",
      "--tool-call-parser", "qwen3_coder",
      "--host", "0.0.0.0",
      "--port", "8000",
      "--tensor-parallel-size", "2",
      "--kv-cache-dtype", "fp8_e4m3",
      "--mm-encoder-tp-mode", "data",
      "--mm-processor-cache-type", "shm",
      "--limit-mm-per-prompt.video", "0",
      "--swap-space", "8",
      "--async-scheduling",
      "--no-enforce-eager",
      "--compilation-config.mode", "3",
      "--compilation-config.cudagraph_mode", "PIECEWISE",
      "--speculative-config", "{\"method\":\"mtp\",\"num_speculative_tokens\":2}",
    ]
    volumes:
      - ./models:/data/MODELS/Qwen3.6-27B-FP8
    environment:
      - TZ=Asia/Shanghai
      - NCCL_DEBUG=WARN
      - NCCL_IB_DISABLE=1
      - NCCL_SOCKET_IFNAME=eth0
      - HOSTNAME=localhost
      - MASTER_ADDR=localhost
      - MASTER_PORT=12355
      - VLLM_CONFIGURE_LOGGING=1
    extra_hosts:
      - "localhost:127.0.0.1"
      - "host.docker.internal:host-gateway"
    shm_size: 32g
    ulimits:
      memlock: -1
      stack: 67108864
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    runtime: nvidia
docker compose up -d
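The --speculative-config value in the compose file above is a JSON string that has to be escaped by hand inside the YAML command list. A minimal sketch (plain Python; the variable names are mine) that builds and round-trip-checks such a string with json.dumps:

```python
import json

# Speculative decoding config exactly as passed to vLLM in the compose file.
spec_config = {"method": "mtp", "num_speculative_tokens": 2}

# json.dumps produces the string vLLM will parse; no manual escaping is
# needed when generating the compose file programmatically.
arg = json.dumps(spec_config, separators=(",", ":"))
print(arg)  # {"method":"mtp","num_speculative_tokens":2}

# Round-trip to confirm the string parses back to the same config.
assert json.loads(arg) == spec_config
```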

Note:

  1. Do not use qwen3_coder for --tool-call-parser: when connected to an Agent, calls are easily interrupted.
  2. The tclf90/Qwen3.6-27B-AWQ model is too slow on two A30 cards (24 GB × 2 = 48 GB), reaching only 11 tokens/s.

III. Using the Model

Text-only input

from openai import OpenAI

# Configured by environment variables
client = OpenAI()

messages = [
    {"role": "user", "content": "Type \"I love Qwen3.6\" backwards"},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", chat_response)

Image input

from openai import OpenAI

# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/CI_Demo/mathv-1327.jpg"
                }
            },
            {
                "type": "text",
                "text": "The centres of the four illustrated circles are in the corners of the square. The two big circles touch each other and also the two little circles. With which factor do you have to multiply the radii of the little circles to obtain the radius of the big circles?\nChoices:\n(A) $\\frac{2}{9}$\n(B) $\\sqrt{5}$\n(C) $0.8 \\cdot \\pi$\n(D) 2.5\n(E) $1+\\sqrt{2}$"
            }
        ]
    }
]

response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", response)

Video input

from openai import OpenAI

# Configured by environment variables
client = OpenAI()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

# When vLLM is launched with `--media-io-kwargs '{"video": {"num_frames": -1}}'`,
# video frame sampling can be configured via `extra_body` (e.g., by setting `fps`).
# This feature is currently supported only in vLLM.
#
# By default, `fps=2` and `do_sample_frames=True`.
# With `do_sample_frames=True`, you can customize the `fps` value to set your
# desired video sampling rate.
response = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={
        "top_k": 20,
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("Chat response:", response)

IV. Usage Recommendations

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

  • Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0

  • Instruct (non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

  • Instruct (non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

  • RoPE scaling (e.g. YaRN) can extend the context length to 1M tokens.

    Edit the model configuration file config.json and change the rope_parameters field under text_config to:

    {
      "mrope_interleaved": true,
      "mrope_section": [11, 11, 10],
      "rope_type": "yarn",
      "rope_theta": 10000000,
      "partial_rotary_factor": 0.25,
      "factor": 4.0,
      "original_max_position_embeddings": 262144
    }

    Or pass the override directly when launching vLLM:

    VLLM_USE_MODELSCOPE=true VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... \
      --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
      --max-model-len 1010000
  • Adjusting the size parameter in the model configuration file video_preprocessor_config.json changes how much video the model can take in. Setting longest_edge to 469,762,048 (corresponding to 224k video tokens) supports higher-frame-rate sampling of hour-long videos for better results:

    {"longest_edge": 469762048, "shortest_edge": 4096}
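The recommendations above can be collected into a small lookup table, and the YaRN numbers can be sanity-checked at the same time (a sketch; the dict name and layout are mine, the values come straight from the bullets):

```python
# Recommended sampling presets from the list above (structure is illustrative).
SAMPLING_PRESETS = {
    "thinking_general":   dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
    "thinking_coding":    dict(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=0.0, repetition_penalty=1.0),
    "instruct_general":   dict(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
    "instruct_reasoning": dict(temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
                               presence_penalty=1.5, repetition_penalty=1.0),
}

# YaRN arithmetic: a scaling factor of 4.0 on the 262144-token base window
# gives 4.0 * 262144 = 1,048,576 tokens, which covers --max-model-len 1010000.
extended_ctx = int(4.0 * 262144)
assert extended_ctx == 1_048_576
assert 1_010_000 <= extended_ctx
print(f"extended context: {extended_ctx} tokens")
```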