LLM Xinference Installation and Usage (CPU, Metal, and CUDA Inference, Plus Distributed Deployment)

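1. Detailed Steps

1.1 Installation

# CUDA/CPU
pip install "xinference[transformers]"
pip install "xinference[vllm]"
pip install "xinference[sglang]"

# Metal (MPS)
pip install "xinference[mlx]"
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

Note: in my setup llama-cpp-python could not be used on CUDA, possibly because of the nvcc version or other local environment settings (it works normally in a plain C/C++ environment). The Metal build of llama-cpp-python works normally. Newer llama-cpp-python releases document the Metal option as -DGGML_METAL=on, so use whichever name matches your installed version. To install further dependencies such as flashinfer, see the official installation guide: https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html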
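The two install groups above split by hardware. As a rough sketch of how a setup script might pick between them, the function below guesses a backend label. The labels (mps, cuda, cpu), the function name detect_backend, and the heuristics (Apple Silicon means Metal, nvidia-smi on PATH means CUDA) are assumptions made for this illustration and are not part of Xinference.

```shell
#!/bin/sh
# Hypothetical helper: guess which install group from section 1.1 applies.
# Arguments are optional overrides so the logic can be exercised on any machine.
detect_backend() {
  os_name=${1:-$(uname -s)}
  arch=${2:-$(uname -m)}
  if [ "$os_name" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    # Assumption: Apple Silicon Macs use the Metal (MPS) group.
    echo mps
  elif command -v nvidia-smi >/dev/null 2>&1; then
    # Assumption: an NVIDIA driver tool on PATH implies a CUDA setup.
    echo cuda
  else
    echo cpu
  fi
}

# Example: print the guess for the current machine.
# detect_backend
```

The result is only a hint; the choice of engines (Transformers, vLLM, SGLang, MLX, llama.cpp) still follows the official installation guide.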

1.2 Launch

1.2.1 Direct Launch

Minimal command

xinference-local --host 0.0.0.0 --port 9997

Command with more options

Set the model cache path and the model source (Hugging Face or ModelScope):

# CUDA/CPU
XINFERENCE_HOME=/path/.xinference XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

# Metal (MPS)
XINFERENCE_HOME=/path/.xinference XINFERENCE_MODEL_SRC=modelscope PYTORCH_ENABLE_MPS_FALLBACK=1 xinference-local --host 0.0.0.0 --port 9997
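The two launch lines above differ only in the extra PYTORCH_ENABLE_MPS_FALLBACK=1 variable. A minimal sketch that assembles the same line from a backend label is shown here. The function name launch_command and the labels cuda, cpu, and mps are hypothetical names for this example. The environment variables and the xinference-local flags are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: print the launch line for a given backend label.
# It only prints the command; it does not start Xinference.
launch_command() {
  backend=$1                  # "cuda", "cpu", or "mps" (labels local to this sketch)
  home_dir=$2                 # value for XINFERENCE_HOME (model cache path)
  model_src=${3:-modelscope}  # value for XINFERENCE_MODEL_SRC
  env_prefix="XINFERENCE_HOME=$home_dir XINFERENCE_MODEL_SRC=$model_src"
  case "$backend" in
    mps) env_prefix="$env_prefix PYTORCH_ENABLE_MPS_FALLBACK=1" ;;
    cuda|cpu) ;;
    *) echo "unknown backend: $backend" >&2; return 1 ;;
  esac
  printf '%s xinference-local --host 0.0.0.0 --port 9997\n' "$env_prefix"
}

# Example: show the Metal launch line.
# launch_command mps /path/.xinference
```

Printing the command instead of running it keeps the sketch safe to try. Copy the printed line into a terminal to launch for real.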

1.2.2 Cluster Deployment

Use ifconfig to look up the IP address of the current server.
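If you would rather not read the ifconfig output by eye, a small filter can pull out the address. This is a sketch under assumptions: the function name first_ipv4 is made up for this example, and it simply takes the first non-loopback IPv4 address, which may not be the interface you want on a machine with several network cards.

```shell
#!/bin/sh
# Hypothetical helper: read ifconfig-style text on stdin and print the
# first IPv4 address that is not in the 127.x.x.x loopback range.
first_ipv4() {
  awk '$1 == "inet" {
    sub(/^addr:/, "", $2)          # older Linux ifconfig prints "inet addr:1.2.3.4"
    if ($2 !~ /^127\./) { print $2; exit }
  }'
}

# Example usage on the current server:
# ifconfig | first_ipv4
```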

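1.2.2.1 Start the Supervisor on the Main Server

# Format (SUPERVISOR_IP is the IP of the current server, i.e. the main server)
xinference-supervisor -H ${SUPERVISOR_IP} --port 9997
# Example
xinference-supervisor -H 192.168.31.100 --port 9997

1.2.2.2 Start a Worker on Each Other Server

# Format (WORKER_IP is the IP of the current server, i.e. the worker server)
xinference-worker -e "http://${SUPERVISOR_IP}:9997" -H ${WORKER_IP}
# Example
xinference-worker -e "http://192.168.31.100:9997" -H 192.168.31.101

Note: add environment variables such as XINFERENCE_HOME, XINFERENCE_MODEL_SRC, and PYTORCH_ENABLE_MPS_FALLBACK as needed, placing them in front of the command at launch.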
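A worker can only register once the supervisor is reachable, so a deployment script may want to wait before starting it. The sketch below polls the supervisor's /docs page (the API documentation URL mentioned in section 1.3) with curl. The function name wait_for_supervisor, the retry count, and the delay are assumptions for illustration and are not Xinference options.

```shell
#!/bin/sh
# Hypothetical helper: poll the supervisor endpoint until it answers.
# Returns 0 once "$endpoint/docs" responds successfully, 1 after all attempts fail.
wait_for_supervisor() {
  endpoint=$1          # e.g. http://192.168.31.100:9997
  attempts=${2:-30}    # how many times to try
  delay=${3:-2}        # seconds to sleep between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf --max-time 5 -o /dev/null "$endpoint/docs"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: start the worker only after the supervisor answers.
# wait_for_supervisor "http://192.168.31.100:9997" &&
#   xinference-worker -e "http://192.168.31.100:9997" -H 192.168.31.101
```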

1.3 Usage

Open http://${SUPERVISOR_IP}:9997/docs to browse the API documentation, and open http://${SUPERVISOR_IP}:9997 to use Xinference.
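The same address can be queried from the command line. The sketch below builds request URLs and shows, in comments, a curl call to /v1/models. That path is assumed here from Xinference's OpenAI-compatible REST interface, and the /docs page on your own deployment is the authoritative list of routes. The function name api_url is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build a URL for the Xinference HTTP endpoint.
api_url() {
  host=$1
  port=${2:-9997}
  path=${3:-/}
  printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

# Example calls (they need a running server, so they are left as comments):
# curl -s "$(api_url 192.168.31.100 9997 /docs)"        # API documentation page
# curl -s "$(api_url 192.168.31.100 9997 /v1/models)"   # assumed route: list running models
```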

2. References

2.1 Xinference

2.1.1 Deployment Documentation

Running Xinference locally

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#run-xinference-locally

Deploying Xinference in a cluster

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#deploy-xinference-in-a-cluster

2.1.2 Installation Documentation

Official page

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html

Transformers engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#transformers-backend

vLLM engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#vllm-backend

Llama.cpp engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#llama-cpp-backend

MLX engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#mlx-backend

3. Resources

3.1 Xinference

3.1.1 GitHub

Official page

https://github.com/xorbitsai/inference

https://github.com/xorbitsai/inference/blob/main/README_zh_CN.md

3.1.2 Installation Documentation

SGLang engine

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#sglang-backend

Other platforms (installing on Ascend NPU)

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation.html#other-platforms

https://inference.readthedocs.io/zh-cn/latest/getting_started/installation_npu.html#installation-npu
