I've recently been working on a local RAG project: the data has to stay on-premises and the models have to be deployed locally, so I'm documenting the process here. This series is built on the open-source PIKE-RAG knowledge base and covers localization changes, wrapping the system with FastAPI, building a front-end page, and more. This post only covers deploying the open-source PIKE-RAG knowledge base and using a locally deployed LLM as the chat model to chunk content.
Introduction to the PIKE-RAG Knowledge Base
PIKE-RAG is a modular knowledge base system open-sourced by Microsoft. It covers document parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. Apart from the lack of a user interface, PIKE-RAG can handle every step of a knowledge base workflow.
Compared with existing knowledge bases, it introduces two main innovations:
1. Knowledge atomizing: a piece of source material is split into "minimal useful knowledge units", and each unit is given a "question label" (for example, a passage saying "drug X was approved in 2020" gets the label "In which year was drug X approved?"). At query time, a search can match either the raw text or the question labels, so the key information is found quickly.
2. Knowledge-aware task decomposition: when breaking down a complex question, the system first checks what information the knowledge base actually contains and then decides how to decompose. For example, for "How many interchangeable biosimilars are there?", if the knowledge base already has an "interchangeable list", it counts directly; if it only has a "list of all biosimilars", it decomposes the task into "find the list → judge whether each one is interchangeable → count", avoiding blind decomposition that leads down the wrong path.
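To make knowledge atomizing concrete, here is a minimal sketch of the idea rather than PIKE-RAG's actual implementation: each chunk is sent to an LLM that proposes question labels, which can then be indexed alongside the chunk. The endpoint, the model name, the prompt wording, and the atomize_chunk helper are all my own illustrative assumptions.

```python
from openai import OpenAI

# Hypothetical sketch of knowledge atomizing: for each chunk, ask an LLM for
# the "atomic questions" the chunk can answer, then index the chunk under both
# its raw text and those questions. The base_url/model assume a local
# Xinference deployment; adjust them to your own setup.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="xinference")

def atomize_chunk(chunk: str, model: str = "deepseek-r1-distill-qwen") -> list[str]:
    """Ask the chat model which atomic questions `chunk` can answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "List the atomic questions this passage can answer, one per line."},
            {"role": "user", "content": chunk},
        ],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

chunk = "Drug X was approved by the FDA in 2020."
for question in atomize_chunk(chunk):
    print(question)  # e.g. "In which year was drug X approved?"
```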
GitHub repository: https://github.com/microsoft/PIKE-RAG
Gitee mirror: https://gitee.com/mirrors_microsoft/PIKE-RAG
Setting Up the PIKE-RAG Knowledge Base
Code Structure
- Core code: the pikerag/ directory, containing core components such as document loaders and transformers.
  - document_loaders/: document loading and reading tools;
  - document_transformers/: document splitting and filtering, including the LLM-based tagger/splitter;
  - knowledge_retrievers/: multiple retriever implementations, such as BM25, Chroma, and ChunkAtom retrievers;
  - llm_client/: language model client interfaces, supporting the OpenAI API, Azure, HuggingFace, etc.;
  - prompts/: prompt template definitions covering chunking, QA, generation, and more;
  - utils/: general utilities such as logging, config parsing, and path management;
  - workflows/: core workflow wrappers, including the flow-control modules for QA, evaluation, and tagging.
- Data processing: the data_process/ directory, with scripts for sentence splitting and benchmark data processing (e.g. chunk_by_sentence.py, retrieval_contexts_as_chunks.py).
- Example scripts: the examples/ directory, providing examples for biology, HotpotQA, MuSiQue, and other scenarios (QA, evaluation, tagging scripts, etc.).
- Documentation: the docs/ directory, with guides for environment setup and running the examples.
- Helper scripts: the scripts/ directory, with Azure-related installation and login scripts.
- Configuration files: the configs/ directory under each example scenario, containing YAML configuration files (e.g. tagging and QA workflow configs).
Local Model Deployment
I used Xinference to deploy the 4-bit quantized version of DeepSeek-R1-32B as the chat model, and bge-m3 as the embedding model. If you want to learn how to deploy models with Xinference, see: https://mp.weixin.qq.com/s/glAeQDgdXIHvIgwUmtnVzA.
You can also deploy the LLM and embedding model in whatever way you are familiar with; a quick sanity check for the endpoint is shown below.
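Before wiring PIKE-RAG to the local models, it is worth confirming that the endpoint answers OpenAI-compatible requests. The snippet below is a sketch assuming the default Xinference URL http://localhost:9997/v1 and api_key "xinference"; the model names are placeholders for whatever names you launched the models under.

```python
from openai import OpenAI

# Sanity check against the Xinference OpenAI-compatible endpoint.
# The model names are assumptions; use the names you registered in Xinference.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="xinference")

chat_model = "deepseek-r1-distill-qwen"   # your chat model name in Xinference
embed_model = "bge-m3"                    # your embedding model name in Xinference

chat = client.chat.completions.create(
    model=chat_model,
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(chat.choices[0].message.content)

embedding = client.embeddings.create(model=embed_model, input=["hello world"])
print(len(embedding.data[0].embedding))  # embedding dimension (1024 for bge-m3 dense vectors)
```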
Environment Setup
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Initialize the project directory
uv init PithyRAG
cd PithyRAG
# Change the Python version to 3.12, then run:
uv run main.py
# Clone the repository
git clone https://gitee.com/mirrors_microsoft/PIKE-RAG.git
# Copy pikerag into the PithyRAG directory
cp -r PIKE-RAG/pikerag ./
```
Delete the uv.lock file and edit pyproject.toml, replacing its contents with the following.
```toml
[project]
name = "pithyrag"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "bs4>=0.0.2",
    "chromadb>=1.1.1",
    "dacite>=1.9.2",
    "datasets>=4.2.0",
    "fastapi[standard]>=0.120.0",
    "jsonlines>=4.0.0",
    "langchain>=0.3.27",
    "langchain-chroma>=0.2.6",
    "langchain-community>=0.3.31",
    "langchain-huggingface>=0.3.1",
    "locust>=2.41.6",
    "markdown>=3.9",
    "openai>=2.3.0",
    "openpyxl>=3.1.5",
    "pandas>=2.3.3",
    "pickledb>=1.3.2",
    "pydantic-settings>=2.11.0",
    "python-docx>=1.2.0",
    "rank-bm25>=0.2.2",
    "rouge>=1.0.1",
    "sentence-transformers>=5.1.1",
    "spacy>=3.8.7",
    "tabulate>=0.9.0",
    "torch>=2.8.0",
    "tqdm>=4.67.1",
    "transformers>=4.57.0",
    "unstructured>=0.18.15",
    "word2number>=1.1",
    "xinference-client>=1.10.1",
]

[[tool.uv.index]]
url = "https://pypi.tuna.tsinghua.edu.cn/simple"
default = true
```
Run uv sync to download the dependencies.
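Once uv sync finishes, an optional sanity check is to import a few of the key dependencies inside the project environment. The file name check_env.py is arbitrary; run it with uv run python check_env.py.

```python
# check_env.py: minimal check that the main dependencies resolved correctly.
import chromadb
import langchain
import openai
import torch
import transformers

print("chromadb:", chromadb.__version__)
print("langchain:", langchain.__version__)
print("openai:", openai.__version__)
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```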
Writing the Local LLM Interface
First, add a file named xinference_client.py under the pikerag/llm_client directory and copy the following code into it.
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2025/11/26 19:05:27
# @Author  : Jsm
# @Version : 1.0
# @Desc    : Describe

import json
import os
import re
import time
from typing import List, Literal, Optional, Union

import openai
from langchain_core.embeddings import Embeddings
from openai import OpenAI
from openai.types import CreateEmbeddingResponse
from openai.types.chat.chat_completion import ChatCompletion
from pickledb import PickleDB

from pikerag.llm_client.base import BaseLLMClient
from pikerag.utils.logger import Logger

# Needed only when testing standalone
# from config.config import load_config
# model_config = load_config().model_config

# def parse_wait_time_from_error(error: openai.RateLimitError) -> Optional[int]:
#     """Parse wait time from OpenAI RateLimitError.
#
#     Args:
#         error (openai.RateLimitError): The rate limit error from OpenAI API.
#
#     Returns:
#         Optional[int]: The suggested wait time in seconds, None if parsing failed.
#     """
#     try:
#         info_str: str = error.args[0]
#         info_dict_str: str = info_str[info_str.find("{"):]
#         error_info: dict = json.loads(re.compile(r"(?<!\\)'").sub('"', info_dict_str))
#         error_message = error_info["error"]["message"]
#         matches = re.search(r"Try again in (\d+) seconds", error_message)
#         wait_time = int(matches.group(1)) + 3  # Add 3 seconds buffer
#         return wait_time
#     except Exception:
#         return None


class XinferenceClient(BaseLLMClient):
    """Xinference client implementation for DeepSeek models."""
    NAME = "XinferenceClient"

    def __init__(
        self,
        location: str = None,
        auto_dump: bool = True,
        logger: Logger = None,
        max_attempt: int = 5,
        exponential_backoff_factor: int = None,
        unit_wait_time: int = 60,
        **kwargs,
    ) -> None:
        """LLM Communication Client for Xinference endpoints with models.

        Args:
            location (str): The file location of the LLM client communication cache. No cache would be created if
                set to None. Defaults to None.
            auto_dump (bool): Automatically save the Client's communication cache or not. Defaults to True.
            logger (Logger): Client logger. Defaults to None.
            max_attempt (int): Maximum attempt time for LLM requesting. Request would be skipped if max_attempt
                reached. Defaults to 5.
            exponential_backoff_factor (int): Set to enable exponential backoff retry manner. Every time the wait
                time would be `exponential_backoff_factor ^ num_attempt`. Set to None to disable and use the
                `unit_wait_time` manner. Defaults to None.
            unit_wait_time (int): `unit_wait_time` would be used only if the exponential backoff mode is disabled.
                Every time the wait time would be `unit_wait_time * num_attempt`, with seconds (s) as the time unit.
                Defaults to 60.
            **kwargs: Additional arguments for Xinference client initialization.

        yml config example:
            ...
            llm_client:
              module_path: pikerag.llm_client
              class_name: XinferenceClient
              args: {
                base_url: http://localhost:9997/v1  # Xinference server URL
                api_key: xinference  # Default API key for Xinference
              }
            ...
        """
        super().__init__(
            location, auto_dump, logger, max_attempt, exponential_backoff_factor, unit_wait_time, **kwargs,
        )
        print(f"kwargs: {kwargs}")
```