Examples of Splitting a Paragraph into Sentences While Keeping Punctuation in Python

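In Python, there are several ways to split a paragraph into sentences while keeping each sentence's closing punctuation.

Below are several common approaches. The examples were collected and adapted from online sources.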

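1 Regex approach

Plain Chinese text can be split with regular expressions. Below are two regex-based examples.

1.1 Basic splitting

Regular expressions are a common tool for sentence splitting. An example follows.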

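    import re


    def split_paragraph_to_sentences(paragraph):
        """Split a paragraph into sentences, keeping the closing punctuation.

        Handles both Chinese and English terminators.
        """
        # Zero-width split point right after a terminator (。！？；.!?;),
        # swallowing any whitespace that follows.
        # Splitting on a possibly empty match requires Python 3.7+.
        pattern = r'(?<=[。！？；.!?;])\s*'
        sentences = re.split(pattern, paragraph.strip())
        # Drop empty strings
        sentences = [s.strip() for s in sentences if s.strip()]
        return sentences


    # Example
    paragraph = "这是一个测试段落。这是第二句话！这是第三句话？让我们继续。结尾标点.最后一句。"
    sentences = split_paragraph_to_sentences(paragraph)
    for i, sentence in enumerate(sentences, 1):
        print(f"Sentence {i}: {sentence}")

Output:

Sentence 1: 这是一个测试段落。
Sentence 2: 这是第二句话！
Sentence 3: 这是第三句话？
Sentence 4: 让我们继续。
Sentence 5: 结尾标点.
Sentence 6: 最后一句。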
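The basic pattern above splits immediately after a terminator, so a closing quotation mark that follows the terminator ends up at the start of the next sentence. Here is a minimal sketch of one way to keep it attached, using `re.findall` instead of `re.split`. The function name `split_keep_closing_quotes` and the set of closing marks are illustrative assumptions, not part of the original examples.

```python
import re

# Sentence terminators and closing marks that may trail them (assumed sets).
TERMINATORS = "。！？；.!?;"
CLOSERS = "”’」』）)\"'"

# A sentence is: a run of non-terminators, then terminators, then any closers.
# The second alternative catches trailing text that has no terminator.
_SENTENCE_RE = re.compile(
    "[^{t}]+[{t}]+[{c}]*|[^{t}]+$".format(
        t=re.escape(TERMINATORS), c=re.escape(CLOSERS)
    )
)


def split_keep_closing_quotes(paragraph):
    """Split into sentences, keeping closing quotes with their sentence."""
    return [m.strip() for m in _SENTENCE_RE.findall(paragraph) if m.strip()]


print(split_keep_closing_quotes("他说：“你好！”然后离开了。没有标点的结尾"))
# ['他说：“你好！”', '然后离开了。', '没有标点的结尾']
```

Matching whole sentences rather than split points also keeps a final fragment that lacks a terminator, which may or may not be what a given application wants.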

1.2 Finer-grained regex splitting

This version uses a more careful pattern. Sample code follows.

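    import re


    def split_sentences_advanced(text):
        """Finer-grained sentence splitting that handles some special cases."""
        # Split on one whitespace character that follows a terminator, unless
        # it is preceded by a dotted abbreviation such as "e.g." (first
        # lookbehind) or a capitalized two-letter title such as "Dr." (second
        # lookbehind). Decimals like 12.50 stay intact because no whitespace
        # follows the inner dot.
        pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|。|！|？|;|；)\s'
        sentences = re.split(pattern, text)
        # Strip whitespace left over from the split
        sentences = [s.strip() for s in sentences if s.strip()]
        return sentences


    # Example
    paragraph = "Dr. Smith went to the store. He bought apples, oranges, etc. The total was $12.50. Was that expensive?"
    sentences = split_sentences_advanced(paragraph)
    for i, sentence in enumerate(sentences, 1):
        print(f"Sentence {i}: {sentence}")

Output:

Sentence 1: Dr. Smith went to the store.
Sentence 2: He bought apples, oranges, etc.
Sentence 3: The total was $12.50.
Sentence 4: Was that expensive?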
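To see what each lookbehind in the pattern above contributes, the sketch below runs the same pattern on a few short inputs. The sample strings are made up for illustration. The third one shows a limitation: a title longer than two letters is not protected.

```python
import re

# Same pattern as in the example above.
PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!|。|！|？|;|；)\s'

samples = [
    "Use a tool, e.g. grep. It is fast.",  # first lookbehind keeps "e.g. grep" together
    "Mr. Lee left. Ms. Wu stayed.",        # second lookbehind keeps "Mr. Lee" together
    "Prof. Chen spoke. We listened.",      # "Prof." has four letters, so it is split off
]

results = [re.split(PATTERN, text) for text in samples]
for pieces in results:
    print(pieces)
# ['Use a tool, e.g. grep.', 'It is fast.']
# ['Mr. Lee left.', 'Ms. Wu stayed.']
# ['Prof.', 'Chen spoke.', 'We listened.']
```

Because Python lookbehinds must have a fixed width, each abbreviation shape needs its own lookbehind. Longer lists of abbreviations are easier to handle with a protect-and-restore step.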

2 NLTK approach

NLTK is well suited to splitting English text. Its punkt resource must be downloaded beforehand. Sample code follows.

    import nltk
    from nltk.tokenize import sent_tokenize

    # The punkt resource must be downloaded before first use.
    # Recent NLTK releases ask for 'punkt_tab' instead of 'punkt'.
    # nltk.download('punkt')


    def split_sentences_nltk(text):
        """Split sentences with NLTK (mainly for English)."""
        return sent_tokenize(text)


    # Example
    english_paragraph = "Hello world! This is a test. How are you? I'm fine, thank you."
    sentences = split_sentences_nltk(english_paragraph)
    for i, sentence in enumerate(sentences, 1):
        print(f"Sentence {i}: {sentence}")

Sample output:

Sentence 1: Hello world!
Sentence 2: This is a test.
Sentence 3: How are you?
Sentence 4: I'm fine, thank you.

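3 Combined approaches

The following approaches handle mixed Chinese and English text as well as several special cases.

3.1 Multi-stage splitting

If a paragraph mixes Chinese and English, it can be split in several stages, as in the following example.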

    import re


    def split_mixed_language_paragraph(paragraph):
        """Split a paragraph that mixes Chinese and English."""
        # Always split after a Chinese terminator. After an ASCII terminator,
        # split only when the next character is not an ASCII letter or digit,
        # so tokens such as 3.14 or example.com stay in one piece.
        # Splitting on a possibly empty match requires Python 3.7+.
        pattern = r'(?<=[。！？；])\s*|(?<=[.!?;])\s*(?![a-zA-Z0-9])'
        sentences = re.split(pattern, paragraph)

        # Second pass: refine long pieces that still contain English terminators
        refined_sentences = []
        for sentence in sentences:
            if not sentence.strip():
                continue
            if re.search(r'[.!?]', sentence) and len(sentence) > 50:
                sub_sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', sentence)
                refined_sentences.extend(s.strip() for s in sub_sentences if s.strip())
            else:
                refined_sentences.append(sentence.strip())
        return refined_sentences


    # Example
    mixed_paragraph = "这是一个测试。Hello world! 这是中文句子。How are you? 我很好！"
    sentences = split_mixed_language_paragraph(mixed_paragraph)
    for i, sentence in enumerate(sentences, 1):
        print(f"Sentence {i}: {sentence}")

Sample output:

Sentence 1: 这是一个测试。
Sentence 2: Hello world!
Sentence 3: 这是中文句子。
Sentence 4: How are you?
Sentence 5: 我很好！

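3.2 Adding special markers

To go further, known abbreviations can be protected with a special marker before splitting and restored afterwards. Sample code follows.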

    import re


    class SentenceSplitter:
        def __init__(self):
            # Common abbreviations that must not trigger a split
            self.abbreviations = {
                'mr.', 'mrs.', 'ms.', 'dr.', 'prof.', 'rev.', 'hon.', 'st.',
                'ave.', 'blvd.', 'rd.', 'ln.', 'etc.', 'e.g.', 'i.e.', 'vs.',
                'jan.', 'feb.', 'mar.', 'apr.', 'jun.', 'jul.', 'aug.', 'sep.',
                'oct.', 'nov.', 'dec.'
            }

        def split(self, text):
            """Main entry point."""
            if not text.strip():
                return []
            # Pre-process: mark the dots of known abbreviations
            text = self._protect_abbreviations(text)
            # Split into sentences
            pattern = r'(?<=[。！？.!?])\s+'
            sentences = re.split(pattern, text)
            # Restore the protected abbreviations
            sentences = [self._restore_abbreviations(s.strip()) for s in sentences if s.strip()]
            return sentences

        def _protect_abbreviations(self, text):
            """Replace the dots of known abbreviations with a placeholder."""
            def replace_abbr(match):
                abbr = match.group(0).lower()
                if abbr in self.abbreviations:
                    return match.group(0).replace('.', '[DOT]')
                return match.group(0)

            # Candidates: dotted letter sequences such as "e.g.", or a word plus a dot
            pattern = r'\b(?:[a-z]\.){2,}|\b[a-z]+\.'
            return re.sub(pattern, replace_abbr, text, flags=re.IGNORECASE)

        def _restore_abbreviations(self, text):
            """Turn the placeholder back into a dot."""
            return text.replace('[DOT]', '.')


    # Usage example
    splitter = SentenceSplitter()
    paragraph = "Dr. Smith met Mr. Jones at 5 p.m. They discussed the project. It was great!"
    sentences = splitter.split(paragraph)
    for i, sentence in enumerate(sentences, 1):
        print(f"{i}. {sentence}")

Output:

1. Dr. Smith met Mr. Jones at 5 p.m.
2. They discussed the project.
3. It was great!
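The same protect-then-restore idea can be applied to other dots that do not end a sentence. Below is a minimal sketch for decimal numbers and dotted tokens, combined with a pattern that also splits when no whitespace follows the terminator (Python 3.7+). It assumes the placeholder `[DOT]` never occurs in the input. The helper name `split_protecting_inner_dots` is hypothetical.

```python
import re

PLACEHOLDER = "[DOT]"  # assumed not to appear in the input text


def split_protecting_inner_dots(text):
    """Split after sentence terminators, but not at dots inside tokens
    such as 3.14 or example.com."""
    # A dot with an ASCII letter or digit on both sides is treated as
    # token-internal and temporarily replaced.
    protected = re.sub(r'(?<=[A-Za-z0-9])\.(?=[A-Za-z0-9])', PLACEHOLDER, text)
    parts = re.split(r'(?<=[。！？；.!?;])\s*', protected)
    return [p.strip().replace(PLACEHOLDER, ".") for p in parts if p.strip()]


print(split_protecting_inner_dots("圆周率约为3.14。详见example.com!结束."))
# ['圆周率约为3.14。', '详见example.com!', '结束.']
```

The trade-off is that an English sentence boundary written without a space, such as "end.Next", is also treated as token-internal and will not be split.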

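3.3 spaCy example

Alternatively, spaCy can be used for sentence splitting. It works well for plain English text, and a Chinese model is also available.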

    # Install first: pip install spacy
    # Download the English model: python -m spacy download en_core_web_sm
    import spacy


    def split_sentences_spacy(text, language='en'):
        """Split sentences with spaCy."""
        if language == 'en':
            nlp = spacy.load('en_core_web_sm')
        else:
            # Chinese needs the Chinese model:
            # python -m spacy download zh_core_web_sm
            nlp = spacy.load('zh_core_web_sm')
        doc = nlp(text)
        return [sent.text.strip() for sent in doc.sents]


    # Example
    text = "This is the first sentence. This is the second one! And here's the third?"
    sentences = split_sentences_spacy(text)
    for i, sent in enumerate(sentences, 1):
        print(f"Sentence {i}: {sent}")

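3.4 Combined example

For mixed Chinese and English text, the following combined function can also be used. It lets the caller choose the splitting method and the language.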

    import re


    def split_paragraph(paragraph, method='auto', language='mixed'):
        """Combined sentence-splitting function.

        Args:
            paragraph: input paragraph text
            method: splitting method: 'auto', 'regex', 'nltk', or 'spacy'
            language: 'zh', 'en', or 'mixed'

        Returns:
            list of sentences
        """
        if not paragraph or not paragraph.strip():
            return []

        if method == 'auto':
            # Pick a method based on the language
            if language == 'en':
                try:
                    from nltk.tokenize import sent_tokenize
                    return sent_tokenize(paragraph)
                except (ImportError, LookupError):
                    # NLTK or its punkt data is missing; fall back to regex
                    method = 'regex'
            else:
                method = 'regex'

        if method == 'regex':
            if language == 'zh':
                pattern = r'(?<=[。！？；])\s*'
            elif language == 'en':
                pattern = r'(?<=[.!?])\s+(?=[A-Z])'
            else:  # mixed
                pattern = r'(?<=[。！？；.!?;])\s*'
            sentences = re.split(pattern, paragraph.strip())
            return [s.strip() for s in sentences if s.strip()]
        elif method == 'nltk':
            from nltk.tokenize import sent_tokenize
            return sent_tokenize(paragraph)
        elif method == 'spacy':
            import spacy
            nlp = spacy.load('en_core_web_sm' if language == 'en' else 'zh_core_web_sm')
            doc = nlp(paragraph)
            return [sent.text.strip() for sent in doc.sents]
        return []


    # Usage example
    paragraph = "这是一个测试。Hello world! 第二句话？结束。"
    sentences = split_paragraph(paragraph, method='regex', language='mixed')
    print("Result:", sentences)
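The `language` argument above has to be supplied by the caller. As a rough sketch, it could be guessed from the share of CJK characters in the text. The helper `guess_language` and its thresholds are assumptions for illustration, not part of the original function, and the check only covers the basic CJK ideograph range.

```python
def guess_language(text, low=0.1, high=0.9):
    """Return 'zh', 'en', or 'mixed' from the ratio of CJK ideographs to
    all CJK-or-ASCII-letter characters. The thresholds are arbitrary."""
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cjk + latin
    if total == 0:
        return 'mixed'  # nothing to go on; fall back to the most permissive mode
    ratio = cjk / total
    if ratio >= high:
        return 'zh'
    if ratio <= low:
        return 'en'
    return 'mixed'


for sample in ["这是一个测试。", "Hello world!", "这是一个测试。Hello world!"]:
    print(sample, "->", guess_language(sample))
```

The returned value can be passed as the `language` argument of `split_paragraph`. Note that `str.isascii` requires Python 3.7 or later.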
