Blackstone: an NLP tool designed for legal text

Blackstone

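Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from ICLR&D, the research lab of the Incorporated Council of Law Reporting for England and Wales. Blackstone was written by Daniel Hoadley.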

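Contents

Why are we building Blackstone?
What's special about Blackstone?
Observations and other things worth noting
Installation
- Install the library
- Install the Blackstone model
About the model
- The pipeline
- Named-entity recogniser
- Text categoriser
Usage
- Applying the NER model
  - Visualising entities
- Applying the text categoriser model
Custom pipeline extensions
- Abbreviation and long-form definition resolution
- Compound case reference detection
- Legislation linker
- Sentence segmenter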

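Why are we building Blackstone?

The past several years have seen a surge in activity at the intersection of law and technology. In the United Kingdom, however, the overwhelming bulk of that activity has taken place in law firms and other commercial contexts. The consequence is that, despite the constant flurry of development in legal informatics, almost none of the research is available on an open-source basis.

Moreover, the bulk of research in the UK legal-informatics domain (whether open or closed) has focused on NLP applications for automating contracts and other legal documents that are transactional in nature. This is understandable: the principal beneficiaries of legal NLP research in the UK are law firms, and law firms tend not to find it difficult to get hold of transactional documents that can be used as training data.

The problem, as we see it, is that legal NLP research in the UK has become over-concentrated on commercial applications. It is worth investing in NLP research directed at other kinds of legal text, such as judgments, scholarly articles, skeleton arguments and pleadings.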

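What's special about Blackstone?

So far as we are aware, Blackstone is the first open-source model trained specifically for use on long-form texts containing common law entities and concepts.
Blackstone is built on spaCy, which makes it easy to pick up and apply to your own data.
Blackstone's training data spans a considerable period of time (the earliest texts were drafted in the 1860s). This is useful because an interesting quirk of the common law is that older writings (judgments in particular) remain relevant for many years.
It is free and open-source.
It is imperfect and makes no attempt to hide that fact from you.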

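Observations and other things worth noting

Perfect is the enemy of good. This is a prototype release of a highly experimental project. As such, the accuracy of Blackstone's models leaves something to be desired (F1 on the NER is approximately 70%). The accuracy of these models will improve over time.
The models were trained on English case law, and the library was built with the peculiarities of the legal system of England and Wales in mind. That said, the model has generalised well and should do a reasonably good job on Australian, Canadian and American content too.
The data used to train Blackstone's models was derived from the Incorporated Council of Law Reporting for England and Wales' archive of case reports and unreported judgments. That archive is proprietary, which prevents us from releasing any of the data used to train Blackstone.
Blackstone is not a judge or litigation analytics tool.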
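For readers unfamiliar with the F1 figure quoted above, the following sketch shows how precision, recall and F1 are commonly computed for NER when a prediction only counts as correct if its span and label both match the gold annotation exactly. The entity tuples are made-up illustrations, not Blackstone's evaluation data, and this is not necessarily how the reported score was produced.

```python
def ner_f1(gold, predicted):
    """Return (precision, recall, f1) using exact span-and-label matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (start_token, end_token, label) annotations.
gold = {
    (0, 3, "CASENAME"),
    (3, 6, "CITATION"),
    (10, 13, "INSTRUMENT"),
    (15, 17, "PROVISION"),
}
predicted = {
    (0, 3, "CASENAME"),
    (3, 6, "CITATION"),
    (10, 12, "INSTRUMENT"),  # wrong span boundary, so it does not count
    (20, 22, "COURT"),       # spurious entity
}

precision, recall, f1 = ner_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, a prediction with the right label but a slightly wrong boundary is penalised twice: once as a false positive and once as a missed gold entity.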

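Installation

Note: it is strongly recommended that you install Blackstone into a virtual environment. Blackstone should be compatible with Python 3.6 and higher.

To install Blackstone, follow these steps:

1. Install the library

The first step is to install the library, which at present contains a handful of custom spaCy components. Install the library like so:

    pip install blackstone

2. Install the Blackstone model

The second step is to install the spaCy model. Install the model like so:

    pip install https://blackstone-model.s3-eu-west-1.amazonaws.com/en_blackstone_proto-0.0.1.tar.gz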
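After both steps, a quick way to confirm that the installation worked is to check that the two packages can be found by the interpreter in the active environment. The sketch below uses only the standard library; it assumes the library is importable as `blackstone` and the model as `en_blackstone_proto`, which is how they are referred to elsewhere in this document.

```python
import importlib.util


def check_install(packages=("blackstone", "en_blackstone_proto")):
    """Map each top-level package name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


for name, found in check_install().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either package is reported as missing, the most likely cause is that pip installed it into a different environment from the one running the check.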

Installing from source

If you are developing Blackstone, you can install from source like so:

    pip install --editable .
    pip install -r dev-requirements.txt

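About the model

This is the very first release of Blackstone and the model is best viewed as a prototype. It is rough around the edges and represents the first step in a larger, ongoing programme of open-source research into NLP on legal texts being carried out by ICLR&D.

With that out of the way, here is a brief rundown of what is included in the prototype model.

The pipeline

The prototype model included in this release has the following elements in its pipeline: tokenizer, tagger, parser, ner and textcat.

Owing to a scarcity of labelled part-of-speech and dependency training data for legal text, the tokenizer, tagger and parser pipeline components have been taken from spaCy's en_core_web_sm model. By and large these components do a decent job, but it would be good to revisit them with custom training data at some point in the future.
The ner and textcat components are custom components trained especially for Blackstone.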

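Named-entity recogniser

The NER component of the Blackstone model has been trained to detect the following entity types:

Entity type	Name	Examples
CASENAME	Case names	e.g. Smith v Jones, In re Jones, In Jones’ case
CITATION	Citations (unique identifiers for reported and unreported cases)	e.g. (2002) 2 Cr App R 123
INSTRUMENT	Written legal instruments	e.g. Theft Act 1968, European Convention on Human Rights, CPR
PROVISION	Units within written legal instruments	e.g. section 1, art 2(3)
COURT	Courts or tribunals	e.g. Court of Appeal, Upper Tribunal
JUDGE	References to judges	e.g. Eady J, Lord Bingham of Cornhill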

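Text categoriser

This release of Blackstone also comes with a text categoriser. In contrast with the NER component (which has been trained to identify tokens and sequences of tokens of interest), the text categoriser classifies longer spans of text, such as sentences.

The text categoriser has been trained to classify text into one of five mutually exclusive categories, which are as follows:

Category	Description
AXIOM	The text appears to postulate a well-established principle
CONCLUSION	The text appears to make a finding, holding, determination or conclusion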

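Usage

Applying the NER model

Here is an example of the model applied to some text taken from para 31 of the Divisional Court's judgment in R (Miller) v Secretary of State for Exiting the European Union [2017] UKSC 5; [2018] AC 61: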

    import spacy

    # Load the model
    nlp = spacy.load("en_blackstone_proto")

    text = """ 31 As we shall explain in more detail in examining the submission of the Secretary of State (see paras 77 and following), it is the Secretary of State’s case that nothing has been done by Parliament in the European Communities Act 1972 or any other statute to remove the prerogative power of the Crown, in the conduct of the international relations of the UK, to take steps to remove the UK from the EU by giving notice under article 50EU for the UK to withdraw from the EU Treaty and other relevant EU Treaties. The Secretary of State relies in particular on Attorney General v De Keyser’s Royal Hotel Ltd [1920] AC 508 and R v Secretary of State for Foreign and Commonwealth Affairs, Ex p Rees-Mogg [1994] QB 552; he contends that the Crown’s prerogative power to cause the UK to withdraw from the EU by giving notice under article 50EU could only have been removed by primary legislation using express words to that effect, alternatively by legislation which has that effect by necessary implication. The Secretary of State contends that neither the ECA 1972 nor any of the other Acts of Parliament referred to have abrogated this aspect of the Crown’s prerogative, either by express words or by necessary implication. """

    # Apply the model to the text
    doc = nlp(text)

    # Iterate through the entities identified by the model
    for ent in doc.ents:
        print(ent.text, ent.label_)

    >>> European Communities Act 1972 INSTRUMENT
    >>> article 50EU PROVISION
    >>> EU Treaty INSTRUMENT
    >>> Attorney General v De Keyser’s Royal Hotel Ltd CASENAME
    >>> [1920] AC 508 CITATION
    >>> R v Secretary of State for Foreign and Commonwealth Affairs, Ex p Rees-Mogg CASENAME
    >>> [1994] QB 552 CITATION
    >>> article 50EU PROVISION
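Downstream code often wants the entities grouped by type rather than as a flat stream. The sketch below does this with the standard library only, working on plain (text, label) pairs copied from part of the output above; with a real doc the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `group_by_label` is ours and is not part of Blackstone.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group entity texts by label, dropping repeats and keeping first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# (text, label) pairs taken from the example output above.
pairs = [
    ("European Communities Act 1972", "INSTRUMENT"),
    ("article 50EU", "PROVISION"),
    ("EU Treaty", "INSTRUMENT"),
    ("[1920] AC 508", "CITATION"),
    ("[1994] QB 552", "CITATION"),
    ("article 50EU", "PROVISION"),
]

grouped = group_by_label(pairs)
for label, texts in grouped.items():
    print(label, texts)
```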

Visualising entities

spaCy ships with an excellent set of visualisers, including a visualiser for NER predictions. Blackstone comes with a custom colour palette that makes it easier to distinguish entities on the source text when using displacy.

    """
    Visualise entities using spaCy's displacy visualiser.

    Blackstone has a custom colour palette:
    `from blackstone.displacy_palette import ner_displacy_options`
    """

    import spacy
    from spacy import displacy
    from blackstone.displacy_palette import ner_displacy_options

    nlp = spacy.load("en_blackstone_proto")

    text = """ The applicant must satisfy a high standard. This is a case where the action is to be tried by a judge with a jury. The standard is set out in Jameel v Wall Street Journal Europe Sprl [2004] EMLR 89, para 14: “But every time a meaning is shut out (including any holding that the words complained of either are, or are not, capable of bearing a defamatory meaning) it must be remembered that the judge is taking it upon himself to rule in effect that any jury would be perverse to take a different view on the question. It is a high threshold of exclusion. Ever since Fox’s Act 1792 (32 Geo 3, c 60) the meaning of words in civil as well as criminal libel proceedings has been constitutionally a matter for the jury. The judge’s function is no more and no less than to pre-empt perversity. That being clearly the position with regard to whether or not words are capable of being understood as defamatory or, as the case may be, non-defamatory, I see no basis on which it could sensibly be otherwise with regard to differing levels of defamatory meaning. Often the question whether words are defamatory at all and, if so, what level of defamatory meaning they bear will overlap.” 18 In Berezovsky v Forbes Inc [2001] EMLR 1030, para 16 Sedley LJ had stated the test this way: “The real question in the present case is how the courts ought to go about ascertaining the range of legitimate meanings. Eady J regarded it as a matter of impression. That is all right, it seems to us, provided that the impression is not of what the words mean but of what a jury could sensibly think they meant. Such an exercise is an exercise in generosity, not in parsimony.” """

    doc = nlp(text)

    # Call displacy and pass `ner_displacy_options` into the options parameter
    displacy.serve(doc, style="ent", options=ner_displacy_options)

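Running this starts displacy's local web server and renders the passage with each recognised entity highlighted in the colour assigned to its type.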
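For orientation, displacy's entity visualiser accepts an options dictionary with an "ents" key (the labels to display) and a "colors" key (a mapping from label to colour). The sketch below builds such a dictionary for Blackstone's entity types. The hex colours are arbitrary placeholders chosen for illustration and are not Blackstone's actual palette; the helper name `make_ent_options` is likewise hypothetical.

```python
def make_ent_options(colours):
    """Build a displacy-style options dict from a label -> colour mapping."""
    return {"ents": list(colours), "colors": dict(colours)}


# Placeholder colours, one per entity type listed earlier in this document.
EXAMPLE_COLOURS = {
    "CASENAME": "#ffd54f",
    "CITATION": "#81d4fa",
    "INSTRUMENT": "#a5d6a7",
    "PROVISION": "#ce93d8",
    "COURT": "#ffab91",
    "JUDGE": "#b0bec5",
}

options = make_ent_options(EXAMPLE_COLOURS)
print(options["ents"])
```

A dictionary of this shape can be passed as the `options` argument in place of `ner_displacy_options` if a different colour scheme is wanted.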

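Applying the text categoriser model

Blackstone's text categoriser generates a predicted categorisation for a doc. The textcat pipeline component is designed to be applied to individual sentences rather than to a single document consisting of many sentences.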

    import spacy

    # Load the model
    nlp = spacy.load("en_blackstone_proto")

    def get_top_cat(doc):
        """
        Function to identify the highest scoring category
        prediction generated by the text categoriser.
        """
        cats = doc.cats
        max_score = max(cats.values())
        max_cats = [k for k, v in cats.items() if v == max_score]
        max_cat = max_cats[0]
        return (max_cat, max_score)

    text = """
    It is a well-established principle of law that the transactions of independent states between each other are governed by other laws than those which municipal courts administer. \
    It is, however, in my judgment, insufficient to react to the danger of over-formalisation and “judicialisation” simply by emphasising flexibility and context-sensitivity. \
    The question is whether on the facts found by the judge, the (or a) proximate cause of the loss of the rig was “inherent vice or nature of the subject matter insured” within the meaning of clause 4.4 of the Institute Cargo Clauses (A).
    """

    # Apply the model to the text
    doc = nlp(text)

    # Get the sentences in the passage of text
    sentences = [sent.text for sent in doc.sents]

    # Print each sentence and the corresponding predicted category
    for sentence in sentences:
        doc = nlp(sentence)
        top_category = get_top_cat(doc)
        print(f"\"{sentence}\" {top_category}\n")

Illustrative output (these two sentences are not taken from the passage above):

    >>> "In my judgment, it is patently obvious that cats are a type of dog." ('CONCLUSION', 0.9990500807762146)
    >>> "It is a well settled principle that theft is wrong." ('AXIOM', 0.556410014629364)
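The second illustrative score above (roughly 0.56) shows that the top category is sometimes only weakly preferred. One possible refinement, sketched here on a plain dictionary shaped like `doc.cats` (label mapped to score), is to discard predictions that fall below a confidence threshold. The scores and the 0.6 threshold are hypothetical, and `top_category` is our own helper rather than part of Blackstone.

```python
def top_category(cats, threshold=0.6):
    """Return (label, score) for the best category, or None if it is too weak."""
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return (label, score) if score >= threshold else None


# Hypothetical score dictionaries for two sentences.
confident = {"AXIOM": 0.02, "CONCLUSION": 0.95}
uncertain = {"AXIOM": 0.56, "CONCLUSION": 0.30}

print(top_category(confident))
print(top_category(uncertain))
```

Where exactly to set the threshold is a judgment call that depends on whether missed classifications or wrong ones are more costly for the task at hand.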

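Custom pipeline extensions

In addition to the core model, this prototype release of Blackstone comes with the following custom components:

Abbreviation detection - this is heavily based on the AbbreviationDetector() component in scispaCy and resolves an abbreviated form to its long-form definition, e.g. ECtHR -> European Court of Human Rights.
Compound case reference detection - this is an alpha component that attempts to identify CASENAME and CITATION pairs, enabling a CITATION to be merged with its parent CASENAME.
Legislation linker - this attempts to pair PROVISION references with their parent INSTRUMENT.
Sentence segmenter - this is a rule-based segmenter that handles features of legal text that tend to confuse out-of-the-box segmentation rules.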

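Abbreviation detection and long-form definition resolution

It is not uncommon for authors of legal documents to abbreviate a long-winded term and then use the abbreviated form throughout the rest of the document. For example:

The European Court of Human Rights (“ECtHR”) is the court ultimately responsible for applying the European Convention on Human Rights (“ECHR”).

The abbreviation detection component in Blackstone seeks to address this by implementing a slightly modified version of scispaCy's AbbreviationDetector() (which is itself an implementation of the approach set out in this paper: https://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf). Our implementation still has some problems, but an example of its usage is as follows: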

    import spacy
    from blackstone.pipeline.abbreviations import AbbreviationDetector

    nlp = spacy.load("en_blackstone_proto")

    # Add the abbreviation pipe to the spaCy pipeline
    abbreviation_pipe = AbbreviationDetector(nlp)
    nlp.add_pipe(abbreviation_pipe)

    doc = nlp('The European Court of Human Rights ("ECtHR") is the court ultimately responsible for applying the European Convention on Human Rights ("ECHR").')

    print("Abbreviation", "\t", "Definition")
    for abrv in doc._.abbreviations:
        print(f"{abrv} \t ({abrv.start}, {abrv.end}) {abrv._.long_form}")

    >>> "ECtHR" (7, 10) European Court of Human Rights
    >>> "ECHR" (25, 28) European Convention on Human Rights
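To give a feel for the matching idea in the paper linked above, here is a simplified, regex-based sketch that works on raw strings. It looks for a short form in parentheses, takes a bounded window of preceding words, and matches the short form's characters right to left against that window, requiring the first character to begin a word. It is not Blackstone's or scispaCy's implementation (both operate on spaCy tokens), and the names `SHORT_FORM`, `find_long_form` and `find_abbreviations` are our own.

```python
import re

# A parenthesised short form, optionally wrapped in straight or curly quotes.
SHORT_FORM = re.compile(r'\(\s*["“”]?([A-Za-z][A-Za-z0-9]{1,9})["“”]?\s*\)')


def find_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text."""
    s_index = len(short_form) - 1
    l_index = len(candidate) - 1
    while s_index >= 0:
        char = short_form[s_index].lower()
        # Move left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != char) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return a dict mapping each detected short form to its long form."""
    found = {}
    for match in SHORT_FORM.finditer(text):
        short_form = match.group(1)
        words = text[: match.start()].split()
        # Limit the search window, as the paper suggests.
        window = min(len(short_form) + 5, len(short_form) * 2)
        long_form = find_long_form(short_form, " ".join(words[-window:]))
        if long_form:
            found[short_form] = long_form
    return found


sentence = (
    'The European Court of Human Rights ("ECtHR") is the court ultimately '
    'responsible for applying the European Convention on Human Rights ("ECHR").'
)
for short_form, long_form in find_abbreviations(sentence).items():
    print(short_form, "->", long_form)
```

Working right to left is what lets the lowercase "t" in ECtHR match the final letter of "Court" rather than forcing every character to be a word initial.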

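Compound case reference detection

The compound case reference detection component in Blackstone is designed to marry up CITATION entities with their parent CASENAME entities.

Common law jurisdictions typically refer to cases by a name (usually derived from the names of the parties) together with some form of unique citation, like so:

Regina v Horncastle [2010] 2 AC 373

Blackstone's NER model attempts to identify CASENAME and CITATION entities separately. In the context of information extraction, however, it can be useful to pull these entities out as pairs.

CompoundCases() applies a custom pipe after the NER and identifies CASENAME/CITATION pairs in two scenarios:

The standard scenario: Gelmini v Moriggia [1913] 2 KB 549
The possessive scenario (which is a little antiquated): Jones' case [1915] 1 KB 45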

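    import spacy
    from blackstone.pipeline.compound_cases import CompoundCases

    nlp = spacy.load("en_blackstone_proto")

    compound_pipe = CompoundCases(nlp)
    nlp.add_pipe(compound_pipe)

    # Assumed sample input containing the two references from the scenarios above
    text = "The defendants relied on Gelmini v Moriggia [1913] 2 KB 549. The point was also considered in Jones' case [1915] 1 KB 45."
    doc = nlp(text)

    for compound_ref in doc._.compound_cases:
        print(compound_ref)

    >>> Gelmini v Moriggia [1913] 2 KB 549
    >>> Jones' case [1915] 1 KB 45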
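The pairing idea can be illustrated without spaCy. The sketch below assumes entities are available as (text, label, start, end) records with token offsets, and pairs a CASENAME with a CITATION that follows it directly or after a small gap. This is a simplified stand-in for whatever CompoundCases() does internally; the `Entity` type, the `max_gap` parameter and the offsets are all hypothetical.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int  # index of the first token
    end: int    # index one past the last token


def pair_cases(entities: List[Entity], max_gap: int = 1) -> List[str]:
    """Join each CASENAME with a CITATION that follows within max_gap tokens."""
    ordered = sorted(entities, key=lambda ent: ent.start)
    pairs = []
    for current, following in zip(ordered, ordered[1:]):
        is_pair = current.label == "CASENAME" and following.label == "CITATION"
        if is_pair and 0 <= following.start - current.end <= max_gap:
            pairs.append(f"{current.text} {following.text}")
    return pairs


# Hypothetical entities with made-up token offsets.
entities = [
    Entity("Gelmini v Moriggia", "CASENAME", 10, 13),
    Entity("[1913] 2 KB 549", "CITATION", 13, 19),
    Entity("Court of Appeal", "COURT", 25, 28),
    Entity("Jones' case", "CASENAME", 30, 33),
    Entity("[1915] 1 KB 45", "CITATION", 33, 39),
    Entity("[2010] 2 AC 373", "CITATION", 50, 56),  # no case name nearby
]

for compound in pair_cases(entities):
    print(compound)
```

The gap limit is what prevents a stray citation later in the paragraph from being attached to an unrelated case name.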

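Legislation linker

Blackstone's legislation linker attempts to couple a reference to a PROVISION with its parent INSTRUMENT by using the NER model to identify the presence of an INSTRUMENT and then navigating the dependency tree to identify the child provision.

Once Blackstone has identified a PROVISION:INSTRUMENT pair, it will attempt to generate target URLs for both the provision and the instrument on legislation.gov.uk.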

    import spacy
    from blackstone.utils.legislation_linker import extract_legislation_relations

    nlp = spacy.load("en_blackstone_proto")

    text = "The Secretary of State was at pains to emphasise that, if a withdrawal agreement is made, it is very likely to be a treaty requiring ratification and as such would have to be submitted for review by Parliament, acting separately, under the negative resolution procedure set out in section 20 of the Constitutional Reform and Governance Act 2010. Theft is defined in section 1 of the Theft Act 1968"

    doc = nlp(text)
    relations = extract_legislation_relations(doc)

    for provision, provision_url, instrument, instrument_url in relations:
        print(f"\n{provision}\t{provision_url}\t{instrument}\t{instrument_url}")

    >>> section 20	http://www.legislation.gov.uk/ukpga/2010/25/section/20	Constitutional Reform and Governance Act 2010	http://www.legislation.gov.uk/ukpga/2010/25/contents
    >>> section 1	http://www.legislation.gov.uk/ukpga/1968/60/section/1	Theft Act 1968	http://www.legislation.gov.uk/ukpga/1968/60/contents
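The URLs in the output above follow a regular shape: a base address, a path identifying the Act, and then either "contents" or "section/N". The sketch below reproduces that shape for the two Acts in the example. The lookup table is hard-coded from the output shown above; how Blackstone actually resolves an Act's name to its path is not described here, so `ACT_PATHS` and `build_urls` should be read as hypothetical.

```python
import re

BASE_URL = "http://www.legislation.gov.uk"

# Paths copied from the example output above (assumed lookup, two Acts only).
ACT_PATHS = {
    "Theft Act 1968": "ukpga/1968/60",
    "Constitutional Reform and Governance Act 2010": "ukpga/2010/25",
}


def build_urls(provision, instrument):
    """Return (provision_url, instrument_url), or None for an unknown Act."""
    path = ACT_PATHS.get(instrument)
    if path is None:
        return None
    instrument_url = f"{BASE_URL}/{path}/contents"
    match = re.search(r"\bsection (\d+[A-Z]?)", provision)
    provision_url = f"{BASE_URL}/{path}/section/{match.group(1)}" if match else None
    return provision_url, instrument_url


print(build_urls("section 1", "Theft Act 1968"))
print(build_urls("section 20", "Constitutional Reform and Governance Act 2010"))
```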

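Sentence segmenter

Blackstone ships with a custom rule-based sentence segmenter that addresses a range of characteristics of legal texts that tend to baffle out-of-the-box sentence segmentation rules.

This behaviour can be extended by optionally passing a list of spaCy-style Matcher patterns that will explicitly prevent sentence boundary detection inside matches.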

    import spacy
    from blackstone.pipeline.sentence_segmenter import SentenceSegmenter
    from blackstone.rules import CITATION_PATTERNS

    nlp = spacy.load("en_blackstone_proto")

    # Add the Blackstone sentence segmenter to the pipeline before the parser
    sentence_segmenter = SentenceSegmenter(nlp.vocab, CITATION_PATTERNS)
    nlp.add_pipe(sentence_segmenter, before="parser")

    doc = nlp(""" The courts in this jurisdiction will enforce those commitments when it is legally possible and necessary to do so (see, most recently, R. (on the application of ClientEarth) v Secretary of State for the Environment, Food and Rural Affairs (No.2) [2017] P.T.S.R. 203 and R. (on the application of ClientEarth) v Secretary of State for Environment, Food and Rural Affairs (No.3) [2018] Env. L.R. 21). The central question in this case arises against that background. """)

    for sent in doc.sents:
        print(sent.text)
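The principle of blocking sentence boundaries inside matched spans can be shown with plain regular expressions. The sketch below splits naively after sentence-final punctuation, but skips any boundary that falls inside a protected span. The two protected patterns are hand-written for this one passage and are not Blackstone's CITATION_PATTERNS (which are spaCy Matcher patterns over tokens); the function and constant names are our own.

```python
import re

# Hypothetical protected spans, written for the passage used above.
PROTECTED = [
    re.compile(r"\bR\. \(on the application of [^)]+\)"),   # R. (on the application of X)
    re.compile(r"\[\d{4}\] (?:[A-Z][A-Za-z]*\. ?)+\d+"),     # [2018] Env. L.R. 21
]

# Naive boundary: whitespace after . ? or ! and before a capital or bracket.
CANDIDATE = re.compile(r"(?<=[.?!])\s+(?=[A-Z(\[])")


def split_sentences(text, protected=PROTECTED):
    """Split on naive boundaries, except those inside a protected span."""
    spans = [m.span() for pattern in protected for m in pattern.finditer(text)]
    sentences, start = [], 0
    for boundary in CANDIDATE.finditer(text):
        inside = any(s <= boundary.start() and boundary.end() <= e for s, e in spans)
        if inside:
            continue
        sentences.append(text[start:boundary.start()].strip())
        start = boundary.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


passage = (
    "The courts in this jurisdiction will enforce those commitments when it is "
    "legally possible and necessary to do so (see, most recently, R. (on the "
    "application of ClientEarth) v Secretary of State for the Environment, Food "
    "and Rural Affairs (No.2) [2017] P.T.S.R. 203 and R. (on the application of "
    "ClientEarth) v Secretary of State for Environment, Food and Rural Affairs "
    "(No.3) [2018] Env. L.R. 21). The central question in this case arises "
    "against that background."
)

for sentence in split_sentences(passage):
    print(sentence)
```

Calling `split_sentences(passage, protected=[])` shows the problem being solved: without protection the naive rule also breaks after each "R." and after "Env.".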

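Acknowledgements

We would like to thank the following people and organisations who have helped us, directly or indirectly, to build this prototype.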

Mark Neumann of AI2 and scispaCy
Explosion AI for building spaCy and Prodigy
Kristin Hodgins of the Office of the Attorney General of British Columbia