
A Dataset Showing a Century of Evolution in the Complexity of the United States Legal Code

Source: DataCite Commons. Updated: 2026-01-06. Indexed: 2026-04-25.
Download link:
https://figshare.com/articles/dataset/A_Century_of_Evolution_in_the_Complexity_of_the_United_States_Legal_Code/29540039/5
Description:
We use <b>OCR</b> and <b>Generative AI</b> techniques to recover and clean printed historical editions of the U.S. Code. This enables computational analysis of federal law even for periods before web-based digital access existed. The processing pipeline includes:
📄 <b>Contents of the U.S. Code</b>: word counts, unique word counts, entropy, scaling exponents, etc.
🌲 <b>Hierarchical Structure</b>: Subtitle → Part → Chapter → Section → Subsection...
🔗 <b>Cross-Reference Relationships</b>: title-to-title citation relationships

For a small sample of the data, see our GitHub repository: https://github.com/Dawoon-Jeong0523/uscode-complexity
🔍 A sample OCR text page (<code>ocr_processing_gemini</code>) for demonstration
🌐 Web-based U.S. Code text from 1994 for structural parsing (<code>Data Set 2</code>)
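
The content measures named in the description (word counts, unique word counts, entropy) can be illustrated with a minimal sketch. The tokenization rule, the function name `text_stats`, and the output keys are assumptions made for illustration and are not taken from the authors' pipeline; scaling exponents are not covered here.

```python
import math
import re
from collections import Counter


def text_stats(text: str) -> dict:
    """Compute simple complexity measures for a block of text.

    Hypothetical helper: tokens are lowercase runs of letters and digits,
    which is an assumed tokenization, not the dataset's documented one.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)

    # Shannon entropy (in bits) of the empirical word-frequency distribution.
    entropy = 0.0
    if total:
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )

    return {
        "word_count": total,
        "unique_word_count": len(counts),
        "entropy_bits": entropy,
    }


if __name__ == "__main__":
    sample = "The Secretary shall prescribe regulations. The Secretary may waive regulations."
    print(text_stats(sample))
```

Applied per title and per edition, a function of this kind would yield one row of statistics for each title-year pair, which is the shape of data the description suggests.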
Provider:
figshare
Created:
2025-10-22