
american_municipal_law

ModelScope community. Updated 2025-11-27. Indexed 2025-10-11.
Download link:
https://modelscope.cn/datasets/laion/american_municipal_law
Resource description:
# American Law

## Gathered by Kyle Rose @the-ride-never-ends

https://huggingface.co/datasets/the-ride-never-ends/american_municipal_law

## Description

Municipal and county laws from across the United States, in parquet format.

Files are named after a location's [GNIS id](https://en.wikipedia.org/wiki/Geographic_Names_Information_System) and what they contain. Unless specified otherwise, each row is one sub-section of a law, which is one of the smallest units of law that can be cited under the [Bluebook Citation method](https://owl.purdue.edu/owl/research_and_citation/chicago_manual_17th_edition/cmos_formatting_and_style_guide/bluebook_citation_for_legal_materials.html). For convenience's sake, these will be referred to as "law" for the rest of this document. All types are from Python. Unless specified otherwise, each embedding was created using OpenAI's "text-embedding-3-small" model.

## Parquet Contents

A series of municipal laws scraped from the web, with sanitization and embeddings included.

### HTML

Files that end in '_html' consist of rows with the following attributes:

- `cid` (str): A unique CID for the law. It is created from the string "{gnis}_{doc_title}.json". Ex: "bafkreifzhmfladwvggsjdrc32npt6exucvrl3xe2omkgyfkqpo35ufaope"
- `doc_id` (str): A unique ID based on the law's plaintext title. Ex: "CO_CH27HOSPVI"
- `doc_order` (int): The relative location of a law in a corpus, in ascending order.
- `html_title` (str): The raw HTML of the law's title. Ex: "<div class=\"chunk-title\">Chapter 46 - MASTER ROAD PLAN AND SPECIFICATIONS<a href=..."
- `html` (str): The raw HTML of the law itself. This can include footnotes, citations, tables, etc.
- `__index_level_0__` (int): Artifact of the parquet conversion process. Will likely be removed in future updates.

### Citation

Files that end in '_citation' consist of rows with the following attributes:

- `bluebook_cid` (str): A unique CID for the given citation. It is created from the string f"{place_name}{bluebook_state_code}{title}{history_note}". A law may have multiple citations linked to it, as parts of a law can be modified without changing other parts.
- `cid` (str): The CID for the citation's associated law. This functions as a foreign key.
- `title` (str): A plaintext version of the law's title. Ex: "Sec. 44. - Civic center and municipal building"
- `title_num` (str): The number in the law's title. As title numbers can also use letters (e.g. A, B, C, etc.), specially-formatted numbers (e.g. 17-35), or none at all, `title_num` should be treated as a string for search and processing purposes.
- `date` (str): The date when an ordinance was passed/changed. Ex: "10-22-85"
- `public_law_num` (str): An "NA" placeholder for the Public Law number. A Public Law number, like "P.L. 107-101," identifies a law passed by the US Congress, indicating the Congress number (107) and the law's sequential number within that Congress (101). As municipal and county laws are not federal laws, they do not have such a number. The field is kept because this dataset might include federal law in the future.
- `chapter` (str): The plaintext title of the chapter in the law corpus that contains the law. Chapters are broadly defined to be any macro-grouping in a corpus, and include appendices, tables, and other errata. Ex: "Chapter 16 - PARKS AND OTHER RECREATIONAL AND PUBLIC FACILITIES"
- `chapter_num` (str): The number of the chapter. As chapter numbers can also use letters (e.g. A, B, C, etc.), specially-formatted numbers (e.g. 17-35), or none at all, `chapter_num` should be treated as a string for search and processing purposes.
- `history_note` (str): The plaintext version of a single footnote for a law. Footnotes document the law's history. Ex: "Ord. No. 1985-19, § 1, 10-22-85"
- `ordinance` (str): Which ordinance the law was passed/changed under. Ex: "Ord. No. 1985-19". Will be updated when history note parsing is more robust.
- `section` (str): Which section the law was passed/changed under. Ex: "§ 1". Will be updated when history note parsing is more robust.
- `enacted` (str): The date when a specific law came into effect. Ex: "1985". Will be updated when history note parsing is more robust.
- `year` (str): The year when an ordinance was passed/changed. Ex: "1985". Will be updated when history note parsing is more robust.
- `place_name` (str): The place where the law is in effect. Currently only has municipalities and counties. Ex: "City of Baker"
- `state_name` (str): The state where a given place is located. Ex: "Louisiana"
- `state_code` (str): The two-letter abbreviation of a state's name. Ex: "LA"
- `bluebook_sate_code` (str): An abbreviation of `state_name` used specifically for Bluebook citations. Ex: "Mass."
- `bluebook_citation` (str): A Bluebook citation for a given law. Ex: "City of Baker, La., Municipal Code, §17-102 (1972)"
- `__index_level_0__` (int): Artifact of the parquet conversion process. Will likely be removed in future updates.

### Embeddings

Files that end in '_embeddings' consist of rows with the following attributes:

- `embedding_cid` (str): A unique CID for the embedding. It is created by turning each float into a string, concatenating the strings, then making a CID out of the result.
- `gnis` (str): A place's GNIS id. This functions as a foreign key and allows cosine similarity searches to be narrowed to a specific location.
- `cid` (str): The CID for the embedding's associated law. This functions as a foreign key.
- `text_chunk_order` (int): The relative location of an embedding for a law. As chunks may sometimes have token counts greater than the embedding model's input limits, laws are split into separate parts if they go over this limit.
- `embedding` (list(float)): An embedding of the plaintext version of the law, with newlines removed and spaces normalized.
- `__index_level_0__` (int): Artifact of the parquet conversion process. Will likely be removed in future updates.
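
Because `cid` in the citation files is described as a foreign key to the law, one law can be paired with all of its citations by grouping on that column. The following is a minimal sketch of that join, not the dataset author's tooling. It assumes rows have already been loaded as plain dictionaries (for example via `pandas.read_parquet(...).to_dict("records")`), and every value in the toy rows is made up. The helper name `attach_citations` is hypothetical.

```python
from collections import defaultdict

# Toy rows that use the documented column names. Real rows would come from
# the '_html' and '_citation' parquet files; all values here are invented.
html_rows = [
    {"cid": "cid-b", "doc_id": "CO_CH2", "doc_order": 2, "html": "<p>Second law</p>"},
    {"cid": "cid-a", "doc_id": "CO_CH1", "doc_order": 1, "html": "<p>First law</p>"},
]
citation_rows = [
    {"bluebook_cid": "bb-1", "cid": "cid-a", "title_num": "17-35", "year": "1985"},
    {"bluebook_cid": "bb-2", "cid": "cid-a", "title_num": "17-35", "year": "1992"},
    {"bluebook_cid": "bb-3", "cid": "cid-b", "title_num": "A", "year": "1972"},
]


def attach_citations(html_rows, citation_rows):
    """Return laws in corpus order, each with a list of its citations."""
    # Group citations by the law they point to (cid is the foreign key).
    by_cid = defaultdict(list)
    for citation in citation_rows:
        by_cid[citation["cid"]].append(citation)

    # doc_order gives the law's relative position in the corpus.
    ordered = sorted(html_rows, key=lambda row: row["doc_order"])
    return [{**law, "citations": by_cid.get(law["cid"], [])} for law in ordered]


laws = attach_citations(html_rows, citation_rows)
for law in laws:
    print(law["doc_id"], [c["bluebook_cid"] for c in law["citations"]])
```

Note that `title_num` and `year` stay strings throughout, in line with the guidance above that these columns should not be parsed as numbers.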

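
The embeddings files carry a `gnis` column so that cosine similarity searches can be narrowed to one location, and a law may be split into several chunks. A rough sketch of such a search follows, under these assumptions: rows are plain dictionaries with the documented column names, the query vector was produced by the same embedding model, and a law is scored by its best-matching chunk (one possible choice, not something the dataset prescribes). The function names and the toy 3-dimensional vectors are illustrative only; real vectors are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(embedding_rows, query_vector, gnis, top_k=5):
    """Rank laws in one place by their best-matching chunk."""
    best = {}
    for row in embedding_rows:
        if row["gnis"] != gnis:  # restrict the search to a single location
            continue
        score = cosine_similarity(row["embedding"], query_vector)
        # A law can have several chunks; keep the highest score per cid.
        if score > best.get(row["cid"], float("-inf")):
            best[row["cid"]] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Invented rows: law "cid-a" has two chunks, "cid-c" belongs to another place.
rows = [
    {"gnis": "100", "cid": "cid-a", "text_chunk_order": 0, "embedding": [1.0, 0.0, 0.0]},
    {"gnis": "100", "cid": "cid-a", "text_chunk_order": 1, "embedding": [0.0, 1.0, 0.0]},
    {"gnis": "100", "cid": "cid-b", "text_chunk_order": 0, "embedding": [0.6, 0.8, 0.0]},
    {"gnis": "200", "cid": "cid-c", "text_chunk_order": 0, "embedding": [0.0, 1.0, 0.0]},
]
results = search(rows, [0.0, 1.0, 0.0], gnis="100")
for cid, score in results:
    print(cid, round(score, 3))
```

The returned `cid` values can then be looked up in the '_html' or '_citation' files, since `cid` is the shared key across all three file types.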
Provider:
maas
Created:
2025-10-03