pile-of-law/pile-of-law

Name: pile-of-law/pile-of-law
Creator: pile-of-law
Published: 2023-01-08 03:10:35
License: no description provided

Source: Hugging Face · Updated: 2023-01-08 · Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/pile-of-law/pile-of-law

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- cc-by-nc-sa-4.0
multilinguality:
- monolingual
pretty_name: pile-of-law
size_categories:
- 10M<n<100M
source_datasets: []
task_categories:
- fill-mask
task_ids:
- masked-language-modeling
viewer: false
---

# Dataset Card for Pile of Law

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/pile-of-law/pile-of-law
- **Repository:** https://huggingface.co/datasets/pile-of-law/pile-of-law
- **Paper:** https://arxiv.org/abs/2207.00220

### Dataset Summary

We curate a large corpus of legal and administrative data.
The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

### Supported Tasks and Leaderboards

See paper for details.

### Languages

Mainly English, but some other languages may appear in some portions of the data.

## Dataset Structure

### Data Instances

- **courtListener_docket_entry_documents**: Docket entries in U.S. federal courts, including filed briefs, from the CourtListener RECAP archive.
- **courtListener_opinions**: U.S. court opinions from CourtListener (synchronized as of 12/31/2022).
- **atticus_contracts**: Unannotated contracts from the Atticus Project.
- **federal_register**: The U.S. Federal Register, where agencies file draft rulemaking.
- **bva_opinions**: Board of Veterans' Appeals opinions.
- **us_bills**: Draft bills from the United States Congress.
- **cc_casebooks**: Educational casebooks released under open CC licenses.
- **tos**: Unannotated Terms of Service contracts.
- **euro_parl**: European parliamentary debates.
- **nlrb_decisions**: Decisions from the U.S. National Labor Relations Board.
- **scotus_oral_arguments**: U.S. Supreme Court oral arguments.
- **cfr**: U.S. Code of Federal Regulations.
- **state_codes**: U.S. state codes.
- **scotus_filings**: Briefs and filings with the U.S. Supreme Court.
- **exam_outlines**: Exam outlines available openly on the web.
- **edgar**: Contracts filed with the SEC and made available on the SEC's EDGAR tool.
- **cfpb_creditcard_contracts**: Credit card contracts compiled by the U.S. Consumer Financial Protection Bureau.
- **constitutions**: The world's constitutions.
- **congressional_hearings**: U.S. Congressional hearing transcripts and statements.
- **oig**: U.S. Office of Inspector General reports.
- **olc_memos**: U.S. Office of Legal Counsel memos.
- **uscode**: The United States Code (laws).
- **founding_docs**: Letters from U.S. founders.
- **ftc_advisory_opinions**: Advisory opinions by the Federal Trade Commission.
- **echr**: European Court of Human Rights opinions.
- **eurlex**: European laws.
- **tax_rulings**: Rulings from the U.S. Tax Court.
- **un_debates**: U.N. General Debates.
- **fre**: U.S. Federal Rules of Evidence.
- **frcp**: U.S. Federal Rules of Civil Procedure.
- **canadian_decisions**: Canadian court opinions from ON and BC.
- **eoir**: U.S. Executive Office for Immigration Review Immigration and Nationality Precedential Decisions.
- **dol_ecab**: Department of Labor Employees' Compensation Appeals Board decisions after 2006.
- **r_legaladvice**: Filtered data from the r/legaladvice and r/legaladviceofftopic subreddits, in the format: Title: [Post Title] Question: [Post Content] Topic: [Post Flair] Answer \#[N]: [Top Answers]...
- **acus_reports**: Reports from the Administrative Conference of the United States from 2010-2022.
- **ed_policy_guidance**: Policy guidance documents from the U.S. Department of Education (2001-2022).
- **uspto_office_actions**: Office Actions from the U.S. Patent and Trademark Office from 2019-2022.
- **icj-pcij**: International Court of Justice and Permanent Court of International Justice opinions.
- **hhs_alj_opinions**: Opinions from the U.S. Department of Health and Human Services Administrative Law Judges from 1985-2019.
- **sec_administrative_proceedings**: Significant pleadings, orders and decisions for administrative proceedings from the U.S. Securities and Exchange Commission from 2005-2022.
- **fmshrc_bluebooks**: Bluebooks from the U.S. Federal Mine Safety and Health Review Commission from 1979 (March) - 2022 (August).
- **resource_contracts**: Resource contracts collected by ResourceContracts.org.
- **medicaid_policy_guidance**: Policy guidance documents from the U.S. Department of Health and Human Services (1994-2022).
- **irs_legal_advice_memos**: Legal Advice Memos and Chief Counsel Notices from the U.S. Internal Revenue Service.
- **doj_guidance**: Guidance documents from the U.S. Department of Justice (2020-2022).

**1/23 update**: Data updated in 2023 included: syncing courtListener opinions, adding ACUS reports, USPTO office actions, Ed Policy Guidance, HHS ALJ opinions, SEC administrative proceedings, FMSHRC Bluebooks, Resource Contracts, and ICJ/PCIJ legal opinions. We also fixed OLC opinions, which had some formatting inconsistencies, and merged exam outlines into one file, adding some additional exam outlines.

On-disk sizes might vary due to caching and compression, but should be approximately as follows as of 1/7/2023.

```bash
% xz --list data/*.xz
 Strms Blocks    Compressed  Uncompressed  Ratio  Check  Filename
   183    181   9,631.2 KiB      35.0 MiB  0.268  CRC64  data/train.acus_reports.jsonl.xz
     1      1   1,024.1 MiB   6,804.7 MiB  0.150  CRC64  data/train.atticus_contracts.0.jsonl.xz
     1      1   1,024.1 MiB   6,781.1 MiB  0.151  CRC64  data/train.atticus_contracts.1.jsonl.xz
     1      1   1,024.1 MiB   6,790.1 MiB  0.151  CRC64  data/train.atticus_contracts.2.jsonl.xz
     1      1   1,024.1 MiB   6,759.2 MiB  0.152  CRC64  data/train.atticus_contracts.3.jsonl.xz
     1      1     139.9 MiB     925.0 MiB  0.151  CRC64  data/train.atticus_contracts.4.jsonl.xz
     1      1   1,564.6 MiB      12.5 GiB  0.123  CRC64  data/train.bva.jsonl.xz
     1      1      29.8 MiB     154.3 MiB  0.193  CRC64  data/train.canadian_decisions.jsonl.xz
     1      1      18.5 MiB      82.6 MiB  0.224  CRC64  data/train.cc_casebooks.jsonl.xz
     1      1   3,427.3 KiB      67.2 MiB  0.050  CRC64  data/train.cfpb_cc.jsonl.xz
     1      1      72.7 MiB     582.6 MiB  0.125  CRC64  data/train.cfr.jsonl.xz
     1      1   1,056.1 MiB   4,941.9 MiB  0.214  CRC64  data/train.congressional_hearings.jsonl.xz
     1      1   3,272.4 KiB      21.3 MiB  0.150  CRC64  data/train.constitutions.jsonl.xz
     1      1   1,024.1 MiB      13.0 GiB  0.077  CRC64  data/train.courtlistenerdocketentries.0.jsonl.xz
     1      1   1,024.3 MiB      13.3 GiB  0.075  CRC64  data/train.courtlistenerdocketentries.1.jsonl.xz
     1      1   1,024.1 MiB      12.4 GiB  0.080  CRC64  data/train.courtlistenerdocketentries.2.jsonl.xz
     1      1     635.2 MiB   8,671.6 MiB  0.073  CRC64  data/train.courtlistenerdocketentries.3.jsonl.xz
     1      1     953.7 MiB   4,575.7 MiB  0.208  CRC64  data/train.courtlisteneropinions.0.jsonl.xz
     1      1     953.7 MiB   4,356.2 MiB  0.219  CRC64  data/train.courtlisteneropinions.1.jsonl.xz
     1      1     953.7 MiB   4,315.6 MiB  0.221  CRC64  data/train.courtlisteneropinions.10.jsonl.xz
     1      1     953.7 MiB   4,650.3 MiB  0.205  CRC64  data/train.courtlisteneropinions.11.jsonl.xz
     1      1     953.7 MiB   4,836.3 MiB  0.197  CRC64  data/train.courtlisteneropinions.12.jsonl.xz
     1      1     953.7 MiB   4,644.9 MiB  0.205  CRC64  data/train.courtlisteneropinions.13.jsonl.xz
     1      1     953.7 MiB   4,657.5 MiB  0.205  CRC64  data/train.courtlisteneropinions.14.jsonl.xz
     1      1     539.2 MiB   2,621.8 MiB  0.206  CRC64  data/train.courtlisteneropinions.15.jsonl.xz
     1      1     953.7 MiB   4,335.3 MiB  0.220  CRC64  data/train.courtlisteneropinions.2.jsonl.xz
     1      1     953.7 MiB   4,352.0 MiB  0.219  CRC64  data/train.courtlisteneropinions.3.jsonl.xz
     1      1     953.7 MiB   4,575.9 MiB  0.208  CRC64  data/train.courtlisteneropinions.4.jsonl.xz
     1      1     953.7 MiB   4,382.6 MiB  0.218  CRC64  data/train.courtlisteneropinions.5.jsonl.xz
     1      1     953.7 MiB   4,352.3 MiB  0.219  CRC64  data/train.courtlisteneropinions.6.jsonl.xz
     1      1     953.7 MiB   4,462.4 MiB  0.214  CRC64  data/train.courtlisteneropinions.7.jsonl.xz
     1      1     953.7 MiB   4,604.0 MiB  0.207  CRC64  data/train.courtlisteneropinions.8.jsonl.xz
     1      1     953.7 MiB   4,612.0 MiB  0.207  CRC64  data/train.courtlisteneropinions.9.jsonl.xz
   335    335   6,047.4 KiB      24.1 MiB  0.245  CRC64  data/train.doj_guidance.jsonl.xz
     1      1      41.1 MiB     305.6 MiB  0.135  CRC64  data/train.dol_ecab.jsonl.xz
     1      1      19.1 MiB     100.5 MiB  0.190  CRC64  data/train.echr.jsonl.xz
   508    507   1,502.0 KiB   4,716.7 KiB  0.318  CRC64  data/train.ed_policy_guidance.jsonl.xz
     1      1   1,372.0 MiB   9,032.6 MiB  0.152  CRC64  data/train.edgar.jsonl.xz
     1      1   3,896.6 KiB      18.6 MiB  0.205  CRC64  data/train.eoir.jsonl.xz
     1      1     140.3 MiB   1,154.7 MiB  0.121  CRC64  data/train.eurlex.jsonl.xz
     1      1      51.4 MiB     239.4 MiB  0.215  CRC64  data/train.euro_parl.jsonl.xz
     1      1     355.3 KiB   1,512.5 KiB  0.235  CRC64  data/train.examoutlines.jsonl.xz
     1      1      20.7 MiB     131.7 MiB  0.157  CRC64  data/train.federal_register.jsonl.xz
   396    396      43.9 MiB     175.7 MiB  0.250  CRC64  data/train.fmshrc.jsonl.xz
     1      1      73.4 MiB     341.7 MiB  0.215  CRC64  data/train.founding_docs.jsonl.xz
     1      1     324.2 KiB   1,459.4 KiB  0.222  CRC64  data/train.frcp.jsonl.xz
     1      1     116.1 KiB     484.9 KiB  0.239  CRC64  data/train.fre.jsonl.xz
     1      1     297.3 KiB   1,245.0 KiB  0.239  CRC64  data/train.ftc_advisory_opinions.jsonl.xz
 2,084  2,083      13.4 MiB      42.2 MiB  0.318  CRC64  data/train.hhs_alj.jsonl.xz
     1      1      29.5 MiB     157.4 MiB  0.188  CRC64  data/train.ijc.jsonl.xz
   442    442   7,904.4 KiB      35.8 MiB  0.216  CRC64  data/train.irs_legal_advice_memos.jsonl.xz
   658    658   3,403.1 KiB      10.6 MiB  0.314  CRC64  data/train.medicaid_policy_guidance.jsonl.xz
     1      1     170.7 MiB     788.9 MiB  0.216  CRC64  data/train.nlrb_decisions.jsonl.xz
     1      1     218.4 MiB   1,580.3 MiB  0.138  CRC64  data/train.oig.jsonl.xz
     1      1   5,857.4 KiB      31.5 MiB  0.182  CRC64  data/train.olc_memos.jsonl.xz
     1      1      58.6 MiB     234.5 MiB  0.250  CRC64  data/train.r_legaldvice.jsonl.xz
 1,639  1,639      43.7 MiB     188.1 MiB  0.232  CRC64  data/train.resource_contracts.jsonl.xz
     1      1     242.6 MiB   1,241.6 MiB  0.195  CRC64  data/train.scotus_docket_entries.jsonl.xz
     1      1      68.5 MiB     323.2 MiB  0.212  CRC64  data/train.scotus_oral.jsonl.xz
10,805 10,805      40.7 MiB     118.4 MiB  0.344  CRC64  data/train.sec.jsonl.xz
     1      1     705.0 MiB   5,019.9 MiB  0.140  CRC64  data/train.state_code.jsonl.xz
     1      1      75.2 MiB     540.8 MiB  0.139  CRC64  data/train.taxrulings.jsonl.xz
     1      1     273.6 KiB   1,318.5 KiB  0.207  CRC64  data/train.tos.jsonl.xz
     1      1      22.6 MiB     108.1 MiB  0.209  CRC64  data/train.undebates.jsonl.xz
     1      1     167.6 MiB   1,119.6 MiB  0.150  CRC64  data/train.us_bills.jsonl.xz
     1      1      25.3 MiB     196.1 MiB  0.129  CRC64  data/train.uscode.jsonl.xz
     1      1   1,713.2 MiB      33.7 GiB  0.050  CRC64  data/train.uspto_oab.jsonl.xz
    54     54   2,960.9 KiB      11.0 MiB  0.264  CRC64  data/validation.acus_reports.jsonl.xz
     1      1   1,024.1 MiB   6,797.1 MiB  0.151  CRC64  data/validation.atticus_contracts.0.jsonl.xz
     1      1     374.6 MiB   2,471.7 MiB  0.152  CRC64  data/validation.atticus_contracts.1.jsonl.xz
     1      1     523.0 MiB   4,258.9 MiB  0.123  CRC64  data/validation.bva.jsonl.xz
     1      1       9.8 MiB      50.5 MiB  0.195  CRC64  data/validation.canadian_decisions.jsonl.xz
     1      1   4,281.5 KiB      19.1 MiB  0.219  CRC64  data/validation.cc_casebooks.jsonl.xz
     1      1   1,532.6 KiB      19.6 MiB  0.077  CRC64  data/validation.cfpb_cc.jsonl.xz
     1      1      23.3 MiB     190.4 MiB  0.122  CRC64  data/validation.cfr.jsonl.xz
     1      1     347.4 MiB   1,620.7 MiB  0.214  CRC64  data/validation.congressional_hearings.jsonl.xz
     1      1   1,102.4 KiB   6,733.0 KiB  0.164  CRC64  data/validation.constitutions.jsonl.xz
     1      1   1,024.1 MiB      10.7 GiB  0.094  CRC64  data/validation.courtlistenerdocketentries.0.jsonl.xz
     1      1     473.7 MiB   5,225.2 MiB  0.091  CRC64  data/validation.courtlistenerdocketentries.1.jsonl.xz
     1      1     953.7 MiB   4,391.3 MiB  0.217  CRC64  data/validation.courtlisteneropinions.0.jsonl.xz
     1      1     953.7 MiB   4,406.9 MiB  0.216  CRC64  data/validation.courtlisteneropinions.1.jsonl.xz
     1      1     953.8 MiB   4,436.7 MiB  0.215  CRC64  data/validation.courtlisteneropinions.2.jsonl.xz
     1      1     953.7 MiB   4,476.9 MiB  0.213  CRC64  data/validation.courtlisteneropinions.3.jsonl.xz
     1      1     953.7 MiB   4,618.0 MiB  0.207  CRC64  data/validation.courtlisteneropinions.4.jsonl.xz
     1      1     238.5 MiB   1,147.4 MiB  0.208  CRC64  data/validation.courtlisteneropinions.5.jsonl.xz
   100    100   1,778.7 KiB   7,371.5 KiB  0.241  CRC64  data/validation.doj_guidance.jsonl.xz
     1      1      13.8 MiB     101.5 MiB  0.136  CRC64  data/validation.dol_ecab.jsonl.xz
     1      1   4,132.1 KiB      20.8 MiB  0.194  CRC64  data/validation.echr.jsonl.xz
   174    173     490.5 KiB   1,564.9 KiB  0.313  CRC64  data/validation.ed_policy_guidance.jsonl.xz
     1      1     453.6 MiB   2,978.9 MiB  0.152  CRC64  data/validation.edgar.jsonl.xz
     1      1   1,340.0 KiB   6,294.8 KiB  0.213  CRC64  data/validation.eoir.jsonl.xz
     1      1      49.1 MiB     393.7 MiB  0.125  CRC64  data/validation.eurlex.jsonl.xz
     1      1      17.0 MiB      79.0 MiB  0.215  CRC64  data/validation.euro_parl.jsonl.xz
     1      1     103.7 KiB     547.9 KiB  0.189  CRC64  data/validation.examoutlines.jsonl.xz
     1      1   7,419.0 KiB      45.7 MiB  0.158  CRC64  data/validation.federal_register.jsonl.xz
   120    120      13.5 MiB      53.9 MiB  0.250  CRC64  data/validation.fmshrc.jsonl.xz
     1      1      25.3 MiB     113.2 MiB  0.224  CRC64  data/validation.founding_docs.jsonl.xz
     1      1      63.5 KiB     248.8 KiB  0.255  CRC64  data/validation.frcp.jsonl.xz
     1      1      58.4 KiB     226.7 KiB  0.257  CRC64  data/validation.fre.jsonl.xz
     1      1     117.4 KiB     419.1 KiB  0.280  CRC64  data/validation.ftc_advisory_opinions.jsonl.xz
   722    721   4,900.2 KiB      15.1 MiB  0.318  CRC64  data/validation.hhs_alj.jsonl.xz
     1      1      10.0 MiB      52.3 MiB  0.191  CRC64  data/validation.ijc.jsonl.xz
   161    161   3,791.0 KiB      17.7 MiB  0.209  CRC64  data/validation.irs_legal_advice_memos.jsonl.xz
   214    214   1,101.1 KiB   3,411.1 KiB  0.323  CRC64  data/validation.medicaid_policy_guidance.jsonl.xz
     1      1      55.8 MiB     257.8 MiB  0.217  CRC64  data/validation.nlrb_decisions.jsonl.xz
     1      1      80.0 MiB     603.7 MiB  0.132  CRC64  data/validation.oig.jsonl.xz
     1      1   1,826.2 KiB   9,874.6 KiB  0.185  CRC64  data/validation.olc_memos.jsonl.xz
     1      1      19.7 MiB      78.7 MiB  0.251  CRC64  data/validation.r_legaldvice.jsonl.xz
   584    584      15.3 MiB      63.5 MiB  0.241  CRC64  data/validation.resource_contracts.jsonl.xz
     1      1      86.4 MiB     422.5 MiB  0.204  CRC64  data/validation.scotus_docket_entries.jsonl.xz
     1      1      23.1 MiB     109.0 MiB  0.212  CRC64  data/validation.scotus_oral.jsonl.xz
 3,559  3,559      13.0 MiB      37.7 MiB  0.344  CRC64  data/validation.sec.jsonl.xz
     1      1     371.8 MiB   2,678.4 MiB  0.139  CRC64  data/validation.state_code.jsonl.xz
     1      1      24.8 MiB     177.4 MiB  0.140  CRC64  data/validation.taxrulings.jsonl.xz
     1      1      92.7 KiB     381.6 KiB  0.243  CRC64  data/validation.tos.jsonl.xz
     1      1   7,705.6 KiB      35.5 MiB  0.212  CRC64  data/validation.undebates.jsonl.xz
     1      1      53.8 MiB     356.3 MiB  0.151  CRC64  data/validation.us_bills.jsonl.xz
     1      1      15.2 MiB     117.5 MiB  0.129  CRC64  data/validation.uscode.jsonl.xz
     1      1     885.5 MiB      11.2 GiB  0.077  CRC64  data/validation.uspto_oab.jsonl.xz
-------------------------------------------------------------------------------
22,839 22,833      41.0 GiB     291.5 GiB  0.141  CRC64  119 files
```

### Data Fields

- text: the document text
- created_timestamp: If the original source provided a timestamp for when the document was created, we provide it as well. Note that these may be inaccurate. For example, CourtListener case opinions provide the timestamp of when the opinion was uploaded to CourtListener, not when it was published. We welcome pull requests to correct this field if such inaccuracies are discovered.
- downloaded_timestamp: When the document was scraped.
- url: the source url

### Data Splits

Each subset of the data has a 75%/25% train/validation split. Note that we do not use the validation set for any downstream tasks, nor do we filter out any data from downstream tasks. Please filter as needed before training models, or feel free to use a different dataset split.

## Dataset Creation

### Curation Rationale

We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives. As such, data sources are curated to inform: (1) legal analysis, knowledge, or understanding; (2) argument formation; (3) privacy filtering standards. Sources like codes and laws tend to inform (1). Transcripts and court filings tend to inform (2). Opinions tend to inform (1) and (3).

### Source Data

#### Initial Data Collection and Normalization

We do not normalize the data, but we provide dataset creation code and relevant urls in https://github.com/Breakend/PileOfLaw

#### Who are the source language producers?

Varied (see sources above).

### Personal and Sensitive Information

This dataset may contain personal and sensitive information. However, this has been previously filtered by the relevant government and federal agencies that weigh the harms of revealing this information against the benefits of transparency.
If you encounter something particularly harmful, please file a takedown request with the upstream source and notify us in the communities tab. We will then remove the content. We cannot enable more restrictive licensing because upstream sources may restrict using a more restrictive license. However, we ask that all users of this data respect the upstream licenses and restrictions. Per the standards of CourtListener, we do not allow indexing of this data by search engines and we ask that others not do so either. Please do not turn on anything that allows the data to be easily indexed.

## Considerations for Using the Data

### Social Impact of Dataset

We hope that this dataset will provide more mechanisms for doing data work. As we describe in the paper, the internal variation allows contextual privacy rules to be learned. If robust mechanisms for this are developed, they can be applied more broadly. This dataset can also potentially be used for legal language model pretraining. As discussed in "On the Opportunities and Risks of Foundation Models", legal language models can help improve access to justice in various ways. But they can also be used in potentially harmful ways. While such models are not ready for most production environments and are the subject of significant research, we ask that model creators using this data, particularly when creating generative models, consider the impacts of their model and make a good faith effort to weigh the benefits against the harms of their method. Our license and many of the sub-licenses also restrict commercial usage.

### Discussion of Biases

The data reflects the biases of governments and courts. As we discuss in our work, these can be significant, though more recent text will likely be less overtly toxic. Please see the above statement and embark on any model uses responsibly.

### Other Known Limitations

We mainly focus on U.S. and English-speaking legal sources, though we include some European and Canadian resources.
## Additional Information

### Licensing Information

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. But individual sources may have other licenses. See paper for details. Some upstream data sources request that indexing be disabled. As such please **do not re-host any data in a way that can be indexed by search engines.**

### No Representations

We do not make any representation that the legal information provided here is accurate. It is meant for research purposes only. For the authoritative and updated source of information please refer directly to the governing body which provides the latest laws, rules, and regulations relevant to you.

### DMCA Takedown Requests

Pile of Law follows the notice and takedown procedures in the Digital Millennium Copyright Act (DMCA), 17 U.S.C. Section 512.

If you believe content on Pile of Law violates your copyright, please immediately notify its operators by sending a message with the information described below. Please use the subject "Copyright" in your message. If Pile of Law's operators act in response to an infringement notice, they will make a good-faith attempt to contact the person who contributed the content using the most recent email address that person provided to Pile of Law.

Under the DMCA, you may be held liable for damages based on material misrepresentations in your infringement notice. You must also make a good-faith evaluation of whether the use of your content is a fair use, because fair uses are not infringing. See 17 U.S.C. Section 107 and Lenz v. Universal Music Corp., No. 13-16106 (9th Cir. Sep. 14, 2015). If you are not sure if the content you want to report infringes your copyright, you should first contact a lawyer.
The DMCA requires that all infringement notices must include all of the following:

+ A signature of the copyright owner or a person authorized to act on the copyright owner's behalf
+ An identification of the copyright claimed to have been infringed
+ A description of the nature and location of the material that you claim to infringe your copyright, in sufficient detail to allow Pile of Law to find and positively identify that material
+ Your name, address, telephone number, and email address
+ A statement that you believe in good faith that the use of the material that you claim to infringe your copyright is not authorized by law, or by the copyright owner or such owner's agent
+ A statement, under penalty of perjury, that all of the information contained in your infringement notice is accurate
+ A statement, under penalty of perjury, that you are either the copyright owner or a person authorized to act on their behalf.

Pile of Law will respond to all DMCA-compliant infringement notices, including, as required or appropriate, by removing the offending material or disabling all links to it. All received infringement notices may be posted in full to the Lumen database (previously known as the Chilling Effects Clearinghouse).

All takedown requests with the above information should be posted to the Communities tab.

This removal notice has been modified from the [CourtListener DMCA takedown notice](https://www.courtlistener.com/terms/).

### Citation Information

For a citation to this work:

```
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}
```

Since this dataset also includes several other data sources with citations, please refer to our paper and cite the additional relevant work in addition to our own work.

Provided by:

pile-of-law

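Summary of original information

Dataset overview

Basic dataset information

Name: Pile of Law
Language: mainly English; some portions contain other languages
License: CC-BY-NC-SA-4.0
Multilinguality: monolingual
Size: 10M<n<100M
Task category: fill-mask
Task ID: masked-language-modeling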

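Dataset contents

Data instances

The dataset comprises multiple subsets, each corresponding to a different type of legal or administrative document, including but not limited to:

courtListener_docket_entry_documents: docket entry documents from U.S. federal courts
courtListener_opinions: U.S. court opinions
atticus_contracts: unannotated contracts from the Atticus Project
federal_register: the U.S. Federal Register, where agencies file draft rulemaking
bva_opinions: Board of Veterans' Appeals opinions
us_bills: draft bills from the U.S. Congress
cc_casebooks: educational casebooks released under open CC licenses
tos: unannotated Terms of Service contracts
euro_parl: European parliamentary debates
nlrb_decisions: decisions of the U.S. National Labor Relations Board
scotus_oral_arguments: U.S. Supreme Court oral arguments
cfr: U.S. Code of Federal Regulations
state_codes: U.S. state codes
scotus_filings: briefs and filings with the U.S. Supreme Court
exam_outlines: openly available exam outlines
edgar: contracts made available through the SEC's EDGAR tool
cfpb_creditcard_contracts: credit card contracts compiled by the U.S. Consumer Financial Protection Bureau
constitutions: the world's constitutions
congressional_hearings: U.S. Congressional hearing transcripts and statements
oig: U.S. Office of Inspector General reports
olc_memos: U.S. Office of Legal Counsel memos
uscode: the United States Code (laws)
founding_docs: letters from U.S. founders
ftc_advisory_opinions: advisory opinions of the Federal Trade Commission
echr: European Court of Human Rights opinions
eurlex: European laws
tax_rulings: U.S. Tax Court rulings
un_debates: U.N. General Debates
fre: U.S. Federal Rules of Evidence
frcp: U.S. Federal Rules of Civil Procedure
canadian_decisions: Canadian court opinions
eoir: U.S. Executive Office for Immigration Review immigration and nationality precedential decisions
dol_ecab: decisions of the U.S. Department of Labor Employees' Compensation Appeals Board
r_legaladvice: data from the r/legaladvice and r/legaladviceofftopic subreddits
acus_reports: reports of the Administrative Conference of the United States
ed_policy_guidance: policy guidance documents from the U.S. Department of Education
uspto_office_actions: Office Actions from the U.S. Patent and Trademark Office
icj-pcij: International Court of Justice and Permanent Court of International Justice opinions
hhs_alj_opinions: opinions of U.S. Department of Health and Human Services Administrative Law Judges
sec_administrative_proceedings: significant pleadings, orders, and decisions from U.S. Securities and Exchange Commission administrative proceedings
fmshrc_bluebooks: Bluebooks from the U.S. Federal Mine Safety and Health Review Commission
resource_contracts: resource contracts collected by ResourceContracts.org
medicaid_policy_guidance: policy guidance documents from the U.S. Department of Health and Human Services
irs_legal_advice_memos: Legal Advice Memos and Chief Counsel Notices from the U.S. Internal Revenue Service
doj_guidance: guidance documents from the U.S. Department of Justice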

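Data fields

text: the document text
created_timestamp: timestamp of when the document was created (may be inaccurate)
downloaded_timestamp: when the document was scraped
url: the source URL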
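
To make the record layout concrete, here is a minimal sketch of streaming one shard with only the Python standard library. It assumes a shard has already been downloaded locally and that each line of the `.jsonl.xz` file is one JSON object carrying the four fields listed above; the function name `iter_records` and the choice to fill missing fields with an empty string are illustrative, not part of the dataset's tooling.

```python
import json
import lzma
from pathlib import Path
from typing import Iterator

# The four fields documented in the dataset card.
FIELDS = ("text", "created_timestamp", "downloaded_timestamp", "url")


def iter_records(path: Path) -> Iterator[dict]:
    """Stream records from one .jsonl.xz shard without unpacking it to disk.

    Assumption: one JSON object per line. Fields absent from a record are
    returned as empty strings (an illustrative choice, not a dataset rule).
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            yield {name: record.get(name, "") for name in FIELDS}
```

A call such as `next(iter_records(Path("data/train.fre.jsonl.xz")))` would return the first record of that shard as a dictionary. Streaming line by line matters here because several shards decompress to more than 10 GiB.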

Data splits

Each subset has a training set and a validation set, split 75%/25%.
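
The dataset card invites users to choose a different split if the published one does not suit them. The sketch below shows one common way to do that deterministically: hash a stable key and compare it with the validation fraction. This is not how the authors produced their 75%/25% split (the source does not say); using the `url` field as the key, the function name `assign_split`, and SHA-256 are all assumptions made for illustration.

```python
import hashlib


def assign_split(key: str, validation_fraction: float = 0.25) -> str:
    """Map a stable key (here assumed to be a record's url) to a split.

    The same key always lands in the same split, so the assignment can be
    recomputed at any time without storing an index.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "train"
```

Because the assignment depends only on the key, two records with the same URL always end up on the same side of the split, which avoids leaking duplicates across train and validation.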

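Collected summary

Dataset introduction

Construction

Pile of Law is built by curating and aggregating a large body of legal and administrative data. Sources include U.S. federal court docket entries, court opinions, contracts, the Federal Register, Board of Veterans' Appeals opinions, and draft congressional bills, among others. The documents were scraped from publicly available online sources, and each record carries creation and download timestamps and a source URL, which keeps the data traceable; the dataset card states that the text itself is not normalized. Each subset is split into 75% training and 25% validation data.

Characteristics

Pile of Law stands out for its broad legal coverage and the variety of its sources. Beyond U.S. legal documents, it includes international material such as European parliamentary debates, U.N. General Debates, and International Court of Justice opinions. The dataset also pays particular attention to privacy-filtering standards: according to the dataset card, personal and sensitive information in the sources has already been filtered by the relevant government agencies, which weigh the harm of disclosure against the benefits of transparency. This breadth and depth make the corpus well suited to pretraining legal-domain language models.

Usage

Pile of Law is mainly used for natural language processing in the legal domain, especially language model pretraining. Users can access the dataset directly on Hugging Face and download only the subsets they need. Users are advised to filter and re-split the data according to their task before training. The timestamps and source URLs on each record make it possible to trace and verify individual documents.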
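
As one example of the task-specific filtering recommended above, the following sketch drops records whose text exactly matches a held-out evaluation text after whitespace and case normalization. It is a hypothetical helper, not part of the dataset's tooling: the names `fingerprint` and `drop_overlaps` are invented here, and exact-match hashing will miss near-duplicates such as re-paginated or lightly edited copies.

```python
import hashlib
import re
from typing import Iterable, Iterator


def fingerprint(text: str) -> str:
    """Hash a text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def drop_overlaps(records: Iterable[dict], held_out_texts: Iterable[str]) -> Iterator[dict]:
    """Yield only records whose `text` does not match any held-out text.

    Assumption: each record is a dict with a `text` field, as described
    in the data fields section.
    """
    blocked = {fingerprint(text) for text in held_out_texts}
    for record in records:
        if fingerprint(record["text"]) not in blocked:
            yield record
```

The generator form lets the filter sit directly on top of a streaming reader, so a large subset never has to be held in memory.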

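Background and challenges

Background

Pile of Law is a large-scale corpus focused on the legal and administrative domain, created in 2022 by a group of researchers to support pretraining of legal-domain language models. It brings together legal documents, contracts, regulations, and court opinions from sources such as U.S. federal courts, the European Parliament, and U.N. debates. Its central research question is how large-scale legal text can advance natural language processing for law, particularly legal knowledge and understanding, argument formation, and privacy-filtering standards. The dataset provides a rich resource for legal AI research and a technical foundation for access-to-justice initiatives.

Current challenges

Building and using Pile of Law involves several challenges. First, legal text is complex and varied, which makes preprocessing and standardization difficult, especially because legal terminology and phrasing differ markedly across jurisdictions. Second, the data may contain sensitive information; although it has already been filtered by the relevant government agencies, it still needs careful handling to avoid privacy harms. Third, timestamp accuracy is a known issue: some creation timestamps are inaccurate, which can affect any training or analysis that depends on document dates. Finally, the dataset is large (about 41 GiB compressed and 291.5 GiB uncompressed, per the file listing above), so storing and processing it efficiently while maintaining data quality is a significant technical challenge.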

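Common scenarios

Typical use cases

In the legal domain, Pile of Law is used to train and evaluate legal language models. It contains a large volume of legal filings, court opinions, contracts, and regulations, providing corpus support for legal text understanding, generation, and classification. With it, researchers can build more accurate legal text analysis tools that help lawyers, judges, and legal scholars work through legal documents more efficiently.

Practical applications

In practice, the dataset can support the development of legal assistants, contract analysis tools, and regulation retrieval systems. For example, a law firm could use models trained on the dataset to automate parts of large-scale contract review, and a government agency could build a retrieval system that helps the public find and understand relevant legal provisions. Note that the dataset's license and many of its sub-licenses restrict commercial use. The dataset also serves legal education and research as a resource for in-depth text analysis.

Derived and related work

Several lines of work relate to Pile of Law. The accompanying paper (https://arxiv.org/abs/2207.00220) presents the corpus as a resource for pretraining legal-domain language models and for learning data-filtering rules from the law. Earlier legal-domain models such as Legal-BERT and CaseLaw-BERT predate the dataset and were trained on other legal corpora, so they are related work rather than derivatives. Other reported uses include building legal knowledge graphs to support legal reasoning and case retrieval. Together, this work advances legal AI and opens new directions for natural language processing research in the legal domain.

The above content was collected and summarized by 遇见数据集.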