TeraflopAI/Caselaw_Access_Project_FAISS_index

Name: TeraflopAI/Caselaw_Access_Project_FAISS_index
Creator: TeraflopAI
Published: 2024-03-16 20:04:55
License: No description available

Source: Hugging Face. Updated 2024-03-16. Indexed 2024-04-19.

Download link:

https://hf-mirror.com/datasets/TeraflopAI/Caselaw_Access_Project_FAISS_index

Resource description:

---
license: cc0-1.0
task_categories:
- text-generation
language:
- en
tags:
- legal
- law
- caselaw
pretty_name: Caselaw Access Project
size_categories:
- 1M<n<10M
---

<img src="https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_project/resolve/main/cap.png" width="800">

# The Caselaw Access Project

In collaboration with Ravel Law, the Harvard Law School Library digitized over 40 million U.S. court decisions consisting of 6.7 million cases from the last 360 years into a widely accessible dataset.

Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/

Find more information about accessing state and federal written court decisions of common law through the bulk data service documentation here: https://case.law/docs/

Learn more about the Caselaw Access Project and all of the phenomenal work done by Jack Cushman, Greg Leppert, and Matteo Cargnelutti here: https://case.law/about/

Watch a live stream of the data release here: https://lil.law.harvard.edu/about/cap-celebration/stream

# Post-processing

Teraflop AI is excited to help support the Caselaw Access Project and the Harvard Library Innovation Lab in the release of over 6.6 million state and federal court decisions published throughout U.S. history. It is important to democratize fair access to data for the public, the legal community, and researchers.

This is a processed and cleaned version of the original CAP data. OCR errors were introduced during the digitization of these texts. We post-processed each text for model training to fix encoding, normalization, repetition, redundancy, parsing, and formatting.

Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text form. Our one-click deployment allowed us to easily split the computation between 1000s of nodes on our managed infrastructure.

# FAISS Index

We built a FAISS index over all of the post-processed legal texts. The index consists of ~6.6 million dense vectors, and the average latency of a query over the entire index is 12.46 milliseconds.

The FAISS library by Meta allows you to perform k-nearest neighbor search efficiently and scalably over millions of dense vectors. Find the FAISS library here: https://github.com/facebookresearch/faiss

The combination of an Inverted File Index (IVF), Product Quantization (PQ), and Hierarchical Navigable Small World (HNSW) graphs allows us to run these queries across all of the dense vectors in milliseconds. Find more information about these index types here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

# Licensing Information

The Caselaw Access Project dataset is licensed under the [CC0 License](https://creativecommons.org/public-domain/cc0/).

# Citation Information

```
The President and Fellows of Harvard University. "Caselaw Access Project." 2024, https://case.law/
```

```
@misc{ccap,
  title={Cleaned Caselaw Access Project},
  author={Enrico Shippole and Aran Komatsuzaki},
  howpublished={\url{https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project}},
  year={2024}
}
```
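To make the Inverted File Index (IVF) idea from the FAISS Index section more concrete, here is a minimal pure-Python sketch. It is not FAISS and not the code used to build this index: it omits Product Quantization and HNSW entirely, picks centroids by random sampling instead of k-means, and every name in it (`build_ivf`, `search_ivf`, `nlist`, `nprobe` as used here) is a hypothetical helper chosen for illustration. It only shows why probing a few cells is cheaper than scanning every vector, and that probing all cells recovers the exact result.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, nlist, seed=0):
    """Toy inverted file: assign every vector to its nearest centroid.

    Assumption: centroids are sampled data points, not trained with k-means.
    """
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        cell = min(range(nlist), key=lambda c: sq_dist(vec, centroids[c]))
        lists[cell].append(idx)
    return centroids, lists


def search_ivf(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = heapq.nsmallest(
        nprobe, range(len(centroids)), key=lambda c: sq_dist(query, centroids[c])
    )
    candidates = [i for c in cells for i in lists[c]]
    return heapq.nsmallest(k, candidates, key=lambda i: sq_dist(query, vectors[i]))


def search_exact(query, vectors, k):
    """Brute-force baseline: scan every vector."""
    return heapq.nsmallest(
        k, range(len(vectors)), key=lambda i: sq_dist(query, vectors[i])
    )


# Synthetic stand-in data; the real index holds ~6.6 million vectors.
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(2000)]
query = [rng.gauss(0, 1) for _ in range(16)]

centroids, lists = build_ivf(vectors, nlist=32)
exact = search_exact(query, vectors, k=5)
approx = search_ivf(query, vectors, centroids, lists, k=5, nprobe=4)
full = search_ivf(query, vectors, centroids, lists, k=5, nprobe=32)

recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@5 with nprobe=4: {recall:.2f}")
print(f"nprobe=32 matches exact search: {sorted(full) == sorted(exact)}")
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to `nlist` every cell is scanned, so the result matches brute-force search. In FAISS itself, PQ additionally compresses the stored vectors and HNSW can speed up the choice of which cells to probe.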

Provider:

TeraflopAI

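Summary of original information

Dataset overview

Collaborating institutions: Harvard Law School Library, in collaboration with Ravel Law
Contents: over 40 million U.S. court decisions, covering 6.7 million cases
Time span: the last 360 years
Access: bulk download through the Caselaw Access Project API (CAPAPI); for details, see https://case.law/caselaw/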
