free-law/Caselaw_Access_Project_FAISS_index

Name: free-law/Caselaw_Access_Project_FAISS_index
Creator: free-law
Published: 2024-03-16 20:04:55
License: cc0-1.0

Source: Hugging Face. Updated 2024-03-16; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/free-law/Caselaw_Access_Project_FAISS_index

Description:

---
license: cc0-1.0
task_categories:
- text-generation
language:
- en
tags:
- legal
- law
- caselaw
pretty_name: Caselaw Access Project
size_categories:
- 1M<n<10M
---

<img src="https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_project/resolve/main/cap.png" width="800">

# The Caselaw Access Project

In collaboration with Ravel Law, Harvard Law Library digitized over 40 million pages of U.S. court decisions, comprising 6.7 million cases from the last 360 years, into a widely accessible dataset.

Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/

Find more information about accessing state and federal written court decisions of common law through the bulk data service documentation here: https://case.law/docs/

Learn more about the Caselaw Access Project and all of the phenomenal work done by Jack Cushman, Greg Leppert, and Matteo Cargnelutti here: https://case.law/about/

Watch a live stream of the data release here: https://lil.law.harvard.edu/about/cap-celebration/stream

# Post-processing

Teraflop AI is excited to help support the Caselaw Access Project and Harvard Library Innovation Lab in the release of over 6.6 million state and federal court decisions published throughout U.S. history. It is important to democratize fair access to data for the public, the legal community, and researchers.

This is a processed and cleaned version of the original CAP data. The digitization of these texts introduced OCR errors. We post-processed each text for model training, fixing problems with encoding, normalization, repetition, redundancy, parsing, and formatting.

Teraflop AI's data engine allows for the massively parallel processing of web-scale datasets into cleaned text form. Our one-click deployment allowed us to easily split the computation between 1000s of nodes on our managed infrastructure.

# FAISS Index

We built a FAISS index over all of the post-processed legal texts. The index consists of ~6.6 million dense vectors, and the average search time for a query over the entire index is 12.46 milliseconds.

The FAISS library by Meta performs efficient, scalable k-nearest neighbor search over millions of dense vectors. Find the FAISS library here: https://github.com/facebookresearch/faiss

The combination of an Inverted File Index (IVF), Product Quantization (PQ), and Hierarchical Navigable Small World (HNSW) allows us to run these queries across all of the dense vectors in milliseconds. Find more information about these index types here: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

# Licensing Information

The Caselaw Access Project dataset is licensed under the [CC0 License](https://creativecommons.org/public-domain/cc0/).

# Citation Information

```
The President and Fellows of Harvard University. "Caselaw Access Project." 2024, https://case.law/
```

```
@misc{ccap,
  title={Cleaned Caselaw Access Project},
  author={Enrico Shippole and Aran Komatsuzaki},
  howpublished={\url{https://huggingface.co/datasets/TeraflopAI/Caselaw_Access_Project}},
  year={2024}
}
```
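To make the inverted-file idea named in the FAISS Index section more concrete, here is a minimal pure-Python sketch. It is not the FAISS API and not the pipeline used to build this index: `ToyIVF`, `train_centroids`, `brute_force`, and the parameters `n_lists` and `nprobe` are hypothetical names chosen for illustration, the data is random, and the PQ and HNSW components are omitted entirely. The sketch only shows why probing a few cells can be much cheaper than scanning every vector.

```python
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, candidates):
    """Index of the candidate vector closest to `point`."""
    return min(range(len(candidates)), key=lambda i: squared_l2(point, candidates[i]))


def train_centroids(vectors, n_lists, n_iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm) used as the coarse quantizer."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(n_iters):
        buckets = [[] for _ in range(n_lists)]
        for v in vectors:
            buckets[nearest(v, centroids)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cell keeps its previous centroid
                dim = len(bucket[0])
                centroids[i] = [sum(v[d] for v in bucket) / len(bucket) for d in range(dim)]
    return centroids


class ToyIVF:
    """Toy inverted-file index: each vector id is stored in the list of its nearest centroid."""

    def __init__(self, vectors, n_lists=8, seed=0):
        self.vectors = vectors
        self.centroids = train_centroids(vectors, n_lists, seed=seed)
        self.lists = [[] for _ in range(n_lists)]
        for idx, v in enumerate(vectors):
            self.lists[nearest(v, self.centroids)].append(idx)

    def search(self, query, k=5, nprobe=2):
        """Scan only the `nprobe` cells whose centroids are closest to the query."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: squared_l2(query, self.centroids[i]))
        candidates = [idx for cell in order[:nprobe] for idx in self.lists[cell]]
        candidates.sort(key=lambda idx: squared_l2(query, self.vectors[idx]))
        return candidates[:k]


def brute_force(vectors, query, k=5):
    """Exact k-nearest-neighbor search over every vector, for comparison."""
    return sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))[:k]


# Synthetic stand-in data: 500 random 16-dimensional vectors.
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(500)]
index = ToyIVF(data, n_lists=8)

query = data[123]
approx = index.search(query, k=5, nprobe=2)   # approximate: scans 2 of 8 cells
exact = index.search(query, k=5, nprobe=8)    # probing every cell is an exact search
```

Raising `nprobe` trades speed for recall: with `nprobe` equal to `n_lists` the toy search degenerates to a full scan and matches `brute_force`. In a real index of this kind, PQ would additionally compress the stored vectors and HNSW would speed up the lookup of the closest centroids.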

Provider:

free-law
