aurora2424/epstein-files-20k

Name: aurora2424/epstein-files-20k
Creator: aurora2424
Published: 2026-03-13 08:22:52
License: No description available

Source: Hugging Face. Updated 2026-03-13; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/aurora2424/epstein-files-20k


Description:

---
tags:
- epstein
- public-records
- government-documents
- legal-documents
- research
- information-retrieval
---

# Disclaimer

This dataset is a reupload of a previously circulating public dataset. The contents may include unverified, incomplete, disputed, or inaccurate information and should not be interpreted as factual, authoritative, or as proof of guilt for any individual.

This dataset is provided strictly for research, archival, and educational purposes, such as analysis of information propagation, data preservation, or media studies. No claims are made regarding the accuracy, authenticity, or legitimacy of the materials contained herein.

If the dataset is demonstrated to be falsified, inaccurate, or if the original creator or relevant rights holders request removal, it will be taken down promptly.

This dataset should not be relied upon as a source of truth. The maintainer of this repository does not endorse, affirm, or validate any claims contained within the dataset. This reupload is intended solely to preserve access to publicly released materials in their extracted text form.

The rest of this readme is a copy of the original repository.

# U.S. House Oversight Epstein Estate Documents

The motivation for curating this dataset is to enable transparent exploration of the Epstein estate documents. The goal is to empower AI practitioners, researchers, and enthusiasts to build RAG based systems that can identify patterns, connections, and insights that are difficult to obtain through manual inspection.

## Usage and Responsibilities (Required Reading)

This dataset is provided for **research and exploratory analysis** with a focus on:

- Evaluating information retrieval and retrieval augmented generation (RAG) systems.
- Developing and testing search, clustering, and summarization methods on a real world corpus.

**Users are responsible for:**

- Using the dataset only for lawful purposes and in accordance with institutional and ethical review requirements.
- Treating individuals mentioned in the documents with respect, and avoiding sensationalism or misuse of sensitive material.
- Clearly distinguishing model generated content and exploratory findings from verified facts, and citing primary sources where appropriate.

It is **not** intended for:

- Fine-tuning language models.
- Harassment, doxing, or targeted attacks on any individual or group.
- Attempts to deanonymize redacted information or circumvent existing redactions.
- Making or amplifying unverified allegations as factual claims.

All use must comply with applicable law, institutional policies, and the terms of the original House releases. See the “Legal and copyright status” and “Ethical and content warning” sections below before working with this corpus.

## Source (All data derived from publicly released materials)

All documents originate from the public release **“Oversight Committee Releases Additional Epstein Estate Documents”** on the official House Oversight Committee website (press release dated **November 12, 2025**):

https://oversight.house.gov/release/oversight-committee-releases-additional-epstein-estate-documents/

The underlying materials are distributed via a Google Drive structure maintained by the Committee. This dataset is an independent derivative collection built from that release and is **not** an official product of the U.S. House of Representatives or the Committee on Oversight and Government Reform.

## Dataset contents

- **Documents**: Over 25,000 plain text files derived from the committee’s public releases organized in a single csv file
- **Source Folders**:
  - `TEXT/` – Files that were originally text-based (e.g., PDFs, emails) and converted to plain text.
  - `IMAGES/` – Image files (primarily JPG) converted to text via OCR.
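
As a rough illustration of working with the single CSV file, the sketch below tallies rows by their top-level source folder. The column names `filename` and `text` are assumptions made for this example (the card does not state the CSV header), and the sample rows are invented placeholders, so check the real header before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the dataset card does not specify the CSV header.
FILENAME_COL = "filename"
TEXT_COL = "text"


def count_by_source_folder(csv_file):
    """Count rows per top-level folder of the stored path (e.g. TEXT or IMAGES)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Normalise separators so Windows-style paths are grouped the same way.
        path = row[FILENAME_COL].replace("\\", "/").lstrip("/")
        top = path.split("/", 1)[0] if "/" in path else "(no folder)"
        counts[top] += 1
    return counts


# Tiny in-memory stand-in for the real file, using made-up rows.
sample = io.StringIO(
    "filename,text\n"
    "TEXT/001/example_a.txt,first placeholder document\n"
    "IMAGES/001/example_b.jpg,second placeholder document\n"
    "IMAGES/002/example_c.jpg,third placeholder document\n"
)
folder_counts = count_by_source_folder(sample)
print(dict(folder_counts))
```

For the real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, newline="", encoding="utf-8")` (the encoding is also an assumption). Very long OCR documents may exceed the default CSV field size, in which case `csv.field_size_limit` can be raised.
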

**Filenames preserve the relative path and naming conventions from the original Google Drive release, to facilitate cross-referencing back to the official source files.**

## Processing

- All image files under the `IMAGES/` directory (approximately 20,000 JPGs) were converted to machine-readable text using the open-source **Tesseract** OCR engine.
- Native text-based files under `TEXT/` were converted to plain text using standard tools (e.g., PDF/text extraction) without manual editing.
- No manual content editing, summarization, or redaction has been performed beyond:
  - basic file organization,
  - text extraction / OCR,
  - and any redactions already present in the official House releases.

As a result, the corpus may contain:

- OCR noise and misrecognized characters
- Broken formatting
- Redaction blocks, stamps, or markers inherited from the original scans

## Legal and copyright status (non-authoritative)

- The original underlying documents were created by various private individuals and entities, not by the dataset maintainer.
- The documents are sourced from releases published by the U.S. House Committee on Oversight and Government Reform. The release webpages themselves carry standard copyright notices (© 2025 Committee on Oversight and Government Reform), and many individual documents are likely protected by copyright held by their original authors or rights holders.
- This dataset:
  - **Does not** assert any ownership over the underlying documents.
  - **Does not** grant any license to reproduce, distribute, or create derivative works from the underlying texts beyond what may already be permitted by law (e.g., fair use or similar doctrines in your jurisdiction).
- Users are solely responsible for ensuring that their use of this corpus complies with applicable copyright law, privacy law, institutional policies, and the terms of the original House releases.

Nothing in this dataset card constitutes legal advice.
If you plan to use this corpus in a public-facing product, for model training, or at scale, you should seek independent legal counsel.

## Ethical and content warning

The documents contain material related to:

- Sexual abuse and exploitation
- Trafficking
- Violence and other highly sensitive topics
- Unverified allegations, opinions, or speculation

## Intended use and limitations

Recommended / common use cases include:

- Text mining and exploratory analysis of the *public record* surrounding the Epstein estate documents.
- Search / retrieval experiments (e.g., indexing, ranking, IR/RAG prototypes) conducted in controlled or research settings.
- Qualitative review by journalists, historians, or legal scholars.
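
The search and retrieval use case above can be prototyped with very little machinery. The following is a minimal sketch of TF-IDF ranking using only the Python standard library; it is not the method used by the dataset authors, and the document IDs and texts are invented placeholders rather than content from the corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs; crude, but tolerant of stray OCR symbols."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document term counts and smoothed inverse document frequencies."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return term_counts, idf


def search(query, term_counts, idf, top_k=3):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    query_terms = tokenize(query)
    scores = {}
    for doc_id, counts in term_counts.items():
        length = sum(counts.values()) or 1
        score = sum((counts[t] / length) * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Placeholder documents keyed by hypothetical relative paths.
docs = {
    "TEXT/a.txt": "flight schedule and travel itinerary for the week",
    "IMAGES/b.jpg": "invoice for office supplies and printer paper",
    "TEXT/c.txt": "travel reimbursement form submitted to the accounting office last month",
}
term_counts, idf = build_index(docs)
results = search("travel itinerary", term_counts, idf)
for doc_id, score in results:
    print(f"{doc_id}\t{score:.3f}")
```

Keying results by the original relative path keeps each hit traceable to its source file, which helps when separating retrieved text from verified facts. For a corpus of this size, a dedicated search library or a vector index would likely be a better fit than this in-memory loop.
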

