
EPSTEIN_FILES_20K

ModelScope Community · updated 2025-11-25 · indexed 2025-11-29
Download link:
https://modelscope.cn/datasets/tensonaut/EPSTEIN_FILES_20K
Description:
# U.S. House Oversight Epstein Estate Documents

## Overview

On November 12, 2025, the U.S. House Oversight Committee released over 20,000 pages of documents from the Epstein estate. While intended to serve the public interest, these records remain largely inaccessible as they are scattered across nested folders in mixed file formats.

This dataset aims to democratize access to these public government documents by organizing and converting them into a clean, standardized format suitable for open source investigation. It enables AI researchers and investigative journalists to perform exploratory analysis and build RAG systems capable of surfacing insights that would be impractical to uncover through manual review.

*Dataset was originally shared on r/LocalLLaMA on November 16, 2025; updated and published on Hugging Face based on community feedback.*

---

## Usage Guidelines

This dataset is intended for research and exploratory analysis in support of investigative journalism, with a focus on:

- Evaluating information retrieval and retrieval augmented generation (RAG) systems
- Developing and testing search, clustering, knowledge graph, and summarization tools
- Enabling transparent, reproducible research aligned with open science principles

### User Responsibilities

- Treat individuals mentioned in documents with respect; avoid sensationalism or misuse of sensitive material
- Clearly distinguish model generated content and exploratory findings from verified facts. Cite primary sources where appropriate
- Respect all existing redactions. Do not attempt to identify protected information
- Adhere to journalistic and academic ethics standards

### Prohibited Uses

- Finetuning language models
- Harassment, doxing, or targeted attacks on any individual or group
- Attempts to deanonymize redacted information or circumvent existing redactions
- Presenting unverified allegations as factual claims
- Sensationalizing findings

All use must comply with applicable law, institutional policies, and the terms of the original House release. See the Legal and Ethical sections below before working with this corpus.

---

## Source

All documents originate from the public release **"Oversight Committee Releases Additional Epstein Estate Documents"** published by the House Oversight Committee on **November 12, 2025**:

🔗 [Official Release](https://oversight.house.gov/release/oversight-committee-releases-additional-epstein-estate-documents/)

This dataset is an independent derivative collection and is **not** an official product of the U.S. House of Representatives or the Committee on Oversight and Government Reform.

---

## Preprocessing

- **25,000+ plain text files** derived from the committee's public releases, organized into a single CSV
- Image files (~20,000 JPGs under `IMAGES/`) converted to text using the open source **Tesseract** OCR engine
- Native text files (under `TEXT/`) preserved as is
- Filenames retain original relative paths and naming conventions for cross-referencing

### Known Limitations

The corpus may contain:

- OCR noise and misrecognized characters
- Broken formatting
- Redaction blocks, stamps, or markers inherited from the original scans

---

## How to Contribute

The collaborative nature of this project supports natural checks and balances. Contributions are welcome in three areas:

| Area | Description | Link |
|------|-------------|------|
| **Safety & Accuracy** | Report concerns, inaccuracies, or potential misuse | [EF20K/Safety](https://github.com/EF20K/Safety) |
| **Project Registry** | Register tools, models, or IR systems built on this dataset | [EF20K/Projects](https://github.com/EF20K/Projects) |
| **Dataset Cleaning** | Improve OCR output using vision models | [EF20K/Datasets](https://github.com/EF20K/Datasets) |

---

## Legal and Copyright Status

> ⚠️ **Disclaimer:** Nothing in this section constitutes legal advice.

- Original documents were created by various private individuals and entities, not by the dataset maintainer
- Documents are sourced from releases by the U.S. House Committee on Oversight and Government Reform. Release pages carry standard copyright notices (© 2025), and individual documents may be protected by copyright held by original authors or rights holders
- This dataset:
  - Does **not** assert ownership over underlying documents
  - Does **not** grant any license to reproduce, distribute, or create derivative works beyond what is permitted by law (e.g., fair use)
- Users are solely responsible for ensuring compliance with applicable copyright law, privacy law, institutional policies, and the terms of the original release

If you plan to use this corpus in a public-facing product, for model training, or at scale, seek independent legal counsel.

---

## Content Warning

These documents contain material related to:

- Sexual abuse, exploitation and trafficking
- Violence and other highly sensitive topics
- Unverified allegations, opinions, and speculation

---

## Acknowledgments

Thank you to Hugging Face for hosting this dataset despite its sensitive subject matter, supporting open access to public records and open source tool development for investigative journalism.

---

## Resources

- [Dataset on Hugging Face](https://huggingface.co/datasets/tensonaut/EPSTEIN_FILES_20K)
- [Safety & Reporting](https://github.com/EF20K/Safety)
- [Project Registry](https://github.com/EF20K/Projects)
- [Dataset Improvements](https://github.com/EF20K/Datasets)
- [Official Government Release](https://oversight.house.gov/release/oversight-committee-releases-additional-epstein-estate-documents/)
- [Hugging Face Ethics Guidelines](https://huggingface.co/ethics)

---

*This project processes public government documents to enable open access and open source tool development for investigative journalism. It operates under strict ethical guidelines with community oversight.*
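
---

## Example: Reading the CSV

The Preprocessing notes say the corpus ships as a single CSV whose filenames keep their original relative paths, with OCR output coming from `IMAGES/` and native text from `TEXT/`. The sketch below shows one way a retrieval experiment might tag each row by origin, so that OCR noise can be accounted for separately. It is not the maintainer's code. The column names `filename` and `text`, the helper names, and the sample paths are all assumptions made for illustration; check the actual CSV header before relying on them.

```python
import csv
import io
from collections import Counter

# Assumed column names (not confirmed by the dataset card); adjust to the real header.
FILENAME_COL = "filename"
TEXT_COL = "text"


def classify_origin(path: str) -> str:
    """Label a row by the top-level folder names mentioned in the dataset card."""
    parts = path.replace("\\", "/").strip("/").split("/")
    if "IMAGES" in parts:
        return "ocr"      # text produced by Tesseract from a JPG scan
    if "TEXT" in parts:
        return "native"   # text file preserved as is
    return "unknown"


def load_rows(handle):
    """Yield (filename, origin, text) tuples from an open CSV file handle."""
    for row in csv.DictReader(handle):
        name = row[FILENAME_COL]
        yield name, classify_origin(name), row[TEXT_COL]


# Synthetic stand-in for the real CSV; paths and text are hypothetical placeholders.
SAMPLE_CSV = (
    "filename,text\n"
    "IMAGES/001/PAGE-0001.jpg,scanned page text\n"
    "TEXT/001/NOTE-0001.txt,native page text\n"
)

rows = list(load_rows(io.StringIO(SAMPLE_CSV)))
origin_counts = Counter(origin for _, origin, _ in rows)
print(origin_counts)
```

To run this against the real file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")`. Keeping the origin label next to each row makes it easier to report retrieval results for OCR and native text separately, and to trace any finding back to its primary source file.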

Provider:
maas
Created:
2025-11-24