ishumilin/epstein-files-ocr-complete
Source: Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ishumilin/epstein-files-ocr-complete
Resource description:
---
license: cc0-1.0
task_categories:
- question-answering
- text-classification
- text-retrieval
language:
- en
tags:
- epstein
- jeffrey-epstein
- epstein-files
- epstein-case
- court-documents
- depositions
- unsealed-documents
- fbi-files
- legal
- flight-logs
- private-jet
- passenger-list
- island-visits
- us-law
- news
- politics
- corruption
- elite-networks
- power-networks
- social-graph
- network-analysis
- named-entities
- entity-linking
- relationship-extraction
- relation-extraction
- summarization
- investigative-journalism
- open-source-intelligence
- osint
- ocr
size_categories:
- 1M<n<10M
---
# Epstein Files — Complete OCR Dataset
>
> This is a comprehensive, structured publication of the Epstein Files OCR dataset, significantly expanding upon the earlier [Datasets 1-8 release](https://huggingface.co/datasets/ishumilin/epstein-files-ocr-datasets-1-8-early-release).
>
## Dataset Summary
This dataset contains **page-level OCR output** compiled from an extensive release of documents related to **Jeffrey Epstein / the Epstein case**.
Each row in this dataset represents **one scanned PDF document** from the original release. The text was extracted with a proprietary automated OCR pipeline provided by [Wild Ma-Gässli](https://wildma.ch).
The dataset is designed for:
* Question answering
* Information retrieval
* Downstream NLP tasks such as named entity recognition (NER), entity linking, and relationship extraction.
### Enhancements from Previous Versions
- **Scale:** This structured release covers **1,380,935 PDF documents**, comprising over **2,700,000 total pages**.
- **Format:** Restructured from individual `.md` files into a more efficient **Parquet** format.
- **Document Linking:** Each page retains its original `document_id` (e.g., `EFTA00000001`), resolving the limitation from earlier releases where pages could not be easily traced back to their source PDFs.
## Supported Tasks
* Text retrieval / search (BM25, hybrid, dense retrieval)
* Question answering over retrieved context (RAG)
* Entity extraction (names, places, phone numbers, dates) from noisy OCR
* Social graph and network analysis
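As a rough illustration of the retrieval task listed above, the sketch below ranks rows against a query with Okapi BM25 in plain Python. The sample rows are made up (placeholder IDs and text, not real records), and the `tokenize` helper and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; nothing here is taken from Epstein Chat or any other tool, and a real pipeline would more likely rely on a search library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase alphanumeric tokens; crude, but tolerant of OCR punctuation noise.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_rank(rows, query, k1=1.5, b=0.75):
    """Return (document_id, score) pairs sorted by descending BM25 score."""
    docs = [tokenize(r["content"]) for r in rows]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: number of rows containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for row, doc in zip(rows, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((row["document_id"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up rows in the dataset's two-field schema (placeholder IDs and text).
sample_rows = [
    {"document_id": "SAMPLE0001", "content": "Quarterly invoice for office supplies and printer paper."},
    {"document_id": "SAMPLE0002", "content": "Meeting schedule: the office opens at nine on Monday."},
    {"document_id": "SAMPLE0003", "content": "Weather report with no matching terms."},
]
ranking = bm25_rank(sample_rows, "office invoice")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because OCR noise can split or merge words, exact-match scoring like this may miss relevant rows; combining it with dense retrieval, as the list above suggests, is one way to compensate.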
## Languages
Primarily English (`en`).
## Related Tools
This dataset is designed to be used with the **Epstein Chat** analysis tool, which provides a RAG (Retrieval-Augmented Generation) interface for querying these documents.
* **GitHub Repository**: [ishumilin/epstein-chat](https://github.com/ishumilin/epstein-chat)
## Dataset Structure
The dataset is provided as a Parquet file, which works natively with Hugging Face's `datasets` library.
### Data Fields
The schema contains the following fields:
- `document_id` (`string`): The identifier of the original document/page (e.g., `EFTA00146767`).
- `content` (`string`): The full OCR-extracted content for that specific document.
**Example Row:**
```json
{
"document_id": "EFTA00146767",
"content": "Hey beautiful. Tried to call you back..."
}
```
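A light sanity check can catch malformed rows early. The sketch below assumes, based only on the example IDs shown in this card, that `document_id` is `EFTA` followed by eight digits; that pattern and the `validate_row` helper are assumptions for illustration, not a documented guarantee of the dataset.

```python
import re

# Assumed ID shape, inferred from examples such as EFTA00146767; not a documented rule.
DOCUMENT_ID_PATTERN = re.compile(r"EFTA\d{8}")

def validate_row(row):
    """Return a list of problems found in one row; an empty list means it looks fine."""
    problems = []
    for field in ("document_id", "content"):
        if not isinstance(row.get(field), str):
            problems.append(f"{field} is missing or not a string")
    doc_id = row.get("document_id")
    if isinstance(doc_id, str) and not DOCUMENT_ID_PATTERN.fullmatch(doc_id):
        problems.append(f"unexpected document_id format: {doc_id!r}")
    return problems

# The example row from this card, followed by a deliberately malformed one.
example = {"document_id": "EFTA00146767", "content": "Hey beautiful. Tried to call you back..."}
print(validate_row(example))
print(validate_row({"document_id": "efta-1", "content": None}))
```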
### Splits
No predefined train/validation/test splits.
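When a repository defines no splits, the `datasets` library typically exposes the data under a single default split, usually named `train`. That split name is an assumption here and worth confirming on the Hub. The sketch below streams rows instead of downloading the full release first; `stream_rows` and `take` are hypothetical helper names.

```python
from itertools import islice

REPO_ID = "ishumilin/epstein-files-ocr-complete"

def stream_rows(split="train"):
    """Lazily iterate over rows without downloading the whole dataset first.

    The split name "train" is an assumption (the usual default when a
    repository defines no splits); adjust it if the Hub shows otherwise.
    """
    from datasets import load_dataset  # third-party: pip install datasets
    return load_dataset(REPO_ID, split=split, streaming=True)

def take(rows, n):
    """Return the first n rows of any iterable as a list."""
    return list(islice(rows, n))

# Usage (requires network access and the `datasets` package):
#   for row in take(stream_rows(), 3):
#       print(row["document_id"], len(row["content"]))
```

Streaming keeps memory use low while exploring; for repeated full passes, downloading the Parquet data once may be the better choice.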
## Dataset Creation
### Source Data
* **Primary source**: The upstream Epstein Files release hosted at:
  * Torrent: https://github.com/yung-megafone/Epstein-Files/blob/main/Torrent%20Files/epstein-files-structured-full-20250204.tar.zst.torrent
**Coverage in this dataset:** All PDF files from the upstream release.
### OCR / Preprocessing
OCR was performed on this dataset using a **proprietary model** provided by [Wild Ma-Gässli](https://wildma.ch).
## Considerations for Using the Data
### Personal / Sensitive Information
These documents contain **personal data** (names, phone numbers, addresses, emails) and/or information about alleged criminal activity.
**Redaction Policy:**
* This dataset is published as **verbatim OCR output** derived from the public source files.
* **No additional redaction** (masking/removal) has been applied beyond what was already redacted by the DOJ or the original releasing entity.
**Use Responsibly:**
* Comply with applicable laws and platform policies.
* Avoid doxxing or harassment.
* Do not treat OCR text as ground truth; always verify against the original page images/PDFs for high-stakes use.
### Known Limitations
* **OCR noise**: Although improved, automated extraction can still produce recognition errors and formatting artifacts, and it can miss obscure characters (especially on poor-quality scans or handwriting). Some pages contain explicit placeholders such as `[hidden text]`, reflecting original redactions made by the DOJ.
* **Content variance**: Documents range from dense narrative text to unformatted tables and metadata tags.
* **Corrupted Source Files**: Three files from the original release were severely corrupted and their contents remain unknown and unextracted:
  * `EFTA00645624.pdf`
  * `EFTA01175426.pdf`
  * `EFTA01220934.pdf`
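When computing downstream statistics, it can help to flag rows that contain redaction placeholders and to keep track of the source files that could not be extracted. The sketch below assumes the placeholder appears literally as `[hidden text]` (the wording "such as" above suggests other variants may exist) and uses a made-up sample row; `describe_row` is a hypothetical helper.

```python
# Source files reported as corrupted in this card (IDs taken from the file names).
CORRUPTED_DOCUMENT_IDS = frozenset({"EFTA00645624", "EFTA01175426", "EFTA01220934"})

# Literal placeholder mentioned in the card; other variants may exist.
REDACTION_PLACEHOLDER = "[hidden text]"

def describe_row(row):
    """Summarize limitation-related properties of one row."""
    content = row.get("content") or ""
    return {
        "document_id": row["document_id"],
        "known_corrupted_source": row["document_id"] in CORRUPTED_DOCUMENT_IDS,
        "redaction_placeholders": content.count(REDACTION_PLACEHOLDER),
        "is_empty": not content.strip(),
    }

# Made-up row for illustration only (placeholder ID and text).
sample = {"document_id": "SAMPLE0001", "content": "Call [hidden text] at [hidden text]."}
print(describe_row(sample))
```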
### Biases
This dataset reflects:
* The selection, redaction, and presentation choices of the original releasing institution.
* OCR model performance characteristics (better on clean text, worse on handwriting / low-quality scans).
## Licensing
See [`LICENSE`](./LICENSE) for the full CC0 1.0 legal text.
## Citation
If you use this dataset, please cite:
1. The original [public release](https://www.justice.gov/epstein/doj-disclosures).
2. This [dataset](https://huggingface.co/datasets/ishumilin/epstein-files-ocr-complete).
Provider: ishumilin



