genevera/epstein-files-ocr-complete
Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/genevera/epstein-files-ocr-complete
Resource description:
---
license: cc0-1.0
task_categories:
- question-answering
- text-classification
- text-retrieval
language:
- en
tags:
- epstein
- jeffrey-epstein
- epstein-files
- epstein-case
- court-documents
- depositions
- unsealed-documents
- fbi-files
- legal
- flight-logs
- private-jet
- passenger-list
- island-visits
- us-law
- news
- politics
- corruption
- elite-networks
- power-networks
- social-graph
- network-analysis
- named-entities
- entity-linking
- relationship-extraction
- relation-extraction
- summarization
- investigative-journalism
- open-source-intelligence
- osint
- ocr
size_categories:
- 1M<n<10M
---
# Epstein Files — Complete OCR Dataset
> This is a comprehensive, structured publication of the Epstein Files OCR dataset, significantly expanding upon the earlier [Datasets 1-8 release](https://huggingface.co/datasets/ishumilin/epstein-files-ocr-datasets-1-8-early-release).
## Dataset Summary
This dataset contains **page-level OCR output** compiled from an extensive release of documents related to **Jeffrey Epstein / the Epstein case**.
Each row in this dataset represents **one scanned PDF document** from the original release, processed with a proprietary automated OCR pipeline provided by [Wild Ma-Gässli](https://wildma.ch).
The dataset is designed for:
* Question answering
* Information retrieval
* Downstream NLP tasks such as named entity recognition (NER), entity linking, and relationship extraction.
### Enhancements from Previous Versions
- **Scale:** This structured release covers **1,380,935 PDF documents**, comprising over **2,700,000 total pages**.
- **Format:** Restructured from individual `.md` files into a more efficient **Parquet** format.
- **Document Linking:** Each page retains its original `document_id` (e.g., `EFTA00000001`), resolving the limitation from earlier releases where pages could not be easily traced back to their source PDFs.
## Supported Tasks
* Text retrieval / search (BM25, hybrid, dense retrieval)
* Question answering over retrieved context (RAG)
* Entity extraction (names, places, phone numbers, dates) from noisy OCR
* Social graph and network analysis
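As an illustration of the retrieval task listed above, the following is a minimal sketch of Okapi BM25 ranking over rows shaped like this dataset's schema (`document_id`, `content`). The function names `tokenize` and `bm25_rank`, the lowercase alphanumeric tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the toy rows with their `EXAMPLE-*` identifiers are all assumptions made for the example; none of them come from the dataset or from the Epstein Chat tool.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: a simple lowercase alphanumeric tokenizer; noisy OCR text
    # will usually need something more forgiving than this.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, rows: list[dict], k1: float = 1.5, b: float = 0.75):
    """Return (document_id, score) pairs sorted by descending BM25 score."""
    docs = [tokenize(row["content"]) for row in rows]
    n = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scored = []
    for row, doc in zip(rows, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((row["document_id"], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy rows; these are not taken from the dataset.
toy_rows = [
    {"document_id": "EXAMPLE-A", "content": "flight log with departure dates"},
    {"document_id": "EXAMPLE-B", "content": "deposition transcript with exhibit list"},
    {"document_id": "EXAMPLE-C", "content": "handwritten note about a phone call"},
]
ranked = bm25_rank("flight log", toy_rows)
print(ranked[0][0])
```

At the scale of this release, an in-memory loop like this would be too slow; it is only meant to show the scoring logic that a real search index would implement.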
## Languages
Primarily English (`en`).
## Related Tools
This dataset is designed to be used with the **Epstein Chat** analysis tool, which provides a RAG (Retrieval-Augmented Generation) interface for querying these documents.
* **GitHub Repository**: [ishumilin/epstein-chat](https://github.com/ishumilin/epstein-chat)
## Dataset Structure
The dataset is provided as a Parquet file, which works natively with Hugging Face's `datasets` library.
### Data Fields
The schema contains the following fields:
- `document_id` (`string`): The identifier of the original document/page (e.g., `EFTA00146767`).
- `content` (`string`): The full OCR-extracted content for that specific document.
**Example Row:**
```json
{
"document_id": "EFTA00146767",
"content": "Hey beautiful. Tried to call you back..."
}
```
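The two-field schema above can be checked and streamed with a short sketch like the one below. It makes several assumptions: the `EFTA` plus eight digits pattern is inferred only from the example identifiers in this card, the `train` split name is the usual default when no splits are defined, and the helper names `is_valid_row` and `stream_rows` are hypothetical. `stream_rows` needs the third-party `datasets` package and network access; `is_valid_row` uses only the standard library.

```python
import re
from typing import Iterator

# Assumption: identifiers are "EFTA" followed by eight digits, as in the
# examples shown in this card (e.g. EFTA00146767).
DOCUMENT_ID_PATTERN = re.compile(r"EFTA\d{8}")


def is_valid_row(row: dict) -> bool:
    """Check that a row carries the two documented string fields."""
    doc_id = row.get("document_id")
    content = row.get("content")
    return (
        isinstance(doc_id, str)
        and isinstance(content, str)
        and DOCUMENT_ID_PATTERN.fullmatch(doc_id) is not None
    )


def stream_rows(repo_id: str = "genevera/epstein-files-ocr-complete") -> Iterator[dict]:
    """Lazily yield rows instead of downloading the whole dataset first."""
    from datasets import load_dataset  # third-party: pip install datasets

    # Assumption: the data is exposed as a single default "train" split.
    yield from load_dataset(repo_id, split="train", streaming=True)


example_row = {
    "document_id": "EFTA00146767",
    "content": "Hey beautiful. Tried to call you back...",
}
print(is_valid_row(example_row))
```

Streaming is used here because the release is large; calling `load_dataset` without `streaming=True` would download the full Parquet data before returning.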
### Splits
No predefined train/validation/test splits.
## Dataset Creation
### Source Data
* **Primary source**: The upstream Epstein Files release, hosted at:
  * Torrent: https://github.com/yung-megafone/Epstein-Files/blob/main/Torrent%20Files/epstein-files-structured-full-20250204.tar.zst.torrent
**Coverage in this dataset:** All PDF files from the upstream release.
### OCR / Preprocessing
OCR was performed on the source PDFs using a **proprietary model** provided by [Wild Ma-Gässli](https://wildma.ch).
## Considerations for Using the Data
### Personal / Sensitive Information
These documents contain **personal data** (names, phone numbers, addresses, emails) and/or information about alleged criminal activity.
**Redaction Policy:**
* This dataset is published as **verbatim OCR output** derived from the public source files.
* **No additional redaction** (masking/removal) has been applied beyond what was already redacted by the DOJ or the original releasing entity.
**Use Responsibly:**
* Comply with applicable laws and platform policies.
* Avoid doxxing or harassment.
* Do not treat OCR text as ground truth; always verify against the original page images/PDFs for high-stakes use.
### Known Limitations
* **OCR noise**: Although the OCR quality has improved, automated extraction can still produce recognition errors and formatting artifacts, or miss obscure characters (especially on poor-quality scans or handwriting). Some pages contain explicit placeholders such as `[hidden text]`, reflecting redactions made by the DOJ in the original documents.
* **Content variance**: Documents range from dense narrative text to unformatted tables and metadata tags.
* **Corrupted Source Files**: Three files from the original release were severely corrupted. Their contents could not be extracted and remain unknown:
  * `EFTA00645624.pdf`
  * `EFTA01175426.pdf`
  * `EFTA01220934.pdf`
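The limitations above can be surfaced per row before any downstream processing. The sketch below flags the three corrupted source files by identifier and counts literal `[hidden text]` placeholders. The helper name `redaction_stats` and the sample rows are hypothetical, and the card does not say whether the corrupted files appear as empty rows or are absent altogether, so the check handles both cases without assuming either.

```python
# Identifiers of the three corrupted source files listed in this card,
# written without the ".pdf" suffix to match the document_id field.
CORRUPTED_IDS = {"EFTA00645624", "EFTA01175426", "EFTA01220934"}

# Placeholder string quoted in the card; other spellings may exist in the data.
PLACEHOLDER = "[hidden text]"


def redaction_stats(row: dict) -> dict:
    """Summarize known data-quality signals for one row."""
    content = row["content"]
    return {
        "document_id": row["document_id"],
        "is_known_corrupted": row["document_id"] in CORRUPTED_IDS,
        "placeholder_count": content.count(PLACEHOLDER),
        "is_empty": not content.strip(),
    }


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"document_id": "EFTA00645624", "content": "  "},
    {"document_id": "EXAMPLE-1", "content": "Call from [hidden text] at [hidden text]."},
]
stats = [redaction_stats(row) for row in sample_rows]
for entry in stats:
    print(entry)
```

A placeholder count is only a rough signal: redactions rendered as black boxes in the scans may leave no textual trace in the OCR output at all.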
### Biases
This dataset reflects:
* The selection, redaction, and presentation choices of the original releasing institution.
* OCR model performance characteristics (better on clean text, worse on handwriting / low-quality scans).
## Licensing
See [`LICENSE`](./LICENSE) for the full CC0 1.0 legal text.
## Citation
If you use this dataset, please cite:
1. The original [public release](https://www.justice.gov/epstein/doj-disclosures).
2. This [dataset](https://huggingface.co/datasets/ishumilin/epstein-files-ocr-complete).
Provider:
genevera



