genevera/epstein-files-ocr-complete

Name: genevera/epstein-files-ocr-complete
Creator: genevera
Published: 2026-04-18 20:24:30
License: No description available

Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/genevera/epstein-files-ocr-complete

Resource overview:

---
license: cc0-1.0
task_categories:
- question-answering
- text-classification
- text-retrieval
language:
- en
tags:
- epstein
- jeffrey-epstein
- epstein-files
- epstein-case
- court-documents
- depositions
- unsealed-documents
- fbi-files
- legal
- flight-logs
- private-jet
- passenger-list
- island-visits
- us-law
- news
- politics
- corruption
- elite-networks
- power-networks
- social-graph
- network-analysis
- named-entities
- entity-linking
- relationship-extraction
- relation-extraction
- summarization
- investigative-journalism
- open-source-intelligence
- osint
- ocr
size_categories:
- 1M<n<10M
---

# Epstein Files — Complete OCR Dataset

> This is a comprehensive, structured publication of the Epstein Files OCR dataset, significantly expanding upon the earlier [Datasets 1-8 release](https://huggingface.co/datasets/ishumilin/epstein-files-ocr-datasets-1-8-early-release).

## Dataset Summary

This dataset contains **page-level OCR output** compiled from an extensive release of documents related to **Jeffrey Epstein / the Epstein case**. Each row in this dataset represents **one scanned PDF document** from the original release, processed with a proprietary automated OCR pipeline provided by [Wild Ma-Gässli](https://wildma.ch).

The dataset is designed for:

* Question answering
* Information retrieval
* Downstream NLP tasks such as named entity recognition (NER), entity linking, and relationship extraction.

### Enhancements from Previous Versions

- **Scale:** This structured release covers **1,380,935 PDF documents**, comprising over **2,700,000 total pages**.
- **Format:** Restructured from individual `.md` files into a more efficient **Parquet** format.
- **Document Linking:** Each page retains its original `document_id` (e.g., `EFTA00000001`), resolving the limitation from earlier releases where pages could not be easily traced back to their source PDFs.

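The document-linking point above can be illustrated with a minimal sketch that validates a `document_id` and derives the name of its source PDF. It assumes that every ID follows the pattern seen in this card's examples (`EFTA` followed by eight digits) and that the source file is named `<document_id>.pdf`, as the corrupted-file names listed in this card suggest. The helper names `parse_document_id` and `source_pdf_name` are hypothetical and not part of the dataset or any library.

```python
import re

# Assumption: IDs look like "EFTA" plus eight digits, as in the card's examples.
DOCUMENT_ID_PATTERN = re.compile(r"EFTA(\d{8})")


def parse_document_id(document_id: str) -> int:
    """Return the numeric part of a document ID, or raise ValueError."""
    match = DOCUMENT_ID_PATTERN.fullmatch(document_id)
    if match is None:
        raise ValueError(f"unexpected document_id format: {document_id!r}")
    return int(match.group(1))


def source_pdf_name(document_id: str) -> str:
    """Map a document ID to its presumed source PDF file name."""
    parse_document_id(document_id)  # validates the format first
    return f"{document_id}.pdf"
```

Validating the ID before building a file name keeps malformed values (for example, IDs damaged by OCR or truncation) from silently producing paths that do not exist upstream.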
## Supported Tasks

* Text retrieval / search (BM25, hybrid, dense retrieval)
* Question answering over retrieved context (RAG)
* Entity extraction (names, places, phone numbers, dates) from noisy OCR
* Social graph and network analysis

## Languages

Primarily English (`en`).

## Related Tools

This dataset is designed to be used with the **Epstein Chat** analysis tool, which provides a RAG (Retrieval-Augmented Generation) interface for querying these documents.

* **GitHub Repository**: [ishumilin/epstein-chat](https://github.com/ishumilin/epstein-chat)

## Dataset Structure

The dataset is provided as a Parquet file, which works natively with Hugging Face's `datasets` library.

### Data Fields

The schema contains the following fields:

- `document_id` (`string`): The identifier of the original document/page (e.g., `EFTA00146767`).
- `content` (`string`): The full OCR-extracted content for that specific document.

**Example Row:**

```json
{
  "document_id": "EFTA00146767",
  "content": "Hey beautiful. Tried to call you back..."
}
```

### Splits

No predefined train/validation/test splits.

## Dataset Creation

### Source Data

* **Primary source**: The upstream Epstein Files release hosted at:
  * Torrent: https://github.com/yung-megafone/Epstein-Files/blob/main/Torrent%20Files/epstein-files-structured-full-20250204.tar.zst.torrent

**Coverage in this dataset:** All PDF files from the upstream release.

### OCR / Preprocessing

OCR was performed on this dataset using a **proprietary model** provided by [Wild Ma-Gässli](https://wildma.ch).

## Considerations for Using the Data

### Personal / Sensitive Information

These documents contain **personal data** (names, phone numbers, addresses, emails) and/or information about alleged criminal activity.

**Redaction Policy:**

* This dataset is published as **verbatim OCR output** derived from the public source files.
* **No additional redaction** (masking/removal) has been applied beyond what was already redacted by the DOJ or the original releasing entity.

**Use Responsibly:**

* Comply with applicable laws and platform policies.
* Avoid doxxing or harassment.
* Do not treat OCR text as ground truth; always verify against the original page images/PDFs for high-stakes use.

### Known Limitations

* **OCR noise**: While improved, automated extraction can produce recognition errors, incorrect formatting artifacts, or miss obscure characters (especially on poor-quality scans or handwriting). Some pages contain explicit placeholders such as `[hidden text]` reflecting original redactions made by DOJ.
* **Content variance**: Documents range from dense narrative text to unformatted tables and metadata tags.
* **Corrupted Source Files**: Three files from the original release were severely corrupted and their contents remain unknown and unextracted:
  * `EFTA00645624.pdf`
  * `EFTA01175426.pdf`
  * `EFTA01220934.pdf`

### Biases

This dataset reflects:

* The selection, redaction, and presentation choices of the original releasing institution.
* OCR model performance characteristics (better on clean text, worse on handwriting / low-quality scans).

## Licensing

See [`LICENSE`](./LICENSE) for the full CC0 1.0 legal text.

## Citation

If you use this dataset, please cite:

1. The original [public release](https://www.justice.gov/epstein/doj-disclosures).
2. This [dataset](https://huggingface.co/datasets/ishumilin/epstein-files-ocr-complete).
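As a rough illustration of the Known Limitations above, the sketch below counts `[hidden text]` placeholders per row and flags the three document IDs whose source PDFs are listed as corrupted. It assumes rows are plain dictionaries with the `document_id` and `content` fields described under Data Fields. Whether the corrupted documents appear as rows at all is not stated in this card, so the flag is only a precaution. The function names and the sample row are hypothetical.

```python
from typing import Dict, Iterable, List

# Placeholder string and corrupted file IDs are taken from "Known Limitations".
REDACTION_PLACEHOLDER = "[hidden text]"
CORRUPTED_DOCUMENT_IDS = {"EFTA00645624", "EFTA01175426", "EFTA01220934"}


def count_redactions(content: str) -> int:
    """Count explicit redaction placeholders in one row's OCR text."""
    return content.count(REDACTION_PLACEHOLDER)


def flag_rows(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Annotate each row with its redaction count and a corrupted-source flag."""
    flagged = []
    for row in rows:
        flagged.append(
            {
                "document_id": row["document_id"],
                "redactions": count_redactions(row["content"]),
                "known_corrupted": row["document_id"] in CORRUPTED_DOCUMENT_IDS,
            }
        )
    return flagged


if __name__ == "__main__":
    # Made-up sample row for illustration only; not taken from the dataset.
    sample = [{"document_id": "EFTA00000001", "content": "Call [hidden text] today."}]
    print(flag_rows(sample))
```

A count like this only detects explicit placeholders; redactions that OCR rendered as blank space or noise would still need checking against the original PDFs.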

Provided by:

genevera
