datamatters24/research-document-archive
Hugging Face. Updated 2026-04-15. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/datamatters24/research-document-archive
Resource description:
---
language:
- en
license: cc0-1.0
task_categories:
- text-classification
- token-classification
- feature-extraction
size_categories:
- 100K<n<1M
tags:
- declassified
- government
- historical
- ocr
- entities
- embeddings
---
# Research Document Archive
Computational analysis of **234,630 declassified U.S. government documents** across 7 archival collections. The dataset is the output of a 13-step ML pipeline that extracts OCR text, entities, topics, keywords, redactions, and semantic embeddings from 3.1 million pages.
## Files
| File | Rows | Description |
|------|------|-------------|
| `documents.parquet` | 234,630 | Document metadata: id, source_section, file_path, file_hash, total_pages |
| `pages/<section>.parquet` | 3.1M | Per-page OCR text + 1536-dim page embeddings (see Methodology for the embedding models). Sharded by collection. |
| `entities/<section>.parquet` | 31M | Named entities (spaCy `en_core_web_lg`): PERSON, ORG, DATE, GPE, FAC, LOC, NORP, EVENT. Sharded by collection. |
| `document_topics.parquet` | 234,629 | BART-large-MNLI zero-shot topic assignments (per doc, top-1 topic + probability) |
| `document_features.parquet` | 1.17M | EAV feature table: `redaction_summary`, `forensic_metadata`, `bertopic`, `sentiment`, `topic_distribution`, `exact_duplicate` |
| `document_keywords.parquet` | 3.5M | TF-IDF keywords (top 15 per document, unigrams + bigrams) |
| `document_dates.parquet` | 234,630 | Inferred document dates (regex + header parsing) |
| `document_events.parquet` | 296K | Document-to-historical-event correlations (20 crisis events) |
| `historical_events.parquet` | 20 | Crisis event dictionary (event_name, date range, category, keywords) |
| `entity_relationships.parquet` | 2.88M | Entity co-occurrence pairs with counts, distances, and sample documents |
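`document_features.parquet` is described as an EAV (entity-attribute-value) table, so each document has one row per feature rather than one column per feature. A minimal sketch of pivoting such rows into one record per document follows. The column names `document_id`, `feature_name`, and `feature_value` and the sample values are assumptions for illustration, not the file's confirmed schema.

```python
from collections import defaultdict

# Hypothetical EAV rows: (document_id, feature_name, feature_value).
# Column names and values are illustrative; check the real schema first.
rows = [
    (1, "sentiment", "neutral"),
    (1, "exact_duplicate", "false"),
    (2, "sentiment", "negative"),
]

def pivot_eav(rows):
    """Group EAV rows into one {feature_name: feature_value} dict per document."""
    wide = defaultdict(dict)
    for document_id, feature_name, feature_value in rows:
        wide[document_id][feature_name] = feature_value
    return dict(wide)

features_by_doc = pivot_eav(rows)
print(features_by_doc[1])
```

Documents that lack a given feature simply have no key for it, so lookups should use `.get()` rather than indexing.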
## Collections
| source_section | Documents | Source |
|---|---|---|
| `cia_declassified` | 1,605 | CIA Reading Room |
| `cia_mkultra` | 1,936 | MKULTRA release |
| `cia_stargate` | 13,937 | Stargate remote viewing program |
| `doj_disclosures` | — | DOJ public disclosures |
| `house_resolutions` | 181,092 | House.gov bill text (GovInfo API) |
| `jfk_assassination` | 35,979 | National Archives JFK release |
| `lincoln_archives` | 21 | Library of Congress |
## Loading
```python
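import pyarrow.parquet as pq
from datasets import load_dataset
from huggingface_hub import HfFileSystem

# Load the document metadata table.
docs = load_dataset("datamatters24/research-document-archive", data_files="documents.parquet")

# Or load a specific sharded table, resolving the Hub path via an explicit filesystem:
shard = "datasets/datamatters24/research-document-archive/pages/cia_mkultra.parquet"
pages = pq.read_table(shard, filesystem=HfFileSystem())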
```
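Because `pages/` and `entities/` are sharded by collection, loading a whole table means iterating over the `source_section` values. The sketch below only builds `hf://` paths from the section names listed under Collections. `shard_paths` is a hypothetical helper, and it assumes every section has a shard in each table, which this card does not confirm.

```python
REPO = "datamatters24/research-document-archive"

# source_section values as listed in the Collections table.
SECTIONS = [
    "cia_declassified", "cia_mkultra", "cia_stargate", "doj_disclosures",
    "house_resolutions", "jfk_assassination", "lincoln_archives",
]

def shard_paths(table, sections=SECTIONS, repo=REPO):
    """Return hf:// paths for a sharded table such as "pages" or "entities"."""
    return [f"hf://datasets/{repo}/{table}/{section}.parquet" for section in sections]

for path in shard_paths("entities", sections=["cia_mkultra", "lincoln_archives"]):
    print(path)
```

Each path can be handed to a reader that resolves `hf://` URLs through `huggingface_hub`, such as `pandas.read_parquet`. Starting with a small collection like `lincoln_archives` is a cheap way to inspect the schema.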
## Methodology
- **OCR**: Tesseract + PyMuPDF
- **NER**: spaCy `en_core_web_lg`
- **Topics**: BART-large-MNLI (zero-shot) + BERTopic (unsupervised)
- **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384-dim) and OpenAI text-embedding-3-small (1536-dim) per page
- **Redaction detection**: OpenCV contour analysis on PDF-rendered pages
- **Entity relationships**: page-window co-occurrence with distance weighting
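The entity-relationship step is described only as page-window co-occurrence with distance weighting. One way such a score could be computed is sketched below. The window size, the `1 / (1 + distance)` weight, and the `(page, entity)` input format are assumptions chosen for illustration and may differ from the pipeline's actual implementation.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(mentions, window=1):
    """Score entity pairs mentioned within `window` pages of each other.

    mentions: list of (page_number, entity) tuples for one document.
    Returns {(entity_a, entity_b): weight}, with closer mentions weighted higher.
    """
    scores = defaultdict(float)
    for (page_a, ent_a), (page_b, ent_b) in combinations(mentions, 2):
        distance = abs(page_a - page_b)
        if ent_a == ent_b or distance > window:
            continue
        pair = tuple(sorted((ent_a, ent_b)))  # order-independent key
        scores[pair] += 1.0 / (1.0 + distance)  # assumed weighting
    return dict(scores)

# Hypothetical mentions from a three-page document.
mentions = [(1, "CIA"), (1, "Langley"), (2, "CIA"), (3, "Dallas")]
scores = cooccurrence(mentions)
print(scores)
```

Same-page mentions contribute a full count and mentions one page apart contribute half, so pairs that repeatedly appear close together accumulate the highest scores.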
Code: https://github.com/tedrubin80/Massivedata-Pull
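The per-page embeddings listed under Methodology lend themselves to nearest-neighbour lookup. A minimal pure-Python sketch of cosine-similarity ranking follows. The toy 3-dimensional vectors, the page ids, and the `top_k` helper are hypothetical stand-ins for the real 384- or 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, page_vectors, k=2):
    """Return the k page ids whose vectors are most similar to `query`."""
    ranked = sorted(
        page_vectors,
        key=lambda page_id: cosine_similarity(query, page_vectors[page_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors standing in for real page embeddings (illustrative only).
page_vectors = {
    "page_a": [1.0, 0.0, 0.0],
    "page_b": [0.9, 0.1, 0.0],
    "page_c": [0.0, 0.0, 1.0],
}
nearest = top_k([1.0, 0.0, 0.0], page_vectors)
print(nearest)
```

For real queries, the query text would need to be embedded with the same model that produced the stored page vectors, since vectors from different models are not comparable.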
Provider:
datamatters24



