datamatters24/research-document-archive

Source: Hugging Face (updated 2026-04-15, indexed 2026-04-12)
Download link: https://hf-mirror.com/datasets/datamatters24/research-document-archive
Resource description:
---
language:
- en
license: cc0-1.0
task_categories:
- text-classification
- token-classification
- feature-extraction
size_categories:
- 100K<n<1M
tags:
- declassified
- government
- historical
- ocr
- entities
- embeddings
---

# Research Document Archive

Computational analysis of **234,630 declassified U.S. government documents** across 7 archival collections. The dataset is the output of a 13-step ML pipeline that extracts OCR text, entities, topics, keywords, redactions, and semantic embeddings from 3.1 million pages.

## Files

| File | Rows | Description |
|------|------|-------------|
| `documents.parquet` | 234,630 | Document metadata: id, source_section, file_path, file_hash, total_pages |
| `pages/<section>.parquet` | 3.1M | Per-page OCR text + 1536-dim embeddings (see Methodology for the embedding models). Sharded by collection. |
| `entities/<section>.parquet` | 31M | Named entities (spaCy `en_core_web_lg`): PERSON, ORG, DATE, GPE, FAC, LOC, NORP, EVENT. Sharded by collection. |
| `document_topics.parquet` | 234,629 | BART-large-MNLI zero-shot topic assignments (per doc, top-1 topic + probability) |
| `document_features.parquet` | 1.17M | EAV feature table: `redaction_summary`, `forensic_metadata`, `bertopic`, `sentiment`, `topic_distribution`, `exact_duplicate` |
| `document_keywords.parquet` | 3.5M | TF-IDF keywords (top 15 per document, unigrams + bigrams) |
| `document_dates.parquet` | 234,630 | Inferred document dates (regex + header parsing) |
| `document_events.parquet` | 296K | Document-to-historical-event correlations (20 crisis events) |
| `historical_events.parquet` | 20 | Crisis event dictionary (event_name, date range, category, keywords) |
| `entity_relationships.parquet` | 2.88M | Entity co-occurrence pairs with counts, distances, and sample documents |

## Collections

| source_section | Documents | Source |
|---|---|---|
| `cia_declassified` | 1,605 | CIA Reading Room |
| `cia_mkultra` | 1,936 | MKULTRA release |
| `cia_stargate` | 13,937 | Stargate remote viewing program |
| `doj_disclosures` | — | DOJ public disclosures |
| `house_resolutions` | 181,092 | House.gov bill text (GovInfo API) |
| `jfk_assassination` | 35,979 | National Archives JFK release |
| `lincoln_archives` | 21 | Library of Congress |

## Loading

```python
from datasets import load_dataset
docs = load_dataset("datamatters24/research-document-archive", data_files="documents.parquet")

# Or load a specific sharded table:
import pyarrow.parquet as pq
from huggingface_hub import HfFileSystem

# Pass the Hub filesystem explicitly so pyarrow does not have to resolve hf:// itself.
pages = pq.read_table("datasets/datamatters24/research-document-archive/pages/cia_mkultra.parquet", filesystem=HfFileSystem())
```

## Methodology

- **OCR**: Tesseract + PyMuPDF
- **NER**: spaCy `en_core_web_lg`
- **Topics**: BART-large-MNLI (zero-shot) + BERTopic (unsupervised)
- **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (384-dim) and OpenAI text-embedding-3-small (1536-dim) per page
- **Redaction detection**: OpenCV contour analysis on PDF-rendered pages
- **Entity relationships**: page-window co-occurrence with distance weighting

Code: https://github.com/tedrubin80/Massivedata-Pull
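The "page-window co-occurrence with distance weighting" step is not specified further in this card, so the following is only a minimal sketch of what such a scheme could look like, not the pipeline's actual implementation. The function name `cooccurrence_weights`, the `window` parameter, and the `1 / (1 + distance)` weighting are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_weights(page_entities, window=1):
    """Score entity pairs that appear on the same or nearby pages.

    page_entities: list of sets of entity strings, one set per page, in page order.
    window: maximum page distance at which two entities still count as co-occurring
            (hypothetical parameter, not taken from the dataset card).
    Returns {(a, b): weight} with a < b. Each co-occurrence at page distance d
    adds 1 / (1 + d), which is an assumed weighting chosen for this sketch.
    """
    weights = defaultdict(float)
    last_page = len(page_entities) - 1
    for i, left in enumerate(page_entities):
        for j in range(i, min(i + window, last_page) + 1):
            distance = j - i
            if distance == 0:
                # Same page: every unordered pair of distinct entities, once.
                pairs = combinations(sorted(left), 2)
            else:
                # Nearby pages: one entity from each page, skipping self-pairs.
                pairs = ((a, b) for a in left for b in page_entities[j] if a != b)
            for a, b in pairs:
                key = (a, b) if a < b else (b, a)
                weights[key] += 1.0 / (1 + distance)
    return dict(weights)


# Toy three-page document; the entity strings are illustrative only.
pages = [{"CIA", "Helms"}, {"MKULTRA"}, {"Helms"}]
for pair, weight in sorted(cooccurrence_weights(pages, window=1).items()):
    print(pair, weight)
```

A reciprocal weight keeps same-page pairs dominant while still crediting entities that sit one or two pages apart; an exponential decay or a hard cutoff would be equally plausible choices.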
Provider: datamatters24