RevolutionCrossroads/nara_revolutionary_war_pension_files_PDFs
收藏Hugging Face2026-04-10 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files_PDFs
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: NAID
dtype: string
- name: naraURL
dtype: string
- name: title
dtype: string
- name: logicalDate
dtype: string
- name: pdfObjectID
dtype: string
- name: pdfURL
dtype: string
- name: pageObjectId
sequence: string
- name: pageURL
sequence: string
- name: numberOfPages
dtype: int64
- name: pageImageType
dtype: string
- name: extractedTextID
sequence: string
- name: extractedText
sequence: string
- name: extractedTextDate
sequence: string
- name: extractedTextContributor
dtype: string
- name: transcriptionID
sequence: string
- name: transcriptionText
sequence: string
- name: transcriptionDate
sequence: string
splits:
- name: train
num_bytes: 2697565677
num_examples: 78926
download_size: 1267069275
dataset_size: 2697565677
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: cc0-1.0
task_categories:
- image-to-text
- text-retrieval
- text-classification
- feature-extraction
language:
- en
tags:
- american-history
- american-revolution
- archives
- national-archives
- pension-files
- genealogy
- america-250
- glam
- lam
pretty_name: Revolutionary War Pension Files - File-Level
size_categories:
- 10K<n<100K
---
# Dataset Card for American Revolutionary War Pension Files - File-Level
## Table of Contents
- [Dataset Summary](#dataset-summary)
- [Dataset Description](#dataset-description)
- [Dataset Details](#dataset-details)
- [Relationship to Source Dataset](#relationship-to-source-dataset)
- [Curation Rationale](#curation-rationale)
- [Dataset Creation](#dataset-creation)
- [Data Collection and Processing](#data-collection-and-processing)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Source Data](#source-data)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Additional Information](#additional-information)
---
## Dataset Summary
A dataset derived from the National Archives and Records Administration (NARA) series *Case Files of Pension and Bounty-Land Warrant Applications Based on American Revolutionary War Service* ([NARA Catalog Series, NAID 300022](https://catalog.archives.gov/id/300022)). This dataset provides a **file-level representation** of Revolutionary War pension records, aggregating individual page records into complete pension files with associated metadata, extracted text, and transcriptions.
It is derived from the page-level dataset:
[https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files](https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files)
This dataset was prepared as part of the Revolution Crossroads project:
[https://www.si.edu/revolution-crossroads](https://www.si.edu/revolution-crossroads)
---
## Dataset Description
For the initiative *Revolution Crossroads*, the Smithsonian Institution prepared this dataset using data, metadata, and digital objects publicly available from the National Archives Catalog.
### Dataset Details
- **Prepared by**: Smithsonian Institution, Office of Digital & Innovation staff
- **Shared by**: Revolution Crossroads
- **Language(s)**: English, some French
- **License**: Public domain
---
### Relationship to Source Dataset
This dataset is a transformation of the original [Revolution Crossroads page-level dataset](https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files). For detailed information about:
- source materials
- digitization processes
- metadata definitions
- rights and provenance
refer to the page-level dataset card:
[https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files/blob/main/README.md](https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files/blob/main/README.md)
---
### Curation Rationale
The original dataset is structured at the page level, where each record represents a single page from a pension file. However, pension files are multi-page archival documents that often contain narratives, affidavits, correspondence, and supporting materials that span multiple pages.
This file-level dataset was created to:
- Preserve document-level context across entire pension files
- Support OCR and text extraction workflows at the file level
- Enable analysis of relationships between individuals, places, and events within a single case file
- Support entity extraction and historical interpretation across full narrative records
---
## Dataset Creation
This dataset was derived from the page-level pension dataset by restructuring records from page-level to file-level.
- Page-level records were grouped by NAID (file-level identifier)
- File-level records retain key metadata from the source dataset
- Page counts were calculated for each file (`numberOfPages`)
- Page-level data (images, extracted text, and transcriptions) were retained as lists within each file-level record
- Some page-level metadata fields not relevant to file-level analysis were removed
A note on archival levels:
- In most cases, records correspond to file-level groupings. However, a small number of records represent an intermediate “item” level between file and page in the NARA Catalog hierarchy
- These item-level records were treated equivalently to file-level records in this dataset
- Any record containing one or more pages was used as the top-level grouping unit (NAID), regardless of whether it was formally classified as a file or item
Each record in this dataset corresponds to a single pension file or equivalent document-level unit.
---
## Data Collection and Processing
### Supporting Files Available
File-level PDF files are provided in the Files tab for this dataset, within the *media* directory.
- **pdfs**
Contains compiled PDF files representing complete pension files. Each PDF corresponds to a single record in the dataset and is constructed from the source page images.
*Note:*
`pdfObjectID` and `pdfURL` fields refer to PDFs provided by the National Archives Catalog where available. These may be incomplete or inconsistently available. The supplementary PDFs included with this dataset were generated to provide consistent, complete file-level representations.
#### PDF Construction
- Source page images were obtained as JPG files from the NARA Catalog
- Images were grouped by top-level NAID and compiled into multi-page PDFs
- PDFs were constructed to represent complete pension files
To ensure compatibility with downstream processing systems:
- Some PDFs were resized when they exceeded size limits during OCR extraction workflows
- Resizing was applied selectively to affected files
- A future update will provide consistently resized PDFs for broader external use
---
## Dataset Structure
Each record in the dataset corresponds to a single pension file. File-level metadata is represented once per record, while page-level data (images, text, transcriptions) are stored as lists.
---
## Data Fields
- **NAID** *(string)*
National Archives Identifier for the pension file.
*Example:*
`111403815`
- **naraURL** *(string)*
Link to the pension file in the National Archives Catalog.
*Example:*
`https://catalog.archives.gov/id/111403815`
- **title** *(string)*
Title of the pension file.
*Example:*
`Revolutionary War Pension and Bounty Land Warrant Application File B. L. Wt. 2,153-400, Joseph Torrey`
- **logicalDate** *(string)*
Normalized or machine-readable date, if available.
*Example:*
`1836-06-02`
- **pdfObjectID** *(string)*
Identifier for a file-level PDF provided by the NARA Catalog, when available.
*Example:*
`57140166`
- **pdfURL** *(string)*
URL to a file-level PDF provided by the NARA Catalog, when available.
*Example:*
`https://s3.amazonaws.com/NARAprodstorage/.../28482-2016.pdf`
- **pageObjectId** *(list)*
List of identifiers for individual pages within the file.
*Example:*
`["111403816", "111403817"]`
- **pageURL** *(list)*
List of URLs to page images.
*Example:*
`["https://s3.amazonaws.com/NARAprodstorage/.../image1.jpg"]`
- **numberOfPages** *(int64)*
Number of pages in the pension file.
*Example:*
`5`
- **pageImageType** *(string)*
File format of the page images.
*Example:*
`JPG`
- **extractedTextID** *(list)*
Identifiers for extracted text records.
*Example:*
`["19537435", "19537436"]`
- **extractedText** *(list)*
Extracted text for each page, provided via automated OCR/AI processes.
*Example:*
`["State of New York...", "City and County of New York..."]`
- **extractedTextDate** *(list)*
Dates when extracted text was created.
*Example:*
`["2024-11-20T13:30:07.000Z"]`
- **extractedTextContributor** *(string)*
Source of the extracted text.
*Example:*
`FamilySearch`
- **transcriptionID** *(list)*
Identifiers for transcription records, if available.
*Example:*
`["60f6e82a-f53d-4de3-a09b-f4f707046c0f"]`
- **transcriptionText** *(list)*
Human-created transcription text for pages, where available.
*Example:*
`["Gorham Me March 22 1868..."]`
- **transcriptionDate** *(list)*
Dates when transcriptions were created or updated.
*Example:*
`["2025-04-30 17:15:44"]`
---
## Source Data
This dataset is derived from the National Archives and Records Administration series:
[https://catalog.archives.gov/id/300022](https://catalog.archives.gov/id/300022)
For full details on the source collection, refer to the page-level dataset card:
[https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files/blob/main/README.md](https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files/blob/main/README.md)
---
## Personal and Sensitive Information
No known personal or sensitive information is included beyond what appears in historical archival records. These materials may contain outdated or offensive terminology reflective of the period in which they were created.
---
## Considerations for Using the Data
### Risks and Limitations
- Pension files vary widely in length, structure, and completeness
- Extracted text accuracy varies, especially for handwritten documents
- Transcriptions are available for only a portion of records
- Metadata may be incomplete or inconsistent
### Recommendations
Researchers should validate extracted information against source images when accuracy is critical and consult the National Archives Catalog for the most up-to-date records.
---
## Additional Information
### Citation Information
**BibTeX**
```
@misc{revolution_crossroads_2026,
author = { Revolution Crossroads },
title = { nara_revolutionary_war_pension_files_PDFs (Revision 8453ff3) },
year = 2026,
url = { https://huggingface.co/datasets/RevolutionCrossroads/nara_revolutionary_war_pension_files_PDFs },
doi = { 10.57967/hf/8349 },
publisher = { Hugging Face }
}
```
**APA**
Revolution Crossroads Project Team. (2025). *American Revolutionary War Pension Files - File-Level* [Data set]. Hugging Face.
---
### Glossary
- **NAID**: National Archives Identifier
- **Extracted text**: Machine-generated text created from images
- **Transcription**: Human-created text
- **Pension file**: A complete archival case file documenting a pension application
---
### Dataset Card Contact
revolutioncrossroads@si.edu
---
提供机构:
RevolutionCrossroads
搜集汇总
数据集介绍

构建方式
该数据集源自美国国家档案馆(NARA)所藏的《美国独立战争服役养老金及 bounty-land 申请案卷》系列,由史密森尼学会“革命十字路口”项目团队构建。构建过程中,团队将原本以单页为单位的页面级数据集,依据国家档案标识符(NAID)进行分组聚合,重组为以完整案卷为单位的文件级数据集。每个文件级记录保留了源数据的关键元数据,同时计算了每份案卷的页数(numberOfPages),并将页面级图像、机器提取文本与人工转录文本以列表形式嵌套于记录中。为满足下游处理需求,团队还将案卷内的页面图像合并生成了完整的PDF文件,部分超大PDF在OCR流程中进行了尺寸调整。此举旨在保留案卷文档级别的上下文信息,支持对整份档案中人物、地点与事件间关联的深度分析。
特点
该数据集的核心特点在于其文件级结构,每个记录对应一份完整的养老金申请案卷,涵盖了叙事陈述、证词、信函及支撑材料等多页文档的完整语境。数据字段丰富多元,既包含案卷级别的NAID、标题、日期与NARA链接等元数据,又通过列表形式保留了每页的图像标识符、URL、机器提取文本与人工转录文本,形成了多模态信息的高效整合。值得注意的是,数据集中还附带了由系统自动生成的完整PDF文件,确保了案卷内容呈现的一致性与完整性。数据集规模适中,共包含78,926个训练样本,总大小约2.7 GB,采用CC0公共领域许可协议,便于学术研究与文化遗产领域的广泛复用。
使用方法
该数据集支持多种下游任务,包括图像到文本生成、文本检索、文本分类与特征提取等。研究者可通过NAID字段唯一标识每份案卷,利用naraURL直接访问NARA目录中的原始记录。对于需要整份案卷内容的任务,可直接使用pdfURL字段下载对应的PDF文件;对于需要逐页分析的任务,则可遍历pageURL、extractedText与transcriptionText列表,获取每页的图像与文本数据。由于提取文本的准确性因手写文档而异,建议在关键分析中参照原始图像进行验证。数据集以Hugging Face Datasets格式组织,可通过标准API加载与处理,并提供了BibTeX引用格式,便于学术论文中规范引用。
背景与挑战
背景概述
该数据集由史密森尼学会(Smithsonian Institution)于2025年依托“革命十字路口”(Revolution Crossroads)项目创建,旨在系统化呈现美国国家档案馆(NARA)所藏独立战争退伍军人养老金与 bounty-land 申请档案(系列号NAID 300022)。核心研究问题聚焦于如何将分散的页面级档案重构为完整的文件级数字对象,以支持历史叙事分析、实体抽取及跨案例关系挖掘。该数据集涵盖约7.8万份文件,通过整合元数据、OCR提取文本与人工转录,为数字人文与档案学领域提供了重要的结构化资源,对理解18世纪末至19世纪初美国社会史与军事史具有深远影响。
当前挑战
该数据集面临的挑战首要在于解决档案领域长期存在的多页文档上下文断裂问题:养老金档案包含叙述书、证词、信函等跨页材料,原始页面级表示难以保留完整语义。构建过程中,数据整理需应对档案等级不一(部分记录为中间“项目”层级而非文件层级)、PDF生成时因OCR工作流尺寸限制导致的图像缩放不一致,以及自动提取文本在手写文档上准确率参差等难题。此外,人工转录覆盖率有限,元数据存在不完整或不一致现象,要求研究者对关键信息进行源头影像验证,以确保历史解析的可靠性。
常用场景
经典使用场景
在历史文献数字化与自然语言处理的交叉领域中,nara_revolutionary_war_pension_files_PDFs数据集以其独特的档案层级结构,成为研究18至19世纪美国社会史的珍贵语料库。该数据集将美国国家档案馆藏的革命战争养老金申请案卷,从分散的页面级记录重组为完整的文件级单元,保留了多页叙事、证词、信函等连贯的文档语境。研究者常将其用于历史文档的OCR后处理与文本校正,通过分析跨页面的语义连贯性来提升机器转录的准确性。同时,该数据集为训练针对手写历史文献的命名实体识别模型提供了理想素材,支持从养老金申请书中自动抽取人名、地名、军衔与服役日期等关键信息,从而构建结构化的历史人物关系网络。
实际应用
在实际应用层面,该数据集为家谱学与公共历史领域提供了强有力的数字工具。家谱研究者可利用文件级PDF快速定位祖先的完整养老金申请材料,无需逐页浏览海量档案,大幅降低了家族史追踪的时间成本。博物馆与档案馆则借助该数据集开发互动式展览,通过可视化呈现退伍军人的迁徙地图与家庭网络,使公众直观感受独立战争后美国社会的流动图景。此外,该数据还服务于美国250周年国家庆典(America 250)的文化遗产数字化项目,其结构化文本与图像资源为开发教育性AI应用奠定了坚实基础,例如构建面向学生的历史角色对话系统,或生成自动化的档案摘要与时间线。
衍生相关工作
该数据集催生了一系列开创性的衍生工作,在数字人文与机器学习领域形成了活跃的研究生态。基于文件级结构,研究者开发了跨页面文档分类模型,能够自动识别养老金案卷中的叙事片段类型(如申请函、证人陈述、财产证明),从而实现对历史文书的智能归档。同时,针对该数据集的手写文本识别挑战,社区贡献了多个微调后的OCR模型与数据集扩增策略,显著提升了19世纪手写英文的转录准确率。在知识图谱构建方面,衍生工作通过实体链接与关系抽取技术,将养老金文件中的孤立信息整合为可查询的历史人物知识库,支持复杂的跨文件推理,例如揭示同一家族多代成员的服役记录,或关联不同案卷中的共同证人与地点,为美国革命史的系统性研究开辟了新路径。
以上内容由遇见数据集搜集并总结生成



