svattikuti/rev_warwithembeddings
Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/svattikuti/rev_warwithembeddings
Resource description:
---
pretty_name: Revolutionary War With Embeddings
tags:
- history
- revolutionary-war
- archives
- ocr
- embeddings
- multimodal
task_categories:
- text-retrieval
- text-classification
- visual-document-retrieval
size_categories:
- 1M<n<10M
---
# Revolutionary War With Embeddings
This dataset repository aggregates several Revolutionary War-era corpora and derived embedding tables in Parquet format. It is intended for retrieval, search, clustering, OCR exploration, and multimodal linking across archival text and images.
The repository currently contains 21 files, primarily Parquet datasets, plus a few small CSV/JSON artifacts used during processing.
## What is included
The dataset combines material from three main source families:
- Library of Congress Chronicling America newspaper data and OCR.
- U.S. National Archives Revolutionary War pension and related records.
- Smithsonian Institution Revolutionary Era collection and media data.
It also includes derived embedding tables:
- `*_e5_embeddings.parquet` files for text embeddings.
- `si_us_revolutionary_era_media_clip_embeddings.parquet` and `si_us_revolutionary_era_media_clip_embeddings_with_context.parquet` for image or image-context CLIP embeddings.
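
Embedding tables like these are usually queried with nearest-neighbor search. The following is a minimal sketch of cosine-similarity ranking in plain Python. It assumes each row provides an identifier and a vector as a list of floats; the `(id, embedding)` pair layout and the toy vectors are placeholders, since the column names of the embedding files are not documented in this card.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, rows, k=3):
    """Return the k (id, score) pairs most similar to query_vec.

    `rows` is an iterable of (id, embedding) pairs. Both names are
    placeholders: the real column names are an assumption to verify.
    """
    scored = [(row_id, cosine_similarity(query_vec, vec)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for real embeddings.
toy_rows = [
    ("doc-a", [1.0, 0.0, 0.0]),
    ("doc-b", [0.0, 1.0, 0.0]),
    ("doc-c", [0.7, 0.7, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], toy_rows, k=2)
print(results)
```

For tables with millions of rows, a vectorized or approximate nearest-neighbor index would likely be a better fit than this linear scan. The query vector must come from the same model and preprocessing as the stored embeddings.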
## Key files
Some of the larger and more central files are:
- `loc_chronicling_america_1770_1810.parquet`: 58,116 rows of newspaper/OCR-oriented records.
- `nara_revolutionary_war_pension_files.parquet`: 2,244,629 rows of NARA pension-file related records.
- `si_us_revolutionary_era_collections.parquet`: 12,667 Smithsonian collection records.
- `si_us_revolutionary_era_media_blobs.parquet`: 5,205 media records with binary image payloads.
## Example schemas
`loc_chronicling_america_1770_1810.parquet` includes fields such as:
- `Web_URL`
- `newspaper_title`
- `place_of_publication`
- `issue_date`
- `Page`
- `thumbnail_url`
- `jpeg2000_url`
- `pdf_url`
- `ocr_url`
- `ocr_text`
`nara_revolutionary_war_pension_files.parquet` includes fields such as:
- `NAID`
- `naraURL`
- `title`
- `logicalDate`
- `pdfURL`
- `pageURL`
- `extractedText`
- `transcriptionText`
- `transcriptionUserNames`
`si_us_revolutionary_era_media_blobs.parquet` includes:
- `media_key`
- `media_url`
- `image_bytes`
- `size_bytes`
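
Before decoding the binary payloads, it can help to sanity-check them. The sketch below inspects one media record given as a dict keyed by the field names listed above. It assumes that `size_bytes` records the byte length of `image_bytes`, which this card does not state explicitly, and the sample record is synthetic.

```python
def sniff_image_format(image_bytes):
    """Guess the image format from leading magic bytes; None if unknown."""
    signatures = {
        b"\xff\xd8\xff": "jpeg",
        b"\x89PNG\r\n\x1a\n": "png",
        b"GIF87a": "gif",
        b"GIF89a": "gif",
    }
    for magic, name in signatures.items():
        if image_bytes.startswith(magic):
            return name
    return None

def check_media_row(row):
    """Check one media record (a dict keyed by the documented field names).

    Assumption: `size_bytes` is the byte length of `image_bytes`.
    """
    payload = row["image_bytes"]
    return {
        "media_key": row["media_key"],
        "format": sniff_image_format(payload),
        "size_matches": len(payload) == row["size_bytes"],
    }

# Synthetic record: a JPEG signature followed by padding, not a real image.
fake_row = {
    "media_key": "example-key",
    "media_url": "https://example.org/image.jpg",
    "image_bytes": b"\xff\xd8\xff\xe0" + b"\x00" * 12,
    "size_bytes": 16,
}
report = check_media_row(fake_row)
print(report)
```

Rows whose format is `None` or whose size does not match are candidates for manual inspection before they are passed to an image decoder.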
## Intended use
Possible use cases:
- Semantic search over Revolutionary War newspapers and archival text.
- Linking text records to embeddings for retrieval-augmented workflows.
- Matching Smithsonian media assets to text context using CLIP embeddings.
- OCR quality inspection and downstream historical NLP experiments.
- Building exploratory knowledge bases around Revolutionary War people, places, and events.
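
As an illustration of the OCR quality inspection use case, the sketch below scores a text by the share of tokens that look like ordinary words. The heuristic and the function name `ocr_quality_score` are illustrative assumptions, not part of the dataset tooling. Dates, numerals, and period spellings will lower the score, so it is only a coarse first filter.

```python
import re

def ocr_quality_score(text):
    """Fraction of whitespace-separated tokens that look like plain words.

    A token counts as word-like if, after stripping surrounding punctuation,
    it consists only of letters (with optional internal apostrophes or hyphens).
    This is a rough heuristic, not a validated OCR metric.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    word_like = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")
    good = 0
    for token in tokens:
        core = token.strip(".,;:!?\"'()[]")
        if word_like.match(core):
            good += 1
    return good / len(tokens)

clean = "The army marched to Boston."
noisy = "Th3 a|my m@rched t0 B0st0n ~~"
print(ocr_quality_score(clean), ocr_quality_score(noisy))
```

Applied to a column such as `ocr_text`, a score like this could be used to rank pages for manual review.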
## Notes
- Files are stored as flat artifacts rather than as named Hugging Face splits.
- Several files are derivative products generated from source records, including embeddings and image blobs.
- The `.parquet` files are the primary dataset payloads; the `.csv` and `.json` files are smaller supporting outputs.
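
Because the files are flat artifacts rather than named splits, each one is most easily loaded by file name. The sketch below builds the `hf://` path for a file and, as one possible approach, loads a single file through the `datasets` library's `data_files` argument. The helper names are hypothetical, and the loader assumes `datasets` is installed.

```python
REPO_ID = "svattikuti/rev_warwithembeddings"

def hf_parquet_url(filename, repo_id=REPO_ID):
    """Build the hf:// path for one file in the dataset repository."""
    return f"hf://datasets/{repo_id}/{filename}"

def load_single_file(filename, repo_id=REPO_ID):
    """Load one Parquet file as a `datasets.Dataset`.

    When `data_files` is a single path, the library places the rows in a
    split named "train" by default. That name is a library convention and
    says nothing about how the data was prepared.
    """
    from datasets import load_dataset  # imported lazily; requires `datasets`
    return load_dataset(repo_id, data_files=filename, split="train")

print(hf_parquet_url("si_us_revolutionary_era_collections.parquet"))
```

Loading one file at a time also avoids schema conflicts between tables from different source families.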
## Loading examples
### Python with pandas
```python
import pandas as pd

# hf:// paths are resolved through fsspec, which requires `huggingface_hub` to be installed.
df = pd.read_parquet("hf://datasets/svattikuti/rev_warwithembeddings/loc_chronicling_america_1770_1810.parquet")
print(df.head())
```
### DuckDB
```sql
SELECT *
FROM read_parquet('hf://datasets/svattikuti/rev_warwithembeddings/nara_revolutionary_war_pension_files.parquet')
LIMIT 10;
```
## Limitations
- OCR text may contain recognition errors.
- Coverage and row granularity differ across source families.
- Embedding files depend on the model and preprocessing choices used during dataset construction.
- Some tables contain binary blobs and may be large to download or query locally.
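
One way to reduce the cost of the blob-bearing tables is to read only the columns that are needed, since Parquet is a columnar format and `pandas.read_parquet` accepts a `columns` argument. The sketch below is a minimal example; the helper names are hypothetical, and treating `image_bytes` as the only binary column is an assumption based on the schema listed for the media blobs file.

```python
def non_blob_columns(all_columns, blob_columns=("image_bytes",)):
    """Return the columns to read, skipping known binary-payload columns."""
    skip = set(blob_columns)
    return [name for name in all_columns if name not in skip]

def read_without_blobs(path, all_columns):
    """Read a Parquet file while leaving binary columns out."""
    import pandas as pd  # imported lazily; hf:// paths also need `huggingface_hub`
    return pd.read_parquet(path, columns=non_blob_columns(all_columns))

# Field names taken from the media blobs schema listed in this card.
media_columns = ["media_key", "media_url", "image_bytes", "size_bytes"]
print(non_blob_columns(media_columns))
```

The binary payloads can then be fetched separately, by `media_key`, for only the records that are actually needed.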
## Provenance
Source provenance is inferred from file names and record schemas in this repository:
- Library of Congress / Chronicling America
- U.S. National Archives and Records Administration
- Smithsonian Institution collections and media
If stricter citation, licensing, or collection-level provenance is needed, those details should be added in a follow-up update.
Provider: svattikuti



