
svattikuti/rev_warwithembeddings

Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/svattikuti/rev_warwithembeddings
Resource description:
---
pretty_name: Revolutionary War With Embeddings
tags:
- history
- revolutionary-war
- archives
- ocr
- embeddings
- multimodal
task_categories:
- text-retrieval
- text-classification
- visual-document-retrieval
size_categories:
- 1M<n<10M
---

# Revolutionary War With Embeddings

This dataset repo aggregates several Revolutionary War era corpora and derived embedding tables in Parquet form. It is intended for retrieval, search, clustering, OCR exploration, and multimodal linking across archival text and images.

The repository currently contains 21 files, primarily Parquet datasets, plus a few small CSV/JSON artifacts used during processing.

## What is included

The dataset combines material from three main source families:

- Library of Congress Chronicling America newspaper data and OCR.
- U.S. National Archives Revolutionary War pension and related records.
- Smithsonian Institution Revolutionary Era collection and media data.

It also includes derived embedding tables:

- `*_e5_embeddings.parquet` files for text embeddings.
- `si_us_revolutionary_era_media_clip_embeddings.parquet` and `si_us_revolutionary_era_media_clip_embeddings_with_context.parquet` for image or image-context CLIP embeddings.

## Key files

Some of the larger and more central files are:

- `loc_chronicling_america_1770_1810.parquet`: 58,116 rows of newspaper/OCR-oriented records.
- `nara_revolutionary_war_pension_files.parquet`: 2,244,629 rows of NARA pension-file related records.
- `si_us_revolutionary_era_collections.parquet`: 12,667 Smithsonian collection records.
- `si_us_revolutionary_era_media_blobs.parquet`: 5,205 media records with binary image payloads.
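The naming conventions described above can be used to sort a file listing by source family and table kind. The following is a minimal sketch, not part of the repository: the prefix-to-source mapping is an assumption inferred from the file names, `describe_file` is a hypothetical helper, and the E5 file name in the example list is invented purely to match the documented `*_e5_embeddings.parquet` pattern.

```python
from fnmatch import fnmatch

# Assumed mapping from file-name prefix to source family (inferred from names).
SOURCE_PREFIXES = {
    "loc_": "Library of Congress",
    "nara_": "National Archives",
    "si_": "Smithsonian Institution",
}


def describe_file(name):
    """Return (source family, table kind) for a repository file name."""
    source = next(
        (label for prefix, label in SOURCE_PREFIXES.items() if name.startswith(prefix)),
        "unknown",
    )
    if fnmatch(name, "*_e5_embeddings.parquet"):
        kind = "text embeddings (E5)"
    elif fnmatch(name, "*_clip_embeddings*.parquet"):
        kind = "image embeddings (CLIP)"
    elif name.endswith(".parquet"):
        kind = "records"
    else:
        kind = "supporting artifact"
    return source, kind


files = [
    "loc_chronicling_america_1770_1810.parquet",
    "nara_revolutionary_war_pension_files.parquet",
    "si_us_revolutionary_era_media_blobs.parquet",
    "si_us_revolutionary_era_media_clip_embeddings_with_context.parquet",
    # Hypothetical name: it only illustrates the `*_e5_embeddings.parquet` pattern.
    "loc_chronicling_america_1770_1810_e5_embeddings.parquet",
]

for file_name in files:
    print(file_name, describe_file(file_name))
```

A classification like this can help decide which tables to download first, since the record tables and the embedding tables serve different steps of a retrieval workflow.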

## Example schemas

`loc_chronicling_america_1770_1810.parquet` includes fields such as:

- `Web_URL`
- `newspaper_title`
- `place_of_publication`
- `issue_date`
- `Page`
- `thumbnail_url`
- `jpeg2000_url`
- `pdf_url`
- `ocr_url`
- `ocr_text`

`nara_revolutionary_war_pension_files.parquet` includes fields such as:

- `NAID`
- `naraURL`
- `title`
- `logicalDate`
- `pdfURL`
- `pageURL`
- `extractedText`
- `transcriptionText`
- `transcriptionUserNames`

`si_us_revolutionary_era_media_blobs.parquet` includes:

- `media_key`
- `media_url`
- `image_bytes`
- `size_bytes`

## Intended use

Possible use cases:

- Semantic search over Revolutionary War newspapers and archival text.
- Linking text records to embeddings for retrieval-augmented workflows.
- Matching Smithsonian media assets to text context using CLIP embeddings.
- OCR quality inspection and downstream historical NLP experiments.
- Building exploratory knowledge bases around Revolutionary War people, places, and events.

## Notes

- Files are stored as flat artifacts rather than as named Hugging Face splits.
- Several files are derivative products generated from source records, including embeddings and image blobs.
- The `.parquet` files are the primary dataset payloads; the `.csv` and `.json` files are smaller supporting outputs.

## Loading examples

### Python with pandas

```python
import pandas as pd

# Reading hf:// paths requires the huggingface_hub package to be installed.
df = pd.read_parquet("hf://datasets/svattikuti/rev_warwithembeddings/loc_chronicling_america_1770_1810.parquet")
print(df.head())
```

### DuckDB

```sql
-- Preview the first 10 rows of the NARA pension-file table.
SELECT *
FROM read_parquet('hf://datasets/svattikuti/rev_warwithembeddings/nara_revolutionary_war_pension_files.parquet')
LIMIT 10;
```

## Limitations

- OCR text may contain recognition errors.
- Coverage and row granularity differ across source families.
- Embedding files depend on the model and preprocessing choices used during dataset construction.
- Some tables contain binary blobs and may be large to download or query locally.
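
The first limitation above, OCR recognition errors, can be screened for with a simple heuristic before running downstream NLP. The sketch below is one possible approach and not the method used to build this dataset: `looks_like_word` and `ocr_quality_score` are hypothetical helpers, and the sample strings are invented for illustration.

```python
import string


def looks_like_word(token):
    """True if the token is alphabetic after trimming surrounding punctuation."""
    core = token.strip(string.punctuation)
    # Allow internal apostrophes and hyphens, as in "o'clock" or "well-known".
    return bool(core) and core.replace("'", "").replace("-", "").isalpha()


def ocr_quality_score(text):
    """Fraction of whitespace-separated tokens that look like ordinary words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if looks_like_word(token)) / len(tokens)


# Invented sample strings, not taken from the dataset.
clean = "The army marched to Boston in the spring."
noisy = "Th3 a~my m@rch'd t0 B0st0n ||| .,;"

print(ocr_quality_score(clean))
print(ocr_quality_score(noisy))
```

Applied to a text column such as `ocr_text` or `extractedText`, a low score flags pages worth inspecting by hand. Numbers and dates also lower the score, so any cutoff should be tuned on a sample rather than taken as a fixed rule.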

## Provenance

Source provenance is inferred from file names and record schemas in this repository:

- Library of Congress / Chronicling America
- U.S. National Archives and Records Administration
- Smithsonian Institution collections and media

If stricter citation, licensing, or collection-level provenance is needed, those details should be added in a follow-up update.

Provider:
svattikuti