esa-sceva/satcom-corpus
Hugging Face · Updated 2025-11-13 · Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/esa-sceva/satcom-corpus
---
license: apache-2.0
language:
- en
pretty_name: SatCom Corpus
tags:
- satellite
- communication
- OCR
- scientific-articles
- NLP
- technical-texts
- PII-removal
---
# satcom-corpus
The **SatCom Corpus** is a large-scale collection of scientific and technical text data focused on **satellite communications (SatCom)** and related technologies.
The corpus contains text extracted from online sources selected by experts (mainly scientific publishers and open-access archives) and is designed for **language modeling, domain adaptation, and question answering** in the SatCom domain.
---
## Dataset structure
Each entry in the dataset includes the following columns:
* `id`: a unique identifier (UUID) derived from the original document filename
* `text`: OCR-extracted and cleaned text content of the article
* `source`: original URL of the document
* `publisher`: publisher of the article
* `doi`: DOI of the article
* `title`: title of the article
* `journal`: journal name
* `year`: year of publication
* `authors`: list of authors
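
As a rough illustration of the columns listed above, the sketch below checks that a record carries every expected field. The column names come from this card; the example values are made up, the value types (for instance `year` as an integer) are assumptions because the card does not state them, and `missing_columns` is a hypothetical helper rather than part of any dataset tooling.

```python
from typing import Any, Mapping

# Column names as listed in the dataset card.
COLUMNS = ("id", "text", "source", "publisher", "doi",
           "title", "journal", "year", "authors")


def missing_columns(record: Mapping[str, Any]) -> list[str]:
    """Return the expected column names that are absent from `record`."""
    return [name for name in COLUMNS if name not in record]


# A made-up record for illustration only; it is not taken from the corpus,
# and the value types are assumptions.
example = {
    "id": "00000000-0000-0000-0000-000000000000",
    "text": "Placeholder article text.",
    "source": "https://example.org/article.pdf",
    "publisher": "Example Publisher",
    "doi": "10.0000/example",
    "title": "Placeholder title",
    "journal": "Placeholder Journal",
    "year": 2020,
    "authors": ["A. Author", "B. Author"],
}

print(missing_columns(example))  # an empty list means every column is present
```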
---
## Data pipeline
The dataset was built through a fully automated **data curation pipeline**:
1. **Scraping** — Articles were retrieved from open-access online sources such as *Nature*, *IEEE*, and *ESA* technical archives using specialized scrapers.
2. **Text extraction** — PDF documents were processed with the **[Nougat OCR model](https://github.com/facebookresearch/nougat)** to obtain clean, structured text from scientific PDFs.
3. **Cleaning and normalization** — Post-processing removed non-textual artifacts, repeated headers/footers, and irrelevant formatting.
4. **PII removal** — Automatic filtering and normalization routines were applied to eliminate **personally identifiable information (PII)**, such as author email addresses, ORCID identifiers, and affiliations, ensuring that the dataset is compliant for open release.
5. **Integration** — Metadata such as URLs, SHA256 hashes, and timestamps were merged to ensure full traceability.
GitHub repository of the processing pipeline:
[https://github.com/esa-sceva/satcom-data-pipeline](https://github.com/esa-sceva/satcom-data-pipeline)
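
The repository above holds the actual implementation. Purely as an illustration of steps 3 to 5, the following minimal sketch shows one way such a cleaning, PII-redaction, and integration pass could look. Everything in it is an assumption: the regular expressions, the repeated-line heuristic for headers and footers, the choice to replace PII with placeholder tokens, hashing the cleaned text, and the `sha256` and `processed_at` field names. Affiliation removal is not covered.

```python
import hashlib
import re
from collections import Counter
from datetime import datetime, timezone

# Simplified patterns (assumptions, not the pipeline's own rules).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def drop_repeated_lines(text: str, min_repeats: int = 3) -> str:
    """Drop non-empty lines that occur at least `min_repeats` times.

    Lines repeated this often are treated as probable page headers or footers.
    """
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    kept = [line for line in lines if counts[line.strip()] < min_repeats]
    return "\n".join(kept)


def redact_pii(text: str) -> str:
    """Replace email addresses and ORCID identifiers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return ORCID_RE.sub("[ORCID]", text)


def build_record(raw_text: str, url: str) -> dict:
    """Clean the text, then attach traceability metadata (hypothetical field names)."""
    cleaned = redact_pii(drop_repeated_lines(raw_text))
    return {
        "text": cleaned,
        "source": url,
        "sha256": hashlib.sha256(cleaned.encode("utf-8")).hexdigest(),
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }
```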
---
## Data splits
The dataset is split into **21 Parquet shards** for easier processing and efficient loading:
| Split | Description | Size | Files |
|-------|-------------|------|-------|
| train | Full corpus for text modeling or pretraining | ~4 GB | train-00-of-20.parquet … train-20-of-20.parquet |
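
Since the corpus is around 4 GB, streaming may be more convenient than downloading every shard first. The sketch below assumes the third-party `datasets` library is installed and that the repository id `esa-sceva/satcom-corpus` resolves on the Hugging Face Hub; `take_long_texts` and `stream_corpus` are hypothetical helper names, and the length threshold is arbitrary.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, Mapping


def take_long_texts(rows: Iterable[Mapping[str, Any]],
                    min_chars: int = 1000,
                    limit: int = 5) -> Iterator[Mapping[str, Any]]:
    """Return an iterator over at most `limit` rows whose `text` has >= `min_chars` characters."""
    long_rows = (row for row in rows if len(row.get("text") or "") >= min_chars)
    return islice(long_rows, limit)


def stream_corpus():
    """Open the train split in streaming mode (needs `datasets` and network access)."""
    # Imported lazily so that the helper above works with the standard library alone.
    from datasets import load_dataset
    return load_dataset("esa-sceva/satcom-corpus", split="train", streaming=True)


# Usage (requires network access):
#     for row in take_long_texts(stream_corpus()):
#         print(row["title"], row["year"])
```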
---
## Use cases
This dataset can be used for:
- Domain-specific **language model pretraining** and fine-tuning (e.g. adapting general-purpose LLMs to technical texts)
- **Question answering** and **information retrieval** in SatCom and aerospace domains
- **Text summarization** or **topic modeling** on engineering publications
- **Knowledge graph construction** from scientific literature
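
For the retrieval and question-answering use cases, long articles usually need to be split into smaller passages before indexing. The following is a minimal sketch of overlapping word-window chunking applied to the `text` column; the function name `chunk_words` and the default window sizes are assumptions chosen for illustration, not recommendations from the dataset authors.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split `text` into chunks of up to `size` words, consecutive chunks sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window already reaches the end of the text.
        if start + size >= len(words):
            break
    return chunks


# Example with a tiny made-up text and small windows.
sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, size=4, overlap=2):
    print(chunk)
```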
---
## Limitations and notes
- OCR extraction may introduce transcription errors.
- Not all scraped content is guaranteed to be peer-reviewed.
- PII filtering is automatic and may be incomplete; users should handle the dataset responsibly and in compliance with applicable data privacy standards.
- The dataset is intended for **research and model training purposes only**.
---
## Citation
If you use this dataset, please cite:
Provider: esa-sceva



