esa-sceva/satcom-corpus

Name: esa-sceva/satcom-corpus
Creator: esa-sceva
Published: 2025-11-13 12:11:16
License: apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated 2025-11-13. Indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/esa-sceva/satcom-corpus


Description:

---
license: apache-2.0
language:
- en
pretty_name: SatCom Corpus
tags:
- satellite
- communication
- OCR
- scientific-articles
- NLP
- technical-texts
- PII-removal
---

# satcom-corpus

The **SatCom Corpus** is a large-scale collection of scientific and technical text data focused on **satellite communications (SatCom)** and related technologies.
The corpus contains text extracted from online sources selected by experts (mainly scientific publishers and open-access archives) and is designed for **language modeling, domain adaptation, and question answering** in the SatCom domain.

---

## Dataset structure

Each entry in the dataset includes the following columns:

* `id`: a unique identifier (UUID) derived from the original document filename
* `text`: OCR-extracted and cleaned text content of the article
* `source`: original URL of the document
* `publisher`: publisher of the article
* `doi`: DOI of the article
* `title`: title of the article
* `journal`: journal name
* `year`: year of publication
* `authors`: list of authors

---

## Data pipeline

The dataset was built through a fully automated **data curation pipeline**:

1. **Scraping**: Articles were retrieved from open-access online sources such as *Nature*, *IEEE*, and *ESA* technical archives using specialized scrapers.
2. **Text extraction**: PDF documents were processed with the **[Nougat OCR model](https://github.com/facebookresearch/nougat)** to obtain clean, structured text from scientific PDFs.
3. **Cleaning and normalization**: Post-processing removed non-textual artifacts, repeated headers/footers, and irrelevant formatting.
4. **PII removal**: Automatic filtering and normalization routines were applied to eliminate **personally identifiable information (PII)**, such as author email addresses, ORCID identifiers, and affiliations, ensuring that the dataset is compliant for open release.
5. **Integration**: Metadata such as URLs, SHA256 hashes, and timestamps were merged to ensure full traceability.
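
To make the PII removal step more concrete, here is a minimal sketch of how email addresses and ORCID identifiers might be masked with regular expressions. It is an illustration only, not the pipeline's actual code: the function name `scrub_pii`, the two patterns, and the `[EMAIL]` / `[ORCID]` placeholders are all assumptions, and affiliations (which the pipeline also filters) are not handled here.

```python
import re

# Hypothetical patterns for illustration; the real pipeline's rules may differ.
# Simple email matcher: local part, "@", domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# ORCID iDs are four hyphen-separated groups of four characters; the final
# character may be the checksum letter "X". An optional URL prefix is included
# so the whole link is masked, not only the identifier.
ORCID_RE = re.compile(r"(?:https?://orcid\.org/)?\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def scrub_pii(text: str) -> str:
    """Replace email addresses and ORCID iDs with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ORCID_RE.sub("[ORCID]", text)
    return text


if __name__ == "__main__":
    sample = "Contact: jane.doe@example.org, ORCID https://orcid.org/0000-0002-1825-0097."
    print(scrub_pii(sample))
```

Placeholder tokens are used instead of deleting the matches so that the surrounding sentence structure stays intact for language modeling. Whether the actual pipeline masks or deletes is not stated in the card.
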
GitHub repository of the processing pipeline: [https://github.com/esa-sceva/satcom-data-pipeline](https://github.com/esa-sceva/satcom-data-pipeline)

---

## Data splits

The dataset is split into **21 Parquet shards** for easier processing and efficient loading:

| Split | Description | Size | Files |
|-------|-------------|------|-------|
| train | Full corpus for text modeling or pretraining | ~4 GB | train-00-of-20.parquet … train-20-of-20.parquet |

---

## Use cases

This dataset can be used for:

- Domain-specific **language model pretraining** (e.g. fine-tuning LLMs on technical texts)
- **Question answering** and **information retrieval** in SatCom and aerospace domains
- **Text summarization** or **topic modeling** on engineering publications
- **Knowledge graph construction** from scientific literature

---

## Limitations and notes

- OCR extraction may introduce transcription errors.
- Not all scraped content is guaranteed to be peer-reviewed.
- Despite PII filtering, users are encouraged to handle the dataset responsibly in compliance with data privacy standards.
- The dataset is intended for **research and model training purposes only**.

---

## Citation

If you use this dataset, please cite:
