
esa-sceva/satcom-corpus

Hugging Face | Updated 2025-11-13 | Indexed 2026-01-03
Download link:
https://hf-mirror.com/datasets/esa-sceva/satcom-corpus
Resource description:
---
license: apache-2.0
language:
- en
pretty_name: SatCom Corpus
tags:
- satellite
- communication
- OCR
- scientific-articles
- NLP
- technical-texts
- PII-removal
---

# satcom-corpus

The **SatCom Corpus** is a large-scale collection of scientific and technical text data focused on **satellite communications (SatCom)** and related technologies. The corpus contains text extracted from online sources selected by experts (mainly scientific publishers and open-access archives) and is designed for **language modeling, domain adaptation, and question answering** in the SatCom domain.

---

## Dataset structure

Each entry in the dataset includes the following columns:

* `id`: a unique identifier (UUID) derived from the original document filename
* `text`: OCR-extracted and cleaned text content of the article
* `source`: original URL of the document
* `publisher`: publisher of the article
* `doi`: DOI of the article
* `title`: title of the article
* `journal`: journal name
* `year`: year of publication
* `authors`: list of authors

---

## Data pipeline

The dataset was built through a fully automated **data curation pipeline**:

1. **Scraping**: Articles were retrieved from open-access online sources such as *Nature*, *IEEE*, and *ESA* technical archives using specialized scrapers.
2. **Text extraction**: PDF documents were processed with the **[Nougat OCR model](https://github.com/facebookresearch/nougat)** to obtain clean, structured text from scientific PDFs.
3. **Cleaning and normalization**: Post-processing removed non-textual artifacts, repeated headers/footers, and irrelevant formatting.
4. **PII removal**: Automatic filtering and normalization routines were applied to eliminate **personally identifiable information (PII)**, such as author email addresses, ORCID identifiers, and affiliations, ensuring that the dataset is compliant for open release.
5. **Integration**: Metadata such as URLs, SHA256 hashes, and timestamps were merged to ensure full traceability.
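
As a rough illustration of steps 4 and 5, the sketch below scrubs email addresses and ORCID iDs from a text and attaches a SHA256 digest for traceability. It is not the pipeline's actual code: the regular expressions, the placeholder tokens `[EMAIL]` and `[ORCID]`, the helper names, and the `sha256` field are all assumptions made for this example, and affiliation removal is not covered.

```python
import hashlib
import re

# Hypothetical patterns: the card does not describe the pipeline's real rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ORCID_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{3}[\dX]\b")


def scrub_pii(text: str) -> str:
    """Replace email addresses and ORCID iDs with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return ORCID_RE.sub("[ORCID]", text)


def sha256_hex(text: str) -> str:
    """Return the SHA256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_record(raw_text: str, source: str) -> dict:
    """Scrub the text, then attach the source URL and a content hash.

    The `sha256` key is an assumed field name, not a documented column.
    """
    clean = scrub_pii(raw_text)
    return {"text": clean, "source": source, "sha256": sha256_hex(clean)}


record = build_record(
    "Contact: jane.doe@example.org, ORCID 0000-0002-1825-0097.",
    "https://example.org/paper.pdf",
)
print(record["text"])  # Contact: [EMAIL], ORCID [ORCID].
```

Hashing the cleaned text rather than the raw text means the digest can be recomputed from the released `text` field alone. Whether the real pipeline hashes the text or the original PDF is not stated in the card.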

GitHub repository of the processing pipeline: [https://github.com/esa-sceva/satcom-data-pipeline](https://github.com/esa-sceva/satcom-data-pipeline)

---

## Data splits

The dataset is split into **21 Parquet shards** for easier processing and efficient loading:

| Split | Description | Size | Files |
|-------|-------------|------|-------|
| train | Full corpus for text modeling or pretraining | ~4 GB | train-00-of-20.parquet … train-20-of-20.parquet |

---

## Use cases

This dataset can be used for:

- Domain-specific **language model pretraining** (e.g. fine-tuning LLMs on technical texts)
- **Question answering** and **information retrieval** in SatCom and aerospace domains
- **Text summarization** or **topic modeling** on engineering publications
- **Knowledge graph construction** from scientific literature

---

## Limitations and notes

- OCR extraction may introduce transcription errors.
- Not all scraped content is guaranteed to be peer-reviewed.
- Despite PII filtering, users are encouraged to handle the dataset responsibly in compliance with data privacy standards.
- The dataset is intended for **research and model training purposes only**.

---

## Citation

If you use this dataset, please cite:
Provided by:
esa-sceva