
esa-sceva/satcom-chunk-collection

Source: Hugging Face. Updated 2026-04-01. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/esa-sceva/satcom-chunk-collection
Resource description:
---
configs:
- config_name: chunks
  data_files:
  - split: train
    path: chunks/*.parquet
  default: true
- config_name: embeddings
  data_files:
  - split: train
    path: embeddings/*.parquet
license: cc-by-sa-4.0
---

# SatCom Chunk Collection

A large-scale dataset of **1,900,085** text chunks was constructed from satellite communication (SatCom) research papers. These chunks are extracted from the [SatCom corpus](https://huggingface.co/datasets/esa-sceva/satcom-corpus). Each chunk is enriched with structured metadata (e.g., publication details and authorship information), domain relevance scores, and precomputed vector embeddings using [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) to enable efficient semantic retrieval and downstream RAG-based generation.

## Dataset Structure

The dataset is organized into two subsets:

### `chunks` (default)

Text content and metadata. Lightweight and previewable in the HuggingFace dataset viewer.

| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Unique identifier for each chunk |
| `content` | `string` | The text content of the chunk |
| `title` | `string` | Title of the source paper |
| `authors` | `string` | Authors of the source paper |
| `doi` | `string` | Digital Object Identifier of the source paper |
| `url` | `string` | URL to the source paper |
| `journal` | `string` | Journal or venue where the paper was published |
| `publisher` | `string` | Publisher of the source paper |
| `year` | `float64` | Publication year (ranges from 1929 to 2026) |
| `score` | `float64` | Domain relevance score produced by [UltraRM](https://huggingface.co/openbmb/UltraRM-13b), a reward model. Represents how closely the chunk's content relates to the satellite communication domain. Higher values (closer to 0) indicate stronger relevance |
| `file_id` | `string` | Internal file identifier |
| `original_file_name` | `string` | Original filename of the source document |
| `chunk_name` | `string` | Name/identifier of the chunk within its source document |

### `embeddings`

Precomputed embedding vectors, joinable with `chunks` via the `id` column.

| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Unique identifier (matches `chunks.id`) |
| `vector` | `fixed_size_list<float32>[2560]` | 2560-dimensional embedding vector |

## Usage

```python
from datasets import load_dataset

# Load text and metadata (default)
chunks = load_dataset("esa-sceva/satcom-chunk-collection")

# Load embeddings
embeddings = load_dataset("esa-sceva/satcom-chunk-collection", "embeddings")

# Merge on id when you need both text and vectors
import pandas as pd

chunks_df = chunks["train"].to_pandas()
embeddings_df = embeddings["train"].to_pandas()
merged = chunks_df.merge(embeddings_df, on="id")
```

## Chunking Strategy

Documents were split using **hierarchical chunking**, which preserves the logical structure of research papers (sections, subsections, paragraphs) rather than splitting at arbitrary token boundaries. This ensures that each chunk captures a coherent unit of information. The maximum token length per chunk is **1,048 tokens**.

## Dataset Statistics

- **Total rows**: 1,900,085
- **Max chunk length**: 1,048 tokens
- **Chunking method**: Hierarchical
- **Embedding dimensions**: 2,560 (float32)
- **Publication years**: 1929–2026
- **Source**: Satellite communication research literature

## Score Details

The `score` column is a domain relevance score computed using [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b), a reward model developed by OpenBMB. Each chunk was scored based on how closely its content aligns with the satellite communication domain. Scores are negative floats where values closer to zero indicate higher relevance to SatCom.
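
To illustrate how the `score` and `vector` columns might be combined for retrieval, here is a minimal sketch that first drops chunks below a relevance threshold and then ranks the rest by cosine similarity to a query vector. Everything in it is an assumption made for illustration: the three toy rows, the 3-dimensional vectors (the real ones have 2,560 dimensions), the `min_score=-2.0` cutoff, and the choice of cosine similarity, which the dataset card does not specify. In practice the query vector would come from the same Qwen3-Embedding-4B model, and a vector index or NumPy would likely replace the pure-Python loop.

```python
import math

# Hypothetical stand-ins for rows of the `chunks` subset (id, content, score).
chunks = [
    {"id": 0, "content": "Link budget for a Ku-band downlink", "score": -0.4},
    {"id": 1, "content": "Acknowledgements and funding", "score": -6.2},
    {"id": 2, "content": "LEO constellation handover", "score": -1.1},
]

# Hypothetical stand-ins for the `embeddings` subset, keyed by chunk id.
vectors = {
    0: [0.9, 0.1, 0.0],
    1: [0.0, 0.2, 0.9],
    2: [0.6, 0.8, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, min_score=-2.0, top_k=2):
    """Keep chunks with score >= min_score, then rank them by similarity."""
    # Scores are negative; values closer to zero mean higher relevance.
    candidates = [c for c in chunks if c["score"] >= min_score]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vector, vectors[c["id"]]),
        reverse=True,
    )
    return ranked[:top_k]


results = search([1.0, 0.0, 0.0])
for row in results:
    print(row["id"], row["content"])
```

Filtering on `score` before ranking keeps low-relevance chunks (for example, acknowledgement sections) out of the candidate set regardless of how similar their vectors are to the query; the right threshold depends on the score distribution and should be chosen after inspecting the data.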
Provider:
esa-sceva