esa-sceva/satcom-chunk-collection

Name: esa-sceva/satcom-chunk-collection
Creator: esa-sceva
Published: 2026-04-01 08:09:50
License: cc-by-sa-4.0

Source: Hugging Face. Updated: 2026-04-01. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/esa-sceva/satcom-chunk-collection

Resource description:

---
configs:
- config_name: chunks
  data_files:
  - split: train
    path: chunks/*.parquet
  default: true
- config_name: embeddings
  data_files:
  - split: train
    path: embeddings/*.parquet
license: cc-by-sa-4.0
---

# SatCom Chunk Collection

A large-scale dataset of **1,900,085** text chunks constructed from satellite communication (SatCom) research papers. The chunks are extracted from the [SatCom corpus](https://huggingface.co/datasets/esa-sceva/satcom-corpus). Each chunk is enriched with structured metadata (e.g., publication details and authorship information), a domain relevance score, and a precomputed vector embedding from [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B) to enable efficient semantic retrieval and downstream RAG-based generation.

## Dataset Structure

The dataset is organized into two subsets:

### `chunks` (default)

Text content and metadata. Lightweight and previewable in the Hugging Face dataset viewer.

| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Unique identifier for each chunk |
| `content` | `string` | The text content of the chunk |
| `title` | `string` | Title of the source paper |
| `authors` | `string` | Authors of the source paper |
| `doi` | `string` | Digital Object Identifier of the source paper |
| `url` | `string` | URL to the source paper |
| `journal` | `string` | Journal or venue where the paper was published |
| `publisher` | `string` | Publisher of the source paper |
| `year` | `float64` | Publication year (ranges from 1929 to 2026) |
| `score` | `float64` | Domain relevance score produced by [UltraRM](https://huggingface.co/openbmb/UltraRM-13b), a reward model. Represents how closely the chunk's content relates to the satellite communication domain. Higher values (closer to 0) indicate stronger relevance |
| `file_id` | `string` | Internal file identifier |
| `original_file_name` | `string` | Original filename of the source document |
| `chunk_name` | `string` | Name/identifier of the chunk within its source document |

### `embeddings`

Precomputed embedding vectors, joinable with `chunks` via the `id` column.

| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Unique identifier (matches `chunks.id`) |
| `vector` | `fixed_size_list<float32>[2560]` | 2560-dimensional embedding vector |

## Usage

```python
from datasets import load_dataset

# Load text and metadata (the default "chunks" subset)
chunks = load_dataset("esa-sceva/satcom-chunk-collection")

# Load the precomputed embedding vectors
embeddings = load_dataset("esa-sceva/satcom-chunk-collection", "embeddings")

# Merge on id when you need both text and vectors.
# Note: 1,900,085 vectors x 2,560 float32 values is roughly 19.5 GB of raw
# vector data, so the full merge needs a machine with ample memory.
chunks_df = chunks["train"].to_pandas()
embeddings_df = embeddings["train"].to_pandas()
merged = chunks_df.merge(embeddings_df, on="id")
```

## Chunking Strategy

Documents were split using **hierarchical chunking**, which preserves the logical structure of research papers (sections, subsections, paragraphs) rather than splitting at arbitrary token boundaries. This ensures that each chunk captures a coherent unit of information. The maximum token length per chunk is **1,048 tokens**.

## Dataset Statistics

- **Total rows**: 1,900,085
- **Max chunk length**: 1,048 tokens
- **Chunking method**: Hierarchical
- **Embedding dimensions**: 2,560 (float32)
- **Publication years**: 1929–2026
- **Source**: Satellite communication research literature

## Score Details

The `score` column is a domain relevance score computed using [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b), a reward model developed by OpenBMB. Each chunk was scored based on how closely its content aligns with the satellite communication domain. Scores are negative floats where values closer to zero indicate higher relevance to SatCom.
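
The score convention above (negative floats, closer to zero is more relevant) and the `id` join between the two subsets can be combined into a small retrieval routine. The following is a minimal sketch, not the dataset authors' pipeline. The helper names (`filter_by_score`, `cosine_similarity`, `top_k_chunks`), the threshold of `-2.0`, and the toy rows with 3-dimensional vectors are all assumptions made for illustration. With the real data, rows would come from iterating the `chunks` and `embeddings` subsets, vectors would have 2,560 dimensions, and the query vector would presumably need to be produced by the same Qwen3-Embedding-4B model to be comparable.

```python
import heapq
import math


def filter_by_score(rows, min_score):
    """Keep rows whose relevance score is at least min_score.

    Scores are negative floats and values closer to zero mean higher
    relevance, so a larger (less negative) threshold is stricter.
    """
    return [
        row for row in rows
        if row["score"] is not None and row["score"] >= min_score
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vector, chunk_rows, embedding_rows, k=3):
    """Return the k (similarity, chunk_row) pairs most similar to query_vector.

    Chunks and embeddings are matched on the shared `id` column.
    """
    vectors = {row["id"]: row["vector"] for row in embedding_rows}
    scored = [
        (cosine_similarity(query_vector, vectors[row["id"]]), row)
        for row in chunk_rows
        if row["id"] in vectors
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical toy rows that mimic the two schemas; values are invented.
toy_chunks = [
    {"id": 0, "content": "LEO link budget analysis", "score": -0.5},
    {"id": 1, "content": "Unrelated front-matter text", "score": -6.0},
    {"id": 2, "content": "Ka-band rain attenuation", "score": -1.2},
]
toy_embeddings = [
    {"id": 0, "vector": [1.0, 0.0, 0.0]},
    {"id": 1, "vector": [0.0, 1.0, 0.0]},
    {"id": 2, "vector": [0.6, 0.8, 0.0]},
]

# Drop low-relevance chunks first, then rank the rest against a query vector.
relevant = filter_by_score(toy_chunks, min_score=-2.0)
results = top_k_chunks([1.0, 0.0, 0.0], relevant, toy_embeddings, k=2)
for similarity, row in results:
    print(f"{similarity:.2f}  {row['content']}")
```

Pure-Python loops keep the sketch dependency-free, but they are only practical for small samples. At the full scale of 1,900,085 vectors, a vectorized library or an approximate nearest-neighbor index would likely be a better fit.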

Provided by:

esa-sceva
