esa-sceva/satcom-chunk-collection
Hugging Face · Updated 2026-04-01 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/esa-sceva/satcom-chunk-collection
Resource description:
---
configs:
- config_name: chunks
  data_files:
  - split: train
    path: chunks/*.parquet
  default: true
- config_name: embeddings
  data_files:
  - split: train
    path: embeddings/*.parquet
license: cc-by-sa-4.0
---
# SatCom Chunk Collection
A large-scale dataset of **1,900,085** text chunks built from satellite communication (SatCom) research papers. The chunks are extracted from the [SatCom corpus](https://huggingface.co/datasets/esa-sceva/satcom-corpus).
Each chunk is enriched with structured metadata (e.g., publication details and authorship information), a domain relevance score, and a vector embedding precomputed with [Qwen3-Embedding-4B](https://huggingface.co/Qwen/Qwen3-Embedding-4B), to support efficient semantic retrieval and downstream retrieval-augmented generation (RAG).
## Dataset Structure
The dataset is organized into two subsets:
### `chunks` (default)
Text content and metadata. This subset is lightweight and can be previewed in the Hugging Face dataset viewer.
| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Unique identifier for each chunk |
| `content` | `string` | The text content of the chunk |
| `title` | `string` | Title of the source paper |
| `authors` | `string` | Authors of the source paper |
| `doi` | `string` | Digital Object Identifier of the source paper |
| `url` | `string` | URL to the source paper |
| `journal` | `string` | Journal or venue where the paper was published |
| `publisher` | `string` | Publisher of the source paper |
| `year` | `float64` | Publication year (ranges from 1929 to 2026) |
| `score` | `float64` | Domain relevance score produced by the [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b) reward model, indicating how closely the chunk's content relates to the satellite communication domain. Scores are negative; values closer to 0 indicate stronger relevance |
| `file_id` | `string` | Internal file identifier |
| `original_file_name` | `string` | Original filename of the source document |
| `chunk_name` | `string` | Name/identifier of the chunk within its source document |
### `embeddings`
Precomputed embedding vectors, joinable with `chunks` via the `id` column.
| Column | Type | Description |
|--------|------|-------------|
| `id` | `int64` | Unique identifier (matches `chunks.id`) |
| `vector` | `fixed_size_list<float32>[2560]` | 2560-dimensional embedding vector |
## Usage
```python
from datasets import load_dataset

# Load text and metadata (the default `chunks` config)
chunks = load_dataset("esa-sceva/satcom-chunk-collection")

# Load the precomputed embeddings
embeddings = load_dataset("esa-sceva/satcom-chunk-collection", "embeddings")

# Merge on id when you need both text and vectors
# (to_pandas() requires pandas to be installed)
chunks_df = chunks["train"].to_pandas()
embeddings_df = embeddings["train"].to_pandas()
merged = chunks_df.merge(embeddings_df, on="id")
```
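Once vectors are loaded, semantic retrieval reduces to a nearest-neighbour search. The following is a minimal sketch of a brute-force cosine-similarity search, not a method prescribed by the dataset authors. The helper name `top_k_cosine` and the toy data are hypothetical; in practice the query vector is assumed to come from the same Qwen3-Embedding-4B model, and `vectors` and `ids` would be taken from the `embeddings` subset.

```python
import numpy as np


def top_k_cosine(query, vectors, ids, k=5):
    """Return (id, similarity) pairs for the k rows of `vectors` closest to `query`."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q
    k = max(1, min(k, len(sims)))
    # argpartition selects the k best rows without a full sort; sort only those k.
    best = np.argpartition(-sims, k - 1)[:k]
    best = best[np.argsort(-sims[best])]
    return [(int(ids[i]), float(sims[i])) for i in best]


# Toy data standing in for the real 2560-dimensional vectors (assumption).
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(100, 8)).astype(np.float32)
toy_ids = np.arange(100, dtype=np.int64)

# Querying with row 42 itself should rank id 42 first.
hits = top_k_cosine(toy_vectors[42], toy_vectors, toy_ids, k=3)
```

The returned ids can then be looked up in the `chunks` subset to recover the text. For the full collection, an approximate nearest-neighbour index is likely a better fit than a brute-force scan.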
## Chunking Strategy
Documents were split using **hierarchical chunking**, which preserves the logical structure of research papers (sections, subsections, paragraphs) rather than splitting at arbitrary token boundaries. This ensures that each chunk captures a coherent unit of information. Each chunk contains at most **1,048 tokens**.
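The exact chunking pipeline is not documented here, so the following is only an illustrative sketch of the general idea: split on section headings first, then pack whole paragraphs into chunks under a token budget. It assumes Markdown-style headings and approximates tokens by whitespace-separated words; the function name `hierarchical_chunks` is hypothetical, and the real pipeline presumably uses a model tokenizer.

```python
import re


def hierarchical_chunks(text, max_tokens=1048):
    """Split on headings, then pack paragraphs into chunks of at most max_tokens.

    Tokens are approximated by whitespace-separated words (an assumption).
    A single paragraph longer than max_tokens is kept whole in this sketch.
    """
    chunks = []
    # Zero-width split before each heading line, so a section keeps its heading.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    for section in sections:
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current, current_len = [], 0
        for para in paragraphs:
            n = len(para.split())
            # Flush the running chunk if adding this paragraph would exceed the budget.
            if current and current_len + n > max_tokens:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            current.append(para)
            current_len += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks


doc = "# Intro\n\none two three\n\nfour five\n\n## Method\n\nsix seven"
small_chunks = hierarchical_chunks(doc, max_tokens=5)
```

Because chunks never cross a heading, each one stays within a single section, which is the property the hierarchical approach is meant to preserve.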
## Dataset Statistics
- **Total rows**: 1,900,085
- **Max chunk length**: 1,048 tokens
- **Chunking method**: Hierarchical
- **Embedding dimensions**: 2,560 (float32)
- **Publication years**: 1929–2026
- **Source**: Satellite communication research literature
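The row count and embedding dimensions above imply a sizeable memory footprint if the whole `embeddings` subset is materialized at once. A quick back-of-the-envelope calculation, counting raw float32 values only and ignoring any container overhead (the helper name `embedding_bytes` is hypothetical):

```python
def embedding_bytes(rows, dims, bytes_per_value=4):
    """Raw size in bytes of a dense rows x dims matrix (float32 by default)."""
    return rows * dims * bytes_per_value


total = embedding_bytes(1_900_085, 2_560)
print(f"{total / 1e9:.2f} GB ({total / 2**30:.2f} GiB)")
```

This comes to roughly 19.5 GB of raw vector data, so loading a subset of rows or processing the data in batches may be preferable on memory-constrained machines.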
## Score Details
The `score` column is a domain relevance score computed using [UltraRM-13b](https://huggingface.co/openbmb/UltraRM-13b), a reward model developed by OpenBMB. Each chunk was scored based on how closely its content aligns with the satellite communication domain. Scores are negative floats where values closer to zero indicate higher relevance to SatCom.
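Since scores closer to zero indicate higher relevance, one way to use the column is to keep only chunks above a chosen cutoff. The sketch below assumes a pandas DataFrame with a `score` column; the helper name `most_relevant`, the toy values, and the `-2.0` threshold are illustrative assumptions, not recommended settings.

```python
import pandas as pd


def most_relevant(df, min_score):
    """Keep rows with score >= min_score, ordered from most to least relevant."""
    kept = df[df["score"] >= min_score]
    return kept.sort_values("score", ascending=False).reset_index(drop=True)


# Toy scores standing in for the real `score` column (assumption).
toy = pd.DataFrame({"id": [1, 2, 3, 4], "score": [-0.5, -7.2, -2.0, -0.1]})
relevant = most_relevant(toy, min_score=-2.0)
```

A sensible cutoff depends on the actual score distribution, so inspecting it (for example with `df["score"].describe()`) before filtering is advisable.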
Provider:
esa-sceva



