yuyijiong/context_qa_sum_qwen3_synthetic
Source: Hugging Face. Updated 2026-03-31; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/yuyijiong/context_qa_sum_qwen3_synthetic
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- summarization
- question-answering
language:
- en
- zh
size_categories:
- 10M<n<100M
---
# Context-based QA and Summarization Synthetic Dataset
## Overview
This dataset contains **synthetic** context-based question-answering (QA) and summarization data. The data was synthesized using:
- **Source context**: [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
- **Synthesis model**: [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
Each context is the **initial segment** of a raw pretraining document from Ultra-FineWeb, truncated to at most the target token count of its split (128, 256, 512, or 1024). The cut is placed at a sentence boundary, so no context ends in the middle of a sentence.
The dataset covers **context-based QA** and **summarization** tasks in both **English** and **Chinese**. It is organized into **single-document** and **multi-document** splits.
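
The sentence-aligned truncation described above can be sketched as follows. This is only an illustration, not the authors' preprocessing code: the card does not say which tokenizer or sentence splitter was used, so the sketch assumes whitespace-separated words as a stand-in for tokens and a simple regex for English sentence boundaries. The function name `truncate_at_sentence` is hypothetical.

```python
import re


def truncate_at_sentence(text: str, max_tokens: int) -> str:
    """Keep the longest prefix of whole sentences that fits in max_tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    A real pipeline would count tokens with the model's tokenizer.
    """
    # Split after '.', '!' or '?' when followed by whitespace (English only).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    used = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        if used + n_tokens > max_tokens:
            break  # adding this sentence would exceed the budget
        kept.append(sentence)
        used += n_tokens
    # Returns an empty string if even the first sentence is too long.
    return " ".join(kept)


example = "One two three. Four five. Six seven eight nine."
print(truncate_at_sentence(example, max_tokens=5))
```

Because whole sentences are dropped rather than cut, the result is usually somewhat shorter than the budget, which matches the "close to (and not exceeding)" wording used for the splits.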
---
## Dataset Structure
### Single-Document Tasks
Single-document splits provide one source context per sample, with context lengths **close to (and not exceeding)** **128**, **256**, **512**, or **1024** tokens. Each sample includes a question, an answer, and summaries derived from that context.
| Folder | Language | Context length (tokens) | # Samples |
|--------|----------|-------------------------|-----------|
| `doc_sum_qa_shortsum_synthetic_text_128_en` | English | ~128 | 1,879,005 |
| `doc_sum_qa_shortsum_synthetic_text_128_zh` | Chinese | ~128 | 3,070,655 |
| `doc_sum_qa_shortsum_synthetic_text_256_en` | English | ~256 | 1,613,809 |
| `doc_sum_qa_shortsum_synthetic_text_256_zh` | Chinese | ~256 | 2,992,339 |
| `doc_sum_qa_shortsum_synthetic_text_512_en` | English | ~512 | 1,538,654 |
| `doc_sum_qa_shortsum_synthetic_text_512_zh` | Chinese | ~512 | 3,446,382 |
| `doc_complex_qa_shortsum_synthetic_text_1024_en` | English | ~1024 | 1,597,716 |
| `doc_complex_qa_shortsum_synthetic_text_1024_zh` | Chinese | ~1024 | 2,533,299 |
**Column definitions (single-document):**
| Column | Description |
|--------|-------------|
| `text_128` / `text_256` / `text_512` / `text_1024` | Source context (column name matches the target context length for that split). |
| `question` | A regular question grounded in this context. |
| `answer` | A brief answer to the question. |
| `summary` | A summary of the context. |
| `short_summary` | A highly condensed summary of the context. |
**Note on `doc_complex_qa_shortsum_synthetic_text_1024_*` splits:** These 1024-token splits do **not** include a `summary` column. Their QA is **complex QA**: questions that require multi-step reasoning and are more challenging than the regular QA in the 128/256/512 splits.
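
As a convenience, the folder and column naming from the tables above can be captured in a small helper. This is a minimal sketch: `single_doc_split` and `load_single_doc` are hypothetical names, and the loading call assumes that each folder can be selected with the `data_dir` argument of `datasets.load_dataset` and that the data is exposed as a `train` split. Neither assumption is confirmed by this card, so check the repository layout before relying on it.

```python
REPO_ID = "yuyijiong/context_qa_sum_qwen3_synthetic"


def single_doc_split(length: int, lang: str) -> dict:
    """Return folder name, text column, and expected columns for a split."""
    if length not in (128, 256, 512, 1024):
        raise ValueError(f"unsupported context length: {length}")
    if lang not in ("en", "zh"):
        raise ValueError(f"unsupported language: {lang}")

    is_complex = length == 1024  # the 1024-token splits hold complex QA
    prefix = "doc_complex_qa_shortsum" if is_complex else "doc_sum_qa_shortsum"
    text_column = f"text_{length}"

    columns = [text_column, "question", "answer"]
    if not is_complex:
        columns.append("summary")  # absent from the 1024-token splits
    columns.append("short_summary")

    return {
        "folder": f"{prefix}_synthetic_text_{length}_{lang}",
        "text_column": text_column,
        "columns": columns,
    }


def load_single_doc(length: int, lang: str):
    """Stream one split. Assumes data_dir selection and a 'train' split."""
    from datasets import load_dataset  # imported lazily; needs `datasets`

    info = single_doc_split(length, lang)
    return load_dataset(
        REPO_ID, data_dir=info["folder"], split="train", streaming=True
    )


print(single_doc_split(1024, "zh"))
```

Streaming is used in the sketch because each split holds more than a million samples; drop `streaming=True` to download a split in full.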
---
### Multi-Document Tasks
Multi-document splits provide a context formed by **multiple documents**, each of length approximately **128** tokens. Each sample includes multiple questions over the combined context, with answers and a short summary.
| Folder | Language | # Samples |
|--------|----------|-----------|
| `doc_multi_doc_multi_qa_short_sum_synthetic_text_128_en` | English | 1,861,577 |
| `doc_multi_doc_multi_qa_short_sum_synthetic_text_128_zh` | Chinese | 2,034,612 |
**Column definitions (multi-document):**
| Column | Description |
|--------|-------------|
| `context` | The full context, composed of multiple documents of ~128 tokens each. |
| `questions` | Multiple regular questions about this context. |
| `answers` | The answer corresponding to each question. |
| `short_summary` | A highly condensed summary of the full context. |
| `docs` | The original documents used to form the context. |
| `target_docs` | For each question, the document(s) that the question is based on. |
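
The relationship between `questions`, `answers`, and `target_docs` can be illustrated with a short sketch. It assumes the three columns are parallel lists with one entry per question, which is how the column descriptions read. The card does not state the exact type of each `target_docs` entry (document text, index, or a list of either), so the code treats it as opaque. The record below is invented for illustration and is not taken from the dataset; `iter_qa` is a hypothetical helper.

```python
def iter_qa(sample: dict):
    """Yield (question, answer, target_docs_entry) triples from one sample.

    Assumption: `questions`, `answers`, and `target_docs` are parallel lists.
    """
    questions = sample["questions"]
    answers = sample["answers"]
    targets = sample["target_docs"]
    if not (len(questions) == len(answers) == len(targets)):
        raise ValueError("questions, answers and target_docs differ in length")
    yield from zip(questions, answers, targets)


# Invented toy record that mirrors the column names in the table above.
toy_sample = {
    "context": "Doc A text. Doc B text.",
    "docs": ["Doc A text.", "Doc B text."],
    "questions": ["What does doc A say?", "What does doc B say?"],
    "answers": ["Doc A text.", "Doc B text."],
    "target_docs": [["Doc A text."], ["Doc B text."]],
    "short_summary": "Two short documents.",
}

for question, answer, target in iter_qa(toy_sample):
    print(question, "->", answer, "| grounded in:", target)
```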
## Intended Use
- This is the training data for [Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio](https://arxiv.org/abs/2603.25926). The trained models are available [here](https://huggingface.co/yuyijiong/qwen3-semi-dynamic-soft-context-compress).
- **Training / fine-tuning** LLMs on (short-)context-based QA and summarization to improve their overall ability to use context.
- **Evaluating** LLMs' ability to answer questions about, or summarize, a given context.
Provider:
yuyijiong



