yuyijiong/context_qa_sum_qwen3_synthetic

Name: yuyijiong/context_qa_sum_qwen3_synthetic
Creator: yuyijiong
Published: 2026-03-31 05:52:04
License: apache-2.0

Source: Hugging Face. Updated: 2026-03-31. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/yuyijiong/context_qa_sum_qwen3_synthetic

Description:

---
license: apache-2.0
task_categories:
- text-generation
- summarization
- question-answering
language:
- en
- zh
size_categories:
- 10M<n<100M
---

# Context-based QA and Summarization Synthetic Dataset

## Overview

This dataset contains **synthetic** context-based question-answering (QA) and summarization data. The data was synthesized using:

- **Source context**: [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
- **Synthesis model**: [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)

Each context is the **initial segment** of a raw pretraining text from Ultra-FineWeb, truncated to at most the target token count of its split (128, 256, 512, or 1024). The cut is placed so that it never falls in the middle of a sentence.

The dataset covers **context-based QA** and **summarization** tasks in both **English** and **Chinese**. It is organized into **single-document** and **multi-document** splits.

---

## Dataset Structure

### Single-Document Tasks

Single-document splits provide one source context per sample, with context lengths **close to (and not exceeding)** **128**, **256**, **512**, or **1024** tokens. Each sample includes a question, an answer, and summaries derived from that context.

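The truncation rule described in the Overview (keep the initial segment, stay within the token budget, never cut mid-sentence) can be sketched as follows. This is only an illustration, not the authors' pipeline: the function name `truncate_at_sentence`, the punctuation-based sentence detector, and the whitespace token counter are all assumptions, since the card does not say which tokenizer or sentence splitter was used.

```python
import re
from typing import Callable

# Assumption: a sentence ends at English or Chinese terminal punctuation.
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption); swap in a real tokenizer as needed."""
    return len(text.split())


def truncate_at_sentence(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> str:
    """Return the longest sentence-aligned prefix of `text` within `max_tokens`.

    Returns an empty string if even the first sentence exceeds the budget
    or if no sentence boundary is found.
    """
    best_end = 0
    for match in _SENTENCE_END.finditer(text):
        end = match.end()
        # Stop at the first sentence boundary that overshoots the budget.
        if count_tokens(text[:end]) > max_tokens:
            break
        best_end = end
    return text[:best_end]


english = "One two three. Four five six seven. Eight nine."
print(truncate_at_sentence(english, max_tokens=7))

chinese = "今天天气很好。我们去公园散步。"
# For Chinese, character count is used here as a rough token proxy (assumption).
print(truncate_at_sentence(chinese, max_tokens=8, count_tokens=len))
```

Passing the token counter as a parameter keeps the sketch independent of any particular tokenizer, so a model tokenizer's length function could be substituted without changing the truncation logic.
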

| Folder | Language | Context length (tokens) | # Samples |
|--------|----------|-------------------------|-----------|
| `doc_sum_qa_shortsum_synthetic_text_128_en` | English | ~128 | 1,879,005 |
| `doc_sum_qa_shortsum_synthetic_text_128_zh` | Chinese | ~128 | 3,070,655 |
| `doc_sum_qa_shortsum_synthetic_text_256_en` | English | ~256 | 1,613,809 |
| `doc_sum_qa_shortsum_synthetic_text_256_zh` | Chinese | ~256 | 2,992,339 |
| `doc_sum_qa_shortsum_synthetic_text_512_en` | English | ~512 | 1,538,654 |
| `doc_sum_qa_shortsum_synthetic_text_512_zh` | Chinese | ~512 | 3,446,382 |
| `doc_complex_qa_shortsum_synthetic_text_1024_en` | English | ~1024 | 1,597,716 |
| `doc_complex_qa_shortsum_synthetic_text_1024_zh` | Chinese | ~1024 | 2,533,299 |

**Column definitions (single-document):**

| Column | Description |
|--------|-------------|
| `text_128` / `text_256` / `text_512` / `text_1024` | Source context (column name matches the target context length for that split). |
| `question` | A regular question grounded in this context. |
| `answer` | A brief answer to the question. |
| `summary` | A summary of the context. |
| `short_summary` | A highly condensed summary of the context. |

**Note on `doc_complex_qa_shortsum_synthetic_text_1024_*` splits:** These 1024-token splits do **not** include a `summary` column. The QA in these splits is **complex QA**, i.e., questions that require multi-step reasoning and are more challenging than the regular QA in the 128/256/512 splits.

---

### Multi-Document Tasks

Multi-document splits provide a context formed by **multiple documents**, each of length approximately **128** tokens. Each sample includes multiple questions over the combined context, with answers and a short summary.


| Folder | Language | # Samples |
|--------|----------|-----------|
| `doc_multi_doc_multi_qa_short_sum_synthetic_text_128_en` | English | 1,861,577 |
| `doc_multi_doc_multi_qa_short_sum_synthetic_text_128_zh` | Chinese | 2,034,612 |

**Column definitions (multi-document):**

| Column | Description |
|--------|-------------|
| `context` | The full context, composed of multiple documents of ~128 tokens each. |
| `questions` | Multiple regular questions about this context. |
| `answers` | The answer corresponding to each question. |
| `short_summary` | A highly condensed summary of the full context. |
| `docs` | The original documents used to form the context. |
| `target_docs` | For each question, the document(s) that the question is based on. |

## Intended Use

- This is the training data for [Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio](https://arxiv.org/abs/2603.25926). The trained models are [here](https://huggingface.co/yuyijiong/qwen3-semi-dynamic-soft-context-compress).
- **Training / fine-tuning** LLMs for (short-)context-based QA and summarization, to improve their overall ability to use context.
- **Evaluating** LLMs' ability to answer questions about, or summarize, a given context.
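
The folder and column conventions above could be used roughly as follows. This is a hedged sketch rather than an official loader: the helper names (`context_column`, `iter_multi_doc_qa`, `load_split`) are hypothetical, and it assumes that each folder can be loaded through the `data_dir` argument of `datasets.load_dataset` with a default `train` split, and that `questions`, `answers`, and `target_docs` are parallel lists. None of these details are confirmed by the card.

```python
import re
from typing import Any, Dict, Iterator, Tuple

REPO_ID = "yuyijiong/context_qa_sum_qwen3_synthetic"


def context_column(folder: str) -> str:
    """Map a split folder name to the column holding its context."""
    if "multi_doc" in folder:
        # Multi-document splits store the combined context in `context`.
        return "context"
    match = re.search(r"text_(\d+)_(?:en|zh)$", folder)
    if match is None:
        raise ValueError(f"Unrecognized folder name: {folder}")
    # Single-document splits name the column after the target length.
    return f"text_{match.group(1)}"


def iter_multi_doc_qa(sample: Dict[str, Any]) -> Iterator[Tuple[str, str, Any]]:
    """Yield (question, answer, target_docs) triples from a multi-doc sample.

    Assumption: the three columns are parallel lists of equal length.
    """
    yield from zip(sample["questions"], sample["answers"], sample["target_docs"])


def load_split(folder: str, streaming: bool = True):
    """Load one folder of the dataset (requires `pip install datasets`).

    Assumption: files in `folder` are auto-detected and exposed as `train`.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID, data_dir=folder, split="train", streaming=streaming)


# Offline illustration with a made-up sample (not real dataset content).
fake_sample = {
    "questions": ["Q1?", "Q2?"],
    "answers": ["A1", "A2"],
    "target_docs": [["doc a"], ["doc b"]],
}
for question, answer, targets in iter_multi_doc_qa(fake_sample):
    print(question, answer, targets)
```

Streaming is the default in `load_split` because the splits contain millions of samples; with it, samples are read lazily instead of being downloaded in full first.
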

Provided by:

yuyijiong
