
yuyijiong/context_qa_sum_qwen3_synthetic

Hugging Face | Updated 2026-03-31 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/yuyijiong/context_qa_sum_qwen3_synthetic
Description:
---
license: apache-2.0
task_categories:
- text-generation
- summarization
- question-answering
language:
- en
- zh
size_categories:
- 10M<n<100M
---

# Context-based QA and Summarization Synthetic Dataset

## Overview

This dataset contains **synthetic** context-based question-answering (QA) and summarization data. The data was synthesized using:

- **Source context**: [openbmb/Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
- **Synthesis model**: [Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)

Each context is the **initial segment** of a raw pretraining text from Ultra-FineWeb, truncated to at most the target token count of its split (128, 256, 512, or 1024) and never cut in the middle of a sentence.

The dataset covers **context-based QA** and **summarization** tasks in both **English** and **Chinese**. It is organized into **single-document** and **multi-document** splits.

---

## Dataset Structure

### Single-Document Tasks

Single-document splits provide one source context per sample, with context lengths **close to (and not exceeding)** **128**, **256**, **512**, or **1024** tokens. Each sample includes a question, an answer, and summaries derived from that context.

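
The "close to, and not exceeding" length budget follows from the truncation rule in the Overview. A minimal sketch of that rule is given here under stated assumptions: the tokenizer used by the dataset authors is not named in this card, so `whitespace_token_count` is a hypothetical stand-in, and the regex-based sentence splitter and the function names are illustrative only. Returning an empty string when even the first sentence exceeds the budget is also an assumption.

```python
import re

# One "sentence" = text up to and including terminal punctuation (English or
# Chinese) plus any trailing whitespace; a final unterminated fragment also
# counts. This is a rough splitter (it will split on decimals like "3.14").
_SENTENCE = re.compile(r"[^.!?。！？]*[.!?。！？]+\s*|[^.!?。！？]+$")


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_at_sentence(text, max_tokens, count_tokens=whitespace_token_count):
    """Keep the longest prefix of whole sentences within max_tokens tokens."""
    kept = ""
    for sentence in _SENTENCE.findall(text):
        candidate = kept + sentence
        if count_tokens(candidate) > max_tokens:
            break  # adding this sentence would exceed the budget
        kept = candidate
    return kept.strip()


english = "A b c. D e f g. H i."
print(truncate_at_sentence(english, 7))  # keeps the first two sentences
print(truncate_at_sentence(english, 6))  # keeps only the first sentence
```

Whitespace counting is meaningless for Chinese text, so a real token counter (for example, the length of a tokenizer's encoded output) would be passed as `count_tokens` in practice.
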
| Folder | Language | Context length (tokens) | # Samples |
|--------|----------|-------------------------|-----------|
| `doc_sum_qa_shortsum_synthetic_text_128_en` | English | ~128 | 1,879,005 |
| `doc_sum_qa_shortsum_synthetic_text_128_zh` | Chinese | ~128 | 3,070,655 |
| `doc_sum_qa_shortsum_synthetic_text_256_en` | English | ~256 | 1,613,809 |
| `doc_sum_qa_shortsum_synthetic_text_256_zh` | Chinese | ~256 | 2,992,339 |
| `doc_sum_qa_shortsum_synthetic_text_512_en` | English | ~512 | 1,538,654 |
| `doc_sum_qa_shortsum_synthetic_text_512_zh` | Chinese | ~512 | 3,446,382 |
| `doc_complex_qa_shortsum_synthetic_text_1024_en` | English | ~1024 | 1,597,716 |
| `doc_complex_qa_shortsum_synthetic_text_1024_zh` | Chinese | ~1024 | 2,533,299 |

**Column definitions (single-document):**

| Column | Description |
|--------|-------------|
| `text_128` / `text_256` / `text_512` / `text_1024` | Source context (column name matches the target context length for that split). |
| `question` | A regular question grounded in this context. |
| `answer` | A brief answer to the question. |
| `summary` | A summary of the context. |
| `short_summary` | A highly condensed summary of the context. |

**Note on `doc_complex_qa_shortsum_synthetic_text_1024_*` splits:** These 1024-token splits do **not** include a `summary` column. The QA in these splits is **complex QA**, that is, questions that require multi-step reasoning and are more challenging than the regular QA in the 128/256/512 splits.

---

### Multi-Document Tasks

Multi-document splits provide a context formed by **multiple documents**, each of length approximately **128** tokens. Each sample includes multiple questions over the combined context, with answers and a short summary.

| Folder | Language | # Samples |
|--------|----------|-----------|
| `doc_multi_doc_multi_qa_short_sum_synthetic_text_128_en` | English | 1,861,577 |
| `doc_multi_doc_multi_qa_short_sum_synthetic_text_128_zh` | Chinese | 2,034,612 |

**Column definitions (multi-document):**

| Column | Description |
|--------|-------------|
| `context` | The full context, composed of multiple documents of ~128 tokens each. |
| `questions` | Multiple regular questions about this context. |
| `answers` | The answer corresponding to each question. |
| `short_summary` | A highly condensed summary of the full context. |
| `docs` | The original documents used to form the context. |
| `target_docs` | For each question, the document(s) that the question is based on. |

## Intended Use

- This is the training data for [Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio](https://arxiv.org/abs/2603.25926). The trained models are available [here](https://huggingface.co/yuyijiong/qwen3-semi-dynamic-soft-context-compress).
- **Training / fine-tuning** LLMs for (short-)context-based QA and summarization, to improve their overall ability to use context.
- **Evaluating** LLMs' ability to answer questions about, or summarize, information provided in context.
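
When training on a mix of splits, the differing schemas described in the column tables (a per-split `text_<N>` column versus `context`, a single `question`/`answer` versus `questions`/`answers`, and the missing `summary` in the 1024-token splits) may need to be normalized. The sketch below shows one way to do so. It assumes each sample is a plain dict keyed by column name and that `questions` and `answers` are parallel lists; the helper names `text_column` and `normalize` and the output keys are hypothetical, not part of the dataset.

```python
import re


def text_column(folder: str) -> str:
    """Derive the context column name (e.g. 'text_512') from a folder name."""
    match = re.search(r"_text_(\d+)_(?:en|zh)$", folder)
    if match is None:
        raise ValueError(f"unrecognized folder name: {folder}")
    return f"text_{match.group(1)}"


def normalize(sample: dict, folder: str) -> dict:
    """Map a single- or multi-document sample to one common layout."""
    if "multi_doc" in folder:
        # Multi-document splits: combined context, parallel question/answer lists.
        context = sample["context"]
        qa_pairs = list(zip(sample["questions"], sample["answers"]))
    else:
        # Single-document splits: context column is named after the token budget.
        context = sample[text_column(folder)]
        qa_pairs = [(sample["question"], sample["answer"])]
    return {
        "context": context,
        "qa_pairs": qa_pairs,
        # `summary` is absent from the 1024-token and multi-document splits.
        "summary": sample.get("summary"),
        "short_summary": sample["short_summary"],
    }


row = {"text_1024": "ctx", "question": "q", "answer": "a", "short_summary": "s"}
print(normalize(row, "doc_complex_qa_shortsum_synthetic_text_1024_en"))
```

The multi-document check runs first because those folder names also end in `_text_128_<lang>` and would otherwise be routed to the single-document branch.
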
Provider:
yuyijiong