WikiRAG-TR
Hosted on the ModelScope community. Updated 2026-01-09; indexed 2024-08-31.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/WikiRAG-TR

# Dataset Summary
WikiRAG-TR is a dataset of roughly 6K (5999) question-answer pairs that were synthetically created from the introduction sections of Turkish Wikipedia articles. The dataset is intended for Turkish Retrieval-Augmented Generation (RAG) tasks.
## Dataset Information
- **Number of Instances**: 5999 (5725 synthetically generated question-answer pairs, 274 augmented negative samples)
- **Dataset Size**: 20.5 MB
- **Language**: Turkish
- **Dataset License**: apache-2.0
- **Dataset Category**: Text2Text Generation
- **Dataset Domain**: STEM and Social Sciences
## WikiRAG-TR Pipeline
The creation of the dataset was accomplished in two main phases, each represented by a separate diagram.
### Phase 1: Subcategory Collection

In this initial phase:
1. A curated list of seed categories was chosen, including science, technology, engineering, mathematics, physics, chemistry, biology, geology, meteorology, history, social sciences, and more.
2. Using these seed categories, subcategories were recursively gathered from Wikipedia.
- **Recursion depth** was set to 3 and the **number of subcategories** to collect was limited to 100 for each depth layer.
3. At each step, the following subcategory types were filtered out:
- Subcategories containing **NSFW words**.
- Subcategories that only contain **lists of items**.
- Subcategories used as **templates**.
4. Articles from the resulting subcategory list were acquired.
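
The Phase 1 traversal can be sketched as a breadth-first walk. This is only an illustration under stated assumptions, not the authors' code: `fetch_subcategories` and `is_excluded` are hypothetical callables standing in for the Wikipedia lookup and the NSFW/list/template filters, and the limit of 100 is read here as a cap on the total number of subcategories kept per depth layer.

```python
from typing import Callable, Iterable, List, Set


def collect_subcategories(
    seeds: Iterable[str],
    fetch_subcategories: Callable[[str], Iterable[str]],  # hypothetical lookup
    is_excluded: Callable[[str], bool],  # hypothetical NSFW/list/template filter
    max_depth: int = 3,
    max_per_layer: int = 100,
) -> List[str]:
    """Walk the category tree layer by layer, down to max_depth layers."""
    frontier: List[str] = list(seeds)
    seen: Set[str] = set(frontier)
    collected: List[str] = []

    for _ in range(max_depth):
        layer: List[str] = []
        for category in frontier:
            for sub in fetch_subcategories(category):
                # Skip repeats and anything the filter rejects.
                if sub in seen or is_excluded(sub):
                    continue
                seen.add(sub)
                layer.append(sub)
                if len(layer) >= max_per_layer:
                    break
            if len(layer) >= max_per_layer:
                break
        collected.extend(layer)
        frontier = layer  # the next layer expands only what was kept

    return collected
```

Passing the lookup and the filter in as arguments keeps the traversal independent of any particular Wikipedia client, so it can be exercised on a small in-memory category tree.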
### Phase 2: Dataset Generation

The second phase involved the following steps:
1. Introduction sections were extracted from the articles gathered in Phase 1.
- If the introduction was **too short** or **too long** (fewer than 50 or more than 2500 characters), the article was discarded.
- If the introduction contained **NSFW words**, the article was discarded.
- If the introduction contained **equations**, the article was discarded.
- If the introduction section was **empty**, the article was discarded.
2. The filtered introductions were fed into a large language model (`Gemma-2-27B-it`) to generate synthetic question and answer pairs.
3. For each resulting row in the dataset (containing an introduction, question, and answer), the following operations were performed:
- Unrelated contexts (introductions) were gathered from other rows to add false positive retrievals to the context.
- These unrelated contexts were appended to a list.
- The related context was added to this list. (In some cases, the relevant context was omitted to create **negative samples** where the answer indicates the model can't answer the question due to insufficient information. These negative samples were created separately, ensuring all original questions have corresponding answers.)
- The list was shuffled to **randomize the position** of the relevant context.
- The list elements were joined using the '\n' character.
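
The length filter from step 1 and the context assembly from step 3 can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the `negative` flag, and the use of a seeded `random.Random` are assumptions, and the NSFW and equation filters are left out because their word lists and detection rules are not given here.

```python
import random
from typing import Optional, Sequence, Tuple

MIN_CHARS, MAX_CHARS = 50, 2500  # length bounds stated in step 1


def keep_introduction(intro: str) -> bool:
    """Length check only; an empty introduction fails it as well."""
    return MIN_CHARS <= len(intro) <= MAX_CHARS


def build_context(
    related: str,
    unrelated: Sequence[str],
    rng: random.Random,
    negative: bool = False,  # assumed flag: omit the related introduction
) -> Tuple[str, Optional[int]]:
    """Return the joined context and the position of the related introduction."""
    # Tag each introduction so the related one can be found after shuffling.
    items = [(text, False) for text in unrelated]
    if not negative:
        items.append((related, True))
    rng.shuffle(items)

    correct_idx = next(
        (i for i, (_, is_related) in enumerate(items) if is_related), None
    )
    context = "\n".join(text for text, _ in items)
    return context, correct_idx
```

For a negative sample the returned index is `None`, since the related introduction is not in the context. How the dataset itself encodes that case in `correct_intro_idx` is not specified here.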
## Considerations for Using the Data
The generated answers are usually short and concise. This may lead models trained on this dataset to generate short answers.
Since Wikipedia articles were used to create this dataset, any biases and inaccuracies present in them may also exist in this dataset.
## Dataset Columns
- `id`: Unique identifier for each row.
- `question`: The question generated by the model.
- `answer`: The answer generated by the model.
- `context`: The augmented context containing both relevant and irrelevant information.
- `is_negative_response`: Indicates whether the answer is a negative response (0: No, 1: Yes).
- `number_of_articles`: The number of article introductions used to create the context.
- `ctx_split_points`: The ending character indices of each introduction in the context. These can be used to split the `context` column into its individual article introductions.
- `correct_intro_idx`: Index of the related introduction in the context. Can be used together with `ctx_split_points` to find the related introduction. This can also be useful for post-training analysis.
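
A sketch of how `ctx_split_points` and `correct_intro_idx` might be used together is shown below. It assumes the split points are a sequence of integer end offsets into `context` and that introductions are separated by a single `'\n'`. Whether the stored offsets include the separator is not stated, so the sketch strips newlines at the edges of each slice, which would also remove any newline that genuinely begins or ends an introduction. The function names are illustrative.

```python
from typing import List, Sequence


def split_context(context: str, split_points: Sequence[int]) -> List[str]:
    """Cut the context at each end offset and drop the '\n' separators."""
    intros: List[str] = []
    start = 0
    for end in split_points:
        # Assumption: `end` is the end offset of one introduction.
        intros.append(context[start:end].strip("\n"))
        start = end
    return intros


def related_introduction(
    context: str, split_points: Sequence[int], correct_intro_idx: int
) -> str:
    """Pick the introduction the question was generated from."""
    return split_context(context, split_points)[correct_intro_idx]
```

Rows with `is_negative_response` set to 1 contain no related introduction, so they should be checked before indexing with `correct_intro_idx`.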
# Attributions
<a href="https://www.flaticon.com/free-icons/globe" title="globe icons">Globe icons created by Freepik - Flaticon</a>
<a href="https://www.flaticon.com/free-icons/search" title="search icons">Search icons created by Freepik - Flaticon</a>

Provider:
maas
Created:
2024-08-06



