Turku-WebQA
Hosted on ModelScope (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/TurkuNLP/Turku-WebQA
### Dataset Summary
The Turku WebQA dataset is a Finnish Question-Answer dataset that has been extracted from different CommonCrawl sources (Parsebank, mC4-Fi, CC-Fi).
The dataset has 237,000 question-answer pairs (290,000 questions in total, but not all of them have an answer). Questions without an answer can be discarded by dropping the rows whose `answer` is None (null).
The codebase as well as the raw data can be found on [GitHub](https://github.com/TurkuNLP/register-qa).
The extracted question-answer pairs cover various topics from the source corpora, some of which are explored in the paper cited below.
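The null filtering mentioned in the summary can be sketched in plain Python. This is a minimal illustration, assuming each row is a dictionary with the four fields documented in this card; the sample rows and the helper name `drop_unanswered` are made up for the example and are not part of the dataset or its codebase.

```python
from typing import Optional

# Hypothetical sample rows; real rows come from the dataset files.
rows: list[dict[str, Optional[str]]] = [
    {"source": "Parsebank", "id": "doc-1",
     "question": "Mikä on Suomen pääkaupunki?", "answer": "Helsinki."},
    {"source": "mC4-Fi", "id": "doc-2",
     "question": "Milloin kauppa aukeaa?", "answer": None},
    {"source": "CC-Fi", "id": "doc-3",
     "question": "Paljonko lippu maksaa?", "answer": "Lippu maksaa 10 euroa."},
]


def drop_unanswered(rows):
    """Keep only rows whose `answer` is not None (null)."""
    return [row for row in rows if row["answer"] is not None]


answered = drop_unanswered(rows)
print(f"{len(rows)} questions, {len(answered)} question-answer pairs")
```

Applied to the full data, a filter like this would be expected to reduce the roughly 290,000 questions to the roughly 237,000 answered pairs quoted in the summary, dropping about 53,000 rows. Both figures are rounded in the card, so the exact counts may differ.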
### Data Fields
- `source`: a `string` feature. Tells whether the question-answer pair is extracted from Parsebank, mC4-Fi or CC-Fi.
- `id`: a `string` feature. Id of the original text from which the question-answer pair is extracted.
- `question`: a `string` feature.
- `answer`: a `string` feature. Can also be None (null).
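As a rough sketch of how these four fields might be modelled in code, the block below defines a small record type with an optional `answer` and counts rows per source. The class name `QAPair`, the helper `parse_row`, and the sample rows are hypothetical and chosen only for illustration; the card itself does not prescribe any particular representation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QAPair:
    source: str            # "Parsebank", "mC4-Fi" or "CC-Fi"
    id: str                # id of the original text
    question: str
    answer: Optional[str]  # None when no answer was extracted


def parse_row(row: dict) -> QAPair:
    """Build a QAPair from a raw row; a missing answer becomes None."""
    return QAPair(
        source=row["source"],
        id=row["id"],
        question=row["question"],
        answer=row.get("answer"),
    )


# Hypothetical rows, only to exercise the types.
sample = [
    {"source": "Parsebank", "id": "a1", "question": "Q1?", "answer": "A1"},
    {"source": "Parsebank", "id": "a2", "question": "Q2?", "answer": None},
    {"source": "CC-Fi", "id": "b1", "question": "Q3?", "answer": "A3"},
]

pairs = [parse_row(row) for row in sample]
per_source = Counter(pair.source for pair in pairs)
print(per_source)
```

Typing `answer` as `Optional[str]` makes the null case explicit, so code that consumes the pairs has to handle unanswered questions deliberately.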
### Manual Evaluation of the Pairs
To get an idea of how good the extracted pairs were, a sample was annotated for noisy artefacts, insufficient answers and missing context.
The evaluation showed that there is variation between the different source corpora.
| Source | Noisy artefacts | Insufficient answer | Missing context |
| -------- | -------- | -------- | -------- |
| Total (N=73) | 0.29 | 0.22 | 0.08 |
| CC-Fi (N=25) | 0.36 | 0.22 | 0.03 |
| mC4-Fi (N=25) | 0.28 | 0.28 | 0.14 |
| Parsebank (N=22) | 0.23 | 0.14 | 0.07 |
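One way to read the Total row is as a size-weighted mean of the three per-source rows. The sketch below checks that reading using only the numbers in the table. It is an assumption, since the card does not say how the scores were computed, and the dictionary keys are illustrative names. The per-source sample sizes sum to 72, while the Total row states N=73; the card does not explain the difference.

```python
# Per-source sample sizes and scores, copied from the table above.
per_source = {
    "CC-Fi": (25, {"noisy": 0.36, "insufficient": 0.22, "missing_context": 0.03}),
    "mC4-Fi": (25, {"noisy": 0.28, "insufficient": 0.28, "missing_context": 0.14}),
    "Parsebank": (22, {"noisy": 0.23, "insufficient": 0.14, "missing_context": 0.07}),
}


def weighted_mean(column: str) -> float:
    """Mean of one score column, weighted by each source's sample size."""
    total_n = sum(n for n, _ in per_source.values())
    weighted_sum = sum(n * scores[column] for n, scores in per_source.values())
    return weighted_sum / total_n


for column in ("noisy", "insufficient", "missing_context"):
    print(column, round(weighted_mean(column), 2))
```

Rounded to two decimals, the weighted means come out as 0.29, 0.22 and 0.08, which agrees with the Total row.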
### Citing
To cite this dataset, use the following BibTeX entry.
```
@inproceedings{eskelinen-etal-2024-building-question,
title = "Building Question-Answer Data Using Web Register Identification",
author = "Eskelinen, Anni and
Myntti, Amanda and
Henriksson, Erik and
Pyysalo, Sampo and
Laippala, Veronika",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.234",
pages = "2595--2611",
abstract = "This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine translated QA dataset for Finnish {--} Turku WebQA {--} comprising over 200,000 QA pairs.",
}
```
Provider: maas
Created: 2025-08-08



