CLEAN
Source: arXiv. Updated 2024-02-15. Indexed 2024-07-29.
Download link:
https://zhiyiluo.site/misc/clean_v1.0_sample.json
Resource overview:
CLEAN, created by Zhejiang Sci-Tech University, is a Chinese multi-span question answering dataset covering a wide range of open-domain topics. It contains 9,063 samples, about 76% of which require descriptive answers. The content comes from large-scale Chinese online knowledge-sharing Q&A platforms such as Baidu Zhidao, and the dataset supports framing reading comprehension as an answer extraction task. During construction, one million questions were randomly crawled and then filtered for high-quality answers, so that the intent of each question is properly addressed. CLEAN targets the extraction of multiple pieces of information for complex open-domain questions and aims to overcome the limitations of existing datasets in question selection.
Provider:
Zhejiang Sci-Tech University
Created:
2024-02-15
Collected Summary
Dataset Introduction

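Background and Challenges
Background Overview
CLEAN is a structured dataset for question answering or named entity recognition. It contains multiple data items, each consisting of a question, a context passage, sequence labeling tags, and answer spans. It covers topics such as geography, history, and culture, and is intended to support natural language processing tasks such as information extraction and text understanding.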
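The item layout described above (question, context, sequence labels, answer spans) can be illustrated with a minimal sketch of how labels might be decoded into multiple answer spans. Everything specific here is an assumption rather than the dataset's documented schema: the field names `question`, `context`, and `label`, the character-level tokenization, the B/I/O tag scheme, and the example record itself are hypothetical.

```python
from typing import List


def decode_spans(tokens: List[str], labels: List[str]) -> List[str]:
    """Collect answer spans from per-token labels.

    Assumes a B/I/O scheme: "B" opens a span, "I" continues it,
    and anything else closes it. This scheme is an assumption.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new span starts; flush the previous one if it is still open.
            if current:
                spans.append("".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open span: close any open span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


# Hypothetical record; field names and text are illustrative only.
record = {
    "question": "长江流经哪些城市?",
    "context": list("长江流经重庆、武汉和南京。"),
    "label": ["O", "O", "O", "O", "B", "I", "O", "B", "I", "O", "B", "I", "O"],
}

spans = decode_spans(record["context"], record["label"])
print(spans)  # ['重庆', '武汉', '南京']
```

Returning a list rather than a single string reflects the multi-span setting: one question can have several disjoint answers within the same context.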
The summary above was collected and generated by 遇见数据集.



