wiki-727K + wiki-50 (Text Segmentation as a Supervised Learning Task)
Source: Mendeley Data. Updated 2024-03-27; indexed 2024-06-29.
Download link:
https://zenodo.org/record/4737322
Description:
Dataset accompanying Text Segmentation as a Supervised Learning Task (Koshorek et al. 2018).

For this work we have created a new dataset, which we name wiki-727K. It is a collection of 727,746 English Wikipedia documents and their hierarchical segmentation, as it appears in their table of contents. We randomly partitioned the documents into a train (80%), development (10%), and test (10%) set.

Different text segmentation use-cases require different levels of granularity. For example, for segmenting text by overarching topic it makes sense to train a model that predicts only top-level segments, which typically vary in topic, such as “History”, “Geography”, and “Demographics”. For segmenting a radio broadcast into separate news stories, which requires finer granularity, it makes sense to train a model to predict sub-segments. Our dataset provides the entire segmentation information, and an application may choose the appropriate level of granularity.

To generate the data, we performed the following preprocessing steps for each Wikipedia document:
- Removed all photos, tables, Wikipedia template elements, and other non-text elements.
- Removed single-sentence segments, documents with fewer than three segments, and documents where most segments were filtered.
- Divided each segment into sentences using the PUNKT tokenizer of the NLTK library. This is necessary for the use of our dataset as a benchmark, as without a well-defined sentence segmentation it is impossible to evaluate different models.

Because our test set is large, it is difficult to evaluate some of the existing methods, which are computationally demanding. Thus, we introduce wiki-50, a set of 50 randomly sampled test documents from wiki-727K. We use wiki-50 to evaluate systems that are too slow to evaluate on the entire test set. We also provide human segmentation performance results on wiki-50 (Pk = 14.97; Beeferman et al. 1999).
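The Pk score quoted above compares a predicted segmentation with a reference one. The following is a minimal Python sketch of the usual sliding-window formulation, not the authors' evaluation code. The function names, the choice to represent a segmentation as a list of segment lengths in sentences, and the default window size (half the mean reference segment length) are assumptions made for illustration. The function returns a fraction between 0 and 1; the 14.97 figure reads as a percentage, so multiply by 100 to compare.

```python
from typing import List, Optional


def segment_ids(segment_lengths: List[int]) -> List[int]:
    """Expand segment lengths (in sentences) into one segment id per sentence."""
    ids: List[int] = []
    for seg_index, length in enumerate(segment_lengths):
        ids.extend([seg_index] * length)
    return ids


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where reference and hypothesis disagree.

    Both arguments are lists of segment lengths over the same sentences
    (a representation assumed for this sketch).
    """
    ref = segment_ids(reference)
    hyp = segment_ids(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must cover the same sentences")
    if k is None:
        # Assumed default: half the mean reference segment length, at least 1.
        k = max(1, round(len(ref) / len(reference) / 2))
    windows = len(ref) - k
    if windows <= 0:
        raise ValueError("document is too short for the chosen window size")

    errors = 0
    for i in range(windows):
        # Are sentences i and i + k in the same segment, according to each side?
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / windows


if __name__ == "__main__":
    # Two reference segments of four sentences; the hypothesis misses the boundary.
    print(pk([4, 4], [8]))
    # A perfect prediction has no disagreeing windows.
    print(pk([4, 4], [4, 4]))
```

Lower is better: a hypothesis identical to the reference scores 0, and each window that straddles a missed or spurious boundary counts as one error.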
Created:
2023-06-28
Collected summary
Dataset introduction

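Background and challenges
Background overview
This dataset comprises two parts, wiki-727K and wiki-50, for text segmentation as a supervised learning task. wiki-727K consists of 727,746 English Wikipedia documents with their hierarchical segmentation, split into training (80%), development (10%), and test (10%) sets, and it supports segmentation applications at different levels of granularity. wiki-50 is a set of 50 test documents randomly sampled from wiki-727K, used to evaluate computationally intensive models. The dataset was preprocessed to remove non-text elements and to ensure consistent sentence segmentation.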
The content above was collected and summarized by 遇见数据集.
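Because the dataset keeps the full heading hierarchy, an application has to collapse it to the granularity it wants before training. The sketch below illustrates one way to do that in Python. It assumes a hypothetical in-memory representation in which each section is a (heading level, title, sentences) tuple in document order; this is not the on-disk format of the dataset. The names `flatten_to_level` and `boundary_labels`, and the convention of labelling the last sentence of each segment with 1, are likewise assumptions of this sketch.

```python
from typing import List, Tuple

# Hypothetical representation: (heading level, title, sentences), in document order.
Section = Tuple[int, str, List[str]]


def flatten_to_level(sections: List[Section], max_level: int) -> List[Tuple[str, List[str]]]:
    """Merge every section deeper than max_level into the segment that precedes it."""
    merged: List[Tuple[str, List[str]]] = []
    for level, title, sentences in sections:
        if level <= max_level or not merged:
            # Start a new segment (also when the document opens with a deep section).
            merged.append((title, list(sentences)))
        else:
            # Too fine-grained for the requested level: fold into the previous segment.
            merged[-1][1].extend(sentences)
    return merged


def boundary_labels(segments: List[Tuple[str, List[str]]]) -> List[int]:
    """One label per sentence: 1 if the sentence ends a segment, else 0."""
    labels: List[int] = []
    for _, sentences in segments:
        if sentences:
            labels.extend([0] * (len(sentences) - 1) + [1])
    return labels


# Toy document with placeholder sentences.
example: List[Section] = [
    (1, "History", ["a", "b"]),
    (2, "Early history", ["c", "d"]),
    (1, "Geography", ["e", "f", "g"]),
]

if __name__ == "__main__":
    top_level = flatten_to_level(example, max_level=1)
    print([title for title, _ in top_level])
    print(boundary_labels(top_level))
    print(boundary_labels(flatten_to_level(example, max_level=2)))
```

With `max_level=1` the toy document yields two topic-level segments, while `max_level=2` keeps the sub-section as its own segment, which matches the finer granularity the description suggests for tasks such as splitting a broadcast into stories.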



