wiki-727K + wiki-50 (Text Segmentation as a Supervised Learning Task)
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4737321
Description:
Dataset accompanying Text Segmentation as a Supervised Learning Task (Koshorek et al. 2018)
For this work we have created a new dataset, which we name wiki-727K. It is a collection of 727,746 English Wikipedia documents together with their hierarchical segmentation, as it appears in each document's table of contents. We randomly partitioned the documents into a train (80%), development (10%), and test (10%) set.
Different text segmentation use cases require different levels of granularity. For example, for segmenting text by overarching topic, it makes sense to train a model that predicts only top-level segments, which typically vary in topic (for example, “History”, “Geography”, and “Demographics”). For segmenting a radio broadcast into separate news stories, which requires finer granularity, it makes sense to train a model to predict sub-segments. Our dataset provides the entire segmentation information, and an application may choose the appropriate level of granularity.
To generate the data, we performed the following preprocessing steps for each Wikipedia document:
Removed all photos, tables, Wikipedia template elements, and other non-text elements.
Removed single-sentence segments, documents with fewer than three segments, and documents where most segments were filtered.
Divided each segment into sentences using the PUNKT tokenizer of the NLTK library. This is necessary for the use of our dataset as a benchmark, as without a well-defined sentence segmentation, it is impossible to evaluate different models.
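The filtering steps above can be sketched roughly as follows. This is an illustration, not the authors' pipeline: the (heading, text) input shape, the function names, and the reading of "most segments were filtered" as "more than half were dropped" are all assumptions, as is applying the three-segment minimum after the single-sentence filter. The regex splitter is only a stand-in so the example has no external dependencies; the dataset itself was split with NLTK's PUNKT tokenizer, and a function such as nltk.tokenize.sent_tokenize could be passed in its place.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical input shape: one (heading, raw_text) pair per segment.
Segment = Tuple[str, str]
SplitSegment = Tuple[str, List[str]]


def naive_split(text: str) -> List[str]:
    """Stand-in splitter (NOT Punkt): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def preprocess(
    segments: List[Segment],
    split_sentences: Callable[[str], List[str]] = naive_split,
    min_segments: int = 3,
) -> Optional[List[SplitSegment]]:
    """Return the surviving segments as sentence lists, or None to drop the document."""
    kept: List[SplitSegment] = []
    for heading, text in segments:
        sentences = split_sentences(text)
        # Drop single-sentence (and empty) segments.
        if len(sentences) > 1:
            kept.append((heading, sentences))

    # Assumption: the minimum applies to the segments that survive filtering.
    if len(kept) < min_segments:
        return None
    # Assumption: "most segments were filtered" means more than half were dropped.
    if 2 * len(kept) < len(segments):
        return None
    return kept


doc = [
    ("History", "A b. C d."),
    ("Geography", "One."),
    ("Demographics", "E f. G h. I j."),
    ("Economy", "K. L."),
]
result = preprocess(doc)
```

Here the single-sentence "Geography" segment is dropped and the remaining three segments are kept, so the document survives; a document left with only one or two segments would be discarded.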
Because our test set is large, it is difficult to evaluate some of the existing methods, which are computationally demanding. Thus, we introduce wiki-50, a set of 50 randomly sampled test documents from wiki-727K. We use wiki-50 to evaluate systems that are too slow to evaluate on the entire test set. We also provide human segmentation performance results on wiki-50 (Pk = 14.97, Beeferman et al. 1999).
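For readers unfamiliar with the Pk metric cited above, here is a minimal sketch of how it is commonly computed at the sentence level. It is not the evaluation code used for the reported numbers. The boundary encoding (1 marks a sentence that ends a segment), the function names, and the default window size (half the average reference segment length, rounded) are assumptions made for illustration. The function returns a fraction between 0 and 1; the 14.97 figure is presumably on a percentage scale.

```python
from typing import List, Optional


def boundaries_to_labels(boundaries: List[int]) -> List[int]:
    """Map per-sentence boundary flags to a segment id for each sentence.

    boundaries[i] == 1 means sentence i is the last sentence of its segment.
    """
    labels: List[int] = []
    current = 0
    for flag in boundaries:
        labels.append(current)
        if flag:
            current += 1
    return labels


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of sentence pairs (i, i + k) on which the two segmentations
    disagree about whether both sentences lie in the same segment."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    n = len(reference)
    if n == 0:
        raise ValueError("empty segmentation")

    ref = boundaries_to_labels(reference)
    hyp = boundaries_to_labels(hypothesis)

    if k is None:
        # Assumed default: half the average reference segment length.
        num_segments = ref[-1] + 1
        k = max(1, round(n / (2 * num_segments)))
    if k >= n:
        raise ValueError("window size k must be smaller than the document length")

    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


reference = [0, 0, 1, 0, 0, 1]   # two segments of three sentences each
one_segment = [0, 0, 0, 0, 0, 1]  # hypothesis that misses the inner boundary
perfect_score = pk(reference, reference, k=2)
missed_score = pk(reference, one_segment, k=2)
```

Lower is better: a hypothesis identical to the reference scores 0, while the hypothesis that misses the only internal boundary disagrees on two of the four window positions.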
Created:
2021-05-05



