Wikipedia Artificial Dataset
Source: arXiv. Updated 2021-12-11; indexed 2024-08-06.
Download link:
http://arxiv.org/abs/2112.11478v1
Description:
The Wikipedia Artificial Dataset is a text dataset created by the University of Applied Sciences for evaluating data deduplication methods. It contains more than 30,000 English Wikipedia articles of varying lengths. Near-duplicate samples were created by tokenizing each text into sentences, randomly selecting sentences, and adding sentences drawn from a pool of 2,000 distinct sentences collected from 300 articles. The dataset targets the problem of duplicate data in large-scale datasets, where deduplication helps keep model training effective and prevents data overlap within a dataset. It is applicable to fields such as natural language processing and automatic speech recognition.
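The creation process described above (sentence tokenization, random sentence selection, and adding sentences from a separate pool) can be sketched roughly as follows. This is only an illustrative sketch, not the dataset authors' code: the function names `split_sentences` and `make_near_duplicate`, the regex-based sentence splitter, and the `keep_ratio` and `n_added` parameters are all assumptions, since the description does not state which tokenizer was used or how many sentences were kept or added.

```python
import random
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by
    # whitespace. A real pipeline would likely use a proper sentence tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def make_near_duplicate(article, sentence_pool, keep_ratio=0.8, n_added=2, seed=None):
    """Build a near-duplicate of `article` (hypothetical helper).

    keep_ratio and n_added are illustrative values, not taken from the source.
    """
    rng = random.Random(seed)
    sentences = split_sentences(article)
    if not sentences:
        return article

    # Randomly select a subset of the original sentences, preserving order.
    n_keep = max(1, round(len(sentences) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(sentences)), n_keep))
    result = [sentences[i] for i in kept_idx]

    # Add sentences drawn from the external pool at random positions.
    extras = rng.sample(sentence_pool, min(n_added, len(sentence_pool)))
    for extra in extras:
        result.insert(rng.randint(0, len(result)), extra)

    return " ".join(result)
```

Under these assumptions, calling `make_near_duplicate(article, pool, seed=0)` on each article, with `pool` standing in for the 2,000 distinct sentences collected from 300 articles, would yield one near-duplicate per original. Fixing `seed` makes the result reproducible.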
Provider:
University of Applied Sciences
Created:
2021-12-11



