BREAD

Name: BREAD
Creator: Google Research
Published: 2023-11-11 08:11:50
License: not specified

Source: arXiv; updated 2023-11-11; indexed 2024-06-21

Download link:

https://github.com/toizzy/bread

Overview:

BREAD is a dataset created by Google Research for evaluating redundancy and boilerplate content in text across 360 languages. It consists of documents randomly selected from the MADLAD-400 dataset and annotated by NLP experts. The data is labeled with four classes: REP (repetitive boilerplate), OK (natural text), BOIL (non-linguistic boilerplate or noise), and UNK (annotator unsure). BREAD is intended for developing and testing methods that detect redundant text, with a particular focus on low-resource languages, in support of cleaner language modeling datasets.
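As a rough illustration of the four-class scheme described above, the following Python sketch models the labels as an enum and computes the share of each class in a list of annotations. The names `Label` and `label_shares` and the sample annotations are hypothetical and made up for this example; the actual file layout of the BREAD repository is not described here and is not assumed.

```python
from collections import Counter
from enum import Enum


class Label(Enum):
    """The four annotation classes named in the dataset description."""
    REP = "REP"    # repetitive boilerplate
    OK = "OK"      # natural text
    BOIL = "BOIL"  # non-linguistic boilerplate or noise
    UNK = "UNK"    # annotator unsure


def label_shares(labels):
    """Return the fraction of annotations in each class (hypothetical helper).

    Raises ValueError if a label is not one of the four known classes.
    """
    counts = Counter(Label(raw) for raw in labels)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in Label}
    return {label: counts[label] / total for label in Label}


# Made-up annotations for illustration only; not taken from BREAD.
example = ["OK", "REP", "OK", "BOIL", "UNK", "OK", "REP", "OK"]
shares = label_shares(example)
for label, share in shares.items():
    print(f"{label.value}: {share:.3f}")
```

A summary like this is one simple way to compare how much natural text versus boilerplate a given language's sample contains, which is the kind of question the dataset is meant to support.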

Provided by:

Google Research

Created:

2023-11-11
