Saibo-creator/bookcorpus_deduplicated

Hugging Face · Updated 2022-12-29 · Indexed 2024-06-15
Download link: https://hf-mirror.com/datasets/Saibo-creator/bookcorpus_deduplicated
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2867856394
    num_examples: 38832894
  download_size: 1794567875
  dataset_size: 2867856394
---

# Dataset Card for "bookcorpus_deduplicated"

## Dataset Summary

This is a deduplicated version of the original [Book Corpus dataset](https://huggingface.co/datasets/bookcorpus).
The Book Corpus (Zhu et al., 2015), which was used to train popular models such as BERT, contains a substantial number of exact-duplicate documents. [Bandy and Vincent (2021)](https://arxiv.org/abs/2105.05241) find that thousands of books in BookCorpus are duplicated, with only 7,185 unique books out of 11,038 total.

Effect of deduplication (deduplicated vs. original):
- Number of lines: 38,832,894 vs. 74,004,228
- Dataset size: 2.91 GB vs. 4.63 GB

Duplicate text has been dropped and only the first appearance is kept. The original order of the text is preserved.

## Why deduplicate?

Deduplicating training data has been shown to have several advantages, including:
- requiring fewer training steps to achieve the same or better accuracy
- training models that emit memorized text ten times less frequently
- reducing carbon emissions and energy consumption

cf. [Deduplicating Training Data Makes Language Models Better](https://arxiv.org/abs/2107.06499)

## Deduplication script

```python
import pandas as pd
from datasets import load_dataset

# Load every line of the original corpus into memory
dataset = load_dataset("bookcorpus")["train"]["text"]
df = pd.DataFrame({"text": dataset})

# Drop duplicates (exact match); the first occurrence is kept by default
df_filtered = df["text"].drop_duplicates()
df_filtered.to_csv("bookcorpus_filtered.csv", index=False, header=False)

# Reload the filtered file as a line-per-example text dataset
new_dataset = load_dataset("text", data_files={"train": "bookcorpus_filtered.csv"})
```

The script runs quickly, in a matter of minutes.
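The exact-match, keep-first rule used by the script can also be sketched with only the standard library, streaming over lines instead of building a DataFrame. This is a minimal illustration, not the script that produced this dataset: the function name `dedup_lines` and the sample strings are hypothetical, and storing a BLAKE2 digest instead of the full line is an assumption made here to reduce memory use.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in order of first appearance."""
    seen = set()
    for line in lines:
        # Assumption: keep a 16-byte digest per line rather than the text itself.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


# Hypothetical sample; matching is exact, so case differences are not merged.
sample = ["the cat sat .", "hello world", "the cat sat .", "Hello world"]
unique = list(dedup_lines(sample))
print(unique)
```

Because the function is a generator, it can be fed a file handle or any other iterable of lines, and later repeats are dropped without changing the relative order of the lines that remain.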
More sophisticated deduplication algorithms, such as https://github.com/google-research/deduplicate-text-datasets, can be applied to improve performance.

## Reference

```bib
@misc{https://doi.org/10.48550/arxiv.2105.05241,
  doi = {10.48550/ARXIV.2105.05241},
  url = {https://arxiv.org/abs/2105.05241},
  author = {Bandy, Jack and Vincent, Nicholas},
  keywords = {Computation and Language (cs.CL), Computers and Society (cs.CY), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

```bib
@misc{https://doi.org/10.48550/arxiv.2107.06499,
  doi = {10.48550/ARXIV.2107.06499},
  url = {https://arxiv.org/abs/2107.06499},
  author = {Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Deduplicating Training Data Makes Language Models Better},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

```bib
@misc{https://doi.org/10.48550/arxiv.2209.00099,
  doi = {10.48550/ARXIV.2209.00099},
  url = {https://arxiv.org/abs/2209.00099},
  author = {Treviso, Marcos and Ji, Tianchu and Lee, Ji-Ung and van Aken, Betty and Cao, Qingqing and Ciosici, Manuel R. and Hassid, Michael and Heafield, Kenneth and Hooker, Sara and Martins, Pedro H. and Martins, André F. T. and Milder, Peter and Raffel, Colin and Simpson, Edwin and Slonim, Noam and Balasubramanian, Niranjan and Derczynski, Leon and Schwartz, Roy},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Efficient Methods for Natural Language Processing: A Survey},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
Provider: Saibo-creator
