Saibo-creator/bookcorpus_deduplicated
Dataset Card for "bookcorpus_deduplicated"
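Dataset Summary
This is a deduplicated version of the original Book Corpus dataset. According to Bandy and Vincent (2021), the Book Corpus dataset contains a large number of exact-duplicate documents. Of the 11,038 books in the original dataset, only 7,185 are unique.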
Effect of deduplication (deduplicated vs. original):
- Number of lines: 38,832,894 vs. 74,004,228
- Dataset size: 2.91 GB vs. 4.63 GB
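To put the two comparisons on a common scale, the figures above can be turned into relative reductions. The snippet below is only a small arithmetic sketch; the helper name `reduction` is made up for illustration and is not part of the dataset tooling.

```python
# Figures quoted in this card (deduplicated vs. original).
lines_dedup, lines_orig = 38_832_894, 74_004_228
size_dedup_gb, size_orig_gb = 2.91, 4.63


def reduction(after: float, before: float) -> float:
    """Fraction removed, e.g. 0.25 means a quarter was dropped."""
    return 1.0 - after / before


line_reduction = reduction(lines_dedup, lines_orig)
size_reduction = reduction(size_dedup_gb, size_orig_gb)

print(f"lines removed: {line_reduction:.1%}")
print(f"size removed:  {size_reduction:.1%}")
```

With these numbers, roughly 47.5% of the lines and 37.1% of the bytes are removed.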
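During deduplication, repeated text is removed: only the first occurrence of each text is kept, and the original order of the text is preserved.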
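The "keep the first occurrence, preserve order" behavior can be illustrated on a toy list. This is a minimal standard-library sketch, not the script used to build the dataset; the function name `dedup_keep_first` and the sample lines are invented for the example. pandas' `drop_duplicates`, which the script in this card relies on, has the same keep-first default.

```python
def dedup_keep_first(lines):
    """Drop exact duplicates, keeping the first occurrence in original order."""
    # dict preserves insertion order (Python 3.7+), so the keys come out
    # in the order in which each distinct line was first seen.
    return list(dict.fromkeys(lines))


# Hypothetical sample lines, for illustration only.
sample = [
    "chapter one",
    "it was late .",
    "chapter one",    # exact repeat of the first line: dropped
    "she left .",
    "it was late .",  # exact repeat of the second line: dropped
]

deduped = dedup_keep_first(sample)
print(deduped)
```

Only byte-for-byte identical lines are treated as duplicates here; lines that differ in case or whitespace would all be kept.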
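Why deduplicate?
Deduplicating training data has been shown to bring several benefits, including:
- Fewer training steps are needed to reach the same or better accuracy
- Trained models emit memorized text ten times less frequently
- Lower carbon emissions and energy consumption
Reference: Deduplicating Training Data Makes Language Models Better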
Deduplication Script
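```python
import pandas as pd
from datasets import load_dataset

# Load every line of the original corpus into a single-column DataFrame.
dataset = load_dataset("bookcorpus")["train"]["text"]
df = pd.DataFrame({"text": dataset})

# Remove duplicates (exact match); the first occurrence is kept by default.
df_filtered = df["text"].drop_duplicates()

# Write the surviving lines, then reload them with the "text" loader.
# Note: to_csv applies CSV quoting to lines that contain commas or quotes.
df_filtered.to_csv("bookcorpus_filtered.csv", index=False, header=False)
new_dataset = load_dataset("text", data_files={"train": "bookcorpus_filtered.csv"})
```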
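The script runs quickly, in under a few minutes. More sophisticated deduplication algorithms, such as google-research/deduplicate-text-datasets, could be applied to improve performance.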
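As one possible variation on the exact-match approach, the sketch below deduplicates a text file in a streaming fashion, storing a fixed-size hash per distinct line instead of holding the whole corpus in a DataFrame. It is an illustration under stated assumptions, not the method used for this dataset and not the algorithm implemented in google-research/deduplicate-text-datasets. The names `iter_unique` and `dedup_file` are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def iter_unique(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in order of first appearance."""
    seen = set()  # 16-byte digests of the lines already emitted
    for line in lines:
        # A 128-bit BLAKE2b digest stands in for the full line; an
        # accidental collision is extremely unlikely but not impossible.
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


def dedup_file(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path without repeated lines; return lines kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # Strip the newline first so a final line without a trailing
        # newline still matches earlier copies of the same text.
        for line in iter_unique(raw.rstrip("\n") for raw in src):
            dst.write(line + "\n")
            kept += 1
    return kept
```

A call such as `dedup_file("corpus.txt", "corpus_dedup.txt")` (both paths are placeholders) writes one line per distinct input line. Like the pandas script, this only catches exact matches; near-duplicates and repeated substrings need a different technique.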
References
```bib
@misc{https://doi.org/10.48550/arxiv.2105.05241,
  doi = {10.48550/ARXIV.2105.05241},
  url = {https://arxiv.org/abs/2105.05241},
  author = {Bandy, Jack and Vincent, Nicholas},
  keywords = {Computation and Language (cs.CL), Computers and Society (cs.CY), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

```bib
@misc{https://doi.org/10.48550/arxiv.2107.06499,
  doi = {10.48550/ARXIV.2107.06499},
  url = {https://arxiv.org/abs/2107.06499},
  author = {Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {Deduplicating Training Data Makes Language Models Better},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```

```bib
@misc{https://doi.org/10.48550/arxiv.2209.00099,
  doi = {10.48550/ARXIV.2209.00099},
  url = {https://arxiv.org/abs/2209.00099},
  author = {Treviso, Marcos and Ji, Tianchu and Lee, Ji-Ung and van Aken, Betty and Cao, Qingqing and Ciosici, Manuel R. and Hassid, Michael and Heafield, Kenneth and Hooker, Sara and Martins, Pedro H. and Martins, André F. T. and Milder, Peter and Raffel, Colin and Simpson, Edwin and Slonim, Noam and Balasubramanian, Niranjan and Derczynski, Leon and Schwartz, Roy},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Efficient Methods for Natural Language Processing: A Survey},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```



