fineweb_deduplicated
Source: ModelScope community (updated 2025-12-04, indexed 2024-10-12)
Download link:
https://modelscope.cn/datasets/AI-ModelScope/fineweb_deduplicated
Description:
# TL;DR
[Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) is a popular, high-quality open dataset. This dataset is a deduplicated version of Fineweb: rows with duplicate text are removed, and the number of occurrences of each text is recorded.
## Motivation
Fineweb is an open text dataset intended for training language models. It is one of the highest-quality and most popular open datasets available. It was produced by HuggingFace, a reputable AI lab, and has been downloaded tens of thousands of times.
The Fineweb dataset is 93.4 TB and contains 15T tokens. This makes it one of the 10 largest open text datasets available, and that volume makes it hard and expensive to download and process.
About 70% of Fineweb is duplicated. Exact deduplication across all Common Crawl (CC) dumps reduces the dataset from 15T to 5T tokens, which makes it much cheaper and easier to work with.
This dataset also provides an opportunity for research on the effects of deduplication on massive datasets.
## Existing deduplication
Fineweb was deduplicated within each CC dump, but not across dumps.
HuggingFace's reasoning for publishing the dataset without exact deduplication across dumps is that the duplicates provide potentially valuable upsampling of high-quality rows. The hypothesis is that text which persists across multiple CC dumps is longer lived on the web and therefore more valuable. This is a very reasonable hypothesis, but the upsampling makes the dataset three times larger.
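The difference between the two deduplication scopes can be illustrated with a toy example. This is only a sketch under stated assumptions: the dump names, the sample texts, and both helper functions are hypothetical and are not taken from the Fineweb pipeline.

```python
from collections import Counter

def dedup_within_dumps(rows):
    """Keep the first copy of each text inside each dump (per-dump scope)."""
    seen = set()
    kept = []
    for dump, text in rows:
        key = (dump, text)  # the same text in another dump is a different key
        if key not in seen:
            seen.add(key)
            kept.append((dump, text))
    return kept

def dedup_across_dumps(rows):
    """Keep the first copy of each text over the whole dataset (global scope)."""
    seen = set()
    kept = []
    for dump, text in rows:
        if text not in seen:
            seen.add(text)
            kept.append((dump, text))
    return kept

# Hypothetical rows: (dump name, document text).
rows = [
    ("dump-A", "evergreen page"),
    ("dump-A", "evergreen page"),    # duplicate inside one dump
    ("dump-A", "short-lived page"),
    ("dump-B", "evergreen page"),    # same text, later dump
    ("dump-C", "evergreen page"),
]

per_dump = dedup_within_dumps(rows)
global_rows = dedup_across_dumps(rows)

# How often each text survives per-dump deduplication: the implicit upsampling.
upsampling = Counter(text for _, text in per_dump)

print(len(per_dump), len(global_rows), dict(upsampling))
```

In this toy data, the text that persists across three dumps survives per-dump deduplication three times, which is the upsampling effect described above; global deduplication keeps it once.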
## Deduplication mechanism
The text column was tokenized with the GPT-4o tokenizer, and the tokenized version was used as the key for exact deduplication. There is no deeper meaning behind this approach: we already work with the GPT-4o tokenized version, so it made sense to deduplicate on it, and there is no reason why deduplication on tokenized text should differ drastically from deduplication on plain text.
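A minimal sketch of exact deduplication keyed on a tokenized column, with occurrence counts, might look like the following. It is not the code used to build this dataset. The `byte_tokenize` function is a stand-in assumption so the example runs with the standard library only; the dataset itself used the GPT-4o tokenizer. The function names and sample documents are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

def byte_tokenize(text: str) -> Tuple[int, ...]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte.

    Like a byte-level BPE tokenizer, this mapping is lossless, so two texts
    share a token sequence only if they are identical.
    """
    return tuple(text.encode("utf-8"))

def dedup_with_counts(
    texts: Iterable[str],
    tokenize: Callable[[str], Tuple[int, ...]] = byte_tokenize,
) -> List[Tuple[str, int]]:
    """Return (text, count) pairs, one per distinct token sequence."""
    counts: Counter = Counter()
    first_text: Dict[Tuple[int, ...], str] = {}
    for text in texts:
        key = tokenize(text)              # deduplication key
        counts[key] += 1
        first_text.setdefault(key, text)  # keep the first row seen per key
    # most_common() orders by count, highest first
    return [(first_text[key], n) for key, n in counts.most_common()]

docs = ["a b", "c", "a b", "a b", "c", "d"]
result = dedup_with_counts(docs)
print(result)
```

Because the key comes from a lossless tokenization, grouping by token sequence gives the same groups as grouping by the raw text, which matches the observation above that the two approaches should not differ drastically.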
[This CSV file](https://huggingface.co/datasets/Salesforce/fineweb_deduplicated/blob/main/top_100_documents_by_accurances.csv) lists the 100 most common documents in Fineweb with their row counts.
Here is the most repeated document in Fineweb (17049 occurrences):
> Skip to main content Genealogy and Family History Records for Newspaper Archives (1690 – 2016) Newspaper Articles: Includes additional obituaries, births, marriages, and more > Historical Obituaries > Birth Records > Marriage Records > Passenger Lists > More Results – Other Newspaper Archives Records > Recent Newspaper Obituaries (1977 – Today) Government Publications (1789 – 1994) Find military records, widow's claims, orphan petitions, land grants and much more! Historical Books (1749 – 1900) Printed items including: family genealogies, local histories, funeral sermons, biographies, and much more. Social Security Death Index (1937 – 2014) GET UNLIMITED ACCESS: Sign up for a 30-day trial to get unlimited access to our archives. Start a 30-Day Trial As seen on: The Wall Street Journal The Huffington Post Terms of Service Share this page:
## Ethical Considerations
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
Provider:
maas
Created:
2024-09-27



