
argilla-warehouse/personahub-fineweb-edu-4-dedup

Source: Hugging Face. Updated 2024-09-10. Indexed 2024-12-14.
Download link:
https://hf-mirror.com/datasets/argilla-warehouse/personahub-fineweb-edu-4-dedup
Overview:
PersonaHub FineWeb-Edu 4 dedup is a deduplicated version of [argilla-warehouse/personahub-fineweb-edu-4-raw](https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-raw), produced with a newly added `MinHashDedup` step that removed approximately 1.7 million personas. It has four features: `id`, `persona`, `model_name`, and `keep_row_after_minhash_filtering`. There is a single training split of 22,532,926 rows, which places the dataset in the 10M to 100M size category. The dataset occupies 5,815,089,318 bytes, with a download size of 2,752,424,820 bytes. The language is English, the license is llama3, and the tags include synthetic and distilabel.

The listing's summary also describes the personas as mainly relating to Mexican-American history and culture, a scope the upstream description does not mention.
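To make the deduplication step more concrete, here is a minimal pure-Python sketch of MinHash-based near-duplicate flagging. It is not the `MinHashDedup` implementation used to build this dataset: the helper names, the shingle size, the number of hash functions, and the similarity threshold are all assumptions chosen for illustration, and a production step would normally add locality-sensitive hashing rather than compare every pair. The output flag is meant to mirror the role of the `keep_row_after_minhash_filtering` column.

```python
import hashlib
import re

NUM_PERM = 64      # hash functions per signature (assumed value)
SHINGLE_SIZE = 3   # words per shingle (assumed value)
THRESHOLD = 0.7    # estimated-Jaccard cutoff for a duplicate (assumed value)


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token, varied by seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=NUM_PERM):
    """Signature = minimum hash of the shingle set under each seeded hash."""
    tokens = shingles(text)
    return [min(seeded_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def mark_duplicates(personas, threshold=THRESHOLD):
    """Return one keep flag per persona; False means a near-duplicate of an earlier kept row."""
    kept_signatures = []
    flags = []
    for text in personas:
        sig = minhash_signature(text)
        is_dup = any(
            estimated_jaccard(sig, other) >= threshold for other in kept_signatures
        )
        flags.append(not is_dup)
        if not is_dup:
            kept_signatures.append(sig)
    return flags


# Hypothetical sample rows, not taken from the dataset.
sample = [
    "A historian who studies medieval trade routes",
    "A historian who studies medieval trade routes",
    "A software engineer focused on compilers and runtime systems",
]
keep_flags = mark_duplicates(sample)
```

In this sketch the second row is an exact repeat of the first, so its flag is `False`, while the first and third rows are kept. Filtering on the flag then yields the deduplicated set, which is how a column like `keep_row_after_minhash_filtering` would typically be consumed.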
Provider:
argilla-warehouse