argilla-warehouse/personahub-fineweb-edu-4-dedup
Source: Hugging Face | Updated: 2024-09-10 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/argilla-warehouse/personahub-fineweb-edu-4-dedup
Resource summary:
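PersonaHub FineWeb-Edu 4 dedup is a deduplicated dataset: a `MinHashDedup` step removed about 1.7 million personas. It has four features (`id`, `persona`, `model_name`, and `keep_row_after_minhash_filtering`) and is mainly used to store persona descriptions related to Mexican-American history and culture. It falls in the 10M to 100M size category, with 22,532,926 training samples and a total size of 5,815,089,318 bytes. The language is English, the license is llama3, and its tags include synthetic and distilabel.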
This is a deduplicated version of [argilla-warehouse/personahub-fineweb-edu-4-raw](https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-raw), produced with a newly added `MinHashDedup` step that removed approximately 1.7 million personas. The dataset has four features: `id`, `persona`, `model_name`, and `keep_row_after_minhash_filtering`. It consists of a training set with 22,532,926 samples. The dataset size is 5,815,089,318 bytes, and the download size is 2,752,424,820 bytes. The dataset is in English and is licensed under llama3.
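To illustrate the general idea behind MinHash-based deduplication, here is a minimal standard-library sketch. It is not the `MinHashDedup` implementation used to build this dataset. The helper names (`shingles`, `minhash_signature`, `mark_duplicates`), the 3-word shingles, the 64 hash functions, and the 0.9 threshold are all assumptions chosen for illustration, as is the reading of `keep_row_after_minhash_filtering` as a boolean keep flag.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into overlapping n-word shingles (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> tuple[int, ...]:
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for item in items
            )
        )
    return tuple(signature)


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mark_duplicates(rows: list[dict], threshold: float = 0.9) -> list[dict]:
    """Flag each row; a row is kept only if it is not too similar to an earlier kept row.

    Assumption: the flag is stored under `keep_row_after_minhash_filtering`.
    This pairwise loop is quadratic; large-scale pipelines typically use
    locality-sensitive hashing to avoid comparing every pair.
    """
    kept_signatures = []
    flagged = []
    for row in rows:
        sig = minhash_signature(shingles(row["persona"]))
        keep = all(estimated_jaccard(sig, other) < threshold for other in kept_signatures)
        if keep:
            kept_signatures.append(sig)
        flagged.append({**row, "keep_row_after_minhash_filtering": keep})
    return flagged
```

With rows shaped like the dataset's features, filtering on the flag leaves one representative per group of near-duplicate personas. For 22 million rows, the pairwise comparison shown here would be far too slow, which is why this is only a sketch of the principle.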
Provider:
argilla-warehouse



