skymizer/fineweb-edu-dedup-45B-4-of-4
Source: Hugging Face | Updated: 2025-01-11 | Indexed: 2025-02-15
Download link:
https://hf-mirror.com/datasets/skymizer/fineweb-edu-dedup-45B-4-of-4
Description:
The dataset contains text content, unique identifiers, and metadata. The text is the main component, and each text entry has a unique identifier. The metadata describes each text: source URL, creation date, file path, text language and its language score, token count, a score, and an integer score. The dataset has a single training split, which is 55.93 GB in size and contains approximately 10.84 million examples. The download size is 32.43 GB.
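The record layout described above can be sketched in Python as follows. This is a minimal illustration, not the dataset's published schema: the field names (`text`, `id`, `url`, `date`, `file_path`, `language`, `language_score`, `token_count`, `score`, `int_score`) are assumptions inferred from the description, the sample records are invented, and the size arithmetic assumes 1 GB = 10^9 bytes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Record:
    """One example, mirroring the fields named in the description.

    All field names here are assumptions, not confirmed column names.
    """
    text: str              # main text content
    id: str                # unique identifier
    url: str               # source URL
    date: str              # creation date
    file_path: str         # file path of the source file
    language: str          # detected text language
    language_score: float  # score attached to the language field
    token_count: int       # number of tokens in the text
    score: float           # score
    int_score: int         # integer score


def total_tokens(records: Iterable[Record], min_int_score: int = 0) -> int:
    """Sum token counts over records whose integer score meets a threshold."""
    return sum(r.token_count for r in records if r.int_score >= min_int_score)


# Two invented records, for illustration only.
sample = [
    Record("Photosynthesis converts light into chemical energy.",
           "doc-1", "https://example.org/a", "2024-01-01", "path/a",
           "en", 0.98, 9, 3.6, 4),
    Record("Click here to see more offers today.",
           "doc-2", "https://example.org/b", "2024-01-02", "path/b",
           "en", 0.95, 7, 2.4, 2),
]

# Figures quoted in the description (GB taken as 10**9 bytes, an assumption).
TRAIN_SIZE_GB = 55.93
DOWNLOAD_SIZE_GB = 32.43
NUM_EXAMPLES = 10_840_000  # "approximately 10.84 million"

avg_bytes_per_example = TRAIN_SIZE_GB * 1e9 / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE_GB / TRAIN_SIZE_GB

if __name__ == "__main__":
    print("tokens with int_score >= 3:", total_tokens(sample, min_int_score=3))
    print(f"average bytes per example: {avg_bytes_per_example:.0f}")
    print(f"download size / train size: {download_ratio:.2f}")
```

Given the 32.43 GB download, streaming may be preferable to a full download. With the Hugging Face `datasets` library, something like `load_dataset("skymizer/fineweb-edu-dedup-45B-4-of-4", split="train", streaming=True)` should yield examples one at a time, although the actual column names should be checked against the dataset card before relying on the assumed ones above.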
Provider:
skymizer



