jack-stanley/fineweb-edu-dedup-10b-2gram-shuffled
Source: Hugging Face. Updated: 2025-04-08. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/jack-stanley/fineweb-edu-dedup-10b-2gram-shuffled
Description:
The dataset contains text data. Each record has a text field (text), a unique identifier (id), and metadata (metadata). The metadata covers the source dump (dump) and URL (url), the date (date), the file path (file_path), the language (language) with its confidence score (language_score), and the token count (token_count). Each record also carries two score fields: score and int_score. The dataset has a single split, train, with about 9.5 million records and a total size of about 48.1 GB.
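As a rough illustration of the record layout described above, the following Python sketch checks a record for the listed fields and estimates the average size per record from the stated totals. The nesting of the metadata fields under a `metadata` key, the top-level placement of `score` and `int_score`, the sample values, and the helper names `missing_fields` and `approx_bytes_per_record` are all assumptions made for illustration; the actual dataset may store these fields differently.

```python
# Field names come from the description; the nesting under "metadata" and the
# top-level placement of score/int_score are assumptions, not confirmed layout.
TOP_LEVEL_FIELDS = {"text", "id", "metadata", "score", "int_score"}
METADATA_FIELDS = {
    "dump", "url", "date", "file_path",
    "language", "language_score", "token_count",
}


def missing_fields(record):
    """Return the set of described fields that are absent from a record."""
    missing = set(TOP_LEVEL_FIELDS - record.keys())
    metadata = record.get("metadata") or {}
    missing |= {"metadata." + name for name in METADATA_FIELDS - metadata.keys()}
    return missing


def approx_bytes_per_record(total_gb=48.1, num_records=9.5e6):
    """Average bytes per record from the stated totals (assumes 1 GB = 10**9 bytes)."""
    return total_gb * 1e9 / num_records


# Hypothetical record with made-up values, used only to exercise the check.
sample = {
    "text": "Example educational passage.",
    "id": "example-id-0001",
    "metadata": {
        "dump": "example-dump",
        "url": "https://example.com/page",
        "date": "2024-01-01",
        "file_path": "example/path/file",
        "language": "en",
        "language_score": 0.98,
        "token_count": 5,
    },
    "score": 3.2,
    "int_score": 3,
}

if __name__ == "__main__":
    print(missing_fields(sample))
    print(round(approx_bytes_per_record()))
```

With the stated totals, the estimate works out to roughly 5 KB per record, which is only an average and says nothing about the size distribution.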
Provider:
jack-stanley



