ruliad/fineweb_350BT_chunk_3
Hugging Face, updated 2024-06-02, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/ruliad/fineweb_350BT_chunk_3
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: dump
    dtype: string
  - name: url
    dtype: string
  - name: date
    dtype: string
  - name: file_path
    dtype: string
  - name: language
    dtype: string
  - name: language_score
    dtype: float64
  - name: token_count
    dtype: int64
  splits:
  - name: train
    num_bytes: 17041257984
    num_examples: 5000000
  download_size: 10198207509
  dataset_size: 17041257984
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
---
Each record has nine fields: the text content, a unique identifier, the dump name, the source URL, the collection date, the file path, the language, the language detection score, and the token count of the text. The dataset has a single train split with 5,000,000 examples. Its total size is 17041257984 bytes (about 17.04 GB), and the download size is 10198207509 bytes (about 10.20 GB).
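As a rough illustration of how these fields might be used, the sketch below filters records by `language_score` and `token_count`. The two thresholds and the helper names `filter_records`, `stream_train_split`, and `preview` are assumptions made for this example, not part of the dataset card. The loader assumes the Hugging Face `datasets` package is installed and that the repository is reachable under the name shown on this page.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "ruliad/fineweb_350BT_chunk_3"  # repository name from this page


def filter_records(
    records: Iterable[Dict[str, Any]],
    min_language_score: float = 0.9,  # illustrative threshold, not from the source
    max_token_count: int = 2048,      # illustrative threshold, not from the source
) -> Iterator[Dict[str, Any]]:
    """Yield only the records that pass both thresholds."""
    for record in records:
        if (record["language_score"] >= min_language_score
                and record["token_count"] <= max_token_count):
            yield record


def stream_train_split():
    """Open the train split lazily, without downloading the whole dataset first.

    Needs `pip install datasets` and network access.
    """
    from datasets import load_dataset  # imported here so the filter works without it
    return load_dataset(REPO_ID, split="train", streaming=True)


def preview(n: int = 3) -> None:
    """Print a few filtered records from the streamed train split."""
    for record in islice(filter_records(stream_train_split()), n):
        print(record["id"], record["token_count"], record["url"])
```

Calling `preview()` touches the network; `filter_records` works on any iterable of dictionaries, so it can be tried on a handful of hand-written records first.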
Provider:
ruliad
Summary of original information
Dataset overview
Dataset features
- text: string
- id: string
- dump: string
- url: string
- date: string
- file_path: string
- language: string
- language_score: float
- token_count: integer
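The feature list above can be written down as a small schema check. This is a minimal sketch using only the standard library: the `SCHEMA` mapping mirrors the nine fields and types listed on this page, while the function name `schema_errors` and every value in `sample` are hypothetical, made up purely for illustration.

```python
from typing import Any, Dict, List

# Field names and Python types corresponding to the features listed on this page.
SCHEMA = {
    "text": str,
    "id": str,
    "dump": str,
    "url": str,
    "date": str,
    "file_path": str,
    "language": str,
    "language_score": float,
    "token_count": int,
}


def schema_errors(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record fits."""
    errors = []
    for name, expected in SCHEMA.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for name in record:
        if name not in SCHEMA:
            errors.append(f"unexpected field: {name}")
    return errors


# Hypothetical record: all values below are invented for this example.
sample = {
    "text": "Example page text.",
    "id": "example-id",
    "dump": "example-dump",
    "url": "https://example.com/page",
    "date": "2024-01-01T00:00:00Z",
    "file_path": "example/path",
    "language": "en",
    "language_score": 0.97,
    "token_count": 4,
}
print(schema_errors(sample))
```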
Dataset splits
- train:
  - Number of examples: 5000000
  - Storage size: 17041257984 bytes
Dataset size
- Download size: 10198207509 bytes
- Total dataset size: 17041257984 bytes
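The byte counts above are easier to compare after a little arithmetic. The sketch below derives the average size per example and the ratio of download size to dataset size from the three figures on this page; the helper names `to_gb` and `to_gib` are assumptions of this example. Reading the ratio as a compression ratio is also an assumption, since the page does not say how the download is stored.

```python
# Figures copied from this page.
NUM_EXAMPLES = 5_000_000
DATASET_SIZE_BYTES = 17_041_257_984
DOWNLOAD_SIZE_BYTES = 10_198_207_509


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"dataset size:  {to_gb(DATASET_SIZE_BYTES):.2f} GB ({to_gib(DATASET_SIZE_BYTES):.2f} GiB)")
print(f"download size: {to_gb(DOWNLOAD_SIZE_BYTES):.2f} GB ({to_gib(DOWNLOAD_SIZE_BYTES):.2f} GiB)")
print(f"average bytes per example: {avg_bytes_per_example:.1f}")
print(f"download size / dataset size: {download_ratio:.3f}")
```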
Configuration
- config_name: default
- data_files:
  - split: train
  - path: data/train-*



