sumuks/essential-web-v1.0-sample-10B
Source: Hugging Face | Updated: 2025-07-03 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/sumuks/essential-web-v1.0-sample-10B
Description:
This is a sampled subset from the Essential Web v1.0 dataset, containing approximately 10.1 billion tokens. The dataset preserves all columns from the original dataset, including document ID, text content, URL and source information, RedPajama quality metrics, Essential AI taxonomy labels, etc. The dataset is stored in the data directory in Parquet file format and is automatically loaded as a single dataset through the HuggingFace datasets library. The sampling method is random sampling across snapshots, preserving all original columns and metadata.
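Since the description says the Parquet files load as a single dataset through the Hugging Face datasets library, a minimal sketch of inspecting the sample without downloading all of it might look like the following. The split name "train" and the use of streaming mode are assumptions not stated on this page, and the helper names (open_stream, peek, column_names) are hypothetical, introduced only for illustration. Column names are discovered from the records rather than hard-coded, because the page lists the columns only by description.

```python
from itertools import islice

# Repository ID taken from the listing above.
REPO_ID = "sumuks/essential-web-v1.0-sample-10B"


def open_stream(repo_id=REPO_ID, split="train"):
    """Open the dataset in streaming mode (records are fetched lazily).

    Assumption: the split is named "train"; adjust if the repository differs.
    """
    # Imported here so the helpers below work without the package installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def peek(records, n=3):
    """Return the first n records from any iterable of dict-like rows."""
    return list(islice(records, n))


def column_names(rows):
    """Collect column names in first-seen order from a list of rows."""
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return names


# Usage (requires network access and the `datasets` package):
#   rows = peek(open_stream(), 3)
#   print(column_names(rows))
```

Streaming is chosen here because the sample holds roughly 10.1 billion tokens, so pulling a few rows first is a cheap way to confirm which columns are present before committing to a full download.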
Provider:
sumuks



