
sumukshashidhar-archive/essential-web-v1.0-sample-10B

Source: Hugging Face | Updated: 2025-07-03 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/sumukshashidhar-archive/essential-web-v1.0-sample-10B
Description:
# Essential Web v1.0 - 10B Token Sample

Approximately 10,000,000,000 tokens sampled from Essential Web v1.0.

## Dataset Info

- **Target**: 10,000,000,000 tokens
- **Actual**: ~10,123,307,400 tokens (estimated)
- **Source**: [EssentialAI/essential-web-v1.0](https://huggingface.co/datasets/EssentialAI/essential-web-v1.0)

## Schema

This sample preserves ALL columns from the original dataset, including:

- `id`: Document ID
- `text`: Text content
- `metadata`: URL and source information
- `quality_signals`: RedPajama quality metrics
- `eai_taxonomy`: Essential AI taxonomy labels
- `pid`: Partition ID
- And all other original columns

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("sumuks/essential-web-v1.0-sample-10B")

# Access the data with all columns
example = dataset['train'][0]
print(example['text'][:200] + "...")

# Access quality signals
print(example['quality_signals'])

# Access taxonomy
print(example['eai_taxonomy'])
```

## File Structure

The dataset is split across multiple parquet files in the `data/` directory:

- `data/part-00000.parquet`
- `data/part-00001.parquet`
- etc.

HuggingFace datasets automatically loads all parts as a single dataset.

## Sampling Method

- Random sampling across snapshots
- Preserves all original columns and metadata
- Token estimation: ~600 tokens per row
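The token estimate and the shard naming pattern above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the "Actual" figure was computed as row count times the flat ~600 tokens-per-row estimate (the card does not state this explicitly). The helper names `implied_rows` and `part_path` are hypothetical and are not part of the dataset or of any library.

```python
# Figures taken from the dataset card.
TOKENS_PER_ROW = 600                 # flat per-row token estimate
ESTIMATED_TOKENS = 10_123_307_400    # "Actual" token count (estimated)


def implied_rows(total_tokens: int, tokens_per_row: int = TOKENS_PER_ROW) -> int:
    """Number of rows implied by a token total under a flat per-row estimate.

    Assumption: the total was derived as rows * tokens_per_row.
    """
    return round(total_tokens / tokens_per_row)


def part_path(index: int) -> str:
    """Path of a shard, following the data/part-NNNNN.parquet pattern from the card."""
    return f"data/part-{index:05d}.parquet"


if __name__ == "__main__":
    print(implied_rows(ESTIMATED_TOKENS))
    print(part_path(1))
```

Under that assumption, the sample would contain roughly 16.9 million rows. Because the per-row figure is a flat estimate rather than a tokenizer count, the true token total may differ from the number reported.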
Provider:
sumukshashidhar-archive