orionweller/tokenized-datasets-c4-msmarco-jsonl
Hugging Face | Updated: 2024-10-30 | Indexed: 2025-08-30
Download link:
https://hf-mirror.com/datasets/orionweller/tokenized-datasets-c4-msmarco-jsonl
Resource description:
---
dataset_info:
- config_name: c4
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  splits:
  - name: train
    num_examples: 8596372
- config_name: msmarco
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  splits:
  - name: train
    num_examples: 8596372
configs:
- config_name: c4
  data_files:
  - split: train
    path: c4/data-*-of-*.arrow
- config_name: msmarco
  data_files:
  - split: train
    path: msmarco/data-*-of-*.arrow
---
This dataset contains two configurations, c4 and msmarco. According to the metadata above, each has a single train split of 8,596,372 examples with two string fields (text and id), stored as sharded Arrow files.
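As a rough illustration of how the two configurations might be loaded, here is a minimal sketch using the Hugging Face `datasets` library. The repository ID and shard patterns are copied from this card; the helper names `config_for_path` and `load_train_split` are hypothetical, and the sketch assumes that `load_dataset` resolves the Arrow shards through the card's `configs` section, which has not been verified against the actual repository.

```python
from fnmatch import fnmatch
from typing import Optional

# Repository ID taken from the card title.
REPO_ID = "orionweller/tokenized-datasets-c4-msmarco-jsonl"

# Shard glob patterns copied from the card's `configs` section.
SHARD_PATTERNS = {
    "c4": "c4/data-*-of-*.arrow",
    "msmarco": "msmarco/data-*-of-*.arrow",
}


def config_for_path(path: str) -> Optional[str]:
    """Return the configuration whose shard pattern matches `path`, if any."""
    for name, pattern in SHARD_PATTERNS.items():
        if fnmatch(path, pattern):
            return name
    return None


def load_train_split(config_name: str, streaming: bool = True):
    """Load the train split of one configuration (hypothetical helper).

    Streaming is the default here so the shards are not all downloaded
    up front; that choice is an assumption of this sketch, not a
    recommendation from the card.
    """
    if config_name not in SHARD_PATTERNS:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)
```

A possible usage, assuming network access and an installed `datasets` package: call `load_train_split("msmarco")` and iterate over the result, where each record should expose the `text` and `id` fields listed in the metadata.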
Provider:
orionweller



