orionweller/dolma_20_percent_sample
Source: Hugging Face | Updated: 2024-05-23 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/orionweller/dolma_20_percent_sample
Resource description:
# Example Usage
```
from datasets import load_dataset
import huggingface_hub

for folder_name in huggingface_hub.list_repo_tree("orionweller/dolma_20_percent_sample", repo_type="dataset"):
    # list_repo_tree yields repo entry objects, so compare their .path rather than the object itself
    if folder_name.path in ["README.md", ".gitattributes"]:
        continue
    # otherwise it is a folder for a url from a particular part of Dolma, e.g. `algebraic_stack_train_0000`, total is 2419
    # You can load only one part like this
    dataset = load_dataset("orionweller/dolma_20_percent_sample", data_files={"data": f"{folder_name.path}/*"})["data"]
    # dataset will have these keys: ["id", "text", "added", "created", "source"]
```
dolma_20_percent_sample is a sample of Dolma split into 2419 parts. Each part is stored in its own folder (for example, algebraic_stack_train_0000) and corresponds to a specific URL from a particular part of Dolma. Every loaded part has the keys id, text, added, created, and source.
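Loading one part at a time, as described above, could be wrapped in a few small helpers. This is only a sketch: the names `data_folders`, `part_data_files`, and `load_part` are hypothetical and not part of the dataset or of any library, and the use of `streaming=True` is an assumption made here to avoid downloading a whole part up front.

```python
from typing import Dict, Iterable, List

REPO_ID = "orionweller/dolma_20_percent_sample"
# Top-level repo entries that are not data folders (taken from the example above).
NON_DATA_FILES = {"README.md", ".gitattributes"}


def data_folders(paths: Iterable[str]) -> List[str]:
    """Keep only the top-level paths that are not repo metadata files."""
    return [path for path in paths if path not in NON_DATA_FILES]


def part_data_files(folder: str) -> Dict[str, str]:
    """Build the data_files mapping that selects every file in one folder."""
    return {"data": f"{folder}/*"}


def load_part(folder: str, streaming: bool = True):
    """Load a single part. Requires network access and the `datasets` package."""
    # Imported lazily so the pure helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        data_files=part_data_files(folder),
        streaming=streaming,  # assumption: iterate lazily instead of downloading first
    )["data"]
```

With network access, `load_part("algebraic_stack_train_0000")` should return an iterable over that one folder, and `next(iter(...))` on it would give a single record to inspect.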
Provider:
orionweller
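Summary of original information
Dataset overview
Dataset name
- Name: dolma_20_percent_sample
- Owner: orionweller
Dataset structure
- Data files: the dataset consists of multiple parts, each corresponding to a folder, e.g. algebraic_stack_train_0000.
- Data keys: a loaded dataset contains the keys ["id", "text", "added", "created", "source"].
Dataset loading
- Loading method: use the load_dataset function to load a specific part of the dataset from the Hugging Face Hub.
- Example code (Python), where folder_name is an entry yielded by huggingface_hub.list_repo_tree, as in Example Usage:
    from datasets import load_dataset
    import huggingface_hub
    dataset = load_dataset("orionweller/dolma_20_percent_sample", data_files={"data": f"{folder_name.path}/*"})["data"]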
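Since every part is documented as having the same five keys, a quick sanity check on loaded records may be useful before processing. The sketch below assumes records behave like dictionaries; `missing_keys` and `count_valid` are hypothetical helper names, and no assumption is made about the type of each field's value.

```python
from typing import Any, Iterable, List, Mapping

# Keys documented for every part of the dataset.
EXPECTED_KEYS = ("id", "text", "added", "created", "source")


def missing_keys(record: Mapping[str, Any]) -> List[str]:
    """Return the documented keys that are absent from a record."""
    return [key for key in EXPECTED_KEYS if key not in record]


def count_valid(records: Iterable[Mapping[str, Any]], limit: int = 100) -> int:
    """Check up to `limit` records and return how many were checked.

    Raises ValueError on the first record that lacks a documented key.
    """
    checked = 0
    for record in records:
        if checked >= limit:
            break
        missing = missing_keys(record)
        if missing:
            raise ValueError(f"record {checked} is missing keys: {missing}")
        checked += 1
    return checked
```

Passing the `dataset` object from the example to `count_valid(dataset, limit=10)` would check only the first ten records, which keeps the check cheap on large parts.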



