nilq/babylm-10M
Source: Hugging Face. Updated 2024-01-21; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/nilq/babylm-10M
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 57802971
    num_examples: 1058740
  - name: validation
    num_bytes: 55093483
    num_examples: 1026747
  - name: test
    num_bytes: 60175255
    num_examples: 1054646
  download_size: 108417116
  dataset_size: 173071709
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
language:
- en
pretty_name: BabyLM 10M
---
# BabyLM 10M
This curated dataset originates from the [BabyLM Challenge](https://babylm.github.io/guidelines.html).
It contains ~10M words of mixed-domain text drawn from the following sources:
- CHILDES (child-directed speech)
- Subtitles (speech)
- BNC (speech)
- TED talks (speech)
- children's books (simple written language)
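
The card does not show how to load the data, so the following is only a minimal sketch. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable on the Hub under the name in the page title. The helper names `load_split` and `count_words` are hypothetical, and whitespace splitting is an assumed definition of "word" that may differ from the one behind the ~10M figure.

```python
from typing import Iterable, Mapping

# Repository name taken from the page title.
DATASET_ID = "nilq/babylm-10M"


def load_split(split: str = "train"):
    """Load one split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so count_words below can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def count_words(rows: Iterable[Mapping[str, str]]) -> int:
    """Count whitespace-separated tokens in the `text` column of each row."""
    return sum(len(row["text"].split()) for row in rows)
```

Calling `count_words(load_split("train"))` would then give a figure to compare against the stated ~10M words; the exact number depends on how words are tokenized.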
Provider:
nilq
Summary of original information
BabyLM 10M dataset overview
Dataset information
- Features:
  - text: string
- Splits:
  - train: 1058740 examples, 57802971 bytes
  - validation: 1026747 examples, 55093483 bytes
  - test: 1054646 examples, 60175255 bytes
- Data size:
  - Download size: 108417116 bytes
  - Total dataset size: 173071709 bytes
Configuration
- Config name: default
- Data file paths:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
Language
- Dataset language: English (en)
Dataset name
pretty_name: BabyLM 10M
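
As a quick consistency check on the figures above, the per-split byte counts can be summed and compared with the reported total. This is an illustrative sketch: the numbers are copied from the `dataset_info` block, while the variable names are hypothetical.

```python
# Split statistics copied from the dataset_info block of the card.
SPLITS = {
    "train": {"num_bytes": 57802971, "num_examples": 1058740},
    "validation": {"num_bytes": 55093483, "num_examples": 1026747},
    "test": {"num_bytes": 60175255, "num_examples": 1054646},
}
DATASET_SIZE = 173071709   # reported dataset_size, in bytes
DOWNLOAD_SIZE = 108417116  # reported download_size, in bytes

# Totals across the three splits.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Fraction of the in-memory size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(total_bytes == DATASET_SIZE)  # the split sizes add up to dataset_size
print(total_examples, round(download_ratio, 2))
```

The three splits are of similar size, so the reported `dataset_size` covers roughly three times the bytes of the train split alone.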



