text-machine-lab/constrained_language
Source: Hugging Face. Updated 2023-06-13; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/text-machine-lab/constrained_language
Resource description:
---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  splits:
  - name: train
    num_bytes: 4537675604
    num_examples: 9081490
  - name: validation
    num_bytes: 50107745
    num_examples: 100000
  - name: test
    num_bytes: 50134861
    num_examples: 100000
  download_size: 3052451421
  dataset_size: 4637918210
---
# Dataset Card for constrained_language (pre-training data for simplified English)
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Citation Information](#citation-information)
## Dataset Description
- **Paper:** https://arxiv.org/abs/2305.17266
- **Point of Contact:** vijeta_deshpande@student.uml.edu
### Dataset Summary
This dataset is one of two datasets released with the paper "Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale" (https://arxiv.org/abs/2305.17266).
The dataset at this link is the pre-training data constrained by vocabulary. The other dataset, the pre-training data that is not constrained by vocabulary, is available at https://huggingface.co/datasets/text-machine-lab/unconstrained_language.
The vocabulary used to curate the data is built from the AOChildes corpus (https://www.sciencedirect.com/science/article/abs/pii/S0079742121000256), which consists of transcripts of child-directed speech. The vocabulary therefore consists of words spoken or heard by children aged six or younger.
The vocabulary is then used to filter the following widely used text corpora:
- C4: https://arxiv.org/abs/1910.10683
- BookCorpus: https://ieeexplore.ieee.org/document/7410368
- Wikipedia: https://huggingface.co/datasets/wikipedia
- Simplified-Wikipedia: https://simple.wikipedia.org/wiki/Main_Page
- Children's Book Test Corpus: https://arxiv.org/abs/1511.02301

From these corpora, a span is included only if every word in it belongs to the predefined vocabulary. The dataset includes 44 million sentences (~6 million sequences, each with ~128 tokens) and 3 million contiguous spans (each with ~128 tokens). Refer to Table 1 of the paper for the data distribution over the different corpora.
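The filtering rule described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the regex tokenizer, the function names, and the toy transcripts are all assumptions made for illustration, and the paper should be consulted for the actual tokenization and span construction.

```python
import re
from typing import Iterable, List, Set

# Hypothetical tokenizer: lowercase alphabetic words, optionally with an
# apostrophe. Digits and punctuation are ignored. This is an assumption,
# not the tokenization used to build the dataset.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens."""
    return WORD_RE.findall(text.lower())


def build_vocabulary(transcripts: Iterable[str]) -> Set[str]:
    """Collect every word that appears in the transcripts."""
    vocab: Set[str] = set()
    for line in transcripts:
        vocab.update(tokenize(line))
    return vocab


def filter_spans(spans: Iterable[str], vocab: Set[str]) -> List[str]:
    """Keep a span only if all of its words are in the vocabulary."""
    kept = []
    for span in spans:
        words = tokenize(span)
        if words and all(word in vocab for word in words):
            kept.append(span)
    return kept


# Toy stand-ins for child-directed speech transcripts (made up for this sketch).
transcripts = ["do you want the red ball", "look at the big dog"]
vocab = build_vocabulary(transcripts)

candidates = [
    "Look at the red dog.",
    "The dog is ubiquitous.",  # "is" and "ubiquitous" are out of vocabulary
    "You want the ball?",
]
kept = filter_spans(candidates, vocab)
print(kept)
```

A set-based membership test keeps the check linear in the number of words per span, which matters when the same rule is applied to corpora as large as C4.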
### Languages
The dataset contains the English language only.
## Dataset Structure
The dataset is available in the Arrow dataset format with three splits: train, validation, and test. Every data instance has a single key, `TEXT` (as named in the dataset metadata), which contains a text span of approximately 128 tokens.
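As a rough sketch of how the splits might be read with the Hugging Face `datasets` library: the repository ID comes from the download link, and the feature name `TEXT` is taken from the `dataset_info` metadata. The helper names `iter_texts` and `preview` are hypothetical, and `preview` needs network access and an installed `datasets` package, so it is defined here but not called.

```python
from typing import Dict, Iterable, Iterator

REPO_ID = "text-machine-lab/constrained_language"
TEXT_KEY = "TEXT"  # feature name as listed in the dataset_info metadata


def iter_texts(records: Iterable[Dict[str, str]], key: str = TEXT_KEY) -> Iterator[str]:
    """Yield the text span from each record."""
    for record in records:
        yield record[key]


def preview(split: str = "validation", n: int = 3) -> None:
    """Print the first n spans of a split without downloading all of it."""
    # Imported lazily so iter_texts stays usable without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches data on demand.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    for text in iter_texts(stream.take(n)):
        print(text[:80])


# Offline illustration with a made-up record shaped like a data instance.
sample = [{"TEXT": "the dog ran to the park"}]
print(list(iter_texts(sample)))
```

Streaming is used in the sketch because the train split is several gigabytes; calling `preview("train")` would then read only the first few records.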
### Citation Information
If this dataset is useful to you, please cite our work.
```bibtex
@article{deshpande2023honey,
  title={Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale},
  author={Deshpande, Vijeta and Pechi, Dan and Thatte, Shree and Lialin, Vladislav and Rumshisky, Anna},
  journal={arXiv preprint arXiv:2305.17266},
  year={2023}
}
```
Provider:
text-machine-lab
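Summary of original information

Dataset overview
- Name: constrained_language
- Purpose: pre-training data for simplified English
- Language: English

Dataset structure
- Features:
  - TEXT: string
- Splits:
  - Train: 9081490 examples, 4537675604 bytes
  - Validation: 100000 examples, 50107745 bytes
  - Test: 100000 examples, 50134861 bytes
- Download size: 3052451421 bytes
- Dataset size: 4637918210 bytes

Data sources
- Vocabulary source: AOChildes corpus, transcripts of child-directed speech (words spoken or heard by children aged six or younger)
- Text sources:
  - C4
  - BookCorpus
  - Wikipedia
  - Simplified-Wikipedia
  - Children's Book Test Corpus

Dataset details
- Composition: 44 million sentences (~6 million sequences, each with ~128 tokens) and 3 million contiguous spans (each with ~128 tokens)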



