text-machine-lab/constrained_language

Name: text-machine-lab/constrained_language
Creator: text-machine-lab
Published: 2023-06-13 05:32:11
License: No description available

Hugging Face | Updated 2023-06-13 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/text-machine-lab/constrained_language

Resource description:

---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  splits:
  - name: train
    num_bytes: 4537675604
    num_examples: 9081490
  - name: validation
    num_bytes: 50107745
    num_examples: 100000
  - name: test
    num_bytes: 50134861
    num_examples: 100000
  download_size: 3052451421
  dataset_size: 4637918210
---

# Dataset Card for constrained_language (pre-training data for simplified English)

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Citation Information](#citation-information)

## Dataset Description

- **Paper:** https://arxiv.org/abs/2305.17266
- **Point of Contact:** vijeta_deshpande@student.uml.edu

### Dataset Summary

This dataset is one of the two datasets published with "Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale" (https://arxiv.org/abs/2305.17266). It is the pre-training data constrained by vocabulary. The other dataset, the pre-training data that is not constrained by vocabulary, is available at https://huggingface.co/datasets/text-machine-lab/unconstrained_language.

The vocabulary used for curating the data is constructed from the AOChildes corpus (https://www.sciencedirect.com/science/article/abs/pii/S0079742121000256). The AOChildes corpus consists of transcripts of child-directed speech, so the vocabulary built from it consists of words spoken or heard by children aged six years or younger.

The vocabulary is then used to filter the following widely used text corpora:

- C4: https://arxiv.org/abs/1910.10683
- BookCorpus: https://ieeexplore.ieee.org/document/7410368
- Wikipedia: https://huggingface.co/datasets/wikipedia
- Simplified-Wikipedia: https://simple.wikipedia.org/wiki/Main_Page
- Children's Book Test Corpus: https://arxiv.org/abs/1511.02301

From these corpora, only spans whose words all belong to the predefined vocabulary are included. The dataset includes 44 million sentences (~6 million sequences, each with ~128 tokens) and 3 million contiguous spans (each with ~128 tokens). Refer to Table 1 of the paper for the data distribution over the different corpora.

### Languages

The dataset contains English text only.

## Dataset Structure

The dataset is available in the Arrow dataset format with three splits: train, validation, and test. Every data instance has a single key, `TEXT`, that contains a text span of approximately 128 tokens.

### Citation Information

If this dataset is useful to you, please cite our work.

```bibtex
@article{deshpande2023honey,
  title={Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale},
  author={Deshpande, Vijeta and Pechi, Dan and Thatte, Shree and Lialin, Vladislav and Rumshisky, Anna},
  journal={arXiv preprint arXiv:2305.17266},
  year={2023}
}
```
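As a minimal sketch of how the splits described under Dataset Structure might be read, the snippet below assumes the Hugging Face `datasets` library and its streaming mode, neither of which is mentioned in the card. The helper names `stream_split` and `first_texts` are hypothetical. The field name `TEXT` is taken from the `dataset_info` metadata.

```python
from itertools import islice
from typing import Iterable, List, Mapping

DATASET_ID = "text-machine-lab/constrained_language"
TEXT_KEY = "TEXT"  # feature name listed in the dataset_info metadata


def stream_split(split: str = "validation"):
    """Lazily stream one split from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` library is installed. It is imported here so
    that `first_texts` stays usable without it.
    """
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def first_texts(records: Iterable[Mapping[str, str]], n: int = 3) -> List[str]:
    """Return the text spans of the first `n` records."""
    return [record[TEXT_KEY] for record in islice(records, n)]
```

With network access, `first_texts(stream_split("validation"))` would return the first three text spans without downloading the full 3 GB archive. The helper also works on any in-memory iterable of dictionaries that use the same key.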

Provider:

text-machine-lab

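Summary of original information

Dataset overview

Name: constrained_language
Purpose: pre-training data for simplified English
Language: English

Dataset structure

Features:
- TEXT: string
Splits:
- Train: 9081490 examples, 4537675604 bytes
- Validation: 100000 examples, 50107745 bytes
- Test: 100000 examples, 50134861 bytes
Download size: 3052451421 bytes
Dataset size: 4637918210 bytes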
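The split figures listed above can be cross-checked with a little arithmetic. The sketch below only restates the numbers from this listing, and its variable names are illustrative. It checks that the three split sizes add up to the reported dataset size and estimates the average number of bytes per example.

```python
# Split statistics copied from the dataset_info metadata.
SPLITS = {
    "train": {"num_bytes": 4537675604, "num_examples": 9081490},
    "validation": {"num_bytes": 50107745, "num_examples": 100000},
    "test": {"num_bytes": 50134861, "num_examples": 100000},
}
REPORTED_DATASET_SIZE = 4637918210

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Average uncompressed size of one example in each split.
avg_bytes = {name: s["num_bytes"] / s["num_examples"] for name, s in SPLITS.items()}

for name, s in SPLITS.items():
    share = s["num_examples"] / total_examples
    print(f"{name:<10} {share:6.2%} of examples, {avg_bytes[name]:.1f} bytes/example")

print("split sizes match dataset_size:", total_bytes == REPORTED_DATASET_SIZE)
```

Each example averages roughly 500 bytes, which is in line with spans of about 128 tokens at a little under 4 bytes per token.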

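Data sources

Vocabulary source: AOChildes corpus (transcripts of child-directed speech; the vocabulary covers words spoken or heard by children aged six years or younger)
Text sources:
- C4
- BookCorpus
- Wikipedia
- Simplified-Wikipedia
- Children's Book Test Corpus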
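The vocabulary-based filtering of these text sources can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' pipeline: the regex tokenization, lowercasing, sentence-level granularity, and the names `is_within_vocabulary` and `filter_sentences` are all hypothetical, and the toy vocabulary stands in for the one derived from AOChildes.

```python
import re

# Assumption: a word is a run of letters or apostrophes, compared case-insensitively.
WORD_RE = re.compile(r"[A-Za-z']+")


def is_within_vocabulary(sentence, vocabulary):
    """True if the sentence has at least one word and every word is in the vocabulary."""
    words = [w.lower() for w in WORD_RE.findall(sentence)]
    return bool(words) and all(w in vocabulary for w in words)


def filter_sentences(sentences, vocabulary):
    """Keep only the sentences made up entirely of vocabulary words."""
    return [s for s in sentences if is_within_vocabulary(s, vocabulary)]


# Toy stand-in for the child-directed-speech vocabulary.
toy_vocabulary = {"the", "a", "dog", "cat", "ran", "sat", "to", "park"}
toy_sentences = [
    "The dog ran to the park.",
    "The cat sat.",
    "The dog studied thermodynamics.",
]

kept = filter_sentences(toy_sentences, toy_vocabulary)
print(kept)
```

The third sentence is dropped because "studied" and "thermodynamics" are outside the toy vocabulary. A real pipeline would also need to decide how to handle numbers, punctuation, and inflected forms, none of which is specified in this listing.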

Dataset details

Composition: 44 million sentences (about 6 million sequences of roughly 128 tokens each) plus 3 million contiguous spans (roughly 128 tokens each)

Citation information

@article{deshpande2023honey,
  title={Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale},
  author={Deshpande, Vijeta and Pechi, Dan and Thatte, Shree and Lialin, Vladislav and Rumshisky, Anna},
  journal={arXiv preprint arXiv:2305.17266},
  year={2023}
}
