
text-machine-lab/constrained_language

Source: Hugging Face | Updated: 2023-06-13 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/text-machine-lab/constrained_language
Resource description:
---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  splits:
  - name: train
    num_bytes: 4537675604
    num_examples: 9081490
  - name: validation
    num_bytes: 50107745
    num_examples: 100000
  - name: test
    num_bytes: 50134861
    num_examples: 100000
  download_size: 3052451421
  dataset_size: 4637918210
---

# Dataset Card for constrained_language (pre-training data for simplified English)

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Citation Information](#citation-information)

## Dataset Description

- **Paper:** https://arxiv.org/abs/2305.17266
- **Point of Contact:** vijeta_deshpande@student.uml.edu

### Dataset Summary

This dataset is one of the two datasets published with "Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale" (https://arxiv.org/abs/2305.17266). The dataset available at this link is the pre-training data constrained by vocabulary. The other published dataset, i.e. the pre-training data that is not constrained by vocabulary, is available at https://huggingface.co/datasets/text-machine-lab/unconstrained_language.

The vocabulary used for curating the data is constructed from the AOChildes corpus (https://www.sciencedirect.com/science/article/abs/pii/S0079742121000256). The AOChildes corpus consists of transcripts of child-directed speech. Hence, the vocabulary constructed from the AOChildes corpus consists of words spoken or heard by children aged six years or younger.

The vocabulary is then used to filter the following widely used text corpora:

- C4: https://arxiv.org/abs/1910.10683
- BookCorpus: https://ieeexplore.ieee.org/document/7410368
- Wikipedia: https://huggingface.co/datasets/wikipedia
- Simplified-Wikipedia: https://simple.wikipedia.org/wiki/Main_Page
- Children's Book Test Corpus: https://arxiv.org/abs/1511.02301

From the above corpora, only those spans are included that contain words exclusively from the predefined vocabulary. The dataset includes 44 million sentences (~6 million sequences, each with ~128 tokens) and 3 million contiguous spans (each with ~128 tokens). Refer to Table 1 of the paper for the data distribution over the different corpora.

### Languages

The dataset contains the English language only.

## Dataset Structure

The dataset is available in the Arrow dataset format with three splits, namely train, validation, and test. Every data instance has a single key, "TEXT", which contains a text span of approximately 128 tokens.

### Citation Information

If this dataset is useful to you, please cite our work.

```bibtex
@article{deshpande2023honey,
  title={Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale},
  author={Deshpande, Vijeta and Pechi, Dan and Thatte, Shree and Lialin, Vladislav and Rumshisky, Anna},
  journal={arXiv preprint arXiv:2305.17266},
  year={2023}
}
```
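The split layout described in the card (train, validation, test, one text field per instance) can be read with the Hugging Face `datasets` library. The following is a minimal sketch, not an official loading recipe: it assumes the `datasets` package is installed, that the column is named `TEXT` as in the metadata block, and that streaming is preferable to downloading the full ~3 GB archive. The helper names `stream_split` and `preview` are hypothetical.

```python
from itertools import islice

# Repository id taken from the card; the column name follows the metadata block.
DATASET_ID = "text-machine-lab/constrained_language"
TEXT_COLUMN = "TEXT"  # assumption: the metadata spelling is the actual key


def stream_split(split="validation"):
    """Return an iterable over one split without downloading it in full."""
    # Imported lazily so the rest of the sketch works without the package.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


def preview(rows, n=3, column=TEXT_COLUMN):
    """Collect the text field of the first n rows from any iterable of dicts."""
    return [row[column] for row in islice(rows, n)]


# Example usage (requires network access):
# for text in preview(stream_split("validation"), n=2):
#     print(text[:80])
```

Because `preview` accepts any iterable of dictionaries, it can be tried on a small in-memory list before touching the remote dataset.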
Provider:
text-machine-lab
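Summary of the original information

Dataset overview

  • Name: constrained_language
  • Purpose: pre-training data for simplified English
  • Language: English

Dataset structure

  • Features:
    • TEXT: string
  • Splits:
    • Train: 9,081,490 examples, 4,537,675,604 bytes
    • Validation: 100,000 examples, 50,107,745 bytes
    • Test: 100,000 examples, 50,134,861 bytes
  • Download size: 3,052,451,421 bytes
  • Dataset size: 4,637,918,210 bytes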
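As a quick consistency check on the figures above, the per-split byte counts should add up to the reported dataset size. The sketch below only restates the numbers from the metadata; the variable names are illustrative.

```python
# Split statistics copied from the dataset metadata.
SPLITS = {
    "train": {"num_bytes": 4537675604, "num_examples": 9081490},
    "validation": {"num_bytes": 50107745, "num_examples": 100000},
    "test": {"num_bytes": 50134861, "num_examples": 100000},
}
DATASET_SIZE = 4637918210  # reported uncompressed size in bytes

# Sum bytes and examples over all splits.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Average size of one training example, in bytes.
avg_train_bytes = SPLITS["train"]["num_bytes"] / SPLITS["train"]["num_examples"]

print(total_bytes == DATASET_SIZE)
print(total_examples)
print(round(avg_train_bytes, 1))
```

The total of roughly 9.3 million examples is in line with the roughly 6 million sentence-based sequences plus 3 million contiguous spans stated in the card.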

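Data sources

  • Vocabulary source: AOChildes corpus, transcripts of child-directed speech (words spoken or heard by children aged six years or younger)
  • Text sources:
    • C4
    • BookCorpus
    • Wikipedia
    • Simplified-Wikipedia
    • Children's Book Test Corpus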
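The card states that only spans containing words exclusively from the predefined vocabulary are kept. A minimal sketch of that idea is shown below. It is not the authors' pipeline: the toy vocabulary, the regex tokenizer, and the function names are all assumptions made for illustration, and the real vocabulary is built from AOChildes.

```python
import re

# Hypothetical toy vocabulary standing in for the AOChildes-derived one.
TOY_VOCAB = {"the", "dog", "ran", "to", "a", "big", "tree"}

# Assumed tokenizer: lowercase alphabetic words, apostrophes allowed.
WORD_RE = re.compile(r"[a-z']+")


def words(span):
    """Split a span into lowercase word tokens."""
    return WORD_RE.findall(span.lower())


def is_in_vocab(span, vocab):
    """True if the span has at least one word and every word is in vocab."""
    tokens = words(span)
    return bool(tokens) and all(token in vocab for token in tokens)


def filter_spans(spans, vocab):
    """Keep only the spans made up entirely of in-vocabulary words."""
    return [span for span in spans if is_in_vocab(span, vocab)]


spans = [
    "The dog ran to a big tree.",
    "The dog ran to a photosynthesis lab.",
]
kept = filter_spans(spans, TOY_VOCAB)
print(kept)
```

A single out-of-vocabulary word is enough to reject a span here, which mirrors the "words only from the predefined vocabulary" wording; how the original work tokenizes text or handles punctuation and numbers is not specified in the card.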

Dataset details

  • Composition: 44 million sentences (about 6 million sequences of about 128 tokens each) and 3 million contiguous spans (about 128 tokens each)

Citation information

  @article{deshpande2023honey,
    title={Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale},
    author={Deshpande, Vijeta and Pechi, Dan and Thatte, Shree and Lialin, Vladislav and Rumshisky, Anna},
    journal={arXiv preprint arXiv:2305.17266},
    year={2023}
  }
