cam-cst/cbt

Name: cam-cst/cbt
Creator: cam-cst
Published: 2024-01-16 16:01:16
License: no description provided

Source: Hugging Face. Updated 2024-01-16; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/cam-cst/cbt

Resource overview:

The Children's Book Test (CBT) is designed to measure directly how well language models exploit wider linguistic context. The dataset is built from freely available children's books and includes four question configurations: `V` (verbs), `P` (pronouns), `NE` (named entities), and `CN` (common nouns). Each of these configurations has the data fields `sentences`, `question`, `answer`, and `options`. The dataset is in English, and its source texts are children's storybooks by authors such as Lucy Maud Montgomery and Charles Dickens.

Provider:

cam-cst

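Summary of original information

Dataset overview

Basic dataset information

Dataset name: Children’s Book Test (CBT)
Language: English
License: GFDL
Multilinguality: monolingual
Size categories: 100K<n<1M and n<1K
Source datasets: original
Task categories: other, question answering
Task IDs: multiple-choice QA
PapersWithCode ID: cbt
Configuration names: CN, NE, P, V, raw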

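Configuration details

CN configuration

Features:
- sentences: sequence of strings
- question: string
- answer: string
- options: sequence of strings
Splits:
- train: 301730151 bytes, 120769 examples
- test: 6138376 bytes, 2500 examples
- validation: 4737257 bytes, 2000 examples
Download size: 31615166 bytes
Dataset size: 312605784 bytes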

NE configuration

Features:
- sentences: sequence of strings
- question: string
- answer: string
- options: sequence of strings
Splits:
- train: 253551931 bytes, 108719 examples
- test: 5707734 bytes, 2500 examples
- validation: 4424316 bytes, 2000 examples
Download size: 29693075 bytes
Dataset size: 263683981 bytes

P configuration

Features:
- sentences: sequence of strings
- question: string
- answer: string
- options: sequence of strings
Splits:
- train: 852852601 bytes, 334030 examples
- test: 6078048 bytes, 2500 examples
- validation: 4776981 bytes, 2000 examples
Download size: 43825356 bytes
Dataset size: 863707630 bytes

V configuration

Features:
- sentences: sequence of strings
- question: string
- answer: string
- options: sequence of strings
Splits:
- train: 252177649 bytes, 105825 examples
- test: 5806625 bytes, 2500 examples
- validation: 4556425 bytes, 2000 examples
Download size: 29992082 bytes
Dataset size: 262540699 bytes

raw configuration

Features:
- title: string
- content: string
Splits:
- train: 25741580 bytes, 98 examples
- test: 1528704 bytes, 5 examples
- validation: 1182657 bytes, 5 examples
Download size: 16350790 bytes
Dataset size: 28452941 bytes
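
The byte counts listed above can be cross-checked with simple arithmetic: for each configuration, the reported dataset size should equal the sum of its three split sizes. The following is a minimal sketch of that check; the dictionary and function names are illustrative and not part of the dataset, and the numbers are copied from the listings above.

```python
# Split sizes in bytes, copied from the configuration details above.
SPLIT_BYTES = {
    "CN": {"train": 301730151, "test": 6138376, "validation": 4737257},
    "NE": {"train": 253551931, "test": 5707734, "validation": 4424316},
    "P": {"train": 852852601, "test": 6078048, "validation": 4776981},
    "V": {"train": 252177649, "test": 5806625, "validation": 4556425},
    "raw": {"train": 25741580, "test": 1528704, "validation": 1182657},
}

# Reported "dataset size" per configuration, also copied from above.
REPORTED_DATASET_BYTES = {
    "CN": 312605784,
    "NE": 263683981,
    "P": 863707630,
    "V": 262540699,
    "raw": 28452941,
}


def total_bytes(config: str) -> int:
    """Sum the train, test, and validation sizes of one configuration."""
    return sum(SPLIT_BYTES[config].values())


for config, reported in REPORTED_DATASET_BYTES.items():
    status = "matches" if total_bytes(config) == reported else "differs from"
    print(f"{config}: split total {total_bytes(config)} {status} reported {reported}")
```

The download sizes are smaller than the dataset sizes and are not covered by this check.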

Dataset structure

Data instances

V configuration example (JSON):

{
  "answer": "said",
  "options": ["christening", "existed", "hear", "knows", "read", "remarked", "said", "sitting", "talking", "wearing"],
  "question": "They are very kind old ladies in their way , XXXXX the king ; and were nice to me when I was a boy . ",
  "sentences": [
    "This vexed the king even more than the queen , who was very clever and learned , and who had hated dolls when she was a child .",
    "However , she , too in spite of all the books she read and all the pictures she painted , would have been glad enough to be the mother of a little prince .",
    "The king was anxious to consult the fairies , but the queen would not hear of such a thing .",
    "She did not believe in fairies : she said that they had never existed ; and that she maintained , though The History of the Royal Family was full of chapters about nothing else .",
    "Well , at long and at last they had a little boy , who was generally regarded as the finest baby that had ever been seen .",
    "Even her majesty herself remarked that , though she could never believe all the courtiers told her , yet he certainly was a fine child -- a very fine child .",
    "Now , the time drew near for the christening party , and the king and queen were sitting at breakfast in their summer parlour talking over it .",
    "It was a splendid room , hung with portraits of the royal ancestors .",
    "There was Cinderella , the grandmother of the reigning monarch , with her little foot in her glass slipper thrust out before her .",
    "There was the Marquis de Carabas , who , as everyone knows , was raised to the throne as prince consort after his marriage with the daughter of the king of the period .",
    "On the arm of the throne was seated his celebrated cat , wearing boots .",
    "There , too , was a portrait of a beautiful lady , sound asleep : this was Madame La Belle au Bois-dormant , also an ancestress of the royal family .",
    "Many other pictures of celebrated persons were hanging on the walls .",
    "`` You have asked all the right people , my dear ? ",
    "said the king .",
    "`` Everyone who should be asked , answered the queen .",
    "`` People are so touchy on these occasions , said his majesty .",
    "`` You have not forgotten any of our aunts ? ",
    "`` No ; the old cats ! ",
    "replied the queen ; for the king s aunts were old-fashioned , and did not approve of her , and she knew it ."
  ]
}

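Data fields

raw configuration:
- title: a string feature containing the title of a book in the dataset.
- content: a string feature containing the content of a book in the dataset.
Other configurations:
- sentences: a sequence-of-strings feature containing 20 sentences from a book.
- question: a string feature containing a question with a blank marker (XXXXX).
- answer: a string feature containing the answer.
- options: a sequence-of-strings feature containing the candidate answers for the question.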
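
To make the relationship between `question`, `answer`, and `options` concrete, here is a minimal sketch that fills the blank with a candidate word and checks that an instance is internally consistent. The helper names `fill_blank` and `check_instance` are hypothetical; the field values are taken from the V configuration example, with `sentences` omitted for brevity.

```python
BLANK = "XXXXX"  # blank marker as it appears in the V configuration example


def fill_blank(question: str, candidate: str) -> str:
    """Return the question with its single blank replaced by the candidate."""
    if question.count(BLANK) != 1:
        raise ValueError("expected exactly one blank marker in the question")
    return question.replace(BLANK, candidate)


def check_instance(question: str, answer: str, options: list) -> bool:
    """Check that the question has one blank and the answer is among the options."""
    return question.count(BLANK) == 1 and answer in options


# Field values copied from the V configuration example.
question = (
    "They are very kind old ladies in their way , XXXXX the king ; "
    "and were nice to me when I was a boy . "
)
answer = "said"
options = ["christening", "existed", "hear", "knows", "read",
           "remarked", "said", "sitting", "talking", "wearing"]

print(check_instance(question, answer, options))
print(fill_blank(question, answer))
```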

Data splits

Splits and their sizes (number of examples):

Config  train    test   validation
raw     98       5      5
V       105825   2500   2000
P       334030   2500   2000
CN      120769   2500   2000
NE      108719   2500   2000

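Curated summary

Dataset introduction

Construction

In natural language processing, it is important to evaluate how well a model understands wide context. The construction of the Children's Book Test (CBT) reflects this goal: its source data comes from classic children's literature, including works by Lucy Maud Montgomery and Charles Dickens. The construction method is systematic. From each book, 21 consecutive sentences are extracted; the first 20 form the context, and one word is removed from the 21st sentence, which becomes the query. The answer options are 10 candidate words that appear in the context and query sentences. The whole process is automated, which keeps the data consistent and reproducible.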
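
The construction steps described above can be sketched in a few lines. This is a simplified illustration, not the pipeline used to build CBT: the function name `build_example` is hypothetical, the word to remove is supplied by the caller, and the nine distractors are sampled at random from the context and query words without regard to word class.

```python
import random

BLANK = "XXXXX"


def build_example(sentences, target_word, rng, num_options=10):
    """Build one CBT-style example from 21 consecutive tokenized sentences.

    Assumption: distractors are drawn uniformly from the alphabetic words in
    the 21 sentences; this sketch ignores word class entirely.
    """
    if len(sentences) != 21:
        raise ValueError("expected 21 consecutive sentences")
    context, query = sentences[:20], sentences[20]
    query_tokens = query.split()
    if target_word not in query_tokens:
        raise ValueError("target word must occur in the 21st sentence")

    # Blank out the first occurrence of the target word in the query.
    idx = query_tokens.index(target_word)
    question = " ".join(query_tokens[:idx] + [BLANK] + query_tokens[idx + 1:])

    # Candidate pool: distinct alphabetic words from context and query,
    # excluding the answer itself.
    pool = sorted(
        {tok for s in sentences for tok in s.split() if tok.isalpha()}
        - {target_word}
    )
    if len(pool) < num_options - 1:
        raise ValueError("not enough distinct words to draw distractors from")
    options = sorted(rng.sample(pool, num_options - 1) + [target_word])

    return {"sentences": context, "question": question,
            "answer": target_word, "options": options}


# Toy input: 20 context sentences followed by one query sentence.
animals = ["ant", "bear", "cat", "deer", "eagle", "fox", "goat", "hare",
           "ibis", "jay", "kite", "lynx", "mole", "newt", "owl", "pig",
           "quail", "rat", "seal", "toad"]
toy_sentences = [f"The {a} walked home ." for a in animals]
toy_sentences.append("The king said hello .")

example = build_example(toy_sentences, "said", random.Random(0))
print(example["question"])
print(example["options"])
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, in the same spirit as the automated, repeatable process the paragraph describes.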

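Characteristics

The dataset has a distinctive design for language model evaluation. It provides four question configurations that separately test verbs, pronouns, named entities, and common nouns; this fine-grained split lets researchers examine how well a model handles each word class. Every instance contains a 20-sentence context, a question with a blank marker, and ten candidate answers, a structure that mimics the inference a reader performs while reading. Across these four configurations the training sets range from 105,825 to 334,030 examples, while every test set has 2,500 examples and every validation set has 2,000, which gives a stable benchmark for model performance.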

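Usage

Researchers can choose the configuration that matches the linguistic phenomenon they want to evaluate. The dataset is already divided into training, validation, and test sets, so it fits a standard machine learning workflow. For training, the context sentences and question-answer pairs can be used to optimize model parameters; for evaluation, the model must select the correct word from the candidates to fill the blank. The raw configuration additionally provides the full book texts for broader text analysis. The data can be loaded with the Hugging Face libraries and is subject to the terms of the GNU Free Documentation License.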
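
As a rough illustration of this workflow, the sketch below wraps loading in a helper and scores a trivial baseline that picks the candidate occurring most often in the context. It assumes the Hugging Face `datasets` package is installed and that the repository `cam-cst/cbt` exposes the configuration names listed earlier; the helper names `load_cbt`, `most_frequent_option`, and `accuracy` are hypothetical, and the baseline is only a placeholder for a real model.

```python
from collections import Counter

CONFIGS = ("CN", "NE", "P", "V", "raw")


def load_cbt(config: str, split: str = "validation"):
    """Load one configuration (requires the `datasets` package and network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the helpers below work without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("cam-cst/cbt", config, split=split)


def most_frequent_option(example: dict) -> str:
    """Pick the option that occurs most often in the context sentences.

    Ties are resolved in favor of the option listed first.
    """
    counts = Counter(tok for s in example["sentences"] for tok in s.split())
    return max(example["options"], key=lambda opt: counts[opt])


def accuracy(examples, predict) -> float:
    """Fraction of examples for which predict(example) equals the answer."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(predict(ex) == ex["answer"] for ex in examples)
    return correct / len(examples)


# Toy instance with the same fields as the V, P, CN, and NE configurations.
toy = {
    "sentences": ["the king said yes .", "the queen said no ."],
    "question": "the king XXXXX maybe .",
    "answer": "said",
    "options": ["king", "queen", "said"],
}
print(accuracy([toy], most_frequent_option))
```

With network access, the same function could be applied to a real split, for example `accuracy(load_cbt("V"), most_frequent_option)`.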

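Background and challenges

Background overview

The Children’s Book Test (CBT) was built in 2016 by researchers from Facebook AI Research and the University of Cambridge. Its core research question is how effectively language models use wide context when predicting words. The dataset is drawn from classic children's literature: consecutive sentences are extracted systematically and turned into fill-in-the-blank questions that measure how well a model follows narrative coherence and semantic relations. CBT became an important evaluation benchmark in natural language processing, particularly for lexical disambiguation and long-range dependency modeling, and it helped drive the development of models such as memory-augmented neural networks.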

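Current challenges

The dataset targets a core challenge in word prediction: capturing long-range contextual dependencies accurately enough to fill a blank for a specific word class (named entities, common nouns, verbs, and so on). During construction, the research team had to generate high-quality questions automatically from raw text, making sure that every candidate answer comes from the context while keeping the questions diverse and linguistically complex. In addition, the splits must strictly avoid content overlap between the training and test sets so that evaluation stays fair, which places fine-grained requirements on how book chapters are assigned and how sentence sequences are sampled.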

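Common scenarios

Classic use cases

In natural language processing, the Children's Book Test is often used to evaluate how well language models understand long-range contextual dependencies. The dataset builds a context from consecutive sentences of children's storybooks and poses a fill-in-the-blank question that requires the model to choose the correct answer from a set of candidate words. This setup mimics the way a human reader infers a missing word from the preceding text, and it provides a standardized test bed for measuring the depth of a model's language understanding.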

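Derived and related work

The dataset has given rise to several well-known lines of research, the most representative being the memory network architecture proposed by the Facebook AI team. That work introduced an explicit memory module for handling long text sequences and substantially improved model performance on the Children's Book Test. Follow-up work such as dynamic memory networks and key-value memory networks extended this direction. These architectures have become an important paradigm for long-document understanding and continue to advance reading comprehension research.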
