Children's Book Test
Payititi, indexed 2024-03-04
Download link:
https://www.payititi.com/opendatasets/show-271.html
Resource description:
The Children's Book Test (CBT) is designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg. Details and baseline results on this dataset can be found in the paper: Felix Hill, Antoine Bordes, Sumit Chopra and Jason Weston, "The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations", arXiv:1511.02301. After allocating books to training, validation or test sets, we formed example 'questions' from chapters in each book by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. For finer-grained analyses, we evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs and Prepositions. Here is an example of a question (context + query) from Alice in Wonderland by Lewis Carroll:
Provider:
Payititi
Collected summary
Dataset introduction

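Background and challenges
Background overview
Children's Book Test is a natural language processing dataset for evaluating how well language models exploit context. It is built from Project Gutenberg books and tests a model's ability to predict a missing word by constructing questions that consist of a context and a query. The question types cover named entities, nouns, verbs and prepositions.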
The above content was collected and summarized by 遇见数据集.
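The question construction described in the resource description (20 context sentences, a query with one word removed, 10 candidate answers) can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_question`, the `XXXXX` placeholder, whitespace tokenization, and the uniform random sampling of candidates are all assumptions made for this example. In particular, the sketch does not restrict the removed word or the candidates to a word class such as named entities or verbs.

```python
import random

PLACEHOLDER = "XXXXX"  # assumed marker standing in for the removed word


def build_question(sentences, answer_index, num_candidates=10, seed=0):
    """Build one CBT-style question from 21 consecutive sentences.

    Hypothetical helper: sentences are assumed to be pre-tokenized strings
    whose tokens are separated by single spaces.
    """
    if len(sentences) != 21:
        raise ValueError("expected exactly 21 consecutive sentences")

    # The first 20 sentences form the context.
    context = sentences[:20]

    # The 21st sentence becomes the query once one word is removed.
    query_tokens = sentences[20].split()
    answer = query_tokens[answer_index]
    query_tokens[answer_index] = PLACEHOLDER
    query = " ".join(query_tokens)

    # Candidate pool: distinct words from the context and the query,
    # excluding the answer itself (assumption: no word-class filtering).
    pool = sorted({w for s in sentences for w in s.split()} - {answer})
    if len(pool) < num_candidates - 1:
        raise ValueError("not enough distinct words to draw candidates")

    rng = random.Random(seed)
    candidates = rng.sample(pool, num_candidates - 1) + [answer]
    rng.shuffle(candidates)

    return {
        "context": context,
        "query": query,
        "answer": answer,
        "candidates": candidates,
    }


# Toy input: 20 filler sentences followed by the sentence to be queried.
sentences = [f"word{i} ran home ." for i in range(20)] + ["Alice went home ."]
question = build_question(sentences, answer_index=0)
```

A model would then be scored on whether it selects `question["answer"]` from `question["candidates"]` given the context and the query.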



