New Brown Corpus
OpenDataLab · Updated 2026-05-24 · Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/New_Brown_Corpus
Description:
We introduce a new dataset for training and evaluating grounded language models. Our data was collected in a virtual reality environment and is designed to mimic the quality of the linguistic input available to pre-linguistic children: namely, naturalistic, spontaneous speech paired with rich visuospatial context. We use the collected data to compare several distributional semantic models for verb learning. We evaluate neural models based on 2D (pixel) features as well as feature-engineered models based on 3D (symbolic, spatial) features, and show that neither modeling approach achieves satisfactory performance. Our results are consistent with evidence from child language acquisition, which emphasizes the difficulty of learning verbs from naive distributional data. We discuss avenues for future work on cognitively grounded language learning, and release the corpus to facilitate research on this topic.
Provider:
OpenDataLab
Created:
2022-05-25
Collected summary
Dataset introduction

Background and challenges
Background overview
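New Brown Corpus is a dataset released by Brown University in 2020. It simulates the conditions of child language acquisition by collecting naturalistic speech together with visuospatial context in a virtual reality environment, and is intended for training and evaluating language models. The dataset is also used to compare distributional semantic models of verb learning and to examine the challenges of learning verbs from distributional data, with the aim of advancing research on grounded language learning.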
The summary above was collected and generated by 遇见数据集.



