New Brown Corpus

Name: New Brown Corpus
Creator: OpenDataLab
Published: 2026-05-24 10:30:11
License: No description available

OpenDataLab, updated 2026-05-24, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/New_Brown_Corpus

Description:

We introduce a new dataset for training and evaluating grounded language models. The data was collected in a virtual reality environment and is designed to mimic the quality of the linguistic input available to pre-linguistic children: naturalistic, spontaneous speech paired with rich visuospatial context. We use the collected data to compare several distributional semantic models for verb learning. We evaluate neural models based on 2D (pixel) features as well as feature-engineered models based on 3D (symbolic, spatial) features, and show that neither modeling approach achieves satisfactory performance. Our results are consistent with evidence from child language acquisition, which emphasizes the difficulty of learning verbs from naive distributional data. We discuss avenues for future work on cognitively grounded language learning, and release the corpus to facilitate further research on this topic.
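As a rough illustration of what "naive distributional data" means for verb learning, the sketch below builds co-occurrence vectors between verbs and symbolic context features and compares them with cosine similarity. It is not one of the models evaluated on this corpus: the observations, the feature names, and the helpers `build_vectors` and `cosine` are all hypothetical and invented for this example.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy observations: (verb heard, symbolic context features seen).
# Invented for illustration only; not drawn from the New Brown Corpus.
OBSERVATIONS = [
    ("pick", {"hand_near_object", "object_rising"}),
    ("pick", {"hand_near_object", "object_rising"}),
    ("lift", {"hand_near_object", "object_rising"}),
    ("drop", {"hand_near_object", "object_falling"}),
    ("drop", {"object_falling"}),
]


def build_vectors(observations):
    """Count how often each verb co-occurs with each context feature."""
    vectors = defaultdict(Counter)
    for verb, features in observations:
        vectors[verb].update(features)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = build_vectors(OBSERVATIONS)
sim_pick_lift = cosine(vectors["pick"], vectors["lift"])
sim_pick_drop = cosine(vectors["pick"], vectors["drop"])
print(f"pick~lift: {sim_pick_lift:.3f}, pick~drop: {sim_pick_drop:.3f}")
```

In this toy setup "pick" and "lift" come out indistinguishable because they occur in identical contexts. That is one simple way to see why co-occurrence statistics alone may struggle to separate verb meanings.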

Provider:

OpenDataLab

Created:

2022-05-25

Collection summary

Dataset introduction