BIG-C
Source: arXiv. Updated: 2023-05-27. Indexed: 2024-06-21.
Download link:
https://github.com/csikasote/bigc
Description:
BIG-C is a large multimodal dataset for the Bemba language, created by the Department of Computer Science at the University of Zambia. It consists of image-grounded multi-turn dialogues between Bemba speakers, comprising more than 92,000 utterances and over 180 hours of audio, each with a corresponding transcription and English translation. BIG-C is intended to address the scarcity of Bemba language resources and to support natural language processing (NLP) experiments and language technology development. It is suited to building tools such as speech recognition, machine translation, and speech translation systems. It can also serve as a benchmark for academic and industrial research and support work on language grounding and multimodal models.
Provided by:
Department of Computer Science, University of Zambia
Created:
2023-05-27



