Code-mixed Natural Language Inference (NLI) Dataset

Name: Code-mixed Natural Language Inference (NLI) Dataset
Creator: Microsoft Research India
Published: 2020-04-13 12:34:52
License: No description available

Source: arXiv. Updated: 2020-04-13. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/2004.05051v2

Description:

This dataset was created by Microsoft Research India for code-mixed natural language inference (NLI), specifically for conversations that mix Hindi and English. It contains 400 code-mixed premises drawn from Bollywood film dialogues and 2,240 code-mixed hypotheses written by Hindi-English bilingual speakers. Its construction involved a pilot annotation study followed by the development of a final annotation protocol. The dataset is intended to help machines understand code-mixed language, particularly in conversational agents and chatbots, and so to address communication problems in multilingual communities.

Provider:

Microsoft Research India

Created:

2020-04-10
