NirantK/hda_nli_hindi
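Dataset Overview
Basic Information
- Dataset name: Hindi Discourse Analysis Dataset
- Language: Hindi (hi)
- License: MIT
- Size category: 10K<n<100K examples
- Multilinguality: monolingual
- Source dataset: extended from hindi_discourse
- Task category: text classification
- Task ID: natural language inference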
Dataset Configuration
- Configuration names: "HDA hindi nli" and "hda nli hindi"
- Features:
  - premise: string
  - hypothesis: string
  - label: class label, either "not-entailment" (0) or "entailment" (1)
  - topic: class label, one of "Argumentative" (0), "Descriptive" (1), "Dialogic" (2), "Informative" (3), "Narrative" (4)
Data Splits
- Training set: 31892 examples, 8721972 bytes
- Validation set: 9460 examples, 2556118 bytes
- Test set: 9970 examples, 2646453 bytes
- Download size: 13519261 bytes
- Dataset size: 13924543 bytes
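As a quick consistency check on the figures above, the following minimal sketch sums the per-split counts and computes each split's share of the examples. The variable and function names are illustrative and not part of the dataset.

```python
# Split statistics copied from the list above: (examples, bytes).
SPLITS = {
    "train": (31892, 8721972),
    "validation": (9460, 2556118),
    "test": (9970, 2646453),
}

total_examples = sum(n for n, _ in SPLITS.values())
total_bytes = sum(b for _, b in SPLITS.values())


def split_fractions(splits):
    """Return each split's share of the total number of examples."""
    total = sum(n for n, _ in splits.values())
    return {name: n / total for name, (n, _) in splits.items()}


fractions = split_fractions(SPLITS)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
print(total_examples, total_bytes)
```

The three splits add up to 51322 examples, which falls inside the 10K<n<100K size category, and their byte counts add up to the listed dataset size of 13924543 bytes. The split is roughly 62% training, 18% validation, and 19% test.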
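Dataset Creation
- Creation method: recasting. A publicly available Hindi discourse analysis classification dataset is converted into a textual entailment problem.
- Source data: BBC Hindi Headlines dataset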
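The recasting idea can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes one hypothesis is generated per candidate discourse mode, that only the hypothesis naming the gold mode is labelled as entailment, and that `topic` stores the gold mode. The English hypothesis template and the `recast` and `make_hypothesis` helpers are hypothetical; the dataset itself uses Hindi hypotheses.

```python
# Discourse modes, in the order given by the `topic` feature.
MODES = ["Argumentative", "Descriptive", "Dialogic", "Informative", "Narrative"]


def make_hypothesis(mode):
    # Hypothetical English stand-in; the real hypotheses are written in Hindi.
    return f"This statement is {mode.lower()}."


def recast(sentence, mode_id):
    """Turn one (sentence, discourse mode) pair into NLI examples.

    Assumption: one hypothesis per candidate mode, labelled entailment (1)
    only when the hypothesis names the sentence's gold mode.
    """
    examples = []
    for candidate_id, candidate in enumerate(MODES):
        examples.append({
            "premise": sentence,
            "hypothesis": make_hypothesis(candidate),
            "label": 1 if candidate_id == mode_id else 0,
            "topic": mode_id,  # assumed to hold the gold discourse mode
        })
    return examples


pairs = recast("An example sentence.", mode_id=1)
for pair in pairs:
    print(pair["label"], pair["hypothesis"])
```

Under this scheme each classified sentence yields one entailment pair and four not-entailment pairs. The procedure actually used is described in the paper cited under Citation Information and may differ in its details.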
Data Fields
- premise: the premise, a string
- hypothesis: the hypothesis, a string
- label: the entailment label, a class label that is either "not-entailment" (0) or "entailment" (1)
- topic: the topic, a class label that is one of "Argumentative" (0), "Descriptive" (1), "Dialogic" (2), "Informative" (3), "Narrative" (4)
Data Instance

    {
      "hypothesis": "यह एक वर्णनात्मक कथन है।",
      "label": 1,
      "premise": "जैसे उस का सारा चेहरा अपना हो और आँखें किसी दूसरे की जो चेहरे पर पपोटों के पीछे महसूर कर दी गईं।",
      "topic": 1
    }
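To read such an instance, the integer class IDs need to be mapped back to their names. The sketch below parses the instance shown above and decodes it using the orderings listed under Data Fields. The `decode` helper and the `label_name` and `topic_name` keys are illustrative names, not part of the dataset.

```python
import json

# Class names in ID order, as listed under Data Fields.
LABEL_NAMES = ["not-entailment", "entailment"]
TOPIC_NAMES = ["Argumentative", "Descriptive", "Dialogic", "Informative", "Narrative"]

RAW = """
{
  "hypothesis": "यह एक वर्णनात्मक कथन है।",
  "label": 1,
  "premise": "जैसे उस का सारा चेहरा अपना हो और आँखें किसी दूसरे की जो चेहरे पर पपोटों के पीछे महसूर कर दी गईं।",
  "topic": 1
}
"""


def decode(record):
    """Check field types and map integer class IDs to their names."""
    for field in ("premise", "hypothesis"):
        if not isinstance(record[field], str):
            raise TypeError(f"{field} must be a string")
    label, topic = record["label"], record["topic"]
    if label not in range(len(LABEL_NAMES)):
        raise ValueError(f"unknown label id: {label}")
    if topic not in range(len(TOPIC_NAMES)):
        raise ValueError(f"unknown topic id: {topic}")
    return {**record, "label_name": LABEL_NAMES[label], "topic_name": TOPIC_NAMES[topic]}


decoded = decode(json.loads(RAW))
print(decoded["label_name"], decoded["topic_name"])  # entailment Descriptive
```

For this instance the hypothesis (in English, "This is a descriptive statement.") is marked as entailed by the premise, and the topic is Descriptive.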
Dataset Usage
- Training models for natural language inference in Hindi.
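A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that `load_dataset` accepts the repository ID and configuration names listed in this card. The `load_hda_nli` and `to_pair` helpers are hypothetical names introduced here for illustration.

```python
def load_hda_nli(split="train", config="hda nli hindi"):
    """Load one split from the Hugging Face Hub (needs network access).

    Assumption: the repository ID and configuration name below are taken
    from this card and are accepted by `datasets.load_dataset`.
    """
    from datasets import load_dataset  # third-party dependency, imported lazily
    return load_dataset("NirantK/hda_nli_hindi", config, split=split)


def to_pair(example):
    """Return ((premise, hypothesis), label), the shape most
    sentence-pair classifiers and tokenizers expect."""
    return (example["premise"], example["hypothesis"]), example["label"]


# Works on any record with the fields described in this card.
sample = {"premise": "p", "hypothesis": "h", "label": 1, "topic": 1}
print(to_pair(sample))
```

With network access, `load_hda_nli("validation")` would be expected to return the validation split, and each record can then be passed through `to_pair` before tokenization.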
License Information
- License: MIT
- Copyright: held by the Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi (MIDAS, IIIT-Delhi).
Citation Information

    @inproceedings{uppal-etal-2020-two,
        title = "Two-Step Classification using Recasted Data for Low Resource Settings",
        author = "Uppal, Shagun and Gupta, Vivek and Swaminathan, Avinash and Zhang, Haimin and Mahata, Debanjan and Gosangi, Rakesh and Shah, Rajiv Ratn and Stent, Amanda",
        booktitle = "Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing",
        month = dec,
        year = "2020",
        address = "Suzhou, China",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.aacl-main.71",
        pages = "706--719",
        abstract = "An NLP model's ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high resource languages like English. To address scarcity of data in low-resource languages such as Hindi, we use data recasting to create NLI datasets for four existing text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We further study the consistency in predictions of the textual entailment models and propose a consistency regulariser to remove pairwise-inconsistencies in predictions. We propose a novel two-step classification method which uses textual-entailment predictions for classification task. We further improve the performance by using a joint-objective for classification and textual entailment. We therefore highlight the benefits of data recasting and improvements on classification performance using our approach with supporting experimental results.",
    }




