Divyanshu/indicxnli

Name: Divyanshu/indicxnli
Creator: Divyanshu
Published: 2022-10-06 15:26:00
License: No description provided

Source: Hugging Face. Updated: 2022-10-06. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Divyanshu/indicxnli

Resource description:

---
annotations_creators:
- machine-generated
language_creators:
- machine-generated
language:
- as
- bn
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
license:
- cc0-1.0
multilinguality:
- multilingual
pretty_name: IndicXNLI
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- natural-language-inference
---

# Dataset Card for "IndicXNLI"

## Table of Contents

- [Dataset Card for "IndicXNLI"](#dataset-card-for-indicxnli)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)

## Dataset Description

- **Homepage:** <https://github.com/divyanshuaggarwal/IndicXNLI>
- **Paper:** [IndicXNLI: Evaluating Multilingual Inference for Indian Languages](https://arxiv.org/abs/2204.08776)
- **Point of Contact:** [Divyanshu Aggarwal](mailto:divyanshuggrwl@gmail.com)

### Dataset Summary

INDICXNLI is similar to the existing XNLI dataset in shape and form, but focuses on the Indic language family. It includes NLI data for eleven major Indic languages: Assamese (‘as’), Gujarati (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’).

### Supported Tasks and Leaderboards

**Tasks:** Natural Language Inference

**Leaderboards:** There is currently no leaderboard for this dataset.

### Languages

- `Assamese (as)`
- `Bengali (bn)`
- `Gujarati (gu)`
- `Kannada (kn)`
- `Hindi (hi)`
- `Malayalam (ml)`
- `Marathi (mr)`
- `Oriya (or)`
- `Punjabi (pa)`
- `Tamil (ta)`
- `Telugu (te)`

## Dataset Structure

### Data Instances

One example from the `hi` dataset is given below as a Python dictionary.

```python
{
    'premise': 'अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।',
    'hypothesis': 'उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।',
    'label': 1  # neutral
}
```

### Data Fields

- `premise (string)`: Premise sentence
- `hypothesis (string)`: Hypothesis sentence
- `label (integer)`: Integer label: `0` if the premise entails the hypothesis (entailment), `2` if the premise contradicts the hypothesis (contradiction), and `1` otherwise (neutral).

### Data Splits

| Language  | ISO 639-1 Code | Train   | Test  | Dev   |
|-----------|----------------|---------|-------|-------|
| Assamese  | as             | 392,702 | 5,010 | 2,490 |
| Bengali   | bn             | 392,702 | 5,010 | 2,490 |
| Gujarati  | gu             | 392,702 | 5,010 | 2,490 |
| Hindi     | hi             | 392,702 | 5,010 | 2,490 |
| Kannada   | kn             | 392,702 | 5,010 | 2,490 |
| Malayalam | ml             | 392,702 | 5,010 | 2,490 |
| Marathi   | mr             | 392,702 | 5,010 | 2,490 |
| Oriya     | or             | 392,702 | 5,010 | 2,490 |
| Punjabi   | pa             | 392,702 | 5,010 | 2,490 |
| Tamil     | ta             | 392,702 | 5,010 | 2,490 |
| Telugu    | te             | 392,702 | 5,010 | 2,490 |

## Dataset usage

Code snippet for loading the dataset with the `datasets` library.

```python
from datasets import load_dataset

dataset = load_dataset("Divyanshu/indicxnli")
```

## Dataset Creation

Machine translation of the English XNLI dataset into the 11 listed Indic languages.

### Curation Rationale

[More information needed]

### Source Data

[XNLI dataset](https://cims.nyu.edu/~sbowman/xnli/)

#### Initial Data Collection and Normalization

[Detailed in the paper](https://arxiv.org/abs/2204.08776)

#### Who are the source language producers?

[Detailed in the paper](https://arxiv.org/abs/2204.08776)

#### Human Verification Process

[Detailed in the paper](https://arxiv.org/abs/2204.08776)

## Considerations for Using the Data

### Social Impact of Dataset

[Detailed in the paper](https://arxiv.org/abs/2204.08776)

### Discussion of Biases

[Detailed in the paper](https://arxiv.org/abs/2204.08776)

### Other Known Limitations

[Detailed in the paper](https://arxiv.org/abs/2204.08776)

### Dataset Curators

Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

### Licensing Information

Contents of this repository are restricted to only non-commercial research purposes under the [Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). Copyright of the dataset contents belongs to the original copyright holders.

### Citation Information

If you use any of the datasets, models or code modules, please cite the following paper:

```
@misc{https://doi.org/10.48550/arxiv.2204.08776,
  doi = {10.48550/ARXIV.2204.08776},
  url = {https://arxiv.org/abs/2204.08776},
  author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
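
The Data Fields and Dataset usage sections can be combined into a small helper. The following is a minimal sketch, not the authors' code: the names `LABEL_NAMES`, `label_name`, `format_example`, and `load_language` are hypothetical, and it assumes that each language code from the Languages list is accepted as a configuration name by `load_dataset` and that a `train` split exists under that name. The label names "entailment" and "contradiction" are the conventional NLI terms for ids `0` and `2`; only "neutral" is spelled out in the card itself.

```python
# Label ids as described in the Data Fields section (names are conventional NLI terms).
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Language codes from the Languages section of the card.
LANGUAGES = ("as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te")


def label_name(label_id):
    """Map an integer label to a readable name; raises KeyError for unknown ids."""
    return LABEL_NAMES[label_id]


def format_example(example):
    """Render one record (premise, hypothesis, label) as a single line of text."""
    return "{} => {} [{}]".format(
        example["premise"], example["hypothesis"], label_name(example["label"])
    )


def load_language(lang, split="train"):
    """Load one language of the dataset.

    Assumption: the language code is a valid configuration name for
    "Divyanshu/indicxnli". Requires the `datasets` package and network access.
    """
    if lang not in LANGUAGES:
        raise ValueError("unsupported language code: {!r}".format(lang))
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset("Divyanshu/indicxnli", lang, split=split)
```

With network access, `format_example(load_language("hi")[0])` would print the first Hindi training record with its label name, provided the configuration assumption holds.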

Provided by:

Divyanshu

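Summary of original information

Dataset overview

Dataset name

IndicXNLI

Dataset summary

IndicXNLI is a natural language inference dataset focused on the Indic language family, covering 11 major Indic languages.

Supported tasks and leaderboards

Task: natural language inference
Leaderboard: none at present

Supported languages

Assamese (as)
Bengali (bn)
Gujarati (gu)
Kannada (kn)
Hindi (hi)
Malayalam (ml)
Marathi (mr)
Oriya (or)
Punjabi (pa)
Tamil (ta)
Telugu (te)

Dataset structure

Data instances: each instance contains a premise, a hypothesis, and a label.
Data fields:
- premise (string): premise sentence
- hypothesis (string): hypothesis sentence
- label (integer): 0 if the premise entails the hypothesis, 2 if the premise contradicts the hypothesis, and 1 otherwise (neutral).
Data splits: each language is split into training, test, and validation (dev) sets of 392,702, 5,010, and 2,490 records respectively.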
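
The split sizes above can be checked against the stated size category `1M<n<10M` with a little arithmetic. This is a small sketch using only the figures quoted in this listing; the names `SPLIT_SIZES`, `per_language_total`, and `corpus_total` are hypothetical.

```python
# Per-language split sizes as quoted in the data splits description.
SPLIT_SIZES = {"train": 392_702, "test": 5_010, "dev": 2_490}

# The 11 language codes listed for the dataset.
LANGUAGES = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]


def per_language_total(split_sizes=SPLIT_SIZES):
    """Number of records for one language across all three splits."""
    return sum(split_sizes.values())


def corpus_total(languages=LANGUAGES, split_sizes=SPLIT_SIZES):
    """Number of records across all languages, assuming identical split sizes."""
    return len(languages) * per_language_total(split_sizes)


if __name__ == "__main__":
    print(per_language_total())  # 400202
    print(corpus_total())        # 4402222
```

Each language therefore contributes 400,202 records, and the 11 languages together give 4,402,222, which falls inside the `1M<n<10M` range.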

License

cc0-1.0

Dataset size

1M<n<10M

Source datasets

original

Task categories

Text classification

Task IDs

Natural language inference
