five

Divyanshu/indicxnli

收藏
Hugging Face2022-10-06 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/Divyanshu/indicxnli
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - machine-generated language_creators: - machine-generated language: - as - bn - gu - hi - kn - ml - mr - or - pa - ta - te license: - cc0-1.0 multilinguality: - multilingual pretty_name: IndicXNLI size_categories: - 1M<n<10M source_datasets: - original task_categories: - text-classification task_ids: - natural-language-inference --- # Dataset Card for "IndicXNLI" ## Table of Contents - [Dataset Card for "IndicXNLI"](#dataset-card-for-indicxnli) - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) ## Dataset Description - **Homepage:** <https://github.com/divyanshuaggarwal/IndicXNLI> - **Paper:** [IndicXNLI: Evaluating Multilingual Inference for Indian Languages](https://arxiv.org/abs/2204.08776) - **Point of Contact:** [Divyanshu Aggarwal](mailto:divyanshuggrwl@gmail.com) ### Dataset Summary INDICXNLI is similar to existing XNLI dataset in shape/form, but focusses on Indic language family. INDICXNLI include NLI data for eleven major Indic languages that includes Assamese (‘as’), Gujarat (‘gu’), Kannada (‘kn’), Malayalam (‘ml’), Marathi (‘mr’), Odia (‘or’), Punjabi (‘pa’), Tamil (‘ta’), Telugu (‘te’), Hindi (‘hi’), and Bengali (‘bn’). ### Supported Tasks and Leaderboards **Tasks:** Natural Language Inference **Leaderboards:** Currently there is no Leaderboard for this dataset. ### Languages - `Assamese (as)` - `Bengali (bn)` - `Gujarati (gu)` - `Kannada (kn)` - `Hindi (hi)` - `Malayalam (ml)` - `Marathi (mr)` - `Oriya (or)` - `Punjabi (pa)` - `Tamil (ta)` - `Telugu (te)` ## Dataset Structure ### Data Instances One example from the `hi` dataset is given below in JSON format. ```python {'premise': 'अवधारणात्मक रूप से क्रीम स्किमिंग के दो बुनियादी आयाम हैं-उत्पाद और भूगोल।', 'hypothesis': 'उत्पाद और भूगोल क्रीम स्किमिंग का काम करते हैं।', 'label': 1 (neutral) } ``` ### Data Fields - `premise (string)`: Premise Sentence - `hypothesis (string)`: Hypothesis Sentence - `label (integer)`: Integer label `0` if hypothesis `entails` the premise, `2` if hypothesis `negates` the premise and `1` otherwise. ### Data Splits <!-- Below is the dataset split given for `hi` dataset. ```python DatasetDict({ train: Dataset({ features: ['premise', 'hypothesis', 'label'], num_rows: 392702 }) test: Dataset({ features: ['premise', 'hypothesis', 'label'], num_rows: 5010 }) validation: Dataset({ features: ['premise', 'hypothesis', 'label'], num_rows: 2490 }) }) ``` --> Language | ISO 639-1 Code |Train | Test | Dev | --------------|----------------|-------|-----|------| Assamese | as | 392,702 | 5,010 | 2,490 | Bengali | bn | 392,702 | 5,010 | 2,490 | Gujarati | gu | 392,702 | 5,010 | 2,490 | Hindi | hi | 392,702 | 5,010 | 2,490 | Kannada | kn | 392,702 | 5,010 | 2,490 | Malayalam | ml |392,702 | 5,010 | 2,490 | Marathi | mr |392,702 | 5,010 | 2,490 | Oriya | or | 392,702 | 5,010 | 2,490 | Punjabi | pa | 392,702 | 5,010 | 2,490 | Tamil | ta | 392,702 | 5,010 | 2,490 | Telugu | te | 392,702 | 5,010 | 2,490 | <!-- The dataset split remains same across all languages. --> ## Dataset usage Code snippet for using the dataset using datasets library. ```python from datasets import load_dataset dataset = load_dataset("Divyanshu/indicxnli") ``` ## Dataset Creation Machine translation of XNLI english dataset to 11 listed Indic Languages. ### Curation Rationale [More information needed] ### Source Data [XNLI dataset](https://cims.nyu.edu/~sbowman/xnli/) #### Initial Data Collection and Normalization [Detailed in the paper](https://arxiv.org/abs/2204.08776) #### Who are the source language producers? [Detailed in the paper](https://arxiv.org/abs/2204.08776) #### Human Verification Process [Detailed in the paper](https://arxiv.org/abs/2204.08776) ## Considerations for Using the Data ### Social Impact of Dataset [Detailed in the paper](https://arxiv.org/abs/2204.08776) ### Discussion of Biases [Detailed in the paper](https://arxiv.org/abs/2204.08776) ### Other Known Limitations [Detailed in the paper](https://arxiv.org/abs/2204.08776) ### Dataset Curators Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan ### Licensing Information Contents of this repository are restricted to only non-commercial research purposes under the [Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). Copyright of the dataset contents belongs to the original copyright holders. ### Citation Information If you use any of the datasets, models or code modules, please cite the following paper: ``` @misc{https://doi.org/10.48550/arxiv.2204.08776, doi = {10.48550/ARXIV.2204.08776}, url = {https://arxiv.org/abs/2204.08776}, author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop}, keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages}, publisher = {arXiv}, year = {2022}, copyright = {Creative Commons Attribution 4.0 International} } ``` <!-- ### Contributions -->
提供机构:
Divyanshu
原始信息汇总

数据集概述

数据集名称

  • IndicXNLI

数据集摘要

  • IndicXNLI 是一个专注于印度语言家族的自然语言推理数据集,包含11种主要印度语言的数据。

支持的任务和排行榜

  • 任务: 自然语言推理
  • 排行榜: 目前没有

语言支持

  • Assamese (as)
  • Bengali (bn)
  • Gujarati (gu)
  • Kannada (kn)
  • Hindi (hi)
  • Malayalam (ml)
  • Marathi (mr)
  • Oriya (or)
  • Punjabi (pa)
  • Tamil (ta)
  • Telugu (te)

数据集结构

  • 数据实例: 包含前提(premise)、假设(hypothesis)和标签(label)。
  • 数据字段:
    • premise (string): 前提句子
    • hypothesis (string): 假设句子
    • label (integer): 标签,0表示假设蕴含前提,2表示假设否定前提,1表示其他情况。
  • 数据分割: 每个语言的数据集都分为训练集、测试集和验证集,每部分的大小分别为392,702、5,010和2,490条记录。

许可证

  • cc0-1.0

数据集大小

  • 1M<n<10M

数据来源

  • 原始数据

任务类别

  • 文本分类

任务ID

  • 自然语言推理
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作