muhammadravi251001/idkmrc-nli

Name: muhammadravi251001/idkmrc-nli
Creator: muhammadravi251001
Published: 2024-05-16 08:14:03
License: No description available

Hugging Face | Updated 2024-05-16 | Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/muhammadravi251001/idkmrc-nli

Resource overview:

---
annotations_creators:
- machine-generated
- manual-partial-validation
language_creators:
- expert-generated
language:
- id
license: unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- IDK-MRC
task_categories:
- text-classification
task_ids:
- natural-language-inference
pretty_name: IDK-MRC-NLI
dataset_info:
  features:
  - name: premise
    dtype: string
  - name: hypothesis
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': entailment
          '1': neutral
          '2': contradiction
  config_name: idkmrc-nli
  splits:
  - name: train
    num_bytes: 5916125
    num_examples: 18664
  - name: validation
    num_bytes: 473125
    num_examples: 1528
  - name: test
    num_bytes: 521375
    num_examples: 1688
  download_size: 6910625
  dataset_size: 21880
---

# Dataset Card for IDK-MRC-NLI

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Hugging Face](https://huggingface.co/datasets/muhammadravi251001/idkmrc-nli)
- **Point of Contact:** [Hugging Face](https://huggingface.co/datasets/muhammadravi251001/idkmrc-nli)
- **Experiment:** [Github](https://github.com/muhammadravi251001/multilingual-qas-with-nli)

### Dataset Summary

The IDK-MRC-NLI dataset is derived from the IDK-MRC question answering dataset. Its contradiction sets are determined using named entity recognition (NER), chunking tags, regular expressions, and embedding similarity. As a result of this process, the dataset contains columns beyond premise, hypothesis, and label, including properties aligned with NER and chunking tags. The dataset is designed to support Natural Language Inference (NLI) tasks and contains information extracted from diverse sources. Each data instance holds a premise, a hypothesis, a label, and additional properties relevant to NLI evaluation.

### Supported Tasks and Leaderboards

- Natural Language Inference for Indonesian

### Languages

Indonesian

## Dataset Structure

### Data Instances

An example of `test` looks as follows.

```
{
  "premise": "Karangkancana adalah sebuah kecamatan di Kabupaten Kuningan, Provinsi Jawa Barat, Indonesia.",
  "hypothesis": "Dimanakah letak Desa Karang kancana? Kabupaten Kuningan, Provinsi Jawa Barat, Indonesia.",
  "label": 0
}
```

### Data Fields

The data fields are:

- `premise`: a `string` feature
- `hypothesis`: a `string` feature
- `label`: a classification label, with possible values including `entailment` (0), `neutral` (1), `contradiction` (2).

### Data Splits

The data is split across `train`, `valid`, and `test`.

| split | # examples |
|----------|-------:|
| train | 18664 |
| valid | 1528 |
| test | 1688 |

## Dataset Creation

### Curation Rationale

Indonesian NLP is considered under-resourced. We need an NLI dataset to fine-tune NLI models, which can then be used to improve the performance of QA models.

### Source Data

#### Initial Data Collection and Normalization

We collect the data from a prominent Indonesian QA dataset. The annotation was done entirely by the original dataset's researchers.

#### Who are the source language producers?

This synthetic data was produced by machine, but the original data was produced by humans.

### Personal and Sensitive Information

There might be some personal information coming from Wikipedia and news, especially information about famous or important people.

## Considerations for Using the Data

### Discussion of Biases

The QA dataset (and therefore the NLI dataset derived from it) was created using premise sentences taken from Wikipedia and news. These data sources may contain some bias.

### Other Known Limitations

No other known limitations.

## Additional Information

### Dataset Curators

This dataset is the result of the collaborative work of Indonesian researchers from the University of Indonesia, Mohamed bin Zayed University of Artificial Intelligence, and the Korea Advanced Institute of Science & Technology.

### Licensing Information

The license is unknown. Please contact the authors for any information on the dataset.

Provided by:

muhammadravi251001

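Summary of Original Information

Dataset Card: IDK-MRC-NLI

Dataset Description

Dataset Summary

The IDK-MRC-NLI dataset is derived from the IDK-MRC question answering dataset. Its contradiction sets are determined using named entity recognition (NER), chunking tags, regular expressions, and embedding similarity. The dataset contains columns such as premise, hypothesis, and label, along with properties related to NER and chunking tags. It is designed to support Natural Language Inference (NLI) tasks and contains information extracted from diverse sources. Each data instance holds a premise, a hypothesis, a label, and additional properties relevant to NLI evaluation.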

Supported Tasks and Leaderboards

Natural Language Inference for Indonesian

Languages

Indonesian

Dataset Structure

Data Instances

An example from the test split looks as follows:

{
  "premise": "Karangkancana adalah sebuah kecamatan di Kabupaten Kuningan, Provinsi Jawa Barat, Indonesia.",
  "hypothesis": "Dimanakah letak Desa Karang kancana? Kabupaten Kuningan, Provinsi Jawa Barat, Indonesia.",
  "label": 0
}

Data Fields

The data fields are:

premise: a string feature
hypothesis: a string feature
label: a classification label, with possible values including entailment (0), neutral (1), contradiction (2).
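The field list above can be checked against the quoted test example with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `decode_example` and the constant names are hypothetical and not part of the dataset or any library.

```python
import json

# Label ids and names as listed in the dataset card.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# The `test` example quoted in the card, kept as a JSON string.
RAW_EXAMPLE = """
{
  "premise": "Karangkancana adalah sebuah kecamatan di Kabupaten Kuningan, Provinsi Jawa Barat, Indonesia.",
  "hypothesis": "Dimanakah letak Desa Karang kancana? Kabupaten Kuningan, Provinsi Jawa Barat, Indonesia.",
  "label": 0
}
"""


def decode_example(raw: str) -> dict:
    """Hypothetical helper: parse one record and attach a readable label name."""
    record = json.loads(raw)
    # premise and hypothesis are documented as string features.
    for field in ("premise", "hypothesis"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"{field!r} must be a string")
    # label is documented as an integer class id in {0, 1, 2}.
    if record.get("label") not in LABEL_NAMES:
        raise ValueError(f"unknown label id: {record.get('label')!r}")
    return {**record, "label_name": LABEL_NAMES[record["label"]]}


example = decode_example(RAW_EXAMPLE)
print(example["label_name"])  # entailment
```

The same mapping would apply to records loaded by other means, assuming they carry the integer label ids listed above.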

Data Splits

The data is split across train, valid, and test:

Split	# examples
train	18664
valid	1528
test	1688
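To see how the examples are distributed, the split sizes in the table can be summed and turned into proportions. The sketch below copies the counts from the table; the variable names are illustrative only.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {"train": 18664, "valid": 1528, "test": 1688}

# Total number of examples across all splits.
total = sum(SPLIT_SIZES.values())

# Share of each split relative to the total.
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

for name, size in SPLIT_SIZES.items():
    print(f"{name:<6}{size:>7}  {shares[name]:6.1%}")
print(f"total {total:>7}")
```

With these counts the total is 21,880 examples, of which roughly 85.3% are in train, 7.0% in valid, and 7.7% in test.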

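Dataset Creation

Curation Rationale

Indonesian NLP is considered under-resourced. We need an NLI dataset to fine-tune NLI models, which can then be used to improve the performance of QA models.

Source Data

Initial Data Collection and Normalization

We collect the data from a prominent Indonesian QA dataset. The annotation was done entirely by the original dataset's researchers.

Who are the source language producers?

This synthetic data was produced by machine, but the original data was produced by humans.

Personal and Sensitive Information

There might be some personal information coming from Wikipedia and news, especially information about famous or important people.

Considerations for Using the Data

Discussion of Biases

The QA dataset (and therefore the NLI dataset derived from it) was created using premise sentences taken from Wikipedia and news. These data sources may contain some bias.

Other Known Limitations

No other known limitations.

Additional Information

Dataset Curators

This dataset is the result of the collaborative work of researchers from the University of Indonesia, Mohamed bin Zayed University of Artificial Intelligence, and the Korea Advanced Institute of Science & Technology.

Licensing Information

The license is unknown. Please contact the authors for any information on the dataset.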
