muhammadravi251001/tydiqaid-nli

Name: muhammadravi251001/tydiqaid-nli
Creator: muhammadravi251001
Published: 2024-05-16 08:13:42
License: No description available

Hugging Face · Updated 2024-05-16 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/muhammadravi251001/tydiqaid-nli

Resource description:

---
annotations_creators:
- machine-generated
- manual-partial-validation
language_creators:
- expert-generated
language:
- id
license: unknown
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- TyDI-QA-ID
task_categories:
- text-classification
task_ids:
- natural-language-inference
pretty_name: TyDI-QA-ID-NLI
dataset_info:
  features:
  - name: premise
    dtype: string
  - name: hypothesis
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': entailment
          '1': neutral
          '2': contradiction
  config_name: tydiqaid-nli
  splits:
  - name: train
    num_bytes: 3207000
    num_examples: 9694
  - name: validation
    num_bytes: 373750
    num_examples: 1130
  - name: test
    num_bytes: 565625
    num_examples: 1170
  download_size: 4146375
  dataset_size: 11994
---

# Dataset Card for TyDI-QA-ID-NLI

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Hugging Face](https://huggingface.co/datasets/muhammadravi251001/tydiqaid-nli)
- **Point of Contact:** [Hugging Face](https://huggingface.co/datasets/muhammadravi251001/tydiqaid-nli)
- **Experiment:** [Github](https://github.com/muhammadravi251001/multilingual-qas-with-nli)

### Dataset Summary

The TyDI-QA-ID-NLI dataset is derived from the TyDI-QA-ID question answering dataset. Its contradiction sets are built using named entity recognition (NER), chunking tags, regular expressions, and embedding similarity. As a result of this process, the dataset contains columns beyond premise, hypothesis, and label, including properties aligned with the NER and chunking tags. The dataset is designed to support Natural Language Inference (NLI) tasks. Each data instance contains a premise, a hypothesis, a label, and additional properties relevant to NLI evaluation.

### Supported Tasks and Leaderboards

- Natural Language Inference for Indonesian

### Languages

Indonesian

## Dataset Structure

### Data Instances

An example of `test` looks as follows.

```
{
  "premise": "Manuls sering kali terlihat di padang rumput stepa Asia Tengah wilayah Mongolia, Cina dan Dataran Tinggi Tibet, di mana rekor elevasi 5.050 m (16.570 kaki) dilaporkan.[5] Mereka secara luas tersebar di daerah dataran tinggi dan lekukan Intermountain serta padang rumput pegunungan di Kyrgyzstan dan Kazakhstan.[6] Di Rusia, mereka muncul sesekali di Transkaukasus dan daerah Transbaikal, di sepanjang perbatasan dengan utara-timur Kazakhstan, dan di sepanjang perbatasan dengan Mongolia dan Cina di Altai, Tyva Buryatia, dan Chita republik. Pada musim semi 1997, trek yang ditemukan di Timur Sayan pada ketinggian 2.470 m (8.100 kaki) dalam 4,5cm (1,8 in) lapisan salju yang tebal. Trek ini dianggap fakta pertama yang dapat dibuktikan mendiami daerah manuls. Analisis DNA dari kotoran individu ini menegaskan kehadiran spesies.[7] Populasi di barat daya, yaitu wilayah Laut Kaspia, Afghanistan dan Pakistan, berkurang, terisolasi dan jarang [8][9]. Pada tahun 2008, seekor individu terekam kamera di Iran Khojir National Park untuk pertama kalinya [10].,Dimanakah Kucing Pallas pertama kali ditemukan ?",
  "hypothesis": ",Dimanakah Kucing Pallas pertama kali ditemukan ? 2008",
  "label": 0
}
```

### Data Fields

The data fields are:

- `premise`: a `string` feature
- `hypothesis`: a `string` feature
- `label`: a classification label, with possible values including `entailment` (0), `neutral` (1), `contradiction` (2).

### Data Splits

The data is split across `train`, `valid`, and `test`.

| split | # examples |
|----------|-------:|
| train | 9694 |
| valid | 1130 |
| test | 1170 |

## Dataset Creation

### Curation Rationale

Indonesian NLP is considered under-resourced. We need an NLI dataset to fine-tune NLI models, which can then be used alongside QA models to improve QA performance.

### Source Data

#### Initial Data Collection and Normalization

We collected the data from a prominent Indonesian QA dataset. The annotation was done entirely by the original dataset's researchers.

#### Who are the source language producers?

This synthetic data was produced by machine, but the original data was produced by humans.

### Personal and Sensitive Information

There might be some personal information coming from Wikipedia and news, especially information about famous or important people.

## Considerations for Using the Data

### Discussion of Biases

The QA dataset (and therefore the NLI dataset derived from it) was created using premise sentences taken from Wikipedia and news. These data sources may contain some bias.

### Other Known Limitations

No other known limitations.

## Additional Information

### Dataset Curators

This dataset is the result of the collaborative work of Indonesian researchers from the University of Indonesia, Mohamed bin Zayed University of Artificial Intelligence, and the Korea Advanced Institute of Science & Technology.

### Licensing Information

The license is unknown. Please contact the authors for any information on the dataset.
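
The integer labels in the card map to class names in a fixed order, so a small lookup table is often enough to make records readable. The following is a minimal sketch: the mapping comes from the card's `label` feature, while the helper name `decode_label`, the added `label_name` field, and the shortened example record are illustrative assumptions.

```python
# Label ids as listed in the card's `label` feature.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_label(record: dict) -> dict:
    """Return a copy of the record with a readable `label_name` field added."""
    decoded = dict(record)
    decoded["label_name"] = ID2LABEL[record["label"]]
    return decoded


# Shortened, hypothetical stand-in for the `test` instance shown in the card.
example = {
    "premise": "Pada tahun 2008, seekor individu terekam kamera di Iran.",
    "hypothesis": ",Dimanakah Kucing Pallas pertama kali ditemukan ? 2008",
    "label": 0,
}

print(decode_label(example)["label_name"])
```

`LABEL2ID` is the inverse mapping, which is typically what a classifier configuration expects.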

Provided by:

muhammadravi251001

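Summary of original information

Dataset overview

Name: TyDI-QA-ID-NLI

Language: Indonesian

License: unknown

Multilinguality: monolingual

Size: 1K<n<10K

Task category: text classification

Task ID: natural language inference

Dataset information:

Features:
- premise: string
- hypothesis: string
- label: class label with values entailment (0), neutral (1), contradiction (2)
Configuration name: tydiqaid-nli
Data splits:
- train: 9694 examples, 3207000 bytes
- validation: 1130 examples, 373750 bytes
- test: 1170 examples, 565625 bytes
Download size: 4146375 bytes
Dataset size: 11994 examples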
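
The split counts listed above can be checked with a few lines of arithmetic. This sketch only uses the example counts stated in the metadata; the function name `split_shares` is an illustrative choice.

```python
# Example counts per split, as stated in the dataset metadata.
SPLITS = {"train": 9694, "validation": 1130, "test": 1170}


def split_shares(splits: dict) -> tuple:
    """Return the total example count and each split's share in percent."""
    total = sum(splits.values())
    shares = {name: round(100 * n / total, 1) for name, n in splits.items()}
    return total, shares


total, shares = split_shares(SPLITS)
print(total)   # 11994
print(shares)  # {'train': 80.8, 'validation': 9.4, 'test': 9.8}
```

The total matches the stated dataset size of 11994 examples, which corresponds to roughly an 80/10/10 split.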

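Dataset creation:

Source data: derived from the TyDI-QA-ID question answering dataset
Annotations: machine-generated, with partial manual validation
Language creators: expert-generated

Dataset structure:

Data instances: contain a premise, a hypothesis, a label, and additional properties relevant to NLI evaluation
Data fields: see the feature descriptions above
Data splits: see the data split descriptions above

Considerations for using the data:

Discussion of biases: the data sources, such as sentences from Wikipedia and news, may contain bias

Additional information:

Dataset curators: a collaboration among researchers from the University of Indonesia, Mohamed bin Zayed University of Artificial Intelligence, and the Korea Advanced Institute of Science & Technology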

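Collected summary

Dataset introduction

Construction method

The TyDI-QA-ID-NLI dataset was created in response to the relative scarcity of Indonesian-language resources in natural language processing. It takes the TyDI-QA-ID question answering dataset as its source and uses named entity recognition, chunking tags, regular expressions, and embedding similarity to construct three logical relations: entailment, neutral, and contradiction. Construction combines automatic machine generation with partial manual validation to improve annotation accuracy and reliability, yielding a dedicated natural language inference corpus of nearly 12,000 examples.

Features

As a natural language inference dataset focused on Indonesian, its core feature is the annotation of entailment relations between text pairs. Each instance contains a premise, a hypothesis, and the corresponding logical label, and additionally retains properties related to the NER and chunking tags, giving models extra linguistic features. The dataset is moderate in size and divided into training, validation, and test sets, which makes model training and evaluation straightforward and supports further research on Indonesian language understanding.

Usage

The dataset is mainly used to train and evaluate models on natural language inference. Researchers can load the standard splits and use them directly for supervised learning, fine-tuning pretrained language models to improve their classification of entailment relations. Because the data sources may carry biases from Wikipedia or news, users should perform appropriate bias analysis and data cleaning. The derived linguistic properties can also serve as auxiliary features for multi-task learning or interpretability research.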
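
One simple starting point for the bias analysis mentioned above is to inspect the label distribution of a split before fine-tuning. The sketch below is illustrative only: `label_distribution` and `load_split` are hypothetical helper names, the toy records stand in for real data, and the loader assumes that the Hugging Face `datasets` library is installed and that the repository loads with its default configuration.

```python
from collections import Counter


def label_distribution(records) -> dict:
    """Return the fraction of records carrying each label id."""
    counts = Counter(record["label"] for record in records)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


def load_split(split: str = "train"):
    """Load one split from the Hub (assumes `datasets` and network access)."""
    # Imported lazily so label_distribution can be used offline.
    from datasets import load_dataset

    return load_dataset("muhammadravi251001/tydiqaid-nli", split=split)


# Toy records standing in for a real split.
toy = [{"label": 0}, {"label": 2}, {"label": 2}, {"label": 1}]
print(label_distribution(toy))  # {0: 0.25, 1: 0.25, 2: 0.5}
```

With network access, `label_distribution(load_split("train"))` would give the same summary for the real training split; a strongly skewed result would suggest rebalancing or class-weighted training.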

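Background and challenges

Background overview

In natural language processing, building high-quality datasets for relatively under-resourced languages is key to advancing technology for those languages. The TyDI-QA-ID-NLI dataset was created to meet the need for natural language inference resources in Indonesian. It was built collaboratively by researchers from the University of Indonesia, Mohamed bin Zayed University of Artificial Intelligence, and the Korea Advanced Institute of Science & Technology. Its central goal is to provide a reliable NLI benchmark for Indonesian that supports performance improvements in downstream models such as question answering systems. By converting the well-known TyDI-QA-ID question answering dataset and combining named entity recognition, chunking tags, and embedding similarity, the dataset provides an important resource for research on Indonesian semantic understanding and for improving how pretrained models represent the language.

Current challenges

The domain challenge this dataset addresses is building a natural language inference benchmark for low-resource Indonesian, filling a research gap in semantic relation modeling and logical inference for the language. During construction, the main challenges lie in converting and annotating the source data. Although the original QA data was produced by humans, converting it to NLI format relies on automated techniques such as named entity recognition, regular expressions, and embedding similarity to generate contradiction hypotheses, which may reduce the precision of the semantic relations. In addition, the data comes from Wikipedia and news and may carry implicit social and cultural biases, and the automated pipeline has limited ability to capture linguistic nuance and complex reasoning patterns, which may affect how well models trained on the dataset generalize.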
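
To make the idea of regex-based contradiction generation concrete, here is a deliberately simplified sketch. It is not the authors' pipeline: it assumes the answer is a four-digit year, treats every other year found in the context as a contradicting answer, and forms each hypothesis by appending the answer to the question. The names `YEAR` and `make_hypotheses` and the example sentences are hypothetical.

```python
import re

# Four-digit years between 1000 and 2099 (an illustrative pattern).
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def make_hypotheses(context: str, question: str, answer: str) -> list:
    """Build (hypothesis, label) pairs from one QA example.

    The gold answer yields an entailment pair (label 0); every other
    year found in the context yields a contradiction pair (label 2).
    """
    pairs = [(f"{question} {answer}", 0)]
    seen = {answer}
    for year in YEAR.findall(context):
        if year not in seen:
            seen.add(year)
            pairs.append((f"{question} {year}", 2))
    return pairs


context = (
    "Pada musim semi 1997, trek ditemukan di Timur Sayan. "
    "Pada tahun 2008, seekor individu terekam kamera di Iran."
)
question = "Kapan individu terekam kamera di Iran ?"
for hypothesis, label in make_hypotheses(context, question, "2008"):
    print(label, hypothesis)
```

Swapping in a distractor of the same type is the step where precision can suffer: a distractor that happens to be a valid alternative answer would be mislabeled as a contradiction, which is why partial manual validation matters.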

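Common scenarios

Typical use cases

The TyDI-QA-ID-NLI dataset is a key resource for natural language inference in Indonesian. It extracts premise and hypothesis pairs from the TyDI-QA-ID question answering dataset and builds contradiction sets using named entity recognition, chunking tags, and related techniques, supporting models that classify pairs as entailment, neutral, or contradiction. Typical use cases include training and evaluating cross-lingual or monolingual NLI models, particularly for relatively under-resourced Indonesian, where the dataset gives researchers a standardized benchmark for language understanding work.

Derived related work

A number of research efforts have built on this dataset, including fine-tuning experiments with pretrained models for Indonesian, improvements to cross-lingual transfer learning frameworks, and new low-resource NLI methods. For example, researchers have used the dataset to evaluate models such as multilingual BERT on Indonesian and have developed enhanced inference techniques that incorporate entity information. This work adds to the body of research on Indonesian NLI and offers a methodology that can be reused for other low-resource languages.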

Recent research on the dataset