KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives
Source: Hugging Face (updated 2026-01-16, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives
Resource description:
---
dataset_info:
  features:
  - name: anchor
    dtype: string
  - name: positives
    dtype: string
  - name: negatives
    dtype: string
  splits:
  - name: train
    num_bytes: 749109360
    num_examples: 1308990
  - name: test
    num_bytes: 9045912
    num_examples: 16700
  - name: dev
    num_bytes: 4501075
    num_examples: 8300
  download_size: 113501692
  dataset_size: 762656347
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
language:
- as
license: cc0-1.0
task_categories:
- sentence-similarity
- text-classification
pretty_name: Assamese IndicXNLI Triplet Dataset (Random Negatives)
---
# Assamese IndicXNLI Triplet Dataset (Random Negatives = 10)
## Overview
This dataset is derived from the **Assamese portion** of the **IndicXNLI dataset**
([Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli)), a
multilingual natural language inference corpus covering 11 Indic languages.
It is specifically constructed for **metric learning and contrastive learning**
settings such as **triplet-loss training**.
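Each anchor–positive pair comes with:
- an **anchor sentence**
- a **positive sentence** (entailment)
- **10 randomly sampled negative sentences** (non-entailment)

In the schema above, `anchor`, `positives`, and `negatives` are each a single string, so a row appears to hold one triplet, with the 10 negatives for a pair spread across separate rows.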
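
A minimal sketch of loading the data with the Hugging Face `datasets` library. The repository ID, split names, and column names are taken from the metadata above; `as_triplet` and `load_split` are hypothetical helpers, and reading each row as one (anchor, positive, negative) triplet is an assumption based on the three string-typed columns.

```python
from typing import Dict, Tuple

# Repository ID as given in the card's download link.
REPO_ID = "KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives"


def as_triplet(row: Dict[str, str]) -> Tuple[str, str, str]:
    """Hypothetical helper: unpack one row using the column names from the card."""
    return row["anchor"], row["positives"], row["negatives"]


def load_split(split: str = "dev"):
    """Load one split ("train", "test", or "dev"). Requires network access."""
    # Imported lazily so that as_triplet can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

With network access, something like `as_triplet(load_split("dev")[0])` should return the three strings of the first row in the `dev` split.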
---
## Source Dataset: IndicXNLI
**IndicXNLI** is a multilingual natural language inference (NLI) dataset created by
machine-translating the English XNLI corpus into 11 Indic languages, including
Assamese.
Each example consists of a *(premise, hypothesis)* pair labeled as:
- entailment
- contradiction
- neutral
**Source dataset:**
[Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli)
---
## Construction Details
For the Assamese split of IndicXNLI:
- **Entailment pairs** are treated as *(anchor, positive)* pairs.
- For each such pair, **10 negative sentences** are sampled at random from
examples labeled as *neutral* or *contradiction*.
- Negative sampling is **uniform and random**, without semantic filtering or
hard-negative mining.
This yields one triplet per sampled negative, i.e. 10 triplets per anchor–positive
pair, which serves as a random-negative baseline for representation learning.
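
The construction steps above can be sketched roughly as follows. This is not the authors' script: the field names `premise`, `hypothesis`, and `label`, the use of the premise as anchor and the hypothesis as positive, drawing negatives from the hypotheses of non-entailment examples, and sampling with replacement are all assumptions made for illustration. The toy strings are placeholders, not real data.

```python
import random
from typing import Dict, Iterable, List


def build_triplets(
    examples: Iterable[Dict[str, str]],
    num_negatives: int = 10,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Pair every entailment example with uniformly sampled random negatives."""
    rng = random.Random(seed)
    examples = list(examples)

    # Assumed negative pool: hypotheses of neutral/contradiction examples.
    pool = [
        ex["hypothesis"]
        for ex in examples
        if ex["label"] in ("neutral", "contradiction")
    ]
    if not pool:
        raise ValueError("no neutral or contradiction examples to sample from")

    triplets = []
    for ex in examples:
        if ex["label"] != "entailment":
            continue
        # One output row per sampled negative (uniform, with replacement).
        for _ in range(num_negatives):
            triplets.append(
                {
                    "anchor": ex["premise"],
                    "positives": ex["hypothesis"],
                    "negatives": rng.choice(pool),
                }
            )
    return triplets


# Placeholder examples: two entailment pairs and two non-entailment pairs.
toy = [
    {"premise": "P1", "hypothesis": "H1", "label": "entailment"},
    {"premise": "P2", "hypothesis": "H2", "label": "neutral"},
    {"premise": "P3", "hypothesis": "H3", "label": "contradiction"},
    {"premise": "P4", "hypothesis": "H4", "label": "entailment"},
]
print(len(build_triplets(toy)))  # 20
```

Under this one-row-per-triplet reading, the 1,308,990 train rows listed in the metadata would correspond to 130,899 entailment pairs.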
---
## Intended Use
This dataset is suitable for:
- Sentence embedding learning
- Triplet-loss and contrastive-loss training
- Siamese / bi-encoder models
- Low-resource Indic language representation learning
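
For the triplet-loss use case, the objective can be illustrated with a small dependency-free sketch. It computes the standard margin-based triplet loss, `max(0, d(a, p) - d(a, n) + margin)`, on already-computed embedding vectors. The choice of cosine distance and the margin of 0.5 are assumptions for illustration, not settings recommended by the dataset authors.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,  # illustrative value, not taken from the card
) -> float:
    """Margin-based triplet loss: zero once the negative is far enough away."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

In practice the three vectors would come from a sentence encoder applied to the `anchor`, `positives`, and `negatives` columns, and the loss would be averaged over a batch.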
---
## Limitations
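- Negatives are **random**, not hard negatives
- Some negatives may be semantically distant from the anchor, so the corresponding triplets can be trivially easy
- Not intended for direct NLI classification: rows carry no entailment/neutral/contradiction label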
---
## Attribution
This dataset is a **derived work** based on
[Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli), described in
*IndicXNLI: Evaluating Multilingual Inference for Indian Languages* (cited below).
---
## Citation
If you use this dataset, please cite the original IndicXNLI paper:
```bibtex
@inproceedings{aggarwal-etal-2022-indicxnli,
title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
year = {2022},
address = {Abu Dhabi, United Arab Emirates},
publisher = {Association for Computational Linguistics},
pages = {10994--11006},
doi = {10.18653/v1/2022.emnlp-main.755}
}
```
---
Provider: KhyontekAI



