KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives

Name: KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives
Creator: KhyontekAI
Published: 2026-01-16 07:33:03
License: cc0-1.0

Source: Hugging Face. Updated: 2026-01-16. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives
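A minimal sketch of loading one split with the Hugging Face `datasets` library follows. The helper name `load_split` and the `use_mirror` flag are hypothetical, and selecting the mirror through the `HF_ENDPOINT` environment variable is an assumption about how the mirror linked above is normally used. The split names and row counts are taken from the dataset card metadata.

```python
import os

DATASET_ID = "KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives"

# Row counts per split, as stated in the dataset card metadata.
EXPECTED_ROWS = {"train": 1308990, "test": 16700, "dev": 8300}


def load_split(split="dev", use_mirror=False):
    """Load one split of the dataset (hypothetical helper, not part of any library)."""
    if split not in EXPECTED_ROWS:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(EXPECTED_ROWS)}")
    if use_mirror:
        # Assumption: the mirror is selected via HF_ENDPOINT. The variable must be
        # set before the Hugging Face libraries are imported for the first time.
        os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
    # Imported lazily so that the module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)
```

Calling `load_split("dev")` downloads the smallest split first, which is a cheap way to inspect the `anchor`, `positives`, and `negatives` columns before fetching the full training split.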


Resource description:

---
dataset_info:
  features:
  - name: anchor
    dtype: string
  - name: positives
    dtype: string
  - name: negatives
    dtype: string
  splits:
  - name: train
    num_bytes: 749109360
    num_examples: 1308990
  - name: test
    num_bytes: 9045912
    num_examples: 16700
  - name: dev
    num_bytes: 4501075
    num_examples: 8300
  download_size: 113501692
  dataset_size: 762656347
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
language:
- as
license: cc0-1.0
task_categories:
- sentence-similarity
- text-classification
pretty_name: Assamese IndicXNLI Triplet Dataset (Random Negatives)
---

# Assamese IndicXNLI Triplet Dataset (Random Negatives = 10)

## Overview

This dataset is derived from the **Assamese portion** of the **IndicXNLI dataset** ([Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli)), a multilingual natural language inference corpus covering 11 Indic languages. It is specifically constructed for **metric learning and contrastive learning** settings such as **triplet-loss training**.

Each instance contains:

- an **anchor sentence**
- a **positive sentence** (entailment)
- **10 randomly sampled negative sentences** (non-entailment)

---

## Source Dataset: IndicXNLI

**IndicXNLI** is a multilingual natural language inference (NLI) dataset created by machine-translating the English XNLI corpus into 11 Indic languages, including Assamese. Each example consists of a *(premise, hypothesis)* pair labeled as:

- entailment
- contradiction
- neutral

**Source dataset:** [Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli)

---

## Construction Details

For the Assamese split of IndicXNLI:

- **Entailment pairs** are treated as *(anchor, positive)* pairs.
- For each such pair, **10 negative sentences** are sampled at random from examples labeled as *neutral* or *contradiction*.
- Negative sampling is **uniform and random**, without semantic filtering or hard-negative mining.

This results in multiple triplets per anchor-positive pair, providing a strong baseline for representation learning.

---

## Intended Use

This dataset is suitable for:

- Sentence embedding learning
- Triplet-loss and contrastive-loss training
- Siamese / bi-encoder models
- Low-resource Indic language representation learning

---

## Limitations

- Negatives are **random**, not hard negatives
- Some negatives may be semantically distant
- Not intended for direct NLI classification

---

## Attribution

This dataset is a **derived work** based on [Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli) (IndicXNLI: Evaluating Multilingual Inference for Indian Languages).

---

## Citation

If you use this dataset, please cite the original IndicXNLI paper:

```bibtex
@inproceedings{aggarwal-etal-2022-indicxnli,
  title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
  author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  year = {2022},
  address = {Abu Dhabi, United Arab Emirates},
  publisher = {Association for Computational Linguistics},
  pages = {10994--11006},
  doi = {10.18653/v1/2022.emnlp-main.755}
}
```
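
The construction described in the card can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual script: the function names are hypothetical, the toy sentences are English placeholders rather than Assamese data, the premise is assumed to serve as the anchor and the hypothesis as the positive, negatives are assumed to be hypotheses drawn without replacement from one global non-entailment pool, and the negatives are kept as a Python list even though the card metadata types the `negatives` column as a string.

```python
import random

# Toy stand-in for IndicXNLI rows: (premise, hypothesis, label).
TOY_EXAMPLES = [
    ("A man is sleeping.", "A person is asleep.", "entailment"),
    ("A man is sleeping.", "The man is running.", "contradiction"),
    ("A man is sleeping.", "The man is tired.", "neutral"),
    ("Two dogs play outside.", "Animals are outdoors.", "entailment"),
    ("Two dogs play outside.", "The dogs are indoors.", "contradiction"),
    ("Two dogs play outside.", "The dogs are brown.", "neutral"),
]


def build_triplet_rows(examples, num_negatives=10, seed=0):
    """Pair each entailment example with uniformly sampled random negatives."""
    rng = random.Random(seed)
    # Assumption: the negative pool holds hypotheses of all non-entailment examples.
    pool = [hyp for _, hyp, label in examples if label in ("neutral", "contradiction")]
    if len(pool) < num_negatives:
        raise ValueError("not enough non-entailment sentences to sample from")
    rows = []
    for premise, hypothesis, label in examples:
        if label != "entailment":
            continue
        # Uniform sampling without replacement; no semantic filtering,
        # no hard-negative mining.
        negatives = rng.sample(pool, num_negatives)
        rows.append({"anchor": premise, "positives": hypothesis, "negatives": negatives})
    return rows


def expand_to_triplets(rows):
    """Flatten each row into one (anchor, positive, negative) triplet per negative."""
    return [(r["anchor"], r["positives"], neg) for r in rows for neg in r["negatives"]]


rows = build_triplet_rows(TOY_EXAMPLES, num_negatives=3, seed=0)
triplets = expand_to_triplets(rows)
```

With `num_negatives=10`, as in the card, each anchor-positive pair would expand to ten triplets that can be fed to a triplet-loss objective. Because the pool is sampled blindly, a negative may occasionally be unrelated to the anchor or even compatible with it, which matches the limitations listed in the card.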


Provider:

KhyontekAI
