
KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives

Source: Hugging Face. Updated: 2026-01-16. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/KhyontekAI/Assamese-IndicXNLI-Triplet-Random-Negatives
Description:
---
dataset_info:
  features:
  - name: anchor
    dtype: string
  - name: positives
    dtype: string
  - name: negatives
    dtype: string
  splits:
  - name: train
    num_bytes: 749109360
    num_examples: 1308990
  - name: test
    num_bytes: 9045912
    num_examples: 16700
  - name: dev
    num_bytes: 4501075
    num_examples: 8300
  download_size: 113501692
  dataset_size: 762656347
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
language:
- as
license: cc0-1.0
task_categories:
- sentence-similarity
- text-classification
pretty_name: Assamese IndicXNLI Triplet Dataset (Random Negatives)
---

# Assamese IndicXNLI Triplet Dataset (Random Negatives = 10)

## Overview

This dataset is derived from the **Assamese portion** of the **IndicXNLI dataset** ([Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli)), a multilingual natural language inference corpus covering 11 Indic languages.

It is specifically constructed for **metric learning and contrastive learning** settings such as **triplet-loss training**.

Each instance contains:

- an **anchor sentence**
- a **positive sentence** (entailment)
- **10 randomly sampled negative sentences** (non-entailment)

---

## Source Dataset: IndicXNLI

**IndicXNLI** is a multilingual natural language inference (NLI) dataset created by machine-translating the English XNLI corpus into 11 Indic languages, including Assamese.

Each example consists of a *(premise, hypothesis)* pair labeled as:

- entailment
- contradiction
- neutral

**Source dataset:** [Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli)

---

## Construction Details

For the Assamese split of IndicXNLI:

- **Entailment pairs** are treated as *(anchor, positive)* pairs.
- For each such pair, **10 negative sentences** are sampled at random from examples labeled as *neutral* or *contradiction*.
- Negative sampling is **uniform and random**, without semantic filtering or hard-negative mining.

This results in multiple triplets per anchor–positive pair, providing a strong baseline for representation learning.

---

## Intended Use

This dataset is suitable for:

- Sentence embedding learning
- Triplet-loss and contrastive-loss training
- Siamese / bi-encoder models
- Low-resource Indic language representation learning

---

## Limitations

- Negatives are **random**, not hard negatives
- Some negatives may be semantically distant
- Not intended for direct NLI classification

---

## Attribution

This dataset is a **derived work** based on [Divyanshu/indicxnli](https://huggingface.co/datasets/Divyanshu/indicxnli): IndicXNLI: Evaluating Multilingual Inference for Indian Languages.

---

## Citation

If you use this dataset, please cite the original IndicXNLI paper:

```bibtex
@inproceedings{aggarwal-etal-2022-indicxnli,
  title = {IndicXNLI: Evaluating Multilingual Inference for Indian Languages},
  author = {Aggarwal, Divyanshu and Gupta, Vivek and Kunchukuttan, Anoop},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  year = {2022},
  address = {Abu Dhabi, United Arab Emirates},
  publisher = {Association for Computational Linguistics},
  pages = {10994--11006},
  doi = {10.18653/v1/2022.emnlp-main.755}
}
```
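
---

## Illustrative Sketch: Random Negative Sampling

The procedure described under Construction Details could look roughly like the sketch below. It is not the authors' script. Several choices are assumptions made only for illustration: the premise is used as the anchor and the hypothesis as the positive, negatives are taken from the hypotheses of neutral and contradiction examples, sampling is done without replacement, and each sampled negative is written as its own row (which would match the `string` dtype of `negatives` in the metadata, although the card does not confirm the row layout). The function name `build_triplets` and the English toy sentences are hypothetical.

```python
import random


def build_triplets(examples, num_negatives=10, seed=0):
    """Build (anchor, positive, negative) rows from labeled NLI examples.

    `examples` is a list of (premise, hypothesis, label) tuples, where label
    is "entailment", "neutral", or "contradiction".
    """
    rng = random.Random(seed)

    # Assumption: the negative pool holds hypotheses of non-entailment examples.
    negative_pool = [hyp for _, hyp, label in examples if label != "entailment"]

    triplets = []
    for premise, hypothesis, label in examples:
        if label != "entailment":
            continue
        # Uniform random sampling, no semantic filtering or hard-negative mining.
        k = min(num_negatives, len(negative_pool))
        for negative in rng.sample(negative_pool, k):
            # Assumption: one output row per sampled negative.
            triplets.append(
                {"anchor": premise, "positives": hypothesis, "negatives": negative}
            )
    return triplets


# Toy English data standing in for the Assamese IndicXNLI examples.
toy = [
    ("A man is cooking.", "Someone is preparing food.", "entailment"),
    ("A man is cooking.", "The man is a chef.", "neutral"),
    ("A man is cooking.", "Nobody is in the kitchen.", "contradiction"),
    ("A dog runs outside.", "An animal is moving.", "entailment"),
    ("A dog runs outside.", "The dog is asleep.", "contradiction"),
]

triplets = build_triplets(toy, num_negatives=2, seed=0)
print(len(triplets))  # 2 entailment pairs x 2 negatives each = 4 rows
```

Because the pool is sampled without any similarity check, a negative may occasionally come from the same premise as the anchor or be trivially unrelated to it, which is the trade-off noted under Limitations.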

Provider:
KhyontekAI