jaygala24/indic_sts
Hugging Face | Updated 2024-04-23 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/jaygala24/indic_sts
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
- found
- machine-generated
language:
- as
- bn
- en
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
- ur
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
task_categories:
- text-classification
task_ids:
- text-scoring
- semantic-similarity-scoring
pretty_name: Indic STS
dataset_info:
- config_name: en-as
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 167751
    num_examples: 656
  download_size: 83024
  dataset_size: 167751
- config_name: en-bn
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 280422
    num_examples: 957
  download_size: 133859
  dataset_size: 280422
- config_name: en-gu
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 217276
    num_examples: 780
  download_size: 105617
  dataset_size: 217276
- config_name: en-hi
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 458019
    num_examples: 1268
  download_size: 224825
  dataset_size: 458019
- config_name: en-kn
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 291586
    num_examples: 953
  download_size: 140273
  dataset_size: 291586
- config_name: en-ml
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 302856
    num_examples: 947
  download_size: 139777
  dataset_size: 302856
- config_name: en-mr
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 253059
    num_examples: 779
  download_size: 124599
  dataset_size: 253059
- config_name: en-or
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 121019
    num_examples: 500
  download_size: 57149
  dataset_size: 121019
- config_name: en-pa
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 204730
    num_examples: 688
  download_size: 102251
  dataset_size: 204730
- config_name: en-ta
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 368438
    num_examples: 1044
  download_size: 162846
  dataset_size: 368438
- config_name: en-te
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 291129
    num_examples: 948
  download_size: 138528
  dataset_size: 291129
- config_name: en-ur
  features:
  - name: lang_code
    dtype: string
  - name: source
    dtype: string
  - name: english_sentence
    dtype: string
  - name: indic_sentence
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: test
    num_bytes: 150870
    num_examples: 500
  download_size: 85405
  dataset_size: 150870
configs:
- config_name: en-as
  data_files:
  - split: test
    path: en-as/test-*
- config_name: en-bn
  data_files:
  - split: test
    path: en-bn/test-*
- config_name: en-gu
  data_files:
  - split: test
    path: en-gu/test-*
- config_name: en-hi
  data_files:
  - split: test
    path: en-hi/test-*
- config_name: en-kn
  data_files:
  - split: test
    path: en-kn/test-*
- config_name: en-ml
  data_files:
  - split: test
    path: en-ml/test-*
- config_name: en-mr
  data_files:
  - split: test
    path: en-mr/test-*
- config_name: en-or
  data_files:
  - split: test
    path: en-or/test-*
- config_name: en-pa
  data_files:
  - split: test
    path: en-pa/test-*
- config_name: en-ta
  data_files:
  - split: test
    path: en-ta/test-*
- config_name: en-te
  data_files:
  - split: test
    path: en-te/test-*
- config_name: en-ur
  data_files:
  - split: test
    path: en-ur/test-*
tags:
- multilingual
- semantic-textual-similarity
---
# Dataset Card for Indic STS
This dataset is a semantic textual similarity (STS) benchmark between English and 12 high-resource Indic languages. It was released as part of the [Samanantar](https://arxiv.org/abs/2104.05596) paper. Please refer to the paper for more details.
### Languages
The available language pairs, which are also the configuration names, are: en-as, en-bn, en-gu, en-hi, en-kn, en-ml, en-mr, en-or, en-pa, en-ta, en-te, en-ur
### Dataset Structure
#### Dataset Fields
- lang_code: Two-letter ISO 639-1 code of the Indic language.
- source: The source from which the candidate sentence pair was taken.
- english_sentence: The full sentence in English.
- indic_sentence: The full sentence in the corresponding Indic language.
- score: The similarity score, a float between 0.0 and 5.0 inclusive.
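The field list above can be turned into a small sanity check for individual records. The following is a minimal sketch, not part of the dataset's own tooling: the names `EXPECTED_FIELDS` and `validate_record` are hypothetical, and the only facts taken from the card are the five field names, their types, and the 0.0 to 5.0 score range.

```python
from typing import Any, List, Mapping

# Field names and Python types corresponding to the dataset card
# (string fields map to str, the float64 score maps to float).
EXPECTED_FIELDS = {
    "lang_code": str,
    "source": str,
    "english_sentence": str,
    "indic_sentence": str,
    "score": float,
}


def validate_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # The card states that scores lie between 0.0 and 5.0 inclusive.
    score = record.get("score")
    if isinstance(score, float) and not 0.0 <= score <= 5.0:
        problems.append(f"score out of range: {score}")
    return problems
```

Calling `validate_record(row)` on each row of a loaded split would flag rows that deviate from the documented schema.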
#### Data Instances
```json
{
"lang_code":"hi",
"source":"CatchNews",
"english_sentence":"\"...this is only an interim measure and as long as we have hopefully control over COVID in a few months or a year\\'s time then I think things will go back to as normal as it can be,\" Kumble said\n",
"indic_sentence":"उन्होंने कहा,\"यह केवल एक अंतरिम उपाय है और जब तक हम कुछ महीनों या एक साल के समय में COVID-19 पर नियंत्रण करते हैं, तब तक मुझे लगता है कि चीजें फिर से सामान्य हो जाएंगी\n",
"score":4.0
}
```
#### Splits
| | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| test | 656 | 957 | 780 | 1268 | 953 | 947 | 779 | 500 | 688 | 1044 | 948 | 500 |
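The per-pair counts in the table can be kept as a lookup for checking a local copy of the data. This is a rough sketch: `TEST_SIZES`, `total_examples`, and `mismatched_configs` are hypothetical helper names, and the numbers are copied from the table above.

```python
from typing import Dict, Optional, Tuple

# Test-split sizes copied from the splits table.
TEST_SIZES: Dict[str, int] = {
    "en-as": 656, "en-bn": 957, "en-gu": 780, "en-hi": 1268,
    "en-kn": 953, "en-ml": 947, "en-mr": 779, "en-or": 500,
    "en-pa": 688, "en-ta": 1044, "en-te": 948, "en-ur": 500,
}


def total_examples(sizes: Dict[str, int]) -> int:
    """Sum the number of sentence pairs over all language pairs."""
    return sum(sizes.values())


def mismatched_configs(
    actual: Dict[str, int],
    expected: Dict[str, int] = TEST_SIZES,
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map each config whose size differs from the table to (expected, actual).

    A config missing from `actual` is reported with an actual size of None.
    """
    return {
        name: (size, actual.get(name))
        for name, size in expected.items()
        if actual.get(name) != size
    }
```

Summed over the 12 language pairs, the table gives 10,020 sentence pairs in total.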
### Examples of Use
```python3
from datasets import load_dataset
dataset = load_dataset("jaygala24/indic_sts", name="en-hi", split="test")
```
### Citation
```bibtex
@article{DBLP:journals/tacl/RameshDBJASSDJK22,
author = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Deepak Kumar and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
title = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
journal = {Trans. Assoc. Comput. Linguistics},
volume = {10},
pages = {145-162},
year = {2022},
url = {https://doi.org/10.1162/tacl\_a\_00452},
doi = {10.1162/TACL\_A\_00452},
timestamp = {Wed, 29 Jun 2022 16:03:22 +0200},
biburl = {https://dblp.org/rec/journals/tacl/RameshDBJASSDJK22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
Provider:
jaygala24
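Summary of the original information
Indic STS dataset overview
Basic dataset information
- Name: Indic STS
- Languages: multilingual, covering as, bn, en, gu, hi, kn, ml, mr, or, pa, ta, te, ur
- License: cc0-1.0
- Size: declared size category 1K<n<10K; individual configurations contain between 500 and 1,268 test examples
- Task category: text classification
- Specific tasks: text scoring, semantic similarity scoring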
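Dataset structure
Dataset fields
- lang_code: two-letter ISO language code
- source: the source of the candidate sentence
- english_sentence: the full sentence in English
- indic_sentence: the full sentence in the corresponding Indic language
- score: the similarity score, ranging from 0.0 to 5.0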
Dataset splits
| | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test | 656 | 957 | 780 | 1268 | 953 | 947 | 779 | 500 | 688 | 1044 | 948 | 500 |
Collected summary
Dataset introduction

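Construction method
In natural language processing, cross-lingual semantic similarity evaluation is a key task for measuring how well models understand multilingual text. Indic STS was built from a combination of crowdsourced, found, and machine-generated text, and covers sentence pairs between English and 12 high-resource Indic languages. The dataset originates from the Samanantar project. Each sentence pair was collected and annotated with a semantic similarity score, giving cross-lingual semantic understanding research a varied and representative evaluation resource.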
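Usage
The dataset is intended mainly for evaluating models on cross-lingual semantic similarity; only a test split is provided for each language pair. It can be loaded directly with the Hugging Face datasets library, for example by calling load_dataset with a language configuration such as 'en-hi' to obtain the Hindi-English test set. Each instance contains a language code, a source, an English sentence, an Indic-language sentence, and a similarity score, which makes the data convenient for analyzing model performance or for use as an evaluation benchmark in downstream multilingual NLP work.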
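Background and challenges
Background overview
In natural language processing, cross-lingual semantic similarity evaluation is a core task for measuring how closely texts in different languages correspond in meaning, and it matters for machine translation, information retrieval, and multilingual model training. Indic STS is part of the Samanantar project and was published in 2022 by Gowtham Ramesh and colleagues as a semantic textual similarity benchmark between English and 12 high-resource Indic languages. Built from crowdsourced, found, and machine-generated text, it covers Assamese, Bengali, Gujarati, Hindi, and other languages, and serves as an evaluation resource for multilingual NLP research on South Asian languages.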
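Current challenges
Indic STS targets a central difficulty in cross-lingual semantic similarity: accurately quantifying how equivalent two texts are when they come from languages of different families with different grammatical structures. Because Indic languages differ markedly in morphology, syntax, and cultural expression, defining a single similarity scale runs into the complexity of semantic alignment. In building the dataset, the main challenges lie in collecting and annotating high-quality multilingual data. Crowdsourced annotation has to cope with a shortage of language experts and with the subjectivity of scoring. Combining machine-generated text with existing resources also requires keeping the data consistent and representative, so that translation bias or culture-specific expressions do not undermine the reliability of the evaluation.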
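Common scenarios
Typical use cases
In natural language processing, cross-lingual semantic similarity evaluation is a key task for measuring whether a model captures the core meaning of multilingual text. As a semantic textual similarity benchmark between English and twelve high-resource Indic languages, Indic STS is typically used as a standardized test bed for cross-lingual similarity models. Researchers use it to evaluate how well a model captures the semantic relatedness of English and Indic-language sentence pairs, which in turn informs work on multilingual representation learning.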
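Academic problems addressed
The dataset addresses the shortage of evaluation resources for semantic alignment in multilingual NLP. By providing sentence pairs between English and Indic languages such as Assamese, Bengali, and Hindi, together with crowdsourced similarity scores, it offers a benchmark for cross-lingual semantic similarity, for assessing models on lower-resource languages, and for fair comparison of multilingual representation learning methods. Its significance lies in helping fill the gap in semantic evaluation data for Indic languages and in making language technology research more inclusive and diverse.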
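Practical applications
In practice, Indic STS supports the development of multilingual systems for South Asia. Models validated on it can be applied to machine translation quality estimation, cross-lingual information retrieval, and the semantic understanding components of multilingual chatbots. In news aggregation or social media analysis, for example, a system can judge whether reports of the same event written in different languages are semantically consistent, improving the accuracy and coverage of information services.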
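Recent research on the dataset
Latest research directions
In natural language processing, multilingual semantic similarity evaluation is a key step in advancing cross-lingual understanding. As a semantic textual similarity benchmark covering English paired with 12 high-resource Indic languages, Indic STS is a resource for studying how well multilingual models align meaning across languages. Current research focuses on using the dataset to train and evaluate cross-lingual representation models, in particular on transfer learning and zero-shot performance in low-resource settings. Alongside global digital inclusion initiatives, datasets of this kind support the use of Indic languages in machine translation, information retrieval, and multilingual dialogue systems, helping to lower language barriers and develop the regional language technology ecosystem.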
The content above was collected and summarized by 遇见数据集.



