mteb/indic_sts
Source: Hugging Face. Updated 2024-05-07; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/mteb/indic_sts
Resource description:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
- found
- machine-generated
language:
- as
- bn
- en
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
- ur
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
task_categories:
- text-classification
task_ids:
- text-scoring
- semantic-similarity-scoring
pretty_name: Indic STS
configs:
- config_name: default
  data_files:
  - path: test/*.parquet
    split: test
- config_name: en-bn
  data_files:
  - path: test/en-bn.parquet
    split: test
- config_name: en-hi
  data_files:
  - path: test/en-hi.parquet
    split: test
- config_name: en-or
  data_files:
  - path: test/en-or.parquet
    split: test
- config_name: en-ml
  data_files:
  - path: test/en-ml.parquet
    split: test
- config_name: en-as
  data_files:
  - path: test/en-as.parquet
    split: test
- config_name: en-pa
  data_files:
  - path: test/en-pa.parquet
    split: test
- config_name: en-ta
  data_files:
  - path: test/en-ta.parquet
    split: test
- config_name: en-gu
  data_files:
  - path: test/en-gu.parquet
    split: test
- config_name: en-kn
  data_files:
  - path: test/en-kn.parquet
    split: test
- config_name: en-te
  data_files:
  - path: test/en-te.parquet
    split: test
- config_name: en-ur
  data_files:
  - path: test/en-ur.parquet
    split: test
- config_name: en-mr
  data_files:
  - path: test/en-mr.parquet
    split: test
tags:
- multilingual
- semantic-textual-similarity
---
# Dataset Card for Indic STS
This dataset is a semantic textual similarity (STS) benchmark between English and 12 high-resource Indic languages. It was released as part of the [Samanantar](https://arxiv.org/abs/2104.05596) paper; please refer to the paper for more details.
### Languages
The available language pairs are: en-as, en-bn, en-gu, en-hi, en-kn, en-ml, en-mr, en-or, en-pa, en-ta, en-te, en-ur.
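The pair names above follow a regular `en-<code>` pattern, and the card's front matter maps each one to a parquet file under `test/`. A minimal sketch of that naming scheme is shown here; the helper names `config_name` and `data_file` are hypothetical and not part of the dataset or of any library.

```python
# Indic language codes paired with English, taken from the list above.
INDIC_LANGS = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te", "ur"]


def config_name(lang_code: str) -> str:
    """Return the config name for an English-Indic pair, e.g. 'en-hi'.

    Hypothetical helper for illustration only.
    """
    if lang_code not in INDIC_LANGS:
        raise ValueError(f"unsupported language code: {lang_code!r}")
    return f"en-{lang_code}"


def data_file(lang_code: str) -> str:
    """Return the parquet path that the card's front matter lists for a pair."""
    return f"test/{config_name(lang_code)}.parquet"


if __name__ == "__main__":
    for code in INDIC_LANGS:
        print(config_name(code), "->", data_file(code))
```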
### Dataset Structure
#### Dataset Fields
- lang_code: Two-letter ISO 639-1 code of the Indic language.
- source: The source from which the candidate sentence pair was taken.
- english_sentence: The full sentence in English.
- indic_sentence: The full sentence in the corresponding Indic language.
- score: The similarity score, a float in the range 0.0 to 5.0 (inclusive).
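The field descriptions above can be turned into a small sanity check. This is a sketch only: `validate_record` is a hypothetical helper, the type mapping is inferred from the descriptions rather than from a published schema, and the sample record is a made-up toy row, not data from the dataset.

```python
# Expected fields and Python types, inferred from the field list above.
EXPECTED_FIELDS = {
    "lang_code": str,
    "source": str,
    "english_sentence": str,
    "indic_sentence": str,
    "score": float,
}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    score = record.get("score")
    # The card states that scores lie between 0.0 and 5.0 inclusive.
    if isinstance(score, float) and not 0.0 <= score <= 5.0:
        problems.append(f"score out of range: {score}")
    return problems


if __name__ == "__main__":
    # Toy record for illustration; not an actual row from the dataset.
    toy = {
        "lang_code": "hi",
        "source": "example",
        "english_sentence": "Hello.",
        "indic_sentence": "नमस्ते।",
        "score": 4.0,
    }
    print(validate_record(toy))
```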
#### Data Instances
```json
{
"lang_code":"hi",
"source":"CatchNews",
"english_sentence":"\"...this is only an interim measure and as long as we have hopefully control over COVID in a few months or a year\\'s time then I think things will go back to as normal as it can be,\" Kumble said\n",
"indic_sentence":"उन्होंने कहा,\"यह केवल एक अंतरिम उपाय है और जब तक हम कुछ महीनों या एक साल के समय में COVID-19 पर नियंत्रण करते हैं, तब तक मुझे लगता है कि चीजें फिर से सामान्य हो जाएंगी\n",
"score":4.0
}
```
#### Splits
| | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| test | 656 | 957 | 780 | 1268 | 953 | 947 | 779 | 500 | 688 | 1044 | 948 | 500 |
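For quick reference, the per-pair counts in the table can be aggregated in a few lines. The numbers are copied from the table above; the variable names are illustrative.

```python
# Test-set sizes per language pair, copied from the splits table.
TEST_SIZES = {
    "en-as": 656, "en-bn": 957, "en-gu": 780, "en-hi": 1268,
    "en-kn": 953, "en-ml": 947, "en-mr": 779, "en-or": 500,
    "en-pa": 688, "en-ta": 1044, "en-te": 948, "en-ur": 500,
}

total = sum(TEST_SIZES.values())
largest = max(TEST_SIZES, key=TEST_SIZES.get)
smallest_size = min(TEST_SIZES.values())
smallest = sorted(pair for pair, n in TEST_SIZES.items() if n == smallest_size)

print(f"total test pairs: {total}")
print(f"largest split: {largest} ({TEST_SIZES[largest]})")
print(f"smallest splits: {smallest} ({smallest_size})")
```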
### Examples of Use
```python3
from datasets import load_dataset

# Load the English-Hindi test split; other pairs use configs such as "en-ta".
dataset = load_dataset("mteb/indic_sts", name="en-hi", split="test")
```
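Once a split is loaded, a model's predicted similarities can be compared against the gold `score` column. The card does not name an evaluation metric; Spearman rank correlation is a common choice for STS benchmarks, so the sketch below assumes it. The functions are illustrative, written in plain Python to stay dependency-free, and the numbers are toy values rather than dataset rows or real model outputs.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(gold, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(gold), average_ranks(predicted))


if __name__ == "__main__":
    gold_scores = [4.0, 1.0, 3.0, 0.0, 5.0]       # toy gold scores in [0, 5]
    model_scores = [0.80, 0.30, 0.60, 0.10, 0.90]  # toy model similarities
    print(round(spearman(gold_scores, model_scores), 4))
```

Because only the ranking matters, model similarities do not need to be rescaled to the 0.0 to 5.0 range before scoring.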
### Citation
```bibtex
@article{DBLP:journals/tacl/RameshDBJASSDJK22,
author = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Srihari Nagaraj and Deepak Kumar and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
title = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
journal = {Trans. Assoc. Comput. Linguistics},
volume = {10},
pages = {145-162},
year = {2022},
url = {https://doi.org/10.1162/tacl\_a\_00452},
doi = {10.1162/TACL\_A\_00452},
timestamp = {Wed, 29 Jun 2022 16:03:22 +0200},
biburl = {https://dblp.org/rec/journals/tacl/RameshDBJASSDJK22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
Provider: mteb
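### Summary of Original Information
#### Dataset Overview
Basic information:
- Name: Indic STS
- Languages: multilingual (as, bn, en, gu, hi, kn, ml, mr, or, pa, ta, te, ur)
- License: cc0-1.0
- Size: 1K<n<10K
- Task category: text classification
- Task IDs: text scoring, semantic similarity scoring
- Tags: multilingual, semantic textual similarity

Dataset fields:
- lang_code: two-letter ISO language code
- source: source of the candidate sentence
- english_sentence: full sentence in English
- indic_sentence: full sentence in the corresponding Indic language
- score: similarity score, ranging from 0.0 to 5.0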
#### Dataset Splits
| | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| test | 656 | 957 | 780 | 1268 | 953 | 947 | 779 | 500 | 688 | 1044 | 948 | 500 |
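#### Configurations
- Default config: the test data path is `test/*.parquet`.
- Language-specific configs: en-bn, en-hi, en-or, en-ml, en-as, en-pa, en-ta, en-gu, en-kn, en-te, en-ur, en-mr; each reads its test data from the corresponding `.parquet` file under `test/`.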



