indic_sts

Name: indic_sts
Creator: maas
Published: 2025-12-04 16:17:02
License: No description provided

ModelScope community. Updated 2025-12-04; indexed 2024-09-07.

Download link:

https://modelscope.cn/datasets/MTEB/indic_sts

Description:

# Dataset Card for Indic STS

This dataset is a semantic textual similarity (STS) benchmark between English and 12 high-resource Indic languages. It was released as part of the [Samanantar](https://arxiv.org/abs/2104.05596) paper. Please refer to the paper for more details.

### Languages

Available language pairs are: en-as, en-bn, en-gu, en-hi, en-kn, en-ml, en-mr, en-or, en-pa, en-ta, en-te, en-ur

### Dataset Structure

#### Dataset Fields

- lang_code: Two-letter ISO language code of the Indic language.
- source: The source from which the candidate sentence was drawn.
- english_sentence: The full sentence in English.
- indic_sentence: The full sentence in the corresponding Indic language.
- score: The similarity score, a float between 0.0 and 5.0 inclusive.

#### Data Instances

```json
{
  "lang_code":"hi",
  "source":"CatchNews",
  "english_sentence":"\"...this is only an interim measure and as long as we have hopefully control over COVID in a few months or a year\\'s time then I think things will go back to as normal as it can be,\" Kumble said\n",
  "indic_sentence":"उन्होंने कहा,\"यह केवल एक अंतरिम उपाय है और जब तक हम कुछ महीनों या एक साल के समय में COVID-19 पर नियंत्रण करते हैं, तब तक मुझे लगता है कि चीजें फिर से सामान्य हो जाएंगी\n",
  "score":4.0
}
```

#### Splits

|      | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| test | 656   | 957   | 780   | 1268  | 953   | 947   | 779   | 500   | 688   | 1044  | 948   | 500   |

### Examples of Use

```python3
from datasets import load_dataset

dataset = load_dataset("jaygala24/indic_sts", name="en-hi", split="test")
```

### Citation

```bibtex
@article{DBLP:journals/tacl/RameshDBJASSDJK22,
  author    = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and
               Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and
               Sujit Sahoo and Harshita Diddee and Mahalakshmi J and
               Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and
               Srihari Nagaraj and Deepak Kumar and Vivek Raghavan and
               Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  title     = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  journal   = {Trans. Assoc. Comput. Linguistics},
  volume    = {10},
  pages     = {145-162},
  year      = {2022},
  url       = {https://doi.org/10.1162/tacl\_a\_00452},
  doi       = {10.1162/TACL\_A\_00452},
  timestamp = {Wed, 29 Jun 2022 16:03:22 +0200},
  biburl    = {https://dblp.org/rec/journals/tacl/RameshDBJASSDJK22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
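The fields listed in the dataset card suggest a simple sanity check that can be run on each record after loading. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `validate_record` and `normalize_score` are hypothetical, the sample record uses placeholder sentence text, and dividing by 5 to get a 0 to 1 range is an assumption about how one might rescale the score.

```python
from typing import Any, Dict, List

# Field names and types as described in the dataset card.
EXPECTED_FIELDS = {
    "lang_code": str,
    "source": str,
    "english_sentence": str,
    "indic_sentence": str,
    "score": float,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    # The card states that score lies between 0.0 and 5.0 inclusive.
    score = record.get("score")
    if isinstance(score, float) and not 0.0 <= score <= 5.0:
        problems.append(f"score out of range: {score}")
    return problems


def normalize_score(score: float) -> float:
    """Rescale a 0-5 similarity score to the 0-1 range (an assumed convention)."""
    return score / 5.0


# Placeholder record shaped like the card's example; sentence text is elided.
sample = {
    "lang_code": "hi",
    "source": "CatchNews",
    "english_sentence": "placeholder English sentence",
    "indic_sentence": "placeholder Indic sentence",
    "score": 4.0,
}
print(validate_record(sample))
print(normalize_score(sample["score"]))
```

The same check could be applied to each row of the object returned by `load_dataset`, assuming the rows are dictionaries with these keys.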

Provider:

maas

Created:

2024-09-06
