mteb/indic_sts

Hugging Face · Updated 2024-05-07 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/mteb/indic_sts
Resource overview:
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
- found
- machine-generated
language:
- as
- bn
- en
- gu
- hi
- kn
- ml
- mr
- or
- pa
- ta
- te
- ur
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
- 1K<n<10K
task_categories:
- text-classification
task_ids:
- text-scoring
- semantic-similarity-scoring
pretty_name: Indic STS
configs:
- config_name: default
  data_files:
  - path: test/*.parquet
    split: test
- config_name: en-bn
  data_files:
  - path: test/en-bn.parquet
    split: test
- config_name: en-hi
  data_files:
  - path: test/en-hi.parquet
    split: test
- config_name: en-or
  data_files:
  - path: test/en-or.parquet
    split: test
- config_name: en-ml
  data_files:
  - path: test/en-ml.parquet
    split: test
- config_name: en-as
  data_files:
  - path: test/en-as.parquet
    split: test
- config_name: en-pa
  data_files:
  - path: test/en-pa.parquet
    split: test
- config_name: en-ta
  data_files:
  - path: test/en-ta.parquet
    split: test
- config_name: en-gu
  data_files:
  - path: test/en-gu.parquet
    split: test
- config_name: en-kn
  data_files:
  - path: test/en-kn.parquet
    split: test
- config_name: en-te
  data_files:
  - path: test/en-te.parquet
    split: test
- config_name: en-ur
  data_files:
  - path: test/en-ur.parquet
    split: test
- config_name: en-mr
  data_files:
  - path: test/en-mr.parquet
    split: test
tags:
- multilingual
- semantic-textual-similarity
---

# Dataset Card for Indic STS

This dataset is a semantic textual similarity (STS) benchmark between English and 12 high-resource Indic languages. It was released as part of the [Samanantar](https://arxiv.org/abs/2104.05596) paper. Please refer to the paper for more details.

### Languages

The available language pairs are: en-as, en-bn, en-gu, en-hi, en-kn, en-ml, en-mr, en-or, en-pa, en-ta, en-te, en-ur

### Dataset Structure

#### Dataset Fields

- lang_code: 2-letter ISO language code
- source: The source from which the candidate sentence was taken.
- english_sentence: The full sentence in English.
- indic_sentence: The full sentence in the corresponding Indic language.
- score: The similarity score, a float between 0.0 and 5.0 inclusive.

#### Data Instances

```json
{
  "lang_code":"hi",
  "source":"CatchNews",
  "english_sentence":"\"...this is only an interim measure and as long as we have hopefully control over COVID in a few months or a year\\'s time then I think things will go back to as normal as it can be,\" Kumble said\n",
  "indic_sentence":"उन्होंने कहा,\"यह केवल एक अंतरिम उपाय है और जब तक हम कुछ महीनों या एक साल के समय में COVID-19 पर नियंत्रण करते हैं, तब तक मुझे लगता है कि चीजें फिर से सामान्य हो जाएंगी\n",
  "score":4.0
}
```

#### Splits

|      | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| test | 656   | 957   | 780   | 1268  | 953   | 947   | 779   | 500   | 688   | 1044  | 948   | 500   |

### Examples of Use

```python3
from datasets import load_dataset

# Repository ID as given in the original card; this copy is hosted at mteb/indic_sts.
dataset = load_dataset("jaygala24/indic_sts", name="en-hi", split="test")
```

### Citation

```bibtex
@article{DBLP:journals/tacl/RameshDBJASSDJK22,
  author    = {Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and
               Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and
               Sujit Sahoo and Harshita Diddee and Mahalakshmi J and
               Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and
               Srihari Nagaraj and Deepak Kumar and Vivek Raghavan and
               Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
  title     = {Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
  journal   = {Trans. Assoc. Comput. Linguistics},
  volume    = {10},
  pages     = {145-162},
  year      = {2022},
  url       = {https://doi.org/10.1162/tacl\_a\_00452},
  doi       = {10.1162/TACL\_A\_00452},
  timestamp = {Wed, 29 Jun 2022 16:03:22 +0200},
  biburl    = {https://dblp.org/rec/journals/tacl/RameshDBJASSDJK22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
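The card does not say how the `score` field should be used for evaluation. A common convention for STS data, assumed here and not taken from the card, is to compute the Spearman rank correlation between a model's cosine similarities and the gold scores. The sketch below uses only the standard library; the toy vectors and gold scores are made up for illustration and are not drawn from the dataset.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation (0.0 if either input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy stand-ins for (english_sentence, indic_sentence) embeddings and gold scores.
toy_pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [1.0, 1.0])]
toy_gold = [4.5, 0.5, 2.5]
toy_pred = [cosine(u, v) for u, v in toy_pairs]
print(round(spearman(toy_pred, toy_gold), 3))
```

In practice the vectors would come from whatever embedding model is being evaluated, and the gold values from the `score` column of one language-pair config.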
Provider:
mteb
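Summary of original information

Dataset overview

Basic information

  • Name: Indic STS
  • Languages: multilingual, covering as, bn, en, gu, hi, kn, ml, mr, or, pa, ta, te, ur
  • License: cc0-1.0
  • Size: 1K<n<10K
  • Task category: text classification
  • Task IDs: text scoring, semantic similarity scoring
  • Tags: multilingual, semantic textual similarity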

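Dataset structure

  • Fields:
    • lang_code: 2-letter ISO language code
    • source: the source from which the candidate sentence was taken
    • english_sentence: the full sentence in English
    • indic_sentence: the full sentence in the corresponding Indic language
    • score: similarity score, a float from 0.0 to 5.0 inclusive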
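The field list above can be expressed as a small typed record with range checks. This is an illustrative sketch only: the class name `IndicStsRecord` is hypothetical, and treating `lang_code` as the Indic-side code of the pair is an assumption based on the single `"hi"` example on this page.

```python
from dataclasses import dataclass

# Indic-side language codes listed for this dataset.
INDIC_CODES = frozenset(
    {"as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te", "ur"}
)


@dataclass(frozen=True)
class IndicStsRecord:
    """One row of the dataset (hypothetical helper, not part of any library)."""

    lang_code: str
    source: str
    english_sentence: str
    indic_sentence: str
    score: float

    def __post_init__(self) -> None:
        # Assumption: lang_code names the Indic language of the pair.
        if self.lang_code not in INDIC_CODES:
            raise ValueError(f"unexpected lang_code: {self.lang_code!r}")
        # The card states that scores lie between 0.0 and 5.0 inclusive.
        if not 0.0 <= self.score <= 5.0:
            raise ValueError(f"score out of range [0.0, 5.0]: {self.score}")


record = IndicStsRecord(
    lang_code="hi",
    source="CatchNews",
    english_sentence="placeholder English sentence",
    indic_sentence="placeholder Indic sentence",
    score=4.0,
)
```

A frozen dataclass is used so that a validated row cannot later be mutated into an invalid state.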

Data instance

{
  "lang_code":"hi",
  "source":"CatchNews",
  "english_sentence":"\"...this is only an interim measure and as long as we have hopefully control over COVID in a few months or a year\\'s time then I think things will go back to as normal as it can be,\" Kumble said\n",
  "indic_sentence":"उन्होंने कहा,\"यह केवल एक अंतरिम उपाय है और जब तक हम कुछ महीनों या एक साल के समय में COVID-19 पर नियंत्रण करते हैं, तब तक मुझे लगता है कि चीजें फिर से सामान्य हो जाएंगी\n",
  "score":4.0
}

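Dataset splits

|      | en-as | en-bn | en-gu | en-hi | en-kn | en-ml | en-mr | en-or | en-pa | en-ta | en-te | en-ur |
|------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| test | 656   | 957   | 780   | 1268  | 953   | 947   | 779   | 500   | 688   | 1044  | 948   | 500   |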
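The per-pair counts above can be tallied with a few lines of arithmetic. The sketch below only sums the numbers listed on this page. The total comes to 10,020 rows, slightly above the declared `1K<n<10K` size category, so that category is best read as approximate.

```python
# Test-split sizes per language pair, copied from the table above.
TEST_SPLIT_SIZES = {
    "en-as": 656, "en-bn": 957, "en-gu": 780, "en-hi": 1268,
    "en-kn": 953, "en-ml": 947, "en-mr": 779, "en-or": 500,
    "en-pa": 688, "en-ta": 1044, "en-te": 948, "en-ur": 500,
}

total_rows = sum(TEST_SPLIT_SIZES.values())
largest_pair = max(TEST_SPLIT_SIZES, key=TEST_SPLIT_SIZES.get)
smallest_size = min(TEST_SPLIT_SIZES.values())
smallest_pairs = sorted(p for p, n in TEST_SPLIT_SIZES.items() if n == smallest_size)

print(total_rows)      # 10020
print(largest_pair)    # en-hi
print(smallest_pairs)  # ['en-or', 'en-ur']
```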

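Configurations

  • Default configuration: the test data path is test/*.parquet
  • Language-specific configurations: en-bn, en-hi, en-or, en-ml, en-as, en-pa, en-ta, en-gu, en-kn, en-te, en-ur, en-mr; each reads its test data from the matching file, test/<config_name>.parquet.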
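The mapping from configuration name to data file described above can be sketched as a small lookup. The helper name `data_files_for` is hypothetical and exists only for illustration. It mirrors the path pattern listed on this page and does not touch the network.

```python
# Language-pair configurations listed for this dataset.
LANGUAGE_CONFIGS = (
    "en-bn", "en-hi", "en-or", "en-ml", "en-as", "en-pa",
    "en-ta", "en-gu", "en-kn", "en-te", "en-ur", "en-mr",
)


def data_files_for(config_name: str = "default") -> str:
    """Return the test-split parquet path (or glob) for a configuration name."""
    if config_name == "default":
        # The default configuration globs every per-pair file.
        return "test/*.parquet"
    if config_name not in LANGUAGE_CONFIGS:
        raise ValueError(f"unknown configuration: {config_name!r}")
    return f"test/{config_name}.parquet"


print(data_files_for("en-hi"))  # test/en-hi.parquet
print(data_files_for())         # test/*.parquet
```

With the `datasets` library installed, the same configuration name would presumably be passed as `name=` to `load_dataset("mteb/indic_sts", name="en-hi", split="test")`. That call needs network access and is not exercised here.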