ai4bharat/indic_glue

Name: ai4bharat/indic_glue
Creator: ai4bharat
Published: 2024-01-04 12:36:30
License: no description available

Source: Hugging Face. Updated 2024-01-04; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/ai4bharat/indic_glue

Overview:

IndicGLUE is a multilingual dataset covering Indian languages such as Assamese, Bengali, Gujarati, and Hindi, alongside English. It supports text classification, token classification, and multiple-choice tasks; the specific tasks include topic classification, natural language inference, sentiment analysis, semantic similarity scoring, and named entity recognition (NER). The dataset falls in the 100K to 1M size category and is split into multiple configurations, each corresponding to a particular task and language, with its own feature description and size information.
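As a minimal sketch of how a single configuration might be loaded, assuming the Hugging Face `datasets` package is installed and the repository is reachable. The helper names `config_name` and `load_config` are hypothetical, and the `task.language` naming pattern is inferred from configuration names such as `copa.hi`.

```python
def config_name(task: str, language: str) -> str:
    """Build a configuration name such as 'copa.hi' (pattern inferred, not documented)."""
    return f"{task}.{language}"


def load_config(task: str, language: str):
    """Load one configuration; needs network access and the `datasets` package."""
    # Imported lazily so config_name() stays usable without the dependency.
    from datasets import load_dataset

    return load_dataset("ai4bharat/indic_glue", config_name(task, language))


name = config_name("copa", "hi")
```

Calling `load_config("copa", "hi")` would then return the splits for that configuration, provided the download succeeds.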

Provider:

ai4bharat

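Summary of original information

Dataset overview

Basic information

Dataset name: IndicGLUE
Languages: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Odia (or), Punjabi (pa), Tamil (ta), Telugu (te)
License: other
Multilinguality: multilingual
Dataset size: 100K<n<1M
Source datasets: extended from other datasets
Task categories: text classification, token classification, multiple choice
Task IDs: topic classification, natural language inference, sentiment analysis, semantic similarity scoring, named entity recognition, multiple-choice question answering
Tags: discourse mode classification, paraphrase identification, cross-lingual similarity, headline classification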
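The language codes above also appear as suffixes in the configuration names. The following sketch splits a configuration name into its task and language names; the function `describe_config` is hypothetical, and the assumption that language pairs are joined with a hyphen is inferred from names such as `cvit-mkb-clsr.en-bn`.

```python
# Code-to-name table taken from the language list above.
LANGUAGES = {
    "as": "Assamese", "bn": "Bengali", "en": "English", "gu": "Gujarati",
    "hi": "Hindi", "kn": "Kannada", "ml": "Malayalam", "mr": "Marathi",
    "or": "Odia", "pa": "Punjabi", "ta": "Tamil", "te": "Telugu",
}


def describe_config(name: str):
    """Return (task, [language names]) for a name like 'copa.hi' or 'cvit-mkb-clsr.en-bn'."""
    # Split on the last dot: task names may contain hyphens but no dots.
    task, _, suffix = name.rpartition(".")
    # Codes missing from the table are reported as unknown rather than guessed.
    names = [LANGUAGES.get(code, f"unknown ({code})") for code in suffix.split("-")]
    return task, names
```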

Dataset configurations

actsa-sc.te

Features:
- text: string
- label: class label (positive, negative)
Splits:
- train: 4328 examples, 1370907 bytes
- validation: 541 examples, 166089 bytes
- test: 541 examples, 168291 bytes
Download size: 727630 bytes
Dataset size: 1705287 bytes
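The figures for this configuration can be cross-checked with simple arithmetic: the reported dataset size is the sum of the per-split byte counts, and the example counts follow an 80/10/10 split. A short sketch using only the numbers listed above (the variable names are illustrative):

```python
# (examples, bytes) per split, copied from the actsa-sc.te listing.
splits = {
    "train": (4328, 1370907),
    "validation": (541, 166089),
    "test": (541, 168291),
}

total_examples = sum(n for n, _ in splits.values())
total_bytes = sum(b for _, b in splits.values())

# Fraction of all examples that falls in each split.
shares = {name: n / total_examples for name, (n, _) in splits.items()}

print(total_examples)  # 5410
print(total_bytes)     # 1705287, matching the reported dataset size
```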

bbca.hi

Features:
- label: string
- text: string
Splits:
- train: 3467 examples, 22126205 bytes
- test: 866 examples, 5501148 bytes
Download size: 10349015 bytes
Dataset size: 27627353 bytes

copa.en

Features:
- premise: string
- choice1: string
- choice2: string
- question: string
- label: integer
Splits:
- train: 400 examples, 46033 bytes
- validation: 100 examples, 11679 bytes
- test: 500 examples, 55846 bytes
Download size: 79431 bytes
Dataset size: 113558 bytes

copa.gu

Features:
- premise: string
- choice1: string
- choice2: string
- question: string
- label: integer
Splits:
- train: 362 examples, 92097 bytes
- validation: 88 examples, 23450 bytes
- test: 448 examples, 109997 bytes
Download size: 107668 bytes
Dataset size: 225544 bytes

copa.hi

Features:
- premise: string
- choice1: string
- choice2: string
- question: string
- label: integer
Splits:
- train: 362 examples, 93376 bytes
- validation: 88 examples, 23559 bytes
- test: 449 examples, 112830 bytes
Download size: 104233 bytes
Dataset size: 229765 bytes

copa.mr

Features:
- premise: string
- choice1: string
- choice2: string
- question: string
- label: integer
Splits:
- train: 362 examples, 93441 bytes
- validation: 88 examples, 23874 bytes
- test: 449 examples, 112055 bytes
Download size: 105962 bytes
Dataset size: 229370 bytes
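To illustrate how the five copa fields fit together, here is a hedged sketch that turns one record into two candidate sentences. It assumes, as in the original COPA task, that `question` is either "cause" or "effect" and that `label` is the zero-based index of the correct choice; neither detail is stated in the listing above. The sample record is invented for illustration and is not taken from the dataset.

```python
def copa_candidates(record: dict) -> list:
    """Join the premise with each choice using a connector chosen by question type."""
    # Assumption: question is "cause" or "effect" (any other value raises KeyError).
    connector = {"cause": "because", "effect": "so"}[record["question"]]
    return [
        f'{record["premise"]} {connector} {choice}'
        for choice in (record["choice1"], record["choice2"])
    ]


def gold_choice(record: dict) -> str:
    """Return the correct choice, assuming label 0 means choice1 and 1 means choice2."""
    return record["choice1"] if record["label"] == 0 else record["choice2"]


# Invented example record with the same field names as the copa configurations.
sample = {
    "premise": "The ground was wet",
    "choice1": "it had rained",
    "choice2": "the sun was out",
    "question": "cause",
    "label": 0,
}
candidates = copa_candidates(sample)
```

A model could then score each candidate sentence and pick the higher-scoring one, to be compared against `gold_choice(sample)`.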

csqa.as

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 2942 examples, 3800523 bytes
Download size: 1390423 bytes
Dataset size: 3800523 bytes

csqa.bn

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 38845 examples, 54671018 bytes
Download size: 19648180 bytes
Dataset size: 54671018 bytes

csqa.gu

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 22861 examples, 29131607 bytes
Download size: 6027825 bytes
Dataset size: 29131607 bytes

csqa.hi

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 35140 examples, 40409347 bytes
Download size: 14711258 bytes
Dataset size: 40409347 bytes

csqa.kn

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 13666 examples, 21199816 bytes
Download size: 7669655 bytes
Dataset size: 21199816 bytes

csqa.ml

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 26537 examples, 47220836 bytes
Download size: 17382215 bytes
Dataset size: 47220836 bytes

csqa.mr

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 11370 examples, 13667174 bytes
Download size: 5072738 bytes
Dataset size: 13667174 bytes

csqa.or

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 1975 examples, 2562365 bytes
Download size: 948046 bytes
Dataset size: 2562365 bytes

csqa.pa

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 5667 examples, 5806097 bytes
Download size: 2194109 bytes
Dataset size: 5806097 bytes

csqa.ta

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 38590 examples, 61868481 bytes
Download size: 20789467 bytes
Dataset size: 61868481 bytes

csqa.te

Features:
- question: string
- answer: string
- category: string
- title: string
- options: sequence of strings
- out_of_context_options: sequence of strings
Splits:
- test: 41338 examples, 58784997 bytes
Download size: 17447618 bytes
Dataset size: 58784997 bytes
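The csqa configurations are test-only, so a typical use is to score a model's choice among `options` against `answer`. The sketch below assumes that the question contains a mask placeholder (written here as `<MASK>`, which is an assumption) and that `answer` is one of the strings in `options`. The records and the `first_option` baseline are invented for illustration.

```python
def fill_mask(question: str, option: str, mask: str = "<MASK>") -> str:
    """Substitute a candidate into the question; the mask token is an assumption."""
    return question.replace(mask, option)


def accuracy(records, choose) -> float:
    """Fraction of records where choose(question, options) equals the answer."""
    correct = sum(
        1 for r in records if choose(r["question"], r["options"]) == r["answer"]
    )
    return correct / len(records)


def first_option(question, options):
    """Trivial baseline that always picks the first option."""
    return options[0]


# Invented records using the csqa field names (other fields omitted).
toy_records = [
    {"question": "<MASK> is a river.", "options": ["Ganga", "Delhi"], "answer": "Ganga"},
    {"question": "<MASK> is a city.", "options": ["Ganga", "Delhi"], "answer": "Delhi"},
]
baseline = accuracy(toy_records, first_option)
```

Replacing `first_option` with a function that scores `fill_mask(question, option)` for each option under a language model would give a real evaluation loop.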

cvit-mkb-clsr.en-bn

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 5522 examples, 1990957 bytes
Download size: 945551 bytes
Dataset size: 1990957 bytes

cvit-mkb-clsr.en-gu

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 6463 examples, 2303377 bytes
Download size: 1093313 bytes
Dataset size: 2303377 bytes

cvit-mkb-clsr.en-hi

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 5169 examples, 1855989 bytes
Download size: 890609 bytes
Dataset size: 1855989 bytes

cvit-mkb-clsr.en-ml

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 4886 examples, 1990089 bytes
Download size: 868956 bytes
Dataset size: 1990089 bytes

cvit-mkb-clsr.en-mr

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 5760 examples, 2130601 bytes
Download size: 993961 bytes
Dataset size: 2130601 bytes

cvit-mkb-clsr.en-or

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 752 examples, 274873 bytes
Download size: 134334 bytes
Dataset size: 274873 bytes

cvit-mkb-clsr.en-ta

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 5637 examples, 2565178 bytes
Download size: 1091653 bytes
Dataset size: 2565178 bytes

cvit-mkb-clsr.en-te

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 5049 examples, 1771129 bytes
Download size: 840410 bytes
Dataset size: 1771129 bytes

cvit-mkb-clsr.en-ur

Features:
- sentence1: string
- sentence2: string
Splits:
- test: 1006 examples, 288430 bytes
Download size: 166129 bytes
Dataset size: 288430 bytes
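The cvit-mkb-clsr configurations pair two sentences per row, which suits a cross-lingual retrieval evaluation. The sketch below assumes that `sentence1` and `sentence2` in the same row are translations of each other and that sentence embeddings come from some external encoder (not shown). It uses plain Python lists with tiny invented vectors; the function names are hypothetical.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_accuracy(src_vecs, tgt_vecs) -> float:
    """Share of source sentences whose nearest target has the same row index."""
    hits = 0
    for i, src in enumerate(src_vecs):
        # Index of the target vector most similar to this source vector.
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(src, tgt_vecs[j]))
        hits += best == i
    return hits / len(src_vecs)


# Invented 2-d "embeddings" standing in for encoder output.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8]]
score = retrieval_accuracy(src, tgt)
```

This brute-force search is quadratic in the number of pairs, which is manageable for test sets of a few thousand rows like those listed above.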

iitp-mr.hi

Features:
- text: string
- label: class label (negative, neutral, positive)
Splits:
- train: 2480 examples, 6704905 bytes
- validation: 310 examples, 822218 bytes
- test: 310 examples, 702373 bytes
Download size: 3151762 bytes
Dataset size: 8229496 bytes

iitp-pr.hi

Features:
- text: string
- label: class label (negative, neutral, positive)
Splits:
- train: 4182 examples, 945589 bytes
- validation: 523 examples, 120100 bytes
- test: 523 examples, 121910 bytes
Download size: 509822 bytes
Dataset size: 1187599 bytes
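Class labels are usually stored as integers, so a small lookup helps when inspecting the two iitp sentiment configurations. This sketch assumes the integer ids follow the order in which the label names are listed (negative = 0, neutral = 1, positive = 2), which the listing does not confirm. The helper names and the sample ids are illustrative.

```python
from collections import Counter

# Assumed id order: the order in which the names appear in the feature description.
SENTIMENT_NAMES = ["negative", "neutral", "positive"]


def label_name(label_id: int) -> str:
    """Map an integer class id to its name under the assumed ordering."""
    return SENTIMENT_NAMES[label_id]


def label_distribution(label_ids) -> dict:
    """Count how often each sentiment appears, including zero counts."""
    counts = Counter(label_name(i) for i in label_ids)
    return {name: counts.get(name, 0) for name in SENTIMENT_NAMES}


# Invented ids, not taken from the dataset.
distribution = label_distribution([0, 2, 2, 1])
```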

inltkh.gu

Features:
- text: string
- label: class label (entertainment, business, tech, sports, state, spirituality, tamil-cinema, positive, negative, neutral)
Splits:
- train: 5269 examples, 883063 bytes
- validation: 659 examples, 111201 bytes
- test: 659 examples, 110757 bytes
Download size: 5150
