five

ai4bharat/IndicSentenceSummarization

收藏
Hugging Face2022-10-13 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/ai4bharat/IndicSentenceSummarization
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - no-annotation language_creators: - found language: - as - bn - gu - hi - kn - ml - mr - or - pa - ta - te license: - cc-by-nc-4.0 multilinguality: - multilingual pretty_name: IndicSentenceSummarization size_categories: - 5K<n<112K source_datasets: - original for Hindi, and modified [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) for other languages. task_categories: - conditional-text-generation task_ids: - conditional-text-generation-other-sentence-summarization --- # Dataset Card for "IndicSentenceSummarization" ## Table of Contents - [Dataset Card Creation Guide](#dataset-card-creation-guide) - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization) - [Who are the source language producers?](#who-are-the-source-language-producers) - [Annotations](#annotations) - [Annotation process](#annotation-process) - [Who are the annotators?](#who-are-the-annotators) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://indicnlp.ai4bharat.org/indicnlg-suite - **Paper:** [IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages](https://arxiv.org/abs/2203.05437) - **Point of Contact:** ### Dataset Summary IndicSentenceSummarization is the sentence summarization dataset released as part of IndicNLG Suite. Each input sentence is paired with an output as summary. We create this dataset in eleven languages including as, bn, gu, hi, kn, ml, mr, or, pa, ta, te. The total size of the dataset is 431K. ### Supported Tasks and Leaderboards **Tasks:** Sentence Summarization **Leaderboards:** Currently there is no Leaderboard for this dataset. ### Languages - `Assamese (as)` - `Bengali (bn)` - `Gujarati (gu)` - `Kannada (kn)` - `Hindi (hi)` - `Malayalam (ml)` - `Marathi (mr)` - `Oriya (or)` - `Punjabi (pa)` - `Tamil (ta)` - `Telugu (te)` ## Dataset Structure ### Data Instances One random example from the `hi` dataset is given below in JSON format. ``` {'id': '5', 'input': 'जम्मू एवं कश्मीर के अनंतनाग जिले में शनिवार को सुरक्षाबलों के साथ मुठभेड़ में दो आतंकवादियों को मार गिराया गया।', 'target': 'जम्मू-कश्मीर : सुरक्षाबलों के साथ मुठभेड़ में 2 आतंकवादी ढेर', 'url': 'https://www.indiatv.in/india/national-jammu-kashmir-two-millitant-killed-in-encounter-with-security-forces-574529' } ``` ### Data Fields - `id (string)`: Unique identifier. - `input (string)`: Input sentence. - `target (strings)`: Output summary. - `url (string)`: Source web link of the sentence. ### Data Splits Here is the number of samples in each split for all the languages. Language | ISO 639-1 Code | Train | Dev | Test | ---------- | ---------- | ---------- | ---------- | ---------- | Assamese | as | 10,812 | 5,232 | 5,452 | Bengali | bn | 17,035 | 2,355 | 2,384 | Gujarati | gu | 54,788 | 8,720 | 8,460 | Hindi | hi | 78,876 | 16,935 | 16,835 | Kannada | kn | 61,220 | 9,024 | 1,485 | Malayalam | ml | 2,855 | 1,520 | 1,580 | Marathi | mr | 27,066 | 3,249 | 3,309 | Oriya | or | 12,065 | 1,539 | 1,440 | Punjabi | pa | 31,630 | 4,004 | 3,967 | Tamil | ta | 23,098 | 2,874 | 2,948 | Telugu | te | 7,119 | 878 | 862 | ## Dataset Creation ### Curation Rationale [Detailed in the paper](https://arxiv.org/abs/2203.05437) ### Source Data It is a modified subset of [IndicHeadlineGeneration](https://huggingface.co/datasets/ai4bharat/IndicHeadlineGeneration) dataset. #### Initial Data Collection and Normalization [Detailed in the paper](https://arxiv.org/abs/2203.05437) #### Who are the source language producers? [Detailed in the paper](https://arxiv.org/abs/2203.05437) ### Annotations [More information needed] #### Annotation process [More information needed] #### Who are the annotators? [More information needed] ### Personal and Sensitive Information [More information needed] ## Considerations for Using the Data ### Social Impact of Dataset [More information needed] ### Discussion of Biases [More information needed] ### Other Known Limitations [More information needed] ## Additional Information ### Dataset Curators [More information needed] ### Licensing Information Contents of this repository are restricted to only non-commercial research purposes under the [Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/). Copyright of the dataset contents belongs to the original copyright holders. ### Citation Information If you use any of the datasets, models or code modules, please cite the following paper: ``` @inproceedings{Kumar2022IndicNLGSM, title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages}, author={Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar}, year={2022}, url = "https://arxiv.org/abs/2203.05437", ``` ### Contributions [Detailed in the paper](https://arxiv.org/abs/2203.05437)
提供机构:
ai4bharat
原始信息汇总

数据集概述

数据集名称

  • 名称: IndicSentenceSummarization
  • 描述: 这是一个多语言句子摘要数据集,包含11种语言的句子及其对应的摘要。

语言支持

  • 支持的语言:
    • Assamese (as)
    • Bengali (bn)
    • Gujarati (gu)
    • Kannada (kn)
    • Hindi (hi)
    • Malayalam (ml)
    • Marathi (mr)
    • Oriya (or)
    • Punjabi (pa)
    • Tamil (ta)
    • Telugu (te)

数据集大小

  • 总大小: 431K

数据集结构

  • 数据实例: 每个实例包含一个唯一的ID、输入句子、输出摘要和源网页链接。
  • 数据字段:
    • id (string): 唯一标识符。
    • input (string): 输入句子。
    • target (string): 输出摘要。
    • url (string): 源网页链接。

数据分割

  • 分割详情:
    • 每个语言的数据集被分为训练集、验证集和测试集。
    • 例如,Hindi (hi) 语言的数据集包含78,876个训练样本,16,935个验证样本和16,835个测试样本。

许可信息

  • 许可: Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)

引用信息

  • 引用文献:

    @inproceedings{Kumar2022IndicNLGSM, title={IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages}, author={Aman Kumar and others}, year={2022}, url = "https://arxiv.org/abs/2203.05437" }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作