1-800-SHARED-TASKS/telugu-summarization-generation
Hugging Face | Updated 2024-09-26 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/1-800-SHARED-TASKS/telugu-summarization-generation
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "news_articles_dataset.csv"
annotations_creators:
- expert-generated
language:
- te
language_creators:
- expert-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Telugu News Articles
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- newspaper
- 2018-2023
task_categories:
- text-generation
task_ids:
- language-modeling
---
# Summary
`aya-telugu-news-articles` is an open-source dataset of instruct-style records generated by scraping a Telugu news website. It was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI.
This dataset can be used for any purpose, academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.
Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation
Languages: Telugu

Version: 1.0
# Dataset Overview
`aya-telugu-news-articles` is a corpus of more than 467k records generated by scraping a Telugu news website. The dataset supports the following two tasks:
- Given the title/headline of an article, generate the article.
- Given an article, generate its title/headline.
# Intended Uses
As a corpus of instruction prompts, this dataset is immediately useful for instruction fine-tuning of large language models. It also presents an opportunity for synthetic data generation. For example, prompt-completion pairs could be submitted as few-shot examples to a large open language model to generate additional articles and their respective titles.
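The few-shot idea above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_few_shot_prompt`, the separator, and the placeholder records are all assumptions made for the example, and real rows would hold Telugu text in `inputs` and `targets`.

```python
def build_few_shot_prompt(examples, new_input, separator="\n\n"):
    """Join (inputs, targets) pairs into one prompt that ends with an open input."""
    # Each demonstration is the prompt followed by its completion.
    parts = [f"{ex['inputs']}\n{ex['targets']}" for ex in examples]
    # The final block has no completion; the model is expected to continue it.
    parts.append(f"{new_input}\n")
    return separator.join(parts)


# Hypothetical placeholder records standing in for dataset rows.
examples = [
    {"inputs": "PROMPT 1", "targets": "COMPLETION 1"},
    {"inputs": "PROMPT 2", "targets": "COMPLETION 2"},
]
prompt = build_few_shot_prompt(examples, "PROMPT 3")
print(prompt)
```

The resulting string can be passed to whichever text-generation model is available; how many demonstrations fit depends on that model's context length.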
# Dataset
## Load with Datasets
To load this dataset with the Datasets library, install it with `pip install datasets --upgrade`, then run the following code:
```python
from datasets import load_dataset
ds = load_dataset('SuryaKrishna02/aya-telugu-news-articles')
```
## Purpose of Collection
Telugu is a low-resource language for which, to the best of my knowledge, no instruct-style dataset for title and article generation exists. This dataset was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI to make sure Telugu is well represented in the AI/ML space. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.
## Sources
- **Suryaa News Website**: The data was scraped from the [Suryaa Website](https://telugu.suryaa.com/), a well-known news website in the Telugu states. The scraped data was then pre-processed by removing unwanted characters and discarding articles that were too long or too short. Finally, the scraped data was converted into instruct-style prompts and completions.
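The pre-processing described above (removing unwanted characters, then dropping articles that are too long or too short) could look roughly like the sketch below. The source does not state which characters were removed or what the length limits were, so `MIN_CHARS`, `MAX_CHARS`, the control-character pattern, and the function names are hypothetical choices for illustration only.

```python
import re

# Hypothetical thresholds; the actual limits used for the dataset are not documented.
MIN_CHARS = 200
MAX_CHARS = 10_000


def clean_text(text):
    """Drop ASCII control characters and collapse runs of whitespace."""
    # Tab, newline and carriage return are kept here and collapsed in the next step.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_article(text, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Return True when the cleaned article length is within the allowed range."""
    return min_chars <= len(text) <= max_chars


def preprocess(articles):
    """Clean every article, then discard the ones that are too short or too long."""
    cleaned = (clean_text(article) for article in articles)
    return [article for article in cleaned if keep_article(article)]
```

Length is measured in characters after cleaning; a token-based or word-based limit would work the same way with a different `len` computation.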
## Data Fields
- `inputs` : Prompt or input to the language model.
- `targets` : Completion or output of the language model.
- `template_id` : Id of the template used in `inputs` and `targets`.
- `template_lang`: ISO code of the language used in `inputs` and `targets`; *tel* refers to Telugu.
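Because both tasks share one table, `template_id` is the field that tells them apart. The sketch below groups records by that field using plain dictionaries; the helper name `group_by_template`, the placeholder values, and the assumption that `template_id` is stored as an integer are illustrative and not confirmed by the source.

```python
from collections import defaultdict


def group_by_template(records):
    """Group records by their template_id so each task can be handled separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record["template_id"]].append(record)
    return dict(groups)


# Hypothetical placeholder rows with the four documented fields.
records = [
    {"inputs": "PROMPT A", "targets": "ARTICLE A", "template_id": 1, "template_lang": "tel"},
    {"inputs": "PROMPT B", "targets": "TITLE B", "template_id": 2, "template_lang": "tel"},
    {"inputs": "PROMPT C", "targets": "ARTICLE C", "template_id": 1, "template_lang": "tel"},
]
groups = group_by_template(records)
print({template_id: len(rows) for template_id, rows in groups.items()})
```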
## Templates
To create instruct-style prompts and completions from the scraped data, the following two template categories, with two templates in total, were used:
1. Given the title/headline of an article, generate the article.
| template_id | inputs | targets |
|-------------|--------|---------|
| 1 | ```[క్రింది \| కింది \| ఇవ్వబడిన \| ఇచ్చిన] [శీర్షికతో \| టైటిల్ తో \| హెడ్లైన్ తో] [వార్తా కథనాన్ని \| న్యూస్ ఆర్టికల్ ని \| న్యూస్ కథనాన్ని] [వ్రాయండి \| రాయండి]:\n{{Title}}``` | ```{{Article}}```
2. Given an article, generate its title/headline.
| template_id | inputs | targets |
|-------------|--------|---------|
| 2 | ```[క్రింది \| కింది \| ఇవ్వబడిన \| ఇచ్చిన] [వార్తా కథనానికి \| న్యూస్ ఆర్టికల్ కి \| న్యూస్ కథనానికి] [శీర్షికను \| టైటిల్ ను \| హెడ్లైన్ ను] [వ్రాయండి \| ఇవ్వండి \| రాయండి]:\n{{Article}}``` | ```[ఇచ్చిన \| ఇవ్వబడిన] [వార్తా కథనానికి \| న్యూస్ ఆర్టికల్ కి \| న్యూస్ కథనానికి] [సరిపోయే \| తగిన \| అనువైన] [శీర్షిక \| టైటిల్ \| హెడ్లైన్] '{{Title}}'.``` |
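One way to read the templates above is that each bracketed group lists interchangeable phrasings and each `{{...}}` marks a field to fill in. The sketch below renders template 1 under that reading. How the authors actually selected among the alternatives is not documented, so the uniform random choice and the `render` helper are assumptions; the backslash before `|` in the tables is a Markdown escape and is dropped here.

```python
import random
import re

# Template 1 `inputs`, copied from the table above without the Markdown escapes.
TEMPLATE_1_INPUTS = (
    "[క్రింది | కింది | ఇవ్వబడిన | ఇచ్చిన] [శీర్షికతో | టైటిల్ తో | హెడ్లైన్ తో] "
    "[వార్తా కథనాన్ని | న్యూస్ ఆర్టికల్ ని | న్యూస్ కథనాన్ని] [వ్రాయండి | రాయండి]:\n{{Title}}"
)


def render(template, fields, rng=random):
    """Pick one option per [a | b | c] group, then fill the {{Name}} placeholders."""

    def pick(match):
        options = [option.strip() for option in match.group(1).split("|")]
        return rng.choice(options)  # assumption: alternatives are chosen uniformly

    text = re.sub(r"\[([^\]]*)\]", pick, template)
    for name, value in fields.items():
        text = text.replace("{{" + name + "}}", value)
    return text


# A seeded generator keeps the example reproducible; the title is a placeholder.
print(render(TEMPLATE_1_INPUTS, {"Title": "TITLE"}, random.Random(0)))
```

The same function applies to template 2 by passing its `inputs` or `targets` string together with the `Article` or `Title` field.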
## Personal or Sensitive Data
This dataset contains public information. To our knowledge, it contains no personal identifiers of private individuals and no sensitive information.
## Language
Telugu
# Known Limitations
- The dataset was scraped from a news website, so its contents may reflect bias, factual errors, political affiliations, and sensitive matters.
- Although great care was taken to keep the dataset monolingual, some records may contain English alongside Telugu.
# Contributors
[SuryaKrishna02](https://github.com/SuryaKrishna02) and [Desik98](https://github.com/desik1998)
Provider:
1-800-SHARED-TASKS



