
1-800-SHARED-TASKS/telugu-summarization-generation

Hugging Face, updated 2024-09-26, indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/1-800-SHARED-TASKS/telugu-summarization-generation
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "news_articles_dataset.csv"
annotations_creators:
- expert-generated
language:
- te
language_creators:
- expert-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Telugu News Articles
size_categories:
- 100K<n<1M
source_datasets:
- original
tags:
- newspaper
- 2018-2023
task_categories:
- text-generation
task_ids:
- language-modeling
---

# Summary

`aya-telugu-news-articles` is an open source dataset of instruct-style records generated by web scraping a Telugu news website. It was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.

Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: Telugu

Version: 1.0

# Dataset Overview

`aya-telugu-news-articles` is a corpus of more than 467k records generated by web scraping a Telugu news website. The dataset can be used for the following two tasks:
- Given the title/headline of an article, generate the article with that title/headline.
- Given an article, generate the title/headline for the article.

# Intended Uses

While immediately valuable as a corpus of instruction prompts for instruction fine-tuning large language models, this dataset also presents an opportunity for synthetic data generation. For example, prompt-completion pairs could be submitted as few-shot examples to a large open language model to generate additional articles and their respective titles.
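The few-shot idea above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each record is a dict with string fields `inputs` and `targets` (as listed under Data Fields), and the helper name `build_few_shot_prompt`, the separator, and the English stand-in records are all hypothetical.

```python
def build_few_shot_prompt(examples, query, separator="\n\n"):
    """Join (inputs, targets) pairs into one prompt that ends with an open query."""
    # Each demonstration is the prompt followed by its completion.
    blocks = [f"{ex['inputs']}\n{ex['targets']}" for ex in examples]
    # The final block is the new prompt the model is asked to complete.
    blocks.append(query)
    return separator.join(blocks)


# Toy English stand-ins; real records would hold Telugu text.
examples = [
    {"inputs": "Write a news article with the given title:\nTitle A",
     "targets": "Article A"},
    {"inputs": "Write a news article with the given title:\nTitle B",
     "targets": "Article B"},
]
prompt = build_few_shot_prompt(
    examples, "Write a news article with the given title:\nTitle C"
)
print(prompt)
```

The resulting string can be sent to whichever language model is being used for generation; how many demonstrations to include depends on that model's context length.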
# Dataset

## Load with Datasets

To load this dataset with Datasets, install the library with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

ds = load_dataset('SuryaKrishna02/aya-telugu-news-articles')
```

## Purpose of Collection

Telugu is a low-resource language for which, to the best of my knowledge, no instruct-style dataset for title and article generation exists. This dataset was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI to make sure Telugu is well represented in the AI/ML space. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

## Sources

- **Suryaa news articles website**: The data was scraped from the [Suryaa website](https://telugu.suryaa.com/), a well-known news website in the Telugu states. The scraped data was then pre-processed by removing unwanted characters and dropping articles that were too long or too short. Finally, the cleaned data was converted into instruct-style prompts and completions.

## Data Fields

- `inputs`: Prompt or input to the language model.
- `targets`: Completion or output of the language model.
- `template_id`: Id of the template used in `inputs` and `targets`.
- `template_lang`: ISO code of the language used in `inputs` and `targets`, where *tel* refers to Telugu.

## Templates

To create instruct-style prompts and completions from the scraped data, the following two template categories, with one template each, were used:

1. Given the title/headline of an article, generate the article with that title/headline.

| template_id | inputs | targets |
|-------------|--------|---------|
| 1 | ```[క్రింది \| కింది \| ఇవ్వబడిన \| ఇచ్చిన] [శీర్షికతో \| టైటిల్ తో \| హెడ్లైన్ తో] [వార్తా కథనాన్ని \| న్యూస్ ఆర్టికల్ ని \| న్యూస్ కథనాన్ని] [వ్రాయండి \| రాయండి]:\n{{Title}}``` | ```{{Article}}``` |

2. Given an article, generate the title/headline for the article.

| template_id | inputs | targets |
|-------------|--------|---------|
| 2 | ```[క్రింది \| కింది \| ఇవ్వబడిన \| ఇచ్చిన] [వార్తా కథనానికి \| న్యూస్ ఆర్టికల్ కి \| న్యూస్ కథనానికి] [శీర్షికను \| టైటిల్ ను \| హెడ్లైన్ ను] [వ్రాయండి \| ఇవ్వండి \| రాయండి]:\n{{Article}}``` | ```[ఇచ్చిన \| ఇవ్వబడిన] [వార్తా కథనానికి \| న్యూస్ ఆర్టికల్ కి \| న్యూస్ కథనానికి] [సరిపోయే \| తగిన \| అనువైన] [శీర్షిక \| టైటిల్ \| హెడ్లైన్] '{{Title}}'.``` |

## Personal or Sensitive Data

This dataset contains public information. To our knowledge, it contains no personal identifiers of private individuals and no sensitive information.

## Language

Telugu

# Known Limitations

- The dataset is scraped from a news website, so its contents may reflect bias, factual errors, political affiliations, and sensitive matters.
- Although utmost care was taken to keep the dataset monolingual, some records may contain English alongside Telugu.

# Contributors

[SuryaKrishna02](https://github.com/SuryaKrishna02) and [Desik98](https://github.com/desik1998)
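As a reading aid for the bracket notation in the Templates tables, here is a minimal sketch of how template 1 might be expanded into a record. It assumes that one alternative is picked from each bracketed group and that the picks are joined with single spaces; the card does not state how the authors selected alternatives, so the random choice, the function name `render_template_1`, and the toy title and article are hypothetical.

```python
import math
import random

# Alternatives transcribed from template 1 in the Templates section.
TEMPLATE_1_SLOTS = [
    ["క్రింది", "కింది", "ఇవ్వబడిన", "ఇచ్చిన"],
    ["శీర్షికతో", "టైటిల్ తో", "హెడ్లైన్ తో"],
    ["వార్తా కథనాన్ని", "న్యూస్ ఆర్టికల్ ని", "న్యూస్ కథనాన్ని"],
    ["వ్రాయండి", "రాయండి"],
]


def render_template_1(title, article, rng):
    """Build one record by choosing a single alternative per bracketed group."""
    # Assumption: alternatives are sampled uniformly and independently.
    phrase = " ".join(rng.choice(slot) for slot in TEMPLATE_1_SLOTS)
    return {
        "inputs": f"{phrase}:\n{title}",
        "targets": article,
        "template_id": 1,
        "template_lang": "tel",
    }


# Number of distinct phrasings template 1 can produce: 4 * 3 * 3 * 2.
n_variants = math.prod(len(slot) for slot in TEMPLATE_1_SLOTS)
print(n_variants)  # 72

rng = random.Random(0)  # seeded only so repeated runs pick the same phrasing
record = render_template_1("Example title", "Example article body", rng)
print(record["inputs"])
```

Template 2 follows the same pattern, with the article in `inputs` and a bracketed sentence wrapping the title in `targets`.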
Provided by:
1-800-SHARED-TASKS