SuryaKrishna02/aya-telugu-paraphrase
Source: Hugging Face; updated 2024-01-23; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/SuryaKrishna02/aya-telugu-paraphrase
Resource description:
---
annotations_creators:
- expert-generated
language:
- te
language_creators:
- expert-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Telugu Paraphrase
size_categories:
- n<1K
source_datasets:
- extended|ai4bharat/IndicXParaphrase
tags:
- paraphrase
task_categories:
- text-generation
task_ids:
- language-modeling
---
# Summary
`aya-telugu-paraphrase` is an open-source dataset of instruct-style records generated from the Telugu split of the [ai4bharat/IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase/viewer/te/test) dataset. It was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI.
This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.
Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation
Languages: Telugu
Version: 1.0
# Dataset Overview
`aya-telugu-paraphrase` is a corpus of more than 1.5k records generated by converting the Telugu split of the [ai4bharat/IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase/viewer/te/test) dataset into instruct-style format. The dataset can be used for the following task:
- Given a sentence, generate a sentence with similar meaning.
# Intended Uses
As a corpus of instruction prompts, this dataset is immediately useful for instruction fine-tuning of large language models. It can also be used for synthetic data generation. For example, prompt-completion pairs could be submitted as few-shot examples to a large open language model to generate new sentences and their corresponding paraphrases.
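The few-shot idea can be sketched as follows. This is a minimal illustration, not part of the original card: the function name `build_few_shot_prompt`, the separator, and the placeholder records are all assumptions. The only thing taken from the card is that each record has `inputs` and `targets` fields.

```python
def build_few_shot_prompt(examples, separator="\n\n"):
    """Join prompt-completion pairs into one few-shot prompt.

    `examples` is a list of dicts with "inputs" and "targets" keys, as in
    the dataset records. The prompt ends with the separator so that a
    language model can continue with a new pair (hypothetical usage).
    """
    shots = [f"{ex['inputs']}\n{ex['targets']}" for ex in examples]
    return separator.join(shots) + separator


# Placeholder records for illustration only; real records are in Telugu.
demo_examples = [
    {"inputs": "prompt 1", "targets": "completion 1"},
    {"inputs": "prompt 2", "targets": "completion 2"},
]
few_shot_prompt = build_few_shot_prompt(demo_examples)
```

The resulting string would be sent to whichever model is used for generation; the model call itself is left out because the card does not name one.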
# Dataset
## Load with Datasets
To load this dataset with the Hugging Face `datasets` library, install it with `pip install datasets --upgrade` and then run the following code:
```python
from datasets import load_dataset
ds = load_dataset('SuryaKrishna02/aya-telugu-paraphrase')
```
## Purpose of Collection
Telugu is a low-resource language for which, to the best of my knowledge, no instruct-style paraphrase-generation dataset exists. This dataset was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI to make sure Telugu is well represented in AI/ML. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.
## Sources
- **[ai4bharat/IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase/viewer/te/test)**: Converted this dataset into Instruct-style prompts and completions.
## Data Fields
- `inputs` : Prompt or input to the language model.
- `targets` : Completion or output of the language model.
- `template_id` : Id of the template used in `inputs` and `targets`.
- `template_lang`: ISO code of the language used in the `inputs` and `targets` where *tel* refers to Telugu.
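As a rough sketch of what a record with these four fields looks like, the snippet below defines an illustrative record and a small helper that reports missing fields. The helper name `missing_fields` and the example values are assumptions made for illustration. In particular, storing `template_id` as an integer and `template_lang` as a plain string is a guess, not something the card states.

```python
# Field names as listed in the card.
EXPECTED_FIELDS = ("inputs", "targets", "template_id", "template_lang")


def missing_fields(record):
    """Return the expected field names that are absent from `record`."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Illustrative record, not taken from the dataset. The value types
# (int for template_id, str for template_lang) are assumptions.
example_record = {
    "inputs": "<prompt text in Telugu>",
    "targets": "<paraphrased sentence in Telugu>",
    "template_id": 1,
    "template_lang": "tel",
}

problems = missing_fields(example_record)  # empty list when all fields exist
```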
## Templates
To create instruct-style prompts and completions from the original dataset, one template category with six variations was used:
1. Given a sentence, generate a sentence with similar meaning.
| template_id | inputs | targets |
|-------------|--------|---------|
| 1 | ```ఈ క్రింది వాక్యం మరోరీతిలో రాయి:\n{{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 2 | ```ఈ వాక్యం మరోరీతిలో రాయి: {{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 3 | ```ఈ క్రింది వాక్యం ఇంకొలాగా రాయి:\n{{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 4 | ```ఈ వాక్యం ఇంకొలాగా రాయి: {{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 5 | ```ఈ క్రింది వాక్యం మరోరకంగా రాయి:\n{{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 6 | ```ఈ వాక్యం మరోరకంగా రాయి: {{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
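The table above can be read as a small rendering function. The sketch below is one possible way to apply the six templates to a sentence pair; it is not the authors' conversion script. The names `TEMPLATES` and `build_record` are hypothetical, `{original}` stands in for the `{{Original Sentence}}` placeholder, and the `\n` shown in templates 1, 3, and 5 is assumed to be a real newline.

```python
# Prompt templates copied from the table; "{original}" replaces the
# {{Original Sentence}} placeholder. Assumption: "\n" is a newline.
TEMPLATES = {
    1: "ఈ క్రింది వాక్యం మరోరీతిలో రాయి:\n{original}",
    2: "ఈ వాక్యం మరోరీతిలో రాయి: {original}",
    3: "ఈ క్రింది వాక్యం ఇంకొలాగా రాయి:\n{original}",
    4: "ఈ వాక్యం ఇంకొలాగా రాయి: {original}",
    5: "ఈ క్రింది వాక్యం మరోరకంగా రాయి:\n{original}",
    6: "ఈ వాక్యం మరోరకంగా రాయి: {original}",
}


def build_record(original, paraphrase, template_id):
    """Render one (original, paraphrase) pair into the four card fields."""
    prompt = TEMPLATES[template_id].format(original=original)
    return {
        "inputs": prompt,
        "targets": paraphrase,  # the completion is the paraphrase itself
        "template_id": template_id,
        "template_lang": "tel",  # assumed to be stored as a plain string
    }


# Placeholder sentences for illustration only.
record = build_record("<original sentence>", "<paraphrased sentence>", 2)
```

How the authors chose which template to apply to each source pair is not described in the card, so the `template_id` here is passed in explicitly.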
## Personal or Sensitive Data
This dataset contains public information. To our knowledge, it contains no personal identifiers of private individuals and no sensitive information.
## Language
Telugu
# Known Limitations
- The dataset is converted from an existing dataset, so its contents may reflect the biases, factual errors, and sensitive matters present in that source.
- Although utmost care was taken to keep the dataset monolingual, some records may contain English along with Telugu.
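The mixed-language limitation can be checked with a rough heuristic: flag any record whose prompt or completion contains basic Latin letters. This is only a sketch under that assumption, and the helper names are hypothetical. It will also flag legitimate uses such as acronyms or product names, so flagged records would need manual review.

```python
import re

# Basic Latin letters; Telugu script lives in the U+0C00-U+0C7F block,
# so any ASCII letter is treated here as a sign of non-Telugu text.
LATIN_LETTER = re.compile(r"[A-Za-z]")


def contains_latin(text):
    """Return True if `text` contains at least one ASCII letter."""
    return bool(LATIN_LETTER.search(text))


def mixed_language_records(records):
    """Return records whose inputs or targets contain ASCII letters."""
    return [
        r for r in records
        if contains_latin(r["inputs"]) or contains_latin(r["targets"])
    ]


# Illustrative records, not taken from the dataset.
sample = [
    {"inputs": "ఇది ఒక వాక్యం", "targets": "ఇది ఒక వాక్యం"},
    {"inputs": "ఇది AI వాక్యం", "targets": "ఇది ఒక వాక్యం"},
]
flagged = mixed_language_records(sample)
```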
# Contributors
[SuryaKrishna02](https://github.com/SuryaKrishna02) and [Desik98](https://github.com/desik1998)
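Provider:
SuryaKrishna02
Summary of original information
Dataset overview
Basic information
- Dataset name: aya-telugu-paraphrase
- Language: Telugu
- Dataset size: fewer than 1K records (as listed in the `size_categories` metadata)
- License: Apache 2.0
- Multilinguality: monolingual
- Tags: paraphrase
- Task category: text generation
- Task ID: language modeling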
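Dataset source
- Source dataset: ai4bharat/IndicXParaphrase
- Creators: expert-generated
Dataset uses
- Supported tasks:
  - Training large language models (LLMs)
  - Synthetic data generation
  - Data augmentation
- Specific task: given a sentence, generate a sentence with similar meaning.
Dataset structure
- Data fields:
  - `inputs`: prompt or input to the language model
  - `targets`: completion or output of the language model
  - `template_id`: ID of the template used in `inputs` and `targets`
  - `template_lang`: ISO code of the language used in `inputs` and `targets`, where `tel` refers to Telugu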
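Templates
- Template category: given a sentence, generate a sentence with similar meaning
- Template IDs: 1-6
- Example:
  - `inputs`: ఈ క్రింది వాక్యం మరోరీతిలో రాయి: {{Original Sentence}}
  - `targets`: {{Paraphrased Sentence}}
Purpose of collection
- Purpose: Telugu is a low-resource language with no existing instruct-style paraphrase-generation dataset. This dataset was created as part of the Aya Open Science Initiative from Cohere For AI to ensure Telugu is well represented in AI/ML.
Known limitations
- Limitations:
  - The dataset may reflect the biases, factual errors, and sensitive matters of the source dataset.
  - Although every effort was made to keep the dataset monolingual, some records may mix Telugu and English.
Contributors
- Contributors: SuryaKrishna02 and Desik98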



