
SuryaKrishna02/aya-telugu-paraphrase

Hugging Face | Updated 2024-01-23 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/SuryaKrishna02/aya-telugu-paraphrase
Resource description:
---
annotations_creators:
- expert-generated
language:
- te
language_creators:
- expert-generated
license:
- apache-2.0
multilinguality:
- monolingual
pretty_name: Telugu Paraphrase
size_categories:
- n<1K
source_datasets:
- extended|ai4bharat/IndicXParaphrase
tags:
- paraphrase
task_categories:
- text-generation
task_ids:
- language-modeling
---

# Summary

`aya-telugu-paraphrase` is an open source dataset of instruct-style records generated from the Telugu split of the [ai4bharat/IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase/viewer/te/test) dataset. It was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the [Apache 2.0](https://opensource.org/license/apache-2-0) License.

Supported Tasks:
- Training LLMs
- Synthetic Data Generation
- Data Augmentation

Languages: Telugu

Version: 1.0

# Dataset Overview

`aya-telugu-paraphrase` is a corpus of more than 1.5k records generated by converting the Telugu split of the [ai4bharat/IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase/viewer/te/test) dataset into instruct-style format.

This dataset can be used for the following task:
- Given a sentence, generate a sentence with similar meaning.

# Intended Uses

As a corpus of instruction prompts, this dataset is immediately valuable for instruction fine-tuning of large language models. It also presents an opportunity for synthetic data generation. For example, prompt-completion pairs could be submitted as few-shot examples to a large open language model to generate new sentences and their corresponding paraphrases.
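The few-shot idea above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the helper `build_few_shot_prompt`, the blank-line separator, and the placeholder strings are assumptions made for the example. Only the `inputs` and `targets` field names come from the dataset card.

```python
from typing import Dict, List


def build_few_shot_prompt(examples: List[Dict[str, str]], new_input: str) -> str:
    """Join (inputs, targets) pairs, then append the prompt left for the model to complete."""
    # Each solved example is rendered as the prompt followed by its completion.
    blocks = [f"{ex['inputs']}\n{ex['targets']}" for ex in examples]
    # The final block has no completion, so the model is expected to supply one.
    blocks.append(new_input)
    return "\n\n".join(blocks)


# Placeholder records shaped like the dataset rows; the text is illustrative only.
examples = [
    {"inputs": "<prompt 1>", "targets": "<paraphrase 1>"},
    {"inputs": "<prompt 2>", "targets": "<paraphrase 2>"},
]
prompt = build_few_shot_prompt(examples, "<prompt 3>")
print(prompt)
```

In practice the placeholder records would be replaced by rows drawn from the dataset, and the resulting string would be sent to whichever language model is being used for generation.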
# Dataset

## Load with Datasets

To load this dataset with Datasets, install the library with `pip install datasets --upgrade` and then use the following code:

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
ds = load_dataset('SuryaKrishna02/aya-telugu-paraphrase')
```

## Purpose of Collection

Telugu is a low-resource language for which, to the best of my knowledge, there is no instruct-style paraphrase generation dataset. This dataset was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI to make sure Telugu is well represented in the AI/ML space. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

## Sources

- **[ai4bharat/IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase/viewer/te/test)**: Converted this dataset into instruct-style prompts and completions.

## Data Fields

- `inputs`: Prompt or input to the language model.
- `targets`: Completion or output of the language model.
- `template_id`: ID of the template used in `inputs` and `targets`.
- `template_lang`: ISO code of the language used in `inputs` and `targets`, where *tel* refers to Telugu.

## Templates

To create instruct-style prompts and completions from the original dataset, one template category with 6 variations was used:

1. Given a sentence, generate a sentence with similar meaning.

| template_id | inputs | targets |
|-------------|--------|---------|
| 1 | ```ఈ క్రింది వాక్యం మరోరీతిలో రాయి:\n{{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 2 | ```ఈ వాక్యం మరోరీతిలో రాయి: {{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 3 | ```ఈ క్రింది వాక్యం ఇంకొలాగా రాయి:\n{{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 4 | ```ఈ వాక్యం ఇంకొలాగా రాయి: {{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 5 | ```ఈ క్రింది వాక్యం మరోరకంగా రాయి:\n{{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |
| 6 | ```ఈ వాక్యం మరోరకంగా రాయి: {{Original Sentence}}``` | ```{{Paraphrased Sentence}}``` |

## Personal or Sensitive Data

This dataset contains public information. To our knowledge, it contains no personal identifiers of private persons and no sensitive information.

## Language

Telugu

# Known Limitations

- This dataset was converted from an existing dataset, so its contents may reflect the biases, factual errors, and sensitive matters present in the source.
- Although utmost care was taken to keep the dataset monolingual, some records may contain English alongside Telugu.

# Contributors

[SuryaKrishna02](https://github.com/SuryaKrishna02) and [Desik98](https://github.com/desik1998)
Provider:
SuryaKrishna02
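Summary of original information

Dataset overview

Basic information

  • Dataset name: aya-telugu-paraphrase
  • Language: Telugu
  • Dataset size: fewer than 1K records
  • License: Apache 2.0
  • Multilinguality: monolingual
  • Tags: paraphrase
  • Task category: text generation
  • Task ID: language modeling

Dataset source

  • Source dataset: ai4bharat/IndicXParaphrase
  • Creators: expert-generated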

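Dataset uses

  • Supported tasks:
    • Training large language models (LLMs)
    • Synthetic data generation
    • Data augmentation
  • Specific task: given a sentence, generate a sentence with similar meaning.

Dataset structure

  • Data fields:
    • inputs: prompt or input to the language model
    • targets: completion or output of the language model
    • template_id: ID of the template used in inputs and targets
    • template_lang: ISO code of the language used in inputs and targets, where tel refers to Telugu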

Templates

  • Template category: given a sentence, generate a sentence with similar meaning
    • Template IDs: 1-6
    • Example (template 1):
      • inputs: ఈ క్రింది వాక్యం మరోరీతిలో రాయి:\n{{Original Sentence}}
      • targets: {{Paraphrased Sentence}}
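How a source sentence pair becomes a record can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the contributors' conversion script: the `TEMPLATES` dictionary and the `to_record` helper are hypothetical names. The instruction strings, the line break in templates 1, 3, and 5, and the four field names follow the dataset card.

```python
from typing import Dict, Union

# Instruction text for the six template variations, copied from the dataset card.
# Templates 1, 3, and 5 put the sentence on a new line; 2, 4, and 6 keep it inline.
TEMPLATES: Dict[int, str] = {
    1: "ఈ క్రింది వాక్యం మరోరీతిలో రాయి:\n{original}",
    2: "ఈ వాక్యం మరోరీతిలో రాయి: {original}",
    3: "ఈ క్రింది వాక్యం ఇంకొలాగా రాయి:\n{original}",
    4: "ఈ వాక్యం ఇంకొలాగా రాయి: {original}",
    5: "ఈ క్రింది వాక్యం మరోరకంగా రాయి:\n{original}",
    6: "ఈ వాక్యం మరోరకంగా రాయి: {original}",
}


def to_record(template_id: int, original: str, paraphrase: str) -> Dict[str, Union[int, str]]:
    """Render one (original, paraphrase) pair into the four documented fields."""
    return {
        "inputs": TEMPLATES[template_id].format(original=original),
        "targets": paraphrase,
        "template_id": template_id,
        "template_lang": "tel",  # ISO code for Telugu, as stated in the card
    }


# Placeholder sentences; real rows would come from the source dataset.
record = to_record(2, "<original sentence>", "<paraphrased sentence>")
print(record["inputs"])
```

How the original conversion chose a template for each row is not stated in the card, so the sketch leaves that choice to the caller.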

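Purpose of collection

  • Purpose: Telugu is a low-resource language for which no instruct-style paraphrase generation dataset currently exists. This dataset was created as part of the Aya Open Science Initiative from Cohere For AI to make sure Telugu is well represented in the AI/ML space.

Known limitations

  • Limitations:
    • The dataset may reflect the biases, factual errors, and sensitive matters of the source dataset.
    • Although every effort was made to keep the dataset monolingual, some records may mix Telugu and English.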
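One way to surface the mixed-language records mentioned above is a simple script check. The sketch below is a heuristic under stated assumptions, not part of the dataset tooling: the helper names are hypothetical, and treating any ASCII letter as "English" will also flag acronyms and proper nouns written in Latin script. The Telugu range U+0C00 to U+0C7F is the Unicode Telugu block.

```python
import re
from typing import Dict, List

# ASCII letters stand in for English; the Telugu Unicode block is U+0C00 to U+0C7F.
LATIN_LETTER = re.compile(r"[A-Za-z]")
TELUGU_CHAR = re.compile(r"[\u0C00-\u0C7F]")


def is_mixed_script(text: str) -> bool:
    """Return True when the text contains both Telugu characters and Latin letters."""
    return bool(TELUGU_CHAR.search(text)) and bool(LATIN_LETTER.search(text))


def flag_mixed_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose prompt or completion mixes the two scripts."""
    return [r for r in records if is_mixed_script(r["inputs"] + " " + r["targets"])]


# Illustrative rows built from the card's instruction text, not real dataset rows.
samples = [
    {"inputs": "ఈ వాక్యం మరోరకంగా రాయి: ఈ వాక్యం", "targets": "ఈ క్రింది వాక్యం"},
    {"inputs": "ఈ వాక్యం మరోరకంగా రాయి: AI", "targets": "ఈ క్రింది వాక్యం"},
]
flagged = flag_mixed_records(samples)
print(len(flagged))
```

Flagged rows are candidates for manual review rather than automatic removal, since a Latin-script token is not always an English sentence.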

Contributors

  • SuryaKrishna02 and Desik98
