el2e10/aya-paraphrase

Source: Hugging Face | Updated 2024-02-04 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/el2e10/aya-paraphrase
Resource description:

---
license: cc
task_categories:
- text-generation
language:
- ml
- gu
- mr
- hi
- pa
- bn
pretty_name: Aya Paraphrase
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: mal
    path: data/mal.parquet
  - split: ben
    path: data/ben.parquet
  - split: guj
    path: data/guj.parquet
  - split: hin
    path: data/hin.parquet
  - split: mar
    path: data/mar.parquet
  - split: pan
    path: data/pan.parquet
---

### Description

This dataset is derived from an existing AI4Bharat dataset: we used AI4Bharat's [IndicXParaphrase](https://huggingface.co/datasets/ai4bharat/IndicXParaphrase) dataset to create this instruction-style dataset. It was created as part of the [Aya Open Science Initiative](https://sites.google.com/cohere.com/aya-en/home) from Cohere For AI.

IndicXParaphrase is a multilingual, n-way parallel dataset for paraphrase detection in 10 Indic languages. The original dataset (IndicXParaphrase) was made available under the CC0 license.

### Template

The following templates were used to convert the original dataset:

```
#Template 1
prompt: Write the following sentence using different words: "{original_sentence}"
completion: {paraphrased_sentence}
```

```
#Template 2
prompt: Rewrite the following sentence in different way: "{original_sentence}"
completion: {paraphrased_sentence}
```

```
#Template 3
prompt: Paraphrase the following sentence:: "{original_sentence}"
completion: {paraphrased_sentence}
```

### Acknowledgement

Thanks to Jay Patel for providing the Gujarati translations, Amarjit for the Punjabi translations, Yogesh Haribhau Kulkarni for the Marathi translations, Ganesh Jagadeesan for the Hindi translations, and Tahmid Hossain for the Bengali translations of the English prompts above.
Provided by:
el2e10
Summary of original information

Dataset overview

Basic information

  • License: cc
  • Task category: text generation
  • Languages: Malayalam (ml), Gujarati (gu), Marathi (mr), Hindi (hi), Punjabi (pa), Bengali (bn)
  • Dataset name: Aya Paraphrase
  • Dataset size: 1K<n<10K

Configuration

  • Config name: default
  • Data files:
    • split: mal, path: data/mal.parquet
    • split: ben, path: data/ben.parquet
    • split: guj, path: data/guj.parquet
    • split: hin, path: data/hin.parquet
    • split: mar, path: data/mar.parquet
    • split: pan, path: data/pan.parquet
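
Since the configuration lists one parquet file per language split, a single language can be loaded by split name. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed and the Hub is reachable; `SPLIT_FILES`, `REPO_ID`, and `load_split` are illustrative names that do not come from the dataset card.

```python
# Split names and parquet paths as listed in the dataset card's `configs` entry.
SPLIT_FILES = {
    "mal": "data/mal.parquet",
    "ben": "data/ben.parquet",
    "guj": "data/guj.parquet",
    "hin": "data/hin.parquet",
    "mar": "data/mar.parquet",
    "pan": "data/pan.parquet",
}

# Repository id taken from the page title.
REPO_ID = "el2e10/aya-paraphrase"


def load_split(split: str):
    """Load one language split from the Hub (hypothetical helper).

    Requires network access and the `datasets` package.
    """
    if split not in SPLIT_FILES:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(SPLIT_FILES)}")
    # Imported lazily so the split mapping stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

For example, `load_split("hin")` would return the Hindi split. The column names are not stated in the card, so it is worth inspecting the first record before relying on any field.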

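Description

This dataset is derived from AI4Bharat's existing IndicXParaphrase dataset and recasts it as an instruction-style dataset. IndicXParaphrase is a multilingual, n-way parallel dataset for paraphrase detection in 10 Indic languages.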

Templates

The following templates were used to convert the original dataset:

#Template 1
prompt: Write the following sentence using different words: "{original_sentence}"
completion: {paraphrased_sentence}

#Template 2
prompt: Rewrite the following sentence in different way: "{original_sentence}"
completion: {paraphrased_sentence}

#Template 3
prompt: Paraphrase the following sentence:: "{original_sentence}"
completion: {paraphrased_sentence}
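
To illustrate how a template turns a sentence pair into an instruction-style record, here is a rough sketch. The three template strings are copied verbatim from the card (English versions; the dataset itself uses their translations). The function name `to_instruction`, the output keys, and the way a template is selected are assumptions, since the card does not say how templates were assigned to examples.

```python
# English prompt templates, copied verbatim from the dataset card.
TEMPLATES = [
    'Write the following sentence using different words: "{original_sentence}"',
    'Rewrite the following sentence in different way: "{original_sentence}"',
    'Paraphrase the following sentence:: "{original_sentence}"',
]


def to_instruction(original_sentence: str, paraphrased_sentence: str, template_index: int) -> dict:
    """Build one prompt/completion pair (hypothetical helper, not the authors' code)."""
    # The sentence is passed as a format argument, so braces inside it are left untouched.
    prompt = TEMPLATES[template_index].format(original_sentence=original_sentence)
    return {"prompt": prompt, "completion": paraphrased_sentence}


example = to_instruction("The sky is blue.", "The sky has a blue colour.", 0)
print(example["prompt"])
print(example["completion"])
```

Passing the template index explicitly keeps the sketch deterministic; a real conversion might instead pick a template at random per example.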

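Acknowledgements

Thanks to Jay Patel for the Gujarati translations, Amarjit for the Punjabi translations, Yogesh Haribhau Kulkarni for the Marathi translations, Ganesh Jagadeesan for the Hindi translations, and Tahmid Hossain for the Bengali translations.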
