
Felladrin/ChatML-aya_dataset

Source: Hugging Face. Updated: 2024-02-17. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Felladrin/ChatML-aya_dataset
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
annotations_creators:
- crowdsourced
- expert-generated
language:
- amh
- arb
- ary
- ars
- acq
- arz
- apc
- ben
- ceb
- dan
- deu
- ell
- eng
- eus
- fil
- fin
- fra
- gle
- guj
- hat
- hau
- hin
- hun
- ibo
- ind
- ita
- jav
- jpn
- kan
- kir
- kor
- kur
- lit
- mal
- mar
- mlg
- msa
- mya
- nep
- nld
- nso
- nya
- pan
- pes
- pol
- por
- pus
- rus
- sin
- sna
- snd
- som
- spa
- sqi
- srp
- sun
- swa
- swe
- tam
- tel
- tha
- tur
- ukr
- urd
- vie
- wol
- xho
- yor
- zho
- zul
language_creators:
- crowdsourced
- expert-generated
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
---

[CohereForAI/aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) in ChatML format, ready to use in [HuggingFace TRL's SFT Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer).

Python code used for conversion:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")

dataset = load_dataset("CohereForAI/aya_dataset", split="train")

def format(columns):
    messages = [
        {
            "role": "user",
            "content": columns["inputs"].strip(),
        },
        {
            "role": "assistant",
            "content": columns["targets"].strip(),
        },
    ]

    return { "text": tokenizer.apply_chat_template(messages, tokenize=False) }

dataset.map(format).select_columns(['text', 'language', 'language_code', 'annotation_type', 'user_id']).to_parquet("train.parquet")
```
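To illustrate what the `text` column roughly looks like after conversion, here is a minimal sketch that renders one row in the generic ChatML layout without downloading a tokenizer. It is an approximation, not the author's method: the real output comes from the chat template shipped with `Felladrin/Llama-160M-Chat-v1`, which may differ in details such as a system prompt or trailing whitespace. The names `render_chatml`, `format_row`, and the example row are hypothetical.

```python
from typing import Dict, List

# Generic ChatML delimiters (assumed; the actual tokenizer template may differ).
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render a list of {"role", "content"} messages as generic ChatML text."""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


def format_row(row: Dict[str, str]) -> Dict[str, str]:
    """Mirror the conversion step: strip both fields, then build a user/assistant pair."""
    messages = [
        {"role": "user", "content": row["inputs"].strip()},
        {"role": "assistant", "content": row["targets"].strip()},
    ]
    return {"text": render_chatml(messages)}


# Hypothetical row using the same column names as the source dataset.
example_row = {"inputs": "  What is the capital of France? ", "targets": "Paris.\n"}
formatted = format_row(example_row)
print(formatted["text"])
```

Because each row already carries the full conversation in a single `text` field, a trainer that consumes plain text can likely use the dataset without further templating.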
Provided by:
Felladrin
Summary of original information

Dataset overview

License

  • Apache 2.0

Task categories

  • Question answering
  • Text generation

Annotation creators

  • Crowdsourced
  • Expert-generated

Languages

  • amh, arb, ary, ars, acq, arz, apc, ben, ceb, dan, deu, ell, eng, eus, fil, fin, fra, gle, guj, hat, hau, hin, hun, ibo, ind, ita, jav, jpn, kan, kir, kor, kur, lit, mal, mar, mlg, msa, mya, nep, nld, nso, nya, pan, pes, pol, por, pus, rus, sin, sna, snd, som, spa, sqi, srp, sun, swa, swe, tam, tel, tha, tur, ukr, urd, vie, wol, xho, yor, zho, zul

Language creators

  • Crowdsourced
  • Expert-generated

Multilinguality

  • Multilingual

Dataset size

  • 100K<n<1M