Felladrin/ChatML-aya_dataset
Source: Hugging Face. Updated 2024-02-17. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Felladrin/ChatML-aya_dataset
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
annotations_creators:
- crowdsourced
- expert-generated
language:
- amh
- arb
- ary
- ars
- acq
- arz
- apc
- ben
- ceb
- dan
- deu
- ell
- eng
- eus
- fil
- fin
- fra
- gle
- guj
- hat
- hau
- hin
- hun
- ibo
- ind
- ita
- jav
- jpn
- kan
- kir
- kor
- kur
- lit
- mal
- mar
- mlg
- msa
- mya
- nep
- nld
- nso
- nya
- pan
- pes
- pol
- por
- pus
- rus
- sin
- sna
- snd
- som
- spa
- sqi
- srp
- sun
- swa
- swe
- tam
- tel
- tha
- tur
- ukr
- urd
- vie
- wol
- xho
- yor
- zho
- zul
language_creators:
- crowdsourced
- expert-generated
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
---
[CohereForAI/aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) in ChatML format, ready to use in [HuggingFace TRL's SFT Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer).
Python code used for conversion:
```python
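from datasets import load_dataset
from transformers import AutoTokenizer

# The tokenizer supplies the chat template used to render each conversation.
tokenizer = AutoTokenizer.from_pretrained("Felladrin/Llama-160M-Chat-v1")
dataset = load_dataset("CohereForAI/aya_dataset", split="train")

def format(columns):
    # Build a single-turn conversation from the prompt and its completion.
    messages = [
        {
            "role": "user",
            "content": columns["inputs"].strip(),
        },
        {
            "role": "assistant",
            "content": columns["targets"].strip(),
        },
    ]
    # Render the conversation as a string without tokenizing it.
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

# Add the "text" column, keep it alongside the metadata columns, and write Parquet.
dataset.map(format).select_columns(
    ["text", "language", "language_code", "annotation_type", "user_id"]
).to_parquet("train.parquet")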
```
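To illustrate what the `text` column is expected to look like, here is a minimal sketch that renders a conversation by hand. It assumes the common ChatML layout (`<|im_start|>role`, a newline, the content, then `<|im_end|>`). The exact string produced by `apply_chat_template` depends on the chat template shipped with the tokenizer, so treat this as an approximation. The `to_chatml` helper and the sample messages are hypothetical and not part of the original conversion script.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as ChatML text.

    Assumption: each turn is wrapped as
    <|im_start|>{role}\n{content}<|im_end|>\n
    which is the usual ChatML convention; a real tokenizer template may differ.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


# Hypothetical single-turn example mirroring the user/assistant pair above.
example = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
text = to_chatml(example)
print(text)
```

Each row of the converted dataset would then hold one such string in `text`, next to the `language`, `language_code`, `annotation_type`, and `user_id` columns kept by the conversion script.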
Provider:
Felladrin
Original information summary
Dataset overview
License
- Apache 2.0
Task categories
- Question answering
- Text generation
Annotation creators
- Crowdsourced
- Expert-generated
Languages
- amh, arb, ary, ars, acq, arz, apc, ben, ceb, dan, deu, ell, eng, eus, fil, fin, fra, gle, guj, hat, hau, hin, hun, ibo, ind, ita, jav, jpn, kan, kir, kor, kur, lit, mal, mar, mlg, msa, mya, nep, nld, nso, nya, pan, pes, pol, por, pus, rus, sin, sna, snd, som, spa, sqi, srp, sun, swa, swe, tam, tel, tha, tur, ukr, urd, vie, wol, xho, yor, zho, zul
Language creators
- Crowdsourced
- Expert-generated
Multilinguality
- Multilingual
Dataset size
- 100K<n<1M



