Felladrin/ChatML-databricks-dolly-15k
Hugging Face · Updated 2024-02-03 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Felladrin/ChatML-databricks-dolly-15k
Resource description:
---
license: cc-by-sa-3.0
task_categories:
- question-answering
- text-generation
language:
- en
size_categories:
- 10K<n<100K
---
[databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) in ChatML format.
Python code used for conversion:
```python
from datasets import load_dataset
import pandas
from transformers import AutoTokenizer

# The tokenizer supplies the chat template used to render each conversation.
tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path="Felladrin/Llama-160M-Chat-v1"
)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")


def format(columns):
    instruction = columns["instruction"].strip()
    context = columns["context"].strip()
    response = columns["response"].strip()

    # Append the optional context to the instruction when one is provided.
    if context:
        user_message = f"{instruction}\n\nContext:\n{context}"
    else:
        user_message = instruction

    messages = [
        {
            "role": "user",
            "content": user_message,
        },
        {
            "role": "assistant",
            "content": response,
        },
    ]

    # tokenize=False returns the rendered string instead of token IDs.
    return tokenizer.apply_chat_template(messages, tokenize=False)


# Write a single "text" column holding the rendered conversations.
pandas.DataFrame({"text": [format(columns) for columns in dataset]}).to_parquet(
    "train.parquet", index=False
)
```
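To give a rough idea of what each row of the `text` column looks like, here is a minimal sketch that renders one conversation without downloading the tokenizer. It assumes the generic ChatML layout (`<|im_start|>role`, content, `<|im_end|>`); the actual chat template of `Felladrin/Llama-160M-Chat-v1` may add or change tokens, so treat the output as illustrative only. The helper names `build_user_message` and `render_chatml` are hypothetical and not part of the original script.

```python
def build_user_message(instruction: str, context: str) -> str:
    """Mirror the conversion script: append the context only when it is non-empty."""
    instruction = instruction.strip()
    context = context.strip()
    if context:
        return f"{instruction}\n\nContext:\n{context}"
    return instruction


def render_chatml(messages: list[dict[str, str]]) -> str:
    """Render messages in the generic ChatML layout.

    Assumption: this layout is written by hand here, not read from the
    tokenizer's chat template, which may differ in detail.
    """
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


example = render_chatml(
    [
        {
            "role": "user",
            "content": build_user_message("Summarize the text.", " Dolly is a dataset. "),
        },
        {"role": "assistant", "content": "Dolly is a dataset."},
    ]
)
print(example)
```

Rows without a context simply omit the `Context:` part, so the user turn contains only the instruction.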
Provider:
Felladrin
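Summary of original information
Dataset overview
Basic information
- License: cc-by-sa-3.0
- Task categories:
  - Question answering
  - Text generation
- Language:
  - English
- Size category:
  - 10K<n<100K
Source dataset
- Name: databricks/databricks-dolly-15k
Data conversion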
- Conversion format: ChatML
- Conversion code: the Python script listed above, which loads databricks/databricks-dolly-15k, renders each row with the chat template of the Felladrin/Llama-160M-Chat-v1 tokenizer, and writes the result to train.parquet.
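For readers who consume the converted `text` column, the following sketch splits one ChatML string back into role/content turns. It assumes the generic `<|im_start|>` / `<|im_end|>` delimiters; rows produced by the real tokenizer template may carry extra tokens that this pattern ignores. The name `parse_chatml` and the sample string are hypothetical.

```python
import re

# Assumption: each turn is delimited as <|im_start|>role\ncontent<|im_end|>.
TURN_PATTERN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)


def parse_chatml(text: str) -> list[dict[str, str]]:
    """Return the turns found in a ChatML string as role/content dictionaries."""
    return [
        {"role": role, "content": content}
        for role, content in TURN_PATTERN.findall(text)
    ]


# Hand-written sample, not an actual row from the dataset.
sample = (
    "<|im_start|>user\nWhat is Dolly?<|im_end|>\n"
    "<|im_start|>assistant\nAn instruction dataset.<|im_end|>\n"
)
turns = parse_chatml(sample)
print(turns)
```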



