sumandas/openhermes-2.5-llama3
Hugging Face | Updated 2024-04-20 | Indexed 2024-04-21
Download link:
https://hf-mirror.com/datasets/sumandas/openhermes-2.5-llama3
Description:
---
dataset_info:
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 1192763111.1828105
    num_examples: 701085
  - name: test
    num_bytes: 511185891.8171895
    num_examples: 300466
  download_size: 860367422
  dataset_size: 1703949003
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
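This dataset is a fork of `teknium/OpenHermes-2.5`, with each conversation re-encoded as a single Llama 3 style prompt string in the `content` column. The script used to generate it: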
```python
from datasets import load_dataset

# Load the source dataset that this fork is derived from.
dataset = load_dataset("teknium/OpenHermes-2.5")


def _return_header(message) -> str:
    """Map an OpenHermes role ("system", "gpt", "human") to a Llama 3 role name."""
    role = message["from"]
    header = ""
    if role == "system":
        header = "system"
    elif role == "gpt":
        header = "assistant"
    elif role == "human":
        header = "user"
    return header


def encode_header(message) -> str:
    """Build the role header that precedes each message."""
    text = ""
    text = text + "<|start_header_id|>"
    header = _return_header(message)
    text = text + header
    text = text + "<|end_header_id|>"
    text = text + "\n\n"
    return text


def encode_message(message) -> str:
    """Encode one message: header, stripped text, end-of-turn token."""
    text = encode_header(message)
    text = text + message["value"].strip()
    text = text + "<|eot_id|>"
    return text


def encode_dialog_prompt(dialog) -> str:
    """Encode a whole conversation as a single string."""
    text = ""
    text = text + "<|begin_of_text|>"
    for message in dialog:
        text = text + encode_message(message)
    # text = text + "<|end_of_text|>"
    return text


# Add the encoded conversation as a new "content" column.
ds = dataset.map(lambda x: {"content": encode_dialog_prompt(x["conversations"])}, num_proc=10)
# Drop every original column so that only "content" remains.
ds = ds.remove_columns(['custom_instruction', 'topic', 'model_name', 'model', 'skip_prompt_formatting', 'category', 'conversations', 'views', 'language', 'id', 'title', 'idx', 'hash', 'avatarUrl', 'system_prompt', 'source'])
# Hold out 30% of the rows as a test split, then upload both splits.
train_test_split = ds["train"].train_test_split(test_size=0.3)
train_test_split.push_to_hub("sumandas/openhermes-2.5-llama3")
```
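
To make the output of the script concrete, here is a minimal sketch that applies the same encoding steps to a made-up two-turn dialog. The names `ROLE_MAP` and `encode_dialog` are compact stand-ins written for this example and are not part of the script, and the sample dialog is invented rather than taken from the dataset.

```python
# Compact restatement of the encoding logic, for illustration only.
# ROLE_MAP and encode_dialog are hypothetical names, not part of the script.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def encode_dialog(dialog):
    """Encode a list of {"from", "value"} messages as one prompt string."""
    text = "<|begin_of_text|>"
    for message in dialog:
        # Unknown roles fall back to an empty header, as in the script.
        role = ROLE_MAP.get(message["from"], "")
        text += f"<|start_header_id|>{role}<|end_header_id|>\n\n"
        # Surrounding whitespace is stripped before the end-of-turn token.
        text += message["value"].strip() + "<|eot_id|>"
    return text


# Invented sample in the OpenHermes conversation layout.
sample = [
    {"from": "human", "value": "  What is 2 + 2?  "},
    {"from": "gpt", "value": "4"},
]

encoded = encode_dialog(sample)
print(encoded)
```

The printed string starts with `<|begin_of_text|>`, then contains one header and one `<|eot_id|>` per message. No `<|end_of_text|>` token is appended, matching the commented-out line in the script.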
**Note:** An experiment is still needed to validate that this output is correctly formatted.
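
One possible first step toward that validation is a purely structural check on each `content` string. The sketch below rests on an assumption about what "correctly formatted" means here: the string starts with `<|begin_of_text|>` and the remainder is a sequence of complete turns, each with a `system`, `user`, or `assistant` header and a closing `<|eot_id|>`. The function name `is_well_formed` is hypothetical, and passing this check does not show that the strings match the tokenizer's official chat template.

```python
import re

BEGIN = "<|begin_of_text|>"
# One turn: header with a known role, blank line, body, end-of-turn token.
TURN = re.compile(
    r"<\|start_header_id\|>(system|user|assistant)<\|end_header_id\|>\n\n"
    r"(.*?)<\|eot_id\|>",
    re.DOTALL,
)


def is_well_formed(content: str) -> bool:
    """Return True if content is BEGIN followed by one or more complete turns."""
    if not content.startswith(BEGIN):
        return False
    pos = len(BEGIN)
    turns = 0
    while pos < len(content):
        match = TURN.match(content, pos)
        if match is None:
            return False  # leftover text that is not a complete turn
        pos = match.end()
        turns += 1
    return turns > 0
```

Rows with an unmapped role would fail this check, because the generation script gives them an empty header. Across the dataset, something like `ds.filter(lambda x: not is_well_formed(x["content"]))` could be used to collect rows that need a closer look.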
### Citation
```
@misc{OpenHermes 2.5,
title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
author = {Teknium},
year = {2023},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
}
```
Provider:
sumandas
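Summary of original information
Dataset information
Features
- Name: content
- Data type: string
Splits
- Train
  - Bytes: 1192763111.1828105
  - Examples: 701085
- Test
  - Bytes: 511185891.8171895
  - Examples: 300466
Download and dataset size
- Download size: 860367422
- Dataset size: 1703949003
Configs
- Config name: default
- Data files
  - Train path: data/train-*
  - Test path: data/test-*
Task categories
- Text generation
Size categories
- 100K<n<1M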



