sumandas/openhermes-2.5-llama3

Name: sumandas/openhermes-2.5-llama3
Creator: sumandas
Published: 2024-04-20 07:33:19
License: No description available

Hugging Face · Updated 2024-04-20 · Indexed 2024-04-21

Download link:

https://hf-mirror.com/datasets/sumandas/openhermes-2.5-llama3


Description:

---
dataset_info:
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 1192763111.1828105
    num_examples: 701085
  - name: test
    num_bytes: 511185891.8171895
    num_examples: 300466
  download_size: 860367422
  dataset_size: 1703949003
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

It's a fork of `teknium/OpenHermes-2.5`. The script used to generate it:

```python
from datasets import load_dataset

dataset = load_dataset("teknium/OpenHermes-2.5")


def _return_header(message) -> str:
    # Map the ShareGPT-style "from" field to a Llama 3 role name.
    # Unknown roles fall through to an empty header.
    role = message["from"]
    header = ""
    if role == "system":
        header = "system"
    elif role == "gpt":
        header = "assistant"
    elif role == "human":
        header = "user"
    return header


def encode_header(message) -> str:
    # Wrap the role name in the Llama 3 header tokens.
    text = ""
    text = text + "<|start_header_id|>"
    header = _return_header(message)
    text = text + header
    text = text + "<|end_header_id|>"
    text = text + "\n\n"
    return text


def encode_message(message) -> str:
    # One turn: header, stripped message body, end-of-turn token.
    text = encode_header(message)
    text = text + message["value"].strip()
    text = text + "<|eot_id|>"
    return text


def encode_dialog_prompt(dialog) -> str:
    # Concatenate all turns of a conversation after the begin-of-text token.
    text = ""
    text = text + "<|begin_of_text|>"
    for message in dialog:
        text = text + encode_message(message)
    # text = text + "<|end_of_text|>"
    return text


# Add the formatted "content" column, then drop every original column.
ds = dataset.map(lambda x: {"content": encode_dialog_prompt(x["conversations"])}, num_proc=10)
ds = ds.remove_columns(['custom_instruction', 'topic', 'model_name', 'model', 'skip_prompt_formatting', 'category', 'conversations', 'views', 'language', 'id', 'title', 'idx', 'hash', 'avatarUrl', 'system_prompt', 'source'])

# Hold out 30% of the rows as a test split and publish both splits.
train_test_split = ds["train"].train_test_split(test_size=0.3)
train_test_split.push_to_hub("sumandas/openhermes-2.5-llama3")
```

**Note:** An experiment is still needed to validate that this is correctly formatted.

### Citation

```
@misc{OpenHermes 2.5,
  title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
  author = {Teknium},
  year = {2023},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
}
```
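The note in the dataset card says the formatting still needs to be validated. One possible first step, sketched here under assumptions, is a structural check on a `content` string: it should start with `<|begin_of_text|>` and then consist only of well-formed turns, as `encode_message` in the script produces them. The names `TURN_RE`, `BOS`, `parse_content`, and `sample` are hypothetical and not part of the original script; this checks only the string layout, not whether a Llama 3 tokenizer treats these markers as special tokens.

```python
import re

# One encoded turn: header tokens around a known role, a blank line,
# the message body, and the end-of-turn token.
TURN_RE = re.compile(
    r"<\|start_header_id\|>(system|user|assistant)<\|end_header_id\|>\n\n(.*?)<\|eot_id\|>",
    re.DOTALL,
)
BOS = "<|begin_of_text|>"


def parse_content(content: str) -> list[tuple[str, str]]:
    """Split a `content` string into (role, text) turns; raise ValueError if malformed."""
    if not content.startswith(BOS):
        raise ValueError("missing <|begin_of_text|> prefix")
    body = content[len(BOS):]
    turns = []
    pos = 0
    for match in TURN_RE.finditer(body):
        # Every turn must begin exactly where the previous one ended.
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        turns.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos != len(body):
        raise ValueError(f"trailing text at offset {pos}")
    return turns


# Hand-written example in the same layout the script emits (not real dataset rows).
sample = (
    "<|begin_of_text|>"
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello!<|eot_id|>"
)
turns = parse_content(sample)
```

A row whose `from` field was not `system`, `gpt`, or `human` would have been written with an empty header by `_return_header`, so this check rejects it rather than silently accepting it.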

Provided by:

sumandas

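Summary of original information

Dataset information

Features

Name: content
Data type: string

Data splits

Training set
- Bytes: 1192763111.1828105
- Examples: 701085
Test set
- Bytes: 511185891.8171895
- Examples: 300466

Download and dataset size

Download size: 860367422 bytes
Dataset size: 1703949003 bytes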
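The split figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed on this page; the variable names are illustrative, and the comparison with a rounded-up 30% hold-out is an assumption about how the split size was derived from `test_size=0.3`, not something the page states.

```python
import math

# Example counts as listed in the split summary.
train_examples = 701085
test_examples = 300466
total_examples = train_examples + test_examples

# Share of rows in the test split; the generating script asked for 0.3.
test_fraction = test_examples / total_examples

# Assumption: the test size is 30% of the rows, rounded up.
expected_test = math.ceil(0.3 * total_examples)

# Per-split byte counts (fractional as listed) should add up to the dataset size.
train_bytes = 1192763111.1828105
test_bytes = 511185891.8171895
dataset_size = 1703949003
bytes_gap = abs((train_bytes + test_bytes) - dataset_size)

print(total_examples, round(test_fraction, 4), expected_test, bytes_gap < 1e-3)
```

The fractional byte counts suggest the total size was apportioned between the two splits by row count rather than measured per split, which would explain why they sum exactly to the dataset size.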

Configuration

Config name: default
- Data files
  - Training set path: data/train-*
  - Test set path: data/test-*

Task categories

Text generation

Size categories

100K<n<1M
