
sumandas/openhermes-2.5-llama3

Hugging Face, updated 2024-04-20, indexed 2024-04-21
Download link:
https://hf-mirror.com/datasets/sumandas/openhermes-2.5-llama3
Resource description:
---
dataset_info:
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 1192763111.1828105
    num_examples: 701085
  - name: test
    num_bytes: 511185891.8171895
    num_examples: 300466
  download_size: 860367422
  dataset_size: 1703949003
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

This dataset is a fork of `teknium/OpenHermes-2.5`, with each conversation rendered into a single `content` string using Llama 3 style special tokens. The script used to generate it:

```python
from datasets import load_dataset

dataset = load_dataset("teknium/OpenHermes-2.5")


def _return_header(message) -> str:
    # Map OpenHermes role names to the role names used in the headers.
    # Any other role falls through and yields an empty header.
    role = message["from"]
    header = ""
    if role == "system":
        header = "system"
    elif role == "gpt":
        header = "assistant"
    elif role == "human":
        header = "user"
    return header


def encode_header(message) -> str:
    # <|start_header_id|>{role}<|end_header_id|> followed by a blank line.
    text = "<|start_header_id|>"
    text = text + _return_header(message)
    text = text + "<|end_header_id|>"
    text = text + "\n\n"
    return text


def encode_message(message) -> str:
    # Header, stripped message body, then the end-of-turn token.
    text = encode_header(message)
    text = text + message["value"].strip()
    text = text + "<|eot_id|>"
    return text


def encode_dialog_prompt(dialog) -> str:
    # Concatenate every turn after a single <|begin_of_text|> token.
    text = "<|begin_of_text|>"
    for message in dialog:
        text = text + encode_message(message)
    # text = text + "<|end_of_text|>"
    return text


# Add the encoded conversation as a new "content" column.
ds = dataset.map(lambda x: {"content": encode_dialog_prompt(x["conversations"])}, num_proc=10)

# Drop every original column so that only "content" remains.
ds = ds.remove_columns(['custom_instruction', 'topic', 'model_name', 'model', 'skip_prompt_formatting', 'category', 'conversations', 'views', 'language', 'id', 'title', 'idx', 'hash', 'avatarUrl', 'system_prompt', 'source'])

# Hold out 30% of the examples as the test split, then publish both splits.
train_test_split = ds["train"].train_test_split(test_size=0.3)
train_test_split.push_to_hub("sumandas/openhermes-2.5-llama3")
```

**Note:** An experiment is still needed to validate that the output is correctly formatted.

### Citation

```
@misc{OpenHermes 2.5,
  title = {OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants},
  author = {Teknium},
  year = {2023},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/teknium/OpenHermes-2.5}
}
```
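Since the card notes that the formatting has not yet been validated, a structural check along the following lines may be a useful first step. This is only a sketch, not part of the original script: `check_content`, `ALLOWED_ROLES`, and `TURN_RE` are hypothetical names, and the check only confirms that a string has the shape the generation script is meant to produce (a `<|begin_of_text|>` prefix followed by header/body/`<|eot_id|>` turns). It does not confirm that a particular tokenizer or model treats these strings as special tokens.

```python
import re

# Roles the generation script can emit via its role mapping (assumed complete).
ALLOWED_ROLES = {"system", "user", "assistant"}

BOS = "<|begin_of_text|>"

# One turn: header, blank line, message body, end-of-turn marker.
TURN_RE = re.compile(
    r"<\|start_header_id\|>(?P<role>[^<]*)<\|end_header_id\|>\n\n"
    r"(?P<body>.*?)<\|eot_id\|>",
    re.DOTALL,
)


def check_content(text: str) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if text.startswith(BOS):
        rest = text[len(BOS):]
    else:
        problems.append("missing <|begin_of_text|> prefix")
        rest = text

    pos = 0
    turns = 0
    while pos < len(rest):
        match = TURN_RE.match(rest, pos)
        if match is None:
            problems.append(f"unparseable text at offset {pos}")
            break
        role = match.group("role")
        if role not in ALLOWED_ROLES:
            # The script writes an empty header for any role other than
            # "system", "gpt", or "human", which this branch would catch.
            problems.append(f"unexpected role {role!r} in turn {turns}")
        pos = match.end()
        turns += 1

    if turns == 0 and not problems:
        problems.append("no turns found")
    return problems


sample = (
    BOS
    + "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    + "<|start_header_id|>assistant<|end_header_id|>\n\nHello<|eot_id|>"
)
print(check_content(sample))
```

Running such a check over the `content` column would flag rows with an empty role header, which the script produces whenever a message's `from` field is something other than `system`, `gpt`, or `human`.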
Provider:
sumandas
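Summary of original information

Dataset information

Features

  • Name: content
  • Data type: string

Data splits

  • Train
    • Bytes: 1192763111.1828105
    • Examples: 701085
  • Test
    • Bytes: 511185891.8171895
    • Examples: 300466

Download and dataset size

  • Download size: 860367422 bytes
  • Dataset size: 1703949003 bytes

Configuration

  • Config name: default
    • Data files
      • Train path: data/train-*
      • Test path: data/test-*

Task categories

  • Text generation

Size categories

  • 100K<n<1M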
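
The split figures above can be cross-checked with a little arithmetic. The sketch below assumes that a fractional `test_size` of 0.3 is rounded up to get the test count (an assumption about the splitting behaviour, not something the card states), and it also tests whether the fractional per-split byte counts are proportional shares of the total dataset size. All variable names are illustrative.

```python
import math

# Figures copied from the dataset metadata.
train_examples = 701085
test_examples = 300466
train_bytes = 1192763111.1828105
test_bytes = 511185891.8171895
dataset_size = 1703949003

total_examples = train_examples + test_examples

# Assumption: the test count is the 30% fraction rounded up.
expected_test = math.ceil(0.3 * total_examples)
counts_match = expected_test == test_examples

# The per-split byte counts should add up to the reported dataset size.
bytes_add_up = round(train_bytes + test_bytes) == dataset_size

# If the byte counts are proportional estimates, the train share of the
# bytes equals the train share of the examples (within one byte here).
proportional_train_bytes = dataset_size * train_examples / total_examples
bytes_proportional = abs(proportional_train_bytes - train_bytes) < 1.0

print(total_examples, counts_match, bytes_add_up, bytes_proportional)
```

If the last check holds, the non-integer byte counts are most likely estimates derived from the example ratio rather than measured sizes of each split.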