
Mihaiii/OpenHermes-2.5-1k-longest-curated

Hugging Face | Updated 2024-02-17 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Mihaiii/OpenHermes-2.5-1k-longest-curated
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4176433
    num_examples: 519
  download_size: 1835764
  dataset_size: 4176433
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

This is a dataset that was created from [HuggingFaceH4/OpenHermes-2.5-1k-longest](https://huggingface.co/datasets/HuggingFaceH4/OpenHermes-2.5-1k-longest).

It is intended for use in an [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) config by adding:

```yaml
datasets:
  - path: Mihaiii/OpenHermes-2.5-1k-longest-curated
    type: alpaca
```

I eliminated rows that:

1) Had a system prompt (only 3 rows eliminated)
2) Contained in the output a character repeated 10 times in a row (478 rows eliminated)

So from a 1000-row dataset, I ended up with a 519-row dataset.

See the [OpenHermes-2.5-1k-longest-curated.ipynb](https://huggingface.co/datasets/Mihaiii/OpenHermes-2.5-1k-longest-curated/blob/main/OpenHermes-2.5-1k-longest-curated.ipynb) notebook for details on how the dataset was constructed.

**Later edit**: after a more in-depth analysis of the dataset, I noticed that:

1) The imported subset is `test_sft`, but this is the 2nd chunk of the top 1k records. The first one is in the `train_sft` subset.
2) Valid code records that contained 10 repeated spaces for indentation were also eliminated.

This is a dataset created from HuggingFaceH4/OpenHermes-2.5-1k-longest, containing 519 samples, each with two features: instruction and output, both of which are string types. The dataset was filtered to remove rows with system prompts and rows where the output contained a character repeated 10 times. It is used for axolotl configuration, with detailed construction information available in the provided Jupyter notebook.
Provider:
Mihaiii
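Original information summary

Dataset overview

Dataset information

  • Features:
    • instruction: string
    • output: string
  • Splits:
    • train: 519 examples, 4176433 bytes in total
  • Size:
    • Download size: 1835764 bytes
    • Dataset size: 4176433 bytes

Configuration

  • Default config:
    • Data file path: data/train-*

Data processing

  • The original dataset contained 1000 rows; 519 rows remained after the following steps:
    1. Rows with a system prompt were removed (3 rows)
    2. Rows whose output contained a character repeated 10 times in a row were removed (478 rows)
  • Further analysis found that:
    1. The imported subset is test_sft, but this is the second chunk of the top 1000 records; the first chunk is in the train_sft subset
    2. Valid code records that used 10 consecutive spaces for indentation were also removed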
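
The two filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not the code from the author's notebook: the `system_prompt` field name, the helper names, and the sample rows are all assumptions made for the example. The last sample row shows the caveat noted above, where valid code indented with 10 spaces is also dropped.

```python
import re

# Matches any single character repeated 10 times in a row
# (one captured character followed by 9 more copies of it).
REPEATED_CHAR = re.compile(r"(.)\1{9}", re.DOTALL)


def has_system_prompt(row):
    # "system_prompt" is a hypothetical field name used for illustration.
    return bool((row.get("system_prompt") or "").strip())


def has_repeated_run(text):
    return REPEATED_CHAR.search(text) is not None


def curate(rows):
    # Keep only rows with no system prompt and no 10-character run in the output.
    return [
        r for r in rows
        if not has_system_prompt(r) and not has_repeated_run(r["output"])
    ]


# Made-up sample rows, not taken from the dataset.
rows = [
    {"system_prompt": "", "instruction": "Say hi", "output": "Hello there."},
    {"system_prompt": "You are helpful.", "instruction": "Say hi", "output": "Hi."},
    {"system_prompt": "", "instruction": "Draw a line", "output": "-" * 12},
    # Valid code with 10 spaces of indentation is filtered out as well.
    {"system_prompt": "", "instruction": "Write code", "output": "if x:\n" + " " * 10 + "pass"},
]

kept = curate(rows)
print(len(kept), "of", len(rows), "rows kept")
```

A variant that skips whitespace when looking for repeated runs would keep indented code, at the cost of also keeping outputs padded with long runs of spaces.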