Mihaiii/OpenHermes-2.5-1k-longest-curated

Name: Mihaiii/OpenHermes-2.5-1k-longest-curated
Creator: Mihaiii
Published: 2024-02-17 12:36:56
License: No description available

Source: Hugging Face. Updated 2024-02-17. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Mihaiii/OpenHermes-2.5-1k-longest-curated

Resource description:

---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4176433
    num_examples: 519
  download_size: 1835764
  dataset_size: 4176433
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

This is a dataset that was created from [HuggingFaceH4/OpenHermes-2.5-1k-longest](https://huggingface.co/datasets/HuggingFaceH4/OpenHermes-2.5-1k-longest).

The purpose is to be able to use it in an [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) config by adding:

```yaml
datasets:
  - path: Mihaiii/OpenHermes-2.5-1k-longest-curated
    type: alpaca
```

I eliminated rows that:
1) Had a system prompt (only 3 rows eliminated)
2) Contained in the output a character repeated 10 times in a row (478 rows eliminated)

So from a 1000-row dataset, I ended up with a 519-row dataset.

See the [OpenHermes-2.5-1k-longest-curated.ipynb](https://huggingface.co/datasets/Mihaiii/OpenHermes-2.5-1k-longest-curated/blob/main/OpenHermes-2.5-1k-longest-curated.ipynb) notebook for details on how the dataset was constructed.

**Later edit**: after a more in-depth analysis of the dataset, I noticed that:
1) The imported subset is `test_sft`, but this is the 2nd chunk of the top 1k records. The first one is in the `train_sft` subset.
2) Valid code records that contained 10 repeated spaces for indentation were also eliminated.
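The two elimination rules described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's code: the `system_prompt` field name, the helper names, and the choice of a regular expression to detect repeats are all hypothetical. The linked notebook remains the reference for what was actually done.

```python
import re

# Assumption: "repeated 10 times in a row" means the same character
# appearing 10 or more times consecutively, whitespace included.
REPEATED_RUN = re.compile(r"(.)\1{9}", re.DOTALL)


def has_repeated_run(text: str) -> bool:
    """Return True if any single character occurs 10+ times in a row."""
    return REPEATED_RUN.search(text) is not None


def keep_row(row: dict) -> bool:
    """Apply both rules; `system_prompt` is a hypothetical field name."""
    if row.get("system_prompt"):
        return False
    return not has_repeated_run(row["output"])


# Toy rows made up for this sketch (not taken from the dataset).
rows = [
    {"system_prompt": None, "instruction": "Say hi", "output": "Hello there."},
    {"system_prompt": "You are a bot.", "instruction": "Say hi", "output": "Hi."},
    {"system_prompt": None, "instruction": "Draw a line", "output": "-" * 12},
    # Deeply indented code also trips the rule, as the later edit points out.
    {"system_prompt": None, "instruction": "Write code",
     "output": "if x:\n" + " " * 10 + "pass"},
]

kept = [row for row in rows if keep_row(row)]
```

In this toy input only the first row survives, which mirrors how a rule aimed at degenerate repetition can also drop legitimate code with deep indentation.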

This dataset was created from HuggingFaceH4/OpenHermes-2.5-1k-longest and contains 519 samples, each with two string features: instruction and output. Rows with a system prompt and rows whose output contained a character repeated 10 times in a row were removed. The dataset is intended for use in axolotl configs, and the provided Jupyter notebook documents how it was constructed.
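To make the two-field layout concrete, the sketch below renders one instruction/output pair as a single training string. It uses the prompt template published with Stanford Alpaca for examples that have no separate input field. Whether axolotl's `alpaca` type produces exactly this text is an assumption, and `render_alpaca` and the sample row are hypothetical.

```python
# Stanford Alpaca's template for examples without a separate input field.
# Assumption: used here for illustration; axolotl's rendering may differ.
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_alpaca(row: dict) -> str:
    """Render one row as prompt plus target (illustrative helper)."""
    prompt = ALPACA_NO_INPUT.format(instruction=row["instruction"])
    return prompt + row["output"]


# Made-up row with the dataset's two string features.
example = {"instruction": "Name a prime number.", "output": "7"}
text = render_alpaca(example)
```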

Provider:

Mihaiii

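Summary of original information

Dataset overview

Dataset information

Features:
- instruction: string
- output: string
Splits:
- train: 519 examples, 4176433 bytes in total
Sizes:
- Download size: 1835764 bytes
- Dataset size: 4176433 bytes

Configuration

Default configuration:
- Data file path: data/train-*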

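Data processing

The original dataset contained 1000 rows; 519 rows remained after the following processing:
1. Removed rows containing a system prompt (3 rows)
2. Removed rows whose output contained a character repeated 10 times in a row (478 rows)
Further analysis found:
1. The imported subset is test_sft, but this is the second chunk of the top 1000 records; the first chunk is in the train_sft subset
2. Valid code records that used 10 consecutive spaces for indentation were also removed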
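The second finding suggests one possible refinement, which was not applied to this dataset: count only non-whitespace characters toward a repeated run, so that indentation alone does not disqualify a row. The sketch below is a hypothetical variant; the pattern and the function name are assumptions made for illustration.

```python
import re

# Hypothetical refinement: only non-whitespace characters count toward a run,
# so 10+ spaces of indentation no longer trigger the rule.
NON_SPACE_RUN = re.compile(r"(\S)\1{9}")


def has_non_space_run(text: str) -> bool:
    """Return True if a non-whitespace character occurs 10+ times in a row."""
    return NON_SPACE_RUN.search(text) is not None


indented_code = "def f():\n" + " " * 12 + "return 1"
separator_line = "=" * 10

code_flagged = has_non_space_run(indented_code)        # False
separator_flagged = has_non_space_run(separator_line)  # True
```

A variant like this would still remove long runs of punctuation, so valid outputs such as ASCII tables or separator lines could still be dropped.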
