Mihaiii/OpenHermes-2.5-1k-longest-curated
Source: Hugging Face. Updated 2024-02-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Mihaiii/OpenHermes-2.5-1k-longest-curated
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4176433
    num_examples: 519
  download_size: 1835764
  dataset_size: 4176433
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
This dataset was created from [HuggingFaceH4/OpenHermes-2.5-1k-longest](https://huggingface.co/datasets/HuggingFaceH4/OpenHermes-2.5-1k-longest).
It is meant to be used in an [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) config by adding:
```yaml
datasets:
  - path: Mihaiii/OpenHermes-2.5-1k-longest-curated
    type: alpaca
```
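To illustrate what the `alpaca` type implies for rows that have only `instruction` and `output`, here is a minimal sketch that renders one row with the widely used Stanford Alpaca no-input template. The helper name `render_alpaca` and the example row are hypothetical, and the template axolotl applies internally may differ in detail.

```python
# Stanford Alpaca "no input" prompt template (assumed here for illustration;
# axolotl's own template for type: alpaca may not match it exactly).
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_alpaca(row: dict) -> str:
    """Hypothetical helper: return prompt plus target text for one row."""
    return ALPACA_NO_INPUT.format(instruction=row["instruction"]) + row["output"]


# Made-up row with the two columns this dataset provides.
example = {"instruction": "Name a prime number.", "output": "7"}
print(render_alpaca(example))
```

Since the dataset has no `input` column, only the no-input variant of the template is relevant in this sketch.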
I eliminated rows that:
1) Had a system prompt (only 3 rows eliminated)
2) Contained in the output a character repeated 10 times in a row (478 rows eliminated)
So from a 1,000-row dataset, I ended up with a 519-row dataset.
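The two filters above can be sketched as follows. This is not the notebook's code: the column name `system_prompt`, the helper `keep_row`, and the sample rows are assumptions made for illustration, so consult the linked notebook for the actual implementation.

```python
import re

# Any single character (newlines included) followed by 9 more copies of itself,
# i.e. the same character 10 times in a row.
REPEATED_RUN = re.compile(r"(.)\1{9}", re.DOTALL)


def keep_row(row: dict) -> bool:
    """Keep a row only if it has no system prompt and no 10-character run in its output."""
    if row.get("system_prompt"):  # hypothetical column name
        return False
    return REPEATED_RUN.search(row["output"]) is None


# Made-up rows, one per case.
rows = [
    {"system_prompt": "", "instruction": "a", "output": "short answer"},
    {"system_prompt": "You are helpful.", "instruction": "b", "output": "ok"},
    {"system_prompt": "", "instruction": "c", "output": "-" * 10},
]

# Keep only the two columns the curated dataset exposes.
curated = [
    {"instruction": r["instruction"], "output": r["output"]}
    for r in rows
    if keep_row(r)
]
print(len(curated))  # 1
```

The counts reported above are consistent with the two filters removing disjoint sets of rows: 1000 - 3 - 478 = 519.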
See the [OpenHermes-2.5-1k-longest-curated.ipynb](https://huggingface.co/datasets/Mihaiii/OpenHermes-2.5-1k-longest-curated/blob/main/OpenHermes-2.5-1k-longest-curated.ipynb) notebook for details on how the dataset was constructed.
**Later edit**: after a more in-depth analysis of the dataset, I noticed that:
1) The imported subset is `test_sft`, but this is the 2nd chunk of the top 1k records. The first one is in the `train_sft` subset.
2) Valid code records that contained 10 repeated spaces for indentation were also eliminated.
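One possible way to avoid the second issue, shown only as a sketch and not applied to this dataset, is to ignore runs of whitespace when looking for repeated characters. The helper name `has_degenerate_run` is hypothetical.

```python
import re

# A non-whitespace character followed by 9 more copies of itself.
# Runs of spaces, tabs, or newlines no longer match, so deep indentation is kept.
REPEATED_NON_SPACE = re.compile(r"(\S)\1{9}")


def has_degenerate_run(text: str) -> bool:
    """Return True if text contains a non-whitespace character repeated 10 times in a row."""
    return REPEATED_NON_SPACE.search(text) is not None


indented_code = "def f():\n" + " " * 12 + "return 1\n"
print(has_degenerate_run(indented_code))  # False
print(has_degenerate_run("=" * 10))       # True
```

This variant would still drop outputs that contain long separator lines such as `==========`, so it trades one kind of false positive for a narrower one.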
This dataset was created from HuggingFaceH4/OpenHermes-2.5-1k-longest and contains 519 samples, each with two string features: instruction and output. It was filtered to remove rows with system prompts and rows whose output contained a character repeated 10 times in a row. It is intended for use in an axolotl config; details of its construction are in the provided Jupyter notebook.
Provided by:
Mihaiii
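Summary of original information
Dataset overview
Dataset information
- Features:
  - instruction: string
  - output: string
- Splits:
  - train: 519 examples, 4176433 bytes in total
- Sizes:
  - Download size: 1835764 bytes
  - Dataset size: 4176433 bytes
Configuration
- Default config:
  - Data file path: data/train-*
Data processing
- The original dataset contained 1000 rows; 519 remained after the following processing:
  - Removed rows containing a system prompt (3 rows)
  - Removed rows whose output contained a character repeated 10 times in a row (478 rows)
- Further analysis found that:
  - The imported subset is test_sft, but this is the second chunk of the top 1000 records; the first chunk is in the train_sft subset
  - Valid code records that used 10 consecutive spaces for indentation were also removed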



