garg-aayush/mini-platypus-1K
Source: Hugging Face. Updated 2024-04-18; indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/garg-aayush/mini-platypus-1K
Resource description:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 4185802
    num_examples: 1000
  download_size: 2263884
  dataset_size: 4185802
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
language:
- en
size_categories:
- n<1K
---
# Dataset Card for mini-platypus-1K
This is a filtered subset of the [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) dataset, containing 1000 examples for supervised fine-tuning (SFT).
## Dataset Details
### Dataset Description
The 1000 examples were selected from [Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus) with the following filtering steps:
1. Remove all rows whose combined token count (instruction plus output) exceeds 2048. Tokens are counted with the `NousResearch/Llama-2-7b-hf` tokenizer.
2. Remove near-duplicate examples by embedding similarity. Embeddings are generated with `mixedbread-ai/mxbai-embed-large-v1`, `faiss` is used for the vector search, and cosine similarity is the similarity function. The similarity threshold is `0.877`.
3. Keep the top k = 1000 remaining examples, ranked by combined token count from highest to lowest.
The filtering steps are adapted from Maxime Labonne's [Dataset creation for fine-tuning LLM.ipynb](https://colab.research.google.com/drive/1GH8PW9-zAe4cXEZyOIE-T9uHXblIldAg?usp=sharing) notebook.
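The three filtering steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code. It makes several assumptions: `count_tokens` and `embed` are hypothetical stand-ins (whitespace splitting and bag-of-words vectors) for the Llama-2 tokenizer and the `mxbai-embed-large-v1` embeddings; the greedy pairwise comparison replaces the `faiss` search; and the choice to embed the instruction and output together, and to treat similarity at or above the threshold as a duplicate, is a guess, since the card does not specify either.

```python
import math
from collections import Counter

MAX_TOKENS = 2048      # step 1 limit, from the card
SIM_THRESHOLD = 0.877  # step 2 threshold, from the card
TOP_K = 1000           # step 3 size, from the card


def count_tokens(text):
    """Hypothetical stand-in: whitespace tokens instead of the
    NousResearch/Llama-2-7b-hf tokenizer named in the card."""
    return len(text.split())


def embed(text):
    """Hypothetical stand-in: bag-of-words counts instead of
    mixedbread-ai/mxbai-embed-large-v1 embeddings."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_rows(rows, max_tokens=MAX_TOKENS, threshold=SIM_THRESHOLD, k=TOP_K):
    # Step 1: drop rows whose instruction + output exceed the token limit.
    sized = []
    for row in rows:
        n = count_tokens(row["instruction"]) + count_tokens(row["output"])
        if n <= max_tokens:
            sized.append((n, row))

    # Step 2: greedy near-duplicate removal. A row is kept only if its
    # similarity to every previously kept row is below the threshold
    # (assumption: similarity >= threshold counts as a duplicate).
    kept, kept_vecs = [], []
    for n, row in sized:
        vec = embed(row["instruction"] + " " + row["output"])
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append((n, row))
            kept_vecs.append(vec)

    # Step 3: keep the k rows with the most combined tokens.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in kept[:k]]
```

The greedy comparison is quadratic in the number of rows, which is fine for a toy example; an approximate or indexed search such as `faiss` is the practical choice at dataset scale.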
The dataset has been formatted using the following chat template function:
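```python
def chat_template(example):
    # Wrap the raw instruction in the "### Instruction / ### Response" prompt
    # format; the output column is left unchanged.
    example["instruction"] = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return example
```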
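To make the effect of the template concrete, the sketch below applies it to a single made-up row. The row contents are hypothetical and are not taken from the dataset. The card does not say how the function was applied; with the `datasets` library it would presumably be passed to `Dataset.map`, but that is an assumption.

```python
def chat_template(example):
    # Same function as in the card, repeated so this snippet runs on its own.
    example["instruction"] = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return example


# Hypothetical row, for illustration only.
row = {"instruction": "What is 2 + 2?", "output": "4"}

# Pass a copy so the original row is not modified in place.
formatted = chat_template(dict(row))
print(formatted["instruction"])
```

Because the template is already baked into the `instruction` column, the prompt format should not be applied a second time when training on this dataset.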
## Uses
For SFT LLM training.
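As a rough sketch of how the rows might be prepared for SFT, the snippet below joins the pre-formatted `instruction` column with `output` into a single training string. The helper names `to_sft_text` and `load_mini_platypus` are hypothetical, and the `</s>` end-of-sequence token is an assumption that matches Llama-2 style tokenizers; use the EOS token of your own model. Loading requires the `datasets` library and network access, so the import is kept inside the function.

```python
def to_sft_text(example, eos_token="</s>"):
    """Concatenate prompt and answer into one training string.

    Assumes example["instruction"] already contains the
    "### Instruction / ### Response" template, as described in the card.
    The eos_token default is an assumption (Llama-2 style).
    """
    return example["instruction"] + example["output"] + eos_token


def load_mini_platypus(split="train"):
    """Load the dataset from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset

    return load_dataset("garg-aayush/mini-platypus-1K", split=split)


# Hypothetical, already-formatted row for illustration.
sample = {
    "instruction": "### Instruction:\nWhat is 2 + 2?\n\n### Response:\n",
    "output": "4",
}
text = to_sft_text(sample)
```

With the real data, something like `[to_sft_text(row) for row in load_mini_platypus()]` would produce the list of training strings.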
Provider:
garg-aayush
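Summary of original information
Dataset overview
Dataset information
- Features: `instruction` (string), `output` (string).
- Splits: `train`, with 1000 examples and a data size of 4185802 bytes.
- Download size: 2263884 bytes.
- Dataset size: 4185802 bytes.
- Configs: `default`, with training data files at `data/train-*`.
- License: Apache-2.0.
- Task categories: text generation, text-to-text generation.
- Language: English.
- Size category: less than 1K.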
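Dataset description
This is a filtered Open-Platypus dataset containing 1000 examples for SFT training. The filtering steps are:
- Remove all examples with more than 2048 combined tokens.
- Remove duplicate examples by similarity, with a similarity threshold of 0.877.
- Apply top-k selection, keeping the 1000 examples with the most combined tokens.
The filtering steps are based on Maxime Labonne's Dataset creation for fine-tuning LLM.ipynb notebook.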
The dataset has been formatted using the following chat template function:
```python
def chat_template(example):
    example["instruction"] = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return example
```
Uses
For SFT LLM training.



