collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format-under4096
Source: Hugging Face | Updated: 2025-07-04 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format-under4096
Description:
This is a token-length-filtered dataset containing only samples with fewer than 4096 tokens. It is derived from the collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format dataset and was filtered with the collabo-research/Moxin-7B-Instruct-hf tokenizer. The source dataset contained 196807 samples; 66753 remain after filtering and 130054 were removed. The features are prompt (the formatted prompt), token_length (the number of tokens in the prompt), and all original columns from the source dataset. The dataset is split into training and validation sets.
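The filtering step described above can be illustrated with a minimal sketch. The listing does not publish the actual filtering script, so everything here is an assumption: the function names are hypothetical, and a whitespace word counter stands in for the real tokenizer so the example runs without downloads. In practice one would presumably count tokens with the collabo-research/Moxin-7B-Instruct-hf tokenizer instead.

```python
from typing import Callable, Dict, List

MAX_TOKENS = 4096  # threshold stated in the dataset description


def whitespace_token_count(text: str) -> int:
    """Hypothetical stand-in for the real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def filter_by_token_length(
    rows: List[Dict],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[Dict]:
    """Keep rows whose prompt has fewer than max_tokens tokens.

    Each kept row retains its original columns and gains a token_length column,
    mirroring the features listed in the description.
    """
    kept = []
    for row in rows:
        n = count_tokens(row["prompt"])
        if n < max_tokens:  # strictly fewer than the threshold
            kept.append({**row, "token_length": n})
    return kept


# Toy data (not from the dataset): one short prompt and one over-length prompt.
rows = [
    {"prompt": "short example prompt", "source": "toy"},
    {"prompt": "word " * 5000, "source": "toy"},
]
kept = filter_by_token_length(rows)
removed = len(rows) - len(kept)

# Consistency check on the counts reported in the description.
reported_removed = 196807 - 66753
```

The counts in the description are self-consistent: 196807 source samples minus 66753 kept samples gives the 130054 removed samples it reports.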
Provider:
collabo-research



