collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format-under4096

Name: collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format-under4096
Creator: collabo-research
Published: 2025-07-04 08:22:52
License: No description available

Hugging Face | Updated 2025-07-04 | Indexed 2025-07-05

Download link:

https://hf-mirror.com/datasets/collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format-under4096

Description:

This is a token-length-filtered dataset containing only samples with fewer than 4096 tokens. It is derived from the collabo-research/Moxin-sft-reasoning-dataset-en-32kfiltered-chat-format dataset and was filtered using the collabo-research/Moxin-7B-Instruct-hf tokenizer. The source dataset contained 196807 samples; 66753 remain after filtering and 130054 were removed. Features include prompt (the formatted prompt), token_length (the number of tokens in the prompt), and all original columns from the source dataset. The dataset is split into training and validation sets.
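The filtering described above can be illustrated with a minimal sketch. It is not the creators' actual script: the whitespace word counter is a stand-in for the real tokenizer so the example runs without downloads, the helper names are hypothetical, and the rows are assumed to already carry a formatted prompt column.

```python
from typing import Callable, Dict, Iterable, List

MAX_TOKENS = 4096  # threshold stated in the description


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def filter_by_token_length(
    rows: Iterable[Dict],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> List[Dict]:
    """Keep rows whose prompt has fewer than max_tokens tokens.

    Each kept row retains its original columns and gains a token_length
    column, mirroring the features listed in the description.
    """
    kept = []
    for row in rows:
        n = count_tokens(row["prompt"])
        if n < max_tokens:  # strictly fewer than the limit
            kept.append({**row, "token_length": n})
    return kept


# Consistency check on the counts reported in the description.
ORIGINAL, KEPT = 196807, 66753
REMOVED = ORIGINAL - KEPT  # equals the reported 130054
```

To approximate the real pipeline, the stand-in counter could be replaced with one built on the Hugging Face transformers library, for example by loading `AutoTokenizer.from_pretrained("collabo-research/Moxin-7B-Instruct-hf")` and returning `len(tokenizer(text)["input_ids"])`. Whether the original filter counted special tokens is not stated, so exact counts may differ.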

Provider:

collabo-research
