imranarshad01/ethizo-nb-sft-gemma4-8k-thinking-20260423
Source: Hugging Face | Updated: 2026-04-23 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/imranarshad01/ethizo-nb-sft-gemma4-8k-thinking-20260423
Description:
The Gemma 4 8k Thinking Dataset is an English text-generation dataset covering medical, clinical, tool-calling, reasoning, and thinking use cases. It contains 15,000 training rows and 2,101 validation rows, with every row strictly capped at 8,192 tokens. The data was converted and filtered to maintain quality and a reasonable distribution. The conversion rules are: retain assistant messages that contain real thinking blocks, drop any train row whose synthetic_pid appears in validation, keep only clean message shapes, convert assistant thinking blocks into Gemma-style reasoning plus visible content, add chat_template_kwargs to every row, and sample the train data proportionally to preserve the source distribution. The dataset also provides a detailed statistical summary, including the number of usable train and validation rows before sampling, the final row counts, and how overlaps and replacements were handled during processing.
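Two of the rules above, dropping train rows whose synthetic_pid appears in validation and sampling train data proportionally, can be illustrated with a minimal Python sketch. This is not the dataset author's pipeline. The `synthetic_pid` field comes from the description, while the `source` grouping field, the function names, and the largest-remainder allocation are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def drop_validation_overlap(train_rows, val_rows, key="synthetic_pid"):
    """Remove train rows whose `key` value also appears in validation."""
    val_ids = {row[key] for row in val_rows}
    return [row for row in train_rows if row[key] not in val_ids]


def proportional_sample(rows, target, group_key="source", seed=0):
    """Sample `target` rows while keeping each group's share of the total.

    `group_key` is a hypothetical field naming the origin of each row;
    the real dataset may label its sources differently.
    """
    if target >= len(rows):
        return list(rows)

    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Largest-remainder allocation: floor each group's exact share, then
    # hand the leftover slots to the groups with the largest fractions,
    # so the quotas sum exactly to `target`.
    exact = {g: target * len(m) / len(rows) for g, m in groups.items()}
    quota = {g: int(share) for g, share in exact.items()}
    leftover = target - sum(quota.values())
    by_fraction = sorted(exact, key=lambda g: exact[g] - quota[g], reverse=True)
    for g in by_fraction[:leftover]:
        quota[g] += 1

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = []
    for g, members in groups.items():
        sampled.extend(rng.sample(members, quota[g]))
    return sampled
```

Running the overlap filter before sampling means the proportions are computed on the rows that are actually eligible for the final train split.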
Provider:
imranarshad01



