ontocord/Dolci-Instruct-SFT-decontaminated
Source: Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ontocord/Dolci-Instruct-SFT-decontaminated
Resource description:
---
dataset_info:
  config_name: default
  splits:
  - name: train
    num_examples: 2144770
license: apache-2.0
tags:
- decontaminated
---
# Dolci-Instruct-SFT-decontaminated
Decontaminated version of [allenai/Dolci-Instruct-SFT](https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT).
## Decontamination Details
- **Method**: 13-gram overlap detection
- **Original samples**: 2,152,112
- **Cleaned samples**: 2,144,770
- **Removed samples**: 7,342 (0.34%)
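
The counts above can be cross-checked with a few lines of arithmetic. This is only a consistency check on the figures stated in this card; the variable names are illustrative.

```python
# Figures taken from the "Decontamination Details" list.
original = 2_152_112
cleaned = 2_144_770

# Removed samples are the difference between the two counts.
removed = original - cleaned

# Share of the original dataset that was removed, in percent.
removed_pct = 100 * removed / original

print(removed, f"{removed_pct:.2f}%")  # prints: 7342 0.34%
```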
### Benchmarks Checked
MMLU, IFEval, ARC, COPA, LAMBADA, OpenBookQA, WinoGrande, BoolQ, HellaSwag, PIQA, GSM8K, ALERT, GPQA, MATH, MBPP, HumanEval, SimpleQA, CommonsenseQA, DoNotAnswer, AIME24, LiveCodeBench, MATH500
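
For readers unfamiliar with the technique, the following is a minimal sketch of how 13-gram overlap detection against a set of benchmark texts can work. It is not the pipeline used to build this dataset: the card does not specify tokenization, normalization, or the removal rule, so the lowercase word tokenizer, the "any shared 13-gram means contaminated" rule, and all function names here are assumptions made for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

N = 13  # n-gram size stated in the card

NGram = Tuple[str, ...]


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase word tokens; the card does not say how text is tokenized.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens: List[str], n: int = N) -> Set[NGram]:
    # All contiguous n-token windows; empty if the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = N) -> Set[NGram]:
    # Union of the n-grams of every benchmark text (questions, answers, prompts).
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(tokenize(text), n)
    return index


def is_contaminated(sample: str, index: Set[NGram], n: int = N) -> bool:
    # Assumption: a single shared n-gram is enough to flag a sample.
    return not ngrams(tokenize(sample), n).isdisjoint(index)


def decontaminate(samples: Iterable[str], index: Set[NGram], n: int = N) -> List[str]:
    # Keep only the samples that share no n-gram with the benchmark index.
    return [s for s in samples if not is_contaminated(s, index, n)]
```

In use, the index would be built once from the text of every benchmark listed above, and each training sample would then be checked against it. A set of tuples keeps the lookup simple; a production pipeline over millions of samples would more likely hash the n-grams to reduce memory.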
Provider:
ontocord



