openeurollm/contaminated-documents
Source: Hugging Face. Updated: 2025-11-25. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/openeurollm/contaminated-documents
Resource description:
---
dataset_info:
  config_name: nemotron_sample
  features:
  - name: warc_record_id
    dtype: string
  - name: file_part
    dtype: string
  - name: benchmark
    dtype: string
  - name: matched_ngram
    dtype: string
  - name: benchmark_text
    dtype: string
  - name: train
    dtype: string
  splits:
  - name: single_collision
    num_bytes: 189307298
    num_examples: 7428
  - name: all_collisions
    num_bytes: 1044885141
    num_examples: 32841
  download_size: 212808419
  dataset_size: 1234192439
configs:
- config_name: nemotron_sample
  data_files:
  - split: single_collision
    path: nemotron_sample/single_collision-*
  - split: all_collisions
    path: nemotron_sample/all_collisions-*
---
This repository will include the contaminated documents from Nemotron and HPLT, extracted using nemo-curator.
The benchmarks are taken from [this spreadsheet](https://docs.google.com/spreadsheets/d/1uBji9fJFLdaaOnzYI71RGW2IiuHJ2vrmuHsXPIsx_rk/edit?gid=1345345034#gid=1345345034), and each one uses the split that lm-evaluation-harness defines for benchmarking.
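As a rough illustration of how the records might be inspected, the sketch below counts contaminated records per benchmark. It assumes the Hugging Face `datasets` library is installed for the loading step and that the repository id matches the download link above. The helper names `count_by_benchmark` and `load_split` are hypothetical, and the benchmark names in the sample rows are made up; only the `benchmark` column name comes from the metadata.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_benchmark(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count contaminated records per value of the `benchmark` column."""
    return Counter(row["benchmark"] for row in rows)


def load_split(split: str = "single_collision"):
    """Stream one split from the Hub.

    Assumption: requires `pip install datasets` and network access.
    """
    # Imported lazily so count_by_benchmark works without the library.
    from datasets import load_dataset

    return load_dataset(
        "openeurollm/contaminated-documents",
        "nemotron_sample",
        split=split,
        streaming=True,  # avoids downloading the full split up front
    )


if __name__ == "__main__":
    # Hypothetical rows that only mimic the schema; the values are invented.
    sample_rows = [
        {"benchmark": "benchmark_a", "matched_ngram": "example n-gram one"},
        {"benchmark": "benchmark_b", "matched_ngram": "example n-gram two"},
        {"benchmark": "benchmark_a", "matched_ngram": "example n-gram three"},
    ]
    print(count_by_benchmark(sample_rows))
```

To run the same count on real data, pass the result of `load_split()` to `count_by_benchmark`; with streaming enabled this iterates over the split once without storing it locally.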
Provider:
openeurollm



