
openeurollm/contaminated-documents

Source: Hugging Face. Updated 2025-11-25. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/openeurollm/contaminated-documents
Resource description:
---
dataset_info:
  config_name: nemotron_sample
  features:
  - name: warc_record_id
    dtype: string
  - name: file_part
    dtype: string
  - name: benchmark
    dtype: string
  - name: matched_ngram
    dtype: string
  - name: benchmark_text
    dtype: string
  - name: train
    dtype: string
  splits:
  - name: single_collision
    num_bytes: 189307298
    num_examples: 7428
  - name: all_collisions
    num_bytes: 1044885141
    num_examples: 32841
  download_size: 212808419
  dataset_size: 1234192439
configs:
- config_name: nemotron_sample
  data_files:
  - split: single_collision
    path: nemotron_sample/single_collision-*
  - split: all_collisions
    path: nemotron_sample/all_collisions-*
---

This repository will include the contaminated documents from Nemotron and HPLT, extracted using nemo-curator. The benchmarks are taken from [this spreadsheet](https://docs.google.com/spreadsheets/d/1uBji9fJFLdaaOnzYI71RGW2IiuHJ2vrmuHsXPIsx_rk/edit?gid=1345345034#gid=1345345034) and use the split that lm-evaluation-harness defines for benchmarking.
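As a rough illustration of how the records described above might be inspected, here is a minimal Python sketch. The repository id, config name (`nemotron_sample`), split name (`single_collision`), and column names come from the metadata; everything else is an assumption. In particular, the helper names (`count_by_benchmark`, `ngram_in_both`, `load_single_collisions`) are hypothetical, and the lowercase and whitespace normalisation in `ngram_in_both` is only a guess at how `matched_ngram` relates to `benchmark_text` and `train`. It is not the matching procedure nemo-curator itself uses.

```python
from collections import Counter
from typing import Iterable, Mapping


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalisation)."""
    return " ".join(text.lower().split())


def count_by_benchmark(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count how many contaminated records are attributed to each benchmark."""
    return Counter(row["benchmark"] for row in rows)


def ngram_in_both(row: Mapping[str, str]) -> bool:
    """Check whether the matched n-gram occurs in both the benchmark text
    and the training document, after the assumed normalisation above."""
    ngram = normalise(row["matched_ngram"])
    return ngram in normalise(row["benchmark_text"]) and ngram in normalise(row["train"])


def load_single_collisions():
    """Load the `single_collision` split from the Hugging Face Hub.

    Requires the `datasets` package and network access; imported lazily so
    the helpers above can be used on plain dictionaries without it.
    """
    from datasets import load_dataset

    return load_dataset(
        "openeurollm/contaminated-documents",
        "nemotron_sample",
        split="single_collision",
    )


if __name__ == "__main__":
    ds = load_single_collisions()
    # Show the benchmarks with the most contaminated records in this split.
    print(count_by_benchmark(ds).most_common(10))
```

The helpers accept any iterable of dictionaries with the listed columns, so they can be tried on a handful of hand-written rows before downloading the roughly 200 MB of data.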
Provided by:
openeurollm