
EER6/nvidia-OpenCodeInstruct-broad

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/EER6/nvidia-OpenCodeInstruct-broad
Description:
---
license: apache-2.0
source_datasets:
- nvidia/OpenCodeInstruct
task_categories:
- text-generation
language:
- en
tags:
- code
- sft
- instruction-tuning
- filtered
size_categories:
- 1M<n<10M
---

# nvidia-OpenCodeInstruct-broad

A quality-filtered subset of [nvidia/OpenCodeInstruct](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) (5M examples).

## Filtering criteria

Both conditions must be satisfied:

| Criterion | Threshold |
|-----------|-----------|
| **LLM judge min score** | >= 4 (out of 5) |
| **Unit test pass rate** (`average_test_score`) | >= 0.8 |

**LLM judge min score** is the minimum across all three dimensions in the `llm_judgement` field:

- `requirement_conformance`: does the code do what the instruction asked?
- `logical_correctness`: is the algorithm/logic correct?
- `edge_case_consideration`: does it handle edge cases?

A min score >= 4 means *every* dimension scores at least 4/5.

**Unit test pass rate** (`average_test_score`) is the fraction of 10 LLM-generated unit tests that the solution passes. >= 0.8 means at least 8 out of 10 tests pass.

## Result

- **Source size:** 5,000,000
- **Filtered size:** 1,698,239
- **Retention rate:** 34.0%

[EER6/nvidia-OpenCodeInstruct-refined](https://huggingface.co/datasets/EER6/nvidia-OpenCodeInstruct-refined) is a strict subset of this dataset with tighter thresholds (llm_min = 5, test = 1.0, 444K examples).

## Schema

All original columns from `nvidia/OpenCodeInstruct` are preserved as-is, with no transforms or column additions. See the [original dataset card](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) for column descriptions and citation.
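The filtering criteria described in the card can be sketched as a per-row predicate. This is a minimal illustration, not the script used to build the dataset. The function names (`llm_min_score`, `passes_filter`) are hypothetical, and the layout of `llm_judgement` is an assumption: the sketch accepts either a dict or a JSON string, with each dimension holding a bare number or an object with a `score` key. It also assumes `average_test_score` is a number or a numeric string.

```python
import json
from collections.abc import Mapping
from typing import Any

# The three judge dimensions named in the dataset card.
JUDGE_DIMENSIONS = (
    "requirement_conformance",
    "logical_correctness",
    "edge_case_consideration",
)


def _score(value: Any) -> float:
    """Read one dimension's score.

    Assumption: the value is either a bare number or a mapping
    with a "score" key; the card does not specify the layout.
    """
    if isinstance(value, Mapping):
        value = value["score"]
    return float(value)


def llm_min_score(llm_judgement: Any) -> float:
    """Return the minimum score across the three judge dimensions."""
    if isinstance(llm_judgement, str):
        # Assumption: a string field holds JSON.
        llm_judgement = json.loads(llm_judgement)
    return min(_score(llm_judgement[dim]) for dim in JUDGE_DIMENSIONS)


def passes_filter(row: Mapping, min_judge: float = 4.0, min_test: float = 0.8) -> bool:
    """Both conditions must hold: judge minimum and unit test pass rate."""
    return (
        llm_min_score(row["llm_judgement"]) >= min_judge
        and float(row["average_test_score"]) >= min_test
    )


# Retention rate from the sizes reported in the card.
retention_pct = round(100 * 1_698_239 / 5_000_000, 1)
```

Calling `passes_filter(row, min_judge=5.0, min_test=1.0)` applies the tighter thresholds the card lists for the refined subset. A predicate of this shape could be passed to a row-wise filter over the source dataset, provided the actual column layout matches the assumptions above.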
Provider:
EER6