
fineinstructions/fineinstructions_nemotron

Source: Hugging Face. Updated: 2026-01-30. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/fineinstructions/fineinstructions_nemotron
Resource description:
---
language:
- en
---

[![FineInstructionsCoverImage](https://cdn-uploads.huggingface.co/production/uploads/61c40eeb727d1257bf3cf5ba/jSiXJ8FaogflCSRt_YirX.png)](https://huggingface.co/fineinstructions)

**✨ Note:** For all FineInstructions resources please visit: https://huggingface.co/fineinstructions

----

This dataset contains more than ~1B synthetic instruction-answer pairs (~300B tokens) created using the [FineInstructions pipeline](https://huggingface.co/fineinstructions). The pipeline was run over the raw pre-training documents in the Nemotron-CC pre-training corpus (a subset of high-quality documents from CommonCrawl). See our paper for more details.

Each `.parquet` file in the [`data` folder](https://huggingface.co/datasets/fineinstructions/fineinstructions_nemotron/tree/main/data) has a corresponding `judge-*.json` file that contains an automatic judgement score of the quality of each synthetic instruction-answer pair on a Likert scale (1-5), where 5 is the highest quality.

<!-- Autocitation -->

--------------------

If you use this project in your research please cite:

```
@article{patel2026fineinstructions,
  title={FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale},
  author={Patel, Ajay and Raffel, Colin and Callison-Burch, Chris},
  journal={arXiv preprint arXiv:2601.22146},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  doi={10.48550/arXiv.2601.22146}
}
```
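The 1-5 judge scores mentioned in the description can be used to keep only higher-quality pairs. The following is a minimal Python sketch, not the authors' tooling. The internal layout of the `judge-*.json` files and the parquet column names are not described here, so the sketch assumes the scores have already been loaded into a list aligned row by row with the pairs. The field names `instruction` and `answer`, the helper `filter_by_judge_score`, and the toy rows are all hypothetical.

```python
from typing import Dict, List, Sequence


def filter_by_judge_score(
    pairs: Sequence[Dict[str, str]],
    scores: Sequence[int],
    min_score: int = 4,
) -> List[Dict[str, str]]:
    """Keep the pairs whose judge score is at least `min_score`.

    Assumption: `scores[i]` is the Likert score (1-5) for `pairs[i]`.
    How that alignment is recovered from a real judge file is not
    specified in the dataset description.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    for score in scores:
        if not 1 <= score <= 5:
            raise ValueError(f"judge score out of range 1-5: {score}")
    return [pair for pair, score in zip(pairs, scores) if score >= min_score]


# Toy rows standing in for one parquet shard (hypothetical field names).
example_pairs = [
    {"instruction": "Summarize the passage.", "answer": "A short summary."},
    {"instruction": "List the key dates.", "answer": "1990, 2001."},
    {"instruction": "Explain the term.", "answer": "Unclear."},
]
# Toy scores standing in for the matching judge file.
example_scores = [5, 4, 2]

kept = filter_by_judge_score(example_pairs, example_scores, min_score=4)
print(len(kept))
```

Validating the score range up front makes a misaligned or malformed judge file fail loudly instead of silently dropping rows.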
Provider:
fineinstructions