nvidia/Nemotron-VLM-Dataset-v2

Source: Hugging Face | Updated: 2025-12-18 | Indexed: 2025-11-15
Download link:
https://hf-mirror.com/datasets/nvidia/Nemotron-VLM-Dataset-v2
Resource summary:

Following up on Llama Nemotron VLM Dataset V1 with 3 million samples, we are releasing the Nemotron VLM Dataset V2 with almost three times as many high-quality samples. This time, our focus was on three main areas: adding new data modalities like video, expanding our chain-of-thought reasoning data, and providing the community with a toolchain to generate OCR training data.

We discovered that to enhance performance further, our models needed to learn not only the correct answer but also the reasoning process behind it. Adding more targeted chain-of-thought datasets proved to be the key to breaking the plateau for various benchmarks.

With this release, we are broadening the dataset scope to allow for training more capable models. We added:

- New Modalities and Domains: We have added a substantial amount of new data covering UI understanding, complex charts, and diagrams. For the first time, we are also including video understanding tasks.
- Focus on Reasoning: We have been able to break benchmark plateaus by adding more chain-of-thought data, some of which we generated by auto-labeling thinking traces for existing samples. We found that providing those traces helped especially for samples that the previous model struggled with.
- Improved OCR: We further improved on the highly competitive OCR capabilities of our first VL model by adding an even larger variety of training samples, including multilingual data for six languages. Unfortunately, we cannot redistribute a large part of those samples, but we are releasing the data generation pipeline that we used, so you can generate all that OCR data with ground truth yourself! Check it out [here](https://github.com/NVIDIA-NeMo/Curator/tree/experimental/experimental/nvpdftex).

In the table below, you can see all the subdatasets that we are publishing, with their sizes, properties, and a link to a subdataset card with more details.
For each subdataset, we are publishing the annotations/labels, which we generated using various strategies (see the Source & Processing column). The actual media data (images and videos) can only be redistributed for some of the datasets, according to their licenses. For the remaining ones, we provide instructions on how to obtain the data in each of the subdataset cards. All of the data is prepared to be used with our multi-modal data loader, Megatron Energon. For more details, see [this section](#loading-the-data-with-megatron-energon) below. This dataset is ready for commercial use.
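
As a rough illustration of fetching the published annotations, the sketch below uses `snapshot_download` from the `huggingface_hub` package. It is not taken from the dataset card: the helper names `build_download_args` and `download`, the target directory, and the file patterns are all hypothetical, and the repository layout is not described on this page, so any pattern you pass should be checked against the subdataset cards first.

```python
from pathlib import Path

# Dataset name as given on this page.
REPO_ID = "nvidia/Nemotron-VLM-Dataset-v2"


def build_download_args(local_dir, patterns=None):
    """Assemble keyword arguments for huggingface_hub.snapshot_download.

    `patterns` is an optional list of glob patterns (hypothetical here,
    since the repository layout is not described on this page).
    """
    args = {
        "repo_id": REPO_ID,
        "repo_type": "dataset",  # the repository is a dataset, not a model
        "local_dir": str(Path(local_dir)),
    }
    if patterns:
        args["allow_patterns"] = list(patterns)
    return args


def download(local_dir, patterns=None):
    """Download the repository (or a filtered subset) and return its local path."""
    # Imported lazily so the helper above can be used without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_args(local_dir, patterns))
```

Calling `download("./nemotron_vlm_v2")` would fetch everything that is redistributable in the repository; media that cannot be redistributed still has to be obtained by following the instructions in the individual subdataset cards.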
Provider:
nvidia