nvidia/Nemotron-SpecializedDomains-Finance-v1

Name: nvidia/Nemotron-SpecializedDomains-Finance-v1
Creator: nvidia
Published: 2026-03-11 01:14:55
License: 暂无描述

Hugging Face2026-03-11 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/nvidia/Nemotron-SpecializedDomains-Finance-v1

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en license: - cc-by-4.0 task_categories: - text-generation - question-answering tags: - finance - financial-reasoning - sec-filings - synthetic-data - specialized-domains configs: - config_name: default data_files: - split: train path: data/train.jsonl --- ## Dataset Description Nemotron-SpecializedDomains-Finance is a large-scale synthetic financial question-answering dataset designed to improve LLM performance on specialized financial reasoning and document comprehension tasks. The dataset comprises 326K+ high-quality Q&A pairs generated from SEC filings of S&P 500 companies spanning 2019-2024. This dataset is ready for commercial use. ### Overview The dataset leverages **template-based Synthetic Data Generation (SDG)** to create contextualized financial questions grounded in real regulatory documents. It processes SEC filings (10-K annual reports and 10-Q quarterly reports) from S&P 500 companies filed between 2019 and 2024, providing broad coverage across diverse industry sectors. ### Key Features - **Document-Grounded**: All questions and answers are anchored to specific sections of SEC filings, ensuring factual accuracy - **Domain Coverage**: Spans corporate finance, risk factors, financial performance, governance, regulatory compliance, and business operations - **High Quality**: Filtered using GenSelect methodology to ensure coherent, accurate, and contextually appropriate responses - **SFT-Ready**: Pre-formatted in conversational structure (system/user/assistant messages) for supervised fine-tuning ### Generation Pipeline The dataset was created through a 6-stage template-based SDG process using seed questions from the **[SecQue benchmark](https://huggingface.co/datasets/nogabenyoash/SecQue)** (565 validated financial questions): 1. **Seed Data Creation**: Download SecQue benchmark questions and S&P 500 company metadata with industry classifications 2. **Question Generation**: Adapt seed questions to different companies and fiscal periods, maintaining question structure while updating company-specific details 3. **Context Mapping**: Map questions to relevant SEC filing sections based on original SecQue mappings (e.g., if a SecQue question mapped to Item 1A of a 10-K, the adapted question maps to Item 1A of the target company's 10-K) 4. **Answer Generation**: Generate multiple candidate answers (5 variations with different random seeds) using GPT-OSS-120B, grounded in the mapped filing sections 5. **Answer Selection (GenSelect)**: Use a larger evaluation model to select the best answer from the 5 candidates based on accuracy, coherence, and context alignment 6. **Quality Filtering**: Remove unanswerable questions and low-quality responses through automated verification This approach combines the validated question patterns from SecQue with broad coverage across S&P 500 companies spanning diverse industry sectors, ensuring both question quality and dataset diversity. Each sample includes reasoning traces, teaching models to analyze financial documents systematically before providing answers. ## How to Use It ```python from datasets import load_dataset # Load the dataset dataset = load_dataset("nvidia/Nemotron-SpecializedDomains-Finance") # Access splits train_data = dataset["train"] val_data = dataset["validation"] # Example: Inspect a sample sample = train_data[0] print("Messages:", sample["messages"]) print("Metadata:", sample["metadata"]) # Example: Extract user query and assistant response for message in sample["messages"]: if message["role"] == "user": print("Question:", message["content"][:200]) # First 200 chars elif message["role"] == "assistant": print("Answer:", message["content"][:200]) ``` ## Dataset Owner(s) NVIDIA Corporation ## Dataset Creation Date Created on: 12/01/2025 Last Modified on: 02/03/2026 ## License / Terms of Use This dataset is governed by the Creative Commons Attribution 4.0 International License (CC BY 4.0). ## Intended Usage This dataset is intended for LLM engineers, research teams, and financial AI developers working on: - **Supervised fine-tuning (SFT)** of foundation models for financial domain expertise - **Domain-specific reasoning**: Training models to understand financial terminology, regulatory language, and corporate disclosure patterns - **Document comprehension**: Improving model capabilities for analyzing long-form financial documents (10-K and 10-Q reports) - **Financial QA systems**: Building AI assistants for investment research, compliance analysis, and financial advisory - **Evaluation**: Benchmarking model performance on specialized financial reasoning tasks - **Research**: Studying domain adaptation techniques for highly specialized domains The dataset is particularly suitable for teams developing financial copilots, research assistants, compliance tools, and automated financial analysis systems. ## Dataset Characterization **Data Collection Method** Hybrid: Automated, Synthetic Source documents (SEC filings from S&P 500 companies) were collected through automated download from EDGAR (SEC's Electronic Data Gathering, Analysis, and Retrieval system). Seed questions were sourced from the SecQue benchmark, a curated set of 565 validated financial questions. The question-answer pairs were synthetically generated using template-based SDG methodology powered by large language models (GPT-OSS-120B for answer generation). **Labeling Method** Synthetic All question-answer pairs were generated using template-based Synthetic Data Generation (SDG). The methodology involves: 1. Downloading seed questions from SecQue benchmark and company metadata 2. Adapting seed questions to different companies and fiscal years 3. Mapping questions to relevant context sections from SEC filings based on SecQue section mappings 4. Generating multiple candidate answers (5 variations) using large language models (GPT-OSS-120B) 5. Selecting best answers using GenSelect methodology with a larger evaluation model 6. Filtering low-quality responses through automated quality checks and verification against source documents ## Dataset Format Modality: Text Format: JSONL Structure: Each sample contains: - `messages`: Array with role-based conversation structure (system, user, assistant) - System message (empty in this dataset) - User message with financial document context and question - Assistant message with reasoning and answer - `metadata`: Contains UUID and SDG model identifier (e.g., "openai/gpt-oss-120b") ## Dataset Quantification | Subset | Samples | |--------|---------| | train | 326,698 | | Total | 326,698 | Total Data Storage: ~ 20GB ## Reference(s) - **Source Documents**: SEC EDGAR filings (10-K and 10-Q) from S&P 500 companies (2019-2024) - **Seed Questions**: [SecQue benchmark](https://huggingface.co/datasets/nogabenyoash/SecQue) - 565 validated financial questions from real SEC filings - **GenSelect Methodology**: [arXiv:2507.17797](https://arxiv.org/abs/2507.17797) - Answer selection approach using LLM-as-a-judge ## Ethical Considerations NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)

提供机构：

nvidia

5,000+

优质数据集

54 个

任务类型

进入经典数据集