nri-ai/nri-fin-reasoning

Name: nri-ai/nri-fin-reasoning
Creator: nri-ai
Published: 2026-03-09 02:28:48
License: CC BY 4.0

Source: Hugging Face. Updated 2026-03-09. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/nri-ai/nri-fin-reasoning

Resource description:

---
license: cc-by-4.0
language:
- ja
- en
task_categories:
- text-generation
tags:
- reasoning
- finance
- japanese
- sft
- thinking
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: id
    dtype: string
  - name: model
    dtype:
      class_label:
        names:
          '0': openai/gpt-oss-120b
  - name: language
    dtype:
      class_label:
        names:
          '0': en
          '1': ja
  - name: task
    dtype:
      class_label:
        names:
          '0': math
          '1': mcqa
          '2': openqa
          '3': writing
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
    - name: thinking
      dtype: string
  splits:
  - name: train
    num_bytes: 22178381121
    num_examples: 632636
  download_size: 11919658034
  dataset_size: 22178381121
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# nri-fin-reasoning

<div align="center">
  <img src="assets/method_pipeline_en.png" alt="Method Pipeline" width="100%"/>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/nri-ai" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-NRI--AI-005bac?color=005bac&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/datasets/nri-ai/nri-fin-reasoning/blob/main/docs/README.ja.md" style="margin: 2px;">
    <img alt="Japanese" src="https://img.shields.io/badge/%F0%9F%87%AF%F0%9F%87%B5%20%E6%97%A5%E6%9C%AC%E8%AA%9E-README-005bac?color=005bac&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/C7-2.pdf" target="_blank" style="margin: 2px;">
    <img alt="NLP2026" src="https://img.shields.io/badge/%F0%9F%93%9D%20NLP2026-Paper-005bac?color=005bac&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://arxiv.org/abs/2603.01353" target="_blank" style="margin: 2px;">
    <img alt="arXiv" src="https://img.shields.io/badge/%F0%9F%93%9D%20arXiv-Paper-b31b1b?color=b31b1b&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://creativecommons.org/licenses/by/4.0/" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/License-CC_BY_4.0-f5de53?color=f5de53" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

A Japanese instruction dataset with reasoning traces from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), specialized for the financial domain.

## Overview

A large-scale dataset of 632,636 samples (~6.35 billion tokens), featuring multi-turn conversations (up to 3 turns) with explicit reasoning traces. It is designed for supervised fine-tuning to improve LLM reasoning in the financial domain.

## Dataset Structure

### Fields

Each example contains:

- **`id`** (str): Unique identifier for the sample
- **`model`** (ClassLabel): Model used for response generation (`openai/gpt-oss-120b`)
- **`language`** (ClassLabel): Language of the sample (`en` or `ja`)
- **`task`** (ClassLabel): Task type (`math`, `mcqa`, `openqa`, or `writing`)
- **`messages`** (list): Conversation in standard chat format
  - `role`: Either "user" or "assistant"
  - `content`: The message content
  - `thinking` (str, optional): The reasoning trace, present only in assistant messages

### Task Types

| Task | Description |
|------|-------------|
| `openqa` | Open-ended questions |
| `math` | Mathematical problems |
| `writing` | Writing requests |
| `mcqa` | Multiple-choice questions |

## Usage

### Loading the Dataset

```python
from datasets import load_dataset

ds = load_dataset("nri-ai/nri-fin-reasoning", split="train")

# Example: access a sample
sample = ds[0]

# `language` and `task` are ClassLabel features, so each sample stores an
# integer id; int2str maps the id back to its label name (e.g. 1 -> "ja").
language = ds.features["language"].int2str(sample["language"])
task = ds.features["task"].int2str(sample["task"])

print(f"Language: {language}")
print(f"Task: {task}")
print(f"User: {sample['messages'][0]['content']}")
print(f"Thinking: {sample['messages'][1]['thinking']}")
print(f"Assistant: {sample['messages'][1]['content']}")
```

## Dataset Creation

### Topic Word Generation

To ensure comprehensive coverage of the financial domain, we curated topic words from multiple categories:

- **Financial industries**: Banking, insurance, securities, etc.
- **Financial instruments**: Bonds, ETFs, derivatives, etc.
- **Key technologies**: Blockchain, smart contracts, etc.
- **Professional certifications**: Financial Planner (FP), CPA, etc.

A total of 135 financial topic words were selected, with an additional 20 general-domain topics to maintain model versatility. For each topic, 10 related sub-topics were generated.

### Question Generation

For each sub-topic, user questions were generated across four types:

- **Open-ended questions** (10 per sub-topic)
- **Math problems** (10 per sub-topic)
- **Writing requests** (10 per sub-topic)
- **Multiple-choice questions** (8 per sub-topic)

### Question Expansion

To increase diversity, questions were expanded through:

- Context addition
- Style transformation
- Specialization to specific scenarios
- Modification to related topics

### Lexical Filtering

- N-gram filtering for repetition detection
- Word count filtering (minimum 10 words via MeCab tokenization)
- Fuzzy deduplication using MinHash and LSH

### Multi-turn Dialogue Generation

Responses and subsequent user turns are generated by the LLM, up to a maximum of three turns.

### LLM-as-a-Judge Filtering

Final quality filtering uses gpt-oss-120b as the judge, evaluating each sample on 5 dimensions:

| Dimension | Evaluation Criteria |
|-----------|---------------------|
| Accuracy | Factual correctness, absence of misinformation |
| Relevance | Appropriate response to prompt, instruction following |
| Usefulness | Helpfulness, comprehensiveness of information |
| Reasoning | Quality of reasoning, logical consistency |
| Safety | Safety, ethics, appropriate expression |

Each dimension is scored from 1 to 5, and only samples scoring 5/5 on every dimension were retained.
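
The retention rule above (keep a sample only if all five dimensions score 5) can be sketched as a small filter. This is an illustrative sketch, not the authors' pipeline: the `judge_scores` field, the lowercase dimension keys, and the record layout are hypothetical, since the dataset card does not describe the judge's output format.

```python
from typing import Dict, Iterable, List

# The five dimensions named in the judge table; the lowercase keys are an assumption.
DIMENSIONS = ("accuracy", "relevance", "usefulness", "reasoning", "safety")
MAX_SCORE = 5  # each dimension is scored 1-5


def passes_judge(scores: Dict[str, int]) -> bool:
    """Return True only if every dimension is present and scored MAX_SCORE."""
    return all(scores.get(dim) == MAX_SCORE for dim in DIMENSIONS)


def filter_samples(samples: Iterable[dict]) -> List[dict]:
    """Keep samples whose (hypothetical) `judge_scores` are perfect on all dimensions."""
    return [s for s in samples if passes_judge(s["judge_scores"])]


# Hypothetical judged records, for illustration only.
candidates = [
    {"id": "sample-a", "judge_scores": {d: 5 for d in DIMENSIONS}},
    {"id": "sample-b", "judge_scores": {**{d: 5 for d in DIMENSIONS}, "reasoning": 4}},
    {"id": "sample-c", "judge_scores": {"accuracy": 5, "relevance": 5}},  # missing scores
]

kept = filter_samples(candidates)
print([s["id"] for s in kept])
```

Treating a missing dimension as a failure is a design choice of this sketch: a sample the judge did not fully score is dropped rather than kept by default.
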

## Models Trained on This Dataset

- [gpt-oss-20b-Ja-Fin-Thinking](https://huggingface.co/nri-ai/gpt-oss-20b-Ja-Fin-Thinking)
- [Qwen3-14B-Ja-Fin-Thinking](https://huggingface.co/nri-ai/Qwen3-14B-Ja-Fin-Thinking)

## Intended Use

This dataset can be used to post-train LLMs via supervised fine-tuning (SFT).

## Limitations

- **Domain specificity**: This dataset was generated with the financial domain in mind, and thus lacks coverage of other areas such as coding, STEM, and creative writing.
- **Synthetic data**: All instructions and responses were synthesized using LLMs and as a result may contain hallucinations despite filtering.
- **Multilingual coverage**: Primarily Japanese and English.

## Ethical Considerations

- Financial information in this dataset should not be used as professional financial advice without review by qualified experts.
- Users should evaluate model outputs for their specific use cases and seek professional guidance where appropriate.

## License

This dataset is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.

## Privacy Notice

For details on how personal information is handled, please see the [Privacy Notice](https://huggingface.co/datasets/nri-ai/nri-fin-reasoning/blob/main/docs/PRIVACY_NOTICE.md) ([日本語](https://huggingface.co/datasets/nri-ai/nri-fin-reasoning/blob/main/docs/PRIVACY_NOTICE.ja.md)).

## Citation

```bibtex
@inproceedings{okochiDomainSpecificLLM2026,
  author    = {大河内悠磨 and Sim, Fabio Milentiansen and 岡田智靖},
  title     = {ドメイン特化LLMの推論能力向上を目的とした合成指示データセットの構築と金融ドメインにおける評価},
  booktitle = {言語処理学会第32回年次大会 (NLP2026)},
  year      = {2026},
  month     = mar,
  address   = {Utsunomiya, Tochigi, Japan},
  publisher = {言語処理学会},
  note      = {Paper ID: C7-2},
  url       = {https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/C7-2.pdf}
}
```

```bibtex
@misc{okochi2026constructingsyntheticinstructiondatasets,
  title         = {Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain},
  author        = {Yuma Okochi and Fabio Milentiansen Sim and Tomoyasu Okada},
  year          = {2026},
  eprint        = {2603.01353},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2603.01353}
}
```

## Acknowledgments

This dataset was developed with the support of the "GENIAC (Generative AI Accelerator Challenge)" project, implemented by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO), with the aim of strengthening Japan's development capabilities in generative AI.

Provided by:

nri-ai
