nri-ai/nri-fin-reasoning
收藏Hugging Face2026-03-09 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/nri-ai/nri-fin-reasoning
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
language:
- ja
- en
task_categories:
- text-generation
tags:
- reasoning
- finance
- japanese
- sft
- thinking
size_categories:
- 100K<n<1M
dataset_info:
features:
- name: id
dtype: string
- name: model
dtype:
class_label:
names:
'0': openai/gpt-oss-120b
- name: language
dtype:
class_label:
names:
'0': en
'1': ja
- name: task
dtype:
class_label:
names:
'0': math
'1': mcqa
'2': openqa
'3': writing
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: thinking
dtype: string
splits:
- name: train
num_bytes: 22178381121
num_examples: 632636
download_size: 11919658034
dataset_size: 22178381121
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# nri-fin-reasoning
<div align="center">
<img src="assets/method_pipeline_en.png" alt="Method Pipeline" width="100%"/>
</div>
<div align="center" style="line-height: 1;">
<a href="https://huggingface.co/nri-ai" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-NRI--AI-005bac?color=005bac&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/datasets/nri-ai/nri-fin-reasoning/blob/main/docs/README.ja.md" style="margin: 2px;">
<img alt="Japanese" src="https://img.shields.io/badge/%F0%9F%87%AF%F0%9F%87%B5%20%E6%97%A5%E6%9C%AC%E8%AA%9E-README-005bac?color=005bac&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/C7-2.pdf" target="_blank" style="margin: 2px;">
<img alt="NLP2026" src="https://img.shields.io/badge/%F0%9F%93%9D%20NLP2026-Paper-005bac?color=005bac&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://arxiv.org/abs/2603.01353" target="_blank" style="margin: 2px;">
<img alt="arXiv" src="https://img.shields.io/badge/%F0%9F%93%9D%20arXiv-Paper-b31b1b?color=b31b1b&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
<div align="center" style="line-height: 1;">
<a href="https://creativecommons.org/licenses/by/4.0/" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-CC_BY_4.0-f5de53?color=f5de53" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
A Japanese instruction dataset with reasoning traces from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b), specialized for the financial domain.
## Overview
A large-scale dataset of 632,636 samples (~6.35 billion tokens), featuring multi-turn conversations (up to 3 turns) with explicit reasoning traces. Designed for supervised fine-tuning to improve LLM reasoning in the financial domain.
## Dataset Structure
### Fields
Each example contains:
- **`id`** (str): Unique identifier for the sample
- **`model`** (ClassLabel): Model used for response generation (`openai/gpt-oss-120b`)
- **`language`** (ClassLabel): Language of the sample (`en` or `ja`)
- **`task`** (ClassLabel): Task type (`math`, `mcqa`, `openqa`, or `writing`)
- **`messages`** (list): Conversation in standard chat format
- `role`: Either "user" or "assistant"
- `content`: The message content
- `thinking` (str, optional): The reasoning trace, present only in assistant messages
### Task Types
| Task | Description |
|------|-------------|
| `openqa` | Open-ended questions |
| `math` | Mathematical problems |
| `writing` | Writing requests |
| `mcqa` | Multiple-choice questions |
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
ds = load_dataset("nri-ai/nri-fin-reasoning", split="train")
# Example: Access a sample
sample = ds[0]
print(f"Language: {sample['language']}")
print(f"Task: {sample['task']}")
print(f"User: {sample['messages'][0]['content']}")
print(f"Thinking: {sample['messages'][1]['thinking']}")
print(f"Assistant: {sample['messages'][1]['content']}")
```
## Dataset Creation
### Topic Word Generation
To ensure comprehensive coverage of the financial domain, we curated topic words from multiple categories:
- **Financial industries**: Banking, insurance, securities, etc.
- **Financial instruments**: Bonds, ETFs, derivatives, etc.
- **Key technologies**: Blockchain, smart contracts, etc.
- **Professional certifications**: Financial Planner (FP), CPA, etc.
A total of 135 financial topic words were selected, with an additional 20 general-domain topics to maintain model versatility. For each topic, 10 related sub-topics were generated.
### Question Generation
For each sub-topic, user questions were generated across four types:
- **Open-ended questions** (10 per sub-topic)
- **Math problems** (10 per sub-topic)
- **Writing requests** (10 per sub-topic)
- **Multiple-choice questions** (8 per sub-topic)
### Question Expansion
To increase diversity, questions were expanded through:
- Context addition
- Style transformation
- Specialization to specific scenarios
- Modification to related topics
### Lexical Filtering
- N-gram filtering for repetition detection
- Word count filtering (minimum 10 words via MeCab tokenization)
- Fuzzy deduplication using MinHash and LSH
### Multi-turn Dialogue Generation
Responses and subsequent user turns are generated by the LLM, up to a maximum of three turns.
### LLM-as-a-Judge Filtering
Final quality filtering using gpt-oss-120b as judge, evaluating on 5 dimensions:
| Dimension | Evaluation Criteria |
|-----------|---------------------|
| Accuracy | Factual correctness, absence of misinformation |
| Relevance | Appropriate response to prompt, instruction following |
| Usefulness | Helpfulness, comprehensiveness of information |
| Reasoning | Quality of reasoning, logical consistency |
| Safety | Safety, ethics, appropriate expression |
Each dimension is scored 1-5 and only samples achieving 5/5 on all dimensions were retained.
## Models Trained on This Dataset
- [gpt-oss-20b-Ja-Fin-Thinking](https://huggingface.co/nri-ai/gpt-oss-20b-Ja-Fin-Thinking)
- [Qwen3-14B-Ja-Fin-Thinking](https://huggingface.co/nri-ai/Qwen3-14B-Ja-Fin-Thinking)
## Intended Use
This dataset can be used to post-train LLMs via supervised fine-tuning (SFT).
## Limitations
- **Domain specificity**: This dataset was generated with the financial domain in mind, and thus lacks coverage of other areas such as coding, STEM, and creative writing.
- **Synthetic data**: All instructions and responses were synthesized using LLMs and as a result may contain hallucinations despite filtering.
- **Multilingual coverage**: Primarily Japanese and English.
## Ethical Considerations
- Financial information in this dataset should not be used as professional financial advice without review by qualified experts.
- Users should evaluate model outputs for their specific use cases and seek professional guidance where appropriate.
## License
This dataset is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
## Privacy Notice
For details on how personal information is handled, please see the [Privacy Notice](https://huggingface.co/datasets/nri-ai/nri-fin-reasoning/blob/main/docs/PRIVACY_NOTICE.md) ([日本語](https://huggingface.co/datasets/nri-ai/nri-fin-reasoning/blob/main/docs/PRIVACY_NOTICE.ja.md)).
## Citation
```bibtex
@inproceedings{okochiDomainSpecificLLM2026,
author = {大河内 悠磨 and Sim, Fabio Milentiansen and 岡田 智靖},
title = {ドメイン特化LLMの推論能力向上を目的とした合成指示データセットの構築と金融ドメインにおける評価},
booktitle = {言語処理学会第32回年次大会 (NLP2026) },
year = {2026},
month = mar,
address = {Utsunomiya, Tochigi, Japan},
publisher = {言語処理学会},
note = {Paper ID: C7-2},
url = {https://www.anlp.jp/proceedings/annual_meeting/2026/pdf_dir/C7-2.pdf}
}
```
```bibtex
@misc{okochi2026constructingsyntheticinstructiondatasets,
title = {Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain},
author = {Yuma Okochi and Fabio Milentiansen Sim and Tomoyasu Okada},
year = {2026},
eprint = {2603.01353},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2603.01353}
}
```
## Acknowledgments
This dataset was developed with the support of the "GENIAC (Generative AI Accelerator Challenge)" project, implemented by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO), with the aim of strengthening Japan's development capabilities in generative AI.
提供机构:
nri-ai



