EDINET-Bench

ModelScope community · updated 2026-01-09 · indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/SakanaAI/EDINET-Bench
Description:
# EDINET-Bench

📚 [Paper](https://arxiv.org/abs/2506.08762) | 📝 [Blog](https://sakana.ai/edinet-bench/) | 🧑‍💻 [Code](https://github.com/SakanaAI/EDINET-Bench)

EDINET-Bench is a Japanese financial benchmark designed to evaluate the performance of LLMs on challenging financial tasks, including accounting fraud detection, earnings forecasting, and industry prediction.
This dataset is built leveraging [EDINET](https://disclosure2.edinet-fsa.go.jp), a platform managed by the Financial Services Agency (FSA) of Japan that provides access to disclosure documents such as securities reports.

## Notice

- **June 9, 2025**: This dataset was originally released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Although section 1.7.3 of the Public Data License (PDL) 1.0 states that it is compatible with CC BY 4.0, we have relicensed the dataset under PDL 1.0 to ensure strict consistency with the original licensing terms of the source data.

## Resources

- 📃**Paper**: Read our paper for the detailed dataset construction pipeline and evaluation results at https://arxiv.org/abs/2506.08762
- 🏗️**Construction Code**: Create a new benchmark dataset at https://github.com/SakanaAI/edinet2dataset
- 📊**Evaluation Code**: Evaluate the performance of models on EDINET-Bench at https://github.com/SakanaAI/EDINET-Bench

## Dataset Construction Pipeline

<img src="EDINET-Bench.png" alt="Overview of EDINET-Bench" width="50%"/>

EDINET-Bench is built by downloading the past 10 years of annual reports of Japanese listed companies via EDINET-API and automatically annotating labels for each task.
For detailed information, please read our paper and code.

## How to Use

**Accounting fraud detection**

This task is a binary classification problem aimed at predicting whether a given annual report is fraudulent. The label is either fraud (1) or non-fraud (0).
The explanation includes the reasons why the LLM determined that the contents of the amended report are related to accounting fraud.

Sample Explanation:
```
この訂正有価証券報告書は明らかに会計不正に関連しています。提出理由の部分に「当社の元従業員が、複数年度に亘って、商品の不正持ち出し転売するなどの行為を行っていた事実が判明」と記載されており、「第三者委員会」を設置して調査を行ったことが明記されています。さらに「不適切な会計処理を訂正」という表現も使用されています。この不正行為により、連結財務諸表および財務諸表の数値に変更が生じており、訂正箇所として貸借対照表、損益計算書、連結キャッシュ・フロー計算書など財務諸表の主要部分が挙げられています。これは単なる記載ミスではなく、元従業員による不正行為に起因する重大な会計上の問題であることが明確です。

(The amended securities report is clearly related to accounting fraud. In the section stating the reason for the amendment, it is noted that "it was discovered that a former employee of the company had, over multiple fiscal years, engaged in misconduct such as unlawfully removing and reselling products." It is also clearly stated that a "third-party committee" was established to investigate the matter. Furthermore, the report uses expressions such as "correction of inappropriate accounting treatment." As a result of this misconduct, changes have been made to figures in both the consolidated financial statements and the individual financial statements. The corrected sections include major parts of the financial statements, such as the balance sheet, income statement, and consolidated cash flow statement. This is not merely a clerical error, but rather a serious accounting issue stemming from fraudulent actions by a former employee.)
```

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("SakanaAI/EDINET-Bench", "fraud_detection")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['meta', 'summary', 'bs', 'pl', 'cf', 'text', 'label', 'explanation', 'edinet_code', 'ammended_doc_id', 'doc_id', 'file_path'],
        num_rows: 865
    })
    test: Dataset({
        features: ['meta', 'summary', 'bs', 'pl', 'cf', 'text', 'label', 'explanation', 'edinet_code', 'ammended_doc_id', 'doc_id', 'file_path'],
        num_rows: 224
    })
})
```

**Earnings forecast**

This task is a binary classification problem that predicts whether a company's earnings will increase or decrease in the next fiscal year based on its current annual report.
The label is either increase (1) or not (0).

```python
>>> from datasets import load_dataset
>>> ds = load_dataset("SakanaAI/EDINET-Bench", "earnings_forecast")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['meta', 'summary', 'bs', 'pl', 'cf', 'text', 'label', 'naive_prediction', 'edinet_code', 'doc_id', 'previous_year_file_path', 'current_year_file_path'],
        num_rows: 549
    })
    test: Dataset({
        features: ['meta', 'summary', 'bs', 'pl', 'cf', 'text', 'label', 'naive_prediction', 'edinet_code', 'doc_id', 'previous_year_file_path', 'current_year_file_path'],
        num_rows: 451
    })
})
```

**Industry prediction**

This task is a multi-class classification problem that predicts a company's industry type (e.g., Banking) based on its current annual report.
Each label (in this case, the industry column) represents one of 16 possible industry types.
```python
>>> from datasets import load_dataset
>>> ds = load_dataset("SakanaAI/EDINET-Bench", "industry_prediction")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['meta', 'summary', 'bs', 'pl', 'cf', 'text', 'industry', 'edinet_code', 'doc_id', 'file_path'],
        num_rows: 496
    })
})
```

## Limitation

- **Mislabeling**: When constructing the benchmark dataset for the accounting fraud detection task, we assume that only cases explicitly reported as fraudulent are labeled as such, while all others are considered non-fraudulent. However, there may be undiscovered fraud cases that remain unreported, introducing potential label noise into the dataset.
Additionally, our fraud examples are constructed by having the LLM read the contents of the amended reports and determine whether they are related to fraudulent activities. Because LLMs are prone to hallucination and do not always follow instructions, there is a risk that some cases may be incorrectly identified as fraudulent.
- **Intrinsic difficulty**: Among the tasks in our benchmark, the fraud detection and earnings forecasting tasks may be intrinsically challenging with a performance upper bound, as the LLM relies solely on information from a single annual report for its predictions.
Future research directions could explore the development of benchmark task designs that enable the model to utilize information beyond the annual report with novel agentic pipelines.

## LICENSE

EDINET-Bench is licensed under the [PDL 1.0](https://www.digital.go.jp/resources/open_data/public_data_license_v1.0) in accordance with [EDINET's Terms of Use](https://disclosure2dl.edinet-fsa.go.jp/guide/static/disclosure/WZEK0030.html).

## ⚠️ Warnings

EDINET-Bench is intended solely for advancing LLM applications in finance and must not be used to target or harm any real companies included in the dataset.
## Citation

```
@misc{sugiura2025edinet,
  author={Issa Sugiura and Takashi Ishida and Taro Makino and Chieko Tazuke and Takanori Nakagawa and Kosuke Nakago and David Ha},
  title={{EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements}},
  year={2025},
  eprint={2506.08762},
  archivePrefix={arXiv},
  primaryClass={q-fin.ST},
  url={https://arxiv.org/abs/2506.08762},
}
```
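As a companion to the two binary tasks described under How to Use (fraud detection and earnings forecast), the following is a minimal sketch of how 0/1 predictions might be scored against the `label` column. It is not the official evaluation code, which lives in the repository linked under Resources. The helper names (`label_counts`, `accuracy`, `matthews_corrcoef`, `majority_baseline`) and the choice of metrics are assumptions made for illustration, and the toy rows only stand in for real dataset rows.

```python
from collections import Counter
from math import sqrt


def label_counts(rows, label_key="label"):
    """Count how often each label value occurs in an iterable of row dicts."""
    return Counter(row[label_key] for row in rows)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty split")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels.

    Returns 0.0 when the coefficient is undefined, for example when
    every prediction falls into a single class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def majority_baseline(train_labels, n):
    """Predict the most common training label for all n test rows."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Toy rows standing in for ds["train"] and ds["test"] (hypothetical data).
toy_train = [{"label": 0}, {"label": 0}, {"label": 1}]
toy_test = [{"label": 0}, {"label": 1}, {"label": 0}, {"label": 0}]

y_true = [row["label"] for row in toy_test]
y_pred = majority_baseline([row["label"] for row in toy_train], len(y_true))

print(label_counts(toy_test))
print(accuracy(y_true, y_pred))
print(matthews_corrcoef(y_true, y_pred))
```

With the real data, `y_true` could be taken as `list(ds["test"]["label"])` after one of the `load_dataset` calls shown above, and `y_pred` would be whatever 0/1 outputs your model produces. The majority-class baseline is included because a constant predictor can look reasonable on accuracy when the classes are imbalanced while scoring 0.0 on the correlation metric.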

Provider:
maas
Created:
2025-06-09