
dataset_4kinds

ModelScope community. Updated 2025-09-03; indexed 2025-09-06.
Download link:
https://modelscope.cn/datasets/Memect/dataset_4kinds
Resource description:
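# Financial Document Information Extraction Dataset

This dataset contains data for four financial document information extraction tasks. It is intended for training and evaluating the structured information extraction capabilities of large language models (LLMs) in the financial domain.

## Dataset Overview

| Dataset name | Chinese name | English name | Training set | Test set |
|------------|----------|----------|--------|--------|
| Motions_table | 股东大会议案表 | Shareholders' meeting motions table | 789 | 198 |
| meeting | 股东大会信息 | Shareholders' meeting information | 780 | 195 |
| forcast | 业绩预告 | Earnings forecast | 77 | 21 |
| HR_change | 人事变动 | Personnel change | 61 | 7 |

**Total: four categories of data**

## Detailed Dataset Description

The field names below are the original Chinese JSON keys; the English gloss in parentheses is for reading only.

### 1. Motions_table (shareholders' meeting motions table)

**Data format:** JSON array; each element contains the `instruction`, `input`, `output`, and `system` fields

**Output format:** JSON array; each motion contains the following fields:

- `应选人数` (number to be elected): number of candidates to be elected under the motion (string)
- `议案内容` (motion content): the specific content of the motion (string)
- `议案序号` (motion number): the motion's number, such as 1, 2, or 12.1 (string)
- `累计投票制是否适用` (cumulative voting applicable): whether cumulative voting applies (enum: 是/否, i.e. yes/no)

**Directory structure:**

```
Motions_table/
├── train/
│   └── data.json (789 entries)
├── eval/
│   └── results.json (evaluation results)
├── test/
│   └── test.json (198 entries)
└── schema.json
```

### 2. meeting (shareholders' meeting information)

**Data format:** JSON array; each element contains the `instruction`, `input`, `output`, and `system` fields

**Output format:** JSON object with the following fields:

- `A股股东资格登记日期` (A-share shareholder record date): date, format YYYY-MM-DD
- `交易系统投票日期` (trading-system voting date): date, format YYYY-MM-DD
- `会议召开地点` (meeting venue): venue text
- `会议召开时间` (meeting date): date, format YYYY-MM-DD
- `参会登记日期截止日期` (attendance registration deadline): date, format YYYY-MM-DD
- `参会登记起始日` (attendance registration start date): date, format YYYY-MM-DD
- `网络投票终止日` (online voting end date): date, format YYYY-MM-DD
- `网络投票起始日` (online voting start date): date, format YYYY-MM-DD
- `股东大会名称` (meeting name): name of the shareholders' meeting, as text
- `股东大会名称(英文)` (meeting name, English): the English name
- `股东大会类别` (meeting category): category of the shareholders' meeting, as text
- `股东大会类别编码` (meeting category code): numeric code, stored as a string

**Directory structure:**

```
meeting/
├── train/
│   └── data.json (780 entries)
├── eval/
│   └── results.json
├── test/
│   └── test.json (195 entries)
└── schema.json
```

### 3. forcast (earnings forecast)

**Data format:** JSON array; each element contains the `instruction`, `input`, `output`, and `system` fields

**Output format:** JSON object with the following fields:

- `业绩变化原因` (reason for the change in results): descriptive text
- `业绩类型` (forecast type): text, such as "预计扭亏" (expected return to profit) or "业绩预增" (expected profit increase)
- `净利润描述` (net profit description): descriptive text
- `净利润(上年)(万元)` (net profit, previous year, in units of 10,000 yuan): previous year's net profit (a ± sign is supported)
- `扣非后净利润(上年)(万元)` (net profit after deducting non-recurring items, previous year, in units of 10,000 yuan): previous year's net profit after deducting non-recurring gains and losses
- `报告年度` (reporting year): reporting-year date
- `本期扣非前净利润上限(万元)`: upper bound of current-period net profit before deducting non-recurring items (10,000 yuan)
- `本期扣非前净利润下限(万元)`: lower bound of current-period net profit before deducting non-recurring items (10,000 yuan)
- `本期扣非后净利润上限(万元)`: upper bound of current-period net profit after deducting non-recurring items (10,000 yuan)
- `本期扣非后净利润下限(万元)`: lower bound of current-period net profit after deducting non-recurring items (10,000 yuan)

**Directory structure:**

```
forcast/
├── train/
│   └── data.json (77 entries)
├── eval/
│   └── results.json
├── test/
│   └── test.json (21 entries)
└── schema.json
```

### 4. HR_change (executive changes)

**Data format:** JSON array; each element contains the `instruction`, `input`, `output`, and `system` fields

**Output format:** JSON array; each executive change event contains the following fields:

- `离职高管姓名` (departing executive's name): full name of the departing executive
- `离职高管职务` (departing executive's position): the specific position held before departure
- `离职高管性别` (departing executive's gender): gender (enum: 男/女, i.e. male/female)
- `继任者姓名` (successor's name): full name of the successor (empty if none has been appointed)
- `继任者职务` (successor's position): the position the successor will hold
- `继任者性别` (successor's gender): gender of the successor (enum: 男/女, i.e. male/female)
- `离职原因` (reason for departure): the specific reason for departure given in the announcement

**Directory structure:**

```
HR_change/
├── train/
│   └── data.json (61 entries)
├── eval/
│   ├── results-1.json
│   ├── results-2.json
│   └── results-3.json
├── test/
│   └── test.json (7 entries)
└── schema.json
```

## Data Format

All four datasets use the same training format:

```json
{
  "instruction": "Task instruction",
  "input": "Text of the source document",
  "output": "Structured extraction result (JSON format)",
  "system": "System prompt"
}
```

## Usage

1. **Training data**: use the `train/data.json` file for model training
2. **Validation data**: use the files under the `eval/` directory for model validation
3. **Test results**: the files under the `eval/` directory store test results
4. **Schema file**: the `schema.json` in each dataset directory defines the output format specification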
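The unified record format described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the helper names `load_records` and `parse_output` are hypothetical, and it assumes each split file is a single JSON array as the card describes. The card does not say whether `output` is stored as a JSON-encoded string or as a nested value, so the sketch accepts both.

```python
import json
from pathlib import Path

# The four fields the dataset card lists for every record.
REQUIRED_FIELDS = ("instruction", "input", "output", "system")


def load_records(path):
    """Read one split file (a JSON array of records) and check the four fields."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    for i, record in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    return records


def parse_output(record):
    """Return the structured result, decoding it first if it is a JSON string.

    Assumption: `output` may be either a JSON-encoded string or an already
    nested list/dict; the dataset card does not specify which.
    """
    output = record["output"]
    return json.loads(output) if isinstance(output, str) else output
```

For example, `load_records("Motions_table/train/data.json")` should return the training records if the download follows the directory structure shown above, and `parse_output(records[0])` would then give that record's list of motions.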

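Because the `meeting` task specifies YYYY-MM-DD for all of its date fields, a small format check can help when scoring model predictions. The sketch below is only an illustration: the function names are hypothetical, the keys are the dataset's original Chinese field names for the seven date fields, and treating a missing or empty date as acceptable is an assumption, since the card does not say how absent dates are represented.

```python
import re
from datetime import datetime

# The seven date fields of the `meeting` output (original Chinese keys).
MEETING_DATE_FIELDS = (
    "A股股东资格登记日期",
    "交易系统投票日期",
    "会议召开时间",
    "参会登记日期截止日期",
    "参会登记起始日",
    "网络投票终止日",
    "网络投票起始日",
)

# Strict shape check: four digits, two digits, two digits.
_DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """True if value is a zero-padded YYYY-MM-DD string naming a real date."""
    if not isinstance(value, str) or not _DATE_SHAPE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects e.g. 2025-02-30
    except ValueError:
        return False
    return True


def invalid_date_fields(meeting):
    """List the date fields whose non-empty value is not a valid date."""
    bad = []
    for field in MEETING_DATE_FIELDS:
        value = meeting.get(field)
        if value is None or value == "":
            continue  # assumption: absent or empty dates are allowed
        if not is_iso_date(value):
            bad.append(field)
    return bad
```

The regular expression is there because `strptime` alone accepts unpadded values such as `2025-5-2`; combining the two keeps the check aligned with the stated format.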
Provider:
maas
Created:
2025-09-03