
finben-finer-ord

ModelScope community, updated 2025-06-05, indexed 2025-03-08
Download link:
https://modelscope.cn/datasets/TheFinAI/finben-finer-ord
Description:
# Dataset Card for FinBen-FiNER-ORD

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/TheFinAI/finben-finer-ord
- **Repository:** https://huggingface.co/datasets/TheFinAI/finben-finer-ord/edit/main/README.md
- **Paper:** FinBen: A Holistic Financial Benchmark for Large Language Models
- **Leaderboard:** https://huggingface.co/spaces/finosfoundation/Open-Financial-LLM-Leaderboard

### Dataset Summary

FinBen-FiNER-ORD is a financial named entity recognition (NER) dataset adapted from **FiNER-ORD (Shah et al., 2023b)**. The dataset is designed for training and evaluating large language models (LLMs) on financial text entity recognition tasks. The dataset includes necessary label columns and instructions to enhance its usability for LLM-based training and evaluation.

### Supported Tasks and Leaderboards

- **Task:** Named Entity Recognition (NER)
- **Evaluation Metric:** Entity F1 Score
- **Test Size:** 1080 instances

### Languages

- English

## Dataset Structure

### Data Instances

Each instance consists of a list of tokens along with their corresponding entity labels. The annotation follows the **BIO** tagging format:

- **B-PER, B-LOC, B-ORG**: Indicates the beginning of an entity (Person, Location, Organization).
- **I-PER, I-LOC, I-ORG**: Indicates the continuation of an entity.
- **O**: Indicates a token that does not belong to any named entity category.

### Data Fields

- `id`: A unique identifier for each data instance.
- `query`: The input text that the model processes.
- `answer`: The expected response or annotation.
- `label`: The sequence of labels for each token.
- `token`: The tokenized version of the query text.

### Data Splits

The dataset is split into:

- **Test:** 1080 instances

## Dataset Creation

### Curation Rationale

The dataset is adapted from **FiNER-ORD (Shah et al., 2023b)** to improve its suitability for LLM-based NER tasks by adding instruction and label columns for better training and evaluation.

### Source Data

#### Initial Data Collection and Normalization

The dataset originates from financial documents and articles containing named entities relevant to financial contexts.

#### Who are the source language producers?

Financial analysts, researchers, and automated data extraction systems.

### Annotations

#### Annotation Process

Annotations follow the BIO tagging scheme, where entities are labeled manually and reviewed for accuracy.

#### Who are the annotators?

Trained annotators with expertise in financial document analysis.

### Personal and Sensitive Information

No personally identifiable information (PII) is included.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset enhances financial NLP capabilities, allowing more accurate extraction of named entities in financial texts.

### Discussion of Biases

Potential biases may exist due to:

- Overrepresentation of specific financial sectors.
- Linguistic biases in the original dataset.

### Other Known Limitations

- May require domain-specific fine-tuning.
- Lacks multilingual support.

## Additional Information

### Dataset Curators

- The Fin AI Community

### Licensing Information

- **License:** CC BY-NC 4.0

### Citation Information

```bibtex
@article{shah2023finer,
  title={FiNER: Financial Named Entity Recognition Dataset and Weak-Supervision Model},
  author={Shah, Agam and Vithani, Ruchit and Gullapalli, Abhinav and Chava, Sudheer},
  journal={arXiv preprint arXiv:2302.11157},
  year={2023}
}
```

**Adapted Version (FinBen-FiNER-ORD):**

```bibtex
@article{xie2024finben,
  title={FinBen: A Holistic Financial Benchmark for Large Language Models},
  author={Xie, Qianqian and others},
  journal={arXiv preprint arXiv:2402.12659},
  year={2024}
}
```
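
### Example: BIO Decoding and Entity F1

The BIO format described under Data Instances and the Entity F1 metric listed under Supported Tasks can be illustrated with a small sketch. The card does not define how Entity F1 is computed, so the code below assumes the common convention of micro-averaged F1 over exact-match typed spans. The helper names `bio_to_spans` and `entity_f1` are hypothetical, and the sample sentence is made up for illustration rather than taken from the dataset.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Collapse a BIO label sequence into typed entity spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        # An I- tag extends the open span only when the entity type matches.
        if prefix == "I" and etype == current:
            continue
        # Any other tag closes the span that is currently open, if any.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- opens a new span; a stray I- is also treated as a span start
        # (a lenient choice made for this sketch, not stated in the card).
        if prefix in ("B", "I") and etype:
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exact-match spans (assumed definition)."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = set(bio_to_spans(g)), set(bio_to_spans(p))
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Illustrative instance: tokens and labels are invented for this example.
tokens = ["Goldman", "Sachs", "hired", "Jane", "Doe", "in", "London"]
gold = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-ORG", "I-ORG", "O", "B-PER", "O", "O", "B-LOC"]

print(bio_to_spans(gold))  # [('ORG', 0, 2), ('PER', 3, 5), ('LOC', 6, 7)]
print(round(entity_f1([gold], [pred]), 3))  # 0.667
```

In this example the prediction truncates the person span to a single token, so only two of the three gold entities match exactly. Under the exact-match assumption, a partially correct span earns no credit.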

Provider:
maas
Created:
2025-03-03