plutus-finner-numeric
Source: ModelScope community (updated 2025-05-31, indexed 2025-03-08)
Download link: https://modelscope.cn/datasets/TheFinAI/plutus-finner-numeric
Resource description:
----------------------------------------------------------------
# Dataset Card for Plutus Finner Numeric
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://huggingface.co/collections/TheFinAI/plutus-benchmarking-greek-financial-llms-67bc718fb8d897c65f1e87db
- **Repository:** https://huggingface.co/datasets/TheFinAI/plutus-finner-numeric
- **Paper:** Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance (https://arxiv.org/abs/2502.18772)
- **Leaderboard:** https://huggingface.co/spaces/TheFinAI/Open-Greek-Financial-LLM-Leaderboard#/
- **Model:** https://huggingface.co/spaces/TheFinAI/plutus-8B-instruct
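Because the repository is hosted on the Hugging Face Hub, the dataset can presumably be loaded with the `datasets` library. The following is a minimal sketch, not an official loader: the helper name `load_plutus_split` is hypothetical, and the lowercase split names are an assumption based on the Data Splits section of this card.

```python
DATASET_ID = "TheFinAI/plutus-finner-numeric"  # repository id taken from this card
SPLITS = ("train", "validation", "test")  # assumed Hub split names


def load_plutus_split(split: str = "test"):
    """Load one split; needs `pip install datasets` and network access."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without the dependency.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)
```

Calling `load_plutus_split("test")` should return a `datasets.Dataset` whose rows are dictionaries keyed by the field names described under Data Fields, provided the assumed split names match the Hub configuration.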
### Dataset Summary
Plutus Finner Numeric is a dataset for numeric named entity recognition in financial texts. It pairs financial queries with their answers, labels, and additional textual context, and serves as a benchmark for identifying numeric entities in Greek-language financial documents.
### Supported Tasks and Leaderboards
- **Task:** Numeric Named Entity Recognition
- **Evaluation Metrics:** Entity F1 Score
- **Test Size:** 100
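The card names Entity F1 as the metric but does not specify the scoring script. As an illustration only, the sketch below computes a micro-averaged, exact-match entity F1, assuming each example's entities are given as `(surface text, entity type)` pairs. That representation, the function name `entity_f1`, and the placeholder type names are assumptions, not the leaderboard's implementation.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Entity = Tuple[str, str]  # (surface text, entity type); an assumed representation


def entity_f1(gold: Iterable[List[Entity]], pred: Iterable[List[Entity]]) -> float:
    """Micro-averaged exact-match entity F1 over paired gold/predicted lists."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_counts, p_counts = Counter(g), Counter(p)
        # Multiset intersection: each gold entity can be matched at most once.
        tp += sum((g_counts & p_counts).values())
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Placeholder entities; "TYPE_A" and "TYPE_B" are hypothetical type names.
gold = [[("1.250", "TYPE_A"), ("15", "TYPE_B")]]
pred = [[("1.250", "TYPE_A"), ("2023", "TYPE_B")]]
print(entity_f1(gold, pred))  # 0.5
```

One of two predictions is correct and one of two gold entities is found, so precision and recall are both 0.5. Partial-match or token-level variants would score the same example differently.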
### Languages
- Greek
## Dataset Structure
### Data Instances
Each data instance in the dataset includes the following fields:
- **query:** A financial query or prompt containing numeric expressions.
- **answer:** The expected answer corresponding to the query.
- **label:** A sequence field containing labels that may represent numeric entities or related categories.
- **text:** Supplementary context or explanation accompanying each query.
### Data Fields
- **query:** String – The input prompt centered on financial numeric content.
- **answer:** String – The answer intended for the corresponding query.
- **label:** Sequence of strings – Additional labels or categorical descriptors for each instance.
- **text:** String – Extra textual context or commentary.
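The field list above can be written down as a typed record, which is handy for sanity-checking rows before training or evaluation. This is a sketch under the assumption that rows arrive as plain dictionaries; `PlutusInstance` and `validate_instance` are hypothetical names, not part of any published tooling.

```python
from typing import Any, List, Mapping, TypedDict


class PlutusInstance(TypedDict):
    query: str  # input prompt centered on financial numeric content
    answer: str  # answer intended for the query
    label: List[str]  # sequence of label strings
    text: str  # extra textual context


def validate_instance(row: Mapping[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row matches the card."""
    problems = []
    for name in ("query", "answer", "text"):
        if not isinstance(row.get(name), str):
            problems.append(f"{name}: expected a string")
    label = row.get("label")
    if not isinstance(label, (list, tuple)) or not all(
        isinstance(item, str) for item in label
    ):
        problems.append("label: expected a sequence of strings")
    return problems
```

A row that passes returns an empty list, so `assert not validate_instance(row)` can serve as a cheap guard in a preprocessing loop.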
### Data Splits
The dataset is divided into three splits:
- **Train:** 320 instances (609,219 bytes)
- **Validation:** 80 instances (166,639 bytes)
- **Test:** 100 instances (219,566 bytes)
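For a quick sense of proportions, the figures above can be turned into per-split shares and average instance sizes. The sketch below only does arithmetic on the numbers listed in this card and takes the byte sizes at face value; the names `SPLIT_INFO` and `split_summary` are illustrative.

```python
# (instance count, size in bytes) as listed in this card
SPLIT_INFO = {
    "train": (320, 609_219),
    "validation": (80, 166_639),
    "test": (100, 219_566),
}


def split_summary(info=SPLIT_INFO):
    """Return each split's share of all instances and its mean bytes per instance."""
    total = sum(count for count, _ in info.values())
    return {
        name: {"share": count / total, "avg_bytes": size / count}
        for name, (count, size) in info.items()
    }


for name, stats in split_summary().items():
    print(f"{name}: {stats['share']:.0%} of instances, ~{stats['avg_bytes']:.0f} bytes each")
```

The counts sum to 500 instances, which corresponds to a 64/16/20 split across train, validation, and test.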
## Dataset Creation
### Curation Rationale
The dataset was developed to evaluate how accurately models identify and reason about numeric entities in Greek financial texts. It is intended as a resource for research on numeric named entity recognition and financial text analysis in low-resource settings.
### Source Data
#### Initial Data Collection and Normalization
The source data was derived from a diverse collection of Greek financial annual reports containing numeric information.
#### Who are the Source Language Producers?
The source texts come from Greek financial annual reports.
### Annotations
#### Annotation Process
Annotations were performed by domain experts with backgrounds in finance and linguistics. The process involved identifying and labeling numeric entities based on financial context to facilitate accurate named entity recognition.
#### Who are the Annotators?
The dataset was annotated by a team of financial analysts, data scientists, and linguists collaborating to ensure both numeric accuracy and linguistic coherence.
### Personal and Sensitive Information
This dataset has been curated to exclude any personally identifiable information (PII). It focuses solely on financial content and numeric reasoning without sensitive personal data.
## Considerations for Using the Data
### Social Impact of Dataset
By enhancing the capabilities of numeric named entity recognition in the financial domain, particularly in Greek, this dataset supports improved financial analysis, informed decision-making, and automated information processing. It is a valuable resource for both academic research and practical financial applications.
### Discussion of Biases
Potential biases include:
- Domain-specific language and numeric formats that might be less applicable outside Greek financial texts.
- The curation process may favor specific numeric presentation styles or financial topics.
- Annotation variability inherent in domain-specific text processing.
### Other Known Limitations
- The dataset is specifically designed for Greek financial texts, which may limit its applicability to other domains or languages.
- Variations in numeric expressions and formatting may require additional pre-processing before analysis.
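As one example of the pre-processing mentioned above, Greek texts conventionally write numbers with a period as the thousands separator and a comma as the decimal mark (for instance `1.234,56`). The sketch below normalizes such strings to `Decimal`. Whether the dataset consistently follows this convention is an assumption, and `parse_greek_number` is a hypothetical helper.

```python
import re
from decimal import Decimal

# Optional sign, digits with optional '.'-grouped thousands, optional ',' decimals.
_GREEK_NUMBER = re.compile(r"[+-]?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?")


def parse_greek_number(token: str) -> Decimal:
    """Convert a string such as '1.234,56' to Decimal('1234.56')."""
    cleaned = token.strip()
    if not _GREEK_NUMBER.fullmatch(cleaned):
        raise ValueError(f"not a Greek-formatted number: {token!r}")
    # Drop thousands separators, then switch the decimal comma to a point.
    return Decimal(cleaned.replace(".", "").replace(",", "."))


print(parse_greek_number("1.234,56"))  # 1234.56
print(parse_greek_number("15,5"))  # 15.5
```

A string like `1.234` is ambiguous without context; this sketch reads it as one thousand two hundred thirty-four, and it rejects English-style input such as `1,234.56` rather than guessing.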
## Additional Information
### Dataset Curators
- Xueqing Peng
- Triantafillos Papadopoulos
- Efstathia Soufleri
- Polydoros Giannouris
- Ruoyu Xiang
- Yan Wang
- Lingfei Qian
- Jimin Huang
- Qianqian Xie
- Sophia Ananiadou
The research is supported by NaCTeM, Archimedes RC, and The Fin AI.
### Licensing Information
- **License:** Apache License 2.0
### Citation Information
If you use this dataset in your research, please consider citing it as follows:
```bibtex
@misc{peng2025plutusbenchmarkinglargelanguage,
title={Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance},
author={Xueqing Peng and Triantafillos Papadopoulos and Efstathia Soufleri and Polydoros Giannouris and Ruoyu Xiang and Yan Wang and Lingfei Qian and Jimin Huang and Qianqian Xie and Sophia Ananiadou},
year={2025},
eprint={2502.18772},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18772},
}
```
Provider: maas
Created: 2025-03-03



