plutus-finner-text
ModelScope community. Updated 2025-05-31; indexed 2025-03-08.
Download link:
https://modelscope.cn/datasets/TheFinAI/plutus-finner-text
Resource description:
----------------------------------------------------------------
# Dataset Card for Plutus Finner Text
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://huggingface.co/collections/TheFinAI/plutus-benchmarking-greek-financial-llms-67bc718fb8d897c65f1e87db
- **Repository:** https://huggingface.co/datasets/TheFinAI/plutus-finner-text
- **Paper:** https://arxiv.org/pdf/2502.18772
- **Leaderboard:** https://huggingface.co/spaces/TheFinAI/Open-Greek-Financial-LLM-Leaderboard#/
- **Model:** https://huggingface.co/spaces/TheFinAI/plutus-8B-instruct
### Dataset Summary
Plutus Finner Text is a dataset for text named entity recognition (NER) in Greek-language financial documents. Each instance combines a financial query with an answer, a sequence of labels, and additional contextual text. The dataset is designed as a benchmark for extracting and categorizing textual entities in finance.
### Supported Tasks
- **Task:** Text Named Entity Recognition
- **Evaluation Metrics:** Entity F1 Score
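
The card names Entity F1 Score as the metric but does not spell out how it is computed. The following is a minimal sketch, assuming exact-match, micro-averaged scoring over sets of entities. The `(start, end, type)` representation, the function name `entity_f1`, and the type names `ORG`, `PER`, and `LOC` are illustrative assumptions, not taken from the dataset or the paper.

```python
from typing import List, Set, Tuple

# Hypothetical entity representation: (start index, end index, entity type).
Entity = Tuple[int, int, str]


def entity_f1(gold: List[Set[Entity]], pred: List[Set[Entity]]) -> float:
    """Micro-averaged, exact-match entity F1 over a list of instances.

    An entity counts as correct only if its span and type both match
    a gold entity in the same instance (an assumption of this sketch).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of instances")

    true_positives = sum(len(g & p) for g, p in zip(gold, pred))
    if true_positives == 0:
        # Also covers the cases with no predicted or no gold entities.
        return 0.0

    precision = true_positives / sum(len(p) for p in pred)
    recall = true_positives / sum(len(g) for g in gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one of two predicted entities matches one of two gold ones.
gold_entities = [{(0, 2, "ORG"), (5, 6, "PER")}]
pred_entities = [{(0, 2, "ORG"), (7, 8, "LOC")}]
score = entity_f1(gold_entities, pred_entities)
print(f"entity F1 = {score:.2f}")
```

Exact matching is the stricter choice; a partial-overlap variant would give credit for near misses and would generally yield higher scores on the same predictions.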
### Languages
- Greek
## Dataset Structure
### Data Instances
Each instance in the dataset is composed of four fields:
- **query:** A financial query or prompt that includes text potentially containing named entities.
- **answer:** The expected answer associated with the query.
- **label:** A sequence field containing labels which denote the named entities.
- **text:** Additional context or commentary that clarifies the query.
### Data Fields
- **query:** String – Represents the financial query or prompt.
- **answer:** String – The corresponding answer for the query.
- **label:** Sequence of strings – Contains the named entity labels linked to each instance.
- **text:** String – Provides supplementary context or details.
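
The field types listed above can be checked mechanically. The sketch below is one way to do so, assuming an instance arrives as a plain Python dictionary. The helper `validate_instance` and the example record are hypothetical: the placeholder strings and the tag values `O` and `B-ORG` are invented for illustration and are not drawn from the dataset.

```python
from typing import Any, Dict, List

# Field names and types as described in the card.
EXPECTED_FIELDS = {"query": str, "answer": str, "label": list, "text": str}


def validate_instance(instance: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the instance conforms."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in instance:
            problems.append(f"missing field: {name}")
        elif not isinstance(instance[name], expected_type):
            actual = type(instance[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")

    # "label" is a sequence of strings, so check its elements as well.
    label = instance.get("label")
    if isinstance(label, list) and not all(isinstance(tag, str) for tag in label):
        problems.append("label: every element should be a string")
    return problems


# Hypothetical record with placeholder values (not real dataset content).
example = {
    "query": "<prompt containing the text to tag>",
    "answer": "<expected answer>",
    "label": ["O", "B-ORG"],
    "text": "<supplementary context>",
}
print(validate_instance(example))
```

With the Hugging Face `datasets` library installed, records of this shape would typically be obtained with `load_dataset("TheFinAI/plutus-finner-text")`, assuming the repository loads with its default configuration.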
### Data Splits
The dataset is organized into three splits:
- **Train:** 320 instances (649,136 bytes)
- **Validation:** 80 instances (157,953 bytes)
- **Test:** 100 instances (230,512 bytes)
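
As a quick check on the figures above, the split proportions and the average size per instance can be derived directly from the listed counts. This is a small arithmetic sketch; the dictionary and function names are illustrative only.

```python
# Instance counts and byte sizes as listed in the card.
SPLIT_SIZES = {"train": 320, "validation": 80, "test": 100}
SPLIT_BYTES = {"train": 649_136, "validation": 157_953, "test": 230_512}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


fractions = split_fractions(SPLIT_SIZES)
for name, count in SPLIT_SIZES.items():
    avg_bytes = SPLIT_BYTES[name] / count
    print(f"{name}: {count} instances, {fractions[name]:.0%} of total, "
          f"about {avg_bytes:.0f} bytes per instance")
```

The counts sum to 500 instances, so the splits follow a 64/16/20 division between train, validation, and test.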
## Dataset Creation
### Curation Rationale
The Plutus Finner Text dataset was developed to support robust text-based named entity recognition in the financial domain, tailored specifically to Greek-language texts. It aims to give researchers and practitioners a challenging benchmark for extracting and classifying named entities within financial documents.
### Source Data
#### Initial Data Collection and Normalization
The source data was derived from a diverse collection of Greek financial annual reports containing numeric information.
#### Who are the Source Language Producers?
Greek financial annual reports.
### Annotations
#### Annotation Process
The annotation process involved domain experts in both finance and linguistics who manually identified and marked the relevant named entities within the financial queries and contextual text. Quality control was maintained to ensure high annotation consistency.
#### Who are the Annotators?
A collaboration between financial analysts, data scientists, and linguists was established to annotate the dataset accurately and reliably.
### Personal and Sensitive Information
This dataset has been curated to exclude any personally identifiable information (PII) and focuses solely on financial textual data and entity extraction.
## Considerations for Using the Data
### Social Impact of Dataset
By advancing text NER within the Greek financial sector, this dataset supports improved information extraction and automated analysis, which benefits financial decision-making and research in both industry and academia.
### Discussion of Biases
- The domain-specific language and textual formats may limit generalizability outside Greek financial texts.
- Annotation subjectivity could introduce biases in the identification of entities.
- The dataset’s focused scope in finance may require further adaptation for use in broader contexts.
### Other Known Limitations
- Additional pre-processing might be needed to handle variations in text and entity presentation.
- The dataset’s application is primarily limited to the financial domain.
## Additional Information
### Dataset Curators
- Xueqing Peng
- Triantafillos Papadopoulos
- Efstathia Soufleri
- Polydoros Giannouris
- Ruoyu Xiang
- Yan Wang
- Lingfei Qian
- Jimin Huang
- Qianqian Xie
- Sophia Ananiadou
The research is supported by NaCTeM, Archimedes RC, and The Fin AI.
### Licensing Information
- **License:** Apache License 2.0
### Citation Information
If you use this dataset in your research, please consider citing it as follows:
```bibtex
@misc{peng2025plutusbenchmarkinglargelanguage,
title={Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance},
author={Xueqing Peng and Triantafillos Papadopoulos and Efstathia Soufleri and Polydoros Giannouris and Ruoyu Xiang and Yan Wang and Lingfei Qian and Jimin Huang and Qianqian Xie and Sophia Ananiadou},
year={2025},
eprint={2502.18772},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18772},
}
```
Provider:
maas
Created:
2025-03-03



