sohomghosh/IndicFinNLP_FinancialNatural_Language_Processing_for_Indian_Languages
Hosted on Hugging Face. Updated 2024-06-05; indexed 2024-06-12.
Download link: https://hf-mirror.com/datasets/sohomghosh/IndicFinNLP_FinancialNatural_Language_Processing_for_Indian_Languages

---
license: cc-by-nc-sa-4.0
task_categories:
- text-classification
language:
- bn
- hi
- te
tags:
- finance
size_categories:
- 10K<n<100K
---
# IndicFinNLP
This repository contains the dataset introduced in the paper "IndicFinNLP: Financial Natural Language Processing for Indian Languages" (LREC-COLING 2024).

Tasks:
- Exaggerated Numeral Detection
- Sustainability Assessment
- ESG Theme Determination

Languages: Hindi, Bengali, Telugu

Source:
- Budget speeches of Hindi-, Bengali-, and Telugu-speaking states (Punjab, Uttarakhand, Haryana, West Bengal, Telangana, and Andhra Pradesh) from 2011 to 2023.
- Financial texts filtered from the Samanantar corpus (Ramesh et al., TACL 2022).
- The dataset proposed by Kang and El Maarouf (FinSim4-ESG, FinNLP-2022), translated from English into Hindi, Bengali, and Telugu. Only the high-quality translations were retained.
- The dataset proposed by Chen et al. 2023 (ML-ESG, FinNLP-2023), translated from English into Hindi, Bengali, and Telugu. The translations were manually verified and corrected wherever needed.

## Resources
### Task-1 Metadata (Exaggerated Numeral Detection)

| column name | explanation | example |
|----------------|--------------------------------------------------------------------|-------------------------------|
| indic | financial text in indic language | যা টাকায় ১০ কোটি টাকারও বেশি। |
| number_english | number present in indic text after translating it to English | 10 |
| number_indic | number present in indic text | ১০ |
| start_posn | starting position of the number in indic text | 9 |
| end_posn | ending position of the number in indic text | 11 |
| language | indic language in which the text is present (hindi/bengali/telugu) | bengali |
| magnitude | magnitude of the number | 1 |
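
The offset columns can be checked against the text with a short sketch like the one below. It assumes that `start_posn` and `end_posn` are 0-based Unicode code-point offsets with an exclusive end, which is consistent with the example row but is not stated in the card. The helper name `extract_number` is hypothetical, and the sentence is spelled out with escapes so its code-point layout is unambiguous.

```python
# Example row from the Task-1 table, written with explicit escapes.
# Assumption: offsets are 0-based code-point indices, end exclusive.
row = {
    "indic": (
        "\u09af\u09be \u099f\u09be\u0995\u09be\u09df \u09e7\u09e6 "
        "\u0995\u09cb\u099f\u09bf \u099f\u09be\u0995\u09be\u09b0\u0993 "
        "\u09ac\u09c7\u09b6\u09bf\u0964"
    ),
    "number_indic": "\u09e7\u09e6",
    "number_english": 10,
    "start_posn": 9,
    "end_posn": 11,
}


def extract_number(row):
    """Slice the numeral out of the text and parse it as an integer.

    Python's int() accepts Unicode decimal digits, so Bengali, Hindi, and
    Telugu digit strings parse directly. Numerals containing separators or
    decimal points would need extra handling that is not shown here.
    """
    numeral = row["indic"][row["start_posn"]:row["end_posn"]]
    return numeral, int(numeral)


numeral, value = extract_number(row)
print(numeral == row["number_indic"], value == row["number_english"])
```

Unicode normalisation can change the number of code points in a string, so the offsets are best applied to the text exactly as it is stored.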

### Task-2 Metadata (Sustainability Assessment)

| column name | explanation | example |
|----------------|----------------------------------|-----------------------------------------------------------------|
| sentence_indic | financial text in indic language | 2019 में, हवाई यात्रा हमारे अपने कार्बन फुटप्रिंट का लगभग 38 प्रतिशत थी। |
| label | sustainable or unsustainable | unsustainable |
| language | indic language | hindi |
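
For a text-classification setup, the string labels usually need to become integer ids. The sketch below reads rows in the Task-2 layout from an in-memory CSV and encodes the label. The CSV layout, the id assignment in `LABEL_TO_ID`, and the helper name `encode_rows` are assumptions for illustration; the card does not say how the files are stored.

```python
import csv
import io

# Assumed id assignment; the card only names the two label strings.
LABEL_TO_ID = {"unsustainable": 0, "sustainable": 1}

# One sample row taken from the Task-2 table (quoted because it has a comma).
SAMPLE_CSV = (
    "sentence_indic,label,language\n"
    '"2019 में, हवाई यात्रा हमारे अपने कार्बन फुटप्रिंट का लगभग 38 प्रतिशत थी।",'
    "unsustainable,hindi\n"
)


def encode_rows(csv_text):
    """Return (sentence, label_id, language) tuples, rejecting unknown labels."""
    encoded = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["label"].strip().lower()
        if label not in LABEL_TO_ID:
            raise ValueError(f"unexpected label: {row['label']!r}")
        encoded.append((row["sentence_indic"], LABEL_TO_ID[label], row["language"]))
    return encoded


encoded = encode_rows(SAMPLE_CSV)
print(encoded[0][1], encoded[0][2])
```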

### Task-3 Metadata (ESG Theme Determination)

| column name | explanation | example |
|------------------|------------------------------|-----------------------------------------------------------------------------------------|
| URL | url of the news title | https://www.esgtoday.com/abn-amro-to-align-lending-investment-portfolios-with-net-zero/ |
| news_title_indic | news title in indic language | రుణాలను సమలేఖనం చేయడానికి, నికర సున్నాతో పెట్టుబడి దస్త్రాలను సమలేఖనం చేయడానికి ABN AMRO |
| ESG_Theme | ESG theme of the news title | climate change |
| language | indic language | telugu |
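
The three column layouts above can be turned into a lightweight sanity check before training. This is a minimal sketch: the task keys (`task1`, `task2`, `task3`) and the helper `validate_row` are hypothetical names, and the lowercase language values are taken from the table examples rather than from a documented schema.

```python
# Column sets copied from the three metadata tables.
TASK_COLUMNS = {
    "task1": {"indic", "number_english", "number_indic",
              "start_posn", "end_posn", "language", "magnitude"},
    "task2": {"sentence_indic", "label", "language"},
    "task3": {"URL", "news_title_indic", "ESG_Theme", "language"},
}

# Language values as they appear in the table examples.
LANGUAGES = {"hindi", "bengali", "telugu"}


def validate_row(task, row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    missing = TASK_COLUMNS[task] - set(row)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if row.get("language") not in LANGUAGES:
        problems.append(f"unexpected language: {row.get('language')!r}")
    return problems


good_row = {
    "URL": "https://www.esgtoday.com/abn-amro-to-align-lending-investment-portfolios-with-net-zero/",
    "news_title_indic": "రుణాలను సమలేఖనం చేయడానికి, నికర సున్నాతో పెట్టుబడి దస్త్రాలను సమలేఖనం చేయడానికి ABN AMRO",
    "ESG_Theme": "climate change",
    "language": "telugu",
}
bad_row = {"label": "sustainable", "language": "tamil"}

print(validate_row("task3", good_row))
print(validate_row("task2", bad_row))
```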
```bibtex
@INPROCEEDINGS{ghosh2024,
author={Ghosh, Sohom and Majhi, Arnab and Narayana, Aswartha and Naskar, Sudip Kumar},
booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
title={IndicFinNLP: Financial Natural Language Processing for Indian Languages},
month={05},
year={2024},
pages={9010--9018},
url={https://aclanthology.org/2024.lrec-main.789/}
}
```