
sohomghosh/IndicFinNLP_FinancialNatural_Language_Processing_for_Indian_Languages

Hugging Face | Updated 2024-06-05 | Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/sohomghosh/IndicFinNLP_FinancialNatural_Language_Processing_for_Indian_Languages
Resource description:
---
license: cc-by-nc-sa-4.0
task_categories:
- text-classification
language:
- bn
- hi
- te
tags:
- finance
size_categories:
- 10K<n<100K
---

# IndicFinNLP

This repository contains the dataset described in the paper "IndicFinNLP: Financial Natural Language Processing for Indian Languages" (LREC-COLING 2024).

Tasks:
- Exaggerated Numeral Detection
- Sustainability Assessment
- ESG Theme Determination

Languages: Hindi, Bengali, Telugu

Source:
- Budget speeches of Hindi-, Bengali-, and Telugu-speaking states (Punjab, Uttarakhand, Haryana, West Bengal, Telangana, and Andhra Pradesh) from 2011 to 2023.
- Financial texts filtered from the Samanantar corpus (Ramesh et al., TACL 2022).
- The existing dataset proposed by Kang and El Maarouf (FinSim4-ESG, FinNLP-2022), translated from English to Indian languages (Hindi, Bengali, and Telugu). Only the high-quality translations were retained.
- The existing dataset proposed by Chen et al. 2023 (FinNLP-2023 ML-ESG), translated from English to Indian languages (Hindi, Bengali, and Telugu). The translations were manually verified and corrected wherever needed.

## Resources

**Task-1 Metadata**

| column name | explanation | example |
|----------------|--------------------------------------------------------------------|-------------------------------|
| indic | financial text in indic language | যা টাকায় ১০ কোটি টাকারও বেশি। |
| number_english | number present in indic text after translating it to English | 10 |
| number_indic | number present in indic text | ১০ |
| start_posn | starting position of the number in indic text | 9 |
| end_posn | ending position of the number in indic text | 11 |
| language | indic language in which the text is present (hindi/bengali/telugu) | bengali |
| magnitude | magnitude of the number | 1 |

**Task-2 Metadata**

| column name | explanation | example |
|----------------|----------------------------------|-----------------------------------------------------------------|
| sentence_indic | financial text in indic language | 2019 में, हवाई यात्रा हमारे अपने कार्बन फुटप्रिंट का लगभग 38 प्रतिशत थी। |
| label | sustainable or unsustainable | unsustainable |
| language | indic language | hindi |

**Task-3 Metadata**

| column name | explanation | example |
|------------------|------------------------------|-----------------------------------------------------------------------------------------|
| URL | url of the news title | https://www.esgtoday.com/abn-amro-to-align-lending-investment-portfolios-with-net-zero/ |
| news_title_indic | news title in indic language | రుణాలను సమలేఖనం చేయడానికి, నికర సున్నాతో పెట్టుబడి దస్త్రాలను సమలేఖనం చేయడానికి ABN AMRO |
| ESG_Theme | ESG theme of the news title | climate change |
| language | indic language | telugu |

```bibtex
@INPROCEEDINGS{ghosh2024,
  author={Ghosh, Sohom and Majhi, Arnab and Narayana, Aswartha and Naskar, Sudip Kumar},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  title={IndicFinNLP: Financial Natural Language Processing for Indian Languages},
  month={05},
  year={2024},
  pages={9010–9018},
  url={https://aclanthology.org/2024.lrec-main.789/}
}
```
Provider:
sohomghosh
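Summary of Original Information

Dataset Overview

Dataset Name

IndicFinNLP

Dataset Sources

  • Budget speech texts from Indian-language-speaking states (Punjab, Uttarakhand, Haryana, West Bengal, Telangana, and Andhra Pradesh), spanning 2011 to 2023.
  • Financial texts filtered from the Samanantar corpus.
  • Translations, from English into Indian languages (Hindi, Bengali, Telugu), of the FinSim4-ESG FinNLP-2022 dataset proposed by Kang and El Maarouf and the FinNLP-2023 ML-ESG dataset proposed by Chen et al. 2023.

Dataset Tasks

  1. Exaggerated numeral detection
  2. Sustainability assessment
  3. ESG theme determination

Dataset Languages

  • Hindi
  • Bengali
  • Telugu

Dataset Size

10K<n<100K

Dataset License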

cc-by-nc-sa-4.0

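Dataset Metadata

Task-1 Metadata

| Column name | Explanation | Example |
|----------------|--------------------------------------------------------------|-------------------------------|
| indic | Financial text in an Indian language | যা টাকায় ১০ কোটি টাকারও বেশি। |
| number_english | The number in the Indian-language text, rendered in English | 10 |
| number_indic | The number as it appears in the Indian-language text | ১০ |
| start_posn | Starting position of the number in the Indian-language text | 9 |
| end_posn | Ending position of the number in the Indian-language text | 11 |
| language | Language of the text (hindi/bengali/telugu) | bengali |
| magnitude | Magnitude of the number | 1 |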
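
As a rough illustration of how the Task-1 position columns might be used, the sketch below assumes that `start_posn` and `end_posn` are 0-based character offsets with an exclusive end. That reading is consistent with a two-character number spanning 9 to 11, but the dataset card does not confirm it. The helper names `number_span` and `row_is_consistent` are hypothetical, and the offsets in the sample row are computed from the text rather than copied from the dataset.

```python
def number_span(row: dict) -> str:
    """Slice the number out of the text using the position columns.

    Assumption: offsets are 0-based character indices, end exclusive.
    """
    return row["indic"][row["start_posn"]:row["end_posn"]]


def row_is_consistent(row: dict) -> bool:
    """Check that the sliced span matches number_indic and number_english."""
    span = number_span(row)
    # Python's int() accepts Unicode decimal digits, including Bengali ones.
    return span == row["number_indic"] and int(span) == int(row["number_english"])


# Sample row modelled on the Task-1 example; offsets are derived here,
# not taken from the dataset.
text = "যা টাকায় ১০ কোটি টাকারও বেশি।"
number = "১০"
start = text.index(number)
row = {
    "indic": text,
    "number_english": 10,
    "number_indic": number,
    "start_posn": start,
    "end_posn": start + len(number),
    "language": "bengali",
}

print(number_span(row), row_is_consistent(row))
```

If the real offsets turn out to be 1-based or inclusive, only the slice inside `number_span` would need to change.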

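Task-2 Metadata

| Column name | Explanation | Example |
|----------------|--------------------------------------|-----------------------------------------------------------------|
| sentence_indic | Financial text in an Indian language | 2019 में, हवाई यात्रा हमारे अपने कार्बन फुटप्रिंट का लगभग 38 प्रतिशत थी। |
| label | Sustainable or unsustainable | unsustainable |
| language | Language of the text | hindi |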

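Task-3 Metadata

| Column name | Explanation | Example |
|------------------|----------------------------------|-----------------------------------------------------------------------------------------|
| URL | URL of the news title | https://www.esgtoday.com/abn-amro-to-align-lending-investment-portfolios-with-net-zero/ |
| news_title_indic | News title in an Indian language | రుణాలను సమలేఖనం చేయడానికి, నికర సున్నాతో పెట్టుబడి దస్త్రాలను సమలేఖనం చేయడానికి ABN AMRO |
| ESG_Theme | ESG theme of the news title | climate change |
| language | Language of the text | telugu |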
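
The column lists for the three tasks can be turned into a lightweight row check. The following is a minimal sketch, not part of the dataset's tooling: the task keys (`task1`, `task2`, `task3`) and the function name `validate_row` are hypothetical, and the allowed `language` and `label` values are taken from the explanations and examples in the metadata, which may not be exhaustive.

```python
# Column names as listed in the Task-1, Task-2, and Task-3 metadata.
TASK_COLUMNS = {
    "task1": {"indic", "number_english", "number_indic",
              "start_posn", "end_posn", "language", "magnitude"},
    "task2": {"sentence_indic", "label", "language"},
    "task3": {"URL", "news_title_indic", "ESG_Theme", "language"},
}

# Assumed value sets, based on the metadata explanations and examples.
LANGUAGES = {"hindi", "bengali", "telugu"}
TASK2_LABELS = {"sustainable", "unsustainable"}


def validate_row(task: str, row: dict) -> list:
    """Return a list of human-readable problems; empty means the row looks fine."""
    problems = []
    missing = TASK_COLUMNS[task] - set(row)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if row.get("language") not in LANGUAGES:
        problems.append(f"unexpected language: {row.get('language')!r}")
    if task == "task2" and row.get("label") not in TASK2_LABELS:
        problems.append(f"unexpected label: {row.get('label')!r}")
    return problems


example = {
    "sentence_indic": "2019 में, हवाई यात्रा हमारे अपने कार्बन फुटप्रिंट का लगभग 38 प्रतिशत थी।",
    "label": "unsustainable",
    "language": "hindi",
}
print(validate_row("task2", example))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a row in a single pass.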

Related Publications

  • Ghosh, Sohom and Majhi, Arnab and Narayana, Aswartha and Naskar, Sudip Kumar. "IndicFinNLP: Financial Natural Language Processing for Indian Languages". Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, pp. 9010–9018.