ExponentialScience/ESG-DLT-NER

Source: Hugging Face. Updated 2024-08-15. Indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/ExponentialScience/ESG-DLT-NER
Resource description:
---
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
task_categories:
- token-classification
tags:
- Distributed Ledger Technology
- Blockchain
- ESG
- Named Entity Recognition
- Environmental, Social, and Governance
dataset_info:
  features:
  - name: text
    dtype: string
  - name: input_ids
    sequence: int64
  - name: attention_mask
    sequence: int64
  - name: labels
    sequence:
      class_label:
        names:
          '0': O
          '1': B-Blockchain_Name
          '2': I-Blockchain_Name
          '3': B-Codebase
          '4': I-Codebase
          '5': B-Consensus
          '6': I-Consensus
          '7': B-ChargingAndRewardingSystem
          '8': I-ChargingAndRewardingSystem
          '9': B-ESG
          '10': I-ESG
          '11': B-Extensibility
          '12': I-Extensibility
          '13': B-Identifiers
          '14': I-Identifiers
          '15': B-Identity_Management
          '16': I-Identity_Management
          '17': B-Miscellaneous
          '18': I-Miscellaneous
          '19': B-Native_Currency_Tokenisation
          '20': I-Native_Currency_Tokenisation
          '21': B-Security_Privacy
          '22': I-Security_Privacy
          '23': B-Transaction_Capabilities
          '24': I-Transaction_Capabilities
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-Blockchain_Name
          '2': I-Blockchain_Name
          '3': B-Codebase
          '4': I-Codebase
          '5': B-Consensus
          '6': I-Consensus
          '7': B-ChargingAndRewardingSystem
          '8': I-ChargingAndRewardingSystem
          '9': B-ESG
          '10': I-ESG
          '11': B-Extensibility
          '12': I-Extensibility
          '13': B-Identifiers
          '14': I-Identifiers
          '15': B-Identity_Management
          '16': I-Identity_Management
          '17': B-Miscellaneous
          '18': I-Miscellaneous
          '19': B-Native_Currency_Tokenisation
          '20': I-Native_Currency_Tokenisation
          '21': B-Security_Privacy
          '22': I-Security_Privacy
          '23': B-Transaction_Capabilities
          '24': I-Transaction_Capabilities
  - name: tokens
    sequence: string
  - name: paper_name
    dtype: string
  - name: __index_level_0__
    dtype: int64
  splits:
  - name: train
    num_bytes: 108441309
    num_examples: 5813
  download_size: 11419998
  dataset_size: 108441309
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for ESG/DLT Named Entity Recognition Dataset

This dataset contains named entities related to Distributed Ledger Technology (DLT) and Environmental, Social, and Governance (ESG) topics, created to support research in these areas and at the intersection of these domains.

## Dataset Details

### Dataset Description

- **Curated by:** Walter Hernandez Cruz, Kamil Tylinski, Ali Irzam Kathia, Alastair Moore, Niall Roche, Nikhil Vadgama, Horst Treiblmaier, Jiangbo Shangguan, Jiahua Xu, Paolo Tasca
- **Language(s) (NLP):** English
- **Number of Entity Types:** 12
- **Entity Types:** `Blockchain Name`, `Consensus`, `Identifiers`, `Security Privacy`, `ESG`, `Transaction Capabilities`, `ChargingAndRewardingSystem`, `Extensibility`, `Identity Management`, `Native Currency Tokenisation`, `Codebase`, `Miscellaneous`
- **License:** CC BY-NC 4.0

### Dataset Sources

- **Repository:** https://github.com/dlt-science/ESG-DLT-LitReview
- **Paper:** https://arxiv.org/abs/2308.12420

## Use

This dataset can be used for training and evaluating Named Entity Recognition models focused on DLT and ESG topics. It is particularly useful for researchers and practitioners working on text mining and information extraction in these domains.

## Dataset Structure

The dataset contains 39,427 named entities organized into 12 top-level categories with 136 labels in a tree structure. It includes entities related to blockchain names, consensus mechanisms, transaction capabilities, security and privacy, and ESG concepts.
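As a minimal sketch of how the `tokens` and `ner_tags` features fit together, the following Python snippet decodes a sequence of BIO tag ids into entity spans. The label order is copied from the `class_label` names in the card metadata. The helper name `decode_entities` and the example sentence with its tags are illustrative assumptions, not taken from the dataset or its tooling.

```python
from typing import List, Tuple

# Entity types in the order implied by the class_label names in the card metadata.
ENTITY_TYPES = [
    "Blockchain_Name", "Codebase", "Consensus", "ChargingAndRewardingSystem",
    "ESG", "Extensibility", "Identifiers", "Identity_Management",
    "Miscellaneous", "Native_Currency_Tokenisation", "Security_Privacy",
    "Transaction_Capabilities",
]

# Id 0 is "O"; each type contributes a B- and an I- tag: 1 + 2 * 12 = 25 labels.
ID2LABEL = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]


def decode_entities(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs (hypothetical helper)."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, entity_type = ID2LABEL[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a new span.
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# Made-up example sentence and tags, for illustration only.
example_tokens = ["Bitcoin", "uses", "Proof", "of", "Work", "."]
example_tags = [1, 0, 5, 6, 6, 0]
print(decode_entities(example_tokens, example_tags))
```

Joining tokens with spaces is a simplification; if exact character offsets are needed, they would have to be recovered from the `text` field instead.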

### Label ID

The `label2id` dictionary is:

```python
{
    "O": 0,
    "B-Blockchain_Name": 1,
    "I-Blockchain_Name": 2,
    "B-Codebase": 3,
    "I-Codebase": 4,
    "B-Consensus": 5,
    "I-Consensus": 6,
    "B-ChargingAndRewardingSystem": 7,
    "I-ChargingAndRewardingSystem": 8,
    "B-ESG": 9,
    "I-ESG": 10,
    "B-Extensibility": 11,
    "I-Extensibility": 12,
    "B-Identifiers": 13,
    "I-Identifiers": 14,
    "B-Identity_Management": 15,
    "I-Identity_Management": 16,
    "B-Miscellaneous": 17,
    "I-Miscellaneous": 18,
    "B-Native_Currency_Tokenisation": 19,
    "I-Native_Currency_Tokenisation": 20,
    "B-Security_Privacy": 21,
    "I-Security_Privacy": 22,
    "B-Transaction_Capabilities": 23,
    "I-Transaction_Capabilities": 24
}
```

## Dataset Creation

### Curation Rationale

The dataset was created to address the scarcity of labeled NLP data for blockchain research, focusing on the intersection of DLT and ESG topics.

### Source Data

#### Data Collection and Processing

The dataset was created by manually annotating 80 publicly available publications using the brat tool and argilla. The taxonomy framework from [Tasca and Tessone (2019)](https://ledger.pitt.edu/ojs/ledger/article/view/140) was extended to include ESG-related concepts.

### Annotations

#### Annotation process

The annotation process involved manual labeling using the brat tool and argilla, following an extended version of the [Tasca and Tessone (2019) taxonomy](https://ledger.pitt.edu/ojs/ledger/article/view/140). Inter-labeler consistency was improved through systematic processes and programmatic cleaning.

#### Who are the annotators?

The annotators are the research paper's authors and other collaborators involved in the project.

#### Personal and Sensitive Information

The dataset does not contain personal or sensitive information as it is based on publicly available academic publications.

## Bias, Risks, and Limitations

### Recommendations

Users should be aware of potential biases in the dataset due to the selection of source publications and the annotation process.
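One way to act on this recommendation is to check how annotations are distributed across source publications. The sketch below is an illustration under stated assumptions, not part of the dataset's tooling: it takes records shaped like the dataset's rows (with `paper_name` and `ner_tags` fields) and counts entities per paper, relying on the fact that every `B-` tag in the `label2id` dictionary has an odd id. The function name `entities_per_paper` and the toy records are hypothetical.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def entities_per_paper(records: Iterable[Mapping[str, Any]]) -> Counter:
    """Count entity starts (B- tags) per source publication (hypothetical helper)."""
    counts: Counter = Counter()
    for record in records:
        # In label2id, ids 1, 3, ..., 23 are the B- tags; each marks one entity start.
        n_entities = sum(1 for tag_id in record["ner_tags"] if tag_id % 2 == 1)
        counts[record["paper_name"]] += n_entities
    return counts


# Toy records, made up for illustration; they are not rows from the dataset.
toy_records = [
    {"paper_name": "paper_a", "ner_tags": [1, 2, 0, 5, 6]},
    {"paper_name": "paper_a", "ner_tags": [0, 9, 10, 0]},
    {"paper_name": "paper_b", "ner_tags": [0, 0, 21]},
]
paper_counts = entities_per_paper(toy_records)
print(paper_counts.most_common())
```

With the real data, the same function could presumably be applied to rows loaded through the Hugging Face `datasets` library, for example `load_dataset("ExponentialScience/ESG-DLT-NER", split="train")`. A heavily skewed count would suggest that a few publications dominate the annotations.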

## Glossary

- DLT: Distributed Ledger Technology
- ESG: Environmental, Social, and Governance
- NER: Named Entity Recognition

## More Information

For more details about the dataset creation process and its applications, please refer to the associated research paper: https://arxiv.org/abs/2308.12420

## Citation Information

```
@misc{hernandez2024evolutionesgfocuseddltresearch,
  title={Evolution of ESG-focused DLT Research: An NLP Analysis of the Literature},
  author={Walter Hernandez and Kamil Tylinski and Alastair Moore and Niall Roche and Nikhil Vadgama and Horst Treiblmaier and Jiangbo Shangguan and Paolo Tasca and Jiahua Xu},
  year={2024},
  eprint={2308.12420},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2308.12420},
}
```

## Contributions

Thanks to [Ali Irzam Kathia](https://uk.linkedin.com/in/alikathia) for his contribution to labeling this dataset.

Provided by:
ExponentialScience