ireneturrado/finer-139

Name: ireneturrado/finer-139
Creator: ireneturrado
Published: 2026-03-25 18:33:27
License: No description provided

Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/ireneturrado/finer-139


Description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
pretty_name: FiNER-139
size_categories:
- 1M<n<10M
source_datasets: []
task_categories:
- structure-prediction
- named-entity-recognition
- entity-extraction
task_ids:
- named-entity-recognition
---

# Dataset Card for FiNER-139

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
- [SEC-BERT](#sec-bert)
- [About Us](#about-us)

## Dataset Description

- **Homepage:** [FiNER](https://github.com/nlpaueb/finer)
- **Repository:** [FiNER](https://github.com/nlpaueb/finer)
- **Paper:** [FiNER, Loukas et al. (2022)](https://arxiv.org/abs/2203.06482)
- **Point of Contact:** [Manos Fergadiotis](mailto:fergadiotis@aueb.gr)

### Dataset Summary

<div style="text-align: justify">

<strong>FiNER-139</strong> consists of 1.1M sentences annotated with <strong>eXtensible Business Reporting Language (XBRL)</strong> tags extracted from annual and quarterly reports of publicly traded companies in the US. Unlike other entity extraction tasks, such as named entity recognition (NER) or contract element extraction, which typically require identifying entities of a small set of common types (e.g., persons, organizations), FiNER-139 uses a much larger label set of <strong>139 entity types</strong>. Another important difference from typical entity extraction is that FiNER focuses on numeric tokens, with the correct tag depending mostly on context, not the token itself.

</div>

### Supported Tasks

<div style="text-align: justify">

To promote transparency among shareholders and potential investors, publicly traded companies are required to file periodic financial reports annotated with tags from the eXtensible Business Reporting Language (XBRL), an XML-based language, to facilitate the processing of financial information. However, manually tagging reports with XBRL tags is tedious and resource-intensive. We therefore introduce <strong>XBRL tagging</strong> as a <strong>new entity extraction task</strong> for the <strong>financial domain</strong> and study how financial reports can be automatically enriched with XBRL tags. To facilitate research towards automated XBRL tagging, we release FiNER-139.

</div>

### Languages

**FiNER-139** is compiled from approximately 10k annual and quarterly **English** reports.

## Dataset Structure

### Data Instances

This is a "train" split example:

```json
{
  "id": 40,
  "tokens": ["In", "March", "2014", ",", "the", "Rialto", "segment", "issued", "an", "additional", "$", "100", "million", "of", "the", "7.00", "%", "Senior", "Notes", ",", "at", "a", "price", "of", "102.25", "%", "of", "their", "face", "value", "in", "a", "private", "placement", "."],
  "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 37, 0, 0, 0, 41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
}
```

### Data Fields

**id**: ID of the example <br>
**tokens**: List of tokens for the specific example. <br>
**ner_tags**: List of tags for each token in the example. Tags are provided as integer classes.<br>

If you want to use the class names, you can access them as follows:

```python
import datasets

finer_train = datasets.load_dataset("nlpaueb/finer-139", split="train")

finer_tag_names = finer_train.features["ner_tags"].feature.names
```

**finer_tag_names** contains a list of class names corresponding to the integer classes, e.g.

```
0 -> "O"
1 -> "B-AccrualForEnvironmentalLossContingencies"
```

### Data Splits

| Training | Validation | Test    |
| -------- | ---------- | ------- |
| 900,384  | 112,494    | 108,378 |

## Dataset Creation

### Curation Rationale

The dataset was curated by [Loukas et al. (2022)](https://arxiv.org/abs/2203.06482) <br>

### Source Data

#### Initial Data Collection and Normalization

<div style="text-align: justify">

FiNER-139 is compiled from approximately 10k annual and quarterly English reports (filings) of publicly traded companies downloaded from the [US Securities and Exchange Commission's (SEC)](https://www.sec.gov/) [Electronic Data Gathering, Analysis, and Retrieval (EDGAR)](https://www.sec.gov/edgar.shtml) system. The reports span a 5-year period, from 2016 to 2020. They are annotated with XBRL tags by professional auditors and describe the performance and projections of the companies. XBRL defines approximately 6k entity types from the US-GAAP taxonomy. FiNER-139 is annotated with the 139 most frequent XBRL entity types with at least 1,000 appearances. We used regular expressions to extract the text notes from the Financial Statements Item of each filing, which is the primary source of XBRL tags in annual and quarterly reports. We used the <strong>IOB2</strong> annotation scheme to distinguish tokens at the beginning, inside, or outside of tagged expressions, which leads to 279 possible token labels (a B- and an I- label for each of the 139 entity types, plus O).

</div>

### Annotations

#### Annotation process

<div style="text-align: justify">

All the examples were annotated by professional auditors, as required by Securities & Exchange Commission (SEC) legislation. Even though the gold XBRL tags come from professional auditors, there are still some discrepancies. See Section 9.4 of [Loukas et al. (2022)](https://arxiv.org/abs/2203.06482) for more details.

</div>

#### Who are the annotators?

Professional auditors

### Personal and Sensitive Information

The dataset contains publicly available annual and quarterly reports (filings).

## Additional Information

### Dataset Curators

[Loukas et al. (2022)](https://arxiv.org/abs/2203.06482)

### Licensing Information

<div style="text-align: justify">

Access to SEC's EDGAR public database is free, allowing research of public companies' financial information and operations by reviewing the filings the companies make with the SEC.

</div>

### Citation Information

If you use this dataset, please cite the following:

```
@inproceedings{loukas-etal-2022-finer,
    title = {FiNER: Financial Numeric Entity Recognition for XBRL Tagging},
    author = {Loukas, Lefteris and
              Fergadiotis, Manos and
              Chalkidis, Ilias and
              Spyropoulou, Eirini and
              Malakasiotis, Prodromos and
              Androutsopoulos, Ion and
              Paliouras, George},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)},
    publisher = {Association for Computational Linguistics},
    location = {Dublin, Republic of Ireland},
    year = {2022},
    url = {https://arxiv.org/abs/2203.06482}
}
```

## SEC-BERT

<img align="center" src="https://i.ibb.co/0yz81K9/sec-bert-logo.png" alt="SEC-BERT" width="400"/>

<div style="text-align: justify">

We also pre-train our own BERT models (<strong>SEC-BERT</strong>) for the financial domain, intended to assist financial NLP research and FinTech applications. <br>
<strong>SEC-BERT</strong> consists of the following models:

* [**SEC-BERT-BASE**](https://huggingface.co/nlpaueb/sec-bert-base): Same architecture as BERT-BASE, trained on financial documents.
* [**SEC-BERT-NUM**](https://huggingface.co/nlpaueb/sec-bert-num): Same as SEC-BERT-BASE, but we replace every number token with a [NUM] pseudo-token, handling all numeric expressions in a uniform manner and disallowing their fragmentation.
* [**SEC-BERT-SHAPE**](https://huggingface.co/nlpaueb/sec-bert-shape): Same as SEC-BERT-BASE, but we replace numbers with pseudo-tokens that represent the number’s shape, so numeric expressions (of known shapes) are no longer fragmented, e.g., '53.2' becomes '[XX.X]' and '40,200.5' becomes '[XX,XXX.X]'.

These models were pre-trained on 260,773 10-K filings (annual reports) from 1993-2019, publicly available at [U.S. Securities and Exchange Commission (SEC)](https://www.sec.gov/)

</div>

## About Us

<div style="text-align: justify">

[**AUEB's Natural Language Processing Group**](http://nlp.cs.aueb.gr) develops algorithms, models, and systems that allow computers to process and generate natural language texts.

The group's current research interests include:

* question answering systems for databases, ontologies, document collections, and the Web, especially biomedical question answering,
* natural language generation from databases and ontologies, especially Semantic Web ontologies,
* text classification, including filtering spam and abusive content,
* information extraction and opinion mining, including legal text analytics and sentiment analysis,
* natural language processing tools for Greek, for example parsers and named-entity recognizers,
* machine learning in natural language processing, especially deep learning.

The group is part of the Information Processing Laboratory of the Department of Informatics of the Athens University of Economics and Business.

</div>

[Manos Fergadiotis](https://manosfer.github.io) on behalf of [AUEB's Natural Language Processing Group](http://nlp.cs.aueb.gr)
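
The IOB2 scheme and the integer `ner_tags` described in the card can be illustrated with a minimal sketch. It builds an IOB2 label list from a set of entity types and groups tagged tokens back into spans. The helper names `build_iob2_labels` and `extract_spans`, and the two toy entity types, are assumptions made for illustration; they are not part of the dataset or of the authors' code, and the toy integer ids do not match the real FiNER-139 ids. With the real data, `finer_tag_names` from the card would be passed in place of the toy label list.

```python
from typing import List, Tuple


def build_iob2_labels(entity_types: List[str]) -> List[str]:
    """Return ["O", "B-x", "I-x", ...]: one O label plus B-/I- per entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


def extract_spans(
    tokens: List[str], tag_ids: List[int], tag_names: List[str]
) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, entity = tag_names[tag_id].partition("-")
        if prefix == "I" and entity == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            # A stray "I-" tag is treated leniently as the start of a span.
            current_type, current_tokens = entity, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical entity types; the real dataset has 139 of them.
toy_labels = build_iob2_labels(["FaceAmount", "InterestRate"])
toy_tokens = ["$", "100", "million", "at", "7.00", "%"]
toy_tags = [0, 1, 0, 0, 3, 0]
print(extract_spans(toy_tokens, toy_tags, toy_labels))
```

Applying `build_iob2_labels` to 139 entity types yields 279 labels, which matches the label count stated in the card.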

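
The number handling described for SEC-BERT-NUM and SEC-BERT-SHAPE can be sketched in a few lines. This is an illustrative approximation, not the authors' preprocessing code: the regular expression `NUMBER_RE` and the function names `to_num_token` and `to_shape_token` are assumptions, and the sketch does not restrict shapes to a fixed vocabulary of known shapes as the card implies the real models do.

```python
import re

# Assumed definition of a "number token": digits with optional thousands
# separators and an optional decimal part, e.g. "100", "40,200.5", "7.00".
NUMBER_RE = re.compile(r"\d[\d,]*(\.\d+)?")


def to_num_token(token: str) -> str:
    """SEC-BERT-NUM style: map every number token to a single [NUM] token."""
    return "[NUM]" if NUMBER_RE.fullmatch(token) else token


def to_shape_token(token: str) -> str:
    """SEC-BERT-SHAPE style: replace each digit with X and wrap in brackets."""
    if not NUMBER_RE.fullmatch(token):
        return token
    return "[" + re.sub(r"\d", "X", token) + "]"


sentence = ["issued", "$", "100", "million", "of", "the", "7.00", "%", "Senior", "Notes"]
print([to_num_token(t) for t in sentence])
print([to_shape_token(t) for t in sentence])
```

Either mapping keeps a numeric expression as one token before subword tokenization, which is the property the card highlights for these two models.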
