
him1411/EDGAR10-Q

Hugging Face | Updated 2023-05-14 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/him1411/EDGAR10-Q
Description:
---
license: mit
tags:
- financial
- ner
- context-ner
size_categories:
- 1M<n<10M
---

# EDGAR10-Q

## Dataset Summary

EDGAR10-Q is a large financial dataset curated by scraping the annual and quarterly reports of the top 1500 LLCs in the world. The dataset is designed for the task of ContextNER, which aims to generate the relevant context for entities in a sentence, where the context is a set of phrases describing the entity but not necessarily present in the sentence. The dataset is the largest in terms of the number of sentences (1M), entities (2.8M), and average tokens per sentence (35).

You may want to check out

* Our paper: [CONTEXT-NER: Contextual Phrase Generation at Scale](https://arxiv.org/abs/2109.08079/)
* GitHub: [Click Here](https://github.com/him1411/edgar10q-dataset)

## Supported Tasks

The dataset is designed for the task of ContextNER, which aims to generate the relevant context for entities in a sentence, where the context is a set of phrases describing the entity but not necessarily present in the sentence.

## Dataset Structure

### Data Instances

The dataset includes plain text input-output pairs, where the input is a sentence with an entity and the output is the context for the entity.

An example of a train instance looks as follows:

```
{
  "input": "0.6 million . The denominator also includes the dilutive effect of approximately 0.9 million, 0.6 million and 0.6 million shares of unvested restricted shares of common stock for the years ended December 31, 2019, 2018 and 2017, respectively.",
  "output": "Dilutive effect of unvested restricted shares of Class A common stock"
}
```

We also publish a metadata file in the original repository to promote future research in the area. Please check out the [main website](https://github.com/him1411/edgar10q-dataset).

### Data Fields

The data fields are the same among all splits.

- `text`: a `string` in the form of entity plus sentence.
- `label`: a `string` describing the relevant context for the entity in the sentence.

### Data Splits

The dataset is split into train, validation, and test sets. The sizes of the splits are as follows:

|           | Train     | Validation | Test    |
|-----------|-----------|------------|---------|
| Instances | 1,498,995 | 187,383    | 187,383 |

### Dataset Creation

The dataset was created by scraping the annual and quarterly reports of the top 1500 LLCs in the world.

### Models trained using this dataset

Several models have been finetuned using this dataset:

1. [EDGAR-T5-base](https://huggingface.co/him1411/EDGAR-T5-base)
2. [EDGAR-BART-Base](https://huggingface.co/him1411/EDGAR-BART-Base)
3. [EDGAR-flan-t5-base](https://huggingface.co/him1411/EDGAR-flan-t5-base)
4. [EDGAR-T5-Large](https://huggingface.co/him1411/EDGAR-T5-Large)
5. [EDGAR-Tk-Instruct-Large](https://huggingface.co/him1411/EDGAR-Tk-Instruct-Large)
6. [Instruction tuned EDGAR-Tk-Instruct-base](https://huggingface.co/him1411/EDGAR-Tk-instruct-base-inst-tune)

### Citation Information

If you use this dataset or any other related artifact, please cite the following paper:

```
@article{gupta2021context,
  title={Context-NER: Contextual Phrase Generation at Scale},
  author={Gupta, Himanshu and Verma, Shreyas and Kumar, Tarun and Mishra, Swaroop and Agrawal, Tamanna and Badugu, Amogh and Bhatt, Himanshu Sharad},
  journal={arXiv preprint arXiv:2109.08079},
  year={2021}
}
```

### Contributions

Thanks to [@him1411](https://github.com/him1411) for adding this dataset.
Provider:
him1411
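Summary of Original Information

EDGAR10-Q Dataset Overview

Dataset Summary

EDGAR10-Q is a large financial dataset compiled by scraping the annual and quarterly reports of the top 1,500 LLCs in the world. It is designed for the ContextNER task, which aims to generate the relevant context for entities in a sentence, where the context is a set of phrases that describe the entity but do not necessarily appear in the sentence. The dataset is the largest in terms of number of sentences (1M), number of entities (2.8M), and average tokens per sentence (35).

Supported Tasks

The dataset is designed for the ContextNER task, which aims to generate the relevant context for entities in a sentence, where the context is a set of phrases that describe the entity but do not necessarily appear in the sentence.

Dataset Structure

Data Instances

The dataset consists of plain-text input-output pairs, where the input is a sentence containing an entity and the output is the context for that entity.

Example: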

{ "input": "0.6 million . The denominator also includes the dilutive effect of approximately 0.9 million, 0.6 million and 0.6 million shares of unvested restricted shares of common stock for the years ended December 31, 2019, 2018 and 2017, respectively.", "output": "Dilutive effect of unvested restricted shares of Class A common stock" }
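As a minimal sketch of how such an instance could be taken apart, the snippet below parses the example record and separates the leading entity from the sentence. The ` . ` separator is an assumption read off this single example, not a documented format, and `split_entity` is a hypothetical helper name.

```python
import json

# The example instance from the dataset card, as a JSON string.
raw = json.dumps({
    "input": (
        "0.6 million . The denominator also includes the dilutive effect of "
        "approximately 0.9 million, 0.6 million and 0.6 million shares of "
        "unvested restricted shares of common stock for the years ended "
        "December 31, 2019, 2018 and 2017, respectively."
    ),
    "output": "Dilutive effect of unvested restricted shares of Class A common stock",
})


def split_entity(text, sep=" . "):
    """Split 'entity . sentence' on the first separator.

    ASSUMPTION: the entity precedes the sentence and is delimited by ' . ',
    as in the card's example. Returns (None, text) if no separator is found.
    """
    entity, found, sentence = text.partition(sep)
    if not found:
        return None, text
    return entity.strip(), sentence.strip()


record = json.loads(raw)
entity, sentence = split_entity(record["input"])
print("entity:  ", entity)
print("sentence:", sentence[:40], "...")
print("context: ", record["output"])
```

Records that do not follow the assumed layout come back with `None` as the entity, so they can be inspected rather than silently mis-split.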

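Data Fields

The data fields are the same across all splits:

  • text: the sentence containing the entity, of type string.
  • label: a string describing the relevant context for the entity in the sentence.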
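The field list names `text` and `label`, while the example instance uses the keys `input` and `output`. Which pair the published files actually use is not confirmed here, so a loader may want to accept either. The following is a hedged sketch; `normalize_record` and the `source`/`target` output keys are hypothetical names chosen for illustration.

```python
def normalize_record(record):
    """Map a raw record to {'source': ..., 'target': ...}.

    ASSUMPTION: a record uses either ('input', 'output'), as in the card's
    example, or ('text', 'label'), as in the card's field list.
    """
    for src_key, tgt_key in (("input", "output"), ("text", "label")):
        if src_key in record and tgt_key in record:
            return {"source": record[src_key], "target": record[tgt_key]}
    raise KeyError(f"unrecognized field names: {sorted(record)}")


# Both naming schemes normalize to the same shape.
print(normalize_record({"input": "0.6 million . The denominator ...", "output": "Dilutive effect ..."}))
print(normalize_record({"text": "0.6 million . The denominator ...", "label": "Dilutive effect ..."}))
```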

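Data Splits

The dataset is divided into training, validation, and test sets, with the following sizes:

|           | Train     | Validation | Test    |
|-----------|-----------|------------|---------|
| Instances | 1,498,995 | 187,383    | 187,383 |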
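For a quick sense of the split proportions, the arithmetic on the reported sizes can be sketched as follows. Only the three instance counts come from the table; the variable names are illustrative.

```python
# Instance counts as reported in the splits table.
SPLIT_SIZES = {"train": 1_498_995, "validation": 187_383, "test": 187_383}

total = sum(SPLIT_SIZES.values())
shares = {name: size / total for name, size in SPLIT_SIZES.items()}

print(f"total instances: {total:,}")
for name, size in SPLIT_SIZES.items():
    print(f"{name:<10} {size:>9,}  {shares[name]:.1%}")
```

The counts work out to roughly an 80/10/10 split over 1,873,761 instances.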

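Dataset Creation

The dataset was created by scraping the annual and quarterly reports of the top 1,500 LLCs in the world.

Models Trained Using This Dataset

Models finetuned using this dataset include: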

  1. EDGAR-T5-base
  2. EDGAR-BART-Base
  3. EDGAR-flan-t5-base
  4. EDGAR-T5-Large
  5. EDGAR-Tk-Instruct-Large
  6. Instruction tuned EDGAR-Tk-instruct-base
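A rough sketch of how one of these checkpoints might be used for context generation is given below. It assumes the checkpoints are sequence-to-sequence models loadable through the `transformers` Auto classes (their T5/BART names suggest this, but it is not confirmed here) and that `transformers` and PyTorch are installed. The repository IDs are taken from the dataset card's links; `load_seq2seq` and `generate_context` are hypothetical helper names.

```python
# Hugging Face repository IDs as linked from the dataset card.
MODEL_REPOS = [
    "him1411/EDGAR-T5-base",
    "him1411/EDGAR-BART-Base",
    "him1411/EDGAR-flan-t5-base",
    "him1411/EDGAR-T5-Large",
    "him1411/EDGAR-Tk-Instruct-Large",
    "him1411/EDGAR-Tk-instruct-base-inst-tune",
]


def load_seq2seq(repo_id):
    """Download and return (tokenizer, model) for a checkpoint.

    ASSUMPTION: the checkpoint is a seq2seq model. The import is deferred so
    the rest of this sketch runs without transformers installed.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    return tokenizer, model


def generate_context(text, tokenizer, model, max_new_tokens=32):
    """Generate a context phrase for an 'entity plus sentence' input string."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    for repo in MODEL_REPOS:
        print(repo)
    # Uncomment to download a checkpoint and run generation (needs network):
    # tokenizer, model = load_seq2seq(MODEL_REPOS[0])
    # print(generate_context("0.6 million . The denominator also includes ...", tokenizer, model))
```

Whether the models expect any prompt prefix or instruction template in front of the input is not stated in the card, so the raw input string used here is only a starting point.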

Citation Information

If you use this dataset or related artifacts, please cite the following paper:

@article{gupta2021context, title={Context-NER: Contextual Phrase Generation at Scale}, author={Gupta, Himanshu and Verma, Shreyas and Kumar, Tarun and Mishra, Swaroop and Agrawal, Tamanna and Badugu, Amogh and Bhatt, Himanshu Sharad}, journal={arXiv preprint arXiv:2109.08079}, year={2021} }
