
vikrantburman/financial_phrasebank

Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/vikrantburman/financial_phrasebank
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- cc-by-nc-sa-3.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- sentiment-classification
pretty_name: FinancialPhrasebank
dataset_info:
- config_name: sentences_allagree
  features:
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': neutral
          '2': positive
  splits:
  - name: train
    num_bytes: 303371
    num_examples: 2264
  download_size: 681890
  dataset_size: 303371
- config_name: sentences_75agree
  features:
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': neutral
          '2': positive
  splits:
  - name: train
    num_bytes: 472703
    num_examples: 3453
  download_size: 681890
  dataset_size: 472703
- config_name: sentences_66agree
  features:
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': neutral
          '2': positive
  splits:
  - name: train
    num_bytes: 587152
    num_examples: 4217
  download_size: 681890
  dataset_size: 587152
- config_name: sentences_50agree
  features:
  - name: sentence
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': negative
          '1': neutral
          '2': positive
  splits:
  - name: train
    num_bytes: 679240
    num_examples: 4846
  download_size: 681890
  dataset_size: 679240
tags:
- finance
---

# Dataset Card for financial_phrasebank

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Kaggle](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news) [ResearchGate](https://www.researchgate.net/publication/251231364_FinancialPhraseBank-v10)
- **Repository:**
- **Paper:** [Arxiv](https://arxiv.org/abs/1307.5336)
- **Leaderboard:** [Kaggle](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news/code) [PapersWithCode](https://paperswithcode.com/sota/sentiment-analysis-on-financial-phrasebank)
- **Point of Contact:** [Pekka Malo](mailto:pekka.malo@aalto.fi) [Ankur Sinha](mailto:ankur.sinha@aalto.fi)

### Dataset Summary

Polar sentiment dataset of sentences from financial news. The dataset consists of 4840 sentences from English-language financial news, categorised by sentiment. It is divided into configurations by the agreement rate among the 5 to 8 annotators who labelled each sentence.

### Supported Tasks and Leaderboards

Sentiment Classification

### Languages

English

## Dataset Structure

### Data Instances

```
{ "sentence": "Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .",
  "label": "negative"
}
```

### Data Fields

Each example contains:

- sentence: a tokenized line from the dataset
- label: the sentiment class of the sentence, one of 'positive', 'negative' or 'neutral' (declared in the metadata above as a class label with ids 0 = negative, 1 = neutral, 2 = positive)

### Data Splits

There is no train/validation/test split. However, the dataset is available in four configurations, depending on the percentage of annotators who agreed on the label:

- `sentences_50agree`: Number of instances with >=50% annotator agreement: 4846
- `sentences_66agree`: Number of instances with >=66% annotator agreement: 4217
- `sentences_75agree`: Number of instances with >=75% annotator agreement: 3453
- `sentences_allagree`: Number of instances with 100% annotator agreement: 2264

### Usage

```python
from datasets import load_dataset

# Load the highest-agreement configuration
ds = load_dataset("takala/financial_phrasebank", "sentences_allagree")

print(ds)
print(ds["train"][0])
```

Other configurations (e.g. `sentences_75agree`, `sentences_66agree`) can be loaded by changing the second argument.

### Quick baseline (Transformers)

```python
from datasets import load_dataset
from transformers import pipeline

ds = load_dataset("takala/financial_phrasebank", "sentences_allagree")["train"]

# General-purpose SST-2 sentiment model: it predicts only POSITIVE or NEGATIVE,
# so it has no counterpart for this dataset's "neutral" class.
clf = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

print(ds[0]["sentence"])
print(clf(ds[0]["sentence"]))
```

## Dataset Creation

### Curation Rationale

The key arguments for the low utilization of statistical techniques in financial sentiment analysis have been the difficulty of implementation for practical applications and the lack of high-quality training data for building such models. Especially in the case of finance and economic texts, annotated collections are a scarce resource and many are reserved for proprietary use only. To resolve the missing training data problem, we present a collection of ∼5000 sentences to establish human-annotated standards for benchmarking alternative modeling techniques.

The objective of the phrase-level annotation task was to classify each example sentence into a positive, negative or neutral category by considering only the information explicitly available in the given sentence. Since the study is focused only on financial and economic domains, the annotators were asked to consider the sentences from the viewpoint of an investor only, i.e. whether the news may have a positive, negative or neutral influence on the stock price. As a result, sentences whose sentiment is not relevant from an economic or financial perspective are considered neutral.

### Source Data

#### Initial Data Collection and Normalization

The corpus used in this paper is made up of English news on all listed companies in OMX Helsinki. The news has been downloaded from the LexisNexis database using an automated web scraper. Out of this news database, a random subset of 10,000 articles was selected to obtain good coverage across small and large companies, companies in different industries, as well as different news sources. Following the approach taken by Maks and Vossen (2010), we excluded all sentences which did not contain any of the lexicon entities. This reduced the overall sample to 53,400 sentences, each of which contains at least one recognized lexicon entity. The sentences were then classified according to the types of entity sequences detected. Finally, a random sample of ∼5000 sentences was chosen to represent the overall news database.

#### Who are the source language producers?

The source data was written by various financial journalists.

### Annotations

#### Annotation process

This release of the financial phrase bank covers a collection of 4840 sentences. The selected collection of phrases was annotated by 16 people with adequate background knowledge of financial markets.

Given the large number of overlapping annotations (5 to 8 annotations per sentence), there are several ways to define a majority-vote-based gold standard. To provide an objective comparison, we have formed 4 alternative reference datasets based on the strength of majority agreement (the four configurations listed under Data Splits).

#### Who are the annotators?

Three of the annotators were researchers and the remaining 13 annotators were master's students at Aalto University School of Business with majors primarily in finance, accounting, and economics.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

All annotators were from the same institution, so inter-annotator agreement should be interpreted with this in mind.

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.

If you are interested in commercial use of the data, please contact the following authors for an appropriate license:

- [Pekka Malo](mailto:pekka.malo@aalto.fi)
- [Ankur Sinha](mailto:ankur.sinha@aalto.fi)

### Citation Information

```
@article{Malo2014GoodDO,
  title={Good debt or bad debt: Detecting semantic orientations in economic texts},
  author={P. Malo and A. Sinha and P. Korhonen and J. Wallenius and P. Takala},
  journal={Journal of the Association for Information Science and Technology},
  year={2014},
  volume={65}
}
```

### Contributions

Thanks to [@frankier](https://github.com/frankier) for adding this dataset.
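
### Illustration: agreement thresholds

The card describes four configurations built from the strength of majority agreement among the 5 to 8 annotators per sentence. The following is a minimal sketch of how such a rule could be applied to one sentence's annotations. It is not the authors' code: the function names, the example votes, the tie-breaking behaviour, and the exact cut-offs (read off the configuration names and the ">=" wording in the card) are all assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Cut-offs assumed from the configuration names and the ">=" wording in the
# card; the exact rule used to build the released files is not stated here.
CONFIG_THRESHOLDS = {
    "sentences_50agree": Fraction(50, 100),
    "sentences_66agree": Fraction(66, 100),
    "sentences_75agree": Fraction(75, 100),
    "sentences_allagree": Fraction(1),
}


def majority_vote(annotations):
    """Return (majority_label, agreement_rate) for one sentence.

    `annotations` holds one label per annotator. If two labels tie, the one
    Counter reports first is returned, an arbitrary choice for this sketch.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, Fraction(votes, len(annotations))


def matching_configs(annotations):
    """List the configurations whose assumed cut-off this sentence meets."""
    _, rate = majority_vote(annotations)
    return [name for name, cutoff in CONFIG_THRESHOLDS.items() if rate >= cutoff]


# Hypothetical annotations for a single sentence (not taken from the dataset).
votes = ["negative", "negative", "negative", "neutral", "negative"]
label, rate = majority_vote(votes)
print(label, float(rate))       # negative 0.8
print(matching_configs(votes))  # every configuration except sentences_allagree
```

Under these assumptions, a sentence with four of five annotators in agreement would appear in the 50%, 66% and 75% configurations but not in `sentences_allagree`, which matches the nesting implied by the instance counts (4846 > 4217 > 3453 > 2264).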

Provider:
vikrantburman