
demo-org/auditor_review

Source: Hugging Face. Updated: 2022-08-30. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/demo-org/auditor_review
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- sentiment-classification
paperswithcode_id: null
pretty_name: Auditor_Review
---

# Dataset Card for Auditor_Review

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)

## Dataset Description

Auditor review data collected by News Department

- **Point of Contact:** Talked to COE for Auditing, currently sue@demo.org

### Dataset Summary

Auditor sentiment dataset of sentences from financial news. The dataset consists of 3500 sentences from English language financial news categorized by sentiment. The dataset is divided by the agreement rate of 5-8 annotators.

### Supported Tasks and Leaderboards

Sentiment Classification

### Languages

English

## Dataset Structure

### Data Instances

```
"sentence": "Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .",
"label": "negative"
```

### Data Fields

- sentence: a tokenized line from the dataset
- label: a label corresponding to the class as a string: 'positive' - (2), 'neutral' - (1), or 'negative' - (0)

Complete data code is [available here](https://www.datafiles.samhsa.gov/get-help/codebooks/what-codebook)

### Data Splits

A train/test split was created randomly with a 75/25 split.

## Dataset Creation

### Curation Rationale

To gather our auditor evaluations into one dataset. Previous attempts using off-the-shelf sentiment models reached only 70% F1; this dataset is an attempt to improve on that performance.

### Source Data

#### Initial Data Collection and Normalization

The corpus used in this paper is made up of English news reports.

#### Who are the source language producers?

The source data was written by various auditors.

### Annotations

#### Annotation process

This release of the auditor reviews covers a collection of 4840 sentences. The selected collection of phrases was annotated by 16 people with adequate background knowledge of financial markets. The subset here is where inter-annotator agreement was greater than 75%.

#### Who are the annotators?

They were drawn from the SME list; names are held by sue@demo.org

### Personal and Sensitive Information

There is no personal or sensitive information in this dataset.

## Considerations for Using the Data

### Discussion of Biases

All annotators were from the same institution, so inter-annotator agreement should be interpreted with this in mind.

The [Dataset Measurement tool](https://huggingface.co/spaces/huggingface/data-measurements-tool) identified these bias statistics:

![Bias](https://huggingface.co/datasets/demo-org/auditor_review/resolve/main/bias_stats.png)

### Other Known Limitations

[More Information Needed]

### Licensing Information

License: Demo.Org Proprietary - DO NOT SHARE
Provider:
demo-org
Summary of Original Information

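Dataset Overview

Dataset Name

  • Name: Auditor_Review

Dataset Summary

  • Description: Contains 3500 sentences from English-language financial news, categorized by sentiment. The dataset is divided by the agreement rate of 5-8 annotators.
  • Task: Sentiment classification
  • Language: English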

Dataset Structure

  • Example data instance:

    "sentence": "Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .", "label": "negative"

  • Data fields:

    • sentence: a tokenized line from the dataset
    • label: a label corresponding to the class, as a string: positive - (2), neutral - (1), negative - (0)
  • Data splits: a train/test split created randomly with a 75/25 ratio.
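
The field description and the 75/25 random split can be sketched in a few lines of Python. This is a minimal illustration, not the curators' actual procedure: the helper names `encode_label` and `train_test_split`, the seed, and the toy records are assumptions made for the example; only the label-to-integer mapping and the 75/25 ratio come from the card.

```python
import random

# Mapping taken from the field description: negative=0, neutral=1, positive=2.
LABEL_TO_ID = {"negative": 0, "neutral": 1, "positive": 2}
ID_TO_LABEL = {i: name for name, i in LABEL_TO_ID.items()}


def encode_label(label: str) -> int:
    """Convert a string label to its integer class id (hypothetical helper)."""
    return LABEL_TO_ID[label.strip().lower()]


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and return (train, test).

    The seed is an assumption; the card only says the split was random.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# The data instance shown in the card.
example = {
    "sentence": "Pharmaceuticals group Orion Corp reported a fall in its "
                "third-quarter earnings that were hit by larger expenditures "
                "on R&D and marketing .",
    "label": "negative",
}
encoded = encode_label(example["label"])

# Toy records (made up) used only to show the split proportions.
toy = [{"sentence": f"sentence {i}", "label": "neutral"} for i in range(8)]
train, test = train_test_split(toy)
```

Fixing the seed keeps the split reproducible between runs, which matters when results are compared against an earlier baseline.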

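Dataset Creation

  • Source data:
    • Initial data collection and normalization: the corpus is built from English news reports
    • Source language producers: written by various auditors
  • Annotations:
    • Annotation process: a collection of 4840 sentences annotated by 16 people with background knowledge of financial markets; the subset used here is the part where inter-annotator agreement exceeded 75%.
    • Annotators: drawn from the SME list; names are held by sue@demo.org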
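
The agreement-based subset can be illustrated with a short Python sketch. The card does not say how agreement was measured, so this example assumes it is the share of annotators who chose the majority label; the function names and the votes below are hypothetical and exist only for illustration.

```python
from collections import Counter


def majority_agreement(votes):
    """Return (majority_label, share of annotators who chose it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def filter_by_agreement(annotated, threshold=0.75):
    """Keep sentences whose majority share is strictly above the threshold."""
    kept = []
    for sentence, votes in annotated:
        label, share = majority_agreement(votes)
        if share > threshold:
            kept.append({"sentence": sentence, "label": label})
    return kept


# Made-up votes, for illustration only.
annotated = [
    ("Sentence A", ["negative"] * 5),                    # share 5/5, kept
    ("Sentence B", ["positive"] * 3 + ["neutral"] * 2),  # share 3/5, dropped
    ("Sentence C", ["neutral"] * 6 + ["positive"] * 2),  # share 6/8, not above 0.75
]
subset = filter_by_agreement(annotated)
```

The comparison is strict because the card says agreement was "greater than 75%", so a sentence with exactly 75% agreement is excluded under this reading.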

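Considerations for Using the Data

  • Discussion of biases: All annotators were from the same institution, which should be taken into account when interpreting inter-annotator agreement.
  • Other known limitations: More information needed

Licensing Information

  • License: Demo.Org Proprietary - DO NOT SHARE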