eduagarcia/FactNews

Name: eduagarcia/FactNews
Creator: eduagarcia
Published: 2024-04-29 22:45:39
License: not specified

Hugging Face · Updated 2024-04-29 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/eduagarcia/FactNews

Description:

---
dataset_info:
- config_name: bias_prediction
  features:
  - name: file
    dtype: string
  - name: id_sente
    dtype: string
  - name: id_article
    dtype: string
  - name: domain
    dtype: string
  - name: year
    dtype: string
  - name: sentences
    dtype: string
  - name: label
    dtype: int64
  - name: label_text
    dtype: string
  splits:
  - name: train
    num_bytes: 163041
    num_examples: 738
  - name: full_train
    num_bytes: 951010
    num_examples: 4403
  - name: test
    num_bytes: 384327
    num_examples: 1788
  download_size: 718605
  dataset_size: 1498378
- config_name: factuality_prediction
  features:
  - name: file
    dtype: string
  - name: id_sente
    dtype: string
  - name: id_article
    dtype: string
  - name: domain
    dtype: string
  - name: year
    dtype: string
  - name: sentences
    dtype: string
  - name: label
    dtype: int64
  - name: label_text
    dtype: string
  splits:
  - name: train
    num_bytes: 606722
    num_examples: 2826
  - name: full_train
    num_bytes: 944929
    num_examples: 4403
  - name: test
    num_bytes: 381863
    num_examples: 1788
  download_size: 927856
  dataset_size: 1933514
- config_name: original
  features:
  - name: file
    dtype: string
  - name: id_sente
    dtype: string
  - name: id_article
    dtype: string
  - name: domain
    dtype: string
  - name: year
    dtype: string
  - name: sentences
    dtype: string
  - name: classe
    dtype: int64
  - name: label_text
    dtype: string
  splits:
  - name: train
    num_bytes: 1317047
    num_examples: 6191
  download_size: 516651
  dataset_size: 1317047
configs:
- config_name: bias_prediction
  data_files:
  - split: train
    path: bias_prediction/train-*
  - split: full_train
    path: bias_prediction/full_train-*
  - split: test
    path: bias_prediction/test-*
- config_name: factuality_prediction
  data_files:
  - split: train
    path: factuality_prediction/train-*
  - split: full_train
    path: factuality_prediction/full_train-*
  - split: test
    path: factuality_prediction/test-*
- config_name: original
  data_files:
  - split: train
    path: original/train-*
license: unknown
task_categories:
- text-classification
language:
- pt
- por
pretty_name: FactNews
size_categories:
- 1K<n<10K
multilinguality:
- monolingual
language_creators:
- found
annotations_creators:
- expert-generated
tags:
- subjectivity
- mediabias
- media-bias
---

## Disclaimer

*I am not the author of this dataset. This is a modified version of the FactNews dataset on Hugging Face; the original data was made available by Vargas et al., 2023 and can be downloaded from: https://github.com/franciellevargas/FactNews*

*Modifications:*
- *The "original" subset contains the unmodified original CSV.*
- *The subsets for the "bias_prediction" and "factuality_prediction" tasks were split into train (70%) and test (30%) by randomly selecting sentences grouped by their id_article. This configuration differs from that of the authors, who used a 90%/10% 10-fold split in the paper.*
- *Each task contains an unbalanced split (full_train) and a balanced split (train).*

# Sentence-Level Annotated Dataset for Predicting Factuality of News and Bias of Media Outlets in Portuguese

Automated fact-checking and news credibility verification at scale require accurate prediction of news factuality and media bias. Here, we introduce a large sentence-level dataset, titled FactNews, composed of 6,191 sentences expertly annotated according to factuality and media bias definitions proposed by AllSides. We used FactNews to assess the overall reliability of news sources by formulating two text classification problems for predicting sentence-level factuality of news reporting and bias of media outlets. Our experiments demonstrate that biased sentences present a higher number of words compared to factual sentences, besides having a predominance of emotions. Hence, the fine-grained analysis of subjectivity and impartiality of news articles showed promising results for predicting the reliability of the entire media outlet. Finally, due to the severity of fake news and political polarization in Brazil, and the lack of research for Portuguese, both dataset and baseline were proposed for Brazilian Portuguese.

The following table describes in detail the FactNews labels, documents, and stories:

| Factual | Quotes | Biased | Total sentences | Total news stories | Total news documents |
| :--- | :---: | ---: | ---: | ---: | ---: |
| 4,242 | 1,391 | 558 | 6,191 | 100 | 300 |

### Sources:

- Media 1: Folha de São Paulo
- Media 2: Estadão
- Media 3: O Globo

### Paper Results:

Sentence-Level Media Bias Prediction (90%/10% 10-fold split)
- 67% (F1-Score) by Fine-tuned mBert-case

Sentence-Level Factuality Prediction (90%/10% 10-fold split)
- 88% (F1-Score) by Fine-tuned mBert-case

## Citation

```
Vargas, F., Jaidka, K., Pardo, T.A.S., Benevenuto, F. (2023). Predicting Sentence-Level Factuality of News and Bias of Media Outlets. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pp.1197--1206. Varna, Bulgaria. Association for Computational Linguistics (ACL).
```

**Bibtex**

```
@inproceedings{vargas-etal-2023-predicting,
    title = "Predicting Sentence-Level Factuality of News and Bias of Media Outlets",
    author = "Vargas, Francielle and Jaidka, Kokil and Pardo, Thiago and Benevenuto, Fabr{\'\i}cio",
    editor = "Mitkov, Ruslan and Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.127",
    pages = "1197--1206",
}
```

## Dataset Description

- **Homepage:** https://github.com/franciellevargas/FactNews
- **Paper:** [Predicting Sentence-Level Factuality of News and Bias of Media Outlets](https://aclanthology.org/2023.ranlp-1.127)
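The disclaimer describes a 70%/30% train/test split in which sentences are grouped by `id_article`, so that no article contributes sentences to both sides. The exact procedure used to build this dataset is not given on this page; the following is only a minimal sketch of one way to do such a grouped split. The function name `split_by_article`, the fixed seed, and the toy rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_article(rows, test_fraction=0.3, seed=0):
    """Split sentence rows into train/test so that no article spans both sides.

    Hypothetical helper: the fraction applies to articles, so the share of
    sentences on each side is only approximately 70/30.
    """
    by_article = defaultdict(list)
    for row in rows:
        by_article[row["id_article"]].append(row)

    # Sort before shuffling so the result depends only on the seed.
    article_ids = sorted(by_article)
    random.Random(seed).shuffle(article_ids)

    n_test = round(len(article_ids) * test_fraction)
    test_ids = article_ids[:n_test]
    train_ids = article_ids[n_test:]

    train = [row for a in train_ids for row in by_article[a]]
    test = [row for a in test_ids for row in by_article[a]]
    return train, test


# Toy rows, made up for illustration (not taken from FactNews):
# 10 articles with 3 sentences each.
rows = [
    {"id_article": str(i // 3), "sentences": f"sentence {i}", "label": i % 2}
    for i in range(30)
]
train, test = split_by_article(rows)
```

Grouping by article before splitting avoids leaking sentences from the same news document into both train and test, which would otherwise inflate test scores.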

Provided by:

eduagarcia

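Summary of original information

Dataset overview

Dataset name

Name: FactNews
Language: Portuguese (pt, por)
Task category: text classification
Multilinguality: monolingual
Dataset size: 1K<n<10K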

Dataset configurations

bias_prediction
- Features:
  - file: string
  - id_sente: string
  - id_article: string
  - domain: string
  - year: string
  - sentences: string
  - label: int64
  - label_text: string
- Splits:
  - train: 738 examples, 163041 bytes
  - full_train: 4403 examples, 951010 bytes
  - test: 1788 examples, 384327 bytes
- Download size: 718605 bytes
- Dataset size: 1498378 bytes
factuality_prediction
- Features:
  - file: string
  - id_sente: string
  - id_article: string
  - domain: string
  - year: string
  - sentences: string
  - label: int64
  - label_text: string
- Splits:
  - train: 2826 examples, 606722 bytes
  - full_train: 4403 examples, 944929 bytes
  - test: 1788 examples, 381863 bytes
- Download size: 927856 bytes
- Dataset size: 1933514 bytes
original
- Features:
  - file: string
  - id_sente: string
  - id_article: string
  - domain: string
  - year: string
  - sentences: string
  - classe: int64
  - label_text: string
- Splits:
  - train: 6191 examples, 1317047 bytes
- Download size: 516651 bytes
- Dataset size: 1317047 bytes
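
The configurations and splits listed above can be loaded with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the repository is reachable under the ID `eduagarcia/FactNews`; the wrapper `load_factnews` and the table `EXPECTED_EXAMPLES` are hypothetical names introduced here, with the counts copied from the metadata above.

```python
# Example counts copied from the dataset metadata listed above.
EXPECTED_EXAMPLES = {
    "bias_prediction": {"train": 738, "full_train": 4403, "test": 1788},
    "factuality_prediction": {"train": 2826, "full_train": 4403, "test": 1788},
    "original": {"train": 6191},
}


def task_splits_cover_original(config):
    """Check that full_train + test account for every sentence in `original`."""
    sizes = EXPECTED_EXAMPLES[config]
    return sizes["full_train"] + sizes["test"] == EXPECTED_EXAMPLES["original"]["train"]


def load_factnews(config, split):
    """Load one split and compare its size with the listed example count.

    Hypothetical wrapper around `datasets.load_dataset`; needs network access.
    """
    from datasets import load_dataset  # lazy import: requires `pip install datasets`

    if split not in EXPECTED_EXAMPLES[config]:
        raise ValueError(f"config {config!r} has no split {split!r}")
    ds = load_dataset("eduagarcia/FactNews", config, split=split)
    expected = EXPECTED_EXAMPLES[config][split]
    if len(ds) != expected:
        raise ValueError(f"expected {expected} examples, got {len(ds)}")
    return ds


# Usage (downloads data on first call):
# test_set = load_factnews("bias_prediction", "test")
# print(test_set[0]["sentences"], test_set[0]["label_text"])
```

Note that the `original` configuration stores its integer label in a column named `classe`, while the two task configurations use `label`.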

Dataset files

bias_prediction:
- train: bias_prediction/train-*
- full_train: bias_prediction/full_train-*
- test: bias_prediction/test-*
factuality_prediction:
- train: factuality_prediction/train-*
- full_train: factuality_prediction/full_train-*
- test: factuality_prediction/test-*
original:
- train: original/train-*

Dataset label statistics

Total sentences: 6,191
Total news stories: 100
Total news documents: 300
Factual sentences: 4,242
Quote sentences: 1,391
Biased sentences: 558
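
The three class counts sum to 6,191, the number of examples in the `original` configuration, and they show how imbalanced the labels are. A small sketch of that arithmetic follows; the dictionary keys are informal names chosen here, not identifiers from the dataset.

```python
# Class counts as listed above; the key names are informal labels.
counts = {"factual": 4242, "quotes": 1391, "biased": 558}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

print(f"total: {total}")  # total: 6191
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# factual: 68.5%
# quotes: 22.5%
# biased: 9.0%
```

With roughly 9% biased sentences, the unbalanced `full_train` split and the balanced `train` split can be expected to behave quite differently for the bias task.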

Source media

Media 1: Folha de São Paulo
Media 2: Estadão
Media 3: O Globo

Paper results

Sentence-level media bias prediction: 67% F1-Score (Fine-tuned mBert-case)
Sentence-level factuality prediction: 88% F1-Score (Fine-tuned mBert-case)
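
The results above are reported as F1-Scores. This page does not say which averaging was used, so the following is only a sketch of how a macro-averaged F1 could be computed for a binary task; macro averaging is an assumption, and the labels and predictions are made up for illustration.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels and predictions, not FactNews results.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
score = macro_f1(y_true, y_pred)  # class 1: F1 = 2/3, class 0: F1 = 0.8
```

Scores obtained on the 70%/30% splits of this dataset version are not directly comparable with the paper's figures, which come from a 90%/10% 10-fold setup.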
