
okite97/news-data

Hugging Face, updated 2022-08-25, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/okite97/news-data
Resource description:
---
annotations_creators:
- other
language:
- 'en'
language_creators:
- found
license:
- afl-3.0
multilinguality:
- monolingual
pretty_name: News Dataset
size_categories:
- 1K<n<10K
source_datasets:
- original
tags: []
task_categories:
- text-classification
task_ids:
- topic-classification
- multi-class-classification
---

# Dataset Card for news-data

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Dataset Curators](#dataset-curators)

### Dataset Summary

The News Dataset is an English-language dataset containing just over 4k unique news articles scraped from AriseTv, one of the most popular television news channels in Nigeria.

### Supported Tasks and Leaderboards

It supports classification of news articles into different categories.

### Languages

English

## Dataset Structure

### Data Instances

'''
{'Title': 'Nigeria: APC Yet to Zone Party Positions Ahead of Convention',
 'Excerpt': 'The leadership of the All Progressives Congress (APC), has denied reports that it had zoned some party positions ahead of',
 'Category': 'politics',
 'labels': 2}
'''

### Data Fields

* Title: a string containing the title of a news article as shown
* Excerpt: a string containing a short extract from the body of the news article
* Category: a string giving the category of an example (string label)
* labels: an integer giving the class of an example (label)

### Data Splits

| Dataset Split | Number of instances in split |
| ----------- | ----------- |
| Train | 4,594 |
| Paragraph | 811 |

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

The code for creating the dataset is at *https://github.com/chimaobi-okite/NLP-Projects-Competitions/blob/main/NewsCategorization/Data/NewsDataScraping.ipynb*. The examples were scraped from <https://www.arise.tv/>.

### Annotations

#### Annotation process

The annotation is based on the news category assigned on the [arisetv](https://www.arise.tv) website.

#### Who are the annotators?

Journalists at arisetv

## Considerations for Using the Data

### Social Impact of Dataset

The purpose of this dataset is to help develop models that can classify news articles into categories. This task is useful for efficiently presenting information given a large quantity of text. It should be made clear that any summarizations produced by models trained on this dataset are reflective of the language used in the articles, but are in fact automatically generated.

### Discussion of Biases

This data is biased towards news events in Nigeria, but a model built on it can also classify news from other parts of the world, with a slight degradation in performance.

### Dataset Curators

The dataset content was created by people at arise but was scraped by [@github-chimaobi-okite](https://github.com/chimaobi-okite/).
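Provider:
okite97
Summary of Original Information

Dataset Overview

Dataset Description

Dataset Summary

  • Name: News Dataset
  • Language: English
  • Size: about 4k news articles
  • Source: scraped from the AriseTv website in Nigeria

Supported Tasks and Leaderboards

  • Task: news article classification
  • Type: multi-class classification

Languages

  • English

Dataset Structure

Data Instances

  • Example structure:
    • Title: news title (string)
    • Excerpt: news excerpt (string)
    • Category: news category (string label)
    • labels: category label (integer)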
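
The four-field record layout above can be checked mechanically before training. The following is a minimal sketch, not part of the dataset's own tooling: `EXPECTED_FIELDS` and `validate_record` are hypothetical names introduced here, and the example record is the one shown in the dataset card.

```python
# Expected field names and Python types, taken from the dataset card.
# EXPECTED_FIELDS and validate_record are illustrative names, not dataset APIs.
EXPECTED_FIELDS = {
    "Title": str,
    "Excerpt": str,
    "Category": str,
    "labels": int,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return problems


# The example instance from the dataset card.
example = {
    "Title": "Nigeria: APC Yet to Zone Party Positions Ahead of Convention",
    "Excerpt": "The leadership of the All Progressives Congress (APC), has denied "
               "reports that it had zoned some party positions ahead of",
    "Category": "politics",
    "labels": 2,
}

print(validate_record(example))  # an empty list means the record is well-formed
```

Returning a list of problems rather than raising on the first one makes it easier to report every malformed field in a scraped record at once.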

Data Fields

  • Title: news title
  • Excerpt: news excerpt
  • Category: news category
  • labels: category label
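
Because each example carries both a string `Category` and an integer `labels` value, the mapping between the two can be recovered from the data itself. The sketch below assumes records are plain dictionaries with the four fields listed above; `build_label_map` and `to_model_input` are hypothetical helpers, and joining `Title` and `Excerpt` with a newline is only one possible way to form a classifier input. The card documents only the `politics` → `2` pair, so no other categories are assumed here.

```python
def build_label_map(records):
    """Map each integer label to its category name, checking consistency."""
    label_to_category = {}
    for record in records:
        label, category = record["labels"], record["Category"]
        # setdefault stores the first category seen for a label and returns it.
        if label_to_category.setdefault(label, category) != category:
            raise ValueError(f"label {label} maps to more than one category")
    return label_to_category


def to_model_input(record, separator="\n"):
    """Join Title and Excerpt into one string (an illustrative choice)."""
    return record["Title"] + separator + record["Excerpt"]


# A one-record sample built from the example in the dataset card.
records = [
    {
        "Title": "Nigeria: APC Yet to Zone Party Positions Ahead of Convention",
        "Excerpt": "The leadership of the All Progressives Congress (APC), has "
                   "denied reports that it had zoned some party positions ahead of",
        "Category": "politics",
        "labels": 2,
    }
]

print(build_label_map(records))  # {2: 'politics'}
```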

Data Splits

  • Train: 4,594 instances
  • Paragraph: 811 instances
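
As a quick sanity check on the split sizes above, the share of each split can be computed directly. This is a small sketch using only the two counts reported in the card; `split_fractions` is a hypothetical helper name.

```python
# Split sizes as reported in the dataset card.
split_sizes = {"Train": 4594, "Paragraph": 811}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


total = sum(split_sizes.values())
for name, fraction in split_fractions(split_sizes).items():
    print(f"{name}: {split_sizes[name]} instances ({fraction:.1%} of {total})")
# Train: 4594 instances (85.0% of 5405)
# Paragraph: 811 instances (15.0% of 5405)
```

The two splits together hold 5,405 instances, which is more than the "about 4k" figure in the summary; that figure is closer to the size of the Train split alone.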

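Dataset Creation

Source Data

  • Initial data collection and normalization: the scraping code is at https://github.com/chimaobi-okite/NLP-Projects-Competitions/blob/main/NewsCategorization/Data/NewsDataScraping.ipynb; the data comes from AriseTv

Annotations

  • Annotation process: based on the news categories on the AriseTv website
  • Annotators: journalists at AriseTv

Considerations for Using the Data

Social Impact of Dataset

  • Purpose: to help develop models that can classify news articles into categories
  • Note: any summaries produced by models are automatically generated and reflect the language used in the articles

Discussion of Biases

  • Bias: the data is biased towards news events in Nigeria, but a model built on it can also classify news from other parts of the world, with a slight degradation in performance.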