osanseviero/covid_news

Name: osanseviero/covid_news
Creator: osanseviero
Published: 2022-09-09 14:53:32
License: not specified in this listing (the dataset card states cc0-1.0)

Source: Hugging Face. Updated 2022-09-09. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/osanseviero/covid_news
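A minimal sketch of how the dataset might be fetched through the mirror above. It assumes the `datasets` library is installed, that the `HF_ENDPOINT` environment variable is set before the Hugging Face libraries are imported, and that the repository exposes a `train` split (the card does not document its splits). The helper name `load_covid_news` is hypothetical.

```python
import os

# Assumption: point the Hugging Face client at the mirror used in the link above.
# This must happen before `datasets` / `huggingface_hub` are imported.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

DATASET_ID = "osanseviero/covid_news"


def load_covid_news(split="train"):
    """Load one split of the dataset (hypothetical helper).

    The split name is an assumption; the dataset card leaves
    "Data Splits" unfilled.
    """
    # Imported lazily so the endpoint override above is already in place.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)
```

Calling `load_covid_news()` requires network access; if the split name is wrong, `load_dataset(DATASET_ID)` without a `split` argument returns whatever splits the repository provides.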


Resource description:

---
license:
- cc0-1.0
converted_from: kaggle
kaggle_id: timmayer/covid-news-articles-2020-2022
---

# Dataset Card for COVID News Articles (2020 - 2022)

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://kaggle.com/datasets/timmayer/covid-news-articles-2020-2022
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The dataset contains approximately half a million news articles collected over a period of 2 years during the onset and surge of the Coronavirus pandemic. It consists of 3 columns: **title**, **content** and **category**. **title** is the headline of the news article, **content** is the article text itself, and **category** denotes the overall context of the article at a high level.

This dataset can be used to pre-train large language models (LLMs) and to demonstrate downstream NLP tasks such as binary or multi-class text classification.

The dataset can also be used to study how the behavior of language models changes when the data distribution shifts. For example, the original transformer-based BERT model was trained before the COVID era. By training a masked language model (MLM) on this dataset, we can compare the behavior of the original BERT model with that of the newly trained models.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

This dataset was shared by [@timmayer](https://kaggle.com/timmayer)

### Licensing Information

The license for this dataset is cc0-1.0

### Citation Information

```bibtex
[More Information Needed]
```

### Contributions

[More Information Needed]
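The Dataset Summary in the card above proposes training a masked language model (MLM) on these articles. As a rough illustration of the masking step only, the sketch below hides a random subset of tokens and records the originals as prediction targets. It is a simplification: it assumes whitespace tokenization and always substitutes `[MASK]`, whereas BERT uses subword tokens and replaces only 80% of selected tokens with `[MASK]` (10% get a random token, 10% are left unchanged). The function name, the sample headline, and the parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM objective.

    labels[i] holds the original token where position i was masked,
    and None where the token was left visible.
    """
    rng = random.Random(seed)  # seeded so the masking is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical headline standing in for a value of the `title` column.
headline = "Vaccine rollout expands as new variant spreads".split()
masked, labels = mask_tokens(headline, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

A model trained on such pairs learns to recover the hidden tokens; comparing its predictions on pandemic-era vocabulary with those of the original BERT is one way to probe the data shift the card describes.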

Provided by:

osanseviero

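Summary of Original Information

Dataset overview: COVID News Articles (2020-2022)

Dataset Description

Dataset Summary

Dataset contents: approximately 500,000 news articles collected during the COVID-19 pandemic, from 2020 to 2022.
Dataset structure: three main fields: title (the headline of the news article), content (the full text of the article) and category (a high-level classification of the article).
Use cases: pre-training large language models (LLMs) and downstream natural language processing (NLP) tasks such as text classification.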
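To make the text classification use case concrete, the sketch below turns records with the three documented fields into (text, label) pairs with integer class ids. The sample records and their category values are invented for illustration; the card does not list the actual categories. The helper names are hypothetical.

```python
def build_label_map(records):
    """Map each distinct category string to a stable integer id."""
    categories = sorted({r["category"] for r in records})
    return {name: idx for idx, name in enumerate(categories)}


def to_examples(records, label_map):
    """Join title and content into one text and attach the class id."""
    return [
        (f'{r["title"]}\n\n{r["content"]}', label_map[r["category"]])
        for r in records
    ]


# Hypothetical records using the documented columns: title, content, category.
records = [
    {"title": "Hospitals add capacity", "content": "Wards reopen.", "category": "health"},
    {"title": "Markets react to lockdown", "content": "Shares fall.", "category": "business"},
    {"title": "Clinics extend hours", "content": "More staff hired.", "category": "health"},
]

label_map = build_label_map(records)
examples = to_examples(records, label_map)
print(label_map)
print(examples[0])
```

Sorting the category names before numbering keeps the label ids identical across runs, which matters when a trained classifier is reloaded later.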

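Supported Tasks and Leaderboards

Missing information: the supported tasks and leaderboards are not provided.

Languages

Missing information: the languages covered by the dataset are not provided.

Dataset Structure

Data Instances

Missing information: no description of data instances is provided.

Data Fields

Missing information: no description of data fields is provided.

Data Splits

Missing information: no description of data splits is provided.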

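Dataset Creation

Curation Rationale

Missing information: the curation rationale is not provided.

Source Data

Initial Data Collection and Normalization

Missing information: no description of the initial data collection and normalization is provided.

Source Language Producers

Missing information: no information about the source language producers is provided.

Annotations

Annotation Process

Missing information: no description of the annotation process is provided.

Annotators

Missing information: no information about the annotators is provided.

Personal and Sensitive Information

Missing information: how personal and sensitive information is handled is not provided.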

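Considerations for Using the Data

Social Impact of Dataset

Missing information: no discussion of the dataset's social impact is provided.

Discussion of Biases

Missing information: no discussion of possible biases in the dataset is provided.

Other Known Limitations

Missing information: no description of other known limitations is provided.

Additional Information

Dataset Curators

Dataset shared by: @timmayer

Licensing Information

Dataset license: cc0-1.0

Citation Information

Missing information: citation information for the dataset is not provided.

Contributions

Missing information: contribution information for the dataset is not provided.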
