osanseviero/covid_news
Source: Hugging Face. Updated 2022-09-09; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/osanseviero/covid_news
Resource description:
---
license:
- cc0-1.0
converted_from: kaggle
kaggle_id: timmayer/covid-news-articles-2020-2022
---
# Dataset Card for COVID News Articles (2020 - 2022)
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://kaggle.com/datasets/timmayer/covid-news-articles-2020-2022
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The dataset contains approximately half a million news articles collected over two years during the onset and surge of the coronavirus pandemic. It has three columns: **title**, **content**, and **category**. **title** is the headline of the news article, **content** is the text of the article itself, and **category** denotes the high-level context of the article.
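As a rough illustration of the three-column layout described above, the following sketch reads records and counts articles per category. It assumes a CSV file with a header row; the card does not state the file format, and the sample rows and category values (`health`, `business`) are hypothetical placeholders, not data from the dataset. The helper names `read_articles` and `category_counts` are likewise invented for this example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: the real file name, format, and category
# values are not given in this card.
SAMPLE_CSV = """title,content,category
"Vaccine rollout begins","Health officials started distributing doses today.",health
"Markets react to lockdown","Stocks fell as new restrictions were announced.",business
"Hospitals near capacity","Doctors report a surge in admissions.",health
"""

EXPECTED_COLUMNS = ("title", "content", "category")


def read_articles(handle):
    """Yield one dict per article after checking the three documented columns exist."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        yield {c: row[c] for c in EXPECTED_COLUMNS}


def category_counts(articles):
    """Count how many articles fall under each category label."""
    return Counter(article["category"] for article in articles)


articles = list(read_articles(io.StringIO(SAMPLE_CSV)))
counts = category_counts(articles)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

A per-category count like this is a common first check before using **category** as a classification target, since a heavily skewed label distribution affects how a classifier should be trained and evaluated.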
This dataset can be used to pre-train large language models (LLMs) and to demonstrate downstream NLP tasks such as binary or multi-class text classification. It can also be used to study how language models behave when the data distribution shifts. For example, the original Transformer-based BERT model was trained before the COVID era. By training a masked language model (MLM) on this dataset, we can compare the behavior of the original BERT model with that of the newly trained model.
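To make the masked language modeling objective mentioned above more concrete, here is a minimal sketch of BERT-style input corruption: roughly 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. It is a simplification under stated assumptions: tokens are split on whitespace rather than with a WordPiece tokenizer, the function name `mask_tokens` and the example headline are hypothetical, and real training would rely on an established library rather than this code.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token list using the 80/10/10 masking rule.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; the model is not scored here
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


# Hypothetical headline, split on whitespace for illustration only.
headline = "New variant spreads as vaccine boosters roll out".split()
corrupted, labels = mask_tokens(headline, vocab=headline, seed=0)
print(corrupted)
print(labels)
```

The model is trained to recover `labels` from `corrupted`. Terms that became frequent only during the pandemic are the kind of tokens where a model trained on this dataset and the original BERT model might be expected to differ.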
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
[More Information Needed]
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
This dataset was shared by [@timmayer](https://kaggle.com/timmayer)
### Licensing Information
The license for this dataset is cc0-1.0
### Citation Information
```bibtex
[More Information Needed]
```
### Contributions
[More Information Needed]
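Provider:
osanseviero
Summary of original information
Dataset overview: COVID News Articles (2020 - 2022)
Dataset description
Dataset summary
- Dataset content: Contains approximately 500,000 news articles collected during the coronavirus pandemic from 2020 to 2022.
- Dataset structure: Contains three main fields: title (the headline of the news article), content (the full text of the news article), and category (the high-level classification of the news article).
- Use cases: Can be used to pre-train large language models (LLMs) and for downstream natural language processing (NLP) tasks such as text classification.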
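Supported tasks and leaderboards
- Missing information: The supported tasks and leaderboards are not provided.
Languages
- Missing information: The languages contained in the dataset are not provided.
Dataset structure
Data instances
- Missing information: No description of the data instances is provided.
Data fields
- Missing information: No description of the data fields is provided.
Data splits
- Missing information: No description of the data splits is provided.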
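Dataset creation
Curation rationale
- Missing information: The rationale for curating the data is not provided.
Source data
Initial data collection and normalization
- Missing information: No description of the initial data collection and normalization is provided.
Source language producers
- Missing information: No information about the source language producers is provided.
Annotations
Annotation process
- Missing information: No description of the annotation process is provided.
Annotators
- Missing information: No information about the annotators is provided.
Personal and sensitive information
- Missing information: How personal and sensitive information is handled is not provided.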
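Considerations for using the data
Social impact of dataset
- Missing information: No discussion of the dataset's social impact is provided.
Discussion of biases
- Missing information: No discussion of biases that may exist in the dataset is provided.
Other known limitations
- Missing information: No description of other known limitations of the dataset is provided.
Additional information
Dataset curators
- Dataset shared by: @timmayer
Licensing information
- Dataset license: cc0-1.0
Citation information
- Missing information: No citation information for the dataset is provided.
Contributions
- Missing information: No contribution information for the dataset is provided.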



