bjoernp/tagesschau-2018-2023
Hugging Face. Updated 2023-04-27; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/bjoernp/tagesschau-2018-2023
Resource description:
---
dataset_info:
  features:
  - name: date
    dtype: string
  - name: headline
    dtype: string
  - name: short_headline
    dtype: string
  - name: short_text
    dtype: string
  - name: article
    dtype: string
  - name: link
    dtype: string
  splits:
  - name: train
    num_bytes: 107545823
    num_examples: 21847
  download_size: 63956047
  dataset_size: 107545823
language:
- de
size_categories:
- 10K<n<100K
---
# Tagesschau Archive Article Dataset
A scrape of Tagesschau.de articles from 01.01.2018 to 26.04.2023. Find all source code in [github.com/bjoernpl/tagesschau](https://github.com/bjoernpl/tagesschau).
## Dataset Information
CSV structure:
| Field | Description |
| --- | --- |
| `date` | Date of the article |
| `headline` | Title of the article |
| `short_headline` | A short headline / Context |
| `short_text` | A brief summary of the article |
| `article` | The full text of the article |
| `href` | The href of the article on tagesschau.de (listed as `link` in the dataset features) |
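
As an illustration of the CSV structure above, the following minimal sketch reads such a file with Python's standard `csv` module and checks that the six documented columns are present. The helper name `read_articles`, the sample row, and the date format in it are assumptions made for this example; they are not taken from the repository.

```python
import csv
import io

# Column names as documented in the table above.
EXPECTED_FIELDS = ["date", "headline", "short_headline", "short_text", "article", "href"]


def read_articles(fileobj):
    """Read article rows from an open CSV file object (hypothetical helper)."""
    # Full article texts can be long, so raise the per-field size limit.
    csv.field_size_limit(10_000_000)
    reader = csv.DictReader(fileobj)
    missing = set(EXPECTED_FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


# Made-up sample row; the real date format may differ.
sample = io.StringIO(
    "date,headline,short_headline,short_text,article,href\n"
    '2023-04-26,Example headline,Context,Short summary,"Full text, with a comma.",/example.html\n'
)
rows = read_articles(sample)
print(rows[0]["headline"])
```

For a file on disk, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the CSV was saved.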
Size:
The final dataset (01.01.2018 to 26.04.2023) contains 225202 articles from 1942 days. Of these articles, only
21848 are unique, because Tagesschau often keeps articles in circulation for about a month. The total download
size is ~65 MB.
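
The figures above can be checked with a few lines of arithmetic. This is a small sketch that assumes the 1942 days are counted inclusively from 01.01.2018 to 26.04.2023; the variable names are chosen for this example only.

```python
from datetime import date

TOTAL_ARTICLES = 225202   # articles before deduplication
UNIQUE_ARTICLES = 21848   # unique articles, as stated above

first_day = date(2018, 1, 1)
last_day = date(2023, 4, 26)

# Inclusive day count between the first and last scrape dates.
days = (last_day - first_day).days + 1
articles_per_day = TOTAL_ARTICLES / days
unique_share = UNIQUE_ARTICLES / TOTAL_ARTICLES
copies_per_article = TOTAL_ARTICLES / UNIQUE_ARTICLES

print(days)                          # 1942
print(round(articles_per_day))       # 116
print(f"{unique_share:.1%}")         # 9.7%
print(round(copies_per_article, 1))  # 10.3
```

So each day contributes about 116 scraped articles, under 10% of all rows are unique, and each unique article appears roughly 10 times on average.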
Cleaning:
- Duplicate articles are removed
- Articles with empty text are removed
- Articles with empty short_texts are removed
- Articles, headlines and short_headlines are stripped of leading and trailing whitespace
More details in [`clean.py`](https://github.com/bjoernpl/tagesschau/blob/main/clean.py).
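
The four cleaning steps listed above could look roughly like the following sketch. It is not a copy of `clean.py`: the function name `clean_rows`, the use of `href` as the duplicate key, and the treatment of whitespace-only values as empty are all assumptions made for illustration.

```python
def clean_rows(rows):
    """Apply the listed cleaning steps to an iterable of row dicts (illustrative only)."""
    seen = set()
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        # Strip leading and trailing whitespace from the three text fields.
        for field in ("article", "headline", "short_headline"):
            row[field] = (row.get(field) or "").strip()
        # Drop rows with empty article text or empty short_text.
        # Assumption: whitespace-only values count as empty.
        if not row["article"] or not (row.get("short_text") or "").strip():
            continue
        # Assumption: rows sharing the same href are duplicates; keep the first.
        key = row.get("href")
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


raw = [
    {"date": "d1", "headline": " A ", "short_headline": "x", "short_text": "s", "article": " text ", "href": "u1"},
    {"date": "d2", "headline": "A", "short_headline": "x", "short_text": "s", "article": "text", "href": "u1"},
    {"date": "d3", "headline": "B", "short_headline": "y", "short_text": "", "article": "t", "href": "u2"},
]
print(len(clean_rows(raw)))
```

Keying duplicates on `href` is only one possible choice; comparing the article text or headline would be an equally plausible criterion, and `clean.py` remains the reference for what was actually done.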
Provider:
bjoernp
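Summary of original information
Tagesschau Archive Article Dataset overview
Dataset features
- date: date of the article; data type is string.
- headline: title of the article; data type is string.
- short_headline: short headline / context; data type is string.
- short_text: brief summary of the article; data type is string.
- article: full text of the article; data type is string.
- link: link to the article; data type is string.
Dataset splits
- train: training split with 21,847 examples and a total size of 107,545,823 bytes.
Dataset size
- Download size: 63,956,047 bytes.
- Total dataset size: 107,545,823 bytes.
Language
- de: German.
Dataset scale
- 10K<n<100K: the dataset contains between 10,000 and 100,000 examples.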



