
bjoernp/tagesschau-2018-2023

Hugging Face | Updated 2023-04-27 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/bjoernp/tagesschau-2018-2023
Resource description:
---
dataset_info:
  features:
  - name: date
    dtype: string
  - name: headline
    dtype: string
  - name: short_headline
    dtype: string
  - name: short_text
    dtype: string
  - name: article
    dtype: string
  - name: link
    dtype: string
  splits:
  - name: train
    num_bytes: 107545823
    num_examples: 21847
  download_size: 63956047
  dataset_size: 107545823
language:
- de
size_categories:
- 10K<n<100K
---

# Tagesschau Archive Article Dataset

A scrape of Tagesschau.de articles from 01.01.2018 to 26.04.2023. Find all source code in [github.com/bjoernpl/tagesschau](https://github.com/bjoernpl/tagesschau).

## Dataset Information

CSV structure:

| Field | Description |
| --- | --- |
| `date` | Date of the article |
| `headline` | Title of the article |
| `short_headline` | A short headline / Context |
| `short_text` | A brief summary of the article |
| `article` | The full text of the article |
| `href` | The href of the article on tagesschau.de |

Size:
The final dataset (2018-today) contains 225202 articles from 1942 days. Of these articles only 21848 are unique (Tagesschau often keeps articles in circulation for ~1 month). The total download size is ~65MB.

Cleaning:
- Duplicate articles are removed
- Articles with empty text are removed
- Articles with empty short_texts are removed
- Articles, headlines and short_headlines are stripped of leading and trailing whitespace

More details in [`clean.py`](https://github.com/bjoernpl/tagesschau/blob/main/clean.py).
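The four cleaning rules listed in the card can be sketched in plain Python as follows. This is an illustration only, not the author's `clean.py`: the function name `clean_rows`, the choice of `(headline, article)` as the duplicate key, and the decision to treat whitespace-only text as empty are all assumptions.

```python
from typing import Dict, Iterable, List

# Fields the card says are stripped of leading and trailing whitespace.
STRIPPED_FIELDS = ("headline", "short_headline", "article")


def clean_rows(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the cleaning rules from the dataset card to an iterable of rows.

    Assumptions (not confirmed by the source): duplicates are detected by the
    stripped (headline, article) pair, and whitespace-only text counts as empty.
    """
    seen = set()
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        for field in STRIPPED_FIELDS:
            row[field] = (row.get(field) or "").strip()
        # Drop rows with an empty article or an empty short_text.
        if not row["article"] or not (row.get("short_text") or "").strip():
            continue
        # Drop duplicates, keeping the first occurrence.
        key = (row["headline"], row["article"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Called on a list of dictionaries keyed by the field names above, the function returns a new list and leaves the input untouched.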
Provided by:
bjoernp
Summary of original information

Tagesschau Archive Article Dataset overview

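Dataset features

  • date: date of the article; data type is string.
  • headline: title of the article; data type is string.
  • short_headline: short headline / context; data type is string.
  • short_text: brief summary of the article; data type is string.
  • article: full text of the article; data type is string.
  • link: link to the article; data type is string.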
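As a minimal sketch of how this schema could be checked, the snippet below verifies that a record carries all six fields and that each value is a string. The helper name `validate_record` is hypothetical and not part of the dataset or its tooling.

```python
from typing import Any, Dict, List

# Field names from the feature list; every field has dtype string.
FIELDS = ("date", "headline", "short_headline", "short_text", "article", "link")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name in FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(
                f"{name} should be str, got {type(record[name]).__name__}"
            )
    return problems
```

A record with all six string fields yields an empty list, while a record without `link` yields `["missing field: link"]`.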

Dataset splits

  • train: training split with 21,847 examples, 107,545,823 bytes in total.

Dataset size

  • Download size: 63,956,047 bytes.
  • Total dataset size: 107,545,823 bytes.
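The figures quoted on this page can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the listing and in the dataset card; the variable names are illustrative.

```python
from datetime import date

# Figures quoted in the metadata and the dataset card.
NUM_EXAMPLES = 21_847
DATASET_BYTES = 107_545_823
DOWNLOAD_BYTES = 63_956_047
SCRAPED_ARTICLES = 225_202  # articles before deduplication
SCRAPED_DAYS = 1_942

avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES
avg_articles_per_day = SCRAPED_ARTICLES / SCRAPED_DAYS
unique_share = NUM_EXAMPLES / SCRAPED_ARTICLES

# Inclusive day count for the stated range 01.01.2018 to 26.04.2023.
days_in_range = (date(2023, 4, 26) - date(2018, 1, 1)).days + 1

print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"download size / dataset size: {download_ratio:.1%}")
print(f"scraped articles per day: {avg_articles_per_day:.1f}")
print(f"share of scraped articles kept: {unique_share:.1%}")
print(f"days in scrape range: {days_in_range}")
```

The inclusive day count comes out at 1942, matching the card. Each example averages roughly 4.9 KB, the download is about 59% of the dataset size, and about 9.7% of the scraped articles survive deduplication.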

Language

  • de: German.

Size category

  • 10K<n<100K: the dataset contains between 10,000 and 100,000 examples.