Yet Another Chinese News Dataset
Source: www.kaggle.com | Updated: 2019-11-18 | Indexed: 2025-01-16
Download link:
https://www.kaggle.com/ceshine/yet-another-chinese-news-dataset
Description:
A collection of news articles in Traditional and Simplified Chinese. It includes some Internet news outlets that are NOT Chinese state media (state media deserve a separate dataset).
Complete coverage is not guaranteed, so this dataset is not suitable for analyzing event coverage. It is meant to be used as a corpus for NLP algorithms.
## Data Collection Process
1. The links to the news articles were collected from the RSS feeds or the Twitter accounts of the news outlets.
2. The web pages were downloaded and parsed, and the meta tags were used to extract the title, description/summary, and cover image of each article. (These are the fields used in Twitter and Facebook summary cards.)
Note: Only minimal text cleaning has been performed on the meta tags.
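The extraction step can be illustrated with a minimal sketch using only the Python standard library. This is not the dataset author's actual scraper. The names `MetaTagParser` and `extract_card_fields` are hypothetical, and the fallback order between the `og:*` and `twitter:*` tags is an assumption based on the order in which the tags are listed in the field descriptions.

```python
from html.parser import HTMLParser

# Assumed tag preference per field (first match wins); the real scraper may differ.
FIELD_TAGS = {
    "title": ("og:title", "twitter:title"),
    "desc": ("twitter:description", "og:description"),
    "image": ("twitter:image", "og:image"),
}


class MetaTagParser(HTMLParser):
    """Collects the content of <meta property=...> and <meta name=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first occurrence of each tag; strip surrounding whitespace only.
            self.meta.setdefault(key, content.strip())


def extract_card_fields(html):
    """Return title, desc, and image from a page's summary-card meta tags."""
    parser = MetaTagParser()
    parser.feed(html)
    return {
        field: next((parser.meta[t] for t in tags if t in parser.meta), None)
        for field, tags in FIELD_TAGS.items()
    }
```

A field is `None` when neither of its candidate tags is present on the page.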
### Data Fields
1. title: Article title from `og:title` or `twitter:title` meta tag.
2. desc: Article summary from `twitter:description` or `og:description` meta tag.
3. image: URL to the cover image from `twitter:image` or `og:image` meta tag.
4. url: URL of the article.
5. source: The code of the news outlet.
6. date: The publish date of the article on Twitter or in RSS feeds. Format: YYYYMMDD
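As a rough sketch of how these fields might be consumed, the following reads records and converts the `YYYYMMDD` date string into a `datetime.date`. It assumes the data is available as a CSV file whose header uses the six field names above; the file format is not stated in this description, and `Article`, `parse_row`, and `read_articles` are hypothetical names introduced for illustration.

```python
import csv
import datetime as dt
import io
from typing import NamedTuple


class Article(NamedTuple):
    title: str
    desc: str
    image: str
    url: str
    source: str
    date: dt.date


def parse_row(row):
    """Convert one CSV row (a dict of strings) into an Article."""
    return Article(
        title=row["title"],
        desc=row["desc"],
        image=row["image"],
        url=row["url"],
        source=row["source"],
        # The date field is documented as YYYYMMDD.
        date=dt.datetime.strptime(row["date"], "%Y%m%d").date(),
    )


def read_articles(fileobj):
    """Read all rows from an open CSV file object (assumed to have a header row)."""
    return [parse_row(row) for row in csv.DictReader(fileobj)]


# Illustrative, made-up row; not taken from the dataset.
sample = io.StringIO(
    "title,desc,image,url,source,date\n"
    "Example title,Example summary,http://example.com/i.jpg,"
    "http://example.com/a,abc,20191118\n"
)
articles = read_articles(sample)
```

Because only minimal cleaning was applied to the meta tags, callers may want to handle empty or malformed values before relying on any field.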
This dataset does not provide the full text of the articles. You'll need to scrape it yourself using the links provided.
Provider:
Kaggle



