German News Portals: The Linked Domains In Their Articles
Source: DataCite Commons. Updated 2026-02-12; indexed 2026-04-25.
Download link:
https://data.tu-dortmund.de/citation?persistentId=doi:10.71955/DUEDATA-2026-ML5CATJG
Description:
<p>The dataset was created to investigate the linking behavior of 12 of the largest German news portals, with the goal of understanding how they connect to other online resources. Specifically, this study aims to examine whether the number of links correlates with other variables such as article title, keywords, section, and length. By analyzing these relationships, we hope to gain insights into the factors that influence the linking behavior of news portals. The data was automatically scraped using a web crawler implemented in Python. For copyright reasons, the article texts are not included.</p>
<b>Methods</b>
<p>To collect the data, we used a web crawling and scraping approach: the homepages of the 12 news portals were crawled daily at 12 noon during the survey period. We used the Scrapy and Selenium libraries in Python to implement the crawler and scraper, which allowed us to handle cookie banners, interactive elements, and dynamic content when necessary. The crawler extracted all news articles available in text form on the homepage, along with the links contained within the article text. The code is available on GitHub at <a href="https://github.com/EstherKuerbis/NewsArticlesScraper.git">https://github.com/EstherKuerbis/NewsArticlesScraper.git</a>, so other researchers can review, use, and modify it for their own purposes. To exclude irrelevant links, we filtered out links that appeared alongside the article text, such as advertisements. Where possible, we extracted the articles’ sections (departments) from the scraped keywords or URLs. To categorize the domains, we used a Python script that automatically divides them into internal and external domains based on the domain name in the URL. This approach yielded a dataset of news articles and their links that can be used to analyze the linking behavior of news portals.</p>
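<p>The internal/external split described above can be illustrated with a minimal sketch. It is not the script from the repository: the function names, the example URLs, and the portal domain <code>example-news.de</code> are hypothetical, and the sketch assumes absolute http(s) URLs and treats any subdomain of the portal as internal.</p>

```python
from urllib.parse import urlparse


def hostname(url: str) -> str:
    """Return the lowercased hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_link(link_url: str, portal_domain: str) -> str:
    """Label a link 'internal' if its host is the portal domain or one of
    its subdomains, otherwise 'external' (assumed rule, not the authors')."""
    host = hostname(link_url)
    portal = portal_domain.lower()
    if host == portal or host.endswith("." + portal):
        return "internal"
    return "external"


# Hypothetical links found in one article of the portal example-news.de
links = [
    "https://www.example-news.de/politik/artikel-1",
    "https://sport.example-news.de/liga",
    "https://www.example.org/studie",
]
labels = [classify_link(u, "example-news.de") for u in links]
print(labels)
```

<p>Matching on a leading dot keeps a look-alike host such as <code>notexample-news.de</code> from being counted as internal. A production version would likely also need to handle relative links and portals that publish under several registered domains.</p>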
Provider:
TUDOdata
Created:
2026-02-02



